
Sqoop Study Notes

Sqoop

Sqoop Overview

Sqoop (pronounced: skup) is an open-source tool used mainly to move data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, ...). It can import data from a relational database (e.g. MySQL, Oracle, Postgres) into Hadoop's HDFS, and export data from HDFS back into a relational database.

Importing data: moving data from MySQL, Oracle, etc. into Hadoop storage systems such as HDFS, Hive, and HBase;

Exporting data: moving data out of the Hadoop file system into a relational database.

MySQL <--> HDFS

Sqoop Tutorial

Note: Sqoop 1.99.7 is not compatible with 1.4.6 and is not feature complete; it is not intended for production deployment.

How Sqoop Works

Sqoop implements its import and export commands by translating them into MapReduce programs.

In the generated MapReduce job, the work is mainly a matter of customizing the InputFormat and OutputFormat.
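
You can see part of that machinery without running a job: sqoop codegen (it appears in the help output further down) emits the generated record class that an import job would compile. The connection details here are borrowed from the examples later in these notes, and the output directory is an arbitrary choice:

sqoop codegen \
--connect jdbc:mysql://192.168.88.110:3306/test \
--username root \
-P \
--table student \
--outdir /tmp/sqoop-codegen    # writes student.java here instead of a temp dir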

Installing Sqoop

Sqoop requires existing Java and Hadoop environments; a quick check of both is shown below.
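
java -version
hadoop version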

Download and Extract

Official website

Upload the tarball to the server and extract it:

[root@hadoop01 ~]# tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /usr/local/

Configure Environment Variables

vi /etc/profile

export SQOOP_HOME=/usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$ZOOKEEPER_HOME/sbin:$HIVE_HOME/bin:$ZEPPELIN_HOME/bin:$HBASE_HOME/bin:$SQOOP_HOME/bin

source /etc/profile
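
Confirm the shell picks up the new variables:

echo $SQOOP_HOME
which sqoop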

Edit the Configuration

Like most big-data frameworks, Sqoop keeps its configuration files in the conf directory under the Sqoop root.

  1. Rename the configuration file

    Drop the template suffix from the file name. (You can also copy the file instead, keeping the original template as a backup.)

    [root@hadoop01 conf]# mv sqoop-env-template.sh sqoop-env.sh
  2. Edit the configuration file

    sqoop-env.sh

    #Set path to where bin/hadoop is available
    export HADOOP_COMMON_HOME=/usr/local/hadoop-2.7.3

    #Set path to where hadoop-*-core.jar is available
    export HADOOP_MAPRED_HOME=/usr/local/hadoop-2.7.3

    #set the path to where bin/hbase is available
    export HBASE_HOME=/usr/local/hbase-1.3.2

    #Set the path to where bin/hive is available
    export HIVE_HOME=/usr/local/apache-hive-1.2.2-bin

    #Set the path for where zookeeper config dir is
    export ZOOKEEPER_HOME=/usr/local/zookeeper-3.4.6
    export ZOOCFGDIR=/usr/local/zookeeper-3.4.6/conf
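
    Watch out for the ZOOKEEPER_HOME line: with a stray space after the equals sign (export ZOOKEEPER_HOME= /usr/local/zookeeper-3.4.6), bash sets the variable to an empty string and then tries to export the path itself as a variable name. That is exactly the `/usr/local/zookeeper-3.4.6': not a valid identifier warning that shows up in the sqoop output below.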
  3. Copy the JDBC driver

    Copy the MySQL JDBC driver jar into Sqoop's lib directory:

    [root@hadoop01 lib]# pwd
    /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/lib
    [root@hadoop01 lib]# cp /root/zeppelin-lib/mysql-connector-java-5.1.37.jar $PWD
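
    A quick check that the driver jar is in place:

    [root@hadoop01 lib]# ls $SQOOP_HOME/lib | grep mysql
    mysql-connector-java-5.1.37.jar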
  4. Verify Sqoop

    [root@hadoop01 ~]# sqoop help
    /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/conf/sqoop-env.sh: line 35: export: `/usr/local/zookeeper-3.4.6': not a valid identifier
    Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../hcatalog does not exist! HCatalog jobs will fail.
    Please set $HCAT_HOME to the root of your HCatalog installation.
    Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../accumulo does not exist! Accumulo imports will fail.
    Please set $ACCUMULO_HOME to the root of your Accumulo installation.
    Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../zookeeper does not exist! Accumulo imports will fail.
    Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
    20/07/24 19:34:41 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
    usage: sqoop COMMAND [ARGS]

    Available commands:
    codegen            Generate code to interact with database records
    create-hive-table  Import a table definition into Hive
    eval               Evaluate a SQL statement and display the results
    export             Export an HDFS directory to a database table
    help               List available commands
    import             Import a table from a database to HDFS
    import-all-tables  Import tables from a database to HDFS
    import-mainframe   Import datasets from a mainframe server to HDFS
    job                Work with saved jobs
    list-databases     List available databases on a server
    list-tables        List available tables in a database
    merge              Merge results of incremental imports
    metastore          Run a standalone Sqoop metastore
    version            Display version information

    See 'sqoop help COMMAND' for information on a specific command.
  5. Test that Sqoop can connect to the database

    sqoop-list-databases --connect jdbc:mysql://192.168.88.110:3306/  --username root --password 123456
    # Output:
    /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/conf/sqoop-env.sh: line 35: export: `/usr/local/zookeeper-3.4.6': not a valid identifier
    Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../hcatalog does not exist! HCatalog jobs will fail.
    Please set $HCAT_HOME to the root of your HCatalog installation.
    Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../accumulo does not exist! Accumulo imports will fail.
    Please set $ACCUMULO_HOME to the root of your Accumulo installation.
    Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../zookeeper does not exist! Accumulo imports will fail.
    Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
    20/07/24 19:36:24 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
    20/07/24 19:36:24 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
    20/07/24 19:36:24 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
    information_schema
    hive
    mysql
    performance_schema
    sys
    zeppelintest

    This lists all the database names in MySQL.
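
    Note the WARN line above: putting the password on the command line is insecure. The warning itself suggests -P, which prompts for the password interactively; Sqoop 1.4.6 also accepts --password-file. A sketch of the prompted variant:

    sqoop-list-databases \
    --connect jdbc:mysql://192.168.88.110:3306/ \
    --username root \
    -P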

Basic Sqoop Usage Examples

Importing Data

In Sqoop, "importing" refers to transferring data from a non-big-data cluster (an RDBMS) into the big-data cluster (HDFS, Hive, HBase), using the import keyword.

RDBMS (relational database) to HDFS

  1. Make sure the MySQL service is running normally.

  2. Create a table in MySQL and insert some rows.

    The data in MySQL:

    (screenshot: rows of the student table in MySQL)

  3. Import the data

    • Full import

      sqoop import \
      --connect jdbc:mysql://192.168.88.110:3306/test \
      --username root \
      --password 123456 \
      --table student \
      --target-dir /sqoop/mysql/student \
      --delete-target-dir \
      --m 1 \
      --fields-terminated-by "\t"

      The log printed during the run:

      /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/conf/sqoop-env.sh: line 35: export: `/usr/local/zookeeper-3.4.6': not a valid identifier
      Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../hcatalog does not exist! HCatalog jobs will fail.
      Please set $HCAT_HOME to the root of your HCatalog installation.
      Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../accumulo does not exist! Accumulo imports will fail.
      Please set $ACCUMULO_HOME to the root of your Accumulo installation.
      Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../zookeeper does not exist! Accumulo imports will fail.
      Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
      20/07/24 19:45:10 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
      20/07/24 19:45:10 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
      20/07/24 19:45:10 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
      20/07/24 19:45:10 INFO tool.CodeGenTool: Beginning code generation
      20/07/24 19:45:11 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `student` AS t LIMIT 1
      20/07/24 19:45:11 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `student` AS t LIMIT 1
      20/07/24 19:45:11 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/local/hadoop-2.7.3
      Note: /tmp/sqoop-root/compile/29825426f428c228e971e547c536c287/student.java uses or overrides a deprecated API.
      Note: Recompile with -Xlint:deprecation for details.
      20/07/24 19:45:16 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/29825426f428c228e971e547c536c287/student.jar
      SLF4J: Class path contains multiple SLF4J bindings.
      SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/usr/local/hbase-1.3.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
      SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
      20/07/24 19:45:17 INFO tool.ImportTool: Destination directory /sqoop/mysql/student is not present, hence not deleting.
      20/07/24 19:45:17 WARN manager.MySQLManager: It looks like you are importing from mysql.
      20/07/24 19:45:17 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
      20/07/24 19:45:17 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
      20/07/24 19:45:17 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
      20/07/24 19:45:17 INFO mapreduce.ImportJobBase: Beginning import of student
      20/07/24 19:45:17 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
      20/07/24 19:45:17 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
      20/07/24 19:45:17 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.88.110:8032
      20/07/24 19:45:19 WARN hdfs.DFSClient: Caught exception
      java.lang.InterruptedException
      at java.lang.Object.wait(Native Method)
      at java.lang.Thread.join(Thread.java:1252)
      at java.lang.Thread.join(Thread.java:1326)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546)
      20/07/24 19:45:19 WARN hdfs.DFSClient: Caught exception
      java.lang.InterruptedException
      at java.lang.Object.wait(Native Method)
      at java.lang.Thread.join(Thread.java:1252)
      at java.lang.Thread.join(Thread.java:1326)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546)
      20/07/24 19:45:20 INFO db.DBInputFormat: Using read commited transaction isolation
      20/07/24 19:45:20 INFO mapreduce.JobSubmitter: number of splits:1
      20/07/24 19:45:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1595558750246_0001
      20/07/24 19:45:20 INFO impl.YarnClientImpl: Submitted application application_1595558750246_0001
      20/07/24 19:45:20 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1595558750246_0001/
      20/07/24 19:45:20 INFO mapreduce.Job: Running job: job_1595558750246_0001
      20/07/24 19:45:35 INFO mapreduce.Job: Job job_1595558750246_0001 running in uber mode : false
      20/07/24 19:45:35 INFO mapreduce.Job: map 0% reduce 0%
      20/07/24 19:45:48 INFO mapreduce.Job: map 100% reduce 0%
      20/07/24 19:45:49 INFO mapreduce.Job: Job job_1595558750246_0001 completed successfully
      20/07/24 19:45:49 INFO mapreduce.Job: Counters: 30
      File System Counters
      FILE: Number of bytes read=0
      FILE: Number of bytes written=137395
      FILE: Number of read operations=0
      FILE: Number of large read operations=0
      FILE: Number of write operations=0
      HDFS: Number of bytes read=87
      HDFS: Number of bytes written=129
      HDFS: Number of read operations=4
      HDFS: Number of large read operations=0
      HDFS: Number of write operations=2
      Job Counters
      Launched map tasks=1
      Other local map tasks=1
      Total time spent by all maps in occupied slots (ms)=9426
      Total time spent by all reduces in occupied slots (ms)=0
      Total time spent by all map tasks (ms)=9426
      Total vcore-milliseconds taken by all map tasks=9426
      Total megabyte-milliseconds taken by all map tasks=9652224
      Map-Reduce Framework
      Map input records=6
      Map output records=6
      Input split bytes=87
      Spilled Records=0
      Failed Shuffles=0
      Merged Map outputs=0
      GC time elapsed (ms)=558
      CPU time spent (ms)=1570
      Physical memory (bytes) snapshot=175525888
      Virtual memory (bytes) snapshot=2091405312
      Total committed heap usage (bytes)=89653248
      File Input Format Counters
      Bytes Read=0
      File Output Format Counters
      Bytes Written=129
      20/07/24 19:45:49 INFO mapreduce.ImportJobBase: Transferred 129 bytes in 32.0322 seconds (4.0272 bytes/sec)
      20/07/24 19:45:49 INFO mapreduce.ImportJobBase: Retrieved 6 records.
      [root@hadoop01 ~]#

      As the log shows, Sqoop's import from an RDBMS into HDFS is implemented under the hood as a MapReduce job.

      The import result:

      (screenshot: the output file under /sqoop/mysql/student in HDFS)
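
      You can also inspect the result from the command line (same target path as above; with a single mapper the output is one part-m-00000 file):

      hdfs dfs -cat /sqoop/mysql/student/part-m-00000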

    • Free-form query import

      sqoop import \
      --connect jdbc:mysql://192.168.88.110:3306/test \
      --username root \
      --password 123456 \
      --target-dir /sqoop/mysql/student01 \
      --delete-target-dir \
      -m 1 \
      --fields-terminated-by "\t" \
      --query 'select name,age,address from student where 1=1 and $CONDITIONS;'

      Note: a free-form query must include $CONDITIONS in its WHERE clause (Sqoop substitutes its split conditions there). Single quotes keep the shell from expanding $CONDITIONS; if you wrap the query in double quotes instead, it must be escaped as \$CONDITIONS. See the sketch after the result below.

      Result:

      (screenshot: query-import output under /sqoop/mysql/student01)
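
      A sketch of the same import with the query in double quotes (the age filter is made up purely for illustration; note the escaped \$CONDITIONS):

      sqoop import \
      --connect jdbc:mysql://192.168.88.110:3306/test \
      --username root \
      --password 123456 \
      --target-dir /sqoop/mysql/student01 \
      --delete-target-dir \
      -m 1 \
      --fields-terminated-by "\t" \
      --query "select name,age,address from student where age > 20 and \$CONDITIONS"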

    • Import specific columns

      sqoop import \
      --connect jdbc:mysql://192.168.88.110:3306/test \
      --username root \
      --password 123456 \
      --target-dir /sqoop/mysql/student02 \
      --delete-target-dir \
      --num-mappers 1 \
      --fields-terminated-by "\t" \
      --columns id,name \
      --table student

      Result:

      (screenshot: id and name columns under /sqoop/mysql/student02)

    • Filter the imported rows with Sqoop's where option

      sqoop import \
      --connect jdbc:mysql://192.168.88.110:3306/test \
      --username root \
      --password 123456 \
      --target-dir /sqoop/mysql/student \
      --delete-target-dir \
      --num-mappers 1 \
      --fields-terminated-by "\t" \
      --table student \
      --where "id=2"

      Result: (screenshot: only the row with id=2 is imported)
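
Before committing to any of these imports, sqoop eval (listed in the help output earlier) is a convenient way to preview what a statement returns, reusing the same connection settings; a minimal sketch:

sqoop eval \
--connect jdbc:mysql://192.168.88.110:3306/test \
--username root \
--password 123456 \
--query "select * from student limit 3"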

Wrapping the MySQL-to-HDFS imports in a shell script

mysql-hdfs.sh

#!/bin/bash
# Interactive wrapper around the four import variants shown above.

read -p "Choose an option (1. full import  2. query import  3. column import  4. where-filter import): " choose

read -p "Database name: " database
read -p "Table name: " tableName

if [ "$choose" == "1" ]
then

sqoop import \
--connect jdbc:mysql://192.168.88.110:3306/$database \
--username root \
--password 123456 \
--table $tableName \
--target-dir /sqoop/mysql/$tableName \
--delete-target-dir \
--m 1 \
--fields-terminated-by "\t"

elif [ "$choose" == "2" ]
then

# read stores the line verbatim, so the user can type $CONDITIONS literally;
# quoting "$sql" below keeps the whole query as a single argument.
read -p "SQL query (must contain \$CONDITIONS, e.g. select name,sex from teacher where 1=1 and \$CONDITIONS): " sql

sqoop import \
--connect jdbc:mysql://192.168.88.110:3306/$database \
--username root \
--password 123456 \
--target-dir /sqoop/mysql/$tableName \
--delete-target-dir \
-m 1 \
--fields-terminated-by "\t" \
--query "$sql"

elif [ "$choose" == "3" ]
then

read -p "Columns to import (comma-separated): " string

sqoop import \
--connect jdbc:mysql://192.168.88.110:3306/$database \
--username root \
--password 123456 \
--target-dir /sqoop/mysql/$tableName \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--columns $string \
--table $tableName

elif [ "$choose" == "4" ]
then

# Quote "$string" below so conditions containing spaces survive word splitting.
read -p "WHERE condition (e.g. id=2): " string

sqoop import \
--connect jdbc:mysql://192.168.88.110:3306/$database \
--username root \
--password 123456 \
--target-dir /sqoop/mysql/$tableName \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--table $tableName \
--where "$string"

else
echo "Invalid choice!"

fi
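
To call the script by name as in the transcript further down, make it executable and put it somewhere on the PATH (the /usr/local/bin location is just one choice):

chmod +x mysql-hdfs.sh
cp mysql-hdfs.sh /usr/local/bin/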

RDBMS (relational database) to Hive

  • Method 1:

    sqoop import \
    --connect jdbc:mysql://192.168.88.110:3306/test \
    --username root \
    --password 123456 \
    --target-dir /sqoop/mysql/student \
    --delete-target-dir \
    --table student \
    --num-mappers 1 \
    --hive-import \
    --hive-database db_user \
    --fields-terminated-by "\t" \
    --hive-overwrite \
    --hive-table mysql_hive_student

    ==Tip: this runs as two steps: the data is first imported into HDFS, then moved from HDFS into the Hive warehouse.==

    ==Tip: the first step stages the data in /sqoop/mysql/student==

    Result:

    (screenshot: the mysql_hive_student table data in Hive)
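
    A quick sanity check from the shell (database and table names as above):

    hive -e "select * from db_user.mysql_hive_student limit 5;"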

  • Method 2

    Run the two steps separately: import the MySQL data into HDFS first, then create a table in Hive and use the load command to move the data from HDFS into the Hive table.

    Step 1: import only id, name, and age, using the shell script wrapped earlier.

    [root@hadoop01 ~]# mysql-hdfs.sh
    Choose an option (1. full import  2. query import  3. column import  4. where-filter import): 3
    Database name: test
    Table name: student
    Columns to import (comma-separated): id,name,age

    ==Step 2: create the table in Hive, then load the data into it==

    Create an external table:

    hive (db_user)> create external table mysql_hive_student01(id int,name string,age int)
    > location '/external/mysql_hive_student01';
    OK
    Time taken: 0.118 seconds

    Load the data:

    load data inpath '/sqoop/mysql/student/part-m-*' into table mysql_hive_student01;

    ==There is a problem here==

