Sqoop
sqoop概述
Sqoop(发音:skup)是一款开源的工具,主要用于在Hadoop(Hive)与传统的数据库(mysql、postgresql…)间进行数据的传递,可以将一个关系型数据库(例如 : MySQL ,Oracle ,Postgres等)中的数据导进到Hadoop的HDFS中,也可以将HDFS的数据导进到关系型数据库中。
导入数据:MySQL,Oracle导入数据到Hadoop的HDFS、HIVE、HBASE等数据存储系统;
导出数据:从Hadoop的文件系统中导出数据到关系数据库
mysql <—-à hdfs
注意:1.99.7与1.4.6不兼容,且没有特征不完整,它并不打算用于生产部署。
Sqoop原理
将导入或导出命令翻译成mapreduce程序来实现。
在翻译出的mapreduce中主要是对inputformat和outputformat进行定制。
Sqoop安装
安装Sqoop的前提是已经具备Java和Hadoop的环境。
下载并解压
上传至服务器并解压,解压命令如下:
1 | [root@hadoop01 ~]# tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4.tar.gz -C /usr/local/ |
配置环境变量
1 | vi /etc/profile |
修改配置
Sqoop的配置文件与大多数大数据框架类似,在sqoop根目录下的conf目录中。
重命名配置文件
将临时文件的template去掉。(也可对配置文件进行拷贝,保留原始template文件做恢复备份)
1
[root@hadoop01 conf]# mv sqoop-env-template.sh sqoop-env.sh
修改配置文件
sqoop-env.sh
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/usr/local/hadoop-2.7.3
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/usr/local/hadoop-2.7.3
#set the path to where bin/hbase is available
export HBASE_HOME=/usr/local/hbase-1.3.2
#Set the path to where bin/hive is available
export HIVE_HOME=/usr/local/apache-hive-1.2.2-bin
#Set the path for where zookeper config dir is
export ZOOKEEPER_HOME= /usr/local/zookeeper-3.4.6
export ZOOCFGDIR=/usr/local/zookeeper-3.4.6/conf拷贝JDBC驱动
拷贝jdbc驱动到sqoop的lib目录下
1
2
3[root@hadoop01 lib]# pwd
/usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/lib
[root@hadoop01 lib]# cp /root/zeppelin-lib/mysql-connector-java-5.1.37.jar $PWD验证Sqoop
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28[root@hadoop01 ~]# sqoop help
/usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/conf/sqoop-env.sh: line 35: export: `/usr/local/zookeeper-3.4.6': not a valid identifier
Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
20/07/24 19:34:41 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
See 'sqoop help COMMAND' for information on a specific command.测试Sqoop是否能够成功连接数据库
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18sqoop-list-databases --connect jdbc:mysql://192.168.88.110:3306/ --username root --password 123456
#结果如下
/usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/conf/sqoop-env.sh: line 35: export: `/usr/local/zookeeper-3.4.6': not a valid identifier
Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
20/07/24 19:36:24 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
20/07/24 19:36:24 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/07/24 19:36:24 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
hive
mysql
performance_schema
sys
zeppelintest将mysql中的所有库名展示出来了。
Sqoop的简单使用案例
导入数据
在Sqoop中,“导入”概念指:从非大数据集群(RDBMS)向大数据集群(HDFS,HIVE,HBASE)中传输数据,叫做:导入,即使用import关键字。
RDBMS(关系型数据库)到HDFS
确定Mysql服务开启正常
在Mysql中新建一张表并插入一些数据
mysql中数据如下:
导入数据
全部导入
1
2
3
4
5
6
7
8
9sqoop import \
--connect jdbc:mysql://192.168.88.110:3306/test \
--username root \
--password 123456 \
--table student \
--target-dir /sqoop/mysql/student \
--delete-target-dir \
--m 1 \
--fields-terminated-by "\t"运行过程打印日志信息如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96/usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/conf/sqoop-env.sh: line 35: export: `/usr/local/zookeeper-3.4.6': not a valid identifier
Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
20/07/24 19:45:10 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
20/07/24 19:45:10 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/07/24 19:45:10 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/07/24 19:45:10 INFO tool.CodeGenTool: Beginning code generation
20/07/24 19:45:11 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `student` AS t LIMIT 1
20/07/24 19:45:11 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `student` AS t LIMIT 1
20/07/24 19:45:11 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/local/hadoop-2.7.3
注: /tmp/sqoop-root/compile/29825426f428c228e971e547c536c287/student.java使用或覆盖了已过时的 API。
注: 有关详细信息, 请使用 -Xlint:deprecation 重新编译。
20/07/24 19:45:16 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/29825426f428c228e971e547c536c287/student.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hbase-1.3.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/07/24 19:45:17 INFO tool.ImportTool: Destination directory /sqoop/mysql/student is not present, hence not deleting.
20/07/24 19:45:17 WARN manager.MySQLManager: It looks like you are importing from mysql.
20/07/24 19:45:17 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
20/07/24 19:45:17 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
20/07/24 19:45:17 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
20/07/24 19:45:17 INFO mapreduce.ImportJobBase: Beginning import of student
20/07/24 19:45:17 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/07/24 19:45:17 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/07/24 19:45:17 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.88.110:8032
20/07/24 19:45:19 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546)
20/07/24 19:45:19 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546)
20/07/24 19:45:20 INFO db.DBInputFormat: Using read commited transaction isolation
20/07/24 19:45:20 INFO mapreduce.JobSubmitter: number of splits:1
20/07/24 19:45:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1595558750246_0001
20/07/24 19:45:20 INFO impl.YarnClientImpl: Submitted application application_1595558750246_0001
20/07/24 19:45:20 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1595558750246_0001/
20/07/24 19:45:20 INFO mapreduce.Job: Running job: job_1595558750246_0001
20/07/24 19:45:35 INFO mapreduce.Job: Job job_1595558750246_0001 running in uber mode : false
20/07/24 19:45:35 INFO mapreduce.Job: map 0% reduce 0%
20/07/24 19:45:48 INFO mapreduce.Job: map 100% reduce 0%
20/07/24 19:45:49 INFO mapreduce.Job: Job job_1595558750246_0001 completed successfully
20/07/24 19:45:49 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=137395
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=129
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=9426
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=9426
Total vcore-milliseconds taken by all map tasks=9426
Total megabyte-milliseconds taken by all map tasks=9652224
Map-Reduce Framework
Map input records=6
Map output records=6
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=558
CPU time spent (ms)=1570
Physical memory (bytes) snapshot=175525888
Virtual memory (bytes) snapshot=2091405312
Total committed heap usage (bytes)=89653248
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=129
20/07/24 19:45:49 INFO mapreduce.ImportJobBase: Transferred 129 bytes in 32.0322 seconds (4.0272 bytes/sec)
20/07/24 19:45:49 INFO mapreduce.ImportJobBase: Retrieved 6 records.
[root@hadoop01 ~]#可见sqoop将RDBMS(关系型数据库)中的数据导入HDFS的过程底层是通过MapReduce操作实现的。
导入结果如下:
查询导入
1
2
3
4
5
6
7
8
9
10
11sqoop import \
--connect jdbc:mysql://192.168.88.110:3306/test \
--username root \
--password 123456 \
--target-dir /sqoop/mysql/student01 \
--delete-target-dir \
-m 1 \
--fields-terminated-by "\t" \
--query 'select name,age,address from student where 1=1 and $CONDITIONS;'
"'select name,age,address from student where 1=1 and $CONDITIONS;'"结果如下:
导入指定列
1
2
3
4
5
6
7
8
9
10sqoop import \
--connect jdbc:mysql://192.168.88.110:3306/test \
--username root \
--password 123456 \
--target-dir /sqoop/mysql/student02 \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--columns id,name \
--table student结果:
使用sqoop关键字筛选查询导入数据
1
2
3
4
5
6
7
8
9
10sqoop import \
--connect jdbc:mysql://192.168.88.110:3306/test \
--username root \
--password 123456 \
--target-dir /sqoop/mysql/student \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--table student \
--where "id=2"结果:
mysql导入hdfs封装shell脚本
mysql-hdfs.sh
1 |
|
RDBMS(关系型数据库)到Hive
方法一:
1
2
3
4
5
6
7
8
9
10
11
12
13sqoop import \
--connect jdbc:mysql://192.168.88.110:3306/test \
--username root \
--password 123456 \
--target-dir /sqoop/mysql/student \
--delete-target-dir \
--table student \
--num-mappers 1 \
--hive-import \
--hive-database db_user \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table mysql_hive_student==提示:该过程分为两步,第一步将数据导入到HDFS,第二步将导入到HDFS的数据迁移到Hive仓库==
==提示:第一步的保存目录是/sqoop/mysql/student==
结果:
方法二
将第一步和第二部分开执行,即将mysql中的数据导入hdfs中,再在hive中创建表,用load命令将数据从hdfs中导入hive表中。
第一步:(只导入id,name,age)用之前封装的shell
1
2
3
4
5[root@hadoop01 ~]# mysql-hdfs.sh
请输入你的选择(1.导入全部 2.查询导入 3.基于列元素的导入 4.关键字筛选查询导入):3
请输入数据库名:test
请输入要查询的表名:student
请输入要查询的元素列(用逗号分隔):id,name,age==第二步:在hive中创建表,然后导入数据至hive表中==
创建外部表:
1
2
3
4hive (db_user)> create external table mysql_hive_student01(id int,name string,age int)
> location '/external/mysql_hive_student01';
OK
Time taken: 0.118 seconds导入数据
1
load data inpath '/sqoop/mysql/student/part-m-*' into table mysql_hive_student01
==有问题==
**