Use Apache Sqoop on an E-MapReduce (EMR) cluster to import data from a MySQL database into OSS-HDFS. Sqoop accesses OSS-HDFS through the Hadoop-OSS connector built into EMR, which means the OSS-HDFS endpoint format differs from standard OSS — use oss://bucket.region.oss-dls.aliyuncs.com/path instead of oss://bucket/path.
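The two URI forms differ only in the authority part: the bucket name is suffixed with the region-specific OSS-HDFS domain. A minimal sketch of how the OSS-HDFS path is assembled (the helper name and sample values are illustrative, not part of Sqoop or EMR):

```python
def oss_hdfs_uri(bucket: str, region: str, path: str) -> str:
    """Build an OSS-HDFS URI (illustrative helper, not an EMR API).

    Standard OSS form:  oss://<bucket>/<path>
    OSS-HDFS form:      oss://<bucket>.<region>.oss-dls.aliyuncs.com/<path>
    """
    return f"oss://{bucket}.{region}.oss-dls.aliyuncs.com/{path.lstrip('/')}"

print(oss_hdfs_uri("examplebucket", "cn-hangzhou", "/tmp/sqoop_kv"))
# oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/tmp/sqoop_kv
```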
Prerequisites
Before you begin, make sure you have:
An EMR cluster running EMR V3.42.0 or later, or EMR V5.8.0 or later. For instructions, see Create a cluster.
OSS-HDFS enabled for a bucket, with permissions granted to access it. For instructions, see Enable OSS-HDFS and grant access permissions.
Connect to the EMR cluster
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
Click the EMR cluster.
Click the Nodes tab, then click the plus icon (+) on the left side of the node group.
Click the ID of the ECS instance. On the Instances page, click Connect next to the instance ID.
For instructions on logging in with an SSH key pair or SSH password from Windows or Linux, see Log on to a cluster.
Import data from MySQL to OSS-HDFS
Run the following command to import data from a MySQL table into an OSS-HDFS directory:
sudo sqoop import \
--connect <dburi>/<dbname> \
--username <username> \
--password <password> \
--table <tablename> \
--target-dir <oss-dir> \
-m <count>
Parameters
| Parameter | Required | Description |
|---|---|---|
| dburi | Yes | The JDBC URI of the database. Example: jdbc:mysql://192.168.xxx.xxx:3306/ |
| dbname | Yes | The name of the database. |
| username | Yes | The username for logging in to the database. |
| password | Yes | The password of the user. |
| tablename | Yes | The name of the MySQL table to import. |
| oss-dir | Yes | The OSS-HDFS target directory. Use the OSS-HDFS endpoint format: oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/ |
| count | No | The number of map tasks to run in parallel. |
| format | No | The output file format. Valid values: avrodatafile, sequencefile, textfile (default), and parquetfile. Specify the format with --as-<format>, for example, --as-parquetfile. |
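With -m, Sqoop runs the import as that many parallel map tasks, each reading one slice of the table; for numeric keys it divides the range of the split column evenly among the mappers. A rough sketch of that slicing, assuming integer keys (the function is illustrative, not Sqoop's actual splitter):

```python
def split_ranges(min_key: int, max_key: int, num_mappers: int):
    """Divide [min_key, max_key] into num_mappers contiguous slices,
    roughly as a numeric split-by column is partitioned (illustrative only)."""
    step = (max_key - min_key + 1) / num_mappers
    bounds = [round(min_key + i * step) for i in range(num_mappers)] + [max_key + 1]
    # Each mapper reads rows whose key falls in [lo, hi).
    return [(bounds[i], bounds[i + 1]) for i in range(num_mappers)]

print(split_ranges(1, 100, 4))
# [(1, 26), (26, 51), (51, 76), (76, 101)]
```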
The OSS-HDFS endpoint format (oss://bucket.region.oss-dls.aliyuncs.com/path) differs from standard OSS. Make sure that you use the OSS-HDFS-specific endpoint.
Example
The following command imports the src_kv table from the sqoop_test database into OSS-HDFS in Parquet format:
sudo sqoop import \
--connect jdbc:mysql://master-1-1/sqoop_test \
--username root \
--password password1 \
--table src_kv \
-m 1 \
--target-dir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/tmp/sqoop_kv \
--as-parquetfile
Import data incrementally
For incremental imports, add --incremental, --check-column, and --last-value to your command. Two modes are supported: append and lastmodified.
Append mode
Use append mode when your table only receives new rows (inserts, no updates). Sqoop imports rows where the check column value is greater than --last-value.
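The selection rule can be pictured as a simple filter on the check column (a Python sketch of the semantics, not Sqoop internals; the sample rows are made up):

```python
rows = [
    {"id": 99, "v": "old"},
    {"id": 100, "v": "old"},
    {"id": 101, "v": "new"},
    {"id": 102, "v": "new"},
]
last_value = 100  # --last-value recorded from the previous run

# Append mode imports only rows whose check column exceeds --last-value.
imported = [r for r in rows if r["id"] > last_value]
print([r["id"] for r in imported])  # [101, 102]

# The next run would pass the new maximum as --last-value.
next_last_value = max(r["id"] for r in imported)  # 102
```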
When using append mode, --temporary-rootdir is required. Sqoop first writes data to the temporary directory, then moves the files to the target directory. If the target directory already exists, Sqoop does not import or overwrite data in it.
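That staging behavior can be sketched with plain filesystem operations (a Python sketch under the assumption that the temporary and target directories are local paths; Sqoop performs the equivalent steps on OSS-HDFS):

```python
import os
import shutil
import tempfile

def staged_import(files: dict, tmp_root: str, target_dir: str) -> bool:
    """Write files under tmp_root first, then move them into target_dir.
    Refuses to touch an existing target, mirroring append mode's rule
    (illustrative sketch, not Sqoop code)."""
    if os.path.exists(target_dir):
        return False  # target exists: nothing is imported or overwritten
    staging = os.path.join(tmp_root, "staging")
    os.makedirs(staging)
    for name, data in files.items():
        with open(os.path.join(staging, name), "w") as f:
            f.write(data)
    shutil.move(staging, target_dir)  # rename the staged data into place
    return True

root = tempfile.mkdtemp()
ok = staged_import({"part-m-00000": "1,a\n"}, root, os.path.join(root, "out"))
print(ok, os.listdir(os.path.join(root, "out")))  # True ['part-m-00000']
```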
sudo sqoop import \
--connect <dburi>/<dbname> \
--username <username> \
--password <password> \
--table <tablename> \
--target-dir <oss-dir> \
--temporary-rootdir <oss-tmpdir> \
--incremental append \
--check-column <col> \
--last-value <value> \
-m <count>
Incremental parameters
| Parameter | Required | Description |
|---|---|---|
| oss-tmpdir | Yes (incremental modes) | The temporary OSS-HDFS directory. Example: oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/tmp/sqoop_tmp/ |
| col | Yes | The column used to identify new rows. |
| value | Yes | The maximum check column value from the previous import. Sqoop imports only rows with a higher value. |
Example
sudo sqoop import \
--connect jdbc:mysql://master-1-1/sqoop_test \
--username root \
--password password1 \
--table src_kv \
--target-dir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/tmp/sqoop_kv \
--temporary-rootdir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/tmp/sqoop_tmp \
--incremental append \
--check-column id \
--last-value 100 \
-m 1
Lastmodified mode
Use lastmodified mode when your table receives both inserts and updates. Sqoop imports rows where the check column (a timestamp) is more recent than --last-value. This mode requires --merge-key to merge updated rows with existing data.
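The merge step can be pictured as keeping, for each --merge-key value, the most recent version of the row (a Python sketch of the semantics, not Sqoop internals; the sample rows are made up):

```python
existing = [
    {"id": 1, "val": "a", "updated_at": "2024-01-01 00:00:00"},
    {"id": 2, "val": "b", "updated_at": "2024-01-01 00:00:00"},
]
# Rows whose updated_at is newer than --last-value: both updates and inserts.
incoming = [
    {"id": 2, "val": "b2", "updated_at": "2024-02-01 09:00:00"},  # update
    {"id": 3, "val": "c", "updated_at": "2024-02-01 10:00:00"},   # insert
]

# Merge on the --merge-key column (id): incoming rows replace existing ones.
merged = {r["id"]: r for r in existing}
merged.update({r["id"]: r for r in incoming})
result = sorted(merged.values(), key=lambda r: r["id"])
print([(r["id"], r["val"]) for r in result])  # [(1, 'a'), (2, 'b2'), (3, 'c')]
```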
When using lastmodified mode, --temporary-rootdir is required.
sudo sqoop import \
--connect <dburi>/<dbname> \
--username <username> \
--password <password> \
--table <tablename> \
--target-dir <oss-dir> \
--temporary-rootdir <oss-tmpdir> \
--incremental lastmodified \
--check-column <col> \
--merge-key <key-col> \
--last-value <value> \
-m <count>
Example
sudo sqoop import \
--connect jdbc:mysql://master-1-1/sqoop_test \
--username root \
--password password1 \
--table src_kv \
--target-dir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/tmp/sqoop_kv \
--temporary-rootdir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/tmp/sqoop_tmp \
--incremental lastmodified \
--check-column updated_at \
--merge-key id \
--last-value "2024-01-01 00:00:00" \
-m 1