All Products
Search
Document Center

Object Storage Service:Use Apache Sqoop on an EMR cluster to implement read and write access to data stored in OSS-HDFS

Last Updated:Mar 20, 2026

Use Apache Sqoop on an E-MapReduce (EMR) cluster to import data from a MySQL database into OSS-HDFS. Sqoop accesses OSS-HDFS through the Hadoop-OSS connector built into EMR, which means the OSS-HDFS endpoint format differs from standard OSS — use oss://bucket.region.oss-dls.aliyuncs.com/path instead of oss://bucket/path.

Prerequisites

Before you begin, make sure you have:

Connect to the EMR cluster

  1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

  2. Click the EMR cluster.

  3. Click the Nodes tab, then click the plus icon (+) on the left side of the node group.

  4. Click the ID of the ECS instance. On the Instances page, click Connect next to the instance ID.

For instructions on logging in with an SSH key pair or SSH password from Windows or Linux, see Log on to a cluster.

Import data from MySQL to OSS-HDFS

Run the following command to import data from a MySQL table into an OSS-HDFS directory:

sudo sqoop import \
  --connect <dburi>/<dbname> \
  --username <username> \
  --password <password> \
  --table <tablename> \
  --target-dir <oss-dir> \
  -m <count>

Parameters

ParameterRequiredDescription
dburiYesThe JDBC URI for the database. Example: jdbc:mysql://192.168.xxx.xxx:3306/
dbnameYesThe name of the database.
usernameYesThe username for logging in to the database.
passwordYesThe password for the user.
tablenameYesThe name of the MySQL table to import.
oss-dirYesThe OSS-HDFS target directory. Use the OSS-HDFS endpoint format: oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/
countNoThe number of MapReduce tasks.
formatNoThe output file format. Valid values: avrodatafile, sequencefile, textfile (default), parquetfile. Specify using --as-<format>, for example --as-parquetfile.
The OSS-HDFS endpoint format (oss://bucket.region.oss-dls.aliyuncs.com/path) differs from standard OSS. Make sure you use the OSS-HDFS-specific endpoint.

Example

The following command imports the src_kv table from the sqoop_test database into OSS-HDFS in Parquet format:

sudo sqoop import \
  --connect jdbc:mysql://master-1-1/sqoop_test \
  --username root \
  --password password1 \
  --table src_kv \
  -m 1 \
  --target-dir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/tmp/sqoop_kv \
  --as-parquetfile

Import data incrementally

For incremental imports, add --incremental, --check-column, and --last-value to your command. Two modes are supported: append and lastmodified.

Append mode

Use append mode when your table only receives new rows (inserts, no updates). Sqoop imports rows where the check column value is greater than --last-value.

When using append mode, --temporary-rootdir is required. Sqoop first writes data to the temporary directory, then moves the files to the target directory. If the target directory already exists, Sqoop does not import or overwrite data in it.

sudo sqoop import \
  --connect <dburi>/<dbname> \
  --username <username> \
  --password <password> \
  --table <tablename> \
  --target-dir <oss-dir> \
  --temporary-rootdir <oss-tmpdir> \
  --incremental append \
  --check-column <col> \
  --last-value <value> \
  -m <count>

Incremental parameters

ParameterRequiredDescription
oss-tmpdirYes (incremental modes)The temporary OSS-HDFS directory. Example: oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/tmp/sqoop_tmp/
colYesThe column used to identify new rows.
valueYesThe maximum check column value from the previous import. Sqoop imports only rows with a higher value.

Example

sudo sqoop import \
  --connect jdbc:mysql://master-1-1/sqoop_test \
  --username root \
  --password password1 \
  --table src_kv \
  --target-dir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/tmp/sqoop_kv \
  --temporary-rootdir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/tmp/sqoop_tmp \
  --incremental append \
  --check-column id \
  --last-value 100 \
  -m 1

Lastmodified mode

Use lastmodified mode when your table receives both inserts and updates. Sqoop imports rows where the check column (a timestamp) is more recent than --last-value. This mode requires --merge-key to merge updated rows with existing data.

When using lastmodified mode, --temporary-rootdir is required.

sudo sqoop import \
  --connect <dburi>/<dbname> \
  --username <username> \
  --password <password> \
  --table <tablename> \
  --target-dir <oss-dir> \
  --temporary-rootdir <oss-tmpdir> \
  --incremental lastmodified \
  --check-column <col> \
  --merge-key <key-col> \
  --last-value <value> \
  -m <count>

Example

sudo sqoop import \
  --connect jdbc:mysql://master-1-1/sqoop_test \
  --username root \
  --password password1 \
  --table src_kv \
  --target-dir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/tmp/sqoop_kv \
  --temporary-rootdir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/tmp/sqoop_tmp \
  --incremental lastmodified \
  --check-column updated_at \
  --merge-key id \
  --last-value "2024-01-01 00:00:00" \
  -m 1