This topic describes how to use DataX to export data from Tablestore to a local CSV file. You can export data from data tables and time series tables in Tablestore.
Background
DataX is an open-source offline data synchronization tool that supports data transfer between heterogeneous data sources, including MySQL, Oracle, HDFS, Hive, and Tablestore.
Before you start, note the following:
Exported file names: DataX appends a random suffix to exported CSV file names (for example,
output.csv__d737aec2_c9e3_4489_a5d7_361f44c998ce). Remove the suffix manually after export to get a standard CSV file.Network costs: If you run DataX on an Elastic Compute Service (ECS) instance, use a VPC endpoint to avoid outbound Internet traffic fees and improve network performance.
Prerequisites
The endpoint and name of the Tablestore instance and the information of the source table in Tablestore are obtained.
-
An AccessKey pair is created for your Alibaba Cloud account or a Resource Access Management (RAM) user. For more information, see Create an access key pair.
ImportantFor security purposes, use Tablestore as a RAM user. Create a RAM user, attach the
AliyunOTSFullAccesspolicy to the RAM user, and create an AccessKey pair for the RAM user. For more information, see Use the AccessKey pair of a RAM user to access Tablestore.
Procedure
The examples in this topic use ECS instances running Alibaba Cloud Linux 3.2104 LTS 64-bit or Ubuntu 22.04 64-bit. For more information, see What is ECS?
Step 1: Install dependencies
-
Install Python 2 or Python 3.
Python 3 is pre-installed on ECS instances running Alibaba Cloud Linux or Ubuntu. For other operating systems, install Python manually.
-
Install JDK 1.8 or later. JDK 1.8 is recommended.
The following commands install JDK 1.8 on Alibaba Cloud Linux or Ubuntu. For other operating systems, install JDK manually.
Alibaba Cloud Linux
yum -y install java-1.8.0-openjdk-devel.x86_64Ubuntu
apt update && apt upgrade apt install openjdk-8-jdk
Step 2: Download DataX
-
Download the DataX package.
wget https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202309/datax.tar.gz -
Extract the package.
tar -zxvf datax.tar.gz
To compile DataX from source, see the DataX installation guide.
Step 3: Create a configuration file
-
Go to the
bindirectory of DataX.cd datax/bin -
Create a configuration file.
vi tablestore_to_csv.jsonThe following examples show the configuration file content for each table type. Modify the parameters based on your requirements.
Export data from a data table
For information of the schema of the source data table named
orders, see Appendix 1: Data table .{ "job": { "setting": { "speed": { "channel": 1 }, "errorLimit": { "record": 0, "percentage": 0 } }, "content": [ { "reader": { "name": "otsreader", "parameter": { "endpoint": "https://<YOUR-INSTANCE>.<YOUR-REGION>.ots.aliyuncs.com", "accessId": "<YOUR-ACCESS-KEY-ID>", "accessKey": "<YOUR-ACCESS-KEY-SECRET>", "instanceName": "<YOUR-INSTANCE>", "table": "orders", "range": { "begin": [ { "type": "INF_MIN" } ], "end": [ { "type": "INF_MAX" } ] }, "column": [ { "name": "order_id" }, { "name": "user_id" }, { "name": "sku_id" }, { "name": "price" }, { "name": "num" }, { "name": "total_price" }, { "name": "order_status" }, { "name": "create_time" }, { "name": "modified_time" } ] } }, "writer": { "name": "txtfilewriter", "parameter": { "path": "/tmp/export/", "fileName": "output.csv", "writeMode": "truncate", "fileFormat": "csv" } } } ] } }Export data from a time series table
For information of the schema of the source time series table named
vehicles_timeseriesdata, see Appendix 2: Time series table .{ "job": { "setting": { "speed": { "channel": 1 } }, "content": [ { "reader": { "name": "otsreader", "parameter": { "endpoint": "https://<YOUR-INSTANCE>.<YOUR-REGION>.ots.aliyuncs.com", "accessId": "<YOUR-ACCESS-KEY-ID>", "accessKey": "<YOUR-ACCESS-KEY-SECRET>", "instanceName": "<YOUR-INSTANCE>", "table": "vehicles_timeseriesdata", "mode": "normal", "newVersion": "true", "isTimeseriesTable": "true", "measurementName": "measurement_1", "timeRange": { "begin": 0, "end": 1750000000000 }, "column": [ { "name": "_m_name" }, { "name": "_data_source" }, { "name": "_tags" }, { "name": "_time" }, { "name": "vin_id", "type": "STRING" }, { "name": "mfrs", "type": "STRING" }, { "name": "model", "type": "STRING" }, { "name": "speed", "type": "DOUBLE" }, { "name": "gps", "type": "STRING" }, { "name": "mileage", "type": "DOUBLE" }, { "name": "emission", "type": "DOUBLE" } ] } }, "writer": { "name": "txtfilewriter", "parameter": { "path": "/tmp/export/", "fileName": "output.csv", "writeMode": "truncate", "fileFormat": "csv" } } } ] } }-
The following table describes the Tablestore Reader parameters that you must modify.
Parameter
Applies to
Description
channel
Both
The number of parallel read/write threads. Each channel is an independent thread. Increasing this value raises concurrency and resource consumption.
endpoint
Both
The endpoint of the Tablestore instance. If you access Tablestore from an ECS instance, use a VPC endpoint to avoid outbound traffic fees and improve performance and security.
accessId
Both
The AccessKey ID of your Alibaba Cloud account or RAM user.
accessKey
Both
The AccessKey secret of your Alibaba Cloud account or RAM user.
instanceName
Both
The name of the Tablestore instance.
tableName
Both
The name of the source table in Tablestore.
column
Both
The array of columns to export.
range
Data table
The primary key range to read. The
beginandendarrays each specify one value per primary key column. The range is left-closed and right-open. UseINF_MIN(negative infinity) andINF_MAX(positive infinity) to read the full table. The number of columns in a virtual point must be the same as that in a primary key. This parameter is optional; the default reads from negative infinity.measurementName
Time series table
The metric name of the time series to read. If not set, all data in the table is read.
timeRange
Time series table
The timestamp range to read, in milliseconds. The range is left-closed and right-open. This parameter is optional; the default reads all versions.
-
The following table describes the TxtFileWriter parameters that you must modify.
Parameter
Description
path
The local directory where the exported file is saved.
fileName
The base name of the exported file, including the extension. Example:
output.csv.writeMode
How TxtFileWriter handles existing files before writing. Valid values:
-
truncate: deletes all files prefixed with thefileNamevalue in the target directory before writing. -
append: writes directly to the file without pre-processing, ensuring no file name conflicts. -
nonConflict: reports an error and aborts if a file prefixed with thefileNamevalue already exists in the directory.
fileFormat
The output file format. Valid values:
csvandtext. -
-
Step 4: Run the DataX job
-
Run the export job.
python3 datax.py tablestore_to_csv.jsonAfter the job completes, a summary is printed:
2025-03-19 17:21:05.146 [job-0] INFO StandAloneJobContainerCommunicator - Total 200000 records, 23086634 bytes | Speed 1.10MB/s, 10000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.222s | All Task WaitReaderTime 18.392s | Percentage 100.00% 2025-03-19 17:21:05.147 [job-0] INFO JobContainer - Task Start Time : 2025-03-19 17:20:43 Task End Time : 2025-03-19 17:21:05 Task Duration : 21s Average Transfer Speed : 1.10MB/s Record Write Speed : 10000rec/s Total Records Read : 200000 Failed Records : 0 -
Verify the export result.
Check the output directory (for example,
/tmp/export/) to confirm the exported file. The file name includes a random suffix:output.csv__d737aec2_c9e3_4489_a5d7_361f44c998ceTo preview the file contents, run:
head -5 output.csv__d737aec2_c9e3_4489_a5d7_361f44c998ceThe output is similar to:
order_id,user_id,sku_id,price,num,total_price,order_status,create_time,modified_time ORD-001,USR-100,SKU-200,29.99,2,59.98,completed,2025-01-01 08:00:00,2025-01-02 10:00:00 ORD-002,USR-101,SKU-201,49.99,1,49.99,pending,2025-01-02 09:00:00,2025-01-02 09:00:00Remove the suffix to get a standard CSV file name:
mv output.csv__d737aec2_c9e3_4489_a5d7_361f44c998ce output.csvNoteDataX appends a random suffix to distinguish files written by parallel threads. Remove the suffix to get a standard CSV file name.
Appendix: Sample source tables
Appendix 1: Data table
The name of the sample data table is orders. The following table describes the schema of the data table.
|
Field name |
Type |
Description |
|
order_id (primary key column) |
String |
The ID of the order. |
|
user_id |
String |
The ID of the user. |
|
sku_id |
String |
The ID of the product. |
|
price |
Double |
The unit price of the purchased products. |
|
num |
Integer |
The quantity of the purchased products. |
|
total_price |
Double |
The total price of the order. |
|
order_status |
String |
The status of the order. |
|
create_time |
String |
The time when the order was created. |
|
modified_time |
String |
The time when the order was last modified. |
Appendix 2: Time series table
The name of the sample time series table is vehicles_timeseriesdata. The following table describes the schema of the time series table.
|
Field name |
Type |
Description |
|
_m_name |
String |
The name of the physical quantity or metric for data in the time series. |
|
_data_source |
String |
The identifier of the data source for the time series. This field can be left blank. |
|
_tags |
String |
The tags of the time series. |
|
_time |
Integer |
The time when the data is reported. |
|
vin_id |
String |
The vehicle identification number (VIN) of the vehicle. |
|
mfrs |
String |
The manufacturer. |
|
model |
String |
The model of the vehicle. |
|
speed |
Double |
The current speed of the vehicle. |
|
gps |
String |
The GPS coordinates of the vehicle in the format of |
|
mileage |
Double |
The current mileage of the vehicle. |
|
emission |
Double |
The emission value. |