This topic describes how to use DataX to export data from Tablestore to a local CSV file. You can export data from data tables and time series tables in Tablestore.
Background information
DataX is a tool for offline synchronization between disparate data sources. It supports efficient and stable synchronization between various data sources such as MySQL, Oracle, HDFS, Hive, and Tablestore.
Prerequisites
The name and endpoint of the Tablestore instance and information about the source table in Tablestore are obtained.
An AccessKey pair is created for your Alibaba Cloud account or a Resource Access Management (RAM) user. For more information, see Create an AccessKey pair.
Important: For security purposes, we recommend that you use Tablestore features as a RAM user. You can create a RAM user, grant the user permissions to manage Tablestore by attaching the AliyunOTSFullAccess policy to the RAM user, and create an AccessKey pair for the RAM user. For more information, see Use the AccessKey pair of a RAM user to access Tablestore.
Procedure
In the examples in this topic, Elastic Compute Service (ECS) instances that run Alibaba Cloud Linux 3.2104 LTS 64-bit or Ubuntu 22.04 64-bit are used. For more information, see What is ECS?
Step 1: Install dependencies
Install Python 2 or Python 3.
Python 3 is pre-installed on ECS instances that run Alibaba Cloud Linux or Ubuntu. For ECS instances that run other operating systems, you need to install Python yourself.
Install JDK 1.8 or later. We recommend that you install JDK 1.8.
This step describes how to install JDK 1.8 on an ECS instance that runs Alibaba Cloud Linux or Ubuntu. For ECS instances that run other operating systems, you need to install JDK yourself.
Alibaba Cloud Linux
yum -y install java-1.8.0-openjdk-devel.x86_64
Ubuntu
apt update && apt upgrade -y && apt install -y openjdk-8-jdk
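After installation, you can quickly confirm that both dependencies are available. The following check is a minimal sketch; the exact version strings vary by system.

```shell
# Check that Python is available (the DataX launcher script requires it).
python3 --version
# Check that a JDK is available; print a hint if it is missing.
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "java not found: install JDK 1.8 or later"
fi
```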
Step 2: Download DataX
Download the DataX package.
wget https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202309/datax.tar.gz
Extract the package.
tar -zxvf datax.tar.gz
For information about how to compile DataX yourself, see DataX installation guide.
Step 3: Create a configuration file
Go to the bin directory of DataX.
cd datax/bin
Create a configuration file. If you use vim instead of vi, replace the command accordingly.
vi tablestore_to_csv.json
The following sample code shows the content of a configuration file. Modify the parameters based on your synchronization requirements and actual situation.
Export data from a data table
For information about the schema of the source data table named orders, see Appendix 1: Data table.
{
  "job": {
    "setting": {
      "speed": {
        "channel": 1
      },
      "errorLimit": {
        "record": 0,
        "percentage": 0
      }
    },
    "content": [
      {
        "reader": {
          "name": "otsreader",
          "parameter": {
            "endpoint": "https://<YOUR-INSTANCE>.<YOUR-REGION>.ots.aliyuncs.com",
            "accessId": "<YOUR-ACCESS-KEY-ID>",
            "accessKey": "<YOUR-ACCESS-KEY-SECRET>",
            "instanceName": "<YOUR-INSTANCE>",
            "table": "orders",
            "range": {
              "begin": [
                { "type": "INF_MIN" }
              ],
              "end": [
                { "type": "INF_MAX" }
              ]
            },
            "column": [
              { "name": "order_id" },
              { "name": "user_id" },
              { "name": "sku_id" },
              { "name": "price" },
              { "name": "num" },
              { "name": "total_price" },
              { "name": "order_status" },
              { "name": "create_time" },
              { "name": "modified_time" }
            ]
          }
        },
        "writer": {
          "name": "txtfilewriter",
          "parameter": {
            "path": "/tmp/export/",
            "fileName": "output.csv",
            "writeMode": "truncate",
            "fileFormat": "csv"
          }
        }
      }
    ]
  }
}
Export data from a time series table
For information about the schema of the source time series table named vehicles_timeseriesdata, see Appendix 2: Time series table.
{
  "job": {
    "setting": {
      "speed": {
        "channel": 1
      }
    },
    "content": [
      {
        "reader": {
          "name": "otsreader",
          "parameter": {
            "endpoint": "https://<YOUR-INSTANCE>.<YOUR-REGION>.ots.aliyuncs.com",
            "accessId": "<YOUR-ACCESS-KEY-ID>",
            "accessKey": "<YOUR-ACCESS-KEY-SECRET>",
            "instanceName": "<YOUR-INSTANCE>",
            "table": "vehicles_timeseriesdata",
            "mode": "normal",
            "newVersion": "true",
            "isTimeseriesTable": "true",
            "measurementName": "measurement_1",
            "timeRange": {
              "begin": 0,
              "end": 1750000000000
            },
            "column": [
              { "name": "_m_name" },
              { "name": "_data_source" },
              { "name": "_tags" },
              { "name": "_time" },
              { "name": "vin_id", "type": "STRING" },
              { "name": "mfrs", "type": "STRING" },
              { "name": "model", "type": "STRING" },
              { "name": "speed", "type": "DOUBLE" },
              { "name": "gps", "type": "STRING" },
              { "name": "mileage", "type": "DOUBLE" },
              { "name": "emission", "type": "DOUBLE" }
            ]
          }
        },
        "writer": {
          "name": "txtfilewriter",
          "parameter": {
            "path": "/tmp/export/",
            "fileName": "output.csv",
            "writeMode": "truncate",
            "fileFormat": "csv"
          }
        }
      }
    ]
  }
}
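Before you run the task, it can help to confirm that the file you saved is syntactically valid JSON. The following check is a minimal sketch; it assumes the configuration file is named tablestore_to_csv.json in the current directory.

```shell
# Validate the JSON syntax of the job configuration before running DataX.
if [ -f tablestore_to_csv.json ]; then
  python3 -m json.tool tablestore_to_csv.json > /dev/null && echo "JSON OK"
else
  echo "tablestore_to_csv.json not found"
fi
```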
The following table describes the parameters of Tablestore Reader that you must modify.
Parameter
Description
channel
The core parallel unit of the data synchronization task.
Each channel corresponds to an independent thread for reading and writing data. You can control the concurrency of the task by modifying the number of channels, which affects system performance and resource consumption.
endpoint
The endpoint of the Tablestore instance.
Note: If you access Tablestore from an ECS instance, we recommend that you use a VPC endpoint, which does not generate fees for outbound traffic over the Internet and provides better network performance and security.
accessId
The AccessKey ID of your Alibaba Cloud account or RAM user.
accessKey
The AccessKey secret of your Alibaba Cloud account or RAM user.
instanceName
The name of the Tablestore instance.
table
The name of the source table in Tablestore.
column
The array of columns to export.
range
The begin and end parameters specify the start and end values of each primary key column for data that is read from a Tablestore data table. The value of each parameter is a JSON array. The range of data to read is a left-closed, right-open interval.
The start and end primary keys must be valid primary keys or virtual points that consist of the INF_MIN and INF_MAX types. The number of columns in a virtual point must be the same as the number of primary key columns. INF_MIN indicates negative infinity, which is smaller than any value. INF_MAX indicates positive infinity, which is larger than any value.
Note: This parameter takes effect only when a data table is used as the source table. This parameter is optional. The default value indicates that reading starts from negative infinity.
measurementName
The metric name of the time series that you want to read. If this parameter is not configured, all data in the table is read.
Note: This parameter takes effect only when a time series table is used as the source table.
timeRange
The begin and end parameters specify the start and end timestamps for reading data. The value of the begin and end parameters is a JSON array. The range of data to read from Tablestore is a left-closed, right-open interval. The unit of the timestamp is milliseconds.
Note: This parameter takes effect only when a time series table is used as the source table. This parameter is optional. The default value indicates that data in the entire time range is read.
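The begin and end values of timeRange are plain millisecond timestamps. As a sketch, you can compute them with GNU date, which is available on Alibaba Cloud Linux and Ubuntu; the dates below are examples only.

```shell
# Convert human-readable dates (UTC) to millisecond timestamps for timeRange.
begin=$(date -u -d "2025-06-01 00:00:00" +%s%3N)
end=$(date -u -d "2025-07-01 00:00:00" +%s%3N)
echo "\"timeRange\": { \"begin\": $begin, \"end\": $end }"
```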
The following table describes the parameters of TxtFileWriter that you must modify.
Parameter
Description
path
The path of the exported file in the system.
fileName
The name of the exported file, including the file name extension. Example: output.csv.
writeMode
The mode in which TxtFileWriter processes data before writing data. Valid values:
truncate: clears all files whose names are prefixed with the value of the fileName parameter in the specified directory before writing data.
append: does not process existing files before writing. Data is written directly to a file whose name is based on the value of the fileName parameter, in a way that ensures that file names do not conflict.
nonConflict: reports an error and aborts the operation if a file whose name is prefixed with the value of the fileName parameter exists in the directory.
fileFormat
The format of the output file. Valid values: csv and text.
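For example, a writer configuration that keeps previous output and appends new files instead of truncating might look as follows (the values are illustrative; keep the truncate setting from the samples above if you want each run to replace the previous output):

```json
"writer": {
  "name": "txtfilewriter",
  "parameter": {
    "path": "/tmp/export/",
    "fileName": "output.csv",
    "writeMode": "append",
    "fileFormat": "csv"
  }
}
```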
Step 4: Execute the DataX task
Run the following command to start the execution of the data export task:
python3 datax.py tablestore_to_csv.json
After the task is completed, the overall execution status is printed.
2025-03-19 17:21:05.146 [job-0] INFO  StandAloneJobContainerCommunicator - Total 200000 records, 23086634 bytes | Speed 1.10MB/s, 10000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.222s | All Task WaitReaderTime 18.392s | Percentage 100.00%
2025-03-19 17:21:05.147 [job-0] INFO  JobContainer -
Task Start Time          : 2025-03-19 17:20:43
Task End Time            : 2025-03-19 17:21:05
Task Duration            : 21s
Average Transfer Speed   : 1.10MB/s
Record Write Speed       : 10000rec/s
Total Records Read       : 200000
Failed Records           : 0
Verify the export result.
After the task is completed, check the specified path (such as /tmp/export/) to confirm that the exported CSV file meets your expectations.
output.csv__d737aec2_c9e3_4489_a5d7_361f44c998ce
Note: The name of the exported CSV file has a random suffix to distinguish files that are generated by multiple threads. You can manually remove this suffix to obtain a standard CSV file name.
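As a sketch, you can strip the random suffix with a shell loop. This assumes a single export file in the path from the writer configuration; with multiple output files, rename them individually instead.

```shell
# Rename the exported file (e.g. output.csv__d737aec2_...) to a plain name.
export_dir=/tmp/export
for f in "$export_dir"/output.csv__*; do
  [ -e "$f" ] && mv "$f" "$export_dir/output.csv"
done
```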
Appendix: Sample source tables
Appendix 1: Data table
The name of the sample data table is orders. The following table describes the schema of the data table.
Field name | Type | Description |
order_id (primary key column) | String | The ID of the order. |
user_id | String | The ID of the user. |
sku_id | String | The ID of the product. |
price | Double | The unit price of the purchased products. |
num | Integer | The quantity of the purchased products. |
total_price | Double | The total price of the order. |
order_status | String | The status of the order. |
create_time | String | The time when the order was created. |
modified_time | String | The time when the order was last modified. |
Appendix 2: Time series table
The name of the sample time series table is vehicles_timeseriesdata. The following table describes the schema of the time series table.
Field name | Type | Description |
_m_name | String | The name of the physical quantity or metric for data in the time series. |
_data_source | String | The identifier of the data source for the time series. This field can be left empty. |
_tags | String | The tags of the time series. |
_time | Integer | The time when the data is reported. |
vin_id | String | The vehicle identification number (VIN) of the vehicle. |
mfrs | String | The manufacturer. |
model | String | The model of the vehicle. |
speed | Double | The current speed of the vehicle. |
gps | String | The GPS coordinates of the vehicle. |
mileage | Double | The current mileage of the vehicle. |
emission | Double | The emission value. |