This topic describes how to upload data to MaxCompute or download data from MaxCompute, including service connections, SDKs, tools, data import and export, and cloud migration.

Background information

MaxCompute provides the following types of channels for data uploads and downloads:
  • DataHub: used for real-time data uploads and downloads. DataHub provides tools, such as the OGG, Flume, Logstash, and Fluentd plug-ins, to upload and download data.
  • Tunnel: used for batch data uploads and downloads. Tunnel provides tools, such as the MaxCompute client, DataWorks, Data Transmission Service, Sqoop, the Kettle plug-in, and MaxCompute Migration Assist (MMA), to upload and download data.
  • Streaming Tunnel: used for streaming data writes. Streaming Tunnel supports Realtime Compute for Apache Flink, DataHub, Data Transmission Service, real-time synchronization of DataWorks, and Message Queue for Apache Kafka.

DataHub and Tunnel provide their own SDKs. The data upload and download tools derived from these SDKs allow you to upload or download data in a variety of scenarios. For more information, see MaxCompute Tunnel overview.

The preceding tools cover most scenarios that involve migrating data to the cloud. The subsequent topics describe the tools and common cloud migration scenarios, such as data migration from Hadoop to MaxCompute, database synchronization, and log collection, to help you select an appropriate technical solution.

Note To perform offline data synchronization, we recommend that you use Data Integration of DataWorks. For more information, see Overview.

Limits

  • Limits on data uploads by using Tunnel commands
    • No limits are imposed on the upload speed. The upload speed varies based on the network bandwidth and server performance.
    • The number of upload retries is limited. By default, a block is retried up to 5 times. If the number of retries exceeds the limit, the system moves on to the next block. Because a block that exhausts its retries may not be uploaded, execute the SELECT COUNT(*) FROM table_name statement after the upload to check whether any data was lost.
    • Each project supports a maximum of 2,000 parallel Tunnel connections.
    • On the server, a session remains valid for 24 hours after it is created. A session can be shared among processes and threads, but you must make sure that each block ID is unique.
  • Limits on data uploads by using DataHub
    • The size of each field cannot exceed its upper limit. For more information, see Data type editions.
      Note The size of a string cannot exceed 8 MB.
    • During the upload, multiple data entries are packaged into the same file.
  • Limits on the TableTunnel SDK
    • The value of a block ID must be in the range of [0, 20000). The amount of data that you want to upload in a block cannot exceed 100 GB.
    • The lifecycle of a session is 24 hours. If you want to upload large amounts of data, we recommend that you transfer your data in multiple sessions.
    • The lifecycle of an HTTP request that corresponds to a RecordWriter is 120 seconds. If no data flows over an HTTP connection within 120 seconds, the server closes the connection.
  • The Tunnel service allocates available slots to requests in the order in which the requests access data.
    • If the remaining resources are insufficient and the number of parallel requests exceeds 1,000, the Tunnel service denies data access until sufficient resources become available.
    • If no resources are available for data access, data cannot be accessed until resources are released.
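
The block ID range and the per-block size limit of the TableTunnel SDK can be checked before an upload starts. The following Python sketch is illustrative only and does not call the Tunnel SDK. The names plan_blocks, MAX_BLOCK_ID, and MAX_BLOCK_BYTES are hypothetical, and 100 GB is interpreted as 100 × 1024³ bytes, which is an assumption.

```python
# Values taken from the TableTunnel SDK limits listed above.
MAX_BLOCK_ID = 20000               # block IDs must fall in [0, 20000)
MAX_BLOCK_BYTES = 100 * 1024 ** 3  # 100 GB per block (read as GiB, an assumption)


def plan_blocks(total_bytes, block_bytes):
    """Split total_bytes into (block_id, size_in_bytes) pairs that respect the limits."""
    if total_bytes < 0:
        raise ValueError("total_bytes must not be negative")
    if not 0 < block_bytes <= MAX_BLOCK_BYTES:
        raise ValueError("block size must be greater than 0 and at most 100 GB")

    # Integer ceiling division avoids floating-point rounding on large sizes.
    block_count = (total_bytes + block_bytes - 1) // block_bytes
    if block_count > MAX_BLOCK_ID:
        raise ValueError("too many blocks for one session; use a larger block size")

    plan = []
    for block_id in range(block_count):
        remaining = total_bytes - block_id * block_bytes
        plan.append((block_id, min(block_bytes, remaining)))
    return plan


# Example: 250 bytes split into blocks of 100 bytes.
print(plan_blocks(250, 100))
```

If the plan needs more than 20,000 blocks, the sketch raises an error instead of reusing block IDs. In that case, a larger block size or several sessions would be needed.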

Precautions

Network conditions have a significant impact on Tunnel uploads and downloads. Under normal conditions, the upload speed ranges from 1 MB/s to 20 MB/s. To upload a large amount of data, we recommend that you configure the Tunnel endpoint of the classic network or a virtual private cloud (VPC). You can access these endpoints from Elastic Compute Service (ECS) instances or over a leased line. If the upload speed is slow, you can upload data by using multiple threads.
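
A multi-threaded upload can be sketched as follows. This Python example shows only the concurrency pattern: one block per task, with a unique block ID for each block. The function upload_block is a caller-supplied placeholder, and fake_upload stands in for a real Tunnel SDK call. Neither name is part of the MaxCompute SDK.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_in_parallel(blocks, upload_block, max_workers=4):
    """Upload (block_id, payload) pairs concurrently and return results by block ID.

    upload_block is a hypothetical caller-supplied function that uploads
    one block and returns the number of records written.
    """
    block_ids = [block_id for block_id, _ in blocks]
    if len(set(block_ids)) != len(block_ids):
        raise ValueError("each block ID must be unique within a session")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            block_id: pool.submit(upload_block, block_id, payload)
            for block_id, payload in blocks
        }
        # result() re-raises any exception raised inside a worker thread.
        return {block_id: future.result() for block_id, future in futures.items()}


def fake_upload(block_id, payload):
    # Stand-in for a real upload call; it only counts the records.
    return len(payload)


written = upload_in_parallel([(0, ["a", "b"]), (1, ["c"])], fake_upload)
print(sum(written.values()))  # prints 3
```

Comparing the total returned by the workers with the row count of the destination table is one way to confirm that no block was skipped.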

For more information about Tunnel endpoints, see Endpoints.