
MaxCompute:Data upload scenarios and tools

Last Updated:Sep 21, 2023

This topic describes how to upload data to and download data from MaxCompute. It also describes the required service connections, SDKs, and tools, as well as common operations such as data import, data export, and data migration to the cloud.

Background information

MaxCompute provides the following types of channels for data uploads and downloads. You can select a channel based on your business requirements.

  • MaxCompute Tunnel: allows you to upload and download data in batches.

  • Streaming Tunnel: allows you to write data to MaxCompute in streaming mode.

  • DataHub: allows you to process streaming data. DataHub allows you to subscribe to streaming data, publish and distribute streaming data, and archive streaming data to MaxCompute.

Features

  • Upload data by using MaxCompute Tunnel

    You can perform a single batch operation to upload data to MaxCompute by using MaxCompute Tunnel. For example, you can upload data in external files, external databases, external object storage systems, or log files to MaxCompute. MaxCompute Tunnel supports the following upload solutions:

    • Tunnel SDK: You can upload data to MaxCompute by using the interfaces of Tunnel SDK. For more information, see MaxCompute Tunnel.

    • Data synchronization: You can extract, transform, and load data to MaxCompute by using the Data Integration service of DataWorks. For more information, see Overview.

    • Open source tools and plug-ins: You can upload data to MaxCompute by using Sqoop, Kettle, Flume, Fluentd, and Oracle GoldenGate (OGG).

    • Built-in tool of MaxCompute: The MaxCompute client provides built-in commands based on Tunnel SDK. You can upload data to MaxCompute by using Tunnel commands. For more information about how to use Tunnel commands, see Tunnel commands.

    Note

    To perform offline data synchronization, we recommend that you use Data Integration of DataWorks. For more information, see Overview.
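As an illustration of the SDK route, the following sketch uses PyODPS, the Python SDK that wraps the Tunnel service, to upload records in blocks. The credentials, endpoint, project, table, and partition are placeholders, and `chunk_into_blocks` and `batch_upload` are our own helper names, not part of the SDK:

```python
def chunk_into_blocks(records, block_size):
    """Split records into upload blocks. A single upload session allows
    at most 20,000 blocks, and a single block holds at most 100 GB."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]

def batch_upload(records, block_size=10_000):
    # PyODPS wraps the Tunnel SDK; the credentials, project, table, and
    # partition below are placeholders, not real values.
    from odps import ODPS  # imported lazily so the helper above stays standalone
    o = ODPS('<access-key-id>', '<access-key-secret>', project='my_project',
             endpoint='https://service.cn-hangzhou.maxcompute.aliyun.com/api')
    table = o.get_table('my_table')
    with table.open_writer(partition="pt='20230921'", create_partition=True) as writer:
        for block in chunk_into_blocks(records, block_size):
            writer.write(block)  # each active block write occupies one DTS slot
```

Splitting the data into multiple blocks also lets you retry a single failed block instead of repeating the whole upload.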

  • Write data by using Streaming Tunnel

    MaxCompute Streaming Tunnel allows you to write data to MaxCompute in streaming mode. It provides a set of APIs and backend services that are separate from those of MaxCompute Tunnel. Streaming Tunnel supports the following data write solutions:

    • Data synchronization of Data Integration: allows you to write streaming data to MaxCompute. For more information, see Overview of real-time synchronization nodes.

    • Data shipping: allows you to write streaming data to MaxCompute by using data shipping services that integrate the streaming write APIs. For example, you can ship data to MaxCompute by using Simple Log Service or ApsaraMQ for Kafka.

    • Data writing to MaxCompute in real time: allows you to write streaming data to MaxCompute in real time by using Realtime Compute for Apache Flink.
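Regardless of the solution, streaming writers commonly buffer records and flush them in small batches rather than issuing one request per record, which keeps request counts within the per-slot limits. The sketch below shows this generic pattern; `StreamBuffer` and its threshold are illustrative and not part of any MaxCompute SDK:

```python
class StreamBuffer:
    """Illustrative micro-batching buffer for streaming writes.
    flush_fn stands in for a streaming write call (e.g. a Streaming
    Tunnel append); the batch size here is an arbitrary example."""
    def __init__(self, flush_fn, max_records=100):
        self.flush_fn = flush_fn
        self.max_records = max_records
        self.buf = []

    def append(self, record):
        self.buf.append(record)
        if len(self.buf) >= self.max_records:
            self.flush()  # flush automatically once the batch is full

    def flush(self):
        if self.buf:
            self.flush_fn(self.buf)
            self.buf = []
```

A real writer would also flush on a timer so that records are not held indefinitely during quiet periods.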

Reliability of solutions

MaxCompute provides a service level agreement (SLA) guarantee. By default, MaxCompute Tunnel and Streaming Tunnel use shared resources that are free of charge. When you upload or download data by using MaxCompute Tunnel or Streaming Tunnel, take the reliability of the solution that you want to use into account. The Tunnel service allocates available slots to requests in the order in which data is accessed.

  • If no resources are available for data access, data cannot be accessed until resources are released.

  • If fewer than 100 valid requests are received within 5 minutes, the Tunnel service is considered unavailable. For more information, see Status codes.

  • The request latency and the limits on requests are not included in the scope of SLA guarantee. For more information about the limits on requests, see Limits.
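The slot allocation described above can be pictured with a small simulation: a fixed pool of slots granted in arrival order, with callers blocking until a slot is released. This is an illustrative model only; `SlotPool` is not part of any MaxCompute SDK:

```python
import threading
from contextlib import contextmanager

class SlotPool:
    """Illustrative model of Tunnel slot allocation: a fixed number of
    slots handed out in arrival order; callers block until one frees up."""
    def __init__(self, slots):
        self._sem = threading.BoundedSemaphore(slots)
        self._lock = threading.Lock()
        self.active = 0  # slots currently in use
        self.peak = 0    # highest concurrency observed

    @contextmanager
    def slot(self):
        self._sem.acquire()  # blocks here if no slot is available
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        try:
            yield
        finally:
            with self._lock:
                self.active -= 1
            self._sem.release()  # releasing lets a waiting request proceed
```

However many requests arrive, concurrency never exceeds the pool size; excess requests simply wait, which is the behavior described in the first bullet above.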

Limits

  • Limits on using MaxCompute Tunnel

    • Data uploads

      • Lifecycle of an upload session: 24 hours

      • Maximum number of blocks that can be written in a single upload session: 20,000

      • Maximum data write speed of a single block: 10 MB/s

      • Maximum amount of data that can be written in a single block: 100 GB

      • Maximum number of upload sessions that can be created for a single table: 500 per 5 minutes

      • Maximum number of blocks that can be written to a single table: 500 per 5 minutes

      • Maximum number of upload sessions that can be concurrently committed by a single table: 32

      • Maximum number of blocks that can be written at the same time: depends on the number of Data Transmission Service (DTS) slots that can be used at the same time. One DTS slot is occupied each time data is written to a block.

    • Data downloads

      • Lifecycle of a download session: 24 hours

      • Lifecycle of a session that is used to download instance data: 24 hours (limited by the instance lifecycle)

      • Maximum number of instance-data download sessions that can be created for a single project: 200 per 5 minutes

      • Maximum number of download sessions that can be created for a single table: 200 per 5 minutes

      • Maximum speed of a single download: 10 MB/s

      • Maximum number of download sessions that can be created at the same time: depends on the number of DTS slots that can be used at the same time. One DTS slot is occupied each time a download session is created.

      • Maximum number of instance-data download sessions that can be created at the same time: depends on the number of DTS slots that can be used at the same time. One DTS slot is occupied each time an instance-data download session is created.

      • Maximum number of download requests that can be sent at the same time: depends on the number of DTS slots that can be used at the same time. One DTS slot is occupied each time a download request is sent.

  • Limits on using Streaming Tunnel

    • Maximum write speed per slot: 1 MB/s

    • Maximum number of write requests per slot: 10 per second

    • Maximum number of partitions to which data can be concurrently written in a single table: 64

    • Maximum number of slots that are available for a single partition: 32

    • Maximum number of slots that can be used by a single streaming-data upload session: depends on the number of DTS slots that can be used at the same time. You can specify the number of DTS slots when you create a streaming-data upload session.

  • Limits on data uploads by using DataHub

    • The size of each field cannot exceed the upper limit of its data type. For more information, see Data type editions.

      Note

      The size of a string cannot exceed 8 MB.

    • During the upload, multiple data entries are packaged into the same file.
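The per-slot and per-block limits above translate into simple throughput ceilings. The helpers below are illustrative arithmetic only (the function names are ours): a partition accepts streaming writes from at most 32 slots at 1 MB/s each, and a batch upload is capped by 10 MB/s per concurrently written block.

```python
def streaming_partition_ceiling(slots_per_partition=32, mb_per_slot=1):
    """Aggregate streaming write ceiling for one partition, in MB/s:
    at most 32 slots per partition, each limited to 1 MB/s."""
    return slots_per_partition * mb_per_slot

def min_upload_seconds(total_mb, concurrent_blocks, mb_per_block_per_s=10):
    """Lower bound on batch upload time: each block writes at most
    10 MB/s, so total speed is capped at concurrent_blocks * 10 MB/s."""
    return total_mb / (concurrent_blocks * mb_per_block_per_s)
```

For example, a single partition can absorb at most 32 MB/s of streaming writes, and uploading 1,000 MB across 10 concurrent blocks takes at least 10 seconds.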

Free shared DTS slots in each region

The following table describes the maximum number of free shared DTS slots that can be assigned at the project level in each region.

Country or region             Region                      Number of DTS slots
China                         China (Hangzhou)            300
                              China (Shanghai)            600
                              China East 2 Finance        50
                              China (Beijing)             300
                              China North 2 Ali Gov       100
                              China (Zhangjiakou)         300
                              China (Shenzhen)            150
                              China South 1 Finance       50
                              China (Chengdu)             150
                              China (Hong Kong)           50
Other countries or regions    Singapore                   100
                              Australia (Sydney)          50
                              Malaysia (Kuala Lumpur)     50
                              Indonesia (Jakarta)         50
                              Japan (Tokyo)               50
                              Germany (Frankfurt)         50
                              US (Silicon Valley)         100
                              US (Virginia)               50
                              UK (London)                 50
                              India (Mumbai)              50
                              UAE (Dubai)                 50

Status codes

Status code    Meaning
200            HTTP_OK
201            HTTP_CREATED
400            HTTP_BAD_REQUEST
401            HTTP_UNAUTHORIZED
403            HTTP_FORBIDDEN
404            HTTP_NOT_FOUND
405            HTTP_METHOD_NOT_ALLOWED
409            HTTP_CONFLICT
422            HTTP_UNPROCESSABLE_ENTITY
429            HTTP_TOO_MANY_REQUESTS
499            HTTP_CLIENT_CLOSED_REQUEST
500            HTTP_INTERNAL_SERVER_ERROR
502            HTTP_BAD_GATEWAY
503            HTTP_SERVICE_UNAVAILABLE
504            HTTP_GATEWAY_TIME_OUT
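When a client receives one of these status codes, a common pattern is to retry only throttling and server-side errors with exponential backoff, and to fail fast on client errors such as 400 or 403. The classification below is an assumption for illustration, not an official MaxCompute recommendation:

```python
import time

# Assumption: throttling and server-side codes are worth retrying;
# client errors (400, 401, 403, 404, 405, 409, 422) are not.
RETRYABLE = {429, 500, 502, 503, 504}

def call_with_retry(request, max_retries=4, base_delay=0.2):
    """Call request() -> HTTP-like status code, retrying transient
    failures with exponential backoff. Returns the final status."""
    for attempt in range(max_retries + 1):
        status = request()
        if status not in RETRYABLE or attempt == max_retries:
            return status
        time.sleep(base_delay * (2 ** attempt))  # 0.2s, 0.4s, 0.8s, ...
```

Backoff matters especially for 429 (HTTP_TOO_MANY_REQUESTS): retrying immediately would keep competing for the same exhausted slots.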

Precautions

The network status has a significant impact on Tunnel uploads and downloads. In normal cases, the upload speed ranges from 1 MB/s to 20 MB/s. If you want to upload a large amount of data, we recommend that you configure the Tunnel endpoint of the classic network or a virtual private cloud (VPC). You can access the Tunnel endpoint of the classic network or a VPC by using Elastic Compute Service (ECS) instances or a leased line. If the upload speed is slow, you can use the multi-thread upload method.
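The multi-thread upload method mentioned above can be sketched as follows. `upload_block` stands in for a real per-block Tunnel write and is a placeholder; each active block write occupies one DTS slot, so the worker count should stay within your available slots:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_upload(blocks, upload_block, max_workers=4):
    """Upload blocks concurrently. upload_block is a placeholder for a
    per-block Tunnel write; results are returned in block order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload_block, blocks))
```

Because each block is capped at 10 MB/s, writing several blocks in parallel is the main way to raise aggregate upload throughput.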

For more information about Tunnel endpoints, see Endpoints.