
FAQs for Data Upload with MaxCompute Tunnel

In this article, we cover some frequently asked questions about MaxCompute Tunnel commands, which are mainly used to upload or download data.

Data Issues

Q: If I use the Tunnel Java SDK to upload data, can data be automatically allocated to each partition?

A: Currently, Tunnel does not automatically allocate uploaded data to partitions. Each upload writes data to one table or to one partition of a table. When you upload data to a partitioned table, you must specify the target partition; if the table has multiple partition levels, you must specify the partition down to the last level. For more information about the Java SDK, see Java SDK.
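
For reference, a minimal Java SDK sketch of uploading to one specific multi-level partition might look like the following. The endpoint, credentials, and project are placeholders, and the table and columns follow the sale_detail example used later in this article:

import com.aliyun.odps.Odps;
import com.aliyun.odps.PartitionSpec;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;

public class TunnelPartitionUpload {
    public static void main(String[] args) throws Exception {
        Odps odps = new Odps(new AliyunAccount("<access_id>", "<access_key>"));
        odps.setEndpoint("<odps_endpoint>");
        odps.setDefaultProject("<project>");
        TableTunnel tunnel = new TableTunnel(odps);
        // Specify every partition level explicitly, down to the last level.
        PartitionSpec spec = new PartitionSpec("sale_date='201312',region='hangzhou'");
        TableTunnel.UploadSession session =
                tunnel.createUploadSession("<project>", "sale_detail", spec);
        RecordWriter writer = session.openRecordWriter(0);
        Record record = session.newRecord();
        record.setString("shop_name", "shopx");
        record.setString("customer_id", "x_id");
        record.setDouble("total_price", 100.0);
        writer.write(record);
        writer.close();
        session.commit(new Long[]{0L});
    }
}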

Q: If I use the Tunnel Java SDK to upload data to a partition table, can the SDK dynamically create partitions based on the data?

A: You must create partitions first and specify the target partition when uploading data through the SDK. You can also upload data to a table in MaxCompute and run SQL statements to dynamically create partitions.
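
As an illustration, a hedged sketch of creating the target partition with the Java SDK before opening the upload session might look like the following. The project, table, and partition values are placeholders, and it assumes the SDK's Table.createPartition method:

import com.aliyun.odps.Odps;
import com.aliyun.odps.PartitionSpec;
import com.aliyun.odps.Table;
import com.aliyun.odps.account.AliyunAccount;

public class CreatePartitionFirst {
    public static void main(String[] args) throws Exception {
        Odps odps = new Odps(new AliyunAccount("<access_id>", "<access_key>"));
        odps.setEndpoint("<odps_endpoint>");
        odps.setDefaultProject("<project>");
        // Create the partition before uploading; the Tunnel SDK does not create it for you.
        Table table = odps.tables().get("sale_detail");
        PartitionSpec spec = new PartitionSpec("sale_date='201312',region='hangzhou'");
        table.createPartition(spec, true); // true: do not fail if the partition already exists
        // Then create the Tunnel upload session against this partition as usual.
    }
}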

Q: When I run tunnel upload D:test test/pt="time" in DataWorks to upload data to partitions, why is the following error returned: FAILED: error occurred while running tunnel command?

A: DataWorks does not support the Upload statement provided by the MaxCompute Tunnel command line tool.

Q: When I run Tunnel commands to upload data, the data is divided into 50 blocks and uploaded properly at first. However, the upload fails at the 22nd block. After five retries, the 22nd block is ignored and the data of the 23rd block is uploaded. Why does this happen?

A: A block corresponds to one HTTP request. Multiple blocks can be uploaded in parallel, and each block upload is atomic: whether the synchronization request for a block succeeds or fails, it does not affect the other blocks. The number of retries for a failed block is limited. When the retry limit is reached, the current block is skipped and the next block is uploaded. After the upload is completed, you can run the select count(*) statement to check whether any data is missing.

Q: I want to upload 10 GB of website logs collected by my local server each day to MaxCompute. The current upload speed by running the tunnel upload command is about 300 KB/s. How can I accelerate the upload process?

A: The tunnel upload command does not limit the upload speed. The upload speed is restricted by your network bandwidth and server performance. To improve throughput, you can distribute the data across different partitions or tables, or upload the data from multiple ECS instances at the same time.

Q: How can I combine the following two commands in a shell script to upload data in a TXT file to a table in MaxCompute?

/odpscmd/bin/odpscmd

tunnel upload "$FILE" project.table

A: Refer to the Client documentation to set the startup parameters for the CLT. The startup command in the shell script is as follows:

/odpscmd/bin/odpscmd -e "tunnel upload "$FILE" project.table"

Q: When I run the tunnel upload command to upload data to MaxCompute, why does the upload fail if the data includes a carriage return or a space?

A: If the data itself includes carriage returns or spaces, use other characters as the delimiters and specify them when uploading: -fd sets the column delimiter and -rd sets the row delimiter. If no suitable delimiters can be found for the data, upload each record as a single column and use UDFs to parse it.

For example, the following data includes a carriage return. You can use the comma (,) as the column delimiter (-fd) and the at sign (@) as the row delimiter (-rd). The data can then be uploaded properly.

Data content:

shopx,x_id,100@
shopy,y_id,200@
shopz,z_i
d,300@

Upload command:

odps@ MaxCompute_DOC>tunnel u d:\data.txt sale_detail/sale_date=201312,region=hangzhou -s false -fd "," -rd "@";

Upload result:

+-----------+-------------+-------------+-----------+----------+
| shop_name | customer_id | total_price | sale_date | region   |
+-----------+-------------+-------------+-----------+----------+
| shopx     | x_id        | 100.0       | 201312    | hangzhou |
| shopy     | y_id        | 200.0       | 201312    | hangzhou |
| shopz     | z_i
d          | 300.0       | 201312    | hangzhou |
+-----------+-------------+-------------+-----------+----------+

Q: When running the tunnel upload command to upload data to MaxCompute, I use the comma (,) as the column delimiter. What should I do if the data in the description field already includes a comma (,) or vertical bar (|)?

A: If the data in the description field already includes a comma (,), you can use another symbol as the delimiter and specify it using -fd. An example is as follows:

A user wants to run the tunnel upload command in Windows to upload an Excel file that describes job requirements, and the table content includes commas (,). To upload the data, first set the default list separator used when converting the Excel file to a CSV file: in Windows 7, open Control Panel and click Clock, Language, and Region. In the displayed window, click Change the date, time, or number formats. On the Formats tab, click Additional settings. In this example, the original data does not include the dollar sign ($), so the separator is set to the dollar sign ($).


After the settings, save the EXCEL file as a CSV file and use a text editing tool such as Notepad++ to transcode the file to UTF-8 encoding (default encoding format of Tunnel). Check whether the file delimiter has been changed to the dollar sign ($).

Title$Location$Salary$Company$Company introduction link$Company type$Company size$Industry$Work experience$Education background$Enrollment$Release time$Tag$Job information$Work address$Company information$Website$Collection time

Run the tunnel upload command to upload the data and specify the delimiter. (You must first create a table in MaxCompute.) The following information indicates that the data has been successfully uploaded:

odps@ MaxCompute_DOC>tunnel u d:\12JD.CSV JD2 -s false -fd "$";
Upload session: 201808201531180d47df0b0b18fa45
Start upload:d:\12JD.CSV
Using \r\n to split records
Upload in strict schema mode: true
Total bytes:111028       Split input to 1 blocks
2018-08-20 15:31:18     upload block: '1'
2018-08-20 15:31:18     upload block complete, blockid=1
upload complete, average speed is 108.4 KB/s
OK

Q: I want to run the tunnel upload command to upload data to MaxCompute. The default delimiter of the tunnel upload command is the comma (,). However, the delimiter in the CSV file is also the comma (,): A column of data in the file includes commas (,) enclosed in quotation marks (""). What should I do?

A: You can use the -fd parameter to specify another delimiter in the CSV file.

If the data includes multiple symbols that may conflict with the delimiter, you can customize the delimiter in the data to prevent the conflict, for example, #@#@@ or )#@#@@$.

Q: Why is the OOM error java.lang.OutOfMemoryError:Java heap space returned when I run the tunnel upload command to upload data to MaxCompute?


A: The error message indicates that a memory overflow occurred during the upload. The tunnel upload command itself supports uploading large amounts of data, so an OOM error usually means that the row or column delimiters are wrong. In that case, the entire file is treated as a single record and buffered in memory for splitting, which exhausts the heap.

If this is the case, test a small amount of data with the correct -rd and -fd settings, and then upload the full data set.

Q: I need to run the tunnel upload command to upload multiple data files to a table in MaxCompute. Can I use loops in a script to upload all the data files under the folder?

A: The tunnel upload command can be used to upload a file or a level-1 folder.

For example, to upload the folder d:\data, run the following command:

odps@ MaxCompute_DOC>tunnel u d:\data sale_detail/sale_date=201312,region=hangzhou -s false;

For more information, see Tunnel commands.

Q: The error column mismatch is returned when I upload a folder. However, each file in the folder can be separately uploaded. Is the folder size too large?


A: Set -dbr=false -s true in the upload command to verify the data format. The column mismatch error is returned because the number of columns in a row does not match the table schema. A possible reason is that the column delimiter is incorrect, or a blank row that cannot be split by the delimiter exists at the end of one of the files.

Q: When I run the tunnel upload command to upload two files to MaxCompute, why does the second file fail to be uploaded after the first file is uploaded? No error is returned. The upload commands are as follows:

D:\odps\bin\odpscmd.bat -e"tunnel upload d:\data1.txt sale_detail/sale_data=201312 -fd="$" -mbr=5 --scan=true; "

D:\odps\bin\odpscmd.bat -e"tunnel upload d:\data2.txt sale_detail/sale_data=201312 -fd="$" -mbr=5 --scan=true; "

A: You are using an earlier version of the CLT. In earlier versions, when the tunnel upload command includes the --scan parameter, the parameter may not be passed on correctly when the command is relayed. Delete --scan=true from the command and try again.

Q: I want to upload all files in a folder to a table in MaxCompute and want a partition to be automatically created. When I run the command tunnel upload /data/2018/20180813/*.json app_log /dt=20180813 -fd '@@' -acp true;, the following error is returned:

Unrecognized option: -acp
FAILED: error occurred while running tunnel command

A: This error occurs because an unsupported option is used: the tunnel upload command does not provide an -acp option for creating partitions automatically, and it does not support wildcards or regular expressions in the file path.

Q: When I run the tunnel upload command to upload file data to MaxCompute, an error is returned. Is there a command similar to -f of MySQL that can skip the error data and continue the upload?

A: This kind of error is caused by data format issues, for example, a mismatched data type. Set the -dbr parameter to true to ignore dirty data (such as rows with too many columns, too few columns, or mismatched column types). For more information, see Tunnel commands. The default value of the -dbr parameter is false, which means dirty data is not ignored and the upload fails when dirty data is encountered. When the value is true, all data that does not conform to the table definition is ignored.
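
For example, a command of the following form (the file path, table, and partition values are placeholders mirroring earlier examples in this article) skips dirty records and continues the upload:

odps@ MaxCompute_DOC>tunnel upload d:\data.txt sale_detail/sale_date=201312,region=hangzhou -dbr true -fd ",";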

Q: When I run the tunnel upload command to upload file data to MaxCompute, why is the following error returned?

java.io.IOException: RequestId=XXXXXXXXXXXXXXXXXXXXXXXXX, ErrorCode=StatusConflict, ErrorMessage=You cannot complete the specified operation under the current upload or download status.
        at com.aliyun.odps.tunnel.io.TunnelRecordWriter.close(TunnelRecordWriter.java:93)
        at com.xgoods.utils.aliyun.maxcompute.OdpsTunnel.upload(OdpsTunnel.java:92)
        at com.xgoods.utils.aliyun.maxcompute.OdpsTunnel.upload(OdpsTunnel.java:45)
        at com.xeshop.task.SaleStatFeedTask.doWork(SaleStatFeedTask.java:119)
        at com.xgoods.main.AbstractTool.excute(AbstractTool.java:90)
        at com.xeshop.task.SaleStatFeedTask.main(SaleStatFeedTask.java:305)

A: This error message indicates that an upload or download task is running. The same operation cannot be performed again.

Q: When I use the Tunnel SDK to upload file data to MaxCompute, why is an error indicating repeated submission returned?

RequestId=20170116xxxxxxx, ErrorCode=StatusConflict, ErrorMessage=You cannot complete the specified operation under the current upload or download status. java.io.IOException: RequestId=20170116xxxxxxx, ErrorCode=StatusConflict, ErrorMessage=You cannot complete the specified operation under the current upload or download status.
at com.aliyun.odps.tunnel.io.TunnelRecordWriter.close(TunnelRecordWriter.java:93)

A: This error message indicates that the error occurs when the writer is to be closed. The possible reasons are as follows:

  1. The close operation is performed on a writer that has been closed.
  2. The session corresponding to the writer has been closed.
  3. The session has been submitted.

You can diagnose the problem based on the preceding possible reasons, for example, by printing the current status of the writer and the session.
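
As a rough illustration of the expected lifecycle (endpoint, project, table, and column names below are placeholders), each writer should be closed exactly once before the session is committed, and nothing should be written or closed after the commit:

import com.aliyun.odps.Odps;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;

public class WriterLifecycle {
    public static void main(String[] args) throws Exception {
        Odps odps = new Odps(new AliyunAccount("<access_id>", "<access_key>"));
        odps.setEndpoint("<odps_endpoint>");
        odps.setDefaultProject("<project>");
        TableTunnel tunnel = new TableTunnel(odps);
        TableTunnel.UploadSession session =
                tunnel.createUploadSession("<project>", "<table>");
        RecordWriter writer = session.openRecordWriter(0);  // one block per writer
        Record record = session.newRecord();
        record.setString("data", "example");
        writer.write(record);
        writer.close();                    // close each writer exactly once
        session.commit(new Long[]{0L});    // commit once, after all writers are closed
        // Closing the writer again, or writing after commit(), raises the StatusConflict error above.
    }
}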

Q: I have written a UDF and packaged it in a JAR file and want to use the Tunnel SDK to upload the file to MaxCompute. Is there any requirement for the JAR package size?

A: The JAR package size cannot exceed 10 MB. If the size is larger than 10 MB, we recommend that you run the tunnel upload command to upload data to MaxCompute.

Q: If I run the tunnel upload command to upload data to MaxCompute, is there any requirement for the data size?

A: The tunnel upload command does not have a limit for the size of data to be uploaded.

Q: If I run the tunnel upload command to upload a CSV file to MaxCompute, how can I upload the data without the table header in the first row?

A: We recommend that you set the -h parameter to true to skip the table header.
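
For example, a command of the following form (the file path, table, and partition values are placeholders) skips the header row of the CSV file:

odps@ MaxCompute_DOC>tunnel upload d:\data.csv sale_detail/sale_date=201312,region=hangzhou -h true -fd ",";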

Q: Are there any partition requirements if I use the Tunnel SDK to upload data to MaxCompute?

A: If you use the Tunnel SDK to upload data to MaxCompute, a maximum of 60,000 partitions are supported.

If too many partitions are specified, the performance of statistics and analysis will be affected. MaxCompute limits the maximum number of instances in a job, which is closely related to the amount of data you upload and the number of partitions. We recommend that you evaluate your business and select the appropriate partition strategy to prevent the impact of excessive partitions.

For more information about partition tables, see Partition.

In addition, MaxCompute supports batch data submission with the Tunnel Python SDK. For more information, see Data upload/download configuration in Python SDK.

Q: I want to upload 80 million data records at one time. The following error is returned when I run odps tunnel recordWriter.close():

ErrorCode=StatusConflict, ErrorMessage=You cannot complete the specified operation under the current upload or download status.

A: This error message indicates that the session status is incorrect. We recommend that you create a session and upload the data again. A possible reason is that the session has been closed or committed in the previous operation. Each partition requires a separate session.

To prevent this error caused by multiple commit operations, check whether the data has been successfully uploaded. If data upload failed, upload the data again. For more information, see Example of multi-thread uploading.

Q: How can I use Tunnel BufferedWriter to prevent errors caused by batch data upload through the Tunnel SDK?

A: MaxCompute Java SDK 0.21.3-public and later versions support the BufferedWriter SDK, which simplifies the data upload process and provides a fault tolerance function.

BufferedWriter hides the block concept from users: from the user's perspective, a writer is opened on the session and data is written to it. Internally, BufferedWriter caches the data in a client-side buffer and opens an HTTP connection to upload the data once the buffer is full.

BufferedWriter provides a fault tolerance function to ensure successful data upload.
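
Under these assumptions, a minimal BufferedWriter sketch might look like the following. The endpoint, project, table, and column names are placeholders; note that with the buffered writer you do not manage block IDs yourself:

import com.aliyun.odps.Odps;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;

public class BufferedUpload {
    public static void main(String[] args) throws Exception {
        Odps odps = new Odps(new AliyunAccount("<access_id>", "<access_key>"));
        odps.setEndpoint("<odps_endpoint>");
        odps.setDefaultProject("<project>");
        TableTunnel tunnel = new TableTunnel(odps);
        TableTunnel.UploadSession session =
                tunnel.createUploadSession("<project>", "<table>");
        RecordWriter writer = session.openBufferedWriter(); // blocks are managed by the SDK
        for (int i = 0; i < 1000; i++) {
            Record record = session.newRecord();
            record.setString("data", "row-" + i);
            writer.write(record);  // buffered on the client and uploaded when the buffer is full
        }
        writer.close();
        session.commit();          // no block list is needed with the buffered writer
    }
}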

Q: When I run the tunnel upload command to upload a CSV file to MaxCompute, why are many parts of the text replaced with hyphens (-) after the file is successfully uploaded?

A: A possible reason is that the data encoding format or the delimiter is incorrect so that the data uploaded to the table is incorrect. We recommend that you standardize the original data and upload the file again.

Q: Can I use the configuration of a table when running the tunnel upload command to upload data to MaxCompute?

A: You can write a shell script to run the tunnel upload command. Use /odpscmd/bin/odpscmd -e to run the script and paste the table configuration in the script.
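
For example, a hedged sketch of such a script (the table schema, project.table name, and file path below are placeholders) could be:

#!/bin/sh
FILE=/path/to/data.txt   # placeholder path to the local data file
# Create the table (the table configuration) and upload the file in a single CLT call.
/odpscmd/bin/odpscmd -e "create table if not exists project.table (data string); tunnel upload $FILE project.table;"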

Q: When using the Tunnel SDK to upload data to MaxCompute, I often find that query with the SELECT statement is slow and the performance of SQL statements is low. This may be because a large number of small files exist in MaxCompute, which affect the performance. How can I handle so many small files?

A: Reasons for generation of small files:

Apsara Distributed File System of MaxCompute stores data by block. A file whose size is smaller than the block size (64 MB by default) is called a small file.

Currently, the following small files may be generated in MaxCompute:

  1. A large number of small files generated during the Reduce computing process.
  2. Small files generated when Tunnel collects data.
  3. Temporary files generated during runtime and expired files in the recycle bin, including:
    1. TABLE_BACKUP: tables exceeding the retention period in the recycle bin.
    2. FUXI_JOB_TMP: temporary directories during runtime.
    3. TMP_TABLE: temporary tables generated during runtime.
    4. INSTANCE: logs stored in meta tables during runtime.
    5. LIFECYCLE: data tables or partitions exceeding the lifecycle.
    6. INSTANCEPROFILE: profile information after job submission and completion.
    7. VOLUME_TMP: data that is stored in Apsara Distributed File System but does not have meta information.
    8. TEMPRESOURCE: disposable or temporary resource files used by UDFs.
    9. FAILOVER: temporary files retained during system failover.

Impact of excessive small files:

  1. Affects the map instance performance: A small file corresponds to an instance by default. Too many small files cause resource waste and affect the overall runtime performance.
  2. Too many small files put pressure on Apsara Distributed File System and occupy too much space, even causing file system unavailability.

Command for checking the number of small files in a table:

desc extended <table_name>;

Methods for handling small files:

The method for handling small files varies depending on the generation causes:

  1. Small files generated during the reduce process

    Run INSERT OVERWRITE on the source table (or partition) or write data to the new table and delete the source table.

  2. Small files generated when Tunnel collects data
    1. When calling the Tunnel SDK, submit the buffered data after its size reaches 64 MB.
    2. When using the console, do not frequently upload small files. We recommend that you accumulate the files to a certain size and upload them at one time.
    3. If you upload a partition table, we recommend that you set a lifecycle for the partitions so that expired data is automatically cleared.
    4. Run INSERT OVERWRITE on the source table (or partition).
    5. Run ALTER TABLE tablename [PARTITION] MERGE SMALLFILES to merge small files:

      set odps.merge.cross.paths=true;
      set odps.merge.max.partition.count=100; --Ten partitions are optimized by default. In this example, 100 partitions are optimized.
      ALTER TABLE tablename [PARTITION] MERGE SMALLFILES;

  3. Temporary tables

    We recommend that you set a lifecycle for all temporary tables so that expired temporary tables are automatically moved to the recycle bin.

Q: When I run the tunnel upload command to upload data to MaxCompute, what should I do if I use a space as the column delimiter or filter data by using regular expressions?

A: The tunnel upload command does not support regular expressions. To use the space as the column delimiter or filter data through regular expressions, use MaxCompute UDFs.

Upload the data as a single column. The following is the original data in this example. The space and carriage return are used as the column and row delimiters, respectively. The data to be extracted is enclosed in quotation marks. Certain data, such as hyphens "-", needs to be filtered out. You can use regular expressions to complete this complex task.

10.21.17.2 [24/Jul/2018:00:00:00 +0800] - "GET https://help.aliyun.com/document_detail/73477.html" 200 0 81615 81615 "-" "iphone" - HIT - - 0_0_0 001 - - - -
10.17.5.23 [24/Jul/2018:00:00:00 +0800] - "GET https://help.aliyun.com/document_detail/73478.html" 206 0 49369 49369 "-" "huawei" - HIT - - 0_0_0 002 - - - -
10.24.7.16 [24/Jul/2018:00:00:00 +0800] - "GET https://help.aliyun.com/document_detail/73479.html" 206 0 83821 83821 "-" "vivo" - HIT - - 0_0_0 003 - - - -

To upload the data as a single column, first run the following command to create a single-column table in the project of MaxCompute for receiving data:

odps@ bigdata_DOC>create table userlog1(data string);

Upload the data with the column delimiter "\u0000", which does not exist in the data, so that each row is kept intact as a single column:

odps@ bigdata_DOC>tunnel upload C:\userlog.txt userlog1 -s false -fd "\u0000" -rd "\n";

After uploading the original data, use MaxCompute Studio (the IntelliJ IDEA plugin) to define a Python UDF. (You can also define a Java UDF.) To use Python UDFs, submit a ticket to apply for the permission.

Write the following code:

from odps.udf import annotate
from odps.udf import BaseUDTF
import re           #Import the regular expression module.
regex = '([(\d\.)]+) \[(.*?)\] - "(.*?)" (\d+) (\d+) (\d+) (\d+) "-" "(.*?)" - (.*?) - - (.*?) (.*?) - - - -'             #The regular expression used to parse each line
# line -> ip,date,request,code,c1,c2,c3,ua,q1,q2,q3
@annotate('string -> string,string,string,string,string,string,string,string,string,string,string')  #The number of output strings must match the data. In this example, there are 11 columns.
class ParseAccessLog(BaseUDTF):
    def process(self, line):
        try:
            t = re.match(regex, line).groups()
            self.forward(t[0], t[1], t[2], t[3], t[4], t[5], t[6], t[7], t[8], t[9], t[10])
        except:
            pass

After the UDF is defined, upload the code.


Set the MaxCompute project to the current project.


After the code is uploaded, register the UDF and enter the UDF name, which is ParseAccessLog in this example.


After the UDF is registered, use it to process the original data uploaded to the userlog1 table. Make sure that you enter the correct column name, which is data in this example. You can use common SQL syntax to create the userlog2 table to store the processed data.

odps@ bigdata_DOC>create table userlog2 as select ParseAccessLog(data) as (ip,date,request,code,c1,c2,c3,ua,q1,q2,q3) from userlog1;

After the processing is completed, you can view the target table that has been created. The data has been successfully divided into columns.

+----+------+---------+------+----+----+----+----+----+----+----+
| ip | date | request | code | c1 | c2 | c3 | ua | q1 | q2 | q3 |
+----+------+---------+------+----+----+----+----+----+----+----+
| 10.21.17.2 | 24/Jul/2018:00:00:00 +0800 | GET https://help.aliyun.com/document_detail/73477.html | 200  | 0  | 81615 | 81615 | iphone | HIT | 0_0_0 | 001 |
| 10.17.5.23 | 24/Jul/2018:00:00:00 +0800 | GET https://help.aliyun.com/document_detail/73478.html | 206  | 0  | 49369 | 49369 | huawei | HIT | 0_0_0 | 002 |
| 10.24.7.16 | 24/Jul/2018:00:00:00 +0800 | GET https://help.aliyun.com/document_detail/73479.html | 206  | 0  | 83821 | 83821 | vivo | HIT | 0_0_0 | 003 |
+----+------+---------+------+----+----+----+----+----+----+----+

Q: When I run the tunnel upload command to upload data to MaxCompute, how can multiple files in a folder be uploaded to the same table with each file in a separate partition?

A: You can write a shell script to enable this function. The following describes how to use the shell script to call CLT in Windows. You can use the shell script in Linux in a similar way.

The content of the shell script is as follows:

#!/bin/sh
C:/odpscmd_public/bin/odpscmd.bat  -e "create table user(data string) partitioned by (dt int);"  # Create a partitioned table named user with the partition key dt. In this example, CLT is installed in C:/odpscmd_public/bin/odpscmd.bat. You can modify the installation path as required.
dir=$(ls C:/userlog)  # Store the names of all files under the folder in the dir variable.
pt=0  # The pt variable indicates the partition value. Its initial value is 0 and it is incremented by 1 for each file, so each file is stored in a separate partition.
for i in $dir  # Loop over all files under the folder C:/userlog.
do
   let pt=pt+1  # Increment the partition value for each file.
   echo $i   # Display the file name.
   echo $pt  # Display the partition value.
   C:/odpscmd_public/bin/odpscmd.bat  -e "alter table user add partition (dt=$pt);tunnel upload C:/userlog/$i user/dt=$pt -s false -fd \"%\" -rd \"@\";"  # Use CLT to add the partition and upload the file to it.
done

The userlog1 and userlog2 files are used in this example; running the script uploads each file to its own partition.


After the upload is completed, check the table data in CLT.


Network Issues

Q: When I run the tunnel upload command to upload data to MaxCompute, why is the following error returned?

java.io.IOException: Error writing request body to server

A: This error message indicates that an exception occurred when you uploaded data to the server. A possible reason is that the network is disconnected or timed out during the upload.

  1. If your data source is not a local file and must be obtained from a database, the server has to wait for the data to arrive while writing, which can cause a timeout. Currently, the server considers that a timeout has occurred if no data is uploaded through the UploadSession within 600 seconds.
  2. If you use the endpoint of a public network to upload data, the timeout may occur due to unstable performance of the public network.

Solution:

  1. Obtain the data first, and then call the Tunnel SDK to upload it.
  2. A block can store 64 MB to 1 GB of data. We recommend that you upload no more than 10,000 data records at a time to prevent timeouts caused by retries. A maximum of 20,000 blocks can be attached to one session.

If you store your data on an ECS instance, see Configure endpoint to configure the appropriate endpoint to improve the speed and save costs.

Q: When I run the tunnel upload command to upload data to MaxCompute, why is the Internet Tunnel endpoint used even though I have configured the classic network endpoint?

A: Besides configuring endpoint, you must configure tunnel_endpoint in the odps_config.ini configuration file. For more information, see Configure endpoint. Currently, only the Tunnel endpoint of China (Shanghai) does not need to be configured.
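
For illustration, the relevant lines in odps_config.ini take roughly the following form; the region placeholder and exact URLs depend on your region and network type, so check the Configure endpoint documentation for the values that apply to you:

end_point=http://service.<your-region>.maxcompute.aliyun-inc.com/api
tunnel_endpoint=http://dt.<your-region>.maxcompute.aliyun-inc.com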

Q: Is throttling supported when I run the tunnel upload command to upload data to MaxCompute? If the upload rate is too high, the server I/O performance may be affected.

A: Currently, the tunnel upload command of MaxCompute does not support throttling. You can use the Tunnel SDK to implement throttling.

Billing Issues

Q: When I run the tunnel upload command to upload data to MaxCompute, is the bandwidth charged based on the data size before or after compression?

A: The bandwidth is charged based on the data size after compression.

To learn more about Alibaba Cloud MaxCompute, visit www.alibabacloud.com/product/maxcompute
