MaxCompute enables you to write data from a MaxCompute project to an OSS directory using OSS external tables through INSERT statements. This topic explains the process of writing data to OSS and includes examples of data write operations.
Background information
MaxCompute supports writing data from internal tables or processed from external tables to OSS:
-
Write data to OSS using a built-in text or open source data extractor
To write data in CSV, TSV, open source formats, or supported compression formats to OSS, create an OSS external table with MaxCompute's built-in text or open source data extractor and execute an INSERT statement.
Examples:
-
Example: Write data to a non-partitioned OSS directory using a built-in text extractor
-
Example: Write data to a partitioned OSS directory using a built-in text extractor
-
Example: Write data in a compression format to OSS using a built-in text extractor
-
Example: Write data to OSS using a built-in open source data extractor
-
-
Write data to OSS using a custom extractor
To write data in a custom format to OSS, create an OSS external table with a custom extractor and use it to execute an INSERT statement to write data to OSS.
For more information, see Example: Write data to OSS using a custom extractor.
-
Write data to OSS using the multipart upload feature
To write data in an open source format to OSS, create an OSS external table with a built-in open source data extractor, enable or disable the multipart upload feature, and execute an INSERT statement. This feature is controlled by the
odps.sql.unstructured.oss.commit.mode
property at the session or project level and is disabled by default. For more details on the multipart upload feature, see Multipart Upload.The behavior when the
odps.sql.unstructured.oss.commit.mode
property is set to different values is as follows:-
Value is False: Data written to OSS is stored in the LOCATION directory under the .odps folder. The .odps folder maintains a .meta file to ensure data consistency. The .odps content can only be processed by MaxCompute, and may not be correctly parsed by other engines, potentially causing errors.
-
Value is True: MaxCompute uses multipart upload to ensure data consistency with a
two-phase commit
approach. There will be no .odps directory or .meta file, ensuring compatibility with other data processing engines.
The
odps.sql.unstructured.oss.commit.mode
property can be set at the project and session levels:-
Configure the odps.sql.unstructured.oss.commit.mode parameter at the project level. The configuration takes effect on the project. Sample command:
setproject odps.sql.unstructured.oss.commit.mode =true;
-
Configure the odps.sql.unstructured.oss.commit.mode parameter at the session level. The configuration takes effect only on the current task. Sample command:
set odps.sql.unstructured.oss.commit.mode =true;
ImportantIn the
insert overwrite
operation scenario, if the job fails to run in extreme cases, an issue may occur where the old data is deleted but the new data is not written.Cause: A hardware failure occurs or the metadata fails to be updated. This issue rarely occurs. If this issue occurs, new data cannot be written to the destination table, and the deleted historical data cannot be restored because the delete operation in OSS cannot be rolled back.
Solutions:
-
If you overwrite the OSS external table based on the old data, such as
insert overwrite table T select * from table T;
, you need to back up the OSS data in advance. If the job fails to run, overwrite the OSS external table based on the backed-up old data. -
If the
insert overwrite
job can be resubmitted, resubmit the job when the job fails to run.
-
注意事项
OSS bandwidth is limited. If the traffic generated in a short period of time exceeds the bandwidth limit of the OSS instance, the bandwidth is used up, reducing the speed of data read or write operations to the OSS external table. For more information about OSS bandwidth, see CloseDoes OSS have bandwidth and QPS limits?.
Prerequisites
Before writing data to OSS, ensure the following requirements are met:
-
The Alibaba Cloud account or RAM user has the necessary access permissions for OSS.
For more authorization information, see STS Authorization of OSS.
-
(Optional) An OSS bucket, directories, and data files are prepared.
NoteMaxCompute can automatically create an OSS directory in the OSS console. For an SQL statement that is used to create an external table and a user-defined function (UDF), you can execute the SQL statement to read data from or write data to the external table and UDF. You can also manually create an OSS directory.
For more information about OSS buckets, directories, and data file upload operations, see Create a Bucket, Create a Directory, and Upload a File.
Example: Write data to a non-partitioned OSS directory by using a built-in text extractor
Read data from the OSS external table mc_oss_csv_external1, created as shown in Example: Creating an OSS External Table Using a Built-in Text Extractor - Non-partitioned Table, and write the data in CSV format to the OSS directory oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo1/output/
.
To add the image repository, follow these steps:
-
Log on to the MaxCompute client and execute the following command to create an OSS external table that maps to the target OSS directory.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_csv_external4 ( vehicleId INT, recordId INT, patientId INT, calls INT, locationLatitute DOUBLE, locationLongtitue DOUBLE, recordTime STRING, direction STRING ) STORED BY 'com.aliyun.odps.CsvStorageHandler' WITH serdeproperties ( 'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole' ) LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo1/output/';
For more information about the syntax and parameters for creating an OSS external table, see Create an OSS External Table.
-
Use the MaxCompute client to execute the INSERT OVERWRITE or
INSERT INTO
command on the external table to write data to OSS.INSERT INTO TABLE mc_oss_csv_external4 SELECT * FROM mc_oss_csv_external1;
After successful execution, the exported files can be viewed in the OSS directory. An output folder is created under the .odps folder, containing a .meta file and a folder for .csv files, such as
output/.odps/20210330*********/R2_1_0_0-0_TableSink1-0-.csv
.Note-
The .odps folder includes a .meta file, which is an additional macro data file created by MaxCompute to record valid data in the current folder. MaxCompute only parses the .meta file if the INSERT statement fails. If the job fails or is terminated prematurely, the INSERT OVERWRITE statement can be executed again.
-
When processing TSV or CSV files with a built-in MaxCompute extractor, the number of files generated in the OSS directory matches the number of concurrent SQL jobs.
-
If the
INSERT OVERWRITE ... SELECT ... FROM ...;
operation assigns 1,000 mappers to the source data table from_tablename, it will generate 1,000 TSV or CSV files. -
MaxCompute's flexible semantics and configurations can limit the number of files generated. To control the concurrency of mappers and the number of files, adjust the size of
odps.stage.mapper.split.size
. For reducers or joiners, adjust the number ofodps.stage.reducer.num
orodps.stage.joiner.num
respectively.
-
Example: Write data to a partitioned OSS directory by using a built-in text extractor
Read data from the OSS external table mc_oss_csv_external2, created in Example: Create an OSS External Table by Using a Built-in Text Extractor - Partitioned Table, and write the data in CSV format to the OSS directory oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo2/output/
.
To add the image repository, follow these steps:
-
To create an OSS external table that maps to the specified OSS directory, log on to the MaxCompute client and execute the following command.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_csv_external5 ( vehicleId INT, recordId INT, patientId INT, calls INT, locationLatitute DOUBLE, locationLongtitue DOUBLE, recordTime STRING ) PARTITIONED BY ( direction STRING ) STORED BY 'com.aliyun.odps.CsvStorageHandler' WITH serdeproperties ( 'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole' ) LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo2/output/';
For more information about the syntax and parameters for creating an OSS external table, see Create an OSS External Table.
-
Use the MaxCompute client to execute the
INSERT OVERWRITE
orINSERT INTO
command, which writes data to OSS from an external table.INSERT INTO TABLE mc_oss_csv_external5 PARTITION (direction) SELECT * FROM mc_oss_csv_external2;
Upon successful execution of the statement, you can view the exported files in the OSS directory. A subdirectory is created under the output folder based on the partition value specified in the INSERT statement. This partition subdirectory contains a .odps folder, which includes a .meta file and a folder for storing .csv files. For instance,
output/direction=N/.odps/20210330*********/R2_1_0_0-0_TableSink1-0-.csv
.Note-
The .odps folder includes a .meta file, which is an additional macro data file created by MaxCompute to record valid data in the current folder. MaxCompute only parses the .meta file if the INSERT statement fails. If the job fails or is terminated prematurely, the INSERT OVERWRITE statement can be executed again.
-
When processing TSV or CSV files with a built-in MaxCompute extractor, the number of files generated in the OSS directory matches the number of concurrent SQL jobs.
-
If the
INSERT OVERWRITE ... SELECT ... FROM ...;
operation assigns 1,000 mappers to the source data table from_tablename, it will generate 1,000 TSV or CSV files. -
MaxCompute's flexible semantics and configurations can limit the number of files generated. To control the concurrency of mappers and the number of files, adjust the size of
odps.stage.mapper.split.size
. For reducers or joiners, adjust the number ofodps.stage.reducer.num
orodps.stage.joiner.num
respectively.
-
Example: Write data in a compression format to OSS by using a built-in text extractor
Read data from the OSS external table mc_oss_csv_external1, created as shown in Example: Create an OSS External Table by Using a Built-in Text Extractor - Non-partitioned Table, and write the data in GZIP compression format to the OSS directory oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo3/output/
.
To add the image repository, follow these steps:
-
Log on to the MaxCompute client and execute the command below to create an OSS external table mapping to the specified OSS directory.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_csv_external6 ( vehicleId INT, recordId INT, patientId INT, calls INT, locationLatitute DOUBLE, locationLongtitue DOUBLE, recordTime STRING, direction STRING ) STORED BY 'com.aliyun.odps.CsvStorageHandler' WITH serdeproperties ( 'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole', 'odps.text.option.gzip.input.enabled'='true', 'odps.text.option.gzip.output.enabled'='true' ) LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo3/output/';
For more information about the syntax and parameters for creating an OSS external table, see Create an OSS External Table.
-
Use the MaxCompute client to execute the
INSERT OVERWRITE
orINSERT INTO
command on the external table to write data to OSS.INSERT INTO TABLE mc_oss_csv_external6 SELECT * FROM mc_oss_csv_external1;
After a successful execution, you can view the exported files in the OSS directory. An output folder is created under the .odps folder, containing a .meta file and a folder with .gz files. For instance,
output/.odps/20220413*********/M1_0_0-0_TableSink1-0-.csv.gz
.Note-
The .odps folder includes a .meta file, which is an additional macro data file created by MaxCompute to record valid data in the current folder. MaxCompute only parses the .meta file if the INSERT statement fails. If the job fails or is terminated prematurely, the INSERT OVERWRITE statement can be executed again.
-
When processing TSV or CSV files with a built-in MaxCompute extractor, the number of files generated in the OSS directory matches the number of concurrent SQL jobs.
-
If the
INSERT OVERWRITE ... SELECT ... FROM ...;
operation assigns 1,000 mappers to the source data table from_tablename, it will generate 1,000 TSV or CSV files. -
MaxCompute's flexible semantics and configurations can limit the number of files generated. To control the concurrency of mappers and the number of files, adjust the size of
odps.stage.mapper.split.size
. For reducers or joiners, adjust the number ofodps.stage.reducer.num
orodps.stage.joiner.num
respectively.
-
Example: Write data to OSS by using a built-in open source data extractor
Read data from the OSS external table mc_oss_csv_external1, created as shown in Example: Create an OSS External Table Using a Built-in Text Extractor - Non-partitioned Table, and write the data in ORC format to the OSS directory oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo4/output/
.
To add the image repository, follow these steps:
-
Log on to the MaxCompute client and execute the command below to create an OSS external table that corresponds to the target OSS directory.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_orc_external ( vehicleId INT, recordId INT, patientId INT, calls INT, locationLatitute DOUBLE, locationLongtitue DOUBLE, recordTime STRING, direction STRING ) STORED AS orc LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo4/output/';
For more details on the syntax and parameters for creating an OSS external table, see Create an OSS External Table.
-
Use the MaxCompute client to execute the
INSERT OVERWRITE
orINSERT INTO
command on the external table to write data to OSS.INSERT INTO TABLE mc_oss_orc_external SELECT * FROM mc_oss_csv_external1;
Upon successful execution of the statement, you can check the exported files in the OSS directory. An output folder will be created under the .odps folder, containing a .meta file and a folder for the ORC files. For instance,
output/.odps/20220413*********/M1_0_0-0_TableSink1
.Note-
The .odps folder includes a .meta file, which is an additional macro data file created by MaxCompute to record valid data in the current folder. MaxCompute only parses the .meta file if the INSERT statement fails. If the job fails or is terminated prematurely, the INSERT OVERWRITE statement can be executed again.
-
When processing TSV or CSV files with a built-in MaxCompute extractor, the number of files generated in the OSS directory matches the number of concurrent SQL jobs.
-
If the
INSERT OVERWRITE ... SELECT ... FROM ...;
operation assigns 1,000 mappers to the source data table from_tablename, it will generate 1,000 TSV or CSV files. -
MaxCompute's flexible semantics and configurations can limit the number of files generated. To control the concurrency of mappers and the number of files, adjust the size of
odps.stage.mapper.split.size
. For reducers or joiners, adjust the number ofodps.stage.reducer.num
orodps.stage.joiner.num
respectively.
-
Example: Write data to OSS by using a custom extractor
Following the development of a custom extractor as demonstrated in Example: Create an OSS External Table by Using a Custom Extractor, this section describes the process for MaxCompute to write data to OSS.
To add the image repository, complete these steps:
-
Log on to the MaxCompute client and execute the command below to create an OSS external table linked to the desired OSS directory.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_external ( vehicleId INT, recordId INT, patientId INT, calls INT, locationLatitute DOUBLE, locationLongtitue DOUBLE, recordTime STRING, direction STRING ) STORED BY 'com.aliyun.odps.udf.example.text.TextStorageHandler' WITH serdeproperties ( 'delimiter'='|', 'odps.properties.rolearn'='acs:ram::xxxxxxxxxxxxx:role/aliyunodpsdefaultrole' ) LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/SampleData/output' USING 'javatest-1.0-SNAPSHOT.jar';
For more information about the syntax and parameters for creating an OSS external table, see Create an OSS External Table.
-
Use the MaxCompute client to execute either the
INSERT OVERWRITE
orINSERT INTO
command on the external table to write data to OSS.INSERT INTO TABLE mc_oss_external SELECT * FROM mc_oss_csv_external1;
Upon successful execution of the statement, the exported files can be found in the OSS directory. An output folder is created within the .odps folder, containing a .meta file and a directory for TEXT files, such as
output/.odps/20220413*********/M1_0_0-0_TableSink1-0
.Note-
The .odps folder includes a .meta file, which is an additional macro data file created by MaxCompute to record valid data in the current folder. MaxCompute only parses the .meta file if the INSERT statement fails. If the job fails or is terminated prematurely, the INSERT OVERWRITE statement can be executed again.
-
When processing TSV or CSV files with a built-in MaxCompute extractor, the number of files generated in the OSS directory matches the number of concurrent SQL jobs.
-
If the
INSERT OVERWRITE ... SELECT ... FROM ...;
operation assigns 1,000 mappers to the source data table from_tablename, it will generate 1,000 TSV or CSV files. -
MaxCompute's flexible semantics and configurations can limit the number of files generated. To control the concurrency of mappers and the number of files, adjust the size of
odps.stage.mapper.split.size
. For reducers or joiners, adjust the number ofodps.stage.reducer.num
orodps.stage.joiner.num
respectively.
-