This topic describes the data types and parameters that OSS Writer supports and how to configure it by using the codeless user interface (UI) and code editor.

Background information

OSS Writer allows you to write one or more CSV-like files to Object Storage Service (OSS). The number of files written to OSS depends on the number of concurrent threads and the total number of files to be synchronized.
Note: You must configure an OSS connection before you configure OSS Writer.

OSS Writer can write files that store logical two-dimensional tables, such as CSV files that store text data, to OSS. For more information about OSS, see What is OSS?

OSS Writer allows you to convert data obtained from a Data Integration reader to files and write the files to OSS. The OSS files store unstructured data only. OSS Writer supports the following features:
  • Writes only files that store text data. The text data must be logical two-dimensional tables.
  • Writes CSV-like files with custom delimiters.
  • Uses concurrent threads to write files. Each thread writes a file.
  • Supports file rotation. OSS Writer starts to write data to a new file when the size of the current file or the number of rows in the current file exceeds a specific value. For a sample configuration, see the sketch after these lists.
OSS Writer does not support the following features:
  • Using concurrent threads to write a single file.
  • Distinguishing between data types. OSS does not distinguish between data types. Therefore, OSS Writer writes all data as strings to files in OSS.
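
For reference, the parameter section of a writer step that relies on file rotation might look like the following sketch. This is a minimal illustration, not a complete job configuration: the connection name my_oss_connection and all values are placeholders, each parameter is described in the next section, and the number of concurrent threads is set in the speed section of the job settings, as shown in the code editor example later in this topic.
{
    "datasource":"my_oss_connection",// A placeholder connection name.
    "object":"datax",// Each concurrent thread writes its own files whose names start with datax.
    "writeMode":"truncate",// Existing objects whose names start with datax are deleted before new files are written.
    "maxFileSize":1024// A thread starts to write a new file after the current file reaches about 1,024 MB. Rotated files get suffixes such as _1 and _2.
}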

Parameters

OSS Writer supports the following parameters. A sample configuration that uses these parameters is provided after the parameter descriptions.
datasource: The name of the connection. It must be the same as the name of the connection that you have added. You can add connections in the code editor. Required: Yes. No default value.
object: The name prefix of the files to be written to OSS as objects. OSS simulates the directory effect by adding delimiters to object names. You can set the object parameter based on the following rules:
  • "object": "datax": The names of the written files start with datax, followed by a random string as the suffix.
  • "object": "cdo/datax": The names of the written files start with cdo/datax, followed by a random string as the suffix. OSS uses forward slashes (/) in object names to simulate the directory effect.

If you do not want a random universally unique identifier (UUID) to be appended as the suffix, we recommend that you set the writeSingleObject parameter to true.

Required: Yes. No default value.
writeMode: The mode in which OSS Writer writes the files to OSS. Valid values:
  • truncate: deletes all existing objects with the specified object name prefix before writing files to OSS. For example, if you set the object parameter to abc, all objects whose names start with abc are deleted.
  • append: writes all files and ensures that the actual file names do not conflict with those of existing objects by suffixing the file names with random UUIDs. For example, if you set the object parameter to DI, the actual names of the files written to OSS are in the following format: DI_****_****_****.
  • nonConflict: returns an error message if an object with the specified object name exists. For example, if you set the object parameter to abc and the object named abc123 exists, an error message is returned.
Required: Yes. No default value.
writeSingleObject: Specifies whether OSS Writer writes a single file to OSS at a time. Valid values:
  • true: OSS Writer writes a single file to OSS at a time.
  • false: OSS Writer writes multiple files to OSS at a time.
Required: No. Default value: false.
fileFormat: The format in which the files are written to OSS. Valid values: csv and text.
  • If a file is written as a CSV file, the file strictly follows the CSV specification. If the data contains the column delimiter, the data is enclosed in double quotation marks ("").
  • If a file is written as a text file, the data is separated by the column delimiter. If the data contains the column delimiter, the column delimiter is not escaped.
Required: No. Default value: text.
fieldDelimiter: The column delimiter that is used in the files to be written to OSS. Required: No. Default value: comma (,).
encoding: The encoding format of the files to be written to OSS. Required: No. Default value: utf-8.
nullFormat: The string that represents null. No standard string can represent null in text files. Therefore, Data Integration provides the nullFormat parameter to define the string that represents a null pointer. For example, if you set nullFormat to "null", Data Integration uses the string null to represent a null value. Required: No. No default value.
header (advanced parameter, which cannot be set on the codeless UI): The table header of the files to be written to OSS, for example, ['id', 'name', 'age']. Required: No. No default value.
maxFileSize (advanced parameter, which cannot be set on the codeless UI): The maximum size of a single file that can be written to OSS. Unit: MB. File rotation based on this maximum size is similar to the log rotation of Log4j: when the current file exceeds this size, OSS Writer starts to write a new file. When a file is uploaded to OSS in multiple parts, the minimum size of a part is 10 MB, which is also the minimum granularity for file rotation. If you set the maxFileSize parameter to a value less than 10 MB, files are still rotated at 10 MB. Each call of the InitiateMultipartUploadRequest operation supports writing up to 10,000 parts.

If file rotation occurs, suffixes such as _1, _2, and _3 are appended to the new file names, which consist of the file name prefix and a random UUID.

Required: No. Default value: 100,000 MB.
suffix (advanced parameter, which cannot be set on the codeless UI): The file name extension of the files to be written to OSS. For example, if you set the suffix parameter to .csv, the final name of a file written to OSS is in the format of fileName****.csv. Required: No. No default value.
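
The following writer step shows how these parameters fit together. This is an illustrative sketch, not a recommended configuration: the connection name my_oss_connection and all parameter values are placeholders that you must replace with your own settings.
{
    "stepType":"oss",
    "parameter":{
        "datasource":"my_oss_connection",// A placeholder connection name.
        "object":"cdo/datax",// The file names start with cdo/datax, which simulates a cdo/ directory in OSS.
        "writeMode":"append",// Random UUID suffixes keep the new file names from conflicting with existing objects.
        "writeSingleObject":"false",
        "fileFormat":"csv",
        "fieldDelimiter":",",
        "encoding":"utf-8",
        "nullFormat":"null",// The string null represents a null value in the written files.
        "header":["id", "name", "age"],// Advanced parameter: the table header of the written files.
        "suffix":".csv"// Advanced parameter: the file name extension of the written files.
    },
    "name":"Writer",
    "category":"writer"
}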

Configure OSS Writer by using the codeless UI

  1. Configure the connections.
    Configure the source and destination connections for the sync node. The following table describes the parameters for the destination (OSS).
    Parameter Description
    Data source: The datasource parameter in the preceding parameter description. Select a connection type, and then select the name of a connection that you have configured in DataWorks.
    Object prefix: The object parameter in the preceding parameter description. Enter the path of the directory in which to store the files. Do not include the bucket name in the path.
    Text Type: The fileFormat parameter in the preceding parameter description. Valid values: csv and text.
    Column separator: The fieldDelimiter parameter in the preceding parameter description. The default delimiter is a comma (,).
    Encoding: The encoding parameter in the preceding parameter description. Default value: UTF-8.
    null value: The nullFormat parameter in the preceding parameter description. Enter the string that represents null. Null values from the source are written to OSS as this string.
    Time format: The format in which data of the DATE type is serialized in the files, for example, "dateFormat": "yyyy-MM-dd".
    Prefix conflict: The action to take if an object with the specified name prefix already exists. The system can replace the existing object with the new object, insert the new object, or return an error message.
  2. Configure field mapping. This is equivalent to setting the column parameter in the code editor. Fields in the source table on the left have a one-to-one mapping with fields in the destination table on the right.
    GUI element Description
    The same name mapping: Click The same name mapping to establish a mapping between fields that have the same name. Note that the data types of the fields must match.
    Peer mapping: Click Peer mapping to establish a mapping between fields in the same row. Note that the data types of the fields must match.
    Unmap: Click Unmap to remove the mappings that have been established.
  3. Configure channel control policies.
    Parameter Description
    Maximum number of concurrent tasks expected: The maximum number of concurrent threads that the sync node uses to read data from the source data store and write data to the destination data store. You can configure the concurrency for the node on the codeless UI.
    Synchronization rate: Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and specify a maximum transmission rate to prevent heavy read workloads on the source data store. We recommend that you enable bandwidth throttling and set the maximum transmission rate to an appropriate value.
    The number of error records exceeds: The maximum number of dirty data records allowed.

Configure OSS Writer by using the code editor

The following example shows how to configure a sync node to write files to OSS. For more information, see Create a sync node by using the code editor.
{
    "type":"job",
    "version":"2.0",
    "steps":[
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"oss", // The writer type.
            "parameter":{
                "nullFormat":"",//The string that represents null.
                "dateFormat":"",// The format in which the data of the DATE type is serialized in an object.
                "datasource":"",// The connection name.
                "writeMode":"",// The write mode.
                "writeSingleObject":"false", // Specifies whether to write a single file to OSS at a time. A value of false indicates that OSS Writer writes multiple files to OSS at a time.
                "encoding":"",// The encoding format.
                "fieldDelimiter":","// The column delimiter.
                "fileFormat":"",// The format in which files are written to OSS.
                "Object":[]// The name prefix of the files to be written to OSS as objects.
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed.
        },
        "speed":{
            "throttle":false,// Specifies whether to enable bandwidth throttling. A value of false indicates that the bandwidth is not throttled. A value of true indicates that the bandwidth is throttled. The maximum transmission rate takes effect only if you set this parameter to true.
            "concurrent":1 // The maximum number of concurrent threads.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

Write ORC or Parquet files to OSS

OSS Writer uses HDFS Writer to write ORC or Parquet files to OSS. In addition to the parameters described in the preceding sections, OSS Writer provides the extended parameters path and fileFormat for this purpose. For more information about the extended parameters, see Configure HDFS Writer.

The following examples show how to configure a sync node to write ORC or Parquet files to OSS.
  • The following example shows how to configure a sync node to write ORC files to OSS:
    {
          "stepType": "oss",
          "parameter": {
            "datasource": "",
            "fileFormat": "orc",
            "path": "/tests/case61",
            "fileName": "orc",
            "writeMode": "append",
            "column": [
              {
                "name": "col1",
                "type": "BIGINT"
              },
              {
                "name": "col2",
                "type": "DOUBLE"
              },
              {
                "name": "col3",
                "type": "STRING"
              }
            ],
            "writeMode": "append",
            "fieldDelimiter": "\t",
            "compress": "NONE",
            "encoding": "UTF-8"
          }
        }
  • The following example shows how to configure a sync node to write Parquet files to OSS:
    {
          "stepType": "oss",
          "parameter": {
            "datasource": "",
            "fileFormat": "parquet",
            "path": "/tests/case61",
            "fileName": "test",
            "writeMode": "append",
            "fieldDelimiter": "\t",
            "compress": "SNAPPY",
            "encoding": "UTF-8",
            "parquetSchema": "message test { required int64 int64_col;\n required binary str_col (UTF8);\nrequired group params (MAP) {\nrepeated group key_value {\nrequired binary key (UTF8);\nrequired binary value (UTF8);\n}\n}\nrequired group params_arr (LIST) {\n  repeated group list {\n    required binary element (UTF8);\n  }\n}\nrequired group params_struct {\n  required int64 id;\n required binary name (UTF8);\n }\nrequired group params_arr_complex (LIST) {\n  repeated group list {\n    required group element {\n required int64 id;\n required binary name (UTF8);\n}\n  }\n}\nrequired group params_complex (MAP) {\nrepeated group key_value {\nrequired binary key (UTF8);\nrequired group value {\n  required int64 id;\n required binary name (UTF8);\n  }\n}\n}\nrequired group params_struct_complex {\n  required int64 id;\n required group detail {\n  required int64 id;\n required binary name (UTF8);\n  }\n  }\n}",
            "dataxParquetMode": "fields"
          }
        }