Configure parameters for Vertica data synchronization - DataWorks

Vertica is a column-oriented database that uses a massively parallel processing (MPP) architecture. The Vertica data source provides bidirectional channels to read from and write to Vertica. This topic describes the data synchronization capabilities of the Vertica data source in DataWorks.

Supported versions

Vertica Reader connects to Vertica through the Vertica JDBC driver. Make sure the driver version is compatible with your Vertica service. DataWorks uses JDBC driver version 7.1.2:

<dependency>
  <groupId>com.vertica</groupId>
  <artifactId>vertica-jdbc</artifactId>
  <version>7.1.2</version>
</dependency>

Limits

Vertica data sources work only with Serverless resource groups (recommended) and exclusive resource groups for Data Integration.
Vertica Writer does not support the writeMode parameter.
Tasks can be configured only in the code editor.

Supported field types

Common Vertica data types are supported: integer, float, string, and time. Support for advanced data types is limited.

Add a data source

Before developing a synchronization task, add the Vertica data source to DataWorks. For instructions, see Data source management.

Parameter descriptions are also available in the DataWorks console when you add the data source.

Develop a data synchronization task

Vertica data synchronization tasks must be configured in the code editor. The following sections cover the script format, parameters, and examples.

Configure an offline synchronization task for a single table

For the configuration procedure, see Configure a task in the code editor.
For all parameters and script examples, see the Appendix: Script examples and parameter descriptions section.

Appendix: Script examples and parameter descriptions

Configure a batch synchronization task by using the code editor

The script must follow the unified script format for batch synchronization tasks. For format requirements, see Configure a task in the code editor.

Reader script example

{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "vertica",
      "parameter": {
        "datasource": "",        // The data source name.
        "column": [              // The columns to read.
          "id",
          "name"
        ],
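        "where": "",             // The filter condition. Leave empty to read all rows.
        "splitPk": "id",         // The integer column used to partition data for concurrent reads.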
        "connection": [
          {
            "table": [           // The source table name.
              "table"
            ]
          }
        ]
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {
        "print": false,
        "fieldDelimiter": ","
      },
      "name": "Writer",
      "category": "writer"
    }
  ],
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  },
  "setting": {
    "errorLimit": {
      "record": "0"              // Maximum number of error records allowed.
    },
    "speed": {
      "throttle": true,          // Set to true to enable rate limiting; set to false to disable.
      "concurrent": 1,           // Number of concurrent jobs.
      "mbps": "12"               // Maximum transmission rate. 1 mbps = 1 MB/s.
    }
  }
}

Reader script parameters

Parameter	Description	Required	Default value	Example
`datasource`	The data source name. Must match the name of a data source that has been added to DataWorks.	Yes	None	`my_vertica_source`
`table`	The source tables to read from, specified as a JSON array. Multiple tables can be read simultaneously, but all tables must have the same schema. Vertica Reader does not verify schema consistency. The `table` parameter must be placed inside the `connection` block.	Yes	None	`["orders", "order_items"]`
`column`	The columns to read from the source tables, specified as a JSON array. Use `["*"]` to read all columns. Supports column pruning, column reordering, and constants. Cannot be left blank.	Yes	None	`["id", "name", "created_at"]`
`splitPk`	The column used to partition data for concurrent reads. Use the primary key for even distribution and to avoid hot spots. Only integer columns are supported; string, float, and date columns are not. If left blank, data is read through a single channel without partitioning.	No	None	`"id"`
`where`	A filter condition. Vertica Reader uses the `column`, `table`, and `where` parameters to construct the SQL query. For incremental synchronization of daily data, set this to `gmt_create > $bizdate`. If not configured, all data in the table is read.	No	None	`"gmt_create > $bizdate"`
`querySql`	A custom SQL query for advanced filtering scenarios where `where` alone is insufficient. When `querySql` is configured, Vertica Reader ignores the `table`, `column`, and `where` parameters.	No	None	`"SELECT id, name FROM orders WHERE status = 'active'"`
`fetchSize`	The number of records fetched from the database in each batch. Increasing this value reduces network round trips and improves extraction performance. Setting `fetchSize` above 2048 may cause an out-of-memory (OOM) error.	No	1024	`512`
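
The parameters above can be combined into a single reader step. The following fragment is a minimal sketch of an incremental read from two tables that share a schema. The data source name is hypothetical, the table and column names are illustrative, and placing `fetchSize` directly under `parameter` is an assumption, because the table above does not state its position. The `//` comments follow the annotation style of the reader script example and are explanatory only.

```json
{
  "stepType": "vertica",
  "parameter": {
    "datasource": "my_vertica_source",        // Hypothetical data source name.
    "column": ["id", "name", "gmt_create"],   // Illustrative column names.
    "where": "gmt_create > $bizdate",         // Read only rows newer than the business date.
    "splitPk": "id",                          // Integer column used to partition concurrent reads.
    "fetchSize": 512,                         // Assumption: configured at the parameter level.
    "connection": [
      {
        "table": ["orders", "order_items"]    // Both tables must have the same schema; the reader does not verify this.
      }
    ]
  },
  "name": "Reader",
  "category": "reader"
}
```

Because `splitPk` is set, the read can be partitioned across channels when `concurrent` in the `speed` settings is greater than 1. If `splitPk` is removed, the same data is read through a single channel.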

Writer script example

{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "vertica",
      "parameter": {
        "datasource": "data_source_name",
        "column": [                    // The destination columns.
          "id",
          "name"
        ],
        "connection": [
          {
            "table": [                 // The destination table name.
              "vertica_table"
            ]
          }
        ],
        "preSql": [                    // SQL to run before the write task starts.
          "delete from @table where db_id = -1"
        ],
        "postSql": [                   // SQL to run after the write task completes.
          "update @table set db_modify_time = now() where db_id = 1"
        ]
      },
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "0"                    // Maximum number of error records allowed.
    },
    "speed": {
      "throttle": true,                // Set to true to enable rate limiting; set to false to disable.
      "concurrent": 1,                 // Number of concurrent jobs.
      "mbps": "12"                     // Maximum transmission rate. 1 mbps = 1 MB/s.
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}

Writer script parameters

Parameter	Description	Required	Default value	Example
`datasource`	The data source name. Must match the name of a data source that has been added to DataWorks.	Yes	None	`my_vertica_dest`
`jdbcUrl`	The JDBC URL of the destination Vertica database, specified inside the `connection` block. Only one JDBC URL can be configured for a database. Writing to multiple primary databases of the same database instance, as in bidirectional data import scenarios, is not supported.	Yes	None	`jdbc:vertica://127.0.0.1:5433/database`
`username`	The username for authenticating to the data source.	Yes	None	`dbadmin`
`password`	The password for the specified username.	Yes	None	`********`
`table`	The destination tables to write to, specified as a JSON array inside the `connection` block.	Yes	None	`["vertica_table"]`
`column`	The destination columns to write data to, specified as a JSON array.	Yes	None	`["id", "name", "age"]`
`preSql`	A SQL statement to run before data is written to the destination table. Use `@table` as a placeholder for the table name; it is replaced with the actual table name at runtime.	No	None	`"delete from @table where db_id = -1"`
`postSql`	A SQL statement to run after data is written to the destination table.	No	None	`"update @table set db_modify_time = now() where db_id = 1"`
`batchSize`	The number of records committed in each batch. Larger values reduce network round trips and improve throughput, but setting this too high may cause an OOM error.	No	1024	`512`
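
As an illustration of how these parameters fit together, the following fragment sketches a writer step that sets `batchSize` and relies on the `@table` placeholder. The data source name is hypothetical, and placing `batchSize` directly under `parameter` is an assumption, because the table above does not state its position. The `//` comments follow the annotation style of the writer script example and are explanatory only.

```json
{
  "stepType": "vertica",
  "parameter": {
    "datasource": "my_vertica_dest",          // Hypothetical data source name.
    "column": ["id", "name", "age"],          // Destination columns, as a JSON array.
    "batchSize": 512,                         // Assumption: configured at the parameter level.
    "connection": [
      {
        "table": ["vertica_table"]
      }
    ],
    "preSql": [
      "delete from @table where db_id = -1"   // With the table above, @table resolves to vertica_table.
    ],
    "postSql": [
      "update @table set db_modify_time = now() where db_id = 1"
    ]
  },
  "name": "Writer",
  "category": "writer"
}
```

A smaller `batchSize` lowers memory pressure at the cost of more round trips, so it is a reasonable first adjustment if a task fails with an OOM error.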