
Configure RDBMSReader

Last Updated: Feb 23, 2018

Overview

The RDBMSReader plug-in reads data from relational database management systems (RDBMS). At the implementation level, RDBMSReader connects to a remote RDBMS database through JDBC and runs SELECT SQL statements to retrieve data from it. Currently, it supports reading from databases including DM, DB2, PPAS, and Sybase. RDBMSReader is a generic plug-in for reading from relational databases, and you can add read support for any relational database by registering its database driver.

How it works

RDBMSReader connects to a remote RDBMS database through the JDBC connector. The SELECT SQL query statements are generated and sent to the remote RDBMS database based on your configuration. Then, the SQL statements are run and the returned results are assembled into abstract datasets using the custom data types of DataX. Datasets are passed to the downstream writer for processing.

RDBMSReader assembles the Table, Column, and Where information configured by the user into SQL statements to send to the RDBMS database, and sends the querySql information configured by the user to the RDBMS database directly.
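The assembly described above can be sketched as follows. This is a minimal, hypothetical Python illustration of the query-building logic, not the actual DataX implementation; the `build_queries` helper is invented for this sketch, and its `parameter` layout mirrors the reader configuration samples in this document.

```python
# Hypothetical sketch of how table/column/where configuration is
# assembled into SELECT statements (not the actual DataX code).

def build_queries(parameter):
    """Build one SELECT per configured table from the column, table,
    and where items. If querySql is configured, it is sent as-is and
    the table/column/where items are ignored for that connection."""
    queries = []
    for conn in parameter["connection"]:
        if "querySql" in conn:
            queries.extend(conn["querySql"])
            continue
        columns = ", ".join(parameter["column"])
        where = parameter.get("where")
        for table in conn["table"]:
            sql = f"SELECT {columns} FROM {table}"
            if where:
                sql += f" WHERE {where}"
            queries.append(sql)
    return queries

parameter = {
    "column": ["id", "name"],
    "where": "1 = 1",
    "connection": [{"table": ["table"]}],
}
print(build_queries(parameter))
# SELECT id, name FROM table WHERE 1 = 1
```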

Features

Configuration samples

  • Configure a job to synchronously extract data from an RDBMS database:

```json
{
  "job": {
    "setting": {
      "speed": {
        "byte": 1048576
      },
      "errorLimit": {
        "record": 0,
        "percentage": 0.02
      }
    },
    "content": [
      {
        "reader": {
          "name": "rdbmsreader",
          "parameter": {
            "username": "xxx",
            "password": "xxx",
            "column": [
              "id",
              "name"
            ],
            "splitPk": "pk",
            "connection": [
              {
                "table": [
                  "table"
                ],
                "jdbcUrl": [
                  "jdbc:dm://ip:port/database"
                ]
              }
            ],
            "fetchSize": 1024,
            "where": "1 = 1"
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "print": true
          }
        }
      }
    ]
  }
}
```
  • Configure a job to synchronize tasks from a database with custom SQLs to ODPS:

```json
{
  "job": {
    "setting": {
      "speed": {
        "byte": 1048576
      },
      "errorLimit": {
        "record": 0,
        "percentage": 0.02
      }
    },
    "content": [
      {
        "reader": {
          "name": "rdbmsreader",
          "parameter": {
            "username": "xxx",
            "password": "xxx",
            "column": [
              "id",
              "name"
            ],
            "splitPk": "pk",
            "connection": [
              {
                "querySql": [
                  "SELECT * from dual"
                ],
                "jdbcUrl": [
                  "jdbc:dm://ip:port/database"
                ]
              }
            ],
            "fetchSize": 1024,
            "where": "1 = 1"
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "print": true
          }
        }
      }
    ]
  }
}
```

Parameter description

  • jdbcUrl

    • Description: JDBC connection information for the remote database. The format of jdbcUrl follows the official RDBMS specification, and URL attachment control information may be appended. The JDBC format varies between databases, and DataX selects the appropriate database driver for reading data based on the specific format.

      • DM: jdbc:dm://ip:port/database
      • DB2: jdbc:db2://ip:port/database
      • PPAS: jdbc:edb://ip:port/database

      How to add a new supported database to rdbmsreader:

      • Go to the rdbmsreader directory ${DATAX_HOME}/plugin/reader/rdbmsreader, where ${DATAX_HOME} is the home directory of DataX.
      • Under the rdbmsreader directory, you can find the plugin.json configuration file. Use this file to register your specific database driver in the drivers array. The RDBMSReader plug-in dynamically selects the appropriate database driver to connect to the database when running the job.

```json
{
  "name": "rdbmsreader",
  "class": "com.alibaba.datax.plugin.reader.rdbmsreader.RdbmsReader",
  "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.",
  "developer": "alibaba",
  "drivers": [
    "dm.jdbc.driver.DmDriver",
    "com.ibm.db2.jcc.DB2Driver",
    "com.sybase.jdbc3.jdbc.SybDriver",
    "com.edb.Driver"
  ]
}
```

      • The rdbmsreader directory contains the libs sub-directory, under which you must put your specific database driver JAR.

```
$ tree
.
|-- libs
|   |-- Dm7JdbcDriver16.jar
|   |-- commons-collections-3.0.jar
|   |-- commons-io-2.4.jar
|   |-- commons-lang3-3.3.2.jar
|   |-- commons-math3-3.1.1.jar
|   |-- datax-common-0.0.1-SNAPSHOT.jar
|   |-- datax-service-face-1.0.23-20160120.024328-1.jar
|   |-- db2jcc4.jar
|   |-- druid-1.0.15.jar
|   |-- edb-jdbc16.jar
|   |-- fastjson-1.1.46.sec01.jar
|   |-- guava-r05.jar
|   |-- hamcrest-core-1.3.jar
|   |-- jconn3-1.0.0-SNAPSHOT.jar
|   |-- logback-classic-1.0.13.jar
|   |-- logback-core-1.0.13.jar
|   |-- plugin-rdbms-util-0.0.1-SNAPSHOT.jar
|   `-- slf4j-api-1.7.10.jar
|-- plugin.json
|-- plugin_job_template.json
`-- rdbmsreader-0.0.1-SNAPSHOT.jar
```
  • Required: Yes

  • Default value: None
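To illustrate how a driver can be chosen from the jdbcUrl, the sketch below matches the URL against a prefix table. This is a hypothetical Python illustration only: the actual plug-in runs on the JVM and relies on JDBC's DriverManager to pick whichever registered driver accepts the URL, and the prefix strings here (including the Sybase one) are assumptions, not part of DataX.

```python
# Hypothetical driver selection by JDBC URL prefix (illustrative only;
# the real plug-in delegates this to JDBC's DriverManager).
DRIVERS = {
    "jdbc:dm:": "dm.jdbc.driver.DmDriver",
    "jdbc:db2:": "com.ibm.db2.jcc.DB2Driver",
    "jdbc:sybase:": "com.sybase.jdbc3.jdbc.SybDriver",  # assumed prefix
    "jdbc:edb:": "com.edb.Driver",
}

def select_driver(jdbc_url):
    """Return the driver class registered for the URL's scheme prefix."""
    for prefix, driver in DRIVERS.items():
        if jdbc_url.startswith(prefix):
            return driver
    raise ValueError(f"no registered driver accepts {jdbc_url}")

print(select_driver("jdbc:dm://ip:port/database"))
# dm.jdbc.driver.DmDriver
```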

  • username

    • Description: Username for the data source.

    • Required: Yes

    • Default value: None

  • password

    • Description: Password corresponding to a specified username for the data source.

    • Required: Yes

    • Default value: None

  • table

    • Description: The table selected for synchronization.

    • Required: Yes

    • Default value: None

  • column

    • Description: The set of column names to be synchronized from the configured table, described as a JSON array. An asterisk indicates all columns, for example, ["*"].

      Column pruning is supported, which means you can select some columns to export.

      Change of column order is supported, which means you can export the columns in an order different from the schema order of the table.

      Constant configuration is supported in JSON format: ["id", "1", "'bazhen.csy'", "null", "to_char(a + 1)", "2.3", "true"], where id refers to an ordinary column name, 1 is an integer constant, 'bazhen.csy' is a string constant, null refers to the null pointer, to_char(a + 1) is a function expression, 2.3 is a floating point number, and true is a Boolean value.

      Column is required and cannot be blank.

    • Required: Yes

    • Default value: None

  • splitPk

    • Description: If you specify splitPk when using RDBMSReader to extract data, the field specified by splitPk is used for data sharding. DataX then starts concurrent tasks to synchronize the data, which greatly improves the efficiency of data synchronization.

      If you use splitPk, we recommend the table's primary key, because primary keys are generally evenly distributed, so the resulting shards are less prone to data hotspots.

      Currently, splitPk only supports data sharding for integer data types. Other types such as floating point, string, and date are not supported. If other unsupported types are specified, RDBMSReader reports an error.

      If splitPk is left blank, it indicates that no table splitting is required and RDBMSReader synchronizes full data through a single channel.

    • Required: No

    • Default: Null
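The sharding idea can be sketched as follows: split the integer splitPk range into per-channel sub-ranges, each of which becomes a WHERE clause for one concurrent task. This is a hypothetical Python sketch; the `split_ranges` helper is invented here, and DataX's actual splitting algorithm may differ.

```python
# Hypothetical sketch of integer-range sharding on splitPk
# (illustrative only, not DataX's exact algorithm).

def split_ranges(min_pk, max_pk, channels):
    """Split [min_pk, max_pk] into roughly equal integer sub-ranges,
    one per concurrent channel."""
    step = (max_pk - min_pk + 1) // channels or 1
    ranges = []
    lo = min_pk
    while lo <= max_pk:
        hi = min(lo + step - 1, max_pk)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# Each range then yields a predicate such as: pk >= 1 AND pk <= 25
print(split_ranges(1, 100, 4))
# [(1, 25), (26, 50), (51, 75), (76, 100)]
```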

  • where

    • Description: Filtering condition. RDBMSReader concatenates an SQL statement from the configured column, table, and where conditions and extracts data according to that SQL. For example, during a test you can set the where condition to limit 10. In actual business scenarios, incremental synchronization typically synchronizes the data generated on the current day, in which case you can set the where condition to gmt_create > $bizdate.

      The where condition can be effectively used for incremental synchronization. If the where condition is not set or is left null, full table data synchronization is applied.
    • Required: No

    • Default value: None

  • querySql

    • Description: In some business scenarios, the where condition is insufficient for filtering. In such cases, you can customize a filter SQL statement using this configuration item. When this item is configured, DataX filters data using it directly instead of the table, column, and where configuration items. For example, for data synchronization after a multi-table join, use select a,b from table_a join table_b on table_a.id = table_b.id.

      When querySql is configured, RDBMSReader directly ignores the configuration of table, column, and where conditions.

    • Required: No

    • Default value: None

  • fetchSize

    • Description: The number of records the plug-in fetches from the database server in each batch. This value determines the number of network round trips between DataX and the server, and a larger value can greatly improve data extraction performance.

      Note that a value greater than 2048 may lead to an out-of-memory (OOM) error during data synchronization.

    • Required: No

    • Default value: 1024
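In JDBC terms, fetchSize corresponds to the hint set via Statement.setFetchSize. The Python sketch below illustrates the same batching idea using the DB-API `cursor.fetchmany` call; it is an analogy for how batched fetching bounds memory use, not DataX code, and the `stream_rows` helper is invented here.

```python
# Illustrative batched-fetch pattern: at most fetch_size rows are
# buffered in memory at a time (analogous to JDBC's fetchSize hint).

def stream_rows(cursor, fetch_size=1024):
    """Execute a query and yield rows batch by batch."""
    cursor.execute("SELECT id, name FROM table")
    while True:
        batch = cursor.fetchmany(fetch_size)
        if not batch:
            break
        for row in batch:
            yield row
```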

Type conversion

RDBMSReader supports most general-purpose relational database types, such as numeric and character types. Check the type mapping against your specific database.
