
Configure PostgreSQL reader

Last Updated: Apr 12, 2018

The PostgreSQL Reader plug-in reads data from PostgreSQL databases. At the underlying implementation level, PostgreSQL Reader connects to a remote PostgreSQL database through JDBC and runs SELECT statements to extract data from the database. On the public cloud, RDS provides PostgreSQL as a database engine.

Specifically, PostgreSQL Reader connects to a remote PostgreSQL database through the JDBC connector, generates SELECT statements based on your configuration, and sends them to the remote database. The statements are run, and the returned results are assembled into abstract datasets using the custom data types of Data Integration. These datasets are then passed to the downstream writer for processing.

  • PostgreSQL Reader concatenates the table, column, and where information you configured into SQL statements, and sends them to the PostgreSQL database.

  • PostgreSQL Reader directly sends the querySql statement you configured to the PostgreSQL database.
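The two behaviors above can be sketched as follows. This is an illustration of the query-building logic, not the plug-in's actual implementation; build_query is a hypothetical helper:

```python
def build_query(parameter):
    """Build the SELECT statement the reader would send to PostgreSQL.

    If querySql is configured, it is passed through verbatim; otherwise
    the statement is concatenated from column, table, and where.
    """
    if parameter.get("querySql"):
        return parameter["querySql"]
    sql = "SELECT {} FROM {}".format(
        ", ".join(parameter["column"]), parameter["table"])
    if parameter.get("where"):
        sql += " WHERE " + parameter["where"]
    return sql

# Concatenated from table, column, and where:
print(build_query({"table": "t", "column": ["id", "name"], "where": "id > 100"}))
# querySql is sent as-is; other settings are ignored:
print(build_query({"querySql": "select a,b from table_a join table_b on table_a.id = table_b.id"}))
```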

PostgreSQL Reader supports most data types in PostgreSQL. Check whether your data type is supported.

PostgreSQL Reader converts PostgreSQL data types as follows.

  • Integer: bigint, bigserial, integer, smallint, serial
  • Floating point: double precision, money, numeric, real
  • String: varchar, char, text, bit, inet
  • Date and time: date, time, timestamp
  • Boolean: bool
  • Binary: bytea

Note:

  • Data types not listed here are not supported.
  • For money, inet, and bit, you must cast the values explicitly, for example a_inet::varchar.
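For example, the casts can be written directly into the column list so that they land in the generated SELECT list. This is a hypothetical configuration; a_inet, a_bit, and a_money are made-up column names:

```python
# Cast money, inet, and bit columns to varchar inside the column list,
# so the generated SELECT list reads: id, a_inet::varchar, a_bit::varchar, a_money::varchar
column = ["id", "a_inet::varchar", "a_bit::varchar", "a_money::varchar"]
select_list = ", ".join(column)
print("SELECT {} FROM example_table".format(select_list))
```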

Parameter description

  • datasource: The name of the data source. It must be identical to the name of the added data source. Adding a data source is supported in script mode. Required: yes. Default: none.

  • table: The name of the table to be synchronized. Required: yes. Default: none.

  • column: The set of columns to be synchronized from the configured table, described as a JSON array of field names. ["*"] indicates all columns.
    - Column pruning is supported: you can select a subset of columns to export.
    - Column reordering is supported: you can export columns in an order different from the table schema order.
    - Constants are supported, following SQL syntax, for example ["id", "table", "1", "'mingya.wmy'", "'null'", "to_char(a + 1)", "2.3", "true"], where id is an ordinary column name, table is a column name that is also a reserved word, 1 is an integer constant, 'mingya.wmy' is a string constant (note that single quotation marks are required), 'null' is the string null, to_char(a + 1) is a function expression, 2.3 is a floating-point constant, and true is a Boolean value.
    - column must contain the specified set of columns to be synchronized and cannot be blank.
    Required: yes. Default: none.

  • splitPk: If you specify splitPk when using PostgreSQL Reader, the field it names is used for data sharding, and the synchronization then runs concurrent tasks, which greatly improves efficiency.
    - We recommend using the table's primary key as splitPk, because primary keys are usually distributed evenly, so the resulting shards are less prone to data hot spots.
    - Currently, splitPk supports sharding only for integer fields. Other types, such as string, floating point, and date, are not supported. If you specify an unsupported type, splitPk is ignored and the data is synchronized over a single channel.
    - If splitPk is not specified or its value is null, the table data is synchronized over a single channel.
    Required: no. Default: null.

  • where: The filtering condition. PostgreSQL Reader concatenates an SQL statement from the configured column, table, and where values and extracts data accordingly. For example, you can set a where condition during a test. In actual business scenarios, you usually need to synchronize the data of the current day and can set the where condition to something like id > 2 and sex = 1.
    - The where condition can be used effectively for incremental synchronization.
    - If where is not set or is null, the full table is synchronized.
    Required: no. Default: none.

  • querySql (advanced mode; unavailable in wizard mode): In some business scenarios, where is insufficient for filtering, and this parameter lets you define a custom SQL filter. When it is configured, the data integration system filters data directly with this SQL and ignores the table, column, and where settings. For example, to synchronize data after a multi-table join, use select a,b from table_a join table_b on table_a.id = table_b.id. Required: no. Default: none.

  • fetchSize: The number of rows fetched in each batch between the plug-in and the database server. This value determines the number of network round trips between the data integration system and the server; a suitably large value can greatly improve data extraction performance. Note: A value greater than 2048 may cause the synchronization process to run out of memory (OOM). Required: no. Default: 512.
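The splitPk behavior can be illustrated by dividing an integer key range into one WHERE clause per concurrent channel. This sketch only illustrates the sharding idea; it is not the plug-in's actual splitting algorithm, and split_ranges is a hypothetical helper:

```python
def split_ranges(min_pk, max_pk, channels):
    """Split the inclusive key range [min_pk, max_pk] into contiguous
    WHERE clauses, one per concurrent synchronization channel."""
    step = (max_pk - min_pk) // channels + 1
    clauses = []
    lo = min_pk
    while lo <= max_pk:
        hi = min(lo + step - 1, max_pk)
        clauses.append("id >= {} AND id <= {}".format(lo, hi))
        lo = hi + 1
    return clauses

# Four channels over ids 1..100 yield four non-overlapping ranges:
for where in split_ranges(1, 100, 4):
    print(where)
```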

Development in wizard mode


  • Data sources: datasource in the preceding parameter description. Select postgresql.

  • Table: table in the preceding parameter description. Select the table to be synchronized.

  • Data filtering: where in the preceding parameter description. You can enter a where filtering clause (the where keyword itself is not required) that follows the corresponding SQL syntax. This filtering condition can serve as the criterion for incremental synchronization.

  • Shard key: splitPk in the preceding parameter description. Only integer fields are supported. During data reading, the data is split based on the configured field to achieve concurrent reading, improving synchronization efficiency. The shard key configuration depends on the data source selected for synchronization, and this configuration item is displayed only after you configure the data source.

  • Field mapping: column in the preceding parameter description.

Development in script mode

Configure a job to synchronously extract data from a PostgreSQL database:

{
  "type": "job",
  "traceId": "You can enter a trace ID for your job here. We recommend you to organize your trace ID in the following format: service name + job ID",
  "version": "1.0",
  "configuration": {
    "reader": {
      "plugin": "postgresql",
      "parameter": {
        "datasource": "datasourceName",
        "table": "tableName",
        "column": [
          "id", "name", "age"
        ],
        "fetchSize": 512,
        "splitPk": "id",
        "where": "id > 100"
      }
    },
    "writer": {}
  }
}

Configure a PostgreSQL synchronization task with custom SQL statements as follows:

{
  "type": "job",
  "traceId": "You can enter a trace ID for your job here. We recommend you to organize your trace ID in the following format: service name + job ID",
  "version": "1.0",
  "configuration": {
    "reader": {
      "plugin": "postgresql",
      "parameter": {
        "datasource": "datasourceName",
        "querySql": "SELECT * from tableName",
        "fetchSize": 1024
      }
    },
    "writer": {}
  }
}
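Because querySql overrides table, column, and where, a configuration check along these lines can help catch conflicting settings before a job runs. This is a hypothetical validation sketch, not part of the product; effective_mode is a made-up helper:

```python
import json

def effective_mode(parameter):
    """Report which reader settings would actually take effect."""
    if "querySql" in parameter:
        ignored = [k for k in ("table", "column", "where") if k in parameter]
        return "querySql", ignored
    return "table/column/where", []

# A config that sets both querySql and table; table would be ignored:
conf = json.loads('{"datasource": "datasourceName", '
                  '"querySql": "SELECT * from tableName", '
                  '"table": "tableName"}')
mode, ignored = effective_mode(conf)
print(mode, ignored)
```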

Supplemental instructions

Data restoration of master/slave synchronization

Master/slave synchronization means that PostgreSQL uses a master/slave disaster recovery mode in which the slave database continuously restores data from the master database through WAL (write-ahead log) replication. Because of the time lag in master/slave synchronization, especially under conditions such as network latency, the restored data on the slave can differ significantly from the data on the master. That is, the data synchronized to the slave is not a full image of the master database at the current time.

If the data integration system synchronizes data from Alibaba Cloud RDS, the data is read directly from the master database, so there are no data restoration concerns. However, this adds load to the master database, so configure throttling appropriately.

Consistency constraints

PostgreSQL is an RDBMS that stores data with strong consistency and provides APIs for strongly consistent queries. For example, if another writer writes data to the database during a synchronization task, PostgreSQL Reader does not see the newly written data because of the database's snapshot isolation. For details on database snapshots, see the Wikipedia article on MVCC.

These consistency properties hold only in the single-threaded model of PostgreSQL Reader. Because PostgreSQL Reader extracts data concurrently according to your configuration, strict data consistency cannot be guaranteed: after PostgreSQL Reader shards the data based on splitPk, multiple concurrent tasks are started in succession to synchronize the data. These tasks do not belong to the same read transaction, and time intervals exist between them, so the combined result is not a complete and consistent data snapshot.

Currently, consistent snapshots cannot be achieved technically in the multi-threaded model; the problem can only be addressed at the engineering level, and each engineering approach has trade-offs. The following solutions are provided for your consideration:

  • Use single-threaded synchronization without data sharding. This is slower but guarantees data consistency.

  • Stop other data writers so that the current data is static. For example, you can lock the table or pause slave database synchronization. The disadvantage is that online business may be affected.

Database encoding

PostgreSQL supports the EUC_CN and UTF-8 encodings for simplified Chinese. PostgreSQL Reader extracts data through JDBC, which supports all encodings and performs transcoding at the underlying level. Therefore, PostgreSQL Reader detects the encoding and transcodes automatically; you do not need to specify an encoding.

PostgreSQL Reader cannot detect an inconsistency between the encoding of the data actually stored in PostgreSQL and the configured encoding, nor can it provide a solution. In this case, the exported data may contain garbled characters.

Incremental data synchronization

Because PostgreSQL Reader extracts data with JDBC SELECT statements, you can extract incremental data using a SELECT ... WHERE ... condition in the following ways:

  • When online applications write to the database, they fill a modify field with the modification timestamp for every change, including insertion, update, and (logical) deletion. For such applications, PostgreSQL Reader only needs a WHERE condition on the timestamp of the last synchronization.
  • For append-only data, PostgreSQL Reader only needs a WHERE condition on the maximum auto-increment ID of the last synchronization.

If no field is available to identify added or modified data, PostgreSQL Reader cannot perform incremental synchronization and can only perform full synchronization.
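The two incremental patterns above can be sketched as follows. The field names modify_time and id, and the helper incremental_where, are hypothetical; substitute your table's actual timestamp or auto-increment columns:

```python
def incremental_where(last_sync_time=None, last_max_id=None):
    """Build a WHERE condition for incremental extraction.

    Timestamp-based: rows whose modify field is newer than the last sync.
    ID-based: append-only rows beyond the last synchronized auto-increment ID.
    """
    if last_sync_time is not None:
        return "modify_time > '{}'".format(last_sync_time)
    if last_max_id is not None:
        return "id > {}".format(last_max_id)
    return None  # no incremental marker available: fall back to full sync

print(incremental_where(last_sync_time="2018-04-12 00:00:00"))
print(incremental_where(last_max_id=10000))
```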

SQL security

PostgreSQL Reader provides the querySql parameter so that you can SELECT data yourself. PostgreSQL Reader performs no security verification on querySql; you are responsible for ensuring that these statements are safe to run.
