DataHub: OGG for Big Data

Last Updated: Jan 06, 2026

This guide explains how to install and configure Oracle GoldenGate (OGG) for Big Data to replicate data from an Oracle database to Alibaba Cloud DataHub.

Prerequisites

  • An Oracle database, version 19c or earlier. The database version cannot be newer than the source OGG version.

  • Source: Oracle GoldenGate 19.1.0.0

  • Target: Oracle GoldenGate for Big Data 19.1.0.0

  • OGG Official Download Link

Note

The examples in this guide use OGG 19.1. For information about other supported versions, refer to the version specifications provided by Oracle.

Installation

This section describes the installation and configuration process for Oracle and OGG. The installation of the Oracle database itself is not covered.

Note

The Oracle and OGG parameter settings in this guide are for demonstration purposes only. For production environments, consult an experienced Oracle or OGG administrator for the proper production configuration.

Configure the source OGG

This section provides a working example using Oracle 11g. For Oracle 12c and later multitenant versions, refer to the Official Documentation.

1. Configure the source Oracle database

Note

Failing to perform the following configuration may result in NULL pre-update values for UPDATE operations.

Log in to sqlplus as a user with DBA privileges: sqlplus / as sysdba

-- Create a dedicated tablespace
create tablespace ATMV datafile '/home/oracle/u01/app/oracle/oradata/uprr/ATMV.dbf' size 100m autoextend on next 50m maxsize unlimited;
-- Create a user named ogg_test with the password ogg_test
create user ogg_test identified by ogg_test default tablespace ATMV;
-- Grant full privileges to the ogg_test user
grant connect, resource, dba to ogg_test;
-- Check the supplemental logging status
select SUPPLEMENTAL_LOG_DATA_MIN, SUPPLEMENTAL_LOG_DATA_PK, SUPPLEMENTAL_LOG_DATA_UI, SUPPLEMENTAL_LOG_DATA_FK, SUPPLEMENTAL_LOG_DATA_ALL from v$database;
-- Add database supplemental logging
alter database add supplemental log data;
alter database add supplemental log data (primary key, unique, foreign key) columns;
-- Rollback commands
alter database drop supplemental log data (primary key, unique, foreign key) columns;
alter database drop supplemental log data;
-- Full column logging mode. Note: in this mode, DELETE operations contain only primary key values. To include the other column values, configure NOCOMPRESSDELETES in the source Extract process.
alter database add supplemental log data (all) columns;
-- Enable forced logging mode for the database
alter database force logging;
-- Install sequence support
@sequence.sql
-- Enable supplemental logging on the sequence base table (sys.seq$)
alter table sys.seq$ add supplemental log data (primary key) columns;
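
After enabling these options, rerun the status query to confirm that supplemental logging is active; the corresponding v$database columns should now report YES:

-- Confirm supplemental logging is enabled (expected value: YES)
select SUPPLEMENTAL_LOG_DATA_MIN, SUPPLEMENTAL_LOG_DATA_PK, SUPPLEMENTAL_LOG_DATA_UI, SUPPLEMENTAL_LOG_DATA_FK, SUPPLEMENTAL_LOG_DATA_ALL from v$database;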

2. Install the source OGG

After decompressing the OGG installation package, you will find the following directory structure:

drwxr-xr-x install
drwxrwxr-x response
-rwxr-xr-x runInstaller
drwxr-xr-x stage

Oracle installations typically use a response file. Configure the installation dependencies in response/oggcore.rsp as follows:

oracle.install.responseFileVersion=/oracle/install/rspfmt_ogginstall_response_schema_v12_1_2
# Must match the Oracle version
INSTALL_OPTION=ORA11g
# GoldenGate home directory
SOFTWARE_LOCATION=/home/oracle/u01/ggate
# Do not start the Manager process initially
START_MANAGER=false
# Manager port
MANAGER_PORT=7839
# Corresponding Oracle home directory
DATABASE_LOCATION=/home/oracle/u01/app/oracle/product/11.2.0/dbhome_1
# Can be left unconfigured for now
INVENTORY_LOCATION=
# Group (In this example, both Oracle and OGG use the same user, ogg_test. In a production environment, you can create a separate user for OGG.)
UNIX_GROUP_NAME=oinstall

Run the following command:

runInstaller -silent -responseFile {YOUR_OGG_INSTALL_FILE_PATH}/response/oggcore.rsp

In this example, OGG is installed in the /home/oracle/u01/ggate directory. The installation log is located in the /home/oracle/u01/ggate/cfgtoollogs/oui directory. The following message in the silentInstall{timestamp}.log file confirms a successful installation:

The installation of Oracle GoldenGate Core was successful.
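
As a quick check, you can search the installer logs for this message (a sketch assuming the default log location above; the exact file name includes a timestamp):

grep -i "successful" /home/oracle/u01/ggate/cfgtoollogs/oui/silentInstall*.log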

Run the /home/oracle/u01/ggate/ggsci command. At the prompt, enter the CREATE SUBDIRS command to generate the required directories (dirxxx) for OGG. The source OGG installation is now complete.
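
The sequence looks roughly like this (illustrative; prompt output varies by version):

cd /home/oracle/u01/ggate
./ggsci
# At the GGSCI prompt:
create subdirs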

3. Configure the source Manager

Use the GGSCI utility to configure the source Manager. Run the edit params mgr command and add the following configuration:

PORT 7839
DYNAMICPORTLIST  7840-7849
USERID ogg_test, PASSWORD ogg_test
PURGEOLDEXTRACTS ./dirdat/*, USECHECKPOINTS, MINKEEPDAYS 7
LAGREPORTHOURS 1
LAGINFOMINUTES 30
LAGCRITICALMINUTES 45
PURGEDDLHISTORY MINKEEPDAYS 3, MAXKEEPDAYS 7
PURGEMARKERHISTORY MINKEEPDAYS 3, MAXKEEPDAYS 7

After configuring, run view params mgr to verify the settings.

Start the Manager: start mgr

Check the Manager status: info mgr
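
Putting the steps together, a typical GGSCI session for the source Manager looks like this (illustrative only):

# Open the parameter file and paste the configuration above
edit params mgr
# Verify the settings
view params mgr
# Start the Manager and confirm it is running
start mgr
info mgr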

4. Configure the source Extract process

Use GGSCI to configure the Extract process. This example names the Extract process dhext. Run edit params dhext and add the following configuration:

EXTRACT dhext
SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")
DBOPTIONS   ALLOWUNUSEDCOLUMN
USERID ogg_test, PASSWORD ogg_test
REPORTCOUNT EVERY 1 MINUTES, RATE
NUMFILES 5000
DISCARDFILE ./dirrpt/ext_test.dsc, APPEND, MEGABYTES 100
DISCARDROLLOVER AT 2:00
WARNLONGTRANS 2h, CHECKINTERVAL 3m
EXTTRAIL ./dirdat/st, MEGABYTES 200
DDL &
INCLUDE MAPPED OBJTYPE 'table' &
INCLUDE MAPPED OBJTYPE 'index' &
INCLUDE MAPPED OBJTYPE 'SEQUENCE' &
EXCLUDE OPTYPE COMMENT
DDLOPTIONS  NOCROSSRENAME  REPORT
TABLE  OGG_TEST.*,tokens (TKN-ROWID=@GETENV('RECORD','rowid'));
SEQUENCE  OGG_TEST.*;
GETUPDATEBEFORES

Note

The configuration TABLE OGG_TEST.*,tokens (TKN-ROWID=@GETENV('RECORD','rowid')); is used to capture the ROWID of the source table. If you do not need to capture the ROWID, you can change it to TABLE OGG_TEST.*;.

Add and start the Extract process:

# Add the Extract process
add extract dhext,tranlog, begin now

# Set the size of each trail file to 200 MB
add exttrail ./dirdat/st,extract dhext, megabytes 200

# Start the process
start dhext

Once started, the Extract process captures all database changes in files within the ggate/dirdat directory.
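
To verify that capture is working, check the process status in GGSCI and confirm that trail files with the st prefix are being written (a sketch; trail file names carry a numeric sequence suffix):

# In GGSCI: process status and processing statistics
info dhext
stats dhext
# From the shell: trail files should appear under dirdat
ls -l /home/oracle/u01/ggate/dirdat/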

5. Configure the source Pump process

Start GGSCI and run edit params pump to configure the Pump process:

EXTRACT pump
RMTHOST xx.xx.xx.xx, MGRPORT 7919, COMPRESS
PASSTHRU
NUMFILES 5000
RMTTRAIL ./dirdat/st
DYNAMICRESOLUTION
TABLE      OGG_TEST.*;
SEQUENCE  OGG_TEST.*;

Add and start the Pump process:

# Add the Pump process
add extract pump,exttrailsource ./dirdat/st

# Add the target trail file and set the size of each file to 200 MB
add rmttrail ./dirdat/st,extract pump,megabytes 200

# Ensure the target manager is running before you start the Pump process
start pump

Once the Pump process starts successfully, it delivers the trail file to the dirdat directory on the target system.
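
From the source side, you can confirm delivery with the standard GGSCI status commands (illustrative):

# Check the Pump status and how far it lags behind the local trail
info pump
lag extract pump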

Configure the target OGG for Big Data

1. Install the target OGG for Big Data

The target OGG for Big Data does not require a formal installation. Simply decompress the package. After decompression, start GGSCI and run the create subdirs command. This creates the various dirxxx directories that OGG requires.
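
A minimal sketch of these steps, using hypothetical placeholders for the package name and installation directory:

# Decompress the package into the installation directory (names vary by build)
tar -xf {YOUR_OGG_BIGDATA_PACKAGE}.tar -C {YOUR_OGG_BIGDATA_HOME}
cd {YOUR_OGG_BIGDATA_HOME}
./ggsci
# At the GGSCI prompt:
create subdirs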

2. Install and configure the DataHub plugin

Your environment must have JDK 1.8 or later installed. Configure the JAVA_HOME and LD_LIBRARY_PATH environment variables. You can add them to your ~/.bash_profile file. For example:

export JAVA_HOME=/xxx/xxx
export JRE_HOME=/xxx/xxx/jrexx
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$JRE_HOME/lib/amd64:$JRE_HOME/lib/amd64/server
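
To verify that the variables took effect, you can check that the JVM and its shared library resolve (a sketch; the amd64 directory layout shown matches JDK 8 on Linux x64):

source ~/.bash_profile
$JAVA_HOME/bin/java -version
# libjvm.so must be on LD_LIBRARY_PATH for the OGG Java user exit to load
ls $JRE_HOME/lib/amd64/server/libjvm.so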

After you modify the environment variables, download and decompress datahub-ogg-plugin.tar.gz. In the conf directory, edit the javaue.properties file and replace ${GG_HOME} with the package's decompression path.

gg.handlerlist=ggdatahub

gg.handler.ggdatahub.type=com.aliyun.odps.ogg.handler.datahub.DatahubHandler
gg.handler.ggdatahub.configureFileName=${GG_HOME}/aliyun-datahub-ogg-plugin/conf/configure.xml

goldengate.userexit.timestamp=utc+8
goldengate.userexit.writers=javawriter

javawriter.stats.display=TRUE
javawriter.stats.full=TRUE

gg.includeggtokens=true
gg.classpath=${GG_HOME}/aliyun-datahub-ogg-plugin/lib/*
gg.log=log4j
gg.log.level=info
gg.log.file.count=64
gg.log.file.size=128MB

javawriter.bootoptions=-Xms512m -Xmx512m -Xmn256m -Djava.class.path=ggjava/ggjava.jar -Dlog4j.configurationFile=${GG_HOME}/aliyun-datahub-ogg-plugin/conf/log4j.properties

In the conf directory, edit the configure.xml file. For configuration guidance, refer to the comments within the file.

<?xml version="1.0" encoding="UTF-8"?>
<configue>
    <defaultOracleConfigure>
        <!-- Oracle SID. Required. -->
        <sid>100</sid>
        <!-- Oracle schema. Can be overridden by oracleSchema in a mapping. One of these must be specified. -->
        <schema>ogg_test</schema>
    </defaultOracleConfigure>
    <defalutDatahubConfigure>
        <!-- DataHub endpoint. Required. -->
        <endPoint>YOUR_DATAHUB_ENDPOINT</endPoint>
        <!-- DataHub project. Can be overridden by datahubProject in a mapping. One of these must be specified. -->
        <project>YOUR_DATAHUB_PROJECT</project>
        <!-- DataHub AccessKey ID. Can be overridden by datahubAccessId in a mapping. One of these must be specified. -->
        <accessId>YOUR_DATAHUB_ACCESS_ID</accessId>
        <!-- DataHub AccessKey secret. Can be overridden by datahubAccessKey in a mapping. One of these must be specified. -->
        <accessKey>YOUR_DATAHUB_ACCESS_KEY</accessKey>
        <!-- The DataHub field for the data change type. Can be overridden by ctypeColumn in a columnMapping. -->
        <ctypeColumn>optype</ctypeColumn>
        <!-- The DataHub field for the data change time. Can be overridden by ctimeColumn in a columnMapping. -->
        <ctimeColumn>readtime</ctimeColumn>
        <!-- The DataHub field for the data change sequence number. Increments with each change, but is not guaranteed to be consecutive. Can be overridden by cidColumn in a columnMapping. -->
        <cidColumn>record_id</cidColumn>
    </defalutDatahubConfigure>
    <!-- By default, the strictest policy is applied: no dirty data file is written, the process exits immediately, and retries are infinite. -->
    <!-- Maximum number of records per batch. Optional. Default: 1000. -->
    <batchSize>1000</batchSize>
    <!-- Default format for time fields. Optional. Default: yyyy-MM-dd HH:mm:ss. -->
    <defaultDateFormat>yyyy-MM-dd HH:mm:ss</defaultDateFormat>
    <!-- Whether to continue on dirty data. Optional. Default: false. -->
    <dirtyDataContinue>true</dirtyDataContinue>
    <!-- Dirty data file name. Optional. Default: datahub_ogg_plugin.dirty. -->
    <dirtyDataFile>datahub_ogg_plugin.dirty</dirtyDataFile>
    <!-- Maximum size of the dirty data file in MB. Optional. Default: 500. -->
    <dirtyDataFileMaxSize>200</dirtyDataFileMaxSize>
    <!-- Number of retries. -1: infinite, 0: no retries, n: n retries. Optional. Default: -1. -->
    <retryTimes>0</retryTimes>
    <!-- A comma-separated list of shard IDs to use, which takes precedence. Optional. Example: 0,1. -->
    <shardId>0,1</shardId>
    <!-- Retry interval in milliseconds. Optional. Default: 3000. -->
    <retryInterval>4000</retryInterval>
    <!-- Checkpoint file name. Optional. Default: datahub_ogg_plugin.chk. -->
    <checkPointFileName>datahub_ogg_plugin.chk</checkPointFileName>
    <mappings>
        <mapping>
            <!-- Oracle schema. See description above. -->
            <oracleSchema></oracleSchema>
            <!-- Oracle table. Required. -->
            <oracleTable>t_person</oracleTable>
            <!-- DataHub project. See description above. -->
            <datahubProject></datahubProject>
            <!-- DataHub topic. Required. -->
            <datahubTopic>t_person</datahubTopic>
            <!-- The DataHub field for the Oracle table's ROWID. Optional. -->
            <rowIdColumn></rowIdColumn>
            <ctypeColumn></ctypeColumn>
            <ctimeColumn></ctimeColumn>
            <cidColumn></cidColumn>
            <columnMapping>
                <!--
                src: Source Oracle column name. Required.
                dest: Target DataHub field. Required.
                destOld: Target DataHub field for pre-update data. Optional.
                isShardColumn: Whether this column is used as the hash key for sharding. Optional. Default: false. Can be overridden by shardId.
                isDateFormat: Whether to format timestamp fields using dateFormat. Default: true. If false, the source data must be a long.
                dateFormat: The format for timestamp fields. If not specified, the default format is used.
                -->
                <column src="id" dest="id" isShardColumn="true"  isDateFormat="false" dateFormat="yyyy-MM-dd HH:mm:ss"/>
                <column src="name" dest="name" isShardColumn="true"/>
                <column src="age" dest="age"/>
                <column src="address" dest="address"/>
                <column src="comments" dest="comments"/>
                <column src="sex" dest="sex"/>
                <column src="temp" dest="temp" destOld="temp1"/>
            </columnMapping>
        </mapping>
    </mappings>
</configue>
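
A malformed edit to this file will typically prevent the Replicat process from starting. If xmllint (from libxml2) is available, you can confirm that the file is still well-formed XML after editing; keep the element names (including configue and defalutDatahubConfigure) exactly as shown, because they are what the plugin parses:

xmllint --noout ${GG_HOME}/aliyun-datahub-ogg-plugin/conf/configure.xml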

3. Configure the target Manager

In GGSCI, run edit params mgr and add the following configuration:

port 7919
dynamicportlist  7910-7919
lagreportminutes  10
laginfoseconds  1
purgeoldextracts ./dirdat/*, usecheckpoints, minkeepdays 7

Start the Manager: start mgr

Check the Manager status: info mgr

4. Configure the target Replicat process

In GGSCI, run edit params dhwt to configure the Replicat process:

REPLICAT dhwt
getEnv (JAVA_HOME)
getEnv (LD_LIBRARY_PATH)
getEnv (PATH)
TARGETDB LIBFILE libggjava.so SET property=${GG_HOME}/aliyun-datahub-ogg-plugin/conf/javaue.properties -- Manually update this path.
MAP ogg_test.*, TARGET ogg_test.*;

Note

Manually update the path to the javaue.properties file and specify which tables to map in the MAP parameter. This example captures all tables under the ogg_test schema.

Add and start the dhwt process:

# Add the Replicat process
add replicat dhwt, exttrail ./dirdat/st
# Start the dhwt process
start dhwt
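
Once started, verify the Replicat status and watch the plugin log for write activity (see the FAQ below for the log locations):

# In GGSCI
info dhwt
# From the shell: follow the DataHub plugin log for this process
tail -f dirrpt/DHWT-datahub.log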

FAQ

Q: Where are the log files for troubleshooting DataHub plugin issues?

A: The DataHub plugin logs are located in the dirrpt directory of your OGG for Big Data installation.

Each Replicat process generates its own log file named [ProcessName]-datahub.log. For example, if your Replicat process is named DHWT, the corresponding plugin log file is DHWT-datahub.log. Check this file for any issues related to DataHub connectivity, data formatting, or write errors.

Q: Where can I find the general OGG Replicat process logs?

A: The general OGG Replicat process logs, also known as report files, are located in the dirrpt directory of your OGG for Big Data installation.

These files contain detailed information about the process's status, errors, and performance metrics. The report file is named [ProcessName].rpt. For example, for a Replicat process named DHWT, the report file is DHWT.rpt.
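
You can also open a report file directly from GGSCI:

view report dhwt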

Q: What does the replicated data look like in DataHub?

A: When data is successfully written to DataHub, you will see a confirmation message in the DataHub plugin log:

2020-12-01 11:29:30.000461 [record_writer-1-thread-1] INFO ShardWriter - Write DataHub success, table: orders, topic: orders_new, shard: 0, recordNum: 1, rt: 3

In DataHub, the replicated record will contain your original table's columns plus several metadata columns added by the plugin to provide context about the change event.

Here is an example of a replicated record in the DataHub console:

Shard ID: 0
System time: 11:29:30 AM Dec 1, 2020
oid (string): 1
num (string): 3
pid (string): 2
bak (string): zh
ctype (string): I
ctime (string): 2020-12-01 03:29:24.000074
cid (string): 16067933700000
rowid (string): AAAWwyAAGAAABufAAC