Access a Hologres database from MaxCompute

Last Updated: Jun 06, 2024

This topic describes how to access a Hologres database from MaxCompute.

Background information

Hologres is an interactive analytics service provided by Alibaba Cloud for real-time data processing. Hologres allows you to write and query real-time data with high concurrency and analyze data with high performance. You can use Hologres and MaxCompute together to build a federated data warehouse solution, which eliminates the need for data migration and combines real-time and offline data processing. The solution is suitable for scenarios such as large-scale offline analysis, real-time operational analysis, and interactive queries. To implement the solution, you can use the following methods to access Hologres from MaxCompute:

  • Use an external table

  • Use Spark on MaxCompute in local mode, cluster mode, or DataWorks mode

Prerequisites

  • A Hologres instance is purchased.

  • A MaxCompute project is created, and the MaxCompute client is installed and configured.

  • If you want to use the Spark on MaxCompute methods, the Spark on MaxCompute package and the PostgreSQL JDBC driver (postgresql-42.2.16.jar in this example) are downloaded.

Use an external table

  1. In the Hologres console, select the instance that you purchased and create a Hologres database named mc_db_holo.

    For information about how to create a Hologres database, see Create a database.

  2. In the HoloWeb console, execute the following statement in the mc_db_holo database to create a table named mc_sql_holo and insert data into the table.

    For information about how to create a Hologres table, see Overview.

    CREATE TABLE mc_sql_holo(
            id INTEGER,
            name TEXT
    );
    
    INSERT INTO mc_sql_holo VALUES
            (1,'zhangsan'),
            (2,'lisi'),
            (3,'wangwu')
    ;
  3. In the Resource Access Management (RAM) console, create a RAM role named AliyunOdpsHoloRole and modify the trust policy.

    For more information, see the "Create a Hologres external table in STS mode" section in the Hologres external tables topic.

    Note

    In this example, the RAM role is created for a trusted Alibaba Cloud account.
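
    For reference, the following sketch shows what a trust policy that allows MaxCompute to assume the role typically looks like. This is an assumption based on the standard configuration that trusts the MaxCompute (ODPS) service; follow the linked topic for the authoritative policy.

    {
        "Statement": [
            {
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "odps.aliyuncs.com"
                    ]
                }
            }
        ],
        "Version": "1"
    }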

  4. Grant the RAM role AliyunOdpsHoloRole access to the Hologres instance.

    For more information, see the "Create a Hologres external table in STS mode" section in the Hologres external tables topic.

  5. On the MaxCompute client, execute the following statement to create an external table named mc_externaltable_holo:

    create external table if not exists mc_externaltable_holo
    (
        id int ,
        name string
    )
    stored by 'com.aliyun.odps.jdbc.JdbcStorageHandler'
    with serdeproperties (
      'odps.properties.rolearn'='acs:ram::13969******5947:role/aliyunodpsholorole')
    LOCATION 'jdbc:postgresql://hgprecn-cn-2r42******-cn-hangzhou-internal.hologres.aliyuncs.com:80/mc_db_holo?currentSchema=public&useSSL=false&table=mc_sql_holo/'
    TBLPROPERTIES (
      'mcfed.mapreduce.jdbc.driver.class'='org.postgresql.Driver',
      'odps.federation.jdbc.target.db.type'='holo',
      'odps.federation.jdbc.colmapping'='id:id,name:name'
    );
    Note

    For information about the parameters in the statement, see Hologres external tables.
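
    To verify that the external table was created as expected, you can run the following optional check on the MaxCompute client. The desc extended statement prints the table definition, including the storage handler and JDBC properties.

    -- Optional check: view the external table definition and properties.
    desc extended mc_externaltable_holo;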

  6. On the MaxCompute client, execute the following statement to obtain data from the external table:

    set odps.sql.split.hive.bridge=true;
    set odps.sql.hive.compatible=true;
    select * from mc_externaltable_holo limit 10;
    Note

    For information about the parameters in SET operations, see SET operations.

    The following result is returned:

    +----+----------+
    | id | name     |
    +----+----------+
    | 1  | zhangsan |
    | 2  | lisi     |
    | 3  | wangwu   |
    +----+----------+
  7. On the MaxCompute client, execute the following statement to insert data into the external table:

    set odps.sql.split.hive.bridge=true;
    set odps.sql.hive.compatible=true;
    insert into mc_externaltable_holo values (4,'alice');
  8. In the HoloWeb console, query data in the Hologres table mc_sql_holo. The result contains the data that you inserted in the previous step.

    select * from mc_sql_holo;
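
    Based on the data inserted in the preceding steps, the table is expected to contain the following rows:

    +----+----------+
    | id | name     |
    +----+----------+
    | 1  | zhangsan |
    | 2  | lisi     |
    | 3  | wangwu   |
    | 4  | alice    |
    +----+----------+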

Use Spark on MaxCompute in local mode

  1. In the HoloWeb console, execute the following statement in the mc_db_holo database to create a table named mc_jdbc_holo.

    For information about how to create a Hologres table, see Overview.

    CREATE TABLE mc_jdbc_holo(
            id INTEGER,
            name TEXT
    );
  2. Go to the /home/pythoncode directory on the Linux machine and create a Python script named holo_local.py.

    Sample Python script:

    from pyspark.sql import SparkSession
    
    spark = SparkSession \
                .builder \
                .appName("Spark_local") \
                .config("spark.eventLog.enabled","false") \
                .getOrCreate()
    
    jdbcDF = spark.read.format("jdbc"). \
                options(
                url='jdbc:postgresql://hgprecn-cn-2r42******-cn-hangzhou.hologres.aliyuncs.com:80/mc_db_holo',
                dbtable='mc_jdbc_holo',
                user='LTAI5tJE8fuerxXdPPB****',
                password='Okr2kbBKueR3uRaHaBiUHw4r6****',
                driver='org.postgresql.Driver').load()
    
    jdbcDF.printSchema()

    Parameters:

    • url: the JDBC URL that contains the jdbc:postgresql:// prefix.

      • hgprecn-cn-2r42******-cn-hangzhou.hologres.aliyuncs.com:80: the endpoint used to access the Hologres instance over the Internet. For information about how to obtain the endpoint, see Instance configurations.

      • mc_db_holo: the name of the Hologres database to which you want to connect. In this example, the Hologres database name is mc_db_holo.

    • dbtable: the name of the Hologres table from which you want to read data or to which you want to write data. In this example, the Hologres table name is mc_jdbc_holo.

    • user: the AccessKey ID of an authorized Alibaba Cloud account or RAM user. You can obtain the AccessKey ID on the AccessKey Pair page.

    • password: the AccessKey secret corresponding to the AccessKey ID that you specified in the user parameter. You can obtain the AccessKey secret on the AccessKey Pair page.

    • driver: the PostgreSQL driver. Set this parameter to org.postgresql.Driver.
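
    The sample script only reads and prints the table schema. To write data from Spark to the Hologres table, you can use the standard JDBC write interface of Spark. The following is a minimal sketch that reuses the connection options above; the sample rows are illustrative, and the URL, user, and password are the same placeholders as in the script above.

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.appName("Spark_local_write").getOrCreate()
    
    # Build a small DataFrame whose schema matches mc_jdbc_holo (id INTEGER, name TEXT).
    df = spark.createDataFrame([(1, 'foo'), (2, 'bar')], ['id', 'name'])
    
    # Append the rows to the Hologres table over JDBC.
    df.write.format("jdbc") \
        .option("url", "jdbc:postgresql://hgprecn-cn-2r42******-cn-hangzhou.hologres.aliyuncs.com:80/mc_db_holo") \
        .option("dbtable", "mc_jdbc_holo") \
        .option("user", "LTAI5tJE8fuerxXdPPB****") \
        .option("password", "Okr2kbBKueR3uRaHaBiUHw4r6****") \
        .option("driver", "org.postgresql.Driver") \
        .mode("append") \
        .save()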

  3. On the Linux machine, run the following spark-submit command to submit the job in local mode:

    spark-submit --master local --driver-class-path /home/postgreSQL/postgresql-42.2.16.jar --jars /home/postgreSQL/postgresql-42.2.16.jar /home/pythoncode/holo_local.py

    View the output logs. If the printed schema information is the same as that of the mc_jdbc_holo table in Hologres, access to Hologres from MaxCompute is successful.

Use Spark on MaxCompute in cluster mode

  1. In the HoloWeb console, execute the following statement in the mc_db_holo database to create a table named mc_jdbc_holo.

    For information about how to create a Hologres table, see Overview.

    CREATE TABLE mc_jdbc_holo(
            id INTEGER,
            name TEXT
    );
  2. Go to the /home/pythoncode directory on the Linux machine and create a Python script named holo_yarncluster.py.

    Sample Python script:

    from pyspark.sql import SparkSession
    
    spark = SparkSession \
                .builder \
                .appName("Spark_yarn") \
                .getOrCreate()
    
    jdbcDF = spark.read.format("jdbc"). \
                options(
                url='jdbc:postgresql://hgprecn-cn-2r42******-cn-hangzhou-internal.hologres.aliyuncs.com:80/mc_db_holo',
                dbtable='mc_jdbc_holo',
                user='LTAI5tJE8fuerxXdPPB****',
                password='Okr2kbBKueR3uRaHaBiUHw4r6****',
                driver='org.postgresql.Driver').load()
    
    jdbcDF.printSchema()

    Parameters:

    • url: the JDBC URL that contains the jdbc:postgresql:// prefix.

      • hgprecn-cn-2r42******-cn-hangzhou-internal.hologres.aliyuncs.com:80: the endpoint used to access the Hologres instance over the classic network. For information about how to obtain the endpoint, see Instance configurations.

      • mc_db_holo: the name of the Hologres database to which you want to connect. In this example, the Hologres database name is mc_db_holo.

    • dbtable: the name of the Hologres table from which you want to read data or to which you want to write data. In this example, the Hologres table name is mc_jdbc_holo.

    • user: the AccessKey ID of an authorized Alibaba Cloud account or RAM user. You can obtain the AccessKey ID on the AccessKey Pair page.

    • password: the AccessKey secret corresponding to the AccessKey ID that you specified in the user parameter. You can obtain the AccessKey secret on the AccessKey Pair page.

    • driver: the PostgreSQL driver. Set this parameter to org.postgresql.Driver.
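
    The script verifies connectivity by printing the table schema. Optionally, you can append the following sketch to holo_yarncluster.py so that the job output also shows data access. Because mc_jdbc_holo is empty at this point, the count is 0 until you insert data.

    # Optional check: print up to 10 rows and the row count so that
    # the StdOut section of LogView also shows data access.
    jdbcDF.show(10)
    print(jdbcDF.count())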

  3. Go to the /home/spark2.4.5/spark-2.4.5-odps0.33.2/conf directory and modify the spark-defaults.conf file.

    # Modify the following configurations:
    spark.hadoop.odps.project.name = <MaxCompute_Project_Name>
    spark.hadoop.odps.end.point = <Endpoint>
    spark.hadoop.odps.runtime.end.point = <VPC_Endpoint>
    spark.hadoop.odps.access.id = <AccessKey_ID>
    spark.hadoop.odps.access.key = <AccessKey_Secret>
    
    spark.hadoop.odps.cupid.trusted.services.access.list = <Hologres_Classic_Network>
    
    # Retain the following configurations:
    spark.master = yarn-cluster
    spark.driver.cores = 2
    spark.driver.memory = 4g
    
    spark.dynamicAllocation.shuffleTracking.enabled = true
    spark.dynamicAllocation.shuffleTracking.timeout = 20s
    spark.dynamicAllocation.enabled = true
    spark.dynamicAllocation.maxExecutors = 10
    spark.dynamicAllocation.initialExecutors = 2
    spark.executor.cores = 2
    spark.executor.memory = 8g
    
    spark.eventLog.enabled = true
    spark.eventLog.overwrite = true
    spark.eventLog.dir = odps://admin_task_project/cupidhistory/sparkhistory
    
    spark.sql.catalogImplementation = hive
    spark.sql.sources.default = hive

    Parameters:

    • MaxCompute_Project_Name: the name of the MaxCompute project that you created.

      To obtain the project name, log on to the MaxCompute console, select a region in the upper-left corner, and then choose Workspaces > Projects in the left-side navigation pane.

    • AccessKey_ID: the AccessKey ID used to access the MaxCompute project.

      You can obtain the AccessKey ID on the AccessKey Pair page.

    • AccessKey_Secret: the AccessKey secret corresponding to the AccessKey ID that you specified in the AccessKey_ID parameter.

      You can obtain the AccessKey secret on the AccessKey Pair page.

    • Endpoint: the endpoint used to access the MaxCompute project over the Internet.

      For information about the public endpoint of each region, see Endpoints in different regions (Internet).

    • VPC_Endpoint: the endpoint used to access the Virtual Private Cloud (VPC) where the MaxCompute project resides.

      For information about the VPC endpoint of each region, see Endpoints in different regions (VPC).

    • Hologres_Classic_Network: the endpoint used to access Hologres over the classic network. This configuration adds the Hologres endpoint to the network access policy of the MaxCompute sandbox environment. If you do not specify this configuration, MaxCompute cannot access external services such as Hologres.
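
      For reference, the following sketch shows the modified configurations filled in with hypothetical values. The project name, endpoints, and AccessKey pair below are illustrative assumptions, not real values; the cn-hangzhou endpoints follow the documented regional endpoint format.

      spark.hadoop.odps.project.name = my_mc_project
      spark.hadoop.odps.end.point = http://service.cn-hangzhou.maxcompute.aliyun.com/api
      spark.hadoop.odps.runtime.end.point = http://service.cn-hangzhou-vpc.maxcompute.aliyun-inc.com/api
      spark.hadoop.odps.access.id = LTAI5tJE8fuerxXdPPB****
      spark.hadoop.odps.access.key = Okr2kbBKueR3uRaHaBiUHw4r6****
      spark.hadoop.odps.cupid.trusted.services.access.list = hgprecn-cn-2r42******-cn-hangzhou-internal.hologres.aliyuncs.com:80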

  4. On the Linux machine, run the following spark-submit command to submit the job in cluster mode:

    spark-submit --master yarn-cluster --driver-class-path /home/postgreSQL/postgresql-42.2.16.jar --jars /home/postgreSQL/postgresql-42.2.16.jar /home/pythoncode/holo_yarncluster.py

    After you submit the job, you can use the following information in the output logs to monitor the job:

    • The LogView URL.

    • The JobView URL of Spark UI.

  5. Open the LogView URL. If the job status is success, choose Job Details > master-0 > StdOut to view the output of the jdbcDF.printSchema() method.

    If the printed schema information in the StdOut section is the same as that of the mc_jdbc_holo table in Hologres, access to Hologres from MaxCompute is successful.

    Note

    You can also open the JobView URL of Spark UI to view the job details.

Use Spark on MaxCompute in DataWorks mode

  1. In the HoloWeb console, execute the following statement in the mc_db_holo database to create a table named mc_jdbc_holo.

    For information about how to create a Hologres table, see Overview.

    CREATE TABLE mc_jdbc_holo(
            id INTEGER,
            name TEXT
    );
  2. Go to the /home/spark2.4.5/spark-2.4.5-odps0.33.2/conf directory and modify the spark-defaults.conf file.

    # Modify the following configurations:
    spark.hadoop.odps.project.name = <MaxCompute_Project_Name>
    spark.hadoop.odps.end.point = <Endpoint>
    spark.hadoop.odps.runtime.end.point = <VPC_Endpoint>
    spark.hadoop.odps.access.id = <AccessKey_ID>
    spark.hadoop.odps.access.key = <AccessKey_Secret>
    
    spark.hadoop.odps.cupid.trusted.services.access.list = <Hologres_Classic_Network>
    
    # Retain the following configurations:
    spark.master = yarn-cluster
    spark.driver.cores = 2
    spark.driver.memory = 4g
    
    spark.dynamicAllocation.shuffleTracking.enabled = true
    spark.dynamicAllocation.shuffleTracking.timeout = 20s
    spark.dynamicAllocation.enabled = true
    spark.dynamicAllocation.maxExecutors = 10
    spark.dynamicAllocation.initialExecutors = 2
    spark.executor.cores = 2
    spark.executor.memory = 8g
    
    spark.eventLog.enabled = true
    spark.eventLog.overwrite = true
    spark.eventLog.dir = odps://admin_task_project/cupidhistory/sparkhistory
    
    spark.sql.catalogImplementation = hive
    spark.sql.sources.default = hive

    Parameters:

    • MaxCompute_Project_Name: the name of the MaxCompute project that you created.

      To obtain the project name, log on to the MaxCompute console, select a region in the upper-left corner, and then choose Workspaces > Projects in the left-side navigation pane.

    • AccessKey_ID: the AccessKey ID used to access the MaxCompute project.

      You can obtain the AccessKey ID on the AccessKey Pair page.

    • AccessKey_Secret: the AccessKey secret corresponding to the AccessKey ID that you specified in the AccessKey_ID parameter.

      You can obtain the AccessKey secret on the AccessKey Pair page.

    • Endpoint: the endpoint used to access the MaxCompute project over the Internet.

      For information about the public endpoint of each region, see Endpoints in different regions (Internet).

    • VPC_Endpoint: the endpoint used to access the Virtual Private Cloud (VPC) where the MaxCompute project resides.

      For information about the VPC endpoint of each region, see Endpoints in different regions (VPC).

    • Hologres_Classic_Network: the endpoint used to access Hologres over the classic network. This configuration adds the Hologres endpoint to the network access policy of the MaxCompute sandbox environment. If you do not specify this configuration, MaxCompute cannot access external services such as Hologres.

  3. Log on to the DataWorks console.

  4. In the left-side navigation pane, click Workspaces.

  5. On the Workspaces page, find the workspace that you want to view and choose Shortcuts > Data Development in the Actions column.

  6. Create a PostgreSQL JDBC resource and an ODPS Spark node.

    1. In the left-side navigation pane, right-click the business flow that you want to use and choose Create Resource > MaxCompute > File. In the Create Resource dialog box, upload the downloaded PostgreSQL JDBC package and click Create.

    2. Right-click the business flow that you want to use and choose Create Resource > MaxCompute > Python. In the Create Resource dialog box, enter a resource name in the Name field and click Create.

      In this example, the resource name is read_holo.py.

    3. Copy and paste the following code to the read_holo.py file and click the Save icon.

      from pyspark.sql import SparkSession
      
      spark = SparkSession \
                  .builder \
                  .appName("Spark") \
                  .getOrCreate()
      
      jdbcDF = spark.read.format("jdbc"). \
                  options(
                  url='jdbc:postgresql://hgprecn-cn-2r42******-cn-hangzhou-internal.hologres.aliyuncs.com:80/mc_db_holo',
                  dbtable='mc_jdbc_holo',
                  user='LTAI5tJE8fuerxXdPPB****',
                  password='Okr2kbBKueR3uRaHaBiUHw4r6****',
                  driver='org.postgresql.Driver').load()
      
      jdbcDF.printSchema()

      Parameters:

      • url: the JDBC URL that contains the jdbc:postgresql:// prefix.

        • hgprecn-cn-2r42******-cn-hangzhou-internal.hologres.aliyuncs.com:80: the endpoint used to access the Hologres instance over the classic network. For information about how to obtain the endpoint, see Instance configurations.

        • mc_db_holo: the name of the Hologres database to which you want to connect. In this example, the Hologres database name is mc_db_holo.

      • dbtable: the name of the Hologres table from which you want to read data or to which you want to write data. In this example, the Hologres table name is mc_jdbc_holo.

      • user: the AccessKey ID of an authorized Alibaba Cloud account or RAM user. You can obtain the AccessKey ID on the AccessKey Pair page.

      • password: the AccessKey secret corresponding to the AccessKey ID that you specified in the user parameter. You can obtain the AccessKey secret on the AccessKey Pair page.

      • driver: the PostgreSQL driver. Set this parameter to org.postgresql.Driver.

    4. Right-click the business flow that you want to use and choose Create Node > ODPS Spark. In the Create Node dialog box, enter a node name in the Name field and click Confirm. In this example, the node name is spark_read_holo.

    5. Configure the spark_read_holo node as follows:

      • Add an entry in the Configuration Items field. Set the key to spark.hadoop.odps.cupid.trusted.services.access.list.

      • Set the value to hgprecn-cn-2r42******-cn-hangzhou-internal.hologres.aliyuncs.com:80, which specifies the endpoint used to access Hologres over the classic network.

        Note

        This configuration adds the Hologres endpoint to the network access policy of the MaxCompute sandbox environment. If you do not specify this configuration, MaxCompute cannot access external services such as Hologres.

  7. On the canvas of the business flow, right-click the spark_read_holo node and choose Run Current Node.

    After you submit the job, you can view the job details in the output logs, including the diagnostic information, LogView URL, and JobView URL of Spark UI.

  8. Open the LogView URL. If the job status is success, choose Job Details > master-0 > StdOut to view the output of the jdbcDF.printSchema() method.

    If the printed schema information in the StdOut section is the same as that of the mc_jdbc_holo table in Hologres, access to Hologres from MaxCompute is successful.

    Note

    You can also open the JobView URL of Spark UI to view the job details.