ApsaraDB for Cassandra is a distributed NoSQL database that is developed based on Apache Cassandra and integrated with the database as a service (DBaaS) feature of Alibaba Cloud. This topic describes how to use the serverless Spark engine of Data Lake Analytics (DLA) to access ApsaraDB for Cassandra.

Prerequisites

  • Object Storage Service (OSS) is activated. For more information, see Activate OSS.
  • An ApsaraDB for Cassandra cluster is created. For more information, see Use cqlsh to manage an ApsaraDB for Cassandra instance.
  • The internal endpoint and CQL port number of the ApsaraDB for Cassandra cluster and the username and password for logon to the ApsaraDB for Cassandra database are obtained. For more information, see Use multi-language SDKs to access an ApsaraDB for Cassandra instance over the Internet and the internal network.
  • A table is created in the ApsaraDB for Cassandra cluster and data is inserted into the table. For more information, see Use cqlsh to manage an ApsaraDB for Cassandra instance. Sample statements:
    CREATE KEYSPACE spark WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
    use spark;
    CREATE TABLE spark_test (first_name text, last_name text, PRIMARY KEY (first_name));
    INSERT INTO spark_test (first_name, last_name) VALUES ('hadoop', 'big data basic platform');
    INSERT INTO spark_test (first_name, last_name) VALUES ('spark', 'big data compute engine');
    INSERT INTO spark_test (first_name, last_name) VALUES ('kafka', 'streaming data platform');
    INSERT INTO spark_test (first_name, last_name) VALUES ('mongodb', 'document database');
    INSERT INTO spark_test (first_name, last_name) VALUES ('es', 'search engine');
    INSERT INTO spark_test (first_name, last_name) VALUES ('flink', 'streaming platform');
  • The security group ID and vSwitch ID that are used by the serverless Spark engine of DLA to access the ApsaraDB for Cassandra cluster are obtained. For more information, see Access your VPC.

Procedure

  1. Write the following test code and add the dependency that is required for accessing the ApsaraDB for Cassandra cluster to the pom.xml file. Then, package the test code into a JAR file and upload this file, together with the JAR file of the dependency, to OSS.
    Sample test code:
    import org.apache.spark.sql.SparkSession
    
    object SparkCassandra {
    
      def main(args: Array[String]): Unit = {
        // The internal endpoint of the ApsaraDB for Cassandra cluster.
        val cHost = args(0)
        // The CQL port of the ApsaraDB for Cassandra cluster.
        val cPort = args(1)
        // The username and password that are used to log on to the ApsaraDB for Cassandra database.
        val cUser = args(2)
        val cPw = args(3)
        // The keyspace and table of the ApsaraDB for Cassandra cluster.
        val cKeySpace = args(4)
        val cTable = args(5)
    
        val spark = SparkSession
          .builder()
          .config("spark.cassandra.connection.host", cHost)
          .config("spark.cassandra.connection.port", cPort)
          .config("spark.cassandra.auth.username", cUser)
          .config("spark.cassandra.auth.password", cPw)
          .getOrCreate()
    
        val cData1 = spark
          .read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("table" -> cTable, "keyspace" -> cKeySpace))
          .load()
        print("=======start to print the cassandra data======")
        cData1.show()
      }
    }
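    The sample code above only reads data. The same connector can also write a DataFrame back to the table. The following minimal sketch reuses the cData1, cKeySpace, and cTable variables from the sample code; the DataFrame columns must match the table schema:
    // Write-back sketch: mode("append") inserts the rows into the Cassandra
    // table, upserting by primary key.
    cData1
      .write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> cTable, "keyspace" -> cKeySpace))
      .mode("append")
      .save()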
    Dependency that must be added to the pom.xml file to access the ApsaraDB for Cassandra cluster:
            <dependency>
                <groupId>com.datastax.spark</groupId>
                <artifactId>spark-cassandra-connector_2.11</artifactId>
                <version>2.4.2</version>
            </dependency>
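    The connector also provides an RDD-based API in addition to the DataFrame API. The following sketch, which assumes the same dependency and reuses the spark, cKeySpace, and cTable variables from the sample code, reads the same table at the RDD level:
    // The import adds cassandraTable(...) to SparkContext through implicit
    // conversions.
    import com.datastax.spark.connector._

    val rdd = spark.sparkContext.cassandraTable(cKeySpace, cTable)
    println(s"row count: ${rdd.count()}")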
  2. Log on to the DLA console.
  3. In the top navigation bar, select the region where the ApsaraDB for Cassandra cluster resides.
  4. In the left-side navigation pane, choose Serverless Spark > Submit job.
  5. On the Parameter Configuration page, click Create Job.
  6. In the Create Job dialog box, configure the parameters and click OK to create a Spark job.
  7. In the Job List navigation tree, click the Spark job that you created and enter the following job content in the code editor. Replace the parameter values based on the descriptions in the comments. Then, click Save and Execute.
    {
        "args": [
            "cds-xxx-1-core-001.cassandra.rds.aliyuncs.com",  # The internal endpoint of the ApsaraDB for Cassandra cluster. Two internal endpoints may be provided. You can choose one of them.
            "9042",  # The CQL port number of the ApsaraDB for Cassandra cluster.
            "cassandra",  # The username that is used to log on to the ApsaraDB for Cassandra database.
            "test_1234",  # The password that is used to log on to the ApsaraDB for Cassandra database.
            "spark", # The keyspace in the ApsaraDB for Cassandra cluster.
            "spark_test"# The name of the table in the ApsaraDB for Cassandra database.
        ],
        "file": "oss://spark_test/jars/cassandra/spark-examples-0.0.1-SNAPSHOT.jar",  # The OSS directory where the test code is saved.
        "name": "Cassandra-test",
        "jars": [
            "oss://spark_test/jars/cassandra/spark-cassandra-connector_2.11-2.4.2.jar"  # The OSS directory where the JAR file that contains the dependency of the test code is saved.
        ],
        "className": "com.aliyun.spark.SparkCassandra",
        "conf": {
            "spark.driver.resourceSpec": "small",  # The specifications of the Spark driver, which can be small, medium, large, or xlarge.
            "spark.executor.instances": 2,  # The number of Spark executors.
            "spark.executor.resourceSpec": "small",   # The specifications of Spark executors, which can be small, medium, large, or xlarge.
            "spark.dla.eni.enable": "true",  # Specifies whether to enable an elastic network interface (ENI) for the VPC. If you want to access data of the VPC, set spark.dla.eni.enable to true.
            "spark.dla.eni.vswitch.id": "vsw-xxx",  # The ID of the vSwitch to which the ApsaraDB for Cassandra cluster belongs.
            "spark.dla.eni.security.group.id": "sg-xxx"  # The ID of the security group to which the ApsaraDB for Cassandra cluster belongs.
        }
    }
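    Alternatively, the spark.cassandra.* connection settings can be placed in the conf section of the job (for example, "spark.cassandra.connection.host": "cds-xxx-1-core-001.cassandra.rds.aliyuncs.com") because the connector reads these keys from the Spark configuration. A minimal sketch of the corresponding code, assuming the settings are supplied in conf:
    // Sketch: no explicit .config(...) calls are needed when the
    // spark.cassandra.* settings arrive through the job configuration.
    val spark = SparkSession.builder().getOrCreate()
    val df = spark
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "spark_test", "keyspace" -> "spark"))
      .load()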

Result

After the job succeeds, find the job in the job list and click Log in the Operation column to view the job logs. The data of the spark_test table appears after the "=======start to print the cassandra data======" line that the sample code prints.