
Tutorial

Last Updated: Mar 15, 2018

Using Spark/Spark SQL to access Table Store

Data preparation

Create a table named pet in Table Store and import the following data:

Name      Owner   Species  Sex  Birth       Death
Fluffy    Harold  cat      f    1993-02-04
Claws     Gwen    cat      m    1994-03-17
Buffy     Harold  dog      f    1989-05-13
Fang      Benny   dog      m    1990-08-27
Bowser    Diane   dog      m    1979-08-31  1995-07-29
Chirpy    Gwen    bird     f    1998-09-11
Whistler  Gwen    bird          1997-12-09
Slim      Benny   snake    m    1996-04-29
Puffball  Diane   hamster  f    1999-03-30

Note that the Name column is the only Primary Key.

Because Table Store is schema-free (see Data model), you do not need to write anything (such as NULL) into blank cells.
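The schema-free behavior can be pictured with a plain map per row: a row carries only the attribute columns that were actually written, so an empty cell is simply an absent column. A minimal sketch in Java (this models the behavior only; it is not the Table Store SDK):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaFreeRow {
    public static void main(String[] args) {
        // Model only: a row is a primary key plus the attribute columns that
        // were actually written. Absent columns simply do not exist, so there
        // is nothing to fill in for Fluffy's empty Death cell.
        Map<String, String> fluffy = new LinkedHashMap<>();
        fluffy.put("Owner", "Harold");
        fluffy.put("Species", "cat");
        fluffy.put("Sex", "f");
        fluffy.put("Birth", "1993-02-04");

        System.out.println(fluffy.containsKey("Death")); // → false
    }
}
```

When such a row is read through an external table, the missing column surfaces as NULL, as the query results below show.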

Example of accessing Table Store by Spark SQL

Preparations

Prepare the Spark environment, the JDK, and the dependency packages of the Table Store SDK and EMR, as described in Prerequisites.

Example

  $ bin/spark-sql --master local --jars tablestore-4.3.1-jar-with-dependencies.jar,emr-tablestore-1.4.2.jar
  spark-sql> CREATE EXTERNAL TABLE pet
               (name STRING, owner STRING, species STRING, sex STRING, birth STRING, death STRING)
             STORED BY 'com.aliyun.openservices.tablestore.hive.TableStoreStorageHandler'
             WITH SERDEPROPERTIES(
               "tablestore.columns.mapping"="name,owner,species,sex,birth,death")
             TBLPROPERTIES (
               "tablestore.endpoint"="YourEndpoint",
               "tablestore.access_key_id"="YourAccessKeyId",
               "tablestore.access_key_secret"="YourAccessKeySecret",
               "tablestore.table.name"="pet");
  spark-sql> SELECT * FROM pet;
  Bowser    Diane   dog      m     1979-08-31  1995-07-29
  Buffy     Harold  dog      f     1989-05-13  NULL
  Chirpy    Gwen    bird     f     1998-09-11  NULL
  Claws     Gwen    cat      m     1994-03-17  NULL
  Fang      Benny   dog      m     1990-08-27  NULL
  Fluffy    Harold  cat      f     1993-02-04  NULL
  Puffball  Diane   hamster  f     1999-03-30  NULL
  Slim      Benny   snake    m     1996-04-29  NULL
  Whistler  Gwen    bird     NULL  1997-12-09  NULL
  Time taken: 5.045 seconds, Fetched 9 row(s)
  spark-sql> SELECT * FROM pet WHERE birth > "1995-01-01";
  Chirpy    Gwen    bird     f     1998-09-11  NULL
  Puffball  Diane   hamster  f     1999-03-30  NULL
  Slim      Benny   snake    m     1996-04-29  NULL
  Whistler  Gwen    bird     NULL  1997-12-09  NULL
  Time taken: 1.41 seconds, Fetched 4 row(s)
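The second query compares birth, a STRING column, against "1995-01-01". This works because zero-padded ISO-8601 dates (yyyy-MM-dd) sort lexicographically in the same order as chronologically, which a short Java check illustrates:

```java
public class DateStringCompare {
    public static void main(String[] args) {
        // Zero-padded ISO-8601 dates compare correctly as strings, so a
        // string-typed birth column still supports range predicates.
        System.out.println("1998-09-11".compareTo("1995-01-01") > 0); // → true
        System.out.println("1994-03-17".compareTo("1995-01-01") > 0); // → false
    }
}
```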

Parameter description

  • WITH SERDEPROPERTIES

    • tablestore.columns.mapping (optional): By default, the field names of the external table (lowercased, per Hive convention) match the column names (Primary Key or Attribute column names) in Table Store. The names can differ, however, because of case sensitivity or character sets; in that case, specify tablestore.columns.mapping. The value is a comma-separated list of Table Store column names, in the same order as the external table's fields, with no blank space before or after any comma.

      Note: Table Store column names may contain blank characters, so a blank space is treated as part of the column name.

  • TBLPROPERTIES

    • tablestore.endpoint (required): The endpoint. You can view the endpoint information of the instance on the Table Store console.

    • tablestore.instance (optional): The instance name. If not specified, it is derived from the first field of tablestore.endpoint.

    • tablestore.table.name (required): The table name in Table Store.

    • tablestore.access_key_id and tablestore.access_key_secret (required): See Access control.

    • tablestore.sts_token (optional): See Security token.
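Two of these string-valued settings can be sketched in Java: building a legal tablestore.columns.mapping value (comma-separated, no surrounding spaces) and extracting the instance name from an endpoint. The instanceFromEndpoint helper is hypothetical, shown only to illustrate what "the first field of tablestore.endpoint" means:

```java
import java.net.URI;

public class TableStoreProps {
    // Build a tablestore.columns.mapping value: comma-separated, in the same
    // order as the external table's fields, with no spaces around the commas.
    static String columnsMapping(String... tableStoreColumns) {
        return String.join(",", tableStoreColumns);
    }

    // Hypothetical helper: takes the leading host label of the endpoint,
    // which is what the default for tablestore.instance refers to.
    static String instanceFromEndpoint(String endpoint) {
        String host = URI.create(endpoint).getHost();
        return host.substring(0, host.indexOf('.'));
    }

    public static void main(String[] args) {
        System.out.println(
            columnsMapping("name", "owner", "species", "sex", "birth", "death"));
        // → name,owner,species,sex,birth,death
        System.out.println(
            instanceFromEndpoint("https://YourInstance.Region.ots.aliyuncs.com/"));
        // → YourInstance
    }
}
```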

Example of accessing Table Store by Spark

The following example illustrates how to count the rows in pet by using Spark.

  import java.util.ArrayList;
  import java.util.Formatter;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  // The Table Store classes (RangeRowQueryCriteria, PrimaryKeyColumn,
  // PrimaryKeyValue, PrimaryKey, Credential, Endpoint, TableStoreInputFormat,
  // PrimaryKeyWritable, RowWritable) come from the jars listed in
  // Preparations; their package names depend on the SDK version.

  public class RowCounter {
      // Scan the whole table: one range from INF_MIN to INF_MAX over the
      // primary key, reading at most one version of each cell.
      private static RangeRowQueryCriteria fetchCriteria() {
          RangeRowQueryCriteria res = new RangeRowQueryCriteria("YourTableName");
          res.setMaxVersions(1);
          List<PrimaryKeyColumn> lower = new ArrayList<PrimaryKeyColumn>();
          List<PrimaryKeyColumn> upper = new ArrayList<PrimaryKeyColumn>();
          lower.add(new PrimaryKeyColumn("YourPkeyName", PrimaryKeyValue.INF_MIN));
          upper.add(new PrimaryKeyColumn("YourPkeyName", PrimaryKeyValue.INF_MAX));
          res.setInclusiveStartPrimaryKey(new PrimaryKey(lower));
          res.setExclusiveEndPrimaryKey(new PrimaryKey(upper));
          return res;
      }

      public static void main(String[] args) {
          SparkConf sparkConf = new SparkConf().setAppName("RowCounter");
          JavaSparkContext sc = new JavaSparkContext(sparkConf);
          Configuration hadoopConf = new Configuration();
          TableStoreInputFormat.setCredential(
              hadoopConf,
              new Credential("YourAccessKeyId", "YourAccessKeySecret"));
          TableStoreInputFormat.setEndpoint(
              hadoopConf,
              new Endpoint("https://YourInstance.Region.ots.aliyuncs.com/"));
          TableStoreInputFormat.addCriteria(hadoopConf, fetchCriteria());
          try {
              JavaPairRDD<PrimaryKeyWritable, RowWritable> rdd = sc.newAPIHadoopRDD(
                  hadoopConf,
                  TableStoreInputFormat.class,
                  PrimaryKeyWritable.class,
                  RowWritable.class);
              System.out.println(
                  new Formatter().format("TOTAL: %d", rdd.count()).toString());
          } finally {
              sc.close();
          }
      }
  }

Note: If you use Scala, replace JavaSparkContext with SparkContext, and replace JavaPairRDD with an RDD of key-value pairs.

Run the program

  $ bin/spark-submit --master local --jars hadoop-connector.jar row-counter.jar
  TOTAL: 9

Data type conversion

Table Store and Hive/Spark support different sets of data types.

The following table shows the supported conversions from Table Store data types (rows) to Hive data types (columns).

          TINYINT  SMALLINT  INT   BIGINT  FLOAT  DOUBLE  BOOLEAN  STRING  BINARY
INTEGER   Yes*     Yes*      Yes*  Yes     Yes*   Yes*
DOUBLE    Yes*     Yes*      Yes*  Yes*    Yes*   Yes
BOOLEAN                                            Yes
STRING                                                     Yes
BINARY                                                             Yes

* Yes, with limited precision
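"Limited precision" reflects that a Table Store INTEGER is a 64-bit value, while narrower Hive numeric types cannot represent every such value exactly. A small stdlib-only Java illustration (the literals are arbitrary examples):

```java
public class PrecisionLoss {
    public static void main(String[] args) {
        // Only BIGINT holds a 64-bit Table Store INTEGER exactly; narrower
        // types can silently change the value.
        long big = 4_000_000_000L;   // fits in BIGINT, not in 32-bit INT
        int asInt = (int) big;       // overflows and wraps around
        float asFloat = 16_777_217L; // exceeds float's 24-bit mantissa

        System.out.println(asInt == big);                   // → false
        System.out.println((long) asFloat == 16_777_217L);  // → false
    }
}
```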