All Products
Search
Document Center

Configure predicate pushdown for batch computing

Last Updated: Nov 24, 2020

When you use search index for batch computing, you can customize predicate pushdown. You can configure predicate pushdown only on LONG and STRING columns.

Background information

You can use predicate pushdown when a query condition for the search index needs to filter large amounts of data, and it is time-consuming for Tablestore to union all intermediate query results. To address these needs, you can push down the value filtering of part of fields from the storage layer (Tablestore) to the compute layer (Spark), which improves query efficiency.

Take select * from table where a = 10 and b < 999999999; as an example. If the number of rows returned for the condition (a = 10) is only 1,000, and the number of rows returned for the condition (b < 999999999) is up to 100 million, it is time-consuming for Tablestore to union the results of the two conditions. However, you can push down the condition (b < 999999999) to the compute layer. Spark needs only to filter data based on the 1,000 rows returned from the storage layer, which reduces most pressure on the storage layer.

Supported operators

The following table lists the operators supported by Spark.

Spark

Supported

Example of SQL statement

And

Yes

select * from table where a > 1 and b < 0;

Or

Yes

select * from table where a > 1 or b < 0;

Not

Yes

select * from table where a ! = 1;

EqualTo

Yes

select * from table where a = 1;

Not+EqualTo

Yes

select * from table where a ! = 1;

IsNull

Yes

select * from table where a is null;

Return rows that do not contain the a column from table.

In

Yes

select * from table where a in {1,2,3};

By default, the maximum upper limit is 1024.

LessThan

Yes

select * from table where a < 10;

You can customize predicate pushdown when the SQL statement uses LONG or STRING columns.

LessThanOrEqual

Yes

select * from table where a <=10;

You can customize predicate pushdown when the SQL statement uses LONG or STRING columns.

GreaterThan

Yes

select * from table where a > 10;

You can customize predicate pushdown when the SQL statement uses LONG or STRING columns.

GreaterThanOrEqual

Yes

select * from table where a >= 10;

You can customize predicate pushdown when the SQL statement uses LONG or STRING columns.

StringStartsWith

Yes

select * from table where a like "tablestore%";

Return rows that contain a column whose values are prefixed with tablestore from table.

Coordinate pair of the central point, radius (STRING JSON)

Yes

select * from table where val_geo = '{"centerPoint":"3,0", "distanceInMeter": 100000}';

The geographical area is defined by the radius and coordinate pair of the central point.

Coordinate pairs of the top-left corner and bottom-right corner in the rectangular geographical area (STRING JSON)

Yes

select * from table where geo = '{"topLeft":"8,0", "bottomRight": "0,10"}';

The rectangular geographical area is defined by the coordinate pairs of the upper-left corner and lower-right corner.

Coordinate pairs of vertices in the polygon geographical area (STRING JSON)

Yes

select * from table where geo = '{"points":["5,0", "5,1", "6,1", "6,10"]}';

The polygon geographical area is defined by the coordinate pairs of multiple vertices.

Configure predicate pushdown

The parameters for predicate pushdown can be configured when you create a Spark external table. Predicate pushdown uses the following rules:

  • When ADN and NOT are the only logical operators in the filter conditions, you can specify whether to enable predicate pushdown.

  • When the filter conditions contain logical operator OR , all predicates are pushed down. Custom configurations for predicate pushdown such as push.down.range.long=false and push.down.range.string=false (configured by using SQL in EMR) become invalid.

The following table describes the combination of logical operators with pushdown configurations, examples of SQL statements, and expected results.

Logical operator

Pushdown configuration

Example of SQL statement

Expected result

AND

  • push.down.range.long=true

  • push.down.range.string=true

select * from table where val_long1 > 1000 and val_long1 is null and name like 'table%' and pk in {12341,213432};

The SQL statement is executed.

All predicates are pushed down.

AND only

  • push.down.range.long=false

  • push.down.range.string=true

select * from table where val_long1 > 1 and name like 'table%';

Predicates where operators that are used to compare values with LONG column values are not pushed down.

The results returned based on the predicate are filtered by the Spark compute layer. The Spark compute layer obtains the data based on the condition name like 'table%'. The condition val_long1 > 1 is executed by using the business code.

AND only

  • push.down.range.long=true

  • push.down.range.string=false

select * from table where val_string1 > 'string1' and name like 'table%';

Predicates where operators that are used to compare values with STRING column values are not pushed down.

The results returned based on the predicate are filtered by the Spark compute layer. The Spark compute layer obtains the data based on the condition name like 'table%'. The condition val_long1 > 1 is executed by using the business code.

AND only

Columns containing values for geographical locations exist

select * from table where val_geo = '{"centerPoint":"3,0", "distanceInMeter": 100000}' and val_long1 = 37691900 and val_long2 > 2134234;

The SQL statement is executed.

The configurations of push.down.range.long determine whether the Spark compute layer filters the results returned based on the condition val_long2 > 2134234.

OR

  • push.down.range.long=true

  • push.down.range.string=true

select * from table where val_long1 > 1000 or val_long1 is null or name like 'table%' and pk in {12341,213432};

The SQL statement is executed.

All predicates are pushed down.

OR

Columns containing values for geographical locations exist

select * from table where val_geo = '{"centerPoint":"3,0", "distanceInMeter": 100000}' or val_long1 = 37691900;

The SQL statement is executed.

All predicates are pushed down.

OR

  • push.down.range.long=false

  • push.down.range.string=true

  • The SQL statement contains the rangeLong field for filtering.

select * from table where val_long1 > 1 and name like 'table%';

The custom configurations for predicate pushdown become invalid. All predicates are pushed down.

OR

  • push.down.range.long=true

  • push.down.range.string=false

  • The SQL statement contains the rangeString field for filtering.

select * from table where val_string1 > 'string1' and name like 'table%';

The custom configurations for predicate pushdown become invalid. All predicates are pushed down.