Import SSTable and CSV data with BulkLoad-ApsaraDB for Cassandra - Deprecated

Before you begin

This tool uses a file streaming interface to import data to an ApsaraDB for Cassandra cluster. BulkLoad is one of the fastest ways to migrate offline data to a Cassandra cluster. Before you import data, make the following preparations:

Create a Cassandra cluster.
Prepare offline data in SSTable or CSV format.
Create an independent ECS instance in the same VPC as the Cassandra cluster, and configure security group rules to ensure that the ECS instance can access the Cassandra cluster.

1. Create a client ECS instance in the same VPC as the Cassandra cluster

We recommend that you create an ECS instance independent of the Cassandra cluster. Otherwise, online services may be affected.
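Before you import data, it can help to confirm that the security group rules actually let the ECS instance reach a Cassandra node. The following is a minimal sketch that only tests whether a TCP connection can be opened. The class name CassandraPortCheck, the default host, and the 3-second timeout are hypothetical, and port 9042 is the default CQL native transport port in Apache Cassandra, which your cluster may not use.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class CassandraPortCheck {
    /** Returns true if a TCP connection to host:port opens within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolvable host names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical defaults: pass a node address of your cluster as the first argument.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9042;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 3000));
    }
}
```

A successful connection shows only that the port is reachable. It does not verify the username and password.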

2. Create a schema

$ cqlsh -f schema.cql -u USERNAME -p PASSWORD [host]

3. Prepare data

3.1 SSTable data format

Organize a directory in the data/${keyspace}/${table} format and store SSTable data in the directory, as shown in the following example:

ls /tmp/quote/historical_prices/
md-1-big-CompressionInfo.db md-1-big-Data.db        md-1-big-Digest.crc32       md-1-big-Filter.db      md-1-big-Index.db       md-1-big-Statistics.db      md-1-big-Summary.db     md-1-big-TOC.txt

In the preceding example, the keyspace is quote and the table is historical_prices.
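Because the keyspace and table are taken from the last two components of the directory path, you can check a path before you load it. The following sketch derives both names from a directory string. The class name SSTableDirectory and its fields are hypothetical and are not part of Cassandra.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class SSTableDirectory {
    final String keyspace;
    final String table;

    private SSTableDirectory(String keyspace, String table) {
        this.keyspace = keyspace;
        this.table = table;
    }

    /** Interprets the last two path components as ${keyspace}/${table}. */
    static SSTableDirectory fromPath(String dir) {
        Path path = Paths.get(dir).toAbsolutePath().normalize();
        Path tableDir = path.getFileName();
        Path parent = path.getParent();
        Path keyspaceDir = parent == null ? null : parent.getFileName();
        if (tableDir == null || keyspaceDir == null) {
            throw new IllegalArgumentException(
                "Expected a path that ends in <keyspace>/<table>: " + dir);
        }
        return new SSTableDirectory(keyspaceDir.toString(), tableDir.toString());
    }

    public static void main(String[] args) {
        SSTableDirectory d = fromPath("/tmp/quote/historical_prices/");
        System.out.println("keyspace=" + d.keyspace + ", table=" + d.table);
    }
}
```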

Import data

Run the sstableloader command, which is located in the bin directory of the Cassandra distribution, and specify the data directory data/${ks}/${table}.

${cassandra_home}/bin/sstableloader -d <ip address of the node> data/${ks}/${table}
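If you start the import from a program rather than a shell, the same command can be assembled as an argument list. The following is a sketch under assumptions: the class name SSTableLoaderCommand and all sample values are hypothetical, and the -u and -pw options are the sstableloader options for a username and password, which you may need if the cluster requires authentication.

```java
import java.util.ArrayList;
import java.util.List;

final class SSTableLoaderCommand {
    /** Builds the sstableloader argument list; pass a null username to omit credentials. */
    static List<String> build(String cassandraHome, String node,
                              String username, String password, String dataDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add(cassandraHome + "/bin/sstableloader");
        cmd.add("-d");
        cmd.add(node);
        if (username != null) {
            cmd.add("-u");
            cmd.add(username);
            cmd.add("-pw");
            cmd.add(password);
        }
        // The data directory must end in ${ks}/${table}.
        cmd.add(dataDir);
        return cmd;
    }

    public static void main(String[] args) {
        // Hypothetical values for illustration only.
        List<String> cmd = build("/opt/cassandra", "10.0.0.1",
                                 "USERNAME", "PASSWORD", "/tmp/quote/historical_prices");
        System.out.println(String.join(" ", cmd));
    }
}
```

To run the command, pass the list to new ProcessBuilder(cmd).inheritIO().start() and wait for the process to exit.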

After the SSTable data is imported, connect to the cluster with cqlsh and query the table to check the data:

$ bin/cqlsh -u USERNAME -p PASSWORD [host]
cqlsh> select * from quote.historical_prices;

 ticker | date                            | adj_close | close     | high      | low       | open      | volume
--------+---------------------------------+-----------+-----------+-----------+-----------+-----------+--------
   ORCL | 2019-10-29 16:00:00.000000+0000 | 26.160000 | 26.160000 | 26.809999 | 25.629999 | 26.600000 | 181000
   ORCL | 2019-10-28 16:00:00.000000+0000 | 26.559999 | 26.559999 | 26.700001 | 22.600000 | 22.900000 | 555000

3.2 CSV data format

You must first convert CSV data to the SSTable format. Cassandra provides the CQLSSTableWriter tool, which generates SSTables from data in any format. Because the tool does not read CSV files itself, you must write, compile, and run your own code that parses the CSV data and passes each row to the tool. The following sample code demonstrates how to use this tool. For more information about this tool, visit the GitHub repository.

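        import java.math.BigDecimal;
        import java.util.List;
        import org.apache.cassandra.dht.Murmur3Partitioner;
        import org.apache.cassandra.io.sstable.CQLSSTableWriter;

        // The statements below belong inside a method. outputDir, SCHEMA, INSERT_STMT,
        // ticker, DATE_FORMAT, and csvReader are defined elsewhere in your program.

        // Prepare the SSTable writer
        CQLSSTableWriter.Builder builder = CQLSSTableWriter.builder();
        // Set the output directory
        builder.inDirectory(outputDir)
               // Set the target schema (a CREATE TABLE statement)
               .forTable(SCHEMA)
               // Set the CQL INSERT statement used to write each row
               .using(INSERT_STMT)
               // Set the partitioner only if the cluster does not use the
               // default, Murmur3Partitioner
               .withPartitioner(new Murmur3Partitioner());
        CQLSSTableWriter writer = builder.build();
        // TODO: Open the CSV file, then read it line by line.
        // This assumes that csvReader.read() returns one row as a List<String>.
        List<String> line;
        while ((line = csvReader.read()) != null) {
            // Pass the values in the same order as the bind markers in INSERT_STMT
            writer.addRow(ticker,
                          DATE_FORMAT.parse(line.get(0)),
                          new BigDecimal(line.get(1)),
                          new BigDecimal(line.get(2)),
                          new BigDecimal(line.get(3)),
                          new BigDecimal(line.get(4)),
                          Long.parseLong(line.get(6)),
                          new BigDecimal(line.get(5)));
        }
        // Finish writing the SSTable files
        writer.close();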
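The sample code refers to SCHEMA and INSERT_STMT without showing them. The following sketch shows what these two strings might look like for the quote.historical_prices table. The column types and the primary key are assumptions inferred from the value types passed to addRow and from the query output in section 3.1, and the class name BulkLoadSchema is hypothetical. The statements must match the schema that you created in step 2.

```java
final class BulkLoadSchema {
    static final String KEYSPACE = "quote";
    static final String TABLE = "historical_prices";

    // Assumed table definition: decimal for prices, bigint for volume.
    static final String SCHEMA = String.format(
        "CREATE TABLE %s.%s ("
            + "ticker ascii, "
            + "date timestamp, "
            + "open decimal, high decimal, low decimal, close decimal, "
            + "volume bigint, "
            + "adj_close decimal, "
            + "PRIMARY KEY (ticker, date)"
            + ") WITH CLUSTERING ORDER BY (date DESC)",
        KEYSPACE, TABLE);

    // The bind markers follow the order of the values passed to addRow:
    // ticker, date, open, high, low, close, volume, adj_close.
    static final String INSERT_STMT = String.format(
        "INSERT INTO %s.%s ("
            + "ticker, date, open, high, low, close, volume, adj_close"
            + ") VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        KEYSPACE, TABLE);

    public static void main(String[] args) {
        System.out.println(SCHEMA);
        System.out.println(INSERT_STMT);
    }
}
```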

After you generate SSTable data by using the custom program, import the data as described in section 3.1.
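The CSV parsing step is left as a TODO in the sample code. As a rough illustration, the following sketch converts one CSV line into the value array that addRow(Object...) accepts, using only the Java standard library. It assumes the column layout Date,Open,High,Low,Close,Adj Close,Volume (inferred from the indexes used in the sample), dates in yyyy-MM-dd format interpreted as UTC, and fields that contain no quoted commas. The class name CsvRow is hypothetical.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Date;

final class CsvRow {
    /** Returns values in the order ticker, date, open, high, low, close, volume, adj_close. */
    static Object[] toRowValues(String ticker, String csvLine) {
        String[] f = csvLine.split(",", -1);
        if (f.length < 7) {
            throw new IllegalArgumentException("Expected 7 columns but found " + f.length);
        }
        // Assumption: the date column holds a calendar date that is stored as midnight UTC.
        Date date = Date.from(
            LocalDate.parse(f[0].trim()).atStartOfDay(ZoneOffset.UTC).toInstant());
        return new Object[] {
            ticker,
            date,
            new BigDecimal(f[1].trim()),   // open
            new BigDecimal(f[2].trim()),   // high
            new BigDecimal(f[3].trim()),   // low
            new BigDecimal(f[4].trim()),   // close
            Long.parseLong(f[6].trim()),   // volume
            new BigDecimal(f[5].trim())    // adj_close
        };
    }

    public static void main(String[] args) {
        Object[] row = toRowValues("ORCL", "2019-10-29,26.60,26.81,25.63,26.16,26.16,181000");
        System.out.println(java.util.Arrays.toString(row));
    }
}
```

In the write loop, each parsed line can then be passed on with writer.addRow(CsvRow.toRowValues(ticker, csvLine)). Skip the header line of the CSV file before you parse the rows.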