BulkLoad uses a file streaming interface to import data into an ApsaraDB for Cassandra cluster. It is one of the fastest ways to migrate offline data to a Cassandra cluster. Before you import data, complete the following preparations:

  • Create a Cassandra cluster.
  • Prepare offline data in SSTable or CSV format.
  • Create an independent ECS instance in the same VPC as the Cassandra cluster, and configure security group rules to ensure that the ECS instance can access the Cassandra cluster.

1. Create an ECS instance in the same VPC as the Cassandra cluster

We recommend that you run the import from an ECS instance that is independent of the Cassandra cluster. Running the import on a cluster node may affect online services.

2. Create a schema

$ cqlsh -f schema.cql -u USERNAME -p PASSWORD [host]

3. Prepare data

3.1 SSTable data format

Organize a directory in the data/${keyspace}/${table} format and store SSTable data in the directory, as shown in the following example:

$ ls /tmp/quote/historical_prices/
md-1-big-CompressionInfo.db  md-1-big-Data.db  md-1-big-Digest.crc32  md-1-big-Filter.db  md-1-big-Index.db  md-1-big-Statistics.db  md-1-big-Summary.db  md-1-big-TOC.txt

In the preceding example, the keyspace parameter is set to quote and the table parameter is set to historical_prices.
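
sstableloader takes the target keyspace and table from the last two components of the directory path, so it can be worth checking the layout before you start a long import. The following is a minimal sketch of such a check. The class SstableDirCheck and the method keyspaceAndTable are hypothetical names used for illustration and are not part of Cassandra.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of Cassandra): reports the keyspace and table
// that a directory layout of the form .../<keyspace>/<table> implies.
class SstableDirCheck {

    // Returns {keyspace, table} taken from the last two path components.
    static String[] keyspaceAndTable(String dir) {
        Path path = Paths.get(dir).normalize();
        if (path.getNameCount() < 2) {
            throw new IllegalArgumentException(
                "Expected a path ending in <keyspace>/<table>: " + dir);
        }
        String table = path.getFileName().toString();
        String keyspace = path.getParent().getFileName().toString();
        return new String[] { keyspace, table };
    }

    public static void main(String[] args) {
        // Uses the directory from the example above.
        String[] names = keyspaceAndTable("/tmp/quote/historical_prices/");
        System.out.println("keyspace=" + names[0] + ", table=" + names[1]);
    }
}
```

If the reported names do not match the keyspace and table that you created in step 2, rename or move the directory before you run the import.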

Import data

Run the sstableloader command, which is located in the bin directory of the Cassandra distribution, and specify the data directory data/${ks}/${table}.

${cassandra_home}/bin/sstableloader -d <ip address of the node> data/${ks}/${table}

After the SSTable data is imported, connect to the cluster with cqlsh and query the table to check the data:

$ bin/cqlsh -u USERNAME -p PASSWORD [host]
cqlsh> select * from quote.historical_prices;
 
 ticker | date                            | adj_close | close     | high      | low       | open      | volume
--------+---------------------------------+-----------+-----------+-----------+-----------+-----------+--------
   ORCL | 2019-10-29 16:00:00.000000+0000 | 26.160000 | 26.160000 | 26.809999 | 25.629999 | 26.600000 | 181000
   ORCL | 2019-10-28 16:00:00.000000+0000 | 26.559999 | 26.559999 | 26.700001 | 22.600000 | 22.900000 | 555000

3.2 CSV data format

You must first convert CSV data to the SSTable format. Cassandra provides the CQLSSTableWriter class for generating SSTables, which allows you to convert data in any format into the SSTable format. The CSV data must be organized in advance, and you must write, compile, and run your own code to parse it. The following sample code demonstrates how to use CQLSSTableWriter. For more information about this tool, visit the GitHub repository.

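        // Imports needed by this fragment.
        import java.math.BigDecimal;
        import java.util.List;
        import org.apache.cassandra.dht.Murmur3Partitioner;
        import org.apache.cassandra.io.sstable.CQLSSTableWriter;

        // outputDir, SCHEMA, INSERT_STMT, ticker, csvReader, and DATE_FORMAT
        // are defined elsewhere in your program.

        // Prepare the SSTable writer.
        CQLSSTableWriter.Builder builder = CQLSSTableWriter.builder();
        // Set the output directory.
        builder.inDirectory(outputDir)
               // Set the target schema.
               .forTable(SCHEMA)
               // Set the CQL statement to insert data.
               .using(INSERT_STMT)
               // Set a partitioner as needed.
               // The default partitioner is Murmur3Partitioner. You can set a different one.
               .withPartitioner(new Murmur3Partitioner());
        CQLSSTableWriter writer = builder.build();

        // TODO: Iterate over each line of a CSV file.
        List<String> line;
        while ((line = csvReader.read()) != null)
        {
            // The argument order must match the bind markers in INSERT_STMT.
            writer.addRow(ticker,
                          DATE_FORMAT.parse(line.get(0)),
                          new BigDecimal(line.get(1)),
                          new BigDecimal(line.get(2)),
                          new BigDecimal(line.get(3)),
                          new BigDecimal(line.get(4)),
                          Long.parseLong(line.get(6)),
                          new BigDecimal(line.get(5)));
        }
        // Flush the remaining data and finish the SSTable files.
        writer.close();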
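
The sample code references SCHEMA, INSERT_STMT, and a per-line conversion without defining them. The following is a minimal sketch of what they might look like for the quote.historical_prices table. Everything in it is an assumption made for illustration: the column types, the primary key, the CSV column order (Date,Open,High,Low,Close,Adj Close,Volume, chosen to match the indexes used in the sample code), the yyyy-MM-dd date format, the UTC time zone, and the names HistoricalPricesRow and toRow. Adjust them to your own schema and files.

```java
import java.math.BigDecimal;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical sketch: these definitions are assumptions, not taken from the source.
class HistoricalPricesRow {

    // Assumed table definition; column names follow the cqlsh output in section 3.1.
    static final String SCHEMA =
        "CREATE TABLE quote.historical_prices ("
        + "ticker ascii, date timestamp, open decimal, high decimal, low decimal, "
        + "close decimal, volume bigint, adj_close decimal, "
        + "PRIMARY KEY (ticker, date)) WITH CLUSTERING ORDER BY (date DESC)";

    // The bind-marker order must match the argument order passed to writer.addRow(...).
    static final String INSERT_STMT =
        "INSERT INTO quote.historical_prices "
        + "(ticker, date, open, high, low, close, volume, adj_close) "
        + "VALUES (?, ?, ?, ?, ?, ?, ?, ?)";

    // Converts one CSV line into values in INSERT_STMT order.
    // Assumed CSV layout: Date,Open,High,Low,Close,Adj Close,Volume
    // A plain split is used, so quoted fields that contain commas are not supported.
    static Object[] toRow(String ticker, String csvLine) {
        String[] f = csvLine.split(",");
        if (f.length != 7) {
            throw new IllegalArgumentException("Expected 7 fields: " + csvLine);
        }
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumed time zone
        try {
            return new Object[] {
                ticker,
                format.parse(f[0].trim()),            // date
                new BigDecimal(f[1].trim()),          // open
                new BigDecimal(f[2].trim()),          // high
                new BigDecimal(f[3].trim()),          // low
                new BigDecimal(f[4].trim()),          // close
                Long.parseLong(f[6].trim()),          // volume
                new BigDecimal(f[5].trim())           // adj_close
            };
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad date in line: " + csvLine, e);
        }
    }

    public static void main(String[] args) {
        // Illustrative input line only.
        Object[] row = toRow("ORCL",
            "2019-10-29,26.600000,26.809999,25.629999,26.160000,26.160000,181000");
        System.out.println("columns: " + row.length);
    }
}
```

With definitions like these, each row could be passed to the writer as writer.addRow(HistoricalPricesRow.toRow(ticker, csvLine)), because addRow accepts the values as a variable-length argument list.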

After you generate SSTable data by using the custom program, import the data as described in section 3.1.