You can create and update schemas of ApsaraDB for HBase by using HBase Shell or Admin in the Java API.
Before you modify column families, disable the table, as shown in the following example:
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
String table = "Test";
admin.disableTable(table); // Disable the table
HColumnDescriptor f1 = ...;
admin.addColumn(table, f1); // Add a column family
HColumnDescriptor f2 = ...;
admin.modifyColumn(table, f2); // Modify a column family
HColumnDescriptor f3 = ...;
admin.modifyColumn(table, f3); // Modify a column family
admin.enableTable(table);
After you change a table or a column family (including the encoding algorithm, compression format, and block size), the changes take effect the next time a major compaction is performed or the StoreFiles are rewritten.
Rules for table schema design
- Keep the size of each region between 10 GB and 50 GB.
- The size of a cell must be no larger than 10 MB. For cells larger than 10 MB, use Medium-sized Objects (MOBs). MOBs are currently not supported by ApsaraDB for HBase and will be supported in version 2.0. If a cell is too large for MOBs, or MOBs are not applicable, store the data directly in Object Storage Service (OSS).
- Typically, a table contains one to three column families. Do not design an ApsaraDB for HBase table in the same way as an RDBMS table.
- A table with one or two column families can be divided by rowkey into about 50 to 100 regions. Note: Within a region, the data of each column family is stored contiguously and separately from the other column families.
- Make your column family name as short as possible because each value in the storage contains a column family name (ignoring prefix encoding).
- If you store time-based device data or logs, define rowkeys that consist of the device ID followed by the time. This produces a table in which old regions receive no additional writes after a certain age. You then have a small number of active regions and a large number of old regions that receive no new writes. Having a large number of regions is acceptable in this case because only active regions consume resources.
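The rowkey layout in the last rule above (device ID plus time) can be sketched with the JDK alone. This is a minimal illustration, not ApsaraDB for HBase code: the class name `DeviceRowKey`, the fixed-width device ID format, and the 8-byte big-endian millisecond timestamp are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the HBase client API.
class DeviceRowKey {

    // Builds a rowkey of the form <deviceId bytes><8-byte big-endian timestamp>.
    // Assumes device IDs have a fixed width so that keys of one device stay adjacent.
    static byte[] build(String deviceId, long epochMillis) {
        byte[] id = deviceId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(id.length + Long.BYTES)
                .put(id)
                .putLong(epochMillis)
                .array();
    }

    public static void main(String[] args) {
        byte[] older = build("device-0001", 1_700_000_000_000L);
        byte[] newer = build("device-0001", 1_700_000_060_000L);
        byte[] other = build("device-0002", 1_600_000_000_000L);

        // HBase orders rowkeys as unsigned bytes, lexicographically.
        // Arrays.compareUnsigned (Java 9+) applies the same ordering here.
        System.out.println(Arrays.compareUnsigned(older, newer) < 0); // true
        System.out.println(Arrays.compareUnsigned(newer, other) < 0); // true
    }
}
```

With this layout, all rows of one device are contiguous and sorted by time, so new writes for a device always land at the end of that device's key range.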
Number of column families
Currently, ApsaraDB for HBase is not optimized for more than one column family. We recommend that you make the number of column families as small as possible.
Flushing and compaction are performed per region. If a flush is triggered by a column family that holds a large amount of data, the adjacent column families are also flushed, even if they hold only a small amount of data.
Compaction is currently triggered based on the number of files in a column family, not on the size of the files.
When flushing and compaction involve multiple column families, many redundant I/O operations are performed. To avoid this problem, design the schema so that flushing and compaction work on only one column family.
Try to use only one column family in the schema. Group columns with similar usage rates into one column family so that each access touches only one column family, which improves efficiency.
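The following toy model illustrates the flush behavior described in this section: when a region is flushed, every column family with buffered data writes a file, however small. It is a sketch of the description above rather than HBase internals; the class name, the family names `d` and `m`, and the sizes are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of a region-wide flush; not HBase code.
class RegionFlushModel {

    // Returns the size of the file written for each column family when the
    // whole region is flushed. Every family with buffered data writes a file.
    static Map<String, Long> flushRegion(Map<String, Long> memstoreBytes) {
        Map<String, Long> filesWritten = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : memstoreBytes.entrySet()) {
            if (e.getValue() > 0) {
                filesWritten.put(e.getKey(), e.getValue());
            }
        }
        return filesWritten;
    }

    public static void main(String[] args) {
        Map<String, Long> memstore = new LinkedHashMap<>();
        memstore.put("d", 128L * 1024 * 1024); // large family that triggers the flush
        memstore.put("m", 64L * 1024);         // small family flushed along with it
        System.out.println(flushRegion(memstore)); // {d=134217728, m=65536}
    }
}
```

In this model the small family produces a 64 KB file on every flush. Many such small files accumulate quickly, and because compaction is triggered by file count, they cause extra compaction work.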
Column family cardinality
If a table has multiple column families, make sure that their cardinalities (that is, the number of rows) do not differ too much. For example, suppose column family A contains one million rows and column family B contains one billion rows. Because regions are split by rowkey, the data of column family A is likely to be spread across many regions (and region servers), which makes scanning column family A very inefficient.
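A rough calculation makes the example concrete. The sketch below assumes, purely for illustration, that regions hold about 10 million rows of the larger column family and that the rows of column family A are spread evenly across the regions. Neither figure comes from ApsaraDB for HBase.

```java
// Back-of-the-envelope estimate; all split sizes are assumed values.
class CardinalityEstimate {

    // Number of regions needed when splits are driven by the larger family.
    static long regionCount(long largeFamilyRows, long rowsPerRegion) {
        return (largeFamilyRows + rowsPerRegion - 1) / rowsPerRegion; // ceiling division
    }

    public static void main(String[] args) {
        long rowsA = 1_000_000L;          // column family A, from the example above
        long rowsB = 1_000_000_000L;      // column family B, from the example above
        long rowsPerRegion = 10_000_000L; // assumed rows per region, illustration only

        long regions = regionCount(rowsB, rowsPerRegion);
        System.out.println("regions: " + regions);                       // regions: 100
        System.out.println("rows of A per region: " + rowsA / regions);  // rows of A per region: 10000
    }
}
```

Under these assumptions, a full scan of column family A has to visit 100 regions to read only about 10,000 rows from each.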
Number of versions
The maximum number of row versions to store is configured per column family by using HColumnDescriptor. The default value is 3. This setting is important because ApsaraDB for HBase does not overwrite values in place. It only appends new versions, which are distinguished by timestamp. Excess earlier versions are deleted when you run a major compaction. The use of this setting is described in the data model section. You can increase or decrease the number of versions based on the needs of your application.
We recommend that you do not set the maximum number of versions to a high value (for example, hundreds or more) unless old data is very important to you. A high value makes the StoreFiles extremely large.
Minimum number of versions
Like the maximum number of row versions, the minimum number of versions to keep is configured per column family by using HColumnDescriptor. The default value is 0, which means that the feature is disabled. The minimum number of versions is used together with the Time To Live (TTL) parameter to express policies such as: keep the last T seconds of data, up to N versions, but at least M versions (where M is the minimum number of versions and M < N). Set this parameter only when TTL is enabled for the column family. Its value must be less than the maximum number of row versions.
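The interaction of T, N, and M can be sketched as a small retention model. This is an illustration of the policy described above, not the compaction code of ApsaraDB for HBase. The class and method names are hypothetical, and treating a version as expired when its age is strictly greater than the TTL is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy retention model for the versions of a single cell; not HBase code.
class VersionRetentionModel {

    // Keeps at most maxVersions (N) of the newest versions. A version older
    // than the TTL (T) is dropped unless it is among the newest minVersions (M).
    static List<Long> retained(List<Long> timestampsMillis, long nowMillis,
                               long ttlSeconds, int maxVersions, int minVersions) {
        List<Long> newestFirst = new ArrayList<>(timestampsMillis);
        newestFirst.sort(Comparator.reverseOrder());

        List<Long> kept = new ArrayList<>();
        for (int i = 0; i < newestFirst.size() && i < maxVersions; i++) {
            long ts = newestFirst.get(i);
            boolean expired = nowMillis - ts > ttlSeconds * 1000L;
            if (!expired || i < minVersions) {
                kept.add(ts);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long now = 100_000L;
        // T = 10 s, N = 3, M = 1: only the two versions inside the TTL window survive.
        System.out.println(retained(List.of(99_000L, 95_000L, 50_000L, 10_000L), now, 10, 3, 1));
        // prints [99000, 95000]

        // Every version is older than the TTL, but M = 1 keeps the newest one.
        System.out.println(retained(List.of(50_000L, 10_000L), now, 10, 3, 1));
        // prints [50000]
    }
}
```

With M = 0 (the default), the second call would return an empty list, because nothing protects versions that are past the TTL.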
Supported data types
ApsaraDB for HBase supports the bytes-in/bytes-out interface through Put and Result, so anything that can be converted into byte arrays can be saved as values. The input can be strings, numbers, complex objects, or even images as long as they can be converted into bytes.
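As a minimal sketch of the bytes-in/bytes-out idea, the following JDK-only example converts a string and a number to byte arrays and back. The class `ValueCodec` is hypothetical. In a real client you would more likely use the `Bytes` utility class that ships with the HBase client library, which is not used here so that the example compiles without it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing that values are just byte arrays.
class ValueCodec {

    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decodeString(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    // Encodes a long as 8 big-endian bytes.
    static byte[] encode(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    static long decodeLong(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }

    public static void main(String[] args) {
        byte[] name = encode("sensor-7");
        byte[] reading = encode(42L);
        System.out.println(decodeString(name)); // sensor-7
        System.out.println(decodeLong(reading)); // 42
        System.out.println(reading.length);      // 8
    }
}
```

The store never interprets these bytes, so the reader must decode each value with the same scheme that the writer used.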
There are practical limits on the size of a value. For example, storing objects of 10 MB to 50 MB in ApsaraDB for HBase may degrade query performance. All rows in ApsaraDB for HBase follow the HBase data model, including versioning. Take this into account when you design the schema, along with the block size of the column families.
Time To Live (TTL)
You can set a TTL, in seconds, for a column family. ApsaraDB for HBase automatically deletes data once it reaches its expiration time. ApsaraDB for HBase treats TTL times as UTC.
StoreFiles that contain only expired rows are deleted during minor compaction. To disable this feature, set the hbase.store.delete.expired.storefile parameter to false, or set the minimum number of versions to a non-zero value.
The latest version of ApsaraDB for HBase will support setting a TTL on individual cells. A cell TTL is submitted as an attribute of the update request (such as an Append, Increment, or Put) by using Mutation#setTTL. If the TTL attribute is set, it is applied to all cells updated by the operation. Cell TTLs differ from column family TTLs in two notable ways:
1. Cell TTLs are expressed in milliseconds instead of seconds.
2. A cell TTL cannot extend the lifetime of a cell beyond the TTL set for the column family.
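The two differences can be combined into a single calculation. The sketch below is a hypothetical helper, not part of the HBase API: it converts the column family TTL to milliseconds and caps the cell TTL at that value.

```java
// Illustrative calculation of the TTL that actually applies to a cell.
class EffectiveTtl {

    // The column family TTL is given in seconds and the cell TTL in milliseconds.
    // The cell TTL can shorten, but never extend, the column family TTL.
    static long effectiveTtlMillis(long familyTtlSeconds, long cellTtlMillis) {
        long familyTtlMillis = familyTtlSeconds * 1000L;
        return Math.min(familyTtlMillis, cellTtlMillis);
    }

    public static void main(String[] args) {
        // Family TTL of 1 day, cell TTL of 1 hour: the cell TTL applies.
        System.out.println(effectiveTtlMillis(86_400L, 3_600_000L));  // 3600000
        // Family TTL of 1 hour, cell TTL of 1 day: capped at the family TTL.
        System.out.println(effectiveTtlMillis(3_600L, 86_400_000L));  // 3600000
    }
}
```

A common mistake is to pass a value in seconds where milliseconds are expected, which makes cells expire 1,000 times sooner than intended.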