TableStore data models-WideColumn and Timeline-Alibaba Cloud Developer Community

TableStore is the first distributed multi-model database developed by Alibaba Cloud, and it belongs to the NoSQL family. NoSQL databases are by now familiar to most application developers. Many application systems no longer rely solely on relational databases in the storage layer. Instead, they choose different types of databases for different business scenarios. For example, cached key-value data is stored in Redis, document data in MongoDB, and graph data in Neo4j.
Looking back, NoSQL was born in the Web 2.0 era. The rapid growth of the Internet brought an explosion of data that traditional relational databases struggled to handle, which created the need for highly scalable distributed databases. However, building a highly available and scalable distributed database on top of the traditional relational data model is very challenging. Most data on the Internet has a very simple data model and does not need relational modeling. If we set the relational model aside, model data with something simpler, weaken transactions and constraints, and make high availability and scalability the goals, the resulting databases can better meet business needs. This idea is what drove the development of NoSQL.
In summary, the development of NoSQL was driven by the new business challenges and database requirements of the Internet era. As a result, NoSQL has several distinctive characteristics:
  • Multiple data models: to meet different data requirements, many different data models have been created, such as KeyValue, Document, Wide Column, Graph, and Time Series. This is one of the most prominent features of NoSQL development: it breaks away from the constraints of the relational model and diversifies. The choice of data model is more scenario-oriented and closer to actual business needs, so it can be optimized more deeply.
  • High concurrency and low latency: the development of NoSQL databases is driven mainly by the needs of online services. The design goal is to provide high-concurrency, low-latency access for those services.
  • High scalability: to cope with explosive growth in data volume, scalability is one of the core design goals. The underlying architecture is therefore usually designed as a distributed architecture from the start.
As the trend statistics from DB-Engines show, NoSQL databases of all kinds have been booming in recent years. As a distributed NoSQL database, Alibaba Cloud TableStore adopts a multi-model architecture and supports two data models: WideColumn and Timeline.
The WideColumn model is a classic model proposed by Bigtable and widely adopted by similar systems. Today, a large share of the world's semi-structured and structured data is stored in systems built on this model. In addition to the WideColumn model, we have introduced a new data model: Timeline. The Timeline model is a new-generation model for message data. It is suited to message storage and synchronization in messaging systems such as IM, Feeds, and IoT device messaging, and it is already widely used in these scenarios. Next, let's look at the two models in detail.
 
The preceding figure is a diagram of the Wide Column model. To better understand this model, we compare it with the relational model. Leaving transactions and constraints aside, a relational model can be understood as a two-dimensional model made of rows and columns, with a fixed schema for every row. Its defining features are therefore: two-dimensional and fixed schema. The Wide Column model is a three-dimensional model that adds a time dimension to rows and columns. The time dimension shows up in the attribute columns: an attribute column can hold multiple values, and each value has a timestamp as its version. Each row is schema-free and has no strict schema definition. In short, compared with the relational model, the Wide Column model is three-dimensional and schema-free, with simplified transactions and constraints.
Let's look at the composition of this model in detail. The main parts are:
  • Primary Key: each row has a primary key, which consists of 1 to 4 columns. The primary key has a fixed schema and uniquely identifies a row.
  • Partition Key: the first column of the primary key is called the partition key. The table is range-partitioned by the partition key, and each partition is served by a different machine. Cross-row transactions are supported among rows that share the same partition key.
  • Attribute Column: every column in a row other than the primary key columns is an attribute column. An attribute column can hold multiple values, and each value corresponds to a different version. A row can store an unlimited number of attribute columns.
  • Version: each value corresponds to a different version. The version is a timestamp and is used to determine the data lifecycle.
  • Data Type: TableStore supports multiple data types, including String, Binary, Double, Integer, and Boolean.
  • Time To Live (TTL): each table can define a data lifecycle. For example, if the lifecycle is set to one month, data written to the table more than one month ago is automatically cleared. The write time of the data is determined by its version. The version is normally assigned by the server when the data is written, but the application can also specify it.
  • Maximum number of versions (MaxVersion): each table defines the maximum number of versions kept for each column. When a column exceeds this limit, the oldest versions are automatically cleared.
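The components listed above can be sketched as a small in-memory structure. This is only an illustration of the model (row, attribute column, and version, plus TTL and MaxVersion), not TableStore's implementation. The class `ToyWideColumnTable` and its `write` and `read` methods are hypothetical names, and versions are plain integer timestamps supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

# 1 to 4 primary key columns; by convention the first one is the partition key.
PrimaryKey = Tuple[Any, ...]


@dataclass
class ToyWideColumnTable:
    """Hypothetical toy table: row -> attribute column -> {version: value}."""
    max_versions: int = 1
    ttl_ms: Optional[int] = None  # None means data never expires
    rows: Dict[PrimaryKey, Dict[str, Dict[int, Any]]] = field(default_factory=dict)

    def write(self, pk: PrimaryKey, column: str, value: Any, version_ms: int) -> None:
        if not 1 <= len(pk) <= 4:
            raise ValueError("primary key must have 1 to 4 columns")
        versions = self.rows.setdefault(pk, {}).setdefault(column, {})
        versions[version_ms] = value
        # MaxVersion: keep only the newest `max_versions` versions of the column.
        for old in sorted(versions, reverse=True)[self.max_versions:]:
            del versions[old]

    def read(self, pk: PrimaryKey, column: str, now_ms: int) -> List[Tuple[int, Any]]:
        versions = self.rows.get(pk, {}).get(column, {})
        # TTL: a value is visible only while its version is younger than ttl_ms.
        live = [(ts, v) for ts, v in versions.items()
                if self.ttl_ms is None or now_ms - ts < self.ttl_ms]
        return sorted(live, reverse=True)  # newest version first


table = ToyWideColumnTable(max_versions=2, ttl_ms=1000)
pk = ("user_1", "order_9")  # "user_1" plays the role of the partition key
table.write(pk, "status", "created", version_ms=100)
table.write(pk, "status", "paid", version_ms=200)
table.write(pk, "status", "shipped", version_ms=300)
print(table.read(pk, "status", now_ms=400))   # [(300, 'shipped'), (200, 'paid')]
print(table.read(pk, "status", now_ms=1250))  # [(300, 'shipped')]
```

The third write pushes the column past two versions, so the oldest one is dropped. The second read happens after version 200 has outlived the TTL, so only version 300 remains visible.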
 
In summary, the characteristics of the Wide Column model are: a three-dimensional structure (row, column, and time), wide rows, multi-version data, and lifecycle management. For data operations, the Wide Column model provides two types of data access APIs: the Data API and the Stream API.

Data API

For more information about the Data API, see the official documentation. The Data API is the standard API for online data reads and writes. It includes:
  • PutRow: inserts a new row. If the row already exists, it is overwritten.
  • UpdateRow: adds or deletes attribute columns in a row, or updates the values of existing attribute columns. If the row does not exist, a new row is inserted.
  • DeleteRow: deletes a row.
  • BatchWriteRow: updates multiple rows across multiple tables in one batch. PutRow, UpdateRow, and DeleteRow operations can be combined.
  • GetRow: reads a single row.
  • GetRange: scans a range of rows in forward or reverse order.
  • BatchGetRow: reads multiple rows across multiple tables.
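The difference between these operations is easiest to see on a toy table. The sketch below is not the TableStore SDK. `ToyTable` and its method signatures are hypothetical, each row keeps a single version per column, and `get_range` takes a lower bound, an upper bound, and a direction flag as a simplification.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

PrimaryKey = Tuple[Any, ...]
Row = Dict[str, Any]


class ToyTable:
    """Hypothetical in-memory stand-in used only to illustrate row semantics."""

    def __init__(self) -> None:
        self._rows: Dict[PrimaryKey, Row] = {}

    def put_row(self, pk: PrimaryKey, columns: Row) -> None:
        # PutRow: insert, or replace the whole row if it already exists.
        self._rows[pk] = dict(columns)

    def update_row(self, pk: PrimaryKey, put: Optional[Row] = None,
                   delete: Iterable[str] = ()) -> None:
        # UpdateRow: merge into the existing row; create the row if it is missing.
        row = self._rows.setdefault(pk, {})
        row.update(put or {})
        for name in delete:
            row.pop(name, None)

    def delete_row(self, pk: PrimaryKey) -> None:
        self._rows.pop(pk, None)

    def get_row(self, pk: PrimaryKey) -> Optional[Row]:
        row = self._rows.get(pk)
        return dict(row) if row is not None else None

    def get_range(self, start: PrimaryKey, end: PrimaryKey,
                  forward: bool = True) -> List[Tuple[PrimaryKey, Row]]:
        # Scan primary keys in [start, end), in ascending or descending order.
        keys = sorted(k for k in self._rows if start <= k < end)
        if not forward:
            keys.reverse()
        return [(k, dict(self._rows[k])) for k in keys]


t = ToyTable()
t.put_row(("u1", 1), {"name": "a", "age": 30})
t.put_row(("u1", 1), {"name": "b"})            # overwrite: "age" is gone
t.update_row(("u1", 2), put={"name": "c"})     # row did not exist: inserted
t.update_row(("u1", 2), put={"city": "HZ"}, delete=["name"])
print(t.get_row(("u1", 1)))  # {'name': 'b'}
print(t.get_range(("u1", 0), ("u1", 9), forward=False))
# [(('u1', 2), {'city': 'HZ'}), (('u1', 1), {'name': 'b'})]
```

A batch operation would apply several of these calls in a single request. It is omitted here to keep the sketch short.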

Stream API

Relational databases do not define a standard API for incremental data. Yet in many application scenarios built on traditional relational databases, incremental data (the binlog) is too useful to ignore. It is widely used inside Alibaba, where middleware such as DRC is provided to take full advantage of it. Once incremental data is accessible, it enables many things in the technical architecture:
 
However, useful as the incremental data of relational databases is, the industry has no standard API for obtaining it. In TableStore, we recognized the value of this data early and provide a standardized API that exposes it: the Stream API (see the documentation).
The Stream API includes:
  • ListStream: obtains the Stream ID of a table.
  • DescribeStream: obtains the details of a Stream, including the list of shards in the Stream and the shard structure tree.
  • GetShardIterator: obtains an iterator pointing at the current incremental data of a shard.
  • GetStreamRecord: reads incremental data from a shard using the shard iterator.
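The four operations are meant to be chained: find the stream, list its shards, open an iterator per shard, then pull records until the iterator is exhausted. The sketch below runs that loop against an in-memory stand-in. `ToyStreamService`, `consume_all`, and all signatures are hypothetical and only mirror the operation names above. They are not the SDK's API, and the toy ends a shard at the end of its log, which a live stream would not do.

```python
from typing import Dict, List, Optional, Tuple

Iterator = Tuple[str, int]  # hypothetical iterator: (shard ID, offset into the log)


class ToyStreamService:
    """In-memory stand-in; each shard is an ordered list of change records."""

    def __init__(self, table: str, shards: Dict[str, List[str]]) -> None:
        self._stream_id = f"{table}_stream"
        self._shards = shards

    def list_stream(self, table: str) -> List[str]:
        return [self._stream_id]

    def describe_stream(self, stream_id: str) -> List[str]:
        return sorted(self._shards)

    def get_shard_iterator(self, stream_id: str, shard_id: str) -> Iterator:
        return (shard_id, 0)

    def get_stream_record(self, iterator: Iterator,
                          limit: int = 2) -> Tuple[List[str], Optional[Iterator]]:
        shard_id, offset = iterator
        log = self._shards[shard_id]
        records = log[offset:offset + limit]
        next_offset = offset + len(records)
        # In this toy, None signals that the shard has been fully read.
        next_iterator = (shard_id, next_offset) if next_offset < len(log) else None
        return records, next_iterator


def consume_all(service: ToyStreamService, table: str) -> Dict[str, List[str]]:
    consumed: Dict[str, List[str]] = {}
    for stream_id in service.list_stream(table):
        for shard_id in service.describe_stream(stream_id):
            iterator = service.get_shard_iterator(stream_id, shard_id)
            records: List[str] = []
            while iterator is not None:  # records within a shard arrive in order
                batch, iterator = service.get_stream_record(iterator)
                records.extend(batch)
            consumed[shard_id] = records
    return consumed


service = ToyStreamService("orders", {
    "shard_a": ["put r1", "update r1", "delete r1"],
    "shard_b": ["put r2"],
})
print(consume_all(service, "orders"))
# {'shard_a': ['put r1', 'update r1', 'delete r1'], 'shard_b': ['put r2']}
```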
 
The implementation of TableStore Stream is much more complex than the MySQL binlog, because TableStore itself is a distributed architecture and Stream is a distributed framework for consuming incremental data. Stream data must be consumed in order. Each shard of a Stream corresponds to a partition of the table, and table partitions can split and merge. To ensure that data from old shards and newly added shards is still consumed in order after a split or merge, we designed a fairly sophisticated mechanism. We will not go into the design of TableStore Stream here, and will publish more detailed design documents later.
The complexity of the Stream internal architecture also leaks into the consumption side, so using the Stream API directly is not that simple. This year, we plan to launch a new data consumption channel service that simplifies Stream data consumption and offers simpler, easier-to-use APIs.
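The text does not describe how that ordering is enforced, so the following is only a sketch of one plausible rule, not TableStore's actual design: finish reading every parent shard before starting any shard created from it by a split or merge. The function `consumption_order` and the `parents` mapping (shard ID to the shards it was created from) are hypothetical.

```python
from typing import Dict, List


def consumption_order(parents: Dict[str, List[str]]) -> List[str]:
    """Return shard IDs so that every shard comes after all of its parents."""
    ordered: List[str] = []
    done = set()
    pending = sorted(parents)
    while pending:
        # A shard is ready once all of its parent shards have been consumed.
        ready = [s for s in pending if all(p in done for p in parents[s])]
        if not ready:
            raise ValueError("cycle or unknown parent in shard lineage")
        for shard in ready:
            ordered.append(shard)
            done.add(shard)
        pending = [s for s in pending if s not in done]
    return ordered


lineage = {
    "s1": [],            # original partition
    "s2": ["s1"],        # s1 split into s2 and s3
    "s3": ["s1"],
    "s4": ["s2", "s3"],  # s2 and s3 later merged into s4
}
print(consumption_order(lineage))  # ['s1', 's2', 's3', 's4']
```

Shards with no ancestor relationship, such as s2 and s3 here, cover disjoint key ranges and could be read in parallel. The sketch serializes them only for simplicity.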
 
The Timeline model is a new data model for message data scenarios. It meets the special requirements of these scenarios: message order preservation, massive message storage, and real-time synchronization.
The figure above is a diagram of the Timeline model. The data in one large table is abstracted as multiple Timelines, and there is no upper limit on the number of Timelines a table can host.
A Timeline consists mainly of:
  • Timeline ID: the ID that uniquely identifies the Timeline.
  • Timeline Meta: the metadata of the Timeline, which can contain arbitrary key-value attributes.
  • Message Sequence: a message queue that holds all messages in the Timeline. Messages are stored in order, and auto-increment IDs are assigned in write order. There is no upper limit on the number of messages a queue can hold. A message can be located directly by its message ID, and the queue can be scanned from that position in forward or reverse order.
  • Message Entry: the message body, which holds the content of the message and can contain arbitrary key-value pairs.
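As a rough illustration of these four parts, the sketch below models a single Timeline in memory. `ToyTimeline` and its methods are hypothetical names, message IDs simply start at 1, and nothing here reflects how TableStore stores Timelines internally.

```python
from typing import Any, Dict, List, Optional, Tuple

Entry = Dict[str, Any]  # Message Entry: arbitrary key-value pairs


class ToyTimeline:
    """Hypothetical in-memory Timeline: ID, meta, and an ordered message sequence."""

    def __init__(self, timeline_id: str, meta: Optional[Dict[str, Any]] = None) -> None:
        self.timeline_id = timeline_id
        self.meta = dict(meta or {})
        self._messages: List[Entry] = []

    def append(self, entry: Entry) -> int:
        # Auto-increment ID assigned in write order (1, 2, 3, ...).
        self._messages.append(dict(entry))
        return len(self._messages)

    def get(self, message_id: int) -> Optional[Entry]:
        # Locate a message directly by its ID.
        if 1 <= message_id <= len(self._messages):
            return self._messages[message_id - 1]
        return None

    def scan(self, from_id: int, limit: int,
             forward: bool = True) -> List[Tuple[int, Entry]]:
        # Scan from a given ID in forward or reverse order.
        last = len(self._messages)
        if forward:
            ids = range(max(from_id, 1), last + 1)
        else:
            ids = range(min(from_id, last), 0, -1)
        return [(i, self._messages[i - 1]) for i in ids[:limit]]


inbox = ToyTimeline("user_42_inbox", meta={"owner": "user_42"})
for text in ["hi", "how are you?", "bye"]:
    inbox.append({"text": text})
print(inbox.get(2))                           # {'text': 'how are you?'}
print(inbox.scan(3, limit=2, forward=False))  # [(3, {'text': 'bye'}), (2, {'text': 'how are you?'})]
```

A table hosting many Timelines could then be pictured as a mapping from Timeline ID to such an object.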
 
Logically, the Timeline model resembles a message queue, with a Timeline playing the role of a topic. The difference is that TableStore Timeline focuses more on the number of topics. In the instant messaging scenario, each user's inbox and outbox are each a topic. In the IoT messaging scenario, each device corresponds to a topic, and the number of topics can reach tens of millions or even hundreds of millions. TableStore Timeline is built on the underlying distributed engine, so a single table can in theory hold an unlimited number of Timelines (topics). It simplifies the Pub/Sub model of queues and supports message order preservation, random positioning, and forward and reverse scanning, which makes it a better fit for scenarios such as instant messaging (IM), Feeds, and IoT messaging systems.
 
For more information about the origin of the Timeline model, see the article "Implementation of message push and storage architecture in modern IM systems". For specific applications, see the article "TableStore Timeline: easily build tens of millions of IM and Feed stream systems".
 
Timeline is a new data model released last year, and we are still polishing it. Based on this model, we have helped DingTalk, Cainiao intelligent customer service, Taobao Ticket gathering, and intelligent device management to build messaging systems in the fields of instant messaging, Feeds, and IoT.
 
Finally, you are welcome to join our internal DingTalk group (group number: 11789671) for discussion.