Abstract: This article describes the general framework behind a feed stream system and how its architecture is designed.
By Jing Shaoqiang, nicknamed Shaoqiang at Alibaba.
Smartphones replaced feature phones, or "dumb phones," almost a decade ago, but we are still feeling the effects today. This is in large part because the transition ushered in the mobile Internet era and the rise of the "mobile app." In the US, some of the major apps are Facebook, Instagram, Snapchat, and Twitter. In China, some of the most notable apps are the social feed and posting app Weibo, the social messaging "super" app WeChat, the algorithm-powered news app TopBuzz, and the video sharing app Kuaishou. These products have grown rapidly over the last half decade as smartphones exploded in popularity and saturated the market.
One thing that all of the apps mentioned above have in common is that they depend on feed streams, which flow from the top to the bottom of your smartphone screen. These feed streams are usually, but not always, ordered by time, and they are well suited to browsing content on mobile devices. These products have easily replaced their previous-generation counterparts, quickly seizing the remaining market share.
Feed streaming, fundamentally speaking, consists of feeds and streams, and it works by continuously delivering feeds to your smartphone app. A feed is an information unit, such as a post on WeChat Moments or Weibo, a review post, or a short video. A feed stream, in turn, is a constantly updated sequence of such information units. On their mobile devices, users browse the new feeds that publishers continuously push.
Currently, the most popular feed streams in China are the information flows of Weibo, WeChat Moments, and TopBuzz (Toutiao), and the video flows of Kuaishou and TikTok (known in China as Douyin). These products also come with various private message or notification systems. The systems that deliver these feed streams are called feed stream systems. The remaining sections of this article describe the architecture of a feed stream system and how you can design it.
A feed stream is essentially a data stream that delivers information units from N publishers to M recipients based on the following relationships detailed in the infographic below.
A feed stream system is a data stream system in which data is the core. At the data layer, the data is divided into three types:
The three data types are defined as follows:
Consider the following product-related factors while designing a feed stream system.
Also, consider the following factors while selecting a data storage system.
Now let's dive right into how one would go about designing a feed stream system from the top down.
The first step is identifying the type of target product, which may be one of the following:
Refer to the following table to compare the product types.
|Product type||Following relationship||Availability of VIP accounts||Timeliness||Sort by|
|Microblogging||One-way||Yes||Seconds to minutes||Time|
|Short video sharing, such as TikTok||One-way or N/A||Yes||Seconds to minutes||Recommendation|
|Friend's sharing, such as WeChat Moments||Two-way||N/A||Seconds||Time|
This comparison involves the core features of the product types. For example, a two-way relationship is formed between two users who follow each other, but it is only a supplementary feature of microblogging.
The product types may be distinguished on the basis of the following two criteria.
After determining the product type, determine the system design target, which refers to the maximum number of users that the system supports: 100 thousand, 1 million, 10 million, or 100 million.
System design is simple when the number of users is small. The following section describes how to design a feed stream system that supports hundreds of millions of users. To support such a massive user scale, you need to select subsystems with sufficient scalability, availability, and reliability. In a large system, an unreliable subsystem may affect the entire system.
Storage is the most critical component and is the same for all synchronization modes. User messages are stored in a repository. A repository must provide the following three functions:
Therefore, the repository has the following important features:
The following table lists the two types of repositories available.
|Feature||Distributed NoSQL||Relational database (database splitting and table sharding)|
|Scalability||Linear||To be restructured|
|Common system||Table Store and Bigtable||MySQL and PostgreSQL|
Therefore, the following pointers summarize NoSQL functionality.
Refer to the following table to understand the structural design of the repository table when an Alibaba Cloud Tablestore database is used.
|Primary key column||First primary key column||Second primary key column||Property column||Property column|
|Description||The ID of the message sender.||The message ID that might be a timestamp.||Content||Other content|
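The repository layout above can be illustrated with a minimal in-memory sketch. It is not Tablestore code: the class name `MessageStore` and its methods are hypothetical, and the only thing taken from the table is the key design, where rows are identified by the sender ID plus a message ID and can be read back in message-ID order.

```python
import bisect
from collections import defaultdict


class MessageStore:
    """In-memory stand-in for the repository table (hypothetical helper)."""

    def __init__(self):
        self._ids = defaultdict(list)  # sender_id -> sorted list of message IDs
        self._rows = {}                # (sender_id, message_id) -> content

    def put(self, sender_id, message_id, content):
        key = (sender_id, message_id)
        if key not in self._rows:
            # Keep message IDs sorted so that range reads are cheap.
            bisect.insort(self._ids[sender_id], message_id)
        self._rows[key] = content

    def get(self, sender_id, message_id):
        return self._rows.get((sender_id, message_id))

    def range_read(self, sender_id, start_id, limit):
        """Return up to `limit` (message_id, content) pairs with ID > start_id."""
        ids = self._ids.get(sender_id, [])
        begin = bisect.bisect_right(ids, start_id)
        return [(m, self._rows[(sender_id, m)]) for m in ids[begin:begin + limit]]
```

Using a timestamp-like message ID as the second key column means that "all messages from this sender after position X" becomes a single ordered range read.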
The following figure shows the system architecture based on the selected repository type.
After determining the system scale, product type, and storage system type, select a synchronization mode from the following three choices:
However, the push-pull mode requires a system design that is more complex than that in push mode.
The following table compares the three modes:
|Type||Push mode||Pull mode||Push-pull mode|
|User read latency||Milliseconds||Seconds||Seconds|
|System requirement||Robust write capability||Robust read capability||Moderate read and write capabilities|
|Common system||Distributed NoSQL with the LSM architecture, such as Table Store and Bigtable||Cache systems or search systems, such as Redis and Memcached (applicable to sorting by recommendation)||Combination of the two system types|
|Architecture complexity||Simple||Complex||More complex|
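To make the trade-off in the table concrete, here is a minimal sketch of a push-pull hybrid. It assumes the usual reading of the modes: push copies a message to every follower at write time, and pull reads the publishers' own messages at read time. The class `HybridFeed` and the `VIP_FOLLOWER_THRESHOLD` cutoff are illustrative assumptions, not part of the original design.

```python
from collections import defaultdict

VIP_FOLLOWER_THRESHOLD = 2  # hypothetical cutoff, kept tiny for the demo


class HybridFeed:
    def __init__(self):
        self.followers = defaultdict(set)  # publisher -> users following them
        self.following = defaultdict(set)  # user -> publishers they follow
        self.outbox = defaultdict(list)    # publisher -> [(message_id, text)]
        self.inbox = defaultdict(list)     # receiver -> [(message_id, publisher, text)]
        self._next_id = 0

    def follow(self, user, publisher):
        self.followers[publisher].add(user)
        self.following[user].add(publisher)

    def is_vip(self, publisher):
        return len(self.followers[publisher]) >= VIP_FOLLOWER_THRESHOLD

    def publish(self, publisher, text):
        self._next_id += 1
        message_id = self._next_id
        self.outbox[publisher].append((message_id, text))
        if not self.is_vip(publisher):
            # Push mode: copy the message to every follower at write time.
            for user in self.followers[publisher]:
                self.inbox[user].append((message_id, publisher, text))
        return message_id

    def read(self, user):
        # Start from what was pushed, then pull from VIP publishers.
        merged = {m: (m, p, t) for m, p, t in self.inbox[user]}
        for publisher in self.following[user]:
            if self.is_vip(publisher):
                for message_id, text in self.outbox[publisher]:
                    merged[message_id] = (message_id, publisher, text)
        return sorted(merged.values(), reverse=True)  # newest first
```

The sketch shows why the hybrid is more complex than pure push: every read has to merge two sources and remove duplicates, in exchange for avoiding a huge write fan-out for accounts with many followers.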
Now here's a quick summary of the scenarios and modes of synchronization that we discussed above:
If you are using an Alibaba Cloud Tablestore database, refer to the following structural design of the synchronization database table:
|Primary key column||First primary key column||Second primary key column||Property column||Property column||Property column|
|Description||The ID of the message receiver.||The ID of the message, which may be in the format of timestamp+send_user_id, or use the auto-increment column of Table Store.||The ID of the message sender.||The value of the message_id column in store_table, that is, the message ID. You can query the message content in store_table based on the sender_id and message_id.||Other message content that is not included in the synchronization database.|
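The relationship between the synchronization table and `store_table` can be sketched as follows. Plain dictionaries stand in for both tables, and `itertools.count` stands in for an auto-increment column; the function names `push_message` and `read_feed` are hypothetical.

```python
import itertools

store_table = {}  # (sender_id, message_id) -> content
sync_table = {}   # (receiver_id, sequence_id) -> (sender_id, message_id)
_sequence = itertools.count(1)  # stand-in for an auto-increment column


def push_message(sender_id, message_id, content, receiver_ids):
    """Write the content once, then write one small sync row per receiver."""
    store_table[(sender_id, message_id)] = content
    for receiver_id in receiver_ids:
        sync_table[(receiver_id, next(_sequence))] = (sender_id, message_id)


def read_feed(receiver_id, after_sequence_id=0):
    """Return (sequence_id, sender_id, content) for rows newer than a position."""
    rows = sorted(
        (seq, ref)
        for (rid, seq), ref in sync_table.items()
        if rid == receiver_id and seq > after_sequence_id
    )
    return [
        (seq, sender_id, store_table[(sender_id, message_id)])
        for seq, (sender_id, message_id) in rows
        if (sender_id, message_id) in store_table  # skip content that is gone
    ]
```

Because the synchronization rows carry only IDs, the content is written once and resolved at read time, which keeps the per-receiver rows small.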
A complete feed stream system includes not only the basic synchronization and storage functions but also metadata. The following sections describe how metadata is processed.
The metadata in a feed stream system includes:
Next, we will introduce the three types of metadata individually.
User details include custom user properties and attached system properties, which are queried by the user ID. User details can be stored in a distributed NoSQL system or a relational database.
If you are using Alibaba Cloud's NoSQL database product Tablestore, consider the structural design of the user details given below:
|Primary key sequence||First primary key column||Property column-1||Property column-2||......|
|Description||The primary key column used to uniquely identify a user.||User nickname, which is a custom user property.||User gender, which is a custom user property.||Other properties, including the custom user property column and the attached system property column. Table Store is schema-free, so a new column can be added to any row without affecting the existing data.|
This section describes how to store relationships. Design the system to support the query of the following lists, follower lists, and friends lists through the indexing capability. Such querying involves multiple property columns. The storage system might be a relational database or unstructured database.
Refer to the following relationship table for the structural design of Tablestore.
|Primary key sequence||First primary key column||Second primary key column||Property column||Property column|
|Description||The user ID.||The follower ID.||The follow time.||Other property columns.|
Consider the following search index structure:
Take note of the following critical points during a query:
user_id by using TermQuery, and sort the followers by timestamp.
follow_user_id by using TermQuery and sort the followed accounts by timestamp.
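The two lookups can be sketched without any Tablestore API. The following minimal example assumes that a row `(user_id, follow_user_id)` means that `follow_user_id` follows `user_id`, which is one reading of the table above. The class and method names are hypothetical, and a linear scan stands in for the index's exact-match query.

```python
class RelationshipTable:
    """In-memory stand-in for the relationship table and its index."""

    def __init__(self):
        self._rows = {}  # (user_id, follow_user_id) -> follow timestamp

    def add(self, user_id, follow_user_id, timestamp):
        self._rows[(user_id, follow_user_id)] = timestamp

    def followers_of(self, user_id):
        """Exact match on user_id, sorted by follow time."""
        hits = [(ts, f) for (u, f), ts in self._rows.items() if u == user_id]
        return [f for ts, f in sorted(hits)]

    def followed_by(self, follow_user_id):
        """Exact match on follow_user_id, sorted by follow time."""
        hits = [(ts, u) for (u, f), ts in self._rows.items() if f == follow_user_id]
        return [u for ts, u in sorted(hits)]

    def friends_of(self, user_id):
        """Friends are users who appear in both lists (a two-way relationship)."""
        return sorted(set(self.followers_of(user_id)) & set(self.followed_by(user_id)))
```

A real deployment would answer both lookups from an index rather than a scan, but the shape of the queries stays the same: match one column exactly, then sort by timestamp.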
Without a session push pool, receivers perceive new messages from senders only through periodic refreshing on the client side. In this case, the read load on the system grows with the number of clients. When a platform publishes breaking news, many hibernating devices come online at once, the device online rate far exceeds the usual 20% to 30%, and a query storm results. The system may then stop responding and become unavailable to all users.
Therefore, one solution is the maintenance of a session push pool on the server to record a list of online users. After User A sends a message to User B, the server writes the message to the repository and synchronization database and notifies User B of the new message in User B's session that is stored in the session push pool. The message is pushed to the client when User B accesses the session to read the message.
Alternatively, a notification is pushed to the client to instruct it to pull the new message. The session push pool is used in synchronization and is typically held in memory, but its data is essentially metadata and must also be stored persistently. Because the pool only needs to support single-key queries, you can store the pool data in a distributed NoSQL database, a relational database, or a system you already run.
Refer to the structural design of the session table below while using Tablestore to design your own.
|Primary key column sequence||First primary key column||Second primary key column||Property column|
|Description||The ID of the receiver.||The ID of a device. One user may have multiple devices, which may have different read positions. The device ID is used to differentiate these read positions. This column can be ignored if the multi-client feature is not required.||The latest message sequence ID that has been pushed to the receiver's client.|
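A minimal sketch of the session table, assuming one row per online `(receiver_id, device_id)` pair that holds the last sequence ID delivered to that device. The class `SessionPushPool` and its methods are hypothetical names used for illustration only.

```python
class SessionPushPool:
    """Tracks online devices and the last sequence ID pushed to each one."""

    def __init__(self):
        self._sessions = {}  # (receiver_id, device_id) -> last pushed sequence ID

    def connect(self, receiver_id, device_id, last_sequence_id=0):
        self._sessions[(receiver_id, device_id)] = last_sequence_id

    def disconnect(self, receiver_id, device_id):
        self._sessions.pop((receiver_id, device_id), None)

    def notify(self, receiver_id, sequence_id):
        """Return the receiver's online devices that have not seen sequence_id yet."""
        notified = []
        for (rid, device_id), last in list(self._sessions.items()):
            if rid == receiver_id and sequence_id > last:
                self._sessions[(rid, device_id)] = sequence_id  # advance position
                notified.append(device_id)
        return sorted(notified)
```

Only devices that are online and behind the new sequence ID are contacted, so offline or hibernating clients generate no load until they reconnect.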
All types of feed streams, typically except private messages, support the comment feature. In essence, the comment feature is similar to a repository but has an additional relationship with the commented message. Therefore, comments are grouped on the basis of commented messages, and comment querying is done within a specific range. The query method is simple and does not require the complex transactions and Join feature of a relational database, so a distributed NoSQL database is suitable for storing comments.
Follow the steps below to select a storage method:
|Primary key column sequence||First primary key column||Second primary key column||Property column||Property column||Property column|
|Description||The ID of the message that is posted to Weibo or WeChat Moments.||The ID of a comment.||The content of a comment.||The user to which the response is returned.||Other properties.|
To search for comments, create search indexes for the table.
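The grouping and range query described for comments can be sketched in a few lines. This is an in-memory illustration keyed the same way as the comment table above (message ID, then comment ID); the class `CommentTable` and its parameters are hypothetical.

```python
class CommentTable:
    """In-memory stand-in for the comment table."""

    def __init__(self):
        self._rows = {}  # (message_id, comment_id) -> (content, reply_to)

    def add(self, message_id, comment_id, content, reply_to=None):
        self._rows[(message_id, comment_id)] = (content, reply_to)

    def list_comments(self, message_id, after_comment_id=0, limit=20):
        """Range query: comments on one message, in comment-ID order."""
        ids = sorted(
            c for (m, c) in self._rows
            if m == message_id and c > after_comment_id
        )
        return [(c, *self._rows[(message_id, c)]) for c in ids[:limit]]
```

Passing the last seen comment ID as `after_comment_id` gives simple pagination without transactions or joins.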
The likes feature has gained popularity in recent years and is implemented similarly to the comment feature. The only difference is that the likes feature lacks an item that is available in the comment feature, so the two features support the same storage method.
When Table Store is used, the structural design of the likes table is the same as that of the comment table.
The following figure shows the system architecture that includes the metadata system:
The feed stream products require the search capability in the following scenarios:
Searching for such content is based on a string match and only requires the word segmentation retrieval function rather than complex correlation algorithms.
Use the following two methods to implement content searching.
Also, create search indexes for the corresponding table while using Tablestore.
nick_name to the text type with single word segmentation.
store_table. This enables complex searches for feed stream content, such as conditional filtering and full-text retrieval.
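String-match search over segmented words can be illustrated with a tiny inverted index. This sketch assumes whitespace tokenization of lowercase text, which is far simpler than a real word segmenter; the `tokenize` function and `SearchIndex` class are hypothetical and unrelated to any Tablestore API.

```python
from collections import defaultdict


def tokenize(text):
    """Assumed segmentation: lowercase the text and split on whitespace."""
    return text.lower().split()


class SearchIndex:
    def __init__(self):
        self._postings = defaultdict(set)  # token -> IDs of rows containing it

    def add(self, row_id, text):
        for token in tokenize(text):
            self._postings[token].add(row_id)

    def search(self, query):
        """Return the IDs of rows that contain every token in the query."""
        tokens = tokenize(query)
        if not tokens:
            return []
        matches = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(matches)
```

No relevance scoring is involved: a row either contains all query tokens or it does not, which matches the string-match requirement described above.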
The following figure shows the system architecture that supports the search function.
Currently, a feed stream system supports sorting by time or by score. Common feed stream products, such as Weibo, WeChat Moments, and private messages, belong to the timeline type and prefer real-time performance over the quality of published content. You need to follow a user before viewing the content published by that user, and the published content may include useless messages. Sorting by timeline is applicable to these products. The architecture described here belongs to the timeline type.
In other types of feed stream products, the system pushes differentiated content to users based on their preferences. This system architecture is different from the timeline-based architecture and will be described in a subsequent article about recommendations.
A feed stream system must provide a method to process content that a user publishes and then deletes. In push mode, the message has already been delivered to many receivers, so it is difficult to delete every copy as promptly as laws and regulations require.
For this reason, the synchronization table includes only message IDs instead of message content, and the message content is read from the repository when users read messages. Once a message is deleted directly from the repository, users can no longer read any data through its message ID, which is equivalent to deleting the content quickly. In addition to direct deletion, message content may also be deleted through tombstones: deleted feed content is labeled, and when the system queries labeled data, it treats that data as deleted.
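Both deletion styles can be sketched as follows. The example assumes that synchronization rows are `(sender_id, message_id)` pairs resolved against the repository at read time; the `Repository` class and `render_feed` function are hypothetical names.

```python
class Repository:
    """In-memory stand-in for the repository with two ways to delete."""

    def __init__(self):
        self._rows = {}  # (sender_id, message_id) -> {"content", "deleted"}

    def put(self, sender_id, message_id, content):
        self._rows[(sender_id, message_id)] = {"content": content, "deleted": False}

    def hard_delete(self, sender_id, message_id):
        # Direct deletion: the row disappears from the repository.
        self._rows.pop((sender_id, message_id), None)

    def tombstone(self, sender_id, message_id):
        # Tombstone: keep the row but label it as deleted.
        row = self._rows.get((sender_id, message_id))
        if row is not None:
            row["deleted"] = True

    def get(self, sender_id, message_id):
        row = self._rows.get((sender_id, message_id))
        if row is None or row["deleted"]:
            return None
        return row["content"]


def render_feed(sync_rows, repository):
    """Resolve sync rows to content, skipping anything deleted either way."""
    feed = []
    for sender_id, message_id in sync_rows:
        content = repository.get(sender_id, message_id)
        if content is not None:
            feed.append(content)
    return feed
```

In both cases the per-receiver synchronization rows are left untouched, so a delete costs one write regardless of how many followers received the message.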
The logic used to update feed content is similar to the one used to delete feed content. If you use a storage system with multi-version support, such as Table Store, you may edit versions, just as in the case of Weibo.
The above sections describe the features and system requirements of different sub-functions. Two types of systems meet these requirements: a single system based on Alibaba Cloud Tablestore and a combined system based on open-source components.
The design focus of a feed stream system varies depending on different product types. Let's take a quick look at the various product types.
WeChat Moments is a typical feed stream system that uses the push mode and has a limited number of two-way write relationships. Feed content is sorted by time. Users who keep producing useless content are added to a blacklist.
The design of feed stream systems similar to WeChat Moments is detailed in the article "System Architecture Design for WeChat Moments and Similar Systems."
Weibo is a typical feed stream system with one-way relationships and VIP accounts, so it requires both push and pull modes. On Weibo, users actively follow other users, so feed content is sorted by time. Sorting by recommendation is relatively ineffective.
The design of feed stream systems similar to Weibo is detailed in the article "System Architecture Design for Weibo and Similar Systems."
TopBuzz is an app that evolved from the feed stream system of Weibo and has quickly gained popularity in recent years. The system pushes content to users based on their preferences, which are inferred from the content they browse. After a period of training, the pushed content approximates user preferences. Users do not need to follow other users to receive content pushes.
The design of feed stream systems similar to TopBuzz is detailed in the article "System Architecture Design for Toutiao and Similar Systems."
Private messages may be regarded as a simple feed stream system or a variant of instant messaging (IM). Such messages use one-way relationships and do not support group chats.
This article describes the general framework of a feed stream system, including product definitions, synchronization, storage, metadata, comments, likes, sorting, and search features. We hope that this article has helped you to design a feed stream system with hundreds of millions of users.