How to Compress Time Series Data More Efficiently?

The essence of deep learning is to make decisions. When using it to solve specific problems, it is very important to find the matching point, make a reasonable model, and then organize the data to optimize the loss and finally solve the problem better. In the past period of time, we have done some research and exploration on data compression with deep reinforcement learning and achieved some results, which have been published in ICDE 2020 research track (Two-level Data Compression using Machine Learning in Time Series Database) and done Oral report. Here is an overall rough introduction, hoping to bring some reference to other scenarios, at least the compression of other data.

Background description

1 Time series data

Time series data, as the name implies, refers to data related to time series, which is a form of data that can be seen everywhere every day. The figure below lists three examples: a) electrocardiogram, b) stock index, c) specific stock transaction data.

Regarding the work content of the time series database, briefly, it needs to respond to massive queries, analysis, prediction, etc. at the user level; while at the bottom layer it needs to deal with massive reads and writes, compression and decompression, aggregation and other operations, and these The basic unit of operation is time-series data, which is generally (and can be simplified) described uniformly by two 8-byte values.

It is conceivable that any electronic device generates a variety of massive time-series data every day, requiring a large amount of storage space, etc. It is a natural method to compress, store and process it. The focus here is how to perform more efficient compression.

2 Reinforcement Learning
Machine learning can be divided into supervised learning, unsupervised learning, and reinforcement learning according to whether the sample has groundTruth. Reinforcement learning, as the name suggests, is to study hard without stopping, and does not require groundTruth, and the real world often does not have groundTruth. For example, human cognition is often a process of continuous iterative learning. In this sense, reinforcement learning is a process and method that is more in line with or more comprehensive and general to deal with real-world problems, so there is a saying: if deep learning will slowly become a solution to specific problems like C/Python/Java If it is a basic tool for the problem, then reinforcement learning is a basic tool for deep learning.

The classic schematic diagram of reinforcement learning is as follows, the basic elements are State, Action, and Environment. The basic process is: the Environment gives the State, the Agent makes an Action decision based on the State, and the Action acts on the Environment to generate a new State and reward, where the reward is used to guide the Agent to make a better Action decision, and the cycle repeats….

The common supervised learning is much simpler. It can be considered as a special case of reinforcement learning. The goal is very clear groudTruth, so the corresponding reward is also relatively clear.

Reinforcement learning can be classified into the following three categories according to personal understanding:

1) DQN

Deep Q network, a type that is more in line with human intuitive feeling logic, it will train a network that evaluates Q-value, and can give the reward of each Action for any state, and then finally select the action with the largest reward to operate. Can. The training process is passed back by evaluating the results of "estimated Q-value" and "real Q-value", and finally makes the network estimate Q-value more and more accurate.

2) Policy Gradient

It is a more end-to-end type that trains a network and directly gives the final action to any state. The scope of application of DQN requires that the Q-value of the continuous state is also relatively continuous (this is not applicable to playing Go, etc.), while Policy Gradient has greater universality because it ignores the internal process and directly gives the action. But its disadvantage is that it is more difficult to evaluate and converge. The general training process is: for a certain state, a variety of actions are randomly taken at the same time, and the results of various actions are evaluated for reverse transmission, and finally the network outputs an action with better effect.

3) Actor-Critic

Try to combine the previous two networks and learn from each other. On the one hand, use the policy gradient network to output the action of any state, and on the other hand, use the DQN network to make a better quantitative evaluation of the action output of the policy gradient and use it to guide the policy gradient. renew. As the name suggests, it's like a performer-critic relationship. The training process needs to train the actor (policy Graident) and critic (QN) network at the same time, but the training of the actor only needs the guidance of the follow critic. It has many variants, and it is also the main direction of continuous development in the current theoretical research of DRL.

Compression of Time Series Data
Compressing massive time-series data is an obvious thing, so there are many researches and explorations in academia and industry. Some methods are:

Snappy: Compresses integers or strings, mainly using long-distance prediction and run-length encoding (RLE). Wide applications include Infuxdb.
Simple8b: first perform delta processing on the data before and after, if they are the same, use RLE encoding; otherwise, according to a code table with 16 entries, pack 1 to 240 numbers (the bits of each number according to the code table) to the data in units of 8B , has a wide range of applications including Infuxdb.
Compression planner: introduces some general compression tools such as scale, delta, dictionary, huffman, run length and patched constant, etc., and then proposes to use static or dynamic methods to combine and try these tools for compression; the idea is quite novel, but the actual performance will be lower is a problem.
ModelarDB: focuses on lossy compression, and compresses based on the tolerable loss given by the user. The basic idea is to maintain a small buff, detect whether the pre-order data conforms to a certain pattern (straight line fitting of the slope), and if it fails, switch modes and start the buff again; it is more suitable for supporting lossy IoT fields.
Sprintz: It is also effective in the IoT field, focusing on 8/16 bit integer processing; mainly using scale for prediction and RLC for difference encoding and bit-level packing.
Gorilla: The compression algorithm of sofa used in Facebook's high-throughput real-time system, which performs lossless compression and is widely used in various fields such as IoT and cloud services. It introduces delta-of-delta to process timestamps, transforms data with xor and then uses Huffman encoding and bit-packing. An example graph is shown below.
MO: Similar to Gorilla, but removes bit-packing, all data operations are basically byte-aligned, which reduces the compression ratio but improves processing performance.

There are many related compression algorithms, in general:

They basically support single mode, or limited partial static mode for data compression.
Many use bit-packing (even lossy compression) in order to improve the compression ratio, but it is not very friendly to the increasingly widely used parallel computing.
Two-stage deep learning-based compression algorithm
1 Characteristics of Time Series Data Compression
Time series data comes from various aspects such as IoT, finance, Internet, business management and monitoring, etc., and the morphological characteristics are very different, and the requirements for data accuracy are also different. If there is only one unified compression algorithm for indiscriminate processing, it should be based on a lossless algorithm that uses 8B data for data description.

The figure below is an example of some time-series data in Alibaba Cloud’s business. Whether it is lossless at the macro or micro level, the data patterns are varied, not only shape curves, but also data accuracy. Therefore, it is necessary for the compression algorithm to support as many compression modes as possible, and then select one of them for compression effectively and economically.

For a large-scale commercial time-series data compression algorithm, three important features need to be focused on:

Time correlation: Time series data has a strong time correlation, and the corresponding data is basically continuous. Sampling intervals are usually 1s, 100ms etc.
Pattern diversity: As shown in the figure above, there will be a large gap between patterns and features.
Data massiveness: The amount of data that needs to be processed every day, every hour, and every second is massive. The overall processing data is at least at the level of 10P per day. The corresponding compression algorithm needs to be efficient and have high throughput.
2 The core concept of the new algorithm
Tracing back to the source, the essence of data compression can be divided into two stages: first, the Transform stage transforms the data from one space to another more regular space, and then in the difference encoding stage, various methods are used to better identify the transformed difference.

According to the characteristics of time series data, the following six basic transform primitives (extensible) can be defined:

Then define the following 3 basic differential coding primitives (extensible):

Next, compress the above two tools permutations and combinations, which is feasible but the effect is definitely not very good, because the cost proportion of mode selection and related parameters is too high, and 2B (primitive choice + primitive parameter) control information is required, accounting for 8B needs to express 25% of the data.

A better method should be to abstract and layer the characteristics of the data, as shown in the schematic diagram below. Create a control parameter set to better express all situations, and then select appropriate parameters at the global (a timeline) level to determine a search space (including only a small number of compression modes, such as 4); and then perform each point in detail When compressing, it traverses to select the best compression mode for compression. The proportion of control information is ~3%.

3 Two-stage compression framework AMMMO
The overall process of AMMMO (adatpive multiple mode middle-out) is divided into two stages. The first stage determines the overall characteristics of the current timeline (determines the specific values of 9 control parameters); then in the second stage, a small amount of compression mode traverse and find the last one for compression, the specific block diagram is as follows:

The mode selection in the second stage is not difficult, and the logic is simple and suitable for high-efficiency execution; in the first stage, it is a relatively big challenge to determine the parameter values (9 here) to obtain a suitable compression space, which needs to be selected from more than 300K theoretical permutations and combinations Find the right one.

4 Rule-based pattern space selection algorithm
An algorithm can be designed, such as creating a scoreboard of the effects of each compression mode, and then traversing all points in a timeline and analyzing and recording, and then selecting the best mode through statistical analysis and comparison. Some obvious problems are:

Is the chosen evaluation metric ideal?
It requires manual thinking and programming, and there are more implementation, debug and maintain workloads.
If the primitives and compression modes in the algorithm are changed, the entire code needs to be refactored. The above choice is not a theoretical choice, and an automatic and intelligent method is needed to support continuous evolution.

deep reinforcement learning

1 Problem Modeling

Simplifying the entire pattern space selection algorithm above is shown in the figure below. We can equate this problem with a multi-objective classification problem. Each parameter is a target, and the value range of each parameter space is the number of categories that can be selected. Deep learning has demonstrated its high usability in image classification, semantic understanding, etc. Similarly, we can also use deep learning to realize the selection problem of the pattern space here, and treat it as a multi-label classification problem.

What kind of network do you use? Considering that the main relationship of recognition is delta/xor, shift, bitmask, etc., cnn is not appropriate, and full-connect mlp is more appropriate. Correspondingly, for all points on a timeline, if there are 3600 points in 1 hour, a total of 3600*8B, it is too much. Considering the similarity of sections within the same timeline, 32 points are regarded as the most basic processing unit.

Next, how to create training samples? How to find labels for samples?

Here we introduce reinforcement learning instead of supervised learning for training because:

1) It is difficult to create samples with labels

32 samples 256B, theoretically there are 256^256 possibilities for sample, for each such sample, it is necessary to traverse 300K possibilities to find the best one. The workload of creating and selecting samples and creating labels is very heavy.

2) This is not a normal one-class-label problem

Given a sample, there is not the only best result. It is very likely that many choices can achieve the same compression effect, and the training of N class (N is basically unknowable) increases a lot of difficulty.

3) An automated approach is needed

The selection of parameters such as the compressed tool may need to be expanded. If the creation of the entire training sample occurs, it needs to be done again. An automated approach is needed.

What kind of reinforcement learning to use? DQN, policy gradient, or actor-critic? As analyzed above, DQN is not suitable for discontinuous reward/action. The parameters here, such as majorMode 0 and 1, are completely different results, so DQN is not suitable. In addition, the compression problem is not easy to evaluate on the one hand, and the network is not so complicated, and actor-critic is not needed. In the end we settled on policy gradient.

The common loss of policy gradient is to use a slowly increasing baseline as a measure to feedback whether the current action is appropriate, but it is not suitable here (the effect is not very good after trying), because here the theoretical block of the sample (256^256 ) state is too much. To this end, we specially designed a loss.

After getting the parameters of each block, consider the correlation of blocks, etc. Statistical methods can be used to aggregate the final parameter settings of the entire timeline.

2 Deep Reinforcement Learning Network Framework
The overall network framework diagram is as follows:

On the training side: randomly select M blocks, copy N copies of each block, and then input them into a fully connected network with 3 hidden layers, use region softmax to get the probability of various choices for each parameter, and then sample each The value of a parameter, after obtaining the parameter, input it to the underlying compression algorithm for actual compression and obtain the compressed value. The copied N blocks are compared with each other to calculate the loss and then do backpropagation. The overall design of loss is:

fn(copi) describes the compression effect, positive feedback is higher than the average value of N blocks, Hcs(copi) is the cross entropy, the higher the probability of the higher score, the better; and vice versa. The following H(cop) is cross entropy as a regularization factor to avoid network curing and converge to a local optimum.

On the inference side, all or partial blocks of a timeline can be input into the network to obtain parameters, perform statistical aggregation and then obtain the parameters of the entire timeline.

result data
1 Experimental design
On the one hand, the test data part randomly selected 28 large timelines under the two major scenarios of Aliyun business IoT and server; on the other hand, it also selected UCR, the most common data set in the field of time series data analysis and mining. The basic information is as follows:

For the comparison algorithm, the comparative algorithms Gorilla, MO and Snappy are selected. Because AMMMO is a two-stage compression algorithm framework, various algorithms can be used for parameter selection in the first stage. Lazy (simple and rude setting of some universal parameters), rnd1000Avg (random 1000 times to take the average value of the effect) is selected here. , Analyze (algorithm using artificial code) and ML (method of deep reinforcement learning), etc.

2 Comparison of compression effects
First of all, from the point of view of the overall compression rate, the two-stage adaptive multi-mode compression of AMMMO has significantly improved effects compared with Gorila/MO, etc., and the average compression rate has increased by about 50%.

Then what about the effect of ML? The following figure compares the compression effect on test set B in the field of view of ML. In general, ML is slightly better than artificially designed algorithms, and significantly better than random average.

3 Operating efficiency
AMMMO draws on the design idea of MO and removes bit-packing. It can not only run at high speed on CPU, but also is especially suitable for parallel computing platforms such as GPU. In addition, AMMMO is divided into two stages, the performance of the first stage will be worse, but in many cases, for example, for the data of a specific device in the past 2 days, the global compression parameters can be reused. The figure below describes the overall performance comparison. The experimental environment is "Intel CPU 8163 + Nvidia GPU P100", and the AMMMO code uses P100.

As can be seen from the figure above, AMMMO can achieve GB/s processing performance at both the compression end and the decompression end, and the performance indicators are still very good.

4 Effects learned by the algorithm
The network trained by deep reinforcement learning looks good from the final effect, so has it really learned meaningful content? The subscript compares the performance of the three algorithms on several test sets. It can be seen that the parameter selection of the ML version is similar to the analysis algorithm/optimal effect selection, especially in the selection of byte offset and majorMode.

What would this compressed fully connected network parameter representation look like? The parameter heatmap visualization is performed on the first layer (positive parameters are red, negative ones are blue, and the larger the value, the brighter the color), as follows:

It can be clearly seen that the 32 points have many regular operations on the same byte, and the vertical lines (if they span bytes will confuse the situation), can be considered as delta or xor operations on the corresponding positions. Then the parameters of Byte0 with the largest number change are also more active.

In summary, what deep learning has learned is quite explanatory.

Machine Learning Open Class | Learn Intelligent Recommendation System from Alibaba Cloud Technical Experts

The Alibaba Cloud machine learning PAI team has launched a series of courses, combining the experience of various recommendation scenarios, to bring you the popularization of knowledge related to the recommendation business. The course includes basic concepts and architecture descriptions of recommendation systems, recall algorithms, sorting algorithms, and online service orchestration. It takes you 10 minutes to implement a simple recommendation system.

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00

phone Contact Us