Time Series Database – Solution to the Problem of Timeline Expansion (High Cardinality)

This article mainly discusses some feasible solutions for InfluxDB when it encounters the problem of high cardinality in the written data.

By Xu Jianwei (Zhuying)

Preface

The mobile market is approaching saturation, and the IT industry is now looking ahead to the Internet of Everything. In Internet of Things (IoT) scenarios, many terminals are deployed in different locations to collect various kinds of data. For example, suppose there are 100,000 IoT devices in a certain area and each of them sends data every five seconds. About 630.7 billion data points will then be generated every year. These data points are generated in time order, share a consistent format, and never need to be deleted or modified. Time series databases emerged to meet these needs.
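The yearly figure follows directly from the device count and the reporting interval. The short Go check below reproduces it; the function name is made up for illustration and a 365-day year is assumed.

```go
package main

import "fmt"

// pointsPerYear returns how many samples a fleet of devices produces in a
// 365-day year when every device reports once per interval.
func pointsPerYear(devices, intervalSeconds int64) int64 {
	const secondsPerYear = 365 * 24 * 60 * 60
	return devices * (secondsPerYear / intervalSeconds)
}

func main() {
	// 100,000 devices, one sample every 5 seconds.
	fmt.Println(pointsPerYear(100_000, 5)) // 630720000000, about 630.7 billion
}
```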

A time series database pursues fast writes, high compression, and fast retrieval, on the assumption that data is never inserted retroactively or updated and that the data structure is stable. The labels (tags) of time series data are indexed to improve query performance, so that the values matching all specified tags can be found quickly. If the number of distinct tag values is too large (the high cardinality problem), the index will have various problems. This article mainly discusses some feasible solutions when InfluxDB encounters high cardinality in the written data.

The High Cardinality Problem (Timeline Expansion)

Time series databases mainly store metric data. Each piece of data is called a sample. The sample consists of the following three parts:

  • Metrics (Time-Series): Metric name and labelsets that describe the characteristics of the current sample
  • Timestamp: A timestamp with millisecond precision
  • Sample Value (Value): The value of the current sample.
<--------- time-series --------> <-timestamp--> <-value->
node_cpu{cpu="cpu0",mode="idle"} @1627339366586 70
node_cpu{cpu="cpu0",mode="sys"} @1627339366586 5
node_cpu{cpu="cpu0",mode="user"} @1627339366586 25

Ordinarily, the labelsets of a time series are limited and enumerable. For example, the possible values of mode in the example above are idle, sys, and user.

Suggestions on labels in the official Prometheus documentation:

CAUTION: Please remember every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

Time series databases are likewise designed on the assumption that timelines have low cardinality. However, with the widespread use of metrics, timeline expansion cannot be avoided in many scenarios.

For example, in cloud-native scenarios some tags hold pod or container IDs, and others hold user IDs or even URLs. The number of timelines grows multiplicatively when these tags are combined.
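To make the combination effect concrete, here is a minimal Go sketch that computes the upper bound on the number of time series as the product of the number of values per label. The label names and counts are invented for illustration.

```go
package main

import "fmt"

// seriesUpperBound returns the maximum number of distinct time series a
// metric can produce: the product of the number of values of each label.
func seriesUpperBound(labelValues map[string]int) int {
	total := 1
	for _, n := range labelValues {
		total *= n
	}
	return total
}

func main() {
	// Bounded, enumerable labels stay small.
	bounded := map[string]int{"cpu": 8, "mode": 3}
	fmt.Println(seriesUpperBound(bounded)) // 24

	// One unbounded label (hypothetical pod_id) multiplies everything else.
	unbounded := map[string]int{"cpu": 8, "mode": 3, "pod_id": 50_000}
	fmt.Println(seriesUpperBound(unbounded)) // 1200000
}
```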

This contradiction is inevitable. How do we solve it? Should the party that writes data reduce the number of time series it writes? Should the time series database change its design to fit this scenario? There is no perfect solution to this problem yet.

In practice, it is acceptable as long as the time series database does not become unavailable after the timeline expands and its performance does not drop exponentially. In other words, when the timeline does not expand, the performance is excellent. After the timeline expands, the performance is still good or at least passable.

How can the performance of the time series database be good when the timeline is expanded? Next, let's discuss this issue through the InfluxDB source code.

Timeline Processing Logic

The main processing logic of the InfluxDB TSM structure is similar to that of an LSM tree. After the data is reported, it is added to the cache and to the log file (WAL). The reported data is later compacted (data files are merged and indexes are rebuilt) to speed up retrieval and improve the compression ratio.

Indexing involves three aspects:

  • TSI (Time Series Index): Retrieve measurement, tag, tagval, and time
  • TSM (Time-Structured Merge Tree): Retrieve time-series -> value
  • Series Segment Index: Retrieve time-series key <-> time-series Id

Please refer to the official article for the specific index implementation of InfluxDB.

When the timeline expands, the retrieval performance of TSI and TSM does not degrade severely. The problem mainly occurs in the Series Segment Index.

In this section, we will discuss the forward index of timeline files in InfluxDB (time-series key -> id, id -> time-series key):

  • SeriesFile is at the database (bucket) level.
  • SeriesIndex mainly handles the index mapping of key -> ID and ID -> key.
  • SeriesSegment mainly stores the ID and key of Series.
  • SeriesIndex contains indexes, such as ID and key of Series. (It can be understood as two HashMaps.)
  • keyIDMap uses the key to find the corresponding ID.
  • idOffsetMap finds the offset through ID and searches the SeriesSegment file through this offset (corresponding location of SeriesSegment) to obtain the key.

The specific code (InfluxDB 2.0.7) is listed below:

tsdb/series_partition.go:30
// SeriesPartition represents a subset of series file data.
type SeriesPartition struct {
    ...
    segments []*SeriesSegment
    index    *SeriesIndex
    seq      uint64 // series id sequence
    ...
}

tsdb/series_index.go:36
// SeriesIndex represents an index of key-to-id & id-to-offset mappings.
type SeriesIndex struct {
    path string
    ...
    data         []byte // mmap data
    keyIDData    []byte // key/id mmap data
    idOffsetData []byte // id/offset mmap data

    // In-memory data since rebuild.
    keyIDMap    *rhh.HashMap
    idOffsetMap map[uint64]int64
    tombstones  map[uint64]struct{}
}
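The relationship between the segment and the two maps can be illustrated with a small standalone sketch. This is not InfluxDB code: the type, the method names, and the 2-byte length-prefixed entry format are assumptions chosen only to show how key -> ID and ID -> offset -> key lookups fit together.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// forwardIndex is a toy stand-in for a SeriesSegment plus its SeriesIndex.
type forwardIndex struct {
	segment  []byte            // append-only log of series keys
	keyID    map[string]uint64 // series key -> series id
	idOffset map[uint64]int    // series id -> offset of the entry in segment
	seq      uint64            // last assigned series id
}

func newForwardIndex() *forwardIndex {
	return &forwardIndex{keyID: map[string]uint64{}, idOffset: map[uint64]int{}}
}

// add returns the id of key, appending a new segment entry if it is unseen.
func (f *forwardIndex) add(key string) uint64 {
	if id, ok := f.keyID[key]; ok {
		return id
	}
	f.seq++
	offset := len(f.segment)
	var hdr [2]byte // hypothetical entry header: key length
	binary.BigEndian.PutUint16(hdr[:], uint16(len(key)))
	f.segment = append(f.segment, hdr[:]...)
	f.segment = append(f.segment, key...)
	f.keyID[key] = f.seq
	f.idOffset[f.seq] = offset
	return f.seq
}

// key resolves an id to its series key by reading the segment at the offset.
func (f *forwardIndex) key(id uint64) (string, bool) {
	offset, ok := f.idOffset[id]
	if !ok {
		return "", false
	}
	n := int(binary.BigEndian.Uint16(f.segment[offset:]))
	return string(f.segment[offset+2 : offset+2+n]), true
}

func main() {
	f := newForwardIndex()
	id := f.add("cpu,host=a")
	k, _ := f.key(id)
	fmt.Println(id, k) // 1 cpu,host=a
}
```

Note that the key itself is stored only once, in the segment. The ID -> key direction holds nothing but an offset, which is why the listing above keeps idOffsetMap as map[uint64]int64.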

When the series key is retrieved, it will be searched in the memory map first and then in the disk map. The specific implementation code is listed below:

tsdb/series_index.go:185
func (idx *SeriesIndex) FindIDBySeriesKey(segments []*SeriesSegment, key []byte) uint64 {
    // Search the in-memory map first.
    if v := idx.keyIDMap.Get(key); v != nil {
        if id, _ := v.(uint64); id != 0 && !idx.IsDeleted(id) {
            return id
        }
    }
    if len(idx.data) == 0 {
        return 0
    }
    hash := rhh.HashKey(key)
    for d, pos := int64(0), hash&idx.mask; ; d, pos = d+1, (pos+1)&idx.mask {
        // Search the on-disk map: read the segment offset stored in this slot.
        elem := idx.keyIDData[(pos * SeriesIndexElemSize):]
        elemOffset := int64(binary.BigEndian.Uint64(elem[:8]))
        if elemOffset == 0 {
            return 0
        }
        // Read the series key stored at that offset and compare it.
        elemKey := ReadSeriesKeyFromSegments(segments, elemOffset+SeriesEntryHeaderSize)
        elemHash := rhh.HashKey(elemKey)
        if d > rhh.Dist(elemHash, pos, idx.capacity) {
            return 0
        } else if elemHash == hash && bytes.Equal(elemKey, key) {
            id := binary.BigEndian.Uint64(elem[8:])
            if idx.IsDeleted(id) {
                return 0
            }
            return id
        }
    }
}

A note on how the in-memory hash map is turned into an on-disk hash map. A hash map is backed by an array. InfluxDB maps the index file into memory with mmap (see keyIDData in SeriesIndex) and then addresses that array by hash. It uses Robin Hood hashing, whose linear probing keeps a lookup within neighboring slots and therefore respects memory locality. (The lookup logic is shown in the code above.) The developers put a lot of thought into porting the Robin Hood hash table from memory to disk.

How are memory map and disk map generated? Why do we need two maps?

InfluxDB first puts newly added series keys into the in-memory hash map. When the in-memory hash map exceeds a threshold, it is merged with the on-disk hash map (by traversing all SeriesSegments and filtering out deleted series keys) to generate a new on-disk hash map. This process is called compaction. After compaction completes, the in-memory hash map is cleared and continues to store new series keys.

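The interplay of the two maps can be sketched as follows. This is a deliberately simplified model: ordinary Go maps stand in for both the in-memory map and the mmap-backed on-disk map, the rebuild merges the two maps instead of re-reading segment files, and all names are hypothetical.

```go
package main

import "fmt"

// seriesIndex models an in-memory map that is periodically folded into a
// rebuilt "on-disk" map once it reaches a threshold.
type seriesIndex struct {
	inMem       map[string]uint64 // keys added since the last compaction
	onDisk      map[string]uint64 // stand-in for the mmap-backed hash map
	tombstone   map[uint64]bool   // deleted series ids
	threshold   int
	compactions int
}

func newSeriesIndex(threshold int) *seriesIndex {
	return &seriesIndex{
		inMem:     map[string]uint64{},
		onDisk:    map[string]uint64{},
		tombstone: map[uint64]bool{},
		threshold: threshold,
	}
}

// insert records a new series key and compacts when the threshold is hit.
func (s *seriesIndex) insert(key string, id uint64) {
	s.inMem[key] = id
	if len(s.inMem) >= s.threshold {
		s.compact()
	}
}

// compact rebuilds the on-disk map from all live keys and clears inMem.
func (s *seriesIndex) compact() {
	rebuilt := make(map[string]uint64, len(s.onDisk)+len(s.inMem))
	for _, src := range []map[string]uint64{s.onDisk, s.inMem} {
		for k, id := range src {
			if !s.tombstone[id] { // tombstoned series are dropped from the index
				rebuilt[k] = id
			}
		}
	}
	s.onDisk = rebuilt
	s.inMem = map[string]uint64{}
	s.compactions++
}

// find checks the in-memory map first and then the on-disk map.
func (s *seriesIndex) find(key string) (uint64, bool) {
	if id, ok := s.inMem[key]; ok && !s.tombstone[id] {
		return id, true
	}
	if id, ok := s.onDisk[key]; ok && !s.tombstone[id] {
		return id, true
	}
	return 0, false
}

func main() {
	s := newSeriesIndex(2)
	s.insert("cpu,host=a", 1)
	s.insert("cpu,host=b", 2) // reaches the threshold and triggers compaction
	fmt.Println(s.compactions, len(s.inMem), len(s.onDisk)) // 1 0 2
}
```

Even in this toy form, the cost profile is visible: every compaction touches every live key, not only the newly added ones.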
tsdb/series_partition.go:200
// Check if we've crossed the compaction threshold.
if p.compactionsEnabled() && !p.compacting &&
    p.CompactThreshold != 0 && p.index.InMemCount() >= uint64(p.CompactThreshold) &&
    p.compactionLimiter.TryTake() {
    p.compacting = true
    log, logEnd := logger.NewOperation(context.TODO(), p.Logger, "Series partition compaction", "series_partition_compaction", zap.String("path", p.path))
    p.wg.Add(1)
    go func() {
        defer p.wg.Done()
        defer p.compactionLimiter.Release()
        compactor := NewSeriesPartitionCompactor()
        compactor.cancel = p.closing
        if err := compactor.Compact(p); err != nil {
            log.Error("series partition compaction failed", zap.Error(err))
        }
        logEnd()
        // Clear compaction flag.
        p.mu.Lock()
        p.compacting = false
        p.mu.Unlock()
    }()
}

tsdb/series_partition.go:569
func (c *SeriesPartitionCompactor) compactIndexTo(index *SeriesIndex, seriesN uint64, segments []*SeriesSegment, path string) error {
    hdr := NewSeriesIndexHeader()
    hdr.Count = seriesN
    hdr.Capacity = pow2((int64(hdr.Count) * 100) / SeriesIndexLoadFactor)
    // Allocate space for maps.
    keyIDMap := make([]byte, (hdr.Capacity * SeriesIndexElemSize))
    idOffsetMap := make([]byte, (hdr.Capacity * SeriesIndexElemSize))
    // Reindex all partitions.
    var entryN int
    for _, segment := range segments {
        errDone := errors.New("done")
        if err := segment.ForEachEntry(func(flag uint8, id uint64, offset int64, key []byte) error {
            ...
            // Save max series identifier processed.
            hdr.MaxSeriesID, hdr.MaxOffset = id, offset
            // Ignore entry if tombstoned.
            if index.IsDeleted(id) {
                return nil
            }
            // Insert into maps.
            c.insertIDOffsetMap(idOffsetMap, hdr.Capacity, id, offset)
            return c.insertKeyIDMap(keyIDMap, hdr.Capacity, segments, key, offset, id)
        }); err == errDone {
            break
        } else if err != nil {
            return err
        }
    }
    ...
}

This design has two drawbacks:

  1. During compaction, all SeriesSegment files are read and all series keys are loaded into memory. A new hash table is built in memory and then written to disk through mmap. When the number of series keys reaches tens of millions or more, memory runs short and OOM errors occur.
  2. During compaction, deleted series keys (marked with a tombstone) are filtered out and no index entries are generated for them. However, a deleted series key in a SeriesSegment is only marked with a tombstone and is never physically removed, so the SeriesSegment files keep growing. In production, the segment files of a single partition can exceed tens of GB in total, and compaction then generates a large amount of I/O.
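A rough estimate of the buffers allocated in compactIndexTo shows why memory becomes a concern. The sketch below follows the capacity formula from the listing above. The element size of 16 bytes matches the 8-byte offset plus 8-byte ID read in FindIDBySeriesKey, while the load factor of 90 is an assumption about the value of SeriesIndexLoadFactor. The pow2 helper here is written for this example. The estimate covers only the two map buffers, not the series keys read from the segments.

```go
package main

import "fmt"

const (
	elemSize   = 16 // bytes per slot: 8-byte offset + 8-byte id
	loadFactor = 90 // assumed value of SeriesIndexLoadFactor (percent)
)

// pow2 returns the smallest power of two that is >= v.
func pow2(v int64) int64 {
	p := int64(1)
	for p < v {
		p *= 2
	}
	return p
}

// compactionBufferBytes estimates the size of the two byte slices
// (keyIDMap and idOffsetMap) allocated while rebuilding the index.
func compactionBufferBytes(seriesN int64) int64 {
	capacity := pow2(seriesN * 100 / loadFactor)
	return 2 * capacity * elemSize
}

func main() {
	// 50 million series keys in one partition.
	fmt.Println(compactionBufferBytes(50_000_000)) // 2147483648 (2 GiB)
}
```

Because the capacity is rounded up to a power of two, the buffer size jumps in steps: crossing a power-of-two boundary doubles the allocation.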

Feasible Solutions

1. Add Partitions or Databases

The forward index of InfluxDB is at the database level. There are two ways to reduce memory usage during compaction: increase the number of partitions, or split measurements across different databases. However, neither of the two is easy to adjust for an InfluxDB instance that already holds data.

2. Modify the Timeline Storage Policy

A hash index offers O(1) lookups, which is very efficient. However, it scales poorly as data grows, because the whole table has to be rebuilt. As a compromise, when a partition grows beyond a certain threshold, its hash index can be converted to a B+ tree index. The performance of a B+ tree degrades only moderately as data grows, which makes it better suited to high cardinality, and it no longer requires global compaction.
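One way to picture this proposal is a partition that hides its index behind an interface and swaps the implementation once a threshold is crossed. The sketch below is purely illustrative: none of these types exist in InfluxDB, and a sorted slice with binary search stands in for a real B+ tree, so it shows the switching logic rather than B+ tree performance.

```go
package main

import (
	"fmt"
	"sort"
)

// keyIndex is the hypothetical interface both index kinds satisfy.
type keyIndex interface {
	put(key string, id uint64)
	get(key string) (uint64, bool)
}

// hashIndex gives O(1) lookups.
type hashIndex map[string]uint64

func (h hashIndex) put(k string, id uint64)     { h[k] = id }
func (h hashIndex) get(k string) (uint64, bool) { id, ok := h[k]; return id, ok }

// orderedIndex keeps keys sorted; it stands in for a B+ tree.
type orderedIndex struct {
	keys []string
	ids  []uint64
}

func (o *orderedIndex) put(k string, id uint64) {
	i := sort.SearchStrings(o.keys, k)
	if i < len(o.keys) && o.keys[i] == k {
		o.ids[i] = id
		return
	}
	o.keys = append(o.keys, "")
	copy(o.keys[i+1:], o.keys[i:])
	o.keys[i] = k
	o.ids = append(o.ids, 0)
	copy(o.ids[i+1:], o.ids[i:])
	o.ids[i] = id
}

func (o *orderedIndex) get(k string) (uint64, bool) {
	i := sort.SearchStrings(o.keys, k)
	if i < len(o.keys) && o.keys[i] == k {
		return o.ids[i], true
	}
	return 0, false
}

// partition starts with a hash index and converts it once it grows too big.
type partition struct {
	idx       keyIndex
	threshold int
}

func newPartition(threshold int) *partition {
	return &partition{idx: hashIndex{}, threshold: threshold}
}

func (p *partition) put(k string, id uint64) {
	p.idx.put(k, id)
	if h, ok := p.idx.(hashIndex); ok && len(h) > p.threshold {
		o := &orderedIndex{}
		for key, v := range h {
			o.put(key, v)
		}
		p.idx = o // one-time conversion; later inserts go to the ordered index
	}
}

func main() {
	p := newPartition(3)
	for i := uint64(1); i <= 5; i++ {
		p.put(fmt.Sprintf("http,url=/%d", i), i)
	}
	_, converted := p.idx.(*orderedIndex)
	id, ok := p.idx.get("http,url=/4")
	fmt.Println(converted, id, ok) // true 4 true
}
```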

3. Construct the Forward Index of the Series Key at the Shard Level

In InfluxDB, each shard covers a time interval, and the number of timelines within one interval is not large. For example, a database may store 180 days of series keys, while a shard generally spans only one day or one hour, so there is a huge gap between the number of series keys the two hold. In addition, building the forward index of series keys at the shard level makes deletion easier. When a shard expires and is deleted, a diff compares all series keys of that shard with those of the other shards, and the series keys that exist nowhere else are deleted.
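The diff on shard expiry can be sketched as a set difference. This is a minimal illustration with hypothetical names, in which each shard's series keys are modeled as a Go set.

```go
package main

import (
	"fmt"
	"sort"
)

// expiredKeys returns the series keys of an expiring shard that no
// remaining shard references, i.e. the keys that can be deleted for good.
func expiredKeys(expiring map[string]struct{}, remaining []map[string]struct{}) []string {
	var dead []string
	for key := range expiring {
		alive := false
		for _, shard := range remaining {
			if _, ok := shard[key]; ok {
				alive = true
				break
			}
		}
		if !alive {
			dead = append(dead, key)
		}
	}
	sort.Strings(dead) // deterministic order for display
	return dead
}

func main() {
	old := map[string]struct{}{"cpu,host=a": {}, "cpu,host=b": {}, "cpu,pod=x1": {}}
	newer := []map[string]struct{}{
		{"cpu,host=a": {}},
		{"cpu,host=b": {}, "cpu,pod=x2": {}},
	}
	fmt.Println(expiredKeys(old, newer)) // [cpu,pod=x1]
}
```

Short-lived keys such as pod IDs disappear together with the shard that held them, so nothing accumulates behind tombstones.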

4. Modify the Timeline Storage Policy Based on Measurements

In the production environment, timeline expansion has a lot to do with measurements. Generally, a few measurements have the timeline expansion problem, but most do not.

We can collect per-measurement timeline statistics when compacting the forward index of the series key. If the timeline of a measurement has expanded, all series keys of that measurement can be switched to a B+ tree. The series keys of measurements that have not expanded continue to use the hash index. As such, this solution performs better than the second one, but the development cost is higher.

Currently, the problem of high cardinality is mainly reflected in the forward index of series keys. In my opinion, the second solution can be adopted in the short term, followed gradually by the fourth. This can solve the problem of timeline growth with little performance degradation and low cost. The third solution involves relatively large changes, but its design is more reasonable, so it can serve as the long-term fix.
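The per-measurement decision can be sketched as a counting pass over the series keys. The example assumes a textual series key of the form "measurement,tag=value" (InfluxDB's internal key encoding differs), and the function name, policy names, and threshold are invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// chooseIndexPolicy counts series per measurement and picks an index kind
// for each one: "btree" when the count exceeds threshold, else "hash".
func chooseIndexPolicy(seriesKeys []string, threshold int) map[string]string {
	counts := map[string]int{}
	for _, key := range seriesKeys {
		measurement := strings.SplitN(key, ",", 2)[0]
		counts[measurement]++
	}
	policy := make(map[string]string, len(counts))
	for m, n := range counts {
		if n > threshold {
			policy[m] = "btree"
		} else {
			policy[m] = "hash"
		}
	}
	return policy
}

func main() {
	keys := []string{
		"cpu,host=a", "cpu,host=b",
		"http,url=/1", "http,url=/2", "http,url=/3",
	}
	policy := chooseIndexPolicy(keys, 2)
	fmt.Println(policy["cpu"], policy["http"]) // hash btree
}
```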

Summary

This article uses InfluxDB to explain the high cardinality problem of time series databases and some feasible solutions. The dimension explosion of metrics causes timeline expansion. Many believe this comes from misuse or abuse of the time series database. However, given today's explosion of information and data, the cost of keeping data dimensions bounded is very high, much higher than the cost of data storage.

A divide-and-conquer strategy is needed to solve this problem and to improve the tolerance of time series databases to dimension explosion. In other words, when timeline expansion occurs, the time series database does not crash, and the metrics without timeline expansion continue to run efficiently. The metrics with timeline expansion may experience slight performance degradation. Tolerating timeline expansion and limiting its blast radius will be a core capability of time series databases.

The sample code in this article was written in Golang and based on the InfluxDB source code. Special thanks to Boshu, Lizi, and Renjie for helping to explain InfluxDB and discussing the timeline expansion problem.
