Optimizing Internet of Vehicles Data with the Window Function

This article describes how to optimize the data collected about the virtual tracks traveled by a vehicle by removing unnecessary data and separating data by vehicle.

By Digoal

This article describes how to optimize the data collected about the virtual tracks traveled by a vehicle by removing unnecessary data and separating data by vehicle.

Overview

One common scenario for Internet of Vehicles (IoV) applications is to collect the virtual tracks traveled by a vehicle. Virtual tracks are usually not generated and reported in real time, but rather consist of aggregated data that is reported at some interval. This aggregated data is from multiple vehicles and often may also represent multiple points in time at the same position. To make sense of this data, it is important that you optimize it.

Typically, the data of virtual tracks consists of the following parameters:

(car_id, pos geometry, crt_time timestamp)

This data includes information such as the vehicle identification information, its position, and the timestamp of the data collected.

From the data collected, you can also infer other information, such as when there were traffic jams and red traffic lights, by a lack of change in the position of the vehicle over time. Data reports are outputted and reported as follows:

1, position 1, '2017-01-01 12:00:00'  
1, position 1, '2017-01-01 12:00:05'  
1, position 1, '2017-01-01 12:00:10'  
1, position 1, '2017-01-01 12:00:15'  
1, position 1, '2017-01-01 12:00:20'  
1, position 2, '2017-01-01 12:00:30'

In order to save space, reducing the size of data collected, it is more efficient to clean tracks. For example, by only retaining two records for a particular position, specifically the first and last times recorded, or the arrival and departure points, you can significantly reduce the size of the data collected.

You can optimize the data can using the window function. Next, it is recommended to clean tracks using a terminal window.

Removing Unnecessary Data

1. Design a table structure.

create table car_trace (cid int, pos point, crt_time timestamp);

In this table, columns will show the car identification information, the position of the car, and time that this information was recorded.

2. Generate 10 million test data records, assuming 1,000 vehicles are available. Having this large dataset will make it easier to repeat the data and test the effect, 25 points are used at the position.

insert into car_trace select random()*999, point((random()*5)::int, (random()*5)::int), clock_timestamp() from generate_series(1,10000000);

3. Create an index with the car identification number and the timestamp.

create index idx_car on car_trace (cid, crt_time);

4. Run the following command to query the data layout for the car with the identification number of 1:

select * from car_trace where cid=1 order by crt_time limit 1000;  
  
   1 | (3,1) | 2017-07-22 21:30:09.84984  
   1 | (1,4) | 2017-07-22 21:30:09.850297  
   1 | (1,4) | 2017-07-22 21:30:09.852586  
   1 | (1,4) | 2017-07-22 21:30:09.854155  
   1 | (1,4) | 2017-07-22 21:30:09.854425  
   1 | (3,1) | 2017-07-22 21:30:09.854493  
  
Several pieces of data are repetitive.

5. Filter the records that were generated at a single position using the window. At most, only two records for arrival and departure at this position will be kept.

Two window functions are used here:

Lag: indicates the previous record of the current record.
Lead : indicates the next record of the current record.

The method you can use to determine the arrival and departure points can be broken up into the following steps:

To identify the arrival point, find a current pos is not the same as the previous pos (or "lag"). The arrival point is the current pos in this case.
To identify the departure point, the current pos is not the same as the next pos (or "lead"). In this case, the departure point is the current pos.
See if the previous pos (or "lag") is empty, indicating that the current record is the first record.
See if the next pos (or "lead") is empty, indicating that the current record is the last record.

For this method, you can run following command using the parameters "lag" and "lead". This involves the methods discussed above.

using select * from   
(  
select   
  *,   
  lag(pos) over (partition by cid order by crt_time) as lag,   
  lead(pos) over (partition by cid order by crt_time) as lead   
from car_trace   
  where cid=1   
  and crt_time between '2017-07-22 21:30:09.83994' and '2017-07-22 21:30:09.859735'  
) t  
  where pos <> lag  
  or pos <> lead  
  or lag is null  
  or lead is null;  
  
 cid |  pos  |          crt_time          |  lag  | lead    
-----+-------+----------------------------+-------+-------  
   1 | (2,1) | 2017-07-22 21:30:09.83994  |       | (3,1)  
   1 | (3,1) | 2017-07-22 21:30:09.839953 | (2,1) | (5,2)  
   1 | (5,2) | 2017-07-22 21:30:09.840704 | (3,1) | (4,4)  
   1 | (4,4) | 2017-07-22 21:30:09.84179  | (5,2) | (5,2)  
   1 | (5,2) | 2017-07-22 21:30:09.843787 | (4,4) | (1,5)  
   1 | (1,5) | 2017-07-22 21:30:09.844165 | (5,2) | (0,5)  
   1 | (0,5) | 2017-07-22 21:30:09.84536  | (1,5) | (4,1)  
   1 | (4,1) | 2017-07-22 21:30:09.845896 | (0,5) | (3,3)  
   1 | (3,3) | 2017-07-22 21:30:09.846958 | (4,1) | (3,1)  
   1 | (3,1) | 2017-07-22 21:30:09.84984  | (3,3) | (1,4)  
   1 | (1,4) | 2017-07-22 21:30:09.850297 | (3,1) | (1,4)  
   1 | (1,4) | 2017-07-22 21:30:09.854425 | (1,4) | (3,1)  
   1 | (3,1) | 2017-07-22 21:30:09.854493 | (1,4) | (3,2)  
   1 | (3,2) | 2017-07-22 21:30:09.854541 | (3,1) | (2,0)  
   1 | (2,0) | 2017-07-22 21:30:09.855297 | (3,2) | (4,1)  
   1 | (4,1) | 2017-07-22 21:30:09.857592 | (2,0) | (4,1)  
   1 | (4,1) | 2017-07-22 21:30:09.857595 | (4,1) | (0,4)  
   1 | (0,4) | 2017-07-22 21:30:09.857597 | (4,1) | (3,1)  
   1 | (3,1) | 2017-07-22 21:30:09.858996 | (0,4) | (3,1)  
   1 | (3,1) | 2017-07-22 21:30:09.859735 | (3,1) |   
(20 rows)

If the track dataset cleaning is not performed, the obtained results are as follows:

select   
  *,   
  lag(pos) over (partition by cid order by crt_time) as lag,   
  lead(pos) over (partition by cid order by crt_time) as lead   
from car_trace   
  where cid=1   
  and crt_time between '2017-07-22 21:30:09.83994' and '2017-07-22 21:30:09.859735';  
  
 cid |  pos  |          crt_time          |  lag  | lead    
-----+-------+----------------------------+-------+-------  
   1 | (2,1) | 2017-07-22 21:30:09.83994  |       | (3,1)  
   1 | (3,1) | 2017-07-22 21:30:09.839953 | (2,1) | (5,2)  
   1 | (5,2) | 2017-07-22 21:30:09.840704 | (3,1) | (4,4)  
   1 | (4,4) | 2017-07-22 21:30:09.84179  | (5,2) | (5,2)  
   1 | (5,2) | 2017-07-22 21:30:09.843787 | (4,4) | (1,5)  
   1 | (1,5) | 2017-07-22 21:30:09.844165 | (5,2) | (0,5)  
   1 | (0,5) | 2017-07-22 21:30:09.84536  | (1,5) | (4,1)  
   1 | (4,1) | 2017-07-22 21:30:09.845896 | (0,5) | (3,3)  
   1 | (3,3) | 2017-07-22 21:30:09.846958 | (4,1) | (3,1)  
   1 | (3,1) | 2017-07-22 21:30:09.84984  | (3,3) | (1,4)  
   1 | (1,4) | 2017-07-22 21:30:09.850297 | (3,1) | (1,4)  
   1 | (1,4) | 2017-07-22 21:30:09.852586 | (1,4) | (1,4)  
   1 | (1,4) | 2017-07-22 21:30:09.854155 | (1,4) | (1,4)  
   1 | (1,4) | 2017-07-22 21:30:09.854425 | (1,4) | (3,1)  
   1 | (3,1) | 2017-07-22 21:30:09.854493 | (1,4) | (3,2)  
   1 | (3,2) | 2017-07-22 21:30:09.854541 | (3,1) | (2,0)  
   1 | (2,0) | 2017-07-22 21:30:09.855297 | (3,2) | (4,1)  
   1 | (4,1) | 2017-07-22 21:30:09.857592 | (2,0) | (4,1)  
   1 | (4,1) | 2017-07-22 21:30:09.857595 | (4,1) | (0,4)  
   1 | (0,4) | 2017-07-22 21:30:09.857597 | (4,1) | (3,1)  
   1 | (3,1) | 2017-07-22 21:30:09.858996 | (0,4) | (3,1)  
   1 | (3,1) | 2017-07-22 21:30:09.859735 | (3,1) |   
(22 rows)

The parameters Lag and lead are used to clear records during the stay of the vehicle between its arrival and departure points.

Optimizing Scanning IO Amplification

Because data collected involves many vehicle IDs, the data of different vehicles is usually aggregated and written into the database. If this data is not optimized, the data of different vehicles may be mixed together or scattered in a confusing way. That is, data for different vehicles may be stored in the same data block.

There are two ways to optimize the data so not have the data of different vehicles mixed together:

Data is aggregated, grouped, and sorted at the business end before being written into the database. For example, after receiving the data submitted by the vehicle terminal, the program groups vehicle IDs, sorts them by time, and writes them into the database (insert into tbl values (),(),...() ;). In this way, the data of the same vehicle will fall into the same data block as much as possible.
The database end uses a partition to reorganize the data. For example, data is stored based on the vehicle ID, each vehicle, or vehicle HASH partition.

These two methods both relate to reorganizing the data based on query requirements to achieve the purpose of decreasing scanning IO amplification.

This method is similar to PostgreSQL Securities Industry Database Requirement Analysis and Application. For more information about this method, refer to this article.

Community

Optimizing Internet of Vehicles Data with the Window Function

Overview

Removing Unnecessary Data

Optimizing Scanning IO Amplification

Read previous post:

Read next post:

digoal

You may also like

Comments

digoal

Related Products

IoT Platform

AnalyticDB for PostgreSQL

Global Internet Access Solution

PolarDB for PostgreSQL