How to achieve efficient matching between massive goods and business districts


According to comprehensive consideration of traffic conditions, distribution of shopping malls, and distribution of residential areas, the city is divided into business districts. The division of commercial districts in some areas of Hangzhou is shown in the figure below.

Commodities are point data distributed randomly on the map by GPS released by users. When the user is within a certain business district, the app will recommend to the user products whose GPS is located in this business district. To achieve precise recommendation services, it is necessary to calculate which products belong to your business district.

In the database, business circles are surface data surrounded by multiple points. These surface data have different shapes and sizes and do not overlap each other. Commodities are point data marked by GPS. How can we quickly and efficiently determine the attribution relationship between massive commodities and business circles? The traditional and direct method is to use geometric spatial relationship calculation formulas to perform direct "point-plane" relationship calculations on massive data to determine whether each commodity is located within each business district.

There are currently 1 billion commodity data, and it is increasing rapidly every day. The total number of commercial districts in all cities across the country is about 10,000, and the size of each commercial district varies from 10 to 80. If geometric point-plane relationship operations are directly used, the required calculation level is about 2 billion billion times of basic operations. Following this line of thought, we tried to use the offline computing cluster inside Alibaba Group to perform calculations. As a result, the cluster failed to give results after running for more than 2 days.

After algorithm improvement, we adopted an algorithm based on GeoHash exact matching, combined with GeoHash inexact matching and fine matching with small-scale geometric relationship operations, which greatly reduced the amount of calculation and efficiently realized massive point-surface data in an offline environment Inclusion relationship calculation. The same is to match 1 billion items of product with 10,000 business district data, and the results can be obtained within 1 day.

▌Principle and Algorithm of Point Data GeoHash

GeoHash is a method of encoding geographic coordinates, which maps two-dimensional coordinates to a string. Each string represents a specific rectangle, and all coordinates within the rectangle share this string. The longer the string, the higher the precision, and the smaller the corresponding rectangular range.

When encoding a geographic coordinate, according to the initial interval range latitude [-90,90] and longitude [-180,180], calculate whether the target longitude and latitude fall in the left interval or the right interval respectively. If it falls in the left interval, it will take 0, and if it falls in the right interval, it will take 1. Then, continue to search in half according to this method for the interval obtained in the previous step to obtain the next binary code. When the length of the code reaches the progress requirement of the business, according to the rule of "longitude is placed in even digits, and latitude is placed in odd digits", the obtained binary codes are interspersed and combined to obtain a new binary string. Finally, according to the comparison table of base32, the binary string is translated into a string, that is, the target GeoHash string corresponding to the geographic coordinates is obtained.

Taking the coordinate "30.280245, 120.027162" as an example, calculate its GeoHash string. First encode the latitude in binary:

1. Divide [-90,90] into 2 parts, if "30.280245" falls in the right interval (0,90), the first digit will be 1.
2. Divide (0,90] into 2 points, and "30.280245" falls in the left interval (0,45], then the second digit is 0.
3. By repeating the above steps, the target range will be smaller and smaller, and the two endpoints of the range will be closer to "30.280245".

The process in the figure below describes the process of the first few iterations in detail: follow the above process and continue to iterate until the number of encoding digits meets the precision requirements of our business. The complete 15-bit binary code iteration table is as follows:

The resulting latitude binary code is 10101 01100 01000.

According to the same process, the longitude is binary coded, and the specific iteration details are as follows: The obtained longitude binary code is 11010 10101 01101.

According to the rule of "longitude in even digits and latitude in odd digits", the binary codes of latitude and longitude are interspersed, and the completed binary codes are: 11100 11001 10011 10010 00111 00010. Since base32 encoding will be used later, every 5 binary numbers correspond to a 32-ary number, so here we convert every 5 binary digits into decimal digits to get 28, 25, 19, 18, 7, 2. Compare the base32 encoding table, and get the corresponding encoding: wtmk72.

The above results can be verified on the website, and the verification results are as follows: the first few digits of the verification results are consistent with our calculation results. If we iterate more times when using the binary method to obtain the binary encoding, we will get a more accurate result with more digits like this in the verification website.

The correspondence between the length and precision of a GeoHash string is as follows:

▌Face data GeoHash encoding implementation
The standard GeoHash algorithm introduced in the previous section can only be used to calculate the GeoHash code corresponding to the two-dimensional point coordinates. In our scene, we also need to calculate the GeoHash code corresponding to the surface data (that is, the POLYGON polygon object in GIS), which requires an extended algorithm to achieve .

The idea of the algorithm is to first find the minimum bounding rectangle MBR of the target Polygon, and calculate the GeoHash code corresponding to the coordinates of the southwest corner of the MBR. Then use the inverse algorithm of GeoHash encoding to reversely solve the rectangular GeoHash block corresponding to this encoding. Starting from this GeoHash block, loop to the east and north to find adjacent GeoHash blocks of the same size, and stop until the found GeoHash block completely exceeds the range of the MBR. For the multiple GeoHash blocks found in this way, the part on the edge may be completely disjoint with the target Polygon, and this part of the block needs to be eliminated through calculation, so as to reduce the amount of subsequent unnecessary calculations.

The high-definition large image of the final result obtained in the above example is as follows, in which the blue GeoHash block is partially intersected with the original Polygon, and the orange GeoHash block is completely contained in the original Polygon.

▌A fast algorithm for finding adjacent GeoHash blocks
The two steps marked green and orange in the flow chart of GeoHash encoding of surface data in the previous section are to find the adjacent GeoHash strings on the east or north respectively.

The traditional method is to find a point inside the adjacent block according to the reverse solution information of the current GeoHash block, and then perform GeoHash encoding on this point, which is the GeoHash encoding of the adjacent block. As shown in the figure below, if we want to calculate the codes of the 8 adjacent blocks around "wtmk72", we must first use the GeoHash inverse algorithm to reverse the coordinates N1, N2, N3, and N4 of the four vertices of "wtmk72", and then use these 4 Calculate the coordinate N5 of any point inside the adjacent block on the right, and then perform GeoHash encoding on N5, and the obtained "wtmk78" is the encoding of the adjacent block on the right we require. According to the same method, the codes of a total of 8 adjacent blocks around "wtmk72" can be obtained.

This method needs to be decoded once and then encoded again, which is time-consuming, especially when the specified GeoHash string is long and needs to be looped many times.

By observing the rules of the GeoHash encoding table and combining the characteristics of the Z-order curve used in GeoHash encoding, a method of quickly finding adjacent GeoHash strings by looking up the table is verified.

Still take the GeoHash string "wtmk72" as an example, the corresponding decimal number is "28, 25, 19, 18, 7, 2", converted to binary is 11100 11001 10011 10010 00111 00010. Among them, w corresponds to 11100, and these 5 binary digits respectively represent "longitude latitude longitude latitude longitude"; t corresponds to 11001, and these 5 binary digits respectively represent "longitude longitude latitude longitude latitude". From this promotion, it can be seen that the binary digits represented by odd-numbered characters ('w', 'm', and '7' in this example) in GeoHash correspond to "longitude, latitude, and longitude" respectively, and even-numbered characters ('w', 'm', and '7' in this example) The binary digits represented by 't', 'k', and '2') correspond to "latitude, longitude, latitude, and longitude".

The binary value of 'w' is 11100, which means "right upper right lower left" when converted into azimuth. The binary value of 't' is 11001, which means "top right bottom left top".

According to the conversion relationship between this character and orientation, we can know that the comparison table of characters and positions on odd-numbered positions is as follows: the comparison table of characters and positions on even-numbered positions is as follows: Here we can see a very interesting phenomenon, the comparison table of odd-numbered positions There is a transpose and flip relationship with the even-numbered bit look-up table.

With the above two character and position comparison tables, you can quickly find out what the 8 characters around each character are. To calculate 8 GeoHash values around a given GeoHash string, if the last character of the string does not exceed the boundary in this direction, the first few characters remain unchanged, and the last character is the adjacent character in this direction. Yes; if the last digit exceeds the boundary of the comparison table in this direction, first calculate the adjacent characters of the penultimate character in this direction, and then calculate the adjacent characters of the last character in this direction (the comparison table is circular Adjacent characters); if the second-to-last adjacent character in this direction also exceeds the boundary of the lookup table, first calculate the penultimate adjacent character in this direction. and so on.

Taking the above "wtmk72" as an example, asking for 8 adjacent character strings of this GeoHash string actually means finding the adjacent characters of the tail character '2'. '2' applies to the even-numbered comparison table, and its 8 adjacent characters are '1', '3', '9', '0', '8', 'p', 'r', 'x', where 'p', 'r', and 'x' have exceeded the lower boundary of the comparison table, and are adjacent relationships obtained by connecting the even-numbered comparison table up and down to form a ring. Therefore, for these 3 "below" adjacent characters beyond the boundary, it is necessary to find the next-to-last adjacent character below, that is, the adjacent character below '7'. '7' is an odd number, and the odd number comparison table is applicable. The adjacent character "below" of '7' in the comparison table is '5', so the 8 adjacent GeoHash strings of "wtmk72" are "wtmk71", "wtmk73", "wtmk79", "wtmk70", "wtmk78", "wtmk5p", "wtmk5r", "wtmk5x". Using this fast algorithm for adjacent strings can greatly improve the efficiency of the surface data GeoHash encoding algorithm in the flow chart in the previous section.

▌Efficiently establish the relationship between massive point data and area data
The idea of ​​establishing the relationship between massive point data and area data is to first calculate the GeoHash codes of the same length for the GPS data (point data) and AOI data (area data) of business districts that need to be matched according to the algorithm described above. Each point data corresponds to a unique GeoHash string; each surface data corresponds to one or more GeoHash codes, and these codes are either "completely contain a string" or "partially contain a string".

a) Join the GeoHash string of each commodity with the "completely contained string" of the business district. The data that appears in the result obtained by join is the definite relationship of "a certain commodity belongs to a certain business district".

b) For the remaining commodities whose relationship has not been determined, perform a join operation on the GeoHash strings of these commodities and the "partially included strings" of the business district. The data that appears in the results obtained by join is a possible relationship of "the product belongs to a certain business district". Next, the geometric relationship calculation is performed on the product GPS and the business district AOI data in this batch of data. , and then screen out the determined relationship of "goods belong to a certain business circle".

As shown in the figure, the GeoHash code of item 1's point data is "wtmk70j", and the join with the area data "completely contains the string wtmk70j" is successful, so it can be directly determined that item 1 belongs to this area data.

The GeoHash code of the point data of product 2 is "wtmk70r", and the join with the "partially contained string wtmk70r" of the surface data is successful. Therefore, product 2 is suspected to belong to the surface data. Whether there is a containment relationship depends on the follow-up point-surface geometry calculation. Sure. The point data GeoHash code of item 3 cannot match any GeoHash block code of the area data, so it can be quickly determined that item 3 does not belong to the area data.

In practical applications, the original GPS range of massive commodities is scattered all over the country, and the data of massive business districts is also distributed in different cities across the country. After the operation of step a), most of the commodity data have already determined the affiliation relationship with the business district. For the remaining unmatched commodity data, after the GeoHash matching in step b), the calculation amount of the subsequent "commodity-business district geometry calculation" can be changed from the Cartesian product of "1 commodity x all business districts in the country" The magnitude is reduced to the magnitude of the Cartesian product of "1 commodity x 1 (or several) business districts", which reduces most unnecessary geometric operations, which are very time-consuming.

In practical application, with 1 billion commodities and 10,000 business district data, using the fast algorithm in this paper, only 1 billion GeoHash point codes + 10,000 GeoHash surface codes + 5 million geometric operations of "whether the point is inside the surface" are needed , roughly converted to the number of times required for basic operations is about 180 billion times, which is far less than the 200 billion times of basic operations required by traditional methods. Using Alibaba's offline computing platform, the algorithm in this paper completes all calculations in less than a day.

In addition, for a given point and polygon, there are more than one algorithm for calculating the containment relationship through geometry, and the most commonly used algorithm is the ray method. To put it simply, it is to make a ray from this point, and judge whether the number of intersections between the ray and the polygon is odd or even. If it is an odd number, the point is inside the polygon; otherwise, the point is outside the polygon.


In the face of the spatial relationship division of massive point-plane data, this paper adopts the idea of reducing the amount of calculation through GeoHash, which essentially uses the idea of spatial indexing. In fact, there are a variety of practical spatial indexes in the GIS field, such as the R tree series (R tree, R+ tree, R* tree), quad tree, K-D tree, grid index, etc. These index algorithms have their own characteristics . The ideas in this paper can not only be used to deal with problems related to point-surface relations, but also can be used to quickly deal with problems such as point-point relations, surface-surface relations, point-line relations, and line-line relations. The affiliation relationship between a large number of bus stops and roads, whether there are intersections between multiple roads or railways, etc.

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00

phone Contact Us