Vision is a bridge connecting the real world

Today I will mainly introduce how vision and related technologies are applied at AutoNavi, and how they help connect the real world. "Connecting the real world" is not just my personal view; it is AutoNavi's mission: "connecting the real world and making travel better".

First, a brief introduction to AutoNavi Maps. It has more than 100 million daily active users and more than 400 million monthly active users. Besides navigation, AutoNavi Maps provides other travel-related services, covering information services, driving navigation, shared travel, smart public transportation, smart scenic spots, cycling, walking, long-distance travel, and other application scenarios.

What Gaode Map (AutoNavi Maps) does is establish the relationship between people and the real world. People need to connect with the real world; the map is the foundation, and more and more information can be obtained through it.

Vision is the bridge to the real world

Vision is the bridge to the real world. Why? In terms of how humans acquire information, about 80% of it comes through vision. In terms of how humans process information, 30%-60% of the brain is devoted to visual perception. From a machine's perspective, vision is likewise a very important means of general-purpose perception.

There are many other ways to perceive the real world, such as other sensors, LT... However, as a general-purpose method, I have always thought vision is the first choice, and it can work in real time.

Another reason is that more than 80% of the elements of the human-built world are designed for vision. We are sometimes so familiar with the real world that we do not notice this, but look at the signs and information around you: almost everything you know was designed to be perceived through vision.

Because humans acquire information mainly through vision, the real world is also designed around vision. Imagine that our main way of acquiring information were smell: the world would be very different. With this in mind, and returning to our own work, it should be no surprise that most map information is also acquired and built through vision.

Visual Technology @ Gaode Map: Map Making
There are many different ways to apply visual technology to Amap, as shown in the following figure:

On the left is map production, covering both conventional maps and high-definition (HD) maps; HD maps serve future autonomous driving. The right side relates to the navigation experience: we are working on positioning, and we also use visual technology to make navigation more convenient. Due to time constraints, today I will only cover conventional map production and navigation.
Where does map data come from? The first step is data collection, and at present most information is collected through cameras, that is, through vision. The real world is huge: there are millions of kilometers of roads across the country, plus all the other information on top of them, which is more than manual methods can currently handle. To a large extent, data must be identified automatically by algorithms. Algorithms cannot reach 100% accuracy, so manual correction is still needed in some cases, and the result becomes the map database that supports map data services.
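The split between automatic recognition and manual correction can be pictured as confidence-based routing. The sketch below is a minimal illustration under assumptions: the `Detection` record, the label strings, and the 0.9 threshold are hypothetical and do not describe AutoNavi's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    element_id: str    # hypothetical identifier of a detected map element
    label: str         # predicted class, e.g. a sign type
    confidence: float  # model confidence in [0, 1]


def route_detections(detections, accept_threshold=0.9):
    """Accept confident detections automatically; queue the rest for manual correction."""
    auto_accepted, needs_review = [], []
    for det in detections:
        if det.confidence >= accept_threshold:
            auto_accepted.append(det)
        else:
            needs_review.append(det)
    return auto_accepted, needs_review


if __name__ == "__main__":
    batch = [
        Detection("sign-001", "speed_limit_60", 0.97),
        Detection("sign-002", "no_left_turn", 0.62),
    ]
    accepted, review = route_detections(batch)
    print("auto:", [d.element_id for d in accepted])
    print("manual:", [d.element_id for d in review])
```

The threshold trades manual workload against error rate: raising it sends more items to human review.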

Conventional map-making tasks usually fall into two categories: road-related tasks and POI signboard recognition. Both rely heavily on visual techniques. In road sign recognition, for example, the algorithm must find every sign along the road and identify each sign's type and content.

There are more than 100 types of road signs. Handling the signs themselves is not that complicated. In practice, however, data sometimes has to be collected at low cost, and ensuring image quality then becomes a problem that must be considered and solved.
Collected images sometimes suffer from distortion, reflections, occlusion, and so on. Even setting aside resolution and compression, imaging depends on lens quality and cost, weather, lighting, and other factors, so the collected data often contains many poor images. The task is therefore not just an idealized algorithm problem; it also has to cope with many practical conditions.

Here are a few examples. The picture on the left below is an actually collected image, and it shows various problems. If you know a little about cameras, you know that cameras have intrinsic and extrinsic parameters. The intrinsic parameters are the focal length, the principal point (image center), and distortion; the extrinsic parameters are the camera's position and orientation. Both affect the resulting image.
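How intrinsics, extrinsics, and distortion combine can be sketched with the standard pinhole model plus a two-term radial distortion polynomial. This is a textbook formulation chosen for illustration, not necessarily the camera model AutoNavi uses; tangential distortion is ignored for brevity.

```python
def project_point(point_world, R, t, fx, fy, cx, cy, k1=0.0, k2=0.0):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    R (3x3 rotation) and t (translation) are the extrinsics; fx, fy, cx, cy
    are the intrinsics; k1, k2 are radial distortion coefficients.
    """
    # Extrinsics: transform the world point into camera coordinates (X_c = R X_w + t).
    xc, yc, zc = (
        sum(R[i][j] * point_world[j] for j in range(3)) + t[i] for i in range(3)
    )
    if zc <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division to normalized image coordinates.
    x, y = xc / zc, yc / zc
    # Radial distortion: points move inward or outward with distance from the center.
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    xd, yd = x * factor, y * factor
    # Intrinsics: scale by focal length and shift by the principal point.
    return fx * xd + cx, fy * yd + cy


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(project_point((1.0, 0.5, 10.0), identity, (0.0, 0.0, 0.0),
                        1000.0, 1000.0, 640.0, 360.0, k1=-0.1))
```

An error in any of these parameters shifts where a real-world point lands in the image, which is why geometric calculations are sensitive to them.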

For recognition, these camera parameters do not cause many problems. But when geometric or positional calculations are needed, lens distortion and inaccurate intrinsic and extrinsic parameters cause big problems. We can largely solve this by bringing multi-source data together and matching it. On the right is a practical example: the camera distortion has been corrected and some oblique angles have been rectified, which greatly improves subsequent algorithm processing.
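Estimating the parameters by matching multi-source data is not shown here; the sketch below covers only the later correction step, once parameters are known. It assumes the same two-term radial model as a standard pinhole camera and inverts it numerically by fixed-point iteration, which converges quickly for moderate distortion.

```python
def distort_normalized(x, y, k1, k2=0.0):
    """Apply the two-term radial distortion model to normalized coordinates."""
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * factor, y * factor


def undistort_normalized(xd, yd, k1, k2=0.0, iterations=20):
    """Invert radial distortion numerically by fixed-point iteration."""
    x, y = xd, yd  # initial guess: the distorted point itself
    for _ in range(iterations):
        r2 = x * x + y * y
        factor = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = xd / factor, yd / factor
    return x, y


def undistort_pixel(u, v, fx, fy, cx, cy, k1, k2=0.0):
    """Map a distorted pixel to where an ideal pinhole camera would have imaged it."""
    xd, yd = (u - cx) / fx, (v - cy) / fy
    x, y = undistort_normalized(xd, yd, k1, k2)
    return fx * x + cx, fy * y + cy
```

Applying `undistort_pixel` to every pixel (or to detected keypoints only) yields the corrected geometry that downstream position calculations depend on.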

Another example is image quality. Some pictures are of poor quality, but they cannot simply be thrown away because they still contain useful information. Some original images are very blurry when zoomed in; image enhancement can make them clearer. Many methods are available for improving raw data quality, which in turn improves the accuracy of recognition algorithms and the efficiency of manual work. Enhancement can also be used for blur detection: by comparing images before and after enhancement, you can tell which ones are blurry and which are not.
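The talk does not specify how blur is scored. One widely used sharpness measure, swapped in here for illustration, is the variance of the Laplacian: sharp images have strong edges and a high variance, blurry images a low one. The same score could be computed before and after enhancement for the comparison described above. The threshold of 100 is a hypothetical value that would need tuning on real data.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian response over interior pixels.

    `gray` is a grayscale image given as a list of equal-length rows.
    """
    if len(gray) < 3 or len(gray[0]) < 3:
        raise ValueError("image must be at least 3x3")
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Discrete Laplacian: sum of the 4 neighbours minus 4x the center.
            responses.append(
                gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def is_blurry(gray, threshold=100.0):
    """Flag an image as blurry when its Laplacian variance falls below a tuned threshold."""
    return laplacian_variance(gray) < threshold
```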

Traffic signs are just one example. Another interesting problem is perceiving electronic eyes (traffic enforcement cameras). Electronic eyes are very small, and small-object detection is a challenging problem that is receiving growing attention in research. You can try it yourself: take a picture, and if the object is too small, zooming in does not make it clear; it can look even worse than the distant view. So how can we find such small electronic eyes more accurately?

The usual approach is to zoom in on a region. Because the target is so small and hard to find, we locate the region, enlarge it, and bring in the surrounding context. That context helps find the small target: enlarging the area slightly reveals other relevant information that supports intelligent detection of the electronic eye.
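The region-enlarging step can be sketched as growing a candidate box around its center to include context, then computing how much to upscale the crop before running a detector on it. The context factor and target size below are hypothetical; as the next paragraph notes, a factor that is too large pulls in irrelevant information.

```python
def expand_box(box, context_scale, img_w, img_h):
    """Grow a box around its center by `context_scale`, clamped to the image.

    box is (x_min, y_min, x_max, y_max) in pixels; context_scale > 1 adds context.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) * context_scale / 2.0
    half_h = (y1 - y0) * context_scale / 2.0
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(img_w), cx + half_w),
        min(float(img_h), cy + half_h),
    )


def upscale_factor(region, target_side=256):
    """Zoom factor that brings the region's longer side up to `target_side` pixels."""
    x0, y0, x1, y1 = region
    longer = max(x1 - x0, y1 - y0)
    if longer <= 0:
        raise ValueError("empty region")
    return target_side / longer


if __name__ == "__main__":
    region = expand_box((100, 100, 120, 110), 3.0, 1920, 1080)
    print(region, upscale_factor(region))
```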
But if the region is too large, it introduces a lot of irrelevant information. Technically, there are solutions. Attention mechanisms are among the most widely used visual techniques today: draw a large box, and the model learns which parts are important and which are not, helping it focus on the target itself. We also try to use prior information, such as where the targets are typically distributed, their height, and their size.
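Attention is learned inside the network and is not shown here. The use of priors, however, can be sketched as rescoring: a detection whose vertical position or size is atypical for electronic eyes gets its confidence reduced. The Gaussian form and all prior statistics below are hypothetical placeholders for values that would be measured from labeled data.

```python
import math


def gaussian_weight(value, mean, std):
    """Unnormalized Gaussian weight: 1.0 at the mean, decaying with distance."""
    return math.exp(-0.5 * ((value - mean) / std) ** 2)


def rescore_with_priors(score, box, img_h, prior):
    """Down-weight a detection whose position or size is atypical for the target class.

    `prior` holds the mean/std of the box center's relative height in the image
    and of the box height in pixels (hypothetical statistics).
    """
    _, y0, _, y1 = box
    center_ratio = (y0 + y1) / 2.0 / img_h
    height_px = y1 - y0
    w_pos = gaussian_weight(center_ratio, prior["center_mean"], prior["center_std"])
    w_size = gaussian_weight(height_px, prior["height_mean"], prior["height_std"])
    return score * w_pos * w_size


if __name__ == "__main__":
    prior = {"center_mean": 0.25, "center_std": 0.1, "height_mean": 20.0, "height_std": 10.0}
    print(rescore_with_priors(0.8, (500, 260, 520, 280), 1080, prior))
    print(rescore_with_priors(0.8, (500, 900, 520, 920), 1080, prior))
```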

Detection alone is not enough, because the real world keeps changing, and we often need to distinguish what has changed from what has not. If an electronic eye was detected before and an electronic eye is detected in new data, we need to know whether the two are the same.

How do we judge? The two pictures look different, but if you look carefully, the background buildings and the mounting type are similar. Algorithms are needed to decide whether they match, which involves object detection, lane attribution, mounting-type analysis, and scene matching. Together, these largely determine what scene this is, and therefore whether the elements in the two pictures are the same.
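Once the individual cues have been computed upstream, they still have to be combined into a same/different decision. The sketch below uses simple rules: cheap categorical cues (lane, mounting type) first, then geometric distance, then scene similarity via cosine similarity of feature vectors. The field names, mounting-type labels, and thresholds are hypothetical, and a production system might learn this combination instead.

```python
import math
from dataclasses import dataclass


@dataclass
class CameraObservation:
    lane_id: int            # lane the camera is attributed to
    mount_type: str         # e.g. "gantry" or "pole" (hypothetical labels)
    position_m: tuple       # (east, north) in a local metric frame
    scene_embedding: tuple  # feature vector describing the surrounding scene


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_same_camera(old, new, max_distance_m=15.0, min_scene_similarity=0.8):
    """Decide whether two observations show the same electronic eye."""
    # Categorical cues must agree exactly.
    if old.lane_id != new.lane_id or old.mount_type != new.mount_type:
        return False
    # Positions must be close in the local metric frame.
    if math.dist(old.position_m, new.position_m) > max_distance_m:
        return False
    # Finally, the surrounding scenes must look alike.
    return cosine_similarity(old.scene_embedding, new.scene_embedding) >= min_scene_similarity
```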

So far I have talked about roads. Here are a few examples related to POIs. POI signboards come in many types, such as archway signs, plaque signs, and storefront signs. Besides the variety of POIs, there is also a variety of non-POIs: if you detect text alone, you will find that much of the text in the real world is not a POI at all but signs, slogans, advertisements, couplets, traffic signs, and so on. It is therefore necessary to distinguish POIs from non-POIs.

There are many other complicated scenes that I won't list one by one, some of which we would not normally think of. One example is a three-dimensional signboard: rather than a flat sign, a fruit supermarket on a street corner may have a sign that bends around the corner. Such a signboard is hard to detect completely in one image, and even when it is detected, it is easily split into two signboards by mistake. The complexity of the real world still causes many problems.
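One simple post-processing heuristic for the split-signboard case is to merge fragments that sit side by side at roughly the same height. The sketch assumes the fragments have already been detected and recognized in image coordinates; the gap and overlap thresholds and the space joiner are hypothetical, and a sign that truly wraps around a corner may still need the three-dimensional reasoning discussed next.

```python
from dataclasses import dataclass


@dataclass
class SignFragment:
    text: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def _horizontal_gap(a, b):
    """Pixels of empty space between two boxes along x (0 if they overlap)."""
    return max(0.0, max(a[0], b[0]) - min(a[2], b[2]))


def _vertical_overlap_ratio(a, b):
    """Vertical overlap as a fraction of the shorter box's height."""
    overlap = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    shorter = min(a[3] - a[1], b[3] - b[1])
    return overlap / shorter if shorter > 0 else 0.0


def merge_fragments(fragments, max_gap=40.0, min_v_overlap=0.6, joiner=" "):
    """Greedily merge left-to-right fragments that sit side by side at the same height."""
    merged = []
    for frag in sorted(fragments, key=lambda f: f.box[0]):
        if merged:
            last = merged[-1]
            if (_horizontal_gap(last.box, frag.box) <= max_gap
                    and _vertical_overlap_ratio(last.box, frag.box) >= min_v_overlap):
                union = (min(last.box[0], frag.box[0]), min(last.box[1], frag.box[1]),
                         max(last.box[2], frag.box[2]), max(last.box[3], frag.box[3]))
                merged[-1] = SignFragment(last.text + joiner + frag.text, union)
                continue
        merged.append(SignFragment(frag.text, frag.box))
    return merged
```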

Faced with so much complexity, each specific scenario must be analyzed on its own terms. In many cases, no single algorithm can solve every problem, and multiple algorithms must be fused. For example, the signboard must be detected, and the text on it must also be detected and recognized; for position, some three-dimensional inference is required; and even after the data is obtained, there are often blurred or occluded parts that require further judgment.

No single method can settle every judgment, and relying on one model does not give the best results. Instead, two or more models are needed to attack each problem from different angles, and this in turn depends on accumulated data.
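The talk does not say how models are combined. One of the simplest options, shown here as an assumption, is late fusion: each model outputs class probabilities and the results are averaged with per-model weights. The class names and weights below are hypothetical.

```python
def fuse_predictions(model_outputs, weights=None):
    """Weighted average of class probabilities from several models.

    model_outputs: list of dicts mapping class name -> probability.
    Returns (best_class, fused_probabilities).
    """
    if not model_outputs:
        raise ValueError("need at least one model output")
    if weights is None:
        weights = [1.0] * len(model_outputs)
    if len(weights) != len(model_outputs):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    # Classes missing from a model's output are treated as probability 0.
    classes = set().union(*model_outputs)
    fused = {
        c: sum(w * out.get(c, 0.0) for w, out in zip(weights, model_outputs)) / total
        for c in classes
    }
    best = max(fused, key=fused.get)
    return best, fused


if __name__ == "__main__":
    image_model = {"poi": 0.6, "non_poi": 0.4}
    text_model = {"poi": 0.3, "non_poi": 0.7}
    layout_model = {"poi": 0.8, "non_poi": 0.2}
    print(fuse_predictions([image_model, text_model, layout_model]))
```

Weights let a model that has proven more reliable on accumulated data count for more in the final decision.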

Some of the problems listed above are fairly complex, and as with all problems, the further we go, the harder they get; we are still working on them. These algorithms largely determine the efficiency of map production and the quality of the maps that reach users, so they are core problems for us.

For POIs, beyond deciding whether text is a POI and recognizing it as described above, we often also need to understand the signboard's layout. For a signboard, we need to know what information it carries: whether it has a main name, whether it has a branch name, and whether it lists contact information or a business scope. All of this has to be done by algorithms.
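A rule-based sketch of layout parsing is shown below. It assumes OCR has already produced text lines with an estimated character height, and uses hypothetical heuristics: the largest text is the main name, a line ending in a branch-like suffix is the branch, and digit runs matching a loose pattern are phone numbers. Real signboards, especially in Chinese, would call for learned models rather than these English-only rules.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TextLine:
    text: str
    height_px: float  # rendered character height, a proxy for font size


@dataclass
class SignLayout:
    main_name: str = ""
    branch: str = ""
    phones: list = field(default_factory=list)
    other: list = field(default_factory=list)


PHONE_PATTERN = re.compile(r"\d[\d\- ]{6,}\d")  # loose, hypothetical phone pattern
BRANCH_SUFFIXES = ("Branch", "Store")           # hypothetical branch markers


def parse_sign(lines):
    """Assign OCR lines of one signboard to layout fields."""
    layout = SignLayout()
    if not lines:
        return layout
    remaining = list(lines)
    main = max(remaining, key=lambda line: line.height_px)  # largest text
    layout.main_name = main.text
    remaining.remove(main)
    for line in remaining:
        phones = PHONE_PATTERN.findall(line.text)
        if phones:
            layout.phones.extend(p.strip() for p in phones)
        elif line.text.endswith(BRANCH_SUFFIXES) and not layout.branch:
            layout.branch = line.text
        else:
            layout.other.append(line.text)  # e.g. business scope
    return layout
```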

Visual Technology @ Gaode Map: Navigation
As described above, map making involves many complexities that must be handled by visual or other algorithms. Next, I will talk about navigation.

Let me start with my own experience. I was on vacation in Spain some time ago, and there are many roundabouts in Europe. Google Maps navigation often told me to take the third exit after entering a roundabout, but it was not always clear what counted as an exit, so I went the wrong way several times. I have never driven in China, and domestic traffic is even more complicated. At Xizhimen in Beijing, for example, sometimes you can turn right directly, and sometimes you have to go around 810 degrees.

We hope to change navigation substantially and make it WYSIWYG (what you see is what you get). If an algorithm can tell people directly which way to go, it will be more useful to them, making driving easier and navigation more direct.

Many cars now have cameras, whether factory-installed or aftermarket, so video data is available in many cases. We overlay the results computed by AI algorithms on the video to show people where to go.
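A common way to draw a guide line on video, assumed here rather than taken from the talk, is to treat the road ahead as locally flat: a 3x3 homography then maps points on the ground plane to image pixels. The matrix would come from camera calibration; the values in the usage example are hypothetical.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity (e.g. the horizon)")
    return u / w, v / w


def project_guide_line(H_ground_to_image, ground_points):
    """Convert a planned path on the road plane into pixel positions for drawing."""
    return [apply_homography(H_ground_to_image, p) for p in ground_points]


if __name__ == "__main__":
    # Hypothetical ground-to-image homography and a path 5-30 m ahead, curving right.
    H = [[100.0, 0.0, 640.0], [0.0, -10.0, 700.0], [0.0, 0.05, 1.0]]
    path = [(0.0, 5.0), (0.5, 10.0), (2.0, 20.0), (4.0, 30.0)]
    print(project_guide_line(H, path))
```

Flat-road homographies break down on slopes and bumps, which is one reason low-cost AR guidance is harder than it looks.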

AutoNavi released an AR navigation product in April this year. One of its features is augmented-reality guidance: it shows a line telling you to keep straight or turn, warns you when you are driving on a lane line, and displays arrows, for example telling you to turn right ahead.
Besides guidance, the product has other functions. For example, forward collision warning has been added: it estimates the distance and speed of the vehicle ahead to help drivers stay safe. Other information can also be displayed more intuitively, such as speed limits, electronic eyes, and zebra crossings; if there is a pedestrian ahead, you will also be alerted.
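With a single camera, one textbook way to estimate distance, assumed here rather than confirmed as AutoNavi's method, is the pinhole relation Z = f * W / w: a vehicle of known real width W that appears w pixels wide at focal length f (in pixels) is about Z meters away. Differencing distances over time gives the closing speed and a time-to-collision (TTC). The 1.8 m vehicle width is an assumed average.

```python
ASSUMED_VEHICLE_WIDTH_M = 1.8  # hypothetical typical car width


def distance_from_width(focal_px, width_px, real_width_m=ASSUMED_VEHICLE_WIDTH_M):
    """Pinhole range estimate: Z = f * W / w."""
    if width_px <= 0:
        raise ValueError("width in pixels must be positive")
    return focal_px * real_width_m / width_px


def time_to_collision(d_prev_m, d_curr_m, dt_s):
    """Seconds until contact at the current closing speed (inf if the gap is not shrinking)."""
    closing_speed = (d_prev_m - d_curr_m) / dt_s  # positive when approaching
    if closing_speed <= 0:
        return float("inf")
    return d_curr_m / closing_speed


if __name__ == "__main__":
    d0 = distance_from_width(1000.0, 90.0)   # earlier frame
    d1 = distance_from_width(1000.0, 100.0)  # 0.5 s later, the car looks wider
    print(d0, d1, time_to_collision(d0, d1, 0.5))
```

A warning would typically fire when TTC drops below a tuned threshold; in practice the noisy per-frame estimates would also be smoothed over time.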

These functions may not look difficult, but they are hard to implement. Why? Because we want a function that everyone can use right away, it has to be very low cost, which is different from an autonomous driving system. On the sensor side, we use a single sensor, a low-cost camera. On the computing side, an autonomous driving system may use dedicated chips drawing several hundred watts, whereas the computing power available to us is only about one-fifth of an ordinary mobile phone's.

Here is an AR navigation example showing the output of the actual algorithm: vehicle detection, lane-line segmentation, and guide-line calculation. As mentioned, achieving high performance with low computing power is a major challenge, so computing efficiency must be considered throughout algorithm development. The means include model compression, training and optimizing small models, combining detection with tracking, multi-target joint models, and fusion with conventional GPS navigation, with several tasks handled in a single model.
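The talk does not say which compression techniques are used. As one common example, post-training quantization stores weights as 8-bit integers plus a scale, cutting memory to a quarter of float32 and enabling integer arithmetic. The sketch shows symmetric per-tensor quantization on a plain list of weights.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w ≈ q * scale, with q in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]


if __name__ == "__main__":
    q, s = quantize_int8([0.5, -1.27, 0.0, 1.27])
    print(q, s, dequantize_int8(q, s))
```

The rounding error per weight is at most half a quantization step (scale / 2), which is usually tolerable after some fine-tuning.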

The real world is very complex. Achieving high-quality, efficient map production, or accurate positioning and navigation, still requires a lot of work in vision. I hope this introduction has given you a better understanding of how visual technology is applied in AutoNavi Maps and in the travel field, and of AutoNavi's mission.
