Elizabeth
Engineer
Engineer
  • UID625
  • Fans2
  • Follows1
  • Posts68
Reads:881Replies:0

MXTrip: Best practice of high-speed overseas access and HA & DR architecture

Created#
More Posted time:Mar 24, 2017 9:31 AM
MXTrip mainly orients to users looking for overseas independent travel services. It provides users with LBS-based services including real-time restaurant information, hotel booking and attraction query, namely finding food, attraction and discount information for going abroad.
Normally, those users will use travel guides and consult tourism experts for overseas independent travel. For Chinese users, finding this information for domestic tourist destinations and attractions is easy, however it can be difficult for overseas destinations and result in complicated preparation work.
To address this challenge, MXTrip emerged and focuses on the following three items:
• Integration and structuralization of global tourism data;
• Construction of mapping knowledge domains and exploration of knowledge entities;
• Intelligent travel.
Among those items, the integration and structuralization of global tourism data is the fundamental service of MXTrip. By obtaining mass unstructured tourism data, this service can filter and integrate tourism data as a comprehensive travel knowledge system and help users start a simpler trip.

Overview of MXTrip data
MXTrip crawls and collects over 2 billion global travel pages, implements global data crawling, transfer and storage and stores more than 200 historical versions. Also, it can track the updates of original webpages with a tracking time of less than 30 minutes for hotspot pages while supporting self-learning intelligent dispatch algorithms, balanced updating of hot and cold data, intelligent sniffing of various travel data sources and automatic parsing and matching of templates.
For large companies, this data volume is acceptable. However, for small and micro companies, it'd be challenging to build a quick, cross-area and scalable distributed crawler system with limited resources.
One of the exclusive highlights of MXTrip is remote crawling, which is not designed for disaster recovery (DR) but for crawling data from a number of useful websites that are inaccessible from China, especially some social media websites. One of such websites is Facebook. In addition to social users, this platform also provides a mass amount of relevant information maintained by merchants themselves, such as addresses, business hours, discount information and even users' consultation and comments on those merchants. All the information is valuable for pre- and mid-trip users. Though MXTrip is using the cross-area crawling model today, it once introduced VPN for crawling, but failed shortly after due to the network bandwidth bottleneck and the quick firewall detection mechanism for traffic anomalies.

Distributed crawling architecture
 
For overall architecture, ESCs are used to build a crawler cluster where each websites has its own agent pool. In this way, IP addresses blocked by website A are still available to website B. Also, the JS render engine is introduced to obtain core data from websites as many websites populate their core data through Ajax asynchronous population.
The clusters deployed in Hangzhou China and the US North have their own message clusters. When a page is crawled, the crawled data will be written into the corresponding message cluster. ETS will then abstract and convert the data, process and store data in both message clusters, and place the processed data in webpage and link libraries. Noticeably, WDB is implemented by using OSS. This design is very useful as OSS is ideal for storing small files for the following three reasons: no restriction on the number of uploaded files, the total storage capacity and the size of a single file. With those three advantages, OSS can store more than 200 historical states for each webpage to simplify the quick tracking of historical webpage data states.
On the other hand, LinkDB is implemented based on MongoDB. Previously MySQL was used for this purpose when the URL volume was less than 300 million as it features a high degree of ease-of-use. However, when the data amount exceeded 300 million, MongoDB was used as LinkDB, which is flexible for expanding the link library to support crawling a greater data volume of webpages.
With intelligent dispatch, the system can dynamically adjust webpage weights by website, webpage weight, update frequency and update time, facilitating crawling hot webpages and balancing crawling speeds of hot and cold URLs. The crawling queue uses the common Redis system, which can adjust weights dynamically.

Data integration and knowledge discovery
After data crawling is complete, other challenges occur to the basic crawled data. Specifically, the system needs to address the following 4 algorithm-level challenges:
• How to effectively obtain webpage content from mass sources in highly-different page structures and multiple languages?
• How to filter and integrate unstructured data from sources with different information qualities?
• How to extract effective entities to create mapping knowledge domains from diverse tourism information?
• How to apply mapping knowledge domains to actual scenarios to help users complete intelligent travel?
 
MXTrip stores most basic data on OSS and RDS as E-MapReduce and ODPS support both of them and can calculate data quickly.
For data calculation, MPI was used when E-MapReduce and ODPS were not used widely. However, MPI suffered from a high complexity and was soon migrated to the E-MapReduce and ODPS platforms as they can provide more comprehensive functions. The reason for using both the big data processing platforms is that a large number of original pages needed to be obtained and extensive data correction and processing occurred frequently due to complicated data processing logic. In this context, both of those Alibaba Cloud computing platforms were available for accelerating data processing and helping with operational implementation and rapid business iteration.
Also, the evolution of the MaxCompute platform and having more useful functions available greatly accelerates business implementation. For example, face recognition, e-commerce image and machine translation functions from the MaxCompute platform and the machine learning algorithms of E-MR and ODPS both dramatically simplify the development work of algorithm teams and enable them to focus more on business indexes rather than building a platform and algorithm model.
By using certain basic modules such as theme model and entity discovery and various algorithms, processing tasks of unstructured data can be combined to form tourism entity and mapping knowledge domains.

MXTrip application scenarios
MXTrip application scenarios are distinct to those of other companies as they orient to different user groups. Among those differences, the most distinct characteristics are:
• 50% of access requests are from overseas, and this requires that overseas images can be uploaded and accessed quickly.
• System HA and remote DR are required for implementing LBS-based instant personalized retrieval.
Overseas access acceleration
Given that apps are services specific to areas, which can be a country or a city, data is normally isolated from area to area. However, MXTrip application scenarios are designed for global travel users without obvious relevance.
This requires that overseas images can be uploaded and accessed quickly. To ensure access efficiency, many similar enterprises use CDN to cache APIs, resulting in similar data seen by overseas users.
Travel users normally like taking selfies and sharing them online. However, uploading images overseas is much more complicated than in China, and cross-country network access always has to go through the same international network backbones. However, this link is poorly maintained due to limited bandwidth and enormous traffic. During the peak surfing hours especially after 9 in the morning and after 6 in the afternoon, this link often suffers from a packet loss rate of greater than 20%, which is totally unstable for network access. As a result, this causes poor user experience.
By completing the steps shown in the figure below, MXTrip manages to accelerate image uploading and improve the stability of accessing APIs. Specifically, these steps are nearby deployment, unilateral/bilateral acceleration and asynchronous pulling/sending back. To implement those functions, Alibaba Cloud can provide ideal supportive functions, namely cloud resolution, OSS and image service.
 

HA & DR architecture optimization
To provide personalized recommendations and retrieval, MXTrip must provide quick and real-time processing capabilities.
In addition, HA and remote DR concerns exist in each enterprise. So far, we have implemented a quasi real-time image backup mechanism in Beijing, and data switchover can be done in a timely manner once the main machine room becomes faulty (or suffers from natural disasters), preventing long-term service downtime.
The figure below shows the architecture of the HA and DR solution.
 
1. Load balancing
Among our services, load balancing is a very useful basic service, which can ensure HA for each server and implement quasi real-time synchronization in active and standby machine rooms to ensure data security and reliability.
Noticeably, the forwarding performance of Server Load Balancers can be significantly compromised when forwarding performance requirements are demanding; for example, multiple accesses occur inside APIs and the most common scenario is accessing databases. To solve this problem, Server Load Balancers are used in the scenarios with lower performance requirements and fewer accesses. If multiple accesses exist for the same request, local HAProxy will be used to implement local forwarding to improve forwarding performance. In this way, the access speed can be greatly improved, which is 400% quicker than Server Load Balancers as VIP is unavailable on classic networks and local HAProxy must be used instead.
2. Message queue
The message queue is a modularized service and an important tool for decoupling different businesses. Currently, MXTrip uses Kafka clusters as our technical teams are experienced in message queue technologies and can process an average of 6 billion messages per day. For this reason, we have rich experience in this field and it is maintained by ourselves at this moment.
3. Data transfer
For data backup, MXTrip uses Otter as the data transfer tool between databases and files. Otter is an open-source software program developed by the Alibaba Group and is initially used for data sync in China-US B2B services, which mainly involves database synchronization in active and standby machine rooms. It is worth noting that PXC is used for online databases as it can accelerate access and provide better autonomy. For other databases such as those deployed in the approval system and CMS, RDS is used instead. With the newly available Alibaba Cloud OSS cross-area backup function, file data synchronization does not rely on Otter anymore and provides better efficiency and timeliness.
4. Cloud monitoring
Initially, the self-developed zabbix service is used for system monitoring. Today, CloudMonitor is used for the same purpose to facilitate customization, which can solve most monitoring problems with a high degree of ease-of-use.
5. Logging system
On the one hand, MXTrip uses the message queue to implement real-time processing; on the other hand, it uses the Alibaba Cloud logging system to complete data analysis through ODPS and to simplify the process of analyzing logs.
6. Others
MXTrip uses Kafka for message clusters and PXC for some databases.
Guest