Large-scale graph searches and real-time computing applications in the Alibaba anti-fraud system - Alibaba Cloud Developer Forums: Cloud Discussion Forums


Posted time: Oct 26, 2016 9:29 AM
Alibaba takes a zero-tolerance approach to fraud on its e-commerce platform. In identifying, preventing, controlling, and punishing fraudulent transactions, Alibaba treats no measure as final and keeps strengthening its defenses. Over many years of operating the world's largest e-commerce platform, Alibaba has accumulated a massive volume of data. Alibaba's e-commerce anti-fraud work is a multidimensional supervision mechanism covering monitoring and alarms, identification and analysis, and punishment and control. In particular, its data monitoring and algorithmic identification of fraudulent transactions rely on real-time big data analysis and processing and on large-scale graph search technology to identify fraudulent behavior anywhere in the system.

1. Taobao anti-fraud system structure
Taobao's anti-fraud system structure can be explained from three perspectives: data, algorithms, and systems.

Data: The system groups identified fraud data into four categories (sellers, products, orders, and buyers) and provides all of it to the data platform, where business staff can use it. The data serves as sample features for algorithm training and supports queries and monitoring of fraud data trends.

Algorithms: The algorithms draw on big data from the account network, transaction network, capital network, and logistics network. They cover the presales, sales, and aftersales stages of the business and identify all kinds of fraudulent behavior.

Systems: This is a set of systems built on top of the data layer, including the monitoring and alarm, online analysis, and risk operations systems. These systems can quickly and effectively discover click farms and immediately stop click farmers from profiting.

In addition, the Taobao anti-fraud system also includes an evaluation system that provides a comprehensive set of methods to evaluate the performance and value of the system. The evaluation system incorporates both manual and algorithm testing. The recall rate and accuracy metrics are used to evaluate the coverage and accuracy of algorithm models, while the implementation rate, purity, and bounce rate are used to evaluate the performance and value of the anti-fraud service.
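
As a rough illustration of the recall and accuracy metrics mentioned above, here is a minimal Python sketch. It assumes that "accuracy" means the share of flagged items that are truly fraudulent (usually called precision) and that "recall rate" means the share of known fraud that the model catches. The order IDs are hypothetical and are not taken from the Taobao system.

```python
def precision_recall(flagged, actual_fraud):
    """Return (precision, recall) for flagged IDs measured against known fraud IDs."""
    flagged, actual_fraud = set(flagged), set(actual_fraud)
    true_positives = len(flagged & actual_fraud)
    # Precision: how many of the flagged items are really fraud.
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: how many of the known fraud items were flagged.
    recall = true_positives / len(actual_fraud) if actual_fraud else 0.0
    return precision, recall


# Hypothetical data: the model flags four orders, three of which are known fraud.
flagged_orders = {"o1", "o2", "o3", "o4"}
known_fraud = {"o1", "o2", "o3", "o5", "o6"}
precision, recall = precision_recall(flagged_orders, known_fraud)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A model can score well on one metric and poorly on the other, which is presumably why the evaluation system tracks both.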

2. Taobao's anti-fraud algorithms
Taobao's anti-fraud algorithm system has been continuously optimized and improved alongside the Taobao platform. In the early years, fraud was simple, such as repeatedly changing a product's listing and delisting times to improve its rank. Simple analysis was usually enough to create rules that dealt with such activities. However, as the platform's business scenarios became more diverse, so did fraud tactics. Still, most fraud was concentrated in two areas: basic product information (for example, incorrect categories, title abuse, exaggerated claims, low-cost credit hyping, advertisement products, repeated distribution, unwanted traffic, and search terms) and bot click farming (for example, creating a large number of fake accounts and having them quickly click certain items in order to increase sales). Click farming is likewise a fraudulent way to obtain more free traffic and keep growing a business on the platform. As Taobao's business evolved and the anti-fraud algorithms were optimized, these kinds of fraud became easy to identify and punish.

Justice always prevails over evil. No matter how many new fraud tactics appear, Taobao's anti-fraud algorithm system can respond quickly. The most important step was to implement real-time big data analysis and processing (over the account network, transaction network, capital network, and logistics network) throughout the entire business chain (presales, sales, and aftersales). As a result, even concealed, sophisticated fraud methods can be modeled and cross-analyzed from multiple angles across massive volumes of data, making it possible to quickly identify and control risks. Taobao's anti-fraud algorithm framework is shown in Fig 1.

Figure 1

The overall anti-fraud algorithm framework incorporates four networks of big data: the account network, transaction network, capital network, and logistics network. It covers the entire e-commerce business chain, from presales, through sales, to aftersales. The algorithm models are organized as a stream computing framework. In the real-time and offline computing modules, data logs are processed into a set of transaction attribute features that serve as the basis for the identification algorithms. Real-time computing quickly analyzes abnormalities in online data (such as abnormal sales or abnormal increases in seller reputation) and converts the data into corresponding features. Offline computing processes the features of all stages in the business chain. By combining online and offline computing, the models account for both long-term and short-term changes in behavior, further improving the speed and accuracy of fraud identification.
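
To make the combination of offline and real-time features more concrete, here is a small Python sketch. It is only an illustration under stated assumptions, not Taobao's implementation: the class name `SalesSpikeDetector`, the window size, and the z-score threshold are all hypothetical. The offline side supplies a long-term baseline of hourly sales, and the real-time side compares a short sliding window against that baseline.

```python
from collections import deque
from statistics import mean, pstdev


class SalesSpikeDetector:
    """Hypothetical feature extractor: short-term sales versus a long-term baseline."""

    def __init__(self, baseline_hourly_sales, window_size=3, threshold=3.0):
        # Long-term statistics, as an offline job might compute them.
        self.baseline_mean = mean(baseline_hourly_sales)
        self.baseline_std = pstdev(baseline_hourly_sales) or 1.0  # avoid division by zero
        # Short-term state, updated as new events stream in.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, hourly_sales):
        """Add one observation and return the features derived from it."""
        self.window.append(hourly_sales)
        short_term_mean = mean(self.window)
        z_score = (short_term_mean - self.baseline_mean) / self.baseline_std
        return {
            "short_term_mean": short_term_mean,
            "z_score": z_score,
            "abnormal": z_score > self.threshold,
        }


# Hypothetical seller whose hourly sales are normally around 10.
detector = SalesSpikeDetector(baseline_hourly_sales=[10, 12, 11, 9, 10, 8])
for sales in [11, 60]:
    print(detector.update(sales))
```

The returned dictionary stands in for the "corresponding features" that a downstream identification model would consume.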
Taobao's anti-fraud algorithm framework covers two Alibaba e-commerce scenarios: daily anti-fraud and major promotion anti-fraud. The algorithms are mainly based on graph mining and online learning.

Online learning can update models in real time for certain rule-based algorithms, which helps defend against fraudsters who probe the system with trial tactics. Generally, rule-based models (decision tree and logistic regression (LR) models) establish strong rules based on transaction features to identify fraud. This type of model is very efficient at identifying the more obvious kinds of product fraud.

Graph mining, in contrast, moves away from a local approach to consider global behavior and identify more subtle fraud techniques. For example, probabilistic graph models model user behavior paths over time: if we assume that normal user behavior trajectories follow a probability distribution, abnormal trajectories will deviate from that distribution at certain points. This can be very effective at identifying bot click farming or fixed-pattern click farming. Graph label propagation models can be used to identify manual click farming, including the sophisticated methods used by highly concealed, well-organized click-farm platforms, with extremely high efficiency and accuracy.

To further verify the precision of the algorithm models, the anti-fraud system added real-time intervention modules that perform cross-check verification and analysis. These include expert knowledge, manual reporting, abnormality monitoring, and manual evaluation. After these external data sources are processed, the output serves as verification for dynamically optimizing the models.
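
The behavior-path idea above can be sketched with a first-order Markov chain, which is one of the simplest probabilistic graph models. This is an assumption made for illustration, since the post does not say which model Taobao uses. The state names, the smoothing value, and the sample paths are hypothetical. Transition probabilities are estimated from normal paths, and a path whose average log-likelihood is unusually low deviates from the learned distribution.

```python
import math
from collections import defaultdict


def fit_transitions(paths, smoothing=1.0):
    """Estimate first-order transition probabilities with additive smoothing."""
    states = sorted({state for path in paths for state in path})
    counts = defaultdict(lambda: defaultdict(float))
    for path in paths:
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    probs = {}
    for current in states:
        total = sum(counts[current].values()) + smoothing * len(states)
        probs[current] = {
            nxt: (counts[current][nxt] + smoothing) / total for nxt in states
        }
    return probs


def avg_log_likelihood(path, probs, floor=1e-6):
    """Average log-probability per step; lower values mean a more unusual path."""
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0.0
    total = sum(math.log(probs.get(a, {}).get(b, floor)) for a, b in steps)
    return total / len(steps)


# Hypothetical training data: paths observed from ordinary shoppers.
normal_paths = [
    ["search", "click", "detail", "cart", "pay"],
    ["search", "click", "detail", "cart", "pay"],
    ["search", "click", "detail", "cart", "pay"],
    ["search", "click", "detail", "favorite", "cart", "pay"],
]
model = fit_transitions(normal_paths)

normal_score = avg_log_likelihood(["search", "click", "detail", "cart", "pay"], model)
# A bot-like path that jumps straight from search to payment, repeatedly.
bot_score = avg_log_likelihood(["search", "pay", "search", "pay"], model)
print(normal_score, bot_score)
```

In this toy setup the bot-like path scores lower than the normal path. A real system would also need a threshold calibrated on labeled data.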

In the anti-fraud system, large-scale graph search technology is applied in the following four models:
1. Label graph model: This model mines communities and gangs in a large-scale attribute graph structure. It differs from the other machine learning algorithms described above in that label propagation algorithms can be applied effectively on the attribute graph to analyze user behavior and find many bot gangs and groups that other algorithms cannot.

2. Probabilistic graph model: In a large-scale graph structure, this model finds relationships between variables. By using a probabilistic graph model, we can effectively analyze the risk level of user information (for example, to prevent the leak of user addresses) and the association between user shopping behavior processes (to identify abnormal behaviors).

3. Data stream graph model: In a large-scale data stream, this model picks out frequent subgraphs. Using data stream mining, we discovered for the first time the "collapsing networks" that the credit hyping activities of bot accounts produce in the capital flow network. At the same time, we can create a "first transfer network" to effectively identify credit hyping users, with an accuracy of 99.9%.

4. Large-scale graph linking model: On the basis of large-scale graph data, this model finds ranks and weights. Using this graph linking method, we can effectively discover repetitive shipment and false shipment behavior. The graph algorithms can concurrently process 500 million lines of graph data from over 100 million nodes. When processing 220 million lines of graph data from 30 million nodes, a call to the graph linking algorithm takes only 14 minutes. The overall algorithm framework also includes real-time computation modules for highly time-sensitive business scenarios (such as the 11/11 shopping festival). In such scenarios, some algorithms can identify fraud with a 0-second lag, and a dynamic adjustment runs every 15 minutes to update all the other models.
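
As a small illustration of the label graph model in item 1 above, here is a minimal seed-based label propagation sketch in Python. It is a toy version under stated assumptions, not the production algorithm: the account IDs are hypothetical, edges are assumed to mean that two accounts share an attribute (such as a device or an address), and a handful of accounts are assumed to be already labeled by earlier investigations.

```python
from collections import Counter, defaultdict


def propagate_labels(edges, seeds, max_iters=20):
    """Spread seed labels to unlabeled nodes by majority vote among neighbors."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    labels = dict(seeds)
    for _ in range(max_iters):
        updated = dict(labels)
        for node in sorted(neighbors):
            if node in seeds:
                continue  # seed labels are treated as ground truth and never change
            votes = Counter(labels[n] for n in neighbors[node] if n in labels)
            if votes:
                # Sorting first makes tie-breaking deterministic.
                updated[node] = max(sorted(votes), key=lambda label: votes[label])
        if updated == labels:
            break  # converged
        labels = updated
    return labels


# Hypothetical attribute graph: two separate groups of accounts.
edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"), ("a3", "a4"),
    ("b1", "b2"), ("b2", "b3"),
]
seeds = {"a1": "fraud", "b1": "normal"}
labels = propagate_labels(edges, seeds)
print(labels)
```

Here a single known fraudulent account is enough to label the rest of its tightly connected group, which is the intuition behind finding gangs that per-account rules miss.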

3. Big data full-process anti-fraud example
The most critical component of Taobao anti-fraud is the construction of a full-process big data dragnet, incorporating the account network, transaction network, capital network, and logistics network. This dragnet provides comprehensive monitoring with no blind spots and identifies any type of fraudulent behavior.
*Account network: In this network, we can look at various registration and login information to comprehensively determine the authenticity and platform features of accounts. By mining changes in user behavior, we can effectively discover account behavior abnormalities (see Fig 2).

Figure 2

*Transaction network: In this network, we can mine user shopping behavior paths to track abnormal behavior. This process involves presales (search terms, clicks and views, product details pages), sales (favorites, shopping cart, payment), and aftersales (logistics, comments, returns) (see Fig 3).

Figure 3

*Capital network: In this network, we can mine capital flow behavior to identify abnormal transactions, money laundering, account hacking, cash-out, and other dangerous behaviors (see Fig 4).

Figure 4

*Logistics network: In this network, we can mine associations between transactions and logistics, to identify fake shipments, empty box scams, and other frauds (see Fig 5).

Figure 5
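
To illustrate the kind of association mining described for the logistics network, here is a deliberately simple Python sketch. It assumes that one signal of repetitive or fake shipments is a single tracking number attached to several different orders. The field names `order_id` and `tracking_no` and the sample records are hypothetical, and a real system would combine many more signals.

```python
from collections import defaultdict


def find_reused_tracking_numbers(orders):
    """Return tracking numbers that appear on more than one distinct order."""
    orders_by_tracking = defaultdict(set)
    for order in orders:
        orders_by_tracking[order["tracking_no"]].add(order["order_id"])
    return {
        tracking_no: sorted(order_ids)
        for tracking_no, order_ids in orders_by_tracking.items()
        if len(order_ids) > 1
    }


# Hypothetical records joining transactions with logistics data.
orders = [
    {"order_id": "o1", "tracking_no": "T100"},
    {"order_id": "o2", "tracking_no": "T100"},
    {"order_id": "o3", "tracking_no": "T200"},
    {"order_id": "o4", "tracking_no": "T100"},
]
suspicious = find_reused_tracking_numbers(orders)
print(suspicious)
```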

4. Conclusion
The Taobao anti-fraud system has established and refined a comprehensive set of big data analysis systems, incorporating the account, transaction, capital, and logistics networks, together with identification systems based on online learning and large-scale graph mining algorithms, covering presales, sales, and aftersales. Alibaba also established a comprehensive, platform-based risk control system, Wormhole. Through this system's monitoring, alarm, and online analysis methods, model algorithms and manual operations are effectively integrated. The system not only identifies fraudulent behavior efficiently and intervenes effectively, but also keeps various risks under control. In both routine operations and major promotions, Taobao's anti-fraud algorithm system has demonstrated excellent accuracy, coverage, and bounce rates.