By Alibaba Cloud Data Intelligence Team
Alibaba Machine Translation is dedicated to serving Alibaba's major international businesses, such as AliExpress, Lazada, ICBU, Tmall Global, Taobao Overseas, and DingTalk. This technology helps businesses break through the language barrier by simplifying localization processes when going global.
This article shares the background of machine translation and illustrates the applications of machine translation in the Alibaba ecosystem, particularly in vertical fields such as e-commerce and communication. This article also introduces a full suite of e-commerce multi-language solutions based on machine translation technology, and presents the technological innovations and highlights of Alibaba machine translation in resolving business problems.
Shi Yangbin, senior technical expert in the translation platform of Alibaba's Machine Intelligence Technology Laboratory, is the head of corpus and solutions on the platform. In terms of corpus, he is responsible for acquiring, cleaning, mining, and systematically constructing the corpus data for Alibaba machine translation. In terms of solutions, he is responsible for encapsulating, integrating, and servicing Alibaba translation technologies and delivering them as a complete solution to resolve language problems in the internationalization of cross-border e-commerce.
This article was written based on his speech and the PowerPoint used for the speech.
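In his presentation, Shi Yangbin divided the topics into four parts: the background of machine translation, the applications of machine translation in Alibaba's cross-border e-commerce scenarios, the technology highlights of Alibaba machine translation, and the machine translation products launched on Alibaba Cloud.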
First of all, I will introduce the background of machine translation. It includes two parts: trends of machine translation and Alibaba machine translation.
When it comes to translation, you may first think of simultaneous interpretation. Next, you may think of subtitle translation in movies and television works. A popular Chinese movie screened at the beginning of this year, whose name can be literally translated as "No questions about the West and the East", has a quite interesting English name, "Forever Young". The English name conveys another meaning of the movie. Of course, such an English name was translated by a person and is obviously not the work of machine translation. Another scenario that everyone encounters in their daily work and study is machine translation.
Before introducing the background of machine translation, let me introduce its development. The following figure shows Chris Manning's hand-drawn development history of machine translation. As you can see from this picture, machine translation was first discussed as early as 1954. By 1982, the first rule-based machine translation system appeared. By around 1993, word-based statistical machine translation systems were created, followed by phrase-based machine translation, which was then further optimized. Around 2014, with the development of AI, neural machine translation emerged and greatly improved the quality of machine translation. I talk about this history to explain that machine translation is constantly improving, that it has made great progress in quality, and that it plays a significant role in many scenarios.
Machine translation has several groups of users. The first group is language service companies, such as translation service providers and localization companies, which use machine translation to improve the efficiency of human translation. Then there are Internet content providers, because Internet companies need to build sites that release international news and foreign-language news. For example, YouTube, Youku, and other video content providers may need to use machine translation to translate videos into multiple languages. The third group is social platforms. The users on social platforms are often from different countries, and the platforms need to break the barrier between users through machine translation. Governments and some state-owned enterprises are also users of machine translation. Their websites need to provide information and news in multiple languages. In addition, multinational companies like Huawei and HP need to sell products all over the world, so their product documents and customer support services also need multi-language versions. The last group is tool service providers. For example, the well-known Fliggy needs to provide multi-language support for tourism, while HJ provides support for personal language learning.
The following figure shows what the machine translation market looks like. You can see traditional machine translation companies including Google, Baidu, and Microsoft, and recently emerging machine translation providers including Amazon, GTCOM, NiuTrans, and Alibaba language services. The machine translation market was already worth 450 million USD in 2017, and it is growing at an annual rate of 10%. Today, the number of words translated using online machine translation has reached 100 billion.
I talked about the background of machine translation. Let's move to the next question: Why does Alibaba need machine translation? In fact, the answer is simple. Alibaba has been vigorously developing its international strategy in recent years. Therefore, it is necessary to expand all the businesses into the global market and make them available to more users around the world. Alibaba is expected to achieve five globalization goals. In this process, Alibaba has to break through the language barrier, and therefore has made great investment in machine translation.
The following figure shows the overall capability of Alibaba machine translation. Alibaba started R&D in machine translation around 2013. So far, machine translation has served more than 40 teams and more than 170 applications in Alibaba. Currently, Alibaba machine translation supports 21 languages and 43 language directions, as well as automatic recognition of 19 language directions. The average daily call volume in Alibaba has reached 750 million calls, and the system stability has reached 99.99%. At this year's WMT, the most authoritative evaluation in the machine translation field, Alibaba machine translation ranked first in the world in five language tasks. All these capabilities reflect the long-term accumulation of Alibaba machine translation. In terms of translation forms, in addition to traditional text translation, Alibaba machine translation supports speech, photo, and video translation. These capabilities will be made available on Alibaba Cloud.
The first part introduces the development history and background of machine translation and Alibaba's capabilities of machine translation. In the second part, we will focus on the applications of machine translation in Alibaba cross-border e-commerce scenarios. In this part, I will show you the cross-border e-commerce chain and present several application cases in the chain.
The following figure illustrates the cross-border e-commerce chain. For a company that wants to do cross-border e-commerce business, the first step is to build a website in multiple languages. This involves the multilingualization of websites and apps, as well as the multilingualization of website rules and of security and risk control information.
After building the website in multiple languages, the next step is to attract new customers by diverting traffic to the website. This includes providing multi-language versions of advertisements, marketing, and promotion information, and refining the translation of traffic diversion products. Once traffic is diverted to the multi-language website, the goal is to increase the probability that users locate the products they expect. This involves intra-site search, which requires a multilingual search scheme. Based on the multilingual search scheme, the categories and product attributes are optimized to help users quickly locate the products they expect.
When a user locates the expected product, the product information should be easy to understand, which helps improve the user's purchase conversion rate. This involves title customization and rewriting: the title is made simple and clear, and the product title, detailed description, and comments are translated into multiple languages. Multi-language basic brand information is also required. The product purchase is followed by payment and logistics. Another crucial phase for cross-border e-commerce is customs clearance, where the "customs inspection" information requires multilingual translation support.
After the customer receives the product, the website also wants the customer to buy more products. This comes down to customer retention and re-purchase. The post-sale team of the website needs to communicate with the customer in time, perform quality inspection, and provide product descriptions or translated descriptions. Feeding the user feedback back to the source of the products then helps increase the purchase conversion rate.
In the preceding cross-border e-commerce chain, the function of machine translation can be measured by specific indicators in each stage. In the multilingual website building stage, the indicator is DAU; in the traffic diversion and customer attraction stage, the indicators are UV and COST of the entire site; in the intra-site search stage, the indicator is the conversion rate from the List page to Detail page; in the product browsing stage, the indicator is the conversion rate from Detail page to ordering. The final payment and re-purchase stages can also be measured by specific indicators.
The following are the use cases of some stages in the cross-border e-commerce process. As we know, search is the major traffic entrance for e-commerce websites. Users want to search in their native languages on the e-commerce websites of different countries. However, no e-commerce website can build an independent search engine for the users in each country, because the cost is huge. Therefore, Alibaba uses a single set of English indexes. The words entered by users are converted into English, and the products are searched based on the English indexes. This is how multilingual search is implemented. The following case shows multilingual search. A Russian user of AliExpress enters a Russian word to search for a microphone. The system identifies the language entered by the user, automatically corrects any misspelling, translates the word into English, and invokes the search engine to find the product the user wants. This chain can effectively improve the List-to-Detail conversion rate.
After retrieving the product the user wants, the website redirects the user to the product detail page, which includes multilingual product information. In this way, the user can navigate to a product category on the website by searching, and then find the product to buy. The user may also view the product details and comments after reading the title of the product. If the user does not find the desired information, the user may close the page, and this user is lost. A cross-border e-commerce website therefore needs to provide multilingual product information, allowing users to understand what the products are and what functions they have.
Alibaba has made great efforts in providing multilingual product information. The following figure shows the product title translation on AliExpress. The original language of the information about this product is English, and the product information has been translated into Russian and Arabic.
Some products do not have comments in every language. Therefore, it is necessary to translate the comments written on the same product in one language into other languages. In the following example, the comments in Spanish are translated into Russian and Arabic. Users can then view the comments from other consumers.
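The search chain described above (identify the language, correct the spelling, translate into English, query the English index) can be sketched in a few lines. This is a minimal illustration, not Alibaba's implementation: the dictionary `RU_TO_EN`, the index `ENGLISH_INDEX`, and all function names are hypothetical, language identification is reduced to a Cyrillic check, and spelling correction uses Python's standard `difflib` in place of a trained model.

```python
from difflib import get_close_matches
from typing import Iterable, List

# Hypothetical toy data; a real system would use trained models and a search engine.
RU_TO_EN = {"микрофон": "microphone", "телефон": "phone"}
ENGLISH_INDEX = {
    "microphone": ["Wireless Karaoke Microphone", "USB Condenser Microphone"],
    "phone": ["Car Phone Holder"],
}


def detect_language(query: str) -> str:
    """Very rough language identification: any Cyrillic letter means Russian."""
    if any("\u0400" <= ch <= "\u04ff" for ch in query):
        return "ru"
    return "en"


def correct_spelling(word: str, vocabulary: Iterable[str]) -> str:
    """Return the closest known word, or the input unchanged if none is close."""
    matches = get_close_matches(word, list(vocabulary), n=1, cutoff=0.8)
    return matches[0] if matches else word


def multilingual_search(query: str) -> List[str]:
    """Identify the language, fix spelling, translate to English, search the English index."""
    word = query.strip().lower()
    if detect_language(word) == "ru":
        word = correct_spelling(word, RU_TO_EN)
        word = RU_TO_EN.get(word, word)  # translate into English
    else:
        word = correct_spelling(word, ENGLISH_INDEX)
    return ENGLISH_INDEX.get(word, [])


# A misspelled Russian query still reaches the English "microphone" entries.
print(multilingual_search("микрафон"))
```

The point of the design is that only one index has to be maintained: every new language adds a translation step in front of the same English search engine.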
The third part is multilingual product details. On AliExpress, sellers generally publish information in English. AliExpress needs to translate the product details, such as dimensions, attributes, and logistics information, into other languages.
In addition, the product detail page provides a "Question" function, which is commonly used by users. AliExpress needs to translate the questions asked by users in different countries into other languages, so that more users can obtain the desired information. As shown in the following figure, the questions in Russian are translated into English and Arabic.
After the user searches for and reads the product information and orders the product, the website needs to ship the product to the user. Cross-border e-commerce also includes an essential stage: customs clearance. In customs clearance, the product names must be translated into Chinese and provided to China customs. As shown in the following example, the English name of the phone holder is too long. Customs only needs the key information about the phone holder, not the long name. Therefore, AliExpress uses NLP technology to extract the key words from the long title. In this example, AliExpress extracts the words "Phone Holder" and translates them into Chinese. The Cainiao customs affairs platform then uses the Chinese words for filing and customs clearance, and completes the checkout operation for the products.
Before and after the transaction, buyers and sellers need to communicate with each other. Alibaba.com is a B2B scenario that may need more pre-sale communication. According to Alibaba's research, about 30% of buyers in international trade use minority languages, but most sellers are unable to communicate in these languages. Alibaba developed an automatic translation system for real-time multilingual communication. The system supports translation between multiple languages, and can accurately translate terms in the trading scenario thanks to a large bilingual term bank. In addition, the real-time multilingual communication system has intelligent processing capabilities. It automatically identifies user languages and translates information into those languages. Moreover, the system provides context-based intelligent correction. As you may have noticed, typos often occur in communication, and text with typos cannot be translated accurately. The system also normalizes the expressions used in oral communication scenarios. Last, the real-time multilingual communication system provides a cross-border communication solution that supports multiple operating systems such as Windows, iOS, and Android. Users who are able to do so can also edit the machine-translated text before sending it. In many scenarios, some terms need to be translated into a predefined text. Therefore, a real-time intervention function is supported for the real-time scenarios.
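The idea of reducing a long title to its key term before translation can be illustrated with a small sketch. The article only says that NLP technology is used for the extraction; the glossary lookup below is a deliberately simple stand-in, not the actual method, and `CATEGORY_GLOSSARY`, `extract_key_term`, and `customs_name` are hypothetical names.

```python
from typing import Dict, Optional

# Hypothetical category glossary: English key term -> Chinese name for customs filing.
CATEGORY_GLOSSARY = {
    "phone holder": "手机支架",
    "phone": "手机",
    "holder": "支架",
}


def extract_key_term(title: str, glossary: Dict[str, str]) -> Optional[str]:
    """Return the longest glossary term found in the title as whole words, else None."""
    words = title.lower().split()  # naive whitespace tokenization
    # Try the longest word sequences first so "phone holder" wins over "phone".
    for size in range(len(words), 0, -1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in glossary:
                return candidate
    return None


def customs_name(title: str, glossary: Dict[str, str] = CATEGORY_GLOSSARY) -> Optional[str]:
    """Extract the key term from a long title and return its Chinese translation."""
    term = extract_key_term(title, glossary)
    return glossary[term] if term else None


title = "Universal 360 Rotating Car Air Vent Mount Phone Holder Stand for Smartphone"
print(extract_key_term(title, CATEGORY_GLOSSARY), customs_name(title))
```

A production system would need a learned extractor to handle titles whose key term is not in any glossary; the sketch only shows where extraction sits relative to translation.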
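One common way to implement this kind of term intervention is to protect the glossary terms with placeholders before translation and substitute the required target text afterwards. The sketch below assumes that approach; the article does not say how Alibaba's intervention works internally. `apply_term_intervention`, `toy_translate`, and the sample glossary are hypothetical, and `toy_translate` is only a word-lookup stand-in for a real engine.

```python
import re
from typing import Callable, Dict

# Hypothetical word table used by the stand-in translator below.
TOY_EN_ES = {"we": "nosotros", "offer": "ofrecemos"}


def toy_translate(text: str) -> str:
    """Stand-in for a real engine: word-by-word lookup, unknown tokens pass through."""
    return " ".join(TOY_EN_ES.get(word.lower(), word) for word in text.split())


def apply_term_intervention(source: str, glossary: Dict[str, str],
                            translate: Callable[[str], str]) -> str:
    """Protect glossary terms with placeholders, translate, then insert the required terms."""
    protected = source
    slots = {}
    # Longest terms first so that "free shipping" wins over "shipping".
    for i, term in enumerate(sorted(glossary, key=len, reverse=True)):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        placeholder = f"__TERM{i}__"
        protected, count = pattern.subn(placeholder, protected)
        if count:
            slots[placeholder] = glossary[term]
    translated = translate(protected)
    for placeholder, target in slots.items():
        translated = translated.replace(placeholder, target)
    return translated


glossary = {"free shipping": "envío gratis"}
print(apply_term_intervention("We offer free shipping", glossary, toy_translate))
```

Because the glossary is consulted at request time, a wrong term can be corrected immediately without retraining the model, which matches the need for rapid intervention in real-time chat.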
The previous content describes the overall cross-border e-commerce chain and specially describes some stages. The following part introduces the technology highlights of Alibaba machine translation, including the challenges and technology innovation of Alibaba machine translation for e-commerce.
The challenges include translation quality, service level, and fast iteration.
First, translation quality. E-commerce depends on transactions, so high translation quality is required. In the e-commerce scenario, translation must meet the readability requirement, and the field-related key information must be precisely translated. The key information includes brand, key attributes, dimensions, numerals, and logistics information. Compared with common scenarios, the e-commerce scenario has higher requirements for the translation of the key information. In addition, a flexible intervention mechanism is required, because machine translation is not accurate in some scenarios. Once an inaccurate translation is identified, the rapid intervention mechanism can correct the translation results in time.
Second, service level. High availability is one of the requirements: problems that may affect the entire transaction must be prevented. In addition, machine translation must meet the multi-region requirement. Alibaba machine translation serves many departments and teams spread across different regions, so machine translation must be deployed in multiple regions. Machine translation must also meet the high concurrency and fast response requirements. As we know, Alibaba's Double 11 shopping festival creates extremely high traffic. In this scenario, machine translation must handle high concurrency and respond quickly.
Last, fast iteration. Alibaba machine translation supports many services, so rapid training on massive corpora is essential to produce a usable model in a short period of time. Facing various scenarios, the set of supported languages often needs to be expanded, so Alibaba machine translation supports fast language expansion. Machine translation also needs to support highly efficient model iteration.
The three challenges mentioned above can be addressed from three aspects: model, data, and project, in that order. Next, I would like to talk about how to approach these challenges through model, data, and project.
To ensure high translation quality in e-commerce scenarios, different models must be developed for the target scenarios, and a multi-model integration mechanism must be introduced. Lengthy text that requires high sentence fluency, such as product descriptions, comments, and communications, is translated with the neural machine translation model. For short text such as product titles, search words, and attributes, the statistical machine translation model suffices. Items such as numerals, dates, units, addresses, and travel-scenario phrases can be processed by rule-based machine translation. In addition, Alibaba machine translation places a translation memory library built from accurate manual translations at the outer layer: text that fully matches an entry in the library is returned from the library directly.
The model uses the new Transformer neural network architecture, which clearly improves translation quality compared with the traditional neural machine translation model and greatly improves training speed. The model was demonstrated in the latest WMT and won the championship in 5 evaluations.
In the artificial intelligence field, data also plays a very important role in addition to the model. The data used in Alibaba machine translation is field-specific. That is, machine translation uses a large amount of data related to the e-commerce field, for example, the bilingual corpus, glossaries, frequently used phrases, and monolingual corpus in the e-commerce field, and e-commerce brand glossaries. In addition, the monolingual and bilingual corpora in the general fields are used to train the e-commerce machine translation engines. As a result, a billion-level bilingual parallel corpus, a 100 million-level e-commerce bilingual parallel corpus, a 10 million-level e-commerce knowledge base, and a large-scale industry multilingual term bank are available. Most of the corpus is bilingual parallel text captured from the Internet, some is obtained through terminology mining, and the remainder comes from manual translation.
The following figure shows the data system, including data acquisition, data selection, and e-commerce knowledge base building. Most data used in Alibaba machine translation is obtained from the Internet. Alibaba captures data from multilingual websites, then analyzes, cleans, and processes it to create the bilingual corpus. A small part of the corpus is bought, exchanged, or obtained from manual translation. Alibaba also optimizes part of the data to fit the field. Corpus selection is performed at different levels. The basic level is to judge translation quality and fluency based on certain rules and to filter the corpus using N-grams. The field-related corpus can also be filtered using models. In addition, machine learning can be used for further, more specific quality-related jobs. The building of the e-commerce knowledge base depends on the business side, for example, Alibaba.com, AliExpress, and Tmall Global. Data is intelligently mined from the product information, such as named entities, synonyms, hypernyms, and the dependencies between words. Then, the information is translated into multiple languages, automatically or manually, based on the dependencies. Finally, the multilingual e-commerce support data is generated.
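The layering described above can be pictured as a small router that decides which engine handles a piece of text. This is a sketch under assumptions: the article does not give the actual routing criteria, so the function name `choose_engine`, the regular expression for numeral-like text, and the eight-word threshold for "short text" are all hypothetical.

```python
import re
from typing import Dict

# Hypothetical pattern for text made only of numerals, dates, and simple units.
NUMERIC_PATTERN = re.compile(r"[\d\s.,:/%-]+(cm|mm|kg|g|m)?", re.IGNORECASE)


def choose_engine(text: str, translation_memory: Dict[str, str],
                  short_text_max_words: int = 8) -> str:
    """Pick an engine name for the text, checking the layers from the outside in."""
    stripped = text.strip()
    # Outer layer: an exact match in the translation memory is returned as-is.
    if stripped in translation_memory:
        return "translation_memory"
    # Numerals, dates, and units go to rule-based translation.
    if NUMERIC_PATTERN.fullmatch(stripped):
        return "rule_based"
    # Short text such as titles and search words goes to the statistical model.
    if len(stripped.split()) <= short_text_max_words:
        return "statistical"
    # Long text that needs fluent sentences goes to the neural model.
    return "neural"


memory = {"Free shipping": "Бесплатная доставка"}
for sample in ["Free shipping", "15.5 cm", "wireless bluetooth headphones",
               "This holder fits most phones and can be mounted on any air vent"]:
    print(sample, "->", choose_engine(sample, memory))
```

Checking the cheapest and most reliable layer first means the expensive neural model is only invoked for text that actually needs sentence-level fluency.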
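The basic level of corpus selection, rules plus an N-gram check, can be sketched as follows. This is an illustration under assumptions rather than the actual pipeline: the length-ratio rule, the bigram coverage score, the thresholds, and all function names are hypothetical choices, and whitespace tokenization stands in for real tokenizers.

```python
from collections import Counter
from typing import List, Tuple


def bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return the adjacent token pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


def build_bigram_counts(trusted_sentences: List[str]) -> Counter:
    """Count bigrams in trusted, fluent target-language text."""
    counts = Counter()
    for sentence in trusted_sentences:
        counts.update(bigrams(sentence.lower().split()))
    return counts


def passes_rules(source: str, target: str, max_ratio: float = 3.0) -> bool:
    """Rule check: both sides non-empty and their lengths not too different."""
    src_len, tgt_len = len(source.split()), len(target.split())
    if src_len == 0 or tgt_len == 0:
        return False
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_ratio


def fluency_score(sentence: str, bigram_counts: Counter) -> float:
    """Fraction of the sentence's bigrams seen in trusted text (0.0 if it has none)."""
    grams = bigrams(sentence.lower().split())
    if not grams:
        return 0.0
    return sum(1 for gram in grams if gram in bigram_counts) / len(grams)


def filter_corpus(pairs: List[Tuple[str, str]], bigram_counts: Counter,
                  min_fluency: float = 0.5) -> List[Tuple[str, str]]:
    """Keep sentence pairs that pass the rules and whose target side looks fluent."""
    return [(src, tgt) for src, tgt in pairs
            if passes_rules(src, tgt) and fluency_score(tgt, bigram_counts) >= min_fluency]


trusted = ["phone holder for car", "wireless microphone for phone"]
candidates = [
    ("держатель для телефона", "phone holder for car"),
    ("держатель", "car for holder phone"),
    ("микрофон для телефона", "phone for microphone wireless"),
]
print(filter_corpus(candidates, build_bigram_counts(trusted)))
```

In this toy run the second pair fails the length-ratio rule and the third fails the bigram check, so only the first pair is kept.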
The project in Alibaba machine translation includes four parts.
The previous sections describe the highlights and challenges of machine translation. This section introduces the machine translation products launched on Alibaba Cloud. The machine translation products on Alibaba Cloud are provided through APIs. You can choose Products > AI > Natural language Processing > Machine Translation on the homepage of Alibaba Cloud to view details about machine translation products. Three editions have been launched.
1. General basic edition API, supporting Chinese<->English translation. It includes a free quota for trial use.
2. E-commerce standard edition API, supporting English<->Chinese, English<->Russian, English<->Spanish, English<->French, and English<->Portuguese translation. This edition has an obvious advantage in e-commerce translation. It is applicable to the translation of titles, product descriptions, and comments.
3. General standard edition API, supporting Chinese<->English translation. It will support more languages in the future. This edition is applicable to general scenarios such as traveling and oral communication.
This product launch is just the beginning for Alibaba machine translation. In the future, more capabilities will be made available through Alibaba Cloud. Alibaba will continuously improve translation quality and open up the latest model capabilities; enrich the API capabilities, for example, by supporting user-defined translation; broaden the open scenarios, focusing on e-commerce while supporting more scenarios; and improve the product matrix with multimodal open APIs for text, voice, and images. Moreover, Alibaba will support customized private deployment. That is, users only need to provide scenario-oriented training data, and Alibaba will help them build the model and deploy and launch machine translation. Users can then deploy and use machine translation in their own environments.