Data Analysis: FlashText or RegEx

In this article, we discuss how you can use FlashText, a Python library reported to be up to 100 times faster than RegEx on large keyword lists, to clean text for data analysis.

Before you proceed with your analysis, you need to clean your source data, even for the simplest text. This often includes searching for and replacing keywords. For example, you might search the corpus for the keyword "Python" or replace every "python" with "Python".
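As a point of reference, the search and replace steps described above can be sketched with Python's built-in re module. The sample text is made up for illustration; only the "python" to "Python" replacement comes from the example above.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "I like python. python is fun, and Python is popular."

# Search: find whole-word occurrences of the keyword, ignoring case.
matches = re.findall(r"\bpython\b", text, flags=re.IGNORECASE)

# Replace: normalise the lowercase spelling to "Python".
cleaned = re.sub(r"\bpython\b", "Python", text)

print(len(matches))
print(cleaned)
```

This works well for a handful of keywords. The cost grows as the keyword list gets longer, which is the case the rest of this article looks at.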

The Smarter and Faster Way of Data Cleansing: FlashText

As the name suggests, FlashText is one of the fastest ways to search for and replace keywords. It is an open source Python library available on GitHub.

When using FlashText, begin by providing a list of keywords. FlashText uses this list to build an internal Trie dictionary. You then pass it a string of text and ask it either to search for the keywords or to replace them.
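To make this workflow concrete, here is a minimal sketch of the idea using only the standard library. It is not FlashText's own code: the function names (build_trie, scan, extract_keywords, replace_keywords), the nested-dict layout, and the sample keywords are all assumptions made for illustration.

```python
END = "_end_"  # marker key; longer than one character, so it cannot clash with a letter


def build_trie(keywords):
    """Build a nested-dict trie from a {keyword: replacement} mapping."""
    trie = {}
    for keyword, replacement in keywords.items():
        node = trie
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[END] = replacement
    return trie


def is_word_char(ch):
    return ch.isalnum() or ch == "_"


def scan(text, trie):
    """Yield (start, end, replacement) for each whole-word keyword, longest match first."""
    i, n = 0, len(text)
    while i < n:
        # Only try to match where a new word starts.
        if is_word_char(text[i]) and (i == 0 or not is_word_char(text[i - 1])):
            node, j, best = trie, i, None
            while j < n and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                # Accept a keyword only if it ends at a word boundary.
                if END in node and (j == n or not is_word_char(text[j])):
                    best = (i, j, node[END])
            if best:
                yield best
                i = best[1]
                continue
        i += 1


def extract_keywords(text, trie):
    return [replacement for _, _, replacement in scan(text, trie)]


def replace_keywords(text, trie):
    pieces, last = [], 0
    for start, end, replacement in scan(text, trie):
        pieces.append(text[last:start])
        pieces.append(replacement)
        last = end
    pieces.append(text[last:])
    return "".join(pieces)


# Hypothetical keyword list and sentence.
trie = build_trie({"python": "Python", "java": "Java", "j2ee": "J2EE", "ruby": "Ruby"})
sentence = "I like python and JAVA, not javascript."
found = extract_keywords(sentence, trie)
replaced = replace_keywords(sentence, trie)
print(found)
print(replaced)
```

The text is walked once from left to right, no matter how many keywords the trie holds. Note that "javascript" is left alone because the sketch only accepts a keyword that ends at a word boundary.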

Why is FlashText so Fast?

To truly understand the reason behind FlashText’s speed, let us consider an example. Take a sentence that comprises three words: "I like Python". Assume that you have a corpus of four words: {Python, Java, J2ee, Ruby}.

If you take each word in the corpus in turn and check whether it appears in the sentence, you need to scan the sentence four times.

For n words in the corpus, you need n iterations, and each step (is the word in the sentence?) takes its own time. This is the logic behind RegEx matching.

There is also an alternative method that reverses the first: for each word in the sentence, check whether it exists in the corpus.

For m words in the sentence, you have m cycles. In this case, the time spent depends only on the number of words in the sentence. You can perform each step (is the word in the corpus?) quickly using a dictionary.
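The two methods can be compared side by side in a short sketch that uses the sentence and corpus from the example above. The variable names are illustrative, and a set stands in for the dictionary because it offers the same hash-based lookup.

```python
import re

sentence = "I like Python"
corpus = ["Python", "Java", "J2ee", "Ruby"]

# Method 1: one scan of the sentence per corpus word (n scans for n keywords).
found_by_keyword = [
    word
    for word in corpus
    if re.search(r"\b" + re.escape(word) + r"\b", sentence)
]

# Method 2: one hash lookup per sentence word (m lookups for m words).
corpus_lookup = set(corpus)
found_by_word = [word for word in sentence.split() if word in corpus_lookup]

print(found_by_keyword)
print(found_by_word)
```

Both methods find the same keyword here. The difference is in what the work scales with: the first grows with the size of the corpus, the second with the length of the sentence.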

The FlashText algorithm uses the second method. Its design is inspired by the Aho-Corasick algorithm and the Trie data structure.

When do You Need to Use FlashText?

In terms of search, once the number of keywords exceeds about 500, FlashText performs better than RegEx.

RegEx, however, can search using special characters such as "^", "$", "*", and "\d", which FlashText does not support.

FlashText cannot match patterns for partial words (for example "word\dvec"), but it can match complete words (for example "word2vec").
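The difference between a pattern and a literal keyword can be illustrated with the standard library alone. The sample text is made up, and the literal lookup below is a plain set membership test that stands in for keyword matching; it is not a call into FlashText.

```python
import re

# Hypothetical sample text.
text = "word2vec and word3vec are model names"

# A regex pattern describes a family of strings: "word", any digit, "vec".
pattern_hits = re.findall(r"word\dvec", text)

# A literal keyword list only finds the exact strings it contains.
keywords = {"word2vec"}
literal_hits = [token for token in text.split() if token in keywords]

print(pattern_hits)
print(literal_hits)
```

If your cleaning rules depend on patterns like this, RegEx remains the right tool. If they are a fixed list of complete words, keyword matching is enough.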

Take a look at the basic usage of FlashText and give it a try. With a large keyword list, you will likely find that it is much faster than RegEx.

Related Blog Posts

How to Write a Headless Web Scraping Bot in Python

In this article, you will learn how to write your own basic headless web scraping "bot" in Python with Beautiful Soup 4 on an Alibaba Cloud Elastic Compute Service (ECS) instance with CentOS 7.

We will be using Python for our basic web scraping "bot". I admire the language for its relative simplicity and of course the wide variety of modules available to play around with. In particular, we will be using the Requests and Beautiful Soup 4 modules.

Alibaba Cloud DevOps Cookbook Part 1 – CLI, SDK, SSH, SFTP

Now that I have a test website that is load balanced and has auto scaling, I would like to learn more about the Alibaba Cloud CLI and Python SDK. During development I often need to make changes to files that I publish on my ECS instances. Since the auto scaling group is built from an image, changing the image takes effort and time. During testing, I want to do rapid-fire edit / deploy / debug / improve. This means that I need a quick way to upload files to my ECS instances all at once.

Related Market Product

AISE TensorFlow 1.9 Python 3.6 CPU MKL Notebook

A pre-configured and fully integrated minimal runtime environment with TensorFlow, an open source software library for machine learning, Keras, an open source neural network library, Jupyter Notebook, a browser-based interactive notebook for programming, mathematics, and data science, and the Python programming language. The stack is built with the Intel MKL and MKL-DNN libraries and optimized for running on CPU.

Versions: TensorFlow 1.8.0, Python 3.6.3, Development preset 1, Libc 2.22, OpenBLAS 0.2.20, Python_enum34 1.1.6, NumPy 1.13.3, Ubuntu 16.04

Related Documentation

RDS Python SDK Developer Guide

This example shows how to use the Alibaba Cloud Python SDK to call the createdbinstance interface of RDS to create an RDS instance.

Enable logging in Python SDK

OSS Python SDK provides a logging function to easily track problems. This function is disabled by default.

With this function, you can locate and collect log information about OSS operations and save the information as log files in local disks.

Related Products

Object Storage Service

Alibaba Cloud Object Storage Service (OSS) is an encrypted, secure, cost-effective, and easy-to-use object storage service that enables you to store, back up, and archive large amounts of data in the cloud, with a guaranteed reliability of 99.999999999%. RESTful APIs allow storage and access to OSS anywhere on the Internet. You can elastically scale the capacity and processing capability, and choose from a variety of storage types to optimize the storage cost.

Server Load Balancer

Alibaba Cloud Server Load Balancer (SLB) distributes traffic among multiple instances to improve the service capabilities of your applications. You can use SLB to prevent single points of failure (SPOFs) and improve the availability and fault tolerance of your applications.

Related Course

How to Scale Python on Cloud

Analysts often use libraries and tools in the Python ecosystem to analyze data on their personal computers. They like these tools because they are efficient, intuitive, and widely trusted. However, when they apply their analyses to larger datasets, they find that these tools were not designed to scale beyond a single machine. In this course, we introduce how Alibaba Cloud scales Python based on its offline data processing engine, taking advantage of the unlimited computing resources in the cloud.

Alibaba Clouder
