String Matching Algorithm Starts From the Indexof Function

String Matching Algorithm Introduction

String matching algorithm. The string matching algorithm starts from the indexOf function

foreword
I believe that everyone who has learned Java has used the indexOf function. With the indexOf function, we can find out whether a string (pattern string) has appeared in another string (main string), and the returned result indicates the subscript of the occurrence position. If it returns -1, means that the pattern string does not exist in the main string, so, have you ever wondered how these lookup functions are implemented?

From the indexOf source code
String matching algorithm.First, let's take a look at the source code of indexOf . There are many ways to use indexOf . This is an example of a formal parameter.
static String mainString = "Hello my name is HuangLinqing ";
static String patternString = " HuangLinqing ";
public static void main( String[] args ) {
System.out.printf ( mainString.indexOf ( patternString , 0) + "");

The result of running the above code, the returned result is 17, indicating that the pattern string exists in the main string, and the subscript of the first occurrence is 17
indexOf method will eventually go to the following method, and the source code is as follows:
/**
* Code shared by String and StringBuffer to do searches. The
* source is the character array being searched, and the target
* is the string being searched for.
*
* @param source the characters being searched.
* @param sourceOffset offset of the source string.
* @param sourceCount count of the source string.
* @param target the characters being searched for.
* @param targetOffset offset of the target string.
* @param targetCount count of the target string.
* @param fromIndex the index to begin searching from.
*/
static int indexOf(char[] source, int sourceOffset, int sourceCount,
char[] target, int targetOffset, int targetCount,
int fromIndex) {
if (fromIndex >= sourceCount) {
return (targetCount == 0 ? sourceCount : -1);
}
if (fromIndex < 0) {
fromIndex = 0;
}
if (targetCount == 0) {
return fromIndex;
}
char first = target[targetOffset];
int max = sourceOffset + (sourceCount - targetCount);
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first);
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j]
== target[k]; j++, k++);
if (j == end) {
/* Found whole string. */
return i - sourceOffset;
}
}
}
return -1;
}

There are not many lines of code, so let's analyze the above code, fromIndex is 0 by default, target is the pattern string, targetCount is the size of the pattern string, source is the main string, sourceCount is the size of the main string
if ( fromIndex >= sourceCount ) {
return( targetCount == 0 ? sourceCount : -1);
}
if ( fromIndex < 0) {
fromIndex = 0;
}
if ( targetCount == 0) {
return fromIndex ;
}

If the starting position of the search is greater than the size of the main string, if the pattern string is an empty string , it returns the size of the main string , otherwise it returns -1, and if the size of the pattern string is equal to 0, it is the position to start searching. These lines of code are easy to understand. There is no example, mainly the following code:
char first = target[ targetOffset ];
int max = sourceOffset + ( sourceCount - targetCount );
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first);
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j]
== target[k]; j++ , k++);
if (j == end) {

/* Found whole string. */
return i - sourceOffset ;
}
}
}

indexOf is a typical BF algorithm. Let's briefly introduce the BF algorithm, and then go back and understand the above code.

BF and RK algorithms
•BF algorithm
BF algorithm is Brute Force, a brute force matching algorithm, and also a naive matching algorithm. The size of the main string is sourceSize , and the size of the pattern string is targetSize , because we want to find the pattern string in the main string , so sourceZize > targetSize , so from the main string The subscript is 0, and the targetSize characters are continuously searched , and then the subscript is 1, and the subscript is sourceSize - targetSize . For a simple example, find EF in ABCDEFG:

The above figure shows the sequential comparison from i is 0 to i is 4. We can also see from the figure that the BF algorithm is more time-consuming because the number of comparisons is large, but the main string and Neither pattern string is too long, so this method of comparison is easier to use.
Now let's go back and look at the second half of the source code of indexOf , I believe that there is no need to explain.
•RK algorithm
The RK algorithm is actually an upgrade of the BF algorithm. Take the above figure as an example. When searching for EF in ABCDEFG, for example, when the subscript is 0, we compare the values of A and E. If they are not equal, we will not continue. Compared, but for example, we are now looking for whether CDF exists in the main string . We need to find that the third bit is not equal from the known larger E of C, so that when the first part of the pattern string is equal to the main string, only the last bit is not equal. , the number of comparisons is too many, and the efficiency is relatively low, so we can use hash calculation to compare, and I will add an article after the hash calculation .
We want to compare the pattern string with sourceSize - targetSize + 1 string , we can first hash the sourceSize - targetSize + 1 pattern string . Compared with the pattern string after hash calculation, if it is equal, it exists. The probability of hash collision is relatively low in general implementation. If you are worried, we can compare the original string again when the hash value is equal to ensure accuracy. The collision probability of is also related to the design of the hash algorithm itself. In this case, we first calculate the hash value of AB and compare it with the pattern string, and then calculate the hash value of BC and compare it with the pattern string, until the equal return subscripts are compared.

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00