The principle of the hash method in HashMap

A hash function converts an input of arbitrary length into a fixed-length output, called the hash value, by means of a hashing algorithm. This conversion is a compressing map: the space of possible hash values is usually much smaller than the space of inputs, so different inputs may hash to the same output, and the original input cannot be uniquely recovered from its hash value. In short, a hash function compresses a message of arbitrary length into a message digest of fixed length.

All hash functions share one basic property: if two hash values computed by the same hash function differ, the inputs must differ as well. Conversely, if the two hash values are equal, the inputs are not necessarily equal.

When two different inputs produce the same hash value under the same hash function, we call it a collision.

Common ways to construct a hash function include the following (a small code illustration of the division method appears after the list):
Direct addressing: use the key k itself, or k plus some constant (k + c), as the hash address.
Digit analysis: pick the digits of the key whose values are most evenly distributed and use them as the hash address.
Division and remainder: divide the key k by a number p no greater than the table length m and use the remainder as the hash address.
Folding (segmented superposition): split the key into several parts with the same number of digits as the hash-table address (the last part may be shorter), add the parts together, and discard any carry out of the highest digit; the result is the hash address.
Mid-square: if the digits of the key are unevenly distributed, square the key first and then take however many middle digits are needed as the hash address.
Pseudo-random number: use a pseudo-random function of the key as the hash address.
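
As a concrete illustration of the division-and-remainder method (the other constructions follow the same pattern), here is a minimal sketch; the prime p and the sample keys are made up for the example:

```java
// Minimal sketch of the division/remainder method: the hash address is
// the key modulo a prime p that is no greater than the table length m.
public class DivisionHash {
    static int hash(int key, int p) {
        return key % p;                     // the remainder becomes the hash address
    }

    public static void main(String[] args) {
        int p = 13;                         // assume the table length m >= 13
        System.out.println(hash(1234, p));  // some sample (non-negative) keys
        System.out.println(hash(5678, p));
    }
}
```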

Collisions were mentioned above. Two important measures of a hash function's quality are its collision probability and how collisions are resolved; essentially no hash function can avoid collisions completely. The common collision-resolution strategies are as follows:

Open addressing

When a collision occurs, probe for the next empty slot in the hash table. As long as the table is large enough, an empty slot can always be found and the record stored there.
Chain address method (separate chaining)

Each slot of the hash table serves as the head of a linked list, and all elements whose hash address is i form one synonym list. When a collision occurs, the new key is simply linked into the list headed by that slot.
Rehashing

When a hash address collides, compute another address with a different hash function, repeating until the collision disappears.
Common overflow area

Split the hash table into a basic table and an overflow table, and put colliding elements into the overflow table.

The data structure of HashMap
Java offers two relatively simple data structures for storing data: arrays and linked lists. An array is easy to index into but expensive to insert into or delete from; a linked list is the opposite, cheap to insert into and delete from but expensive to index. As mentioned above, one common collision-resolution strategy is the chain address method, which combines arrays and linked lists to take advantage of both. You can think of the result as an array of linked lists.


Picture the structure this way: on one side is an array, and each element of the array is the head of a linked list. Every stored element carries a pointer that links it to the next element in its list. We assign each element to a list according to some characteristic of the element, and later use that same characteristic to find the right list and then the right element within it. The step that computes the array index from the element's characteristic is the hash algorithm, i.e. the hash() function this article is about (together with the indexFor() function). A minimal sketch of this array-of-lists structure follows.
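
The following is a minimal sketch of that structure, not the JDK implementation: buckets is the array, and each Node chains colliding entries together (null keys are not handled, and the index function is a simple placeholder):

```java
// Minimal "array of linked lists" map: collisions are resolved by chaining
// entries together inside the bucket they hash to.
public class SimpleChainedMap<K, V> {
    static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;
        Node(K key, V value, Node<K, V> next) { this.key = key; this.value = value; this.next = next; }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] buckets = (Node<K, V>[]) new Node[16];

    private int indexOfKey(K key) {
        // placeholder index function: non-negative hash modulo the array length
        return (key.hashCode() & 0x7FFFFFFF) % buckets.length;
    }

    public void put(K key, V value) {
        int i = indexOfKey(key);
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) { n.value = value; return; }   // key exists: replace
        }
        buckets[i] = new Node<>(key, value, buckets[i]);          // new key: link at the head
    }

    public V get(K key) {
        for (Node<K, V> n = buckets[indexOfKey(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;
        }
        return null;
    }
}
```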

hash method
Let's take the HashMap of JDK 1.7 as an example. It defines a final int hash(Object k) method, which is referenced mainly by the methods that add, look up and remove entries (put, get, remove, and so on).


These are mainly the methods that add and remove entries, which is easy to understand: before we can add an element to or remove it from the array of linked lists, we first need to know where it lives, that is, its index in the array. The job of the hash() method is exactly that: to locate a key's position inside the HashMap. The same applies to Hashtable and ConcurrentHashMap. An abridged sketch of how put() uses it is shown below.
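
For orientation, here is an abridged sketch of JDK 1.7's put(K, V), keeping only the lines relevant to hashing; the rest is elided and slightly paraphrased:

```java
// Abridged from JDK 1.7 HashMap.put(K, V); only the hashing-related steps
// are shown, everything else is elided.
public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key);                  // 1. perturb the key's hashCode
    int i = indexFor(hash, table.length);  // 2. map the hash to an array index
    // ... walk the linked list in table[i]; replace the value if the key
    //     already exists, otherwise append a new entry ...
    addEntry(hash, key, value, i);
    return null;
}
```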

Source code analysis
First of all, even within the same JDK version, the hash method is implemented differently in HashMap, Hashtable and ConcurrentHashMap, and the implementations also differ between JDK versions (Java 7 and Java 8). I will try to cover them all; after reading this article you should understand the hash method thoroughly.

Before looking at the code, let's do some simple analysis. We know the hash method's job is to locate the position of a K-V pair in the array of linked lists according to its key. In other words, its input is a Key of type Object and its output should be an array index of type int. If you were asked to design this method, how would you do it?

It is actually simple: call the key's hashCode() method, which returns an integer, and then take that integer modulo the capacity of the HashMap or Hashtable. That really is the basic principle. In the actual implementation it is split across two methods, int hash(Object k) and int indexFor(int h, int length), and, partly for reasons of efficiency, HashMap's implementation is a little more involved.

hash: this method converts the key Object into an integer.

indexFor: this method converts the integer produced by hash into an index in the array of linked lists. A naive sketch of the two-step idea follows.
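
If we ignore efficiency and edge cases, the whole idea fits in a few lines (a made-up sketch, not the JDK code):

```java
// Naive version of the two steps: turn the key into an int, then fold
// that int into a valid index for a table of the given capacity.
public class NaiveHashing {
    static int hash(Object key) {
        return key.hashCode();
    }

    static int indexFor(int h, int capacity) {
        return (h & 0x7FFFFFFF) % capacity;   // mask the sign bit so the index is non-negative
    }

    public static void main(String[] args) {
        System.out.println(indexFor(hash("some key"), 16));
    }
}
```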

HashMap in Java 7
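Below is roughly what the index computation looks like in JDK 1.7's HashMap (paraphrased from the JDK source; the surrounding class details are omitted):

```java
// JDK 1.7-style index computation: maps a (perturbed) hash value h to a
// slot in a table whose length is always a power of two.
static int indexFor(int h, int length) {
    return h & (length - 1);
}
```
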
As said before, the indexFor method converts the integer produced by hash into an index in the array of linked lists. So what does return h & (length-1); mean? It is, in effect, taking a modulus: HashMap uses a bitwise AND (&) in place of the modulo operator (%) purely for efficiency. A bitwise AND is a single cheap CPU instruction, whereas integer division and remainder are considerably slower, so the bit-operation form is noticeably faster.

So why can a bitwise AND (&) be used to implement the modulo operation (%)? The identity behind it is:

X % 2^n = X & (2^n - 1)

Here 2^n means 2 to the n-th power. The identity says that, for a non-negative X, taking X modulo 2^n gives the same result as bitwise-ANDing X with (2^n - 1).

Suppose n is 3. Then 2^3 = 8, which is 1000 in binary, and 2^3 - 1 = 7, which is 0111.

In that case X & (2^3 - 1) simply keeps the lowest three bits of X's binary representation.

Viewed in binary, X / 8 is equivalent to X >> 3: shifting X right by three bits yields the quotient of X / 8, and the bits shifted out (the lowest three bits) are exactly X % 8, the remainder.

If the explanation above isn't entirely clear, don't worry; just remember the trick, or work through a few examples yourself.

Therefore return h & (length-1); implements the modulo operation as long as length is guaranteed to be a power of two (2^n). The length of HashMap's table is indeed always a power of two: the initial value is 16, and each resize doubles it. A quick check of the equivalence is sketched below.
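
A quick self-check of the equivalence, for non-negative values and a power-of-two length:

```java
// Verify that h % length == h & (length - 1) when length is a power of two
// and h is non-negative.
public class ModMaskCheck {
    public static void main(String[] args) {
        int length = 16;                      // 2^4, like HashMap's default capacity
        for (int h = 0; h < 100; h++) {
            if (h % length != (h & (length - 1))) {
                throw new AssertionError("mismatch at " + h);
            }
        }
        System.out.println("h % 16 == h & 15 for all h in [0, 100)");
    }
}
```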

Having analyzed the indexFor method, we will next analyze the principle and implementation of the hash method. Before diving in, let's summarize what we have so far.

HashMap stores its data in an array of linked lists. When inserting into or deleting from a HashMap, the key of the K-V pair must first be used to determine which array index the pair belongs to; this key-to-index step is the hashing. The array has a length, which Java restricts to powers of two, with an initial value of 16. The simple approach is to take the key's hashCode, an int, and reduce it modulo the array length; for performance, HashMap implements that modulo with a bitwise AND.

Next we will find that neither the modulo nor the bitwise AND by itself solves the problem of frequent collisions. For example, CA11 0000 and 0001 0000 yield the same value after a bitwise AND with 0000 1111.

Two different keys end up with the same result after the bitwise AND with (length - 1); isn't that a collision? Let's see how Java mitigates it.
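
The fix is a perturbation step inside hash(). The following is approximately the JDK 1.7 implementation (details such as the hashSeed and the String special case vary between update releases):

```java
// Approximate JDK 1.7 HashMap.hash(Object k): an XOR-shift "perturbation"
// that folds the high bits of the hashCode into the low bits.
final int hash(Object k) {
    int h = hashSeed;
    if (0 != h && k instanceof String) {
        return sun.misc.Hashing.stringHash32((String) k);
    }
    h ^= k.hashCode();

    // Spread higher bits downward so keys whose hashCodes differ only in
    // their high bits no longer collide under the low-bit mask.
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
```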

This code perturbs the key's hashCode to avoid collisions between hashCodes that differ in their high bits but share the same low bits. Put simply, it mixes high-order and low-order features together to reduce the probability of collision; the goal is that a change in any single bit of the hashCode can affect the final index.

For example, suppose we want to put a K-V pair whose key is "hollischuang" into a HashMap, and the key's raw hashCode is 1011000110101110011111010011011 in binary. If the current table size is 16, then without any perturbation the resulting index is 11: 15 extended to 32 bits is 00000000000000000000000000001111, and when a number is ANDed with it, the upper 28 bits are irrelevant, because ANDing anything with 0 gives 0.

Other hashCodes that share those same low four bits would also map to index 11 after the AND. We may not know which concrete keys produce such hashCodes, but such keys certainly exist, and they would collide.

So let's see what the perturbed algorithm produces instead. A small demo that walks through the computation for a sample key is sketched below.
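
As a rough way to see the effect for yourself, the following standalone demo (not JDK code; the key and table size are taken from the example above) recomputes the Java 7-style perturbation and prints the intermediate values:

```java
// Hypothetical demo: apply the Java 7-style perturbation to one key's
// hashCode and show the resulting bucket index for a table of size 16.
public class PerturbDemo {
    static int perturb(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int h = "hollischuang".hashCode();        // sample key from the example
        int perturbed = perturb(h);
        System.out.println("raw hashCode   : " + Integer.toBinaryString(h));
        System.out.println("perturbed      : " + Integer.toBinaryString(perturbed));
        System.out.println("index (size 16): " + (perturbed & 15));
    }
}
```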

Hashtable in Java 7

The above covers the hash and indexFor methods of HashMap in Java 7. Next let's see how the thread-safe Hashtable does it, how it differs from HashMap, and why. Below is roughly how Hashtable's hash method looks in Java 7.
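
A paraphrased version (late JDK 1.7 update releases; earlier builds simply return key.hashCode()):

```java
// Approximate Java 7 Hashtable hashing: the hash is (almost) the raw
// hashCode, and the index is computed with a plain modulo.
private int hash(Object k) {
    // hashSeed is 0 unless "alternative hashing" is enabled
    return hashSeed ^ k.hashCode();
}

// ...and wherever an index is needed, e.g. in put()/get():
// int index = (hash & 0x7FFFFFFF) % tab.length;
```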

We can see it is very simple: it essentially just returns k's hashCode (optionally XORed with a hash seed). Hashtable also has no indexFor method; instead it computes the index inline with int index = (hash & 0x7FFFFFFF) % tab.length;. In other words, HashMap and Hashtable compute the array index in two different ways: HashMap uses a bit operation, while Hashtable uses a plain modulo.

Why AND the hash value with 0x7FFFFFFF? Mainly to force the sign bit of the result to 0, i.e. to obtain a non-negative number: in a signed integer a leading 0 means positive and a leading 1 means negative, and a negative hash would produce a negative array index. A tiny demonstration follows.
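
A tiny illustration (the hashCode value below is made up):

```java
// A negative hashCode gives a negative remainder in Java; masking the sign
// bit with 0x7FFFFFFF keeps the index non-negative.
public class SignMaskDemo {
    public static void main(String[] args) {
        int h = -123456789;                          // pretend this is some key's hashCode()
        System.out.println(h % 11);                  // negative -- unusable as an array index
        System.out.println((h & 0x7FFFFFFF) % 11);   // always in the range [0, 10]
    }
}
```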

As we said earlier, HashMap avoids the plain modulo for efficiency. Some people assume that because Hashtable is a thread-safe, and therefore inherently slower, class, Java simply did not bother about efficiency and used the straightforward modulo. That is not entirely true: the design reflects a deliberate trade-off, even though it is indeed slower than HashMap's approach.

Hashtable's use of a simple modulo is in fact deliberate. It is tied to Hashtable's constructor and resizing logic; for brevity the code is omitted here and only the conclusion is given:

Hashtable's default initial capacity is 11, and each resize grows it to 2n + 1 (twice the old capacity plus one).

In other words, the default size of Hashtable's array is a prime (and odd) number, and every subsequent capacity after resizing is also odd.

Hashtable tries to use prime or odd numbers as its capacity because, when the table size is a prime, the results of a simple modulo hash are distributed more uniformly. (This can be proved; since it is not the focus of this article, see http://zhaox.github.io/algorithm/2015/06/29/hash for details.) A rough illustration of the effect is sketched below.
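
A rough, made-up illustration of the effect: if the keys happen to share a common factor (here, multiples of 4), a power-of-two modulus uses only a fraction of the buckets, while a prime modulus spreads them over all of them.

```java
// Compare how clustered keys (multiples of 4) spread over a table of
// size 16 (power of two) versus size 11 (prime) under a plain modulo.
public class ModulusSpread {
    public static void main(String[] args) {
        int[] powerOfTwo = new int[16];
        int[] prime      = new int[11];
        for (int key = 0; key < 16 * 11; key += 4) {   // keys: 0, 4, 8, ...
            powerOfTwo[key % 16]++;
            prime[key % 11]++;
        }
        System.out.println("size 16 buckets used: " + countUsed(powerOfTwo)); // only 4 of 16
        System.out.println("size 11 buckets used: " + countUsed(prime));      // all 11
    }

    static int countUsed(int[] buckets) {
        int used = 0;
        for (int b : buckets) if (b > 0) used++;
        return used;
    }
}
```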

At this point we have read through the hash implementations of HashMap and Hashtable in Java 7, so let's briefly summarize.

HashMap's default initial capacity is 16, and each resize doubles it.

Hashtable's default initial capacity is 11, and each resize grows it to 2n + 1.

When the table size is a prime, a simple modulo hash distributes keys more uniformly, so from that standpoint Hashtable's choice of table size looks better: the more evenly the hash results are spread, the better.

For the modulo step itself, when the modulus is a power of two the result can be computed directly with a bit operation, which is much faster than division, so in terms of the efficiency of computing the index, HashMap has the edge.

However, by replacing the modulo with a bit operation for efficiency, HashMap introduces the problem of an uneven hash distribution, and to compensate it improves the hash step with the perturbation calculation described above.
