Which Part of the JDK Source Code Touches You the Most #HashMap


HashMap: a linked list of length >= 8 is converted to a red-black tree
Why did the HashMap designers pick "8" as the threshold for turning a bucket's linked list into a red-black tree?
The designers first explain the advantage of using red-black trees in HashMap: tree nodes can be identified quickly with an instanceof check, lookup complexity is low (O(log n)), and query performance is still guaranteed even when there are many collisions.
Then they explain the drawback of the tree: a tree node takes roughly twice the space of an ordinary node. So, to save space, the conversion is only considered once enough nodes collide (the linked list length is >= 8 and the table size is >= 64), and when collisions drop again the bucket is converted back into a linked list.
The designers are also very rigorous: the comments list the Poisson-distribution probabilities for linked-list lengths from 0 to 8, together with the Poisson formula and a link to the Wikipedia article. The probability of eight nodes colliding in one bucket is 0.00000006, less than one in ten million. That vanishingly small probability is why "8" was chosen as the conversion boundary, balancing time cost against space cost.
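For reference, these thresholds appear as constants in the JDK 8 HashMap source; the sketch below quotes them from memory (values as commonly documented, worth double-checking against your own JDK):

static final int TREEIFY_THRESHOLD = 8;     // a bucket is treeified once its chain reaches this length
static final int UNTREEIFY_THRESHOLD = 6;   // a tree bin is converted back to a list when it shrinks to this size
static final int MIN_TREEIFY_CAPACITY = 64; // treeification only happens once the table has at least 64 slots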
The designers' pursuit of efficiency and their rigor are what touch me.

The map capacity algorithm
HashMap.putAll pre-sizes the table using ((float) s / loadFactor) + 1.0F.
The JDK source:
public void putAll(Map<? extends K, ? extends V> m) {
    putMapEntries(m, true);
}
final void putMapEntries(Map<? extends K, ? extends V> m, boolean evict) {
    int s = m.size();
    if (s > 0) {
        if (table == null) { // pre-size
            float ft = ((float) s / loadFactor) + 1.0F;
            int t = ((ft < (float) MAXIMUM_CAPACITY) ?
                     (int) ft : MAXIMUM_CAPACITY);
            if (t > threshold)
                threshold = tableSizeFor(t);
        }
        else if (s > threshold)
            resize();
        for (Map.Entry<? extends K, ? extends V> e : m.entrySet()) {
            K key = e.getKey();
            V value = e.getValue();
            putVal(hash(key), key, value, false, evict);
        }
    }
}
When we initialize a map with HashMap(int initialCapacity), HashMap does not use the initialCapacity we pass in directly. The JDK computes a "reasonable" value for us: the first power of two greater than or equal to the value passed in. But that value is not as reasonable as it looks, because the calculation only considers the capacity and ignores the load factor. loadFactor is the load factor: when the number of elements (size) in the HashMap exceeds threshold = loadFactor * capacity, the table is resized. So if we pass in 7, the JDK sets the capacity to 8, but the map will be resized as soon as the element count reaches 8 * 0.75 = 6, which is obviously not what we want, and rehashing is a time-consuming process. If the initial capacity is instead computed as expectedSize / 0.75 + 1, the unnecessary resize is avoided and collisions are effectively reduced.
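A quick illustration of the difference, assuming the default load factor of 0.75 and the JDK 8 sizing rules (the class and variable names here are mine):

import java.util.HashMap;
import java.util.Map;

public class CapacityDemo {
    public static void main(String[] args) {
        int expectedSize = 7;

        // new HashMap<>(7): table size becomes 8, threshold = 8 * 0.75 = 6,
        // so the map is resized while inserting the 7th entry.
        Map<String, String> naive = new HashMap<>(expectedSize);

        // expectedSize / 0.75 + 1 = 10 (after truncation): table size becomes 16,
        // threshold = 12, so 7 entries fit without any resize.
        int betterCapacity = (int) ((float) expectedSize / 0.75F + 1.0F);
        Map<String, String> presized = new HashMap<>(betterCapacity);
    }
}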
This algorithm is also implemented in Guava:
public static <K, V> HashMap<K, V> newHashMap(Map<? extends K, ? extends V> map) {
    return new HashMap<>(map);
}

/**
* Creates a {@code HashMap} instance, with a high enough "initial capacity" that it should
* hold {@code expectedSize} elements without growth. This behavior cannot be broadly guaranteed,
* but it is observed to be true for OpenJDK 1.7. It also can't be guaranteed that the method
* isn't inadvertently oversizing the returned map.
*
* @param expectedSize the number of entries you expect to add to the returned map
* @return a new, empty {@code HashMap} with enough capacity to hold {@code expectedSize} entries
* without resizing
* @throws IllegalArgumentException if {@code expectedSize} is negative
*/
public static <K, V> HashMap<K, V> newHashMapWithExpectedSize(int expectedSize) {
    return new HashMap<>(capacity(expectedSize));
}

/**
* Returns a capacity that is sufficient to keep the map from being resized as long as it grows no
* larger than expectedSize and the load factor is ≥ its default (0.75).
*/
static int capacity(int expectedSize) {
    if (expectedSize < 3) {
        checkNonnegative(expectedSize, "expectedSize");
        return expectedSize + 1;
    }
    if (expectedSize < Ints.MAX_POWER_OF_TWO) {
        // This is the calculation used in JDK8 to resize when a putAll
        // happens; it seems to be the most conservative calculation we
        // can make.  0.75 is the default load factor.
        return (int) ((float) expectedSize / 0.75F + 1.0F);
    }
    return Integer.MAX_VALUE; // any large value
}
The method HashMap uses in Java 8 to compute the table size:
static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}
After subtracting 1 from the incoming capacity (the subtraction ensures that a capacity that is already a power of two maps to itself rather than the next power), the highest set bit of the result tells us which power of two we need. For example, for cap = 7 we get n = 6 (binary 0110); its highest set bit corresponds to 4, and the next power of two, 8, is the table size we want.
The author then repeatedly shifts that highest 1 bit to the right and ORs the result back in, so that every bit below the highest one becomes 1. Adding 1 then yields the desired power of two: 0110 becomes 0111, and 0111 + 1 = 1000, i.e. 8, the array size we want.
The author uses only shift and OR operations, which are cheap for the CPU, and cleverly derives the array size. In JDK 11 this method was changed to use Integer.numberOfLeadingZeros, presumably because the original version always performs the same five shift/OR steps whether the input is small or large, which wastes some work.
By branching on the size of the input, the number of shift operations can be reduced; personally I think that is the reason for the change, and the numberOfLeadingZeros-based implementation probably still borrows from the original technique (my own opinion).
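For comparison, the later version based on Integer.numberOfLeadingZeros looks roughly like this (quoted from memory, so treat it as a sketch and check your own JDK):

static final int tableSizeFor(int cap) {
    // -1 >>> k leaves exactly the low (32 - k) bits set, giving the same "run of 1s" as the shift/OR cascade
    int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}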
Either way, this JDK 8 calculation is very cleverly designed.

31, an odd prime, as the multiplier to reduce collisions
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;
        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
Where does the 31 in this code come from? Why choose 31 out of all possible numbers? The comments give the hash formula: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]. According to several articles and the Stack Overflow question "Why does Java's hashCode() in String use 31 as a multiplier?", the main reasons are as follows.
1. Using 31 as the multiplier in the hashCode calculation, followed by the later modulo step (actually an AND operation in HashMap), lowers the probability of two keys mapping to the same index, i.e. it lowers the probability of a hash collision. But 37, 41, 43 and so on would also be good choices, so why 31 in the end? That is the second reason.
2. 31 also performs well: the multiplication can be replaced with a shift and a subtraction, 31 * i == (i << 5) - i, and modern VMs perform this optimization automatically.
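As a small illustration (my own re-implementation of the same documented polynomial, not the JDK code path), the following produces the same value as String.hashCode() for ordinary strings:

static int polyHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);   // the JIT can strength-reduce 31 * h to (h << 5) - h
    }
    return h;
}

// polyHash("hello") == "hello".hashCode()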
This touches on the collision rate of a hash algorithm. Take the simplest hash, the modulo method, with an array of length 4. For the value 5 we compute 5 % 4 = 1 and store 5 at index 1. 6 % 4 = 2, so 6 goes to index 2; 7 % 4 = 3, so 7 goes to index 3; 8 % 4 = 0, so 8 goes to index 0. So far there has been no conflict. But the next value, 9, gives 9 % 4 = 1, and index 1 is already occupied by 5, so two values now want the same slot: that is a hash collision, and the colliding values can only be stored one after another. If the array is large and the data is well spread out, the collision rate is low; otherwise it is high.

The perturbation function hash() in the HashMap source code


Here is the source code from JDK 1.8, pasted for easy viewing:
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
Reason: this is a piece of code that left a deep impression on me when I was studying the HashMap source. Although it is very short, it achieves the author's goal in a way that is, to me, concise and elegant.
First, the meaning of the code. It takes the key's hashCode and XORs its high 16 bits with its low 16 bits. The returned value is later used to determine the bucket: instead of taking the hash modulo the number of buckets, HashMap uses a bit operation, because the number of buckets is always a power of two (which itself helps achieve a uniform distribution). The index is computed as hash & (n - 1), which replaces the modulo with an AND, is more efficient than %, and shows how much attention was paid to the details.
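A small sketch of that equivalence, assuming the table size n is a power of two and the hash is non-negative:

int n = 16;                   // table size, must be a power of two
int hash = 185;
int index1 = hash % n;        // 9
int index2 = hash & (n - 1);  // 9, same result, but computed with a single AND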

Second, back to the hash() function itself. h >>> 16 shifts the key's hashCode right by 16 bits, so the high 16 bits of the result are all zero; XORing this with the original hashCode mixes the high bits into the low bits. The benefit is that when two keys have hash codes whose low 16 bits are equal but whose high 16 bits differ, they no longer have to collide. We know that when HashMap elements collide they form a chain, and when the chain reaches the threshold it is converted into a red-black tree, so elements should be spread as evenly as possible and the probability of collisions kept low. Letting both the high and the low bits take part in the later index calculation is exactly such a probability-reducing trick.
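A small illustration with made-up hash codes, assuming a table of 16 buckets:

int n = 16;
int h1 = 0x0001ABCD;   // low 16 bits identical to h2, high 16 bits differ
int h2 = 0x0002ABCD;

// Without the perturbation both keys land in the same bucket:
System.out.println((n - 1) & h1);                    // 13
System.out.println((n - 1) & h2);                    // 13

// With the perturbation the high bits influence the index and the keys separate:
System.out.println((n - 1) & (h1 ^ (h1 >>> 16)));    // 12
System.out.println((n - 1) & (h2 ^ (h2 >>> 16)));    // 15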

When I first saw this code I spent quite a while studying binary bit operations, and this one extra XOR step can directly affect the result. It is not only clever; performance was also optimized to the extreme. Back then I thought this was simply the gap between the masters and me; now I think this is the professional quality a programmer should have. A requirement can be completed by piling up junk code, or it can be completed by optimizing it to the extreme, and if you study the code carefully you can tell the difference. Very often our attitude determines how far we go. Personally, I think the way forward for a programmer is to abandon code that merely runs and to pursue code that later readers will see and applaud.