How Is Netty Used to Write a High-Performance Distributed Service Framework?

By Jiachun

1. What Is Netty and What Can It Do?

Netty is a mature I/O framework dedicated to creating high-performance network applications.
Compared with programming directly against the lower-level Java I/O APIs, Netty lets you build complex network applications without being a network expert.
Most common middleware products concerning network communication in the industry implement the network layer based on Netty.

2. Design a Distributed Service Framework

2.1 Architecture

2.2 Remote Call Process

Start the server (service provider) and publish the service to the registry.
Start the client (service consumer) and subscribe to the services it is interested in from the registry.
The client receives the service address list sent by the registry.
When the caller initiates a call, the Proxy selects an address from the service address list, serializes the request information (<group, providerName, version>, methodName, and args[]) into a byte array, and sends it to that address over the network.
The server receives the request and deserializes it, looks up the corresponding providerObject in the local service dictionary by <group, providerName, version>, calls the specified method through reflection based on <methodName, args[]>, and serializes the return value into a byte array to send back to the client.
After the client receives the response, it deserializes it into a Java object, and the Proxy returns the object to the method caller.

The process above is transparent to the method caller, and everything looks like a local call.
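The server-side step of this process (dictionary lookup followed by a reflective call) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: ServiceKey, ServiceDictionary, and EchoService are hypothetical names, serialization and networking are left out, and methods are matched by name and argument count only, so overloads are not disambiguated.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical key for the <group, providerName, version> metadata. */
final class ServiceKey {
    final String group;
    final String providerName;
    final String version;

    ServiceKey(String group, String providerName, String version) {
        this.group = group;
        this.providerName = providerName;
        this.version = version;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ServiceKey)) {
            return false;
        }
        ServiceKey k = (ServiceKey) o;
        return group.equals(k.group)
                && providerName.equals(k.providerName)
                && version.equals(k.version);
    }

    @Override
    public int hashCode() {
        return Objects.hash(group, providerName, version);
    }

    @Override
    public String toString() {
        return "<" + group + ", " + providerName + ", " + version + ">";
    }
}

/** Sample provider used only for this illustration. */
final class EchoService {
    public String echo(String text) {
        return "echo:" + text;
    }
}

/** Local service dictionary: metadata -> provider object. */
final class ServiceDictionary {
    private final Map<ServiceKey, Object> providers = new ConcurrentHashMap<>();

    void publish(ServiceKey key, Object provider) {
        providers.put(key, provider);
    }

    /** Looks up the provider and calls methodName(args) through reflection. */
    Object invoke(ServiceKey key, String methodName, Object[] args) {
        Object provider = providers.get(key);
        if (provider == null) {
            throw new IllegalStateException("no provider for " + key);
        }
        for (Method m : provider.getClass().getMethods()) {
            // Simplification: match by name and arity only.
            if (m.getName().equals(methodName) && m.getParameterCount() == args.length) {
                try {
                    return m.invoke(provider, args);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    throw new IllegalStateException("invocation failed: " + methodName, e);
                }
            }
        }
        throw new IllegalStateException("no such method: " + methodName);
    }
}
```

A real framework would deserialize the key, method name, and arguments from the request bytes before calling invoke, and serialize the returned value afterwards.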

2.3 Diagram of Remote Client Call

Important Concept: RPC triple <ID, Request, Response>.

Note: In Netty 4.x, thread contention can be reduced further by replacing the global Map with a per-I/O-thread (worker) Map<InvokeId, Future>.
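A minimal sketch of how the ID ties a response back to its request is shown below. It uses the simpler global-map variant; PendingCalls and its method names are hypothetical, and CompletableFuture stands in for whatever Future type the framework defines. In the per-worker variant, each I/O thread would own its own map, so a plain HashMap would be enough.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Tracks in-flight calls: invokeId -> future awaiting the response. */
final class PendingCalls {
    private final AtomicLong nextId = new AtomicLong();
    private final Map<Long, CompletableFuture<Object>> pending = new ConcurrentHashMap<>();

    /** Called on the sending side: allocates an ID and remembers the future. */
    long register(CompletableFuture<Object> future) {
        long invokeId = nextId.incrementAndGet();
        pending.put(invokeId, future);
        return invokeId;
    }

    /** Called when a response arrives; returns false if the ID is unknown. */
    boolean complete(long invokeId, Object response) {
        CompletableFuture<Object> future = pending.remove(invokeId);
        return future != null && future.complete(response);
    }

    int inFlight() {
        return pending.size();
    }
}
```

The ID travels with the request and comes back with the response, which is what allows many calls to share one connection.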

2.4 Diagram of Remote Server Call

2.5 Diagram of Remote Call on the Transport Layer

2.6 Design Transport Layer Protocol Stack

Protocol Header

Protocol Body

1) metadata:<group, providerName, version>

2) methodName

3) Is parameterTypes[] necessary?

a) What's the problem?

Potential lock contention for ClassLoader.loadClass() during deserialization
The size of the protocol body stream
Parameter types are added in generic calls

b) Can they be solved?

For more information about the static dispatch rules for Java methods, please see section 15.12.2.5 entitled Choosing the Most Specific Method of The Java Language Specification.

c) args[]

d) Other: traceId, appName ...

3. Features, Good Practices, and Squeezing Out Performance

3.1 Create a Client Proxy Object

1) What does the Proxy do?

Cluster fault tolerance —> Load balancing —> Network

2) What methods can be used to create a Proxy?

jdk proxy/javassist/cglib/asm/bytebuddy

3) What are the most important aspects?

Intercept toString, equals, hashCode, and other Object methods locally so that they do not trigger remote calls.
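As a dependency-free illustration of this point, the sketch below uses a JDK dynamic proxy rather than ByteBuddy. LocalObjectMethodsHandler, Greeter, and the remoteInvoker callback are hypothetical names; the callback stands in for the cluster fault tolerance, load balancing, and network path.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.function.BiFunction;

/** Sample service interface used only for this illustration. */
interface Greeter {
    String greet(String name);
}

/** Answers Object methods locally and forwards everything else. */
final class LocalObjectMethodsHandler implements InvocationHandler {
    private final String serviceName;
    // Hypothetical stand-in for the remote call: (methodName, args) -> result.
    private final BiFunction<String, Object[], Object> remoteInvoker;

    LocalObjectMethodsHandler(String serviceName,
                              BiFunction<String, Object[], Object> remoteInvoker) {
        this.serviceName = serviceName;
        this.remoteInvoker = remoteInvoker;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) {
        // JDK proxies dispatch toString/equals/hashCode with Object as declaring class.
        if (method.getDeclaringClass() == Object.class) {
            switch (method.getName()) {
                case "toString":
                    return "RpcProxy(" + serviceName + ")";
                case "hashCode":
                    return System.identityHashCode(proxy);
                case "equals":
                    return proxy == args[0];
                default:
                    throw new UnsupportedOperationException(method.getName());
            }
        }
        return remoteInvoker.apply(method.getName(), args);
    }

    @SuppressWarnings("unchecked")
    static <T> T newProxy(Class<T> iface, BiFunction<String, Object[], Object> remoteInvoker) {
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(),
                new Class<?>[] {iface},
                new LocalObjectMethodsHandler(iface.getName(), remoteInvoker));
    }
}
```

Without the check, an innocent log statement or a HashMap lookup on the proxy would go over the network.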

4) Recommendation (bytebuddy):

3.2 Elegant Synchronous/Asynchronous Call

First, look back at the "diagram of remote client call"
Then, understand how to handle Failover better
Think about the future

3.3 Unicast/Multicast

Message Dispatcher
FutureGroup

3.4 Generic Call

3.5 Serialization / Deserialization

The protocol header is marked with the serializer type. Multiple types are supported.
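The sketch below shows one way a header can carry the serializer type. The layout is an assumption made for illustration, not the layout from this article's protocol diagram: a 2-byte magic number, one byte holding the serializer type in its high four bits and the message type in its low four bits, an 8-byte invoke ID, and a 4-byte body length. HeaderCodec and the MAGIC value are hypothetical.

```java
import java.nio.ByteBuffer;

/** Encodes and inspects frames with an assumed 15-byte header. */
final class HeaderCodec {
    static final short MAGIC = (short) 0xBABE;   // assumed value
    static final int HEADER_SIZE = 2 + 1 + 8 + 4;

    static ByteBuffer encode(int serializerType, int messageType, long invokeId, byte[] body) {
        if (serializerType < 0 || serializerType > 15 || messageType < 0 || messageType > 15) {
            throw new IllegalArgumentException("types must fit in 4 bits");
        }
        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE + body.length);
        buf.putShort(MAGIC);
        buf.put((byte) ((serializerType << 4) | messageType));
        buf.putLong(invokeId);
        buf.putInt(body.length);
        buf.put(body);
        buf.flip();
        return buf;
    }

    // The readers below use absolute gets, so the buffer position is not moved.
    static int serializerType(ByteBuffer frame) {
        return (frame.get(frame.position() + 2) & 0xF0) >>> 4;
    }

    static int messageType(ByteBuffer frame) {
        return frame.get(frame.position() + 2) & 0x0F;
    }

    static long invokeId(ByteBuffer frame) {
        return frame.getLong(frame.position() + 3);
    }

    static int bodyLength(ByteBuffer frame) {
        return frame.getInt(frame.position() + 11);
    }
}
```

The receiver reads the serializer type from the header first and then picks the matching deserializer for the body, which is how several serializers can coexist on one connection.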

3.6 Scalability

Java SPI:

java.util.ServiceLoader
META-INF/services/com.xxx.Xxx

3.7 Service-Level Thread Pool Isolation

The failure of one thread pool does not affect other thread pools.

3.8 Interceptor in Responsibility Chain Mode

Many extensions start from here.

3.9 Metrics

3.10 Tracing

OpenTracing

3.11 Registry

3.12 Throttling (Application Level or Service Level)

The framework needs extension points so that third-party throttling middleware can be integrated easily.

3.13 What if the Provider Thread Pool Is Full?

3.14 Soft Load Balancing

1) Weighted Random (Binary Search Instead of Linear Traversal)

2) Weighted Round-Robin (Greatest Common Divisor)

3) Minimum Load

4) Consistent Hash (Stateful Service Scenarios)

5) Others

Note: Warm-up logic is required.
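For the weighted random strategy, the binary search can run over an array of prefix sums of the weights. The sketch below is a minimal version under stated assumptions: WeightedRandom and its method names are hypothetical, weights are fixed positive integers, and warm-up (ramping up the weight of a freshly started provider) is not modeled.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

/** Weighted random selection using binary search over prefix sums. */
final class WeightedRandom {
    private final String[] addresses;
    private final int[] prefixSums;   // prefixSums[i] = weights[0] + ... + weights[i]

    WeightedRandom(String[] addresses, int[] weights) {
        if (addresses.length == 0 || addresses.length != weights.length) {
            throw new IllegalArgumentException("need one weight per address");
        }
        this.addresses = addresses.clone();
        this.prefixSums = new int[weights.length];
        int sum = 0;
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] <= 0) {
                throw new IllegalArgumentException("weights must be positive");
            }
            sum += weights[i];
            prefixSums[i] = sum;
        }
    }

    int totalWeight() {
        return prefixSums[prefixSums.length - 1];
    }

    String select() {
        return selectAt(ThreadLocalRandom.current().nextInt(totalWeight()));
    }

    /** Maps a point in [0, totalWeight) to an address in O(log n). */
    String selectAt(int point) {
        // Find the first prefix sum that is strictly greater than point.
        int idx = Arrays.binarySearch(prefixSums, point + 1);
        if (idx < 0) {
            idx = -idx - 1;   // not found: convert to the insertion point
        }
        return addresses[idx];
    }
}
```

With weights 1, 2, and 3, the points 0 to 5 map to the first address once, the second twice, and the third three times, so the selection probability is proportional to the weight.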

3.15 Cluster Fault Tolerance

1) Fail-Fast

2) Failover

How do we handle asynchronous calls?

Better

3) Fail-Safe

4) Fail-Back

5) Forking

6) Others
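For the question of how Failover can work with asynchronous calls, one possible approach is to chain the retry onto the completion of the failed attempt instead of blocking. The sketch below assumes CompletableFuture as the future type; AsyncFailover and the invoker callback (address -> future) are hypothetical, and addresses are simply tried in list order.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Non-blocking failover: try the next address when an attempt fails. */
final class AsyncFailover {
    static <T> CompletableFuture<T> call(List<String> addresses,
                                         Function<String, CompletableFuture<T>> invoker) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(addresses, 0, invoker, result, null);
        return result;
    }

    private static <T> void attempt(List<String> addresses, int index,
                                    Function<String, CompletableFuture<T>> invoker,
                                    CompletableFuture<T> result, Throwable lastError) {
        if (index >= addresses.size()) {
            // All addresses failed: surface the last error to the caller.
            result.completeExceptionally(lastError != null
                    ? lastError
                    : new IllegalStateException("no address available"));
            return;
        }
        CompletableFuture<T> current;
        try {
            current = invoker.apply(addresses.get(index));
        } catch (RuntimeException e) {
            attempt(addresses, index + 1, invoker, result, e);
            return;
        }
        current.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else {
                attempt(addresses, index + 1, invoker, result, error);
            }
        });
    }
}
```

The caller only ever sees the outer future, so synchronous and asynchronous callers can share the same failover logic: a synchronous caller just waits on the returned future.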

3.16 The Way to Explore the Performance (Don't Trust It, Test It)

1) Write a FastMethodAccessor using ASM to replace the reflection call on the server.

2) Serialization/Deserialization

Serialize and deserialize in business threads to avoid occupying I/O threads:

There are very few I/O threads, so serialization/deserialization on them would consume scarce I/O thread time slices.
Deserialization often involves loading a Class, and loadClass has a serious lock contention problem, which can be observed through JMC.

Select an efficient serialization/deserialization framework:

For example, Kryo, Protobuf, Protostuff, Hessian, or Fastjson

Framework selection is only the first step. If the serialization framework does not perform well enough, extend and optimize it:

The traditional serialization/deserialization and data writing/reading path is: java object -> byte[] -> off-heap memory, and off-heap memory -> byte[] -> java object.
Optimization: Omit the byte[] step and read from/write to the off-heap memory directly. This requires extending the serialization framework accordingly.
String encoding and decoding optimization
Varint optimization: Multiple single-byte writes are merged into one writeShort/writeInt/writeLong.
Example for Protostuff optimization: UnsafeNioBufInput reads from the off-heap memory directly, and UnsafeNioBufOutput writes to the off-heap memory directly.
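The idea of merging several single-byte writes into one wider write can be sketched with a java.nio.ByteBuffer and a base-128 variable-length integer. This is an illustration only, not the Protostuff extension mentioned above: VarintWriter is a hypothetical name, only the two-byte case is merged, and the buffer is assumed to use big-endian byte order (the ByteBuffer default).

```java
import java.nio.ByteBuffer;

/** Writes base-128 variable-length ints, one byte at a time or merged. */
final class VarintWriter {
    /** Baseline: one put() call per encoded byte. */
    static void writeSimple(ByteBuffer out, int value) {
        while ((value & ~0x7F) != 0) {
            out.put((byte) ((value & 0x7F) | 0x80));   // low 7 bits plus continuation bit
            value >>>= 7;
        }
        out.put((byte) value);
    }

    /** Same bytes, but the two-byte case is emitted with a single putShort(). */
    static void writeMerged(ByteBuffer out, int value) {
        if ((value & ~0x7F) == 0) {
            out.put((byte) value);
        } else if ((value & ~0x3FFF) == 0) {
            int first = (value & 0x7F) | 0x80;
            int second = value >>> 7;
            // Big-endian order puts the high byte (first) into the buffer first.
            out.putShort((short) ((first << 8) | second));
        } else {
            writeSimple(out, value);   // wider cases follow the same pattern
        }
    }
}
```

Each put on a buffer carries bounds checks and index updates, so fewer and wider writes reduce that fixed per-call cost. Whether the gain matters should be measured, in the spirit of "Don't Trust It, Test It".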

3) Bind I/O threads to CPU cores.

4) Client-side coroutines: synchronous blocking calls on the client easily become a bottleneck.

There are not many options at the Java level, and they are not perfect for the time being.

Name	Description
Kilim	Bytecode enhancement during compilation
Quasar Agent	Dynamic bytecode enhancement
ali_wisp	Implemented at the JVM level in ali_jvm

5) Netty Native Transport and PooledByteBufAllocator:

Reduce the fluctuation caused by GC

6) Release the I/O thread as soon as possible to do what it should do and minimize thread context switching

4. Why Netty?

4.1 BIO vs. NIO

4.2 Difficulties of Java Native NIO APIs

High Complexity

It is difficult to start because APIs are complex.
Sticky packet/half-packet problems are troublesome.
Strong concurrent or asynchronous programming skills are required. Otherwise, it is difficult to write efficient and stable implementations.

Poor Stability with Multiple Problems

Debugging is difficult. Occasionally you run into a bug that is extremely hard to reproduce, yet you still have to track it down.
On Linux, EPollArrayWrapper.epollWait can return immediately with no ready events, and the resulting empty polling loop drives CPU usage to 100%. Netty works around this bug by rebuilding the selector.

Some Disadvantages of NIO Code Implementation

1) Selector.selectedKeys() produces too much garbage.

Netty replaces the selectedKeys set of sun.nio.ch.SelectorImpl with an implementation backed by two arrays instead of a HashSet:

Compared with HashSet, which allocates iterators, wrapper objects, and so on, less garbage is generated. This helps reduce GC activity.
Slight performance gain (1~2%)

NIO code is synchronized everywhere, for example in direct buffer allocation and Selector.wakeup():

For direct buffer allocation, Netty's PooledByteBuf has a thread-local cache in front of it, similar to a TLAB (thread-local allocation buffer), which reduces lock contention effectively.
If wakeup is called too often, serious lock contention and high overhead occur. (Reason for the high overhead: on Linux, a pair of pipes is used so that threads other than the select thread can communicate with it. On Windows, a pipe handle cannot be placed in an fd_set, so as a compromise two TCP connections are used to simulate the pipe.) If wakeup is called too rarely, select blocks longer than necessary. (If this is confusing, use Netty directly, which has the corresponding optimization logic.)
Netty Native Transport has fewer locks.

2) fdToKey mapping

EPollSelectorImpl#fdToKey maintains the mapping of SelectionKey corresponding to all connected fd (descriptor), which is a HashMap.
Each worker thread has its own selector, so each worker has its own fdToKey. Together, these fdToKey maps split all connections roughly evenly.
Imagine a scenario where a single machine holds hundreds of thousands of connections, and each HashMap rehashes step by step as it grows from the default capacity of 16.

3) On Linux, Selector is implemented with epoll in LT mode.

Netty Native Transport supports Epoll ET.

4) Direct buffers are managed by GC.

DirectByteBuffer.cleaner: This phantom reference is responsible for freeing the direct memory. DirectByteBuffer itself is just a shell. If the shell survives past the tenuring threshold of the young generation and is promoted to the old generation, the direct memory behind it is not freed until the old generation is collected, which is bad news.
Failing to reserve enough direct memory triggers an explicit GC: Bits.reserveMemory() -> { System.gc() }. First, the whole process is paused by the GC, and then the code sleeps for 100 milliseconds. If direct memory is still insufficient after it wakes up, an OutOfMemoryError is thrown.
To make matters worse, if you follow the advice of books with titles like Tips on the Improvement of XX and set -XX:+DisableExplicitGC, that explicit GC never runs, with unpleasant consequences.
Netty's UnpooledUnsafeNoCleanerDirectByteBuf removes the cleaner. The Netty framework frees the memory promptly by maintaining a reference count.

5. The Real Netty

5.1 Several Important Concepts in Netty and Their Relationships

EventLoop

A Selector
A task queue (mpsc_queue: Lock-free multiple producers and single consumer)
A delayed task queue (delay_queue: A priority queue with a binary heap structure. Complexity: O(log n)).
Each EventLoop is bound to one Thread, which avoids thread contention in the pipeline.

Boss: the mainReactor and Worker: the subReactor

The Boss and Worker share the EventLoop code logic. The Boss handles the accept event, and the Worker handles read and write events.
After the Boss listens for and accepts a connection (channel), it hands the channel to a Worker in round-robin fashion. The Worker is responsible for processing the subsequent I/O events of the channel, such as read and write.
When only one port is bound, BossEventLoopGroup needs just one EventLoop, and only one will ever be used.
WorkerEventLoopGroup generally contains multiple EventLoops, and the number is generally twice the number of CPU cores. The most important thing is to find the best value for your scenario.
Channels are divided into two types: ServerChannel and Channel. ServerChannel corresponds to ServerSocketChannel, and Channel corresponds to a network connection.

5.2 Netty4 Thread Model

5.3 ChannelPipeline

5.4 Pooling & Reuse

PooledByteBufAllocator

Based on jemalloc paper (3.x)
ThreadLocal caches for lock-free allocation: This practice caused some problems. When a ByteBuf was allocated on one thread and released on another, memory leaked. This was later solved with an mpsc_queue at the cost of a little performance.
Different size classes

Recycler

ThreadLocal + Stack
Memory leakage occurred when an element was obtained on one thread and returned on another.
Later, it was improved. When different threads return elements, they are put into a WeakOrderQueue and associated with the stack. If the stack is empty in the next pop, all WeakOrderQueues associated with the current stack are scanned first.
WeakOrderQueue is a linked list of multiple arrays. The default size of each array is 16.
Open question: What is the impact on GC when old objects reference new ones?
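The basic "ThreadLocal + Stack" idea can be sketched in a few lines. This is a deliberately simplified illustration, not Netty's Recycler: SimpleRecycler is a hypothetical class, and it has no equivalent of WeakOrderQueue, so an object recycled on a different thread simply lands in that thread's stack.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

/** Per-thread object pool: each thread reuses objects from its own stack. */
final class SimpleRecycler<T> {
    private final Supplier<T> factory;
    private final int maxPerThread;
    private final ThreadLocal<ArrayDeque<T>> stacks = ThreadLocal.withInitial(ArrayDeque::new);

    SimpleRecycler(Supplier<T> factory, int maxPerThread) {
        this.factory = factory;
        this.maxPerThread = maxPerThread;
    }

    /** Pops a pooled object, or creates a new one if the stack is empty. */
    T get() {
        T pooled = stacks.get().pollFirst();
        return pooled != null ? pooled : factory.get();
    }

    /** Pushes the object back; returns false if the stack is already full. */
    boolean recycle(T object) {
        ArrayDeque<T> stack = stacks.get();
        if (stack.size() >= maxPerThread) {
            return false;   // let GC reclaim the surplus object
        }
        stack.addFirst(object);
        return true;
    }
}
```

Because each stack is touched by only one thread, no locking is needed. The capacity limit keeps a thread that only ever recycles from growing its stack without bound.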

5.5 Netty Native Transport

It creates fewer objects and has less GC pressure than NIO.

The following part describes some specific features for the optimization on Linux:

SO_REUSEPORT: Port reuse. Multiple sockets are allowed to listen on the same IP address and port, and combining this with RPS/RFS improves performance further. RPS and RFS simulate multi-queue network interface cards (NICs) at the software layer and provide load balancing. This avoids having all NIC receive interrupts and packet delivery land on a single CPU core, which would hurt performance.
TCP_FASTOPEN: The three-way handshake is also used to exchange data.
EDGE_TRIGGERED: Epoll ET is supported.
Unix domain sockets (Inter-process communication on the same machine, such as Service Mesh).

5.6 An Introduction to Multiplexing

select/poll

Due to the limitations on the implementation mechanism, the more the concurrent connections, the poorer the performance. The time complexity of using polling to detect ready events is O(n). We also need to copy the bloated fd_set between the user space and kernel space repeatedly.
Poll is not very different from select. The only difference is the limit on the maximum number of file descriptors is lifted.
Both select and poll are in LT mode.

Epoll

Ready events are detected through callbacks, so the time complexity is O(1). Each call to epoll_wait returns only the ready file descriptors.
Epoll supports LT and ET modes.

5.7 More In-Depth Understanding of Epoll

LT vs. ET

Concepts:

LT: Level-Triggered
ET: Edge-Triggered

Readable:

When the buffer is not empty, the corresponding readable state in fd events is set to 1. Otherwise, it is set to 0.

Writable:

When the buffer has space to write, the corresponding writable state in fd events is set to 1. Otherwise, it is set to 0.

Diagram:

Three Epoll Methods

1) Main Code: linux-2.6.11.12/fs/eventpoll.c

2) int epoll_create(int size)

Create an rb-tree (red-black tree) and a ready-list (ready linked list):

Red-black tree O(logN) balances efficiency and memory occupation. Red-black tree is the best choice when capacity demand is uncertain and probably large.
The size parameter no longer has any effect. Early epoll implementations used a hash table, which is why the size parameter was needed.

3) int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event)

Put epitem in the rb-tree and register ep_poll_callback with the kernel interrupt handler. When the callback is triggered, put the epitem in the ready-list.

4) int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout)

ready-list —> events[]

Data Structure of Epoll

epoll_wait Workflow Overview

Code for reference: linux-2.6.11.12/fs/eventpoll.c:

1) epoll_wait calls ep_poll

If rdlist (ready-list) is empty (no ready fd), the current thread is suspended. The current thread is only awakened when rdlist is not empty.

2) The event status of file descriptor fd is changed.

The buffer changes from an unreadable state to a readable state or from a non-writable state to a writable state. As a result, the callback function ep_poll_callback on the corresponding fd is triggered.

3) ep_poll_callback is triggered.

The epitem of the corresponding fd is added to the rdlist. Therefore, the rdlist is not empty, the thread is awakened, and the epoll_wait can continue.

4) Run the ep_events_transfer function

Copy the epitem in rdlist to txlist and clear rdlist
If epoll LT is used and the fd.events status has not changed (for example, the data in the buffer has not been read yet), the epitem is put back into rdlist.

5) Run the ep_send_events function

Scan each epitem in txlist and call the poll method corresponding to its associated fd to obtain the newer events
Send the events and corresponding fd to the user space

5.8 Netty-Based Best Practices

1) The necessity of the business thread pool

The business logic, especially the logic with a long blocking time, should not occupy Netty I/O threads and should be dispatched to the business thread pool.

2) WriteBufferWaterMark

Please note the default settings of the high and low watermarks (32 KB to 64 KB) and adjust the value according to the scenario. You can think about how to use the watermark.

3) Rewrite the MessageSizeEstimator to reflect the real high and low watermarks

The default implementation cannot calculate the size of an arbitrary object. When an object is written, its size is estimated before it passes through any outbound handler, at which point it has not yet been encoded into a ByteBuf. The estimate is therefore inaccurate (too small).

4) Pay attention to the setting of EventLoop#ioRatio (50 by default).

This controls the proportion of time for EventLoop to execute I/O tasks and non-I/O tasks.

5) Who schedules idle connection detection?

Netty 4.x schedules it on the I/O thread by default, using the delayQueue of the EventLoop, a priority queue implemented as a binary heap with O(log N) complexity. Each worker handles idle detection for its own connections, which helps reduce context switching, but network I/O and idle detection affect each other.
If the total number of connections is small (for example, up to tens of thousands), the implementation above is fine. When the number of connections is large, we recommend implementing an IdleStateHandler on top of HashedWheelTimer. Its complexity is O(1), and it makes network I/O and idle detection independent of each other, at the cost of context switching overhead.
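To show why a hashed wheel gives O(1) scheduling, here is a minimal single-threaded sketch of the data structure. It is not Netty's HashedWheelTimer: SimpleHashedWheel is a hypothetical class, time advances only when tick() is called, and there is no cancellation or worker thread.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Minimal hashed timing wheel driven by explicit tick() calls. */
final class SimpleHashedWheel {
    private static final class Timeout {
        final Runnable task;
        long remainingRounds;   // full wheel revolutions left before firing

        Timeout(Runnable task, long remainingRounds) {
            this.task = task;
            this.remainingRounds = remainingRounds;
        }
    }

    private final List<List<Timeout>> wheel = new ArrayList<>();
    private long currentTick;

    SimpleHashedWheel(int wheelSize) {
        for (int i = 0; i < wheelSize; i++) {
            wheel.add(new ArrayList<>());
        }
    }

    /** O(1): the task runs on the delayTicks-th future call of tick(). */
    void schedule(Runnable task, long delayTicks) {
        if (delayTicks < 1) {
            throw new IllegalArgumentException("delayTicks must be >= 1");
        }
        long deadline = currentTick + delayTicks;
        int slot = (int) (deadline % wheel.size());
        long rounds = (delayTicks - 1) / wheel.size();
        wheel.get(slot).add(new Timeout(task, rounds));
    }

    /** Advances one slot and runs every timeout in it whose rounds are used up. */
    void tick() {
        currentTick++;
        List<Timeout> bucket = wheel.get((int) (currentTick % wheel.size()));
        List<Runnable> due = new ArrayList<>();
        for (Iterator<Timeout> it = bucket.iterator(); it.hasNext();) {
            Timeout timeout = it.next();
            if (timeout.remainingRounds == 0) {
                it.remove();
                due.add(timeout.task);
            } else {
                timeout.remainingRounds--;
            }
        }
        // Run after the scan so that tasks may safely schedule new timeouts.
        for (Runnable task : due) {
            task.run();
        }
    }
}
```

Scheduling only appends to one bucket, which is where the O(1) cost comes from, compared with the O(log N) insert into a binary heap. The price is coarser timing, which is usually acceptable for idle checks.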

6) ctx.writeAndFlush or channel.writeAndFlush?

ctx.write goes directly to the next outbound handler. Be careful not to let it bypass idle connection detection unintentionally.
channel.write starts from the tail of the pipeline and passes through all outbound handlers one by one toward the head.

7) Use ByteBuf.forEachByte() instead of looping over ByteBuf.readByte(), which avoids rangeCheck() on every byte

8) Use CompositeByteBuf to avoid unnecessary memory copying

The disadvantage is that index computation is more expensive. Please make your own judgment based on the scenario.

9) To read an int, use ByteBuf.readInt() instead of ByteBuf.readBytes(buf, 0, 4).

This can avoid a memory copy. The same is true for long, short, and other types.

10) Configure UnpooledUnsafeNoCleanerDirectByteBuf to replace the JDK's DirectByteBuffer so that the Netty framework releases off-heap memory based on the reference count.

io.netty.maxDirectMemory:

< 0: No cleaner is used, and Netty inherits the maximum direct memory size set for the JDK. The JDK's own direct memory is accounted separately, so the total direct memory can be up to twice the JDK setting.
== 0: The cleaner is used, and Netty does not set a maximum direct memory size.
> 0: No cleaner is used, and this parameter limits the maximum direct memory size of Netty. (The JDK's direct memory is accounted separately and is not limited by this parameter.)

11) Optimal Number of Connections

A single connection becomes a bottleneck and cannot utilize the CPU effectively, while too many connections bring no benefit. The best practice is to test for yourself based on your scenario.

12) When using PooledByteBuf, make good use of the -Dio.netty.leakDetection.level parameter.

Four Levels: DISABLED, SIMPLE, ADVANCED, and PARANOID
SIMPLE: Its sampling rate is the same as ADVANCED, less than 1% (bitwise AND with mask == 128 - 1).
The default is the SIMPLE level, with low overhead.
When leakage occurs, the word "LEAK:" will appear in logs. Please run the grep command to check the logs from time to time. Once "LEAK:" appears, change the level to ADVANCED immediately and run again. This way, you can know where the leaking object was accessed.
PARANOID: This level is recommended for testing because the sampling rate is 100%.

13) Channel.attr() – Attach your own objects to the channel

It is a thread-safe hash table that resolves collisions by chaining (open hashing) and uses a form of lock striping (only the bucket head is locked). Lock contention occurs only on hash collisions (similar to ConcurrentHashMapV8).
The hash table has only four buckets by default. Use them sparingly.

5.9 Code Skills Learned from the Netty Source Code

1) Use AtomicIntegerFieldUpdater instead of AtomicInteger in scenarios with a large number of objects

In Java, the object header is 12 bytes (with compressed pointers enabled). Java objects are aligned to eight bytes, so the minimum object size is 16 bytes. An AtomicInteger is 16 bytes, and an AtomicLong is 24 bytes.
An AtomicIntegerFieldUpdater is held in a static field and operates on a volatile int field, so no extra object is allocated per instance.
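A reference count is a typical use of this technique. The sketch below is a simplified illustration rather than Netty's actual reference-counting code; RefCounted is a hypothetical class. One static updater serves every instance, and each instance carries only a 4-byte volatile int.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

/** Reference-counted object using a shared static field updater. */
final class RefCounted {
    // One updater for all instances instead of one AtomicInteger per instance.
    private static final AtomicIntegerFieldUpdater<RefCounted> REF_CNT =
            AtomicIntegerFieldUpdater.newUpdater(RefCounted.class, "refCnt");

    private volatile int refCnt = 1;

    int refCnt() {
        return refCnt;
    }

    RefCounted retain() {
        for (;;) {
            int current = refCnt;
            if (current <= 0) {
                throw new IllegalStateException("already released");
            }
            if (REF_CNT.compareAndSet(this, current, current + 1)) {
                return this;
            }
        }
    }

    /** Returns true when this call dropped the count to zero. */
    boolean release() {
        for (;;) {
            int current = refCnt;
            if (current <= 0) {
                throw new IllegalStateException("already released");
            }
            if (REF_CNT.compareAndSet(this, current, current - 1)) {
                return current == 1;
            }
        }
    }
}
```

With millions of live buffers, saving a 16-byte AtomicInteger per object adds up, which is why the pattern pays off only when the number of objects is large.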

2) FastThreadLocal is faster than the JDK's ThreadLocal.

Linear-probing hash table -> plain array storage, with indexes assigned by an atomic auto-incrementing counter.

3) IntObjectHashMap / LongObjectHashMap

Integer —> int
Node[] —> raw array

4) RecyclableArrayList

Based on the Recycler mentioned above, you can use it in scenarios with frequent new ArrayList.

5) JCTools

It provides lock-free SPSC/MPSC/SPMC/MPMC concurrent queues and NonBlockingHashMap (comparable to ConcurrentHashMapV6/V8), which are not available in the JDK.

Recruitment

We are the Ant Intelligent Monitoring Technology Middle Platform Storage Team. We are using Rust, Go, and Java to build a new-generation low-cost time-series database with high performance and real-time analysis capability. You are welcome to transfer positions or recommend other applicants to our team. Please contact Feng Jiachun via email (jiachun.fjc@antgroup.com) for more information.

References

[1] Netty

[2] JDK-Source

/jdk/src/solaris/classes/sun/nio/ch/EPollSelectorImpl.java
/jdk/src/solaris/classes/sun/nio/ch/EPollArrayWrapper.java
/jdk/src/solaris/native/sun/nio/ch/EPollArrayWrapper.c

[3] Linux-Source

linux-2.6.11.12/fs/eventpoll.c
https://github.com/torvalds/linux

[4] RPS/RFS

https://my.oschina.net/guol/blog/113144 (Article in Chinese)

[5] I/O Multiplexing

Volume 1 Chapter 6 of UNIX Network Programming
http://blog.csdn.net/russell_tao/article/details/17119729 (Article in Chinese)

[6] jemalloc

[7] SO_REUSEPORT

https://my.oschina.net/miffa/blog/390931 (Article in Chinese)

[8] TCP_FASTOPEN

https://www.oschina.net/question/12_137950 (Article in Chinese)

[9] Main Reference Sources for Best Practices

http://calvin1978.blogcn.com/articles/netty-info.html (Article in Chinese)

[10] https://github.com/eishay/jvm-serializers/wiki

[11] https://github.com/OpenHFT/Java-Thread-Affinity