Serialization technologies (XML, JSON, JDBC byte encoding, Protobuf)
There are many serialization and deserialization schemes for network transmission in the industry today. They were designed with very different goals and against very different backgrounds, so their architectures differ widely, and in the end they diverge greatly in parsing speed, impact on the system, size of the transmitted data, maintainability, and readability. This article shares some of my analysis and understanding of several common serialization technologies.
The article is divided into 3 parts:
1. List common serialization and deserialization schemes (ObjectXXStream, XML, JSON)
2. Processing scheme of MySQL JDBC result set
3. Google Protocol Buffer processing solution
[1. Common serialization solutions used in API calls and message communication]:
Option 1. Serialize and deserialize objects with Java's native ObjectOutputStream.writeObject() and ObjectInputStream.readObject().
Option 2. Serialize and deserialize based on JSON.
Option 3. Serialization and deserialization based on XML.
[Scheme 1 Analysis, ObjectXXXStream]:
Advantages:
(1) It uses Java's built-in API: simple, convenient, and without third-party dependencies.
(2) There is no need to worry about precision loss during parsing, missing fields, or uncertainty about the concrete type of an Object when deserializing.
Disadvantages:
(1) Debugging between the two parties is troublesome. The sender and receiver had better have exactly the same version of the object definition, otherwise strange problems appear, the debugging cycle gets long, and cross-team cooperation runs into many issues.
(2) The transferred object carries metadata, which takes up considerable space.
[Scheme 2 analysis, JSON serialization]:
Advantages:
(1) Simple and convenient; there is no need to pay attention to the format of the object being serialized.
(2) Many open-source libraries support it, and some, such as FastJSON, perform very well.
(3) Basically every RPC framework supports this scheme.
Disadvantages:
(1) If an object property is declared as Object and the business does not know the concrete type at deserialization time, it is very troublesome to handle.
(2) Because it is a text format, it inevitably takes up more space, as in the example below.
(3) The JSON parsing libraries you depend on differ in compatibility and performance, and they may handle certain details (such as non-standard JSON) differently.
(4) During serialization, whatever the data type, it must first be converted to a String and then to byte[], which adds memory copies.
(5) During deserialization, the entire JSON must be turned into objects before it can be read; keep in mind that Java objects, especially deeply nested ones, occupy far more memory than the data itself.
Data amplification extreme case 1:
The data to be transferred is described as:
class PP {
    long userId = 102333320132133L;
    int passportNumber = 123456;
}
The format of the JSON passed at this time is:
{
"userId":102333320132133,
"passportNumber": 123456
}
The data we want to pass is 1 long and 1 int, i.e. 12 bytes. The string length of this JSON (excluding the line breaks and spaces, which are only there for readability, and note that the long travels as text characters in JSON) is about 51 bytes, which means the data is amplified roughly 4.25 times.
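As a rough cross-check of these numbers, here is a minimal sketch (my own illustration, not from the original article; the JSON string is built by hand so no particular library is assumed):

import java.nio.charset.StandardCharsets;

public class JsonSizeDemo {
    public static void main(String[] args) {
        long userId = 102333320132133L;
        int passportNumber = 123456;

        // Binary size: one long (8 bytes) + one int (4 bytes) = 12 bytes
        int binarySize = Long.BYTES + Integer.BYTES;

        // The same data rendered as compact JSON text
        String json = "{\"userId\":" + userId
                + ",\"passportNumber\":" + passportNumber + "}";
        int jsonSize = json.getBytes(StandardCharsets.UTF_8).length;

        System.out.println("binary: " + binarySize + " bytes, json: " + jsonSize
                + " bytes, amplification: " + (double) jsonSize / binarySize);
    }
}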
Data amplification extreme case 2:
When your object contains a field of type byte[]: JSON is a text format and cannot carry raw byte[], so to serialize such data the bytes have to be converted into characters. The two usual approaches are:
(1) Use Base64 encoding, which is the common practice in JSON today.
(2) Encode each byte as two hexadecimal characters, e.g. the string "FF" represents the byte 0xFF.
With hex encoding, 1 byte becomes 2 characters (a 2x enlargement); with Base64, every 3 bytes become 4 characters (about 1.33x), so either way the byte[] data grows noticeably. Why not use the ISO-8859-1 character encoding? Encoded that way the final serialized byte[] would indeed not grow, but when the text is deserialized the receiver does not know it is ISO-8859-1 and will typically parse the String with a more common charset such as GBK or UTF-8 before parsing the JSON, so the byte[] may be altered during the charset conversion, and that problem is very troublesome to deal with.
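To make the enlargement concrete, a minimal sketch using the JDK's own java.util.Base64 plus a hand-rolled hex encoder (my own illustration):

import java.util.Base64;

public class ByteArrayInJsonDemo {
    public static void main(String[] args) {
        byte[] payload = new byte[300]; // e.g. a 300-byte binary field

        // Base64: every 3 bytes become 4 characters
        String base64 = Base64.getEncoder().encodeToString(payload);

        // Hex: every byte becomes 2 characters
        StringBuilder hex = new StringBuilder();
        for (byte b : payload) {
            hex.append(String.format("%02X", b));
        }

        // Prints: raw: 300, base64: 400, hex: 600
        System.out.println("raw: " + payload.length
                + ", base64: " + base64.length()
                + ", hex: " + hex.length());
    }
}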
[Scheme 3 analysis, XML serialization]:
Advantages:
(1) Simple and convenient to use; there is no need to pay attention to the format of the object being serialized.
(2) Readability is good; XML is common in the industry, and everyone is used to seeing XML in configuration files.
(3) It is supported by a large number of RPC frameworks, and the XML can directly serve as documentation for circulation.
Disadvantages:
(1) Serialization and deserialization performance has never been very good.
(2) It has the same data-type problems as JSON, as well as the data-amplification problem, which is even more serious; the number of memory copies is also the same as with JSON and is unavoidable.
XML data amplification example:
XML data amplification is usually even more serious than JSON's. For the JSON case above, XML would typically transmit the data like this:
<PP>
    <userId>102333320132133</userId>
    <passportNumber>123456</passportNumber>
</PP>
This message is 80+ bytes. If attributes are added to the XML and objects are nested, the amplification ratio may reach 10x, so XML's amplification is even more serious than JSON's, which is why more and more APIs now prefer JSON over XML.
[What problems does data amplification cause]:
(1) More time is spent concatenating strings and copying memory, more Java heap is occupied, and more fragmentation is produced.
(2) To turn the generated JSON object into byte[], it must first be converted into a String and then encoded into byte[]; because this is a text protocol, that naturally means another full copy of the data.
(3) Because the data is amplified, transmission consumes more network traffic.
(4) As the number of packets on the network grows, TCP ACKs grow too, so the system load naturally increases; at the same packet-loss rate, more packets are lost and the overall transfer takes longer. If the link has high latency and a high loss rate, we need to shrink the data as much as possible. Compression is one way, but it adds a large CPU load, so our hope is to reduce data amplification as much as possible before compression; then, when transmitting, decide whether to compress based on RT and data size, and if necessary first compress a sample of the data to test the compression ratio.
(5) The receiver spends more time processing the data.
(6) Because it is a text protocol, processing adds overhead, such as converting numbers to strings and strings to numbers, and converting byte[] to strings and strings to byte[], all of which cost extra memory and computation.
However, in the vast majority of applications this overhead is insignificant compared with the business logic, so optimizing it is not the focus of our attention. But in certain scenarios with heavy data processing, where serialization and deserialization are the core of the work, this problem has to be considered, so I will continue the discussion below.
Here are some questions:
(1) Is there a better scheme for network transmission? If so, why is it not more widely adopted?
(2) How does lower-level data communication, such as JDBC, do it? What if it transmitted result sets the way the three schemes above do?
[2. MySQL JDBC data transmission scheme]:
The previous section described how data can be amplified several times over during serialization; is the same true of relatively low-level communication? Let's take MySQL JDBC as an example and see whether the same happens in the communication between the driver and MySQL.
There are many JDBC driver implementations for different databases, and their implementation details differ enormously. This article takes the data parsing of MySQL JDBC (before MySQL 8.0) as an example to explain how it transmits data; I believe what everyone cares about most is how ResultSet data is transmitted.
Setting aside basic information such as the MetaData of the result set, let's look at the data itself:
(1) When JDBC reads a data row, it first reads a row package from the buffer. The row package is obtained from network packets: the package size is determined from the header defined by the protocol, and then the corresponding amount of content is read from the network buffer. The packets transmitted on the network and the packages of business data do not necessarily correspond one-to-one; once the network packets have all arrived in the local buffer they are logically continuous, and the process of JDBC reading a row package from the local buffer is the process of copying the package from the kernel buffer into the JVM. For us on the Java side, the row package is what matters (there are special cases in JDBC where the package read is not at the row level; interested readers can consult the source code themselves).
(2) Within a single row, everything except the header is the body. The body contains many different data types; placing the type information in the body would obviously waste space, so the data types are carried in the metadata instead, and the order of the data columns in the body is consistent with the order of the columns in the metadata.
(3) How MySQL parses the data types in detail:
3.1. If the metadata says the column is a fixed-length type such as int, bigint, tinyint, year, float or double, it is read as a fixed number of bytes; an int, naturally, is read as 4 bytes and a long as 8 bytes.
3.2. If it is another type, such as varchar, binary or text, it is read as variable-length data.
3.3. A variable-length value starts with a 1-byte flag.
3.4. If the flag's value is less than or equal to 250, it directly represents the length of the following bytes (note: the length of the string measured in bytes), which guarantees that the variable-length strings stored by most businesses need only 1 byte of amplification when delivered over the network.
3.5. If the flag is 251, the field is NULL.
3.6. If the flag is 252, the next 2 bytes represent the field length; with the flag that is 3 bytes in total, covering lengths up to 65535 (64KB). Note that when converted to a long the high bits are zero-filled, so 2 bytes are enough to express up to 65535, and only 3 bytes of amplification are needed to describe such data.
3.7. If the flag is 253, the next 3 bytes represent the length, covering lengths up to 16MB (again with the high bits zero-filled); such data hardly ever appears in a database, and even when it does, the amplification is only 4 bytes.
3.8. If the flag is 254, the next 8 bytes hold the length; no high-order zero-filling is involved, and it can express lengths up to Long.MAX_VALUE, which no memory today could hold anyway, so there is no practical concern; in this case the amplification is 9 bytes. (A small decoding sketch based on these rules is given at the end of this section.)
(4) Following this understanding for now, MySQL seems to amplify data very little during transfer. Is that really true? See the discussion from point (5) onward.
Additional notes:
a. MySQL JDBC (excluding the JDBC versions for MySQL 8.0 and above) has two implementation classes for parsing ResultSetRow data: BufferRow and ByteArrayRow. The two share the same logic for parsing a single row of data; what differs is how the parsed data is stored. BufferRow keeps the row as a single byte[], while ByteArrayRow parses it into a two-dimensional byte[][] whose second dimension is the per-column data. This is purely client-side behavior and independent of the format in which data is delivered over the network. (The two are used in different scenarios; for example, ByteArrayRow is enabled when a cursor is opened in UPDATE mode. That is not the focus of this article; the point is that either way the data is read in the same way.)
b. RowData is the entry point for data processing in a ResultSet in MySQL JDBC, and it has three implementation classes: RowDataStatic, RowDataCursor and RowDataDynamic. Although there are three implementations, they do not affect the data format; they differ only in how they read: RowDataStatic and RowDataCursor read all the data currently in the buffer into JDBC at once to form an array, with RowDataCursor differing in that the database returns FetchSize rows at a time (the implementation details were mentioned in a previous article); RowDataDynamic reads a row from the package only when it is needed and tries to fetch the next package after the current one is exhausted. None of this affects the transmission format of the data on the network, so the parsing described here is the common MySQL JDBC parsing path and is independent of the internal execution path.
(5) Take BufferRow as an example. When you call getString("xxx") or getDate(int), the driver first has to locate the column (passing the column index skips the name lookup). It keeps a lastRequestedIndex and a lastRequestedPos, recording the last column read and its byte position. If the index you pass in is larger than the last one, it scans forward from the current position, following the width rules described above, finds the corresponding position, copies out the corresponding byte[] and converts it into the object type you asked for.
PS: lastRequestedIndex and lastRequestedPos exist because JDBC assumes that in most cases you read columns from front to back, so this is the friendliest scheme for such programs; otherwise, moving the pointer backwards means starting again from 0, for a simple reason: the length of each value is stored at its head, not its tail. Moving the pointer back and forth therefore generates a lot of overhead and more memory-copy fragmentation. ByteArrayRow can solve this problem, but it occupies more space itself, and the byte[] arrays returned from its internal two-dimensional array can be modified by callers (because they are not copies).
In addition, with this way of reading data, if a single row is too large (for example it contains 100MB+ large fields), it is still read into Java memory in full, and even with CursorFetch or streaming reads, a few dozen such rows can bring down the JVM heap. So far I have not seen a MySQL parameter that can limit the length of a single row; official support for such a special need seems unlikely, and mostly it could only be achieved by modifying the source code.
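Putting rules 3.3-3.8 together, here is a minimal decoding sketch in Java (my own illustration of the rules above, not the driver's actual code; the class and method names are made up):

import java.nio.ByteBuffer;

public class LengthEncodedInt {
    public static final long NULL_LENGTH = -1;

    // Read a length-encoded integer, following the flag-byte rules in 3.3-3.8.
    public static long readFieldLength(ByteBuffer buf) {
        int flag = buf.get() & 0xFF;
        if (flag <= 250) {
            return flag;                           // the flag itself is the length
        }
        switch (flag) {
            case 251: return NULL_LENGTH;          // the field is NULL
            case 252: return readUnsigned(buf, 2); // 2-byte length, up to 64KB
            case 253: return readUnsigned(buf, 3); // 3-byte length, up to 16MB
            case 254: return readUnsigned(buf, 8); // 8-byte length
            default:  throw new IllegalStateException("unexpected flag: " + flag);
        }
    }

    // MySQL sends multi-byte integers in little-endian order; absent high bytes are 0.
    private static long readUnsigned(ByteBuffer buf, int bytes) {
        long value = 0;
        for (int i = 0; i < bytes; i++) {
            value |= (long) (buf.get() & 0xFF) << (8 * i);
        }
        return value;
    }
}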
[Back to the topic: so the amplification in the communication between MySQL and JDBC is very small?]
In fact, MySQL transmits data to JDBC using the text protocol by default, not the binary protocol. Its byte[] is not amplified the way JSON is, but it is still essentially a text protocol: by default many data types are transmitted as text. For example, the account number mentioned above, 102333320132133, occupies 8 bytes in the database, but transmitted as text over the network it becomes 15 bytes; a DateTime can be stored in 8 bytes in the database, but transmitted over the network as "YYYY-MM-DD HH:MI:SS" text it can reach 19 bytes. And since these values travel as strings, MySQL treats them as variable-length values per the rules above, so at least one flag byte is added at the head of the data. More importantly, the cost is not just a few extra bytes: to obtain the real value the receiver has to do further computation; for example, after 102333320132133 is sent as text, the receiver has to convert the string into a long before it actually has a long. Imagine processing 5 million rows of 20 columns each with a lot of this kind of processing: doesn't the cost increase considerably?
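A quick size check of the two examples (my own illustration):

import java.nio.charset.StandardCharsets;

public class TextVsBinarySize {
    public static void main(String[] args) {
        long account = 102333320132133L;
        String accountText = Long.toString(account);
        // 8 bytes as a binary long vs 15 bytes as text
        System.out.println(Long.BYTES + " vs "
                + accountText.getBytes(StandardCharsets.US_ASCII).length);

        // A DateTime sent as "YYYY-MM-DD HH:MI:SS" text: 19 bytes
        String dateTimeText = "2018-01-02 11:22:33";
        System.out.println(dateTimeText.getBytes(StandardCharsets.US_ASCII).length);

        // And the receiver still has to parse the text back into a long
        long parsed = Long.parseLong(accountText); // extra CPU work per value
        System.out.println(parsed);
    }
}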
JDBC and MySQL can communicate over the binary protocol, which transmits values in the space the numbers actually occupy. The pitfall is that MySQL's current way of enabling the binary protocol is to enable server-side PreparedStatements, and once that is turned on there are plenty of traps, especially in internet-scale programming; I will explain them step by step in later articles.
Setting aside the pitfalls of enabling the binary protocol, we can say that MySQL JDBC encodes data very well during communication: the encoding is very compact. (Of course, if you enable SSL the data is enlarged again, and encrypted data is essentially incompressible.)
A summary of the advantages and disadvantages of this approach compared with traditional data serialization:
Advantages:
(1) Because everything is encoded as byte[] and the encoding is compact, the data itself is amplified very little.
(2) Encoding and decoding involve no text parsing: values are appended sequentially to the end of a ByteBuffer, so writing never has to search for a position, and reading, by design, at most moves an offset forward, which is very efficient.
(3) When multiple rows are transferred, deserialization does not have to parse the entire payload into objects at once the way XML or JSON does. Just imagine how many Java objects a 5000-row, 20-column result set would produce, and how much each Java object amplifies the data itself. With bytes, you can convert to Java objects on demand and release each one after use, so you never need that much JVM memory at the same time, and the byte[] is only as large as the data itself and can likewise be released on demand.
(4) Compared with the three schemes above, such as JSON, there is no String conversion during serialization and deserialization, which removes one memory copy.
(5) By writing your own code with a similar communication scheme, you can push network optimization to the extreme.
Disadvantages:
(1) The encoding is custom between MySQL and MySQL JDBC; nobody else can use it (though we can borrow the ideas).
(2) The byte encoding and decoding are written by programmers by hand, which demands a high level of skill and rigor; a lot of testing is needed up front, and later on, any network corner case handled with a slight deviation can produce unexpected bugs. (So this kind of code should be encapsulated inside the company, and most programmers do not need to touch it.)
(3) As for memory copies, the copy from the row buffer into the application's data is unavoidable. A custom program can, where necessary, reduce memory copies further, but it cannot eliminate them; as noted above, this overhead is simply negligible compared with the application's overall business processing.
Why traditional communication protocols do not do this:
(1) Refer to the three points under the disadvantages above.
(2) Traditional API communication cares more about speed and generality: we frequently have to debug with different teams and even different companies, and designing a binary protocol makes both development and debugging very costly.
(3) Readability: for business code, byte[] is hard to read, and once objects are nested, expressing them in byte[] becomes very complicated.
If MySQL JDBC used the binary protocol, would data compactness reach its limit?
The usual assumption would be yes: nothing is amplified any further, an int is transferred in just 4 bytes and a long in just 8. Can it get any smaller, short of compressing it?
Not quite. In the binary world, once you dig into the details, there are more surprising things; let's explore:
For example:
long id = 1L;
This value is transmitted as 8 bytes on the wire; in big-endian order the bytes are: 00 00 00 00 00 00 00 01.
Never mind that 63 of the 64 bits are 0; at the byte level, 7 of the 8 bytes carry no data and only 1 byte has a value. Look at the auto-increment columns in your own database: while the id is below 4194303, the first 5 bytes are wasted, and until it grows beyond what 4 bytes can hold (2^32 - 1), the first 4 bytes are all 0 and wasted. Moreover, even once the first of the 8 bytes is in use, there is a very high probability that some of the middle bytes are 0, just as in decimal: a number below 100 million written into a fixed 16-digit field has at least 8 leading zeros, and the high-order digits are rarely filled.
If you really want to try, you can do it this way: use 1 extra byte as a flag. It costs some extra computation, so whether the space saving is worth it is up to you; this is only a technical discussion:
Method 1: use a few low-order bits of the flag byte to record how many bytes the value actually needs. A long has at most 8 bytes, so 3 bits are enough and the other 5 bits are wasted, which does not matter; on deserialization, the missing high-order bytes are padded back with 0x00.
Method 2: more thorough than method 1 but more complex to handle. The 8 bits of the flag byte indicate, one per byte, which of the long's 8 bytes are present; serialization and deserialization pad the absent bytes with 0x00 according to the flag to rebuild the full 8 bytes of the long. The worst case is 9 bytes for a long; the best case, the value 0, is just the 1 flag byte. Small values need only 2 bytes, and even when values get quite large there are usually several zero bytes that can be omitted, so most values can be expressed in fewer than 8 bytes; the encoding only occupies as much space as a plain long once 7 data bytes are needed, i.e. when the value already exceeds 2^48 - 1.
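A minimal sketch of method 1 (my own illustration; method 2 would replace the byte count with a per-byte bitmap in the flag):

import java.io.ByteArrayOutputStream;

public class CompactLong {
    // Method 1: one flag byte holding the number of significant bytes (1-8),
    // followed by only those bytes, most significant first.
    public static byte[] encode(long value) {
        int bytes = Math.max(1, (64 - Long.numberOfLeadingZeros(value) + 7) / 8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(bytes); // flag: how many data bytes follow
        for (int i = bytes - 1; i >= 0; i--) {
            out.write((int) (value >>> (8 * i)) & 0xFF);
        }
        return out.toByteArray();
    }

    // Deserialization pads the missing high-order bytes with 0x00.
    public static long decode(byte[] data) {
        int bytes = data[0];
        long value = 0;
        for (int i = 1; i <= bytes; i++) {
            value = (value << 8) | (data[i] & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encode(1L).length);        // 2 bytes instead of 8
        System.out.println(decode(encode(4194303L))); // round-trips correctly
    }
}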
[3. Google Protocol Buffer technical solution]:
Many people may not have used this and may not know what it is for, but I have to say it is a killer tool for data serialization and deserialization. It was designed for data communication inside Google; everyone knows Google's global network is extremely powerful, so naturally it pushes data transmission quite far. Here I explain its principles; for how to use it, please refer to other people's blogs, since there is not enough space in this article to walk through it step by step.
From the name you can tell it is a protocol buffer, or rather a protocol encoding. Its purpose is similar to using JSON or XML for RPC calls as described above, namely passing messages or calling APIs between systems. On the one hand Google wanted readability and cross-language generality similar to XML and JSON; on the other hand it wanted high serialization and deserialization performance with controllable data amplification; so it wanted something easier to use than a raw low-level encoding, yet still built on a low-level encoding, with the definition doubling as readable documentation.
It first needs to define a format file, as follows:
syntax = "proto2";
package com.xxx.proto.buffer.test;
message TestData2 {
    optional int32 id = 2;
    optional int64 longId = 1;
    optional bool boolValue = 3;
    optional string name = 4;
    optional bytes bytesValue = 5;
    optional int32 id2 = 6;
}
This file is neither a Java file nor a C file; it is language-independent, and its suffix is conventionally .proto. (The numbers 1, 2, 3... after each field are field numbers that identify the fields on the wire; fields are normally serialized in this numeric order and deserialized accordingly.) After installing protobuf locally (installation differs by OS; official downloads and instructions are available), you get a protoc executable; add it to the PATH and run the command, specifying a target directory:
protoc --java_out=~/temp TestData2.proto
This generates, under the specified directory, the directory structure described by the package declaration, with a Java source file inside (other languages generate their own sources). Google generates this code for you because writing it by hand would be too laborious. The protobuf runtime also needs to be added to the Java project, e.g. with the following Maven dependency (choose the version yourself):
<dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java</artifactId>
    <version>3.6.1</version>
</dependency>
The generated code calls the library methods provided in the Google package to do the serialization; our own code only needs to call the API of the generated class to serialize and deserialize. The generated code can be published as a module to the Maven repository so that others can reference it. As for test code, many blogs provide examples that are quite usable.
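A minimal usage sketch, assuming TestData2 was generated from the .proto above (the builder/parse methods follow protobuf's standard generated-code pattern; the exact outer class name and import depend on the .proto file options):

// the exact import depends on the generated outer class name
import com.xxx.proto.buffer.test.TestData2OuterClass.TestData2;

public class ProtobufDemo {
    public static void main(String[] args) throws Exception {
        // Build an object with the generated builder API
        TestData2 data = TestData2.newBuilder()
                .setLongId(102333320132133L)
                .setId(123456)
                .setName("hello")
                .build();

        // Serialize: the output is only slightly larger than the raw data
        byte[] bytes = data.toByteArray();
        System.out.println("serialized size: " + bytes.length);

        // Deserialize on the receiving side, which has the same generated class
        TestData2 parsed = TestData2.parseFrom(bytes);
        System.out.println(parsed.getLongId() + " / " + parsed.getId());
    }
}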
The remarkable thing about Google's encoding is that you define the format of the transmitted data in the form of objects, which is highly readable, arguably even friendlier to programmers than XML or JSON; it can double as communication documentation, it is common across languages, objects can still be nested, and yet the serialized bytes are only slightly larger than the raw data, which is impressive.
After testing different data types, deliberately deep object nesting, and multi-level nesting of byte arrays, I found the amplification ratio to be very, very small, almost equivalent to raw binary transmission. I then dumped the serialized binary and found the encoding to be very close to the JDBC encoding above; the details differ, but it is very close. In addition, its serialization has several notable characteristics:
(1) If a field is empty, it generates no bytes at all; if every property of the object is null, the serialized output is 0 bytes long.
(2) int32 and int64 data uses variable-length encoding, which shares some ideas with the description above: a relatively small int64 value can be expressed in fewer bytes, handled internally by a set of bit-shift and mask operations (plus ZigZag encoding for signed sint types); a sketch of the idea follows this list.
(3) It performs no conversion on string and bytes fields; the byte arrays are written directly, much like a binary encoding.
(4) Precisely because an empty field can generate no bytes at all, every value that is present carries its field-number information; you can try reordering the field numbers and see whether the generated bytes change. This also gives it strong compatibility: simply adding fields is not a problem, which is hard for an ordinary binary encoding.
(5) Serialization generates no metadata, i.e. the structure of the object is not written into the bytes; the deserializing side has the same object definition and can parse the bytes back.
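To illustrate point (2), here is a minimal sketch of base-128 varint encoding, the scheme protobuf uses for unsigned integer values (my own illustration, not protobuf's source; signed sint types additionally apply ZigZag first):

import java.io.ByteArrayOutputStream;

public class VarintDemo {
    // Base-128 varint: 7 data bits per byte, the high bit marks "more bytes follow".
    public static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // continuation bit set
            value >>>= 7;
        }
        out.write((int) value); // last byte, high bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encode(1L).length);               // 1 byte
        System.out.println(encode(300L).length);             // 2 bytes
        System.out.println(encode(102333320132133L).length); // 7 bytes instead of 8
    }
}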
How does this differ from writing the code myself?
(1) Writing it yourself involves many uncertainties: if it is not written well, the data may end up larger, and mistakes are easy to make.
(2) After Google's engineers standardized it internally, Google's open-source products adopted this protocol widely, more and more industry middleware is starting to use it, and even newer versions of MySQL are beginning to support protobuf for data transmission.
(3) Google has effectively defined a new industry data-transmission scheme that is both performant and cheap to develop against, with cross-language access, so more and more people like to use it.
So what are the downsides? Not many. It takes into account basically everything you have to consider in serialization and deserialization and strikes a very good balance, but if we must pick at flaws we have to look for specific scenarios:
(1) protobuf requires both sides to agree on data types: every field in the definition file must declare a type; there is no provision for an Object type, so you must know in advance what that Object is.
(2) Arrays can be expressed with repeated, but only for elements of the same type; for example, when the columns of one JDBC row have different types, as mentioned above, this is awkward to express. Also, by default an array is one-dimensional; a two-dimensional array has to be expressed indirectly via nested objects.
(3) The data types it provides are basic types; anything uncommon must be converted into a common type before transmission. For example, to send a Document object fetched from MongoDB, you must first convert it into byte[] or a String by some other means, whereas XML and JSON libraries generally provide recursive handling. If protobuf offered that, it would inevitably face the data-amplification problem again; generality and performance are always in tension.
(4) Compared with a custom byte encoding, serialization and deserialization happen in one pass and cannot be done incrementally, so with nested arrays a large number of Java objects are produced during deserialization. A custom byte encoding can also reduce memory copies further, although Google's memory copying is already far less than a text protocol's.
Additional notes:
As mentioned in point (2), every element of a repeated array must be of the same type; elements of different types cannot be expressed directly, because there is nothing like Java's Object[]. Even if the sender could determine each Object's type, deserialization would still be very troublesome, since the receiver does not know what type the data is and protobuf carries no metadata on the wire; the only way is for the client side to decide according to business needs.
Therefore, if it is really necessary, you can express a row of data as a List of byte[], where the metadata for each byte[] element is obtained elsewhere (for example from the database metadata) and the conversion is done by yourself, while everything on the wire remains byte[]. As far as JDBC is concerned, I personally recommend transmitting one byte[] per row rather than one byte[] per column value, because during serialization and deserialization protobuf wraps each array element in an object, so the number of Java objects produced is a multiple of the column count: with 40 columns you get 40 times as many Java objects, which is outrageous.
In short, every serialization and deserialization scheme available today has its application scenarios. The choices made at design time determine the architecture, and ultimately also the performance, stability, system overhead, network transmission size, and so on.
Author: Zhong Yin
The article is divided into 3 parts:
1. List common serialization and deserialization schemes (ObjectXXStream, XML, JSON)
2. Processing scheme of MySQL JDBC result set
3. Google Protocol Buffer processing solution
[1. Common Serialize solutions used in API and message communication calls]:
Option 1. Object serialization and deserialization are performed based on Java's native ObjectOutputStream.write() and ObjectInputStream.read().
Option 2. Serialize and deserialize based on JSON.
Option 3. Serialization and deserialization based on XML.
[Scheme 1 Analysis, ObjectXXXStream]:
advantage:
(1) It is serialized by Java's own API, which is simple, convenient, and has no third-party dependencies.
(2) Don't worry about the loss of precision in data parsing, the loss of fields, and the uncertainty of the deserialization type of Object.
shortcoming:
(1) It is troublesome for both parties to debug. It is best for the sender and receiver to have the same version of the object description, otherwise there will be strange problems, the debugging cycle is relatively long, and there are many problems in cross-team cooperation.
(2) The transferred object contains metadata information, which occupies a large space.
[Scheme 2 analysis, JSON serialization]:
advantage:
(1) Simple and convenient, no need to pay attention to the object format to be serialized.
(2) There are many components in the open source world that can be supported, such as FastJSON performance is very good.
(3) In many RPC frameworks, this scheme is basically supported.
shortcoming:
(1) If the object property contains the Object type, if the business itself does not know the data type during deserialization, it will be very troublesome to process.
(2) Due to the text type, it will definitely occupy a large data space, such as the following figure.
(3) Comparing the compatibility and performance of parsing packages that depend on JSON, the processing methods of some details of JSON (such as some non-standard JSON) may be different.
(4) Serialization, regardless of any data type, must first be converted to String and converted to byte[], which will increase the number of memory copies.
(5) When deserializing, the entire JSON must be deserialized into an object before it can be read. You should know that Java objects, especially objects with many levels of nesting, will take up much more memory than data. its own space.
Extreme Case 1 of Data Amplification:
The transfer data description information is:
class PP {
long userId = 102333320132133L;
int passportNumber = 123456;
}
The format of the JSON passed at this time is:
{
"userId":102333320132133,
"passportNumber": 123456
}
The data we want to pass is 1 long, 1 int, that is, 12 bytes of data, the string length of this JSON will be the actual number of bytes (excluding carriage returns and spaces, this is just for readability) , and note that the long here is a string in JSON), this string has: 51 bytes, that is, the data is placed about 4.25 times.
Data amplification extreme case 2:
When there is data in your object that is of type byte[], JSON is data in text format and cannot store byte[], then to serialize such data, the only way is to convert bytes into characters. The usual method is: Two kinds:
(1) Use BASE64 encoding, which is a common practice in JSON at present.
(2) Hexadecimal character encoding is performed according to the byte, for example, a string: "FF" represents the byte 0xFF.
No matter which of the above two methods is used, 1 byte will become 2 characters to be transmitted, that is, the byte[] data will be enlarged by more than 2 times. Why not use the ISO-8859-1 character encoding? Because after encoding in this way, after the final serialization into network byte[] data, the corresponding byte[] does not become larger, but when deserialized into text, the receiver does not know that it is ISO-8859-1, and will use For example, the more common character sets such as GBK and UTF-8 can be parsed into String before further parsing of JSON objects, so this byte[] may be changed during the encoding process, and it will be very troublesome to deal with this problem.
[Scheme 2 analysis, XML serialization]:
advantage:
(1) Simple and convenient to use, no need to pay attention to the object format to be serialized
(2) The readability is good, XML is relatively common in the industry, and everyone is accustomed to seeing the appearance of XML in the configuration file
(3) A large number of RPC frameworks are supported, and documents can be directly formed through XML for circulation
shortcoming:
(1) The performance of serialization and deserialization has not been very good.
(2) There are also the same data type problems as JSON, and the problem of data amplification. At the same time, the problem of data amplification is more serious. At the same time, the number of memory copies is also the same as the JSON type, which is inevitable.
XML data enlargement description:
The data amplification of XML is usually more serious than that of JSON. In the JSON case above, XML usually transmits this data like this:
This message has more than 80+ bytes. If some Property attributes are added in the XML and the objects are nested, it is possible that the magnification ratio may reach 10 times, so its magnification ratio JSON is more serious, which is why more and more APIs now prefer JSON over XML.
[What is the problem of zooming in]:
(1) Spend more time splicing strings and copying memory, occupying more Java memory and generating more fragments.
(2) To convert the generated JSON object to byte[], it needs to be converted to String text and then encoded by byte[]. Because this is a text protocol, it is natural to copy the full amount of memory again.
(3) The transmission process takes up more network traffic because the data is amplified.
(4) As the number of packages on the network increases, the ACK of TCP will also increase, and the natural system will also be larger. Under the same packet loss rate, the number of lost packets will increase, and the overall transmission time will be longer. If this The network delay of data transmission is very large and the packet loss rate is very high. We need to reduce the size as much as possible; compression is one way, but compression will bring a huge increase in CPU load. It is our expectation to reduce data amplification as much as possible before compression. Then, when transmitting data, determine whether to compress according to RT and data size. If necessary, if the data is too large before compression, you can also perform partial sample data compression to test the compression rate.
(5) It will take more time for the receiver to process the data.
(6) Since it is a text protocol, overhead will be increased during processing, such as converting numbers to strings, converting strings to numbers; converting byte[] to string and converting string to byte[] will increase additional memory and computational overhead.
However, in the usual large number of applications, this overhead is insignificant compared to business logic, so optimization is not the focus of our attention, but we face some specific scenarios with more data processing, that is, the core business is in the data sequence. This problem must be considered when serializing and deserializing, so I will continue to discuss the problem below.
Here are some questions:
(1) Is there a better solution for network transmission? If so, why is it not widely adopted now?
(2) Relative to the underlying data communication, such as how JDBC does it, what if it transmits the result set like the above three schemes?
[Second, MySQL JDBC data transmission scheme]:
As mentioned in the previous article, the problem of data being amplified several times during the serialization process, do we want to see if the same is true for some relatively low-level communications? So let's take MySQL JDBC as an example to see if the same is true for the communication between it and JDBC.
There are many implementations of JDBC drivers according to different databases. There are huge differences in the implementation details of each database. This article takes the data parsing of MySQL JDBC (before MySQL 8.0) as an example to explain how it transmits data. In the process of data, I believe that what everyone is most concerned about is how the data of the ResultSet is transmitted.
Aside from basic information such as MetaData in the result set, just look at the data itself:
(1) When JDBC reads a data row, it first reads a row packetge from the buffer. The row package is obtained from the network package. The package size is determined according to the header of the package passed in the protocol, and then the row package is obtained from the network package. The content of the corresponding size is read in the network buffer. The following figure wants to express that the package transmitted by the network and the package in the business data may not completely correspond. In addition, if the packages in the network all reach the local buffer, they are logically continuous (the intentional separation in the figure is to let everyone know that the network transmission is divided into different packages to the local), JDBC reads from the local buffer The process of row package is the process of copying the package from the kernel package to the JVM. For us Java, we mainly focus on the row package (there may be some special cases in JDBC that the package read is not at the row level. For this special case, please read Interested students can consult the source code by themselves).
(2) Except for the header, the single-line data is the body. The body part contains a variety of different data types. At this time, placing the data types on the body obviously takes up space, so the data types are extracted from the metadata. , the order of data columns in body will be consistent with the order of columns in metdata.
(3), MySQL parses data types in detail:
3.1. If the data corresponding to Metadata is found to be int, longint, tinyint, year, float, double and other data types, it will be read according to the number of fixed-length bytes. For example, int will naturally be read according to 4 bytes, and long will be read according to 8 bytes. .
3.2. If it is found to be other types, such as varchar, binary, text, etc., it will be read according to variable length.
3.3. The variable-length string first reads the 1-byte flag bit.
3.4. If the value of this flag bit is less than or equal to 250, it directly represents the length of the subsequent bytes (note that the string is converted to the length of bytes here), so as to ensure that the variable-length strings stored in most businesses are stored in the Only 1 byte of amplification is required during network delivery.
3.5. If this flag is: 251, it means this field is NULL
3.6. If the flag bit is: 252, it means that 2 bytes are needed to represent the length of the field. At this time, adding the flag bit is 3 bytes, and the length of data (64KB) within 65536. Note that it will be converted into When long, the high bits are filled with 0, so 2 bytes can be filled to 65536, and only 3 bytes need to be enlarged to represent this data.
3.7. If the flag bit is: 253, it means that the length of 4 large table fields is required, which can represent 4GB (the same as above with 0 supplemented by the high order), such data will hardly appear in the database, even if it does appear, only 5 words will appear. section magnification.
3.8. If the flag bit is: 254, 8 bytes represent the length. At this time, there is no high-order 0 to be filled. At most, the length of Long.MAX_VALUE can be read, but this space cannot be stored in memory at present, so there is no need to worry about the use problem. , at this time, the 9-byte enlargement, the source code is as follows:
(4) Let's first follow this understanding. In the process of transferring data, MySQL enlarges the data very small. Is this true? Please explain point 5 below.
Additional instructions:
a. There are two implementation classes for the parsing of ResultSetRow data in MySQL JDBC (except JDBC versions above MySQL 8.0): BufferRow and ByteArrayRow. These two methods have the same logic in the parsing stage of reading a single row of data, but The way of parsing and storing data is different. BufferRow will be parsed into byte[] of data row, ByteArrayRow will be parsed into byte[][] two-dimensional array, and the second dimension is the information of each column, which is the client Behavior, independent of the format in which the data is delivered over the network. (The two are used in different scenarios. For example, one of the scenarios is: ByteArrayRow will be enabled when the cursor turns on the UPDATE mode, but this is not the focus of this article. It is mentioned here that the main purpose is to inform you that no matter which method, read data the same way)
b. RowData in MySQL JDBC is the entry for data processing in ResultSet. There are three implementation classes: RowStatStatic, RowDataCursor, and RowDataDynamic. Although there are three implementation classes, they will not affect the data format. There are different ways of reading data in the area: RowStatStatic and RowDataCursor will read all the data in the buffer into JDBC to form an array each time, and RowDataCursor has a difference in processing that the database returns the data content of FetchSize size each time ( The details of the implementation are mentioned in the previous article); RowDataDynamic is to start from the pak when a row is needed Read it in cege, and try to read the next package after the package is read. These will not affect the transmission format of the data itself on the network, so the parsing mentioned in the text is the more common parsing of MySQL JDBC at present, and has nothing to do with its internal execution path.
(5) Taking BufferRow as an example, when you initiate getString('xxx'), getDate(int), first it needs to find the first column inside (pass the number to omit this action), and then there will be A lastRequestedIndex and lastRequestedPos respectively record the last read field and the position of the byte. If the index you pass in is larger than this index, it starts from the current position and scans backwards. The scanning rules are consistent with the width of the database above. , find the corresponding position, copy out the corresponding byte[] array, and convert the object type you want.
PS: lastRequestedIndex and lastRequestedPos are actually JDBC that thinks that most of your situations are read from front to back, so reading this way is also the most friendly solution for JDBC programs, otherwise the pointer moves forward and needs to start from 0, the reason is very Simple (the length of the data is not in the tail, but in the head), so when the pointer is moved back and forth, this will generate a lot of overhead, and will generate more fragmentation of memory copies. Although ByteArrayRow can solve this problem, it will occupy a relatively large space by itself. In addition, the byte[] bytes returned by its internal two-dimensional array can be modified externally (because there is no copy).
In addition, according to this method of reading data, if a single row of data is too large (for example, there are large fields of 100MB+), it is read into Java memory, even if CursorFetch and Stream are used to read dozens of pieces of data, you can JVM memory hangs up. So far, I haven't seen a parameter that can "limit the length of a single row of data" in MySQL. It is estimated that the possibility of official support for such special needs is very small, and most of them can only be realized by modifying the source code.
[Back to the topic itself: the communication between MySQL and JDBC seems to be amplified very small? 】
In fact, MySQL transmits data to JDBC through the text protocol by default, not the Binary protocol. Although its byte[] array will not be enlarged like JSON, it is not really a text protocol, but it has many kinds of data. By default, the type is all text transmission. For example, an account mentioned above: 102333320132133 is 8 in the database, but if there is a text format transmission during network transmission, it will be: 15 bytes, if it is DateTime data It can be stored in 8 bytes in the database, but if the network transmission is transmitted according to YYYY-MM-DD HH:MI:SS, it can reach 19 bytes, and when they use String to transmit in the network, according to the above mentioned Yes, MySQL will treat it as a variable-length character, so it will add at least one of its own flags to the data header. In addition, the increase here is not just a few bytes, but you need to get the real data, and the receiver needs to further calculate and process to get it. For example, after 102333320132133 is sent by text, the receiver needs to convert this string to long Only the type can get long. Let’s imagine that you are processing 5 million data, each row of data has 20 columns, and there are a lot of similar processing. Isn’t the cost increased a lot?
JDBC and MySQL can communicate through the binary protocol, which is to communicate according to the space occupied by the actual number, but when it is more pitted, MySQL's current plan to open the binary protocol is: "open the server prepareStatemet", this Once it is turned on, there will be a lot of pits, especially in the programming of the Internet, I will explain it step by step in the subsequent articles.
Aside from "the pit of opening the binary protocol", we think that MySQL JDBC encodes data very well in the process of communication, very compact encoding (of course, if you enable SSL access, the data will be enlarged again, and Encrypted data is basically difficult to compress).
A summary of the advantages and disadvantages of traditional data serialization:
Advantage:
(1) After the data is all encoded according to byte[], due to the compact encoding, the amplification of the data itself is very small.
(2) Since there is no parsing process for encoding and decoding, they are written sequentially to the end of the ByteBuffer, that is to say, there is no need to find a position. When reading, it can be reduced by design according to the design. Even if the position is found, it is a moving offset quantity, very efficient.
(3) If multiple lines of data are passed, the deserialization process does not need to parse the entire passed data at one time and then process it like XML or JSON. Just imagine, if a result set of 5000 lines and 20 columns will generate How many Java objects, and how much each Java object enlarges the data itself, can be converted into Java objects on demand after passing in bytes, and the used Java objects can be released, so that it does not need to occupy such a large JVM memory at the same time, and The byte[] array is also just the size of the data itself, and can also be released on demand.
(4) Compared with the three methods mentioned above, such as JSON, it does not need to undergo a String conversion during serialization and deserialization, which will reduce a memory copy.
(5) Write your own code and use a similar communication scheme to achieve the ultimate in network optimization.
Disadvantage:
(1) The encoding is customized between MySQL and MySQL JDBC, and others cannot use it (we can refer to other people's ideas)
(2) The byte encoding and decoding process is written by the programmer, which requires high level and rigor of the programmer. A lot of testing is required in the early stage. In the later stage, if a slight deviation is considered in the network problem, unexpected bugs may appear. (So these contents need to be encapsulated within the company, and most programmers do not need to pay attention to this content).
(3) In terms of memory copy, this layer of memory copy is unavoidable from rowBuffer to data in the application. If you write a custom program, under necessary conditions, this place can further reduce memory copy, but It cannot be eliminated; as mentioned above, this overhead is simply insignificant for the business processing of the entire application.
Why legacy communication protocols don't choose to do this:
(1), refer to 3 points in the disadvantage.
(2) For traditional API communication, we pay more attention to speed and generality, that is, we often debug code with different teams and even different companies. To design a binary protocol, the development cost and debugging cost are very high.
(3) Readability, for business code, byte[] is less readable, especially when objects are nested, the way byte[] is expressed is very complicated.
If MySQL JDBC uses the binary protocol, will the data compactness reach the extreme?
According to the general understanding, it is the ultimate, all data will not be further enlarged, int only uses 4 bytes to transfer, long only uses 8 bytes to transfer, then it can continue to become smaller, can it be compressed?
No, no, in the binary world, if you explore the details, there are more magical things, let's explore:
E.g:
long id = 1L;
At this time, the network will use 8 bytes for transmission. You can see the arrangement of the 8 bytes:
Let's not consider that there are 31 bits of 0 according to the bit. First, there are 7 0s according to the byte, which means that the byte has no data, and only 1 byte has value. You can look at a large number of data in your own database. The automatic growth column of , before the id is less than 4194303, the first 5 bytes are wasted, and the first 4 bytes are all 0 before they grow to 4 bytes (2 to the 32nd power -1), which is wasted. In addition, even if the first byte of the 8 bytes is used, there will be a large amount of data. The middle byte is: 0. The probability is extremely high, just like entering 100 million in decimal, then the maximum will be below 100 million. There are 8 0s, and the higher 0s are difficult to add.
If you really want to try, you can use this method: use 1 byte as a sign, but it will take up a certain amount of computational overhead, so it is up to you to decide whether to do this for this space. This article is only a technical discussion:
Method 1: Express the number of bytes currently used by several low-order bits. Since long has only 8 bytes, it is enough to use 3 bits. The other 5 bits are wasted, and it doesn't matter. Deserialization At this time, you can add 0x00 according to the high number.
Method 2: Compared with method 1, it is more thorough, but it is more complicated to handle. The 0 and 1 of the 8 bits of 1 are used to represent the 8 bytes of long respectively. The serialization and deserialization process is based on the flag. The bit and the data itself perform the byte complement 0x00 operation. The complete 8 bytes are the value of long. The worst case is 9 bytes representing long, and the best case is 0 is 1 byte, and only the byte is occupied. When 2 bytes are used, even if the data becomes quite large, there will be a large number of data bytes with vacancies. In these cases, it can usually be expressed with less than 8 bytes. Only 7 bytes can be used to occupy the same space as the original number long, and the data at this time is already larger than 2 to the 48th power -1.
[3. Google Protocol Buffer technical solution]:
Many people may never have used it or even know what it is for, but it has to be said that it is a superb tool for data serialization and deserialization. It was designed by Google for its own internal data communication; everyone knows how vast Google's global network is, so it naturally pushes data transmission to the extreme. Here I only explain its principles; for how to use it, please refer to other people's blogs, as the space of this article is limited.
As the name suggests, it is a protocol buffer, in other words a protocol encoding. Its purpose is similar to using XML or JSON for RPC calls as described above, that is, passing messages or calling APIs between systems. But Google wanted, on the one hand, readability and cross-language generality similar to XML and JSON, and on the other hand, high serialization and deserialization performance with controlled data amplification; in short, something easier to use than a raw low-level encoding, yet still able to exploit low-level encoding, while also serving as readable documentation.
It first needs to define a format file, as follows:
syntax = "proto2";
package com.xxx.proto.buffer.test;
message TestData2 {
optional int32 id = 2;
optional int64 longId = 1;
optional bool boolValue = 3;
optional string name = 4;
optional bytes bytesValue = 5;
optional int32 id2 = 6;
}
This file is neither a Java file nor a C file; it is language-independent, and by convention its suffix is .proto (the numbers 1, 2, 3... after the fields are field tags, which also define the serialization order, and deserialization matches fields by the same tags). After installing protobuf locally (installation differs per OS; official downloads and instructions are available), you get a protoc executable. Add it to the environment variables, then run it, specifying a target directory and the .proto file:
protoc --java_out=~/temp/ TestData2.proto
The package directory described in the .proto is then generated under the specified output directory, containing a Java source file (other target languages generate their own sources). This code is generated for you by Google because writing it by hand would be far too laborious. Next, add the protobuf runtime to the local Java project as a Maven dependency on the protobuf-java artifact (choose the version yourself).
The generated code calls the method library provided in Google's package to do the serialization; our own code only needs to call the API of the generated class to serialize and deserialize. The generated file can be published as a module to the Maven repository for others to reference. As for test code, many blogs already provide useful examples; a minimal sketch follows.
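A minimal usage sketch, assuming the class generated from the .proto above is accessible as TestData2 (the exact outer-class layout depends on protoc options such as java_outer_classname and java_multiple_files):

import com.google.protobuf.ByteString;

public class ProtoDemo {
    public static void main(String[] args) throws Exception {
        // Build a message; only the fields that are set will produce bytes on the wire.
        TestData2 data = TestData2.newBuilder()
                .setId(123456)
                .setLongId(102333320132133L)
                .setName("hello")
                .setBytesValue(ByteString.copyFromUtf8("raw"))
                .build();

        byte[] bytes = data.toByteArray();             // serialize
        TestData2 parsed = TestData2.parseFrom(bytes); // deserialize on the receiving side
        System.out.println(bytes.length + " bytes, longId=" + parsed.getLongId());
    }
}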
The amazing thing about Google's encoding is that you define the transmitted data in an object-like way that is highly readable, arguably even more programmer-friendly than XML or JSON, and the definition can double as a communication document. It is common across languages, objects can still be nested, yet the serialized bytes are only slightly larger than the raw data, which is remarkable.
After testing different data types, deliberately increasing the nesting depth, and nesting binary arrays several levels deep, I found that the amplification ratio is very, very small, almost equivalent to raw binary transmission. So I dumped the serialized binary and found that its encoding is very close to the JDBC scheme described above; there are differences in detail, but they are very similar. In addition, its serialization has several notable characteristics:
(1) If a field is not set, it produces no bytes at all; if all the properties of a nested object are null, zero bytes are generated for it.
(2) Variable-length encoding is used for int32 and int64 data. The idea has something in common with the flag-byte scheme described above: a relatively small int64 value can be expressed with fewer bytes, and internally a set of bit-shift and XOR operations handles this (a simplified sketch appears after this list).
(3) It does no conversion at all on string and byte[] values; the byte array is written out directly, essentially as raw binary.
(4) Because an unset field produces no bytes, the wire format carries a small position (tag) code wherever there is data; you can try reordering the field numbers and see whether the generated bytes change. This also gives it strong compatibility: simply adding fields is not a problem, which is hard to achieve with an ordinary hand-rolled binary encoding.
(5) The serialization process produces no metadata, that is, it does not write the structure of the object into the bytes; the receiving side deserializes with the same object definition and can parse everything back out.
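A simplified sketch of the variable-length (varint) idea behind point (2): 7 payload bits per byte with a continuation bit, plus the ZigZag mapping that keeps small negative numbers short. This illustrates the idea and is not Google's exact implementation:

import java.io.ByteArrayOutputStream;

public class VarintSketch {

    public static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {               // more than 7 significant bits left
            out.write((int) ((value & 0x7F) | 0x80)); // 7 payload bits plus continuation bit
            value >>>= 7;
        }
        out.write((int) value);                       // final byte, continuation bit clear
        return out.toByteArray();
    }

    public static long decode(byte[] data) {
        long value = 0;
        int shift = 0;
        for (byte b : data) {
            value |= (long) (b & 0x7F) << shift;      // accumulate 7 bits at a time
            if ((b & 0x80) == 0) break;               // continuation bit clear: last byte
            shift += 7;
        }
        return value;
    }

    // ZigZag maps signed values to small unsigned ones so negatives also encode short.
    public static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }
}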
How does this differ from writing the code myself?
(1) There are many uncertainties when you write it yourself: if it is not written well, the data may end up even larger, and mistakes are easy to make.
(2) Google's engineers standardized it internally, Google's open-source products use this communication protocol extensively, and more and more industry middleware products are adopting it; even newer versions of MySQL are beginning to support protobuf in data transmission.
(3) Google has effectively defined a new industry data-transmission scheme that combines performance with lower development effort and cross-language access, so more and more people like to use it.
So what are the downsides? Not many: it takes into account almost everything you need to consider for serialization and deserialization and strikes a very good balance. But if we must pick at defects, we have to look for specific scenarios:
(1) protobuf requires both sides to agree on the data types: every field in the definition file must declare a type, and there is no way to express an Object of unknown type; you must know in advance what type it will be.
(2) Arrays can be expressed with repeated, but only with elements of the same type. For example, a row of JDBC data whose columns have different types, as mentioned above, is awkward to express this way; in addition, repeated by default only expresses a one-dimensional array, so a two-dimensional array has to be expressed indirectly through nested objects.
(3) The data types it provides are all basic types. Anything else must be converted into a common type for transmission by yourself. For example, a Document object fetched from MongoDB must first be converted into byte[] or String by some other means before it can be carried (see the sketch after this list). XML and JSON libraries generally offer a recursive capability for such cases, but if protobuf offered it, it would inevitably face the data-amplification problem again; generality and performance are always at odds.
(4) Compared with a custom byte encoding, serialization and deserialization happen in one pass and cannot be done step by step, so nested arrays produce a large number of Java objects during deserialization. A custom byte encoding can also further reduce memory copies, although Google's copying is already far less than that of text protocols.
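A hypothetical sketch for point (3): the application first turns the foreign object into bytes itself (a JSON string stands in here for, say, a MongoDB Document converted by its own means) and then carries it in a bytes field such as bytesValue of the TestData2 message defined earlier:

import com.google.protobuf.ByteString;
import java.nio.charset.StandardCharsets;

public class WrapForeignType {
    public static void main(String[] args) {
        // The application serializes the foreign object by its own means first.
        byte[] payload = "{\"_id\":\"abc\",\"count\":42}".getBytes(StandardCharsets.UTF_8);

        // TestData2 is assumed to be the class generated from the .proto shown earlier.
        TestData2 wrapper = TestData2.newBuilder()
                .setBytesValue(ByteString.copyFrom(payload))
                .build();
        System.out.println(wrapper.toByteArray().length + " bytes on the wire");
    }
}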
Additional notes:
Regarding point (2): every element of a repeated field must be of the same type, and elements of different types cannot be expressed directly, because there is nothing like Java's Object[]. Even if the sender could determine the element types before passing them, deserialization would be very troublesome, since the receiver does not know what type the data is and protobuf transmits no metadata over the network; the only option is for the client itself to decide based on business needs.
Therefore, if it is really necessary, you can use List
In short, every serialization and deserialization scheme currently has its own application scenarios. The choices made at design time determine the architecture, and ultimately also the performance, stability, system overhead, network transmission size, and so on.
Author: Zhong Yin