Understanding character encoding in one article

What is character encoding?

We know that the computer world contains only 0s and 1s. Without character encoding, all we would see is a string like "11001010010101100111001....", and communicating with the machine would be like playing the lute to a cow: I can't understand it, and it can't understand me. Character encoding is the translator between humans and machines. It translates the characters we are familiar with into binary that the machine can read, and translates binary back into characters we can understand. The following is an encyclopedia's explanation of character encoding:

Character encoding, also known as character set code, encodes the characters of a character set as objects in a specified set (such as bit patterns, sequences of natural numbers, octets or electric pulses) so that text can be stored in a computer or transmitted over a communication network. Common examples are encoding the Latin alphabet in Morse code and in ASCII. ASCII, for instance, numbers letters, digits and other symbols and represents each number with 7 binary bits. A character set is a collection of characters. There are many character sets, and each contains a different set of characters. Common ones include the ASCII, GB2312, BIG5, GB18030 and Unicode character sets. For a computer to process all kinds of characters accurately, the characters must be encoded so that the computer can recognize and store them.

Why Computers Need Coding

Encoding is the process of converting information from one form to another, for example converting characters (letters, digits, symbols, etc.), images, sounds or other objects into specified electric pulse signals or binary digits by a predetermined method. The pictures we see, the music we hear, even the lines of code we write and the characters we type all look real. But behind them there is only a string of 0s and 1s. The attractive girl you saw on your mobile phone yesterday does not exist in the real world. She is just a "skeleton" that the computer builds out of 0s and 1s.

Binary does not exist

You may think that the data in a computer is binary "01", but there is actually no binary inside the computer. We know that everything is stored on the hard disk, yet if you take one apart you won't find any "0101" digits in it. There are only platters and tracks. Zoom in on a platter and all you find are tiny regions of its surface that are magnetized one way or the other. We simply call one state "1" and the other "0".

Similarly, you can't find binary digits in memory. Magnified, memory is an array of capacitors. Whether a memory cell stores "0" or "1" depends on whether its capacitor is charged: charged means "1", uncharged means "0". However, capacitors leak over time, so a capacitor representing "1" gradually loses its charge. This is why memory cannot keep its contents when powered off, and why the capacitors must be recharged regularly so that every "1" keeps its charge.

As for the display, it gives the most direct impression. The paintings of beautiful women, the sun, moon, mountains and rivers you see on a display are actually light spots of different colors emitted by LEDs. The display is a matrix of LEDs in which each element can be called a pixel: "1" means lit and "0" means off, and the rich colors we see are combinations of three LED colors (red, green and blue). How can the ASCII code "65" be displayed as "A"? The credit goes to the video card, which stores the graphic data of each character (also known as the font code) and transmits this two-dimensional matrix of graphic data to the display for imaging.

Therefore, the so-called 0 and 1 are just physical states such as electrical signals, and binary is a mathematical and logical concept that we abstracted from them. Why do we use binary representation?

Because binary has only two states, each binary bit can be represented by a physical device with two stable states. High and low voltage levels, positive and negative charges, or a light being on or off can all conveniently stand for "0" and "1". This makes it easy for a computer to implement logical operations and logical decisions.

Computer code conversion process

Because computers can only express the logical concepts "0" and "1", they cannot represent pictures and text directly, so a conversion process is needed.

In fact, we maintain the character number mapping relationship according to certain rules. For example, we abstract "A" into "1" in the computer. When we see "1", we think it is "A", which is essentially a mapping table. In theory, you can assign a unique character code to each character at will, such as the following table
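The idea of an arbitrary mapping table can be sketched in a few lines of Python. The table here is made up purely for illustration, and the names TABLE, encode and decode are hypothetical rather than part of any real standard:

```python
# A made-up character table: any assignment works as long as every
# character gets a unique number and both sides share the same table.
TABLE = {"A": 1, "B": 2, "C": 3}
REVERSE = {number: char for char, number in TABLE.items()}

def encode(text):
    """Translate characters into their numbers."""
    return [TABLE[ch] for ch in text]

def decode(numbers):
    """Translate numbers back into characters."""
    return "".join(REVERSE[n] for n in numbers)

print(encode("CAB"))      # [3, 1, 2]
print(decode([3, 1, 2]))  # CAB
```

Decoding only works if the reader uses the same table as the writer, which is exactly why shared standards such as ASCII became necessary.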

Next, let's look at the simple process of text input, transcoding, storage and output (display/print). Computers were invented by Americans, so the rules were set by Americans: the keys on the keyboard are English letters, and you cannot assign numbers however you like. For English letters there is a direct correspondence between the keyboard and the ASCII code. The key "A" corresponds to the number "65", and what is stored on disk is simply "65" written in binary, "01000001", which is easy to understand.

But this is not the case for Chinese character input. The keyboard has no keys for Chinese characters, so they cannot be typed directly. This is where the conversion between input code, internal code and font code comes in. The input code converts keystrokes on an English keyboard into Chinese characters, the internal code converts Chinese characters into binary sequences, and the font code turns binary sequences into images on the display.
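The correspondence between "A", 65 and 01000001 can be checked with Python's built-in functions. This is only a quick illustration, not part of the original walkthrough:

```python
code = ord("A")              # character -> number
bits = format(code, "08b")   # number -> 8-bit binary string

print(code, bits)            # 65 01000001
print("A".encode("ascii"))   # b'A', a single byte with the value 65
print(chr(0b01000001))       # A
```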

Input code

Let's simulate the input of Chinese characters. First, open a text file and type the pinyin letters "nihao". A list of matching Chinese phrases then pops up in the input bar. Finally, we select the number of the phrase we want, and the Chinese characters are entered. How is this process realized?

In the computer field there is a saying as revered as the Ten Commandments of Moses: "Any problem in computer science can be solved by adding another layer of indirection."

Here we add one more layer, a mapping table between key-letter combinations and Chinese characters, much like an English-Chinese dictionary. This layer is called the input code. Going from input code to internal code is a table lookup, for example on the ASCII characters "nihao". You can freely modify the mapping table and the candidate numbers. I could map "nihao" to "Hello, Xiao Yang", for instance.
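A minimal sketch of such an input-code table, assuming a toy dictionary: the names INPUT_TABLE, candidates and choose are hypothetical, and real input methods use far larger tables with ranking.

```python
# Hypothetical input-method table: key sequence -> candidate phrases.
INPUT_TABLE = {
    "nihao": ["你好", "你号", "拟好"],
    "wan": ["万", "完", "晚"],
}

def candidates(keys):
    """Return the numbered candidate list shown to the user."""
    return list(enumerate(INPUT_TABLE.get(keys, []), start=1))

def choose(keys, number):
    """Pick the candidate the user selected by its number."""
    return INPUT_TABLE[keys][number - 1]

# The table can be edited freely, e.g. to put a custom phrase first.
INPUT_TABLE["nihao"].insert(0, "Hello, Xiao Yang")

print(candidates("nihao"))
print(choose("nihao", 2))  # 你好
```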

Internal code

The internal code (also called the machine internal code) is the core of character encoding: it is the binary code actually used to store, exchange and transmit a character set inside the computer. The internal code is what makes efficient storage and transmission of text possible. The external code (input code) maps keyboard keys to characters, while the internal code turns characters into the binary language that machines can read.

Font code

The characters in the computer are all expressed in binary form of internal code. How can we display the characters corresponding to numbers on the display? For example, the number "1" represents the Chinese character "you". How can we display "1" as "you"?

This depends on the font code. The font code is essentially an n * n pixel lattice. The pixels in some positions are set to white (represented by 1), and the pixels in other positions are set to black (represented by 0). The font of each character is stored in the computer in advance, and this font information library is called the font library.

For example, the bitmap of the Chinese character "you" is a 16 * 16 pixel matrix, which requires 16 * 16 / 8 = 32 bytes of storage. This bitmap information is called the font code. Different font libraries (such as Song typeface and bold typeface) have different font codes for the same character.

Therefore, the character encoding to the displayed font code is actually another lookup table, that is, the mapping table between character encoding and font code.
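The lookup from character code to font code can be sketched with a toy font library. The 8 * 8 glyph below is hand-drawn for illustration and does not come from any real font. FONT and render are hypothetical names:

```python
# Hypothetical 8 x 8 font library: one byte per row, one bit per pixel.
FONT = {
    65: bytes([0x18, 0x24, 0x42, 0x42, 0x7E, 0x42, 0x42, 0x00]),  # "A"
}

def render(code):
    """Look up the glyph for a character code and draw it as text."""
    rows = []
    for row in FONT[code]:
        bits = format(row, "08b")
        rows.append(bits.replace("1", "#").replace("0", "."))
    return "\n".join(rows)

print(render(ord("A")))

# Storage cost of a bitmap glyph: width * height / 8 bytes.
print(8 * 8 // 8)    # 8 bytes for this toy glyph
print(16 * 16 // 8)  # 32 bytes for a 16 x 16 glyph
```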

In fact, we can also think of the character code as a compressed form of the font code: a 32-byte pixel lattice is compressed into a 2-byte internal code.

History of character encoding

Telegraph code

In a broad sense, coding has a long history that can be traced back to the ancient practice of tying knots to record events. But it was the invention of Morse code that came closest to modern character coding and opened the door to the age of information and communication. Morse code was invented by the American Samuel Morse in 1837, more than 100 years before ASCII, and played a very important role in early radio. As every radio operator knows, Morse code is composed of dots "." and dashes "-", which sound like a short "di" and a long "dah" in a telegram. Like binary, it is a two-symbol code. One symbol is certainly not enough to represent all the letters, so several are combined: for example, dot-dash ".-" represents the letter "A", and dash-dot-dot-dot "-..." represents the letter "B".

Coding era

When computers were first invented, they were used to solve mathematical problems. Later, people found that computers could do more, such as text processing. Machines were very large at the time and isolated from one another. With no need to make machines communicate, each major manufacturer went its own way, building its own hardware and software and encoding characters however it liked.

Later, when machines needed to communicate with each other, people found that the same data was displayed differently on different computers. On an IBM machine the number "00010100" might represent "A", while on a Microsoft system it was displayed as "B", which left everyone dumbfounded. So the American standards body (the American Standards Association, which later became ANSI) developed the American Standard Code for Information Interchange (ASCII), which unified the rules of the game and specified which binary numbers represent the common symbols.

All flowers bloom together

English-speaking countries were happy with the unified ASCII standard, but ASCII considered only English letters. When computers reached Europe, the French needed accented letters (for example, é) and the Germans needed several more letters (Ä ä, Ö ö, Ü ü, ß). Fortunately, ASCII used only the first 128 values (0-127) of a byte, so the Europeans used the unused values (128-255) for their own symbols, and everyone could still get along.

However, once computers reached China, people were stumped by the breadth and depth of the Chinese language. We have tens of thousands of Chinese characters, and 256 values are nowhere near enough, which is how multi-byte encodings came about. In the end every country and region introduced its own encoding tables: the ISO 8859 series, the GB series (GB2312, GBK, GB18030, GB13000), Big5, EUC-KR, JIS and so on. To remain usable across computer systems, these extended encodings are all directly or indirectly compatible with ASCII.

In order to sell their products worldwide, international vendors such as Microsoft and IBM had to support the languages of many countries and adopt the local encoding in each place. So they collected the encodings used around the world, numbered them, and called them code pages (also known as internal code tables). This is why a character encoding is sometimes referred to by its code page number: on Microsoft systems, for example, the Chinese GBK encoding corresponds to code page 936 and the Traditional Chinese Big5 encoding to code page 950.

These character encodings, compatible with ASCII but incompatible with each other, later came to be collectively called ANSI encodings. You are probably familiar with the following figure: on Windows, files are usually saved with the ANSI encoding.

Literally, ANSI is not a character encoding at all but the abbreviation of the American National Standards Institute, a non-profit organization in the United States. ANSI did a lot of standardization work on character encoding, and people later got used to calling this chaotic family of multi-byte encodings ANSI encodings or standard code pages.

ANSI encoding is just a label: it stands for the system's default encoding rather than one specific encoding. On the Windows operating system, for example, the ANSI encoding in China refers to the GB encoding, in Hong Kong to Big5, and in South Korea to EUC-KR.

Unification of the world

Since each country has its own character encoding, what if someone wants to show off by dropping a couple of Korean sentences into a Chinese text? Sorry, that is not supported: once you choose GB2312, you can only type Chinese characters. Meanwhile, the major international vendors were also suffering from having to stay compatible with so many character encodings. Eventually they had had enough and decided to develop one encoding that could accommodate all the characters in the world, and so the famous Unicode was born.

Unicode, also called the universal code, comprises a character set, encoding schemes and more. It was created to overcome the limitations of traditional character encoding schemes by assigning a unified, unique code to every character of every language. Within Unicode there are no conflicts between language encodings, and content in any language can be displayed on the same screen. This is Unicode's greatest advantage.

Unicode has four common encoding implementations: UTF-7, UTF-8, UTF-16 and UTF-32. The most famous is UTF-8. Unicode's original design used a fixed-length double-byte encoding (UCS-2, from which UTF-16 grew), but the historical burden of existing single-byte text was too heavy for it to be pushed through. In the end, the variable-length UTF-8 was the one that became widely accepted.

Character encoding model

Traditional coding model

In the traditional character encoding model, the characters of a character set are simply numbered one by one in decimal, and the decimal number is converted directly into the corresponding binary code. In other words, the character number is the character encoding.

When a computer converts between characters and numbers, it is in fact looking them up in a mapping table. ASCII, for example, assigns a unique number to each English character. The whole process is simple: the computer maps the character directly to binary, and the decimal number exists only for our convenience.

Modern coding model

The Unicode encoding model takes a new approach and divides encoding into four layers. Some descriptions count five, but the fifth layer is a transport-layer adaptation of the encoding and, strictly speaking, does not belong to the encoding model.

• The first layer, abstract character repertoire ACR (Abstract Character Repertoire): defines the abstract character set, that is, which abstract characters exist;

• The second layer, coded character set CCS (Coded Character Set): assigns a number to each character in the abstract character set;

• The third layer, character encoding form CEF (Character Encoding Form): encodes the character number into a logical sequence of code units;

• The fourth layer, character encoding scheme CES (Character Encoding Scheme): encodes the logical sequence of code units into a physical sequence of bytes.
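The four layers can be traced for a single character. This is a sketch that assumes UTF-16 as the encoding form and big-endian as the encoding scheme. The helper utf16_code_units is a hypothetical name written out by hand so the third layer is visible:

```python
def utf16_code_units(code_point):
    """Layer 3 (CEF): turn a code point into 16-bit code units."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    # Characters above U+FFFF are split into a surrogate pair.
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

def serialize_big_endian(units):
    """Layer 4 (CES): write each 16-bit code unit as two bytes."""
    return b"".join(unit.to_bytes(2, "big") for unit in units)

ch = "万"                              # layer 1 (ACR): an abstract character
code_point = ord(ch)                   # layer 2 (CCS): its number
units = utf16_code_units(code_point)   # layer 3 (CEF): code units
data = serialize_big_endian(units)     # layer 4 (CES): bytes

print(hex(code_point))                 # 0x4e07
print([hex(u) for u in units])         # ['0x4e07']
print(data.hex())                      # 4e07
print(data == ch.encode("utf-16-be"))  # True
```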

Layer 1: Abstract character set ACR

An abstract character set is an unordered collection of abstract characters. The point of "abstract" is that the set includes not only characters in the narrow sense that we can see, such as "a", but also invisible ones, such as the control characters "DELETE" and "NULL".

Another implication of abstraction is that some glyphs can be composed of several characters. For example, the Spanish "ñ" can be represented as the two characters "n" and "~" (a combining tilde). Here Unicode differs from traditional encoding standards, most of which treat "ñ" only as a single independent character: Unicode also allows it to be written as a combination of two characters, while keeping a precomposed "ñ" for compatibility.

At the same time, one character may have several visual forms. A Chinese character, for example, can be written in regular script, running script, cursive script, clerical script and so on. These are all regarded as the same abstract character (that is, a character set encodes characters, not fonts). How they are displayed is a matter for the font library.

Abstract character sets can be open or closed. An open character set keeps receiving new characters, while a closed character set does not. ASCII, for example, is closed: it has only 128 characters and none will be added. Unicode is open and new characters are added continually. It has grown from the original 7163 characters to 144697 characters in Unicode 14.0.
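Both ways of writing "ñ" can be observed with Python's standard unicodedata module. This is only a small illustration of composed versus combined characters:

```python
import unicodedata

composed = "\u00f1"     # ñ as one precomposed code point
decomposed = "n\u0303"  # n followed by COMBINING TILDE

print(len(composed), len(decomposed))                        # 1 2
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

The two strings look identical on screen but compare as different until they are normalized to the same form.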

Layer 2: coded character set CCS

A coded character set numbers each character in the abstract character set, mapping it to a non-negative integer.

The number is generally written in decimal or hexadecimal, which are convenient for humans to read. For example, the number of the character "A" is "65" and the number of "B" is "66".

Note that in some encodings the character number is itself the stored binary sequence, as in ASCII, while in others the number differs from the stored binary sequence, as in GB2312 and Unicode.

In addition, the range of a coded character set is limited. For example, the range of ASCII is 0~127 and the range of ISO-8859-1 is 0~255. GB2312 uses a 94 * 94 two-dimensional matrix, and Unicode uses the concept of planes. This range is called the code space of the character set.

A position in the code space is called a code point. The coordinates (a pair of non-negative integers) or the single non-negative integer of the code point occupied by a character is that character's code value (code point number).

Layer 3: character encoding form CEF

Abstract character sets and numbered character sets are viewed from the perspective of facilitating our understanding, so finally we need to translate them into a language that can be understood by the computer, and convert decimal numbers into binary forms.

Therefore, the character encoding method is the process of converting the code point number of the character set into a binary code unit sequence.

Code unit: the smallest unit processed by a character encoding. In ASCII, one character is one byte, so the code unit is a single byte. UTF-16 processes text in 16-bit "words", so its code unit is two bytes. UTF-8 is a multi-byte encoding with both single-byte and multi-byte characters, but it is always processed one byte at a time, so its code unit is also a single byte.
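To make the code unit sizes concrete, the following sketch splits encoded text into units of 1, 2 or 4 bytes. The helper name code_units is hypothetical, and the big-endian codec variants are used only so that no byte order mark is added:

```python
def code_units(text, encoding, unit_size):
    """Encode text and split the bytes into code units of unit_size bytes."""
    data = text.encode(encoding)
    return [data[i:i + unit_size].hex() for i in range(0, len(data), unit_size)]

for ch in ["A", "万", "😀"]:
    print(
        ch,
        code_units(ch, "utf-8", 1),      # 8-bit code units
        code_units(ch, "utf-16-be", 2),  # 16-bit code units
        code_units(ch, "utf-32-be", 4),  # 32-bit code units
    )
```

One character may need several code units: "万" takes three in UTF-8 but one in UTF-16, while "😀" takes four in UTF-8 and two in UTF-16.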

You may have a question here. Why should we extract such a layer separately?

Early character encodings did not need this layer: the decimal number was converted directly to binary. In ASCII, for example, the character "A" has the decimal number "65", whose binary form is "1000001", and that same binary value is what is stored on the hard disk, so encoding was simple at the time.

With the advent of multi-byte character sets (MBCS), the character number is no longer converted directly to binary. In GB2312, for example, the location code of the character "万" (wan, "ten thousand") is "45, 82", while the corresponding binary internal code is "1100 1101 1111 0010" (in decimal, "205, 242").

What would happen if we skipped the conversion and mapped the number directly to binary? The location code of "万" is "45, 82"; 45 is "-" in ASCII and 82 is "R". Should the computer display the two characters "-R" or the single character "万"? Prefix processing is added to avoid this conflict. The process is explained in detail below.

Layer 4: character encoding scheme CES

The character encoding scheme, also called the "serialization format", maps the sequence of code units produced by encoding the character numbers into a sequence of bytes (a byte stream), so that the encoded characters can be processed, stored and transmitted in the computer.

The character encoding form (CEF) is somewhat like logical design in database design, while the encoding scheme (CES) of this layer is like physical design: it maps the code unit sequence to the physical binary layout of a specific computer platform.

Here you may wonder why the code unit sequence can differ from the bytes actually stored. This is mainly caused by the big-endian and little-endian byte orders of computers. Byte order is described in detail in the UTF-16 section.

The terms big-endian and little-endian come from Jonathan Swift's Gulliver's Travels:

Everyone agrees that the original way to eat an egg is to break it at the larger end. But when the present emperor's grandfather was a boy, he happened to cut a finger while breaking an egg in the ancient way. His father, the emperor at the time, therefore issued an edict ordering all subjects to break their eggs at the smaller end, on pain of severe punishment. The people resented this order deeply. History tells us that there have been six rebellions on this account, in which one emperor lost his life and another his crown... Hundreds of large volumes have been published on this dispute, but the books of the Big-Endians have long been banned, and the law forbids any member of that party from holding office.
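The effect of byte order can be seen by serializing one 16-bit code unit both ways. This sketch assumes the UTF-16 code unit 0x4E07 (the character "万") as the example value:

```python
import sys

code_unit = 0x4E07  # the UTF-16 code unit of "万"

big = code_unit.to_bytes(2, "big")        # most significant byte first
little = code_unit.to_bytes(2, "little")  # least significant byte first

print(big.hex(), little.hex())             # 4e07 074e
print(big == "万".encode("utf-16-be"))     # True
print(little == "万".encode("utf-16-le"))  # True

# Reading the bytes with the wrong byte order yields a different character.
print(big.decode("utf-16-le") == "万")     # False

# Byte order of the machine running this script ("little" or "big").
print(sys.byteorder)
```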

Common character encoding

ASCII

Long ago, computer manufacturers each rendered characters to the screen in their own way. Computers were the size of a house then, few people could afford one, and nobody cared how computers communicated with each other. With the arrival of microprocessors in the 1970s and 1980s, computers became smaller and smaller, personal computers came into public view, and explosive growth followed. But manufacturers had all worked independently without considering compatibility with others' products, which made converting data between different computer systems very painful. Therefore, the American Standards Association formulated the ASCII code, first published in 1963 and substantially revised in 1967, which defines 128 characters to this day.

The first 32 codes (0~31) are invisible control characters, 32~126 are visible characters, and 127 is the DELETE control character (the DEL key on the keyboard).

In fact, at around the same time as ASCII, IBM introduced its own character encoding system, EBCDIC, in 1963. Like ASCII, it includes control characters, digits, common punctuation marks, and uppercase and lowercase English letters.

However, its letter codes are not contiguous, which made programs harder to write. ASCII learned from the experience of EBCDIC and assigned contiguous codes to the English letters to simplify program processing, which is one reason it was later widely accepted.

Besides arranging characters contiguously, ASCII differs from EBCDIC in that it uses only the low 7 bits of a byte, with the highest bit always 0. Don't underestimate this high-order 0. It seems insignificant, but it is the most successful aspect of ASCII's design. As the coding principles described later show, it is thanks to this high-order 0 that other encoding standards can remain seamlessly compatible with ASCII, which is why ASCII became so widely accepted.

ISO-8859 series

Although character encoding had been unified in the American market, computer manufacturers ran into difficulties when they entered Europe. The mainstream European languages also use Latin letters, but with many extensions, such as French "é" and Norwegian "å", which ASCII cannot express. However, the upper 128 values of a byte, the ones after ASCII, were still unused, and that was enough for the mainstream European languages.

So came the well-known ISO-8859-1 (Latin-1), which extends ASCII only with the upper 128 values and remains a single-byte encoding. To stay compatible with ASCII, a byte whose highest bit is 0 is still the original ASCII character, and a byte whose highest bit is 1 is an extended European character.

But that is not the end of the story. Latin-1 covers only the mainstream European languages, and even so it lacks the three French letters œ, Œ and Ÿ as well as the Finnish Š, š, Ž and ž. With all 256 code points of a single byte used up, further variants were created, ISO-8859-2 through ISO-8859-16. They are all compatible with ASCII, but they are not fully compatible with each other.

The ISO-8859-n series character sets are as follows:

• ISO8859-1, also known as Latin-1, collects the common characters of Western Europe, including the letters of German and French;

• ISO8859-2, also known as Latin-2, collects Eastern European characters;

• ISO8859-3, also known as Latin-3, collects Southern European characters;

• ISO8859-4, also known as Latin-4, collects Nordic characters;

• ISO8859-5, also known as Cyrillic, collects Slavic characters;

• ISO8859-6, also known as Arabic, collects Arabic characters;

• ISO8859-7, also known as Greek, collects Greek characters.
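The way the ISO-8859 parts agree on ASCII but disagree on the upper half can be checked with Python's built-in codecs. The byte values are chosen only for illustration:

```python
raw = bytes([0x41, 0xE9])  # "A" followed by one byte above 127

decoded = {}
for name in ["iso8859_1", "iso8859_5", "iso8859_7"]:
    decoded[name] = raw.decode(name)
    print(name, decoded[name])

# A letter that Latin-1 does not contain cannot be encoded with it.
try:
    "œ".encode("iso8859_1")
except UnicodeEncodeError:
    print("œ has no code in ISO-8859-1")
```

The first byte decodes to "A" under every part, while the second byte becomes a different letter under each one, which is exactly how text turns into garbage when the wrong part is assumed.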

GB series

When computers entered East Asia, manufacturers were even more dumbfounded. The languages of the United States and Europe are basically written with phonetic alphabets, for which one byte is enough. Many Asian countries, however, use ideographic scripts with tens of thousands of characters, and one byte is nowhere near enough. So the relevant departments in China designed the double-byte GB2312 encoding in line with ISO specifications. But GB2312 is a closed character set that includes only commonly used characters, a little over 7000 in total. To add more characters, including rare ones, GBK, GB18030 and GB13000 followed ("GB" is the pinyin abbreviation of "Guo Biao", meaning "national standard").

In the GB series of encodings, a byte in the range 0~127 means the same as in ASCII; otherwise, the byte and the one after it together form a Chinese character (or another character defined by the GB code). All GB encodings therefore contain ASCII.

GB2312

GB2312 is an encoding standard that represents each Chinese character with two bytes. It includes 6763 Chinese characters and 682 non-Chinese graphic characters. To avoid conflicts with ASCII (0~127), each byte of a Chinese character must be greater than 127 (that is, its highest bit is 1), and two such bytes together represent one Chinese character. In practice GB2312 text is therefore variable-length: an English character takes one byte and a Chinese character takes two. GB2312 can be regarded as a Chinese extension of ASCII.

The GB2312 code space is a 94 * 94 two-dimensional table. Rows are zones (high byte) and columns are positions (low byte); each zone has 94 positions, and each zone/position pair corresponds to one character and is called its location code. Adding 2020H to the location code (written in hexadecimal) gives the national standard code, and adding 8080H to the national standard code gives the internal code commonly used by computers. Three concepts have appeared here: location code, national standard code and internal code. Let's look at how they relate.

National standard code

The national standard code is China's standard code for Chinese character information interchange. It consists of four hexadecimal digits, that is, two bytes that each use only the lower 7 bits. To avoid the first 32 control characters of ASCII, each byte starts from the value 33 (21H), as shown in the figure below

Location code

Since the hexadecimal code area of the national standard code is not intuitive enough, we map it to a decimal 94 * 94 two-dimensional table, which we call the location code. The location code can also serve as an external code: an input method can be switched to location-code mode to enter Chinese characters directly. But this input method has no pattern to it. Location codes are hard to remember, so few people use it.

The figure below shows a two-dimensional table of location codes. For example, the character "万" (wan) is in zone 45, position 82, so its location code is "45, 82".

Among them:

• Zones 01~09 (682): special symbols, digits, English letters, box-drawing characters, etc., 682 full-width characters in total, including Latin letters, Greek letters, Japanese hiragana and katakana, Russian Cyrillic letters, etc.;

• Zones 10~15: empty, reserved for expansion;

• Zones 16~55 (3755): commonly used Chinese characters (also known as level-1 Chinese characters), sorted by pinyin;

• Zones 56~87 (3008): less common Chinese characters (also known as level-2 Chinese characters), sorted by radical and stroke count;

• Zones 88~94: empty, reserved for expansion.

Internal code

The GB2312 national standard code also covers the symbols and English letters of the visible part of ASCII, re-encoding them with two 7-bit bytes. This has a drawback: early English text encoded in ASCII can no longer be opened correctly and turns into garbage. In other words, the encoding should have stayed compatible with ASCII instead of overwriting it. Microsoft later solved the problem by setting the highest bit of each byte to 1, since ASCII uses only 7 bits and its highest bit is 0. The converted code is called the internal code. It essentially modified the GB2312 encoding standard and was eventually accepted by everyone.

To summarize the conversions among the three: location code --> add 32 (20H) to the zone and the position respectively to get the national standard code --> add 128 (80H) to each byte to get the internal code (which no longer conflicts with ASCII).
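The chain of conversions can be written out for the character at location code "45, 82". The function names are hypothetical, and Python's built-in gb2312 codec is used only to confirm the result:

```python
def location_to_national(zone, position):
    """Location code -> national standard code: add 20H to each part."""
    return zone + 0x20, position + 0x20

def national_to_internal(high, low):
    """National standard code -> internal code: add 80H to each byte."""
    return high + 0x80, low + 0x80

zone, position = 45, 82  # location code of "万"
national = location_to_national(zone, position)
internal = bytes(national_to_internal(*national))

print([hex(b) for b in national])  # ['0x4d', '0x72']
print(internal.hex())              # cdf2
print(internal.decode("gb2312"))   # 万
```

Note that the national standard code bytes 4DH and 72H on their own would read as the ASCII letters "M" and "r", which is the conflict the extra 80H removes.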

GBK

GBK means "national standard extension". GB2312 requires both bytes of a double-byte character to be greater than 127 (highest bit 1), which caps it at fewer than 10000 characters. GBK extends GB2312 directly without re-encoding the GB2312 characters, so it is fully compatible with GB2312. GBK is also a double-byte encoding, but it only requires the first byte to be greater than 127 to mark the start of a Chinese character. As a result, GBK's encoding space is much larger than GB2312's.

The overall GBK encoding range is 8140-FEFE. The first byte lies in 81-FE and the second byte in 40-FE. After removing the xx7F positions, there are 23940 code positions in total, which hold 21886 Chinese characters and graphic symbols. Of these regions, GBK/1 holds the GB2312 symbols plus some supplementary symbols, GBK/2 holds the GB2312 Chinese characters, GBK/3 holds CJK Chinese characters, GBK/4 holds CJK Chinese characters plus supplementary ones, GBK/5 holds non-Chinese symbols, and UDC is the user-defined area.
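The figure of 23940 code positions follows directly from the two byte ranges, as this short calculation shows:

```python
lead_bytes = range(0x81, 0xFE + 1)  # first byte: 81-FE
trail_bytes = [b for b in range(0x40, 0xFE + 1) if b != 0x7F]  # 40-FE without 7F

print(len(lead_bytes))                     # 126
print(len(trail_bytes))                    # 190
print(len(lead_bytes) * len(trail_bytes))  # 23940
```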

Details are as follows:

You may have two questions here: why do trail bytes start at 40 rather than 00, and why are FF and xx7F excluded?

GBK does not require the high bit of the trail byte to be 1, and when the high bit is 0 the trail byte falls in the ASCII range. The ASCII values 00-3F contain all the control characters (along with digits and common punctuation), so they are excluded mainly to prevent serious system-wide consequences if a lead byte is lost;

FF is excluded for compatibility with GB2312, where that position is reserved. 7F is the DEL control character; if the lead byte were lost in transmission, a stray 7F could have serious consequences, so xx7F is excluded as well. All encoding schemes need to watch out for this.

GB18030

As computing developed, the 20000-plus characters of GBK were still not enough, so in 2000 China published a new standard, GB18030, to replace GBK. GB18030 is a mandatory standard: all software sold in the Chinese mainland now supports it.

GB18030 is in fact aligned with Unicode: it covers the entire Unicode character set and can be regarded as another Unicode encoding implementation (a kind of UTF).

Then why build yet another Unicode implementation when UTF already exists?

UTF-8 and UCS-2 are not compatible with GB2312, so upgrading directly would turn existing text into garbage. GB18030, by contrast, is a superset of GBK and GB2312, so software written for GB2312 (GBK) can move to international Unicode coverage simply by upgrading to GB18030.

Although GB18030 is also an extension of GB2312, it extends it differently from GBK. GBK mainly fills the encoding space left undefined by GB2312, while GB18030 is a variable-length encoding: its single-byte area is compatible with ASCII, its double-byte area is compatible with GBK, and its four-byte area covers all remaining Unicode code points.

In principle, the range 0x30~0x39, which GBK never uses for the second byte, signals whether a character is four bytes long.
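The length rule can be sketched as a small function. This assumes well-formed GB18030 input. gb18030_char_length is a hypothetical helper, and Python's built-in gb18030 codec supplies the test bytes:

```python
def gb18030_char_length(data, i=0):
    """Return the byte length of the GB18030 character starting at data[i]."""
    first = data[i]
    if first <= 0x7F:            # single byte: same as ASCII
        return 1
    second = data[i + 1]
    if 0x30 <= second <= 0x39:   # second byte is 30-39: four-byte character
        return 4
    return 2                     # otherwise: double byte, same as GBK

for ch in ["A", "万", "\U00010000"]:
    encoded = ch.encode("gb18030")
    print(encoded.hex(), gb18030_char_length(encoded))
```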
