MySQL character sets and collations explained in detail (CHARACTER SET, COLLATE)

1. Character sets and collations

  • Database tables store and retrieve data. Different languages and character sets must be stored and retrieved in different ways, so MySQL has to accommodate different character sets (different letters and symbols) as well as different ways of sorting and retrieving data.
  • In day-to-day MySQL work (SELECT, INSERT, and so on) there is little to worry about. Which character set and collation to use is decided at the server, database, and table levels.

Several important terms

  • Character set: a collection of letters and symbols
  • Encoding: internal representation of a member of a character set
  • Collation: the rules that specify how characters are compared. (A collation is also called a "collating order".)

Why collation matters

  • Sorting even English text is not as simple as it seems. Consider APE, apex, and Apple. Are they in the correct order? That depends on whether case matters. Under a case-sensitive collation the words sort one way; under a case-insensitive collation they sort another way. This affects not only sorting (for example, ORDER BY) but also searching (for example, whether a WHERE clause that looks for apple also finds APPLE).
  • The situation is more complicated with accented characters, such as those found in French or German, and with character sets that are not based on Latin (Japanese, Hebrew, Russian, and so on).
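
The effect of case sensitivity on a comparison can be sketched with string literals alone, without any table. This is a minimal illustration that assumes the server includes the latin1_general_ci and latin1_general_cs collations (SHOW COLLATION LIKE 'latin1%' will confirm this); the column aliases are arbitrary.

```sql
-- Compare the same two literals under a case-insensitive and a
-- case-sensitive collation. The _latin1 introducer fixes the character
-- set of each literal, so the result does not depend on connection settings.
SELECT
    _latin1'apple' COLLATE latin1_general_ci = _latin1'APPLE' AS ci_equal,  -- expected: 1
    _latin1'apple' COLLATE latin1_general_cs = _latin1'APPLE' AS cs_equal;  -- expected: 0
```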

2. MySQL character set support

  • MySQL supports multiple character sets and allows a separate character set to be specified at each level: server, database, table, column, and string constant. For example, the columns of a table can use the latin1 character set by default while the same table also contains one column that supports Hebrew and another that supports Greek.
  • In addition, you can specify the collation explicitly. You can find out which character sets and collations MySQL supports, and you can convert data from one character set to another.
  • MySQL provides the following character set features:
  • The MySQL server allows multiple character sets to be used at the same time.
  • A given character set can have one or more collations. You can choose the collation that best suits your application.
  • The character sets that support Unicode are:
  • The utf8 and ucs2 character sets, which cover the Basic Multilingual Plane (BMP, also known as "Plane 0"), the section of Unicode whose code points range from U+0000 to U+FFFF.
  • The utf16, utf32, and utf8mb4 character sets, which cover both BMP characters and supplementary characters.
  • MySQL 5.6.1 adds utf16le. This character set is very similar to utf16; the main difference is that it uses little-endian rather than big-endian byte order.
  • You can specify character sets at the server, database, table, column, and string constant levels respectively:
  • The MySQL server has a default character set.
  • You can use the CREATE DATABASE statement to set the character set of a database, and the ALTER DATABASE statement to change it.
  • CREATE TABLE and ALTER TABLE have special clauses for setting the character set of tables and columns.
  • The character set of a string constant can be determined by context or specified explicitly.
  • There are also several functions and operators that can be used to convert individual values from one character set to another. The CHARSET() function returns the character set of a given value. Similarly, the COLLATE operator changes the collation of a string, and the COLLATION() function returns the collation of a given string.
  • SHOW statements and the INFORMATION_SCHEMA database provide information about the available character sets and collations.
  • When the collation of an indexed character column is changed, the MySQL server automatically reorders the index.
  • Different character sets cannot be mixed within a single string, and different character sets cannot be applied to different rows of a given column. However, you can use a Unicode character set (which represents characters from many languages in one encoding) to achieve multi-language support.
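
The per-column character sets and the inspection functions described above can be sketched as follows. The table and column names are hypothetical, and the sketch assumes the server includes the hebrew, greek, latin1, and utf8mb4 character sets (utf8mb4 requires MySQL 5.5.3 or later).

```sql
-- Hypothetical table: latin1 by default, plus one Hebrew and one Greek column.
CREATE TABLE multilingual_demo (
    name_default VARCHAR(50),
    name_hebrew  VARCHAR(50) CHARACTER SET hebrew,
    name_greek   VARCHAR(50) CHARACTER SET greek
) CHARACTER SET latin1;

-- Inspect and convert individual values.
SELECT
    CHARSET(_latin1'abc')                        AS value_charset,      -- character set of the literal
    COLLATION(_latin1'abc')                      AS value_collation,    -- its default collation
    COLLATION(_latin1'abc' COLLATE latin1_bin)   AS forced_collation,   -- collation set by COLLATE
    CHARSET(CONVERT(_latin1'abc' USING utf8mb4)) AS converted_charset;  -- character set after CONVERT
```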

3. View the supported character sets and collations

Check the supported character sets

SHOW CHARACTER SET;
  • MySQL supports many character sets. This statement lists all available character sets, together with the description and default collation of each.

View the list of supported collations

SHOW COLLATION;
  • This statement lists all supported collations.
  • For example, latin1 has several collations for different European languages, and many collations appear twice: once case-sensitive (indicated by _cs) and once case-insensitive (indicated by _ci).
  • The SHOW statements support a LIKE clause, so you can query for specific character sets or collations. For example:
SHOW CHARACTER SET LIKE 'latin%';
SHOW COLLATION LIKE 'utf8%';
  • Information about the available character sets and collations can also be obtained from the CHARACTER_SETS and COLLATIONS tables of the INFORMATION_SCHEMA database.
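
As a rough sketch of the INFORMATION_SCHEMA route, the queries below return much the same information as the SHOW statements, but as ordinary result sets that can be filtered and joined. The latin1 filter values are only examples.

```sql
-- Character sets whose name starts with "latin", with default collation
-- and the maximum number of bytes per character.
SELECT CHARACTER_SET_NAME, DEFAULT_COLLATE_NAME, MAXLEN
FROM INFORMATION_SCHEMA.CHARACTER_SETS
WHERE CHARACTER_SET_NAME LIKE 'latin%';

-- All collations that belong to the latin1 character set.
SELECT COLLATION_NAME, CHARACTER_SET_NAME, IS_DEFAULT
FROM INFORMATION_SCHEMA.COLLATIONS
WHERE CHARACTER_SET_NAME = 'latin1'
ORDER BY COLLATION_NAME;
```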

4. View the current character set and collation

  • A default character set and collation are defined when the database server is installed. In addition, you can specify a default character set and collation when creating a database.
  • In practice, character sets are seldom server-wide (or even database-wide) settings. Different tables, and even different columns, may require different character sets, and both can be specified when creating the table.

View the current character set and collation

SHOW VARIABLES LIKE 'character%';
SHOW VARIABLES LIKE 'collation%';
  • The SHOW VARIABLES statement displays the character sets and collations currently in effect on the server.
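
Beyond SHOW VARIABLES, the defaults at each level can be read roughly as follows. This is a sketch; the customer table is only an assumed example table (the same name used in the query examples later on) and must exist for the last two statements to run.

```sql
-- Server-level defaults, read directly as system variables.
SELECT @@character_set_server, @@collation_server;

-- Defaults of the currently selected database.
SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME = DATABASE();

-- Table-level and column-level settings (assumes a table named customer).
SHOW CREATE TABLE customer;
SHOW FULL COLUMNS FROM customer;
```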

5. Specify the character set

  • The character set and collation can be set at multiple levels (from the default character set used by the MySQL server to the character set used by a single string).
  • The server's default character set and collation are built in at compile time. You can override them with the --character-set-server and --collation-server options when the server starts, or by setting the character_set_server and collation_server system variables while it is running.
  • If you specify only the character set, that character set's default collation becomes the default collation of the server. If you specify a collation, it must be compatible with the character set. A collation is compatible with a character set if the collation name begins with the character set name. For example, the collation utf8_danish_ci is compatible with the character set utf8 but not with the character set latin1.
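
Changing the server defaults at runtime might look like the sketch below. It assumes an account with the privileges needed for SET GLOBAL and a server that includes utf8mb4 (MySQL 5.5.3 or later); the chosen character set and collation are only examples.

```sql
-- Set the server-wide defaults. Existing databases, tables, and columns
-- keep the character set and collation they were created with.
SET GLOBAL character_set_server = 'utf8mb4';
SET GLOBAL collation_server     = 'utf8mb4_unicode_ci';

-- Verify the new global values.
SELECT @@GLOBAL.character_set_server, @@GLOBAL.collation_server;
```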

Syntax

  • The following two clauses can be used to specify the character set and collation of the database/table/column:
CHARACTER SET charset
COLLATE collation
  • Related notes:
  • CHARACTER SET can be replaced by CHARSET.
  • charset is the name of a certain character set supported by the server, and collation is the name of a certain collation of the character set.
  • The two clauses can be used together or separately. If both are used, the collation must be compatible with the character set.
  • If only the CHARACTER SET clause is given, it means that the default collation is applied.
  • If only the COLLATE clause is given, the character set determined by the beginning of the name of the given collation is used.

Specify rules when creating a database

  • The syntax for specifying the character set and collation when creating a database is as follows:
CREATE DATABASE db_name CHARACTER SET charset COLLATE collation;

Specify rules when creating a table

  • The syntax for specifying the character set and collation when creating a table is as follows:
CREATE TABLE tbl_name(...) CHARACTER SET charset COLLATE collation;

Assign rules to columns

  • These attributes apply to data types such as CHAR, VARCHAR, TEXT, ENUM, and SET. The syntax for specifying the character set and collation of a column is as follows:
CREATE TABLE tbl_name(
    col_name CHAR(10) CHARACTER SET charset COLLATE collation
);
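
The three syntax templates above can be filled in as in the following sketch. All database, table, and column names are hypothetical, and the sketch assumes the server includes the latin1 and utf8mb4 character sets with the collations named here.

```sql
-- Database level.
CREATE DATABASE charset_demo CHARACTER SET latin1 COLLATE latin1_general_ci;

-- Table-level default plus a column-level override.
CREATE TABLE charset_demo.customer_demo (
    cust_id      INT NOT NULL PRIMARY KEY,
    cust_name    VARCHAR(50),  -- inherits latin1 / latin1_general_ci from the table
    cust_name_dk VARCHAR(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_danish_ci
) CHARACTER SET latin1 COLLATE latin1_general_ci;

-- Only COLLATE given: the character set (latin1) is implied by the collation name.
CREATE TABLE charset_demo.collate_only_demo (
    note VARCHAR(100)
) COLLATE latin1_general_cs;

-- Change the database defaults later; existing tables keep their own settings.
ALTER DATABASE charset_demo CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```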

6. Specify the collation in a query

  • For example, the following SELECT statement uses COLLATE to specify an alternate collation (in this example, a case-sensitive one), which affects the order of the results:
SELECT * FROM customer ORDER BY cust_name COLLATE latin1_general_cs;

Temporary case sensitivity

  • The SELECT statement above demonstrates a technique for case-sensitive sorting and searching on a table that is normally not case sensitive. The reverse is of course also possible.
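
Applied to a search rather than a sort, the same technique could look like this. The sketch assumes the customer table from the example above and that cust_name uses the latin1 character set; the search values are made up.

```sql
-- Case-sensitive search on a column that normally compares case-insensitively:
-- matches 'Apple' but not 'APPLE' or 'apple'.
SELECT *
FROM customer
WHERE cust_name COLLATE latin1_general_cs = _latin1'Apple';

-- The reverse: force a case-insensitive comparison on a column that
-- normally compares case-sensitively.
SELECT *
FROM customer
WHERE cust_name COLLATE latin1_general_ci = _latin1'apple';
```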

Other COLLATE clauses in SELECT

  • In addition to the ORDER BY clause shown above, COLLATE can also be used with GROUP BY, HAVING, aggregate functions, aliases, and so on.
  • Finally, if absolutely necessary, strings can be converted between character sets. To do this, use the CAST() or CONVERT() function.
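
A few of these other uses are sketched below, again assuming the customer table with a latin1 cust_name column; the aliases are arbitrary and utf8mb4 requires MySQL 5.5.3 or later.

```sql
-- GROUP BY with an explicit collation: 'Apple' and 'APPLE' form separate groups.
SELECT cust_name COLLATE latin1_general_cs AS name_cs, COUNT(*) AS occurrences
FROM customer
GROUP BY name_cs;

-- Aggregate function evaluated under an explicit collation.
SELECT MAX(cust_name COLLATE latin1_general_cs) AS last_name_cs
FROM customer;

-- Converting values between character sets with CONVERT() and CAST().
SELECT CONVERT(cust_name USING utf8mb4)              AS name_converted,
       CAST(cust_name AS CHAR CHARACTER SET utf8mb4) AS name_cast
FROM customer;
```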

7. Unicode support

  • One reason there are so many character sets is that different character encoding schemes were developed for different languages. This causes many problems. For example, a character that exists in several languages may be represented by different numbers in different encoding schemes. Different languages also often need a different number of bytes per character. The latin1 character set is small, and each of its characters fits in a single byte. Languages such as Japanese and Chinese, however, contain so many characters that each character must be represented by multiple bytes.
  • The goal of Unicode is to provide a unified character encoding system in which the characters of all languages can be represented in a uniform way.
  • The two Unicode character sets utf8 and ucs2 include only the characters defined in the BMP, that is, at most 65,536 characters. They do not support supplementary characters, which lie outside the BMP.
  • The ucs2 character set corresponds to the UCS-2 encoding scheme of Unicode. It uses 2 bytes per character, most significant byte first (big-endian). UCS is an abbreviation for Universal Character Set.
  • The utf8 character set uses a variable-length format that uses 1 to 3 bytes to represent a character. It corresponds to the UTF-8 encoding scheme. UTF is the abbreviation of Unicode Transformation Format.
  • Starting with MySQL 5.5.3, additional Unicode character sets are available that include supplementary characters outside the BMP.
  • The utf16 and utf32 character sets are similar to ucs2, except that they add support for supplementary characters. In utf16, BMP characters still occupy 2 bytes (the same as in ucs2) and supplementary characters occupy 4 bytes. In utf32, every character occupies 4 bytes.
  • The utf8mb4 character set contains all utf8 characters (each occupying 1 to 3 bytes) and also contains supplementary characters, each of which occupies 4 bytes.

  • MySQL 5.6.1 adds support for utf16le. This character set is very similar to utf16; the main difference is that it stores the low-order byte first (little-endian) rather than the high-order byte first (big-endian).
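
The byte counts described in this section can be checked with LENGTH(), which returns the length of a string in bytes. This is a sketch that assumes MySQL 5.5.3 or later with the ucs2, utf16, utf32, and utf8mb4 character sets available; the sample characters and aliases are arbitrary choices.

```sql
-- Hex literals are used so the result does not depend on the client encoding:
--   X'41'       is 'A'      (ASCII, inside the BMP)
--   X'E4B8AD'   is U+4E2D   (a CJK character, inside the BMP)
--   X'F09F9880' is U+1F600  (a supplementary character, outside the BMP)
SELECT
    LENGTH(_utf8mb4 X'41')                              AS a_utf8mb4,    -- expected: 1
    LENGTH(_utf8mb4 X'E4B8AD')                          AS cjk_utf8mb4,  -- expected: 3
    LENGTH(_utf8mb4 X'F09F9880')                        AS sup_utf8mb4,  -- expected: 4
    LENGTH(CONVERT(_utf8mb4 X'E4B8AD' USING ucs2))      AS cjk_ucs2,     -- expected: 2
    LENGTH(CONVERT(_utf8mb4 X'E4B8AD' USING utf16))     AS cjk_utf16,    -- expected: 2
    LENGTH(CONVERT(_utf8mb4 X'F09F9880' USING utf16))   AS sup_utf16,    -- expected: 4
    LENGTH(CONVERT(_utf8mb4 X'41' USING utf32))         AS a_utf32;      -- expected: 4
```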