Home MySQL Blogs Understand what happens in MySQL when using UTF String with Latin1 encoding...
15
Sep
2013
Understand what happens in MySQL when using UTF String with Latin1 encoding... PDF Print E-mail
Written by Marco Tusa   

... And how to go out from it.

 

Why this article?

The scope of the article is to light up what happens behind the scene when an application push UTF data on a MySQL server using Latin1 encoding.

I will show how the characters are store, what exactly happen if you mix UTF8 and latin1, what is transformed to what and why.

Finally I will show the safer way to use to convert a table or simply a table field, without exporting the full dataset.

 

I will also show what is a good way to do it when using master-slave scenario, with minimal impact on the live dataset.

What I am not going to discuss here, and I assume you already have familiarity with is:

  • How to assign a character set or a collation to a table or fields;
  • How collation works in in sorting;
  • How to set the default character set in server and connecting clients.

If any of this point is not clear to you, please read first here (http://dev.mysql.com/doc/refman/5.6/en/charset.html)

For my exercise I will use two tables, two MySQL server instances, and Arabic text.

 

Please note that I have copy this text from Official UN nation page, as such I hope they do not contain any offensive text.

 

Step one check the status of our severs:

Master

(root@localhost) [(none)]>\s
--------------
/opt/mysql_templates/mysql-55o/bin/mysql  Ver 14.14 Distrib 5.5.27, for linux2.6 (x86_64) using readline 5.1
Connection id:		2
Current database:
Current user:		root@localhost
SSL:			Not in use
Current pager:		stdout
Using outfile:		''
Using delimiter:	;
Server version:		5.5.27-log MySQL Community Server (GPL)
Protocol version:	10
Connection:		Localhost via UNIX socket
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	latin1
Conn.  characterset:	latin1
UNIX socket:		/opt/mysql_instances/instances/derekutfm/mysql.sock
Uptime:			2 min 34 sec
Threads: 3  Questions: 16  Slow queries: 0  Opens: 34  Flush tables: 1  Open tables: 26  Queries per second avg: 0.103

Slave

------------

(root@localhost) [(none)]>\s
--------------
/opt/mysql_templates/mysql-55o/bin/mysql  Ver 14.14 Distrib 5.5.27, for linux2.6 (x86_64) using readline 5.1
Connection id:		4
Current database:
Current user:		root@localhost
SSL:			Not in use
Current pager:		stdout
Using outfile:		''
Using delimiter:	;
Server version:		5.5.27-log MySQL Community Server (GPL)
Protocol version:	10
Connection:		Localhost via UNIX socket
Server characterset:	utf8
Db     characterset:	utf8
Client characterset:	latin1
Conn.  characterset:	latin1
UNIX socket:		/opt/mysql_instances/instances/derekutfs/mysql.sock
Uptime:			10 min 43 sec
Threads: 2  Questions: 8  Slow queries: 0  Opens: 34  Flush tables: 1  Open tables: 26  Queries per second avg: 0.012


As you can see, I have define the MASTER to use latin1 as default also for the server, while for the SLAVE I choose to use UTF8.

In this exercise I am going to use only UTF8 for the examples never UTF8mb4.

 

First create the table.

CREATE TABLE `utf_test` (
  `ID` integer AUTO_INCREMENT PRIMARY KEY,
  `notutf` varchar(250) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL,
  `yesutf` varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `binutf` varbinary(250) DEFAULT NULL,
  `different` varchar(250) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
 

 

Now let start insert data again and see what happens:

insert into utf_test values(null,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا',

' this is a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', 'Am I the same');

 

 

 Master
 -------------
(root@localhost) [test]>insert INTO utf_test VALUES(NULL,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', ' this is a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', 'Am I the same');
*************************** 1. row ***************************
       ID: 1
   notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8
   yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
   binutf:  this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
different: Am I the same
1 row IN SET (0.00 sec)
 
 
Slave
---------------
(root@localhost) [test]>set names latin1;
Query OK, 0 rows affected (0.00 sec)
(root@localhost) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8
   yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
   binutf:  this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
different: Am I the same
1 row IN SET (0.00 sec)
 

 

 

 

All good given the value are passed consistently

Now let add some text to the "different" fields and see IF we can still handle it correctly:

 

(root@localhost) [test]>insert INTO utf_test VALUES(NULL,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', ' this is a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', 'Am I the same? لصدأ القمح إلى كينيا');
Master
--------------
(root@localhost) [test]>select * FROM utf_test\G
       ID: 3
   notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8
   yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
   binutf:  this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
different: Am I the same? لصدأ القمح إلى كينيا
2 rows IN SET (0.00 sec)
 
Slave
----------------------------
       ID: 3
   notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8
   yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
   binutf:  this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
different: Am I the same? لصدأ القمح إلى كينيا
 

 

 

 

Ok once more all good.
Is this real UTF8 or as we have seen sometimes fake UTF thanks to Latin1 push?
let see inserting another row with UTF field "yesutf" and "different" containing the same apparent data.

insert into utf_test values(null,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'ح إلى كينيا', 'ح إلى كينيا', 'ح إلى كينيا');

 

 

Master
-----------
(root@localhost) [test]>insert INTO utf_test VALUES(NULL,'simple insert same text everywhere', 'ح إلى كينيا', 'ح إلى كينيا', 'ح إلى كينيا');
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec)
 
Slave
------------------
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec)
 

 

 


Checking the hex:

 

Master
-----
(root@localhost) [test]>select hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
 
Slave 
-------------
(root@localhost) [test]>select hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

As expected the hex code is different between the fields define as UTF8 and the one as Latin1,and I see them correctly because... ?

Look what happens if I change the Names to UTF8...:

 

(root@localhost) [test]>set names utf8;
 
(root@localhost) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
1 row IN SET (0.00 sec)
 

 

 

Now only the binutf (varbinary) is shown correctly.

So just to recap what happen if I use latin1....
Using latin1

I see all fine

 

(root@localhost) [test]>set names latin1;
(root@localhost) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
1 row IN SET (0.00 sec)
 

 

 

The hex values are not matching as:

 

(root@localhost) [test]>select hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

If I do a reverse conversion, also using UNHEX

 

(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,unhex(hex(yesutf)) AS yesutf_hex,unhex(hex(binutf)) AS binutf_hex,unhex(hex(different)) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: ح إلى كينيا
   binutf_hex: ح إلى كينيا
different_hex: ح إلى كينيا
1 row IN SET (0.00 sec)
 

 

 

 

Again I see something different, is it confusing?

No.

 

It is quite clear that varbinary and varchar store the string using the same hexadecimal values, while the yesutf is different because it was expecting a UTF string.

 

Let us analyze JUST the first character coming from the arabic string: ح

This letter correspond to:

Unicode code point U+062D

character: ح

UTF8 characters: Ø

UTF8 hex: d8 ad

Meaning: ARABIC LETTER HAH

 

What happen is that I am inserting the correct codes but the UTF8 field read it as LATIN1 char per byte, as such it transform the value to Ø.

The value of that character in the UTF-8 encoding table and Unicode characters is:

Unicode code point U+00D8

character: Ø

UTF8 characters: Ø

UTF8 hex: c8 98

Meaning: LATIN CAPITAL LETTER O WITH STROKE

 

Comparing the values above with the HEX value we had before, you can see that they match perfectly

yesutf_hex: C398

binutf_hex: D8AD

different_hex: D8AD

 

So far I hope everything is clear.

 

The fields using LATIN1 and UT8 are correctly shown when using NAMES latin1 because the latin1 interpretation of the UTF8 hex values in the case of the fields with latin1 encoding.

 

While in the case of fields with UTF8 encoding, the server perform a double conversion of the codes, conversion that IS NOT executed when doing UNHEX, and in that case what it is shown is the REAL value of the stored codes.

 

Again the only SAFE one during these operation is the varbinary which store the code not bounded to any encoding.

 

So for instance if you have all your database using LATIN1 and you want to convert to UTF8 you cannot just do :

(root@localhost) [test]>alter table utf_corrupt modify different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;

 

Because your text in the tables will be converted and become unusable if using UTF8 clients, see below:

Encoding set as LATIN1:

Client characterset: latin1

Conn. characterset: latin1

I am going to do the conversion:

 

(root@localhost) [test]>alter TABLE utf_corrupt MODIFY different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec)
 
SET NAMES UTF8;
 
	Client characterset:	utf8
	Conn.  characterset:	utf8
 
 
(root@localhost) [test]>select * FROM utf_corrupt\G
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
 

 

The text is a mess again, this because the text is transformed:

 

(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 <---------
1 row IN SET (0.00 sec)
 

 

 

 

Starting from the beginning so the table is again:

 

(root@localhost) [test]>show CREATE TABLE utf_corrupt\G
*************************** 1. row ***************************
       TABLE: utf_corrupt
CREATE TABLE: CREATE TABLE `utf_corrupt` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `notutf` varchar(250) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL,
  `yesutf` varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `binutf` varbinary(250) DEFAULT NULL,
  `different` varbinary(250) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=latin1
1 row IN SET (0.00 sec)
 

 

 

 

data is:

 

*************************** 3. row ***************************
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec) 

 

 

(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

 

And encoding: Client characterset: latin1 Conn.  characterset: latin1

 

To do properly you have to do a 3 steps conversion, mainly varchar -> varbinary -> varchar.As follow:

 

(root@localhost) [test]>alter TABLE utf_corrupt MODIFY different varbinary(250) DEFAULT NULL;
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0
 
(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 <-------------- 
1 row IN SET (0.00 sec)
 

 

 

Now all is fine and the string was not transformed.Now change the encoding with NAMES:

Set NAMES utf8;

Convert back to varchar and UTF ...

 

(root@localhost) [test]>alter TABLE utf_corrupt MODIFY different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;
Query OK, 3 rows affected (0.02 sec)
Records: 3  Duplicates: 0  Warnings: 0
 

 

 

And now yes you have the text stored correctly.

 

       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec)
 
(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

All good! My data is correctly moved from a fields base on varchar LATIN1 to the new varchar UTF8.The most important aspect is to ensure to use the right encoding and to do not allow the server to convert you text to wrong hex values.

In case of significant migration if you have a master slave setup,it is possible to play with them, to minimize the impact of the migration, following these steps:

  1. all latin1 
  2. convert slave to varbinary
  3. insert data from master to slave keeping all as latin1
  4. swap the server
  5. convert ex-master to varbinary 
  6. when done change all application connection to utf8 master_host
  7. convert tables back to varchar utf8 on slave 
  8. swap master slave again
  9. convert slave table to UTF8

 

It seems cumbersome, but if you cannot afford to export the data change the table encoding and reload the data, and you cannot stop the write for long time, this is the safer way.The negative aspect of it is ... that varbinary use a different approach for sorting, it is based on the value of the binaries stored and you cannot use collation to modify it.

Let's go through the process:

 

Step 1) Check that all is latin1

 

Master
------------
(root@127.0.0.1:5510) [test]>\s
--------------
/opt/mysql_templates/mysql-55o/bin/mysql  Ver 14.14 Distrib 5.5.27, FOR linux2.6 (x86_64) USING readline 5.1
 
Connection id:		11
Current DATABASE:	test
Current user:		root@localhost
SSL:			NOT IN USE
Current pager:		stdout
USING OUTFILE:		''
USING delimiter:	;
Server version:		5.5.27-log MySQL Community Server (GPL)
Protocol version:	10
Connection:		Localhost via UNIX socket
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	latin1
Conn.  characterset:	latin1
UNIX socket:		/opt/mysql_instances/instances/derekutfm/mysql.sock
Uptime:			2 days 3 hours 3 min 49 sec
 
Threads: 3  Questions: 198  Slow queries: 0  Opens: 89  FLUSH TABLES: 1  Open TABLES: 27  Queries per second avg: 0.001
--------------
 
Slave
---------------------
(root@127.0.0.1:5511) [test]>\s
--------------
/opt/mysql_templates/mysql-55o/bin/mysql  Ver 14.14 Distrib 5.5.27, FOR linux2.6 (x86_64) USING readline 5.1
 
Connection id:		9
Current DATABASE:	test
Current user:		root@localhost
SSL:			NOT IN USE
Current pager:		stdout
USING OUTFILE:		''
USING delimiter:	;
Server version:		5.5.27-log MySQL Community Server (GPL)
Protocol version:	10
Connection:		Localhost via UNIX socket
Server characterset:	utf8
Db     characterset:	utf8
Client characterset:	latin1
Conn.  characterset:	latin1
UNIX socket:		/opt/mysql_instances/instances/derekutfs/mysql.sock
Uptime:			2 days 3 hours 3 min 45 sec
 
Threads: 2  Questions: 177  Slow queries: 0  Opens: 87  FLUSH TABLES: 1  Open TABLES: 27  Queries per second avg: 0.000
 

 

 

Insert some data on the master ... just to have a couples of rows.

 

(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 1', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 2', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 

 

 

Check the values:

 

Master
----------------------
(root@127.0.0.1:5510) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 2. row ***************************
       ID: 3
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
2 rows IN SET (0.00 sec)
(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 
  
Slave
-------------------------
(root@127.0.0.1:5511) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 2. row ***************************
       ID: 3
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
2 rows IN SET (0.00 sec)
(root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

Step 2) Convert to varbinary the slave, keep same encoding (NAMES latin1)

 

 

(root@127.0.0.1:5511) [test]>alter TABLE utf_test MODIFY different varbinary(250);
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0
 
(root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 <------------------
1 row IN SET (0.00 sec)
 

 

 

Code remain consistent.

Step 3) Production continue to work as usual, no impact.

Insert some other records on the master:

 

(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 2', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 2', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 2. row ***************************
       ID: 3
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 3. row ***************************
       ID: 5
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 4. row ***************************
       ID: 7
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
4 rows IN SET (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test\G
*************************** 1. row ***************************
           id: 1
    noutf_hex: 53544550313A20696E7365727420646174612031
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 2. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 3. row ***************************
           id: 5
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 4. row ***************************
           id: 7
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
4 rows IN SET (0.00 sec)
 
ON The Slave 
----------------------
(root@127.0.0.1:5511) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 2. row ***************************
       ID: 3
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 3. row ***************************
       ID: 5
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 4. row ***************************
       ID: 7
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
4 rows IN SET (0.00 sec)
 
(root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test\G
*************************** 1. row ***************************
           id: 1
    noutf_hex: 53544550313A20696E7365727420646174612031
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 2. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 3. row ***************************
           id: 5
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 4. row ***************************
           id: 7
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
4 rows IN SET (0.00 sec)
 

 

 

Step 4) swap the server

5511 master and 5510 slave (but stop the replication from 5511->5510 and stop also the previous replication from 5510 to 511), production will continue to work in write, order by will be affected for the moment.

 

On new Master 5511:

 

*************************** 6. row ***************************
       ID: 11 <----------------------------------------------
   notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
6 rows IN SET (0.00 sec)
 
ON new Slave 5510:
*************************** 4. row ***************************
       ID: 7 <--------------------------------
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
4 rows IN SET (0.00 sec)
 

 

Step 5) modify to varbinary the tables on the new Slave:

 

(root@127.0.0.1:5510) [test]>alter TABLE utf_test MODIFY different varbinary(250);
Query OK, 4 rows affected (0.02 sec)
Records: 4  Duplicates: 0  Warnings: 0
 

 

 

 

And now restart replication on 5510, so it can replicate from the master 5511.

 

(root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 11
   notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
 
1 row IN SET (0.00 sec)(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 11
    noutf_hex: 53544550343A20696E73657274696E6720646174612066726F6D2045582D534C4156452061667465722073776170207265706C69636174696F6E20697320646F776E20
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

 

All good.

Step 6) change all Application connections to UTF8,

in our example I will change the NAMES on new Master 5511 and on slave 5510

SET NAMES UTF8;

Server characterset: utf8

Db     characterset: utf8

Client characterset: utf8

Conn.  characterset: utf8

 

Master
--------------------------
(root@127.0.0.1:5511) [test]>set names utf8;
Query OK, 0 rows affected (0.00 sec)
 
(root@127.0.0.1:5511) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
...
*************************** 6. row ***************************
       ID: 11
   notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
6 rows IN SET (0.00 sec)
 

 

 

 

Note how the values in the yesutf was NOT correctly converted given the initial double translation.

 

Slave
---------------
Server characterset:	utf8
Db     characterset:	utf8
Client characterset:	utf8
Conn.  characterset:	utf8
 
(root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 11
   notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
1 row IN SET (0.00 sec)
 

 

 

 

Step 7) Convert table on SLAVE back to varchar but using utf8 now

 

(root@127.0.0.1:5510) [test]>alter TABLE utf_test MODIFY different varchar(250)  CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;
Query OK, 7 rows affected (0.01 sec)
Records: 7  Duplicates: 0  Warnings: 0
 
(root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 15
   notutf: STEP7: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
1 row IN SET (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 15
    noutf_hex: 53544550343A20696E73657274696E6720646174612066726F6D2045582D534C4156452061667465722073776170207265706C69636174696F6E20697320646F776E20
   yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

 

Step 8) given all good put now the 5510 back as master,

given NOW the 5510 has the fields set as varchar and encoding/collation using UTF also the order by will work fine again.

At this point data will continue to be pass from master to slave.

 

ON Master
---------------------
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP8: inserting data from master 5510 after swap server again ', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 17
   notutf: STEP8: inserting DATA FROM master 5510 after swap server again 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
1 row IN SET (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 17
    noutf_hex: 53544550383A20696E73657274696E6720646174612066726F6D206D6173746572203535313020616674657220737761702073657276657220616761696E20
   yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 
ON Slave
-------------------------
*************************** 8. row ***************************
       ID: 17
   notutf: STEP8: inserting DATA FROM master 5510 after swap server again 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
8 rows IN SET (0.00 sec)
 
*************************** 8. row ***************************
           id: 17
    noutf_hex: 53544550383A20696E73657274696E6720646174612066726F6D206D6173746572203535313020616674657220737761702073657276657220616761696E20
   yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
 

 

 

 

Step 9) Finally convert all tables also on slave to varchar UTF8:

 

 

Master
---------------
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP9: final Slave conversions', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
 
Slave
---------------
(root@127.0.0.1:5511) [test]>alter TABLE utf_test MODIFY different varchar(250)  CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;
*************************** 9. row ***************************
       ID: 19
   notutf: STEP9: final Slave conversions
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
9 rows IN SET (0.00 sec)
(root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test\G
*************************** 1. row ***************************
           id: 1
    noutf_hex: 53544550313A20696E7365727420646174612031
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
...
           id: 19
    noutf_hex: 53544550393A2066696E616C20536C61766520636F6E76657273696F6E73
   yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
9 rows IN SET (0.00 sec)
 

 

 

 

 

 

Done.
When in doubt about the character set you are using or the collation always check them.
good commands from application:
SELECT USER(), CHARSET(USER()), COLLATION(USER())\G(root@127.0.0.1:5511) [test]>SELECT USER(), CHARSET(USER()), COLLATION(USER())\G
*************************** 1. row ***************************
           USER(): root@127.0.0.1
  CHARSET(USER()): utf8
COLLATION(USER()): utf8_general_ci
1 row IN SET (0.00 sec)
 
AND
(root@127.0.0.1:5511) [test]>show VARIABLES LIKE '%server%'\G
*************************** 1. row ***************************
Variable_name: character_set_server
        Value: utf8
*************************** 2. row ***************************
Variable_name: collation_server
        Value: utf8_unicode_ci
*************************** 3. row ***************************
Variable_name: server_id
        Value: 5511
3 rows IN SET (0.00 sec)
 

 

 

Conclusion

You must put a lot of attention on how you connect to the server, in relation to the encoding you are going to use.
It is advisable to set it explicitly, to be really sure and to check that the server/application are getting it right.
I cannot find a good reasons for which you should have a mix of UTF8 and Latin1 in the same table as for the example I have use, but you never know.
What I mean is that it is normal to have Fields using Latin1 and be define as Latin1 for the encoding, but mixing the UTF8 using Latin1 encoding in the application connection, and try to write UTF will only generate a mess.
If you Have a table with 4 fields 2 UTF8 and 2 LATIN1, please use UTF8 encoding in the connection/server and be sure to push the right text, actually if you do so, the server will alert you if you try to do something stupid, like pushing UTF8 values in latin1 fields.
For that specific cases when you have no idea of what you are going to write, and you MUST use Latin1 for the application connection, well use varbinary and your life will be easier, of course the ordering will be done by binary value instead character, but you need to keep your data safe first.

Reference

 

Thanks 

Finally a huge thanks to Derek Downey, because he raise the issue (again) to me in relation to a customer migration.

Thanks for his dedication, professionality and will to learn and have fun together.

Comments (1)
  • Peter Laursen  - I wrote an article to our FAQ about same 6 years a
    http://faq.webyog.com/content/34/152/en/my-special-characters-display-as-_noisy_-latin-characters-in-sqlyog.html
  • Marco Tusa
    Well I am not saying that this is a new topic, because is not.
    The first time I have to handle it was in 2005 when I was working in UN, and we migrate 180 Databases from oracle to MySQL, almost all of them using 5 languages.
    The reason I publish this, is because I have seen many, many customer still not having clear ideas.

    I believe that our task is to provide better understanding, also on the topic that we found already well know, specially when we still see lack of understanding.

    That is it, nothing is OLD, if there is also only ONE person that require to understand it better.
    Assuming so would be really arrogant from any of us.
Only registered users can write comments!

!joomlacomment 4.0 Copyright (C) 2009 Compojoom.com . All rights reserved."

Last Updated on Sunday, 15 September 2013 23:16
 
 

Who's Online

We have 128 guests online