google
 
 
Home MySQL Blogs
MySQL

My MySQL tipsvalid-rss-rogers




How to mess up your data using ONE command in MySQL/Galera. PDF Stampa E-mail
Scritto da Marco Tusa   
Giovedì 27 Novembre 2014 00:55

Or how wsrep_on can bring you to have a cluster with usless data.

redflag

This is a WARNING article, and it comes out after I have being working on define internal blueprint on how to perform DDL operation using RSU safely.

The fun, if fun we want to call it, comes as usual by the fact that I am a curious guy and I often do things my way and not always following the official instructions.

Anyhow, lets us go straight to the point and describe what can happen on ANY MySQL/Galera installation.

The environment

The test environment, MySQL/Galera (Percona PXC 5.6.20 version).

The cluster was based on three nodes local no geographic distribution, no other replication in place then Galera.

Haproxy on one application node, simple application writing in this table:

Table: tbtest1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
CREATE TABLE: CREATE TABLE 'tbtest1' (
'auAPP1nc' bigint(11) NOT NULL AUTO_INCREMENT,
'a' int(11) NOT NULL,
'uuid' char(36) COLLATE utf8_bin NOT NULL,
'b' varchar(100) COLLATE utf8_bin NOT NULL,
'c' char(200) COLLATE utf8_bin NOT NULL,
'counter' bigint(20) DEFAULT NULL,
'time' timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
'partitionid' int(11) NOT NULL DEFAULT '0',
'date' date NOT NULL,
'strrecordtype' char(3) COLLATE utf8_bin DEFAULT NULL,
PRIMARY KEY ('auAPP1nc','partitionid'),
KEY 'IDX_a' ('a'),
KEY 'IDX_uuid' ('uuid')
) ENGINE=InnoDB AUTO_INCREMENT=482 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
 

 

Small app

1
2
3
4
5
6
7
8
9
10
11
12
13
14
#! /bin/bash --
i=1
while :
do
	echo "$i "
	mysql -ustress -pxxx -h192.168.0.35 -P 3307 -e "SET @HH=@@HOSTNAME;
insert into test.tbtest1 (a,uuid,b,c,strrecordtype,date,partitionid) 
values($i,UUID(),@HH,'a','APP1'|'APP2',now(),RAND()*100)";
	i=$((i + 1))
	if [ $i -eq 100 ]
	then
		break
	fi
	sleep 0.5;
done
 

 

 

Server Information

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
(root@localhost:pm) [(none)]>\s SHOW global STATUS LIKE 'wsrep_provider_version';
--------------
/home/mysql/templates/PCX-56/bin/mysql Ver 14.14 Distrib 5.6.20-68.0, FOR Linux (i686) USING EditLine wrapper
Connection id:	90
Current DATABASE:
Current user:	root@localhost
SSL:	NOT IN USE
Current pager:	stdout
USING OUTFILE:	''
USING delimiter:	;
Server version:	5.6.20-68.0-25.7-log Percona XtraDB Cluster BINARY (GPL) 5.6.20-25.7, Revision 886, wsrep_25.7.r4126
Protocol version:	10
Connection:	Localhost via UNIX socket
Server characterset:	utf8
Db characterset:	utf8
Client characterset:	utf8
Conn. characterset:	utf8
UNIX socket:	/home/mysql/instances/galera1-56/mysql.sock
Uptime:	2 min 38 sec
Threads: 3 Questions: 282 Slow queries: 0 Opens: 78 FLUSH TABLES: 3 Open TABLES: 8 Queries per second avg: 1.784
--------------
+------------------------+---------------+
| Variable_name | Value |
+------------------------+---------------+
| wsrep_provider_version | 3.7(r7f44a18) |
+------------------------+---------------+
1 row IN SET (0.01 sec)
 

 

 

Facts

In MySQL/Galera there is variable that allow us to say to the server to do not replicate. This variable is wsrep_on and when we set it as OFF the server will not replicate any statement to the other node.

This is quite useful when in the need to perform actions on an single node, like when you need to perform DDL on RSU mode.

But this flexibility can bite you quite badly.

I had done a simple small change to the widely use command:

 

SET wsrep_on=OFF;

 

I just add GLOBAL:

SET GLOBAL wsrep_on=OFF;

 

 

To be honest I was expecting to have the command rejected, but no it was accept and this is what happened:

I had run the small loop (see above) on two application servers, one pointing to HAProxy and writing APP1 in the field strrecordtype, the other pointing directly to the node where I will issue the command with wsrep_on inserting APP2.

The results:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
(root@localhost:pm) [test]>select @@HOSTNAME;select count(*) AS APP2_COUNTER FROM tbtest1 
WHERE strrecordtype='APP2';
select count(*) AS APP1_COUNTER FROM tbtest1 WHERE strrecordtype='APP1';
+---------------+
| @@HOSTNAME    |
+---------------+
| tusacentral03 |
+---------------+
1 row IN SET (0.00 sec)
+-------------+
|APP2_COUNTER |
+-------------+
| 99          |
+-------------+
1 row IN SET (0.00 sec)
+-------------+
|APP1_COUNTER |
+-------------+
| 99          |
+-------------+
1 row IN SET (0.00 sec)
(root@localhost:pm) [test]>
(root@localhost:pm) [test]>SET GLOBAL wsrep_on=OFF; <------------- It should not be GLOBAL
(root@localhost:pm) [test]>select @@HOSTNAME;select count(*) AS APP2_COUNTER FROM tbtest1 
WHERE strrecordtype='APP2';
select count(*) AS APP1_COUNTER FROM tbtest1 WHERE strrecordtype='APP1';
+---------------+
| @@HOSTNAME    |
+---------------+
| tusacentral01 |
+---------------+
1 row IN SET (0.00 sec)
+-------------+
|APP2_COUNTER |
+-------------+
| 0 |
+-------------+
1 row IN SET (0.00 sec)
+-------------+
|APP1_COUNTER |
+-------------+
| 66              | <-------------------- 1/3 lost because HAProxy think that the node is ok ...
+-------------+
1 row IN SET (0.00 sec)

 

 

As you can see in the tusacentral03 (which is the one where I issue SET GLOBAL wsrep_ON=OFF), I have ALL the records inserted in the local node and ALL the records coming from the others node.

But on the node tusacentral01, I had NO records related to APP2, but more relevant I had lost 1/3 of my total inserts.

Why?

Well this is quite clear, and unfortunately is by design.

If I issue wsrep_ON=OFF with GLOBAL the server will apply the setting to ALL sessions, meaning all session on that will STOP to replicate.

In the source code the section relevant to this is quite clear:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
#wsrep_mysqld.cc
#line 1395
int wsrep_to_isolation_begin(THD *thd, char *db_, char *table_,
                             const TABLE_LIST* table_list)
{
 
  /*
    No isolation for applier or replaying threads.
   */
  if (thd->wsrep_exec_mode == REPL_RECV) return 0;
 
  int ret= 0;
  mysql_mutex_lock(&thd->LOCK_wsrep_thd);
 
  if (thd->wsrep_conflict_state == MUST_ABORT)
  {
    WSREP_INFO("thread: %lu, %s has been aborted due to multi-master conflict",
               thd->thread_id, thd->query());
    mysql_mutex_unlock(&thd->LOCK_wsrep_thd);
    return WSREP_TRX_FAIL;
  }
  mysql_mutex_unlock(&thd->LOCK_wsrep_thd);
 
  DBUG_ASSERT(thd->wsrep_exec_mode == LOCAL_STATE);
  DBUG_ASSERT(thd->wsrep_trx_meta.gtid.seqno == WSREP_SEQNO_UNDEFINED);
 
  if (thd->global_read_lock.can_acquire_protection())
  {
    WSREP_DEBUG("Aborting APP1: Global Read-Lock (FTWRL) in place: %s %lu",
                thd->query(), thd->thread_id);
    return -1;
  }
 
  if (wsrep_debug && thd->mdl_context.has_locks())
  {
    WSREP_DEBUG("thread holds MDL locks at TI begin: %s %lu",
                thd->query(), thd->thread_id);
  }
 
  /*
    It makes sense to set auto_increment_* to defaults in APP1 operations.
    Must be done before wsrep_APP1_begin() since Query_log_event encapsulating
    APP1 statement and auto inc variables for wsrep replication is constructed
    there. Variables are reset back in THD::reset_for_next_command() before
    processing of next command.
   */
  if (wsrep_auto_increment_control)
  {
    thd->variables.auto_increment_offset = 1;
    thd->variables.auto_increment_increment = 1;
  }
 
  if (thd->variables.wsrep_on && thd->wsrep_exec_mode==LOCAL_STATE) <------- Here we have a check for wsrep_on 
  {
    switch (wsrep_OSU_method_options) {
    case WSREP_OSU_APP1: ret =  wsrep_APP1_begin(thd, db_, table_,
                                               table_list); break;
    case WSREP_OSU_APP2: ret =  wsrep_APP2_begin(thd, db_, table_); break;
    }
    if (!ret)
    {
      thd->wsrep_exec_mode= TOTAL_ORDER;
    }
  }
  return ret;
}
enum wsrep_exec_mode {
    LOCAL_STATE,
    REPL_RECV,
    TOTAL_ORDER,
    LOCAL_COMMIT
};
 

 

 

So what happen is that the server check if the thd object has that variable ON and has LOCAL_STATE, if so it replicates, if not it does nothing.

But as said while this makes sense in the SESSION scope, it does not in the GLOBAL.

 

Not only, setting wsrep_on to OFF in global scope does NOT trigger any further action from MySQL/Galera, like for instance the possible FACT that the node could be desynchronize from the remaining cluster.

The interesting effect of this is that HAProxy has NO WAY to know that the node had stop to replicate, and as such the server can receive the requests but those will not replicate to the other node causing data diversion.

 

You can say, that a DBA SHOULD know what he is doing, and as such he/her should be MANUALLY desync the node and then issue the command.

My point instead is that I don't see ANY good reason to have wsrep_on as global variable; instead I see this as a very dangerous and conceptually wrong "feature".

 

Browsing the Codership manual, I noticed that the wsrep_on variable comes with the "L" flag, meaning that the variable is NOT suppose to be GLOBAL.

But it is ...

I also had dig in the code and:

1
2
3
4
5
6
7
8
9
10
11
12
wsrep_var.cc
#line58
 
bool wsrep_on_update (sys_var *self, THD* thd, enum_var_type var_type)
{
  if (var_type == OPT_GLOBAL) {
    // FIXME: this variable probably should be changed only per session
    thd->variables.wsrep_on = global_system_variables.wsrep_on;
  }
  return false;
}
 

 

That is interesting isn't it?

Wondering when this comment was inserted and why it was ignored.

 

Anyhow the source of all problems is here in the wsrep_on variable definition:

1
2
3
4
5
6
7
static Sys_var_mybool Sys_wsrep_on (
       "wsrep_on", "To enable wsrep replication ",
       SESSION_VAR(wsrep_on),                      <----------------------- This allow global 
       CMD_LINE(OPT_ARG), DEFAULT(TRUE), 
       NO_MUTEX_GUARD, NOT_IN_BINLOG, ON_CHECK(0),
       ON_UPDATE(wsrep_on_update));
 

 

The variable was defined as SESSION_VAR instead of SESSION_ONLY, and as such used also in global scope.

 

As already state, this is from my point of view a conceptual error not a bug, but something that should not exists at all, because in a cluster where I have data certify/replicate/synchronize there should NOT be any option for a DBA/user to bypass at GLOBAL level the data validation/replication process.

 

To note, and to make things worse, after I had done the test I can easily set wsrep_on back, and my node will continue to act as part of the cluster as if all the nodes are equal, while they are not.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
(root@localhost:pm) [test]>select @@HOSTNAME;select count(*) AS RSU_COUNTER FROM tbtest1 
WHERE strrecordtype='RSU';
select count(*) AS TOI_COUNTER FROM tbtest1 WHERE strrecordtype='TOI';
+---------------+
| @@HOSTNAME    |
+---------------+
| tusacentral03 |
+---------------+
1 row IN SET (0.00 sec)
 
+-------------+
| RSU_COUNTER |
+-------------+
|         181 |
+-------------+
1 row IN SET (0.00 sec)
 
+-------------+
| TOI_COUNTER |
+-------------+
|         177 |
+-------------+
1 row IN SET (0.00 sec)
 
+---------------+
| @@HOSTNAME    |
+---------------+
| tusacentral01 |
+---------------+
1 row IN SET (0.00 sec)
 
+-------------+
| RSU_COUNTER |
+-------------+
|          77 |
+-------------+
1 row IN SET (0.00 sec)
 
+-------------+
| TOI_COUNTER |
+-------------+
|         139 |
+-------------+
 

 

As you can see the cluster continue to insert data using HAProxy and all the node, but it has a data set that is inconsistent.

Conclusions

  • Never use SET GLOBAL with wsrep_on
  • IF you are so crazy to do so, be sure no one is writing on the node.
  • I am sure this is a mistake in the logic and as such this variable should be change from the source, in the code defining the variable SESSION_ONLY and not SESSION_VAR
    Or wsrep_on can damage you quite badly.
Ultimo aggiornamento Giovedì 27 Novembre 2014 15:00
 
Geographic replication with MySQL and Galera PDF Stampa E-mail
Scritto da Marco Tusa   
Venerdì 21 Novembre 2014 00:56

Introduction

With the introduction of Galera replication for MySQL, Codership has, in my opinion, provide the most significant support to MySQL that the community had seen in the last tree years.

Codership had filled a significant gap in MySQL replication capability, succeeding in a task where many before had failed.

Not only Codership is constantly committed in doing their solution better and more solid, and they are the most friendly and helpful team I have met in years.

In short chapeau to all of the guys! Great job, I am really proud to have the opportunity to discuss, interact with you.

Said that, no product or code around is unmingled of errors and/or issues at different level. So also if Galera help us a lot solving many replication and architectural issues it still has some limitation that must be taken in consideration.

Galera is easy to setup, especially to perform a basic setup, which will work fine for almost all basic OLTP low level and local traffic.

Unfortunately I have seen many implementations that had start as basic and then become something different, in usage and very often in design.

The most common mistake, from architectural point of view, I have seen is when customers starts with a simple cluster of 3 nodes all based on local network, and then given it "works so great" they decide to add one/two nodes on a geographic distributed site.

This solution is not wrong "per se" but it is how it is implemented that makes the difference.

The easiest way to explain what I mean and what should not be done, it is to use a real case, and describe what was implemented and how we had evolve the solution, while Codership was optimizing their product.

Customer scenario

The customer approaches us after few months from an activity that was focus on showing them the basics of how MySQL/Galera works.

They reach us as an emergency given their cluster was totally stuck and their production site was down.

In order to understand what happened we need first take a look to what the customer put up.

Customer environment description

The customer approaches us after few months from an activity that was focus on showing them the basics of how MySQL/Galera works.

They reach us as an emergency given their cluster was totally stuck and their production site was down.

In order to understand what happened we need first take a look to what the customer put up.

Customer environment description

The initial environment of one single application node pointing to a cluster of 3 MySQL/Galera nodes, was transformed into a 6 (or more) application nodes getting traffic from a balancer, each application nodes could write in any Data node (round robin) residing in the same geographic area (at least), but cluster was geographically distributed 3 nodes in site A, 3 nodes in site B.

As result the customer navigation was jumping from one application server to another, and from one Data node to another as well.

Transactions were small and relevant only for a single request.

Finally each application node had a timer to clean up "old" session, the concept of OLD was relevant only to the local node and it could happen that an OLD session was in fact still on going on another application node.

In short, the customer was pushing randomly traffic to both zones, using a single table for the sessions to coordinate the activities, but also purging them without any good locking mechanism:

Onecluster

Problem description

When I connect to the machine, I was not aware of the architecture as it was; I was still referring to what I had found in our history, so one Application node and a simple cluster.

Takes me 5 seconds, to see that the situation was quite different. The cluster was suffering from many point of view, Flow Control was high very often, and nodes between the zones were declare dead often.

Takes me some time to let the customer understand why that was not working correctly, and why the architecture was design wrongly. Mostly because they were claiming that it was working fine before (reminds me "on my machine works" doesn't it?). Discussion on what before was, how much traffic/load and so on.

Finally, they redirect all the traffic to one zone only and at least the FC went down enough to have the site to work, and to give me the chance to review the settings and try to optimize it.

To keep it up I had to increase the FC related variable, to modify galera parameters for nodes time out, and obviously optimize MySQL itself, and we survive a bit.

For the scope of the article focus on the geographic replication, I am not to report all the review, but the part related to the subject.

As said I had extended time outs and FC, but I was not happy and as such I had tested in more details what the network was really allowing to transmit.

 

 

[root@xx ~]# netperf -H 10.0.0.12 -t TCP_RR -v 2  -- -b 6 -r 32K -S 256K -S 256K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.12 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   
 
16384  87380  32768    32768   10.01      97.45   
249856 249856
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
    8      0       0      0   71830.852   97.451 25.546    25.546  <------------ 71ms 
 

 

So a simple round trip for a small record of 32K takes 71 ms, what if I simulate a basic load of 1Mb?

 

[root@xx ~]# netperf -H 10.2.3.30  -t TCP_RR -v 2  -- -b 6 -r 1M,32K -S 256K -S 256K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.2.3.30 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   
 
16384  87380  1048576   32768   10.00       2.70   
249856 249856
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
    8      0       0      0   2592709.047   2.700  22.648    0.708  <-- we have 2592.709047 ms latency
 

 

 

Note, to understand better what was going on and why I use this method to evaluate the network read my other article (Effective way to check the network connection performance, when using replication geographically distributed)

What we had to do

With this kind of connectivity, it is very unluckily that we can have the galera replication working correctly. Not even if we keep only one site as master and the other as stand by. The way the nodes communicate and validate the writesets will be too demanding.

See the below diagram from Codership to understand why

ScreenShot1

 

The only alternative to this is to use asynchronous connection and set two nodes, one for each site in master-master mode:

master-master

Setup a master-master, means that we loose the automatic replication shift from one node to another. That is it, this solution while fix the latency problem because use asynchronous replication, it does imply that in case of crash of one of the two nodes connected, you must manually move the replication to another node.

How to set it up how to fix when it breaks

Setting up asynchronous replication in a galera cluster is quite easy.

Assuming you have two nodes connected as master master (A1 B1) and assuming the only active was A1 and that it crashes, the only thing you need to do is to check on B1 for:

 

+----------------------+-----------+
| Variable_name        | Value     |
+----------------------+-----------+
| wsrep_last_committed | 221463896 |
+----------------------+-----------+
 

 

 

In the status variables, then look on the node you had choose as master say A2, for the binary log containing the

 

Xid = 221463896

 

 

And issue the command CHANGE MASTER with the coordinates you will find there.

Then on A2 set the node as slave of the B1 using the Master coordinates from that node.

The procedure is simple and obviously worked decently, given the situation.

Customer was obviously not happy because they would like to have the two cluster in perfectly sync, but that was at least good enough to keep the site working efficiently. Not only after we set it up, we had at least two incidents to production, and we had successfully able to fail over to DR and then Failback with less then few seconds of service interruption.

Anyhow our friends in Codership had already release the new version of Galera (3.x), working on MySQL 5.6.

So while the customer was refining and correcting a lot of issues their side, mainly in the code and how the application was using the common table.

I had review, test and crashed MySQL 5.6/Galera3.

Galera 3.0 comes with some interesting new enhancement, see for an overview http://www.slideshare.net/Severalnines/webinar-slides-27249453.

At the moment of writing Seppo is close to deliver the speech about the new features in Galera 4 at Percona Live London 2014. (http://www.percona.com/live/london-2014/sessions/galera-cluster-new-features)

Galera 3.X group replication and how this can change everything.

In Galera 3, the introduction of the segment concept basically isolate/identify a group of data nodes inside a cluster.

The meaning is that any node with the same segment value will communicate extensively to each other, while intra-segments they elect one node for each segment as communication channel:

ScreenShot2

 

Quorum calculation is also taken in to account the segments allocation.

In the presentation I have linked here there is mention that using segments will allow managing WAN latency of 100-300 ms.

Given I am a skeptical, I decide to perform extensive tests, and see/identify what it is really possible to accomplish.

Investigations

I did several tests, involving many parameters in galera, but at the end I had two main actors I was playing with, the max_packet size and gcache pagesize.

All the others ones changed had prove to be stable and efficient, below the settings I had used.

How to test the problem

I have setup 8 different tests:

1) simple insert on 1 node from 3 apps
2) simple insert on 1 node for segment on DR only one app on production 2
3) simple insert on 1 node from 3 apps GALERA PACK larger
4) simple insert on 1 node for on DR only one app on production 2 GALERA PACK larger
5) simple insert on 1 node from 3 apps BUT with WAN connection
6) simple insert on 1 node for on DR only one app on production 2 BUT with WAN connection
7) simple insert on 1 node from 3 apps BUT with WAN connection AND GALERA PACK larger
8) simple insert on 1 node on DR only one app on production 2 BUT with WAN connection AND GALERA PACK larger

 

Each test was using OLTP approach, with 5 main tables and 6 child tables with the following structure:

 

tbtest1 | CREATE TABLE `tbtest1` (
  `autoInc` bigint(11) NOT NULL AUTO_INCREMENT,
  `a` int(11) NOT NULL,
  `uuid` char(36) COLLATE utf8_bin NOT NULL,
  `b` varchar(100) COLLATE utf8_bin NOT NULL,
  `c` char(200) COLLATE utf8_bin NOT NULL,
  `counter` bigint(20) DEFAULT NULL,
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `partitionid` int(11) NOT NULL DEFAULT '0',
  `date` date NOT NULL,
  `strrecordtype` char(3) COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`autoInc`,`partitionid`),
  KEY `IDX_a` (`a`),
  KEY `IDX_uuid` (`uuid`)
) ENGINE=InnoDB AUTO_INCREMENT=3474631 DEFAULT CHARSET=utf8 COLLATE=utf8_bin |
 
CREATE TABLE: CREATE TABLE `tbtest_child1` (
  `a` int(11) NOT NULL,
  `bb` int(11) NOT NULL AUTO_INCREMENT,
  `date` date NOT NULL,
  `partitionid` int(11) NOT NULL DEFAULT '0',
  `stroperation` varchar(254) COLLATE utf8_bin DEFAULT NULL,
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`a`,`bb`),
  UNIQUE KEY `bb` (`bb`)
) ENGINE=InnoDB AUTO_INCREMENT=36289 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row IN SET (0.00 sec)
 

 

 

Using 3 application nodes, each was having 8 children treads for a total of 24 threads writing on the cluster.

The tests performed from 1 to 4 where done as baseline so NO WAN network simulation.

When I mention Galera pack larger I refer specifically to:

gcs.max_packet_size

This value is "All writesets exceeding that size will be fragmented." And it is set by default to 32Kb, which means anything larger that will be fragmented in side galera. My understanding is that writeset package exceeding that dimension will be split in smaller packages. But given the dimension of 32K, or even less is significantly larger then the network frame dimension, the packet fragmentation will take place in any case, but at lower level (network).

My reasoning here was, why I should want to do that, if I already have frame fragmentation at network (TCP/IP) level? I may get the meaning of the parameter wrong, but if not is worth a try.

On this regards I had also optimize the machine network layer:

net.core.optmem_max = 40960
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_slow_start_after_idle = 0

I run the tests on 6 data nodes using 2 segments, in virtualized environment, 8 core 8GB raid 10 6 10k RPM disks, CentOS 6.

On purpose I had skip HAProxy and prefer to pint directly on the node I had defined.

The applications were closing and opening connection at each transaction to emulate the customer behavior.

When simulating the wan network I had set a final total latency of :

 

root@galera1h1n5 galera56N1]# netperf -H 10.0.0.33 -t TCP_RR -v 2  -- -b 6 -r 120K,64K -S 156K -S 156K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.33 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec
 
65536  87380  122880   65536   10.00       1.90
319488 319488
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
    8      0       0      0   3684481.420   1.900  1.868     0.996  <----- 3684.481 ms 
 

 

With packet reordering 15-20%.

Just for the good of comparison a PING whould had report:

[root@galera1h1n5 galera56N1]# ping -M  do -s 1472 -c 3 10.0.0.33
PING 10.0.0.33 (10.0.0.33) 1472(1500) bytes of data.
1480 bytes from 10.0.0.33: icmp_seq=1 ttl=64 time=403 ms
1480 bytes from 10.0.0.33: icmp_seq=2 ttl=64 time=389 ms
1480 bytes from 10.0.0.33: icmp_seq=3 ttl=64 time=410 ms
 

 

So ~400 ms for 1427 (1500) bytes.

 

With all this in place, I have started to roll the ball and collected my data.

To notice that I have run each set of test at least 5 times, to validate the data collected, and data was clean, refresh and InnoDB BP reload at any test.

Before reaching the last set of tests (the final 2 sets) I had run a total amount of 64 variants of the tests, sometime crashing galera, sometime getting so bad performance that it was obvious that would not be usable in production, and so on.

Anyhow, I was more focus on the replication performance/behavior than on what MySQL/Innodb was doing given I was not looking for an ultra-fine tuning at that level, actually some imperfections had fit my purpose better.

Related issues

Initially I was not using an optimize network settings, and as soon as I reach the first time the tests 7-8 I had issues, in short the test 7 was taking 4 time longer then test 5 (his twin), and in some cases it was so bad that I had to stop it.

But once the network was set up correctly most of the issue disappears.

Results

Who wins?

There is not a single winner; all the tests shows that there is the need to carefully pick the right solution for the job.

But these two sets of tests, and settings had proven to be stable and better performing then all the others in my simulation:

execution_time_smallGcache_page

 

This first image show the set of tests perform with the gcahe page set to a lower value of 32Mb and more instances 4.

execution_time_largeGcache_page

 

This second one shows the test with a larger page of 250Mb and only 2 instances.

I did not include the starting test where the page/instance was set by defaults, because it was providing less performance.

Comparing the two, we can see that there is a different behavior between the tests done on a local network and the ones using the WAN.

In the case of a LAN where the cluster is less limited in the performance at network level, the use of more smaller gcache pages with more instances, had allow the cluster and MySQL to insert ~5700-5300 row/sec.

The use of a large gcache page and only 2 instances, on the other side see the drop of the performance to ~5500-4950 row/sec.

The difference is not significant with this traffic, but is interesting t note that it exists and is consistent on all the tests.

My side, this means we have something that add a cost, significant cost when galera need to access a less number of gcache page. Given this is a memory access I am oriented to think there is a mutex that prevent the usage of the single page/resource.

On the other hand the Large gcache page, seems providing better support in the case of a distribute cluster.

I assume that given the transmission is slowed down by the network, galera/MySQL is able to better utilize the less instances of the gcache memory pages.

Never the less, the large gcache with less instances, had be less effective when inserting in multiple nodes then the small gcache page with more instances and larger gcs.max_packet_size.

But that is not all.

When using larger packet size (gcs.max_packet_size), I was consistently able to complete the tests in less time when I was using a geographically distributed cluster, large gcache page and multiple nodes writing.

While when the cluster was performing locally, the small (standard) gcs.max_packet_size was working better.

Let us navigate the results.

Segments distribution

The first test I have done is to be sure that the segments where working as expected.

The cluster was compose by 2 segments, each segment 3 nodes.

Segment 1 nodes IP ending with: 5,21,23

Segment 2 nodes IP ending with 2,23,33

 

 

Local cluster(same network)
Display paused 19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.5                <=> 10.0.0.21        1.65GB  76.8Kb  63.9Kb  67.1Kb
10.0.0.5                <=> 10.0.0.31        1.63GB  63.8Kb  51.8Kb  55.2Kb
10.0.0.5                <=> 10.0.0.23        1.58GB  31.8Kb  26.2Kb  28.5Kb <----
10.0.0.5                <=> 10.0.0.2         30.8MB  19.3Kb  15.4Kb  16.2Kb
10.0.0.5                <=> 10.0.0.33         270KB   848b   1.81Kb  1.74Kb
 
 
 Display paused  19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└────────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.21               <=> 10.0.0.5          1.70GB  70.7Kb  71.6Kb  67.5Kb
10.0.0.21               <=> 10.0.0.31         83.4MB  44.9Kb  43.3Kb  40.9Kb
10.0.0.21               <=> 10.0.0.2          31.8MB  15.6Kb  16.9Kb  16.5Kb <---- ??
10.0.0.21               <=> 10.0.0.33         32.1MB  15.6Kb  16.8Kb  16.3Kb <---- ??
10.0.0.21               <=> 10.0.0.23          269KB   848b   1.34Kb  1.36Kb
 
 
Display paused  19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└────────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.31               <=> 10.0.0.5          1.67GB  56.5Kb  58.3Kb  55.7Kb
10.0.0.31               <=> 10.0.0.21         83.4MB  44.7Kb  43.3Kb  41.5Kb
10.0.0.31               <=> 10.0.0.2          31.7MB  19.6Kb  17.6Kb  17.2Kb <---- ??
10.0.0.31               <=> 10.0.0.33          273KB   848b   1.98Kb  1.89Kb 
10.0.0.31               <=> 10.0.0.23          269KB   848b   1.85Kb  1.72Kb
 
 Display paused 19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└───────────────┴───────────────┴─────────┴──────────────┴────────────────
10.0.0.2              <=> 10.0.0.23       1.70GB  71.2Kb  71.2Kb  67.6Kb
10.0.0.2              <=> 10.0.0.33        111MB  58.2Kb  57.6Kb  54.6Kb 
10.0.0.2              <=> 10.0.0.31       31.7MB  20.2Kb  17.7Kb  16.9Kb <---- ??
10.0.0.2              <=> 10.0.0.5        30.8MB  15.6Kb  17.3Kb  16.5Kb <---- ??
10.0.0.2              <=> 10.0.0.21       31.8MB  15.6Kb  16.9Kb  16.5Kb <---- ??
 
 Display paused 19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
───────────────┴───────────────┴────────────────────────┴────────────────
10.0.0.23              <=> 10.0.0.2        1.65GB  71.2Kb  71.2Kb  68.3Kb
10.0.0.23              <=> 10.0.0.33       1.60GB  43.7Kb  43.4Kb  41.7Kb
10.0.0.23              <=> 10.0.0.5        1.62GB  28.1Kb  29.6Kb  28.3Kb <----
10.0.0.23              <=> 10.0.0.31        269KB   848b   1.85Kb  1.76Kb
10.0.0.23              <=> 10.0.0.21        269KB   848b   1.34Kb  1.36Kb
 
 Display paused 19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└───────────────┴────────────────┴───────┴────────────────┴────────────────
10.0.0.33              <=> 10.0.0.2          111MB  58.8Kb  57.6Kb  55.4Kb 
10.0.0.33              <=> 10.0.0.23        1.65GB  43.7Kb  43.3Kb  41.9Kb 
10.0.0.33              <=> 10.0.0.21        32.1MB  15.6Kb  16.8Kb  16.5Kb <---- ??
10.0.0.33              <=> 10.0.0.31         273KB   848b   1.98Kb  1.89Kb
10.0.0.33              <=> 10.0.0.5          274KB   848b   1.78Kb  1.73Kb
 
 
Wan distributed cluster
Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴───────────┴─────────┴────────────────┴──────────────────────
10.0.0.5              <=> 10.0.0.31            1.71GB  30.2Kb  31.3Kb  31.0Kb
10.0.0.5              <=> 10.0.0.21            1.71GB  28.1Kb  30.4Kb  30.4Kb
10.0.0.5              <=> 10.0.0.2             1.61GB  15.1Kb  16.3Kb  16.6Kb <----
10.0.0.5              <=> 10.0.0.33             682KB  1.19Kb  1.91Kb  1.77Kb
10.0.0.5              <=> 10.0.0.23            8.83MB  3.91Kb  1.77Kb  1.87Kb <---- ??
10.0.0.5              <=> 10.0.0.153           5.05KB     0b      0b    572b
10.0.0.5              <=> 10.0.0.151           5.91KB     0b      0b    535b
 
 Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴────────────┴────────┴─────────────────┴──────────────────────
10.0.0.21             <=> 10.0.0.31              194MB  28.7Kb  31.2Kb  31.3Kb 
10.0.0.21             <=> 10.0.0.5              1.76GB  28.5Kb  30.6Kb  30.4Kb
10.0.0.21             <=> 10.0.0.23             73.8MB  15.6Kb  17.1Kb  17.1Kb <---- ??
10.0.0.21             <=> 10.0.0.33              298KB  1.86Kb  2.00Kb  1.91Kb
10.0.0.21             <=> 10.0.0.2              5.83MB  2.06Kb  1.53Kb  1.42Kb <---- ??
10.0.0.21             <=> 10.0.0.12              351KB  1.53Kb  1.34Kb  1.41Kb
 
 
 Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴────────────┴────────┴─────────────────┴──────────────────────
10.0.0.31             <=> 10.0.0.5              1.76GB  28.7Kb  31.3Kb  31.1Kb
10.0.0.31             <=> 10.0.0.21              192MB  28.7Kb  31.2Kb  31.3Kb
10.0.0.31             <=> 10.0.0.33             53.3MB  17.8Kb  17.9Kb  17.7Kb <---- ??
10.0.0.31             <=> 10.0.0.23             1.23MB  3.93Kb  2.01Kb  2.31Kb
10.0.0.31             <=> 10.0.0.2               323KB   528b   1.77Kb  1.81Kb
10.0.0.31             <=> 10.0.0.13              354KB  1.28Kb  1.32Kb  1.38Kb
 
 
 Display paused 19.1Mb      38.1Mb    57.2Mb           76.3Mb           95.4Mb
└───────────────┴───────────┴─────────┴────────────────┴──────────────────────
10.0.0.2              <=> 10.0.0.23            1.65GB  28.5Kb  30.9Kb  31.2Kb
10.0.0.2              <=> 10.0.0.33            1.65GB  27.7Kb  30.5Kb  31.0Kb
10.0.0.2              <=> 10.0.0.5             1.64GB  15.1Kb  16.3Kb  16.6Kb <----
10.0.0.2              <=> 10.0.0.31             323KB   528b   1.77Kb  1.81Kb
10.0.0.2              <=> 10.0.0.21            5.88MB  2.06Kb  1.53Kb  1.46Kb <---- ??
 
 Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴────────────┴────────┴─────────────────┴──────────────────────
10.0.0.23             <=> 10.0.0.33              118MB  29.8Kb  31.5Kb  31.5Kb
10.0.0.23             <=> 10.0.0.2              1.70GB  29.5Kb  30.8Kb  31.2Kb
10.0.0.23             <=> 10.0.0.21             74.0MB  15.4Kb  16.9Kb  17.0Kb <----??
10.0.0.23             <=> 10.0.0.31             1.22MB  3.31Kb  1.85Kb  2.27Kb
10.0.0.23             <=> 10.0.0.5              8.96MB  3.40Kb  1.67Kb  1.84Kb <---- ??
10.0.0.23             <=> 10.0.0.12              359KB  1.47Kb  1.47Kb  1.36Kb
 
 Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴────────────┴────────┴─────────────────┴──────────────────────
10.0.0.33             <=> 10.0.0.23              118MB  28.1Kb  31.5Kb  31.4Kb
10.0.0.33             <=> 10.0.0.2              1.70GB  27.5Kb  30.2Kb  30.7Kb
10.0.0.33             <=> 10.0.0.31             53.3MB  17.8Kb  17.9Kb  17.7Kb <----??
10.0.0.33             <=> 10.0.0.21              298KB  1.86Kb  2.00Kb  1.91Kb
10.0.0.33             <=> 10.0.0.5               686KB  1.03Kb  1.96Kb  1.78Kb
10.0.0.33             <=> 10.0.0.13              370KB  1.41Kb  1.44Kb  1.42Kb
 

 

 

Apparently all was working fine, and if you notice the nodes 5 -23 and 5-2 seems to work as bridge between the segments.

But if you look carefully (I had point the cases with <---- ??), you will see that there is additional traffic going on, for instance in the first set we have nodes 33-21 exchanging data for 32Mb, and in the second set (wan) even more we have node 23-21 exchanging 73Mb.

It seems nothing, but instead it means there is an exchange of some type. What kind of traffic is going on between SOME nodes? I said SOME on purpose, because other nodes are not generating this that I consider an anomaly.

May be this is by design, if so I would like to know, and would like to know why.

Anyhow a side from this, most of the exchange is actually floating as expected.

 

More inside the replication

As mention before, the whole exercise was to identify if and if yes which is best settings to use for the cluster working in geographically distributed environment. In relation to that the tests that are more relevant are the number 6 and 8:

6) simple insert on 1 node for on DR only one app on production 2 BUT with WAN connection
8) simple insert on 1 node on DR only one app on production 2 BUT with WAN connection AND GALERA PACK larger

Comparing the differences existing between the gcache.keep_pages_size and page_size settings as well.

Test6

Replication_queue6

Test8

Replication_queue8

Reviewing the two graphs above, related to the Replication queue, we can see that both tests do not report Flow Control action.

In the case of Test6 the receiving and sending queue had some spikes, that were not present in Test8, meaning the two queues were processed more efficiently when the max_packet_size is larger.

To note is that the wsrep_evs_repl_latency in Test8 is higher, as such the messages seems takes longer to be dispatched, but at the end the performance are better.

 

Test6

netstats6

Test8

netstats8

Reviewing the graphs related to Writeset dimension and the Netstat, we can notice that the size of the writesets are larger in the test with default max_packet_size, also network traffic results to be higher. I was surprise about that and I had repeat the tests even more times. At the end magnifying the information related to the transmitted packet, we can notice that when using larger max_packet_size, the distance between received and transmitted is less.

Test6

netstatsPacket6

Test8

netstatsPacket8

Given in this test we have writes going on two nodes, my conclusion is that the network layer was able to optimize the data transmission, reducing the number of vectors (frames) and as such reducing the overhead.

 

Test6

Parallelization_efficency6

Test8

Parallelization_efficency8

To complete the review related to the galera replication, let take a look to the parallelization efficiency graphs.

Here the differences are more evident. The apply window (Average distance between highest and lowest concurrently applied seqno) when using default max_packet_size is definitely higher, as well the wsrep_commit_window (Average distance between highest and lowest concurrently committed seqno).

Also the wsrep_cert_deps_distance (Average distance between highest and lowest seqno value that can be possibly applied in parallel, potential degree of parallelization), in the first case is steady while with larger max_packet_size it is going down.

I was oriented to read this as an optimize way to apply and commit the incoming data.

Test6

Write_set_bytes6

Test8

Write_set_bytes8

 

Reviewing the Writeset_byte graphs we can note that while the bytes replicated remain very close between Test6 and Test8, the received are quite larger in the case of smaller/default max_packet_size.

So the node was by the fact receiving less data in Test8, keeping constant the same amount of data from application.

Test6

processes6

Test8

processes8

 

Analyzing the process information the Test6 (default max_packet_size) has more major page fault then Test8, and both system time and user time are larger in Test6. The meaning of this is more I/Operations

The differences are not huge, but they are consistent through the different number of tests.

 

Conclusions?

I had collect and still collecting a lot of additional data, but for the scope of this article focus on the network and replication side, I prefer to stop.

There is not A conclusion about the whole MySQL/Galera setup, but let me say that I am strongly suggesting to do not use a monolithic setup for all cases.

Instead adapt the configuration following the architecture you are implementing.

 

So far and from the tests I have done, I can say that:

Cluster implemented in a single site with very good connectivity and using MySQL/Galera for HA or ad write distribute platform:

  • Use multiple gcache.keep_pages_size (I had relate the number to the number of cores but I don't see any benefit to go above 8 instance at the moment)
  • Use gcache.page_size never less then 32Mb no more 256Mb total
  • Use small gcs.max_packet_size 32K

Cluster implemented in a single site with geographic distributed site and using MySQL/Galera on second site for HA only:

  • Use single or two gcache.keep_pages_size
  • Use default gcache.page_size or little bit larger like 256M
  • Use larger gcs.max_packet_size 2097152K

 

In the final tests I have being using the two following configurations, and they had work great also with significant load, not only for the tests but also when we had implement them at the customer site, using the same load and operation that was crashing the system with Galera 2.x or with other Galera 3.x configurations.

Configuration used for local cluster:

 

wsrep_provider_options = "gmcast.segment=1|2;socket.ssl_compression=no;gcache.keep_pages_size=2;gcache.page_size=125M;
gcache.size=5024M;evs.send_window=512;evs.user_send_window=512; evs.use_aggregate = true; gcs.fc_factor = 1; 
gcs.fc_limit = 32; gcs.max_packet_size = 32768;gcs.fc_debug = 0; evs.inactive_check_period = PT1S; 
evs.inactive_timeout = PT15S;evs.suspect_timeout = PT5S; evs.stats_report_period = PT1M;"

 

 

Configuration used for geographically distributed DR:

 

wsrep_provider_options = "gmcast.segment=1|2;socket.ssl_compression=no;gcache.keep_pages_size=4;gcache.page_size=64M;
gcache.size=5024M;evs.send_window=1024;evs.user_send_window=1024; evs.use_aggregate = true; gcs.fc_factor = 1; 
gcs.fc_limit = 260; gcs.max_packet_size = 2097152;gcs.fc_debug = 0; evs.inactive_check_period = PT30S;
 evs.inactive_timeout = PT1M;evs.suspect_timeout = PT1M; evs.stats_report_period = PT3M;"

 

 

 

Please let me know if in you will experience different results, and how/why. 

 

Great MySQL .... & Galera to all.

Ultimo aggiornamento Venerdì 21 Novembre 2014 20:34
 
Effective way to check the network connection performance, when using replication geographically distributed PDF Stampa E-mail
Scritto da Marco Tusa   
Domenica 23 Marzo 2014 00:11

Why this article

The more I have to interact with customers asking about MySQL/Galera, the most I have to answer over and over to the same question about what kind of network conditions Galera can manage efficiently.

One of the most frequent myths I have to cover at the start of any conversation that involve the network is PING.

Most of the customers use PING to validate the generic network conditions, and as direct consequence they apply that approach also when in need to have information on more complex and heavy use like in the Galera replication.

To have a better understanding why I consider the use of PING, not wrong but inefficient, let us review some basic networking concepts.

 

Frame

At the beginning stay the physical layer, but I am going to skip it otherwise article will be too long, what I want only to say is that unless you are able to afford a leased line to connect to your distributed sites, you are going to be subject to Packet switching, as such affected by: throughput, bandwidth, latency, congestion, and dropped packets issues.

Given the physical layer, and given we are talking about Ethernet connections, the basic transporter and the vector that encapsulates all the others is the Ethernet frame. A frame can have a dimension up to 1518 bytes and nothing less then 64.

A frame has a header compose by:

  • Preamble of 7 bytes,
  • Delimiter 1 bytes,
  • MAC address destination 6 bytes,
  • MAC address destination 6 bytes,
  • Optional fields (IEEE 802.1Q/ IEEE 802.1p), 4 bytes
  • Ethernet type or length (if > 1500 it represent the type; if < 1501 it represents the length of the Payload, 2 bytes
  • PayLoad, up to 1500 bytes
  • Frame CRC, 4 bytes
  • Inter-packet gab, this is the space that is added between to frames

framedimension

 

 

A frame can encapsulate many different protocols like:

  • IPv4
  • IPv6
  • ARP
  • AppleTalk
  • IPX
  • ... Many more

The maximum size available for the datagram to be transmitted is of 1500 bytes, also known as MTU. That is, the MTU or Maximum transmission unit is the dimension in bytes that a frame can transport from source to destination, we will see after that this is not guarantee and fragmentations can happen.

Only exception is when Jumbo Frames are supported, in that case a frame can support a payload with a size up to 9000 bytes. Jumbo frames can be quite bad for latency, especially when the transmission is done between data-centre geographically distributed.

 

 

IP (internet protocol)

For the sake of this article we will focus on the IPv4 only. The IPv4 (Internet Protocol) is base on the connectionless and best-effort packets delivery, this means that each frame is sent independently, the data is split in N IP datagram and sent out to the destination, no guarantee it will deliver or that the frames will arrive in the same order they are sent.

Each IP datagram has a header section and data section.

The IPv4 packet header consists of 14 fields, of which 13 are required.

The 14th field is optional (red background in table) and aptly named: options.

 

IPv4 Header Format

Offsets 

Octet 

0

1

2

3

Octet 

Bit 

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

0

0

Version

IHL

DSCP

ECN

Total Length

4

32

Identification

Flags

Fragment Offset

8

64

Time To Live

Protocol

Header Checksum

12

96

Source IP Address

16

128

Destination IP Address

20

160

Options (if IHL > 5)

   

 

 

An IP datagram is encapsulated in the frame…

ip_datagram

 

 

Then sent.

For performance the larger is the datagram (1500 MTU), the better, but this only in theory, because this rules works fine when a datagram is sent over a local network, or a network that can guarantee to keep the level of MTU to 1500.

What happen in real life is that a frame sent over the Internet has to go over many different networks and there is no guarantee that all of them will support the same MTU, actually the normal condition is that they don’t.

So when a frame pass trough a gateway, the gateway knows the MTU of the two links (in/out) and if they do not match, it will process the frame fragmenting it.

See below:

fragmentating_net

 

Assuming Host A is sending an IP datagram of 1400 bytes to Host B, when Gw1 will receive it, it will have to fragment it in to smaller peaces to allow the frame to be transported:

fragmented

 

 

Once a datagram is fragmented to match the largest transportable frame, it will be recompose only at destination. If one of the fragments of the datagram is lost, for any reason, the whole datagram got discarded and transmission fails.

 

 

ICMP

The IP specification imposes the implementation of a special protocol dedicated to the IP status check and diagnostics, the ICMP (Internet Control Message Protocol).

Any communication done by ICMP is embedded inside an IP datagram, and as such follow the same rules:

icmp_fragment

 

 

ICMP has many “tools” in his pocketknife, one if them is PING.

 

A ping is compose by a echo request datagram and an echo reply datagram:

00

01

02

03

04

05

06

07

08

09

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

Type = 8 (request) OR =0 (reply)

Code = 0

Header Checksum

Identifier

Sequence Number

Data

 

Request and reply have a similar definition, with the note that the replay MUST contain the full set of data sent by the echo request.

Ping by default sent a set of data of 56 bytes plus the bytes for the ICMP header, so 64 bytes.

Ping can be adjust and to behave in different way, including the data dimension and the don’t fragment bit (DF). The last one is quite important because this set the bit that will mark the datagram as not available for fragmentation.

As such if you set a large dimension for the data and set the DF, the datagram will be sent requesting to not be fragmented, in that case if a frame will reach a gateway that will require the fragmentation the whole datagram will be drop, and we will get an error.

 

 

I.e.:

ping -M do -s 1472 -c 3 192.168.0.34
PING 192.168.0.34 (192.168.0.34) 1472(1500) bytes of data.
1480 bytes from 192.168.0.34: icmp_req=1 ttl=128 time=0.909 ms
1480 bytes from 192.168.0.34: icmp_req=2 ttl=128 time=0.873 ms
1480 bytes from 192.168.0.34: icmp_req=3 ttl=128 time=1.59 ms

as you can see I CAN send it using my home wireless because is below 1500 MTU also if I ask to do not fragment, this because my wireless is using 1500MTU.

But if I just raise the dimension of ONE byte:

ping -M do -s 1473 -c 3 192.168.0.34
PING 192.168.0.34 (192.168.0.34) 1473(1501) bytes of data.
From 192.168.0.35 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 192.168.0.35 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 192.168.0.35 icmp_seq=1 Frag needed and DF set (mtu = 1500)

Removing the –M (such that fragmentation can take place)

root@tusacentral01:~# ping  -s 1473 -c 3 192.168.0.34
PING 192.168.0.34 (192.168.0.34) 1473(1501) bytes of data.
1481 bytes from 192.168.0.34: icmp_req=1 ttl=128 time=1.20 ms
1481 bytes from 192.168.0.34: icmp_req=2 ttl=128 time=1.02 ms
1481 bytes from 192.168.0.34: icmp_req=3 ttl=128 time=0.996 ms

 

PING is widely used to test connectivity between two network points, which is good, and from the example above, you may think that it is a good tool to test the network also in more complex situations, like when we need to be sure that the virtually synchronous galera replication will NOT be affected.

But you are wrong, the PING tool as part of the ICPM for his operations it is still using the basic IP protocol which is base on the best effort delivery of the datagram, while any data that is transmitted between the nodes in a MySQL/Galera cluster, use the TCP protocol, the two are bound (from TCP point of view), but the TCP implementation behaviour is quite different.

To understand that let us review at very high level (one billion of kilometres high) the TCP characteristics.

 

 

TCP

As mentioned previously the IP and related ICPM work on the base of the best effort delivery of the datagram, what this means in short is that whatever I sent using it is sent without guarantee it will reach destination, and without actually caring to send it if it fails, more no real check is perform between source and destination regarding the data transmission.

TCP means Transmission Control Protocol and as the name says, it is design to control the data transmission happening between source and destination.

TCP’s implementations offer a reliable, ordered, error-checked delivery of a stream of octets between source and destination.

The TCP is a transmission protocol and as such it works at the transport level of the OSI model:

OSI Model

Data unit

Layer

Function

 

Host
layers

Data

7. Application

Network process to application

 

6. Presentation

Data representation, encryption and decryption, convert machine dependent data to machine independent data

 

5. Session

Interhost communication, managing sessions between applications

 

Segments

4. Transport

Reliable delivery of packets between points on a network.

 

Media
layers

Packet/Datagram

3. Network

Addressing, routing and (not necessarily reliable) delivery of datagrams between points on a network.

 

Bit/Frame

2. Data link

A reliable direct point-to-point data connection.

 

Bit

1. Physical

A (not necessarily reliable) direct point-to-point data connection.

 

 

 

TCP implementations use the IP protocol encapsulation for the transmission of the data:

tcp_datagram_incapsulation

 

A TCP implementation has several characteristics that make sense to summarize:

  • Is stream oriented, that' it when two applications open a connection base on TCP, they will see it as a stream of bit that will be deliver to the destination application, exactly in the same order and consistency they had on the source.
  • Establish a connection, which means that the host1 and host2 before start to send data over, must perform a handshake operation, which will allow them to know each other state. Connection use a three way handshake:

tcp_con_open

Once it is establish the two parts will keep to monitor the data transfer, in case of crash interruption of the service both hosts will notify the application involved in the stream transfer.

  • TCP sees a connection
  • Buffered transmission, the applications communicating the data can choose whatever dimension to send, it could be a single byte a time. The transport layer will buffer the data in order to match the best datagram dimension for that specific moment, and will send it. The dimension of the datagram is adaptative which means it can be adjusted in relation to the network transfer condition.
  • Unstructured stream, meaning the stream does not have a predefine format/dimension for internal data, as such the applications must know how to read from the stream.
  • Full-duplex connection, data can be transmitted simultaneously from both side.
  • Segment TCP implementation see the stream as a sequence of octet (bytes), which it will split in segments, the segment dimension is normally as large as a datagram.

 

TCP header definition:

TCP Header

Offsets

Octet

0

1

2

3

Octet

Bit 

0 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 

21 

22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

0

0 

Source port

Destination port

4

32 

Sequence number

8

64 

Acknowledgment number (if ACK set)

12

96 

Data offset

Reserved
0 0 0

N
S

C
W
R

E
C
E

U
R
G

A
C
K

P
S
H

R
S
T

S
Y
N

F
I
N

Window Size

16

128 

Checksum

Urgent pointer (if URG set)

20
...

160
... 

Options (if data offset > 5. Padded at the end with "0" bytes if necessary.)
...

                                                                   

Source port (16 bits)

 

 

As said TCP implementations are reliable and can re-transmit missed packets, let see how it works:

tcp_pcg_sent_simpletcp_pcg_sent_error

 

This mechanism is in place for each TCP transmission, and ensures the correct send/receive of the segment.

It is easy to understand how also if effective, having this mechanism working for a single segment at time is far to be efficient.  This because considering a set of packages from 1 to 100, before sending package 2, the dispatch of package 1 must be acknowledge.

To solve this issue, the TCP use the concept of dynamic slide window. That’s it the TCP implementations manage to dispatch multiple packages at the same time:

sliding_window_initial

 

 

A sliding window in TCP implementations works at octet level not by segment or by package, each stream’s octet has a sequential number, the dispatcher manages three pointers associated to each connection.

The first pointer indicate the start of the sliding window, splitting the octet that had be dispatched & acknowledge from the ones that are in progress or still to dispatch.

The second pointer indicates the higher octet that can be dispatch before getting the acknowledgments for the already dispatched octet.

The third pointer indicates the window limit after which no dispatch can be done, until further slide.

 

sliding_window_pointers

 

 

In TCP implementation the slide window has, as previously mentioned, a dynamic dimension, each acknowledgement received is also containing a declaration of how many octet the receiver can accept.

The whole discussion now will become much more complex and not pertinent to the scope of this article, but is worth to mention that, it is here that the whole concept of flow/congestion control will take place.

For the sake of this article and simplifying a lot let us say that when there are optimal conditions in a local lan the dimension of a segment will coincide with the maximum MTU available.

While when sending traffic over the internet, the TCP implementation will have to manage not only the optimal initial dimension, but possible and probable issue while the transmission will take place, reducing and enlarging the sliding window as needed, and in some cases put it on hold.

If you are really interested on how congestion works in the TCP/IP world I suggest this book (http://www.amazon.com/Adaptive-Congestion-Networks-Automation-Engineering-ebook/dp/B008NF05DQ)

 

 

Back to the topic

So we have seen how PING works, and the fact it send a simple echo request, using a very basic datagram, with no dispatch or connection error handling.

On the other hand we have see how complex TCP implementation is, and how important is in the TCP implementation the concept of adaptative transmission (that will also seriously affected by available bandwidth) and congestion control.

Keeping that in mind, there is no way we can compare a TCP/IP data transmission between to points with a simple ICMP/IP echo, no matter how large we will make the datagram, or if we declare it not to fragment, or whatever a Ping is executing and following a different protocol, and should never be used to validate the data transmission in a TCP/IP scenario.

Back to square 1, how I can evaluate my connection between two geographically distributed sites?  In detail how I can be sure the transfer will be efficient enough to support the Galera replication?

When implementing Galera, some additional status metrics are added in MySQL, two of them are quite relevant to the purpose of this article:

  • Wsrep replicated bytes
  • Wsrep received bytes

If you are smart, and I am sure you are, you already have some historical monitor in place, and before even thinking to go for geographical distributed replication you have tested and implement MySQL/Galera using local cluster.

As such you will have collected data about the two metrics above, and will be easy to identify the MAX value, be careful here do not use the average.

Why? Because the whole exercise is to see if the network can efficiently manage the traffic of a (virtually) synchronous connection, without having the Galera flow control to take place.

As such let us say we have 150Kb/s as replicated bytes, and 300k as received, and see what will happen if we use PING and another tool more TCP oriented.

In the following scenario I am going to first check if the two locations are connected using a high-speed link that will allow 1500 MTU, then I will check the connection state/latency using the max dimension of 300K.

 

 Test 1

[root@Machine1~]# ping -M  do -s 1432 -c 3 10.5.31.10
PING 10.5.31.10 (10.5.31.10) 1432(1460) bytes of data.
From 192.168.10.30 icmp_seq=1 Frag needed and DF set (mtu = 1398)
From 192.168.10.30 icmp_seq=2 Frag needed and DF set (mtu = 1398)
From 192.168.10.30 icmp_seq=2 Frag needed and DF set (mtu = 1398)

 

 Test 2

[root@Machine1~]# ping -M  do -s 1371 -c 3 10.5.31.10
PING 10.5.31.10 (10.5.31.10) 1371(1399) bytes of data.
From 192.168.10.30 icmp_seq=1 Frag needed and DF set (mtu = 1398)
From 192.168.10.30 icmp_seq=1 Frag needed and DF set (mtu = 1398)
From 192.168.10.30 icmp_seq=1 Frag needed and DF set (mtu = 1398)

Test 3
[root@Machine1~]# ping -M  do -s 1370 -c 3 10.5.31.10
PING 10.5.31.10 (10.5.31.10) 1370(1398) bytes of data.
1378 bytes from 10.5.31.10: icmp_seq=1 ttl=63 time=50.4 ms
1378 bytes from 10.5.31.10: icmp_seq=2 ttl=63 time=47.5 ms
1378 bytes from 10.5.31.10: icmp_seq=3 ttl=63 time=48.8 ms

 

As you can see the link is not so bad, it can support 1370 MTU, so in theory we should be able to have a decent connection.

Let see... what happens with PING

 

[root@Machine1~]# ping   -c 3 10.5.31.10
PING 10.5.31.10 (10.5.31.10) 56(84) bytes of data.
64 bytes from 10.5.31.10: icmp_seq=1 ttl=63 time=49.6 ms
64 bytes from 10.5.31.10: icmp_seq=2 ttl=63 time=46.1 ms
64 bytes from 10.5.31.10: icmp_seq=3 ttl=63 time=49.7 ms
--- 10.5.31.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2052ms
rtt min/avg/max/mdev = 46.189/48.523/49.733/1.650 ms

[root@Machine1~]# ping -M  do -s 300 -c 3 10.5.31.10
PING 10.5.31.10 (10.5.31.10) 300(328) bytes of data.
308 bytes from 10.5.31.10: icmp_seq=1 ttl=63 time=50.5 ms
308 bytes from 10.5.31.10: icmp_seq=2 ttl=63 time=48.5 ms
308 bytes from 10.5.31.10: icmp_seq=3 ttl=63 time=49.6 ms
--- 10.5.31.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2053ms
rtt min/avg/max/mdev = 48.509/49.600/50.598/0.855 ms

Performing the tests seems that we have more ore less 49-50 ms latency, which is not really great, but manageable, we can play a little with flow-control in Galera and have the sites communicating, with that I am NOT recommending to set a Galera cluster with 50ms latency, what I am saying is that if desperate something can be done, period.

But wait a minute let us do the other test, and this time let us use a tool that is design for checking the real condition of a network connection using TCP/IP.

For that I normally use NetPerf (there are other tools, but I like this one).

About NetPerf:

"Netperf is a benchmark that can be used to measure the performance of many different types of networking. It provides tests for both unidirectional throughput, and end-to-end latency. The environments currently measureable by netperf include:

TCP and UDP via BSD Sockets for both IPv4 and IPv6

DLPI

Unix Domain Sockets

SCTP for both IPv4 and IPv6"

(http://www.netperf.org/netperf/)

 

NetPerf use a two-point connection instantiating a server demon on one machine, and using an application that simulate the client-server scenario.

I strongly suggest you to read the documentation, to better understand what it can do, how and the results.

 

Done? Ready let us go...

Given NetPerf allow me to define what is the dimension I need to send, and what I will receive, this time I can set the test properly, and ask what will be the real effort:

[root@Machine1~]# netperf -H 10.5.31.10 -t TCP_RR -v 2 -- -b 6 -r 156K,300k

MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.5.31.10 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec
65536  87380  159744   300000  10.00      30.20
65536  87380
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
8         0          0        0       231800.259 30.198   38.592     72.476

 

The tool will use a TCP round trip, as such it will simulate a real TCP connection/stream/sliding-window/traffic-congestion-control.

The result is 231800.259 microseconds for latency; also other metrics are interesting but for consistency with ping let us stay on latency.

 

So we have 231800.259 microseconds that mean 231.8 millisecond, real latency.

 

This result is quite different from the PING results, reason for that is as said a totally different transport mechanism, which will imply a different cost.

Considering the number of 231.8 ms a site-to-site synchronous replication is out of the question.

I can then focus and spend my time to explore different and more appropriate solutions.

 

 

Conclusion(s)

The conclusion(s) is quite simple, whenever you need to validate your network connection PING is fine.

Whenever you need to measure the capacity of your network, then PING is just dangerous, and you should really avoid using it.

Good MySQL to all.

Ultimo aggiornamento Domenica 23 Marzo 2014 13:27
 
Server ID misconfiguration can affect you more then you think. PDF Stampa E-mail
Scritto da Marco Tusa   
Venerdì 01 Novembre 2013 10:23

Sub title: When you do stupid things,  you get stupid things back, but sometime is good to discover odd behaviour.

 

The other day, I was having fun playing with the latest toy, mysql-cluster-gpl-7.3.2.

 

I was setting up a new cluster set, of 6 data nodes and 3 Mysql node, to do so I was using a simple bash file that helps me to generate what I need on my environment.

During that I had a very stupid idea, to have the script to recognize the machines IPs and PORT I have set, then use them as SERVER IDs.

 

I know I know is not a good way to define SERVER ID, anyhow I did it, so the result was that in my my.cnf my server_id looks like:

server-id = 1921680355510

 

Some of you already see what kind of stupid thing this is, but let me not anticipate it.

 

So as usual I start the mysql node first to generate a brand new database, to do so I have first disable the ndbcluster and all went fine.

 

Then I start my management nodes and my data nodes.

 

I was quite happy and I finally attach my MySQL node to the rest of the cluster.

But Nodes failed to starts, I was quite surprise so I open the log and is saw:

 

2013-10-28 10:20:58 2692 [Note] NDB[3]: NodeID: 77, all storage nodes connected
2013-10-28 10:20:58 2692 [ERROR] NDB: server id provided is too large to be represented in opt_server_id_bits or is reserved
mysqld: /usr/src/mysql-cluster-gpl-7.3.2/sql/ndb_component.cc:139: virtual int Ndb_component::deinit(): Assertion 'm_thread_state == TS_STOPPED' failed.
14:20:58 UTC - mysqld got signal 6 ;
 

 

 

What??? My fantastic server_id of 13!!!!! digit is too long?

 

Hey wait a minute?! how many digit??

 

At that point I just hit the keyboard with my head.. realizing what I have done.

headdesk

Server id is an unsigned integer, his max value is 4294967295 with 10 digit.

Once more my friend NDB was right and I was trying to do a stupid thing.

 

But when I realize it, I also wonder, why when activating ndbcluster I get the error and when I have MySQL running without it NO?

 

This kind of checking should be done at common level at MySQL start-up, so why I get it only with MySQL and NDB active?

I dig a little bit and I found the place where NDB is doing the check.

The if condition where the check is executed is quite simple and clear.

 

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
....
sql::ha_ndbcluster_binlog.cc
 
int ndbcluster_binlog_start()
{
  DBUG_ENTER("ndbcluster_binlog_start");
 
  if (::server_id == 0)
  {
    sql_print_warning("NDB: server id set to zero - changes logged to "
                      "bin log with server id zero will be logged with "
                      "another server id by slave mysqlds");
  }
 
  /* 
     Check that ServerId is not using the reserved bit or bits reserved
     for application use
  */
  if ((::server_id & 0x1 << 31) ||                             // Reserved bit <--------------------- Check
      !ndbcluster_anyvalue_is_serverid_in_range(::server_id))  // server_id_bits
  {
    sql_print_error("NDB: server id provided is too large to be represented in "
                    "opt_server_id_bits or is reserved");
    DBUG_RETURN(-1);
  }

 

 

But again why the error is not reported in MySQL core?

 

Because MySQL is silently converting my server id to his maximum value, as such my stupid server id of 1921680355510 become 4294967295.

 

Oh my...!

An identifier is managed as a quantity!

 

I have done a very stupid thing setting the server_id to , but honestly I don't like this as well.

 

As we know MySQL silently adjust his behaviour in relation to the SQL_MODE, but the way it adjust the parameters should be something different.

I agree that variables like max_connections CAN be adjusted, but others like server_id, well no, I think is wrong given those are identifier(s) and not quantity.

 

Server ID is an ID, as such it must be validated and if not matching the validation an ERROR must be return and the server should not start at all.

 

Instead MySQL currently is just raising a WARNING and continue to run, not caring if there will be another server with the same id because this "adjustment".

 

131028 14:00:58 mysqld_safe Starting mysqld daemon with databases from /home/mysql/instances/csite3/data
2013-10-28 14:00:58 0 [Warning] option 'server_id': unsigned value 192168075510 adjusted to 4294967295 <----------------------------------------------

I was really "not happy" about this, so I went into the code and check where the server ID is evaluated and modified.
So this is the action calls:

 

mysqld::mysqld_main
	=>mysqld::init_common_variables
		=> my_getopt::init_variables(longopts, init_one_value);
			=> my_getopt::init_one_value
				=> my_getopt::getopt_ll_limit_value

 

So what exactly is happening here?

 

MySQL initialize reading the options in my_getopt and on the base of the data type it pass it to getopt_ll_limit_value.

There whatever numeric argument  is pass, if it exceed the max value is reset to his Max value. At this stage there is no logic to manage values like Server ID, that is an identifier and not quantity.

This is obviously something that can be managed better, and actually the fact that NDB prevent the server to go ahead is by far more correct, this because Server Id is it used internally to recognize the server.

 

I have done a quick check, and I have see that at least the following functions are using the server_id variable, in the new replication mechanism.

 

rpl_master.cc
int register_slave(THD* thd, uchar* packet, uint packet_length)
void unregister_slave(THD* thd, bool only_mine, bool need_lock_slave_list)
bool show_slave_hosts(THD* thd)
static int fake_rotate_event(NET* net, String* packet, char* log_file_name,ulonglong position, const char** errmsg,uint8 checksum_alg_arg)
static int send_heartbeat_event(NET* net, String* packet,const struct event_coordinates *coord,uint8 checksum_alg_arg)
bool com_binlog_dump(THD *thd, char *packet, uint packet_length)
bool com_binlog_dump_gtid(THD *thd, char *packet, uint packet_length)
void mysql_binlog_send(THD* thd, char* log_ident, my_off_t pos,const Gtid_set* slave_gtid_executed)

 

I have not investigate more, to see what will happen if there are two different servers having the same ID, I have not done it, because in any case it is conceptually wrong.

 

Just for fun I have create an horrible patch to manage the mismatch, adding a control on the parameter name.

 

Obviously that should be something different like an attribute of the my_option object that indicate if the option is a quantity an identifier or whatever else, then on that base decide what to do.

Any how also a stupid and horrible things like the check on the name is better then not having it at all:

 

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
ulonglong getopt_ull_limit_value(ulonglong num, const struct my_option *optp,
                                 my_bool *fix)
{
  my_bool adjusted= FALSE;
  ulonglong old= num;
  char buf1[255], buf2[255];
  const ulonglong max_of_type=
    max_of_int_range(optp->var_type & GET_TYPE_MASK);
 
	char string1[255], string2[255];
	int result;
	string1=optp->name;
	string2="server_id";
	scanf("%s",string1[10]);
	scanf("%s",string2[10]);
	int chlimit = 255;
	result = strncmp(string1,string2,chlimit);
 
 
 
  if ((ulonglong) num > (ulonglong) optp->max_value &&
      optp->max_value) /* if max value is not set -> no upper limit */
  {
    num= (ulonglong) optp->max_value;
    adjusted= TRUE;
    if(result == 0){
           my_getopt_error_reporter(ERROR_LEVEL,
           "option '%s': unsigned value %s CANNOT BE adjusted to %s please review your settings, and start MySQL again",
           optp->name, llstr(old, buf1), llstr(num, buf2));
           exit(1);
 
    }
  }
  if (num > max_of_type)
  {
    num= max_of_type;
    adjusted= TRUE;
  }
 

 

Running the server with the fix, I will finally had a consisten behaviour, and my favorite MySQL server doing the right thing:

 

131029 13:38:03 mysqld_safe mysqld from pid file /home/mysql/instances/csite3/mysql.pid ended
131029 13:41:11 mysqld_safe Starting mysqld daemon with databases from /home/mysql/instances/csite3/data
2013-10-29 13:41:11 0 [ERROR] 'option server_id' unsigned value 192168075510 CANNOT BE adjusted to 4294967295 please review your settings, and start MySQL again
131029 13:41:12 mysqld_safe mysqld from pid file /home/mysql/instances/csite3/mysql.pid ended

 

 

 

 

Conclusion

From a stupid action it could come out something interesting, at least for me.

I would like to classify this as bug, but given it is a conceptual thing may be is more a what? Feature request? Logic fix?

 

Finally my conclusion is that MySQL is a great product, but is growing fast and sometimes things are left behind, this is a small but possible dangerous one.

Ultimo aggiornamento Venerdì 01 Novembre 2013 11:11
 
Understand what happens in MySQL when using UTF String with Latin1 encoding... PDF Stampa E-mail
Scritto da Marco Tusa   
Domenica 15 Settembre 2013 21:40

... And how to go out from it.

 

Why this article?

The scope of the article is to light up what happens behind the scene when an application push UTF data on a MySQL server using Latin1 encoding.

I will show how the characters are store, what exactly happen if you mix UTF8 and latin1, what is transformed to what and why.

Finally I will show the safer way to use to convert a table or simply a table field, without exporting the full dataset.

 

I will also show what is a good way to do it when using master-slave scenario, with minimal impact on the live dataset.

What I am not going to discuss here, and I assume you already have familiarity with is:

  • How to assign a character set or a collation to a table or fields;
  • How collation works in in sorting;
  • How to set the default character set in server and connecting clients.

If any of this point is not clear to you, please read first here (http://dev.mysql.com/doc/refman/5.6/en/charset.html)

For my exercise I will use two tables, two MySQL server instances, and Arabic text.

 

Please note that I have copy this text from Official UN nation page, as such I hope they do not contain any offensive text.

 

Step one check the status of our severs:

Master

(root@localhost) [(none)]>\s
--------------
/opt/mysql_templates/mysql-55o/bin/mysql  Ver 14.14 Distrib 5.5.27, for linux2.6 (x86_64) using readline 5.1
Connection id:		2
Current database:
Current user:		root@localhost
SSL:			Not in use
Current pager:		stdout
Using outfile:		''
Using delimiter:	;
Server version:		5.5.27-log MySQL Community Server (GPL)
Protocol version:	10
Connection:		Localhost via UNIX socket
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	latin1
Conn.  characterset:	latin1
UNIX socket:		/opt/mysql_instances/instances/derekutfm/mysql.sock
Uptime:			2 min 34 sec
Threads: 3  Questions: 16  Slow queries: 0  Opens: 34  Flush tables: 1  Open tables: 26  Queries per second avg: 0.103

Slave

------------

(root@localhost) [(none)]>\s
--------------
/opt/mysql_templates/mysql-55o/bin/mysql  Ver 14.14 Distrib 5.5.27, for linux2.6 (x86_64) using readline 5.1
Connection id:		4
Current database:
Current user:		root@localhost
SSL:			Not in use
Current pager:		stdout
Using outfile:		''
Using delimiter:	;
Server version:		5.5.27-log MySQL Community Server (GPL)
Protocol version:	10
Connection:		Localhost via UNIX socket
Server characterset:	utf8
Db     characterset:	utf8
Client characterset:	latin1
Conn.  characterset:	latin1
UNIX socket:		/opt/mysql_instances/instances/derekutfs/mysql.sock
Uptime:			10 min 43 sec
Threads: 2  Questions: 8  Slow queries: 0  Opens: 34  Flush tables: 1  Open tables: 26  Queries per second avg: 0.012


As you can see, I have define the MASTER to use latin1 as default also for the server, while for the SLAVE I choose to use UTF8.

In this exercise I am going to use only UTF8 for the examples never UTF8mb4.

 

First create the table.

CREATE TABLE `utf_test` (
  `ID` integer AUTO_INCREMENT PRIMARY KEY,
  `notutf` varchar(250) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL,
  `yesutf` varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `binutf` varbinary(250) DEFAULT NULL,
  `different` varchar(250) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
 

 

Now let start insert data again and see what happens:

insert into utf_test values(null,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا',

' this is a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', 'Am I the same');

 

 

 Master
 -------------
(root@localhost) [test]>insert INTO utf_test VALUES(NULL,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', ' this is a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', 'Am I the same');
*************************** 1. row ***************************
       ID: 1
   notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8
   yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
   binutf:  this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
different: Am I the same
1 row IN SET (0.00 sec)
 
 
Slave
---------------
(root@localhost) [test]>set names latin1;
Query OK, 0 rows affected (0.00 sec)
(root@localhost) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8
   yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
   binutf:  this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
different: Am I the same
1 row IN SET (0.00 sec)
 

 

 

 

All good given the value are passed consistently

Now let add some text to the "different" fields and see IF we can still handle it correctly:

 

(root@localhost) [test]>insert INTO utf_test VALUES(NULL,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', ' this is a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', 'Am I the same? لصدأ القمح إلى كينيا');
Master
--------------
(root@localhost) [test]>select * FROM utf_test\G
       ID: 3
   notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8
   yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
   binutf:  this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
different: Am I the same? لصدأ القمح إلى كينيا
2 rows IN SET (0.00 sec)
 
Slave
----------------------------
       ID: 3
   notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8
   yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
   binutf:  this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
different: Am I the same? لصدأ القمح إلى كينيا
 

 

 

 

Ok once more all good.
Is this real UTF8 or as we have seen sometimes fake UTF thanks to Latin1 push?
let see inserting another row with UTF field "yesutf" and "different" containing the same apparent data.

insert into utf_test values(null,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'ح إلى كينيا', 'ح إلى كينيا', 'ح إلى كينيا');

 

 

Master
-----------
(root@localhost) [test]>insert INTO utf_test VALUES(NULL,'simple insert same text everywhere', 'ح إلى كينيا', 'ح إلى كينيا', 'ح إلى كينيا');
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec)
 
Slave
------------------
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec)
 

 

 


Checking the hex:

 

Master
-----
(root@localhost) [test]>select hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
 
Slave 
-------------
(root@localhost) [test]>select hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

As expected the hex code is different between the fields define as UTF8 and the one as Latin1,and I see them correctly because... ?

Look what happens if I change the Names to UTF8...:

 

(root@localhost) [test]>set names utf8;
 
(root@localhost) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
1 row IN SET (0.00 sec)
 

 

 

Now only the binutf (varbinary) is shown correctly.

So just to recap what happen if I use latin1....
Using latin1

I see all fine

 

(root@localhost) [test]>set names latin1;
(root@localhost) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
1 row IN SET (0.00 sec)
 

 

 

The hex values are not matching as:

 

(root@localhost) [test]>select hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

If I do a reverse conversion, also using UNHEX

 

(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,unhex(hex(yesutf)) AS yesutf_hex,unhex(hex(binutf)) AS binutf_hex,unhex(hex(different)) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: ح إلى كينيا
   binutf_hex: ح إلى كينيا
different_hex: ح إلى كينيا
1 row IN SET (0.00 sec)
 

 

 

 

Again I see something different, is it confusing?

No.

 

It is quite clear that varbinary and varchar store the string using the same hexadecimal values, while the yesutf is different because it was expecting a UTF string.

 

Let us analyze JUST the first character coming from the arabic string: ح

This letter correspond to:

Unicode code point U+062D

character: ح

UTF8 characters: Ø

UTF8 hex: d8 ad

Meaning: ARABIC LETTER HAH

 

What happen is that I am inserting the correct codes but the UTF8 field read it as LATIN1 char per byte, as such it transform the value to Ø.

The value of that character in the UTF-8 encoding table and Unicode characters is:

Unicode code point U+00D8

character: Ø

UTF8 characters: Ø

UTF8 hex: c8 98

Meaning: LATIN CAPITAL LETTER O WITH STROKE

 

Comparing the values above with the HEX value we had before, you can see that they match perfectly

yesutf_hex: C398

binutf_hex: D8AD

different_hex: D8AD

 

So far I hope everything is clear.

 

The fields using LATIN1 and UT8 are correctly shown when using NAMES latin1 because the latin1 interpretation of the UTF8 hex values in the case of the fields with latin1 encoding.

 

While in the case of fields with UTF8 encoding, the server perform a double conversion of the codes, conversion that IS NOT executed when doing UNHEX, and in that case what it is shown is the REAL value of the stored codes.

 

Again the only SAFE one during these operation is the varbinary which store the code not bounded to any encoding.

 

So for instance if you have all your database using LATIN1 and you want to convert to UTF8 you cannot just do :

(root@localhost) [test]>alter table utf_corrupt modify different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;

 

Because your text in the tables will be converted and become unusable if using UTF8 clients, see below:

Encoding set as LATIN1:

Client characterset: latin1

Conn. characterset: latin1

I am going to do the conversion:

 

(root@localhost) [test]>alter TABLE utf_corrupt MODIFY different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec)
 
SET NAMES UTF8;
 
	Client characterset:	utf8
	Conn.  characterset:	utf8
 
 
(root@localhost) [test]>select * FROM utf_corrupt\G
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
 

 

The text is a mess again, this because the text is transformed:

 

(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 <---------
1 row IN SET (0.00 sec)
 

 

 

 

Starting from the beginning so the table is again:

 

(root@localhost) [test]>show CREATE TABLE utf_corrupt\G
*************************** 1. row ***************************
       TABLE: utf_corrupt
CREATE TABLE: CREATE TABLE `utf_corrupt` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `notutf` varchar(250) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL,
  `yesutf` varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `binutf` varbinary(250) DEFAULT NULL,
  `different` varbinary(250) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=latin1
1 row IN SET (0.00 sec)
 

 

 

 

data is:

 

*************************** 3. row ***************************
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec) 

 

 

(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

 

And encoding: Client characterset: latin1 Conn.  characterset: latin1

 

To do properly you have to do a 3 steps conversion, mainly varchar -> varbinary -> varchar.As follow:

 

(root@localhost) [test]>alter TABLE utf_corrupt MODIFY different varbinary(250) DEFAULT NULL;
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0
 
(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 <-------------- 
1 row IN SET (0.00 sec)
 

 

 

Now all is fine and the string was not transformed.Now change the encoding with NAMES:

Set NAMES utf8;

Convert back to varchar and UTF ...

 

(root@localhost) [test]>alter TABLE utf_corrupt MODIFY different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;
Query OK, 3 rows affected (0.02 sec)
Records: 3  Duplicates: 0  Warnings: 0
 

 

 

And now yes you have the text stored correctly.

 

       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec)
 
(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

All good! My data is correctly moved from a fields base on varchar LATIN1 to the new varchar UTF8.The most important aspect is to ensure to use the right encoding and to do not allow the server to convert you text to wrong hex values.

In case of significant migration if you have a master slave setup,it is possible to play with them, to minimize the impact of the migration, following these steps:

  1. all latin1 
  2. convert slave to varbinary
  3. insert data from master to slave keeping all as latin1
  4. swap the server
  5. convert ex-master to varbinary 
  6. when done change all application connection to utf8 master_host
  7. convert tables back to varchar utf8 on slave 
  8. swap master slave again
  9. convert slave table to UTF8

 

It seems cumbersome, but if you cannot afford to export the data change the table encoding and reload the data, and you cannot stop the write for long time, this is the safer way.The negative aspect of it is ... that varbinary use a different approach for sorting, it is based on the value of the binaries stored and you cannot use collation to modify it.

Let's go through the process:

 

Step 1) Check that all is latin1

 

Master
------------
(root@127.0.0.1:5510) [test]>\s
--------------
/opt/mysql_templates/mysql-55o/bin/mysql  Ver 14.14 Distrib 5.5.27, FOR linux2.6 (x86_64) USING readline 5.1
 
Connection id:		11
Current DATABASE:	test
Current user:		root@localhost
SSL:			NOT IN USE
Current pager:		stdout
USING OUTFILE:		''
USING delimiter:	;
Server version:		5.5.27-log MySQL Community Server (GPL)
Protocol version:	10
Connection:		Localhost via UNIX socket
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	latin1
Conn.  characterset:	latin1
UNIX socket:		/opt/mysql_instances/instances/derekutfm/mysql.sock
Uptime:			2 days 3 hours 3 min 49 sec
 
Threads: 3  Questions: 198  Slow queries: 0  Opens: 89  FLUSH TABLES: 1  Open TABLES: 27  Queries per second avg: 0.001
--------------
 
Slave
---------------------
(root@127.0.0.1:5511) [test]>\s
--------------
/opt/mysql_templates/mysql-55o/bin/mysql  Ver 14.14 Distrib 5.5.27, FOR linux2.6 (x86_64) USING readline 5.1
 
Connection id:		9
Current DATABASE:	test
Current user:		root@localhost
SSL:			NOT IN USE
Current pager:		stdout
USING OUTFILE:		''
USING delimiter:	;
Server version:		5.5.27-log MySQL Community Server (GPL)
Protocol version:	10
Connection:		Localhost via UNIX socket
Server characterset:	utf8
Db     characterset:	utf8
Client characterset:	latin1
Conn.  characterset:	latin1
UNIX socket:		/opt/mysql_instances/instances/derekutfs/mysql.sock
Uptime:			2 days 3 hours 3 min 45 sec
 
Threads: 2  Questions: 177  Slow queries: 0  Opens: 87  FLUSH TABLES: 1  Open TABLES: 27  Queries per second avg: 0.000
 

 

 

Insert some data on the master ... just to have a couples of rows.

 

(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 1', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 2', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 

 

 

Check the values:

 

Master
----------------------
(root@127.0.0.1:5510) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 2. row ***************************
       ID: 3
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
2 rows IN SET (0.00 sec)
(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 
  
Slave
-------------------------
(root@127.0.0.1:5511) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 2. row ***************************
       ID: 3
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
2 rows IN SET (0.00 sec)
(root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

Step 2) Convert to varbinary the slave, keep same encoding (NAMES latin1)

 

 

(root@127.0.0.1:5511) [test]>alter TABLE utf_test MODIFY different varbinary(250);
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0
 
(root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 <------------------
1 row IN SET (0.00 sec)
 

 

 

Code remain consistent.

Step 3) Production continue to work as usual, no impact.

Insert some other records on the master:

 

(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 2', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 2', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 2. row ***************************
       ID: 3
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 3. row ***************************
       ID: 5
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 4. row ***************************
       ID: 7
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
4 rows IN SET (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test\G
*************************** 1. row ***************************
           id: 1
    noutf_hex: 53544550313A20696E7365727420646174612031
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 2. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 3. row ***************************
           id: 5
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 4. row ***************************
           id: 7
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
4 rows IN SET (0.00 sec)
 
ON The Slave 
----------------------
(root@127.0.0.1:5511) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 2. row ***************************
       ID: 3
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 3. row ***************************
       ID: 5
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 4. row ***************************
       ID: 7
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
4 rows IN SET (0.00 sec)
 
(root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test\G
*************************** 1. row ***************************
           id: 1
    noutf_hex: 53544550313A20696E7365727420646174612031
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 2. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 3. row ***************************
           id: 5
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 4. row ***************************
           id: 7
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
4 rows IN SET (0.00 sec)
 

 

 

Step 4) swap the server

5511 master and 5510 slave (but stop the replication from 5511->5510 and stop also the previous replication from 5510 to 511), production will continue to work in write, order by will be affected for the moment.

 

On new Master 5511:

 

*************************** 6. row ***************************
       ID: 11 <----------------------------------------------
   notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
6 rows IN SET (0.00 sec)
 
ON new Slave 5510:
*************************** 4. row ***************************
       ID: 7 <--------------------------------
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
4 rows IN SET (0.00 sec)
 

 

Step 5) modify to varbinary the tables on the new Slave:

 

(root@127.0.0.1:5510) [test]>alter TABLE utf_test MODIFY different varbinary(250);
Query OK, 4 rows affected (0.02 sec)
Records: 4  Duplicates: 0  Warnings: 0
 

 

 

 

And now restart replication on 5510, so it can replicate from the master 5511.

 

(root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 11
   notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
 
1 row IN SET (0.00 sec)(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 11
    noutf_hex: 53544550343A20696E73657274696E6720646174612066726F6D2045582D534C4156452061667465722073776170207265706C69636174696F6E20697320646F776E20
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

 

All good.

Step 6) change all Application connections to UTF8,

in our example I will change the NAMES on new Master 5511 and on slave 5510

SET NAMES UTF8;

Server characterset: utf8

Db     characterset: utf8

Client characterset: utf8

Conn.  characterset: utf8

 

Master
--------------------------
(root@127.0.0.1:5511) [test]>set names utf8;
Query OK, 0 rows affected (0.00 sec)
 
(root@127.0.0.1:5511) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
...
*************************** 6. row ***************************
       ID: 11
   notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
6 rows IN SET (0.00 sec)
 

 

 

 

Note how the values in the yesutf was NOT correctly converted given the initial double translation.

 

Slave
---------------
Server characterset:	utf8
Db     characterset:	utf8
Client characterset:	utf8
Conn.  characterset:	utf8
 
(root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 11
   notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
1 row IN SET (0.00 sec)
 

 

 

 

Step 7) Convert table on SLAVE back to varchar but using utf8 now

 

(root@127.0.0.1:5510) [test]>alter TABLE utf_test MODIFY different varchar(250)  CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;
Query OK, 7 rows affected (0.01 sec)
Records: 7  Duplicates: 0  Warnings: 0
 
(root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 15
   notutf: STEP7: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
1 row IN SET (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 15
    noutf_hex: 53544550343A20696E73657274696E6720646174612066726F6D2045582D534C4156452061667465722073776170207265706C69636174696F6E20697320646F776E20
   yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 

 

 

 

Step 8) given all good put now the 5510 back as master,

given NOW the 5510 has the fields set as varchar and encoding/collation using UTF also the order by will work fine again.

At this point data will continue to be pass from master to slave.

 

ON Master
---------------------
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP8: inserting data from master 5510 after swap server again ', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 17
   notutf: STEP8: inserting DATA FROM master 5510 after swap server again 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
1 row IN SET (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 17
    noutf_hex: 53544550383A20696E73657274696E6720646174612066726F6D206D6173746572203535313020616674657220737761702073657276657220616761696E20
   yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 
ON Slave
-------------------------
*************************** 8. row ***************************
       ID: 17
   notutf: STEP8: inserting DATA FROM master 5510 after swap server again 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
8 rows IN SET (0.00 sec)
 
*************************** 8. row ***************************
           id: 17
    noutf_hex: 53544550383A20696E73657274696E6720646174612066726F6D206D6173746572203535313020616674657220737761702073657276657220616761696E20
   yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
 

 

 

 

Step 9) Finally convert all tables also on slave to varchar UTF8:

 

 

Master
---------------
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP9: final Slave conversions', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
 
Slave
---------------
(root@127.0.0.1:5511) [test]>alter TABLE utf_test MODIFY different varchar(250)  CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;
*************************** 9. row ***************************
       ID: 19
   notutf: STEP9: final Slave conversions
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
9 rows IN SET (0.00 sec)
(root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test\G
*************************** 1. row ***************************
           id: 1
    noutf_hex: 53544550313A20696E7365727420646174612031
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
...
           id: 19
    noutf_hex: 53544550393A2066696E616C20536C61766520636F6E76657273696F6E73
   yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
9 rows IN SET (0.00 sec)
 

 

 

 

 

 

Done.
When in doubt about the character set you are using or the collation always check them.
good commands from application:
SELECT USER(), CHARSET(USER()), COLLATION(USER())\G(root@127.0.0.1:5511) [test]>SELECT USER(), CHARSET(USER()), COLLATION(USER())\G
*************************** 1. row ***************************
           USER(): root@127.0.0.1
  CHARSET(USER()): utf8
COLLATION(USER()): utf8_general_ci
1 row IN SET (0.00 sec)
 
AND
(root@127.0.0.1:5511) [test]>show VARIABLES LIKE '%server%'\G
*************************** 1. row ***************************
Variable_name: character_set_server
        Value: utf8
*************************** 2. row ***************************
Variable_name: collation_server
        Value: utf8_unicode_ci
*************************** 3. row ***************************
Variable_name: server_id
        Value: 5511
3 rows IN SET (0.00 sec)
 

 

 

Conclusion

You must put a lot of attention on how you connect to the server, in relation to the encoding you are going to use.
It is advisable to set it explicitly, to be really sure and to check that the server/application are getting it right.
I cannot find a good reasons for which you should have a mix of UTF8 and Latin1 in the same table as for the example I have use, but you never know.
What I mean is that it is normal to have Fields using Latin1 and be define as Latin1 for the encoding, but mixing the UTF8 using Latin1 encoding in the application connection, and try to write UTF will only generate a mess.
If you Have a table with 4 fields 2 UTF8 and 2 LATIN1, please use UTF8 encoding in the connection/server and be sure to push the right text, actually if you do so, the server will alert you if you try to do something stupid, like pushing UTF8 values in latin1 fields.
For that specific cases when you have no idea of what you are going to write, and you MUST use Latin1 for the application connection, well use varbinary and your life will be easier, of course the ordering will be done by binary value instead character, but you need to keep your data safe first.

Reference

 

Thanks 

Finally a huge thanks to Derek Downey, because he raise the issue (again) to me in relation to a customer migration.

Thanks for his dedication, professionality and will to learn and have fun together.

Ultimo aggiornamento Domenica 15 Settembre 2013 23:16
 
«InizioPrec.123456789Succ.Fine»

JPAGE_CURRENT_OF_TOTAL
 

Who's Online

 30 visitatori online