My MySQL tips

Geographic replication with MySQL and Galera

Introduction

With the introduction of Galera replication for MySQL, Codership has, in my opinion, provided the most significant support to MySQL that the community has seen in the last three years.

Codership has filled a significant gap in MySQL replication capability, succeeding in a task where many before had failed.

Not only is Codership constantly committed to making their solution better and more solid, they are also the most friendly and helpful team I have met in years.

In short, chapeau to all of the guys! Great job, I am really proud to have the opportunity to discuss and interact with you.

That said, no product or code is free of errors and/or issues at different levels. So even if Galera helps us a lot in solving many replication and architectural issues, it still has some limitations that must be taken into consideration.

Galera is easy to set up, especially for a basic setup, which will work fine for almost all basic, low-volume, local OLTP traffic.

Unfortunately, I have seen many implementations that started as basic and then became something different, in usage and very often in design.

The most common mistake I have seen, from an architectural point of view, is when customers start with a simple cluster of 3 nodes all on a local network and then, given it "works so great", decide to add one or two nodes on a geographically distributed site.

This solution is not wrong per se; it is how it is implemented that makes the difference.

The easiest way to explain what I mean, and what should not be done, is to use a real case, and describe what was implemented and how we evolved the solution while Codership was optimizing their product.


Customer scenario

The customer approached us a few months after an engagement that was focused on showing them the basics of how MySQL/Galera works.

They reached us as an emergency, given their cluster was totally stuck and their production site was down.

In order to understand what happened, we first need to take a look at what the customer had put up.

Customer environment description

The initial environment, one single application node pointing to a cluster of 3 MySQL/Galera nodes, had been transformed into 6 (or more) application nodes getting traffic from a balancer. Each application node could write to any data node (round robin) residing in the same geographic area (at least), but the cluster was geographically distributed: 3 nodes in site A, 3 nodes in site B.

As a result, the customer navigation was jumping from one application server to another, and from one data node to another as well.

Transactions were small and relevant only for a single request.

Finally, each application node had a timer to clean up "old" sessions. The concept of "old" was relevant only to the local node, so it could happen that an "old" session was in fact still ongoing on another application node.

In short, the customer was pushing traffic randomly to both zones, using a single table for the sessions to coordinate the activities, but also purging them without any good locking mechanism:

Onecluster

Problem description

When I connected to the machine, I was not aware of the architecture as it was; I was still referring to what I had found in our history, so one application node and a simple cluster.

It took me 5 seconds to see that the situation was quite different. The cluster was suffering from many points of view: Flow Control (FC) was very often high, and nodes in the other zone were often declared dead.

It took me some time to make the customer understand why that was not working correctly, and why the architecture was wrongly designed, mostly because they were claiming that it was working fine before (it reminds me of "works on my machine", doesn't it?). A discussion followed on what "before" was, how much traffic/load there was, and so on.

Finally, they redirected all the traffic to one zone only, and at least the FC went down enough to have the site working, and to give me the chance to review the settings and try to optimize them.

To keep it up I had to increase the FC-related variables, modify the Galera parameters for node timeouts, and obviously optimize MySQL itself, and we survived for a while.

Since the scope of this article is geographic replication, I am not going to report the whole review, only the part related to the subject.

As said, I had extended the timeouts and FC, but I was not happy, and as such I tested in more detail what the network was really able to transmit.

[root@xx ~]# netperf -H 10.0.0.12 -t TCP_RR -v 2  -- -b 6 -r 32K -S 256K -S 256K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.12 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   
 
16384  87380  32768    32768   10.01      97.45   
249856 249856
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
    8      0       0      0   71830.852   97.451 25.546    25.546  <------------ 71ms

So a simple round trip for a small record of 32K takes 71 ms. What if I simulate a basic load of 1 MB?

[root@xx ~]# netperf -H 10.2.3.30  -t TCP_RR -v 2  -- -b 6 -r 1M,32K -S 256K -S 256K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.2.3.30 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   
 
16384  87380  1048576   32768   10.00       2.70   
249856 249856
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
    8      0       0      0   2592709.047   2.700  22.648    0.708  <-- we have 2592.709047 ms latency

Note: to better understand what was going on, and why I use this method to evaluate the network, read my other article (Effective way to check the network connection performance, when using replication geographically distributed).
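The columns of the two netperf reports can be tied together with a little arithmetic. The following is a minimal sketch, assuming that with `-b 6` netperf keeps 7 transactions in flight (1 plus a burst of 6); that assumption is an interpretation of the numbers reported above, and the function name `rr_summary` is made up for illustration.

```python
# Relating the columns of the two netperf TCP_RR reports.
# Assumption: "-b 6" means 7 transactions are in flight at any time.

def rr_summary(latency_usec, request_bytes, burst=6):
    """Return (latency in ms, transactions/sec, outbound Mbit/s)."""
    in_flight = burst + 1
    latency_ms = latency_usec / 1000.0
    trans_per_sec = in_flight / (latency_usec / 1_000_000.0)
    outbound_mbit = trans_per_sec * request_bytes * 8 / 1_000_000.0
    return latency_ms, trans_per_sec, outbound_mbit

small = rr_summary(71830.852, 32 * 1024)      # 32K request
large = rr_summary(2592709.047, 1024 * 1024)  # 1M request

print("32K: %.1f ms, %.2f trans/s, %.2f Mbit/s" % small)
print("1M : %.1f ms, %.2f trans/s, %.2f Mbit/s" % large)
```

Under that assumption the computed rates and outbound throughput line up with the values netperf printed, which shows that the link, not the request size, is the limiting factor: the outbound throughput stays in the same range while the per-transaction latency grows from about 72 ms to about 2.6 seconds.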

What we had to do

With this kind of connectivity, it is very unlikely that we can have Galera replication working correctly, not even if we keep only one site as master and the other as standby. The way the nodes communicate and validate the writesets would be too demanding.

See the diagram below, from Codership, to understand why:

ScreenShot1

The only alternative to this is to use an asynchronous connection and set two nodes, one for each site, in master-master mode:

master-master

Setting up master-master means that we lose the automatic shift of replication from one node to another. That is, while this solution fixes the latency problem because it uses asynchronous replication, it does imply that if one of the two connected nodes crashes, you must manually move the replication to another node.

How to set it up and how to fix it when it breaks

Setting up asynchronous replication in a galera cluster is quite easy.

Assuming you have two nodes connected as master-master (A1 and B1), that the only active one was A1, and that it crashes, the only thing you need to do is check the following status variable on B1:

+----------------------+-----------+
| Variable_name        | Value     |
+----------------------+-----------+
| wsrep_last_committed | 221463896 |
+----------------------+-----------+

Then, on the node you have chosen as the new master, say A2, look for the binary log containing:

Xid = 221463896

On B1, issue the CHANGE MASTER command pointing to A2, with the coordinates you find there.

Then set A2 as a slave of B1, using the master coordinates from B1.

The procedure is simple and obviously worked decently, given the situation.
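The lookup step of that procedure can be sketched as follows. This is only an illustration: the `mysqlbinlog` excerpt, the binary log file name `mysql-bin.000042`, the positions and the helper names are all invented, and the replication credentials (MASTER_USER and so on) are left out. It assumes the new replication position is the `end_log_pos` of the event carrying the matching Xid.

```python
import re

# Hypothetical excerpt of `mysqlbinlog` output taken on A2; timestamps,
# positions and checksums are invented for illustration.
SAMPLE_BINLOG_DUMP = """\
#141103 10:15:19 server id 2  end_log_pos 4680 CRC32 0x1a2b3c4d \tXid = 221463895
#141103 10:15:20 server id 2  end_log_pos 4711 CRC32 0x5e6f7a8b \tXid = 221463896
#141103 10:15:21 server id 2  end_log_pos 4742 CRC32 0x9c0d1e2f \tXid = 221463897
"""

XID_LINE = re.compile(r"end_log_pos (\d+).*Xid = (\d+)")

def position_after_xid(dump_text, xid):
    """Return the end_log_pos of the event carrying the given Xid, or None."""
    for line in dump_text.splitlines():
        match = XID_LINE.search(line)
        if match and int(match.group(2)) == xid:
            return int(match.group(1))
    return None

def change_master_sql(host, log_file, log_pos):
    """Build the CHANGE MASTER TO statement to run on the slave (B1)."""
    return ("CHANGE MASTER TO MASTER_HOST='%s', MASTER_LOG_FILE='%s', "
            "MASTER_LOG_POS=%d;" % (host, log_file, log_pos))

last_committed = 221463896  # wsrep_last_committed read on B1
pos = position_after_xid(SAMPLE_BINLOG_DUMP, last_committed)
print(change_master_sql("A2", "mysql-bin.000042", pos))
```

In a real failover the dump text would come from running `mysqlbinlog` against the binary logs of A2, and the generated statement would be reviewed by hand before being executed on B1.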

The customer was obviously not happy, because they would have liked to have the two clusters perfectly in sync, but that was at least good enough to keep the site working efficiently. Not only that: after we set it up we had at least two production incidents, and we were able to successfully fail over to DR and then fail back with no more than a few seconds of service interruption.

Meanwhile, our friends at Codership had already released the new version of Galera (3.x), working on MySQL 5.6.

So while the customer was refining and correcting a lot of issues on their side, mainly in the code and in how the application was using the common table, I reviewed, tested and crashed MySQL 5.6/Galera 3.

Galera 3.0 comes with some interesting new enhancements; for an overview see http://www.slideshare.net/Severalnines/webinar-slides-27249453.

At the time of writing, Seppo is about to deliver a talk on the new features in Galera 4 at Percona Live London 2014 (http://www.percona.com/live/london-2014/sessions/galera-cluster-new-features).

Galera 3.X group replication and how this can change everything.

In Galera 3, the introduction of the segment concept basically isolates/identifies a group of data nodes inside a cluster.

This means that nodes with the same segment value will communicate extensively with each other, while for traffic between segments one node in each segment is elected as the communication channel:

ScreenShot2

Quorum calculation also takes the segment allocation into account.
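To make the effect of segments on WAN traffic concrete, here is a minimal sketch that counts how many copies of a single writeset have to cross the WAN. It assumes a simplified model in which, without segments, the originating node sends one copy to every remote node, while with segments one relay node per remote segment receives the writeset and forwards it locally. The node names and the function are hypothetical.

```python
def wan_copies_per_writeset(segments, origin_segment, use_segments):
    """Count how many copies of one writeset cross the WAN.

    segments: dict mapping segment id -> list of node names.
    """
    remote = {seg: nodes for seg, nodes in segments.items()
              if seg != origin_segment}
    if use_segments:
        # one relay node per remote segment receives the writeset
        return len(remote)
    # without segments, every remote node gets its own copy
    return sum(len(nodes) for nodes in remote.values())

cluster = {1: ["A1", "A2", "A3"], 2: ["B1", "B2", "B3"]}
print(wan_copies_per_writeset(cluster, 1, use_segments=False))
print(wan_copies_per_writeset(cluster, 1, use_segments=True))
```

For the 3+3 layout used in this article, the model gives 3 WAN copies per writeset without segments and 1 with segments.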

The presentation I have linked above mentions that using segments allows managing WAN latency of 100-300 ms.

Given that I am a skeptic, I decided to perform extensive tests and see what it is really possible to accomplish.

Investigations

I did several tests, involving many Galera parameters, but in the end I had two main actors I was playing with: the max_packet_size and the gcache page size.

All the other parameters I changed proved to be stable and efficient; the settings I used are reported below.

How to test the problem

I have set up 8 different tests:

1) simple insert on 1 node from 3 apps

2) simple insert on 1 node per segment: on DR only one app, on production 2

3) simple insert on 1 node from 3 apps, GALERA PACK larger

4) simple insert on 1 node per segment: on DR only one app, on production 2, GALERA PACK larger

5) simple insert on 1 node from 3 apps BUT with WAN connection

6) simple insert on 1 node per segment: on DR only one app, on production 2, BUT with WAN connection

7) simple insert on 1 node from 3 apps BUT with WAN connection AND GALERA PACK larger

8) simple insert on 1 node per segment: on DR only one app, on production 2, BUT with WAN connection AND GALERA PACK larger
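The eight tests are the combinations of three two-valued choices: network, packet size and write distribution. A minimal sketch that regenerates the same numbering is shown below; the labels are paraphrases of the test descriptions, not names used by any tool.

```python
from itertools import product

networks = ["LAN", "WAN"]
packet_sizes = ["default gcs.max_packet_size", "larger gcs.max_packet_size"]
writers = ["3 apps on 1 node", "1 app on DR + 2 apps on production"]

# product() varies the last list fastest, which reproduces tests 1..8 in order
test_matrix = {
    number: combo
    for number, combo in enumerate(product(networks, packet_sizes, writers),
                                   start=1)
}

for number, (network, packet, writer) in test_matrix.items():
    print(number, network, "|", packet, "|", writer)
```

Reading the matrix this way makes it easier to see which pairs are "twins": tests 5 and 7, or 6 and 8, differ only in the packet size.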

Each test used an OLTP approach, with 5 main tables and 6 child tables with the following structure:

CREATE TABLE `tbtest1` (
  `autoInc` bigint(11) NOT NULL AUTO_INCREMENT,
  `a` int(11) NOT NULL,
  `uuid` char(36) COLLATE utf8_bin NOT NULL,
  `b` varchar(100) COLLATE utf8_bin NOT NULL,
  `c` char(200) COLLATE utf8_bin NOT NULL,
  `counter` bigint(20) DEFAULT NULL,
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `partitionid` int(11) NOT NULL DEFAULT '0',
  `date` date NOT NULL,
  `strrecordtype` char(3) COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`autoInc`,`partitionid`),
  KEY `IDX_a` (`a`),
  KEY `IDX_uuid` (`uuid`)
) ENGINE=InnoDB AUTO_INCREMENT=3474631 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

CREATE TABLE `tbtest_child1` (
  `a` int(11) NOT NULL,
  `bb` int(11) NOT NULL AUTO_INCREMENT,
  `date` date NOT NULL,
  `partitionid` int(11) NOT NULL DEFAULT '0',
  `stroperation` varchar(254) COLLATE utf8_bin DEFAULT NULL,
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`a`,`bb`),
  UNIQUE KEY `bb` (`bb`)
) ENGINE=InnoDB AUTO_INCREMENT=36289 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

I used 3 application nodes, each with 8 child threads, for a total of 24 threads writing to the cluster.

Tests 1 to 4 were done as a baseline, so with NO WAN network simulation.

When I mention Galera pack larger I refer specifically to:

gcs.max_packet_size

This value is described as "All writesets exceeding that size will be fragmented." It is set by default to 32 KB, which means anything larger than that will be fragmented inside Galera. My understanding is that a writeset package exceeding that size will be split into smaller packages. But given that 32 KB, or even less, is significantly larger than the network frame size, packet fragmentation will take place in any case, only at a lower level (network).

My reasoning here was: why should I want to do that, if I already have frame fragmentation at the network (TCP/IP) level? I may have gotten the meaning of the parameter wrong, but if not, it is worth a try.
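The two levels of fragmentation can be put side by side with some simple arithmetic. This is a rough sketch: it assumes a standard 1500-byte Ethernet payload, 40 bytes of IPv4 and TCP headers without options, and it ignores any header Galera adds to each fragment. The 1 MB writeset is a hypothetical example.

```python
import math

ETHERNET_MTU = 1500      # payload bytes of a standard Ethernet frame
IP_TCP_HEADERS = 40      # assumption: 20 bytes IPv4 + 20 bytes TCP, no options
TCP_MSS = ETHERNET_MTU - IP_TCP_HEADERS

def galera_fragments(writeset_bytes, max_packet_size):
    """Fragments produced at Galera level for one writeset."""
    return math.ceil(writeset_bytes / max_packet_size)

def tcp_segments(payload_bytes, mss=TCP_MSS):
    """TCP segments (one per frame) needed to carry the payload."""
    return math.ceil(payload_bytes / mss)

writeset = 1024 * 1024   # hypothetical 1 MB writeset
for max_packet in (32768, 2097152):
    print("gcs.max_packet_size=%d: %d Galera fragment(s), %d TCP segments"
          % (max_packet, galera_fragments(writeset, max_packet),
             tcp_segments(writeset)))
```

In this model the number of TCP segments on the wire is the same in both cases; what changes is how many Galera-level fragments have to be produced and reassembled on top of them.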

In this regard I had also optimized the machine network layer:

net.core.optmem_max = 40960
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_slow_start_after_idle = 0
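One common way to reason about the buffer sizes in these settings is the bandwidth-delay product: a single TCP connection cannot move more than one window of data per round trip. The sketch below is general TCP background rather than something measured in these tests, and the 400 ms round trip is an assumed WAN value chosen for illustration.

```python
def max_tcp_throughput_mbit(window_bytes, rtt_seconds):
    """Upper bound for one TCP connection: one window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1_000_000.0

rtt = 0.400                      # assumed WAN round trip of ~400 ms
default_window = 87380           # middle value of net.ipv4.tcp_rmem
max_window = 16777216            # upper value of net.ipv4.tcp_rmem

print("default window: %.2f Mbit/s" % max_tcp_throughput_mbit(default_window, rtt))
print("max window    : %.2f Mbit/s" % max_tcp_throughput_mbit(max_window, rtt))
```

With the default window the bound is under 2 Mbit/s on such a link, while the 16 MB upper value raises it to a few hundred Mbit/s, so the window stops being the bottleneck.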

I ran the tests on 6 data nodes using 2 segments, in a virtualized environment: 8 cores, 8 GB RAM, RAID 10 on 6 10k RPM disks, CentOS 6.

On purpose I skipped HAProxy and preferred to point directly at the nodes I had defined.

The applications were closing and opening a connection at each transaction to emulate the customer behavior.

When simulating the WAN network I had set a final total latency of:

[root@galera1h1n5 galera56N1]# netperf -H 10.0.0.33 -t TCP_RR -v 2  -- -b 6 -r 120K,64K -S 156K -S 156K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.33 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec
 
65536  87380  122880   65536   10.00       1.90
319488 319488
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
    8      0       0      0   3684481.420   1.900  1.868     0.996  <----- 3684.481 ms

With packet reordering 15-20%.

Just for the sake of comparison, a PING would have reported:

[root@galera1h1n5 galera56N1]# ping -M  do -s 1472 -c 3 10.0.0.33
PING 10.0.0.33 (10.0.0.33) 1472(1500) bytes of data.
1480 bytes from 10.0.0.33: icmp_seq=1 ttl=64 time=403 ms
1480 bytes from 10.0.0.33: icmp_seq=2 ttl=64 time=389 ms
1480 bytes from 10.0.0.33: icmp_seq=3 ttl=64 time=410 ms

So ~400 ms for 1472 (1500) bytes.

With all this in place, I started the ball rolling and collected my data.

Note that I ran each set of tests at least 5 times to validate the data collected, and that the data was cleaned and refreshed, and the InnoDB buffer pool reloaded, at every test.

Before reaching the last set of tests (the final 2 sets) I had run a total of 64 variants of the tests, sometimes crashing Galera, sometimes getting such bad performance that it was obvious it would not be usable in production, and so on.

Anyhow, I was more focused on the replication performance/behavior than on what MySQL/InnoDB was doing, given I was not looking for ultra-fine tuning at that level; actually, some imperfections suited my purpose better.

Related issues

Initially I was not using optimized network settings, and the first time I reached tests 7-8 I had issues. In short, test 7 was taking 4 times longer than test 5 (its twin), and in some cases it was so bad that I had to stop it.

But once the network was set up correctly, most of the issues disappeared.

Results

Who wins?

There is not a single winner; all the tests show that there is a need to carefully pick the right solution for the job.

But these two sets of tests and settings proved to be stable and better performing than all the others in my simulation:

execution_time_smallGcache_page

This first image shows the set of tests performed with the gcache page set to a lower value of 32 MB and more instances (4).

execution_time_largeGcache_page

This second one shows the tests with a larger page of 250 MB and only 2 instances.

I did not include the initial test where the page/instance values were left at their defaults, because it provided lower performance.

Comparing the two, we can see that there is a different behavior between the tests done on a local network and the ones using the WAN.

In the case of a LAN, where the cluster is less limited in performance at the network level, the use of more, smaller gcache pages with more instances allowed the cluster and MySQL to insert ~5700-5300 rows/sec.

The use of a large gcache page and only 2 instances, on the other hand, saw performance drop to ~5500-4950 rows/sec.

The difference is not significant with this traffic, but it is interesting to note that it exists and is consistent across all the tests.
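To put a number on that difference, the two ranges can be compared end to end. This is a small worked calculation using only the approximate rows/sec figures quoted above; pairing the high end with the high end and the low end with the low end is an assumption made for illustration.

```python
def relative_drop(before, after):
    """Percentage drop going from `before` to `after`."""
    return (before - after) / before * 100.0

small_pages = (5700, 5300)   # rows/sec, more and smaller gcache pages
large_pages = (5500, 4950)   # rows/sec, large gcache page, 2 instances

for small, large in zip(small_pages, large_pages):
    print("%d -> %d rows/sec: %.1f%% lower"
          % (small, large, relative_drop(small, large)))
```

The drop comes out at roughly 3.5% at the high end and 6.6% at the low end.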

To me, this means there is something that adds a cost, a significant cost, when Galera needs to access a smaller number of gcache pages. Given this is a memory access, I am inclined to think there is a mutex that prevents concurrent usage of the single page/resource.

On the other hand, the large gcache page seems to provide better support in the case of a distributed cluster.

I assume that, given the transmission is slowed down by the network, Galera/MySQL is able to make better use of the fewer instances of the gcache memory pages.

Nevertheless, the large gcache with fewer instances was less effective when inserting on multiple nodes than the small gcache page with more instances and a larger gcs.max_packet_size.

But that is not all.

When using a larger packet size (gcs.max_packet_size), I was consistently able to complete the tests in less time when I was using a geographically distributed cluster, a large gcache page and multiple nodes writing.

When the cluster was working locally, instead, the small (standard) gcs.max_packet_size worked better.

Let us navigate the results.

Segments distribution

The first test I did was to make sure that the segments were working as expected.

The cluster was composed of 2 segments, each with 3 nodes.

Segment 1 nodes IP ending with: 5, 21, 31

Segment 2 nodes IP ending with: 2, 23, 33

Local cluster (same network)
Display paused 19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.5                <=> 10.0.0.21        1.65GB  76.8Kb  63.9Kb  67.1Kb
10.0.0.5                <=> 10.0.0.31        1.63GB  63.8Kb  51.8Kb  55.2Kb
10.0.0.5                <=> 10.0.0.23        1.58GB  31.8Kb  26.2Kb  28.5Kb <----
10.0.0.5                <=> 10.0.0.2         30.8MB  19.3Kb  15.4Kb  16.2Kb
10.0.0.5                <=> 10.0.0.33         270KB   848b   1.81Kb  1.74Kb
 
 
 Display paused  19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└────────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.21               <=> 10.0.0.5          1.70GB  70.7Kb  71.6Kb  67.5Kb
10.0.0.21               <=> 10.0.0.31         83.4MB  44.9Kb  43.3Kb  40.9Kb
10.0.0.21               <=> 10.0.0.2          31.8MB  15.6Kb  16.9Kb  16.5Kb <---- ??
10.0.0.21               <=> 10.0.0.33         32.1MB  15.6Kb  16.8Kb  16.3Kb <---- ??
10.0.0.21               <=> 10.0.0.23          269KB   848b   1.34Kb  1.36Kb
 
 
Display paused  19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└────────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.31               <=> 10.0.0.5          1.67GB  56.5Kb  58.3Kb  55.7Kb
10.0.0.31               <=> 10.0.0.21         83.4MB  44.7Kb  43.3Kb  41.5Kb
10.0.0.31               <=> 10.0.0.2          31.7MB  19.6Kb  17.6Kb  17.2Kb <---- ??
10.0.0.31               <=> 10.0.0.33          273KB   848b   1.98Kb  1.89Kb 
10.0.0.31               <=> 10.0.0.23          269KB   848b   1.85Kb  1.72Kb
 
 Display paused 19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└───────────────┴───────────────┴─────────┴──────────────┴────────────────
10.0.0.2              <=> 10.0.0.23       1.70GB  71.2Kb  71.2Kb  67.6Kb
10.0.0.2              <=> 10.0.0.33        111MB  58.2Kb  57.6Kb  54.6Kb 
10.0.0.2              <=> 10.0.0.31       31.7MB  20.2Kb  17.7Kb  16.9Kb <---- ??
10.0.0.2              <=> 10.0.0.5        30.8MB  15.6Kb  17.3Kb  16.5Kb <---- ??
10.0.0.2              <=> 10.0.0.21       31.8MB  15.6Kb  16.9Kb  16.5Kb <---- ??
 
 Display paused 19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
───────────────┴───────────────┴────────────────────────┴────────────────
10.0.0.23              <=> 10.0.0.2        1.65GB  71.2Kb  71.2Kb  68.3Kb
10.0.0.23              <=> 10.0.0.33       1.60GB  43.7Kb  43.4Kb  41.7Kb
10.0.0.23              <=> 10.0.0.5        1.62GB  28.1Kb  29.6Kb  28.3Kb <----
10.0.0.23              <=> 10.0.0.31        269KB   848b   1.85Kb  1.76Kb
10.0.0.23              <=> 10.0.0.21        269KB   848b   1.34Kb  1.36Kb
 
 Display paused 19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└───────────────┴────────────────┴───────┴────────────────┴────────────────
10.0.0.33              <=> 10.0.0.2          111MB  58.8Kb  57.6Kb  55.4Kb 
10.0.0.33              <=> 10.0.0.23        1.65GB  43.7Kb  43.3Kb  41.9Kb 
10.0.0.33              <=> 10.0.0.21        32.1MB  15.6Kb  16.8Kb  16.5Kb <---- ??
10.0.0.33              <=> 10.0.0.31         273KB   848b   1.98Kb  1.89Kb
10.0.0.33              <=> 10.0.0.5          274KB   848b   1.78Kb  1.73Kb
 
 
WAN distributed cluster
Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴───────────┴─────────┴────────────────┴──────────────────────
10.0.0.5              <=> 10.0.0.31            1.71GB  30.2Kb  31.3Kb  31.0Kb
10.0.0.5              <=> 10.0.0.21            1.71GB  28.1Kb  30.4Kb  30.4Kb
10.0.0.5              <=> 10.0.0.2             1.61GB  15.1Kb  16.3Kb  16.6Kb <----
10.0.0.5              <=> 10.0.0.33             682KB  1.19Kb  1.91Kb  1.77Kb
10.0.0.5              <=> 10.0.0.23            8.83MB  3.91Kb  1.77Kb  1.87Kb <---- ??
10.0.0.5              <=> 10.0.0.153           5.05KB     0b      0b    572b
10.0.0.5              <=> 10.0.0.151           5.91KB     0b      0b    535b
 
 Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴────────────┴────────┴─────────────────┴──────────────────────
10.0.0.21             <=> 10.0.0.31              194MB  28.7Kb  31.2Kb  31.3Kb 
10.0.0.21             <=> 10.0.0.5              1.76GB  28.5Kb  30.6Kb  30.4Kb
10.0.0.21             <=> 10.0.0.23             73.8MB  15.6Kb  17.1Kb  17.1Kb <---- ??
10.0.0.21             <=> 10.0.0.33              298KB  1.86Kb  2.00Kb  1.91Kb
10.0.0.21             <=> 10.0.0.2              5.83MB  2.06Kb  1.53Kb  1.42Kb <---- ??
10.0.0.21             <=> 10.0.0.12              351KB  1.53Kb  1.34Kb  1.41Kb
 
 
 Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴────────────┴────────┴─────────────────┴──────────────────────
10.0.0.31             <=> 10.0.0.5              1.76GB  28.7Kb  31.3Kb  31.1Kb
10.0.0.31             <=> 10.0.0.21              192MB  28.7Kb  31.2Kb  31.3Kb
10.0.0.31             <=> 10.0.0.33             53.3MB  17.8Kb  17.9Kb  17.7Kb <---- ??
10.0.0.31             <=> 10.0.0.23             1.23MB  3.93Kb  2.01Kb  2.31Kb
10.0.0.31             <=> 10.0.0.2               323KB   528b   1.77Kb  1.81Kb
10.0.0.31             <=> 10.0.0.13              354KB  1.28Kb  1.32Kb  1.38Kb
 
 
 Display paused 19.1Mb      38.1Mb    57.2Mb           76.3Mb           95.4Mb
└───────────────┴───────────┴─────────┴────────────────┴──────────────────────
10.0.0.2              <=> 10.0.0.23            1.65GB  28.5Kb  30.9Kb  31.2Kb
10.0.0.2              <=> 10.0.0.33            1.65GB  27.7Kb  30.5Kb  31.0Kb
10.0.0.2              <=> 10.0.0.5             1.64GB  15.1Kb  16.3Kb  16.6Kb <----
10.0.0.2              <=> 10.0.0.31             323KB   528b   1.77Kb  1.81Kb
10.0.0.2              <=> 10.0.0.21            5.88MB  2.06Kb  1.53Kb  1.46Kb <---- ??
 
 Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴────────────┴────────┴─────────────────┴──────────────────────
10.0.0.23             <=> 10.0.0.33              118MB  29.8Kb  31.5Kb  31.5Kb
10.0.0.23             <=> 10.0.0.2              1.70GB  29.5Kb  30.8Kb  31.2Kb
10.0.0.23             <=> 10.0.0.21             74.0MB  15.4Kb  16.9Kb  17.0Kb <----??
10.0.0.23             <=> 10.0.0.31             1.22MB  3.31Kb  1.85Kb  2.27Kb
10.0.0.23             <=> 10.0.0.5              8.96MB  3.40Kb  1.67Kb  1.84Kb <---- ??
10.0.0.23             <=> 10.0.0.12              359KB  1.47Kb  1.47Kb  1.36Kb
 
 Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴────────────┴────────┴─────────────────┴──────────────────────
10.0.0.33             <=> 10.0.0.23              118MB  28.1Kb  31.5Kb  31.4Kb
10.0.0.33             <=> 10.0.0.2              1.70GB  27.5Kb  30.2Kb  30.7Kb
10.0.0.33             <=> 10.0.0.31             53.3MB  17.8Kb  17.9Kb  17.7Kb <----??
10.0.0.33             <=> 10.0.0.21              298KB  1.86Kb  2.00Kb  1.91Kb
10.0.0.33             <=> 10.0.0.5               686KB  1.03Kb  1.96Kb  1.78Kb
10.0.0.33             <=> 10.0.0.13              370KB  1.41Kb  1.44Kb  1.42Kb

Apparently all was working fine, and if you look you will notice that the node pairs 5-23 and 5-2 seem to work as a bridge between the segments.

But if you look carefully (I have marked the cases with <---- ??), you will see that there is additional traffic going on. For instance, in the first set nodes 33-21 exchange 32 MB of data, and in the second set (WAN) there is even more: nodes 23-21 exchange 73 MB.

It may seem like nothing, but it means there is an exchange of some type. What kind of traffic is going on between SOME nodes? I say SOME on purpose, because other nodes are not generating this traffic, which I consider an anomaly.

Maybe this is by design; if so, I would like to know, and would like to know why.

Anyhow, aside from this, most of the traffic is actually flowing as expected.
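The same reading of the capture can be done mechanically. The sketch below takes the cumulative totals of nodes 5 and 21 from the local-network capture above, rounds them to MB, and lists the cross-segment flows other than the main bridge. The segment map follows the node blocks in the capture, while the 1 MB noise threshold and the function names are arbitrary choices made for illustration.

```python
# Segment membership by last IP octet, as seen in the capture
SEGMENT_OF = {5: 1, 21: 1, 31: 1, 2: 2, 23: 2, 33: 2}

# (node, peer, cumulative traffic in MB), rounded from the capture
observed = [
    (5, 21, 1650.0), (5, 31, 1630.0), (5, 23, 1580.0),
    (5, 2, 30.8), (5, 33, 0.27),
    (21, 31, 83.4), (21, 2, 31.8), (21, 33, 32.1), (21, 23, 0.269),
]

def cross_segment(flows):
    """Keep only flows whose endpoints sit in different segments."""
    return [f for f in flows if SEGMENT_OF[f[0]] != SEGMENT_OF[f[1]]]

def unexpected_flows(flows, noise_mb=1.0):
    """Return (bridge flow, other cross-segment flows above the noise level)."""
    cross = cross_segment(flows)
    bridge = max(cross, key=lambda f: f[2])
    extra = [f for f in cross if f is not bridge and f[2] > noise_mb]
    return bridge, extra

bridge, extra = unexpected_flows(observed)
print("bridge:", bridge)
for flow in extra:
    print("unexpected cross-segment traffic:", flow)
```

On this subset the bridge is the 5-23 pair, and the remaining cross-segment flows of about 30 MB each are the ones that deserve an explanation.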

More inside the replication

As mentioned before, the whole exercise was to identify whether there are best settings for a cluster working in a geographically distributed environment and, if so, which ones. In relation to that, the most relevant tests are numbers 6 and 8:

6) simple insert on 1 node per segment: on DR only one app, on production 2, BUT with WAN connection

8) simple insert on 1 node per segment: on DR only one app, on production 2, BUT with WAN connection AND GALERA PACK larger

I also compare the differences between the gcache.keep_pages_size and page_size settings.

Test6

Replication_queue6

Test8

Replication_queue8

Reviewing the two graphs above, related to the replication queue, we can see that neither test reports Flow Control action.

In the case of Test6 the receiving and sending queues had some spikes that were not present in Test8, meaning the two queues were processed more efficiently when max_packet_size is larger.

Note that wsrep_evs_repl_latency in Test8 is higher; as such the messages seem to take longer to be dispatched, but in the end the performance is better.

Test6

netstats6

Test8

netstats8

Reviewing the graphs related to writeset size and netstat, we can notice that the writesets are larger in the test with the default max_packet_size, and network traffic is also higher. I was surprised by that, and I repeated the tests several more times. In the end, magnifying the information related to the transmitted packets, we can notice that when using a larger max_packet_size the gap between received and transmitted is smaller.

Test6

netstatsPacket6

Test8

netstatsPacket8

Given that in this test we have writes going to two nodes, my conclusion is that the network layer was able to optimize the data transmission, reducing the number of vectors (frames) and as such reducing the overhead.

Test6

Parallelization_efficency6

Test8

Parallelization_efficency8

To complete the review of the Galera replication, let us take a look at the parallelization efficiency graphs.

Here the differences are more evident. The apply window (average distance between highest and lowest concurrently applied seqno) when using the default max_packet_size is definitely higher, as is the wsrep_commit_window (average distance between highest and lowest concurrently committed seqno).

Also, the wsrep_cert_deps_distance (average distance between the highest and lowest seqno values that can possibly be applied in parallel, i.e. the potential degree of parallelization) is steady in the first case, while with the larger max_packet_size it goes down.

I am inclined to read this as a more optimized way to apply and commit the incoming data.

Test6

Write_set_bytes6

Test8

Write_set_bytes8

Reviewing the Writeset_byte graphs, we can note that while the bytes replicated remain very close between Test6 and Test8, the bytes received are considerably larger in the case of the smaller/default max_packet_size.

So the node was in fact receiving less data in Test8, while the amount of data coming from the application stayed constant.

Test6

processes6

Test8

processes8

Analyzing the process information, Test6 (default max_packet_size) has more major page faults than Test8, and both system time and user time are larger in Test6. This means more I/O operations.

The differences are not huge, but they are consistent across the different test runs.

Conclusions?

I have collected, and am still collecting, a lot of additional data, but given that the scope of this article is the network and replication side, I prefer to stop here.

There is no single conclusion about the whole MySQL/Galera setup, but let me say that I strongly suggest not using a monolithic setup for all cases.

Instead, adapt the configuration to the architecture you are implementing.

So far, from the tests I have done, I can say that:

Cluster implemented in a single site with very good connectivity, using MySQL/Galera for HA or as a write distribution platform:

Use multiple gcache.keep_pages_size (I related the number to the number of cores, but I don't see any benefit in going above 8 instances at the moment)
Use a gcache.page_size never less than 32 MB and no more than 256 MB total
Use a small gcs.max_packet_size of 32K

Cluster implemented in a single site with a geographically distributed second site, using MySQL/Galera on the second site for HA only:

Use single or two gcache.keep_pages_size
Use the default gcache.page_size or a slightly larger one, like 256M
Use a larger gcs.max_packet_size of 2097152 (2M)

In the final tests I have been using the two following configurations, and they worked great also with significant load, not only in the tests but also when we implemented them at the customer site, using the same load and operations that were crashing the system with Galera 2.x or with other Galera 3.x configurations.

Configuration used for local cluster:

wsrep_provider_options = "gmcast.segment=1|2;socket.ssl_compression=no;gcache.keep_pages_size=2;gcache.page_size=125M;gcache.size=5024M;evs.send_window=512;evs.user_send_window=512;evs.use_aggregate=true;gcs.fc_factor=1;gcs.fc_limit=32;gcs.max_packet_size=32768;gcs.fc_debug=0;evs.inactive_check_period=PT1S;evs.inactive_timeout=PT15S;evs.suspect_timeout=PT5S;evs.stats_report_period=PT1M;"

Configuration used for geographically distributed DR:

wsrep_provider_options = "gmcast.segment=1|2;socket.ssl_compression=no;gcache.keep_pages_size=4;gcache.page_size=64M;
gcache.size=5024M;evs.send_window=1024;evs.user_send_window=1024; evs.use_aggregate = true; gcs.fc_factor = 1;
gcs.fc_limit = 260; gcs.max_packet_size = 2097152;gcs.fc_debug = 0; evs.inactive_check_period = PT30S;
evs.inactive_timeout = PT1M;evs.suspect_timeout = PT1M; evs.stats_report_period = PT3M;"
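When comparing two long wsrep_provider_options strings like the ones above, it is easy to miss a difference. The following is a minimal Python sketch, not part of Galera, that splits such a string into name/value pairs and lists what differs. The helper names parse_provider_options and diff_options are my own, and the two sample strings are shortened extracts of the configurations above.

```python
def parse_provider_options(raw):
    """Split a wsrep_provider_options string into a {name: value} dict."""
    options = {}
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue  # ignore empty pieces left by trailing semicolons
        name, _, value = item.partition("=")
        options[name.strip()] = value.strip()
    return options


def diff_options(a, b):
    """Return {name: (value_in_a, value_in_b)} for options that differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Shortened extracts of the local and DR configurations shown above
local = parse_provider_options(
    "gcache.page_size=125M; gcs.fc_limit = 32; gcs.max_packet_size = 32768;"
)
dr = parse_provider_options(
    "gcache.page_size=64M; gcs.fc_limit = 260; gcs.max_packet_size = 2097152;"
)

for name, (local_value, dr_value) in diff_options(local, dr).items():
    print(f"{name}: local={local_value} dr={dr_value}")
```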

Please let me know if you experience different results, and how/why.

Great MySQL .... & Galera to all.

An effective way to check network connection performance when using geographically distributed replication

Why this article

The more I interact with customers asking about MySQL/Galera, the more I have to answer, over and over, the same question about what kind of network conditions Galera can manage efficiently.

One of the most frequent myths I have to address at the start of any conversation that involves the network is PING.

Most customers use PING to validate generic network conditions, and as a direct consequence they apply the same approach when they need information on more complex and heavier usage, like Galera replication.

To better understand why I consider the use of PING not wrong but inefficient, let us review some basic networking concepts.

Frame

At the beginning there is the physical layer, but I am going to skip it, otherwise the article would be too long. What I only want to say is that unless you can afford a leased line to connect your distributed sites, you are going to be subject to packet switching, and as such affected by throughput, bandwidth, latency, congestion, and dropped-packet issues.

Given the physical layer, and given we are talking about Ethernet connections, the basic transport unit, the vector that encapsulates all the others, is the Ethernet frame. A frame can be up to 1518 bytes and no less than 64.

A frame is composed of:

Preamble, 7 bytes
Delimiter, 1 byte
MAC address destination, 6 bytes
MAC address source, 6 bytes
Optional fields (IEEE 802.1Q / IEEE 802.1p), 4 bytes
Ethernet type or length (values of 1536 and above represent the type; values up to 1500 represent the length of the payload), 2 bytes
Payload, up to 1500 bytes
Frame CRC, 4 bytes
Inter-packet gap, the space that is added between two frames
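To make the numbers in the list above concrete, here is a small Python sketch that adds the fields up. It is only arithmetic on the field sizes listed above. The 12-byte inter-packet gap is the standard Ethernet value and is not stated in the list, and the function names are my own.

```python
PREAMBLE = 7
START_DELIMITER = 1
MAC_DESTINATION = 6
MAC_SOURCE = 6
VLAN_TAG = 4           # optional IEEE 802.1Q tag
ETHERTYPE = 2
FRAME_CRC = 4
INTER_PACKET_GAP = 12  # idle time between frames, expressed in byte times


def frame_size(payload, vlan=False):
    """Frame size from destination MAC to CRC (preamble and gap excluded)."""
    size = MAC_DESTINATION + MAC_SOURCE + ETHERTYPE + payload + FRAME_CRC
    return size + (VLAN_TAG if vlan else 0)


def wire_efficiency(payload, vlan=False):
    """Fraction of the bytes occupying the wire that is actual payload."""
    on_wire = (PREAMBLE + START_DELIMITER + frame_size(payload, vlan)
               + INTER_PACKET_GAP)
    return payload / on_wire


print(frame_size(1500))       # largest untagged frame
print(frame_size(46))         # smallest frame
print(round(wire_efficiency(1500), 4))
print(round(wire_efficiency(46), 4))
```

A full 1500-byte payload gives the 1518-byte maximum mentioned above, and a 46-byte payload gives the 64-byte minimum. The smaller the payload, the larger the share of the wire spent on overhead.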

framedimension

A frame can encapsulate many different protocols like:

IPv4
IPv6
ARP
AppleTalk
IPX
... Many more

The maximum size available for the datagram to be transmitted is 1500 bytes, also known as the MTU. That is, the MTU (Maximum Transmission Unit) is the size in bytes of the payload that a frame can transport from source to destination. We will see later that this is not guaranteed and that fragmentation can happen.

The only exception is when Jumbo Frames are supported; in that case a frame can carry a payload of up to 9000 bytes. Jumbo frames can be quite bad for latency, especially when the transmission is between geographically distributed data centres.

IP (internet protocol)

For the sake of this article we will focus on IPv4 only. IPv4 (Internet Protocol version 4) is based on connectionless, best-effort packet delivery. This means that each datagram is sent independently: the data is split into N IP datagrams and sent out to the destination, with no guarantee that they will be delivered or that they will arrive in the same order they were sent.

Each IP datagram has a header section and a data section.

The IPv4 packet header consists of 14 fields, of which 13 are required.

The 14th field is optional and aptly named: options.

IPv4 Header Format
*Offsets*	Octet	0								1								2								3
Octet	Bit	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31
0	0	Version				IHL				DSCP						ECN		Total Length
4	32	Identification																Flags			Fragment Offset
8	64	Time To Live								Protocol								Header Checksum
12	96	Source IP Address
16	128	Destination IP Address
20	160	Options (if IHL > 5)
		Options (if IHL > 5)

An IP datagram is encapsulated in the frame…

ip_datagram

Then sent.

For performance, the larger the datagram (up to the 1500 MTU) the better, but only in theory, because this rule works fine only when a datagram is sent over a local network, or over a network that can guarantee an MTU of 1500 along the whole path.

What happens in real life is that a frame sent over the Internet has to cross many different networks, and there is no guarantee that all of them will support the same MTU; actually the normal condition is that they don't.

So when a datagram passes through a gateway, the gateway knows the MTU of the two links (in/out), and if the datagram does not fit the outgoing one, it will fragment it.

See below:

fragmentating_net

Assuming Host A is sending an IP datagram of 1400 bytes to Host B, when Gw1 receives it, it will have to fragment it into smaller pieces to allow it to be transported:

fragmented

Once a datagram is fragmented to match the largest transportable frame, it will be recomposed only at the destination. If one of the fragments of the datagram is lost, for any reason, the whole datagram is discarded and the transmission fails.
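The fragmentation described above can be sketched in a few lines of Python. This is a simplified model, assuming a 20-byte IPv4 header without options. The outgoing MTU of 620 bytes in the example is a hypothetical value chosen for illustration, not the one of the network in the picture, and the function name is my own.

```python
IP_HEADER = 20  # bytes, IPv4 header without options


def fragment(datagram_size, mtu):
    """Split an IPv4 datagram for a link with the given MTU.

    Returns a list of (offset_in_8_byte_units, fragment_size, more_fragments).
    """
    if datagram_size <= mtu:
        return [(0, datagram_size, False)]  # fits, no fragmentation needed
    data = datagram_size - IP_HEADER
    # every fragment except the last must carry a multiple of 8 data bytes
    max_data = (mtu - IP_HEADER) // 8 * 8
    fragments = []
    sent = 0
    while sent < data:
        chunk = min(max_data, data - sent)
        more = sent + chunk < data
        fragments.append((sent // 8, chunk + IP_HEADER, more))
        sent += chunk
    return fragments


# 1400-byte datagram crossing a link with a hypothetical 620-byte MTU
for offset, size, more in fragment(1400, 620):
    print(offset, size, more)
```

Each fragment carries its own IP header, so three fragments cost 40 bytes more than the original datagram, and losing any one of them means the destination cannot rebuild the datagram.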

ICMP

The IP specification requires the implementation of a special protocol dedicated to IP status checks and diagnostics: ICMP (Internet Control Message Protocol).

Any communication done by ICMP is embedded inside an IP datagram, and as such follows the same rules:

icmp_fragment

ICMP has many “tools” in its pocketknife; one of them is PING.

A ping is composed of an echo request datagram and an echo reply datagram:

00	01	02	03	04	05	06	07	08	09	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31
Type = 8 (request) OR =0 (reply)								Code = 0								Header Checksum
Identifier																Sequence Number
Data

Request and reply have a similar definition, with the note that the reply MUST contain the full set of data sent by the echo request.

By default ping sends 56 bytes of data plus the 8 bytes of the ICMP header, so 64 bytes.

Ping can be adjusted to behave in different ways, including the data size and the don’t fragment bit (DF). The latter is quite important, because it sets the bit that marks the datagram as not available for fragmentation.

As such, if you set a large data size and set the DF bit, the datagram will be sent with a request not to be fragmented. In that case, if it reaches a gateway that would need to fragment it, the whole datagram will be dropped and we will get an error.

For example:

ping -M do -s 1472 -c 3 192.168.0.34
 PING 192.168.0.34 (192.168.0.34) 1472(1500) bytes of data.
 1480 bytes from 192.168.0.34: icmp_req=1 ttl=128 time=0.909 ms
 1480 bytes from 192.168.0.34: icmp_req=2 ttl=128 time=0.873 ms
 1480 bytes from 192.168.0.34: icmp_req=3 ttl=128 time=1.59 ms

As you can see, I CAN send it over my home wireless even when asking not to fragment, because the datagram does not exceed 1500 bytes and my wireless is using an MTU of 1500.

But if I raise the size by just ONE byte:

ping -M do -s 1473 -c 3 192.168.0.34
 PING 192.168.0.34 (192.168.0.34) 1473(1501) bytes of data.
 From 192.168.0.35 icmp_seq=1 Frag needed and DF set (mtu = 1500)
 From 192.168.0.35 icmp_seq=1 Frag needed and DF set (mtu = 1500)
 From 192.168.0.35 icmp_seq=1 Frag needed and DF set (mtu = 1500)

Removing the -M option (so that fragmentation can take place):

root@tusacentral01:~# ping  -s 1473 -c 3 192.168.0.34

PING 192.168.0.34 (192.168.0.34) 1473(1501) bytes of data.
 1481 bytes from 192.168.0.34: icmp_req=1 ttl=128 time=1.20 ms
 1481 bytes from 192.168.0.34: icmp_req=2 ttl=128 time=1.02 ms
 1481 bytes from 192.168.0.34: icmp_req=3 ttl=128 time=0.996 ms
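
The sizes in the ping runs above follow from simple arithmetic: the value passed to -s is only the ICMP data, to which the 8-byte ICMP header and the 20-byte IPv4 header are added. A small Python sketch of that calculation, with function names of my own choosing:

```python
IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def datagram_size(ping_payload):
    """Size of the IP datagram produced by `ping -s ping_payload`."""
    return ping_payload + ICMP_HEADER + IP_HEADER


def max_ping_payload(mtu):
    """Largest -s value that still fits in one datagram for this MTU."""
    return mtu - IP_HEADER - ICMP_HEADER


def fits_without_fragmentation(ping_payload, mtu):
    return datagram_size(ping_payload) <= mtu


print(datagram_size(1472))                     # matches the 1472(1500) line
print(fits_without_fragmentation(1473, 1500))  # one byte too many
print(max_ping_payload(1500))
```

The same calculation can be used to find the path MTU of a link: the largest -s value that passes with -M do, plus 28.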

PING is widely used to test connectivity between two network points, which is good, and from the example above you may think that it is also a good tool to test the network in more complex situations, like when we need to be sure that the virtually synchronous Galera replication will NOT be affected.

But you would be wrong. The PING tool, as part of ICMP, still relies for its operations on the basic IP protocol, which is based on best-effort delivery of the datagram, while any data transmitted between the nodes of a MySQL/Galera cluster uses the TCP protocol. The two are bound (from the TCP point of view), but the behaviour of a TCP implementation is quite different.

To understand that, let us review at a very high level (one billion kilometres high) the TCP characteristics.

TCP

As mentioned previously, IP and the related ICMP work on the basis of best-effort delivery of the datagram. In short, this means that whatever I send is sent without any guarantee that it will reach its destination, and without any attempt to send it again if it fails. Moreover, no real check is performed between source and destination regarding the data transmission.

TCP means Transmission Control Protocol and, as the name says, it is designed to control the data transmission happening between source and destination.

TCP implementations offer a reliable, ordered, error-checked delivery of a stream of octets between source and destination.

TCP is a transport protocol, and as such it works at the transport layer of the OSI model:

OSI Model
	Data unit	Layer	Function
Host layers	Data	7. Application	Network process to application
		6. Presentation	Data representation, encryption and decryption, convert machine dependent data to machine independent data
		5. Session	Interhost communication, managing sessions between applications
	Segments	4. Transport	Reliable delivery of packets between points on a network.
Media layers	Packet/Datagram	3. Network	Addressing, routing and (not necessarily reliable) delivery of datagrams between points on a network.
	Bit/Frame	2. Data link	A reliable direct point-to-point data connection.
	Bit	1. Physical	A (not necessarily reliable) direct point-to-point data connection.

TCP implementations use the IP protocol encapsulation for the transmission of the data:

tcp_datagram_incapsulation

A TCP implementation has several characteristics that are worth summarizing:

Is stream oriented; that is, when two applications open a connection based on TCP, they see it as a stream of bits that will be delivered to the destination application exactly in the same order and with the same consistency they had at the source.
Establishes a connection, which means that host1 and host2, before starting to send data, must perform a handshake operation, which allows them to know each other's state. The connection uses a three-way handshake:

tcp_con_open

Once the connection is established, the two parties keep monitoring the data transfer; in case of a crash or an interruption of the service, both hosts will notify the applications involved in the stream transfer.

TCP sees a connection
Buffered transmission: the applications communicating the data can choose whatever size to send, even a single byte at a time. The transport layer will buffer the data in order to match the best datagram size for that specific moment, and will send it. The size of the datagram is adaptive, which means it can be adjusted in relation to the network transfer conditions.
Unstructured stream: the stream does not have a predefined format/size for internal data; as such the applications must know how to read from the stream.
Full-duplex connection: data can be transmitted simultaneously from both sides.
Segments: a TCP implementation sees the stream as a sequence of octets (bytes), which it splits into segments; the segment size is normally as large as a datagram.

TCP header definition:

TCP Header
*Offsets*	Octet	0								1								2								3
Octet	Bit	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31
0	0	Source port																Destination port
4	32	Sequence number
8	64	Acknowledgment number (if ACK set)
12	96	Data offset				Reserved 0 0 0			N S	C W R	E C E	U R G	A C K	P S H	R S T	S Y N	F I N	Window Size
16	128	Checksum																Urgent pointer (if URG set)
20 ...	160 ...	Options (if data offset > 5. Padded at the end with "0" bytes if necessary.) ...

Source port (16 bits)

As said, TCP implementations are reliable and can re-transmit missed packets. Let us see how it works:

tcp_pcg_sent_simple tcp_pcg_sent_error

This mechanism is in place for each TCP transmission, and ensures the correct send/receive of the segment.

It is easy to understand that, although effective, having this mechanism work on a single segment at a time is far from efficient. Considering a set of packets from 1 to 100, the delivery of packet 1 must be acknowledged before packet 2 can be sent.
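
To put a number on that inefficiency, here is a rough Python sketch. It is an idealised model under my own assumptions: no packet loss, transmission time ignored, and a round trip of 50 ms picked purely as an illustrative value. The function names are hypothetical.

```python
import math


def stop_and_wait_seconds(segments, rtt_seconds):
    """One segment in flight: every segment costs a full round trip."""
    return segments * rtt_seconds


def windowed_seconds(segments, window_segments, rtt_seconds):
    """Up to window_segments in flight per round trip (idealised, no loss)."""
    return math.ceil(segments / window_segments) * rtt_seconds


rtt = 0.050  # assumed round trip of 50 ms

print(stop_and_wait_seconds(100, rtt))   # 100 packets, one at a time
print(windowed_seconds(100, 10, rtt))    # same packets, 10 in flight
```

With one packet in flight, 100 packets need 100 round trips; allowing 10 in flight brings that down to 10 round trips, which is the motivation for what follows.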

To solve this issue, TCP uses the concept of a dynamic sliding window. That is, TCP implementations manage to dispatch multiple packets at the same time:

sliding_window_initial

A sliding window in TCP implementations works at octet level, not by segment or by packet. Each octet of the stream has a sequence number, and the dispatcher manages three pointers associated with each connection.

The first pointer indicates the start of the sliding window, separating the octets that have been dispatched and acknowledged from the ones that are in progress or still to be dispatched.

The second pointer indicates the highest octet that can be dispatched before getting the acknowledgments for the octets already dispatched.

The third pointer indicates the window limit, after which no dispatch can be done until the window slides further.

sliding_window_pointers

In TCP implementations the sliding window has, as previously mentioned, a dynamic size: each acknowledgement received also contains a declaration of how many octets the receiver can accept.
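
The three pointers and the dynamic window size can be modelled with a toy sender. This is a sketch for illustration only, not how any real TCP stack is written: the class and attribute names are my own, and I model the second pointer as the boundary between octets already sent and octets not yet sent inside the window.

```python
class SlidingWindowSender:
    """Toy model of the sender side of a TCP sliding window (octet level)."""

    def __init__(self, window_size):
        self.acked = 0          # first pointer: octets below are acknowledged
        self.next_to_send = 0   # second pointer: octets below have been sent
        self.window_size = window_size

    @property
    def window_limit(self):
        # third pointer: nothing at or beyond this octet may be sent yet
        return self.acked + self.window_size

    def sendable(self):
        """How many more octets may be sent before an ACK arrives."""
        return max(0, self.window_limit - self.next_to_send)

    def send(self, octets):
        """Send as much as the window allows; return the amount sent."""
        allowed = min(octets, self.sendable())
        self.next_to_send += allowed
        return allowed

    def acknowledge(self, ack_number, advertised_window):
        """Slide the window: the ACK also carries the receiver's new window."""
        if not self.acked <= ack_number <= self.next_to_send:
            raise ValueError("ACK outside the range of sent octets")
        self.acked = ack_number
        self.window_size = advertised_window


sender = SlidingWindowSender(window_size=1000)
print(sender.send(1500))       # only the window's worth goes out
sender.acknowledge(400, 1000)  # receiver confirms 400 octets, keeps window
print(sender.sendable())       # the window slid, more can be sent
sender.acknowledge(1000, 0)    # receiver is full: window put on hold
print(sender.send(10))
```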

From here the discussion becomes much more complex and is not pertinent to the scope of this article, but it is worth mentioning that this is where the whole concept of flow/congestion control comes into play.

For the sake of this article, and simplifying a lot, let us say that under optimal conditions in a local LAN the size of a segment will coincide with the maximum MTU available.

When sending traffic over the Internet, instead, the TCP implementation has to manage not only the optimal initial size, but also the possible and probable issues that arise while the transmission takes place, reducing and enlarging the sliding window as needed, and in some cases putting it on hold.

If you are really interested in how congestion control works in the TCP/IP world, I suggest this book (http://www.amazon.com/Adaptive-Congestion-Networks-Automation-Engineering-ebook/dp/B008NF05DQ)

Back to the topic

So we have seen how PING works, and the fact that it sends a simple echo request, using a very basic datagram, with no delivery or connection error handling.

On the other hand, we have seen how complex a TCP implementation is, and how important the concepts of adaptive transmission (which is also seriously affected by the available bandwidth) and congestion control are in it.

Keeping that in mind, there is no way we can compare a TCP/IP data transmission between two points with a simple ICMP/IP echo. No matter how large we make the datagram, or whether we declare it not to be fragmented, a ping follows a different protocol, and should never be used to validate data transmission in a TCP/IP scenario.

Back to square one: how can I evaluate my connection between two geographically distributed sites? In detail, how can I be sure the transfer will be efficient enough to support Galera replication?

When implementing Galera, some additional status metrics are added to MySQL; two of them are quite relevant to the purpose of this article:

wsrep_replicated_bytes
wsrep_received_bytes

If you are smart, and I am sure you are, you already have some historical monitoring in place, and before even thinking of going for geographically distributed replication you have tested and implemented MySQL/Galera using a local cluster.

As such you will have collected data about the two metrics above, and it will be easy to identify the MAX value. Be careful here: do not use the average.

Why? Because the whole exercise is to see whether the network can efficiently manage the traffic of a (virtually) synchronous connection without Galera flow control kicking in.
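
As a minimal sketch of why the peak matters more than the average, the following Python turns periodic samples of a cumulative byte counter (such as the two status values above) into per-interval rates. The sample values are invented for illustration, the function names are my own, and I assume the monitoring stores (timestamp in seconds, counter value) pairs.

```python
def rates(samples):
    """Convert (unix_time_seconds, counter_value) samples to bytes/second."""
    result = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        # skip intervals where the counter went backwards (server restart)
        if t1 > t0 and v1 >= v0:
            result.append((v1 - v0) / (t1 - t0))
    return result


def peak_and_average(samples):
    """Return (max rate, mean rate) over all usable intervals."""
    per_interval = rates(samples)
    if not per_interval:
        raise ValueError("need at least two usable samples")
    return max(per_interval), sum(per_interval) / len(per_interval)


# Invented samples: a quiet period, one burst, then quiet again
samples = [(0, 0), (10, 100_000), (20, 1_600_000), (30, 1_700_000)]
peak, average = peak_and_average(samples)
print(peak, average)
```

In this invented series the burst is almost three times the average, and it is the burst that the network must absorb if flow control is to stay out of the picture.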

As such, let us say we have 150Kb/s of replicated bytes and 300K of received bytes, and see what happens if we use PING and then another, more TCP-oriented tool.

In the following scenario I am first going to check whether the two locations are connected by a high-speed link that allows a 1500 MTU, then I will check the connection state/latency using the max size of 300K.

Test 1

[root@Machine1~]# ping -M  do -s 1432 -c 3 10.5.31.10
PING 10.5.31.10 (10.5.31.10) 1432(1460) bytes of data.
From 192.168.10.30 icmp_seq=1 Frag needed and DF set (mtu = 1398)
From 192.168.10.30 icmp_seq=2 Frag needed and DF set (mtu = 1398)
From 192.168.10.30 icmp_seq=2 Frag needed and DF set (mtu = 1398)

Test 2

[root@Machine1~]# ping -M  do -s 1371 -c 3 10.5.31.10
PING 10.5.31.10 (10.5.31.10) 1371(1399) bytes of data.
From 192.168.10.30 icmp_seq=1 Frag needed and DF set (mtu = 1398)
From 192.168.10.30 icmp_seq=1 Frag needed and DF set (mtu = 1398)
From 192.168.10.30 icmp_seq=1 Frag needed and DF set (mtu = 1398)

Test 3

[root@Machine1~]# ping -M  do -s 1370 -c 3 10.5.31.10
PING 10.5.31.10 (10.5.31.10) 1370(1398) bytes of data.
1378 bytes from 10.5.31.10: icmp_seq=1 ttl=63 time=50.4 ms
1378 bytes from 10.5.31.10: icmp_seq=2 ttl=63 time=47.5 ms
1378 bytes from 10.5.31.10: icmp_seq=3 ttl=63 time=48.8 ms

As you can see the link is not so bad: it supports an MTU of 1398 bytes (1370 bytes of ICMP data plus 28 bytes of headers), so in theory we should be able to have a decent connection.

Let us see what happens with PING:

[root@Machine1~]# ping   -c 3 10.5.31.10
PING 10.5.31.10 (10.5.31.10) 56(84) bytes of data.
64 bytes from 10.5.31.10: icmp_seq=1 ttl=63 time=49.6 ms
64 bytes from 10.5.31.10: icmp_seq=2 ttl=63 time=46.1 ms
64 bytes from 10.5.31.10: icmp_seq=3 ttl=63 time=49.7 ms

--- 10.5.31.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2052ms
rtt min/avg/max/mdev = 46.189/48.523/49.733/1.650 ms

[root@Machine1~]# ping -M  do -s 300 -c 3 10.5.31.10
PING 10.5.31.10 (10.5.31.10) 300(328) bytes of data.
308 bytes from 10.5.31.10: icmp_seq=1 ttl=63 time=50.5 ms
308 bytes from 10.5.31.10: icmp_seq=2 ttl=63 time=48.5 ms
308 bytes from 10.5.31.10: icmp_seq=3 ttl=63 time=49.6 ms

--- 10.5.31.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2053ms
rtt min/avg/max/mdev = 48.509/49.600/50.598/0.855 ms

From these tests it seems that we have more or less 49-50 ms latency, which is not really great, but manageable: we can play a little with flow control in Galera and have the sites communicating. With that I am NOT recommending setting up a Galera cluster with 50 ms latency; what I am saying is that, if desperate, something can be done, period.

But wait a minute, let us do the other test, and this time let us use a tool that is designed for checking the real condition of a network connection using TCP/IP.

For that I normally use NetPerf (there are other tools, but I like this one).

About NetPerf:

"Netperf is a benchmark that can be used to measure the performance of many different types of networking. It provides tests for both unidirectional throughput, and end-to-end latency. The environments currently measureable by netperf include:

TCP and UDP via BSD Sockets for both IPv4 and IPv6
DLPI
Unix Domain Sockets
SCTP for both IPv4 and IPv6"

(http://www.netperf.org/netperf/)

NetPerf uses a two-point connection, instantiating a server daemon on one machine and using a client application on the other to simulate the client-server scenario.

I strongly suggest you read the documentation, to better understand what it can do, how, and what the results mean.

Done? Ready? Let us go...

Given that NetPerf allows me to define the size of what I send and of what I receive, this time I can set the test properly and ask what the real effort will be:

[root@Machine1~]# netperf -H 10.5.31.10 -t TCP_RR -v 2 -- -b 6 -r 156K,300k
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.5.31.10 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec
65536  87380  159744   300000  10.00      30.20
65536  87380
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
8         0          0        0       231800.259 30.198   38.592     72.476

The tool uses a TCP round trip, and as such it simulates a real TCP connection/stream/sliding-window/traffic-congestion-control.

The result is 231800.259 microseconds of latency; other metrics are interesting too, but for consistency with ping let us stay on latency.

So we have 231800.259 microseconds, which means 231.8 milliseconds of real latency.

This result is quite different from the PING results; the reason, as said, is a totally different transport mechanism, which implies a different cost.
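
The figures in the NetPerf output can be cross-checked with a little arithmetic. The Python sketch below only reuses numbers printed above (latency, transaction rate, request and response sizes) plus the 49.6 ms average of the last ping run; the function names are my own.

```python
def usec_to_ms(microseconds):
    return microseconds / 1000.0


def throughput_mbit(transactions_per_sec, message_bytes):
    """Throughput in 10^6 bits/s for messages of the given size."""
    return transactions_per_sec * message_bytes * 8 / 1e6


netperf_latency_ms = usec_to_ms(231800.259)
ping_latency_ms = 49.6  # average of the 300-byte ping run above

print(round(netperf_latency_ms, 1))                    # latency in ms
print(round(netperf_latency_ms / ping_latency_ms, 1))  # netperf vs ping
print(round(throughput_mbit(30.198, 159744), 2))       # outbound
print(round(throughput_mbit(30.198, 300000), 2))       # inbound
```

The outbound and inbound values computed this way land within rounding of the 38.592 and 72.476 reported by NetPerf, and the latency is more than four times what ping suggested.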

With a figure of 231.8 ms, site-to-site synchronous replication is out of the question.

I can then focus and spend my time exploring different and more appropriate solutions.

Conclusion(s)

The conclusion is quite simple: whenever you need to validate that your network connection is up, PING is fine.

Whenever you need to measure the capacity of your network, PING is just dangerous, and you should really avoid using it.

Good MySQL to all.

Server ID misconfiguration can affect you more than you think.

Sub title: When you do stupid things, you get stupid things back, but sometimes it is good to discover odd behaviour.

The other day, I was having fun playing with the latest toy, mysql-cluster-gpl-7.3.2.

I was setting up a new cluster of 6 data nodes and 3 MySQL nodes; to do so I was using a simple bash script that helps me generate what I need in my environment.

While doing that I had a very stupid idea: to have the script recognize the machines' IPs and PORTs I had set, then use them as SERVER IDs.

I know, I know, it is not a good way to define a SERVER ID. Anyhow I did it, so the result was that in my my.cnf my server_id looked like:

server-id = 1921680355510

Some of you already see what kind of stupid thing this is, but let me not get ahead of myself.

So, as usual, I started the MySQL node first to generate a brand new database; to do so I first disabled ndbcluster, and all went fine.

Then I started my management nodes and my data nodes.

I was quite happy, and I finally attached my MySQL node to the rest of the cluster.

But the node failed to start. I was quite surprised, so I opened the log and saw:

2013-10-28 10:20:58 2692 [Note] NDB[3]: NodeID: 77, all storage nodes connected
2013-10-28 10:20:58 2692 [ERROR] NDB: server id provided is too large to be represented in opt_server_id_bits or is reserved
mysqld: /usr/src/mysql-cluster-gpl-7.3.2/sql/ndb_component.cc:139: virtual int Ndb_component::deinit(): Assertion 'm_thread_state == TS_STOPPED' failed.
14:20:58 UTC - mysqld got signal 6 ;

What??? My fantastic server_id of 13!!!!! digits is too long?

Hey, wait a minute?! How many digits??

At that point I just hit the keyboard with my head, realizing what I had done.

headdesk

Server id is an unsigned integer; its max value is 4294967295, which has 10 digits.

Once more my friend NDB was right and I was trying to do a stupid thing.

But when I realized it, I also wondered: why do I get the error when activating ndbcluster, and NOT when I have MySQL running without it?

This kind of check should be done at a common level at MySQL start-up, so why do I get it only with MySQL and NDB active?

I dug a little bit and found the place where NDB is doing the check.

The if condition where the check is executed is quite simple and clear.

....

sql::ha_ndbcluster_binlog.cc
 
int ndbcluster_binlog_start()
{
  DBUG_ENTER("ndbcluster_binlog_start");
 
  if (::server_id == 0)
  {
    sql_print_warning("NDB: server id set to zero - changes logged to "
                      "bin log with server id zero will be logged with "
                      "another server id by slave mysqlds");
  }
 
  /* 
     Check that ServerId is not using the reserved bit or bits reserved
     for application use
  */
  if ((::server_id & 0x1 << 31) ||                             // Reserved bit <--------------------- Check
      !ndbcluster_anyvalue_is_serverid_in_range(::server_id))  // server_id_bits
  {
    sql_print_error("NDB: server id provided is too large to be represented in "
                    "opt_server_id_bits or is reserved");
    DBUG_RETURN(-1);
  }

But again, why is the error not reported by the MySQL core?

Because MySQL is silently converting my server id to its maximum value; as such my stupid server id of 1921680355510 becomes 4294967295.
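
The two behaviours can be reproduced with plain integer arithmetic. The Python sketch below is only a model of what was described above, not MySQL or NDB code: it clamps the value the way the MySQL core does, then applies the first half of the NDB condition (the reserved bit 31). The second half of that condition, ndbcluster_anyvalue_is_serverid_in_range, is not modelled, and the function names are my own.

```python
UINT32_MAX = 2**32 - 1  # 4294967295, the largest 32-bit unsigned value


def clamp_to_uint32(value):
    """Model of the silent adjustment: too-large values become the maximum."""
    return min(value, UINT32_MAX)


def uses_reserved_bit(server_id):
    """Model of the first NDB check: server_id & (1 << 31)."""
    return bool(server_id & (1 << 31))


configured = 1921680355510
effective = clamp_to_uint32(configured)

print(len(str(configured)))          # number of digits in my server id
print(effective)                     # what the MySQL core ends up using
print(uses_reserved_bit(effective))  # why NDB refuses to start
```

Since the maximum value has every bit set, the clamped id necessarily has bit 31 set as well, so the NDB check trips on the value MySQL itself produced.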

Oh my...!

An identifier is managed as a quantity!

I have done a very stupid thing setting the server_id that way, but honestly I don't like this behaviour either.

As we know, MySQL silently adjusts its behaviour in relation to the SQL_MODE, but the way it adjusts parameters should be something different.

I agree that variables like max_connections CAN be adjusted, but others like server_id, well no; I think it is wrong, given that those are identifiers and not quantities.

Server ID is an ID; as such it must be validated, and if it does not pass validation an ERROR must be returned and the server should not start at all.

Instead MySQL currently just raises a WARNING and continues to run, not caring whether there will be another server with the same id because of this "adjustment".

131028 14:00:58 mysqld_safe Starting mysqld daemon with databases from /home/mysql/instances/csite3/data
2013-10-28 14:00:58 0 [Warning] option 'server_id': unsigned value 192168075510 adjusted to 4294967295 <----------------------------------------------

I was really "not happy" about this, so I went into the code and checked where the server ID is evaluated and modified.

So this is the call chain:

mysqld::mysqld_main
	=>mysqld::init_common_variables
		=> my_getopt::init_variables(longopts, init_one_value);
			=> my_getopt::init_one_value
				=> my_getopt::getopt_ull_limit_value

So what exactly is happening here?

MySQL initializes by reading the options in my_getopt and, on the basis of the data type, passes each value to getopt_ull_limit_value (the variant used for unsigned values such as server_id).

There, whatever numeric argument is passed, if it exceeds the max value it is reset to that max value. At this stage there is no logic to manage values like Server ID, which is an identifier and not a quantity.

This is obviously something that can be managed better, and actually the fact that NDB prevents the server from going ahead is by far more correct, because the Server ID is used internally to recognize the server.

I have done a quick check, and I have seen that at least the following functions are using the server_id variable in the new replication mechanism.

rpl_master.cc
int register_slave(THD* thd, uchar* packet, uint packet_length)
void unregister_slave(THD* thd, bool only_mine, bool need_lock_slave_list)
bool show_slave_hosts(THD* thd)
static int fake_rotate_event(NET* net, String* packet, char* log_file_name,ulonglong position, const char** errmsg,uint8 checksum_alg_arg)
static int send_heartbeat_event(NET* net, String* packet,const struct event_coordinates *coord,uint8 checksum_alg_arg)
bool com_binlog_dump(THD *thd, char *packet, uint packet_length)
bool com_binlog_dump_gtid(THD *thd, char *packet, uint packet_length)
void mysql_binlog_send(THD* thd, char* log_ident, my_off_t pos,const Gtid_set* slave_gtid_executed)

I have not investigated further to see what happens if two different servers have the same ID, because in any case it is conceptually wrong.

Just for fun I have created a horrible patch to manage the mismatch, adding a check on the parameter name.

Obviously that should be something different, like an attribute of the my_option object that indicates whether the option is a quantity, an identifier or whatever else, and then decide what to do on that basis.
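
The idea of an attribute on the option can be sketched as follows. This is a conceptual Python model, not MySQL code: the Option class, its is_identifier flag and the limit_value function are all hypothetical names, and the limits used in the example are only illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    max_value: int
    is_identifier: bool = False  # hypothetical attribute: ID vs quantity


def limit_value(option, value):
    """Clamp quantities, but refuse out-of-range identifiers."""
    if value <= option.max_value:
        return value
    if option.is_identifier:
        # an identifier silently changed is a different identifier
        raise ValueError(
            f"option '{option.name}': value {value} exceeds "
            f"{option.max_value} and cannot be adjusted"
        )
    return option.max_value  # a quantity can safely be reduced


max_connections = Option("max_connections", 100000)
server_id = Option("server_id", 4294967295, is_identifier=True)

print(limit_value(max_connections, 500000))  # adjusted to the maximum
try:
    limit_value(server_id, 1921680355510)
except ValueError as error:
    print(error)                             # start-up should stop here
```

With a flag like this the limit function does not need to know any option by name, which is what makes the name check in the patch below so ugly.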

Anyhow, even a stupid and horrible thing like the check on the name is better than not having it at all:

ulonglong getopt_ull_limit_value(ulonglong num, const struct my_option *optp,
                                 my_bool *fix)
{
  my_bool adjusted= FALSE;
  ulonglong old= num;
  char buf1[255], buf2[255];
  const ulonglong max_of_type=
    max_of_int_range(optp->var_type & GET_TYPE_MASK);
 
  /* Horrible hack: recognize the option by its name; 0 means this is server_id */
  const int result= strcmp(optp->name, "server_id");
 
 
 
  if ((ulonglong) num > (ulonglong) optp->max_value &&
      optp->max_value) /* if max value is not set -> no upper limit */
  {
    num= (ulonglong) optp->max_value;
    adjusted= TRUE;
    if(result == 0){
           my_getopt_error_reporter(ERROR_LEVEL,
           "option '%s': unsigned value %s CANNOT BE adjusted to %s please review your settings, and start MySQL again",
           optp->name, llstr(old, buf1), llstr(num, buf2));
           exit(1);
 
    }
  }
  if (num > max_of_type)
  {
    num= max_of_type;
    adjusted= TRUE;
  }

Running the server with the fix, I finally had a consistent behaviour, with my favorite MySQL server doing the right thing:

131029 13:38:03 mysqld_safe mysqld from pid file /home/mysql/instances/csite3/mysql.pid ended
131029 13:41:11 mysqld_safe Starting mysqld daemon with databases from /home/mysql/instances/csite3/data
2013-10-29 13:41:11 0 [ERROR] 'option server_id' unsigned value 192168075510 CANNOT BE adjusted to 4294967295 please review your settings, and start MySQL again
131029 13:41:12 mysqld_safe mysqld from pid file /home/mysql/instances/csite3/mysql.pid ended

Conclusion

Something interesting can come out of a stupid action, at least for me.

I would like to classify this as a bug, but given it is a conceptual thing, maybe it is more of a what? Feature request? Logic fix?

Finally, my conclusion is that MySQL is a great product, but it is growing fast and sometimes things are left behind; this is a small but possibly dangerous one.

Understand what happens in MySQL when using UTF strings with Latin1 encoding...

... And how to get out of it.

Why this article?

The scope of the article is to shed light on what happens behind the scenes when an application pushes UTF data to a MySQL server using Latin1 encoding.

I will show how the characters are stored, what exactly happens if you mix UTF8 and latin1, and what is transformed to what and why.

Finally I will show the safer way to convert a table, or simply a table field, without exporting the full dataset.

I will also show a good way to do it in a master-slave scenario, with minimal impact on the live dataset.

What I am not going to discuss here, assuming you are already familiar with it, is:

How to assign a character set or a collation to a table or fields;
How collation works in sorting;
How to set the default character set in server and connecting clients.

If any of these points is not clear to you, please read here first (http://dev.mysql.com/doc/refman/5.6/en/charset.html)

For my exercise I will use two tables, two MySQL server instances, and Arabic text.

Please note that I have copied this text from the official United Nations page; as such I hope it does not contain any offensive text.
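
Before looking at the servers, the core effect can be previewed outside MySQL with a few lines of Python. This is only an approximation of what the server does: Python's latin-1 codec is ISO-8859-1, while MySQL's latin1 is closer to cp1252, and the function names are my own. The sample word is simply the Arabic word for "Arabic".

```python
def store_utf8_as_latin1(text):
    """UTF-8 bytes read back as if every byte were one latin1 character."""
    return text.encode("utf-8").decode("latin-1")


def recover(mojibake):
    """Undo the misreading: back to the raw bytes, then decode as UTF-8."""
    return mojibake.encode("latin-1").decode("utf-8")


original = "العربية"
as_latin1 = store_utf8_as_latin1(original)

print(len(original), len(as_latin1))   # characters before and after
print(original.encode("utf-8").hex())  # the bytes that are really stored
print(recover(as_latin1) == original)  # nothing is lost, only misread
```

Each Arabic character takes two bytes in UTF-8, so the same bytes interpreted as latin1 show up as twice as many characters. The bytes themselves are untouched, which is why the data can be recovered as long as nothing converts it a second time.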

Step one, check the status of our servers:

Master

(root@localhost) [(none)]>\s

--------------

/opt/mysql_templates/mysql-55o/bin/mysql  Ver 14.14 Distrib 5.5.27, for linux2.6 (x86_64) using readline 5.1

Connection id:		2

Current database:

Current user:		root@localhost

SSL:			Not in use

Current pager:		stdout

Using outfile:		''

Using delimiter:	;

Server version:		5.5.27-log MySQL Community Server (GPL)

Protocol version:	10

Connection:		Localhost via UNIX socket

Server characterset:	latin1

Db     characterset:	latin1

Client characterset:	latin1

Conn.  characterset:	latin1

UNIX socket:		/opt/mysql_instances/instances/derekutfm/mysql.sock

Uptime:			2 min 34 sec

Threads: 3  Questions: 16  Slow queries: 0  Opens: 34  Flush tables: 1  Open tables: 26  Queries per second avg: 0.103

Slave

------------

(root@localhost) [(none)]>\s

--------------

/opt/mysql_templates/mysql-55o/bin/mysql  Ver 14.14 Distrib 5.5.27, for linux2.6 (x86_64) using readline 5.1

Connection id:		4

Current database:

Current user:		root@localhost

SSL:			Not in use

Current pager:		stdout

Using outfile:		''

Using delimiter:	;

Server version:		5.5.27-log MySQL Community Server (GPL)

Protocol version:	10

Connection:		Localhost via UNIX socket

Server characterset:	utf8

Db     characterset:	utf8

Client characterset:	latin1

Conn.  characterset:	latin1

UNIX socket:		/opt/mysql_instances/instances/derekutfs/mysql.sock

Uptime:			10 min 43 sec

Threads: 2  Questions: 8  Slow queries: 0  Opens: 34  Flush tables: 1  Open tables: 26  Queries per second avg: 0.012

As you can see, I have defined the MASTER to use latin1 as the default for the server as well, while for the SLAVE I chose UTF8.

In this exercise I am going to use only UTF8 for the examples, never UTF8mb4.
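
As a side note on why plain UTF8 is enough here: MySQL's utf8 stores at most three bytes per character, while utf8mb4 allows four. The following is a small Python sketch, not part of the original test, that measures the widest character of an Arabic word taken from the sample text; the helper name max_utf8_width and the emoji used for contrast are only illustrative assumptions.

```python
# Hypothetical check, not part of the original test: how many bytes does each
# character need in UTF-8?
arabic_sample = "\u0627\u0644\u0642\u0645\u062d"   # an Arabic word from the sample text
emoji = "\U0001F600"                               # a character outside the BMP, for contrast


def max_utf8_width(text: str) -> int:
    """Largest number of UTF-8 bytes needed by any single character in text."""
    return max(len(ch.encode("utf-8")) for ch in text)


print(max_utf8_width(arabic_sample))  # 2
print(max_utf8_width(emoji))          # 4
```

The Arabic letters need two bytes each, so they fit in the three-byte utf8 character set used in this exercise.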

First create the table.

CREATE TABLE `utf_test` (
  `ID` integer AUTO_INCREMENT PRIMARY KEY,
  `notutf` varchar(250) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL,
  `yesutf` varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `binutf` varbinary(250) DEFAULT NULL,
  `different` varchar(250) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Now let us start inserting data and see what happens:

insert into utf_test values(null,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', ' this is a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', 'Am I the same');

 Master
 -------------
(root@localhost) [test]>insert INTO utf_test VALUES(NULL,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', ' this is a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', 'Am I the same');
*************************** 1. row ***************************
       ID: 1
   notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8
   yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
   binutf:  this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
different: Am I the same
1 row IN SET (0.00 sec)
 
 
Slave
---------------
(root@localhost) [test]>set names latin1;
Query OK, 0 rows affected (0.00 sec)
(root@localhost) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8
   yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
   binutf:  this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
different: Am I the same
1 row IN SET (0.00 sec)

All good, given that the values are passed consistently.

Now let us add some text to the "different" field and see if we can still handle it correctly:

(root@localhost) [test]>insert INTO utf_test VALUES(NULL,'simple insert not utf in non utf fields, but using Master all Latin1 and Slave with server default UTF8', 'إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', ' this is a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا', 'Am I the same? لصدأ القمح إلى كينيا');
Master
--------------
(root@localhost) [test]>select * FROM utf_test\G
       ID: 3
   notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8
   yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
   binutf:  this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
different: Am I the same? لصدأ القمح إلى كينيا
2 rows IN SET (0.00 sec)
 
Slave
----------------------------
       ID: 3
   notutf: simple INSERT NOT utf IN non utf FIELDS, but USING Master ALL Latin1 AND Slave WITH server DEFAULT UTF8
   yesutf: إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
   binutf:  this IS a mix of it 3 إستقدام أصناف محصولية مقاومة لصدأ القمح إلى كينيا
different: Am I the same? لصدأ القمح إلى كينيا

OK, once more all good.
Is this real UTF8 or, as we have sometimes seen, fake UTF thanks to the Latin1 push?
Let us see by inserting another row, with the UTF field "yesutf" and "different" containing the same apparent data.

insert into utf_test values(null,'simple insert same text everywhere', 'ح إلى كينيا', 'ح إلى كينيا', 'ح إلى كينيا');

Master
-----------
(root@localhost) [test]>insert INTO utf_test VALUES(NULL,'simple insert same text everywhere', 'ح إلى كينيا', 'ح إلى كينيا', 'ح إلى كينيا');
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec)
 
Slave
------------------
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec)

Checking the hex:

Master
-----
(root@localhost) [test]>select hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
 
Slave 
-------------
(root@localhost) [test]>select hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)

As expected, the hex code differs between the field defined as UTF8 and the one defined as Latin1, and yet I see them all correctly. Why?

Look what happens if I change the NAMES to UTF8:

(root@localhost) [test]>set names utf8;
 
(root@localhost) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: Ø Ø¥Ù„Ù‰ ÙƒÙŠÙ†ÙŠØ§
   binutf: ح إلى كينيا
different: Ø Ø¥Ù„Ù‰ ÙƒÙŠÙ†ÙŠØ§
1 row IN SET (0.00 sec)

Now only the binutf (varbinary) is shown correctly.

So, just to recap what happens if I use latin1.
Using latin1

I see everything fine:

(root@localhost) [test]>set names latin1;
(root@localhost) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
1 row IN SET (0.00 sec)

But the hex values do not match:

(root@localhost) [test]>select hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)

If I do a reverse conversion using UNHEX:

(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,unhex(hex(yesutf)) AS yesutf_hex,unhex(hex(binutf)) AS binutf_hex,unhex(hex(different)) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: Ø Ø¥Ù„Ù‰ ÙƒÙŠÙ†ÙŠØ§
   binutf_hex: ح إلى كينيا
different_hex: ح إلى كينيا
1 row IN SET (0.00 sec)

Again I see something different. Is it confusing?

No.

It is quite clear that the varbinary and the latin1 varchar store the string using the same hexadecimal values, while yesutf is different because it was expecting a UTF string.

Let us analyze JUST the first character of the Arabic string: ح

This letter corresponds to:

Unicode code point U+062D

character: ح

UTF8 characters: Ø

UTF8 hex: d8 ad

Meaning: ARABIC LETTER HAH

What happens is that I am inserting the correct codes, but they arrive over a latin1 connection, so for the UTF8 field each byte is read as one LATIN1 character; as such the first byte, D8, is transformed to Ø.

The entry for that character (Ø) in the Unicode and UTF-8 encoding tables is:

Unicode code point U+00D8

character: Ø

UTF8 characters: Ã

UTF8 hex: c3 98

Meaning: LATIN CAPITAL LETTER O WITH STROKE

Comparing the values above with the HEX values we had before, you can see that they match perfectly:

yesutf_hex: C398

binutf_hex: D8AD

different_hex: D8AD
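
The same byte arithmetic can be reproduced outside MySQL. What follows is a minimal Python sketch, not part of the original test. It assumes that Python's cp1252 codec is a fair stand-in for MySQL's latin1 for these particular bytes, and the variable names are only illustrative.

```python
# Hypothetical illustration: reproduce the double encoding of U+062D outside MySQL.
hah = "\u062d"                      # ARABIC LETTER HAH

utf8_bytes = hah.encode("utf-8")    # the bytes the application actually sends
print(utf8_bytes.hex())             # d8ad

# A latin1 connection treats each byte as one character
# (cp1252 stands in for MySQL's latin1 here: an assumption of this sketch).
as_latin1_chars = utf8_bytes.decode("cp1252")

# A utf8 column then stores those two characters re-encoded as UTF-8.
stored_in_utf8_column = as_latin1_chars.encode("utf-8")
print(stored_in_utf8_column.hex())  # c398c2ad
```

The first two bytes of the result, c3 98, are the Ø seen above; the following c2 ad is the second byte (AD) after the same treatment.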

So far I hope everything is clear.

The fields using LATIN1 and UTF8 are both shown correctly when using NAMES latin1. For the fields with latin1 encoding, this is because the stored UTF8 hex values are passed to the client as they are.

For the fields with UTF8 encoding, instead, the server performs a double conversion of the codes. That conversion is NOT executed when doing UNHEX, and in that case what is shown is the REAL value of the stored codes.

Again, the only SAFE one during these operations is the varbinary, which stores the codes without binding them to any encoding.
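
To make the double conversion more tangible, here is a simplified Python model of what a client receives from the utf8 column under the two NAMES settings. It is only a sketch under stated assumptions: cp1252 stands in for MySQL's latin1, the function sent_to_client is hypothetical, and the real server logic is of course more involved.

```python
# Hypothetical model of what the client receives for the double-encoded utf8 column.
# Assumption: cp1252 stands in for MySQL's latin1.
stored = bytes.fromhex(
    "C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0"
    "C399E280A0C399C5A0C398C2A7"
)  # the yesutf_hex value shown earlier


def sent_to_client(column_bytes: bytes, names: str) -> bytes:
    """Bytes a utf8 column would send for a given SET NAMES value (simplified)."""
    chars = column_bytes.decode("utf-8")           # characters the server believes it holds
    codec = "cp1252" if names == "latin1" else "utf-8"
    return chars.encode(codec)                     # conversion to the connection charset


with_latin1 = sent_to_client(stored, "latin1")
with_utf8 = sent_to_client(stored, "utf8")

# With NAMES latin1 the second conversion undoes the first one, so the
# terminal gets the original UTF-8 bytes of the Arabic text.
print(with_latin1.hex().upper())

# With NAMES utf8 nothing is undone: the client gets the stored bytes as they are.
print(with_utf8 == stored)
```

Under these assumptions the latin1 case gives back D8AD20D8A5D984D98920D983D98AD986D98AD8A7, the same bytes held by the varbinary field.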

So, for instance, if you have your whole database using LATIN1 and you want to convert it to UTF8, you cannot just do:

(root@localhost) [test]>alter table utf_corrupt modify different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;

Because the text in your tables will be converted and will become unusable with UTF8 clients. See below:

Encoding set as LATIN1:

Client characterset: latin1

Conn. characterset: latin1

I am going to do the conversion:

(root@localhost) [test]>alter TABLE utf_corrupt MODIFY different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec)
 
SET NAMES UTF8;
 
	Client characterset:	utf8
	Conn.  characterset:	utf8
 
 
(root@localhost) [test]>select * FROM utf_corrupt\G
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: Ø Ø¥Ù„Ù‰ ÙƒÙŠÙ†ÙŠØ§
   binutf: ح إلى كينيا
different: Ø Ø¥Ù„Ù‰ ÙƒÙŠÙ†ÙŠØ§

The text is a mess again, because it has been transformed:

(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7 <---------
1 row IN SET (0.00 sec)

Starting again from the beginning, the table is:

(root@localhost) [test]>show CREATE TABLE utf_corrupt\G
*************************** 1. row ***************************
       TABLE: utf_corrupt
CREATE TABLE: CREATE TABLE `utf_corrupt` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `notutf` varchar(250) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL,
  `yesutf` varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `binutf` varbinary(250) DEFAULT NULL,
  `different` varbinary(250) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=latin1
1 row IN SET (0.00 sec)

The data is:

*************************** 3. row ***************************
       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: ح إلى كينيا
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec)

(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)

And the encoding is: Client characterset: latin1, Conn. characterset: latin1

To do it properly you have to do a three-step conversion, namely varchar -> varbinary -> varchar, as follows:

(root@localhost) [test]>alter TABLE utf_corrupt MODIFY different varbinary(250) DEFAULT NULL;
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0
 
(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7 <-------------- 
1 row IN SET (0.00 sec)

Now all is fine and the string was not transformed. Now change the encoding with NAMES:

Set NAMES utf8;

Convert back to varchar and UTF:

(root@localhost) [test]>alter TABLE utf_corrupt MODIFY different varchar(250) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;
Query OK, 3 rows affected (0.02 sec)
Records: 3  Duplicates: 0  Warnings: 0

And now, yes, you have the text stored correctly.

       ID: 5
   notutf: simple INSERT same text everywhere
   yesutf: Ø Ø¥Ù„Ù‰ ÙƒÙŠÙ†ÙŠØ§
   binutf: ح إلى كينيا
different: ح إلى كينيا
3 rows IN SET (0.00 sec)
 
(root@localhost) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_corrupt WHERE id=(SELECT max(id) FROM utf_corrupt)\G
*************************** 1. row ***************************
           id: 5
    noutf_hex: 73696D706C6520696E736572742073616D6520746578742065766572797768657265
   yesutf_hex: C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)

All good! My data has been correctly moved from a field based on varchar LATIN1 to the new varchar UTF8. The most important aspect is to make sure you use the right encoding and do not allow the server to convert your text to the wrong hex values.
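
The difference between the two paths can be summarized with a short Python sketch. It is an illustration only, not the server code: cp1252 is assumed as a stand-in for MySQL's latin1, and both function names are hypothetical.

```python
# Hypothetical sketch of the two conversion paths for the `different` column.
# Assumption: cp1252 stands in for MySQL's latin1.
latin1_column = bytes.fromhex("D8AD20D8A5D984D98920D983D98AD986D98AD8A7")


def modify_latin1_to_utf8(stored: bytes) -> bytes:
    """Direct ALTER: every stored byte is one latin1 character, re-encoded as UTF-8."""
    return stored.decode("cp1252").encode("utf-8")


def modify_via_varbinary(stored: bytes) -> bytes:
    """varchar -> varbinary -> varchar utf8: the bytes are kept, only the label changes."""
    as_binary = stored           # varbinary keeps the bytes untouched
    as_binary.decode("utf-8")    # raises UnicodeDecodeError if the bytes are not valid UTF-8
    return as_binary


direct = modify_latin1_to_utf8(latin1_column)
safe = modify_via_varbinary(latin1_column)

print(direct.hex().upper())   # the transformed C398C2AD... sequence
print(safe.hex().upper())     # the original D8AD... sequence
```

The direct path produces the same transformed hex seen in the failed ALTER above, while the path through varbinary leaves the bytes exactly as they were.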

For a significant migration, if you have a master-slave setup, it is possible to play with the two servers to minimize the impact of the migration, following these steps:

Check that everything is latin1
Convert the slave to varbinary
Insert data on the master and let it replicate to the slave, keeping everything as latin1
Swap the servers
Convert the ex-master to varbinary
When done, change all application connections to utf8 on the master host
Convert the tables back to varchar utf8 on the slave
Swap master and slave again
Convert the slave tables to UTF8

It seems cumbersome, but if you cannot afford to export the data, change the table encoding and reload the data, and you cannot stop the writes for a long time, this is the safer way. The negative aspect is that varbinary uses a different approach for sorting: it is based on the byte values stored, and you cannot use a collation to modify it.
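
The effect on sorting is easy to picture with a small Python sketch. The word list is made up for the illustration, and the case-insensitive sort is only a rough stand-in for what a collation does; real collations apply many more rules.

```python
# Hypothetical illustration of binary ordering versus a collation-like ordering.
names = ["banana", "Cherry", "apple"]

# varbinary-style ordering: compare raw byte values, so every uppercase ASCII
# letter (0x41-0x5A) sorts before every lowercase one (0x61-0x7A).
binary_order = sorted(names, key=lambda s: s.encode("utf-8"))

# Rough stand-in for a case-insensitive collation.
collation_like_order = sorted(names, key=lambda s: s.casefold())

print(binary_order)          # ['Cherry', 'apple', 'banana']
print(collation_like_order)  # ['apple', 'banana', 'Cherry']
```

This is why, while the field is varbinary, ORDER BY results may differ from what the application is used to.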

Let's go through the process:

Step 1) Check that all is latin1

Master
------------
(root@127.0.0.1:5510) [test]>\s
--------------
/opt/mysql_templates/mysql-55o/bin/mysql  Ver 14.14 Distrib 5.5.27, FOR linux2.6 (x86_64) USING readline 5.1
 
Connection id:		11
Current DATABASE:	test
Current user:		root@localhost
SSL:			NOT IN USE
Current pager:		stdout
USING OUTFILE:		''
USING delimiter:	;
Server version:		5.5.27-log MySQL Community Server (GPL)
Protocol version:	10
Connection:		Localhost via UNIX socket
Server characterset:	latin1
Db     characterset:	latin1
Client characterset:	latin1
Conn.  characterset:	latin1
UNIX socket:		/opt/mysql_instances/instances/derekutfm/mysql.sock
Uptime:			2 days 3 hours 3 min 49 sec
 
Threads: 3  Questions: 198  Slow queries: 0  Opens: 89  FLUSH TABLES: 1  Open TABLES: 27  Queries per second avg: 0.001
--------------
 
Slave
---------------------
(root@127.0.0.1:5511) [test]>\s
--------------
/opt/mysql_templates/mysql-55o/bin/mysql  Ver 14.14 Distrib 5.5.27, FOR linux2.6 (x86_64) USING readline 5.1
 
Connection id:		9
Current DATABASE:	test
Current user:		root@localhost
SSL:			NOT IN USE
Current pager:		stdout
USING OUTFILE:		''
USING delimiter:	;
Server version:		5.5.27-log MySQL Community Server (GPL)
Protocol version:	10
Connection:		Localhost via UNIX socket
Server characterset:	utf8
Db     characterset:	utf8
Client characterset:	latin1
Conn.  characterset:	latin1
UNIX socket:		/opt/mysql_instances/instances/derekutfs/mysql.sock
Uptime:			2 days 3 hours 3 min 45 sec
 
Threads: 2  Questions: 177  Slow queries: 0  Opens: 87  FLUSH TABLES: 1  Open TABLES: 27  Queries per second avg: 0.000

Insert some data on the master, just to have a couple of rows.

(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 1', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 2', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)

Check the values:

Master
----------------------
(root@127.0.0.1:5510) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 2. row ***************************
       ID: 3
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
2 rows IN SET (0.00 sec)
(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 
  
Slave
-------------------------
(root@127.0.0.1:5511) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 2. row ***************************
       ID: 3
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
2 rows IN SET (0.00 sec)
(root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)

Step 2) Convert the slave to varbinary, keeping the same encoding (NAMES latin1)

(root@127.0.0.1:5511) [test]>alter TABLE utf_test MODIFY different varbinary(250);
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0
 
(root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7 <------------------
1 row IN SET (0.00 sec)

The codes remain consistent.

Step 3) Production continues to work as usual, no impact.

Insert some other records on the master:

(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 2', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP1: insert data 2', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 2. row ***************************
       ID: 3
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 3. row ***************************
       ID: 5
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 4. row ***************************
       ID: 7
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
4 rows IN SET (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test\G
*************************** 1. row ***************************
           id: 1
    noutf_hex: 53544550313A20696E7365727420646174612031
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 2. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 3. row ***************************
           id: 5
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 4. row ***************************
           id: 7
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
4 rows IN SET (0.00 sec)
 
On the Slave
----------------------
(root@127.0.0.1:5511) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 2. row ***************************
       ID: 3
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 3. row ***************************
       ID: 5
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
*************************** 4. row ***************************
       ID: 7
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
4 rows IN SET (0.00 sec)
 
(root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test\G
*************************** 1. row ***************************
           id: 1
    noutf_hex: 53544550313A20696E7365727420646174612031
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 2. row ***************************
           id: 3
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 3. row ***************************
           id: 5
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
*************************** 4. row ***************************
           id: 7
    noutf_hex: 53544550313A20696E7365727420646174612032
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
4 rows IN SET (0.00 sec)

Step 4) Swap the servers

5511 becomes the master and 5510 the slave (but stop the replication from 5511 to 5510, and also stop the previous replication from 5510 to 5511). Production will continue to work for writes; ORDER BY will be affected for the moment.

On new Master 5511:

*************************** 6. row ***************************
       ID: 11 <----------------------------------------------
   notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
6 rows IN SET (0.00 sec)
 
On new Slave 5510:
*************************** 4. row ***************************
       ID: 7 <--------------------------------
   notutf: STEP1: INSERT DATA 2
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
4 rows IN SET (0.00 sec)

Step 5) Modify the table to varbinary on the new Slave:

(root@127.0.0.1:5510) [test]>alter TABLE utf_test MODIFY different varbinary(250);
Query OK, 4 rows affected (0.02 sec)
Records: 4  Duplicates: 0  Warnings: 0

And now restart replication on 5510, so it can replicate from the master 5511.

(root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 11
   notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
 
1 row IN SET (0.00 sec)

(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 11
    noutf_hex: 53544550343A20696E73657274696E6720646174612066726F6D2045582D534C4156452061667465722073776170207265706C69636174696F6E20697320646F776E20
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)

All good.

Step 6) Change all application connections to UTF8

In our example I will change the NAMES on the new Master 5511 and on the slave 5510

SET NAMES UTF8;

Server characterset: utf8

Db characterset: utf8

Client characterset: utf8

Conn. characterset: utf8

Master
--------------------------
(root@127.0.0.1:5511) [test]>set names utf8;
Query OK, 0 rows affected (0.00 sec)
 
(root@127.0.0.1:5511) [test]>select * FROM utf_test\G
*************************** 1. row ***************************
       ID: 1
   notutf: STEP1: INSERT DATA 1
   yesutf: Ø§Ù„Ù‚Ù…Ø Ø¥Ù„Ù‰ ÙƒÙŠÙ†ÙŠØ§
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
...
*************************** 6. row ***************************
       ID: 11
   notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: Ø§Ù„Ù‚Ù…Ø Ø¥Ù„Ù‰ ÙƒÙŠÙ†ÙŠØ§
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
6 rows IN SET (0.00 sec)

Note how the values in yesutf were NOT correctly converted, given the initial double translation.
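
If you need to tell such double-translated values apart from correctly stored ones, a heuristic check on the raw bytes can help. The following Python sketch is only an assumption-laden illustration: cp1252 stands in for MySQL's latin1, the function name looks_double_encoded is hypothetical, and a heuristic like this can be fooled by unusual data.

```python
# Hypothetical heuristic: does a value stored in a utf8 column look double-encoded?
# Assumption: cp1252 stands in for MySQL's latin1.
def looks_double_encoded(stored: bytes) -> bool:
    """True if undoing one latin1 -> utf8 step yields different, still valid UTF-8."""
    try:
        candidate = stored.decode("utf-8").encode("cp1252")
        candidate.decode("utf-8")
    except UnicodeError:
        # Either the characters do not exist in cp1252 (real Arabic, for example)
        # or the result is not valid UTF-8: not a double-encoded value.
        return False
    return candidate != stored


double = bytes.fromhex(
    "C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809E"
    "C399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7"
)  # yesutf_hex of the rows inserted with NAMES latin1
correct = bytes.fromhex("D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7")

print(looks_double_encoded(double))    # True
print(looks_double_encoded(correct))   # False
print(looks_double_encoded(b"STEP1"))  # False, plain ASCII is unchanged
```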

Slave
---------------
Server characterset:	utf8
Db     characterset:	utf8
Client characterset:	utf8
Conn.  characterset:	utf8
 
(root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 11
   notutf: STEP4: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: Ø§Ù„Ù‚Ù…Ø Ø¥Ù„Ù‰ ÙƒÙŠÙ†ÙŠØ§
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
1 row IN SET (0.00 sec)

Step 7) Convert the table on the SLAVE back to varchar, this time using utf8

(root@127.0.0.1:5510) [test]>alter TABLE utf_test MODIFY different varchar(250)  CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;
Query OK, 7 rows affected (0.01 sec)
Records: 7  Duplicates: 0  Warnings: 0
 
(root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 15
   notutf: STEP7: inserting DATA FROM EX-SLAVE after swap replication IS down 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
1 row IN SET (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 15
    noutf_hex: 53544550343A20696E73657274696E6720646174612066726F6D2045582D534C4156452061667465722073776170207265706C69636174696F6E20697320646F776E20
   yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)

Step 8) Given that all is good, put the 5510 back as master.

Now that the 5510 has the fields defined as varchar with UTF8 encoding/collation, ORDER BY will also work correctly again.

At this point data will continue to be passed from master to slave.

ON Master
---------------------
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP8: inserting data from master 5510 after swap server again ', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
Query OK, 1 row affected (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select * FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
       ID: 17
   notutf: STEP8: inserting DATA FROM master 5510 after swap server again 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
1 row IN SET (0.00 sec)
 
(root@127.0.0.1:5510) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test WHERE id=(SELECT max(id) FROM utf_test)\G
*************************** 1. row ***************************
           id: 17
    noutf_hex: 53544550383A20696E73657274696E6720646174612066726F6D206D6173746572203535313020616674657220737761702073657276657220616761696E20
   yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
1 row IN SET (0.00 sec)
 
ON Slave
-------------------------
*************************** 8. row ***************************
       ID: 17
   notutf: STEP8: inserting DATA FROM master 5510 after swap server again 
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
8 rows IN SET (0.00 sec)
 
*************************** 8. row ***************************
           id: 17
    noutf_hex: 53544550383A20696E73657274696E6720646174612066726F6D206D6173746572203535313020616674657220737761702073657276657220616761696E20
   yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7

Step 9) Finally, convert all tables on the slave to varchar UTF8 as well:

Master
---------------
(root@127.0.0.1:5510) [test]>insert INTO utf_test VALUES(NULL,'STEP9: final Slave conversions', 'القمح إلى كينيا', 'القمح إلى كينيا', 'القمح إلى كينيا');
 
Slave
---------------
(root@127.0.0.1:5511) [test]>alter TABLE utf_test MODIFY different varchar(250)  CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;
*************************** 9. row ***************************
       ID: 19
   notutf: STEP9: final Slave conversions
   yesutf: القمح إلى كينيا
   binutf: القمح إلى كينيا
different: القمح إلى كينيا
9 rows IN SET (0.00 sec)
(root@127.0.0.1:5511) [test]>select id,hex(notutf) AS noutf_hex,hex(yesutf) AS yesutf_hex,hex(binutf) AS binutf_hex,hex(different) AS different_hex FROM utf_test\G
*************************** 1. row ***************************
           id: 1
    noutf_hex: 53544550313A20696E7365727420646174612031
   yesutf_hex: C398C2A7C399E2809EC399E2809AC399E280A6C398C2AD20C398C2A5C399E2809EC399E280B020C399C692C399C5A0C399E280A0C399C5A0C398C2A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
...
           id: 19
    noutf_hex: 53544550393A2066696E616C20536C61766520636F6E76657273696F6E73
   yesutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
   binutf_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
different_hex: D8A7D984D982D985D8AD20D8A5D984D98920D983D98AD986D98AD8A7
9 rows IN SET (0.00 sec)

Done.

When in doubt about the character set or the collation you are using, always check them.

Useful commands to run from the application connection:

(root@127.0.0.1:5511) [test]>SELECT USER(), CHARSET(USER()), COLLATION(USER())\G
*************************** 1. row ***************************
           USER(): root@127.0.0.1
  CHARSET(USER()): utf8
COLLATION(USER()): utf8_general_ci
1 row IN SET (0.00 sec)
 
AND
(root@127.0.0.1:5511) [test]>show VARIABLES LIKE '%server%'\G
*************************** 1. row ***************************
Variable_name: character_set_server
        Value: utf8
*************************** 2. row ***************************
Variable_name: collation_server
        Value: utf8_unicode_ci
*************************** 3. row ***************************
Variable_name: server_id
        Value: 5511
3 rows IN SET (0.00 sec)

Conclusion

You must pay close attention to how you connect to the server, in relation to the encoding you are going to use.

It is advisable to set it explicitly, to be really sure, and to check that both the server and the application are getting it right.

I cannot find a good reason why you should have a mix of UTF8 and Latin1 in the same table, as in the example I have used, but you never know.

What I mean is that it is normal to have fields that contain Latin1 and are defined as Latin1, but using Latin1 encoding in the application connection and then trying to write UTF8 through it will only generate a mess.

If you have a table with 4 fields, 2 UTF8 and 2 Latin1, please use UTF8 encoding in the connection/server and be sure to push the right text. If you do so, the server will alert you when you try to do something stupid, like pushing UTF8 values into Latin1 fields.

For those specific cases when you have no idea of what you are going to write, and you MUST use Latin1 for the application connection, use varbinary and your life will be easier. Of course the ordering will be done by binary value instead of by character, but you need to keep your data safe first.

Reference

http://dev.mysql.com/doc/refman/5.6/en/globalization.html

http://www.utf8-chartable.de/unicode-utf8-table.pl?number=128

Thanks

Finally a huge thanks to Derek Downey, who raised the issue (again) with me in relation to a customer migration.

Thanks for his dedication, professionalism and will to learn and have fun together.

The Galera installation guide for dummies.

The Galera series

Overview and some history

Some years ago (2011), I presented in an article what at that time I called “A dream on MySQL parallel replication” (http://www.tusacentral.net/joomla/index.php/mysql-blogs/96-a-dream-on-mysql-parallel-replication).

At that time I was dreaming about having a replication mechanism working on the commit operations, and I was also trying to explain how useless it is to have parallel replication by schema. I have to say that I got a lot of feedback about that article, directly and indirectly, most of it telling me that I was crazy and that what I was looking for could not be done.

The only crazy guy who was following me in my wandering was Henrik Ingo, who mentioned Codership and Galera to me.

A few months later a customer came to me describing a scenario that, in my imagination, would have been perfect for the cloud and for an extension of MySQL called Galera.

At that time I was not really aware of all the different aspects of Galera, but I gave it a try. In November 2011 I did the first POC and started to collect valuable information about how to play with Galera.

It was my intention to write an article presenting it, but I was really too busy and in the end the article was left unfinished (http://www.tusacentral.net/joomla/index.php/mysql-blogs/119-galera-on-red-hat-is-123-part-1).

Instead I submitted to MySQL Connect 2012 the first version of a presentation about this POC, which was only the first of a much longer list (http://www.slideshare.net/marcotusa/scaling-with-sync-replication-2012).

From September 2012 to April 2013 we saw Galera, and the work coming from Codership, become more and more known in the community, thanks also to Percona and people like Jay Janssen, to the support coming from FromDual and to the tools from Severalnines.

In April 2013 in Santa Clara I attended Jay’s tutorial, also to see how others were dealing with what for me had become, in the last 2 years, a must-use.

If you missed it, here is the link; please review the material. Jay is a great professional and his tutorial was awesome (http://www.percona.com/live/mysql-conference-2013/sessions/percona-xtradb-cluster-galera-practice-part-1).

At the same conference I presented the updates of the POC, with some additional numbers, statistics and different binaries; in fact I had moved from MySQL/Oracle InnoDB to XtraDB.

Keeping in mind that we are still talking about 5.5, because Galera is not yet production ready on 5.6, the difference was significant. A lot of the inconsistent behaviour in thread handling that I had suffered with standard InnoDB was not present in XtraDB.

As of today, more than two years after that initial article, there are many Galera installations around, some of them used in very critical systems.

This is thanks to the continuous work of the Codership people, and to the others who believed in them, people like Henrik Ingo, Oli Sennhauser (FromDual), Johan Andersson (Severalnines), Jay Janssen (Percona) and myself.

Galera is becoming a more and more solid and trustworthy product. It still has some issues here and there, but the Codership team is quite efficient in fixing them once they are tracked down.

ANYHOW I have to make a note here, and I am talking to all the companies that are taking advantage of Galera. Please remember that open source doesn’t mean free: people still have to pay bills, and Codership developers cannot provide and sustain Galera if you don’t support them.

Buying Codership support is a very efficient way to get a quality service from the software producer, and at the same time to guarantee the future development of a product that allows you to do business.

Architecture

Now let us start to talk about it.

After many installations and different combinations of building blocks, the following is for me the most flexible, scalable and solid way to implement a MySQL cluster using Galera.

Normally, if a customer asks me for advice, the following is my answer:

galera

"The solution sees the HAProxy instance hosted directly on the application machine; HAProxy then connects the application to the data layer, rotating across the nodes.

Given that Codership provides a check for HAProxy to recognize the status of a node in relation to its role, HAProxy is able to skip the MySQL nodes when they are in recovery or Donor mode.

Our tests have shown that this approach is the most scalable in relation to the application/data connection, and at the same time it is the one that reduces the impact to a minimum, given that each application tile is independent.

On the MySQL side this solution allows the data layer to scale both reads and writes. The scalability is anyhow limited by the amount of writes, and related bytes, that the data layer must share between nodes.

That is, each MySQL node has a full copy of the dataset, and as such each write or delete must be applied to all nodes.

The Galera replication layer manages this effort efficiently, but efficiency still depends on the data storage and network layers.

It is possible to better understand whether this solution fulfils the requirements by calculating the real amount of traffic generated and performing a projection.

A proof of concept is also always suggested, to validate the application functions against the MySQL/Galera cluster.

Pro

The solution scales writes and reads; it also allows no service interruption when using a proxy solution like HAProxy, which redirects the traffic in less than 5 seconds.
MySQL nodes can be accessed at all times for read and write operations.
The application can access the database directly if needed, or can be configured as a tile with HAProxy for better scalability.
A specific check is provided to identify the internal status of a MySQL/Galera node.
The solution uses InnoDB as storage engine; as such it will behave in a well-known way when responding to read operations.
This solution can scale out and scale in quite easily: to scale out we just need to add an even number of MySQL servers to an odd cluster.
To scale in, it is just a matter of removing the nodes from HAProxy and then turning MySQL off.

Cons

Data is not partitioned across MySQL nodes; it is fully replicated on all of them. As such a lot of disk space will be used: (Data x number of nodes) + (Binary logs size x number of nodes).
When a node is recovering it will require a donor node; this reduces the capacity of the cluster by the failed node plus the Donor. In a 3-node MySQL cluster only one node will remain active, which is why the recommended minimum number of nodes in a busy environment is five.
The solution has been tested on physical machines, on Amazon EC2, and across different Zones, but it requires a dedicated cross-zone network to prevent delays.
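The first two points above are simple arithmetic, and can be sketched as follows. This is only an illustration: the function names and the sizes used (200 GB of data and 20 GB of binary logs per node) are hypothetical, not taken from a real installation.

```python
def cluster_disk_gb(data_gb: float, binlog_gb: float, nodes: int) -> float:
    """Total disk used: (data x nodes) + (binary logs x nodes)."""
    return (data_gb * nodes) + (binlog_gb * nodes)


def active_nodes_during_recovery(nodes: int) -> int:
    """Nodes still serving traffic while one node recovers from one donor."""
    return nodes - 2  # the recovering node and its donor are both out


# Hypothetical sizes: 200 GB of data and 20 GB of binary logs per node.
for n in (3, 5):
    print(n, cluster_disk_gb(200, 20, n), active_nodes_during_recovery(n))
```

With three nodes a single node is left to carry the whole load during a recovery, while with five nodes three remain active.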

Minimum requirements

The minimum number of MySQL nodes for a cluster is 3, but if the application is critical, a cluster of 5 is strongly suggested to reduce possible issues when recovering a node.

Note that for quorum reasons the number of servers must be odd.

The network between the nodes must be reliable and have low latency.
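The reason behind the odd number can be shown with a small sketch. It assumes a plain majority rule, where a group of nodes keeps working only if it holds more than half of the cluster and all nodes have the same weight; this is a simplified model, and the function names are mine.

```python
def quorum_size(nodes: int) -> int:
    """Smallest group that is still more than half of the cluster."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be lost at once while a majority survives."""
    return nodes - quorum_size(nodes)


for n in (3, 4, 5, 6, 7):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under this assumption, going from 3 to 4 nodes (or from 5 to 6) does not increase the number of failures that can be tolerated, so the extra even node costs resources without adding safety.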

Best usage

Applications that require write distribution and some scalability, with a medium load of writes per second and constant growth of the dataset size.

Uptime in nines

99.995%, which corresponds to 26 minutes of downtime per year.
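The conversion from a percentage to minutes per year is easy to verify. A minimal sketch, assuming a 365-day year; the function name is mine:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, ignoring leap years


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Minutes per year during which the service may be unavailable."""
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR


print(round(downtime_minutes_per_year(99.995), 1))
```

The same function can be used to compare other targets, for instance 99.9 or 99.99.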

Solution Availability

MySQL with Galera is a GA solution, so there is no cost in implementing it.

It is good practice to have a support contract with Codership as software provider, especially to get a faster reaction in case of bugs or feature requests."

Amen!

Implementation

Once you have identified your requirements and dimensioned the machines (or cloud instances) that will host the MySQL Galera implementation, be sure to have one network channel to serve the MySQL-application traffic, a different one to serve the Galera replication layer, and a very efficient data storage layer.

To get help and advice on that you can contact me at any time.

Basic requirements

Packages:

Xinetd

rsyslog

openssl-devel.x86_64

sysstat

iotop

netcat

htop

oprofile

Perl

Perl DBI

Perl DBD::mysql

Perl Time::HiRes

The hosts must accept network traffic from/to ports 3306, 3307, 3308 and 3311.

HAPROXY for:

RH5: http://rpm.pbone.net/index.php3/stat/4/idpl/17082875/dir/redhat_el_5/com/haproxy-1.3.26-1.el5.x86_64.rpm.html

RH6: http://rpm.pbone.net/index.php3?stat=26&dist=74&size=440708&name=haproxy-1.4.8-1.el6.x86_64.rpm

MySQL Galera From Percona:

MariaDB Galera implementation

Codership (Oracle) Galera implementation

Configuration

The initial configuration is normally based on 5 or 7 MySQL Galera nodes listening to each other on the internal network.

HAProxy will check MySQL using the code delivered by Codership, which recognizes the Galera state.

This check will be installed ONLY on the MySQL/Galera nodes, using port 3311, so no configuration is due on the HAProxy node.

The application front end will connect to MySQL using HAProxy; each application node will have its own HAProxy installation.

Each application will then connect locally to HAProxy (127.0.0.1:3307), and HAProxy will redirect the connection to the final target.

The connections are distributed using RR (Round Robin) and are non-persistent.

That is, once the TCP/IP connection is closed, the same application node has no guarantee of reaching the same server on the next connection.
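To make the behaviour concrete, here is a minimal sketch of a non-persistent round robin that skips unhealthy nodes. It is not HAProxy code, only an illustration of the idea; the node names and the health function are hypothetical.

```python
from itertools import cycle


def round_robin(nodes, is_healthy):
    """Yield one healthy node per new connection, rotating over the list."""
    ring = cycle(nodes)
    while True:
        # Try each node at most once per connection request.
        for _ in range(len(nodes)):
            node = next(ring)
            if is_healthy(node):
                yield node
                break
        else:
            raise RuntimeError("no healthy node available")


# Hypothetical pool: node3 is acting as a donor and must be skipped.
nodes = ["node1", "node3", "node4", "node5"]
out_of_pool = {"node3"}
picker = round_robin(nodes, lambda n: n not in out_of_pool)

picks = [next(picker) for _ in range(6)]
print(picks)
```

Each call to next() stands for a new TCP connection from the application, which is why two consecutive connections normally land on different servers.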

PORTS:

HAProxy will listen on port 3307
HAProxy will show its status over HTTP on port 3308
MySQL will be listening on port 3306
MySQL check for HAProxy will operate on port 3311

MySQL:

Adjust the standard parameters to fit the hosting environment.

NOTE!! For innodb_io_capacity, run FIO or IOZONE to discover the real IOPS available and adjust the parameter to that value; just guessing is not what you want in production.

Galera:

As a matter of principle the WSREP settings are tuned during the POC, but further fine-tuning could be required when moving to production.

I advise you to spend some time checking and tuning the following.

wsrep_max_ws_rows= 131072
wsrep_max_ws_size= 2147483648
wsrep_slave_threads= 48
wsrep_provider_options="gcache.size=10240M;
			evs.send_window=512;
			evs.user_send_window=512"

Parameters to keep an eye on are the send/receive queues and gcache.size.

About the latter, there is something that must be clarified: why it is very important to set it large enough.

First of all you should understand that when a node becomes a DONOR the server will not be accessible for write operations; as such it will be removed from the pool by HAProxy until the node has finished feeding the JOINER.


Galera has two ways of synchronizing a starting or recovering node.

IST and SST.

IST

When performing a synchronization with IST, Galera will send over to the restarting node ONLY the information present in the GCache; this can be seen as an INCREMENTAL update.

For instance, if you have a cluster of 5 nodes and for maintenance reasons you need to take them down in rotation, the node that is down will lose a set of operations during the maintenance time.

When you start it back, Galera reads the last position the node has registered locally, and requests the donor to start from there. If the DONOR still has that position in the GCache, it will send the data to the restarting node from there.

This operation is normally much faster and also has very limited impact on the DONOR.
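The decision described above can be sketched as a comparison between the last position of the restarting node and the oldest position still held by the donor. This is a simplified model of the idea and not the Galera implementation, which checks more than this; the function and parameter names are mine.

```python
def choose_state_transfer(joiner_last_seqno, gcache_first_seqno):
    """Return "IST" when the donor's GCache still covers what the joiner missed.

    joiner_last_seqno: last position applied by the restarting node,
                       or None when it has no usable position.
    gcache_first_seqno: oldest position still present in the donor's GCache.
    """
    if joiner_last_seqno is None:
        return "SST"  # nothing to start from: full rebuild
    next_needed = joiner_last_seqno + 1
    if next_needed >= gcache_first_seqno:
        return "IST"  # the missing operations are all still in the cache
    return "SST"      # the cache has already rotated past that position


print(choose_state_transfer(950, 900))   # short maintenance window
print(choose_state_transfer(100, 900))   # node was down for too long
print(choose_state_transfer(None, 900))  # new node
```

The larger the GCache, the lower the oldest position it holds, and the longer a node can stay down while still being able to come back with IST.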

SST

This is a rebuild from scratch; normally it applies when the node is started for the first time, and/or when it crashes.

The operation can be very time consuming when dealing with datasets of some size.

There are several methods that can be used for SST, from mysqldump to Xtrabackup. I have almost always chosen Xtrabackup, which is very well integrated in the Percona Galera distribution and guarantees performance and data safety.

But obviously when you need to back up several hundreds of gigabytes, the process will take some time. Or, if you have a very high level of inserts and, say, one or two hundred gigabytes, again the combination of time and data size will be fatal.

The main point is that in these cases the time for which Galera takes the DONOR out, in order to back up and transmit the data to the JOINER, will be too long for the DONOR node to catch up from the GCache once the operation is finished, transforming the DONOR into an additional JOINER.

I have been observing this cascade effect on several clusters not configured correctly in relation to their traffic and data size.

Clusters of 7 or more nodes were just going on hold because the nodes were not able to store enough information in the gcache. It is true that when Galera is left with one node, given that it is in DONOR mode, it stops writing, allowing the cluster to heal itself. But it is also true that this can be a very long operation, and in production it is quite disruptive.

So what to do? Easy: calculate in advance what the worst scenario is for you, then dimension the GCache to be at least 1/3 bigger than that, if not more. Try to be safe and stay on IST; this is very important if you have a very high level of writes.

My rule is that the GCache must be large enough to hold the modification statements for double the time needed to take a full backup.

For example:

With five nodes, and a set of binary logs of 20GB per day.

If a full backup with XTRABACKUP takes 4 hours, the GCACHE size should be:

GCache = ((BS x (Tb / hrsDay)) x Mn) x 2

GCache = ((20 x (4 / 24)) x 5) x 2 = ~33.3GB

BS - size of the binary logs per day (GB)

Tb - time for the backup, in hours

Mn - number of MySQL/Galera nodes

hrsDay - hours in a day (24)

This should be sufficient to have a decent amount of time and space to be safe.
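The same rule can be written as a small function, so that it is easy to redo the calculation with your own numbers. A minimal sketch; the function and parameter names are mine, and the default factor of 2 is the "double of the time" rule described above.

```python
def gcache_size_gb(binlog_gb_per_day, backup_hours, nodes, safety_factor=2):
    """GCache = ((BS x (Tb / 24)) x Mn) x safety factor, in GB."""
    hours_in_day = 24
    written_during_backup = binlog_gb_per_day * (backup_hours / hours_in_day)
    return written_during_backup * nodes * safety_factor


# Values used in the worked calculation: 20 GB/day, 4 hours, 5 nodes.
print(round(gcache_size_gb(20, 4, 5), 1))

# Same cluster, but with a backup that takes 6 hours.
print(round(gcache_size_gb(20, 6, 5), 1))
```

With a 4-hour backup the result is about 33.3 GB, while a 6-hour backup on the same cluster would already require 50 GB.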

Finally, remember that Galera with Xtrabackup REQUIRES Perl with DBI and DBD::mysql in place, or synchronization will fail!

Main steps

1) configure the environment

- install xinetd (if not present)

- create user

- create directory layout

- download software for MySQL/Galera

- Install HAProxy

2) Deploy the first MySQL Galera node

- create the basic mysql database

- create basic grants

- test and cleanup the other accounts

3) Deploy all other nodes

4) Deploy HAProxy

- review configuration and change in relation to the network

- start HAProxy

5) Test connection from client to HAProxy

6) Perform test to validate the installation

7) Load data set

Step by step

configure the environment

1) Install the basic tools, if the customer agrees:

yum -y install htop sysstat screen xinetd haproxy iotop nc

# EPEL repository: use the line matching the OS release
rpm -Uvh http://mirrors.kernel.org/fedora-epel/6/i386/epel-release-6-8.noarch.rpm   # release 6
rpm -Uvh http://mirrors.kernel.org/fedora-epel/5/i386/epel-release-5-4.noarch.rpm   # release 5

2) If you like, run inside a screen session, so that if needed you can detach without interfering with the installation process

screen -h 50000 -S MySQLINSTALL

3) Check if ANY MySQL daemon is present (RPM) and running; if so, remove it

ps aux|grep mysql
rpm -qa |grep -i MySQL
# remove each package found, using one of:
#   rpm -e --nodeps <package>
#   rpm -ev --allmatches <package>
rpm -qa |grep MySQL
#Remove OLD
for bin in $(ls /usr/local/mysql/bin/); do rm -f /usr/bin/$bin; done
for lib in /usr/local/mysql/lib/libmysqlclient*; do rm -f /usr/lib64/mysql/$(basename $lib); done
for bin in $(ls /usr/local/xtrabackup/bin/); do rm -f /usr/bin/$bin; done

4) Create the user and the directory structure

userdel mysql
rm -fr /home/mysql/
groupadd mysql
useradd -m -g mysql -s /bin/bash -d /home/mysql mysql
passwd mysql
mkdir /opt/mysql_templates/
# /usr/local/mysql will be a symbolic link into /opt/mysql_templates/ (created in step 6)

In case of a binary install:

5) Download the mysql version

wget http://www.percona.com/redir/downloads/Percona-XtraDB-Cluster/LATEST/binary/linux/x86_64/Percona-XtraDB-Cluster-5.5.30-23.7.4.405.Linux.x86_64.tar.gz
tar -xzf Percona-XtraDB-Cluster-5.5.30-23.7.4.405.Linux.x86_64.tar.gz
wget http://www.percona.com/redir/downloads/XtraBackup/LATEST/binary/Linux/x86_64/percona-xtrabackup-2.1.3-608.tar.gz
tar -xzf percona-xtrabackup-2.1.3-608.tar.gz

6) Create symbolic links to /usr/local

ln -s /opt/mysql_templates/Percona-XtraDB-Cluster-5.5.30-23.7.4.405.Linux.x86_64 /usr/local/mysql
ln -s /opt/mysql_templates/percona-xtrabackup-2.1.3 /usr/local/xtrabackup

7) Create symbolic links to /usr/bin

#Install new
for bin in $(ls /usr/local/mysql/bin/); do ln -s /usr/local/mysql/bin/$bin /usr/bin/$bin; done
for bin in $(ls /usr/local/xtrabackup/bin/); do ln -s /usr/local/xtrabackup/bin/$bin /usr/bin/$bin; done

#Set security
for bin in $(ls /usr/local/mysql/bin/); do chmod +x /usr/bin/$bin; done
for bin in $(ls /usr/local/xtrabackup/bin/); do chmod +x /usr/local/xtrabackup/bin/$bin /usr/bin/$bin; done

8) Move the service script from the original directory

mv /usr/local/mysql/support-files/mysql.server /etc/init.d/mysqld
# or: /etc/init.d/mysql/mysqld

Edit the file, filling in the basedir and datadir variables; this is not always needed.

9) Edit my.cnf to match the path and set initial values

10) Reset security

chown -R mysql:mysql /opt/mysql_templates
chmod +x /usr/local/mysql

Configure HAProxy check

Require:

xinetd

/usr/bin/clustercheck

The clustercheck file comes directly with the Percona distribution; you just need to be sure that it is in the execution path.

1) set haproxy check

prepared file (mysqlchk):

vi /etc/xinetd.d/mysqlchk

# default: on
# description: mysqlchk
service mysqlchk
{
# this is a config for xinetd, place it in /etc/xinetd.d/
disable = no
flags = REUSE
socket_type = stream
port = 3311
wait = no
user = nobody
server = /usr/bin/clustercheck
log_on_failure += USERID
only_from = 0.0.0.0
# recommended to put the IPs that need
# to connect exclusively (security purposes)
per_source = UNLIMITED
}

2) check for free port

cat /etc/services |grep 3311

Add the mysqlchk service to /etc/services:

echo "mysqlchk 3311/tcp # mysqlchk" >> /etc/services

3) With the mysqlchk configuration in place in /etc/xinetd.d/, restart xinetd

/etc/init.d/xinetd restart

4) Check it

telnet 127.0.0.1 3311
	Trying 127.0.0.1...
	Connected to localhost.localdomain (127.0.0.1).
	Escape character is '^]'.
	HTTP/1.1 200 OK
	Content-Type: text/plain
	Connection: close
	Content-Length: 40
	Percona XtraDB Cluster Node is synced.
	Connection closed by foreign host.
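What HAProxy looks at in this answer is essentially the HTTP status on the first line. The following sketch parses a response like the one above; it assumes that any status other than 200 (for example 503) means the node must not receive traffic, and the function name is mine.

```python
def node_is_available(raw_response: str) -> bool:
    """True when the check answered with HTTP status 200."""
    lines = raw_response.splitlines()
    if not lines:
        return False  # no answer at all
    parts = lines[0].split()
    return len(parts) >= 2 and parts[1] == "200"


synced = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/plain\r\n"
    "Connection: close\r\n"
    "Content-Length: 40\r\n"
    "\r\n"
    "Percona XtraDB Cluster Node is synced.\r\n"
)
# Assumed shape of the answer from a node that is not ready.
not_synced = "HTTP/1.1 503 Service Unavailable\r\n\r\n"

print(node_is_available(synced), node_is_available(not_synced))
```

This is also a quick way to script the same test that is done by hand with telnet.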

Perl setup

You should do this in the way you are most comfortable with; anyhow, be careful not to do a double installation between yum/apt-get and cpan. These two ways install libraries in different places by default, and will give you a nightmare cleaning up the mess and the library conflicts.

Be sure to have DBI and DBD installed, where DBD::mysql should be version perl-DBD-MySQL-4.019 or newer.

Deploy the first MySQL Galera node

1) Create initial MySQL database FOR BINARY INSTALL ONLY:

su - mysql
cd /usr/local/mysql/
./scripts/mysql_install_db --defaults-file=/etc/my.cnf

Carefully check the output: you should see OK twice; if not, check the error log.

2) Start mysql

/etc/init.d/mysqld start --wsrep_cluster_address=gcomm://

Check the error log for possible errors

tail -fn 200

3) connect for the first time and change security

mysql -uroot
set PASSWORD for root@'localhost'=PASSWORD('secret');

Grant access for xtrabackup

CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 'mysqlsst';
GRANT RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
GRANT RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstuser'@'192.168.1.%';

FLUSH PRIVILEGES;

Grant access to haproxy checks

CREATE USER 'clustercheckuser'@'localhost' IDENTIFIED BY 'clustercheckpassword!';
GRANT PROCESS ON *.* TO 'clustercheckuser'@'127.0.0.1';
GRANT PROCESS ON *.* TO 'clustercheckuser'@'192.168.1.%';
Flush privileges;

Remove generic users:

DROP user ''@'localhost';
DROP user ''@'
GRANT ALL ON *.* TO dbaadmin@'localhost' IDENTIFIED BY 'secret';

4) Collect statistics and information:

SHOW GLOBAL STATUS;
SHOW GLOBAL VARIABLES;
STATUS;

5) Stop server

6) restart server

/etc/init.d/mysqld start --wsrep_cluster_address=gcomm://

Deploy all other nodes

On each node:

1) modify the server identification in the my.cnf

wsrep_node_name=pchgny1 <------------
server-id=1 <----------------

2) start the node checking the mysql log

/etc/init.d/mysqld start

Deploy HAProxy

Connect to the application servers and perform the HAProxy installation.

wget the HAProxy package related to the host OS
rpm -iUvh haproxy-1.4.22-4.el6_4.x86_64.rpm

1) Set the configuration file on each HAProxy node

As a matter of principle, HAProxy is quite efficient at monitoring and reporting the status of the nodes on its HTML interface when using the HTTP protocol; this is not true when using TCP.
Given that, I used the trick of exposing the HTTP protocol on a different port, just for the purpose of reporting.

#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    # to have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local1 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local1.*                       /var/log/haproxy.log
    #
   log 127.0.0.1   local1 notice
 
    maxconn     40096
    user        haproxy
    group       haproxy
    daemon
 
    # turn on stats unix socket
    #stats socket /var/lib/haproxy/stats
 
 
 
#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  tcplog
    option                  dontlognull
    option                  redispatch
    retries                 3
    maxconn 4096
    contimeout 160000
    clitimeout 240000
    srvtimeout 240000
 
 
 
 
#---------------------------------------------------------------------
# statistics page (HTTP on port 3308) and Galera listener (TCP on port 3307)
#---------------------------------------------------------------------
 
listen stats 0.0.0.0:3308
    mode http
    stats enable
#    stats hide-version
    stats scope .
    stats realm Haproxy\ Statistics
    stats uri /haproxy?stats
    stats refresh 5s
#    stats auth xxxxx:xxxxx
    option contstats
    option httpchk
#        option mysql-check user test
 
    server node1 10.5.1.6:3306 check port 3311 inter 2000 rise 3 fall 3
    server node3 10.5.1.8:3306 check port 3311 inter 2000 rise 3 fall 3
    server node4 10.5.1.9:3306 check port 3311 inter 2000 rise 3 fall 3
    server node5 10.5.1.10:3306 check port 3311 inter 2000 rise 3 fall 3

 
 
listen galera_cluster 0.0.0.0:3307
#Logging
	log global
	option dontlognull
#mode
    mode tcp
# balancer
    balance roundrobin
 
#options
#	option abortonclose
#	option forceclose
	option clitcpka 
	option tcpka
    option httpchk
#        option mysql-check user test
 
    server node1 10.5.1.6:3306 check port 3311 inter 2000 rise 3 fall 3
    server node3 10.5.1.8:3306 check port 3311 inter 2000 rise 3 fall 3
    server node4 10.5.1.9:3306 check port 3311 inter 2000 rise 3 fall 3
    server node5 10.5.1.10:3306 check port 3311 inter 2000 rise 3 fall 3
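The check parameters on the server lines (inter 2000, fall 3) give a rough idea of how quickly a failed node is taken out of the rotation. The sketch below assumes that a server is marked down after "fall" consecutive failed checks, spaced "inter" milliseconds apart, and it ignores check timeouts; the function name is mine.

```python
def detection_window_seconds(inter_ms: int, fall: int) -> tuple:
    """Rough (best, worst) time between a node failing and being marked down."""
    best = (fall - 1) * inter_ms / 1000.0   # failure just before a check
    worst = fall * inter_ms / 1000.0        # failure just after a check
    return best, worst


print(detection_window_seconds(2000, 3))
```

Lowering inter or fall shortens this window, at the price of more checks or of a higher chance of removing a node because of a temporary glitch.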

2) add logging

To add logging using rsyslog:

vim /etc/rsyslog.conf

Enable (uncomment) the following:

# Provides UDP syslog reception
$ModLoad imudp.so
$UDPServerRun 514

and add

#HAProxy log
local1.* /var/log/haproxy.log

Finally restart rsyslog

/etc/init.d/rsyslog restart

3) start HAProxy

/etc/init.d/haproxy start

4) check server status using the web interface

Using a web browser check from:

http://:3308/haproxy?stats

You will see, or you SHOULD see, the HTML page reporting the status of your nodes.

Quick check for the connection

Connect to MySQL using the mysql client and a simple watch to cycle through the servers.

watch -n1 -d 'mysql -udbaadmin  -p  -h  -e "Show global status"| grep -E "wsrep_ready|wsrep_last_committed|wsrep_replicated|wsrep_received|wsrep_local_commits|wsrep_local_cert_failures|wsrep_local_bf_aborts|wsrep_local_send_queue|wsrep_local_recv_queue|wsrep_local_state_comment"'

To see how HAProxy redirects the connections, run the following from the APPLICATION NODE:

watch -n1 -d 'mysql -h127.0.0.1 -P3307 -u -p -e "Show global variables"| grep -i -E "server_id|wsrep_node_name"'

You will see the values changing at each request.
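
If you prefer numbers to watching values flip, you can count how many connections land on each node. This is a sketch only: count_per_node is a hypothetical helper, and DB_USER and DB_PASS are hypothetical variables holding the credentials. With balance roundrobin and all four servers up, the counts should come out roughly equal.

```shell
#!/bin/sh
# Count how many times each node name appears on stdin (one name per line)
# and print "name count", sorted by name.
count_per_node() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Example use (needs the HAProxy listener on port 3307):
#   for i in $(seq 1 40); do
#       mysql -h127.0.0.1 -P3307 -u"$DB_USER" -p"$DB_PASS" -N -B \
#             -e "SELECT @@wsrep_node_name"
#   done | count_per_node
```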

Try taking down one node and see what happens on the HAProxy web interface and in the running command.

If all is fine, it is quite fun to see how easily and quickly HAProxy handles the node shutting down.

POC steps

Finally, this is just an example of what we cover when doing the POC; it obviously varies from customer to customer.

POC Tests

Functional tests:

1) light data load/read on all nodes

- perform loading on all nodes

- perform data read on all nodes

2) Query/inserting while one node is failing

- perform selects on all nodes

- perform inserts on all nodes

Expectations:

No service interruption

No difference in the result sets between nodes.

3) Query/inserting while node is recovering

- perform selects on all data nodes

- perform inserts on all data nodes

- identify which node becomes the donor

Expectations:

Minimal service degradation, because the donor node will not be available

No service interruption

No difference in the result sets

Recovery performed by IST if the missing changes are inside the boundary of the cache

Recovery performed by SST if they are bigger than the cache
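
To know in advance which of the two to expect, you can compare the last seqno of the recovering node (the seqno line in its grastate.dat) with the oldest seqno the donor still keeps in its cache. The sketch below is a simplified rule of thumb, not the exact logic Galera applies: it assumes your version exposes the wsrep_local_cached_downto status variable, it ignores the requirement that both nodes share the same cluster UUID, and expected_transfer is a hypothetical name.

```shell
#!/bin/sh
# Rough rule of thumb: IST is possible when the joiner's last applied seqno
# is not older than the lowest seqno the donor still holds in its write-set
# cache. Otherwise (including seqno -1 after a crash) expect a full SST.
# Arguments: <joiner_seqno> <donor_cached_downto>
expected_transfer() (
    if [ "$1" -ge "$2" ]; then
        echo "IST"
    else
        echo "SST"
    fi
)

# Example use, with values read by hand from grastate.dat on the joiner and
# from SHOW GLOBAL STATUS LIKE 'wsrep_local_cached_downto' on the donor:
#   expected_transfer 1500 1000
```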

Performance/capacity tests (including difference in using ONE single node, Three to seven nodes, full capacity):

1) Execute an increasing, modification-only load from the application node (using simple light inserts, as in a real-time application, and more complex inserts such as heavier records, batch inserts and multiple linked tables)

- perform increasing inserts and deletes using from 4 to 96 threads per application block (2 blocks for each application)

Expectations:

- validate the maximum load limit

- identify the lag in the Galera replication, if any, when under heavy stress

Measure:

- IO

- memory usage

- Internal MySQL/Galera metrics

- thread contention
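
How the load is stepped from 4 to 96 threads depends on the tool used for the POC. As an illustration only, the sketch below generates a doubling progression and shows how it could be passed to mysqlslap. Both the doubling and the choice of mysqlslap are my assumptions for this example, not what was used with the customer; thread_steps, DB_USER and DB_PASS are hypothetical names.

```shell
#!/bin/sh
# Print a comma-separated list of thread counts, doubling from <min> and
# always ending with <max>.
# Arguments: <min> <max>
thread_steps() (
    n=$1
    max=$2
    out=""
    while [ "$n" -lt "$max" ]; do
        out="${out}${n},"
        n=$((n * 2))
    done
    printf '%s%s\n' "$out" "$max"
)

# Example use (write-only load through HAProxy, one run per thread count):
#   mysqlslap --host=127.0.0.1 --port=3307 --user="$DB_USER" \
#             --password="$DB_PASS" --auto-generate-sql \
#             --auto-generate-sql-load-type=write \
#             --concurrency="$(thread_steps 4 96)" --iterations=3
```

For the mixed read/write test, the same progression can be reused with --auto-generate-sql-load-type=mixed.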

2) Execute an increasing mixed read/write load

- perform increasing selects, inserts and deletes using from 4 to 96 threads per application block

Expectations:

- validate the maximum load limit

- identify the lag in the Galera replication, if any, when under heavy mixed read and write stress

Measure:

- IO

- memory usage

- Internal MySQL/Galera metrics

- thread contention

3) Perform node data crash and recovery while heavy load is running

Same tests as 1.2 and 1.3 (functional tests 2 and 3 above)

4) Optimizations:

1) Implement partitioning on the tables and execute historical archiving

- Test the impact of the optimization on data access

- Test the impact of managing the partitions (optimizing/defragmenting/drop)

Additional articles about Galera

There are a few forthcoming articles I am writing in the same series:

Galera: understanding what to monitor and how

Galera tests and numbers: what I have proven it is possible to achieve, in numbers and graphs.

Reference

Jay (Percona)

http://www.mysqlperformanceblog.com/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/

http://www.mysqlperformanceblog.com/2013/05/02/galera-flow-control-in-percona-xtradb-cluster-for-mysql/

https://www.percona.com/live/mysql-conference-2013/users/jay-janssen

Oli (FromDual)

http://www.fromdual.com/mysql-and-galera-load-balancer

http://www.fromdual.com/galera-load-balancer-documentation

http://www.fromdual.com/unbreakable-mysql-cluster-with-galera-and-lvs

http://www.fromdual.com/switching-from-mysql-myisam-to-galera-cluster

http://www.fromdual.com/galera-cluster-nagios-plugin-en

Codership

http://codership.com/

http://www.codership.com/wiki/doku.php?id=mysql_options_0.8

http://www.codership.com/wiki/doku.php?id=galera_parameters

http://www.codership.com/wiki/doku.php?id=galera_status_0.8

http://www.codership.com/wiki/doku.php?id=flow_control

http://www.codership.com/wiki/doku.php?id=galera_arbitrator

http://www.codership.com/wiki/doku.php?id=sst_mysql

http://www.codership.com/wiki/doku.php?id=ist

Several Nines

http://www.severalnines.com/clustercontrol-mysql-galera-tutorial

http://www.severalnines.com/blog/migrating-mysql-galera-cluster-new-data-center-without-downtime

http://www.severalnines.com/galera-configurator/
