My MySQL tips

 

MariaDB 10.4 has been declared GA, and a few presentations on Galera4 were held at the latest conferences.

So, I thought, it is time to give it a try and see what is going on.

It is not a secret that I consider the Codership guys my heroes, and that for many years I have pushed for Galera as a solid and very flexible HA solution.

Given that, my first comment is that it is a shame to have Galera4 available only in MariaDB. I would have preferred to test the vanilla MySQL version, for many reasons but mainly because MySQL/Oracle is and remains the official main line of the MySQL software, whether you like it or not, and as such I want to see how Galera4 behaves with it. Anyhow, Codership states that the other versions will be out AFTER the summer, and I hope this will be true.

To test the new version, given that I do not have Galera4 on vanilla MySQL, I decided to use the other distribution, coming from Percona. In the end the tests were done comparing MariaDB 10.4.x with PXC 5.7.x; in short, Galera4 vs Galera3.

I set up the two different distributions on the same machines, configured as closely as possible. That said, I ran two main sets of tests: data ingest and OLTP, both running for 90 minutes, not pushing like hell but gently simulating some traffic. Configuration files can be found here.

Galera4 streaming replication was disabled, following the Codership instructions (wsrep_trx_fragment_size=0).
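For reference, this is how streaming replication can be switched off; my tests set it in the configuration file, and the runtime form below is just the equivalent dynamic variant:

-- Disable Galera4 streaming replication (0 = replicate the whole writeset at commit, as Galera3 does)
SET GLOBAL wsrep_trx_fragment_size = 0;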

Test 1: Ingest

For the ingest test I used my stresstool application (here) with only 10 threads and 50 batch inserts per thread; the schema definition is in the windmills.json file.

As always, an image says more than many words:

[Graph: ingest execution time by thread]

In general terms, PXC was able to execute the same load in less time than MariaDB.

[Graph: ingest events per thread]

And PXC was able to deal with a higher number of events per thread as well.

In PXC, the average Galera latency was around 9ms on the writer and around 5ms on the receivers, with the same load, same machines, same configuration:

[Screenshot: PXC Galera replication latency]

The latency in MariaDB was significantly higher: around 19ms for the writer, and between 5 and 9ms for the receivers.

[Screenshot: MariaDB Galera replication latency]
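If you want to sample this kind of latency yourself, Galera exposes it as a status counter; a minimal check (I am assuming this, or an equivalent node metric, is what the graphs above are based on – the post does not state it explicitly):

-- returns min/avg/max/stddev/sample-count, in seconds, for the latest sampling period
SHOW GLOBAL STATUS LIKE 'wsrep_evs_repl_latency';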

In short, overall PXC 5.7 with Galera3 was performing better than MariaDB 10.4 with Galera4.

The amount of data transmitted and received on PXC was also higher (a good thing) than on MariaDB:

PXC:

[Screenshot: PXC data transmitted/received]

MariaDB:

[Screenshot: MariaDB data transmitted/received]

OLTP

For the OLTP test I used sysbench with the OLTP read/write workload: 180 threads (90 from each of two different application nodes), 200K rows per table, 40 tables, and a 90 minute run.
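As a reference, this is a sketch of the kind of sysbench invocation used on each application node; the exact options, target host and credentials are my assumption, not taken from the original test scripts:

# prepare the 40 tables with 200K rows each (run once)
sysbench oltp_read_write --mysql-host=10.0.0.11 --mysql-user=app_test --mysql-password='***' \
         --mysql-db=windmills --tables=40 --table-size=200000 prepare

# 90 threads per application node, 90 minutes, reporting every 10 seconds
sysbench oltp_read_write --mysql-host=10.0.0.11 --mysql-user=app_test --mysql-password='***' \
         --mysql-db=windmills --tables=40 --table-size=200000 \
         --threads=90 --time=5400 --report-interval=10 run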

Let's see what happened:

[Graph: OLTP writes/s and events per thread]

PXC was performing better than MariaDB, executing more writes/s and more events per thread per second.

Checking the latency, we can see:

[Graph: OLTP latency]

Also in this case, PXC had a lower average execution latency than MariaDB.

What about galera?

For PXC/Galera3, the average Galera latency was around 3.5ms on the writer and less on the receivers:

[Screenshot: PXC/Galera3 replication latency, OLTP test]

In this case the latency in Galera4 was the same as, or lower than, the one in Galera3:

[Screenshot: MariaDB/Galera4 replication latency, OLTP test]

Also analyzing the MAX latency:

Galera3

[Screenshot: Galera3 MAX replication latency]

Galera4

[Screenshot: Galera4 MAX replication latency]

We can see that Galera4 dealt with it much better than version 3.

 

I have done many other checks, and it seems to me that in OLTP (though I do not exclude this is valid for ingest as well) Galera4 is somehow penalized by the MariaDB platform.

I am just at the start of my investigation and I may be wrong, but I cannot confirm or deny this until Codership releases the code for MySQL.

Conclusions

Galera4 seems to come with some very good new features (please review Seppo's presentation), and one thing I noticed is that it comes with optimized node communication, reducing latency fluctuation.

Just this, for me, is great; plus we will have streaming replication, which can be very useful, but I have not tested it yet.

Nevertheless, I would not move to it just yet. I would wait for the other MySQL distributions to be supported, do some tests, and see where the performance problem is.

At the moment, even with a light load, the current version of PXC 5.7/Galera3 runs better than MariaDB/Galera4, so why should I migrate to a platform that locks me in like MariaDB and does not give me any benefit (yet)? Especially considering that once Galera4 is available for the standard MySQL versions, we can have all the good coming from Galera4 without being locked in by MariaDB.

A small note about MariaDB: while I was playing with it, I noticed that by default MariaDB comes with the plugin maturity level set to BETA, which means potentially running in production code that is still in beta stage. No comment!

References

https://github.com/Tusamarco/blogs/tree/master/Galera4

https://galeracluster.com/

Seppo presentation https://www.slideshare.net/SakariKeskitalo/galera-cluster-4-presentation-in-percona-live-austin-2019

https://mariadb.org/mariadb-10-4-4-now-available/

During the last conferences, I had the chance to discuss a lot with many colleagues. Some of them, like me, feel we have lost ourselves; not totally, but a bit.

I started to think about why, and this is my reasoning. I am not saying it is right, but I think it is not far from the truth.

Let us start by just mentioning a few events in the 2019 schedule: FOSDEM in February, Percona Live US at the end of May, DataOps Barcelona in mid-June, Oracle OpenWorld in September, Percona Live Europe at the end of September/October, plus an undefined number of secondary events or MariaDB roadshows.

Where is the MySQL (and related) conference?

Do not get me wrong, we talk about MySQL a lot during each one of the above (or MariaDB, tomato/tomahto). But it seems to me that, in a subtle way, the community has lost its conference.

There is nothing wrong if each company is trying to get the most out of their investment, so implicitly driving the show to get the best return for their own business.

But … as community are we getting what we need?

If we take a look at the past, we see that the MySQL conference in Santa Clara was a real milestone. It is true that MySQL was a less mature product, and as such a lot of innovation and discussion was presented every year. But it was not only because of that.

In my opinion, the singularity of the event made it a more precious moment, with a lot of focus and willingness to share.

Not only that: the fact that we did not have dozens of events brought more people together, as speakers/experts and attendees. This generated more discussions and exchanges, resulting in ideas that we, as a community, had the chance to develop in the following months.

Here is another very important factor: we, as part of the community, as engineers, respect each other, and we have no limit in sharing knowledge and helping each other. It does not matter if one of us makes a mistake; next time he will do it right, and we have always helped each other go in that direction.

Having the chance to discuss face to face during the event was great; having the chance to continue the discussion after the event was even better; and in theory, having the chance to do that multiple times a year should be the best thing. But… there is a but: all these small (or smaller) events are company events, not community events. Sorry to say it, but that is the reality. What it means is that the interactions are often limited, restricted, penalized by the competition between the companies.

Mark Callaghan, some time ago, mentioned in a short post that it was sad to see so many bright engineers spread across companies, often not able to collaborate to make things better in a common effort, following a shared path (Mark, that was my interpretation; if that was not your intention, my apologies, but even so, I think it is true).

 

This is sad indeed, and a waste of energy. It is also very sad because we know each other, directly or indirectly. We bear no resentment toward each other, and we like to discuss, share, help and get help.

Damn we love to work together! No matter the company we are working with, I am sure Oracle developers would love to work with MariaDB guys, as well as Percona’s, and vice versa.

I am not naïve; I understand the need for healthy competition and the need to have "some" differences. But maybe we are not considering the long term: we are a small community, a very small fishbowl, and if we do not help each other we will lose in the long term. This is already happening, with Postgres growing day by day and evolving under our blind eyes.

Looking at our small fishbowl, what I see is fragmentation. I see companies trying to find new magic words, or magic trends, to package the same old shit, just to survive in the market.

That is not innovation, that is not evolution; that is giving up on what is the most important concept of an open source community.

Going back to the number of events and their relevance: this increasing number of events is the natural consequence of the lack of coordination and proper interaction. I do not care how many we have, if they bring us food for the brain.

If they are going to be useful to make MySQL (and related) better.

But if the scope is only promoting the company's business, that is okay… it is just not a community event, and in that case we need to recognize and accept that we no longer have an open conference. And we need one; well, two: one for the Americas, and another one for EurAsia (get used to the geographic definition please, that is where the market is evolving).

We need a strong, unified, focused community. We need all the actors to collaborate on a common strategy, not to fight like chickens over crumbs. We need a steering committee and the companies' commitment to adhere to the committee's indications.

We need that, or we will soon be forgotten because we have become obsolete.

That is what I think… please let me know your ideas, we are a community... let us discuss how to do things better!

ProxySQL, in its versions up to 1.x, did not natively support Percona XtraDB Cluster (PXC). Instead, it relied on the flexibility offered by the scheduler. This approach allowed users to implement their own preferred way to manage ProxySQL's behaviour in relation to Galera events. From version 2.0 we can use native ProxySQL support for PXC. The mechanism to activate native support is very similar to the one already in place for group replication. In brief, it is based on the table [runtime_]mysql_galera_hostgroups, and the information needed is mostly the same:

 

  • writer_hostgroup: the hostgroup ID that refers to the WRITER
  • backup_writer_hostgroup: the hostgroup ID referring to the hostgroup that will contain the candidate writers
  • reader_hostgroup: the reader hostgroup ID, containing the list of servers that need to be taken into consideration
  • offline_hostgroup: the hostgroup ID that will eventually contain the writer that is put OFFLINE
  • active: True[1]/False[0], whether this configuration needs to be used or not
  • max_writers: the MAX number of writers you want to have at the same time. In a sane setup this should always be 1, but if you want to have multiple writers, you can define it up to the number of nodes.
  • writer_is_also_reader: if true [1], the writer will NOT be removed from the reader HG
  • max_transactions_behind: the value of wsrep_local_recv_queue after which the node will be set OFFLINE. This must be set carefully, observing the node's behaviour.
  • comment: I suggest putting some meaningful notes here to identify what is what.

Given the above, let us see what we need to do in order to have a working Galera native solution. I will have three servers:

  192.168.1.205 (Node1)
  192.168.1.21  (Node2)
  192.168.1.231 (node3)

As set of Hostgroup, I will have:

Writer  HG-> 100
Reader  HG-> 101
BackupW HG-> 102
offHG   HG-> 9101

To set it up

Servers first:
 
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.205',101,3306,1000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.21',101,3306,1000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.231',101,3306,1000);

Then the galera settings:
 
insert into mysql_galera_hostgroups (writer_hostgroup,backup_writer_hostgroup,reader_hostgroup, offline_hostgroup,active,max_writers,writer_is_also_reader,max_transactions_behind) values (100,102,101,9101,0,1,1,16);

As usual if we want to have R/W split we need to define the rules for it:
 
insert into mysql_query_rules (rule_id,proxy_port,schemaname,username,destination_hostgroup,active,retries,match_digest,apply) values(1040,6033,'windmills','app_test',100,1,3,'^SELECT.*FOR UPDATE',1);
insert into mysql_query_rules (rule_id,proxy_port,schemaname,username,destination_hostgroup,active,retries,match_digest,apply) values(1041,6033,'windmills','app_test',101,1,3,'^SELECT.*@@',1);

save mysql query rules to disk;
load mysql query rules to runtime;

Then another important variable: the server version. Please do yourself a favor and NEVER use the default.
 
update global_variables set variable_value='5.7.0' where variable_name='mysql-server_version';
LOAD MYSQL VARIABLES TO RUNTIME;SAVE MYSQL VARIABLES TO DISK;

Finally activate the whole thing:
 
save mysql servers to disk;
load mysql servers to runtime;

 

 

 

One thing to note before we go ahead. In the list of servers I had:

  1. Filled only the READER HG
  2. Used the same weight

This because of the election mechanism ProxySQL will use to identify the writer, and the (many) problems that may be attached to it.

For now let us go ahead and see what happens when I load this information to runtime. Before running the above commands:

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | STATUS  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
After:
 
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | STATUS  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 1000   | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 501        |
| 1000   | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 501        |
| 1000   | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 546        |
| 1000   | 101       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 467        |
| 1000   | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 546        |
| 1000   | 102       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 467        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

 

mysql> SELECT * FROM runtime_mysql_galera_hostgroups \G
*************************** 1. row ***************************
       writer_hostgroup: 100
backup_writer_hostgroup: 102
       reader_hostgroup: 101
      offline_hostgroup: 9101
                active: 0  <----------- note this 
            max_writers: 1
  writer_is_also_reader: 1
max_transactions_behind: 16
                comment: NULL
1 row IN SET (0.01 sec)

 

As we can see, ProxySQL has taken the nodes from my READER group and distributed them, electing a writer and putting the remaining candidates in the backup_writer hostgroup.
But – there is a but – wasn't my rule set with Active=0? Indeed it was, and I assume this is a bug (#Issue  1902).
The other thing we should note is that ProxySQL elected node 3 (192.168.1.231) as the writer.
As I said before, what should we do IF we want a specific node as the preferred writer? We need to modify its weight.
So, say we want node 1 (192.168.1.205) as the writer, we will need something like this:

 
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.205',101,3306,10000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.21',101,3306,100);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.231',101,3306,100);
 

Doing that will give us:

+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | STATUS | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 10000  | 100       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 2209       |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 546        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 508        |
| 10000  | 101       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 2209       |
| 100    | 102       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 546        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 508        |
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

If you noticed, given that we set the WEIGHT of node 1 higher, this node will also become the most utilized for reads.

We probably do not want that, so let us modify the reader weight.

UPDATE mysql_servers SET weight=10 WHERE hostgroup_id=101 AND hostname='192.168.1.205';
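Keep in mind that, as with any change to the mysql_servers table, the update stays in ProxySQL's memory layer until it is pushed to runtime (and saved to disk if you want it to survive a restart):

load mysql servers to runtime;
save mysql servers to disk;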

At this point, if we trigger the failover with set global wsrep_reject_queries=all; on node 1, ProxySQL will take action and elect another node as writer:

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | STATUS  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 100    | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 562        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 562        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 588        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 588        |
| 10000  | 9101      | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 468        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

 

Node 3 (192.168.1.231) is the new writer, and node 1 is in the special OFFLINE group. Let's see now what happens IF we put node 1 back.

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | STATUS  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 10000  | 100       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 449        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 532        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 569        |
| 10000  | 101       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 449        |
| 100    | 102       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 532        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 569        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

Oops: the reader has come back with the HIGHEST weight, and as such it will once more be the most used node. To fix it, we need to re-run the update as before.

But is there a way to avoid this? In short, the answer is NO! This, in my opinion, is BAD and is worth a feature request, because it can really bring a node to its knees.

Now, this is not the only problem. There is another point that is probably worth discussing, which is the fact that ProxySQL currently does FAILOVER/FAILBACK.

Failover is obviously something we want to have, but failback is another discussion.

The point is, once the failover is complete and the cluster has redistributed the incoming requests, doing a failback is an impacting operation that can be a disruptive one too.

If all nodes are treated as equal, there is no real way to prevent it, while if YOU set a node to be the main writer, something can be done. Let us see what and how. Say we have:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.205',101,3306,1000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.21',101,3306,100);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.231',101,3306,100);
 
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | STATUS | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 1000   | 100       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 470        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 558        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 613        |
| 10     | 101       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 470        |
| 100    | 102       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 558        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 613        |
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

Let us put the node down with set global wsrep_reject_queries=all; and check:

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | STATUS  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 100    | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 519        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 519        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 506        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 506        |
| 1000   | 9101      | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 527        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

We can now manipulate the weight in the special OFFLINE group and see what happens:

UPDATE mysql_servers SET weight=10 WHERE hostgroup_id=9101 AND hostname='192.168.1.205';

Then I put the node up again: set global wsrep_reject_queries=none;

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | STATUS  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 100    | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 537        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 537        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 573        |
| 10     | 101       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0	   | 0           | 0	   | 0                 | 0               | 0               | 458	|
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0	   | 0           | 0	   | 0                 | 0               | 0               | 573	|
| 10     | 102       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 458        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

 

That's it, the node is back but with no service interruption.

At this point we can decide whether to make this node a reader like the others, or wait and plan a proper time of day to put it back as a writer, while in the meantime it takes a bit of load to warm its buffer pool.

The other point – and important information – is: what is ProxySQL currently checking on Galera? From reading the code, the proxy will trap the following (a quick way to check the same values yourself is sketched right after this list):

  • read_only
  • wsrep_local_recv_queue
  • wsrep_desync
  • wsrep_reject_queries
  • wsrep_sst_donor_rejects_queries
  • primary_partition
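These map to standard MySQL/Galera variables and status counters, so you can check on any node what the monitor will see; a minimal sketch (the exact queries ProxySQL runs internally may differ):

SHOW GLOBAL VARIABLES LIKE 'read_only';
SHOW GLOBAL STATUS    LIKE 'wsrep_local_recv_queue';
SHOW GLOBAL VARIABLES LIKE 'wsrep_desync';
SHOW GLOBAL VARIABLES LIKE 'wsrep_reject_queries';
SHOW GLOBAL VARIABLES LIKE 'wsrep_sst_donor_rejects_queries';
SHOW GLOBAL STATUS    LIKE 'wsrep_cluster_status';   -- Primary / non-Primary partition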

Plus the standard sanity checks on the node. Finally, to monitor the whole situation, we can use this:

mysql> SELECT * FROM mysql_server_galera_log ORDER BY time_start_us DESC LIMIT 10;
+---------------+------+------------------+-----------------+-------------------+-----------+------------------------+-------------------+--------------+----------------------+---------------------------------+-------+
| hostname      | port | time_start_us    | success_time_us | primary_partition | read_only | wsrep_local_recv_queue | wsrep_local_state | wsrep_desync | wsrep_reject_queries | wsrep_sst_donor_rejects_queries | error |
+---------------+------+------------------+-----------------+-------------------+-----------+------------------------+-------------------+--------------+----------------------+---------------------------------+-------+
| 192.168.1.231 | 3306 | 1549982591661779 | 2884            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.21  | 3306 | 1549982591659644 | 2778            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.205 | 3306 | 1549982591658728 | 2794            | YES               | NO        | 0                      | 4                 | NO           | YES                  | NO                              | NULL  |
| 192.168.1.231 | 3306 | 1549982586669233 | 2827            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.21  | 3306 | 1549982586663458 | 5100            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.205 | 3306 | 1549982586658973 | 4132            | YES               | NO        | 0                      | 4                 | NO           | YES                  | NO                              | NULL  |
| 192.168.1.231 | 3306 | 1549982581665317 | 3084            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.21  | 3306 | 1549982581661261 | 3129            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.205 | 3306 | 1549982581658242 | 2786            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.231 | 3306 | 1549982576661349 | 2982            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
+---------------+------+------------------+-----------------+-------------------+-----------+------------------------+-------------------+--------------+----------------------+---------------------------------+-------+

As you can see above, the log table keeps track of what has changed. In this case, it reports that node 1 has wsrep_reject_queries activated, and it will continue to log this until we set wsrep_reject_queries=none.

Conclusions

ProxySQL's native Galera integration is a useful feature to manage any Galera implementation, no matter whether it's Percona XtraDB Cluster, MariaDB Cluster or MySQL/Galera.

The generic approach is obviously a good thing; still, it may miss some specific extensions like the performance_schema pxc_cluster_view table we have in PXC.

I've already objected about the failover/failback, and I am here again to remind you: whenever you do a controlled failover REMEMBER to change the weight to prevent an immediate failback.

This is obviously not possible in the case of a real failover, and, for instance, a simple temporary eviction will cause two downtimes instead of only one.

Some environments are fine with that, others not so much. Personally, I think there should be a FLAG in the configuration, so that we can decide whether failback should be executed or not.

Let's talk about MySQL high availability (HA) and synchronous replication once more.
It's part of a longer series on some high availability reference architecture solutions over geographically distributed areas.
Part 1: Reference Architecture(s) for High Availability Solutions in Geographic Distributed Scenarios: Why Should I Care?
Part 2: MySQL High Availability On-Premises: A Geographically Distributed Scenario

The Problem

A question I often get from customers is: how do I achieve high availability if I need to spread my data across different, distant locations? Can I use Percona XtraDB Cluster? Percona XtraDB Cluster (PXC), MariaDB Cluster and MySQL/Galera are very stable and well-known solutions to improve MySQL high availability, using an approach based on a multi-master, data-centric, synchronous data replication model. This means that each data node composing the cluster MUST see the same data at a given moment in time: information/transactions must be stored and visible synchronously on all the nodes at a given time. This is defined as a tightly coupled database cluster. This level of consistency comes with a price, which is that nodes must physically reside close to each other and cannot be geographically diverse. This is by design (in all synchronous replication mechanisms), and it has had to be clarified over and over throughout the years. Despite that, we still see installations that span geographic locations, including AWS Regions. We still see solutions breaking the golden rule of proximity, and trying to break the rules of physics as well. The problem/mistake is no different for solutions based on-premises or in the cloud (for whatever cloud provider). Recently I had to design a couple of customer solutions based on remote geographic locations. In both cases, the customer was misled by an incorrect understanding of how the synchronous solution works, and by a lack of understanding of the network layer. I decided I needed to cover this topic again, as I have done previously in Galera geographic replication and Effective way to check network connection in a geographically distributed environment.

What Happens When I Put Things on the Network?

Well, let's start with the basics. While light travels at 300 million meters per second, the propagation of electric fields or electric signaling is slower than that. The real speed depends on the medium used to transmit the signal, but it can be said that it normally spans from close to 0% up to 99% of light speed, depending on the transmission medium. This means that in optimal conditions the signal travels at approximately 299.72Km per millisecond, in good/mid conditions at about half that, 149.86Km per millisecond, and in bad conditions it could be 3Km per millisecond or less. To help you understand: the distance between Rome (Italy) and Mountain View (California) is about 10,062Km. At light speed it would take 33.54ms. In good conditions (90% of light speed) the signal will take 37.26ms to reach Mountain View, and in less optimal conditions it can easily double to 74.53ms. Keep in mind this is the electric field propagation speed: optimal conditions with no interruption, re-routing and so on. Reality brings all kinds of interruptions, repeaters and routing. All the physics above works as a baseline. On top of this, each human construct adds functionality, flexibility and (unfortunately) overhead – leading to longer times and slower speeds. The final speed will be different from the simple propagation of the electric fields. It will include the transmission time of complex signaling using the ICMP protocol, or even higher delays with the use of a very complex protocol like TCP/IP, which includes handshaking, packet rerouting, re-sending and so on. On top of that, when sending things over the internet, we need to realize that it is very improbable we will be the only user sending data over that physical channel. As such, whatever we have "on the road" will need to face bandwidth limitations, traffic congestion and so on. I have described the difference between protocols (ICMP vs TCP/IP) here, clarifying how the TCP/IP scenario is very different from using a simpler protocol like ICMP, or from the theoretical approach. What it all means is that we cannot trust the theoretical performance. We must move to a more empirical approach. But we must understand the right empirical approach or we will be misled.
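As a back-of-the-envelope check of the numbers above (assuming, as the text does, roughly 300Km per millisecond for light speed):

#!/bin/bash
# One-way propagation time for ~10,062Km (Rome -> Mountain View) at different
# fractions of light speed; purely illustrative, to match the figures quoted above.
distance_km=10062
for pct in 100 90 50; do
  awk -v d="$distance_km" -v p="$pct" \
      'BEGIN { c = 300; printf "%3d%% of light speed: %6.2f ms one-way\n", p, d / (c * p / 100) }'
done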

An Example

I recently worked on a case where a customer had two data centers (DC) at a distance of approximately 400Km, connected with "fiber channel". Server1 and Server2 were hosted in the same DC, while Server3 was in the secondary DC. Their ping to Server3, with the default packet size, was ~3ms. Not bad at all, right? We decided to perform some serious tests, running multiple sets of tests with netperf for many days and collecting data. We also used the data to perform additional fine tuning on the TCP/IP layer AND at the network provider. The results produced a common (for me) scenario (not so common for them):

 

[Graph: transactions/sec vs. dataset size, before (red) and after (yellow) network tuning]


The red line is the first set of tests, BEFORE we optimized. The yellow line is the result AFTER we optimized. The graph above reports the average number of transactions/sec we could run for different dataset sizes and destination servers; the full round trip was measured. It is interesting to note that while the absolute numbers were better in the second (yellow) tests, this was true only for a limited dataset size: the larger the dataset, the higher the impact. This makes sense if you understand how the TCP/IP stack works (the article I mentioned above explains it). But what surprised them were the numbers. Keeping aside the extreme cases and focusing instead on the intermediate case, we saw that shifting from a 48K dataset to 512K hugely dropped the performance. The number of executed transactions dropped from 2299 to 219 on Server2 (same DC) and from 1472 to 167 on Server3 (different DC). Also note that Server3 managed ~35% fewer transactions than Server2 from the start, given the latency. Latency moved from a more than decent 2.61ms to 27.39ms for Server2, and from 4.27ms to 37.25ms for Server3.

 

[Graph: latency vs. dataset size for Server2 and Server3]


37ms latency is not very high. If that had been the top limit, it would have worked. But it was not. In the presence of the optimized channel, with fiber and so on, when the tests were hitting heavy traffic, the congestion was such that it compromised the data transmitted. It hit a latency of >200ms for Server3. Note those were spikes, but if you are running a tightly coupled database cluster, those events can become failures in applying the data and can create a lot of instability. Let me recap the situation for Server3 for a second: we had two data centers.

  • The connection between the two was with fiber
  • Distance ~400Km, but now we MUST consider the round trip, because in real communication we have not only the send but also the receive packets.
  • Theoretical time at light-speed =2.66ms (2 ways)
  • Ping = 3.10ms (signal traveling at ~80% of light speed), as if the signal had traveled ~930Km at light speed (actual full round trip 800Km)
  • TCP/IP best at 48K = 4.27ms (~62% light speed) as if the signal had traveled ~1,281km
  • TCP/IP best at 512K =37.25ms (~2.6% light speed) as if the signal had traveled ~11,175km

Given the above, we lose from ~20-40% up to ~97% of the theoretical transmission rate. Keep in mind that when moving from a simple signal to heavier, concurrent transmission, we also have to deal with bandwidth limitations, which add additional cost. All of this over only 400Km of distance. And that is not all: within those 400Km we were also dealing with data congestion, and in some cases the tests failed to provide the level of accuracy we required, due to transmission failures and too many packet retries. For comparison, consider Server2, which is in the same DC as Server1. Let's see:

  • Ping = 0.027ms, that is, as if the signal had traveled ~11Km at light speed
  • TCP/IP best at 48K = 2.61ms, as if it had traveled ~783Km
  • TCP/IP best at 512K = 27.39ms, as if it had traveled ~8,217Km
  • We had performance loss, but the congestion issue and accuracy failures did not happen.

You might say, "But this is just a single case, Marco, you cannot generalize from this behavior!" You would be right IF that were true (but it is not). The fact is, I have done this level of checks many times and in many different environments, on-premises or in the cloud. Actually, in the cloud (AWS), I had even more instability. The behavior stays the same. Please test it yourself (it is not difficult to use netperf); just do the right tests, with RTT and multiple requests (see the note at the end of the article).

Anyhow, what I know from the tests is that when working INSIDE a DC, even with some significant overhead due to the TCP/IP stack (and maybe wrong cabling), I do not encounter the same congestion or bandwidth limits I have when dealing with an external DC. This allows me to have more predictable behavior and tune the cluster accordingly.

That is tuning I cannot do to cover the transmission to Server3, because of unpredictable packet behavior and spikes: >200ms is too high and can cause delivery failures.

If we apply this knowledge to the virtually-synchronous replication we have with Galera (Percona XtraDB Cluster), we can see that we are hitting the problems well explained in Jay's article Is Synchronous Replication right for your app? There, he explains Callaghan's Law: [In a Galera cluster] a given row can't be modified more than once per RTT. On top of that, when talking about geographically dispersed solutions, we have TCP/IP magnifying the effect at the writeset transmission/latency level. This makes nodes NOT residing on the same physically contiguous network delay all the certification-commit phases by some amount of time X. When X is predictable, it may correspond to anywhere between 80% and 3% of light speed for the given distance. But you can't predict the transmission time of a set of data split into several datagrams and sent over the internet when using TCP/IP, so we cannot use the X range as a trustworthy measure. The effect is an unpredictable delay, which Galera reads as a network issue. The node can be evicted from the cluster. Which is exactly what happens, and what we experience when dealing with some "BAD" unspecified network issue.
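To put Callaghan's Law into rough numbers, using the latencies measured above (a purely illustrative calculation):

# Max theoretical updates/sec on the SAME row ≈ 1 / RTT (Callaghan's Law)
awk 'BEGIN { for (i = 1; i < ARGC; i++) { rtt = ARGV[i]; printf "RTT %6.2f ms -> ~%.0f updates/sec on a single row\n", rtt, 1000 / rtt } }' 2.61 37.25 200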

This means that whenever we need to use a solution based on a tightly coupled database cluster (like PXC), we cannot locate our nodes at a distance whose RTT is longer than the shortest commit period we can accept. If, for example, our application must apply the data within a maximum of 200ms in one of its functions, and our min RTT is 2ms but our max RTT is 250ms, we cannot use this solution, period. To be clear: locating a node in another geographic location, and as such using the internet to transmit/receive data, is by default a NO GO given the unpredictability of that network link.

I doubt that nowadays many applications can wait an unpredictable period to commit their data. The only case in which having a geographically distributed node is acceptable is if you accept commits happening in undefined periods of time and with possible failures.

What Is the Right Thing To Do?

The right solution is easier than the wrong one, and there are already tools in place to make it work efficiently. Say you need to define your HA solution between the East and West Coast, or between Paris and Frankfurt. First of all, identify the real capacity of your network in each DC. Then build a tightly coupled database cluster in location A and another tightly coupled database cluster in location B, and link them using ASYNCHRONOUS replication. Finally, use a tool like Replication Manager for Percona XtraDB Cluster to automatically manage asynchronous replication failover between nodes. On top of all of that, use a tool like ProxySQL to manage the application requests. The full architecture is described here.
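As a rough sketch of the asynchronous link between the two clusters (hostnames, replication user and the use of GTID below are my assumptions, not part of the reference architecture):

-- On one node of the cluster in location B, pointing to a node of the cluster in location A
CHANGE MASTER TO
    MASTER_HOST='pxc-node1-location-a',
    MASTER_USER='repl',
    MASTER_PASSWORD='***',
    MASTER_AUTO_POSITION=1;   -- assumes GTID is enabled on both clusters
START SLAVE;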

[Diagram: full architecture – two tightly coupled clusters linked by asynchronous replication]

 

Conclusions

The idea that you can use ANY solution based on a tightly coupled database cluster across distributed geographic locations is just that: a myth. It is conceptually wrong and practically dangerous.

It MIGHT work when you set it up, it MIGHT work when you test it, it MIGHT even work for some time in production.

By definition, it will break, and it will break when it is least convenient. It will break at an unpredictable moment, but for a predictable reason.

You did the wrong thing by following a myth. Whenever you need to distribute your data over different geographic locations, and you cannot rely on a single physical channel (fiber) to connect the two locations, use asynchronous replication, period!

References

https://github.com/y-trudeau/Mysql-tools/tree/master/PXC
http://www.tusacentral.net/joomla/index.php/mysql-blogs/164-effective-way-to-check-the-network-connection-when-in-need-of-a-geographic-distribution-replication-.html
https://www.percona.com/blog/2013/05/14/is-synchronous-replication-right-for-your-app/

Sample test

#!/bin/bash
# Send stdout and stderr to a timestamped log file.
test_log=/tmp/results_$(date +'%Y-%m-%d_%H_%M_%S').txt
exec 9>>"$test_log"
exec 2>&9
exec 1>&9
echo "$(date +'%Y-%m-%d_%H_%M_%S')" >&9

# Loop over the target servers (adjust the IPs to your environment).
for ip in 11 12 13; do
  echo "  ==== Processing server 10.0.0.$ip === "

  for size in 1 48 512 1024 4096; do
    # ICMP check with Do-Not-Fragment set and a payload of $size bytes.
    echo " --- PING ---"
    ping -M do -c 5 10.0.0.$ip -s $size
    echo "  ---- Record Size $size ---- "
    # TCP_RR request/response test: request ${size}KB, response 48KB,
    # burst of 5 outstanding requests, 20-second runs, 95% confidence interval.
    netperf -H 10.0.0.$ip -4 -p 3307 -I 95,10 -i 3,3 -j -a 4096 -A 4096 -P 1 -v 2 -l 20 -t TCP_RR -- -b 5 -r ${size}K,48K -s 1M -S 1M
    echo "  ---- ================= ---- ";
  done
  echo "  ==== ----------------- === ";
done


In this article, we'll look at an example of an on-premises, geographically distributed MySQL high availability solution. It's part of a longer series on some high availability reference architecture solutions over geographically distributed areas. Part 1: Reference Architecture(s) for High Availability Solutions in Geographic Distributed Scenarios: Why Should I Care? Percona consulting's main aim is to identify simple solutions to complex problems. We try to focus on identifying the right tool, a more efficient solution, and what can be done to make our customers' lives easier. We believe in doing the work once, doing it well, and having more time afterward for other aspects of life. In our journey, we often receive requests for help – some simple, some complicated.

Scenario

The company "ACME Inc." is moving its whole business from a monolithic application to a distributed application split into services. Each service deals with requests independently of the others. Some services follow the tightly-bound transactional model, while others work/answer asynchronously. Each service can access the data storage layer independently. In this context, ACME Inc. identified the need to distribute the application services over wide geographic regions, with each region able to scale independently. The identified regions are:

  • North America
  • Europe
  • China

ACME Inc. is also aware of the fact that different legislation applies in each region. As such, each region requires independent handling of information about sales policies, sales campaigns, customers, orders, billing and localized catalogs, but will share the global catalog and some historical aggregated data. While most of the application services will work feeding and reading local distributed caches, the basic data related to the catalog, sales and billing is based on an RDBMS. Historical data is instead migrated to a "Big Data" platform, and aggregated data is processed and pushed to a DWH solution at HQ. The application components are developed using multiple programming languages, depending on the service. The RDBMS identified by ACME Inc., in collaboration with the local authorities, was MySQL-oriented. There were several candidate solutions, like:

  • PostgreSQL
  • Oracle DB
  • MS SQL server

We excluded closed-source RDBMSs given that some countries imposed a specific audit plugin. This plugin was only available for the mentioned platforms. The cost of parallel development and subsequent maintenance in case of RDBMS diversification was too high; as such, all the regions had to use the same major RDBMS component. We excluded PostgreSQL given that, compared to it, MySQL adoption and utilization cases were higher and MySQL had a well-defined code producer. Finally, the Business Continuity team of ACME Inc. had defined an ITSC (Information Technology Service Continuity) plan that defined the RPO (Recovery Point Objective), the RTO (Recovery Time Objective) and the system redundancy. That's it. To fulfill the ITSCP, each region must have the critical system redundantly replicated in the same region, but not in close proximity.

Talking About the Components

This is a not-so-uncommon scenario, and it presents a lot of complexity if you try to address it with one solution. But let's analyze it and see how we can simplify the approach while still meeting the needs and requirements of ACME Inc. When using MySQL-based solutions, the answer to "what should we use?" is: use what best fits your business needs. The "nines" availability reference table for the MySQL world (and most RDBMSs) can be summarized as follows:

90.000% (36 days)    MySQL Replication
99.900% (8 hours)    Linux Heartbeat with DRBD (obsolete DRBD)
99.900% (8 hours)    RHCS with Shared Storage (Active/Passive)
99.990% (52 minutes) MHA/Orchestrator with at least three nodes
99.990% (52 minutes) DRBD and Replication (obsolete DRBD)
99.995% (26 minutes) Multi-Master (Galera replication), 3 node minimum
99.999% (5 minutes)  MySQL Cluster

An expert will tell you that it doesn't always make sense to go for the most "nines" in the list. This is because each solution comes with a tradeoff: the more high availability (HA) you get, the higher the complexity of the solution and of managing it. For instance, the approach used in MySQL Cluster (NDB) makes this solution unsuitable for generic utilization. It requires proper analysis of the application needs, data utilization and archiving before being selected. It also requires in-depth knowledge to properly manage the cluster, as it is more complex than other similar solutions. This indirectly makes a solution based on MySQL+Galera replication the better choice with the highest practical HA level, since it stays close to standard, generalized utilization. This is why MySQL+Galera replication has become, in the last six years, the most used solution for platforms looking for very high HA without the need to diverge from the standard MySQL/InnoDB approach. You can read more about Galera replication at http://galeracluster.com/products/ and read more about Percona XtraDB Cluster. There are several distributions implementing Galera replication: Percona XtraDB Cluster, MariaDB Cluster, and Codership's own MySQL/Galera Cluster.

*Note that MariaDB Cluster/Server and all related solutions coming from MariaDB have significantly diverged from the MySQL mainstream. This often means that, once migrated to MariaDB, your database will not be compatible with other MySQL solutions. In short, you are locked in to MariaDB. It is recommended that you carefully evaluate the move to MariaDB before making it.

Choosing the Components

RDBMS

Our advice is to use Percona XtraDB Cluster (PXC), because at the moment it is one of the most flexible, reliable and compatible solutions. PXC is composed of three main components: Percona Server for MySQL, the Galera (wsrep) replication library, and Percona XtraBackup (used by default for node provisioning).

The cluster is normally composed of three nodes or more. Each node can be used as a Master, but the preferred and recommended way is to use one node as a Writer and the other as Readers. Application-wise, accessing the right node can be challenging since this means you need to be aware of which node is the writer, which is the reader, and be able to shift from one to the other if necessary.

Proxy

To simplify this process, it helps to have an additional component that works as a “proxy” connecting the application layer to the desired node(s). The most popular solutions are:

  • HAProxy
  • ProxySQL

There are several important differences between the two. But in summary, ProxySQL is a Level 7 proxy and is MySQL protocol aware. So, while HAProxy just passes the connection over as a forward proxy (level 4), ProxySQL is aware of what is going through it and acts as a reverse proxy. With ProxySQL it is possible to decide, based on several parameters, where to send traffic (read/write split and more), what must be stopped, or whether an incoming SQL command should be rewritten. A lot of information is available on the ProxySQL website https://github.com/sysown/proxysql/wiki and on the Percona Database Performance Blog.

Backup/Restore

No RDBMS platform is safe without a well-tested procedure for backup and recovery. The Percona XtraDB Cluster distribution comes with Percona XtraBackup as the default method for node provisioning. A good backup and restore (B/R) policy starts from ACME's ITSCP: full and incremental backups that fully cover the RPO, and a recovery procedure that keeps the recovery time within the RTO whenever possible. There are several tools that allow you to plan and execute backup/restore procedures, some coming from vendors, others open source and community-oriented. To remain fully open source and community-oriented, in consulting we normally suggest using https://github.com/dotmanila/pyxbackup. Pyxbackup is a wrapper around XtraBackup that helps simplify B/R operations, including the preparation of full and incremental sets, which significantly reduces the recovery time.

Disaster Recovery

Another very important aspect of the ITSC Plan is the capacity of the system to survive major disasters. The disaster recovery (DR) site must be able to act as the main production environment, so it must be designed and scaled with the same resources as the main production site. It must be geographically separated, normally by hundreds of kilometers or more, and completely independent of the main site. Finally, it must be as much in sync with the main production site as possible.

While the first three "musts" are easy to understand, the fourth is often the object of misunderstanding. The idea of being as much in sync with the production site as possible creates confusion when designing HA solutions that involve Galera replication. The most common mistake is the misuse of the Galera replication layer, stemming from conceptual confusion between a tightly coupled database cluster and a loosely coupled one. Any solution based on Galera replication is a tightly coupled database cluster, because the whole idea is to be data-centric, synchronously distributed, and consistent. The price is that such a solution cannot be geographically distributed. Solutions like standard MySQL replication are instead loosely coupled database clusters, designed to be asynchronous. The nodes they connect are completely independent in processing/applying transactions, so the solution fits perfectly into ANY geographically distributed replication scenario. The price is that data on the receiving side might not be up to date with the source at any specific instant.

The point is that for the DR site the ONLY valid solution is an asynchronous link (a loosely coupled database cluster), because by design and by requirement the two sites must be separated by a significant number of kilometers. For a better understanding of why synchronous replication cannot work in a geographically distributed scenario, see "Misuse of Geographic Node distribution with Galera-based replication". In our scenario, the use of Percona XtraDB Cluster helps to create a more robust asynchronous solution, because each tightly coupled database cluster, no matter whether it is the source or the destination, is seen by the other as a single entity. This means we can shift from one node to another inside either cluster and still be confident that the same data is available and that the same asynchronous stream keeps flowing from source to destination. To ensure this procedure is fully automated, we add the last block to our architecture: Replication Manager for Percona XtraDB Cluster (RMfP, https://github.com/y-trudeau/Mysql-tools/tree/master/PXC). RMfP is another open source tool that simplifies and automates failover inside each PXC cluster, so that our asynchronous solution doesn't suffer if the node currently acting as Master fails.
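
To make the asynchronous link concrete, here is a minimal sketch of how one node of the DR cluster could be pointed at the production cluster using standard MySQL 5.7 replication with GTID. The hostname, user, and password are placeholders, and in the real architecture RMfP is what creates and moves this channel automatically when a node fails:

    -- On one node of the DR (receiving) PXC cluster, assuming GTID is enabled on both clusters:
    CHANGE MASTER TO
      MASTER_HOST = 'prod-pxc-node1',   -- placeholder: current writer in the production cluster
      MASTER_USER = 'repl_user',        -- placeholder replication account
      MASTER_PASSWORD = '***',
      MASTER_AUTO_POSITION = 1;         -- GTID auto-positioning, so the source node can be switched
    START SLAVE;
    SHOW SLAVE STATUS\G                 -- verify Slave_IO_Running / Slave_SQL_Running are 'Yes'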

How to Link the Components

Summarizing all the different components of our solution:

  • Application stack
    • Load balancer
    • Application nodes by service
    • Distributed caching
    • Data access service
  • Database stack
    • Data proxy (ProxySQL)
    • RDBMS (Percona XtraDB Cluster)
    • Backup/Restore
      • Xtrabackup
      • Pyxbackup
      • Custom scripts
    • DR
      • Replication Manager for Percona XtraDB Cluster
  • Monitoring
    • PMM (not covered here; see <link> for detailed information)

 

[Architecture diagram: two geographically separated sites, each running a full application stack, ProxySQL, and a Percona XtraDB Cluster, linked by asynchronous replication and monitored by PMM]

In the solution above, we have two locations separated by several kilometers. On top of them, the load balancer(s)/DNS resolution redirects the incoming traffic to the active site. Each site hosts a full application stack, and the applications connect to the local ProxySQL. ProxySQL has read/write split enabled to optimize platform utilization, and it is configured to shift writes from one PXC node to another in case of node failure. Asynchronous replication connects the two locations and transmits data from master to slave. Note that with this solution it is possible to have multiple geographically distributed sites. Backups are taken at each site independently, and recovery tests are performed. RMfP oversees and modifies the replication channels in case of node failure. Finally, Percona Monitoring and Management (PMM) is in place to perform in-depth monitoring of the whole database platform.


Conclusions

We always look for the most efficient, manageable, and user-friendly combination of products, because we believe in providing and supporting the community with simple but efficient solutions. What we have presented here is the most robust and stable high availability solution in the MySQL space (excluding MySQL NDB, which we have left out). It is designed to provide maximum service continuity with limited coupling between the platforms/sites. It is also a well-tested solution that has been adopted and adapted in many different scenarios where performance and real HA are a must. I have preferred to keep this discussion at a high level, given that the details of the implementation have already been covered elsewhere (see the reference section for more reading). Still, Percona XtraDB Cluster (like any other solution implementing Galera replication) might not fit your final use case, so it is important to understand where it does and doesn't fit. This article is a good summary with examples: Is Synchronous Replication Right for Your App?. Check out the next article on How Not to Do MySQL High Availability.

References

https://www.percona.com/blog/2016/06/07/choosing-mysql-high-availability-solutions/

https://dev.mysql.com/doc/mysql-ha-scalability/en/ha-overview.html

https://www.percona.com/blog/2014/11/17/typical-misconceptions-on-galera-for-mysql/

http://galeracluster.com/documentation-webpages/limitations.html

http://tusacentral.net/joomla/index.php/mysql-blogs/170-geographic-replication-and-quorum-calculation-in-mysqlgalera.html

http://tusacentral.net/joomla/index.php/mysql-blogs/167-geographic-replication-with-mysql-and-galera.html

http://tusacentral.net/joomla/index.php/mysql-blogs/164-effective-way-to-check-the-network-connection-when-in-need-of-a-geographic-distribution-replication-.html

http://tusacentral.net/joomla/index.php/mysql-blogs/183-proxysql-percona-cluster-galera-integration.html

https://github.com/sysown/proxysql/wiki
