Support for Percona XtraDB Cluster in ProxySQL (Part One)
In recent times I have been designing several solutions focused on High Availability and Disaster Recovery. Some of them use Percona Server for MySQL with Group Replication, others use Percona XtraDB Cluster (PXC). What many of them have in common is the use of ProxySQL for the connection layer. This is because I consider a layer 7 proxy preferable, given the advantages it provides in read/write splitting and SQL filtering.
The other positive aspect provided by ProxySQL, at least for Group Replication, is the native support which allows us to have a very quick resolution of possible node failures.
ProxySQL has Galera support as well, but in the past that support had proven to be pretty unstable, and the old method of using the scheduler was still the best way to go.
After Percona Live Online 2020 I decided to try it again and to see if at least the basics were now working fine.
What I Have Tested
I was not looking for complicated tests involving different levels of transaction isolation. I was instead interested in the simpler, more basic ones. My scenario was:
1 ProxySQL node v2.0.15 (192.168.4.191)
1 ProxySQL node v2.1.0 (192.168.4.108)
3 PXC 8.0.20 nodes (192.168.4.22/23/233) with an internal network (10.0.0.22/23/33)
ProxySQL was freshly installed.
All the commands used to modify the configuration are here. Tests were done first using ProxySQL v2.0.15, then v2.1.0. Only where the results diverge will I report the version and the differing result.
PXC Failover Scenarios
As mentioned above, I am going to focus on the failover needs, period. I will have two different scenarios:
- Controlled failover for maintenance
- Node crash
From the ProxySQL point of view I will have three scenarios, always with a single Primary:
- Writer is NOT a reader (writer_is_also_reader = 0 and 2)
- Writer is also a reader (writer_is_also_reader = 1)
The configuration of the native support will be:
Galera host groups:
- Writer: 100
- Reader: 101
- Backup_writer: 102
- Offline_hostgroup: 9101
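These host groups are wired together in ProxySQL's mysql_galera_hostgroups table. A minimal sketch of that definition, assuming the standard ProxySQL 2.x schema (the max_writers and max_transactions_behind values here are illustrative assumptions, not the exact values I used):

```sql
-- Illustrative Galera host group definition for this test layout.
INSERT INTO mysql_galera_hostgroups
    (writer_hostgroup, backup_writer_hostgroup, reader_hostgroup,
     offline_hostgroup, active, max_writers, writer_is_also_reader,
     max_transactions_behind)
VALUES
    (100, 102, 101, 9101, 1, 1, 0, 16);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```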
Before going ahead, let us analyze the mysql_servers settings. As you can notice, I am using the weight attribute to indicate to ProxySQL which node is my preferred writer. But I also use the weight for the READ host group to indicate which servers should be used, and how.
Given that, we have:
- 192.168.4.22 is the preferred Primary
- 192.168.4.23 is the first failover
- 192.168.4.233 is the last chance
- 192.168.4.233/23 have the same weight and load should be balanced between the two of them
- 192.168.4.22, given it is the preferred writer, should NOT receive the same read load, and so has a lower weight value.
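A mysql_servers layout reflecting the intent above could look like the following sketch; the port and exact weight values are illustrative assumptions, only their relative ordering matters:

```sql
-- Writer HG 100: .22 preferred Primary, .23 first failover, .233 last chance.
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES
    (100, '192.168.4.22',  3306, 1000000),
    (100, '192.168.4.23',  3306, 1000),
    (100, '192.168.4.233', 3306, 100);
-- Reader HG 101: .23 and .233 balanced, the preferred writer .22 kept low.
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES
    (101, '192.168.4.22',  3306, 100),
    (101, '192.168.4.23',  3306, 10000),
    (101, '192.168.4.233', 3306, 10000);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```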
The first test is to see how the cluster will behave in the case of 1 Writer and 2 readers, with the option writer_is_also_reader = 0.
To achieve this, the settings for ProxySQL will be:
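In essence, the Galera definition needs writer_is_also_reader set to 0; a sketch of the change, assuming the host group definition shown earlier:

```sql
-- Set writer_is_also_reader = 0 for our Galera definition (writer HG 100).
UPDATE mysql_galera_hostgroups SET writer_is_also_reader = 0
 WHERE writer_hostgroup = 100;
LOAD MYSQL SERVERS TO RUNTIME;
```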
As soon as I load this to runtime, ProxySQL should move the nodes to the relevant host groups. But this is not happening; instead, it keeps the readers in the writer HG and SHUNs them.
This is, of course, wrong. But why does it happen?
The reason is simple: ProxySQL is expecting to see all nodes in the reader group with the READ_ONLY flag set to 1.
In ProxySQL documentation we can read:
writer_is_also_reader=0: nodes with read_only=0 will be placed either in the writer_hostgroup and in the backup_writer_hostgroup after a topology change, these will be excluded from the reader_hostgroup.
This is conceptually wrong.
A PXC cluster is a tightly coupled replication cluster with virtually synchronous replication. One of its benefits is that the nodes are “virtually” aligned with respect to the data state.
In this kind of model, the cluster is data-centric, and each node shares the same data view.
What this also means is that, if correctly configured, the nodes will be fully consistent for data reads.
The other characteristic of the cluster is that ANY node can become a writer at any time. While best practices indicate that it is better to use one writer at a time as Primary to prevent certification conflicts, this does not mean that the nodes not currently elected as Primary should be prevented from becoming a writer.
Which is exactly what the READ_ONLY flag does when activated.
Not only that: the need to have READ_ONLY set means that we must change it BEFORE the node is able to become a writer in case of failover.
This, in short, means we need either a topology manager or a script that will do that, with all the related checks and logic needed to be safe. At failover time this adds delay and complexity where neither is really needed, and it goes against the concept of the tightly coupled cluster itself.
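To make the point concrete, this is the extra step such a helper would have to perform on the promoted node before ProxySQL could route writes to it (illustrative only; the surrounding safety checks are precisely the complexity being argued against):

```sql
-- On the node being promoted to writer, a failover helper would need:
SET GLOBAL read_only = 0;
-- ...plus checks that the old Primary is really gone, that no other node
-- was promoted concurrently, and so on.
```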
Given the above, we can say that this ProxySQL method related to writer_is_also_reader = 0, as it is implemented today for Galera, is, at best, useless.
Why does it work for Group Replication? That is easy: because Group Replication internally uses a mechanism to lock/unlock the nodes when they are not the Primary, when the cluster runs in single-Primary mode. That internal mechanism was implemented as a safety guard to prevent random writes on multiple nodes, and it also manages the READ_ONLY flag.
Let us move on and test with writer_is_also_reader = 2. Again from the documentation:
writer_is_also_reader=2 : Only the nodes with read_only=0 which are placed in the backup_writer_hostgroup are also placed in the reader_hostgroup after a topology change i.e. the nodes with read_only=0 exceeding the defined max_writers.
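Switching the existing Galera definition over is a one-line change, sketched here against the host group definition used above:

```sql
-- Switch the Galera definition to writer_is_also_reader = 2.
UPDATE mysql_galera_hostgroups SET writer_is_also_reader = 2
 WHERE writer_hostgroup = 100;
LOAD MYSQL SERVERS TO RUNTIME;
```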
Given the settings as indicated above, my layout before using Galera support is:
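The resulting layout can be checked by querying the runtime table; a query along these lines shows which host group each node landed in and with which weight:

```sql
-- Inspect the node distribution ProxySQL actually applied at runtime.
SELECT hostgroup_id, hostname, status, weight
  FROM runtime_mysql_servers
 ORDER BY hostgroup_id, hostname;
```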
So the node ending with .22 (the elected Primary) is not in the reader pool. Which can be OK, I assume.
But what is not OK at all is that the READERS now have completely different weights. Nodes x.23 and x.233 are no longer balancing the load, because their weight is not the same, nor the one I defined. It is instead copied over from the WRITER settings.
Well of course this is wrong and not what I want. Anyhow, let’s test the READ failover.
I will use sysbench read-only:
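For reference, an invocation along these lines; note that I used a custom “windmills” test script (visible in the error below), so the standard oltp_read_only script, database name, and credentials here are stand-in assumptions:

```shell
# Read-only load through the ProxySQL node, illustrative parameters.
sysbench oltp_read_only \
  --mysql-host=192.168.4.191 --mysql-port=6033 \
  --mysql-user=app_test --mysql-password=test \
  --mysql-db=windmills \
  --tables=10 --table-size=100000 \
  --threads=8 --time=300 --report-interval=1 \
  run
```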
And the application failed:
FATAL: mysql_drv_query() returned error 9001 (Max connect timeout reached while reaching hostgroup 101 after 10000ms) for query 'SELECT id, millid, date,active,kwatts_s FROM windmills2 WHERE id=9364'
Now, this may be by design, but I have serious difficulty understanding the reasoning here, given that we allow the platform to stop serving while we still have a healthy server.
Last but not least, I am not allowed to decide WHICH nodes are the backup_writers; ProxySQL will choose them from my list of writer servers. So why not also include the one I have declared as Primary, at least in case of need? ¯\_(ツ)_/¯
OK, last try, with writer_is_also_reader = 1.
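Again sketched as a one-line change against the same Galera definition:

```sql
-- Switch the Galera definition to writer_is_also_reader = 1.
UPDATE mysql_galera_hostgroups SET writer_is_also_reader = 1
 WHERE writer_hostgroup = 100;
LOAD MYSQL SERVERS TO RUNTIME;
```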
And now I have:
Now, as you may have already realized, the point here is that, YES, my node x.22 (Primary) is able to get the READS as well, but the node was taking the whole read load from the beginning. This is because of the weight shift ProxySQL performed.
This happens because, while ProxySQL initially populates its internal table mysql_servers_incoming with the data from mysql_servers, after several steps that information is overwritten, using the writer's data for the readers as well, messing up the desired results.
Failover Due to Maintenance
In this case, I will set pxc_maint_mode = MAINTENANCE on the writer to fail over to another writer.
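The maintenance switch is done on the current writer itself; a minimal sketch:

```sql
-- On the current writer node: tell PXC (and ProxySQL) this node is going
-- into maintenance. PXC waits pxc_maint_transition_period (10s by default)
-- before completing the transition, so in-flight transactions can drain.
SET GLOBAL pxc_maint_mode = 'MAINTENANCE';
```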
The sysbench command used:
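This time a read/write load; as before, the standard oltp_read_write script and the connection parameters are stand-in assumptions for my custom test script:

```shell
# Read/write load through the ProxySQL node, illustrative parameters.
sysbench oltp_read_write \
  --mysql-host=192.168.4.191 --mysql-port=6033 \
  --mysql-user=app_test --mysql-password=test \
  --mysql-db=windmills \
  --tables=10 --table-size=100000 \
  --threads=8 --time=300 --report-interval=1 \
  run
```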
As we can see, ProxySQL does the failover to node x.23 as expected. But it also adds the node to HG 9101, which is supposed to host the offline servers.
So why move the Primary there?
Once maintenance is over, setting pxc_maint_mode back to DISABLED will restore the original Primary. In short, ProxySQL will fail back.
The whole process will not be impactful if the application is NOT using sticky connections; otherwise, the application will have to deal with:
- Error with the connection
- A retry cycle to re-run the dropped DML
Failover Because of a Crash
To check the next case, I will add --mysql-ignore-errors=all to sysbench, to be able to see how many errors I get, and for how long, when failover is needed. To simulate a crash, I will KILL -9 the mysqld process on the writer.
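The crash simulation itself is a hard kill of the server process on the writer node, along these lines:

```shell
# On the current writer node: simulate a crash by killing mysqld outright,
# giving it no chance to shut down cleanly.
kill -9 $(pidof mysqld)
```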
Check below the data taken right after the other node starts to take over; keep in mind here the PRIMARY was node 192.168.4.233:
In the end, only one node will remain as Primary, but for a certain amount of time both were serving, even though only ONE node was declared ONLINE.
A Problem Along the Road… (only with v2.0.15)
While I was trying to “fix” the issue with the weight for READERS…
Let’s say we have this:
And I want to shift some of the READ load away from the WRITER (currently 192.168.4.23).
If I simply do:
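Something as simple as lowering that node's weight in the reader host group only; the exact weight value here is an illustrative assumption:

```sql
-- Lower the READ weight of the current writer, in HG 101 ONLY.
UPDATE mysql_servers SET weight = 100
 WHERE hostgroup_id = 101 AND hostname = '192.168.4.23';
LOAD MYSQL SERVERS TO RUNTIME;
```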
it changes the value (randomly) also for HG 102, which then impacts the WRITER HG, causing a failover. At this point, I stopped testing. Too many things were making the failover scenario too unpredictable.
ProxySQL is built on a great concept and certainly covers a real gap in the MySQL ecosystem, optimizing and powering up the connection layer between the application layer and the data layer.
But, in regards to the Galera support, we are not there yet. The support provided is not only limited, it is flawed, and could lead to serious and/or unexpected problems. Even using the option writer_is_also_reader=1, which is the only one worth using, we still see too many issues in how the nodes are managed in case of serious events such as failover.
ProxySQL v2.1.0 seems to have fixed some instabilities, but we still have too many open issues to trust the Galera native support. My advice is to stay away from it and use the scheduler to deal with the Galera cluster. Write a robust script that covers your specific needs if you must customize the actions; a scheduler will serve you well.
If you would rather not write your own, there is a sample in Percona-Lab. This is the old script used with ProxySQL 1.4.x, modified to work with ProxySQL 2.x. I have also written one a long time ago that can help as well, here. Both come without any guarantee, and I advise you to use them as examples for your own; see Part 2 of this post for details.
Finally, let me say that ProxySQL is a great tool, but no tool can cover everything. People like me who have been around long enough have seen this happen many times, and it is no surprise.
Great MySQL to all.