
Why should I use a real multi-Primary setup?
To be clear, I am not talking about a multi-writer solution where any node can become the active writer in case of need, as with PXC or PS with Group Replication.
No, we are talking about a multi-Primary setup where I can write on multiple nodes at the same time.
I want to insist on this "why?".

After excluding the solutions mentioned above, both of which cover the famous 99.995% availability (roughly 26.3 minutes of downtime per year), what is left?

Disaster Recovery? Well, that is something I would love to have, but to be a real DR solution it needs to put several kilometers (miles, for the imperial crowd) between the sites.

And we know (see here and here) that, aside from some misleading advertising, we cannot have a tightly coupled cluster solution across geographical regions.

So, what is left? I may need more HA: OK, that is a valid reason. Or I may need to scale the number of writes: OK, that is a valid reason as well.
This means, in the end, that I am looking at a multi-Primary setup because:

  • Scale writes (more nodes, more writes)
    • Consistent reads (what I write on A must be visible on B)
  • Gives me 0 (zero) downtime, or close to that (5 nines means a maximum downtime of 864 milliseconds per day!)
  • Allows me to shift the writer pointer at any time from A to B and vice versa, consistently.

Now, keeping myself bound to the MySQL ecosystem, my natural choice would be MySQL NDB Cluster.

But my (virtual) boss was at AWS re:Invent, and someone mentioned to him that Aurora Multi-Primary does what I was looking for.

This (long) article is my voyage of discovery to see whether that is true or … not.

Given that I am focused on the behaviour first, and NOT interested in absolute numbers to shock the audience with millions of QPS, I will use low-level Aurora instances, and I will run the tests from two EC2 instances in the same VPC/region as the nodes.


You can find the details about the tests on GitHub here. Finally, I will test:

  • Connection speed
  • Stale reads
  • Writes on a single node, for a baseline
  • Writes on both nodes:
    • Scaling, splitting the load by schema
    • Scaling on the same schema

Test results

Let us start to have some real fun. The first test is … 

Connection Speed

The purpose of this test is to evaluate the time taken to open a new connection and the time taken to close it. Opening and closing connections can be a very expensive operation, especially if the application does not use a connection pooling mechanism.
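As a side note, a generic way to get a feeling for how much connection churn a server sees is to look at the standard status counters (this is not part of the test harness used here, just a quick check):

-- Connections = total connection attempts since startup, Threads_connected = currently open,
-- Aborted_connects = failed handshakes; a high Connections/Uptime ratio means heavy churn
SHOW GLOBAL STATUS WHERE Variable_name IN ('Connections', 'Threads_connected', 'Aborted_connects', 'Uptime');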


As we can see, ProxySQL turns out to be the most efficient way to deal with opening connections, which was expected given that it is designed to reuse open connections towards the backend.


The close connection operation is different: here ProxySQL seems to take a little bit longer. As a general observation, we can say that with ProxySQL we get a more consistent behaviour. Of course this test is a simplistic one, and we are not checking scalability (from 1 to N connections), but it is good enough to give us an initial feeling. Specific connection tests will be the focus of the next blog on Aurora MM.

Stale Reads

Aurora Multi-Primary uses the same mechanism as default Aurora to update the buffer pool:

aurora multi master sharing BP

It uses the page cache update mechanism, just working both ways. This means that the buffer pool of Node2 is updated with the modifications performed on Node1, and vice versa.

To verify whether an application would really be able to get consistent reads, I ran this test. It is meant to measure if, and how many, stale reads we get when writing on one node and reading from the other.
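The idea behind the check is simple; a minimal sketch of it (table and column names are illustrative, the actual test harness is on GitHub):

-- On the node I write to: record the write time in microseconds
CREATE TABLE IF NOT EXISTS test.stale_check (id INT PRIMARY KEY, written_at BIGINT);
INSERT INTO test.stale_check VALUES (1, UNIX_TIMESTAMP(NOW(6)) * 1000000)
  ON DUPLICATE KEY UPDATE written_at = UNIX_TIMESTAMP(NOW(6)) * 1000000;

-- On the node I read from: poll right after the write; as long as written_at is still
-- the previous value, the read is stale (age_us tells how far behind we are)
SELECT id, written_at,
       UNIX_TIMESTAMP(NOW(6)) * 1000000 - written_at AS age_us
  FROM test.stale_check WHERE id = 1;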

Amazon Aurora Multi-Primary has two consistency models:

Consistency model

The interesting fact is that with the default consistency model (INSTANCE_RAW) we got 100% stale reads.

Given that, I focused on identifying the cost of using the other consistency model (REGIONAL_RAW), which allows an application to have consistent reads.

The results indicate an increase of 44% in total execution time, and of 95% (22 times slower) in write execution.


It is interesting to note that the time taken is in some way predictable and consistent between the two consistency models. 

The graph below shows in yellow how long the application has to wait to see the correct data on the reader node, while in blue is the amount of time the application waits to get the same consistent read back, because it must wait for the commit on the writer.

As you can see, the two are more or less aligned. Given the performance cost imposed by using REGIONAL_RAW, all the other tests were done with the default INSTANCE_RAW, unless explicitly stated.

Writing tests

All tests run in this section were done using sysbench-tpcc with the following settings:

sysbench ./tpcc.lua --mysql-host=<> --mysql-port=3306 --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --time=300 --threads=32 --report-interval=1 --tables=10 --scale=15  --mysql_table_options=" CHARSET=utf8 COLLATE=utf8_bin"  --db-driver=mysql prepare

sysbench /opt/tools/sysbench-tpcc/tpcc.lua --mysql-host=$mysqlhost --mysql-port=$port --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --db-driver=mysql --tables=10 --scale=15 --time=$time  --rand-type=zipfian --rand-zipfian-exp=0 --report-interval=1 --mysql-ignore-errors=all --histogram  --report_csv=yes --stats_format=csv --db-ps-mode=disable --threads=$threads run

Write Single node (Baseline)

Before starting the comparative analysis, I wanted to define the "limit" of traffic/load for this platform.


From the graph above, we can see that this setup scales up to 128 threads; after that, the performance remains more or less steady.

Amazon claims that we can roughly double the performance when using both nodes in write mode and using a different schema to avoid conflicts.


 

Once more, remember that I am not interested in the absolute numbers here, but in the behaviour. Given that, the expectation is to see the throughput roughly double when writing on both nodes.

Write on both nodes different schemas

So AWS recommends, as the scaling solution, to split the traffic by database/table/partition in order to avoid conflicts.

And I diligently followed the advice.

I used two EC2 instances in the same subnet as the Aurora nodes, each writing to a different schema (tpcc and tpcc2).

Overview

Let us make it short and go straight to the point: did we get the expected scalability?

Well, no. We got just a 26% increase, quite far from the expected 100%.

Let us see what happened in detail (if you are not interested, just skip to the next test).

Node 1

Picture 5

Node 2

Picture 6

As you can see, Node1 was (more or less) keeping up with expectations, staying close to the expected performance. But Node2 was just not keeping up; performance there was simply terrible.

The graphs below show what happened.

While Node1 was (again, more or less) scaling up to the baseline expectation (128 threads), Node2 collapsed to its knees at 16 threads and was never able to scale up.

Reads

Node 1

t4

Node1 scales the reads as expected, even if here and there we can see some performance deterioration.

Node 2

t7

Node2 is not scaling Reads at all. 

Writes

Node 1

t5

Same as Read

Node 2

t8

Same as read

Now, someone may think I made a mistake and was writing to the same schema. I assure you I was not.

Check the next test to see what happens when actually using the same schema.

Write on both nodes same schema

Overview

Now, now Marco, this is unfair. You know this will cause contention.

Yes, I do! But nonetheless I was curious to see what would happen and how the platform would deal with that level of contention.
My expectation was to see a lot of performance degradation and an increased number of locks. About conflicts I was not wrong; after the test, node2 reported:

+-------------+---------+-------------------------+
| table       | index   | PHYSICAL_CONFLICTS_HIST |
+-------------+---------+-------------------------+
| district9   | PRIMARY |                    3450 |
| district6   | PRIMARY |                    3361 |
| district2   | PRIMARY |                    3356 |
| district8   | PRIMARY |                    3271 |
| district4   | PRIMARY |                    3237 |
| district10  | PRIMARY |                    3237 |
| district7   | PRIMARY |                    3237 |
| district3   | PRIMARY |                    3217 |
| district5   | PRIMARY |                    3156 |
| district1   | PRIMARY |                    3072 |
| warehouse2  | PRIMARY |                    1867 |
| warehouse10 | PRIMARY |                    1850 |
| warehouse6  | PRIMARY |                    1808 |
| warehouse5  | PRIMARY |                    1781 |
| warehouse3  | PRIMARY |                    1773 |
| warehouse9  | PRIMARY |                    1769 |
| warehouse4  | PRIMARY |                    1745 |
| warehouse7  | PRIMARY |                    1736 |
| warehouse1  | PRIMARY |                    1735 |
| warehouse8  | PRIMARY |                    1635 |
+-------------+---------+-------------------------+

Which is obviously a strong indication that something was not working right. In terms of performance gain, if we compare ONLY the results at 128 threads:

Picture 4

Even with this high level of conflict, we still get a 12% performance gain.

The problem is that, in general, the two nodes behave quite badly.
If you check the graphs below, you can see that the level of conflict is such that it prevents the nodes not only from scaling but also from behaving consistently.

Node 1

Picture 7

Node 2

Picture 8

Reads

In the following graphs we can see how Node1 had issues; it actually crashed three times, during the tests with 32/64/512 threads.
Node2 was always up, but its performance was very low.

Node 1

t10

Node 2

t13

Writes

Node 1

t11

Node 2

t14

Recovery from crashed Node

Regarding recovery time: reading the AWS documentation and listening to presentations, I often heard that Aurora Multi-Primary is a zero-downtime solution, or statements like:

"In applications where you can't afford even brief downtime for database write operations, a multi-master cluster can help to avoid an outage when a writer instance becomes unavailable. The multi-master cluster doesn't use the failover mechanism, because it doesn't need to promote another DB instance to have read/write capability."

To achieve this, the suggestion I found was to have the application point directly to the node endpoints and not use the cluster endpoint.
In this context, the solution pointing to the nodes should be able to fail over within a second or so, while for the cluster endpoint the documented times are:

fail over times using mariadb driver

Personally, I think that designing an architecture where the application is responsible for the database connection handling and failover is a throwback to 2001. But if you feel this is the way, well, go for it.

What I did for testing was to use ProxySQL, as plain as possible, with nothing more than the basic checks coming from its native monitor.

I then compared the results with the tests using the cluster endpoint.
In this way I adopted the advice of pointing directly at the nodes, but handled the failover in the access layer rather than in the application.
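For reference, the ProxySQL side was as plain as it gets; a minimal sketch of registering the two instance endpoints (hostnames and hostgroup ID are placeholders, not my actual configuration):

-- On the ProxySQL admin interface (port 6032)
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
VALUES (100, 'aurora-node-1.xxxxxxxx.us-east-1.rds.amazonaws.com', 3306, 1000);
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
VALUES (100, 'aurora-node-2.xxxxxxxx.us-east-1.rds.amazonaws.com', 3306, 1000);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;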

The results are below and they confirm (more or less) the data coming from Amazon.


A downtime of 7 seconds is quite a long time nowadays, especially if I am targeting a 5-nines solution which, remember, allows 864 ms of downtime per day.

Using ProxySQL gets closer to that, but it is still too long to be called 0 (zero) downtime.

I also had fail-back issues when using the AWS cluster endpoint, given that it was not able to move the connections to the re-joining node seamlessly.

Last but not least, when using the consistency level INSTANCE_RAW, I had some data issues as well as PK conflicts:
FATAL: mysql_drv_query() returned error 1062 (Duplicate entry '18828082' for key 'PRIMARY')   

Conclusions

As stated at the beginning of this long blog, the reasons and expectations for going with a multi-Primary solution were:

  • Scale writes (more nodes, more writes)
  • Gives me 0 (zero) downtime, or close to that (5 nines means a maximum downtime of 864 milliseconds per day!)
  • Allows me to shift the writer pointer at any time from A to B and vice versa, consistently.

Honestly I feel we have completely failed the scaling point.


Probably if I used the largest Aurora instances I would get much better absolute numbers, and it would take me longer to hit the same issues, but I would hit them.

In any case, if the multi-Primary solution is designed to provide that scalability, it should do so with any instance size.

I did not have zero downtime, but I was able to fail over pretty quickly with ProxySQL.

Finally, unless the consistency model is REGIONAL_RAW, shifting from one node to the other is prone to possible negative effects like stale reads. Because of that, I consider this requirement not satisfied in full.

Given all the above, I think this solution could eventually be valid only for High Availability (close to 5 nines), but given that it comes with some limitations, I do not feel comfortable preferring it over others just for HA; in the end, default Aurora is already good enough as a highly available solution.

References

https://www.youtube.com/watch?v=p0C0jakzYuc

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-multi-master.html

https://www.slideshare.net/marcotusa/improving-enterprises-ha-and-disaster-recovery-solutions-reviewed

https://www.slideshare.net/marcotusa/robust-ha-solutions-with-proxysql

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-multi-master.html#aurora-multi-master-limitations  

If you have a business, no matter how small, you are collecting data, and you need that data to be accessible in order to make informed decisions about how to make your business better. The more successful you become, the more data you produce and the more dependent on it you become. This is when you start to realize you must have your data in a safe place like a database, instead of some spreadsheet.

But putting your data in a database is not enough: you must be sure the database is robust, resilient, and always available when you or your customers need it.

When designing robust database architectures, we always discuss High Availability (HA) and Disaster Recovery (DR). These two concepts are elements of the larger umbrella that is the business continuity plan.

To cover the different needs, we (database architects) use two main models: the Tightly Coupled cluster and the Loosely Coupled cluster (latest presentation: https://www.slideshare.net/marcotusa/best-practicehigh-availabilitysolutiongeodistributedfinal).

The combination of the two models allows us to build solid, resilient, and scalable HA solutions that are geographically distributed.

Like this:

Picture1

 

Until now, the part that was NOT natively supported in the architecture above was how to fail over the replication channel from one node to another in case of a crash.

This element is currently covered by a custom script (https://github.com/y-trudeau/Mysql-tools/tree/master/PXC) developed by my colleague Yves Trudeau.

But in MySQL 8.0.22, Oracle introduced the asynchronous replication connection failover feature. In this blog I am going to check whether this feature is enough to cover what we need, so that we can avoid external custom scripts, or whether there is still work to do.

My scenario is the following:

I have a PXC 8.0.20 cluster on DC1 and I am going to replicate towards a Group Replication cluster 8.0.22.

First, I need to establish the replication; to do so I follow the standard steps.

Once I am up and running, I have a situation like this:
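This is the standard Group Replication membership view, obtained with:

SELECT * FROM performance_schema.replication_group_members;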

+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 38809616-e149-11ea-b0cc-080027d0603e | gr1         |        3306 |       ONLINE |     PRIMARY |         8.0.22 |
| group_replication_applier | 959cc074-e14b-11ea-b90c-080027d0603e | gr2         |        3306 |       ONLINE |   SECONDARY |         8.0.22 |
| group_replication_applier | a56b38ec-e14b-11ea-b9ce-080027d0603e | gr3         |        3306 |       ONLINE |   SECONDARY |         8.0.22 |
| group_replication_applier | b7039e3b-f744-11ea-854c-08002734ed50 | gr4         |        3306 |       ONLINE |   SECONDARY |         8.0.22 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+

 

Then I am ready to create the replication channel.

To start, I first use the OLD style, without auto-failover.

On one of the PXC nodes:

CREATE USER replica_dc@'192.168.4.%' IDENTIFIED BY 'secret';
GRANT REPLICATION SLAVE ON *.* TO replica_dc@'192.168.4.%';

On the PRIMARY of my Group Replication cluster:

CHANGE MASTER to
master_user='<user>',
master_password='<secret>',
master_host='192.168.4.22',
master_auto_position=1,
source_connection_auto_failover=0,
master_retry_count=6,
master_connect_retry=10
for channel "dc1_to_dc2";

 

A brief explanation about the above:

  • source_connection_auto_failover=0   <-- enables/disables automatic failover
  • master_retry_count=6                <-- the number of reconnection attempts per source
  • master_connect_retry=10             <-- the interval (in seconds) between reconnection attempts
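Once the channel is defined, it can be started and verified in the usual way; a quick sketch (in 8.0.22 the SLAVE and REPLICA keywords are interchangeable):

start slave for channel 'dc1_to_dc2';
show slave status for channel 'dc1_to_dc2'\G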

Once Replication is up, we will have this scenario:

async failover 8022 pxc base async1

Now, in case of a crash of the PXC node I am replicating from, my Group Replication cluster will stop replicating:

async failover 8022 pxc base async2

To prevent this from happening, I will use the new functionality that allows asynchronous replication connection failover.

To do so, the first thing I have to do is add the list of possible SOURCEs to the new table located in the mysql schema (mysql.replication_asynchronous_connection_failover). To do that, I don't use an INSERT command but a function:

asynchronous_connection_failover_add_source(channel, host, port, network_namespace, weight)

where:

  • channel: The replication channel for which this replication source server is part of the source list.
  • host: The host name for this replication source server.
  • port: The port number for this replication source server.
  • network_namespace: The network namespace for this replication source server (when specified).
  • weight: The priority of this replication source server

In my case I will have:

SELECT asynchronous_connection_failover_add_source('dc1_to_dc2', '192.168.4.22', 3306, null, 100);
SELECT asynchronous_connection_failover_add_source('dc1_to_dc2', '192.168.4.23', 3306, null, 80);
SELECT asynchronous_connection_failover_add_source('dc1_to_dc2', '192.168.4.233', 3306, null, 50);

Now, and this is important: given that the information is located in a table in the mysql schema, the information present on the Primary is automatically replicated across the cluster. As a result, all nodes SEE the list of potential sources.
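You can verify that on any node of the cluster by simply reading the table (the same rows should be visible on every member, not only on the Primary):

SELECT * FROM mysql.replication_asynchronous_connection_failover;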

But given that we are running our Group Replication cluster in Single-Primary mode, only the Primary can act as a REPLICA, because only the Primary can write. I hope that is clear.

To activate the failover, we just need to issue one command. But first, let us check the current status:

SELECT CHANNEL_NAME, SOURCE_CONNECTION_AUTO_FAILOVER FROM performance_schema.replication_connection_configuration where CHANNEL_NAME = 'dc1_to_dc2';
+--------------+---------------------------------+
| CHANNEL_NAME | SOURCE_CONNECTION_AUTO_FAILOVER |
+--------------+---------------------------------+
| dc1_to_dc2   | 0                               |
+--------------+---------------------------------+


OK, all good; I can activate the auto-failover:

stop slave for channel 'dc1_to_dc2';
CHANGE MASTER TO source_connection_auto_failover=1 for channel "dc1_to_dc2";
start slave for channel 'dc1_to_dc2';
SELECT CHANNEL_NAME, SOURCE_CONNECTION_AUTO_FAILOVER FROM performance_schema.replication_connection_configuration where CHANNEL_NAME = 'dc1_to_dc2';

+--------------+---------------------------------+
| CHANNEL_NAME | SOURCE_CONNECTION_AUTO_FAILOVER |
+--------------+---------------------------------+
| dc1_to_dc2   | 1                               |
+--------------+---------------------------------+

Perfect all is running, and this is my layout now:

async failover 8022 pxc base async3

At this point, if something happens to my current SOURCE, the auto-failover will be triggered, and I should see the replication on my GR Primary move to another source node.

After I kill pxc_node2, the error log of the Primary shows:

[ERROR] [MY-010584] [Repl] Slave I/O for channel 'dc1_to_dc2': error reconnecting to master 'replica_dc@192.168.4.23:3306' - retry-time: 10 retries: 1 message: Can't connect to MySQL server on '192.168.4.23' (111), Error_code: MY-002003
...
[ERROR] [MY-010584] [Repl] Slave I/O for channel 'dc1_to_dc2': error reconnecting to master 'replica_dc@192.168.4.23:3306' - retry-time: 10 retries: 6 message: Can't connect to MySQL server on '192.168.4.23' (111), Error_code: MY-002003

[System] [MY-010562] [Repl] Slave I/O thread for channel 'dc1_to_dc2': connected to master 'replica_dc@<new source>:3306', replication started in log 'FIRST' at position 53656167
[Warning] [MY-010549] [Repl] The master's UUID has changed, although this should not happen unless you have changed it manually. The old UUID was 28ae74e9-12c7-11eb-8d57-08002734ed50.

The Primary identifies that the SOURCE has failed, and retries N times as asked, waiting T seconds between attempts. After that, it tries to change the source, choosing it from the given list. If we use different weight values, the node elected will be the available one with the highest weight.

Once a node is elected as SOURCE, the replication channel WILL NOT fail back, even if a node with a higher weight becomes available. The replication SOURCE changes only if there is a failure in the communication between the SOURCE-REPLICA pair.

This is what I have now:

async failover 8022 pxc async failover4

 

 

Is this enough?

Well, let us say that we are very close, but no, this is not enough.

Why?

Because this covers ONLY one side of the picture: the REPLICA node being able to change its SOURCE. But if we are talking about a cluster like Group Replication, we need a mechanism that allows the CLUSTER (and not only the single node) to fail over.

What does it mean?

Here I mean that IF the Primary fails (remember only a Primary can become a REPLICA because only a Primary can write), another Primary will be elected by the cluster, and I would expect that if my previous Primary was replicating from a SOURCE, then the new one will do the same starting from the last valid applied position.

I am sure Oracle has already identified this as a possible gap, given that, to me, this new feature sounds like something done to make MySQL-based architectures more resilient. As such, the above looks like the most logical step forward when thinking of a tightly coupled cluster like Group Replication.

Summary of commands

Set the source:

CHANGE MASTER TO master_user='replica_dc', master_password='<secret>', master_host='192.168.4.22', master_auto_position=1, source_connection_auto_failover=0, master_retry_count=6, master_connect_retry=10 FOR CHANNEL 'dc1_to_dc2';

Set the list:

SELECT asynchronous_connection_failover_add_source('dc1_to_dc2', '192.168.4.22', 3306, null, 100);
SELECT asynchronous_connection_failover_add_source('dc1_to_dc2', '192.168.4.23', 3306, null, 80);
SELECT asynchronous_connection_failover_add_source('dc1_to_dc2', '192.168.4.233', 3306, null, 50);

Delete an entry from the list

SELECT asynchronous_connection_failover_delete_source('dc1_to_dc2', '192.168.4.22', 3306, '');

Check if a replication channel is using automatic failover:

SELECT CHANNEL_NAME, SOURCE_CONNECTION_AUTO_FAILOVER FROM performance_schema.replication_connection_configuration where CHANNEL_NAME = 'dc1_to_dc2';

To change the SOURCE setting:

stop slave for channel 'dc1_to_dc2';
CHANGE MASTER TO source_connection_auto_failover=1 for channel "dc1_to_dc2";
start slave for channel 'dc1_to_dc2';

Conclusions

This is a very nice feature that will significantly increase resilience when asynchronous replication is part of the architecture.

Still, I think we need to wait for it to fully cover what is needed. This is a great first step, but not enough: we still must manage the replication channel ourselves if the node acting as REPLICA in the receiving cluster fails.

I really hope we will see the evolution of that soon.

In the meantime, GREAT WORK MySQL/Oracle team, really!!

References

https://dev.mysql.com/doc/refman/8.0/en/replication-functions-source-list.html

https://www.slideshare.net/marcotusa/best-practicehigh-availabilitysolutiongeodistributedfinal

https://github.com/y-trudeau/Mysql-tools/tree/master/PXC

https://dev.mysql.com/doc/refman/8.0/en/change-master-to.html

Why another article on this Marco?

Deadlocks are a topic covered many times, with a lot of articles on the web, including from Percona.
I suggest you review the reference section for articles on how to identify deadlocks and where they originate.
So why another article?
The answer is that exchanges like the following are still very common:

User (John): “Marco our MySQL is having problems”
Marco: “Ok John what problems. Can you be a bit more specific?”
John: “Our log scraper is collecting that MySQL has a lot of errors”
Marco: “Ok can you share the MySQL log so I can review it?”
John: “Errors are in the application log, will share one application log”

Marco reviews the log, and in it he finds:

“ERROR 1213 (40001): Deadlock found when trying to get lock;
try restarting transaction”

Marco's reaction is: "Oh my ..." headdesk

Why? Because a deadlock is not the error the message expresses; and, of course, we have a problem of mindset and, last but not least, of terminology.

In this very short article I will try to change your point of view around Deadlocks.

What is a deadlock?

A deadlock is a situation wherein two or more competing actions are waiting for each other to finish. As a consequence, neither ever does.
In computer science, deadlock refers to a specific condition in which two or more processes are each waiting for the other to release a resource.
In order for a deadlock to happen, four conditions (the Coffman conditions) must exist:

  • Mutual exclusion: at least one resource must be held in a non-shareable mode; otherwise, the processes would not be prevented from using the resource when necessary. Only one process can use the resource at any given instant in time.
  • Hold and wait (resource holding): a process is currently holding at least one resource and requesting additional resources which are being held by other processes.
  • No preemption: a resource can be released only voluntarily by the process holding it.
  • Circular wait: each process must be waiting for a resource which is being held by another process, which in turn is waiting for the first process to release the resource.

All of the above are conditions that are not bound to RDBMSs only, but apply to any system dealing with transactional data processing. In any case, it is a fact that today deadlocks are, in most cases, not avoidable unless we prevent one of the above conditions from happening, and that cannot be done without compromising the integrity of the execution. Breaking or ignoring one of the above rules, especially for an RDBMS, could affect data integrity, which would go against the very reason an RDBMS exists.

Just to help us put this in context, let us review a simple case of deadlock.
Say I have MySQL with the World schema loaded, and I have TWO transactions running, both touching the same two cities in Tuscany (Firenze and Prato) but in a different order.

mysql> select * from City where CountryCode = 'ITA' and District='Toscana';
+------+---------+-------------+----------+------------+
| ID   | Name    | CountryCode | District | Population |
+------+---------+-------------+----------+------------+
| 1471 | Firenze | ITA         | Toscana  |     376662 | <---
| 1483 | Prato   | ITA         | Toscana  |     172473 | <---
+------+---------+-------------+----------+------------+

And both transactions are updating the population:

Connection 1 will have:

connection1 > start transaction;
Query OK, 0 rows affected (0.01 sec)

connection1 > select * from City where ID=1471;
+------+---------+-------------+----------+------------+
| ID   | Name    | CountryCode | District | Population |
+------+---------+-------------+----------+------------+
| 1471 | Firenze | ITA         | Toscana  | 376662     |
+------+---------+-------------+----------+------------+
1 row in set (0.00 sec)

connection1 > update City set Population=Population + 1 where ID = 1471;
Query OK, 1 row affected (0.05 sec)
Rows matched: 1 Changed: 1 Warnings: 0

connection1 > update City set Population=Population + 1 where ID = 1483;
Query OK, 1 row affected (2.09 sec)
Rows matched: 1 Changed: 1 Warnings: 0

Connection 2 will have:
connection 2 >start transaction;
Query OK, 0 rows affected (0.01 sec)

connection 2 >select * from City where ID=1483;
+------+-------+-------------+----------+------------+
| ID   | Name  | CountryCode | District | Population |
+------+-------+-------------+----------+------------+
| 1483 | Prato | ITA         | Toscana  | 172473     |
+------+-------+-------------+----------+------------+
1 row in set (0.01 sec)

connection 2 >update City set Population=Population + 1 where ID = 1483;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

connection 2 >update City set Population=Population + 1 where ID = 1471;
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

This is a very simple example of deadlock detection.
An image may help:

dl ff 1 

 

If we stop for a second and ignore the word "ERROR" in the message, what is really happening is that MySQL is preventing us from modifying the data in the wrong way. If the locks were not in place, one of the two transactions would modify the population by incrementing a number that is no longer valid.
The right thing to do is to abort one of the two transactions and NOTIFY the application that, if it really needs to perform the action (in this case, increase the population), it had better redo the execution and make sure the action is still required.
Just think: it could happen that the application re-runs transaction 2 and identifies that there is no need to increase the value, because it is already what it is supposed to be.
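It is also worth noting that the deadlock in the example above exists only because the two transactions touch the same rows in opposite order. A minimal sketch of the usual mitigation on the application side, accessing rows in a consistent order (here, ascending primary key) so the circular wait cannot form:

-- Both connections perform the updates in the same order;
-- the second transaction simply waits for the first to commit instead of deadlocking
START TRANSACTION;
UPDATE City SET Population = Population + 1 WHERE ID = 1471;
UPDATE City SET Population = Population + 1 WHERE ID = 1483;
COMMIT;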

Think of calculating the financial situation of your company while you and a colleague are processing the same data for different tasks.
Without locks and deadlocks you may end up corrupting each other's interpretation of the data and performing the wrong operations. As a result, you may end up paying the wrong salaries, or worse.

Given that, and more, deadlocks (and locks) need to be seen as friends helping us keep our data consistent.
The problem arises when we have applications that are poorly designed and developed, and, unfortunately, from the (in my opinion) wrong terminology used by MySQL.

Let us start with MySQL. Deadlock detection is the detection of an intrinsic, inevitable condition in the RDBMS/ACID world. As such, defining it as an ERROR is totally misleading. A deadlock is a CONDITION, and its natural conclusion is the abort of one of the transactions involved in it.
The message should be a NOTIFICATION, not an ERROR.

The problem on the application side, instead, is that the isolation and validation of the data is normally delegated to the RDBMS, which is fine. But then we only seldom see applications able to deal with messages like lock timeouts or deadlocks. This is a huge pitfall, because while it is natural to have the RDBMS deal with data consistency, it is not, and should not be, responsible for the retry, which is bound to the application logic.
Nowadays we have a lot of applications that require very fast execution, and locks and deadlocks are seen as enemies because they have a cost in time.

But this is a mistake, a design mistake. If you prefer speed over data consistency, then you should not use an RDBMS, which must respect specific rules at any (time) cost.
Other, eventually consistent, data stores will be more appropriate in your case.

If instead you care about your data, then you need to listen to your RDBMS and write your code in a way that gets all the benefit out of it, also when it comes to deadlocks.

Conclusion

Deadlocks (and locks) should be seen as friends. They are mechanisms that exist to keep our data consistent. We should not bypass them unless we are willing to compromise our data.

As previously indicated, if you want to understand in detail how to diagnose a deadlock, review the links in the references.

References
https://www.percona.com/blog/2012/09/19/logging-deadlocks-errors/
https://www.percona.com/blog/2014/10/28/how-to-deal-with-mysql-deadlocks/
https://www.percona.com/community-blog/2018/09/24/minimize-mysql-deadlocks-3-steps/

 

A small thing that brings huge help.

The other day I was writing some code to process a very large number of items coming from a social media API. My items were ending up in a queue in MySQL and then needed to be processed and eventually moved elsewhere.

The task was not so strange, but what I had to develop was a queue processor. Now, when you need to process a queue you have two types of queue: static and dynamic.

A static queue comes as a batch of N items in a given time interval, and is normally easier to process given that you have a defined number of items that you can split into chunks and process in parallel.

A dynamic queue is… well... more challenging. One option is to wait until you have a predefined number of items and then process them as if they were a static queue.

But this approach is not very good, because it may delay the processing of an item significantly, for all the time the queue takes to reach the desired size.

The other possibility is to have the processing jobs work on a single item instead of a chunk/batch. But this is not optimal either, given the chance to process the queue in batches and speed up the processing time.

My incoming queue is a bit unpredictable: a mix of fixed-size batches and a few thousand items coming in sparsely, without a clear interval. So there I was, thinking about how to process this and already starting to design a quite complex mechanism to dynamically calculate the size of the chunks and the number of jobs, when…

An aside: some colleagues know my habit of reading the whole MySQL manual, from A to Z, at least once a year. It's a way for me to review what is going on and sometimes to dig deeper into some aspects. On top of reading the release notes, this normally also gives me a good level of confidence about new features and other changes.

...When … looking at the documentation for something else, my attention was captured by:

“To avoid waiting for other transactions to release row locks, NOWAIT and SKIP LOCKED options may be used with SELECT ... FOR UPDATE or SELECT ... FOR SHARE locking read statements.”

Wait -  what???

Let me dig in a bit:

“SKIP LOCKED. A locking read that uses SKIP LOCKED never waits to acquire a row lock. The query executes immediately, removing locked rows from the result set.”

Wow, how could I have missed that?

It is also not new: it has been there since MySQL 8.0.1, the milestone release. Having experience with Oracle, I knew what SKIP LOCKED does and how to use it, but I was really not aware that it was also available in MySQL.

In short, SKIP LOCKED allows you to lock a row (or set of them), bypassing the rows already locked by other transactions.

The classic example is:

# Session 1:
mysql> CREATE TABLE t (i INT, PRIMARY KEY (i)) ENGINE = InnoDB;
mysql> INSERT INTO t (i) VALUES(1),(2),(3);
mysql> START TRANSACTION;
mysql> SELECT * FROM t WHERE i = 2 FOR UPDATE;
+---+
| i |
+---+
| 2 |
+---+

# Session 2:
mysql> START TRANSACTION;
mysql> SELECT * FROM t FOR UPDATE SKIP LOCKED;
+---+
| i |
+---+
| 1 |
| 3 |
+---+

But what is important for me is that, with N jobs running, I can skip all the effort of calculating dynamic chunks, given that with SKIP LOCKED the same thing happens, if in a different way.

All good, but what performance will I get using SKIP LOCKED in comparison to the other two solutions?

I ran the following tests on my laptop, so not on a real server, and used a fake queue processor I wrote on the fly; you can find it on GitHub here.

What I do is to create a table like this:

CREATE TABLE `jobs` (
  `jobid` int unsigned NOT NULL AUTO_INCREMENT,
  `time_in` bigint NOT NULL,
  `time_out` bigint DEFAULT '0',
  `worked_time` bigint DEFAULT '0',
  `processer` int unsigned DEFAULT '0',
  `info` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`jobid`),
  KEY `idx_time_in` (`time_in`,`time_out`),
  KEY `idx_time_out` (`time_out`,`time_in`),
  KEY `processer` (`processer`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Then I will process the queue with three different methods:

  1. Simple processing, reading and writing one row at a time
  2. Using chunks to split the queue
  3. Using SKIP LOCKED

To clarify the difference between the three ways of processing the queue, let us use three simple images:

proc1

In simple processing, each row represents a possible lock that the other processes must wait for.

proc2

In chunk processing, given that each process knows which records to lock, the processes can go in parallel.

proc3

With SKIP LOCKED, even if each process has no idea which rows it needs to lock, it is enough to state the size of the chunk, and MySQL will return the available (unlocked) records (see the sketch below).
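To make the SKIP LOCKED approach concrete, this is roughly what each worker does against the jobs table above; a minimal sketch (the batch size of 50 and the way a row is marked as done are illustrative, not the exact code of the test application):

START TRANSACTION;
-- Claim up to 50 unprocessed jobs; rows already locked by other workers are silently skipped
SELECT jobid, info
  FROM jobs
 WHERE time_out = 0
 ORDER BY jobid
 LIMIT 50
 FOR UPDATE SKIP LOCKED;

-- ... the application processes the claimed rows, then marks them as done,
-- using the jobid values returned by the SELECT above, e.g.:
UPDATE jobs
   SET time_out = UNIX_TIMESTAMP(NOW(3)) * 1000, processer = 1
 WHERE jobid IN (1, 2, 3);
COMMIT;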

 

I will repeat the tests first for a static queue, and then for a dynamic queue running for 20 seconds. Let's see what happens.

Test one - simple processing and static queue:

queue Processor report:
=========================
Start Time                = Tue Jul 28 13:28:25 CEST 2020
End Time                  = Tue Jul 28 13:28:31 CEST 2020
Total Time taken(ms)      =     6325.0
Number of jobs            =          5
Number of Items processed =      15000
Avg records proc time     =     220308

Test 2 use the chunks and static queue

Chunk no loop
--------------
queue Processor report:
=========================
Start Time                = Tue Jul 28 13:30:18 CEST 2020
End Time                  = Tue Jul 28 13:30:22 CEST 2020
Total Time taken(ms)      =     4311.0
Number of jobs            =          5
Number of Items processed =      15000
Avg records proc time     =     391927

Test three - use SKIP LOCKED and static queue:

SKIP LOCK - no loop
------------------------
queue Processor report:
=========================
Start Time                = Tue Jul 28 13:32:07 CEST 2020
End Time                  = Tue Jul 28 13:32:11 CEST 2020
Total Time taken(ms)      =     4311.0
Number of jobs            =          5
Number of Items processed =      15000
Avg records proc time     =     366812

So far, so good.

Time is almost the same (actually in this test, it's exactly the same), normally fluctuating a bit up and down by a few ms.

Picture 1 

 Average execution by commit fluctuates a bit:

Picture 2

Here, the base execution is faster for the simple reason that the application processes one record at a time, against a batch of records for the other two.

Now it is time to see what happens if, instead of a static batch, I have a process that fills the queue constantly. If you want to picture what happens in each test, just imagine this:

  • Some pipes pour water into a huge tank
  • The base solution tries to empty the tank using a small glass of water, but acting very fast at each run
  • With chunks, it waits for the water to reach a specific level, then uses a fixed-size bucket
  • Using SKIP LOCKED, it constantly looks at the tank and chooses the size of the bucket based on the quantity of water present.

To simulate that, I will use five threads to write new items, five to process the queue, and will run the test for 20 seconds.

Here we will have some leftovers: that is, how much water remains in the tank because the application could not empty it with the given bucket. We can say it is a way to measure the efficiency of the processing, where the optimum is an empty tank.

Test one - simple processing and dynamic queue:

Basic no loop
--------------
queue Processor report:
=========================
Start Time                = Tue Jul 28 13:42:37 CEST 2020
End Time                  = Tue Jul 28 13:43:25 CEST 2020
Total Time taken(ms)      =    48586.0
Number of jobs            =          5
Number of Items processed =      15000
Total Loops executed (sum)=         85
Avg records proc time     =     243400

Leftover
+----------+------------+
| count(*) | max(jobId) |
+----------+------------+
|   143863 |     225000 |
+----------+------------+

Test two - use chunks and dynamic queue:

Chunk no loop
--------------
queue Processor report:
=========================
Start Time                = Tue Jul 28 13:44:56 CEST 2020
End Time                  = Tue Jul 28 13:45:44 CEST 2020
Total Time taken(ms)      =    47946.0
Number of jobs            =          5
Number of Items processed =      15000
Total Loops executed (sum)=         70
Avg records proc time     =     363559

Leftover
+----------+------------+
| count(*) | max(jobId) |
+----------+------------+
|       53 |     195000 |
+----------+------------+

Test three - use SKIP LOCKED and dynamic queue:

queue Processor report:
=========================
Start Time                = Tue Jul 28 14:46:45 CEST 2020
End Time                  = Tue Jul 28 14:47:32 CEST 2020
Total Time taken(ms)      =    46324.0
Number of jobs            =          5
Number of Items processed =      15000
Total Loops executed (sum)=       1528
Avg records proc time     =     282658

Leftover
+----------+------------+
| count(*) | max(jobId) |
+----------+------------+
|        0 |       NULL |
+----------+------------+


Picture 3

Here, the scenario is a bit different than the one we had with the static queue.

Picture 4

Record processing, when comparing chunk and SKIP LOCKED, is again more efficient with the second one. This is because it optimizes the size of the "bucket", given that it can sometimes process fewer records per loop.

Picture 15

As we can see, when using SKIP LOCKED the application was able to execute 1528 loops to process the queue, against the 70 of the chunk approach and the 85 of the basic approach.

In the end, the only one able to empty the tank was the SKIP LOCKED solution.

Conclusion

Processing queues can be simple when we need to process a fixed number of items, but if you need an adaptive approach, the situation changes. You can find yourself writing quite complex algorithms to optimize the processing.

Using SKIP LOCKED helps you keep the code/solution simple and moves the burden of identifying the records to process onto the RDBMS.

SKIP LOCKED is something that other technologies like Oracle and PostgreSQL have already implemented, and their developer communities use it.

The MySQL implementation came a bit later, and the option is not widely known or used in the MySQL developer community, but it should be.

Give it a try and let us know!

NOTE !!

SKIP LOCKED is declared unsafe for statement-based replication; you MUST use ROW-based replication if you use it.
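A quick sanity check before adopting it (generic commands; changing the global value requires the appropriate privileges and only affects new sessions):

SHOW GLOBAL VARIABLES LIKE 'binlog_format';
-- ROW is already the default in 8.0; if it has been changed, switch it back:
SET GLOBAL binlog_format = 'ROW';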

References

MySQL 8.0 Hot Rows with NOWAIT and SKIP LOCKED

MySQL 8.0 Reference Manual: Locking Reads

 

 

 

 

Back to part 1

Now that we have seen how to set up our environment with PXC 8, it is time to see how our setup behaves and what can be done in case of issues.

We will now analyse the following common situations:
- Put a node in maintenance or remove a node and insert it back
- node crash
- network failure
- proxysql node crash

The following tests were done using a Java application connecting with a straight JDBC connection and no connection pooling. I chose that over a benchmarking tool such as sysbench because I was looking for something as close as possible to a real application rather than a benchmark. In particular, I was interested in having dependencies across requests, real data retrieval, and error management.

As soon as we start the scheduler script, it checks the nodes and identifies that we have writer_is_also_reader=0, so it removes the reader entry for the current writer:

2020/07/03 06:43:50.130:[INFO] Writer is also reader disabled removing node from reader Hostgroup 192.168.1.5;3306;201;5000 Retry #1
2020/07/03 06:43:50.138:[WARN] DELETE node:192.168.1.5;3306;201;5000 SQL: DELETE from mysql_servers WHERE hostgroup_id=201 AND hostname='192.168.1.5' AND port='3306'
2020/07/03 06:43:50.143:[INFO] END EXECUTION Total Time(ms):277.705907821655
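For context, this kind of check script is normally hooked into ProxySQL through its scheduler table; a sketch of how that typically looks (path, interval, and arguments are placeholders, not the exact configuration used here):

-- On the ProxySQL admin interface
INSERT INTO scheduler (id, active, interval_ms, filename, arg1)
VALUES (10, 1, 2000, '/var/lib/proxysql/galera_check.pl', '--debug=0');
LOAD SCHEDULER TO RUNTIME;
SAVE SCHEDULER TO DISK;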

Test 1 Put a node in maintenance

PXC has a very useful feature, pxc_maint_mode, to deal with maintenance and to notify applications and mid-level architectural blocks (such as ProxySQL) that a node is going to be put under maintenance.

With pxc_maint_mode you can specify the maintenance mode for taking a node down without adjusting settings in ProxySQL.

The following values are available:

  • DISABLED: This is the default state that tells ProxySQL to route traffic to the node as usual.
  • SHUTDOWN: This state is set automatically when you initiate node shutdown.
  • MAINTENANCE: You can manually change to this state if you need to perform maintenance on a node without shutting it down.
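A quick way to check the current mode (and the companion variable pxc_maint_transition_period, which defines how long the node waits in that state) is:

SHOW GLOBAL VARIABLES LIKE 'pxc_maint%';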

The first test is to put a reader in maintenance and then put it back.
This is the current scenario in ProxySQL:

+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+
| 10000  | 200          | 192.168.1.5 | ONLINE | 2        | 3        | 5      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | ONLINE | 2        | 15       | 17     | 0       | 17          |
| 10000  | 201          | 192.168.1.6 | ONLINE | 0        | 12       | 12     | 0       | 12          |
| 998    | 8200         | 192.168.1.7 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+

Putting down the 192.168.1.6 node:

(root localhost) [(none)]>set global pxc_maint_mode=maintenance;
Query OK, 0 rows affected (10.00 sec)
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status       | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+
| 10000  | 200          | 192.168.1.5 | ONLINE       | 0        | 5        | 5      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | ONLINE       | 1        | 15       | 17     | 0       | 17          |
| 10000  | 201          | 192.168.1.6 | OFFLINE_SOFT | 0        | 0        | 12     | 0       | 12          | 
| 998    | 8200         | 192.168.1.7 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+

The node is in OFFLINE_SOFT; as such, it will allow any existing connections to finish while not accepting new ones.
We can wait for the running connections to end, then do whatever kind of maintenance we need without affecting production.

Once done we can move the node back:

[(none)]>set global pxc_maint_mode=disabled;
Query OK, 0 rows affected (0.00 sec)

In the script log we will see that the node is identified as ready to be put back:

2020/07/03 06:58:23.871:[INFO] Evaluate nodes state 192.168.1.6;3306;201;1000 Retry #1
2020/07/03 06:58:23.882:[WARN] Move node:192.168.1.6;3306;201;1000 SQL: UPDATE mysql_servers SET status='ONLINE' WHERE hostgroup_id=201 AND hostname='192.168.1.6' AND port='3306'
2020/07/03 06:58:23.887:[INFO] END EXECUTION Total Time(ms):265.045881271362

 In all this we never had a moment of service interruption.

The above test was the easy one.

Let us now see what happens if we put the writer in maintenance.

This is a much more impactful action, given that the node is accepting write transactions and the cluster is working in single-writer mode.
Let us put the writer 192.168.1.5 in maintenance:

set global pxc_maint_mode=maintenance;
Query OK, 0 rows affected (10.00 sec)

 And in few seconds:

+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status       | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+
| 999    | 200          | 192.168.1.6 | ONLINE       | 3        | 2        | 5      | 0       | 5           |
| 10000  | 200          | 192.168.1.5 | OFFLINE_SOFT | 0        | 0        | 5      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | ONLINE       | 4        | 12       | 17     | 0       | 17          |
| 998    | 8200         | 192.168.1.7 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+

 What happened?

In the script log:

...
2020/07/03 08:11:22.110:[INFO] END EXECUTION Total Time(ms):231.402158737183
2020/07/03 08:11:24.299:[WARN] PXC maintenance on single writer, is asking for failover. Fail-over in action Using Method = 1
2020/07/03 08:11:24.307:[WARN] Move node:192.168.1.5;3306;200;3020 SQL: UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostgroup_id=200 AND hostname='192.168.1.5' AND port='3306'
2020/07/03 08:11:24.313:[INFO] Special Backup - Group found! I am electing a node to writer following the indications
This Node Try to become the new WRITER for HG 200 Server details: 192.168.1.6:3306:HG8200
2020/07/03 08:11:24.313:[INFO] This Node Try to become a WRITER promoting to HG 200 192.168.1.6:3306:HG 8200
2020/07/03 08:11:24.313:[WARN] DELETE from writer group as: SQL:DELETE from mysql_servers where hostgroup_id in (200,9200) AND STATUS = 'ONLINE'
2020/07/03 08:11:24.720:[WARN] Move node:192.168.1.6:33069992000 SQL:INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.1.6',200,3306,999,2000);
2020/07/03 08:11:24.720:[WARN] !!!! FAILOVER !!!!!
Cluster was without WRITER I have try to restore service promoting a node
2020/07/03 08:11:24.720:[INFO] END EXECUTION Total Time(ms):685.911178588867
...

The script identifies the need to shift production to another node.
It sets the current writer as offline in the staging environment and identifies which node from the special 8200 group is the appropriate replacement.

Then it pushes all the changes to runtime.
All of this is done in ~700 ms, so the whole process takes less than a second, and production was not impacted.

Test 2 writer node crash

Note: In the text below MySQL is set to work on CEST while system is EDT.

This is, of course, a much more impactful scenario: we need to take into account not only the situation of the node in ProxySQL, but also the need for PXC to rebuild the primary component view, getting quorum, etc.

Initial picture:

+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status       | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+
| 999    | 200          | 192.168.1.6 | ONLINE       | 3        | 2        | 5      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | ONLINE       | 4        | 12       | 17     | 0       | 17          |
| 998    | 8200         | 192.168.1.7 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE       | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+--------------+----------+----------+--------+---------+-------------+

 

Abruptly killing the writer with kill -9 signal:

[root@galera1h1n6 gal8_2]# ps aux|grep gal8_2
root 5265 0.0 0.0 113452 876 ? S Jun26 0:00 /bin/sh /opt/mysql_templates/PXC8/bin/mysqld_safe --defaults-file=/opt/mysql_instances/gal8_2/my.cnf
mysql 8745 7.7 72.9 9355104 5771476 ? Sl Jun26 882:54 /opt/mysql_templates/PXC8/bin/mysqld

The PXC cluster starts to identify the issue:

2020-07-04T10:05:41.011531Z 0 [Note] [MY-000000] [Galera] (27260297, 'tcp://192.168.1.5:4567') turning message relay requesting on, nonlive peers: tcp://192.168.1.6:4567

And script as well:

2020/07/04 06:05:41.000:[ERROR] Cannot connect to DBI:mysql:host=192.168.1.6;port=3306 as monitor
2020/07/04 06:05:41.001:[ERROR]  Node is not responding setting it as SHUNNED (ProxySQL bug - #2658)192.168.1.6:3306:HG200

2020/07/04 06:05:41.052:[WARN] PXC maintenance on single writer, is asking for failover. Fail-over in action Using Method = 1
2020/07/04 06:05:41.061:[INFO] Special Backup - Group found! I am electing a node to writer following the indications
 This Node Try to become the new WRITER for HG 200 Server details: 192.168.1.5:3306:HG8200
2020/07/04 06:05:41.062:[INFO] This Node Try to become a WRITER promoting to HG 200 192.168.1.5:3306:HG 8200
2020/07/04 06:05:41.062:[WARN]  DELETE from writer group as:  SQL:DELETE from mysql_servers where hostgroup_id in (200,9200) AND STATUS = 'ONLINE'

As said, the cluster also needs to rebuild the cluster view and get a quorum (see below):

2020-07-04T10:05:45.685154Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,27260297,16)
memb {
	27260297,1
	7e4d3144,1
	}
joined {
	}
left {
	}
partitioned {
	3eb94984,1
	}
)

As soon as the view is available the script can perform the failover:

2020/07/04 06:05:46.318:[WARN] Move node:192.168.1.5:330610002000 SQL:INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.1.5',200,3306,1000,2000);
2020/07/04 06:05:46.318:[WARN] !!!! FAILOVER !!!!!
Cluster was without WRITER I have try to restore service promoting a node
2020/07/04 06:05:46.318:[INFO] END EXECUTION Total Time(ms):5551.42211914062

The whole exercise takes about 5 seconds, which for a server crash is not bad at all.

In my case, given that the Java application was designed to handle short service interruptions with a retry loop, I did not get any errors; but this depends on how the application layer is written.

The layout after the failover is:

+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| 1000   | 200          | 192.168.1.5 | ONLINE  | 0        | 4        | 5      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | ONLINE  | 0        | 26       | 30     | 0       | 28          |
| 998    | 8200         | 192.168.1.7 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+

A node that is SHUNNED in the special groups will still be managed normally and, when needed, put back ONLINE.
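
Re-enabling a recovered node is, again, standard ProxySQL admin work. A minimal sketch of what the script (or an operator) could issue once the node is healthy again (hostname illustrative):

-- illustrative only: bring the recovered node back in its groups and push to runtime
UPDATE mysql_servers SET status = 'ONLINE' WHERE hostname = '192.168.1.6';
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;   -- optional, to persist the change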

Crash of a reader

Let us now do the same with a reader:

[root galera1h1n7 gal8_3]# date; kill -9 5259 8739
Sat Jul 4 06:43:46 EDT 2020

The script fails to connect:

2020/07/04 06:43:46.923:[ERROR] Cannot connect to DBI:mysql:host=192.168.1.7;port=3306 as monitor
2020/07/04 06:43:46.923:[ERROR]  Node is not responding setting it as SHUNNED (ProxySQL bug - #2658)192.168.1.7:3306:HG8201

2020-07-04T10:43:47.998377Z 0 [Note] [MY-000000] [Galera] (27260297, 'tcp://192.168.1.5:4567') reconnecting to 7e4d3144 (tcp://192.168.1.7:4567), attempt 0

The node is internally marked as SHUNNED by the script, given that it is not accessible, while waiting for ProxySQL to take its own action and shun the node. All read requests keep being served through ProxySQL.
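
For reference, marking a node as SHUNNED through the admin interface is conceptually equivalent to the following (whether the script issues this SQL or only tracks the state internally is an implementation detail of the script itself):

-- conceptual equivalent of shunning a reader at the ProxySQL level
UPDATE mysql_servers SET status = 'SHUNNED'
 WHERE hostname = '192.168.1.7' AND hostgroup_id = 201;
LOAD MYSQL SERVERS TO RUNTIME;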
The final picture will be:

+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| 1000   | 200          | 192.168.1.5 | ONLINE  | 0        | 5        | 6      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | SHUNNED | 0        | 0        | 30     | 98      | 28          |
| 1000   | 201          | 192.168.1.6 | ONLINE  | 1        | 24       | 25     | 0       | 25          |
| 998    | 8200         | 192.168.1.7 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+

No service interruption on the read side, thanks to ProxySQL.
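
The reads keep flowing because ProxySQL routes them by hostgroup. The query rules used for this setup are not shown in this article, but a typical read/write split pointing writes to HG 200 and reads to HG 201 would look something like this (rule_ids and regular expressions are illustrative):

-- illustrative read/write split: SELECT ... FOR UPDATE to HG 200, plain SELECTs to HG 201
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
       VALUES (10, 1, '^SELECT.*FOR UPDATE', 200, 1);
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
       VALUES (20, 1, '^SELECT', 201, 1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;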

Test 3 network failure

Network failures are always tricky: they can happen as a single isolated event, or last for a long and unpredictable window of time.
The network can also slow down first, to the point that one node affects the whole cluster, and then fully recover, leaving you with very limited data to understand what happened and why.

Given that, it is important to take action to limit the negative effects as much as possible.
That level of action is beyond the scope of a scheduler script, which is in charge only of the layout of the nodes.
What the script should do is not to solve the possible impact at the PXC level, but to reduce the possible confusion in how ProxySQL distributes the traffic.

The initial picture is:

+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| 1000   | 200          | 192.168.1.5 | ONLINE  | 0        | 5        | 6      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | SHUNNED | 0        | 0        | 30     | 98      | 28          |
| 1000   | 201          | 192.168.1.6 | ONLINE  | 1        | 24       | 25     | 0       | 25          |
| 998    | 8200         | 192.168.1.7 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+

On node 192.168.1.5 I stop the network interface:

enp0s8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.5 netmask 255.255.255.0 broadcast 192.168.1.255
[vagrant galera1h1n5 ~]$ date;sudo ifdown enp0s8
Sat Jul 4 13:20:32 EDT 2020
Device 'enp0s8' successfully disconnected.

As soon as I stop it, the PXC cluster identifies that the node is not reachable:

2020-07-04T17:20:35.980786Z 0 [Note] [MY-000000] [Galera] (7e4d3144, 'tcp://192.168.1.7:4567') connection to peer 27260297 with addr tcp://192.168.1.5:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)

The script is running and must wait for its connection attempt to time out (6 seconds); in the meantime the pid file fences any other check process from starting:

2020/07/04 13:20:36.729:[ERROR] Another process is running using the same HostGroup and settings,
Or orphan pid file. check in /tmp/proxysql_galera_check_200_W_201_R.pid
2020/07/04 13:20:38.806:[ERROR] Another process is running using the same HostGroup and settings,
Or orphan pid file. check in /tmp/proxysql_galera_check_200_W_201_R.pid

Finally, the connection to the node times out and the failover starts:

2020/07/04 13:20:40.699:[ERROR] Cannot connect to DBI:mysql:host=192.168.1.5;port=3306;mysql_connect_timeout=6 as monitor
2020/07/04 13:20:40.699:[ERROR] Node is not responding setting it as SHUNNED (internally) (ProxySQL bug - #2658)192.168.1.5:3306:HG200
2020/07/04 13:20:40.804:[WARN] Fail-over in action Using Method = 1
2020/07/04 13:20:40.805:[INFO] Special Backup - Group found! I am electing a node to writer following the indications
This Node Try to become the new WRITER for HG 200 Server details: 192.168.1.6:3306:HG8200
2020/07/04 13:20:40.805:[INFO] This Node Try to become a WRITER promoting to HG 200 192.168.1.6:3306:HG 8200
2020/07/04 13:20:40.805:[WARN] DELETE from writer group as: SQL:DELETE from mysql_servers where hostgroup_id in (200,9200) AND STATUS = 'ONLINE'
2020/07/04 13:20:42.622:[WARN] Move node:192.168.1.6:33069992000 SQL:INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.1.6',200,3306,999,2000);
2020/07/04 13:20:42.623:[WARN] !!!! FAILOVER !!!!!
Cluster was without WRITER I have try to restore service promoting a node
2020/07/04 13:20:42.623:[INFO] END EXECUTION Total Time(ms):7983.0858707428

The script completes the failover in about 7 seconds of its own execution, roughly 10 seconds after the network issue started.

The final picture:

+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+
| 999    | 200          | 192.168.1.6 | ONLINE  | 0        | 5        | 5      | 0       | 5           |
| 10000  | 201          | 192.168.1.7 | ONLINE  | 3        | 16       | 60     | 721     | 28          |
| 998    | 8200         | 192.168.1.7 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE  | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | SHUNNED | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+---------+----------+----------+--------+---------+-------------+

Given the level of complexity a network failure brings, I think this is a pretty good result, even if some application errors are inevitable for connections that were in flight; there is no way to prevent that, other than re-running the whole transaction.

Test 4 ProxySQL node crash

Last test: what happens if a ProxySQL node crashes? We have set a VIP that we use to connect to ProxySQL (192.168.1.194), and configured Keepalived to move the VIP across nodes. Will that be efficient enough?

Let us try killing one of the ProxySQL nodes.

Picture before:

+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+
| 999    | 200          | 192.168.1.6 | ONLINE | 3        | 1        | 4      | 0       | 4           |
| 10000  | 201          | 192.168.1.7 | ONLINE | 3        | 7        | 10     | 0       | 10          |
| 1000   | 201          | 192.168.1.5 | ONLINE | 0        | 2        | 2      | 0       | 2           |
| 998    | 8200         | 192.168.1.7 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+

I connect to the Proxy node that currently holds the VIP and:

[root proxy1 proxysql]# date;kill -9 24890 24891
Sat Jul 4 14:02:29 EDT 2020

In the system log of the next Proxy server in the chain (a different machine), Keepalived identifies that the node is down and moves the VIP:

Jul 4 14:02:31 proxy2 Keepalived_vrrp[6691]: VRRP_Instance(VI_01) forcing a new MASTER election
Jul 4 14:02:32 proxy2 Keepalived_vrrp[6691]: VRRP_Instance(VI_01) Transition to MASTER STATE
Jul 4 14:02:33 proxy2 Keepalived_vrrp[6691]: VRRP_Instance(VI_01) Entering MASTER STATE
Jul 4 14:02:33 proxy2 Keepalived_vrrp[6691]: VRRP_Instance(VI_01) setting protocol VIPs.
Jul 4 14:02:33 proxy2 Keepalived_vrrp[6691]: Sending gratuitous ARP on enp0s8 for 192.168.1.194
Jul 4 14:02:33 proxy2 Keepalived_vrrp[6691]: VRRP_Instance(VI_01) Sending/queueing gratuitous ARPs on enp0s8 for 192.168.1.194
Jul 4 14:02:33 proxy2 Keepalived_vrrp[6691]: Sending gratuitous ARP on enp0s8 for 192.168.1.194
Jul 4 14:02:33 proxy2 Keepalived_vrrp[6691]: Sending gratuitous ARP on enp0s8 for 192.168.1.194

The application identifies a few connections that were closed unexpectedly and tries to restore them:

20/07/04 14:02:33 ERROR [ACTIONS2]: ##### Connection was closed at server side unexpectedly. I will try to recover it
20/07/04 14:02:33 ERROR [ACTIONS2]: ##### Connection was closed at server side unexpectedly. I will try to recover it
20/07/04 14:02:33 ERROR [ACTIONS2]: ##### Connection was closed at server side unexpectedly. I will try to recover it
20/07/04 14:02:33 ERROR [ACTIONS2]: ##### Connection was closed at server side unexpectedly. I will try to recover it

So, in a few seconds, ProxySQL on the Proxy2 node is up and running and able to manage the traffic:

+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+
| weight | hostgroup_id | srv_host    | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed |
+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+
| 999    | 200          | 192.168.1.6 | ONLINE | 0        | 6        | 6      | 0       | 6           |
| 10000  | 201          | 192.168.1.7 | ONLINE | 3        | 29       | 32     | 0       | 32          |
| 1000   | 201          | 192.168.1.5 | ONLINE | 0        | 2        | 2      | 0       | 2           |
| 998    | 8200         | 192.168.1.7 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 999    | 8200         | 192.168.1.6 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8200         | 192.168.1.5 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.7 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.6 | ONLINE | 0        | 0        | 0      | 0       | 0           |
| 1000   | 8201         | 192.168.1.5 | ONLINE | 0        | 0        | 0      | 0       | 0           |
+--------+--------------+-------------+--------+----------+----------+--------+---------+-------------+ 

Full failover done in 4 seconds!

This also sounds like quite a good achievement to me.

Conclusions

We can achieve a highly customisable setup with the use of the scheduler + script in ProxySQL.

The flexibility we get is such that we can create a highly available solution without having to accept the compromises imposed by the native Galera support.

All this without reducing the efficiency of the solution.
In terms of time we had (all times may carry at most an additional +2 seconds, due to the scheduler interval we defined; see the registration sketch after the list below):

  • Maintenance on the writer - all done in ~700 ms, no service interruption.
  • Writer crash - the whole failover takes 5 seconds; transactions may need to be retried.
  • Reader crash - no service interruption.
  • Network issue (including the PXC cluster quorum rebuild) - between 7 and 10 seconds.
  • ProxySQL node crash - 4 seconds to fail over to another ProxySQL node.
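
For reference, this is roughly how such a check script is registered in the ProxySQL scheduler; the filename and arguments are illustrative (they depend on the script you use), while the 2000 ms interval is what accounts for the +2 seconds mentioned above:

-- illustrative scheduler registration for the check script, run every 2 seconds
INSERT INTO scheduler (id, active, interval_ms, filename, arg1)
       VALUES (10, 1, 2000, '/var/lib/proxysql/galera_check.pl',
               '<connection and hostgroup options for the script>');
LOAD SCHEDULER TO RUNTIME;
SAVE SCHEDULER TO DISK;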

Of course, applications must be designed to be fault tolerant and to retry transactions when problems arise; I would say this goes without saying and is part of any programming best practice.
If you fully rely on the data layer instead, you had better review your development team.

 
