My MySQL tips valid-rss-rogers

 

Overview 

Percona operator for MySQL (POM) :(https://www.percona.com/doc/kubernetes-operator-for-pxc/index.html) it’s aim is a special type of controller introduced to simplify complex deployments. The Operator extends the Kubernetes API with custom resources.

The Percona Operator for MySQL solution is using Percona Xtradb Cluster behind the hood to provide a highly available, resilient and scalable MySQL service in the Kubernetes space. 

This solution comes with all the advantages/disadvantages provided by Kubernetes, plus with some advantages of its own like the capacity to scale reads on the nodes that are not Primary.

Of course there are some limitations like the way PXC handle DDLs, which may impact the service, but there is always a cost to pay to get something, expecting to have all for free is unreasonable.     

In this context we need to talk and cover what is full read consistency in this solution and why it is important to understand the role it plays.  

Stale Reads

When using Kubernetes we should talk about the service and not about the technology/product used to deliver such service. 

In our case the Percona operator is there to deliver a MySQL service. We should then see that as a whole as a single object. To be more clear what we must consider is NOT the fact we have a cluster behind the service but that we have a service that to be resilient and highly available use a cluster. 

We should not care If a node/pod goes down unless the service is discontinued.

What we have as a plus in the Percona operator for MySQL solution is a certain level of READ scalability. This achieved optimising the use of the non PRIMARY nodes, and instead having them sitting there applying only replicated data, the Percona Operator provides access to them to scale the reads.  

But… there is always a BUT  

Let us start with an image: 

Screen Shot 2019 10 13 at 32714 PM

(https://www.slideshare.net/lefred.descamps/galera-replication-demystified-how-does-it-work) from Fred Descamps)

By design the apply and commit finalize in Galera (PXC) may have (and has) a delay between nodes.

This means that, if using defaults, applications may have inconsistent reads if trying to access the data from different nodes than the Primary. 

POM provides access using two different solutions:

  • Using HAProxy (default)
  • Using ProxySQL
haproxy  proxysql

 

When using HAProxy you will have 2 entry points:

  • cluster1-haproxy, which will point to the Primary ONLY, for reads and writes. This is the default entry point for the applications to the MySQL database.
  • cluster1-haproxy-replicas, which will point to all the 3 nodes and is supposed to be used for READS only. This is the PLUS you can use if your application has READ/WRITE separation.

redflag
Please note that at the moment there is nothing preventing application to use the cluster1-haproxy-replicas also for write, but that is dangerous and wrong because will generate a lot of certification conflicts and BF abort given it will distribute writes all over the cluster impacting on performance as well (and not giving you any write scaling):

[marcotusa@instance-1 ~]$ for i in `seq 1 100`; do mysql -h cluster1-haproxy-replicas -e "insert into test.iamwritingto values(null,@@hostname)";done
+----------------+-------------+
| host           | count(host) |
+----------------+-------------+
| cluster1-pxc-1 |          34 |
| cluster1-pxc-2 |          33 |
| cluster1-pxc-0 |          33 |
+----------------+-------------+

 

When using ProxySQL the entry point is a single one, but you may define query rules to automatically split the R/W requests coming from the application.

This is the preferred method when application has no way to separate the READS from the writes.

I have done a comparison of the two methods in POM here 

Now, as mentioned above, by default PXC (any Galera base solution) comes with some relaxed settings, for performance purpose. This is normally fine in many standard cases, but if you use POM and use the PLUS of scaling reads using the second access point with HAproxy or Query Rules with Proxysql, you should NOT have stale reads, given the service must provide consistent data, as if you are acting on a single node. 

To achieve that you can change the defaults and change the parameter in PXC wsrep_sync_wait. 

When changing the parameter wsrep_sync_wait as explained in the documentation the node initiates a causality check, blocking incoming queries while it catches up with the cluster. 

Once all data on the node receiving the READ request is commit_finalized, the node performs the read.

But this has a performance impact as said before.

What is the impact?

To test the performance impact I had used a cluster deployed in GKE, with this characteristics:

  • 3 Main nodes n2-standard-8 (8 vCPUs, 32 GB memory)
  • 1 App node n2-standard-8 (8 vCPUs, 32 GB memory)
  • PXC pods using:
    •  25GB of the 32 available 
    • 6 CPU of the 8 available
  • HAProxy:
    • 600m CPU
    • 1GB RAM
  • PMM agent
    • 500m CPU
    • 500 MB Ram

On The application  node I used sysbench running two instances, one in r/w mode the other only reads. Finally to test stale read I used the stale read test from my test suite (  https://github.com/Tusamarco/testsuite) .

Given I was looking for results with moderate load I just used 68/96/128 threads per sysbench instance. 

Results

Marco, did we have or not stale reads? Yes we did:

stale reads moderate load

I had from 0 (with very light load) up to 37% stale reads with MODERATED load. Where moderated was the 128 threads sysbench running. 

Setting wsrep_sync_wait=3 of course I had full consistency.
But I had performance loss:

performance loss reads

As you can see I had an average loss of 11% in case of READS

performance loss writes

While for writes the average loss was the 16%. 

Conclusions 

At this point we need to stop and think about what is worth doing. If my application is READs heavy and READs scaling, it is probably worth enabling the full synchronicity given scaling on the additional node allows me to have a 2x or more READs. 

If instead my application is write critical, probably losing also ~16% performance is not good.

Finally if my application is stale reads tolerant, I will just go with the defaults and get all the benefits without penalties.

Also keep in mind that POM is designed to offer a MySQL service so the state of the single node is not as critical as if you are using a default PXC installation, PODs are by nature ephemeral objects while service is resilient.

References      

https://www.percona.com/doc/kubernetes-operator-for-pxc/index.html

https://github.com/Tusamarco/testsuite

https://en.wikipedia.org/wiki/Isolation_(database_systems)#Dirty_reads

https://galeracluster.com/library/documentation/mysql-wsrep-options.html#wsrep-sync-wait

https://www.slideshare.net/lefred.descamps/galera-replication-demystified-how-does-it-work

Overview

Percona Operator for MySQL (POM) comes with two different proxies, HAProxy and ProxySQL. While the initial version was based on ProxySQL, in time Percona opted to set HAProxy as the default Proxy for the operator, this without removing ProxySQL. 

While one of the main points was to guarantee users to have a 1:1 compatibility with vanilla MySQL in the way the operator allows connections. There are also other factors that are involved in the decision to have two proxies. In this article I will scratch the surface of this why.

Operator assumptions

When working with the Percona operator for MySQL, there are few things to keep in mind:

  • Each deployment has to be seen as a single MySQL service as if a single MySQL instance
  • The technology used to provide the service may change in time
  • Pod resiliency is not guaranteed, Service resiliency is. 
  • Resources to be allocated are not automatically calculated and must be identified at the moment of the deployment
  • In Production you cannot set more than 5 or less than 3 nodes when using PXC

There are two very important points in the list above.

The first one is that what you get IS NOT a PXC cluster, but a MySQL service. The fact that Percona at the moment uses PXC to cover the service is purely accidental and we may decide to change it anytime.

The other point is that Service is resilient the pod is not. In short you should expect to see pods stopping to work and being re-created. What should NOT happen is that service goes down. Trying to debug each minor issue per node/pod is not what is expected when you use kubernetes. 

Given the above, review your expectations… and let us go ahead. 

The plus in the game (read scaling)

As said, what is offered with POM is a mysql service. Percona has added a proxy on top of the nodes/pods that help the service to respect the resiliency service expectations. There are two possible deployments:

  • HAProxy
  • ProxySQL

Both allow to optimise one aspect of POM, which is read scaling.
Infact what we were thinking was, given we must use a (virtually synchronous) cluster, why not take advantage of that and allow reads to scale on the other nodes when available? 

This approach will help all the ones using POM to have the standard MySQL service but with a plus. 

But, with it also come some possible issues like READ/WRITE splitting and stale reads. About stale reads see this article on how to deal with it (https://docs.google.com/document/d/1NyqcEKxfhgD1tDRPbiY0bENdp953DzkBfAoMi22u3J4/edit)

For R/W splitting we instead have a totally different approach in respect to what kind of proxy we implement. 

If using HAProxy, we offer a second entry point that can be used for READ operation. That entrypoint will balance the load on all the nodes available. 

Please note that at the moment there is nothing preventing application to use the cluster1-haproxy-replicas also for write, but that is dangerous and wrong because will generate a lot of certification conflicts and BF abort given it will distribute writes all over the cluster impacting on performance as well (and not giving you any write scaling). It is your responsibility to guarantee that only READS will go through that entrypoint.

If instead ProxySQL is in use it is possible to implement automatic R/W splitting. 

Global difference and comparison

At this point it is useful to have a better understanding of the functional difference between the two proxies and what is the performance difference if any. 

As we know HAProxy acts as a level 4 proxy when operating in TCP mode, it also is a forward-proxy, which means each TCP connection is established with the client with the final target and there is no interpretation of the data-flow.

ProxySQL on the other hand is a level 7 proxy and is a reverse-proxy, this means the client establishes a connection to the proxy who presents itself as the final backend. Data can be altered on the fly when it is in transit. 

To be honest, it is more complicated than that but allows me the simplification. 

On top of that there are additional functionalities that are present in one (ProxySQL) and not in the other. The point is if they are relevant for the use in this context or not. For a short list see below (source is from ProxySQL blog but data was removed)  

 proxySQL HAProxy feature comparison

As you may have noticed HAProxy is lacking some of that functionalities, like R/W split, firewalling and caching, proper of the level 7 implemented in ProxySQL.  

The test environment

To test the performance impact I had used a cluster deployed in GKE, with this characteristics:

  • 3 Main nodes n2-standard-8 (8 vCPUs, 32 GB memory)
  • 1 App node n2-standard-8 (8 vCPUs, 32 GB memory)
  • PXC pods using:
    •  25GB of the 32 available 
    • 6 CPU of the 8 available
  • HAProxy:
    • 600m CPU
    • 1GB RAM
  • PMM agent
    • 500m CPU
    • 500 MB Ram
  • Tests using sysbench as for (https://github.com/Tusamarco/sysbench

What I have done is to run several tests running two Sysbench instances. One only executing reads, while the other reads and writes. 

In the case of ProxySQL I had R/W splitting thanks to the Query rules, so both sysbench instances were pointing to the same address. While testing HAProxy I was using two entry points:

  • Cluster1-haproxy – for read and write
  • Cluster1-haproxy-replicas – for read only

Then I also compare what happens if all requests hit one node only. For that I execute one Sysbench in R/W mode against one entry point, and NO R/W split for ProxySQL.

Finally sysbench tests were executed with the –reconnect option to force the tests to establish new connections.

As usual tests were executed multiple times, on different days of the week and moment of the day. Data reported is a consolidation of that, and images from PMM are samples coming from the execution that was closest to the average values. 

Comparing performance when scaling Reads

These tests imply that one node is mainly serving writes while the others are serving reads. To not affect performance and given I was not interested in maintaining full read consistency the parameter wsrep_sync_wait was kept as default (0). 

events 3node

operation 3node

A first observation shows how ProxySQL seems to keep a more stable level of request served. The increasing load penalises HAProxy reducing if ⅓ the number of operations at 1024 threads.

writes 3node

reads 3node

 

Digging a bit more we can see that HAProxy is performing much better than ProxySQL for the WRITE operation. The number of writes remains almost steady with minimal fluctuations. ProxySQL on the other hand is performing great when load in write is low, then performance drops by 50%.

For reads we have the opposite. ProxySQL is able to scale in a very efficient way, distributing the load across the nodes and able to maintain the level of service despite the load increase. 

If we start to take a look to the latency distribution statistics (sysbench histogram information), we can see that:

latency68 3node HAproxy w

latency68 3node proxy w

In case of low load and writes both proxies stay on the left side of the graph with low value in ms. HAProxy is a bit more consistent and grouped around 55ms value, while ProxySQL is a bit more sparse and spans between 190-293ms.

latency68 3node HAproxy r

latency68 3node proxy r

About reads we have a similar behaviour, both for the large majority between 28-70ms.

We have a different picture when load increases:  

latency1024 3node HAproxy w

latency1024 3node proxy w

ProxySQL is having some occurrences where it performs better, but it spans in a very large range, from ~2k ms to ~29k ms. While HAProxy is substantially grouped around 10-11K ms.
As a result, in this context, HAProxy is able to better serve writes under heavy load than ProxySQL. 

Again different picture in case of reads.

latency1024 3node HAproxy r

latency1024 3node proxy r

Here ProxySQL is still spanning on a wide range ~76ms – 1500ms, while HAProxy is more consistent but less efficient, grouping around 1200ms the majority of the service. This is consistent with the performance loss we have seen in READ when using high load and HAProxy.  

Comparing when using only one node

But let us now discover what happens when using only one node. So using the service as it should be, without the possible Plus of read scaling. 

 writes 1node

reads 1node

The first thing I want to mention is a strange behaviour that was consistently happening (no matter what proxy used) at 128 threads. I am investigating it but I do not have a good answer yet on why Operator solution with PXC, was having that significant drop in performance ONLY with 128 threads.

Aside that, the results were consistently showing HAProxy performing better in serving read/writes. Keep in mind that HAProxy just establishes the connection point-to-point and is not doing anything else. While ProxySQL is designed to eventually act on the incoming stream of data. 

This becomes even more evident when reviewing the latency distribution.
In this case no matter what load we have, HAProxy performs better:

latency68 1node HAproxy rw

latency68 1node proxy rw

latency2048 1node HAproxy rw

latency2048 1node proxy rw

 

As you can notice, HAProxy is less grouped than when we have 2 entry points, but it is still able to serve more efficiently than ProxySQL.

Conclusions

As usual my advice is to use the right tool for the job, and do not force yourself in something stupid.right tool
As clearly stated at the beginning POM is designed to provide a mysql SERVICE, not a PXC cluster and all the configuration and utilisation should converge on that.
ProxySQL can help you IF you want to scale a bit more on READS using the possible plus. But this is not guaranteed to work as it works when using standard PXC
Not only you need to have a very good understanding of Kubernetes and ProxySQL if you want to avoid issues.
With HAProxy you can scale reads as well, but you need to be sure you have R/W separation at application level.

In any case utilising HAProxy for the service is the easier way to go.
This is one of the reasons why Percona decided to shift to HAProxy.
HAProxy is the solution that offers the proxy service more in line with the aim of the kubernetes service concept.
It is also the solution that remains closer on how a simple MySQL service should behave.

You need to set your expectations correctly to avoid being in trouble later.

References

https://www.percona.com/doc/kubernetes-operator-for-pxc/index.html

 

In recent times I have been designing several solutions focused on High Availability and Disaster Recovery. Some of them using Percona Server for MySQL with group replication, some using Percona XtraDB Cluster (PXC). What many of them had in common was the use of ProxySQL for the connection layer. This is because I consider the use of a layer 7 Proxy preferable, given the possible advantages provided in ReadWrite split and SQL filtering. 

The other positive aspect provided by ProxySQL, at least for Group Replication, is the native support which allows us to have a very quick resolution of possible node failures.

ProxySQL has Galera support as well, but in the past, that had shown to be pretty unstable, and the old method to use the scheduler was still the best way to go.

After Percona Live Online 2020 I decided to try it again and to see if at least the basics were now working fine. 

What I Have Tested

I was not looking for complicated tests that would have included different levels of transaction isolation. I was instead interested in the more simple and basic ones. My scenario was:

1 ProxySQL node v2.0.15  (192.168.4.191)
1 ProxySQL node v2.1.0  (192.168.4.108)
3 PXC 8.20 nodes (192.168.4.22/23/233) with internal network (10.0.0.22/23/33) 

ProxySQL was freshly installed. 

All the commands used to modify the configuration are here. Tests were done first using ProxySQL v2.015 then v2.1.0. Only if results diverge I will report the version and results. 

PXC- Failover Scenario

As mentioned above I am going to focus on the fail-over needs, period. I will have two different scenarios:

  • Maintenance
  • Node crash 

From the ProxySQL point of view I will have three scenarios always with a single Primary:

  • Writer is NOT a reader (option 0 and 2)
  • Writer is also a reader

The configuration of the native support will be:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.22',100,3306,10000,2000,'DC1');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.22',101,3306,100,2000,'DC1');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.23',101,3306,10000,2000,'DC1');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.233',101,3306,10000,2000,'DC1');

Galera host groups:

  • Writer: 100
  • Reader: 101
  • Backup_writer: 102
  • Offline_hostgroup: 9101

Before going ahead let us analyze the Mysql Servers settings. As you can notice I am using the weight attribute to indicate ProxySQL which is my preferred write. But I also use weight for the READ Host Group to indicate which servers should be used and how.

Given that we have that:

  • Write
    • 192.168.4.22  is the preferred Primary
    • 192.168.4.23  is the first failover 
    • 192.168.4.233 is the last chance 
  • Read
    • 192.168.4.233/23 have the same weight and load should be balanced between the two of them
    • The 192.168.4.22 given is the preferred writer should NOT receive the same load in reads and have a lower weight value.  

The Tests

First Test

The first test is to see how the cluster will behave in the case of 1 Writer and 2 readers, with the option writer_is_also_reader = 0.
To achieve this the settings for proxysql will be:

insert into mysql_galera_hostgroups (writer_hostgroup,backup_writer_hostgroup,reader_hostgroup, offline_hostgroup,active,max_writers,writer_is_also_reader,max_transactions_behind) 
values (100,102,101,9101,1,1,0,10);

As soon as I load this to runtime, ProxySQL should move the nodes to the relevant Host Group. But this is not happening, instead, it keeps the readers in the writer HG and SHUN them.

+--------+-----------+---------------+----------+---------+
| weight | hostgroup | srv_host | srv_port | status |
+--------+-----------+---------------+----------+---------+
| 10000 | 100 | 192.168.4.233 | 3306 | ONLINE |
| 10000 | 100 | 192.168.4.23 | 3306 | SHUNNED |
| 10000 | 100 | 192.168.4.22 | 3306 | SHUNNED |
| 10000 | 102 | 192.168.4.23 | 3306 | ONLINE |
| 10000 | 102 | 192.168.4.22 | 3306 | ONLINE |
+--------+-----------+---------------+----------+---------+

This is, of course, wrong. But why does it happen?

The reason is simple. ProxySQL is expecting to see all nodes in the reader group with READ_ONLY flag set to 1. 

In ProxySQL documentation we can read:

writer_is_also_reader=0: nodes with read_only=0 will be placed either in the writer_hostgroup and in the backup_writer_hostgroup after a topology change, these will be excluded from the reader_hostgroup.

This is conceptually wrong. 

A PXC cluster is a tightly coupled replication cluster, with virtually synchronous replication. One of its benefits is to have the node “virtually” aligned with respect to the data state. 

In this kind of model, the cluster is data-centric, and each node shares the same data view.

tightly coupled

What it also means is that if correctly set the nodes will be fully consistent in data READ.

The other characteristic of the cluster is that ANY node can become a writer anytime.  While best practices indicate that it is better to use one Writer a time as Primary to prevent certification conflicts, this does not mean that the nodes not currently elected as Primary, should be prevented from becoming a writer.

Which is exactly what READ_ONLY flag does if activated.

Not only, the need to have READ_ONLY set means that we must change it BEFORE we have the node able to become a writer in case of fail-over. 

This, in short, means the need to have either a topology manager or a script that will do that with all the relative checks and logic to be safe. Which in time of fail-over means it will add time and complexity when it’s not really needed and that goes against the concept of the tightly-coupled cluster itself.

Given the above, we can say that this ProxySQL method related to writer_is_also_reader =0, as it is implemented today for Galera, is, at the best, useless. 

Why is it working for Group Replication? That is easy; because Group Replication internally uses a mechanism to lock/unlock the nodes when non-primary, when using the cluster in single Primary mode. That internal mechanism was implemented as a security guard to prevent random writes on multiple nodes, and also manage the READ_ONLY flag. 

Second Test

Let us move on and test with writer_is_also_reader = 2. Again from the documentation:

writer_is_also_reader=2 : Only the nodes with read_only=0 which are placed in the backup_writer_hostgroup are also placed in the reader_hostgroup after a topology change i.e. the nodes with read_only=0 exceeding the defined max_writers.

Given the settings as indicated above, my layout before using Galera support is:

How scheduler and script stand in supporting failover (Percona and Marco example) 

In part one of this series I had illustrated how simple scenarios may fail or have problems when using Galera native support inside ProxySQL. In this post, I will repeat the same tests but using the scheduler option and the external script.

The Scheduler

First a brief explanation about the scheduler.

The scheduler inside ProxySQL was created to allow administrators to extend ProxySQL capabilities. The scheduler gives the option to add any kind of script or application and run it at the specified interval of time. The scheduler was also the initial first way we had to deal with Galera/Percona XtraDB Cluster (PXC) node management in case of issues. 

The scheduler table is composed as follows:

For what reason should I use a real multi-Primary setup?
To be clear, not a multi-writer solution where any node can become the active writer in case of needs, as for PXC or PS-Group_replication.
No, we are talking about a multi-Primary setup where I can write at the same time on multiple nodes.
I want to insist on this “why?”.

After having excluded the possible solutions mentioned above, both covering the famous 99,995% availability, which is 26.30 minutes downtime in a year, what is left?

Disaster Recovery? Well that is something I would love to have, but to be a real DR solution we need to put several kilometers (miles for imperial) in the middle. 

And we know (see here and here) that aside some misleading advertising, we cannot have a tightly coupled cluster solution across geographical regions.

So, what is left? I may need more HA, ok that is a valid reason. Or I may need to scale the number of writes, ok that is a valid reason as well.
This means, at the end, that I am looking to a multi-Primary because:

  • Scale writes (more nodes more writes)
    • Consistent reads (what I write on A must be visible on B)
  • Gives me 0 (zero) downtime, or close to that (5 nines is a maximum downtime of 864 milliseconds per day!!)
  • Allow me to shift the writer pointer at any time from A to B and vice versa, consistently.   

Now, keeping myself bound to the MySQL ecosystem, my natural choice would be MySQL NDB cluster.

But my (virtual) boss was at AWS re-invent and someone mentioned to him that Aurora Multi-Primary does what I was looking for.

This (long) article is my voyage in discovering if that is true or … not.

Given I am focused on the behaviour first, and NOT interested in absolute numbers to shock the audience with millions of QPS, I will use low level Aurora instances. And will perform tests from two EC2 in the same VPC/region of the nodes.

instances

You can find the details about the tests on GitHub here Finally I will test:

  • Connection speed
  • Stale read
  • Write single node for baseline
  • Write on both node:
    • Scaling splitting the load by schema
    • Scaling same schema 

Tests results

Let us start to have some real fun. The first test is … 

Connection Speed

The purpose of this test is to evaluate the time taken in opening a new connection and time taken to close it. The action of open/close connection can be a very expensive operation especially if applications do not use a connection pool mechanism.

a1

a3

As we can see ProxySQL results to be the most efficient way to deal with opening connections, which was expected given the way it is designed to reuse open connections towards the backend. 

a2

a4

Different is the close connection operation in which ProxySQL seems to take a little bit longer.  As global observation we can say that using ProxySQL we have more consistent behaviour. Of course this test is a simplistic one, and we are not checking the scalability (from 1 to N connections) but it is good enough to give us the initial feeling. Specific connection tests will be the focus of the next blog on Aurora MM. 

Stale Reads

Aurora MultiPrimary use the same mechanism of the default Aurora to update the buffer pool:

aurora multi master sharing BP

Using the Page Cache update, just doing both ways. This means that the Buffer Pool of Node2 is updated with the modification performed in Node1 and vice versa.

To verify if an application would be really able to have consistent reads, I have run this test. This test is meant to measure if, and how many, stale reads we will have when writing on a node and reading from the other.

Amazon Aurora multi Primary has 2 consistency model:

Consistency model

As an interesting fact the result was that with the default consistency model (INSTANCE_RAW), we got 100% stale read.

Given that I focused on identifying the level of the cost that exists when using the other consistency model (REGIONAL_RAW) that allows an application to have consistent reads.

The results indicate an increase of the 44% in total execution time, and of the 95% (22 time slower) in write execution. 

a5

a6

a7

It is interesting to note that the time taken is in some way predictable and consistent between the two consistency models. 

The graph below shows in yellow how long the application must wait to see the correct data on the reader node. While in blue is the amount of time the application waits to get back the same consistent read because it must wait for the commit on the writer.

   a8

As you can see the two are more or less aligned. Given the performance cost imposed by using REGIONAL_RAW,  all the other tests are done the defaut INSTANCE_RAW, unless explicitly stated.

Writing tests

All tests run in this section were done using sysbench-tpcc with the following settings:

sysbench ./tpcc.lua --mysql-host=<> --mysql-port=3306 --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --time=300 --threads=32 --report-interval=1 --tables=10 --scale=15  --mysql_table_options=" CHARSET=utf8 COLLATE=utf8_bin"  --db-driver=mysql prepare

sysbench /opt/tools/sysbench-tpcc/tpcc.lua --mysql-host=$mysqlhost --mysql-port=$port --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --db-driver=mysql --tables=10 --scale=15 --time=$time  --rand-type=zipfian --rand-zipfian-exp=0 --report-interval=1 --mysql-ignore-errors=all --histogram  --report_csv=yes --stats_format=csv --db-ps-mode=disable --threads=$threads run

Write Single node (Baseline)

Before starting the comparative analysis, I was looking to define what was the “limit” of traffic/load for this platform. 

Picture 1

t1 t2

From the graph above, we can see that this setup scales up to 128 threads after that, the performance remains more or less steady. 

Amazon claims that we can mainly double the performance when using both nodes in write mode and use a different schema to avoid conflict.

scalability

 

Once more remember I am not interested in the absolute numbers here, but I am expecting the same behaviour Given that our expectation is to see:

Picture 2

Write on both nodes different schemas

So AWS recommend this as the scaling solution:

split traffic by db table partition to avoid conflicts

And I diligently follow the advice.

I used 2 EC2 nodes in the same subnet of the Aurora Node, writing to a different schema (tpcc & tpcc2). 

Overview

Let us make it short and go straight to the point. Did we get the expected scalability?

Well no:

Picture 3

We just had 26% increase, quite far to be the expected 100% Let us see what happened in detail (if not interested just skip and go to the next test).

Node 1

Picture 5

Node 2

Picture 6

As you can see Node1 was (more or less) keeping up with the expectations and being close to the expected performance.
But Node2 was just not keeping up, performances there were just terrible. 

The graphs below show what happened.

While Node1 was (again more or less) scaling up to the baseline expectations (128 threads), Node2 collapsed on its knees at 16 threads. Node2 was never able to scale up.

Reads

Node 1

t4

Node1 is scaling the reads as expected also if here and there we can see performance deterioration.

Node 2

t7

Node2 is not scaling Reads at all. 

Writes

Node 1

t5

Same as Read

Node 2

t8

Same as read

Now someone may think I was making a mistake and I was writing on the same schema. I assure you I was not.

Check the next test to see what happened if using the same schema.  

Write on both nodes same schema

Overview

Now, now Marco, this is unfair. You know this will cause contention.

Yes I do! But nonetheless I was curious to see what was going to happen and how the platform would deal with that level of contention. 
My expectations were to have a lot of performance degradation and increased number of locks. About conflict I was not wrong, node2 after the test reported:

+-------------+---------+-------------------------+
| table       | index   | PHYSICAL_CONFLICTS_HIST |
+-------------+---------+-------------------------+
| district9   | PRIMARY |                    3450 |
| district6   | PRIMARY |                    3361 |
| district2   | PRIMARY |                    3356 |
| district8   | PRIMARY |                    3271 |
| district4   | PRIMARY |                    3237 |
| district10  | PRIMARY |                    3237 |
| district7   | PRIMARY |                    3237 |
| district3   | PRIMARY |                    3217 |
| district5   | PRIMARY |                    3156 |
| district1   | PRIMARY |                    3072 |
| warehouse2  | PRIMARY |                    1867 |
| warehouse10 | PRIMARY |                    1850 |
| warehouse6  | PRIMARY |                    1808 |
| warehouse5  | PRIMARY |                    1781 |
| warehouse3  | PRIMARY |                    1773 |
| warehouse9  | PRIMARY |                    1769 |
| warehouse4  | PRIMARY |                    1745 |
| warehouse7  | PRIMARY |                    1736 |
| warehouse1  | PRIMARY |                    1735 |
| warehouse8  | PRIMARY |                    1635 |
+-------------+---------+-------------------------+

Which is obviously a strong indication something was not working right. In terms of performance gain, if we compare ONLY the result with the 128 Threads : Picture 4

Also with the high level of conflict we still have 12% of performance gain.

The problem is that in general we have the two nodes behave quite badly.
If you check the graph below you can see that the level of conflict is such to prevent the nodes not only to scale but to act consistently.

Node 1

Picture 7

Node 2

Picture 8

Reads

In the following graphs we can see how node1 had issues and it actually crashed 3 times, during tests with 32/64/512 treads.
Node2 was always up but the performances were very low. 

Node 1

t10

Node 2

t13

Writes

Node 1

t11

Node 2

t14

Recovery from crashed Node

About recovery time reading the AWS documentation and listening to presentations, I often heard that Aurora Multi Primary is a 0 downtime solution.
Or other statements like: “
in applications where you can't afford even brief downtime for database write operations, a multi-master cluster can help to avoid an outage when a writer instance becomes unavailable. The multi-master cluster doesn't use the failover mechanism, because it doesn't need to promote another DB instance to have read/write capability

To achieve this the suggestion I found, was to have applications pointing directly to the Nodes endpoint and not use the Cluster endpoint.
In this context the solution pointing to the Nodes should be able to failover within a seconds or so, while the cluster endpoint:

fail over times using mariadb driver

Personally I think that designing an architecture where the application is responsible for the connection to the database and failover is some kind of refuse from 2001. But if you feel this is the way, well go for it.

What I did for testing is to use ProxySQL, as plain as possible, with nothing else then the basic monitor coming from the native monitor.

I then compare the results with the tests using the Cluster endpoint.
In this way I adopt the advice of pointing directly at the nodes, but I was doing things in our time.  

The results are below and they confirm (more or less) the data coming from Amazon.

a10

A downtime of 7 seconds is quite a long time nowadays, especially if I am targeting the 5 nines solution that I want to remember is 864 ms downtime per day.

Using ProxySQL is going closer to that, still too long to be called 0 (zero) downtime.

I also have fail-back issues when using the AWS cluster endpoint.

Given it was not able to move the connection to the joining node seamlessly. 

Last but not least when using the consistency level INSTANCE_RAW, I had some data issue as well as PK conflict:
FATAL: mysql_drv_query() returned error 1062 (Duplicate entry '18828082' for key 'PRIMARY')   

Conclusions

As state the beginning of this long blog the reasons expectations to go for a multi Primary solution were:

  • Scale writes (more nodes more writes)
  • Gives me 0 (zero) downtime, or close to that (5 nines is a maximum downtime of 864 milliseconds per day!!)
  • Allow me to shift the writer pointer at any time from A to B and vice versa, consistently.   

Honestly I feel we have completely failed the scaling point.

Facepalm Jesus

Probably if I use the largest Aurora I will get much better absolute numbers, and it will take me more to encounter the same issues, but I will.

In any case if the Multi muster solution is designed to provide that scalability, it should do that with any version.

I did not have zero downtime, but I was able to failover pretty quickly with ProxySQL.

Finally, unless the consistency model is REGIONAL_RAW, shifting from one node to the other is not prone to possible negative effects like stale reads.

Because that I consider this requirement not satisfied in full. 

Given all the above, I think this solution could eventually be valid only for High Availability (close to be 5 nines), but given it comes with some limitations I do not feel comfortable in preferring it over others just for HA, at the end default Aurora is already good enough as a High available solution. 

references

https://www.youtube.com/watch?v=p0C0jakzYuc

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-multi-master.html

https://www.slideshare.net/marcotusa/improving-enterprises-ha-and-disaster-recovery-solutions-reviewed

https://www.slideshare.net/marcotusa/robust-ha-solutions-with-proxysql

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-multi-master.html#aurora-multi-master-limitations  

Latest conferences

PL2020
percona_tech_days_aug_2020

We have 165 guests and no members online

oracle_ace