TusaCentral - Welcome to the Frontpage

Proof of Concept: Horizontal Write Scaling for MySQL with Kubernetes Operator

Historically MySQL is great in horizontal READ scale. The scaling in that case is offered by the different number of Replica nodes, no matter if using standard asynchronous replication or synchronous replication.

However those solutions do not offer the same level of scaling for writes operation.

Why? Because the solutions still rely on writing in one single node that works as Primary. Also in case of multi-Primary the writes will be distributed by transaction. In both cases, when using virtually-synchronous replication, the process will require certification from each node and local (by node) write, as such the number of writes are NOT distributed across multiple nodes but duplicated.

The main reason behind this is that MySQL is a relational database system (RDBMS), and any data that is going to be written in it, must respect the RDBMS rules (https://en.wikipedia.org/wiki/Relational_database). In short any data that is written must be consistent with the data present. To achieve that the data needs to be checked with the existing through defined relations and constraints. This action is something that can affect very large datasets and be very expensive. Think about updating a table with millions of rows that refer to another table with another million rows.

An image may help:

Every time I will insert an order, I must be sure that all the related elements are in place and consistent.

This operation is quite expensive but our database is able to run it in a few milliseconds or less, thanks to several optimizations that allow the node to execute most of them in memory, with none or little access to mass storage.

The key factor is that the whole data structure resides in the same location (node) facilitating the operations.

Once we have understood that, it will also become clear why we cannot have relational data split in multiple nodes and have to distribute writes by table. If I have a node that manages only the items, another the orders, another the payments, I will need to have my solution able to deal with distributed transactions, each of which needs to certify and verify other nodes data.

This level of distribution will seriously affect the efficiency of the operation which will increase the response time significantly. This is it, nothing is impossible however the performances will be so impacted that each operation may take seconds instead milliseconds or fraction of it, unless lifting some of the rules breaking the relational model.

MySQL as well as other RDBMS are designed to work respecting the model and cannot scale in any way by fragmenting and distributing a schema, so what can be done to scale?

The alternative is to split a consistent set of data into fragments. What is a consistent set of data? It all depends on the kind of information we are dealing with. Keeping in mind the example above where we have a shop online serving multiple customers, we need to identify which is the most effective way to split the data.

For instance if we try to split the data by Product Type (Books, CD/DVD, etc) we will have a huge duplication of data related to customers/orders/shipments and so on, all this data is also quite dynamic given I will have customers constantly ordering things.

Why duplicate the data? Because if I do not duplicate that data I will not know if a customer had already bought or not that specific item, or I will have to ask again about the shipment address and so on. Which means also that any time a customer buys something or puts something in the wish list I have to reconcile the data in all my nodes/clusters.

On the other hand if I choose to split my data by country of customer’s residence the only data I will have to duplicate and keep in sync is the one related to the products, of which the most dynamic one will be the number of items in stock. This of course unless I can organize my products by country as well, which is a bit unusual nowadays, but not impossible.

Another possible case is if I am a health organization and I manage several hospitals. As for the example above, it will be easier to split my data by hospital given most of the data related to patients is bound to the hospital itself as well as treatments and any other element related to hospital management. While it will make no sense to split by patient's country of residence.

This technique of splitting the data into smaller pieces is called sharding and at the moment is the only way we have to scale RDBM horizontally.

In the MySQL open source ecosystem we have only two consolidated ways to perform sharding, Vitess and ProxySQL. The first one is a complete solution that takes ownership of your database and manages almost any aspect of its operations in a sharded environment, this includes a lot of specific features for DBA to deal with daily operations like table modifications, backup and more.

While this may look great it also comes with some string attached, including the complexity and proprietary environment. That makes Vitess a good fit for “complex” sharding scenarios where other solutions may not be enough.

ProxySQL does not have a sharding mechanism “per se” but given the way it works and the features it has, allow us to build simple sharding solutions.

It is important to note that most of the DBA operations will still be on DBA to be executed with incremented complexity given the sharding environment.

There is a third option which is application aware sharding.

This solution sees the application be aware of the need to split the data in smaller fragments and internally point the data to different “connectors” who are connected to multiple data sources.

In this case the application is aware of a customer country and will redirect all the operations related to him to the datasource responsible for the specific fragment.

Normally this solution requires a full code re-design and could be quite difficult to achieve when it is injected after the initial code architecture definition.

On the other hand, if done at design it is probably the best solution, because it will allow the application to define the sharding rules and can also optimize the different data sources using different technologies for different uses.

One example could be the use of a RDBMS for most of the Online transaction processing (OLTP) data shared by country, and having the products as distributed memory cache with a different technology. At the same time all the data related to orders, payments and customer history can be consolidated in a data warehouse used to generate reporting.

As said the last one is probably the most powerful, scalable and difficult to design and unfortunately it represents probably less than the 5% of the solution currently deployed.

As well as very few cases are in the need to have a full system/solution to provide scalability with sharding.

By experience, most of the needs for horizontal scaling fell in the simple scenario, where there is the need to achieve sharding and data separation, very often with sharding-nothing architecture. In shared-nothing, each shard can live in a totally separate logical schema instance / physical database server / data center / continent. There is no ongoing need to retain shared access (from between shards) to the other unpartitioned tables in other shards.

The POC

Why this POC?

In the years I have faced a lot of customers that were talking about scaling their database solution and looking at very complex sharding as Vitess as the first and only way to go.

This without even considering if their needs were driving them there for real.

kiss

In my experience, and talking with several colleagues, I am not alone, when analyzing the real needs and after discussing with all the parties impacted, only a very small percentage of customers were in the real need of complex solutions. Most of the others were just trying to avoid a project that will implement simple shared-nothing solutions. Why? Because apparently it is simpler to migrate data to a platform that does all for you, than accept a bit of additional work and challenge at the beginning but keep a simple approach. Also going for the last shining things always has its magic.

On top of that, with the rise of Kubernetes and Mysql Operators, a lot of confusion starts to circulate, most of which generate by the total lack of understanding that a Database and a relational database are two separate things. That lack of understanding the difference and the real problems attached to a RDBMS had brought some to talk about horizontal scaling for databases, with a concerning superficiality and without clarifying if they were talking about RDBMS or not. As such some clarification is long due as well as putting back the KISS principle as the main focus.

Given that, I thought that refreshing how ProxySQL could help in building a simple sharding solution may help to clarify the issues, reset the expectations and show how we can do things in a simpler way. (See my old post https://www.percona.com/blog/mysql-sharding-with-proxysql/).

To do so I had built a simple POC that illustrates how you can use Percona Operator for MySQL (POM) and ProxySQL to build a sharded environment with a good level of automation for some standard operations like backup/restore software upgrade and resource scaling.

Why Proxysql?

In the following example we mimic a case where we need a simple sharding solution, which means we just need to redirect the data to different data containers, keeping the database maintenance operations on us. In this common case we do not need to implement a full sharding system such as Vitess.

As illustrated above ProxySQL allows us to set up a common entry point for the application and then redirect the traffic on the base of identified sharding keys. It will also allow us to redirect read/write traffic to the primary and read only traffic to all secondaries.

The other interesting thing is that we can have ProxySQL as part of the application pod, or as an independent service. Best practices indicate that having ProxySQL closer to the application will be more efficient especially if we decide to activate the caching feature.

Why POM

Percona Operator for MySQL comes with three main solutions, Percona Operator for PXC, Percona Operator for MySQL Group Replication and Percona Operator for Percona server. The first two are based on virtually-synchronous replication, and allow the cluster to keep the data state consistent across all pods, which guarantees that the service will always offer consistent data. In K8s context we can see POM as a single service with native horizontal scalability for reads, while for writes we will adopt the mentioned sharding approach.

The other important aspects of using a POM based solution is the automation it comes with. Deploying POM you will be able to set automation for backups, software updates, monitoring (using PMM) and last but not least the possibility to scale UP or DOWN just changing the needed resources.

The elements used

In our POC I will use a modified version of sysbench (https://github.com/Tusamarco/sysbench), that has an additional field continent and I will use that as a sharding key. At the moment and for the purpose of this simple POC I will only have 2 Shards.

As the diagram above illustrates here we have a simple deployment but good enough to illustrate the sharding approach.

We have:

The application(s) node(s), it is really up to you if you want to test with one application node or more, nothing will change, as well as for the ProxySQL nodes, just keep in mind that if you use more proxysql nodes is better to activate the internal cluster support or use consul to synchronize them.
Shard 1 is based on POM with PXC, it has:

Load balancer for service entry point
- Entry point for r/w
- Entry point for read only
3 Pods for Haproxy
- Haproxy container
- Pmm agent container
3 Pods with data nodes (PXC)
- PXC cluster node container
- Log streaming
- Pmm container
Backup/restore service

Shard 2 is based on POM for Percona server and Group Replication (technical preview)
- Load balancer for service entry point
  - Entry point for r/w
  - Entry point for read only
- 3 Pods for MySQL Router (testing)
  - MySQL router container
- 3 Pods with data nodes (PS with GR)
  - PS -GR cluster node container
  - Log streaming
  - Pmm container
- Backup/restore on scheduler

Now you may have noticed that the representation of the nodes are different in size, this is not a mistake while drawing. It indicates that I have allocated more resources (CPU and Memory) to shard1 than shard2. Why? Because I can and I am simulating a situation where a shard2 gets less traffic, at least temporarily, as such I do not want to give it the same resources as shard1. I will eventually increase them if I see the needs.

The settings

Data layer

Let us start with the easy one, the Data layer configuration. Configuring correctly the environment is the key, and to do so I am using a tool that I wrote specifically to calculate the needed configuration in K8s POM environment, you can find it here (https://github.com/Tusamarco/mysqloperatorcalculator).

Once you have compiled it and run you can simply ask what “dimensions” are supported, or you can define a custom level of resources, but you will still need to indicate the level of expected load. In any case please refer to the README in the repository which has all the instructions.

The full cr.yaml for PXC shard1 is here, while the one for PS-GR here.

For Shard 1: I asked for resources to cover traffic of type 2 (Light OLTP), configuration type 5 (2XLarge) 1000 connections.

For Shard2: I ask for resources to cover traffic of type 2 (Light OLTP), configuration type 2 (Small), 100 connections.

Once you have the CRs defined you can follow the official guidelines to set the environment up ( PXC (https://docs.percona.com/percona-operator-for-mysql/pxc/index.html), PS (https://docs.percona.com/percona-operator-for-mysql/ps/index.html)

It is time now to see the Proxysql settings.

ProxySQL and Sharding rules

As mentioned before we are going to test the load sharding by continent, and as also mentioned before we know that ProxySQL will not provide additional help to automatically manage the sharded environment.

Given that one way to do it is to create a DBA account per shard, or to inject shard information in the commands while executing.
I will use the less comfortable one just to prove if it works, the different DBA accounts.

We will have 2 shards, the sharding key is the continent field, and the continents will be grouped as follows:

Shard 1:
- Asia
- Africa
- Antarctica
- Europe
- North America
Shard 2:
- Oceania
- South America

The DBAs users:

dba_g1
dba_g2

The application user:

app_test

The host groups will be:

Shard 1
- 100 Read and Write
- 101 Read only
Shard 2
- 200 Read and Write
- 201 Read only

Once that is defined, we need to identify which query rules will serve us and how.

What we want is to redirect all the incoming queries for:

Asia, Africa, Antarctica, Europe and North America to shard 1.
Oceania and South America to shard 2
Split the queries in R/W and Read only
Prevent the execution of any query that do not have a shard key
Backup data at regular interval and store it in a safe place

Given the above we first define the rules for the DBAs accounts:

We set the Hostgroup for each DBA and then if the query matches the sharding rule we redirect it to the proper sharding, otherwise the HG will remain as set.

This allows us to execute queries like CREATE/DROP table on our shard without problem, but will allow us to send data where needed.

For instance the one below is the output of the queries that sysbench will run.

Prepare:

INSERT INTO windmills_test1 /*  continent=Asia */ (uuid,millid,kwatts_s,date,location,continent,active,strrecordtype) VALUES(UUID(), 79, 3949999,NOW(),'mr18n2L9K88eMlGn7CcctT9RwKSB1FebW397','Asia',0,'quq')

In this case I have the application simply injecting a comment in the INSERT SQL declaring the shard key, given I am using the account dba_g1 to create/prepare the schemas, rules 32/32 will be used and give I have sett apply=1, ProxySQL will exit the query rules parsing and send the command to relevant hostgroup.

Run:

SELECT id, millid, date,continent,active,kwatts_s FROM windmills_test1 WHERE id BETWEEN ? AND ? AND continent='South America'

SELECT SUM(kwatts_s) FROM windmills_test1 WHERE id BETWEEN ? AND ?  and active=1  AND continent='Asia'
SELECT id, millid, date,continent,active,kwatts_s  FROM windmills_test1 WHERE id BETWEEN ? AND ?  AND continent='Oceania' ORDER BY millid

SELECT DISTINCT millid,continent,active,kwatts_s   FROM windmills_test1 WHERE id BETWEEN ? AND ? AND active =1  AND continent='Oceania' ORDER BY millid

UPDATE windmills_test1 SET active=? WHERE id=?  AND continent='Asia'
UPDATE windmills_test1 SET strrecordtype=? WHERE id=?  AND continent='North America'

DELETE FROM windmills_test1 WHERE id=?  AND continent='Antarctica'

INSERT INTO windmills_test1 /* continent=Antarctica */ (id,uuid,millid,kwatts_s,date,location,continent,active,strrecordtype) VALUES (?, UUID(), ?, ?, NOW(), ?, ?, ?,?) ON DUPLICATE KEY UPDATE kwatts_s=kwatts_s+1

The above are executed during the tests.

In all of them the sharding key is present, either in the WHERE clause OR as comment.

Of course if I execute one of them without the sharding key, the firewall rule will stop the query execution, ie:

mysql> SELECT id, millid, date,continent,active,kwatts_s FROM windmills_test1 WHERE id BETWEEN ? AND ?;
ERROR 1148 (42000): It is impossible to redirect this command to a defined shard. Please be sure you Have the Continent definition in your query, or that you use a defined DBA account (dba_g{1/2})

Check here for full command list

Setting up the dataset

Once we have the rules set it is time to setup the schemas and the data using sysbench (https://github.com/Tusamarco/sysbench), remember to use windmills_sharding tests.

The first operation is to build the schema on SHARD2 without filling it with data. This is a DBA action as such we will execute it using the dba_g2 account:

sysbench ./src/lua/windmills_sharding/oltp_read.lua  --mysql-host=10.0.1.96  --mysql-port=6033 --mysql-user=dba_g2 --mysql-password=xxx --mysql-db=windmills_large --mysql_storage_engine=innodb --db-driver=mysql --tables=4 --table_size=0 --table_name=windmills --mysql-ignore-errors=all --threads=1  prepare

Setting table_size and pointing to the ProxySQL IP/port will do, and I will have:

mysql> select current_user(), @@hostname;
+----------------+-------------------+
| current_user() | @@hostname        |
+----------------+-------------------+
| dba_g2@%       | ps-mysql1-mysql-0 |
+----------------+-------------------+
1 row in set (0.01 sec)

mysql> use windmills_large;
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_windmills_large |
+---------------------------+
| windmills1                |
| windmills2                |
| windmills3                |
| windmills4                |
+---------------------------+
4 rows in set (0.01 sec)

mysql> select count(*) from windmills1;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.09 sec)

All set but empty.

Now let us do the same but with the other DBA user:

sysbench ./src/lua/windmills_sharding/oltp_read.lua  --mysql-host=10.0.1.96  --mysql-port=6033 --mysql-user=dba_g1 --mysql-password=xxx --mysql-db=windmills_large --mysql_storage_engine=innodb --db-driver=mysql --tables=4 --table_size=400 --table_name=windmills --mysql-ignore-errors=all --threads=1  prepare

If I do now the select above with user dba_g2:

mysql> select current_user(), @@hostname;select count(*) from windmills1;
+----------------+-------------------+
| current_user() | @@hostname        |
+----------------+-------------------+
| dba_g2@%       | ps-mysql1-mysql-0 |
+----------------+-------------------+
1 row in set (0.00 sec)

+----------+
| count(*) |
+----------+
|      113 |
+----------+
1 row in set (0.00 sec)

While If I reconnect and use dba_g1:

mysql> select current_user(), @@hostname;select count(*) from windmills1;
+----------------+--------------------+
| current_user() | @@hostname         |
+----------------+--------------------+
| dba_g1@%       | mt-cluster-1-pxc-0 |
+----------------+--------------------+
1 row in set (0.00 sec)

+----------+
| count(*) |
+----------+
|      287 |
+----------+
1 row in set (0.01 sec)

I can also check on ProxySQL to see which rules were utilized:

select active,hits,destination_hostgroup, mysql_query_rules.rule_id, match_digest, match_pattern, replace_pattern, cache_ttl, apply,flagIn,flagOUT FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id;

+------+-----------------------+---------+---------------------+----------------------------------------------------------------------------+-------+--------+---------+
| hits | destination_hostgroup | rule_id | match_digest        | match_pattern                                                              | apply | flagIN | flagOUT |
+------+-----------------------+---------+---------------------+----------------------------------------------------------------------------+-------+--------+---------+
| 3261 | 100                   | 20      | NULL                | NULL                                                                       | 0     | 0      | 500     |
| 51   | 200                   | 21      | NULL                | NULL                                                                       | 0     | 0      | 600     |
| 2320 | 100                   | 31      | NULL                | scontinents*(=|like)s*'*(Asia|Africa|Antarctica|Europe|North America)'* | 1     | 500    | 0       |
| 880  | 200                   | 32      | NULL                | scontinents*(=|like)s*'*(Oceania|South America)'*                       | 1     | 500    | 0       |
| 0    | 100                   | 34      | NULL                | scontinents*(=|like)s*'*(Asia|Africa|Antarctica|Europe|North America)'* | 1     | 600    | 0       |
| 0    | 200                   | 35      | NULL                | scontinents*(=|like)s*'*(Oceania|South America)'*                       | 1     | 600    | 0       |
| 2    | 100                   | 51      | NULL                | scontinents*(=|like)s*'*(Asia|Africa|Antarctica|Europe|North America)'* | 0     | 0      | 1001    |
| 0    | 200                   | 54      | NULL                | scontinents*(=|like)s*'*(Oceania|South America)'*                       | 0     | 0      | 1002    |
| 0    | 100                   | 60      | NULL                | NULL                                                                       | 0     | 50     | 1001    |
| 0    | 200                   | 62      | NULL                | NULL                                                                       | 0     | 60     | 1002    |
| 7    | NULL                  | 2000    | .                   | NULL                                                                       | 1     | 0      | NULL    |
| 0    | 100                   | 2040    | ^SELECT.*FOR UPDATE | NULL                                                                       | 1     | 1001   | NULL    |
| 2    | 101                   | 2041    | ^SELECT.*$          | NULL                                                                       | 1     | 1001   | NULL    |
| 0    | 200                   | 2050    | ^SELECT.*FOR UPDATE | NULL                                                                       | 1     | 1002   | NULL    |
| 0    | 201                   | 2051    | ^SELECT.*$          | NULL                                                                       | 1     | 1002   | NULL    |
+------+-----------------------+---------+---------------------+----------------------------------------------------------------------------+-------+--------+---------+

Running the application

Now that the data load test was successful, let us do the real load following the indication as above but use 80 Tables and just a bit more records like 20000, nothing huge.

Once the data is loaded we will have the 2 shards with different numbers of records, if all went well the shard2 should have ¼ of the total and shard1 ¾ .

When load is over I have as expected:

mysql> select current_user(), @@hostname;select count(*) as shard1 from windmills_large.windmills80;select /* continent=shard2 */ count(*) as shard2 from windmills_large.windmills80;
+----------------+--------------------+
| current_user() | @@hostname         |
+----------------+--------------------+
| dba_g1@%       | mt-cluster-1-pxc-0 |
+----------------+--------------------+
1 row in set (0.00 sec)

+--------+
| shard1 |
+--------+
|  14272 | ← Table windmills80 in SHARD1
+--------+
+--------+
| shard2 |
+--------+
|   5728 | ← Table windmills80 in SHARD2
+--------+

As you may have already noticed, I used a trick to query the other shard using the dba_g1 user, I just passed in the query the shard2 definition as a comment. That is all we need.

Let us execute the run command for writes in sysbench and see what happens.

The first thing we can notice while doing writes is the query distribution:

+--------+-----------+----------------------------------------------------------------------------+----------+--------+----------+----------+--------+---------+-------------+---------+
| weight | hostgroup | srv_host                                                                   | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries |
+--------+-----------+----------------------------------------------------------------------------+----------+--------+----------+----------+--------+---------+-------------+---------+
| 10000  | 100       | ac966f7d46c04400fb92a3603f0e2634-193113472.eu-central-1.elb.amazonaws.com  | 3306     | ONLINE | 24	     | 0        | 138    | 66      | 25          | 1309353 |
| 100    | 101       | a5c8836b7c05b41928ca84f2beb48aee-1936458168.eu-central-1.elb.amazonaws.com | 3306     | ONLINE | 0	     | 0        | 0      | 0       | 0           |       0 |
| 10000  | 200       | a039ab70e9f564f5e879d5e1374d9ffa-1769267689.eu-central-1.elb.amazonaws.com | 3306     | ONLINE | 24	     | 1        | 129    | 66      | 25          |  516407 |
| 10000  | 201       | a039ab70e9f564f5e879d5e1374d9ffa-1769267689.eu-central-1.elb.amazonaws.com | 6447     | ONLINE | 0	     | 0        | 0      | 0       | 0           |       0 |
+--------+-----------+----------------------------------------------------------------------------+----------+--------+----------+----------+--------+---------+-------------+---------+

Where we can notice that the load in connection is evenly distributed, while the load is mainly going to shard1 as we expected given we have an unbalanced sharding by design.

At MySQL level we had:

Questions

Com Type

The final point is, what is the gain using this sharding approach?

Well we still need to consider the fact we are testing on a very small set of data, however if we can already identify some benefit here, that will be an interesting result.

Let see the write operations with 24 and 64 threads:

We get a gain of ~33% just using sharding, while for latency we do not have a cost on the contrary also with small load increase we can see how the sharded solution performs better. Of course we are still talking about low number of rows and running threads but gain is there.

Backup

The backup and restore operation when using POM is completely managed by the operator (see instructions in the POM documentation https://docs.percona.com/percona-operator-for-mysql/pxc/backups.html and https://docs.percona.com/percona-operator-for-mysql/ps/backups.html ).

The interesting part is that we can have multiple kind of backup solution, like:

On demand
Scheduled
Full Point in time recovery with log streaming

Automation will allow us to set schedule as simple as this:

    schedule:
     - name: "sat-night-backup"
        schedule: "0 0 * * 6"
        keep: 3
        storageName: s3-eu-west
      - name: "daily-backup"
        schedule: "0 3 * * *"
        keep: 7
        storageName: s3-eu-west

Or if you want to run the on demand:

kubectl apply -f backup.yaml

Where the backup.yaml file has very simple informations:

apiVersion: ps.percona.com/v1alpha1
kind: PerconaServerMySQLBackup
metadata:
  name: ps-gr-sharding-test-2nd-of-may
#  finalizers:
#    - delete-backup
spec:
  clusterName: ps-mysql1
  storageName: s3-ondemand

Using both methods we will be able to soon have a good set of backups like:

POM (PXC)

cron-mt-cluster-1-s3-eu-west-20234293010-3vsve   mt-cluster-1   s3-eu-west    s3://mt-bucket-backup-tl/scheduled/mt-cluster-1-2023-04-29-03:00:10-full   Succeeded   3d9h        3d9h
cron-mt-cluster-1-s3-eu-west-20234303010-3vsve   mt-cluster-1   s3-eu-west    s3://mt-bucket-backup-tl/scheduled/mt-cluster-1-2023-04-30-03:00:10-full   Succeeded   2d9h        2d9h
cron-mt-cluster-1-s3-eu-west-2023513010-3vsve    mt-cluster-1   s3-eu-west    s3://mt-bucket-backup-tl/scheduled/mt-cluster-1-2023-05-01-03:00:10-full   Succeeded   33h         33h
cron-mt-cluster-1-s3-eu-west-2023523010-3vsve    mt-cluster-1   s3-eu-west    s3://mt-bucket-backup-tl/scheduled/mt-cluster-1-2023-05-02-03:00:10-full   Succeeded   9h          9h

POM (PS) *

NAME                             STORAGE       DESTINATION                                                                     STATE       COMPLETED   AGE
ps-gr-sharding-test              s3-ondemand   s3://mt-bucket-backup-tl/ondemand/ondemand/ps-mysql1-2023-05-01-15:10:04-full   Succeeded   21h         21h
ps-gr-sharding-test-2nd-of-may   s3-ondemand   s3://mt-bucket-backup-tl/ondemand/ondemand/ps-mysql1-2023-05-02-12:22:24-full   Succeeded   27m         27m

To note that as DBA, we still need to validate the backups with a restore procedure, that part is not automated (yet).

*Note that Backup for POM PS is available only on demand given the solution is still in technical preview

When will this solution fit in?

As mentioned multiple times, this solution can cover simple cases of sharding, better if you have shared-nothing.

It also requires work from the DBA side in case of DDL operations or resharding.

You also need to be able to change some SQL code in order to be sure to have present the sharding key/information, in any SQL executed.

When will this solution not fit in?

There are several things that could prevent you to use this solution, the most common ones are:

You need to query multiple shards at the same time. This is not possible with ProxySQL
You do not have a DBA to perform administrative work and need to rely on automated system
Distributed transaction cross shard
No access to SQL code.

Conclusions

We do not have the Amletic dilemma about sharding or not sharding.

When using a RDBMS like MySQL, if you need horizontal scalability, you need to shard.

The point is there is no magic wand or solution, moving to sharding is an expensive and impacting operation. If you choose it at the beginning, before doing any application development, the effort can be significantly less.

Doing sooner will also allow you to test proper solutions, where proper is a KISS solution, always go for the less complex things, in 2 years you will be super happy about your decision.

If instead you must convert a current solution, then prepare for a bloodshed, or at least for a long journey.

In any case we need to keep in mind few key points:

Do not believe most of the articles on the internet that promise you infinite scalability for your database. If there is no distinction in the article about a simple database and a RDBMS, run away.
Do not go for the last shiny things just because they shine. Test them and evaluate IF it makes sense for you. Better to spend a quarter testing now a few solutions, than fight for years with something that you do not fully comprehend.
Using containers/operators/kubernetes do not scale per se, you must find a mechanism to have the solution scaling, there is absolutely NO difference with premises. What you may get is a good level of automation, however that will come with a good level of complexity, it is up to you to evaluate if it makes sense or not.

As said at the beginning for MYSQL the choice is limited. Vitess is the full complete solution, with a lot of coding to provide you a complete platform to deal with your scaling needs.

However do not be so fast to exclude ProxySQL as possible solutions. There are out there already many using it also for sharding.

This small POC used a synthetic case, but it also shows that with just 4 rules you can achieve a decent solution. A real scenario could be a bit more complex … or not.

References

Vitess (https://vitess.io/docs/)

ProxySQL (https://proxysql.com/documentation/)

Firewalling with ProxySQL (https://www.tusacentral.com/joomla/index.php/mysql-blogs/197-proxysql-firewalling)

Sharding:

https://www.percona.com/blog/mysql-sharding-with-proxysql/
https://https://medium.com/pinterest-engineering/sharding-pinterest-how-we-scaled-our-mysql-fleet-3f341e96ca6fwww.percona.com/blog/horizontal-scaling-in-mysql-sharding-followup/
https://quoraengineering.quora.com/MySQL-sharding-at-Quora
https://skywalking.apache.org/blog/skywalkings-new-storage-feature-based-on-shardingsphere-proxy-mysql-sharding/

Which is the best Proxy for Percona MySQL Operator?

Overview

HAProxy, ProxySQL, MySQL Router (AKA MySQL Proxy), in the last few years I had to answer multiple times on what proxy to use and in what scenario. When designing an architecture there are many components that need to be considered before deciding what is the best solution.

When deciding what to pick, there are many things to consider like where the proxy needs to be, if it “just” needs to redirect the connections or if more features need to be in, like caching, filtering, or if it needs to be integrated with some MySQL embedded automation.

Given that, there never was a single straight answer, instead an analysis needs to be done. Only after a better understanding of the environment, the needs and the evolution that the platform needs to achieve is it possible to decide what will be the better choice.

However recently we have seen an increase in usage of MySQL on Kubernetes, especially with the adoption of Percona Operator for MySQL.
In this case we have a quite well define scenario that can resemble the image below:

In this scenario the proxies need to sit inside Pods balancing the incoming traffic from the Service LoadBalancer connecting with the active data nodes.

Their role is merely to be sure that any incoming connection is redirected to nodes that are able to serve them, which include having a separation between Read/Write and Read Only traffic, separation that can be achieved, at service level, with automatic recognition or with two separate entry points.

In this scenario it is also crucial to be efficient in resource utilization and scaling with frugality. In this context features like filtering, firewalling or caching are redundant and may consume resources that could be allocated to scaling. Those are also features that will work better outside the K8s/Operator cluster, given the closer to the application they are located, the better they will serve.

About that we must always remember the concept that each K8s/Operator cluster needs to be seen as a single service, not as a real cluster. In short each cluster is in reality a single database with High Availability and other functionalities built in.

Anyhow, we are here to talk about Proxies. Once we have defined that we have one clear mandate in mind, we need to identify which product allow our K8s/Operator solution to:

Scale at the maximum the number of incoming connections
Serve the request with the higher efficiency
Consume less resources as possible

The Environment

To identify the above points I have simulated a possible K8s/Operator environment, creating:

One powerful application node, where I run sysbench read only tests, scaling from 2 to 4096 threads. (Type c5.4xlarge)
3 Mid data nodes with several gigabytes of data in with MySQL and Group Replication (Type m5.xlarge)
1 Proxy node running on a resource limited box (Type t2.micro)

The Tests

We will have very simple test cases. The first one has the scope to define the baseline, identifying the moment when we will have the first level of saturation due to the number of connections. In this case we will increase the number of connections and keep a low number of operations.

The second test will define how well the increasing load is served inside the range we had previously identified.

For documentation the sysbench commands are:

Test1

sysbench ./src/lua/windmills/oltp_read.lua  --db-driver=mysql --tables=200 --table_size=1000000 
 --rand-type=zipfian --rand-zipfian-exp=0 --skip_trx=true  --report-interval=1 --mysql-ignore-errors=all 
--mysql_storage_engine=innodb --auto_inc=off --histogram  --stats_format=csv --db-ps-mode=disable --point-selects=50 
--reconnect=10 --range-selects=true –rate=100 --threads=<#Threads from 2 to 4096> --time=1200 run

Test2

sysbench ./src/lua/windmills/oltp_read.lua  --mysql-host=<host> --mysql-port=<port> --mysql-user=<user> 
--mysql-password=<pw> --mysql-db=<schema> --db-driver=mysql --tables=200 --table_size=1000000  --rand-type=zipfian 
--rand-zipfian-exp=0 --skip_trx=true  --report-interval=1 --mysql-ignore-errors=all --mysql_storage_engine=innodb 
--auto_inc=off --histogram --table_name=<tablename>  --stats_format=csv --db-ps-mode=disable --point-selects=50 
--reconnect=10 --range-selects=true --threads=<#Threads from 2 to 4096> --time=1200 run

Results

Test 1

As indicated here I was looking to identify when the first Proxy will reach a dimension that would not be manageable. The load is all in creating and serving the connections, while the number of operations is capped to 100.

As you can see and as I was expecting, the three Proxies were behaving more or less the same, serving the same number of operations (they were capped so why not), until not.

MySQL router after 2048 connection was not able to serve anything more.

NOTE: MySQL Router was actually stopped working at 1024 threads, but using version 8.0.32 I enabled the feature: connection_sharing. That allows it to go a bit further.

Let us take a look also the the latency:

Here the situation starts to be a little bit more complicated. MySQL Router is the one that has the higher latency no matter what. However HAProxy and ProxySQL have an interesting behavior. HAProxy is performing better with a low number of connections, while ProxySQL is doing better when a high number of connections is in place.

This is due to the multiplexing and the very efficient way ProxySQL uses to deal with high load.

Everything has a cost:

HAProxy is definitely using less user’s CPU resources than ProxySQL or MySQL Router …

.. we can also notice that HAProxy barely reaches on average the 1.5 CPU load while ProxySQL is at 2.50 and MySQL Router around 2.

To be honest I was expecting something like this, given ProxySQL's need to handle the connections and the other basic routing. What was instead a surprise was MySQL Router, why does it have a higher load?

Brief summary

This test highlights that HAProxy and ProxySQL are able to reach a level of connection higher than the slowest runner in the game (MySQL Router). It is also clear that traffic is better served under a high number of connections by ProxySQL but it requires more resources.

Test 2

When the going gets tough, the tough gets going

Well let us remove the –rate limitation and see what will happen.

The scenario with load changes drastically. We can see how HAProxy is able to serve the connection and allow the execution of a higher number of operations for the whole test. ProxySQL is immediately after it and is behaving quite well up to 128 threads, then it just collapses.

MySQL Router never takes off; it always stays below the 1k reads/second while HAProxy was able to serve 8.2k and ProxySQL 6.6k.

Taking a look at the Latency, we can see that HAProxy had a gradual increase as expected, while ProxySQL and MySQL Router just went up from the 256 threads on.

To observe that both ProxySQL and MySQL Router were not able to complete the tests with 4096 threads.

Why? HAProxy stays always below 50% cpu, no matter the increasing number of threads/connections, scaling the load very efficiently. MySQL router was almost immediately reaching the saturation point, being affected not only by the number of threads/connections but also by the number of operations, and that was not expected given we do not have a level 7 capability in MySQL Router.

Finally ProxySQL, which was working fine up to a certain limit, then it reached saturation point and was not able to serve the load. I am saying load because ProxySQL is a level 7 proxy and is aware of the content of the load. Given that on top of multiplexing, additional resource consumption was expected.

Here we just have a clear confirmation of what already said above, with 100% cpu utilization reached by MySQL Router with just 16 threads, and ProxySQL way after at 256 threads.

Brief Summary

HAProxy comes up as the champion in this test, there is no doubt that it was able to scale the increasing load in connection without being affected significantly by the load generated by the requests. The lower consumption in resources also indicates the possible space for even more scaling.

ProxySQL was penalized by the limited resources, but this was the game, we have to get the most out of the few available. This test indicates that it is not optimal to use ProxySQL inside the Operator, actually it is a wrong choice if low resource and scalability is a must.

MySQL Router was never in the game. Unless a serious refactoring, MySQL Router is designed for very limited scalability, as such the only way to adopt it is to have many of them at application node level. Utilizing it close to the data nodes in a centralized position is a mistake.

Conclusions

I started showing an image on how the MySQL service is organized and want to close showing the variation that for me is the one to be considered the default approach:

This to highlight that we always need to choose the right tool for the job.

The Proxy in architectures involving MySQL/PS/PXC is a crucial element for the scalability of the cluster, no matter if using K8s or not. It is important to choose the one that serves us better, which in some cases can be ProxySQL over HAProxy.

However when talking of K8s and Operators we must recognize the need to optimize the resources usage for the specific service. In that context there is no discussion about it, HAProxy is the best solution and the one we should go to.

My final observation is about MySQL Router (aka MySQL Proxy).

Unless a significant refactoring of the product at the moment it is not even close to what the other two can do. From the tests done so far, it requires a complete reshaping starting to identify why it is so subject to the load coming from the query, more than the load coming from the connections.

Great MySQL to everyone.

References

https://www.percona.com/blog/boosting-percona-distribution-for-mysql-operator-efficiency/

https://www.slideshare.net/marcotusa/my-sql-on-kubernetes-demystified

https://docs.haproxy.org/2.7/configuration.html

https://proxysql.com/documentation/

https://dev.mysql.com/doc/mysql-router/8.0/en/

Help! I am out of disk space!

How we could fix a nasty out of space issue leveraging the flexibility of Percona MySQL operator (PMO)

When planning a database deployment, one of the most challenging factors to consider is the amount of space we need to dedicate for Data on disk.

This is even more cumbersome when working on bare metal. Given it is definitely more difficult to add space when using this kind of solution in respect to the cloud.

This is it, when using cloud storage like EBS or similar it is normally easy(er) to extend volumes, which gives us the luxury to plan the space to allocate for data with a good grade of relaxation.

Is this also true when using a solution based on Kubernetes like Percona Operator for MySQL? Well it depends on where you run it, however if the platform you choose supports the option to extend volumes K8s per se is giving you the possibility to do so as well.

However, if it can go wrong it will, and ending up with a fully filled device with MySQL is not a fun experience.

As you know, on normal deployments, when mysql has no space left on the device, it simply stops working, ergo it will cause a production down event, which of course is an unfortunate event that we want to avoid at any cost.

This blog is the story of what happened, what was supposed to happen and why.

The story

Case was on AWS using EKS.

Given all the above, I was quite surprised when we had a case in which a deployed solution based on PMO went out of space. However we start to dig and review what was going on and why.

The first thing we did was to quickly investigate what was really taking space, that could have been an easy win if most of the space was taken by some log, but unfortunately this was not the case, data was really taking all the available space.

The next step was to check what storage class was used for the PVC

k get pvc
NAME                         VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS
datadir-mt-cluster-1-pxc-0   pvc-<snip>   233Gi      RWO            io1
datadir-mt-cluster-1-pxc-1   pvc-<snip>   233Gi      RWO            io1
datadir-mt-cluster-1-pxc-2   pvc-<snip>   233Gi      RWO            io1

Ok we use the io1 SC, it is now time to check if the SC is supporting volume expansion:

kubectl describe sc io1
Name:            io1
IsDefaultClass:  No
Annotations:     kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"},"name":"io1"},"parameters":{"fsType":"ext4","iopsPerGB":"12","type":"io1"},"provisioner":"kubernetes.io/aws-ebs"}
,storageclass.kubernetes.io/is-default-class=false
Provisioner:           kubernetes.io/aws-ebs
Parameters:            fsType=ext4,iopsPerGB=12,type=io1
AllowVolumeExpansion:  <unset> <------------
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>

And no is not enabled, in this case we cannot just go and expand the volume, must change the storage class settings first.
To enable volume expansion, you need to delete the storage class and enable it again.

Unfortunately we were unsuccessful in doing that operation, because the storage class kept staying as unset for ALLOWVOLUMEEXPANSION.

As said this is a production down event, so we cannot invest too much time in digging why it was not correctly changing the mode, we had to act quickly.

The only option we had to fix it was:

Expand the io1 volumes from AWS console (or aws client)
Resize the file system
Patch any K8 file to allow K8 to correctly see the new volumes dimension

Expanding EBS volumes from the console is trivial, just go to Volumes, select the volume you want to modify, choose modify and change the size of it with the one desired, done.

Once that is done connect to the Node hosting the pod which has the volume mounted like:

 k get pods -o wide|grep mysql-0
NAME                                        READY     STATUS    RESTARTS   AGE    IP            NODE             
cluster-1-pxc-0                               2/2     Running   1          11d    10.1.76.189     <mynode>.eu-central-1.compute.internal

Then we need to get the id of the pvc to identify it on the node

k get pvc
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS
datadir-cluster-1-pxc-0   Bound    pvc-1678c7ee-3e50-4329-a5d8-25cd188dc0df   233Gi      RWO            io1

One note, when doing this kind of recovery with a PXC based solution, always recover node-0 first, then the others.

So we connect to <mynode> and identify the volume:

lslbk |grep pvc-1678c7ee-3e50-4329-a5d8-25cd188dc0df
nvme1n1      259:4    0  350G  0 disk /var/lib/kubelet/pods/9724a0f6-fb79-4e6b-be8d-b797062bf716/volumes/kubernetes.io~aws-ebs/pvc-1678c7ee-3e50-4329-a5d8-25cd188dc0df <-----

At this point we can resize it:

root@ip-<snip>:/# resize2fs  /dev/nvme1n1
resize2fs 1.45.5 (07-Jan-2020)
Filesystem at /dev/nvme1n1 is mounted on /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/eu-central-1a/vol-0ab0db8ecf0293b2f; on-line resizing required
old_desc_blocks = 30, new_desc_blocks = 44
The filesystem on /dev/nvme1n1 is now 91750400 (4k) blocks long.

The good thing is that as soon as you do that the MySQL daemon see the space and will restart, however it will happen only on the current pod and K8 will still see the old dimension:

k get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                            STORAGECLASS   REASON
pvc-1678c7ee-3e50-4329-a5d8-25cd188dc0df   333Gi      RWO            Delete           Bound    pxc/datadir-cluster-1-pxc-0   io1

To allow k8 to be align with the real dimension we must patch the information stored, the command is the following:

kubectl patch pvc <pvc-name>  -n <pvc-namespace> -p '{ "spec": { "resources": { "requests": { "storage": "NEW STORAGE VALUE" }}}}'
Ie:
kubectl patch pvc datadir-cluster-1-pxc-0 -n pxc -p '{ "spec": { "resources": { "requests": { "storage": "350" }}}}'

Remember to use as pvc-name the NAME coming from

kubectl get pvc.

Once this is done k8 will see the new volume dimension correctly.

Just repeat the process for Node-1 and Node-2 and …done the cluster is up again.

Finally do not forget to modify your custom resources file (cr.yaml) to match the new volume size. IE:

    volumeSpec:
      persistentVolumeClaim:
        storageClassName: "io1"
        resources:
          requests:
            storage: 350G

The whole process took just a few minutes, it was time now to investigate why the incident happened and why the storage class was not allowing extension in the first place.

Why it happened

Well first and foremost the platform was not correctly monitored. As such there was lack of visibility about the space utilization, and no alert about disk space.

This was easy to solve just enabling the PMM feature in the cluster cr and set the alert in PMM once the nodes join it (see https://docs.percona.com/percona-monitoring-and-management/get-started/alerting.html for details on how to).

The second issue was the problem with the storage class. Once we had the time to carefully review the configuration files, we identified that there was an additional tab in the SC class, which was causing k8 to ignore the directive.

Was suppose to be:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: io1
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: kubernetes.io/aws-ebs
parameters:
  type: io1
  iopsPerGB: "12"
  fsType: ext4 
allowVolumeExpansion: true <----------

It was:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: io1
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: kubernetes.io/aws-ebs
parameters:
  type: io1
  iopsPerGB: "12"
  fsType: ext4 
  allowVolumeExpansion: true. <---------

What was concerning was the lack of error returned by the Kubernetes API, so in theory the configuration was accepted but not really validated.

In any case once we had fix the typo and recreated the SC, the setting for volume expansion was correctly accepted:

kubectl describe sc io1
Name:            io1
IsDefaultClass:  No
Annotations:     kubectl.kubernetes.io/last-applied-configuration={"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"},"name":"io1"},"parameters":{"fsType":"ext4","iopsPerGB":"12","type":"io1"},"provisioner":"kubernetes.io/aws-ebs"}
,storageclass.kubernetes.io/is-default-class=false
Provisioner:           kubernetes.io/aws-ebs
Parameters:            fsType=ext4,iopsPerGB=12,type=io1
AllowVolumeExpansion:  True    
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>

What should have happened instead?

If a proper monitoring and alerting was in place, the administrators would have the time to act and extend the volumes without downtime.

However, the procedure for extending volumes on K8 is not complex but also not as straightforward as you may think. My colleague Natalia Marukovich wrote this blog post (https://www.percona.com/blog/percona-operator-volume-expansion-without-downtime/) that gives you the step by step instructions on how to extend the volumes without downtime.

Conclusions

Using the cloud, containers, automation or more complex orchestrators like Kubernetes, do not solve all, do not prevent mistakes from happening, and more importantly do not make the right decisions for you.

You must set up a proper architecture that includes backup, monitoring and alerting. You must set the right alerts and act on them in time.

Finally automation is cool, however the devil is in the details and typos are his day to day joy. Be careful and check what you put online, do not rush it. Validate, validate validate…

Great stateful MySQL to all.

Sidebar

Main Menu Mobile

Proof of Concept: Horizontal Write Scaling for MySQL with Kubernetes Operator

The POC

Why this POC?

Why Proxysql?

Why POM

The elements used

The settings

Data layer

ProxySQL and Sharding rules

Setting up the dataset

Running the application

Backup

When will this solution fit in?

When will this solution not fit in?

Conclusions

References

Which is the best Proxy for Percona MySQL Operator?

Overview

The Environment

The Tests

Results

Test 1

Brief summary

Test 2

Brief Summary

Conclusions

References

Help! I am out of disk space!

The story

Why it happened

What should have happened instead?

Conclusions

More Articles ...

login