My MySQL tips

Percona Distribution for MySQL: High Availability with Group Replication solution

This blog provides high availability (HA) guidelines using group replication architecture and deployment recommendations in MySQL, based on our best practices.

Every architecture and deployment depends on the customer requirements and application demands for high availability and the estimated level of usage. For example, using high read or high write applications, or both, with a need for 99.999% availability.

Here, we give architecture and deployment recommendations along with a technical overview for a solution that provides a high level of high availability and assumes the usage of high read/write applications (20k or more queries per second).

Layout

Components

This architecture is composed of two main layers:

Connection and distribution layer
RDBMS (Relational Database Management System) layer

Connection Layer

The connection layer is composed of:

Application to proxy redirection mechanism which can be anything from a Virtual IP managed by Keepalived local service to a DNS resolution service like Amazon Route 53. Its function is to redirect the traffic to the active Proxy node.
Proxy connection distribution is composed of two or more nodes. Its role is to redirect the traffic to the active nodes of the Group Replication cluster. In cases like ProxySQL where the proxy is a level 7 proxy and is able to perform Read/Write split, this layer is also in charge of redirecting writes to the Primary node and reads to the Replicas, and of HA to prevent a single point of failure

Data Layer

The data layer is composed of:

Primary node serving writes (or source) - this is the node that will accept writes and DDL modifications. Data will be processed following the ACID paradigm (atomicity, consistency, isolation, durability) and replicated to all other nodes.
Replica nodes are the elements serving read requests. Some replica nodes can be elected Primary in case of Primary node failure. A replica node should be able to leave and join back a healthy cluster without impacting the service.
Replication mechanism to distribute changes across nodes and in this solution is done with Group Replication. Group Replication is a tightly coupled solution which means that the database cluster is based on a Datacentric approach (single state of the data, distributed commit). In this case, the data is consistent in time across nodes and replication requires a high performant link. Given that, geographic distribution is strongly discouraged and Disaster Recovery (DR) is not implicitly supported by the main Group Replication mechanism.

The node characteristics (CPU/RAM/Storage) are not relevant to the main solution design. They instead must reflect the estimated workload the solution will have to cover, which is a case-by-case identification.

What is important to keep in mind is that all nodes that are part of the cluster must have the same characteristics. If they don’t, the cluster will be imbalanced and service will be affected.

As a generic indication, we recommend using solutions with at least 8 cores and 16GB RAM when production.

High Availability

How do we measure availability and at what point does it become “high” availability?

Generally speaking, the measurement of availability is done by establishing a measurement time frame and dividing it by the time that it was available. This ratio will rarely be 1, which is equal to 100% availability. At Percona we don’t consider a solution to be highly available if it is not at least 99% or “two nines” available.

Availability %	Downtime per year	Downtime per month	Downtime per week	Downtime per day
99% ("two nines")	3.65 days	7.31 hours	1.68 hours	14.40 minutes
99.5% ("two nines five")	1.83 days	3.65 hours	50.40 minutes	7.20 minutes
99.9% ("three nines")	8.77 hours	43.83 minutes	10.08 minutes	1.44 minutes
99.95% ("three nines five")	4.38 hours	21.92 minutes	5.04 minutes	43.20 seconds
99.99% ("four nines")	52.60 minutes	4.38 minutes	1.01 minutes	8.64 seconds
99.995% ("four nines five")	26.30 minutes	2.19 minutes	30.24 seconds	4.32 seconds
99.999% ("five nines")	5.26 minutes	26.30 seconds	6.05 seconds	864.00 milliseconds

How is High Availability Achieved?

There are three key components to high availability:

Infrastructure - This is the physical or virtual hardware that database systems rely on to run. Without enough infrastructure (VM’s, networking, etc) there cannot be high availability. The easiest example is: there is no way to make a single server highly available.
Topology Management - This is the software management related specifically to the database and managing its ability to stay consistent in the event of a failure. Many clustering or synchronous replication solutions offer this capability out of the box. However, for asynchronous replication, this is handled by additional software.
Connection Management - This is the software management related specifically to the networking and connectivity aspect of the database. Clustering solutions typically bundle with a connection manager, however in asynchronous clusters deploying a connection manager is mandatory for high availability.

This Solution Provides:

The proposed solution, based on a tightly coupled database cluster, offers an HA level of 99.995% when coupled with the Group replication setting group_replication_consistency=AFTER.

group replication

Failovers

If properly planned and architected, a database failure or configuration change that requires a restart shouldn’t affect the stability of the database infrastructure. Failovers are an integral part of a stability strategy and aligning the business requirements for availability and uptime with failover methodologies is critical to achieving those goals. Below are the 3 main types of failovers that can occur in database environments.

Planned Failover: A planned failover is a failover that has been scheduled in advance or occurs at a regular interval. There can be many reasons for planned failovers including patching, large data operations, retiring existing infrastructure, or simply testing the failover strategy.
Unplanned Failover: An unplanned failover is what occurs when a database unexpectedly becomes unresponsive or experiences instability. This could also include emergency changes that do not fall under the planned failover cadence or scheduling parameters. Unplanned failovers are generally considered higher-risk operations due to the high stress and high potential for either data corruption or data fragmentation.
Regional or Disaster Recovery Failover: Unplanned failovers still work with the assumption that additional database infrastructure is immediately available and in a usable state. In a regional or DR failover, we would be making the assumption that there is a large-scale infrastructure outage that requires the business to move its operations away from its current availability zone.
This solution covers both planned and unplanned failovers.

Maintenance Windows

Major vs Minor Maintenance: Although it may not be obvious at first, not all maintenance activities are created equal and do not have the same dependencies. It is good to separate maintenance that demands downtime or failover from maintenance that can be done without impacting those important stability metrics. When defining these maintenance dependencies there can be a change in the actual maintenance process that allows for a different cadence.

Maintenance Without Service Interruption: With rolling restart and using proper version upgrade it is possible to cover both major and minor maintenance without service interruption.

Uptime

When referring to database stability, uptime is likely the largest indicator of stability and oftentimes is the most obvious symptom of an unstable database environment. Uptime is composed of 3 key components and, contrary to common perception, is based on what happens when the database software is not able to take incoming requests rather than maintaining the ability to take requests with errors.

Recovery Time Objective (RTO): This can be boiled down to a very simple question “How long can the business sustain a database outage?”. Once the business is aligned with a goal of a minimum viable recovery time objective, it is much more straightforward to plan and invest in the infrastructure required to meet that requirement. It is important to acknowledge that while everyone desires 100% uptime, there is a need for realistic expectations that align with the business needs and not a technical desire.

Recovery Point Objective (RPO): There is a big distinction between the Recovery Point and the Recovery Time for database infrastructure. The database can be available, but not to the exact state that it was when it became unavailable. That is where Recovery Point comes in. The question we ask here is “How much data can the business lose during a database outage?”. All businesses have their own requirements here and it is worthy to note that it is always the goal to never sustain any data loss. But this is framed in a worst-case scenario how much data could be lost and the business maintains the ability to continue.

Disaster Recovery: While RTO and RPO are great for unplanned outages or small-scale hiccups to the infrastructure, when we talk about Disaster Recovery this is a major large-scale outage not strictly for the database infrastructure. How capable is the business of restarting operations with the assumption that all resources are completely unavailable in the main availability zone? The assumption here is that there is no viable restoration point or time that aligns with the business requirements. While each DR scenario is unique based on available infrastructure, backup strategy, and technology stack, there are some common threads for any scenario.

This solution helps improve uptime:

Using this solution will help you to significantly reduce both RPO and RTO. Given the tightly coupled cluster solution approach, the failure of a single node will not result in service interruption.
Increasing the number of nodes will also improve the cluster resilience by the formula:
F = (N -1) / 2

Where:

F - Number of admissible failures

N - number of nodes in the cluster

Examples:

In a cluster of 5 nodes, F = (5 - 1)/2 = 2.

The cluster can support up to 2 failures.

In a cluster of 4 nodes, F = (4 - 1)/2 = 1.

The cluster can support up to 1 failure.

This solution also allows for a more restrictive backup policy, dedicating a node to the backup cycle, which will help in keeping RPO low. As previously mentioned, DR is not covered by default by the solution which will require an additional replication setup and controller.

Measurement and Monitoring

To ensure database infrastructure is performing as intended or at its best, it is necessary to measure specific metrics and alert when some of these metrics are not in line with expectations. Periodic review of these measurements is also encouraged to promote stability and understand potential risks associated with the database infrastructure. Below are the 3 aspects of Database performance measurement and monitoring

Measurement: To understand how a database infrastructure is performing there is a need to measure multiple aspects of the infrastructure. With measurement, it’s important to understand the impact of the sample sizes, sample timing, and sample types.

Metrics: Metrics refer to the actual parts of the database infrastructure being measured. When we discuss metrics, more isn’t always better as it could introduce unintentional noise or allow for troubleshooting to become overly burdensome.

Alerting: When one or many metrics of the database infrastructure is not within a normal or acceptable range, an alert should be generated so that the team responsible for the appropriate portion of the database infrastructure can investigate and remedy it

Monitoring for this solution is covered by:

Percona Monitoring and Management has a specific dashboard to monitor the Group Replication state and cluster status as a whole. (https://www.percona.com/doc/percona-monitoring-and-management/2.x/introduction.html) has a specific dashboard to monitor Group Replication state, and cluster status as a whole.

How to Implement the Infrastructure

In this section, we are providing the step by step instructions on how to implement the above solution.

The Elements

The following will be used:

1 Virtual IP for ProxySQL failover - 192.168.4.194
2 ProxySQL nodes
- Proxy1 192.168.4.191
- Proxy2 192.168.4.192
4 MySQL nodes in Single Primary mode
- Gr1 192.168.4.81 - Initial Primary
- Gr2 192.168.4.82 - Replica / failover
- Gr3 192.168.4.83 - Replica / failover
- Gr4 192.168.4.84 - Replica / Backup
Ports. All ports must be open if a firewall is in place or any other restriction like AppArmor or SELinux.
- Proxysql
  - 6033
  - 6032
  - 3306
- MySQL - GR
  - 3306
  - 33060
  - 33061

Software Installation

First, you need to install the Percona Distribution for MySQL, the Percona Server for MySQL-based variant, on each node. Follow the instructions at https://www.percona.com/doc/percona-distribution-mysql/8.0/installing.html to install Percona Server for MySQL v8.0.

Configure the Nodes

Before anything, make sure that all the nodes use the same time-zone and time:

    [root@gr1 ps8]# date
    Tue Aug 18 08:22:12 EDT 2020

Check also for ntpd service to be present and enabled. Be sure that each node resolves the other nodes by name:

root@gr1 ps8]# for i in 1 2 3 4 ; do ping -c 1 gr$i > /dev/null;echo $?; done

If not able to resolve, add the entries in the /etc/hosts file.

Once instances are up and running check the Percona Server for MySQL version on each node:

(root@node1) [(none)]>\s
--------------
/opt/mysql_templates/PS-8P/bin/mysql  Ver 8.0.20-11 for Linux on x86_64 (Percona Server (GPL), Release 11, Revision 159f0eb)

Step 1

Create a proper user for administration:

CREATE user dba@localhost identified by 'dbapw';
CREATE user dba@'192.168.%' identified by 'dbapw'; 

GRANT ALL on *.* to dba@localhost with grant option;
GRANT ALL on *.* to dba@'192.168.%' with grant option;

Exit from the client as user root and login as user dba.

Be sure to have a good and unique SERVER_ID value:

(dba@node1) [(none)]>show global variables like 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     |     1 | <--- Not good given the same for all nodes
+---------------+-------+
1 row in set (0.01 sec)

It's now time to add group replication settings to the instances.

Step 2

Stop all running nodes, then in the my.cnf add:

 #####################
    #Replication + binlog settings
    #####################
    auto-increment-increment                                    =1
    auto-increment-offset                                       =1

    log-bin                                             =<path_to_logs>/binlog
    log-bin-index                                       =binlog.index
    binlog-checksum                                             =NONE
    binlog-format                                               =ROW
    binlog-row-image                                            =FULL
    log-slave-updates 						=1
    binlog-transaction-dependency-tracking                      =WRITESET_SESSION


    enforce-gtid-consistency                                    =TRUE
    gtid-mode                                                   =ON

    master-info-file                                            =master.info
    master-info-repository                                      =TABLE
    relay_log_info_repository                                   =TABLE
    relay-log                                            =<path_to_logs>/relay

    sync-binlog                                                 =1

    ### SLAVE SECTION
    skip-slave-start
    slave-parallel-type                                        = LOGICAL_CLOCK
    slave-parallel-workers                                      = 4
    slave-preserve-commit-order                                 = 1

    ######################################
    #Group Replication
    ######################################
    plugin_load_add                                    ='group_replication.so'
    plugin-load-add                                    ='mysql_clone.so'
    group_replication_group_name       ="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" <-- Not good use something that will help you 
                                                                            to identify the GR transactions and from where they 
                                                                             come from IE "dc1euz1-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
    group_replication_start_on_boot                     =off
    group_replication_local_address                     = "192.168.4.81/2/3/4:33061"  <---- CHANGE THIS TO MATCH EACH NODE LOCAL IP
    group_replication_group_seeds                       = "192.168.4.81:33061,192.168.4.82:33061,192.168.4.83:33061,192.168.4.84:33061"
    group_replication_bootstrap_group                   = off
    transaction-write-set-extraction                    = XXHASH64

Restart all nodes and connect to them.

Step 3

Create a user for replication (on all nodes):

   SET SQL_LOG_BIN=0;
    CREATE USER replica@'192.168.4.%' IDENTIFIED BY 'replicapw';   #<--- Please note the filter by IP is more restrictive 
    GRANT REPLICATION SLAVE ON *.* TO replica@'192.168.4.%';
    FLUSH PRIVILEGES;
    SET SQL_LOG_BIN=1;

Link the nodes with replication channel (on all nodes):

CHANGE MASTER TO MASTER_USER='replica', MASTER_PASSWORD='replicapw' FOR CHANNEL 'group_replication_recovery';

Check the current status:

(dba@node1) [(none)]>\u performance_schema
    (dba@node1) [performance_schema]>show tables like '%repl%';
    +-------------------------------------------+
    | Tables_in_performance_schema (%repl%)     |
    +-------------------------------------------+
    | replication_applier_configuration         |
    | replication_applier_filters               |
    | replication_applier_global_filters        |
    | replication_applier_status                |
    | replication_applier_status_by_coordinator |
    | replication_applier_status_by_worker      |
    | replication_connection_configuration      |
    | replication_connection_status             |
    | replication_group_member_stats            |
    | replication_group_members                 | <------------------------
    +-------------------------------------------+

   (dba@node1) [performance_schema]>select * from replication_group_members\G
CHANNEL_NAME: group_replication_applier
     MEMBER_ID: 
   MEMBER_HOST: 
   MEMBER_PORT: 
  MEMBER_STATE: 
   MEMBER_ROLE: OFFLINE
MEMBER_VERSION: 
1 row in set (0.00 sec)

At this stage, you should be able to start the first (Primary) cluster node.

Only on GR1:

(dba@node1)[none]> SET GLOBAL group_replication_bootstrap_group=ON;
(dba@node1)[none]> START GROUP_REPLICATION;
(dba@node1)[none]> SET GLOBAL group_replication_bootstrap_group=OFF;

And then check it:

(dba@node1) [none]>select * from performance_schema.replication_group_members\G
     CHANNEL_NAME: group_replication_applier
     MEMBER_ID: 90a353b8-e6dc-11ea-98fa-08002734ed50
   MEMBER_HOST: gr1
   MEMBER_PORT: 3306
  MEMBER_STATE: ONLINE
   MEMBER_ROLE: PRIMARY
MEMBER_VERSION: 8.0.20

Once the Primary is running, connect on the second node GR2 and start Group replication:

(dba@node2) [none]>START GROUP_REPLICATION;
Query OK, 0 rows affected (4.60 sec)

Check if it registered correctly:

    (dba@node2) [performance_schema]>select * from replication_group_members\G
*************************** 1. row ***************************
  CHANNEL_NAME: group_replication_applier
     MEMBER_ID: 58ffd118-e6dc-11ea-8af8-08002734ed50
   MEMBER_HOST: gr2
   MEMBER_PORT: 3306
  MEMBER_STATE: ONLINE
   MEMBER_ROLE: SECONDARY
MEMBER_VERSION: 8.0.20
*************************** 2. row ***************************
  CHANNEL_NAME: group_replication_applier
     MEMBER_ID: 90a353b8-e6dc-11ea-98fa-08002734ed50
   MEMBER_HOST: gr1
   MEMBER_PORT: 3306
  MEMBER_STATE: ONLINE
   MEMBER_ROLE: PRIMARY
MEMBER_VERSION: 8.0.20

Test if replication works:

On GR1

(dba@node1) [performance_schema]>create schema test;
Query OK, 1 row affected (0.76 sec)

(dba@node1) [performance_schema]>\u test
Database changed

(dba@node1) [test]>create table test1 (`id` int auto_increment primary key);
Query OK, 0 rows affected (0.32 sec)

(dba@node1) [test]>insert into test1 values(null);
Query OK, 1 row affected (0.34 sec)

On GR2

 (dba@node2) [performance_schema]>use \test
 Database changed
 (dba@node2) [test]>select * from test1;
 +----+
 | id |
 +----+
 |  1 |
 +----+
 1 row in set (0.00 sec)

Start group replication of the other two nodes GR3 and GR4:

(dba@node3) [performance_schema]>START GROUP_REPLICATION;
(dba@node4) [performance_schema]>START GROUP_REPLICATION;

Proxy Setup

Step 1

In our solution we will use two ProxySQL nodes:

Proxy1 192.168.4.191
Proxy2 192.168.4.192

First, you need to install ProxySQL on the nodes you have selected, in our case the two above.

To install the software follow the instructions in How to Install ProxySQL From the Percona Repository. Once you have installed the software, we first need to grant access to the ProxySQL monitor user to our Percona Server for MySQL nodes.

Create monitor user in MySQL group replication nodes:

Create monitor user in MySQL group replication nodes:
create user monitor@'192.168.4.%' identified by 'monitor';
grant usage on *.* to 'monitor'@'192.168.4.%';
grant select on sys.* to 'monitor'@'192.168.4.%';

Then define some basic variables:

update global_variables set Variable_Value='admin:admin;cluster1:clusterpass'  where Variable_name='admin-admin_credentials';
update global_variables set variable_value='cluster1' where variable_name='admin-cluster_username';
update global_variables set variable_value='clusterpass' where variable_name='admin-cluster_password';
update global_variables set Variable_Value=0  where Variable_name='mysql-hostgroup_manager_verbose';
update global_variables set Variable_Value='true'  where Variable_name='mysql-query_digests_normalize_digest_text';
update global_variables set Variable_Value='8.0.20'  where Variable_name='mysql-server_version';
update global_variables set Variable_Value='utf8'  where Variable_name='mysql-default_charset';
update global_variables set Variable_Value=300  where Variable_name='mysql-tcp_keepalive_time';
update global_variables set Variable_Value='true'  where Variable_name='mysql-use_tcp_keepalive';
update global_variables set Variable_Value='true'  where Variable_name='mysql-verbose_query_error';
update global_variables set Variable_Value='true'  where Variable_name='mysql-show_processlist_extended';
update global_variables set Variable_Value=50000  where Variable_name='mysql-max_stmts_cache';
update global_variables set Variable_Value='false'  where Variable_name='admin-web_enabled';
update global_variables set Variable_Value='0'  where Variable_name='mysql-set_query_lock_on_hostgroup';

load admin variables to run;save admin variables to disk;
load mysql variables to run;save mysql variables to disk;

The user name and password need to reflect your standards. The ones used above are just an example. Then set up the nodes as a cluster:

INSERT INTO proxysql_servers (hostname,port,weight,comment) VALUES('192.168.4.191',6032,100,'PRIMARY');
INSERT INTO proxysql_servers (hostname,port,weight,comment) VALUES('192.168.4.192',6032,100,'SECONDARY');
load proxysql servers to run;save proxysql servers to disk;

Step 2

Define user(s), servers, and query rules to perform read/write split. Create one or more valid user(s), for instance, if you have a user named app_gr with the password test, that has access to your group replication cluster:

insert into mysql_users (username,password,active,default_hostgroup,default_schema,transaction_persistent,comment) values ('app_gr','test',1,400,'mysql',1,'application test user GR');
LOAD MYSQL USERS TO RUNTIME;SAVE MYSQL USERS TO DISK;

Define servers:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.81',400,3306,10000,2000,'GR1');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.81',401,3306,100,2000,'GR1');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.82',401,3306,10000,2000,'GR2');    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.83',401,3306,10000,2000,'GR2');        
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.84',401,3306,1,2000,'GR2');        
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;

Define query rules to get read-write split:

INSERT INTO mysql_query_rules (rule_id,proxy_port,username,destination_hostgroup,active,retries,match_digest,apply) values(4040,6033,'app_gr',400,1,3,'^SELECT.*FOR UPDATE',1);
INSERT INTO mysql_query_rules (rule_id,proxy_port,username,destination_hostgroup,active,retries,match_digest,multiplex,apply) values(4042,6033,'app_gr',401,1,3,'^SELECT.*$',1,1);
LOAD MYSQL QUERY RULES TO RUN;SAVE MYSQL QUERY RULES TO DISK;

Step 3

Once we have all the configuration ready, we need to have a special view in the SYS schema in our Percona server nodes. The view working for the server version 8 and above can be found here (https://github.com/Percona-Lab/group_replication_tools/blob/master/GR_sys_view_forProxysql_v1.sql)

Run that sql on the PRIMARY node of the Group Replication cluster.

Step 4

Now we are ready to activate the native support for Group Replication in ProxySQL. We will use the following group definition:

Writer HG-> 400
Reader HG-> 401
BackupW HG-> 402
Offline HG-> 9401

INSERT INTO mysql_group_replication_hostgroups (writer_hostgroup,backup_writer_hostgroup,reader_hostgroup, offline_hostgroup,active,max_writers,writer_is_also_reader,max_transactions_behind) 
values (400,402,401,9401,1,1,1,100);
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;

Few comments here about the parameters (for full reference see here https://proxysql.com/documentation/main-runtime#mysql_group_replication_hostgroups ). We recommend setting the number of writers always to 1, and witer_is_also_reader to 1 as well to obtain the most reliable results.

max_writers: 1
writer_is_also_reader: 1

The max_transactions_behind is a subjective parameter that you should calculate on the basis of your needs. If for instance you cannot have a stale read, it will be safe to set this value to a low number (ie 50) and to set in all Group replication nodes:

set global group_replication_consistency=AFTER;

If instead, you have no issue or strict requirements about some stale read, you can relax the parameter and ignore the group_replication_consistency setting. Our recommended setting is group_replication_consistency=AFTER and max_transactions_behind: 100.

Proxy HA

The final step is to enable High Availability for the ProxySQL layer. In this approach, we will use the well-known keepalived service. First, install the keepalived software using yum or apt-get on each ProxySQL node:

Sudo yum install -y keepalived
Or
Sudo apt-get install -y keepalived

Then modify the /etc/keepalived/keepalived.conf file accordingly to your setup. In our case:

Proxy1 192.168.4.0/24 dev enp0s9 proto kernel scope link src 192.168.4.191
Proxy2 192.168.4.0/24 dev enp0s9 proto kernel scope link src 192.168.4.192
VIP 192.168.4.194

We want to have the primary to be Proxy1 and the failover node to be Proxy2. Given that the config will look like:

cat /etc/keepalived/keepalived.conf 
global_defs {
  # Keepalived process identifier
  router_id  proxy_HA
}
# Script used to check if Proxy is running
vrrp_script check_proxy {
  script "killall -0 proxysql"
  interval 2
  weight 2
}
# Virtual interface
# The priority specifies the order in which the assigned interface to take over in a failover
vrrp_instance VI_01 {
  state MASTER 
  interface enp0s9
  virtual_router_id 51
  priority 100  <----- This needs to be different for each ProxySQL node, like 100/99 
 
  # The virtual ip address shared between the two load balancers
  virtual_ipaddress {
   192.168.4.194  dev enp0s9
  }
  track_script {
    check_proxy
  }
}

Once done, start the keepalived service, and from now on the VIP will be associated with the Proxy1 unless service is down.

In the system log:

proxysql1 Keepalived_vrrp[17422]: VRRP sockpool: [ifindex(4), proto(112), unicast(0), fd(10,11)]
proxysql1 Keepalived_vrrp[17422]: VRRP_Script(check_proxy) succeeded
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Transition to MASTER STATE
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Received advert with higher priority 101, ours 100
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Entering BACKUP STATE
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Changing effective priority from 100 to 102
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) forcing a new MASTER election
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Transition to MASTER STATE
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Entering MASTER STATE
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) setting protocol VIPs.
proxysql1 Keepalived_vrrp[17422]: Sending gratuitous ARP on enp0s9 for 192.168.4.194
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Sending/queueing gratuitous RPs on enp0s9 for 192.168.4.194
proxysql1 Keepalived_vrrp[17422]: Sending gratuitous ARP on enp0s9 for 192.168.4.194
..
proxysql1 Keepalived_vrrp[17422]: Sending gratuitous ARP on enp0s9 for 192.168.4.194
proxysql1 avahi-daemon[989]: Registering new address record for 192.168.4.194 on enp0s9.IPv4.

Disaster Recovery Implementation

The implementation of a DR (Disaster Recovery) site will follow the same direction provided for the main site. There are only some generic rules that should be followed:

A DR site should be located in a different geographic location than the main site (several hundred kilometers/miles away).
The connection link between the main site and the DR site can only be established using asynchronous replication (standard MySQL replication setup ).

Monitoring

There are few ways to monitor a Group Replication cluster. The easiest way is to have Percona Monitoring and Management (Version 2.10 or later) deployed to do it for you. For an easy installation of Percona Monitoring and Management check out this quickstart.

Percona Monitoring and Management

The only important thing to remember is that when registering the Percona Server for MySQL node or the MySQL node, you should specify the replication_set flag.

Ie:  pmm-admin add mysql --username=pmm --password=pmm --query-source=perfschema --replication-set=gr_test_lab  group_rep4 127.0.0.1:3306

Then you can use the Group replication Dashboard and monitor your cluster with a lot of details.

The sections are:

Overview(3 panels)

Replication Delay Details(3 panels)

Transactions(8 panels)

Conflicts

From Command Line

From the command line you need to manually query the tables in Performance schema:

+----------------------------------------------+
| replication_applier_configuration            |
| replication_applier_filters                  |
| replication_applier_global_filters           |
| replication_applier_status                   |
| replication_applier_status_by_coordinator    |
| replication_applier_status_by_worker         |
| replication_connection_configuration         |
| replication_connection_status                |
| replication_group_member_stats               |
| replication_group_members                    |
+----------------------------------------------+

For instance, to get the lag in the number of transactions on a node:

select @last_exec:=SUBSTRING_INDEX(SUBSTRING_INDEX( @@global.GTID_EXECUTED,':',-1),'-',-1) last_executed;select  @last_rec:=SUBSTRING_INDEX(SUBSTRING_INDEX(Received_transaction_set,':',-1),'-',-1) last_received FROM performance_schema.replication_connection_status WHERE Channel_name = 'group_replication_applier'; select (@last_rec - @last_exec) as real_lag;
+---------------+
| last_executed |
+---------------+
| 125624        |
+---------------+
1 row in set, 1 warning (0.03 sec)

+---------------+
| last_received |
+---------------+
| 125624        |
+---------------+
1 row in set, 1 warning (0.00 sec)

+----------+
| real_lag |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

Or use a more composite query:

SELECT
  conn_status.channel_name as channel_name,
  conn_status.service_state as IO_thread,
  applier_status.service_state as SQL_thread,
  conn_status.LAST_QUEUED_TRANSACTION as last_queued_transaction,
  applier_status.LAST_APPLIED_TRANSACTION as last_applied_transaction,
  LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP - 
                            LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 'rep delay (sec)',
  LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP - 
                           LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 'transport time', 
  LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP - 
                           LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP 'time RL',
  LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP - 
                           LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP 'apply time',  
  if(GTID_SUBTRACT(LAST_QUEUED_TRANSACTION, LAST_APPLIED_TRANSACTION) = "","0" ,        abs(time_to_sec(if(time_to_sec(APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP)=0,0,timediff(APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,now()))))) `lag_in_sec` 
FROM
  performance_schema.replication_connection_status AS conn_status
JOIN performance_schema.replication_applier_status_by_worker AS applier_status
  ON applier_status.channel_name = conn_status.channel_name
ORDER BY lag_in_sec, lag_in_sec desc\G

Which will provide information about each applier:

*************************** 1. row ***************************
channel_name: group_replication_applier
IO_thread: ON
SQL_thread: ON
last_queued_transaction: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:125624
last_applied_transaction: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:125621
rep delay (sec): 3.153038
transport time: 0.061327
time RL: 0.001005
apply time: 0.388680
lag_in_sec: 0

As you can see, Percona Monitoring and Management will give you a better view without compromising the details.

Conclusions

Using these steps and recommendations, you can set up database infrastructure with high availability based on group replication and use Percona Monitoring and Managemen to monitor the infrastructure’s performance and health.

Keep in mind that we are constantly working on making our recommendations better. As such, what is illustrated here is subject to changes and revision especially on the basis of the increasing adoption of Group Replication. This is because the more GR is used the more edge cases or deviation we will identify. Those are a significant help for us to refine our best practices.

Who is drop-in replacement of

From MySQL to Percona or MariaDB ...and back

As we all know MariaDB starts as a fork of MySQL and then slowly diverges until becoming a different product.

Still I often catch at conferences that MariaDB is a drop-in replacement for MySQL (https://en.wikipedia.org/wiki/Drop-in_replacement). This for me is a bold statement given drop-in-replacement: “It refers to the ability to replace one hardware (or software) component with another one without any other code or configuration changes being required and resulting in no negative impacts”. Which by inheritance also means I can go backwards.

In short if MariaDB is a real drop-in replacement, we should be able to replace the MySQL binaries with the ones coming from MariaDB and then roll back without any issue.

This short article is the result of my notorious disbelieving about any kind of bold marketing statements.

To be clear I am NOT going to compare the functionalities of the different products, I just want to see if I can replace one with the other.

The tests

What and how

For the tests I will use the latest version of:

MySQL 5.7
Percona Server 5.7
MariaDb 10.3

And for the newest:

MySQL 8.0.22 (8.0.23 is out but PS is not yet, so we cannot compare)
Percona Server 8.0.22
MariaDB 10.5

I will have a source pointing to a symbolic link named /opt/mysql_templates/magic then I will replace the target of the link pointing to the different binaries.

What I will do is simple:

Point to MySQL
Create a new instance
Create the world schema and load data (using Innodb)
Select count(*) from world.City;
SET GLOBAL innodb_fast_shutdown=0;
Stop instance
Point link to Percona
Start instance
Select count(*) from world.City;
Drop world
Repeat all steps from #3 and point to MariaDB
Once the MariaDB test is done, point to MySQL and repeat.

Will run the same tests for the 5.7/10.3 series and for the 8.0.22/10.5.

The expectation to be drop-in-replacement is to be able to move from MySQL to Percona Server to MariaDB and back to MySQL. Anything diverging will prove we are NOT dealing with drop-in.

Version 5.7/10.3

MySQL

/opt/mysql_templates/magic/bin/mysql  Ver 14.14 Distrib 5.7.33, for linux-glibc2.12 (x86_64) using  EditLine wrapper
Connection id:		4
Server version:		5.7.33-log MySQL Community Server (GPL)
UNIX socket:		/opt/mysql_instances/magic/mysql.sock
Uptime:			4 sec

+----------+
| count(*) |
+----------+
|     4079 |
+----------+
1 row in set (0.01 sec)

Shift to Percona:

/opt/mysql_templates/magic/bin/mysql  Ver 14.14 Distrib 5.7.33-36, for Linux (x86_64) using  6.0

Connection id:		8
Server version:		5.7.33-36-log Percona Server (GPL), Release 36, Revision 7e403c5
UNIX socket:		/opt/mysql_instances/magic/mysql.sock
Uptime:			5 min 38 sec

+----------+
| count(*) |
+----------+
|     4079 |
+----------+
1 row in set (0.00 sec)

No errors at all in the log

Let us go backwards to MySQL

(root@localhost) [world]>\s
--------------
/opt/mysql_templates/magic/bin/mysql  Ver 14.14 Distrib 5.7.33, for linux-glibc2.12 (x86_64) using  EditLine wrapper

Connection id:		7
Server version:		5.7.33-log MySQL Community Server (GPL)
UNIX socket:		/opt/mysql_instances/magic/mysql.sock
Uptime:			5 min 33 sec

Database changed
+----------+
| count(*) |
+----------+
|     4079 |
+----------+
1 row in set (0.00 sec)

No issue at all

Let us try to move to MariaDB 10.3:

So at first attempt it failed, and had to modify my.cnf removing GTID config as Performance schema (see reference section about that).

Once done:

2021-03-07 10:59:22 0 [ERROR] Missing system table mysql.roles_mapping; please run mysql_upgrade to create it
2021-03-07 10:59:22 0 [ERROR] Incorrect definition of table mysql.event: expected column 'sql_mode' at position 14 to have type set(...), found type set(...)
2021-03-07 10:59:22 0 [ERROR] mysqld: Event Scheduler: An error occurred when initializing system tables. Disabling the Event Scheduler.

Working but needing to run innodb_update, once done, I had a lot of errors related to the Performance schema, and still InnoDB issues:

/opt/mysql_templates/magic/bin/mysql  Ver 15.1 Distrib 10.3.28-MariaDB, for Linux (x86_64) using readline 5.1

Connection id:		76
Server:			MariaDB
Server version:		10.3.28-MariaDB-log MariaDB Server
UNIX socket:		/opt/mysql_instances/magic/mysql.sock
Uptime:			1 min 17 sec

(root@localhost) [(none)]>select count(*) from world.City;
+----------+
| count(*) |
+----------+
|     4079 |
+----------+
1 row in set (0.003 sec)


2021-03-07 11:03:37 229 [ERROR] InnoDB: Column last_update in table `mysql`.`innodb_table_stats` is BINARY(4) NOT NULL but should be INT UNSIGNED NOT NULL (flags mismatch).
2021-03-07 11:03:37 229 [ERROR] InnoDB: Fetch of persistent statistics requested for table `mysql`.`gtid_executed` but the required system tables mysql.innodb_table_stats and mysql.innodb_index_stats are not present or have unexpected structure. Using transient stats instead.
2021-03-07 11:03:38 232 [ERROR] Column count of mysql.proc is wrong. Expected 21, found 20. Created with MariaDB 50733, now running 100328. Please use mysql_upgrade to fix this error
2021-03-07 11:03:38 232 [Note] View `sys`.`x$statements_with_errors_or_warnings`: the version is set to 100328, algorithm restored to be MERGE
2021-03-07 11:03:38 233 [ERROR] InnoDB: Column last_update in table `mysql`.`innodb_table_stats` is BINARY(4) NOT NULL but should be INT UNSIGNED NOT NULL (flags mismatch).
2021-03-07 11:03:38 233 [ERROR] InnoDB: Column last_update in table `mysql`.`innodb_table_stats` is BINARY(4) NOT NULL but should be INT UNSIGNED NOT NULL (flags mismatch).
2021-03-07 11:03:38 233 [ERROR] InnoDB: Column last_update in table `mysql`.`innodb_table_stats` is BINARY(4) NOT NULL but should be INT UNSIGNED NOT NULL (flags mismatch).

Let us go back to MySQL now:

2021-03-07T16:27:01.517996Z 0 [ERROR] Native table 'performance_schema'.'session_variables' has the wrong structure
2021-03-07T16:27:01.518307Z 0 [ERROR] Incorrect definition of table mysql.db: expected column 'User' at position 2 to have type char(32), found type char(80).
2021-03-07T16:27:01.518428Z 0 [ERROR] mysql.user has no `Event_priv` column at position 28
2021-03-07T16:27:01.518697Z 0 [ERROR] Incorrect definition of table mysql.event: expected column 'sql_mode' at position 14 to have type set(...), found type set(...)
2021-03-07T16:27:01.518829Z 0 [ERROR] Event Scheduler: An error occurred when initializing system tables. Disabling the Event Scheduler.
2021-03-07T16:27:01.520065Z 0 [Note] /opt/mysql_templates/magic/bin/mysqld: ready for connections.
Version: '5.7.33-log'  socket: '/opt/mysql_instances/magic/mysql.sock'  port: 3306  MySQL Community Server (GPL)

Try to run mysql_upgrade

mysql@gr1 magic]$ /opt/mysql_templates/magic/bin/mysql_upgrade --defaults-file=./my.cnf -uroot --force
mysql_upgrade: Got error: 1524: Plugin '0' is not loaded while connecting to the MySQL server
Upgrade process encountered error and will not continue.

Only way to access the data at this point is to use skip-grant-tables. But that is unsustainable and in any case I continue to get errors in the log.

Summarizing

Once I have an instance built with MySQL 5.7, I can easily shift to Percona Server 5.7, and eventually go back. I can migrate to MariaDB, but this implies configuration changes and must run mysql_update to modify the core system tables. Finally cannot rollback to MySQL, modifications done by MariaDB are not allowing it.

Version 8/10.5

MySQL

/opt/mysql_templates/magic8/bin/mysql  Ver 8.0.22 for Linux on x86_64 (MySQL Community Server - GPL)

Connection id:		19
Server version:		8.0.22 MySQL Community Server - GPL
UNIX socket:		/opt/mysql_instances/magic8/mysql.sock
Binary data as:		Hexadecimal
Uptime:			22 sec

(root@localhost) [world]>select count(*) from world.City;
+----------+
| count(*) |
+----------+
|     4079 |
+----------+
1 row in set (0.01 sec)

Move to Percona Server

/opt/mysql_templates/magic8/bin/mysql  Ver 8.0.22-13 for Linux on x86_64 (Percona Server (GPL), Release 13, Revision 6f7822f)

Connection id:		23
Server version:		8.0.22-13 Percona Server (GPL), Release 13, Revision 6f7822f
UNIX socket:		/opt/mysql_instances/magic8/mysql.sock
Binary data as:		Hexadecimal
Uptime:			10 min 45 sec

+----------+
| count(*) |
+----------+
|     4079 |
+----------+
1 row in set (0.00 sec)

No error of any type and no changes as for 5.7

Let us rollback to MySQL

/opt/mysql_templates/magic8/bin/mysql  Ver 8.0.22 for Linux on x86_64 (MySQL Community Server - GPL)

Connection id:		18
Server version:		8.0.22 MySQL Community Server - GPL
UNIX socket:		/opt/mysql_instances/magic8/mysql.sock
Binary data as:		Hexadecimal
Uptime:			19 sec

+----------+
| count(*) |
+----------+
|     4079 |
+----------+
1 row in set (0.00 sec)

Perfect and no issue.

Let us now try MariaDB

210308 09:13:34 mysqld_safe Starting mariadbd daemon with databases from /opt/mysql_instances/magic8/data
2021-03-08  9:13:35 0 [ERROR] /opt/mysql_templates/magic8/bin/mariadbd: Error while setting value 'index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on' to 'optimizer_switch'
210308 09:13:35 mysqld_safe mysqld from pid file /opt/mysql_instances/magic8/mysqld.pid ended

Ok let see if I remove the optimizer settings:

2021-03-08  9:14:53 0 [ERROR] InnoDB: Invalid flags 0x4800 in /opt/mysql_instances/magic8/data/ibdata1
2021-03-08  9:14:53 0 [ERROR] InnoDB: Plugin initialization aborted with error Data structure corruption
2021-03-08  9:14:53 0 [Note] InnoDB: Starting shutdown...
2021-03-08  9:14:53 0 [ERROR] Plugin 'InnoDB' init function returned error.
2021-03-08  9:14:53 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2021-03-08  9:14:53 0 [Note] Plugin 'FEEDBACK' is disabled.
2021-03-08  9:14:53 0 [ERROR] Could not open mysql.plugin table: "Table 'mysql.plugin' doesn't exist". Some plugins may be not loaded
2021-03-08  9:14:53 0 [ERROR] /opt/mysql_templates/magic8/bin/mariadbd: unknown variable 'auto-generate-certs=TRUE'
2021-03-08  9:14:53 0 [ERROR] Aborting
210308 09:14:53 mysqld_safe mysqld from pid file /opt/mysql_instances/magic8/mysqld.pid ended

No way to have it working unless a logical dump

Summarizing

As for MySQL 5.7 once I have the instance built with MySQL 8.0.22 I can move to Percona 8.0.22 and roll back with no issues. Moving to MariaDB is not possible at all unless a full LOGICAL dump, which I really want to see how long it could take when you have TeraBytes of data.

Another weirdo

While I was working with MariaDB I also review the documentation and my eyes were capture by this (https://mariadb.com/docs/reference/es/system-variables/innodb_purge_threads/)

“WoW”, I thought, “DYNAMIC settings for innodb_purge_threads this is cool”. Let me try it.

/opt/mysql_templates/mariadb-10.5.9-linux-systemd-x86_64/bin/mysql  Ver 15.1 Distrib 10.5.9-MariaDB, for Linux (x86_64) using readline 5.1

Connection id:		7
Server:			MariaDB
Server version:		10.5.9-MariaDB MariaDB Server
UNIX socket:		/opt/mysql_instances/maria105/mysql.sock
Uptime:			8 min 14 sec


(root@localhost:pm) [(none)]>set global innodb_purge_threads=10;
ERROR 1238 (HY000): Variable 'innodb_purge_threads' is a read only variable
(root@localhost:pm) [(none)]>

Whaaat??

Then I realized I was on the page of “Enterprise Documentation”. Checking with the community version https://mariadb.com/kb/en/innodb-system-variables/#innodb_purge_threads innodb_purge_threads is still NOT dynamic, and after checking even more I also found Innodb_log_file_size, same story.

maria slide

I was very badly impressed by this, for several reasons but want to mention the two most important:

MariaDB is declaring over and over to be the one truly open source and community oriented. But where is the support for the community here?
Is InnoDB not owned by Oracle? Isn't MariaDB using it because Oracle released it under the GPLv2? So why are they modifying the code and not returning it to the open source community? I am not an expert in legal things, but that sounds to me an infringement of the license.

Conclusions

Drop-in replacement has a very specific meaning, and it must be used with caution. It also brings several strings attached, one of these is that replacing binaries is not a one way only solution. Attaching the term limited to drop-in as in MariaDB documentation, it is not correct, it still evokes an inapplicable concept and can be seen as misleading advertising.

As we can see the only real drop-in replacement for MySQL is Percona Server, MariaDB is not even close to it, too many changes in the configuration file, and of course the need to go for a logical dump is not even remotely the right way to go.

In short MariaDB is obviously a different product, as already stated by many in different articles, that is diverging more and more.

The only reason for which MariaDB continues to play the drop-in game with MySQL, for me, is because they continue to use the traction MySQL has on the community and the market, to attract customers. They also absorb from the work done by Oracle and Percona but as shown with innodb_purge_threads/innodb_log_file_size, they do not hesitate to keep significant features only for Enterprise version, without sharing with the community.

As said many times, MariaDB has huge minds in development, starting from Monty down to any level of developer/DBA. We must recognize and respect that, as well we must recognize the great work they do at technical level.

I would love to see more collaboration, but I also understand the need to be something different to survive as a company.

What I cannot accept is when we have claims that are not real like the drop-in replacement (also if limited), or even worse the case of innodb_purge_threads/innodb_log_file_size.

That is wrong and as a strong advocate of open source and a lover of the MySQL/MariaDB community I feel I need to voice my concern.

Great MySQL to all

Reference

https://www.percona.com/resources/webinars/differences-between-mariadb%C2%AE-and-mysql%C2%AE

https://mariadb.com/docs/reference/es/system-variables/innodb_purge_threads/

https://mariadb.com/kb/en/innodb-system-variables/#innodb_purge_threads

https://mariadb.com/kb/en/mariadb-vs-mysql-compatibility/

Full read consistency within Percona Operator for MySQL

Overview

Percona operator for MySQL (POM) :(https://www.percona.com/doc/kubernetes-operator-for-pxc/index.html) it’s aim is a special type of controller introduced to simplify complex deployments. The Operator extends the Kubernetes API with custom resources.

The Percona Operator for MySQL solution is using Percona Xtradb Cluster behind the hood to provide a highly available, resilient and scalable MySQL service in the Kubernetes space.

This solution comes with all the advantages/disadvantages provided by Kubernetes, plus with some advantages of its own like the capacity to scale reads on the nodes that are not Primary.

Of course there are some limitations like the way PXC handle DDLs, which may impact the service, but there is always a cost to pay to get something, expecting to have all for free is unreasonable.

In this context we need to talk and cover what is full read consistency in this solution and why it is important to understand the role it plays.

Stale Reads

When using Kubernetes we should talk about the service and not about the technology/product used to deliver such service.

In our case the Percona operator is there to deliver a MySQL service. We should then see that as a whole as a single object. To be more clear what we must consider is NOT the fact we have a cluster behind the service but that we have a service that to be resilient and highly available use a cluster.

We should not care If a node/pod goes down unless the service is discontinued.

What we have as a plus in the Percona operator for MySQL solution is a certain level of READ scalability. This achieved optimising the use of the non PRIMARY nodes, and instead having them sitting there applying only replicated data, the Percona Operator provides access to them to scale the reads.

But… there is always a BUT

Let us start with an image:

Screen Shot 2019 10 13 at 32714 PM

(https://www.slideshare.net/lefred.descamps/galera-replication-demystified-how-does-it-work) from Fred Descamps)

By design the apply and commit finalize in Galera (PXC) may have (and has) a delay between nodes.

This means that, if using defaults, applications may have inconsistent reads if trying to access the data from different nodes than the Primary.

POM provides access using two different solutions:

Using HAProxy (default)
Using ProxySQL

When using HAProxy you will have 2 entry points:

cluster1-haproxy, which will point to the Primary ONLY, for reads and writes. This is the default entry point for the applications to the MySQL database.
cluster1-haproxy-replicas, which will point to all the 3 nodes and is supposed to be used for READS only. This is the PLUS you can use if your application has READ/WRITE separation.

Please note that at the moment there is nothing preventing application to use the cluster1-haproxy-replicas also for write, but that is dangerous and wrong because will generate a lot of certification conflicts and BF abort given it will distribute writes all over the cluster impacting on performance as well (and not giving you any write scaling):

[marcotusa@instance-1 ~]$ for i in `seq 1 100`; do mysql -h cluster1-haproxy-replicas -e "insert into test.iamwritingto values(null,@@hostname)";done
+----------------+-------------+
| host           | count(host) |
+----------------+-------------+
| cluster1-pxc-1 |          34 |
| cluster1-pxc-2 |          33 |
| cluster1-pxc-0 |          33 |
+----------------+-------------+

When using ProxySQL the entry point is a single one, but you may define query rules to automatically split the R/W requests coming from the application.

This is the preferred method when application has no way to separate the READS from the writes.

I have done a comparison of the two methods in POM here

Now, as mentioned above, by default PXC (any Galera base solution) comes with some relaxed settings, for performance purpose. This is normally fine in many standard cases, but if you use POM and use the PLUS of scaling reads using the second access point with HAproxy or Query Rules with Proxysql, you should NOT have stale reads, given the service must provide consistent data, as if you are acting on a single node.

To achieve that you can change the defaults and change the parameter in PXC wsrep_sync_wait.

When changing the parameter wsrep_sync_wait as explained in the documentation the node initiates a causality check, blocking incoming queries while it catches up with the cluster.

Once all data on the node receiving the READ request is commit_finalized, the node performs the read.

But this has a performance impact as said before.

What is the impact?

To test the performance impact I had used a cluster deployed in GKE, with this characteristics:

3 Main nodes n2-standard-8 (8 vCPUs, 32 GB memory)
1 App node n2-standard-8 (8 vCPUs, 32 GB memory)
PXC pods using:
- 25GB of the 32 available
- 6 CPU of the 8 available
HAProxy:
- 600m CPU
- 1GB RAM
PMM agent
- 500m CPU
- 500 MB Ram

On The application node I used sysbench running two instances, one in r/w mode the other only reads. Finally to test stale read I used the stale read test from my test suite ( https://github.com/Tusamarco/testsuite) .

Given I was looking for results with moderate load I just used 68/96/128 threads per sysbench instance.

Results

Marco, did we have or not stale reads? Yes we did:

stale reads moderate load

I had from 0 (with very light load) up to 37% stale reads with MODERATED load. Where moderated was the 128 threads sysbench running.

Setting wsrep_sync_wait=3 of course I had full consistency.
But I had performance loss:

performance loss reads

As you can see I had an average loss of 11% in case of READS

performance loss writes

While for writes the average loss was the 16%.

Conclusions

At this point we need to stop and think about what is worth doing. If my application is READs heavy and READs scaling, it is probably worth enabling the full synchronicity given scaling on the additional node allows me to have a 2x or more READs.

If instead my application is write critical, probably losing also ~16% performance is not good.

Finally if my application is stale reads tolerant, I will just go with the defaults and get all the benefits without penalties.

Also keep in mind that POM is designed to offer a MySQL service so the state of the single node is not as critical as if you are using a default PXC installation, PODs are by nature ephemeral objects while service is resilient.

References

https://www.percona.com/doc/kubernetes-operator-for-pxc/index.html

https://github.com/Tusamarco/testsuite

https://en.wikipedia.org/wiki/Isolation_(database_systems)#Dirty_reads

https://galeracluster.com/library/documentation/mysql-wsrep-options.html#wsrep-sync-wait

https://www.slideshare.net/lefred.descamps/galera-replication-demystified-how-does-it-work

Percona Operator for MySQL (HAProxy or ProxySQL?)

Overview

Percona Operator for MySQL (POM) comes with two different proxies, HAProxy and ProxySQL. While the initial version was based on ProxySQL, in time Percona opted to set HAProxy as the default Proxy for the operator, this without removing ProxySQL.

While one of the main points was to guarantee users to have a 1:1 compatibility with vanilla MySQL in the way the operator allows connections. There are also other factors that are involved in the decision to have two proxies. In this article I will scratch the surface of this why.

Operator assumptions

When working with the Percona operator for MySQL, there are few things to keep in mind:

Each deployment has to be seen as a single MySQL service as if a single MySQL instance
The technology used to provide the service may change in time
Pod resiliency is not guaranteed, Service resiliency is.
Resources to be allocated are not automatically calculated and must be identified at the moment of the deployment
In Production you cannot set more than 5 or less than 3 nodes when using PXC

There are two very important points in the list above.

The first one is that what you get IS NOT a PXC cluster, but a MySQL service. The fact that Percona at the moment uses PXC to cover the service is purely accidental and we may decide to change it anytime.

The other point is that Service is resilient the pod is not. In short you should expect to see pods stopping to work and being re-created. What should NOT happen is that service goes down. Trying to debug each minor issue per node/pod is not what is expected when you use kubernetes.

Given the above, review your expectations… and let us go ahead.

The plus in the game (read scaling)

As said, what is offered with POM is a mysql service. Percona has added a proxy on top of the nodes/pods that help the service to respect the resiliency service expectations. There are two possible deployments:

HAProxy
ProxySQL

Both allow to optimise one aspect of POM, which is read scaling.
Infact what we were thinking was, given we must use a (virtually synchronous) cluster, why not take advantage of that and allow reads to scale on the other nodes when available?

This approach will help all the ones using POM to have the standard MySQL service but with a plus.

But, with it also come some possible issues like READ/WRITE splitting and stale reads. About stale reads see this article on how to deal with it (https://docs.google.com/document/d/1NyqcEKxfhgD1tDRPbiY0bENdp953DzkBfAoMi22u3J4/edit)

For R/W splitting we instead have a totally different approach in respect to what kind of proxy we implement.

If using HAProxy, we offer a second entry point that can be used for READ operation. That entrypoint will balance the load on all the nodes available.

If instead ProxySQL is in use it is possible to implement automatic R/W splitting.

Global difference and comparison

At this point it is useful to have a better understanding of the functional difference between the two proxies and what is the performance difference if any.

As we know HAProxy acts as a level 4 proxy when operating in TCP mode, it also is a forward-proxy, which means each TCP connection is established with the client with the final target and there is no interpretation of the data-flow.

ProxySQL on the other hand is a level 7 proxy and is a reverse-proxy, this means the client establishes a connection to the proxy who presents itself as the final backend. Data can be altered on the fly when it is in transit.

To be honest, it is more complicated than that but allows me the simplification.

On top of that there are additional functionalities that are present in one (ProxySQL) and not in the other. The point is if they are relevant for the use in this context or not. For a short list see below (source is from ProxySQL blog but data was removed)

proxySQL HAProxy feature comparison

As you may have noticed HAProxy is lacking some of that functionalities, like R/W split, firewalling and caching, proper of the level 7 implemented in ProxySQL.

The test environment

To test the performance impact I had used a cluster deployed in GKE, with this characteristics:

3 Main nodes n2-standard-8 (8 vCPUs, 32 GB memory)
1 App node n2-standard-8 (8 vCPUs, 32 GB memory)
PXC pods using:
- 25GB of the 32 available
- 6 CPU of the 8 available
HAProxy:
- 600m CPU
- 1GB RAM
PMM agent
- 500m CPU
- 500 MB Ram
Tests using sysbench as for (https://github.com/Tusamarco/sysbench)

What I have done is to run several tests running two Sysbench instances. One only executing reads, while the other reads and writes.

In the case of ProxySQL I had R/W splitting thanks to the Query rules, so both sysbench instances were pointing to the same address. While testing HAProxy I was using two entry points:

Cluster1-haproxy – for read and write
Cluster1-haproxy-replicas – for read only

Then I also compare what happens if all requests hit one node only. For that I execute one Sysbench in R/W mode against one entry point, and NO R/W split for ProxySQL.

Finally sysbench tests were executed with the –reconnect option to force the tests to establish new connections.

As usual tests were executed multiple times, on different days of the week and moment of the day. Data reported is a consolidation of that, and images from PMM are samples coming from the execution that was closest to the average values.

Comparing performance when scaling Reads

These tests imply that one node is mainly serving writes while the others are serving reads. To not affect performance and given I was not interested in maintaining full read consistency the parameter wsrep_sync_wait was kept as default (0).

events 3node

operation 3node

A first observation shows how ProxySQL seems to keep a more stable level of request served. The increasing load penalises HAProxy reducing if ⅓ the number of operations at 1024 threads.

writes 3node

reads 3node

Digging a bit more we can see that HAProxy is performing much better than ProxySQL for the WRITE operation. The number of writes remains almost steady with minimal fluctuations. ProxySQL on the other hand is performing great when load in write is low, then performance drops by 50%.

For reads we have the opposite. ProxySQL is able to scale in a very efficient way, distributing the load across the nodes and able to maintain the level of service despite the load increase.

If we start to take a look to the latency distribution statistics (sysbench histogram information), we can see that:

latency68 3node HAproxy w

latency68 3node proxy w

In case of low load and writes both proxies stay on the left side of the graph with low value in ms. HAProxy is a bit more consistent and grouped around 55ms value, while ProxySQL is a bit more sparse and spans between 190-293ms.

latency68 3node HAproxy r

latency68 3node proxy r

About reads we have a similar behaviour, both for the large majority between 28-70ms.

We have a different picture when load increases:

latency1024 3node HAproxy w

latency1024 3node proxy w

ProxySQL is having some occurrences where it performs better, but it spans in a very large range, from ~2k ms to ~29k ms. While HAProxy is substantially grouped around 10-11K ms.
As a result, in this context, HAProxy is able to better serve writes under heavy load than ProxySQL.

Again different picture in case of reads.

latency1024 3node HAproxy r

latency1024 3node proxy r

Here ProxySQL is still spanning on a wide range ~76ms – 1500ms, while HAProxy is more consistent but less efficient, grouping around 1200ms the majority of the service. This is consistent with the performance loss we have seen in READ when using high load and HAProxy.

Comparing when using only one node

But let us now discover what happens when using only one node. So using the service as it should be, without the possible Plus of read scaling.

writes 1node

reads 1node

The first thing I want to mention is a strange behaviour that was consistently happening (no matter what proxy used) at 128 threads. I am investigating it but I do not have a good answer yet on why Operator solution with PXC, was having that significant drop in performance ONLY with 128 threads.

Aside that, the results were consistently showing HAProxy performing better in serving read/writes. Keep in mind that HAProxy just establishes the connection point-to-point and is not doing anything else. While ProxySQL is designed to eventually act on the incoming stream of data.

This becomes even more evident when reviewing the latency distribution.
In this case no matter what load we have, HAProxy performs better:

latency68 1node HAproxy rw

latency68 1node proxy rw

latency2048 1node HAproxy rw

latency2048 1node proxy rw

As you can notice, HAProxy is less grouped than when we have 2 entry points, but it is still able to serve more efficiently than ProxySQL.

Conclusions

As usual my advice is to use the right tool for the job, and do not force yourself in something stupid.
As clearly stated at the beginning POM is designed to provide a mysql SERVICE, not a PXC cluster and all the configuration and utilisation should converge on that.
ProxySQL can help you IF you want to scale a bit more on READS using the possible plus. But this is not guaranteed to work as it works when using standard PXC
Not only you need to have a very good understanding of Kubernetes and ProxySQL if you want to avoid issues.
With HAProxy you can scale reads as well, but you need to be sure you have R/W separation at application level.

In any case utilising HAProxy for the service is the easier way to go.
This is one of the reasons why Percona decided to shift to HAProxy.
HAProxy is the solution that offers the proxy service more in line with the aim of the kubernetes service concept.
It is also the solution that remains closer on how a simple MySQL service should behave.

You need to set your expectations correctly to avoid being in trouble later.

References

https://www.percona.com/doc/kubernetes-operator-for-pxc/index.html

Support for Percona XtraDB Cluster in ProxySQL (Part Two)

How scheduler and script stand in supporting failover (Percona and Marco example)

In part one of this series, I had illustrated how simple scenarios may fail or have problems when using Galera native support inside ProxySQL. In this post, I will repeat the same tests but using the scheduler option and the external script.

The Scheduler

First a brief explanation about the scheduler.

The scheduler inside ProxySQL was created to allow administrators to extend ProxySQL capabilities. The scheduler gives the option to add any kind of script or application and run it at the specified interval of time. The scheduler was also the initial first way we had to deal with Galera/Percona XtraDB Cluster (PXC) node management in case of issues.

The scheduler table is composed as follows:

CREATE TABLE scheduler (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    active INT CHECK (active IN (0,1)) NOT NULL DEFAULT 1,
    interval_ms INTEGER CHECK (interval_ms>=100 AND interval_ms<=100000000) NOT NULL,
    filename VARCHAR NOT NULL,
    arg1 VARCHAR,
    arg2 VARCHAR,
    arg3 VARCHAR,
    arg4 VARCHAR,
    arg5 VARCHAR,
    comment VARCHAR NOT NULL DEFAULT '')

The relevant elements are:

Active: that defines if the scheduler should execute or not the external script
Interval_ms: frequency of the execution. This has NO check if previous executions terminate. Given that a script must include a check to prevent launching multiple instances which will probably create conflicts and resource issues.
Filename: the FULL path of the script/app you want to be executed.
Arg(s): whatever you want to pass as arguments. When you have a complex script, either use a configuration file or collapse multiple arguments in a single string.

The Scripts

In this blog, I will present two different scripts (as examples). Both will cover the scenarios as in the previous article and can do more, but I will focus only on that part for now.

One script is written in Bash and is the porting of the proxysql_galera_checker Percona was using with ProxySQL-admin in ProxySQL version 1.4. The script is available here from Percona-lab (git clone ).

The other, written by me, is written in Perl and is probably the first script that came out in 2016. I have done some enhancements and bug fixing to it during the years. Available here (git clone).

Both are offered here as examples and I am not suggesting to use them in critical production environments.

The Setup

To use the two scripts some custom setup must be done. First of all, check that the files are executable by the user running ProxySQL.

Let’s start with mine in Perl

To make it work we need to define a set of host groups that will work as Reader/Writer/Backup-writer/backup-reader (optional but recommended). The difference from the native support is that instead of having them indicated in a specialized table, we will use the mysql_servers table.

Writer: 100
Readers: 101
Backup Writers:8100
Backup Readers: 8101

Given the above, on top of the already defined servers in the previous article, we just need to add the 8000 HGs.

For example:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.22',8100,3306,1000,2000,'Failover server preferred');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.23',8100,3306,999,2000,'Second preferred');    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.233',8100,3306,998,2000,'Third and last in the list');      
    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.22',8101,3306,100,2000,'');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.23',8101,3306,1000,2000,'');    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.233',8101,3306,1000,2000,'');

After that we need to insert the instructions for the scheduler:INSERT INTO scheduler (id,active,interval_ms,filename,arg1) values (10,0,2000,"/opt/tools/proxy_sql_tools/galera_check.pl","-u=cluster1 -p=clusterpass -h=192.168.4.191 -H=100:W,101:R -P=6032 --retry_down=2 --retry_up=1 --main_segment=2 --debug=0 --log=/var/lib/proxysql/galeraLog --active_failover=1");

The result will be:

    id: 10
     active: 0
interval_ms: 2000
   filename: /opt/tools/proxy_sql_tools/galera_check.pl
       arg1: -u=cluster1 -p=clusterpass -h=192.168.4.191 -H=100:W,101:R -P=6032 --retry_down=2 --retry_up=1 --main_segment=2 --debug=0  --log=/var/lib/proxysql/galeraLog --active_failover=1
       arg2: NULL
       arg3: NULL
       arg4: NULL
       arg5: NULL
    comment:

Please refer to the instruction in Github for the details of the parameters. What we can specify here is:

-H=100:W,101:R Are the Host Group we need to refer to as the ones dealing with our PXC cluster
–active_failover=1 Failover method to apply
–retry_down=2 –retry_up=1 If action must be taken immediately or if a retry is to be done. This is to avoid the possible jojo effect due to any delay from the node or network.

Always set it to 0 and activate only when all is set and you are ready to go. Once the above is done, the script ready to be used by ProxySQL is the galera_check script.

Percona proxysql_galera_checker

One limitation this script has is that you cannot use different IPs for the PXC internal communication and the ProxySQL node. Given that, we need to modify the setup we had in the previous blog to match the script requirements. Also here we need to define which HG will be the writer which the reader, but we will specify the internal IPs, and, of course, ProxySQL must have access to that network as well.

Writer HG : 200
Reader HG: 201
Network IPs 10.0.0.22 – 23 – 33

Given that, our ProxySQL setup will be:

delete from mysql_users where username='app_test';
insert into mysql_users (username,password,active,default_hostgroup,default_schema,transaction_persistent,comment) values ('app_test','test',1,200,'mysql',1,'application test user DC1');
LOAD MYSQL USERS TO RUNTIME;SAVE MYSQL USERS TO DISK;

delete from mysql_query_rules where rule_id in(1040,1042);
insert into mysql_query_rules (rule_id,proxy_port,username,destination_hostgroup,active,retries,match_digest,apply) values(1040,6033,'app_test',200,1,3,'^SELECT.*FOR UPDATE',1);
insert into mysql_query_rules (rule_id,proxy_port,username,destination_hostgroup,active,retries,match_digest,apply) values(1042,6033,'app_test',201,1,3,'^SELECT.*$',1);
load mysql query rules to run;save mysql query rules to disk;

delete from mysql_servers where hostgroup_id in (200,201);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.22',200,3306,10000,2000,'DC1');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.22',201,3306,100,2000,'DC1');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.23',201,3306,10000,2000,'DC1');    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.33',201,3306,10000,2000,'DC1');        

load mysql servers to run;save mysql servers to disk;

As you can see here we need to redefine also the user and query rules to match the different HGs, if you use the same (100 -101) no need to do that.

Now it’s time to add the line in for the scheduler:

delete from scheduler where id=60;
INSERT  INTO scheduler (id,active,interval_ms,filename,arg1) values (60,0,3000,"/opt/tools/proxysql-scheduler/proxysql_galera_checker","--config-file=/opt/tools/proxysql-scheduler/proxysql-admin-sample.cnf --writer-is-reader=always --write-hg=200 --read-hg=201 --writer-count=1 --priority=10.0.0.22:3306,10.0.0.23:3306,10.0.0.33:3306 --mode=singlewrite --debug --log=/tmp/pxc_test_proxysql_galera_check.log");
LOAD SCHEDULER TO RUNTIME;SAVE SCHEDULER TO DISK;

Also in this case please refer to the specifications of the parameters, but it’s worth mentioning:

–write-hg=200 –read-hg=201 Host groups definition
–writer-is-reader=always Keep this as ALWAYS please, we will see you do not need anything different.
–mode=singlewrite Possible modes are load balancer and single writer. This is refuse from the old. Never, ever use Galera/PXC in multi-primary mode, period.
–priority=10.0.0.22:3306,10.0.0.23:3306,10.0.0.33:3306 This is where we define the priority for the writers.

Also in this case when loading a schedule, keep the schedule deactivated, and enable it only when ready.

The Tests

Read Test

The first test is the simple read test, so while we have sysbench running in read_only mode we remove one reader after the other.

Marco script:

+---------+-----------+---------------+----------+--------------+----------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed | ConnFree |
+---------+-----------+---------------+----------+--------------+----------+----------+
| 10000   | 100       | 192.168.4.22  | 3306     | ONLINE       | 0        | 0        |
| 10000   | 101       | 192.168.4.233 | 3306     | ONLINE       | 38       | 8        |
| 10000   | 101       | 192.168.4.23  | 3306     | ONLINE	| 15       | 49       |
| 100     | 101       | 192.168.4.22  | 3306     | ONLINE       | 0        | 64       |

As we can see, by just setting the weight we will be able to prevent sending reads to the Writer, and while some will still arrive there, it is negligible. Once we put all the readers down…

Marco script:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 10000   | 100       | 192.168.4.22  | 3306     | ONLINE       | 0        |
| 10000   | 101       | 192.168.4.233 | 3306     | SHUNNED      | 0        |
| 10000   | 101       | 192.168.4.23  | 3306     | SHUNNED      | 0 	|
| 100     | 101       | 192.168.4.22  | 3306     | ONLINE       | 58       |

Given the last node also if with the low weight it will serve all the reads.

Percona Script:

+---------+-----------+---------------+----------+--------------+--------
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed | 
+---------+-----------+---------------+----------+--------------+--------
| 10000   | 200       | 10.0.0.22     | 3306     | ONLINE       | 0        | 
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE       | 22       | 
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE       | 21	| 
| 100     | 201       | 10.0.0.22     | 3306     | ONLINE       | 1        |

Remove the reads:

+---------+-----------+---------------+----------+--------------+-------
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+-------
| 10000   | 200       | 10.0.0.22     | 3306     | ONLINE       | 0        |
| 10000   | 201       | 10.0.0.33     | 3306     | OFFLINE_SOFT | 0 	
| 10000   | 201       | 10.0.0.23     | 3306     | OFFLINE_SOFT | 0 	
| 100     | 201       | 10.0.0.22     | 3306     | ONLINE       | 62       |

In both cases, no issue at all; the writer takes the load of the reads only when left alone.

Maintenance Test

In this test, I will simply put the node down into maintenance mode using pxc_maint_mode=maintenance, as done in the other article. As a reminder, this was working fine also with native Galera.

Marco script:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 10000   | 100       | 192.168.4.22  | 3306     | ONLINE       | 50       |
| 10000   | 101       | 192.168.4.233 | 3306     | ONLINE       | 8        |
| 10000   | 101       | 192.168.4.23  | 3306     | ONLINE       | 3    	   |
| 100     | 101       | 192.168.4.22  | 3306     | ONLINE       | 0        |
| 1000000 | 200       | 10.0.0.23     | 3306     | OFFLINE_SOFT | 0        |

After:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 999     | 100       | 192.168.4.23  | 3306     | ONLINE       | 50       |
| 10000   | 100       | 192.168.4.22  | 3306     | OFFLINE_SOFT | 0        |
| 10000   | 101       | 192.168.4.233 | 3306     | ONLINE       | 5        |
| 10000   | 101       | 192.168.4.23  | 3306     | ONLINE       | 6 	   |
| 100     | 101       | 192.168.4.22  | 3306     | OFFLINE_SOFT | 0        |

Node was elected and connections on the old writer were also able to end given OFFLINE_SOFT.

Putting back the node, removing it from maintenance:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 999     | 100       | 192.168.4.23  | 3306     | ONLINE	| 50       |
| 10000   | 101       | 192.168.4.233 | 3306     | ONLINE       | 5        |
| 10000   | 101       | 192.168.4.23  | 3306     | ONLINE       | 5 	   |
| 100     | 101       | 192.168.4.22  | 3306     | ONLINE       | 0        |

Node WILL NOT failback by default (this is by design), this will eventually allow you to warm caches or anything else it may be meaningful before moving the node to Primary role again.

The Percona script will behave a bit differently:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 1000000 | 200       | 10.0.0.23     | 3306     | OFFLINE_SOFT | 0        |
| 10000   | 200       | 10.0.0.22     | 3306     | ONLINE       | 50       |
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE       | 4 	  |
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE       | 10	  |
| 100     | 201       | 10.0.0.22     | 3306     | ONLINE       | 0        |

Then I put the node under maintenance:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 1000000 | 200       | 10.0.0.23     | 3306     | ONLINE       | 26       |
| 10000   | 200       | 10.0.0.22     | 3306     | OFFLINE_SOFT | 22       |
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE       | 8        |
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE       | 12       |
| 100     | 201       | 10.0.0.22     | 3306     | OFFLINE_SOFT | 0        |

Connections will be moved to the new Writer slowly based on the application approach.

But when I put the node back from maintenance:

+---------+-----------+---------------+----------+--------------+----------
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed 
+---------+-----------+---------------+----------+--------------+----------
| 1000000 | 200       | 10.0.0.23     | 3306     | OFFLINE_SOFT | 0        
| 10000   | 200       | 10.0.0.22     | 3306     | ONLINE       | 49       
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE       | 5        
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE       | 14       
| 100     | 201       | 10.0.0.22     | 3306     | ONLINE       | 0

The old Writer will be put back as Primary. As indicated above I consider this wrong, given we may risk putting back a node that is cold and that can affect production performance. It is true that putting it back from maintenance is a controlled action, but the more checks the better.

Testing Node Crash

Marco script:

To emulate a crash I will kill the mysqld process with kill -9 <pid>.

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status	| ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 1000    | 100       | 192.168.4.22  | 3306     | ONLINE	| 50       |
| 10000   | 101       | 192.168.4.233 | 3306     | ONLINE	| 12       |
| 10000   | 101       | 192.168.4.23  | 3306     | ONLINE	| 4        |
| 100     | 101       | 192.168.4.22  | 3306     | ONLINE	| 0        |

Kill the process:

59,50,53.99,6603.16,6205.21,218.97,178.98,1561.52,0.00,2.00
60,50,54.11,5674.25,5295.50,215.43,163.32,1648.20,0.00,1.00 
61,50,3.99,3382.12,3327.22,30.95,23.96,2159.29,0.00,48.91   <--- start
62,50,0.00,820.35,820.35,0.00,0.00,0.00,0.00,0.00         
63,50,0.00,2848.86,2550.67,195.13,103.07,0.00,0.00,0.00
64,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
65,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
66,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
67,50,50.00,4268.99,4066.99,52.00,150.00,7615.89,0.00,1.00  <--- failover end 
68,50,72.00,6522.40,6096.37,268.02,158.01,1109.09,0.00,1.00

Five seconds is consistently taken, of which two are because I set the scheduler to run every two seconds, and also a retry.
And the new Primary is serving while the failed node is removed:

+---------+-----------+---------------+----------+--------------+----------+----------+
| weight  | hostgroup | srv_host      | srv_port | status	    | ConnUsed | ConnFree |
+---------+-----------+---------------+----------+--------------+----------+----------+
| 999     | 100       | 192.168.4.23  | 3306     | ONLINE       | 0        | 50       |
| 10000   | 101       | 192.168.4.233 | 3306     | ONLINE       | 0        | 34       |
| 10000   | 101       | 192.168.4.23  | 3306     | ONLINE       | 0        | 35       |
| 100     | 101       | 192.168.4.22  | 3306     | SHUNNED      | 0        | 0        |

Percona script:

Also, in this case, the Percona script behaves a bit differently.

Before the crash:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 10000   | 200       | 10.0.0.22     | 3306     | ONLINE       | 49       |
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE       | 5        |
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE       | 14       |
| 100     | 201       | 10.0.0.22     | 3306     | ONLINE       | 0        |

29,50,41.05,4099.74,3838.44,155.18,106.12,2009.23,0.00,0.00
30,50,8.01,1617.92,1547.79,37.07,33.06,1803.47,0.00,50.09
31,50,0.00,2696.60,2696.60,0.00,0.00,0.00,0.00,0.00       <--- start
32,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
33,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
34,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
35,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
36,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
37,50,12.96,2385.82,2172.46,91.72,121.63,8795.93,0.00,0.00  <--- failback ends 6"
38,50,39.95,4360.00,4083.38,148.80,127.82,9284.15,0.00,0.00

Variable time to recover but around 6-12 seconds.

+---------+-----------+---------------+----------+---------+----------+
| weight  | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+---------+-----------+---------------+----------+---------+----------+
| 1000000 | 200       | 10.0.0.23     | 3306     | ONLINE  | 50       | ← new
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE  | 11       |
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE  | 5        |

New Primary is elected. But on node recovery:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 1000000 | 200       | 10.0.0.23     | 3306     | OFFLINE_SOFT | 50       | 
| 10000   | 200       | 10.0.0.22     | 3306     | ONLINE       | 0        |<--old is back
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE       | 10       |
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE	| 6        |
| 1000    | 201       | 10.0.0.22     | 3306     | ONLINE       | 0        |

MySQL

Layout

Components

Connection Layer

Data Layer

High Availability

How is High Availability Achieved?

This Solution Provides:

Failovers

Maintenance Windows

Uptime

Measurement and Monitoring

How to Implement the Infrastructure

The Elements

Software Installation

Configure the Nodes

Step 1

Step 2

Step 3

Proxy Setup

Step 1

Step 2

Step 3

Step 4

Proxy HA

Disaster Recovery Implementation

Monitoring

Percona Monitoring and Management

From Command Line

Conclusions

The tests

What and how

Version 5.7/10.3

MySQL

Shift to Percona:

Let us go backwards to MySQL

Let us try to move to MariaDB 10.3:

Let us go back to MySQL now:

Summarizing

Version 8/10.5

MySQL

Move to Percona Server

Let us rollback to MySQL

Let us now try MariaDB

Summarizing

Another weirdo

Conclusions

Reference

Overview

Stale Reads

What is the impact?

Results

Conclusions

References

Overview

Operator assumptions

The plus in the game (read scaling)

Global difference and comparison

The test environment

Comparing performance when scaling Reads

Comparing when using only one node

Conclusions

References

The Scheduler

The Scripts

The Setup

Let’s start with mine in Perl

Percona proxysql_galera_checker

The Tests

Read Test

Maintenance Test

Testing Node Crash

Conclusions

More Articles …

Path

login