TusaCentral - MySQL Blogs

My MySQL tips

The Galera Crossroads: Why PXC is the Lifeline for MariaDB Community Users

Or: Surviving the Codership Acquisition Without Losing Your Cluster

Why this long post?

Recently, the database landscape shifted significantly when MariaDB plc absorbed Codership. If you aren't familiar, Codership is the company that introduced the Galera library and the WSREP API to MySQL, creating the first virtually synchronous replication solution for the MySQL ecosystem. For years, they produced their own highly stable, patched version of MySQL + Galera, which was widely adopted alongside solutions like Percona XtraDB Cluster (PXC).

The Post-Acquisition Landscape

Following the acquisition, MariaDB plc made a controversial decision: they plan to phase out Galera from the MariaDB Community version and enhance it exclusively for MariaDB Enterprise.

This move sparked a lengthy debate. A large portion of the community pushed back, and even the MariaDB Foundation wasn't aligned with the decision (as detailed in this blog post by lefred).

However, looking at the Foundation's meeting minutes from February 25, 2026, it is clear they ultimately settled on "Option 2." This means the Foundation is willing to keep the existing Galera/WSREP code in the community server, but any future evolution or enhancement of the product will have to rely entirely on external community contributions.

What Does This Mean for You?

The reality of the situation is straightforward:

Codership is gone.
MariaDB plc (the company) will transition the active development of Galera strictly to their Enterprise offering.
The MariaDB Foundation will maintain the Galera code "as-is" unless the community actively steps up to provide updates.

As a result, users currently relying on the Codership version of Galera and, in my opinion, those using MariaDB Community may soon find themselves stuck in a difficult position, unsure of what steps to take next.

The Goal of This Document

This post is meant to cut through the uncertainty and answer those lingering questions. My goal is to provide you with the facts so you can make an informed decision about your database architecture's future.

Below, you will find a detailed comparison between Percona XtraDB Cluster (based on Galera) and the MariaDB implementation to help you navigate this transition.

1. Executive Summary

Both Percona XtraDB Cluster (PXC) and MariaDB Galera Cluster share a common ancestor: the Galera synchronous multi-master replication library by Codership. However, PXC does not use the same Galera library as MariaDB. It uses a fork that tracks upstream Galera, to which Percona adds many critical fixes that make it actually work in some areas (IST stability, gcache, NBO, etc.). Both implement the wsrep API and use the same Group Communication System (GCS) for write-set ordering. Despite this shared foundation, they are NOT binary-compatible, and you cannot simply swap binaries between them.

The divergence stems from their server cores: PXC is built on Percona Server for MySQL (which closely tracks Oracle MySQL 8.x, 9.x), while MariaDB Galera is built on MariaDB Server, which forked from MySQL around 2010 and has since developed its own independent feature set, system tables, InnoDB patches, GTID implementation, and binary log format.

The gap has widened considerably since MySQL 8.0 introduced a new data dictionary stored entirely in InnoDB (eliminating .frm files), new authentication plugins, and native JSON type changes, none of which exist in MariaDB's independent implementation.

2. Why You Cannot Simply Replace the Binaries

This is the most critical section for anyone considering migration. The following incompatibilities make a drop-in binary replacement impossible:

2.1 Data Dictionary and System Tables

MySQL 8.0, and therefore MySQL with Galera as well as PXC 8.0/8.4, replaced all .frm, .par, .opt, .trn, and .trg files with a transactional data dictionary stored in InnoDB. MariaDB never adopted this. MariaDB 10.5 deprecated .frm files but uses its own internal frm-less representation. The system tables (mysql.user, mysql.tables_priv, mysql.columns_priv, mysql.routines, mysql.events, etc.) have fundamentally different schemas. Mounting a PXC data directory with a MariaDB binary will fail at startup, and vice versa.

Concrete examples of schema divergence in mysql.user:

PXC/MySQL 8.0 uses plugin-centric design; Password column was removed entirely
MariaDB retains password column alongside authentication_string
PXC defaults to caching_sha2_password; MariaDB defaults to mysql_native_password (10.6 LTS) or ed25519

2.2 GTID Format Incompatibility

GTID implementations are entirely different and mutually incompatible:

Aspect	Percona XtraDB Cluster	MariaDB Galera Cluster
Format	server_uuid:seq_no (e.g. 6b07f8c7-...:1)	domain_id:server_id:seq_no (e.g. 0-1-100)
System variable	gtid_mode = ON/OFF/ON_PERMISSIVE	gtid_strict_mode = ON/OFF
Cluster GTID integration	wsrep generates UUID-based GTIDs automatically	Requires wsrep_gtid_mode + wsrep_gtid_domain_id
Replication positioning	MASTER_AUTO_POSITION=1	MASTER_USE_GTID = slave_pos / current_pos
Cross-product GTID replication	Cannot replicate to/from MariaDB using GTID	Cannot replicate to/from MySQL 8.x using GTID
Binlog GTID events	Gtid_log_event format	Gtid_list_log_event; incompatible wire format

Any DR topology crossing PXC and MariaDB must use file+position-based replication or purpose-built ETL tools. GTID-based replication between them does not work.
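To make the format difference above tangible, here is a minimal Python sketch that tells the two GTID shapes apart. It is only an illustration: the regular expressions cover a single GTID (not full GTID sets), the function name gtid_flavor is my own, and the UUID in the example is made up.

```python
import re

# Simplified patterns: a MySQL GTID is uuid:seqno (optionally uuid:first-last),
# a MariaDB GTID is domain_id-server_id-seq_no.
MYSQL_GTID = re.compile(
    r"^[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}:\d+(?:-\d+)?$"
)
MARIADB_GTID = re.compile(r"^\d+-\d+-\d+$")


def gtid_flavor(gtid: str) -> str:
    """Classify a single GTID string as 'mysql', 'mariadb', or 'unknown'."""
    gtid = gtid.strip()
    if MYSQL_GTID.match(gtid):
        return "mysql"
    if MARIADB_GTID.match(gtid):
        return "mariadb"
    return "unknown"


if __name__ == "__main__":
    # The UUID below is made up for illustration.
    for sample in ("3e11fa47-71ca-11e1-9e33-c80aa9429562:1", "0-1-100"):
        print(sample, "->", gtid_flavor(sample))
```

A check like this can be handy in migration tooling to fail fast when a replication position from one product is about to be fed to the other.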

2.3 InnoDB / XtraDB Divergence

InnoDB tablespace format: MySQL 8.0 uses a newer undo tablespace design (undo001/undo002) absent in MariaDB
Redo log format: MySQL 8.0.30+ uses a new circular redo log; MariaDB uses its own format since 10.5
innodb_autoinc_lock_mode: PXC enforces mode=2 via pxc_strict_mode; MariaDB defaults to mode=1, and this alone causes certification failures if uncorrected
Row format checksums and internal page structures differ between the two InnoDB forks

2.4 Binary Log Format Enforcement

PXC 8.0 hardcodes ROW-based binary logging. Setting binlog_format=STATEMENT or MIXED raises an error regardless of pxc_strict_mode:

ERROR: --binlog-format=STATEMENT is not supported. Use ROW.

MariaDB Galera warns but can run with MIXED format in some scenarios, which risks non-deterministic replication.

2.5 wsrep API and Patch Divergence

Aspect	Percona XtraDB Cluster	MariaDB Galera Cluster
Galera library	Galera 4.x, separate package (libgalera_smm.so)	Galera 4.x, embedded in server package since 10.1
wsrep activation	Active when wsrep_provider path is configured	Requires explicit wsrep_on=ON in my.cnf
Extra status variables	10 PXC-specific wsrep_* variables	1 extra: wsrep_thread_count
Extra config variables	pxc_strict_mode, pxc_encrypt_cluster_traffic, pxc_maint_mode, wsrep_reject_queries	wsrep_gtid_mode, wsrep_gtid_domain_id, wsrep_patch_version, wsrep_mysql_replication_bundle

2.6 JSON, SQL Modes, and Reserved Words

MySQL 8.0 stores JSON as a native binary type. MariaDB stores JSON as longtext with a CHECK constraint. Tables with JSON columns cannot be physically migrated; logical exports (mysqldump) are required and may need schema adjustments. Numerous SQL modes and reserved words differ, causing silent behavioral differences that surface only in application testing.

3. Architecture and Replication Internals

Both products share the same fundamental architecture: a database server patched with the wsrep API communicates with the Galera plugin (libgalera_smm.so), which handles Group Communication via the Totem Single Ring Ordering protocol and write-set certification.

3.1 Write-Set Replication Flow

The flow is identical because it is implemented in the shared Galera library:

Transaction executes locally; InnoDB registers each modified row key via wsrep append_key()
On COMMIT: wsrep packages row keys + binary log event as a write-set (WS)
WS is sent to GCS, which assigns a global sequence number (seqno) and broadcasts to all nodes
Every node independently certifies the WS against its local Certification Conflict Vector (cert_index_ng)
Conflict (same row key at overlapping seqno range): certification fails → ERROR 1213 Deadlock
Pass: WS applied via applier thread; certification is deterministic, so every node reaches the same decision
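The certification step in the flow above can be pictured with a toy model. The sketch below is not Galera's actual data structure or code; the CertIndex class, the string row keys, and the last_seen parameter (the highest seqno the originating node had applied when the transaction ran) are assumptions made purely to show why every node reaches the same verdict.

```python
class CertIndex:
    """Toy certification index: row key -> seqno of the last write-set that touched it."""

    def __init__(self):
        self.last_writer = {}

    def certify(self, seqno, last_seen, keys):
        """Return True if the write-set passes certification.

        seqno     -- global sequence number assigned by GCS
        last_seen -- highest seqno the origin node had applied when the
                     transaction executed (hypothetical name)
        keys      -- row keys modified by the transaction
        """
        for key in keys:
            # Someone else changed this row after the transaction's snapshot.
            if self.last_writer.get(key, 0) > last_seen:
                return False
        for key in keys:
            self.last_writer[key] = seqno
        return True


if __name__ == "__main__":
    stream = [(1, 0, {"t1:1"}), (2, 0, {"t1:2"}), (3, 0, {"t1:1"})]
    node_a, node_b = CertIndex(), CertIndex()
    print([node_a.certify(*ws) for ws in stream])
    print([node_b.certify(*ws) for ws in stream])
```

Because both toy nodes consume the same ordered stream and apply the same rule, they produce the same pass/fail sequence without exchanging any votes.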

Galera uses optimistic locking at the cluster level, pessimistic locking locally.

A transaction acquires row locks on the originating node (standard InnoDB pessimistic locking) but has no visibility into locks on other nodes. Conflicts are detected only at commit time.

3.2 Brute Force Abort

When an incoming replicated write-set conflicts with a local uncommitted transaction, the incoming write-set always wins. The local transaction is rolled back immediately and the client receives ERROR 1213. Applications must implement retry logic; this is not optional for multi-writer topologies. Both products behave identically here; the difference is in monitoring granularity (PXC exposes wsrep_local_bf_aborts and wsrep_local_cert_failures).
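As a rough illustration of the retry logic mentioned above, here is a driver-agnostic Python sketch. DatabaseError is a stand-in class I define here, not a real driver exception; in practice you would catch your connector's own error type and read its error number. The backoff values are arbitrary assumptions.

```python
import random
import time

DEADLOCK_ERRNO = 1213  # the error code a brute-force-aborted transaction receives


class DatabaseError(Exception):
    """Stand-in for a driver exception that carries a MySQL error number."""

    def __init__(self, errno, msg=""):
        super().__init__(msg)
        self.errno = errno


def run_with_retry(txn, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Run txn(); retry with jittered exponential backoff on ERROR 1213."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except DatabaseError as exc:
            # Anything other than a deadlock, or the final attempt, is re-raised.
            if exc.errno != DEADLOCK_ERRNO or attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

Note that txn must wrap the whole transaction (from BEGIN to COMMIT), since the abort rolls back everything, not just the last statement.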

4. Flow Control

Flow Control (FC) is the back-pressure mechanism that prevents fast writers from overwhelming slow appliers. When a node's receive queue exceeds gcs.fc_limit, it broadcasts a FLOW_CONTROL_PAUSE message to the entire cluster. All nodes suspend committing until the queue drains below gcs.fc_factor × gcs.fc_limit.

FC is cluster-global: when ONE node pauses, ALL nodes stop committing
Creates latency spikes visible to all applications on all nodes
The pausing node continues applying its backlog during FC
Frequent FC indicates the cluster is write-bound beyond what the slowest node can absorb
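The pause/resume rule described above boils down to two thresholds. The following sketch models only what the text states (pause above gcs.fc_limit, resume below gcs.fc_factor × gcs.fc_limit); it is a simplification, the function names are mine, and it ignores anything else the real library may factor in.

```python
def fc_thresholds(fc_limit=100, fc_factor=1.0):
    """Return (pause_above, resume_below) receive-queue depths."""
    return fc_limit, fc_limit * fc_factor


def fc_paused(queue_depth, paused, fc_limit=100, fc_factor=1.0):
    """Return the new paused state for a node given its receive queue depth."""
    pause_above, resume_below = fc_thresholds(fc_limit, fc_factor)
    if not paused and queue_depth > pause_above:
        return True   # node asks the cluster to pause
    if paused and queue_depth < resume_below:
        return False  # queue drained enough, cluster resumes
    return paused


if __name__ == "__main__":
    print(fc_paused(101, paused=False))                # queue above the limit
    print(fc_paused(60, paused=True, fc_factor=0.5))   # still above 50
    print(fc_paused(49, paused=True, fc_factor=0.5))   # drained below 50
```

With a factor below 1.0 the node has to drain further before resuming, which trades longer pauses for fewer pause/resume flips.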

4.1 Observability: PXC vs MariaDB

Aspect	Percona XtraDB Cluster	MariaDB Galera Cluster
wsrep_flow_control_paused	Available: fraction of time in FC	Available in both Enterprise and community
wsrep_flow_control_sent/recv	Available	Available in both Enterprise and community
wsrep_flow_control_status	PXC ONLY: ON or OFF right now	Not available
wsrep_flow_control_interval	PXC ONLY: current [low, high] range	Not available
wsrep_flow_control_interval_low/high	PXC ONLY: individual thresholds	Not available
wsrep_cert_bucket_count	PXC ONLY: cert index hash buckets	Not available
wsrep_gcache_pool_size	PXC ONLY: gcache memory in use	Not available
wsrep_ist_receive_seqno_*	PXC ONLY: IST progress (start/current/end)	Not available

4.2 Key FC Tuning Variables (both products)

wsrep_provider_options key	Default	Effect
gcs.fc_limit	100	Recv queue depth that triggers FC pause. Raise for bursty writers.
gcs.fc_factor	1.0	Queue must drop below fc_limit × fc_factor to resume. Lower = queue must drain further before resuming.
gcs.fc_master_slave	no	Set yes for single-writer topology to disable FC on the writer node.
gcs.max_packet_size	64500	Max GCS packet size. Set larger than your largest expected write-set.

5. Streaming Replication and Large Transaction Handling

Streaming Replication (SR) is a Galera 4 feature available in both PXC 8.0+ and MariaDB 10.4+. It splits large transactions into fragments that are replicated and certified before the final COMMIT.

Without SR: A 1M-row UPDATE runs entirely on one node. At commit, the entire write-set is sent. Other nodes stall 28–30 seconds certifying and applying it. All unrelated writes cluster-wide are blocked during this window.

With SR (wsrep_trx_fragment_size > 0): Fragments are replicated mid-transaction. Each certified fragment acquires row locks on ALL nodes, providing cluster-wide row-level locking during the transaction. Conflicting transactions on other nodes WAIT rather than certifying and failing later.

The trade-off: Galera double-writes fragments to mysql.wsrep_streaming_log (an InnoDB table). A 34-second update without SR takes 40 seconds with 1MB fragments, and 51 seconds with 0.1MB fragments. Fragment rollback propagates to all nodes, which makes it more expensive than a local-only rollback.
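To put the timings quoted above in relative terms, the small calculation below turns them into overhead percentages. It uses only the three numbers from the paragraph; the helper name overhead_pct is mine.

```python
def overhead_pct(baseline_s, with_sr_s):
    """Relative slowdown of a run with SR versus the run without it, in percent."""
    return (with_sr_s - baseline_s) / baseline_s * 100


if __name__ == "__main__":
    baseline = 34  # seconds without SR
    for label, secs in (("1MB fragments", 40), ("0.1MB fragments", 51)):
        print(f"{label}: +{overhead_pct(baseline, secs):.1f}%")
```

In other words, 1MB fragments cost roughly 18% extra in that test, while 0.1MB fragments cost 50%, which is why very small fragment sizes are rarely worth it.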

Variable	Values	Notes
wsrep_trx_fragment_size	0 (off), N	Fragment size in units of wsrep_trx_fragment_unit
wsrep_trx_fragment_unit	bytes, rows, statements	Recommend bytes; 1MB is a reasonable starting point for most workloads
Session-scope only	Yes	Do not enable globally. Enable per-session for known large transactions only.

There is no difference between PXC and MariaDB on SR; it is identical Galera 4 library behavior in both.

6. Split Brain, Quorum, and Primary Component

Split-brain protection is implemented identically in both products via the Galera GCS layer.

6.1 Primary Component Election

When a network partition occurs, each segment runs a membership algorithm. The segment with strictly more than 50% of cluster weight becomes the Primary Component (PC). Minority segments enter non-primary state:

All writes rejected: ERROR 1047 WSREP has not yet prepared node for application use
Reads permitted (effectively read-only)
Node waits until network heals and PC is re-established

6.2 Garbd Arbitrator

For 2-node or even-node clusters, garbd is a lightweight voting member without data storage. Both products ship it. Mixing garbd binaries from PXC and MariaDB in the same cluster is not recommended due to potential wsrep API version differences.

6.3 Node Weighting (pc.weight)

Both products support pc.weight in wsrep_provider_options to assign higher votes to specific nodes. Use this to prioritize primary datacenter nodes over DR nodes in quorum calculations, preventing the DR site from forming a spurious PC if the link to the primary drops.
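The "strictly more than 50% of cluster weight" rule, combined with pc.weight, can be sketched as below. This is a simplified model of the rule as stated in this section, not the library's full quorum algorithm; the node layout and weights are invented for the example.

```python
def has_quorum(segment_weights, total_weight):
    """True if a partition holds strictly more than half of the cluster weight."""
    return sum(segment_weights) * 2 > total_weight


if __name__ == "__main__":
    # Hypothetical layout: two primary-DC nodes with weight 2, two DR nodes with weight 1.
    primary_dc, dr_site = [2, 2], [1, 1]
    total = sum(primary_dc) + sum(dr_site)
    print("primary DC keeps PC:", has_quorum(primary_dc, total))
    print("DR site keeps PC:", has_quorum(dr_site, total))
    # With equal weights, a 2/2 split leaves neither side with quorum.
    print("even split:", has_quorum([1, 1], 4))
```

The last line shows why an even split without weighting (or without garbd) stops the whole cluster: exactly 50% is not enough.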

6.4 Split-Brain Recovery

Identify the most advanced node: inspect grastate.dat and the seqno field
The node with safe_to_bootstrap: 1 was the last to write cleanly
Bootstrap from it: SET GLOBAL wsrep_provider_options="pc.bootstrap=YES"; or restart with --wsrep-new-cluster
Re-provision all other nodes via SST from the bootstrapped node (losing diverged writes)
Verify with pt-table-checksum after cluster reform
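The first two recovery steps above can be scripted. The sketch below parses the text of grastate.dat files that you have already collected from each node and suggests a bootstrap candidate. The helper names and the node names are hypothetical, and the sample contents are made up; treat the result as a hint to verify by hand, especially when seqno is -1 after an unclean shutdown.

```python
def parse_grastate(text):
    """Parse the key: value lines of a grastate.dat file into a dict of strings."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(states):
    """states: node name -> parsed grastate. Prefer safe_to_bootstrap, else highest seqno."""
    for name, st in states.items():
        if st.get("safe_to_bootstrap") == "1":
            return name
    return max(states, key=lambda n: int(states[n].get("seqno", "-1")))


if __name__ == "__main__":
    template = "# GALERA saved state\nversion: 2.1\nuuid: example\nseqno: {}\nsafe_to_bootstrap: 0\n"
    states = {
        "node1": parse_grastate(template.format(100)),
        "node2": parse_grastate(template.format(120)),
    }
    print(pick_bootstrap_node(states))
```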

PXC-specific advantage: pxc_maint_mode. Provides DISABLED / PXCMAINT / MAINTENANCE states. PXCMAINT signals load balancers to drain the node gracefully before maintenance. MariaDB has no equivalent; HAProxy/ProxySQL coordination must be done externally.

7. State Transfer: SST and IST

7.1 IST (Incremental State Transfer)

Used when a node rejoins after a short absence and the donor's gcache still contains the missing write-sets. Fast and non-blocking to donor. Mechanism is identical in both. PXC adds monitoring:

wsrep_ist_receive_status: text description of IST state
wsrep_ist_receive_seqno_start / current / end: enables building a completion percentage

MariaDB provides none of these; IST progress requires log file grepping.
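As an example of the completion percentage mentioned above, this sketch turns the three wsrep_ist_receive_seqno_* values into a progress figure. How you read the status variables is left out; the function name and the clamping behavior are my own assumptions.

```python
def ist_progress_pct(start, current, end):
    """Percentage of the IST range already received, clamped to 0..100."""
    total = end - start
    if total <= 0:
        return 100.0  # nothing (left) to transfer
    done = min(max(current - start, 0), total)
    return done / total * 100


if __name__ == "__main__":
    # Made-up seqno values for illustration.
    print(f"{ist_progress_pct(1000, 1500, 2000):.0f}% of IST received")
```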

7.2 SST (Full State Transfer)

Aspect	Percona XtraDB Cluster	MariaDB Galera Cluster	Impact
Default method	xtrabackup-v2	mariabackup (recommended)	Both are production-grade
CLONE SST	YES: native MySQL CLONE plugin; no external binary; encrypted by default	NO: not available in MariaDB	PXC can provision a new node with zero external tooling; MariaDB always requires the mariabackup binary installed and configured on all nodes
Backup tool	Percona XtraBackup (xtrabackup)	MariaDB Backup (mariabackup, fork of xtrabackup 2.3)	The tools are incompatible. PXC's backup files cannot be restored by mariabackup and vice versa. Migration between the two products requires a full logical dump, not a physical copy
Cross-product SST	xtrabackup CANNOT restore MariaDB data	mariabackup CANNOT restore PXC data	Hard blocker for any hybrid topology or live migration attempt using physical SST. Reinforces that the two clusters cannot share nodes
Donor blocking	CLONE: non-blocking for reads. xtrabackup: with --lock-ddl=REDUCED, even DDLs don't block the donor	mariabackup: brief FTWRL, then non-blocking	Operationally equivalent for the xtrabackup vs mariabackup paths. CLONE and the new lock-ddl option in PXC eliminate even the brief lock, making them preferable for write-sensitive donors
wsrep_allowlist	PXC 8.0+ IP allowlist for SST/IST requests	Not available in MariaDB Galera	Without an allowlist, any node that knows the cluster address can request an SST, increasing the attack surface. PXC allows hardening this at the database layer; MariaDB relies entirely on network-level controls
Encryption	pxc_encrypt_cluster_traffic covers SST automatically	Requires separate SSL config per SST method	In MariaDB, SST encryption is configured independently from replication traffic encryption. A misconfiguration (e.g. TLS enabled for write-sets but forgotten for SST) silently transfers a full data snapshot in plaintext, a common security gap. PXC's single-variable approach eliminates this risk by default

8. DDL Handling and Online Schema Changes

Schema changes are the most operationally dangerous operations in Galera. Both products support three mechanisms via wsrep_OSU_method.

8.1 TOI: Total Order Isolation (default)

DDL is executed across all nodes in global total order. Every node pauses at the same logical point, applies the DDL, then resumes. Safe but causes cluster-wide stall for the DDL duration. For large tables this means minutes of downtime. Identical behavior in both products.

8.2 RSU: Rolling Schema Upgrade

Desynchronizes one node (wsrep_desync=ON), applies DDL locally, then re-syncs. Cluster continues processing during upgrade on that node. Risk: schema is temporarily inconsistent across nodes. Identical behavior in both products.

8.3 NBO: Non-Blocking Operation (KEY DIFFERENCE)

NBO acquires a metadata lock only at the very start and very end of the DDL. The DDL executes independently on each node while the cluster processes other statements normally.

PXC 8.0.25+ (Community / Open Source): NBO is fully supported for CREATE/ALTER/DROP INDEX and ALTER TABLE index operations. Available at no cost in the standard community release.
MariaDB Galera (Enterprise Only): NBO is restricted to MariaDB Enterprise Server. The community edition does NOT support NBO; only TOI and RSU are available. This is a significant operational disadvantage for large table DDL in production.

External tools for Galera-safe DDL:

pt-online-schema-change: works with Galera, requires pxc_strict_mode=PERMISSIVE during execution (PXC), careful configuration

9. Disaster Recovery Architectures

Galera provides synchronous multi-master replication within a cluster. For DR across geographic sites, both products rely on asynchronous MySQL replication. The GTID incompatibility is the main constraint.

9.1 Async Replica as DR Node

Standard DR pattern: async replica in DR site replicates from one Galera node. Requirements for both products:

log_slave_updates = ON on all cluster nodes (cluster writes must reach binlog for async replicas)
binlog_format = ROW (enforced by PXC; must be set explicitly in MariaDB)

Critical: DR replica must be the same product family. A PXC cluster cannot replicate to a MariaDB DR node via GTID (format mismatch). File+position replication is possible but loses GTID safety. In practice: PXC → Percona Server/PXC DR; MariaDB → MariaDB Server DR.

9.2 Geo-Distributed Galera

Galera can technically span datacenters, but WAN latency adds directly to commit latency (certification is synchronous). At 20ms RTT, every write adds 20ms to commit time. Both products are equally affected. Mitigation: tune evs.* provider options for WAN tolerance. However, for most workloads, async replication between sites should be preferred over geo-distributed Galera.
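A back-of-the-envelope way to read the 20ms figure above: if every commit waits one round trip, a single connection issuing commits back to back cannot exceed roughly 1000 / RTT commits per second. The sketch below is only that arithmetic; the local_commit_ms parameter is a hypothetical knob for local commit cost, and parallel connections can of course overlap their waits.

```python
def max_sequential_commits_per_sec(rtt_ms, local_commit_ms=0.0):
    """Upper bound on back-to-back commits per second for one connection."""
    return 1000.0 / (rtt_ms + local_commit_ms)


if __name__ == "__main__":
    print(max_sequential_commits_per_sec(20))       # WAN round trip only
    print(max_sequential_commits_per_sec(20, 5.0))  # plus an assumed 5ms local commit
```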

9.3 PMM Integration

PXC integrates natively with Percona Monitoring and Management (PMM), providing built-in Galera dashboards, flow control visualization, write-set lag tracking, and cluster state alerting. MariaDB Galera can be monitored by PMM but requires additional dashboard configuration and custom exporters for full visibility.

10. PXC-Specific Features

10.1 pxc_strict_mode

Performs safety validations at startup and runtime. Modes: ENFORCING (default), PERMISSIVE, DISABLED.

ENFORCING blocks:

MyISAM DML (would not replicate, causing silent data divergence)
Tables without primary keys (certification is key-based; no PK causes full-table locks in certification)
Non-ROW binlog_format
log_output=FILE (can impact applier performance)
innodb_autoinc_lock_mode != 2 (mode 1 can cause gaps/deadlocks in multi-master)

MariaDB has no equivalent. Without this enforcement, operators can accidentally run MyISAM writes or INSERT into a table without a PK on a MariaDB node and the operation silently succeeds locally but is not replicated, causing cluster data divergence.
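If you are coming from MariaDB, it is worth checking your schema for the two most common offenders before ENFORCING mode does it for you. The sketch below works on table metadata you have already fetched (for instance from information_schema); the dictionary shape with name, engine, and has_primary_key is an assumption of this example, not an existing API.

```python
def strict_mode_violations(tables):
    """Return (table name, reason) pairs for tables ENFORCING mode would object to.

    tables -- iterable of dicts with 'name', 'engine' and 'has_primary_key' keys
    """
    problems = []
    for table in tables:
        if table["engine"].lower() == "myisam":
            problems.append((table["name"], "MyISAM engine"))
        if not table["has_primary_key"]:
            problems.append((table["name"], "no primary key"))
    return problems


if __name__ == "__main__":
    # Made-up metadata for illustration.
    sample = [
        {"name": "shop.orders", "engine": "InnoDB", "has_primary_key": True},
        {"name": "shop.audit_log", "engine": "InnoDB", "has_primary_key": False},
        {"name": "shop.legacy", "engine": "MyISAM", "has_primary_key": True},
    ]
    for name, reason in strict_mode_violations(sample):
        print(f"{name}: {reason}")
```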

10.2 pxc_encrypt_cluster_traffic

A single variable (ON by default in PXC 8.0) that enables TLS for ALL cluster traffic: write-set replication, SST, IST, and internal service messages. In MariaDB Galera, each of these requires separate SSL configuration. Misconfiguration can leave SST traffic unencrypted while write-set traffic is encrypted, a common security gap in MariaDB Galera deployments.

10.3 CLONE SST Plugin

PXC's CLONE SST (default since 8.0.41) requires no external backup tool, uses MySQL's native encryption, and is non-blocking for reads on the donor. Node provisioning is simpler, faster for smaller datasets, and requires no xtrabackup binary installation.

10.4 GCache and Write-Set Cache Encryption

Introduced in PXC 8.0.31-23. Currently a tech preview feature.

What it does: Encrypts two on-disk structures that Galera uses to buffer replication data:

GCache (RingBuffer file): the persistent on-disk write-set cache used for IST. Encryption uses a two-layer key scheme: the Keyring stores only a Master Key, which encrypts a per-file File Key. The encrypted File Key is stored in the RingBuffer's preamble. Since the RingBuffer is non-volatile (survives restarts), the File Key must be retrievable from the preamble on restart.

Write-Set cache (allocator disk pages): temporary disk pages spilled during large transactions. These are ephemeral (not persistent across restarts), so no File Key is stored; encryption is in-memory-keyed only.

How to enable via wsrep_provider_options:

Variable	Default	Controls
gcache.encryption	off	Enable/disable GCache encryption
gcache.encryption_cache_size	16MB	Encryption cache size (max 512 pages)
gcache.encryption_cache_page_size	32KB	Must be a multiple of CPU page size (typically 4KB)
allocator.disk_pages_encryption	off	Enable/disable Write-Set cache encryption
allocator.encryption_cache_size	16MB	Same structure as GCache
allocator.encryption_cache_page_size	32KB	Same constraint
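The page-size constraint in the table above is easy to sanity-check before touching a node. The sketch below only does the arithmetic (cache size divided by page size, with the multiple-of-CPU-page rule); the function name is mine and the 4KB CPU page is the "typical" value from the table, not something the code detects.

```python
def encryption_cache_pages(cache_size_bytes, page_size_bytes, cpu_page_bytes=4096):
    """Return how many cache pages fit, enforcing the CPU-page multiple rule."""
    if page_size_bytes <= 0 or page_size_bytes % cpu_page_bytes != 0:
        raise ValueError("cache page size must be a positive multiple of the CPU page size")
    return cache_size_bytes // page_size_bytes


if __name__ == "__main__":
    # Defaults from the table: 16MB cache, 32KB pages.
    print(encryption_cache_pages(16 * 1024 * 1024, 32 * 1024))
```

With the defaults, 16MB divided by 32KB gives 512 pages, which lines up with the page figure quoted in the table.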

Master Key rotation:

sql

ALTER INSTANCE ROTATE GCACHE MASTER KEY;

Requires a keyring plugin or keyring component (e.g. keyring_file, keyring_vault) loaded and configured. The keyring file should be stored outside the data directory.


10.5 FC Auto Eviction of Lagging Nodes

Introduced in PXC 8.0.33-25 (PXC-3760).

The problem it solves: When a node is persistently slow, it drives Flow Control (FC) for the entire cluster, throttling all writes. Previously, operators had to manually evict such a node. This feature makes the node evict itself when it has been in FC too long.

How it works: A sliding time window tracks FC activity. If FC time within that window exceeds a threshold ratio, the node self-leaves the cluster.

Variables (set via wsrep_provider_options; both are static and require a restart):

Variable	Default	Description
gcs.fc_auto_evict_window	0 (disabled)	Width of the observation window (seconds). 0 = feature off.
gcs.fc_auto_evict_threshold	0.75	Ratio (0.0–1.0): if FC time ÷ window ≥ this value, node self-evicts.

Example: With gcs.fc_auto_evict_window=60 and gcs.fc_auto_evict_threshold=0.75, if a node spends ≥ 45 seconds of any 60-second window in FC, it leaves the cluster automatically.
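The example above reduces to a one-line comparison. The following sketch models just that rule; it does not implement the sliding window itself, and the function and parameter names are mine.

```python
def should_self_evict(fc_seconds_in_window, window=60, threshold=0.75):
    """True if the time spent in FC within the window reaches the threshold ratio."""
    if window <= 0:
        return False  # a window of 0 means the feature is off
    return fc_seconds_in_window >= threshold * window


if __name__ == "__main__":
    for fc_time in (30, 44.9, 45, 60):
        print(fc_time, should_self_evict(fc_time))
```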

There is also the older, separate EVS-level auto eviction:

Variable	Default	Description
evs.auto_evict	0 (disabled)	Number of delayed-list entries allowed before EVS auto-evicts a slow node. Requires evs.version=1.
evs.evict		Manual eviction: set to a node's UUID to force evict it.
evs.delay_margin	PT1S	How long a node can lag before being added to the delayed list.


11. MariaDB Galera-Specific Features

11.1 wsrep_gtid_mode and wsrep_gtid_domain_id

MariaDB's GTID integration with Galera is more explicit. wsrep_gtid_mode=ON ensures all Galera write-sets carry consistent MariaDB GTIDs using the domain from wsrep_gtid_domain_id. Critical for async replication topologies where downstream replicas need GTID-based position tracking. PXC handles this automatically via MySQL's native GTID integration in the wsrep patch.

11.2 WSREP_INFO Plugin

MariaDB contributed the WSREP_INFO plugin, which exposes cluster membership as queryable information_schema tables:

SELECT * FROM information_schema.WSREP_MEMBERSHIP;

SELECT * FROM information_schema.WSREP_STATUS;

More ergonomic than parsing SHOW STATUS LIKE 'wsrep%'. PXC achieves equivalent visibility through status variables and PMM but does not have these information_schema tables natively.

11.3 Embedded Galera and Simplified Installation

Since MariaDB 10.1, Galera support is embedded in the server package (no separate galera lib install). Activated by wsrep_on=ON. PXC ships libgalera_smm.so as a separate package alongside the server package. Operationally minor, but reduces packaging complexity in some deployment automation scenarios.

11.4 MariaDB-Unique Database Features

MariaDB Galera supports several features PXC/MySQL 8.x does not have:

System-Versioned Tables (temporal tables, SQL:2011 standard) work with Galera replication
Sequence objects (CREATE SEQUENCE)
Spider storage engine for horizontal sharding
COMPRESS() / UNCOMPRESS() improvements
Different JSON implementation (LONGTEXT + CHECK): more portable but less performant for JSON operations

12. Performance Characteristics

12.1 Write Throughput and Concurrency Scaling

Both impose identical write amplification: every write executes on every node. Server-level performance differences:

High concurrency (>128 threads): PXC 8.4 (MySQL 8.4 base) scales better. In Percona's 2026 ecosystem benchmark, MySQL 8.4 and Percona Server 8.4 reached 13,325–13,385 TPS at 512 threads. MariaDB variants peak at 128 threads then show notable degradation.
Low concurrency / single-thread: MariaDB performs better. MariaDB 10.11 shows excellent single-thread throughput.
Galera overhead: Write-set certification adds ~2–5% overhead on commit versus standalone MySQL. Identical for both products.

12.2 IST Performance

Percona has demonstrated up to 4x IST improvement in PXC through parallelized write-set application during IST. MariaDB has also improved IST threading but has fewer published benchmarks for direct comparison.

12.3 Large Transaction Performance

Without Streaming Replication, both products behave identically: full write-set certification blocks other nodes during apply. With SR enabled, both achieve similar improvements via Galera 4 library behavior. No meaningful advantage for either product.

13. Key Configuration Differences Summary

Parameter	PXC 8.x	MariaDB Galera 10.x/11.x
wsrep activation	Active when wsrep_provider path set	wsrep_on=ON required explicitly
binlog_format	ROW enforced; cannot change	ROW recommended; MIXED possible with caveats
innodb_autoinc_lock_mode	Must be 2 (pxc_strict_mode enforces)	Defaults to 1; must manually set to 2
SST default method	xtrabackup-v2	mariabackup
Cluster traffic encryption	pxc_encrypt_cluster_traffic=ON (default)	Manual SSL config per subsystem
Safety enforcement	pxc_strict_mode=ENFORCING (default)	No equivalent; manual discipline required
Maintenance drain	pxc_maint_mode=PXCMAINT/MAINTENANCE	Must coordinate externally with load balancer
Online DDL (NBO)	Index ops supported, open source	Enterprise Server only (not community edition)
GTID integration	MySQL GTID native; automatic	wsrep_gtid_mode + wsrep_gtid_domain_id needed
IST progress monitoring	wsrep_ist_receive_seqno_* variables	Log file parsing only
Monitoring	PMM native + 10 extra wsrep_* status vars	WSREP_INFO plugin; PMM needs extra setup
Flow control visibility	wsrep_flow_control_status/interval/interval_low/high	Only basic paused/sent/recv counters
IP allowlist for SST/IST	wsrep_allowlist (8.0+)	Not available
GCache + Write-Set cache encryption	gcache.encryption and allocator.disk_pages_encryption via wsrep_provider_options; Master Key via ALTER INSTANCE ROTATE GCACHE MASTER KEY; keyring plugin/component required (since 8.0.31-23, tech preview)	Not available
FC auto eviction of lagging node	Node self-evicts when FC time exceeds gcs.fc_auto_evict_threshold ratio within gcs.fc_auto_evict_window; older EVS-level evs.auto_evict also available (since 8.0.33-25)	Not available

14. Decision Guidance

Choose PXC when:

You use MySQL Galera and require the shortest possible path to replace it
Application requires MySQL 8.x-specific features (native JSON binary type, improved window functions, roles, invisible columns)
You need enforced cluster safety: pxc_strict_mode catches dangerous configurations before they diverge data
Simplest possible encryption setup is required (pxc_encrypt_cluster_traffic=ON)
Using PMM for monitoring and want native Galera dashboards out of the box
You need NBO (online DDL for indexes) without an enterprise license
Already using Percona XtraBackup for backups, so the SST toolchain is shared
DR replication topology connects to MySQL 8.x replicas (GTID-compatible)
High concurrency (>128 threads) write workloads where MySQL 8.x scales better

Choose MariaDB Galera when:

Application relies on MariaDB-specific features: temporal tables, Sequences, Spider, Aria, MariaDB stored procedure syntax differences
Async replication to MariaDB-native replicas using GTID is required
Team DBA expertise is MariaDB-centric
Low-concurrency workloads where MariaDB single-thread performance is superior
WSREP_INFO plugin’s information_schema tables are useful to your monitoring tooling

Migrating between the two:

Full logical migration only, no physical data directory copy is possible
Use mysqldump or mydumper; expect schema adjustments for JSON columns, auth plugins, and reserved words
Reconfigure entire GTID replication topology
Allow significant application testing time: SQL mode, optimizer, and implicit conversion differences will surface
Test pxc_strict_mode=ENFORCING carefully: existing MariaDB schemas often have tables without PKs or MyISAM tables that will block startup

Stop Guessing Your Kubernetes MySQL Configs: Meet the MySQL Operator Calculator

Let’s be honest: migrating a relational database to Kubernetes sounds fantastic in a whiteboard meeting, but the reality of day-two operations is a completely different story.

When moving MySQL to Kubernetes, the ultimate goal is simple: identify a safe, performant set of configuration values for your database pods. But where do you start? Usually, you look at your overall node resources: say, a machine with 16 CPUs and 64GB of RAM.

In the old bare-metal days, you'd apply the standard rules of thumb:

Set innodb_buffer_pool_size to 60-80% of total RAM to maximize caching.
Allocate 1 innodb_buffer_pool_instances per 1GB of buffer pool.
Match innodb_io_capacity to your drive speeds.
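For reference, this is what the first two legacy rules produce for the 64GB machine mentioned above. It is the bare-metal arithmetic only, shown so the numbers are concrete; the function name and the 75% midpoint are my own choices, and the sketch ignores any server-side cap on the number of instances.

```python
def legacy_bare_metal_sizing(total_ram_gb, pool_fraction=0.75):
    """Apply the old rules of thumb: 60-80% of RAM, one pool instance per GB."""
    if not 0.60 <= pool_fraction <= 0.80:
        raise ValueError("the legacy rule expects a fraction between 0.60 and 0.80")
    pool_gb = total_ram_gb * pool_fraction
    return {
        "innodb_buffer_pool_size_gb": pool_gb,
        "innodb_buffer_pool_instances": max(1, int(pool_gb)),
    }


if __name__ == "__main__":
    print(legacy_bare_metal_sizing(64))
```

These are the values that look reasonable on a dedicated host and that the rest of this post argues against using inside a pod.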

If you try applying these legacy rules in Kubernetes, your pod won't survive.

The Kubernetes Reality Check: OOMKills and Probe Traps

Why do the old rules fail? Because Kubernetes environments lack swap space. If a pod exceeds its assigned memory limit, Kubernetes executes an immediate, destructive action: an OOMKill.

Standard tuning rules don't account for the hidden memory consumers inside a K8s pod. You aren't just allocating memory for MySQL anymore; you have to share the pod's limits across running connections, the routing proxy, monitoring sidecars, and internal database processes.

For example, extensive testing reveals that Percona Server (PS) with Group Replication consumes about 9% to 11% more memory than Percona XtraDB Cluster (PXC) under the exact same load. If you blindly allocate 80% of your RAM to the buffer pool, that extra 10% overhead from Group Replication will push you right over the edge.

Memory isn't the only trap. During OLTP load testing (using sysbench TPC-C), pods can get killed before memory even peaks. The culprit? Kubernetes liveness and readiness probes. Under heavy load, a perfectly healthy database pod might take slightly longer to respond. If your probe timeouts are too short, K8s assumes the pod is dead and kills it with no questions asked.

Step 1: Discover Your Actual Resources

To avoid these pitfalls, you must know what resources you actually have before tuning anything. A 64GB node does not give you 64GB of pod memory. Cloud providers run system pods to manage the cluster, which silently consume your baseline resources.

Before applying any configurations, check your node:

Bash
kubectl describe node <nodename>

You might see something like this in the output:

Plaintext
  Resource           Requests    Limits
  --------           --------    ------
  cpu                702m (4%)   1200m (7%)
  memory             645Mi (1%)  1994Mi (3%)

In this scenario, 7% of the CPU and 3% of the memory are already spoken for. Your 16 CPUs and 64GB of RAM are actually closer to 14 CPUs and 58GB of usable memory. If you base your manual database tuning on the 64GB fantasy, you are already on a collision course with an OOMKill.

You could try to manually scale down your buffers to be "safe" (e.g., arbitrarily dropping the buffer pool to 50%), but then you sacrifice massive amounts of performance.

This is where the guessing game has to stop.

Enter the MySQL Operator Calculator

Built as a lightning-fast, RESTful Go service, the MySQL Operator Calculator is designed to take this exact math entirely out of your hands.

Instead of manually calculating overheads for proxies, monitors, and Group Replication, you simply feed the calculator your actual available pod resources and workload type. It dynamically computes the optimal, mathematically safe configuration parameters for your Kubernetes operator (such as the Percona Operator for MySQL).

Why You Need It in Your Toolkit:

Say Goodbye to OOM Kills: The tool mathematically balances your total allocated memory across the three critical components of a modern K8s database pod: the MySQL engine, the proxy layer, and the monitor agent.
Workload-Aware Tuning: Simply tell the calculator your load type (Read-Heavy, Light OLTP, or Heavy OLTP), and it adjusts the buffers and threads accordingly.
Automation: Designed with modern infrastructure in mind, the calculator outputs clean, structured JSON. You can easily curl the API from your CI/CD pipelines to automatically inject calculated configurations into your Helm charts.
Auto-Calculated Connections: Not sure how many connections your memory limit can safely handle? Pass 0 for connections, and the tool will calculate the maximum safe threshold for you.

How It Works in Practice

Getting your optimized configuration is as simple as making an HTTP request. Let's say you have a heavy OLTP Percona XtraDB Cluster (PXC), you've identified you have exactly 4 CPUs and 2.5GB of RAM available, and you want the tool to figure out the max connections for MySQL 8.0.33. Just ask:

Bash
curl -i -X GET -H "Content-Type: application/json" -d '{
  "output": "human",
  "dbtype": "pxc",
  "dimension": {
    "id": 999,
    "cpu": 4000,
    "memory": "2.5G"
  },
  "loadtype": {"id": 3},
  "connections": 0,
  "mysqlversion": {"major": 8, "minor": 0, "patch": 33}
}' http://127.0.0.1:8080/calculator

Using the human output flag gives you a highly readable, my.cnf-style output, while the json flag provides structured data detailing the exact configuration section, calculated value, and the safe minimums/maximums used in the background math.
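If you prefer to call the service from a script rather than from curl, the same request can be built in Python. This is a minimal sketch that only mirrors the fields of the curl call above. The function name build_calculator_request is made up for this example, and I am assuming urllib sends the JSON body along with the GET, as curl does.

```python
import json
import urllib.request


def build_calculator_request(base_url, dbtype, cpu_millicores, memory,
                             loadtype_id, connections, mysql_version,
                             output="json"):
    """Build the request for the /calculator endpoint (mirrors the curl call)."""
    major, minor, patch = mysql_version
    payload = {
        "output": output,
        "dbtype": dbtype,
        "dimension": {"id": 999, "cpu": cpu_millicores, "memory": memory},
        "loadtype": {"id": loadtype_id},
        "connections": connections,  # 0 asks the tool to compute the maximum
        "mysqlversion": {"major": major, "minor": minor, "patch": patch},
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/calculator",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",  # GET with a JSON body, as in the curl example
    )


req = build_calculator_request("http://127.0.0.1:8080", "pxc", 4000, "2.5G",
                               3, 0, (8, 0, 33))
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually call a running calculator (not executed here):
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes it easy to unit test the payload in a CI/CD pipeline before pointing it at a real calculator instance.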

Ready to Stop Guessing?

Container orchestration is complex enough without having to manually calculate memory overheads on a calculator app at 2:00 AM during an outage. By programmatically determining your limits, you ensure your database remains stable, performant, and perfectly sized for its environment.

This is why I developed this tool, initially for my personal use, but I think it can be useful to others, so here we go:

Check out the source code, compile the binary, and start optimizing your clusters today by visiting the MySQL Operator Calculator on GitHub.

Or you can try it using the service at tusacentral.net:8080, like this:

curl -i -X GET -H "Content-Type: application/json" -d '{
"output":"human",
"dbtype":"group_replication", 
"dimension": {"id": 999, "cpu": 16000, "memory": "64G"}, 
"loadtype": {"id": 3}, 
"connections":1500,
"mysqlversion":{"major":8,"minor":4,"patch":8},
"providercostpct":0.10}' http://tusacentral.net:8080/calculator

This endpoint is only a demo and must not be relied on as a production service. Please build and run your own server for that.

Of course, you use the generated settings at your own risk. I am not taking any responsibility if they do not work for you, so test them over and over and check that they match your needs.

Also read the recent blogs https://tusacentral.net/joomla/index.php/mysql-blogs/265-group-replication-vs-percona-xtradb-cluster-the-true-cost-of-consistency and https://tusacentral.net/joomla/index.php/mysql-blogs/266-the-failover-brownout-rethinking-high-availability-in-mysql-group-replication
They are VERY important to understand what is going on in the operators, especially the one using Group Replication.

PRs and issue reports are welcome.

The Failover Brownout: Rethinking High Availability in MySQL Group Replication

It is time to talk again about Flow Control and Group Replication, this time with a special eye on the use of Group Replication in the Kubernetes context. In this article we will dig a bit into how it works and what its various side effects are.

The problem

Recently I was refining the calculations used in the MySQL Operator Calculator, because I kept running into a very serious problem with the Percona Server Operator.

The problem is that whenever the deployment serves a high level of traffic, it ends up, no matter what, getting OOMKilled by the K8s system.

This happens because the pod gradually consumes more and more memory until it reaches the memory limit set in the CR specification.

Now let me clarify a few things, to get straight to the facts.

Strictly speaking, Kubernetes itself does not OOMKill a pod for hitting its memory limit. The mechanism works as described below, including how the Working Set Size (WSS) is calculated and how OOMKills are triggered. Links to the official documentation and source code are in the resources section.

1. The Reality of OOMKills vs. Kubelet Evictions

It is crucial to distinguish between what the Linux kernel does and what Kubernetes does:

OOMKilled (Exit Code 137): This is executed entirely by the Linux kernel's OOM Killer, not Kubernetes. When we set a memory limit in our Pod spec, Kubernetes translates that into a Linux cgroup constraint (memory.limit_in_bytes for cgroups v1, or memory.max for cgroups v2). If our container attempts to allocate more memory than this hard limit, and the kernel cannot reclaim any page cache (like inactive files), the kernel directly intervenes and terminates the process.
Node-Pressure Evictions: This is where Kubernetes actively observes memory. The kubelet monitors the working_set_bytes metric to protect the node from running out of memory. If the node's available memory drops below an eviction threshold, Kubernetes will actively evict pods to prevent the kernel from initiating a system-wide OOM kill.

2. How Working Set Size (WSS) is Calculated for the container

Kubernetes monitors container memory via cAdvisor, which is integrated directly into the kubelet. cAdvisor calculates the Working Set Size by taking the total memory usage and subtracting the inactive file cache (memory that the kernel can easily reclaim if it faces memory pressure).

Because active file caches and anonymous memory (like our application's heap) cannot be easily evicted, this working set metric is the most accurate representation of the memory your container is forcing the system to hold.

The Calculation & cgroups Evolution: The core mathematical calculation is Memory Usage - Inactive File Cache, but how cAdvisor fetches this data from the Linux kernel depends entirely on your node's cgroup version. Modern cAdvisor relies heavily on the opencontainers/runc/libcontainer library to read these raw cgroup files:

cgroups v1: cAdvisor starts with the raw usage from memory.usage_in_bytes and subtracts the reclaimable cache found under the total_inactive_file key.
cgroups v2 (Unified): cAdvisor starts with the raw usage from memory.current and subtracts the reclaimable cache found under the inactive_file key.

The Underlying Code Logic: While older versions used a static setMemoryStats function, modern Kubernetes branches handle this dynamically. The logic executes the following flow before reporting back to the kubelet:

Detects Version: It identifies whether the node runs cgroups v1 or v2 to determine the correct inactive file key name.
Fetch Usage: It pulls the raw memory usage from the container.
Subtract Cache: It looks up the inactive file value and safely subtracts it from the usage (including a safeguard to ensure the working set never drops below zero).
Report Metric: It sets this final calculated value as container_memory_working_set_bytes, which the kubelet then uses to decide if the node is under memory pressure.
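The flow above can be sketched as follows. This is not the cAdvisor code, only a small Python illustration of the same calculation. The helper names and the sample numbers are made up, while the key names total_inactive_file and inactive_file are the ones mentioned above.

```python
def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage_bytes, memory_stat, cgroup_v2):
    """Usage minus inactive file cache, never below zero."""
    # Step 1: pick the key name according to the cgroup version
    key = "inactive_file" if cgroup_v2 else "total_inactive_file"
    # Steps 2 and 3: take the raw usage and subtract the reclaimable cache
    inactive = memory_stat.get(key, 0)
    return max(usage_bytes - inactive, 0)


# Hypothetical values, as they could be read from memory.current and
# memory.stat on a cgroups v2 node
sample_stat = parse_memory_stat(
    "anon 3000000000\nfile 900000000\ninactive_file 600000000\n"
)
print(working_set_bytes(4000000000, sample_stat, cgroup_v2=True))  # 3400000000
```

The max() call plays the role of the safeguard that prevents the working set from going negative when the inactive file cache is reported as larger than the usage.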

Back to us

In the end, the point is that if our pod reaches the limit and we are NOT using the new swap feature available in Kubernetes, our pod will be brutally killed, and in 99% of the cases our production will suffer a lot. (Oops, spoiler!)

To clearly understand what was causing this memory consumption, and why my calculator was getting it wrong, I started to collect information about memory usage in MySQL itself.

SELECT EVENT_NAME,
       CURRENT_NUMBER_OF_BYTES_USED / 1024 / 1024 AS current_usage_mb
FROM performance_schema.memory_summary_global_by_event_name
WHERE EVENT_NAME LIKE 'memory/%'
  AND EVENT_NAME NOT LIKE 'memory/performance%'
ORDER BY current_usage_mb DESC
LIMIT 25;

Which will give you an output like this:

+---------------------------------------+------------------+
| EVENT_NAME                            | current_usage_mb |
+---------------------------------------+------------------+
| memory/innodb/buf_buf_pool            |   46398.92578125 |
| memory/group_rpl/GCS_XCom::xcom_cache |    1066.66179943 |
| memory/group_rpl/certification_info   |      92.45250702 |
| memory/innodb/log_buffer_memory       |      64.00096130 |
| memory/sql/TABLE                      |      49.90627003 |
| memory/innodb/memory                  |      34.68734741 |
| memory/innodb/ut0link_buf             |      24.00006104 |
| memory/innodb/lock0lock               |      21.40064240 |
| memory/mysqld_openssl/openssl_malloc  |       9.51009655 |
| memory/innodb/read0read               |       8.19496155 |
| memory/mysys/KEY_CACHE                |       8.00215149 |
| memory/innodb/sync0arr                |       7.03147125 |
| memory/innodb/ha_innodb               |       6.87006950 |
| memory/innodb/lock_sys                |       5.25009155 |
| memory/sql/log_sink_pfs               |       5.00003052 |
| memory/innodb/ut0pool                 |       4.00017548 |
| memory/sql/dd::objects                |       2.83031464 |
| memory/innodb/std                     |       2.72618866 |
| memory/innodb/os0file                 |       2.63054657 |
| memory/innodb/os0event                |       2.34302521 |
| memory/sql/TABLE_SHARE::mem_root      |       2.31734467 |
| memory/innodb/trx0trx                 |       2.22647858 |
| memory/temptable/physical_ram         |       1.00003052 |
| memory/sql/dd::String_type            |       0.94942093 |
| memory/innodb/btr0pcur                |       0.89743423 |
+---------------------------------------+------------------+

I also used PMM to collect memory information.

To simulate the load I used the sysbench-tpcc variant (a TPC-C derived test) and ran the tests with 1024 threads against a cluster built on machines with 16 cores and 64GB of RAM, with volumes delivering ~3k IOPS, so not gigantic but not small.

The finding was almost immediate:

+---------------------------------------+------------------+
| EVENT_NAME                            | current_usage_mb |
+---------------------------------------+------------------+
| memory/innodb/buf_buf_pool            |   46398.92578125 |
| memory/group_rpl/certification_info   |    1431.67934418 | <constantly increasing
| memory/group_rpl/GCS_XCom::xcom_cache |    1066.63542366 |
| memory/sql/Gtid_set::Interval_chunk   |      95.52413940 |
| memory/innodb/log_buffer_memory       |      64.00096130 |
| memory/sql/TABLE                      |      48.17613125 |
| memory/innodb/memory                  |      35.08897400 |
| memory/innodb/ut0link_buf             |      24.00006104 |
| memory/innodb/lock0lock               |      21.40064240 |
| memory/innodb/read0read               |      14.86782837 |
| memory/mysqld_openssl/openssl_malloc  |      12.05916119 |
| memory/mysys/KEY_CACHE                |       8.00215149 |
| memory/innodb/sync0arr                |       7.03147125 |
| memory/innodb/ha_innodb               |       6.84074974 |
| memory/innodb/lock_sys                |       5.25009155 |
| memory/sql/log_sink_pfs               |       5.00003052 |
| memory/innodb/ut0pool                 |       4.00017548 |
| memory/sql/dd::objects                |       2.82012177 |
| memory/innodb/std                     |       2.72515869 |
| memory/innodb/os0file                 |       2.63054657 |
| memory/innodb/os0event                |       2.35884857 |
| memory/innodb/trx0trx                 |       2.22647858 |
| memory/sql/TABLE_SHARE::mem_root      |       1.83777618 |
| memory/innodb/trx0undo                |       1.26304626 |
| memory/mysys/lf_node                  |       1.08828735 |
+---------------------------------------+------------------+
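Spotting the grower by eye works with two outputs, but when you collect many samples it is handy to diff them. Here is a minimal sketch, with a made-up helper name (growth_mb), that compares two snapshots. The values are a few rows copied from the two outputs above.

```python
def growth_mb(before, after):
    """Return (event, delta_mb) pairs sorted by largest growth first.

    Only events present in both snapshots are compared.
    """
    deltas = [(name, after[name] - before[name])
              for name in before if name in after]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


# A few rows copied from the two outputs above (values in MB)
before = {
    "memory/innodb/buf_buf_pool": 46398.92578125,
    "memory/group_rpl/GCS_XCom::xcom_cache": 1066.66179943,
    "memory/group_rpl/certification_info": 92.45250702,
    "memory/innodb/read0read": 8.19496155,
}
after = {
    "memory/innodb/buf_buf_pool": 46398.92578125,
    "memory/group_rpl/GCS_XCom::xcom_cache": 1066.63542366,
    "memory/group_rpl/certification_info": 1431.67934418,
    "memory/innodb/read0read": 14.86782837,
}

for name, delta in growth_mb(before, after)[:2]:
    print(f"{name}: {delta:+.2f} MB")
```

With these rows, memory/group_rpl/certification_info comes out on top, with a growth of more than 1.3 GB, while the buffer pool does not move at all.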

Ok then … What is the certification info???

What is group_rpl/certification_info?

In MySQL, memory/group_rpl/certification_info is a Performance Schema memory instrument. It tracks the exact amount of RAM allocated to store the Certification Database (or Certification Info).

In Group Replication, nodes do not lock rows across the network while a transaction is executing. Instead, transactions execute locally and optimistically. When it is time to commit, the transaction undergoes a Certification Process to ensure no other concurrent transaction in the cluster has modified the exact same rows. The certification_info buffer is the in-memory hash map that makes this conflict detection possible.

1. What is it used for?

The certification_info structure acts as a tracking ledger for recently modified rows.

Here is how it works under the hood:

The Key-Value Pair: It is fundamentally an in-memory dictionary. The key is the hash of a modified row (extracted from the transaction's "write set"), and the value is the Global Transaction Identifier (GTID) of the transaction that successfully modified it.
Conflict Detection: When a new transaction attempts to commit, it broadcasts its write set and the "snapshot version" of the database it saw when it started. The certifier cross-references the incoming transaction's write set against the certification_info map.
The Decision: If the certification_info shows that a row was modified by a newer GTID that the incoming transaction did not "see" when it started, a conflict is flagged, and the transaction is aborted. If no conflict exists, the transaction is certified, and the certification_info map is updated with the new write set and GTID.
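To make the three steps above concrete, here is a toy model of the certifier. It is a deliberate simplification and not the Group Replication implementation: plain integers stand in for GTIDs and for the snapshot version, strings stand in for row hashes, and the class name ToyCertifier is invented for this example.

```python
class ToyCertifier:
    """Toy model of certification: row hash -> sequence number of last writer."""

    def __init__(self):
        self.certification_info = {}
        self.last_seq = 0  # stands in for the GTID sequence number

    def certify(self, write_set, snapshot_seq):
        """Return the assigned sequence number, or None on conflict.

        write_set: iterable of row hashes modified by the transaction.
        snapshot_seq: highest sequence number the transaction had seen
                      when it started executing.
        """
        for row in write_set:
            if self.certification_info.get(row, 0) > snapshot_seq:
                return None  # a newer writer touched this row: abort
        self.last_seq += 1
        for row in write_set:
            self.certification_info[row] = self.last_seq
        return self.last_seq


certifier = ToyCertifier()
print(certifier.certify({"t1:pk=10"}, snapshot_seq=0))              # 1
print(certifier.certify({"t1:pk=10"}, snapshot_seq=0))              # None, conflict
print(certifier.certify({"t1:pk=10", "t1:pk=11"}, snapshot_seq=1))  # 2
```

The second transaction is rejected because it started from a snapshot that did not include the first one. The third one passes because it had already seen sequence number 1.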

The primary does not hold onto this memory out of stubbornness; it does so because purging that data too early would destroy the cluster's consistency in the event of a failover.

In Group Replication, garbage collection for the certification_info buffer is not triggered just because a transaction commits on the primary. It is triggered by a concept called the Stable Set.

Every node in the cluster periodically broadcasts a message to the rest of the group saying, "Here are the GTIDs I have successfully applied to my disk." The cluster then calculates a global low watermark. This watermark is the highest transaction GTID that every single member of the group has successfully applied. Garbage collection is only allowed to purge write-sets from the certification database that fall below this global watermark. Note that this purge is a synchronous operation during which writes are forbidden.

2. How the Apply Queue Stalls the Watermark

When a secondary node starts lagging, its applier queue grows. This means the secondary is receiving transactions from the network quickly, but its SQL thread is too slow to actually execute them and commit them to disk.

Because the secondary hasn't applied these transactions, it cannot report those GTIDs back to the group as "finished."

The lagging secondary's local watermark stalls.
Therefore, the global low watermark for the entire cluster stalls.
Because the global watermark hasn't moved forward, the garbage_collect function on the primary (and all other nodes) says, "I am not allowed to delete any write-sets yet."
As the primary continues to process new writes, the certification_info memory buffer grows continuously.
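The stall described in the steps above can be reproduced with a few lines of Python. This is a simplified sketch, not the server logic: integers stand in for GTIDs, the member names and numbers are hypothetical, and the function names are made up for the example.

```python
def global_low_watermark(applied_by_member):
    """Highest sequence number that every member has applied."""
    return min(applied_by_member.values())


def garbage_collect(certification_info, watermark):
    """Keep only entries whose last writer is above the watermark."""
    return {row: seq for row, seq in certification_info.items()
            if seq > watermark}


# Hypothetical state: 1000 rows written by transactions 1..1000
certification_info = {f"row-{n}": n for n in range(1, 1001)}

healthy = {"primary": 1000, "secondary-1": 990, "secondary-2": 995}
lagging = {"primary": 1000, "secondary-1": 990, "secondary-2": 40}

print(len(garbage_collect(certification_info, global_low_watermark(healthy))))  # 10
print(len(garbage_collect(certification_info, global_low_watermark(lagging))))  # 960
```

A single lagging member is enough to pin the watermark: in the second case 960 of the 1000 entries must stay in memory on every node.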

3. Why the Primary Cannot Purge Early

We might wonder: If the transaction is already committed on the primary, why does the primary care if the secondary has applied it? Why not just drop the write-set from its own memory?

The answer comes down to Failover Safety and Distributed Conflict Detection. GR is a shared-nothing, decentralized architecture. Even if you are running in Single-Primary mode (keep this in mind, it will be important later), the underlying engine uses the exact same logic as Multi-Primary mode.

Here is why the primary is forbidden from purging that data:

The Failover Scenario: Imagine our primary node crashes right now. The lagging secondary (which still has a massive apply queue) is immediately elected as the new primary.
The Conflict Risk: As the new primary, it starts accepting new writes from your application. However, it still has thousands of old transactions in its applier queue that it hasn't written to disk yet!
The Necessity of the Buffer: When a new write comes in, the new primary must check if that write conflicts with any of the pending transactions in its apply queue. It does this by checking the certification_info map. If the old primary had purged the global certification data early, the new primary wouldn't have the write-sets for those pending transactions. It would blindly accept the new write, causing a massive data conflict and breaking the replication group entirely.

Fine Marco, then what is the effect of this?

Well, drum roll …

… When a secondary node is elected as the new primary during a failover, it does not immediately open the floodgates to new writes. It keeps its super_read_only variable set to ON until it has completely drained its local apply queue of all transactions that were certified prior to the election.

This is an intentional design choice to guarantee that the new primary's state is completely consistent with the old primary before it starts accepting new data.

4. Immediate Write Rejections (No Built-in Queuing)

The most critical impact to understand is that the new primary does not queue or pause new incoming writes while it catches up. It outright rejects them.

If our application or proxy routes a COMMIT, INSERT, UPDATE, or DELETE to the new primary while it is still processing the old queue, MySQL will immediately throw an error back to the client:

ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
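Since the server rejects the write instead of queuing it, any waiting has to happen on the application or proxy side. A minimal sketch of a retry wrapper follows. The DatabaseError class is only a stand-in for whatever exception your driver raises, and the function name, the number of attempts, and the delays are assumptions for illustration.

```python
import time

ER_OPTION_PREVENTS_STATEMENT = 1290  # the error shown above


class DatabaseError(Exception):
    """Stand-in for a driver exception carrying the MySQL error number."""

    def __init__(self, errno, msg=""):
        super().__init__(msg)
        self.errno = errno


def write_with_retry(do_write, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call do_write(), retrying with exponential backoff on error 1290."""
    for attempt in range(attempts):
        try:
            return do_write()
        except DatabaseError as exc:
            last_attempt = attempt == attempts - 1
            if exc.errno != ER_OPTION_PREVENTS_STATEMENT or last_attempt:
                raise  # other errors, or retries exhausted: give up
            sleep(base_delay * 2 ** attempt)


attempts_seen = []


def flaky_write():
    """Fails twice with 1290, then succeeds (simulates the catch-up phase)."""
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise DatabaseError(ER_OPTION_PREVENTS_STATEMENT, "super-read-only")
    return "committed"


print(write_with_retry(flaky_write, base_delay=0.01))  # committed
```

A wrapper like this only hides short catch-up phases. If the apply queue is large, the retries will run out long before the node becomes writable.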

5. The "Brownout" Window (Write Outage)

Because of this behavior, a failover in MySQL Group Replication does not instantly restore write availability. Our cluster experiences a "brownout", a period where reads might succeed, but writes are entirely blocked.

The duration of this write outage is directly proportional to the size of the apply queue.

If the secondary was fully caught up, write availability is restored in milliseconds.
If the secondary was lagging by 50 minutes, your application will suffer a 50 minute write outage while the node applies the backlog.
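As a back-of-the-envelope estimate, the write outage is the queue size divided by the rate at which the node can apply it. The sketch below assumes a steady apply rate, which is optimistic, and the figures are hypothetical.

```python
def brownout_seconds(queued_transactions, apply_rate_tps):
    """Estimated write outage: time to drain the apply queue at a steady rate."""
    if apply_rate_tps <= 0:
        raise ValueError("apply rate must be positive")
    return queued_transactions / apply_rate_tps


# Hypothetical figures: a secondary applying 2,000 transactions per second
print(brownout_seconds(0, 2000))          # 0.0, fully caught up
print(brownout_seconds(6_000_000, 2000))  # 3000.0 seconds, i.e. 50 minutes
```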

6. Impact on Proxies (e.g., MySQL Router or ProxySQL)

If we are using a proxy layer to route our database traffic, the apply queue dictates how the proxy behaves during the transition:

MySQL Router: It continuously monitors the cluster topology and the super_read_only flag. Even though the node has technically been elected primary, Router will not open the read-write port to it until the apply queue drains and super_read_only flips to OFF. Depending on your application timeouts, client connections will either hang waiting for a writable connection or fail completely.
ProxySQL: Similar to Router, if it is configured to check for the read_only state, it will temporarily quarantine the new primary from the write hostgroup.
HAProxy (in the Operator): It monitors both the primary state and the read_only state, but it still exposes the primary to writes, causing the application to fail (a bug we need to fix).

7. Read Traffic and Stale Data

During this catch-up phase, the node will accept incoming SELECT queries (since it is still a valid database). However, because it is actively churning through the old primary's backlog, the data being read is temporarily stale.

If your application reads a row that is sitting in the apply queue but hasn't been committed to disk yet, it will get the old version of that row.

Why Flow Control is Critical

Because a large apply queue turns a seamless failover into a severe, application-breaking write outage, Group Replication includes the Flow Control feature.

Flow Control monitors the size of the apply queues across all secondaries. If a secondary starts lagging too far behind, Flow Control should actively throttle the write throughput on the current primary to allow the lagging node to catch up. It is essentially a trade-off: we accept a slight performance hit during normal operations to guarantee that your database recovers almost instantly during a failover.

However, this is not what really happens.

1. It is Reactive, Not Proactive (The Polling Blind Spot)

Flow control does not intercept and evaluate every single transaction in real-time. Instead, it relies on a periodic polling interval governed by group_replication_flow_control_period (which defaults to 1 second).

Once a second, the cluster checks the size of the apply queues and the certifier queues.

The Vulnerability: If our application generates a massive spike of 50,000 writes in 500 milliseconds, the primary will happily accept and certify all of them. Flow control will not even notice the spike until the next 1 second polling interval hits. By the time it decides to apply a throttle, the damage is already done, and the secondary's queue is already overflowing.

2. The PID Controller's "Soft Brake" Math

When flow control does decide to throttle, it does not simply freeze the primary. It uses a PID (Proportional-Integral-Derivative) controller algorithm to calculate a "write quota" (the maximum number of transactions the primary is allowed to commit in the next second).

The PID controller is deliberately tuned to be gentle. It wants to gracefully degrade performance rather than cause immediate application timeouts.

When the secondary's queue breaches the group_replication_flow_control_applier_threshold (default 25,000 transactions), the PID controller reduces the primary's quota incrementally.
The Failure Point: If the primary's incoming write rate is astronomically higher than the secondary's disk IO capacity, this incremental "step down" in the quota is too slow. The primary is still allowed to write, say, 10,000 transactions per second, while the secondary is only applying 2,000. The queue continues to grow aggressively despite the throttle being "active."
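A toy simulation shows why a once-per-second, incremental brake is not enough. This is not the real flow control algorithm: the 10% step-down per period is an assumed value, and the rates are the example numbers used above (10,000 writes per second on the primary, 2,000 applied per second on the secondary).

```python
def simulate_queue(seconds, incoming_tps, apply_tps, threshold, step_down_pct=10):
    """Return the secondary's queue size at the end of each one-second period.

    Toy model: the quota is re-evaluated only once per period, and it shrinks
    by step_down_pct percent (assumed) whenever the queue is above threshold.
    """
    queue = 0
    quota = incoming_tps  # no throttling at the start
    history = []
    for _ in range(seconds):
        written = min(incoming_tps, quota)
        queue = max(queue + written - apply_tps, 0)
        history.append(queue)
        if queue > threshold:  # the check happens only at the period boundary
            quota = max(quota * (100 - step_down_pct) // 100, 1)
    return history


history = simulate_queue(seconds=10, incoming_tps=10_000, apply_tps=2_000,
                         threshold=25_000)
print(history)
```

In this model the throttle only engages after the fourth second, when the queue is already at 32,000, and the queue keeps growing for the rest of the run because the quota is still far above what the secondary can apply.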

3. The Concurrency Mismatch (Parallel vs. Serial)

This is often the silent killer that defeats flow control. Flow control makes mathematical assumptions about how fast the secondary should be able to apply transactions based on recent history.

However, the primary node might be executing writes using hundreds of highly concurrent threads. The secondary relies on the parallel applier to keep up. If the incoming workload suddenly includes transactions that cannot be parallelized, such as writes hitting overlapping rows, cascading foreign key updates, or DDL statements, the secondary's applier instantly drops from executing in parallel down to a single, serialized thread.

When this serialization happens, the secondary's applier rate plummets instantly. Flow control, which only checks in once a second and adjusts gradually, cannot brake the primary fast enough to compensate for the secondary suddenly dropping to a crawl.

What can we do?

At the time of writing there are only two things that can be done.

Make Flow control more aggressive
Increase the number of replication appliers

1. Making Flow Control More Aggressive

We can configure Flow Control to be a bit more aggressive. It will still remain a suggestion but a strong one.

How it works (The Configuration):

Lower the Threshold: By reducing group_replication_flow_control_applier_threshold (default is 25,000) to something like 1,000 or 500, we force the PID controller to kick in almost immediately when a spike occurs.
Remove the Safety Net: By keeping group_replication_flow_control_min_quota at 0 (the default), we remove the minimum write guarantee. If the secondary falls behind, Flow Control is allowed to throttle the primary's writes down to zero, even though in practice this never happens.
Increase the Sensitivity: We can tweak the PID controller's math (using the derivative and proportional tuning variables) to react much more aggressively to queue growth, for example with group_replication_flow_control_hold_percent=100 and group_replication_flow_control_release_percent=5.

The reality check, does it work?:

If the expectation is to have rigid control over the applier queue on the lagging secondary, then the answer is NO. At the moment, flow control is simply not designed to act as we are used to in PXC (Percona XtraDB Cluster), where we have rigid control of the pending queue even at the cost of delaying writes. In Group Replication, Flow Control will never bring writes down to 0, and unfortunately the mechanism is not enough to keep the queue under control.

2. Increasing Replication Appliers

To help the secondary chew through the queue faster, we can increase the number of parallel threads it uses to write to disk.

How it works: We can increase the replica_parallel_workers (formerly slave_parallel_workers) setting. GR is exceptionally smart about this. Because of the certification process we discussed earlier, GR already knows exactly which transactions modify which rows. It uses a writeset-based dependency tracker to safely hand off non-conflicting transactions to multiple worker threads simultaneously. A commonly used rule of thumb is to set 2.5 workers for each available core. For example, if we have 14000m CPUs in our CR (K8s), we can assign ~35 workers, which is definitely higher than the default value of 4.
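The rule of thumb is easy to turn into a helper that takes the CPU value exactly as it is written in the CR. The function below is only a sketch of that 2.5 workers per core rule, and its name is not tied to any existing tool.

```python
def replica_parallel_workers(cpu_millicores, workers_per_core=2.5):
    """Rule of thumb from the text: 2.5 applier workers per available core."""
    cores = cpu_millicores / 1000
    return max(int(cores * workers_per_core), 1)


print(replica_parallel_workers(14000))  # 35
print(replica_parallel_workers(4000))   # 10
```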

The reality check, does it work?: Yes, but only if our workload allows it.

The Catch - The Serialization Wall: Parallel appliers only work if the transactions do not conflict. If our application has 50 concurrent threads all trying to update the same "inventory count" row, or updating a highly contentious table, those transactions cannot be parallelized. The secondary's coordinator thread will see the row-level conflicts and force those transactions to wait in line and execute sequentially. We could allocate 128 parallel workers, but 127 of them will sit idle while one thread does all the work.
The Catch - Context Switching: More threads do not magically create more disk IOPS. If we set the workers too high (e.g., beyond the physical CPU core count or disk IO capacity), the secondary's InnoDB engine will spend more time context-switching and fighting over internal mutex locks than actually committing data. In many cases, over-allocating parallel workers actually slows down the apply rate.

Do we have any conclusions?

1. If HA is the goal, enforce Strict Flow Control

If our absolute top priority is High Availability, specifically achieving a near-zero Recovery Time Objective (RTO), we must configure an aggressive flow control.

The Logic: Fast failovers require small apply queues. To guarantee a small apply queue, we must strictly throttle the primary the millisecond the secondary starts to lag.
The Trade-off: we are protecting the cluster's failover readiness at the expense of application write latency. If there is a massive write spike, our application will face timeouts and connection errors, but if the primary server suddenly catches fire, our database will recover and elect a new primary almost instantly.

The problem is that Group Replication is not able to act like that today. This is something we eventually need to implement to have better HA.

2. If Performance is the goal, relax Flow Control

If our top priority is keeping the application fast and ensuring COMMIT latencies remain extremely low, we should relax flow control or rely on the generous defaults.

The Logic: By relaxing flow control, we allow the primary to run at the absolute maximum speed its local disks and CPU allow. It does not care if the secondaries fall behind. Our application users remain happy and experience zero throttling.
The Trade-off: We are accepting severe risks to our HA posture. If the primary crashes while the secondaries have a massive apply queue, we will suffer a long write outage (the brownout) while the new primary catches up. Additionally, we are accepting the risk that the certification_info memory buffer will grow significantly on the primary and eventually get the pod OOMKilled.

3. Is this not what Asynchronous replication with semi-sync offers?

1. The Similarities

If we look purely at how a single transaction flows and how a failover behaves, GR and Semi-Sync look like twins:

The Durability Guarantee:
- Semi-Sync: The primary waits to commit until at least one secondary confirms it has received the transaction and written it to its local Relay Log.
- GR: The primary waits to commit until a majority quorum of nodes confirm they have received the transaction, certified it, and written it to their local relay logs.
The Failover Delay (The Queue):
- In both systems, the secondary receiving the data does not mean the secondary has applied the data to its InnoDB tables.
- If a crash happens, both systems require the new primary to completely execute its pending queue (Relay Log for Semi-Sync, Apply Queue for GR) before it is safe to accept new writes.

2. The Crucial Differences

If they behave so similarly, why use GR at all? The differences lie entirely in automation, consensus, and split-brain protection. Semi-Sync is just a data transport mechanism; GR is a full state-machine cluster.

Here is what GR gives you that Semi-Sync does not:

Automatic Election and Orchestration:
- Semi-Sync: If the primary dies, Semi-Sync does nothing. The cluster sits there broken. You must rely on external tools (like Orchestrator or manual DBA intervention) to detect the crash, pick the most up-to-date secondary, wait for its relay log to apply, disable read_only, and re-point the application.
- GR: The cluster detects the failure natively. The remaining nodes use Paxos consensus to elect a new primary automatically, manage the queue drain natively via the super_read_only flip we discussed, and self-heal.
Split-Brain Protection (Network Partitions):
- Semi-Sync: If our network splits in half, an external failover tool might accidentally promote a secondary while the old primary is still alive and accepting writes. We now have a split-brain, and our data is permanently corrupted.
- GR: GR enforces strict quorum. If a network split happens, the side of the network with the minority of nodes will automatically fence itself off and refuse all writes. Split-brain is mathematically prevented.
The Certification Database:
- As we established, GR requires the certification map to ensure the new primary doesn't accept writes that conflict with its unapplied queue. Semi-Sync does not have this; it relies entirely on the external failover tool to guarantee no writes touch the new primary until the relay log is 100% applied.
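The quorum rule behind the split-brain protection can be sketched in a few lines. This is only an illustration of the majority arithmetic, not GR's actual implementation; the function name is hypothetical.

```python
def has_quorum(reachable_members: int, group_size: int) -> bool:
    """Return True when a partition holds a strict majority of the group.

    reachable_members counts the local node plus every member it can
    still communicate with.
    """
    if not 0 <= reachable_members <= group_size:
        raise ValueError("reachable_members must be between 0 and group_size")
    return reachable_members > group_size // 2


# A 3-node group split 2/1: only the side with two nodes keeps accepting writes.
print(has_quorum(2, 3), has_quorum(1, 3))
```

Note that an even split (for example 2/2 in a 4-node group) leaves neither side with a majority, which is one reason odd group sizes are the usual choice.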

3. Final observation

If we are using Single-Primary GR with relaxed flow control, we have essentially built a highly-automated, consensus-driven version of Semi-Sync replication.

We have the exact same apply-queue bottleneck during failover, but we have traded the need for external orchestrator tools for built-in Paxos consensus and native split-brain protection.

Conclusions (for real)

When we run MySQL on a traditional, dedicated Virtual Machine, memory limits are "soft." If the certification_info database explodes and consumes an extra 10GB of RAM because of the applier lag, the Linux OS might start aggressively swapping inactive pages to disk, but the MySQL process usually survives. Performance degrades, but the database stays online.

In Kubernetes, memory limits are "hard." As we discussed earlier, Kubernetes enforces pod memory limits via cgroups v2 (memory.max). The Linux kernel's OOM Killer has no understanding of database quorum, failover states, or apply queues. It only sees math: Working Set Size > memory.max = Terminate Process (Exit Code 137).
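The comparison above can be sketched as follows. This is a simplified illustration that assumes the working set is approximated as memory.current minus the inactive_file value from memory.stat, which is how the metric is commonly derived. The function names are hypothetical, and the inputs are passed in as text so the sketch does not depend on a particular cgroup path.

```python
def working_set_bytes(memory_current: int, memory_stat_text: str) -> int:
    """Approximate the working set as usage minus inactive file-backed pages."""
    inactive_file = 0
    for line in memory_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "inactive_file":
            inactive_file = int(value)
            break
    return max(memory_current - inactive_file, 0)


def at_risk(memory_current: int, memory_stat_text: str, memory_max_text: str) -> bool:
    """True when the approximated working set has reached the cgroup limit."""
    limit = memory_max_text.strip()
    if limit == "max":  # cgroup v2 reports "max" when no limit is set
        return False
    return working_set_bytes(memory_current, memory_stat_text) >= int(limit)


# Hypothetical sample values, in bytes.
stat = "anon 3500000000\ninactive_file 300000000\nactive_file 200000000"
print(at_risk(4_000_000_000, stat, "4294967296"))
```

In this sample the pod is still below its limit, because 300 MB of its usage is reclaimable page cache.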

The Chain Reaction of Relaxed Flow Control in k8s

If we prioritize "performance" by relaxing Flow Control in a Kubernetes environment, we are essentially setting a ticking time bomb. Here is the chain of events:

The Spike: Our application experiences a massive write spike.
The Queue: The secondary pod's disk cannot keep up, and its applier queue grows to 1,000,000 transactions.
The Memory Sprawl: Because the queue is large, the global low-watermark stalls. The Primary pod is forbidden from garbage collecting the certification_info map. The in-memory hash map balloons in size.
The Execution: When memory.current reaches memory.max, the kernel starts the OOMKill process. Its first action is to try to free the page cache related to the process. If the purge succeeds and memory.current drops below memory.max, the process survives; otherwise the kernel kills it. We can use the Working Set Size (WSS) metric to predict whether the OOMKill will go through: once the Primary pod's WSS breaches its Kubernetes memory limit, the kill is to be expected. This is a fair estimate, not an absolute value.
The Catastrophe: The Linux OOM Killer instantly assassinates the Primary MySQL process.

Because we tried to avoid a few seconds of write latency by keeping relaxed Flow Control, we inadvertently caused a hard crash of the primary database pod, with long write downtime.

The Architectural Law

Therefore, here is my statement as architectural law for containerized environments: In Kubernetes, High Availability and Pod stability are so intrinsically linked that Flow Control must act as hard as it can to cap the apply queue.

We cannot allow unbounded memory growth in a container. The only way to bound certification_info memory is to bound the apply queue.
The only way to bound the apply queue is with strict, aggressive Flow Control.
Increasing the number of replication appliers helps but is not the conclusive answer.

In a Kubernetes environment, we must tune group_replication_flow_control_applier_threshold to a strict, low number, and accept that during massive traffic spikes, our application will experience write throttling. It is infinitely better for our application's connection pool to wait 2 seconds for a COMMIT to succeed than for the primary database pod to be violently OOMKilled by the kernel, and have to wait for minutes or hours to recover write capabilities.

Note

As a side note, this is exactly how the Percona Operator with Percona XtraDB Cluster works. More specifically, PXC, and Galera-based solutions in general, have a Flow Control mechanism that forces the queue to stay within hard limits. While this more invasive control may be noticeable at the application level, it guarantees that the other nodes are not lagging behind the primary, and this is why it is a stronger HA solution in the Kubernetes environment.

Reference

https://github.com/Tusamarco/mysqloperatorcalculator

Managing Resources and OOMKills: Resource Management for Pods and Containers

(This page details how memory limits are enforced reactively by the Linux kernel via OOM kills).

How WSS triggers Evictions: Node-pressure Eviction

(This page explicitly details how the kubelet uses the memory.available signal, which is derived from node capacity minus the working set size).

Latest changes. Pointer to the code

Swap Memory Management (Core Concepts & Configuration): https://kubernetes.io/docs/concepts/cluster-administration/swap-memory-management/

Group Replication VS Percona XtraDB Cluster: The True Cost of Consistency

Overview

When building high-availability MySQL environments, the choice between MySQL Group Replication (GR) and Percona XtraDB Cluster (PXC) often comes down to how they handle the eternal database dilemma: data consistency versus performance.

While both provide "synchronous-like" replication, they approach the problem of stale reads—reading data that has been committed on one node but not yet applied on another—in distinct ways. Understanding these differences, and the performance penalties associated with fixing them, is critical for any production environment.

Technology Overviews

MySQL Group Replication (GR)

Group Replication is the native, albeit more recent, high-availability solution built by Oracle for MySQL. It is based on a distributed state machine architecture and uses the Paxos consensus protocol.

Mechanism: When a transaction is committed, it is sent to all group members. The members must agree (consensus) on the order of transactions. Once a majority agrees, the transaction is "certified" and committed on the originator.
Replication Type: Virtually synchronous. The consensus ensures the data is received and ordered across nodes, but the actual applying of the data to the database happens asynchronously in the background.

Percona XtraDB Cluster (PXC)

PXC is an open-source enterprise solution based on Percona Server for MySQL and the Galera Replication library, which is the first and most mature virtually synchronous solution for MySQL.

Mechanism: When a node commits a transaction, it sends it to all other members of the Primary component (active group). All nodes must certify the transaction (check for conflicts). Certification runs on each node in the cluster, including the node that originates the write-set, before the originating node can finalize the commit.
Replication Type: Strictly synchronous (up to the certification level), asynchronous afterward. If the certification test fails, the node drops the write-set and the cluster rolls back the original transaction. If the test succeeds, however, the transaction commits and the write-set is applied to the rest of the cluster.

The Battle Against "Stale Reads": Why It Matters

The most critical distinction for developers is whether a SELECT query on Node B will immediately see the INSERT just performed on Node A.

In a distributed system, there is a microsecond-to-millisecond gap between a transaction being globally ordered (everyone knows it happened) and being locally applied (the data is physically readable in the table). A read executed on a secondary during this gap is a stale read.

Why is avoiding stale reads so critical?

While a stale read might just mean a user temporarily sees their old profile picture after updating it, in many business cases, it breaks the application's core logic:

Financial Transactions: A user deposits $100 on the Primary node and immediately refreshes their balance page, which reads from a Replica. If the read is stale, the balance hasn't updated. The user panics, thinking their money is lost.
E-commerce & Inventory: A customer buys the last item in stock. The next user immediately loads the product page. A stale read tells the second user the item is still available, leading to a cancelled order and a frustrated customer.
Security & Access: A user changes their password or updates a critical permission. If the next authentication request hits a node lagging by just a fraction of a second, their valid login might be rejected, or a revoked session might still be active.

To prevent these scenarios, we must tell the database to enforce strict consistency. But how do GR and PXC handle this, and what does it cost?

Consistency Controls Comparison

Both Group Replication and Percona XtraDB Cluster provide built-in mechanisms to enforce consistency and eliminate stale reads when your application demands it. However, they approach this problem using entirely different variables and distinct levels of granularity. The table below breaks down the specific controls each technology offers, highlighting exactly what it takes to force a node to serve fresh data.

Feature	MySQL Group Replication	Percona XtraDB Cluster
Default Behavior	Reads on secondaries may be stale because the applier thread might be lagging after consensus.	Reads on secondaries may be stale due to asynchronous background applying.
Stale Read Fix	Uses the group_replication_consistency variable.	Uses the wsrep_sync_wait variable.
Consistency Levels	Offers EVENTUAL, BEFORE, AFTER, and BEFORE_AND_AFTER.	Offers granular levels from 0 (default, no checks) up to 7 (checks on all READ, UPDATE, DELETE, INSERT, and REPLACE statements).
The Fix	Setting to AFTER ensures the next read is fresh.	Setting to 7 gives a scenario comparable with GR. However, in PXC setting wsrep_sync_wait = 1 is enough to avoid stale reads.
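The wsrep_sync_wait levels in the table are a bitmask, which is why 1 alone already covers reads. A minimal sketch of the decoding, covering only the 0 to 7 range discussed here (the helper name is hypothetical):

```python
# Bit values of wsrep_sync_wait for the levels 0 to 7 discussed above.
SYNC_WAIT_BITS = {
    1: "READ",
    2: "UPDATE and DELETE",
    4: "INSERT and REPLACE",
}


def describe_sync_wait(value: int) -> list[str]:
    """List the statement classes that wait for causality at a given level."""
    if not 0 <= value <= 7:
        raise ValueError("this sketch only covers values 0 to 7")
    return [name for bit, name in SYNC_WAIT_BITS.items() if value & bit]


for level in (0, 1, 3, 7):
    print(level, describe_sync_wait(level))
```

Level 0 returns an empty list (no checks), level 1 returns only READ, and level 7 returns all three classes.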

The True Cost of Being Consistent

If we know stale reads are bad, why don't we just enforce strict consistency everywhere?

An image can help to understand:

[Image: dirty comparative2]

Because in distributed databases, consistency is incredibly expensive. To test this, we used a 3-node internal lab environment to run a Sysbench-based TPC-C derivative test (50/50 read/write split, running for 600 seconds, scaling from 1 to 1024 threads).

You can find the detailed machine specifications here. The benchmarks were executed using a TPC-C derivative test based on sysbench. Finally—and crucially—you can review the configuration files used for the tests. I maintained the same baseline MySQL configuration across the board, only adjusting the parameters specific to each replication technology.

Scenario 1: Default (Relaxed) Consistency

(GR = EVENTUAL, PXC = wsrep_sync_wait 0)

As a reminder, MySQL CE and Percona Server are running Group Replication, while PXC is using Galera.

With default settings, both systems allow stale reads.

Both technologies scale well up to 128 threads:

Group Replication performs exceptionally well, handling up to 15K operations/sec before dropping off after 128 threads.
PXC (Galera) is slightly less efficient at peak but scales very nicely and predictably.

At this level, the lag between the moment of commit and the moment the server returns the answer is minimal. But we are entirely exposed to stale reads.

Scenario 2: Enforced Consistency (The Cost)

(GR = AFTER, PXC = wsrep_sync_wait 7)

When we configure the servers to prevent stale reads, the systems must wait for transactions to be fully applied before returning a read. This is where the architectural differences become glaringly apparent:

PXC (Galera): Performance drops only modestly, from a peak of ~9K ops/sec (in the previous test) to ~8.5K ops/sec. This is a hit, but not a huge one, and the database remains highly functional and stable.
Group Replication: Performance catastrophically drops from ~15K ops/sec (in the previous test) to a staggering ~3.8K ops/sec.

This is the crucial takeaway

Enforcing strict consistency in Group Replication results in a massive ~75% performance penalty. The latency between the commit and the server response increases significantly compared to PXC.
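As a quick check of the arithmetic, the penalty can be recomputed from the approximate peak values quoted above. The figures are rounded readings from the benchmark, so the percentages are approximate as well.

```python
def drop_percent(before: float, after: float) -> float:
    """Relative throughput loss, in percent, going from `before` to `after`."""
    return (before - after) / before * 100


# Approximate peak ops/sec quoted above (relaxed vs. enforced consistency).
gr = drop_percent(15_000, 3_800)
pxc = drop_percent(9_000, 8_500)
print(f"GR: {gr:.1f}%  PXC: {pxc:.1f}%")
```

With these inputs GR loses roughly three quarters of its peak throughput, while PXC loses around 5 to 6 percent.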

The intermediate way

There is another approach: apply the higher consistency level only when it is really needed.

The Solution: Session-Level Consistency

You do not need, and should not use, full consistency at the global level for general cases. Instead, force consistency only when and where it is critical.

Group Replication does not support setting this through SQL hints like SELECT /*+ SET_VAR(...) */, but you can enforce it at the session level right before a critical read:

SET SESSION group_replication_consistency = 'AFTER';
-- OR for PXC:
SET SESSION wsrep_sync_wait = 7;

Note that PXC offers more flexibility, and you can use hints:

select /*+ SET_VAR(wsrep_sync_wait=7) */ @@session.wsrep_sync_wait ,@@global.wsrep_sync_wait;
+---------------------------+--------------------------+
| @@session.wsrep_sync_wait | @@global.wsrep_sync_wait |
+---------------------------+--------------------------+
|                         7 |                        0 |
+---------------------------+--------------------------+

By isolating these variables to specific sessions (like the immediate redirect after a password change or a checkout process), you ensure data integrity exactly where the business requires it, while allowing the rest of your application to enjoy the high-speed performance of relaxed consistency.
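On the application side, this pattern can be wrapped in a small helper so the stricter level never leaks into the rest of the session. The sketch below is one possible shape, not a prescribed implementation: the helper name is hypothetical, `cursor` is assumed to be any DB-API 2.0 cursor, and the `%s` placeholder style is an assumption that matches the common MySQL drivers.

```python
from contextlib import contextmanager


@contextmanager
def session_variable(cursor, name, strict_value, relaxed_value):
    """Raise a session variable for one critical block, then restore it.

    `name` must come from trusted code, never from user input, because it
    is interpolated directly into the statement text.
    """
    cursor.execute(f"SET SESSION {name} = %s", (strict_value,))
    try:
        yield cursor
    finally:
        # Always fall back to the relaxed value, even if the read fails.
        cursor.execute(f"SET SESSION {name} = %s", (relaxed_value,))


# Usage sketch (table and column names are hypothetical):
# with session_variable(cur, "wsrep_sync_wait", 7, 0):
#     cur.execute("SELECT balance FROM accounts WHERE id = %s", (42,))
# with session_variable(cur, "group_replication_consistency", "AFTER", "EVENTUAL"):
#     cur.execute("SELECT balance FROM accounts WHERE id = %s", (42,))
```

The helper restores a fixed relaxed value instead of reading the previous one back, which keeps it to two extra statements per critical block.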

PXC: The performance drop is minimal and the solution is able to provide a consistent delivery with nice scalability up to 256 threads.

Group Replication: The solution still suffers a significant drop. It is not as severe as when we set the AFTER condition at the global level, but we still see a drop of ~52%.

Comparing the two solutions we can see that PXC is able to deal with the additional requested consistency better.

Additional differences

But these are not the only differences we can immediately see. Comparing resource utilization, we can see that both solutions move the same amount of data as IO operations, but their memory consumption differs.

For exactly the same load and traffic, Group Replication consumes 8GB more memory than PXC, which in this environment represents 26% of the total available memory.

This cost is also reflected in CPU utilization.

Conclusion: How to Survive the Cost

How impactful is enforcing strict consistency at a global level in a production environment? Massively. If you blindly enforce strict consistency globally without understanding your architecture, you will decimate your database throughput. Here is the reality of how the two solutions handle that tax:

The Group Replication Reality: By default (using EVENTUAL consistency), MySQL Group Replication behaves essentially as semi-synchronous replication paired with an automated topology manager (see The Failover Brownout: Rethinking High Availability in MySQL Group Replication). The Primary is allowed to forge ahead and serve traffic even if the Secondaries are lagging significantly behind. The moment you demand strict consistency, the Primary is violently tethered back to the rest of the cluster, and its performance drops off a cliff as it waits for the slowest node.
The PXC Advantage: Percona XtraDB Cluster (PXC) absorbs the "consistency penalty" much more gracefully. While varying consistency levels exist in PXC, adjusting them does not cause the same dramatic throughput shock seen in GR. This is because PXC enforces a virtually synchronous, high-consistency baseline from the start. It simply does not allow the node receiving writes to deviate too far from the rest of the cluster. You pay a baseline performance tax upfront, but in exchange, you get guaranteed, ironclad High Availability out of the box.

The Final Verdict

Modifying consistency values at the global server level should only be done after rigorous load testing and a complete understanding of the performance tax you are about to pay.

Ultimately, it comes down to choosing the right tool for your specific SLA:

If your architecture demands a true, virtually synchronous solution with strict High Availability out of the box, PXC is the purpose-built engine for the job.
If you are looking for a highly automated, semi-synchronous solution, Group Replication delivers excellent default performance—but tuning it to mimic PXC's strict consistency will cost you heavily in throughput.

References

https://mariadb.com/docs/galera-cluster/galera-architecture/certification-based-replication

https://docs.percona.com/percona-xtradb-cluster/5.7/wsrep-system-index.html#wsrep_sync_wait

MySQL Belgian Days and FOSDEM 2026: My Impressions

First of all, I want to say a huge thank you to Frederic Descamps and the entire team who worked on the MySQL Belgian Days. I was thrilled to see the high number of presentations and the excellent quality across the board. We had two rooms packed with attendees, and the event was simply great. It was also incredibly productive to finally reconnect with so many people in the community face-to-face.

Presentation-wise, I was really impressed by Vitor Oliveira (Huawei) and his talk, "Beyond Linear Read-Ahead: Logical Prefetching using Primary and Secondary Indexes in InnoDB." I found the presentation and the work behind it fascinating. It perfectly explained something that I, along with several colleagues, had empirically proven in the field regarding InnoDB old pages and their impact on performance. I strongly suggest reviewing this presentation.

Another highly interesting talk, even if I feel its full power wasn't grasped by everyone in the room, was Arnaud Adant's session on MySQL Binary Log Analytics. The level of detail we can dig into and the way he handled the binlog was excellent. It was a great demonstration of how a well-known topic like the binlog can still hold a few surprises and remain highly relevant, especially when looking at real-world, large-scale scenarios.

During the Belgian Days, I also received the MySQL Legend award, which was totally unexpected for me. It was so unexpected, in fact, that after the final Rockstar nomination, I actually walked out of the room and missed Fred announcing my name! In pure Grinch style, Fred had to come out and drag me back in. I was so embarrassed; here is the video of my momentary shame.

Now, what about FOSDEM? Well, FOSDEM is chaos, as we all know, and nobody expects anything less. However, this year we had a single database room for just one day. That meant trying to cram a whole universe into a single jar. As a result, the room was completely full, but the speeches were, at least for me, a bit too high-level and generic. I understand that was the intention given the constraints, but we need to keep this in mind for the future. Ultimately, I wasn't really impressed.

The day after, we had the MariaDB Day, which featured some interesting talks, specifically focusing on what is coming next for MariaDB. I had a few great discussions there and hope we will be able to collaborate when performing future tests.

The Summit for the MySQL Community

Last but not least, on Monday, February 2nd, we held the Summit for the MySQL Community. The event was an open discussion about how we, as a community, can work together to keep the MySQL ecosystem not just alive, but thriving and effective. It was an excellent meeting featuring people from AWS, Bloomberg, Booking, Canonical, WordPress, Oracle, Percona, MariaDB, and more. I don't have the full list, but it was amazing to see everyone together and willing to collaborate.

What became clear to everyone is that our scope is the same. No matter what company we come from, we want to ensure the MySQL/MariaDB/Percona/Whatever-flavor ecosystem continues to meet user needs and expands to tackle upcoming challenges. To do this, we need to focus on improving community interaction, code sharing, and evolution, without getting derailed by useless debates about who is the latest shining rockstar.

The intention is to do this together, Oracle included, assuming they take the right steps. In this regard, there is an open letter to Oracle that we are asking everyone to read; if you agree with its principles, please sign it.

Looking Forward: The Foundation and Ecosystem

Finally, I want to wish the best of luck to Fred (LeFred), who has decided to move on from Oracle and join the MariaDB Foundation, as he announced in his recent blog post. However, I also want to take a moment to answer the question he posed in that post:

"There is an initiative to create a foundation to ‘save’ MySQL, but doesn’t such a foundation already exist? There is a viable alternative for MySQL users: MariaDB. It offers more features, is ready to innovate further, and welcomes your contributions. Let’s work together!"

To answer Fred's question directly: No, that specific, overarching foundation does not quite exist yet, and that is exactly what became so clear during Monday's summit. The fact that the MariaDB Foundation is there is fantastic, and we all view it as a vital piece of the larger puzzle we debated.

However, we also recognize that no single entity or fork can accomplish this broader mission alone. The goal of this new foundation initiative isn't to compete with MariaDB, but to build a unified, vendor-neutral space that lifts up the entire ecosystem.

So, let us stay focused on the greater good. Rather than trying to shift entirely into one court or the other, let's build a truly collaborative foundation where all flavors and contributors can thrive together. We have a lot of work ahead of us; let's do it side by side.