My MySQL tips valid-rss-rogers


This is my 2nd year in Pythian, and the first in Santa Clara as part of the Pythian company, but I will not be alone.

This year Pythian will have a good number of MySQL members. We will wait for you at Pedro's restaurant , if you are not registered yet, please do so NOW!!!  Register yourself and join us.

The official announcement from the company:

"Pythian organizing an event that by now may be considered a tradition: The MySQL community dinner at Pedro’s! This dinner is open to all MySQL community members as many of you will be in town for the MySQL Conference that week."

We will wait there for all of you....


That is trying to explain myself a quite complicate process in few word/paragraphs, don’t expect anything special, this write start from one talk at the office, where I was drive in confusion by many talks, mine include.

So I decide to try to get a better logical path on what is going on InnoDB process when we modify a record. It was not my purpose to do an in depth review of every process, just overview.

Most of what I have written come from my understanding of other people writing, and code reading.

InnoDB maxi experts forgive me if I will not be precise, and correct me please when I will be wrong.



Let us start from the beginning.


InnoDB is one of the storage engines present in MySQL, one of its characteristics is that it is ACID compliant; this means it has atomic, consistent, isolated, durable transactions, and so on …. We know that let us go ahead, if you don’t know what I am talking about STOP here and read:


To satisfy those requirements InnoDB use several components like:

Buffer Pool

Rollback log (undo insert, undo update)

redo log



Finally InnoDB internally organizes the data in memory page of 16K default dimension. How all those blocks are combined and used is not always very clear.

But we know so move it on …


I will try to present a logical path that explains the different blocks and moments.


Let us begin starting MySQL, on start the MYSQLD demon will invoke InnoDB storge engine initialization, which has to perform several operations, like check the files presence and if they are readable/writable, check the catalog in the system table space, read the LSN from the data files and compare it with the last apply in the double-buffer and REDO log, initialize the Buffer Pool and other buffers.


If we are lucky and all the operation will be successfully executed, InnoDB will initialize correctly and will come online.


At this point our data-set still reside in the data file and BP is empty, system areas like undo insert, undo update (rollback log) are empty as well given no transaction are really in place.


Let assume we want to do a smart thing like warming-our database loading in the BP the most used datasets, to do so we simply run a select on those dataset and InnoDB will load the data from disk to BP storing record in pages of 16Kb (5.6 will allow an easy way to change/manage the page size).


Anyhow, running SELECT * FROM COUNTRY will load in the BP all the table in once, and data will from now on accessible from BP instead of the disk. This is true until data is frequently accessed. In this case InnoDB will maintain the pages in memory available for reading, in the case no one will ask for it, the pages containing the data will be marked as old. InnoDB use the LRU approach (least recently used) to identify which page can be removed from the BP to make space for new blocks. Also it will place the new created block(s), in the middle of the LRU list.


So far so good and easy, and if we need to tune the BP for efficient use of the LRU list reading the gives good explanation.


But BP for InnoDB is not only a buffer we use to cache data, it is a real working area, where data is also manipulated.


Here the fun starts.


Let us assume we will have 3 different transactions accessing the COUNTRY table, for modification, what will happen?


First of all the first transaction (there is always someone coming first), will store the values its going to change in a undo log record, this could fit in one page or more, structure will follow the standard InnoDB way using cursor, tuple and page. In the undo log LSN reference is present as well for comparison with the data set.


The rollback log is mainly pided in undo insert and undo update. It is a log allocate PER transaction and located in the system tables (in 5.6 is possible to move it out). The number of possible buffers/transactions from InnoDB 1.1 is 123*1023 approximately 123K, before InnoDB 1.1 the maximum number was 1023.


Finally the log will contain the initial status of the page(s) and the versioning of the modifications done by the transaction, organize in cursors, record and page(s).


The combination of the different buffers/transaction and the level of interaction are defined by the Isolation level (Repeatable reads; read committed; read uncommitted etc).


We can have many transactions accessing the rollback log but what is not evident at the first shot is:


When rows are modified or deleted, the rows and associated undo logs are not physically removed immediately, or even immediately after the transaction commits. The old data is preserved until transactions that started earlier or concurrently are finished, so that those transactions can access the previous state of modified or deleted rows.

Thus, a long-running transaction can prevent InnoDB from purging data that was changed by a different transaction.

When rows are modified or deleted within a long-running transaction, other transactions using the READ COMMITTED and REPEATABLE READ isolation levels have to do more work to reconstruct the older data if they read those same rows.

When a long-running transaction modifies a table, queries against that table from other transactions do not make use of the covering index technique. Queries that normally could retrieve all the result columns from a secondary index, instead look up the appropriate values from the table data.


Seems clear enough, but let me make it clearer.


IF a transaction is modifying a table and this table is accessed by another transaction, the secondary index IS NOT USED for returning the value, but IS USED for the search, the value is retrieve by table look-up instead, so INDEX ->PK ->Table look-up.


It is not clear to me how undo log it manages the different LSN position, in case of difference between records/cursors. Can we have them on the same page? Or must reside on different space?  Also how I retrieve them in order to follow the correct order.


Reading the comments in trx0undo.c done by Heikki I suppose:

If each transaction is given a whole page for its update undo log, file space consumption can be 10 times higher than necessary.

Therefore, partly filled update undo log pages should be reusable.

But then there is no way inpidual pages can be ordered so that the ordering agrees with the serialization numbers of the transactions on the pages.

Thus, the history list must be formed of undo logs, not their header pages as it was in the old implementation.

However, on a single header page the transactions are placed in

the order of their serialization numbers.

As old versions are purged, we may free the page when the last transaction on the page has been purged.

A problem is that the purge has to go through the transactions in the serialization order. This means that we have to look through all rollback segments for the one that has the smallest transaction number in its history list.


So from what I read here, we will have a History list based on undo log (s) themselves. But the transactions will also be ordered at page level.


I am not sue I get it right… so don’t take it as a statement.


Ok back to our COUNTRY table, assume we have first transaction doing ”UPDATE COUNTRY set gov_type='dictatorship' where iso3=ITA;”

Then another doing   “UPDATE COUNTRY set people_status='unhappy' where gov_type='dictatorship'”;

Finally the 3td doing :”INSERT INTO COUNTY (<fields>)  VALUES (values);


Assuming they happens in short period of time but without overlap so no issue as the one describe above, what happens internally?


First InnoDB need to have a transaction (ib_trx_t), then it will create a “tuple” to put the value I want to change, then a cursor to represent the record structure (ib_crsr_t),  get the tuple value from the cursor, clone the tuple, modify it put back in the cursor, destroy tuple(s) and finally commit the transaction.


All operations until the commit, are done at tuple and cursor level. Nothing is really written in the BP or REDO log.


Both updates will work the same while the insert will have a different path:  Transaction and cursor creation will be similar but then it will need a tuple array and it will populate it with the available values, if value not present for the specific tuple NULL will apply, if that tuple represent a NOT NULL attribute, an error will occur.

So far so good, so nothing is really written outside the cursor until the commit, so what happen after the commit?


Many things.


On commit, value is pushed from cursor to the BP and the relevant pages modified and marked dirty.

Also the dirty page will be located in the flush list, which contains all of the dirty pages in LSN order, least-recently-modified first

HERE the optimistic approach will happen, means that InnoDB assume that the change will be only at row/Page level, without the need of an index/B-tree reorganization.


Note that at this point InnoDB is not in a Consistent state anymore, and thinking to flush the page(s) to the data file could take to long leaving it in an Inconsistent state for too long.

Solution is that as part of the commit InnoDB flush the information to the REDO log, which “the facto” represent the CD of ACID for InnoDB.


There is a lot of confusion in the documentation here, which sometimes contradict itself in the same paragraph.

This one is a good example “InnoDB, like any other ACID-compliant database engine, flushes the redo log of a transaction before it is committed. …  With group commit, InnoDB issues a single write to the log file to perform the commit action for multiple user transactions that commit at about the same time, significantly improving throughput.


Ok so IF InnoDB flush uncommitted transaction how can it then group by committed transaction in X of time.


More does it make ANY sense to flush not committed transaction at all?

I read all this in the following way (code reading will follow and let you know).

InnoDB will process all the COMMITTED transaction in memory on REDOLOG, but those transactions are not committed on the data-file(s) yet, following the write-ahead-logging concept.

To me seems more logic, and it seems to follow more the InnoDB logic.


Back to REDO logs or Iblog, they are written sequentially, where information is previously store in the buffer then flush on disk/file.

They contains only the LSN, and the information related to the changes.


Is important to mention that from InnoDB 1.1 the Group commit is back and that is the way to ensure that Binary log AND Iblog inserts follow the same order. To achieve that InnoDB use the Two Phase Commit protocol, pushing writes to Iblog and Binary log, but it needs to have sync_binlog disable.


At this point if the server crash our data will be recover reading the REDO log.

Restarting the InnoDB will start reading the Iblog from the latest checkpoint comparing the LSN number in the log with the last related to the table, all the LSN less then the log LSNs will be then apply.


So REDO logs are quite relevant not only for recovery but also for the durability, unfortunately we all know that the REDO logs are also a huge bottleneck, and if not tuned correctly they represent a serious issue for the MySQL effiency.


See Vadim blog for discussion about checkpoints (

But that is not all, we can have many other possible issue related to REDO logs like:

1) Iblog needs to have enough space to write the incoming transactions or your transactions process will be stop soon.

2) InnoDB will not cancel information from the log, if the relevant pages are still not flushed from BP to data-files. It will stop any insert and start to flush the dirty pages from the BP.


In short, keep innodb_max_dirty_pages_pct to the default value of 75% can KILL your server when using large BP.


After that we still have the data in the data-files not matching the data in memory, so now InnoDB needs to align that as well.

InnoDB use both sharp checkpoint, and fuzzy checkpoint.


The sharp checkpoint takes place when InnoDB align all the modification performed in committed transactions and when all LSN references are align as well.

Example is when InnoDB shuts down.


The fuzzy checkpoint take place very frequently, given InnoDB try to keep the dirty page to minimum, simplifying and citing:

When InnoDB flushes dirty pages to disk, it finds the oldest dirty page’s LSN and treats that as the checkpoint low-water mark. It then writes this to the transaction log header. You can see this in the functions log_checkpoint_margin() and log_checkpoint().

Therefore, every time InnoDB flushes dirty pages from the head of the flush list, it is actually making a checkpoint by advancing the oldest LSN in the system. And that is how continual fuzzy checkpointing is implemented without ever “doing a checkpoint” as a separate event. If there is a server crash, then recovery simply proceeds from the oldest LSN onwards.

(tks Baron for the simplification)


Finally InnoDB to perform the real writes use the doublebuffer, only after the write and the flush to the doublewrite buffer have completed, InnoDB write the pages to their proper positions in the data file.


If the operating system crashes in the middle of a page write, InnoDB can later find a good copy of the page from the doublewrite during crash recovery.


With that the process is far to be complete and I skip points that needs to be addressed, I know it very well.


My intention was just to put in order, simplifying a lot, what happens in the “main” flow.

Many open questions remain, what happen in the case of non optimistic (or pessimistic) writes that require a rebuild of innoDB b-tree index?

What in the B-tree reorganization, what about the page flush, insert buffer … and many more.


WHAT it will be very nice to have is a book explaining how it works from A to Z, driving us poor guys working with MySQL and InnoDB also in each file/function of the code, or at least on the main sections of it.


Checkpoint     As changes are made to data pages that are cached in the buffer pool, those changes are written to the data files sometime later, a process known as flushing. The checkpoint is a record of the latest changes (represented by an LSN value) that have been successfully written to the data files.

LRU     An acronym for “least recently used”, a common method for managing storage areas. The items that have not been used recently are evicted when space is needed to cache newer items. InnoDB uses the LRU mechanism by default to manage the pages within the buffer pool, but makes exceptions in cases where a page might be read only a single time, such as during a full table scan. This variation of the LRU algorithm is called the midpoint insertion strategy. The ways in which the buffer pool management differs from the traditional LRU algorithm is fine-tuned by the options innodb_old_blocks_pct, innodb_old_blocks_time, and the new MySQL 5.6 options innodb_lru_scan_depth and innodb_flush_neighbors.

LSN     Acronym for “log sequence number”. This arbitrary, ever-increasing value represents a point in time corresponding to operations recorded in the redo log. (This point in time is regardless of transaction boundaries; it can fall in the middle of one or more transactions.) It is used internally by InnoDB during crash recovery and for managing the buffer pool.

In the MySQL Enterprise Backup product, you can specify an LSN to represent the point in time from which to take an incremental backup. The relevant LSN is displayed by the output of the ibbackup command. Once you have the LSN corresponding to the time of a full backup, you can specify that value to take a subsequent incremental backup, whose output contains another LSN for the next incremental backup.

Flush     To write changes to the database files, that had been buffered in a memory area or a temporary disk storage area. The InnoDB storage structures that are periodically flushed include the redo log, the undo log, and the buffer pool.     Flushing can happen because a memory area becomes full and the system needs to free some space, because a commit operation means the changes from a transaction can be finalized, or because a slow shutdown operation means that all outstanding work should be finalized. When it is not critical to flush all the buffered data at once, InnoDB can use a technique called fuzzy checkpointing to flush small batches of pages to spread out the I/O overhead.


A SQL statement that ends a transaction, making permanent any changes made by the transaction. It is the opposite of rollback, which undoes any changes made in the transaction.     InnoDB uses an optimistic mechanism for commits, so that changes can be written to the data files before the commit actually occurs. This technique makes the commit itself faster, with the tradeoff that more work is required in case of a rollback.

Page     A unit representing how much data InnoDB transfers at any one time between disk (the data files) and memory (the buffer pool). A page can contain one or more rows, depending on how much data is in each row. If a row does not fit entirely into a single page, InnoDB sets up additional pointer-style data structures so that the information about the row can be stored in one page.     One way to fit more data in each page is to use compressed row format. For tables that use BLOBs or large text fields, compact row format allows those large columns to be stored separately from the rest of the row, reducing I/O overhead and memory usage for queries that do not reference those columns.     When InnoDB reads or writes sets of pages as a batch to increase I/O throughput, it reads or writes an extent at a time.     All the InnoDB disk data structures within a MySQL instance share the same page size.

Extent     A group of pages within a tablespace totaling 1 megabyte. With the default page size of 16KB, an extent contains 64 pages. In MySQL 5.6, the page size can also be 4KB or 8KB, in which case an extent contains more pages, still adding up to 1MB.     InnoDB features such as segments, read-ahead requests and the doublewrite buffer use I/O operations that read, write, allocate, or free data one extent at a time.

Dirty page

A page in the InnoDB buffer pool that has been updated in memory, where the changes are not yet written (flushed) to the data files.

Dirty read     An operation that retrieves unreliable data, data that was updated by another transaction but not yet committed. It is only possible with the isolation level known as read uncommitted.     This kind of operation does not adhere to the ACID principle of database design. It is considered very risky, because the data could be rolled back, or updated further before being committed; then, the transaction doing the dirty read would be using data that was never confirmed as accurate.     Its polar opposite is consistent read, where InnoDB goes to great lengths to ensure that a transaction does not read information updated by another transaction, even if the other transaction commits in the meantime.

Crash recovery     The cleanup activities that occur when InnoDB is started again after a crash. Changes that were committed before the crash, but not yet written into the tablespace files, are reconstructed from the doublewrite buffer. When the database is shut down normally, this type of activity is performed during shutdown by the purge operation.     During normal operation, committed data can be stored in the insert buffer for a period of time before being written to the tablespace files. There is always a tradeoff between keeping the tablespace files up-to-date, which introduces performance overhead during normal operation, and buffering the data, which can make shutdown and crash recovery take longer.

Doublewrite buffer     InnoDB uses a novel file flush technique called doublewrite. Before writing pages to the data files, InnoDB first writes them to a contiguous area called the doublewrite buffer. Only after the write and the flush to the doublewrite buffer have completed, does InnoDB write the pages to their proper positions in the data file. If the operating system crashes in the middle of a page write, InnoDB can later find a good copy of the page from the doublewrite buffer during crash recovery.     Although data is always written twice, the doublewrite buffer does not require twice as much I/O overhead or twice as many I/O operations. Data is written to the buffer itself as a large sequential chunk, with a single fsync() call to the operating system.

Write combining     An optimization technique that reduces write operations when dirty pages are flushed from the InnoDB buffer pool. If a row in a page is updated multiple times, or multiple rows on the same page are updated, all of those changes are stored to the data files in a single write operation rather than one write for each change.

Undo log     A storage area that holds copies of data modified by active transactions. If another transaction needs to see the original data (as part of a consistent read operation), the unmodified data is retrieved from this storage area. This area is physically part of the system tablespace. It is split into separate portions, the insert undo buffer and the update undo buffer. Collectively, these parts are also known as the rollback segment, a familiar term for Oracle DBAs.

Redo log     A set of files, typically named ib_logfile0 and ib_logfile1, that record statements that attempt to change data in InnoDB tables. These statements are replayed automatically to correct data written by incomplete transactions, on startup following a crash. The passage of data through the redo logs is represented by the ever-increasing LSN value. The 4GB limit on maximum size for the redo log is raised in MySQL 5.6.     The disk layout of the redo log is influenced by the configuration options innodb_log_file_size, innodb_log_group_home_dir, and (rarely) innodb_log_files_in_group. The performance of redo log operations is also affected by the log buffer, which is controlled by the innodb_log_buffer_size configuration option.

Tuple     In the Embedded InnoDB product, a data structure holding one or more table columns. The tuple is an intermediate holding area where query results are stored before the columns are extracted, or where column values are stored before being used in an insert or update operation.

Cursor     An internal data structure that is used to represent the result set of a query, or other operation that performs a search using a SQL WHERE clause. It works like an iterator in other high-level languages, producing each value from the result set as requested.     Although usually SQL handles the processing of cursors for you, you might delve into the inner workings when dealing with performance-critical code. Embedded InnoDB requires you to be familiar with cursors and to manipulate them using C API calls.




Ewen Fortune

Mark Challagan



And the code … good reading


Finally I thank Danil Zburivsky who raised few good questions.



Let me say NO, but nothing really bad just annoying and thanks to Oli (always great) less annoying then how it could be.

Anyhow, I had to put all up in a RED HAT box on EC2.

So I start to take a look here an there to see what is the best way to do it, I also take the configurator from Severalnines, but for the moment I am not going to use it,

and in any case I never use it for the my.cnf parameters.

But what was the scope of all this?



I have a customer that could be interested in using a scaling MySQL solution, for his business.

Requirements as usual are 100% HA (OK we know this is not possible), scaling in writes as much as possible, geographic distribution.

Data set will be starting from 40G (peanuts) to 6 Tb (well not so peanuts) in 2 years.

More the scalability need to be "out" and "in" so architecture need to be able to scale as we like in relation to the "seasons".


What from there?


Well first of all define the generic architecture that should serve better, then find few good candidates that could fit in from the technological part

do some POCs and take numbers...

So here we are POCs and Numbers but first of all Installation.

As said I need to do that using REDHAT 6 EL so no way to use the pre-compile version, but taking a try will not harm.

Time to test and see how it works.

What I want to have as start:

3 nodes as start

data replicate from to each node

do stress test

download and installation


Excluding the demo package (that I don't want to use)

Any how I have found the site confuse, what I have to take? the package with ONLY the wsrep patch, or also the galera provider, all none both?


I had to download and extract to see that the galera libraries where not in the wsrep version, noting bad having it clear and on the same page could help.

Then I took the file from several nine using the configurator ...

DO I REALLY need it ??

mmm let us see


You will deploy on ec2, using redhat, and x86_64 architecture. You can also go into s9s-galera-xyz/scripts/install/.s9s/config

to make adjustments (but be careful) - if you get it wrong, refresh this page and start over with fresh deployment package.

To install:

Upload your keypair to /root/.ssh/Availng.pem on the cluster-control server

tar xvfz s9s-galera-1.0.0.tar.gz

cd s9s-galera-1.0.0/mysql/scripts/install

./ 2>&1 |tee cc.log

A detailed log is also written to $HOME/s9s_deploy.log

As said I decide not to use it for the moment.

My directory layout will be:


+-- bin

+-- etc
+-- galera -> /opt/galera_templates/galera¦   +-- bi

¦   +-- lib64

¦   +-- shar

+-- games

+-- include
+-- lib
+-- lib64
¦   +-- perl5
+-- libexec
+-- mysql -> /opt/mysql
¦   +-- bi

¦   +-- data

¦   +-- docs

¦   +-- include

¦   +-- lib

¦   +-- man

¦   +-- mysql-test

¦   +-- scripts

¦   +-- share

¦   +-- sql-bench

¦   +-- support-files

+-- sbin

+-- share

¦   +-- applications

¦   +-- info

¦   +-- man

¦   +-- perl5
+-- src


Next I had to do the usual thing:

setup MySQL user and group

setup directories

Configuring my.cnf


Start Galera


I am in rush (never good), so I did start and configure MySQL, but then when I run it:

111118 17:34:55 [Note] Event Scheduler: Loaded 0 events
111118 17:34:55 [Note] WSREP: wsrep_load(): loading provider library '/usr/local/galera/lib64/galera/'
111118 17:34:55 [ERROR] WSREP: wsrep_load(): dlopen(): cannot open shared object file: No such file or directory
111118 17:34:55 [ERROR] WSREP: wsrep_load(/usr/local/galera/lib64/galera/ failed..
111118 17:34:55 [Note] WSREP: wsrep_load(): loading provider library 'none'
111118 17:34:55 [ERROR] Aborting
111118 17:34:55 [Note] WSREP: Service disconnected.
111118 17:34:56 [Note] WSREP: Some threads may fail to exit.
111118 17:34:56  InnoDB: Starting shutdown...
111118 17:34:57  InnoDB: Shutdown completed; log sequence number 1595682
111118 17:34:57 [Note] /usr/local/mysql/bin/mysqld: Shutdown complete

Ok Ok I know they said that so no way I need to recompile and install.

As Oli wrote issue is set up all before so let me see for RED HAT:

Take MySQL as for Oli instructions

and patch: wget

apply patch and download the galera replicator




yum -y install openssl

yum -y install boost-devel

yum -y install openssl-devel.x86_64

yum -y install check-devel.x86_64


Then run :


Have a coffee or two ...

cp /opt/mysql/lib/plugin/

cp garb/garbd /opt/mysql/bin/



11122 18:19:53 [Note] WSREP: Flow-control interval: [8, 16]
111122 18:19:53 [Note] WSREP: Restored state OPEN -> JOINED (0)
111122 18:19:53 [Note] WSREP: Member 0 (galera1) synced with group.
111122 18:19:53 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 0)
111122 18:19:53 [Note] WSREP: New cluster view: group UUID: 7bd4ac4b-1560-11e1-0800-19942f7ebcc0, conf# 1: Primary, number of nodes: 1, my index: 0, first seqno: 1, protocol version 1
111122 18:19:53 [Warning] WSREP: wsrep_notify_cmd is not defined, skipping notification.
111122 18:19:53 [Note] WSREP: Assign initial position for certification: 0, protocol version: 1
111122 18:19:53 [Note] WSREP: Synchronized with group, ready for connections
111122 18:19:53 [Warning] WSREP: wsrep_notify_cmd is not defined, skipping notification.











mysql> SHOW global STATUS LIKE "wsrep%";
| Variable_name | Value |
| wsrep_local_state_uuid | 7bd4ac4b-1560-11e1-0800-19942f7ebcc0 |
| wsrep_protocol_version | 1 |
| wsrep_last_committed | 0 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_received | 2 |
| wsrep_received_bytes | 135 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced (6) |
| wsrep_cert_index_size | 0 |
| wsrep_cluster_conf_id | 1 |
| wsrep_cluster_size | 1 |
| wsrep_cluster_state_uuid | 7bd4ac4b-1560-11e1-0800-19942f7ebcc0 |
| wsrep_cluster_status | PRIMARY |
| wsrep_connected | ON |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <> |
| wsrep_provider_version | |
| wsrep_ready | ON |
38 rows IN SET (0.00 sec)



As state in the internal text file a note about the LIMITATIONS

1) Currently replication works only with InnoDB storage engine. Any writes to
tables of other types, including system (mysql.*) tables are not replicated.
However, DDL statements are replicated in statement level, and changes
to mysql.* tables will get replicated that way.
So, you can safely issue: CREATE USER...,
but issuing: INSERT INTO mysql.user..., will not be replicated.

2) DELETE operation is unsupported on tables without primary keys.
Rows in tables without primary keys may appear in different order on
different nodes. As a result SELECT...LIMIT... may return slightly
different sets.

3) Unsupported queries:
* LOCK/UNLOCK tables is not supported in multimaster configuration.
* lock functions (GET_LOCK(), RELEASE_LOCK()... )

4) Query log cannot be directed to table. If you enable query logging,
you must forward the log to a file:
log_output = FILE
Use general_log and general_log_file to choose query logging and the
log file name

5) Maximum allowed transaction size is defined by wsrep_max_ws_rows and
wsrep_max_ws_size. Anything bigger (e.g. huge LOAD DATA) will be rejected.

6) Due to cluster level optimistic concurrency control, transaction issuing
COMMIT may still be aborted at that stage. There can be two transactions
writing to same rows and committing in separate cluster nodes, and only one
of the them can successfully commit. The failing one will be aborted.
For cluster level aborts, MySQL/galera cluster gives back deadlock error
code (Error: 1213 SQLSTATE: 40001  (ER_LOCK_DEADLOCK)).

7) XA transactions can not be supported due to possible rollback on commit.

8) Currently supported only in master-slave mode without parallel applying
(will be fixed in the nearest release):
* cascading foreign key constraints,
* secondary unique indexes and
* primary varchar indexes with non-binary collation.

Creating the full environment

Now that I have all up and working, I can finally use my current EC2 as template and add boxes to my cluster.
From here the real Fun will start ...



-> Expanding the cluster
-> Starting my dataset
-> Pushing on throttle and take numbers




The fact that MySQL is not supporting at his best triggers and Store Procedure, is something we all know quite well.

Because that, we normally do use them with caution, but the other day I was working on a simple action on a customer site, and I face something that I did not recognize immediately, because I was starting from a wrong assumption.

Let me describe the case. A customer needs to perform regular export of his data, and reload of it on a different site. To do that we put up an external procedure using "select into outfile" and "Load DATA in FILE".

Nothing special here.
Given we know what we were doing we also set the REPLACE or IGNORE keyword as manual: "The REPLACE and IGNORE keywords control handling of input rows that duplicate existing rows on unique key";

We disable UNIQUE and FOREIGN KEY checks as well, finally no replication or binary logging is in place, so no concern that side.

Then we run the import and got an error message :"ERROR 1062 (23000): Duplicate entry '1' for key 'PRIMARY'".
This time the message error was quite clear, and I should not have problem in recognize the issue, right? It is a primary key duplication, BUT and here there is the need of a BIG BUT, I am using the REPLACE keyword because I want my inserts from file overwrite whatever is in the table.

Given I have being use that for years without any issue, I (me fool) did not recognize at the first shot the rot reason of the error.

This is because in my mind the keyword "REPLACE" had a special meaning, that for obvious reasons it does not have in MySQL.

To simplify my life here I have replicate the issue using the test dataset Sakila, so if you want to have fun try it downloading the schema from MySQL site.

So what I have done:

  • Load schema/data
  • Take an export in CSV
  • take info about triggers constrains and so on:
DATABASE changed
| sakila         | customer_create_date | sakila              | customer           |
| sakila         | ins_film             | sakila              | film               |
| sakila         | upd_film             | sakila              | film               |
| sakila         | del_film             | sakila              | film               |
| sakila         | payment_date         | sakila              | payment            |
| sakila         | rental_date          | sakila              | rental             |
6 rows IN SET (0.00 sec)
    -> WHERE R.`ROUTINE_SCHEMA`= 'sakila';
| sakila         | film_in_stock              | PROCEDURE    | SQL          |
| sakila         | film_not_in_stock          | PROCEDURE    | SQL          |
| sakila         | get_customer_balance       | FUNCTION     | SQL          |
| sakila         | inventory_held_by_customer | FUNCTION     | SQL          |
| sakila         | inventory_in_stock         | FUNCTION     | SQL          |
| sakila         | rewards_report             | PROCEDURE    | SQL          |
6 rows IN SET (0.00 sec)
| sakila            | fk_film_actor_actor       | sakila                   | PRIMARY                | film_actor    | actor                 |
| sakila            | fk_store_address          | sakila                   | PRIMARY                | store         | address               |
| sakila            | fk_staff_address          | sakila                   | PRIMARY                | staff         | address               |
| sakila            | fk_customer_address       | sakila                   | PRIMARY                | customer      | address               |
| sakila            | fk_film_category_category | sakila                   | PRIMARY                | film_category | category              |
| sakila            | fk_address_city           | sakila                   | PRIMARY                | address       | city                  |
| sakila            | fk_city_country           | sakila                   | PRIMARY                | city          | country               |
| sakila            | fk_payment_customer       | sakila                   | PRIMARY                | payment       | customer              |
| sakila            | fk_rental_customer        | sakila                   | PRIMARY                | rental        | customer              |
| sakila            | fk_film_category_film     | sakila                   | PRIMARY                | film_category | film                  |
| sakila            | fk_film_actor_film        | sakila                   | PRIMARY                | film_actor    | film                  |
| sakila            | fk_inventory_film         | sakila                   | PRIMARY                | inventory     | film                  |
| sakila            | fk_rental_inventory       | sakila                   | PRIMARY                | rental        | inventory             |
| sakila            | fk_film_language_original | sakila                   | PRIMARY                | film          | LANGUAGE              |
| sakila            | fk_film_language          | sakila                   | PRIMARY                | film          | LANGUAGE              |
| sakila            | fk_payment_rental         | sakila                   | PRIMARY                | payment       | rental                |
| sakila            | fk_rental_staff           | sakila                   | PRIMARY                | rental        | staff                 |
| sakila            | fk_store_staff            | sakila                   | PRIMARY                | store         | staff                 |
| sakila            | fk_payment_staff          | sakila                   | PRIMARY                | payment       | staff                 |
| sakila            | fk_inventory_store        | sakila                   | PRIMARY                | inventory     | store                 |
| sakila            | fk_customer_store         | sakila                   | PRIMARY                | customer      | store                 |
| sakila            | fk_staff_store            | sakila                   | PRIMARY                | staff         | store                 |
22 rows IN SET (0.00 sec)


Now I have to reload the table FILM which has relation (constrains) and it has Triggers, ok let us take a look on them :

| TRIGGER_SCHEMA | TRIGGER_NAME | EVENT_OBJECT_SCHEMA | EVENT_OBJECT_TABLE | ACTION_STATEMENT                                                                                                                                                                                                                                                                       |
| sakila         | ins_film     | sakila              | film               | BEGIN
    INSERT INTO film_text (film_id, title, description)
        VALUES (new.film_id, new.title, new.description);
| sakila         | upd_film     | sakila              | film               | BEGIN
    IF (old.title != new.title) OR (old.description != new.description)
        UPDATE film_text
            SET title=new.title,
        WHERE film_id=old.film_id;
    END IF;
  END |
| sakila         | del_film     | sakila              | film               | BEGIN
    DELETE FROM film_text WHERE film_id = old.film_id;
  END                                                                                                                                                                                                                     |
3 rows IN SET (0.00 sec)


My table inserts/update/delete records on the table FILM_TEXT, at any insert/update/delete using PRYMARY KEY on both involve table.

The table itself is quite simple:


mysql> SHOW CREATE TABLE film_text;
| TABLE     | CREATE TABLE                                                                                                                                                                                                                                       |
| film_text | CREATE TABLE `film_text` (
  `film_id` smallint(6) NOT NULL,
  `title` varchar(255) NOT NULL,
  `description` text,
  PRIMARY KEY (`film_id`),
  FULLTEXT KEY `idx_title_description` (`title`,`description`)
1 row IN SET (0.00 sec)



It Is MyISAM because the full text (next time we will use Innodb full text ;) ), it use "film_id", as primary key.

At this point I think you should have already get why I was a fool, and where was my misunderstanding.

I was thinking that GIVEN I have REPLACE in primary key in the LOAD DATA IN FILE statement, MySQL was able to propagate this condition also to relate tables.


ERROR 1062 (23000): Duplicate entry '1' FOR KEY 'PRIMARY'


So when I saw the error I was a little bit ... annoyed.

I did a quick check on the FILM_TEXT table:

mysql> SELECT * FROM film_text LIMIT 10;
| film_id | title            | description                                                                                                           |
|       1 | ACADEMY DINOSAUR | A Epic Drama of a Feminist AND a Mad Scientist who must Battle a Teacher IN The Canadian Rockies                      |
|       2 | ACE GOLDFINGER   | A Astounding Epistle of a DATABASE Administrator AND a Explorer who must Find a Car IN Ancient China                  |
|       3 | ADAPTATION HOLES | A Astounding Reflection of a Lumberjack AND a Car who must Sink a Lumberjack IN A Baloon Factory                      |
|       4 | AFFAIR PREJUDICE | A Fanciful Documentary of a Frisbee AND a Lumberjack who must Chase a Monkey IN A Shark Tank                          |
|       5 | AFRICAN EGG      | A Fast-Paced Documentary of a Pastry Chef AND a Dentist who must Pursue a Forensic Psychologist IN The Gulf of Mexico |
|       6 | AGENT TRUMAN     | A Intrepid Panorama of a Robot AND a Boy who must Escape a Sumo Wrestler IN Ancient China                             |
|       7 | AIRPLANE SIERRA  | A Touching Saga of a Hunter AND a Butler who must Discover a Butler IN A Jet Boat                                     |
|       8 | AIRPORT POLLOCK  | A Epic Tale of a Moose AND a Girl who must Confront a Monkey IN Ancient India                                         |
|       9 | ALABAMA DEVIL    | A Thoughtful Panorama of a DATABASE Administrator AND a Mad Scientist who must Outgun a Mad Scientist IN A Jet Boat   |
|      10 | ALADDIN CALENDAR | A Action-Packed Tale of a Man AND a Lumberjack who must Reach a Feminist IN Ancient China                             |
10 rows IN SET (0.00 sec)


And yes data is there and at this point I must assume that this is the cause of the error.

So Marco stop dreaming about REPLACE cascade action and do something.

Cool, let me DISABLE the trigger such that I can reload the table on FILM without worries.

But oops I CANNOT disable a trigger, and more... what if the data I am inserting is the same for Primary key but not for the text, and I need to replace data in the DESCRIPTION fields?

Well I cannot do it without modifying the trigger itself.

So at the end I have to do something in the triggers like so instead of using


So at the end I have TO do something IN the triggers LIKE so instead of USING
    INSERT INTO film_text (film_id, title, description)
        VALUES (new.film_id, new.title, new.description);


We need to use

    INSERT INTO film_text (film_id, title, description)
        VALUES (new.film_id, new.title, new.description)  ON DUPLICATE KEY UPDATE film_id=new.film_id;
  mysql> DROP TRIGGER sakila.ins_film ;
Query OK, 0 rows affected (0.07 sec)
mysql> DELIMITER ;;
    ->     INSERT INTO film_text (film_id, title, description)
    ->         VALUES (new.film_id, new.title, new.description)  ON DUPLICATE KEY UPDATE film_id=new.film_id;
    ->   END;;  
Query OK, 0 rows affected (0.11 sec)


Which obviously will work fine:

Query OK, 1000 rows affected (0.57 sec)
Records: 1000  Deleted: 0  Skipped: 0  Warnings: 0


Problem solve, and all fine.
Really? Not really!

I have few comments here and I met two annoying issue during my exercise.


The first one is that I am still not able to disable the triggers, this is a feature request pending from 2005, we can discuss the best way to implement it, but not the fact that it is a NEEDED feature.


Second if I use REPLACE on LOAD DATA INFILE, I could expect a "Cascade" behaviour of the REPLACE action, at least when I involve the primary key in the Trigger.
That actually should be quite simple, like FIRING the UPDATE trigger instead the INSERT, not rocket science, no need to check what you are doing in the trigger, REPLACE=UPDATE in Trigger action.

In term of suggestion to the others, I can only say ... "REVIEW the triggers code to include the possible case of duplicate inserts", but I don't like it.
I think that we should go for a more elegant way to run it, then modify the trigger code, which was a simple operation here, but that could not be always the case.

Is this a critical issue? No we can stay as we did so far.
Is this something easy to solve? I think so.

Would make any sense to change it as I describe? I think so.
More I would like to say that I don't understand how we can have the option to disable FOREIGN key in place and STILL not have it for Triggers.

That is for me another annoying Cold Case, also if takes me few minutes to fix it!





In the lasts weeks, I choose to stay silent, to try to see and read other people blogs with a new eye. I was trying to go in depth of any single article, reading investigating and double-checking.

There were a lot of news and things going on, not all new and not all really interesting for me, but at the end, enough to le me say that I had spent many nights up reading and studying, instead leading ahead my own projects.

But what was surprise me as well, was that at the end what is very often interesting as a NEW feature or cool coming solution, it is not so interesting in the day by day work.

More what happens is sometime, or I should say too often, the new things that are cool development, are built on the rubble left behind.

Too often we have to still deal with annoying OLD bugs, or feature requests that are "by the facto" standard in almost all decent Database platforms.

Just in the last page (as for the time of writing) of Planet MySQL, I can see mention of the Microsecond issue from Chris Calendar, the way how MySQL handle temporary tables in "A More Perfect UNION " and last but not least my annoying lack of trigger DISABLE/ENABLE function.

So at the end I found myself looking to what I start to call "Cold Case" in my mind. I also star to thought "how I can help"? Well not forgetting them, and instead write and write again about them every time I will find, hoping to be able to propose also some works around, whenever possible.

This is why I have open a series of "Cold Case".

I also invite you all to do not forget and do not leave them behind. Write about your Cold Case and let us push for having them solve once for all.


Latest conferences


We have 85 guests and no members online