My MySQL tips valid-rss-rogers


An old story that is not yet solve.


Why this article.

Some time ago I had open a bug report to codership through Seppo.

The report was about the delay existing in executing data load with FK. (

The delay I was reporting at that time were such to scare me a lot, but I know talking with Alex and Seppo that they were aware of the need to optimize the approach an some work was on going.

After some time I had done the test again with newer version of PXC and Galera library.

This article is describing what I have found, in the hope that share information is still worth something, nothing less nothing more.

The tests

Tests had being run on a VM with 8 cores 16GB RAM RAID10 (6 spindle 10KRPM).

I have run 4 types of tests:

  • Load from file using SOURCE and extended inserts
  • Load from SQL dump and extended inserts
  • Run multiple threads operating against employees tables with and without FK
  • Run single thread operating against employees tables with and without FK

 For the test running against the employees’ db and simulating the client external access, I had used my own stresstool.

The tests have been done during a large period of time, given I was testing different versions and I had no time to stop and consolidate the article. Also I was never fully convinced, as such I was doing the tests over and over, to validate the results.

I have reviewed version from:

Server version:                        5.6.21-70.1-25.8-log Percona XtraDB Cluster binary (GPL) 5.6.21-25.8, Revision 938, wsrep_25.8.r4150


Server version:                        5.6.24-72.2-25.11-log Percona XtraDB Cluster binary (GPL) 5.6.24-25.11, Revision, wsrep_25.11

With consistent behavior.


What happened

The first test was as simple as the one I did for the initial report, and I was mainly loading the employees db in MySQL.

time mysql -ustress -ptool -h -P3306 < employees.sql

Surprise surprise … I literally jump on the chair the load takes 37m57.792s.

Yes you are reading right, it was taking almost 38 minutes to execute.

I was so surprise that I did not trust the test, as such I did it again, and again, and again.

Changing versions, changing machines, and so on.

No way… the time remain surprisingly high.

Running the same test but excluding the FK and using galera was complete in 90 seconds, while with FK but not loading the Galera library 77 seconds.

Ok something was not right. Right?

I decide to dig a bit starting from analyzing the time taken, for each test.

See image below:




From all the tests the only one not align was the data loading with FK + Galera .

I had also decided to see what was the behavior in case of multiple threads and contention.

As such I prepare a test using my StressTool and run two class of tests, one with 8 threads pushing data, the other single threaded.

As usual I have also run the test with FK+Galera, NOFK+Galera, FK+No Galera.

The results were what I was expecting this time and the FK impact was minimal if any, see below:




The distance between execution was minimal and in line with expectations.

Also it was consistent between versions, so no surprise, I relaxed there and I could focus on something else.

On what?

Well why on the case of the load from file, the impact was so significant.

The first thing done was starting to dig on the calls, and what each action was really doing inside MySQL.

To do so I have install some tools like PERF and OPROFILE, and start to dig into it.

First test with FK+Galera taking 38 minutes, was constantly reporting a different sequence of calls/cost from all other tests.

57.25%  [kernel]                      [k] hypercall_page

35.71%  [.] 0x0000000000010c61

2.73%                  [.] __strlen_sse42

0.16%  mysqld                        [.] MYSQLparse(THD*)

0.14%  [.] strlen@plt

0.12%              [.] galera::KeySetOut::KeyPart::KeyPart(galera::KeySetOut::KeyParts&, galera::KeySetOut&, galera::K

0.12%  mysqld                        [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned

0.09%                  [.] memcpy

0.09%                  [.] _int_malloc

0.09%  mysqld                        [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long,

0.08%  mysql                         [.] read_and_execute(bool)

0.08%  mysqld                        [.] ha_innobase::wsrep_append_keys(THD*, bool, unsigned char const*, unsigned char const*)

0.07%                  [.] _int_free

0.07%              [.] galera::KeySetOut::append(galera::KeyData const&)

0.06%                  [.] malloc

0.06%  mysqld                        [.] lex_one_token(YYSTYPE*, THD*)


Comparing this with the output of the action without FK but still with Galera:

75.53%  [kernel]                      [k] hypercall_page

1.31%  mysqld                        [.] MYSQLparse(THD*)

0.81%  mysql                         [.] read_and_execute(bool)

0.78%  mysqld                        [.] ha_innobase::wsrep_append_keys(THD*, bool, unsigned char const*, unsigned char const*)

0.66%  mysqld                        [.] _Z27wsrep_store_key_val_for_rowP3THDP5TABLEjPcjPKhPm.clone.9

0.55%  mysqld                        [.] fill_record(THD*, Field**, List<Item>&, bool, st_bitmap*)

0.53%                  [.] _int_malloc

0.50%                  [.] memcpy

0.48%  mysqld                        [.] lex_one_token(YYSTYPE*, THD*)

0.45%              [.] galera::KeySetOut::KeyPart::KeyPart(galera::KeySetOut::KeyParts&, galera::KeySetOut&, galera::K

0.43%  mysqld                        [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long,

0.43%  mysqld                        [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned

0.39%  mysqld                        [.] trx_undo_report_row_operation(unsigned long, unsigned long, que_thr_t*, dict_index_t*, dtuple_t

0.38%              [.] galera::KeySetOut::append(galera::KeyData const&)

0.37%                  [.] _int_free

0.37%  mysqld                        [.] str_to_datetime

0.36%                  [.] malloc

0.34%  mysqld                        [.] mtr_add_dirtied_pages_to_flush_list(mtr_t*)


What comes out is the significant difference in the FK parsing.

The galera function


KeySetOut::KeyPart::KeyPart (KeyParts&  added, 
                             KeySetOut&     store,
                             const KeyPart* parent,
                             const KeyData& kd,
                             int const      part_num) 



is the top consumer before moving out to share libraries.

After it the server is constantly calling the strlen function, as if evaluating each entry in the insert multiple times.

This unfortunate behavior happens ONLY when the FK exists and require validation, and ONLY if the Galera library is loaded.

It is logic conclusion that the library is adding the overhead, probably in some iteration, and probably a bug.


Running the application tests, using multiple clients and threads, this delay is not happening, at least with this level of magnitude.

During the application tests, I had be using batching insert up to 50 insert for SQL command, as such I could have NOT trigger the limit, that is causing the issue in Galera.

As such, I am not still convinced that we are “safe” there and I have in my to do list to add this test soon, in the case of significant result I will append the information, but I was feeling the need to share in the meanwhile.


The other question was, WHY the data load from SQL dump was NOT taking so long?

That part is easy, comparing the load files we can see that in the SQL dump the FK and UK are disable while loading, as such the server skip the evaluation of the FK in full.

That’s it, adding:




To the import and setting them back after, remove the delay and also the function calls become “standard”.




This short article has the purpose of:

  • Alert all of you of this issue in Galera and let you know this is going on from sometime and has not being fix yet.
  • Provide you a workaround. Use SET FOREIGN_KEY_CHECKS=0, UNIQUE_CHECKS=0; when performing data load, and rememeber to put them back (SET FOREIGN_KEY_CHECKS=1, UNIQUE_CHECKS=1;).
    Unfortunately, as we all know, not always we can disable them, Right? This brings us to the last point.
  • I think that Codership and eventually Percona, should dedicate some attention to this issue, because it COULD be limited to the data loading, but it may be not.





I have more info and oprofile output that I am going to add in the bug report, with the hope it will be processed.


Great MySQL to everyone …

Performance Schema (PS) has been the subject of many, many recent discussions, presentations, and articles.  After its release in MySQL 5.7, PS has become the main actor for people who want to take the further steps in MySQL monitoring. At the same time, it has become clear that Oracle intends to make PS powerful with so many features and new instrumentation that old-style monitoring will begin to look like obsolete tools from the Stone Age.

This article will explain PS and provide guidance on what needs to be done in order to use it effectively.

What I am not going to do is to dig into specific performance issues or address polemics about what PS is and what, in a Utopian vision, it should be. I have seen too many presentations, articles and comments like this and they are not productive, nor are they in line with my target which is: keep people informed on how to do things EASILY.

For the scope of this article I will base my code mainly on version MySQL 5.7, with some digression to MySQL 5.6, if and when it makes sense.


Basic Concepts

Before starting the real how-to, it is my opinion that we must cover a few basic concepts and principles about PS. The primary goal of the Performance Schema is to measure (instrument) the execution of the server. A good measure should not cause any change in behavior. To achieve this, the overall design of the Performance Schema complies with the following, very severe design constraints:

  • The parser is unchanged. Also, there are no new keywords or statements. This guarantees that existing applications will run the same way with or without the Performance Schema.
  • All the instrumentation points return "void", there are no error codes. Even if the performance schema fails internally, execution of the server code will proceed.
  • None of the instrumentation points allocate memory. All the memory used by the Performance Schema is pre-allocated at startup, and is considered "static" during the server life time.
  • None of the instrumentation points use any pthread_mutex, pthread_rwlock, or pthread_cond (or platform equivalents). Executing the instrumentation point should not cause thread scheduling to change in the server.

In other words, the implementation of the instrumentation points, including all the code called by the instrumentation points is:

  • Malloc free
  • Mutex free
  • Rwlock free


Currently, there is still an issue with the usage of the LF_HASH, which introduces memory allocation, though a plan exists to be replace it with lock-free/malloc-free hash code table.

The observer should not influence the one observe. As such, the PS must be as fast as possible, while being less invasive. In cases when there are choices between:

Processing when recording the performance data in the instrumentation.


Processing when retrieving the performance data.

Priority is given in the design to make the instrumentation faster, pushing some complexity to data retrieval.

Performance schema was designed while keeping an eye on future developments and how to facilitate the PS usage in new code. As such, to make it more successful, the barrier of entry for a developer should be low, so it is easy to instrument code. This is particularly true for the instrumentation interface. The interface is available for C and C++ code, so it does not require parameters that the calling code cannot easily provide, supports partial instrumentation (for example, instrumenting mutexes does not require that every mutex is instrumented). The Performance Schema instrument interface is designed in such a way that any improvement/additions in the future will not require modifications, as well as old instrumentation remaining unaffected by the changes.

The final scope for PS is to have it implemented in any plugin included in MySQL, although pretending to have them always using the latest version will be unrealistic in most cases. Given that the Performance Schema implementation must provide up to date support, within the same deployment, multiple versions of the instrumentation interface must ensure binary compatibility with each version.

The importance of flexibility means we may have conditions like:

  • Server supporting the Performance Schema + a storage engine that is instrumented.
  • Server supporting the Performance Schema + a storage engine that is not instrumented.
  • Server not supporting the Performance Schema + a storage engine that is instrumented.



Finally, we need to take in to account that the Performance Schema can be included or excluded from the server binary, using build time configuration options, with exposure in the compiling interface.

Performance Schema Interfaces

As mentioned above, PS can be excluded from code at the moment of the code compilation, thanks to the PS compile interface. This interface is one of seven that are present in PS. The full list is:

  • Instrument interface
  • Compiling interface
  • Server bootstrap interface
  • Server startup interface
  • Runtime configuration interface
  • Internal audit interface
  • Query interface

Instrument Interface:

This is the one that allows plugin implementers to add their instruments to PS. In general the interface is available for:

  • C implementations
  • C++ implementations
  • The core SQL layer (/sql)
  • The mysys library (/mysys)
  • MySQL plugins, including storage engines,
  • Third party plugins, including third party storage engines.


Compiling Interface:

As mentioned earlier, this is used during the build and will include or exclude PS code from the binaries.

Server Bootstrap Interface:

This is an internal private interface, which has the scope to provide access to the instructions demanded and create the tables for the PS itself.

Server Startup Interface:

This interface will expose options used with the mysqld command line or in the my.cnf, required to:

  • Enable or disable the performance schema.
  • Specify some sizing parameters.


Runtime Configuration Interface

This is one of the two most important interfaces for DBAs and SAs. It will allow the configuration of the PS at runtime. Using the methods expose by this interface, we will be able to configure what instruments, consumers, users and more we want to have active. This interface uses standard SQL and is very easy to access and use. Also, it is the preferred method to activate or deactivate instruments. Thus, when we start the server we should always enable the PS with all the instruments and consumers deactivated, and use this interface to choose only the ones we are interested in.

Internal Audit Interface:

The internal audit interface is provided to the DBA to inspect if the Performance Schema code itself is functioning properly. This interface is necessary because a failure caused while instrumenting code in the server should not cause failures in the MySQL server itself, and in turn the performance schema implementation never raises errors during runtime execution. To access the information a DBA just needs to issue the SHOW ENGINE PERFORMANCE SCHEMA STATUS; command.

Query Interface:

Lastly, this interface is the one that allows us to access the collected data, and to perform data filtering, grouping, join, etc. It will also allow access to a special table like the summary tables and digest, which will be discussed later on.

Consumers and Instruments

Another important concept in PS to understand is the difference between Instruments and Consumers.


Instruments are the ones collecting raw data where the calls are embedded in the code, such as:


    { result= index_prev(buf); })


In this case the code refers to the MYSQL_TABLE_IO_WAIT function declared in the class (<mysql_root_code>/sql/ If enabled in the compilation phase the above function will provide PS the information related to specific table io_wait.

The instruments demanded to manage that data collection is: wait/io/table/sql/handler.

The naming convention for the instruments is quite easy. The first part wait is the name of the Top-level Instrument component (list later), the second io is the observed condition, and table is the object.  The remaining suffix is referring to more specific plugin implementations and includes innodb, myisam, sql or names like IO_CACHE::append_buffer_lock. In the above example it refers to the Handler class in SQL tree.


Instruments are organized by top level components like:

  • Idle: An instrumented idle event. This instrument has no further components.
  • Memory: An instrumented memory event.
  • Stage: An instrumented stage event.
  • Statement: An instrumented statement event.
  • Transaction: An instrumented transaction event. This instrument has no further components.
  • Wait: An instrumented wait event.

 Each top level has an n number of instruments:


| name        | Numb |
| idle        |    1 |
| memory      |  367 |
| stage       |  117 |
| statement   |  191 |
| transaction |    1 |
| wait        |  297 |


We can and should keep in consideration that, it is best practice to enable only the instruments we may require for the time we need them. This can be achieved using the re-using the runtime interface (I will explain how exactly later on).

There exists official documentation ( providing more detailed information about the list of what is available for each Top Component.


The Consumers are the destination of the data collected from the instruments. Consumers have different scope and timelines. Also, consumer like event statements has many different tables like:

  • Current
  • History
  • History long
  • Summaries (by different aggregation)
  • Summary Digest (like what we can find by processing the slow query log)

 Once more it is important to define what we are looking for and enable only what we need. For instance, if we need to review/identify the SQL with the most impacting, we should enable only the events_statements_current, events_statements_history and events_statements_summary_by_digest. All the other consumers can stay off. It is also important to keep in mind that each event may have a relation with another one. In this case, we will be able to navigate the tree relating the events using the fields EVENT_ID and NESTING_EVENT_ID where the last one is the EVENT_ID of the parent.

Pre-Filtering vs. Post-filtering

We are almost there, stay tight! Another important concept to understand is the difference between post and pre-filtering. As I mentioned, we can easily query the Consumer tables with SQL, we can create complex SQL to join tables and generate complex reports. But this can be quite heavy and resource consuming, especially if we want to dig on specific sections of our MySQL server.

In this case we can use the pre-filtering approach. The pre-filtering is basically a way to tell to PS to collect information ONLY from a specific source like user/IP (actors) or Object(s) like Tables, Triggers, Events, and Functions. The last one can be set at a general level or down to a specific object name.

The pre-filtering with the activation of the right instruments and consumer is a powerful way to collect the information without overloading the server with useless data. It is also very easy to implement given we just need to set the objects and/or actors in the setup tables as we like.


Rolling the Ball, Setup the PS for Observation as Start

Now that we have covered the basic concepts we can start to work on the real implementation.

Compile the Source Code:

As mentioned earlier, we can use the compile interface to include or exclude features from the code compilation. The available options are:

  • DISABLE_PSI_COND Exclude Performance Schema condition instrumentation
  • DISABLE_PSI_FILE Exclude Performance Schema file instrumentation
  • DISABLE_PSI_IDLE Exclude Performance Schema idle instrumentation
  • DISABLE_PSI_MEMORY Exclude Performance Schema memory instrumentation
  • DISABLE_PSI_METADATA Exclude Performance Schema metadata instrumentation
  • DISABLE_PSI_MUTEX Exclude Performance Schema mutex instrumentation
  • DISABLE_PSI_RWLOCK Exclude Performance Schema rwlock instrumentation
  • DISABLE_PSI_SOCKET Exclude Performance Schema socket instrumentation
  • DISABLE_PSI_SP Exclude Performance Schema stored program instrumentation
  • DISABLE_PSI_STAGE Exclude Performance Schema stage instrumentation
  • DISABLE_PSI_STATEMENT Exclude Performance Schema statement instrumentation
  • DISABLE_PSI_STATEMENT_DIGEST Exclude Performance Schema statement_digest instrumentation
  • DISABLE_PSI_TABLE Exclude Performance Schema table instrumentation

This level of detail is so granular that we can only include the things we are planning to use.

The positive aspect of doing so at the compilation level is that we will be sure no one will mess-up adding undesired instruments. The drawback is that if we change our mind and we decide we may need the ones we had excluded, we will have to compile the whole server again.

As a result, I would say that using this approach is not for someone that is just starting to use PS. Given you are still discovering what is there, it make sense to compile with all the features (default).

Configure PS in my.cnf:

To set the PS correctly in the my.cnf is quite important, so I strongly suggest disabling any instrument and consumer at the start-up. They can be enabled by the script later, and that would be much safer for a production database.

I normally recommend a section like the following:






The settings above will start the server with PS as “enabled”, but all the instruments and consumer will be OFF. Well, this is not entirely true, as for the moment of the writing (MySQL 5.7.7) once the PS is enabled the instruments related to memory/performance_schema are enabled regardless, which make sense given they are dedicated to monitor the memory utilization of PS.

A final note about the configuration is that we can decide to use the counting option of the instruments instead, capturing the latency time. To do so, we just have to declare it as: performance_schema_instrument='statement/sql/%=COUNTED'

In this case I had set that ALL the SQL statements should be counted.

Start Server and Set Only the Users We Need:

Once we have started our MySQL server, we are almost ready to go.

This is it, given we start it with NO instruments, we have to decide where to begin, and given we all know the most impacting factor in a database server is how we query it, we will start from there. In turn, analyzing what is going from the SQL point of view. Although, I want to catch the work coming from my application user, not from everywhere. Given this we can set the user in the actor table. This is very simple given we will use the Runtime configuration interface which uses SQL syntax.

So, let say I want to trace only my application user named stress running from machines in the range. I will need to:


UPDATE setup_actors SET ENABLED='NO' WHERE user='%'; 
INSERT INTO setup_actors VALUES('10.0.0.%','stress','%','YES');
(root@localhost) [performance_schema]>select * FROM setup_actors;
| HOST     | USER   | ROLE | ENABLED |
| %        | %      | %    | NO      |
| 10.0.0.% | stress | %    | YES     |
2 rows IN SET (0.00 sec)



Great, from now on PS will only focus on my user stress, so now let us decide what to enable for instruments and consumers.

Once more using SQL command we will enable all the instruments related to SQL statements, but wait a minute, if you check the instrument table, you will see we have several variations of the statements instrument:

  • SQL
  • SP
  • Scheduler
  • Com
  • Abstract

Also, this is not included but relevant is the TRANSACTION. For now, we will only enable the SQL, ABSTRACT, Scheduler and Transaction.

SQL will be:


UPDATE  setup_instruments SET ENABLED='YES' WHERE ENABLED='NO' AND name LIKE 'statement/abstract/%'; 
UPDATE  setup_instruments SET ENABLED='YES' WHERE ENABLED='NO' AND name LIKE 'statement/sql/%';
UPDATE  setup_instruments SET ENABLED='YES' WHERE ENABLED='NO' AND name LIKE 'transaction';
(root@localhost) [performance_schema]>select count(*) FROM setup_instruments
 WHERE ENABLED = 'YES' AND name NOT LIKE 'memory%';
| count(*) |
|      143 |
1 row IN SET (0.01 sec)



We have 143 instruments active. Now we must setup the consumers and choose the destination that will receive the data.

The list of consumers is the following:


(root@localhost) [performance_schema]>select * FROM setup_consumers;
| NAME                             | ENABLED |
| events_stages_current            | NO      |
| events_stages_history            | NO      |
| events_stages_history_long       | NO      |
| events_statements_current        | NO      |
| events_statements_history        | NO      |
| events_statements_history_long   | NO      |
| events_transactions_current      | NO      |
| events_transactions_history      | NO      |
| events_transactions_history_long | NO      |
| events_waits_current             | NO      |
| events_waits_history             | NO      |
| events_waits_history_long        | NO      |
| global_instrumentation           | NO      |
| thread_instrumentation           | NO      |
| statements_digest                | NO      |
15 rows IN SET (0.00 sec)



To enable ANY of them, first we have to enable the GLOBAL one, which works as a global power on/off. The same thing applies for the Thread instrumentation:


UPDATE setup_consumers SET ENABLED='YES' WHERE NAME='global_instrumentation';
UPDATE setup_consumers SET ENABLED='YES' WHERE NAME='thread_instrumentation';


Then we need to activate at least the events_statements_current to see something, I suggest activating also history and statements_digest.


UPDATE setup_consumers SET ENABLED='YES' WHERE NAME='events_statements_current';
UPDATE setup_consumers SET ENABLED='YES' WHERE NAME='events_statements_history';
UPDATE setup_consumers SET ENABLED='YES' WHERE NAME='statements_digest';
UPDATE setup_consumers SET ENABLED='YES' WHERE NAME='events_transactions_current';
UPDATE setup_consumers SET ENABLED='YES' WHERE NAME='events_transactions_history';



As result, we will have the following consumers activated:


(root@localhost) [performance_schema]>select * FROM setup_consumers;
| NAME                             | ENABLED |
| events_stages_current            | NO      |
| events_stages_history            | NO      |
| events_stages_history_long       | NO      |
| events_statements_current        | YES     |
| events_statements_history        | YES     |
| events_statements_history_long   | NO      |
| events_transactions_current      | YES     |
| events_transactions_history      | YES     |
| events_transactions_history_long | NO      |
| events_waits_current             | NO      |
| events_waits_history             | NO      |
| events_waits_history_long        | NO      |
| global_instrumentation           | YES     |
| thread_instrumentation           | YES     |
| statements_digest                | YES     |
15 rows IN SET (0.00 sec)



Final optimization for the pre-filtering is to decide IF we want to catch all the objects and reduce them to a subset. By default PS will use the settings below:


(root@localhost) [performance_schema]>select * FROM setup_objects;
| EVENT       | mysql              | %           | NO      | NO    |
| EVENT       | performance_schema | %           | NO      | NO    |
| EVENT       | information_schema | %           | NO      | NO    |
| EVENT       | %                  | %           | YES     | YES   |
| FUNCTION    | mysql              | %           | NO      | NO    |
| FUNCTION    | performance_schema | %           | NO      | NO    |
| FUNCTION    | information_schema | %           | NO      | NO    |
| FUNCTION    | %                  | %           | YES     | YES   |
| PROCEDURE   | mysql              | %           | NO      | NO    |
| PROCEDURE   | performance_schema | %           | NO      | NO    |
| PROCEDURE   | information_schema | %           | NO      | NO    |
| PROCEDURE   | %                  | %           | YES     | YES   |
| TABLE       | mysql              | %           | NO      | NO    |
| TABLE       | performance_schema | %           | NO      | NO    |
| TABLE       | information_schema | %           | NO      | NO    |
| TABLE       | %                  | %           | YES     | YES   |
| TRIGGER     | mysql              | %           | NO      | NO    |
| TRIGGER     | performance_schema | %           | NO      | NO    |
| TRIGGER     | information_schema | %           | NO      | NO    |
| TRIGGER     | %                  | %           | YES     | YES   |
20 rows IN SET (0.00 sec)



It is easy to understand that ANY object existing in the default Schema will be ignored. In our case, for now, we will keep it as it is, but this will be our next filtering step after we have analyzed some data. This will happen in the PART 2, stay tuned.


For now, you should understand what a Performance Schema is, its basic concept, as well as what interfaces are available and for what. You should also be able to compile the source code with and without PS, or part of it. You should be able to configure the MySQL configuration file correctly, and perform the initial configuration at runtime. Finally, you should know how to query the PS and how to dig in the information, which will also be discussed in the Part 2.

Folks, as usual Pythian is organizing the community dinner. After many years, food, (responsible) drinking and photos, this event has become an important moment for all of us, to know each other better, discuss and have fun.

This year is also the 20th year for MySQL so … YEAAAH let us celebrate, with more food, fun and responsible drinking.

If you had not done it yet … register yourself here:

Info about the event:

When: Tuesday April 14, 2015 – 7:00 PM at Pedro’s (You are welcome to show up later, too!
Where: Pedro’s Restaurant and Cantina – 3935 Freedom Circle, Santa Clara, CA 95054

I know, I know … we are that kind of people that decide where to go at the last minute, and every years we do the same, but if you could register, that will help us to organize it better … and c’mon the dinner is on Tuesday … so we are almost there!!!

Anyhow, hope to see all of you there, all of you!

Some reference: Menu Eventbrite Pedro


Nowadays most companies use geographic distributed platforms to better serve their customers. It is quite common to see companies with datacenters in North America, Europe and Asia; each site hosting databases and shared data. In some cases the data is simply spread around for better accessibility; in other cases it is localized and different on each geographic location; in most cases a mix of the two.

Useless to say that most of the solutions were implemented using MySQL, and asynchronous replication. MySQL Asynchronous replication had being the most flexible solution, but at the same time also the most unreliable given the poor performance, lack of certification and possible data drift.

In this scenario the use of alternative solutions, like MySQL Synchronous (galera) replication had being a serious challenge. This because the nodes interactions was so intense and dense, that poor network performance between the locations was preventing the system to work properly; or to put it in another way, it was possible only when exceptional network performances were present.

As such horrible solutions like the following are still implemented.


I assume there is no need to explain how multi-circular solutions are a source of trouble, and how they seems to work, until you realize your data is screw.

So question is, what is the status of MySQL Synchronous replication and there is any possibility to successfully implement it in place of Asynchronous replication?

The honest trustable answer is it depends.

There are few factors that may allow or prevent the usage of MySQL Synchronous replication, but before describe them, let us review what had happened in the development of galera 3.x that had significantly changed the scenario.

The first obvious step is to validate the network link, to do so I suggest to follow the method describe in my previous article (Effective way to check network connection).

Once you had certify that, the next step is to design correctly the cluster, assigning the different geographic areas to the logical grouping with the segment feature Galera provide, to know more about segments (geographic-replication-with-mysql-and-galera)


Assuming two scenarios, one is for simple Disaster Recovery, while the second for data distribution.

The first one will be locate on same continent, like Italy and France or Canada and USA. While the second can be distributed like Italy, Canada and USA.

We can come up with a schema for this two solutions that looks like:

image003 OR image005


This looks easy, but I can tell you just right now that while the solution distributed on 3 geographic areas is going to work, the first one will have issue in case of crash.

To understand why, and to correctly design the segments you need to understand another important concept in Galera, which is the quorum calculation.


Generally, you will hear or read saying that Galera cluster should be deploy using an odd number of nodes to correctly manage the quorum calculation. This is not really true and not always needed if you understand how it works, and calculate it correctly.


Galera Cluster supports a weighted quorum, where each node can be assigned a weight in the 0 to 255 range, with which it will participate in quorum calculations.

The formula for the calculation is:


In short the weight sum of the previous view, excluding the node that had left gently, divided by two must be less than the sum of the weight of the current view.

Wait ... what is a view? A view is the logical grouping of nodes composing the cluster.


WSREP: view(view_id(PRIM,28b4b776,78) 
	memb { 
joined {} left {}
partitioned { b9aabaa5,1 <--- node is shutting down}) 



This is a view with ID 78 containing a group of nodes that is the PRIMARY Component, and having one node shutting down gently.


The View information is kept inside galera and on modification of the node(s) presence or access it is updated, the ID is incremented and the new view is compared with the previous following the formula describe above.


I think is quite clear, as such let see why I said that the first solution will not correctly work (by default), or more correctly will not work as you may expect.

So we have 6 nodes distributed cross 2 geo site each with different segment identifier.


If the network between the two sites will have issues and cluster cannot communicate the whole cluster will become NON-Primary:



As you can see if ONE of the two segment will become non reachable, the other will not have enough quorum to remain PRIMARY given 3 is not greater then 6/2 (3).

This is obviously the scenario in which all the weight is set as default to 1. This is also why it is recommended to use ODD nodes.


Just for clarity, see what happened if ONE node goes down and THEN the network crashes.



As you can see here the Final view has the quorum, and in that case the site in segment 1 will be able to stay up as PRIMARY, given 3 is greater then 5/2.

Anyhow back to our production – DR site how this can be set?


The first one is to decide that one of the two side will always win, like say production:



In this scenario the Segment 1 will always win, and to promote the DR to PRIMARY you must do it manually.  That will work, but may be is not what we expect if we choose this solution for DR purposes.

The other option is to use a trick and add a witness like the arbitrator GARBD.


I don’t like the use of GARBD , never had, in the Codership documentation:

If one datacenter fails or loses WAN connection, the node that sees the arbitrator, and by extension sees clients, continues operation.


Even though Galera Arbitrator does not store data, it must see all replication traffic. Placing Galera Arbitrator in a location with poor network connectivity to the rest of the cluster may lead to poor cluster performance.

This means that if you use GARBD you will in any case have all the cost of the traffic but not the benefit of a real node. If this is not clear enough I will show you a simple case in which it may be more an issue then a solution.

In this scenario we will use GARBD and see what happen



We will have the quorum, but the point is… we may have it on both side, as such if the two segments will not be able to communicate, but are able to see the witness aka GARBD, each of them will think to be the good one. In short this is call split-brain, the nightmare of any DBA and SA.

As such the simple but real solution when using Galera also for DR, is to think at it as a geographically distributed cluster and add AT LEAST a 7th node, that will allow the cluster to calculate the quorum properly and in case two segments are temporary unable to connect. Not only the use of a third segment will work as man-in-the-middle passing messages from one segment to another, including the WriteSets.


So in case of real crash of ONE of the segment, the others will be able to keep going as PRIMARY without issue. On the other hand in case of crash of one of the network link, the cluster will be able to survive and distribute the data.




Use asynchronous replication to cover geographic distribution, may still be an option when the network or specific data access mode will prevent it.  But the use of MySQ/Galera may be help you a lot in keep your data consistency under control and to manage HA more efficiently.

As such whatever need you may have (DR or distributed writes) use three different segments and sites, no matter if only for DR. This will improve the robustness of your solution.

MySQL/Galera  is not only a good solution to have a geographical write distributed solution, but is also a robust solution in case of crash of one of the network link.

In that case the cluster will continue to work, but it may be in degraded state, given the third segment will have to forward the data to the other two nodes.

About that I have not yet perform extensive tests, but I will and post additional information.

"Hi Paul how is going?" "Good Marco and you?" "Good, I had a stressful week last week but now is ok, I mange to close some pending activities, working a little bit more during the day, doing that I was able to reduce the queue of pending task, and now all is normal again", "good for you that you manage, I had too many things ongoing and was not able to dedicate more time to queue".


The simple (boring) conversation above hides one of the most complex elaborations of monitoring data. We as human being do a lot of data processing in very short time. We may be less fast in doing some calculations respect to computers, but no computer can compete with us when we are talking about multitask and data processing.


To answer to someone asking you how you are, you do not simple review your status in that moment, your brain decide on the base of the last time you have see the person to review all the relevant data and provide a synthesis of the relevant facts, then again you summarize in "good" because you do not consider relevant to pass over all single facts to your friend but only the conclusion.


Not only, during the same process, you evaluate, in relation to your relationship with the person, what kind of information you may want to share and why, how to present to him/her such that it will be relevant and interesting for the interaction.


The simple process of talking also may happen while you are walking along the street, taking care of the traffic, and expressing interest, curiosity or annoyance to your collocutor.

Each expression you will show on your face is the result of the data collection, analysis and decision your brain is taking. Plus some other coming from more in depth inner process, like deep fear or surprise, but that is out of the context now.

The funniest interesting thing is that we are so use to do this and to summarize in such efficient way, that we consider funny or totally out of context, when we see someone not doing so.


Just think about how hilarious is Sheldon Lee Cooper (for the ones who do not know what I am talking about

In the show Sheldon is quite often answering to the simple question "How are you?" with a massive amount of information, that not only is not understood, but also totally irrelevant, and as such in that context hilarious.

Hilarious in that context I sais, but far to be hilarious in real life, this because we are so expose to external signal and information that we should not and cannot spend time and resource, elaborating incoming information just to know "How our friend is doing". In the evolution it was decide that was the owner of the information that has to process it, that has to elaborate his data and expose only what is relevant for his interaction.


Just think what life would be IF instead of saying "Good thank you" to the question "How are you", you would start to enumerate all the facts in each single details, or with some aggregation, to each single person that asks you the same question and expect them to sort out if that means good or not. Crazy eh? I would say quite inefficient and source of possible misunderstanding as well.

Someone may decide that working an hour more per day is totally unacceptable, and as such your status would be "Bad" instead "Good", which is the exact opposite of how you really feel.

As said this way of acting and behaving, is not something coming from the void, but instead the result of a long process that had be refine in 2,5 millions of years (Homo habilis). The evolution had decide that is much more efficient to have Marco telling to Paul how he is doing, than Paul try to read all the information from Marco and then elaborate, with his parameters, how Marco is doing.

I am going to say that, well the evolution is right, and I am quite happy with what we had achieve, also if we had taken some million of years to get there.

I am also confident that you too, see how this is more efficient, and correct.

So, for God sake, why are we still using a method that is not only inefficient but also exposing us to mistakes, when we have to know how complex system feel, systems that are less complex then us, but complex anyhow.

Why are we "monitoring" things, exposing numbers, and pretend to elaborate those with the illusion to finally GET "How are you?"

Would not much more efficient, and statistically more prune of errors just ask "Hi my nice system, how are you today?" "Marco you are boring, you ask me that every day, anyhow I am good" "There is anything you need?" "Yes please, check the space I may run out in a week" "Oh thanks to let me know in advance I will".


Am I crazy? No I don't think so, is it something that we see only in the movies? Again no I don't think so, and actually is not so far from what we may start to do.

How we can move from a quite wrong way of doing, collecting useless amount of data to analyze to get simple synthetic information?


Here is my dream

Let us start simple, you cannot ask to someone "How are you?" if he is dead, is a yes/no condition. But this does not apply to complex systems, in our body every day we loose cells they die, but we replace tem, and our brain is not sending us warning message for each one of them.

But we do have alert messages if the number of them become too hi such that primary function can be compromise.

In short our brain discriminate between what can be compensate automatically and what not, and bother us only when the last one occur.

What can be done to create monitor monad, that is internally consistent and that allow us to scale and to aggregate?

The main point as state above is to do not flood the collocutor with information, but at the same time do not loose the meaning and if in the need the details.


This is the first point we can separate between what we need to archive and what we need to get as answer.

To be clear, I ask you "How are you" "Good thank you", that is what I need to know, but at the same time I may be in the position to download your data, and collect all the metrics relevant.

I had to use a heart monitor after few events, and what happened was quite simple. They attach to my body a monitor that was sending detailed metrics of my heart to them directly, plus they were calling me to ask, "How you feel today?" The detailed information was for them to eventually dig in the system if something goes bad.


The detailed information is easy to collect the challenge come from the amount of data, how to store aggregate and so on, but all that is the usual boring and old stuff.

What I see instead interesting is how to get the monad to work, how to define the correct way to balance the analysis.

My idea is quite simple; assume the easiest case where we have to process just 3 different metrics to get a meaningful state, something like IO/CPUuser/Net incoming.

A simple




will work fine; each solid vertex is a metric plus one that is the yes/no condition (am I alive?).

The centre of the solid represent the initial state; state in which all the forces are in perfect balance and there is no variation in the position of the Point of Balance from the centre itself.


We know that any system is never in perfect balance, we also know that each system may behave differently on the base of the N factors, where N is not determinate, but change not only in relation of the kind of system, but also between system that behave to the same class. In short try to define N is a waste of time.


What can be done, and guess what is exactly what we do when we move from Blastula to new born, we can learn what is the correct level of variation, meaning we can learn by each system which is the variation that do not compromise our functions.


Initially we may have a define AC which is the acceptable range inside which the point can fluctuate, for each vertex we have an F for the possible fluctuation, when F =0 in one of more of the directions we can say "Huston we have a problem".


While learning, our system will identify what may be the right shape and distance for the F such that the initial circle may become something like this:



Which means that any movement of our point inside the AC area will give us the answer "I am good thanks". Any movement outside, will generate a possible signal like "I am stressed my CPU is overload".

This is a very simple basic example, and it may be not clear how this scale and how it could resolve much more complex scenario. So let us go ahead.


A simple solid like a triangular pyramid covers something that is very basic. But if for instances you need to provide the status of a more complex interaction say a database status or a more complex system, then you may have one or many solid with much more complex interaction:


With the solid disdyakis triacontahedron we can have 62 vertexes, meaning that with the same algorithm we can a associate a significant number of metrics.


Each solid is seen from the whole as single entity, like if enclose in a sphere that shows only the "final" status:


The flexibility comes from the fact we can connect any solid to another in exclusive mode or as subsystem(s), at the same time each element can be quite complex internally but expose simple and direct status.


So for instance a simple one can be like:


While a more complex and probably the most common would be:


In this case we can assume to have a group describing the status for a Storage Engine another for whatever happen on the storage, and so on until we have a Node of our architecture fully describe.


At this point it should be clear that once we had cover the internal complexity of the variation for each solid, the outcome is a simplify message "I am good" no matter at what level we are looking it.

That will allow us to eventually have quite complex system, with complex relations, be described and report status in a very simple and immediate way.


Understanding what is going on in a system like this:


Can be quite difficult and taking time. Using standard way of monitoring, we will not be sure if there is a correlation between the metrics, and if it is taking in to account correctly the behaviour of the Node.


Using the new approach will allow us to, first of all get simple feedback:


Basically, given a node affected (badly ... give it is dead) all the others are still answering, "I am good", but the Nodes related will start to say, "I am not happy", "I am very sad my node is dead", "I am cool don't worry load and service are under control".

And I will focus directly on my dead node and how to fix it. Given the way I collect my information and report the state, I will be able to see that in the timeline and focus directly on the moment issues starts, for the node.



No message is a message

What is also very important to consider, is that once we have define the correct behaviour for each solid, that happen during the learning period, we also know what is the expected behaviour and what signals we should see.


In short if you go in the gym and do 45 minutes on the treadmill, you expect to have higher heart rate, to feel fatigued and sweaty. If that doesn't happen then either you were cheating not doing the right exercise, or probably you are a cyber-man and you were not aware of that.


Getting the right signal in the right context, also when the signal is a negative one is as important as, or even more, then getting a good one.


Last year my mother had a quite important surgery, the day after that she was great, feeling perfectly, no pain, no bad signals. And she was dying down; the doctor start to be uncomfortable with her NOT feeling some level of pain or whatever discomfort. Luckily they take action and save her (at the last second) but they did.


Nowadays we just collect metrics, and very rarely we put them in relation, and even more rarely we try to get the negative reading as relevant event. This because we currently do not have a way to contextualize the right behaviour, to know how thing can be correctly handled, and as such what is the deviation from that.


The implementation I am describing not only takes in to account the behaviour not the singe event, but it also can trace and identify the lack of a negative signal, a signal that must take place to keep the behaviour healthy.


What I really want to stress out is that the way we do monitor today is the same that trying to manage the space shuttle with stone and scalpel.


There are many solutions out there, but all of them are focus on more or less the same approach/model.

Better then nothing of course, and yes we still have situation in which we have NO monitoring. But still I think that changing the paper of the wraps is not doing something new in relation to the content.


I do not pretend to know how to implement my idea, the algorithm to calculate the variation and the interaction in the solid, is something I do not see in my range. But that just means I need to find someone able to share a dream and with quite good mathematical skills.

Latest conferences


We have 439 guests and no members online