Sysbench and the Random Distribution effect

What you may not know about random number generation in sysbench

Sysbench is a well-known and widely used benchmarking tool. Originally written by Peter Zaitsev in the early 2000s, it has become a de facto standard for testing and benchmarking. Nowadays it is maintained by Alexey Kopytov and can be found on GitHub at https://github.com/akopytov/sysbench.

What I have noticed, though, is that while sysbench is widely used, some of its aspects are not familiar to many. For instance, the easy way to expand or modify the MySQL tests using the Lua extension, or the way it handles random number generation internally.

Why this article?

I wrote this article with the intent to show how easy it can be to customize sysbench to make it what you need. There are many different ways to extend sysbench, and one of these is through proper tuning of the random ID generation.

By default, sysbench comes with five different methods to generate random numbers. But very often (in fact, almost all the time), none is explicitly defined, and it is even rarer to see any parametrization when the method allows it.

If you wonder “Why should I care? Most of the time defaults are good”, well, this blog post is intended to help you understand why this may not be true.

Let us start.

What methods do we have in sysbench to generate numbers? Currently the following are implemented and you can easily check them invoking the --help option in sysbench:

Special
Gaussian
Pareto
Zipfian
Uniform

Of these, Special is the default, with the following parameters:

rand-spec-iter=12 number of iterations for the special distribution [12]
rand-spec-pct=1 percentage of the entire range where 'special' values will fall in the special distribution [1]
rand-spec-res=75 percentage of 'special' values to use for the special distribution [75]

Given I like to have simple and easily reproducible tests and scenarios, all the following data has been collected using these sysbench commands:

sysbench ./src/lua/oltp_read.lua --mysql_storage_engine=innodb --db-driver=mysql --tables=10 --table_size=100 prepare
sysbench ./src/lua/oltp_read_write.lua --db-driver=mysql --tables=10 --table_size=100 --skip_trx=off --report-interval=1 --mysql-ignore-errors=all --mysql_storage_engine=innodb --auto_inc=on --histogram --stats_format=csv --db-ps-mode=disable --threads=10 --time=60 --rand-type=XXX run

Feel free to experiment yourself with the scripts, instructions, and data available here (https://github.com/Tusamarco/blogs/tree/master/sysbench_random).

What is sysbench doing with the random number generator? Well, one of the ways it is used is to generate the IDs to be used in the query generation. So for instance in our case, it will look for numbers between 1 and 100, given we have 10 tables with 100 rows each.

What will happen if I run the sysbench run command as above and change only the --rand-type?

I have run the script and used the general log to collect/parse the generated IDs and count their frequencies, and here we go:

[Graphs: frequency of the generated IDs for Special, Uniform, Zipfian, Pareto, and Gaussian]

Makes a lot of sense, right? Sysbench is, in the end, doing exactly what we were expecting.
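The counting step mentioned above (parsing the general log and tallying the generated IDs) can be sketched roughly as follows. This is a minimal illustration in Python, not the script used for the article: the regular expression and the sample log lines are assumptions about what single-row queries look like in a general log.

```python
import re
from collections import Counter

# Hypothetical pattern: matches "WHERE id=42" or "WHERE id = 42" in a logged query.
ID_PATTERN = re.compile(r"\bWHERE\s+id\s*=\s*(\d+)", re.IGNORECASE)


def count_ids(log_lines):
    """Return a Counter mapping each ID found in the log lines to its frequency."""
    counts = Counter()
    for line in log_lines:
        for match in ID_PATTERN.finditer(line):
            counts[int(match.group(1))] += 1
    return counts


# Made-up sample lines, only meant to exercise the function.
sample_log = [
    "12 Query SELECT c FROM sbtest1 WHERE id=50",
    "12 Query UPDATE sbtest3 SET k=k+1 WHERE id=50",
    "13 Query SELECT c FROM sbtest2 WHERE id=51",
    "13 Query SELECT c FROM sbtest2 WHERE id BETWEEN 44 AND 56",  # range query: not counted
]
id_counts = count_ids(sample_log)
for value, hits in sorted(id_counts.items()):
    print(value, hits)
```

Range queries are deliberately skipped here; counting them would need a decision on how to weight each ID in the range.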

Let us check one by one and do some reasoning around them.

Special

The default is Special, so whenever you DO NOT specify a --rand-type, sysbench will use Special. What Special does is use a very, very limited number of IDs for the query operations. Here we can say it mainly uses IDs 50-51, very sporadically a set between 44 and 56, and the others are practically irrelevant. Please note that the chosen values are in the middle of the available range 1-100.

In this case, the spike is focused on two IDs, representing 2 percent of the sample. If I increase the number of records to one million, the spike still exists and is focused on 7,493 IDs, roughly 0.75% of the sample. That is proportionally even more restrictive, although with that many IDs the hot set will probably span more than one page.
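To get a feeling for how the three rand-spec-* parameters can produce such a spike, here is a toy model in Python. It is only a sketch built from the parameter descriptions above and is not sysbench's actual code: it assumes that a share of the draws (res) lands uniformly in a small window in the middle of the range (pct), and that the remaining draws follow a bell-like shape obtained by averaging several uniform draws (iterations).

```python
import random
from collections import Counter


def special_id(rng, low, high, iterations=12, pct=1, res=75):
    """Toy 'special' draw: a hot window in the middle plus a bell-shaped remainder."""
    span = high - low + 1
    if rng.random() * 100 < res:
        # Hot draw: uniform inside a window covering pct% of the range, centred on the middle.
        window = max(1, span * pct // 100)
        start = low + (span - window) // 2
        return start + rng.randrange(window)
    # Remaining draws: the average of several uniform draws approximates a bell curve.
    total = sum(rng.randint(low, high) for _ in range(iterations))
    return total // iterations


rng = random.Random(42)
samples = 100_000
draws = Counter(special_id(rng, 1, 100) for _ in range(samples))
top_id, top_hits = draws.most_common(1)[0]
print(top_id, round(top_hits / samples, 2))
```

Raising pct in this model widens the hot window, and lowering res moves more of the draws into the bell-shaped part, which is the kind of relaxation discussed later in the article.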

Uniform

As declared by the name, if we use Uniform, all the values are going to be used for the IDs and the distribution will be … Uniform.

Zipfian

The Zipf distribution, sometimes referred to as the zeta distribution, is a discrete distribution commonly used in linguistics, insurance, and the modeling of rare events. In this case, sysbench uses a set of numbers starting from the lowest (1), with the frequency dropping very quickly as it moves towards bigger numbers.
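As a rough illustration of that shape, the sketch below computes the textbook Zipf law, where the probability of rank k is proportional to 1/k^s. This is an assumption used for illustration only; sysbench's own Zipfian generator may be implemented differently, and the exponents shown are arbitrary examples.

```python
def zipf_shares(n, exponent):
    """Return the probability of each ID 1..n under a textbook Zipf law."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


for exponent in (0.0, 0.8, 2.0):
    shares = zipf_shares(100, exponent)
    # Share taken by ID 1, and share taken by the first 10 IDs.
    print(exponent, round(shares[0], 3), round(sum(shares[:10]), 3))
```

An exponent of 0 gives every ID the same weight, while larger exponents push more and more of the load onto the first few IDs.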

Pareto

Pareto applies the 80-20 rule (read https://en.wikipedia.org/wiki/Pareto_distribution), so the IDs we use are even less spread out and more concentrated in a small segment. 52 percent of all the IDs used were the number 1, while 73 percent were within the first 10 numbers.
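Those two percentages can be reproduced with a short calculation. The sketch below assumes that the generator maps a uniform draw u to an ID with id = 1 + floor(n * u^p), where p = log(h) / log(1 - h) and h = 0.2 stands for the 80-20 rule. That mapping is my assumption about how the skew is produced, not something stated in this article, but it gives values in line with the measured 52 and 73 percent.

```python
import math


def pareto_power(h):
    """Exponent applied to the uniform draw; h=0.2 corresponds to the 80-20 rule."""
    return math.log(h) / math.log(1.0 - h)


def share_of_first(k, n, h=0.2):
    """Probability that an ID drawn from 1..n lands in the first k IDs.

    Assumed mapping: id = 1 + floor(n * u ** power), with u uniform in [0, 1).
    The ID is <= k exactly when u < (k / n) ** (1 / power).
    """
    return (k / n) ** (1.0 / pareto_power(h))


print(round(share_of_first(1, 100), 3))   # share of ID 1
print(round(share_of_first(10, 100), 3))  # share of IDs 1-10
```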

Gaussian

The Gaussian distribution (or normal distribution) is well known and familiar (see https://en.wikipedia.org/wiki/Normal_distribution) and is mostly used in statistics and in predictions around a central factor. In this case, the IDs used are distributed along a bell curve that peaks at the mid-value and slowly decreases towards the edges.

The point now is, what for?

Each one of the above cases represents something, and if we want to group them we can say that Pareto and Special can be focused on hot-spots. In that case, an application is using the same page/data over and over. This can be fine, but we need to know what we are doing and be sure we do not end up there by mistake.

For instance, IF we are testing the efficiency of InnoDB page compression on reads, we should avoid using the default Special, or Pareto, which means we must change the sysbench defaults. Consider a dataset of 1 TB and a buffer pool of 30 GB where we query the same page over and over: that page has already been read from disk, uncompressed, and is available in memory.

In short, our test is a waste of time/effort.

Same if we need to check the efficiency in writing. Writing the same page over and over is not a good way to go.

What about testing the performance?

Well again, what performance are we looking to identify, and against what case? It is important to understand that using a different rand-type WILL impact your test dramatically. So your “defaults should be good enough” may be totally wrong.

The following graphs represent the differences that appear when changing ONLY the rand-type value; test type, time, additional options, and number of threads are exactly the same.

Latency differs significantly from type to type:

Here I was doing reads and writes, and the data comes from the Performance Schema, queried through the sys schema (sys.schema_table_statistics). As expected, Pareto and Special take much longer than the others, given that the system (MySQL-InnoDB) is artificially suffering from contention on one hot spot.

Changing the rand-type affects not only latency but also the number of processed rows, as reported by the performance schema.

Given all the above, it is important to classify what we are trying to determine, and what we are testing.

If my scope is to test the performance of a system, at all levels, I may prefer to use Uniform, which will equally stress the dataset/DB Server/System and will have more chances to read/load/write all over the place.

If my scope is to identify how to deal with hot-spots, then probably Pareto and Special are the right choices.

But when doing that, do not blindly go with the defaults. Defaults may be good, but they are probably recreating edge cases. That is my personal experience, and in that case you can use the parameters to tune the distribution properly.

For instance, you may still want to have sysbench hammering using the values in the middle, but you want to relax the interval so that it will not look like a spike (Special-default) but also not a bell curve (Gaussian).

You can customize Special and have something like:

In this case, the IDs are still grouped and we still have possible contention, but a single hot-spot has less impact: the possible contention is now spread over a set of IDs that can sit on multiple pages, depending on the number of records per page.
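The dependency on the number of records per page is simple arithmetic, sketched below in Python. The assumptions are mine: rows are packed into pages in ID order, and the "relaxed" window of IDs 40-60 is a hypothetical example, not the values used for the graph.

```python
def pages_touched(ids, rows_per_page):
    """Return the set of page numbers hit by the given IDs (rows packed in ID order)."""
    return {(row_id - 1) // rows_per_page for row_id in ids}


default_hot_ids = range(50, 52)  # IDs 50-51, the default spike described earlier
relaxed_hot_ids = range(40, 61)  # hypothetical wider window after tuning rand-spec-*

for rows_per_page in (5, 20):
    print(rows_per_page,
          len(pages_touched(default_hot_ids, rows_per_page)),
          len(pages_touched(relaxed_hot_ids, rows_per_page)))
```

With few rows per page the relaxed window spreads the contention over several pages, while with many rows per page it may still sit on one or two.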

Another possible test case is based on Partitioning. If, for instance, you want to test how your system will work with partitions and focus on the latest live data while archiving the old one, what can you do?

Easy! Remember the graph of the Pareto distribution? You can modify that as well to fit your needs.

Just by tuning the --rand-pareto-h value, you can easily achieve exactly what you were looking for and have sysbench focus the queries on the higher ID values.
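The inversion can be illustrated with a small simulation. As before, this is a sketch under an assumption of mine, namely that the generator maps a uniform draw u to low + floor(span * u^p) with p = log(h) / log(1 - h); the h values below are examples, not the ones used for the graph.

```python
import math
import random


def pareto_id(rng, low, high, h):
    """Draw one ID using the assumed mapping low + floor(span * u ** power)."""
    power = math.log(h) / math.log(1.0 - h)
    span = high - low + 1
    return low + min(span - 1, int(span * rng.random() ** power))


def share_in_range(h, first, last, samples=50_000, seed=7):
    """Fraction of sampled IDs (from 1..100) that fall inside [first, last]."""
    rng = random.Random(seed)
    hits = sum(first <= pareto_id(rng, 1, 100, h) <= last for _ in range(samples))
    return hits / samples


for h in (0.2, 0.5, 0.8):
    # Share of the lowest ten IDs versus share of the highest ten IDs.
    print(h, round(share_in_range(h, 1, 10), 2), round(share_in_range(h, 91, 100), 2))
```

In this model h=0.2 concentrates the load on the lowest IDs, h=0.5 degenerates to a uniform spread, and h=0.8 moves the hot area to the highest IDs.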

Zipfian can also be tuned, and while you cannot obtain an inversion as with Pareto, you can easily get from spiking on one value to equally distributed scenarios. A good example is the following:

The last thing to keep in mind, obvious as it may sound, is that when you change the distribution-specific parameters, performance will change as well.

See latency details:

Here you can see in green the modified values compared with the original in blue.

Conclusion

At this point, you should have realized how easy it can be to adjust the way sysbench handles random generation, and how effective that can be in matching your needs. Keep in mind that what I have mentioned above is valid for any call that relies on the default generator, such as when we use sysbench.rand.default:

local function get_id()
   return sysbench.rand.default(1, sysbench.opt.table_size)
end

Given that, do not just copy and paste strings from other people’s articles; think, and understand what you need and how to achieve it.

Before running your tests, check the random method/settings to see how it comes up and if it fits your needs. To make it simpler for me, I use this simple test (https://github.com/Tusamarco/sysbench/blob/master/src/lua/test_random.lua). The test runs and will print a quite clear representation of the IDs distribution.

My recommendation is, identify what matches your needs and do your testing/benchmarking in the right way.

References

First and foremost reference is for the great work Alexey Kopytov is doing in working on sysbench https://github.com/akopytov/sysbench

Zipfian articles:

Pareto:

https://en.wikipedia.org/wiki/Pareto_distribution

Percona article on how to extend tests in sysbench https://www.percona.com/blog/2019/04/25/creating-custom-sysbench-scripts/

The whole set material I used for this article is on github (https://github.com/Tusamarco/blogs/tree/master/sysbench_random)

#StopTRUMP

While the peace symbol was designed by Gerald Holtom in 1958 for nuclear disarmament, over the years it has become more. It represents the will to resolve conflict with words and understanding, the will to be respectful of human and cultural diversity, and the will to put egoism, egotism, and egocentrism aside.

Nowadays it is so difficult to remember that sometimes we just need to say NO to political and social abuse. We must be able to take a position even if this means that we may risk something, or that we may end up in a less comfortable position.

Acts of violence are never the right answer; they never lead to positive results. It is a bad thing when a single person commits an act of violence, but it becomes a horror when a state does it, and even worse when it is justified with words such as freedom or, worse still, peace.

What is going on these days is terrible; the acts of terrorism that Mr. Trump is requesting and justifying are beyond any understanding (except for political and economic interest).
Words like "we have targeted 52 Iranian sites (representing the 52 American hostages taken by Iran many years ago), some at a very high level & important to Iran & the Iranian culture, and those targets, and Iran itself, WILL BE HIT VERY FAST AND VERY HARD. The USA wants no more threats!" are not words that a head of state like the US president should say. These words are a violent attack on any democratic country, and mainly express the will of the US President to use terrorism against any possible competitor.

We should hope that the House of Representatives will be able to limit Mr. Trump's current course of action (https://eu.usatoday.com/story/news/politics/2020/01/05/iran-nancy-pelosi-says-house-vote-limit-trumps-war-powers/2821968001/ https://www.nytimes.com/2020/01/04/opinion/editorials/trump-iran-threats-Suleimani.html?action=click&module=Opinion&pgtype=Homepage).

But what is very concerning is the silence coming from many other international actors, and most concerning of all is the silence of the EU.
There is NO justification for the European Union to wait before rejecting and condemning the recent act of terrorism carried out by Mr. Trump. We Europeans have killed each other for millennia, and we should know well that PEACE and respect are values we cannot neglect.

Europe cannot, out of cowardice, accept in silence acts of violence of this magnitude. If the same had been done by Libya or Senegal, the whole EU would have risen with one voice to condemn it. The fact that this was done by the USA is what is stopping everyone. But it is time for the EU, and probably the whole United Nations, to realize that the USA, under this administration, is a rogue state and must be isolated and stopped.

The USA is a great country, with great people. But it has lost its north; the way the Trump administration (and the Bushes before it) is leading the country is against the interest and the good of the whole world.

A recent article, huffpost.com/highline/article/sanctions/, is a must-read to get a better understanding of what can and should be done.

There is something we can do: use the Internet, Facebook, and Twitter not only to show our dogs' underpants or the last shitty meal we had.

We can use them to put pressure on our governments to say NO to violence and acts of terrorism like the recent one carried out by Mr. Trump.

There is no excuse for a head of state to push the world to the edge of another global war, no excuse!

#stopTRUMP


Dirty reads in High Availability solution

Understand dirty reads when using ProxySQL

Recently I was asked to dig a bit into WHY some users were getting dirty reads when using PXC and ProxySQL.

While the immediate answer was easy, I took the opportunity to dig a bit deeper and build up a comparison between different HA solutions.

For those who cannot wait, the immediate answer is (drum roll): PXC is based on Galera replication and, as I have been saying for a VERY long time (since 2011), Galera replication is virtually synchronous. Given that, if you are not careful you MAY hit some dirty reads, especially if it is configured incorrectly.

There is nothing really bad here; we just need to know how to handle it correctly.

In any case the important thing is to understand some basic concepts.

Two ways of seeing the world (the theory)

Once more, let us talk about the data-centric and the data-distributed approaches.

We can have one data state:

[Figure: data-centric approach, a single data state across nodes]

Where all the data nodes see a single state of the data. That is, you will consistently see the same data at a given moment T in time, where T is the moment of commit on the writer.

Or we have data distributed:

[Figure: data-distributed approach, an independent data state on each node]

Where each node has an independent data state. This means that data can be visible on the writer, but not yet visible on another node at the moment of commit, and that there is no guarantee that data will be passed over in a given time.

The two extremes can be summarized as follows:

Tightly coupled database clusters

Data Centric approach (single state of the data, distributed commit)
Data is consistent in time across nodes
Replication requires high performing link
Geographic distribution is forbidden

Loosely coupled database clusters

Single node approach (local commit)
Data state differs by node
Single node state does not affect the cluster
Replication link doesn’t need to be high performance
Geographic distribution is allowed

Two ways of seeing the world (the reality)

Given that life is not perfect and we do not have only extremes, the most commonly used MySQL solutions find their place at different points in a two-dimensional Cartesian coordinate system:

[Figure: MySQL HA solutions positioned by level of high availability and by loose or tight coupling]

This graph has the level of high availability on the X axis and the level of Loose – Tight relation on the Y axis.

As said I am only considering the most used solutions:

MySQL – NDB cluster
Solutions based on Galera
MySQL Group replication / InnoDB Cluster
Basic Asynchronous MySQL replication

InnoDB Cluster and Galera are present in two different positions, while the others take a unique position in the graph. At the two extremes we have standard replication, which is the least tight and least HA, and NDB Cluster, which is the tightest solution with the highest HA.

Translating this into our initial problem, it means that when using NDB we NEVER have Dirty Reads, while when we use standard replication we know this will happen.

Another aspect we must take into consideration when reviewing our solutions is that nothing comes easy. The more we want to move towards the top-right corner, the more we need to be ready to give up. This can be anything: performance, functionality, ease of management, etc.

When I spoke about the above for the first time, I got a few comments; the most common were about why I decided to position the solutions that way and HOW I tested it.

Well, initially I had a very complex approach, but thanks to the issue with the dirty reads and the initial work done by my colleague Marcelo Altmann, I can provide a simple empirical way that you can replicate: just use the code and instructions from HERE.

Down into the rabbit hole

The platform

To perform the following tests, I have used:

A ProxySQL server
An NDB cluster of 3 MySQL nodes and 6 data nodes (3 Node Groups)
A PXC 5.7 cluster of 3 nodes, single writer
An InnoDB cluster of 3 nodes, single writer
A 3-node MySQL replica set
1 Application node running a simple Perl script

All nodes were connected through a dedicated backbone network, separate from the front end receiving data from the script.

The tests

I ran the same simple test script with the same set of rules in ProxySQL.
For Galera and InnoDB cluster I used the native support in ProxySQL, also because I was trying to emulate the issues I was asked to investigate.

For standard replication and NDB I used the mysql_replication_hostgroups settings, with the difference that the latter had 3 writers, while basic replication had only 1.

Finally, the script was a single-threaded operation: it created a table in the test schema, filled it with some data, then read the IDs in ascending order, modified each record with an UPDATE, and tried to read it back immediately after.

When doing that with ProxySQL, the write goes to the writer Host Group (in our case 1 node also for NDB, even if this is suboptimal), while reads are distributed across the reader Host Group. If for any reason an UPDATE operation is NOT yet committed on one of the nodes that are part of the reader HG, we will have a dirty read.
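The logic of that check can be modeled in a few lines of Python. This is a toy simulation, not the Perl script used for the tests: the lag functions and their values are hypothetical, and the only rule encoded is that a read is dirty when the chosen reader has not yet applied the update at the moment the read arrives.

```python
import random


def count_dirty_reads(updates, draw_lag_ms, read_delay_ms, seed=1):
    """Count reads that reach a reader node before the update has been applied there.

    draw_lag_ms: callable returning the apply lag (ms) of the reader chosen for one read.
    read_delay_ms: time between the writer's commit and the follow-up read.
    """
    rng = random.Random(seed)
    dirty = 0
    for _ in range(updates):
        lag = draw_lag_ms(rng)
        if lag > read_delay_ms:
            dirty += 1  # the reader still returns the pre-update value
    return dirty


def async_lag(rng):
    return rng.uniform(0.0, 5.0)  # hypothetical lag range for asynchronous replication


def tight_lag(rng):
    return 0.0  # hypothetical cluster that applies everywhere before acknowledging


print(count_dirty_reads(10_000, async_lag, read_delay_ms=1.0))
print(count_dirty_reads(10_000, tight_lag, read_delay_ms=1.0))
```

The model makes one point visible: what matters is not the absolute lag but whether it is longer than the time the application takes to come back with the read.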

Simple no?!

The results

[Graph: dirty reads and lag compared across the tested solutions]

Let us review the graph. The number of dirty reads drops significantly moving from the left to the right of the graph, from 70% of the total with basic replication to 0.06% with Galera (sync_wait=0).

The average lag is the average time taken from the update commit to when the script returns the read with the correct data.

It is interesting to note a few factors:

The average cost in time in GR between EVENTUAL and AFTER is negligible
The Galera average cost with sync_wait=3 is 4 times longer than with sync_wait=0
NDB has an average cost in line with the others, BUT its max lag is very low, so the fluctuation due to synchronization is minimal (with respect to the others)
GR and Galera can have 0 dirty reads, but they need to be configured correctly.

Describing the scenario a bit more, MySQL NDB cluster is the best, period! It is less performant in single thread than PXC, but this is expected, given that NDB is designed to handle a HIGH number of simultaneous transactions with very limited impact. Aside from that, it has 0 dirty reads and no appreciable lag between the writer commit and the reader.

On the other side of the spectrum we have MySQL replication, with the highest number of dirty reads; performance was not bad, but data is totally inconsistent.

Galera (PXC implementation) is the fastest solution when single threaded and has only 0.06% of dirty reads with WSREP_SYNC_WAIT=0, and 0 dirty reads with WSREP_SYNC_WAIT=3.
With Galera, what we are seeing, and paying for, is something that works that way by design. A very good presentation (https://www.slideshare.net/lefred.descamps/galera-replication-demystified-how-does-it-work) from Fred Descamps explains how the whole thing works.

This slide is a good example:

[Figure: slide from the Galera presentation]

By design, the apply and commit finalize steps in Galera may have (and do have) a delay between nodes. When the wsrep_sync_wait parameter is changed as explained in the documentation, the node initiates a causality check, blocking incoming queries while it catches up with the cluster.

Once all data on the node receiving the READ request is commit_finalized, the node performs the read.
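The effect of the causality check can be pictured with a toy model in Python. This is only a conceptual sketch, not how Galera is implemented: the ToyNode class is hypothetical, and it simply separates "writeset received" from "writeset applied" so that a read can either ignore or wait for the pending changes.

```python
from collections import deque


class ToyNode:
    """Toy reader node: writesets are received immediately but applied later."""

    def __init__(self):
        self.data = {}
        self.pending = deque()  # writesets received but not yet applied

    def receive(self, key, value):
        self.pending.append((key, value))

    def apply_pending(self):
        while self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value

    def read(self, key, sync_wait=False):
        if sync_wait:
            # Causality check: catch up with everything received before reading.
            self.apply_pending()
        return self.data.get(key)


node = ToyNode()
node.data["row1"] = "old"
node.receive("row1", "new")               # committed on the writer, not yet applied here
print(node.read("row1"))                  # read without the check
print(node.read("row1", sync_wait=True))  # read after catching up
```

The first read returns the old value and the second the new one; the price of the second is the time spent applying what was pending.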

MySQL InnoDB Cluster is worth a bit of discussion. With MySQL 8.0.14 Oracle introduced the group_replication_consistency parameter (please read https://dev.mysql.com/doc/refman/8.0/en/group-replication-consistency-guarantees.html). In short, MySQL Group Replication can now handle in different ways the behavior of write transactions and read consistency.

Relevant to us are two settings:

EVENTUAL
- Both RO and RW transactions do not wait for preceding transactions to be applied before executing. This was the behavior of Group Replication before the group_replication_consistency variable was added. A RW transaction does not wait for other members to apply a transaction. This means that a transaction could be externalized on one member before the others.
AFTER
- A RW transaction waits until its changes have been applied to all of the other members. This value has no effect on RO transactions. This mode ensures that when a transaction is committed on the local member, any subsequent transaction reads the written value or a more recent value on any group member. Use this mode with a group that is used for predominantly RO operations to ensure that applied RW transactions are applied everywhere once they commit. This could be used by your application to ensure that subsequent reads fetch the latest data which includes the latest writes.

As shown above, using AFTER is a win and guarantees that we prevent dirty reads at a small cost.

ProxySQL

ProxySQL has native support for Galera and Group Replication, including the identification of how many transactions/writesets a node is behind. Given that, we might think ProxySQL SHOULD prevent dirty reads, and it actually does when the lag is large enough to be caught.

But dirty reads can happen in such a small time window that ProxySQL cannot catch them.

As indicated above, we are talking about microseconds or 1-2 milliseconds. To catch such a small lag, the ProxySQL monitor would have to flood the MySQL servers with requests, and it could still miss them given network latency.

Given the above, dirty reads should be handled internally, as MySQL Group Replication and Galera do, providing the flexibility to choose what to do.

There are always exceptions, and in our case the exception is basic MySQL replication. There you can install and use the ProxySQL binlog reader, which can help keep the READS under control, but it will NOT be able to prevent dirty reads when they happen within a very small time window and in small numbers.

Conclusion

Nothing comes for free; dirty reads are one of “those” things that can be prevented, but we must be ready to give something back.

It doesn’t matter what, but we cannot get everything at the same time.

Given that, it is important to identify case by case WHICH solution fits better; sometimes it can be NDB, other times Galera or Group Replication. There is NO silver bullet and there is no single way to proceed.

Also, when using Galera or GR, the more demanding settings that prevent dirty reads can be set at the SESSION level, reducing the global cost.

Summarizing

NDB is the best, but it is complex and fits only some specific usages, such as a high number of threads, simple schema definitions, and in-memory datasets
Galera is great and helps in combining performance and efficiency. It is a fast solution, yet flexible enough to prevent dirty reads at some cost.
Use WSREP_SYNC_WAIT to tune that, see (https://galeracluster.com/library/documentation/mysql-wsrep-options.html#wsrep-sync-wait)
MySQL Group Replication comes right behind; we can avoid dirty reads at a small cost by using SET group_replication_consistency='AFTER'.
Standard replication can use the ProxySQL Binlog Reader; it will help but will not prevent dirty reads.

To be clear:

With Galera use WSREP_SYNC_WAIT=3 for read consistency
With GR use group_replication_consistency='AFTER'

I suggest using SESSION, not GLOBAL, and playing a bit with the settings to understand well what is going on.
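A session-level approach could look like the following Python sketch. The helper and the RecordingCursor class are hypothetical and only illustrate the idea of enabling the stricter setting around a single statement; with a real driver you would pass a DB-API cursor instead. Note that, per the description of AFTER quoted earlier, in Group Replication the setting matters on the session that performs the write, while wsrep_sync_wait matters on the session that performs the read.

```python
SESSION_SETTINGS = {
    # flavor: (statement enabling the stricter mode, statement restoring the default)
    "galera": ("SET SESSION wsrep_sync_wait = 3",
               "SET SESSION wsrep_sync_wait = DEFAULT"),
    "group_replication": ("SET SESSION group_replication_consistency = 'AFTER'",
                          "SET SESSION group_replication_consistency = DEFAULT"),
}


def run_with_consistency(cursor, flavor, statement, params=()):
    """Run one statement with the stricter setting, then restore the session default."""
    enable, restore = SESSION_SETTINGS[flavor]
    cursor.execute(enable)
    try:
        cursor.execute(statement, params)
    finally:
        cursor.execute(restore)


class RecordingCursor:
    """Hypothetical stand-in for a DB-API cursor that only records what it is asked to run."""

    def __init__(self):
        self.executed = []

    def execute(self, statement, params=()):
        self.executed.append(statement)


cursor = RecordingCursor()
run_with_consistency(cursor, "galera", "SELECT c FROM test.t1 WHERE id = %s", (10,))
print(cursor.executed)
```

Wrapping only the statements that really need it keeps the extra wait away from the rest of the workload, which is the point of preferring SESSION over GLOBAL.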

I hope this article has given you a better understanding of the solutions we have out there, so that you will be able to make an informed decision when needed.

Reference

https://www.proxysql.com/blog/proxysql-gtid-causal-reads

https://github.com/Tusamarco/proxy_sql_tools/tree/master/proxy_debug_tools

https://en.wikipedia.org/wiki/Isolation_(database_systems)#Dirty_reads

https://galeracluster.com/library/documentation/mysql-wsrep-options.html#wsrep-sync-wait

https://dev.mysql.com/doc/refman/8.0/en/group-replication-configuring-consistency-guarantees.html

https://www.slideshare.net/lefred.descamps/galera-replication-demystified-how-does-it-work

