My MySQL tips

Community dinner @ Pedro’s
Written by Marco Tusa   
Sunday, 12 April 2015 20:54

Folks, as usual Pythian is organizing the community dinner. After many years of food, (responsible) drinking and photos, this event has become an important moment for all of us to get to know each other better, discuss and have fun.

This year also marks MySQL's 20th anniversary, so … YEAAAH, let us celebrate with more food, fun and responsible drinking.

If you have not done it yet … register here: https://www.eventbrite.com/e/the-pythian-mysql-community-pay-your-own-way-dinner-tickets-15692805604

Info about the event:

When: Tuesday April 14, 2015 – 7:00 PM at Pedro’s (You are welcome to show up later, too!)
Where: Pedro’s Restaurant and Cantina – 3935 Freedom Circle, Santa Clara, CA 95054

I know, I know … we are the kind of people who decide where to go at the last minute, and every year we do the same, but if you register it will help us organize things better … and c'mon, the dinner is on Tuesday … we are almost there!!!

Anyhow, hope to see all of you there, all of you!

Some references: Menu | Eventbrite | Pedro’s

Last Updated on Sunday, 12 April 2015 20:56
 
Geographic replication and quorum calculation in MySQL/Galera
Written by Marco Tusa   
Monday, 09 March 2015 18:42

Introduction

Nowadays most companies use geographically distributed platforms to better serve their customers. It is quite common to see companies with datacenters in North America, Europe and Asia, each site hosting databases and shared data. In some cases the data is simply spread around for better accessibility; in other cases it is localized and different in each geographic location; in most cases it is a mix of the two.

Needless to say, most of these solutions were implemented using MySQL and asynchronous replication. MySQL asynchronous replication has been the most flexible solution, but at the same time also the most unreliable, given its poor performance, lack of certification and possible data drift.

In this scenario the use of alternative solutions, like MySQL synchronous (Galera) replication, has been a serious challenge. This is because the node interaction was so intense and dense that poor network performance between the locations prevented the system from working properly; or, to put it another way, it was possible only when exceptional network performance was available.

As a result, horrible solutions like the following are still implemented.

[Figure: multi-circular asynchronous replication across sites]

I assume there is no need to explain how multi-circular solutions are a source of trouble, and how they seem to work until you realize your data is screwed.

So the question is: what is the status of MySQL synchronous replication, and is there any possibility of successfully implementing it in place of asynchronous replication?

The honest, trustworthy answer is: it depends.

There are a few factors that may allow or prevent the usage of MySQL synchronous replication, but before describing them, let us review what has happened in the development of Galera 3.x that has significantly changed the scenario.

The first obvious step is to validate the network link; to do so I suggest following the method described in my previous article (Effective way to check network connection).

Once you have certified that, the next step is to design the cluster correctly, assigning the different geographic areas to logical groupings with the segment feature Galera provides; to know more about segments see (geographic-replication-with-mysql-and-galera).
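
As a reminder, the segment a node belongs to is defined through the gmcast.segment provider option (set in my.cnf inside wsrep_provider_options; it is not dynamic). A minimal way to verify, from SQL, which segment a node is actually running with:

-- the provider options string includes the gmcast.segment value,
-- e.g. "... gmcast.segment = 1; ..."
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';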

 

Let us assume two scenarios: the first is for simple Disaster Recovery, while the second is for data distribution.

The first one will be located on the same continent, like Italy and France or Canada and the USA, while the second can be distributed across Italy, Canada and the USA.

We can come up with a schema for these two solutions that looks like this:

[Figure: two-segment DR cluster] OR [Figure: three-segment distributed cluster]

 

This looks easy, but I can tell you right now that while the solution distributed across 3 geographic areas is going to work, the first one will have issues in case of a crash.

To understand why, and to correctly design the segments, you need to understand another important concept in Galera: the quorum calculation.

 

Generally, you will hear or read that a Galera cluster should be deployed using an odd number of nodes to correctly manage the quorum calculation. This is not really true, and it is not always needed if you understand how quorum works and calculate it correctly.

 

Galera Cluster supports a weighted quorum, where each node can be assigned a weight in the 0 to 255 range, with which it will participate in quorum calculations.

The formula for the calculation is:

( Σ weight of previous view members − Σ weight of members that left gracefully ) / 2  <  Σ weight of current view members

In short: the weight sum of the previous view, excluding the nodes that left gracefully, divided by two must be less than the weight sum of the current view.

Wait ... what is a view? A view is the logical grouping of nodes composing the cluster.

 

WSREP: view(view_id(PRIM,28b4b776,78) 
	memb { 
		28b4b776,1
		79cc1886,1
		8637105e,2 
		f218f33d,2
}
joined {} left {}
partitioned { b9aabaa5,1 <--- node is shutting down}) 

 

 

This is a view with ID 78, containing a group of nodes that forms the PRIMARY component, with one node shutting down gracefully.

 

The view information is kept inside Galera, and whenever node presence or access changes it is updated: the ID is incremented and the new view is compared with the previous one following the formula described above.
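
You can watch this happening from SQL too; a minimal check using the standard Galera status counters:

-- wsrep_cluster_conf_id is the view ID, wsrep_cluster_size the number of
-- members, and wsrep_cluster_status tells whether this node is part of
-- the PRIMARY component
SHOW GLOBAL STATUS WHERE Variable_name IN
('wsrep_cluster_conf_id','wsrep_cluster_size','wsrep_cluster_status');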

 

I think this is quite clear, so let us see why I said that the first solution will not work correctly (by default), or more precisely will not work as you may expect.

So we have 6 nodes distributed across 2 geographic sites, each site with a different segment identifier.

[Figure: 6 nodes across 2 segments, 3 nodes each]

If the network between the two sites has issues and the cluster cannot communicate, the whole cluster will become NON-Primary:

[Figure: network split between the two segments; both sides drop to NON-Primary]

 

As you can see, if ONE of the two segments becomes unreachable, the other will not have enough quorum to remain PRIMARY, given 3 is not greater than 6/2 (3).

This is obviously the scenario in which all the weights are set to the default of 1. This is also why an ODD number of nodes is recommended.

 

Just for clarity, see what happens if ONE node goes down and THEN the network crashes.

[Figures: one node leaves gracefully, then the network between the segments crashes]

 

As you can see, here the final view has quorum, and in that case the site in segment 1 will be able to stay up as PRIMARY, given 3 is greater than 5/2.

Anyhow, back to our production – DR setup: how can this be set up?

 

The first option is to decide that one of the two sides will always win, say production:

[Figures: quorum with extra weight on the production segment]

 

In this scenario segment 1 will always win, and to promote the DR site to PRIMARY you must do it manually. That will work, but maybe it is not what we expect if we chose this solution for DR purposes.
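
One way to implement the "production always wins" policy is to raise the weight of the production nodes with the pc.weight provider option, which can be changed at runtime; a minimal sketch (the value 2 is just an example):

-- run on each production node: with 3 production nodes at weight 2 and
-- 3 DR nodes at weight 1, after a split production keeps 6 > 9/2,
-- while DR is left with 3 < 9/2 and goes NON-Primary
SET GLOBAL wsrep_provider_options = 'pc.weight=2';

-- verify:
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';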

The other option is to use a trick and add a witness like the arbitrator GARBD.

 

I don't like the use of GARBD, I never have. From the Codership documentation:

If one datacenter fails or loses WAN connection, the node that sees the arbitrator, and by extension sees clients, continues operation.

And

Even though Galera Arbitrator does not store data, it must see all replication traffic. Placing Galera Arbitrator in a location with poor network connectivity to the rest of the cluster may lead to poor cluster performance.


This means that if you use GARBD you will in any case pay all the cost of the traffic, but without the benefit of a real node. If this is not clear enough, I will show you a simple case in which it may be more of an issue than a solution.

In this scenario we will use GARBD and see what happens:

[Figures: network split with GARBD as witness; both segments keep quorum]

 

We will have quorum, but the point is … we may have it on both sides. If the two segments cannot communicate with each other but are both able to see the witness, aka GARBD, each of them will think it is the good one. In short, this is called split-brain, the nightmare of any DBA and SA.

As such, the simple but real solution when using Galera also for DR is to think of it as a geographically distributed cluster and add AT LEAST a 7th node in a third segment. That will allow the cluster to calculate the quorum properly in case two segments are temporarily unable to connect. Not only that: the third segment will work as a man-in-the-middle, passing messages from one segment to another, including the write-sets.

 

So in case of a real crash of ONE of the segments, the others will be able to keep going as PRIMARY without issue. On the other hand, in case of a crash of one of the network links, the cluster will be able to survive and distribute the data.

[Figure: cluster with a third segment hosting the 7th node]

 

Conclusion

Using asynchronous replication to cover geographic distribution may still be an option when the network or a specific data access mode prevents the synchronous approach. But the use of MySQL/Galera may help you a lot in keeping your data consistency under control and in managing HA more efficiently.

As such, whatever need you may have (DR or distributed writes), use three different segments and sites, even if only for DR. This will improve the robustness of your solution.

MySQL/Galera is not only a good solution for geographically distributed writes, but is also robust in case of a crash of one of the network links.

In that case the cluster will continue to work, but it may be in a degraded state, given that the third segment will have to forward the data between the other two.

About that I have not yet performed extensive tests, but I will, and I will post additional information.

Last Updated on Monday, 09 March 2015 22:45
 
The Monitoring mistake OR how dreaming can bring ideas
Written by Marco Tusa   
Sunday, 21 December 2014 21:53

"Hi Paul how is going?" "Good Marco and you?" "Good, I had a stressful week last week but now is ok, I mange to close some pending activities, working a little bit more during the day, doing that I was able to reduce the queue of pending task, and now all is normal again", "good for you that you manage, I had too many things ongoing and was not able to dedicate more time to queue".

 

The simple (boring) conversation above hides one of the most complex elaborations of monitoring data. We as human beings do a lot of data processing in a very short time. We may be slower than computers in doing some calculations, but no computer can compete with us when we are talking about multitasking and data processing.

 

To answer someone asking you how you are, you do not simply review your status in that moment; your brain decides, on the basis of the last time you saw the person, to review all the relevant data and provide a synthesis of the relevant facts. Then again you summarize it all in "good", because you do not consider it relevant to pass every single fact to your friend, only the conclusion.

 

Not only that: during the same process you evaluate, in relation to your relationship with the person, what kind of information you may want to share and why, and how to present it to him/her such that it will be relevant and interesting for the interaction.

 

The simple process of talking may also happen while you are walking along the street, watching out for traffic, and expressing interest, curiosity or annoyance to your interlocutor.

Each expression you show on your face is the result of the data collection, analysis and decisions your brain is making. Plus some others coming from deeper inner processes, like deep fear or surprise, but that is out of context now.

The interesting thing is that we are so used to doing this, and to summarizing in such an efficient way, that we consider it funny or totally out of context when we see someone not doing so.

 

Just think about how hilarious Sheldon Lee Cooper is (for the ones who do not know what I am talking about: http://en.wikipedia.org/wiki/Sheldon_Cooper).

In the show, Sheldon quite often answers the simple question "How are you?" with a massive amount of information that is not only not understood, but also totally irrelevant, and as such, in that context, hilarious.

Hilarious in that context, I said, but far from hilarious in real life. This is because we are so exposed to external signals and information that we should not and cannot spend time and resources elaborating incoming information just to know "how our friend is doing". Evolution decided that it is the owner of the information who has to process it, who has to elaborate his own data and expose only what is relevant for the interaction.

 

Just think what life would be like IF, instead of saying "Good, thank you" to the question "How are you?", you started to enumerate all the facts in every single detail, or with some aggregation, to every single person that asks you the same question, and expected them to sort out whether that means good or not. Crazy, eh? I would say quite inefficient, and a source of possible misunderstanding as well.

Someone may decide that working an hour more per day is totally unacceptable, and as such your status would be "Bad" instead of "Good", which is the exact opposite of how you really feel.

As said, this way of acting and behaving is not something coming from the void, but the result of a long process that has been refined over 2.5 million years (Homo habilis). Evolution decided that it is much more efficient to have Marco tell Paul how he is doing, than to have Paul read all the information from Marco and then elaborate, with his own parameters, how Marco is doing.

I am going to say that, well, evolution is right, and I am quite happy with what we have achieved, even if it took us some millions of years to get there.

I am also confident that you too see how this is more efficient, and correct.

So, for God's sake, why are we still using a method that is not only inefficient but also exposes us to mistakes when we need to know how complex systems feel, systems that are less complex than us, but complex anyhow?

Why are we "monitoring" things, exposing numbers, and pretend to elaborate those with the illusion to finally GET "How are you?"

Would it not be much more efficient, and statistically less prone to errors, to just ask: "Hi my nice system, how are you today?" "Marco, you are boring, you ask me that every day; anyhow, I am good." "Is there anything you need?" "Yes please, check the space, I may run out in a week." "Oh, thanks for letting me know in advance, I will."

 

Am I crazy? No, I don't think so. Is it something that we see only in the movies? Again, no, I don't think so; actually it is not so far from what we may start to do.

How can we move from a quite wrong way of doing things, collecting useless amounts of data to analyze, to getting simple, synthetic information?

 

Here is my dream

Let us start simple: you cannot ask someone "How are you?" if he is dead; that is a yes/no condition. But this does not apply to complex systems: in our body, every day we lose cells, they die, but we replace them, and our brain does not send us a warning message for each one of them.

But we do get alert messages if their number becomes too high, such that primary functions can be compromised.

In short, our brain discriminates between what can be compensated automatically and what cannot, and bothers us only when the latter occurs.

What can be done to create a monitoring monad that is internally consistent and that allows us to scale and to aggregate?

The main point, as stated above, is to not flood the interlocutor with information, but at the same time not to lose the meaning and, if needed, the details.

 

This is the first point: we can separate what we need to archive from what we need to get as an answer.

To be clear: I ask you "How are you?" "Good, thank you"; that is what I need to know, but at the same time I may be in the position to download your data and collect all the relevant metrics.

I had to use a heart monitor after a few events, and what happened was quite simple. They attached to my body a monitor that was sending detailed metrics of my heart directly to them, plus they were calling me to ask, "How do you feel today?" The detailed information was there for them to dig into if something went bad.

 

The detailed information is easy to collect; the challenge comes from the amount of data, how to store and aggregate it and so on, but all that is the usual boring old stuff.

What I see as interesting instead is how to get the monad to work, how to define the correct way to balance the analysis.

My idea is quite simple; assume the easiest case, where we have to process just 3 different metrics to get a meaningful state, something like IO / CPU user / network incoming.

A simple triangular pyramid will work fine; each solid vertex is a metric, plus one that is the yes/no condition (am I alive?).

The centre of the solid represents the initial state, a state in which all the forces are in perfect balance and there is no variation in the position of the Point of Balance from the centre itself.

[Figure: triangle with the Point of Balance at the centre]

We know that no system is ever in perfect balance; we also know that each system may behave differently on the basis of N factors, where N is not determinate but changes not only in relation to the kind of system, but also between systems that belong to the same class. In short, trying to define N is a waste of time.

 

What can be done, and guess what, it is exactly what we do when we move from blastula to newborn: we can learn what the correct level of variation is, meaning we can learn from each system which variation does not compromise its functions.

 

Initially we may have a defined AC, which is the acceptable range inside which the point can fluctuate; for each vertex we have an F for the possible fluctuation. When F = 0 in one or more of the directions, we can say "Houston, we have a problem".

 

While learning, our system will identify what may be the right shape and distance for the F such that the initial circle may become something like this:

[Figure: learned acceptable area (AC) inside the triangle]

 

This means that any movement of our point inside the AC area will give us the answer "I am good, thanks". Any movement outside will generate a signal like "I am stressed, my CPU is overloaded".

This is a very simple, basic example, and it may not be clear how this scales and how it could resolve much more complex scenarios. So let us go ahead.

 

A simple solid like a triangular pyramid covers something very basic. But if, for instance, you need to provide the status of a more complex interaction, say a database status or a more complex system, then you may have one or many solids with much more complex interactions:

[Figure: disdyakis triacontahedron]

With a solid like the disdyakis triacontahedron we can have 62 vertices, meaning that with the same algorithm we can associate a significant number of metrics.

 

Each solid is seen from the whole as a single entity, as if enclosed in a sphere that shows only the "final" status:

[Figure: solid enclosed in a sphere exposing only the final status]

The flexibility comes from the fact that we can connect any solid to another in exclusive mode or as subsystem(s); at the same time each element can be quite complex internally but expose a simple and direct status.

 

So for instance a simple one can be like:

[Figure: a simple composition of solids]

While a more complex and probably the most common would be:

[Figure: a more complex, typical composition of solids]

In this case we can assume we have one group describing the status of a storage engine, another for whatever happens on the storage, and so on, until we have a node of our architecture fully described.

[Figure: a node fully described by grouped solids]

At this point it should be clear that once we have covered the internal complexity of the variation for each solid, the outcome is a simplified message, "I am good", no matter at what level we are looking.

That will eventually allow quite complex systems, with complex relations, to be described and to report status in a very simple and immediate way.

 

Understanding what is going on in a system like this:

[Figure: a large multi-node system]

can be quite difficult and time consuming. Using standard ways of monitoring, we will not be sure whether there is a correlation between the metrics, and whether the behaviour of the node is correctly taken into account.

 

Using the new approach will allow us, first of all, to get simple feedback:

[Figure: nodes reporting status after a failure]

Basically, given an affected node (badly affected … given it is dead), all the others are still answering "I am good", but the related nodes will start to say "I am not happy", "I am very sad, my node is dead", "I am cool, don't worry, load and service are under control".

And I will focus directly on my dead node and how to fix it. Given the way I collect my information and report the state, I will be able to see that in the timeline and focus directly on the moment the issues started for the node.

[Figure: timeline view of the node failure]

 

No message is a message

What is also very important to consider is that once we have defined the correct behaviour for each solid, which happens during the learning period, we also know what the expected behaviour is and what signals we should see.

 

In short, if you go to the gym and do 45 minutes on the treadmill, you expect to have a higher heart rate and to feel fatigued and sweaty. If that doesn't happen, then either you were cheating, not doing the right exercise, or probably you are a cyber-man and were not aware of it.

 

Getting the right signal in the right context, even when the signal is a negative one, is as important as, or even more important than, getting a good one.

 

Last year my mother had quite an important surgery; the day after, she was great, feeling perfect, no pain, no bad signals. And yet she was dying; the doctors started to be uncomfortable with her NOT feeling some level of pain or discomfort. Luckily they took action and saved her, at the last second, but they did.

 

Nowadays we just collect metrics; very rarely do we put them in relation, and even more rarely do we treat a negative reading as a relevant event. This is because we currently do not have a way to contextualize the right behaviour, to know how things should correctly look, and as such what the deviation from that is.

 

The implementation I am describing not only takes into account the behaviour, not the single event, but can also trace and identify the lack of a negative signal, a signal that must take place for the behaviour to be healthy.

Conclusion

What I really want to stress is that the way we do monitoring today is the same as trying to manage the space shuttle with stone and scalpel.

 

There are many solutions out there, but all of them are focused on more or less the same approach/model.

Better than nothing, of course, and yes, we still have situations in which we have NO monitoring at all. But still, I think that changing the wrapping paper is not doing something new with the content.

 

I do not pretend to know how to implement my idea; the algorithm to calculate the variation and the interaction in the solid is something I do not see within my range. But that just means I need to find someone able to share a dream, and with quite good mathematical skills.

Last Updated on Friday, 09 January 2015 16:54
 
How to mess up your data using ONE command in MySQL/Galera.
Written by Marco Tusa   
Thursday, 27 November 2014 00:55

Or how wsrep_on can leave you with a cluster full of useless data.


This is a WARNING article, and it comes after I had been working on defining an internal blueprint on how to perform DDL operations safely using RSU.

The fun, if we want to call it fun, comes as usual from the fact that I am a curious guy and often do things my way, not always following the official instructions.

Anyhow, let us go straight to the point and describe what can happen on ANY MySQL/Galera installation.

The environment

The test environment: MySQL/Galera (Percona PXC version 5.6.20).

The cluster was based on three local nodes, no geographic distribution, and no replication in place other than Galera.

HAProxy on one application node, and a simple application writing to this table:

Table: tbtest1

CREATE TABLE `tbtest1` (
  `autoInc` bigint(11) NOT NULL AUTO_INCREMENT,
  `a` int(11) NOT NULL,
  `uuid` char(36) COLLATE utf8_bin NOT NULL,
  `b` varchar(100) COLLATE utf8_bin NOT NULL,
  `c` char(200) COLLATE utf8_bin NOT NULL,
  `counter` bigint(20) DEFAULT NULL,
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `partitionid` int(11) NOT NULL DEFAULT '0',
  `date` date NOT NULL,
  `strrecordtype` char(3) COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`autoInc`,`partitionid`),
  KEY `IDX_a` (`a`),
  KEY `IDX_uuid` (`uuid`)
) ENGINE=InnoDB AUTO_INCREMENT=482 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
 

 

Small app

#! /bin/bash --
i=1
while :
do
	echo "$i "
	# 'APP1'|'APP2' stands for: one application server inserts 'APP1',
	# the other inserts 'APP2' in strrecordtype
	mysql -ustress -pxxx -h192.168.0.35 -P 3307 -e "SET @HH=@@HOSTNAME;
insert into test.tbtest1 (a,uuid,b,c,strrecordtype,date,partitionid)
values($i,UUID(),@HH,'a','APP1'|'APP2',now(),RAND()*100)";
	i=$((i + 1))
	if [ $i -eq 100 ]
	then
		break
	fi
	sleep 0.5;
done
 

 

 

Server Information

(root@localhost:pm) [(none)]>\s
--------------
/home/mysql/templates/PCX-56/bin/mysql  Ver 14.14 Distrib 5.6.20-68.0, for Linux (i686) using EditLine wrapper
Connection id:		90
Current database:
Current user:		root@localhost
SSL:			Not in use
Current pager:		stdout
Using outfile:		''
Using delimiter:	;
Server version:		5.6.20-68.0-25.7-log Percona XtraDB Cluster binary (GPL) 5.6.20-25.7, Revision 886, wsrep_25.7.r4126
Protocol version:	10
Connection:		Localhost via UNIX socket
Server characterset:	utf8
Db     characterset:	utf8
Client characterset:	utf8
Conn.  characterset:	utf8
UNIX socket:		/home/mysql/instances/galera1-56/mysql.sock
Uptime:			2 min 38 sec

Threads: 3  Questions: 282  Slow queries: 0  Opens: 78  Flush tables: 3  Open tables: 8  Queries per second avg: 1.784
--------------

(root@localhost:pm) [(none)]>SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';
+------------------------+---------------+
| Variable_name          | Value         |
+------------------------+---------------+
| wsrep_provider_version | 3.7(r7f44a18) |
+------------------------+---------------+
1 row in set (0.01 sec)
 

 

 

Facts

In MySQL/Galera there is a variable that allows us to tell the server not to replicate. This variable is wsrep_on, and when we set it to OFF the server will not replicate any statement to the other nodes.

This is quite useful when you need to perform actions on a single node, like when you need to perform DDL in RSU mode.
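
For reference, the intended, session-scoped usage looks like this; a minimal sketch (the statements inside are just placeholders):

-- session scope: only this connection stops replicating
SET SESSION wsrep_on=OFF;
-- ... run the local-only statement(s) here ...
SET SESSION wsrep_on=ON;

-- for DDL, the RSU method does the desync/DDL/resync dance per session:
SET SESSION wsrep_OSU_method='RSU';
ALTER TABLE test.tbtest1 ADD INDEX idx_b (b);  -- placeholder DDL
SET SESSION wsrep_OSU_method='TOI';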

But this flexibility can bite you quite badly.

I made a simple small change to the widely used command:

 

SET wsrep_on=OFF;

 

I just added GLOBAL:

SET GLOBAL wsrep_on=OFF;

 

 

To be honest I was expecting the command to be rejected, but no, it was accepted, and this is what happened.

I ran the small loop (see above) on two application servers, one pointing to HAProxy and writing APP1 in the field strrecordtype, the other pointing directly to the node where I would issue the wsrep_on command, inserting APP2.

The results:

(root@localhost:pm) [test]>select @@HOSTNAME; select count(*) AS APP2_COUNTER FROM tbtest1 WHERE strrecordtype='APP2'; select count(*) AS APP1_COUNTER FROM tbtest1 WHERE strrecordtype='APP1';
+---------------+
| @@HOSTNAME    |
+---------------+
| tusacentral03 |
+---------------+
1 row in set (0.00 sec)
+--------------+
| APP2_COUNTER |
+--------------+
|           99 |
+--------------+
1 row in set (0.00 sec)
+--------------+
| APP1_COUNTER |
+--------------+
|           99 |
+--------------+
1 row in set (0.00 sec)

(root@localhost:pm) [test]>SET GLOBAL wsrep_on=OFF;  <------------- it should not be GLOBAL

(root@localhost:pm) [test]>select @@HOSTNAME; select count(*) AS APP2_COUNTER FROM tbtest1 WHERE strrecordtype='APP2'; select count(*) AS APP1_COUNTER FROM tbtest1 WHERE strrecordtype='APP1';
+---------------+
| @@HOSTNAME    |
+---------------+
| tusacentral01 |
+---------------+
1 row in set (0.00 sec)
+--------------+
| APP2_COUNTER |
+--------------+
|            0 |
+--------------+
1 row in set (0.00 sec)
+--------------+
| APP1_COUNTER |
+--------------+
|           66 | <-------------------- 1/3 lost because HAProxy thinks the node is ok ...
+--------------+
1 row in set (0.00 sec)

 

 

As you can see, on tusacentral03 (which is the one where I issued SET GLOBAL wsrep_on=OFF), I have ALL the records inserted on the local node and ALL the records coming from the other nodes.

But on the node tusacentral01 I had NO records related to APP2, and more relevant, I had lost 1/3 of my total inserts.

Why?

Well, this is quite clear, and unfortunately it is by design.

If I issue wsrep_on=OFF with GLOBAL, the server will apply the setting to ALL sessions, meaning all sessions on that node will STOP replicating.

In the source code the section relevant to this is quite clear:

#wsrep_mysqld.cc
#line 1395
int wsrep_to_isolation_begin(THD *thd, char *db_, char *table_,
                             const TABLE_LIST* table_list)
{

  /*
    No isolation for applier or replaying threads.
   */
  if (thd->wsrep_exec_mode == REPL_RECV) return 0;

  int ret= 0;
  mysql_mutex_lock(&thd->LOCK_wsrep_thd);

  if (thd->wsrep_conflict_state == MUST_ABORT)
  {
    WSREP_INFO("thread: %lu, %s has been aborted due to multi-master conflict",
               thd->thread_id, thd->query());
    mysql_mutex_unlock(&thd->LOCK_wsrep_thd);
    return WSREP_TRX_FAIL;
  }
  mysql_mutex_unlock(&thd->LOCK_wsrep_thd);

  DBUG_ASSERT(thd->wsrep_exec_mode == LOCAL_STATE);
  DBUG_ASSERT(thd->wsrep_trx_meta.gtid.seqno == WSREP_SEQNO_UNDEFINED);

  if (thd->global_read_lock.can_acquire_protection())
  {
    WSREP_DEBUG("Aborting TOI: Global Read-Lock (FTWRL) in place: %s %lu",
                thd->query(), thd->thread_id);
    return -1;
  }

  if (wsrep_debug && thd->mdl_context.has_locks())
  {
    WSREP_DEBUG("thread holds MDL locks at TI begin: %s %lu",
                thd->query(), thd->thread_id);
  }

  /*
    It makes sense to set auto_increment_* to defaults in TOI operations.
    Must be done before wsrep_TOI_begin() since Query_log_event encapsulating
    TOI statement and auto inc variables for wsrep replication is constructed
    there. Variables are reset back in THD::reset_for_next_command() before
    processing of next command.
   */
  if (wsrep_auto_increment_control)
  {
    thd->variables.auto_increment_offset = 1;
    thd->variables.auto_increment_increment = 1;
  }

  if (thd->variables.wsrep_on && thd->wsrep_exec_mode==LOCAL_STATE) <------- Here we have a check for wsrep_on
  {
    switch (wsrep_OSU_method_options) {
    case WSREP_OSU_TOI: ret =  wsrep_TOI_begin(thd, db_, table_,
                                               table_list); break;
    case WSREP_OSU_RSU: ret =  wsrep_RSU_begin(thd, db_, table_); break;
    }
    if (!ret)
    {
      thd->wsrep_exec_mode= TOTAL_ORDER;
    }
  }
  return ret;
}

enum wsrep_exec_mode {
    LOCAL_STATE,
    REPL_RECV,
    TOTAL_ORDER,
    LOCAL_COMMIT
};
 

 

 

So what happens is that the server checks whether the thd object has that variable ON and is in LOCAL_STATE; if so it replicates, if not it does nothing.

But, as said, while this makes sense in the SESSION scope, it does not in the GLOBAL one.

 

Not only that: setting wsrep_on to OFF in the global scope does NOT trigger any further action from MySQL/Galera, like, for instance, flagging the node as desynchronized from the rest of the cluster.

The interesting effect of this is that HAProxy has NO WAY to know that the node has stopped replicating, and as such the server can keep receiving requests, but those will not be replicated to the other nodes, causing data divergence.

 

You can say that a DBA SHOULD know what he is doing, and as such he/she should MANUALLY desync the node and then issue the command.
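
For completeness, the manual desync mentioned above would look something like this (a sketch; wsrep_desync is a standard global variable in PXC 5.6):

-- tell the cluster this node is intentionally falling behind:
SET GLOBAL wsrep_desync=ON;
-- ... do the local work ...
SET GLOBAL wsrep_desync=OFF;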

My point instead is that I don't see ANY good reason to have wsrep_on as a global variable; I see this as a very dangerous and conceptually wrong "feature".

 

Browsing the Codership manual, I noticed that the wsrep_on variable comes with the "L" flag, meaning that the variable is NOT supposed to be GLOBAL.

But it is ...

I also dug into the code, and:

wsrep_var.cc
#line 58

bool wsrep_on_update (sys_var *self, THD* thd, enum_var_type var_type)
{
  if (var_type == OPT_GLOBAL) {
    // FIXME: this variable probably should be changed only per session
    thd->variables.wsrep_on = global_system_variables.wsrep_on;
  }
  return false;
}
 

 

That is interesting, isn't it?

I wonder when this comment was inserted, and why it was ignored.

 

Anyhow, the source of all problems is here, in the wsrep_on variable definition:

static Sys_var_mybool Sys_wsrep_on (
       "wsrep_on", "To enable wsrep replication ",
       SESSION_VAR(wsrep_on),                      <----------------------- This allows global
       CMD_LINE(OPT_ARG), DEFAULT(TRUE),
       NO_MUTEX_GUARD, NOT_IN_BINLOG, ON_CHECK(0),
       ON_UPDATE(wsrep_on_update));
 

 

The variable was defined as SESSION_VAR instead of SESSION_ONLY, and as such it is usable also in the global scope.

 

As already stated, this is from my point of view a conceptual error, not a bug; it is something that should not exist at all, because in a cluster where data is certified/replicated/synchronized there should NOT be any option for a DBA/user to bypass the data validation/replication process at the GLOBAL level.

 

To note, and to make things worse: after I had done the test I could easily set wsrep_on back, and my node continued to act as part of the cluster as if all the nodes were equal, while they were not (this run used RSU and TOI as the markers in strrecordtype).

(root@localhost:pm) [test]>select @@HOSTNAME; select count(*) AS RSU_COUNTER FROM tbtest1 WHERE strrecordtype='RSU'; select count(*) AS TOI_COUNTER FROM tbtest1 WHERE strrecordtype='TOI';
+---------------+
| @@HOSTNAME    |
+---------------+
| tusacentral03 |
+---------------+
1 row in set (0.00 sec)

+-------------+
| RSU_COUNTER |
+-------------+
|         181 |
+-------------+
1 row in set (0.00 sec)

+-------------+
| TOI_COUNTER |
+-------------+
|         177 |
+-------------+
1 row in set (0.00 sec)

+---------------+
| @@HOSTNAME    |
+---------------+
| tusacentral01 |
+---------------+
1 row in set (0.00 sec)

+-------------+
| RSU_COUNTER |
+-------------+
|          77 |
+-------------+
1 row in set (0.00 sec)

+-------------+
| TOI_COUNTER |
+-------------+
|         139 |
+-------------+
 

 

As you can see, the cluster continued to insert data through HAProxy on all the nodes, but its data set is inconsistent.

Conclusions

  • Never use SET GLOBAL with wsrep_on.
  • IF you are crazy enough to do so, be sure no one is writing on the node.
  • I am sure this is a mistake in the logic, and as such the variable should be changed at the source, defining it in the code as SESSION_ONLY and not SESSION_VAR.

Otherwise wsrep_on can damage you quite badly.
Last Updated on Thursday, 27 November 2014 15:00
 
Geographic replication with MySQL and Galera
Written by Marco Tusa   
Friday, 21 November 2014 00:56

Introduction

With the introduction of Galera replication for MySQL, Codership has, in my opinion, provided the most significant contribution to MySQL that the community has seen in the last three years.

Codership has filled a significant gap in MySQL replication capability, succeeding in a task where many before had failed.

Not only is Codership constantly committed to making their solution better and more solid, they are also the most friendly and helpful team I have met in years.

In short, chapeau to all of the guys! Great job, I am really proud to have the opportunity to discuss and interact with you.

That said, no product or code around is free of errors and/or issues at different levels. So even if Galera helps us a lot in solving many replication and architectural issues, it still has some limitations that must be taken into consideration.

Galera is easy to set up, especially in a basic configuration, which will work fine for almost all basic, low-volume, local OLTP traffic.

Unfortunately I have seen many implementations that started as basic and then became something different, in usage and very often in design.

The most common mistake I have seen, from an architectural point of view, is when a customer starts with a simple cluster of 3 nodes on a local network and then, given it "works so great", decides to add one or two nodes on a geographically distributed site.

This solution is not wrong "per se" but it is how it is implemented that makes the difference.

The easiest way to explain what I mean, and what should not be done, is to use a real case and describe what was implemented and how we evolved the solution while Codership was optimizing their product.

Customer scenario

The customer approached us a few months after an activity that was focused on showing them the basics of how MySQL/Galera works.

They reached us as an emergency, given their cluster was totally stuck and their production site was down.

In order to understand what happened, we first need to take a look at what the customer had put up.

Customer environment description

The initial environment of one single application node pointing to a cluster of 3 MySQL/Galera nodes was transformed into 6 (or more) application nodes getting traffic from a balancer. Each application node could write to any data node (round robin) residing in the same geographic area (at least), but the cluster was geographically distributed: 3 nodes in site A, 3 nodes in site B.

As a result, customer navigation was jumping from one application server to another, and from one data node to another as well.

Transactions were small and relevant only for a single request.

Finally, each application node had a timer to clean up "old" sessions; the concept of OLD was relevant only to the local node, and it could happen that an OLD session was in fact still ongoing on another application node.

In short, the customer was randomly pushing traffic to both zones, using a single table for the sessions to coordinate the activities, but also purging them without any good locking mechanism:

[Figure: one cluster stretched across both sites]

Problem description

When I connected to the machines, I was not aware of the architecture as it was; I was still referring to what I had found in our history, so one application node and a simple cluster.

It took me 5 seconds to see that the situation was quite different. The cluster was suffering from many points of view: Flow Control was kicking in very often, and nodes between the zones were frequently declared dead.

It took me some time to make the customer understand why that was not working correctly, and why the architecture was designed wrongly. Mostly because they were claiming that it was working fine before (reminds you of "it works on my machine", doesn't it?). Then came the discussion of what "before" was, how much traffic/load there was, and so on.

Finally, they redirected all the traffic to one zone only, and at least the FC went down enough to have the site working, and to give me the chance to review the settings and try to optimize them.

To keep it up I had to increase the FC-related variables, modify the Galera parameters for node timeouts, and obviously optimize MySQL itself, and we survived for a bit.
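
To give an idea, the flow-control knobs involved are Galera provider options; a sketch with illustrative values only (not the ones used at the customer):

-- raise the flow-control threshold and soften its trigger:
SET GLOBAL wsrep_provider_options = 'gcs.fc_limit=500; gcs.fc_factor=0.9';
-- the node timeouts (evs.suspect_timeout, evs.inactive_timeout, ...)
-- are provider options as well, usually set in my.cnf at startup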

Given the scope of this article is geographic replication, I am not going to report the whole review, only the part related to the subject.

As said, I had extended the timeouts and FC, but I was not happy, and as such I tested in more detail what the network really allowed us to transmit.

 

 

[root@xx ~]# netperf -H 10.0.0.12 -t TCP_RR -v 2  -- -b 6 -r 32K -S 256K -S 256K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.12 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   
 
16384  87380  32768    32768   10.01      97.45   
249856 249856
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
    8      0       0      0   71830.852   97.451 25.546    25.546  <------------ 71ms 
 

 

So a simple round trip for a small record of 32K takes 71 ms; what if I simulate a basic load of 1MB?

 

[root@xx ~]# netperf -H 10.2.3.30  -t TCP_RR -v 2  -- -b 6 -r 1M,32K -S 256K -S 256K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.2.3.30 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   
 
16384  87380  1048576   32768   10.00       2.70   
249856 249856
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
    8      0       0      0   2592709.047   2.700  22.648    0.708  <-- we have 2592.709047 ms latency
 

 

 

Note: to understand better what was going on, and why I use this method to evaluate the network, read my other article (Effective way to check the network connection performance when using geographically distributed replication).

What we had to do

With this kind of connectivity, it is very unlikely that we can have Galera replication working correctly, not even if we keep only one site as master and the other as standby. The way the nodes communicate and validate the write-sets is too demanding.

See the diagram below, from Codership, to understand why:

[Figure: Codership diagram of write-set certification traffic]

 

The only alternative to this is to use an asynchronous connection and set up two nodes, one for each site, in master-master mode:

[Figure: two clusters linked by asynchronous master-master replication]

Setting up master-master means that we lose the automatic replication shift from one node to another. That is it: while this solution fixes the latency problem, because it uses asynchronous replication, it does imply that in case of a crash of one of the two connected nodes you must manually move the replication to another node.

How to set it up and how to fix it when it breaks

Setting up asynchronous replication in a Galera cluster is quite easy.

Assuming you have two nodes connected as master-master (A1, B1), and assuming the only active one was A1 and that it crashes, the only thing you need to do is check on B1 for:

 

+----------------------+-----------+
| Variable_name        | Value     |
+----------------------+-----------+
| wsrep_last_committed | 221463896 |
+----------------------+-----------+
 

 

 

in the status variables; then look, on the node you have chosen as the new master, say A2, for the binary log containing the

 

Xid = 221463896

 

 

And issue the command CHANGE MASTER with the coordinates you will find there.

Then on A2 set the node as a slave of B1, using the master coordinates from that node.
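
Putting the steps together, a minimal sketch of the procedure; host names, binlog file names and positions below are hypothetical:

-- on B1: the last transaction the cluster committed
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';

-- on A2: find the binlog file/position right after the event with
-- Xid = 221463896 (the file name here is hypothetical)
SHOW BINLOG EVENTS IN 'binlog.000042';

-- on B1: resume replication from A2 at those coordinates
CHANGE MASTER TO MASTER_HOST='A2', MASTER_USER='repl',
  MASTER_PASSWORD='xxx', MASTER_LOG_FILE='binlog.000042',
  MASTER_LOG_POS=120;
START SLAVE;

-- on A2: set it back as a slave of B1 (master-master), using the
-- coordinates from SHOW MASTER STATUS on B1
CHANGE MASTER TO MASTER_HOST='B1', MASTER_USER='repl',
  MASTER_PASSWORD='xxx', MASTER_LOG_FILE='binlog.000099',
  MASTER_LOG_POS=4;
START SLAVE;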

The procedure is simple and obviously worked decently, given the situation.

The customer was obviously not happy, because they would like to have the two clusters in perfect sync, but it was at least good enough to keep the site working efficiently. Not only that: after we set it up we had at least two incidents in production, and we were able to successfully fail over to DR and then fail back with less than a few seconds of service interruption.

Anyhow, our friends at Codership had already released the new version of Galera (3.x), working on MySQL 5.6.

So, while the customer was refining and correcting a lot of issues on their side, mainly in the code and in how the application was using the common table, I reviewed, tested and crashed MySQL 5.6/Galera 3.

Galera 3.0 comes with some interesting new enhancements; for an overview see http://www.slideshare.net/Severalnines/webinar-slides-27249453.

At the moment of writing, Seppo is about to deliver his talk on the new features in Galera 4 at Percona Live London 2014. (http://www.percona.com/live/london-2014/sessions/galera-cluster-new-features)

Galera 3.X group replication and how this can change everything.

In Galera 3, the introduction of the segment concept basically isolates/identifies a group of data nodes inside a cluster.

The meaning is that nodes with the same segment value will communicate extensively with each other, while between segments one node per segment is elected as the communication channel:

[Figure: segment-to-segment communication through elected relay nodes]

 

Quorum calculation also takes the segment allocation into account.

In the presentation I linked above it is mentioned that using segments allows managing WAN latencies of 100-300 ms.

Given I am a skeptic, I decided to perform extensive tests and see/identify what is really possible to accomplish.

Investigations

I did several tests involving many Galera parameters, but in the end there were two main actors I was playing with: gcs.max_packet_size and the gcache page size.

All the other changes had proved to be stable and efficient; below are the settings I used.

How to test the problem

I set up 8 different tests:

1) simple inserts on 1 node, from 3 apps
2) simple inserts on 1 node per segment: on DR only one app, on production 2
3) simple inserts on 1 node, from 3 apps, GALERA PACK larger
4) simple inserts on 1 node per segment: on DR only one app, on production 2, GALERA PACK larger
5) simple inserts on 1 node, from 3 apps, BUT with WAN connection
6) simple inserts on 1 node per segment: on DR only one app, on production 2, BUT with WAN connection
7) simple inserts on 1 node, from 3 apps, BUT with WAN connection AND GALERA PACK larger
8) simple inserts on 1 node per segment: on DR only one app, on production 2, BUT with WAN connection AND GALERA PACK larger

 

Each test used an OLTP approach, with 5 main tables and 6 child tables, with the following structure:

 

CREATE TABLE `tbtest1` (
  `autoInc` bigint(11) NOT NULL AUTO_INCREMENT,
  `a` int(11) NOT NULL,
  `uuid` char(36) COLLATE utf8_bin NOT NULL,
  `b` varchar(100) COLLATE utf8_bin NOT NULL,
  `c` char(200) COLLATE utf8_bin NOT NULL,
  `counter` bigint(20) DEFAULT NULL,
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `partitionid` int(11) NOT NULL DEFAULT '0',
  `date` date NOT NULL,
  `strrecordtype` char(3) COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`autoInc`,`partitionid`),
  KEY `IDX_a` (`a`),
  KEY `IDX_uuid` (`uuid`)
) ENGINE=InnoDB AUTO_INCREMENT=3474631 DEFAULT CHARSET=utf8 COLLATE=utf8_bin |
 
CREATE TABLE `tbtest_child1` (
  `a` int(11) NOT NULL,
  `bb` int(11) NOT NULL AUTO_INCREMENT,
  `date` date NOT NULL,
  `partitionid` int(11) NOT NULL DEFAULT '0',
  `stroperation` varchar(254) COLLATE utf8_bin DEFAULT NULL,
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`a`,`bb`),
  UNIQUE KEY `bb` (`bb`)
) ENGINE=InnoDB AUTO_INCREMENT=36289 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row in set (0.00 sec)
 

 

 

I used 3 application nodes, each having 8 child threads, for a total of 24 threads writing to the cluster.

Tests 1 to 4 were done as a baseline, so with NO WAN network simulation.

When I mention "GALERA PACK larger" I refer specifically to:

gcs.max_packet_size

This value is "All writesets exceeding that size will be fragmented." And it is set by default to 32Kb, which means anything larger that will be fragmented in side galera. My understanding is that writeset package exceeding that dimension will be split in smaller packages. But given the dimension of 32K, or even less is significantly larger then the network frame dimension, the packet fragmentation will take place in any case, but at lower level (network).

My reasoning here was: why should I want to do that, if I already have frame fragmentation at the network (TCP/IP) level? I may have gotten the meaning of the parameter wrong, but if not, it is worth a try.
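
For reference, this is how the parameter can be enlarged; a sketch only, the 1MB value is an example (provider options can also be set in my.cnf at startup):

-- enlarge the Galera fragmentation threshold (example: 1MB)
SET GLOBAL wsrep_provider_options = 'gcs.max_packet_size=1048576';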

In this regard I had also optimized the machine network layer:

net.core.optmem_max = 40960
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_slow_start_after_idle = 0

I ran the tests on 6 data nodes using 2 segments, in a virtualized environment: 8 cores, 8GB RAM, RAID 10 with 6 10k RPM disks, CentOS 6.

On purpose I skipped HAProxy and preferred to point directly to the nodes I had defined.

The applications were closing and opening the connection at each transaction, to emulate the customer's behavior.

When simulating the WAN network, I had set a final total latency of:

 

root@galera1h1n5 galera56N1]# netperf -H 10.0.0.33 -t TCP_RR -v 2  -- -b 6 -r 120K,64K -S 156K -S 156K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.33 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec
 
65536  87380  122880   65536   10.00       1.90
319488 319488
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
    8      0       0      0   3684481.420   1.900  1.868     0.996  <----- 3684.481 ms 
 

 

With packet reordering 15-20%.

Just for the sake of comparison, a PING would have reported:

[root@galera1h1n5 galera56N1]# ping -M  do -s 1472 -c 3 10.0.0.33
PING 10.0.0.33 (10.0.0.33) 1472(1500) bytes of data.
1480 bytes from 10.0.0.33: icmp_seq=1 ttl=64 time=403 ms
1480 bytes from 10.0.0.33: icmp_seq=2 ttl=64 time=389 ms
1480 bytes from 10.0.0.33: icmp_seq=3 ttl=64 time=410 ms
 

 

So ~400 ms for 1472 (1500) bytes.

 

With all this in place, I started to roll the ball and collect my data.

Note that I ran each set of tests at least 5 times to validate the data collected; data was cleaned, refreshed and the InnoDB BP reloaded at every test.

Before reaching the last sets of tests (the final 2) I ran a total of 64 variants of the tests, sometimes crashing Galera, sometimes getting performance so bad that it obviously would not be usable in production, and so on.

Anyhow, I was more focused on the replication performance/behavior than on what MySQL/InnoDB was doing, given I was not looking for ultra-fine tuning at that level; actually, some imperfections fit my purpose better.

Related issues

Initially I was not using optimized network settings, and as soon as I reached tests 7-8 the first time I had issues; in short, test 7 was taking 4 times longer than test 5 (its twin), and in some cases it was so bad that I had to stop it.

But once the network was set up correctly, most of the issues disappeared.

Results

Who wins?

There is no single winner; all the tests show the need to carefully pick the right solution for the job.

But these two sets of tests and settings proved to be stable and better performing than all the others in my simulations:

[Figure: execution times with small gcache pages (32MB, 4 instances)]

 

This first image shows the set of tests performed with the gcache page set to a lower value of 32MB and more instances (4).

[Figure: execution times with a large gcache page (250MB, 2 instances)]

 

This second one shows the tests with a larger page of 250MB and only 2 instances.

I did not include the starting tests where page and instances were set to the defaults, because they provided lower performance.

Comparing the two, we can see that there is a different behavior between the tests done on a local network and the ones using the WAN.

In the case of a LAN, where the cluster is less limited in performance at the network level, the use of smaller gcache pages with more instances allowed the cluster and MySQL to insert ~5700-5300 rows/sec.

The use of a large gcache page and only 2 instances, on the other side, sees performance drop to ~5500-4950 rows/sec.

The difference is not significant with this traffic, but it is interesting to note that it exists and is consistent across all the tests.

From my side, this means we have something that adds a cost, a significant cost, when Galera needs to access a smaller number of gcache pages. Given this is memory access, I am oriented to think there is a mutex that serializes the usage of the single page/resource.

On the other hand, the large gcache page seems to provide better support in the case of a distributed cluster.

I assume that, given the transmission is slowed down by the network, Galera/MySQL is able to better utilize the fewer instances of the gcache memory pages.

Nevertheless, the large gcache with fewer instances was less effective when inserting on multiple nodes than the small gcache page with more instances and a larger gcs.max_packet_size.

But that is not all.

When using larger packet size (gcs.max_packet_size), I was consistently able to complete the tests in less time when I was using a geographically distributed cluster, large gcache page and multiple nodes writing.

While when the cluster was performing locally, the small (standard) gcs.max_packet_size was working better.

Let us navigate the results.

Segments distribution

The first test I did was to be sure that the segments were working as expected.

The cluster was composed of 2 segments, each segment with 3 nodes.

Segment 1 nodes, IPs ending with: 5, 21, 31

Segment 2 nodes, IPs ending with: 2, 23, 33

 

 

Local cluster (same network)
Display paused 19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.5                <=> 10.0.0.21        1.65GB  76.8Kb  63.9Kb  67.1Kb
10.0.0.5                <=> 10.0.0.31        1.63GB  63.8Kb  51.8Kb  55.2Kb
10.0.0.5                <=> 10.0.0.23        1.58GB  31.8Kb  26.2Kb  28.5Kb <----
10.0.0.5                <=> 10.0.0.2         30.8MB  19.3Kb  15.4Kb  16.2Kb
10.0.0.5                <=> 10.0.0.33         270KB   848b   1.81Kb  1.74Kb
 
 
 Display paused  19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└────────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.21               <=> 10.0.0.5          1.70GB  70.7Kb  71.6Kb  67.5Kb
10.0.0.21               <=> 10.0.0.31         83.4MB  44.9Kb  43.3Kb  40.9Kb
10.0.0.21               <=> 10.0.0.2          31.8MB  15.6Kb  16.9Kb  16.5Kb <---- ??
10.0.0.21               <=> 10.0.0.33         32.1MB  15.6Kb  16.8Kb  16.3Kb <---- ??
10.0.0.21               <=> 10.0.0.23          269KB   848b   1.34Kb  1.36Kb
 
 
Display paused  19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└────────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.31               <=> 10.0.0.5          1.67GB  56.5Kb  58.3Kb  55.7Kb
10.0.0.31               <=> 10.0.0.21         83.4MB  44.7Kb  43.3Kb  41.5Kb
10.0.0.31               <=> 10.0.0.2          31.7MB  19.6Kb  17.6Kb  17.2Kb <---- ??
10.0.0.31               <=> 10.0.0.33          273KB   848b   1.98Kb  1.89Kb 
10.0.0.31               <=> 10.0.0.23          269KB   848b   1.85Kb  1.72Kb
 
 Display paused 19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└───────────────┴───────────────┴─────────┴──────────────┴────────────────
10.0.0.2              <=> 10.0.0.23       1.70GB  71.2Kb  71.2Kb  67.6Kb
10.0.0.2              <=> 10.0.0.33        111MB  58.2Kb  57.6Kb  54.6Kb 
10.0.0.2              <=> 10.0.0.31       31.7MB  20.2Kb  17.7Kb  16.9Kb <---- ??
10.0.0.2              <=> 10.0.0.5        30.8MB  15.6Kb  17.3Kb  16.5Kb <---- ??
10.0.0.2              <=> 10.0.0.21       31.8MB  15.6Kb  16.9Kb  16.5Kb <---- ??
 
 Display paused 19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
───────────────┴───────────────┴────────────────────────┴────────────────
10.0.0.23              <=> 10.0.0.2        1.65GB  71.2Kb  71.2Kb  68.3Kb
10.0.0.23              <=> 10.0.0.33       1.60GB  43.7Kb  43.4Kb  41.7Kb
10.0.0.23              <=> 10.0.0.5        1.62GB  28.1Kb  29.6Kb  28.3Kb <----
10.0.0.23              <=> 10.0.0.31        269KB   848b   1.85Kb  1.76Kb
10.0.0.23              <=> 10.0.0.21        269KB   848b   1.34Kb  1.36Kb
 
 Display paused 19.1Mb           38.1Mb   57.2Mb           76.3Mb     95.4Mb
└───────────────┴────────────────┴───────┴────────────────┴────────────────
10.0.0.33              <=> 10.0.0.2          111MB  58.8Kb  57.6Kb  55.4Kb 
10.0.0.33              <=> 10.0.0.23        1.65GB  43.7Kb  43.3Kb  41.9Kb 
10.0.0.33              <=> 10.0.0.21        32.1MB  15.6Kb  16.8Kb  16.5Kb <---- ??
10.0.0.33              <=> 10.0.0.31         273KB   848b   1.98Kb  1.89Kb
10.0.0.33              <=> 10.0.0.5          274KB   848b   1.78Kb  1.73Kb
 
 
WAN distributed cluster
Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴───────────┴─────────┴────────────────┴──────────────────────
10.0.0.5              <=> 10.0.0.31            1.71GB  30.2Kb  31.3Kb  31.0Kb
10.0.0.5              <=> 10.0.0.21            1.71GB  28.1Kb  30.4Kb  30.4Kb
10.0.0.5              <=> 10.0.0.2             1.61GB  15.1Kb  16.3Kb  16.6Kb <----
10.0.0.5              <=> 10.0.0.33             682KB  1.19Kb  1.91Kb  1.77Kb
10.0.0.5              <=> 10.0.0.23            8.83MB  3.91Kb  1.77Kb  1.87Kb <---- ??
10.0.0.5              <=> 10.0.0.153           5.05KB     0b      0b    572b
10.0.0.5              <=> 10.0.0.151           5.91KB     0b      0b    535b
 
 Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴────────────┴────────┴─────────────────┴──────────────────────
10.0.0.21             <=> 10.0.0.31              194MB  28.7Kb  31.2Kb  31.3Kb 
10.0.0.21             <=> 10.0.0.5              1.76GB  28.5Kb  30.6Kb  30.4Kb
10.0.0.21             <=> 10.0.0.23             73.8MB  15.6Kb  17.1Kb  17.1Kb <---- ??
10.0.0.21             <=> 10.0.0.33              298KB  1.86Kb  2.00Kb  1.91Kb
10.0.0.21             <=> 10.0.0.2              5.83MB  2.06Kb  1.53Kb  1.42Kb <---- ??
10.0.0.21             <=> 10.0.0.12              351KB  1.53Kb  1.34Kb  1.41Kb
 
 
 Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴────────────┴────────┴─────────────────┴──────────────────────
10.0.0.31             <=> 10.0.0.5              1.76GB  28.7Kb  31.3Kb  31.1Kb
10.0.0.31             <=> 10.0.0.21              192MB  28.7Kb  31.2Kb  31.3Kb
10.0.0.31             <=> 10.0.0.33             53.3MB  17.8Kb  17.9Kb  17.7Kb <---- ??
10.0.0.31             <=> 10.0.0.23             1.23MB  3.93Kb  2.01Kb  2.31Kb
10.0.0.31             <=> 10.0.0.2               323KB   528b   1.77Kb  1.81Kb
10.0.0.31             <=> 10.0.0.13              354KB  1.28Kb  1.32Kb  1.38Kb
 
 
 Display paused 19.1Mb      38.1Mb    57.2Mb           76.3Mb           95.4Mb
└───────────────┴───────────┴─────────┴────────────────┴──────────────────────
10.0.0.2              <=> 10.0.0.23            1.65GB  28.5Kb  30.9Kb  31.2Kb
10.0.0.2              <=> 10.0.0.33            1.65GB  27.7Kb  30.5Kb  31.0Kb
10.0.0.2              <=> 10.0.0.5             1.64GB  15.1Kb  16.3Kb  16.6Kb <----
10.0.0.2              <=> 10.0.0.31             323KB   528b   1.77Kb  1.81Kb
10.0.0.2              <=> 10.0.0.21            5.88MB  2.06Kb  1.53Kb  1.46Kb <---- ??
 
 Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴────────────┴────────┴─────────────────┴──────────────────────
10.0.0.23             <=> 10.0.0.33              118MB  29.8Kb  31.5Kb  31.5Kb
10.0.0.23             <=> 10.0.0.2              1.70GB  29.5Kb  30.8Kb  31.2Kb
10.0.0.23             <=> 10.0.0.21             74.0MB  15.4Kb  16.9Kb  17.0Kb <----??
10.0.0.23             <=> 10.0.0.31             1.22MB  3.31Kb  1.85Kb  2.27Kb
10.0.0.23             <=> 10.0.0.5              8.96MB  3.40Kb  1.67Kb  1.84Kb <---- ??
10.0.0.23             <=> 10.0.0.12              359KB  1.47Kb  1.47Kb  1.36Kb
 
 Display paused 19.1Mb       38.1Mb   57.2Mb            76.3Mb           95.4Mb
└───────────────┴────────────┴────────┴─────────────────┴──────────────────────
10.0.0.33             <=> 10.0.0.23              118MB  28.1Kb  31.5Kb  31.4Kb
10.0.0.33             <=> 10.0.0.2              1.70GB  27.5Kb  30.2Kb  30.7Kb
10.0.0.33             <=> 10.0.0.31             53.3MB  17.8Kb  17.9Kb  17.7Kb <----??
10.0.0.33             <=> 10.0.0.21              298KB  1.86Kb  2.00Kb  1.91Kb
10.0.0.33             <=> 10.0.0.5               686KB  1.03Kb  1.96Kb  1.78Kb
10.0.0.33             <=> 10.0.0.13              370KB  1.41Kb  1.44Kb  1.42Kb
 

 

 

Apparently everything was working fine, and if you look at the numbers, the node pairs 5-23 and 5-2 seem to act as bridges between the segments.

But if you look carefully (I have marked the cases with <---- ??), you will see there is additional traffic going on: for instance, in the first set nodes 33-21 exchange 32MB, and in the second set (WAN) even more, with nodes 23-21 exchanging 73MB.

It seems like nothing, but it means some kind of exchange is taking place. What traffic is going on between SOME nodes? I say SOME on purpose, because the other nodes are not generating what I consider an anomaly.

Maybe this is by design; if so, I would like to know it, and to know why.

Anyhow, aside from this, most of the traffic flows as expected.
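
As a cross-check of the picture iftop gives, the cluster membership can also be confirmed from the database side using the standard Galera status variables; a minimal sketch:

-- List all cluster members as seen by the local node
SHOW GLOBAL STATUS LIKE 'wsrep_incoming_addresses';
-- Confirm the expected number of nodes (6 in this setup)
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';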

 

A deeper look inside the replication

As mentioned before, the whole exercise was to identify whether, and if so which, settings are best for a cluster working in a geographically distributed environment. In that regard the most relevant tests are numbers 6 and 8:

6) simple inserts on 1 node on DR, with only one application writing on production node 2, but over the WAN connection
8) the same as test 6, but over the WAN connection AND with a larger Galera packet size

I also compared the differences between the gcache.keep_pages_size and gcache.page_size settings.

Test6

Replication_queue6

Test8

Replication_queue8

Reviewing the two graphs above, related to the replication queue, we can see that neither test reports any Flow Control action.

In the case of Test 6, the receive and send queues had some spikes that were not present in Test 8, meaning the two queues were processed more efficiently with the larger max_packet_size.

Worth noting is that wsrep_evs_repl_latency is higher in Test 8, so messages seem to take longer to be dispatched, yet in the end the performance is better.
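
The counters behind these graphs are plain wsrep status variables, so the same sampling can be repeated on any node; a minimal sketch:

-- Fraction of time replication was halted by Flow Control (0 means never)
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';
-- Average receive and send queue lengths since the last status reset
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue_avg';
SHOW GLOBAL STATUS LIKE 'wsrep_local_send_queue_avg';
-- Group communication latency, reported as min/avg/max/stddev/sample size
SHOW GLOBAL STATUS LIKE 'wsrep_evs_repl_latency';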

 

Test6

netstats6

Test8

netstats8

Reviewing the graphs related to the writeset dimension and to netstat, we can notice that the writesets are larger in the test with the default max_packet_size, and the network traffic is higher as well. I was surprised by that, and I repeated the tests several more times. In the end, magnifying the information related to the transmitted packets, we can see that with the larger max_packet_size the distance between received and transmitted packets is smaller.

Test6

netstatsPacket6

Test8

netstatsPacket8

Given that in this test writes were going to two nodes, my conclusion is that the network layer was able to optimize the data transmission, reducing the number of frames and as such reducing the overhead.

 

Test6

Parallelization_efficency6

Test8

Parallelization_efficency8

To complete the review of Galera replication, let us take a look at the parallelization efficiency graphs.

Here the differences are more evident. The apply window (average distance between the highest and lowest concurrently applied seqno) is definitely higher with the default max_packet_size, and so is wsrep_commit_window (average distance between the highest and lowest concurrently committed seqno).

Also wsrep_cert_deps_distance (average distance between the highest and lowest seqno that can possibly be applied in parallel, i.e. the potential degree of parallelization) is steady in the first case, while with the larger max_packet_size it goes down.

I am inclined to read this as a more optimized way of applying and committing the incoming data.
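
These three metrics can be sampled directly, which makes the comparison easy to reproduce; a minimal sketch:

-- Average distance between highest and lowest concurrently applied seqno
SHOW GLOBAL STATUS LIKE 'wsrep_apply_window';
-- Average distance between highest and lowest concurrently committed seqno
SHOW GLOBAL STATUS LIKE 'wsrep_commit_window';
-- Potential degree of parallelization of the incoming stream
SHOW GLOBAL STATUS LIKE 'wsrep_cert_deps_distance';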

Test6

Write_set_bytes6

Test8

Write_set_bytes8

 

Reviewing the writeset bytes graphs, we can note that while the replicated bytes remain very close between Test 6 and Test 8, the received bytes are quite a bit larger in the case of the smaller (default) max_packet_size.

So the node was in fact receiving less data in Test 8, while the amount of data coming from the application stayed the same.
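
The two sides of this comparison map directly to two status counters; a minimal sketch (take two samples and divide the delta by the interval to get a rate):

-- Bytes this node generated and replicated out to the group
SHOW GLOBAL STATUS LIKE 'wsrep_replicated_bytes';
-- Bytes this node received from the group
SHOW GLOBAL STATUS LIKE 'wsrep_received_bytes';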

Test6

processes6

Test8

processes8

 

Analyzing the process information, Test 6 (default max_packet_size) shows more major page faults than Test 8, and both system time and user time are higher in Test 6. This means more I/O operations.

The differences are not huge, but they are consistent across the repeated test runs.

 

Conclusions?

I have collected, and am still collecting, a lot of additional data, but since the scope of this article is the network and replication side, I prefer to stop here.

There is no single conclusion about the whole MySQL/Galera setup, but let me say that I strongly suggest not to use one monolithic setup for all cases.

Instead, adapt the configuration to the architecture you are implementing.

 

So far and from the tests I have done, I can say that:

For a cluster implemented in a single site with very good connectivity, using MySQL/Galera for HA or as a write-distribution platform:

  • Use multiple gcache.keep_pages_size instances (I relate the number to the number of cores, but I don't see any benefit in going above 8 instances at the moment)
  • Use a gcache.page_size never less than 32MB, for no more than 256MB in total
  • Use a small gcs.max_packet_size, such as 32K

For a cluster spanning geographically distributed sites, using MySQL/Galera on the second site for HA only:

  • Use one or two gcache.keep_pages_size instances
  • Use the default gcache.page_size or a little larger, such as 256M
  • Use a larger gcs.max_packet_size, such as 2097152 (2MB)

 

In the final tests I used the following two configurations, and they worked great even with significant load, not only in the tests but also when we implemented them at the customer site, under the same load and operations that used to crash the system with Galera 2.x or with other Galera 3.x configurations.

Configuration used for local cluster:

 

wsrep_provider_options = "gmcast.segment=1|2; socket.ssl_compression=no;
    gcache.keep_pages_size=2; gcache.page_size=125M; gcache.size=5024M;
    evs.send_window=512; evs.user_send_window=512; evs.use_aggregate=true;
    gcs.fc_factor=1; gcs.fc_limit=32; gcs.max_packet_size=32768; gcs.fc_debug=0;
    evs.inactive_check_period=PT1S; evs.inactive_timeout=PT15S;
    evs.suspect_timeout=PT5S; evs.stats_report_period=PT1M;"

 

 

Configuration used for geographically distributed DR:

 

wsrep_provider_options = "gmcast.segment=1|2; socket.ssl_compression=no;
    gcache.keep_pages_size=4; gcache.page_size=64M; gcache.size=5024M;
    evs.send_window=1024; evs.user_send_window=1024; evs.use_aggregate=true;
    gcs.fc_factor=1; gcs.fc_limit=260; gcs.max_packet_size=2097152; gcs.fc_debug=0;
    evs.inactive_check_period=PT30S; evs.inactive_timeout=PT1M;
    evs.suspect_timeout=PT1M; evs.stats_report_period=PT3M;"
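
Keep in mind that not every provider option can be changed on a running node: the flow-control settings are dynamic, while the gcache sizing options generally require a node restart. A minimal, hedged sketch of adjusting the dynamic ones at runtime, with the values from the DR configuration above:

-- gcs.fc_limit and gcs.fc_factor are dynamic and can be adjusted online;
-- options like gcache.size or gcache.page_size need a restart instead
SET GLOBAL wsrep_provider_options = 'gcs.fc_limit=260; gcs.fc_factor=1';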

 

 

 

Please let me know if you experience different results, and how/why.

 

Great MySQL .... & Galera to all.

Last Updated on Friday, 21 November 2014 20:34