21 Dec 2014
The Monitoring mistake OR how dreaming can bring ideas
Written by Marco Tusa   

"Hi Paul, how is it going?" "Good, Marco, and you?" "Good. I had a stressful week last week, but now it's okay. I managed to close some pending activities by working a little more during the day, and that way I was able to reduce the queue of pending tasks, so now everything is back to normal." "Good for you that you managed; I had too many things going on and was not able to dedicate more time to my queue."

 

The simple (boring) conversation above hides one of the most complex elaborations of monitoring data. We human beings do a great deal of data processing in a very short time. We may be slower than computers at some calculations, but no computer can compete with us when it comes to multitasking and data processing.

 

To answer someone asking how you are, you do not simply review your status at that moment: your brain decides, based on the last time you saw the person, to review all the relevant data and produce a synthesis of the relevant facts. Then you summarize it all as "good," because you do not consider it relevant to pass every single fact on to your friend, only the conclusion.

 

Not only that: during the same process you evaluate, in relation to your relationship with the person, what kind of information you may want to share and why, and how to present it so that it will be relevant and interesting for the interaction.

 

This simple process of talking may also happen while you are walking along the street, keeping an eye on the traffic, and expressing interest, curiosity, or annoyance to your interlocutor.

Each expression you show on your face is the result of the data collection, analysis, and decisions your brain is making. Plus some others coming from deeper inner processes, like deep fear or surprise, but that is out of context here.

The interesting thing is that we are so used to doing this, and to summarizing in such an efficient way, that we consider it funny, or totally out of context, when we see someone not doing so.

 

Just think about how hilarious Sheldon Cooper is (for those who do not know what I am talking about: http://en.wikipedia.org/wiki/Sheldon_Cooper).

In the show, Sheldon quite often answers the simple question "How are you?" with a massive amount of information that is not only not understood, but also totally irrelevant, and as such, in that context, hilarious.

Hilarious in that context, I said, but far from hilarious in real life. This is because we are so exposed to external signals and information that we should not, and cannot, spend time and resources elaborating incoming information just to know how our friend is doing. Evolution decided that it is the owner of the information who has to process it, who has to elaborate his own data and expose only what is relevant for the interaction.

 

Just think what life would be like if, instead of saying "Good, thank you" to the question "How are you?", you started to enumerate all the facts in every single detail, or with some aggregation, to each person asking, and expected them to sort out whether that means good or not. Crazy, eh? I would say quite inefficient, and a source of possible misunderstanding as well.

Someone may decide that working an hour more per day is totally unacceptable, and as such your status would be "Bad" instead of "Good," which is the exact opposite of how you really feel.

As said, this way of acting and behaving does not come from the void; it is the result of a long process refined over 2.5 million years (since Homo habilis). Evolution decided that it is much more efficient to have Marco tell Paul how he is doing than to have Paul try to read all the information from Marco and then elaborate, with his own parameters, how Marco is doing.

I am going to say that evolution is right, and I am quite happy with what we have achieved, even if it took us a few million years to get there.

I am also confident that you, too, see how this is more efficient and correct.

So, for God's sake, why are we still using a method that is not only inefficient but also exposes us to mistakes when we need to know how complex systems feel? Systems that are less complex than us, but complex anyhow.

Why are we "monitoring" things, exposing numbers, and pretending to elaborate them with the illusion of finally getting the answer to "How are you?"

Would it not be much more efficient, and statistically less error-prone, to simply ask: "Hi, my nice system, how are you today?" "Marco, you are boring, you ask me that every day; anyhow, I am good." "Is there anything you need?" "Yes, please check the disk space, I may run out in a week." "Oh, thanks for letting me know in advance, I will."

 

Am I crazy? No, I don't think so. Is it something we see only in the movies? Again, no, I don't think so; actually, it is not so far from what we could start to do.

How can we move away from a quite wrong way of doing things, collecting useless amounts of data to analyze, just to get simple synthetic information?

 

Here is my dream

Let us start simple. You cannot ask someone "How are you?" if he is dead; that is a yes/no condition. But this does not apply to complex systems: in our body, every day we lose cells, they die, but we replace them, and our brain does not send us a warning message for each one of them.

But we do get alert messages if their number becomes so high that primary functions can be compromised.

In short, our brain discriminates between what can be compensated automatically and what cannot, and bothers us only when the latter occurs.

What can be done to create a monitoring monad that is internally consistent and that allows us to scale and to aggregate?

The main point, as stated above, is not to flood the interlocutor with information, but at the same time not to lose the meaning and, if needed, the details.

 

This is the first point: we can separate what we need to archive from what we need to get as an answer.

To be clear: I ask you "How are you?" "Good, thank you"; that is what I need to know. But at the same time I may be in the position to download your data and collect all the relevant metrics.

I had to use a heart monitor after a few events, and what happened was quite simple. They attached to my body a monitor that was sending detailed metrics of my heart directly to them, and in addition they were calling me to ask, "How do you feel today?" The detailed information was for them, to eventually dig into the system if something went bad.
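The split the heart monitor makes, detailed metrics archived for later digging versus a synthetic answer to the daily question, could be sketched like this (a minimal Python sketch; the class, method names, and thresholds are all mine, not from any real monitoring product):

```python
from collections import deque

class HeartMonitor:
    """Archive every raw sample, but answer only with a synthesis."""

    def __init__(self, max_rate=100, history=10000):
        self.max_rate = max_rate
        # Full detail kept for later digging, bounded to avoid unbounded growth.
        self.samples = deque(maxlen=history)

    def record(self, heart_rate):
        self.samples.append(heart_rate)

    def how_do_you_feel(self):
        # The owner of the data elaborates it and exposes only the conclusion.
        if not self.samples:
            return "No data yet"
        recent = list(self.samples)[-60:]
        avg = sum(recent) / len(recent)
        return "Good, thank you" if avg <= self.max_rate else "Not so good"

m = HeartMonitor()
for rate in [72, 75, 70, 74]:
    m.record(rate)
print(m.how_do_you_feel())  # the caller gets the synthesis...
print(len(m.samples))       # ...while the full detail stays archived
```

The point of the sketch is only the separation of concerns: `samples` is the archive channel, `how_do_you_feel()` is the answer channel.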

 

The detailed information is easy to collect; the challenge comes from the amount of data, how to store it, aggregate it, and so on, but all that is the usual boring old stuff.

What I find interesting instead is how to get the monad to work, how to define the correct way to balance the analysis.

My idea is quite simple. Assume the easiest case, where we have to process just 3 different metrics to get a meaningful state, something like IO / CPU user / network incoming.

A simple triangular pyramid will work fine; each vertex of the solid is a metric, plus one that is the yes/no condition (am I alive?).

The centre of the solid represents the initial state: the state in which all the forces are in perfect balance and there is no variation in the position of the Point of Balance from the centre itself.

triangolo_sys_1

We know that no system is ever in perfect balance. We also know that each system may behave differently depending on N factors, where N is not determined and changes not only in relation to the kind of system, but also between systems belonging to the same class. In short, trying to define N is a waste of time.

What can be done instead, and guess what, it is exactly what we do when we grow from a blastula to a newborn, is to learn what the correct level of variation is; meaning we can learn, for each system, which variation does not compromise our functions.

 

Initially we may have a defined AC, which is the acceptable range inside which the point can fluctuate; for each vertex we have an F for the possible fluctuation. When F = 0 in one or more of the directions, we can say, "Houston, we have a problem."

While learning, our system will identify the right shape and distance for each F, such that the initial circle may become something like this:

triangolo_sys_2

 

This means that any movement of our point inside the AC area will give us the answer "I am good, thanks." Any movement outside will generate a signal like "I am stressed, my CPU is overloaded."
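The AC check above can be sketched in a few lines. This is only a toy rendering of the idea, under my own assumptions about the details the post leaves open: each vertex gets a learned fluctuation limit F, and only crossing it produces a message:

```python
# Toy "monitoring monad": three metric vertices plus the am-I-alive condition.
# The learned fluctuation limits (one F per vertex) are invented numbers here;
# in the real idea they would come from the learning period.
LEARNED_F = {"io": 0.7, "cpu_user": 0.8, "net_in": 0.6}

def status(alive, deviations):
    """deviations: per-vertex displacement of the Point of Balance (0 = centre)."""
    if not alive:
        return "Houston we have a problem"   # the yes/no vertex failed
    stressed = [m for m, d in deviations.items() if d >= LEARNED_F[m]]
    if not stressed:
        return "I am good thanks"            # the point is still inside the AC area
    return "I am stressed: " + ", ".join(stressed)

print(status(True, {"io": 0.2, "cpu_user": 0.3, "net_in": 0.1}))
print(status(True, {"io": 0.2, "cpu_user": 0.9, "net_in": 0.1}))
```

Everything below the learned limits stays silent; only the vertex that leaves its range speaks, which is exactly the "summarize like a human" behaviour described above.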

This is a very simple, basic example, and it may not be clear how this scales or how it could handle much more complex scenarios. So let us go ahead.

 

A simple solid like a triangular pyramid covers something very basic. But if, for instance, you need to provide the status of a more complex interaction, say a database status or a more complex system, then you may have one or many solids with much more complex interactions:

solido1

With a solid like the disdyakis triacontahedron we can have 62 vertices, meaning that with the same algorithm we can associate a significant number of metrics.

 

Each solid is seen from the whole as a single entity, as if enclosed in a sphere that shows only the "final" status:

solido2

The flexibility comes from the fact that we can connect any solid to another, in exclusive mode or as subsystem(s); at the same time, each element can be quite complex internally yet expose a simple and direct status.

 

So, for instance, a simple one could look like this:

solido3

While a more complex one, and probably the most common, would be:

solido4

In this case we can assume we have one group describing the status of a storage engine, another for whatever happens on the storage, and so on, until we have a node of our architecture fully described.

solido5

At this point it should be clear that once we have covered the internal complexity of the variations of each solid, the outcome is a simplified message, "I am good," no matter at what level we are looking.

That will allow us to have quite complex systems, with complex relations, described and reporting their status in a very simple and immediate way.
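One way to picture this composition, again only a sketch with invented names, is that each solid exposes a single status, and a parent solid treats its children's answers as just more inputs:

```python
class Solid:
    """A monitoring solid: local checks plus child solids.
    Internally it can be arbitrarily complex; externally it only
    answers good / not good, like the sphere enclosing it."""

    def __init__(self, name, checks=None, children=None):
        self.name = name
        self.checks = checks or []      # callables returning True when inside AC
        self.children = children or []  # sub-solids, each seen as a single entity

    def is_good(self):
        return (all(check() for check in self.checks)
                and all(child.is_good() for child in self.children))

    def answer(self):
        return (f"{self.name}: I am good" if self.is_good()
                else f"{self.name}: I am not happy")

# A node fully described by nested solids:
# a storage engine inside the storage group inside the node.
engine  = Solid("storage-engine", checks=[lambda: True])
storage = Solid("storage", checks=[lambda: False], children=[engine])
node    = Solid("node", children=[storage])

print(node.answer())    # the whole reports one simple status
print(engine.answer())  # while each sub-solid can still be asked directly
```

The aggregation rule here (a parent is good only if all children are good) is deliberately the crudest possible one; the interesting work would be in smarter rules, but the interface, one simple answer per solid, is the point.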

 

Understanding what is going on in a system like this:

solido6

can be quite difficult and time-consuming. Using standard ways of monitoring, we cannot be sure whether there is a correlation between the metrics, or whether the behaviour of the node is being taken into account correctly.

 

Using the new approach will allow us, first of all, to get simple feedback:

solido7

Basically, given an affected node (badly affected... let us say it is dead), all the others are still answering "I am good," but the related nodes will start to say, "I am not happy," "I am very sad, my node is dead," "I am cool, don't worry, load and service are under control."

And I will focus directly on my dead node and how to fix it. Given the way I collect my information and report the state, I will be able to see it in the timeline and focus directly on the moment the issue started for that node.

solido8

 

No message is a message

What is also very important to consider is that once we have defined the correct behaviour for each solid, which happens during the learning period, we also know what the expected behaviour is and what signals we should see.

 

In short, if you go to the gym and do 45 minutes on the treadmill, you expect to have a higher heart rate, and to feel fatigued and sweaty. If that doesn't happen, then either you were cheating and not doing the exercise properly, or you are probably a cyberman and were not aware of it.

 

Getting the right signal in the right context, even when the signal is a negative one, is as important as, or even more important than, getting a good one.

 

Last year my mother had quite major surgery. The day after it she was great, feeling perfect: no pain, no bad signals. And she was dying; the doctors started to be uncomfortable with her NOT feeling some level of pain or any discomfort. Luckily they took action and saved her, at the last second, but they did.

 

Nowadays we just collect metrics; very rarely do we put them in relation, and even more rarely do we treat the negative reading as a relevant event. This is because we currently have no way to contextualize the right behaviour, to know how things should correctly unfold, and therefore what the deviation from that is.

 

The implementation I am describing not only takes into account the behaviour rather than the single event, but it can also trace and identify the lack of a negative signal, a signal that must take place for the behaviour to be healthy.
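Detecting the missing negative signal can be as simple as comparing the signals observed in a window against the ones the learned behaviour says must appear, as in the treadmill example (a sketch; the activity names and signal sets are mine):

```python
# During the learning period we record which signals MUST accompany an activity,
# including the "negative" ones like fatigue.
EXPECTED = {"treadmill_45min": {"high_heart_rate", "fatigue", "sweat"}}

def check_behaviour(activity, observed):
    """The absence of an expected signal is itself an alarm: no message is a message."""
    missing = EXPECTED[activity] - set(observed)
    if missing:
        return "ALERT: expected signals missing: " + ", ".join(sorted(missing))
    return "Behaviour as expected"

print(check_behaviour("treadmill_45min", ["high_heart_rate", "fatigue", "sweat"]))
print(check_behaviour("treadmill_45min", ["high_heart_rate"]))  # no fatigue, no sweat
```

Classic threshold alerting can only fire when a value appears; inverting the check, as here, is what lets the monitor notice what did not happen.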

Conclusion

What I really want to stress is that the way we monitor today is like trying to manage the Space Shuttle with a stone and a scalpel.

 

There are many solutions out there, but all of them focus on more or less the same approach/model.

Better than nothing, of course, and yes, we still have situations in which we have NO monitoring at all. But I still think that changing the wrapping paper does not make the content new.

 

I do not pretend to know how to implement my idea; the algorithm to calculate the variations and the interactions in the solid is something I do not see within my range. But that just means I need to find someone able to share a dream, someone with quite good mathematical skills.

Last Updated on Friday, 09 January 2015 16:54
 
27 Nov 2014
How to mess up your data using ONE command in MySQL/Galera.
Written by Marco Tusa   

Or how wsrep_on can leave you with a cluster full of useless data.

redflag

This is a WARNING article, and it comes after I had been working on defining an internal blueprint on how to safely perform DDL operations using RSU.

The fun, if fun we want to call it, comes as usual from the fact that I am a curious guy and I often do things my way, not always following the official instructions.

Anyhow, let us go straight to the point and describe what can happen on ANY MySQL/Galera installation.

The environment

The test environment: MySQL/Galera (Percona PXC, version 5.6.20).

The cluster was based on three local nodes, no geographic distribution, and no replication in place other than Galera.

HAProxy on one application node, and a simple application writing into this table:

Table: tbtest1

CREATE TABLE: CREATE TABLE `tbtest1` (
  `auTOInc` bigint(11) NOT NULL AUTO_INCREMENT,
  `a` int(11) NOT NULL,
  `uuid` char(36) COLLATE utf8_bin NOT NULL,
  `b` varchar(100) COLLATE utf8_bin NOT NULL,
  `c` char(200) COLLATE utf8_bin NOT NULL,
  `counter` bigint(20) DEFAULT NULL,
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `partitionid` int(11) NOT NULL DEFAULT '0',
  `date` date NOT NULL,
  `strrecordtype` char(3) COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`auTOInc`,`partitionid`),
  KEY `IDX_a` (`a`),
  KEY `IDX_uuid` (`uuid`)
) ENGINE=InnoDB AUTO_INCREMENT=482 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
 

 

Small app

#! /bin/bash --
i=1
while :
do
	echo "$i "
	# strrecordtype is 'TOI' on the application server writing through HAProxy,
	# 'RSU' on the one pointing directly at the node (adjust per server)
	mysql -ustress -pxxx -h192.168.0.35 -P 3307 -e "SET @HH=@@HOSTNAME;
insert into test.tbtest1 (a,uuid,b,c,strrecordtype,date,partitionid)
values($i,UUID(),@HH,'a','TOI',now(),RAND()*100)";
	i=$((i + 1))
	if [ $i -eq 100 ]
	then
		break
	fi
	sleep 0.5;
done
 

 

 

Server Information

(root@localhost:pm) [(none)]>\s SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';
--------------
/home/mysql/templates/PCX-56/bin/mysql  Ver 14.14 Distrib 5.6.20-68.0, for Linux (i686) using  EditLine wrapper
Connection id:	90
Current database:
Current user:	root@localhost
SSL:	Not in use
Current pager:	stdout
Using outfile:	''
Using delimiter:	;
Server version:	5.6.20-68.0-25.7-log Percona XtraDB Cluster binary (GPL) 5.6.20-25.7, Revision 886, wsrep_25.7.r4126
Protocol version:	10
Connection:	Localhost via UNIX socket
Server characterset:	utf8
Db     characterset:	utf8
Client characterset:	utf8
Conn.  characterset:	utf8
UNIX socket:	/home/mysql/instances/galera1-56/mysql.sock
Uptime:	2 min 38 sec

Threads: 3  Questions: 282  Slow queries: 0  Opens: 78  Flush tables: 3  Open tables: 8  Queries per second avg: 1.784
--------------

+------------------------+---------------+
| Variable_name          | Value         |
+------------------------+---------------+
| wsrep_provider_version | 3.7(r7f44a18) |
+------------------------+---------------+
1 row in set (0.01 sec)
 

 

 

Facts

In MySQL/Galera there is a variable that allows us to tell the server not to replicate. This variable is wsrep_on, and when we set it to OFF the server will not replicate any statement to the other nodes.

This is quite useful when you need to perform actions on a single node, like when you need to perform DDL in RSU mode.

But this flexibility can bite you quite badly.

I made one simple, small change to the widely used command:

 

SET wsrep_on=OFF;

 

I just added GLOBAL:

SET GLOBAL wsrep_on=OFF;

 

 

To be honest, I was expecting the command to be rejected, but no, it was accepted, and this is what happened.

I ran the small loop (see above) on two application servers: one pointing to HAProxy and writing TOI in the field strrecordtype, the other pointing directly to the node where I would issue the wsrep_on command, inserting RSU.

The results:

(root@localhost:pm) [test]>select @@HOSTNAME;select count(*) AS RSU_COUNTER FROM tbtest1
WHERE strrecordtype='RSU';
select count(*) AS TOI_COUNTER FROM tbtest1 WHERE strrecordtype='TOI';
+---------------+
| @@HOSTNAME    |
+---------------+
| tusacentral03 |
+---------------+
1 row in set (0.00 sec)
+-------------+
| RSU_COUNTER |
+-------------+
|          99 |
+-------------+
1 row in set (0.00 sec)
+-------------+
| TOI_COUNTER |
+-------------+
|          99 |
+-------------+
1 row in set (0.00 sec)
(root@localhost:pm) [test]>
(root@localhost:pm) [test]>SET GLOBAL wsrep_on=OFF; <------------- It should not be GLOBAL
(root@localhost:pm) [test]>select @@HOSTNAME;select count(*) AS RSU_COUNTER FROM tbtest1
WHERE strrecordtype='RSU';
select count(*) AS TOI_COUNTER FROM tbtest1 WHERE strrecordtype='TOI';
+---------------+
| @@HOSTNAME    |
+---------------+
| tusacentral01 |
+---------------+
1 row in set (0.00 sec)
+-------------+
| RSU_COUNTER |
+-------------+
|           0 |
+-------------+
1 row in set (0.00 sec)
+-------------+
| TOI_COUNTER |
+-------------+
|          66 | <-------------------- 1/3 lost because HAProxy thinks the node is ok ...
+-------------+
1 row in set (0.00 sec)

 

 

As you can see, on tusacentral03 (the node where I issued SET GLOBAL wsrep_on=OFF), I have ALL the records inserted locally and ALL the records coming from the other nodes.

But on node tusacentral01, I had NO records related to RSU, and more relevantly, I had lost 1/3 of my total inserts.

Why?

Well, this is quite clear and, unfortunately, by design.

If I issue wsrep_on=OFF with GLOBAL, the server applies the setting to ALL sessions, meaning all sessions on that node STOP replicating.

The relevant section in the source code is quite clear:

#wsrep_mysqld.cc
#line 1395
int wsrep_to_isolation_begin(THD *thd, char *db_, char *table_,
                             const TABLE_LIST* table_list)
{
 
  /*
    No isolation for applier or replaying threads.
   */
  if (thd->wsrep_exec_mode == REPL_RECV) return 0;
 
  int ret= 0;
  mysql_mutex_lock(&thd->LOCK_wsrep_thd);
 
  if (thd->wsrep_conflict_state == MUST_ABORT)
  {
    WSREP_INFO("thread: %lu, %s has been aborted due to multi-master conflict",
               thd->thread_id, thd->query());
    mysql_mutex_unlock(&thd->LOCK_wsrep_thd);
    return WSREP_TRX_FAIL;
  }
  mysql_mutex_unlock(&thd->LOCK_wsrep_thd);
 
  DBUG_ASSERT(thd->wsrep_exec_mode == LOCAL_STATE);
  DBUG_ASSERT(thd->wsrep_trx_meta.gtid.seqno == WSREP_SEQNO_UNDEFINED);
 
  if (thd->global_read_lock.can_acquire_protection())
  {
    WSREP_DEBUG("Aborting TOI: Global Read-Lock (FTWRL) in place: %s %lu",
                thd->query(), thd->thread_id);
    return -1;
  }
 
  if (wsrep_debug && thd->mdl_context.has_locks())
  {
    WSREP_DEBUG("thread holds MDL locks at TI begin: %s %lu",
                thd->query(), thd->thread_id);
  }
 
  /*
    It makes sense to set auto_increment_* to defaults in TOI operations.
    Must be done before wsrep_TOI_begin() since Query_log_event encapsulating
    TOI statement and auto inc variables for wsrep replication is constructed
    there. Variables are reset back in THD::reset_for_next_command() before
    processing of next command.
   */
  if (wsrep_auto_increment_control)
  {
    thd->variables.auto_increment_offset = 1;
    thd->variables.auto_increment_increment = 1;
  }
 
  if (thd->variables.wsrep_on && thd->wsrep_exec_mode==LOCAL_STATE) <------- Here we have a check for wsrep_on 
  {
    switch (wsrep_OSU_method_options) {
    case WSREP_OSU_TOI: ret =  wsrep_TOI_begin(thd, db_, table_,
                                               table_list); break;
    case WSREP_OSU_RSU: ret =  wsrep_RSU_begin(thd, db_, table_); break;
    }
    if (!ret)
    {
      thd->wsrep_exec_mode= TOTAL_ORDER;
    }
  }
  return ret;
}
enum wsrep_exec_mode {
    LOCAL_STATE,
    REPL_RECV,
    TOTAL_ORDER,
    LOCAL_COMMIT
};
 

 

 

So what happens is that the server checks whether the thd object has that variable ON and is in LOCAL_STATE: if so, it replicates; if not, it does nothing.

But as said, while this makes sense in the SESSION scope, it does not in the GLOBAL one.

 

Not only that: setting wsrep_on to OFF in the global scope does NOT trigger any further action from MySQL/Galera, for instance desynchronizing the node from the rest of the cluster.

The interesting effect of this is that HAProxy has NO WAY to know that the node has stopped replicating, so the node keeps receiving requests, but those will not be replicated to the other nodes, causing data divergence.
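A toy simulation makes it obvious why exactly one third of the writes disappeared from the other nodes: HAProxy keeps round-robining over the three nodes, and whatever lands on the node with wsrep_on=OFF is never replicated (pure Python, no real cluster involved; node names are invented):

```python
nodes = ["node1", "node2", "node3"]
# wsrep_on=OFF in global scope on node3: it accepts writes but stops replicating.
replicating = {"node1": True, "node2": True, "node3": False}

seen_by_node1 = 0
for i in range(99):                 # 99 inserts, as in the test above
    target = nodes[i % 3]           # HAProxy still thinks every node is fine
    if target == "node1" or replicating[target]:
        seen_by_node1 += 1          # written locally on node1, or replicated to it

print(seen_by_node1)  # 66 -> one third of the writes never reached the other nodes
```

With 99 round-robined inserts, 33 land on each node; node1 keeps its own 33 plus node2's replicated 33, and node3's 33 simply vanish from its point of view.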

 

You could say that a DBA SHOULD know what he is doing, and as such he/she should MANUALLY desync the node before issuing the command.

My point instead is that I don't see ANY good reason to have wsrep_on as a global variable; on the contrary, I see it as a very dangerous and conceptually wrong "feature".

 

Browsing the Codership manual, I noticed that the wsrep_on variable comes with the "L" flag, meaning that the variable is NOT supposed to be GLOBAL.

But it is ...

I also dug into the code:

wsrep_var.cc
#line 58
 
bool wsrep_on_update (sys_var *self, THD* thd, enum_var_type var_type)
{
  if (var_type == OPT_GLOBAL) {
    // FIXME: this variable probably should be changed only per session
    thd->variables.wsrep_on = global_system_variables.wsrep_on;
  }
  return false;
}
 

 

That is interesting, isn't it?

I wonder when this comment was inserted, and why it was ignored.

Anyhow, the source of all problems is here, in the wsrep_on variable definition:

static Sys_var_mybool Sys_wsrep_on (
       "wsrep_on", "To enable wsrep replication ",
       SESSION_VAR(wsrep_on),                      <----------------------- This allows GLOBAL 
       CMD_LINE(OPT_ARG), DEFAULT(TRUE), 
       NO_MUTEX_GUARD, NOT_IN_BINLOG, ON_CHECK(0),
       ON_UPDATE(wsrep_on_update));
 

 

The variable was defined as SESSION_VAR instead of SESSION_ONLY, and as such it is usable in the global scope as well.

 

As already stated, this is from my point of view a conceptual error, not a bug; it is something that should not exist at all, because in a cluster where data is certified/replicated/synchronized there should NOT be any option for a DBA/user to bypass the data validation/replication process at the GLOBAL level.

 

To note, and to make things worse: after the test I could simply set wsrep_on back to ON, and my node would continue to act as part of the cluster as if all the nodes were equal, while they are not.

(root@localhost:pm) [test]>select @@HOSTNAME;select count(*) AS RSU_COUNTER FROM tbtest1
WHERE strrecordtype='RSU';
select count(*) AS TOI_COUNTER FROM tbtest1 WHERE strrecordtype='TOI';
+---------------+
| @@HOSTNAME    |
+---------------+
| tusacentral03 |
+---------------+
1 row in set (0.00 sec)
 
+-------------+
| RSU_COUNTER |
+-------------+
|         181 |
+-------------+
1 row in set (0.00 sec)
 
+-------------+
| TOI_COUNTER |
+-------------+
|         177 |
+-------------+
1 row in set (0.00 sec)
 
+---------------+
| @@HOSTNAME    |
+---------------+
| tusacentral01 |
+---------------+
1 row in set (0.00 sec)
 
+-------------+
| RSU_COUNTER |
+-------------+
|          77 |
+-------------+
1 row in set (0.00 sec)
 
+-------------+
| TOI_COUNTER |
+-------------+
|         139 |
+-------------+
 

 

As you can see, the cluster continues to insert data through HAProxy on all the nodes, but its data set is inconsistent.

Conclusions

  • Never use SET GLOBAL with wsrep_on.
  • IF you are so crazy as to do so, be sure no one is writing on the node.
  • I am sure this is a mistake in the logic, and as such the variable should be changed at the source, defining it as SESSION_ONLY instead of SESSION_VAR; otherwise wsrep_on can damage you quite badly.
Last Updated on Thursday, 27 November 2014 15:00
 