Today was another good day, but I had a problem with one presentation, the keynote
"Big Data is a Big Scam: Most of the Time". Something in the architecture does not convince me,
mainly because of the lack of numbers and the mismatch between requirements and implementation.
The presentation was interesting in some respects, and the tile approach is probably a good one.
I enjoyed the general idea, and I found the bravery of setting up circular replication across geographic regions awesome.
Nevertheless, I remain bewildered by the numbers behind the tile approach, probably because the presentation was missing some additional information.
To be clearer, I was trying to better understand the real numbers and how it all works.
For example, if each tile is composed of 1 MySQL node and 2 data nodes, and given that this solution uses extra-large EC2 instances with 68GB RAM, I can assume that
we have approximately 65GB of data (allowing some overhead for buffers and so on) for the current Node Group on this tile.
From the description I also understood that the number of tiles implemented differs by region.
That means the number of Node Groups is different and, as a direct consequence, so are the memory available and the number of MySQL SQL nodes.
This raises an immediate question: do we have the same data set replicated worldwide?
Or do we have different sets?
Each Node Group can host ~69GB, and the maximum number of data nodes in 7.2 is still 48 (http://dev.mysql.com/doc/refman/5.5/en/mysql-cluster-ndbd-definition.html), which means that the maximum size that can be served is 1.3TB for the whole cluster.
Given that my understanding during the presentation was that the data set is 100TB, where are the remaining 98.7TB located, now or in the future? Keep in mind that I asked whether tables on disk were used, and the answer was NO.
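
The arithmetic behind this question can be sketched in a few lines. This is only a back-of-the-envelope estimate, and it rests on assumptions: one Node Group of 2 data nodes per tile, the two data nodes in a group holding copies of the same data, and roughly 65GB of usable memory per group. The names below (`cluster_capacity_tb` and the constants) are made up for the illustration and are not part of any MySQL tool.

```python
# Back-of-the-envelope capacity estimate for an in-memory NDB cluster.
# All figures are the ones quoted in this post; nothing here is measured.

MAX_DATA_NODES = 48        # data node limit cited above for 7.2
DATA_NODES_PER_GROUP = 2   # assumption: one Node Group of 2 data nodes per tile
USABLE_GB_PER_GROUP = 65   # assumption: 68GB RAM minus overhead for buffers


def cluster_capacity_tb(node_groups: int, usable_gb: float = USABLE_GB_PER_GROUP) -> float:
    """Unique in-memory data, in TB (1 TB = 1000 GB), that the node groups can hold."""
    return node_groups * usable_gb / 1000


# Largest cluster allowed by the data node limit.
max_node_groups = MAX_DATA_NODES // DATA_NODES_PER_GROUP
ceiling_tb = cluster_capacity_tb(max_node_groups)

# The 20-tile US setup mentioned in the presentation.
us_tiles_tb = cluster_capacity_tb(20)

print(f"node groups at the limit: {max_node_groups}")
print(f"capacity at the limit:    {ceiling_tb:.2f} TB")
print(f"capacity with 20 tiles:   {us_tiles_tb:.2f} TB")
```

Whichever variant is used, the result stays in the low single-digit terabytes, which is why a 100TB data set with no tables on disk is hard to place.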
Then I have another question mark in front of me.
If the set of tiles in the US is composed of 20 tiles (so 20 MySQL SQL nodes, 40 data nodes and so on), it is set up like this because it needs to sustain the traffic; in other words, it requires that setup for its "calculation power" needs.
I assume (probably wrongly) that this is due to the number of requests per second coming in.
Given that we have circular replication in the architecture, and given that we have different numbers of tiles and, as a consequence, of MySQL nodes, HOW can an 8 MySQL SQL node setup (say in Australia) sustain the traffic coming in from a 20 MySQL SQL node setup?
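
To make the concern concrete, here is a minimal sketch of the load ratio. It assumes that every write accepted in the larger region has to be replicated to the smaller one, and that the apply work could be spread evenly across the smaller region's SQL nodes, which may well be optimistic. The figure of 1000 writes per second per SQL node is a placeholder chosen for the example, not a number from the presentation, and `apply_load_per_sql_node` is a hypothetical helper.

```python
def apply_load_per_sql_node(source_sql_nodes: int,
                            target_sql_nodes: int,
                            writes_per_source_node: float) -> float:
    """Replicated writes per second that each target SQL node must absorb,
    assuming all source writes are replicated and spread evenly."""
    total_replicated_writes = source_sql_nodes * writes_per_source_node
    return total_replicated_writes / target_sql_nodes


# Placeholder rate: 1000 writes/s per SQL node in the 20-node region.
per_node = apply_load_per_sql_node(source_sql_nodes=20,
                                   target_sql_nodes=8,
                                   writes_per_source_node=1000)

print(f"each of the 8 SQL nodes must apply {per_node:.0f} replicated writes/s")
print(f"that is {per_node / 1000:.1f}x the per-node rate of the source region")
```

Under these assumptions each SQL node in the smaller region absorbs 2.5 times the per-node write rate of the larger one, before serving any local traffic of its own.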
I am sure we are missing something in the presentation in terms of real numbers.
Being a MySQL NDB lover, I would really like to have more information and details, given that what I got from the presentation was not enough for me.