MySQL on Kubernetes demystified
Why
Marco, why did you write this long article?
Yes, it is long, and I know most people will not read it in full, but my hope is that at least someone will, and I count on them to start a wave of sanity.
Why I wrote it is simple. We write articles to share something we discovered, to present new approaches, or, as in this case, to try to demystify and put into the right perspective the "latest shiny thing" that will save the world.
The "latest shiny thing" is the use of containers for relational database management systems (RDBMS), together with all the attached solutions such as Kubernetes and similar orchestrators.
Why is this a problem? The use of containers for an RDBMS is not really a problem per se. It has become a problem because it is not correctly contextualized and, even more important, because the model that should be used to properly design such solutions has never been reviewed and adapted from the classic one.
One example that stands for all is this image:
Source: https://www.cockroachlabs.com/blog/kubernetes-trends/
In this report the term Database is used multiple times, with references to how easy it is to adopt and scale a Database on Kubernetes. But the problem is … which Database? Not all databases are the same, and not all can be adopted so easily; some use a more restrictive design in order to be a real RDBMS, others less so.
Some are designed to scale horizontally, others are not. The other missing piece is that to get something, you need to give something up. If I design a system to scale horizontally, I probably have to pay a price somewhere else. It could be the loss of referential integrity or a weaker isolation level; the specifics do not matter, the point is that you do not get anything for free.
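To make "isolation level" concrete, here is a minimal sketch (the accounts table is hypothetical, purely for illustration) of the kind of guarantee a classic single-node MySQL/InnoDB server gives you on request, and that some horizontally scalable stores weaken or drop as the price of scaling out:

```sql
-- Classic single-node MySQL/InnoDB: strong isolation is one statement away.
-- The accounts table is hypothetical, used only for illustration.
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;

START TRANSACTION;
-- Lock the row so no concurrent transaction can modify it under us.
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;
```

On a distributed store, the same sequence may be unsupported, far more expensive, or silently subject to weaker guarantees; that is the trade-off rarely spelled out.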
Generalizing without mentioning these differences is misleading.
What happens instead is that a lot of people present solutions so generic that they are unusable. The most hilarious part is that they present this as an inevitable evolution, the step into the future that will solve every problem, yet while doing so they do not clarify which "Database" is in use, what you get, and what you lose. This leads to misunderstanding and future frustration.
For instance, we constantly see presentations illustrating how easy it will be to manage hundreds or thousands of pods containing an RDBMS, without any real understanding of what an RDBMS is, of the data it may host, or of that data's size.
The further ahead we go, the more dangerous this disinformation becomes, because out of 100 companies currently using an RDBMS to manage their crucial data, only 10 have a good team of experts who understand how to really use containers, and probably only one or two of them have real data experts able to redesign the data layer for correct use in containers, or to identify the proper database to migrate to in order to achieve the expected results.
Another common expectation is that moving to containers/Kubernetes will reduce costs, whether those of the iron (physical servers) or of managing it.
Indeed, you can optimize that part, but you need to understand the limitations of your future solution and take into account that, one way or another, you may not get the same level of service. Honestly, I haven't seen this addressed in current blog posts or reports.
This is why I wrote this article. What I want to do is open the door to a discussion that will lead us to review the original model used to design data inside an RDBMS, and that will allow anyone to safely approach a different model for modern database design.
When
(If you know how we got here, you can skip this.)
To fully understand what we are talking about, we need to jump back in time, because without history we are nothing; without memory we are lost. Just note that I am going to touch on things at a very high level and only insofar as they concern us; otherwise this would be a book, not an article.
A long time ago, in a world far, far away where the internet did not yet exist, there was the client-server approach.
At that time we had many clients connecting to a server that provided access to whatever was needed. Most clients performed information rendering and local data validation, then sent the information back to the application server, which processed it and stored it… where? Well, there was a wide variety of containers. Some were just proprietary files in custom formats; others connected to an external database. What is relevant here is that we had a very limited number of clients (often fewer than 1,000), and in the end the amount of data in transit and then stored was very small (a database holding more than 1 GB of data in total was considered a monster).
Then came the internet … and many things started to change.
We still have a client on our PC, but now it is called a web browser; information rendering is standardized using SGML-based markup (HTML tagging); and the server is no longer a simple application server but a web server connected to one or many applications through the Common Gateway Interface (CGI). Each application then handled its data in and out independently, some using databases, some not.
This new approach added many challenges to the previous model:
- Anyone from anywhere could access the web server and any application it hosted.
- The number of requests and connected clients jumped from well-predictable figures to something impossible to define.
- The connection to a web server followed a different approach, based on a request issued by a browser and a set of information sent back as an answer by the server (later we would get more interactive protocols, but let us stay at a high level, OK?).
- Many applications duplicated the same functions, each with a different approach.
- Data received and sent needed to be consistent not only locally on the server, but also across requests.
So, on top of the problems generated by the number of possible connected clients and the number of requests per second a web server had to process, the initial model wasted resources both during development and at runtime. There was a need to optimize the interactions between applications and their functions, and to cover that, the Service-Oriented Architecture (SOA) model was widely adopted.
The data problem was partially solved by identifying the RDBMS as the best tool to guarantee the needed level of data consistency. At the same time it became possible to organize data in containers (tables) with validated interactions (foreign keys and constraints), solving the online transaction processing (OLTP) problem. Business analysis and reporting, meanwhile, went down the OLAP (online analytical processing) road.
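As a minimal sketch of those "validated interactions" (the customers and orders tables here are hypothetical, purely for illustration), this is referential integrity enforced by the RDBMS itself:

```sql
-- Hypothetical schema, for illustration only (InnoDB, the MySQL default
-- engine, enforces foreign key constraints).
CREATE TABLE customers (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE orders (
    id          INT PRIMARY KEY,
    customer_id INT NOT NULL,
    total       DECIMAL(10,2) NOT NULL,
    -- The validated interaction: an order must point to a real customer.
    CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id) REFERENCES customers (id)
);

-- This fails with a foreign key error, because customer 42 does not exist:
-- inconsistent data never enters the system.
INSERT INTO orders (id, customer_id, total) VALUES (1, 42, 99.90);
```

It is this class of guarantees that OLTP workloads rely on, and that not every "Database" mentioned in the Kubernetes reports actually provides.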