Post by Ivan Borshchov
CEO & CTO @ Devforth
Going to design a next-generation scalable web application that will have quick as a lightning reaction? You might think that triumph depends on the fronted SPA framework, hosting/cloud provider or programming language you choose.
As a CTO of Devforth company which is delivering complex new-age web applications I say that none of this really matters. What matters is identifiers system your developers use! And I will explain why.
After designing architecture of dozen high-load apps with strong reaction requirements I must admit that UUID kind of identifiers is only one way to:
- Deliver rapid and robust User Interface on any client - Web, Mobile, Desktop
- Spend minimal developer hours by using smart architecture approaches
- Build Infinity-scalable solution
- Release offline/reconnect-tolerant solution which will save all edits even in offline and then seamlessly synchronize them to a server
But first, let's start with a classic. Despite the fact that UUID identifiers have been widely used by awesome web-portals as primary ID systems (https://trello.com, https://slack.com) a lot of new start-ups by trusting development to low-level outsource companies or freelancers are still based on Autoincremented primary keys as IDs. Typical de-facto default in relational DBs (MySQL, MSSQL, Oracle, Postgres) will look like:
id INT AUTO_INCREMENT PRIMARY KEY
Autoincrement IDs problem one - frontend have to wait for a server
Imagine user should be able to create two linked entities on User Interface - Company and a Branch . Branch should mandatory belong to a Company. It could be anything - thread and message, warehouse and shelf, etc...
App should handle this story in the next way:
- Frontend asks user to enter info about Company (Name, etc) and send request to the server (without ID) for creating an item in database
- Server creates table item and new ID for it by incrementing last ID value. Server is only one centralized storage of last ID so only server could create next id.
- Server sends a response to a browser to deliver created ID
- Only after this browser could complete forming of related entity - only now it knows real
companyIdfield with value returned from server
Some developers show loaders/spinners on every request. Most fast-to-implement and dump approach, it will stop UX comfort on the vine! If user goes offline, company request will fail and we can't create branch at all. Exceptional handling will create a lot of pain for developers and QA need to test them very carefully before accepting product.
Other developers create special "combined" server requests which will create several items: company and branch in one shot. This approach is better, but:
- It is not universal, every story will require rethinking and separate API (more code, more underhanded bugs).
- Has pseudo-id problem. To work before posting to the server (or offline) on frontend you will have to create some pseudo-identifiers and then replace them with real on server (only server knows correct last available id)
Obviously, such architectural solution is crooked and requires a lot of developer and QA resources. Remember that every complex code is longer to write, for QA it is harder to find an issues, then harder to support, harder to add new features, on-board new team, so don't be surprised then that your developers are super-slow and "ate" all your budget!
Simplicity and not complexity moves this world forward.
What is UUID
Universally unique identifier. Absolutely unique once you generate it ANYWHERE. It is possible because of using a large (128-bit) random number. Typical example encoded in HEX:
Any modern language has a built-in module and function to generate new UUID or has some third-party library.
There are several versions of UUIDs:
- The simplest one is
UUIDv4- it is a fully random number. "They" call it "full entropy". To give you an idea of how unique would be 128-bit random number - imagine you would generate 1 billion UUIDs v4 every second for the next 100 years, then the probability that you will have only one duplicate in all these generated IDs would be
- Another very common one is
UUIDv1. It includes 2 parts: 48-bit Host Mac Address (so already unique for different clients who generate it), 60-bit Timestamp (nanoseconds precision, even two subsequent generate function calls will never generate same timestamps). It's harder to measure a clash here but it is always better to use this version, and later in post, I will explain why.
So any client can just generate ID of company on own side (Frontend or Application), then he can go ahead and create branches by linking with this
companyId. You don't need to call a server to approve these IDs - no one will reuse it because it is unique! You could submit all these data even after an hour of offline work and you will still be sure that no one else generated it. (It is just an example, of course, it is always better to deliver data to server ASAP - at least to make data visible for another users if you have multiuser app)
Auto-increment IDs problem two - to scale you will need data shards, but you blocked by same IDs on different servers
Imagine you created your awesome well-tuned orchestration to spawn new servers with new DB instances, routing mechanisms, you are ready for high-load by scaling horizontally!
But on some peak you faced with overload on one server - for example, one client has grown and takes 90% of payload. You decide rapidly move this client from this server to another free server (which has fewer clients). You are moving company
135 to another server. But what if it already has company
135. OK, you could regenerate ID somehow and take care of delivering this change to a user, but... No, such solution sometime will make you pay a big price for data loss, you will spend a lot of time by recovering screwed entities with a user who gone offline at this important moment, or who edited it between swap.
Limited sharding solution
Most relational databases have some mechanisms to generate unique sequences of auto-incremented IDs on different servers to be able then move data between database servers. On one project on which we already had autoincremented IDs we configured
auto_increment_offset settings of MySQL. The first one should hold the maximum potential number of servers, and the second one number on the server. If you think you will have 8 servers then you have to set the next values:
- On the first server:
In all tables on first server MySQL will generate next IDs:
- On the second server:
In all tables on this server MySQL will generate IDs:
And so on. So as you see, now it is safe to move data from one server to another. And you can have 2 active servers now and then when you need to add one more, add one more, up to
8 servers. Not so bad, right?
But, you are limited. Imagine you will have a need to have a 9th server, and if you have scalable business then you will sometime. You will not able to just tune settings, you will need to reconfigure all previous servers and regenerate IDs. It is obvious how hard it will be.
To set a very big value of increment for potential grow you will have to take care of Field data type. Imagine you want up to
10'000 servers. Standart
INT in MySQL takes 4-bytes (32 bits), so if you will have
INT type you will be able to store next record count:
2 ** 32 / 10'000 = 429'496
If you could state that you will never have more then
10k servers, then I am not sure you will state that
429k items would be enough for one server. So probably you will need
BIGINT type which takes 8-bytes (64 bits).
Yes, UUIDs take even more -
16 bytes, but no special database settings are required - you could just move data from one server to another, they are Absolutly Unique!!! It will be much much easier for high-load software administrators to support it. Simplicity and only simplicity moves this world forward.
Small nice bonus - secret records count
If you use Autoincrement IDs some non-good-guy might create an account in your system, create a company, check the Network tab in the Chrome browser and see that his ID is
136. He now knows that your system has only 135 companies created by another clients through all the time from the beginning (or less if they were deleted). You will say "not a big deal", but are you 100% sure that info about total entities count is not sensitive? Think carefully!
When you use UUIDs you can't say how much records were created before. Also, you could use UUIDv4 in secret (e.g. share) links. Hackers will need eternity to guess it even if you will have millions of secret links.
But wait, probability of clash is very low but it exists, right?
The biggest problem of wrong decisions is the wrong priority. If you are afraid that after 100 years you will have one clash for one client and he will not able to create a company then you will miss several next years of a highly profitable and nicely scalable business. Let's earn money in the nearest decades, most systems anyway will die after 10-20 years because of rapid technology grow.
Anyway, any DB has some kind of Unique ID check, so correct handling will throw some exception, like any regular exception in your system it should be universally handled and for example, attempt executed again, so it would be even invisible for end-user. But normally it will happen not in the nearest
How about the clash in case of moving data between servers? I would just handle the exception and rollback transactions in which you will move data with the correct message, nothing serious - just select another server in case you will face it after years.
How to store it?
Technically every DB could store UUIDs, even relational like MySQL. It is a big mistake if someone will say you that Relational DBs should have sequentional IDs and NoSQL shoud have UUIDs. It is not about it. But you have to remember the fact that UUID takes 16 bytes of data on disk. This leads to 2 issues that should be carefully mitigated:
- Larger disk space
- Slower index
How to achieve minimal space when you use UUIDs
First of all, you've seen an example of UUID I posted above:
This is the mainstream format - 36-characters ASCII string. Most libraries will generate UUIDs like such string. And if you are going to store it in DB in this format (e.g.
VARCHAR(36) in SQL DB) you will waste a lot of disk space because it will take 36 bytes on disk. And if you use foreign keys in other tables, they all should hold the same values with the same length.
Never do this! Store UUIDs as
BINARY(16) or any other binary format your DB supports which could guarantee that one value will consume 16 bytes. Yes, you will need to write code or find another library to convert to/from the binary format. You will need to test it once, but then you will get an awesome most optimal setup!
How to achieve maximum performance when you find/insert UUIDs
To perform fast data search by ID fields Database must have some sort of index on it. The index, in most cases, is some sort of binary search tree so it gives search complexity
O(log(n)). This means that search time depends on elements count, and on some fixed time used to follow tree branch. Obviously, this time depends on value length, and nothing in this world could be free. If we increase the length of the field we need to pay with larger time on the following tree. This is another reason to use 16 bytes instead of 36. Anyway yes, 16-byte field will always be slower in index then 4-byte INT, but not so dramatically that we need to pay attention. Smart decisions are about priorities, we lose very little, but get a lot!
Another important factor is tree formation. Basic data structure theory which we study at University courses tells us that each insert requires some time to find the correct place in tree where we have to place a new id (with a pointer to data row on disk) and tree balancing. So value itself also influences insert time. There is some simple research that proves it https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/ According to this post to achieve the best results we have to use UUIDv1 because it has timestamp part so index builder can faster find the correct place in a tree. So just always use v1! v4 will make your queries slower!
Also, the author suggested a good idea of rearranging some parts in v1 to achieve even better results. But this is not very necessary, even native MySQL built-in UUID generator has standard v1 format https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid
How to make network requests consume less traffic
It is easy to explain why standard UUID takes 36-chars for representing 16 bytes:
- they use 4 dashes which are always in constant places and have no value.
- they use only hex characters (
To transmit such UUIDs over text protocols like HTTP/WS we are limited for ASCII chars, but we could use a larger alphabet. If we would use
0-9, a-z, A-Z, we could reduce length up to 22 chars, it will look like:
This is also called a Short UUID format. Obviously, it is just about representation, so any UUID version could be represented as short. I would definitely recommend to transmit this version over the internet in your requests and responses and get rid of 14 unused bytes.
This is a rare improvement, If you will check Trello and Slack requestyou will notice that they transmit plain 36-chars versions, but shortification is very time-cheap and could be easily implemented on client and server, so why not to use it?
- UUIDs solve the problem of centralized ID store making possible working in offline and handle complex relations on frontend by making frontend independent from the server which makes it super-responsive
- UUIDs allow you to implement shards for high-load scalable systems
- UUIDs make your IDs more secure by hiding records count
- To decrease insert time use UUIDv1 as IDs system
- Store it in the database as 16 bytes
- To save traffic transmit Shot UUIDs