Going to design a next-generation scalable web application that will have a reaction quick as lightning? You might think that triumph depends on the fronted SPA framework, hosting/cloud provider or programming language you choose.
As CTO of Devforth, which is delivering complex new-age web applications, I say that none of this really matters. What matters is the identifiers system your developers use! Here I will explain why.
After designing the architecture of dozens of high-load apps with strong reaction requirements I believe that UUID identifiers is the only way to:
- Deliver rapid and robust User Interface on any client – web, mobile, desktop
- Spend minimal developer hours by using smart architecture approaches
- Build infinity-scalable solutions
- Release offline/reconnect-tolerant solution, which will save all edits, even offline, and then seamlessly synchronize them to a server
But first, let's start with a classic. Despite the fact that UUID identifiers have been widely used by awesome web-portals as primary ID systems (https://trello.com, https://slack.com), a lot of new start-ups and developments are still based on auto-incremented primary keys as IDs. Typical de-facto default in relational DBs (MySQL, MSSQL, Oracle, Postgres) will look like:
id INT AUTO_INCREMENT PRIMARY KEY
This approach has advantages, but they are not valuable compared to the next drawbacks.
Auto-increment IDs' problem one is that the frontend has to wait for a server
Imagine the user should be able to create two linked entities on User Interface – company and branch. The branch should belong to a company. It could be anything – chat thread and message, warehouse and shelf, etc.
The app should handle this story in the following way:
- The frontend asks the user to enter info about the company (name, etc.) and send the request to the server (without ID) for creating an item in the database
- The server creates a table item and a new ID for it by incrementing the last ID value. The server is only one centralized storage of the last ID, so only the server could create the next ID.
- The server sends a response to a browser to deliver the created ID
- Only after this could the browser complete forming the related entity – only now it knows the real
companyIdfield with the value returned from the server
Some developers still show loaders/spinners on every request. These are most fast-to-implement but dump approaches. They will stop UX comfort on the vine! If the user goes offline, the company request will fail and we can't create branch at all. Exceptional handling will create a lot of pain for developers and QA need to test them very carefully before accepting the product.
Other developers create special "combined" server requests which will create the company and branch in one shot. This approach is better, however:
- It is not universal, as every story will require rethinking and separate API (more code, more underhanded bugs).
- It has a pseudo-ID problem. Before posting to the server (or offline) on frontend, you will have to create some pseudo-identifiers and then replace them with real on server (only server knows correct last available ID)
Obviously, such an architectural solution is crooked and requires a lot of developer and QA resources. Remember that every complex code is longer to write. For QA it is harder to find issues, support, add new features, and on-board new team, so don't be surprised if your developers are super-slow and "eat" all your budget!
Simplicity – not complexity – moves this world forward.
What is UUID
It stands for a universally unique identifier, which is absolutely unique once you generate it ANYWHERE. It is possible because of using a large (128-bit) random number. A typical example encoded in HEX is the following:
Any modern language has a built-in module and function to generate new UUID or has a third-party library.
There are several versions of UUIDs:
- The simplest one is
UUIDv4– it is a fully random number. "They" call it "full entropy." To give you an idea of how unique a 128-bit random number would be, imagine generating 1 billion UUIDs v4 every second for the next 100 years, The probability that you will have only one duplicate in all these generated IDs would be
- Another very common one is
UUIDv1. It includes 2 parts: 48-bit Host Mac Address (already unique for different clients who generate it), 60-bit Timestamp (nanoseconds precision; even two subsequent generate function calls will never generate the same timestamps). It's harder to measure a clash here. Later in the post, I will explain why it is better to use this version if you have no security restrictions about data inside of id (when it was generated and by whom).
Any client can generate an ID of a company on their own side (Frontend or Application), and then create branches by linking with this
companyId. You don't need to call a server to approve these IDs – no one will reuse it because it is unique! You could submit all these data even after an hour of offline work and you would still be sure that no one else generated it. This is just an example of the worst-case – it is always better to deliver data to server ASAP, at least to make data visible for other users if you have a multi-user app).
There are two problems with auto-increment IDs – to scale you will need data shards, but you will be blocked by the same IDs on different servers
Imagine you created your awesome well-tuned orchestration to spawn new servers with new DB instances and routing mechanisms – you are ready for high-load by scaling horizontally!
Let’s say during a peak you experience a server overload – for example, one client has grown and takes 90% of payload. You decide to rapidly move this client from this server to another free server (which has fewer clients). You are moving company
135 to another server. But what if it already has company
135? OK, you could somehow regenerate the ID and take care of delivering this change to a user, – however, you will pay a big price for data loss and spend a lot of time recovering data for a user who went offline or edited between a swap.
Limited sharding solution
Most relational databases have mechanisms to generate unique sequences of auto-incremented IDs on different servers to then be able to move data between database servers. On one project in which we already had auto-incremented IDs, we configured
auto_increment_offset settings of MySQL. The first one should hold the maximum potential number of servers, and the second one the number on the server. For example, if you think you will have 8 servers then you have to set the next values:
1. On the first server:
In all tables on the first server MySQL will generate the following IDs:
2. On the second server:
In all tables on this server MySQL will generate IDs:
And so on. So as you see, now it is safe to move data from one server to another. You can have 2 active servers now and when you need to add more, you can add up to
8 servers. Not so bad, right?
However, you are limited. If you have a scalable business, at some point you will need more than 8 servers. You will not be able to just tune settings – you will need to reconfigure all previous servers and regenerate IDs. It is obvious how hard this will be.
To set a very big value of increment for potential growth you will have to take care of the field data type. Imagine you want up to
10,000 servers. Standart
INT in MySQL takes 4-bytes (32 bits), so if you have an
INT type you will be able to store the following record count:
2 ** 32 / 10,000 = 429,496
If you know that you will never have more than
10k servers, then I am not sure that
429k items would be enough for one server. So you will probably need
BIGINT type, which takes 8-bytes (64 bits).
Yes, UUIDs take even more –
16 bytes, although no special database settings are required. You could just move data from one server to another – they are very unique! It will be much much easier for high-load software administrators to support it. Simplicity, and only simplicity moves this world forward.
Another nice bonus is the secret records count.
If you use Autoincrement IDs, a bad guy could create an account in your system, create a company, check the Network tab in the Chrome browser and see that his ID is
136. He now knows that your system has only 135 companies ever created by other clients. You may say "not a big deal," but are you 100% sure that info about total entities count is not sensitive?
When you use UUIDs, you can't say how many records were created before. Also, you could use UUIDv4 in secret (e.g. share) links. Hackers will need eternity to guess it even if you will have millions of secret links.
But wait – the probability of a clash is very low, but it exists, right?
The biggest problem in wrong decisions is having the wrong priorities. If you are afraid that after 100 years you will have one clash for one client and he will not be able to create a company, then you will miss several years of a highly profitable and nicely scalable business. Let's earn money in the near decades!
Any DB has a Unique ID check, so correct handling will be throwing some exception in case of a clash. Like any regular exception in your system it should be universally handled and for example, attempt to be executed again, so this would be invisible even for the end-user coming after 100 years.
How about a clash in the case of moving data between servers? I would just handle the exception and rollback transactions in which you will move data with the correct message. This is nothing major — just select another server in case you will face it after years.
How to store it?
Technically every DB could store UUIDs, even relational once like MySQL. It is a big mistake if someone says to you that Relational DBs should have sequential IDs and NoSQL should have UUIDs. This is not about it. But you have to remember the fact that UUID takes 16 bytes of data on disk. This leads to two issues that should be carefully mitigated:
- Larger disk space
- Slower index
How to achieve minimal space when you use UUIDs
First of all, you've seen an example of UUID I posted above:
This is the mainstream format – 36-characters ASCII string. Most libraries will generate UUIDs like this string. And if you are going to store it in DB in this format ((e.g.
VARCHAR(36) in SQL DB)) you will waste a lot of disk space because it will take 36 bytes on disk. If you use foreign keys in other tables, they all should hold the same values with the same length.
Never do this. Store UUIDs as
BINARY(16) or any other binary format your DB supports, which could guarantee that one value will consume 16 bytes. Yes, you will need to write code or find another library to convert to/from the binary format. You will need to test it once, but then you will get the awesome most optimal setup!
If you do it in other way, you will face a lot of issues, e.g. think about price of sql database backups
How to achieve maximum performance when you find/insert UUIDs
To perform fast data search by ID fields, the Database must have some sort of index on it. In most cases, the index is some sort of binary search tree so it gives search complexity
O(log(n)). This means that search time depends on element count and a fixed time used to follow the tree branches. Obviously, this time depends on value length, and nothing in this world could be free. If we increase the length of the field we need to pay with more time on the following tree. This is another reason to use 16 bytes instead of 36. Anyway yes, 16-byte field will always be slower in index then 4-byte INT, but not so dramatically that we need to pay attention. Smart decisions are about priorities, we lose very little, but get a lot!
Another important factor is tree formation. Basic data structure theory, which we study in university courses, tells us that each insert requires some time to find the correct place in a tree where we have to place a new ID (with a pointer to data row on disk) and tree balancing. Therefore, value also influences insert time. There is some simple research that proves it https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/. According to this post. to achieve the best results we have to use UUIDv1 because it has a timestamp part so index builders can faster find the correct place in a tree. So if you have no security limits about ID generation time and source, use v1. v4 will make your queries slower!
Also, the author suggested a good idea of rearranging some parts in v1 to achieve even better results. But this is not very necessary, as even native MySQL built-in UUID generator has standard v1 format: https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid
How to make network requests consume less traffic
It is easy to explain why standard UUID takes 36-chars for representing 16 bytes:
- They use 4 dashes, which are always in constant places and have no value.
- They use only hex characters (
To transmit such UUIDs over text protocols like HTTP/WS we are limited for ASCII chars, but we could use a larger alphabet. If we would use
0-9, a-z, A-Z, we could reduce length up to 22 chars, which would look like:
This is also something called a Short UUID format. Obviously, it is just about representation, so any UUID version could be represented as short. I would definitely recommend transmitting this version over the internet in your requests and responses and getting rid of 14 unused bytes.
This is a rare improvement. If you will check Trello and Slack requests, you will notice that they transmit plain 36-chars versions, but shortification is very time-cheap and could be easily implemented for the client and server. So why not use it?
- UUIDs solve the problem of centralized ID store, making it possible to work in offline and handle complex relations on frontend by making the frontend independent from the server which makes it super-responsive
- UUIDs allow you to implement shards for high-load scalable systems
- UUIDs make your IDs more secure by hiding records count
- To decrease insert time, use UUIDv1 as IDs system
- Store it in the database as 16 bytes
- To save traffic, transmit Shot UUIDs