How to develop a scalable, resilient and highly available URL shortener.
Greetings everyone!
In the world of software development, we often face an overwhelming amount of information. Every application has its complexities hidden beneath the surface, which can be difficult to understand upfront. When I started my career in development, I was mainly focused on learning programming languages and familiarizing myself with a specific tech stack. Over time, I began to realize the importance of understanding the inner workings of applications. This shift in perspective occurred as I observed the performance of my applications in real-world scenarios and recognized the need to dig deeper into the details. I realized that to date, I was only staring at the tip of the iceberg.
In this blog, my intent is to present to you my thought process while developing an application. I will try to guide you through the critical aspects of system designs, how to assess your application and finally, how to arrive at a happy showdown. This blog might seem boring/irrelevant to some of the readers because this is purely theory-based. Don't worry, I'll be coming back with the practical version of this blog soon. But if you are willing to understand the details, this would be the one-stop shop for you.
Okay then, let's begin!
Roadmap
Here is the list of things we will be doing in this blog:
Get introduced to URL Shorteners
Decide upon the implementation of the idea
Take a deep dive into its working
Figure out the working of the backend
If you have come this far, I'm sure you will enjoy this:")
What are URL Shorteners?
Have you ever gone to Facebook profile via the web? Did you ever try to share your LinkedIn profile? If yes, then you can certainly feel where this is going. If not, though, let me tell you.
Here is the link to my GitHub profile: github.com/rajdip-b
This link might look like it's easy to memorize. Well, it is. But, if we ramp up the URL, let's say I show you my LinkedIn URL, you would get the idea.
Here goes: https://www.linkedin.com/in/rajdip-bhattacharya-581119232
As you can see, it's no joke to remember this URL. This is where URL shorteners come into play. You can think of URL shortness as a mathematical function that takes a string as input and generates another string based on the input. Let's see this in action with an example:
Notice how the big URL gets shortened into a much smaller URL. Now that we know what a URL shortener does, let's see how it does it.
Setting our goals
To make sure that our application stands out in the crowd, first, we need to make sure that we have a clear conception of what we are trying to build. It is in this step that we outline our application.
Working overview
First, let's see how our application will function.
The above diagram sketches out the working of our URL Shortener. To reiterate,
The client sends a request to our server to generate a shortened version of a URL.
The server generates this URL, stores it in the DB (which will be covered in the next sections) and returns the shortened URL.
The client, later on, requests the shortened URL to our server.
Our server sends a `301 Permanently Moved` response along with the location.
The client gets redirected to the actual server (in here, google.com).
Note that in step 4, we are sending `301 Permanently Moved`. In case you need to brush up your knowledge, here is a great resource to learn more about HTTP status codes. There are two status codes that we use for redirecting:
301 Permanently Moved
302 Found
There is a subtle difference between the two:
301 states that the resource has moved permanently. Browsers are programmed to cache such responses so that when we make subsequent requests to this URL, we are redirected to the actual location instead of the former location.
302, on the other hand, merely redirects us. This means the browsers won't cache these responses, and we will always be hitting the URL that sends us the redirected location.
To summarize, considering that we have requested a server that gets us redirected, for the subsequent requests,
301: Browser -> Our Server -> Actual Location
302: Browser -> Actual Location
Where "Actual Location" can be thought of as the long URL.
Planning
Next stop, we will make some assumptions on the use case of our application. This can be considered the most crucial point in any application design because implementation and resource allocation highly depend upon the output of this state. So here they are:
Capacity: 100 million URLs generated per day
Uptime: Should be running for 1 year at least.
Data: URLs containing alphanumeric characters.
Data modification: Just writes are allowed.
From the given design, we can infer that our application would have more reads than the number of writes. Let's say, reads:writes would be 10:1
Calculations:
Writes per second: 100 mil / (24 * 60 * 60) = 1160
Read operations: 10*1160 (since 10:1 is read:write) = 11600
Number of records that needs to be stored: 100 mil * 365 = 36.5 billion records
Considering an average of 100 bytes per record, the storage required for 1 year: ~= 33 TB
From the above calculations, we can infer the following requirements:
A database that can scale at large. Hence, scalability is ensured.
The database should have great indexing support for decreasing delay.
Caching the most frequently accessed URLs would decrease delay.
Creating multiple servers and load balancing would make the application resilient.
Designing the heart of the URL Shortener
Now that we have established the requirements, we are ready to move forward. The next stop is deciding the function that will actually generate the shortened URL. Note that we will have to store 36.5 billion records, which means we will have 36.5 billion unique shortened URLs. Therefore, we need to find out the minimum size of the shortened URL. We know that we will be operating on alphanumeric input, which has a size of 62. Since our shortened URL would also contain alphanumeric, we will have 62 different characters in 1 position and 62^n different combinations, where n is the size of the URL. So if our URL is of size 4, then n=4. So the number of combinations becomes 62^4 = 1,47,76,336 ~= 14 million. Therefore, we need to find an n such that 62^n >= 36.5 billion. I'll spare you the calculation. The lowest n that satisfies this equation is 7. Therefore, our shortened URL would be size 7.
Next, I'll be discussing 2 approaches to solving this problem.
Hash Function
A hash function can be defined as a complex mathematical function that takes in an input (hashValue) and generates a hashOutput based on the input. SHA256, SHA512, MD5 are some renowned hashing algorithms to name a few.
The above diagram illustrates SHA256 converting a string into absolute gibberish. Behind the scenes, the hash function uses mathematical functions to map characters to certain calculated values. The output is very long and according to our requirements, we only need 7 characters. We can truncate the output to be 7 characters long, but then it will mean that we might have duplicate elements in our records. To mitigate this, we will use hash + collision detection.
The diagram illustrates how we will be generating our shortened URL. Note that the output `shortUrl` has 7 characters, and this is done by truncating the rest of the output.
Base 62 conversion
Base 62 is a commonly used encoding technique that uses 62 characters-long set. This set consists of all characters in the alphanumeric family. For generating a Base62 short URL, we follow the following steps:
Character Set Selection: We map all the alphanumeric characters to decimal numbers, as represented in the above table.
Converting to Decimal: The entire long URL is scanned through, and each character is replaced by its decimal representation
Conversion to Base62: The output of the previous step is continuously divided by 62 and the remainder obtained at each step is mapped back from the table. For example, at any step, we got a remainder of 20, then would write K, because the table maps alphanumeric K to decimal 20.
Padding and truncating: The output obtained from the above stage might not be always 7 characters long. If the output length is lesser than 7, pad it with 0s else, truncate it if it's greater than 7.
The steps involved are much similar to that of hashing, just that we don't use any hash function but encoding functions.
With our URL Shortener function defined, we are ready to move forward to the "web" aspects of our application.
Deciding the backend
Nearly any application is always backend-heavy. In our case, we need to figure out how we want to put the pieces together to make our URL shortener functional.
Firstly, we want to establish how our DB operates.
The records in the database would be of the format:
id: number (auto generated)
shortUrl: string (7)
longUrl: string
Next, comes the high-level architecture of the overall application.
The client requests the server requesting for the long URL for a short URL.
The load balancer receives the request and sends forwards it to one of its healthy servers in the target group.
The server first checks the cache. If it is present, it returns the short URL to the client (5).
If it is not present in the cache, the server queries the DB and fetches it. It stores it in the cache and sends it over to the client(5).
This is a 301 Moved Permanently response, that will ultimately make the browser go to the destined location.
Conclusion
With that, we've concluded this blog. While it delved into theory, it has provided you with a clear understanding of the inner workings of such applications. What's even better is that you're now equipped to put this knowledge into action and apply the thought process to your projects!
This blog might not be 100% accurate, so, if you see any window for improvements, you're always welcome!