Many of us are familiar with databases like Postgres or MySQL, which fall under the category of Online Transaction Processing (OLTP) databases. These databases are designed for high-concurrency, low-latency operations, typically handling simple queries that involve reading or updating a few records at a time. It's important to note the emphasis on "few records," as we'll reference this throughout the article. NoSQL databases, like MongoDB, also belong to this class.
On the other hand, we have Online Analytical Processing (OLAP) databases, which are tailored for tasks such as reporting and analysis. They often deal with complex queries, including aggregates and joins, that require high-throughput scans over many records.
While OLTP databases can support basic analytics functionalities (such as aggregation, groupby, and window functions), they are not optimized for such tasks. Here are a few reasons why using OLTP databases for analytics may not be ideal:
Since OLTP and OLAP databases serve different purposes, creating a single database that can effectively handle both types of workloads is not feasible. Instead, companies develop independent databases for each use case. Choosing between OLTP and OLAP is not an "either-or" problem; most companies utilize both types of databases. OLTP databases are commonly used as application databases, while data from various OLTP databases is aggregated in a single OLAP database. Although the same data exists in two places, it's stored in different formats and in databases optimized for different workloads.
OLAP databases, being scalable, can easily store data from numerous OLTP databases without breaking a sweat. With this setup, your analytical workloads don’t hurt the performance of your OLTP databases. Another advantage is that both OLAP and OLTP can be optimized independently for their specific workloads.
Now that you understand “Why OLAP?”, let's try to answer “What?”.
The primary requirement for an OLAP database is scalability. To address the storage challenge, HDFS (Hadoop Distributed File System) was introduced. It functions like any other file system but is distributed across a cluster of machines. When executing a query, OLAP fetches records from HDFS, processes them, and returns the results. Results can also be written back to HDFS. It’s noteworthy that even with billions of records stored across thousands of machines, a small machine can execute simple operations; the size of the machine affects only the query execution time.
This leads to an important distinction: unlike OLTP databases, OLAP databases decouple compute and storage. This allows each system to scale independently as needed by the workload. For instance, if no queries are running, compute can scale to zero, and you only pay for storage. Here is day-to-day scenario to exemplifies decoupling of storage and compute. Imagine all your data stored in an S3 bucket; to train a ML model, you acquire a GPU machine on Paperspace, train your model on the data in the S3 bucket, push the trained model back to the S3 bucket, and then terminate the Paperspace instance. Even after terminating the compute, your data still persists, and you can access it from any other instance.
<aside> 🚨 Imagine you have a server running PostgreSQL on a remote machine, and you access it from another machine via a client. In this scenario, when a client sends a request to the server, both the computation and data processing occur on the same device. This means that even when there are no requests, you can't scale down the server to zero without losing access to your data. Unlike in a decoupled compute and storage system, where these processes are separate, here they are tightly intertwined.
</aside>
Various organizations build their OLAP databases by independently choosing:
This flexibility—the ability to select the storage system and processing engine best suited to the workload—is a significant advantage of OLAP databases over OLTP databases. Managed solutions like Google’s BigQuery, Azure Synapse, AWS Redshift, Snowflake, Databrick’s Data Intelligence Platform, among others, simplify the complexity.
To illustrate the decoupled compute and storage, check the pricing pages for Google’s BigQuery, Snowflake, or other data warehousing solutions, where both storage and compute are billed separately.