Duckdb Basic
Overview
This post covers the basic concepts of DuckDB and the reasons behind its popularity. It is based on my review of DuckDB blog post, “Why DuckDB?”.
Why DuckDB?
DuckDB is fast
DuckDB is an OLAP-centered database using SQL. OLAP stands for Online Analytical Processing, which focuses on data analysis and data science rather than just inserting or deleting records. It excels at analyzing databases through operations like aggregation and filtering.
People choose DuckDB for OLAP primarily because it is incredibly fast. But how exactly does it achieve this speed? These are what I have understood so far.
Columnar Storage
DuckDB stores data column-wise rather than the row-wise format typically observed in traditional table storage. Usually, when we analyze a database, we apply aggregations like:
SUM(Salary)
This means we only need to consider a single column rather than entire rows. Column-wise storage directly supports this feature of OLAP, making the whole process much faster.
DuckDB has no server
Traditional databases like PostgreSQL always require a server to be running. When you download PostgreSQL and start the program, it asks for a port and a server address (such as localhost). This is for several reasons, one of which is concurrency control, allowing multiple people to access the data at the same time. Another reason is the ability to use high-performance computers or the cloud as servers, as handling many transactions simultaneously is heavy computing work that a typical laptop might not manage. In this setup, the server handles the work, and we act as the client.
However, since DuckDB is an OLAP database, we often don’t need this server because we are less concerned with heavy concurrency control. One thing to note is that this also means DuckDB isn’t primarily built for BIG data in the same way because it doesn’t use a server, it depends on the user’s local environment. To address this, DuckDB developers provide cool algorithms like Out-of-core Processing and leverage MotherDuck, the cloud version of DuckDB. Of course, for massive amounts of data, Apache Spark, Google BigQuery, or Snowflake are still used for OLAP jobs.
Convenient file process
My experience is heavily centered on OLTP, such as PostgreSQL. In that environment, we usually had to load a CSV file into the database first and then run SQL queries. However, with DuckDB, users can easily run SQL queries directly on CSV files without moving them into the database.
Next Steps
- Execute simple queries over DuckDB in Python
- Deep dive into the core algorithms (e.g., Vectorized Execution, Out-of-core Processing)