Using DuckDB for not-so-big data in Glean

Big Data is cool - and so are infinitely scalable Data Warehouses. But sometimes, you just have not-so-big data. Or maybe you have a .csv file and you don’t even know what a Data Warehouse is, or don’t care about the “Modern Data Stack” - like, what is that? That shouldn’t stop you from exploring your data in an intuitive, visual way in Hashboard.

There's lots of hype about DuckDB and for good reason: it's a fun tool! But it also has some tangible benefits for the data ecosystem. For Hashboard there a few features that made it the obvious choice for small data:

Improved accessibility - no external database required
Isolation and testing
Portable computation

For those that don't know, DuckDB is a columnar in-process database engine that can read a variety of data formats and works especially well with Apache Arrow, which makes it nice for quickly writing fast sql queries on medium-sized data.

Making data tools more accessible

All the technical features are neat-o for engineers / data nerds - but more importantly for us, we saw DuckDB as an opportunity to improve the accessibility of Hashboard. Before we integrated DuckDB, you had to have a database server to use Hashboard (something like Snowflake or Postgres). This means if you had data you wanted to analyze, you also had to figure out how to get it INTO that database. With our DuckDB integration, you can now upload CSV files and don't need any external database (or upload json, tsv, parquet files etc.).

Hashboard's whole reason for existing is to make data more accessible and useful and allowing file uploads and our DuckDB integration fits in with that perfectly.

To our surprise, DuckDB even increased our own usage of Hashboard. I didn't realize how often I had a loose spreadsheet that I wanted to quickly pivot and trellis in Hashboard but found myself using excel instead. Excel is great too, but it was more fun to track bug-bash progress in Hashboard without having to use FiveTran and the rest of the Modern Data Stack™.

One final note that DuckDB has a pretty awesome SQL dialect that is designed for analysis. So this has been fun to use with random lightweight datasets (and public datasets from Kaggle).

Isolation and testing

More on the developer / internal side - it's made Hashboard higher quality by improving our system's isolation in testing. Hashboard is built on the Modern Data Stack™ - we buy into having lots of standardized logic in a database like Snowflake. Fun! Except, this sorta sucked when you don't want to depend on an external service.

Previously all integration tests would run against databases that were external to Hashboard. Since we run DuckDB ourselves, we can now run more integration tests without relying on an external process or database.

Portable computation

The first iteration of our DuckDB integration just uses a process / service in our backend to process files in DuckDB and run sql on them.

Hashboard natively uses Apache Arrow as a serialization format so that we can do data processing on both the backend and frontend natively. This lends itself well to potentially bringing DuckDB into our client as a next iteration and not having to ship data to our servers at all in order to explore it.‍

Using DuckDB for not-so-big data in Hashboard