DuckDb Guide


DuckDb is an in-process SQL database. That's a mouthful but really matters is that it is a great way to query data locally on your machine. It's fast, feature rich and has a number of plugins that make it easy to pull various sources of data, from csv files, parquet files, to other databases like Postgres. In this guide, we'll show you how to get up and running with DuckDb locally with quary and give you the basics to understand the DuckDB template. We'll show you how to add data, transform it and test it.

Installation

DuckDb is bundled into Quary, so you don't need to install anything else other than Quary. If you haven't already, download Quary and install it.

Once you have Quary installed, both as a CLI and as VSCode extension, you can start using DuckDb with its full capability.

Demo Project

If you want to look at the DuckDb template, you can find it here. You can also use the template by running the following command in a new folder:

quary init --type=duckdb

This will create a new project with the DuckDb template. You can then open the project in VSCode and start working with it. Rather than do that though, this guide will walk you through the basics of DuckDb.

Starting From Scratch

Configuring DuckDb

To start from scratch, create a new folder. In that folder, let's create the config file for DuckDb. Create a new file called quary.yaml and add the following content:

duckdb:
  path: data.db

This tells Quary to use DuckDb and to store the database in a file called data.db. You can change the path to whatever you want.

Adding source data

Now that we have the config file, let's add some data. Create a new file called people.csv, place it in a folder called data and add the following content:

id,name,age
1,Alice,30
2,Bob,40
3,Charlie,50
4,Dave,45
5,Eve,35

This is a simple CSV file with three columns: id, name, and age. We'll use this file as the data source in DuckDb. To do this, create a folder called models and add a new file called sources.yaml with the following content:

sources:
  - name: people
    path: '"data/people.csv"'
    columns:
      - name: id
        tests:
          - type: unique
          - type: not_null
      - name: name
        tests:
          - type: not_null
      - name: age
        tests:
          - type: not_null
          - type: gte
            info:
              value: 0

The above file names the source people and tells Quary to use the people.csv file as the source. It also specifies the columns and their tests. The id column should be unique and not null, the name column should not be null, and the age column should not be null and greater than or equal to 0.

From the root of the project, run the following command. This will run the tests on the source and ensure that the data is valid.

quary test

At this point, your file structure should look like this and if that is the case you now have a source that you can use in DuckDb to start transforming data.

.
├── data
│   └── people.csv
├── models
│   └── sources.yaml
└── quary.yaml

Transforming data

Now that we have the source, let's query the data. Create a new file called age_distribution.sql in the models folder. In this file, we will write a SQL query to get the count of the number of people in each age group (1-10, 11-20, 21-30, etc.).

SELECT
  CASE
    WHEN age BETWEEN 1 AND 10 THEN '1-10'
    WHEN age BETWEEN 11 AND 20 THEN '11-20'
    WHEN age BETWEEN 21 AND 30 THEN '21-30'
    WHEN age BETWEEN 31 AND 40 THEN '31-40'
    WHEN age BETWEEN 41 AND 50 THEN '41-50'
    ELSE '51+'
  END AS age_group,
  COUNT(*) AS count
FROM q.people

That query groups the ages into different buckets and counts the number of people in each bucket. The q.people refers to the people source that we defined earlier. The name for a source is always q.<source_name> where the source name is the name of the source in the sources.yaml file. To build this query as a view in DuckDb, run the following command:

quary build

This will create the structure in the data.db file. You can now query the data using the duckdb CLI for example.

At this point, your file structure should look like this.

.
├── data
│   └── people.csv
├── models
│   └── sources.yaml
│   └── age_distribution.sql
└── quary.yaml

Testing transformation

In addition to building the view, you can also test the transformation. Create a new file called schema.yaml in the models folder and add the following content:

models:
- name: age_distribution
  columns:
  - name: age_group
    tests:
      - type: not_null
  - name: count
    tests:
      - type: not_null
      - type: gte
        info:
          value: "0"

This file specifies the tests for the age_distribution model. The age_group column should not be null and the count column should not be null and greater than or equal to 0. To run the tests, run the following command:

quary test

Note that the name of the model is inferred from the file name. The file name should be the name of the model with the extension .sql. Just like you can refer to a source using q.<source_name>, you can refer to a model using q.<model_name>.