
Building Local and Serverless ELT (ETL) pipelines with CloudQuery and MotherDuck

Yevgeny Pats

CloudQuery is an open-source, high-performance ELT framework powered by Apache Arrow, and DuckDB is an open-source in-process OLAP database. In this blog post we will walk through building an ELT pipeline using CloudQuery and DuckDB locally, and then run the same configuration and pipeline with CloudQuery Cloud and MotherDuck (the cloud version of DuckDB).
One cool thing about CloudQuery and DuckDB is that both are very easy to run locally, which makes it easy to develop and debug before running in the cloud. This shortens feedback loops and avoids surging cloud costs for compute, data, and egress, especially during development.

Setting Up Local Pipeline

Setup & Prerequisites

  1. CloudQuery is a single binary CLI that you can download and install following these instructions.
  2. DuckDB is also a single binary and can be downloaded and installed following these instructions.
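For example, on macOS both can be installed with Homebrew (assuming you use Homebrew; the linked instructions cover other platforms):
brew install cloudquery/tap/cloudquery
brew install duckdb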

Syncing Data from HackerNews API to DuckDB

Now, let’s sync this year’s HackerNews stories and comments to DuckDB. To do that we will need to use the CloudQuery Hacker News Source and the DuckDB destination. The configuration for CloudQuery will look like the following:
# config.yml
kind: source
spec:
  name: "hackernews"
  path: "cloudquery/hackernews"
  registry: "cloudquery"
  version: "v3.0.21"
  tables: ["*"]
  destinations:
    - "duckdb"
  spec:
    start_time: "2024-01-01T00:00:00Z"
---

kind: destination
spec:
  name: duckdb
  path: cloudquery/duckdb
  registry: cloudquery
  version: "v5.1.0"
  write_mode: "overwrite-delete-stale"
  spec:
    connection_string: ./example.db
Now that we have the CloudQuery CLI installed and this simple config file in place, we can run the following:
cloudquery login
cloudquery sync ./config.yml
This will start syncing all the items from the start of 2024 to a local DuckDB database. It can take a while, so feel free to interrupt it midway; you will already have data to play with.
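Once the sync finishes (or after you interrupt it), you can quickly check how many rows made it in (the hackernews_items table is created by the source plugin):
duckdb ./example.db "SELECT count(*) FROM hackernews_items;"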
Now you can open up DuckDB and run the following (the total size of the database is about 512MB at the time of writing):
SELECT
    title, score
FROM
    hackernews_items
WHERE
    parent = 0 AND
    (title ILIKE '%open source%' OR
     title ILIKE '%open-source%')
ORDER BY score DESC LIMIT 20;
This will show something similar to the following:
Result 1: Screenshot of the query result in two columns (title and score). The top entry is "Zed, a collaborative code editor, is now open-source" with a score of 1576.
These are the top stories with "open source" or "open-source" in their title (however you prefer to spell it). We can definitely see that HN loves open source!
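If you are curious which spelling wins, a quick aggregation over the same table can compare them (a sketch using DuckDB's FILTER clause):
SELECT
    count(*) FILTER (WHERE title ILIKE '%open source%') AS with_space,
    count(*) FILTER (WHERE title ILIKE '%open-source%') AS with_hyphen
FROM
    hackernews_items
WHERE
    parent = 0;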
Now let’s do something a bit more interesting and run the below query:
WITH lists_split_into_rows AS (
    SELECT
        score,
        unnest(string_split(lower(title), ' ')) AS word
    FROM
        hackernews_items
)
SELECT
    word,
    sum(score) AS score_weight
FROM
    lists_split_into_rows
WHERE
    word NOT IN ('to', 'the', 'of', 'in', 'for', 'and', 'with', 'on', 'a', 'from', 'are', '', 'is', 'ask', 'how', 'what', 'why', 'by', 'an', 'as', 'your', 'at', 'you', 'it', 'over', 'using', 'into')
GROUP BY
    word
ORDER BY score_weight DESC LIMIT 100;
This will return a list of words with score_weight, which represents the sum of the scores of all stories whose titles contain that word. If a story gains traction, the words in its title gain more weight.
Result 2: Screenshot of the query result in two columns (word and score_weight). The top 10 entries are "HN:", "-", "show", "i", "ai", "new", "that", "my", "has", "apple".
This query could use more fine-tuning, but even now we can see that if you want to get noticed by the HN community, using words such as "AI", "LLM", "737", "open-source", or "free" will no doubt increase your odds of making the front page!
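One simple refinement is to keep only purely alphabetic tokens, which drops entries like "hn:" and "-" without growing the stop-word list by hand (a sketch using DuckDB's regexp_matches function):
WITH lists_split_into_rows AS (
    SELECT
        score,
        unnest(string_split(lower(title), ' ')) AS word
    FROM
        hackernews_items
)
SELECT
    word,
    sum(score) AS score_weight
FROM
    lists_split_into_rows
WHERE
    regexp_matches(word, '^[a-z]+$') -- keep only purely alphabetic tokens
    AND word NOT IN ('to', 'the', 'of', 'in', 'for', 'and', 'with', 'on', 'a', 'from', 'are', 'is', 'ask', 'how', 'what', 'why', 'by', 'an', 'as', 'your', 'at', 'you', 'it', 'over', 'using', 'into')
GROUP BY
    word
ORDER BY score_weight DESC LIMIT 100;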
When the queries are ready, and you want to move them to your production database where other users and downstream subsystems can take advantage of the data, you can move to the next section and use the same queries and ELT configuration!

Setting Up Managed / Serverless Pipeline

Setup & Prerequisites

Now let’s see how we can set up the same thing as a cloud pipeline to run periodically.
  1. Sign up for CloudQuery Cloud

Syncing Data from HackerNews API to DuckDB

Once logged in to CloudQuery Cloud, click New Sync in the top right.
The Destination Plugin
This will load the Configure Destination screen. From the Select destination plugin dropdown, choose DuckDB. You can leave the Select destination plugin version and Destination Name fields alone as the defaults will be fine.
The Destination Spec field for MotherDuck should look like the following.
kind: destination
spec:
  name: duckdb
  path: cloudquery/duckdb
  registry: cloudquery
  version: "v5.1.0"
  write_mode: "overwrite-delete-stale"
  spec:
    connection_string: "md:hackernews?motherduck_token=${MOTHERDUCK_TOKEN}"
However, you will need to replace the md:hackernews portion of the connection_string shown here with the name you want to use for your MotherDuck database (as described in authenticating to MotherDuck).
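For example, if you wanted the data to land in a MotherDuck database called hn_data (a hypothetical name), the last line of the spec would become:
    connection_string: "md:hn_data?motherduck_token=${MOTHERDUCK_TOKEN}"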
In this case, you'll notice that we're using CloudQuery's secret injection to securely store the MOTHERDUCK_TOKEN and inject it at runtime. When you selected the DuckDB destination plugin, a secret for this key was automatically added to the secret storage, so all you need to do is paste your MotherDuck service token into the Value field.
With that done, click Next.
The Source Plugin
This will progress to the Configure Source screen. From the Select source plugin dropdown, select HackerNews. In this case, all the default values are acceptable, so click Continue.
Schedule and Resources
This will progress to the New Sync screen. Here, you can change the name of the sync, alter the schedule, and set the resource limits. Again, in this case the default values are acceptable, so click Save and run.

Accessing the data

Now you should be able to see the running sync, and over in the MotherDuck UI, the data should be arriving.
Result 3: Screenshot of the MotherDuck UI running one of the queries used earlier.
The other cool part is that you can also connect to MotherDuck from the local shell. By running .open md:blog you can query both local and remote datasets from the same session: the local hackernews_items and the remote hackernews.main.hackernews_items will both be accessible!
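For example, from a local DuckDB shell (a sketch; it assumes your MotherDuck token is available, e.g. via the motherduck_token environment variable, and that the sync above populated the hackernews database):
.open md:blog
-- query the remote table that CloudQuery Cloud synced
SELECT count(*) FROM hackernews.main.hackernews_items;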

Closing thoughts

For a long time, data engineering suffered from local development environments that couldn't easily be deployed to the cloud with the same configuration and query language. DuckDB, MotherDuck, and CloudQuery bring those capabilities to you!
 