Deploying CloudQuery using Kubernetes CronJobs

In this tutorial, we will set up a Kubernetes CronJob to run a CloudQuery sync on a regular schedule. Due to its flexibility and standardization across cloud providers, Kubernetes is commonly used by DevOps and platform engineers when deploying workloads and microservices.

Prerequisites

A CloudQuery API key (see the CloudQuery documentation for how to generate one)
A Kubernetes Cluster
A Postgres Database server
Credentials (with read privileges) for a supported API; in this example we’ll use a DigitalOcean account

Step 1: Create a Kubernetes Secret for your API Keys

To keep our credentials separate from our manifests, we need to create a Secret that we’ll use to supply our environment variables. We can do this in a single command:

kubectl create secret generic cloudquery-secret \
  --from-literal=CLOUDQUERY_API_KEY='<your_cloudquery_api_key>' \
  --from-literal=DIGITALOCEAN_TOKEN='<your_token>' \
  --from-literal=SPACES_ACCESS_KEY_ID='<your_access_key>' \
  --from-literal=SPACES_SECRET_ACCESS_KEY='<your_secret_key>' \
  --from-literal=PG_CONNECTION_STR='<your_postgres_connection_string>'
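
To confirm that the Secret holds every key the later steps rely on, the following sketch lists the key names (never the values) and reports any that are absent. The helper names secret_keys and missing_secret_keys are hypothetical, and the check assumes kubectl is already pointed at the namespace where the Secret was created.

```shell
# List the key names stored in a Secret, one per line, without printing values.
secret_keys() {
  kubectl get secret "$1" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
}

# Print each expected key that is absent from the Secret.
# Returns 0 if all keys exist, 1 if any is missing, 2 if the lookup failed.
missing_secret_keys() {
  secret_name=$1
  shift
  present=$(secret_keys "$secret_name") || return 2
  status=0
  for key in "$@"; do
    if ! printf '%s\n' "$present" | grep -Fqx -- "$key"; then
      echo "missing: $key"
      status=1
    fi
  done
  return $status
}

# Example:
#   missing_secret_keys cloudquery-secret CLOUDQUERY_API_KEY DIGITALOCEAN_TOKEN \
#     SPACES_ACCESS_KEY_ID SPACES_SECRET_ACCESS_KEY PG_CONNECTION_STR
```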

Step 2: Create the CronJob Manifest

A CronJob is a Kubernetes object that runs a Job on a schedule. It’s similar to other Kubernetes workload resources, except that it is intended for Jobs, whose Pods carry out a task and then shut down, rather than for long-lived Pods that run services.

The CronJob manifest has two key elements: the schedule and the jobTemplate.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cloudquery
  labels:
    app: cloudquery
spec:
  schedule: "0 0 * * *"
  jobTemplate: {}

The schedule expression follows the same format as crontab: a space-separated list of five fields in the order minute, hour, day of month, month, day of week. The expression in the example above runs the job at midnight (00:00) every day.
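
As a quick illustration of the field order, this sketch splits a schedule string into its five fields and labels each one. The function name explain_schedule is hypothetical; it only labels the fields and does not validate them.

```shell
# Split a cron schedule into its five fields and label each one.
explain_schedule() {
  # read splits on whitespace without expanding "*" as a filename glob
  read -r minute hour day_of_month month day_of_week <<EOF
$1
EOF
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$minute" "$hour" "$day_of_month" "$month" "$day_of_week"
}

explain_schedule "0 0 * * *"     # the schedule used above: 00:00 every day
explain_schedule "*/15 * * * *"  # every 15 minutes
explain_schedule "0 6 * * 1"     # 06:00 every Monday
```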

To learn more about cron schedule expressions, check out https://crontab.guru/ or the Kubernetes documentation https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/

The jobTemplate describes what to run, how to run it, and any volumes that it needs.

As this follows the same Pod template structure as other Kubernetes workloads, we won’t go too deep into the details here.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cloudquery
  labels:
    app: cloudquery
spec:
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cloudquery
              image: ghcr.io/cloudquery/cloudquery:latest
              imagePullPolicy: IfNotPresent
              args: ["sync", "/config/config.yml", "--log-console", "--log-format", "json"]
              envFrom:
                - secretRef:
                    name: cloudquery-secret
              volumeMounts:
              - name: config
                mountPath: "/config"
                readOnly: true
          restartPolicy: Never
          volumes:
          - name: config
            configMap:
              name: cloudquery-config
              items:
              - key: "config"
                path: "config.yml"

Looking at this completed manifest, there are three key elements to pay attention to: args, envFrom, and volumes.

The args are the arguments passed to CloudQuery, envFrom instructs Kubernetes to expose the keys of the Secret you created earlier as environment variables, and volumes (together with volumeMounts) tells Kubernetes where to find the configuration file: the config key of the cloudquery-config ConfigMap is mounted at /config/config.yml.

Step 3: Define the ConfigMap to hold the CloudQuery config

When using Kubernetes, configuration files are commonly stored in a special type of object called a ConfigMap. ConfigMaps enable you to define collections of text files that can be mounted as directories by Pods and Jobs.

A ConfigMap definition is relatively simple, containing a data object where files are defined as key-value pairs.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cloudquery-config
data: {}

In this case, you want to define a ConfigMap to store a CloudQuery configuration file that defines a source integration and a destination integration.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cloudquery-config
data:
  config: |
    kind: source
    spec:
      # Source spec section
      name: digitalocean
      path: cloudquery/digitalocean
      registry: cloudquery
      version: "v6.7.20"
      tables:
        - "digitalocean_droplets"
        - "digitalocean_databases"
        - "digitalocean_accounts"
        - "digitalocean_storage_volumes"
        - "digitalocean_floating_ips"
        - "digitalocean_firewalls"
        - "digitalocean_load_balancers"
        - "digitalocean_billing_history"
      destinations: ["postgresql"]
    ---
    kind: destination
    spec:
      name: "postgresql"
      path: "cloudquery/postgresql"
      registry: "cloudquery"
      version: "v8.12.1"
      spec:
        connection_string: ${PG_CONNECTION_STR}

In this example we’re using the DigitalOcean source integration and the Postgres destination integration. The CloudQuery documentation has more on building configuration files and on the other available integrations.
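
If you would rather keep the CloudQuery configuration in its own file, one option is to let kubectl generate the ConfigMap manifest from it. This is only a sketch: the function name configmap_from_file and the local file name config.yml are assumptions. The key must stay config so that it matches the items entry in the CronJob manifest.

```shell
# Print a ConfigMap manifest named cloudquery-config whose "config" key
# holds the contents of the given file. Nothing is created in the cluster.
configmap_from_file() {
  kubectl create configmap cloudquery-config \
    --from-file=config="$1" \
    --dry-run=client -o yaml
}

# Example: configmap_from_file config.yml > configmap.yaml
```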

Step 4: Apply the manifest

Save the CronJob manifest as cronjob.yaml and the ConfigMap as configmap.yaml, then apply both from a terminal using kubectl apply:

kubectl apply -f cronjob.yaml -f configmap.yaml
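
A more cautious variant is to run a client-side dry run first and confirm the CronJob afterwards. The sketch below assumes the file names used in the command above; the function name apply_cloudquery is hypothetical.

```shell
# Dry-run both manifests without persisting anything, apply them for real,
# then show the resulting CronJob. Stops at the first step that fails.
apply_cloudquery() {
  kubectl apply --dry-run=client -f cronjob.yaml -f configmap.yaml &&
  kubectl apply -f cronjob.yaml -f configmap.yaml &&
  kubectl get cronjob cloudquery
}

# Example: apply_cloudquery
```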

Step 5: Query the data

You can manually trigger the CronJob to run ahead of its schedule using:

kubectl create job --from=cronjob/<name of cronjob> <name of job>
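
Putting this together, the sketch below triggers a one-off run, waits for it to finish, and prints its logs. The function name run_sync_now, the default job name, and the 10-minute timeout are assumptions. Note that if the job fails, kubectl wait only returns once the timeout expires.

```shell
# Trigger a one-off Job from a CronJob, wait for completion, print the logs.
#   $1: CronJob name (defaults to "cloudquery", the name used in the manifest)
#   $2: Job name (defaults to a timestamped name)
run_sync_now() {
  cronjob=${1:-cloudquery}
  job=${2:-"${cronjob}-manual-$(date +%s)"}
  kubectl create job --from="cronjob/${cronjob}" "$job" &&
  kubectl wait --for=condition=complete "job/${job}" --timeout=10m &&
  kubectl logs "job/${job}"
}

# Example: run_sync_now cloudquery cloudquery-manual-1
```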

Once the job has completed, you can query your PostgreSQL database to see the results.
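
For example, assuming the psql client is installed and PG_CONNECTION_STR is exported in your shell with the same value stored in the Secret, a row count per synced table gives a quick sanity check. The function name count_rows is hypothetical; the table names come from the ConfigMap above.

```shell
# Print the number of rows in one synced table.
# The table name is interpolated into the SQL, so pass trusted names only.
count_rows() {
  psql "$PG_CONNECTION_STR" -At -c "SELECT count(*) FROM $1;"
}

# Example:
#   count_rows digitalocean_droplets
#   count_rows digitalocean_firewalls
```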

Summary

In this tutorial, we have seen how to set up CloudQuery on Kubernetes and sync data to a PostgreSQL database. If you have any questions, check out the video above or join our Community to chat with us!
