Running CloudQuery in Parallel
Running multiple instances of cloudquery sync in parallel can be useful when a single sync is too slow, for example when syncing a large number of accounts or when fetching from large accounts.
Splitting Syncs Automatically
Starting with version v6.8.0 of the CloudQuery CLI, you can use the --shard flag to automatically split a sync into smaller parts that can be run in parallel.
For example, to split a sync into 4 parts, you can run:
cloudquery sync config.yml --shard 1/4
cloudquery sync config.yml --shard 2/4
cloudquery sync config.yml --shard 3/4
cloudquery sync config.yml --shard 4/4
The --shard flag automatically splits the sync into parts, gives each part a unique source name, and ensures that the parts don't overlap.
It's recommended to run the parts in parallel, since that is what makes the overall sync finish faster than a single unsharded sync.
You can find an example of how to run the syncs in parallel in the GitHub Actions Deployment Guide section.
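On a single machine, the four commands above can also be started together from a shell. The following is a minimal sketch, not an official script: run_shards is a hypothetical helper name, and CQ_BIN is a hypothetical variable introduced here only so the binary can be swapped out; it defaults to cloudquery. The helper starts one background process per shard, waits for all of them, and returns a non-zero status if any shard failed.

```shell
#!/bin/sh
# Sketch: run every shard of one sync in parallel and report failure if any shard fails.
# run_shards and CQ_BIN are illustrative names, not part of the CloudQuery CLI.
run_shards() {
  config="$1"   # path to the sync configuration, e.g. config.yml
  total="$2"    # number of shards, e.g. 4
  pids=""
  i=1
  while [ "$i" -le "$total" ]; do
    # Start shard i of total in the background and remember its process ID.
    "${CQ_BIN:-cloudquery}" sync "$config" --shard "$i/$total" &
    pids="$pids $!"
    i=$((i + 1))
  done
  status=0
  for pid in $pids; do
    # wait returns the exit status of that shard's process.
    wait "$pid" || status=1
  done
  return "$status"
}
```

With the function defined, run_shards config.yml 4 corresponds to the four commands shown earlier. Each shard still runs as its own process, so the machine needs enough CPU, memory, and API rate-limit headroom for all of them at once.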
Supported Source Integrations for Sharding
Source Integration | Minimum Version |
---|---|
AWS | v27.20.0 |
Azure | v14.8.0 |
GCP | v16.3.0 |
Splitting Syncs Manually
If you are using an older version of the CloudQuery CLI, or if you want to manually split a sync, you can do so by creating different configurations for each part of the sync, using the guidelines below.
Unique Names
Every source and destination integration configuration must have a unique name. This is required because the name is written into the database (_cq_source_name), and is later used to delete stale resources.
For instance, a configuration with multiple source integrations could look like:
kind: source
spec:
  name: aws1
  path: cloudquery/aws
  registry: cloudquery
  ...
---
kind: source
spec:
  name: aws2
  path: cloudquery/aws
  registry: cloudquery
  ...
---
kind: destination
spec:
  name: "postgresql"
  path: cloudquery/postgresql
  registry: cloudquery
  ...
If the names are not unique, then the different integrations may delete/overwrite each other's resources.
No Overlapping Syncs
When splitting a sync into multiple source integration configurations to be run in parallel, it is important that these syncs don't overlap: the sets of account/table/region combinations fetched by the different source integrations must not intersect.
For instance, in GCP, if the first source integration fetches resource A from project 1, the second source integration can fetch resource B from project 1, or resource A from project 2, but can never fetch resource A from project 1.
For another example, if the first source integration fetches from region europe-west1 in project 1, the second source integration can fetch from region europe-west1 in project 2, or from region europe-west2 in project 1, but can never fetch from region europe-west1 in project 1.
If the configurations overlap, the behavior is undefined, and the database may contain duplicate rows.
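One way to sanity-check a manual split before running it is to write down what each source integration is meant to fetch and look for repeated combinations. The sketch below assumes a made-up helper format: one text file per source integration, with one "project region resource" combination per line. CloudQuery does not read these files, and overlap is a hypothetical helper name; the script only illustrates the disjointness rule described above.

```shell
#!/bin/sh
# Sketch: list the combinations that appear in both scope files.
# The "project region resource" file format is an assumption made for this example.
overlap() {
  # De-duplicate each file on its own first, so that only lines
  # present in BOTH files are reported as duplicates by uniq -d.
  { sort -u "$1"; sort -u "$2"; } | sort | uniq -d
}

dir="$(mktemp -d)"
printf '%s\n' 'project-1 europe-west1 A' > "$dir/source1.txt"
printf '%s\n' 'project-1 europe-west1 B' 'project-2 europe-west1 A' > "$dir/source2.txt"

# Disjoint scopes: nothing is printed.
overlap "$dir/source1.txt" "$dir/source2.txt"

# Add a combination that the first source already covers.
printf '%s\n' 'project-1 europe-west1 A' >> "$dir/source2.txt"

# Overlapping scopes: the shared combination is printed.
overlap "$dir/source1.txt" "$dir/source2.txt"

rm -rf "$dir"
```

Empty output means the two listed scopes are disjoint. The check is only as accurate as the hand-written lists, so wildcard table selections in a real configuration would have to be expanded before comparing.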