Deploying CloudQuery to a Google Cloud Virtual Machine (VM)
In this tutorial we will install CloudQuery on a Google Cloud Virtual Machine (VM). We will then set up a cron schedule to run a regular CloudQuery sync. Running CloudQuery on a VM like this is one of the simplest ways to get started with CloudQuery in the Google Cloud ecosystem.
Prerequisites
- A CloudQuery API key. More information on generating an API Key can be found here.
- A Google Cloud account. You can sign up here.
- A Google Cloud Project with billing enabled.
- The following APIs enabled on the project, at a minimum:
- Compute Engine API
- Cloud Logging API
- Cloud SQL Admin API
- Cloud Resource Manager API
You can enable these APIs by going to the APIs & Services section of the Google Cloud Console and clicking on Enable APIs and Services.
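If you prefer the command line, the same four APIs can likely be enabled with the gcloud CLI. The following is a minimal sketch, assuming gcloud is installed and authenticated. The function name enable_required_apis and the project ID my-project are placeholders introduced here for illustration; the command is wrapped in a function so that pasting the block does not run anything until you call it.

```shell
# Enable the four APIs listed above on one project.
# $1 = project ID (placeholder; substitute your own).
enable_required_apis() {
  gcloud services enable \
    compute.googleapis.com \
    logging.googleapis.com \
    sqladmin.googleapis.com \
    cloudresourcemanager.googleapis.com \
    --project="$1"
}

# Example call (uncomment to run):
# enable_required_apis "my-project"
```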
Step 1: Create a Google Cloud Virtual Machine
- Go to the Google Cloud Console and select your project.
- Click on the hamburger menu in the top left corner and select Compute Engine.
- Click on Create Instance.
- Give your instance a name and select a region and zone.
- Select a machine type. We recommend using a machine with at least 2 vCPUs and 8GB of memory. The required size will vary depending on the size of the dataset you are syncing and how you configure the concurrency of the sync.
- Under Identity and API access, select Allow full access to all Cloud APIs. This will allow the GCP source integration to access the APIs it needs to list all GCP projects in your account. Note that this permission is distinct from the permissions we will later grant to the service account for this VM. You can also restrict this to only the APIs you need, but this is not covered in this tutorial.
- Create the instance
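The console steps above can also be approximated from the command line. This is a sketch rather than the tutorial's official method: it assumes the gcloud CLI is installed and authenticated, and the function name and its arguments are placeholders. The e2-standard-2 machine type is one choice that meets the 2 vCPU / 8GB recommendation.

```shell
# Create a VM roughly matching the console steps above.
# $1 = instance name, $2 = zone, $3 = project ID (all placeholders).
create_cloudquery_vm() {
  # e2-standard-2 provides 2 vCPUs and 8 GB of memory.
  # --scopes=cloud-platform corresponds to the console option
  # "Allow full access to all Cloud APIs".
  gcloud compute instances create "$1" \
    --zone="$2" \
    --project="$3" \
    --machine-type=e2-standard-2 \
    --scopes=cloud-platform
}

# Example call (uncomment to run):
# create_cloudquery_vm "cloudquery-vm" "us-central1-a" "my-project"
```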
Step 2: Install CloudQuery
- SSH into your instance. You can do this by clicking on the SSH button next to your instance in the Google Cloud Console.
- Download CloudQuery by running the following command from the Quickstart guide for Linux:
curl -L https://github.com/cloudquery/cloudquery/releases/download/cli-v6.12.7/cloudquery_linux_amd64 -o cloudquery
chmod a+x cloudquery
Step 3: Create a Cloud SQL Instance
- Click on the hamburger menu in the top left corner and select SQL.
- Click on Create Instance.
- Select PostgreSQL. (Or another database if you prefer--CloudQuery supports many destinations. You will just need to then modify your destination configuration accordingly later.)
- Give your instance a name and select a region and zone.
- Select a machine type. We recommend using a machine with at least 2 vCPUs and 8GB of memory. The required size will vary depending on the size of the dataset you are syncing and how you configure the concurrency of the sync.
- Under Networking, select Private IP. If you later want to use Cloud SQL console or access the database from a service outside of your GCP VPC network (like Grafana or Superset), then you can also add a public IP. For a sync to complete, we only need a Private IP.
- Create the instance
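For reference, a roughly equivalent instance could be created with the gcloud CLI. Treat this as a sketch under several assumptions that are not part of the tutorial: private services access is already configured on the VPC network named default, PostgreSQL 15 is an acceptable version, and the function name and arguments are placeholders.

```shell
# Create a private-IP-only PostgreSQL instance with 2 vCPUs and 8 GiB of memory.
# $1 = instance name, $2 = region, $3 = project ID (all placeholders).
create_cloudquery_sql_instance() {
  # Assumes private services access already exists on the "default" network.
  # --no-assign-ip leaves the instance without a public IP address.
  gcloud sql instances create "$1" \
    --project="$3" \
    --region="$2" \
    --database-version=POSTGRES_15 \
    --cpu=2 \
    --memory=8GiB \
    --network="projects/$3/global/networks/default" \
    --no-assign-ip
}

# Example call (uncomment to run):
# create_cloudquery_sql_instance "cloudquery-db" "us-central1" "my-project"
```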
Step 4: Add a database and user
- Click on your newly created Cloud SQL instance from the Cloud SQL page.
- Click on Databases.
- Click on Create Database.
- Give your database a name, like cloudquery.
- Click on Users.
- Click on Create User Account.
- Give your user a name and password. The user name can again be cloudquery, with a password of your choice.
- Click on Create.
It is also possible to use Cloud IAM for authentication, instead of a password, but this is not covered in this tutorial. In short, you will need to run Cloud SQL Proxy on your VM and configure CloudQuery to connect to the proxy.
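The database and user from this step can also be created with the gcloud CLI. A minimal sketch, assuming gcloud is installed and authenticated; the function name is a placeholder, and the instance name is whatever you chose in Step 3. Note that a password passed on the command line may end up in your shell history.

```shell
# Create the "cloudquery" database and user on an existing Cloud SQL instance.
# $1 = Cloud SQL instance name, $2 = password for the new user (placeholders).
create_cloudquery_db_and_user() {
  gcloud sql databases create cloudquery --instance="$1" &&
    gcloud sql users create cloudquery --instance="$1" --password="$2"
}

# Example call (uncomment to run):
# create_cloudquery_db_and_user "cloudquery-db" "<PASSWORD>"
```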
Step 5: Configure CloudQuery
- Back in the SSH session connected to the VM instance, create a new file called config.yaml in the same directory as the CloudQuery binary.
- Add the following contents to the file, replacing the values with your own:
kind: source
spec:
  # Source spec section
  name: "gcp"
  path: "cloudquery/gcp"
  registry: "cloudquery"
  version: "v18.0.1"
  tables: ["gcp_storage_buckets"] # Add more tables here if you want to sync more data
  destinations: ["postgresql"]
  spec:
    # GCP Spec section
    # project_ids: ["my-project"]
---
kind: destination
spec:
  name: "postgresql"
  path: "cloudquery/postgresql"
  registry: "cloudquery"
  version: "v8.7.7"
  spec:
    connection_string: "postgresql://cloudquery:<PASSWORD>@<PRIVATE_IP>:5432/cloudquery"
If you are using a database other than PostgreSQL, you will need to modify the destination section of the configuration file accordingly. See the destinations section of the docs for more information.
Similarly, if you are using a different source, follow the documentation for your source integration to get the correct configuration.
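One detail worth checking when filling in connection_string: if the password contains characters such as @, : or /, they generally need to be percent-encoded for the URI to parse correctly. The helper below is a sketch in plain POSIX shell, not part of CloudQuery; the function names urlencode and pg_connection_string are hypothetical, and only ASCII passwords are assumed.

```shell
# Percent-encode a string for use in the user/password part of a URI.
# The body runs in a subshell so the locale change stays local.
# ASCII input is assumed.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character
    case $c in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;                 # unreserved: keep as-is
      *) out="$out$(printf '%%%02X' "'$c")" ;;         # everything else: %XX
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# Build the connection string used in config.yaml.
# $1 = user, $2 = password, $3 = host (private IP), $4 = database.
pg_connection_string() {
  printf 'postgresql://%s:%s@%s:5432/%s\n' "$1" "$(urlencode "$2")" "$3" "$4"
}

# Example:
# pg_connection_string cloudquery 'p@ss' 10.0.0.5 cloudquery
# prints: postgresql://cloudquery:p%40ss@10.0.0.5:5432/cloudquery
```

Paste the printed value into the connection_string field of the destination spec.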
Step 6: Add IAM roles
- Click on the hamburger menu in the top left corner and select IAM & Admin.
- Click on IAM.
- Find the service principal that is running your VM instance. You can find this information on the VM details page: it will typically be in the format 123456789012-compute@developer.gserviceaccount.com.
- Click Edit principal (the pencil icon) next to the service account.
- Add the following roles to the service account that is running your VM instance.
- Viewer: This will allow the GCP integration to read GCP resources from your projects
- Browser: This is necessary to allow the GCP integration to list all projects in your account.
These permissions assume you are using the GCP source integration; they may be too broad for your use case or for other integrations. Following the principle of least privilege, you should make the scope as small as possible for your needs.
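The two role grants from this step can also be scripted. A minimal sketch, assuming the gcloud CLI is installed and that you are allowed to change IAM policy on the project; the function name and both arguments are placeholders. roles/viewer and roles/browser are the identifiers of the Viewer and Browser roles.

```shell
# Grant Viewer and Browser to the VM's service account on one project.
# $1 = project ID, $2 = service account email of the VM (placeholders).
grant_cloudquery_roles() {
  for role in roles/viewer roles/browser; do
    gcloud projects add-iam-policy-binding "$1" \
      --member="serviceAccount:$2" \
      --role="$role" || return 1
  done
}

# Example call (uncomment to run):
# grant_cloudquery_roles "my-project" "123456789012-compute@developer.gserviceaccount.com"
```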
Step 7: Run a Sync
- Back in the SSH session connected to the VM instance, run the following command to start the sync:
CLOUDQUERY_API_KEY=<your-api-key-value> ./cloudquery sync config.yaml
- You should see a successful sync! If not, check the CLI output and cloudquery.log for errors.
- You can now fine-tune your configuration file: you may want to try tables: ["*"] to sync all tables at least once and get a complete asset inventory. Be aware, however, that this can take a long time on large accounts. We also don't recommend using * in production: when you perform version upgrades, new tables will automatically get synced, which may cause unexpected issues. In general, we recommend only syncing the tables you need.
For tips on performance-tuning, see the performance tuning guide.
Step 8: Query the data
We can now query the data in our database. We will use the Cloud SQL interface on the Google Cloud console, but you can also use any other tool that supports PostgreSQL.
- Click on the hamburger menu in the top left corner and select SQL.
- Click on your database.
- Click on Open Cloud Shell.
- Edit the command if necessary, then press Enter to execute it in Cloud Shell.
- Run any SQL query you wish. For example, the following query lists all the storage buckets in your GCP account:
SELECT * FROM gcp_storage_buckets;
Step 9: Set up a cron schedule
Option 1: Basic cron schedule
Now that we have a working sync, we can set up a cron schedule to run it automatically.
- SSH into your instance. You can do this by clicking on the SSH button next to your instance in the Google Cloud Console.
- Run the following command to open the crontab editor:
crontab -e
- Add the following line to the crontab file, replacing the path with the path to your CloudQuery binary and configuration file:
0 1 * * * /path/to/cloudquery sync /path/to/config.yaml
- Save the file and exit the editor. The sync will now run every day at 1 a.m.
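Cron runs jobs with a minimal environment, so the CLOUDQUERY_API_KEY variable set on the command line in Step 7 will not be present when the scheduled job runs. One way to handle this is a small wrapper script that loads the key from a file and is called from the crontab instead of the binary. The sketch below is an assumption-laden example, not an official CloudQuery script: the file name run-sync.sh, the function cq_sync_once, the CQ_* variables and all paths are placeholders.

```shell
#!/bin/sh
# run-sync.sh: hypothetical cron wrapper. All paths are placeholders.
CQ_BIN="${CQ_BIN:-/path/to/cloudquery}"
CQ_CONFIG="${CQ_CONFIG:-/path/to/config.yaml}"
# File containing one line: CLOUDQUERY_API_KEY=<your-api-key-value>
# Restrict it with: chmod 600 /path/to/cloudquery.env
CQ_ENV_FILE="${CQ_ENV_FILE:-/path/to/cloudquery.env}"
CQ_LOG="${CQ_LOG:-/path/to/cron-sync.log}"

cq_sync_once() {
  # Subshell: the API key and the directory change do not leak to the caller.
  (
    set -a                  # export every variable defined while sourcing
    . "$CQ_ENV_FILE"
    set +a
    # Run next to the config so files written to the working directory
    # (such as cloudquery.log) land in a predictable place.
    cd "$(dirname "$CQ_CONFIG")" || exit 1
    "$CQ_BIN" sync "$CQ_CONFIG"
  ) >>"$CQ_LOG" 2>&1
}

# In the real script, the last line would call the function:
# cq_sync_once
```

With the final call uncommented and the script made executable, the crontab line would become 0 1 * * * /path/to/run-sync.sh.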
Option 2: Run on reboot
Alternatively, we can use the @reboot cron directive to run the sync every time the VM instance is rebooted. This is useful if you want to keep the VM instance stopped most of the day, but have it come up and sync once a day. To do this, update the line in the crontab file:
@reboot /path/to/cloudquery sync /path/to/config.yaml
We can now use the Instance Schedule feature on GCP to have the VM come up at a certain time every day:
- Click on the hamburger menu in the top left corner and select Compute Engine.
- Click on VM Instances.
- Click on the Instance Schedules tab.
- Click on the Create Instance Schedule button.
- Configure the start time and end time so that the VM instance is up for long enough to run the sync to completion.
- Click on Submit.
- Click on the newly created schedule.
- Click on Add Instances to Schedule.
- Select your VM instance and click on Add.
For this to work, you may first need to grant the Compute Instance Admin (v1) role to the Google-managed compute-system service account (do this from the IAM page).
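The instance schedule can also be set up with the gcloud CLI. The following is a sketch under assumptions: gcloud is installed and authenticated, a two-hour window (01:00 to 03:00 UTC) is long enough for your sync, and the function name and all arguments are placeholders. The schedule must be created in the same region as the VM.

```shell
# Create a daily start/stop schedule and attach it to the VM.
# $1 = schedule name, $2 = region, $3 = VM name, $4 = zone (placeholders).
schedule_cloudquery_vm() {
  # Start at 01:00 and stop at 03:00 UTC every day; widen the window if
  # your sync takes longer to complete.
  gcloud compute resource-policies create instance-schedule "$1" \
    --region="$2" \
    --vm-start-schedule='0 1 * * *' \
    --vm-stop-schedule='0 3 * * *' \
    --timezone=UTC &&
    gcloud compute instances add-resource-policies "$3" \
      --zone="$4" \
      --resource-policies="$1"
}

# Example call (uncomment to run):
# schedule_cloudquery_vm "cq-daily" "us-central1" "cloudquery-vm" "us-central1-a"
```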
Summary
In this tutorial, we have seen how to set up CloudQuery on a GCP VM instance and sync data to a PostgreSQL database managed by Cloud SQL. We have also seen how to set up a cron schedule to run the sync automatically. If you have any questions, check out the video above or join our Community to chat with us!