Tutorial: How to set up a Google BigQuery Collector for Anodot

Before we start

In this tutorial we will set up a Google BigQuery Collector.

We assume that the following steps have been completed successfully:

  1. anodot-collect is installed as described in Installation
  2. Google Service Account Set-up for BigQuery

The Dataset

We will use the table nyc-tlc.yellow.trips from the Google BigQuery Public Datasets to send the daily number of trips broken down by the number of passengers in each trip.

For this example we will use Standard SQL.

Testing in BigQuery

We need to build a query which can be used by the collector to extract the data. For example, we can start with the following query in the BigQuery Compose window:

SELECT
    DATE(pickup_datetime) as date,
    passenger_count,
    COUNT(1) as trip_count
FROM `nyc-tlc.yellow.trips`
GROUP BY 1,2
ORDER BY 1 LIMIT 8

Which gives the following result set when executed:

date passenger_count trip_count
2009-01-01 2 78198
2009-01-01 1 183441
2009-01-01 0 51
2009-01-01 6 1072
2009-01-01 5 22497
2009-01-01 4 17029
2009-01-01 3 25337
2009-01-02 1 223899

Writing a Query Template

Since the actual query used by the collector depends on a date range, we need to convert the original query into a query template, which we will put in a file query.sql.

SELECT 
    DATE(pickup_datetime) as date, 
    passenger_count, 
    COUNT(1) as trip_count 
FROM `nyc-tlc.yellow.trips`
WHERE 
    pickup_datetime >= '{{start_date}}' AND 
    pickup_datetime < '{{end_date}}'
group by 1,2

Notice that the range includes the left point and excludes the right point in the time interval.

Writing a Configuration File

The collector needs to convert the raw, tabular data from the query template into Anodot Metrics, the configuration file specifies how this is done. anodot-collect-conf.yaml.

anodot_api_token: SECRET_API_TOKEN          # You need to supply a token here in case you want to send data to Anodot
collectors:
  pyanodot.collectors.google_bigquery:
    accounts:
      my_account:
        credentials: creds.json
        queries:
          my_query:

            query:
              file: query.sql               # The query template resides in the file `query.sql`

            temporal_field: date            # The column `date` is the temporal field for reporting values

            use_legacy_sql: false           # we are using Standard SQL

            dimension_defs:
            - dimension: passenger_count    # The column `passenger_count` is a dimension

            metric_defs:
            - metric: trip_count            # The column `trip_count` is a metric
              target_type: counter          # This metric is aggregated as a counter ( sum )

Preparing the Work Directory

The Collector’s execution environment is a directory which contains all the necessary files for execution, and is a place for logs and other output files. So far we have the following files:

  1. anodot-collect-conf.yaml.
  2. query.sql
  3. The credentials file for the service account which we call creds.json.

Let’s put all three files in a directory called work:

$ tree
.
├── anodot-collect-conf.yaml
├── creds.json
└── query.sql

Testing the Collector

We want to estimate what amount of Anodot Metrics ( time-series ) is going to be generated as we run this collector daily. First we need to make sure the python environment is active for this shell:

source ~/.anodot/venv/bin/activate

We will run the collector with a specific time range, as follows:

anodot-collect.py pyanodot.collectors.google_bigquery -w . -X start_date=2014-01-01 -X end_date=2014-01-20 -a

The above command ( in particular the flag -a ) writes some .csv files into the work directory, containing the metric samples that would be sent to Anodot in production. We will use the exported file to estimate the number of metrics:

cut -d, -f3 *_my_query.csv | sort -u | wc -l

Which returns the number of indivitual Anodot Metrics which would get reported during this time range. If you think the amount of generated metrics is too large, go back a few steps to Testing in BigQuery and repeat until this step.

Sending Data to Anodot

After having verified the output, we are ready to start sending our data. If you haven’t done so already, you need to edit the configuration file and set the