Tutorial: How to set up a Google BigQuery Collector for Anodot¶
Before we start¶
In this tutorial we will set up a Google BigQuery Collector.
We assume that the following steps have been completed successfully:
anodot-collectis installed as described in Installation- Google Service Account Set-up for BigQuery
The Dataset¶
We will use the table nyc-tlc.yellow.trips from the Google BigQuery Public Datasets to send the daily number of trips broken down by the number of passengers in each trip.
For this example we will use Standard SQL.
Testing in BigQuery¶
We need to build a query which can be used by the collector to extract the data. For example, we can start with the following query in the BigQuery Compose window:
SELECT
DATE(pickup_datetime) as date,
passenger_count,
COUNT(1) as trip_count
FROM `nyc-tlc.yellow.trips`
GROUP BY 1,2
ORDER BY 1 LIMIT 8
Which gives the following result set when executed:
| date | passenger_count | trip_count |
|---|---|---|
| 2009-01-01 | 2 | 78198 |
| 2009-01-01 | 1 | 183441 |
| 2009-01-01 | 0 | 51 |
| 2009-01-01 | 6 | 1072 |
| 2009-01-01 | 5 | 22497 |
| 2009-01-01 | 4 | 17029 |
| 2009-01-01 | 3 | 25337 |
| 2009-01-02 | 1 | 223899 |
Writing a Query Template¶
Since the actual query used by the collector depends on a date range, we need to convert the original query into a query template, which we
will put in a file query.sql.
SELECT
DATE(pickup_datetime) as date,
passenger_count,
COUNT(1) as trip_count
FROM `nyc-tlc.yellow.trips`
WHERE
pickup_datetime >= '{{start_date}}' AND
pickup_datetime < '{{end_date}}'
group by 1,2
Notice that the range includes the left point and excludes the right point in the time interval.
Writing a Configuration File¶
The collector needs to convert the raw, tabular data from the query template into Anodot Metrics,
the configuration file specifies how this is done. anodot-collect-conf.yaml.
anodot_api_token: SECRET_API_TOKEN # You need to supply a token here in case you want to send data to Anodot
collectors:
pyanodot.collectors.google_bigquery:
accounts:
my_account:
credentials: creds.json
queries:
my_query:
query:
file: query.sql # The query template resides in the file `query.sql`
temporal_field: date # The column `date` is the temporal field for reporting values
use_legacy_sql: false # we are using Standard SQL
dimension_defs:
- dimension: passenger_count # The column `passenger_count` is a dimension
metric_defs:
- metric: trip_count # The column `trip_count` is a metric
target_type: counter # This metric is aggregated as a counter ( sum )
Preparing the Work Directory¶
The Collector’s execution environment is a directory which contains all the necessary files for execution, and is a place for logs and other output files. So far we have the following files:
anodot-collect-conf.yaml.query.sql- The credentials file for the service account which we call
creds.json.
Let’s put all three files in a directory called work:
$ tree
.
├── anodot-collect-conf.yaml
├── creds.json
└── query.sql
Testing the Collector¶
We want to estimate what amount of Anodot Metrics ( time-series ) is going to be generated as we run this collector daily. First we need to make sure the python environment is active for this shell:
source ~/.anodot/venv/bin/activate
We will run the collector with a specific time range, as follows:
anodot-collect.py pyanodot.collectors.google_bigquery -w . -X start_date=2014-01-01 -X end_date=2014-01-20 -a
The above command ( in particular the flag -a ) writes some .csv files into the work directory, containing the metric samples that
would be sent to Anodot in production. We will use the exported file to estimate the number of metrics:
cut -d, -f3 *_my_query.csv | sort -u | wc -l
Which returns the number of indivitual Anodot Metrics which would get reported during this time range. If you think the amount of generated metrics is too large, go back a few steps to Testing in BigQuery and repeat until this step.
Sending Data to Anodot¶
After having verified the output, we are ready to start sending our data. If you haven’t done so already, you need to edit the configuration file and set the