google-bigquery - Google BigQuery Collector

The google-bigquery collector performs a Standard SQL or Legacy SQL query based on CLI parameters and the configuration.

Google Service Account Set-up for BigQuery

The preferred long-term authentication method for Google services is the Service Account Credentials method. To use the Google BigQuery collector you will need to create or attain a credentials JSON file for a service account which has the necessary roles/permissions, as described below.

Create a Service Account

  1. Go to: https://console.cloud.google.com/apis/credentials
  2. Click “Create Credentials”, choose “Service Account Key”.

Granting permissions to the Service Account

  1. Go to dashboard https://console.cloud.google.com/apis/dashboard and check that Google BigQuery API is enabled for this project

  2. If not, click Enable API and enable BigQuery API for this project.

  3. Grant the Service Account the necessary IAM Permissions to create queries on your behalf:

    1. Go to the IAM console https://console.cloud.google.com/iam-admin/iam

    2. Click the Add button. a dialog will open.

    3. Fill the email of the service account.

    4. Select both roles:

      • BigQuery -> BigQuery User, and
      • BigQuery -> BigQuery Data Viewer
  4. Create a JSON credentials file

Config File Reference

Account-level properties

  • credentials (string or object):
  • If the value is a string then it is treated as a file-path assumed to be a JSON file which corresponds to a Google Service Account Credentials file.
  • If the value is an object then it is assumed to be the inlined JSON content of the Google Service Account Credentials file.

Query-level properties

  • query (string or array[string], required)
This is a jinja template to be filled from the CLI using the -X parameters
  • temporal_field (string, optional) - specifies how to generate the timestamp for the resulting Anodot metric.
  • dimension_defs, metrics_defs - the dimension names refer to the query result schema.

Example:

  • In this simplified example, the query schema is (total_users bigint, device text), total_users is treated as a metric and device is treated as a dimension:
{
    "query" : "SELECT SUM(users) AS total_users, device FROM daily_data GROUP BY device",
    "dimension_defs": [ { "dimension": "device" } ],
    "metric_defs": [ {"metric": "total_users", "target_type": "counter" } ]
}

sample_time

This collector receives the start_time and end_time parameters from the anodot-collect.py command line. Then for every timestamp at the time_step between start_time (inclusive) and end_time (exclusive) it will perform the query, with sample_time injected.

Example:

  • start_time=2016-12-01 00:00:00
  • end_time=2016-12-03 00:00:00
  • time_step="day" then the query will be performed with sample_time=2016-12-01 00:00:00,2016-12-03 00:00:00

Then it performs the query, injecting sample_time For each in this range performs the specified query and collects the result, assigning to the metric a timestamp which is either the sample_time

To simplify the query process, the collector injects a variable called sample_time into the original query.

Example:

  • query="SELECT * FROM daily_data_{sample_time:%Y%m%d};"
  • sample_time=2016-12-01 01:00:00

Then the collector will run the query

SELECT * FROM daily_data_20161201;

Command-line params

$ anodot-collect.py google-bigquery -h
usage: anodot-collect.py google-bigquery [-h] [-s START_TIME] [-e END_TIME]
                                         [-w WORK_DIR] [-E ENDPOINT]
                                         [-T ANODOT_API_TOKEN] [-V VER] [-a]
                                         [-J] [-d] [-p] [-D API_DELAY]
                                         [-C API_CHUNK]
                                         [--producer-concurrency PRODUCER_CONCURRENCY]
                                         [--anodot-api-concurrency ANODOT_API_CONCURRENCY]

optional arguments:
  -h, --help            show this help message and exit
  -s START_TIME, --start-time START_TIME
                        start timestamp (inclusive - used in time_range
                        queries) format 'YYYY-mm-dd HH:MM:SS'
  -e END_TIME, --end-time END_TIME
                        end timestamp (exclusive - used in time_range queries)
                        format 'YYYY-mm-dd HH:MM:SS'
  -w WORK_DIR, --work-dir WORK_DIR
                        working directory ( logs saved there )
  -E ENDPOINT, --endpoint ENDPOINT
                        Anodot API endpoint: `poc` or `production`
  -T ANODOT_API_TOKEN, --anodot-api-token ANODOT_API_TOKEN
                        Anodot API endpoint: `poc` or `production`
  -V VER, --ver VER     data version number to send ( unless specified here or
                        in the config, ver=1 )
  -a, --save-anodot-csv
                        write a CSV file with the Anodot format
  -J, --save-anodot-json
                        write a JSON file with the same format being sent to
                        the Anodot API
  -d, --debug           Print verbose debug output
  -p, --production
  -D API_DELAY, --api-delay API_DELAY
                        Anodot API: delay between requests
  -C API_CHUNK, --api-chunk API_CHUNK
                        Anodot API: max. number of metrics to send per request
  --producer-concurrency PRODUCER_CONCURRENCY
                        Number of concurrent processes to use for producers
  --anodot-api-concurrency ANODOT_API_CONCURRENCY
                        Number of concurrent processes to use for sending to
                        anodot

Sample CLI Invocation

  • Basic execution for generating CSV files for Anodot:
(venv) $ anodot-collect.py -w my_work_dir -s 2017-06-01 -e 2017-06-10 -a

Sample Config File

{"anodot_api_token": "XXXXXXXXXXXXXXXXXXx",
    "collectors": { 
    "google-bigquery" : {
        "accounts": {
            "account_name": {
                "credentials": "/path/to/service-account-file-you-got-from-google.json",
                "queries": {
            "example_sampled":
            {
                "ver": "55",
                "time_step": "hour",
                "query": [
                    "SELECT",
                    "SUM(totals_hits) as totals_hits,",
                    "trafficSource_utm_medium,",
                    "trafficSource_utm_campaign,",
                    "trafficSource_utm_id,",
                    "trafficSource_utm_source,",
                    "trafficSource_utm_term,",
                    "trafficSource_utm_content",
                    "FROM `my_ga_data_{sample_time:%Y%m%d}`",
                    "WHERE hour_utc_timestamp = TIMESTAMP('{sample_time:%Y-%m-%d %H:%M:%S}')",
                    "GROUP BY 2,3,4,5,6,7"
                ],
                "dimension_defs": [
                        { "dimension": "trafficSource_utm_medium" },
                        { "dimension": "trafficSource_utm_campaign" },
                        { "dimension": "trafficSource_utm_id" },
                        { "dimension": "trafficSource_utm_source" },
                        { "dimension": "trafficSource_utm_term" },
                        { "dimension": "trafficSource_utm_content" }
                ],
                "metric_defs": [
                    { "metric": "totals_hits", "target_type": "counter" }
                ]
            }    
        }
}}}}}