google-bigquery - Google BigQuery Collector¶
The google-bigquery collector performs a Standard SQL or Legacy SQL query based
on CLI parameters and the configuration.
Google Service Account Set-up for BigQuery¶
The preferred long-term authentication method for Google services is the Service Account Credentials method. To use the Google BigQuery collector you will need to create or attain a credentials JSON file for a service account which has the necessary roles/permissions, as described below.
Create a Service Account¶
- Go to: https://console.cloud.google.com/apis/credentials
- Click “Create Credentials”, choose “Service Account Key”.
Granting permissions to the Service Account¶
Go to dashboard https://console.cloud.google.com/apis/dashboard and check that Google BigQuery API is enabled for this project
If not, click Enable API and enable BigQuery API for this project.
Grant the Service Account the necessary IAM Permissions to create queries on your behalf:
Go to the IAM console https://console.cloud.google.com/iam-admin/iam
Click the Add button. a dialog will open.
Fill the email of the service account.
Select both roles:
- BigQuery -> BigQuery User, and
- BigQuery -> BigQuery Data Viewer
Create a JSON credentials file
Config File Reference¶
Account-level properties¶
credentials (string or object):- If the value is a string then it is treated as a file-path assumed to be a JSON file which corresponds to a Google Service Account Credentials file.
- If the value is an object then it is assumed to be the inlined JSON content of the Google Service Account Credentials file.
Query-level properties¶
query (string or array[string], required)
This is a jinja template to be filled from the CLI using the -X parameters
temporal_field (string, optional)- specifies how to generate the timestamp for the resulting Anodot metric.dimension_defs,metrics_defs- the dimension names refer to the query result schema.
Example:¶
- In this simplified example, the query schema is
(total_users bigint, device text),total_usersis treated as a metric anddeviceis treated as a dimension:
{
"query" : "SELECT SUM(users) AS total_users, device FROM daily_data GROUP BY device",
"dimension_defs": [ { "dimension": "device" } ],
"metric_defs": [ {"metric": "total_users", "target_type": "counter" } ]
}
sample_time¶
This collector receives the start_time and end_time parameters
from the anodot-collect.py command line. Then for every timestamp at
the time_step between start_time (inclusive) and end_time
(exclusive) it will perform the query, with sample_time injected.
Example:¶
start_time=2016-12-01 00:00:00end_time=2016-12-03 00:00:00time_step="day"then the query will be performed withsample_time=2016-12-01 00:00:00,2016-12-03 00:00:00
Then it performs the query, injecting sample_time For each in this
range performs the specified query and collects the result, assigning to
the metric a timestamp which is either the sample_time
To simplify the query process, the collector injects a variable called
sample_time into the original query.
Example:¶
query="SELECT * FROM daily_data_{sample_time:%Y%m%d};"sample_time=2016-12-01 01:00:00
Then the collector will run the query
SELECT * FROM daily_data_20161201;
Command-line params¶
$ anodot-collect.py google-bigquery -h
usage: anodot-collect.py google-bigquery [-h] [-s START_TIME] [-e END_TIME]
[-w WORK_DIR] [-E ENDPOINT]
[-T ANODOT_API_TOKEN] [-V VER] [-a]
[-J] [-d] [-p] [-D API_DELAY]
[-C API_CHUNK]
[--producer-concurrency PRODUCER_CONCURRENCY]
[--anodot-api-concurrency ANODOT_API_CONCURRENCY]
optional arguments:
-h, --help show this help message and exit
-s START_TIME, --start-time START_TIME
start timestamp (inclusive - used in time_range
queries) format 'YYYY-mm-dd HH:MM:SS'
-e END_TIME, --end-time END_TIME
end timestamp (exclusive - used in time_range queries)
format 'YYYY-mm-dd HH:MM:SS'
-w WORK_DIR, --work-dir WORK_DIR
working directory ( logs saved there )
-E ENDPOINT, --endpoint ENDPOINT
Anodot API endpoint: `poc` or `production`
-T ANODOT_API_TOKEN, --anodot-api-token ANODOT_API_TOKEN
Anodot API endpoint: `poc` or `production`
-V VER, --ver VER data version number to send ( unless specified here or
in the config, ver=1 )
-a, --save-anodot-csv
write a CSV file with the Anodot format
-J, --save-anodot-json
write a JSON file with the same format being sent to
the Anodot API
-d, --debug Print verbose debug output
-p, --production
-D API_DELAY, --api-delay API_DELAY
Anodot API: delay between requests
-C API_CHUNK, --api-chunk API_CHUNK
Anodot API: max. number of metrics to send per request
--producer-concurrency PRODUCER_CONCURRENCY
Number of concurrent processes to use for producers
--anodot-api-concurrency ANODOT_API_CONCURRENCY
Number of concurrent processes to use for sending to
anodot
Sample CLI Invocation¶
- Basic execution for generating CSV files for Anodot:
(venv) $ anodot-collect.py -w my_work_dir -s 2017-06-01 -e 2017-06-10 -a
Sample Config File¶
{"anodot_api_token": "XXXXXXXXXXXXXXXXXXx",
"collectors": {
"google-bigquery" : {
"accounts": {
"account_name": {
"credentials": "/path/to/service-account-file-you-got-from-google.json",
"queries": {
"example_sampled":
{
"ver": "55",
"time_step": "hour",
"query": [
"SELECT",
"SUM(totals_hits) as totals_hits,",
"trafficSource_utm_medium,",
"trafficSource_utm_campaign,",
"trafficSource_utm_id,",
"trafficSource_utm_source,",
"trafficSource_utm_term,",
"trafficSource_utm_content",
"FROM `my_ga_data_{sample_time:%Y%m%d}`",
"WHERE hour_utc_timestamp = TIMESTAMP('{sample_time:%Y-%m-%d %H:%M:%S}')",
"GROUP BY 2,3,4,5,6,7"
],
"dimension_defs": [
{ "dimension": "trafficSource_utm_medium" },
{ "dimension": "trafficSource_utm_campaign" },
{ "dimension": "trafficSource_utm_id" },
{ "dimension": "trafficSource_utm_source" },
{ "dimension": "trafficSource_utm_term" },
{ "dimension": "trafficSource_utm_content" }
],
"metric_defs": [
{ "metric": "totals_hits", "target_type": "counter" }
]
}
}
}}}}}