Flowfunctions Approach
We take a standardised approach when developing data workflows in BigQuery, which enables us to develop functions that support it. Naming conventions for resources, functions and metadata elements are also kept consistent to improve clarity and simplicity.
Date-Partitioned Table Functions
Table Functions are an extremely powerful resource type in BigQuery, as they enable users to pass parameters to views. This means that if your inbound data is date-partitioned, you can build a downstream multi-step data transformation flow of arbitrary complexity whilst still maintaining the predictability and efficiency resulting from date partitioning at every stage.
As such, we utilise them as a default when building data transformations, using start_date and end_date DATE fields as the only parameters. By consistently using this structure we maintain all of the logic in BigQuery, making it directly accessible and easy to query and quality assure (QA).
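As a minimal sketch of this pattern (the dataset, function and column names here are hypothetical), a single transformation step might be defined as:
--Example date-partitioned table function (hypothetical names)
CREATE OR REPLACE TABLE FUNCTION your_dataset.step_1_clean_events(start_date DATE, end_date DATE)
AS
SELECT event_date, user_id, event_name
FROM your_dataset.inbound_events
WHERE event_date BETWEEN start_date AND end_date;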
This consistency also allows us to develop programmatic QA functions to track how e.g. row counts, unique values or metric sums evolve through the logical flow, and then execute these functions on specific date ranges. This makes it very quick to pinpoint specific unexpected values, and to trace the root cause back to specific code and row-level inbound data.
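As an illustration of the kind of check such QA functions automate, a manual equivalent might simply count rows per day for one step of the flow over a specific date range (names again hypothetical):
--Manual QA check: daily row counts for one flow step (hypothetical names)
SELECT event_date, COUNT(*) AS row_count
FROM your_dataset.step_1_clean_events(DATE '2024-01-01', DATE '2024-01-07')
GROUP BY event_date
ORDER BY event_date;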
SQL/DDL/DML/DCL Compilation
One of the major challenges when building native BigQuery data engineering or transformation workflows is the inconsistency of constraints when it comes to scripting, and the fact that some values cannot be dynamically set (e.g. it is not possible to dynamically set structural query elements such as table_id or column_name). For this reason we have designed the flowfunctions library to:
- Mirror native BigQuery routines and language as closely as possible
- Accept any dynamically set variables passed to the function
- Compile the SQL/DDL/DML/DCL code
- Test and execute compiled code
- Enable resources to be created using compiled SQL definitions
This final point means that arbitrarily complex logic can be executed directly in BigQuery without requiring user-defined templating or external libraries or tools.
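To illustrate the underlying compile-then-execute pattern using only native BigQuery scripting (this is not the flowfunctions API itself, and the table name is hypothetical):
--Compile and execute dynamic SQL with a dynamically set table_id (hypothetical name)
DECLARE table_id STRING DEFAULT 'your_dataset.inbound_events';
DECLARE query STRING;
--Compile the SQL with the dynamically set structural element
SET query = FORMAT('SELECT COUNT(*) AS row_count FROM `%s`', table_id);
--Execute the compiled code
EXECUTE IMMEDIATE query;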
Temporary Tables
Whilst using variables in scripts is a powerful approach, the syntax used to interact with them can make the resulting scripts difficult to read and therefore maintain. Additionally, although any table data can be aggregated into an ARRAY of STRUCTs, these are then subject to memory constraints which can cause workflows to fail if these limits are hit.
A better approach, which we utilise in any script, is leveraging Temporary Tables as pseudo-dataframes. Temporary tables are named (but without the project_id or dataset_name), are not subject to memory constraints, and are accessible by name at any point in the script or session.
Combining Temporary Tables with Date-Partitioned Table Functions for inbound script data enables rapid development of arbitrarily complex logic, which can then be trivially deployed on the full dataset by adjusting the start_date and end_date parameters.
The contents of temporary tables can then be used to create new tables, inserted into existing tables or used as a foundation for iteration to build complex automated flows.
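A minimal sketch of this workflow, reusing the hypothetical table function from above:
--Develop against a narrow date range, then widen the parameters for deployment
CREATE TEMP TABLE clean_events AS
SELECT *
FROM your_dataset.step_1_clean_events(DATE '2024-01-01', DATE '2024-01-02');
--The temporary table is then accessible by name for any subsequent step
SELECT event_date, COUNT(*) AS row_count
FROM clean_events
GROUP BY event_date;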
Table Iteration
Iterating over the rows of a table is a foundational requirement for any automated workflow, and one which is now trivial in BigQuery thanks to the FOR...IN loop, which iterates over the rows of a table object (e.g. table, view, table function or temporary table).
Column values are then accessed using simple dot notation. As a minimal sketch (the source table here is hypothetical), the following example would simply select each value in the column_name column:
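--Iterate over each row of a (hypothetical) source table and select the column_name value
FOR record IN (SELECT column_name FROM your_dataset.your_table)
DO
  SELECT record.column_name;
END FOR;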
Remote Functions
There are a number of useful features which are not possible in the BigQuery console and therefore require access to the BigQuery API, other Google Cloud APIs or external APIs. These functions are deployed using Cloud Functions to the regional flowfunctions projects and accessed using Remote Functions in BigQuery. However, once the appropriate access has been provisioned, this complexity is invisible to the end user and the functions are called using the standard syntax.
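As a generic sketch of the standard BigQuery remote function pattern (the connection, endpoint and function names below are purely illustrative and not the actual flowfunctions deployment):
--Register a (hypothetical) Cloud Function endpoint as a remote function
CREATE OR REPLACE FUNCTION your_dataset.example_remote_function(input STRING)
RETURNS STRING
REMOTE WITH CONNECTION `your-project.eu.your-connection`
OPTIONS (endpoint = 'https://europe-west2-your-project.cloudfunctions.net/example-function');
--Once provisioned, it is called like any other function
SELECT your_dataset.example_remote_function('input_value') AS result;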
Script Structure
Since readability is fundamental to the ongoing maintenance overhead of any code, we follow a standardised pattern when building any script and use appropriate indentations to leverage row collapsing in the BigQuery console. A standard script contains the following elements:
DECLARATION
Variable declarations need to take place at the beginning of any script, and they can also be defaulted to an initial value in-line using the DEFAULT keyword. However, this can get quite difficult to read, so we place the variable declarations in the DECLARATION section and set variable values in the subsequent appropriate section, with variables of the same type declared on the same line.
If using flowfunctions it is also advisable to set the @@dataset_project_id system variable to the flowfunctions project in your region. This means that you can then drop the project from each function call and use a more succinct and readable syntax throughout the script.
--DECLARATION:
DECLARE source_project_id STRING;
DECLARE start_date, end_date DATE;
SET @@dataset_project_id = 'flowfunctions--eu';
CONFIGURATION
Separating the script configuration from the variable declaration and execution makes it very quick and simple to reuse previous structures or code templates without worrying about configuration consistency throughout the script. This is the only section where code changes should be required, for example (with hypothetical values):
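--CONFIGURATION:
--Hypothetical values for illustration
SET source_project_id = 'your-source-project';
SET (start_date, end_date) = (DATE '2024-01-01', DATE '2024-01-07');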
EXECUTION
This is the section where the logic is defined and the code is actually executed. Inputs into all functions in this code should be derived from the configuration and not hard-coded.
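A minimal sketch of an execution section, again using hypothetical function names, which takes its date range from the configured variables:
--EXECUTION:
CREATE TEMP TABLE clean_events AS
SELECT *
FROM your_dataset.step_1_clean_events(start_date, end_date);
SELECT * FROM clean_events;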