Flowfunctions Approach

We take a standardised approach when developing data workflows in BigQuery, which enables us to build functions that support this approach. Naming conventions for resources, functions and metadata elements are also kept consistent to improve clarity and simplicity.

Date-Partitioned Table Functions

Table Functions are an extremely powerful resource type in BigQuery, as they effectively enable users to pass parameters to views. This means that if your inbound data is date-partitioned, you can build a downstream multi-step data transformation flow of arbitrary complexity whilst still maintaining the predictability and efficiency that result from date partitioning at every stage.

As such, we use them by default when building data transformations, with start_date and end_date DATE parameters as the only arguments. By consistently using this structure we keep all of the logic in BigQuery, making it directly accessible and easy to query and quality assure (QA).
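As a minimal sketch (the project, dataset, table and column names here are purely illustrative), a date-partitioned table function might be defined as:

CREATE OR REPLACE TABLE FUNCTION `my_project.my_dataset.events_cleaned`(start_date DATE, end_date DATE)
AS
SELECT
    event_date,
    user_id,
    event_name
FROM `my_project.my_dataset.events`
WHERE event_date BETWEEN start_date AND end_date;

Each downstream step can then be another table function selecting from this one with the same two parameters, so the partition filter propagates through every stage of the flow.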

This consistency also allows us to develop programmatic QA functions to track how e.g. row counts, unique values or metric sums evolve through the logical flow, and to execute these functions on specific date ranges. This makes it very quick to pinpoint unexpected values, and to trace the root cause back to specific code and row-level inbound data.
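For example, a simple row-count check against the illustrative function above can be run on a narrow date range at any stage of the flow:

SELECT
    event_date,
    COUNT(*) AS row_count
FROM `my_project.my_dataset.events_cleaned`(DATE '2024-01-01', DATE '2024-01-07')
GROUP BY event_date
ORDER BY event_date;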

SQL/DDL/DML/DCL Compilation

One of the major challenges when building native BigQuery data engineering or transformation workflows is the set of constraints around scripting, in particular the fact that some values cannot be set dynamically (e.g. it is not possible to dynamically set structural query elements such as table_id or column_name). For this reason we have designed the flowfunctions library to:

  • Mirror native BigQuery routines and language as closely as possible
  • Accept dynamically-set variables as function arguments
  • Compile the SQL/DDL/DML/DCL code
  • Test and execute compiled code
  • Enable resources to be created using compiled SQL definitions

This final point means that arbitrarily complex logic can be executed directly in BigQuery without requiring user-defined templating or external libraries or tools.
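The native mechanism underpinning this pattern is EXECUTE IMMEDIATE; as an illustrative sketch (the table_id and query below are hypothetical), a structural element can be compiled into a SQL string and then executed:

DECLARE table_id STRING DEFAULT 'my_project.my_dataset.my_table';
DECLARE ddl STRING;

-- Compile the DDL with the dynamically-set table_id...
SET ddl = FORMAT("""
CREATE OR REPLACE TABLE `%s` AS
SELECT CURRENT_DATE() AS run_date
""", table_id);

-- ...then execute the compiled statement
EXECUTE IMMEDIATE ddl;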

Temporary Tables

Whilst using variables in scripts is a powerful approach, the syntax used to interact with them can make the resulting scripts difficult to read and therefore maintain. Additionally, although any table data can be aggregated into an ARRAY of STRUCTs, these are then subject to memory constraints which can cause workflows to fail if these limits are hit.

A better approach, which we use in any script, is to leverage Temporary Tables as pseudo-dataframes. Temporary tables are named (but without a project_id or dataset_name), are not subject to the same memory constraints, and are accessible by name at any point in the script or session.

Combining Temporary Tables with Date-Partitioned Table Functions for inbound script data enables rapid development of arbitrarily complex logic, which can then be trivially deployed on the full dataset by adjusting the start_date and end_date parameters.

The contents of temporary tables can then be used to create new tables, insert into existing tables, or serve as a foundation for iteration to build complex automated flows.
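As a sketch of this pattern (reusing the illustrative events_cleaned function from above; the target table name is also hypothetical), a script might stage a date range into a temporary table and then persist the results:

-- Stage the inbound date range in a temporary table
CREATE TEMP TABLE events_staged AS
SELECT *
FROM `my_project.my_dataset.events_cleaned`(DATE '2024-01-01', DATE '2024-01-07');

-- The temporary table can now be referenced by name for QA, transformation or persistence
INSERT INTO `my_project.my_dataset.events_final`
SELECT * FROM events_staged;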

Table Iteration

Iterating over the rows of a table is a foundational requirement for any automated workflow, and one which is now trivial in BigQuery thanks to the FOR...IN structure, which iterates over the rows returned by a query against any table object (e.g. a table, view, table function or temporary table).

Column values are then accessed using simple dot notation; this example would select each value in the column_name column:

FOR table_row IN (SELECT * FROM table_object)
DO
    SELECT table_row.column_name;
END FOR;
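Combined with Temporary Tables and compiled SQL, this becomes a simple automation loop. As an illustrative sketch (the dataset and table names are hypothetical), the rows of a temporary table can drive one dynamically compiled statement per iteration:

-- Stage the list of tables to process
CREATE TEMP TABLE target_tables AS
SELECT table_name
FROM `my_project.my_dataset.INFORMATION_SCHEMA.TABLES`
WHERE table_type = 'BASE TABLE';

-- Execute one compiled statement per row
FOR target IN (SELECT table_name FROM target_tables)
DO
    EXECUTE IMMEDIATE FORMAT(
        'SELECT COUNT(*) AS row_count FROM `my_project.my_dataset.%s`',
        target.table_name);
END FOR;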

Remote Functions

There are a number of useful features which are not possible from the BigQuery console alone and therefore require access to the BigQuery API, other Google Cloud APIs or external APIs. These functions are deployed as Cloud Functions to the regional flowfunctions projects and accessed using Remote Functions in BigQuery. However, once the appropriate access has been provisioned, this complexity is invisible to the end user and the functions are called using standard function syntax.
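For context, this is the generic BigQuery mechanism in play (the project, connection, endpoint and function names below are purely illustrative, not the actual flowfunctions deployment): a remote function is bound to a Cloud Function endpoint via a connection, after which the call site looks like any other function call:

CREATE OR REPLACE FUNCTION `my_project.my_dataset.my_remote_function`(input STRING)
RETURNS STRING
REMOTE WITH CONNECTION `my_project.eu.my_connection`
OPTIONS (
    endpoint = 'https://europe-west2-my-project.cloudfunctions.net/my_cloud_function'
);

-- Called like any native function
SELECT `my_project.my_dataset.my_remote_function`('example input') AS result;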

Script Structure

Since readability is fundamental to keeping the ongoing maintenance overhead of any code low, we follow a standardised pattern when building any script and use appropriate indentation to leverage row collapsing in the BigQuery console. A standard script contains the following sections:

DECLARATION

Variable declarations need to take place at the beginning of any script, and variables can also be given an initial value in-line using the DEFAULT keyword. However, this can get quite difficult to read, so we place the variable declarations in the DECLARATION section and set variable values in the subsequent appropriate section, with variables of the same type declared on the same line.
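For contrast, this is what in-line defaulting looks like (valid, but harder to scan as scripts grow; the values are illustrative):

DECLARE source_project_id STRING DEFAULT 'my_project_id';
DECLARE start_date DATE DEFAULT DATE '2024-01-01';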

If using flowfunctions it is also advisable to set the @@dataset_project_id system variable to the flowfunctions project in your region. This means that you can then drop the project from each function call and use a more succinct and readable syntax throughout the script.

--DECLARATION:
    DECLARE source_project_id STRING;
    DECLARE start_date, end_date DATE;
    SET @@dataset_project_id = 'flowfunctions--eu';

CONFIGURATION

Separating the script configuration from the variable declaration and execution makes it very quick and simple to reuse previous structures or code templates without worrying about configuration consistency throughout the script. This is the only section where code changes should be required:

--CONFIGURATION:
    SET source_project_id = 'my_project_id';

EXECUTION

This is the section where the logic is defined and the code is actually executed. Inputs to all functions in this code should be derived from the configuration rather than hard-coded.
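As a minimal sketch of an execution section (assuming the illustrative events_cleaned table function from above, and that start_date and end_date have also been set in the CONFIGURATION section), the configured project can be compiled into the query and the results staged in a temporary table:

--EXECUTION:
    -- Compile the configured project_id and date range into the query, then stage the results
    EXECUTE IMMEDIATE FORMAT("""
        CREATE TEMP TABLE inbound_events AS
        SELECT *
        FROM `%s.my_dataset.events_cleaned`(DATE '%t', DATE '%t')
        """, source_project_id, start_date, end_date);

    -- Simple QA check on the staged data
    SELECT COUNT(*) AS inbound_row_count
    FROM inbound_events;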