Flowfunctions
The flowfunctions
library enables and empowers development, management and monitoring of complex and extensible data workflows within the native Google BigQuery environment.
Motivation
Google BigQuery is a powerful and accessible foundation for any kind of data work. It is:
- Serverless, requiring no infrastructure setup, configuration or maintenance
- Scalable, to a practically infinite degree
- Pay-as-you-go billed, with optional discounted commitment for predictability at scale
BigQuery has:
- A large feature set which is constantly expanding
- A built-in console so you can run queries, code, and inspect data in the native environment
- A wide range of integrations for data ingestion or downstream data activities
- A comprehensive API with support for multiple languages
This makes it suitable for any scale or type of organisation or team, from a single person to a large multi-national.
However due to some fundamental practical and syntactical idiosyncrasies and constraints, it can be difficult to achieve complex tasks in a clear, modular and scalable manner.
This library aims to provide a clean, unified, functional abstraction based on elements of the core BigQuery language sets. This enables developers to write and reuse powerful, scalable, extensible and maintainable code, to achieve any data objective without requiring (but still permitting) integration or usage of external tools.
Situation
There are a number of different language variations available in BigQuery:
- SQL - Structured Query Language
- DDL - Data Definition Language
- DML - Data Manipulation Language
- DCL - Data Control Language
- Procedural Language
- Debugging
- Other Statements
Additionally, programming and automation objectives can be achieved through a broad range of different functional constructs:
- UDF - User Defined Function
- Procedures
- Table Functions
- Scheduled Queries
- Remote Functions (using Cloud Functions)
BigQuery also permits access to a number of INFORMATION_SCHEMAviews which provide metadata about BigQuery objects and activity. This metadata, in combination with a subset of the native functionality provides us with the foundational tools to achieve complex data workflows within BigQuery.
However one fundamental constraint which impacts the ability to build clean, functional routines is the fact that structural statement elements like table ids, column names and json paths cannot be dynamically set.
This means that to achieve a task requiring iteration around tables - such as SELECT
data FROM
each table in a dataset and INSERT
into another table - it is possible to get the table metadata from the INFORMATION_SCHEMA.TABLES view, but you then need to construct a STRING
representation of the required statement and execute that statement via the EXECUTE IMMEDIATE statement.
While this is undoubtedly powerful, it is a cumbersome paradigm which is hard to reuse or scale. It can also be difficult to read.
A secondary constraint in the programming environment is the fact that - in order to receive an output from a procedure - it is necessary to know and instantiate a variable with the exact return variable structure and pass it to the procedure, where it is updated and returned.
Whilst this approach adds negligible developer overhead in simple cases, for more complex return data structures it becomes a heavy drag on the developer workflow.
Solution
The flowfunctions
library and contextual framework builds on these foundations, but aims to abstract away any requirement on the developer to manually construct and execute STRING
statements. It provides a functional interface to permit any function parameter to be dynamically set and therefore used as an automated step in any functional workflow.
The precise mechanism used varies, however the syntax and framework is organised around the principles of:
- Readability
- Simplicity
- Consistency
- Modularity
Humans should not have to write templated SQL.
Execution
Since routines in BigQuery can only be executed on code which is in the exact same physical location, the deployment process mirrors the entire flowfunctions
function set to all required Google Cloud regions (both multi-region and single region). The projects are named by appending a double hyphen and the lowercase region identifier:
- US:
flowfunctions--us
- EU:
flowfunctions--eu
The function sets are then either aligned to the BigQuery language (e.g. DML
, DCL
), or to the functional application (e.g. LINEAGE
, PROFILING
). The functions are then called using the standard CALL
or SELECT
syntax depending on the function type, for example the function DML.DELETE_TABLE_PARTITIONS(target_table_id STRING, date_partition_column STRING, start_date DATE, end_date DATE)
will delete a range of table partitions as part of an idempotent transformation flow, and would called to act upon data located in the US multi-region using the following syntax:
CALL `flowfunctions--us`.DML.DELETE_TABLE_PARTITIONS(target_table_id, date_partition_column, start_date, end_date)
Note that it is possible to drop the project identifier from the function call by setting the @@dataset_project_id
System Variable as the required geographic library:
SET @@dataset_project_id = 'flowfunctions--us';
CALL DML.DELETE_TABLE_PARTITIONS(target_table_id, date_partition_column, start_date, end_date)
The flowfunctions
library is currently in private beta and is being actively developed, tested and documented with a diverse set of client companies. For information or to request beta access please contact flowfunctions@transformationflow.io. Access is managed via Google Groups.