The flowfunctions library enables the development, management and monitoring of complex, extensible data workflows within the native Google BigQuery environment.
Google BigQuery is a powerful and accessible foundation for any kind of data work. It offers:
- A serverless architecture (so it requires no infrastructure setup or maintenance)
- Near-limitless scalability
- Pay-as-you-go pricing (or other models for predictability at scale)
- A large and expanding array of features
- A built-in console (so you can run queries and code, and inspect data in the native environment)
- A wide range of integrations for data ingestion and downstream data activities
- A comprehensive API with support for multiple languages
This makes it suitable for any scale or type of organisation or team.
However, due to some fundamental practical and syntactical idiosyncrasies and constraints, it can be difficult to achieve complex tasks in a clear, modular and scalable manner.
This library aims to provide a clean, unified, functional abstraction based on elements of the core BigQuery language sets. This enables developers to write and reuse powerful, scalable, extensible and maintainable code, to achieve any data objective without requiring (but still permitting) integration or usage of external tools.
There are a number of different language variations available in BigQuery:
- SQL - Structured Query Language
- DDL - Data Definition Language
- DML - Data Manipulation Language
- DCL - Data Control Language
- Procedural Language
- Other Statements
Additionally, programming and automation objectives can be achieved through a broad range of different functional constructs:
- User-Defined Functions (UDFs)
- Table Functions
- Scheduled Queries
- Remote Functions (using Cloud Functions)
BigQuery also provides access to a number of INFORMATION_SCHEMA views, which expose metadata about BigQuery objects and activity. This, in combination with a subset of the native functionality, should provide us with the foundational tools to achieve complex data workflows within BigQuery.
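As a simple illustration, listing the tables in a dataset is a single query against the relevant metadata view (the project and dataset names here are hypothetical placeholders):

```sql
-- List all base tables in a dataset via the INFORMATION_SCHEMA metadata views
-- (`my_project.my_dataset` is a hypothetical placeholder)
SELECT table_name, creation_time
FROM `my_project.my_dataset`.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE'
ORDER BY creation_time DESC;
```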
However, one fundamental constraint which impacts the ability to build clean, functional routines is that structural statement elements (e.g. table ids, column names, JSON paths) cannot be set dynamically.
This means that, to achieve a task requiring iteration over tables (e.g. selecting FROM each table in a dataset and INSERTing into another table), you can get the table metadata from the INFORMATION_SCHEMA.TABLES view, but you must then construct a STRING representation of the required statement and execute it via the EXECUTE IMMEDIATE statement.
While this is undoubtedly powerful, it is a cumbersome paradigm which is hard to reuse or scale. It is also extremely ugly to read.
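A sketch of this paradigm, using BigQuery procedural language (all project, dataset and table names are hypothetical placeholders), might look something like:

```sql
-- Hypothetical example: append every table in a dataset into one target table.
-- Loop over the metadata view, build a statement STRING for each table,
-- then run it with EXECUTE IMMEDIATE.
FOR source IN (
  SELECT table_name
  FROM `my_project.my_dataset`.INFORMATION_SCHEMA.TABLES
  WHERE table_type = 'BASE TABLE'
)
DO
  EXECUTE IMMEDIATE FORMAT(
    "INSERT INTO `my_project.my_dataset.target_table` SELECT * FROM `my_project.my_dataset.%s`",
    source.table_name
  );
END FOR;
```

Even in this minimal case, the actual logic is buried inside a templated string, which is what makes the pattern hard to read, reuse and scale.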
A secondary constraint in the programming environment is that - in order to receive an output from a procedure - the caller must know and declare a variable with the exact structure of the return value and pass it to the procedure, where it is updated and returned.
Whilst this approach adds negligible developer overhead in simple cases, for more complex return data structures it becomes a heavy drag on the developer workflow.
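To illustrate this second constraint (the procedure, dataset and variable names here are hypothetical), the caller must pre-declare a variable matching the procedure's OUT parameter structure exactly:

```sql
-- Hypothetical example: receiving output from a procedure requires the caller
-- to declare a variable with the exact structure of the OUT parameter.
DECLARE row_counts ARRAY<STRUCT<table_name STRING, row_count INT64>>;

-- The procedure populates the variable, which is then available to the caller
CALL `my_project.my_dataset`.get_row_counts('my_dataset', row_counts);
SELECT * FROM UNNEST(row_counts);
```

For a nested or deeply structured return value, writing out this declaration by hand quickly becomes the dominant cost of calling the procedure.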
The flowfunctions library and contextual framework build on these foundations, but aim to abstract away any requirement for the developer to manually construct and execute STRING statements. The library provides a functional interface which permits any function parameter to be set dynamically, and therefore to be used as an automated step in any functional workflow.
The precise mechanism varies, however the syntax and framing are organised around a single principle:
Humans should not have to write templated SQL.
Since BigQuery routines can only be executed against data in the same physical location, the deployment process mirrors the entire flowfunctions function set to all required Google Cloud regions (both multi-region and single-region). The projects are named by appending a double hyphen and the lowercase region identifier, for example flowfunctions--us for the US multi-region.
The function sets are aligned either to the BigQuery language subset (e.g. DCL) or to the functional application (e.g. PROFILING). The functions are then called using the standard CALL or SELECT syntax, depending on the routine type. For example, DML.DELETE_TABLE_PARTITIONS(target_table_id STRING, date_partition_column STRING, start_date DATE, end_date DATE) deletes a range of table partitions as part of an idempotent transformation flow, and is called using the following syntax:
```sql
CALL `flowfunctions--us`.DML.DELETE_TABLE_PARTITIONS(target_table_id, date_partition_column, start_date, end_date);
```
Note that it is possible to drop the project identifier from the function call by setting the @@dataset_project_id system variable to the required regional project:
```sql
SET @@dataset_project_id = 'flowfunctions--us';
CALL DML.DELETE_TABLE_PARTITIONS(target_table_id, date_partition_column, start_date, end_date);
```
The flowfunctions library is currently in private beta and is being actively developed, tested and documented with a diverse set of client companies. For more information, or to request beta access, please contact email@example.com. Access is managed via Google Groups.