# Flowfunctions Introduction

The `flowfunctions` library enables the development, management and monitoring of complex and extensible data workflows within the native Google BigQuery environment.
## Motivation
Google BigQuery is a powerful and accessible foundation for any kind of data work. It is:
- Serverless, requiring no infrastructure setup, configuration or maintenance
- Scalable, to a practically infinite degree
- Pay-as-you-go billed, with optional discounted commitment for predictability at scale
BigQuery has:
- A large feature set which is constantly expanding
- A built-in console for running queries and code and inspecting data in the native environment
- A wide range of integrations for data ingestion or downstream data activities
- A comprehensive API with support for multiple languages
This makes it suitable for any scale or type of organisation or team, from a single person to a large multinational.
However, due to some fundamental practical and syntactical idiosyncrasies and constraints, it can be difficult to achieve complex tasks in a clear, modular and scalable manner.
This library aims to provide a clean, unified, functional abstraction based on elements of the core BigQuery language sets. This enables developers to write and reuse powerful, scalable, extensible and maintainable code to achieve any data objective, without requiring (but still permitting) the integration or use of external tools.
## Situation
There are a number of different language variations available in BigQuery:
- SQL - Structured Query Language
- DDL - Data Definition Language
- DML - Data Manipulation Language
- DCL - Data Control Language
- Procedural Language
- Debugging Statements
- Other Statements
Additionally, programming and automation objectives can be achieved through a broad range of different functional constructs:
- UDFs - User-Defined Functions
- Procedures
- Table Functions
- Scheduled Queries
- Remote Functions (using Cloud Functions)
BigQuery also provides access to a number of `INFORMATION_SCHEMA` views which expose metadata about BigQuery objects and activity. This metadata, in combination with a subset of the native functionality, provides the foundational tools to achieve complex data workflows within BigQuery.
However, the fundamental constraint impacting the ability to build clean, functional routines is that structural statement elements such as table IDs, column names and JSON paths cannot be set dynamically.
This means that to achieve a task requiring iteration over tables, taking an action on each one, it is possible to get the table metadata from the `INFORMATION_SCHEMA.TABLES` view, but you then need to construct a `STRING` representation of the required statement and execute it via the `EXECUTE IMMEDIATE` statement.
While this is undoubtedly powerful, it is a cumbersome paradigm which is hard to reuse or scale, and can also be difficult to read.
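As an illustration, the following sketch uses this pattern to count the rows in each table of a hypothetical dataset `my_dataset` (the dataset name and the action taken are assumptions for the example):

```sql
FOR t IN (
  SELECT table_name
  FROM my_dataset.INFORMATION_SCHEMA.TABLES
  WHERE table_type = 'BASE TABLE'
)
DO
  -- The table id cannot be parameterised directly, so the statement
  -- is built as a STRING and executed dynamically:
  EXECUTE IMMEDIATE FORMAT(
    "SELECT COUNT(*) AS row_count FROM `my_dataset.%s`",
    t.table_name
  );
END FOR;
```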
A secondary constraint in the programming environment is that, in order to receive an output from a procedure, it is necessary to `DECLARE` a variable with the exact return variable structure and pass it to the procedure, where it is updated and returned.
Whilst this approach adds negligible developer overhead in simple cases, for more complex return data structures it becomes a heavy drag on the developer workflow.
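A minimal sketch of this pattern, using a hypothetical procedure `my_dataset.get_table_stats` with a `STRUCT` output:

```sql
-- The variable must be declared with the exact return structure
-- before it can be passed to (and populated by) the procedure:
DECLARE table_stats STRUCT<table_name STRING, row_count INT64>;

CALL my_dataset.get_table_stats('my_table', table_stats);

-- The populated variable can then be used downstream:
SELECT table_stats;
```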
## Solution
The `flowfunctions` library and contextual framework build on these foundations, but aim to abstract away any requirement on the developer to manually construct and execute `STRING` statements, while also permitting programmatic definition of any function argument.
The precise mechanism used varies, however the syntax and framework is organised around the principles of:
- Readability
- Simplicity
- Consistency
- Modularity
Humans should not have to write templated SQL.
The native `JSON` data type is used for complex data exchange, as it enables functions to pass data as inputs and outputs without requiring precise schema definitions in advance. This in turn supports optional function arguments, which are otherwise not possible in BigQuery.
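For instance, optional arguments can be collected into a single `JSON` value, with any subset of keys supplied (the keys shown here are illustrative assumptions, not the library's confirmed options):

```sql
DECLARE optional_arguments JSON DEFAULT JSON '{"header": true, "compression": "GZIP"}';

-- Inside a routine, individual keys can be read without a
-- predeclared schema, so absent keys behave as unset arguments:
SELECT LAX_BOOL(optional_arguments.header) AS header;
```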
## Execution
Since BigQuery routines can only be called from code running in the same physical location, the deployment process mirrors the entire `flowfunctions` function set to all required Google Cloud regions (multi-region, dual-region and single-region). The projects are named by appending a double hyphen and the lowercase region identifier:
- US: `flowfunctions--us`
- EU: `flowfunctions--eu`
The function sets are then grouped by functional application (e.g. `EXPORT`, `LOAD`), which are themselves aligned as closely as possible to the equivalent BigQuery statements. The functions are called using the standard `CALL` or `SELECT` syntax depending on the function type. For example, the function `EXPORT.DATA(query_statement STRING, export_format STRING, export_uri STRING, optional_arguments JSON)` is equivalent to the [EXPORT DATA](https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements#export_data_statement) statement and enables programmatic exporting of arbitrary data to Google Cloud Storage:
```sql
CALL `flowfunctions--eu`.EXPORT.DATA(query_statement, export_format, export_uri, optional_arguments);
```
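A complete script might look like the following sketch (the query, bucket path and `header` key are illustrative assumptions):

```sql
DECLARE query_statement STRING DEFAULT 'SELECT * FROM `my_dataset.my_table`';
DECLARE export_format STRING DEFAULT 'CSV';
DECLARE export_uri STRING DEFAULT 'gs://my-bucket/exports/my_table_*.csv';
DECLARE optional_arguments JSON DEFAULT JSON '{"header": true}';

CALL `flowfunctions--eu`.EXPORT.DATA(query_statement, export_format, export_uri, optional_arguments);
```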
Note that it is possible to drop the project identifier from the function call by setting the `@@dataset_project_id` system variable to the required geographic library:
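```sql
-- Point unqualified routine references at the regional library:
SET @@dataset_project_id = 'flowfunctions--eu';
```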
Once this system variable has been set in the script, the syntax to export data becomes much cleaner:
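```sql
CALL EXPORT.DATA(query_statement, export_format, export_uri, optional_arguments);
```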
The `flowfunctions` library is currently in private beta and is being actively developed, tested and documented with a diverse set of client companies. For information or to request beta access, please contact flowfunctions@transformationflow.io.
Access is managed via Google Groups, so it can be granted and managed at an individual or organisation level.