DBT (Data Build Tool) Beyond the Basics: Complex Transformations Simplified with Examples
Introduction
Data is frequently referred to as the “new oil” because it is such a potent tool for decision-making in the modern world. However, raw data, like crude oil, is not very useful in its original form. This is where the Data Maturity Model comes in: it helps organizations assess their current data landscape and plan their journey toward data transformation.
Tooling is also an essential component of data management. Tools like the Data Build Tool (DBT) help teams build a flexible, maintainable, and reliable data architecture.
Understanding the Data Maturity Model
The Data Maturity Model is a framework for evaluating an organization’s data management processes and pinpointing areas for improvement. It consists of several stages, each corresponding to a different degree of data maturity:
- Ad Hoc: Data is stored and handled in silos, and governance is inconsistent.
- Awareness: Organizations recognize the significance of data and begin taking incremental steps toward better management.
- Defined: Formal data management procedures are established, and data governance becomes a top priority.
- Managed: Governance, integration, and data quality are all actively managed.
- Optimized: Continuous improvement is practiced, and data-driven decision-making is ingrained in the organization’s culture.
The Evolution of Data Architecture: Warehouses, Lakes, and Lakehouses
Let’s examine the development of data architecture before delving into DBT:
- Data Warehouses: Traditional data warehouses are designed to store structured data and run structured business intelligence queries.
- Data Lakes: Data lakes were created to store enormous amounts of raw, unstructured data. They let businesses store data without a predetermined schema, so it can be used for a variety of analytics and processing applications.
- Lakehouses: The lakehouse architecture combines the best features of data lakes and data warehouses. It enables integrated storage and processing, giving users the best of both worlds.
Introducing DBT: Data Transformation Made Easy
DBT (Data Build Tool) is an open-source command-line tool that transforms raw data in a warehouse (or lakehouse) into tables suitable for analysis. It streamlines the data transformation process, making it easier to manage, test, and collaborate on. Here is how DBT fits into the modern data stack:
- Extract & Load: Raw data from various sources is loaded into the data warehouse (or lakehouse).
- Transform with DBT: DBT applies transformations to shape the raw data into meaningful tables for analysis.
- Analyze: Analysts and data scientists query the transformed data to gain insights and make informed decisions.
- Visualize & Report: The analyzed data is presented with visualization tools in an approachable, shareable form.
Slowly Changing Dimensions (SCD) and DBT
Slowly Changing Dimensions (SCD) in data warehousing describe how dimension data evolves over time. The different SCD types (Type 1, Type 2, and Type 3) define how new and historical values are handled. DBT provides strong functionality for managing SCDs, ensuring that the historical integrity of data is preserved throughout transformations, as sketched below.
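As a concrete illustration, DBT snapshots are the built-in mechanism for SCD Type 2: each time a tracked record changes, a new versioned row is recorded. Below is a minimal sketch; the `airbnb`/`listings` source name and the `listing_id`/`updated_at` columns are assumptions for illustration.

```sql
-- snapshots/listings_snapshot.sql (illustrative sketch)
{% snapshot listings_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='listing_id',   -- assumed primary key of the source
        strategy='timestamp',
        updated_at='updated_at'    -- assumed last-modified column
    )
}}

-- Source name is hypothetical; point this at the raw listings table.
SELECT * FROM {{ source('airbnb', 'listings') }}

{% endsnapshot %}
```

Each time `dbt snapshot` runs, changed rows receive a new version with `dbt_valid_from`/`dbt_valid_to` columns, preserving full history (SCD Type 2).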
Why DBT?
The Data Build Tool (DBT) helps data analysts and engineers transform the data in their warehouse more efficiently. It lets them write modular SQL queries, which are then executed against the warehouse in the proper order. Here is why it has become so popular:
- Version control: Tracks modifications to SQL code.
- Testing: Verifies the integrity of data transformations.
- Documentation: Documentation for analytics code and its dependencies is generated automatically.
Airbnb’s Data Ecosystem with DBT
Airbnb has a complex data environment spanning numerous dimensions, including listings, users, and transactions. They organize their DBT projects as follows:
Source Models (Src)
Airbnb uses DBT to create a source layer that holds unprocessed data from several sources, including transaction data, user activity logs, and third-party data.
The models/src/src_listing.sql code serves as a foundational layer in the DBT project by transforming the raw listings data from Airbnb’s database. It uses a common table expression (CTE) named raw_listings to fetch all records from AIRBNB.RAW.RAW_LISTINGS. The SQL query then selectively picks and renames relevant columns, such as id to listing_id and price to price_str, to prepare the data for further transformation and analysis. This “source” model thus acts as a cleaned-up version of the original raw data.
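The full model is not reproduced here, but based on the description above it looks roughly like the sketch below; columns other than the renamed `id`/`listing_id` and `price`/`price_str` pairs are assumptions for illustration.

```sql
-- models/src/src_listing.sql (sketch based on the description above)
WITH raw_listings AS (
    SELECT * FROM AIRBNB.RAW.RAW_LISTINGS
)
SELECT
    id AS listing_id,       -- renamed per the description
    name AS listing_name,   -- assumed additional columns
    room_type,
    minimum_nights,
    host_id,
    price AS price_str,     -- renamed per the description
    created_at,
    updated_at
FROM raw_listings
```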
Dimensional Models (Dim)
The code in the dim_listings_cleansed.sql file is a DBT model that produces a tidy dimensional (DIM) table for property listings. It reads from the src_listing source model and converts the data into an organized form. In particular, it fixes inconsistencies such as a minimum_nights value of 0 by replacing it with 1, and it turns the string field price_str into a number by stripping the dollar sign (“$”). The result is a clean, consistent dataset that makes future analyses and queries about Airbnb’s property listings easier.
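A sketch of what such a model might look like follows, assuming Snowflake SQL (consistent with the AIRBNB.RAW database referenced earlier); the exact column list is illustrative.

```sql
-- models/dim/dim_listings_cleansed.sql (illustrative sketch)
WITH src_listing AS (
    SELECT * FROM {{ ref('src_listing') }}
)
SELECT
    listing_id,
    listing_name,
    room_type,
    CASE
        WHEN minimum_nights = 0 THEN 1   -- fix inconsistent zero values
        ELSE minimum_nights
    END AS minimum_nights,
    host_id,
    REPLACE(price_str, '$')::NUMBER(10, 2) AS price,  -- strip "$", cast to number
    created_at,
    updated_at
FROM src_listing
```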
Fact Models (Fact)
Metrics and measures of business operations are stored in fact tables.
The fact_reviews.sql model in the example code is a DBT fact model created to gather and store review data. It is configured as an incremental model, so rather than rebuilding the entire table, it merely appends new data. The SQL retrieves reviews from the src_reviews source model and removes any rows where the review_text field is null. When running incrementally, it only processes records with a review_date later than the maximum review_date already present in the table. The table is kept current while the data processed is kept to a minimum.
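In DBT, this incremental pattern is expressed with the `is_incremental()` guard; a minimal sketch, assuming the column names described above, might look like this:

```sql
-- models/fct/fact_reviews.sql (illustrative sketch)
{{
    config(materialized='incremental')
}}

WITH src_reviews AS (
    SELECT * FROM {{ ref('src_reviews') }}
)
SELECT *
FROM src_reviews
WHERE review_text IS NOT NULL   -- drop reviews without text
{% if is_incremental() %}
  -- On incremental runs, only pick up reviews newer than what is already loaded.
  AND review_date > (SELECT MAX(review_date) FROM {{ this }})
{% endif %}
```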
Jinja Templates
DBT uses Jinja templating to build reusable and dynamic SQL queries.
In the given code, Jinja templating is used to dynamically generate SQL queries in DBT. The {{ config(materialized='table') }} directive specifies that the output should be materialized as a table. {{ fields('reviewer_name') }} dynamically fetches the field reviewer_name. The {{ ref('fact_reviews') }} function references another DBT model named fact_reviews, ensuring dependencies are handled. GROUP BY 1 then groups the output by the first selected column, which in this case is reviewer_name. Overall, Jinja in DBT allows for reusable, modular SQL code.
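Putting those pieces together, the described model might look like the sketch below. Note that `fields` is not a DBT built-in; it is assumed here to be a project-specific macro, so a minimal hypothetical definition is shown alongside.

```sql
-- macros/fields.sql (hypothetical helper macro assumed by the model)
{% macro fields(column_name) %}
    {{ column_name }}
{% endmacro %}
```

```sql
-- models/mart/reviewer_stats.sql (hypothetical model name)
{{ config(materialized='table') }}

SELECT
    {{ fields('reviewer_name') }},   -- renders as: reviewer_name
    COUNT(*) AS review_count
FROM {{ ref('fact_reviews') }}
GROUP BY 1                           -- group by the first column (reviewer_name)
```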
Testing
DBT makes it simple to test the data models, which helps assure data integrity.
The given code shows how DBT testing for data models works. Column-level checks for dim_listings_cleansed are defined in models/schema.yml. For instance, host_id must not be null and must have a matching record in the dim_hosts_cleansed table, while listing_id must be unique and not null. Custom tests can also be specified, such as checking for acceptable room_type values. The second example, tests/dim_listings_minimum_nights.sql, shows how to create a singular test that finds rows where minimum_nights is less than 1. These tests assure the integrity and quality of the data.
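The two files described above might look roughly like the following sketches; the accepted room_type values are illustrative assumptions.

```yaml
# models/schema.yml (illustrative sketch)
version: 2

models:
  - name: dim_listings_cleansed
    columns:
      - name: listing_id
        tests:
          - unique
          - not_null
      - name: host_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_hosts_cleansed')
              field: host_id
      - name: room_type
        tests:
          - accepted_values:
              values: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']
```

A singular test is just a SQL file that returns the offending rows; the test fails if any rows come back:

```sql
-- tests/dim_listings_minimum_nights.sql (illustrative sketch)
SELECT *
FROM {{ ref('dim_listings_cleansed') }}
WHERE minimum_nights < 1
```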
Macros
Macros in DBT are reusable pieces of code that can be applied across multiple models.
The no_nulls_in_columns macro is intended to check every row of a specified model for NULL values in any column. The SQL query is generated dynamically using DBT’s Jinja templating: a for-loop iterates over the column names returned by adapter.get_columns_in_relation(model), adding a condition that checks each column for NULL. The resulting query surfaces any rows that contain a NULL in any column, which makes the macro useful as a data quality test.
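A common way to write such a macro is sketched below; invoked from a singular test, it fails whenever any row contains a NULL.

```sql
-- macros/no_nulls_in_columns.sql (illustrative sketch)
{% macro no_nulls_in_columns(model) %}
    SELECT * FROM {{ model }} WHERE
    {% for col in adapter.get_columns_in_relation(model) %}
        -- One condition per column, chained with OR
        {{ col.column }} IS NULL OR
    {% endfor %}
    FALSE   -- closes the trailing OR so the generated SQL stays valid
{% endmacro %}
```

It can then be applied to any model from a singular test:

```sql
-- tests/no_nulls_in_dim_listings.sql (hypothetical usage)
{{ no_nulls_in_columns(ref('dim_listings_cleansed')) }}
```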
Documentation
The documentation that DBT generates automatically makes it easier to understand the structure and lineage of the data models.
The documentation for the minimum_nights field of dim_listings_cleansed is kept in the models/doc.md file. It states that this field represents the minimum number of nights required to book a listing. An important caveat to note is that minimum_nights may be 0 for older entries in the source tables. To ensure data integrity and consistency, the cleansing logic resets such instances to a minimum value of 1. This documentation is essential because it makes the assumptions and data transformations clear for anyone working with the data model.
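In DBT, such field-level documentation lives in a docs block inside a markdown file; a sketch of the models/doc.md entry, with a hypothetical block name, might be:

```markdown
{% docs dim_listing_cleansed__minimum_nights %}
Minimum number of nights required to book a listing.

Note: in the source tables, older records may contain a value of 0.
The cleansing step resets these to a minimum of 1 to preserve data integrity.
{% enddocs %}
```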
Conclusion
The Data Maturity Model gives organizations a roadmap for enhancing their data capabilities. Tools like DBT accelerate this journey by offering a standardized way to manage data transformation tasks. Airbnb is a great example of a company using both techniques and tools to create a data-centric culture.
DBT’s features, such as version control, testing, and documentation, allow Airbnb’s data team to concentrate more on producing insights and less on maintaining data, helping them advance up the data maturity curve.