Wumpus running up a staircase made of Discord Developer logos.

Overclocking dbt: Discord's Custom Solution in Processing Petabytes of Data

At Discord, we faced a challenge that would make most data teams flinch: scaling dbt to process petabytes of data while supporting 100+ developers simultaneously working across 2,500+ models. What started as a simple implementation quickly hit critical limitations as we worked to accommodate millions of concurrent users generating petabytes of data.

dbt (data build tool) is a command-line tool that transforms data in your warehouse and brings software engineering principles to the world of SQL. Originally developed by Fishtown Analytics, a Philadelphia-based consulting firm, dbt has grown from humble beginnings to become widely adopted by data practitioners worldwide, leading to the company's rebranding as dbt Labs to reflect the tool's prominence.

Our journey with dbt began several years ago when we were evaluating solutions that could handle our rapidly growing data needs while maintaining the flexibility and transparency that engineers at Discord value. We chose dbt primarily because of its open-source nature, which aligns with Discord's engineering philosophy of leveraging and contributing to the open-source community whenever possible.

dbt offers several key features that made it attractive for our data transformation needs:

  • Seamless integration with the other tools in our data stack (see our previous blog post about our orchestrator, Dagster!)
  • Developer-friendly experience for data transformation
  • Modular design that promotes code reusability and maintainability
  • Comprehensive testing framework to ensure robust data quality

However, our initial implementation of dbt began to buckle under the scale of Discord. We encountered frequent re-compilation of the entire dbt project, amounting to painful 20+ minute waits. The default incremental materialization strategy wasn't optimized for our data volumes. Developers found themselves overwriting each other's test tables, creating confusion and wasted effort. Without addressing these scaling challenges, our ability to deliver timely data insights would have been severely compromised.

To scale beyond dbt's standard capabilities, we've implemented custom solutions that extend its core functionality. We built a state-of-the-art dbt system, better suited to a company of Discord's size, that enhances developer productivity, prevents breaking changes, and streamlines complex calculations. The customizations we've made, which we'll detail throughout this post, have allowed us to overcome these challenges and build a robust, scalable data transformation platform that serves as the backbone of our analytics infrastructure.

This isn't just another "how we use dbt" story — it's a blueprint for extending dbt to handle truly massive scale. We’re transforming painful compile times into rapid development cycles while ensuring robust data quality, and automating complex backfills that would otherwise require extensive manual intervention.

While we use Google BigQuery as our data warehouse, our solution is largely provider-agnostic and can be applied to other cloud platforms.

Separating Data Warehouse Environments

As our team grew to 100+ developers simultaneously working on 2,500+ dbt assets, we hit a critical collaboration roadblock: devs frequently needed to modify the same tables, but with dbt's default behavior, they would constantly overwrite each other's test tables. Having 100+ developers in the same “kitchen”, of course, created confusion and a suboptimal experience for fellow devs who couldn't reliably test their changes.

Without solving this fundamental collision problem, we couldn't scale our development process, regardless of how powerful our underlying infrastructure was. It became clear that we needed a way for multiple developers to work on the same models without stepping on each other's toes.

We implemented a custom environment separation strategy by overriding dbt's generate_alias_name macro, a built-in function that controls the exact table name generated during execution. Instead of using default table names, we created a system that automatically appends developer-specific identifiers based on the execution environment:

{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
    {%- if target.name.endswith('_testing') -%}
        {{ node.name + "__" + env_var('USER').replace(".", "_") }}
    {%- else -%}
        {{ node.name }}
    {%- endif -%}
{%- endmacro %}

This is a *very* simplified version of our macro (a fuller sketch that also covers the CI/CD branch follows the list below), but it creates isolated environments:

  • For local development: Tables in the dbt_testing dataset get the developer's username appended (e.g., dim_users__chris_dong)
  • For CI/CD: Tables use pull request numbers or commit hashes as identifiers (e.g., dim_users__150513)
  • For production: Clean table names with no suffixes
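
For illustration, here is a slightly fuller (but still simplified) sketch of how the CI/CD branch might be handled. The target names and the BUILDKITE_PULL_REQUEST environment variable are assumptions for the example, not our exact implementation:

{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
    {#- Hypothetical sketch: the target names and env vars below are illustrative. -#}
    {%- if target.name.endswith('_testing') -%}
        {#- Local development: append the developer's username -#}
        {{ node.name + "__" + env_var('USER').replace(".", "_") }}
    {%- elif target.name.endswith('_ci') -%}
        {#- CI/CD: append the pull request number (with a fallback for manual runs) -#}
        {{ node.name + "__" + env_var('BUILDKITE_PULL_REQUEST', 'manual') }}
    {%- else -%}
        {#- Production: clean table names with no suffix -#}
        {{ node.name }}
    {%- endif -%}
{%- endmacro %}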

With this, developers can now work on the same models simultaneously without interference. This change eliminated confusion and rework, accelerating our development velocity and removing a constant source of friction.

For teams facing similar challenges, this approach provides a straightforward path to supporting multi-developer workflows without complex branching strategies or additional infrastructure. Now there’s less risk of too many cooks in our kitchen! 

Breaking Through Performance Barriers

As our data volumes grew to petabytes and our team expanded to 100+ developers, we hit a critical performance wall: dbt's compilation times stretched painfully long, and the standard approach to incremental processing couldn't handle our scale efficiently.

The industry-standard method of using dbt's is_incremental() macro became a major bottleneck. Using this macro in BigQuery is suboptimal since it requires a full table scan to determine the latest timestamp from dbt's previous run, which can be costly and slow at scale. Our developers were regularly waiting over 20 minutes for compilations, severely limiting productivity and creating a frustrating development experience.
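
For context, the pattern we moved away from looks like the textbook incremental model from dbt's documentation (the model and column names here are illustrative); the subquery over {{ this }} is what forces a scan of the existing table on every run:

select *
from {{ ref('stg_messages') }}
{% if is_incremental() %}
-- Scans the existing table to find the high-water mark from the previous run
where event_time > (select max(event_time) from {{ this }})
{% endif %}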

Instead, we replaced the traditional incremental approach with configurable time-based processing using dbt’s variables system to define var('start_date') and var('end_date'). This offers more flexibility than dbt's microbatch incremental feature, as it allows us to customize date ranges for each table. For instance, we might combine complete historical data from one table, the past week's data from another, and a specific date range from a third.
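
As a rough sketch, a model can then filter directly on those variables instead of calling is_incremental(). The table, columns, and config shown here are illustrative rather than our production code:

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={'field': 'ds', 'data_type': 'date'}
    )
}}

-- Process only the requested date range; no scan of the existing table is needed.
select
    ds,
    user_id,
    count(*) as message_count
from {{ ref('stg_messages') }}
where ds between date('{{ var("start_date") }}') and date('{{ var("end_date") }}')
group by ds, user_id

A run then supplies the window explicitly, for example: dbt run --select my_model --vars '{"start_date": "2024-01-01", "end_date": "2024-01-07"}'.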

However, this introduced a new challenge: modifying these date variables forced a complete dbt project recompilation, adding a delay of several minutes as we hit the dreaded message “Unable to do partial parsing because config vars, config profile, or config target have changed.” This inefficiency cascaded throughout our workflow, affecting test runs, pre-commit checks, and CI/CD processes alike. To solve this, we leveraged "dbt turbo" (inspired by this Medium post), which bypasses full compilation by strategically modifying the hash in dbt's partial parsing file:

vars_hash = FileHash.from_contents(
    "\x00".join(
        [
            stringified_cli_vars,
            getattr(config.args, "profile", "") or "",
            getattr(config.args, "target", "") or "",
            __version__,
        ]
    )
)

This open-source code from dbt-core produces a hash of your CLI variables, profile, and target. We override how this hash is computed so that changing start_date or end_date doesn't trigger a full recompile.

To make this accessible to all developers, we created a custom command called create-test-models that handles these optimizations automatically. This command streamlines the usage of additional developer-friendly features, like dbt's defer functionality, which lets us reference production assets in our sandbox testing environment. We also leveraged BigQuery's copy partitioning feature, which replaces entire partitions without write costs, making incremental processing significantly faster and cheaper than traditional MERGE operations at our scale.
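
For reference, dbt-bigquery exposes this behavior through a copy_partitions flag on the partition config when using the insert_overwrite strategy. The sketch below is a simplified illustration rather than our exact materialization settings:

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            'field': 'ds',
            'data_type': 'date',
            'copy_partitions': true
        }
    )
}}

With this set, dbt replaces affected partitions using BigQuery copy jobs instead of running a MERGE statement.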

The impact was transformative: a massive 5x speedup in dbt execution times! Developers no longer needed coffee breaks while waiting for dbt to run. More importantly, this performance leap achieved exactly what we were going for: it allowed our team to rapidly iterate on models, test changes quickly, and maintain productivity despite our massive scale.

For greater efficiency with sources, we built a custom dbt package to handle incoming raw data. Instead of manually adding sources for hundreds of new logging events, our framework automatically generates and integrates these sources as data arrives. This system extends to Python-based workflows in Dagster (including third-party data ingestion and ML pipelines), making resources immediately available in dbt. This automation significantly reduces maintenance overhead while ensuring seamless integration across all components of our data stack.

For teams hitting similar performance walls, our approach offers a path to dramatically faster development cycles, though it does require careful implementation since modifying dbt's internal metadata carries risk if not done properly.

Revolutionizing Data Backfills

The standard approach for refreshing data in dbt is using the aptly-named --full-refresh flag. While this may work at smaller volumes of data, at our petabyte scale, it simply wasn't feasible.

We noticed that, when attempting to process massive data volumes in a single query, dbt would consistently time out due to exceeding BigQuery's query limits. Even when queries succeeded, the computational costs were astronomical.

We needed a solution that would allow targeted, efficient backfills while maintaining strict data governance and controlling costs. And more importantly, we wanted to create a "fire and forget" experience where engineers could make changes and have the system intelligently determine what needed refreshing.

The solution we arrived at leverages dbt's meta field to implement a sophisticated versioning system that ensures proper data governance through controlled, documented data changes. Each dbt asset follows semantic versioning (major.minor.patch), with each version type triggering different backfill behaviors:

  • Major version changes (1.1.0 → 2.0.0): Trigger complete backfills of the modified table and all downstream dependencies
  • Minor version changes (1.1.0 → 1.2.0): Enable selective backfills of specific tables within the lineage

This versioning system works in conjunction with additional configuration parameters we developed (a combined sketch follows the list below):

  • partition_start_date: Determines the earliest date your data contains
  • backfill_partition_chunk_size: Controls how many days to process at once based on data volume, allowing us to break down massive queries into manageable chunks
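
To make the shape of this concrete, here is a hypothetical config block showing how such metadata might be attached to a model through dbt's meta field. The key names come from this post, but the exact structure and values are illustrative, not our production schema:

{{
    config(
        materialized='incremental',
        meta={
            'version': '1.2.0',
            'partition_start_date': '2020-01-01',
            'backfill_partition_chunk_size': 7
        }
    )
}}

In this sketch, bumping the version to 1.3.0 would request a selective backfill of just this table, while bumping it to 2.0.0 would cascade a full backfill to its downstream dependencies as well, processing seven days of partitions per query.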

The impact has been transformative for our development workflow. Our team can now iterate quickly on tables, with changes automatically flowing to production without manual intervention. The system intelligently handles data updates based on version changes, automating what would otherwise be complex manual processes, while maintaining a complete audit trail of all data transformations.

This also significantly reduces storage costs and processing time. Rather than maintaining complete historical data since inception, we only store and process what's necessary for analytics while maintaining full data integrity. For teams managing massive datasets, this targeting of backfills can dramatically reduce both the time and cost associated with data processing.

Given the effectiveness of our custom versioning strategy, we chose not to adopt dbt's native model versions at this time, though we continue to evaluate its potential benefits as the feature evolves.

Building macros and robust CI/CD guardrails

During our initial adoption of dbt, we quickly encountered a troubling pattern: seemingly minor changes would snowball into critical failures, sometimes breaking our entire data warehouse. With our 100+ developers working simultaneously across thousands of models, we needed a robust system to prevent these outages while maintaining development velocity.

We identified several key challenges that threatened data quality and stability:

  • Critical business metrics (like user engagement) were implemented inconsistently across models
  • Macro changes could silently impact hundreds of dependent tables without proper testing
  • Performance issues from inefficient query patterns often remained hidden until reaching production

After some tinkering, we created a comprehensive system of reusable macros paired with strict CI/CD guardrails that catch errors before they reach production. Before dbt, we relied heavily on copy-pasting code, which led to bugs and inconsistent logic between files. Jinja and macros are central to dbt: Jinja is a templating language with Python-esque syntax that lets you write dynamic SQL using programming constructs like loops and conditionals, while macros are reusable SQL snippets you can call throughout your project.
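
For readers newer to dbt, a macro is essentially a parameterized SQL snippet. A minimal, textbook-style example (not one of ours) looks like this:

{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

Calling {{ cents_to_dollars('amount_cents') }} in any model expands to the same rounding expression, so the logic lives in exactly one place.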

We leveraged several open-source packages like dbt-utils, dbt-expectations, and elementary-data, adapting them to suit our needs. Then, we built a library of standardized macros that ensure consistency across our analytics. 

A prime example is our "lookback" macros, which make weekly and monthly active user (WAU/MAU) calculations straightforward while using HyperLogLog under the hood for significant performance gains. The call below calculates aggregate metrics over the last X days of data for an arbitrary granularity:

{{ create_lookback_query(model, grain, lookback=[7, 30]) }}
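
Our actual implementation is more involved, but a hypothetical sketch of such a macro might look like the following. It assumes model is a relation (e.g., a ref()) exposing a ds date column and a user_id column, and that grain is the name of the column to aggregate by; these assumptions are for illustration only:

{% macro create_lookback_query(model, grain, lookback=[7, 30]) %}
{#- Hypothetical sketch: rolling approximate distinct users per grain using
    BigQuery HyperLogLog sketches instead of exact COUNT(DISTINCT ...). -#}
with daily_sketches as (
    select
        ds,
        {{ grain }},
        hll_count.init(user_id) as users_sketch
    from {{ model }}
    group by ds, {{ grain }}
)

select
    anchor.ds,
    anchor.{{ grain }},
    {%- for days in lookback %}
    hll_count.merge(
        if(history.ds > date_sub(anchor.ds, interval {{ days }} day), history.users_sketch, null)
    ) as active_users_{{ days }}d{{ "," if not loop.last }}
    {%- endfor %}
from daily_sketches as anchor
join daily_sketches as history
    on history.{{ grain }} = anchor.{{ grain }}
    and history.ds > date_sub(anchor.ds, interval {{ lookback | max }} day)
    and history.ds <= anchor.ds
group by anchor.ds, anchor.{{ grain }}
{% endmacro %}

Because the daily HLL sketches are built once and then merged per window, adding another lookback (say, 90 days) only adds one more merge rather than another scan of the raw data.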

We also created custom materializations to handle Discord's unique requirements. For instance, we added a merge_mode configuration that preserves older or newer rows based on timestamps, rather than always replacing data. For snapshots, we enhanced the logic to reconstruct SCD2-style tables from historical data, extending beyond dbt's default capabilities. We even created a separate materialization for static tables that need to run only once.
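
As a purely illustrative example of how a model might opt into one of these behaviors (the shape is hypothetical; custom configs like this are read inside the materialization via config.get):

-- Hypothetical: keep the newest row per key based on its timestamp instead of
-- unconditionally replacing existing data.
{{
    config(
        materialized='incremental',
        merge_mode='keep_newest'
    )
}}

The materialization would then read config.get('merge_mode') and pick the appropriate merge predicate.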

But the most transformative component was our CI/CD pipeline in Buildkite. Each pull request now faces multiple automated checks before it can be merged:

  • Cost analysis that identifies and flags expensive queries requiring optimization
  • Dependency analysis that detects potential breaking changes to downstream assets
  • Automated handling of maintenance operations such as table deprecation, UDF creation, and cluster key updates

A list of 42 Buildkite jobs. All but one have successfully passed.
Buildkite jobs that are run for each PR and must pass before it can merge

An example of these checks is a utility that analyzes macro dependencies, highlighting how many tables could be affected by a seemingly innocuous change — this encourages getting extra eyes during code review. Each macro also requires a unit test with sample cases that must pass before the macro can be updated in production.

A Github comment from “Discord-Automation” explaining which macros and models have been affected by a submitted PR.
When you modify a macro, we send a warning about its potential downstream impact

The impact has been massive. Before, data outages from broken dbt models were a fairly regular occurrence. Now, critical issues are caught during development, and our data platform has achieved a new level of reliability while maintaining development velocity. What's more, our standardized macros have simplified complex calculations, making our codebase more maintainable and consistent.

For teams scaling dbt beyond a few dozen models, investing in strong guardrails early pays enormous dividends down the road. While it requires upfront effort, the resulting stability and consistency allow teams to focus on delivering insights rather than debugging production issues.

Looking towards the future

Today, dbt powers many of our production use cases at Discord and has been serving us extremely well. With the recent SDF Labs acquisition, we're excited to reap the benefits of a next-generation dbt engine announced during dbt Developer Day and enjoy significantly faster compile times through Rust. We've also been adopting sqlglot and are working to integrate its column-level lineage capabilities.

As we look ahead, we're exploring the integration of dbt with emerging technologies. Apache Iceberg has us particularly interested as it aligns with dbt's own roadmap as highlighted in their "Iceberg is an Implementation Detail" blog post. This open table format promises improved storage costs and data management capabilities. Additionally, we're evaluating Apache Spark integration with dbt by leveraging its existing adapter for enhanced computing capabilities across our data ecosystem.

What started as a challenging journey to scale dbt for Discord’s massive data operations has evolved into a robust, custom-tailored solution that powers critical analytics across the platform. If you enjoyed reading this blog and want to help us push the boundaries of modern data engineering, check out our open roles in Data Engineering and Data Platform to join our amazing team and take our data stack to the next level!
