This piece is part of a larger series where we explore setting up a no-cost data stack in a home lab. In an earlier installment, we covered how to pull data from CoinMarketCap’s API and orchestrate it using Airflow. Now, we’ll shift gears to focus on transforming that raw cryptocurrency data into structured insights. Think of this as moving from raw ingredients to a well-prepped meal. 🍽️
Previous articles:
Here’s what we’ll tackle in this section:
- Building a bronze layer to stage and clean the raw data.
- Upgrading it to a silver layer, where the data becomes more integrated and easier to analyze.
- Leveraging DBT to automate, document, and streamline the entire process. 🚀
Raw Data Overview 🔍
The data comes from CoinMarketCap’s API, specifically the “listings” endpoint. This endpoint delivers a paginated list of active cryptocurrencies, complete with the latest market stats. It also offers flexibility to sort by various metrics like market cap, name, symbol, or even the date the coin was added to the system.
Here’s a closer look at what this data provides:
1. coin_market_data
📈
This table contains hourly updates with key metrics for each cryptocurrency. It’s the heartbeat of the dataset, constantly tracking the market’s ebb and flow:
- Notable fields:
price
: The latest average trade price across exchanges.percent_change_1h
,percent_change_24h
,percent_change_7d
: Percentage changes in price over one hour, one day, and one week, respectively.volume_24h
,volume_7d
,volume_30d
: Rolling trading volume metrics for 24 hours, 7 days, and 30 days.market_cap
: An estimation of the coin’s market size, calculated from its circulating supply.circulating_supply
: The approximate number of coins currently in circulation.total_supply
: The total number of coins in existence, minus any that have been burned.max_supply
: The best estimate of the maximum number of coins that will ever exist.
2. coins
🪙
This table captures static details about each cryptocurrency and comes from the “coin” endpoint. It’s more descriptive, focusing on metadata rather than market dynamics:
- Key details:
id
: The unique identifier for each coin.name
: The full name of the cryptocurrency (e.g., Bitcoin).symbol
: The ticker symbol for the coin (e.g., BTC, ETH).category
: Indicates whether it’s categorized as a “coin” or a “token.”slug
: A URL-friendly version of the coin’s name.logo
: A link to the coin’s logo in PNG format (useful for dashboards or presentations). 🖼️description
: A brief summary about the coin—when available.
These two datasets form the foundation for building a structured data model. With coin_market_data
, you get the dynamic pulse of the market, while coins
gives you a solid reference point dimension. Together, they set the stage for deeper analysis and actionable insights. 🔎
What is DBT? 🤔💻
DBT, or Data Build Tool, is like the Swiss Army knife for anyone wrangling data transformations. It’s open-source, built to make life easier for analytics engineers, and, honestly, pretty versatile. So, if you’ve got raw data and need to whip it into shape, DBT is the buddy you call. In realty DBT is just a python library that creates objects in the data warehouse and abstracts all the DDL. Which is great, becuase creating tables, views and merge code is time consuming. 🚀
Here’s how it kind of plays out, like steps in a pipeline:
- Data Ingestion: You’ve got your raw data, maybe from APIs or databases. This stuff lands in your warehouse (think PostgreSQL, Snowflake, or BigQuery). DBT has no place here in my data stack.
- Staging (Bronze Layer): Now DBT steps in, You write SQL scripts to clean and organizing the objects, iterating on the raw tabular data so you can start making sense of it. In this layer I just try to apply a first normization step, filter and rename columns.
- Transformations (Silver Layer): Here’s where the magic happens. DBT takes your cleaned data, applies logic, and turns it into something way more useful. You can implement incremental strategies and dimensional modeling here. This is where I’m at with this article.
Key Features of DBT 🛡️
Why is DBT such a go-to tool? Let’s break it down:
- Modular SQL 🧩: Think of it like Lego blocks—you build small pieces (or models) that fit together to handle big tasks. Easy to use, easy to reuse.
- Code Abstraction ✍️: DBT simplifies your workflow by abstracting the need to write and maintain all the DDL scripts.
- Version Control 🔄: Scripts are stored in a folder, making it easy to integrate with a repository for versioning and backups.
- Data Lineage 🕵️♂️: Understand exactly how tables connect and depend on each other—no detective work needed.
- Testing ✅: Catch data issues early with built-in tests, so errors don’t become bigger problems.
- Documentation 📚: Automatically generated, interactive documentation ensures all models, tests, and dependencies are well-organized.
- Orchestration-Friendly ⚙️: DBT integrates seamlessly with orchestration tools like Airflow, enabling efficient scheduling and automation.
These features absolutely tranformed how I deal with data. I now insist on handling data with the above in mind. As I’ve mentioned befire it doesn’t really matter what you use for this, just that you use something
Setting Up DBT for Data Transformation
To standardize and manage these transformations, DBT was used as the transformation layer in the stack. Here’s how we structured the project:
DBT Environment Setup
- Installed DBT in a Python virtual environment.
- Configured
dbt_project.yml
andprofiles.yml
for database connections. - Created a folder structure for models:
staging/
: For bronze layer staging.silver/
: For transformed data in the silver layer.
Source Definitions
Sources were defined in schema.yml
:
|
|
Bronze Layer: Staging Raw Data
The bronze layer focuses on lightly transforming raw data to prepare it for more complex transformations. Really the only thing I am doing here is correcting some of the column naming and standardizing the keys. I’m not trying to full on transform this data for dimensional modeling just describe it in DBT so I can use these models to build more models. This level is materialized as SQL views so no duplicating or caching.
Staging coin_market_data
|
|
Staging coins
|
|
Silver Layer: Transforming Data for Analysis
The silver layer refines and integrates raw data, turning it into a structured and insightful resource for analytics. This stage focuses on combining coin_market_data
and coins
to craft a detailed coin dimension. Here’s what was done:
- Run Key Optimization 🕒: The
created_at_key
serves as the key for each API run. Since the API process takes a few seconds, the timestamp is rounded down to the nearest minute to ensure a consistent run key. - Integrating Aggregate Data 📊: While many aggregate values in coin_market_data aren’t directly needed for this analysis, it’s useful to include them in the coin dimension for context. The latest values from the transaction table are extracted and stamped into this dimension, enriching the data with recent insights.
Silver Model: Coin Dimension
This model integrates the latest market metrics into the coin dimension.
|
|
Silver Model: Market Data Fact Table
The market data provides hourly snapshots of coin activity from the API. Since this data focuses on small-cap cryptocurrencies, it highlights coins with lower trading volumes and typically lower prices. After reviewing the dataset, here’s what stood out:
Efficient Formatting 🛠️: The dataset is already in a compact format, with most columns being quantitative, which minimizes storage concerns. As a result, only minimal transformation is required.
Single Dimension Dependency 🔗: For now, the only dimension necessary for this fact table is the coins dimension. Additional dimensions might be added later as needed.
Removing Redundant Fields 🗑️: Aggregate fields in coin_market_data take up unnecessary space without adding value here. These fields are better suited for inclusion in the coin dimension, where they can provide contextual insights without duplicating calculations.
In summary, the transformation for this table mainly involves dropping unnecessary columns and streamlining the dataset for focused analytics.
|
|
Conclusion
In this first part, we:
- Staged raw cryptocurrency data into a bronze layer.
- Transformed and enriched the data into a silver layer, creating a coin dimension with integrated metrics.
- Used DBT to ensure modular, reusable, and testable transformations.
In the upcoming section, we’ll dive into developing an incremental strategy and incorporating DBT into the data pipeline. Alongside this, we’ll introduce testing methodologies to enhance observability and quickly identify any breakdowns in the pipeline. 🚦