Introduction

We all have hobbies, right? Some people collect stamps; I like to collect data on small-cap cryptocurrencies. Because nothing says “good use of free time” like tracking obscure digital coins with names like MoonFluff and RocketPoodle.

But in all seriousness, tracking small-cap coins isn’t just about watching volatile charts (though that’s part of the fun). It’s about building a robust data pipeline, exploring how to transform raw API outputs into actionable analytics, and flexing some data engineering muscles along the way.

This article takes you through my process of setting up a pipeline that pulls crypto data, transforms it, and sets the stage for deeper analysis, all while wrestling with the CoinMarketCap API and pretending I totally know what I’m doing.


1. Why Track Small-Cap Coins?

Because it’s fun! And by “fun,” I mean intellectually engaging in that “trying to balance a data pipeline on 10,000 API credits per month” kind of way. Small-cap coins are particularly interesting because they’re chaotic, high-risk, and packed with potential. They’re the cryptocurrency world’s equivalent of indie bands: a little messy but full of possibilities. 🎸

For this project, my goal is to:

  1. Get Data: Start collecting small-cap crypto data to populate my data stack.
  2. Build a Pipeline: Use my existing setup, which includes:
    • Python + Requests: Pull raw data from the API.
    • FastAPI + Pydantic: Transform and validate data in the first stage.
    • Postgres: Store data in a relational database.
    • DBT: Create meaningful data models and analytics-ready transformations.
    • OLAP Design: Combine all this into a system primed for analysis and visualization. 📊

If nothing else, I’ll end up with something pretty in Grafana. And if a small-cap coin I track happens to explode in value? Well, we’ll pretend I had a good financial reason for this exercise all along.


2. To Scrape or Not to Scrape?

Ah, the eternal question. Web scraping might seem like the scrappy underdog, but in practice, it’s a lot like parking in a fire lane: occasionally useful, generally frowned upon, and prone to getting you in trouble. 🚫🚗

Web Scraping: The Rogue’s Path

  • Pros: Free (kind of), unregulated (mostly), and limited only by your HTML-parsing skills.
  • Cons: Violates terms of service. Risk of IP bans. Feels a bit like trying to sneak into a concert by pretending you’re the drummer.

CoinMarketCap’s API is the grown-up choice. It’s designed for data access, and more importantly, it keeps the lawyers happy. Here’s what I discovered:

  • Free Tier: 10,000 credits/month. Generous if you’re not greedy.
  • Endpoints: I’m using cryptocurrency/listings/latest because it’s the buffet of crypto data.
  • Cost: 1 credit per 100 coins. Add a convert option (e.g., BTC), and it’s an extra credit. Do I really need to know the price in Dogecoin? No. So USD it is. (See the quick credit estimate sketch after this list.)
  • Update Frequency: Every 60 seconds. Perfect for tracking volatile markets, or as I call it, “digital gambling.” 🎰
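
Before burning real requests, here’s a tiny back-of-the-envelope helper for that credit math. It’s only a sketch of the pricing rule of thumb above (1 credit per started batch of 100 coins, plus an extra credit per additional convert option); CoinMarketCap’s own docs are the source of truth for billing.

import math

def estimate_credits(num_coins: int, num_converts: int = 1) -> int:
    """Rough per-call credit estimate for listings/latest.

    Mirrors the rule of thumb above: 1 credit per started batch of 100
    coins, plus 1 extra credit for each convert option beyond the first.
    """
    batches = math.ceil(num_coins / 100)
    extra_converts = max(num_converts - 1, 0)
    return batches + extra_converts

print(estimate_credits(500))     # 5 credits for 500 coins, USD only
print(estimate_credits(500, 2))  # 6 credits if I also asked for BTC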

3. Testing the API

First, I tested the API to figure out how many credits I’d burn through and whether I’d accidentally DDoS myself. Here’s the Python script I used to pull the top 500 small-cap coins:

import math

import requests

API_KEY = "YOUR_API_KEY"
URL = "https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest"

PARAMS = {
    "start": 1,
    "limit": 500,               # Top 500 coins
    "convert": "USD",           # USD only, to avoid extra convert credits
    "market_cap_max": int(1e9)  # Small-cap threshold: $1B market cap
}

HEADERS = {
    "Accepts": "application/json",
    "X-CMC_PRO_API_KEY": API_KEY
}

response = requests.get(URL, headers=HEADERS, params=PARAMS)
response.raise_for_status()  # fail loudly on a bad key or rate limit
data = response.json()

coins_returned = len(data["data"])
credits_estimate = math.ceil(coins_returned / 100)  # ~1 credit per 100 coins
print(f"Coins returned: {coins_returned}, credits used: ~{credits_estimate}")

Results

  • Credits Used: 3 credits. It turns out there weren’t quite 500 small-cap coins in the dataset. 🤷‍♂️
  • How Often Can I Query? With 10,000 credits/month, I can query hourly (30 days × 24 hours × 3 credits = 2,160 credits/month) and still have room for error, or my inevitable curiosity. (A minimal hourly polling sketch follows.)
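
To be clear, the real scheduling will live elsewhere in the pipeline; the loop below is just a minimal sketch of what “query hourly” means in practice, reusing URL, HEADERS, and PARAMS from the script above.

import time

import requests

def fetch_listings(url: str, headers: dict, params: dict) -> dict:
    """One listings/latest call; raises on HTTP errors."""
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    return response.json()

# Minimal hourly loop; a real setup would use cron or a proper scheduler,
# plus retries and logging.
while True:
    payload = fetch_listings(URL, HEADERS, PARAMS)  # constants from the script above
    print(f"Pulled {len(payload['data'])} coins at {time.strftime('%H:%M:%S')}")
    time.sleep(60 * 60)  # sleep an hour; roughly 3 credits per iteration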

4. Exploring and Normalizing the Data

The data from the API is… dense. It’s like a buffet where everything looks great until you realize half the options are nested structures. Take the quote field, for example: it’s a nested object keyed by conversion currency, but I’m only using USD, so that’s one more thing to flatten.

Example Data Structure

{
    "id": 20947,
    "name": "Sui",
    "symbol": "SUI",
    "slug": "sui",
    "num_market_pairs": 440,
    "date_added": "2022-07-12T08:03:11.000Z",
    "tags": ["binance-launchpool", "layer-1"],
    "max_supply": 10000000000,
    "circulating_supply": 2927660018.558888,
    "total_supply": 10000000000,
    "quote": {
        "USD": {
            "price": 3.55,
            "volume_24h": 1872104986.2806106,
            "market_cap": 10408819004.025967,
            "percent_change_24h": -4.40,
            "last_updated": "2024-12-04T17:21:00.000Z"
        }
    }
}

Normalizing the Data

Here’s how I split the data into three normalized tables (a Python flattening sketch follows the list):

  1. Coins Table: Static metadata like name, symbol, and tags. 🪙
  2. Market Data Table: Transactional data like price and volume_24h. 📈
  3. Tags Table: A many-to-one table linking tags to the coin’s id. 🏷️
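
To make that split concrete, here’s a rough flattening sketch in plain Python. The column names are my own placeholders based on the example payload, not a finished schema; the production version will sit behind FastAPI and Pydantic in the next stage.

from typing import Any

def flatten_listing(raw: dict[str, Any]) -> tuple[dict, dict, list[dict]]:
    """Split one listings/latest record into coin, market-data, and tag rows."""
    usd = raw["quote"]["USD"]  # only requesting USD, so flatten this level away

    coin_row = {
        "id": raw["id"],
        "name": raw["name"],
        "symbol": raw["symbol"],
        "slug": raw["slug"],
        "date_added": raw["date_added"],
        "max_supply": raw["max_supply"],
    }

    market_row = {
        "coin_id": raw["id"],
        "price_usd": usd["price"],
        "volume_24h": usd["volume_24h"],
        "market_cap": usd["market_cap"],
        "percent_change_24h": usd["percent_change_24h"],
        "last_updated": usd["last_updated"],
    }

    tag_rows = [{"coin_id": raw["id"], "tag": tag} for tag in raw.get("tags", [])]

    return coin_row, market_row, tag_rows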

Future-Proofing: OLAP Planning

While I’m normalizing data now, I’m keeping an eye on OLAP (Online Analytical Processing) design. The goal? Build the smallest possible data structure with the most flexibility for recombining metrics like market cap and price changes into aggregate views.


5. What’s Next?

Now that I’ve explored the API and laid the groundwork for normalizing the data, the next step is to extend my FastAPI setup to:

  1. Validate the incoming data (a rough Pydantic preview follows this list). ✅
  2. Store it in my Postgres database. 🐘
  3. Let DBT work its magic to turn raw numbers into actionable insights. ✨
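
As a teaser for that validation step, here’s a rough Pydantic sketch of what one incoming record might look like. Field names follow the example payload above; the actual models, and the FastAPI wiring around them, are the subject of the next article.

from datetime import datetime

from pydantic import BaseModel

class QuoteUSD(BaseModel):
    price: float
    volume_24h: float
    market_cap: float
    percent_change_24h: float
    last_updated: datetime

class CoinListing(BaseModel):
    id: int
    name: str
    symbol: str
    slug: str
    num_market_pairs: int
    date_added: datetime
    tags: list[str]
    max_supply: float | None = None  # some coins have no max supply
    circulating_supply: float
    total_supply: float
    quote: dict[str, QuoteUSD]  # keyed by conversion currency, e.g. "USD"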

Stay tuned for the next article, where we’ll dive into FastAPI, Pydantic, and Postgres as we build the second stage of this pipeline. Spoiler: it involves more JSON parsing and a lot of coffee. ☕