You need to extract metadata from websites—titles, descriptions, images, Open Graph tags. Should you build your own scraper or use a metadata API? This guide breaks down both approaches with real numbers on cost, complexity, and maintenance.
The Problem: Extracting Website Metadata
Whether you're building link previews, an SEO tool, or a content aggregator, you need to fetch data from external websites. Sounds simple, right?
// Naive approach
const response = await fetch('https://example.com');
const html = await response.text();
// Parse HTML and extract meta tags...
But this simple approach fails in production:
- CORS blocks browser requests to external domains
- JavaScript-rendered pages return empty HTML
- Rate limiting from target sites blocks your requests
- Edge cases break your parser constantly
You have two options: build a robust scraper or use a metadata API.
Option 1: Build Your Own Scraper
Let's look at what it takes to build a production-grade metadata scraper.
Basic Architecture
┌─────────────────────────────────────────────────────────────┐
│ Your Application │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Scraper Service │
├─────────────────────────────────────────────────────────────┤
│ URL Queue → Rate Limiter → Fetcher → Parser → Cache │
│ │ │
│ ▼ │
│ Headless Browser │
│ (for JS-rendered pages) │
└─────────────────────────────────────────────────────────────┘
Implementation: Basic Scraper
Here's a minimal Node.js scraper:
// scraper.ts
import * as cheerio from 'cheerio';
interface Metadata {
title: string;
description: string;
image: string | null;
favicon: string | null;
}
export async function scrapeMetadata(url: string): Promise<Metadata> {
const response = await fetch(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (compatible; MetadataBot/1.0)',
},
});
if (!response.ok) {
throw new Error(`HTTP ${response.status}`);
}
const html = await response.text();
const $ = cheerio.load(html);
return {
title:
$('meta[property="og:title"]').attr('content') ||
$('meta[name="twitter:title"]').attr('content') ||
$('title').text() ||
'',
description:
$('meta[property="og:description"]').attr('content') ||
$('meta[name="twitter:description"]').attr('content') ||
$('meta[name="description"]').attr('content') ||
'',
image:
$('meta[property="og:image"]').attr('content') ||
$('meta[name="twitter:image"]').attr('content') ||
null,
favicon: resolveUrl($('link[rel="icon"]').attr('href'), url),
};
}
function resolveUrl(path: string | undefined, base: string): string | null {
if (!path) return null;
try {
return new URL(path, base).href;
} catch {
return null;
}
}
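Calling it looks like this (example.com is just a stand-in for any URL you want to preview):

// usage example for the scraper above
const meta = await scrapeMetadata('https://example.com');
console.log(meta.title);  // "Example Domain"
console.log(meta.image);  // null (example.com has no og:image or twitter:image tag)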
The Problems Start
This basic scraper fails on many sites. Let's fix the issues one by one:
Problem 1: JavaScript-Rendered Pages
Many modern sites (React, Vue, Angular) render content client-side:
// scrapeMetadata('https://react-spa.com') returns empty data!
Solution: Add Puppeteer
import puppeteer from 'puppeteer';
async function scrapeWithBrowser(url: string): Promise<string> {
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox'],
});
try {
const page = await browser.newPage();
await page.setUserAgent('Mozilla/5.0...');
await page.goto(url, { waitUntil: 'networkidle0', timeout: 30000 });
return await page.content();
} finally {
await browser.close();
}
}
Cost: Puppeteer uses 50-200MB RAM per instance. Running 10 concurrent scrapes needs 2GB RAM minimum.
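One way to keep that cost down is to reuse a single browser instance and open a new page (tab) per scrape instead of launching a full browser each time. A rough sketch, not a full pool implementation (pair it with the per-domain limiter below to bound concurrency):

import puppeteer, { Browser } from 'puppeteer';

let browser: Browser | null = null;

// Lazily launch one shared browser instead of one per request
async function getBrowser(): Promise<Browser> {
  if (!browser || !browser.isConnected()) {
    browser = await puppeteer.launch({
      headless: 'new',
      args: ['--no-sandbox', '--disable-setuid-sandbox'],
    });
  }
  return browser;
}

async function renderPage(url: string): Promise<string> {
  const page = await (await getBrowser()).newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 30000 });
    return await page.content();
  } finally {
    await page.close(); // close the tab, keep the browser alive
  }
}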
Problem 2: Rate Limiting
Sites block rapid requests:
// After 10 requests to the same domain:
// HTTP 429 Too Many Requests
Solution: Add rate limiting per domain
import Bottleneck from 'bottleneck';
const limiters = new Map<string, Bottleneck>();
function getLimiter(domain: string): Bottleneck {
if (!limiters.has(domain)) {
limiters.set(domain, new Bottleneck({
maxConcurrent: 1,
minTime: 2000, // 1 request per 2 seconds per domain
}));
}
return limiters.get(domain)!;
}
async function rateLimitedFetch(url: string): Promise<Response> {
const domain = new URL(url).hostname;
const limiter = getLimiter(domain);
return limiter.schedule(() => fetch(url));
}
Problem 3: Caching
Fetching the same URL repeatedly wastes resources:
import { Redis } from 'ioredis';
const redis = new Redis();
const CACHE_TTL = 3600; // 1 hour
async function getCachedMetadata(url: string): Promise<Metadata | null> {
const cached = await redis.get(`meta:${url}`);
return cached ? JSON.parse(cached) : null;
}
async function cacheMetadata(url: string, data: Metadata): Promise<void> {
await redis.set(`meta:${url}`, JSON.stringify(data), 'EX', CACHE_TTL);
}
Problem 4: Error Handling
Real-world URLs fail in many ways:
async function robustScrape(url: string): Promise<Metadata> {
try {
// Validate URL
const parsed = new URL(url);
if (!['http:', 'https:'].includes(parsed.protocol)) {
throw new Error('Invalid protocol');
}
// Check cache first
const cached = await getCachedMetadata(url);
if (cached) return cached;
// Try simple fetch first
let html: string;
    try {
      const response = await rateLimitedFetch(url);
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      html = await response.text();
    } catch {
      // Fall back to the headless browser for problematic sites
      html = await scrapeWithBrowser(url);
    }
    // parseHtml() is the cheerio extraction from scrapeMetadata above,
    // refactored to accept already-fetched HTML
    const metadata = parseHtml(html, url);
await cacheMetadata(url, metadata);
return metadata;
} catch (error) {
// Return minimal fallback
return {
title: new URL(url).hostname,
description: '',
image: null,
favicon: null,
};
}
}
Full Scraper Infrastructure
A production scraper needs:
| Component | Purpose | Technology |
|---|---|---|
| Queue | Job management | Redis + BullMQ |
| Browser Pool | JS rendering | Puppeteer cluster |
| Rate Limiter | Respect site limits | Bottleneck |
| Cache | Avoid re-fetching | Redis |
| Proxy Rotation | Avoid IP bans | Proxy service |
| Monitoring | Track failures | Prometheus/Grafana |
| Retry Logic | Handle transient errors | Exponential backoff |
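To make the table concrete, here's a rough sketch of how the queue and retry pieces might be wired up with BullMQ. It assumes the robustScrape function built above is exported from a scraper module and that Redis runs locally; the connection details are placeholders:

// queue.ts
import { Queue, Worker } from 'bullmq';
import { robustScrape } from './scraper';

const connection = { host: '127.0.0.1', port: 6379 };

// Producer: enqueue URLs with retries and exponential backoff
export const metadataQueue = new Queue('metadata', { connection });

export async function enqueueUrl(url: string): Promise<void> {
  await metadataQueue.add(
    'scrape',
    { url },
    { attempts: 3, backoff: { type: 'exponential', delay: 5000 } },
  );
}

// Consumer: a worker that runs the scraper with bounded concurrency
const worker = new Worker(
  'metadata',
  async (job) => robustScrape(job.data.url),
  { connection, concurrency: 5 },
);

worker.on('failed', (job, err) => {
  console.error(`Scrape failed for ${job?.data.url}: ${err.message}`);
});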
Cost Analysis: DIY Scraper
| Item | Cost |
|---|---|
| Server (4 GB RAM, 2 CPU) | $40-80/month |
| Redis (caching) | $15-30/month |
| Proxy service | $50-200/month |
| Monitoring | $20-50/month |
| Total infrastructure | $125-360/month |
| Engineering time (one-time setup) | 40-80 hours |
| Maintenance | 5-10 hours/month |
Option 2: Use a Metadata API
APIs handle all the complexity for you:
async function getMetadata(url: string): Promise<Metadata> {
const response = await fetch(
`https://api.katsau.com/v1/extract?url=${encodeURIComponent(url)}`,
{
headers: { 'Authorization': 'Bearer YOUR_API_KEY' }
}
);
const { data } = await response.json();
return data;
}
That's it. One API call, all the metadata you need.
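Calling it from application code is a one-liner (the URL here is just a placeholder, and the exact response fields depend on the plan and endpoint you use):

const preview = await getMetadata('https://example.com/blog/launch-post');
// preview.title, preview.description, and preview.image are ready to render as a link preview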
What the API Handles
| Challenge | DIY Solution | API Solution |
|---|---|---|
| CORS | Backend service | Handled |
| JS rendering | Puppeteer cluster | Handled |
| Rate limiting | Per-domain limiters | Handled |
| Caching | Redis setup | Handled |
| Proxy rotation | Proxy service | Handled |
| Edge cases | Constant fixes | Handled |
| Uptime | Your responsibility | 99.9% SLA |
Cost Analysis: API
| Usage | Monthly Cost |
|---|---|
| 1,000 requests | Free |
| 10,000 requests | ~$20 |
| 50,000 requests | ~$50 |
| 100,000 requests | ~$80 |
No infrastructure. No maintenance. Predictable costs.
When to Build vs Buy
Build Your Own Scraper When:
- Extreme customization - You need very specific data extraction
- Massive scale - Millions of URLs per day
- Sensitive data - Cannot send URLs to third parties
- Learning - Educational project
Use an API When:
- Time to market - Need it working today
- Moderate scale - Thousands to hundreds of thousands of URLs
- Reliability - Can't afford downtime
- Focus - Want to build product, not infrastructure
Real-World Comparison
Let's compare both approaches for a real use case: building link previews for a chat app.
Scenario: 10,000 link previews/month
DIY Approach:
Setup time: 40 hours × $100/hour = $4,000
Monthly infra: $150
Monthly maintenance: 5 hours × $100 = $500
First year cost: $4,000 + (12 × $650) = $11,800
API Approach:
Setup time: 2 hours × $100/hour = $200
Monthly cost: $20
First year cost: $200 + (12 × $20) = $440
Savings with API: $11,360 in year one
Scenario: 500,000 link previews/month
DIY Approach:
Setup time: 80 hours × $100/hour = $8,000
Monthly infra: $500 (larger servers, more proxies)
Monthly maintenance: 10 hours × $100 = $1,000
First year cost: $8,000 + (12 × $1,500) = $26,000
API Approach:
Setup time: 2 hours × $100/hour = $200
Monthly cost: $200 (enterprise plan)
First year cost: $200 + (12 × $200) = $2,600
Savings with API: $23,400 in year one
Even at high scale, APIs often win on total cost.
Hybrid Approach
Some teams use both:
// `cache` is your own store (e.g. the Redis helpers above);
// `apiClient` is a thin wrapper around the metadata API
async function getMetadata(url: string): Promise<Metadata> {
  // Check your cache first
const cached = await cache.get(url);
if (cached) return cached;
// Use API for fresh data
const data = await apiClient.extract(url);
// Cache with your own TTL
await cache.set(url, data, { ttl: 86400 }); // 24 hours
return data;
}
This gives you:
- Control over caching strategy
- Reliability of professional API
- Cost optimization through local caching
Making the Decision
Ask yourself these questions:
| Question | Build | Buy |
|---|---|---|
| Do I need this working in < 1 week? | ❌ | ✅ |
| Is metadata extraction my core product? | ✅ | ❌ |
| Do I have DevOps resources? | ✅ | ❌ |
| Is my budget < $500/month? | ❌ | ✅ |
| Do I need 99.9% uptime? | ❌ | ✅ |
| Am I scraping > 1M URLs/month? | ✅ | ❌ |
Conclusion
For most teams, a metadata API is the right choice. The math is clear:
- Lower total cost (infrastructure + engineering time)
- Faster time to market (days vs weeks)
- Better reliability (professional SLA vs DIY monitoring)
- Focus on your product (not scraping infrastructure)
Build your own scraper only if metadata extraction is your core business or you have very specific requirements that APIs can't meet.
Ready to stop maintaining scrapers? Try Katsau's metadata API free — 1,000 requests/month, no credit card required.