You need to extract metadata from websites—titles, descriptions, images, Open Graph tags. Should you build your own scraper or use a metadata API? This guide breaks down both approaches with real numbers on cost, complexity, and maintenance.
The Problem: Extracting Website Metadata
Whether you're building link previews, an SEO tool, or a content aggregator, you need to fetch data from external websites. Sounds simple, right?
// Naive approach
const response = await fetch('https://example.com');
const html = await response.text();
// Parse HTML and extract meta tags...
But this simple approach fails in production:
- CORS blocks browser requests to external domains
- JavaScript-rendered pages return empty HTML
- Rate limiting from target sites blocks your requests
- Edge cases break your parser constantly
You have two options: build a robust scraper or use a metadata API.
Option 1: Build Your Own Scraper
Let's look at what it takes to build a production-grade metadata scraper.
Basic Architecture
┌─────────────────────────────────────────────────────────────┐
│ Your Application │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Scraper Service │
├─────────────────────────────────────────────────────────────┤
│ URL Queue → Rate Limiter → Fetcher → Parser → Cache │
│ │ │
│ ▼ │
│ Headless Browser │
│ (for JS-rendered pages) │
└─────────────────────────────────────────────────────────────┘
Implementation: Basic Scraper
Here's a minimal Node.js scraper:
// scraper.ts
import * as cheerio from 'cheerio';
interface Metadata {
title: string;
description: string;
image: string | null;
favicon: string | null;
}
export async function scrapeMetadata(url: string): Promise<Metadata> {
const response = await fetch(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (compatible; MetadataBot/1.0)',
},
});
if (!response.ok) {
throw new Error(`HTTP ${response.status}`);
}
const html = await response.text();
const $ = cheerio.load(html);
return {
title:
$('meta[property="og:title"]').attr('content') ||
$('meta[name="twitter:title"]').attr('content') ||
$('title').text() ||
'',
description:
$('meta[property="og:description"]').attr('content') ||
$('meta[name="twitter:description"]').attr('content') ||
$('meta[name="description"]').attr('content') ||
'',
image:
$('meta[property="og:image"]').attr('content') ||
$('meta[name="twitter:image"]').attr('content') ||
null,
favicon: resolveUrl($('link[rel="icon"]').attr('href'), url),
};
}
function resolveUrl(path: string | undefined, base: string): string | null {
if (!path) return null;
try {
return new URL(path, base).href;
} catch {
return null;
}
}
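Calling it looks like this (example.com is just a stand-in for any URL you want to preview):

// usage example for the scraper above
const meta = await scrapeMetadata('https://example.com');
console.log(meta.title);  // "Example Domain"
console.log(meta.image);  // null (example.com has no og:image or twitter:image tag)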
The Problems Start
This basic scraper fails on many sites. Let's fix the issues one by one:
Problem 1: JavaScript-Rendered Pages
Many modern sites (React, Vue, Angular) render content client-side:
// scrapeMetadata('https://react-spa.com') returns empty data!
Solution: Add Puppeteer
import puppeteer from 'puppeteer';
async function scrapeWithBrowser(url: string): Promise<string> {
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox'],
});
try {
const page = await browser.newPage();
await page.setUserAgent('Mozilla/5.0...');
await page.goto(url, { waitUntil: 'networkidle0', timeout: 30000 });
return await page.content();
} finally {
await browser.close();
}
}
Cost: Puppeteer uses 50-200MB RAM per instance. Running 10 concurrent scrapes needs 2GB RAM minimum.
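One way to keep that cost down is to reuse a single browser instance and open a new page (tab) per scrape instead of launching a full browser each time. A rough sketch, not a full pool implementation (pair it with the per-domain limiter below to bound concurrency):

import puppeteer, { Browser } from 'puppeteer';

let browser: Browser | null = null;

// Lazily launch one shared browser instead of one per request
async function getBrowser(): Promise<Browser> {
  if (!browser || !browser.isConnected()) {
    browser = await puppeteer.launch({
      headless: 'new',
      args: ['--no-sandbox', '--disable-setuid-sandbox'],
    });
  }
  return browser;
}

async function renderPage(url: string): Promise<string> {
  const page = await (await getBrowser()).newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 30000 });
    return await page.content();
  } finally {
    await page.close(); // close the tab, keep the browser alive
  }
}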
Problem 2: Rate Limiting
Sites block rapid requests:
// After 10 requests to the same domain:
// HTTP 429 Too Many Requests
Solution: Add rate limiting per domain
import Bottleneck from 'bottleneck';
const limiters = new Map<string, Bottleneck>();
function getLimiter(domain: string): Bottleneck {
if (!limiters.has(domain)) {
limiters.set(domain, new Bottleneck({
maxConcurrent: 1,
minTime: 2000, // 1 request per 2 seconds per domain
}));
}
return limiters.get(domain)!;
}
async function rateLimitedFetch(url: string): Promise<Response> {
const domain = new URL(url).hostname;
const limiter = getLimiter(domain);
return limiter.schedule(() => fetch(url));
}
Problem 3: Caching
Fetching the same URL repeatedly wastes resources:
import { Redis } from 'ioredis';
const redis = new Redis();
const CACHE_TTL = 3600; // 1 hour
async function getCachedMetadata(url: string): Promise<Metadata | null> {
const cached = await redis.get(`meta:${url}`);
return cached ? JSON.parse(cached) : null;
}
async function cacheMetadata(url: string, data: Metadata): Promise<void> {
await redis.set(`meta:${url}`, JSON.stringify(data), 'EX', CACHE_TTL);
}
Problem 4: Error Handling
Real-world URLs fail in many ways:
async function robustScrape(url: string): Promise<Metadata> {
try {
// Validate URL
const parsed = new URL(url);
if (!['http:', 'https:'].includes(parsed.protocol)) {
throw new Error('Invalid protocol');
}
// Check cache first
const cached = await getCachedMetadata(url);
if (cached) return cached;
// Try simple fetch first
let html: string;
    try {
      const response = await rateLimitedFetch(url);
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      html = await response.text();
    } catch {
      // Fall back to the headless browser for problematic sites
      html = await scrapeWithBrowser(url);
    }
    // parseHtml() is the cheerio extraction from scrapeMetadata above,
    // refactored to accept already-fetched HTML
    const metadata = parseHtml(html, url);
await cacheMetadata(url, metadata);
return metadata;
} catch (error) {
// Return minimal fallback
return {
title: new URL(url).hostname,
description: '',
image: null,
favicon: null,
};
}
}
Full Scraper Infrastructure
A production scraper needs:
| Component | Purpose | Technology |
|---|---|---|
| Queue | Job management | Redis + BullMQ |
| Browser Pool | JS rendering | Puppeteer cluster |
| Rate Limiter | Respect site limits | Bottleneck |
| Cache | Avoid re-fetching | Redis |
| Proxy Rotation | Avoid IP bans | Proxy service |
| Monitoring | Track failures | Prometheus/Grafana |
| Retry Logic | Handle transient errors | Exponential backoff |
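To make the table concrete, here's a rough sketch of how the queue and retry pieces might be wired up with BullMQ. It assumes the robustScrape function built above is exported from a scraper module and that Redis runs locally; the connection details are placeholders:

// queue.ts
import { Queue, Worker } from 'bullmq';
import { robustScrape } from './scraper';

const connection = { host: '127.0.0.1', port: 6379 };

// Producer: enqueue URLs with retries and exponential backoff
export const metadataQueue = new Queue('metadata', { connection });

export async function enqueueUrl(url: string): Promise<void> {
  await metadataQueue.add(
    'scrape',
    { url },
    { attempts: 3, backoff: { type: 'exponential', delay: 5000 } },
  );
}

// Consumer: a worker that runs the scraper with bounded concurrency
const worker = new Worker(
  'metadata',
  async (job) => robustScrape(job.data.url),
  { connection, concurrency: 5 },
);

worker.on('failed', (job, err) => {
  console.error(`Scrape failed for ${job?.data.url}: ${err.message}`);
});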
Cost Analysis: DIY Scraper
| Item | Cost |
|---|---|
| Server (4 GB RAM, 2 CPU) | $40-80/month |
| Redis (caching) | $15-30/month |
| Proxy service | $50-200/month |
| Monitoring | $20-50/month |
| Total infrastructure | $125-360/month |
| Engineering time (one-time setup) | 40-80 hours |
| Maintenance | 5-10 hours/month |
Option 2: Use a Metadata API
APIs handle all the complexity for you:
async function getMetadata(url: string): Promise<Metadata> {
const response = await fetch(
`https://api.katsau.com/v1/extract?url=${encodeURIComponent(url)}`,
{
headers: { 'Authorization': 'Bearer YOUR_API_KEY' }
}
);
const { data } = await response.json();
return data;
}
That's it. One API call, all the metadata you need.
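Calling it from application code is a one-liner (the URL here is just a placeholder, and the exact response fields depend on the plan and endpoint you use):

const preview = await getMetadata('https://example.com/blog/launch-post');
// preview.title, preview.description, and preview.image are ready to render as a link preview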
What the API Handles
| Challenge | DIY Solution | API Solution |
|---|---|---|
| CORS | Backend service | Handled |
| JS rendering | Puppeteer cluster | Handled |
| Rate limiting | Per-domain limiters | Handled |
| Caching | Redis setup | Handled |
| Proxy rotation | Proxy service | Handled |
| Edge cases | Constant fixes | Handled |
| Uptime | Your responsibility | 99.9% SLA |
Cost Analysis: API
| Usage | Monthly Cost |
|---|---|
| 1,000 requests | Free |
| 10,000 requests | ~$20 |
| 50,000 requests | ~$50 |
| 100,000 requests | ~$80 |
No infrastructure. No maintenance. Predictable costs.
When to Build vs Buy
Build Your Own Scraper When:
- Extreme customization - You need very specific data extraction
- Massive scale - Millions of URLs per day
- Sensitive data - Cannot send URLs to third parties
- Learning - Educational project
Use an API When:
- Time to market - Need it working today
- Moderate scale - Thousands to hundreds of thousands of URLs
- Reliability - Can't afford downtime
- Focus - Want to build product, not infrastructure
Real-World Comparison
Let's compare both approaches for a real use case: building link previews for a chat app.
Scenario: 10,000 link previews/month
DIY Approach:
Setup time: 40 hours × $100/hour = $4,000
Monthly infra: $150
Monthly maintenance: 5 hours × $100 = $500
First year cost: $4,000 + (12 × $650) = $11,800
API Approach:
Setup time: 2 hours × $100/hour = $200
Monthly cost: $20
First year cost: $200 + (12 × $20) = $440
Savings with API: $11,360 in year one
Scenario: 500,000 link previews/month
DIY Approach:
Setup time: 80 hours × $100/hour = $8,000
Monthly infra: $500 (larger servers, more proxies)
Monthly maintenance: 10 hours × $100 = $1,000
First year cost: $8,000 + (12 × $1,500) = $26,000
API Approach:
Setup time: 2 hours × $100/hour = $200
Monthly cost: $200 (enterprise plan)
First year cost: $200 + (12 × $200) = $2,600
Savings with API: $23,400 in year one
Even at high scale, APIs often win on total cost.
Hybrid Approach
Some teams use both:
// `cache` is your own store (e.g. the Redis helpers above);
// `apiClient` is a thin wrapper around the metadata API
async function getMetadata(url: string): Promise<Metadata> {
  // Check your cache first
const cached = await cache.get(url);
if (cached) return cached;
// Use API for fresh data
const data = await apiClient.extract(url);
// Cache with your own TTL
await cache.set(url, data, { ttl: 86400 }); // 24 hours
return data;
}
This gives you:
- Control over caching strategy
- Reliability of professional API
- Cost optimization through local caching
Making the Decision
Ask yourself these questions:
| Question | Build | Buy |
|---|---|---|
| Do I need this working in < 1 week? | ❌ | ✅ |
| Is metadata extraction my core product? | ✅ | ❌ |
| Do I have DevOps resources? | ✅ | ❌ |
| Is my budget < $500/month? | ❌ | ✅ |
| Do I need 99.9% uptime? | ❌ | ✅ |
| Am I scraping > 1M URLs/month? | ✅ | ❌ |
Conclusion
For most teams, a metadata API is the right choice. The math is clear:
- Lower total cost (infrastructure + engineering time)
- Faster time to market (days vs weeks)
- Better reliability (professional SLA vs DIY monitoring)
- Focus on your product (not scraping infrastructure)
Build your own scraper only if metadata extraction is your core business or you have very specific requirements that APIs can't meet.
Ready to stop maintaining scrapers? Try Katsau's metadata API free — 1,000 requests/month, no credit card required.