Paginate & crawl¶

Goal: scrape many pages - a paginated listing, a set of URLs, or a category tree - reliably and at scale.

ScrapeNest jobs are single-URL, so you orchestrate a multi-page crawl by submitting many jobs, ideally with submit and webhooks so you aren't blocked waiting on each one.

Pattern 1: known page URLs¶

If the pages follow a predictable pattern, fan out with submit and a shared tag, then collect results as webhooks arrive:

Python first, then a curl loop:

from scrapenest import ScrapeNestClient

client = ScrapeNestClient(api_key="sn_live_...", base_url="https://api.scrapenest.com")

batch = "catalog-2026-06-06"
job_ids = []
for page in range(1, 51):
    created = client.submit(
        job_type="light",
        target_url=f"https://example.com/products?page={page}",
        tags=[f"batch:{batch}"],
        idempotency_token=f"{batch}-page-{page}",
        artifact_options={"include_extraction": True},
        extraction={"hooks": [
            {"hook_id": "skus", "type": "css", "selector": ".sku", "all_matches": True},
        ]},
    )
    job_ids.append(created.job_id)

for page in $(seq 1 50); do
  curl -s -X POST "https://api.scrapenest.com/v1/jobs" \
    -H "X-API-Key: sn_live_..." \
    -H "Content-Type: application/json" \
    -d "{
      \"job_type\": \"light\",
      \"target_url\": \"https://example.com/products?page=$page\",
      \"tags\": [\"batch:catalog-2026-06-06\"],
      \"idempotency_token\": \"catalog-2026-06-06-page-$page\"
    }"
done

The idempotency_token makes the whole batch safe to re-run - already-submitted pages return their original job instead of duplicating.
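The re-run safety described above depends on each token being derived only from stable inputs (here the batch name and page number), never from a timestamp or random value. A minimal sketch of that idea, using a hypothetical helper `page_jobs` that builds the same payloads as the loop above without calling the API:

```python
from typing import Iterator


def page_jobs(batch: str, url_template: str, first: int, last: int) -> Iterator[dict]:
    """Yield one job payload per page (page_jobs is a hypothetical helper)."""
    for page in range(first, last + 1):
        yield {
            "job_type": "light",
            "target_url": url_template.format(page=page),
            "tags": [f"batch:{batch}"],
            # Deterministic: the same batch and page always give the same token.
            "idempotency_token": f"{batch}-page-{page}",
        }


template = "https://example.com/products?page={page}"
first_run = list(page_jobs("catalog-2026-06-06", template, 1, 50))
second_run = list(page_jobs("catalog-2026-06-06", template, 1, 50))  # simulated re-run
```

Because both runs produce identical tokens, passing each payload to `client.submit(**payload)` a second time should hit the idempotency check rather than create new jobs, assuming the client accepts these keyword arguments as in the example above.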

Pattern 2: discover then crawl¶

When you don't know the page URLs up front, scrape page 1, extract the links, then submit a job per link:

import json

# 1. Scrape the index page and extract product links
first = client.scrape(
    job_type="light",
    target_url="https://example.com/products",
    artifact_options={"include_extraction": True},
    extraction={"hooks": [
        {"hook_id": "links", "type": "css", "selector": "a.product", "attribute": "href", "all_matches": True},
    ]},
)

# 2. Read the extracted links
job = client.jobs.get(first.job_id)
art = next(a for a in job.artifacts if a.artifact_type == "json")
links = json.loads(client.artifacts.download_text(art.artifact_id))["hooks"][0]["values"]

# 3. Fan out a job per link
for href in links:
    client.submit(job_type="light", target_url=href, tags=["crawl:products"])
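Extracted href values are often relative, duplicated, or pointed at other hosts, so it can help to clean them before fanning out. The sketch below uses only the Python standard library; `normalize_links` is a hypothetical helper, and the same-host filter is an assumption about what a product crawl usually wants rather than anything ScrapeNest requires.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(index_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against index_url, keep same-host http(s) links, drop duplicates."""
    host = urlparse(index_url).netloc
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        # Make the link absolute and strip any #fragment so variants compare equal.
        absolute, _fragment = urldefrag(urljoin(index_url, href.strip()))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https") or parsed.netloc != host:
            continue  # skip mailto:, javascript:, and off-site links
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


clean = normalize_links(
    "https://example.com/products",
    ["/p/1", "/p/1#reviews", "https://example.com/p/2",
     "https://other.example/p/3", "mailto:a@b.c"],
)
```

Here `clean` holds two URLs, `https://example.com/p/1` and `https://example.com/p/2`, in first-seen order; the fan-out loop can iterate over it instead of the raw extracted values.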

Collecting results¶

Use one shared tag for the batch, then filter jobs by it in the Console or the API to track progress and gather outputs. For large batches, prefer webhooks over polling - you'll get a job.completed event per page.
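On the receiving side, a webhook handler mostly needs to tick jobs off a pending set and notice when the batch is done. The sketch below shows only that bookkeeping. The payload field names `event` and `job_id` are assumptions for illustration (the webhook payload schema is not described on this page), and `BatchTracker` is a hypothetical class, not part of the ScrapeNest client.

```python
class BatchTracker:
    """Track which submitted jobs have reported completion."""

    def __init__(self, job_ids):
        self.pending = set(job_ids)
        self.completed = set()

    def handle(self, payload: dict) -> bool:
        """Record one webhook payload; return True once every job has completed.

        Assumes the payload carries "event" and "job_id" keys (hypothetical names).
        """
        if payload.get("event") == "job.completed":
            job_id = payload.get("job_id")
            if job_id in self.pending:
                self.pending.discard(job_id)
                self.completed.add(job_id)
        return not self.pending


tracker = BatchTracker(["job-1", "job-2"])
done_after_first = tracker.handle({"event": "job.completed", "job_id": "job-1"})
# A repeated delivery of the same event changes nothing.
done_after_duplicate = tracker.handle({"event": "job.completed", "job_id": "job-1"})
done_after_second = tracker.handle({"event": "job.completed", "job_id": "job-2"})
```

Keeping the handler idempotent matters because webhook senders commonly retry deliveries; unknown job IDs and other event types are ignored rather than treated as errors.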

Stay within limits¶

Pace yourself. Bursts against one domain invite blocks and burn quota. Respect your rate limits and concurrency.
Use the cheapest tier that works - light for listings, escalate only where needed.
Always set idempotency_token so retries and re-runs never double-submit.
Watch your credit pool - each successful job consumes credits by tier. See Billing & Usage.
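One simple way to pace submissions is a client-side throttle that enforces a minimum gap between calls. The sketch below is a generic standard-library pattern, not a ScrapeNest feature; `Pacer` is a hypothetical class and the interval shown is illustrative, since the actual rate and concurrency limits depend on your plan.

```python
import time
from typing import Callable, Optional


class Pacer:
    """Enforce a minimum interval between consecutive submissions."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Block until min_interval has passed since the previous call; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay


pacer = Pacer(min_interval=0.01)  # illustrative value only
delays = []
for page in range(1, 4):
    delays.append(pacer.wait())
    # client.submit(...) for this page would go here
```

The first call never sleeps; later calls sleep only for whatever part of the interval has not already elapsed. The injectable `clock` and `sleep` arguments exist so the pacing logic can be tested without real waiting.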
