Paginate & crawl¶
Goal: scrape many pages — a paginated listing, a set of URLs, or a category tree — reliably and at scale.
ScrapeNest jobs are single-URL. You orchestrate multi-page crawls by submitting many jobs, ideally with scrape_async and webhooks so you're not blocking on each one.
Pattern 1: known page URLs¶
If the pages follow a predictable pattern, fan out with scrape_async and a shared tag, then collect results as webhooks arrive:
from scrapenest import ScrapeNestClient
client = ScrapeNestClient(api_key="sn_live_...", base_url="https://api.scrapenest.com")
batch = "catalog-2026-06-06"
job_ids = []
for page in range(1, 51):
created = client.scrape_async(
job_type="light",
target_url=f"https://example.com/products?page={page}",
tags=[f"batch:{batch}"],
idempotency_token=f"{batch}-page-{page}",
artifact_options={"include_extraction": True},
extraction={"hooks": [
{"hook_id": "skus", "type": "css", "selector": ".sku", "all_matches": True},
]},
)
job_ids.append(created.job_id)
for page in $(seq 1 50); do
curl -s -X POST "https://api.scrapenest.com/api/v1/jobs" \
-H "X-API-Key: sn_live_..." \
-H "Content-Type: application/json" \
-d "{
\"job_type\": \"light\",
\"target_url\": \"https://example.com/products?page=$page\",
\"tags\": [\"batch:catalog-2026-06-06\"],
\"idempotency_token\": \"catalog-2026-06-06-page-$page\"
}"
done
The idempotency_token makes the whole batch safe to re-run — already-submitted pages return their original job instead of duplicating.
Pattern 2: discover then crawl¶
When you don't know the page URLs up front, scrape page 1, extract the links, then submit a job per link:
import json
# 1. Scrape the index page and extract product links
first = client.scrape_sync(
job_type="light",
target_url="https://example.com/products",
artifact_options={"include_extraction": True},
extraction={"hooks": [
{"hook_id": "links", "type": "css", "selector": "a.product", "attribute": "href", "all_matches": True},
]},
)
# 2. Read the extracted links
job = client.jobs.get(first.job_id)
art = next(a for a in job.artifacts if a.artifact_type == "json")
links = json.loads(client.artifacts.download_text(art.artifact_id))["hooks"][0]["values"]
# 3. Fan out a job per link
for href in links:
client.scrape_async(job_type="light", target_url=href, tags=["crawl:products"])
Collecting results¶
Use one shared tag for the batch, then filter jobs by it in the Console or the API to track progress and gather outputs. For large batches, prefer webhooks over polling — you'll get a job.completed event per page.
Stay within limits¶
- Pace yourself. Bursts against one domain invite blocks and burn quota. Respect your rate limits and concurrency.
- Use the cheapest tier that works —
lightfor listings, escalate only where needed. - Always set
idempotency_tokenso retries and re-runs never double-submit. - Watch your credit pool — each successful job consumes credits by tier. See Billing & Usage.
See also¶
- Webhooks end-to-end — the recommended way to collect batch results.
- Rate Limits & Quotas · Extract structured data