Extract structured data
Goal: turn a web page into clean JSON without downloading and parsing HTML yourself.
ScrapeNest runs extraction hooks inside the job. You describe what you want with CSS selectors, regular expressions, or JSONPath, and the results come back as a JSON artifact.
Minimal example
Extract a page title and all product prices:
from scrapenest import ScrapeNestClient

client = ScrapeNestClient(api_key="sn_live_...", base_url="https://api.scrapenest.com")

result = client.scrape_sync(
    job_type="light",
    target_url="https://example.com/products",
    artifact_options={"include_extraction": True},
    extraction={
        "hooks": [
            {"hook_id": "title", "type": "css", "selector": "h1"},
            {"hook_id": "prices", "type": "css", "selector": ".price", "all_matches": True},
        ]
    },
)
print(result.status)  # "succeeded"
curl -X POST "https://api.scrapenest.com/api/v1/jobs" \
  -H "X-API-Key: sn_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "job_type": "light",
    "target_url": "https://example.com/products",
    "artifact_options": {"include_extraction": true},
    "extraction": {"hooks": [
      {"hook_id": "title", "type": "css", "selector": "h1"},
      {"hook_id": "prices", "type": "css", "selector": ".price", "all_matches": true}
    ]}
  }'
Read the results
The job produces a json artifact containing one entry per hook. Download it like any artifact:
import json

job = client.jobs.get(result.job_id)
art = next(a for a in job.artifacts if a.artifact_type == "json")
data = json.loads(client.artifacts.download_text(art.artifact_id))

for hook in data["hooks"]:
    print(hook["hook_id"], hook["status"], hook["values"])
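When a job runs several hooks, it can help to index the results by hook_id and set aside the hooks that returned nothing. The sketch below works on the parsed artifact and relies only on the fields shown in this guide (hook_id, status, values). The helper name index_hooks is hypothetical, and the status strings in the sample data are placeholders, not documented values.

```python
def index_hooks(data):
    """Split parsed hook results into usable values and problem hooks.

    `data` is the parsed JSON artifact, shaped like
    {"hooks": [{"hook_id": ..., "status": ..., "values": [...]}, ...]}.
    """
    values_by_id = {}
    problems = {}
    for hook in data.get("hooks", []):
        values = hook.get("values") or []
        if values:
            values_by_id[hook["hook_id"]] = values
        else:
            # An empty or failed hook does not fail the job, so keep its
            # status around for inspection instead of raising.
            problems[hook["hook_id"]] = hook.get("status")
    return values_by_id, problems


# Made-up artifact content; the status strings are placeholders.
sample = {
    "hooks": [
        {"hook_id": "title", "status": "ok", "values": ["Products"]},
        {"hook_id": "prices", "status": "empty", "values": []},
    ]
}
values_by_id, problems = index_hooks(sample)
print(values_by_id)
print(problems)
```

Keying on whether values is non-empty avoids depending on specific status strings; the reference page lists the actual ones.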
Variations
Grab an attribute, not text — get href/src instead of element text:
{"hook_id": "links", "type": "css", "selector": "a.product", "attribute": "href", "all_matches": true}
Extract from a JSON API — use JSONPath on a light job against an API endpoint:
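As a rough illustration only: the hook below assumes that the type value is "jsonpath" and that the expression goes in the same "selector" field the CSS hooks use. Neither is confirmed by this guide, so check the Artifacts & Extraction reference for the exact syntax. The second half shows, in plain Python on a made-up API response, which values the expression $.items[*].price is meant to select.

```python
# Hypothetical hook: the "jsonpath" type value and the use of "selector"
# for the expression are assumptions, not confirmed field names.
jsonpath_hook = {
    "hook_id": "prices",
    "type": "jsonpath",
    "selector": "$.items[*].price",
    "all_matches": True,
}

# Made-up API response, used only to show what the expression targets.
api_response = {
    "items": [
        {"id": 1, "price": 9.99},
        {"id": 2, "price": 24.5},
    ]
}

# Plain-Python equivalent of $.items[*].price:
# take the "price" member of every element of "items".
prices = [item["price"] for item in api_response["items"]]
print(prices)
```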
Pull IDs with a regex — capture a group from the raw source:
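To make "capture a group" concrete, the sketch below runs a pattern with one capture group over a made-up page source using Python's standard re module, which is a convenient way to test a pattern locally before putting it in a hook. The hook dict at the end is hypothetical: the "regex" type value and the "pattern" field name are assumptions, not taken from the reference.

```python
import re

# Made-up page source for illustration.
raw_source = '<li data-sku="A-1001">Mug</li><li data-sku="B-2040">Plate</li>'

# One capture group: re.findall returns the group's text for every match,
# not the whole matched string.
pattern = r'data-sku="([A-Z]-\d+)"'
skus = re.findall(pattern, raw_source)
print(skus)

# Hypothetical hook carrying the same pattern; the "regex" type value and
# the "pattern" field name are assumptions.
regex_hook = {
    "hook_id": "skus",
    "type": "regex",
    "pattern": pattern,
    "all_matches": True,
}
```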
Extract after rendering — for JavaScript-built pages, switch to standard/stealth and combine with wait_until:
client.scrape_sync(
    job_type="standard",
    target_url="https://example.com/app",
    wait_until="networkidle",
    artifact_options={"include_extraction": True},
    extraction={
        "hooks": [
            {"hook_id": "rows", "type": "css", "selector": "tr.item", "all_matches": True}
        ]
    },
)
Tips
- A failing or empty hook never fails the job, so always check each hook's status.
- Limits: up to 25 hooks per job, 50 results per hook. For larger crawls, paginate.
- Prefer light for static/server-rendered pages: it's the fastest and cheapest tier.
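If a page needs more hooks than the 25-per-job limit allows, one option is to split the hook list into batches and submit one job per batch. The sketch below shows only the batching step; the helper name batch_hooks is hypothetical, and whether several jobs against the same URL are worth the cost is a judgment call.

```python
MAX_HOOKS_PER_JOB = 25  # per-job hook limit stated in the tips


def batch_hooks(hooks, size=MAX_HOOKS_PER_JOB):
    """Yield consecutive slices of `hooks`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(hooks), size):
        yield hooks[start:start + size]


# 60 illustrative CSS hooks split into batches of 25, 25, and 10.
hooks = [
    {"hook_id": f"field_{i}", "type": "css", "selector": f".field-{i}"}
    for i in range(60)
]
batches = list(batch_hooks(hooks))
print([len(b) for b in batches])  # [25, 25, 10]
```

Each batch can then be passed as the "hooks" list of its own extraction block.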
See also
- Artifacts & Extraction reference — full hook syntax, result format, limits.
- Scrape a JavaScript SPA — when content is rendered client-side.
- Paginate & crawl — extract across many pages.