Extract structured data

Goal: turn a web page into clean JSON without downloading and parsing HTML yourself.

ScrapeNest runs extraction hooks inside the job. You describe what you want with CSS selectors, regular expressions, or JSONPath, and the results come back as a JSON artifact.

Minimal example

Extract a page title and all product prices:

from scrapenest import ScrapeNestClient

client = ScrapeNestClient(api_key="sn_live_...", base_url="https://api.scrapenest.com")

result = client.scrape_sync(
    job_type="light",
    target_url="https://example.com/products",
    artifact_options={"include_extraction": True},
    extraction={
        "hooks": [
            {"hook_id": "title", "type": "css", "selector": "h1"},
            {"hook_id": "prices", "type": "css", "selector": ".price", "all_matches": True},
        ]
    },
)
print(result.status)  # "succeeded"

The same job submitted with curl:

curl -X POST "https://api.scrapenest.com/api/v1/jobs" \
  -H "X-API-Key: sn_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "job_type": "light",
    "target_url": "https://example.com/products",
    "artifact_options": {"include_extraction": true},
    "extraction": {"hooks": [
      {"hook_id": "title", "type": "css", "selector": "h1"},
      {"hook_id": "prices", "type": "css", "selector": ".price", "all_matches": true}
    ]}
  }'

Read the results

The job produces a json artifact containing one entry per hook. Download it like any artifact:

import json

job = client.jobs.get(result.job_id)
art = next(a for a in job.artifacts if a.artifact_type == "json")

data = json.loads(client.artifacts.download_text(art.artifact_id))
for hook in data["hooks"]:
    print(hook["hook_id"], hook["status"], hook["values"])
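
Because each entry carries its own status, it can help to fold the hooks list into a dictionary keyed by hook_id and set aside any hook that came back empty. The following is a minimal sketch, assuming the artifact has the shape used above (a `hooks` list whose entries have `hook_id`, `status`, and `values`). The helper name `index_hooks`, the sample payload, and the status strings in it are illustrative only and are not documented ScrapeNest values.

```python
from typing import Any


def index_hooks(data: dict[str, Any]) -> tuple[dict[str, list], dict[str, str]]:
    """Split an extraction artifact into usable values and problem hooks.

    Returns (values_by_hook, problems), where problems maps hook_id to the
    reported status for every hook that returned no values.
    """
    values_by_hook: dict[str, list] = {}
    problems: dict[str, str] = {}
    for hook in data.get("hooks", []):
        values = hook.get("values") or []
        if values:
            values_by_hook[hook["hook_id"]] = values
        else:
            # Keep the reported status so the caller can log it or retry.
            problems[hook["hook_id"]] = str(hook.get("status"))
    return values_by_hook, problems


# Hypothetical payload for illustration; real status strings may differ.
sample = {
    "hooks": [
        {"hook_id": "title", "status": "ok", "values": ["Products"]},
        {"hook_id": "prices", "status": "empty", "values": []},
    ]
}
found, problems = index_hooks(sample)
print(found)
print(problems)
```

In real use, `data` would be the dictionary loaded from the downloaded artifact rather than the sample above.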

Variations

Grab an attribute, not text — get href/src instead of element text:

{"hook_id": "links", "type": "css", "selector": "a.product", "attribute": "href", "all_matches": true}

Extract from a JSON API — use JSONPath on a light job against an API endpoint:

{"hook_id": "names", "type": "jsonpath", "path": "$.data[*].name"}
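
For readers new to JSONPath, the path `$.data[*].name` selects the `name` field of every element in the top-level `data` array. The plain-Python equivalent below is only an illustration on a made-up response body; it does not use or describe ScrapeNest's JSONPath engine.

```python
import json

# Made-up API response, for illustration only.
body = json.loads('{"data": [{"name": "Lamp", "id": 1}, {"name": "Desk", "id": 2}]}')

# $.data[*].name -> the "name" of each item in the top-level "data" array
names = [item["name"] for item in body["data"]]
print(names)
```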

Pull IDs with a regex — capture a group from the raw source:

{"hook_id": "skus", "type": "regex", "pattern": "SKU-(\\d+)", "group": 1, "all_matches": true}
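
Before submitting a job, it can be useful to preview what a pattern and group would capture. The sketch below does this locally with Python's standard `re` module on a made-up snippet of source. The helper `preview_regex` is hypothetical, and the regex engine ScrapeNest runs may differ from Python's in edge cases. Note that the backslash is doubled in the JSON hook (`\\d`) but written once in a Python raw string.

```python
import re


def preview_regex(source: str, pattern: str, group: int = 0) -> list[str]:
    """Return the requested group from every match of pattern in source."""
    return [m.group(group) for m in re.finditer(pattern, source)]


# Made-up page source, for illustration only.
html = "<li>SKU-1001</li><li>SKU-2002</li><li>SKU-none</li>"

# group=1 keeps only the digits captured by (\d+); group=0 would keep the whole match.
skus = preview_regex(html, r"SKU-(\d+)", group=1)
print(skus)
```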

Extract after rendering — for JavaScript-built pages, switch to standard/stealth and combine with wait_until:

client.scrape_sync(
    job_type="standard",
    target_url="https://example.com/app",
    wait_until="networkidle",
    artifact_options={"include_extraction": True},
    extraction={"hooks": [{"hook_id": "rows", "type": "css", "selector": "tr.item", "all_matches": True}]},
)

Tips

  • A failing or empty hook never fails the job — always check each hook's status.
  • Limits: up to 25 hooks per job and 50 results per hook. To collect more, paginate: split the work across several jobs, for example one per results page.
  • Prefer light for static/server-rendered pages — it's the fastest and cheapest tier.
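
If a page needs more than 25 hooks, one way to stay within the per-job limit is to split the hook list into batches and submit one job per batch. This is a minimal sketch of the batching step only; `batch_hooks` and the placeholder hooks are hypothetical names introduced for illustration.

```python
from typing import Iterator

MAX_HOOKS_PER_JOB = 25  # limit stated in the tips above


def batch_hooks(hooks: list[dict], size: int = MAX_HOOKS_PER_JOB) -> Iterator[list[dict]]:
    """Yield consecutive slices of hooks, each no longer than size."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(hooks), size):
        yield hooks[start:start + size]


# 60 placeholder hooks -> batches of 25, 25 and 10
hooks = [{"hook_id": f"field_{i}", "type": "css", "selector": f".f{i}"} for i in range(60)]
batches = list(batch_hooks(hooks))
print([len(b) for b in batches])
```

Each batch could then be passed as `extraction={"hooks": batch}` in its own request. Every job fetches the page again, so this trades extra requests for staying under the limit.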

See also