Extract structured data

Goal: turn a web page into clean JSON without downloading and parsing HTML yourself.

ScrapeNest runs extraction hooks inside the job. You describe what you want with CSS selectors, regular expressions, or JSONPath, and the results come back as a JSON artifact.

Minimal example

Extract a page title and all product prices:

Python:

from scrapenest import ScrapeNestClient

client = ScrapeNestClient(api_key="sn_live_...", base_url="https://api.scrapenest.com")

result = client.scrape(
    job_type="light",
    target_url="https://example.com/products",
    artifact_options={"include_extraction": True},
    extraction={
        "hooks": [
            {"hook_id": "title", "type": "css", "selector": "h1"},
            {"hook_id": "prices", "type": "css", "selector": ".price", "all_matches": True},
        ]
    },
)
print(result.status)  # "succeeded"

curl:

curl -X POST "https://api.scrapenest.com/v1/jobs" \
  -H "X-API-Key: sn_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "job_type": "light",
    "target_url": "https://example.com/products",
    "artifact_options": {"include_extraction": true},
    "extraction": {"hooks": [
      {"hook_id": "title", "type": "css", "selector": "h1"},
      {"hook_id": "prices", "type": "css", "selector": ".price", "all_matches": true}
    ]}
  }'

Read the results

The job produces a json artifact containing one entry per hook. Download it like any artifact:

import json

job = client.jobs.get(result.job_id)
art = next(a for a in job.artifacts if a.artifact_type == "json")

data = json.loads(client.artifacts.download_text(art.artifact_id))
for hook in data["hooks"]:
    print(hook["hook_id"], hook["status"], hook["values"])
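Because a hook can come back empty or failed without failing the job, it can help to split the artifact into usable values and hooks that need attention. The following is a minimal sketch, assuming the artifact has the shape read above (a `hooks` list whose entries carry `hook_id`, `status`, and `values`). The helper name `split_hooks` and the sample payload, including its status strings, are illustrative and not part of the ScrapeNest API:

```python
def split_hooks(data):
    """Separate hooks that returned values from hooks that did not."""
    values = {}
    problems = {}
    for hook in data.get("hooks", []):
        hook_values = hook.get("values") or []
        if hook_values:
            values[hook["hook_id"]] = hook_values
        else:
            # Keep the reported status so the caller can log or retry.
            problems[hook["hook_id"]] = hook.get("status")
    return values, problems


# Hand-written sample payload; the status strings are placeholders.
sample = {
    "hooks": [
        {"hook_id": "title", "status": "ok", "values": ["Products"]},
        {"hook_id": "prices", "status": "empty", "values": []},
    ]
}

values, problems = split_hooks(sample)
print(values)
print(problems)
```

Deciding on the presence of values rather than on a particular status string keeps the sketch independent of the exact status vocabulary, which this page does not list.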

Variations

Grab an attribute, not text - get href/src instead of element text:

{"hook_id": "links", "type": "css", "selector": "a.product", "attribute": "href", "all_matches": true}

Extract from a JSON API - use JSONPath on a light job against an API endpoint:

{"hook_id": "names", "type": "jsonpath", "path": "$.data[*].name"}
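To see what a path like `$.data[*].name` selects, here is the equivalent lookup in plain Python on a hand-written response body. This only illustrates the meaning of the path, assuming the endpoint returns an object with a `data` array; it does not call ScrapeNest, and the sample body is made up:

```python
import json

# Hypothetical API response body.
body = json.loads('{"data": [{"name": "Kettle", "id": 1}, {"name": "Toaster", "id": 2}]}')

# $.data[*].name: take the "name" member of every element of the "data" array.
names = [item["name"] for item in body["data"] if "name" in item]
print(names)
```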

Pull IDs with a regex - capture a group from the raw source:

{"hook_id": "skus", "type": "regex", "pattern": "SKU-(\\d+)", "group": 1, "all_matches": true}
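The doubled backslash in the hook is JSON escaping; the pattern itself is `SKU-(\d+)`. A quick local check with Python's `re` module shows what group 1 captures. This sketch assumes ScrapeNest's regex engine treats this simple pattern the same way Python does, which this page does not state, and the sample source string is made up:

```python
import json
import re

# Parse the hook exactly as it would appear in a JSON request body.
hook = json.loads(
    r'{"hook_id": "skus", "type": "regex", "pattern": "SKU-(\\d+)", "group": 1, "all_matches": true}'
)

# Hypothetical raw page source.
source = '<li data-sku="SKU-1042">Kettle</li><li data-sku="SKU-77">Toaster</li>'

# Collect the requested capture group from every match.
skus = [m.group(hook["group"]) for m in re.finditer(hook["pattern"], source)]
print(skus)
```

Testing a pattern locally against a saved copy of the page is a cheap way to catch escaping mistakes before spending a job on it.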

Extract after rendering - for JavaScript-built pages, set job_type to standard or stealth and combine it with wait_until so the hooks run against the rendered page:

client.scrape(
    job_type="standard",
    target_url="https://example.com/app",
    wait_until="networkidle",
    artifact_options={"include_extraction": True},
    extraction={"hooks": [{"hook_id": "rows", "type": "css", "selector": "tr.item", "all_matches": True}]},
)

Tips

A failing or empty hook never fails the job - always check each hook's status.
Limits: up to 25 hooks per job, 50 results per hook. For larger crawls, paginate.
Prefer light for static/server-rendered pages - it's the fastest and cheapest tier.
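If a page needs more hooks than the per-job limit, one option is to split the hook list across several jobs. The following is a minimal sketch, assuming the 25-hook limit quoted in the tips above; `chunk_hooks` is a hypothetical helper, and the generated hooks are placeholders. Each batch would then be submitted with `client.scrape` as in the earlier examples:

```python
MAX_HOOKS_PER_JOB = 25  # limit quoted in the tips above


def chunk_hooks(hooks, size=MAX_HOOKS_PER_JOB):
    """Split a list of hook definitions into batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [hooks[i:i + size] for i in range(0, len(hooks), size)]


# Placeholder hooks, only to exercise the helper.
hooks = [
    {"hook_id": f"field_{i}", "type": "css", "selector": f".field-{i}"}
    for i in range(60)
]

batches = chunk_hooks(hooks)
print([len(batch) for batch in batches])
```

Note that every batch is a separate job that fetches the page again, so trimming the hook list is usually preferable to splitting it.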
