Artifacts & Extraction

Every scraping job produces one or more artifacts — the outputs you actually consume. This page lists the artifact types, the manifest that describes them, how to download them, and how to pull structured data out of a page with extraction hooks.

Artifact types

Which artifacts a job produces depends on its artifact_options and tier.

| Type | MIME | Produced when | Description |
|---|---|---|---|
| html | text/html | include_html (default on) | Final rendered HTML of the page. |
| screenshot | image/png | include_screenshot (browser) | PNG capture of the rendered page. |
| har | application/json | include_har (browser) | Full HTTP Archive of network activity. |
| console_log | application/x-ndjson | include_console (browser) | Browser console messages, one JSON object per line. |
| response_body | varies | include_response_body | Raw HTTP response body. |
| json | application/json | include_extraction | Structured results from your extraction hooks. |
| text | text/plain | include_text | Cleaned, readable text content. |
| manifest | application/json | always | The manifest itself (see below). |

Response metadata — HTTP status, response headers, and timing — is not a separate artifact. It lives in the manifest (see The manifest).

The manifest

Each job emits a manifest: the forensics record for the run. It describes every artifact produced (ids, types, sizes, integrity hashes) and, at the top level, captures the job's outcome, the target's HTTP response, and timing.

{
  "manifest_version": "1.0.0",
  "job_id": "3d7d1e6e-2b8e-47c2-8bbd-9c2a1a3f9b10",
  "engine": "stealth",
  "target_url": "https://example.com",
  "final_url": "https://example.com/",
  "status": "completed",
  "outcome": "success",
  "response": {
    "http_status": 200,
    "headers": { "content-type": "text/html; charset=utf-8" }
  },
  "timing": {
    "started_at": "2026-06-06T12:00:03Z",
    "finished_at": "2026-06-06T12:00:05Z",
    "duration_ms": 1250
  },
  "credits": { "weight": 30, "charged": true, "amount": 30 },
  "artifacts": [
    {
      "artifact_id": "7d6cbb9f-2f31-4e9c-8c3a-5e9b0f2a14e1",
      "artifact_type": "html",
      "mime_type": "text/html",
      "checksum_sha256": "…",
      "byte_size": 18342,
      "created_at": "2026-06-06T12:00:05Z"
    }
  ]
}

The same artifact list is returned inline on the job object (GET /api/v1/jobs/{jobId}) and, in the SDK, as Job.artifacts.
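
As a rough illustration of reading a manifest, the sketch below parses a trimmed copy of the example above with Python's standard json module and indexes the artifacts by type. The helper name index_artifacts is hypothetical, and the sketch assumes at most one artifact per type, which this page does not guarantee.

```python
import json

# Trimmed copy of the example manifest above (only the fields used here).
MANIFEST_JSON = """
{
  "job_id": "3d7d1e6e-2b8e-47c2-8bbd-9c2a1a3f9b10",
  "status": "completed",
  "outcome": "success",
  "response": {"http_status": 200},
  "timing": {"duration_ms": 1250},
  "artifacts": [
    {
      "artifact_id": "7d6cbb9f-2f31-4e9c-8c3a-5e9b0f2a14e1",
      "artifact_type": "html",
      "mime_type": "text/html",
      "byte_size": 18342
    }
  ]
}
"""


def index_artifacts(manifest: dict) -> dict:
    """Map artifact_type -> artifact entry (hypothetical helper).

    Assumes at most one artifact per type; a later entry would
    overwrite an earlier one of the same type.
    """
    return {a["artifact_type"]: a for a in manifest.get("artifacts", [])}


manifest = json.loads(MANIFEST_JSON)
by_type = index_artifacts(manifest)
html_entry = by_type["html"]
print(manifest["outcome"], manifest["response"]["http_status"], html_entry["byte_size"])
```

Indexing by type keeps later lookups (for example, finding the artifact_id to download) to a single dictionary access.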

Status vs. outcome

The manifest separates two things that are easy to conflate:

| Field | Values | Meaning |
|---|---|---|
| status | completed, failed | Lifecycle: did the crawl run to completion? failed means an execution error (timeout, navigation crash) with no usable response. |
| outcome | success, blocked, failed | Result quality. success: content was delivered. blocked: the crawl ran but the target denied or challenged it (an anti-bot wall). failed: the target returned an error response. |

A blocked job has status: "completed" (it ran) but outcome: "blocked", plus a structured block:

{
  "status": "completed",
  "outcome": "blocked",
  "block": { "blocked": true, "signal": "anti_bot_challenge", "http_status": 403 },
  "response": { "http_status": 403, "headers": {  } },
  "credits": { "weight": 30, "charged": false, "amount": 0 }
}

You pay for results, not blocks

Credits are charged only when outcome is success. Blocked and failed jobs cost 0 credits (credits.charged: false). Always check outcome — not just status — before treating a job as a delivered page.
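
The check described here can be sketched in a few lines of Python. The function names is_delivered and describe are hypothetical; the field names and the blocked sample come from the manifest examples above.

```python
def is_delivered(manifest: dict) -> bool:
    """True only when the job ran to completion AND content was delivered."""
    return (
        manifest.get("status") == "completed"
        and manifest.get("outcome") == "success"
    )


def describe(manifest: dict) -> str:
    """Return a short human-readable summary of the job result."""
    if is_delivered(manifest):
        return "delivered"
    if manifest.get("outcome") == "blocked":
        block = manifest.get("block", {})
        return f"blocked: {block.get('signal')} (HTTP {block.get('http_status')})"
    return "failed"


# The blocked manifest from the example above.
blocked = {
    "status": "completed",
    "outcome": "blocked",
    "block": {"blocked": True, "signal": "anti_bot_challenge", "http_status": 403},
    "response": {"http_status": 403, "headers": {}},
    "credits": {"weight": 30, "charged": False, "amount": 0},
}

print(describe(blocked))
```

Checking status alone would treat the blocked job above as a delivered page, which is the mistake this callout warns against.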

Response headers in the manifest are sanitized: cookie, set-cookie, authorization, and similar credential-bearing headers are redacted.

Downloading artifacts

Artifacts are never public. You exchange an artifact_id for a short-lived presigned URL, then fetch the bytes from that URL.

Python (SDK):

# Look up the job and pick its HTML artifact from the artifact list
job = client.jobs.get("JOB_ID")
art = next(a for a in job.artifacts if a.artifact_type == "html")

# Download the artifact's contents as text
html = client.artifacts.download_text(art.artifact_id)

cURL:

# Exchange the id for a presigned URL (optionally set ttl_seconds, 60–86400)
curl "https://api.scrapenest.com/api/v1/artifacts/ARTIFACT_ID/download?ttl_seconds=600" \
  -H "X-API-Key: sn_live_..."
# → {"download_url": "https://...", "expires_at": "2026-06-06T12:15:05Z"}

# Fetch the bytes from the presigned URL
curl -L "PRESIGNED_DOWNLOAD_URL" -o result.html

URLs expire

Presigned URLs are valid for a limited window (default ~15 minutes; control with ttl_seconds, range 60–86400). Request a fresh URL each time; don't cache it. You can also receive a ready-to-use URL on the artifact.ready webhook.
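
To make the ttl_seconds range concrete, here is a minimal sketch that builds the download endpoint URL and rejects out-of-range values before any request is sent. The function name download_endpoint is hypothetical; the host and path are copied from the curl example above, and the client-side range check is an assumption about convenient usage, not documented server behavior.

```python
from typing import Optional
from urllib.parse import quote, urlencode

API_BASE = "https://api.scrapenest.com/api/v1"  # host taken from the curl example
TTL_MIN, TTL_MAX = 60, 86400  # documented ttl_seconds range


def download_endpoint(artifact_id: str, ttl_seconds: Optional[int] = None) -> str:
    """Build the URL that exchanges an artifact_id for a presigned URL."""
    url = f"{API_BASE}/artifacts/{quote(artifact_id, safe='')}/download"
    if ttl_seconds is None:
        return url  # let the service apply its default window
    if not TTL_MIN <= ttl_seconds <= TTL_MAX:
        raise ValueError(
            f"ttl_seconds must be between {TTL_MIN} and {TTL_MAX}, got {ttl_seconds}"
        )
    return f"{url}?{urlencode({'ttl_seconds': ttl_seconds})}"


print(download_endpoint("ARTIFACT_ID", 600))
```

The request to this URL still needs the X-API-Key header shown in the curl example.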

Integrity: every artifact carries a checksum_sha256. Verify it after download if you need tamper-evidence.
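
A minimal verification sketch using Python's standard hashlib follows. It assumes checksum_sha256 is the hex-encoded SHA-256 digest of the raw artifact bytes (the example manifest elides the value, so the encoding is an assumption), and verify_checksum is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_checksum(data: bytes, expected_sha256: str) -> bool:
    """Compare the SHA-256 of downloaded bytes with the manifest value.

    Assumes the manifest stores a hex digest; case and surrounding
    whitespace are normalized before comparison.
    """
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual, expected_sha256.strip().lower())


# Stand-in for downloaded bytes; a real check would use the artifact's
# bytes and the checksum_sha256 from its manifest entry.
body = b"<h1>Example Domain</h1>"
expected = hashlib.sha256(body).hexdigest()
print(verify_checksum(body, expected))
```

Hash the bytes exactly as downloaded; decoding to text and re-encoding first can change them and cause a false mismatch.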

Extraction hooks

Instead of downloading HTML and parsing it yourself, let ScrapeNest extract structured data during the job. Add extraction.hooks to the payload and set artifact_options.include_extraction: true. Results come back as a json artifact.

A hook has an id, a type, and type-specific fields:

"extraction": {
  "hooks": [
    {"hook_id": "title", "type": "css", "selector": "h1"},
    {"hook_id": "image", "type": "css", "selector": "img.hero", "attribute": "src"},
    {"hook_id": "skus", "type": "regex", "pattern": "SKU-(\\d+)", "group": 1, "all_matches": true},
    {"hook_id": "price", "type": "jsonpath", "path": "$.product.price"}
  ]
}
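
If you assemble the payload in code, a sketch like the following builds the two fragments named above (extraction.hooks and artifact_options.include_extraction). The helper build_extraction_options is hypothetical, the duplicate hook_id check is an assumption (this page does not say ids must be unique), and the rest of the job payload is omitted because its fields are not described here.

```python
import json


def build_extraction_options(hooks: list) -> dict:
    """Return the payload fragment that enables extraction hooks."""
    seen = set()
    for hook in hooks:
        hook_id = hook["hook_id"]
        if hook_id in seen:
            # Assumption: ids should be unique so results can be told apart.
            raise ValueError(f"duplicate hook_id: {hook_id}")
        seen.add(hook_id)
    return {
        "extraction": {"hooks": hooks},
        "artifact_options": {"include_extraction": True},
    }


hooks = [
    {"hook_id": "title", "type": "css", "selector": "h1"},
    # A raw string keeps the backslash; json.dumps escapes it as \\d.
    {"hook_id": "skus", "type": "regex", "pattern": r"SKU-(\d+)",
     "group": 1, "all_matches": True},
]

fragment = build_extraction_options(hooks)
print(json.dumps(fragment, indent=2))
```

Merge the returned dictionary into your job payload before submitting it.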

Hook types

css

Selects elements by CSS selector. Returns text content by default, or an attribute value.

| Field | Type | Description |
|---|---|---|
| selector | string | Required. CSS selector. |
| attribute | string | Optional. Return this attribute (e.g. href, src) instead of text. |
| all_matches | boolean | Return every match instead of the first. |

{"hook_id": "links", "type": "css", "selector": "a.product", "attribute": "href", "all_matches": true}

regex

Matches the page source with a regular expression.

| Field | Type | Description |
|---|---|---|
| pattern | string | Required. Regular expression. |
| flags | string | Optional combination of i (ignore case), m (multiline), s (dotall). |
| group | integer | Capture group to return (default 0 = whole match). |
| all_matches | boolean | Return every match instead of the first. |

{"hook_id": "ids", "type": "regex", "pattern": "id=(\\d+)", "group": 1, "all_matches": true}
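
To preview what a regex hook might return before submitting a job, you can approximate its fields locally with Python's re module. This is only a sketch: preview_regex_hook is a hypothetical helper, and the service's regex engine may differ from Python's in syntax and edge cases.

```python
import re

# Map the documented flag letters onto Python's re flags.
_FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def preview_regex_hook(hook: dict, source: str) -> list:
    """Approximate a regex hook locally (not the service's implementation)."""
    flags = 0
    for letter in hook.get("flags", ""):
        flags |= _FLAG_MAP[letter]  # KeyError on an undocumented flag letter
    group = hook.get("group", 0)  # 0 = whole match, as documented
    matches = [m.group(group) for m in re.finditer(hook["pattern"], source, flags)]
    # Without all_matches, only the first match is kept.
    return matches if hook.get("all_matches") else matches[:1]


hook = {"hook_id": "ids", "type": "regex", "pattern": r"id=(\d+)",
        "group": 1, "all_matches": True}
print(preview_regex_hook(hook, "/p?id=12 /p?id=34"))  # → ['12', '34']
```

Note that the JSON payload needs the backslash doubled ("id=(\\d+)"), while the Python raw string above does not.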

jsonpath

Evaluates a JSONPath expression. Best suited to JSON API responses (Light tier) or embedded JSON.

| Field | Type | Description |
|---|---|---|
| path | string | Required. JSONPath expression, e.g. $.data[*].name. |

{"hook_id": "names", "type": "jsonpath", "path": "$.data[*].name"}

Result format

The extraction artifact contains one entry per hook:

{
  "hooks": [
    {
      "hook_id": "title",
      "type": "css",
      "status": "succeeded",
      "duration_ms": 4,
      "values": ["Example Domain"]
    },
    {
      "hook_id": "price",
      "type": "jsonpath",
      "status": "failed",
      "duration_ms": 1,
      "values": [],
      "error": "no_match"
    }
  ]
}

| Field | Description |
|---|---|
| status | succeeded, failed, timeout, or invalid. |
| values | Array of extracted values (empty if none matched). |
| error | Error code when status != succeeded (e.g. no_match, invalid_selector). |
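
One way to consume this artifact is to split it into extracted values and per-hook problems, as in the sketch below. The helper name collect_values is hypothetical; the sample data is the result shown above.

```python
def collect_values(result: dict) -> tuple:
    """Split an extraction result into (values_by_hook, problems_by_hook)."""
    values, problems = {}, {}
    for entry in result.get("hooks", []):
        if entry.get("status") == "succeeded":
            values[entry["hook_id"]] = entry.get("values", [])
        else:
            # Prefer the error code; fall back to the status itself.
            problems[entry["hook_id"]] = entry.get("error") or entry.get("status")
    return values, problems


# The extraction result from the example above.
result = {
    "hooks": [
        {"hook_id": "title", "type": "css", "status": "succeeded",
         "duration_ms": 4, "values": ["Example Domain"]},
        {"hook_id": "price", "type": "jsonpath", "status": "failed",
         "duration_ms": 1, "values": [], "error": "no_match"},
    ]
}

values, problems = collect_values(result)
print(values)    # → {'title': ['Example Domain']}
print(problems)  # → {'price': 'no_match'}
```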

Limits

| Limit | Value |
|---|---|
| Max hooks per job | 25 |
| Max results per hook | 50 |
| Per-hook timeout | 1000 ms (configurable per hook via timeout_ms, up to 30000 ms) |

A failing or empty hook never fails the job — check each hook's status in the result.
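
A client-side pre-check against these limits could look like the sketch below. It is not part of the SDK: validate_hooks is a hypothetical helper, the required-field map is read off the hook type tables above, and treating timeout_ms as a positive number is an assumption, since no lower bound is documented.

```python
MAX_HOOKS_PER_JOB = 25
DEFAULT_TIMEOUT_MS = 1000
MAX_TIMEOUT_MS = 30000

# Required field per hook type, taken from the hook type tables.
REQUIRED_FIELD = {"css": "selector", "regex": "pattern", "jsonpath": "path"}


def validate_hooks(hooks: list) -> list:
    """Return a list of problems; an empty list means the hooks look valid."""
    errors = []
    if len(hooks) > MAX_HOOKS_PER_JOB:
        errors.append(f"too many hooks: {len(hooks)} > {MAX_HOOKS_PER_JOB}")
    for index, hook in enumerate(hooks):
        label = hook.get("hook_id", f"#{index}")
        required = REQUIRED_FIELD.get(hook.get("type"))
        if required is None:
            errors.append(f"{label}: unknown type {hook.get('type')!r}")
        elif not hook.get(required):
            errors.append(f"{label}: missing required field {required!r}")
        timeout = hook.get("timeout_ms", DEFAULT_TIMEOUT_MS)
        # Assumption: timeouts must be positive; only the upper bound is documented.
        if not 0 < timeout <= MAX_TIMEOUT_MS:
            errors.append(f"{label}: timeout_ms {timeout} outside 1..{MAX_TIMEOUT_MS}")
    return errors


print(validate_hooks([{"hook_id": "title", "type": "css", "selector": "h1"}]))  # → []
```

The 50-results-per-hook cap applies to output, so it cannot be checked before the job runs.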

Next steps