Artifacts & Extraction

Every scraping job produces one or more artifacts - the outputs you actually consume. This page lists the artifact types, the manifest that describes them, how to download them, and how to pull structured data out of a page with extraction hooks.

Artifact types

Which artifacts a job produces depends on its artifact_options and tier.

Type	MIME	Produced when	Description
`html`	`text/html`	`include_html` (default on)	Final rendered HTML of the page.
`screenshot`	`image/png`	`include_screenshot` (browser)	PNG capture of the rendered page.
`har`	`application/json`	`include_har` (browser)	HTTP Archive (HAR 1.2) of network activity: every request with method, URL, status, headers, timings and response body. Bodies are absent only where the response has none (3xx redirects, `204 No Content`). The time span covered by the HAR is usually shorter than the job's total duration, because settling, scrolling and capture happen after the last request.
`console_log`	`application/x-ndjson`	`include_console` (browser)	Browser console messages, one JSON object per line. Always produced when requested - a page that logged nothing yields an empty file, not a missing artifact.
`reader`	`text/markdown`	`include_reader` (browser)	Article content as Markdown, extracted from the rendered DOM. Produced only when the page has article-like content; listing, search and app pages typically yield none.
`links`	`application/json`	`include_links` (browser)	Every `<a href>` on the rendered page: absolute `url`, link `text`, and `rel`. Duplicates removed.
`text`	`text/plain`	`include_text` (browser)	Visible text of the rendered page (`innerText`).
`page_metadata`	`application/json`	`include_page_metadata` (browser)	Title, description, author, published date, canonical URL, language, favicon, Open Graph, Twitter card, and JSON-LD blocks.
`response_body`	varies	`include_response_body`	Raw HTTP response body.
`extraction`	`application/json`	`include_extraction`	Structured results from your extraction hooks.
`manifest`	`application/json`	always	The manifest itself (see below).

Response metadata - HTTP status, response headers, and timing - is not a separate artifact. It lives in the manifest (see The manifest).
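
Because a `console_log` artifact is newline-delimited JSON, it can be read one line at a time. The sketch below relies only on that format. The `level` and `text` keys in the sample are hypothetical, since this page does not list the fields of a console entry.

```python
import json


def parse_ndjson(raw: str) -> list[dict]:
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


# Hypothetical sample; real console entries may use different keys.
sample = (
    '{"level": "warning", "text": "deprecated API"}\n'
    '{"level": "error", "text": "boom"}\n'
)
entries = parse_ndjson(sample)

# A page that logged nothing yields an empty file, which parses to an empty list.
nothing_logged = parse_ndjson("")
```

Treating an empty file as an empty list matches the table's note that the artifact is still produced when the page logs nothing.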

The manifest

Each job emits a manifest: the forensics record for the run. It describes every artifact produced (ids, types, sizes, integrity hashes) and, at the top level, captures the job's outcome, the target's HTTP response, and timing.

{
  "manifest_version": "1.0.0",
  "job_id": "3d7d1e6e-2b8e-47c2-8bbd-9c2a1a3f9b10",
  "engine": "stealth",
  "target_url": "https://example.com",
  "final_url": "https://example.com/",
  "status": "completed",
  "outcome": "success",
  "response": {
    "http_status": 200,
    "headers": { "content-type": "text/html; charset=utf-8" }
  },
  "timing": {
    "started_at": "2026-06-06T12:00:03Z",
    "finished_at": "2026-06-06T12:00:05Z",
    "duration_ms": 1250
  },
  "credits": { "weight": 30, "charged": true, "amount": 30 },
  "artifacts": [
    {
      "artifact_id": "7d6cbb9f-2f31-4e9c-8c3a-5e9b0f2a14e1",
      "artifact_type": "html",
      "mime_type": "text/html",
      "checksum_sha256": "…",
      "byte_size": 18342,
      "created_at": "2026-06-06T12:00:05Z"
    }
  ]
}

The same artifact list is returned inline on the job object (GET /v1/jobs/{jobId}) and, in the SDK, as Job.artifacts.
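
As a rough illustration of consuming the manifest, the following sketch indexes the artifact list by type using only the standard library. The manifest is a trimmed copy of the example above; the second artifact entry (`links`) and its id are made up for illustration.

```python
import json

# Trimmed copy of the example manifest; the "links" entry is hypothetical.
MANIFEST = json.loads("""
{
  "job_id": "3d7d1e6e-2b8e-47c2-8bbd-9c2a1a3f9b10",
  "status": "completed",
  "outcome": "success",
  "artifacts": [
    {"artifact_id": "7d6cbb9f-2f31-4e9c-8c3a-5e9b0f2a14e1",
     "artifact_type": "html", "mime_type": "text/html", "byte_size": 18342},
    {"artifact_id": "00000000-0000-0000-0000-000000000000",
     "artifact_type": "links", "mime_type": "application/json", "byte_size": 512}
  ]
}
""")


def artifacts_by_type(manifest: dict) -> dict[str, dict]:
    """Index artifact entries by artifact_type (last one wins on duplicates)."""
    return {a["artifact_type"]: a for a in manifest.get("artifacts", [])}


index = artifacts_by_type(MANIFEST)
html_id = index["html"]["artifact_id"]
total_bytes = sum(a["byte_size"] for a in MANIFEST["artifacts"])
```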

Status vs. outcome

The manifest separates two things that are easy to conflate:

Field	Values	Meaning
`status`	`completed`, `failed`	Lifecycle - did the crawl run to completion? `failed` means an execution error (timeout, navigation crash) with no usable response.
`outcome`	`success`, `blocked`, `failed`	Result quality - `success`: content was delivered. `blocked`: the crawl ran but the target denied or challenged it (an anti-bot wall). `failed`: the target returned an error response.

A blocked job has status: "completed" (it ran) but outcome: "blocked", plus a structured block:

{
  "status": "completed",
  "outcome": "blocked",
  "block": { "blocked": true, "signal": "anti_bot_challenge", "http_status": 403 },
  "response": { "http_status": 403, "headers": { … } },
  "credits": { "weight": 30, "charged": false, "amount": 0 }
}

You pay for results, not blocks

Credits are charged only when outcome is success. Blocked and failed jobs cost 0 credits (credits.charged: false). Always check outcome - not just status - before treating a job as a delivered page.
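
One way to apply this rule in client code is a small guard that looks at both fields before using a page. This is a minimal sketch over plain manifest dictionaries; `is_delivered` and `describe` are illustrative helper names, not part of the SDK.

```python
def is_delivered(manifest: dict) -> bool:
    """True only when the crawl completed and content was delivered."""
    return (
        manifest.get("status") == "completed"
        and manifest.get("outcome") == "success"
    )


def describe(manifest: dict) -> str:
    """Short human-readable summary of a manifest's result."""
    if is_delivered(manifest):
        return "delivered"
    if manifest.get("outcome") == "blocked":
        block = manifest.get("block", {})
        signal = block.get("signal", "unknown")
        return f"blocked ({signal}, HTTP {block.get('http_status')})"
    return "failed"


# The blocked example from above, minus the elided response headers.
blocked = {
    "status": "completed",
    "outcome": "blocked",
    "block": {"blocked": True, "signal": "anti_bot_challenge", "http_status": 403},
    "credits": {"weight": 30, "charged": False, "amount": 0},
}
print(describe(blocked))  # blocked (anti_bot_challenge, HTTP 403)
```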

Response headers in the manifest are sanitized: cookie, set-cookie, authorization, and similar credential-bearing headers are redacted.

Downloading artifacts

Artifacts are never public. You exchange an artifact_id for a short-lived presigned URL, then fetch the bytes from that URL.

Python

# `client` is an already-configured ScrapeNest SDK client.
job = client.jobs.get("JOB_ID")
# Pick the first HTML artifact; next() raises StopIteration if the job has none.
art = next(a for a in job.artifacts if a.artifact_type == "html")

# Download the artifact body as text.
html = client.artifacts.download_text(art.artifact_id)

curl

# Exchange the id for a presigned URL (optionally set ttl_seconds, 60–86400)
curl "https://api.scrapenest.com/v1/artifacts/ARTIFACT_ID/download?ttl_seconds=600" \
  -H "X-API-Key: sn_live_..."
# → {"download_url": "https://...", "expires_at": "2026-06-06T12:15:05Z"}

# Fetch the bytes from the presigned URL
curl -L "PRESIGNED_DOWNLOAD_URL" -o result.html

URLs expire

Presigned URLs are valid for a limited window (default ~15 minutes; control with ttl_seconds, range 60–86400). Request a fresh URL each time; don't cache it. You can also receive a ready-to-use URL on the artifact.ready webhook.
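
The TTL range and the expiry check can be handled with two small helpers, sketched here with the standard library only. The endpoint path is taken from the curl example. `download_endpoint`, `is_expired`, and the 30-second safety margin are hypothetical choices for illustration, not SDK features.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import urlencode

API_BASE = "https://api.scrapenest.com/v1"
TTL_MIN, TTL_MAX = 60, 86400  # documented range for ttl_seconds


def download_endpoint(artifact_id: str, ttl_seconds: Optional[int] = None) -> str:
    """Build the URL that exchanges an artifact id for a presigned URL."""
    url = f"{API_BASE}/artifacts/{artifact_id}/download"
    if ttl_seconds is None:
        return url  # the server default applies
    if not TTL_MIN <= ttl_seconds <= TTL_MAX:
        raise ValueError(f"ttl_seconds must be between {TTL_MIN} and {TTL_MAX}")
    return f"{url}?{urlencode({'ttl_seconds': ttl_seconds})}"


def is_expired(expires_at: str, now: datetime, margin_s: int = 30) -> bool:
    """True if the presigned URL has expired or will within margin_s seconds."""
    # Swap the trailing "Z" so fromisoformat also works on Python < 3.11.
    deadline = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    return now >= deadline - timedelta(seconds=margin_s)


endpoint = download_endpoint("ARTIFACT_ID", ttl_seconds=600)
stale = is_expired("2026-06-06T12:15:05Z", datetime.now(timezone.utc))
```

When `is_expired` returns true, the simplest response is to call the download endpoint again for a fresh URL rather than retrying the old one.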

Integrity: every artifact carries a checksum_sha256. Verify it after download if you need tamper-evidence.
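
Checksum verification might look like the sketch below. It assumes `checksum_sha256` is a hex-encoded digest of the raw artifact bytes; the page does not state the encoding, so adjust if the API returns something else (for example base64).

```python
import hashlib
import hmac


def verify_sha256(data: bytes, expected_hex: str) -> bool:
    """Compare the SHA-256 of data against an expected hex digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(actual, expected_hex.strip().lower())


body = b"<html><body>Example Domain</body></html>"
expected = hashlib.sha256(body).hexdigest()  # stands in for checksum_sha256

ok = verify_sha256(body, expected)
tampered = verify_sha256(body + b" ", expected)
```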

Extraction hooks

Instead of downloading HTML and parsing it yourself, let ScrapeNest extract structured data during the job. Add extraction.hooks to the payload and set artifact_options.include_extraction: true. Results come back as an extraction artifact.

A hook has a hook_id, a type, and type-specific fields:

"extraction": {
  "hooks": [
    {"hook_id": "title", "type": "css", "selector": "h1"},
    {"hook_id": "image", "type": "css", "selector": "img.hero", "attribute": "src"},
    {"hook_id": "skus", "type": "regex", "pattern": "SKU-(\\d+)", "group": 1, "all_matches": true},
    {"hook_id": "price", "type": "jsonpath", "path": "$.product.price"}
  ]
}
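
When the payload is built in Python, the two required pieces (the hooks and `include_extraction`) can be added together so neither is forgotten. This is a sketch over plain dictionaries; `with_extraction` is a hypothetical helper, and the base payload shown contains only `artifact_options` because the other job parameters are documented elsewhere.

```python
import copy
import json


def with_extraction(payload: dict, hooks: list[dict]) -> dict:
    """Return a copy of payload with extraction hooks enabled."""
    out = copy.deepcopy(payload)
    out.setdefault("extraction", {})["hooks"] = list(hooks)
    out.setdefault("artifact_options", {})["include_extraction"] = True
    return out


hooks = [
    {"hook_id": "title", "type": "css", "selector": "h1"},
    # A raw string keeps the regex readable; json.dumps adds the JSON escaping.
    {"hook_id": "skus", "type": "regex", "pattern": r"SKU-(\d+)",
     "group": 1, "all_matches": True},
]

base = {"artifact_options": {"include_html": True}}
payload = with_extraction(base, hooks)
body = json.dumps(payload, indent=2)
```

Note that the single backslash in the Python raw string becomes `\\d` in the serialized JSON, which is the form shown in the example above.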

Common fields (all hook types)

Every hook also accepts these optional fields, regardless of type:

Field	Type	Description
`max_results`	integer	Cap the number of values returned when `all_matches` is true. Range 1-1000. Defaults to the engine limit.
`timeout_ms`	integer	Per-hook execution timeout. Range 10-30000. Defaults to the engine limit.

Hook types

CSS selector

Selects elements by CSS. Returns text content by default, or an attribute value.

Field	Type	Description
`selector`	string	Required. CSS selector.
`attribute`	string	Optional. Return this attribute (e.g. `href`, `src`) instead of text.
`all_matches`	boolean	Return every match instead of the first.

{"hook_id": "links", "type": "css", "selector": "a.product", "attribute": "href", "all_matches": true}

Regex

Matches the page source with a regular expression.

Field	Type	Description
`pattern`	string	Required. Regular expression.
`flags`	string	Optional combination of `i` (ignore case), `m` (multiline), `s` (dotall).
`group`	integer	Capture group to return (default `0` = whole match).
`all_matches`	boolean	Return every match instead of the first.

{"hook_id": "ids", "type": "regex", "pattern": "id=(\\d+)", "group": 1, "all_matches": true}

JSONPath

Evaluates a JSONPath expression - ideal for JSON API responses (Light tier) or embedded JSON.

Field	Type	Description
`path`	string	Required. JSONPath expression, e.g. `$.data[*].name`.

{"hook_id": "names", "type": "jsonpath", "path": "$.data[*].name"}
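
Before submitting a job, it can help to try a regex hook locally against a saved copy of the page. The sketch below approximates the documented `pattern`, `flags`, `group` and `all_matches` fields with Python's `re` module. It is only an approximation: the service's regex engine is not specified here and may differ from Python's in edge cases. `preview_regex_hook` is an illustrative name.

```python
import re

# Documented flag letters mapped to Python's equivalents.
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def preview_regex_hook(hook: dict, source: str) -> list:
    """Approximate a regex hook locally with Python's re module."""
    flags = 0
    for letter in hook.get("flags", ""):
        flags |= FLAG_MAP[letter]
    group = hook.get("group", 0)  # 0 means the whole match
    values = [m.group(group) for m in re.finditer(hook["pattern"], source, flags)]
    if not hook.get("all_matches", False):
        values = values[:1]  # first match only
    return values


hook = {"hook_id": "ids", "type": "regex", "pattern": r"id=(\d+)",
        "group": 1, "all_matches": True}
print(preview_regex_hook(hook, "a?id=12&b=1 c?id=34"))  # ['12', '34']
```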

Result format

The extraction artifact contains one entry per hook:

{
  "hooks": [
    {
      "hook_id": "title",
      "type": "css",
      "status": "succeeded",
      "duration_ms": 4,
      "values": ["Example Domain"]
    },
    {
      "hook_id": "price",
      "type": "jsonpath",
      "status": "failed",
      "duration_ms": 1,
      "values": [],
      "error": "no_match"
    }
  ]
}

Field	Description
`status`	`succeeded`, `failed`, `timeout`, or `invalid`.
`values`	Array of extracted values (empty if none matched).
`error`	Error code when `status != succeeded` (e.g. `no_match`, `invalid_selector`).
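
A consumer usually wants the extracted values keyed by `hook_id`, with failures kept separately. The following sketch does that for the example result above; `split_results` is an illustrative helper, not part of the SDK.

```python
import json

# The example extraction artifact from above.
EXTRACTION = json.loads("""
{
  "hooks": [
    {"hook_id": "title", "type": "css", "status": "succeeded",
     "duration_ms": 4, "values": ["Example Domain"]},
    {"hook_id": "price", "type": "jsonpath", "status": "failed",
     "duration_ms": 1, "values": [], "error": "no_match"}
  ]
}
""")


def split_results(extraction: dict) -> tuple[dict, dict]:
    """Return (values by hook_id, error codes by hook_id)."""
    values, errors = {}, {}
    for hook in extraction.get("hooks", []):
        if hook["status"] == "succeeded":
            values[hook["hook_id"]] = hook["values"]
        else:
            # Fall back to the status if no error code is present.
            errors[hook["hook_id"]] = hook.get("error", hook["status"])
    return values, errors


values, errors = split_results(EXTRACTION)
```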

Limits

Limit	Value
Max hooks per job	25
Max results per hook	50 (configurable per hook via `max_results`, up to 1000)
Per-hook timeout	1000 ms (configurable per hook via `timeout_ms`, up to 30000)

A failing or empty hook never fails the job - check each hook's status in the result.
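
These limits can be checked on the client before a job is submitted. The sketch below uses the hook count limit from this table and the `timeout_ms` and `max_results` ranges from the common fields table. `validate_hooks` is a hypothetical helper, and the API may enforce further rules that are not covered here.

```python
MAX_HOOKS = 25
TIMEOUT_MS_RANGE = (10, 30000)
MAX_RESULTS_RANGE = (1, 1000)


def validate_hooks(hooks: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means no limit is exceeded."""
    problems = []
    if len(hooks) > MAX_HOOKS:
        problems.append(f"too many hooks: {len(hooks)} > {MAX_HOOKS}")
    for hook in hooks:
        hook_id = hook.get("hook_id", "<missing hook_id>")
        for field, (low, high) in (("timeout_ms", TIMEOUT_MS_RANGE),
                                   ("max_results", MAX_RESULTS_RANGE)):
            value = hook.get(field)
            # Optional fields are only checked when present.
            if value is not None and not low <= value <= high:
                problems.append(f"{hook_id}: {field} {value} outside {low}-{high}")
    return problems


hooks = [
    {"hook_id": "title", "type": "css", "selector": "h1", "timeout_ms": 5},
    {"hook_id": "skus", "type": "regex", "pattern": r"SKU-(\d+)", "max_results": 100},
]
problems = validate_hooks(hooks)
```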

Next steps

Extract structured data - extraction hooks in a full recipe.
Job Parameters - artifact_options and extraction in context.
Data Retention & Holds - how long artifacts live.