Artifacts & Extraction
Every scraping job produces one or more artifacts — the outputs you actually consume. This page lists the artifact types, the manifest that describes them, how to download them, and how to pull structured data out of a page with extraction hooks.
Artifact types
Which artifacts a job produces depends on its artifact_options and tier.
| Type | MIME | Produced when | Description |
|---|---|---|---|
| html | text/html | include_html (default on) | Final rendered HTML of the page. |
| screenshot | image/png | include_screenshot (browser) | PNG capture of the rendered page. |
| har | application/json | include_har (browser) | Full HTTP Archive of network activity. |
| console_log | application/x-ndjson | include_console (browser) | Browser console messages, one JSON object per line. |
| response_body | varies | include_response_body | Raw HTTP response body. |
| json | application/json | include_extraction | Structured results from your extraction hooks. |
| text | text/plain | include_text | Cleaned, readable text content. |
| manifest | application/json | always | The manifest itself (see below). |
Response metadata — HTTP status, response headers, and timing — is not a separate artifact. It lives in the manifest (see The manifest).
The manifest
Each job emits a manifest: the forensics record for the run. It describes every artifact produced (ids, types, sizes, integrity hashes) and, at the top level, captures the job's outcome, the target's HTTP response, and timing.
{
"manifest_version": "1.0.0",
"job_id": "3d7d1e6e-2b8e-47c2-8bbd-9c2a1a3f9b10",
"engine": "stealth",
"target_url": "https://example.com",
"final_url": "https://example.com/",
"status": "completed",
"outcome": "success",
"response": {
"http_status": 200,
"headers": { "content-type": "text/html; charset=utf-8" }
},
"timing": {
"started_at": "2026-06-06T12:00:03Z",
"finished_at": "2026-06-06T12:00:05Z",
"duration_ms": 1250
},
"credits": { "weight": 30, "charged": true, "amount": 30 },
"artifacts": [
{
"artifact_id": "7d6cbb9f-2f31-4e9c-8c3a-5e9b0f2a14e1",
"artifact_type": "html",
"mime_type": "text/html",
"checksum_sha256": "…",
"byte_size": 18342,
"created_at": "2026-06-06T12:00:05Z"
}
]
}
The same artifact list is returned inline on the job object (GET /api/v1/jobs/{jobId}) and, in the SDK, as Job.artifacts.
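As a rough illustration of reading a manifest, the sketch below groups the artifact list by type. It relies only on fields shown in the example above; the helper name `artifacts_by_type` is hypothetical and not part of any ScrapeNest SDK.

```python
import json

# Trimmed copy of the example manifest above (only the fields used here).
MANIFEST_JSON = """
{
  "job_id": "3d7d1e6e-2b8e-47c2-8bbd-9c2a1a3f9b10",
  "status": "completed",
  "outcome": "success",
  "artifacts": [
    {
      "artifact_id": "7d6cbb9f-2f31-4e9c-8c3a-5e9b0f2a14e1",
      "artifact_type": "html",
      "mime_type": "text/html",
      "byte_size": 18342
    }
  ]
}
"""


def artifacts_by_type(manifest: dict) -> dict:
    """Group artifact entries by artifact_type (hypothetical helper)."""
    grouped: dict = {}
    for artifact in manifest.get("artifacts", []):
        grouped.setdefault(artifact["artifact_type"], []).append(artifact)
    return grouped


manifest = json.loads(MANIFEST_JSON)
html_artifacts = artifacts_by_type(manifest).get("html", [])
for artifact in html_artifacts:
    # The artifact_id is what you later exchange for a download URL.
    print(artifact["artifact_id"], artifact["byte_size"])
```

Grouping by type keeps the lookup simple when a job produces several artifacts, for example an html artifact alongside a screenshot.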
Status vs. outcome
The manifest separates two things that are easy to conflate:
| Field | Values | Meaning |
|---|---|---|
| status | completed, failed | Lifecycle: did the crawl run to completion? failed means an execution error (timeout, navigation crash) with no usable response. |
| outcome | success, blocked, failed | Result quality. success: content was delivered. blocked: the crawl ran but the target denied or challenged it (an anti-bot wall). failed: the target returned an error response. |
A blocked job has status: "completed" (it ran) but outcome: "blocked", plus a structured block:
{
"status": "completed",
"outcome": "blocked",
"block": { "blocked": true, "signal": "anti_bot_challenge", "http_status": 403 },
"response": { "http_status": 403, "headers": { … } },
"credits": { "weight": 30, "charged": false, "amount": 0 }
}
You pay for results, not blocks
Credits are charged only when outcome is success. Blocked and failed jobs cost 0 credits (credits.charged: false). Always check outcome — not just status — before treating a job as a delivered page.
Response headers in the manifest are sanitized: cookie, set-cookie, authorization, and similar credential-bearing headers are redacted.
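The status and outcome distinction can be checked with a few lines of code. The following is a minimal sketch that uses only the manifest fields documented in this section; the function names `is_delivered` and `describe` are hypothetical.

```python
def is_delivered(manifest: dict) -> bool:
    """True only when the crawl completed AND content was delivered."""
    return (
        manifest.get("status") == "completed"
        and manifest.get("outcome") == "success"
    )


def describe(manifest: dict) -> str:
    """Summarize a manifest in one line (hypothetical helper)."""
    if is_delivered(manifest):
        return "delivered"
    if manifest.get("outcome") == "blocked":
        block = manifest.get("block", {})
        signal = block.get("signal", "unknown")
        return f"blocked ({signal}, HTTP {block.get('http_status')})"
    return "failed"


# The blocked example from above, reduced to the fields this sketch reads.
blocked = {
    "status": "completed",
    "outcome": "blocked",
    "block": {"blocked": True, "signal": "anti_bot_challenge", "http_status": 403},
    "credits": {"weight": 30, "charged": False, "amount": 0},
}

print(describe(blocked))  # blocked (anti_bot_challenge, HTTP 403)
```

Checking status alone would treat this job as a delivered page, which is the mistake the outcome field is meant to prevent.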
Downloading artifacts
Artifacts are never public. You exchange an artifact_id for a short-lived presigned URL, then fetch the bytes from that URL.
# Exchange the id for a presigned URL (optionally set ttl_seconds, 60–86400)
curl "https://api.scrapenest.com/api/v1/artifacts/ARTIFACT_ID/download?ttl_seconds=600" \
-H "X-API-Key: sn_live_..."
# → {"download_url": "https://...", "expires_at": "2026-06-06T12:15:05Z"}
curl -L "PRESIGNED_DOWNLOAD_URL" -o result.html
URLs expire
Presigned URLs are valid for a limited window (default ~15 minutes; control with ttl_seconds, range 60–86400). Request a fresh URL each time; don't cache it. You can also receive a ready-to-use URL on the artifact.ready webhook.
Integrity: every artifact carries a checksum_sha256. Verify it after download if you need tamper-evidence.
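The exchange, download, and verification steps could be combined roughly as follows. This is a sketch using only the Python standard library, not an official client. It assumes that checksum_sha256 is a hex-encoded digest (the manifest example elides the value), and the helper names are hypothetical.

```python
import hashlib
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.scrapenest.com/api/v1"


def download_endpoint(artifact_id: str, ttl_seconds: int = 600) -> str:
    """Build the exchange URL, enforcing the documented 60-86400 range."""
    if not 60 <= ttl_seconds <= 86400:
        raise ValueError("ttl_seconds must be between 60 and 86400")
    query = urllib.parse.urlencode({"ttl_seconds": ttl_seconds})
    safe_id = urllib.parse.quote(artifact_id, safe="")
    return f"{API_BASE}/artifacts/{safe_id}/download?{query}"


def checksum_matches(data: bytes, expected_sha256: str) -> bool:
    """Compare the SHA-256 of the bytes with the manifest value.

    Assumption: the manifest stores the digest as a hex string.
    """
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()


def fetch_artifact(artifact_id: str, api_key: str, expected_sha256: str) -> bytes:
    """Exchange the id for a presigned URL, download, then verify."""
    request = urllib.request.Request(
        download_endpoint(artifact_id), headers={"X-API-Key": api_key}
    )
    with urllib.request.urlopen(request) as response:
        download_url = json.load(response)["download_url"]
    # A fresh presigned URL is requested on every call and never cached.
    with urllib.request.urlopen(download_url) as response:
        data = response.read()
    if not checksum_matches(data, expected_sha256):
        raise ValueError("checksum mismatch for artifact " + artifact_id)
    return data
```

Calling `fetch_artifact` with an artifact_id and checksum_sha256 taken from the manifest returns the verified bytes, or raises if the digest does not match.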
Extraction hooks
Instead of downloading HTML and parsing it yourself, let ScrapeNest extract structured data during the job. Add extraction.hooks to the payload and set artifact_options.include_extraction: true. Results come back as a json artifact.
A hook has an id, a type, and type-specific fields:
"extraction": {
"hooks": [
{"hook_id": "title", "type": "css", "selector": "h1"},
{"hook_id": "image", "type": "css", "selector": "img.hero", "attribute": "src"},
{"hook_id": "skus", "type": "regex", "pattern": "SKU-(\\d+)", "group": 1, "all_matches": true},
{"hook_id": "price", "type": "jsonpath", "path": "$.product.price"}
]
}
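If you assemble the payload in code, the hook list and the artifact option might be built together as in the sketch below. It covers only the two payload keys named above; the rest of the job payload is left out, and the variable name `payload_fragment` is hypothetical.

```python
import json

# Same hooks as the JSON example above. A raw string keeps the regex
# readable; json.dumps adds the JSON escaping for the backslash.
hooks = [
    {"hook_id": "title", "type": "css", "selector": "h1"},
    {"hook_id": "image", "type": "css", "selector": "img.hero", "attribute": "src"},
    {"hook_id": "skus", "type": "regex", "pattern": r"SKU-(\d+)",
     "group": 1, "all_matches": True},
    {"hook_id": "price", "type": "jsonpath", "path": "$.product.price"},
]

# Only the extraction-related part of a job payload; merge it into the
# rest of your request body.
payload_fragment = {
    "artifact_options": {"include_extraction": True},
    "extraction": {"hooks": hooks},
}

print(json.dumps(payload_fragment, indent=2))
```

Without include_extraction set to true, the hooks would have no json artifact to report into, so the two keys are kept side by side here.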
Hook types
css: selects elements by CSS selector. Returns text content by default, or the value of attribute when it is set.
| Field | Type | Description |
|---|---|---|
| selector | string | Required. CSS selector. |
| attribute | string | Optional. Return this attribute (e.g. href, src) instead of text. |
| all_matches | boolean | Return every match instead of the first. |
regex: matches the page source against a regular expression.
| Field | Type | Description |
|---|---|---|
| pattern | string | Required. Regular expression. |
| flags | string | Optional combination of i (ignore case), m (multiline), s (dotall). |
| group | integer | Capture group to return (default 0 = whole match). |
| all_matches | boolean | Return every match instead of the first. |
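To try a pattern before submitting a job, the regex fields can be approximated locally. The sketch below mirrors pattern, flags, group, and all_matches with Python's `re` module. It assumes the service's regex dialect behaves like Python's for the pattern in question, which this page does not confirm, and `run_regex_hook` is a hypothetical name.

```python
import re

# Map the documented flag letters onto Python's re flags.
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def run_regex_hook(source: str, pattern: str, flags: str = "",
                   group: int = 0, all_matches: bool = False) -> list:
    """Local approximation of a regex hook (not the service's engine)."""
    compiled_flags = 0
    for letter in flags:
        compiled_flags |= FLAG_MAP[letter]
    values = [m.group(group) for m in re.finditer(pattern, source, compiled_flags)]
    # Without all_matches, only the first match is returned.
    return values if all_matches else values[:1]


page = "In stock: SKU-1001, sku-2002 and SKU-3003"
print(run_regex_hook(page, r"SKU-(\d+)", group=1, all_matches=True))
# ['1001', '3003']
print(run_regex_hook(page, r"SKU-(\d+)", flags="i", group=1, all_matches=True))
# ['1001', '2002', '3003']
print(run_regex_hook(page, r"SKU-(\d+)"))
# ['SKU-1001']
```

The last call shows the defaults: group 0 returns the whole match, and only the first match is kept.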
Result format
The extraction artifact contains one entry per hook:
{
"hooks": [
{
"hook_id": "title",
"type": "css",
"status": "succeeded",
"duration_ms": 4,
"values": ["Example Domain"]
},
{
"hook_id": "price",
"type": "jsonpath",
"status": "failed",
"duration_ms": 1,
"values": [],
"error": "no_match"
}
]
}
| Field | Description |
|---|---|
| status | succeeded, failed, timeout, or invalid. |
| values | Array of extracted values (empty if none matched). |
| error | Error code when status != succeeded (e.g. no_match, invalid_selector). |
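One way to consume the result is to split it into extracted values and per-hook errors. The sketch below works on the example artifact above; the helper `split_results` is hypothetical.

```python
# The example extraction artifact from above, as a Python dict.
extraction = {
    "hooks": [
        {"hook_id": "title", "type": "css", "status": "succeeded",
         "duration_ms": 4, "values": ["Example Domain"]},
        {"hook_id": "price", "type": "jsonpath", "status": "failed",
         "duration_ms": 1, "values": [], "error": "no_match"},
    ]
}


def split_results(artifact: dict) -> tuple:
    """Return (values_by_hook, errors_by_hook) for an extraction artifact."""
    values, errors = {}, {}
    for hook in artifact.get("hooks", []):
        if hook["status"] == "succeeded":
            values[hook["hook_id"]] = hook["values"]
        else:
            # Fall back to the status when no error code is present.
            errors[hook["hook_id"]] = hook.get("error", hook["status"])
    return values, errors


values, errors = split_results(extraction)
print(values)  # {'title': ['Example Domain']}
print(errors)  # {'price': 'no_match'}
```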
Limits
| Limit | Value |
|---|---|
| Max hooks per job | 25 |
| Max results per hook | 50 |
| Per-hook timeout | 1000 ms (configurable per hook via timeout_ms, up to 30000) |
A failing or empty hook never fails the job — check each hook's status in the result.
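A client-side pre-check against these limits might look like the sketch below. It covers the hook count and the timeout_ms ceiling from the table. The lower bound of 1 ms for timeout_ms is an assumption, since this page only states the maximum, and `check_hooks` is a hypothetical helper.

```python
MAX_HOOKS_PER_JOB = 25
MAX_TIMEOUT_MS = 30000  # documented ceiling for timeout_ms


def check_hooks(hooks: list) -> list:
    """Return a list of human-readable problems (empty when all is fine)."""
    problems = []
    if len(hooks) > MAX_HOOKS_PER_JOB:
        problems.append(
            f"{len(hooks)} hooks exceeds the limit of {MAX_HOOKS_PER_JOB}"
        )
    for hook in hooks:
        timeout = hook.get("timeout_ms")
        # Assumption: timeout_ms must be a positive integer.
        if timeout is not None and not 1 <= timeout <= MAX_TIMEOUT_MS:
            problems.append(
                f"{hook.get('hook_id')}: timeout_ms {timeout} "
                f"is outside 1-{MAX_TIMEOUT_MS}"
            )
    return problems


hooks = [
    {"hook_id": "title", "type": "css", "selector": "h1"},
    {"hook_id": "slow", "type": "regex", "pattern": "x+", "timeout_ms": 60000},
]
print(check_hooks(hooks))
# ['slow: timeout_ms 60000 is outside 1-30000']
```

The 50-results-per-hook limit is not checked here because it applies to the output rather than the request.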
Next steps
- Extract structured data: extraction hooks in a full recipe.
- Job Parameters: artifact_options and extraction in context.
- Data Retention & Holds: how long artifacts live.