opendevbrowser-data-extraction
Use this skill to extract structured, auditable datasets from dynamic pages with compliance-aware workflows.
Data Extraction Skill
Pack Contents
- artifacts/extraction-workflows.md
- assets/templates/extraction-schema.json
- assets/templates/pagination-state.json
- assets/templates/quality-gates.json
- assets/templates/compliance-checklist.md
- scripts/run-extraction-workflow.sh
- scripts/validate-skill-assets.sh
- Shared robustness matrix: ../opendevbrowser-best-practices/artifacts/browser-agent-known-issues-matrix.md
Fast Start
./skills/opendevbrowser-data-extraction/scripts/validate-skill-assets.sh
./skills/opendevbrowser-data-extraction/scripts/run-extraction-workflow.sh list
./skills/opendevbrowser-data-extraction/scripts/run-extraction-workflow.sh pagination
./skills/opendevbrowser-data-extraction/scripts/run-extraction-workflow.sh infinite-scroll
Core Rules
- Define schema before extraction.
- Track provenance for each record (source_url, provider, captured_at, page).
- Prefer embedded structured data (JSON-LD/microdata) where available.
- Stop on sustained anti-bot pressure (repeated 403/429/challenge loops).
- Honor Retry-After and preserve checkpoint state before retrying pagination.
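
The provenance rule can be sketched as follows. This is a minimal illustration, not the skill's own implementation: the four field names come from the rule above, while the `Provenance` class, the `with_provenance` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    """Provenance fields named in the Core Rules (class itself is hypothetical)."""
    source_url: str
    provider: str
    captured_at: str  # ISO 8601 timestamp in UTC
    page: int


def with_provenance(fields: dict, source_url: str, provider: str, page: int) -> dict:
    """Return a copy of the extracted fields with a provenance block attached."""
    prov = Provenance(
        source_url=source_url,
        provider=provider,
        captured_at=datetime.now(timezone.utc).isoformat(),
        page=page,
    )
    return {**fields, "provenance": asdict(prov)}


# Illustrative values only.
record = with_provenance(
    {"title": "Example item", "price": "19.99"},
    source_url="https://example.com/catalog?page=2",
    provider="example-provider",
    page=2,
)
```

Keeping provenance in its own nested block means a record can be audited without confusing capture metadata with extracted fields.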
Parallel Multitab Alignment
- Apply shared concurrency policy from ../opendevbrowser-best-practices/SKILL.md ("Parallel Operations").
- Run extraction acceptance on managed, extension, and cdpConnect before claiming mode parity.
- Keep one session per worker; avoid interleaving target-use streams inside a single session.
Robustness Coverage (Known-Issue Matrix)
Matrix source: ../opendevbrowser-best-practices/artifacts/browser-agent-known-issues-matrix.md
- ISSUE-01: stale refs after dynamic content updates
- ISSUE-06: 429/backoff and retry budgeting
- ISSUE-08: blocked/restricted origins and policy checks
- ISSUE-09: pagination drift, duplicate accumulation, terminal detection
- ISSUE-10: locale/currency parsing consistency
Extraction Planning
- Define required fields and null policy.
- Snapshot and map refs to schema.
- Choose pagination strategy.
- Apply quality gates each page.
opendevbrowser_snapshot sessionId="<session-id>" format="actionables"
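
A required-field list and a null policy can be expressed in a few lines. The sketch below is illustrative and does not reflect the contents of assets/templates/extraction-schema.json: the `SCHEMA` mapping, the `apply_null_policy` helper, and the rule that blank strings count as null are all assumptions.

```python
from typing import Any

# Hypothetical schema: field name -> whether the field is required.
SCHEMA = {"title": True, "price": True, "rating": False}


def apply_null_policy(raw: dict[str, Any], schema: dict[str, bool]) -> tuple[dict[str, Any], list[str]]:
    """Project raw onto the schema and return (record, missing_required_fields)."""
    record = {}
    missing = []
    for name, required in schema.items():
        value = raw.get(name)
        if isinstance(value, str):
            value = value.strip() or None  # assumed policy: blank strings become null
        record[name] = value
        if required and value is None:
            missing.append(name)
    return record, missing


# Fields outside the schema ("extra") are dropped; the blank price is reported.
record, missing = apply_null_policy({"title": " Widget ", "price": "", "extra": 1}, SCHEMA)
```

Fixing the schema before extraction makes missing required fields visible per record instead of surfacing later as silent gaps in the dataset.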
Structured Data First
Attempt extraction in this order:
- JSON-LD product/article blocks
- semantic table/list/card DOM
- fallback text parsing
opendevbrowser_dom_get_text sessionId="<session-id>" ref="<json-ld-ref>"
opendevbrowser_dom_get_html sessionId="<session-id>" ref="<table-ref>"
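
The first step of that order, JSON-LD, can be sketched with the Python standard library alone. This assumes the page HTML has already been captured as a string; `JsonLdCollector`, `extract_jsonld`, and the sample markup are hypothetical, and the sketch does not unpack `@graph` containers.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_jsonld(html: str, wanted_types=("Product", "Article")) -> list:
    """Return JSON-LD objects whose @type is one of wanted_types."""
    collector = JsonLdCollector()
    collector.feed(html)
    found = []
    for text in collector.blocks:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # malformed block: fall through to DOM or text parsing
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if any(t in wanted_types for t in types):
                found.append(item)
    return found


sample = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Widget",
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}}
</script>
<script type="application/ld+json">{not valid json}</script>
</head></html>"""
products = extract_jsonld(sample)
```

An empty result is the signal to move on to the semantic DOM step rather than a failure in itself.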
Pagination Patterns
Numbered/Next pagination
opendevbrowser_click sessionId="<session-id>" ref="<next-ref>"
opendevbrowser_wait sessionId="<session-id>" until="networkidle"
opendevbrowser_snapshot sessionId="<session-id>" format="actionables"
Infinite scroll
opendevbrowser_scroll sessionId="<session-id>" dy=1000
opendevbrowser_wait sessionId="<session-id>" until="networkidle"
Load more
opendevbrowser_click sessionId="<session-id>" ref="<load-more-ref>"
opendevbrowser_wait sessionId="<session-id>" until="networkidle"
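
All three patterns share the same control loop: advance, extract, checkpoint, and stop when nothing new arrives. The sketch below shows that loop under stated assumptions: `fetch_page` is a hypothetical callable standing in for the click/scroll, wait, and snapshot steps above, and the checkpoint fields (`next_page`, `seen_keys`, `stale_pages`) are invented for illustration rather than taken from assets/templates/pagination-state.json.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def paginate(fetch_page: Callable[[int], list], state_path: Path,
             key: str = "url", max_pages: int = 50, max_stale_pages: int = 2) -> list:
    """Collect records page by page until no new keys appear or max_pages is hit."""
    # Resume from a previous checkpoint if one exists.
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"next_page": 1, "seen_keys": [], "stale_pages": 0}
    seen = set(state["seen_keys"])
    collected = []  # records gathered during this run only

    while state["next_page"] <= max_pages and state["stale_pages"] < max_stale_pages:
        records = fetch_page(state["next_page"])
        fresh = []
        for r in records:
            if r[key] not in seen:  # drop duplicates caused by pagination drift
                seen.add(r[key])
                fresh.append(r)
        collected.extend(fresh)
        # A page with no unseen keys suggests drift or the end of the listing.
        state["stale_pages"] = 0 if fresh else state["stale_pages"] + 1
        state["next_page"] += 1
        state["seen_keys"] = sorted(seen)
        state_path.write_text(json.dumps(state))  # checkpoint after every page
    return collected


# Simulated listing: page 2 repeats "b", pages 3 and 4 add nothing new.
pages = {1: [{"url": "a"}, {"url": "b"}], 2: [{"url": "b"}, {"url": "c"}],
         3: [{"url": "c"}], 4: [{"url": "c"}]}
with tempfile.TemporaryDirectory() as tmp:
    rows = paginate(lambda n: pages.get(n, []), Path(tmp) / "state.json")
```

Writing the checkpoint after every page means a retry after a 429 can resume from `next_page` instead of restarting and re-accumulating duplicates.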
Quality Gates
Apply per page:
- dedupe by stable key (URL or canonical ID)
- null-rate check for required fields
- count delta check (each page must add new records)
- consistency check for currency/units
- max consecutive challenge/429 loops before stop
Use assets/templates/quality-gates.json.
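
A per-page gate run might look like the sketch below. It is an illustration only: the `check_page` helper, the field names, and the 10% null-rate threshold are assumptions, not values read from assets/templates/quality-gates.json.

```python
def check_page(records, seen_keys, required=("title", "price"), key="url",
               max_null_rate=0.1):
    """Run per-page gates and return (unique_records, report)."""
    # Gate 1: dedupe by stable key; records without a key are skipped.
    unique = []
    for r in records:
        k = r.get(key)
        if k is None or k in seen_keys:
            continue
        seen_keys.add(k)
        unique.append(r)

    # Gate 2: null rate per required field, measured on the new records.
    null_rates = {
        f: (sum(1 for r in unique if r.get(f) in (None, "")) / len(unique)) if unique else 0.0
        for f in required
    }

    # Gate 4: all new records on a page should share one currency code.
    currencies = {r["currency"] for r in unique if r.get("currency")}

    report = {
        "new_records": len(unique),
        "count_delta_ok": len(unique) > 0,  # Gate 3: the page added something
        "null_rate_ok": all(rate <= max_null_rate for rate in null_rates.values()),
        "currency_consistent": len(currencies) <= 1,
        "null_rates": null_rates,
    }
    return unique, report


seen = {"u1"}  # keys accepted on earlier pages
page = [
    {"url": "u1", "title": "A", "price": "1", "currency": "USD"},
    {"url": "u2", "title": "B", "price": None, "currency": "USD"},
    {"url": "u3", "title": "C", "price": "3", "currency": "EUR"},
]
unique, report = check_page(page, seen)
```

In this example the page passes the count delta gate but fails the null-rate and currency gates, which is the kind of result that should pause extraction for inspection rather than be written to the dataset.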
Compliance and Safety
- Respect robots.txt and site terms.
- Use pacing; do not flood endpoints.
- Treat robots.txt as policy guidance, not as an authorization mechanism.
- Stop or back off on repeated 429/403 and challenges.
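
The back-off and stop rules can be combined into one small policy object. The sketch below is hypothetical: `retry_after_seconds`, `BackoffPolicy`, and the default limits (three consecutive blocks, 2 s base delay, 60 s cap) are assumptions chosen for illustration. It accepts both forms of Retry-After (delta-seconds and HTTP-date), and challenge pages would need to be mapped onto a blocked status by the caller.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After value (delta-seconds or HTTP-date) into seconds, or None."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: caller falls back to plain backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


class BackoffPolicy:
    """Track consecutive 403/429 responses and decide whether to wait or stop."""

    def __init__(self, max_consecutive=3, base_delay=2.0, max_delay=60.0):
        self.max_consecutive = max_consecutive
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.consecutive = 0

    def on_response(self, status, retry_after=None):
        """Return seconds to wait before the next request, or None to stop."""
        if status not in (403, 429):
            self.consecutive = 0
            return 0.0
        self.consecutive += 1
        if self.consecutive >= self.max_consecutive:
            return None  # sustained pressure: stop and keep the checkpoint
        hinted = retry_after_seconds(retry_after)
        backoff = min(self.max_delay, self.base_delay * 2 ** (self.consecutive - 1))
        # Never wait less than the server asked for.
        return max(backoff, hinted or 0.0)


policy = BackoffPolicy()
waits = [policy.on_response(429, "5"), policy.on_response(429), policy.on_response(429)]
```

Here the first block waits the 5 seconds the server requested, the second falls back to exponential backoff, and the third returns `None`, which is the cue to stop rather than keep probing.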
References
- RFC 9309 (robots protocol): https://www.rfc-editor.org/rfc/rfc9309
- Google robots docs: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
- Schema.org Offer: https://schema.org/Offer
- Google Product structured data: https://developers.google.com/search/docs/appearance/structured-data/product-snippet
- Playwright best practices: https://playwright.dev/docs/best-practices