opendevbrowser-data-extraction

Use this skill to extract structured, auditable datasets from dynamic pages with compliance-aware workflows.

Data Extraction Skill

Pack Contents

artifacts/extraction-workflows.md
assets/templates/extraction-schema.json
assets/templates/pagination-state.json
assets/templates/quality-gates.json
assets/templates/compliance-checklist.md
scripts/run-extraction-workflow.sh
scripts/validate-skill-assets.sh
Shared robustness matrix: ../opendevbrowser-best-practices/artifacts/browser-agent-known-issues-matrix.md

Fast Start


./skills/opendevbrowser-data-extraction/scripts/validate-skill-assets.sh

./skills/opendevbrowser-data-extraction/scripts/run-extraction-workflow.sh list

./skills/opendevbrowser-data-extraction/scripts/run-extraction-workflow.sh pagination

./skills/opendevbrowser-data-extraction/scripts/run-extraction-workflow.sh infinite-scroll

Supporting Surfaces

Use browser replay (screencast-start / screencast-stop) when lazy loading, infinite scroll, or pagination drift needs time-based proof.
Use desktop observation only for read-only evidence around sibling desktop surfaces; most extraction flows should stay browser-only.
Use --challenge-automation-mode off|browser|browser_with_helper only for bounded browser-scoped computer use when provider challenges appear; stop before any desktop-control interpretation.

Core Rules

Define schema before extraction.
Track provenance for each record (source_url, provider, captured_at, page).
Prefer embedded structured data (JSON-LD/microdata) where available.
Stop on sustained anti-bot pressure (repeated 403/429/challenge loops).
Honor Retry-After and preserve checkpoint state before retrying pagination.
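
The provenance rule above can be sketched as a small helper that stamps every record before it is stored. This is only an illustration: the `Provenance` class, the `with_provenance` function, and the `_provenance` key are hypothetical names, not part of the skill's templates; only the four field names come from the rule itself.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Provenance:
    """Hypothetical container for the four provenance fields named in Core Rules."""
    source_url: str
    provider: str
    captured_at: str  # ISO 8601 timestamp, normalized to UTC
    page: int


def with_provenance(
    record: Dict[str, Any],
    source_url: str,
    provider: str,
    page: int,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Return a copy of `record` with provenance attached under `_provenance`."""
    captured = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    prov = Provenance(source_url, provider, captured.isoformat(), page)
    # Copy rather than mutate so the raw extracted record stays untouched.
    return {**record, "_provenance": asdict(prov)}
```

Keeping provenance in a separate key, rather than mixing it into the schema fields, makes it easy to strip or audit later.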

Parallel Multitab Alignment

Apply shared concurrency policy from ../opendevbrowser-best-practices/SKILL.md ("Parallel Operations").
Run extraction acceptance on managed, extension, and cdpConnect before claiming mode parity.
Keep one session per worker; avoid interleaving target-use streams inside a single session.

Robustness Coverage (Known-Issue Matrix)

Matrix source: ../opendevbrowser-best-practices/artifacts/browser-agent-known-issues-matrix.md

ISSUE-01: stale refs after dynamic content updates
ISSUE-06: 429/backoff and retry budgeting
ISSUE-08: blocked/restricted origins and policy checks
ISSUE-09: pagination drift, duplicate accumulation, terminal detection
ISSUE-10: locale/currency parsing consistency
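
For ISSUE-10, one possible approach is to declare the decimal separator per provider instead of guessing it from each string. The sketch below assumes that convention; `parse_amount` and its `decimal_sep` parameter are hypothetical and are not defined by the matrix or the skill's templates.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(text: str, decimal_sep: str = ".") -> Optional[Decimal]:
    """Parse a displayed price into a Decimal using a declared decimal separator.

    The other of "." and "," is treated as a grouping separator and dropped.
    Returns None when no parseable number remains.
    """
    group_sep = "," if decimal_sep == "." else "."
    # Keep only digits, separators, and a minus sign (drops symbols and spaces).
    cleaned = re.sub(r"[^\d.,\-]", "", text)
    normalized = cleaned.replace(group_sep, "").replace(decimal_sep, ".")
    try:
        return Decimal(normalized)
    except InvalidOperation:
        return None
```

Using `Decimal` rather than `float` avoids rounding surprises when amounts are later compared or summed.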

Extraction Planning

Define required fields and a null policy.
Snapshot the page and map refs to schema fields.
Choose a pagination strategy.
Apply quality gates on each page.


opendevbrowser_snapshot sessionId="<session-id>" format="actionables"

Structured Data First

Attempt extraction in this order:

JSON-LD product/article blocks
semantic table/list/card DOM
fallback text parsing


opendevbrowser_dom_get_text sessionId="<session-id>" ref="<json-ld-ref>"

opendevbrowser_dom_get_html sessionId="<session-id>" ref="<table-ref>"
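
Once the JSON-LD text has been retrieved, it still has to be parsed and filtered by type. The following is a minimal sketch, assuming the retrieved text is the raw content of one JSON-LD script block; the helper names `iter_jsonld_nodes` and `nodes_of_type` are hypothetical.

```python
import json
from typing import Any, Dict, List


def iter_jsonld_nodes(raw: str) -> List[Dict[str, Any]]:
    """Flatten one JSON-LD payload into a list of node dicts.

    Handles a single object, a top-level array, and an object with "@graph".
    Returns an empty list when the text is not valid JSON.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    top = data if isinstance(data, list) else [data]
    nodes: List[Dict[str, Any]] = []
    for item in top:
        if not isinstance(item, dict):
            continue
        graph = item.get("@graph")
        if isinstance(graph, list):
            nodes.extend(n for n in graph if isinstance(n, dict))
        else:
            nodes.append(item)
    return nodes


def nodes_of_type(raw: str, wanted: str) -> List[Dict[str, Any]]:
    """Return nodes whose "@type" equals `wanted` or lists it."""
    matches = []
    for node in iter_jsonld_nodes(raw):
        declared = node.get("@type")
        types = declared if isinstance(declared, list) else [declared]
        if wanted in types:
            matches.append(node)
    return matches
```

An empty result is the signal to fall through to the next step in the order above (semantic DOM, then text parsing).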

Pagination Patterns

Numbered/Next pagination


opendevbrowser_click sessionId="<session-id>" ref="<next-ref>"

opendevbrowser_wait sessionId="<session-id>" until="networkidle"

opendevbrowser_snapshot sessionId="<session-id>" format="actionables"

Infinite scroll


opendevbrowser_scroll sessionId="<session-id>" dy=1000

opendevbrowser_wait sessionId="<session-id>" until="networkidle"

Load more


opendevbrowser_click sessionId="<session-id>" ref="<load-more-ref>"

opendevbrowser_wait sessionId="<session-id>" until="networkidle"
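
All three patterns share the same control loop: advance, collect, and stop when nothing new arrives. The sketch below shows that loop with terminal detection under stated assumptions: `fetch_page` is a hypothetical callable standing in for the click/scroll, wait, and snapshot steps above, and the `state` dict is an illustrative checkpoint shape, not the format of pagination-state.json.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

Record = Dict[str, Any]


def paginate(
    fetch_page: Callable[[int], List[Record]],
    key: str = "url",
    max_pages: int = 50,
    max_stalled_pages: int = 1,
) -> Tuple[List[Record], Dict[str, Any]]:
    """Collect unique records until pages stop yielding new ones."""
    seen: Set[Any] = set()
    records: List[Record] = []
    state: Dict[str, Any] = {"page": 0, "stalled": 0, "stop_reason": None}
    for page in range(1, max_pages + 1):
        added = 0
        for record in fetch_page(page):
            k = record.get(key)
            if k is None or k in seen:
                continue  # skip records without a key and duplicates
            seen.add(k)
            records.append(record)
            added += 1
        state["page"] = page
        if added == 0:
            # No new records: treat as terminal page or pagination drift.
            state["stalled"] += 1
            if state["stalled"] >= max_stalled_pages:
                state["stop_reason"] = "no_new_records"
                break
        else:
            state["stalled"] = 0
    else:
        state["stop_reason"] = "max_pages"
    return records, state
```

Raising `max_stalled_pages` tolerates an occasional empty scroll step on infinite-scroll pages before declaring the end of the feed.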

Quality Gates

Apply per page:

dedupe by stable key (URL or canonical ID)
null-rate check for required fields
count delta check (each page must add new records)
consistency check for currency/units
cap on consecutive challenge/429 loops before stopping

Use assets/templates/quality-gates.json.
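
As an illustration of how the per-page gates might fit together, here is a minimal sketch. The function names, the `max_null_rate` default of 0.1, and the `currency` field name are assumptions made for the example; they are not taken from quality-gates.json.

```python
from typing import Any, Dict, List, Sequence, Set, Tuple

Record = Dict[str, Any]


def null_rate(records: Sequence[Record], field: str) -> float:
    """Fraction of records where `field` is missing, None, or empty."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)


def run_page_gates(
    page_records: Sequence[Record],
    seen_keys: Set[Any],
    required_fields: Sequence[str],
    key: str = "url",
    max_null_rate: float = 0.1,
    currency_field: str = "currency",
) -> Tuple[List[Record], List[str]]:
    """Dedupe one page against `seen_keys` and report gate failures."""
    failures: List[str] = []
    unique: List[Record] = []
    for record in page_records:
        k = record.get(key)
        if k is None or k in seen_keys:
            continue
        seen_keys.add(k)  # updated in place so the next page sees it
        unique.append(record)
    if not unique:
        failures.append("count_delta: page added no new records")
    for field in required_fields:
        rate = null_rate(unique, field)
        if rate > max_null_rate:
            failures.append(f"null_rate: {field} missing in {rate:.0%} of records")
    currencies = {r[currency_field] for r in unique if r.get(currency_field)}
    if len(currencies) > 1:
        failures.append(f"consistency: mixed currencies {sorted(currencies)}")
    return unique, failures
```

Returning failures as a list, rather than raising on the first one, lets a run log every gate that tripped on a page before deciding whether to stop.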

Compliance and Safety

Respect robots.txt and site terms.
Use pacing; do not flood endpoints.
Treat robots.txt as policy guidance, not as an authorization mechanism.
Stop or back off on repeated 429/403 and challenges.
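
The back-off rule can be made concrete with a small delay calculator that honors Retry-After and enforces a retry budget. This is a sketch under assumptions: the function names and the defaults (`base`, `cap`, `max_attempts`) are illustrative, and the caller is assumed to save checkpoint state before sleeping.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After header given as delay seconds or as an HTTP date."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: caller falls back to plain backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())


def next_delay(
    attempt: int,
    retry_after: Optional[str],
    base: float = 1.0,
    cap: float = 60.0,
    max_attempts: int = 5,
) -> Optional[float]:
    """Return seconds to wait before retrying, or None when the budget is spent."""
    if attempt >= max_attempts:
        return None  # stop instead of looping on 429/403/challenges
    backoff = min(cap, base * (2 ** attempt))
    hinted = retry_after_seconds(retry_after)
    # Never wait less than the server asked for.
    return backoff if hinted is None else max(backoff, hinted)
```

A `None` return is the stop signal: the run should end with its checkpoint preserved rather than keep pressing the endpoint.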

References

RFC 9309 (robots protocol): https://www.rfc-editor.org/rfc/rfc9309
Google robots docs: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
Schema.org Offer: https://schema.org/Offer
Google Product structured data: https://developers.google.com/search/docs/appearance/structured-data/product-snippet
Playwright best practices: https://playwright.dev/docs/best-practices

command example

# Skill source
cat skills/opendevbrowser-data-extraction/SKILL.md
