01 // URL EXTRACTOR

Sitemap URL extractor.

Dump every URL out of any sitemap — XML, HTML, sitemap-index or text. Deduped, validated, exportable in five formats.

Supports XML sitemaps, sitemap-index files (auto-followed), HTML site indexes and plain-text URL lists.

How extraction handles edge cases

  • Sitemap-index files are auto-followed (first 50 children, parallelized).
  • HTML sitemap pages have every <a href> extracted, with relative links resolved against the page URL.
  • Plain-text URL lists are read line-by-line; non-URL lines are ignored.
  • Duplicates are removed across all child sitemaps.
  • Malformed URLs are skipped rather than passed through to the output.
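
The rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual code: the names `extract_urls` and `is_valid_url` are hypothetical, "valid" is assumed to mean an absolute http(s) URL with a host and no whitespace, and format detection is assumed to work by inspecting the body rather than the Content-Type header.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


def is_valid_url(url):
    """Assumed rule: absolute http(s) URL with a host and no whitespace."""
    if not url or any(ch.isspace() for ch in url):
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def _local(tag):
    """Strip an XML namespace: '{ns}urlset' -> 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def extract_urls(body, base_url):
    """Return (kind, urls); kind is 'urlset', 'index', 'html' or 'text'."""
    text = body.lstrip("\ufeff \t\r\n")
    # Default: plain-text list, one candidate per line.
    kind, candidates = "text", [line.strip() for line in text.splitlines()]
    if text.startswith("<"):
        try:
            root = ET.fromstring(text)
        except ET.ParseError:
            root = None  # not well-formed XML, so treat it as HTML below
        name = _local(root.tag) if root is not None else ""
        if name in ("urlset", "sitemapindex"):
            kind = "index" if name == "sitemapindex" else "urlset"
            # Only <loc> directly under each <url>/<sitemap> entry.
            candidates = [(loc.text or "").strip()
                          for entry in root for loc in entry
                          if _local(loc.tag) == "loc"]
        else:
            kind = "html"
            collector = _HrefCollector()
            collector.feed(text)
            collector.close()
            # Resolve relative links against the page URL.
            candidates = [urljoin(base_url, h.strip()) for h in collector.hrefs]
    # Skip malformed entries, then dedupe while keeping first-seen order.
    return kind, list(dict.fromkeys(u for u in candidates if is_valid_url(u)))
```

When the returned kind is 'index', the URLs are child sitemaps to fetch rather than pages, so a caller would follow them before merging results.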

Frequently asked

How many URLs can it handle?
No fixed cap, but each fetched sitemap is bounded to 25 MB. A 50,000-URL sitemap typically lands around 8–12 MB.
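
As a rough illustration of a per-sitemap size bound, the sketch below reads a response body in chunks and stops once the cap is crossed. The helper name `read_bounded` and the exception are hypothetical, 25 MB is assumed to mean 25 × 1024 × 1024 bytes, and rejecting an oversized body (rather than truncating it) is an assumption; the text above does not say which the tool does.

```python
import io

MAX_BYTES = 25 * 1024 * 1024  # assumed interpretation of the 25 MB bound


class SitemapTooLarge(Exception):
    """Raised when a body exceeds the configured limit (hypothetical)."""


def read_bounded(stream, limit=MAX_BYTES, chunk_size=64 * 1024):
    """Read a binary stream fully, refusing to buffer more than `limit` bytes."""
    chunks, total = [], 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > limit:
            raise SitemapTooLarge(f"body exceeds {limit} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


# Works with any object exposing read(n), such as an HTTP response body.
body = read_bounded(io.BytesIO(b"<urlset></urlset>"))
```

Reading in chunks means an oversized file is abandoned shortly after the limit is crossed instead of being downloaded in full first.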
Can I extract URLs from a sitemap index in one call?
Yes. Point it at the index and it fans out, fetches the child sitemaps in parallel (the first 50, per the limit above), and returns one flat, deduplicated list.
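
For readers who want to reproduce the fan-out themselves, here is a minimal sketch using a thread pool. It is not the tool's implementation: `flatten_sitemap`, `fetch_text` and `locs` are hypothetical names, the worker count is arbitrary, and skipping failed or nested-index children is an assumption. Gzipped children (.xml.gz) are not handled.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

MAX_CHILDREN = 50  # the page states that the first 50 children are followed


def fetch_text(url):
    """Hypothetical default fetcher; swap in any HTTP client."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def locs(xml_text):
    """Return (root tag without namespace, <loc> values of each entry)."""
    root = ET.fromstring(xml_text)
    found = [(loc.text or "").strip()
             for entry in root for loc in entry
             if loc.tag.rsplit("}", 1)[-1] == "loc"]
    return root.tag.rsplit("}", 1)[-1], found


def flatten_sitemap(url, fetch=fetch_text, workers=8):
    """Return one flat, deduplicated URL list for a sitemap or sitemap index."""
    name, found = locs(fetch(url))
    if name != "sitemapindex":
        return list(dict.fromkeys(found))

    def child_urls(child):
        try:
            child_name, urls = locs(fetch(child))
        except Exception:
            return []  # assumption: a child that fails to load is skipped
        # Assumption: nested indexes are not recursed into.
        return urls if child_name == "urlset" else []

    children = found[:MAX_CHILDREN]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        batches = list(pool.map(child_urls, children))  # keeps child order
    # Dedupe across all children, keeping first-seen order.
    return list(dict.fromkeys(u for batch in batches for u in batch))
```

Passing `fetch` as a parameter keeps the network layer replaceable, which also makes the function easy to exercise offline with canned responses.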