01 // URL EXTRACTOR
Sitemap URL extractor.
Dump every URL out of any sitemap — XML, HTML, sitemap-index or text. Deduped, validated, exportable in five formats.
Supports XML sitemaps, sitemap-index files (auto-followed), HTML site indexes and plain-text URL lists.
How extraction handles edge cases
- Sitemap-index files are auto-followed (first 50 children, parallelized).
- HTML sitemap pages have every <a href> extracted and relative-resolved.
- Plain-text URL lists are read line-by-line; non-URL lines are ignored.
- Duplicates are removed across all child sitemaps.
- Malformed URLs are skipped (not silently included).
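The rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's actual code: the function names (`extract_urls`, `is_valid_url`), the format-sniffing heuristic, and the validity rule (absolute http or https URL with a host) are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET


def is_valid_url(url):
    """Assumed validity rule: absolute http(s) URL with a host."""
    try:
        parts = urlparse(url)
    except ValueError:  # e.g. a malformed IPv6 literal
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


class _AnchorCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def urls_from_xml(text):
    """Return the <loc> of each top-level <url> or <sitemap> entry."""
    root = ET.fromstring(text)
    found = []
    for entry in root:
        for child in entry:
            # Strip the namespace, e.g. "{http://...}loc" -> "loc".
            if child.tag.rsplit("}", 1)[-1] == "loc" and child.text:
                found.append(child.text.strip())
    return found


def urls_from_html(text, base_url):
    """Extract every <a href> and resolve it against base_url."""
    collector = _AnchorCollector()
    collector.feed(text)
    return [urljoin(base_url, href) for href in collector.hrefs]


def urls_from_text(text):
    """One candidate per non-empty line; validation happens later."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def extract_urls(text, base_url):
    """Sniff the format, extract, drop malformed URLs, dedupe in order."""
    head = text.lstrip().lower()
    if head.startswith(("<?xml", "<urlset", "<sitemapindex")):
        candidates = urls_from_xml(text)
    elif "<html" in head or "<a " in head:
        candidates = urls_from_html(text, base_url)
    else:
        candidates = urls_from_text(text)
    valid = [u for u in candidates if is_valid_url(u)]
    return list(dict.fromkeys(valid))  # keeps first occurrence, preserves order
```

For instance, `extract_urls(page_html, "https://example.com/dir/")` would turn an `<a href="/x">` into `https://example.com/x` and drop a `mailto:` link, since it fails the assumed validity rule.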
Frequently asked
FAQ
How many URLs can it handle?
No fixed cap, but each fetched sitemap is bounded to 25 MB. A 50,000-URL sitemap typically lands around 8–12 MB.
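Taking 1 MB as 10^6 bytes, 8–12 MB for 50,000 URLs works out to roughly 160 to 240 bytes per entry, so a 25 MB bound leaves room for about 100,000 to 150,000 entries of that density. A per-sitemap size bound can be enforced while streaming, as in the sketch below. The names `read_bounded` and `SitemapTooLarge` are hypothetical, and rejecting an oversized sitemap (rather than truncating it) is an assumption; the page does not say which the tool does.

```python
import io

# 25 MB cap from the FAQ; reading "MB" as 10**6 bytes is an assumption.
MAX_SITEMAP_BYTES = 25 * 1000 * 1000


class SitemapTooLarge(Exception):
    """Raised when a sitemap body exceeds the configured size bound."""


def read_bounded(stream, limit=MAX_SITEMAP_BYTES, chunk_size=64 * 1024):
    """Read a binary stream in chunks, failing once more than `limit` bytes arrive."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > limit:
            raise SitemapTooLarge(f"sitemap exceeds {limit} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


# Usage with an in-memory stream standing in for an HTTP response body.
body = read_bounded(io.BytesIO(b"<urlset></urlset>"))
```

Reading in chunks means an oversized response is abandoned as soon as the limit is crossed instead of being buffered in full first.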
Can I extract URLs from a sitemap index in one call?
Yes. Point it at the index and it fans out, fetches the child sitemaps (the first 50, in parallel), and returns one flat deduplicated list.
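A minimal sketch of that fan-out, assuming a caller-supplied `fetch(url)` function that returns the XML text of a child sitemap. The names `flatten_index` and `fetch`, the thread pool, and the worker count are illustrative choices, not the tool's documented API; the 50-child limit comes from the edge-case list above.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def _locs(xml_text):
    """Return the <loc> of each top-level <url> or <sitemap> entry."""
    root = ET.fromstring(xml_text)
    locs = []
    for entry in root:
        for child in entry:
            # Strip the namespace, e.g. "{http://...}loc" -> "loc".
            if child.tag.rsplit("}", 1)[-1] == "loc" and child.text:
                locs.append(child.text.strip())
    return locs


def flatten_index(index_xml, fetch, max_children=50, workers=8):
    """Follow a sitemap index and return one flat, deduplicated URL list.

    `fetch` is a hypothetical callable: child sitemap URL -> XML text.
    """
    children = _locs(index_xml)[:max_children]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map runs fetches concurrently and yields results in input order.
        pages = list(pool.map(fetch, children))
    flat = []
    for page in pages:
        flat.extend(_locs(page))
    # Deduplicate across all children, keeping the first occurrence.
    return list(dict.fromkeys(flat))
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP library, and lets the fan-out be exercised with canned XML strings.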