There are plenty of reasons you might need to find all the URLs on a website, but your exact goal will determine what you're looking for. For example, you may want to:
Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each case, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.
In this post, I'll walk you through some tools to build your URL list before deduplicating the data with a spreadsheet or Jupyter Notebook, depending on your site's size.
Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. Still, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
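If you're comfortable with a bit of code, the Wayback Machine's CDX API is another way around the missing export button. Here's a minimal Python sketch; the domain is a placeholder, and the parameters are the documented CDX query options:

```python
import requests

# Query the Wayback Machine's CDX API for archived URLs on a domain.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com",   # placeholder: your domain
        "matchType": "domain",  # include subdomains
        "fl": "original",       # return only the original URL field
        "collapse": "urlkey",   # deduplicate by normalized URL key
        "output": "text",
    },
    timeout=120,
)
resp.raise_for_status()
urls = resp.text.splitlines()
print(f"Retrieved {len(urls)} archived URLs")
```

Expect the same quality caveats as the UI: you'll still need to filter out resource files and malformed URLs afterward.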
Moz Pro
While you'd typically use a link index to find external pages linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
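For larger sites, that export can be scripted against the Moz Links API. The sketch below is a rough illustration only: the endpoint is Moz's v2 links endpoint, but the request fields and response shape here are assumptions you should verify against Moz's API documentation before relying on them.

```python
import requests

ACCESS_ID = "YOUR_ACCESS_ID"    # placeholder: from your Moz account
SECRET_KEY = "YOUR_SECRET_KEY"  # placeholder

# Assumption: v2 links endpoint with HTTP Basic auth and a JSON body;
# check Moz's docs for the exact field names and pagination scheme.
resp = requests.post(
    "https://lsapi.seomoz.com/v2/links",
    auth=(ACCESS_ID, SECRET_KEY),
    json={
        "target": "example.com",        # placeholder: your site
        "target_scope": "root_domain",  # assumption: links to the whole domain
        "limit": 50,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # inspect the payload for the linked-to URLs on your site
```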
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.
Performance → Search Results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
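If you'd rather script it, here's a minimal sketch of paging through the Search Analytics query endpoint with the google-api-python-client library; the property URL, date range, and service-account key file are all placeholders:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Assumes a service-account JSON key that has been granted access to the property.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

site = "https://www.example.com/"  # placeholder: your verified property
pages, start_row = [], 0
while True:
    resp = service.searchanalytics().query(
        siteUrl=site,
        body={
            "startDate": "2024-01-01",  # placeholder date range
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,          # API maximum per request
            "startRow": start_row,
        },
    ).execute()
    rows = resp.get("rows", [])
    pages.extend(row["keys"][0] for row in rows)
    if len(rows) < 25000:  # last page reached
        break
    start_row += len(rows)
print(f"{len(pages)} pages with impressions")
```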
Indexing → Pages report:
This section offers exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to the report.
Step 2: Click "Create a new segment."
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
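You can also skip the UI and pull page paths with the GA4 Data API via the google-analytics-data Python package. A minimal sketch, assuming application-default credentials are configured; the property ID and date range are placeholders:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # uses GOOGLE_APPLICATION_CREDENTIALS
request = RunReportRequest(
    property="properties/123456789",  # placeholder: your GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
    limit=100000,  # page through with `offset` for more rows
)
response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} page paths")
```

Keep in mind that pagePath values are paths rather than full URLs, so prefix your domain before combining them with other lists.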
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be massive, so many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process.
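If your logs use the common combined format (typical for Apache and Nginx), a short script is often all the tooling you need. A minimal sketch; the file name is a placeholder, and the regex may need adjusting for your server or CDN's format:

```python
import re
from collections import Counter

# Extract the request path from each combined-format log line,
# e.g.: ... "GET /blog/post-1 HTTP/1.1" 200 ...
REQUEST = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[^"]*"')

paths = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:  # placeholder file
    for line in log:
        match = REQUEST.search(line)
        if match:
            paths[match.group("path")] += 1

# Unique paths requested during the logged period, most-requested first.
for path, hits in paths.most_common(20):
    print(hits, path)
```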
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
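In a Jupyter Notebook, pandas makes the combine-and-deduplicate step straightforward. A minimal sketch, assuming each export has been saved as a one-column CSV of URLs with no header row (the file names are placeholders):

```python
import pandas as pd

SOURCES = ["archive_org.csv", "gsc_pages.csv", "ga4_paths.csv", "log_paths.csv"]  # placeholders

# Stack every export into one column of URLs.
urls = pd.concat(
    [pd.read_csv(f, header=None, names=["url"]) for f in SOURCES],
    ignore_index=True,
)["url"].dropna().astype(str)

# Normalize formatting so one page doesn't appear under several spellings:
# trim whitespace, drop fragments, and strip trailing slashes.
urls = (
    urls.str.strip()
        .str.replace(r"#.*$", "", regex=True)
        .str.rstrip("/")
)

unique_urls = urls.drop_duplicates().sort_values()
unique_urls.to_csv("all_urls_deduped.csv", index=False, header=False)
print(f"{len(unique_urls)} unique URLs")
```

Remember that log files and GA4 give you paths rather than full URLs, so prefix those with your domain before concatenating.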
And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!