Can I prevent Screaming Frog from running out of memory?
Magic SEO Ball says: Outlook good.
Although it’s possible that your site is simply too big for your computer’s crawling resources, most Screaming Frog users can get a fast, memory-efficient crawl by choosing settings carefully for their needs.
During a crawling session, Screaming Frog keeps every URL, file, and redirect it encounters in memory. Our strategy is to exclude unneeded resources from the crawl entirely. You can also increase the memory available to the application if your machine has RAM to spare, but these four tips will reduce both the memory and the time your crawl needs.
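A quick note on the memory route before the tips: Screaming Frog is a Java application, and at the time of writing its documentation has you raise the allocation by editing the small launcher configuration file that ships with it (ScreamingFrogSEOSpider.l4j.ini on Windows; the exact file name and location depend on your version and operating system). Replace the existing -Xmx line, which sets the maximum Java heap size, with a value your machine can actually spare, for example:
-Xmx4g
Allocating more RAM than you have free just trades crawler crashes for system swapping, so treat this as a complement to the four tips below, not a substitute.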
- Don’t crawl unnecessary file types. Under Configuration > Spider, uncheck the resource types you don’t need, from images to external links. If you’re crawling for SEO purposes, you’re unlikely to need these files.
- Respect noindex and canonical. If you switch to the Advanced tab of the same menu, you’ll see checkboxes to respect both directives. If the site you’re analyzing uses either one to substantially limit indexation, checking these boxes will save you time and memory.
- Increase the timeout window and crawl speed. If your site can handle it, increase the crawl speed under Configuration > Speed. How high can you go? Experiment! As a safeguard, bump the response timeout (Configuration > Spider > Advanced) up to 60 seconds to protect against the server errors an overly speedy crawl can cause.
- Exclude any pattern you aren’t concerned about today. This is the big tip. To be most efficient, you need to understand your site well and know a little bit about regular expressions (print out a regex cheat sheet and post it near your desk). Exclude all of the URL patterns that can cause your entire site to be re-crawled, as well as any folders you aren’t concerned about today. Use the exclusion filters at Configuration > Exclude.
Great, but what specifically should I exclude?
Exclusion Cookbook
I recommend that you pause your Screaming Frog crawl after 20 minutes, sort by URL, and look for URL patterns that are wasting time. These are the easiest patterns to exclude. A grasp of regular expressions will let you improvise, but here are the patterns I often find myself excluding to create a more efficient crawl.
- https://.*
  Exclude all secure URLs. You may want to find secure URLs, but don’t crawl them; instead, set up a custom filter to find links to them (example filter strings follow this list). Why? If the site uses relative links, letting the crawler onto https means it re-crawls your entire site, fetching every page as both http://url and https://url, which is very inefficient.
- http://(qa|weirdsubdomain|anotherweirdone).*
  If you host a staging or QA server on a subdomain and you have multiple writers, there is a risk that absolute links to the QA domain will end up in your content. This pattern keeps the crawler off the QA subdomain entirely, even if one of those links slips through. Reuse it for any other subdomains that eat up crawler time.
- http://domain.com.*
  Suppose you host your content on www.domain.com. You certainly want to find absolute links pointing to the non-www version, but don’t let Screaming Frog crawl a complete duplicate of your site. Instead, set a custom filter that flags pages containing links to “http://domain.com” and add the pattern above to your exclusions.
- http://www.domain.com/(badfolder1|badfolder2)/.*
  Don’t let folders of low interest eat up your crawl time. The classic case is a forum hosted in a subfolder (http://www.domain.com/forum/). If you aren’t currently concerned with the SEO problems built into the forum software (and I’m sure platforms such as phpBB have their share), skip those folders.
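Two of the entries above suggest flagging unwanted links with a custom filter instead of crawling them. Screaming Frog’s custom filters (Configuration > Custom) simply search the HTML source of every page you do crawl for a string you supply, so entries like the following will do the job, assuming the site writes its links as double-quoted absolute URLs:
href="https://
href="http://domain.com
Matching pages show up under the Custom tab, giving you a list of links to fix without spending any crawl budget on the URLs themselves.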
Finally, remember that the exclusion rules are cumulative: a URL only has to match one line to be skipped, so once https://.* is on the list there’s no need to also exclude https://qa.*, for example.
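Putting the cookbook together, the Exclude list for a hypothetical site that lives on www.domain.com, keeps staging copies on a few subdomains, and hosts a forum in a subfolder might look like this:
https://.*
http://(qa|weirdsubdomain|anotherweirdone).*
http://domain.com.*
http://www.domain.com/(badfolder1|badfolder2)/.*
Swap in your own hostnames and folders, and revisit the list whenever you start a crawl with a different question in mind.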
Happy remixing!