Crawl budget seems to be an increasingly important factor. In the past year, I’ve noticed some very interesting things in my vertical regarding the value of individual pages versus the average page value across an entire site. Google has, for a very long time, said they don’t want to index low-value pages because it’s a waste of crawling resources for them. They didn’t really have the tools to enforce that until fairly recently. My best guess is that RankBrain is becoming more sophisticated and accurate.
Have you ever wondered why Google isn’t indexing your pages, even after several indexing requests?
I believe we’re seeing a delay in the indexing process because Google takes a long time to gather the data and teach RankBrain what to look for. Then, after repeat crawls of a page, once Google concludes the page is of low value (or what it deems low value), the algorithm simply chooses not to consider it, or rather, is instructed by RankBrain to skip it.
I’ve had a chat with some other experts on the subject, and they believe RankBrain will change the game altogether. I don’t believe it will /change/ the game, but it will force people to get better and to write more sophisticated content. RankBrain simply gives Google the means to enforce things they’ve already been saying for years.
Therefore, as RankBrain grows and becomes more sophisticated by the day, it’s important to ensure that we aren’t wasting our crawl budget on pages that don’t contribute towards the success of our website. In essence, we want Google to ignore those pesky pages and to spend more time focusing on the content that really matters on our site, instead of burning its crawling resources on parametrised content.
Some SEOs believe that the higher the crawl rate, the better the site. I have mixed feelings about that. It’s not that the crawl rate should be high; it should, in fact, be compatible with the size of your site and how often you update it. If you have a huge site with a lot of irrelevant content, an unnecessarily high crawl rate can cost you rankings. Consistency of the crawl rate matters too: sometimes Google finds its way into a corner of a site it’s never seen before and opens up a right shit storm, spiking the crawl rate.
Google downgrades websites with irrelevant indexable content; they value their crawl budget. Therefore it’s important to get on top of your game and start removing the content from the SERPs that isn’t of any value to the user, or is just straight-up garbage.
Below are some examples of content I found on other sites that needs to be noindexed, or should carry pagination markup to communicate to Google that these pages form a logical sequence:
- /help/comments/page-1110/ (every single page is indexed in the SERPs);
- /community/reviews/page-5150/ (yep, that’s another 5,000+ pages indexed);
- /community/photos/page-1134 (yet another 1,000+ pages indexed).
I probably don’t need to tell you that these pages are garbage. What person is really going to click through all of them? Let alone search engine crawlers: it’s a waste of resources, and it provides no value to the user. I should also mention that all of these pages inherit the same styling, navigation, footer, structure, page titles, meta descriptions and heading tags; only the paragraph content is ‘slightly’ different. They’re not needed in the search results.
So, how do we fix this nonsense?
Good question. It’s relatively easy; it just requires some development work, which is, of course, easier said than done, but hey ho. To signal to Googlebot that you don’t want these pages crawled, nor considered for inclusion in the search results, you’ll need to use the noindex directive. In most cases, I wouldn’t recommend noindexing the root page itself, but rather the subsequent pages that provide no value and are almost serving as duplicate pages.
In this scenario, I wouldn’t apply the noindex tag to ‘/help/comments’, but to every page after /help/comments/page-1/. This tells Google that you’d like the root page indexed but not the following pages; it will considerably reduce the number of pages you have in the search results and will help Googlebot conserve its crawling efforts for the pages that provide genuine value.
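If your templates are generated programmatically, the rule is simple enough to sketch. This is a hypothetical helper (the function name and paths are just the examples from above, not anything from a real codebase), deciding which URLs in a paginated series should carry the noindex tag:

```python
import re

def should_noindex(path: str) -> bool:
    """Return True for page 2 onwards of a paginated series,
    e.g. /help/comments/page-2/. Page 1 and non-paginated
    URLs stay indexable."""
    match = re.search(r"/page-(\d+)/?$", path)
    return bool(match) and int(match.group(1)) > 1

# The root and the first page stay indexable; everything after gets noindex
assert should_noindex("/help/comments/") is False
assert should_noindex("/help/comments/page-1/") is False
assert should_noindex("/help/comments/page-1110/") is True
```

The template would then emit the robots meta tag only when this returns True, leaving the root of the series untouched.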
It’s important to review these pages before applying the noindex tag, as it will kill all organic traffic to them. It may be worth implementing rel="next" and rel="prev" links to indicate the relationship between the subsequent URLs if you find that your component URLs are pulling a fair amount of traffic from search. In our case, the pages I outlined generate 1-2 organic hits a month; it’s just not worth having them indexed.
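For the pages you do keep indexable, the pagination markup would look something like this in the head of, say, page 2 of the hypothetical /help/comments/ series (example.com stands in for your own domain):

```html
<link rel="prev" href="https://example.com/help/comments/page-1/">
<link rel="next" href="https://example.com/help/comments/page-3/">
```

The first page of the series carries only rel="next", and the last only rel="prev", so the crawler can see where the sequence starts and ends.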
To block search engine crawlers from indexing your pages, add the following meta tag to the head section of the page. Please be advised that this code blocks ALL search engine web crawlers, not just Googlebot:
<meta name="robots" content="noindex">
On the other hand, if you’re looking to only block Google, but allow other search engine crawlers like Bing, you can do so by using the following tag:
<meta name="googlebot" content="noindex">
It’s worth noting that other search engines may use different directives, so you may find your pages included in some search engines even though you have implemented these tags. In Google’s case, however, this should work correctly.
Some SEOs believe that you should disallow these URLs in the robots.txt file shortly after adding the noindex tags.
Don’t do this.
Google needs to re-crawl those pages in order to read the noindex tags; if they’re blocked in robots.txt, it will never see them. I’d only suggest disallowing these URLs later down the line, once they’ve dropped out of the index.
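Once the pages have dropped out of the index, the eventual disallow might look something like this (paths borrowed from the examples above; note that Google honours the Allow rule with the longest matching path, which is how page 1 is carved out of the broader block):

```
User-agent: *
Allow: /help/comments/page-1/
Disallow: /help/comments/page-
Disallow: /community/reviews/page-
Disallow: /community/photos/page-
```

Without the Allow line, the Disallow prefix would also block crawling of page-1, which we wanted to keep indexed.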