How To Force Google To Remove URLs From Its Index

Let me tell you a secret. Google’s top priority is to index as many pages as possible, by hook or by crook. Removing pages from its index is not a priority, even when it finds a 404 (page not found) error. It’s too busy adding newer pages to care about removing the stale ones.

Let me tell you another one. Google is too lazy to even respect robots.txt. And when it finds a 404, it doesn’t take that signal seriously enough. So what do you do when a page is gone? A 404 is not enough, and robots.txt isn’t taken seriously either. What next?

Gone means really 410 gone

A 404 HTTP error means that the resource or page cannot be found. It’s just gone missing. Maybe it will come back? Who knows?

If the page has moved to a new location, you should use a 301 permanent redirect or a 302 temporary redirect. But that’s for another day.

robots.txt means nothing

robots.txt is only meant to tell bots not to crawl the URLs. Semantically, this is not the same as a 404 (page not found) error.

A 410 HTTP error or response code means that the page is gone. It’s gone forever, and there’s no substitute for it where the user can be redirected. In my testing, a 410 is the strongest signal you can send to Google to have the page removed.

Indicates that the resource requested is no longer available and will not be available again. This should be used when a resource has been intentionally removed and the resource should be purged. Upon receiving a 410 status code, the client should not request the resource in the future. Clients such as search engines should remove the resource from their indices. Most use cases do not require clients and search engines to purge the resource, and a “404 Not Found” may be used instead.

Wikipedia

Hand-twisting Google to respect a 410

Use 410 only when the resource is really gone. If it still exists on the server, then it’s technically a misuse of the 410 status code. There are use cases where you really want to remove the pages from Google’s index even though you can’t really kill the pages. Should that be the case, you have to hand-twist Google into it. With a few .htaccess declarations, it’s easy to reinforce the message: it’s really gone forever. I’ve had success with this SEO approach on numerous sites that I’ve worked with, and it has never failed so far.

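# mod_rewrite must be switched on before the rule below can fire
RewriteEngine On
# No WordPress attachment pages. Match anything that has attachment_id in the query string, e.g. https://www.example.com/?attachment_id=599…
RewriteCond %{QUERY_STRING} attachment_id [NC]
# … and throw a 410 (with a non-3xx status, R drops the substitution and stops rewriting)
RewriteRule ^ - [R=410,L]
# Remove WordPress author archives (Redirect matches by path prefix, so paginated author URLs go too)
Redirect 410 /author/john-doe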
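
The attachment rule above can also be written a little more defensively. What follows is only a sketch, not something from my production files: it assumes you want to match attachment_id strictly as a query parameter of its own, and that you would rather have the rule silently skipped than get a 500 error on a server where mod_rewrite is missing.

# Sketch only: skip the whole block if mod_rewrite is not loaded
<IfModule mod_rewrite.c>
RewriteEngine On
# Assumption: match attachment_id only as a whole query parameter, not as part of another name
RewriteCond %{QUERY_STRING} (^|&)attachment_id= [NC]
# G is mod_rewrite's shorthand for 410 Gone; it implies L
RewriteRule ^ - [G]
</IfModule>

The looser pattern is fine on most WordPress sites. The anchored one only matters if a plugin or a tracking parameter happens to contain the text attachment_id inside a longer name.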

The cheat option

And finally, the way to cheat and nuke it is to just redirect the URL to the homepage. Google totally buys it; doesn’t break a sweat.

Redirect 301 /author/john-doe /

So if you really want to have the URLs removed from Google’s index, a 404 won’t do it. robots.txt will also take its own sweet time. But a 410 hand-twister can be used as an effective means to remove those URLs from the search index. I prefer that there be as few direct descendants of the homepage as possible. This means no author archives, date-based archives, attachment pages (unless you rely heavily on image search ranks) etc. And 410 comes in handy when you want to have the last laugh. And the final nuke option works the fastest.
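
For the author and date-based archives mentioned above, the same 410 idea could be stretched to whole URL patterns with RedirectMatch. Treat this as a rough sketch under assumptions, not a drop-in rule: it assumes the default WordPress archive URLs (/author/…, /2019/, /2019/05/, /2019/05/17/, optionally followed by /page/N/), and the year and dates in the comments are made up for illustration. Check it against your own permalink structure before using it.

# Assumption: date archives look like /2019/, /2019/05/ or /2019/05/17/, optionally with /page/N/.
# A post under a dated permalink such as /2019/05/my-post/ has a slug after the date and is not matched.
RedirectMatch 410 ^/\d{4}/(\d{2}/){0,2}(page/\d+/?)?$
# Every author archive, not just one named author
RedirectMatch 410 ^/author/

If a real page on your site has a four-digit slug (say /2019/), the first rule would kill it too, so look through your URLs first.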
