Google intentionally leaves some pages out of its index when it decides they should be excluded. According to Google, these pages will not appear on the search results page. A page can be excluded for several reasons, and the status for each reason is shown in the new Search Console. Here are the reasons:
A crawl anomaly occurs when Google receives a 4xx or 5xx level HTTP response code from the server while fetching the page. On my blog, the affected pages included trackback URLs, URLs with ak_action=reject_mobile (I think this comes from some mobile plugin), feed URLs for images, and very old URLs that were once linked from some source but were later deleted or had their permalinks changed.
If you have recently changed the permalink structure of your website, or you are planning to, be careful. Google will eventually learn the new URL structure and categorize the old URLs as crawl anomalies, but there is more you can do to change the permalink without affecting SEO.
You do not need to worry about these affected pages.
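To make the status codes concrete, here is a minimal Python sketch that groups an HTTP status code the way this article describes the report. The function name and the bucket labels are my own assumptions for illustration; this is not how Google actually classifies pages.

```python
def coverage_bucket(status_code: int) -> str:
    """Roughly group an HTTP status code by the statuses described in this post.

    Illustrative only: the labels are assumptions, not Google's real logic.
    """
    if 200 <= status_code < 300:
        return "ok"
    if 300 <= status_code < 400:
        return "redirect"
    if status_code == 404:
        # 404 has its own "Not Found (404)" status in the report.
        return "not found"
    if 400 <= status_code < 600:
        # Other 4xx and 5xx responses are what the post calls a crawl anomaly.
        return "crawl anomaly candidate"
    return "unknown"
```

For example, a 503 from an overloaded server would land in the "crawl anomaly candidate" bucket, while a 301 would be treated as a redirect.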
Duplicate without user-selected canonical
These pages have duplicates, and none of the duplicates has a canonical tag. Below are some sample links that I saw listed under the affected pages:
- Upload links of the form /wp-content/uploads/year/month, e.g. /wp-content/uploads/2018/08
- URLs automatically generated by plugins such as Wordfence, e.g. https://trekkerpedia.com/page/64/?wordfence_logHuman=1&hid=BAB03DC2A06830FD8077921BAD21D8B7
No action is needed here.
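If you are curious why such URLs count as duplicates, the following Python sketch strips plugin-generated query parameters so you can see which page a URL really points to. The parameter list (IGNORED_PARAMS) and the function name are assumptions based on the sample URLs in this post; adjust them for your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of plugin-generated parameters, taken from the sample URLs above.
IGNORED_PARAMS = {"wordfence_logHuman", "hid", "ak_action"}


def strip_plugin_params(url: str) -> str:
    """Return the URL without the plugin-generated query parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    # Drop the fragment as well, since it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Running it on the Wordfence URL above gives https://trekkerpedia.com/page/64/, which is the same page as the clean paginated URL.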
Alternate page with a proper canonical tag
This page is a duplicate of a page that Google recognizes as canonical. If this page correctly points to the canonical page, there is nothing for you to do. Otherwise, you need to tell Google the proper canonical URL for each URL that has been excluded.
Sample URLs for pages with the alternate canonical tag
- AMP pages: An AMP page has the non-AMP version of the page as its canonical URL, so Google shows AMP pages under this index status.
- URLs with a query string: On a platform like WordPress, it is common to have plugins that generate new URLs on their own. As a result, many URLs can be generated dynamically for tags, paginated pages, security plugins like Wordfence, etc.
If you have AMP enabled, double-check the canonical URL that your AMP pages link to; otherwise, no action is needed. Also, read my blog on AMP vs non-AMP website performance.
Alternate page with a proper canonical tag for AMPs
If you are seeing this issue for an AMP page, check the following conditions. Examples are given below (replace the URLs with your own):
- Add the following to the non-AMP page: <link rel="amphtml" href="https://www.example.com/url/to/amp/document.html">
- And this to the AMP page: <link rel="canonical" href="https://www.example.com/url/to/full/document.html">
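The two tags above have to point at each other. Here is a small Python sketch, using only the standard library, that checks this pairing for HTML you have already downloaded. The class and function names are hypothetical, and fetching the pages is left out on purpose.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collect the first href seen for each <link rel="..."> value."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower()
        href = attr.get("href")
        if rel and href:
            self.links.setdefault(rel, href)


def link_href(html: str, rel: str):
    """Return the href of the first <link> with the given rel, or None."""
    parser = LinkRelParser()
    parser.feed(html)
    return parser.links.get(rel)


def amp_pair_is_consistent(page_url, page_html, amp_url, amp_html) -> bool:
    """True if the page links to its AMP version and the AMP page links back."""
    return (
        link_href(page_html, "amphtml") == amp_url
        and link_href(amp_html, "canonical") == page_url
    )
```

The comparison is an exact string match, so a trailing slash or an http/https mismatch will be reported as inconsistent; that is a deliberate simplification.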
Crawled – Currently not indexed
The page was crawled by Google, but not indexed. It may or may not be indexed in the future; there is no need to resubmit this URL for crawling. Most of the URLs that I saw for my blog were tags, feeds with an attachment_id query parameter, category feeds, tag feeds, single post feeds, etc.
Page with redirect
The URL is a redirect and therefore was not added to the index. Some of the sample URLs which I have seen under this index status are comment replies, past paginated posts, posts with the old permalink structure, and dynamically generated URLs from plugins.
How to avoid page redirect
- Do not change the blog URL structure, which WordPress calls the permalink structure.
- Leave paginated pages as they are. Do not merge them into a single page, and do not split a single page into paginated ones.
- Do not modify the permalink of any individual blog post.
Recently, I changed the permalink structure of my WordPress blog, and now all the URLs with the old structure fall under this category.
Excluded by ‘noindex’ tag
When Google tried to index the page, it encountered a ‘noindex’ directive and therefore did not index it. If you do not want this page indexed, congratulations! If you do want this page to be indexed, you should remove that ‘noindex’ directive. Most of the affected pages are feeds, media gallery directories, and individual comment links on WordPress.
SEO tips for the noindex tag
- Crawl the entire website using a tool such as DeepCrawl or Screaming Frog, look for pages with this tag, and remove it from the pages you want indexed.
- Keep the tag on pages that contain private or sensitive information, such as the cart, checkout, login, account, and thank-you pages.
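If you would rather spot-check a few pages yourself, here is a rough Python sketch that looks for a noindex directive in the robots meta tag of some HTML, and optionally in the value of an X-Robots-Tag response header. The function and class names are assumptions for illustration, and a real crawler handles more cases than this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attr.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def has_noindex(html: str, x_robots_tag: str = "") -> bool:
    """True if the HTML or the given X-Robots-Tag header value says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header_directives = {part.strip().lower() for part in x_robots_tag.split(",")}
    return "noindex" in parser.directives or "noindex" in header_directives
```

You could run this over the HTML of your cart or login page to confirm the tag is present, and over your blog posts to confirm it is not.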
Soft 404
Google treats the page as a soft 404. This means that the page returns a user-friendly “not found” message without a corresponding 404 response code. Google recommends returning a 404 response code for truly “not found” pages, or adding more information to the page to show that it is not a soft 404.
On WordPress, most of the affected pages were links to deleted tags. In other words, those tags no longer exist on my blog.
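As a rough illustration of what a soft 404 looks like from the outside, the Python sketch below flags a response that has a 200 status code but whose body reads like an error page. The phrase list and the function name are assumptions; Google's own detection is certainly more sophisticated than a keyword match.

```python
# Assumed phrases that typically appear on "not found" pages; extend as needed.
NOT_FOUND_PHRASES = ("not found", "nothing found", "no longer exists")


def looks_like_soft_404(status_code: int, body_text: str) -> bool:
    """Heuristic: a 200 response whose text says the content is missing."""
    if status_code != 200:
        # A real 404 (or any other non-200 status) is not a *soft* 404.
        return False
    text = body_text.lower()
    return any(phrase in text for phrase in NOT_FOUND_PHRASES)
```

A deleted tag page that still answers with status 200 and a "Nothing found" message would be flagged, while the same message served with a real 404 status would not.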
Not Found (404)
This page returned a 404 error when requested. Google discovered this URL without any explicit request or sitemap. Google might have discovered the URL in different ways, such as through a link from another site, or possibly the page existed before and was deleted.
Googlebot will probably continue to try this URL for some period of time; there is no way to tell Googlebot to permanently forget a URL, although it will crawl it less and less often. 404 responses are not a problem if they are intentional. If your page has moved, use a 301 redirect to the new location.
Most of the affected pages are old blog posts, tags, and categories that were linked somewhere internally or externally and whose links now return 404 (not found).
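If pages moved because of a permalink change, a 301 redirect needs a rule that maps each old path to its new one. The Python sketch below shows one such mapping. Both URL structures are assumptions made up for the example (old: /YYYY/MM/post-name/, new: /post-name/), so replace the pattern with whatever your blog actually used.

```python
import re
from typing import Optional

# Assumed old structure: /YYYY/MM/post-name/  (hypothetical, for illustration)
OLD_PERMALINK = re.compile(r"^/(\d{4})/(\d{2})/([^/]+)/?$")


def redirect_target(path: str) -> Optional[str]:
    """Return the new path to 301-redirect to, or None if no rule applies."""
    match = OLD_PERMALINK.match(path)
    if match is None:
        return None
    # Assumed new structure: /post-name/
    return f"/{match.group(3)}/"
```

This only computes the target; the redirect itself would still be issued by your server or a redirect plugin. Paths that do not match the old structure return None, so they are left alone.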
Discovered – currently not indexed
The page was found by Google, but not crawled yet. Typically, Google tried to crawl the URL but the site was overloaded; therefore Google had to reschedule the crawl. This is why the last crawl date is empty on the report. Most of the affected pages are working tag URLs of the website.
Duplicate, submitted URL not selected as canonical
The URL is one of a set of duplicate URLs without an explicitly marked canonical page. You explicitly asked for this URL to be indexed, but because it is a duplicate and Google thinks that another URL is a better candidate for canonical, Google did not index this URL.
Instead, Google indexed the canonical that it selected. The difference between this status and “Google chose different canonical than user” is that, in this case, you explicitly requested indexing. Most of the affected pages are new tag and category URLs.
Duplicate, Google chose different canonical than user
This page is marked as canonical for a set of pages, but Google thinks another URL makes a better canonical. Google has indexed the page it considers canonical rather than this one. Google recommends that you explicitly mark this page as a duplicate of the canonical URL. This page was discovered without an explicit crawl request. Google is still working on a tool to show which pages were marked canonical, but that will take some time, I guess.
Blocked by robots.txt
Google will also show you pages that have been blocked in the robots.txt file. If you are using pattern rules in your robots.txt file (Google supports the * and $ wildcards rather than full regex), be careful with them, because you might unintentionally end up blocking many pages that were never meant to be blocked. If all the blocked pages are meant to be blocked, no other action is needed; otherwise, remove the rules that block those pages from the robots.txt file.
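Before you change robots.txt, it helps to test which URLs a set of rules would block. Here is a minimal sketch using Python's standard urllib.robotparser module. The function name and the sample rules are assumptions, and as far as I know this parser does plain prefix matching and does not interpret Google's * and $ wildcards in paths, so treat it as a check for simple rules only.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls, user_agent: str = "Googlebot"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no download is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Example rules (hypothetical): block only the WordPress admin area.
SAMPLE_RULES = "User-agent: *\nDisallow: /wp-admin/\n"
```

Calling blocked_urls(SAMPLE_RULES, [...]) with a list of your own URLs shows which ones the rules would keep Googlebot away from, which makes an accidental over-broad rule easier to notice.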
A proper technical SEO audit can easily help you minimize these coverage issues. You can do the audit by yourself. Here is the link to the technical SEO audit.