Indexing in Google and the mother who bore them

“Discovered: currently not indexed.” A Search Console status that drives us up the wall.

What should you do if your website is not being indexed?

This status is appearing on many pages, so I will discuss several methods to improve your coverage rate without dying in the attempt.

For years in my talks I used the analogy that Google is a glutton that swallows everything.

It would index everything very easily, without you even noticing, and that is why it was important to be careful with your indexing strategy.

That slide I deleted a long time ago, because things have changed.

For some time now, I have been detecting it on my own websites, on my clients’ websites, and in the sheer number of questions I keep coming across (in forums, support threads, Twitter, communities):

  • What am I doing wrong?
  • Why doesn’t Google index me?
  • I get a notice in the Search Console coverage report: “Discovered: currently not indexed”, in the Excluded section.
  • Did I break something? Was it a plugin update, or WordPress itself? (I saw this one in the WP support forums).
  • My website was hacked and since then Google has not indexed anything for me, etc.

And the truth is, it is normal for people outside the SEO world to have these doubts. I’ve seen them in developers who have been in the digital world for years.

They’re not in the business, they’ve always had their content indexed without much trouble, and suddenly: a disturbance in the Force, and the mother who bore them!

Spoiler: you are NOT doing anything wrong. At least not a priori. But it needs to be analyzed.

From my point of view, Google does not communicate these problems clearly, which leads to theories being generated when, in reality, we may simply be witnessing a “bug” or technical failure.

In many cases, these will be updates or changes to its indexing policy.

So I thought it would be useful to explain the process, the cases I see and possible solutions for this nuisance, to which the almighty search engine subjects us.

But first let’s go to the origin.

Google Spider according to Wajari

What is Google indexing?

Search engines such as Google are composed of 3 essential components:

  1. A crawler that crawls our website. In the case of Google: Googlebot.
  2. A database. This is what we can call indexing: when a web page arrives, Googlebot crawls it and incorporates it into its database to make it available for people’s searches. Best analogy: a librarian. It registers the book (the website) and its content (the pages) and enters them in its catalogue.
  3. Algorithms. They organize the information based on relevance and authority when a person performs a search.

It is a theoretically simple process, and for years you had to be very careful with the content on your website, because things you didn’t want indexed often ended up indexed, for example:

  • Cookie notices
  • Acknowledgements pages
  • Lorem ipsum placeholder content
  • Development versions of your website that you put in subdomains, etc.

How is indexing controlled?

With meta robots. I explained this in audio (and in writing) in my abandoned podcast (SEO for WP: Meta robots), and I’ll allow myself to repeat part of that content in this post:

The meta robots tag is an HTML tag that gives instructions to search engines.

As with robots.txt, we can block search engines, but in the case of robots.txt some directives can be ignored, especially if a URL receives an external link and is discovered that way.

Tags placed in the HTML head are usually the best way to control the behavior of each individual URL.

As Fernando Maciá points out in his digital marketing dictionary:

“Meta robots allows you to control how a page should be indexed and how it is displayed to users on the search results page.”

Fernando Maciá

It doesn’t get any clearer than that.

Also, with robots.txt we block a URL completely, while with meta robots we can have a URL that still passes link juice or popularity but that we decide should not appear in Google’s index.

Meta robots

Meta robots tag syntax

Very simple: it is a single tag placed in the page’s <head>, and these are the combinations we can define:

noindex, follow

<meta name="robots" content="noindex, follow"/>

In this case, with noindex we tell search engines NOT to index this content, but that they can follow the links.

By letting the links be followed, we maintain the transfer of links and the associated popularity juice.

This is the most typical solution when you want to avoid indexing a URL that may be considered thin content or duplicate content of other sections of your website.

It is very common for internal search results, which generate a URL containing the search term, and for tag archives, author archives, etc.

If you have Yoast, RankMath or any other SEO plugin installed, do a test: perform a search in your WP and check the source code of the result. You will probably see this tag in the header.

index, nofollow

<meta name="robots" content="index, nofollow"/>

In this case the opposite is true: we tell search engines that they can index this URL but should NOT follow the links; therefore, those links will not usually transmit their value.

As Tomás de Teresa points out (in an article that no longer exists, so I can’t link to it), it’s the ideal combination when you don’t want to vouch for the links on a particular URL; think of user-generated pages, for example in a forum.

noindex, nofollow

<meta name="robots" content="noindex, nofollow"/>

We prevent both indexing and the following of links. It is a way of completely blocking that URL. Its use is not very common.

index, follow

There is a fourth combination, index, follow, but it is not necessary to add it because it is the default behavior: the URL is identified, its links are followed and its content is indexed by search engines.
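
If you ever want to make that default explicit, the tag simply mirrors the previous ones:

<meta name="robots" content="index, follow"/>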

One clarification: you don’t need to know HTML. Obviously, on platforms like WordPress, plugins make this task very easy; you just check or uncheck options and that’s it.

Is there a difference between robots.txt and meta robots at the crawling level?

As Fernando Maciá tells us, yes, of course. Remember that robots.txt is usually one of the first files that search engines check.

If we set a disallow for a directory in that file, in principle Google will not waste time crawling that directory, whereas if it reaches a URL with the noindex tag, it still has to crawl that URL to see the tag.

In addition, with robots.txt we can define patterns (imagine blocking whole directories or subsets of URLs), while the robots meta tag goes on each individual URL.
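
To illustrate the difference, a robots.txt pattern (the /search/ directory here is just a hypothetical example) blocks crawling of every URL underneath it:

User-agent: *
Disallow: /search/

Whereas the meta robots tag is placed in the <head> of each individual page:

<meta name="robots" content="noindex, follow"/>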

What should we take into account with these two methods?

Both are essential for controlling crawling and indexing.

For this reason it is important to set in the meta robots tag the directives we really want, as a way to control the final indexing that Google makes of our website.

Other directives for meta robots

We can use more directives; some examples:

  • archive / noarchive: whether or not to store a cached copy of the page.
  • noimageindex: do not index the images on the page.

There are a few more, with less frequent uses, which Google lists in its developer help page.
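
These directives can also be combined in a single tag. A purely illustrative example that allows the page to be indexed, but blocks the indexing of its images and the cached copy:

<meta name="robots" content="index, noimageindex, noarchive"/>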

Meta robot directives

Why are there currently problems with indexing?

I would venture to say that in the last year, we have begun to see changes in this regard.

Not everything was indexed as easily as before.

But it didn’t happen in every case. These are the most frequent situations I have personally detected:

  1. New pages with little history.
  2. Newly registered domains (and with few external links).
  3. Pages with “not very relevant” content in the eyes of the search engine, the mother who bore them!
  4. Websites that have been hacked recently, even if you don’t get the security warning. A typical case: a lot of spammy “Russian or Chinese” pages were indexed, and even if you manage to clean up the site, it still affects your coverage rate.
  5. Slow websites with WPO (web performance optimization) problems.
  6. Rare cases of websites that did not allow crawling correctly.

According to Search Console’s own documentation:

Discovered: currently not indexed. Google has found the page, but has not yet crawled it, probably because it has determined that doing so would overload the website. Therefore, it has had to postpone the crawl.

Search Console Documentation

This explanation leads many people to link it directly to server overload caused by WPO issues.

ContentKing explains the possible causes well in their post:

  1. Overloaded server, which means that Google cannot crawl correctly.
  2. Content overload. Your website has more content than the spider can crawl at that moment. This is undoubtedly an exceptional case and I believe it is reserved for excessively large websites.
  3. Poor internal link structure.
  4. Low quality content, which does not add value to the user.

I don’t doubt that such cases exist, but most of the ones I’ve come across were not due to those causes (WPO and/or crawl budget), but to inefficient internal linking and to that “directive” about content quality, according to which your content supposedly doesn’t add any specific value.

Search Console Coverage Report

To detect whether we have URLs in this situation, we have to go to the Coverage report and select Excluded.

Search Console exclusion index

This report will show us all cases of URLs that are NOT indexed. Most common cases:

  • Excluded by noindex tag
  • Errors
  • Redirections
  • And a long etcetera that is not relevant here

But the ones that concern us in this article are:

  1. Crawled: currently not indexed. In this case, the URLs are usually indexed later without much problem and without any action on our part. We also find in this section many things that make no sense to index, such as the feed, or sorting filters that are not well configured and have their meta robots set to index, etc.
  2. Discovered: currently not indexed. The case we seek to solve in this post.

(Quick) solution to indexing failure

Emilio García, on his excellent YouTube channel and podcast Campamento Web, posted a great video on this.

It explains in a clear and simple way how to get your content indexed using the Google Indexing API through RankMath:

Emilio uses the methodology explained in this RankMath post: Google indexing API.

A word of caution. As we can read in RankMath’s post, and they are very clear about it, this Google Indexing API is specifically designed for:

Google recommends using the Indexing API ONLY for websites with JobPosting or BroadcastEvent embedded in a VideoObject [structured data types]. During our tests, we found that it worked on any type of website with great results, and we created this plugin to test it.

RankMath

Therefore, they clarify that this methodology is NOT for everyone. But it certainly works. Do you want a quick and good solution? This is your method.
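
For context, what this method automates under the hood is a call to Google’s Indexing API. A simplified sketch of the request it sends (authentication with a Google service account is omitted, and the URL is a placeholder):

POST https://indexing.googleapis.com/v3/urlNotifications:publish
Content-Type: application/json

{
  "url": "https://example.com/your-new-page/",
  "type": "URL_UPDATED"
}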

(Slow) solution for indexing your content

As SEO consultants, we often come across situations like this on client websites: certain things that make us uneasy.

This solution is slower, but in general I have seen very good results: Patience and the mother who bore them!

It all comes down to the following:

Sitemap tracking with Screaming Frog

1. Analyze your sitemap.xml

  • Locate the sitemap.xml of your website and copy its URL.
  • Use any crawler, such as Screaming Frog, in List mode: List > Import > Download sitemap.xml, and paste the sitemap address.
  • This will download and analyze only the sitemap (not the entire site).
Response codes with Screaming Frog
  • RULE: 100% of the response codes must be 200.
  • There should be NO redirects (3xx) and NO errors (4xx). If you have errors: clean house first.
  • If everything is perfect, you can resubmit the sitemap to Google through Search Console:
Add Sitemap to Search Console
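
For reference, a minimal, well-formed sitemap entry looks like this (the URL and date are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/sample-post/</loc>
    <lastmod>2022-01-15</lastmod>
  </url>
</urlset>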

2. Check your robots.txt

For obvious reasons, you should check that you are not blocking any important directory and that the syntax of this file has no defects.

You can use the web validation tool: Technical SEO Tools.

Is your sitemap referenced in it? RankMath adds it by default. Other SEO plugins do not, and you would have to add it manually.

The syntax is very simple; it is advisable to put the sitemap line at the end and leave a blank line between the user-agent block and the sitemap. Being minimalist with this file is good advice. Example:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://wajari.com/sitemap_index.xml

3. Analyze the URLs not indexed in Search Console using the inspector.

By using the inspector, you can get clues from Google as to what might be going on.

URL inspection in Search Console

In this link you have all the official documentation about the URL Inspection Tool.

If there are no apparent errors and the URL simply does not appear indexed, as you well know, you can click Request Indexing.

This usually works very well. Of course, if we are talking about a handful of URLs there is no problem doing it manually this way, but if we are talking about hundreds or thousands of URLs, you have to look for other options.

4. Analyze your internal links and correct errors

Crawlers such as Screaming Frog allow us to analyze internal links. It would take a whole post to explain this point, but remember that links are essential.

If we do not have our contents well linked, it can be a negative factor for Google to discover the sections and add them to its database.

Keep Search Console errors and warnings under control.

Try to improve all the aspects the tool flags, from structured data to Core Web Vitals, which are a ranking factor and can affect both crawling and positioning.

And last but not least: Patience.

Google, like any company, makes mistakes.

In my experience with my clients, we have solved most situations by simply following these steps.

I don’t even want to imagine the scale of what is involved in indexing every website on the Internet. I understand that it represents a technological challenge for the Californian giant.

In some exceptional cases (news media) I solved it using news sitemaps, which obviously do not apply to all websites, but which allowed newly created content to be recognized and indexed quickly.

With plugins like RankMath, keeping track is quite convenient because, if you have it connected to Search Console, you can see the status of the index in the statistics tab.

That tab, “Index status”, lists your URLs and shows whether each one has rich results, whether it is indexed or not, etc.

RankMath index status

It also has an Instant Indexing module, although it only works with Yandex and Bing. It submits your posts or pages automatically when they change, or you can even do it manually. A good invention indeed.

Final words

As usual in SEO: It depends. Your case may have multiple causes. I just recommend you to be patient and look for the best solution for your website.

It may annoy us, it may seem unfair, and I get your point. I empathize with you. But that anger will not help you solve the problem.

As a content creator or business, you want your website to show up in Google, everyone wants that.

I understand that it is something that will improve over time and, in any case, it is an opportunity to improve our website in terms of crawling, linking, authority, content, speed, etc.

I just hope this post helps you approach it calmly, and not from the uneasiness of “I did something wrong”.

You are not alone in this world of “Discovered: currently not indexed”!

Has something similar happened to you? I will be happy to hear your case in the comments.

Live long and prosper!
