Hi, im Laria


A new noindex proposal

(Posted: 2014-04-11 01:04:26)

So, today I improved my blog by adding more semantic markup, in the hope that this will improve the indexability of my page. The blog now uses XHTML5 instead of XHTML1. It makes heavy use of rel attributes, has some <meta/> tags in the <head/> area and properly uses <article/>, <nav/>, <aside/> and tons of other new HTML5 features. (I already wrote about it here)

This should help web crawlers index this blog properly.

But here's another problem: The header (of course, properly marked up as a <header/> tag) includes a cool (well, at least I think so...) random text that changes with every visit. Unfortunately, crawlers seem to think that this text is part of the actual content.

So I needed a way to tell a crawler to not index that one element.

While it is quite easy to tell a crawler to not index a whole page (using <meta name="robots" content="noindex" /> or the robots.txt), there seems to be no standard way of doing this with a single element.

We'll do it anyway:

Let's start with this code, which we want to hide from the crawlers:

<span>Don't look at me, I'm shy!</span>

The Yandex crawler has a solution for this: <noindex/>. Wrap your stuff in this tag, and Yandex won't index it. But whelp, now we have invalid markup. So the guys at Yandex provided an alternative version that uses HTML comments. Okay, add this one to our example:

<!--noindex-->
<span>Don't look at me, I'm shy!</span>
<!--/noindex-->

There is also a microformat for preventing the indexing of a section. Add the robots-noindex class and crawlers should ignore that part. Let's do this:

<!--noindex-->
<span class="robots-noindex">Don't look at me, I'm shy!</span>
<!--/noindex-->

Two years later, Yahoo defined its own method, which looks very similar but is incompatible. Okay, add another class (robots-nocontent):

<!--noindex-->
<span class="robots-noindex robots-nocontent">Don't look at me, I'm shy!</span>
<!--/noindex-->

Of course, Google has another way of doing this thing: googleoff/googleon. Well, let's add another markup-as-a-comment-thingie:

<!--noindex-->
<!--googleoff: index-->
<span class="robots-noindex robots-nocontent">Don't look at me, I'm shy!</span>
<!--googleon: index-->
<!--/noindex-->

Oh, BTW: According to Wikipedia, Google doesn't recognize any of these. Yeah, makes a lot of sense: add your own stupid format and then ignore it anyway. Way to go, Google...

I think that's still not enough markup! We should add another standard, just to make sure that this will not be crawled.

Introducing: The no-seriously-dont-index-me-plz microformat:

We add yet another class in the microformat tradition that clearly expresses our intent: Don't index this content. Simply add the class no-seriously-dont-index-me-plz!

The full code is now:

<!--noindex-->
<!--googleoff: index-->
<span class="robots-noindex robots-nocontent no-seriously-dont-index-me-plz">Don't look at me, I'm shy!</span>
<!--googleon: index-->
<!--/noindex-->

Perhaps this will actually work, since you now cannot spot the original content in that code so easily with all that nice, semantic markup cluttered around it...

I think this tiny bit of markup is totally reasonable and should be expected from any professional web developer and SEO enthusiast.
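
For the curious, here is a rough sketch of what a crawler would have to do to honor all of the conventions above at once. It is plain Python using only the standard library. Everything in it (the class IndexableText, the helper indexable_text, the marker sets) is made up for illustration and does not describe how Yandex, Yahoo, Google or any other real crawler works. It also assumes reasonably well-formed markup where every non-void element is closed.

```python
from html.parser import HTMLParser

# Hypothetical marker sets, taken from the conventions named in this post.
NOINDEX_CLASSES = {"robots-noindex", "robots-nocontent",
                   "no-seriously-dont-index-me-plz"}
COMMENT_OFF = {"noindex", "googleoff: index"}
COMMENT_ON = {"/noindex", "googleon: index"}
# Void elements never get an end tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndexableText(HTMLParser):
    """Collects only the text that no noindex convention has excluded."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.comment_depth = 0  # number of currently open comment markers
        self.skip_stack = []    # one flag per open element: marked noindex?

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        self.skip_stack.append(bool(classes & NOINDEX_CLASSES))

    def handle_endtag(self, tag):
        # Naive: assumes end tags match the most recently opened element.
        if tag not in VOID_TAGS and self.skip_stack:
            self.skip_stack.pop()

    def handle_comment(self, data):
        marker = data.strip().lower()
        if marker in COMMENT_OFF:
            self.comment_depth += 1
        elif marker in COMMENT_ON and self.comment_depth > 0:
            self.comment_depth -= 1

    def handle_data(self, data):
        # Keep text only if no comment marker and no ancestor class hides it.
        if self.comment_depth == 0 and not any(self.skip_stack):
            self.chunks.append(data)


def indexable_text(html):
    """Return the whitespace-normalized text a polite crawler might index."""
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Feeding the full example from above into indexable_text, surrounded by some ordinary paragraphs, should leave only the ordinary paragraphs. Any one of the five markers would already be enough for this sketch; using all of them just makes extra sure.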