The Magic SEO Ball

All your SEO questions answered by The Magic SEO Ball!

Are fragment identifiers that change content cloaking?

March 1, 2019 By Natan Gesher

A slightly technical and complex question

The Magic SEO Ball has recently become aware of a large ecommerce website with extensive faceted navigation (currently managed with robots.txt disallows, rel=canonical and Search Console’s query parameter management feature) that is considering replacing its query parameters with URL fragments. To be more specific, consider these pretend URLs:

  1. http://domain.tld/category/subcategory/attribute/ is a category + attribute filter; typically, adding a single attribute results in a page that is crawlable and canonical.
  2. http://domain.tld/category/subcategory/attribute/?second-attribute is the same category + attribute except with a second attribute added; all 2+ attribute combinations get a query parameter, and all the query parameters are currently disallowed, with rel=canonical pointing to http://domain.tld/category/subcategory/attribute/.

Replacing query parameters with URL fragments might look like this:

  1. http://domain.tld/category/subcategory/attribute/ remains crawlable and canonical.
  2. http://domain.tld/category/subcategory/attribute/#second-attribute gets generated with two or more attributes, and http://domain.tld/category/subcategory/attribute/?second-attribute redirects to it.
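
In practice, the client-side piece of this is small. Here is a minimal sketch in TypeScript of how a listing page might apply a second attribute taken from the fragment; the data-attributes markup is an assumption for illustration, not a description of the actual site:

    // Minimal sketch: hide product tiles that don't match the attribute named in the
    // URL fragment. Assumes each tile carries a space-separated data-attributes list.
    function applyFragmentFilter(): void {
      const attribute = decodeURIComponent(window.location.hash.slice(1)); // e.g. "second-attribute"
      document.querySelectorAll<HTMLElement>('[data-attributes]').forEach((tile) => {
        const attrs = (tile.dataset.attributes ?? '').split(' ');
        tile.hidden = attribute !== '' && !attrs.includes(attribute);
      });
    }

    window.addEventListener('hashchange', applyFragmentFilter);
    applyFragmentFilter();

Because the filtering happens entirely in the browser after the page loads, the server only ever has one URL to respond to, and switching between second attributes never generates a new request.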

Why consider this? Only because it addresses:

  • Crawl budget: #second-attribute will not be requested, and therefore it will not be crawled.
  • Duplication and canonicalization: there is no need to signal that #second-attribute is canonical to anything else, because #second-attribute will never be recognized as a duplicate in the first place.
  • Internal link equity: all filters pointing at #second-attribute will just be perceived as internal links pointing to the parent crawlable page.
  • External link equity: any backlink pointing to #second-attribute will just pass equity to the parent crawlable page.
  • (Bonus) Thinness: filtered combinations that are currently crawlable, but use noindex due to very low inventory counts, can be replaced with fragments as above.

In the words of one developer, upon learning of this plan and its intricate elegance: “It seems like cheating!” Doesn’t it, though? Others on the website’s engineering team have resisted this change on the grounds that doing it could be considered cloaking by Google. Wait, did they say cloaking? Yes, cloaking (seriously).

Are they right?

Magic SEO Ball says: my reply is no.

Because there wasn't any option for OF COURSE NOT.

First, some gratitude

One thing the Magic SEO Ball would like to make very clear is that he or it appreciates software developers’ concern about cloaking and its SEO risks, which can be real and serious. Usually this sort of conversation goes precisely the opposite way: the devs want to do something fancy-like, the SEO says it’s concerning, and the SEO gets steamrolled. So this kind of partnership is rare and welcome.

What is cloaking?

What Google says about cloaking is, “Cloaking refers to the practice of presenting different content or URLs to human users and search engines,” but what they mean is something like this: “Cloaking refers to the practice of presenting different content or URLs to human users and search engines in order to trick search engines into making a page rank better than it otherwise would.”

(The last part was left out, no doubt, because they assumed it would be totally obvious.)

In its classic presentation, consider http://domain.tld/blue-widgets/, a product listing page that displays dozens of blue widgets but fails to rank for the query [blue widgets]. One SEO strategy might be to show that product listing page to users at its URL, but to use user-agent detection or IP range detection to serve Googlebot an 800-word essay on the subject of blue widgets at the same URL. This is cloaking because it is fundamentally an attempt to deceive Google into letting a URL rank in search results for a query when it otherwise would not.
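
To make that concrete, the classic pattern looks something like this Node/TypeScript sketch, shown purely to illustrate the behavior Google penalizes, not as something anyone should build:

    import { createServer } from 'node:http';

    // Illustrative only: humans get the product listing, Googlebot gets the essay.
    const productListingHtml = '<html><body>dozens of blue widgets</body></html>';
    const essayHtml = '<html><body>an 800-word essay about blue widgets</body></html>';

    createServer((req, res) => {
      const userAgent = req.headers['user-agent'] ?? '';
      const body = /Googlebot/i.test(userAgent) ? essayHtml : productListingHtml; // user-agent detection
      res.writeHead(200, { 'Content-Type': 'text/html' });
      res.end(body);
    }).listen(8080);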

The essence of cloaking is in the final five words of this sentence (emphasis added):

Cloaking is considered a violation of Google’s Webmaster Guidelines because it provides our users with different results than they expected.

Again, if Google lets http://domain.tld/blue-widgets/ rank for [blue widgets] because they have indexed it with an essay about blue widgets that only they can see, and searchers reach it but it only has products and no essay, then that is cloaking.

What is not cloaking?

On the other hand, if the company were to redirect http://domain.tld/blue-widgets/ to http://domain.tld/#blue-widgets/, Google would never consider that cloaking, because the company would in essence be trying not to rank for [blue widgets].

In this case, with http://domain.tld/blue-widgets/ replaced by http://domain.tld/#blue-widgets/, the fragment URL would simply never be indexed and therefore never get search traffic – it’s a strategy to keep something out of Google’s index, rather than a strategy to change the appearance of something in Google’s index – so it is not and could never be considered cloaking.

Here’s some Matt Cutts doing some of his famous subtlety:

Now let’s have Googlebot come and ask for a page as well. And you give Googlebot a page…. Cloaking is when you show different content to users and to Googlebot….

Did you catch that? In order for it to be cloaking, Googlebot has to ask for (i.e., request) a page – but Googlebot cannot and will not request a URL fragment, and no fragment will ever be served to it: it will only ever request http://domain.tld/ or http://domain.tld/blue-widgets/, never http://domain.tld/#blue-widgets/.
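
This is not just a claim about Googlebot’s behavior; it is how URLs work. The fragment is resolved entirely on the client and has no place in an HTTP request, as a quick TypeScript illustration shows:

    const url = new URL('http://domain.tld/category/subcategory/attribute/#second-attribute');

    console.log(url.pathname); // "/category/subcategory/attribute/"
    console.log(url.hash);     // "#second-attribute"

    // The only request any client (Googlebot included) can make for this page is:
    //   GET /category/subcategory/attribute/ HTTP/1.1
    // There is no way to send "#second-attribute" to the server at all.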

Cutts continues with a superficially uninteresting, but ultimately revealing, example about pornography:

It’s a hugely bad experience [for searchers to land on pornography when that is not what they intended]; people complain about it. It’s an awful experience for users.

If http://domain.tld/blue-widgets/ changed to http://domain.tld/#blue-widgets/, then Google would never send search traffic to http://domain.tld/#blue-widgets/. Because no searchers would reach it, there could never be a negative experience for searchers like in the Cutts porn example.

Back to our example site

In context, the question about cloaking is even more bizarre because the site in question currently uses robots.txt disallows to prevent search engines from seeing URLs with query parameters. As a result of those URLs not being crawled, they are not indexed and do not get search traffic. They still exist, though; users can navigate to them and in the most literal possible way, they show different content to those users than they show to search engines: the users see filtered product listing pages and the search engines see only inaccessible URLs.
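
For reference, the rule doing that work today is a one-liner. This is an assumed sketch rather than the site’s actual file, using the wildcard syntax Google supports to block any URL containing a query string:

    User-agent: *
    Disallow: /*?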

Using robots.txt disallows and/or rel=nofollow to keep search engines away from pages is no more and no less “cloaking” than removing those pages and replacing them with fragments – which is to say, neither approach is cloaking at all.

Common guidance on faceted navigation

Faceted navigation is a very complex subject, but like most SEO guidance around the web, guidance on this subject is effective primarily for beginner to intermediate SEOs working on small to medium-sized sites. For a site with a dozen categories and maybe a half-dozen facets across each, using Search Console’s parameter management tool and rel=canonical or meta noindex is likely to be satisfactory.

It’s only when sites get rather more complex – thousands of categories, hundreds of thousands of desired category + attribute combinations, millions of product SKUs and backlinks – that the basic advice starts to break down and strategic technical SEO leadership must be consulted to devise a plan that will work for search engines as well as for humans. Otherwise, expect bots to get lost in a spider trap of trillions of URLs, or to throw away a substantial amount of link equity pointing to disallowed URLs.

Why aren’t other large, complex sites using URL fragments for faceted navigation?

Because they don’t care about SEO, of course, or because they don’t care enough about SEO to invest in it.

Or they don’t have a substantial amount of backlinks pointing to disallowed URLs, and are able to address the internal link equity issue satisfactorily by using javascript to hide links (good for them!).

Or because they don’t have the engineering talent to pull it off.

Or because they thought seriously about it and looked around the web, but couldn’t find anyone else doing it, and their normal practice is to bury any unusual proposal with requests for “comps,” so it didn’t get done.

Or because they don’t care about users sharing URLs, so they went with the idea of not updating URLs when filters are applied (I see you, TripAdvisor!).

Or some combination of these reasons.

Or they will, when they finally are able to hire a new SEO lead who tells them to do it.

Another objection

This isn’t how fragment identifiers were meant to be used. They were supposed to be URL anchors only.

Let the Magic SEO Ball’s operator tell you something about how things were meant to be. He remembers what javascript was like when it was first invented, because he used it then on his personal website in high school to annoy visitors with interminable popups (for real – the only option was to click “ok,” but clicking “ok” just meant more and more popups until the browser had to be force-quit). Javascript, in short, was meant to add interactivity to World Wide Web pages, which were static HTML documents.

A decade later, he was pretty shocked to see entire user interfaces – “web applications” – built out of javascript with URLs like this popular one: https://mail.google.com/mail/u/0/#inbox/.

A decade after that, javascript had actually managed to migrate from browsers to web servers, where it was generating entire sites, backend and frontend. And in that form, it’s used now for a variety of applications that aren’t even directly related to the web at all (narrowly defining “the web”).

Should technologies be used in the ways that they were originally intended? Maybe they should – it depends.

Should technologies also be used for new and different things? Maybe they should – it depends.

But objecting to an idea on the conservative grounds that it uses a technology in a way that’s different from its intended use is not likely to be a recipe for long term success in a competitive industry like ecommerce.

Is HTTPS a tie-breaker?

May 14, 2015 By Natan Gesher

So… last summer, Google announced that they would begin using HTTPS as a ranking signal:

…over the past few months we’ve been running tests taking into account whether sites use secure, encrypted connections as a signal in our search ranking algorithms. We’ve seen positive results, so we’re starting to use HTTPS as a ranking signal. For now it’s only a very lightweight signal — affecting fewer than 1% of global queries, and carrying less weight than other signals such as high-quality content — while we give webmasters time to switch to HTTPS. But over time, we may decide to strengthen it, because we’d like to encourage all website owners to switch from HTTP to HTTPS to keep everyone safe on the web.

Now Googler Gary Illyes has stated that HTTPS isn’t a ranking factor, but actually is just a tie-breaker:

HTTPS is a tie breaker (attribute, not a ranking factor) @methode #smxsydney #smx

— Woj Kwasi (@WojKwasi), May 14, 2015

More analysis here:

Google uses HTTPS as a tie breaker when it comes to the search results. If two sites are virtually identical when it comes to where Google would rank them in the search results, HTTPS acts as a tie breaker. Google would use HTTPS to decide which site would appear first, if one site is using HTTPS while the other is not.

Does this make any sense?

Magic SEO Ball says: very doubtful.

Very doubtful.

There’s been some dispute among SEOs about just how important, if at all, HTTPS actually is as a ranking factor. Here’s an SEO arguing that there are seven specific things on which a site should focus before worrying about HTTPS. Her list:

  • Consistent URLs Everywhere
  • Short, Non-Parameter Heavy URLs
  • Real, Relevant Content, Not “SEO Copy”
  • Live Text High on the Page, Not Stuffed at the Bottom
  • Links
  • Strong CTA-Friendly Title Tags
  • Speed Up Your Load Time

For what it’s worth, I think she’s basically right, though this varies a lot based on the niche and the size of the site. For much bigger sites it may make sense to tackle other technical issues first; for certain types of sites it may make sense to improve the mobile experience first. And of course, for e-commerce or any site that takes people’s money, HTTPS does matter.

So what exactly is a tie-breaker?

A tie-breaker is a mechanism for deciding the outcome of a contest when all the other inputs are exactly equal. It only gets used when it is actually needed to break a tie; if a certain input is factored in all the time, then it is not a tie-breaker.

One good example of a tie-breaker is the vote of the Vice President in the United States Senate. The Vice President is not a member of the Senate, but serves as its president, and he doesn’t vote at all unless his vote would change the outcome (i.e., unless the Senate’s vote is tied).

Another example of a tie-breaker is how professional football teams are ranked at the end of a season. The NFL has a complex set of rules for determining which of two teams with identical win-loss records will advance to the playoffs and which will not (and it gets even more complex when there are three teams that are tied, and so on):

  1. Head-to-head (best won-lost-tied percentage in games between the clubs).
  2. Best won-lost-tied percentage in games played within the division.
  3. Best won-lost-tied percentage in common games.
  4. Best won-lost-tied percentage in games played within the conference.
  5. Strength of victory.
  6. Strength of schedule.
  7. Best combined ranking among conference teams in points scored and points allowed.
  8. Best combined ranking among all teams in points scored and points allowed.
  9. Best net points in common games.
  10. Best net points in all games.
  11. Best net touchdowns in all games.
  12. Coin toss.

Critically, these rules only get invoked if two teams finish with the same record of wins and losses.

So, is HTTPS – which Google has previously used its official blog to describe as a “ranking signal” – not actually a ranking signal, but actually just a tie-breaker that gets used in the rare scenario that two pages rank exactly equally?

Almost certainly not. In order for HTTPS to be a tie-breaker, Google would have to compute all the other ranking signals for a given query and then, only in the case of an exact tie between two results, tip the scale in favor of the site with HTTPS. But what if both sites use HTTPS, or neither does? Then invoking the tie-breaker wouldn’t even break the tie.
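
The distinction is easy to see in code. Here is a minimal TypeScript sketch; the weight and field names are invented for illustration, not a claim about how Google actually scores anything:

    interface Result { score: number; https: boolean; }

    // If HTTPS is a ranking signal, it is folded into every result's score, however lightly.
    function scoreWithHttpsSignal(r: Result): number {
      return r.score + (r.https ? 0.01 : 0); // tiny weight, but always applied
    }

    // If HTTPS is only a tie-breaker, it is consulted solely when two scores are exactly equal.
    function compareWithHttpsTieBreaker(a: Result, b: Result): number {
      if (a.score !== b.score) return b.score - a.score; // HTTPS is never looked at
      return Number(b.https) - Number(a.https);          // if both or neither use HTTPS, the tie stands
    }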

HTTPS is, as John Mueller has said and as SEOs have confirmed, a relatively weak and – in most cases – somewhat insignificant ranking signal. For the sake of comparison, another relatively weak and somewhat insignificant ranking signal is keyword use in H2s.

Being a small ranking signal does not mean that it’s a tie-breaker.

Incidentally, this is not the first time that Mr. Illyes has made public statements about how Google works that have beggared belief among SEOs. Just a couple of months ago, he said that Panda is updated in real time, which made absolutely no sense at the time; we now know this to be false.

Matt Cutts often said things that might be true only in some extremely literal or extremely general sense, and which needed to be parsed carefully so their meanings could be teased out, and John Mueller often seems to be – I want to put this charitably – filling in the gaps in his knowledge with educated guesses. But Gary Illyes’s record of disputable and provably false statements suggests that he should be prevented from continuing to offer them.

Does Google use Gmail for URL discovery?

March 26, 2015 By Natan Gesher

I’ve heard an SEO person say that Google can scan my Gmail and use it to discover new URLs to crawl. Does this really happen?

Magic SEO Ball says: My sources say no.

My sources say no.

This is actually something that we have believed for some time, and something that we have told a bunch of people, including in interviews for serious SEO positions at really excellent companies (Oops!).

So it’s a bit surprising to learn that we were almost certainly incorrect.

The IMEC Lab group ran a pretty good test, Does Google Sniff Your Gmail to Discover URLs?:

We posted 4 total pages … and then asked different groups of users to email links to those pages… We asked 20 to 22 people to send gmails sharing the links for each article to the various pages. One group was asked to share article 1, a different group was asked to share article 2, and so forth. The goal was to see if Google would spot these links in the gmails, and then crawl and index those URLs… there was very little to see. The results were wholly unremarkable, and that’s the most remarkable thing about them!

The test lasted less than two weeks, so it’s possible that those URLs would eventually have been crawled.

It’s also possible that Google will use Gmail to discover new domains to crawl, but not specific individual URLs.

So it would definitely be worthwhile to try repeating the test using different parameters, but until we see evidence demonstrating otherwise, it seems fair to say that Google does not crawl the URLs they see in Gmail.

Will the 21st April 2015 mobile algorithm update really be bigger than Panda and Penguin?

March 17, 2015 By Natan Gesher

Last month, Google announced a change to the way they’ll rank mobile search results, to begin on 21 April.

Now Googler Zineb Ait Bahajji is stating that this change will be bigger than Panda and Penguin.

Zineb from Google at #smx Munich about the mobile ranking update: is going to have a bigger effect than penguin and panda!

— Aleyda Solis (@aleyda), March 17, 2015

Is she right?

(Or is this like when Googler Gary Illyes claimed incorrectly that Panda was already being updated in real time?)

Magic SEO Ball says: Don’t count on it.

Don't count on it.

The strange thing to consider about the upcoming change to mobile search rankings is the way Google announced it: they may never have given as much information, as far in advance, about a genuinely meaningful update to their organic ranking algorithm. It was truly an unprecedented event.

Unless, that is, the change is actually going to be relatively minor, and the announcement and all the Twitter hype and hoopla are really just a way to get webmasters and SEOs to do what Google wants them to do, which is to make mobile-friendly websites, preferably of the responsive or adaptive varieties.

We try not to be too skeptical and we definitely don’t believe that Google is lying about the mobile rankings change, but we have to wonder whether Google’s search quality team is really going to shoot themselves in the foot by providing worse search results in some cases, just because the pages happen to be mobile optimized. Tl;dr: they aren’t.

Panda and Penguin have been, at best, mixed successes for Google. Completely aside from the pissed off webmaster and SEO communities, we are aware of many SERPs that are lower quality as a result of Google’s attempts to use machine learning to create new algorithmic ranking factors.

After 21 April, expect to see changes, but don’t expect the world to end for great sites whose pages aren’t mobile friendly, and don’t expect garbage sites with responsive pages to start crushing their authoritative, very relevant, high-quality competitors.

Is Panda updated in real time?

March 13, 2015 By Natan Gesher

At SMX West, a Googler named Gary Illyes claimed that the Panda algorithm is now running constantly, and that sites affected by it only need to fix their problems and be recrawled, after which they should regain their traffic.

.@methode says panda happens “pretty much instantly” cc @rustybrick

— Rae Hoffman (@sugarrae), March 5, 2015

@portentint @sugarrae … @methode did qualify that once the “pages were re-processed” … However long that takes cc @rustybrick

— Eric Wu ( ・ㅂ・)و ̑̑ (@eywu), March 6, 2015

Is he correct / telling the truth?

Magic SEO Ball says: very doubtful.

Very doubtful.

First, this claim is completely in conflict with the evidence at hand. We have firsthand knowledge of at least one large site that got hit by Panda around 24-25 October 2014, eliminated the problem almost immediately by noindexing a quarter of its URLs [1], watched those URLs drop from Google’s index, and still has not recovered from Panda.

Second… well, there isn’t much more to say about this. While the Panda algorithm itself might have been updated at some point since late October, there is zero reason to believe that its data has been refreshed. And there’s also no reason to think that Google would run an updated Panda algorithm with stale data. So, almost certainly there’s been neither an algorithm update nor a data refresh.

SEOs who do high quality work are generally in agreement about this.

So does this mean that Mr. Illyes was misleading us or lying to us, or does it mean that he was mistaken or confused?

We think the latter explanation is far more likely. His excuse that he “caused some confusion by saying too much and not saying enough at the same time” sounds like a nice try to save face, which is understandable. It’s probably an internal goal at Google to get to a point where Panda can be run in real time, but this requires two things:

  1. The quality of the algorithm has to be high enough. This means that false negatives need to be reduced, and false positives need to be eliminated.
  2. The logistics of running the algorithm have to be workable. This means that the computing complexity has to be manageable enough that Google’s engineers and infrastructure can handle it on a constant basis, rather than just on a periodic basis.

While the second issue is the kind of problem that Google is pretty good about solving – more engineers, better engineers, more hardware, more powerful hardware, whatever – the first issue is something that may not be possible in the near future.

References

[1] Without going into details, we have 95+% certainty that those URLs caused the Panda problem in the first place.

Should I create articles on my website for the purpose of syndicating them?

December 22, 2014 By Natan Gesher

This question comes to us via Twitter:

@gesher What’s your take on creating articles on your site for the purposes of syndication?

— Jesse Semchuck (@jessesem), December 20, 2014

Magic SEO Ball says: my reply is no.

My reply is no.

Content syndication as an audience development strategy

Creating articles specifically with the intent of having them syndicated on other sites can be a fine way to expose those sites’ different audiences to your product, service, ideas or your own website. When doing so, you should keep the following concerns in mind.

Audience

Every website has a different audience. Some are huge and some are tiny; some are general and some are specific; some are overlapping and some are distinct. Take care to ensure that your articles are appearing in the right places online by taking the time to understand the audience profiles of the sites where they will be syndicated. Failing to do so may cause your content to be ignored at best, or resented and marked as spam at worst.

Volume

How much is too much? If your syndicated content overwhelms the unique content on your own site, you are syndicating too much. If your syndicated content overwhelms the original content of the sites on which it appears, you are syndicating too much.

Repetition

Many people have a certain set of content websites that they visit on a regular basis, daily or weekly; or an RSS feed reader that they check on a regular schedule; or the expectation that they’ll be able to use Twitter and Facebook to find out what’s happening. Some people use all three methods. If they follow your site and another site that syndicates your site’s content, or multiple sites that syndicate your site’s content, they’re going to start seeing your articles repeatedly. While that may strike you as desirable, it may also backfire by bothering this extended audience, preventing people from ever becoming your customers or followers.

In summary, what many syndication issues – with audience, volume and repetition – have in common is that they are caused by a casual “If you build it, they will come” approach that discounts the users’ interests, wishes, and experience. This may result from a surfeit of technical ability to effect syndication (viz., by RSS) and a deficit of concern for other web denizens.

Consider, instead of a push method, a pull method by which you publish your own material on your own site, and allow it to be republished by request by other webmasters on an article by article basis.

Content syndication as an SEO strategy

In general, the main reason to be interested in content syndication as an SEO strategy is for link building: the idea being that you can create feeds with your articles, including followed links back to your own site, and allow other sites to use the articles with proper canonical tags.

While it would be a stretch to say that Google’s policies about link building have historically been clear, one trend that has emerged and that can be stated with clarity is that Google does not want to allow any link building strategy that can scale. In effect, this means that asking for links and giving them is fine, but that manipulation of the overall link graph is not fine.

Does content syndication for SEO purposes (i.e., for the purposes of increasing your articles’ link authority and your site’s domain authority) work? Yes, but you’d better assume that links added without any human effort by other sites’ webmasters can be devalued without any human effort by Google’s search quality engineers.

And that doesn’t even touch on the risks involved, which I outlined briefly in this Quora question: Can my search engine ranking be hurt if I post my blog articles on my own site and on several other blogging sites?

… if you publish the same article on your own site and on other sites, you’re running the risk that it will rank on the other sites but not on your own… employing this practice at scale may expose your site to Panda… Instead, consider creating original content for different sites that is useful to each site’s audience.

So if you’re thinking about audience development and want to do content syndication, I think it is ok but also that you should consult an SEO and seriously weigh the SEO concerns, along with the possibility that syndicating content in the wrong ways may do more harm than good. And if you’re thinking specifically about a content syndication strategy for SEO, there are much better ideas out there.

Is title the same as headline?

December 13, 2013 By Natan Gesher

My SEO submitted requirements that our CMS should not allow titles longer than seventy characters. He also mentioned elsewhere in the requirements document that headline should generate the title (which should then be editable). So headline is the same as title, right?

I built the headline field in the CMS so that it will not accept a headline from editors longer than seventy characters. Good?

Magic SEO Ball says: my reply is no.

My reply is no.

When an SEO talks about a title, he means the “title” tag.

When he talks about a headline, he almost definitely means the on-page “h1” tag.

It’s an SEO best practice for titles to be limited to seventy characters, because longer titles are likely to be truncated when they appear in search engine results. There is no particular character limit on headlines, except that they should not be very long or overwhelming to users.

It’s also a good idea for editorially created headlines to generate titles, but there can be some collision over the fact that headlines have no length limit while titles do. Therefore, if you’re working on a CMS and get these requirements from your SEO, keep in mind that he probably wants the title field to be editable.
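
For the developer building the CMS, a minimal sketch of the intended behavior might look like this, with hypothetical field names: the headline feeds a default title, the title is capped at seventy characters, and editors can still override it.

    const TITLE_LIMIT = 70;

    interface Article {
      headline: string; // rendered as the on-page h1; no hard character cap
      title?: string;   // rendered as the title tag; an editable override of the default
    }

    // Default the title to the headline, truncated to the seventy-character limit.
    function defaultTitleFromHeadline(headline: string): string {
      return headline.length <= TITLE_LIMIT
        ? headline
        : headline.slice(0, TITLE_LIMIT - 1).trimEnd() + '…';
    }

    function titleFor(article: Article): string {
      return article.title ?? defaultTitleFromHeadline(article.headline);
    }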

Based on a true story.

A section of my site is called “Deals”; should I make the H1 “Specials”?

May 13, 2013 By Natan Gesher

Magic SEO Ball says: very doubtful.

Very doubtful.

But really, why would you do this? If you’ve got an area of your site that’s about deals and other related things, and you’ve decided that it will be called “Deals,” why would you use some other term instead of “deals” to tell your users what that area of the site is about? It just defies logic, not to mention the first rule of marketing that we learned at our first SEO job: never make up a different term to describe your product that’s separate from the one you’ve already made up.

Based on a true story.

Should I update my robots.txt file to disallow all bots?

May 11, 2013 By Natan Gesher

Magic SEO Ball says: My reply is no.

My reply is no.

Seriously though, why would you do this?

If you are creating a new site, want it to be private, and specifically aren’t interested in receiving any search engine traffic, then maybe that makes sense… but if you’ve already got a big site that gets a lot of its traffic from SEO, you should really make sure that your robots.txt doesn’t block search engines.
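
For the record, “disallow all bots” means a robots.txt of exactly two lines, which tell every compliant crawler to stay away from the entire site:

    User-agent: *
    Disallow: /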

Based on a true story.
