Entities. They’re nothing new in SEO, but over the last year, I’ve been ruminating on how moving into entities should fundamentally be changing the way most of us are still thinking about SEO.
To start, let’s first step back and look at the history of the algorithms.
Moving from Pages to Entities
In the early years of the post-Google era, the link graph was focused on page-to-page hyperlink citations, which led to work in PageRank, TrustRank, HITS, SpamRank, anchor text, etc. Over time, this continued to domain-level analysis, where pages were clumped together and the processes were applied on the domain level. I think this was the first step into entity-specific search (with domains being the simplest entity to understand).
I think we’re several years into the ongoing iteration of entities. We’ve seen it early on with businesses and local results. These were the second easiest to conceptually build. We also saw Google start pushing brands, which is yet another implementation of entities. They’re more unique and verifiable than people’s names as named entities. Then we saw question / answer one boxes, which started answering questions with known named entities and facts (search for “who wrote harry potter” for an example).
Today, we’re seeing it expanding into AuthorRank/AgentRank, implicit social graphs, Google+, Schema.org, business page verification, and authors in search. Google also helped set the stage back in 2010 when it acquired Metaweb (Freebase) and developed Google Squared (which was discontinued in labs, but the technology still continues in search).
I think we’re seeing the growth of this ranking concept, and initiatives like Schema.org and Rel=”Author” only help demonstrate the ongoing effort to create a schema for data related to entities.
I’m also tempted to propose that “entity search” could even be considered its own vertical in a sense, in the same way local, images, and social results are – with entity results being weighted when search queries trigger this vertical.
What Exactly is An Entity
An entity is anything, including real world objects, facts, and concepts, that has a number of documents associated with it. In the historical search models, a document is supported by other documents, but in the entity world, a conceptual object is supported by documents.
Examples of entities are businesses, products, movies, authors, people, places, events, etc.
From here, you end up with a known entity with the following information about it.
Entities are exceptionally powerful for search. The Metaweb promo video does a great job explaining the value and power of entities.
Looking at an entity based search engine conceptually changes many aspects of SEO. It would no longer be about just optimizing against a page, but optimizing for an object.
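To make the definition above a little more concrete, here is a minimal sketch of an entity as a conceptual object supported by documents. The class names, fields, and example URLs are all hypothetical and chosen for illustration; this is not how Google or Freebase actually stores entities.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A single page that mentions or describes an entity."""
    url: str
    text: str


@dataclass
class Entity:
    """Hypothetical model: a conceptual object supported by documents."""
    name: str
    entity_type: str  # e.g. "movie", "business", "author"
    attributes: dict = field(default_factory=dict)
    documents: list = field(default_factory=list)

    def add_document(self, doc: Document) -> None:
        # In the entity model, documents support the object,
        # rather than documents only supporting other documents.
        self.documents.append(doc)

    def support(self) -> int:
        # Naive support measure: the number of associated documents.
        return len(self.documents)


memento = Entity("Memento", "movie", {"director": "Christopher Nolan"})
memento.add_document(Document("https://example.com/review", "A review of Memento"))
memento.add_document(Document("https://example.com/quotes", "Quotes from Memento"))
print(memento.name, memento.entity_type, memento.support())
```

The point of the sketch is the direction of the relationship: the thing being optimized is the object, and pages are evidence attached to it.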
How Do Search Engines Do This?
Bill Slawski has done a great job discussing entities as they’ve developed over the years. I’d recommend taking time to read about it. (I was excited he did a wrap up post for it this week.)
But if you want to hear it directly from Google, I recommend watching the almost hour and a half long talk Andrew Hogue gave on “The Structured Search Engine”. At the time of the presentation, Andrew was working in New York on Google Squared and entity building with Freebase. I believe he now heads up search at FourSquare.
The presentation was one of the more enlightening hours I’ve spent on SEO in the last few months.
The Reddit Phenomenon
Two months ago, a thread went up on Reddit comparing the Google Algorithm to Bing’s, especially when providing the right answer to relatively vague descriptions of movies. It seems as if Google knew what the searcher meant, beyond just hearing the words (which almost mirrors the language Andrew uses in his presentation).
I suspect what we’re seeing in this Reddit thread is entity search. I think Bing does it as well, and can get it right, like Dan Shure mentioned in this tweet. However, I think Google gets this better than Bing does. They have the superior index, larger history, faster speeds, and greater computing power. They also have Freebase, which gives them an exceptional edge on entities like movies, which are very well understood on Freebase.
The thing that set off a bell in my head was when Director of Bing, Stefan Weitz, addressed a Q&A question after Rand’s presentation at Mozcation Seattle, where, although we were discussing local, he referenced the concept of entities using the example of a sofa.
“Longer term we’re looking at how we think of the web really as a representation of the physical world itself” – Stefan Weitz, Bing (SEOmoz WBF)
How Might This Be Working?
Let’s take this search from Reddit:
that movie that’s backwards and the guy can’t remember anything
The results return Memento, then it leads into random script, quotes, and review websites.
… I don’t think Google is returning pages about Memento…..
I think they are returning “Memento” … the entity.
The difference there may seem slight, but if that’s what is happening, I think it’s significant.
Let’s break it down.
I do think we see advanced language understanding in Google, but I feel the Reddit examples are either entity-weighted results being pulled into universal results, or entities being used to weight particular documents in the broad match indices.
Queries with structures like “that movie where X” make the intent easy to understand, which lends itself well to entity-style searching.
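As a rough illustration of why a structure like “that movie where X” is easy to work with, the sketch below pulls an entity type and a description out of a query with a regular expression. The list of entity types, the connector words, and the `parse_query` function are assumptions made up for this example; a real search engine would rely on far more than a pattern match.

```python
import re

# Hypothetical entity types a query might name explicitly.
ENTITY_TYPES = ("movie", "book", "song", "restaurant")

# "that <type> <connector> <description>", e.g. "that movie where X".
PATTERN = re.compile(
    r"^that (?P<type>" + "|".join(ENTITY_TYPES) + r") "
    r"(?:that's|that|where|with|about) (?P<description>.+)$"
)


def parse_query(query: str):
    """Return (entity_type, description) if the query fits the pattern, else None."""
    # Normalize case, whitespace, and curly apostrophes before matching.
    normalized = query.strip().lower().replace("\u2019", "'")
    match = PATTERN.match(normalized)
    if match is None:
        return None
    return match.group("type"), match.group("description")


print(parse_query("that movie that\u2019s backwards and the guy can\u2019t remember anything"))
print(parse_query("memento reviews"))
```

Once the type is known, the description only has to be compared against entities of that type, which is a much smaller problem than matching the full query against every page in the index.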
How Entities Change SEO
How this changes SEO is a whole series of posts, and one can only speculate on how far this goes, but I think there are a few immediate items that we’re already seeing, or will continue to see, in the next few years.
Going broader than pages:
This conversation has been growing louder over time, but we’re stepping outside the SEO we’ve been doing since the early 2000s. With 70% of demand being in the longtail, there is nothing SEOs can do to optimize for all of these phrases in the same exact, phrase, and broad keyword match paradigm we’ve been using for years.
Google research has shown that on more difficult queries, people start to type their searches as natural language questions. They also searched longer queries on average. This study also stated that, at the time of the study (2010), most of the time the question queries failed to give users the information they were looking for and they would revert back to keyword queries.
Looking at query length compared to query success, we can see a clustering in the bottom right quadrant, which suggests that Google is doing better with short, more specific keyword style searches. When users are trying to find the answer to a difficult query, they tend to go longer and are met with less success.
The study shows how those with unsuccessful searches also tend towards more question words, a sign that they might be looking for a specific answer, which lends itself to entity-style search.
I think this study does a good job of demonstrating the value for Google to improve its natural language understanding, especially for longer tail and question specific queries.
As SEOs, we need to look at how IMDB is successfully ranking in some of these longtail, natural language, question style queries. What is happening here is more complicated than longtail keyword targeting or UGC strategies.
It also changes the paradigm of how to rank. Instead of thinking in terms of setting a page to a keyword, getting anchor text, and getting link popularity, you can start to consider aligning a page with a known entity when appropriate.
The right “type” of citations by entity type:
This influences the link building game as well. The old model of PageRank, TrustRank, LRD, anchor text, domain diversity, related content, keyword in title, topic-sensitive PageRank, etc. will change a bit.
It adds a new dimension, which isn’t just to get links that support the topic/theme, but to get citations that support the entity type (and this isn’t always hyperlinks, like we’ve seen in local search).
I don’t think we’re too far off from the days where we’ll see widespread strategies to get authors with strong authority to contribute to a domain, because their person-specific entity scores (AuthorRank/AgentRank) will apply a vector against traditional document scores to elevate content associated with strong entities. This will change the game entirely. In the long term, how much will link building resources be reallocated to relationship building with strong entities, not websites?
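To picture what an entity score applied against a document score might look like, here is a deliberately simple sketch. The multiplicative formula, the `author_weight` parameter, and the example numbers are all assumptions invented for illustration; nothing here reflects a published AuthorRank or AgentRank formula.

```python
def blended_score(document_score: float, author_score: float,
                  author_weight: float = 0.5) -> float:
    """Scale a traditional document score by the strength of its author entity.

    Hypothetical formula: an author with score 0 leaves the document
    score unchanged, and stronger authors boost it proportionally.
    """
    return document_score * (1.0 + author_weight * author_score)


# (page, traditional document score, author entity score), made-up values.
pages = [
    ("page-a", 0.60, 0.9),  # weaker page, strong author
    ("page-b", 0.70, 0.1),  # stronger page, weak author
]

ranked = sorted(pages, key=lambda p: blended_score(p[1], p[2]), reverse=True)
for name, doc_score, author_score in ranked:
    print(name, round(blended_score(doc_score, author_score), 3))
```

Under these made-up numbers, the page with the weaker document score but the stronger author ends up ranking first, which is the kind of shift that would make relationships with strong entities worth investing in.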
The right “type” of content by entity type:
This creates a new dimension to content as well. It’s not just about having unique content, or keyword content, or even “great” content. It starts to require the “right” content that would fall into the schema typical for the entity type. The robustness of known data attributes, both on-site and off-site, is likely playing a role in results such as those on Reddit. Like Rand discusses in his video about Advanced On-Page Optimization, we continue to move beyond the standard thoughts behind optimization (title, heading, on page, variations, alt text, etc.). Really, some of this isn’t much different than local search optimization, except it could be expanding to more and more objects.
Death of PageRank or Page Link Graph?
The kneejerk reaction is to wonder if I think the link graph is dying.
Not at all.
The model of modern search engines is still built off the core page link citation model, and most changes we see are additions to that – they work in parallel with this model, weight against this model, or re-sort against this model. PageRank still drives indexation and crawl priority rules as well.
However, I think our concept of a citation is ever growing. They now include local citations, brand mentions, and social shares. They’ll soon include rel author citations and social media profile mentions, which could help Author/Agent Rank. Entity search could mean there is value in schema citations, especially for physical real life objects.
Hey Justin
Why aren’t there more comments here yet?! Fascinating article, and thanks for bringing up the counter-example in my tweet. I think you’ve brought up a very important point about questioning how IMDB for example ranks for those long-tail phrases, and how as SEOs this can inform our strategies.
In the research I did after I saw you tweet about the Reddit example, I clearly got a sense from Google they were detecting “movie” as an entity in many cases, and this meant pages that *shouldn’t* or might not normally rank, were. Just as in a local search when prior to Fall 2010 it was a lot harder for a local business to rank, but with the integration of places/business listings this brought it more within reach. I plan on sharing my research soon 🙂
-Dan
Really liked all of your examples in your post. I’ve been seeing odd differences in how Google vs. Bing treat canonicals. Your examples also show some interesting differences in how they handle entities, if that is what is happening here.
I was just thinking about this! I was focused on a granular level – thinking mainly about the influence of single entities, like authors. But now I have the terminology to use when referencing it. Entities! Thank you for writing this! You continue to be the smarty pantalones of the SEO world 🙂
I think there is a lot of interesting stuff coming with author entities. When Stefan Weitz talked about the idea briefly at SEOmoz, I don’t think he called them entities. I wish I remembered, but he was talking about connecting brands and known products (pretty much entities).
“It also changes the paradigm of how to rank. Instead of thinking in terms of setting a page to a keyword, getting anchor text, and getting link popularity; you can start to consider aligning a page with a known entity when appropriate.”
This also applies to aligning your incoming/outgoing link efforts to documents with similar themes vs. only targeting certain keywords. In some cases what may rank well today will not by the end of this year. Concepts like these certainly debunk the idea of doing keyword research then blanketing a certain niche with verbatim content.
Hi Justin – Your article is just in time as here I sit wondering what on earth SEO advice will be in the future (now?) for websites who don’t operate locally, but want an international audience. Seems like all the normal on page stuff will still apply, but page structure and clean markup will be key too.
If in the past we were trying to match keyphrases that people searched in our on page signals, now we must make sure that whatever they search, the page content must serve the answer unambiguously in text with each entity property covered by a relevant heading and without code bloat in HTML.
Great post Justin. The point about authors carrying more weight is solid: this is the way it works in other media. Why would someone who commands more attention not confer correspondingly more weight to their topic, content address, etc.? Thanks for the insight.
Mind = blown.
Another reason why brands (or entities in this case :D) are becoming more important for search.
Hope to see you posting more often Justin! You honestly write some of the best content I’ve ever read.
Thought provoking post! Definitely not written in a few hours. Great stuff Justin!
I am particularly fascinated with the fact that the emergence of an entity (or entities) from citation still functions like a vector with relationship/term weights.
I think the IR world saw this coming long ago and clustering is still effective, but semantic connectivity is reaching a whole new level in search.
Here is my favorite (heavy hitter) in this field – http://www.miislita.com with a post about the various types of relationships that can be implemented to sort the pieces.
It’s interesting that you chose film as an example, because that’s a domain in which Google has invested heavily for data entry into Freebase. It’s almost certainly the strongest domain in Freebase.
In addition to all the film/actor/director/etc facts, they’ve also got links between their entities and the corresponding records in IMDB, Netflix, and Rotten Tomatoes (as well as some non-U.S. sites), allowing them to cross-correlate references to those sites too.
Awesome post! Like Jon Henshaw, I have been thinking about clusters and TF/IDF more and more the last few months. Your post has really helped me solidify my own opinions around the march we as SEOs will be making towards deeper IR practice. I find it compelling and comforting that the concepts coming to the fore around data science ‘decision’ analysis are causing us all to re-think our approaches.
Holy crap! This post is awesome and now I have like 3 more hours of reading. Before today, I have seen entities online but not even thought about them and how they work. Thanks!
Only just watching the presentation embedded here (I’ve been thinking about this stuff as I’m giving a couple of talks on similar subjects).
It’s really valuable. Thanks for surfacing it.