Nofollow / Indexing Issues
Introduction: Indexing and Ranking
Search engine algorithms can be viewed from the perspective of indexing and ranking. How is the link understood differently by indexers and rankers? This project uses an index research approach. What is indexed is what is available for ranking.
A normal way of thinking about indexing is inclusion/exclusion in devices (for instance censorship). The question we ask is: how is it that some things are not indexed? Noindex and nofollow are two kinds of instructions for bots and search engines. The first research in this project is focused on nofollow.
Secondly, a case study investigates indexed pages and links in three devices: Google, Google Blog Search, and Technorati. Does Google segregate the blogosphere from the web by introducing Google Blog Search?
A larger DMI question we ask is: when do we need to know how Google works? Or should we ask not how it works but what its effects are?
Nofollow
One issue in indexing, especially as it relates to blogs, is the 'nofollow' tag:
nofollow is an HTML attribute value (rel="nofollow") used to instruct search engines that a hyperlink should not influence the link target's ranking in the search engine's index. It is intended to reduce the effectiveness of certain types of spamdexing, thereby improving the quality of search engine results and preventing spamdexing from occurring in the first place.
(more on Wikipedia)
Google introduced the rel="nofollow" attribute value in 2005 to prevent comment spam and trackback spam:
Official Google Blog: Preventing comment spam
See also Google's post "current activity on robot exclusion protocols"
Two kinds of no_follow
1. Robots (don't follow link)
The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable.
For example: "Do not follow any of the hyperlinks in the body of this document." (Wikipedia)
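To make the robots.txt convention described above concrete, here is a minimal sketch of how a cooperating robot could check whether it may access a URL. It uses Python's standard `urllib.robotparser`; the robots.txt content, the bot name `ExampleBot` and the example.org URLs are made up for illustration and are not taken from any site in this project.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all robots are asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the lines of a robots.txt file, so no network access is needed here.
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is an assumed user agent name, used only for this sketch.
blocked = parser.can_fetch("ExampleBot", "http://example.org/private/page.html")
allowed = parser.can_fetch("ExampleBot", "http://example.org/blog/post.html")

print("private page may be fetched:", blocked)
print("blog post may be fetched:", allowed)
```

Note that this only models a cooperating robot: the file is a request, and nothing prevents a non-cooperating crawler from ignoring it.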
2. Search Engines (don't count link)
How the attribute is being interpreted differs between the search engines. While some take it literally and do not follow the link to the page being linked to, others still "follow" the link to find new web pages for indexing. In the latter case rel="nofollow" actually tells a search engine "Don't score this link" rather than "Don't follow this link."
(Wikipedia)
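As a rough illustration of the link-level variant, the sketch below collects the links in a piece of HTML and flags those carrying rel="nofollow". It also computes the share of flagged links, which is one possible way to approach the prevalence question raised below. The class and function names and the sample HTML are hypothetical; this is not the tooling used in this project.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, is_nofollow) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens, e.g. "external nofollow".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def nofollow_share(html):
    """Return the collected links and the fraction of them marked nofollow."""
    collector = LinkCollector()
    collector.feed(html)
    found = collector.links
    if not found:
        return found, 0.0
    flagged = sum(1 for _, is_nofollow in found if is_nofollow)
    return found, flagged / len(found)


# Made-up sample: one link in the post body, one link left by a commenter.
SAMPLE_HTML = """
<p>Post body with an <a href="http://example.org/a">editorial link</a>.</p>
<div class="comment">
  <a href="http://example.com/b" rel="external nofollow">commenter link</a>
</div>
"""

links, share = nofollow_share(SAMPLE_HTML)
for href, is_nofollow in links:
    print(href, "nofollow" if is_nofollow else "no nofollow")
print(f"nofollow share: {share:.0%}")
```

Whether a flagged link is then left unfollowed or merely left unscored depends on the search engine, as described above; the sketch only detects the attribute.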
Research questions (C. Tatum):
- How prevalent is the nofollow tag?
- What percent of a given network is excluded from an internet search due to nofollow?
- Assuming that nofollow segregates content, is there a way to search the internet as a whole?
- What are the social/political implications of this sort of segregation?
- What do we lose by dividing our primary access to the web into two primary entry frames, blogs and not-blogs?
Prevalence of the no_follow tag
Blog software
- WordPress: nofollow on comment links by default (can be removed with a do_follow plugin)
- Blogger:
- TypePad: "For TypePad subscribers, implementation will be automatic. Links from commenters will be flagged automatically in the next update, which will be deployed within the next 24 hours." (Six Apart - Support for Nofollow)
- Movable Type: "For Movable Type users, we’re shipping a plugin today to enable support on Movable Type-powered sites. The Movable Type website has full details, including a download link." (Six Apart - Support for Nofollow)
- LiveJournal: "LiveJournal also plans to implement the specification for comments from other members who are not friends." (Six Apart - Support for Nofollow)
Ranking blogs
Nofollow is presumably important for blog rankings, but there are many other factors. See an interesting post on how Google ranks blogs here.
In short, the major factors for Google Blog Search are:
- Google’s regular ranking factors
- Scrape Gmail for links
- Frequency of Clicks
- Blogrolls
- Social Bookmarking
- Feed Readership
- Other factors
Wikipedia
A January 2007 post on the Search Engine Journal claims that all outlinks on Wikipedia are nofollow, without any exceptions.
More resources
Devices
- Xinu: Check site's pagerank, backlinks, diagnosis, indexed pages, syndication, bookmarks, domain
- Search Engine Support (2005, todo: check current status)
Reading
Table: Prevalence of the Nofollow Tag in Search engines
| rel="nofollow" Action | Google | Yahoo | MSN Search | Ask.com |
| Follows the link | Yes | Yes | Not proven | Yes |
| Indexes the "linked to" page | No | Yes | No | Yes |
| Shows the existence of the link | Only for a previously indexed page | Yes | No | Yes |
| In SERPs for anchor text | Only for a previously indexed page | Yes | No | Yes |
Indexing Issues Case Study: Google vs. Google Blogsearch vs. Technorati
How does Google segregate the static web and blogs? Do noindex and nofollow play a role?
See here for speculation on how their blog search works.
From About Google Blogsearch:
- Which blogs are included in Blogsearch? The goal of Blogsearch is to include every blog that publishes a site feed (either RSS or Atom). It is not restricted to Blogger blogs, or blogs from any other service.
- How do I get my blog listed? If your blog publishes a site feed in any format and automatically pings an updating service (such as Google Blogsearch Pinging Service), we should be able to find and list it. Also, we will soon be providing a form that you can use to manually add your blog to our index, in case we haven't picked it up automatically. Stay tuned for more information on this.
Starting point:
We will compare the results of the query "link:mastersofmedia.hum.uva.nl" in Google, Google Blogsearch and Technorati.
We will use 3 tagclouds for speculation.
- Google
- Google Blogsearch
- Technorati
Creating a tag cloud for linking of Masters of Media Blog in Google.
Tools:
Method:
- Use Google scraper with query "link:mastersofmedia.hum.uva.nl"
- Google search result for "link:http://mastersofmedia.hum.uva.nl", on 27.07.07: 126 results
- Now manually count and list the results per domain like this
Final list for Google query "link:http://mastersofmedia.hum.uva.nl":
DmiMoM
Result
-
- Open the file in Illustrator and manually rescale the results to A4 and organize the svg file into a tag cloud. Adjust transparency according to number of links to MOM blog:
- 1 = 30%
- 2 = 40%
- 3 = 50%
- 4 = 60%
- 5 = 70%
- 6 = 80%
- 7 = 90%
- 8+ = 100%
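The per-domain count and the transparency scale above were done by hand. As a sketch of how the same two steps could be automated, the snippet below counts result URLs per host name and maps each count to the percentage from the list above. The function names and sample URLs are hypothetical, and folding "www." into the bare host name is our assumption about how domains would be grouped.

```python
from collections import Counter
from urllib.parse import urlparse


def count_per_domain(urls):
    """Count how many result URLs fall under each host name."""
    hosts = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Assumption: treat "www.example.org" and "example.org" as one domain.
        if host.startswith("www."):
            host = host[4:]
        # URLs without a scheme yield an empty host and are skipped.
        if host:
            hosts.append(host)
    return Counter(hosts)


def transparency_for(link_count):
    """Map a link count to the scale above: 1 = 30%, 2 = 40%, ..., 8+ = 100%."""
    if link_count < 1:
        raise ValueError("link_count must be at least 1")
    return min(20 + 10 * link_count, 100)


# Made-up sample results, standing in for the scraped result URLs.
SAMPLE_RESULTS = [
    "http://www.example.org/2007/07/post-one",
    "http://example.org/2007/07/post-two",
    "http://blog.example.com/archives/12",
]

counts = count_per_domain(SAMPLE_RESULTS)
for host, n in counts.most_common():
    print(f"{host}\t{n}\t{transparency_for(n)}%")
```

The same sketch would apply unchanged to the Google Blogsearch and Technorati result lists, since all three follow the same counting and scaling steps.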
Result: Backlinks Masters of Media. Tagcloud Google Search:
Creating a tag cloud for linking of Masters of Media Blog in Google Blogsearch.
Tools:
Method:
- Query Google Blogsearch with "link:mastersofmedia.hum.uva.nl". Since there is no tool to scrape all the results, we manually copied all the URLs of the titles of the results.
- googleblogsearch.txt: Google Blogsearch Masters of Media blog inlink results (27-07-07).
We proposed a tool that scrapes the URLs from Google Blogsearch automatically.
Update: the tool has now been made and can be found here.
- Count the results per domain manually and make a list.
Final Google Blogsearch results for "link:http://mastersofmedia.hum.uva.nl" (27-07-07):
DmiMoM
Google Blog Search MOM inlink results
googleblogsearch.txt
Result
-
- Open the file in Illustrator and manually rescale the results to A4 and organize the svg file into a tag cloud. Adjust transparency according to number of links to MOM blog:
- 1 = 30%
- 2 = 40%
- 3 = 50%
- 4 = 60%
- 5 = 70%
- 6 = 80%
- 7 = 90%
- 8+ = 100%
Result: links Masters of Media. Tagcloud Google Blogsearch:
Creating a tag cloud for links to Masters of Media Blog within Technorati.
Tools:
- [[http://service.openkapow.com/artonice/technoratipostsearch1.rest][Technorati Scraper]]
- tag cloud to svg tool
- Illustrator
Method:
- Query Technorati (advanced search) with "link:mastersofmedia.hum.uva.nl" (result).
Since there is no tool to scrape these results, we manually copied all the URLs of the titles of the results.
We proposed a tool that scrapes the URLs from Technorati automatically.
Update: the tool has now been made and can be found here.
- Count the results per domain manually and make a list.
Final Technorati results for "link:http://mastersofmedia.hum.uva.nl" (27-07-07):
DmiMoM
Result
-
- Open the file in Illustrator and manually rescale the results to A4 and organize the svg file into a tag cloud. Adjust transparency according to number of links to MOM blog:
- 1 = 30%
- 2 = 40%
- 3 = 50%
- 4 = 60%
- 5 = 70%
- 6 = 80%
- 7 = 90%
- 8+ = 100%
Result: links Masters of Media. Tagcloud Technorati:
Results: Comparing the three tagclouds
To see which blogs are included or excluded within the spheres of the three devices, we compared the results as found in lists 1, 2 and 3. We used the compare lists tool; the result was this.
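The comparison itself comes down to set membership: for each domain, which of the three devices returned it? The sketch below shows one way to compute that overlap. It is not the compare lists tool itself, and the device keys and sample domains are invented for illustration.

```python
def compare_devices(lists_by_device):
    """Return, for each domain, the sorted tuple of devices that returned it."""
    membership = {}
    for device, domains in lists_by_device.items():
        # set() removes duplicates, so a domain counts once per device.
        for domain in set(domains):
            membership.setdefault(domain, set()).add(device)
    return {domain: tuple(sorted(devices)) for domain, devices in membership.items()}


# Made-up domain lists, standing in for the three per-device result lists.
SAMPLE = {
    "google": ["example.org", "example.net"],
    "googleblogsearch": ["example.org", "example.com"],
    "technorati": ["example.org", "example.com", "example.edu"],
}

overlap = compare_devices(SAMPLE)
in_all = sorted(d for d, devs in overlap.items() if len(devs) == len(SAMPLE))
only_google = sorted(d for d, devs in overlap.items() if devs == ("google",))

print("returned by all three devices:", in_all)
print("returned by Google only:", only_google)
```

Domains returned by all three devices would sit at the centre of a cross-device cloud, while single-device domains mark what each device includes that the others exclude.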
With this result we could manually make this cross-spherical tag cloud in Illustrator:
Cross-device DmiMoM tag cloud: overlap of Google, Google Blogsearch and Technorati:
Findings
- It is remarkable that Google returns mostly blog results (hardly any static web results). Are there very few static websites linking to MOM?
- Google returns blog results that cannot be found in Google Blog Search.
- Nofollow has little to do with the difference in returns across the three devices. The permalink has no nofollow tag; only comment links carry a nofollow tag by default and are excluded from the results. As a consequence, no links to DmiMoM that are placed in comments are returned. This is, however, not a defining factor in the difference in returns. The difference lies in the algorithms of the engines.
DmiMoM inlink cluster map
- Do the inlinks to DmiMoM form a network?
- Are the blogs that are central in the cross-device DmiMoM tag cloud also central in the network?
- What new actors emerge and do they say something about the collection of URLs?
- Method: input the URLs from all three devices in the harvester and launch a crawl with the issuecrawler (settings 1-2)
- Result: