Nofollow (18 Sep 2007, MichaelStevenson)

Nofollow / Indexing Issues

Introduction: Indexing and Ranking

Search engine algorithms can be viewed from the perspective of indexing and ranking. How is the link understood differently by indexers and rankers? This project takes an index-research approach: what is indexed is what is available for ranking.

A normal way of thinking about indexing is inclusion/exclusion in devices (for instance censorship). The question we ask is: how is it that some things are not indexed? Noindex and nofollow are two kinds of instructions for bots and search engines. The first part of the research in this project focuses on nofollow.

Secondly, a case study investigates indexed pages and links in three devices: Google, Google Blog Search, and Technorati. Does Google segregate the blogosphere from the web by introducing Google Blog Search?

A larger DMI question we ask is: when do we need to know how Google works? Or should we ask not how it works but what its effects are?

Nofollow

One issue in indexing, especially as it relates to blogs, is the 'nofollow' tag:

nofollow is an HTML attribute value (rel="nofollow") used to instruct search engines that a hyperlink should not influence the link target's ranking in the search engine's index. It is intended to reduce the effectiveness of certain types of spamdexing, thereby improving the quality of search engine results and preventing spamdexing from occurring in the first place.

(more on Wikipedia)

Google introduced the rel="nofollow" attribute value in 2005 to prevent comment spam and trackback spam: Official Google Blog: Preventing comment spam

See also Google's post "current activity on robot exclusion protocols"

Two kinds of no_follow

1. Robots (don't follow link)

The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is, otherwise, publicly viewable.
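The convention described above can be sketched in a few lines using Python's standard `urllib.robotparser` module. This is only an illustration of how a cooperating robot might consult robots.txt rules before fetching a page; the rules, the bot name `ExampleBot` and the example.org URLs are hypothetical and not taken from the project.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all robots are asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(url, robots_txt=ROBOTS_TXT, agent="ExampleBot"):
    """Return True if a cooperating robot may fetch the URL under these rules."""
    parser = RobotFileParser()
    # parse() takes the robots.txt lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(allowed("http://example.org/index.html"))
    print(allowed("http://example.org/private/notes.html"))
```

Note that the standard only binds robots that choose to cooperate: nothing in robots.txt technically prevents access.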

For example, the robots meta tag value nofollow tells a robot:

"Do not follow any of the hyperlinks in the body of this document." (Wikipedia)

2. Search Engines (don't count link)

How the attribute is being interpreted differs between the search engines. While some take it literally and do not follow the link to the page being linked to, others still "follow" the link to find new web pages for indexing. In the latter case rel="nofollow" actually tells a search engine "Don't score this link" rather than "Don't follow this link." (Wikipedia)
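To make the two kinds of nofollow concrete, here is a minimal sketch that separates the page-level instruction (a robots meta tag with the value nofollow) from the link-level one (rel="nofollow" on a single hyperlink), and reports what share of a page's links carries the latter. It uses Python's standard `html.parser`; the class name `NofollowAudit`, the function `nofollow_share` and the sample HTML are assumptions made for illustration, not a tool used in this project.

```python
from html.parser import HTMLParser


class NofollowAudit(HTMLParser):
    """Collect links and note which ones carry rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False  # set by <meta name="robots" content="... nofollow ...">
        self.links = []             # (href, is_nofollow) pairs

    def handle_starttag(self, tag, attrs):
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            values = [v.strip() for v in attrs.get("content", "").lower().split(",")]
            if "nofollow" in values:
                self.page_nofollow = True
        elif tag == "a" and "href" in attrs:
            # rel may hold several space-separated values, e.g. "external nofollow".
            rel_values = attrs.get("rel", "").lower().split()
            self.links.append((attrs["href"], "nofollow" in rel_values))


def nofollow_share(html):
    """Return (page_nofollow, percentage of links tagged rel="nofollow")."""
    audit = NofollowAudit()
    audit.feed(html)
    if not audit.links:
        return audit.page_nofollow, 0.0
    tagged = sum(1 for _, is_nofollow in audit.links if is_nofollow)
    return audit.page_nofollow, 100.0 * tagged / len(audit.links)


# Hypothetical page: one ordinary link and one comment link tagged nofollow.
SAMPLE = """
<html><head><meta name="robots" content="index, nofollow"></head>
<body>
<a href="http://example.org/post">post</a>
<a href="http://example.org/spam" rel="external nofollow">comment link</a>
</body></html>
"""

if __name__ == "__main__":
    print(nofollow_share(SAMPLE))
```

Run over a set of crawled pages, a count like this would give a first, rough answer to how prevalent the tag is; it says nothing about how a given search engine acts on it.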

Research questions (C. Tatum):

  • How prevalent is the nofollow tag?
  • What percent of a given network is excluded from an internet search due to nofollow?
  • Assuming that nofollow segregates content, is there a way to search the internet as a whole?
  • What are the social/political implications of this sort of segregation?
  • What do we lose by dividing our primary access to the web into two primary entry frames, blogs and not-blogs?

Prevalence of the no_follow tag

Blog software

  • WordPress: nofollow on comment links by default (can be removed with a do_follow plugin)
  • Blogger:
  • TypePad: "For TypePad subscribers, implementation will be automatic. Links from commenters will be flagged automatically in the next update, which will be deployed within the next 24 hours." (Six Apart - Support for Nofollow)
  • Movable Type: "For Movable Type users, we’re shipping a plugin today to enable support on Movable Type-powered sites. The Movable Type website has full details, including a download link." (Six Apart - Support for Nofollow)
  • LiveJournal: "LiveJournal also plans to implement the specification for comments from other members who are not friends." (Six Apart - Support for Nofollow)

Ranking blogs

Nofollow is presumably important for blog rankings, but there are many other factors. See an interesting post on how Google ranks blogs here.

In short, the major factors for Google Blog Search, according to that post, are:

  • Google’s regular ranking factors
  • Scrape Gmail for links
  • Frequency of Clicks
  • Blogrolls
  • Social Bookmarking
  • Feed Readership
  • Other factors

Wikipedia

A January 2007 post on the Search Engine Journal claims that all outlinks on Wikipedia are Nofollow, without any exceptions.

More resources

Devices

  • Xinu: Check site's pagerank, backlinks, diagnosis, indexed pages, syndication, bookmarks, domain
  • Search Engine Support (2005, todo: check current status)

Reading

Table: Treatment of the nofollow tag by search engines

| rel="nofollow" Action | Google | Yahoo | MSN Search | Ask.com |
| Follows the link | Yes | Yes | Not proven | Yes |
| Indexes the "linked to" page | No | Yes | No | Yes |
| Shows the existence of the link | Only for a previously indexed page | Yes | No | Yes |
| In SERPs for anchor text | Only for a previously indexed page | Yes | No | Yes |

Indexing Issues Case Study: Google vs. Google Blogsearch vs. Technorati

How does Google segregate the static web and blogs? Do noindex and nofollow play a role?

See here for speculation on how their blog search works.

From About Google Blogsearch:

  • Which blogs are included in Blogsearch? The goal of Blogsearch is to include every blog that publishes a site feed (either RSS or Atom). It is not restricted to Blogger blogs, or blogs from any other service.

  • How do I get my blog listed? If your blog publishes a site feed in any format and automatically pings an updating service (such as Google Blogsearch Pinging Service), we should be able to find and list it. Also, we will soon be providing a form that you can use to manually add your blog to our index, in case we haven't picked it up automatically. Stay tuned for more information on this.

Starting point:

We will compare the results of the query "link:mastersofmedia.hum.uva.nl" in Google, Google Blogsearch and Technorati. We will use three tag clouds as a basis for speculation.

  • Google
  • Google Blogsearch
  • Technorati

Creating a tag cloud of links to the Masters of Media blog in Google.

Tools:

Method:

  • Use Google scraper with query "link:mastersofmedia.hum.uva.nl"
  • Google search result for "link:http://mastersofmedia.hum.uva.nl", on 27.07.07: 126 results
  • Now manually count and list the results per domain like this
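The per-domain count was done by hand. As a hedged sketch of how that step could be automated, the following tallies result URLs by host name with Python's standard library. The function name `count_per_domain` and the example URLs are hypothetical, "domain" is taken here to mean the host name with a leading "www." removed, and URLs are assumed to include their scheme (http://).

```python
from collections import Counter
from urllib.parse import urlparse


def count_per_domain(urls):
    """Count result URLs per host name, ignoring a leading 'www.'."""
    counts = Counter()
    for url in urls:
        host = urlparse(url.strip()).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host:  # skip lines that are not absolute URLs
            counts[host] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical result URLs, standing in for a scraped result list.
    results = [
        "http://www.example.org/2007/07/post-a",
        "http://example.org/2007/07/post-b",
        "http://blog.example.net/archive/1",
    ]
    for domain, n in count_per_domain(results).most_common():
        print(n, domain)
```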

Final list for Google query "link:http://mastersofmedia.hum.uva.nl": DmiMoM

Result

    • Open the file in Illustrator and manually rescale the results to A4 and organize the svg file into a tag cloud. Adjust transparency according to number of links to MOM blog:
    • 1 = 30%
    • 2 = 40%
    • 3 = 50%
    • 4 = 60%
    • 5 = 70%
    • 6 = 80%
    • 7 = 90%
    • 8+ = 100%
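The transparency scale above is a simple linear rule with a cap: each additional link adds 10%, starting at 30% for one link and reaching 100% at eight or more. The sketch below writes that rule down and, as an assumption, applies it through the SVG `fill-opacity` attribute. The names `opacity_for` and `svg_tag` are hypothetical; in the project itself this step was done manually in Illustrator.

```python
from xml.sax.saxutils import escape


def opacity_for(link_count):
    """Map a link count to the scale above: 1 -> 0.3, 2 -> 0.4, ..., 8 or more -> 1.0."""
    if link_count < 1:
        raise ValueError("link_count must be at least 1")
    return min(20 + 10 * link_count, 100) / 100


def svg_tag(label, link_count, x=0, y=0):
    """Return one SVG <text> element for a domain; the position is a placeholder."""
    return '<text x="%d" y="%d" fill-opacity="%.1f">%s</text>' % (
        x, y, opacity_for(link_count), escape(label))


if __name__ == "__main__":
    print(svg_tag("example.org", 3, x=10, y=20))
```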

Result: Backlinks Masters of Media. Tagcloud Google Search:

nofollow_google.png

Creating a tag cloud of links to the Masters of Media blog in Google Blogsearch.

Tools:

Method:

  • Query Google Blogsearch with "link:mastersofmedia.hum.uva.nl". Since there is no tool to scrape all the results, we manually copied all the URLs of the titles of the results.
  • googleblogsearch.txt: Google Blogsearch Masters of Media blog inlink results (27-07-07).

We proposed a tool that scrapes the URLs from Google Blogsearch automatically.

Update: the tool has since been built and can be found here.

  • Count the results per domain manually and make a list.

Final Google Blogsearch results for "link:http://mastersofmedia.hum.uva.nl" (27-07-07): DmiMoM
Google Blog Search MOM inlink results googleblogsearch.txt

Result

    • Open the file in Illustrator and manually rescale the results to A4 and organize the svg file into a tag cloud. Adjust transparency according to number of links to MOM blog:
    • 1 = 30%
    • 2 = 40%
    • 3 = 50%
    • 4 = 60%
    • 5 = 70%
    • 6 = 80%
    • 7 = 90%
    • 8+ = 100%

Result: links Masters of Media. Tagcloud Google Blogsearch:

nofollow_googleblogsearch.png

Creating a tag cloud for links to Masters of Media Blog within Technorati.

Tools:

  • Technorati Scraper: http://service.openkapow.com/artonice/technoratipostsearch1.rest
  • tag cloud to svg tool
  • Illustrator

Method:

  • Query Technorati (advanced search) with "link:mastersofmedia.hum.uva.nl" (result). Since there was no tool to scrape these results, we manually copied all the URLs of the titles of the results.

We proposed a tool that scrapes the URLs from Technorati automatically.

Update: the tool has since been built and can be found here.

  • Count the results per domain manually and make a list.

Final Technorati results for "link:http://mastersofmedia.hum.uva.nl" (27-07-07): DmiMoM

Result

    • Open the file in Illustrator and manually rescale the results to A4 and organize the svg file into a tag cloud. Adjust transparency according to number of links to MOM blog:
    • 1 = 30%
    • 2 = 40%
    • 3 = 50%
    • 4 = 60%
    • 5 = 70%
    • 6 = 80%
    • 7 = 90%
    • 8+ = 100%

Result: links Masters of Media. Tagcloud Technorati:
nofollow_technorati.png

Results: Comparing the three tagclouds

To see which blogs are included in or excluded from the spheres of the three devices, we compared the result lists (1, 2, 3) using the compare lists tool.

The result was this.
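The comparison of the three lists amounts to set operations: which domains appear in all three devices, in two of them, or in only one. The following is a minimal sketch under that reading; the function name `compare_devices` and the example domains are hypothetical, and it is not the compare lists tool itself.

```python
def compare_devices(google, blogsearch, technorati):
    """Split domains by which of the three devices returned them."""
    g, b, t = set(google), set(blogsearch), set(technorati)
    return {
        "all three": g & b & t,
        "Google and Blogsearch only": (g & b) - t,
        "Google and Technorati only": (g & t) - b,
        "Blogsearch and Technorati only": (b & t) - g,
        "Google only": g - b - t,
        "Blogsearch only": b - g - t,
        "Technorati only": t - g - b,
    }


if __name__ == "__main__":
    # Hypothetical per-device domain lists.
    overlap = compare_devices(
        google=["a.example", "b.example", "c.example"],
        blogsearch=["a.example", "c.example", "d.example"],
        technorati=["a.example", "e.example"],
    )
    for sphere, domains in overlap.items():
        print(sphere, sorted(domains))
```

Every domain ends up in exactly one of the seven groups, which is the information needed to place it in a cross-device tag cloud.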

With this result we could manually make this cross-spherical tag cloud in Illustrator:

Cross-device DmiMoM tag cloud overlap (Google, Google Blogsearch, Technorati):
nofollow_all.png

Findings

  • It is remarkable that Google returns mostly blog results (hardly any static web results). Are there very few static websites linking to MOM?
  • Google returns blog results that cannot be found in Google Blog Search.
  • Nofollow has little to do with the difference in returns across the three devices. Permalinks carry no nofollow tag; only links in comments are tagged nofollow by default and are excluded from results. As a consequence, no links to DmiMoM that are placed in comments are returned. This is, however, not a defining factor in the difference in returns. The difference lies in the algorithms of the engines.

DmiMoM? inlink cluster map

  • Do the inlinks to DmiMoM form a network?
  • Are the blogs that are central in the cross-device DmiMoM tag cloud also central in the network?
  • What new actors emerge and do they say something about the collection of URLs?
  • Method: input the URLs from all three devices in the harvester and launch a crawl with the issuecrawler (settings 1-2)
  • Result:

MoMcluster_map.jpg


Tags: google scraper, hyperlink, issuecrawler, no follow, pagerank, robot exclusion policy, source distance, tag cloud generator, technorati scraper
Topic attachments

| Attachment | Size | Date | Who | Comment |
| 27julidmi.doc | 57.0 K | 03 Aug 2007 - 10:03 | Main.r000s | list of urls from all devices that link the mom blog |
| Desktop.rar | 72.4 K | 27 Jul 2007 - 15:00 | Main.esther | cross-device overlap google googleblogsearch technorati |
| Google_search_for_link-mastersofmedia.hum.uva.nl.rtf | 58.2 K | 27 Jul 2007 - 11:44 | MichaelStevenson | Google search for "link:http://mastersofmedia.hum.uva.nl" 27.07.07 155 results |
| MoM_technorati_inlinks.txt | 9.5 K | 27 Jul 2007 - 12:13 | Main.clifford | masters of media techno-inlinks |
| MoMcluster_map.jpg | 282.7 K | 04 Aug 2007 - 17:51 | Main.esther | cluster map MoM inlinks |
| NoFollow.ppt | 164.0 K | 18 Sep 2007 - 11:39 | AnneHelmond | Nofollow presentation 18-09-07 PPT |
| cross-device_MoM_tagcloud(2).svg | 279.6 K | 30 Jul 2007 - 14:47 | Main.esther | original cross-device MoM tag cloud overlap google googleblogsearch technorati |
| cross-device_MoM_tagcloud(2.gif | 39.8 K | 30 Jul 2007 - 14:49 | Main.esther | cross-device MoM tag cloud overlap google googleblogsearch technorati |
| googleblogsearch.gif | 21.4 K | 27 Jul 2007 - 13:57 | Main.annehelmond | Google Blog Search MOM inlink results TAG CLOUD |
| googleblogsearch.txt | 4.9 K | 27 Jul 2007 - 12:47 | Main.annehelmond | Google Blog Search MOM inlink results |
| nofollow_all.ai | 361.5 K | 18 Sep 2007 - 11:36 | AnneHelmond | |
| nofollow_all.pdf | 384.6 K | 18 Sep 2007 - 11:35 | AnneHelmond | |
| nofollow_all.png | 73.7 K | 18 Sep 2007 - 11:36 | AnneHelmond | |
| nofollow_google.pdf | 323.1 K | 18 Sep 2007 - 11:46 | AnneHelmond | |
| nofollow_google.png | 53.8 K | 18 Sep 2007 - 11:45 | AnneHelmond | |
| nofollow_googleblogsearch.pdf | 345.9 K | 18 Sep 2007 - 11:46 | AnneHelmond | |
| nofollow_googleblogsearch.png | 49.9 K | 18 Sep 2007 - 11:45 | AnneHelmond | |
| nofollow_technorati.pdf | 426.6 K | 18 Sep 2007 - 11:45 | AnneHelmond | |
| nofollow_technorati.png | 78.3 K | 18 Sep 2007 - 11:45 | AnneHelmond | |
| nofollowpresfin.pdf | 80.0 K | 23 Aug 2007 - 09:53 | Main.annehelmond | Nofollow presentation 23-08-07 PDF |
| tagcloud_3devices.ai | 173.5 K | 30 Jul 2007 - 14:39 | Main.annehelmond | Backlinks Masters of Media. Illustrator Tagcloud 3 devices |
| tagcloud_googleblogsearch.gif | 24.5 K | 30 Jul 2007 - 14:20 | Main.annehelmond | Backlinks Masters of Media. Tagcloud Blog Google Search |
| tagcloud_googleblogsearch.pdf | 218.7 K | 30 Jul 2007 - 14:19 | Main.annehelmond | Backlinks Masters of Media. Tagcloud Google Blog Search PDF |
| tagcloud_googlesearch.gif | 30.8 K | 30 Jul 2007 - 14:35 | Main.annehelmond | Backlinks Masters of Media. Tagcloud Google Search |
| tagcloud_googlesearch.pdf | 224.5 K | 30 Jul 2007 - 14:19 | Main.annehelmond | Backlinks Masters of Media. Tagcloud Google Search PDF |
| tagcloud_technorati.gif | 44.2 K | 30 Jul 2007 - 14:21 | Main.annehelmond | Backlinks Masters of Media. Tagcloud Technorati Search |
| tagcloud_technorati.pdf | 219.1 K | 30 Jul 2007 - 14:20 | Main.annehelmond | Backlinks Masters of Media. Tagcloud Technorati Search |
Topic revision: r43 - 18 Sep 2007 - 13:18:58 - MichaelStevenson
Dmi.Nofollow moved from Dmi.Noindex on 10 Aug 2007 - 06:34 by Main.esther

This site is powered by the TWiki collaboration platform. Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.