Item16: Validity: crawl record; show effects of ceilings

Priority:  CurrentState:  AppliesTo:  Component:  WaitingFor:
Normal     New            frontend

Details

Provide the user with a record of the crawl results. The record should allow a 'validity check' of those results, i.e., a check of whether the crawler has captured all interlinkings of the nodes on the map, done by checking the interlinkings between sites. This way the qualitative people can see whether it crawled what it should have crawled and investigate the effect of ceilings.
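A minimal sketch of such a validity check, assuming the links drawn on the map are available as host-level pairs and that a separate list of observed (source URL, target URL) pairs exists to compare against. The function names, the input shapes and the report keys are hypothetical and not part of the crawler.

```python
from urllib.parse import urlsplit


def host(url):
    """Reduce a URL to its lowercase host name, dropping a leading 'www.'."""
    name = urlsplit(url).hostname or ""
    return name[4:] if name.startswith("www.") else name


def site_links(url_pairs, map_hosts):
    """Collapse (source_url, target_url) pairs to host-level links between map nodes."""
    links = set()
    for src, dst in url_pairs:
        a, b = host(src), host(dst)
        # Keep only links between two different sites that are both on the map.
        if a != b and a in map_hosts and b in map_hosts:
            links.add((a, b))
    return links


def validity_report(map_links, observed_pairs, map_hosts):
    """Compare the links on the map with the links observed between the same sites."""
    observed = site_links(observed_pairs, map_hosts)
    return {
        "confirmed": sorted(map_links & observed),
        "missing_from_map": sorted(observed - map_links),
        "not_observed": sorted(map_links - observed),
    }
```

Entries under "missing_from_map" are the interlinkings worth investigating first, since a ceiling is one possible reason for the crawler not having captured them.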

Extra:

  • parse all RSS feeds out of crawl records
  • parse all APPLIED robots.txt rules out of crawl records: e.g. ROBOTS: successfully got http://www.....com/robots.txt, ROBOTS: rule "user-agent: *" applied
  • parse all skipped non-HTML/PHP pages out of crawl records (e.g. pdf, doc, xls)
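The three extras could be pulled out of a crawl record in a single pass, roughly as sketched below. Only the two ROBOTS line formats come from the example above. That the record is line-based, that feeds can be recognised by the URL endings in FEED_HINTS, and the name parse_record are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Line formats taken from the ROBOTS example in this item.
ROBOTS_GOT = re.compile(r'ROBOTS: successfully got (\S+)')
ROBOTS_RULE = re.compile(r'ROBOTS: rule "([^"]+)" applied')
URL = re.compile(r'https?://\S+')

SKIPPED_EXT = (".pdf", ".doc", ".xls")
# Assumption: a rough heuristic for spotting feed URLs, not a crawler feature.
FEED_HINTS = (".rss", "/rss", "/feed", "rss.xml", "atom.xml")


def parse_record(lines):
    """Collect robots.txt files, applied rules, feeds and skipped documents."""
    found = {"robots_files": [], "robots_rules": [], "feeds": [], "skipped": []}
    for line in lines:
        got = ROBOTS_GOT.search(line)
        if got:
            found["robots_files"].append(got.group(1))
            continue
        rule = ROBOTS_RULE.search(line)
        if rule:
            found["robots_rules"].append(rule.group(1))
            continue
        for url in URL.findall(line):
            # Classify on the path only, ignoring case and a trailing slash.
            path = urlsplit(url).path.lower().rstrip("/")
            if path.endswith(SKIPPED_EXT):
                found["skipped"].append(url)
            elif path.endswith(FEED_HINTS):
                found["feeds"].append(url)
    return found
```

Matching on URL endings will miss feeds served from other paths, so the feed list should be read as a lower bound.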

Problem:

  • we are talking about a million URLs per crawler, or even per iteration
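At that volume the record probably has to be summarised rather than listed in full. A minimal sketch, assuming the record can be read as one URL per line: it streams the lines and keeps only a per-host counter, so memory grows with the number of hosts and not with the number of URLs. The function name and the file name in the usage note are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def urls_per_host(lines):
    """Count crawled URLs per host without holding the URLs in memory."""
    counts = Counter()
    for line in lines:
        url = line.strip()
        if not url:
            continue
        hostname = urlsplit(url).hostname
        if hostname:  # lines that are not absolute URLs are ignored
            counts[hostname] += 1
    return counts
```

An open file can be passed in directly, since iterating over it yields one line at a time, e.g. `with open("crawl_record.txt") as f: urls_per_host(f).most_common(20)`.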

ItemTemplate
Summary Validity: crawl record; show effects of ceilings
ReportedBy ErikBorra
AppliesTo frontend
Priority Normal
CurrentState New
WaitingFor

Topic revision: r3 - 14 Apr 2008 - 13:31:00 - ErikBorra