Js Crawler
Stop Press!
Shanti Rao has created an extended standalone Windows Javascript interpreter based on Mozilla's SpiderMonkey. This can be used for any kind of programming you like; Rao has even created a web server using it.
Of course this doesn't really solve my problems, because it is Windows-specific and, anyway, I wanted something that would run in the browser.
Javascript Crawler
If you want to see the crawler in action then visit Crawler3.
Creating the crawler was unreasonably difficult for a number of reasons:
- Microsoft, Netscape and the W3C disagree on a number of quite fundamental points, so simply finding the names of the relevant properties or methods is a task for Peter Wimsey or Hercule Poirot rather than a working programmer like me.
- The very specification of Javascript turns it into a second-class language by denying it the ability to access files outside the domain of the executing script. Yes, I know it is supposed to be more secure that way, but really we all execute insecure code all the time. Every time you download a new program these days, it calls home to see if there is a new version and offers to download it for you. This process plainly involves reading files from another domain, so why should Javascript not be allowed to do the same? I'm not suggesting that it be allowed to read files from local disks, just that it have the same ability to read public web sites that the user has.
How it works
This is very simple. Here is the pseudo-code:
    push the URL of a page to a stack
    while stack not empty {
        pop URL from stack
        open page for URL
        add all links from this page that have not yet been visited to the stack
    }
Yes, I know that if even one page points at an external page, this is probably not going to terminate until hell freezes over. Just getting this far was so hard that I'm not feeling sympathetic to cries of "help, it won't stop"; I'm just glad I managed to make it go at all. In fact it probably only works in IE6. Opera will have to wait (even though it is my favourite browser), I haven't got a copy of Netscape, and I haven't tried Mozilla in ages.
Obviously the pseudo-code describes an algorithm that is too simple. Other sections on this page address the problems.
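The pseudo-code can be turned into a small runnable sketch. This is not the crawler's actual source: page loading is replaced by a made-up in-memory table (`site`) and a hypothetical `getLinks` helper, and it is written in present-day Javascript rather than anything IE6 would accept.

```javascript
// Hypothetical in-memory "site": each URL maps to the links found on that page.
// In a real crawler these would come from the links of the loaded document.
const site = {
  "/index.html": ["/a.html", "/b.html"],
  "/a.html": ["/b.html", "/index.html"],
  "/b.html": ["/c.html"],
  "/c.html": []
};

// Assumed stand-in for "open page for URL": returns the page's outgoing links.
function getLinks(url) {
  return site[url] || [];
}

function crawl(startUrl) {
  const stack = [startUrl];   // URLs waiting to be opened
  const visited = new Set();  // URLs already opened
  const order = [];           // order in which pages were opened

  while (stack.length > 0) {
    const url = stack.pop();
    if (visited.has(url)) continue; // a URL can be pushed more than once
    visited.add(url);
    order.push(url);

    // Add every link that has not been visited yet.
    for (const link of getLinks(url)) {
      if (!visited.has(link)) stack.push(link);
    }
  }
  return order;
}

console.log(crawl("/index.html"));
```

Because the table is finite and every page is opened at most once, this version always terminates; the real difficulty starts when the links lead off the site.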
What is wrong with it
- Doesn't stop crawling until it runs out of links, which could be never.
- Doesn't work in my favourite browser.
- Doesn't actually do anything useful!
- Opens a window instead of keeping the process in the background.
- Doesn't crawl Javascript links properly. Probably doesn't do it at all.
- Adds all pages to the stack even if they are already there.
- Wastes time waiting for pages that could be used loading others.
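Two of these problems, never stopping and pushing pages that are already on the stack, might be handled along the following lines. This is only a sketch under stated assumptions: the `pages` table and the example.com / example.org URLs are placeholders, `inScope` and `boundedCrawl` are hypothetical names, and it relies on the standard `URL` class, which IE6 does not have. The limits it applies (same host, not above the starting page, at most n steps) are the ones listed under "What next?".

```javascript
// Hypothetical link graph with absolute URLs; the hosts are placeholders.
const pages = {
  "http://example.com/docs/index.html": [
    "http://example.com/docs/a.html",
    "http://example.com/docs/a.html#top", // same page, different fragment
    "http://example.com/about.html",      // above the starting directory
    "http://example.org/"                 // another host
  ],
  "http://example.com/docs/a.html": [
    "http://example.com/docs/index.html",
    "http://example.com/docs/deep/b.html"
  ],
  "http://example.com/docs/deep/b.html": ["http://example.com/docs/deep/c.html"],
  "http://example.com/docs/deep/c.html": []
};

// A link is in scope if it is on the same host and not above the start page.
function inScope(startUrl, linkUrl) {
  const start = new URL(startUrl);
  const link = new URL(linkUrl);
  const baseDir = start.pathname.slice(0, start.pathname.lastIndexOf("/") + 1);
  return link.host === start.host && link.pathname.startsWith(baseDir);
}

function boundedCrawl(startUrl, maxDepth) {
  const stack = [{ url: startUrl, depth: 0 }];
  const seen = new Set([startUrl]); // marked when pushed, so nothing is pushed twice
  const opened = [];

  while (stack.length > 0) {
    const { url, depth } = stack.pop();
    opened.push(url);
    if (depth >= maxDepth) continue; // do not follow links beyond n steps

    for (const raw of pages[url] || []) {
      const link = new URL(raw);
      link.hash = "";                // "#top" and the like point at the same page
      const key = link.href;
      if (seen.has(key) || !inScope(startUrl, key)) continue;
      seen.add(key);
      stack.push({ url: key, depth: depth + 1 });
    }
  }
  return opened;
}

console.log(boundedCrawl("http://example.com/docs/index.html", 2));
```

One caveat: with a stack, the recorded depth is the depth along whichever path reached the page first, which is not necessarily the shortest one. A queue (breadth-first order) would be needed to make "n steps away" mean the shortest distance.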
What it might be useful for
- A search engine for a small web site.
- A concordance creator.
- A site map creator.
What next?
In no particular order here are some things that I intend to do:
- don't crawl outside own host
- don't crawl above starting page
- crawl only n steps away from starting point
- create concordance of words found, with links to the pages on which they were found
- create site map
- operate in the background
- store some state as cookies. How much can a cookie hold?
- multithreading. Well, not exactly; just the ability to start a new page download before giving up on an earlier one. This can easily be done by adding another list of pages, so that every time a new page is opened it is added to the list. Instead of timing out and giving up, we could have a two-stage timeout in which the first timeout simply opens the next link. Each link would be given an amount of time to reach readyState "complete". The number of links open at a given time should be limited to some user-defined value. The list would be scanned at intervals looking for pages that have arrived and pages that are still trying. Any that have been trying too long would be discarded and replaced with new links.
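The two-stage timeout in the last item can be tried out without a browser by simulating it. The sketch below is one possible reading of that idea, not an implementation of the crawler: time is counted in scan intervals ("ticks"), `loadTicks` is a made-up table saying how long each page takes to become complete, and `simulateCrawl`, `softTimeout` and `hardTimeout` are hypothetical names. The rule it assumes is that the next link is opened whenever no open page is still inside its first timeout.

```javascript
// loadTicks[url] is an invented number of ticks a page needs to become
// complete; Infinity (or a missing entry) means it never arrives.
function simulateCrawl(urls, loadTicks, options) {
  const { maxOpen, softTimeout, hardTimeout } = options;
  if (maxOpen < 1 || !Number.isFinite(hardTimeout)) {
    throw new Error("need maxOpen >= 1 and a finite hardTimeout");
  }
  const pending = urls.slice(); // links not opened yet
  const open = [];              // pages currently loading: { url, startedAt }
  const arrived = [];
  const discarded = [];

  for (let now = 0; pending.length > 0 || open.length > 0; now++) {
    // Scan the list of open pages, as a setInterval callback might in a browser.
    for (let i = open.length - 1; i >= 0; i--) {
      const age = now - open[i].startedAt;
      if (age >= loadTicks[open[i].url]) {
        arrived.push(open.splice(i, 1)[0].url);   // page is complete
      } else if (age >= hardTimeout) {
        discarded.push(open.splice(i, 1)[0].url); // second timeout: give up
      }
    }
    // First timeout: a page older than softTimeout stays open but no longer
    // holds up the queue. Open the next link when no open page is "fresh".
    const fresh = open.some((p) => now - p.startedAt < softTimeout);
    if (!fresh && open.length < maxOpen && pending.length > 0) {
      open.push({ url: pending.shift(), startedAt: now });
    }
  }
  return { arrived, discarded };
}

const result = simulateCrawl(
  ["/fast.html", "/slow.html", "/dead.html", "/last.html"],
  { "/fast.html": 1, "/slow.html": 4, "/dead.html": Infinity, "/last.html": 1 },
  { maxOpen: 2, softTimeout: 2, hardTimeout: 6 }
);
console.log(result);
```

In this run the slow page still arrives, because passing the first timeout only lets the next link start; the page that never answers is dropped at the second timeout without holding up the pages behind it.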