Wednesday, 31 March 2010

Hot Mail Downloader

The Hotmail downloader is a special-purpose web spider.

Requirements

  • Download all messages
  • Download all attachments
  • Download all different sort orders
  • Rewrite HTML so that the downloaded files are browsable just as though you were online
  • Replace complicated URLs with simpler URLs based on sender address, date, subject
  • Must be able to follow JavaScript links, because most of the links in Hotmail are of this type
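
To make the "simpler URLs" requirement concrete, here is a minimal sketch in Python (not the VB6 of the actual spider) of how a local file name might be built from sender, date and subject. The function name simple_name, the separator characters and the 40-character subject limit are all assumptions made for illustration; nothing in these notes fixes the naming scheme.

```python
import re
from datetime import date


def _clean(text):
    """Reduce text to lowercase letters and digits separated by single hyphens."""
    return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-").lower()


def simple_name(sender, sent_on, subject, max_subject=40):
    """Build a hypothetical local file name from sender, date and subject."""
    # Fall back to a fixed word when the subject is empty or all punctuation.
    subject_part = _clean(subject)[:max_subject].rstrip("-") or "no-subject"
    return "{}_{}_{}.html".format(_clean(sender), sent_on.isoformat(), subject_part)


print(simple_name("alice@example.com", date(2004, 7, 26), "Re: Holiday photos!"))
```

Two messages with the same sender, date and subject would collide under this scheme, so a real version would need a counter or some other tie-breaker.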

First Try

Using the WebBrowser control, build a simple browser in VB6. Use the document.links collection to find the links to the messages and attachments.

Starting from the page that follows the login, look for the anchor that has:

  • text = Mail
  • class = E
  • href begins /cgi-bin/HoTMaiL?curmbox
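
The three tests above can be sketched outside the browser as well. The following is an illustration in Python using the standard html.parser module, not the VB6 code that walks document.links. The sample HTML is made up, with unquoted attributes to mimic the pages, and the class name MailLinkFinder is hypothetical.

```python
from html.parser import HTMLParser


class MailLinkFinder(HTMLParser):
    """Collect hrefs of anchors with class E, text 'Mail' and the expected href prefix."""

    HREF_PREFIX = "/cgi-bin/HoTMaiL?curmbox"

    def __init__(self):
        super().__init__()
        self.matches = []
        self._href = None   # href of the candidate anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        if attributes.get("class") == "E" and href.startswith(self.HREF_PREFIX):
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == "Mail":
                self.matches.append(self._href)
            self._href = None


# Made-up sample; the mailbox identifier is invented for the example.
SAMPLE = (
    "<a class=E href=/cgi-bin/HoTMaiL?curmbox=F000000001>Mail</a>"
    "<a class=E href=/cgi-bin/compose?curmbox=F000000001>Compose</a>"
)

finder = MailLinkFinder()
finder.feed(SAMPLE)
finder.close()
print(finder.matches)
```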

The message lists are presented as a table where the rows are identified by the email address of the sender. The table has ID=MsgTable. Note that the bulk of the HTML is in fact not compliant with the standards as the attributes are not enclosed in quotation marks; just bear this in mind if you decide to parse things by hand.

And now we hit the principal problem: the links to the messages are JavaScript rather than simple URLs, so how do we follow them?

In fact all of the important links on this page are JavaScript: messages, next page, sort order, etc.

I haven't managed to find any web pages that say so, but the click method of the link object causes the browser to navigate to that link, regardless of whether the link is a plain URL or JavaScript, so this solves the problem of how to follow the link. The next problem is how to identify the link. So far I just look for links that begin with a certain string. One caveat here is that you cannot easily save a link object for later, because it relies on the current page to execute the JavaScript (at least it behaves that way).

This seems to work, but the sequence of events doesn't seem to be quite as expected; in particular, when the Browser_NavigateComplete2 event occurs the expected document is not always present. The correct event to use should be DocumentComplete. There is one very important caveat: according to MSDN the DocumentComplete event does not fire if the Visible property of the WebBrowser control is false. Microsoft Knowledge Base Article 180366 also makes a very important point and gives a code snippet that detects when a multi-frame document has been completely downloaded. The trick is to check that the WebBrowser.Object property is the same as the pDisp argument of the event.

Luckily the argument to the JavaScript link is simply a relative URL, relative to the domain that is, so we can use the domain property of the document and a bit of string slicing to create an absolute URL. Now we can make a list of messages to be downloaded.
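
The string slicing might look roughly like this. It is a sketch in Python rather than VB6, and it assumes a link of the form javascript:SomeFunction('/relative/path'); the function name G, the message identifier and the http scheme are all made up for the example, since the real link text is not recorded here.

```python
import re
from urllib.parse import urljoin

# Assumed link shape: javascript:SomeFunction('/relative/path?query')
JS_ARG = re.compile(r"""^javascript:\s*\w+\(\s*['"]([^'"]+)['"]""", re.IGNORECASE)


def absolute_url_from_js_link(href, domain):
    """Return the absolute URL for a JavaScript link, or None if it does not match."""
    match = JS_ARG.match(href)
    if match is None:
        return None
    # The argument is relative to the domain, so join it to the site root.
    return urljoin("http://" + domain + "/", match.group(1))


print(absolute_url_from_js_link("javascript:G('/cgi-bin/getmsg?msg=MSG1')", "example.invalid"))
```

Links that do not fit the assumed shape return None, so they can simply be skipped when building the list of messages.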

Another problem is that Hotmail is advertising-supported, so it tries to navigate to ads, and if you have ad suppression on your computer this may fail and cause NavigateError events. If this occurs in a sub-frame then it appears that the top-level DocumentComplete event never fires. I suppose that this is reasonable in a way, because the document isn't complete. Nonetheless we need to be able to decide when it has gone as far as it can. One possibility is to use the Inet control to download the page, but then we would have to parse it ourselves to find links to attachments and so on.

What we can do instead is to look at the browser busy flag in the NavigateError event. If it is false then we can treat this event as the same as DocumentComplete.
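
Putting the two completion rules together, the logic is roughly as follows. This is a Python model of the idea, not the VB6 event handlers: FakeBrowser is a stand-in with only the two members needed, and all class and method names are invented for the illustration.

```python
class FakeBrowser:
    """Stand-in for the WebBrowser control; only the two members used here."""

    def __init__(self):
        self.object = object()   # plays the role of WebBrowser.Object
        self.busy = False        # plays the role of the Busy property


class PageLoadWatcher:
    """Decide when a page has loaded as far as it is going to."""

    def __init__(self, browser):
        self.browser = browser
        self.finished = False

    def on_document_complete(self, p_disp):
        # Each frame fires its own DocumentComplete; only the top-level one counts.
        if p_disp is self.browser.object:
            self.finished = True

    def on_navigate_error(self):
        # A blocked advert in a sub-frame can suppress the top-level event,
        # so an error while the browser is idle is treated as completion.
        if not self.browser.busy:
            self.finished = True


browser = FakeBrowser()
watcher = PageLoadWatcher(browser)

watcher.on_document_complete(object())   # a sub-frame finished: ignored
print(watcher.finished)

browser.busy = True
watcher.on_navigate_error()              # error while still loading: ignored
print(watcher.finished)

browser.busy = False
watcher.on_navigate_error()              # error and browser idle: treat as done
print(watcher.finished)
```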

So far so good, but what about attachments? A message that has an attachment has a normal link to a page for the attachment, so that can simply be added to a list of pending attachments. However, the page that the server returns uses the click event on a table cell (td) to start the download. This means that we must search for the relevant table cell and parse the click event attribute. Luckily, it is a very simple script statement that just sets the window href to an absolute URL. Once we have the URL we can add it to a list of pending downloads. For each message we can create a folder that will hold the attachments.
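
As an illustration of parsing the click handler, here is a sketch in Python (again, not the VB6 code). It assumes the handler looks like window.location.href='...'; the exact statement is not recorded in these notes, and the sample URL and the class name AttachmentCellFinder are made up.

```python
import re
from html.parser import HTMLParser

# Assumed handler shape: window.location.href = 'http://...'
HREF_ASSIGNMENT = re.compile(r"""href\s*=\s*['"]([^'"]+)['"]""")


class AttachmentCellFinder(HTMLParser):
    """Collect the URLs assigned in the onclick handlers of table cells."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        # html.parser lowercases attribute names, so onClick is found too.
        onclick = dict(attrs).get("onclick") or ""
        match = HREF_ASSIGNMENT.search(onclick)
        if match:
            self.urls.append(match.group(1))


# Made-up sample page for the example.
SAMPLE = """<table><tr>
<td onClick="window.location.href='http://example.invalid/getattach?file=report.doc'">Download</td>
<td>No handler here</td>
</tr></table>"""

cells = AttachmentCellFinder()
cells.feed(SAMPLE)
cells.close()
print(cells.urls)
```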

2004-07-26

Time to scrap the existing spider and rebuild it in a more logical fashion. In particular it is necessary to recognize that the whole thing is event driven. This means that we are dealing with a state machine so let's make that explicit.

Create a class that receives WebBrowser events. The action taken will depend on the event and on the current state.
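
A minimal sketch of such a class, in Python rather than VB6 and with invented state names, might look like this. Only one event is modelled, and the real set of states and transitions would be whatever the rebuilt spider turns out to need.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical states; the real spider may need more or different ones.
    LOGGING_IN = auto()
    LISTING_MESSAGES = auto()
    DOWNLOADING_MESSAGE = auto()
    DONE = auto()


class SpiderStateMachine:
    """Receives browser events; the action depends on the event and the state."""

    def __init__(self):
        self.state = State.LOGGING_IN
        self.pending = []      # message URLs still to fetch
        self.downloaded = []   # message URLs already fetched

    def on_document_complete(self, links_on_page=()):
        if self.state is State.LOGGING_IN:
            # The page after login has loaded; move on to the message list.
            self.state = State.LISTING_MESSAGES
        elif self.state is State.LISTING_MESSAGES:
            self.pending.extend(links_on_page)
            self.state = State.DOWNLOADING_MESSAGE if self.pending else State.DONE
        elif self.state is State.DOWNLOADING_MESSAGE:
            # The message at the head of the queue has finished loading.
            self.downloaded.append(self.pending.pop(0))
            if not self.pending:
                self.state = State.DONE
        # In State.DONE any further events are ignored.


spider = SpiderStateMachine()
spider.on_document_complete()                    # login page finished
spider.on_document_complete(["msg1", "msg2"])    # message list finished
spider.on_document_complete()                    # msg1 finished
spider.on_document_complete()                    # msg2 finished
print(spider.state, spider.downloaded)
```

Keeping the state in one explicit variable means each event handler only has to answer one question: given where we are, what does this event mean?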

Posted via web from kwhitefoot's posterous
