I recently had to
change
some of the code in my scraping software for Blackboard because newer versions of
Mozilla Firefox[^1] were not interacting well somewhere in between
MozRepl and
WWW::Mechanize::Firefox. My
decision to use WWW::Mechanize::Firefox
was primarily prompted by ease of
development. By being able to look at the DOM and match elements using all of
Firefox's great development tools such as Firebug, I
was able to quickly write XPath queries to get
exactly the information I needed. Plus, Firefox would handle all the
difficulties of recursive frames and JavaScript. The drawbacks were that it
was slow, and it was hard to know whether a run had actually completed
successfully, which made me hesitant about putting it in a cronjob. It worked,
but there was something kludgey about the whole thing.
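For context, the old setup looked roughly like the sketch below. It's a minimal reconstruction rather than my actual script: the URL and XPath query are stand-ins, and it assumes a Firefox instance with MozRepl already running.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

# Drives a running Firefox instance through MozRepl.
my $mech = WWW::Mechanize::Firefox->new();

# Placeholder URL; the real script logged into Blackboard first.
$mech->get('https://blackboard.example.edu/courses');

# The XPath queries were worked out interactively with Firebug.
my @links = $mech->xpath('//a[contains(@href, "course_id")]');

for my $link (@links) {
    # Nodes are proxies for the live DOM, so DOM properties are available.
    printf "%s => %s\n", $link->{textContent}, $link->{href};
}
```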
That solution worked last semester, but when I tried it at the beginning of
this semester, things started breaking down. At first, I tried working around
it, but it was too broken. I needed to use JavaScript, so the only solution I
could find was WWW::Scripter. It has a plugin
system that lets you use two different JavaScript engines: a pure-Perl engine
called JE and Mozilla's
SpiderMonkey. I had tried using
WWW::Scripter
before, but had encountered difficulties with compiling the
SpiderMonkey bridge. This time I gave the JE
engine a try and I was surprised
that it worked flawlessly on the site I was scraping.
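The replacement ended up looking roughly like this. Again, a minimal sketch under some assumptions: the URL is a placeholder, and I'm going from memory on the engine option for picking JE over SpiderMonkey.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Scripter;

my $w = WWW::Scripter->new;

# Load the JavaScript plugin; the engine option (going from memory here)
# selects the pure-Perl JE back end rather than SpiderMonkey.
$w->use_plugin( JavaScript => engine => 'JE' );

# Placeholder URL; scripts on the page now run in JE as it loads,
# and the familiar WWW::Mechanize-style interface still applies.
$w->get('https://blackboard.example.edu/');
print $w->title, "\n";

# Snippets can also be evaluated directly against the page.
$w->eval('document.title');
```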
After fixing up my code, I can see a few places where WWW::Scripter
could become a better tool:
- Add a plugin that makes managing and using frames easier.
- Create a tool that makes it possible to view and interact with the rendered page as you go along. This would make it much easier to debug and try things out in a REPL.
- Integrate with WWW::Mechanize::TreeBuilder so
that I can use HTML::TreeBuilder::XPath
immediately. As far as I can tell, all that needs to be added to
WWW::Scripter
is the decoded_content
method (a rough sketch of what I mean follows this list).
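For that last item, this is roughly the shim I have in mind. It's untested and based only on my reading of the docs, so treat the decoded_content delegation and the role-application call as guesses rather than working code:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Scripter;
use WWW::Mechanize::TreeBuilder;

# Guesswork shim: give WWW::Scripter a decoded_content method by
# delegating to the underlying HTTP::Response object, so the
# TreeBuilder role has something to parse.
{
    package WWW::Scripter;
    no warnings 'redefine';
    sub decoded_content {
        my $self = shift;
        return $self->response->decoded_content(@_);
    }
}

my $w = WWW::Scripter->new;

# Apply the role, asking for XPath-capable trees.
WWW::Mechanize::TreeBuilder->meta->apply(
    $w,
    tree_class => 'HTML::TreeBuilder::XPath',
);

$w->get('https://blackboard.example.edu/');
print $_->as_text, "\n" for $w->tree->findnodes('//h2');
```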
[^1]: Which, if you have not heard, is undergoing a rapid development schedule.