I recently had to
change
some of the code in my scraping software for Blackboard because newer versions of
Mozilla Firefox[^1] were not interacting well somewhere in between
MozRepl and
WWW::Mechanize::Firefox. My
decision to use WWW::Mechanize::Firefox
was primarily prompted by ease of
development. By being able to look at the DOM and match elements using all of
Firefox's great development tools such as Firebug, I
was able to quickly write XPath queries to get
exactly the information I needed. Plus, Firefox would handle all the
difficulties of recursive frames and JavaScript. The drawbacks were that it
was slow, and it was hard to know whether a run had actually completed
successfully, which made me hesitant about putting it in a cronjob. It worked,
but there was something kludgey about the whole thing.
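For context, the old setup looked roughly like the sketch below. It's a minimal reconstruction rather than my actual script: the URL and XPath query are stand-ins, and it assumes a Firefox instance with MozRepl already running.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

# Drives a running Firefox instance through MozRepl.
my $mech = WWW::Mechanize::Firefox->new();

# Placeholder URL; the real script logged into Blackboard first.
$mech->get('https://blackboard.example.edu/courses');

# The XPath queries were worked out interactively with Firebug.
my @links = $mech->xpath('//a[contains(@href, "course_id")]');

for my $link (@links) {
    # Nodes are proxies for the live DOM, so DOM properties are available.
    printf "%s => %s\n", $link->{textContent}, $link->{href};
}
```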
That solution worked last semester, but when I tried it at the beginning of
this semester, things started breaking down. At first, I tried working around
it, but it was too broken. I needed to use JavaScript, so the only solution I
could find was WWW::Scripter. It has a plugin
system that lets you use two different JavaScript engines: a pure-Perl engine
called JE and Mozilla's
SpiderMonkey. I had tried using
WWW::Scripter
before, but had encountered difficulties with compiling the
SpiderMonkey bridge. This time I gave the JE
engine a try and I was surprised
that it worked flawlessly on the site I was scraping.
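The replacement ended up looking roughly like this. Again, a minimal sketch under some assumptions: the URL is a placeholder, and I'm going from memory on the engine option for picking JE over SpiderMonkey.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Scripter;

my $w = WWW::Scripter->new;

# Load the JavaScript plugin; the engine option (going from memory here)
# selects the pure-Perl JE back end rather than SpiderMonkey.
$w->use_plugin( JavaScript => engine => 'JE' );

# Placeholder URL; scripts on the page now run in JE as it loads,
# and the familiar WWW::Mechanize-style interface still applies.
$w->get('https://blackboard.example.edu/');
print $w->title, "\n";

# Snippets can also be evaluated directly against the page.
$w->eval('document.title');
```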
After fixing up my code, I can see a few places where WWW::Scripter
could become a better tool:
- Add a plugin that makes managing and using frames easier.
- Create a tool that makes it possible to view and interact with the rendered page as you go along. This would make it much easier to debug and try things out in a REPL.
- Integrate with WWW::Mechanize::TreeBuilder so
that I can use HTML::TreeBuilder::XPath
immediately. As far as I can tell, all that needs to be added to
WWW::Scripter
is the decoded_content
method (a rough sketch of what I mean follows this list).
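For that last item, this is roughly the shim I have in mind. It's untested and based only on my reading of the docs, so treat the decoded_content delegation and the role-application call as guesses rather than working code:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Scripter;
use WWW::Mechanize::TreeBuilder;

# Guesswork shim: give WWW::Scripter a decoded_content method by
# delegating to the underlying HTTP::Response object, so the
# TreeBuilder role has something to parse.
{
    package WWW::Scripter;
    no warnings 'redefine';
    sub decoded_content {
        my $self = shift;
        return $self->response->decoded_content(@_);
    }
}

my $w = WWW::Scripter->new;

# Apply the role, asking for XPath-capable trees.
WWW::Mechanize::TreeBuilder->meta->apply(
    $w,
    tree_class => 'HTML::TreeBuilder::XPath',
);

$w->get('https://blackboard.example.edu/');
print $_->as_text, "\n" for $w->tree->findnodes('//h2');
```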
[^1]: Which, if you have not heard, is undergoing a rapid development schedule.