<p>.plan — ENETDOWN (<a href="http://enetdown.org//dot-plan/">http://enetdown.org//dot-plan/</a>)</p>
<h1><a href="http://enetdown.org//dot-plan/posts/2016/02/24/thinking-about-metacognition/">Thinking about metacognition for reading on screens</a></h1>
<p class="pagedate">Posted <span class="date">2016-02-24</span> by zaki</p>
<p>If you've been around me for a while, you might have heard me go on and on
about my long-term project: creating the ultimate <a href="http://enetdown.org//tag/document_reader/">document reader</a>.
As I try to design it, I am thinking of ways that the cognitive process
of reading can be improved through computers. To guide my research, I have read a
couple of books about the neuroscience of reading.</p>
<p>One book in particular was quite invaluable: Maryanne Wolf's <em>Proust and the Squid</em>. I
recently came across a few articles<a href="http://enetdown.org//dot-plan/#fn:maryanne-wolf-articles" id="fnref:maryanne-wolf-articles" class="footnote">1</a> in which she
discusses how she seems to have lost the ability to read and enjoy long-form
prose. She claims that this is because reading on screens has
trained our brains to skim rather than use "deep reading" skills.
Maryanne Wolf's next books will be on this specific
topic and I look forward to reading them<a href="http://enetdown.org//dot-plan/#fn:maryanne-wolf-future-books" id="fnref:maryanne-wolf-future-books" class="footnote">2</a>.</p>
<h2 id="differencesbetweenreadingonpaperandonscreens">Differences between reading on paper and on screens</h2>
<p>Researchers in human-computer interaction, education, and psychology have long
looked at the differences between reading on paper versus on screens<span class="markdowncitation"> (<a href="http://enetdown.org//dot-plan/#Dillon:1992">1</a>)</span>.
Getting to the root causes behind why there are differences between the two
in terms of reading outcomes (e.g., speed, fatigue, comprehension) has been
difficult because there really hasn't been a standardised methodology behind
all the studies. Early studies found that some of the differences could be
attributed to image quality and this seems to hold for single pages, but for
longer documents, the ability to navigate and locate information is a more
important factor.</p>
<p>Perhaps the difference can accounted for by our expectations of screen reading.
When reading on a screen, we tend to read shorter texts such as e-mail or the
news while doing other tasks such as checking notifications.
This task switching may occur many times a day as we navigate our computers.
Some have suggested that the gap between these different modes of reading may be closed by <a href="http://nautil.us/issue/32/space/the-deep-space-of-digital-reading">adapting to screen
reading</a>
in the same way that people had to adapt from reading on scrolls to reading from a codex.</p>
<p>A study of high-schoolers showed that those that were successful with tasks
that required finding information online via a hypertext interface either had
prior skills in linear reading or prior skills in basic computer navigation<span class="markdowncitation"> (<a href="http://enetdown.org//dot-plan/#Hahnel:2016">2</a>)</span>.
This implies that a combination of those factors may be necessary to make the
most of screen reading.</p>
<p>Another study compared students' performance with screen reading and paper
reading when given two scenarios<span class="markdowncitation"> (<a href="http://enetdown.org//dot-plan/#Ackerman:2011">3</a>)</span>. In the first scenario, the
students had a fixed time to read the material when given an assessment. In the
second scenario, they were allowed to use as much time as they needed. What the
authors found is that for the fixed time scenario, the screen readers performed
on par with paper readers, but for the flexible time scenario, the paper
readers outperformed the screen readers. The authors suggest that this
difference can be accounted for by differences in how each group of readers
self-regulated their reading time and that this difference in self-regulation
comes from how readers perceive reading in each medium.</p>
<p>From these studies, I believe that two areas to improve when designing a
document reader are <em>document navigation</em> and <em>self-regulation of reading</em>. There are many approaches
to navigation, but I will not be covering that in this text. What I want to
know is, how can we train ourselves to have better self-regulation?</p>
<p>I had a thought along these lines while I was studying a
textbook. A major part of reading is going back to review what we just
read. As we read, we must use our limited working memory to both visually
process the words on the page and simultaneously pull out related concepts from
long-term memory so that we can associate new ideas with our prior knowledge. In
fluent reading, all this occurs in a short amount of time between reading
chunks of text, so the number of associations that we can make depends on how
quickly we can process all this information. To ease students into
going back over what they just read, many textbooks use small prompts at the
end of sections. These prompts are meant to help provide context for where the
information fits into a larger structure and confirm that the reader understood
what was written.</p>
<p>In an age where we have direct access to more information than ever before, many
have turned to speed-reading techniques as a way to read more material — however,
such approaches are not supported by eye-tracking research<span class="markdowncitation"> (<a href="http://enetdown.org//dot-plan/#Schotter:2014">4</a>)</span>.
Others have tried to replace reading entire books with <a href="http://www.theatlantic.com/technology/archive/2015/11/please-be-brief/417894/">reading summaries</a>
and while these give a decent summary of the arguments in a book, they are but
tertiary sources — referencing a summary is not the same as referencing the
book itself. Furthermore, if you only read the summary, you might not give yourself
enough material to build a coherent model that you can recall later.
Both forms of reading have their place, but if we want to improve our
reading comprehension, it may be worth slowing down. </p>
<h2 id="metacognition">Metacognition</h2>
<p>The activities found in textbooks are meant to aid what is known as metacognition — thinking
about thinking. We apply metacognition whenever we take notes, break down a
problem into subproblems, or set goals for studying. Metacognition is
the process of recognising that we are thinking a certain way and applying a strategy to
regulate and improve how we think.</p>
<p>The following are all valid metacognitive strategies:</p>
<ul>
<li>using a consistent system for note-taking,</li>
<li>creating a sequence of tasks for working on a project,</li>
<li>planning to read 3 sections in the next 30 minutes,</li>
<li>doing all the exercises at the end of the chapter,</li>
<li>devising mnemonics to remember a certain sequence of events,</li>
<li>creating flash cards based on each new definition, and</li>
<li>cramming 1 hour before a test.</li>
</ul>
<p>These strategies might not be equally effective in all circumstances and for every
person, but what they have in common is that they all regulate how we think. What I would like to investigate is
whether these metacognitive strategies can be incorporated into the tools that we use
to read, so that it is easier to apply them effectively and
consistently.</p>
<p>The purpose of this text is to explore some thoughts on how to implement such a
system.</p>
<h2 id="annotations">Annotations</h2>
<p>When I started thinking about metacognition, my first thought led me to textual
annotations. Writing has long been suggested as a way to help improve memory.
It may even be more important when dealing with electronic media because
notetaking can be used to enhance the understanding of what we're reading.
A 2013 study suggests that the difference between paper
and screen reading may be partially due to the distractions of multitasking;
however, the effect of multitasking can be mitigated by using notetaking on paper
as a way to retain focus<span class="markdowncitation"> (<a href="http://enetdown.org//dot-plan/#Subrahmanyam:2013">5</a>)</span>.</p>
<p>Taking notes on paper is not a huge burden, but having those notes in digital
form allows for more portability and, since not everyone has filing cabinets,
makes it easier to retrieve years later. This ease of use comes with a tradeoff: digital notetaking may not be
as effective for memory recall as taking notes by hand.
In Mueller et al. (2014)<span class="markdowncitation"> (<a href="http://enetdown.org//dot-plan/#Mueller:2014">6</a>)</span>, researchers looked at how students
wrote notes and found that students that wrote notes by hand were
able to <a href="http://www.crlt.umich.edu/node/80537">perform better</a> on assessments
than students that wrote notes on laptops. This may be because there is more
cognitive processing being done when choosing the salient points to
write down from a lecture than when typing out the lecture verbatim.</p>
<p>How can this additional cognitive processing be elicited when keeping
notes in digital form? One approach might be to use the <a href="https://en.wikipedia.org/wiki/Cornell_Notes">Cornell
method</a> for notetaking. One way of
using this method is to return to the notes after a day in
order to summarise what was written. This additional summary promotes
the synthesis and revision of the notes. When used correctly, the rewording
required for summarisation provides the extra cognitive processing needed for
learning.</p>
<p>Below is a simple mockup of what such a system might look like on a computer.
Instead of showing a single page as in most document readers, this interface
shows a page of the original document as well as another page where notes can
be kept. The cue column is on the left of this page, the note-taking column is
on the right, and the summary area is at the bottom of the page.
When the notes for a given unit of the text (e.g., a section or
paragraph) are complete, the time can be recorded so that when returning the next day,
the user can be given a reminder to write a summary.</p>
<table class="img"><caption>Mockup of Cornell notes UI</caption><tr><td><a href="http://enetdown.org//dot-plan/posts/2016/02/24/gfx/cornell-method.svg"><img src="http://enetdown.org//dot-plan/posts/2016/02/24/gfx/cornell-method.svg" width="600" class="img" /></a></td></tr></table>
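<p>As a sketch of the reminder logic behind this mockup, the snippet below records when a section's notes were completed and flags when a summary is due; the class and field names are hypothetical, not part of any existing tool:</p>

```python
from datetime import datetime, timedelta

# Hypothetical sketch: track when the notes for each section were
# completed, and surface a reminder to write a summary a day later.
SUMMARY_DELAY = timedelta(days=1)

class CornellNote:
    def __init__(self, cue, notes):
        self.cue = cue            # cue column (left)
        self.notes = notes        # note-taking column (right)
        self.summary = None       # summary area (bottom)
        self.completed_at = datetime.now()

    def summary_due(self, now=None):
        """True once a day has passed and no summary has been written yet."""
        now = now or datetime.now()
        return self.summary is None and now - self.completed_at >= SUMMARY_DELAY

note = CornellNote("What is metacognition?", "Thinking about thinking.")
tomorrow = note.completed_at + timedelta(days=1, minutes=1)
print(note.summary_due(tomorrow))  # the reminder fires the next day
```

<p>A real reader would persist these timestamps, but the rule itself is this simple.</p>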
<h2 id="flashcards">Flashcards</h2>
<p>Another popular metacognitive strategy is to use flashcards. They make a good
supplement to the Cornell method as the cues can be turned into the front of a
flashcard and the notes can become the back. We can also create flashcards on
their own and link them back to the text if further clarification of the notes
is needed.</p>
<table class="img"><caption>Mockup of flashcards UI</caption><tr><td><a href="http://enetdown.org//dot-plan/posts/2016/02/24/gfx/flashcard-page.svg"><img src="http://enetdown.org//dot-plan/posts/2016/02/24/gfx/flashcard-page.svg" width="600" class="img" /></a></td></tr></table>
<p>Just making the flashcards is not enough. They need to be reviewed. One way of
reviewing called <a href="https://en.wikipedia.org/wiki/Spaced_repetition">spaced repetition</a>
is based on creating an adaptive schedule so that difficult cards are seen more
often than easy cards until the difficult cards have been memorised.
There exist many techniques and tools for creating effective flashcards
and scheduling the revision of cards, so those can easily be integrated into
the document reader either by outputting data in the appropriate format or
reimplementing the algorithms<a href="http://enetdown.org//dot-plan/#fn:flashcard-tech" id="fnref:flashcard-tech" class="footnote">3</a>.</p>
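<p>As one illustration of such a schedule, here is a minimal Leitner-style scheduler; real tools such as Anki use more elaborate interval formulas, and all names here are hypothetical:</p>

```python
# A minimal Leitner-box scheduler, sketched as one way spaced repetition
# could be built into the reader. Correct cards move to less frequent
# boxes; missed cards fall back to box 0.
class Card:
    def __init__(self, front, back):
        self.front = front   # e.g. the Cornell cue
        self.back = back     # e.g. the matching notes
        self.box = 0         # box 0 is reviewed most often

def review(card, correct):
    """Promote a correctly answered card; demote a missed one to box 0."""
    card.box = card.box + 1 if correct else 0

def due(cards, session):
    """A card in box n comes up every 2**n sessions."""
    return [c for c in cards if session % (2 ** c.box) == 0]

cards = [Card("metacognition", "thinking about thinking")]
review(cards[0], correct=True)     # promoted to box 1
print(len(due(cards, session=1)))  # box-1 cards skip odd sessions
```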
<h2 id="triagingandcomparison">Triaging and comparison</h2>
<p>When researching a topic, using a single source of information may not be
enough. I often go to the library and grab almost every book on a topic so that
I can learn from many perspectives. As I read, I try to figure out the specific
strengths of each document: some may cover more theory than others or one may
have particularly informative diagrams. Being able to sort through each of
these documents and decide which ones are relevant is possible on a computer
through the use of folders and tagging, but this is sometimes very
unsatisfying. For example, if I am reading papers for a literature review, I
usually make a spreadsheet to organise details about what each paper is about. Before I
actually commit to making a spreadsheet, I try to stack the papers that I have
printed out into different piles based on quickly skimming the contents. This
is known as document triaging.</p>
<p>One example of how I would use this is to separate a set of papers into the
categories:</p>
<ul>
<li>Not relevant: papers that do not cover what I am trying to figure out in the
current project (ignore these);</li>
<li>Survey papers: papers that review the results of many papers at once (read
through these and find any references I might have missed);</li>
<li>Classic research: older research that may or may not be worth looking at
(skim over these — might only be of historical interest);</li>
<li>Recent research: more current papers (read these more closely).</li>
</ul>
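<p>The triage pass above can be sketched as a simple binning step; the paper titles and verdicts below are made-up placeholders:</p>

```python
# Sketch of document triage: bin each paper by a quick skim verdict.
# Category names mirror the list above.
CATEGORIES = ("not relevant", "survey", "classic", "recent")

def triage(papers, verdicts):
    piles = {c: [] for c in CATEGORIES}
    for paper in papers:
        piles[verdicts[paper]].append(paper)
    return piles

piles = triage(
    ["Dillon 1992", "Hahnel 2016"],
    {"Dillon 1992": "classic", "Hahnel 2016": "recent"},
)
print(piles["classic"])  # ['Dillon 1992']
```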
<p>One approach to a GUI for this might be to emulate the layout of a desktop area
where different regions indicate different categories. This may be easier to visualise and work with than using
different lists or drop-down menus as it uses larger GUI elements (cf. <a href="https://en.wikipedia.org/wiki/Fitts's_law">Fitts's law</a>).
An example of what it might look like is shown below.</p>
<table class="img"><caption>Mockup of document triaging UI</caption><tr><td><a href="http://enetdown.org//dot-plan/posts/2016/02/24/gfx/document-triage.svg"><img src="http://enetdown.org//dot-plan/posts/2016/02/24/gfx/document-triage.svg" width="600" class="img" /></a></td></tr></table>
<p>The workflow proceeds as follows:</p>
<ol>
<li>The user chooses a document from the Queue on the right.</li>
<li>The chosen document appears in the center Reading area where the user can
quickly look over the document.</li>
<li>Once a category for the document is determined, the user drags it to the
appropriate category region on the left.</li>
</ol>
<p>Perhaps this task can benefit from <a href="https://en.wikipedia.org/wiki/Active_learning_(machine_learning)">active learning</a> which might be able to
suggest categories for papers based on metadata such as page length or year of publication.</p>
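<p>As a rough stand-in for the suggestion half of such a system, a nearest-neighbour guess over the metadata the user has already labelled might look like this; it is a simplification, not a full active-learning loop, and all names are hypothetical:</p>

```python
# Suggest a category for a new paper from its nearest already-labelled
# neighbour, using the metadata mentioned above (page length, year).
def suggest(labelled, pages, year):
    """labelled: list of ((pages, year), category) pairs."""
    def dist(feat):
        p, y = feat
        return abs(p - pages) + abs(y - year)
    return min(labelled, key=lambda item: dist(item[0]))[1]

labelled = [((30, 1992), "classic"), ((12, 2016), "recent")]
print(suggest(labelled, pages=10, year=2015))  # closest to the recent paper
```

<p>A genuine active learner would additionally pick which unlabelled papers to ask the user about, but even this guess could pre-sort the queue.</p>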
<p>Another related task is being able to compare multiple documents side by side.
Below is a picture of a bookstand modelled after a design that Thomas Jefferson
created for his office<a href="http://enetdown.org//dot-plan/#fn:jefferson-bookstand" id="fnref:jefferson-bookstand" class="footnote">4</a>. It allows for keeping multiple
books open at the same time and rotates so that switching between books is
easy.</p>
<table class="img"><caption>Picture of Jeffersonian revolving bookstand (taken from this <a href="http://lumberjocks.com/projects/109236">project showcase</a>)</caption><tr><td><a href="http://enetdown.org//dot-plan/posts/2016/02/24/gfx/jeffersonian-bookstand.jpg"><img src="http://enetdown.org//dot-plan/posts/2016/02/24/gfx/jeffersonian-bookstand.jpg" width="712" height="700" class="img" /></a></td></tr></table>
<p>In a way, this is like browser tabs. However, there are times when you may want
to see two documents side-by-side rather than having to flip between them.
For example, if I was reading two books on history written by different
authors, I might want to see how both authors address the same topic. Whether
or not this should be implemented as a <a href="https://en.wikipedia.org/wiki/Multiple_document_interface">multiple document interface</a>
(with tabbing and docking) is still not clear to me.</p>
<h2 id="readingstrategiesandserendipity">Reading strategies and serendipity</h2>
<p>If you've ever read a scientific research paper, the first thing you'll notice
is that they are highly structured: certain sections appear in almost every
paper (e.g., Abstract, Introduction, Methods, Results, Conclusion).
These sections are meant to guide readers so that they don't have to read through the whole paper.
There are certain strategies to reading a research
paper<a href="http://enetdown.org//dot-plan/#fn:reading-research-strategies" id="fnref:reading-research-strategies" class="footnote">5</a> which emphasise that you have
to re-read the paper several times based on your goals (e.g., performing a
literature review versus trying to reproduce the results). On each pass, you
will try to answer different questions so you will want to spend more time
reading specific sections. Perhaps these reading strategies can be turned into
a checklist so that each paper has a progress bar that tells you how far along
in understanding the paper you are. That way you can skim over several papers
in one sitting and then slowly try to understand one paper at a time.</p>
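<p>A minimal sketch of such a progress checklist, assuming hypothetical per-pass questions:</p>

```python
# Each reading pass ticks off a question; the progress bar is just
# the fraction answered so far. Questions are illustrative only.
PASSES = [
    "What problem does the paper address?",   # first skim
    "What are the main results?",             # second pass
    "Do the methods support the results?",    # close reading
]

def progress(answered):
    done = sum(1 for q in PASSES if q in answered)
    return done / len(PASSES)

print(progress({"What problem does the paper address?"}))  # one pass of three
```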
<p>Sometimes the path you take with your reading does not follow a straightforward
checklist. There are times when you are searching for something and you come
across an unexpected connection which can lead to more creative thoughts.
This can occur either when discovering new material or when coming
back to older material. Understanding how to create serendipitous encounters
might be a little tougher than the previous metacognitive strategies. As noted
in André et al. (2009) <span class="markdowncitation"> (<a href="http://enetdown.org//dot-plan/#Andre:2009">7</a>)</span>, serendipity is difficult to facilitate
and study in a laboratory setting. One approach that I would like to try is to
bring up older material that the user might have read months or years ago.
Perhaps the older material might be seen in a different light now that the user
is re-encountering it.</p>
<p>In his post <a href="http://www.stevenberlinjohnson.com/movabletype/archives/000230.html">Tool
for Thought</a>, author Steven Berlin Johnson
describes an interesting workflow where he captures quotes from books and uses
a tool that can help find other related quotes in his library. This allows him
to start with a single idea and then find other ideas that might be related to
that original seed. One of the authors of <span class="markdowncitation"> (<a href="http://enetdown.org//dot-plan/#Andre:2009">7</a>)</span>, <a href="http://research.microsoft.com/en-us/um/people/sdumais/">Susan
Dumais</a>, worked on a
technique to do just this: <a href="https://en.wikipedia.org/wiki/Latent_semantic_indexing">latent semantic indexing</a>.
However, at this point, I am not certain how large a personal library is
needed for the gains from this technique to become apparent.</p>
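<p>As a much-simplified stand-in for latent semantic indexing, plain cosine similarity over bag-of-words vectors already captures the "find related quotes" idea; real LSI additionally factors the term-document matrix with a singular value decomposition to capture indirect associations. The quotes below are invented examples:</p>

```python
import math

# Retrieve the library quote most similar to a seed idea, using
# cosine similarity over simple word-count vectors.
def bag(text):
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

library = [
    "reading on screens trains the brain to skim",
    "flashcards and spaced repetition aid memory",
]
seed = "deep reading versus skimming on screens"
best = max(library, key=lambda q: cosine(bag(seed), bag(q)))
print(best)  # the quote sharing the most vocabulary with the seed
```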
<h2 id="conclusion">Conclusion</h2>
<p>As far as I can tell, none of these techniques exist within a single existing
application. Perhaps that is because using such a complex application would
become daunting — there would simply be too many features. Furthermore, since
each of these metacognitive strategies can be applied in many different ways,
another challenge will be creating tutorials that show how to use them
effectively.</p>
<p>I'd really appreciate any feedback on these ideas. They will certainly need
tweaking before they become usable.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:maryanne-wolf-articles"><p>The articles are <a href="http://www.newyorker.com/science/maria-konnikova/being-a-better-online-reader">Being a Better Online Reader</a>
by Maria Konnikova and an interview with Maryanne Wolf hosted by Robin Young <a href="http://hereandnow.wbur.org/2014/04/09/online-reading-comprehension">Is Online Skimming Hurting Reading Comprehension?</a>.<a href="http://enetdown.org//dot-plan/#fnref:maryanne-wolf-articles" class="reversefootnote"> ↩</a></p></li>
<li id="fn:maryanne-wolf-future-books"><p>From <a href="https://ase.tufts.edu/crlr/team/wolf.htm">her page</a> at Tufts:</p>
<ul>
<li>Wolf, M. & Gottwald, S. (To appear, 2016) <em>What It Means to Read: A
Literacy Agenda for the Digital Age</em>. Oxford University Press. In Series,
Literary Agenda, Editor: Phillip Davis.</li>
<li>Wolf, M. (To appear, 2016). <em>Letters to the Good Reader: The Contemplative
Dimension in the Future Reading Brain</em>. New York: Harper Collins.</li>
<a href="http://enetdown.org//dot-plan/#fnref:maryanne-wolf-future-books" class="reversefootnote"> ↩</a></ul></li>
<li id="fn:flashcard-tech"><p>Some of these flashcard techniques and tools can be found in the following:</p>
<ul>
<li>Wikipedia provides a <a href="https://en.wikipedia.org/wiki/List_of_flashcard_software">list of flashcard software</a>.</li>
<li><a href="https://www.supermemo.com/en/articles/20rules">Effective learning: Twenty rules of formulating knowledge</a>
gives tips for using flashcards effectively.</li>
<li><a href="http://www.pnas.org/content/109/6/1868.full">Education of a model student</a>
describes different mathematical models of memory that can be used for varying
the learning rate based the goals of the learner.</li>
<a href="http://enetdown.org//dot-plan/#fnref:flashcard-tech" class="reversefootnote"> ↩</a></ul></li>
<li id="fn:jefferson-bookstand"><p>For details on how to make your own Jeffersonian bookstand, see this
<a href="https://www.youtube.com/watch?v=-SD_jlH7Ez8">video</a> and these
<a href="http://www.davidcolarusso.com/handouts/jefferson_bookstand.pdf">instructions</a>.<a href="http://enetdown.org//dot-plan/#fnref:jefferson-bookstand" class="reversefootnote"> ↩</a></p></li>
<li id="fn:reading-research-strategies"><p>A couple guides are
<a href="https://www.elsevier.com/connect/infographic-how-to-read-a-scientific-paper">here</a>
and <a href="http://www.bmj.com/about-bmj/resources-readers/publications/how-read-paper">here</a>.<a href="http://enetdown.org//dot-plan/#fnref:reading-research-strategies" class="reversefootnote"> ↩</a></p></li>
</ol>
</div>
<div class="bibliography">
<hr />
<p>Bibliography</p>
<div id="Dillon:1992"><p>[1] <span class="item">Dillon, Andrew.
"<a href="https://www.ischool.utexas.edu/~adillon/Journals/Reading.htm">Reading from paper versus screens: A critical review of the empirical literature</a>."
<em>Ergonomics</em> 35, no. 10 (1992): 1297-1326.
<a href="http://dx.doi.org/10.1080/00140139208967394">doi:10.1080/00140139208967394</a>.</span></p></div>
<div id="Hahnel:2016"><p>[2] <span class="item">Hahnel, Carolin, Frank Goldhammer, Johannes Naumann, and Ulf
Kröhne. "<a href="https://www.uni-frankfurt.de/58831570/Paperb14a7b8059d9c055954c92674ce60032ICTb14a7b8059d9c055954c92674ce60032effects.pdf">Effects of linear reading, basic computer skills, evaluating online
information, and navigation on reading digital text</a>." <em>Computers in Human
Behavior</em> 55 (2016): 486-500.
<a href="http://dx.doi.org/10.1016/j.chb.2015.09.042">doi:10.1016/j.chb.2015.09.042</a>.</span></p></div>
<div id="Ackerman:2011"><p>[3] <span class="item">Ackerman, Rakefet, and Morris Goldsmith. "<a href="http://iew3.technion.ac.il/~ackerman/papers/Ackerman%20&%20Goldsmith%202011%20-%20Metacognitive%20Regulation%20of%20Text%20Learning%20On%20Screen%20Versus%20on%20Paper.pdf">Metacognitive
regulation of text learning: on screen versus on paper</a>." Journal of
Experimental Psychology: Applied 17, no. 1 (2011): 18.
<a href="http://dx.doi.org/10.1037/a0022086">doi:10.1037/a0022086</a>.</span></p></div>
<div id="Schotter:2014"><p>[4] <span class="item">Schotter, Elizabeth R., Randy Tran, and Keith Rayner. "<a href="http://pss.sagepub.com/content/25/6/1218">Don’t Believe What You
Read (Only Once) Comprehension Is Supported by Regressions During Reading</a>."
<em>Psychological science</em> (2014): 0956797614531148.
<a href="http://dx.doi.org/10.1177/0956797614531148">doi:10.1177/0956797614531148</a>.</span></p></div>
<div id="Subrahmanyam:2013"><p>[5] <span class="item">Subrahmanyam, Kaveri, Minas Michikyan, Christine Clemmons, Rogelio Carrillo, Yalda T. Uhls, and Patricia M. Greenfield.
"<a href="http://www.cdmc.ucla.edu/KSb14a7b8059d9c055954c92674ce60032Mediab14a7b8059d9c055954c92674ce60032bibliob14a7b8059d9c055954c92674ce60032files/Subrahmanyam%20Michikyan%20et%20al%202014%20%28paper%20vs%20screens%29.pdf">Learning from Paper, Learning from Screens: Impact of Screen Reading and Multitasking Conditions on Reading and Writing among College Students</a>."
<em>International Journal of Cyber Behavior, Psychology and Learning (IJCBPL)</em> 3, no. 4 (2013): 1-27.
<a href="http://dx.doi.org/10.4018/ijcbpl.2013100101">doi:10.4018/ijcbpl.2013100101</a>.</span></p></div>
<div id="Mueller:2014"><p>[6] <span class="item">Mueller, Pam A., and Daniel M. Oppenheimer.
"<a href="http://pss.sagepub.com/content/25/6/1159">The Pen Is Mightier Than the Keyboard Advantages of Longhand Over Laptop Note Taking</a>."
<em>Psychological science</em> (2014): 0956797614524581.
<a href="http://dx.doi.org/10.1177/0956797614524581">doi:10.1177/0956797614524581</a>.</span></p></div>
<div id="Andre:2009"><p>[7] <span class="item">André, Paul, Jaime Teevan, and Susan T. Dumais.
"<a href="http://research.microsoft.com/en-us/um/people/sdumais/creativityandcognition09-fp392-andre.pdf">Discovery is never by chance: designing for (un)serendipity</a>."
In Proceedings of the seventh ACM conference on Creativity and cognition, pp.
305-314. ACM, 2009.
<a href="http://dx.doi.org/10.1145/1640233.1640279">doi:10.1145/1640233.1640279</a>.</span></p></div>
</div>
<h1><a href="http://enetdown.org//dot-plan/posts/2014/12/24/a_fast_and_natural_interface_to_R_from_Perl/">A fast and natural interface to R from Perl</a></h1>
<p class="pagedate">Posted <span class="date">2014-12-24</span> by zaki</p>
<p>Love <a href="http://www.r-project.org/">R</a>? Love <a href="http://www.perl.org/">Perl</a>? Well,
I've got a nice little present for you this <code>echo $(calendar)</code> <a href="http://enetdown.org//dot-plan/#fn:calendar-unix" id="fnref:calendar-unix" class="footnote">1</a>
season! Now you can pass data in and out of R as easily as</p>
<div class="highlight-perl"><pre class="hl"><span class="hl kwa">use</span> v5<span class="hl opt">.</span>16<span class="hl opt">;</span>
<span class="hl kwa">use</span> Statistics<span class="hl opt">::</span>NiceR<span class="hl opt">;</span>
<span class="hl kwa">use</span> Data<span class="hl opt">::</span>Frame<span class="hl opt">::</span>Rlike<span class="hl opt">;</span>
Moo<span class="hl opt">::</span>Role<span class="hl opt">-></span><span class="hl kwd">apply_roles_to_package</span><span class="hl opt">(</span> q<span class="hl opt">|</span>Data<span class="hl opt">::</span>Frame<span class="hl opt">|,</span> <span class="hl str">qw(Data::Frame::Role::Rlike)</span> <span class="hl opt">);</span>
<span class="hl kwc">my</span> <span class="hl kwb">$r</span> <span class="hl opt">=</span> Statistics<span class="hl opt">::</span>NiceR<span class="hl opt">-></span><span class="hl kwd">new</span><span class="hl opt">;</span>
<span class="hl kwc">my</span> <span class="hl kwb">$iris</span> <span class="hl opt">=</span> <span class="hl kwb">$r</span><span class="hl opt">-></span><span class="hl kwd">get</span><span class="hl opt">(</span><span class="hl str">'iris'</span><span class="hl opt">);</span>
<span class="hl kwc">say</span> <span class="hl str">"Subset of Iris data set"</span><span class="hl opt">;</span>
<span class="hl kwc">say</span> <span class="hl kwb">$iris</span><span class="hl opt">-></span><span class="hl kwd">subset</span><span class="hl opt">(</span> <span class="hl kwa">sub</span> <span class="hl opt">{</span> <span class="hl slc"># like a SQL WHERE clause</span>
<span class="hl opt">(</span> <span class="hl kwb">$_</span><span class="hl opt">->(</span><span class="hl str">'Sepal.Length'</span><span class="hl opt">) ></span> <span class="hl num">6.0</span> <span class="hl opt">)</span>
<span class="hl opt">& (</span> <span class="hl kwb">$_</span><span class="hl opt">->(</span><span class="hl str">'Petal.Width'</span><span class="hl opt">) <</span> <span class="hl num">2</span> <span class="hl opt">)</span>
<span class="hl opt">})-></span><span class="hl kwd">select_rows</span><span class="hl opt">(</span><span class="hl num">0</span><span class="hl opt">,</span> <span class="hl num">34</span><span class="hl opt">);</span> <span class="hl slc"># grab the first and last rows</span>
</pre></div>
<p>which outputs</p>
<div class="highlight-txt"><pre class="hl">Subset of Iris data set
-----------------------------------------------------------------------
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
-----------------------------------------------------------------------
51 7 3.2 4.7 1.4 versicolor
147 6.3 2.5 5 1.9 virginica
-----------------------------------------------------------------------
</pre></div>
<p>This is possible due to Statistics::NiceR and Data::Frame.</p>
<p><a href="http://p3rl.org/Statistics::NiceR">Statistics::NiceR</a> is a C-level binding to
the R interpreter that exposes all of R's functions as if they were Perl
functions. It handles all the magic data conversion in the background so that
you don't have to think about it.</p>
<p><a href="http://p3rl.org/Data::Frame">Data::Frame</a> is a container for
<a href="http://pdl.perl.org/">PDL</a> typed arrays that lets you think in terms of
tabular data just like R's <code>data.frame</code>s. It even prints out a table using
<a href="http://p3rl.org/Text::Table::Tiny">Text::Table::Tiny</a>. To support categorical
data just like R's <code>factor</code> variables, it has a PDL subclass that keeps track
of the levels of the data.</p>
<p>It's still an early release, so there may still be some kinks to figure out,
but give it a try and be sure to ping me if there is something wrong.</p>
<p>Much thanks to the folks in <code>#inline</code> for <a href="http://inline.ouistreet.com/page/inline-grant-weekly-report-8.html">helping out</a>
with very cool <a href="http://p3rl.org/Inline::Module">Inline::Module</a> so that this
code could hit CPAN (ingy++, ether++). You should definitely check it out as an
alternative to writing XS.</p>
<p>There are already several interfaces to R on CPAN, but this is the
first one that embeds R and provides a flexible data conversion mechanism. Hope
you enjoy using it!</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:calendar-unix"><p><a href="http://www.unix.com/man-page/freebsd/1/calendar/">calendar(1)</a><a href="http://enetdown.org//dot-plan/#fnref:calendar-unix" class="reversefootnote"> ↩</a></p></li>
</ol>
</div>
<h1><a href="http://enetdown.org//dot-plan/posts/2014/05/28/scraping_khan_academy/">When APIs break, scrape! Or How I downloaded 17 GB of videos from Khan Academy</a></h1>
<p class="pagedate">Posted <span class="date">2014-05-28</span> by zaki</p>
<p>I usually don't write about short scripts that I've written, but this one might
be useful to others. <a href="https://github.com/zmughal/khan-academy-downloader">Link</a>
for the impatient.</p>
<p>I needed to download videos from Khan Academy so that I could watch them
offline. That should be easy enough, right? The videos are hosted on
YouTube, so it should just be a matter of finding a playlist and running
<code>get_flash_videos</code> on all the URLs. Turns out this isn't the case: the playlists
on YouTube do not match up with all the videos on the Khan Academy website. Argh.</p>
<p>I could try to go through each of the sections on the website and copy the URLs
into a file, but doing that with 700 videos isn't my idea of a fun way to spend
a couple hours. I looked around for a way to download videos, but all I found
was this <a href="https://www.khanacademy.org/downloads">download page</a> which had an
old torrent. I looked for an API and found
<a href="https://github.com/Khan/khan-api/wiki/Khan-Academy-API">one</a> that was a bit
under-documented. After trying to figure out the easiest way of using the API,
I decided that trying to unravel the 10 MB JSON file returned by
<code>http://api.khanacademy.org/api/v1/topictree</code> wasn't worth it<a href="http://enetdown.org//dot-plan/#fn:api" id="fnref:api" class="footnote">1</a>. Time to
scrape the site!</p>
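<p>Had I used the API, the traversal itself would have been a short recursion over the tree. A sketch only: the field names (<code>children</code>, <code>kind</code>, <code>youtube_id</code>) are my assumptions about the JSON's shape and may not match the real <code>topictree</code> exactly:</p>

```perl
use strict;
use warnings;

# Hypothetical depth-first walk over the decoded topictree, collecting
# YouTube IDs in playlist order. Field names are assumptions.
sub collect_videos {
    my ($node, $videos) = @_;
    $videos //= [];
    push @$videos, $node->{youtube_id}
        if ($node->{kind} // '') eq 'Video';
    collect_videos($_, $videos) for @{ $node->{children} // [] };
    return $videos;
}
```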
<p>The final code as of this writing is <a href="https://github.com/zmughal/khan-academy-downloader/tree/c42f6d3ea62dc7658a069d8374ffa45be23b93d3">here</a>.
The scraping code in <code>download.pl</code> isn't exactly great, but it does the job. It
just recursively follows children URLs and records them in a data structure
which is written out to <code>ka-data.json</code>. Then <code>process.pl</code> takes over and reads
the data structure. The important thing here is that the files get written out
with some way of maintaining the order of the playlist. I use the order of the
children URLs on the page to assign a numeric prefix to every directory and
file so that it will sort by name.</p>
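<p>The prefixing can be sketched in a few lines (hypothetical code, not the actual <code>process.pl</code>):</p>

```perl
use strict;
use warnings;

# Zero-padded numeric prefixes make a lexicographic sort by filename
# match the playlist order (assuming fewer than 1000 entries per level).
sub prefixed_names {
    my @titles = @_;
    my $n = 0;
    return map { sprintf '%03d-%s', ++$n, $_ } @titles;
}
```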
<p>Finally, to download the videos, I took the laziest approach. I just print out
the <code>get_flash_videos</code> shell commands needed to download each video and pipe
the commands into a shell. That way I didn't have to deal with any error
handling myself. Now all I need to do is finish watching these videos!</p>
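<p>That is, the script just emits one shell command per video, along these lines (a sketch that assumes the URLs contain no single quotes, which holds for YouTube URLs):</p>

```perl
use strict;
use warnings;

# Emit one get_flash_videos invocation per URL; the output is meant to
# be piped straight into a shell, so each URL is single-quoted.
sub download_commands {
    my @urls = @_;
    return map { "get_flash_videos '$_'\n" } @urls;
}
```

The generated lines are then piped into <code>sh</code>, which handles retries and errors.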
<div class="inlineheader">
<span class="header">
<a href="http://enetdown.org//dot-plan/posts/2014/05/28/gfx/ka-data-eg.pl">ka-data-eg.pl</a>
</span>
</div>
<div class="inlinecontent">
</div>
<div class="inlinefooter">
<span class="pagedate">
Posted <span class="date">Wed May 28 09:02:18 2014</span>
</span>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn:api"><p>In retrospect, it wouldn't have been that hard to use the API, but I
didn't like the feel of it. A recursive data structure isn't exactly the
easiest to run through at a midnight hour. Below, I sketched out how I could
have gotten the same results as the scraping part of my script now that I know
more about the problem. Oh well.
<div class="inlinepage"><a href="http://enetdown.org//dot-plan/#fnref:api" class="reversefootnote"> ↩</a></p></li>
</ol>
</div>
Releasing Data::TestImagehttp://enetdown.org//dot-plan/posts/2014/05/26/data-testimage/zaki2016-06-13T07:17:08Z2014-05-26T09:19:26Z
<p>I've recently been working on putting together modules for image processing in
Perl. One thing that keeps coming up when I write code that I want others to
run is that they don't have the same images on their system as I do. So I put
together a module that wraps up access to some standard test images from the
<a href="http://sipi.usc.edu/database/">USC SIPI image database</a>. So now, instead of
telling people to go download the right set of images, all they have to do is
install <a href="http://p3rl.org/Data::TestImage">Data::TestImage</a> from CPAN and it
will in turn download and extract the image database to a shared location. Then
you can just run</p>
<div class="highlight-perl"><pre class="hl"><span class="hl kwa">use</span> Data<span class="hl opt">::</span>TestImage<span class="hl opt">;</span>
<span class="hl kwc">my</span> <span class="hl kwb">$image</span> <span class="hl opt">=</span> Data<span class="hl opt">::</span>TestImage<span class="hl opt">-></span><span class="hl kwd">get_image</span><span class="hl opt">(</span><span class="hl str">'mandrill'</span><span class="hl opt">);</span>
</pre></div>
<p>and <code>$image</code> will contain the path to a TIFF file with the mandrill test image. Simple!</p>
<p><a href="http://enetdown.org//dot-plan/posts/2014/05/26/gfx/mandrill.png"><img src="http://enetdown.org//dot-plan/posts/2014/05/26/data-testimage/200x-mandrill.png" width="200" height="200" alt="Standard test image of mandrill" class="img" /></a></p>
<p>I put a couple of nice tweaks in the install process too: it doesn't install
all 130 MB of the USC SIPI database by default, but you can set an environment
variable and it will install only the portions you ask for.</p>
<p>I got some inspiration from the <a href="http://www.mathworks.com/matlabcentral/answers/54439-list-of-builtin-demo-images">MATLAB Image Toolbox's built-in
images</a>
and the <a href="https://github.com/timholy/TestImages.jl">TestImages.jl</a> Julia
package. But mine is more extensible!</p>
a writeup on my document reader projecthttp://enetdown.org//dot-plan/posts/2013/09/22/a_writeup_on_my_doc_reader_project/zaki2013-09-22T21:59:52Z2013-09-22T21:59:52Z
<p>The last few months, I have been working on a project to treat text on
computers as more than just an electronic analogue of paper.</p>
<p>The idea is to make text structured and interactive. Text is not just a
blob of characters thrown on the paper/screen for reading, but one of
many ways to communicate. Text is a medium for the conveyance of ideas,
and these ideas do not stand alone: they are referenced
continually within a workflow.</p>
<p>I have read many papers and books (on my computer, tablet, and e-ink
screens, of course) over the past few months on the problems faced in
the fields of information science, interactive information retrieval,
digital libraries, active reading, and information extraction. To
achieve the goal of making a document reader that I would be happy with,
I need to implement several features, which I will detail in the following
sections.</p>
<h1 id="annotations">Annotations</h1>
<p><em>Annotations</em> are the natural first step towards this goal. Annotations
are markers where we leave our thoughts. These thoughts can be
spontaneous "stream-of-consciousness" notes or they can be detailed
thoughts that are meant to tie multiple ideas together.</p>
<p>Annotations need to be <em>easy to create</em>. If you are reading and you have
to mess around with the interface, you will not be making annotations as
often as you would on paper.</p>
<p>Furthermore, annotations need to have the option to be <em>shareable</em>. There
is certainly a distinction made between annotations made for others and
those that are meant to be private. This kind of interaction must be
supported so that the choice of whether to share or not is
straightforward.</p>
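<p>As a concrete illustration, an annotation record supporting this could be as small as the following sketch (an invented shape, not any existing format); an explicit visibility flag is what makes the share-or-not choice straightforward:</p>

```perl
use strict;
use warnings;

# Illustrative annotation record: anchored to a document by an
# identifier plus an exact quote, with a visibility flag for the
# private/shared distinction. Not a real interchange format.
sub make_annotation {
    my (%args) = @_;
    return {
        doc_id     => $args{doc_id},    # e.g. a DOI or file hash
        anchor     => { page => $args{page}, quote => $args{quote} },
        body       => $args{body},
        visibility => $args{visibility} // 'private',
    };
}
```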
<h1 id="catalogueandsearch">Catalogue and search</h1>
<p>Once you start using a computer to collect all your reading material,
inevitably, the documents start piling up. It becomes difficult to return
to the same documents quickly, so finding them again needs to be fast.</p>
<p>To support this, there needs to be an extensible metadata and indexing
tool. This tool should not only contain the <em>bibliography metadata</em> that is
usually expected from a catalogue, but also develop <em>concept maps</em> from
the text. This is a difficult task and will require lots of <em>language
modelling</em>, but it is necessary for surfacing the related information
that is inherent in the meaning of the text.</p>
<p>An easier first approach to this comes from the old and well-studied
field of <em>bibliometrics</em>. Instead of trying to figure out what the
meaning of the text is, the indexer can start by using <em>citation
parsing</em> to find out the semantic structure between documents by seeing
how they are referenced in the text.</p>
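<p>As a toy illustration of where citation parsing starts, a first pass might only pull out simple parenthetical references; real parsers handle far more citation styles, plus the messy reference-list side:</p>

```perl
use strict;
use warnings;

# Toy citation extractor: matches "(Author, Year)" style references
# only. Illustrative; real citation parsing is much harder.
sub extract_citations {
    my ($text) = @_;
    my @cites;
    while ($text =~ /\(([A-Z][A-Za-z-]+),\s*(\d{4})\)/g) {
        push @cites, { author => $1, year => $2 };
    }
    return @cites;
}
```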
<p>Together, these two approaches can lead to better approaches to the
problem of <em>document similarity</em>, or finding other documents that are
semantically close to each other. This is useful because it speeds up
the process of finding ideas to tie together. Instead of trying to hunt
down the appropriate documents, they can be presented to the user as
they are reading.</p>
<h1 id="findingmorereadingmaterial">Finding more reading material</h1>
<p>Usually, when trying to find new documents to add to one's collection,
researchers use search engines and try to explore a topic. <em>Interactive
information retrieval</em> is full of models of the cognitive states involved in
this process and there is a general agreement that it starts with a question
and an <em>uncertain state of knowledge</em> (USK). This is a state where the
researcher does not know enough about the field to know exactly what to search
for. The way out of this state is to search for related terms and
read the findings to learn more about the field of interest. Then, with
this knowledge in hand, the researcher performs more searches to
see whether they can better approximate the original question or need to
reformulate it.</p>
<p>I believe that there are tools that can aid this kind of interaction.
Usually the questions being posed by the researcher have a context and
it is my hope that this context can come from what the researcher has
been reading and writing. Using this information, a search engine can
possibly provide better results by using this contextual
information both to expand the original search when the query
is too narrow (<em>query expansion</em>) and to filter the results by looking
at usage patterns of the terms (<em>entity recognition</em>).</p>
<h1 id="aspecificworkflow">A specific workflow</h1>
<p>I’m going to separate the workflow into tasks:</p>
<ul>
<li>finding new papers:
<ul>
<li>current: I use PubMed and Google Scholar to find recent papers
and the lab’s collaborative reference manager to find older
articles.</li>
<li>desired: I would like a more unified interface that lets me
scroll through abstracts and categorise papers quickly. This
would include the ability to hide papers either individually or
based on certain criteria (authors, journal, etc.).</li>
</ul></li>
<li>storing and retrieving papers from my collection:
<ul>
<li>current: Since I use a wiki, all I do is keep the papers in a
directory on my server that I sync with all my computers.</li>
<li>desired: The current setup is fine, because a folder of PDFs is
the most portable solution for all my devices, but it is not
optimal for finding a specific paper. It would be better if
there were a way to keep the folder setup, but have it managed
by a program that can match up citation keys and be used to only
show papers that I need to read and then send these to my
devices. I think that <a href="http://opds-spec.org/">OPDS</a> (along with
OpenSearch support) and Z39.50/SRU could be useful in this
regard.</li>
</ul></li>
<li>following a citation:
<ul>
<li>current: I have to scroll back and forth to see what paper a given
citation refers to. This is really slow on the Kindle’s e-ink
screen and not much faster with PDFs on other devices (many
journals actually do not accept PDF manuscripts that have
hyperlinks). The HTML versions of papers that some journals
provide alleviate this problem somewhat, but PDFs are the
standard for most scientific literature.</li>
<li>desired: Automatic lookup (from either online or personal
collection) with the ability to jump back.</li>
</ul></li>
<li>adding annotations:
<ul>
<li>current: On a screen, annotations are rudimentary and slow to
use (this may be better on a tablet, but most tablets these days
are not built with a high-resolution digitizer).</li>
<li>desired: Even if annotations are possible on any single device,
one can not use these across different platforms, nor share the
annotations easily. Annotations need to be portable, searchable,
and support cross-references.</li>
</ul></li>
</ul>
<h1 id="otherrelatedtopics">Other related topics</h1>
<ul>
<li>document layout — There is some information that is relevant for
navigation that is implicit in the document's structure and dealing
with documents that are not "born-digital" will require the automatic
extraction of this structure. This can be quite difficult even for PDFs that
are born-digital.</li>
</ul>
Local hackathons (and how Perl can help)http://enetdown.org//dot-plan/posts/2013/05/12/local_hackathons_and_how_perl_can_help/zaki2013-08-20T17:02:03Z2013-05-12T16:19:09Z
<p>Crossposted from <a href="http://blogs.perl.org/users/zaki/2013/05/local-hackathons-and-how-perl-can-help.html">blogs.perl.org</a>.</p>
<p>Hi everyone, this is my first blog post on here (<a href="http://szabgab.com/">Gabor Szabo</a>++ for reminding me
that I need to blog!).
Last week, I <a href="http://permalink.gmane.org/gmane.comp.lang.perl.perl-mongers.houston/1534">posted</a>
to the Houston.pm group inviting them to come out to the local <a href="http://houstonhackathon.com/">City of Houston
Open Innovation Hackathon</a>.</p>
<p>I had been planning to attend since I first heard about it a couple of weeks ago,
but I saw this also as an opportunity to build cool things in Perl and show
what Perl can do.</p>
<p>To those that don't interact much with the Perl community, Perl is completely
invisible or just seen in the form of short scripts that do a single task.
Showing people by doing is a great way to demonstrate how much Perl has
progressed in recent years with all the tools that have been uploaded to CPAN and
that Perl systems can grow beautifully beyond 1,000 SLOC.</p>
<p>Going to hackathons like these also makes for a great way to network with the
local technology community and see what type of problems they are interested in
solving. I would love to see what type of approaches they take in their
technology stack and whether those approaches can be adapted and made Perlish.</p>
Play Perl is a game changerhttp://enetdown.org//dot-plan/posts/2013/02/12/play_perl_is_a_game_changer/zaki2013-08-20T17:02:03Z2013-02-12T07:25:47Z
<p>A few months back, I had read Vyacheslav Matyukhin's
<a href="http://blogs.perl.org/users/vyacheslav_matjukhin/2012/11/play-perl-project.html">announcement</a>
of the <a href="http://play-perl.org/">Play Perl</a> project and I immediately thought
that the idea of a social network for programming was a sound one<a href="http://enetdown.org//dot-plan/#fn:github0" id="fnref:github0" class="footnote">1</a>, but
I was not sure if it would just languish and become another site that nobody
visits. The recent
<a href="http://blogs.perl.org/users/vyacheslav_matjukhin/2013/02/play-perl-is-online.html">release</a>
and widespread adoption of the site by the Perl community gives me optimism for
the kind of cooperation this site can bring to Perl. Play Perl brings a unique
way to spring the community into action — which excites me. I would like to
note that these ideas are not unique in application to Perl, but could be used
for any distributed collaborative effort.</p>
<p>There already exist tools for open-source where we can share tasks and wishlist
items: either project specific such as bug trackers or cross-project such as
<a href="http://openhatch.org/">OpenHatch</a>, <a href="http://24pullrequests.com/">24 Pull
Requests</a>, and the various project proposals made for
<a href="https://developers.google.com/open-source/soc/">GSOC</a>. What does Play Perl
add to what's already out there?</p>
<p>Firstly, it frees the projects from belonging to a single person. People
already create todo lists with projects that they would like to work on, but
these todo lists usually remain private. In an open-source environment, this
can make it difficult to find other people that might be interested in helping
those ideas get off the ground. With Play Perl, the todo list item (or
“quest”) no longer has to stay with the person that came up with it. Anyone
can see what may need to be done and if they have enough <a href="https://en.wiktionary.org/wiki/round_tuit">round
tuits</a>, they can work on it. A quote
that I recently read on the Wikipedia article for <a href="https://en.wikipedia.org/wiki/Alexander_Kronrod">Alexander
Kronrod</a> sums up how I see
this:</p>
<blockquote>
<p>A biographer wrote Kronrod gave ideas "away left and right, quite honestly
being convinced that the authorship belongs to the one who implements them."</p>
</blockquote>
<p>In this sense, Play Perl is an <a href="https://en.wikipedia.org/wiki/Ideas_bank">ideas
bank</a>, but it is much more than that.
By allowing these ideas to be voted on, it allows you to choose which one to
work on first. You can prioritise your time based on what would be most
beneficial to the community — a metric that is difficult to ascertain on your
own.</p>
<p>Secondly, with the gamification<a href="http://enetdown.org//dot-plan/#fn:gamification" id="fnref:gamification" class="footnote">2</a> of collaboration, the process of
implementing ideas becomes part of a feedback loop — we can introduce positive
reinforcement for our work through the interaction with the community. This
feedback loop process already happens through media such as mailing lists and
IRC (karmabots, anyone?). Play Perl quantifies this and lets us see how much
our contributions help others.</p>
<p>The last and possibly the most important aspect of Play Perl that leads me to
believe in its long-term success is its focus on tasks from a single community.
Everyone in the community can quickly see what the others are working on in a
way that is hard to do with blogs or GitHub<a href="http://enetdown.org//dot-plan/#fn:github1" id="fnref:github1" class="footnote">3</a> due to granularity. Ideas
often get lost with time, but more eyes can ensure that they get implemented. I
frequently browse the latest <a href="http://search.cpan.org/recent">CPAN uploads</a> to
look for interesting modules and I find myself following the feed on Play Perl
the same way. I can justify the time spent browsing through all this activity
on both sites because I know that there is likely an item of interest (high
probability of reward) and each item is a blip of text that I can quickly scan
through (low cost to reading each item). Looking at modules lets me see what
code already exists, but Play Perl lets me see what code will probably exist in
the future. Providing this new view on the workflow of open-source development
is empowering because it provides a channel for the free flow of a specific
kind of information that was previously trapped in other less-visible media.
Having easy access to this information means we can interact with it directly
at a scale that best fits the message.</p>
<p>I look forward to seeing Play Perl flourish in the coming months.</p>
<p>This blog post <a href="http://play-perl.org/quest/5119cc453a5ec3040700000f">brought to you</a> by Play Perl. ;-)</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:github0"><p>It has worked for GitHub, hasn't it?<a href="http://enetdown.org//dot-plan/#fnref:github0" class="reversefootnote"> ↩</a></p></li>
<li id="fn:gamification"><p>On a side note, Carl Mäsak has written about the gamification
of development with
<a href="https://en.wikipedia.org/wiki/Test-driven_development">TDD</a> in particular
in <a href="http://use.perl.org/use.perl.org/_masak/journal/39445.html">Perl 6 is my
MMORPG</a> and
<a href="http://use.perl.org/use.perl.org/_masak/journal/39639.html">Helpfully addictive: TDD on
crack</a>.<a href="http://enetdown.org//dot-plan/#fnref:gamification" class="reversefootnote"> ↩</a></p></li>
<li id="fn:github1"><p>GitHub should implement custom lists of people/projects to filter the activity like on Twitter.<a href="http://enetdown.org//dot-plan/#fnref:github1" class="reversefootnote"> ↩</a></p></li>
</ol>
</div>
Adding labels to Gmail's IMAPhttp://enetdown.org//dot-plan/posts/2012/11/11/adding_labels_to_Gmail_IMAP/zaki2012-11-25T23:18:40Z2012-11-11T04:07:56Z
<table class="img"><caption>mutt showing X-Label header in index view</caption><tr><td><a href="http://enetdown.org//dot-plan/posts/2012/11/11/gfx/mutt-gmail-imap-label.png"><img src="http://enetdown.org//dot-plan/posts/2012/11/11/adding_labels_to_Gmail_IMAP/400x-mutt-gmail-imap-label.png" width="400" height="229" class="img" /></a></td></tr></table>
<p>About a year ago (October 2011), I wrote a small tool (<a href="https://github.com/zmughal/gmail-imap-label">git
repo</a>) that has really made using
my e-mail a much more enjoyable experience. My personal e-mail inbox is on
Google's Gmail service; however, I find the web interface gets in the way of
reading and organising my e-mail. I make heavy use of the label and filter
features that let me automatically tag each message (or thread), but having
labels that number in the hundreds gets unwieldy since I can not easily access
them. I use the
<a href="https://en.wikipedia.org/wiki/Internet_Message_Access_Protocol">IMAP</a>
interface to reach my inbox through the <a href="http://www.mutt.org/">mutt</a> e-mail
client; this is fast because there is almost no interface and I can bind many
actions to a couple of keystrokes.</p>
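<p>On the mutt side, the index view shown in the screenshot above relies on the <code>%y</code> expando of <code>$index_format</code>, which expands to the <code>X-Label</code> header; for example (the surrounding format string here is only an illustration):</p>

```
# ~/.muttrc — show the X-Label header in the index view via %y
set index_format="%4C %Z %{%b %d} %-15.15L (%4l) [%y] %s"
```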
<p>The main problem I had with using IMAP was that, although I could see all the
labels presented as IMAP folders, I had no way to know which labels were used
on a particular message that was still in my inbox. I had thought about this
problem for a while and looked around to see if anyone made proxies for IMAP,
but there was not very much information out there. I had originally thought
that I would need to keep a periodically updated database of <code>Message-ID</code>s
and labels which I would query from inside mutt and I had in fact written
some code that would get all the <code>Message-ID</code>s for a particular IMAP folder,
but this was a slow process. I didn't look into it again until I was talking
about my problem with a friend (<a href="http://windotnet.blogspot.com/">Paul
DeCarlo</a>) and he pointed me towards the <a href="http://code.google.com/apis/gmail/imap/#x-gm-labels">Gmail
IMAP extensions</a>. This
was actually going to be possible!</p>
<p>I quickly put together a pass-through proxy that would sit between the mutt
client and the Gmail IMAP server. Since Gmail's server uses SSL with IMAP (i.e.
IMAPS) to encrypt the communication, I would need to get the unencrypted
commands from mutt and then encrypt them before sending them to Gmail. Once I
had this, I could log the commands to a file and study which IMAP commands mutt
was sending. At the same time, I had a window open with <a href="https://tools.ietf.org/html/rfc3501">IETF
RFC3501</a>, the IMAP specification document,
so that I could understand the structure of the protocol. Once I
saved the log to a file, I didn't actually need a network to program the filter
that I was writing — in fact, I finished the core of the program on a drive
back from the ACM Programming Contest in Longview, TX! When I got home, I
tested it and it worked almost perfectly except for another IMAP command that
mutt was sending that was not in my log, but that was just a small fix.</p>
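<p>The core transformation the proxy performs can be reduced to a few lines. This sketch only handles quoted labels; the real <code>X-GM-LABELS</code> data also contains unquoted system labels such as <code>\Inbox</code> and escape sequences, and the resulting header has to be spliced into the IMAP response stream:</p>

```perl
use strict;
use warnings;

# Reduced sketch: turn the label list from an X-GM-LABELS fetch item
# into an X-Label header that mutt can display. Quoted labels only.
sub labels_to_header {
    my ($gm_labels) = @_;                 # e.g. '"perl" "todo/later"'
    my @labels = $gm_labels =~ /"([^"]*)"/g;
    return 'X-Label: ' . join ', ', @labels;
}
```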
<p>Not very long after I first published the code on GitHub, my code was mentioned
on the <a href="http://comments.gmane.org/gmane.mail.mutt.devel/19727">mutt developers mailing
list</a> on 2012-02-01.</p>
<p>Todd Hoffman writes:</p>
<blockquote>
<p>I read the keywords patch description and it definitely sounds useful. One
reminder is that gmail labels are not in the headers and can only be accessed
using the imap fetch command. Users of offlineimap or some other mechanism to
retrieve the messages and put them in mboxes or a maildir structure will not be
able to extract them from headers, in general. Of course, a patched version of
offlineimap or an imap proxy (see <a href="https://github.com/zmughal/gmail-imap-label">https://github.com/zmughal/gmail-imap-label</a>)
that assigns the labels to a header could be used also.</p>
</blockquote>
<p>It was good to know that people were finding my project. Also, over the summer,
I received my <a href="https://github.com/zmughal/gmail-imap-label/issues/1">first ever bug
report</a> from <a href="https://github.com/mavam">Matthias
Vallentin</a>, so I knew that somebody else actually
found it useful. \o/ This was a great feeling, because it closed the loop on the
open-source model for me: I was finally contributing code and interacting with
other developers.</p>
capturing HTTPS proxy requests with Firefoxhttp://enetdown.org//dot-plan/posts/2012/10/19/capturing_https_proxy_requests_with_firefox/zaki2012-10-20T15:47:58Z2012-10-19T22:12:52Z
<p>In one of my projects, I needed to trace HTTPS requests in order to understand
the behaviour of a web application. Since the data is encrypted, it can not be
read using the default configuration of the tools that I normally use to
inspect network data. This post details how to quickly set up an SSL proxy to
monitor the encrypted traffic.</p>
<h1 id="background">Background</h1>
<p>When debugging or reverse engineering a network protocol, it is often necessary
to look at the requests being made in order to see where they are going and what type of
parameters are being sent. Usually this is simple with a packet capture tool
such as <a href="http://www.tcpdump.org/">tcpdump</a> or
<a href="http://www.wireshark.org/">Wireshark</a> when the protocol is being sent in
plaintext; however, it is more work to capture and decode SSL packets as the purpose of
this protocol layer is to prevent the type of eavesdropping that is
accomplished in man-in-the-middle attacks. SSL works on the basis of public
certificates that are issued by trusted organisations known as certificate
authorities (CAs). The purpose of the CAs is to sign certificates so that any
clients that connect to a server that has a signed certificate can trust that they
are connecting to an entity with verified credentials; therefore, a certificate
can only be as trustworthy as the CA that signed it. Operating systems and
browsers come with a list of CA certificates that are considered
trustworthy by consensus; so, in order to run a server with verifiable SSL
communication, the owners of that server need to get their certificate signed
by one of the CAs in that list. Any traffic to and from that server will now be
accepted by the client in encrypted form.</p>
<h1 id="software">Software</h1>
<p>Trying to capture these encrypted packets would require you to have the private
key of a trusted CA; however, we can get around this by installing our own CA
certificate and using a proxy that signs certificates using that CA certificate
for every server that the client connects to. We can accomplish this by using
<a href="http://crypto.stanford.edu/ssl-mitm/">mitm-proxy</a>.</p>
<p><a href="http://enetdown.org//dot-plan/posts/2012/10/19/gfx/mitm-proxy/"><img src="http://enetdown.org//dot-plan/posts/2012/10/19/capturing_https_proxy_requests_with_firefox/400x-mitm-proxy.png" width="400" height="416" alt="SSL Man in the Middle Proxy software website" class="img" /></a></p>
<p>It is written in Java and comes with a CA certificate that you can use right
away, which makes it straightforward to set up.</p>
<p>Once you download and extract the software, you have to add the fake CA
certificate into Firefox. I prefer to set up a new session of Firefox so that
the configuration will use a separate database of certificates from my usual
browsing session. You can create and start a new session using the command:</p>
<pre><code>firefox -P -no-remote
</code></pre>
<h1 id="installingthecertificate">Installing the certificate</h1>
<p>When this new session starts up, add the certificate by going into the
Preferences menu of Firefox and going to the options under <code>Advanced » Encryption</code>
and selecting the <code>View Certificates</code> button.</p>
<p><a href="http://enetdown.org//dot-plan/posts/2012/10/19/gfx/preferences/"><img src="http://enetdown.org//dot-plan/posts/2012/10/19/capturing_https_proxy_requests_with_firefox/400x-preferences.png" width="400" height="366" alt="Preferences option for certificates" class="img" /></a></p>
<p>Under the <code>Authorities</code> tab, click the <code>Import</code> button and select the file
<code>FakeCA.cer</code> from the <code>mitm-proxy</code> directory.</p>
<p><a href="http://enetdown.org//dot-plan/posts/2012/10/19/gfx/add_cert0/"><img src="http://enetdown.org//dot-plan/posts/2012/10/19/capturing_https_proxy_requests_with_firefox/400x-add_cert0.png" width="400" height="277" alt="List of CA certificates" class="img" /></a></p>
<p>Once you add the certificate for identifying websites, you should see it in the
list of authorities under the item name <code>stanford</code>.</p>
<p><a href="http://enetdown.org//dot-plan/posts/2012/10/19/gfx/add_cert1/"><img src="http://enetdown.org//dot-plan/posts/2012/10/19/capturing_https_proxy_requests_with_firefox/400x-add_cert1.png" width="400" height="199" alt="Adding a fake CA for websites" class="img" /></a></p>
<p><a href="http://enetdown.org//dot-plan/posts/2012/10/19/gfx/add_cert2/"><img src="http://enetdown.org//dot-plan/posts/2012/10/19/capturing_https_proxy_requests_with_firefox/400x-add_cert2.png" width="400" height="277" alt="Check if the fake CA is added under 'stanford'" class="img" /></a></p>
<h1 id="runningtheproxy">Running the proxy</h1>
<p>You are now ready to run the proxy. A shell script called <code>run.sh</code> is contained
in the <code>mitm-proxy</code> directory and by examining it<a href="http://enetdown.org//dot-plan/#fn:shell_script_security" id="fnref:shell_script_security" class="footnote">1</a>, you
can see that it starts a proxy on localhost:8888 using the fake CA certificate and
that it will log the HTTPS requests to <code>output.txt</code>. You need to add this proxy
to your Firefox instance by going to <code>Advanced » Network » Settings</code> and adding
the information under the SSL proxy configuration.</p>
<p><a href="http://enetdown.org//dot-plan/posts/2012/10/19/gfx/config_ssl_proxy/"><img src="http://enetdown.org//dot-plan/posts/2012/10/19/capturing_https_proxy_requests_with_firefox/400x-config_ssl_proxy.png" width="400" height="365" alt="Network configuration for the SSL proxy" class="img" /></a></p>
<p>Once you start the server, you can test it by going to
<a href="https://httpsnow.org/">HTTPSNow</a>, a website that promotes HTTPS usage for
secure browsing. Now, by running</p>
<pre><code>tail -f output.txt
</code></pre>
<p>you can see the HTTPS requests and responses as they are sent.</p>
<p><a href="http://enetdown.org//dot-plan/posts/2012/10/19/gfx/run_mitm-proxy/"><img src="http://enetdown.org//dot-plan/posts/2012/10/19/capturing_https_proxy_requests_with_firefox/400x-run_mitm-proxy.png" width="400" height="214" alt="Start mitm-proxy using shell script" class="img" /></a></p>
<p><a href="http://enetdown.org//dot-plan/posts/2012/10/19/gfx/https_log/"><img src="http://enetdown.org//dot-plan/posts/2012/10/19/capturing_https_proxy_requests_with_firefox/400x-https_log.png" width="400" height="214" alt="Log of HTTPS output for httpsnow.org" class="img" /></a></p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:shell_script_security"><p>You should examine all shell scripts you download for
security reasons. You do not want to inadvertently delete your $HOME directory!<a href="http://enetdown.org//dot-plan/#fnref:shell_script_security" class="reversefootnote"> ↩</a></p></li>
</ol>
</div>
switching out web scraping moduleshttp://enetdown.org//dot-plan/posts/2011/08/29/switching_out_web_scraping_modules/zaki2012-01-31T21:20:04Z2011-08-29T16:21:27Z
<p>I recently had to
<a href="http://enetdown.org/git/?p=goosebumper;a=commitdiff;h=1671fff2233699f46ccdb8818990d6c9e4231db7">change</a>
some of the code in my scraping software for Blackboard because newer versions of
<a href="http://www.getfirefox.com/">Mozilla Firefox</a><a href="http://enetdown.org//dot-plan/#fn:firefox_dev" id="fnref:firefox_dev" class="footnote">1</a> were not interacting well somewhere in between
<a href="https://github.com/bard/mozrepl/wiki/">MozRepl</a> and
<a href="http://p3rl.org/WWW::Mechanize::Firefox">WWW::Mechanize::Firefox</a>. My
decision to use <code>WWW::Mechanize::Firefox</code> was primarily prompted by ease of
development. By being able to look at the DOM and match elements using all of
Firefox's great development tools such as <a href="http://getfirebug.com/">Firebug</a>, I
was able to quickly write <a href="http://www.w3.org/TR/xpath">XPath</a> queries to get
exactly the information I needed. Plus, Firefox would handle all the
difficulties of nested frames and JavaScript. The approach had drawbacks,
though: it was slow, and it was hard to tell whether a run had actually
completed successfully, which made me hesitant to put it in a cronjob. It
worked, but there was something kludgey about the whole thing.</p>
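<p>A minimal sketch of that style of scraping with <code>WWW::Mechanize::Firefox</code> looks something like this (the URL and XPath expression are placeholders, not the real Blackboard ones; it assumes a running Firefox with the MozRepl extension listening):</p>
<pre><code>use strict;
use warnings;
use WWW::Mechanize::Firefox;

# Attaches to a running Firefox instance via MozRepl.
my $mech = WWW::Mechanize::Firefox-&gt;new();

# Firefox itself fetches the page, so frames and JavaScript just work.
$mech-&gt;get('https://blackboard.example.edu/courses');

# An XPath query worked out interactively with Firebug
# (placeholder expression).
my @links = $mech-&gt;xpath('//div[@class="courseListing"]//a');

print $_-&gt;{href}, "\n" for @links;
</code></pre>
<p>The convenience is real, but every call round-trips through MozRepl to a live browser, which is where the slowness and the uncertainty about completion come from.</p>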
<p>That solution worked last semester, but when I tried it at the beginning of
this semester, things started breaking down. At first I tried to work around
the breakage, but it ran too deep. I still needed JavaScript support, and the
only alternative I could find was <a href="http://p3rl.org/WWW::Scripter">WWW::Scripter</a>. It has a plugin
system that lets you use two different JavaScript engines: a pure-Perl engine
called <a href="http://p3rl.org/JE">JE</a> and Mozilla's
<a href="https://www.mozilla.org/js/spidermonkey/">SpiderMonkey</a>. I had tried using
<code>WWW::Scripter</code> before, but had encountered difficulties with compiling the
SpiderMonkey bridge. This time I gave the <code>JE</code> engine a try and I was surprised
that it worked flawlessly on the site I was scraping.</p>
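<p>Selecting the pure-Perl engine is a one-line plugin option. Roughly (again with a placeholder URL; the <code>engine</code> option is per the plugin's documentation, untested here):</p>
<pre><code>use strict;
use warnings;
use WWW::Scripter;

my $w = WWW::Scripter-&gt;new;

# Ask the JavaScript plugin for the pure-Perl JE engine explicitly,
# rather than the SpiderMonkey bridge that failed to compile.
$w-&gt;use_plugin( JavaScript =&gt; engine =&gt; 'JE' );

# Page scripts run inside the plugin's engine; no browser needed.
$w-&gt;get('https://blackboard.example.edu/courses');
print $w-&gt;title, "\n";
</code></pre>
<p>Because everything stays in-process, a failed request dies with an ordinary Perl error instead of silently stalling a remote browser, which makes it far more cron-friendly.</p>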
<p>After fixing up my code, I can see a few places where <code>WWW::Scripter</code> could become a better tool:</p>
<ol>
<li>Add a plugin that makes managing and using frames easier.</li>
<li>Create tool to make viewing and interacting with the rendered page as you
go along possible. This will really make the it easier to debug and try out
things in a REPL.</li>
<li>Integrate with <a href="http://p3rl.org/WWW::Mechanize::TreeBuilder">WWW::Mechanize::TreeBuilder</a> so
that I can use <a href="http://p3rl.org/HTML::TreeBuilder::XPath">HTML::TreeBuilder::XPath</a>
immediately. As far as I can tell, all that needs to be added to
<code>WWW::Scripter</code> is the <code>decoded_content</code> method.</li>
</ol>
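<p>For the third item, a rough and entirely hypothetical shim might look like the following: it delegates <code>decoded_content</code> to the underlying HTTP response and then applies the <code>WWW::Mechanize::TreeBuilder</code> role. I have not tested this; the method body and the role-application call are assumptions based on the modules' documentation:</p>
<pre><code>use strict;
use warnings;
use WWW::Scripter;
use WWW::Mechanize::TreeBuilder;

# Hypothetical shim: hand decoded_content through to the HTTP::Response,
# which is what WWW::Mechanize::TreeBuilder expects to find.
{
    no warnings 'redefine';
    *WWW::Scripter::decoded_content = sub {
        my $self = shift;
        return $self-&gt;response-&gt;decoded_content(@_);
    };
}

my $w = WWW::Scripter-&gt;new;
$w-&gt;use_plugin('JavaScript');

# Apply the parameterized role so $w-&gt;tree yields an
# HTML::TreeBuilder::XPath object after each request.
WWW::Mechanize::TreeBuilder-&gt;meta-&gt;apply(
    $w, tree_class =&gt; 'HTML::TreeBuilder::XPath'
);

$w-&gt;get('https://blackboard.example.edu/courses');
my @nodes = $w-&gt;tree-&gt;findnodes('//a');
</code></pre>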
<div class="footnotes">
<hr />
<ol>
<li id="fn:firefox_dev"><p>Which, if you have not heard, is undergoing a
<a href="http://www.computerworld.com/s/article/9215818/Mozilla_kicks_off_Firefox_5_faster_release_schedule">rapid development schedule</a>.<a href="http://enetdown.org//dot-plan/#fnref:firefox_dev" class="reversefootnote"> ↩</a></p></li>
</ol>
</div>