Welcome to my homepage. The following is a list of posts from all the subsections of my blog (listed in the sidebar). To see more, go to the archives for each section.

If you've been around me for a while, you might have heard me go on and on about my long-term project: creating the ultimate document reader. As I try to design it, I am thinking of ways that the cognitive process of reading can be improved through computers, and to guide my research, I have read a couple of books about the neuroscience of reading.

One book in particular was invaluable: Maryanne Wolf's Proust and the Squid. I recently came across a few articles1 in which she discusses how she seems to have lost the ability to read and enjoy long-form prose. She claims that this is because reading on screens has trained our brains to skim rather than use "deep reading" skills. Maryanne Wolf's next books will be on this specific topic and I look forward to reading them2.

Differences between reading on paper and on screens

Researchers in human-computer interaction, education, and psychology have long looked at the differences between reading on paper versus on screens (1). Getting to the root causes behind why there are differences between the two in terms of reading outcomes (e.g., speed, fatigue, comprehension) has been difficult because there really hasn't been a standardised methodology behind all the studies. Early studies found that some of the differences could be attributed to image quality and this seems to hold for single pages, but for longer documents, the ability to navigate and locate information is a more important factor.

Perhaps the difference can be accounted for by our expectations of screen reading. When reading on a screen, we tend to read shorter texts such as e-mail or the news while doing other tasks such as checking notifications. This task switching may occur many times a day as we navigate our computers. Some have suggested that the gap between these different modes of reading may be closed by adapting to screen reading in the same way that people had to adapt from reading on scrolls to reading from a codex.

A study of high-schoolers showed that those that were successful with tasks that required finding information online via a hypertext interface either had prior skills in linear reading or prior skills in basic computer navigation (2). This implies that a combination of those factors may be necessary to make the most of screen reading.

Another study compared students' performance with screen reading and paper reading under two scenarios (3). In the first scenario, the students had a fixed time to read the material before being given an assessment. In the second scenario, they were allowed to use as much time as they needed. The authors found that in the fixed-time scenario, the screen readers performed on par with the paper readers, but in the flexible-time scenario, the paper readers outperformed the screen readers. The authors suggest that this difference can be accounted for by differences in how each group of readers self-regulated their reading time, and that this difference in self-regulation comes from how readers perceive reading in each medium.

From these studies, I believe that two areas to improve when designing a document reader are document navigation and self-regulation of reading. There are many approaches to navigation, but I will not be covering that in this text. What I want to know is, how can we train ourselves to have better self-regulation?

I had a thought along these lines while I was studying a textbook. A major part of reading is going back to review what we just read. As we read, we must use our limited working memory to both visually process the words on the page and simultaneously pull out related concepts from long-term memory so that we can associate new ideas with our prior knowledge. In fluent reading, all this occurs in a short amount of time between reading chunks of text, so the number of associations that we can make depends on how quickly we can process all this information. To ease students into reviewing what they just read, many textbooks use small prompts at the end of sections. These prompts are meant to help provide context for where the information fits into a larger structure and confirm that the reader understood what was written.

In an age where we have direct access to more information than ever before, many have turned to speed-reading techniques as a way to read more material — however, such approaches are not supported by eye-tracking research (4). Others have tried to replace reading entire books with reading summaries and while these give a decent summary of the arguments in a book, they are but tertiary sources — referencing a summary is not the same as referencing the book itself. Furthermore, if you only read the summary, you might not give yourself enough material to build a coherent model that you can recall later. Both forms of reading have their place, but if we want to improve our reading comprehension, it may be worth slowing down.


The activities found in textbooks are meant to aid what is known as metacognition — thinking about thinking. We apply metacognition whenever we take notes, break down a problem into subproblems, or set goals for studying. Metacognition is the process of recognising that we are thinking a certain way and applying a strategy to regulate and improve how we think.

The following are all valid metacognitive strategies:

  • using a consistent system for note-taking,
  • creating a sequence of tasks for working on a project,
  • planning to read 3 sections in the next 30 minutes,
  • doing all the exercises at the end of the chapter,
  • devising mnemonics to remember a certain sequence of events,
  • creating flash cards based on each new definition, and
  • cramming 1 hour before a test.

These strategies might not be equally effective in all circumstances and for every person, but what they have in common is that they all regulate how we think. What I would like to investigate is whether these metacognitive strategies can be incorporated into the tools that we use to read, so that it is easier to apply them effectively and consistently.

The purpose of this text is to explore some thoughts on how to implement such a system.


When I started thinking about metacognition, my first thought led me to textual annotations. Writing has long been suggested as a way to help improve memory. It may be even more important when dealing with electronic media because notetaking can be used to enhance the understanding of what we're reading. A 2013 study suggests that the difference between paper and screen reading may be partially due to the distractions of multitasking; however, the effect of multitasking can be mitigated by using notetaking on paper as a way to retain focus (5).

Taking notes on paper is not a huge burden, but having those notes in digital form allows for more portability and, since not everyone has filing cabinets, makes them easier to retrieve years later. This ease of use comes with a tradeoff: digital notetaking may not be as effective for memory recall as taking notes by hand. In Mueller and Oppenheimer (2014) (6), the researchers looked at how students wrote notes and found that students who wrote notes by hand performed better on assessments than students who took notes on laptops. This may be because more cognitive processing is done when choosing the salient points of a lecture to write down than when typing out the lecture verbatim.

How can this additional cognitive processing be elicited when keeping notes in digital form? One approach might be to use the Cornell method for notetaking. One way of using this method is to return to the notes after a day in order to summarise what was written. This additional summary promotes the synthesis and revision of the notes. When used correctly, the rewording required for summarisation provides the extra cognitive processing needed for learning.

Below is a simple mockup of what such a system might look like on a computer. Instead of showing a single page as in most document readers, this interface shows a page of the original document as well as another page where notes can be kept. The cue column is on the left of this page, the note-taking column is on the right, and the summary area is at the bottom of the page. When the notes for a given unit of the text (e.g., a section or paragraph) are complete, the time can be recorded so that when returning the next day, the user can be given a reminder to write a summary.

Mockup of Cornell notes UI
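The reminder is the only non-UI piece of that design, and it can be stated precisely. Below is a minimal sketch, assuming a one-day delay and hypothetical field names of my own choosing (the real data model could differ):

```python
from datetime import datetime, timedelta

# A unit is due for its Cornell summary once a full day has passed
# since its notes were completed without a summary being written.
REVIEW_DELAY = timedelta(days=1)

def units_needing_summary(units, now=None):
    """Return the units for which the user should be reminded to summarise."""
    now = now or datetime.now()
    return [u for u in units
            if u["notes_completed_at"] is not None   # notes were finished...
            and u.get("summary") is None             # ...but no summary yet...
            and now - u["notes_completed_at"] >= REVIEW_DELAY]  # ...a day ago
```

On each launch, the reader would run this check and surface the due units as reminders.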


Another popular metacognitive strategy is to use flashcards. They make a good supplement to the Cornell method as the cues can be turned into the front of a flashcard and the notes can become the back. We can also create flashcards on their own and link them back to the text if further clarification of the notes is needed.

Mockup of flashcards UI

Just making the flashcards is not enough: they need to be reviewed. One method of review, known as spaced repetition, builds an adaptive schedule so that difficult cards are seen more often than easy cards until the difficult cards have been memorised. There exist many techniques and tools for creating effective flashcards and scheduling the revision of cards, so those can easily be integrated into the document reader either by outputting data in the appropriate format or by reimplementing the algorithms3.
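As a concrete illustration of the adaptive scheduling idea, here is a minimal sketch of a Leitner-style box system. The intervals are arbitrary assumptions on my part; real tools use more refined schemes (e.g., graded difficulty ratings as in SM-2):

```python
from datetime import date, timedelta

# Days to wait before a card in each box is shown again.
INTERVALS = [1, 2, 4, 8, 16]

class Card:
    def __init__(self, front, back):
        self.front, self.back = front, back
        self.box = 0                 # new cards start at the shortest interval
        self.due = date.today()

    def review(self, correct, today=None):
        today = today or date.today()
        if correct:
            # promote the card: easy cards climb to longer intervals
            self.box = min(self.box + 1, len(INTERVALS) - 1)
        else:
            # a miss sends the card back to the shortest interval
            self.box = 0
        self.due = today + timedelta(days=INTERVALS[self.box])

def due_cards(cards, today=None):
    """Cards that should be reviewed on or before the given day."""
    today = today or date.today()
    return [c for c in cards if c.due <= today]
```

The document reader would only need to call `due_cards` at startup and `review` after each answer; everything else is presentation.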

Triaging and comparison

When researching a topic, using a single source of information may not be enough. I often go to the library and grab almost every book on a topic so that I can learn from many perspectives. As I read, I try to figure out the specific strengths of each document: some may cover more theory than others or one may have particularly informative diagrams. Being able to sort through each of these documents and decide which ones are relevant is possible on a computer through the use of folders and tagging, but this is sometimes very unsatisfying. For example, if I am reading papers for a literature review, I usually make a spreadsheet to organise details about what each paper is about. Before I actually commit to making a spreadsheet, I try to stack the papers that I have printed out into different piles based on quickly skimming the contents. This is known as document triaging.

One example of how I would use this is to separate a set of papers into the categories:

  • Not relevant: papers that do not cover what I am trying to figure out in the current project (ignore these);
  • Survey papers: papers that review the results of many papers at once (read through these and find any references I might have missed);
  • Classic research: older research that may or may not be worth looking at (skim over these — might only be of historical interest);
  • Recent research: more current papers (read these more closely).

One approach to a GUI for this might be to emulate the layout of a desktop where different regions indicate different categories. This may be easier to visualise and work with than using different lists or drop-down menus as it uses larger GUI elements (cf. Fitts's law). An example of what it might look like is shown below.

Mockup of document triaging UI

The workflow proceeds as follows:

  1. The user chooses a document from the Queue on the right.
  2. The chosen document appears in the center Reading area where the user can quickly look over the document.
  3. Once a category for the document is determined, the user drags it to the appropriate category region on the left.

Perhaps this task can benefit from active learning which might be able to suggest categories for papers based on metadata such as page length or year of publication.
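As a toy sketch of what such a suggestion could look like, consider 1-nearest-neighbour over the metadata of papers the user has already triaged. The distance scaling here is an arbitrary assumption, and a real system would learn from many more features:

```python
# Suggest a triage category for a paper from its metadata (year, page count),
# based on the nearest paper the user has already categorised by hand.
def suggest_category(paper, labeled):
    """labeled: list of ((year, pages), category) pairs from past triage."""
    def dist(a, b):
        # rough scaling so neither years nor pages dominates the distance
        return abs(a[0] - b[0]) / 10.0 + abs(a[1] - b[1]) / 50.0
    _, category = min(labeled, key=lambda item: dist(item[0], paper))
    return category
```

The suggestion would only pre-select a region in the triage UI; the user still confirms by dragging the document.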

Another related task is being able to compare multiple documents side by side. Below is a picture of a bookstand modelled after a design that Thomas Jefferson created for his office4. It allows for keeping multiple books open at the same time and rotates so that switching between books is easy.

Picture of Jeffersonian revolving bookstand (taken from this project showcase)

In a way, this is like browser tabs. However, there are times when you may want to see two documents side-by-side rather than having to flip between them. For example, if I was reading two books on history written by different authors, I might want to see how both authors address the same topic. Whether or not this should be implemented as a multiple document interface (with tabbing and docking) is still not clear to me.

Reading strategies and serendipity

If you've ever read a scientific research paper, the first thing you'll notice is that they are highly structured: certain sections appear in every paper (e.g., Abstract, Introduction, Methods, Results, Conclusion). These sections are meant to guide readers so that they don't have to read through the whole paper. There are certain strategies for reading a research paper5 which emphasise re-reading the paper several times based on your goals (e.g., performing a literature review versus trying to reproduce the results). On each pass, you try to answer different questions, so you will want to spend more time reading specific sections. Perhaps these reading strategies can be turned into a checklist so that each paper has a progress bar telling you how far along you are in understanding it. That way you can skim over several papers in one sitting and then slowly work through one paper at a time.
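A per-paper progress bar could be driven by something as simple as a list of named passes. The pass names below are illustrative, not taken from any particular guide:

```python
# Reading passes that make up the checklist for each paper.
PASSES = ["skim abstract & figures",
          "read intro & conclusion",
          "read methods closely",
          "check references"]

def progress(completed):
    """Fraction of the checklist done, i.e. the progress bar's fill level."""
    done = sum(1 for p in PASSES if p in completed)
    return done / len(PASSES)
```

Different goals (literature review versus reproduction) would simply swap in a different `PASSES` list.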

Sometimes the path you take with your reading does not follow a straightforward checklist. There are times when you are searching for something and you come across an unexpected connection which can lead to more creative thoughts. This can either occur when discovering new material or coming back to older material. Understanding how to create serendipitous encounters might be a little tougher than the previous metacognitive strategies. As noted in André et al. (2009) (7), serendipity is difficult to facilitate and study in a laboratory setting. One approach that I would like to try is to bring up older material that the user might have read months or years ago. Perhaps the older material might be seen in a different light now that the user is re-encountering it.

In this post titled Tool for Thought, author Steven Berlin Johnson describes an interesting workflow where he captures quotes from books and uses a tool that can help find other related quotes in his library. This allows him to start with a single idea and then find other ideas that might be related to that original seed. One of the authors of (7), Susan Dumais, worked on a technique to do just this: latent semantic indexing. However, at this point, I am not certain how large a personal library is needed for the gains from this technique to become apparent.
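At its core, latent semantic indexing factors a term-by-document count matrix with an SVD and compares documents in the reduced "concept" space. A minimal numpy sketch, assuming raw counts (a real system would use TF-IDF weighting and a much larger rank k):

```python
import numpy as np

def lsi_similarity(X, k=2):
    """X: term-by-quote count matrix. Returns a quote-by-quote cosine
    similarity matrix computed in a k-dimensional latent concept space."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    docs = (np.diag(s[:k]) @ Vt[:k]).T        # one k-dim vector per quote
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    docs = docs / np.where(norms == 0.0, 1.0, norms)
    return docs @ docs.T
```

Given a seed quote, the tool would simply surface the other quotes with the highest similarity scores in its row.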


As far as I can tell, none of these techniques exist within a single existing application. Perhaps that is because using such a complex application would become daunting — there would simply be too many features. Furthermore, since each of these metacognitive strategies can be applied in many different ways, another challenge will be creating tutorials that show how to use them effectively.

I'd really appreciate any feedback on these ideas. They will certainly need tweaking before they become usable.

  1. The articles are Being a Better Online Reader by Maria Konnikova and an interview with Maryanne Wolf hosted by Robin Young Is Online Skimming Hurting Reading Comprehension?. ↩

  2. From her page at Tufts

    • Wolf, M. & Gottwald, S. (To appear, 2016) What It Means to Read: A Literacy Agenda for the Digital Age. Oxford University Press. In Series, Literary Agenda, Editor: Phillip Davis.
    • Wolf, M. (To appear, 2016). Letters to the Good Reader: The Contemplative Dimension in the Future Reading Brain. New York: Harper Collins. ↩
  3. Some of these flashcard techniques and tools can be found in the following:

  4. For details on how to make your own Jeffersonian bookstand, see this video and these instructions. ↩

  5. A couple guides are here and here. ↩


[2] Hahnel, Carolin, Frank Goldhammer, Johannes Naumann, and Ulf Kröhne. "Effects of linear reading, basic computer skills, evaluating online information, and navigation on reading digital text." Computers in Human Behavior 55 (2016): 486-500. doi:10.1016/j.chb.2015.09.042.

[3] Ackerman, Rakefet, and Morris Goldsmith. "Metacognitive regulation of text learning: on screen versus on paper." Journal of Experimental Psychology: Applied 17, no. 1 (2011): 18. doi:10.1037/a0022086.

[4] Schotter, Elizabeth R., Randy Tran, and Keith Rayner. "Don’t Believe What You Read (Only Once) Comprehension Is Supported by Regressions During Reading." Psychological science (2014): 0956797614531148. doi:10.1177/0956797614531148.

[5] Subrahmanyam, Kaveri, Minas Michikyan, Christine Clemmons, Rogelio Carrillo, Yalda T. Uhls, and Patricia M. Greenfield. "Learning from Paper, Learning from Screens: Impact of Screen Reading and Multitasking Conditions on Reading and Writing among College Students." International Journal of Cyber Behavior, Psychology and Learning (IJCBPL) 3, no. 4 (2013): 1-27. doi:10.4018/ijcbpl.2013100101.

[6] Mueller, Pam A., and Daniel M. Oppenheimer. "The Pen Is Mightier Than the Keyboard Advantages of Longhand Over Laptop Note Taking." Psychological science (2014): 0956797614524581. doi:10.1177/0956797614524581.

[7] André, Paul, Jaime Teevan, and Susan T. Dumais. "Discovery is never by chance: designing for (un)serendipity." In Proceedings of the seventh ACM conference on Creativity and cognition, pp. 305-314. ACM, 2009. doi:10.1145/1640233.1640279.

Posted Wed Feb 24 05:42:02 2016 Tags:

When I saw the CPAN PR Challenge come up in my feed, I signed up immediately. I love giving back to FOSS and this challenge would push me to make contributions that are outside of the usual software that I contribute to.

For January, I was assigned Clone. I looked at the reverse dependencies and saw 181 packages. I immediately thought to myself, "I'd better be careful. I don't want to break things." This means that testing is very important for this package.

I e-mailed the maintainer of Clone, Breno G. de Oliveira (garu), about my assignment and he shot me back a lengthy e-mail with all the things I could do. Some of them were easy, such as:

  • fix typos,
  • add continuous integration with Travis-CI and code coverage with Coveralls, and add badges for each of those.

Others were a bit more involved:

  • benchmarking against other packages such as Clone::PP and Storable,
  • adding more tests for different types of Perl variables,
  • going through the bug queue and fixing the open tickets.

I went for the easy ones first. I knew that adding the Travis-CI integration was just a matter of creating a .travis.yml file, but what actually goes in that file can vary quite a bit. I had noticed that haarg had created a set of helper scripts that can grab various pre-built Perl versions and run tests against them all.

I cloned the Clone repository and copied over the example .travis.yml:

language: perl
perl:
  - "5.8"                     # normal preinstalled perl
  - "5.8.4"                   # installs perl 5.8.4
  - "5.8.4-thr"               # installs perl 5.8.4 with threading
  - "5.20"                    # installs latest perl 5.20 (if not already available)
  - "blead"                   # install perl from git
matrix:
  include:
    - perl: 5.18
      env: COVERAGE=1         # enables coverage+coveralls reporting
  allow_failures:
    - perl: "blead"           # ignore failures for blead perl
before_install:
  - git clone git://github.com/travis-perl/helpers ~/travis-perl-helpers
  - source ~/travis-perl-helpers/init
  - build-perl
  - perl -V
  - build-dist
  - cd $BUILD_DIR             # $BUILD_DIR is set by the build-dist command
install:
  - cpan-install --deps       # installs prereqs, including recommends
  - cpan-install --coverage   # installs coverage prereqs, if enabled
before_script:
  - coverage-setup
script:
  - prove -l -j$((SYSTEM_CORES + 1)) $(test-dirs)   # parallel testing
after_success:
  - coverage-report

and enabled my fork of Clone in the Travis-CI and Coveralls settings.

After pushing this, the tests ran, but I kept seeing n/a code coverage on Coveralls. I was very confused because the code coverage was working just fine locally. I jumped on IRC and chatted with haarg. He pointed out that I was using prove -l as in the example, but since Clone is a compiled module, I needed to use prove -b.

Oh. Silly me! I had been using prove -b locally, but never changed the .travis.yml file. That serves me right for copying-and-pasting without looking! Something good came out of it though: this ticket for Test::Harness has suggestions that will help catch this error if anyone else makes the same mistake.

haarg also pointed me to an even simpler .travis.yml file that he was working on that just had the lines

  - eval $(curl https://travis-perl.github.io/init) --auto

and a list of the Perl versions to test. I used that and ran it through Travis-CI and everything just worked!

Now all I had to do was grab the HTML for the badges and put them in the POD and Markdown. I went to the Travis-CI and Coveralls pages and copied the Markdown for those badges and then went to http://badge.fury.io/for/pl and entered in Clone to get a version badge for Clone on CPAN.

I then made a few grammar fixes and converted the POD into Markdown for the README and I was done!

The pull request with my changes is at https://github.com/garu/Clone/pull/4 and my changes are in Clone v0.38.

Badges for Clone on GitHub

Badges for Clone on CPAN

Posted Sun Jan 25 04:36:09 2015 Tags:

A while back, I wrote Unicode::Number which was based on libuninum. This is a library that can convert numbers written in various languages to integers and vice versa. I also wrote a library to install libuninum automatically, Alien::Uninum, with the help of Alien::Base.

This all worked quite well, but I wanted to go a step further. libuninum can support the numbers stored with the GNU Multiple Precision Arithmetic Library (libgmp). This allows converting to and from arbitrarily long numbers. To support this, the computer must have libgmp installed.

So I thought to myself, why not write Alien::GMP and install it myself?

Well, Alien::GMP already exists and is authored by Richard Simões, but it bundles an old version of libgmp (v5.0.4). Alien::GMP should be able to download the latest version and install that.

So I created an issue to point out that it needed updates. That led to me getting co-maintainership on the package.

I went ahead and pointed Alien::GMP at the download page for the source code, but that page required HTTPS: https://gmplib.org/download/gmp/. Alien::Base didn't have support for HTTPS, so I added it.

I could finally get back to Alien::GMP.

I cleared out the original code and made Alien::GMP inherit from Alien::Base.

I also added support for using the tool with Inline so that it is easy to compile with other code. I just needed to change the tests that look for gmp.h and libgmp.so, and the module was good to go. The overall changes can be seen at https://github.com/zmughal/p5-Alien-GMP/compare/zmughal:v0.0.6...v0.0.6_01 and a new dev release of Alien::GMP is at https://metacpan.org/release/ZMUGHAL/Alien-GMP-v0.0.6_01.

Posted Sun Jan 25 03:47:56 2015 Tags:

Love R? Love Perl? Well, I've got a nice little present for you this echo $(calendar) 1 season! Now you can pass data in and out of R as easily as

use v5.16;
use Statistics::NiceR;
use Data::Frame::Rlike;
Moo::Role->apply_roles_to_package( q|Data::Frame|, qw(Data::Frame::Role::Rlike) );

my $r = Statistics::NiceR->new;
my $iris = $r->get('iris');

say "Subset of Iris data set";
say $iris->subset( sub { # like a SQL WHERE clause
                  ( $_->('Sepal.Length') > 6.0 )
                & ( $_->('Petal.Width')  < 2   )
        })->select_rows(0, 34); # grab the first and last rows

which outputs

Subset of Iris data set
      Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
 51   7             3.2          4.7           1.4          versicolor
 147  6.3           2.5          5             1.9          virginica

This is possible due to Statistics::NiceR and Data::Frame.

Statistics::NiceR is a C-level binding to the R interpreter that exposes all of R's functions as if they were Perl functions. It handles all the magic data conversion in the background so that you don't have to think about it.

Data::Frame is a container for PDL typed arrays that lets you think in terms of tabular data just like R's data.frames. It even prints out a table using Text::Table::Tiny. To support categorical data just like R's factor variables, it has a PDL subclass that keeps track of the levels of the data.

It's still an early release, so there may still be some kinks to figure out, but give it a try and be sure to ping me if there is something wrong.

Much thanks to the folks in #inline for helping out with very cool Inline::Module so that this code could hit CPAN (ingy++, ether++). You should definitely check it out as an alternative to writing XS.

There are already several interfaces to R on CPAN, but this is the first one that embeds R and provides a flexible data conversion mechanism. Hope you enjoy using it!

In my last post, I talked about how I found that my social media use has become rather unwieldy. I posted a link to it on Facebook (of course) and got plenty of great discussion out of it. I have much more to say on this theme, so here's another post.

I decided to write a little about what the Internet was like about 15 years ago when I started using it (or at least how I remember it). I split it off from this article because I was starting to meander. I make references to it in the following text, but you don't need to read it to understand what I'm discussing. Caveat lector.

[Information intensifies]

Information overload is one of those perennial topics that everyone always seems to be worrying about, but nobody does anything about. Clay Shirky explains how it will only get worse unless we think about it differently in his talk titled It's Not Information Overload. It's Filter Failure.

To summarise, the phenomenon of information overload follows naturally from how cheaply we can disseminate information with the Internet. In the past, we had editors that had an economic incentive to be the gatekeepers of public discourse (dead trees cost money). I like the analogy of the Great Conversation where different authors respond to the thoughts of others by referencing previous works. The editorial process behind this is not very egalitarian and we have missed out on many ideas that were ahead of their time, but it does keep the quality high. Shirky argues that the problem of information overload is not going to be solved until we start thinking about information flow. Both the problems of privacy and quality can be addressed in this framework.

That the talk was given in 2008 and we still don't see powerful filter features everywhere indicates that we haven't changed our mindset. Perhaps most people don't want that change? I'll admit, it is a lot of work. Managing filters is a constant struggle because not only does the content change, but so does your idea of relevance. And many times those criteria aren't even constant over the course of a single day!

But ignoring the problem doesn't make it go away. The problem of information overload is at its root one of attention scarcity1. We simply can't look deeply at everything. When people use a search engine, they rarely go past the 7th page or so — beyond that, abandon all hope. Endless feeds are even worse: not only are there always new posts to look at, each of the posts can cover various topics, which only makes selecting where to divert our focus harder. When an entertaining post is next to something that demands more contemplation, the entertaining post will win. Deep understanding is difficult and requires patience and dedication.

Most people aren't prepared for that kind of work. But there are professionals that are trained to deal with large amounts of data: librarians. Librarians have to select new books for a collection, organise them according to a system that makes sense for their patrons, and be able to guide patrons towards good sources. This is called curation — which brings me to the first topic I want to talk about.


When I say curation, I mean the deliberate selection of information that is meant to be shared with a specific audience. This is an editorial process — in fact, every person that shares a link to an online community or posts to their social network is curating information.

The reasons for sharing content can be different for each person and not everyone is in the audience for a particular item. This can lead to cluttering — people don't want to see things they aren't interested in. People have different opinions on what is interesting and novel. This can be solved by having the curator find the appropriate audience — as is done with newsgroups on Usenet, subreddits on reddit, groups on Facebook, communities on Google+, or topics on Quora2. These all allow anyone to approach an audience and get content in front of that audience.

But allowing anyone to share in these communities means you'll have to deal with spammy content. One way around that is to have moderators who approve posts (and sometimes even replies to those posts). This is a lot of work, but a good set of moderators can make a community a very enjoyable place to be. My favourite example of this is Metafilter. They consistently have higher quality discussion than most general discussion sites (see this recent post). They have some moderation of content and the user community takes an active part in flagging posts that aren't up to a certain level of quality — a kind of community self-policing. That joining the site requires a one-time $5 fee also seems to help ensure that the people who do join are invested in the community.

Another approach is to attack the problem at the curator level — this is closest to what an editor would do. Some sites like Slashdot have user-submitted stories that go through editors who post them to the front page. All this depends on having good stories submitted by users and a dedicated group of editors to go over all the submissions. Trove (where Rob Malda, founder of Slashdot, now works) adds another layer to this and allows everyone to be an editor and curate content that is suggested by an algorithm.


But getting content in front of an audience only takes into account what the person submitting the content thinks is appropriate for the audience. To gauge how the audience feels about the content, many services use collaborative filtering. That's what voting on stories on reddit or hackernews is. Social networks also use this with likes/+1s and comments.

This works fine for things that are presented to large, diverse groups. The wisdom of the crowd does not do well when there is a bias due to groupthink. The groupthink only gets worse if most people are seeing the same types of items again and again. In my opinion, this brings into question how collaborative filtering is implemented. If you show the top ranked items to everyone, those will slowly gather a higher ranking, while newer, lower-ranked items will not get the same chance and may disappear completely, a process that is known as preferential attachment. Others have written about how to solve this by using randomness (1), (2), (3).
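One simple way to introduce that randomness, sketched below: keep the score-based ranking, but let each display slot be taken by a random remaining item with some small probability. The epsilon value is an arbitrary assumption, and real systems tune this exploration rate carefully:

```python
import random

def ranked_with_serendipity(items, scores, epsilon=0.2, rng=random):
    """Rank items by score, but with probability epsilon fill each slot
    with a random remaining item so low-ranked items get some exposure."""
    remaining = sorted(items, key=lambda i: scores[i], reverse=True)
    out = []
    while remaining:
        if rng.random() < epsilon:
            pick = rng.choice(remaining)   # exploration: surface a sleeper
        else:
            pick = remaining[0]            # exploitation: best-ranked item
        remaining.remove(pick)
        out.append(pick)
    return out
```

With epsilon set to zero this degrades to the usual popularity ranking, which is exactly the preferential-attachment trap described above.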

What happens when there isn't a large enough group to apply collaborative filtering? This can happen when the topic is obscure or requires specialist knowledge. For example, I don't think a paper on the history of land use and agriculture will get many readers on a social media site. That's when you have to start using more information retrieval techniques. We need to start looking more at the content and how the content relates to the rest of the Internet. One of the approaches is to try to do topic modelling and assign each item a topic. Google+ currently approaches this very simply, as far as I can tell, by just looking for keywords in posts and applying a tag based on that. There was a site called Twine that used natural language processing and semantic web data representation to classify content and present that to users based on their interest. There was another site called Kiffets developed by researchers at Xerox PARC that combined many of the ideas of semantic processing, editorial curation, and collaborative filtering into a single system (4). Both Twine and Kiffets are no longer available. Perhaps no one has figured out how to scale their approach both financially and technologically?

Whenever a tool disappears, it is a huge loss for everyone who invested time in using it. To avoid that, we need a way of sharing the information we put into one tool with other tools that can replace it. There are formats such as the APML and FOAF specifications that try to encode interests in a shareable way, but they have not been widely adopted. That's not surprising: writing a specification before getting industry support rarely works well.

What I miss from the Internet

As I mentioned in the appendix to this post, I found that the early web had many more personal sites on it. This is important because, back then, each of those sites had an essence that is getting harder to find nowadays: passion. Each person had a thing they wanted to share with the world, something that made them stand out; sometimes they were even experts in that niche area. Coming across a page borne out of passion was like walking on a rocky beach. Each page was a rock that you would pick up and examine for its unique patterns, and you could recognize it instantly even after you put it down and looked at another rock. Now, with the never-ending clamor of headlines that want you to read one page or another, it feels like the constant crashing of the waves of time has ground those rocks into sand, where each grain is indistinguishable from the next.

There is no incentive to fix this. Pages like Metafilter that rely on advertising to run can't compete in a world where entertainment is what gets the most ad impressions.

I'm not an expert on media studies, but I'm very interested in how media drives society. I plan on reading Neil Postman's Amusing Ourselves to Death sometime; it appears closely related to the idea that entertainment has stifled public discourse. This isn't necessarily the most important problem the world is facing, but since we live together on this planet, we must be able to understand one another as rationally as possible, with as many facts laid bare3. It seems that every new communication technology promises to connect humanity, but we need to closely examine what kinds of relationships we are actually building.

To conclude, I want to talk about what I would like to do to address this problem (because solving problems is my passion). I had previously worked on a tool that let me read social media in a single format for ease of access. My work then centred on creating a protocol to share the data, but now I think it is more important to work on filtering. I'm going to try, and I will keep posting updates with my results.

  1. For more on attention, I recommend reading "Scrolling Forward" by David M. Levy. I reviewed it here. He specifically addresses the differences in attention between reading on paper versus reading on screen. ↩

  2. Tagging (and, in a way, Google+ circles) is a more free-form version of this. ↩

  3. I'm also interested in the related topic of diffusion of innovations and how the Internet can help the rate of adoption of new ideas. ↩


[1] Luu, Dan. Why HN Should Use Randomized Algorithms. 04 Oct 2013.

[3] Marlin, B., Zemel, R., Roweis, S., and Slaney, M.: Collaborative Filtering and the Missing at Random Assumption. 20 Jun 2012. (Note: see more on Benjamin M. Marlin's research page).

[4] Stefik, M., and Good, L.: Design and Deployment of a Personalized News Service. AI Magazine 33.2 (2012): 28.

Posted Sun Jun 8 05:17:20 2014 Tags:

This is an appendix to the post curation and filtering of the social media firehose.

I'm first going to go over a bit of history of the Internet as I experienced it (yawn) to get a grounding for where I'm coming from and some insight into where the Net is going.

Let's start at the beginning of my journey on the Internet. I got online not long after I was able to read on my own. I had a couple of books on how to use the Internet, and I read them so that I could learn not only about the World Wide Web, but also about other tools like FTP, Gopher, Usenet, MUDs, and e-mail. Interactivity was limited to the <map> tag, RealPlayer, Java applets, minimal Flash, and JavaScript embellishments (remember the term DHTML?).

Back then, the distribution of websites was different. I don't have any empirical evidence, but you were much more likely to hit a personal page than you are now. Corporate sites were more of a way to have a simple web presence than an attempt at creating a full-blown marketing experience. The most forward-thinking sites came from publishers and other media outlets that saw the web as a way of extending whatever they were doing in print or on (non-computer) screens; however, rich multimedia wasn't a mainstay of the web yet as the bandwidth wasn't available (see this page from National Geographic for an example). Even with all the hype around the web, most entities didn't attempt to create a web presence and were happy to let fans create communities for them both on the web and other parts of the Internet. Many of these sites are gone now or completely different from how they appeared years ago. I have a copy of the book "Net-mom's Internet Kids & Family Yellow Pages" written by Jean Armour Polly which is an excellent snapshot of both the state of the Internet and the mentality of users at the time. At the time, it made sense to publish a book full of URLs — though linkrot did occur, there was not a large enough volume for it to be a big deal. Now those pages are gone and all we have is printed paper1 describing what would have been there (if that's not a warning to archive anything you like, I don't know what is).

Aside from books, finding other sites on the Internet was still in its infancy. Prior to the web, information retrieval datasets were usually not this large or diverse. There were search engines for FTP (Archie), Gopher (Veronica), and the Web (AltaVista, Lycos, among many others) and many of these returned very different results. This is why some people used metasearch engines like Dogpile to combine multiple results.

Instead of using search engines, many times I would start my searches with directories such as the WWW Virtual Library, Yahoo!, and DMOZ as these had lists of sites that were vetted by editors and were generally of better quality than the usual search results.

  1. Speaking of printed paper, I want to take a moment to remember one of my heroes of the Internet, Michael S. Hart, the founder of Project Gutenberg, a project to convert public domain works into e-books, a format he invented. It was also the first virtual volunteer project. I remember going to that site and downloading many of their books in Plucker format so that I could fill up the megabytes of storage on my Palm. ↩

Posted Fri Jun 6 22:16:38 2014 Tags:

I usually don't write about short scripts that I've written, but this one might be useful to others. Link for the impatient.

I needed to download videos from Khan Academy so that I could watch them offline. That should be easy enough, right? The videos are hosted on YouTube, so it should just be a matter of finding a playlist and running get_flash_videos on all the URLs. Turns out this isn't the case: the playlists on YouTube do not match up with all the videos on the Khan Academy website. Argh.

I could try to go through each of the sections on the website and copy the URLs into a file, but doing that with 700 videos isn't my idea of a fun way to spend a couple hours. I looked around for a way to download videos, but all I found was this download page which had an old torrent. I looked for an API and found one that was a bit under-documented. After trying to figure out the easiest way of using the API, I decided that trying to unravel the 10 MB JSON file returned by http://api.khanacademy.org/api/v1/topictree wasn't worth it1. Time to scrape the site!

The final code as of this writing is here. The scraping code in download.pl isn't exactly great, but it does the job: it recursively follows children URLs and records them in a data structure, which is written out to ka-data.json. Then process.pl takes over and reads the data structure back in. The important thing here is that the files are written out with some way of maintaining the order of the playlist. I use the order of the children URLs on the page to assign a numeric prefix to every directory and file so that sorting by name reproduces the playlist order.
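The ordering trick is simple enough to sketch. This isn't the actual process.pl code, just a Python illustration of the same idea: zero-padded numeric prefixes make a plain sort-by-name reproduce the playlist order.

```python
def ordered_names(titles):
    """Prefix each title with a zero-padded index so that sorting the
    resulting names lexicographically reproduces the original order."""
    width = len(str(len(titles)))  # pad to the widest index
    return ["%0*d_%s" % (width, i, t) for i, t in enumerate(titles, 1)]

names = ordered_names(["Intro to fractions", "Adding fractions",
                       "Multiplying fractions"])
print(names)
# → ['1_Intro to fractions', '2_Adding fractions', '3_Multiplying fractions']
```

With more than nine titles the padding width grows automatically, so "10" never sorts before "2".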

Finally, to download the videos, I took the laziest approach: I just printed out the get_flash_videos shell commands needed to download each video and piped the commands into a shell. That way I didn't have to deal with any error handling myself. Now all I need to do is finish watching these videos!
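The command-generation step amounts to something like this (a Python sketch, not the actual Perl, and the URLs are placeholders):

```python
import shlex

# Placeholder URLs standing in for the scraped playlist.
urls = [
    "https://www.youtube.com/watch?v=AAAAAAAAAAA",
    "https://www.youtube.com/watch?v=BBBBBBBBBBB",
]

# Build one download command per video. Printing them and piping the
# output into a shell (e.g. `python gen.py | sh`) leaves retries and
# error handling to the shell and to get_flash_videos itself.
commands = ["get_flash_videos %s" % shlex.quote(url) for url in urls]
print("\n".join(commands))
```

Quoting each URL matters because YouTube URLs contain `?`, which the shell would otherwise try to glob.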

Posted Wed May 28 09:02:18 2014

  1. In retrospect, it wouldn't have been that hard to use the API, but I didn't like the feel of it. A recursive data structure isn't exactly the easiest thing to run through at midnight. Below, I sketched out how I could have gotten the same results as the scraping part of my script now that I know more about the problem. Oh well.
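The recursive walk over the topictree response could look something like this (a Python sketch; the field names `kind`, `children`, `title`, and `youtube_id` are my best guess at the API's schema, so treat them as assumptions):

```python
def walk(node, path=()):
    """Recursively yield (path, youtube_id) pairs from a topictree node.
    Topics carry children; videos carry a YouTube ID."""
    if node.get("kind") == "Video":
        yield path + (node["title"],), node["youtube_id"]
    for child in node.get("children", []):
        yield from walk(child, path + (node["title"],))

# A tiny stand-in for the ~10 MB response:
tree = {"kind": "Topic", "title": "Math", "children": [
    {"kind": "Topic", "title": "Fractions", "children": [
        {"kind": "Video", "title": "Intro", "youtube_id": "abc123"},
    ]},
]}
result = list(walk(tree))
print(result)
# → [(('Math', 'Fractions', 'Intro'), 'abc123')]
```

The yielded paths map directly onto the directory structure that the scraper builds from the site's navigation.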

Posted Wed May 28 07:08:54 2014 Tags:

I might be suffering from social media overload. In the past, I've used social media networks as a way to keep up to date with what my friends and acquaintances are up to and to follow updates from certain news sources. This has mainly been through Facebook and reddit1. Now I'm starting to find that the rate of return from visiting these sites has fallen below the threshold where visiting them still feels worthwhile. Now you may say "you're doing it wrong!", that social media is about entertainment, not productivity, and you're certainly right. However, I had set up my social networking accounts with enough variety in their sources that every time I visited their newsfeeds, I would find something new to learn.

But lately I've found that the content and layout of the newsfeeds no longer prioritise the things I want to see. I never really wanted to see pictures of food or check-ins or lists of the same as-old-as-the-hills reaction GIFs. For a long time, I have felt that this aspect of the web puts undue emphasis on visuals and videos, which is very frustrating to a text person like me. I like, no, I love reading and writing. I take joy in playing with words, using idioms, and crafting unspoken, never-before-seen sentences. Places like Facebook are not the right venue for me to indulge in that. I have toyed with the idea of scoring posts based on the complexity of their grammatical constructions, but I don't know if it is worth the time.
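If I ever did try that scoring idea, even a crude proxy might do as a first pass: mean sentence length times mean word length already separates reaction-GIF captions from actual prose. A rough Python sketch (this metric is arbitrary and invented here; real scoring would need a parser):

```python
import re

def complexity_score(text):
    """A crude proxy for grammatical complexity: mean sentence length
    (in words) multiplied by mean word length (in characters)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    if not sentences or not words:
        return 0.0
    mean_sentence_len = len(words) / len(sentences)
    mean_word_len = sum(len(w) for w in words) / len(words)
    return mean_sentence_len * mean_word_len

caption = "OMG. So true. LOL."
prose = ("Researchers have long studied how typography and layout "
         "influence the way readers allocate their attention.")
print(complexity_score(caption) < complexity_score(prose))
# → True
```

A feed filter could then simply drop posts scoring below some threshold, though picking that threshold is where the real work would start.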

The essential point that keeps me from enjoying social media use is that it is very disorganised. I could try to put in work by organising people into lists, but any such lists are going to lack granularity. People are complex and multi-faceted and it would be a disservice to group them so indiscriminately. There may, however, be a way to cluster posts in a way that is useful, but this will require playing with the data. I may do that someday.

But not now. Right now, I've blocked myself from social media on my computer. Since my computer is where I work, I need to isolate my work area from something that is decidedly non-work. And I feel better for it.

P.S. A year ago, I read a book called The Filter Bubble by Eli Pariser. It talks about how algorithms are automatically choosing what we want to see for us and thus deciding our world views: we only see the world that we already agree with. He talks about how this is a dangerous trend because it rapidly blurs the line between fact and opinion.

Eli Pariser started the Upworthy site, which tries to break such bubbles by making social issues go "viral". Their posts often fit a specific formula, which appears to work given the number of times their stories are shared. Upworthy has been so adept at applying this formula that imitators have sprung up all over, and now social media feeds are full of people trying too hard to get you to click on their posts. I have a feeling that they all get their stories from the same sources, because I see the same story posted multiple times throughout the day by different pages. The constant pleas of "Click me! I'm important!" (termed click-bait) are yet another reason why social media starts to feel less social and more abusive. If every site is pushing the same content, I'm not sure the goal of bursting the filter bubble has been met. We just replaced algorithms with people, but the results are the same.

  1. I've also used Slashdot for a very long time, but that has always been more of a news site for me. ↩

Posted Tue May 27 02:38:30 2014 Tags:

I've recently been working on putting together modules for image processing in Perl. One thing that keeps coming up when I write code that I want others to run is that they don't have the same images on their system as I do. So I put together a module that wraps up access to some standard test images from the USC SIPI image database. So now, instead of telling people to go download the right set of images, all they have to do is install Data::TestImage from CPAN and it will in turn download and extract the image database to a shared location. Then you can just run

use Data::TestImage;

my $image = Data::TestImage->get_image('mandrill');

and $image will contain the path to a TIFF file with the mandrill test image. Simple!

Standard test image of mandrill

I put a couple of nice tweaks into the install process too: it doesn't install all 130 MB of the USC SIPI database by default, but you can set an environment variable and it will install only the portions you ask for.

I got some inspiration from the MATLAB Image Toolbox's built-in images and the TestImages.jl Julia package. But mine is more extensible!

Posted Mon May 26 09:19:26 2014 Tags:
Posted Sun Feb 23 08:03:46 2014 Tags:

  • This device (check out the GIFs!), called the Organ Care System, extends the viability of organs for transplant by keeping warm blood flowing through them while they are in transport. There are other systems that do the same, but this one looks more portable.

  • This Python with Braces project seems cool. Don't know if it's a joke or not, but they forked the CPython code (via Evan Lee).

  • Engineering Map of America from PBS' American Experience

  • I saw this article about how one of the prominent advocates of learning to code doesn't actually know how to code. This isn't actually that big a problem. The real problem is that too many people think that computer science is about programming computers. Ostensibly, yes, that is what you do when you program, but what you really want to learn from programming is algorithmic thinking. You want to be able to reason about logic, control flow, and state. Those skills can be learned without a computer; many of the most prolific computer scientists did not need computers to do computer science. This is why I like the CS Unplugged materials. You really should check out their videos on YouTube.

  • DeepField is a company that does big data analysis on cloud infrastructure (via this article about how Twitch.tv may have more traffic than Facebook).

  • Some more info on the Marvel Comics API from last time: apparently they use a graph DB, which makes sense for what they are doing. There is a video from the GraphConnect New York conference here. This other video is mostly the same, but there is a Q&A session at the end (@ 34:40).

  • Two links on DIY tooling. Pat Delany has been working on making open-source machine tools for cheap. His work on the MultiMachine is driven by a desire to make toolmaking tools available to anyone around the world:

    It’s strange, but at my advanced age I realize that machine tools are about all that I believe in. The lathe, shaper, and mill built the foundation of our current standard of living and there is no reason why a cheap and easy-to-build multipurpose tool could not help the 500 million people that need simple water pumps or the billion people who live on a dollar a day or less. Thanks for getting a crazy old man started.

    Here's another cheap tool: Mini circular bench saw from scrap.

  • I got quoted in this newspaper article about UH's startup accelerator, RED Labs. As I said in the article, I would really like to see more CS, engineering, and tech-related students join the program and get involved. The Computer Science Entrepreneurship Workshop+Startup Lab - RED Labs was a good start for reaching out to the CS students and there are more initiatives underway for the next semester, but we need to grow a passion for creating new things — I know it's there, but we need more expression and drive.

  • This article titled Girls and Software, while written about the gender problem faced in the software industry, had a different effect on me. It reminded me why I love the Internet and online communities. When you can "hide" your AFK identity behind a pseudonym, people don't treat you with the same AFK prejudices. I remember that I was able to converse with people much older than me and they didn't know they were talking to a 12-year-old. This was quite a freeing feeling as I could push myself to do things that you wouldn't expect from someone so young.

  • Music video: Sinine - All The Same (instrumental) (via mind.in.a.box). And it's award-winning. :-)


  • I read this article by Peter Seibel about code reading a few weeks ago. I love the idea of literate programming, but often you can't code that way because there is too much clutter. Short pieces of code like Backbone are easier to read from beginning-to-end. A comment by dmunoz on a Hacker News post about a 55-line Python task queue (thread) really sums up the sentiment nicely:

    Absolutely. I'm always pleased when documentation includes some pseudocode for what the system generally does, without the overhead of configuration, exceptional control flow, etc. It's not always possible with large systems, but makes it a lot easier to see the forest, not the trees, in even mid-sized code bases.

    (via Which code to read?)

  • This video of an autonomous boat demonstrates mapping and path planning on water. Now I'm wondering if the open-source vehicle I shared last time could be augmented with drive-by-wire to make a cheap driverless car testing platform! (via Evan Lee)

Posted Thu Feb 13 02:56:17 2014 Tags:
  • I was looking at parsing of HTML1 and came across a paper on parsing XML with regex titled REX: XML Shallow Parsing with Regular Expressions. But what's even more interesting is a project by the author called Parabix which implements parallel text processing.

  • Tethne looks like a neat Python tool for bibliographic network analysis.

  • I was reading a blog post about how horribly one-sided the Terms of Service for the Marvel API are and I came across the Swedish API License which attempts to create a license that doesn't just force developers to give up many of their rights to the API providers.

  • OpenCatalog is a list of open source projects that are funded in part by DARPA.

  • I saw this open-source car and was reminded of how I've wanted to build my own car for quite a while. Imagine the learning possibilities! There is actually a high school team in Philadelphia that works on designing and racing hybrid cars. Here's an article in IEEE Spectrum and a video from PBS Frontline. That is way cool.

  • Bioimaging consortium that connects academic and industrial partners: Cyttron.

  • I love backing up design with numbers and this user study on how people hold their mobile devices makes me happy.

  • I came across the book CMDAS: Knowledge-Based Programming for Music Research while in freenode's ##prolog. Algorithmic composition with Prolog!

  • GitHub Education is a very good idea. More students need to learn about version control and testing while in school.

  1. without and with regex (for certain values of regular :-P) ↩

Posted Tue Feb 11 17:18:30 2014 Tags:

This wiki is powered by ikiwiki.