I usually don't write about short scripts that I've written, but this one might be useful to others. Link for the impatient.
I needed to download videos from Khan Academy so that I could watch them
offline. That should be easy enough, right? The videos are hosted on
YouTube, so it should just be a matter of finding a playlist and running
get_flash_videos on all the URLs. Turns out this isn't the case: the playlists
on YouTube do not match up with all the videos on the Khan Academy website. Argh.
I could try to go through each of the sections on the website and copy the URLs
into a file, but doing that with 700 videos isn't my idea of a fun way to spend
a couple hours. I looked around for a way to download videos, but all I found
was this download page which had an
old torrent. I looked for an API and found
one that was a bit
under-documented. After trying to figure out the easiest way of using the API,
I decided that trying to unravel the 10 MB JSON file returned by
http://api.khanacademy.org/api/v1/topictree wasn't worth it [1]. Time to
scrape the site!
The final code as of this writing is here.
The scraping code in
download.pl isn't exactly great, but it does the job. It
just recursively follows children URLs and records them in a data structure,
which is written out to a file. Then
process.pl takes over and reads
the data structure. The important thing here is that the files get written out
with some way of maintaining the order of the playlist. I use the order of the
children URLs on the page to assign a numeric prefix to every directory and
file so that it will sort by name.
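That ordering trick can be sketched in Python (the real scripts are Perl, and the page-fetching part is stubbed out here with a dict; all the names below are made up for illustration):

```python
# A rough sketch of the same idea: recursively walk each page's children
# and number them in page order, so that sorting by name reproduces the
# playlist order. A dict stands in for the fetched-and-parsed pages.

def collect(tree, url, prefix=""):
    """Walk tree (url -> list of (title, child_url)) depth-first,
    prepending a zero-padded index to every title so that a plain
    name sort preserves the original order."""
    paths = []
    for i, (title, child) in enumerate(tree.get(url, []), start=1):
        name = f"{prefix}{i:03d}-{title}"
        if child in tree:                 # a section page: recurse into it
            paths.extend(collect(tree, child, name + "/"))
        else:                             # a leaf: an actual video page
            paths.append((name, child))
    return paths

# Tiny fake site standing in for the scraped pages
site = {
    "/root":    [("algebra", "/algebra"), ("calculus", "/calc")],
    "/algebra": [("intro", "/v/1"), ("equations", "/v/2")],
    "/calc":    [("limits", "/v/3")],
}
for path, video in collect(site, "/root"):
    print(path, video)
```

The zero padding matters: without it, `010-…` would sort before `002-…` in a naive name sort.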
Finally, to download the videos, I took the laziest approach: I just print out
the get_flash_videos shell commands needed to download each video and pipe
the commands into a shell. That way I didn't have to deal with any error
handling myself. Now all I need to do is finish watching these videos!
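The pattern looks roughly like this; `echo` stands in for the real `get_flash_videos <url>` invocations so the sketch runs anywhere:

```shell
# One command per video is printed, in playlist order, and the whole
# stream is piped into a shell, which runs them one by one.
# (In the real pipeline it's `process.pl | sh`, and each line is a
# `get_flash_videos <url>` call rather than an echo.)
printf 'echo fetching video-1\necho fetching video-2\n' | sh
```

A failed download doesn't stop the stream, and rerunning the pipe is the crude retry mechanism, which is exactly the error handling being delegated here.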
[1] In retrospect, it wouldn't have been that hard to use the API, but I didn't like the feel of it. A recursive data structure isn't exactly the easiest thing to run through at midnight. Below, I sketched out how I could have gotten the same results as the scraping part of my script, now that I know more about the problem. Oh well.
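A reconstruction of what that API-based sketch might look like: one recursive pass over the topictree JSON. The field names used here ("kind", "children", "title", "youtube_id") are assumptions about the v1 API response, not verified against it:

```python
# Hypothetical sketch: walk the topictree JSON (as returned by
# /api/v1/topictree) and emit every video with its section path.
# Field names are assumptions; a tiny inline tree stands in for the
# real 10 MB response.
import json

def walk(node, path=()):
    """Yield (section path, title, youtube id) for every video node."""
    if node.get("kind") == "Video":
        yield path, node["title"], node["youtube_id"]
    for child in node.get("children", []):
        yield from walk(child, path + (node.get("title", "root"),))

tree = json.loads("""{
  "kind": "Topic", "title": "root", "children": [
    {"kind": "Topic", "title": "Algebra", "children": [
      {"kind": "Video", "title": "Intro", "youtube_id": "abc123"}
    ]}
  ]
}""")
for path, title, yid in walk(tree):
    print("/".join(path), title, yid)
```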