I usually don't write about short scripts that I've written, but this one might be useful to others. Link for the impatient.

I needed to download videos from Khan Academy so that I could watch them offline. That should be easy enough, right? The videos are hosted on YouTube, so it should just be a matter of finding a playlist and running get_flash_videos on all the URLs. Turns out this isn't the case: the playlists on YouTube do not match up with all the videos on the Khan Academy website. Argh.

I could try to go through each of the sections on the website and copy the URLs into a file, but doing that with 700 videos isn't my idea of a fun way to spend a couple of hours. I looked around for a way to download videos, but all I found was this download page which had an old torrent. I looked for an API and found one that was a bit under-documented. After trying to figure out the easiest way of using the API, I decided that trying to unravel the 10 MB JSON file returned by http://api.khanacademy.org/api/v1/topictree wasn't worth it [1]. Time to scrape the site!

The final code as of this writing is here. The scraping code in download.pl isn't exactly great, but it does the job. It just recursively follows child URLs and records them in a data structure which is written out to ka-data.json. Then process.pl takes over and reads the data structure. The important thing here is that the files get written out with some way of maintaining the order of the playlist. I use the order of the child URLs on the page to assign a numeric prefix to every directory and file so that everything sorts by name.
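To make the numeric-prefix idea concrete, here is a minimal sketch in Python rather than the Perl of process.pl. The layout of the tree (a `title`, an optional `children` list, and a `url` on leaves) is an assumption for illustration and may not match what ka-data.json actually contains.

```python
def ordered_paths(node, parent=""):
    """Yield (path, url) for every leaf, prefixing each name with its position."""
    children = node.get("children", [])
    # Zero-pad so that "10-..." sorts after "09-..." when sorting by name.
    width = max(2, len(str(len(children))))
    for index, child in enumerate(children, start=1):
        name = f"{index:0{width}d}-{child['title']}"
        path = f"{parent}/{name}" if parent else name
        if child.get("children"):
            # A section: recurse and keep building the path.
            yield from ordered_paths(child, path)
        else:
            # A video: emit its ordered path and URL.
            yield path, child["url"]


# Hypothetical sample data; titles and URLs are placeholders.
TREE = {
    "title": "root",
    "children": [
        {"title": "Algebra", "children": [
            {"title": "Intro", "url": "http://example.com/intro"},
            {"title": "Equations", "url": "http://example.com/equations"},
        ]},
        {"title": "Geometry", "children": [
            {"title": "Angles", "url": "http://example.com/angles"},
        ]},
    ],
}

if __name__ == "__main__":
    for path, url in ordered_paths(TREE):
        print(path, url)
```

The zero padding is the part that matters: without it, a plain name sort would put item 10 ahead of item 2.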

Finally, to download the videos, I took the laziest approach. I just print out the get_flash_videos shell commands needed to download each video and pipe the commands into a shell. That way I didn't have to deal with any error handling myself. Now all I need to do is finish watching these videos!
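The "print the commands and pipe them to a shell" trick looks roughly like the sketch below, again in Python instead of the original Perl. The directory names and video URLs are made up, and the script name `print_commands.py` is hypothetical. It assumes get_flash_videos saves into the current directory when given only a URL, so each command changes into the target directory first.

```python
import shlex


def download_commands(videos):
    """Yield one shell command per (directory, url) pair."""
    for directory, url in videos:
        d = shlex.quote(directory)  # protect spaces and shell metacharacters
        u = shlex.quote(url)        # the '?' in a YouTube URL needs quoting
        # The subshell keeps the cd from leaking into the next command.
        yield f"mkdir -p {d} && (cd {d} && get_flash_videos {u})"


# Hypothetical sample data; VIDEO_ID_n are placeholders, not real videos.
VIDEOS = [
    ("01-Algebra", "http://www.youtube.com/watch?v=VIDEO_ID_1"),
    ("02-Linear equations", "http://www.youtube.com/watch?v=VIDEO_ID_2"),
]

if __name__ == "__main__":
    for command in download_commands(VIDEOS):
        print(command)
```

Running something like `python print_commands.py | sh` then executes the commands one after another. A failed download just prints its error and the shell moves on to the next line, which is why no error handling is needed in the script itself.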

  1. In retrospect, it wouldn't have been that hard to use the API, but I didn't like the feel of it. A recursive data structure isn't exactly the easiest to run through at a midnight hour. Below, I sketched out how I could have gotten the same results as the scraping part of my script now that I know more about the problem. Oh well.
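For readers curious what walking such a recursive structure involves, here is a rough illustration in Python. It is not the author's sketch. The field names (`kind`, `title`, `children`, `youtube_id`) are assumptions about what the topictree response looks like, and the sample data is invented so that the example runs without touching the network.

```python
def iter_videos(node, trail=()):
    """Depth-first walk yielding (topic trail, title, youtube_id) per video."""
    if node.get("kind") == "Video":
        yield trail, node["title"], node["youtube_id"]
        return
    # Extend the trail with this topic's title, if it has one.
    here = trail + (node["title"],) if node.get("title") else trail
    for child in node.get("children") or []:
        yield from iter_videos(child, here)


# Hypothetical miniature of the topic tree; ids are placeholders.
SAMPLE = {
    "kind": "Topic",
    "title": "Math",
    "children": [
        {"kind": "Topic", "title": "Algebra", "children": [
            {"kind": "Video", "title": "Intro", "youtube_id": "aaa"},
            {"kind": "Video", "title": "Equations", "youtube_id": "bbb"},
        ]},
        {"kind": "Video", "title": "Welcome", "youtube_id": "ccc"},
    ],
}

if __name__ == "__main__":
    for trail, title, youtube_id in iter_videos(SAMPLE):
        print("/".join(trail), title, youtube_id)
```

Because the walk visits children in list order, the order of the output follows the order of the tree, which is the same property the scraper relies on.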

    Posted Wed May 28 09:02:18 2014