Crossposted from blogs.perl.org.
Hi everyone, this is my first blog post on here (Gabor Szabo++ for reminding me that I need to blog!). Last week, I posted to the Houston.pm group inviting them to come out to the local City of Houston Open Innovation Hackathon.
I was planning on attending since I first heard about it a couple of weeks ago, but I saw this also as an opportunity to build cool things in Perl and show what Perl can do.
To those who don't interact much with the Perl community, Perl is either completely invisible or seen only in the form of short scripts that each do a single task. Showing people by doing is a great way to demonstrate how much Perl has progressed in recent years with all the tools that have been uploaded to CPAN, and that Perl systems can grow beautifully beyond 1,000 SLOC.
Going to hackathons like these also makes for a great way to network with the local technology community and see what types of problems they are interested in solving. I would love to see what approaches they take in their technology stacks and whether those approaches can be adapted and made Perlish.
A few months back, I read Vyacheslav Matyukhin's announcement of the Play Perl project, and I immediately thought that the idea of a social network for programming was a sound one1, but I was not sure whether it would just languish and become another site that nobody visits. The recent release and widespread adoption of the site by the Perl community gives me optimism about the kind of cooperation it can bring to Perl. Play Perl brings a unique way to spring the community into action, which excites me. I would like to note that these ideas are not unique in application to Perl, but could be used for any distributed collaborative effort.
There already exist tools for open-source where we can share tasks and wishlist items: either project specific such as bug trackers or cross-project such as OpenHatch, 24 Pull Requests, and the various project proposals made for GSOC. What does Play Perl add to what's already out there?
Firstly, it frees the projects from belonging to a single person. People already create todo lists with projects that they would like to work on, but these todo lists usually remain private. In an open-source environment, this can make it difficult to find other people that might be interested in helping those ideas get off the ground. With Play Perl, the todo list item (or “quest”) no longer has to stay with the person that came up with it. Anyone can see what may need to be done and if they have enough round tuits, they can work on it. A quote that I recently read on the Wikipedia article for Alexander Kronrod sums up how I see this:
A biographer wrote that Kronrod gave ideas "away left and right, quite honestly being convinced that the authorship belongs to the one who implements them."
In this sense, Play Perl is an ideas bank, but it is much more than that. By allowing these ideas to be voted on, it allows you to choose which one to work on first. You can prioritise your time based on what would be most beneficial to the community — a metric that is difficult to ascertain on your own.
Secondly, with the gamification2 of collaboration, the process of implementing ideas becomes part of a feedback loop — we can introduce positive reinforcement for our work through the interaction with the community. This feedback loop process already happens through media such as mailing lists and IRC (karmabots, anyone?). Play Perl quantifies this and lets us see how much our contributions help others.
The last and possibly the most important aspect of Play Perl that leads me to believe in its long-term success is its focus on tasks from a single community. Everyone in the community can quickly see what the others are working on in a way that is hard to do with blogs or GitHub3 due to granularity. Ideas often get lost with time, but more eyes can ensure that they get implemented. I frequently browse the latest CPAN uploads to look for interesting modules and I find myself following the feed on Play Perl the same way. I can justify the time spent browsing through all this activity on both sites because I know that there is likely an item of interest (high probability of reward) and each item is a blip of text that I can quickly scan through (low cost to reading each item). Looking at modules lets me see what code already exists, but Play Perl lets me see what code will probably exist in the future. Providing this new view on the workflow of open-source development is empowering because it provides a channel for the free flow of a specific kind of information that was previously trapped in other less-visible media. Having easy access to this information means we can interact with it directly at a scale that best fits the message.
I look forward to seeing Play Perl flourish in the coming months.
This blog post brought to you by Play Perl. ;-)
It has worked for GitHub, hasn't it? ↩
GitHub should implement custom lists of people/projects to filter the activity like on Twitter. ↩
About a year ago (October 2011), I wrote a small tool (git repo) that has really made using my e-mail a much more enjoyable experience. My personal e-mail inbox is on Google's Gmail service; however, I find the web interface gets in the way of reading and organising my e-mail. I make heavy use of the label and filter features that let me automatically tag each message (or thread), but having labels that number in the hundreds gets unwieldy since I can not easily access them. I use the IMAP interface to reach my inbox through the mutt e-mail client; this is fast because there is almost no interface and I can bind many actions to a couple of keystrokes.
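For context, the mutt side of this setup is small. Here is a hypothetical ~/.muttrc fragment showing the kind of configuration I mean (the account details are made up, and the X-Label line only becomes useful once something injects that header):

```
# Hypothetical ~/.muttrc fragment: read Gmail over IMAP from mutt.
set folder    = "imaps://imap.gmail.com/"
set spoolfile = "+INBOX"
unignore X-Label:    # show label headers in the pager once a proxy injects them
```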
The main problem I had with using IMAP was that, although I could see all the labels presented as IMAP folders, I had no way to know which labels were used on a particular message that was still in my inbox. I had thought about this problem for a while and looked around to see if anyone had made proxies for IMAP, but there was not very much information out there. I had originally thought that I would need to keep a periodically updated database of Message-IDs and labels which I would query from inside mutt, and I had in fact written some code that would get all the Message-IDs for a particular IMAP folder, but this was a slow process. I didn't look into it again until I was talking about my problem with a friend (Paul DeCarlo) and he pointed me towards the Gmail IMAP extensions. This was actually going to be possible!
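The extension that makes this possible is the X-GM-LABELS attribute for the FETCH command. An illustrative exchange (the sequence number and labels here are invented) looks like this:

```
C: a1 FETCH 42 (X-GM-LABELS)
S: * 42 FETCH (X-GM-LABELS (\Inbox "perl" "mutt"))
S: a1 OK Success
```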
I quickly put together a pass-through proxy that would sit between the mutt client and the Gmail IMAP server. Since Gmail's server uses SSL with IMAP (i.e. IMAPS) to encrypt the communication, I would need to get the unencrypted commands from mutt and then encrypt them before sending them to Gmail. Once I had this, I could log the commands to a file and study which IMAP commands mutt was sending. At the same time, I had a window open with IETF RFC3501, the IMAP specification document, so that I could understand the structure of the protocol. Once I saved the log to a file, I didn't actually need a network to program the filter that I was writing — in fact, I finished the core of the program on a drive back from the ACM Programming Contest in Longview, TX! When I got home, I tested it and it worked almost perfectly except for another IMAP command that mutt was sending that was not in my log, but that was just a small fix.
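The heart of the filter is a small transformation: when a FETCH response carries an X-GM-LABELS attribute, turn those labels into a header the client can display. A minimal sketch of that step (the function name and the choice of X-Label as the header are my own; the real proxy does considerably more protocol bookkeeping):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a Gmail X-GM-LABELS attribute value such as '(\Inbox "perl" "mutt")'
# into an X-Label header and prepend it to the raw message text.
sub add_label_header {
    my ($labels_att, $message) = @_;
    # Pull out both quoted labels ("perl") and system flags (\Inbox).
    my @labels = grep { defined && length }
                 $labels_att =~ /"([^"]+)"|(\\\w+)/g;
    return 'X-Label: ' . join(', ', @labels) . "\r\n" . $message;
}

print add_label_header('(\Inbox "perl" "mutt")',
                       "Subject: test\r\n\r\nbody\r\n");
```

With the label list flattened into a single header, mutt can display it like any other header via `unignore X-Label:`.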
Not very long after I first published the code on GitHub, it was mentioned on the mutt developers mailing list on 2012-02-01.
Todd Hoffman writes:
I read the keywords patch description and it definitely sounds useful. One reminder is that gmail labels are not in the headers and can only be accessed using the imap fetch command. Users of offlineimap or some other mechanism to retrieve the messages and put them in mboxes or a maildir structure will not be able to extract them from headers, in general. Of course, a patched version of offlineimap or an imap proxy (see https://github.com/zmughal/gmail-imap-label) that assigns the labels to a header could be used also.
It was good to know that people were finding my project. Also, over the summer, I received my first ever bug report from Matthias Vallentin, so I knew that somebody else actually found it useful. \o/ This was a great feeling, because it closed the loop on the open-source model for me: I was finally contributing code and interacting with other developers.
In one of my projects, I needed to trace HTTPS requests in order to understand the behaviour of a web application. Since the data is encrypted, it can not be read using the default configuration of the tools that I normally use to inspect network data. This post details how to quickly set up an SSL proxy to monitor the encrypted traffic.
When debugging or reverse engineering a network protocol, it is often necessary to look at the requests being made in order to see where they are going and what type of parameters are being sent. Usually this is simple with a packet capture tool such as tcpdump or Wireshark when the protocol is being sent in plaintext; however, it is more work to capture and decode SSL packets, as the purpose of this protocol layer is to prevent exactly the type of eavesdropping that is accomplished in man-in-the-middle attacks.

SSL works on the basis of public certificates that are issued by trusted organisations known as certificate authorities (CAs). The purpose of the CAs is to sign certificates so that any clients that connect to a server with a signed certificate can trust that they are connecting to an entity with verified credentials; therefore, a certificate can only be as trustworthy as the CA that signed it. Operating systems and browsers come with a list of CA certificates that are considered trustworthy by consensus; so, in order to run a server with verifiable SSL communication, the owners of that server need to get their certificate signed by one of the CAs in that list. Any traffic to and from that server will then be accepted by the client in encrypted form.
Trying to capture these encrypted packets would require you to have the private key of a trusted CA; however, we can get around this by installing our own CA certificate and using a proxy that signs certificates using that CA certificate for every server that the client connects to. We can accomplish this by using mitm-proxy.
It is written in Java and comes with a CA certificate that you can use right away, which makes it straightforward to set up.
Once you download and extract the software, you have to add the fake CA certificate into Firefox. I prefer to set up a new session of Firefox so that the configuration will use a separate database of certificates from my usual browsing session. You can create and start a new session using the command
firefox -P -no-remote
Installing the certificate
When this new session starts up, add the certificate by going into the Preferences menu of Firefox, opening the options under Advanced » Encryption, and selecting the View Certificates button. In the Authorities tab, click the Import button and select the file FakeCA.cer from the mitm-proxy directory. Once you add the certificate for identifying websites, you should see it in the list of authorities.
Running the proxy
You are now ready to run the proxy. A shell script called run.sh is contained in the mitm-proxy directory; by examining it1, you can see that it starts a proxy on localhost:8888 using the fake CA certificate and that it will log the HTTPS requests to output.txt. You need to add this proxy to your Firefox instance by going to Advanced » Network » Settings and adding the information under the SSL proxy configuration.
Once you start the server, you can test it by going to HTTPSNow, a website that promotes HTTPS usage for secure browsing. Now, by running

tail -f output.txt

you can see the HTTPS requests and responses as they are sent.
You should examine all shell scripts you download for security reasons. You do not want to inadvertently delete your $HOME directory! ↩
I recently had to rewrite some of the code in my scraping software for Blackboard because newer versions of Mozilla Firefox1 were no longer interacting well with it. The decision to use WWW::Mechanize::Firefox was primarily prompted by ease of development. By being able to look at the DOM and match elements using all of Firefox's great development tools such as Firebug, I was able to quickly write XPath queries to get exactly the information I needed. Plus, Firefox would handle all the JavaScript for me. However, it was slow, and it became difficult to really know if something completed successfully, which made me hesitant about putting it in a cronjob. It worked, but there was something kludgey about the whole thing.
That solution worked last semester, but when I tried it at the beginning of this semester, things started breaking down. At first, I tried working around the changes, but then I went looking for alternatives. What I found was WWW::Scripter. It has a JavaScript plugin that can use either of two engines: the pure-Perl JE and Mozilla's SpiderMonkey. I had tried using WWW::Scripter before, but had encountered difficulties with compiling the SpiderMonkey bridge. This time I gave the JE engine a try and I was surprised that it worked flawlessly on the site I was scraping.
After fixing up my code, I can see a few places where
WWW::Scripter could become a better tool:
- Add a plugin that makes managing and using frames easier.
- Create a tool to make it possible to view and interact with the rendered page as you go along. This will really make it easier to debug and to try things out in a REPL.
- Integrate with WWW::Mechanize::TreeBuilder so
that I can use HTML::TreeBuilder::XPath
immediately. As far as I can tell, all that needs to be added to
I have been recently working on and dogfooding a project that I call NNTP::Portal. It is meant to be a way to merge the "oldskool" world of newsreaders with content retrieved from other sources, mainly the World Wide Web.
Usenet is a network originally developed in the early 1980s that provides a distributed system for the delivery of messages, called articles, between servers that carry these messages (collectively called a news feed) in hierarchical, topic-based groups called newsgroups. Much of the jargon and many of the behaviours of Internet communication originated with or were developed by users of this network. Users of Usenet read and reply to articles using software called newsreaders. You can see an example of both the messages and the interfaces from the early days of Usenet here. It is also worth noting that most of the software that runs the Internet today was first announced via a Usenet posting.
Usenet's distributed nature allows for mostly free rein in terms of content, so there are inevitably problems that arise from having a flood of articles, many of which could be spam. Newsreaders have developed means to address this by adding features that can filter out unimportant messages; this is usually accomplished through scoring or kill files. Some other methods have been developed as well, but they are not as widely used.
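As an example of scoring, slrn reads a score file at startup; a hypothetical entry (the group and patterns here are invented) that buries an obvious spam subject and boosts a favourite poster might look like this:

```
% Hypothetical slrn score-file entries.
[comp.lang.perl.misc]
Score: -9999
        Subject: MAKE MONEY FAST

Score: 100
        From: someone-i-like@example\.org
```

Articles whose score drops low enough are hidden from the article list entirely, which is how kill files keep high-traffic groups readable.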
The Network News Transport Protocol is a standardised protocol developed in the mid-1980s to facilitate the communication of Usenet traffic both between peering servers and between a server and a client. It is described in RFC 977 and has been updated with extensions over the years.
Today, in the year 2011, the major protocol used online is by far HTTP, which is the means of transportation for data on the World Wide Web. As you may be aware, this data is typically encoded in the HyperText Markup Language, or HTML, along with several other technologies which determine how a Web browser will display the data.
Today's Web sites are moving toward giving users more ways to participate in the generation and consumption of media, bringing the ability to interact with pages through uploads, collaborative editing, ratings, recommendations, and comments. However, due to the client-server nature of HTTP, these interactions generally stay on one Web service, and the only way to access the data is by going to that Web site. As a response to these incompatibilities between sites, Web feed1 formats and APIs have been used to get different Web services to talk to each other2. Many people have called these tools the pipes of the Web, in reference to Unix pipes, because they allow you to create a chain of transformations to get from one set of data to another.
This is great, but many of the efforts I have seen that go through the work of moving data around are about either showing a cute visualisation or bringing that data to the desktop or a mobile device. Each of these efforts has a different interface as well as a different level of interaction. Few have any extensibility to speak of; most can only do a fixed action with the data. These are useful, but severely limiting when it comes to what can be done with a computer. The lack of a standard interface is not only confusing for users (it takes time to navigate and learn a new interface), but also negligent of accessibility needs. Missing extensibility means that any data not exposed via a feed/API will remain inside the browser or application.
Let us jump back for a moment. In the early days of computing, the fastest way to interact with a computer was through the keyboard. This is still true today for certain kinds of operations, specifically those involving text. As such, newsreaders were (and still are) largely keyboard-driven, allowing users to select and scan through large numbers of articles efficiently. Typically, these newsreaders had ways of interacting with their environment through a built-in scripting language or pipes. This way, by extending the reach of newsreaders, one could build a toolkit that worked exactly how you wanted it to. For example, one could write a script that, at a single keystroke, would grab a source code listing out of an article, compile it, and place it in the executable path. You could have such a script for every key on your keyboard. In the end, you are able to create an interface that matches the way you work.
This is what I want to bring to the Web. I'm not the only one.
I decided to work with the NNTP protocol because:
- It is standardised, so I do not need to write a specification or a client.
- The standard is simple to implement.
Newsreaders have a history of working with large conversations, and modern newsreaders have excellent threading support. There is also the much-needed ability to mark messages as read so that they are out of the way. Another action I find lacking on Web sites is the ability to postpone a message, i.e. to save a reply so that I can return to it later. I have lost many lengthy replies because I accidentally switched to another page.
Newsreaders have these features and many more that facilitate two-way communication.
- It uses the Internet Message Format, which has the advantage of being human-readable and, through MIME, of allowing extra data to be embedded in the message if necessary.
- A scripting language that has an extensive library of modules (CPAN), as well as powerful text-processing tools (most notably, regular expressions).
- An object framework for Perl that simplifies working with attributes, roles, and meta-objects.
- A framework for event-driven programming. It abstracts away much of the code that is common in network programming. I wish I had known about it when I was first learning the socket API.
- A database abstraction library. It provides a consistent set of function calls whenever you need to connect, query, and retrieve data from a database.
- A small SQL database that requires no configuration. I chose SQLite as my first database backend for this very reason, but I may test other databases such as BerkeleyDB.
- A module for manipulating message headers and bodies.
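To make the message-format point concrete, here is a hand-rolled sketch of what a plugin-generated article could look like as plain Internet Message Format text. The newsgroup, addresses, and Message-ID are invented for illustration; the real server builds its headers with a proper message module rather than by string concatenation:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An NNTP article is just RFC 5322-style headers, a blank line, and a body.
my @headers = (
    [ 'From'       => 'portal@example.org'                 ],
    [ 'Newsgroups' => 'portal.web.feed'                    ],
    [ 'Subject'    => 'Hello from a hypothetical plugin'   ],
    [ 'Message-ID' => '<20111015.0001@portal.example.org>' ],
);
my $body    = "An article body scraped from a Web source.\r\n";
my $article = join('', map { $_->[0] . ': ' . $_->[1] . "\r\n" } @headers)
            . "\r\n"
            . $body;
print $article;
```

Because the format is plain text, anything a plugin scrapes from the Web can be wrapped this way and served to any standards-compliant newsreader unchanged.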
Right now, there is not much to the server. All it does is take NNTP commands and periodically request new messages from plugins. I am currently playing with the idea of using roles to implement extra features for each database backend, such as the OVERVIEW capability, which sends a set of message headers to the client in a tab-delimited format.
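An illustrative overview exchange might look like the following (the article numbers and values are invented; the tab-separated fields follow the standard overview order of number, Subject, From, Date, Message-ID, References, byte count, and line count):

```
C: XOVER 3000-3001
S: 224 Overview information follows
S: 3000	Re: hello	"A. User" <a@example.org>	Sat, 15 Oct 2011 12:00:00 +0000	<m1@example.org>	<m0@example.org>	1024	20
S: 3001	Re: hello	"B. User" <b@example.org>	Sat, 15 Oct 2011 12:05:00 +0000	<m2@example.org>	<m1@example.org>	980	18
S: .
```

This is what lets a newsreader thread and summarise thousands of articles without fetching their bodies.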
The current server layout is shown below
As you can see, it is rather simple, but it works. The only issue is that the updating is not real-time. One reason for this is that POE uses a single-threaded model, so future versions of the server will be split into separate threads that use the message database concurrently.
The future server layout design so far is shown below
The purpose of the RPC server is to have a way for a user to query the state of plugins and retrieve specific information (e.g. notifications) that can not be done over the NNTP protocol alone (out-of-band).
The idea behind the job queue is that each plugin will be able to post jobs for plugin-specific workers to process. These jobs will most likely be scheduled so that they can poll for updates optimally (the specifics of which I have not worked out yet).
Two parts of the server design that I have not yet figured out are the message updating and the article posting mechanisms. By message updating, I mean when the message changes in the original source. Normal NNTP articles are meant to have immutable bodies, so I will need to see what is the best way to present these changes to the user as well as how the plugins will handle them. The other issue, article posting, has many parts to it, including whether the user has permission to post and how to indicate this without wasting the user's time.
Well, I have to show what the program's output looks like, so here it is in the slrn newsreader:
Currently, the database uses 1.6 KiB/message. I have also noticed that the Graph API can not retrieve messages from certain people. This is a permissions problem, so I will need to work on getting a way around it3. I would also like to tag people in posts on Facebook, but the API does not have a way of doing this, as far as I can tell.
Well, I wanted to do some software engineering and I've recently discovered how much I like writing network software (this is the third server I've written in the past eight months and the first non-trivial one). Also, I've been really frustrated with the trends towards moving applications to the Web. One has to ask, how many of these applications can you really trust? How many let you see the source to the backend so that you can evaluate security? Few do4. In addition, one rarely gets to pull all the data from these services and sometimes the services have put in place terms of service that are hostile5 to third-party users. I want to see how far this project will take me.
I may even try to develop my own distributed service on top of NNTP. It will certainly be more lightweight and extensible than the current webservice offerings.
And outside the browser:
Confusingly, these are also called newsfeeds. The similarity does not end there, as clients that pull Web feeds are also called news readers. Furthermore, it is entirely possible to use an NNTP newsreader to read Web feeds. ↩
So, I'm finally going to put some content on here. Hopefully it can keep you informed of what I am working on or thinking about at a given moment.