curation and filtering of the social media firehose

In my last post, I talked about how I found that my social media use had become rather unwieldy. I posted a link to it on Facebook (of course) to generate discussion, and I got plenty of great discussion out of it. I have much more to say on this theme, so here's another post.

I decided to write a little about how the Internet was about 15 years ago when I started using it (or at least how I remember it). I split it off from this article because I was starting to meander. I make references to it in the following text, but you don't need to read it to understand what I'm discussing. Caveat lector.

[Information intensifies]

Information overload is one of those perennial topics that everyone always seems to be worrying about, but nobody does anything about. In his talk "It's Not Information Overload. It's Filter Failure.", Clay Shirky explains how it will only get worse unless we think about it differently.

To summarise, the phenomenon of information overload follows naturally from how cheaply we can disseminate information with the Internet. In the past, we had editors who had an economic incentive to be the gatekeepers of public discourse (dead trees cost money). I like the analogy of the Great Conversation where different authors respond to the thoughts of others by referencing previous works. The editorial process behind this is not very egalitarian and we have missed out on many ideas that were ahead of their time, but it does keep the quality high. Shirky argues that the problem of information overload is not going to be solved until we start thinking about information flow. The problems of both privacy and quality can be addressed in this framework.

That the talk was given in 2008 and we still don't see powerful filter features everywhere indicates that we haven't changed our mindset. Perhaps most people don't want that change? I'll admit, it is a lot of work. Managing filters is a constant struggle because not only does the content change, but so does your idea of relevance. And many times those criteria aren't even constant over the course of a single day!

But ignoring the problem doesn't make it go away. The problem of information overload is at its root one of attention scarcity¹. We simply can't look deeply at everything. When people use a search engine, they rarely go past the 7th page or so; beyond that, abandon all hope. Endless feeds are even worse: not only are there always new posts to look at, but each post can cover a different topic, which only makes it harder to choose where to direct our focus. When an entertaining post is next to something that demands more contemplation, the entertaining post will win. Deep understanding is difficult and requires patience and dedication.

Most people aren't prepared for that kind of work. But there are professionals that are trained to deal with large amounts of data: librarians. Librarians have to select new books for a collection, organise them according to a system that makes sense for their patrons, and be able to guide patrons towards good sources. This is called curation — which brings me to the first topic I want to talk about.

Curation

When I say curation, I mean the deliberate selection of information that is meant to be shared with a specific audience. This is an editorial process — in fact, every person that shares a link to an online community or posts to their social network is curating information.

The reasons for sharing content can be different for each person and not everyone is in the audience for a particular item. This can lead to clutter: people don't want to see things they aren't interested in, and they have different opinions on what is interesting and novel. This can be solved by having the curator find the appropriate audience, as is done with newsgroups on Usenet, subreddits on reddit, groups on Facebook, communities on Google+, or topics on Quora². These all allow anyone to approach an audience and get content in front of that audience.

But allowing anyone to share in these communities means you'll have to deal with spammy content. One way around that is to have moderators who approve posts (and sometimes even replies to those posts). This is a lot of work, but a good set of moderators can make a community a very enjoyable place to be. My favourite example of this is Metafilter. They consistently have higher quality discussion than most general discussion sites (see this recent post). They have some moderation of content and the user community takes an active part in flagging posts that fall below a certain level of quality, a kind of community self-policing. That joining the site requires a one-time $5 fee seems to help make sure that the people who do join are invested in the community.

Another approach is to attack the problem at the curator level, which is closest to what an editor would do. Some sites like Slashdot have user-submitted stories that go through editors who post them to the front page. All this depends on having good stories submitted by users and a dedicated group of editors to go over all the submissions. Trove (Rob Malda, founder of Slashdot, works there now) adds another layer to this and allows everyone to be an editor and curate content that is suggested by an algorithm.

Filtering

But getting content in front of an audience only takes into account what the person submitting the content thinks is appropriate for the audience. To gauge how the audience feels about the content, many services use collaborative filtering. That's what voting on stories on reddit or Hacker News is. Social networks also use this with likes/+1s and comments.

This works fine for things that are presented to large, diverse groups. However, the wisdom of the crowd does not do well when there is a bias due to groupthink. The groupthink only gets worse if most people are seeing the same types of items again and again. In my opinion, this calls into question how collaborative filtering is implemented. If you show the top-ranked items to everyone, those will slowly gather a higher ranking, while newer, lower-ranked items will not get the same chance and may disappear completely, a process that is known as preferential attachment. Others have written about how to solve this by using randomness [1], [2], [3].
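The rich-get-richer effect described above can be illustrated with a toy simulation. This is only a sketch under invented assumptions: every item is equally good, a reader upvotes any item shown with probability 0.5, and the `explore` parameter (a hypothetical knob, not taken from any of the cited posts) is the chance that a front-page slot is filled with a random item instead of a top-ranked one.

```python
import random


def simulate(num_items=20, slots=5, rounds=2000, explore=0.0, seed=0):
    """Simulate readers voting on a front page that shows `slots` items.

    All items are assumed to be equally good. With explore=0.0 the page
    always shows the current top-voted items; otherwise each slot is
    replaced by a uniformly random item with probability `explore`.
    Returns the final vote count for every item.
    """
    rng = random.Random(seed)
    votes = [0] * num_items
    for _ in range(rounds):
        # Rank items by votes, highest first (ties keep their original order).
        ranked = sorted(range(num_items), key=lambda i: votes[i], reverse=True)
        shown = ranked[:slots]
        for s in range(slots):
            if rng.random() < explore:
                shown[s] = rng.randrange(num_items)
        # Each distinct item on the page gets an upvote half of the time.
        for item in set(shown):
            if rng.random() < 0.5:
                votes[item] += 1
    return votes


if __name__ == "__main__":
    greedy = simulate(explore=0.0)
    mixed = simulate(explore=0.2)
    print("items with votes, top-ranked only:", sum(v > 0 for v in greedy))
    print("items with votes, with randomness:", sum(v > 0 for v in mixed))
```

With `explore=0.0`, the items that happen to sit on the first page keep all the attention, and the remaining items never receive a single vote even though they are just as good. Any positive `explore` value gives the other items a chance to be seen.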

What happens when there isn't a large enough group to apply collaborative filtering? This can happen when the topic is obscure or requires specialist knowledge. For example, I don't think a paper on the history of land use and agriculture will get many readers on a social media site. That's when you have to start using more information retrieval techniques. We need to start looking more at the content and how the content relates to the rest of the Internet. One of the approaches is to try to do topic modelling and assign each item a topic. Google+ currently approaches this very simply, as far as I can tell, by just looking for keywords in posts and applying a tag based on that. There was a site called Twine that used natural language processing and semantic web data representation to classify content and present that to users based on their interests. There was another site called Kiffets developed by researchers at Xerox PARC that combined many of the ideas of semantic processing, editorial curation, and collaborative filtering into a single system [4]. Both Twine and Kiffets are no longer available. Perhaps no one has figured out how to scale their approach both financially and technologically?
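To make the simplest of these approaches concrete, here is a rough sketch of keyword-based tagging. It is a guess at the general idea, not a description of how Google+ or any other service actually works; the `TOPIC_KEYWORDS` table and the `tag_post` function are made up for illustration.

```python
import re

# Hypothetical topic -> keyword table, invented for this example.
TOPIC_KEYWORDS = {
    "agriculture": {"agriculture", "crop", "crops", "farm", "farming", "soil"},
    "history": {"century", "historical", "history", "medieval"},
    "programming": {"algorithm", "code", "compiler", "python"},
}


def tag_post(text, topic_keywords=TOPIC_KEYWORDS):
    """Return the sorted list of topics whose keywords appear in `text`."""
    # Lowercase the text and split it into alphabetic words.
    words = set(re.findall(r"[a-z]+", text.lower()))
    # A topic matches if the post shares at least one word with its keywords.
    return sorted(
        topic for topic, keywords in topic_keywords.items() if words & keywords
    )


if __name__ == "__main__":
    print(tag_post("A paper on the history of land use and agriculture"))
    print(tag_post("Cute cat pictures"))
```

A matcher like this is cheap, but it only sees exact words. It cannot tell that a post about "tillage" belongs under agriculture, which is the gap that topic modelling and semantic approaches try to close.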

Whenever a tool disappears, it is a huge loss for everyone who invested time in using it. To avoid that, we need a way of sharing the information we put into the tool with other tools that can replace it. There are some formats such as the APML and FOAF specifications that try to encode interests in a way that can be shared, but these have not been widely adopted. That's not surprising, because writing a specification before getting industry support rarely works well.

What I miss from the Internet

As I mentioned in the appendix to this post, I found that the early web had many more personal sites on it. This is important because, back then, each of those sites had an essence that is getting harder to find nowadays: passion. Each person had a thing that they wanted to share with the world, something that made them stand out; sometimes they were even an expert in that niche area. Coming across a page born out of passion was like walking on a rocky beach. Each of the pages was a rock that you would pick up and see all the unique patterns on it. You could recognize it instantly even after you put it down and looked at another rock. Now, with the never-ending clamor of headlines that want you to read one page or another, it feels like the constant crashing of the waves of time has ground those rocks into sand where each grain is indistinguishable from the next.

There is no incentive to fix this. Sites like Metafilter that rely on advertising to run can't compete in a world where entertainment is what gets the most ad impressions.

I'm not an expert on media studies, but I'm very interested in how media drives society. I plan on reading Neil Postman's Amusing Ourselves to Death sometime. It appears closely related to the idea of how entertainment has stifled public discourse. This isn't necessarily the most important problem that the world is facing, but since we live together on this planet, we must be able to understand one another as rationally as possible, with as many facts laid bare³. It seems that every new communication technology promises to connect humanity together, but we need to closely examine what kinds of relationships we are building.

To conclude, I want to talk about what I would like to do to address this problem (because solving problems is my passion). I had previously worked on a tool to let me read social media in a single format for ease of access. My work then was based around creating a protocol to share the data, but now I think it is more important to work on filtering. I'm going to try and will keep posting updates with my results.

¹ For more on attention, I recommend reading "Scrolling Forward" by David M. Levy. I reviewed it here. He specifically addresses the differences in attention between reading on paper versus reading on screen.
² Tagging (and, in a way, Google+ circles) is a more free-form version of this.
³ I'm also interested in the related topic of diffusion of innovations and how the Internet can help the rate of adoption of new ideas.

Bibliography

[1] Luu, Dan: Why HN Should Use Randomized Algorithms. 04 Oct 2013.

[2] Stucchio, Chris: Why Hacker News should use a different Deterministic Algorithm, NOT a Random One. 02 Dec 2013.

[3] Marlin, B., Zemel, R., Roweis, S., and Slaney, M.: Collaborative Filtering and the Missing at Random Assumption. 20 Jun 2012. (Note: see more on Benjamin M. Marlin's research page).

[4] Stefik, M., and Good, L.: Design and Deployment of a Personalized News Service. AI Magazine 33.2 (2012): 28.