Welcome to the modern world. It's big. It's scary. At its center is the internet, a tool of such massive utility that everyone tries to use it. As a result, it overflows with an inconceivable vastness of information... and is rendered useless. Thus we have the paradox of the internet: the more valuable it becomes, the harder it is to utilize.
This is embodied by the concept of the signal-to-noise ratio. It may sound unfamiliar to most of you, but it is far from foreign to your lives. When a phone conversation is drowned out by static, the noise has overpowered the signal. When you can't hear your friend's joke in a crowded restaurant, it is this ratio coming into play. Signal is the desired message. Noise is anything that interferes with that message.
Take spam, for instance. Spam is noise in an e-mail inbox. A large percentage of the world's daily e-mail volume is composed of junk mail. I personally get about 13 spam messages every day, which sadly outweighs the number of legitimate messages I receive. Since it would be unacceptable to manually sort hundreds of messages weekly, the spam filter was invented.
A typical spam filter checks each message for spam keywords such as 'Viagra' or 'stocks'; it checks the sender's e-mail address against a list of known spam senders; it checks the number of links; and it performs various other tests. Each failed test lowers the message's score, and a message with a low enough score is thrown into the spam folder. A competent filter will let through only one or two spam messages every day, well within the bounds of what can be handled manually.
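To make the scoring idea concrete, here is a minimal Python sketch of a rule-based filter along these lines. Everything specific in it is an assumption made for illustration: the keyword set, the sender blocklist, the weights, the link limit, and the threshold are all hypothetical, and real filters run many more tests.

```python
# Hypothetical rule-based spam scorer. The keywords, blocklist, weights,
# link limit, and threshold are illustrative assumptions, not real values.
SPAM_KEYWORDS = {"viagra", "stocks"}
KNOWN_SPAMMERS = {"deals@spam.example"}
MAX_LINKS = 3
SPAM_THRESHOLD = -5


def score_message(sender: str, body: str) -> int:
    """Start at zero and subtract points for every test the message fails."""
    score = 0
    text = body.lower()

    # Test 1: each occurrence of a spam keyword costs 3 points.
    words = [w.strip(".,!?:;") for w in text.split()]
    score -= 3 * sum(1 for w in words if w in SPAM_KEYWORDS)

    # Test 2: a sender on the blocklist costs 5 points.
    if sender.lower() in KNOWN_SPAMMERS:
        score -= 5

    # Test 3: more links than the limit costs 2 points.
    links = text.count("http://") + text.count("https://")
    if links > MAX_LINKS:
        score -= 2

    return score


def is_spam(sender: str, body: str) -> bool:
    """A message whose score is low enough goes to the spam folder."""
    return score_message(sender, body) <= SPAM_THRESHOLD
```

With these made-up weights, a single suspicious word is not enough to condemn a message; it takes several failed tests together to push the score under the threshold.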
Spam filters reside at the simplest level of noise reduction. A message is either spam or not, and a spam message is bound by the necessity of carrying its spam content, which makes it relatively easy to sort out from legitimate messages. There are more interesting signal-to-noise problems.
Take, for example, the search engine. Like the spam filter, it is rule-driven. In the case of an internet search, however, it isn't so easy to determine what to include and what not to include. Undesirable sites are not necessarily malicious; they just don't happen to fulfill the needs of the searcher. To make it even more interesting, order becomes important. Search engines strive to put the desired result on the first page, because they have found that few people actually look at subsequent results.
Ultimately, though, these problems are solved in a similar way to the spam filter. Various factors award points to a result. These may include the number of times the search terms appear, whether they appear in the title or simply in a paragraph, and the popularity of that particular page. The results are sorted based on their overall score.
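The point-awarding scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Page` type, the three weights, and the idea of a popularity value between 0 and 1 are all hypothetical, and no real search engine is this simple.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    body: str
    popularity: float  # assumed to be normalized to the range 0..1


# Hypothetical weights: a hit in the title counts for more than a hit
# in a paragraph, and popularity adds a bonus on top.
TITLE_WEIGHT = 5.0
BODY_WEIGHT = 1.0
POPULARITY_WEIGHT = 10.0


def score(page: Page, terms: list[str]) -> float:
    """Award points for each search-term occurrence, plus popularity."""
    title_words = page.title.lower().split()
    body_words = page.body.lower().split()
    points = 0.0
    for term in terms:
        t = term.lower()
        points += TITLE_WEIGHT * title_words.count(t)
        points += BODY_WEIGHT * body_words.count(t)
    return points + POPULARITY_WEIGHT * page.popularity


def rank(pages: list[Page], terms: list[str]) -> list[Page]:
    """Sort results by overall score, best first."""
    return sorted(pages, key=lambda p: score(p, terms), reverse=True)
```

Note that nothing is thrown away here, unlike in the spam filter. Every page gets a score, and the score only decides the order in which results are shown.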
Our third case is the comments section of a weblog (or any online forum, really). Here, there are no search terms. Spam comes into play, but so does inanity. Who wants to read ten comments that say the same thing, or comments that don't add anything to the conversation? To increase the signal-to-noise ratio, comments must be filtered by quality.
How can this be done? After all, if there's one thing that computers are bad at, it's understanding language. Language is far too haphazard and anarchistic. Then there is the abundance of synonyms and equivalent phrasings, metaphors and idioms. Computers just don't understand things — they process binary code. Perhaps the biggest problem in artificial intelligence is to mimic the understanding of concepts. The state of the art isn't there yet.
Even simple measurements like comment length are useless here; the deepest insight can be expressed by a single word, whereas twelve rambling paragraphs might say nothing. It is a well-established fact that neither refined vocabulary nor a knack for eloquence can ensure worthy discourse. Very little that a computer can easily measure is of use.
Which means that it's hopeless, right? Of course not! What is the internet but a network of external devices that have a knack for language processing? People, in other words. You. Me. Aunt Mabel. We are all vastly superior to computers at categorization, understanding language, and playing Go.1 Okay, I'm not actually any good at Go, but some people are.
In the past few years, we have seen the dominance of the collaborative reference site Wikipedia. It is one of the most important sites on the web. There are approximately 2.3 million articles in the English Wikipedia alone, driven entirely by user-submitted content and editing. It covers everything from popular new bands to esoteric mathematical theorems, and operates at a surprisingly high level of quality.2
How is this possible?
The power of the masses is always a bit startling. Give enough people a social incentive to do something that they find interesting, and many of them will do it. Wikipedia relies on devoted contributors who manage small areas of interest and expertise. These contributors provide the necessary quality control over the raw power of collaborative content.
This raw power is already being used to filter information. Look at Slashdot, a popular geek news site. The comment threads can be filtered by ratings given by a host of moderators. This system has proven remarkably effective at pushing the best comments to people's attention, and removing the worst ones from the discussion entirely. The signal-to-noise ratio remains high.
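Rating-based filtering of this kind is simple to express in code. The Python sketch below is loosely modeled on the idea and is not Slashdot's actual implementation: the `Comment` type, the starting score, the score bounds, and the reader-chosen threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed bounds on a comment's score; hypothetical values for this sketch.
MIN_SCORE = -1
MAX_SCORE = 5


@dataclass
class Comment:
    text: str
    score: int = 1  # assumed starting score for a new comment


def moderate(comment: Comment, delta: int) -> None:
    """Apply one moderator's rating (+1 or -1), keeping the score in bounds."""
    comment.score = max(MIN_SCORE, min(MAX_SCORE, comment.score + delta))


def visible(comments: list[Comment], threshold: int) -> list[Comment]:
    """Return only the comments rated at or above the reader's threshold."""
    return [c for c in comments if c.score >= threshold]
```

The computer never has to understand a single comment. People supply the judgment, and the code only adds up their ratings and hides whatever falls below the line.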
Another example of the power of groups is the social bookmarking site Digg. A user of Digg will post a link to, say, a picture of Niagara Falls. If another person on Digg likes the picture, they may give it a point. Eventually these points add up, and the picture ends up on the front page.
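The vote-counting mechanism can be sketched in Python as well. This is a hypothetical model, not Digg's real algorithm: the `Submission` class, the one-vote-per-user rule, and the `min_points` and `limit` parameters are assumptions for the sake of the example.

```python
class Submission:
    """A posted link together with the set of users who gave it a point."""

    def __init__(self, url: str) -> None:
        self.url = url
        self.voters: set[str] = set()

    def vote(self, user: str) -> None:
        # A set ignores repeat votes, so each user counts at most once.
        self.voters.add(user)

    @property
    def points(self) -> int:
        return len(self.voters)


def front_page(submissions: list[Submission], min_points: int,
               limit: int = 10) -> list[Submission]:
    """Pick submissions with enough points, most popular first."""
    popular = [s for s in submissions if s.points >= min_points]
    return sorted(popular, key=lambda s: s.points, reverse=True)[:limit]
```

Storing voters as a set rather than a bare counter is a deliberate choice in this sketch, since it keeps one enthusiastic user from pushing a link to the front page alone.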
Collaborative content and filtering are good solutions for the here and now, but the big frontier for information is artificial intelligence.
Artificial intelligence is an ambiguous concept. In science fiction, it means a non-biological intelligence that could rival or surpass our own. Practically, though, it means something very different: the science of mimicking our own intelligence.
This is important. It's a lot easier to get a computer to act like it understands language than to get it to actually comprehend the symbols of which language is composed. It is sufficient for our purposes to design an artificial intelligence that can judge whether or not a comment is of high quality. It doesn't also need to appreciate Rachmaninoff or be confused by Tzara.
Even the narrowest, most primitive problems prove extremely difficult. Think about everything that is involved in deciding whether a book is worth reading, or whether a song is catchy. A human evaluation requires experience with what is good and what is not; it requires an aesthetic sense; and it requires a response. How could a computer even mimic these things? Answer that, and you will find yourself plagued with job offers. The closer we get to mimicking the relevant aspects of ourselves, the more useful artificial intelligence becomes.
We're not there yet, needless to say.
To conclude, let's take a step back from the land of dreams into the real world. A sad transition, I know. But important, if this is going to be of any use besides as science fiction.3
The most practical improvements we can currently make are improvements in the way we perform the following basic steps:
If we can do these things well, we will ensure that the net remains usable until AI moves into its heyday.