Espen Andersen noted the new O’Reilly book Programming Collective Intelligence, by Toby Segaran, which looks really interesting. In an excellent blog post discussing the book, Tim O’Reilly writes about the importance of what users implicitly contribute to the web, rather than just looking at the photos, videos, blog posts and Facebook profiles that are explicitly contributed.
No one would characterize Google as a “user generated content” company, yet they are clearly at the very heart of Web 2.0. That’s why I prefer the phrase “harnessing collective intelligence” as the touchstone of the revolution. A link is user-generated content, but PageRank is a technique for extracting intelligence from that content. So is Flickr’s “interestingness” algorithm, or Amazon’s “people who bought this product also bought…”, Last.Fm’s algorithms for “similar artist radio”, ebay’s reputation system, and Google’s AdSense.
This is a book explaining the practical side of actually using this information – it “teaches algorithms and techniques for extracting meaning from data, including user data”, O’Reilly writes. For instance, it explains that you might be able “to determine if there are groups of blogs that frequently write about similar subjects or write in similar styles” “by clustering blogs based on word frequencies”, and that this “could be very useful in searching, cataloging, and discovering the huge number of blogs that are currently online.” It then proceeds to tell you exactly how to do this by “downloading the [RSS] feeds from a set of blogs, extracting the text from the entries, and creating a table of word frequencies.”
And the way they’ve set up the online table of contents, with extracts from each subchapter, is a thing of beauty. The bit about finding word clusters in blogs is from Chapter 3, in the sub-section “Word Vectors”.
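To get a feel for what a “table of word frequencies” might look like, here is a minimal sketch in Python. It is not taken from the book: the three blog texts in BLOGS are made-up stand-ins for downloaded feed entries, the function names are my own, and cosine similarity is an assumption I chose for simplicity, not necessarily the measure the book uses. It also stops short of real clustering and only finds the two most similar blogs.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical sample data standing in for text extracted from RSS feeds.
BLOGS = {
    "python_notes": "python code and more python code for clustering data",
    "data_diary": "clustering data with python code",
    "garden_log": "tomato plants and garden soil in the sun",
}

def word_counts(text):
    """Lowercase the text, keep runs of letters, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def frequency_table(blogs):
    """Return (vocabulary, {blog name: list of counts in vocabulary order})."""
    counts = {name: word_counts(text) for name, text in blogs.items()}
    vocab = sorted(set().union(*counts.values()))
    table = {name: [c[w] for w in vocab] for name, c in counts.items()}
    return vocab, table

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def closest_pair(table):
    """Return (similarity, name_a, name_b) for the most similar pair of blogs."""
    names = sorted(table)
    pairs = [
        (cosine_similarity(table[a], table[b]), a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    return max(pairs)

vocab, table = frequency_table(BLOGS)
similarity, first, second = closest_pair(table)
print(f"Most similar blogs: {first} and {second} ({similarity:.2f})")
```

With real feeds you would presumably also want to drop very common words like “and” and “the”, since they make unrelated blogs look more alike than they are.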
William Patrick Wend
Thank you for posting about this book, Jill. I think this might be useful for my MA thesis.