- Collecting new & diverse sets of data via our surveys, primary research, web crawlers, manual data collection efforts & other mechanisms.
- Building algorithms to tease out important signals from the data.
We have three points of view on the same topic by people far smarter than us!
Point of View 1
Anand Rajaraman, in his blog post "More Data Usually Beats Better Algorithms".
Netflix has provided a large data set that tells you how nearly half a million people have rated about 18,000 movies. Based on these ratings, you are asked to predict the ratings of these users for movies in the set that they have not rated. The first team to beat the accuracy of Netflix's proprietary algorithm by a certain margin wins a prize of $1 million!
Different student teams in my class adopted different approaches to the problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?
Team B got much better results, close to the best results on the Netflix leaderboard!! I'm really happy for them, and they're going to tune their algorithm and take a crack at the grand prize. But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I'm often surprised that many people in the business, and even in academia, don't realize this.
Point of View 2
Chris Dixon, in his blog post "To Make Smarter Systems, It's All About the Data".
Significant AI breakthroughs come from identifying or creating new sources of data, not inventing new algorithms.
Google’s PageRank was probably the greatest AI-related invention ever brought to market by a startup. It was one of very few cases where a new system was really an order of magnitude smarter than existing ones. The Google founders are widely recognized for their algorithmic work. Their most important insight, however, in my opinion, was to identify a previously untapped and incredibly valuable data source – links – and then build a (brilliant) algorithm to optimally harness that new data source.
Modern AI algorithms are very powerful, but the reality is there are thousands of programmers/researchers who can implement them with about the same level of success. The Netflix Challenge demonstrated that a massive, world-wide effort only improves on an in-house algorithm by approximately 10%. Studies have shown that naive bayes is as good or better than fancy algorithms in a surprising number of real world cases. It’s relatively easy to build systems that are right 80% of the time, but very hard to go beyond that.
Point of View 3
Peter Norvig spoke about the importance of diverse data and its correlation in this talk. Similar concepts were explored in his paper, "The Unreasonable Effectiveness of Data". We try to take his advice at the end of the paper very seriously!
"So, follow the data. The data holds a lot of detail. Now go out and gather some data, and see what it can do..."