Let’s make this more concrete. Here are two patterns, from Steven Pinker’s book, The Better Angels of our Nature. One of the patterns is randomly generated. The other imitates a pattern from nature. Can you tell which is which?
Thought about it?
Here is Pinker’s explanation.
The one on the left, with the clumps, strands, voids, and filaments (and perhaps, depending on your obsessions, animals, nudes, or Virgin Marys) is the array that was plotted at random, like stars. The one on the right, which seems to be haphazard, is the array whose positions were nudged apart, like glowworms.
That's right, glowworms. The points on the right record the positions of glowworms on the ceiling of the Waitomo cave in New Zealand. These glowworms aren't sitting around at random; they're competing for food and nudging themselves away from each other. They have a vested interest against clumping together.
Try to uniformly sprinkle sand on a surface, and it might look like the pattern on the right. You're instinctively avoiding places where you've already dropped sand. Random processes have no such prejudices; the grains of sand simply fall where they may, clumps and all. It's more like sprinkling sand with your eyes closed. The key difference is that randomness is not the same thing as uniformity. True randomness can have clusters, like the constellations that we draw into the night sky.
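To see the difference in miniature, here is a minimal sketch (our own toy illustration, not Pinker's data): one set of points is dropped completely at random, the other rejects any candidate that lands too close to an existing point, glowworm-style. The function names and the minimum-distance value are arbitrary.

```python
import random

def random_points(n, seed=1):
    """Truly random points: clumps and voids appear on their own."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]

def repulsed_points(n, min_dist=0.04, seed=1):
    """Glowworm-style points: reject any candidate that lands too close
    to an existing point, which spreads the pattern out 'too evenly'."""
    rng = random.Random(seed)
    pts = []
    while len(pts) < n:
        x, y = rng.random(), rng.random()
        if all((x - px) ** 2 + (y - py) ** 2 >= min_dist ** 2 for px, py in pts):
            pts.append((x, y))
    return pts

def min_nearest_neighbour(pts):
    """Smallest gap between any two points: tiny for random, bounded for repulsed."""
    return min(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
               for i, (ax, ay) in enumerate(pts)
               for bx, by in pts[i + 1:])

print("random  :", round(min_nearest_neighbour(random_points(200)), 4))
print("repulsed:", round(min_nearest_neighbour(repulsed_points(200)), 4))
```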
And now, one of our favorite thinkers, Nassim Taleb, has just published an article on a similar concept: it's easy to find patterns in randomness, and big data makes it even easier to find these relationships. A brief excerpt from his post:
Big-data researchers have the option to stop doing their research once they have the right result. In options language: The researcher gets the “upside” and truth gets the “downside.” It makes him antifragile, that is, capable of benefiting from complexity and uncertainty — and at the expense of others.
But beyond that, big data means anyone can find fake statistical relationships, since the spurious rises to the surface. This is because in large data sets, large deviations are vastly more attributable to variance (or noise) than to information (or signal). It’s a property of sampling: In real life there is no cherry-picking, but on the researcher’s computer, there is. Large deviations are likely to be bogus.
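Taleb's point is easy to reproduce: generate enough columns of pure noise and some pair will look strongly "related". A minimal sketch (all numbers are random; nothing here comes from his article):

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (statistics.pstdev(x) * statistics.pstdev(y))

rng = random.Random(42)
n_vars, n_obs = 200, 30                     # many variables, few observations
noise = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

r, i, j = max((abs(pearson(noise[i], noise[j])), i, j)
              for i in range(n_vars) for j in range(i + 1, n_vars))
print(f"strongest 'signal' in pure noise: r = {r:.2f} (variables {i} and {j})")
```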
The NY Times just published a great article on using software to help deliver better tutoring. There were some great nuggets around education data mining below:
Of course, as D’Mello puts it, “we can’t install a $20,000 butt-sensor chair in every school in America.” So D’Mello, along with Heffernan, is working on a less elaborate, less expensive alternative: judging whether a student is bored, confused or frustrated based only on the pattern of his or her responses to questions. Heffernan and a collaborator at Columbia’s Teachers College, Ryan Baker, an expert in educational data mining, determined that students enter their answers in characteristic ways: a student who is bored, for example, may go for long stretches without answering any problems (he might be talking to a fellow student, or daydreaming) and then will answer a flurry of questions all at once, getting most or all correct. A student who is confused, by contrast, will spend a lot of time on each question, resort to the hint button frequently and get many of the questions wrong.
“Right now we’re able to accurately identify students’ emotions from their response patterns at a rate about 30 percent better than chance,” Baker says. “That’s about where the video cameras and posture sensors were a few years ago, and we’re optimistic that we can get close to their current accuracy rates of about 70 percent better than chance.” Human judges of emotion, he notes, reach agreement on what other people are feeling about 80 percent of the time.
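As a toy illustration of what a response-pattern detector might look at (this is not Baker's model; the feature names, thresholds and labels are our own guesses), consider something like:

```python
def classify_emotion(responses):
    """Toy heuristic over a student's recent responses.

    Each response is a dict with hypothetical keys:
      seconds - time spent on the question
      hints   - hints requested
      correct - whether the answer was right
    """
    avg_time = sum(r["seconds"] for r in responses) / len(responses)
    hint_rate = sum(r["hints"] for r in responses) / len(responses)
    accuracy = sum(r["correct"] for r in responses) / len(responses)

    if avg_time > 90 and hint_rate >= 1 and accuracy < 0.5:
        return "confused"   # slow, hint-heavy, mostly wrong
    if avg_time < 10 and accuracy > 0.8:
        return "bored"      # a fast flurry of mostly correct answers
    return "engaged"

print(classify_emotion([{"seconds": 5, "hints": 0, "correct": True},
                        {"seconds": 4, "hints": 0, "correct": True},
                        {"seconds": 6, "hints": 0, "correct": True}]))  # -> "bored"
```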
Koedinger is convinced that learning is so unfathomably complex that we need the data generated by computers to fully understand it. “We think we know how to teach because humans have been doing it forever,” he says, “but in fact we’re just beginning to understand how complicated it is to do it well.”
As an example, Koedinger points to the spacing effect. Decades of research have demonstrated that people learn more effectively when their encounters with information are spread out over time, rather than massed into one marathon study session. Some teachers have incorporated this finding into their classrooms — going over previously covered material at regular intervals, for instance. But optimizing the spacing effect is a far more intricate task than providing the occasional review, Koedinger says: “To maximize retention of material, it’s best to start out by exposing the student to the information at short intervals, gradually lengthening the amount of time between encounters.” Different types of information — abstract concepts versus concrete facts, for example — require different schedules of exposure. The spacing timetable should also be adjusted to each individual’s shifting level of mastery. “There’s no way a classroom teacher can keep track of all this for every kid,” Koedinger says. But a computer, with its vast stores of memory and automated record-keeping, can. Koedinger and his colleagues have identified hundreds of subtle facets of learning, all of which can be managed and implemented by sophisticated software.
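A minimal sketch of the expanding-interval idea Koedinger describes: short gaps at first, gradually lengthening, adjusted by a (hypothetical) mastery estimate. The constants and function names are ours, not Carnegie Learning's:

```python
from datetime import date, timedelta

def next_review_gap(review_count, mastery, base_days=1.0, growth=2.0):
    """Expanding-interval schedule: each successive review waits longer,
    and a weaker mastery estimate (0..1) shortens the gap."""
    days = base_days * (growth ** review_count) * (0.5 + mastery)
    return timedelta(days=max(1, round(days)))

day = date(2012, 9, 1)
for k, mastery in enumerate([0.3, 0.5, 0.7, 0.9]):
    gap = next_review_gap(k, mastery)
    day += gap
    print(f"review {k + 1}: wait {gap.days:>2} days -> {day}")
```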
Yet some educators maintain that however complex the data analysis and targeted the program, computerized tutoring is no match for a good teacher. It’s not clear, for instance, that Koedinger’s program yields better outcomes for students. A review conducted by the Department of Education in 2010 concluded that the product had “no discernible effects” on students’ test scores, while costing far more than a conventional textbook, leading critics to charge that Carnegie Learning is taking advantage of teachers and administrators dazzled by the promise of educational technology. Koedinger counters that “many other studies, mostly positive,” have affirmed the value of the Carnegie Learning program. “I’m confident that the program helps students learn better than paper-and-pencil homework assignments.”
One of our heroes, Nate Silver, is publishing a new book, The Signal and the Noise: Why So Many Predictions Fail-but Some Don't, and the NY Times excerpted part of his book. Nate himself commented on the fallibility of modelling and on why a human common-sense "override" is so necessary.
Perhaps because chaos theory has been a part of meteorological thinking for nearly four decades, professional weather forecasters have become comfortable treating uncertainty the way a stock trader or poker player might. When weather.gov says that there’s a 20 percent chance of rain in Central Park, it’s because the National Weather Service recognizes that our capacity to measure and predict the weather is accurate only up to a point. “The forecasters look at lots of different models: Euro, Canadian, our model — there’s models all over the place, and they don’t tell the same story,” Ben Kyger, a director of operations for the National Oceanic and Atmospheric Administration, told me. “Which means they’re all basically wrong.” The National Weather Service forecasters who adjusted temperature gradients with their light pens were merely interpreting what was coming out of those models and making adjustments themselves. “I’ve learned to live with it, and I know how to correct for it,” Kyger said. “My whole career might be based on how to interpret what it’s telling me.”
Despite their astounding ability to crunch numbers in nanoseconds, there are still things that computers can’t do, contends Hoke at the National Weather Service. They are especially bad at seeing the big picture when it comes to weather. They are also too literal, unable to recognize the pattern once it’s subjected to even the slightest degree of manipulation. Supercomputers, for instance, aren’t good at forecasting atmospheric details in the center of storms. One particular model, Hoke said, tends to forecast precipitation too far south by around 100 miles under certain weather conditions in the Eastern United States. So whenever forecasters see that situation, they know to forecast the precipitation farther north.
But there are literally countless other areas in which weather models fail in more subtle ways and rely on human correction. Perhaps the computer tends to be too conservative on forecasting nighttime rainfalls in Seattle when there’s a low-pressure system in Puget Sound. Perhaps it doesn’t know that the fog in Acadia National Park in Maine will clear up by sunrise if the wind is blowing in one direction but can linger until midmorning if it’s coming from another. These are the sorts of distinctions that forecasters glean over time as they learn to work around potential flaws in the computer’s forecasting model, in the way that a skilled pool player can adjust to the dead spots on the table at his local bar.
Among the National Weather Service’s detailed records is a thorough comparison of how well the computers are doing by themselves alongside the value that humans are contributing. According to the agency’s statistics, humans improve the accuracy of precipitation forecasts by about 25 percent over the computer guidance alone. They improve the temperature forecasts by about 10 percent. Humans are good enough, in fact, that when the organization’s Cray supercomputer burned down, in 1999, their high-temperature forecasts remained remarkably accurate. “You almost can’t have a meeting without someone mentioning the glory days of the Cray fire,” Kyger said, pointing to a mangled, half-burnt piece of the computer that was proudly displayed in the office where I met him. “If you weren’t here for that, you really weren’t part of the brotherhood.”
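The 100-miles-too-far-south correction Hoke describes is, in effect, a simple learned offset applied on top of the model. A toy sketch (the 100-mile figure comes from the excerpt; the function and everything else is illustrative):

```python
MILES_PER_DEGREE_LAT = 69.0   # rough conversion

def correct_precip_band(model_latitude, south_bias_miles=100.0, conditions_met=True):
    """Shift a model's forecast precipitation band north to offset a known
    southward bias, the way a forecaster would adjust by hand."""
    if not conditions_met:
        return model_latitude
    return model_latitude + south_bias_miles / MILES_PER_DEGREE_LAT

print(round(correct_precip_band(38.0), 2))   # 39.45: roughly 100 miles further north
```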
Continuing our coverage of getting inspired by other industries, here is another great article, this one from the UFC & mixed martial arts. It is one of the great examples of using new data sets to provide outside-the-box insights.
Among the many die-hard UFC fans was Rami Genauer, a journalist based in Washington, D.C. Genauer had read Moneyball, Michael Lewis’s best seller about Oakland Athletics general manager Billy Beane and his statistics-driven approach to player evaluation. He dreamed of analyzing mixed martial arts in the same way.
“There were no numbers,” Genauer says. “You’d try to write something, and you’d come to the place where you’d put in the numbers to back up your assertions, and there was absolutely nothing.”
In 2007 Genauer obtained a video of a recent UFC event, and using the slow-motion function on his TiVo, he broke each fight down by the number of strikes attempted, the volume of strikes landed, the type of strike (power leg versus leg jab, for instance) and the finishing move (rear naked choke versus guillotine, and so on). The process took hours, but the end result was something completely new to the sport: a comprehensive data set.
Genauer titled his data-collection project FightMetric and created a website to house the information. Some UFC fans registered their disapproval on Web forums. “‘We don’t need math with our fighting,’ people would say. I disagreed,” Genauer says.
In 2008 he managed to persuade the UFC to use FightMetric data from past matches to support a televised event in Minneapolis. “The idea was that this would be good for the producers, who could use the numbers to illustrate the story,” he says. “It’d also be good for the broadcaster—they’d have ammunition, something to rely on just like they do in other sports.”
Officials liked having Genauer’s fight data, and when the UFC began spiffing up its broadcasts with more graphics and statistics—part of an effort to make MMA seem like a real sport instead of a series of cage brawls—it hired FightMetric as its statistics provider. Genauer quit his job and opened an office in D.C.
Today FightMetric has five full-time staffers and a rotating cast of 15 specialists who collect a large data set for each fight using a video feed, proprietary software and a video-game controller with which they can record every type of strike. Among the statistics they track: each fighter’s number and type of strikes, number of significant strikes (defined as all strikes landed from a distance, as well as power strikes landed from close range) and the accuracy and location of kicks and punches.
The FightMetric team collects the strike and location statistics in real time. The UFC uses some of the data for graphics during broadcasts and on its website. FightMetric goes into even greater detail on its own website, presenting statistics over outlines of a human body. Colored lines indicate the accuracy of each type of strike, and boxes show which ground move, whether arm bar, kimura lock or triangle choke, each fighter used to try to induce a submission. The analysis is strangely disconnected from the violence of the Octagon—a savage fight broken down into simple, neat figures.
As the available body of data from FightMetric (and its main competitor, CompuStrike) grows, Genauer and others are attempting to analyze it in new ways. Already Genauer and his colleagues have identified some clear trends in MMA matches. For instance, the number of fights that end in decisions, especially at the lower weight classes, has risen from a third in 2007 to half today. That’s a significant change from the wilder early days of the UFC, when fighters swung crazily and the vast majority of bouts ended in knockouts. It points to increasing skill levels among UFC fighters (knockouts usually happen when one fighter is obviously superior to the other), a factor that could affect fighters’ styles and training methods. A lighter-weight fighter, expecting now to go the distance in his next fight, might accordingly develop his aerobic threshold (so he can wear out bigger opponents) rather than his ability to throw first-round knockout blows.
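The decision-rate trend is the kind of aggregate that falls out of a fight log almost for free. A minimal sketch with made-up records (not FightMetric's data or schema):

```python
from collections import defaultdict

# Hypothetical records: (year, weight_class, method of victory)
fights = [
    (2007, "lightweight", "KO"), (2007, "lightweight", "submission"),
    (2007, "lightweight", "decision"), (2012, "lightweight", "decision"),
    (2012, "lightweight", "decision"), (2012, "lightweight", "KO"),
    (2012, "heavyweight", "KO"), (2012, "heavyweight", "decision"),
]

tally = defaultdict(lambda: [0, 0])            # (year, class) -> [decisions, total]
for year, weight, method in fights:
    tally[(year, weight)][1] += 1
    if method == "decision":
        tally[(year, weight)][0] += 1

for (year, weight), (decisions, total) in sorted(tally.items()):
    print(f"{year} {weight}: {decisions / total:.0%} of {total} bouts went to a decision")
```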
There is an article about Big Data Analytics coming out almost every day. There was a great piece on Google's Dremel project in Wired. The article was posted at a technical news discussion site, and the top-ranked comment on that article was so sensible, and something we have to explain to clients every day, especially when they are under relentless pressure from everybody about "Big Data". I am re-posting the comment for people to think about:
A small note: It's great to see so many great tools coming up to solve the kinds of problems which were earlier difficult or impossible to solve. However, please check your big data use cases many times before using big data tools, because frankly 'big data' is becoming just a cool must-use tool these days, regardless of the use case. I've even seen data sizes as small as 10 MB being considered for big data use cases. Often this gets subjected to a monstrously complex architecture for no good reason.
Generally most of these cases can be addressed and solved with a tool as simple as SQLite! All you generally need is something like Perl with SQLite and the ability to write simple SQL queries.
People get deceived very easily: when they look at GB-scale XML files, they think that is what big data is. Yet most of that generally and easily goes into a traditional RDBMS, and the performance is generally within pretty acceptable limits. Markup eats a lot of space and inflates data size. When converted to flat file structures like CSVs and TSVs and then imported into an RDBMS, the data sizes are way smaller; I've sometimes seen a difference on the order of 10x.
Another annoying thing is the abuse of NoSQL databases. Perfectly relational data is being denormalized and force-fed into NoSQL databases, and the data access interfaces are generally bad, buggy sub-implementations of SQL.
It is almost as if people who don't understand SQL are condemned to implement it badly.
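To make the commenter's point concrete, here is a minimal sketch of the "just use SQLite" workflow: the standard library's sqlite3 module plus plain SQL. The file name and columns (orders.csv with customer, product, amount) are hypothetical.

```python
import csv
import sqlite3

# Load a (hypothetical) flattened export -- no Hadoop required.
conn = sqlite3.connect("orders.db")
conn.execute("""CREATE TABLE IF NOT EXISTS orders
                (customer TEXT, product TEXT, amount REAL)""")

with open("orders.csv", newline="") as f:
    rows = ((r["customer"], r["product"], float(r["amount"]))
            for r in csv.DictReader(f))
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
conn.commit()

# A plain SQL query answers the business question directly.
for customer, total in conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY SUM(amount) DESC LIMIT 10"):
    print(customer, total)
```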
One of our favorite bloggers just wrote a very insightful post which explains our "Data beats Algorithms" thesis much better than we could explain it ourselves. We would strongly suggest that you go and read the whole post, but here are the key highlights:
"A week or so ago, a top Internet analyst from Wall Street was in our office. He mentioned that the AppData numbers on Zynga foretold a difficult second quarter and the Yipit report on Groupon predicted trouble in that name as well. Both turned out to be fairly accurate and investable." He then references an excellent analysis by Micah Sifry, which uses Wikipedia edits as signals to predict vice presidential picks; we have re-blogged parts of it below:
Sarah Palin's Wikipedia page was updated at least 68 times the day before John McCain announced her selection, with another 54 changes made in the five days previous. Tim Pawlenty, another leading contender for McCain's favor, had 54 edits on August 28th, with just 12 in the five previous days. By contrast, the other likely picks — Romney, Kay Bailey Hutchison — saw far fewer changes. The same burst of last-minute editing appeared on Joe Biden's Wikipedia page, Terry Gudaitis of Cyveillance told the Washington Post.
None of the Wikipedia entries for the current candidates being bandied about by Romney-watchers — Rob Portman, Marco Rubio, Paul Ryan, Bobby Jindal, Chris Christie, Kelly Ayotte or Pawlenty — are currently showing anything like the spike in edits that Cyveillance spotted on Palin's and Biden's pages back in 2008. But most of those edits came in the 24 hours prior to the official announcement. That said, if Wikipedia changes offer any hint of what's coming, then today might be a good day to bet on Ryan.
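Counting recent edits is straightforward with the public MediaWiki API. A rough sketch (the page titles, the 24-hour window and the idea of eyeballing the counts for a spike are our own choices; a production script should also set a descriptive User-Agent):

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def edits_last_24h(title):
    """Number of revisions to an English Wikipedia page in the last 24 hours,
    via the public MediaWiki API (capped at 500 by rvlimit)."""
    since = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%SZ")
    params = urlencode({
        "action": "query", "prop": "revisions", "titles": title,
        "rvprop": "timestamp", "rvlimit": 500, "rvend": since,
        "format": "json", "formatversion": 2,
    })
    with urlopen(f"{API}?{params}") as resp:
        page = json.load(resp)["query"]["pages"][0]
    return len(page.get("revisions", []))

for candidate in ["Paul Ryan", "Rob Portman", "Marco Rubio"]:
    print(candidate, edits_last_24h(candidate))
```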
We take our inspiration from all corners of interesting industries. One of the more recent stories we read is this great article from the dairy farming industry.
Dairy breeding is perfect for quantitative analysis. Pedigree records have been assiduously kept; relatively easy artificial insemination has helped centralize genetic information in a small number of key bulls since the 1960s; there are a relatively small and easily measurable number of traits -- milk production, fat in the milk, protein in the milk, longevity, udder quality -- that breeders want to optimize; each cow works for three or four years, which means that farmers invest thousands of dollars into each animal, so it's worth it to get the best semen money can buy. The economics push breeders to use the genetics.
The bull market (heh) can be reduced to one key statistic, lifetime net merit, though there are many nuances that the single number cannot capture. Net merit denotes the likely additive value of a bull's genetics. The number is actually denominated in dollars because it is an estimate of how much a bull's genetic material will likely improve the revenue from a given cow. A very complicated equation weights all of the factors that go into dairy breeding and -- voila -- you come out with this single number. For example, a bull that could help a cow make an extra 1,000 pounds of milk over her lifetime only gets an increase of $1 in net merit, while a bull who will help that same cow produce a pound more protein will get $3.41 more in net merit. An increase of a single month of predicted productive life yields $35 more.
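Read that way, net merit is just a weighted sum of predicted trait improvements. A toy version using only the three dollar weights quoted above (the trait names and the sample bull are made up, and the real formula includes many more traits):

```python
# Dollar weights per unit of predicted improvement, from the figures quoted above.
WEIGHTS = {
    "milk_lbs": 1.0 / 1000,      # $1 per extra 1,000 pounds of milk
    "protein_lbs": 3.41,         # $3.41 per extra pound of protein
    "productive_months": 35.0,   # $35 per extra month of productive life
}

def net_merit(predicted_improvements):
    """Toy 'net merit': weighted sum of a bull's predicted trait improvements,
    denominated in dollars."""
    return sum(WEIGHTS[trait] * value for trait, value in predicted_improvements.items())

hypothetical_bull = {"milk_lbs": 1500, "protein_lbs": 40, "productive_months": 15}
print(f"net merit: ${net_merit(hypothetical_bull):.0f}")   # $663
```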
When you add it all up, Badger-Fluff Fanny Freddie has a net merit of $792. No other proven sire ranks above $750 and only seven bulls in the country rank above $700. One might assume that this is largely because the bull can help the cows make more milk, but it's not! While breeders used to select for greater milk production, that's no longer considered the most important trait. For example, the number three bull in America is named Ensenada Taboo Planet-Et. His predicted transmitting ability for milk production is +2323, more than 1100 pounds greater than Freddie's. His offspring's milk will likely contain more protein and fat as well. But his daughters' productive life would be shorter and their pregnancy rate is lower. And these factors, as well as some traits related to the hypothetical daughters' size and udder quality, trump Planet's impressive production stats.
One reason for the change in breeding emphasis is that our cows already produce tremendous amounts of milk relative to their forebears. In 1942, when my father was born, the average dairy cow produced less than 5,000 pounds of milk in its lifetime. Now, the average cow produces over 21,000 pounds of milk. At the same time, the number of dairy cows has decreased from a high of 25 million around the end of World War II to fewer than nine million today. This is an indisputable environmental win, as fewer cows create less methane, a potent greenhouse gas, and require less land.
Our analysis of more than 5,000 publicly listed entities in the US has shown that companies impacted by consumers making brand decisions account for more than 35% of total US listed market capitalization & trading volume. If you are an institutional investor, your portfolio is most certainly exposed to these types of companies. In specific cases, some of the smart money uses primary research & surveys to stay ahead of the curve and to know whether the previous quarter's growth or decline is likely to continue or reverse.
An example of a fund making decisions based on primary consumer surveys is Whitney Tilson's fund shorting Netflix (NFLX). Now, the Wall Street Journal has published an article on how Walmart is losing its edge. Key parts of the article are based on consumer surveys of Walmart by equity research analysts.
Some excerpts include:
A recent Goldman Sachs Group Inc. survey of store prices in Chicago found that Wal-Mart prices on identical toys, foods and health and beauty aids were lower than Target's across all categories and 6.2% less overall.....
Morgan Stanley surveys have yielded similar results. But when it recently polled 1,100 Wal-Mart customers to see what they thought, it found the perception was quite different. "We were shocked to see 60% of Wal-Mart shoppers no longer viewed Wal-Mart as having the lowest prices," says Morgan Stanley analyst Mark Wiltamuth.
Some other examples where consumer choice could significantly impact revenue include (Warning: these examples could be dated by tomorrow!):
- Churn & new subscriber uptake because of price changes on Netflix (NFLX)
- Patient attitudes towards Lipitor when it goes off patent & the impact on Pfizer (PFE)
- Are price-conscious shoppers' preferences for MetroPCS (PCS) changing when compared to AT&T, Verizon, T-Mobile and Sprint in overlapping regions?
- Are regular New York Times (NYT) readers going to upgrade to the subscription plan in the next 12 months?
- How are consumers reacting to cable offerings (CMCSA, TWC, CHTR, CVC) vs. FiOS/uVerse (VZ/T) when both are offered in the same markets?
Our organization runs on the principle that "diverse data" provides far more insight than the most elegant mathematical algorithm. Therefore, we spend an inordinate amount of time and technology resources on:
- Collecting new & diverse sets of data via our surveys, primary research, web crawlers, manual data collection efforts & other mechanisms
- Building algorithms to tease out important signals from the data.
We have three points of view on the same topic by people far smarter than us!
----- Point of View 1: Anand Rajaraman, in his blog post "More data usually beats better algorithms". Key excerpts:
Netflix has provided a large data set that tells you how nearly half a million people have rated about 18,000 movies. Based on these ratings, you are asked to predict the ratings of these users for movies in the set that they have not rated. The first team to beat the accuracy of Netflix's proprietary algorithm by a certain margin wins a prize of $1 million!
Different student teams in my class adopted different approaches to the problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?
Team B got much better results, close to the best results on the Netflix leaderboard!! I'm really happy for them, and they're going to tune their algorithm and take a crack at the grand prize. But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I'm often surprised that many people in the business, and even in academia, don't realize this.
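A toy illustration of the Team B idea: lean on an extra data source (genre tags) with an almost embarrassingly simple prediction rule. The ratings, movies and genres below are made up, and this is nowhere near what the actual teams built:

```python
# Hypothetical ratings plus IMDB-style genre tags per movie.
ratings = {("alice", "Alien"): 5, ("alice", "Heat"): 2, ("alice", "The Thing"): 5}
genres = {"Alien": {"sci-fi", "horror"}, "Heat": {"crime"},
          "The Thing": {"sci-fi", "horror"}, "Sunshine": {"sci-fi"}}

def predict(user, movie):
    """Predict a rating as the user's mean rating over movies sharing a genre
    with the target, falling back to the global mean rating."""
    relevant = [r for (u, m), r in ratings.items()
                if u == user and genres[m] & genres[movie]]
    pool = relevant or list(ratings.values())
    return sum(pool) / len(pool)

print(predict("alice", "Sunshine"))   # 5.0 -- driven by her sci-fi ratings
```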
----- Point of View 2: Chris Dixon, in his blog post "To make smarter systems, it's all about the data". Key excerpts:
Significant AI breakthroughs come from identifying or creating new sources of data, not inventing new algorithms.
Google's PageRank was probably the greatest AI-related invention ever brought to market by a startup. It was one of very few cases where a new system was really an order of magnitude smarter than existing ones. The Google founders are widely recognized for their algorithmic work. Their most important insight, however, in my opinion, was to identify a previously untapped and incredibly valuable data source – links – and then build a (brilliant) algorithm to optimally harness that new data source.
Modern AI algorithms are very powerful, but the reality is there are thousands of programmers/researchers who can implement them with about the same level of success. The Netflix Challenge demonstrated that a massive, world-wide effort only improves on an in-house algorithm by approximately 10%. Studies have shown that naive Bayes is as good as or better than fancy algorithms in a surprising number of real-world cases. It's relatively easy to build systems that are right 80% of the time, but very hard to go beyond that.
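For reference, "naive Bayes" really is only a few lines. A minimal multinomial version with add-one smoothing on a made-up toy corpus (nothing here comes from the studies cited above):

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set: (text, label).
docs = [("good great fun", "pos"), ("great story", "pos"),
        ("boring bad", "neg"), ("bad acting boring plot", "neg")]

class_counts = Counter(label for _, label in docs)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in docs:
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def predict(text):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    scores = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n_docs / len(docs))                 # class prior
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("great fun plot"))   # -> "pos"
```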
----- Point of View 3: Peter Norvig spoke about the importance of diverse data and its correlation in this talk. Similar concepts were explored in his paper "The Unreasonable Effectiveness of Data". We try to take his advice at the end of the paper very seriously: "So, follow the data. The data holds a lot of detail. Now go out and gather some data, and see what it can do..."
We published our latest municipal bond liquidity analytics at Seeking Alpha here: http://seekingalpha.com/article/252517-muni-etfs-less-pricing-risk-than-mutual-funds-with-illiquid-assets
We analyze large data sets to provide investment, trading and risk management insights to clients. As part of our municipal bond and fund analytics service, we monitor the underlying portfolio holdings of more than 200 muni mutual funds and ETFs that control more than 80% of mutual fund and ETF assets. One element of our liquidity analysis examines trading data for the underlying muni fund and ETF holdings for a given period and determines what percentage of the underlying bond holdings have not traded in the past 90 days. For example, if a fund holds a bond – ABC – and that bond has not traded in the past 90 days, it is considered illiquid.
Funds with a high percentage of assets that have not been actively traded would use non-market-based pricing for a significant portion of their assets and, as a result, would run the highest risk of having stale pricing. The 90-day inactivity window is a conservative baseline, since most ETFs and funds would need to use non-market-based pricing at times even if a bond has not traded for only a day. In the current volatile period, even short durations of non-market-based pricing impact overall returns.
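The screen itself reduces to a simple percentage. A sketch under stated assumptions: the CUSIP-style identifiers, dates and the shape of the input mapping are invented for illustration.

```python
from datetime import date, timedelta

def pct_untraded_90d(last_trade_dates, as_of):
    """Share of a fund's bond holdings with no trade in the 90 days before
    `as_of`. `last_trade_dates` maps bond identifier -> last trade date (or None)."""
    cutoff = as_of - timedelta(days=90)
    stale = sum(1 for d in last_trade_dates.values() if d is None or d < cutoff)
    return stale / len(last_trade_dates)

# Hypothetical holdings keyed by made-up identifiers.
holdings = {"13063A5G5": date(2011, 1, 10),
            "64971M5E8": None,                 # no trade observed at all
            "882724PK2": date(2010, 9, 30)}
print(f"{pct_untraded_90d(holdings, date(2011, 2, 1)):.0%} of holdings untraded in 90 days")
```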
To examine the effects of illiquid assets on stale pricing and returns, we picked two mutual funds with the most illiquid assets in their peer group – Nuveen Intermediate Tax Free Fund (FMBIX, FAMBX, FMBCX) and Thornburg Intermediate Municipal Fund (THIMX, THMCX) – and compared them to two of the largest ETFs with a similar profile – S&P National AMT-Free Municipal Bond Fund (MUB) and SPDR® Nuveen Barclays Capital Municipal Bond ETF (TFI).