Tech giant Yahoo is doing everything it can to gain an edge in the machine learning market, including releasing what it said is the “largest-ever machine learning data set.” The coveted info is going to the academic research community.
Yahoo said its goal is to advance the field of large-scale machine learning and recommender systems. The company also wants to help narrow the gap between the academic and industrial research communities.
“Many academic researchers and data scientists don’t have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies,” said Suju Rajan, director of research at Yahoo Labs, in a statement. “We are releasing this dataset for independent researchers because we value open and collaborative relationships with our academic colleagues, and are always looking to advance the state-of-the-art in machine learning and recommender systems.”
20 Million Users Involved
What exactly is Yahoo handing over? A collection based on a sample of anonymized user interactions on Yahoo properties, including the Yahoo News Feed dataset, the Yahoo home page, Yahoo Finance, Yahoo Sports, Yahoo Real Estate and Yahoo Movies.
All told, the dataset contains 13.5 TB of uncompressed information connected to how users relate to and interact with these Yahoo properties. The dataset covers 110 billion events and includes the interactions of about 20 million users from February 2015 to May 2015.
Categorized information, including age range, general geographic data and gender, is included in the dataset for a subset of anonymized users. The titles, summaries and key phrases of news articles are also included in the data dump. User interaction data is timestamped and even shows what device was used to browse the sites.
“Academic researchers everywhere will finally have access to realistic scale data to study how to automatically discover which news articles are of interest to which users, and will be able to compare their methods using this as a shared test case,” said Tom Mitchell, machine learning department chair, Carnegie Mellon University, in a statement. “Here at CMU we’ll certainly be using it for our research.”
Yahoo’s Big Move
We caught up with Charles King, principal analyst at Pund-IT, to get his thoughts on Yahoo’s big machine learning move. In a way, this qualifies as a self-promotional event on Yahoo’s part that positions the company as a player in the rapidly growing area of machine learning, he told us. The company’s ongoing business troubles sometimes mask its history of developing innovative, often market-leading technologies, and this effort could and should help counteract that misperception, he said.
“In essence, by making this huge dataset charting anonymized user interactions with Yahoo properties available to academic researchers, the company is helping to advance machine learning efforts among users who seldom, if ever, have access to such a profusion of data,” King said.
In the vast majority of instances, companies collecting datasets of this sort retain them for their own private use, King noted. As a result, data scientists at universities and associated research labs are forced to make do with much smaller data samples.
“Yahoo’s effort should help to advance machine learning, particularly at the university level. Its effects on business organizations are hard to parse, though. Over time, many of the innovations that universities develop do find their way into the commercial market,” King said. “Given the size and richness of the dataset Yahoo is releasing, it could very well support and inspire research that will eventually benefit businesses.”