Yahoo Announces The Public Release Of The Largest-ever Machine Learning Dataset
Image Source: yahoolabs.tumblr.com
Yahoo Labs research scientists have always loved working on large-scale machine learning problems inspired by consumer-facing products. This interest has driven Yahoo to focus on areas such as computational advertising, search ranking, information retrieval, and core machine learning. Today, Yahoo announced the public release of the largest-ever machine learning dataset for researchers.
Here is what Suju Rajan stated in a Yahoo blog post:
The dataset stands at a massive ~110B events (13.5TB uncompressed) of anonymized user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015. The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate.
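To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above and assumes decimal terabytes (1 TB = 10^12 bytes); the function and variable names are illustrative and are not part of the dataset or of any Yahoo tooling.

```python
# Approximate figures taken from the announcement.
TOTAL_EVENTS = 110e9         # ~110B interaction events
UNCOMPRESSED_BYTES = 13.5e12 # 13.5 TB, assuming 1 TB = 10**12 bytes
USERS = 20e6                 # about 20M users


def dataset_scale(events, size_bytes, users):
    """Return average bytes per event and average events per user."""
    return {
        "bytes_per_event": size_bytes / events,
        "events_per_user": events / users,
    }


scale = dataset_scale(TOTAL_EVENTS, UNCOMPRESSED_BYTES, USERS)
print(f"Average size per event: {scale['bytes_per_event']:.0f} bytes")
print(f"Average events per user: {scale['events_per_user']:.0f}")
```

Under these assumptions, each event averages a little over 120 bytes uncompressed, and each user contributes roughly 5,500 events over the collection period.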
The aim goes beyond promoting independent research in large-scale machine learning and recommender systems. The newly announced dataset is available as part of the Yahoo Labs Webscope data-sharing program, a reference library of scientifically useful datasets composed of anonymized user data for non-commercial use.
The dataset provides the following:
- Categorized demographic information (age range, gender, and generalized geographic data) for a subset of the anonymized users.
- On the item side, the title, summary, and key-phrases of the pertinent news article are released.
- The interaction data is timestamped with the relevant local time and also contains partial information about the device on which the user accessed the news feeds.
Rajan further stated:
We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, “real-world” dataset. We strongly believe that this dataset can become the benchmark for large-scale machine learning and recommender systems, and we look forward to hearing from the community about their applications of our data.