Top applications across the top players
- Frequently Bought Together
- Customers Who Bought This Item Also Bought
- Movie Recommendation
- People You Might Know (aka friend suggestions)
- Face detection
- Entity extraction from web pages and queries, such as names and addresses. It ran inside the IE toolbar, Bing index generation, and query processing.
- Click Fraud Detection
NLP API Services
Deep Learning Datasets
Interesting Public Datasets
There are quite a few ML competitions on Kaggle, and each of them releases a good amount of data to the public. Here is a list of datasets that I found interesting.
Below is a set of celebrity, image, and movie data covering about 1,000 to 2,000 celebrities. You can cross-check it against People.com for completeness.
* Celebrity Face on Web from Microsoft
* Celebrity Twitter Accounts – more than 1,000 celebrity Twitter accounts.
* Cross-Age Celebrity Dataset (CACD)
  * celebrityData - contains information on the 2,000 celebrities
    * name - celebrity name
    * identity - celebrity id
    * birth - celebrity birth year
    * rank - rank of the celebrity among those with the same birth year on IMDB.com when the dataset was constructed
    * lfw - whether the celebrity is in the LFW dataset
  * celebrityImageData - contains information on the face images
    * age - estimated age of the celebrity
    * identity - celebrity id
    * year - estimated year in which the photo was taken
    * feature - 75,520-dimensional LBP feature extracted from 16 facial landmarks
    * name - file name of the image
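The CACD fields listed above can be sketched as two record types. This is only a minimal illustration, not the dataset's actual storage format: the class names `Celebrity` and `CelebrityImage`, the two helper functions, and all sample values are assumptions made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Celebrity:
    """One celebrityData record (class name is hypothetical)."""
    name: str      # celebrity name
    identity: int  # celebrity id
    birth: int     # birth year
    rank: int      # IMDB rank among celebrities with the same birth year
    lfw: bool      # True if the celebrity is also in the LFW dataset


@dataclass
class CelebrityImage:
    """One celebrityImageData record (class name is hypothetical)."""
    age: int       # estimated age of the celebrity in the photo
    identity: int  # celebrity id, links back to Celebrity.identity
    year: int      # estimated year in which the photo was taken
    name: str      # file name of the image
    # LBP feature vector; 75,520 values in the real data, left empty here.
    feature: List[float] = field(default_factory=list)


def images_by_celebrity(images: List[CelebrityImage]) -> Dict[int, List[CelebrityImage]]:
    """Group image records by celebrity id."""
    grouped: Dict[int, List[CelebrityImage]] = defaultdict(list)
    for image in images:
        grouped[image.identity].append(image)
    return dict(grouped)


def age_span(images: List[CelebrityImage]) -> int:
    """Difference between the oldest and youngest estimated age in a group."""
    ages = [image.age for image in images]
    return max(ages) - min(ages)


# Made-up sample values, not taken from the dataset.
celebrity = Celebrity(name="Example Person", identity=1, birth=1980, rank=1, lfw=False)
sample_images = [
    CelebrityImage(age=24, identity=1, year=2004, name="example_0001.jpg"),
    CelebrityImage(age=33, identity=1, year=2013, name="example_0002.jpg"),
    CelebrityImage(age=40, identity=2, year=2010, name="example_0003.jpg"),
]

grouped = images_by_celebrity(sample_images)
print(celebrity.name, "spans", age_span(grouped[celebrity.identity]), "years")
```

The shared `identity` field is what ties an image back to its celebrity, so grouping on it is the natural first step for any cross-age experiment.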
- MovieLens Latest Dataset
- OMDB Movie Dataset
- Human curated Movie list from IMDB
- Million Song Dataset (500GB)
- Otto Group product classification dataset – A dataset with 93 features for more than 200,000 products. The objective is to build a predictive model that can distinguish between Otto Group's main product categories. The winning models will be open sourced. The competition was held on Kaggle.
- eCommerce Search Relevance – This set contains image URLs, rank on page, description for each product, the search query that led to each result, and more, each from five major English-language ecommerce sites.
- Sentiment Training Data Towards Products or Brands
- Top 500 brands
- Apparel Top Brands
- 2500+ Popular Brands and 550+ Popular Categories
- Google Product Taxonomy
- Wikipedia Traffic Statistics
- Common Crawl Corpus (541TB) – Have you ever wanted to get your hands on crawl data for billions of web pages with trillions of links? Here’s your chance. The Common Crawl Corpus provides a rich set of tools, examples, and projects you can jump into today.
- Internet Advertisements dataset – Given the details of images on web pages, predict whether an image is an advertisement or not.
- Human activity recognition using smartphones dataset – From smartphone movement data, predict the type of activity performed by the person holding the smartphone.
- Facebook Public Page Post dataset – A data scraper for Facebook Pages, plus the code accompanying the blog post "How to Scrape Data From Facebook Page Posts for Statistical Analysis".
- All Subreddits from ScrapingHub – This dataset contains a list with all the subreddits from reddit.com.
- Zillow Home Value Dataset
- Springleaf marketing dataset – Given features of customers, predict whether they are a marketing target or not. Springleaf is a lending company. The competition was held on Kaggle.
- Lending Club Loan Data
Other Interesting Datasets
- OpenStreetMap database
- Project Gutenberg (742GB) – Project Gutenberg makes over 46,000 books available for analysis. These books are now in the public domain because their copyrights have expired.
- NOAA National Climatic Data Center (3.3 TB) – This dataset contains data on over 150 years of weather from many sources ranging from weather stations to airport readings to satellite data.
- Amazon.com Employee Access Challenge – Given historical resource access changes for employees, predict the resources required by employees.
- WordSimilarity-353 Test Collection – Contains two sets of English word pairs along with human-assigned similarity judgements. The collection can be used to train and/or test computer algorithms implementing semantic similarity measures (i.e., algorithms that numerically estimate similarity of natural language words).
- Google Books Ngram
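A collection like WordSimilarity-353 is typically used by scoring each word pair with an algorithm and checking how well the algorithm's ranking agrees with the human ranking. The sketch below shows one common way to do that, Spearman rank correlation, in plain Python. The word pairs and "human" scores are invented placeholders, not entries from the collection, and `bigram_similarity` is a deliberately naive stand-in for a real semantic similarity measure.

```python
from math import sqrt
from typing import Callable, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def bigram_similarity(a: str, b: str) -> float:
    """Naive stand-in measure: Jaccard overlap of character bigrams."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def evaluate(pairs: List[Tuple[str, str, float]],
             measure: Callable[[str, str], float]) -> float:
    """Correlate a measure's scores with the human scores for the same pairs."""
    human = [score for _, _, score in pairs]
    predicted = [measure(w1, w2) for w1, w2, _ in pairs]
    return spearman(human, predicted)


# Placeholder pairs with invented scores, NOT taken from WordSimilarity-353.
toy_pairs = [("night", "nights", 9.0), ("night", "light", 5.0), ("night", "day", 1.0)]
print("Spearman correlation:", evaluate(toy_pairs, bigram_similarity))
```

Rank correlation is used here rather than raw score differences because the human judgements and an algorithm's scores usually live on different scales; only the ordering of the pairs needs to agree.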