Mar 2, 2021

Datasets for Research work in Artificial Intelligence and Machine Learning

Updated: Mar 21, 2021

Seeing how researchers struggle with the enormous task of thesis writing, I have compiled datasets from various sources, organized by sector, industry, and technology, to ease the burden during research.

Sector: Public Government Datasets for Machine Learning

Dataset and its description

Data.gov: Download data from multiple US government agencies. Data can range from government budgets to school performance scores. Be warned though: much of the data requires additional research.
EU Open Data Portal: This portal provides access to open data published by EU institutions in fields as diverse as economics, employment, science, the environment, and education.
School System Finances: Survey data of the finances of school systems in the US.
US Healthcare Data: Data on population health, diseases, drugs, and health plans, collected from the FDA drug database and the USDA food composition database.
The US National Center for Education Statistics: Data on educational institutions and education demographics from the US and around the world.
The UK Data Service: The UK’s largest collection of social, economic and population data can be found here.
Data USA: A comprehensive visualization of US public data.
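
Many datasets on portals like these can be downloaded as CSV files, and as warned above, they often need checking before use. The following is a minimal sketch of a first-pass check that counts empty cells per column. The sample data and the column names (`district`, `year`, `spending_per_pupil`) are made up for illustration and do not come from any of the portals listed.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file;
# the column names and values are invented for illustration.
SAMPLE_CSV = """district,year,spending_per_pupil
A,2019,10450
B,2019,
C,2019,9875
D,,11200
"""


def count_missing(csv_text):
    """Return a dict mapping each column name to its number of empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row[name]
            # DictReader yields None for cells missing from a short row.
            if value is None or value.strip() == "":
                missing[name] += 1
    return missing


if __name__ == "__main__":
    print(count_missing(SAMPLE_CSV))
```

For a real file, the same function could be given the text read from disk instead of `SAMPLE_CSV`.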

Sector: Finance & Economics Datasets for Machine Learning

Dataset and its description

World Bank Open Data: Datasets covering population demographics and a huge number of economic and development indicators from across the world.
IMF Data: The International Monetary Fund publishes data on international finances, debt rates, foreign exchange reserves, commodity prices and investments.
Quandl: A good source of economic and financial data, useful for building models to predict economic indicators or stock prices.
Financial Times Market Data: Up-to-date information on financial markets from around the world, including stock price indexes, commodities and foreign exchange.
Google Trends: Examine and analyze data on internet search activity and trending news stories around the world.
American Economic Association (AEA): A good source of US macroeconomic data.

Interest Area: Image Datasets for Computer Vision

Dataset and its description

ImageNet: The de facto image dataset for new algorithms. It is organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds and thousands of images.
LabelMe: A large dataset of annotated images.
LSUN: Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.).
COIL100: 100 different objects imaged at every angle in a 360-degree rotation.
MS COCO: Generic image understanding and captioning.
Visual Genome: Very detailed visual knowledge base with captioning of ~100K images.
Google’s Open Images: A collection of 9 million URLs to images “that have been annotated with labels spanning over 6,000 categories” under Creative Commons.
Labeled Faces in the Wild: 13,000 labeled images of human faces, for use in developing applications that involve facial recognition.
Stanford Dogs Dataset: Contains 20,580 images and 120 different dog breed categories.
Indoor Scene Recognition: A very specific dataset, useful because most scene recognition models perform better on outdoor scenes. Contains 67 indoor categories and a total of 15,620 images.
VisualQA: This dataset contains open-ended questions related to 265,016 images. The questions asked require an understanding of vision and language to answer.
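
To make the hierarchy idea behind ImageNet concrete, here is a toy sketch of a category tree in which each node carries its own images and a lookup gathers everything beneath a node. This is not ImageNet's actual file format or API. The `HIERARCHY` dictionary, the category names, and the file names are all hypothetical.

```python
# Toy stand-in for a WordNet-style hierarchy: each node lists its child
# categories and the images attached directly to it. All names are made up.
HIERARCHY = {
    "animal": {"children": ["dog", "cat"], "images": []},
    "dog": {"children": ["terrier"], "images": ["dog_001.jpg"]},
    "terrier": {"children": [], "images": ["terrier_001.jpg", "terrier_002.jpg"]},
    "cat": {"children": [], "images": ["cat_001.jpg"]},
}


def images_under(node, hierarchy):
    """Collect images for a node and all of its descendants (depth-first)."""
    entry = hierarchy[node]
    found = list(entry["images"])
    for child in entry["children"]:
        found.extend(images_under(child, hierarchy))
    return found


if __name__ == "__main__":
    # A query on a broad category also returns images of its subcategories.
    print(images_under("dog", HIERARCHY))
```

The point of the structure is that a model trained on a broad category such as "dog" can draw on images filed under its more specific subcategories.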

Note: I will keep adding relevant datasets as I come across them.

References:

lionbridge.ai
