Sebastian Raschka
Last updated: 01/16/2015

A collection of links to various free and open-source datasets. I look forward to extending this little collection! If you don't find your favorite datasets listed here, just let me know (via [email](mailto:mail@sebastianraschka.com) or [twitter](https://twitter.com/rasbt)) and I will add them in no time!
## Sections

- [Dataset Repositories](#dataset-repositories)
- [Datasets by Format](#datasets-by-format)
    - [Images](#images)
    - [Audio](#audio)
    - [Text](#text)
    - [Time Series](#time-series)
- [Datasets by Topic](#datasets-by-topic)
    - [Natural Sciences](#natural-sciences)
    - [Web, Technology, and Social Networks](#web-technology-and-social-networks)
    - [Historical Data and Human Resources](#historical-data-and-human-resources)
    - [Finance and Companies](#finance-and-companies)
    - [Government Data and Politics](#government-data-and-politics)

# Dataset Repositories

[[back to top](#sections)]

- [Kaggle](https://www.kaggle.com/competitions) - Kaggle, the leading platform for predictive modeling competitions.
- [UCI MLR](http://archive.ics.uci.edu/ml/) - UC Irvine Machine Learning Repository.
- [google.com/publicdata](http://www.google.com/publicdata/directory) - Public data maintained by Google.
- [Freebase](http://www.freebase.com) - A community-curated database of well-known people, places, and things.
- [mldata.org](http://mldata.org) - Machine learning data set repository for uploading and finding data sets.
- [Infochimps](http://www.infochimps.com/datasets) - A huge collection of large-sized data sets.
- [Amazon Web Services](http://aws.amazon.com/datasets) - Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
- [Databib](http://databib.org) - A searchable catalog / registry / directory / bibliography of research data repositories.
- [figshare](http://figshare.com) - An online digital repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos.
- [reddit r/datasets](http://www.reddit.com/r/datasets) - Datasets shared on reddit.
- [datahub](http://datahub.io) - The free, powerful data management platform from the Open Knowledge Foundation.
- [Quandl](http://www.quandl.com/) - A search engine for numerical data.
- [enigma](http://enigma.io) - A search engine for public records published by governments, companies and organizations.

# Datasets by Format

[[back to top](#sections)]

## Images

[[back to top](#sections)]

- [Tiny Images Dataset](http://horatio.cs.nyu.edu/mit/tiny/data/index.html) - A dataset of 79,302,017 images, each being a 32x32 color image.
- [ImageNet](http://www.image-net.org/index) - A searchable image database.
- [CAT Dataset](http://137.189.35.203/WebUI/CatDatabase/catData.html) - A dataset of 10,000 cat images.
- [Amsterdam Library of Object Images (ALOI)](http://aloi.science.uva.nl) - A color image collection of one thousand small objects, recorded for scientific purposes.
- [Face Recognition Databases](http://www.face-rec.org/databases/) - A large collection of datasets for face recognition.
- [INRIA Holidays and Copydays datasets](http://lear.inrialpes.fr/people/jegou/data.php) - Datasets of personal holiday photos.

## Audio

[[back to top](#sections)]

- [Mobio](https://www.idiap.ch/dataset/mobio) - Bi-modal (audio and video) data taken from 152 people.
- [Million Song Dataset](http://labrosa.ee.columbia.edu/millionsong/) - A freely available collection of audio features and metadata for a million contemporary popular music tracks.
- [Music Data Mining](http://users.cis.fiu.edu/~lli003/Music/music.html) - A collection of research done on music analysis and links to various datasets.
- [CMU Audio Databases](http://www.speech.cs.cmu.edu/databases/) - A collection of databases for speech recognition.
- [CMU_ARCTIC speech synthesis databases](http://festvox.org/cmu_arctic/) - Phonetically balanced, US English single speaker databases designed for unit selection speech synthesis research.
- [VoxForge](http://www.voxforge.org) - GPL speech audio corpora.

## Text

[[back to top](#sections)]

- [TechTC](http://techtc.cs.technion.ac.il/techtc300/techtc300.html) - Technion Repository of Text Categorization Datasets containing 300 labeled datasets with categorization difficulties indicated by baseline SVM accuracies.
- [SMS Spam Collection](http://www.dt.fee.unicamp.br/%7Etiago/smsspamcollection/) - A public dataset of 5572 SMS messages that are labeled as either "spam" or "ham" (not spam). It includes a collection of 425 SMS spam messages manually extracted from the Grumbletext Web site.
- [musiXmatch](http://labrosa.ee.columbia.edu/millionsong/musixmatch) - A dataset of lyrics for the songs in the Million Song Dataset. The lyrics are pre-processed and available as "bag of words" after stemming.
- [Google books Ngram Viewer](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) - The corpus of Google books as n-grams available for quick online queries or download.
- [Jeb Bush's email archive](http://americanbridgepac.org/jeb-bushs-gubernatorial-email-archive/) - Jeb Bush's emails during his days as the governor of Florida.
- [Amazon Google Books Ngrams](http://aws.amazon.com/datasets/8172056142375670) - A data set containing Google Books n-gram corpora.
- [The Wayback Machine](http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-research/) - 80 terabytes of archived web crawl data available for research.
- [Yahoo News Feed dataset](http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=75) - A 1.5 TB dataset for building machine learning recommendation systems.
- [The full Reddit Submission Corpus 2006-2015](https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/) - All publicly available Reddit submissions from January 2006 to August 31, 2015.

## Time Series

[[back to top](#sections)]

- [NGAFID](http://people.cs.und.edu/%7Etdesell/ngafid_releases.php) - National General Aviation Flight Information Database. Time series data from various flight data recorders for flights that are approximately an hour long each.

# Datasets by Topic

[[back to top](#sections)]

## Natural Sciences

[[back to top](#sections)]

- [1000 Genomes Project](http://www.1000genomes.org/ftpsearch/) - A deep catalog of human genetic variation.
- [Cancer Program Data Sets](http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi) - A collection of genomic datasets.
- [Meteorites](http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized) - Registered meteorites that have impacted on Earth.

## Web, Technology, and Social Networks

[[back to top](#sections)]

- [The Wayback Machine](http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-research/) - 80 terabytes of archived web crawl data available for research.
- [Social Network Analysis Interactive Dataset Library](http://www.growmeme.com/overview) - A site that contains an accessible library of many of the 'open' social network analysis datasets.
- [SMS Spam Collection](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/) - A collection of 425 SMS spam messages manually extracted from the Grumbletext Web site.
- [SNAP](http://snap.stanford.edu/data/index.html) - Stanford Large Network Dataset Collection.
- [Amazon Google Books Ngrams](http://aws.amazon.com/datasets/8172056142375670) - A data set containing Google Books n-gram corpora.
- [Click Dataset](http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/) - A large dataset of about 53.5 billion HTTP requests made by users at Indiana University.
- [Common Crawl 2012 web corpus](http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us) - A hyperlink graph of 3.5 billion web pages and 128 billion hyperlinks between these pages.
- [PyPi/Maven Dependency Data](http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/) - State of the Maven/Java dependency graph and state of the PyPi/Python dependency graph.

## Historical Data and Human Resources

[[back to top](#sections)]

- [Titanic Survivors](http://lib.stat.cmu.edu/S/Harrell/data/descriptions/titanic.html) - A dataset with 1313 samples and 10 features about Titanic survivors.
- [Pass rates, race & gender](http://home.cc.gatech.edu/ice-gt/556) - Detailed data on pass rates, race, and gender for 2013.

## Finance and Companies

[[back to top](#sections)]

- [Modeling Online Auctions](http://www.modelingonlineauctions.com/datasets) - Datasets of bidding for different eBay auctions.
- [NYPD Crash Data Band-Aid](http://nypd.openscrape.com/#/) - NYPD traffic crash data as a geocoded CSV.
- [aiHit Datasets](http://endb-consolidated.aihit.com/datasets.htm) - Information on 10,000 UK companies randomly sampled from the aiHit DB.
- [Crunchbase Companies Datasets](https://brightdata.com/products/datasets/crunchbase) - 2 million Crunchbase company listings with over 100 data points.

## Government Data and Politics

[[back to top](#sections)]

- [United Nations](http://data.un.org/) - Data about health, environment, energy, etc.
- [United States Government Data](http://www.data.gov/) - The home of the U.S. Government’s open data.
- [Survey Data from U.S.](http://www.asdfree.com/)
- [EconData](http://inforumweb.umd.edu/econdata/econdata.html) - Economic time series, produced by a number of U.S. Government agencies and distributed in a variety of formats and media.
- [USGovXML](http://usgovxml.com) - An index to publicly available web services and XML data sources that are provided by the US government.
- [Nominate/vote data](http://voteview.com/dwnl.htm) - Datasets including all the D-NOMINATE and W-NOMINATE scores.
- [Jeb Bush's email archive](http://americanbridgepac.org/jeb-bushs-gubernatorial-email-archive/) - Jeb Bush's emails during his days as the governor of Florida.