Sebastian Raschka
Last updated: 01/16/2015
A collection of links to various free and open-source datasets.
I look forward to extending this little collection! If you don't find your favorite datasets listed here, just let me know (via [email](mailto:mail@sebastianraschka.com) or [twitter](https://twitter.com/rasbt)) and I will add them in no time!
## Sections
- [Dataset Repositories](#dataset-repositories)
- [Datasets by Format](#datasets-by-format)
    - [Images](#images)
    - [Audio](#audio)
    - [Text](#text)
    - [Time Series](#time-series)
- [Datasets by Topic](#datasets-by-topic)
    - [Natural Sciences](#natural-sciences)
    - [Web, Technology, and Social Networks](#web-technology-and-social-networks)
    - [Historical Data and Human Resources](#historical-data-and-human-resources)
    - [Finance and Companies](#finance-and-companies)
    - [Government Data and Politics](#government-data-and-politics)
# Dataset Repositories
[[back to top](#sections)]
- [Kaggle](https://www.kaggle.com/competitions) - Kaggle, the leading platform for predictive modeling competitions.
- [UCI MLR](http://archive.ics.uci.edu/ml/) - UC Irvine Machine Learning Repository.
- [google.com/publicdata](http://www.google.com/publicdata/directory) - Public data maintained by Google.
- [Freebase](http://www.freebase.com) - A community-curated database of well-known people, places, and things.
- [mldata.org](http://mldata.org) - Machine learning data set repository for uploading and finding data sets.
- [Infochimps](http://www.infochimps.com/datasets) - A huge collection of large-sized data sets.
- [Amazon Web Services](http://aws.amazon.com/datasets) - Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
- [Databib](http://databib.org) - A searchable catalog / registry / directory / bibliography of research data repositories.
- [figshare](http://figshare.com) - An online digital repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos.
- [reddit r/datasets](http://www.reddit.com/r/datasets) - Datasets shared on reddit.
- [datahub](http://datahub.io) - The free, powerful data management platform from the Open Knowledge Foundation.
- [Quandl](http://www.quandl.com/) - A search engine for numerical data.
- [enigma](http://enigma.io) - A search engine for public records published by governments, companies and organizations.
# Datasets by Format
[[back to top](#sections)]
## Images
[[back to top](#sections)]
- [Tiny Images Dataset](http://horatio.cs.nyu.edu/mit/tiny/data/index.html) - A dataset of 79,302,017 images, each being a 32x32 color image.
- [ImageNet](http://www.image-net.org/index) - A searchable image database.
- [CAT Dataset](http://137.189.35.203/WebUI/CatDatabase/catData.html) - A dataset of 10,000 cat images.
- [Amsterdam Library of Object Images (ALOI)](http://aloi.science.uva.nl) - A color image collection of one-thousand small objects, recorded for scientific purposes.
- [Face Recognition Databases](http://www.face-rec.org/databases/) - A large collection of datasets for face recognition.
- [INRIA Holidays and Copydays datasets](http://lear.inrialpes.fr/people/jegou/data.php) - Datasets of personal holiday photos.
## Audio
[[back to top](#sections)]
- [Mobio](https://www.idiap.ch/dataset/mobio) - Bi-modal (audio and video) data taken from 152 people.
- [Million Song Dataset](http://labrosa.ee.columbia.edu/millionsong/) - A freely available collection of audio features and metadata for a million contemporary popular music tracks.
- [Music Data Mining](http://users.cis.fiu.edu/~lli003/Music/music.html) - A collection of research done on music analysis and links to various datasets.
- [CMU Audio Databases](http://www.speech.cs.cmu.edu/databases/) - A collection of databases for speech recognition.
- [CMU_ARCTIC speech synthesis databases](http://festvox.org/cmu_arctic/) - Phonetically balanced, US English single speaker databases designed for unit selection speech synthesis research.
- [VoxForge](http://www.voxforge.org) - GPL speech audio corpora.
## Text
[[back to top](#sections)]
- [TechTC](http://techtc.cs.technion.ac.il/techtc300/techtc300.html) - Technion Repository of Text Categorization Datasets containing 300 labeled datasets with categorization difficulties indicated by baseline SVM accuracies.
- [SMS Spam Collection](http://www.dt.fee.unicamp.br/%7Etiago/smsspamcollection/) - A public dataset of 5572 SMS messages that are labeled as either "spam" or "ham" (not spam).
- [musiXmatch](http://labrosa.ee.columbia.edu/millionsong/musixmatch) - A dataset of lyrics for the songs in the Million Song Dataset. The lyrics are pre-processed and available as "bag of words" after stemming.
- [Google Books Ngram Viewer](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) - The corpus of Google Books as n-grams available for quick online queries or download.
- [Jeb Bush's email archive](http://americanbridgepac.org/jeb-bushs-gubernatorial-email-archive/) - Jeb Bush's emails during his days as the governor of Florida.
- [Amazon Google Books Ngrams](http://aws.amazon.com/datasets/8172056142375670) - A data set containing Google Books n-gram corpora.
- [The Wayback Machine](http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-research/) - 80 terabytes of archived web crawl data available for research.
- [Yahoo News Feed dataset](http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=75) - A 1.5 TB dataset for building machine learning recommendation systems.
- [The full Reddit Submission Corpus 2006-2015](https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/) - All publicly available Reddit submissions from January 2006 through August 31, 2015.
## Time Series
[[back to top](#sections)]
- [NGAFID](http://people.cs.und.edu/%7Etdesell/ngafid_releases.php) - National General Aviation Flight Information Database. Time series data from various flight data recorders for flights that are approximately an hour long each.
# Datasets by Topic
[[back to top](#sections)]
## Natural Sciences
[[back to top](#sections)]
- [1000 Genomes Project](http://www.1000genomes.org/ftpsearch/) - A Deep Catalog of Human Genetic Variation.
- [Cancer Program Data Sets](http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi) - A collection of genomic datasets.
- [Meteorites](http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized) - Registered meteorites that have impacted on Earth.
## Web, Technology, and Social Networks
[[back to top](#sections)]
- [The Wayback Machine](http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-research/) - 80 terabytes of archived web crawl data available for research.
- [Social Network Analysis Interactive Dataset Library](http://www.growmeme.com/overview) - A site that contains an accessible library of many of the 'open' social network analysis datasets.
- [SMS Spam Collection](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/) - A collection of 425 SMS spam messages manually extracted from the Grumbletext Web site.
- [SNAP](http://snap.stanford.edu/data/index.html) - Stanford Large Network Dataset Collection.
- [Amazon Google Books Ngrams](http://aws.amazon.com/datasets/8172056142375670) - A data set containing Google Books n-gram corpora.
- [Click Dataset](http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/) - A large dataset of about 53.5 billion HTTP requests made by users at Indiana University.
- [Common Crawl 2012 web corpus](http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us) - A hyperlink graph of 3.5 billion web pages and 128 billion hyperlinks between these pages.
- [PyPi/Maven Dependency Data](http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/) - State of the Maven/Java dependency graph and state of the PyPi/Python dependency graph.
## Historical Data and Human Resources
[[back to top](#sections)]
- [Titanic Survivors](http://lib.stat.cmu.edu/S/Harrell/data/descriptions/titanic.html) - A dataset with 1313 samples and 10 features about Titanic survivors.
- [Pass rates, race & gender](http://home.cc.gatech.edu/ice-gt/556) - Detailed data on pass rates, race, and gender for 2013.
## Finance and Companies
[[back to top](#sections)]
- [Modeling Online Auctions](http://www.modelingonlineauctions.com/datasets) - Datasets of bidding for different eBay auctions.
- [NYPD Crash Data Band-Aid](http://nypd.openscrape.com/#/) - NYPD traffic crash data as a geocoded CSV.
- [aiHit Datasets](http://endb-consolidated.aihit.com/datasets.htm) - Information on 10,000 UK companies randomly sampled from the aiHit database.
- [Crunchbase Companies Datasets](https://brightdata.com/products/datasets/crunchbase) - 2 million Crunchbase company listings with over 100 data points.
## Government Data and Politics
[[back to top](#sections)]
- [United Nations](http://data.un.org/) - Data about health, environment, energy, etc.
- [United States Government Data](http://www.data.gov/) - The home of the U.S. Government’s open data.
- [Survey Data from U.S.](http://www.asdfree.com/)
- [EconData](http://inforumweb.umd.edu/econdata/econdata.html) - Economic time series, produced by a number of U.S. Government agencies and distributed in a variety of formats and media.
- [USGovXML](http://usgovxml.com) - USGovXML is an index to publicly available web services and XML data sources that are provided by the US government.
- [Nominate/vote data](http://voteview.com/dwnl.htm) - Datasets including all the D-NOMINATE and W-NOMINATE scores.
- [Jeb Bush's email archive](http://americanbridgepac.org/jeb-bushs-gubernatorial-email-archive/) - Jeb Bush's emails during his days as the governor of Florida.