Statistical Data Sets

UCI Machine Learning Repository
A very extensive archive with over a hundred data collections from applications; get the README file (local copy) first

UCI Knowledge Discovery in Databases Archive for large data sets
''The primary role of this repository is to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets.''

THE MNIST DATABASE of handwritten digits and some of its uses: 1, 2, 3

StatLog datasets from Machine Learning, Neural and Statistical Classification (online copy of the book by Michie, Spiegelhalter and Taylor)

Delve Datasets for developing, evaluating, and comparing learning methods

Datasets used for classification: comparison of results

mldr.datasets: R Ultimate Multilabel Dataset Repository

languageR: Data sets and functions with "Analyzing Linguistic Data: A practical introduction to statistics"

Free Datasets, a list of links to collections of datasets

Free Datasets, another list of links to collections of datasets

Datasets in R packages (in CSV format)

Raw data from online personality tests (in CSV format)

Datasets for Data Mining

AI Datasets (maintained by Zhi-Hua Zhou)

Machine Learning and Data Mining - Datasets

KDD Cup, the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining

Natural Stimuli Collection (van Hateren natural image database)

Data Sets For OCR And Document Image Understanding Research

MeasTex Image Texture Database

The Yale Face Database B

The extended Yale Face Database B

Active appearance models

Computer Vision Test Images (still useful, though many links are outdated)

Benchmarking of Learning Algorithms (also includes papers on methodology and assessment of the status quo)

Unipen handwriting data (not free)

Peterson-Barney vowel data (file PetersonBarney.tar.Z)

Audio File Format FAQ

Switchboard Transcription System, several hundred informal speech dialogs recorded over the telephone

Speech databases from the comp.speech FAQ

Speech data

More speech data from the Center for Spoken Language Understanding ($30 per corpus)

WORTSCHATZ, 101 Corpus-Based Monolingual Dictionaries

Linguistic Data Consortium LDC Catalog of 210 corpora of language data (the cheapest, TIMIT and CTIMIT, cost $100 each)

Social Science Data Resources

EconData, economic time series

Time Series Data Library (very extensive)

University of Colorado at Boulder Time Series Repository
(containing, among others, the Santa Fe Institute Time Series Competition Data)

DAISY: A Database for Identification of Systems

WEKA, Waikato Environment for Knowledge Analysis (in Java)
has many data sets

Data Sets for Pattern Recognition and Classification

Ripley's Datasets
from the book Pattern Recognition and Neural Networks by B.D. Ripley, Cambridge University Press 1996.

Simonoff Datasets from the book Smoothing Methods in Statistics by Jeffrey S. Simonoff, Springer 1996.

Data (and Papers) on Cost-Sensitive Learning

Statistical Reference Datasets (StRD)

StatLib data sets

Data sets from the Chance database

neural-bench Benchmark collection

NIST Handprinted Forms and Characters Database (not free)

NIST (National Institute Of Standards And Technology) databases
Industrial Quality Data on CD-ROM (not free)

Data sets from BADC, British Atmospheric Data Centre

Data sets from NCAR, US National Center for Atmospheric Research

Climate Data Catalog

ZUDIS, German Environmental and Climate Data Directories

SCDS, Synthetic Classification Data Set Generator

Tables of Points on Spheres (by Neil Sloane)

Some of my Other Pages

Statistics Links
Restricted Maximum Likelihood (REML) Methods
Surface Interpolation and Approximation
Artificial Intelligence
Morphometrics (Statistics of Shape)
Global Optimization
Mathematical Software
Mathematics Links
my home page

Arnold Neumaier