UCI Knowledge Discovery in Databases Archive for large data sets
"The primary role of this repository is to enable researchers in
knowledge discovery and data mining to scale existing and future
data analysis algorithms to very large and complex data sets."
THE MNIST DATABASE of handwritten digits and some of its uses: 1, 2, 3
StatLog datasets from Machine Learning, Neural and Statistical Classification (online copy of the book by Michie, Spiegelhalter and Taylor)
Delve Datasets for developing, evaluating, and comparing learning methods
Datasets used for classification: comparison of results
mldr.datasets: R Ultimate Multilabel Dataset Repository
languageR: Data sets and functions with "Analyzing Linguistic Data: A practical introduction to statistics"
Free Datasets, a list of links to collections of datasets
Free Datasets, another list of links to collections of datasets
Datasets in R packages (in CSV format)
Raw data from online personality tests (in CSV format)
AI Datasets (maintained by Zhi-Hua Zhou)
Machine Learning and Data Mining - Datasets
KDD Cup, annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group on Knowledge Discovery and Data Mining
Natural Stimuli Collection (van Hateren natural image database)
Data Sets For OCR And Document Image Understanding Research
MeasTex Image Texture Database
The extended Yale Face Database B
Computer Vision Test Images (still useful, though many links are outdated)
Benchmarking of Learning Algorithms (also includes papers on methodology and assessment of the status quo)
Unipen handwriting data (not free)
Peterson-Barney vowel data (file PetersonBarney.tar.Z)
Switchboard Transcription System, several hundred informal speech dialogs recorded over the telephone
Speech databases from the comp.speech FAQ
More speech data from the Center for Spoken Language Understanding ($30 per corpus)
WORTSCHATZ, 101 Corpus-Based Monolingual Dictionaries
Linguistic Data Consortium LDC Catalog of 210 corpora of language data (the cheapest, TIMIT and CTIMIT, cost $100 each)
EconData, economic time series
Time Series Data Library (very extensive)
University of Colorado at Boulder Time Series Repository (containing, among others, the Santa Fe Institute Time Series Competition Data)
DAISY: A Database for Identification of Systems
WEKA, Waikato Environment for Knowledge Analysis (in Java), has many data sets
Data Sets for Pattern Recognition and Classification
Ripley's Datasets from the book Pattern Recognition and Neural Networks by B.D. Ripley, Cambridge University Press 1996.
Simonoff Datasets from the book Smoothing Methods in Statistics by Jeffrey S. Simonoff, Springer 1996.
Data (and Papers) on Cost-Sensitive Learning
Statistical Reference Datasets (StRD)
Data sets from the Chance database
neural-bench Benchmark collection
NIST Handprinted Forms and Characters Database (not free)
NIST (National Institute Of Standards And Technology) databases
Industrial Quality Data on CD-ROM (not free)
Data sets from NCAR, US National Center for Atmospheric Research
ZUDIS, German Environmental and Climate Data Directories
SCDS, Synthetic Classification Data Set Generator
Tables of Points on Spheres (by Neil Sloane)
Statistics Links
Restricted Maximum Likelihood (REML) Methods
Surface Interpolation and Approximation
Artificial Intelligence
Morphometrics (Statistics of Shape)
Global Optimization
Mathematical Software
Mathematics Links
My home page (http://arnold-neumaier.at)
Arnold Neumaier (Arnold.Neumaier@univie.ac.at)