Top 10 Platforms for Hundreds of ML and AI Datasets You Have to Know | by Jan Marcel Kezmann | May, 2023

by Narnia

While there are numerous websites that offer various datasets, I'd like to start with the seven best-known ones.

They are usually a great place to start searching for a dataset that is tailored to your needs.

1. UCI Machine Learning Repository

UCI ML Repository Homepage
Source: Screenshot taken by Jan Marcel Kezmann from UCI Machine Learning Repository

The UCI Machine Learning Repository [1] contains an extensive collection of data generators, domain theories, and databases that the machine learning community uses to empirically analyze machine learning algorithms.

It was created in 1987 by UC Irvine students led by David Aha and is a widely referenced resource for machine learning datasets, with over 1,000 citations, ranking it among the top 100 most cited publications in all of computer science.

2. Kaggle

Kaggle Datasets Homepage
Source: Screenshot taken by Jan Marcel Kezmann from Kaggle datasets

Kaggle is not only a popular place for data scientists and machine learning practitioners; it also offers an extensive selection of datasets [2], many of which are accompanied by tutorials and code.

This makes it an excellent resource for both beginners and expert AI practitioners interested in exploring emerging ML applications and best practices in AI.

As a host for data science competitions, which are mostly created by big tech companies and research institutes, the platform always offers brand-new and up-to-date datasets from all areas of research and industry. In these competitions you can compete with other practitioners to improve your skills and occasionally win prizes.

3. Google Dataset Search

Source: Screenshot taken by Jan Marcel Kezmann from Google Dataset Search

Google is useful not only for ordinary search queries; it also has a specialized search engine [3] for finding datasets hosted on hundreds, if not thousands, of websites.

It allows you to filter datasets by type, file format, and other criteria of your choice, making it a particularly useful resource for finding datasets on niche topics or from lesser-known sources.

4. OpenML

OpenML Datasets Homepage
Source: Screenshot taken by Jan Marcel Kezmann from OpenML

OpenML [4] is a fantastic tool for sharing and discovering datasets and machine learning workflows.

It offers a large collection of over 5,000 datasets, many of which are complemented by preprocessed versions and benchmark results.

Moreover, you can easily compare different machine learning algorithms on the same dataset.
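To make the idea of comparing algorithms on the same dataset more concrete, here is a minimal sketch. It assumes scikit-learn is installed and uses its `fetch_openml` helper to download the well-known "iris" dataset from OpenML; the two models and the helper names `load_openml_dataset`, `compare_models`, `rank_models`, and `demo` are my own illustrative choices, not part of OpenML itself. Note that this compares models locally rather than through the benchmark results published on the OpenML website.

```python
def load_openml_dataset(name: str, version: int = 1):
    """Download a dataset from OpenML by name and return (features, target)."""
    # Lazy import: scikit-learn is only needed when data is actually fetched.
    from sklearn.datasets import fetch_openml

    bunch = fetch_openml(name=name, version=version, as_frame=True)
    return bunch.data, bunch.target


def compare_models(X, y, models: dict, cv: int = 5) -> dict:
    """Return the mean cross-validated accuracy of each model on the same data."""
    from sklearn.model_selection import cross_val_score

    return {
        label: float(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())
        for label, model in models.items()
    }


def rank_models(scores: dict) -> list:
    """Sort (model, score) pairs from best to worst score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def demo():
    """Example run; needs network access, so it is not called automatically."""
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_openml_dataset("iris")  # assumed example dataset
    scores = compare_models(X, y, {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=0),
    })
    for label, score in rank_models(scores):
        print(f"{label}: {score:.3f}")
```

Calling `demo()` downloads the data once and prints one accuracy per model, which is usually enough for a first sanity check before digging into the published benchmark results.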

5. Amazon Web Services (AWS) Public Datasets

Source: Screenshot taken by Jan Marcel Kezmann from AWS Public Datasets

The AWS Public Datasets collection gives you access to public datasets hosted on the AWS infrastructure.

It not only offers datasets in various domains such as genomics, astronomy, and social sciences, but these datasets can also be easily accessed and analyzed using AWS tools and services.

6. Papers With Code

Source: Screenshot taken by Jan Marcel Kezmann from Papers With Code Datasets

One of my personal favorite websites for finding ML topics, papers, benchmarks, and datasets is Papers With Code.

Its dataset repository contains over 8,000 datasets that can be browsed or filtered by modality, task, and language.

Furthermore, when checking out a dataset, you are directly given additional useful information such as the benchmarks it has been used for, the papers it has been cited in, repositories and frameworks offering data loaders, and much more.

7. Hugging Face Datasets

Source: Screenshot taken by Jan Marcel Kezmann from HuggingFace Datasets

HuggingFace, mostly known for its immensely comprehensive deep learning model library, also has an enormous dataset corpus [5] with over 30,000 datasets that gives you easy access to datasets for audio, computer vision, and NLP tasks.

Besides that, it simplifies the data loading process to a single line of code, and its data processing methods expedite dataset preparation for deep learning model training.

Additionally, it seamlessly integrates with the HuggingFace Hub, simplifying dataset sharing and loading within the machine learning community.

While the above-mentioned resources are commonly known in the ML community, the three websites listed below are fairly unknown but offer datasets at least as good as those above.
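The single-line loading mentioned above can be sketched roughly as follows. This assumes the `datasets` library is installed and uses "imdb" purely as an example dataset name; the wrapper functions `load_hf_dataset`, `describe`, and `demo` are hypothetical helpers I added for illustration.

```python
def load_hf_dataset(name: str, split: str = "train"):
    """Load one split of a dataset from the Hugging Face Hub."""
    # Lazy import: the `datasets` library is only needed when loading data.
    from datasets import load_dataset

    # This is the "single line" that does the actual loading.
    return load_dataset(name, split=split)


def describe(dataset) -> str:
    """Summarize a loaded split by its row count and column names."""
    return f"{dataset.num_rows} rows, columns: {', '.join(dataset.column_names)}"


def demo():
    """Example run; needs network access, so it is not called automatically."""
    ds = load_hf_dataset("imdb")  # assumed example dataset
    print(describe(ds))
```

The first call downloads and caches the data locally, so later calls with the same name should load much faster.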

1. Penn Machine Learning Benchmarks

Penn Machine Learning Benchmarks Homepage
Source: Screenshot taken by Jan Marcel Kezmann from Penn Machine Learning Benchmarks (PMLB)

Penn Machine Learning Benchmarks (PMLB) [6] is an extensive compilation of curated benchmark datasets for evaluating and comparing supervised machine learning algorithms.

The datasets cover a wide range of applications, including binary and multi-class classification and regression problems.

Additionally, they feature combinations of categorical, ordinal, and continuous features, making them highly versatile.
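As a rough illustration of how such a benchmark dataset might be pulled into Python, the sketch below assumes the `pmlb` package is installed and uses its `fetch_data` function with "mushroom" as an example dataset name. The helpers `load_pmlb`, `class_counts`, `is_binary`, and `demo` are hypothetical names of my own, added to check whether a classification dataset is binary or multi-class.

```python
from collections import Counter


def load_pmlb(name: str, cache_dir=None):
    """Fetch a PMLB dataset and return (features, labels) as arrays."""
    # Lazy import: the `pmlb` package is only needed when data is fetched.
    from pmlb import fetch_data

    return fetch_data(name, return_X_y=True, local_cache_dir=cache_dir)


def class_counts(labels) -> dict:
    """Count how often each class label occurs."""
    # Convert NumPy arrays to plain Python values so the keys are ordinary ints.
    values = labels.tolist() if hasattr(labels, "tolist") else list(labels)
    return dict(Counter(values))


def is_binary(labels) -> bool:
    """Return True if exactly two distinct class labels are present."""
    return len(class_counts(labels)) == 2


def demo():
    """Example run; needs network access, so it is not called automatically."""
    X, y = load_pmlb("mushroom")  # assumed example dataset
    print(X.shape, class_counts(y), "binary" if is_binary(y) else "multi-class")
```

Passing a `cache_dir` avoids downloading the same dataset again on later runs.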

2. re3data

re3data Homepage
Source: Screenshot taken by Jan Marcel Kezmann from re3data

The Registry of Research Data Repositories, re3data, is a global registry of research data repositories covering a wide range of academic disciplines.

Its catalog includes repositories that facilitate long-term storage and retrieval of datasets for researchers, publishers, funding bodies, and academic institutions.

Personally, I like browsing through it because it provides excellent visualization, as shown in the picture below.

re3data Browse by Category
Source: Screenshot taken by Jan Marcel Kezmann from re3data

3. Academic Torrents

Academic Torrents Homepage
Source: Screenshot taken by Jan Marcel Kezmann from Academic Torrents

Academic Torrents was created to address the needs of the scientific community in the era of big data.

By using a scalable BitTorrent platform, it spreads the hosting load of data and mitigates the risk of data loss caused by the withdrawal of hosting services for datasets.

This allows researchers to replicate the data they are working with and share large datasets without incurring the high costs usually associated with commercial providers.
