Croissant

Summary

Croissant is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file. It works with existing datasets to make them easier to find, use, and support with tools. Croissant builds on schema.org and its Dataset vocabulary, a widely used format for representing datasets on the Web and making them searchable. You can find a gentle introduction in the companion paper Croissant: A Metadata Format for ML-Ready Datasets.

Trying It Out

Croissant is currently under development by the community. You can try the Croissant implementation, mlcroissant:

Installation (requires Python 3.10+):

pip install mlcroissant

Loading an example dataset:

import mlcroissant as mlc
ds = mlc.Dataset("https://raw.githubusercontent.com/mlcommons/croissant/main/datasets/1.0/gpt-3/metadata.json")
metadata = ds.metadata.to_json()
print(f"{metadata['name']}: {metadata['description']}")
for x in ds.records(record_set="default"):
    print(x)

Use it in your ML workflow:

# 1. Point to a local or remote Croissant file
import mlcroissant as mlc
url = "https://datasets-server.huggingface.co/croissant?dataset=fashion_mnist"
# 2. Inspect metadata
print(mlc.Dataset(url).metadata.to_json())
# 3. Use Croissant dataset in your ML workload
import tensorflow_datasets as tfds
# as_data_source needs a random-access file format such as ArrayRecord
builder = tfds.core.dataset_builders.CroissantBuilder(jsonld=url, file_format="array_record")
# Download the data and write it to disk before reading any split
builder.download_and_prepare()
# 4. Split for training/testing
train, test = builder.as_data_source(split=['default[:80%]', 'default[80%:]'])

Please see the notebook recipes for more examples.

Why a standard format for ML datasets?

Datasets are the source code of machine learning (ML), but working with them is needlessly hard. Each dataset has its own file organization and its own method for translating file contents into data structures, so each one requires a novel approach to using the data. We need a standard dataset format to make it easier to find and use ML datasets, and especially to develop tools for creating, understanding, and improving them.

The Croissant Format

Croissant is a high-level format for machine learning datasets. Croissant brings together four rich layers (in a tasty manner, we hope): metadata, resource file descriptions, data structure, and default ML semantics.

Simple Format Example

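Here is an extremely simple example of the Croissant format. The top-level properties carry the dataset metadata, distribution describes the resource files, and recordSet describes the data structure: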

{
  "@type": "sc:Dataset",
  "name": "minimal_example_with_recommended_fields",
  "description": "This is a minimal example, including the required and the recommended fields.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "url": "https://example.com/dataset/recipes/minimal-recommended",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "minimal.csv",
      "name": "minimal.csv",
      "contentUrl": "data/minimal.csv",
      "encodingFormat": "text/csv",
      "sha256": "48a7c257f3c90b2a3e529ddd2cca8f4f1bd8e49ed244ef53927649504ac55354"
    }
  ],
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "name": "examples",
      "description": "Records extracted from the example table, with their schema.",
      "field": [
        {
          "@type": "cr:Field",
          "name": "name",
          "description": "The first column contains the name.",
          "dataType": "sc:Text",
          "source": {
            "fileObject": { "@id": "minimal.csv" },
            "extract": {
              "column": "name"
            }
          }
        },
        {
          "@type": "cr:Field",
          "name": "age",
          "description": "The second column contains the age.",
          "dataType": "sc:Integer",
          "source": {
            "fileObject": { "@id": "minimal.csv" },
            "extract": {
              "column": "age"
            }
          }
        }
      ]
    }
  ]
}
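
To make the link between the layers concrete, here is a minimal standard-library sketch of how a reader could turn such a file into records. It is not how mlcroissant is implemented. The CSV contents, the read_records helper, and the CONVERTERS table are hypothetical and exist only for illustration. The field-to-column link is read from the source property, the name the Croissant specification uses for it, with references accepted as a fallback. The sketch also assumes that every field in a record set reads from the same CSV file.

```python
import csv
import hashlib
import io

# Hypothetical stand-in for data/minimal.csv; the real file is not shown here.
CSV_BYTES = b"name,age\nAda,36\nGrace,45\n"

# Trimmed-down metadata in the shape of the JSON example.
METADATA = {
    "distribution": [
        {
            "@id": "minimal.csv",
            "encodingFormat": "text/csv",
            # Computed here so the checksum matches the stand-in file.
            "sha256": hashlib.sha256(CSV_BYTES).hexdigest(),
        }
    ],
    "recordSet": [
        {
            "name": "examples",
            "field": [
                {
                    "name": "name",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "minimal.csv"},
                        "extract": {"column": "name"},
                    },
                },
                {
                    "name": "age",
                    "dataType": "sc:Integer",
                    "source": {
                        "fileObject": {"@id": "minimal.csv"},
                        "extract": {"column": "age"},
                    },
                },
            ],
        }
    ],
}

# dataType -> Python converter; covers only the two types used above.
CONVERTERS = {"sc:Text": str, "sc:Integer": int}


def read_records(metadata, record_set_name, files):
    """Yield one dict per CSV row, keeping only the declared fields.

    `files` maps a FileObject @id to the raw bytes of that file.
    """
    # Resource layer: verify each file against its declared checksum.
    for file_object in metadata["distribution"]:
        content = files[file_object["@id"]]
        if hashlib.sha256(content).hexdigest() != file_object["sha256"]:
            raise ValueError(f"checksum mismatch for {file_object['@id']}")

    record_set = next(
        rs for rs in metadata["recordSet"] if rs["name"] == record_set_name
    )
    # Structure layer: each field names a file and a column to extract.
    fields = record_set["field"]
    links = [f.get("source") or f["references"] for f in fields]
    file_id = links[0]["fileObject"]["@id"]  # single-file assumption
    rows = csv.DictReader(io.StringIO(files[file_id].decode("utf-8")))
    for row in rows:
        yield {
            f["name"]: CONVERTERS[f["dataType"]](row[link["extract"]["column"]])
            for f, link in zip(fields, links)
        }


records = list(read_records(METADATA, "examples", {"minimal.csv": CSV_BYTES}))
print(records)
```

Each yielded record is a plain dictionary keyed by field name, with values converted according to the declared dataType. A real reader additionally has to handle remote files, archives, joins across record sets, and many more data types.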

Resources

Getting involved

Integrations

Licensing

Croissant project code and examples are licensed under the Apache License 2.0.

Governance

Croissant is being developed by the community as a Task Force of the MLCommons Association Datasets Working Group. The Task Force is open to anyone (as is the parent Datasets working group). The Task Force is co-chaired by Omar Benjelloun and Elena Simperl.

Contributors

Albert Villanova (Hugging Face), Andrew Zaldivar (Google), Baishan Guo (Meta), Carole-Jean Wu (Meta), Ce Zhang (ETH Zurich), Costanza Conforti (Google), D. Sculley (Kaggle), Dan Brickley (Schema.Org), Eduardo Arino de la Rubia (Meta), Edward Lockhart (DeepMind), Elena Simperl (King's College London), Goeff Thomas (Kaggle), Joan Giner-Miguelez (UOC), Joaquin Vanschoren (TU/Eindhoven, OpenML), Jos van der Velde (TU/Eindhoven, OpenML), Julien Chaumond (Hugging Face), Kurt Bollacker (MLCommons), Lora Aroyo (Google), Luis Oala (Dotphoton), Meg Risdal (Kaggle), Natasha Noy (Google), Newsha Ardalani (Meta), Omar Benjelloun (Google), Peter Mattson (MLCommons), Pierre Marcenac (Google), Pierre Ruyssen (Google), Pieter Gijsbers (TU/Eindhoven, OpenML), Prabhant Singh (TU/Eindhoven, OpenML), Quentin Lhoest (Hugging Face), Steffen Vogler (Bayer), Taniya Das (TU/Eindhoven, OpenML), Michael Kuchnik (Meta)

Thank you for supporting Croissant!

Citation

@misc{akhtar2024croissant,
      title={Croissant: A Metadata Format for ML-Ready Datasets}, 
      author={Mubashara Akhtar and Omar Benjelloun and Costanza Conforti and Joan Giner-Miguelez and Nitisha Jain and Michael Kuchnik and Quentin Lhoest and Pierre Marcenac and Manil Maskey and Peter Mattson and Luis Oala and Pierre Ruyssen and Rajat Shinde and Elena Simperl and Goeffry Thomas and Slava Tykhonov and Joaquin Vanschoren and Steffen Vogler and Carole-Jean Wu},
      year={2024},
      eprint={2403.19546},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}