CHEMeDATA Initiative

Main website of the CHEMeDATA umbrella organization

Goal and scope

The goal of the CHEMeDATA Initiative is to improve the quality of chemistry data generated in support of scientific discoveries.

The Initiative focuses on the problem of archive files, usually .zip files containing chemistry data. Chemists generate these files to fulfill the requirement of funding agencies to make the outcome of their research available. The files are typically submitted as supplementary data when publishing scientific articles, or stored in repositories such as Zenodo, Figshare, or university repositories such as Yareta for the University of Geneva.

Image chemistry archive

We are addressing the question of how to make the content of these .zip files FAIR. In other words: how can the chemistry-rich contents be made Findable, Accessible, Interoperable and Reusable?

One aspect will be to develop a minimal ontology for the key chemistry objects.

For example …

Let’s say that you want to share the 1D 1H spectrum of a compound you analysed. First, it has to be accessible, that is, available somewhere on the web. Making it reliably findable using common web search tools or via specialized chemistry data services is more difficult. A simple link to a Dropbox folder, for example, is not enough: the spectrum should be saved in a data repository. A repository that provides a DOI is usually a sign of stability. But even then, how will a search engine know that there is an NMR spectrum there, and which compound it corresponds to?

How search engines can find chemistry data on the web

Organized metadata such as linked data will make the spectrum and the compound visible to search engines. You will need tools that generate these metadata automatically (so that you don’t have to know what linked data are)! This will also allow others to easily cite your work when they (re)use your data.
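To make the idea of linked data more concrete, here is a minimal sketch of what such a tool might generate for one spectrum. It assumes the schema.org vocabulary (`Dataset`, `DataDownload`, `MolecularEntity`) is used; the function name, the DOI, and the file path are hypothetical placeholders, and CHEMeDATA has not prescribed this particular layout.

```python
import json


def spectrum_jsonld(doi, compound_name, inchi, spectrum_file):
    """Build a schema.org-style JSON-LD description of one NMR spectrum.

    The choice of properties is an assumption made for illustration,
    not a CHEMeDATA recommendation.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": f"https://doi.org/{doi}",
        "name": f"1D 1H NMR spectrum of {compound_name}",
        "measurementTechnique": "1H NMR spectroscopy",
        # Where the actual data file can be downloaded.
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": spectrum_file,
        },
        # Links the spectrum to the compound it was measured on.
        "about": {
            "@type": "MolecularEntity",
            "name": compound_name,
            "inChI": inchi,
        },
    }


record = spectrum_jsonld(
    doi="10.1234/example",  # placeholder DOI, not a real record
    compound_name="ethanol",
    inchi="InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    spectrum_file="spectra/ethanol_1H.zip",  # hypothetical path
)
print(json.dumps(record, indent=2))
```

The `about` entry is what tells a search engine which compound the spectrum belongs to; the chemist would only supply the structure and the data file.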

Export for ELN

The CHEMeDATA project may be seen as an effort to introduce a common format for all electronic laboratory notebooks (ELNs) and chemistry data management systems.

CHEMeDATA is an extension of NMReDATA

It may be that the work would mostly involve mapping the data from the different sources (ELNs, etc.) to a set of schema-based concepts such as those listed in the minimal ontology.

The CHEMeDATA Initiative involves:

From Chemist to database

For example, chemists should know that structure files (.cdx, .mol, etc.) are extremely important to make any data involving organic compounds computer-understandable. Specialized journals should mention this in their instructions to authors submitting supplementary chemistry information or, even better, provide tools to generate exemplary data (see next point).

Making it intuitive what needs to be dropped in a web page to fulfill the requirements

A well-designed data generation tool can replace wordy instructions for authors. (A project to lay the groundwork for such a tool was submitted in Nov. 2020 to a Swiss Open Research Data Hackathon taking place in early 2021.)
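As a minimal sketch of the kind of check such a tool might run on a dropped archive, the following looks for at least one structure file in a .zip. The list of extensions, the function names, and the messages are assumptions made for illustration; the demo archive is built in memory with placeholder content.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Extensions treated as structure files (illustrative, incomplete list).
STRUCTURE_EXTENSIONS = {".mol", ".sdf", ".cdx", ".cdxml"}


def find_structure_files(archive):
    """Return the sorted names of structure files found in a .zip archive."""
    with zipfile.ZipFile(archive) as zf:
        return sorted(
            name
            for name in zf.namelist()
            if PurePosixPath(name).suffix.lower() in STRUCTURE_EXTENSIONS
        )


def check_archive(archive):
    """Return a short, human-readable verdict for the submitting chemist."""
    found = find_structure_files(archive)
    if found:
        return f"OK: {len(found)} structure file(s) found"
    return "Missing: add a structure file (.mol, .sdf, .cdx, ...)"


# Demo archive built in memory; file names and contents are placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("compound1/structure.mol", "placeholder molfile content")
    zf.writestr("compound1/1H/fid", "placeholder raw data")

print(check_archive(buffer))
```

A real tool would go further (for instance, checking that each spectrum can be linked to a structure), but even this simple feedback replaces a paragraph of written instructions.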

How to make better chemistry data?

The supplementary data (typically .zip files) that are produced in association with publications, or to fulfill the requirements of funding agencies to make data available, can be improved. This is the key to making data FAIR!

Technical information

We shall recommend using linked data as a way to facilitate the use of any existing ontologies for authors, institutions, funding, and key chemistry objects. These linked data may be enclosed in RO-Crate frames. Tools generating chemistry archives may use these formats directly, or any other XML format with an accessible schema and appropriate data, to allow for the generation of more “searchable” standards.
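The sketch below illustrates how an archive-generating tool might enclose linked data in an RO-Crate. It assumes the RO-Crate 1.1 conventions (a `ro-crate-metadata.json` file at the root of the archive, describing the root dataset `./` and its files); the function names, file names, date, and license are hypothetical, and a complete crate would normally also describe authors, institutions, and funding.

```python
import io
import json
import zipfile


def make_crate_metadata(name, description, date_published, license_url, files):
    """Build a minimal RO-Crate metadata document.

    `files` maps each file path in the archive to a human-readable label.
    """
    graph = [
        {   # Descriptor pointing at the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # The root dataset, i.e. the archive itself.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "license": {"@id": license_url},
            "hasPart": [{"@id": path} for path in files],
        },
    ]
    for path, label in files.items():
        graph.append({"@id": path, "@type": "File", "name": label})
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


def write_crate(archive, metadata, contents):
    """Write the metadata file and the data files into a .zip archive."""
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("ro-crate-metadata.json", json.dumps(metadata, indent=2))
        for path, data in contents.items():
            zf.writestr(path, data)


# Illustrative values only.
metadata = make_crate_metadata(
    name="NMR data for compound 1",
    description="Structure file and raw data (illustrative example).",
    date_published="2021-01-01",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    files={"compound1/structure.mol": "Structure of compound 1"},
)
buffer = io.BytesIO()
write_crate(buffer, metadata, {"compound1/structure.mol": "placeholder molfile content"})

with zipfile.ZipFile(buffer) as zf:
    print(zf.namelist())
```

Because the metadata travels inside the .zip file, the archive remains self-describing wherever it is deposited.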

General recommendations

To be called CHEMeDATA, the data should fulfill some requirements with respect to openness (free access to the content).

CHEMeDATA is the umbrella organization combining efforts by diverse communities addressing the FAIRness of chemistry data in a similar (and yet-to-be clearly defined) manner. It is a spinoff of the NMReDATA Initiative, inheriting its general principles.

Format for chemistry data

One of the activities of the CHEMeDATA Initiative is to encourage the development of standards to report information extracted from crude data (when such a format does not already exist), for example the assignment of NMR spectra, IR spectra, etc., or the relevant information extracted from the output of chemistry software.

Currently only the NMR community is a part of the CHEMeDATA Initiative via the NMReDATA Initiative. Their main outcome is a format to associate the NMR assignment of an organic compound to a chemical structure file. This is done using so-called “tags” included in .sdf files, the latter being compatible with the commonly used .mol format.
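To illustrate how “tags” sit alongside a structure in an .sdf file, here is a simplified sketch that reads the data items of the first record. It relies only on the general SDF convention (`> <TAG_NAME>` header, value lines, blank line, `$$$$` at the end of the record) and assumes that NMReDATA lines end with a backslash. The sample record, including its tag contents, is made up for illustration and is not a valid NMReDATA file.

```python
import re

# Header line of an SDF data item, e.g. "> <NMREDATA_VERSION>".
TAG_HEADER = re.compile(r"^>\s*<([^>]+)>")


def read_sdf_tags(sdf_text):
    """Return {tag name: list of value lines} for the first SDF record."""
    tags = {}
    current = None
    for line in sdf_text.splitlines():
        if line.startswith("$$$$"):  # end of the first record
            break
        match = TAG_HEADER.match(line)
        if match:
            current = match.group(1)
            tags[current] = []
        elif current is not None:
            if line.strip() == "":
                current = None  # a blank line closes the data item
            else:
                # Drop the trailing backslash (assumed line terminator).
                tags[current].append(line.rstrip().rstrip("\\"))
    return tags


# Made-up record: empty structure block and illustrative tag values.
SAMPLE = r"""example
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <NMREDATA_VERSION>
1.1\

> <NMREDATA_ASSIGNMENT>
a, 1.20, 1\
b, 3.65, 2\

$$$$
"""

for tag, lines in read_sdf_tags(SAMPLE).items():
    print(tag, lines)
```

Since the tags are ordinary SDF data items, software that only understands the structure part can still open the file and ignore them.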

Potential developments:

Dealing with repositories

Some information and examples of data extraction are given (in a random and uneven manner) for the following:

From Oct. 2019 to Nov. 2020, the funding of Damien Jeannerat was covered by a 20% activity at the DLCM/E-research group at the University of Geneva.

Starting in Sept. 2021, the PANACEA consortium including Mestrelab may develop open data solutions relevant to the CHEMeDATA Initiative.

IUPAC work group on the development of a standard for FAIR data management of spectroscopic data.

How to participate?

If you have comments, suggestions, contributions, etc., raise an issue on GitHub.

Visibility

CHEMeDATA at the SMASH 2023 conference.

Relevant schema

Provenance ontology