CHEMeDATA projects

Here are the projects presented at the SMASH conference 2023 in Baveno, Italy. They should be considered a work in progress! They will evolve depending on the interest expressed by the community, including the visitors to our booth …

Come visit us!

CHEMeDATA-Schema

We introduce JSON objects we shall call CHEMeDATA objects to collect key data and metadata of common chemistry files in a structured manner.

Tentative examples:

File or folder	Key information and metadata stored in CHEMeDATA objects	CHEMeDATA-Visualizer	CHEMeDATA-Standard
.cdx/.cdxml file with a single compound	Molecular formula, InChI code, etc.	JSmol	.mol
Bruker NMR file folder	Observed nucleus, SNR, etc.	NMRium	x/y JSON
etc.
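
As a rough illustration of the second row of the table, a CHEMeDATA object for a Bruker NMR folder might be sketched as below. Every key (`type`, `source`, `metadata`, `visualizer`, `standard`), the path, and the values are assumptions made for illustration only; the actual schema is still to be defined.

```python
import json

# Hypothetical CHEMeDATA object for a Bruker NMR folder.
# All keys and values are illustrative assumptions, not the official schema.
nmr_object = {
    "type": "nmr-1d",                 # hypothetical type label
    "source": "data/menthol/10/",     # hypothetical location of the folder in the archive
    "metadata": {
        "observed_nucleus": "1H",
        "snr": 250.0,                 # placeholder signal-to-noise ratio
    },
    "visualizer": "NMRium",           # taken from the tentative table
    "standard": "x/y JSON",           # taken from the tentative table
}


def to_json(obj: dict) -> str:
    """Serialize a CHEMeDATA-like object to a JSON string."""
    return json.dumps(obj, indent=2, sort_keys=True)


print(to_json(nmr_object))
```

Keeping the object a plain dictionary means it serializes to JSON without any custom code, which fits the stated goal of JSON-based objects.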

Each CHEMeDATA object will have a badge to facilitate visual identification, convey a status, and allow interaction, such as visualization of the CHEMeDATA object.

A collection of CHEMeDATA objects will constitute a manifest file describing the content of an archive file, typically a .zip file of “electronic supplementary information” submitted with an article for publication or deposited in a science repository such as figshare or Zenodo.
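
A minimal sketch of how such a manifest could be assembled from a ZIP archive is shown below. The manifest layout (an `objects` list with `path` and `size_bytes`) and the function name `build_manifest` are hypothetical; a real manifest would carry full CHEMeDATA objects rather than bare file entries.

```python
import io
import json
import zipfile


def build_manifest(archive: zipfile.ZipFile) -> dict:
    """List archive members as minimal CHEMeDATA-like entries (hypothetical layout)."""
    objects = []
    for info in archive.infolist():
        if info.is_dir():
            continue  # folders carry no data of their own
        objects.append({"path": info.filename, "size_bytes": info.file_size})
    return {"objects": objects}


# Build a tiny in-memory archive so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("compound1.mol", "placeholder molfile content")
    zf.writestr("nmr/1/fid", "placeholder")

with zipfile.ZipFile(buffer) as zf:
    manifest = build_manifest(zf)

print(json.dumps(manifest, indent=2))
```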

More details about the schema …

Note: Schemas are not meant for general chemists to work with directly. They operate in the background and enable the following chemistry-data projects.

CHEMeDATA-Finder

Finding chemical information is notoriously difficult. Schema-based descriptors of public chemistry data will facilitate indexing and make the data easy to find! For example, the NMR spectrum of menthol stored in a Zenodo or figshare archive file will be readily findable by search engines once the corresponding CHEMeDATA is published on the web. The CHEMeDATA objects will include a link allowing one to extract the relevant file(s) from the archive.

Help needed:

Develop a CHEMeDATA-Crawler of public chemistry archives to generate and publish the corresponding CHEMeDATA.
Develop a CHEMeDATA-Extractor to isolate the file(s) or folder of any indexed CHEMeDATA object from the archived data (using the DOI of the archive and the location of the file(s) in the repository) and either deliver them to the user as isolated files or pass them to a CHEMeDATA-Converter or CHEMeDATA-Viewer.
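
The extraction step could look roughly like the sketch below. It assumes the archive has already been downloaded as bytes (resolving the DOI and fetching the file are out of scope here), and the function name `extract_member` and the demo paths are hypothetical.

```python
import io
import zipfile
from typing import Dict


def extract_member(archive_bytes: bytes, member_prefix: str) -> Dict[str, bytes]:
    """Return {path: content} for every file whose path starts with member_prefix.

    A prefix ending in "/" selects a whole folder; a full path selects one file.
    """
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {
            name: zf.read(name)
            for name in zf.namelist()
            if name.startswith(member_prefix) and not name.endswith("/")
        }


# Self-contained demo archive standing in for a downloaded repository file.
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as zf:
    zf.writestr("nmr/1/fid", b"fid-bytes")
    zf.writestr("nmr/1/acqus", b"acqus-text")
    zf.writestr("structures/compound1.mol", b"molfile-text")
demo_archive = _buffer.getvalue()

extracted = extract_member(demo_archive, "nmr/1/")
print(sorted(extracted))
```

Returning the content in memory keeps the sketch neutral about what happens next: the bytes can be written to disk for the user or handed to a converter or viewer.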

CHEMeDATA-Viewer

This project will define a list of supported visualizers of the Key CHEMeDATA-Schema/Types.

Examples of visualization:

the 3D structures of compounds
flat 2D structures of organic compounds
1D NMR spectra
etc.

CHEMeDATA-Converter

This project will define a preferred file format for the Key CHEMeDATA-Schema/Types.

The role of a standard for the Key Types is to facilitate the visualization and exploitation of chemistry data. For example, many file formats encode the 3D structures of organic compounds. Having a preferred format gives the CHEMeDATA-Viewer a stable basis for visualizing any 3D structure and lets efforts focus on implementing a parser for the preferred format. If a new file format is introduced, its integration into the CHEMeDATA world only requires the author of the new format to provide a converter to the preferred format.
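
The "one converter per new format" idea can be sketched as a small registry. For simplicity the example uses spectra rather than 3D structures: it treats the "x/y JSON" entry of the tentative table as the preferred format and registers a converter for a two-column CSV input. The names `register_converter`, `csv_to_xy_json`, and `to_preferred`, and the `{"x": [...], "y": [...]}` layout, are assumptions for illustration.

```python
import csv
import io
import json
from typing import Callable, Dict

# Registry mapping a source format name to a function that returns
# text in the preferred format (here: a hypothetical x/y JSON layout).
converters: Dict[str, Callable[[str], str]] = {}


def register_converter(source_format: str):
    """Decorator adding a converter for source_format to the registry."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        converters[source_format] = func
        return func
    return decorator


@register_converter("csv")
def csv_to_xy_json(text: str) -> str:
    """Convert two-column CSV text (x, y per row, no header) to x/y JSON."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    xs = [float(row[0]) for row in rows]
    ys = [float(row[1]) for row in rows]
    return json.dumps({"x": xs, "y": ys})


def to_preferred(source_format: str, text: str) -> str:
    """Dispatch to the registered converter, failing loudly if none exists."""
    if source_format not in converters:
        raise ValueError(f"no converter registered for {source_format!r}")
    return converters[source_format](text)


print(to_preferred("csv", "1.0,10\n2.0,20\n"))
```

With this design the viewer only ever parses the preferred format; supporting a new input format means adding one decorated function and nothing else.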

CHEMeDATA-Archeology

The goal of this project is to develop tools for the extraction of chemical information from publications, theses, hard copies of spectra, etc.

This project aims at extracting chemical information from printed or scanned documents. It should make it possible to create CHEMeDATA from pre-digital publications (books, articles, theses).

Help needed:

Specialists in text and image extraction from PDF files.
Specialists in the identification of “chemistry” concepts (compound names in particular) in text.
Specialists in image recognition of organic compounds.
Specialists in the reconstruction of spectra from images.
AI specialists to develop a tool that creates CHEMeDATA from the spectroscopic sections of scientific articles.

CHEMeDATA-Release

The aim of this project is to facilitate the publication of private collections of chemistry data. Whether they are personal, institutional, or industrial, collections of chemistry data may require processing, filtering, or anonymization before they can be made public. The project will propose services and tools to facilitate the publication of (say) collections of NMR spectra in a legal manner; this is not a CHEMeDATA-leaks project.

Help needed:

Developer of a CHEMeDATA-Filter to anonymize, convert, and apply selection criteria to collections of chemistry data. Python, a language that requires no compilation, shows the processing code in readable form, and runs without a web browser, should reassure users that data are not leaking and remain under their tight control until publication.
Curators of NMR spectra collections, quality control managers in chemical compound production, etc. who are interested in publishing (or required to publish) data processed with a CHEMeDATA-Filter.
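
A CHEMeDATA-Filter could start from something as small as the sketch below, which the data owner can read in full and run locally. The metadata keys treated as sensitive, the SNR-based selection criterion, and all function names are assumptions for illustration; no such tool exists yet.

```python
from typing import Dict, List

# Hypothetical metadata keys considered identifying; a real list would be
# agreed with the owner of the collection.
SENSITIVE_KEYS = {"operator", "institution", "sample_id"}


def anonymize(record: Dict) -> Dict:
    """Return a copy of the record without the sensitive keys."""
    return {key: value for key, value in record.items() if key not in SENSITIVE_KEYS}


def passes_selection(record: Dict, min_snr: float) -> bool:
    """Assumed selection criterion: keep spectra with SNR at or above a threshold."""
    return record.get("snr", 0.0) >= min_snr


def filter_collection(records: List[Dict], min_snr: float = 100.0) -> List[Dict]:
    """Select records first, then anonymize the ones that will be published."""
    return [anonymize(r) for r in records if passes_selection(r, min_snr)]


# Placeholder records standing in for a private collection.
collection = [
    {"observed_nucleus": "1H", "snr": 250.0, "operator": "A. Chemist"},
    {"observed_nucleus": "13C", "snr": 40.0, "operator": "B. Chemist"},
]
publishable = filter_collection(collection)
print(publishable)
```

The functions return new dictionaries and never modify the input, so the private collection stays intact while only the filtered copies are prepared for release.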

CHEMeDATA-Evolution

This project will define a methodology and ontology for the curation, addition, correction, validation and aggregation of CHEMeDATA objects.

Relation to NMReDATA

CHEMeDATA aims to be to chemistry what NMReDATA was to the NMR assignment of small molecules. NMReDATA will be one particular type of CHEMeDATA object. The NMReDATA tags of .sdf files could simply be converted into JSON to form a CHEMeDATA object.
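
The conversion of SDF tags into JSON could be sketched as follows. The parser relies only on the general SDF convention that a data item starts with a `> <TAG>` line and ends at a blank line; it assumes value lines may carry a trailing backslash, which it strips. The sample text omits the molfile block, and the wrapping keys `type` and `tags` are hypothetical.

```python
import json
import re

TAG_LINE = re.compile(r"^>\s*<([^>]+)>")


def sdf_tags_to_dict(sdf_text: str) -> dict:
    """Collect SDF data items as {tag: [value lines]} for a single record."""
    tags = {}
    current = None
    for line in sdf_text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            current = match.group(1)
            tags[current] = []
        elif line.strip() == "" or line.startswith("$$$$"):
            current = None  # a blank line or record separator closes the item
        elif current is not None:
            # Assumption: drop a trailing backslash used as line terminator.
            tags[current].append(line.rstrip("\\").rstrip())
    return tags


# Illustrative tag section only; a real .sdf file starts with a molfile block.
SAMPLE_SDF = r"""
> <NMREDATA_VERSION>
1.1\

> <NMREDATA_SOLVENT>
CDCl3\

$$$$
"""

chemedata_object = {"type": "nmredata", "tags": sdf_tags_to_dict(SAMPLE_SDF)}
print(json.dumps(chemedata_object, indent=2))
```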

License

CHEMeDATA is currently a completely non-profit initiative. Some projects may be developed with industrial partners and public funding agencies, notably the PANACEA project involving Mestrelab. But anything called CHEMeDATA will remain an “open” initiative with MIT licensing, and the data generated will be free and open. Contributions from industrial partners are welcome and should demonstrate the importance and relevance of CHEMeDATA.

Poster presented at the conference.

CHEMeDATA Initiative

Main website of the CHEMeDATA umbrella organization
