Facilitate correction of a series of .mol files following automatic extraction from images and PDF files
This project is maintained by CHEMeDATA
We used OSRA to extract structures from a PDF file.
The extraction of molecules from images or .pdf files is never perfect. This tool lets you visualize and, as future work, edit the generated structures.
[Online version of OSRA](https://cactus.nci.nih.gov/osra/). The data here were produced using a local installation of osra-2.1.0-1 after some minor modifications to the C++ code (a few casts caused errors at compilation).
The data in the input folder were generated from
[input/unige_5398_attachment01.pdf](/fixingmolfiles/input/unige_5398_attachment01.pdf)
using
osra-2.1.0-1/src/osra input/unige_5398_attachment01.pdf -l osra-2.1.0-1/dict/spelling.txt -a osra-2.1.0-1/dict/superatom.txt -f sdf -w input/allcompounds.sdf --embedded-format inchi -b -c -d -g -p -o input/images/struct -e
All structures are stored in a single file,
unige_5398_attachment01_structures/allcompounds.sdf, and the images are stored in the input/images folder, with numbering starting at 0.
Reactions were extracted using
osra-2.1.0-1/src/osra input/unige_5398_attachment01.pdf -l osra-2.1.0-1/dict/spelling.txt -a osra-2.1.0-1/dict/superatom.txt -f rxn -w input/allreactions.sdf -b -c -d -g -p -o input/images/reaction -e
The scripts used are the following:
obabel input/allcompounds.sdf -o sdf -O input/separated_sdf/str.sdf -m -e
obabel input/allcompounds.sdf -o sdf -O input/separated_sdf/strH.sdf -m -h -e
obabel input/allcompounds.sdf -o svg -O input/separated_svg/str.svg -m -e
obabel input/allcompounds.sdf -o svg -O input/separated_svg/strH.svg -m -h -e
obabel input/allcompounds.sdf -o png -O input/separated_png/str.png -m -e
obabel input/allcompounds.sdf -o png -O input/separated_png/strH.png -m -e -h
obabel input/allreactions.rxn -o rxn -O input/separate_reactions_rxn/reac.rxn -m -e
These two commands do not work:
obabel input/allreactions.rxn -o svg -O input/separate_reactions_svg/reac.svg -m -e
obabel input/allreactions.rxn -o png -O input/separate_reactions_png/reac.png -m -e
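The compound conversions above follow one pattern per output format, so they can be generated with a small loop. The sketch below only prints the commands rather than running them; the function name print_obabel_commands is hypothetical and not part of this project, and the output folders are assumed to exist already.

```shell
#!/bin/sh
# print_obabel_commands FILE: print one obabel command per output format,
# first without and then with explicit hydrogens (-h).
# Hypothetical helper: it prints the commands, it does not execute them.
print_obabel_commands() {
    input=$1
    for fmt in sdf svg png; do
        printf 'obabel %s -o %s -O input/separated_%s/str.%s -m -e\n' \
            "$input" "$fmt" "$fmt" "$fmt"
        printf 'obabel %s -o %s -O input/separated_%s/strH.%s -m -h -e\n' \
            "$input" "$fmt" "$fmt" "$fmt"
    done
}
```

To run the commands instead of printing them, the output could be piped to a shell: print_obabel_commands input/allcompounds.sdf | sh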
Note: The generated .png files do not open on macOS.
Note: The .svg files were used to generate the results page.
Note: Generating reaction images from allreactions.rxn does not work. See the csh script for extracting separate .rxn files (also not working).
Note: -h adds explicit hydrogens.
Note: -b black background.
Note: -e continues after an error.
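Since separating the reactions does not work yet, one possible workaround is to split the combined file with awk. This is only a sketch, not the project's csh script: it assumes every reaction in the file begins with a line starting with $RXN, which should be checked against the actual OSRA output. The function name split_rxn is hypothetical.

```shell
#!/bin/sh
# split_rxn FILE PREFIX: write each record of FILE to PREFIX1.rxn,
# PREFIX2.rxn, ... with numbering from 1.
# Assumption: each reaction starts with a line beginning with "$RXN".
# Lines before the first "$RXN" line are ignored.
split_rxn() {
    awk -v prefix="$2" '
        /^\$RXN/ {
            if (out != "") close(out)   # finish the previous record
            n++
            out = prefix n ".rxn"
        }
        out != "" { print > out }
    ' "$1"
}
```

A call could look like: split_rxn input/allreactions.rxn input/separate_reactions_rxn/reac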
Scripts to attempt generation of png and svg files for compounds and reactions.
This extracts the SD tag that holds the page number of each molecule in the PDF file.
obabel input/allcompounds.sdf -otxt --title "" --append Page -O input/page-number/in.txt
Note: No page number in .rxn files.
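As a cross-check of the obabel command above, the page numbers can also be read straight from the SDF text. The sketch below assumes the tag is named Page (as in the --append option above) and that its value sits on the single line after the data header. The function name print_pages is hypothetical.

```shell
#!/bin/sh
# print_pages FILE: print the value line that follows each "> <Page>"
# data header in an SDF file, one value per molecule that has the tag.
# Assumption: the value occupies exactly one line.
print_pages() {
    awk '/^>.*<Page>/ { if ((getline value) > 0) print value }' "$1"
}
```

A call could look like: print_pages input/allcompounds.sdf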
More information and options are available in the Open Babel documentation.
The files are stored in the input/separated_sdf folder, with numbers starting at 1 (unlike the images, which start at 0).
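Because the separated files count from 1 and the images count from 0, matching a structure to its image needs an offset of one. The sketch below assumes both sets are in the same order, so that file number N corresponds to image number N-1. The function name image_index is hypothetical.

```shell
#!/bin/sh
# image_index N: print the 0-based image number that matches the
# 1-based separated-file number N.
# Assumption: images and separated files are in the same order.
image_index() {
    echo $(( $1 - 1 ))
}
```

For example, image_index 1 prints 0, the number of the first image.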
From the root of the repository:
npm i pdfjs-dist
node myPDFfileToText.js > extractedText.txt