Thursday, 25 March 2010

Any other cases of "Different InChIs from the Same Molecule"? An experiment with MMsINC 1.0

A few days ago, Rich Apodaca highlighted an InChI bug on his blog.

Marco Fanton (see Marco Fanton's web page) and I decided to run an experiment on the 3M 3D structures in MMsINC 1.0. We wrote a small pure-Python script that reshuffles the atom order of a given SDF file (write to me if you are interested in the code). We then generated 10 random permutations "on the fly" for each of the 3M structures, automatically calculated the standard InChI, and searched for any "new" duplicated InChI.
Results? Interesting:

Two molecules (MMs03263666, MMs03263667), both di-azo compounds, which confirms the known bug... and no other duplicates.
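
The reshuffling step described above can be sketched roughly as follows. This is not the script used in the experiment, only a minimal illustration. It assumes a plain V2000 molblock with no atom-indexed property lines (such as "M  CHG"), the names `shuffle_atoms` and `unstable_ids` are hypothetical, and the InChI calculation itself is left to whatever InChI tool you have at hand.

```python
import random


def shuffle_atoms(molblock, rng=random):
    """Return a copy of a V2000 molblock with its atom order randomly permuted."""
    lines = molblock.splitlines()
    counts = lines[3]  # counts line: atom count in cols 1-3, bond count in cols 4-6
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = lines[4:4 + n_atoms]
    bonds = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    rest = lines[4 + n_atoms + n_bonds:]

    # Assumption: property lines that reference atom numbers are not supported,
    # because they would have to be remapped as well.
    if any(l.startswith("M  ") and not l.startswith("M  END") for l in rest):
        raise ValueError("property lines are not handled by this sketch")

    order = list(range(n_atoms))  # order[new_pos] = old_pos (0-based)
    rng.shuffle(order)
    new_index = {old: new for new, old in enumerate(order)}

    new_atoms = [atoms[old] for old in order]
    new_bonds = []
    for bond in bonds:
        # Bond lines store 1-based atom numbers in their first two 3-char fields.
        a1 = new_index[int(bond[0:3]) - 1] + 1
        a2 = new_index[int(bond[3:6]) - 1] + 1
        new_bonds.append("%3d%3d%s" % (a1, a2, bond[6:]))

    return "\n".join(lines[:4] + new_atoms + new_bonds + rest) + "\n"


def unstable_ids(inchis_by_id):
    """Return the IDs whose shuffled copies produced more than one distinct InChI."""
    return sorted(mol_id for mol_id, inchis in inchis_by_id.items()
                  if len(set(inchis)) > 1)
```

In use, each structure would be passed through `shuffle_atoms` several times, the standard InChI computed for every shuffled copy, and the results collected in a dictionary keyed by molecule ID before calling `unstable_ids`.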

Wednesday, 10 March 2010

About MMsINC data and license

In a comment on a recent post, Egon asked two short questions about MMsINC data and its license.
I want to thank Egon for both questions.

QUESTION 1: what's the license of the data in the database?

ANSWER: MMsINC data are the property of the University of Padova.
Currently, the data are not available for download, but users can query them through the web interface.

QUESTION 2: how does the curation compare to that of ChEBI?

ANSWER:
a. The MMsINC data sources are larger than those included in ChEBI. The first release contains a number of sources, the most relevant being ZINC (version 7); the next release aims to process the greater part of public data (mainly PubChem).
b. Structures are not checked or processed one by one; instead, they go through a precise protocol described here (open access NAR paper).

Some additional notes about the quality of MMsINC data:

The quality of MMsINC chemical data is higher than that of other public resources:

- MMsINC is not just a collection of public data: the data go through a long preprocessing step (see the Nucleic Acids Research database issue article for a full description of the pipeline) and a data cleaning step based on the InChIs.
- MMsINC is the only resource that collects the most probable ionic states and tautomers of all the structures (when possible, and with the known limitations).
- MMsINC stores precalculated predictions of the biological enrichment of each molecule (similarity to PDB ligands and to bioactive molecules, presence of "active" fragments).
- MMsINC contains a selection of descriptors that are important from a pharmaceutical and biochemical point of view.

I also want to quote Stefano (Prof. Stefano Moro, University of Padova, Italy):

MMsINC is not only a database: it is a chemogenomics work platform that places
its data and tools to work with it on an even footing.  Although the data
formally is property of the University of Padova, it is more important to note
that we feel that simply providing files to download would belittle our
mission.  Instead, we aim to bring this data and the science of chemoinformatics
together to provide the MMsINC service to our web community.


Let me thank Luca Pireddu for helping me translate Prof. Moro's quote.

Saturday, 6 March 2010

OOChemistry

This is an interesting tool.
I want to report here the message from the developers:

OOChemistry is an extension for OpenOffice.org which provides cross-platform OLE-like integration of OOo with JChemPaint chemical diagram editor. With OOChemistry you can draw structure, embed into document (text or presentation) and than double click and edit whenever you want on any platform having OpenOffice.org and Java Runtime (Windows, Linux, Mac OS X, other Unix flavours). It is only first alpha version and is not recommended for production use (e.g., compatibility with futher versions is not guaranteed).

OOChemistry needs your help! Experience in Java, in development of projects dealing with JChemPaint/CDK, or in development of OpenOffice.org extensions will be highly appreciated. Of course, you can help not only in coding, but also in translation of interface and writing docs.


Project page on SF

It sounds very interesting.
I'll try the installation asap.

Friday, 5 March 2010

c-d-k.org

Good news: c-d-k.org is available again! It is a good way to get introduced to CDK functionality.

Wednesday, 3 March 2010

About SMARTS patterns in PubChem fingerprints

See my recent post at BlueObelisk StackExchange.
Wolf Ihlenfeldt's reply is very interesting for people who want to know more about PubChem fingerprints.
As he says, the SMARTS patterns (the terminal part of the fingerprints)...

"are the result of capturing and analyzing user queries on the old NCI Cancer Screening database Web interface. They are intended to capture features which are used in actual queries. They were not designed specifically for similarities or correlation with any properties, but it turns out that there are indications that these screens work about as well as others in that sector".