Thursday, 23 December 2010

How to download MMsINC entries for AutoDock.

AutoDock is a suite of automated docking tools. It is designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure.

AutoDock 4 uses PDBQT formatted files not only for the receptor but also for the ligand.
The PDBQT format is described here.

Yesterday I converted all the MMsINC entries (about 4M) to the PDBQT format.
It is now possible to download each entry as an input file for AutoDock experiments.

Have a look at this example page. As you can see, at the top of the page there is a menu for downloading the entry in several file formats. The first of them is the AutoDock input file. Simply press the "go" button to download it.

Let me close this post with this nice photo.

Where: the place is the Poetto Beach in Cagliari (Sardinia, Italy).
What: Stefano Moro and me eating "ricci di mare" (sea urchins).

Thursday, 16 December 2010

The hard life of a chemoinformatician (and the chemical space in a bag)

Part 1: The hard life of a chemoinformatician

Thoughts at the end of 2010.
My PhD will be completed in the next 2 months.
My contract is coming to an end (February).
I would like to do more cheminfo than bioinfo.
But this is not easy, partly because there are more opportunities in bioinfo than in cheminfo.

Part 2: the chemical space in a bag

Stefano Moro will be welcome in my village next weekend.
In his bag, there will be the whole public chemical space (the future MMsINC 2.0).
I will enjoy exploring that space during the holidays.

Monday, 8 November 2010

Who is using MMsINC?

A good number of people are using MMsINC (summary from May to October 2010),

but what is interesting is the geographical distribution of the visitors: of the last 500 visitors, 67% were from India.

Friday, 17 September 2010

Pharao installation

A few days ago Silicos released the source code of the tool Pharao.
Here are some notes about the installation of Pharao.
I had some problems, and with the help of the Pharao developers (Gert Thijs) I understood how to install it properly.

1) install cmake
2) install the latest openbabel release via svn
3) follow the installation of openbabel here
4) download and install pharao

If everything worked fine, there should not be any problem.

See also BO exchange.

Sunday, 12 September 2010

Automated large-scale protein modeling

A pipeline for the automated comparative modeling of many proteins can easily be built from the following components:

1) the best template (based on some filters such as the e-value and the resolution) is identified using the hhsearch program;
2) Modeller is used for the comparative modeling;
3) databases for the template search (nr90 and nr70) and for the modeling (a reformatted version of the PDB) are available from the hhsearch ftp site;
4) a set of python and bash utilities for the management of the jobs on a computer cluster.
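
Step 1 can be illustrated with a few lines of Python. This is only a sketch: the hit records, the field names and the two thresholds are hypothetical, and parsing the real hhsearch output is not shown.

```python
def pick_template(hits, max_evalue=1e-5, max_resolution=3.0):
    """Return the hit with the lowest e-value among those passing both filters.

    hits: list of dicts with the (hypothetical) keys 'pdb_id', 'evalue'
    and 'resolution' (in Angstrom). Returns None if no hit passes.
    """
    passing = [h for h in hits
               if h["evalue"] <= max_evalue and h["resolution"] <= max_resolution]
    if not passing:
        return None
    # Lowest e-value first; ties are broken by the better (lower) resolution
    return min(passing, key=lambda h: (h["evalue"], h["resolution"]))


# Made-up hits, only to show the filtering
hits = [
    {"pdb_id": "tmplA", "evalue": 1e-40, "resolution": 3.5},
    {"pdb_id": "tmplB", "evalue": 1e-30, "resolution": 2.1},
    {"pdb_id": "tmplC", "evalue": 1e-2, "resolution": 1.5},
]
best = pick_template(hits)
print(best["pdb_id"])
```

Here tmplA is rejected for its resolution and tmplC for its e-value, so tmplB is selected.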

All these building blocks are part of my pipeline, which is going to be released.
I'll give more information soon.

Some days ago a test experiment revealed that the pipeline can build 450 models in 9 hours (50 models per hour).
Not so fast, but my pipeline also contains some modules for model assessment, and the cluster resources are shared with a lot of different users.

With a dedicated (and larger) cluster, I suppose it would be possible to model the whole human proteome (ca. 78.000 peptides, source: Ensembl) in 1 or 2 weeks with this pipeline.
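
As a rough check of this estimate, the arithmetic can be written down in Python. The sketch assumes that throughput scales linearly with cluster capacity, which is a simplification, and the function name is only illustrative.

```python
def required_speedup(n_targets, models_per_hour, days):
    """Speedup over the measured rate needed to finish n_targets in `days` days."""
    hours_at_current_rate = n_targets / float(models_per_hour)
    return hours_at_current_rate / (days * 24.0)


measured_rate = 450 / 9.0   # 50 models per hour, from the test run
n_peptides = 78000          # proteome size quoted above

days_at_current_rate = n_peptides / measured_rate / 24.0
print("days at the measured rate: %.1f" % days_at_current_rate)
for days in (7, 14):
    speedup = required_speedup(n_peptides, measured_rate, days)
    print("speedup needed for %d days: %.1fx" % (days, speedup))
```

At the measured rate the whole set would take about 65 days, so finishing in one or two weeks implies roughly 5 to 9 times more throughput.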

Monday, 23 August 2010

MMsINC 2.0: coming soon.

We (mainly Marco Fanton at the Univ. of Padova) are processing a lot of public sources for the next MMsINC release; please feel free to contact me if you have any SDF catalog: it will be our pleasure to process and incorporate it (with the appropriate link to your Company or Institute).

Thursday, 19 August 2010

"Ultrafast shape recognition" method implementation

I have implemented the Ultrafast Shape Recognition method (Ballester and Richards, Proc. R. Soc. A., 2007) for the next MMsINC release... pure Python implementation, no external libraries required, fast calculation. I'm looking for a small dataset for the validation of my script. If you are interested in the source code, please send me an email at
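
For readers who want to see the idea, here is a minimal pure-Python sketch of the method. It is not the MMsINC implementation. It assumes the usual four reference points (the centroid, the atom closest to it, the atom farthest from it, and the atom farthest from that one) and uses the mean, variance and third central moment of the distances; other implementations scale the second and third moments differently. All function names are illustrative.

```python
import math


def _dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def _moments(dists):
    """Mean, variance and third central moment of a list of distances."""
    n = float(len(dists))
    mean = sum(dists) / n
    var = sum((d - mean) ** 2 for d in dists) / n
    third = sum((d - mean) ** 3 for d in dists) / n
    return [mean, var, third]


def usr_descriptors(coords):
    """12 shape descriptors from a list of (x, y, z) atom coordinates."""
    n = float(len(coords))
    ctd = tuple(sum(c[k] for c in coords) / n for k in range(3))  # centroid
    d_ctd = [_dist(c, ctd) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]   # atom closest to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]   # atom farthest from the centroid
    d_fct = [_dist(c, fct) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]   # atom farthest from fct
    d_cst = [_dist(c, cst) for c in coords]
    d_ftf = [_dist(c, ftf) for c in coords]
    desc = []
    for dists in (d_ctd, d_cst, d_fct, d_ftf):
        desc.extend(_moments(dists))
    return desc


def usr_similarity(desc_a, desc_b):
    """Score in (0, 1]; 1 means identical descriptors."""
    diff = sum(abs(a - b) for a, b in zip(desc_a, desc_b)) / float(len(desc_a))
    return 1.0 / (1.0 + diff)


# Made-up coordinates: a translated copy must give (almost exactly) 1.0
mol = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.2, 0.0), (3.6, 1.1, 0.4)]
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in mol]
print(usr_similarity(usr_descriptors(mol), usr_descriptors(shifted)))
```

Because only interatomic distances are used, the score does not depend on the position or orientation of the molecule, so no superposition step is needed.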

Thursday, 1 July 2010

InChI Version 1, Software Version 1.03 is out!
InChI Version 1, Software Version 1.03: implemented for both Standard and Non-standard (Customized) InChI/InChIKey.

Wednesday, 23 June 2010

Removing duplicates from large SDF files

Maybe there are better solutions, but this worked very well with a random set taken from Pubchem (5.000.000 structures, but I introduced random duplicates, for a total of 120.000.000 structures):
1) generate your preferred inchis for all the structures in your big SDF, and update the SDF with these inchis (you can use pybel for that)
2) extract PUBCHEM_COMPOUND_CID from the SDF:

grep -A 1 "PUBCHEM_COMPOUND_CID" big.sdf | grep -v "PUBCHEM_COMPOUND_CID" | grep -v -- "--" > CIDs.txt

3) then put inchis and CIDs in the same file:

paste inchi CIDs.txt > inchi_CID.txt

4) now you can sort this file:

sort inchi_CID.txt -o inchi_CID_sort.txt

so all the duplicates end up next to each other...
5) now you can load all the inchis as keys of a Python dictionary (it can be stored with cPickle): if an inchi is unique in the inchi_CID_sort.txt file, the value of the key is 0; if it is a duplicate (last visited inchi == current inchi), the value of the key is 10.
6) now, the Python script should parse the SDF in this way:
for each structure:
    if the inchi of this structure has value 0 in the dictionary, save the molecule;
    if the value is 10, save the molecule, but change the value to 11;
    if the value is 11, skip this structure

I would suggest closing the output file every 100.000 structures and opening a new one for the next batch... at the end, a "cat" command will generate a big SDF without duplicates.
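
Steps 5 and 6 can be sketched in Python as follows. This is a simplified illustration, not the original script: it works on in-memory (InChI, record) pairs instead of parsing a real SDF, and the function and variable names are hypothetical.

```python
UNIQUE, DUPLICATE, DUPLICATE_SAVED = 0, 10, 11


def build_flags(sorted_inchis):
    """Step 5: flag each InChI as unique (0) or duplicated (10).

    sorted_inchis must be sorted, so that equal InChIs are adjacent.
    """
    flags = {}
    last = None
    for inchi in sorted_inchis:
        if inchi == last:
            flags[inchi] = DUPLICATE
        else:
            flags[inchi] = UNIQUE
        last = inchi
    return flags


def remove_duplicates(records, flags):
    """Step 6: keep the first copy of every structure, skip later copies."""
    kept = []
    for inchi, record in records:
        state = flags[inchi]
        if state == UNIQUE:
            kept.append(record)
        elif state == DUPLICATE:
            kept.append(record)
            flags[inchi] = DUPLICATE_SAVED   # later copies will be skipped
        # state == DUPLICATE_SAVED: already saved once, so skip it
    return kept


# Toy input: the second methane is a duplicate of the first
records = [("InChI=1S/CH4/h1H4", "methane #1"),
           ("InChI=1S/H2O/h1H2", "water"),
           ("InChI=1S/CH4/h1H4", "methane #2")]
flags = build_flags(sorted(inchi for inchi, _ in records))
kept = remove_duplicates(records, flags)
print(kept)
```

If all the InChIs fit in memory, a plain set of already seen InChIs would give the same result without the sorting step.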

Thursday, 13 May 2010

Which is the "real" RU-486? [2]

From a Pubchem search (search term: RU-486, 20 results, 18 RO5):

Thursday, 22 April 2010

ChEMBL_03 is available!

JPO announced the new release. FTP data will be available in the next few days.

Thursday, 25 March 2010

Any other case of "Different InChIs from the Same Molecule"?: an experiment with MMsINC 1.0

Some days ago, an InChI bug was highlighted by Rich Apodaca on his blog:

Marco Fanton (see his web page) and I decided to perform an experiment with the 3M 3D structures from MMsINC 1.0. We wrote a small pure-Python script that reshuffles the atom order of a given SDF file (write to me if you are interested in the code), generated 10 random permutations "on the fly" for each of the 3M structures, automatically calculated the standard InChI of each permutation, and searched for any "new" duplicated InChI.
Results? Interesting:

Two molecules (MMs03263666, MMs03263667), both diazo compounds (and this confirms the known bug)... and no other duplicates.
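
To give an idea of what reshuffling the atom order involves, here is a minimal sketch for a single V2000 mol block. It is not the script used in the experiment. It permutes the atom lines and remaps the atom indices of the bond block; property lines (for example charges) and atom parity fields also depend on the atom numbering, so the sketch refuses mol blocks with property lines and leaves atom fields untouched. The function name is hypothetical.

```python
import random


def shuffle_molfile_atoms(molblock, rng=random):
    """Return a V2000 mol block (ending at "M  END") with atoms in random order."""
    lines = molblock.split("\n")
    counts = lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = lines[4:4 + n_atoms]
    bonds = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    tail = lines[4 + n_atoms + n_bonds:]
    extra = [l for l in tail if l.strip() and l.rstrip() != "M  END"]
    if extra:
        raise ValueError("property lines are not handled in this sketch")

    order = list(range(n_atoms))
    rng.shuffle(order)                          # order[new_pos] = old_pos
    new_index = {}
    for new_pos, old_pos in enumerate(order):
        new_index[old_pos + 1] = new_pos + 1    # mol files count atoms from 1

    new_atoms = [atoms[old_pos] for old_pos in order]
    new_bonds = []
    for b in bonds:
        a1, a2 = int(b[0:3]), int(b[3:6])
        new_bonds.append("%3d%3d%s" % (new_index[a1], new_index[a2], b[6:]))
    return "\n".join(lines[:4] + new_atoms + new_bonds + tail)


# Heavy atoms of ethanol, written by hand as a toy input
example = "\n".join([
    "ethanol (heavy atoms)", "  sketch", "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0",
    "    1.5000    0.0000    0.0000 C   0  0",
    "    2.2000    1.2000    0.0000 O   0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])
print(shuffle_molfile_atoms(example, random.Random(0)))
```

The shuffled block describes the same molecule, so a canonical identifier such as the standard InChI should not change between permutations.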

Wednesday, 10 March 2010

About MMsINC data and license

In a comment on a recent post, Egon asked two short questions about MMsINC data and license.
I want to thank Egon for the two questions.

QUESTION 1: what's the license of the data in the database?

ANSWER: MMsINC data are the property of the University of Padova.
Currently, the data are not available for download, but users can query them through the web interface.

QUESTION 2: how does the curation compare to that of ChEBI?

a. the MMsINC data sources are larger than those included in ChEBI: the first release contains a number of sources, the most relevant being ZINC (version 7), and the next release aims to process the greater part of the public data (mainly PubChem).
b. structures are not checked or processed one by one, but by following a precise protocol described here (open access NAR paper).

Some additional notes about the quality of MMsINC data:

the quality of MMsINC chemical data is higher than that of other public resources:

- MMsINC is not just a collection of public data: there is a long preprocessing step (see the Nucleic Acids Research DBIssue article for a full description of the pipeline) and a data cleaning step based on the InChIs.
- MMsINC is the only resource that collects the most probable ionic states and tautomers of all the structures (when possible and with the known limitations)
- MMsINC stores precalculated predictions of the biological enrichment of each molecule (similarity to PDB ligands, to bioactive molecules, presence of "active" fragments)
- MMsINC contains a selection of descriptors that are important from a pharmaceutical and biochemical point of view

I also want to quote Stefano (Prof. Stefano Moro, University of Padova, Italy):

MMsINC is not only a database: it is a chemogenomics work platform that places
its data and tools to work with it on an even footing.  Although the data
formally is property of the University of Padova, it is more important to note
that we feel that simply providing files to download would belittle our
mission.  Instead, we aim to bring this data and the science of chemoinformatics
together to provide the MMsINC service to our web community.

Let me thank Luca Pireddu for helping me translate Prof. Moro's quote.

Saturday, 6 March 2010


This is an interesting tool.
I want to report here the message from the developers:

OOChemistry is an extension for which provides cross-platform OLE-like integration of OOo with JChemPaint chemical diagram editor. With OOChemistry you can draw structure, embed into document (text or presentation) and then double click and edit whenever you want on any platform having and Java Runtime (Windows, Linux, Mac OS X, other Unix flavours). It is only first alpha version and is not recommended for production use (e.g., compatibility with further versions is not guaranteed).

OOChemistry needs your help! Experience in Java, in development of projects dealing with JChemPaint/CDK, or in development of extensions will be highly appreciated. Of course, you can help not only in coding, but also in translation of interface and writing docs.

Project page on SF

It sounds very interesting.
I'll try the installation asap.

Friday, 5 March 2010

Good news: is again available!; it is a good way to be introduced to the CDK functionalities.

Wednesday, 3 March 2010

About SMARTS patterns in PubChem fingerprints

See my recent post at BlueObelisk StackExchange.
Wolf Ihlenfeldt's reply is very interesting for people that want to know more about PubChem fingerprints.
As he says, the SMARTS patterns (terminal part of the FPs)...

"are the result of capturing and analyzing user queries on the old NCI Cancer Screening database Web interface. They are intended to capture features which are used in actual queries. They were not designed specifically for similarities or correlation with any properties, but it turns out that there are indications that these screens work about as well as others in that sector".

Saturday, 27 February 2010

Clustering with rcdk and Python

I received an email from a colleague: "I have a list of ChEBI IDs with the corresponding SMILES; I'd like to do some clustering, based on a similarity measure between the structures".

I want to propose here a method based on the rcdk package developed by R. Guha. This package is really nice. You should read this article carefully.

Let me suggest a variation. If you have a huge number of structures, I would suggest building your similarity matrix outside R.

This can be done with a Python script:
1) you can create your structural keys with an external tool (or with rcdk, saving the fingerprints in another file)
2) then, you can calculate the Tanimoto similarity with Python sets:

fp_A = "110011"
fp_B = "101011"

# positions of the "on" bits in each fingerprint
set_a = set(i for i, bit in enumerate(fp_A) if bit == "1")
set_b = set(i for i, bit in enumerate(fp_B) if bit == "1")

# Tanimoto = shared on bits / on bits found in either fingerprint
tanimoto = float(len(set_a & set_b)) / float(len(set_a | set_b))

3) in this way, you can calculate the matrix of similarities between all your structures, save the matrix in a file, load it into R, and use rcdk for the clustering.
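
Step 3 could look like the following sketch. It is a minimal illustration: the identifiers and fingerprints are made up, the function names are hypothetical, and the matrix is written as plain CSV so that R can read it.

```python
import csv
import sys


def on_bits(fp):
    """Positions of the "1" characters of a bit-string fingerprint."""
    return set(i for i, bit in enumerate(fp) if bit == "1")


def tanimoto(a, b):
    union = len(a | b)
    # Two empty fingerprints share nothing; report 0.0 instead of dividing by zero
    return float(len(a & b)) / union if union else 0.0


def similarity_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto similarities (diagonal set to 1.0)."""
    bits = [on_bits(fp) for fp in fingerprints]
    n = len(bits)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = tanimoto(bits[i], bits[j])
    return matrix


def write_matrix(handle, ids, matrix):
    """Write the matrix as CSV, with the ids as header row and first column."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow([""] + ids)
    for mol_id, row in zip(ids, matrix):
        writer.writerow([mol_id] + row)


ids = ["mol_A", "mol_B", "mol_C"]
fps = ["110011", "101011", "000100"]
matrix = similarity_matrix(fps)
write_matrix(sys.stdout, ids, matrix)
```

Passing an open file instead of sys.stdout saves the matrix to disk; in R it can then be loaded with read.csv and row.names = 1.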

That's all.

Thursday, 25 February 2010

Chemoinformatics in R:

Really interesting if you want to learn more about R programming applied to chemoinformatics: a Joint EBI-Industry Workshop on Cheminformatics in R.
Speakers of this short course:
- Rajarshi Guha, NIH Chemical Genomics Center (R-CDK and R-Pubchem)
- Steffen Neumann, AG Massenspektrometrie & Bioinformatik (XCMS, Rdisop, CAMERA)
- H. Paul Benton, Imperial College London.
- David Broadhurst, Cork University Maternity Hospital.

Course page

Wednesday, 24 February 2010

New data load for kinase SARfari screening data

As JPO reported on the ChEMBL Blog, there is a new data load for the beta version of Kinase SARfari.

See the post at ChEMBL blog

Thursday, 18 February 2010


molpaint - the MMsINC molecular paint tool, based on DINGO.

Tuesday, 9 February 2010

Looking for Open Source Chemical Descriptors

Just asked a question about this point on the Blue Obelisk Exchange.

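With rcdk it is very easy to get a good number of descriptors with a few lines of code:

library(rcdk)
# descriptor categories, and the descriptor names of the first category
dc <- get.desc.categories()
dn <- get.desc.names(dc[1])
# evaluate every available descriptor on benzene
mol <- parse.smiles("c1ccccc1")
descNames <- unique(unlist(sapply(get.desc.categories(), get.desc.names)))
descs <- eval.desc(mol, descNames)

Nice! 288 descriptors (but 65 of them are "NA").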

Database Indexing for a faster Substructure Search

In our first release of MMsINC (see recent post) we developed a strategy that can help to speed up substructure searches in huge databases.

1) choose a fragmentation algorithm
2) fragment all the compounds in your database
3) store the fragments in your database

Then, when the user submits a query, you can apply your fragmentation tool to the query compound.
If you are lucky, you can restrict the search space to the database compounds that share the same fragments as the query.
You can then apply the exact substructure search to this reduced set of compounds.

You need more disk space, but in this way you can save computation time.
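
The idea can be sketched as an inverted index from fragments to compounds. This is an illustration under stated assumptions, not the MMsINC code: the fragments are made-up strings, the compound ids and function names are hypothetical, and the fragmentation itself is not shown. Whether such a filter can miss true hits depends on the fragmentation algorithm.

```python
from collections import defaultdict


def build_index(fragments_by_compound):
    """Map every fragment to the set of compound ids that contain it."""
    index = defaultdict(set)
    for compound_id, fragments in fragments_by_compound.items():
        for fragment in fragments:
            index[fragment].add(compound_id)
    return index


def candidates(index, query_fragments):
    """Compounds containing all the query fragments.

    Returns None when the query has no fragments, meaning that the
    search space cannot be restricted.
    """
    sets = [index.get(f, set()) for f in query_fragments]
    if not sets:
        return None
    return set.intersection(*sets)


# Toy database: compound id -> fragments produced by the fragmentation step
db = {
    "cmpd1": {"c1ccccc1", "C(=O)O"},
    "cmpd2": {"c1ccccc1", "N"},
    "cmpd3": {"C(=O)O", "CC"},
}
index = build_index(db)
hits = candidates(index, {"c1ccccc1", "C(=O)O"})
print(sorted(hits))
```

Only the compounds returned by candidates then need to go through the exact (and slower) substructure search.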

Monday, 8 February 2010


What is MMsINC?

MMsINC is a database of non-redundant, richly annotated, and biomedically relevant chemical structures.

A primary goal of MMsINC is to guarantee the highest quality and the uniqueness of each entry. MMsINC then adds value to these entries by including the analysis of crucial chemical properties such as ionization and tautomerization processes, and the in silico prediction of 24 important molecular properties in the biochemical profile of each structure. MMsINC is consequently a natural input for different chemoinformatics and virtual screening applications. In addition, MMsINC supports various types of queries, including substructure queries and the novel "molecular scissoring" query.

MMsINC is interfaced with other primary data collectors such as PubChem, Protein Data Bank (PDB), the Food and Drug Administration (FDA) database of approved drugs, and ZINC.

The current database contains about 4 million unique compounds. For all the molecules, we calculate 24 molecular properties useful for quantitative structure-activity relationship (QSAR), diversity analysis or combinatorial library design.