Saturday, 27 February 2010

Clustering with rcdk and Python

I received an email from a colleague: "I have a list of ChEBI IDs with the corresponding SMILES; I'd like to do some clustering, based on a similarity measure between the structures".

Here I want to propose a method based on the rcdk package developed by R. Guha. This package is really nice. You should read this article carefully.

Let me suggest a variation. If you have a huge number of structures, I would suggest computing your similarity matrix outside R.

This can be done with a Python script:
1) you can create your structural keys with an external tool (or with rcdk, saving the fingerprints to another file)
2) then, you can calculate the Tanimoto similarity with Python's built-in set type (the old sets module is deprecated):

fp_A = "110011"
fp_B = "101011"

# Collect the positions of the bits set to "1" in each fingerprint
set_a = set(i for i, bit in enumerate(fp_A) if bit == "1")
set_b = set(i for i, bit in enumerate(fp_B) if bit == "1")

# Tanimoto = |A intersection B| / |A union B| (assumes at least one bit is set)
tanimoto = float(len(set_a.intersection(set_b))) / float(len(set_a.union(set_b)))

3) in this way, you can calculate the matrix of similarities between all your structures, save the matrix in a file, load it into the R environment, and use rcdk for the clustering.
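The three steps can be put together in a short sketch. The IDs and 6-bit fingerprints below are made up for illustration, the helper names (on_bits, tanimoto, similarity_matrix, write_matrix) are my own, and the CSV layout is only one possible choice. Two conventions are also assumptions: a pair of empty fingerprints gets similarity 0.0, and the diagonal is set to 1.0.

```python
import csv
import io


def on_bits(fp):
    """Return the set of positions whose bit is "1" in a fingerprint string."""
    return {i for i, bit in enumerate(fp) if bit == "1"}


def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit positions."""
    union = a | b
    # Assumed convention: two empty fingerprints have similarity 0.0
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(fingerprints):
    """Build the symmetric matrix of pairwise Tanimoto similarities."""
    ids = list(fingerprints)
    bits = [on_bits(fingerprints[k]) for k in ids]
    n = len(ids)
    matrix = [[1.0] * n for _ in range(n)]  # self-similarity set to 1.0
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = tanimoto(bits[i], bits[j])
    return ids, matrix


def write_matrix(handle, ids, matrix):
    """Write the matrix as CSV, with the IDs as header row and first column."""
    writer = csv.writer(handle)
    writer.writerow([""] + ids)
    for name, row in zip(ids, matrix):
        writer.writerow([name] + row)


# Hypothetical IDs and toy fingerprints, for illustration only
fingerprints = {"ID_A": "110011", "ID_B": "101011", "ID_C": "000111"}
ids, matrix = similarity_matrix(fingerprints)

# An in-memory buffer stands in for a real file here
buffer = io.StringIO()
write_matrix(buffer, ids, matrix)
print(buffer.getvalue())
```

In practice you would pass a file opened with open("similarities.csv", "w", newline="") instead of the buffer. On the R side, something along the lines of read.csv with row.names=1, followed by as.dist(1 - m) and hclust, should then be enough to cluster on the resulting distances.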

That's all.

Thursday, 25 February 2010

Chemoinformatics in R:

Really interesting if you want to learn more about R programming applied to chemoinformatics: a Joint EBI-Industry Workshop on Cheminformatics in R.
Speakers of this short course:
- Rajarshi Guha, NIH Chemical Genomics Center (R-CDK and R-Pubchem)
- Steffen Neumann, AG Massenspektrometrie & Bioinformatik (XCMS, Rdisop, CAMERA)
- H. Paul Benton, Imperial College London.
- David Broadhurst, Cork University Maternity Hospital.

Course page

Wednesday, 24 February 2010

New data load for kinase SARfari screening data

As JPO reported on the ChEMBL Blog, there is a new data load for the beta version of Kinase SARfari.

See the post on the ChEMBL blog.

Thursday, 18 February 2010


molpaint - the MMsINC molecular paint tool, based on DINGO.

Tuesday, 9 February 2010

Looking for Open Source Chemical Descriptors

Just asked a question about this point on the Blue Obelisk Exchange.

With rcdk it is very easy to get a good number of descriptors with a few lines of code:

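library(rcdk)
dc <- get.desc.categories()  # descriptor categories
dn <- get.desc.names(dc[1])  # descriptor names in the first category
mol <- parse.smiles("c1ccccc1")  # benzene
# Collect the names of all descriptors across categories, then evaluate them
descNames <- unique(unlist(sapply(dc, get.desc.names)))
descs <- eval.desc(mol, descNames)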

Nice! 288 descriptors (but 65 of them are "NA").

Database Indexing for a faster Substructure Search

In our first release of MMsINC (see recent post) we developed a strategy that can help speed up substructure searches in huge databases.

1) choose a fragmentation algorithm
2) fragment all the compounds in your database
3) store the fragments in your database

Then, when the user submits a query, you can apply your fragmentation tool to the query compound.
If you are lucky, you can restrict the search space to the database compounds that contain the same fragments as the query.
You can then apply the exact substructure search to this reduced set of compounds.

You need more disk space, but in this way you can save computation time.
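As a rough illustration of the idea (not the actual MMsINC implementation), the fragment store can be treated as an inverted index from fragment to compound IDs. Everything below is hypothetical: the fragment labels, the compound IDs, the function names, and the exact_match callable that stands in for a real substructure matcher. The sketch also assumes that the fragmentation algorithm produces, for any compound containing the query, every fragment produced for the query itself; that depends on the algorithm chosen.

```python
from collections import defaultdict


def build_fragment_index(compound_fragments):
    """Map each fragment to the set of compound IDs that contain it."""
    index = defaultdict(set)
    for compound_id, fragments in compound_fragments.items():
        for fragment in fragments:
            index[fragment].add(compound_id)
    return index


def candidate_compounds(index, query_fragments):
    """Return the IDs of compounds containing every fragment of the query.

    Returns None when the query has no fragments, meaning the index
    cannot narrow the search.
    """
    postings = [index.get(fragment, set()) for fragment in query_fragments]
    if not postings:
        return None
    return set.intersection(*postings)


def substructure_search(index, all_ids, query_fragments, exact_match):
    """Prefilter with the index, then run the exact check on the survivors."""
    candidates = candidate_compounds(index, query_fragments)
    if candidates is None:
        candidates = set(all_ids)  # unlucky case: fall back to a full scan
    return sorted(c for c in candidates if exact_match(c))


# Toy data: fragments are plain labels, precomputed by some fragmentation tool
database = {
    "cpd1": {"benzene", "carboxyl"},
    "cpd2": {"benzene", "amine"},
    "cpd3": {"carboxyl"},
}
index = build_fragment_index(database)

# Stand-in for the exact substructure test; here it accepts every candidate
hits = substructure_search(index, database, {"benzene", "carboxyl"},
                           exact_match=lambda compound_id: True)
print(hits)
```

The expensive exact test runs only on the compounds that survive the set intersection, which is where the time saving comes from. The index itself is the extra disk space.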

Monday, 8 February 2010


What is MMsINC?

MMsINC is a database of non-redundant, richly annotated, and biomedically relevant chemical structures.

A primary goal of MMsINC is to guarantee the highest quality and the uniqueness of each entry. MMsINC then adds value to these entries by including the analysis of crucial chemical properties such as ionization and tautomerization processes, and the in silico prediction of 24 important molecular properties in the biochemical profile of each structure. MMsINC is consequently a natural input for different chemoinformatics and virtual screening applications. In addition, MMsINC supports various types of queries, including substructure queries and the novel "molecular scissoring" query.

MMsINC is interfaced with other primary data collectors such as PubChem, Protein Data Bank (PDB), the Food and Drug Administration (FDA) database of approved drugs, and ZINC.

The current database contains about 4 million unique compounds. For all the molecules, we calculate 24 molecular properties useful for quantitative structure-activity relationship (QSAR), diversity analysis or combinatorial library design.