Saturday, 27 February 2010

Clustering with rcdk and Python

I received an email from a colleague: "I have a list of ChEBI IDs with the corresponding SMILES; I'd like to do some clustering, based on a similarity measure between the structures".

Here I want to propose a method based on the rcdk package developed by R. Guha. This package is really nice, and this article is worth reading carefully.

Let me suggest a variation. If you have a huge number of structures, I would suggest computing your similarity matrix outside R.

This can be done with a Python script:
1) create your structural keys with an external tool (or with rcdk, saving the fingerprints to a file)
2) then calculate the Tanimoto similarity using Python's set operations:


# Toy fingerprints written as bit strings
fp_A = "110011"
fp_B = "101011"

# Collect the positions of the "on" bits of each fingerprint
set_a = set(i for i, bit in enumerate(fp_A) if bit == "1")
set_b = set(i for i, bit in enumerate(fp_B) if bit == "1")

# Tanimoto = shared on-bits / on-bits present in either fingerprint
# (the union is empty, and the division fails, if neither fingerprint has an on-bit)
tanimoto = float(len(set_a & set_b)) / float(len(set_a | set_b))


3) in this way, you can calculate the matrix of similarities between all your structures, save the matrix to a file, load it into the R environment, and do the clustering there along the lines of the rcdk approach.
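
Steps 2 and 3 could be put together roughly as follows. This is only a minimal sketch: the identifiers id_1, id_2, id_3, the toy fingerprints, the helper function names, and the output file name similarities.csv are all made up for illustration, and I assume that two fingerprints without any on-bit get a similarity of 0.0.

```python
import csv


def on_bits(fingerprint):
    """Return the set of positions whose bit is "1"."""
    return {i for i, bit in enumerate(fingerprint) if bit == "1"}


def tanimoto(set_a, set_b):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(set_a | set_b)
    # Assumption: with no on-bits on either side, report 0.0 instead of failing
    return len(set_a & set_b) / float(union) if union else 0.0


def similarity_matrix(fingerprints):
    """Return the sorted IDs and the symmetric matrix of pairwise similarities."""
    ids = sorted(fingerprints)
    bits = {key: on_bits(fingerprints[key]) for key in ids}
    matrix = [[tanimoto(bits[a], bits[b]) for b in ids] for a in ids]
    return ids, matrix


def write_matrix(ids, matrix, handle):
    """Write the matrix as CSV, with the IDs as header row and first column."""
    writer = csv.writer(handle)
    writer.writerow([""] + ids)
    for row_id, row in zip(ids, matrix):
        writer.writerow([row_id] + ["%.4f" % value for value in row])


# Hypothetical identifiers with toy fingerprints
fingerprints = {"id_1": "110011", "id_2": "101011", "id_3": "000111"}
ids, matrix = similarity_matrix(fingerprints)

if __name__ == "__main__":
    # Hypothetical output file, to be loaded in R afterwards
    with open("similarities.csv", "w", newline="") as handle:
        write_matrix(ids, matrix, handle)
```

On the R side, one possibility is to read the file with read.csv("similarities.csv", row.names = 1), convert it to distances with as.dist(1 - as.matrix(m)), and pass the result to hclust. Note that the loop computes every pair twice; for a really huge set you would fill only one triangle of the matrix and mirror it.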

That's all.
