Here I want to propose a method based on the rcdk package developed by R. Guha. The package is really nice, and you should read this article carefully.
Let me suggest a variation. If you have a huge number of structures, I would suggest building your similarity matrix outside R.
This can be done with a Python script:
1) create your structural keys with an external tool (or with rcdk, saving the fingerprints to a file)
2) then calculate the Tanimoto similarity using Python set operations:
# Toy fingerprints written as bit strings
fp_A = "110011"
fp_B = "101011"

# Collect the positions of the "on" bits of each fingerprint
# (built-in sets replace the old "sets" module, which Python 3 no longer has)
set_a = {i for i, bit in enumerate(fp_A) if bit == "1"}
set_b = {i for i, bit in enumerate(fp_B) if bit == "1"}

# Tanimoto = shared on-bits / on-bits present in either fingerprint
tanimoto = float(len(set_a & set_b)) / float(len(set_a | set_b))
3) in this way you can calculate the similarity matrix for all your structures, save it to a file, load it into R, and use rcdk for the clustering.
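To make step 3 concrete, here is a minimal sketch of how the pairwise matrix could be built and written to a CSV file. It assumes Python 3 and fingerprints stored as bit strings of equal length. The function names (on_bits, tanimoto, similarity_matrix, write_matrix) and the CSV layout are my own choices for illustration, not part of rcdk or any other package.

```python
import csv


def on_bits(fp):
    """Return the set of positions where the bit string has a '1'."""
    return {i for i, bit in enumerate(fp) if bit == "1"}


def tanimoto(a, b):
    """Tanimoto similarity of two sets of on-bit positions."""
    union = a | b
    # Assumption: two empty fingerprints get a similarity of 0.0
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(fingerprints):
    """Build the full symmetric matrix of pairwise Tanimoto similarities."""
    bit_sets = [on_bits(fp) for fp in fingerprints]
    n = len(bit_sets)
    # Diagonal is 1.0: every structure is treated as identical to itself
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Compute each pair once and mirror it across the diagonal
            matrix[i][j] = matrix[j][i] = tanimoto(bit_sets[i], bit_sets[j])
    return matrix


def write_matrix(matrix, names, path):
    """Write the matrix as CSV with structure names as header and row labels."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([""] + list(names))
        for name, row in zip(names, matrix):
            writer.writerow([name] + row)
```

For example, write_matrix(similarity_matrix(["110011", "101011"]), ["mol_A", "mol_B"], "similarities.csv") would produce a file that R can load with read.csv(..., row.names = 1). The double loop is quadratic in the number of structures, so for a really huge set you may prefer to write the rows out as they are computed instead of keeping the whole matrix in memory.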
That's all.