1) generate your preferred inchis for all the structures in your big SDF, and update the SDF with these inchis (you can use pybel for that)
2) extract PUBCHEM_COMPOUND_CID from the SDF:
grep -A 1 PUBCHEM_COMPOUND_CID big.sdf | grep -v PUBCHEM_COMPOUND_CID | grep -v -e '^--$' > CIDs.txt
3) then put inchis and CIDs in the same file:
paste inchi CIDs.txt > inchi_CID.txt
4) now you can sort this file:
sort inchi_CID.txt -o inchi_CID_sort.txt
so, all the duplicates end up on adjacent lines and are easy to spot...
5) now, you could load all the inchis as keys of a python dictionary (which you can save with cPickle)... if an inchi is unique in the inchi_CID_sort.txt file, the value of the key is 0; if it is a duplicate (last visited inchi == current inchi), the value of the key is 10.
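A minimal sketch of step 5, assuming the sorted file is tab-separated (the default for paste) with the inchi in the first column. The function name build_flags and the sample lines with made-up CIDs are only for illustration; in Python 3 the pickle module does the job of cPickle.

```python
def build_flags(sorted_lines):
    """Map each inchi to 0 (unique) or 10 (duplicated).

    Expects lines already sorted by inchi, so duplicates are adjacent.
    """
    flags = {}
    previous = None
    for line in sorted_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        inchi = line.split("\t")[0]  # first column of the paste output
        if inchi == previous:
            flags[inchi] = 10  # same as the last visited inchi: duplicate
        else:
            flags[inchi] = 0   # first time we see this inchi
        previous = inchi
    return flags


# Illustrative sample only: the CIDs in the second column are made up.
example_lines = [
    "InChI=1S/CH4/h1H4\t1\n",
    "InChI=1S/H2O/h1H2\t2\n",
    "InChI=1S/H2O/h1H2\t3\n",
]
flags = build_flags(example_lines)
```

With the real data you would pass open("inchi_CID_sort.txt") instead of the sample list.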
6) now, the python script should parse the SDF in this way:
for each structure:
if the inchi of this structure has value 0 in the dictionary, save the molecule;
if the value is 10, save the molecule, but change the value to 11;
if the value is 11, skip this structure
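The rules above could be sketched like this in plain Python, without pybel. Two things are assumptions of mine: the inchi is stored in the SDF under a tag called MY_INCHI (use whatever tag name you chose in step 1), and an inchi that is missing from the dictionary is kept. The sample records are stripped down (no atom block) just to show the logic.

```python
def iter_sdf_records(lines):
    """Yield each SDF record as a list of lines, ending with the $$$$ line."""
    record = []
    for line in lines:
        record.append(line)
        if line.strip() == "$$$$":
            yield record
            record = []


def get_tag(record, tag):
    """Return the first data line stored under '> <tag>', or None."""
    header = "<%s>" % tag
    for i, line in enumerate(record[:-1]):
        if line.startswith(">") and header in line:
            return record[i + 1].strip()
    return None


def filter_duplicates(records, flags, tag="MY_INCHI"):
    """Yield only the first record for each inchi, following the 0/10/11 rules."""
    for record in records:
        inchi = get_tag(record, tag)
        value = flags.get(inchi, 0)  # assumption: unknown inchis are kept
        if value == 0:
            yield record             # unique: save the molecule
        elif value == 10:
            flags[inchi] = 11        # first copy of a duplicate: save and mark
            yield record
        # value == 11: a copy was already saved, skip this structure


# Stripped-down sample records, for illustration only.
sample = (
    "water_a\n> <MY_INCHI>\nInChI=1S/H2O/h1H2\n\n$$$$\n"
    "methane\n> <MY_INCHI>\nInChI=1S/CH4/h1H4\n\n$$$$\n"
    "water_b\n> <MY_INCHI>\nInChI=1S/H2O/h1H2\n\n$$$$\n"
)
flags = {"InChI=1S/H2O/h1H2": 10, "InChI=1S/CH4/h1H4": 0}
records = iter_sdf_records(sample.splitlines(True))
kept = list(filter_duplicates(records, flags))
kept_names = [record[0].strip() for record in kept]
```

Here water_a and methane are kept, while water_b is skipped because its inchi was already marked with 11.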
I would suggest saving the output file every 100,000 structures, then opening a different output file at each iteration... at the end, a "cat" command will generate a big SDF without duplicates.
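One possible way to do the chunked output, as a sketch: the function name write_in_chunks and the unique_ file prefix are hypothetical, and each record is assumed to be a string that already ends with its $$$$ line. The file numbers are zero-padded so that a shell glob lists the chunks in the order they were written.

```python
import os
import tempfile


def write_in_chunks(records, out_dir, chunk_size=100000, prefix="unique_"):
    """Write records to numbered files, opening a new file every chunk_size records."""
    paths = []
    out = None
    try:
        for count, record in enumerate(records):
            if count % chunk_size == 0:
                if out is not None:
                    out.close()  # finish the previous chunk
                name = "%s%04d.sdf" % (prefix, len(paths))
                path = os.path.join(out_dir, name)
                out = open(path, "w")
                paths.append(path)
            out.write(record)
    finally:
        if out is not None:
            out.close()
    return paths


# Tiny demonstration with fake records and a chunk size of 2.
records = ["mol%d\n$$$$\n" % i for i in range(5)]
out_dir = tempfile.mkdtemp()
paths = write_in_chunks(records, out_dir, chunk_size=2)

# Reading the chunks back in name order is what "cat" would do.
merged = ""
for path in sorted(paths):
    with open(path) as handle:
        merged += handle.read()
```

With these names, something like cat unique_*.sdf > big_unique.sdf would rebuild the single SDF at the end.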