Wednesday, 23 June 2010

Removing duplicates from large SDF files

Maybe there are better solutions, but this worked very well with a random set taken from PubChem (5,000,000 structures, into which I introduced random duplicates for a total of 120,000,000 structures):
1) generate your preferred InChIs for all the structures in your big SDF, and update the SDF with these InChIs (you can use pybel for that); it also helps to write the InChIs, one per line, to a plain text file, since you will need them in step 3. A sketch of this step follows below.
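
A minimal sketch of step 1 using pybel; the file names big.sdf, big_inchi.sdf and inchi.txt are just examples, and you may need to adjust how the InChI string is cleaned up depending on your Open Babel version:

import pybel

out_sdf = pybel.Outputfile("sdf", "big_inchi.sdf", overwrite=True)  # SDF updated with InChIs
inchi_file = open("inchi.txt", "w")                                 # one InChI per line, for step 3

for mol in pybel.readfile("sdf", "big.sdf"):
    inchi = mol.write("inchi").strip()   # compute the InChI string for this structure
    mol.data["InChI"] = inchi            # store it as an SD data field
    out_sdf.write(mol)
    inchi_file.write(inchi + "\n")

out_sdf.close()
inchi_file.close()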
2) extract PUBCHEM_COMPOUND_CID from the SDF:

grep -A 1 PUBCHEM_COMPOUND_CID big.sdf | grep -v PUBCHEM_COMPOUND_CID | grep -v -- "--" > CIDs.txt

3) then put the InChIs and the CIDs in the same file:

paste inchi.txt CIDs.txt > inchi_CID.txt

4) now you can sort this file:

sort inchi_CID.txt -o inchi_CID_sort.txt

so that all the duplicates end up on adjacent lines...
5) now you can load all the InChIs as keys of a Python dictionary (serialized with cPickle, so it can be reused later): if an InChI is unique in the inchi_CID_sort.txt file, the value of the key is 0; if it is a duplicate (the current InChI is equal to the previous one in the sorted file), the value of the key is 10. A sketch of this step follows below.
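
A minimal sketch of step 5, in Python 2 since cPickle is mentioned; the output file name dict.pkl is just an example:

import cPickle

seen = {}
previous = None
for line in open("inchi_CID_sort.txt"):
    inchi = line.split("\t")[0]        # first column of the pasted file is the InChI
    if inchi == previous:
        seen[inchi] = 10               # duplicate: only its first copy will be kept in step 6
    else:
        seen.setdefault(inchi, 0)      # unique so far
    previous = inchi

cPickle.dump(seen, open("dict.pkl", "wb"), cPickle.HIGHEST_PROTOCOL)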
6) now, the Python script should parse the SDF in this way (a sketch follows below):
for each structure:
if the InChI of this structure has value 0 in the dictionary, save the molecule;
if the value is 10, save the molecule, but change the value to 11;
if the value is 11, skip this structure (its first copy has already been saved)
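
A minimal sketch of step 6, reusing the file names from the previous sketches (big_inchi.sdf, dict.pkl and big_nodup.sdf are all just examples):

import pybel
import cPickle

seen = cPickle.load(open("dict.pkl", "rb"))
out = pybel.Outputfile("sdf", "big_nodup.sdf", overwrite=True)

for mol in pybel.readfile("sdf", "big_inchi.sdf"):
    inchi = mol.data["InChI"]          # the SD field written in step 1
    if seen[inchi] == 0:               # unique structure: keep it
        out.write(mol)
    elif seen[inchi] == 10:            # first copy of a duplicate: keep it once
        out.write(mol)
        seen[inchi] = 11
    # value 11: this duplicate has already been written, so skip it

out.close()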

I would suggest saving the output every 100,000 structures, opening a different output file for each chunk... at the end, a "cat" command will concatenate the chunks into one big SDF without duplicates (see the sketch below).
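
A sketch of the chunked-output idea, again with pybel; the out_NNNN.sdf prefix and the 100,000 chunk size are just examples, and the write() method below is meant to replace the plain out.write(mol) calls in the step 6 sketch:

import pybel

class ChunkedSDFWriter(object):
    """Start a new SDF (out_0000.sdf, out_0001.sdf, ...) every `chunk` molecules written."""

    def __init__(self, prefix="out", chunk=100000):
        self.prefix, self.chunk = prefix, chunk
        self.count, self.part = 0, 0
        self.out = pybel.Outputfile("sdf", "%s_%04d.sdf" % (prefix, self.part), overwrite=True)

    def write(self, mol):
        self.out.write(mol)
        self.count += 1
        if self.count % self.chunk == 0:           # chunk is full: rotate to a new file
            self.out.close()
            self.part += 1
            self.out = pybel.Outputfile("sdf", "%s_%04d.sdf" % (self.prefix, self.part), overwrite=True)

    def close(self):
        self.out.close()

and afterwards, on the command line:

cat out_*.sdf > big_nodup.sdf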