alchemoinformatics

pepMMsMIMIC paper is out

2011-05-28T19:37:00.000+02:00

The pepMMsMIMIC paper can now be accessed on the Nucleic Acids Research website.

A novel web-oriented peptidomimetic compound virtual screening tool.

2011-04-26T10:08:00.000+02:00

pepMMsMIMIC is a public, web-based virtual screening platform. Its aim is to suggest chemical compounds whose essential elements (the pharmacophore) mimic a natural peptide or protein in 3D space, and which will hopefully retain the ability to interact with the biological target and produce the typical biological effect.

Starting from the 3D structure of any protein-protein or protein-peptide complex, the pepMMsMIMIC design process begins by identifying the key residues that are responsible for the protein-protein recognition process. In this step, the peptide complexity is reduced and the basic pharmacophore model is defined by its critical structural features (peptide annotation points) in 3D space.

The pepMMsMIMIC paper has been accepted for publication in the NAR Web Server Issue 2011. I will post the Advance Access link in the near future.

Here is the abstract:

pepMMsMIMIC is a novel web-oriented peptidomimetic compound virtual screening tool based on a multi-conformers 3D-similarity search strategy. Key to the development of pepMMsMIMIC has been the creation of a library of 17 million conformers calculated from 3.9 million commercially available chemicals collected in the MMsINC® database. Using as input the three-dimensional structure of a peptide bound to a protein, pepMMsMIMIC suggests which chemical structures are able to mimic the protein-protein recognition of this natural peptide using both pharmacophore and shape similarity techniques. We hope that the accessibility of pepMMsMIMIC will encourage medicinal chemists to de-peptidize protein-protein recognition processes of biological interest, thus increasing the potential of in silico peptidomimetic compound screening of known small molecules to expedite drug development.

MAISTAS: a tool for automatic structural evaluation of alternative splicing products

2011-04-19T12:12:00.002+02:00

MAISTAS: a tool for automatic structural evaluation of alternative splicing products

Matteo Floris 1, Domenico Raimondo 2, Guido Leoni 2, Massimiliano Orsini 1, Paolo Marcatili 2 and Anna Tramontano 3,4*

Author Affiliations

1 CRS4-Bioinformatics Laboratory, c/o Sardegna Ricerche Scientific Park, Pula, 09010 Cagliari, Italy

2 Department of Biochemical Sciences, Sapienza University of Rome, P.le A. Moro, 5 - 00185 Rome, Italy

3 Department of Physics, Sapienza University of Rome, P.le A. Moro, 5 - 00185 Rome, Italy.

4 Istituto Pasteur Fondazione Cenci Bolognetti, Sapienza University of Rome, P.le A. Moro, 5 - 00185 Rome, Italy.

*To whom correspondence should be addressed. Prof. Anna Tramontano, E-mail: anna.tramontano@uniroma1.it

Received October 26, 2010

Revision received March 17, 2011

Accepted March 22, 2011

Bioinformatics (2011) doi: 10.1093/bioinformatics/btr198 First published online: April 15, 2011

Abstract

Motivation: Analysis of the human genome revealed that the amount of transcribed sequence is an order of magnitude greater than the number of predicted and well characterized genes. A sizeable fraction of these transcripts is related to alternatively spliced forms of known protein coding genes. Inspection of the alternatively spliced transcripts identified in the pilot phase of the ENCODE project has clearly shown that often their structure might substantially differ from that of other isoforms of the same gene, and therefore that they might perform unrelated functions, or that they might even not correspond to a functional protein. Identifying these cases is obviously relevant for the functional assignment of gene products and for the interpretation of the effect of variations in the corresponding proteins.

Results: Here we describe a publicly available tool that, given a gene or a protein, retrieves and analyses all its annotated isoforms, provides users with three-dimensional models of the isoform(s) of his/her interest whenever possible and automatically assesses whether homology derived structural models correspond to plausible structures. This information is clearly relevant. When the homology model of some isoforms of a gene does not seem structurally plausible, the implications are that either they assume a structure unrelated to that of the other isoforms of the same gene with presumably significant functional differences, or do not correspond to functional products. We provide indications that the second hypothesis is likely to be true for a substantial fraction of the cases.

Availability: http://maistas.bioinformatica.crs4.it/

Splicing isoform modeling, peptidomimetics and molecular dynamics made easy

2011-04-13T09:10:00.000+02:00

This new season has started with 3 newly accepted papers. Here is a brief introduction; I will give more details on each of them very soon:

Maìstas (Bioinformatics, first author), a fully automatic pipeline aimed at building and assessing three-dimensional models for alternative splicing isoforms. The server builds, when possible, comparative structural models for all the splicing isoforms of a submitted gene or set of genes. The models are then analysed in terms of their suitability to exist in the monomeric state; that is, when a warning appears in the model assessment, the possibility that a multimeric state may stabilize the structure cannot be excluded. Moreover, the exonic coordinates of each splicing isoform are mapped onto the final models.
pep:MMs:MIMIC (Nucleic Acids Research, Web Server Issue, first author), a web-oriented tool that, given the three-dimensional structure of a peptide, automates a multi-conformer three-dimensional similarity search among 17 million conformers calculated from 3.9 million commercially available chemicals collected in the MMsINC database.
ClickMD (Future Medicinal Chemistry), a web-based explicit-solvent molecular dynamics simulator. ClickMD performs minimization, an equilibration phase and a short run of classical MD. ClickMD works with PDB files of proteins and peptides. You just need a valid PDB file to start the MD simulation! At the end of the simulation you will receive an e-mail containing a link to a web page where you can download the MD results: log files, trajectory files, and energy and RMSD representations and graphs.

Job position: modeling the interaction of genetic and environmental factors in autoimmune diseases

2011-01-04T16:17:00.000+01:00

Not exactly drug design (not yet): a grant is available for modeling the interaction of genetic + environmental factors in autoimmune diseases. Deadline for application is Jan 11, 12AM Italy timezone.

It will be coordinated by CRS4 in collaboration with two clinical units for MS and type 1 diabetes (DT1) in Cagliari hospitals and the Biomed Dept for Crohn's disease in Sassari. Info in Italian at http://www.unica.it/UserFiles/File/Selezioni/SciCardio%20N.%206.doc.

If you are interested, please email Enrico Pieroni directly (ep@crs4.it).

How to download MMsINC entries for Autodock.

2010-12-23T19:05:00.003+01:00

AutoDock is a suite of automated docking tools. It is designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure.

AutoDock 4 uses PDBQT formatted files not only for the receptor but also for the ligand.
The PDBQT format is described here.

Yesterday I converted all the MMsINC entries (about 4M) to the PDBQT format.
It is therefore now possible to download each entry in the input format for AutoDock experiments.

Have a look at this example page. As you can see, at the top of the page there is a menu for downloading the entry in several file formats. The first of them is the AutoDock input file. Simply press the "go" button to download it.

Let me close this post with this nice photo.

Where: the place is the Poetto Beach in Cagliari (Sardinia, Italy).
What: Stefano Moro and I eating these "ricci di mare" (sea urchins).

The hard life of a chemoinformatician (and the chemical space in a bag)

2010-12-16T10:31:00.002+01:00

Part 1: The hard life of a chemoinformatician

Thoughts at the end of the Year 2010.
PhD going to be completed in the next 2 months.
My contract is coming to an end (February).
I would like to do more cheminfo than bioinfo.
But this is not easy, also because there are more opportunities in bioinfo than in cheminfo.

Part 2: the chemical space in a bag

Stefano Moro will be welcome in my village next weekend.
In his bag, there will be the whole public chemical space (the future MMsINC 2.0).
I will enjoy exploring that space during the holidays.

Who is using MMsINC?

2010-11-08T12:53:00.004+01:00

A good number of people are using MMsINC (summary from May to October 2010),

but what is interesting is the geographical distribution of the visitors: of the last 500 visitors, 67% were from India.

Pharao installation

2010-09-17T14:30:00.002+02:00

A few days ago Silicos released the source code of the tool Pharao.
Here are some notes about the installation of Pharao.
I had some problems, and with the help of the Pharao developers (Gert Thijs) I understood how to install it properly.

1) install cmake
2) install the latest openbabel release via svn
3) follow the installation of openbabel here
4) download and install pharao

If everything worked fine, there should not be any problem.

See also BO exchange.

Automated large-scale protein modeling

2010-09-12T12:14:00.008+02:00

A pipeline for automated comparative modeling of many proteins can easily be built from the following components:

1) the best template (based on some filters such as the e-value and the resolution) is identified using the hhsearch program;
2) Modeller is used for the comparative modeling;
3) databases for the template search (nr90 and nr70) and for the modeling (a reformatted version of the PDB) are available from the hhsearch ftp site;
4) a set of python and bash utilities for the management of the jobs on a computer cluster.

All these building blocks are part of my pipeline, which is going to be released.
I'll give more information soon.

Some days ago a test experiment revealed that the pipeline can build 450 models in 9 hours (50 models per hour).
Not so fast, but my pipeline also contains some modules for model assessment, and the cluster resources are shared with a lot of different users.

With a dedicated (and larger) cluster, I suppose it would be possible to model the whole human proteome (ca. 78,000 peptides, source: Ensembl) in 1 or 2 weeks with this pipeline.
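As a rough check of that estimate, here is a back-of-the-envelope sketch. It assumes one model per peptide and a throughput that scales linearly with cluster size; both are simplifications of mine, and the function name is only illustrative.

```python
def modeling_days(n_targets, models_per_hour, speedup=1.0):
    """Days needed to model n_targets at a given throughput.

    speedup is a hypothetical factor for a larger, dedicated cluster
    (assumed to scale linearly, which real clusters rarely do).
    """
    hours = float(n_targets) / (models_per_hour * speedup)
    return hours / 24.0


# Throughput measured in the test run: 450 models in 9 hours.
rate = 450 / 9.0  # 50 models per hour

baseline = modeling_days(78000, rate)
print("shared cluster: %.0f days" % baseline)  # 65 days

# Speedup needed to finish in two weeks or in one week.
for target_days in (14, 7):
    print("%d days needs a %.1fx speedup" % (target_days, baseline / target_days))
```

So on the shared cluster the whole set would take about two months, and the 1 or 2 week target corresponds to a cluster roughly 5 to 10 times more productive.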

MMsINC 2.0: coming soon.

2010-08-23T17:10:00.006+02:00

We (mainly Marco Fanton at the Univ. of Padova) are processing a lot of public sources for the next MMsINC release; please feel free to contact me if you have any SDF catalog. It will be our pleasure to process and incorporate it (with the appropriate link to your company or institute).

Sardinian gold jewel

"Ultrafast shape recognition" method implementation

2010-08-19T14:32:00.003+02:00

I have implemented the Ultrafast Shape Recognition method (Ballester and Richards, Proc. R. Soc. A., 2007) for the next MMsINC release... pure Python implementation, no external libraries required, fast calculation. I'm looking for a small dataset for the validation of my script. If you are interested in the source code, please send me an email at matteo.floris@gmail.com.
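For readers who want to see the idea, here is a minimal pure-Python sketch of USR. It is not the MMsINC script. It follows the usual description of the method: four reference points (centroid, atom closest to the centroid, atom farthest from the centroid, atom farthest from that one) and three moments of the distance distribution for each, giving 12 numbers. The exact moment definitions vary between implementations; using the standard deviation and the cube root of the third central moment is my assumption here.

```python
import math


def _dist(a, b):
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in range(3)))


def _moments(dists):
    """Mean, standard deviation and cube root of the third central moment."""
    n = float(len(dists))
    mean = sum(dists) / n
    var = sum((d - mean) ** 2 for d in dists) / n
    third = sum((d - mean) ** 3 for d in dists) / n
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return [mean, math.sqrt(var), skew]


def usr_descriptor(coords):
    """12-number shape descriptor from a list of (x, y, z) atom coordinates."""
    n = float(len(coords))
    ctd = tuple(sum(c[k] for c in coords) / n for k in range(3))
    d_ctd = [_dist(c, ctd) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]  # closest atom to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]  # farthest atom from the centroid
    d_fct = [_dist(c, fct) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]  # farthest atom from fct
    descriptor = []
    for ref in (ctd, cst, fct, ftf):
        descriptor.extend(_moments([_dist(c, ref) for c in coords]))
    return descriptor


def usr_similarity(desc_a, desc_b):
    """Score in (0, 1]; 1 means identical descriptors."""
    diff = sum(abs(a - b) for a, b in zip(desc_a, desc_b))
    return 1.0 / (1.0 + diff / 12.0)


# Toy coordinates, not a real molecule.
mol = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
print(usr_similarity(usr_descriptor(mol), usr_descriptor(mol)))  # 1.0
```

Because only interatomic distances are used, the descriptor does not change when the molecule is translated or rotated, so no alignment step is needed.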

InChI Version 1, Software Version 1.03 is out!

2010-07-01T17:04:00.001+02:00

http://www.iupac.org/inchi/release103.html:

InChI Version 1, Software Version 1.03
– implemented for both Standard and
Non-standard (Customized) InChI/InChIKey

Removing duplicates from large SDF files

2010-06-23T08:54:00.003+02:00

Maybe there are better solutions, but this worked very well with a random set taken from PubChem (5,000,000 structures, but I introduced random duplicates, for a total of 120,000,000 structures):
1) generate your preferred InChIs for all the structures in your big SDF, and update the SDF with these InChIs (you can use pybel for that)
2) extract PUBCHEM_COMPOUND_CID from the SDF:

grep -A 1 PUBCHEM_COMPOUND_CID big.sdf | grep -v "PUBCHEM_COMPOUND_CID" | grep -v -e "--" > CIDs.txt

3) then put inchis and CIDs in the same file:

paste inchi CIDs.txt > inchi_CID.txt

4) now you can sort this file:

sort inchi_CID.txt -o inchi_CID_sort.txt

so, all the duplicates are visible...
5) now you can load all the InChIs as keys of a Python dictionary (which can be stored with cPickle): if an InChI is unique in the inchi_CID_sort.txt file, the value of its key is 0; if it is a duplicate (last visited InChI == current InChI), the value of its key is 10.
6) now the Python script should parse the SDF in this way:
for each structure:
if the InChI of this structure has value 0 in the dictionary, save the molecule;
if the value is 10, save the molecule, but change the value to 11;
if the value is 11, skip this structure

I would suggest closing the output file every 100,000 structures and then opening a different output file at each iteration; at the end, a "cat" command will generate a big SDF without duplicates.
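The final pass can be sketched as follows. This is a simplified variant, not the script I used: instead of the 0/10/11 dictionary it keeps a set of InChIs already written, which has the same effect (the first copy is saved, later copies are skipped) as long as the set fits in memory. The data tag name INCHI is an assumption; use whatever tag you chose in step 1.

```python
import io


def iter_sdf_records(handle):
    """Yield one SDF record (text up to and including '$$$$') at a time."""
    record = []
    for line in handle:
        record.append(line)
        if line.strip() == "$$$$":
            yield "".join(record)
            record = []


def get_field(record, tag):
    """Return the first value line of the data item '> <tag>', or None."""
    lines = record.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(">") and "<%s>" % tag in line:
            return lines[i + 1].strip() if i + 1 < len(lines) else None
    return None


def dedup_sdf(src, dst, tag="INCHI"):
    """Copy records from src to dst, keeping the first record per InChI."""
    seen = set()
    kept = 0
    for record in iter_sdf_records(src):
        key = get_field(record, tag)
        if key is not None:
            if key in seen:
                continue  # duplicate: skip this structure
            seen.add(key)
        dst.write(record)  # records without the tag are kept as they are
        kept += 1
    return kept


# Tiny in-memory example with placeholder InChI strings.
template = u"mol\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n> <INCHI>\n%s\n\n$$$$\n"
source = io.StringIO(template % "key-A" + template % "key-B" + template % "key-A")
output = io.StringIO()
print(dedup_sdf(source, output))  # 2
```

With real files, pass open file handles instead of the StringIO objects; the records are streamed one by one, so the SDF itself never has to fit in memory.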

Which is the "real" RU-486? [2]

2010-05-13T10:31:00.012+02:00

From a Pubchem search (search term: RU-486, 20 results, 18 RO5):

Which is the "real" RU-486?

2010-05-07T18:56:00.006+02:00

This

or this

???
These images are from ChEBI: http://www.ebi.ac.uk/chebi/searchId.do?chebiId=363012 and http://www.ebi.ac.uk/chebi/searchId.do?chebiId=363012.

ChEMBL_03 is available!

2010-04-22T14:11:00.001+02:00

JPO announced the new release. FTP data will be available in the next few days.

Any other case of "Different InChIs from the Same Molecule"?: an experiment with MMsINC 1.0

2010-03-25T11:26:00.010+01:00

Some days ago, an InChI bug was highlighted by Rich Apodaca on his blog:

Marco Fanton (see Marco Fanton's web page) and I decided to perform an experiment with the 3M 3D structures from MMsINC 1.0: we created a small pure-Python script that reshuffles the atom order of a given SDF file (you can write to me if you are interested in the code), then we generated 10 random permutations "on the fly" for each of the 3M structures, automatically calculated the standard InChI, and searched for any "new" duplicated InChI.
Results? Interesting:

two molecules (MMs03263666, MMs03263667) that are di-azo compounds (and this confirms the known bug)... and no other duplicates.
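To give an idea of what the reshuffling involves, here is a minimal sketch (not our actual script) for a single V2000 molblock. It assumes the fixed-width layout of the atom and bond blocks and only remaps those two blocks: property lines that carry atom indices (for example M  CHG) and atom parity flags are copied unchanged, so it is only safe for records without them.

```python
import random


def shuffle_molblock(molblock, rng=random):
    """Return the molblock with atoms in random order and bonds renumbered."""
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])
    n_bonds = int(lines[3][3:6])
    atoms = lines[4:4 + n_atoms]
    bonds = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    tail = lines[4 + n_atoms + n_bonds:]

    order = list(range(n_atoms))  # order[new_position] = old_position
    rng.shuffle(order)
    new_index = dict((old + 1, new + 1) for new, old in enumerate(order))

    new_atoms = [atoms[old] for old in order]
    new_bonds = []
    for bond in bonds:
        a1 = new_index[int(bond[0:3])]
        a2 = new_index[int(bond[3:6])]
        new_bonds.append("%3d%3d%s" % (a1, a2, bond[6:]))
    return "\n".join(lines[:4] + new_atoms + new_bonds + tail) + "\n"


def bonded_x(molblock):
    """Sanity-check helper: bonds as sorted pairs of atom x coordinates."""
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])
    n_bonds = int(lines[3][3:6])
    xs = [float(l[0:10]) for l in lines[4:4 + n_atoms]]
    pairs = []
    for bond in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        pairs.append(tuple(sorted((xs[int(bond[0:3]) - 1], xs[int(bond[3:6]) - 1]))))
    return sorted(pairs)


# Toy three-atom chain (C-C-O) with made-up coordinates.
atom_fmt = "%10.4f%10.4f%10.4f %-3s 0  0  0  0  0  0  0  0  0  0  0  0\n"
toy = ("toy\n  sketch\n\n  3  2  0  0  0  0  0  0  0  0999 V2000\n"
       + "".join(atom_fmt % (x, 0.0, 0.0, s) for x, s in [(0.0, "C"), (1.5, "C"), (2.9, "O")])
       + "  1  2  1  0\n  2  3  1  0\nM  END\n")
print(bonded_x(shuffle_molblock(toy)) == bonded_x(toy))  # True
```

The helper bonded_x identifies each atom by its x coordinate, which is enough for this toy example to confirm that the connectivity survives the permutation.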

About MMsINC data and license

2010-03-10T16:25:00.007+01:00

In a comment on a recent post, Egon posted two short questions about MMsINC data and license.
I want to thank Egon for the two questions.

QUESTION 1: what's the license of the data in the database?

ANSWER: MMsINC data are property of the University of Padova.
Currently, the data are not available for download, but users can query them through the web interface.

QUESTION 2: how does the curation compare to that of ChEBI?

ANSWER:
a. the MMsINC data sources are larger than those included in ChEBI: the first release contains a number of sources, the most relevant being ZINC (version 7), and the next release aims to process the greater part of the public data (mainly PubChem).
b. structures are not checked or processed one by one, but by following a precise protocol described here (open access NAR paper).

Some additional notes about the quality of MMsINC data:

the quality of MMsINC chemical data is higher than that of other public resources:

- MMsINC is not just a collection of public data: there is a long preprocessing step (see the Nucleic Acids Research Database Issue article for a full description of the pipeline) and a data cleaning step based on the InChIs.
- MMsINC is the only resource that collects the most probable ionic states and tautomers of all the structures (when possible and with the known limitations)
- MMsINC stores precalculated predictions of the biological enrichment of each molecule (similarity to PDB ligands, to bioactive molecules, presence of "active" fragments)
- MMsINC contains a selection of descriptors that are important from a pharmaceutical and biochemical point of view

I also want to quote Stefano (Prof. Stefano Moro, University of Padova, Italy):

MMsINC is not only a database: it is a chemogenomics work platform that places
its data and tools to work with it on an even footing. Although the data
formally is property of the University of Padova, it is more important to note
that we feel that simply providing files to download would belittle our
mission. Instead, we aim to bring this data and the science of chemoinformatics
together to provide the MMsINC service to our web community.

Let me thank Luca Pireddu for helping me translate Prof. Moro's quote.

OOChemistry

2010-03-06T14:22:00.002+01:00

This is an interesting tool.
I want to report here the message from the developers:

OOChemistry is an extension for OpenOffice.org which provides cross-platform OLE-like integration of OOo with JChemPaint chemical diagram editor. With OOChemistry you can draw structure, embed into document (text or presentation) and than double click and edit whenever you want on any platform having OpenOffice.org and Java Runtime (Windows, Linux, Mac OS X, other Unix flavours). It is only first alpha version and is not recommended for production use (e.g., compatibility with futher versions is not guaranteed).

OOChemistry needs your help! Experience in Java, in development of projects dealing with JChemPaint/CDK, or in development of OpenOffice.org extensions will be highly appreciated. Of course, you can help not only in coding, but also in translation of interface and writing docs.

Project page on SF

It sounds very interesting.
I'll try the installation asap.

c-d-k.org

2010-03-05T11:26:00.003+01:00

Good news: c-d-k.org is available again! It is a good way to get introduced to the CDK functionalities.

About SMARTS patterns in PubChem fingerprints

2010-03-03T10:12:00.002+01:00

See my recent post at BlueObelisk StackExchange.
Wolf Ihlenfeldt's reply is very interesting for people that want to know more about PubChem fingerprints.
As he says, the SMARTS patterns (terminal part of the FPs)...

"are the result of capturing and analyzing user queries on the old NCI Cancer Screening database Web interface. They are intended to capture features which are used in actual queries. They were not designed specifically for similarities or correlation with any properties, but it turns out that there are indications that these screens work about as well as others in that sector".

Clustering with rcdk and Python

2010-02-27T13:23:00.008+01:00

I received an email from a colleague: "I have a list of ChEBI IDs with the corresponding SMILES; I'd like to do some clustering, based on a similarity measure between the structures".

I want to propose here a method based on the rcdk package developed by R. Guha. This package is really nice. You should read carefully this article.

Let me suggest a variation. If you have a huge number of structures, I would suggest creating your matrix of similarities outside R.

This can be done with a Python script:
1) you can create your structural keys with an external tool (or with rcdk, saving the fingerprints in another file)
2) then, you can calculate the Tanimoto similarity by using Python's built-in set type (the old sets module is deprecated):


def on_bits(fp):
    # positions of the "1" bits in a fingerprint string
    return set(i for i, bit in enumerate(fp) if bit == "1")

def tanimoto(fp_a, fp_b):
    # shared on-bits divided by all on-bits; 0.0 if both are empty
    set_a, set_b = on_bits(fp_a), on_bits(fp_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / float(len(union))

fp_A = "110011"
fp_B = "101011"

print(tanimoto(fp_A, fp_B))  # 3 shared bits / 5 bits in the union = 0.6

3) in this way, you can calculate the matrix of similarities between all your structures, save the matrix in a file, load it in R environment, and use rcdk for the clustering.
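Step 3 could look like the sketch below. The identifiers and fingerprints are made up, the function names are only illustrative, and the CSV layout (IDs in the first row and first column) is just one convenient choice for loading into R.

```python
import csv
import io


def on_bits(fp):
    return set(i for i, bit in enumerate(fp) if bit == "1")


def tanimoto(bits_a, bits_b):
    union = bits_a | bits_b
    return len(bits_a & bits_b) / float(len(union)) if union else 0.0


def similarity_matrix(fingerprints):
    """fingerprints: dict mapping an ID to a '0'/'1' string.

    Returns the sorted IDs and the full symmetric matrix as a list of rows.
    """
    ids = sorted(fingerprints)
    bits = dict((i, on_bits(fingerprints[i])) for i in ids)
    matrix = [[tanimoto(bits[r], bits[c]) for c in ids] for r in ids]
    return ids, matrix


def write_matrix(ids, matrix, handle):
    """Write the matrix as CSV with IDs as header row and first column."""
    writer = csv.writer(handle)
    writer.writerow([""] + ids)
    for name, row in zip(ids, matrix):
        writer.writerow([name] + ["%.4f" % value for value in row])


# Hypothetical input: three structures with 6-bit fingerprints.
fps = {"mol_1": "110011", "mol_2": "101011", "mol_3": "000111"}
ids, matrix = similarity_matrix(fps)
out = io.StringIO()  # with a real file: open("similarities.csv", "w", newline="")
write_matrix(ids, matrix, out)
print(out.getvalue())
```

In R, a file like this can be read with read.csv(..., row.names = 1); since clustering functions usually expect distances, 1 minus the similarity is the natural thing to pass on. For n structures the matrix has n*n entries, so for very large sets you may want to compute only the upper triangle.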

That's all.

Chemoinformatics in R:

2010-02-25T13:39:00.002+01:00

Really interesting if you want to learn more about R programming applied to chemoinformatics: a Joint EBI-Industry Workshop on Cheminformatics in R.
Speakers of this short course:
- Rajarshi Guha, NIH Chemical Genomics Center (R-CDK and R-Pubchem)
- Steffen Neumann, AG Massenspektrometrie & Bioinformatik (XCMS, Rdisop, CAMERA)
- H. Paul Benton, Imperial College London.
- David Broadhurst, Cork University Maternity Hospital.

Course page

New data load for kinase SARfari screening data

2010-02-24T09:45:00.002+01:00

As JPO reported on the ChEMBL Blog, there is a new data load for the beta version of Kinase SARfari.

See the post at ChEMBL blog