Atom typing based on local atomistic environments and data-driven statistics
Loïc Dréano1, Ashenafi Legehar1, Evgeni Grazhdankin1, Alexandre Borrel1,2, Henri Xhaard1
1Faculty of Pharmacy, Division of Pharmaceutical Chemistry and Technology, University of Helsinki, Viikinkaari 5E, P.O. Box 56, FI-00014 Helsinki, Finland
2Division of Intramural Research/Biostatistics and Computational Biology Branch, NIH/NIEHS, RTP, North Carolina 27709, United States
There are around 130,000 protein structures in the Protein Data Bank (PDB), which provide a wealth of information about molecular interactions. It is now possible to automatically extract experimentally measured binding affinities with reasonable confidence (e.g. in PDBbind), and these could be used to train knowledge-based scoring functions. Within this framework, in addition to (i) ligand data availability, common problems are (ii) defining atom types for the ligand, with an expected balance between their number and accuracy; (iii) the lack of negative data; and (iv) setting up reference states.
Here, we present a relational database, developed in PostgreSQL, to mine for molecular interactions in the PDB. The database represents a powerful and time-efficient way to conduct data mining studies. Complexes of interest are usually filtered at several levels in order to build datasets in which, for example, the resolution, protein chain and ligand redundancies, interactions with metals, or simply associated features have been controlled. This filtering is usually done using collections of scripts organized into workflows[4-5]; examples of organizing the data into databases are rare.
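The multi-level filtering described above can be sketched as a single relational query. The example below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the two-table schema, the column names (resolution, contacts_metal), the identifiers and the threshold are all hypothetical rather than the schema of the database presented here.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE complex (
            pdb_id     TEXT PRIMARY KEY,
            resolution REAL              -- in angstroms
        );
        CREATE TABLE ligand (
            ligand_id      INTEGER PRIMARY KEY,
            pdb_id         TEXT REFERENCES complex(pdb_id),
            het_code       TEXT,
            contacts_metal INTEGER       -- 1 if the ligand contacts a metal atom
        );
        """
    )
    # Made-up identifiers and values, for illustration only.
    conn.executemany(
        "INSERT INTO complex VALUES (?, ?)",
        [("0aaa", 1.8), ("0bbb", 3.1), ("0ccc", 2.0)],
    )
    conn.executemany(
        "INSERT INTO ligand (pdb_id, het_code, contacts_metal) VALUES (?, ?, ?)",
        [("0aaa", "LIG", 0), ("0bbb", "LIG", 0), ("0ccc", "LIG", 1), ("0ccc", "ATP", 0)],
    )
    return conn


def select_ligands(conn, max_resolution=2.5, allow_metal=False):
    """Return (pdb_id, het_code) pairs that pass the resolution and metal filters."""
    query = """
        SELECT l.pdb_id, l.het_code
        FROM ligand AS l
        JOIN complex AS c ON c.pdb_id = l.pdb_id
        WHERE c.resolution <= ?
          AND (? OR l.contacts_metal = 0)
        ORDER BY l.pdb_id, l.het_code
    """
    return conn.execute(query, (max_resolution, int(allow_metal))).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    print(select_ligands(demo))
    print(select_ligands(demo, allow_metal=True))
```

Expressing each filter as a query parameter, rather than as a separate script stage, is what makes it cheap to rebuild a dataset under different criteria.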
The database is used to derive the key features that are employed to cluster atoms based on both their bonded and non-bonded contacts. The features comprise (i) the element types of the origin atom and of the atoms in its neighbourhood, (ii) the count of neighbours and (iii) the distance between the origin and the neighbours. We are currently developing knowledge-based scoring functions built upon the database. The data structure has been devised so that water molecules and metal atoms can be integrated easily.
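As a rough illustration of the three feature groups listed above, the following sketch computes, for one origin atom, the neighbour count and mean distance per element. It is not the method used in this work: the function name, the input layout, the single distance cutoff of 4.0 Å and the use of a mean distance are assumptions made for the example, and bonded and non-bonded contacts are not distinguished.

```python
import math
from collections import defaultdict


def environment_features(origin, neighbours, cutoff=4.0):
    """Describe the local environment of one atom.

    origin     -- (element, (x, y, z)) for the atom being typed
    neighbours -- list of (element, (x, y, z)) for candidate neighbour atoms
    cutoff     -- hypothetical distance threshold in angstroms
    """
    origin_element, origin_xyz = origin
    distances = defaultdict(list)
    for element, xyz in neighbours:
        d = math.dist(origin_xyz, xyz)
        if d <= cutoff:  # keep only atoms inside the neighbourhood
            distances[element].append(d)
    return {
        "origin": origin_element,
        "counts": {e: len(ds) for e, ds in distances.items()},
        "mean_distance": {e: sum(ds) / len(ds) for e, ds in distances.items()},
    }


if __name__ == "__main__":
    atom = ("N", (0.0, 0.0, 0.0))
    around = [
        ("C", (1.5, 0.0, 0.0)),
        ("O", (0.0, 3.0, 0.0)),
        ("O", (0.0, 0.0, 3.0)),
        ("C", (10.0, 0.0, 0.0)),  # outside the cutoff, ignored
    ]
    print(environment_features(atom, around))
```

Feature dictionaries of this kind could then be turned into fixed-length vectors and passed to a clustering algorithm to group atoms with similar environments.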
H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne. (2000) The Protein Data Bank. Nucleic Acids Research, 28: 235-242.