By comparing the knottin sequence identity distribution with the expected model accuracy, the average RMSD between model and native structure over all knottin sequences can be estimated at between 1.6 and 1.7 Å, which should be sufficiently accurate for many applications. The homology modeling procedure has also been integrated into the web-accessible protein analysis toolkit PAT. Processing one knottin structure prediction takes from one minute to one hour on this server. The processing time depends linearly on the product of the chosen maximal number of 3D templates and the number of models generated per Modeller run. The best resulting knottin model is saved as PDB-formatted data and is accessible from the PAT web session manager.
In this way, knottin data can be further analysed by interactive data transfer to other analysis tools available in the PAT processing environment.

Discussion

Modeling at low sequence identity can be improved by a structural analysis of template clusters

Although continuous improvements in the accuracy of protein modeling techniques have been achieved over recent years, structural predictions at low sequence identity remain difficult. In this work, we have shown that optimal use of the structural information available from all members of the query family can lead to notable gains in model accuracy and quality, even when the closest templates share less than 20% sequence identity with the query protein. For example, the DC4 criterion, which was shown to improve template selection, could be directly derived from the analysis of the conservation of disulfide bridges and hydrogen bonds over all knottin structures.
Using a hierarchical classification of all knottin structures, we found evidence of a direct influence of the position of cysteine IV on the main-chain hydrogen bond network. Such structural information can easily be translated into a sequence constraint by adding to the PID criterion a penalty when the template and query cysteine IV cannot be aligned. Benchmarks on our knottin test set showed that this modified DC4 criterion achieves better template selection than PID alone. This example demonstrates that generic modeling approaches applicable to any protein are too general to model a specific protein family optimally, because they cannot precisely delineate the structural features conserved over related protein subsets.
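The idea of penalizing PID when cysteine IV cannot be aligned can be illustrated with a minimal sketch. It is not the scoring code used in this work. The function names, the default penalty of 10 percentage points, and the PID definition (identical residues over columns where both sequences have a residue) are all assumptions made for illustration. The sketch also assumes that cysteine IV is simply the fourth cysteine in each sequence, which would not hold for knottins carrying additional cysteines.

```python
from typing import Optional


def percent_identity(aligned_query: str, aligned_template: str) -> float:
    """PID over alignment columns where both sequences have a residue (assumed definition)."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (q, t)
        for q, t in zip(aligned_query, aligned_template)
        if q != "-" and t != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for q, t in pairs if q == t)
    return 100.0 * matches / len(pairs)


def nth_cysteine_column(aligned_seq: str, n: int = 4) -> Optional[int]:
    """Alignment column of the n-th cysteine, or None if there are fewer than n."""
    count = 0
    for column, residue in enumerate(aligned_seq):
        if residue == "C":
            count += 1
            if count == n:
                return column
    return None


def penalized_identity(
    aligned_query: str,
    aligned_template: str,
    penalty: float = 10.0,  # hypothetical value, not taken from the text
) -> float:
    """PID minus a fixed penalty when the two cysteine IV residues are not in the same column."""
    pid = percent_identity(aligned_query, aligned_template)
    query_col = nth_cysteine_column(aligned_query)
    template_col = nth_cysteine_column(aligned_template)
    cys4_aligned = query_col is not None and query_col == template_col
    return pid if cys4_aligned else pid - penalty


if __name__ == "__main__":
    # Toy alignment: the fourth cysteines fall in different columns.
    print(penalized_identity("ACDCECFC-GC", "ACDCECF-CGC"))
```

Ranking candidate templates by such a score instead of raw PID would favour templates whose cysteine IV aligns with that of the query, which is the effect the penalty is meant to have.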
Furthermore, in our work, the conserved hydrogen bonds derived from structure superimposition and clustering were used as restraints to force the models to conform to the 80% consensus hydrogen bonding observed over the whole knottin family or a subset of it. This is useful because not all templates satisfy the consensus hydrogen bonds, most likely because hydrogen bonds cannot always be directly inferred from NMR data.
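One plausible reading of the 80% consensus is that a hydrogen bond is kept as a restraint when it occurs in at least 80% of the superimposed structures of the family or cluster. The sketch below illustrates that reading only. The representation of a bond as a (donor, acceptor) pair of positions in a common alignment numbering, the function name, and the exact threshold rule are assumptions, not details taken from this work.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# A hydrogen bond as (donor position, acceptor position) in a shared
# alignment numbering (hypothetical representation).
HBond = Tuple[int, int]


def consensus_hbonds(
    hbonds_per_structure: Iterable[Set[HBond]],
    threshold: float = 0.8,
) -> Set[HBond]:
    """Return the bonds present in at least `threshold` of the structures."""
    structures = list(hbonds_per_structure)
    if not structures:
        return set()
    # Each structure contributes at most once per bond because it is a set.
    counts = Counter(bond for bonds in structures for bond in bonds)
    total = len(structures)
    return {bond for bond, seen in counts.items() if seen / total >= threshold}


if __name__ == "__main__":
    # Toy data: five structures, one bond seen in four of them.
    cluster = [
        {(10, 25), (12, 23)},
        {(10, 25), (12, 23)},
        {(10, 25)},
        {(10, 25), (30, 5)},
        {(12, 23)},
    ]
    print(sorted(consensus_hbonds(cluster)))
```

The resulting set could then be converted into distance restraints for the modeling program, so that a model built from a template lacking one of these bonds is still pushed towards the consensus network.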