- Distance(i, i+1) < 4 Å for each Calpha atom i;
- Distance(i, j) > 3,5 Å for each pair of Calpha atoms i and j.
UsageThe FT-COMAR implementation available here takes as input a symmetric contact map matrix where contacts are denoted with 1 entries, non-contact are denoted with 0 entries and unknown entries are denoted with -1. Such contact map may be computed with any threshold value, even if previous results show that higher thresholds allow the contact map to carry more information (Vassura et al. 2007 a). Marking unsafe areas as unknown (-1 entry in the contact map) improves reconstruction quality. For further improvement a simple filtering procedure is provided: it can be used to detect (and mark with -1) possibly unsafe entries of the input contact map. This filtering procedure is based on the common neighbors property of contact maps at threshold 12 Å (see Filtered Reconstruction), so it has better performances when the input contact map threshold is 12 Å. FT-COMAR has been tested, obtaining structures with RMSD from the native on average less than 4 Å, with:
- thresholds in the range between 7 and 18 Å;
- proteins having between 55 and 786 residues;
- number of errors in the range between 0 and 10% of the whole contact map (threshold 12, see Experimental Results)
Proteins are from SCOP release 1.67 detailed with X-ray diffraction and with a resolution <2.5 Å, (without missing internal residues). We removed sequence redundancies using BLAST, ending up with a datasets of 1760 protein chains with sequence similarity lower than 25%. Among these we selected 100 mono domain proteins distributed uniformly among the four principal SCOP classes, with chain length between 55 and 629 residues. These 3D structures can be reconstructed by COMAR up to a 2 Å RMSD from the native structure (Vassura et al., 2007 b). Distribution of the resulting protein set is: 25 all Alpha; 25 all Beta; 25 Alpha/Beta; 25 Alpha+Beta (complete list of pdb codes).^
Error generationErrors are generated by flipping the entry of randomly chosen rows and columns of the contact map. To introduce x% errors we generate x errors for each 100 couples of residues. To study how protein 3D structure can be reconstructed with our algorithm from faulty contact maps we introduce three classes errors. Examples of 15% errors of each type are shown below.
Fig. 1. Native contact map of the Asn102 mutant of trypsin (PDB code: 1trmA). Contact map is computed with a threshold of 12 Å: gray areas are contacts, white areas are non-contact. This contact map contains 24753 pairs of residues, 3595 contacts, 21158 non-contacts, and no errors.
Fig. 3. Native contact map of the Asn102 mutant of trypsin (PDB code: 1trmA) with missing contacts. Lengend is as above. Missing contacts (Err-1) = 15% (out of 3595 contacts, 539 missing contacts).
Fig. 4. Native contact map of the Asn102 mutant of trypsin (PDB code: 1trmA) with false contacts. Legend is as above. False contacts (Err-0) = 15% (out of 21158 non-contacts, 3173 false contacts).
In our testing, for each protein contact map and for each percentage of error considered, we generated 100 different faulty contacts maps. Thus, having 100 proteins in our set, we performed 10000 tests for each percentage of errors. By this, our test results are the average values obtained from the 100 different instances we generated. All test runs have been executed on personal computers equipped with the Intel Pentium 4 processor (clock rate of 2.8GHz) and 1Gb of RAM memory.^
Structure Reconstruction from Faulty Contact Maps
Fig. 5. Reconstruction quality (RMSD) as a function of the number of protein residues (Size). Different percentages of random errors on the total pairs of residues (Err%) are shown in different colors. As expected, reconstruction quality decreases for longer protein chains and higher percentages of errors. Note that the sheer number of errors relative to the same percentage increases with the protein size: 10% random errors for a protein of size = 100 tantamounts to 450 errors, while 1% random errors for a protein of size = 400 tantemounts to 798 errors (10000 contact maps are analyzed).
Reconstruction and SCOP Structural Classes
Fig. 6. Reconstruction quality (RMSD) with an Err = 5% as a function of the protein length (Size) clustered according to SCOP categories. As expected the quality is better for small mono domain proteins, with few exceptions. Note that most exceptions belong to the All-Alpha SCOP structural category. This may be due to the fact that in such category most contacts are located at very short sequence separations, causing more difficulties in the reconstruction of the overall structure.
Comparison of Reconstruction Quality for Different Error Types
Fig. 7. Average RMSD to the native structure of structures reconstructed from contact maps as a function of the percentage of errors with respect to each error class: Err refers to random errors, Err-1 refers to missing contacts and Err-0 refers to false contacts. Note that reconstruction quality is better in presence of Err-1 errors.
Reconstruction From Incomplete Contact Maps
Fig. 8. Reconstruction quality (RMSD) as a function of the number of residues in the protein chain (Size) and of the percentage of random skipped pairs on the total pairs of residues. FT-COMAR reconstructs 3D protein structures with RMSD<4 Å provided that the number of skipped pairs is ≤ 75% of the entries of the contact map for proteins of any size.
Fig. 9. Reconstruction quality (RMSD) as a function of the number of residues in the protein (Size) when 25% of the input contact map is skipped. Increasing percentages of random errors (Err) on the remaining 75% of the map are shown. Different percentages of Err have different colors: note that we reconstruct with RMSD < 4 Å only for low percentages of errors and reconstruction quality is decreasing at increasing protein sizes.
Error Filter Preprocessing
Given the good results obtained when some areas of the contact map are not considered we improve reconstruction preprocessing the contact map in order to identify (and not consider) "unsafe" areas. The simple filter we implemented skips contact i,j if:
- M[i, j]=1 (i e j are in contact) and i, j share less than 10 neighbors, i.e. residue i is in contact with less than 10 residues which are in contacts also with residue j;
- M[i, j]=0 (i e j are not in contact) and i, j share more than 18 neighbors, i.e. residue i is in contact with more than 18 residues which are in contacts also with residue j.
Fig. 10. Reconstruction quality (RMSD) of FT-COMAR as a function of the number of residues in the protein (Size). Different percentages of random errors (Err%) on the whole contact map are shown with different colors. Note that we reconstruct with RMSD < 4 Å for 1-10% of errors for proteins of any size, while over 16% of errors the simple filtering preprocessing adopted is not able to skip enough errors to keep reconstruction quality independent of the protein size.
Fig. 11. Average FT-COMAR reconstruction times in seconds for our 100 proteins data set as a function of the protein length for four percentages of random errors: 0% (no errors) 1%, 7%, and 10%. As expected computing times increase at increasing protein size and number of errors.
Comparison with Previous ResultsRecovering a set of 3D coordinates having a specific contact map is known as the realization problem of the unit-disk graph which has been proved to be NP-hard (Breu et al., 1998). A series of heuristic algorithms have been proposed to solve the problem. Galaktinov and Marshall (Galaktionov et al., 1994) reconstructed the structures of five small proteins by adopting information relative to the residue coordination numbers. Vendruscolo et al. (Vendruscolo et al., 1997) described a method based on simulated annealing with the contact map as a target potential.
Fig. 12. Average reconstruction quality (RMSD) for the Asn102 mutant of trypsin (PDB code: 1trm chain A, 223 residues) as a function of the number of random errors included in the native contact map. Vendruscolo et al. refers to the performances described in (Vendruscolo et al., 1997), FT-COMAR* refers to FT-COMAR with common neighbors filtering. To compare this graph with previous ones consider that here 1000 errors are approximately 4% of the number of pairs of residues.
Fig. 13. Average reconstruction quality (RMSD) for the Bovine pancreatic trypsin inhibitor (PDB code: 6pti, 56 residues) as a function of the number of random errors included in the native contact map. Vendruscolo et al. refers to the algorithm described in (Vendruscolo et al., 1997), FT-COMAR* refers to FT-COMAR with common neighbors filtering. To compare this graph with previous ones consider that here 100 errors are approximately 7.25% of the number of pairs of residues.
Fig. 14. Average reconstruction quality (RMSD) for 18 proteins taken from a previous dataset (Vendruscolo et al., 1997) as a function of the number of residues. Vend. et al. refers to the results described in (Vendruscolo et al., 1997), threshold 9 and 13 refer to FT-COMAR when the input contact maps are computed with contact map threshold respectively 9 Å (as in Vendruscolo et al., 1997) and 13 Å. For the sake of comaparison we show the average results of 10 reconstructions obtained from the native contact map without errors.
Below we compare the average RMSD of 10 reconstructions done with FT-COMAR to other results (Galaktionov et al., 1994). For the sake of comparison reconstructions are done from native contact maps without the introduction of errors. Results of this table are computed using as contact map threshold 13 Å (thresholds of 9 or 12 Ås yield similar results).
|Protein||Number of residues||Galaktionov et al., 1994||FT-COMAR|
|RMSD (Å)||Avg RMSD (Å)||Dev RMSD (Å)||Avg Time (s)||Dev Time (s)|