FT-COMAR implements a heuristic procedure for recovering a set of 3D coordinates from an incomplete contact map (Vassura et al. 2007 a, b). The algorithm is in two phases. The first phase consists of the partially random prediction of a set of starting coordinates. This initial phase relies essentially on the metric matrix embedding algorithm (Havel, 1998). The second phase of the algorithm iteratively applies a correction/perturbation procedure to the randomly generated set of coordinates. This is performed in order to obtain a new set of coordinates as consistent as possible with the given contact map. The refinement applies until the set of coordinates is consistent with the given contact map or until some control parameter reaches the stop condition. After that the current version of FT-COMAR applies a correction-perturbation procedure to ensure that the following constraints, known to be true for all existing proteins, are respected:
  • Distance(i, i+1) < 4 Å for each Calpha atom i;
  • Distance(i, j) > 3,5 Å for each pair of Calpha atoms i and j.
Since a blurred contact map can eventually be not physical (i.e. not consistent with any set of 3D coordinates), the quality of the solution found by FT-COMAR is highly influenced by the correctness of the input contact map. To improve the reconstruction it can be sometimes useful to mark as unknown some (possibly) highly unsafe areas of the input contact map. FT-COMAR does not consider the coordinates points related to unknown entries during the refinement phase. This has the effect to eventually avoid the propagation of errors.

Usage

The FT-COMAR implementation available here takes as input a symmetric contact map matrix where contacts are denoted with 1 entries, non-contact are denoted with 0 entries and unknown entries are denoted with -1. Such contact map may be computed with any threshold value, even if previous results show that higher thresholds allow the contact map to carry more information (Vassura et al. 2007 a). Marking unsafe areas as unknown (-1 entry in the contact map) improves reconstruction quality. For further improvement a simple filtering procedure is provided: it can be used to detect (and mark with -1) possibly unsafe entries of the input contact map. This filtering procedure is based on the common neighbors property of contact maps at threshold 12 Å (see Filtered Reconstruction), so it has better performances when the input contact map threshold is 12 Å. FT-COMAR has been tested, obtaining structures with RMSD from the native on average less than 4 Å, with:
  • thresholds in the range between 7 and 18 Å;
  • proteins having between 55 and 786 residues;
  • number of errors in the range between 0 and 10% of the whole contact map (threshold 12, see Experimental Results)

^

Experimental Results

Data Set

Proteins are from SCOP release 1.67 detailed with X-ray diffraction and with a resolution <2.5 Å, (without missing internal residues). We removed sequence redundancies using BLAST, ending up with a datasets of 1760 protein chains with sequence similarity lower than 25%. Among these we selected 100 mono domain proteins distributed uniformly among the four principal SCOP classes, with chain length between 55 and 629 residues. These 3D structures can be reconstructed by COMAR up to a 2 Å RMSD from the native structure (Vassura et al., 2007 b). Distribution of the resulting protein set is: 25 all Alpha; 25 all Beta; 25 Alpha/Beta; 25 Alpha+Beta (complete list of pdb codes).

^


Error generation

Errors are generated by flipping the entry of randomly chosen rows and columns of the contact map. To introduce x% errors we generate x errors for each 100 couples of residues. To study how protein 3D structure can be reconstructed with our algorithm from faulty contact maps we introduce three classes errors. Examples of 15% errors of each type are shown below.

FT-COMAR: no errrors

Fig. 1. Native contact map of the Asn102 mutant of trypsin (PDB code: 1trmA). Contact map is computed with a threshold of 12 Å: gray areas are contacts, white areas are non-contact. This contact map contains 24753 pairs of residues, 3595 contacts, 21158 non-contacts, and no errors.
^


FT-COMAR: 5% random errors
Fig. 2. Native contact map of the Asn102 mutant of trypsin (PDB code: 1trmA) endowed with random errors. Legend is as above. Random errors (Err) = 15% (out of 24753 pairs of residues, 3713 random errors).
^


FT-COMAR: 5% random errors on contacts

Fig. 3. Native contact map of the Asn102 mutant of trypsin (PDB code: 1trmA) with missing contacts. Lengend is as above. Missing contacts (Err-1) = 15% (out of 3595 contacts, 539 missing contacts).
^


FT-COMAR: 5% random err

Fig. 4. Native contact map of the Asn102 mutant of trypsin (PDB code: 1trmA) with false contacts. Legend is as above. False contacts (Err-0) = 15% (out of 21158 non-contacts, 3173 false contacts).
^


Test configuration

In our testing, for each protein contact map and for each percentage of error considered, we generated 100 different faulty contacts maps. Thus, having 100 proteins in our set, we performed 10000 tests for each percentage of errors. By this, our test results are the average values obtained from the 100 different instances we generated. All test runs have been executed on personal computers equipped with the Intel Pentium 4 processor (clock rate of 2.8GHz) and 1Gb of RAM memory.

^


Structure Reconstruction from Faulty Contact Maps

FT-COMAR: random err
Fig. 5. Reconstruction quality (RMSD) as a function of the number of protein residues (Size). Different percentages of random errors on the total pairs of residues (Err%) are shown in different colors. As expected, reconstruction quality decreases for longer protein chains and higher percentages of errors. Note that the sheer number of errors relative to the same percentage increases with the protein size: 10% random errors for a protein of size = 100 tantamounts to 450 errors, while 1% random errors for a protein of size = 400 tantemounts to 798 errors (10000 contact maps are analyzed).
^

Reconstruction and SCOP Structural Classes

FT-COMAR: scop 5% random err
Fig. 6. Reconstruction quality (RMSD) with an Err = 5% as a function of the protein length (Size) clustered according to SCOP categories. As expected the quality is better for small mono domain proteins, with few exceptions. Note that most exceptions belong to the All-Alpha SCOP structural category. This may be due to the fact that in such category most contacts are located at very short sequence separations, causing more difficulties in the reconstruction of the overall structure.
^


Comparison of Reconstruction Quality for Different Error Types

FT-COMAR: scop 5% random err
Fig. 7. Average RMSD to the native structure of structures reconstructed from contact maps as a function of the percentage of errors with respect to each error class: Err refers to random errors, Err-1 refers to missing contacts and Err-0 refers to false contacts. Note that reconstruction quality is better in presence of Err-1 errors.
^


Reconstruction From Incomplete Contact Maps

FT-COMAR: skip random
Fig. 8. Reconstruction quality (RMSD) as a function of the number of residues in the protein chain (Size) and of the percentage of random skipped pairs on the total pairs of residues. FT-COMAR reconstructs 3D protein structures with RMSD<4 Å provided that the number of skipped pairs is ≤ 75% of the entries of the contact map for proteins of any size.
^


FT-COMAR: skip random
Fig. 9. Reconstruction quality (RMSD) as a function of the number of residues in the protein (Size) when 25% of the input contact map is skipped. Increasing percentages of random errors (Err) on the remaining 75% of the map are shown. Different percentages of Err have different colors: note that we reconstruct with RMSD < 4 Å only for low percentages of errors and reconstruction quality is decreasing at increasing protein sizes.
^


Error Filter Preprocessing

Given the good results obtained when some areas of the contact map are not considered we improve reconstruction preprocessing the contact map in order to identify (and not consider) "unsafe" areas. The simple filter we implemented skips contact i,j if:

  • M[i, j]=1 (i e j are in contact) and i, j share less than 10 neighbors, i.e. residue i is in contact with less than 10 residues which are in contacts also with residue j;
  • M[i, j]=0 (i e j are not in contact) and i, j share more than 18 neighbors, i.e. residue i is in contact with more than 18 residues which are in contacts also with residue j.
FT-COMAR: filter random err
Fig. 10. Reconstruction quality (RMSD) of FT-COMAR as a function of the number of residues in the protein (Size). Different percentages of random errors (Err%) on the whole contact map are shown with different colors. Note that we reconstruct with RMSD < 4 Å for 1-10% of errors for proteins of any size, while over 16% of errors the simple filtering preprocessing adopted is not able to skip enough errors to keep reconstruction quality independent of the protein size.
^



FT-COMAR: running times
Fig. 11. Average FT-COMAR reconstruction times in seconds for our 100 proteins data set as a function of the protein length for four percentages of random errors: 0% (no errors) 1%, 7%, and 10%. As expected computing times increase at increasing protein size and number of errors.
^


Comparison with Previous Results

Recovering a set of 3D coordinates having a specific contact map is known as the realization problem of the unit-disk graph which has been proved to be NP-hard (Breu et al., 1998). A series of heuristic algorithms have been proposed to solve the problem. Galaktinov and Marshall (Galaktionov et al., 1994) reconstructed the structures of five small proteins by adopting information relative to the residue coordination numbers. Vendruscolo et al. (Vendruscolo et al., 1997) described a method based on simulated annealing with the contact map as a target potential.

FT-COMAR: comparison Vendruscolo 1trmA
Fig. 12. Average reconstruction quality (RMSD) for the Asn102 mutant of trypsin (PDB code: 1trm chain A, 223 residues) as a function of the number of random errors included in the native contact map. Vendruscolo et al. refers to the performances described in (Vendruscolo et al., 1997), FT-COMAR* refers to FT-COMAR with common neighbors filtering. To compare this graph with previous ones consider that here 1000 errors are approximately 4% of the number of pairs of residues.
^


FT-COMAR: comparison Vendruscolo 6pti
Fig. 13. Average reconstruction quality (RMSD) for the Bovine pancreatic trypsin inhibitor (PDB code: 6pti, 56 residues) as a function of the number of random errors included in the native contact map. Vendruscolo et al. refers to the algorithm described in (Vendruscolo et al., 1997), FT-COMAR* refers to FT-COMAR with common neighbors filtering. To compare this graph with previous ones consider that here 100 errors are approximately 7.25% of the number of pairs of residues.
^


FT-COMAR: comparison Vendruscolo noerr

Fig. 14. Average reconstruction quality (RMSD) for 18 proteins taken from a previous dataset (Vendruscolo et al., 1997) as a function of the number of residues. Vend. et al. refers to the results described in (Vendruscolo et al., 1997), threshold 9 and 13 refer to FT-COMAR when the input contact maps are computed with contact map threshold respectively 9 Å (as in Vendruscolo et al., 1997) and 13 Å. For the sake of comaparison we show the average results of 10 reconstructions obtained from the native contact map without errors.
^


Below we compare the average RMSD of 10 reconstructions done with FT-COMAR to other results (Galaktionov et al., 1994). For the sake of comparison reconstructions are done from native contact maps without the introduction of errors. Results of this table are computed using as contact map threshold 13 Å (thresholds of 9 or 12 Ås yield similar results).
Protein Number of residues Galaktionov et al., 1994 FT-COMAR
    RMSD (Å) Avg RMSD (Å) Dev RMSD (Å) Avg Time (s) Dev Time (s)
1rdg 52 0.66 0.92 0.02 0.07 0.01
1pcy 99 0.88 0.85 0.02 0.28 0.03
4fd1 106 0.86 0.59 0.02 0.33 0.05
1acx 108 0.96 0.85 0.02 0.42 0.06
1cpv 108 0.89 0.68 0.01 0.37 0.04
Average   0.85 0.78 0.02 0.29 0.02

^


DOWNLOAD FT-COMAR


FT-COMAR stand-alone version is currently available for the following architectures:
^


References

  1. Breu,H., Kirkpatrick,D.G. (1998) Unit disk graph recognition is NP-hard, Computational Geometry 9 3-24.
  2. Galaktionov,S.G., Marshall,G.R. (1994) Properties of intraglobular contacts in proteins: an approach to prediction of tertiary structure. In System Sciences, Vol.V:, Proceedings of the Twenty-Seventh Hawaii International Conference on Biotechnology Computing Vol. 5, 4-7 Jan. 1994 Page(s):326-335.
  3. Havel,T.F. (1998) Distance Geometry: Theory, Algorithms, and Chemical Applications in the Encyclopedia of Computational Chemistry.
  4. Vassura,M., Margara,L., Di Lena,P., Medri,F., Fariselli,P., Casadio,R. (2007) Fault Tolerance for Large Scale Protein 3D Reconstruction from Contact Maps. Seventh International Workshop on Algorithms in Bioinformatics (WABI 2007), Pennsylvania 2007. Springer Verlag Lecture Notes in Bioinformatics 4645,25-37.
  5. Vassura,M., Margara,L., Di Lena,P., Medri,F., Fariselli,P., Casadio,R. (2007) Reconstruction of 3D Structures From Protein Contact Maps. Proceedings of Bioinformatics Research and Applications third International Symposium (ISBRA 2007), Atlanta, Springer Lecture Notes in BioInformatics 4463,578-589.
  6. Vendruscolo,M., Kussell,E., Domany,E. (1997) Recovery of protein structure from contact maps. Folding and Design, 2(5):295-306.