Mechanical Unfolding of a Simple Model Protein Goes Beyond the Reach of One-Dimensional Descriptions

We study the mechanical unfolding of a simple model protein. The Langevin dynamics results are analyzed using Markov-model methods which allow to describe completely the configurational space of the system. Using transition path theory we also provide a quantitative description of the unfolding pathways followed by the system. Our study shows a complex dynamical scenario. In particular, we see that the usual one-dimensional picture: free-energy vs end-to-end distance representation, gives a misleading description of the process. Unfolding can occur following different pathways and configurations which seem to play a central role in one-dimensional pictures are not the intermediate states of the unfolding dynamics.

In force-pulling experiments, the one-dimensional description is usually adopted, as force is considered to impose a preferred direction that appears as the slowest degree of freedom compared with the remaining ones. In this sense, optical tweezers [6,7], magnetic tweezers [8,9] or AFM [10][11][12] experiments are usually analyzed considering the end-to-end distance as the proper reaction coordinate, with a well developed force spectroscopy theory [15][16][17][18] that allows stating predictions grounded on this hypothesis. Also, recent studies of single molecule Foester resonant energy transfer fluorescence study thermal unfolding by tracking the radius of gyration of individual molecules [13,14]. Computational works similarly take advantage of this simple description, choosing reaction coordinates such as the fraction of native contacts Q [19][20][21][22], the RMSD from the native structure [23] or the Principal Components [24][25][26][27]. Nevertheless, this tempting approach must be used with great care, as some energy minima which represent relevant metastable conformations and the barriers connecting such states may be hidden when projecting the actual large-dimensional free energy landscape onto a low-dimensional subspace. * Rafa.T. Rojo@gmail.com Besides, one dimensional profiles might suggest misleading unfolding paths, consequence of this projection restriction.
In order to explore such aspects, we choose a coarsegrained model protein [28][29][30][31][32][33][34][35][36][37] and study it through a force-clamp protocol. The output of the simulations will be analyzed through two different approaches, allowing a comparison between the conclusions yielded by each. First, we build one dimensional free-energy profiles along the end-to-end distance and the fraction of native contacts. Second, we describe the configurational space of the system by using Markov-Model methods [38][39][40][41] and obtain the unfolding paths applying transition-path theory [42][43][44][45].
Although recent works cast doubt on a simple low dimensional description of thermal (un)folding processes [24,46], the one-dimensional approach is usually adopted for mechanical unfolding processes, due to the privileged direction imposed by the force [15,47]. In the case studied here, this fact, together with the simplicity of the protein structure, apparently point to a valid one-dimensional description of the unfolding process. Nevertheless, we find out that one-dimensional profiles lead to deceptive conclusions. In particular, these profiles suggest the existence of a metastable state (the half-stretched configuration, see Fig. 1) as a mechanical intermediate between the native and stretched states. Opposed to this, we find that unfolding occurs through two major routes defined by the existence of two different mechanical intermediates, not identified in the onedimensional description. Although very stable, the halfstretched configuration plays a marginal role in the unfolding process. This multi-path picture can never be captured through a one-dimensional description. In addition we are able to systematically define all the individual unfolding pathways calculating their relative weight in the dynamics and yielding a complete and quantitative vision of the protein's landscape that completes the picture described in previous studies on the same system [32,35,36].

II. MODEL
The BLN model [28,29] is a coarse grained offlattice protein model in which the residues are represented by "colored" beads, hydrophobic (B), hydrophilic (L) and neutral (N). Due to its rich behavior, despite its simplicity, this model has been widely studied, with several modifications through time [30-33, 35, 36]. In particular, the 46-residue sequence (BLN-46) B 9 N 3 (LB) 4 N 3 B 9 N 3 (LB) 5 L folds into a four-strand β barrel showing nonetheless a frustrated ground state [33].
The potential terms we use account for a stiff nearestneighbor harmonic potential, a three-body bending interaction, a four-body dihedral interaction and a sequence dependent Lennard-Jones potential [35,36]: where r ij is the distance between residues i and j, θ is the bending angle and φ the dihedral angle. For parameter values see [35] and Appendix A. We simulate the system by integrating Langevin equations of motion at constant temperature T and following a force-clamp protocol, where monomer 1 is fixed while a constant force is applied to the last monomer, 46, through a linear spring. Such equations are given by where m is each residue unitary mass, γ the friction coefficient, F the external force applied in the z direction and η i Gaussian white noise of zero average, holding fluctuation-dissipation theorem η i η j = 2T γδ(t − t )δ ij . This model protein has a well characterized unfolding transition (see [35] and Appendix C) at T c and unfolds mechanically at F U . We work from now on at T = 0.55T c and F = 0.8F U in order to maximize the number of configurations visited by the system. Lower forces would not populate the unfolded state while above F U the unfolding would be irreversible.

III. METHODS
We present here the different methods use to analyze the simulated trajectories in order to understand the mechanical unfolding scenario of our model system.

A. Potential of Mean Force
The Potential of Mean Force (PMF) is a low dimensional (typically one-dimensional) characterization of the free energy landscape of a system, which relies on the choice of a reaction coordinate X. The PMF is simply F/k B T = − log P (X), where P (X) is the probability density of the chosen reaction coordinate X.
We will explore the PMF of the system (section IV.A) by using two different reaction coordinates. As the mechanical force imposes a privileged direction, the endto-end distance ξ = |r N − r 0 | appears as a natural choice. This magnitude is indeed widely used in most single molecule force spectroscopy applications [15,[48][49][50]. Additionally, we use the fraction of native contacts Q [19,20], often reported in computational applications as a good magnitude for describing protein unfolding, based on the importance of topology on protein structure.

B. Principal Components Analysis
Principal Component Analysis (PCA) is a standard statistical method for reducing the dimensionality of a complex system such as biological molecule [25][26][27]. PCA performs a linear transformation by diagonalizing the covariance matrix C ij = y i y j − y i y j , removing thus all internal correlations. The Principal Components (PCs) q i are calculated as the projection of the trajectory onto each eigenspace. If we order the eigenvalues, the first largest PCs contain most of the fluctuations of the system and can be used as adequate reaction coordinates.

C. Conformational Markov Network
In order to characterize the thermodynamical and kinetic properties of our system we build a Markov Model [38,39] by discretizing the state space of our molecule into a set S = {1, 2, · · · , M } of M conformational states defining the Conformational Markov Network of the system [40,41]. For our system, the conformational space is defined as the first three PCs, reducing greatly its dimensionality but keeping its essential features. With these three coordinates we maintain the 75% of the system fluctuations, while the remaining ones account for symmetric thermal fluctuations. Each of the coordinates is discretized into 30 bins of equal volume, thus M = 27000.
The Conformational Markov Network is built from the dynamical trajectories, by counting the occupation of each of the states π i and calculating the transition matrix T ij which measures the probability of going from state i to state j within time τ , being τ the time window or lag time used to analyze our trajectories (τ = 15ps in our case).
The transition matrixT is ergodic and, if the molecule is in equilibrium, the occupation distribution π i can be recovered as the eigenvector with eigenvalue 1. In such situation, detail balance condition holds, π i T ij = π j T ji , and π is the Boltzmann distribution.

D. Basins of attraction Network
As the Conformational Markov Network is typically made up of thousands of nodes and links, hardly any relevant physical information can be directly obtained. A clustering or coarse-graining process is usually followed in order to group together nodes with similar physical features leading to an smaller, more meaningful network.
Here we apply the Stochastic Steepest Descent algorithm [41] (see Appendix B.2 for detailed algorithm). The advantage of this algorithm is that the network is systematically split into its basins of attraction i.e. groups of nodes whose probability flux converges into a single node (minimum). The coarse-graining process does not rely in any arbitrary definition, but on the kinetic properties of the system. Physically, while each node would represent microstates of the system, the basins of attraction represent macrostates.
Onto this network we calculate a new transition matrix T ij and the occupation probability of each basins π i . Free energy differences from basin i and j are given by ∆F ij = −k B T log π i /π j . The mean escape time from basin i is defined as t s = τ /(1 − T ii ), where τ is the time window used to sample the configurations, while transition times between basins i and j are defined as τ i→j = τ /T ij .

E. Transition-Path Theory
The Markov Network defined above contains all thermodynamic and kinetic information of the system. Nevertheless, we are interested in computing the transition pathways between the set of native conformations to the fully stretched conformation. Transition-Path theory provides the necessary tools for doing this [42][43][44]. We define A as the subset of basins which represent the native conformation while B is the subset of stretched basins. Our question is which is the typical sequence of intermediate I states to go from A to B.
The committor probability q + i is defined as the probability, when starting at state i, to reach set B next rather than A. In our case, this is the unfolding probability. By Mathematically, the committor probability can be computed by solving the following system of linear equations: For a molecule in equilibrium, the backwardcommittor probability q − i is simply q − i = 1 − q + i . The transition matrix T ij contains information from every possible trajectory which appears in the equilibrium ensemble of the molecule. In order to extract the contributions from the unfolding trajectories A → B, we calculate the effective flux f ij defined as the probability flux from i → j contributing to the A → B transition: If we want to calculate the unfolding flux, removing recrossings which might appear in a A → B transition, we need to define the net flux as f + ij defines a network of fluxes that go from A to B. The total unfolding flux F represents the expected number of A → B transitions per time window τ and is defined as: In order to decompose this flux network onto individual pathways P i , different approaches can be applied [44,45].
Here we base our strategy on the bottleneck algorithm, where given an individual pathway, the bottleneck (rate limiting step) is identified as the minimal net flux of the path f i and subtracted from every remaining net flux f + ij . The process is iterated until the network is fully decomposed into a set of individual pathways P i .

IV. RESULTS
In order to elucidate the unfolding mechanism under the effect of mechanical force for our model protein, we have performed six long equilibrium simulations. Every simulation starts from the native configuration, is equilibrated for 3µs and then runs up to 3ms.
A. One dimensional description: the Potential of Mean Force Figure 1 shows the PMF calculated along the end-toend distance ξ and the fraction of native contacts Q of our model protein. The profile for ξ shows four clear minima that can be identified with four different configurations, considering that each of the β strands has a length of ξ ∼ 3 nm. In the native configuration (N ) ξ ∼ 0 nm, as the extremal β strands are oriented in the same direction. In the aligned configuration (Al) the second strand (LB) 4 is bent so that the extremes are aligned in the pulling direction and ξ ∼ 3 nm. The half-stretched configuration (HS) appears as an stable minima at ξ ∼ 6 nm, as the fourth (LB) 5 strand is unfolded. The fully stretched configuration (S), with ξ ∼ 12 nm, shows the protein totally unfolded, as an stretched polymer.
These states can also be identified in the Q profile. State S has all contacts broken Q ∼ 0, while Al and HS maintain around half of the contacts (Q ∼ 0.5). The N configuration shows a minimum at Q ∼ 0.75, as thermal fluctuations break on average some of the contacts. Remarkably, for this value of the force, the HS configuration correspond to the lowest minimum in both free energy profiles, and thus is the most stable configuration. Its position in the PMF suggests that it also has a relevant role in the stretching pathways, appearing as a clear mechanical intermediate between the native and fully stretched configuration. In addition, it is necessary to jump over a barrier of several k B T to reach state S while the other states are separated by low barrier. This suggest a fast dynamics between N and HS and longer time scales to visit state S.

B. Two dimensional description: Principal Component Analysis
Before describing the Markov Model of the system, it is worth to exploit further the information PCA provides. As explained previously, we build the Markov network by discretizing the first three PCs, which define our conformational space, with lower dimensionality, but still capturing the main aspects of the system dynamics.  Figure 2 shows the free-energy landscape along the first two principal components ∆F/k B T = − log P (q 1 , q 2 ). Its basic features agree with the one dimensional landscapes shown in previous section, as three major wells are found. Nevertheless we see also clear differences, being the PCs able to capture better the details of the free-energy landscape. Each of these major wells have a rough structure, showing a set of minor wells separated by small energy barriers ∼ 2k B T , revealing thus a richer variety of configurations. Moreover, two new low populated wells appear between the folded structures (native and halfstretched) and the fully-stretched configurations. These new states could suggest the existence of different unfolding pathways, where the half-stretched configuration does not necessarily plays the role of mechanical intermediate.

C. Equilibrium ensemble of the model protein: the Basin Network
The built microstate network is made up of 1876 nodes related kinetically through 23995 links. After applying the Stochastic Steepest Descent algorithm [41], the network is clustered into 30 basins connected through 1290 links. In order to obtain a good description of the system, we keep only those basins which were visited at least . We represent the 13 basins with π > 10 −5 where the size of the bead is proportional to π i . The bidirectional arrows connecting nodes represent allowed transitions (the magnitude of T ij is not shown). Each basin is labelled according to the configuration they encode. Representative structure associated to each basin (Lower ). 0.001% of the trajectory (π i > 10 −5 ), avoiding pathological or extremely rare states. After this refinements, we keep 13 macrostates, connected through 65 edges, including auto-links. Figure 3 (upper) shows a graphical representation of the basin network, where the size of each bead (node) is proportional to its occupation π i . The spatial arrangement of the nodes was calculated applying the Force Atlas algorithm [51], where an artificial dynamics is simulated. This dynamics is based in considering each link as a linear spring and including a certain repulsion between nodes, until an equilibrium configuration is obtained. The nodes are colored according to the modularity class they be-long to [52], having five different classes. Lower panel of Fig. 3 shows a representative structure of each basin (macrostate), including the label which identifies them.
Configurations N 1 and N 2 correspond to native-like states and will define the native set A due to its structural similarity and high Q value. The aligned configuration Al, already identified in Fig. 1, appears close to N 1 and N 2 in Fig. 3 but does not belong to the native set since it gives very different Q and ξ values. Basin HS is the Half-Stretched Configuration, the most stable macrostate under these conditions. State S is the Fully-Stretched Configuration, while the remaining 8 basins are labelled as intermediate states and will be discussed further on.  Table I shows information about each of the identified macrostates. π i is the occupation of basin i, t s the mean escape time (defined above), Q the mean fraction of native contacts and ξ the mean end-to-end distance, both calculated from the marginal distributions of such magnitudes on each basin. It is remarkable that in many cases such distributions are not unimodal, so the actual meaning of the average must be taken with care. Finally, q + i are the committor probabilities from the native (N 1 and N 2 ) to the stretched (S) configuration this is: the unfolding probability of basin i. It is important to stress the difference between the two native basins N 1 and N 2 , as they have very different connectivity features in the network, belonging to different modularity classes. Configuration N 1 is closer to the native structure, given the arrangement of the neutral turns, while N 2 shows bigger fluctuations, leading to a loss of some contacts. Interestingly, N 1 is more connected to the Intermediate States than N 2 , which shows fast transition times to HS, τ N2→HS = 557ps, while τ N1→HS = 13.5 × 10 6 ps. In fact, they are both scarcely connected -τ N2→N1 = 14 × 10 3 ps and τ N1→N2 = 15 × 10 3 ps-, reason why they belong to a different modularity class. In this regard, in spite its structural similarity which overlap both states in the PMF description, their actual role in the configurational space is quite different.
In this sense, the first contradictions with the conclusions yielded by the PMF description appear here. While both descriptions agree coarsely in the main features of the equilibrium ensemble of the system, revealing three major states (native, half-stretched and fully-stretched), the role of such states and the presence of other relevant configurations is hidden in the one-dimensional projection. N 1 and N 2 states are integrated into the same high Q or low ξ minimum, will the intermediate low-populated states which connect to the stretched state are impossible to be identified in the one-dimensional representation.

D. The unfolding pathways: Transition Path Theory
In order to decipher the actual unfolding mechanism of our model protein under the effect of a mechanical force, we apply Transition Path theory to the basin network, as explained in Methods section.
We define the native set A as basins N 1 and N 2 , while the stretched set B is just made up of basin S. According to this definitions, we calculate the committor probabilities, shown in Table I. Figure 4 shows the net flux network, being the thickness of the arrows proportional to the net flux f + ij . The total unfolding flux is F = 2.9 × 10 −7 ps −1 , meaning that we observe an unfolding transition every 3.5µs, approximately.
We decompose the net flux network by identifying first the strongest pathway, remove it from the network and repeat the process until there is no path from set A to set B. Due to the size of our network, this process can be done manually, although computational applications can be used [44,45]. We identify a total of 9 different paths leading from A to B. After decomposing the network into these 9 paths, unconnected regions still remain due to the presence of trap states [43] that carry around 20% of the flux. Figure 5 shows the 6 more relevant paths, which carry 89% of the unfolding flux.
From the 9 pathways, 7 start from conformation N 1 while just 2 from N 2 . This is a remarkable fact, being N 1 closer to the native structure than N 2 , as discussed in previous section. In addition, states I 1 and I 2 appear as the actual intermediates for the unfolding mechanism: A → B is forbidden in case these two states are removed from the net flux network. Out of the 9 pathways, 6 of them pass through state I 2 and 3 through state I 1 .
The construction of the Markov Model from the PCs and the use of Transition Path Theory help us to unveil the actual unfolding mechanism and its driving process. While HS is a notably relevant metastable state (indeed the most stable state under these conditions), its role in the unfolding mechanism is completely marginal, as just appears in path P 5 , with a weight of 7%. This important conclusion contradicts those derived from the onedimensional description showed in Fig. 1, where HS was suggested as the mechanical intermediate of the unfolding mechanism. The actual mechanical intermediates are I 1 and I 2 (not identified in the one-dimensional description), defining the two major unfolding routes. I 2 has a similar structure to HS, but while HS maintains the hydrophobic core, in HS the extremal B 9 strand is unfolded, breaking the core that stabilizes the structure and driving the unfolding mechanism. On the other hand, I 1 is more stable π I1 = 0.07 and represents a modified HS structure where the folded branches collapse into a globu- lar structure which might lead to expose the extremal B 9 branch to the solvent and drive the unfolding mechanism through states I 4 and I 8 .

V. CONCLUSIONS AND DISCUSSION
In this paper we have presented the detailed analysis of the unfolding process of a model protein under the presence of a mechanical pulling force. This scenario mimics force clamp single molecule experiments, where proteins or nucleic acids are subject to a constant external force that drives their unfolding. Due to the limitation of available observables, these experiments are often analyzed by reconstructing their free-energy landscape along the pulling direction through different existing techniques [15][16][17][18][48][49][50]. This approach is often followed in many computational studies by using different reaction coordinates [19][20][21][22][23][24].
In this sense we wanted to reproduce a similar protocol and explore the conclusions yielded by a one-dimensional analysis and a multidimensional Markov model approach.
The simplicity of our model protein, and the fact that the force sets a privileged direction invites to a onedimensional characterization. Nonetheless, we have seen how both approaches lead to contradictory conclusions.
The PMF description shows the existence of three major states, the native, the stretched or denatured and a metastable Half-Stretched configuration which seems to play the role of mechanical intermediate due to its position in the free-energy profile.
Nonetheless, a more detailed multidimensional study changes dramatically the unfolding picture. Being the most populate one, HS state plays a marginal role in the unfolding pathway, with just 7% of the unfolding flux passing through it. The true mechanical intermediates are states I 1 and I 2 , building the two major unfolding routes, both related to the loss of the hydrophobic core that destabilizes the structure and drives the unfolding process. In this sense, due to the existence of multiple pathways, independently of the chosen reaction coordinate, a one-dimensional picture would never be enough to characterize the unfolding pathway of this system. Thus, our work differs from those which put attention on the proper choice of the reaction coordinate [18,47]. The necessity of multidimensional descriptions indeed has been warned in the last years to understand thermal unfolding, where the protein transits from a low-entropy state (native) to a high-entropy one (denatured) [24,46]. The one-dimensional picture, however, is vastly assumed in mechanical unfolding processes, both in experimental and computational applications.
Regarding our analysis Markov Model protocol, we stress two major differences when compared to most works of this community. First, it is important to note that we are actually using the PCs as reaction coordinates in order to reduce the system dimensionality. Nevertheless, these coordinates has been proven to capture successfully the most relevant dynamical events of complex systems such as biomolecules. In our case, three coordinates are enough, as the remaining ones account merely for gaussian thermal fluctuations. Second, we stress on the importance of the coarse-graining mechanism applied to the original Conformational Markov Network [41], which is able to systematically cluster the network based only on the kinetic properties of the system.
Although extremely simple molecular assays such as DNA or RNA hairpins could fit into a single reaction coordinate description [48], increasing slightly the complexity of the molecule leads to a dramatical rise in the complexity of the actual free energy landscape in the system, requiring more detailed studies. In this sense, molecules such as multiple nucleic-acid hairpins [53], protein-ligand complexes [54] or any mechanically pulled protein [55], appear as potential systems where a one-dimensional description takes the risk of leading to a clear misunderstanding of the actual complexity of their conformational space and the dynamical processes to which they are subject.
(b) If T ml > T lm and u m = 0 write the labels of all the nodes in the list as u j = u m . Go back to step 3.
(c) If T ml ≤ T lm remove link l → m from the graph. Return to point 3.
This process ends when every node in the CMN has been labelled, this is u i = 0 ∀ i. Then, the whole conformational space has been characterized and every node is connected with its local minima in the FEL. All nodes with the same label belong to the same basin in this FEL and therefore we can associate them with the same conformational state.
Given the basin partition, a new CMN network can be built, taken the basins themselves as new nodes. The occupation probabilities will now be defined as π α = i∈α π i , while the new transition matrixT is built, with elements T βα = i∈α j∈β T ji π i / i∈α π i . From these definitions, transition times can be easily calculated as t αβ = τ /T βα , being τ the time window used for the network construction. The relative free energy of basin α with respect to basin β is simply ∆F α = −k B T log(π α /π β ).