# 13. Predicting Protein Structure

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Welcome
back, everyone. I hope you had a good break. Hopefully you also remember a little bit about what we did last time. So if you'll recall, last time we did an introduction to protein structure. We talked a little bit about some of the issues in predicting protein structure. Now we're going to go into that in more detail.

And last time, we'd broken down the structure prediction problem into a couple of sub-problems. So there was the problem of secondary structure prediction, which we discussed a little bit last time. And remember that the early algorithms developed in the '70s get about 60% accuracy, and decades of research has only marginally improved that. But we're going to see that some of the work on domain structure recognition and predicting novel three-dimensional structures has really advanced very dramatically in the last few years.

Now, the other thing I hope you'll recall is that we had this dichotomy between two approaches to the energetics of protein structure. We had the physicist's approach and we had the statistician's approach, right? Now, what were some of the key differences between these two approaches? Anyone want to volunteer a difference between the physics and the statistical approaches to parametrizing the energy of a structure? So we're trying to come up with an equation that will convert coordinates into energy, right? And what were some of the differences between the physics approach and the statistical approach? Any volunteers? Yes.

AUDIENCE: I think the statistical approach didn't change the phi and psi angles, right? It just changed other variables.

PROFESSOR: So you're close. Right. So the statistical– or maybe you said the right thing, actually. So the statistical approach keeps a lot of the pieces of the protein rigid, whereas the physics approach allows all the atoms to move independently. So one of the key differences, then, is that in the physics approach, two atoms that are bonded to each other still move apart based on a spring function. It's a very stiff spring, but the atoms move independently. In the statistical approach, we just fix the distance between them. Similarly, for a tetrahedrally coordinated atom, in the physics approach those angles can deform. In the statistical approach, they're fixed. Right? So in the statistical approach, we have more or less fixed geometry. In the physics approach, every atom moves independently.

Anyone else remember another key difference? Where do the energy functions come from? Volunteers? All right. So in the physics approach, they're all derived as much as possible from physical principles, as you might imagine. Whereas in the statistical approach, we're trying to recreate what we see in nature, even if we don't have a good physical grounding for it. So this is most dramatic in trying to predict the solvation free energies. Right? How much does it cost you if you put a hydrophobic atom into a polar environment? Right? So in the physics approach, you actually have to have water molecules. They have to interact with matter. That turns out to be really, really hard to do. In the statistical approach, we come up with an approximation. How much solvent-accessible surface area is there on the polar atom when it's free? When it's in the protein structure? And then we scale the transfer energies by that amount.

OK, so these are then the main differences. Gotta be careful here. So we've got fixed geometry in the statistical approach. We often use discrete rotamers. Remember? The side-chain angles, in principle, can rotate freely, but only a few conformations are typically observed, so we often restrict ourselves to the most commonly observed combinations of the chi angles. And then we have the statistical potential that depends on the frequency at which we observe things in the database. And that could be the frequency at which we observe particular atoms at precise distances. It could be the fraction of time that something's solvent accessible versus not.
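To make the statistical potential concrete, here is a minimal sketch with invented counts (my own illustration, not course code) of the standard Boltzmann-inversion recipe, E(r) = -kT ln(f_observed(r) / f_reference(r)), in units where kT = 1:

```python
import math

# Illustrative sketch (invented counts): converting observed frequencies in
# a structure database into a statistical potential by Boltzmann inversion,
# E = -kT * ln(f_observed / f_reference), in units where kT = 1.

# Made-up counts of how often a particular atom pair is seen at each
# distance bin, versus how often any pair is seen there (the reference).
observed = {4.0: 120, 5.0: 300, 6.0: 180}     # distance bin (angstroms) -> count
reference = {4.0: 200, 5.0: 200, 6.0: 200}

def statistical_potential(observed, reference):
    n_obs = sum(observed.values())
    n_ref = sum(reference.values())
    energies = {}
    for r in observed:
        f_obs = observed[r] / n_obs
        f_ref = reference[r] / n_ref
        energies[r] = -math.log(f_obs / f_ref)   # kT = 1
    return energies

E = statistical_potential(observed, reference)
# The 5 A bin is seen more often than the reference expects, so it gets a
# negative (favorable) energy; the 4 A bin is under-observed and penalized.
print(E[5.0] < 0 < E[4.0])  # True
```

The same inversion works for any observable, such as the fraction of time an atom type is solvent accessible.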

And the other thing that we talked about a little bit last time was this thought problem. If I have a protein sequence and I have two potential structures, how could I use these potential energies– whether they're derived from the physics approach or from the statistical approach– how could I use these potential energies to decide which of the two structures is correct?

So one possibility is that I have two structures. One of them is truly the structure and the other is not. Right? Your fiendish lab mate knows the structure but refuses to tell you. So in that case, what would I do? I know that one of these structures is correct. I don't know which one. How could I use the potential energy function to decide which one's correct? What's going to be true of the correct structure?

AUDIENCE: Minimal energy.

PROFESSOR: It's going to have lower energy. So is that sufficient? No. Right? There's a subtlety we have to face here. So if I just plug my protein sequence onto one of these two structures and compute the free energy, there's no guarantee that the correct one will have lower free energy. Why? What decision do I have to make when I put a protein sequence onto a backbone structure? Yes.

AUDIENCE: How to orient the side chain.

PROFESSOR: Exactly. I need to decide how to orient the side chains. If I orient the side chains wrong, then I'll have side chains literally overlapping with each other. That'll have incredibly high energy, right? So there's no guarantee that simply having the right structure will give you the minimal free energy until you correctly place all the side chains.

OK, but that's the simple case. Now, that's in the case where you've got this fiendish friend who knows the correct structure. But of course, in the general domain recognition problem, we don't know the correct structure. We have homologues. So we have some sequence, and we believe that it's either homologous to Protein A or to Protein B, and I want to decide which one's correct. So in both cases, the structure's wrong. It's a question of how wrong it is, right? So now the problem actually becomes harder, because not only do I need to get the right side-chain conformations, but I need to get the right backbone conformation. It's going to be close to one of these structures, perhaps, but it's never going to be identical.

So both of these situations are examples where we have to do some kind of refinement of an initial starting structure. And what we're going to talk about for the next part of the lecture are alternative strategies for refining a partially correct structure. And we're going to look at three strategies. The simplest one is called energy minimization. Then we're going to look at molecular dynamics and simulated annealing.

So energy minimization starts with this principle that we talked about last time– remember, that came up here– that a stable structure has to be a minimum of free energy. Right? Because if it's not, then there are forces acting on the atoms that are going to drive it away from that structure to some other structure. Now, the fact that it is a minimum of free energy does not guarantee that it is the minimum of free energy. So it's possible that there are other energetic minima. Right? The protein structure, if it's stable, is at the very least a local energetic minimum. It may also be the global free energy minimum. We just don't know the answer to that.

Now, this was a big area of debate in the early days of the protein structure field, whether proteins could fold spontaneously. If they did, then it meant that they were at least apparently at global free energy minima. Chris Anfinsen actually won the Nobel Prize for demonstrating that some proteins could fold independently outside of the cell. So at least some proteins had all the structural information implicit in their sequence, right? And that seems to imply that these are global free energy minima. But there are other proteins, we now know, where the most commonly observed structure is only a local free energy minimum. And there are very high energetic barriers that prevent it from actually getting to the global free energy minimum. But regardless of the case, if we have an initial starting structure, we could try to find the nearest local free energy minimum, and perhaps that is the stable structure.

So in our context, we were talking about packing the side chains on the surface of the protein that we believe might be the right structure. So imagine that this is the true structure and we've got the side chain, and the dashed green lines represent hydrogen bonds. It's making a series of hydrogen bonds from this nitrogen and this oxygen to pieces of the rest of the protein. Now, we get the crude backbone structure. We pop in our side chains. We won't necessarily– in fact, we'll almost never– randomly choose the right conformation to pick up all these hydrogen bonds. So we'll start off with some structure that looks like this, where it's rotated, so that instead of seeing both the nitrogen and the oxygen, you can only see the profile. And so the question is whether we can get from one to the other by following the energy gradient. So that's the question. How would we go about doing this?

Well, we have this function that tells us the potential energy for every XYZ coordinate of the atom. That's what we talked about last time, and you can go back and look at your notes for those two approaches. So how could we minimize this free energy? Well, it's no different from other functions that we want to minimize, right? We take the first derivative. We look for places where the first derivative is zero. The one difference is that we can't write out analytically what this function looks like and choose directions and locations in space that are the minima. So we're going to have to take an approach that makes a series of perturbations to a structure that try to improve the free energy systematically.

The simplest version is this gradient descent approach, which says that I have some initial coordinates that I choose, and I take a step in the direction of the negative first derivative of the function. So what does that look like? So here are two possibilities. I've got this function. If I start off at x equals 2, then minus some epsilon– some small value– times the first derivative is going to point me to the left. And I'm going to take steps to the left until this function, f prime, the first derivative, is zero. Then I'm going to stop moving. So I move from my initial coordinate a little bit each time to the left until I get to the minimum. And similarly, if I start off on the right, I'll move a little bit further to the right each time until the first derivative is zero. So that looks pretty good. It can take a lot of steps, though. And it's not actually guaranteed to have great convergence properties. Because of the number of steps you might have to take, it might take quite a long time. So that's the first derivative, in a simple one-dimensional case.

We're dealing with a multi-dimensional vector, so instead of taking the first derivative we use the gradient, which is the vector of partial first derivatives. And I think one thing that's useful to point out here is that, of course, the force is the negative of the gradient of the potential energy. So when we do gradient descent, you can think of it from a physical perspective as always moving in the direction of the force. So I have some structure. It's not the true native structure, but I take incremental steps in the direction of the force and I move towards some local minimum.
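The procedure just described can be sketched in a few lines. This is my own toy example, not code from the course; the energy function is invented and has two minima, so you can see how the answer depends entirely on the starting point:

```python
# Illustrative sketch (not from the lecture): gradient descent on a toy
# one-dimensional "energy" function E(x) = (x**2 - 1)**2, which has two
# minima at x = -1 and x = +1, separated by a barrier at x = 0.

def energy(x):
    return (x**2 - 1)**2

def gradient(x):
    # dE/dx, computed analytically for this toy function
    return 4 * x * (x**2 - 1)

def minimize(x, step=0.01, tol=1e-8, max_steps=100000):
    """Step opposite the gradient (i.e., along the force) until |dE/dx| ~ 0."""
    for _ in range(max_steps):
        g = gradient(x)
        if abs(g) < tol:
            break
        x -= step * g
    return x

# Starting at x = 2 slides into the minimum at x = +1; starting at x = -0.5
# slides into x = -1. Neither start can cross the barrier at x = 0.
print(round(minimize(2.0), 3))   # 1.0
print(round(minimize(-0.5), 3))  # -1.0
```

The step size `step` here is the arbitrary epsilon from the lecture; too large and the iteration overshoots, too small and convergence takes many steps.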

And we've done this in the case of a continuous energy, but you can actually also do this for discrete ones. Now, the critical point was that you're not guaranteed to get to the correct energetic structure. So in the case that I showed you before, where we had the side chain side-on, if you actually do the minimization there, you actually end up with the side chain rotated 180 degrees from where it's supposed to be. So it eliminates all the steric clashes, but it doesn't actually pick up all the hydrogen bonds. So this is an example of a local energetic minimum that's not the global energetic minimum. Any questions on that? Yes.

AUDIENCE: Where do all these n-dimensional equations come from?

PROFESSOR: Where do what come from?

AUDIENCE: The n-dimensional equations.

PROFESSOR: So these are the equations for the energy in terms of every single atom in the protein, if you're allowing the atoms to move, or in terms of every rotatable bond, if you're allowing only bonds to rotate. So the question was, where do the multi-dimensional equations come from. Other questions? OK.

All right, so that's the simplest approach. Literally minimize the energy. But we said it has this problem that it's not guaranteed to find the global free energy minimum. Another approach is molecular dynamics. So this actually attempts to simulate what's going on in a protein structure in vitro, by simulating the force on every atom and the velocity. Previously, there was no measure of velocity. Right? All the atoms were static. We looked at what the gradient of the energy was and we moved by some arbitrary step size in the direction of the force. Now we're actually going to have velocities associated with all the atoms. They're going to be moving around in space. And the coordinates at any time t_i are going to be determined by the coordinates at the previous time, t_{i-1}, plus a velocity times the time step: x(t_i) = x(t_{i-1}) + v Δt. And the velocities are going to be determined by the forces, which are determined by the gradient of the potential energy. Right?
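That update rule can be sketched as follows. This is my own toy illustration for a single particle on a spring, not the lecture's code; production MD uses femtosecond time steps and better integrators such as velocity Verlet, but the structure of the loop is the same:

```python
# Illustrative sketch (my own toy, not the lecture's code): the molecular
# dynamics update loop for one particle on a spring, U(x) = 0.5*k*x**2.

def force(x, k=1.0):
    # F = -dU/dx for the harmonic potential
    return -k * x

def simulate(x=1.0, v=0.0, mass=1.0, dt=0.01, n_steps=1000):
    """x(t_i) = x(t_{i-1}) + v*dt, with v updated from F = m*a."""
    for _ in range(n_steps):
        v += force(x) / mass * dt   # update the velocity from the force
        x += v * dt                 # update the position from the velocity
    return x, v

x, v = simulate()
# With k = m = 1 the particle oscillates with period 2*pi, so after
# t = 10 it has completed a bit over a cycle and a half.
print(x, v)
```

Updating the velocity before the position (semi-implicit Euler) keeps the total energy approximately conserved over the run, which is why it is preferred over naive forward Euler.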

So we start off, always, with that potential energy function, which is either from the physics approach or the statistical approach. That gives us velocities, eventually giving us the coordinates. So we start off with the protein. There are some serious questions of how you equilibrate the atoms. So you start off with a completely static structure. You want to apply forces to it. There are some subtleties as to how you go about doing that, but then you actually end up simulating the motion of all the atoms. And just to give you a sense of what that looks like, I'll show you a quick movie. So this is the simulation of the folding of a protein structure. And the backbone is mostly highlighted. Most of the side chains are not being shown– actually, a few are, in bold, and you can see the stick figures. And slowly it's accumulating its three-dimensional structure.

[VIDEO PLAYBACK]

[LAUGHTER]

[END VIDEO PLAYBACK]

PROFESSOR: OK, I think you get the idea here. Oh, it won't let me give up. OK, here we go. OK, so these are the equations that are governing the motion in an example like that. Now, the advantage of this is we're actually simulating the protein folding. So if we do it correctly, we should always get the right answer. Of course, that's not what happens in reality. Probably the biggest problem is just computational speed. So these simulations– even very, very short ones like the one I showed you– so how long does it take a protein to fold in vitro? A long folding might take a millisecond, and for a very small protein like that it might be orders of magnitude faster. But to actually compute that could take many, many, many days. So a lot of computing resources go into this. Also, if we want to accurately represent solvation– the interaction of the protein with water, which is what causes the hydrophobic collapse, as we saw– then you actually would have to have water in those simulations. And each water molecule adds a lot of degrees of freedom, so that increases the computational cost as well. So all of these things determine the radius of convergence. How far away can you be from the true structure and still get there? For very small proteins like this, with a lot of computational resources, you can get from an unfolded protein to the folded state. We'll see some important advances that allow us to get around this, but in most cases we can only do relatively local changes.

So that brings us to our third approach for refining protein structures, which is called simulated annealing. And the inspiration for this name comes from metallurgy and how to get the best atomic structure in a metal. I don't know if any of you have ever done any metalworking. Anyone? Oh, OK, well, one person. That's better than most years. I have not, but I understand that in metallurgy– and you can correct me if I'm wrong– that by repeatedly raising and lowering the temperature, you can get better metal structures. Is that reasonably accurate? OK. You can talk to one of your fellow students for more details if you're interested.

So this similar idea is going to be used in this computational approach. We're going to try to find the most probable conformation of atoms by trying to get out of some local minima by raising the energy of the system and then changing the temperature– raising and lowering it according to some heating and cooling schedule– to get the atoms into their most probable conformation, the most stable conformation. And this goes back to this idea that we started with, the local minima. If we're just doing energy minimization, we're not going to be able to get from this minimum to this minimum, because these energetic barriers are in the way. So we need to raise the energy of the system to jump over these energetic barriers before we can get to the global free energy minimum. But if we just move at very high temperature all the time, we will sample the entire energetic space, but it's going to take a long time. We're going to be sampling a lot of conformations that are low probability, as well. So this approach allows us to balance the need for speed and the need to be at high temperature, where we can overcome some of these barriers.

So one thing that I want to stress here is that we've made a physical analogy to this metallurgy process. We're talking about raising the temperature of the system and letting the atoms evolve under forces, but it's in no way meant to simulate what's going on in protein folding. So molecular dynamics would try to say, this is what's actually happening to this protein as it folds in water. Simulated annealing is using high temperature to search over the space, and then low temperature. But these temperatures are much, much higher than the protein would ever encounter, so it's not a simulation. It's a search strategy.

OK, so the key to this– and I'll tell you the full algorithm in a second– but at various steps in the algorithm we're trying to make decisions about how to move from our current set of coordinates to some alternative set of coordinates. Now, that new set of coordinates we're going to call the test state. And we're going to decide whether the new state is more or less probable than the current one. Right? If it's lower in energy, then what's it going to be? It's going to be more probable, right? And so in this algorithm, we're always going to accept those states that are lower in free energy than our current state. What happens when the state is higher in free energy than our current state? So it turns out we are going to accept it probabilistically. Sometimes it's going to move up in energy and sometimes not, and that is going to allow us to go over some of those energetic barriers and try to get to new energetic states that would not be accessible by pure minimization.

So the form of this is the Boltzmann equation, right? The probability of some test state compared to the probability of a reference state is going to be the ratio of these two Boltzmann factors– one with the energy of the test state, one with the energy of the current state. So it's e to the minus the difference in energy over kT: P(test)/P(current) = e^(-(E_test - E_current)/kT). And we'll come back to where this temperature term comes from in a second.

OK, so here's the full algorithm. We will either iterate for a fixed number of steps or until convergence. We'll see we don't always converge. We have some initial conformation. Our current conformation will be state n, and we can compute its energy from those potential energy functions that we discussed in the last meeting. We're going to choose a neighboring state at random. So what does neighboring mean? So if I'm defining this in terms of XYZ coordinates– for every atom I've got a set of XYZ coordinates– I'm going to change a few of them by a small amount. Right? If I change them all by large amounts, I have a completely different structure. So I'm going to make small perturbations. And if I'm doing this with fixed backbone angles and just rotating the side chains, then what would a neighboring state be? Any thoughts? What would a neighboring state be? Anyone? Change a few of the side-chain angles, right? So we don't want to globally change the structure. We want some continuity between the current state and the next state. So we're going to choose an adjacent state in that sense, in the state space.

And then here are the rules. If the new state has an energy that's lower than the current state, we simply accept the new state. If not– this is where it gets interesting– then we accept that higher-energy state with a probability that's associated with the difference in the energies. So if the difference is very, very large, there's a low probability we'll accept. If the difference is only slightly higher, then there's a higher probability that we accept. If we reject it, we just drop back to our current state and we look for a new test state.
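The acceptance rule can be written down directly. A minimal sketch (my own code, not the lecture's; energies here are in units of kT, so the `temperature` argument stands in for kT):

```python
import math
import random

# Illustrative sketch of the Metropolis acceptance rule described above.
# Energies are in units of kT for simplicity.

def accept_move(e_current, e_test, temperature, rng=random.random):
    """Always accept downhill moves; accept uphill moves with
    probability exp(-(E_test - E_current) / kT)."""
    delta = e_test - e_current
    if delta <= 0:
        return True          # lower energy: always accept
    # higher energy: accept probabilistically, more rarely as delta grows
    return rng() < math.exp(-delta / temperature)

# A downhill move is always taken; a strongly uphill move at low
# temperature is essentially never taken.
print(accept_move(10.0, 8.0, temperature=1.0))  # True
```

Passing in the random-number generator (`rng`) is just a convenience that makes the rule easy to test deterministically.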

OK? Any questions on how we do this? Question, yes.

AUDIENCE: How far away do we search for neighbors?

PROFESSOR: That's the art of this process, so I can't give you a straight answer. Different approaches will use different thresholds. Any other questions?

OK, so the key thing I want you to realize, then, is there's this distinction between the minimization approach and the simulated annealing approach. Minimization can only go from state one to the local free energy minimum, whereas simulated annealing has the potential to go much further afield, and potentially to get to the global free energy minimum. But it's not guaranteed to find it. OK, so let's say we start in state one and our neighbor state was state two. So we'd accept that with 100% probability, right? Because it's lower in energy. Then let's say the neighboring state turns out to be state three. That's higher in energy, so there's a probability that we'll accept it, based on the difference between the energy of state two and state three. Similarly from state three to state four, so we might drop back to state two. We might go up. And then we can eventually get over the hump this way with some probability, compounded over each of those steps. OK?

OK, so if this is our function for deciding whether to accept a new state, how does temperature affect our decisions? What happens when the temperature is very, very high, if you look at that equation? So it's e to the minus delta– the difference in the energy– over kT. So if T is very, very large, then what happens to that exponent? It approaches zero. So e to the minus zero is going to be approximately 1, right? So at very high temperatures, we almost always take the high-energy state. So that's what allows us to climb those energetic hills. If I have a very high temperature in my simulated annealing, then I'm always going over those barriers. So conversely, what happens, then, when I set the temperature very low? Then there's a very, very low probability of accepting those changes, right? So if I have a very low temperature– temperature approximately zero– then I'll never go uphill. Almost never go uphill.

So we have a lot of control over how much of the space this algorithm explores by how we set the temperature. So this is again a little bit of the art of simulated annealing– deciding exactly what annealing schedule to use, what temperature program you use. Do you start off high and go linearly down? Do you use some other, more complicated function to decide the temperature? We won't go into exactly how to choose these. [INAUDIBLE] You could track some of these things down from the references that are in the notes. So we have this choice.
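Putting the acceptance rule together with a cooling schedule gives the full algorithm. In this sketch (my own; the geometric cooling schedule and all the constants are invented choices, not from the lecture), the walker starts in the local minimum of a tilted double well and, thanks to the high-temperature phase, usually ends near the global one:

```python
import math
import random

# Illustrative sketch: simulated annealing on a tilted double-well energy
# where x = +1 is the global minimum and x = -1 is only a local minimum.

def energy(x):
    return (x**2 - 1)**2 + 0.3 * (1 - x)

def anneal(x, seed, t_start=2.0, t_end=0.01, alpha=0.95, steps_per_t=200):
    rng = random.Random(seed)
    t = t_start
    while t > t_end:                              # cooling schedule
        for _ in range(steps_per_t):
            x_test = x + rng.gauss(0.0, 0.2)      # small random perturbation
            delta = energy(x_test) - energy(x)
            # Metropolis rule: always downhill, uphill with prob e^(-dE/T)
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                x = x_test
        t *= alpha                                # geometric cooling
    return x

# Start in the wrong (local) minimum. Because the method is stochastic and
# not guaranteed to succeed, we keep the best of a few independent runs.
best = min((anneal(-1.0, seed=s) for s in range(5)), key=energy)
print(round(best, 1))
```

Pure gradient descent from x = -1 could never reach x = +1 here; it is exactly the barrier-hopping during the high-temperature phase that makes the difference.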

is, we’re going to start at higher temperatures. We’re going to explore

most of the space. And then, as we lower

the temperature, we freeze ourselves into the

most probable confirmations. Now, there’s nothing that

restricts simulated annealing to protein structure. This approach is

actually quite general. It’s called the Metropolis

Hastings algorithm. It’s often used in cases where

there’s no energy whatsoever and it’s thought of purely

in probabilistic terms. So if I have some probabilistic

function– some probability of being in some state S– I

can choose a neighboring state at random. Then I can compute

an acceptance ratio, which is the probability

of being a state S test over the probability

of being in a current state. This is what we did in terms

of the Boltzmann equation, but if I some other formulation

for the probabilities I’ll just use that. And then, just like in our

protein folding example, if this acceptance

ratio is greater than 1, we accept the new state. If it’s less than

1, then we accept it with a probabilistic statement. And so this is a very

general approach. I think you might see

it in your problem sets. We certainly have done

this on past exams– asked you to apply this algorithm to

other probabilistic settings. So it’s a very, very general

way to search the sample across a probabilistic

landscape. OK, so we’ve seen these
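Stated purely in terms of probabilities, the same loop looks like this. A minimal sketch (my own), assuming a made-up unnormalized distribution over five discrete states and a symmetric left/right proposal:

```python
import random

# Illustrative sketch: Metropolis-Hastings with no energies anywhere, just
# an unnormalized discrete distribution over states 0..4 (invented numbers).

WEIGHTS = [1.0, 2.0, 8.0, 2.0, 1.0]   # proportional to P(state)

def propose(state, rng):
    """Neighboring state: one step left or right, wrapping around."""
    return (state + rng.choice([-1, 1])) % len(WEIGHTS)

def metropolis_hastings(n_samples, seed=0):
    rng = random.Random(seed)
    state = 0
    counts = [0] * len(WEIGHTS)
    for _ in range(n_samples):
        test = propose(state, rng)
        ratio = WEIGHTS[test] / WEIGHTS[state]  # acceptance ratio P(test)/P(current)
        if ratio >= 1 or rng.random() < ratio:
            state = test                         # accept the new state
        counts[state] += 1                       # on reject, re-count current state
    return counts

counts = metropolis_hastings(100000)
# The visit frequencies should approach WEIGHTS / sum(WEIGHTS): state 2,
# with weight 8 out of 14, should be visited roughly 57% of the time.
print([c / 100000 for c in counts])
```

Note that only the ratio of probabilities is ever needed, which is why the distribution never has to be normalized.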

OK, so we've seen these three separate approaches, starting with an approximate structure and trying to get to the correct structure. We have energy minimization, which will move towards the local minimum conformation. So it's very fast compared to the other two, but it's restricted to local changes. We have molecular dynamics, which actually tries to simulate the biological process. Computationally very intensive. And then we have simulated annealing, which tries to shortcut the route to some of these global free energy minima by raising the temperature– pretending we're at this very high temperature so we can sample all the space– and then cooling down so we trap a high-probability conformation. Any questions on any of these three approaches? OK.

All right, so I'm going to go through now some of the approaches that have already been used to try to solve protein structures. We started off with a sequence. We'd like to figure out what the structure is. And this field has had a tremendous advance, because in 1995 a group got together and came up with an objective way of evaluating whether these methods were working. So lots of people have proposed methods for predicting protein structure, and what the CASP group did in '95 was they said, we will collect structures from crystallographers and NMR spectroscopists that they have not yet published but they know they're likely to be able to get within the time scale of this project. We will send out those sequences to the modelers. The modelers will attempt to predict the structure, and then at the end of the competition we'll go back to the crystallographers and the spectroscopists and say, OK, give us the structure, and now we'll compare the predicted answers to the real ones. So no one knows what the answer is until all the submissions are in, and then you can see objectively which of the approaches did the best. And one of the approaches that has consistently done very well, which we'll look at in some detail, is this approach called Rosetta. So you can look at the details online.

They split this modeling problem into two types. There are ones for which you can come up with a reasonable homology model. This can be very, very low sequence homology, but there's something in the database of known structure that is sequence-similar to the query. And then there are ones that are completely de novo. So how do they go about predicting these structures? So if there's homology, you can imagine the first thing you want to do is align your sequence to the sequence of the protein that has a known structure. Now, if it's high homology, this is not a hard problem, right? We just need to do a few tweaks. But we get to places– what's called the Twilight Zone, in fact– where there's a high probability that you're wrong, that your sequence alignments could be to entirely the wrong structure. And that's where things get interesting. So they've got high sequence similarity problems– greater than 50% sequence similarity– that are considered relatively easy. There are medium problems, at 20% to 50% sequence similarity. And then very low sequence similarity problems– less than 20% sequence similarity.

OK, so you've already seen in this course methods for doing sequence alignment, so we don't have to go into that in any detail. But there are a lot of different specific approaches for how to do those alignments. You could do anything from BLAST to highly sophisticated Markov models to try to decide what's most similar to your protein structure. And one of the important things that Rosetta found was not to rely on any single method but to try a bunch of different alignment approaches and then follow through with many of the different alignments. And then we get this problem of how do you refine the models, which is what we've already started to talk about.

So in the general refinement procedure, when you have a protein that's relatively in good shape, they apply random perturbations to the backbone torsion angles. So this is again the statistical approach– not allowing every atom to move. They're just rotating a certain number of the rotatable bonds. So we've got the phi and psi angles in the backbone, and some of the side chains. They do what's called rotamer optimization of the side chains. So what does that mean? Remember that we could allow the side chains to rotate freely, but very, very few of those rotations are frequently observed. So we're going to choose– as in these three choices– among the best possible rotamers, rotational isomers. And then once we've found a nearly optimal side-chain conformation from those highly probable ones, then we allow more continuous optimization of the side chains.
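The two-stage side-chain placement can be sketched as follows. This is my own toy example with an invented one-dimensional energy; Rosetta's actual rotamer libraries and score function are far richer:

```python
import math

# Illustrative sketch (invented numbers): discrete rotamer optimization
# followed by continuous refinement of a single side-chain dihedral.
# `energy` is a stand-in for the real scoring function.

ROTAMERS = [-60.0, 60.0, 180.0]   # commonly observed chi-angle values

def energy(chi_degrees):
    # Made-up smooth energy whose true minimum sits near chi = 65 degrees
    return -math.cos(math.radians(chi_degrees - 65.0))

def place_side_chain(step=0.5, n_refine=500):
    # Stage 1: pick the best of the discrete, highly probable rotamers.
    chi = min(ROTAMERS, key=energy)
    # Stage 2: continuous local optimization from that starting point
    # (a crude line search; real code uses gradient-based minimizers).
    for _ in range(n_refine):
        for trial in (chi - step, chi + step):
            if energy(trial) < energy(chi):
                chi = trial
    return chi

print(place_side_chain())  # starts from the rotamer at 60, settles at 65.0
```

The discrete stage does the global work cheaply; the continuous stage only has to polish within the basin the rotamer landed in.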

very high sequence homology template, you don’t

need to do a lot of work on most of the structure. Right? Most of it’s going

to be correct. So we’re going to

focus on those places where the alignment is poor. That seems pretty intuitive. Things get a little

bit more interesting when you’ve got these medium

sequence similarity templates. So here, even your basic

alignment might not be right. So they actually proceed

with multiple alignments and carry them through

the refinement process. And then, how do you decide

which one’s the best? You use the potential

energy function. Right? So you’ve already

taken a whole bunch of starting confirmations. We’ve taken them through

this refinery procedure. You now believe that

those energies represent the probability that the

structure is correct, so you’re going to choose

which of those conformations to use based on the energy. OK, in these medium sequence

similarity templates, the refinement doesn’t do

the entire protein structure, but it focuses on

particular regions. So places where there

are gaps, insertions, and deletions in the alignment. Right? So your alignment is

uncertain, so that’s where you need to

refine the structure. Places that were loops

in the starting models, so they weren’t

highly constrained. So it’s plausible

that they’re going to be different between

the starting structure, from some homologous protein,

and the final structure. And then, regions where the

sequence conservation is low. So even if there is a

reasonably good alignment, there’s some

probability that things have changed during evolution. Now, when they do a

refinement, how do they do that? In these places that

we’ve just outlined, they don’t simply randomly

perturb all of the angles. But actually, they take

a segment of the protein, and exactly how long

those segments are has changed over the course

of the Rosetta algorithm’s refinement. But say something on the order

of three to six amino acids. And you look in the

database for proteins that have known structure that

contain the same amino acid sequence. So it could be a completely

unrelated protein structure, but you develop

a peptide library for all of those short

sequences for all the different

possible structures that they’ve adopted. So you know that those

are at least structures that are consistent with

that local sequence, although they

might be completely wrong for this

individual protein. So you pop in all of

those alternative possible structures. So OK, we replace

the torsion angles with those of peptides

of known structure, and then we do a local

optimization using the kinds of minimization

algorithms we just talked about to see whether

there is a structure that’s roughly compatible with

that little peptide that you took from the

database that’s also consistent with the

rest of the structure. And after you’ve done that,

then you do a global refinement. Questions on that approach? OK, so does this work? Rosetta has been one of the best competitors
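
The fragment-replacement refinement just described can be sketched in code. Everything here is a toy stand-in: the fragment library, the sequence "AGV", and the energy function are invented for illustration, not Rosetta's actual data or score function.

```python
# Toy stand-ins: a "fragment library" maps a short sequence to backbone
# (phi, psi) torsions seen in known structures. Real libraries are built
# from the PDB; these values and the sequence "AGV" are invented.
FRAGMENT_LIBRARY = {
    "AGV": [
        [(-60, -45), (-60, -45), (-60, -45)],      # helix-like fragment
        [(-120, 120), (-120, 120), (-120, 120)],   # strand-like fragment
    ],
}

def toy_energy(torsions):
    """Invented score that happens to prefer helix-like torsions.
    Not a real force field."""
    return sum((phi + 60) ** 2 + (psi + 45) ** 2 for phi, psi in torsions)

def refine_segment(torsions, start, seq):
    """Swap in each library fragment for `seq` at `start`; keep the best."""
    best = list(torsions)
    for frag in FRAGMENT_LIBRARY.get(seq, []):
        trial = list(torsions)
        trial[start:start + len(frag)] = frag  # replace the torsions wholesale
        if toy_energy(trial) < toy_energy(best):
            best = trial
    return best

start_torsions = [(-120, 120)] * 3   # strand-like starting conformation
refined = refine_segment(start_torsions, 0, "AGV")
```

A real implementation would follow each fragment insertion with the local minimization described above, rather than just comparing energies.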

in this CASP competition. So here are examples where the

native structure’s in blue. The best model they

produced was in red, and the best template– that’s

the homologous protein– is in green. And you can see that they

agree remarkably well. OK? So this is very

impressive, especially compared to some of

the other algorithms. But again, it’s

focusing on proteins where there’s at least some

decent homology to start with. If you look here at the

center of these proteins, you can see the original

structure, I believe, is blue, and their model’s in red. You can see they also get the

side chain conformations more or less correct, which

is quite remarkable. Now, what gets

really interesting is when they work on

these proteins that have very low

sequence homologies. So we’re talking about 20%

sequence similarity or less. So quite often, at 20%

sequence similarity you’ll actually

have globally the wrong fold. So what do they do here? They start by saying,

OK, we have no guarantee that our templates are

even remotely correct. So they’re going to start

with a lot of templates and they’re going to refine

all of these in parallel in hopes that some of them come

out right at the other end. And these are what they call

more aggressive refinement strategies. So before, where did we focus

our refinement efforts? We focused on places that

were poorly constrained, either by evolution or

regions of the structure that weren’t well-constrained,

or places where the alignment wasn’t good. Here, they actually go after

the relatively well-defined secondary structure

elements, as well. And so they will

allow something that was a clear alpha helix

in all of the templates to change some of the structure

by taking peptides out of the database that

have other structures. OK? So you take a very,

very aggressive approach to the refinement. You rebuild the secondary

structure elements, as well as these gaps, insertions,

loops, and regions with low sequence conservation. And I think the really

remarkable thing is that this

approach also works. It doesn’t work quite as

well, but here’s a side by side comparison of a native

structure and the best model. So this is the hidden

structure that was only known to the crystallographer,

or the spectroscopist, who agreed to participate

in this CASP competition. And here is the

model they submitted blind without

knowing what it was. And you can see again

and again that there’s a pretty good global similarity

between the structures that they propose

and the actual ones. Not always. I mean, here’s an example where

the good parts are highlighted and the not-so-good

parts are shown in white so you can barely see them. [LAUGHTER] PROFESSOR: But even

so, give them that. Give them their credit. It’s a remarkably

good agreement. Now, we’ve looked at cases

where there’s very high sequence similarity, where there’s medium

sequence similarity, where there’s low sequence similarity. But the hardest category are

ones where there’s actually nothing in the structural

database that’s a detectable homologue to the

protein of interest. So how do you go

about doing that? That’s the de novo case. So in that case, they take

the following strategy. They do a Monte Carlo

search for backbone angles. So specifically, they

take short regions– and again, the exact

length changes in different

versions of the algorithm, but it’s roughly three to nine

amino acids in the backbone. They find similar

peptides in the database of known structure. They take the

backbone conformations from the database. They set the angles

to match those. And then, they use those

Metropolis criteria that we looked at in

simulated annealing. Right? The relative probability

of the states, determined by the

Boltzmann factor, to decide whether

to accept or not. If it’s lower

energy, what happens? Do you accept? Do you not accept? AUDIENCE: Accept. PROFESSOR: You accept. And if it’s high energy,

how do you decide? AUDIENCE: [INAUDIBLE] PROFESSOR: [INAUDIBLE],

probability. Very good. OK, so they do a fixed number
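
The Metropolis criterion just described is simple to state in code. This is a minimal sketch that folds kT into a single `temperature` parameter (arbitrary units) and uses Python's random number generator.

```python
import math
import random

def metropolis_accept(delta_e, temperature=1.0, rng=random):
    """Metropolis criterion: always accept a move that lowers the energy;
    accept an uphill move with probability exp(-delta_e / kT), where kT
    is folded into `temperature` here."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)

# Downhill moves are always taken; uphill moves survive occasionally,
# which is what lets the search escape local minima.
random.seed(0)
accepted = sum(metropolis_accept(1.0) for _ in range(10000))
# acceptance rate for delta_e = kT should be near exp(-1), about 0.37
```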

of Monte Carlo steps– 36,000. And then they repeat

this entire process to get 2,000 final structures. OK? Because they really have

very, very low confidence in any individual one

of these structures. OK, now you’ve got

2,000 structures, but you’re allowed

to submit one. So what do you do? So they cluster

them to try to see whether there are common

patterns that emerge, and then they

refine the clusters and they submit each cluster

as a potential solution to this problem. OK, questions on the
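
The clustering step can be sketched with a greedy scheme of the kind commonly used for decoy selection. The decoys here are toy points on a line, and `distance` stands in for what would be a C-alpha RMSD between structures.

```python
def greedy_cluster(decoys, distance, cutoff):
    """Greedy decoy clustering: repeatedly pick the decoy with the most
    neighbors within `cutoff` as a cluster center, peel that cluster off,
    and repeat until nothing remains."""
    remaining = list(decoys)
    clusters = []
    while remaining:
        center = max(remaining,
                     key=lambda d: sum(distance(d, o) <= cutoff for o in remaining))
        members = [o for o in remaining if distance(center, o) <= cutoff]
        clusters.append((center, members))
        remaining = [o for o in remaining if o not in members]
    return clusters

# Toy decoys on a line; two clear basins, around 0 and around 5.
decoys = [0.0, 0.1, 0.2, 5.0, 5.1]
clusters = greedy_cluster(decoys, lambda a, b: abs(a - b), cutoff=0.5)
```

The idea is that a big, tight cluster suggests a conformation the search keeps rediscovering, which is weak evidence that it is low energy for the right reasons.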

Rosetta approach? Yes. AUDIENCE: Can you mention again

why the short region of three to nine amino acids,

and whether [INAUDIBLE]. PROFESSOR: So the

question is, what’s the motivation for taking

these short regions from the structural database? Ultimately, this is

a modeling choice that they made that

seems to work well. So it’s an empirical choice. But what possibly motivated

them, you might ask, right? So, the thought has been in

this field for a long time, and it’s still, I

think, unproven, that certain sequences will

have a certain propensity to certain structures. We saw this in the secondary

structure prediction algorithms, that there were

certain amino acids that occurred much more

frequently in alpha helixes. So it could be that there

are certain structures that are very likely to occur

for short peptides, and other ones that

almost never occur. And so if you had a large enough

database of protein structures, then that would be a

sensible sampling approach. Now, in practice, could you

have gotten some good answer in some other approach? We don’t know. This is what

actually worked well. So there’s no real theoretical

justification for it other than that

crude observation that there is some information

content that’s local, and then a lot of information

content that’s global. Yes? AUDIENCE: So when you’re

doing a de novo approach, is it general that you

come up with a bunch of different clusters

as your answer, whereas with the

homology approach, you are more confident

of structure answer? PROFESSOR: So the question

was, if you’re doing a de novo approach, is it

generally the case that you have lots

of individual, or clusters of structures,

whereas in homology you tend not to. And yes, that’s correct. So in the de novo,

there are frequently going to be multiple

solutions that look equally plausible to you,

whereas the homology tends to drive you to certain classes. Good questions. Any other questions? All right, so that was CASP. CASP 1 was in 1995, which

seems like an eon ago. So how have things

improved over the course of the last decade or two? So there was an interesting

paper that came out recently that just looked at the

differences between CASP 10, one of the most

recent ones, and CASP 5. They’re every two years,

so that’s a decade. So how have things

improved or not over the last decade

in this challenge? So in this chart, the

y-axis is the percent of the residues

that were modeled and that were not

in the template. OK? So I’ve got some template. Some fraction of the amino acids

have no match in the template. How many of those

do I get correct? As a function of

target difficulty, they have their own definition

for target difficulty. You can look in the

actual paper to find out what it is; in the CASP

competition, it’s a combination of structural

and sequence data. So let’s just take

them that they made some reasonable

choices here. They actually put

a lot of effort into coming up with a

criteria for evaluation. Every point in this diagram

represents some submitted structure. The CASP5, a decade

ago, are the triangles. CASP 9, two years

ago, were the squares, and the CASP10 are the circles. And then they have

trend lines for CASP9 and CASP10, shown

here– these two lines. And you can see that they

do better for the easier structures and worse for

the harder structures, which is what you’d expect,

whereas CASP5 was pretty much flat across all of them

and did about as well even on the easy structures

as these ones are doing on the hard structures. So in terms of the fraction

of the protein that they don’t have a template for that

they’re able to get correct, they’re doing much, much

better in the later CASPs than they did a decade earlier. So that’s kind of encouraging. Unfortunately, the story isn’t

always that straightforward. So this chart is, again, target

difficulty on the x-axis. The y-axis is what they call

the Global Distance Test, and it’s a measure of model accuracy. It’s the percent of the carbon

alpha atoms in the predictions that are close– and they have

a precise definition of close that you can look up– that are

close to the true structure. So for a perfect model, it would

be up here in the 90% to 100% range, and then random

models would be down here. You can see a lot of

them are close to random. But more important here

are the trend lines. So the trend line for

CASP10, the most recent one in this report, is black. And fore CASP5, it’s

this yellow one, which is not that

different from the black. So what this shows is that,

over the course of a decade, the actual prediction

accuracy overall has not improved that much. It’s a little bit shocking. So they tried in this paper

to figure out, why is that? I mean, the percentage of

the amino acids that you’re getting correct is going up,

but overall accuracy has not. And so they make

some claims that it could be that target difficulty

is not really a fair measure, because a lot of the proteins

that are being submitted are now actually much harder

in different sense, in that they’re not single domain

proteins initially. So in CASP5, a lot

of them were proteins that had independent structures. By the time of CASP10,

a lot of the proteins that are being

submitted are more interesting structural problems

in that their folding is contingent on interactions

with lots of other things. So maybe all the

information you need is not contained entirely in

the sequence of the peptide that you’ve been given

to test but depends more on the interactions

of it with its partners. So those were for

homology models. These are the free

modeling results. So in free modeling, there’s

no homology to look at, so they don’t have a measure of

difficulty except for length. They’re using, again,

that Global Distance Test. So up here are perfect models. Down here are nearly

random models. CASP10 is in red. CASP5, a decade

earlier, is in green. And you can see the trend

lines are very, very similar. And CASP9, which is

the dashed line here, looks almost identical to CASP5. So again, this is

not very encouraging. It says that the

accuracy of the models has not improved very

much over the last decade. And then, they do

point out that if you focus on the short structures,

then it’s kind of interesting. So in CASP5, which are the

triangles, only one of these was above 60%. In CASP9, 5 out

of 11 were pretty good. But then you get to

CASP10 and now only three are greater than 60%. So it’s been

fluctuating quite a lot. So modeling de novo is still

a very, very hard problem. And they have a whole

bunch of theories as to why that could be. They proposed, as

I already said, that maybe the models that

they’re trying to solve have gotten harder in ways

that are not easy to assess. A lot of the proteins that

previously wouldn’t have had a homologue now already do,

because there has been a decade of structural work trying

to fill in missing domain structures. And that these targets tend

to have more irregularity, and they tend to be part

of larger proteins. So again, there’s not

enough information in the sequence of

what you’re given to make the full prediction. Questions? So what we’ve seen so far

has been the Rosetta approach to solving protein structures. And it really is,

throw everything at it. Any trick that you’ve got. Let’s look into the databases. Let’s take homologous proteins. Right? So we have these high,

medium, low levels homologues. And even when we’re

doing a homologue, we don’t restrict ourselves

to that protein structure. But for certain parts,

we’ll go into the database and find the structures of

peptides of length three to nine. Pull those out of the

database. Plug those in. Our potential energy functions

are a grab bag of information, some of which has strong

physical principles, some of which is just curve fitting

to make sure that we keep the hydrophobics inside

and hydrophilics outside. So we throw any information

that we have at the problem, whereas our physicist has

disdain for that approach. He says, no, no. We’re going to this

purely by the book. All of our equations are going

to have some physical grounding to them. We’re not going to start

with homology models. We’re going to try to do the

simulation that I showed you a little movie of for

every single protein we want to know the structure of. Now, why is that problem hard? It’s because these

potential energy landscapes are incredibly complex. Right? They’re very rugged. Trying to get from any current

position to any other position requires going over

many, many minima. So the reason it’s

hard to do, then, is it’s primarily a

computing power issue. There’s just not

enough computer power to solve all of these problems. So what one group, D. E.

Shaw, did was they said, well, we can solve

that by just spending a lot of money, which

fortunately they had. So they designed

hardware that actually solves individual components of

the potential energy function in hardware rather

than in software. So they have a chip that

they call Anton that actually has parts of it that solve the

electrostatic function, the van der Waals function. And so in these chips,

rather than in software, you are doing as fast

as you conceivably can to solve the energy terms. And that allows you to

sample much, much more space. Run your simulations

for much, much longer in terms of real time. And they do remarkably well. So here are some pictures

from a paper of theirs– a couple of years ago

now– with the predicted and the actual structures. I don’t even remember

which color is which, but you can see it

doesn’t much matter. They get them down to

very, very high resolution. Now, what do you notice

about all these structures? AUDIENCE: They’re small. PROFESSOR: They’re small, right? So obviously there’s

a reason for that. That’s what you can do in

reasonable compute time, even with a high-end computing

that’s special purpose. So we’re still not

in a state where they can fold any

arbitrary structure. What else do you

notice about them? Yeah, in the back. AUDIENCE: They have

very well-defined secondary structures. PROFESSOR: They have

very well-defined secondary structures. And they’re specifically

what, mostly? AUDIENCE: Alpha helixes. PROFESSOR: Alpha helixes, right. And it turns out that a lot more

information is encoded locally in an alpha helix than

in a beta sheet, which is going to be contingent on

what that piece of protein comes up against. Right? Whereas in the

alpha helix, we saw that you can get 60% accuracy

with very crude algorithms, right? So we’re going to do best

with these physics approaches when we have small proteins

that are largely alpha helical. But in later papers– well

here’s even an example. Here’s one that has a

certain amount of beta sheet. And the structures are going

to get larger with time. So it’s not an inherent problem. It’s just a question of

how fast the hardware is today versus tomorrow. OK, a third approach. So we had the

statistical approach. We have the physics approach. The third approach, that

I won’t go into detail but you can play around

was literally yourselves, is a game where

we have humans who try to identify the

right structure, just as humans do very well

in other kinds of pattern recognition problems. So you can try this video game

where you’re given structures to try to solve and say, oh,

should I make that helical? Should I rotate that side chain? So give it a try. Just Google FoldIT,

and you can find out whether you can be the best

gamers and beat the hardware. All right. So so far we’ve been

talking about solving the structures of

individual proteins. We’ve seen there is some

success in this field. It’s improved a

lot in some ways. Between CASP1 and CASP5 I think

there have been huge improvements. Between CASP5 and CASP10, maybe

the problems have gotten harder. Maybe there have

been no improvements. We’ll leave that for

others to decide. What I’d like to look at

in the end of this lecture and the beginning

of the next lecture are problems of proteins

interacting with each other, and can we predict

those interactions? And that’ll, then, lead us

towards even larger systems and network problems. So we’re going to

break this down to three separate

prediction problems. The first of these is predicting

the effect of a point mutation on the stability

of a known complex. So in some ways, you might

think this is an easy problem. I’ve got two proteins. I know their structure. I know they interact. I want to predict whether

a mutation stabilizes that interaction or

makes it fall apart. That’s the first

of the problems. We can try to

predict the structure of particular complexes,

and we can then try to generalize that and try

to predict every protein that interacts with

every other protein. We’ll see how we

do on all of those. So we’ll go into one of these

competition papers, which are very good at

evaluating the fields. This competition paper looked at

what I call the simple problem. So you’ve got two proteins

of known structure. The authors of the paper,

who issued the challenge, knew the answer for the effect

of every possible mutation at a whole bunch of positions

along these proteins on the– well, an approximation

to the free energy of binding. So they challenged

the competitors to try to figure out, we

give you the structure, we tell you all the

positions we’ve mutated, and you tell us whether those

mutations made the complex more stable or made the

complex less stable. Now specifically, they had two

separate protein structures. They mutated 53

positions in one. 45 positions in another. They didn’t directly measure

the free energy of binding for every possible complex,

but they used a high throughput assay. We won’t go into

the details, but it should track, more or

less, with the free energy. So things that appear more

stable in the assay probably are lower free energy complexes. OK, so how would you go

about trying to solve this? So using these potential

energy functions that we’ve already

seen, you could try to plug in the mutation

into the structure. And what would you

have to do then in order to evaluate the energy? Before you evaluate the energy. So I’ve got known structure. I say, position 23 I’m mutating

from phenylalanine to alanine. I’ll say alanine

to phenylalanine. Make it a little

more interesting. OK? So I’m now stuck on

this big side chain. So what do I need to do before

I can evaluate the structure energy? AUDIENCE: Make sure

there’s no clashes. PROFESSOR: Make sure

no clashes, right? So I have to do one

of those methods that we already described

for optimizing the side chain conformation, and

then I can decide, based on the free energy,

whether it’s an improvement or makes things worse. OK, so let’s see how they do. So here’s an example

of a solution. The submitter, the person who

has the algorithm for making a prediction, decides on

some cutoff in their energy function, whether

they think this is improving things or

making things worse. So they decide on the color. Each one of these

dots represents a different mutation. On the y-axis is the

actual change in binding, the observed change in binding. So things above zero

are improved binding. Below zero are worse binding. And here are the predictions

on the submitter scale. And here the submitter

said that everything in red should be worse and everything

green should be better. And you can see that

there’s some trend. They’re doing reasonably well

in predicting all these red guys as being bad, but

they’re not doing so well in the neutral ones, clearly,

and certainly not doing that well in the improved ones. Now, is this one of the better

submitters or one of the worst? You’d hope that this

is one of the worst, but in fact this is one

of the top submitters. In fact, not just

the top submitter but the top submitter

looking at mutations that are right at the

interface where you’d think they’d do the best, right? So if there’s some mutation on

the backside of the protein, there’s less

structural information about what that’s going to

be doing in the complex. There could be some

surprising results. But here, these are

amino acid mutations right at the interface. So here’s an example

of the top performer. This is the graph

I just showed you, focusing only on

the residues at the interface, and all sites. And here’s an average group. And you can see the

average groups are really doing rather abysmally. So this blue cluster that’s

almost entirely below zero were supposed to be neutral. And these green ones were

supposed to be improved, and they’re almost

entirely below zero. This is not an encouraging story. So how do we

evaluate objectively whether they’re

really doing well? So we have some sort

of baseline measure. What is the sort

of baseline algorithm you could use to predict

whether a mutation is improving or hurting this interface? So all of their

algorithms are going to use some kind

of energy function. What have we already seen in

earlier parts of this course that we could use? Well, we could use the

substitution matrices, right? We have the BLOSUM

substitution matrix that tells us how

surprised we should be when we see, in evolution,

that Amino Acid A turns into Amino Acid B.

So we could use, in this case, the BLOSUM matrix. That gives us for

each mutation a score. It ranges from minus 4 to 11. And we can rank

every mutation based on the BLOSUM matrix

for the substitution and say, OK, at some value

in this range things should be getting better

or getting worse. So here’s an area

under the curve plot where we’ve plotted

the false positives and true positive

rates as I change my threshold for

that BLOSUM matrix. So I compute what the

mutation’s BLOSUM score is, and then I say, OK, is a

value of 11 bad or is it good? Is a value of 10 bad or good? That’s what this

curve represents. As I vary that threshold,

how many do I get right and how many do I get wrong? If I’m doing the

decisions at random, then I’ll be getting

roughly equal true positives and false positives. They do slightly better than

random using this matrix. Now, the best algorithm at
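
The baseline evaluation just described, ranking mutations by a substitution score and sweeping a threshold, can be sketched with a rank-based AUC. The scores and labels below are invented for illustration, not the paper's data.

```python
def roc_auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    outscores a randomly chosen negative, with ties counting one half.
    This equals the area under the ROC curve traced by sweeping the
    score threshold."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: score = negated substitution-matrix value, so a low
# BLOSUM score predicts "deleterious"; labels are made-up assay outcomes.
scores = [2, 1, 0, -1]   # predicted deleteriousness per mutation
labels = [1, 1, 0, 0]    # 1 = actually deleterious in the assay
auc = roc_auc(scores, labels)
```

An AUC of 0.5 is the random diagonal discussed above; 1.0 is the perfect straight-up-and-across curve.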

predicting that uses energies only does marginally better. So this is the best

algorithm at predicting. This is this baseline algorithm

using just the BLOSUM matrix. You can see that the green curve

predicting beneficial mutations is really hard. They don’t do much

better than random. And for the

deleterious mutations, they do somewhat better. So we could make these

plots for every single one of the algorithms,

but a little easier is to just compute the

area under the curve. So how much of the area? If I were doing perfectly, I

would get 100% true positives and no false positives, right? So my line would go

straight up and across and the area under the

curve would be one. And if I’m doing terribly,

I’ll get no true positives and all false positives. I’d be flatlining and

my area would be zero. So the area under

the curve, which is normalized

between zero and one, will give me a sense of how

well these algorithms are doing. So this plot– focus first

on the black dots– shows at each one of these algorithms

what the area under the curve is for beneficial and

deleterious mutations. Beneficial on the x-axis,

deleterious mutations on the y-axis. The BLOSUM matrix is here. So good algorithms should be

above that and to the right. They should have a better

area under the curve. And you can see the

perfect algorithm would have been all the way up here. None of the black dots

are even remotely close. The G21, which we’ll talk

about a little bit in a minute, is somewhat better than the

BLOSUM matrix, but not a lot. Now, I’m going to ignore the

second round in much detail, because this is a case

where people weren’t doing so well in the first round so

they went out and gave them some of the information about

mutations at all the positions. And that really changes

the nature of problem, because then you have

a tremendous amount of information about which

positions are important and how much difference those

mutations are making. So we’ll ignore

the second round, which I think is an overly

generous way of comparing these algorithms. OK, so what did the authors

of this paper observe? They observed that the

best algorithms were only doing marginally better

than random choice. So three times better. And that there seemed to be

a particular problem looking at mutations that

affect polar positions. One of the things that I think

was particularly interesting and quite relevant when we

think about these things in a thermodynamic context

is that the algorithms that did better– none of them

could be really considered to do really well– but the

algorithms that did better didn’t just focus on

the energetic change between forming the

native complex over here and forming this mutant

complex indicated by the star. But they also focused on

the effect of the mutation on the stability of

the mutated protein. So there’s an

equilibrium not just moving between the free

proteins and the complex, but also moving

between the free proteins that are folded and the free

proteins that are unfolded. And some of these

mutations are affecting the energy of the folded

state, and so they’re driving things to the

left, to the unfolded. And if you don’t include

that, then you actually get into trouble. And I’ve put a link here

to some lecture notes from a different course that

I teach where you can look up some details and more

sophisticated approaches that actually do take into account

a lot of the unfolded states. So the best approaches–

best of a bad lot– consider the effects of

mutations on stability. They also model packing,

electrostatics, and solvation. But the actual

algorithms that they used were a whole mishmash

of approaches. So there didn’t seem to emerge

a common pattern in what they were doing, and I thought I

would take you through one of these to see what

actually they were doing. So the best one was this

machine learning approach, G21. So this is how they

solved the problem. First of all, they dug

through the literature and found 930 cases where they

could associate a mutation with a change in energy. These had nothing to do with

the proteins under consideration. They were completely

different structures. But they were cases

where they actually had energetic information

for each mutation. Then they go through

and try to predict what the structural change

will be in the protein, using somebody else’s

algorithm, FoldX. And now, they describe

each mutant, not just with a single energy–

we have focused, for example, on PyRosetta,

which you’ll use in the problem sets– but they actually had

85 different features from a whole bunch of

different programs. So they’re taking a

pretty agnostic view. They’re saying, we don’t know

which of these energy functions is the best, so let’s let

the machine learning decide. So every single mutation that’s

posed to them as a problem, they have 85

different parameters as to whether it’s

improving things or not. And then, they had their

database of 930 mutations. For each one of those

they had 85 parameters. So those are labeled

training data. They know whether things

are getting better or worse. They actually don’t even rely

on a single machine learning method. They actually used five

different approaches. We’ll discuss Bayesian

nets later in this course. Most of these others

we won’t cover at all, but they used a lot of different

computational approaches to try to decide how to go

from those 85 parameters to a prediction of whether the

structures improved or not. So this actually
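
G21's actual pipeline (FoldX models, 85 features, five learners) is not reproduced here, but the general recipe, training simple learners on per-mutation feature vectors and combining them by vote, can be sketched with one-feature decision stumps on invented data.

```python
def train_stump(X, y, feature):
    """Fit a one-feature decision stump: pick the (threshold, sign) on
    `feature` that best separates label 1 from label 0 on the training set."""
    best, best_acc = (feature, 0.0, 1), -1.0
    for thresh in sorted({x[feature] for x in X}):
        for sign in (1, -1):
            preds = [1 if sign * (x[feature] - thresh) > 0 else 0 for x in X]
            acc = sum(p == t for p, t in zip(preds, y)) / len(y)
            if acc > best_acc:
                best_acc, best = acc, (feature, thresh, sign)
    return best

def predict_vote(stumps, x):
    """Combine stumps by majority vote, a crude stand-in for the five
    learners G21 actually combined."""
    votes = sum(1 if sign * (x[f] - t) > 0 else 0 for f, t, sign in stumps)
    return 1 if 2 * votes > len(stumps) else 0

# Invented training set: two features per "mutation", label 1 = stabilizing.
X = [[0.0, 0.3], [1.0, 0.7], [0.0, 0.9], [1.0, 0.1]]
y = [0, 1, 0, 1]                 # feature 0 carries the real signal
stumps = [train_stump(X, y, 0)]
```

In the real pipeline the feature vectors come from many energy functions, so the learner, not the modeler, decides which energy terms matter.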

shows the complexity of this apparently

simple problem, right? Here’s a case where I have two

proteins of known structure. I’m making very specific

point mutations, and even so I do only

marginally better than random. And even throwing at it all

the best machine learning techniques. So there’s clearly a

lot in protein structure that we don’t yet have

parametrized in these energy functions. So maybe some of

these other problems are actually not as

hard as we thought. Maybe instead of trying to

be very precise in terms of the energetic change

for a single mutation at an interface, we’d do

better trying to predict rather crude parameters of which

two proteins interact with each other. So that’s what

we’re going to look at in the next

part of the course. We’re going to

look at whether we can use structural data to

predict which two proteins will interact. So here we’ve got a problem,

which is a docking problem. I’ve got two proteins. Say they’re of known

structure, but I’ve never seen them interact

with each other. So how do they come together? Which faces of the proteins are

interacting with each other? That’s called a docking problem. And if I wanted to try to

systematically figure out whether Protein A and Protein

B interact with each other, I would have to do a search

over all possible conformations, right? Then I could use

the energy functions to try to predict which

one has the lowest energy. But it actually would be

a computationally very inefficient way to do things. So we could imagine we

wanted to solve this problem. For each potential

partner, we could evaluate all relative

positions and orientations. Then, when they come together

we can’t just rely on that, but as we’ve seen

several times now we’re going to have to do local

conformational changes to see how they fit together for

each possible docking. And then, once

we’ve done that, we can say, OK, which of

these has the lowest energy of interaction? So that, obviously, is going

to be too computationally intensive to do

on a large scale. It could work very

well if you’ve got a particular pair

of proteins that you need to study. But on a big scale, if

we wanted to predict all possible interactions,

we wouldn’t really be able to get very far. So what people typically do is

use other kinds of information to reduce the search space. And what we’ll see

in the next lecture, then, are different ways

to approach this problem. Now, one question we

should ask is, what role is structural homology

going to play? Should I expect that

any two proteins that interact with each

other– let’s say that Protein A and

I know its interactors. So I’ve got A known to

interact with B. Right? So I know this interface. And now I have

protein C, and I’m not sure if it interacts or not. Should I expect the interface

of C that touches A to match the interface of B? Should these be homologous? And if not precisely

homologous, then are there properties

that we can expect that should be

similar between them? So different

approaches we can take. And there are

certainly cases where you have proteins that interact

with a common target that have no overall structure

similarity to each other but do have local

structural similarity. So here’s an example

of subtilisin, which is shown in light

gray, and pieces of it that interact with the

target are shown in red. So here are two proteins that

are relatively structurally homologous– they interact

at the same region. That’s not too surprising. But here’s a subtilisin

inhibitor that has no global structural

similarity to these two proteins, and yet its

interactions with subtilisin are quite similar. So we might expect, even if

C and B don’t look globally anything like each

other, they might have this local similarity. OK, actually I think we’d

like to turn back your exams. So maybe I’ll stop here. We’ll return the

exams in the class, and then we’ll pick up at this

point in the next lecture.
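The exhaustive docking search described in this lecture (evaluate all relative positions and orientations of one protein against the other, then pick the lowest-energy pose) can be sketched in a few lines of Python. Everything here is a toy stand-in: the "atoms" are made-up points, rotation is restricted to the z-axis, and the Lennard-Jones-like score is a placeholder for a real energy function or statistical potential.

```python
import math
import itertools

def rotate_z(points, theta):
    """Rotate (x, y, z) points about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def translate(points, dx, dy, dz):
    return [(x + dx, y + dy, z + dz) for x, y, z in points]

def interaction_energy(receptor, ligand):
    """Toy pairwise score: a Lennard-Jones-like term over all atom pairs.
    A real docking program would use a full force field or a
    statistical potential instead of this placeholder."""
    energy = 0.0
    for ra in receptor:
        for la in ligand:
            r = max(math.dist(ra, la), 0.8)  # clamp to avoid singularity
            energy += (1.0 / r) ** 12 - 2.0 * (1.0 / r) ** 6
    return energy

def dock(receptor, ligand, angles, offsets):
    """Exhaustive rigid-body search: try every rotation/translation of
    the ligand and return the pose with the lowest interaction energy."""
    best_energy, best_pose = float("inf"), None
    for theta, (dx, dy, dz) in itertools.product(angles, offsets):
        pose = translate(rotate_z(ligand, theta), dx, dy, dz)
        e = interaction_energy(receptor, pose)
        if e < best_energy:
            best_energy, best_pose = e, (theta, (dx, dy, dz))
    return best_energy, best_pose

# Tiny demonstration with made-up "atoms" and a hypothetical search grid.
receptor = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
ligand = [(0.0, 0.0, 0.0)]
angles = [0.0, math.pi / 2]
offsets = [(x, 0.0, 0.0) for x in (-3.0, -1.0, 1.0, 3.0)]
energy, pose = dock(receptor, ligand, angles, offsets)
print(energy, pose)
```

Even on this toy grid the cost is (number of rotations) × (number of translations) × (atom pairs), which is why real docking programs avoid brute-force enumeration, for example by scoring many translations at once with FFT-based correlation, and why the search-space reductions discussed above matter.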
