# 14. Predicting Protein Interactions

The following

content is provided under a Creative

Commons license. Your support will help MIT

OpenCourseWare continue to offer high quality

educational resources for free. To make a donation or

view additional materials from hundreds of MIT courses,

visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: OK. So we’ve been talking about

predicting structure proteins. At the end of the

last lecture we started to talk a little bit

about predicting interactions, and that’s going to be the

focus of today’s lecture. And we identified a couple of

different possible prediction challenges. One was quantitative

predictions of what happens when you make specific

mutations in a known protein complex. We talked about

trying to predict the structure of, say,

just a pair of proteins, and then trying to do that

on the global scale for all known proteins. And so last time,

if you recall, we thought that initially maybe

this would be a simple problem. We have proteins of known

structure with a complex. Structure of the

complex is also known. And we want to make predictions

as to the change in affinity when there’s a

specific mutation made. In principle, this

should be easy because we have all those

different formulations for the potential

energy function. And so if we figure out what

the local structural changes are that are due to the insertion

or deletion of some side chain, then we should be

able to predict the change in the

potential energy, and therefore the change in

the energy of the complex. But in fact, it

turned out that it was very, very hard to do that. And so this plot compared–

the black circles were the prediction

algorithms for this problem, compared to just simply

a substitution matrix, the BLOSUM substitution matrix

defined in terms of the area under the curve for

beneficial mutations and deleterious mutations. And you can see that very,

very few of the black dots get far away from what is the

really simple default model. A lot of them do worse. So OK, well maybe that’s

not such a simple problem because it requires a highly

quantitative prediction. Maybe we’ll do

better just trying to predict which

proteins interact at all. And so that’s going to be

the focus of today’s lecture. Now, that also had

a problem, right? Because even if I know the

structure of two proteins, I don’t know necessarily

what surfaces of those proteins interact. And so I have to

figure out this docking problem of which part

of protein A interacts with which part of protein B. That’s the beginning

of my problem, and then I have to make a

series of subsequent decisions. So I’m going to

have to figure out for any potential

partner of my protein, I need to figure out

the docking problem, the relative

position orientation. Now, in this little

cartoon, it’s shown as a completely

static protein that approaches another

static protein. The only thing that’s changing

is the relative coordinates. But of course, there

will be local changes in confirmation, perhaps

even global ones. And so we need to be able

to make some estimates as to what those structural

rearrangements will be when the two

proteins interact. And then after we’ve

come up with our best estimate of the

structural rearrangements, only then can we come up with

an estimate of the energy interaction and

decide whether it’s better than some threshold. OK. So one of the problems that’s

pretty obvious from this is that this kind of

approach in principle, if we do it rigorously

through all the steps, would be extremely slow. Now, another part that’s perhaps

a little bit less obvious is that it’s going to be very

prone to false positives. And why do you

think that might be? What am I not taking

into account here? AUDIENCE: Are you

not taking into account the desolvation

[INAUDIBLE]. PROFESSOR: So one

answer is I’m not taking account of

the desolvation, but in fact, I can do that. Right? So some of the potential

energy functions we looked at, the

statistician’s version rather than the physicist’s

makes it pretty easy to incorporate the desolvation. Any other thoughts as to what

I’m not taking into account? What other protein

should I be considering when I’m considering

an interaction problem? So I’ve isolated, in

this case, two proteins. I’m saying, in a universe

where these are the only two proteins that exist, will

they have a favorable energy interaction? What I really need to know is

whether that energy interaction is more favorable than all

the competing interactions that they could have. So even if I find something

that’s potentially a good interaction, it may

not be the best possible interaction. And if I consider then the

concentration of this protein and the concentration of

all the other molecules out there that have

a higher affinity, then it could turn out

that this is actually a rather poor substrate

for my protein, a rather poor interaction partner. So we have that false

positive problem. OK. But let’s focus on the

computational efficiency problem, because

that’s at least one that we can come up with

some nice algorithms to try to solve. So what we want to do is try

to limit our search space. If I want to figure out–

I have a query protein and I want to ask, what

does it interact with, instead of trying to do

the pairwise comparison of this protein with every

other protein in the database, and doing very precise

structural calculations on all of those, maybe

there’s some way that I can prefilter the

set of proteins that it might interact with. And that’s what we’re

going to look at. So we’re going to try

to officially choose potential partners

before we’re doing any structural comparison. And then once we

have those partners, we’re going to try

to avoid having to do detailed calculations

until we have a relatively high degree of confidence that

these proteins could interact by other criteria. And we’re going to look at two

papers that describe algorithms for solving this

problem, and they’re both uploaded to the website. The first thing

that we’ll look at is called PRISM that actually

uses structural calculations. And then we’ll look

at PrePPI, which deals with everything

purely at– without actually explicitly calculating

the structures. OK. So what does PRISM do? Well, it’s based on

the notion that there are a limited number

of architectures that we could look at for

which proteins can interact. And so if we can identify

those architectures, then we can try to figure

out whether a protein is a potential partner

of another one before we do the detailed,

costly calculations. In addition, in

those architectures, not all amino acids

are going to be equal, but there are going

to be some that contribute more to the

energy than others. And so by identifying

those critical residues, we can once again focus

our computational energy on those complexes that are

most likely to be important. OK. So it has these two components–

a rigid-body structural comparison. So that’s that two

proteins are not changing their own

coordinates, they’re just being brought together

in different conformations. And then once the proteins

have passed a series of checks, then we allow for

flexible refinement using the kinds of energies we looked

at in the previous lectures to decide how high affinity

this complex could be. And the critical

thing is that we’re going to make some of

these early decisions after the rigid-body comparison

using structural similarity, evolutionary conservation,

and particularly looking at these regions

that are called hotspots. These are sites where most of

the free energy of interaction occurs during an interface. So it’s not, as I said,

uniformly distributed. So I showed you this

slide last time. It shows chymotrypsin in a

light gray and its interaction with some protein partners. These two share some global

similarity to each other, whereas this partner is

quite different from either of these two globally. But you can see that

at the interface, it’s actually quite similar. And so this gives you hope that

even if you can’t find a direct homologue– so if you

were trying to figure out, what does this protein

in yellow interact with, and you searched the database

and you couldn’t find anything that was its

structural homologue, but if you could figure

out to look for homologues of the lower regions

that interact, you might be able to figure out

that it interacts with the same protein as this

one and this one. OK. So what about this

idea of hotspots? And this was an

idea that was first developed in 1995 by this

paper, Clackson and Wells, where they were looking at the

interaction of a cell surface receptor with its

ligand approaching. And they did

systematic mutagenesis across the surface

of the interface to see when I mutate any

single amino acid to alanine, how much it affects the

energy of interaction. What they found was things

were highly non-uniform. So this lower curve shows

the change in free energy when you mutate particular

individual amino acids to alanine. And you can see there are

big losses of free energy at some places, and

other places there’s almost no change in the

free energy binding. In a few places you

actually get a benefit from mutating a side

chain to alanine. So in this particular

case, and it’s held up over many,

many cases then, the free energy of binding is

not uniform across the surface, but it’s distributed in what

has been called hotspots. So here is a structure

of the human growth hormone and its receptor. And in red are the

few amino acids that contribute very, very large

amounts– more than one and a half kcals per mole–

to the energy of interaction. And it doesn’t correspond

with any simple structural parameter. So it’s not the amino acids

that have the biggest surface area, for example, or

anything like that. So it’s not trivial to figure

out what these regions are, although there are some

prediction algorithms. So there are studies,

and subsequent ones have indicated that roughly

10% of the amino acids at the interface

are the ones that have the biggest contribution. There are some trends, but

none of these are hard rules. These tend to be rich in

these three amino acids– tryptophan, arginine,

and tyrosine. If you might imagine,

these are regions of the protein that are

highly complimentary. So there’ll be a patch

on one side that’s a hotspot matching

up with another patch on the other protein

that’s also a hotspot. And it’s kind of

an interesting note that around these regions

where the hotspots occur, there are other amino

acids that exclude solvent from the interface. And they call that an o-ring. So these are some

of the features that tend to occur with

protein interfaces. So in this PRISM algorithm,

what they do is the following. They start off with a

template– two proteins that are known to interact– and

they define the interface simply by close approach of

amino acids in one chain to amino

acids in the other. So in this case,

shown in these balls are regions of the

proteins that interact. And then they isolate

the interfacial residues. Ignore the rest of the

protein, because we said that the parts that

interact in different proteins could be homologous even

if the global structures of the proteins are not, right? So we’re going to do our

structural similarity calculations purely on

the interface residues and not on the entire structure. So then with that

template, you can then look at lots of proteins

and see whether they have any structural match

to pieces that interact. So here they’ve identified

this protein, ASPP2, which has structural homology

to I kappa b at the interface. Although globally

it’s quite different. And now, once they have this

potential partner for NF kappa b, this ASPP2,

they’re going to test whether there’s a

good structural match, whether specifically

in the regions that are hotspots– they have

an algorithm for predicting hotspots– whether the

match is good, whether it’s sequence conservation

at those hotspots. And only then do they

do the refinement to do the flexible

refinement of the type that we looked at in the

previous lecture, energy minimization, and

other approaches to figure out what the

best possible structure of this complex

would be, and then what it’s free energy would be. So here’s their

description of the problem. They have template

proteins and targets. They do a structure alignment. They asked whether it

passes some thresholds. These are very, very

fast calculations to do. And only if they pass

these fast calculations do you do more

detailed calculations. And finally, only

if it passes this do you do the very

computationally expensive refinement. And then one critical thing to

remember from this algorithm is that it doesn’t require

the template and its query to be perfectly

matched in structure. In fact, the elements of the

structure at the interface could come from different

parts of the chain. So they don’t take into

account the chain order. So if I had a beta sheet

structure in one protein that looks like this, in my

query these two proteins could be very

indirectly connected. I don’t care that there’s a

huge gap in the insertion. I just care that locally

at the interface, one protein looks a

lot like the other. There was a question

in the back. AUDIENCE: How do you search

a database for 3d structures? Are you just looking

at all the [INAUDIBLE]? PROFESSOR: That’s right. So the question was,

how do you search a database for 3D structure? You do structural

similarity comparisons that are based on

the 3D coordinates. The simplest way to do it,

but not the most efficient, is to find the rigid-body

superpositions that minimize the root mean

squared deviation, which was a metric we gave in one

of the previous lectures. There are faster things

you can do as well. You could imagine that you could

look at certain global features of elements of secondary

structure and so on. And there’s been a lot of work

making those algorithms very fast. Other questions? Good question. So they give an

example in their papers that starting off with this

known structural complex, cyclin-dependent kinase, the

cyclin, and p27, the inhibitor. And then looking for

structural matches. So we can identify this

potential structure match. You refined it, get an

energy of interaction. Try another one that has no

global structural similarity. Again, once it passes

all the checks, you compute the

refinement and the energy. And similarly with this side. And so from this

initial complex, where we had these

two proteins which were known to interact in the

PDP they can make predictions that these other proteins are

likely to interact even though, again, at the global

level, there’s very little sequence similarity. Is that clear? OK. So the advantage of this

is that it eventually does do these structural

refinements that allow us to figure out

the best match between two potential interacting proteins. But that’s also its

weakness because that takes a lot of

computational time. So this other approach called

PrePPI never actually does those structural refinements

of the type we talked about in the previous lecture. So if so, how does it

figure out whether the two proteins are likely to interact? So this is their schematic,

and we’ll go through the steps. So you start off with

two query proteins that you want to know

if they interact. And you do sequence

similarity to a database of known structures. So you find sequence

homologues to those proteins. And so they call

those homology models. MA and MB. And now they look through the

database for all the structural homologues, not

sequence homologues, but structural

homologues of MA and MB. So they get a

series of neighbors that they call NA 1

through n and NB 1 to n. So these are the neighbors

of these homologues. And they asked whether

any of these neighbors, anything in this row,

anything in this row, are known to interact. And that potential

interaction then could be a model for the

interaction of the query, right? So far so good. Then they do a

sequence alignment. They sequence

alignment of MA and MB, which are the known structural

homologues of the queries, and the two proteins that

are known to interact. And so now they’ve got

this potential model for the interaction of the

queries made up of two proteins of known structure that have

homologues that are known to interact. OK? So it’s two steps removed

from the actual interaction. Now, while their

figure says that they do a structural

superposition, that’s not, in fact, what they do. If you look at it carefully,

it’s a sequence analysis. And I’ll take you through

the steps in a second. So they mean structured

in a rather loose way. So they’re only doing

sequence comparisons here. They’re never actually

building a homology model for the queries. OK So this figure comes

from the supplement where, for some

mysterious reason, they’ve changed all

the nomenclature. So things that previously

were called NA and NB have now been called TA and TB. Take what you get. So this is a pair of

interacting proteins where the structure of

the interaction is known. And they’re structural

neighbors of NA and NB, which you don’t know whether

they interact or not. They identify interacting

residues in this structure. That’s why it’s represented by

these black lines connecting blue dots. So these are

interacting residues from the two template proteins

and neighbors NA and NB. And they asked whether the

amino acids in MA and MB also are good matches

for this interface. And they have a number of

criteria for doing that. So they come up

with five measures. The first of those measures

is a structural similarity between these MA proteins and

the MA and MB and NA and NB. Then similarity– OK, similarity

is the structural similarity. Then they asked, how many of the

amino acids at this interface, and what fraction of the

amino acids at the interface can be aligned? So this is a sequence-based

alignment of MA and– well, it’s here called TA,

but was previously called MA. Just to make life complicated. So this is the

sequence-based alignment. These are they interacting

residues, all the blue ones in the structure of

TA and TB interacting. And they asked, what

fraction and what number of these amino acids are aligned

in this sequence alignment? So here they come

up with a number. In this case, I guess, it’s

four amino acids in this– four pairs, I should say, of the

amino acids– one, two, three, and four, indicated

by these four lines– are both interacting in the

structure of the complex and can be aligned to

sequences in MA and MB. And then they use

these other algorithms that are based primarily

on machine learning looking at protein

interfaces to decide whether the sequence of the

amino acids that are going to sit at those places

in the interface are likely to be residues that

typically occur at interfaces. So this is the

kind of statistics that I showed you before

from those old papers that said 10% of the amino acids

are in these hotspots. Certain kinds of amino

acids are predominant there. So the number of algorithms,

and they list a bunch, that they use to

come up with a score to decide whether these

residues, in fact, are statistically likely

to be good matches. So they have these criteria

and they decide then that some fraction of the amino

acids at this interface in MA and MB are likely to

be reasonable ones to be at the interface. So with all that

done, they then use all of these different scores

with a Bayesian classifier, and we’ll talk a

little bit later in this lecture and

probably the next lecture as well as to what a

Bayesian classifier is. But they plug all

those scores in that they’ve derived

from these proteins to decide whether

these two proteins are likely to interact. So the advantage

of this approach is it’s extremely fast. Everything we’ve talked

about are very, very quick calculations. Even the structural

alignments are fast. The sequence alignments,

of course, are. So we get through the whole

database very quickly. So they’ve actually computed the

potential attraction partners of every pair of proteins in

various genomes based solely on these alignments. The disadvantage– so what’s

the disadvantage of this method? AUDIENCE: Can’t get a

de novo interaction? PROFESSOR: We can’t get

any de novo interaction, so if there’s no neighboring

structures that interact, they’ll never come up with it. So that’s an important point. And then the other problem

is, because it doesn’t have the structural

refinement, it’s given up on that

slow calculation, so also loses a lot of

potential specificity. All the conformational

changes that can occur will be lost to an

algorithm like this. So we have these two

competing approaches. Yes, questions in the back. AUDIENCE: Couldn’t this method

actually be used as an input to, say, a refinement

step, for example? PROFESSOR: The

question was, could you use this kind of approach as an

input to the refinement step? And absolutely one could. Is there another

question back there? Other questions? All right. So we’re going to take a slight

turn here in the course lecture and move away from a purely

computational approach and actually look at how

interaction measurements are made. One of the big changes

of the last decade or so is that we’ve gone from an era

when interactions were measured pairwise to interactions

being measured in bulk. So through high

throughput measurements. And we’ll see that that leads

us to some statistical problems which eventually bring us back

to some computational issues as well. So if you want to measure

all the proteins that interact in an

organism, turns out to be, obviously,

very difficult. One big advance that’s

helped with this is the idea of tagging proteins

and using mass spectrometry to figure out what

they interact with. So in these two sets

of papers, which were some of the early

ones being done in yeast, they took one protein at a

time and attached a tag to it. And I’ll talk about exactly

what those tags are, but those are labels

that allow you to attach it to a solid support. And then by attaching

to a solid support, you could then

purify any proteins that stuck to protein one here. And then after you purify them,

you can run them out on a gel, cut them out, and

figure out what the identity of those

interacting proteins were by mass spec. So this sounds very

labor intensive, but it’s still a lot faster than

anything that came before it. And with this

approach, they were able to go through

entire genomes, proteomes I should

say, and figure out all the interacting partners

for very, very large fractions of all the proteins there. So with this approach,

what kinds of proteins do you think are likely

to be false positives? Any thoughts? Yes. AUDIENCE: Proteins

stuck on the column that has nothing to do with

interaction [INAUDIBLE]. PROFESSOR: Exactly. So one thing that can

be quite problematic are proteins that

stick to the column regardless of which

protein you put there. And we’ll see an approach

to getting rid of that. Other kinds of problems? A variant of that. Thoughts? What about proteins

that tend to stick to other proteins

non-specifically, right? Those are going to be

quite problematic too. And what are the

likely false negatives in an approach like this? The proteins that really do

interact with the blue one but aren’t picked up. Yes. AUDIENCE: Weak interaction

partners [INAUDIBLE] PROFESSOR: Weak interaction

partners, things, particularly with

short half lives. Because you do a lot

of washing, so it’s going to be dependent

on half-life. Very good. What else? Yeah. AUDIENCE: Maybe something

that interacts in tag region? PROFESSOR: Something interacts

in the tag region, right. So something

interacts right around here would be lost because this

would sterically interfere. Very good. Anything else? What about the

concentration of proteins. How does that influence

whether they show up here? All right. So if I have a very high

concentration protein, it may interact even though

naturally it doesn’t. They never see each other. They’re in different

compartments. But when [INAUDIBLE]

and do this. But low abundance

proteins are going to be quite problematic because

there’ll be very little of them in these complexes compared to

the high abundance proteins. It won’t be detected

by this method. They will never get to

the mass spec, and so on. So we’ve got both false

positives and false negatives in these approaches. Now, one of the

things that came up was proteins that stick

non-specifically to the column. And there was a

clever approach in one of these early papers that

got picked up to avoid that. And this is called tandem

affinity purification, or TAP-tags. And the idea is the following. We have some gene. And we use homologous

recombination– this was done in

yeast where this is easy– to insert

this sequence, which codes for the following. A piece of protein of

no particular function, as far as anyone

knows, a spacer, followed by this

calmodulin-binding protein, followed by a protease

recognition site, and then by protein A. So once this protein

gets expressed– and it gets expressed

in it’s native levels because you’re inserting

this into the genome. So it’s not on an

exogenous promoter. It’s in its normal position. Whatever that protein

was, then has it as C terminus all these pieces. So how does that help? In the purification, we start

with something, IgG IGG, that binds to protein

A. So now that’s what attaches us to

the solid support. And attached to

the solid support will be all those things

that are nonspecific binders. And so if I have some

nonspecific binder that just likes my solid

support, it’ll be here. Nonspecific. And if I just acid washed

everything off the column and ran my gels with that,

or boiled it off in SDS, I would get the

nonspecific protein too. But what they do instead

is they instead cleave here with a very specific protease

that recognizes this site. It’s called a tobacco

etch virus protease. It has a very long

recognition sequence. You can make sure it doesn’t cut

anywhere in any other protein. And so now, instead of

alluding non-specifically with acid or detergent, you

allude specifically with TEV, and then this part of the

protein will fall off. And then you do a

second purification that relies on this

piece of the protein. So you pull out only

the things that you want that have the CBP,

the calmodulin binding protein, by having different

kind of solid support that has calmodulin

attached to it. And so through this

process, you can get rid of a lot of nonspecific binders. It doesn’t help you with

the false negatives, right? You’ve made the wash

conditions even harsher so you’re going to

lose more proteins. But you’ll pick up

fewer false positives. And then finally, the last

purification procedure actually uses EGTA, which is

a chelating agent. So this interaction

between CBP and calmodulin depends on calcium. EGTA sucks the calcium

out of that interaction. And so it’s, again, a very

specific way of alluding rather nonspecific one, like heat,

salt, acid, or detergent. So this has been one technology,

affinity purification followed by mass spec, that’s

given us a lot of information on protein-protein interactions. And a computing

technology that’s also contributed quite a lot

is called yeast two-hybrid. So in this approach,

you have a reporter gene that normally is not

going to be transcribed. It has at a design DNA binding

site, a DNA binding protein, and your bait protein. And you want to figure

out every protein that can interact with this prey. So the prey now is attached

to an activation domain. If these two proteins

don’t interact, the activation domain never

gets recruited to this reporter, there’s no transcription. But if the green protein and

the blue protein interact, then the activation

domain is going to be recruited to

this promoter and it’s going to turn on transcription,

and then you’ll get a signal. So what are some of the

advantages of this approach? It doesn’t require you

to purify anything. So it should be

much more sensitive to low abundance proteins. So that’s definitely

an advantage. It’ll pick up a lot of those

transient interactions. You may not get

continuous activation, but you’ll get

transient activation. And if you’ve set the

conditions up properly, you can pick up the

transient activation. But it has its own biases,

so none of these techniques are going to be perfect. It’s going to be

biased against proteins that don’t express well. This is, as the name implies,

typically done in yeast. So if you have human

proteins and you express them in yeast, or plant proteins

that you express in yeast, there could be some proteins

that just will not express well in that organism. What else can be a problem? Some proteins don’t do

well in the nucleus, right? So if you’re interested

in interactions with membrane

proteins, it’s going to be very hard to get them

to express in the nucleus, and therefore, you’ll never

pick up those interactions. OK. So we’ve got these two

different technologies– the affinity capture mass

spec and the two-hybrid. Questions on those technologies? Yes. AUDIENCE: Could

another control be for the mass spec

purification just to subtract out everything

that alludes non-specifically. PROFESSOR: The question was,

could you subtract out anything that’s nonspecific. And yes, if you’ve

got what you might call frequent flyers,

proteins that show up in every single

purification, then you can simply ignore those. And that is often done. So that’ll help you

with things that are very nonspecific

for the surface. What’s more of a

problem are proteins that have some affinity

for your protein x but are not really

highly specific for it. So they tend to bind in

certain kinds of patches. Those would be

harder to figure out because they won’t

stick to everything. Good question. Other questions? All right. So we’ve got these

different technologies. What we’d really

like to be able do is we know that there are

problems in each approach. We’d like to be able to compute

the probability that two proteins interact

based on the data. So now we’re turning back to the

more mathematical computational approaches. So if we just consider

one experiment– and we’re going to talk about

gold standard. So what’s a gold standard? It’s a set of proteins that we

have extremely high confidence interact because it was analyzed

by some other technology. Not two-hybrid, non-affinity

capture mass spec, but much, much more direct interactions. By physical measurements,

maybe the structural work. So the number of

criteria that go into it. So we have this

gold standard data set where we know the proteins

definitely interact, and we have our experiment. So clearly anything

in the overlap, we can count as true

positives, right? We detected it. It’s in the database

of gold standards. And things that are in the

gold standard that we missed are obviously false negatives. We report them as

non-interacting, but in fact they do. The question is, how much

of this is true positive? Everything that’s detected

in the experiment but we have no information

for it in the database. So that could be for one

of two reasons, right? That could be that they

really don’t interact. Or it could be that

no one’s measured it. The whole point

of this experiment is to find new things. So is there any way to estimate

what fraction of all the things that are unique to this

experiment are true positives, and what fraction

are false positives? Those we’d like to

try to figure out. Now, if we just

had one experiment, that would be very challenging. But what happens when

we’ve got two experiments? So we have these two affinity

capture mass spec experiments, or maybe affinity capture

mass spec and a two-hybrid. So now let’s think about

the overlap of those two experiments with

the gold standard. So I’ve got this region of

overlap between experiment 1 and experiment 2,

and then this region that’s overlapping

between all three things. Experiment 1, experiment

2, and the gold standard. So these clearly are

two positives, right? They’re high confidence

because I picked them up in both experiments, and

they’re in the gold standard. What about all these things

in what I’ve labeled here region 2? Well, if we believe that

these two experiments are independent of each

other in a rigorous way– so let’s say one’s a

two-hybrid and one’s an affinity capture mass spec,

there’s no particular reason that the false

positives for one would be false positives in the other. In that case, I can

call this region 2 my consensus true positives. I have a very high

confidence that these are true interactors. Everyone buy that? Seem reasonable? OK. So here’s where

the trick comes in. What fraction of all these

consensus true positives are picked up in

the gold standard? This ratio, right? Region 1 over region 2. OK. So now I’ve got this region

of things that are picked up– the true positives from

this experiment, then the gold standard. And then I’ve got this region

that’s unique to experiment 2 and it’s going to be some

mix of true positives and false positives. And the authors of this

paper that are cited here make the following argument. We’re going to assume

that the ratio of I to II is the same as the

ratio of III to IV. So the fraction of

consensus true positives that are picked– these are

independent experiments. So the fraction

of true positives that are picked up

in the gold standard is going to be constant,

whether they’re in the consensus or not. So the fraction at

ratio of I to II is going to be the same

as the ratio of III to IV. So by that then, I can figure

out how much of this region consists of true

positives and how much consists of false positives. Everyone buy that? Yeah. AUDIENCE: Can I check–

are we not saying that the gold standard

represents all true positives? PROFESSOR: Correct. Well, we’re saying that the

gold standard consists of things that we know to interact– AUDIENCE: But there may be more. PROFESSOR: But

there may be more. And the goal of our experiment

is to find those other ones. All right. So if you accept that premise,

which seems plausible, then you can compute what

fraction of all the things that are picked up in

each of these experiments are likely to be true positives. So drum roll please. It turns out that the

number’s not that high. So the fraction of

things in the consensus was 347 out of almost 2000. And if you do the math

then, what you end up with is that the true

fraction in this region, for which we have no

data, is 1,123 out of– and the false piece in this

is going to be almost 15,000. And they went ahead and

did this for a number of different

experiments and computed the fraction of derived false

positives for these data– might be a little bit hard

to see on this screen. But the numbers range

from 50% false positives to, in some cases, over

90% false positives. That’s a little

disturbing, right? So these technologies are good

at picking up interactions, but there’s reason

to be very skeptical. OK. So now we’ve got

a serious problem, because how are we

going to figure out which of these interactions

to trust when we know that a very, very large fraction

of them are false positives? So what could you do? Well, you could take only

the little bit of overlap. You could say, I have that Venn

diagram– method 1, method 2. They did agree on

a bunch of things. So I could take only those. That obviously

throws away a lot. Someone else suggested we could

throw away the sticky proteins, right? So maybe there are

nonspecific proteins that don’t show up

in every experiment, but they show up in a

very, very large fraction of all experiments. Maybe I toss those out. That’s another possibility. But what we really

want to do is actually come up with a

probability estimate. To not have to make

a hard decision, but come up with an

estimate of the probability that things interact

based on all the data. So how do we go

about doing that? So first of all, what happens

if you just require a consensus? So this plot shows

accuracy and coverage of the gold standard for

individual experiments with different thresholds for

deciding what’s interacting, different cutoffs and things. So the individual

experiments are shown here. And then if you

acquire two methods to pick something up, or three

methods to pick something up, you can get better and

better in your accuracy. This is a log-log plot. So if you require

three methods to agree before you call something

a true positive, you can get up to– I’m not

sure exactly what this is, but 80%, 90% possibly. Right? But look at where

you at the y-axis. You’d only get

about less than 1% coverage of the gold standard. So that’s not a great approach. So what we really

want to do, as I said, is to try to estimate the

probability that proteins interact given all of

our available data. And the data could be

specific experiments. Say the two different

mass spec experiments we just referred to. Or as we’ll see a

little bit later in this lecture and possibly

the next one, other kinds of extraneous data that are not

direct physical measurements of interaction, but

might give us confidence that things interact based

on similarity in annotation, or similarity in gene

expression, and so on. And we’ll get into

details of that. OK. So to do this, we need

to have a little bit of a refresher on

Bayesian statistics. So I want to measure

the probability that an interaction is true

given the available data. Right? And I can estimate that based

on the probability of observing the data for things

that I know to be true and these prior estimates. So what’s the prior probability

that an interaction is true and the prior probability of

observing a particular data set. Now, this by itself isn’t

really that helpful. I haven’t told you yet how

to calculate any of the terms on the right. But bear with me. If I want to decide

the likelihood that a protein interacts–

how likely is it? Is it more likely that

it interacts or not? I can compute this ratio. The probability

that the interaction is true given the data

over the probability an interaction is

false given the data. That’s the likelihood ratio. So by this formula, I then

cancel out this probability of the data, the prior

probability of the data. And if I had a way

of calculating this, and we’ll get to it in

a second, then if it’s more likely than not to

be a true interaction, I can call it an interaction,

right, if it’s less likely. So if this ratio

is greater than 1, I accept it as a

true interaction. If this ratio is less

than 1, then I reject it. OK. So now our challenge

is to figure out how to compute these terms. One more thing to note

is if all I want to do is be able to rank every

interaction by this likelihood ratio, rather than coming

up with a hard threshold, then I actually don’t

need all these terms. So this is the likelihood ratio. I can convert it to a log space. So it’s going to be the

sum of these two terms. And if I’m simply

ranking everything by this log likelihood

ratio, this term is the same for

every interaction. It’s just composed of

prior probabilities. So it’s not going to

affect the ranking at all. Any questions on that? Is that clear? Good. So if I just want to come

up with a ranking function, all I need to do–

all– I need to do is to be able to estimate

the probability of observing data for true interactions and

the probability of observing that set of data for

false interactions. Everybody buy that? Yes, please. AUDIENCE: When you say

that prior probability is the same for all

interactions, we’re saying we’re assuming the same

prior probability for all, or is this [INAUDIBLE]? PROFESSOR: That’s

its definition. We mean, what is the prior

probability that proteins interact versus the

prior probability? So it’s independent of the

proteins that we’re looking at. Other questions? All right. So we need a way of

computing this piece of all the things

we’ve looked at before. So how do we get an estimate

of the probability observing a particular

configuration of the data? Meaning, I detect

it in experiment 1 and not in experiment

2, but in experiment 3. What’s the probability of that

given it’s a true interaction? So that’s what we’re going

to dive into right now. OK. So one thing we could

do to make life simpler, and then we’ll remove

this simplification later, but let’s, for the time being,

assume that all of my data are independent. So the two-hybrid is going

to have completely different mistakes than the affinity

capture mass spec. So those two data

sets are going to be completely independent

of each other. So I can write this as a product

of a particular observation– a particular mass

spec experiment and a particular two-hybrid

experiment for true attractions and false interactions. So it’s the product

of the probability that a particular experiment

would detect an interaction if the interaction is

true over the probability that that particular

experiment would detect it if there was no interaction. I’m just going to multiply

all of those probabilities. Yes. AUDIENCE: [INAUDIBLE]. This is one interaction pair? PROFESSOR: That’s right. AUDIENCE: And you

take the product over all the interaction

pairs within one run of the experiment. Is that correct? PROFESSOR: If I

want to determine whether a particular

interaction pair– I want to compute

this log likelihood ratio, or this,

actually, ranking ratio, because I’ve thrown

away the priors. I want to compute this ranking

ratio for a particular pair. So I’ve got protein

A and protein B. And I want to determine

whether I believe it to be more likely

to interact or not, and rank it with all

the others, right? So I’m doing this for

a pair of proteins now. So far so good? Now, for that pair

of proteins, I have a series of observations,

or lack of observations, right? I have a whole bunch

of experiments. This experiment detected

it, that experiment didn’t detect it, this one did. So what’s the probability

of these proteins– these A and B really interact

given that yes, no, yes in my experiments? And then for new protein,

it might be no, no, yes, and what I want to figure out

the probability for this pair. AUDIENCE: So is the scale

of the big letter M, is it on the order of like 10

experiments, 100 experiments, or thousands of experiments? PROFESSOR: Ah. So the question is,

what’s the scale of this. So obviously, that’s going to

depend on what kind of data I bring in, but in

these cases, it’s small. So we have a handful of these

high throughput experiments over entire genomes

and proteomes. So there’s not to be a lot. So in some of

these early papers, there were four

interaction experiments that they were looking at. Now the numbers might

be a little bit bigger, but not significantly greater. All right. So now to compute this, we

need a set of gold standards. But now we don’t just need gold

standard positive interactions, proteins that we know

really do interact. We also need proteins that we

know really don’t interact. Because I want to compute the

probability of an observation given that some interaction

is definitely wrong. So precisely how I

compute these terms is going to depend

on the kinds of data. The experiments I’ve

just been talking about, these high throughput

mass spec, which were the ones which we looked

at the ratio of the consensus, true positives, and estimated

that 96% of all the data were possibly in error. The details of how to do

those calculations are here. I leave you to look that

up if you’re interested. But now what we’re

going to do is we’re going to see how, if

we were to rank interactions based on this term,

we can avoid having to throw out most of our data. So we said if we require all

the experiments to agree, we’re going to have

very, very low coverage. Now we’re instead going

to rank everything based on this likelihood

ratio, or something derived from the

likelihood ratio. So in this paper

where they were simply looking at the

protein-protein interaction data sets to compute

these interactions, they ranked everything based on

that ranking function we just described. And then as you

vary your threshold, you can figure out how many

true positives you have and how many false positives

you have in the gold standard. True interactors and

false interactors. And you can compute

this curve, right? For any particular value

of that ranking ratio, what’s my sensitivity and

what’s my specificity? Are you clear what

this plot means? And here they’ve

plotted the values for individual experiments. And this is the value for

an independent database of gold standard interactions. And so now, where

do they come up with their true positives

and their false positives? A lot of this is going to depend

on how representative those are. And all these numbers

are subject to revision if you decide that the true

positives and false positives that people are using

are not accurate enough. So they used two well annotated

databases of interactions. One from MIPS and one from SGD. And you can play those

off against each other as the database

of true positives. In some ways, that’s

the easier thing because people like to report

that proteins interact. They tend not to like to report

the proteins don’t interact. You don’t see a lot of

nature papers saying protein x doesn’t interact

with protein y. So how are you going

to figure out, then, what are your true negatives? So the strategies

that they used– well, one possibility is they’re

annotated to be in complexes, and those complexes are

different from each other. That’s not bad, right? But it’s not a guarantee either. Or this is a little bit better. They’re annotated to be in

different parts of the cell. Of course, if those

annotations aren’t perfect, low concentrations, you

could still be wrong. Or that they have

anti-correlated gene expression. I kind of like this one. So it’s one thing to be not

correlated, but if you’re anti-correlated, seems

pretty suggestive that these two proteins are

never in a complex together. Again, it’s no guarantee

because, as we’ll talk about in some detail later,

RNA levels are not very good predictors

of protein levels. But if you apply enough

of these criteria, you can come up with

a set of proteins that you have fairly

high confidence really don’t interact. You combine that with

the databases of proteins with very high confidence

that they do interact, and you can get the true

positives and false positives that you need for this analysis. all right. So that’s a way of

combining some information. We’re going to see a

generalization of that called Bayesian networks. We’ve mentioned this

already in at least two different

contexts, and it’ll come up again later

in the course as well. So these are very

general methods for reasoning probabilistically. We will see them

in the context here of predicting interactions. We’ll see them later in the

context of gene regulation and signaling as well. What we fundamentally need

to do a Bayesian network is a graphical structure that

represents our understanding what the relationship is

between causes and effects. And a set of

probabilities that allow us to compute things

on this network. We’ll show you examples where

those networks are derived from our prior understanding

of the problem, but also ones where the

structure of the network is learned from the data. And we’re going to see

two primary contexts. First we have this question

of whether proteins interact. That’s what we’ve just

been talking about. So here are four experiments,

the in vitro pulldown experiments and yeast

two-hybrid experiments, that give us relatively

independent information about whether proteins interact. And we’re going to

look at a paper that used those data with

a Bayesian network to compute the probability that

two proteins really do interact based on the combination

of all the data, rather than throwing out

anything that doesn’t fall in the overlap, which could

be a very, very small number. And then later on

we’ll see examples of using Bayesian networks to

understand biological networks. So this might be a set

of transcription factors that are regulating a set of

differentially expressed genes. And the structure of

the graphical network for a Bayesian network

has a lot of similarities to the way we normally

think about transcriptional regulatory networks. So there’s sort of a

natural way of transferring our regulatory problem into

a graphical network problem. But we’re going to focus

on these prediction problems for protein-protein

interactions first. Now, if I just want to compute

the probability of detecting an interaction in various

experiments, given that it’s true or false, I

could explicitly compute that probability. And we saw examples

of that just now. But some of these

Bayesian network problems become much, much

too large to do that. This is a little tiny

piece of a Bayesian network that is supposed to

represent I believe it’s transcriptional

regulatory network. You could never possibly

write down all of the terms in this probability, where every

node could, in principle depend on every other node

in the network. It would just be a

ridiculously large problem. In fact, how large would it be

if I’ve got N binary variables, my gene is on or off, my

interaction is true or false, I have 2 to the N

possible states? Right? And the only constraint

I have, in principle, is that all the probabilities

have to add up to one. So I have 2 to the N minus 1. 2 to the N minus 1 possible

variables that I need to set. So that’s a ridiculously

large number in most contexts. So how do Bayesian networks

help us solve this problem? Well, we represent

our understanding of the problem in a

graphical structure where we have

causes and effects. And there’ll be a direct arrow

from a cause to an effect. I don’t always know the cause. So in our context,

we were trying to figure out whether

two proteins interact. What do we measure? We actually don’t

measure interactions. We measure the result of a

particular experiment, which is a combination of

whether interacted and all sorts of noise

that we’ve just discussed. So the effects that we observe

are detected in experiment one or detected in experiment two. The cause is, did

it interact or not? So the cause is hidden,

the effects are observed. Now, in the case we

were looking at before, we treated all

these probabilities as being independent. But we might know something

about the structure of our experiments, the kinds

of experiments we’re doing, that might lead us to have

a different structure. So we could have an

interaction that gives rise to all different kinds of data. But depending on whether

the protein’s a membrane protein or highly

expressed, it might influence the results

of certain experiments and not influence the

results of others, right? So like a two-hybrid

would be very biased by which one of these? The membrane, right? And then the affinity

capture mass spec could be very

influenced by proteins that are expressed at very

high levels or very low levels. If we assume that all the

interactions are independent, then we multiply probabilities. And we’ll go into

more detail, but this is what we’re looking

at up until now. In cases where we believe that

all the observations are not independent, then

we’re not going to simply multiply things. We’ll see there’s

a more precise way of computing the probabilities. Now in this case, I’ve drawn

the graphical structure because I believe that

I know what’s going on. But in the more general

case that we’ll look at, we’ll actually derive the

structure from the data. One of the nice things

about Bayesian networks is that it removes the

need to have all 2 to the N minus 1 possible parameters,

because it tells us there are certain

independence conditions. So node is independent of its

ancestors given its parents. What does that mean? If I’m trying to reason

about the expression of one of the genes down here, and I

know that this transcription factor is on, I

don’t really care what the probability is

that any particular parent of that transcription

factor is on, right? So I don’t need to know anything

of transcription factor B1 if I know the state of B2. If this is on, then

that’s the only thing that’s going to affect whether

it’s turning on these genes, regardless of what the

activation state of its parent was. Is that clear? Yes. AUDIENCE: The

slide’s saying TF B1. [INAUDIBLE] TF B2? It says TF A1. PROFESSOR: Yeah, sorry. That should say TF B1. Thank you. OK. So we’ll do a little example. It’s admission season

both for graduate school and undergraduate. So let’s do a little

toy example where we’re going to get rid of

the admissions committees and just do

automated admissions. So we’re going to collect

various data about students, and then we’re going to

build a Bayesian network. And that network

is going to decide whether to admit students

into this simplified version. And the only information that

will go into our decision will be the grades on the

transcript and the GREs. Hopefully that’s not the case. And we believe

that certain things influenced your

grades and your GREs. Whether or not the

student is smart certainly should

have some influence, but also the great

inflation at their school will have some influence. So a prediction problem

in a Bayesian network is going from the

causes to the effects. So if I want to predict

whether a student’s admitted, I only need to look upstream. So we want to predict– we

observe the things on the top. Say, grades and

GREs, and we want to predict whether this student

should be admitted or not. There’s another problem called

an inference problem, which is when we observe

the effect and we want to make inferences

about the causes. So an example of that would

be, you apply for an internship and they say, oh,

she’s a student at MIT. I bet she’s smart. Right? They’re doing an

inference problem. We’ll leave it for you to decide

whether you and your colleagues are as smart as everyone

thinks, but hopefully you are. OK. So we’ve got these two

different kinds of problems. We’ve got prediction

problems from top to bottom, and inference problems

from bottom to top. And we’re going to talk about

conditional probability. So if I’ve got some very

small piece of this network with just two nodes,

I could write out all the possible probabilities

for any pair of those nodes. So the probability that

a student is not smart given that that student has

low grades, the probability that the student is not smart

given that the student has good grades, and so on, for all

possible pairwise comparisons. Or I could write this as a

conditional probability, which tends to be an easier way

to think about the problem. What’s the conditional

probability of a student being smart given that

they’ve got good grades or given that they

have bad grades? They have the same information. For this one, I need

additional information about the total probability of

students being smart or not. And the total number of

variables, as I said, in either case is the same. So these are completely

interchangeable, but it’s a lot easier to reason

with conditional probabilities than with the joint

probability tables. Those we’ll see in a second. So as I’ve said, you don’t

need a full probability table for a Bayesian network. You don’t need two N to

the minus 1 variables. And the fundamental

reason for that is that the joint

probability is only going to depend on the parents. So in this toy example,

the GRE scores over here are not dependent

on grade inflation. Now, that all

hopefully makes sense. Questions? Bayesian networks get

a little murky next, so I’m going to try to

give you into– oh, yes. Question, please. AUDIENCE: You said that

the parents don’t affect their children, but if grade

inflation affects the grades, how does that

influence– will that influence the grade [INAUDIBLE]? PROFESSOR: Sorry, can you

say the question again? AUDIENCE: I guess

I’m just confused by this particular example. What do you mean by

the joint probability? The joint probability of what? PROFESSOR: So if I

want to figure out the probability of some

particular configuration of all the nodes in my network,

I don’t necessarily need to consider

all possibilities. Because for example,

if I want to consider all of the joint

probability samples with settings for the GREs,

whether the student had good GRE scores

or not, that’s not going be influenced by the

student’s school’s grade inflation policies. AUDIENCE: But wouldn’t the

grades be influenced by the– PROFESSOR: But the

grades would be. That’s right. So some of the

variables I can remove and others– some of the

joint probability statements I don’t need to worry

about and others I do. And which ones I

need to consider is determined by

the graph structure. Yes. AUDIENCE: How is the graph

structure determined? PROFESSOR: OK. So how is the graph

structure determined? So it’s determined

in one of two ways. I can draw it in advance because

I believe that I know something about my setting, I believe

that these data are independent. Then it has that

structure like this. Cause and a bunch of

independent effects. Or perhaps I claim to know that

actually two of these things have a common parent as well. In some cases I know. We’ll also talk about how

to learn the structure from the data, which is

the more common setting in regulatory networks. So in these kinds

of problems when trying to decide

how to integrate different proteomic data

sets, typically people make arbitrary decisions

about what the structure is based on their

knowledge of the system. But if you’re trying to figure

out de novo which proteins interact with which, which

proteins regulate which genes, then you have to learn

it from the data. And we’ll talk about how

to do that in a second. Great questions. Any other questions? Anything in the quiet

half of the room? OK. So as I said, this

part of it, I think you can usually

come up with cases that give you fairly

good intuition. One of the things that is true

in these Bayesian networks which most people find a

little bit surprising at first is something called

explaining away. So let’s look at this

Bayesian network. I go outside and I

detect that things are slippery on the grass. So that could be for

a lot of reasons, but one possible reason

is that the grass is wet. OK. What are the causes of

the grass being wet? Well, it could have

rained or the sprinklers might have been on. And depending on this

as an example– so a lot of the Bayesian networks

were developed in Stanford by Judea Pearl and colleagues. And of course, in California

it doesn’t rain that often. So there the season is a strong

determiner of these things. Not so much around here. So in this example

that they like to do, so does the

probability that it’s raining depend on whether

the sprinkler is on or not? Now, the answer

should be no, right? I mean, in reality, when

you think about– there’s no causal relationship

between the sprinkler being on and the rain. But in fact, when we’re

reasoning over these networks, we actually are influenced. In a probabilistic model,

if I know that it’s raining, and I know the grass

is wet, then what do I think about the

sprinkler being on? Do I think it’s just as likely? No, I think it’s

less likely, right? If I go outside and see the

grass is wet, there are clouds, the rain is coming

down, is the sprinkler likely to be on or not? It’s likely to be off, right? So there’s no

causal relationship, but there’s the probabilistic

relationship through the graph structure. And that’s called

explaining away. And you can take a whole

course on how to understand which relationships you

can detect and which not. This is not the place

to try to go into that, but I hope you’ll be

familiar with this problem. And I’ll try to give

you a toy example that makes it a little bit

more obvious in terms of the equations

where this comes from. So imagine this very silly game

where we play, we toss coins. We toss a coin twice. And if it turns up heads

both times, you get a point. If it turns up tails both

times, you get a point. But if one’s a head and one’s a

tail, you don’t get any points. Now, does the probability that I

tossed a head on the first time depend on whether I toss

a tail on the second time? So causally,

obviously not, right? First of all, it

happened earlier in time. And secondly, the coin tosses

are completely independent. But what happens when

I know the outcome? What if I know

what score you got? So if I know your score,

then is the probability that I tossed the

heads on the first time independent of whether I got

a tail on the second time? What do you think? How many people think

it is independent then? How many people think

it’s not independent. Very good. It’s not independent. And obviously, here’s

the math to prove it, but your intuition

does the same thing. So what’s the probability

that I tossed a head on the second time

given that I got a one, I scored, and I tossed a

tail on the first time? Obviously, it’s zero, right? So here’s the

probability of getting a head in the first

time and scoring one, and tails on the second

time is exactly zero. So that’s called

explaining away. You can reduce your

belief in certain parents based on what you know

about the children. Think of this coin toss

example or the rain in California and

the sprinklers. All right. So as this come

up several times, how do we obtain the

Bayesian network structure? There are two problems that

we need to be able to solve. We need to be able to

learn the structure, and we need to be able to

learn these probability tables. If we know structure, how

do we get the probabilities? Well, we need to identify

some objective function we’re going to try to optimize,

and then choose values for all probability

distributions that optimize that

objective function. And that’s the

kind of thing we’ve been doing all along, just

like in the Gibbs sampler. We need some objective

function or protein structure. We need some objective

function that we’re going to try to optimize. So there are two common

ones that are used a lot. There’s maximum likelihood

and the maximum posterior. So maximum likelihood is defined

as the set of param– theta is all the parameters, all

the probability distributions, the probability of getting a

score of one given that you had heads and tails,

whatever it may be. The probability of

getting admitted given that you had certain

GREs and certain grades. So we want to find

the set of parameters, all those probability

distributions, that maximize this. The probability of the

data, our training data, given those parameters. That’s a pretty obvious one. And the maximum posterior

includes some of our beliefs about the prior

probability of the data and the prior probability

of the parameters. This is a little

bit less intuitive because you have

to ask, well, where do those numbers come from? And that, again, is a

whole course unto itself. OK. Now, how do you find

these parameters? Again, it’s the kinds

of search problems that we’ve looked at before,

various kinds of hill climbing. So gradient descent,

expectation maximization, Gibbs sampling, which

you’ve looked at explicitly. And again, the full

details of how to do that are outside of our scope today. OK. So in our example of

this coin toss game, we would use one of

these two functions to try to decide what’s

the probability of getting heads or tails for

any given score. That’s what the kinds

of parameters are. Now, the structure

problem actually turns out to be

really, really hard, because there are a very

exponentially large number of potential structures

to draw from. And unless you’ve got

some prior knowledge, it can be impossible, depending

on how much data you have, to actually build

this structure. So there are many algorithms

that have been proposed. And a lot of our

settings, we’re going to use some kind

of prior knowledge to reduce the search space. So if we’re trying to talk

about transcriptional regulatory networks, it’s very common

to assume that there are only some kinds of nodes that can be

causes and other kinds of nodes that can be effects, right? So in gene expression

it would be effect, and then you would

limit your causes to only be

transcription factors. It would generally be signaling

molecules or something like that, and not allow all

20,000 genes to be causes and all 20,000

genes to be effects. So there are lot of

resources to learn more about Bayesian networks. As I said, you can have

whole courses on this. I think there are a lot of

good tutorials at this website. I’ve also put in the notes

a little toy example for you to work through all the

probabilities, which I think, in the interest of time, we

won’t go through in detail. All right. So to motivate what we’re going

to do in the next lecture, I just want to talk

about other kinds of data that you could bring to bear

on this problem of predicting which proteins interact. We’ll see, then,

how that gets fed into an interaction

Bayesian network to make the predictions. So we’ve talked about affinity

capture and two-hybrid, but what other

kinds of data could we use to predict the

probability interaction? Well, one thing you could use

would be gene expression data. And the idea is that if

two proteins interact, they should be present in the

cell at the same time, right? So we talked about

this a little bit. If they’re

anti-correlated, it seems very unlikely they interact. What about if they’re

correlated, but not perfectly correlated? So here’s a plot that shows

a histogram of proteins that are known to interact, proteins

that are known not to interact. So empty circles are known

interacting proteins, the dark circles are

non-interacting proteins, and the other ones are based

on the experimental data. And the distance here

is the difference between expression profiles. And we’ll talk in coming lecture

about exactly how to compute distance between

expression profiles. But the further to the right

it is, the less similar the expression profiles

are across large data sets. So what you see

is the interacting proteins tend to be shifted

more to the left, more similar expression profiles

than the non-interacting ones. But what do you

notice about this? There’s no way to

draw a line and say, everything to the right

of this is in one class and everything to the

left is another, right? So by itself, it’s not

going to get us very far. There are plenty of

non-interacting proteins that have very highly correlated

gene expression and plenty of interacting proteins

that have poorly correlated gene expression. So it’s a trend, not a rule. Now, what about evolution? So if I look over many, many

organisms, I might expect what? The proteins that

interact with each other are going to appear in

the same species, right? So let’s look at

these two cases. We’ve got a bunch of–

eight different genomes. And I’ve got gene 1 and gene 2,

which I suspect might interact, and gene 3 and gene 4, which

I suspect might interact. Now, looking at

these two patterns of evolution, which one

do we have more confidence in that it interacts? The red one or the green one? So what do we notice about

the difference between them? What’s true of the red one

compared to the green one? Yeah. AUDIENCE: The red one is only

in one branch of the tree. PROFESSOR: The red one is

only one branch in the tree and the green one

is scattered across. So let’s take a vote. Do we believe that

the red one is better evidence of

interaction or the green one is better evidence

of interaction? Red? Green? Can I have an advocate of green. Someone explain their rationale? Anyone in the quiet

side of the room? All right, Ed. AUDIENCE: Because red is only

on one branch of the tree, I’d expect that

they’re naturally more correlated with each other. They have less–

they appear together in [INAUDIBLE] so I’d

expect [INAUDIBLE]. PROFESSOR: OK. So the argument is that red only

occurs in one part of the tree. And so there could be a

very simple explanation for all the reds being in one

part of the tree and one not, which would be a single

loss and gain event. Right? Somewhere early

on, perhaps here, I gain those two proteins. And then they’re inherited

throughout the genome, like most of genes get

inherited throughout the genome. Whereas here, we’ve

got independent events of gain and loss. And at each one of these

independent events, we’re getting them

moving jointly, either in or out of the genome. So there’s more

evidence for green to be interacting than red. Everyone buy that? Even some of the

advocates of red? Questions? Yes. AUDIENCE: Could there be a

way of either objectively or mathematically

[INAUDIBLE] that way, or is it just the

reasoning [INAUDIBLE]? PROFESSOR: One can do

the statistics on it with known ones, right? I think that’s

probably the best way. And we’ll actually see that

in one of these papers that uses– well,

actually, now I don’t recall whether they

use this co-evolution. But yeah, there are

plenty of papers that actually have done

the statistics on that. So it is supported. And a related kind

of question is what’s called the

Rosetta Stone approach. Unfortunately, of

the term Rosetta gets used far too much

in computational biology. So this has nothing to

do with the other Rosetta that we’ve been talking about. And this has to do

with how often you find the same pair of genes in

the same genome versus split up in different genomes. OK. So what we’re going to

look at next time then is an approach that

combines these kinds of data with the protein interaction

physical measurements through the two-hybrid and the

affinity capture mass spec that actually uses the Bayesian

networks we talked about this time to predict

whether two proteins are likely to interact based on

all of the available data. These evolutionary arguments,

the [? sentiality ?] arguments, and then the interaction data. Any final questions? OK, see you next time.

## 4 Replies to “14. Predicting Protein Interactions”

Thanks for sharing this. Large scale identification of PPIs generated hundreds of thousands interactions, which were collected together in specialized biological databases that are continuously updated in order to provide complete interactomes. The first of these databases was the Database of Interacting Proteins (DIP).

13:58 // Pg.23

How can we determine the delta G value? Could you give us a link or so? I assume we are not literally calculating the value based on thermodynamic data but there should be a software that does that. Could you help me how to figure del G value? Thanks

40:09 // pg. 60

How does that prior probability is the same for all interactions?? I thought differently. If P(true_PPI) is 0.6, wouldn't that P(false_PPI) is 0.4? Whatever probability we have for the true_PPI, wouldn't the false probability be 1-P(true)? Could anyone help me out? Thanks

Hey, team, I have just started a youtube channel about my PhD. I have just started so the idea of the channel is for you guys to follow along with me as I complete my PhD. My latest video is about how having dyslexia and how it affecting my grades and academia!