3. Global Alignment of Protein Sequences (NW, SW, PAM, BLOSUM)

The following
content is provided under a Creative
Commons license. Your support will help MIT
OpenCourseWare continue to offer high quality
educational resources for free. To make a donation or
view additional materials from hundreds of
MIT courses, visit ocw.mit.edu. PROFESSOR: All right,
so let’s get started. So today we’re going to
review local alignment, which we talked about last time, and
introduce global alignment, also talking about
issues related to protein sequences,
which include some more interesting
scoring matrices. So just some info on topic
one, which we are still in. So I will have an
overview slide. It’ll have a blue
background, and there will be a review slide
with a purple background in every lecture. So last time we talked about
local alignment and some of the statistics
associated with that, and also a little bit about
sequencing technologies, both conventional
Sanger DNA sequencing as well as second
generation sequencing. And at the beginning of the
local alignment section, we introduced a simple
BLAST-like algorithm, and then we talked about
statistics, target frequencies, mismatched penalties,
that sort of thing. So there were a couple questions
at the end which I just wanted to briefly answer. So I believe it was Joe asked
about how the dye is attached to the dNTP in dye
terminator sequencing. And it appears
that it’s attached to the base, sort of the
backside of the base, not the Watson-Crick
face, obviously. That seems to be the
common way that it’s done. And then there was
another question from somebody in the back. I don’t remember who asked about
when you’re making libraries, how do you make sure that
each of your insert sequences has the two different adapters,
one adapter on one side and the other adapter
on the other side? And there are at least
three ways to do this. So simplest is in
RNA ligation, when you take advantage of the
different chemistry at the five prime and three prime
ends of the small RNA that you’re trying to clone. So you just use the
phosphate and the hydroxyl to ligate two
different adapters. Another more complicated way
occurs in ribosome footprint profiling, which is a method for
mapping the precise locations of ribosomes along
mRNAs, and involves polyA tailing, and then
introducing the adapters together, the two adapters,
with a polyT primer that primes off the polyA tail. And then you circularize, and
then you PCR off the circles. And it’s a little
bit complicated, but you can look it up
in the reference that’s up here on the slide. It’s working now. And then, finally,
the way that’s actually most commonly
used for protocols like RNA seq and
genomic DNA sequencing is that after you make
your double strand DNA, there’s an enzyme that adds
a single A to the three prime end of each strand. So now you have a
symmetrical molecule. But then you add these funny
Y-shaped adapters that have an overhanging T on,
say, the red guy here. And so what will happen is
that each of these Y’s can be ligated here. But each of the inserts,
independent of which strand it is, will have a red
adapter at the five prime end and a blue adapter at
the three prime end. Any questions
about this or about sequencing technologies before
we go to local alignments? OK, good. It was a good question,
and that’s the answer. So we motivated our
discussion of local alignments last time by talking
about this example, where you have a non-coding
RNA that you found in human. You BLAST it against mouse,
and you get this alignment. Is this significant? So is this really likely to
be a homologous sequence? And how do you find
the alignments? And so we said
that, well, there’s this theory that’s
exact, at least exact in the asymptotic sense
for large query and database sizes that tells us the
statistical significance of the highest scoring
ungapped local alignment. And it’s given by
this formula here, which is the extreme value
or Gumbel distribution. And then we talked
about the constraints or the expected score
has to be negative, but positive scores have to
be possible for this theory to work. And we also talked
about an algorithm. But if you remember, the
algorithm was very simple. It involved– this
is zero– keeping track of the cumulative score. So we have a mismatch and a
match, mismatch, mismatch, mismatch, match, match, match. That is a high scoring
segment, et cetera. So you keep track of the lowest
point you’ve ever been to as well as the current score. And when the current score
exceeds that lowest point you’ve ever been to by more
than we’ve ever seen before, more than this, then that’s
your high scoring segment. Now it turns out, if this
is not intuitive to you, there’s another
algorithm which I find, personally, more intuitive. So I just want to tell you
about that one as well. And it’s basically
the same thing, except whenever you go
negative, you reset to zero. So here, we were
going to go negative, so we just reset to zero. That was on this mismatch here. Then we have a match. Now we’re at plus 1. That’s fine. Now we have a mismatch. Now we’re down to zero. We don’t need to do anything. Now we have another mismatch. Here, we’re still at zero. Remember, we just stay at zero. We were going to go negative,
but we stayed at zero. Another mismatch, we
still stay at zero. And now we have these
three matches in a row. My line is not
staying very flat. But this should’ve
been here flat at zero. The point is that now the
highest scoring segment is the highest point
you ever reach. So it’s very simple. So this is actually slightly
easier to implement. And that’s sort
of a little trick. So for local alignments,
you can often reset to zero. Any questions about that?
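By the way, here is what that reset-to-zero scan looks like written out– a minimal sketch in Python, with made-up plus 1/minus 1 match and mismatch scores rather than anything from the slides:

```python
def best_segment_score(a, b, match=1, mismatch=-1):
    """Score of the highest-scoring ungapped segment, by the
    reset-to-zero trick: never let the running score go negative,
    and remember the best value it ever reaches."""
    running = best = 0
    for x, y in zip(a, b):
        running += match if x == y else mismatch
        running = max(running, 0)   # reset to zero instead of going negative
        best = max(best, running)
    return best

print(best_segment_score("ACGTACGT", "ACGAACGT"))   # 6
```

So, we talked about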
computational efficiency, this big O notation,
where you consider the number of
individual computations that are required to run
an algorithm as a function of the size of the
input, basically the number of units
in the problem– base pairs, amino acid
residues, whatever. So computer scientists look
at the asymptotic worst case running time. That’s either because
they’re pessimistic, or perhaps because they
want to guarantee things. They want to say, it’s not
going to be worse than this. Maybe it’ll be faster,
and then you’ll be happy. But I can guarantee
you, it’s not going to be worse than this. And so in this case, the
algorithm we talked about was order m times n, where m and n are the lengths of
the two sequences. So toward the end
last time, we talked about this
lambda parameter and said that lambda is the
unique positive solution to this equation here,
where sij are the scores and pi and rj are the
nucleotide frequencies.
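To make that concrete, here is a little sketch that solves for lambda numerically by bisection. It assumes uniform base frequencies of 0.25 and a simple plus 1/m match/mismatch scoring system; those inputs are illustrative, not from the slides:

```python
import math

def solve_lambda(match, mismatch, hi=10.0):
    """Find the unique positive lambda satisfying
    sum_ij pi * rj * exp(lambda * sij) = 1, which with uniform base
    frequencies (0.25 each) reduces to
    0.25*exp(lambda*match) + 0.75*exp(lambda*mismatch) = 1."""
    def f(lam):
        return 0.25 * math.exp(lam * match) + 0.75 * math.exp(lam * mismatch) - 1
    lo = 1e-9   # f is negative just above zero, positive for large lambda
    for _ in range(100):            # plain bisection
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if f(mid) > 0 else (mid, hi)
    return (lo + hi) / 2

print(solve_lambda(1, -1))   # ~1.099, i.e. ln 3
print(solve_lambda(1, -3))   # ~1.37: harsher mismatch, bigger lambda
```

For a minus 1 mismatch you get lambda around 1.1 (that's ln 3), and for minus 3 around 1.37, so the harsher penalty does give the bigger lambda. And then there's this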
target frequency formula that comes up that says that
if you use a scoring system sij to apply to sequences,
and then you pull out just the high scoring
segments, the ones that are unusually high
scoring, they will have a frequency of
matching nucleotides qij that’s given by the
product of the frequencies in the two sequences basically
weighted by e to the lambda sij. So matches will occur more
strongly, because that exponent is positive for matches, and
mismatches less strongly. And that then gives
rise to this notion that there’s an optimal
mismatch penalty, if you just consider scoring systems
that have plus 1 for a match and m for a mismatch, some
negative number, that’s given by this equation
here, and here I’ve worked out a couple of values. So the theory says that to find
matches that are 99% identical, you should use a mismatch
score of minus 3, but for 75% identical,
you should use minus 1. And I asked you to think
about does that make sense, or how is that true? So why is minus 3
better than minus 1 for finding nearly
identical matches. Anyone have an idea
or thought on this? There’s some thoughts
on the slide. But can anyone intuitively
explain why this is true? Yeah, what’s your name? AUDIENCE: Eric. PROFESSOR: Yeah, go ahead. AUDIENCE: With a mismatch
penalty of minus 3, you actually need more
steps climbing back up to get back to
some local maximum. And therefore, you do require
a longer stretch [INAUDIBLE] matches in order to
get a significant hit. That’s my guess as
to why a score of m equals minus 3, a
penalty, [INAUDIBLE] why it would be
better at finding the higher identity matches. PROFESSOR: OK, because
the minus 3 makes you go down faster, so it
takes longer to recover, so you can only find
nearly identical things with that kind of
scoring system. Is that your point? OK, that’s a good point. So yeah, when would
you want to use a mismatch penalty of minus 1? AUDIENCE: When you're trying
to look for things that are [INAUDIBLE], but
maybe not so close. Well, when you’re looking
for a [INAUDIBLE], you’re looking for [INAUDIBLE]. That kind of situation. PROFESSOR: And so
let’s say I’m using a mismatch penalty of minus 2. Can I find regions
that are 66% identical? AUDIENCE: Probably. PROFESSOR: Probably. AUDIENCE: But not guaranteed. PROFESSOR: Anyone else
have a comment on that? Match is plus 1. Mismatch is minus 2
regions of 66% identity. Yeah, with Levi, yeah. AUDIENCE: No, since your
score will just be zero. PROFESSOR: Yeah. That’s correct. So Levi’s comment is
your score will be zero. Well, I’ll just say plus for
match, plus, plus, minus, plus, plus, minus. I mean, it’ll be interspersed. It doesn’t have to be like
this for every triplet. But on average, you'll have two matches for every mismatch. That's what 66% identity means. And so these will score
a total of plus 2. And this will score minus 2. And so you basically will
never rise much above zero. And so you can't really
use that mismatch penalty. There’s a limit. 66% is sort of at a point
where you can no longer see. You could potentially
see things that are 75% identical if
they were incredibly long with that kind
of mismatch penalty. But you just can’t see
anything below 2/3 identity with minus 2. So to find those low things,
you have to use the lower. You have to go down
to minus 1 if you want to find the
really weak matches. But they will have to be
correspondingly very long in order to achieve
statistical significance. So correspondingly, the
reason why it’s better to use a harsher mismatch
penalty of minus 3 to find the nearly
identical regions is that, in this
equation, when you go from having a plus 1, minus
1 scoring system to plus 1, minus 3, lambda will change. This equation will
no longer be satisfied, so a new value of
lambda will be relevant. And that value will be larger. That’s not totally
obvious from this equation, because you have e to the minus lambda in one term and e to the plus lambda in the other. But it turns out that making the mismatch penalty more negative will
lead to a solution that’s a bigger value of lambda. So that means that
the same score, x, will lead to a larger
negative exponent here. And how will that affect the p-value? Can anyone take us through this? It's a little bit
convoluted with all these negative
exponentials and stuff, but can someone explain to us
how that affects the p-value? Same x. We’re going to increase lambda. What happens to the p-value? This gets bigger, more negative. That means this e to the
minus thing gets closer to 0. That means that this is
inside an exponential. As that thing gets closer
to 0, this whole term here gets closer to 1. Therefore you’re
subtracting it from 1. Therefore the p-value
gets smaller, closer to 0, more significant. Does that make sense? So it's good to
just work through how this equation works.
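If it helps, here is that arithmetic in a few lines of Python. The K, m, n, and x values are made-up placeholders, just to show the direction of the effect:

```python
import math

def pvalue(x, lam, K=0.1, m=1000, n=1000000):
    """P(max ungapped local score >= x) under the extreme value
    (Gumbel) approximation: 1 - exp(-K*m*n*exp(-lambda*x))."""
    return 1 - math.exp(-K * m * n * math.exp(-lam * x))

print(pvalue(30, 1.10))   # ~5e-07
print(pvalue(30, 1.37))   # ~1e-10: same score, bigger lambda, smaller p
```

Same score, bigger lambda, smaller p-value. So that's all I wanted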
to say about mismatch penalties for DNA. Any questions about that? So how do you actually
use this in practice? So if you just
Google “BLASTN,” you’ll get to this website. It’s been set up at NCBI
for about 20 years or so. And of course, it’s gone
through various iterations and improvements over the years. And if you look
down at the bottom, there is a place where
you can click and set the algorithm parameters. And there are a number of
parameters you can set. Some of them affect the speed. But we’re focused here
mostly on parameters that will affect the quality,
the nature of the alignments that you find. And so here, you can set
not arbitrary penalties, but you can set within a range
of standard mismatch penalties. You can do 1 minus 1,
1 minus 2, et cetera. So what about sequences
that code for protein? So exons, for example. So you can search them with a
nucleotide search, like BLAST. But it can often be
the case that you’ll do better if you first
translate your exon into the corresponding
amino acid sequence using a genetic code and then
search that peptide. Now you may or may not know
the reading frame of your exon a priori, or even know
that it is an exon, so BLAST automatically will
do this translation for you. So for example, with
this DNA sequence, it’ll translate in all
three of the reading frames, leading to essentially
this bag of peptides here, where sometimes you’ll
hit a stop codon, like right here. And then it just treats
it as, OK, there’s a little PR dipeptide there. And then there’s a longer
peptide here, [INAUDIBLE], and so forth. So it just makes these bags of
peptides for each reading frame and searches all those peptides
against some target, which can be a protein
database or a DNA database, again, translated in
all the reading frames. So the folks at NCBI have made
all these different flavors of BLAST available. So BLASTP is for proteins. N is for nucleotides. And then the translating ones
are called things like BLASTX for a nucleotide query
against a protein database. TBLASTN for a protein query
against a nucleotide database, which gets translated in
all frames, or TBLASTX, where you translate both
the nucleotide sequences in all frames. And then there’s a number
of other versions of BLAST which we probably won’t
discuss but that are well described in the textbook
and other accessible online sources. Let me ask you this. So remember ESTs. So ESTs are segments of cDNAs
that typically correspond to one ABI 3700 Sanger
sequencing read off of that cDNA, so one read, like
600 bases or so. So let’s say you have
some ESTs from chimp. And you don’t have
the chimp genome yet. So you’re going to search
them against human. What would you do? Would you use a
translating search? Or would you use
a BLASTN search? Or does it matter? Chimp is about 98% identical to human, very high. Any ideas? Yeah, Tim. AUDIENCE: You could use
a translating search, because you know that the cDNAs
are at least coding for RNAs. And so if you just use
a nucleotide search, then you’re not going to
have functional significance in terms of the alignment. But if it’s going
for a protein, then– PROFESSOR: You mean you won’t
know whether it is the protein coding part of the cDNA or not? AUDIENCE: So I just
mean that if you’re looking between
chimp and human, then you’re expecting some
sort of mismatch. But it’s possible that it
could be a functional mismatch. Then you know that the cDNA
is maybe coding for a protein. Therefore, if the
mismatch is between two similar amino acids, then
that would be picked up by a translating
search, but it would be skewed against it
in a nucleotide search. PROFESSOR: OK, fair enough. But if you assume that
the two genomes are, let’s say, 97% identical, even
in non-coding regions, which is very high. I don't remember the exact percent, but it's very high. Then if you're searching 600
nucleotides against the genome, even if it’s 95%
identical, you’ll easily find that under either. So either answer is
correct, BLASTN or BLASTX. And the UTRs could only be found
by– if it happened that this was a sequence from
a three prime UTR, you could only find that
by BLASTN typically. What if it’s a human EST
against the mouse genome? So mouse exons are
about 80% identical to human exons at the
nucleotide level, typically. Any ideas? What kind of search
would you do? BLASTN, BLASTX,
or something else? TBLASTX. Yeah, go ahead. AUDIENCE: I have
another question. What exactly is the
kind of question we’re trying to answer by
doing this BLAST search? PROFESSOR: Oh,
well, I was assuming you’re just trying to find
the closest homologous cDNA or exons in the genome– exons,
I guess, yeah, the exons of the homologous gene. Yeah, that's a good question. Exons of a homologous gene. We've got a human EST going against the mouse genome. What do we do? AUDIENCE: I suggest
BLASTP because– PROFESSOR: Well,
BLASTP, that’s protein. This is a nucleotide
sequence against nucleotide. So we can do BLASTN
or TBLASTX, let’s say. AUDIENCE: TBLASTX. PROFESSOR: TBLASTX. You translate your EST,
translate the genome, search those peptides. TBLASTX, why? AUDIENCE: The
nucleotide sequences may have only about 80% similarity, but the protein
sequences functionally, due to the functional
constraints, you might actually get
higher similarities there. PROFESSOR: Yeah. It’s exactly right. So they are about, on
average, about 80% identical. It varies by gene. But a lot of those
variations that occur are at the third site of the codon and don't affect the amino
acid, because there’s a lot of constraint
on protein sequence. And so you’ll do
better, in general, with a translating search
than with a nucleotide search. Although, they both may work. But you may find a
more complete match with a translating search. That’s good. Everyone got that? Sally, yeah. AUDIENCE: Is there a reason
why you wouldn’t use BLASTX and instead you use TBLASTX? PROFESSOR: Yeah, I
just gave the example of searching against the genome. But you could search against
the mouse proteome as well. You might or might not. It depends how well
annotated that genome is. Mouse is pretty well annotated. Almost all the proteins
are probably known. So you probably get it. But if you were searching
against some more obscure organism, the chameleon
genome or something, and it wasn’t well
annotated, then you might do better with searching
against the genome, because you could find a new exon there. OK, good. Question, yeah, go ahead. AUDIENCE: So when we do these
translations, these nucleotide, amino acid things,
do we get all frames? Do the algorithms do all frames? PROFESSOR: Yeah, all six frames. So three frames on the plus
strand, and three frames on the reverse strand.
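If you want to see the mechanics, here is roughly what that six-frame translation looks like, assuming you have Biopython available; the input sequence is just a made-up example:

```python
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")   # arbitrary example
frames = []
for strand in (dna, dna.reverse_complement()):
    for offset in range(3):
        # trim to whole codons so translate() doesn't complain
        end = offset + 3 * ((len(strand) - offset) // 3)
        frames.append(str(strand[offset:end].translate()))

print(frames)   # six peptides; '*' marks stop codons, so splitting on '*'
                # gives the "bag of peptides" for each frame
```

Yeah. All right, great. So that's the end of local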
alignment, for the moment. And we’re going to now move
on to global alignment using two algorithms. For global alignment,
Needleman-Wunsch-Sellers, and then for gapped local
alignment, Smith-Waterman. And toward the end,
we’re going to introduce the concept of amino acid
substitution matrices. So the background for
today, the textbook does a pretty good
job on these topics, especially the
pages indicated are good for introducing the
PAM series of matrices. We'll talk a little bit today
and a little bit next time. So why would we align
protein sequences? So the most obvious reason is to
find homologues that we might, then, want to investigate,
or we might, for example, if you have a human protein
and you find homologous mouse protein, and that mouse
protein has known function from a knockout or from some
biochemical studies, for example, then you can guess
that the human protein will have similar function. So we often use this
type of inference that sequence similarity implies
similarity in function and/or structure. So how true is this? So it turns out,
from a wide body of literature, that this
inference that sequence similarity implies functional
and structural similarity is almost always true when
the sequence similarity is more than about 30% identity
over the whole length of a protein, over
300, 400 amino acids. That’s a good inference. Below that, sort of
in the 20% to 30% sequence similarity,
that’s often referred to as
the Twilight Zone, where sometimes it’s
a good inference, and sometimes it’s not. So you need to be a
little bit careful. And below that, it’s deeper
into the Twilight Zone, where most of the time you
probably shouldn’t trust it. But occasionally, you can see
these very remote homologies. You might want to have
additional information to support that
kind of inference. And I want to just point out
that the converse is just not true in biology. So structural
similarity does not imply sequence
similarity or even derivation from a
common ancestor. So you may think,
well, every protein has a really complex, elaborate
three dimensional structure, and there’s no way that
could ever evolve twice. And it’s true that probably
that exact structure can never evolve twice. But a very similar structure,
a similar fold even, in terms of the topology
of alpha helices and beta strands, which
Professor Frank will talk about later in the
course– the identical fold can evolve more than once. It's not that hard
to evolve a pattern of alpha helices
and beta strands. And so this point about
structural similarity not implying
sequence similarity, the way I think about it
is like this, like here are two organisms. This is a hummingbird,
you’ve all seen. And some of you
may have seen this. This is a hawk moth,
which is an insect that is roughly two inches long,
beats its wings very fast, has a long tongue that
sips nectar from flowers. So it basically occupies
the same ecological niche as a hummingbird, and
looks very, very similar to a hummingbird at a distance. From 10 or more feet,
you often can’t tell. This is an insect,
and that’s a bird. The last common
ancestor was something that probably lived
500 million years ago, and certainly didn’t
have wings, and may not have had legs or eyes. And yet, they’ve
independently evolved eyes and wings and
all these things. So when there’s
selective pressure to evolve something, either
a morphology or a protein structure, for
example, evolution is flexible enough that it can
evolve it many, many times. So here’s an example from
the protein structure world. This is a Haemophilus iron-binding protein. This is just the iron
coordination center. And this is now a eukaryotic
protein called lactoferrin. Turns out these
guys are homologous. But eukaryotes and bacteria
diverged 2 billion years ago or so, so their ancestry
is very, very ancient. And yet, you can see that in
this iron coordination center, you have a tyrosine
pointing into the iron here. And you have a histidine
up here, and so forth. So the geometry has
been highly conserved. It’s not perfectly conserved. Like, here you
have a carboxylate. And here you have a phosphate. So there’s been a
little bit of change. But overall, this way
of coordinating iron has basically evolved
independently. So although these
are homologous, the last common
ancestor bound anions– that’s known from
[INAUDIBLE] construction. So they independently evolved
the ability to bind cations, like iron. And here is actually
my favorite example. So here’s a protein called
ribosome recycling factor. And that’s its shape. So it’s a very
unusual shaped protein that’s kind of shaped like an L. Does this remind
anyone of anything, this particular shape? Have you seen this in another
biomolecule at some point? AUDIENCE: [INAUDIBLE]. PROFESSOR: Something
like [INAUDIBLE]. OK, could be. Any other guesses? How about this? That’s a tRNA. So the 3D structure of
tRNA is almost identical, both in terms of
the overall shape and in terms of the geometry. Sorry, I’m having issues
with my animations here. The geometry of these, they’re
both about 70 angstroms long. So why is that? Why would this protein
evolve to have the same three dimensional shape as a tRNA? Any ideas? AUDIENCE: [INAUDIBLE]. PROFESSOR: [INAUDIBLE]. Right, exactly. It fits into the
ribosome, and it’s involved, when the
ribosome is stalled, and basically
releasing the ribosome. So it’s mimicking a tRNA
in terms of structure. And so the point about
this is that, if you were to take a bunch
of biomolecules and match them up using
a structure comparison algorithm to find similar ones–
these two are clearly similar. And yet, they probably never
had a common ancestor, right, because one's an RNA and one's a protein. OK. So now we're going to talk
about a few different types of alignments. So we talked about
local alignments, where you don’t try to
align the entire sequence of your query or your database. You just find smaller
regions of high similarity. Global alignment, where
you try to align the two proteins from end
to end, you assume that these two proteins
are homologous, and actually that they
haven’t had major insertions or rearrangements
of their sequence. And then semi-global,
which is sort of a little twist on global. And we’ll talk about a few
different scoring systems– so ungapped, which we’ve been
talking about until now, and then we’ll introduce
gaps of two types that are called linear and affine. And the nomenclature is a little
bit confusing, as you’ll see. They’re both linear, in a sense. So a common way to represent
sequence alignments, especially in the protein alignment– you
can do it for protein or DNA– is what’s called a dot matrix. Now we’ve got two proteins. They might be 500 amino
acids long each, let’s say. You write sequence
one along the x-axis, sequence two along the y-axis. And then you make a dot
in this matrix whenever they have identical residues,
although probably there would be a lot
more dots in this. So let’s say, whenever you have
three residues in a row that are identical– OK, that’s
going to occur fairly rarely, since there’s 20 amino acids. And you make that dot. And for these two
proteins, you don’t get any off diagonal dots. You just get these three
diagonal lines here. So what does that tell you
about the history of these two proteins? What’s that right there? Sally. AUDIENCE: An
insertion or deletion. PROFESSOR: An
insertion or deletion. An insertion in which protein? AUDIENCE: Sequence two. PROFESSOR: Or a deletion in? AUDIENCE: One. PROFESSOR: OK. Everyone got that? OK, good. There’s extra sequence
in sequence two here that’s not in sequence one. You don’t know whether it’s
an insertion or deletion. It could be either one,
based on this information. Sometimes you can figure that
out from other information. So sometimes you call that an
indel– insertion or deletion. And then, what is
this down here? Someone else? Insertion, I heard, insertion
in sequence one or deletion in sequence two. OK, good. All right, so what
type of alignment would be most appropriate
for this pair of sequences, a local or a global? AUDIENCE: I would do
global, because they’re very, very similar. [INAUDIBLE]. PROFESSOR: Yeah. They are quite similar
across their entire lengths, just with these
two major indels. So that’s sort of the
classical case where you want to do the
global alignment. All right. So what about
these two proteins? Based on this dot
matrix, what can you say about the relation
between these two, and what type of
alignment would you want to use when comparing
these two proteins? Yeah, what’s your name? AUDIENCE: Sonia. PROFESSOR: Go ahead, Sonia. AUDIENCE: It looks like they’ve
got similar domains, maybe. So local alignment
might be better. PROFESSOR: And why wouldn’t
you do a global alignment? AUDIENCE: Local, because the
local alignment might actually find those domains and
tell you what they are. PROFESSOR: So a local
alignment should at least find these two guys here. And why do these two
parallel diagonal lines, what does that tell you? AUDIENCE: That the
two different proteins have similar sequences, just in
different parts of the protein, different areas
relative to the start. PROFESSOR: Right. Yeah, go ahead. AUDIENCE: Doesn’t it
just basically mean that there’s a section
in sequence two that’s in sequence one twice? PROFESSOR: Yeah, exactly. So this segment of sequence two,
here– sorry, having trouble, there we go, that part– is
present twice in sequence one. It’s present once from
about here over to here, and then it’s present once
from here over to here. So it’s repeated. So repeats and
things like that will confuse your global alignment. The global alignment needs to
align each residue– or trying to align each residue
in protein one to each residue in protein two. And here, it’s ambiguous. It’s not clear which
part of sequence one to align to that
part of sequence two. So it’ll get confused. It’ll choose one or the other. But that may be
wrong, and that really doesn’t capture what
actually happens. So yeah, so here
a local alignment would be more suitable. Good.
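And a dot matrix like the ones on these slides is easy to compute yourself. Here is a rough sketch using the window-of-three-identical-residues rule from before; the sequences are invented for illustration:

```python
def dot_matrix(seq1, seq2, w=3):
    """All (i, j) where seq1 and seq2 share a run of w identical
    residues: the 'dots' you'd plot with seq1 on x and seq2 on y."""
    return [(i, j)
            for i in range(len(seq1) - w + 1)
            for j in range(len(seq2) - w + 1)
            if seq1[i:i + w] == seq2[j:j + w]]

# the repeated MKLV in seq1 produces two parallel diagonals of dots
print(dot_matrix("MKLVQQWMKLV", "AMKLVA"))
```

So let's talk now about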
gaps, again, which can be called indels. In protein sequence alignments,
or DNA, which many of you have probably seen, you
often use maybe just a dash to represent a gap. So in this alignment
here, you can see that’s kind of a
reasonable alignment, right? You’ve got pretty good
matching on both sides. But there’s nothing in
the second sequence that matches the RG in
the first sequence. So that would be a reasonable
alignment of those two. And so what’s often
used is what's called a linear gap penalty. So if you have n gaps, like in this case, two, you assign each a gap
penalty A, let’s say. And A is a negative number. And then you can just run the
same kinds of algorithms, where you add up matches,
penalize mismatches, but then you have an
additional penalty you apply when you
introduce a gap. And typically,
the gap penalty is more severe than your
average mismatch. But there’s really
no theory that says exactly how the gap
penalty should be chosen. But empirically,
in cases where you should know the
answer, where you have, for example, a
structural alignment, you can often find that
a gap penalty that’s bigger than your
average mismatch penalty is usually the
right thing to do. So why would that be? Why would a gap penalty–
why would you want to set it larger than a typical mismatch? Any ideas? Yeah, what’s your name? AUDIENCE: I’m Chris. PROFESSOR: Chris. AUDIENCE: Because having
mutations that shift the frame or that one insert would
have insertions or deletions is far more uncommon than just
having changing [INAUDIBLE]. PROFESSOR: Mutations that create
insertions and deletions are less common than those that
introduce substitutions of residues. Everyone got that? That’s true. And do you know by what factor? AUDIENCE: Oh, I couldn’t
give you a number. PROFESSOR: So I mean,
this varies by organism, and It varies by what type of
insertion you’re looking at. But even at the single
nucleotide level, having insertions
is about an order of magnitude less
common than having a substitution in
those lineages. And here, in order to get
an amino acid insertion, you actually have to have a
triplet insertion, three or six or some multiple of
three into the exon. And that’s quite
a bit less common. So they occur less commonly. A mutation occurs less commonly,
and therefore the mutation is actually accepted by
evolution even less commonly. And an alternative is a
so-called affine gap penalty, which is defined
as G plus n times gamma. So n is the number of gap positions, and then G is what's called a
gap opening penalty. So the idea here is
that basically the gaps tend to cluster. So having an insertion
is a rare thing. You penalize that
with G. But then, if you’re going to
have an insertion, sometimes you’ll have a big
insertion of two or three or four codons. A four codon insertion should
not be penalized twice as much as a two codon insertion,
because only one gap actually occurred. And when you have
this insertion event, it can be any of a variety of sizes. You still penalize
more for a bigger gap than for a smaller gap,
but it’s no longer linear. I mean, it’s still
a linear function, just with this
constant thing added. So these are the two common
types of gap penalties that you’ll see
in the literature. The affine works a
little bit better, but it is a little bit more
complicated to implement. So sometimes you’ll see both
of them used in practice. And then, of course, by changing
your definition of gamma, you could have G plus n minus 1 times gamma. So that first gap position would cost G, and then all the subsequent gap positions would cost gamma. So you're not going to double count the gap opening.
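Written out as little cost functions, the two schemes look like this. The G of minus 12 and gamma of minus 2 are made-up example values, not anything from the slides:

```python
def linear_gap(n, a=-8):
    """Linear: every gap position costs the same penalty a."""
    return n * a

def affine_gap(n, g=-12, gamma=-2):
    """Affine: one gap-opening charge g, plus gamma per position."""
    return g + n * gamma if n > 0 else 0

print(linear_gap(2), linear_gap(4))   # -16 -32: cost doubles with length
print(affine_gap(2), affine_gap(4))   # -16 -20: opening charged only once
```

All right. OK. You've got two proteins.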
optimal global alignment? Any ideas on how to do this? So we can write one sequence
down one axis, one down the other axis. We can make this dot plot. The dot plot can give us some
ideas about what’s going on. But how do we actually
find the optimal one where we want to start
from the beginning? In the end, we’re going
to write the two sequences one above the other. And if the first residue
or the first sequence is n, and maybe we’ll
align it to here, then we have to write
the entire sequence here all the way down to the end. And below it has to
be either a residue in sequence two or a gap. And again, we can
have gaps up here. So you have to do something. You have to make it all the way
from the beginning to the end. And we’re just going to sum
the scores of all the matching residues, of all the mismatching
residues, and of all the gaps. How do we find that alignment? Chris. AUDIENCE: Well, since we’re
using dynamic programming, I’m guessing that you’re
going to have to fill out a matrix of some
sort and backtrack. PROFESSOR: And so when you see
the term dynamic programming, what does that mean to you? AUDIENCE: You’re going to
find solutions to sub problems until you find a
smaller solution. Then you’d backtrack
through what you’ve solved so far to
find the global sequence. PROFESSOR: Good. That’s a good way
of describing it. So what smaller
problems are you going to break this
large problem into? AUDIENCE: The smaller
sub-sequences. PROFESSOR: Which
smaller sub-sequences? Anyone else? You are definitely on
the right track here. Go ahead. AUDIENCE: I mean, it
says at the top there, one sequence across the
top and one down the side. You could start with just
the gap versus the sequence and say your gap will
increase as you move across. Basically, each cell there could
be filled out with information from some of its neighbors. So you want to make
sure that you fill out old cells in some
order so that we can proceed to the next
level with what we’ve [? written down. ?] PROFESSOR: So if
you had precisely to find a sub problem where you
could see what the answer is, and then a slightly larger sub
problem whose solution would build on the solution that first
one, where would you start? What would be your
smallest sub problem? AUDIENCE: I’d start
with the top row, because you could just
the gap versus gap, and then move in the row,
because you don’t need anything above that. PROFESSOR: And then what’s
the smallest actual problem where you actually have
parts of the protein aligned? AUDIENCE: Row one, column two, basically. If it's a match,
you have some score. And if it’s a mismatch,
you have some other score. And you want the best
possible one in each block. PROFESSOR: Yeah, OK. Yeah. That’s good. So just to generalize
this– hopefully this is blank– in general, you
could think about we’ve got, let’s say, 1 to n here,
and a sequence 1 to n here. You could think about a position
i here and a position j here. And we could say finding the
optimal global alignment, that’s a big problem. That’s complicated. But finding an alignment
of just the sequence from 1 to i in the first protein
against the sequence from 1 to j in the
second protein, that could be pretty easy. If i is 2 and j is 2,
you’ve got a dipeptide against a dipeptide. You could actually
try all combinations and get the optimal
alignment there. And so the idea, then, is if you
can record those optimal scores here in this matrix, then you
could build out, for example, like this, and find
the optimal alignments of increasingly
bigger sub problems where you add another residue
in each direction, for example. Does that make sense to you? The idea of a dynamic
programming algorithm is it’s a form of
recursive optimization. So you first optimize
something small, and then you optimize
something bigger using the solution you got
from that smaller piece. And the way that that’s
done for protein sequences in Needleman-Wunsch is
to, as we were saying, first consider that there
might be a gap in one aligning to a residue in the other. So we need to put these
gaps down across the top and down the side. This is a linear gap
penalty, for example. And so here would
be how you start. And this is a gap penalty,
obviously, of minus 8. So if you want the optimal solution that begins with this V in
the top sequence aligned to this gap in the
vertical sequence, there’s one gap there,
so it’s minus 8. And then if you want to start
with two gaps against this V and D, then it’s minus 16. So that’s how you
would start it. So you start with these problems
where there’s no options. If you have two gaps against
two residues, that’s minus 16. By our scoring system,
it’s unambiguous. So you just can fill those in. And then you can
start thinking about, what do we put right here? What score should
we put right there? Remember, we’re defining
the entries in this matrix as the optimal score of the
sub-sequence of the top protein up to position i against
the vertical protein up to position j. So that would be the
top protein position one up to the vertical
protein position one. What score would that be? What's the optimal alignment there that ends at position one in both sequences? It'll depend on
your scoring system. But for a reasonable scoring
system, that’s a match. That’s going to get
some positive score. That’s going to be better than
anything involving a gap in one against a gap in the other
or something crazy like that. So that’s going to get whatever
your VV match score is. This is your Sij from
your scoring matrix for your different amino acids. And then, basically the
way that this is done is to consider that when
you’re matching that position one against position one, you
might have come from a gap before in one sequence or a
gap in the other sequence, or from a match position
in the other sequence. And that leads to
these three arrows. I think it gets clear if I write
up the whole algorithm here. So Sij is the score of the
optimal alignment ending at position i in sequence one
and position j in sequence two. Requires that we know
what’s above, to the left, and diagonally above. And you solve it from
the top and left down to the bottom and
right, which is often called dynamic programming. And let’s just look at
what the recursion is. So Needleman and
Wunsch basically observed that you could find
this optimal global alignment score by filling in the
matrix by at each point taking the maximum of these
three scores here. So you take the
maximum of the score that you had above and to the
left, so diagonally above, plus sigma of xi yj. Sigma, in this case,
is the scoring matrix that you’re using
that’s 20 by 20 that scores each amino acid against
each other amino acid residue. You add that score if you’re
going to move diagonally to whatever the optimal
score was there, or if you’re moving to
the right or down, you’re adding a gap in one
sequence or the other. So you have to add A, which
is this gap penalty, which is a negative number, to
whatever the optimal alignment score was before.
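For reference, here is the fill step written out as a sketch in Python. The plus 4 match and minus 2 mismatch are stand-ins for a real 20 by 20 scoring matrix, and minus 8 is the linear gap penalty from the slide:

```python
def needleman_wunsch(s1, s2, gap=-8, sigma=lambda x, y: 4 if x == y else -2):
    """Fill the global-alignment matrix with a linear gap penalty.
    S[i][j] = optimal score of s1[:i] versus s2[:j]."""
    m, n = len(s1), len(s2)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap                 # s1 prefix against all gaps
    for j in range(1, n + 1):
        S[0][j] = j * gap                 # s2 prefix against all gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i-1][j-1] + sigma(s1[i-1], s2[j-1]),   # diagonal: (mis)match
                S[i-1][j] + gap,                         # gap in s2
                S[i][j-1] + gap,                         # gap in s1
            )
    return S

# bottom-right corner holds the optimal global alignment score
print(needleman_wunsch("VDSCY", "VESLCY")[-1][-1])
```

The trace-back arrows would be recorded alongside each max as you go; we'll come back to those. I think it's maybe easier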
if we do an example here. So here is the PAM250
scoring matrix. So this was actually developed
by Dayhoff back in the ’70s. This might be an
updated version, but it’s more or less
the same as the original. Notice, it’s a
triangular matrix. Why is that? AUDIENCE: It’s symmetric. PROFESSOR: It’s
symmetric, right. So it has a diagonal. But then everything
below the diagonal, it would be mirrored
above the diagonal, because it's symmetric. When you see a valine matched to a leucine, you don't know the direction, so it's the same as a leucine matched to a valine; it's a symmetrical
definition of scoring. And here are two
relevant scores. So notice that VV has a score
of plus 4 in this matrix. And over here, VD has
a score of minus 2. So I’ll just write those down. Anyone notice anything else
interesting about this matrix? We haven’t said exactly where it
comes from, but we’re going to. Yeah, what’s your name? AUDIENCE: Michael. PROFESSOR: Go ahead. AUDIENCE: Not all the
diagonal values are the same. PROFESSOR: Not all
diagonals are the same. In fact, there’s a pretty
big range, from 2 up to 17, so a big range. And anything else? OK, I’m sorry, go ahead. What’s your name? AUDIENCE: Tagius. PROFESSOR: Tagius, yeah. Go ahead. AUDIENCE: There
are positive values for things that
are not the same? PROFESSOR: Yeah. So all the diagonal
terms are positive. So a match of any
particular residue type to its identical
residue is always scored positively, but
with varying scores. And there are also some positive
scores in the off diagonal. And where are those
positive scores occurring? Notice they tend to
be to nearby residues. And notice the order of
residues is not alphabetical. So someone who knows a
lot about amino acids, what can you see
about these scores? Yeah, go ahead. AUDIENCE: I think these
amino acids [INAUDIBLE] based on their [INAUDIBLE]. PROFESSOR: So the comment
was that the residues have been grouped by similar
chemistry of their side chains. And that’s exactly right. So the basic residues,
histidine, arginine, and lysine, are all together. The acidic residues,
aspartate and glutamate, are here, along with
asparagine and glutamine. And notice that D to E
has a positive score here. It’s 3. It’s almost as good as D to D
or E to E, which are plus 4. So recognizing that you can
often substitute in evolution an aspartate for a glutamate. So yeah, so it basically,
to some extent, is scoring for
similar chemistry. But that doesn’t explain
why, on the diagonal, you have such a large
range of values. Why is a tryptophan
more like a tryptophan than a serine is like a serine? Tim, you want to comment?
tryptophans occur very rarely in all proteins [INAUDIBLE]. So if you’ve got
two [INAUDIBLE], that’s a lot rarer of an
occurrence and [INAUDIBLE]. PROFESSOR: So Tim’s point was
that tryptophans occur rarely, so when you see two
tryptophans aligned, you should take note of it. It can anchor your alignment. You can be more
confident in that. Sally. AUDIENCE: Well, tryptophans
are also incredibly bulky, and also have the ability to
make electric interactions, electro-static interactions. PROFESSOR: Not really
electro-static, you would say, more– [INTERPOSING VOICES] AUDIENCE: Yes. But they do have
a lot of abilities to interact with
other side chains. And cysteines contribute
very, very strongly to the three dimensional
structure of the protein. PROFESSOR: Why is that? AUDIENCE: Well, because
they can form [INAUDIBLE]. PROFESSOR: OK. Yeah. So maybe you don’t put your
tryptophans and your cysteines into your protein by
chance, or you only put them when you
want them, when there’s enough space
for a tryptophan. And when you substitute
something smaller, it leaves a gap. It leaves a 3D spatial gap. And so you don’t want that. You don’t get good packing. When you have cysteines,
they form disulfide bonds. If you change it to something
that's non-cysteine, it can't form that anymore. That could be disruptive
to the overall fold. So those ones tend to be more
conserved in protein sequence alignments, absolutely. Whereas, for example, if you
look at these hydrophobics, the MILV group
down here, they all have positive scores
relative to each other. And that says that, basically,
most of the time when those are used– I mean, there are sometimes when it really matters. But a lot of the time, if you just
want a transmembrane segment, you can often substitute any one
of those at several positions and it’ll work equally well
as a transmembrane segment. So these are not random at all. There’s some patterns here. So let’s go back
to this algorithm. So now, if we’re going to
implement this recursion, so we fill in the top
row and the left column, and then we need to
fill in this first. I would argue the first
interesting place in the matrix is right here. And we consider
adding a gap here. When you move vertically
or horizontally, you're not adding a match; you're adding a gap. So from this position, this is
sort of the beginning point. It doesn’t actually correspond
to a particular position in the protein. We’re going to add
now the score for VV. And we said that VV, you look
it up in that PAM matrix, and it’s plus 4. So we’re going to
add 4 there to 0. And so that’s clearly
bigger than minus 16, which is what you get from
coming above or coming from the left. So you put in the 4. And then you also, in addition
to putting that 4 there, you also keep the arrow. So there’s that red arrow. We remember where we came
from in this algorithm. Because someone said something
about backtracking– I think Chris– so that’s
going to be relevant later. So we basically get rid
of those two dotted arrows and just keep that red
arrow as well as the score. And then we fill in
the next position here. And so to fill in
this, now we’re considering going to the second
position in sequence one, but we’re still only at the
first position in sequence two. So if we match V to V,
then we’d have to add, basically, a gap in
one of the sequences. Basically it would be
a gap in sequence two. And that’s going to be minus 8. So you take 4, and then plus
minus 8, so it’s negative 4. Or you could do minus 8
and then plus negative 2, if you want to start
from a gap and then add a DV mismatch
there, because minus 2 was the score for a DV mismatch. Or again, you can start from a
gap and then add another gap. OK, does that make sense? So what is the
maximum going to be? Negative 4. And the arrow is going
to be horizontal, because we got some bonus
points for that VV match, and now it’s carrying over. We’re negative, but that’s OK. We’re going to just keep
the maximum, whatever it is. All right, so it’s minus 4,
and the horizontal arrow. And then here’s the
entire matrix filled out. And you’ll have a chance
to do this for yourself on problem set one. And I’ve also filled in arrows. I haven’t filled
in all the arrows, because it gets
kind of cluttered. But all the relevant arrows
here are filled in, as well as some irrelevant arrows. And so then, once
I fill this in, what do I do with
this information? How do I get an actual
alignment out of this matrix? Any ideas? Yeah, what’s your name? AUDIENCE: [INAUDIBLE]. PROFESSOR: Yeah, so what he said
is start at the bottom right corner and go
backwards following the red arrows in reverse. Is that right? So why the bottom right corner? What’s special about that? AUDIENCE: [INAUDIBLE]. PROFESSOR: Yeah. It’s a score of the
optimal alignment of the entire sequence one
against the entire sequence two. So that’s the answer. That’s what we define as the
optimal global alignment. And then you want to
know how you got there. And so how did we get there? So the fact that there’s
this red arrow here, what does that red arrow
correspond to specifically? AUDIENCE: [INAUDIBLE]. PROFESSOR: Right. In this particular case, for
this particular red arrow, remember the
diagonals are matches. So what match is that? AUDIENCE: [INAUDIBLE]. PROFESSOR: Yeah, that’s
a Y to Y match, right? Can everyone see that? We added Y to Y,
which was plus 10, to whatever this
13 was and got 23. OK, so now we go back to here. And then how do we get here? We came from up here by
going this diagonal arrow. What is that? What match was that? That’s a
cysteine-cysteine match. And then how do
we get to this 1? We came vertically. And so what does that mean? AUDIENCE: [INAUDIBLE]. PROFESSOR: We inserted a
gap, in which sequence? The first one. The second one? What do people think? Moving down. AUDIENCE: [INAUDIBLE]. PROFESSOR: Yeah, the top one. And so that got us to here. Here’s a match, plus 2 for
having a serine-serine match. Here’s a plus 3 for
having a D to E mismatch. But remember, those are chemically similar, so they get a positive score. And then this is the V to V. So can you see? I think I have the
optimal alignment written somewhere here,
hopefully, there. That’s called the trace back. And then that is the alignment. OK, we align the Y to
the Y, the C to the C. Then we have basically a gap
in this top sequence– that’s that purple dash there–
that’s corresponding to that L. And you can see why we wanted
to put that gap in there, because we want
these S’s to match, and we want the C’s to match. And the only way
to connect those is to have a gap in the purple. And the purple was shorter
than the green sequence anyway, so we kind
of knew that there was going to be a gap somewhere. And good. And that’s the
optimal alignment. So that’s just some philosophy
on Needleman-Wunsch alignments. So what is
semi-global alignment? You don’t see that
that commonly. It’s not that big a deal. I don’t want to spend
too much time on it. But it is actually
reasonable a lot of times that, let’s say you
have a protein that has a particular
enzymatic activity, and you may find that the
whole, the bulk of the protein is well conserved
across species. But then at the N and
C termini, there’s a little bit of flutter. You can add a few residues or
delete a few residues, and not much matters at the
N and C termini. Or it may matter not for
the structure but for, you know, you're adding a signal peptide so it'll be secreted, or you're adding some
localization signal. You’re adding some
little thing that isn’t necessarily conserved. And so a semi-global
alignment, where you use the same
algorithm, except that you initialize the edges of
the dynamic programming matrix to 0, instead of the minus
8, minus 16 gap penalties along the edge. So we're not going to penalize
for gaps of the edges. And then, instead of
requiring the trace back to begin at the
bottom right, Smn, you allow it to
begin at the highest score in the bottom row
or the rightmost column. And when you do the
trace back as before, these two changes basically find
the optimal global alignment but allowing arbitrary
numbers of gaps at the ends and just finding the core match. It has to go basically
to the end of one or the other
sequences, but then you can have other residues
hanging off the end on the other sequence, if
you want, with no penalty. And this sometimes will
give a better answer, so it’s worth knowing about. And it’s quite
easy to implement.
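In code, the semi-global variant is just two small edits to the global-alignment sketch from before, with the same toy scoring assumptions:

```python
def semi_global(s1, s2, gap=-8, sigma=lambda x, y: 4 if x == y else -2):
    """Global alignment with free end gaps: edges start at 0, and the
    answer is the best cell in the bottom row or rightmost column."""
    m, n = len(s1), len(s2)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # edge cells stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i-1][j-1] + sigma(s1[i-1], s2[j-1]),
                S[i-1][j] + gap,
                S[i][j-1] + gap,
            )
    # the trace back would start from wherever this max is found
    return max(max(S[m]), max(row[n] for row in S))
```

Now what about gapped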
local alignments? So what if you
have two proteins? Do you remember
those two proteins where we had the two diagonals? I guess they were diagonal lines. How were they? Something like that. Anyway, diagonal
lines like that. So where in this protein on the
vertical, there is a sequence here that matches two segments
of the horizontal protein. So for those two, you don’t want
to do this global alignment. It’ll get confused. It doesn’t know whether
to match this guy to this or this other one
to the sequence. So you want to use
a local alignment. So how do we modify this
Needleman-Wunsch algorithm to do local alignment? Any ideas? It’s not super hard. Yeah, go ahead. AUDIENCE: If the score
is going to go negative, instead of putting a negative
score, you just put 0, and you start from where you get
the highest total score, rather than the last
column or last row. Start your traceback
from the highest score. PROFESSOR: So whenever you’re
going negative, you reset to 0. Now what does that
remind you of? That’s the same trick
we did write previously with ungapped local alignment. So you reset to 0. And that’s as a no
penalty, because if you’re going negative, it’s better
just to throw that stuff away and start over. We can do that because
we’re doing local alignment. We don’t have to
align the whole thing. So that’s allowed. And then, rather than going
to the bottom right corner, you can be anywhere
in the matrix. You look for that highest score
and then do the traceback. That’s exactly right. So it’s not that different. There are a few
constraints, though, now on the scoring system. So if you think about the
Needleman-Wunsch algorithm, we could actually use a matrix
that had all positive scores. You could take
the PAM250 matrix. And let’s say the
most negative score there is, I don’t know,
like minus 10 or something, and you could just add 10, or
even add 20 to all those scores. So they're all positive now. And you could still
run that algorithm. And it would still produce
more or less sensible results. I mean, they wouldn't be
as good as the real PAM250, but you would still get
a coherent alignment out of the other end. But that is no
longer true when you talk about the
Smith-Waterman algorithm, for the same reason that an
ungapped local alignment, we had to require that the
expected score be negative, because you have to
have this negative drift to find small regions
that go in the positive. So if you have this rule, this
kind of permissive rule that says, whenever we go negative
we can just reset to 0, then you have to have this
negative drift in order for positive scoring
stuff to be unusual. All right, so that’s
another constraint there. You have to have negative
values for mismatches– I mean, not all mismatches. But if you took two random
residues in alignment, the average score
has to be negative. I should probably rephrase
that, but more or less. And here’s an example
of Smith-Waterman. So you write zeroes down the
left side and across the top. And that’s because,
remember, if you go negative, you reset to 0. So we’re doing that. And then you take the
maximum of four things. So coming from the diagonal and
adding the score of the match, that’s the same as before. Coming from the left and
adding a gap in one sequence, coming from above
and adding a gap in the other sequence, or 0. This “or 0” business
allows us to reset to 0 if we ever go negative. And when you have a 0, you still
keep track of these arrows. But when you have a
0, there’s no arrow. You’re starting it. You’re starting the
alignment right there. So that's Smith-Waterman.
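And here is the corresponding sketch, one more small change on top of the global version, again with the same toy scoring assumptions:

```python
def smith_waterman(s1, s2, gap=-8, sigma=lambda x, y: 4 if x == y else -2):
    """Gapped local alignment: include 0 in the max (reset when
    negative) and take the best cell anywhere in the matrix."""
    m, n = len(s1), len(s2)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                0,                                        # reset to zero
                S[i-1][j-1] + sigma(s1[i-1], s2[j-1]),
                S[i-1][j] + gap,
                S[i][j-1] + gap,
            )
            best = max(best, S[i][j])    # trace back starts at the best cell
    return best
```

It's helpful. I think on problem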
set one, you’ll have some experience thinking
about both Needleman-Wunsch and Smith-Waterman. They sort of behave a
little bit differently, but they’re highly related. So it’s important to
understand how they’re similar, how they’re different. And what I want to focus on for
the remainder of this lecture is just introducing the concept
of amino acid similarity matrices. We saw that PAM matrix, but
where does it come from? And what does it mean? And does it work well or not,
and are there alternatives? So we could use this
identity matrix. But as we’ve heard,
there are a number of reasons why that
may not be optimal. For example, the cysteines, we
should surely score them more, because they’re often
involved in disulfide bonds, and those have major structural
effects on the protein and are likely to be conserved
more than your average leucine or alanine or whatever. So clearly, scoring
system should favor matching identical or
related amino acids, penalize poor matches and for gaps. And there’s also
an argument that can be made that it should
have to do with how often one residue is substituted for
another during evolution. So that commonly
substituted thing should have either positive
scores or less negative scores than rarely
substituted things. And perhaps not
totally obvious, but it is if you think
about it for a while, is that any scoring system that
you dream up carries with it an implicit model of
molecular evolution for how often things
are going to be substituted for each other. So it’s going to turn out
that the score is roughly proportional to a log-odds score: the log of how often that pair of residues occurs in real alignments, divided by how often it would occur by chance, something like that. And so if you assign
positive scores to things, to certain pairs
of residues, you’re basically implying that
those things will commonly interchange during evolution. And so if you want to have
realistic, useful scores, it helps to think about what the
implicit evolutionary model is and whether that is a realistic
model for how proteins evolve. So I’m going to come to Dayhoff. And so unlike
later matrices, she had an explicit
evolutionary model, like an actual mathematical
model, for how proteins evolve. And the idea was
that there are going to be alignments
of some proteins. And keep in mind,
this was in 1978. So the protein database probably
had like 1,000 proteins in it, or something. It was very, very small. But there were some
alignments that were obvious. If you see two protein
segments of 50 residues long that are 85% identical, there's
no way that occurred by chance. You don’t even need to
do statistics on that. So you’re sure. So she took these very high
confidence protein sequence alignments, and she calculated
the actual residue-residue substitution frequencies, how
often we have a valine in one sequence as a substitute
for a leucine. And she actually assumed it's symmetrical. Again, you don't
know the direction. And calculated these
substitution frequencies. Basically estimated
what she called a PAM1 matrix,
which is a matrix that implies 1% divergence
between proteins. So there’s, on average,
only a 1% chance any given residue will change. And the real alignments had
greater divergence than that. They had something
like 15% divergence. But you can look at
those frequencies and reduce them
by a factor of 15– not exactly 15, but something like it. And you'll get
something where there’s a 1% chance of substitution. And then once you have that
model for what 1% sequence substitution looks
like, turns out you can just represent
that as a matrix and multiply it up to get a
matrix that describes what 5% sequence substitution looks
like, or 10% or 50% or 250%.
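Here is the multiplying-up idea on a toy four-letter nucleotide version instead of the full 20 by 20, just to show the mechanics. The 1% per-step rate echoes the PAM1 definition; the uniform off-diagonal rates are my simplification:

```python
import numpy as np

# Toy "PAM1": every base has a 1% chance of changing per unit time,
# split evenly among the other three bases.
P1 = np.full((4, 4), 0.01 / 3)
np.fill_diagonal(P1, 0.99)

for k in (1, 50, 250):
    Pk = np.linalg.matrix_power(P1, k)       # k units of divergence
    print(k, round(1 - Pk[0, 0], 3))         # expected fraction changed
```

In this toy model, 250 steps of 1% substitution leave only about 72% of positions actually changed, because of multiple hits and reversions. So that PAM250 matrix that we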
talked about before, that’s a model for what 250% amino
acid substitution looks like. How does that even make sense? How can you have more than 100%? Is anyone with me on this? Tim, yeah. AUDIENCE: Because
it can go back. So it’s more likely, in
some cases, that you revert rather than [INAUDIBLE]. PROFESSOR: Right. So a PAM10 matrix means, on
average, 10% of the residues have changed. But a few of those
residues might have actually– so maybe about
90% won’t have changed at all. Some will have
changed once, but some might have even changed
twice, even at 10%. And when you get to
250%, on average, every residue has
changed 2 and 1/2 times. But again, a few residues
might have remained the same. And some residues that
change– for example, if you had an isoleucine
that mutated to a valine, it might have actually changed
back already in that time. So it basically accounts for
all those sorts of things. And if you have commonly
substituted residues, you get that type of
evolution happening. All right. So she took these protein
sequence alignments– it looks something like
this– and calculated these statistics. Again, I don’t want to
go through this in detail during lecture, because
it’s very well described in the text. But what I do want to do
is introduce this concept of a Markov chain,
because it’s sort of what is underlying
these Dayhoff matrices. So let’s think about it. We’ll do more on this next time. But imagine that you
were able to sequence the genomes of
cartoon characters with some newly
developed technology and you chose to analyze
the complicated genetics of the Simpson lineage. I’m assuming you all
know these people. This is Grandpa, and Homer eating a doughnut, and his son, Bart. So imagine this is
Grandpa’s genome at the apolipoprotein A locus. And a mutation occurred that
he then passed on to Homer. So this mutation occurred in the
germ line, passed on to Homer. And then when Homer passed
on his genes to Bart, another mutation occurred
here, changing this AT pair to a GC pair in Bart. So this, I would argue,
is a type of Markov chain. So what is a Markov chain? So it’s just a
stochastic process. So a stochastic process
is a random process, is sort of the general meaning. But here we’re
going to be dealing with discrete stochastic
processes, which is just a sequence of random variables. So X1 here is a random variable
that represents, for example, the genome of an
individual, or it could represent the
genotype, in this case, at a particular position,
maybe whether it’s an A, C, G, or T at one particular
position in the genome. And now the index here– one,
two, three, and so forth– is going to represent time. So X1 might be the
genotype in Grandpa Simpson at a particular position. And X2 might be the
genotype of Homer Simpson. And X3 would be the genotype
in the next generation, which would be Bart Simpson. And what a Markov
chain is, is a particular type of
stochastic process that arises commonly in
natural sciences, really, and other places
all over the place. So it's a good one to know. It has what's called the Markov property. And that says that
the probability that the next
random variable, or the genotype at the next
generation, if you will– so Xn plus 1 equals
some value, j, which could be any of the
possible values, say any of the four bases,
conditional on the values of X1 through Xn, that is the entire
history of the process up to that time, is equal to
the conditional probability that Xn plus 1 equals j given
only that little xn equals some particular value. So basically what it says
that if I tell you what Homer’s genotype
was at this locus, and I tell you what Grandpa
Simpson’s genotype was at that locus, you can just
ignore Grandpa Simpson’s. That’s irrelevant. It only matters what
Homer’s genotype was for the purpose of
predicting Bart’s genotype. Does that make sense? So it really doesn’t
matter whether that base in Homer’s genome was the same
as it was in Grandpa Simpson’s genome, or whether it
was a mutation that’s specific to Homer,
because Homer is the one who passes
on DNA to Bart. Does that make sense? So you only look
back one generation. It’s a type of
memoryless process, that you only remember
the last generation. That’s the only thing
that’s relevant. And so to understand
Markov chains, it’s very important
that you all review your conditional probability. So we’re going to do a little
bit more on Markov chains next time. P A given B, what
does that mean? If you don't remember, look it up in a probability and statistics reference, because that's
sort of essential to Markov chains. So next time we’re going to
talk about comparative genomics, which will involve some
applications of some of the alignment methods that
we’ve been talking about. And I may post some examples of
interesting comparative genomic research papers, which are
going to be optional reading. You may get a little more out
of the lecture if you read them, but it’s not essential.
