Stretched Clusters and High Availability Best Practices | vSAN

Hello, I'm Elver Sena Sosa. Now we're going to talk about stretched clusters. It turns out this is one of the reasons, besides SPBM, that I really got interested in vSAN, and I'll explain why in a minute. But first, let's describe what a stretched cluster is. Actually, this will just be a reminder, because we've been using stretched clusters since the beginning of time. With a stretched cluster, you have two locations, and you have a number of hosts in location one and a number of hosts in location two. And you connect the two so that the hosts in both locations form part of the same vSphere cluster. Hence, stretched:
different locations. When it comes to vSAN, however, there are some conditions we must meet. One of them is that in each data center you can have up to 15 ESXi hosts, for a total of 30. Then, because you have two locations, you need a third location for the witness. So we're going to have location number three, which is going to hold the witness: an ESXi host packaged as a virtual appliance. Another condition is that the round-trip time (RTT) between the data centers and that witness site should be no more than 200 milliseconds. Of course, you want it well under that for better performance, because the witness will be constantly updated with information. That said, there is also a condition between the two data centers themselves: no more than five milliseconds of round-trip delay. So the two data centers have to be close in proximity to each other, more like a metro cluster, if you wish.
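To make those limits concrete, here is a minimal Python sketch (my own illustration, not any VMware API; the validate_topology function and its inputs are hypothetical) that checks a proposed topology against the numbers above:

    # Hypothetical sanity check for the stretched-cluster limits described above.
    MAX_HOSTS_PER_SITE = 15    # up to 15 ESXi hosts per data site (30 total)
    MAX_SITE_RTT_MS = 5        # data site <-> data site round-trip time
    MAX_WITNESS_RTT_MS = 200   # data site <-> witness site round-trip time

    def validate_topology(site1_hosts, site2_hosts, inter_site_rtt_ms, witness_rtt_ms):
        """Return a list of constraint violations (empty means the topology fits)."""
        problems = []
        if site1_hosts > MAX_HOSTS_PER_SITE or site2_hosts > MAX_HOSTS_PER_SITE:
            problems.append("each data site is limited to 15 ESXi hosts")
        if inter_site_rtt_ms > MAX_SITE_RTT_MS:
            problems.append("inter-site RTT must be <= 5 ms for synchronous writes")
        if witness_rtt_ms > MAX_WITNESS_RTT_MS:
            problems.append("witness RTT must be <= 200 ms")
        return problems

    print(validate_topology(15, 15, inter_site_rtt_ms=3.2, witness_rtt_ms=80))  # -> []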
Now, how do you configure this? You have 15 hosts and 15 hosts; how does vSAN do it? It turns out it is rather simple. When you configure vSAN, it groups each site's 15 hosts into something called a fault domain. A fault domain is a collection of ESXi hosts that vSAN, instead of treating each individual host as the failure construct, treats as a single failure construct, and it makes placement decisions based on that. So you have fault domain number one, fault domain number two, and fault domain number three, which is the witness.
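As a mental model, here is a small Python sketch (a toy model of the placement idea, not vSAN internals; the host names and the place_object helper are made up) showing why fault domains matter: each copy of an object lands in a different fault domain, so a whole-site failure takes out at most one copy, and the witness breaks the tie:

    # Toy model: vSAN places each component of an object in a distinct fault domain.
    fault_domains = {
        "FD1-site1":   ["esx01", "esx02", "esx03"],   # data site one
        "FD2-site2":   ["esx04", "esx05", "esx06"],   # data site two
        "FD3-witness": ["witness-appliance"],         # witness site
    }

    def place_object(components):
        """Assign each component (replica or witness) to its own fault domain."""
        assert len(components) <= len(fault_domains), "not enough fault domains"
        placement = {}
        for comp, (fd, hosts) in zip(components, fault_domains.items()):
            placement[comp] = f"{fd} ({hosts[0]})"    # any host in that domain will do
        return placement

    # One mirrored object: a replica in each data site, witness component at site three.
    print(place_object(["replica-A", "replica-B", "witness-component"]))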
So, we have our stretched cluster configured. What's next? We need a VM. And the VM, there it is, is going to have a failures-to-tolerate (FTT) policy of one, with RAID-1. This means one copy of the VM's objects will be written locally, and the other copy goes across the data center interconnect to the other location. So that's an FTT of one with a RAID-1 method. That's what we have.
However, there is an additional FTT you can leverage here. You see, this VM's object here lives on one host. As it stands right now, if that host dies, the recovery, the resync, has to come from the other data center, the other fault domain. That means the traffic is going to cross the DCI, and if you talk to any network engineer working in a data center, that connection is sacred. It's really, really important, so you try to minimize the traffic that goes across it so you don't impact production. Therefore, you can have a second failures to tolerate of one or two, with a method of RAID-1 or erasure coding (RAID-5 or RAID-6). And that's local, so both locations will have that. Now we have two failures-to-tolerate settings, and it gets confusing to use the same term for two different things. Therefore, the failures to tolerate between data centers is called the primary failures to tolerate (PFTT), and the other failures to tolerate, which is local to each data center, is called the secondary failures to tolerate (SFTT). So you have those options in the SPBM policy for each VM you deploy in this stretched cluster.
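To see what those policy choices cost in capacity, here is a rough Python sketch (my own arithmetic based on standard RAID overheads, not an SPBM API; the capacity_multiplier helper is hypothetical):

    # Rough raw-capacity multiplier for a stretched-cluster policy.
    # PFTT mirrors data across sites; SFTT adds protection within each site.
    LOCAL_MULTIPLIER = {
        ("RAID-1", 1): 2.0,    # two full copies per site
        ("RAID-1", 2): 3.0,    # three full copies per site
        ("RAID-5", 1): 4 / 3,  # three data components plus one parity
        ("RAID-6", 2): 1.5,    # four data components plus two parity
    }

    def capacity_multiplier(pftt, sftt, local_method):
        sites = pftt + 1       # PFTT=1 means the data lives in two sites
        local = LOCAL_MULTIPLIER[(local_method, sftt)] if sftt else 1.0
        return sites * local

    # A 100 GB VMDK, mirrored across sites, RAID-5 inside each site:
    print(capacity_multiplier(pftt=1, sftt=1, local_method="RAID-5") * 100)  # ~266.7 GB raw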
Now, the one thing I really like about this, which I mentioned at the beginning of the video: when this VM does a write, vSAN writes it locally, and the write also has to go across and be written at the other site. Until the other data center replies and acknowledges the write, the VM doesn't get its write acknowledgment. Because the write is acknowledged on both sides before the VM sees it, the data is protected: you have an RPO of zero. That means you finally have a storage solution that natively allows you to have your data replicated across two different locations without having to spend, well, lots of money on metro cluster storage. That's what I really like about this.
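Here is a toy Python sketch of that write path (a model of synchronous replication in general, not the actual vSAN data path; commit_local and commit_remote are stand-ins):

    import time

    def commit_local(data):
        return True                      # pretend the local replica persisted the block

    def commit_remote(data, rtt_ms=3.0):
        time.sleep(rtt_ms / 1000)        # pay the inter-site round trip (<= 5 ms)
        return True                      # pretend the remote replica persisted it too

    def synchronous_write(data):
        local_ok = commit_local(data)
        remote_ok = commit_remote(data)  # the ack is withheld until this returns
        if local_ok and remote_ok:
            return "ACK"                 # both copies identical -> RPO of zero
        raise IOError("no acknowledgment; the VM sees the I/O retried or failed")

    print(synchronous_write(b"block 42"))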
Now that we have an RPO of zero (and one thing I should mention, digressing a bit, is that all of this is under the same vCenter), if something happens to that VM or to this data center, vSphere HA can restart that VM over at the other site, and because you had an RPO of zero, there is no data loss. And because you're recovering your VM exactly as it was before, you don't need other replication technologies or recovery automation like SRM, for example. Native to vSphere, you can have complete protection and recovery within minutes, automated, without any human intervention. That is a stretched cluster. Elver Sena Sosa, thank you for watching.

14 Replies to “Stretched Clusters and High Availability Best Practices | vSAN”

  1. Thank you so much for this vSAN series. I learned a lot. I had watched a lot of videos nearly one hour long and did vSphere and vSAN implementation projects, but was still vague about some concepts, and this series cleared them up. Again, thank you so much. Waiting for any series that explains vSAN concepts more deeply, along with the problems admins face in day-to-day operations: object synchronization, rebalancing, errors during maintenance mode, how components get divided and at what size scale, how to maintain health checks and remove inaccessible objects, and more details on stretched clusters.

  2. This was an entertaining and very easy-to-watch series on vSAN which anyone in the field would benefit from. Great work!

  3. This was a great series to watch, thanks! I did feel a little confused about the block-level drive setup, though. Kind of wish you'd spent more time talking about physical drives, where they can go, why they aren't using a hardware RAID controller, how to recover from a failed drive, etc. I have the resources I need to learn about this on my own, but it would have made the video series that much more perfect. πŸ™‚

  4. Question: Does the witness (FD3) VM have to be in a third physical location? Can it be a VM in one of the DCs, protected by HA or even by FT? Or can I have 4 fault domains, 2 of which are VM witnesses, each in its own DC?

  5. Absolutely awesome! I watched all 12 videos in one sitting – which was totally not what I planned. Useful, practical information, presented in a straight forward, understandable manner. Excellent detail in easily digestible increments. Took copious notes; learned and clarified tons about this technology. Presentation flowed naturally, with each video setting up the foundation for the next. All terms explained in sufficient detail to be able to understand the necessary nuances of more complex concepts. Delivery pulls in the audience. This has been added to my library of "go to" references because it is devoid of marketing/sales spin and vague techno-babble. Every video filled with usable, essential information. Well done, Elver. Well done indeed. Looking forward to more of your videos.

  6. RPO=0, agreed, but in case of corruption or human mistake on the site-1 hosts, the corrupted block will be replicated to site-2 as well, so the RPO will no longer effectively be zero. Therefore you need a third site to replicate data to in async mode, so you have a delayed copy at the third site.

  7. You are really Awesome sir … You have cleared all my VSAN doubts in just 12 small but informational videos…Thank you πŸ™‚

  8. So basically, redundant WAN links are mandatory between sites, because failure of that WAN will mean no ack for writes to the stretched cluster and therefore bring down the whole cluster?

  9. Thank you so much for this vSAN series. Very easy to watch, to the point, and very well explained. I will look forward to more videos from you. Thanks again for the vSAN 100 series.
