Windows Azure introduced the concepts of fault domains and upgrade domains. Fault domains relate to the physical deployment of roles, whereas upgrade domains relate to their logical deployment.
Fault domains define a physical unit of deployment for an application. The fault domain concept was introduced in Windows Azure to provide highly available services and to reduce single points of failure (servers, racks of servers, switches) in a data center. In Windows Azure, a rack of computers is identified as a fault domain.
Windows Azure determines the allocation of service instances to fault domains at deployment time, and the service owner cannot control it. Because fault domains map to separate racks of computers, service instances are spread across hardware well enough that they are unlikely to all fail at the same time.
Note that to qualify for the guaranteed 99.95% SLA you must have two or more role instances, which will be deployed to different fault domains. You can find more about the Cloud Service SLA on the Service Level Agreements page.
Upgrade domains define a logical unit of deployment for an application. The upgrade domain concept was introduced in Windows Azure to keep services highly available while an application is being upgraded.
The number of upgrade domains can be configured as part of the service definition file (.csdef). The default number of upgrade domains is 5, and the maximum is 20.
<ServiceDefinition name="service-name"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition"
    schemaVersion="version"
    upgradeDomainCount="number-of-upgrade-domains">
  <!-- .... -->
</ServiceDefinition>
Windows Azure distributes the instances of a role as evenly as possible across the configured number of upgrade domains. For example, if the default number of upgrade domains is used and a service has five instances, each instance is assigned to its own upgrade domain. If the service has ten instances, each upgrade domain has two instances. If the service has 14 instances, the first four upgrade domains have three instances each, and the last one has two.
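The arithmetic behind these examples can be sketched in a few lines of Python. This is only an illustrative model that assumes a simple round-robin placement; the function name distribute_instances is hypothetical, and the actual allocation logic is internal to Windows Azure.

```python
def distribute_instances(instance_count, upgrade_domain_count=5):
    """Return how many instances land in each upgrade domain, in domain order.

    Assumes round-robin placement, which reproduces the examples in the text.
    This is an illustrative model, not the platform's actual algorithm.
    """
    if not 1 <= upgrade_domain_count <= 20:
        raise ValueError("upgrade domain count must be between 1 and 20")
    if instance_count < 0:
        raise ValueError("instance count must not be negative")

    # Every domain gets the base share; the first `extra` domains get one more.
    base, extra = divmod(instance_count, upgrade_domain_count)
    return [base + 1 if domain < extra else base
            for domain in range(upgrade_domain_count)]


if __name__ == "__main__":
    for count in (5, 10, 14):
        print(count, distribute_instances(count))
    # 5 [1, 1, 1, 1, 1]
    # 10 [2, 2, 2, 2, 2]
    # 14 [3, 3, 3, 3, 2]
```

Under the same assumption, a service with three instances and five upgrade domains would occupy only the first three domains, leaving the remaining two empty.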
Note that Windows Azure determines the allocation of a service instance to a particular upgrade domain at deployment time, and the service owner cannot control it.
Note also that the number of upgrade domains does not have to equal the number of fault domains, so a single application could span several upgrade domains while being deployed to only two separate fault domains.
How a deployment proceeds
During deployment, all instances of the upgraded role that belong to the first upgrade domain are stopped, upgraded, and brought back online. Once they are back online, the process is repeated for the second upgrade domain (all instances stopped, upgraded, and brought back online), then the third upgrade domain, and so on until all instances in all upgrade domains have been upgraded.
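The walk through the upgrade domains can be simulated with a short Python sketch. It only models which instances are offline at each step; the function name rolling_upgrade and the instance names are hypothetical, and the mapping of instances to domains is assumed to be given.

```python
def rolling_upgrade(assignment):
    """Simulate an upgrade that proceeds one upgrade domain at a time.

    `assignment` maps an instance name to its upgrade domain index.
    Returns a list of (domain, offline_instances, online_instances) steps.
    Illustrative model only; names and structure are assumptions.
    """
    steps = []
    # Domains are processed in ascending order; empty domains are skipped.
    for domain in sorted(set(assignment.values())):
        offline = sorted(name for name, d in assignment.items() if d == domain)
        online = sorted(name for name, d in assignment.items() if d != domain)
        steps.append((domain, offline, online))
    return steps


if __name__ == "__main__":
    # Three instances spread over the first three of five upgrade domains.
    service = {"Web_IN_0": 0, "Web_IN_1": 1, "Web_IN_2": 2}
    for domain, offline, online in rolling_upgrade(service):
        print(f"domain {domain}: upgrading {offline}, still serving {online}")
```

In this model only the instances of one upgrade domain are offline at any step, which is why a service with at least two instances in different upgrade domains keeps serving requests throughout the upgrade.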
The screenshots below show the cloud service instances screen of the Windows Azure portal during an upgrade of a service with three instances and five (the default) upgrade domains.
Please note that during deployment you can decide whether you want to update all of the roles in your service or a single role in the service.