My Top Cloud Risks and their Mitigation

Most customers have a risk management, but with the new cloud area, there are new risks too. Every customer has to assess them for himself and work out possible compensation strategies. The following is my personal assessment of the most common cloud risks I see and hear from customers.

R1: Microsoft will not pay compensation if the Azure service is not performed

Occurrence probability: possible

Impact: medium

Description: In contrast to providers with an on-premises data center, the large cloud providers are only liable for the failed compute hours (see 6b). I.e. if some Azure Services fail for 4 hours and during this time a WebShop is unreachable, the customer only receives as compensation the amount that the Azure Services would have cost in these 4 hours. However, he does not receive any compensation for the lost sales of his WebShop or the negative external impact that this failure has.

Assessment: On the one hand, Microsoft regularly exceeds its own SLA, for example with an uptime of 99.995 in the period from July 2018 – July 2019. On the other hand, critical systems should always be protected against failure.

Mitigation: It is therefore advisable, even if the availability of Azure services is very high, to think of a fail-safe design and to make important systems resilient.

R2: Services can be discontinued with a short preparation time

Occurrence probability: unlikely

Impact: high

Description: In the past, the lead time for expiring Azure Services was only 3 months. But today, Microsoft is now talking about a Modern Lifecycle Policy. A notification of 30 days applies when customers have to carry out activities due to changes in services. The MLP also includes a lead time of 12 months if a service is discontinued.

Assessment: Due to this 12 month period there is enough time for larger applications to counteract the loss of services. In addition, services are usually announced earlier and operated beyond the contractually agreed 12 months. In addition, services that are no longer available are generally replaced by more modern and new services that even have migration scenarios for the services that are no longer available.

Mitigation: Mitigation is very difficult and makes little sense. The use of IaaS and the self operation of all application parts could help, but many advantages were lost. The use of higher quality PaaSe and SaaSe makes the cloud so flexible, fast and attractive.

R3: Using Azure creates a “vendor lock-in” to Microsoft

Occurrence probability: very likely

Impact: low

Description: The more high-quality services of a cloud provider are used (in Azure, for example, Managed SQL Server, EventHub, StreamAnalytics, Functions, Container Instances, …) the greater the advantages for the application and the customers. However, this also increases the binding to these services, which are only available from the selected cloud provider. A quick move, which many customers always want as an exit strategy, is then no longer possible.

Assessment: Yes, there is clearly a vendor lock-in. Only through the use of higher quality services, the advantages unfold and the use of the cloud makes sense. In a way, the effect is wanted and unavoidable.
If some customers still demand a “quick move”, this is often a bit of an illusion. On the one hand, the use of cloud services requires governance that differs for each cloud provider and, on the other hand, an operating team is also always required, which often then only has to be trained or purchased by a service provider.
The most common reason to avoid vendor lock-in is the failure of a hyperscaler. But the worldwide failure of all regions and all data centers is an extremely absurd risk (see R7).

Mitigation: If a quick exit should be possible, only infrastructure services can be used, because all cloud providers offer a similar range of services for infrastructure. The workload can usually only consist of VMs or containers.

R4: Malicious Microsoft employees can easily take customer data

Occurrence probability: unlikely

Impact: high

Description: Some customers are concerned that Microsoft employees can directly access all Azure services and the data they contain.

Assessment: The probability of this scenario is very low. On the one hand, direct access by Microsoft employees is prohibited or is only permitted under extreme conditions. Second, most Azure services are encrypted out-of-the-box. For example, an Azure Storage Account is always created with encryption activated, which cannot be deactivated either.

Mitigation: As extended protection, customers can also provide their own key and use it with many Azure services. This means that even in the event of a data theft by Microsoft, no data can be accessed.

R5: Access of other Azure customers to systems or data through shared infrastructure

Occurrence probability: unlikely

Impact: high

Description: The cloud often works on shared hardware, i.e. VMs from different customers run on the same hardware. Some customers fear that external VMs will have access to their own VM.

Assessment: However, this has been a known problem since there are shared hosts and it is exactly the business model of all hyperscalers. Therefore, data separation pays special attention to the fact that customer data is not accessible from outside.

Mitigation: For an even stronger separation, many services are available on dedicated hardware in Azure. These include, for example, App Service Environment or Dedicated Hosts. Another option is DC machines, which encrypt the data not only when it is persisted, but also when it is processed in memory.

R6: Internet-accessible infrastructure in the cloud is subject to frequent attacks

Occurrence probability: very likely

Impact: low

Description: Many customers fear that they will be exposed to more frequent attacks if they store their data with one of the large cloud providers. A small data center doesn’t seem to be attacked as often by hackers because it’s less known.

Assessment: It is true that Azure is subject to frequent attacks. Microsoft publishes 1.5 million attacks per day. However, Microsoft is also opposing a large team of 3500 security engineers and IT forensics. All data centers are certified many times, for Germany e.g. for the EU Model Clauses or the GDPR. Smaller data centers often cannot provide that amount of certificates.
The individual regions are additionally protected, for example by automatic DDoS detection.
In addition, Microsoft operates with its Red Team – Blue Team approach a continuous improvement in the security of their own systems.

Mitigation: Additional services or appliances are available for every customer, which can be integrated in every solution to get more security features. The spectrum is very broad and starts with subnets and NSGs up to Azure Sentinal, as a complex SIEM solution.

R7: Total Azure blackout (all regions)

Occurrence probability: unlikely

Impact: high

Description: Sometimes customers speak of “a possible blackout of the Azure Cloud”, which means all Azure data centers are down.

Assessment: The probability of this scenario is extremely low. Microsoft currently operates more than 60 regions with several data centers. There are separate areas for power supply and fire compartments within a data center. Services can be distributed across multiple Availability Zones, placing the workload in multiple buildings in the region. Due to the decentralized structure, one region cannot influence the performance of another region.
However, the major cloud providers can also affect central services. For example, backups were automatically imported in some SQL databases in January 2019 and services could not be reached. Even though the problem affected several services and customers, but all regions and most users were not affected.
Nevertheless, some customers have a strategy of distributing the workload across two different cloud providers, for example, running half of the workload in AWS and the other half in Azure. But here you get similar problems as with the distribution in R3. In addition, the latencies between the workloads increase enormously and the traffic causes additional costs.

Mitigation: To compensate for the possible failure of an entire Azure region, the distribution or duplication of resources is needed. This scenario requires global routing, administration becomes more complex and the costs almost doubles.

R8: Deletion of data may be delayed

Occurrence probability: possible

Impact: low

Description: The Azure infrastructure is gigantic and the internal administration runs asynchronously in many parts. So if we delete a service or data within a service, this often happens in the background and is not done immediately. Many customers fear breaches of the GDPR or that deleted data can still be tapped by attackers.

Assessment: It is true that certain deletion processes have to happen in the background. This is the case with all major cloud providers and is by design. Deleted data usually disappear immediately from the users view, but some still exist in the background. For example, if services are deleted and created shortly thereafter with the same name, error messages are thrown. With PaaSes, of course, data that has been deleted can be created again immediately, since no such behavior can be observed.
Nevertheless, due to the strict data separation (see R5) and various security certifications, e.g. also GDPR, no real risk exists here.

Mitigation: Mitigation is not possible. Through the shared responsibility model, the management of the services and the logic below the API layer is provided by Microsoft and can no be influenced.

R9: No new service can be provisioned in Azure

Occurrence probability: possible

Impact: high

Description: A rather new risk is that a data center could be full and no new service can be provisioned.

Assessment: This probability has been extremely low in recent years. Microsoft currently operates 60 Azure regions, which consist of several data centers. As the global cloud usage increased in the last years, the hardware in the regions is constantly renewed and more powerful, additional the capacity of the data centers also increases.
Since Covid-19, the worldwide usage of Azure has increased rapidly in a short time. This sudden global increase leads to bottlenecks in all existing services and in the provisioning of new services. This applies not only to Microsoft, but to all major cloud providers.

Mitigation: Microsoft itself implements many strategies to keep the availability of the services high, for example the file size restrictions in MS Teams. As a customer, however, there is only the option of provisioning the services in other regions that are lesser used. Thanks to the Microsoft Backbone, the connection between the regions is very fast.

My Top Cloud Risks and their Mitigation

R1: Microsoft will not pay compensation if the Azure service is not performed

R2: Services can be discontinued with a short preparation time

R3: Using Azure creates a “vendor lock-in” to Microsoft

R4: Malicious Microsoft employees can easily take customer data

R5: Access of other Azure customers to systems or data through shared infrastructure

R6: Internet-accessible infrastructure in the cloud is subject to frequent attacks

R7: Total Azure blackout (all regions)

R8: Deletion of data may be delayed

R9: No new service can be provisioned in Azure

Thomas Zühlke

Schreibe einen Kommentar Antworten abbrechen