Resiliency has always been an important factor in the enterprise architecture roadmap. With ongoing innovation in technology and growing competition, organizations need to move to the cloud as part of a sustainable strategy. Application resiliency is the ability of a system to maintain a minimum viable service in the face of disruptions and challenges to normal operating conditions. Building resiliency into business-critical applications ensures that IT teams can quickly identify and resolve the typical issues that affect application performance in the cloud.
Why is resiliency needed for Cloud?
Cloud provides undeniable benefits to organizations: cost savings, scalability, reliability, improved customer experience through reduced downtime, ease of operations, and more. Resiliency is not just about avoiding failure; it also involves accepting failure and building and automating the next steps that allow the application to respond to the event and return to a fully functioning or optimal state as quickly as possible.
Creating cloud-resilient applications helps organizations strike the right balance between the agility the cloud provides and their risk tolerance. Adding resiliency to the cloud roadmap helps organizations manage applications with cloud design patterns, embed risk-handling capabilities, respond to unpredictable changes faster, and increase customer satisfaction.
Cloud design patterns to enhance resiliency
Traditionally, we have used multiple design patterns to build resiliency into on-premises applications. As part of the cloud migration roadmap, resiliency should be treated as a key metric in the design phase.
Failure mode analysis (FMA) is a process for building resiliency into a system by identifying its possible failure points. FMA should be part of the architecture and design phases so that failure recovery is built into the system from the beginning. To ensure an application is resilient end to end, all fault points and fault modes must be understood and operationalized.
Below is a list of patterns that can be considered during the design stage:
Bulkhead pattern
This pattern isolates the failed parts of an application so that its other services continue to work. It is named after the partitions, called bulkheads, in a ship's hull: if one compartment is breached, the bulkheads contain the water and prevent the entire ship from sinking.
A cloud-based application may expose one or more services that are consumed by multiple clients in different combinations. Without the bulkhead pattern, a faulty request from one consumer can exhaust the service's resources and disrupt the other consumers.
Every cloud vendor provides ways to partition service instances into different groups based on consumer requirements. This helps isolate service failures and preserves capacity for the other consumers.
This pattern is widely used to isolate critical applications from standard ones, to protect against cascading failures, and to shield specifically targeted clusters of backend services.
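As a minimal sketch of the bulkhead idea, the hypothetical class below caps the number of concurrent calls allowed into each partition with a semaphore, failing fast once a partition is full. The class name, limits, and rejection behavior are illustrative assumptions, not a specific vendor's API.

```python
import threading

class Bulkhead:
    """Caps concurrent calls into one partition so a misbehaving
    consumer cannot exhaust resources shared with other consumers."""
    def __init__(self, max_concurrent):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, *args):
        # Non-blocking acquire: reject immediately instead of queuing,
        # so the caller can degrade gracefully.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args)
        finally:
            self._sem.release()

# Separate bulkheads per consumer group isolate failures:
critical = Bulkhead(max_concurrent=10)
standard = Bulkhead(max_concurrent=3)

result = critical.call(lambda x: x * 2, 21)  # → 42
```

In practice the same partitioning is achieved with separate connection pools, thread pools, or service instances per consumer group.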
Retry pattern
One of the most common failures we see is a transient issue: a system is momentarily unavailable, or requests time out because a downstream service is busy processing previous payloads. These can generally be handled by implementing a delay in the calling application and retrying after some time.
Cloud applications can be designed to handle these faults gracefully, reducing their impact on the business logic currently being performed.
The main retry strategies are Cancel, Retry, and Retry after delay. Other factors to consider while adding retries to an application are logging and monitoring: proper logging helps operations teams investigate subsequent behavior and provides room for improvement.
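The strategies above can be sketched in one small helper. This is an illustrative example, assuming timeouts are the transient fault being retried; the function name, attempt count, and backoff constants are hypothetical.

```python
import logging
import time

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("retry")

def retry_with_backoff(operation, max_attempts=3, base_delay=0.1):
    """Retry-after-delay strategy with exponential backoff;
    falls back to Cancel (re-raise) after the final attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError as exc:  # treat timeouts as transient
            log.warning("attempt %d failed: %s", attempt, exc)
            if attempt == max_attempts:
                raise  # Cancel: stop retrying, surface the fault
            time.sleep(base_delay * 2 ** (attempt - 1))

# A stand-in downstream that succeeds only on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("downstream busy")
    return "ok"

result = retry_with_backoff(flaky)
print(result)  # prints "ok" after two logged retries
```

The logging calls matter as much as the retries themselves: the warning per attempt is what gives operations visibility into how often the downstream is struggling.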
Circuit breaker Pattern
While there is a lot to cover on this pattern, we will keep this part concise and focus on why it is needed in the cloud.
Some faults continue to cause issues even after the retry strategies have run. The circuit breaker pattern lets the application detect whether the fault has been resolved before retrying again, reducing unnecessary execution cycles and resource consumption.
The purposes of the circuit breaker and retry patterns are different, though they can go hand in hand in an implementation. The circuit breaker acts as a proxy that monitors the number of recent failures and uses that information to decide whether to allow an operation to proceed or simply return an exception. The pattern is customizable and can be adapted to the type of failure.
Other considerations while implementing this pattern: exception handling, logging, recovery, testing failed operations, manual override, concurrency, resource differentiation, accelerated circuit breaking, replaying failed requests, and inappropriate timeouts on external services.
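The proxy behavior described above can be sketched as a small state machine with the usual closed, open, and half-open states. The thresholds and class shape here are illustrative assumptions, not a reference implementation.

```python
import time

class CircuitBreaker:
    """Minimal closed/open/half-open breaker. Closed: calls pass
    through. Open: calls fail fast. Half-open: one trial call is
    allowed after reset_timeout to probe whether the fault cleared."""
    def __init__(self, failure_threshold=3, reset_timeout=1.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0
            self.opened_at = None  # success closes the circuit
            return result
```

Failing fast while open is the point: callers stop burning execution cycles on an operation that is known to be broken, and the downstream service gets time to recover.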
Throttling
Throttling means controlling the consumption of resources used by an instance of an application or service. One strategy in the cloud is autoscaling to meet business demand, but this can push costs beyond the targeted budget. Alternatively, we can autoscale up to a certain limit and, once the threshold is met, throttle requests from one or more users by rejecting, disabling, or degrading functionality, using load leveling, or deferring work based on priorities. A system can use a priority queue as part of the throttling strategy to maintain performance for critical or higher-priority applications.
A few things to consider while implementing throttling are SLAs, cost optimization, setting up priorities, and monitoring.
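A priority-aware throttle can be sketched as two thresholds: below a soft limit everything is admitted, between the soft and hard limits only critical traffic gets through, and above the hard limit everything is rejected. The class, limits, and priority scheme below are hypothetical, standing in for whatever admission control a platform provides.

```python
class PriorityThrottle:
    """Degrades gracefully under load: standard traffic is shed
    first, critical traffic is admitted until the hard limit."""
    def __init__(self, soft_limit, hard_limit):
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.in_flight = 0

    def admit(self, priority):
        # priority: 0 = critical; larger numbers = less important.
        if self.in_flight >= self.hard_limit:
            return False  # reject everything: capacity exhausted
        if self.in_flight >= self.soft_limit and priority > 0:
            return False  # shed standard traffic first
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1

throttle = PriorityThrottle(soft_limit=2, hard_limit=3)
```

Pairing this with autoscaling up to the soft limit gives the cost/availability balance described above: scale while it is affordable, then degrade selectively instead of paying for more capacity.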
Rate Limiting Pattern
Rate limiting is a simple strategy for limiting traffic: we define a cap on how often a service or a consumer can repeat an action within a certain timeframe. This pattern is widely used in batch processing. While a throttling strategy can increase traffic and throughput when retrying failed operations, a rate-limiting strategy keeps that behavior under control.
Rate limiting can be performed based on the number of operations, the amount of data, or the cost of operations.
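One common way to cap the number of operations per timeframe is a token bucket, sketched below. The class name and parameters are illustrative; real platforms expose this as a managed quota or gateway policy.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allows bursts of up to `capacity`
    operations, refilled continuously at `rate` tokens per second."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # operation is within the cap
        return False      # cap exceeded: defer or reject
```

The same structure covers the other two bases mentioned above: spend tokens per byte to limit the amount of data, or a weighted number of tokens per call to limit by cost of operations.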
Queue-Based Load Levelling
This design pattern uses a queue that acts as a buffer between a task and a service it invokes in order to smooth intermittent heavy loads that can cause the service to fail or the task to time out. This can help to minimize the impact of peaks in demand on availability and responsiveness for both the task and the service.
Many solutions in the cloud involve running tasks that invoke services. In this environment, if a service is subjected to intermittent heavy loads, it can cause performance or reliability issues. Flooding a service with a large number of concurrent requests can also result in the service failing if it’s unable to handle the contention these requests cause.
This pattern is useful for any application that uses services subject to overloading, but it isn't useful when the application expects a response from the service with minimal latency.
A related pattern is the Competing Consumers pattern: multiple instances of a service run in parallel, each acting as a message consumer on the load-leveling queue. You can use this approach to adjust the rate at which messages are received and passed to the service.
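Both ideas can be sketched together with the standard library: a queue absorbs a burst of requests, and two worker threads act as competing consumers draining it at their own pace. The doubling "service call" and the sentinel shutdown are illustrative stand-ins.

```python
import queue
import threading

# The queue buffers bursts so the service is never hit with the
# full peak load at once (queue-based load leveling).
tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        item = tasks.get()
        if item is None:          # sentinel: shut this worker down
            tasks.task_done()
            break
        with results_lock:
            results.append(item * 2)  # stand-in for the real service call
        tasks.task_done()

# Two competing consumers drain the same load-leveling queue.
workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()

for n in range(5):                # a burst of incoming requests
    tasks.put(n)
for _ in workers:
    tasks.put(None)
tasks.join()
print(sorted(results))            # → [0, 2, 4, 6, 8]
```

Adding or removing workers adjusts the drain rate without touching the producers, which is exactly the knob the Competing Consumers pattern provides.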
Conclusion: A sustainable strategy for resilient Cloud Design
Think through the customer vision to build the right cloud strategy. A resiliency-focused team should critically examine all components of the application stack to understand every possible failure scenario that feeds into application design. The solution design team should work with application, network, security, and infrastructure architects to develop an interaction model of all the components that are part of the overall application stack.
In an "always available" world, the cloud poses inherent risks, and the complexity and impact of disruptions require detailed, focused attention. Understanding each risk scenario, its impact, and its mitigation is what produces resilient applications that bring value to the customer.
In upcoming posts, we will present a detailed strategy for using these patterns and examine their effectiveness when implemented on specific cloud applications.