Approaches to Fault Tolerance and Disaster Recovery in DevOps Processes

Maksym Karyonov
Faculty of Information Technology and Cyber Security, State University of Intellectual Technologies and Communications, Odesa, Ukraine
Correspondence to: Maksym Karyonov, maksymkaryonov@gmail.com

Premier Journal of Science

Additional information

  • Ethical approval: Ethical approval was given by the Ethics Commission of the State University of Intellectual Technologies and Communications with No. HG-764
  • Consent: The participants were informed that participation was anonymous and voluntary, and they provided their consent.
  • Funding: No industry funding
  • Conflicts of interest: N/a
  • Author contribution: Maksym Karyonov – Conceptualization, Writing – original draft, review and editing
  • Guarantor: Maksym Karyonov
  • Provenance and peer-review: Unsolicited and externally peer-reviewed
  • Data availability statement: N/a

Keywords: Fault-tolerant DevOps, Microservices scalability, Kubernetes orchestration, Adaptive load balancing, Automated disaster recovery.

Peer Review
Received: 25 August 2025
Last revised: 13 October 2025
Accepted: 14 October 2025
Version accepted: 4
Published: 15 November 2025

Plain Language Summary Infographic
The infographic, ‘Approaches to Fault Tolerance and Disaster Recovery in DevOps’, presents three sections: (1) Background, explaining the study on improving reliability and failure recovery in software development and operations; (2) Fault tolerance methods, including microservices, serverless computing, monitoring, containerisation, load balancing, and disaster recovery planning, each illustrated with icons; and (3) Disaster recovery strategies such as data replication, backup, automated testing, and preparedness planning. The infographic uses bold colours (orange, blue, and green) with clear text and flat DevOps-style icons.
Abstract

Background: The purpose of the study was to analyse approaches to improving reliability and recovery from failures in the processes of software development and operation.

Materials and Methods: The research utilised a qualitative methodology, conducting semi-structured interviews with industry experts to investigate fault tolerance and disaster recovery solutions within DevOps processes. Grounded theory coding was applied to analyse the data, identifying key themes that were mapped to the research questions and comparison matrix.

Results: The results showed that microservices architecture is highly flexible and scalable, which allows for faster adaptation to changes in workload and reduces the likelihood of system failures. However, the complexity of managing dependencies between services can introduce new problems. Serverless computing can be effective in reducing infrastructure costs and simplifying scaling, but it requires detailed monitoring of function execution and management of function resources. Containerisation and container orchestration are well suited to providing a high level of portability and application isolation. At the same time, the growing complexity of orchestration systems can be a challenge for the team responsible for their administration.

Cloud computing and load balancing should be used to allocate resources and reduce the risk of stress on individual components, but they require constant monitoring to ensure optimal performance and identify potential problems. In addition, data replication and backup provide reliable protection against data loss, but the increase in data volume places new demands on the speed and efficiency of backup processes. Monitoring and alerting can improve timely detection and response to problems but require regular updates and adaptation to new threats. Automated testing and disaster recovery planning have proven to be key to reducing system recovery times, but the complexity of setting up automated scripts and tests remains an essential issue.

Conclusion: Overall, the results indicate the need for continuous improvement of existing methods to ensure increased resilience and adaptability of systems to new threats when working with software.

Highlights

  • Microservices architecture offers high flexibility and scalability but increases dependency management complexity.
  • Serverless computing reduces infrastructure costs while requiring detailed function monitoring and resource management.
  • Containerization provides superior portability and isolation but demands specialized orchestration expertise.
  • Cloud computing and load balancing optimize resource allocation but need continuous performance monitoring.

Introduction

The Development and Operations (DevOps) methodology emerged as a response to the need to integrate development and operational support processes to speed up product delivery and improve product quality. DevOps is focused on automating and monitoring all stages of software development, from code integration to deployment and infrastructure management. One of its key aspects is the implementation of practices that reduce the risk of failures and allow for quick recovery, which ensures high system resilience. Despite advances in technology, many organisations face challenges in ensuring fault tolerance and rapid disaster recovery in their DevOps processes. The lack of effective risk management and disaster recovery methods can lead to long downtime, reduced productivity, and user dissatisfaction.1 This research aims to address these issues by analysing different approaches and technologies that can help reduce risks and increase system reliability in a DevOps environment.

Analysing similar studies is an important aspect of identifying existing gaps and implementing best practices in this work. For example, Orlov and Pasichnyk2 emphasised the importance of implementing the DevOps methodology to accelerate software development and improve the quality of software, focusing on the need for continuous monitoring and optimisation of processes, as well as training of qualified personnel for the effective implementation of DevOps in technology projects. The authors Onyshchenko et al.3 pointed out that the use of artificial intelligence in combination with DevOps tools is critical to optimise testing, especially in today’s high requirements for efficiency and speed of development. The results of the study by Lytvynov et al.4 showed the effectiveness of integrating DevOps and Development, Security, Operations (DevSecOps) to optimise the interaction between developers, testers and security specialists. Through the use of centralised log management systems like ELK Stack and Graylog, this integration improves overall system monitoring and troubleshooting, boosts efficiency by automating log management procedures, and improves communication by simplifying the exchange of log data.

Researchers Capizzi et al.5 noted that data management plays a key role in DevOps processes, where Big Data solutions are successfully used to optimise data collection, analysis and storage, which improves the quality and speed of software development. In addition, Valenzuela Robles et al.6 identified approaches to improve effort estimation in software development projects in the context of DevOps and also emphasised the importance of extending effort estimation from the development phase to the operational phase, which helps to reduce discrepancies between initial and actual cost and time estimates. In addition, after analysing the detection and recovery of network failures, researchers Kumar et al.7 demonstrated the benefits of proactive strategies, such as predictive maintenance and network rebooting, which ensure network resilience and autonomy through the use of machine learning and deep learning algorithms.

Sandu8 highlighted the importance of integrating security practices into the DevOps lifecycle to increase resilience and reduce security risks, emphasising the automation of security procedures, resilience testing, cultural change, and new trends in DevSecOps. Tatineni9 noted that DevOps has a significant impact on the quality and reliability of software, pointing out the importance of cultural change in the organisation and the introduction of continuous delivery, feedback, and automation as part of the complex impact of DevOps on software. Finally, the results of Craciun and Necula10 have shown that the effectiveness of DevOps implementation largely depends on the organisational structure, as start-ups and their cultures contribute to the effective adaptation of DevOps, which ensures technological progress and cultural synergy, compared to large companies.

Thus, the main conclusions of the works of these authors pointed to the importance of continuous monitoring, automated testing and security processes, as well as the need to integrate modern technologies to improve the efficiency and quality of development. This study explores the effectiveness of fault tolerance and disaster recovery strategies in DevOps processes, aiming to identify optimal methods for ensuring system reliability and minimising downtime in dynamic environments. The study is guided by the following research questions:

  • What are the benefits and difficulties linked to various fault tolerance solutions (e.g., microservices, serverless computing, containerisation, cloud computing)?
  • How do these fault tolerance tactics help with disaster recovery in real-world DevOps settings?
  • How can the experiences of industry professionals be used to judge how well these strategies work?

Materials and Methods

Based on a qualitative methodology, this study examines fault tolerance and disaster recovery strategies in DevOps processes. The research’s empirical foundation is a thorough case study methodology, enhanced by interviews with experts in the fields of operations and software development, specifically DevOps engineers and system architects, who provided insights into the practical implementation of fault tolerance techniques. In addition to the interviews, the study examined publicly accessible documentation, case studies, and operational data from leading businesses that use DevOps techniques, including Netflix and Spotify.

Participants

Participants were selected using purposive sampling to ensure the inclusion of individuals with extensive experience in DevOps processes. Ten experts were interviewed, drawn from fields such as technology, software development, and cloud computing (Table 1); all worked for firms applying modern DevOps practices. The study’s inclusion criteria required participants to have at least five years of experience in DevOps roles or as system architects, with direct engagement in implementing fault tolerance or disaster recovery strategies. Individuals with minimal practical experience in operational settings or insufficient exposure to DevOps approaches were excluded. Recruitment was performed through professional networks (LinkedIn, DevOps community Slack channels) and by direct email invitation explaining the research purpose, voluntary participation, and confidentiality measures. Out of 15 invitations, 10 experts consented to participate.

Table 1: Characteristics of participants.
Characteristic | Description
Gender | 7 male, 3 female
Age | 28–46 years (mean = 37.2)
Years of DevOps experience | 5–15 years (mean = 9.8)
Industry sectors | Software development (Participants 1–5), Cloud computing (Participants 6–8), IT operations (Participants 9–10)
Geographic location | USA (3), Ukraine (2), Poland (1), Germany (2), India (2)
Source: Created by the author.

Interview Guide

The interviews were semi-structured and had open-ended questions about the advantages and limitations of different fault tolerance systems and how they may be put into practice. Interviews were conducted via video calls and were recorded with consent, and each lasted between 30 and 60 minutes. The questions looked into the following topics:

  • Microservices: “Can you describe how your organisation has implemented microservices architecture, and what advantages and challenges have you encountered in terms of fault tolerance and disaster recovery?”
  • Serverless Computing: “How does serverless computing impact your system’s ability to recover from failures? What are some of the trade-offs you’ve experienced, particularly in managing resource allocation and cold-start issues?”
  • Containerisation: “How has the use of containerisation and container orchestration contributed to fault tolerance in your environment? Can you provide examples of how these technologies have improved or complicated system recovery?”
  • Cloud Computing: “In what ways have cloud technologies, such as AWS or Azure, enhanced your ability to ensure high availability and quick recovery during disruptions? What limitations have you faced when relying on cloud platforms for fault tolerance?”
  • Automated Recovery: “How does your organisation approach automated recovery? What role does automation play in reducing downtime during system failures, and what challenges have you faced in implementing automated recovery processes?”

These questions were designed to elicit detailed answers and specific insights grounded in the interviewees’ own experience. The researcher aimed to reach data saturation, the point at which interviews no longer reveal new themes or ideas. Saturation was reached after eight interviews; two further interviews confirmed that no new themes emerged.

Coding and Codebook

The qualitative data underwent analysis by grounded theory coding, entailing a three-stage process. Initially, all transcripts were examined line by line during open coding in order to find recurrent behaviours, viewpoints, and procedures associated with disaster recovery and fault tolerance. Axial coding was then used to arrange conceptually similar codes into more general categories that showed connections between organisational characteristics and technology practices. Lastly, these categories were combined via selective coding around key elements that describe systemic resilience in DevOps systems. The audit trail, maintained in NVivo software, comprised timestamped transcripts, coding memoranda, code modifications, and analytical reflections. This traceable record enhanced the confirmability of conclusions and guaranteed process transparency. The author performed all coding independently to ensure consistency. The codebook was created in stages (Appendix 1), starting with a rough set of categories that was refined as the study progressed.

Triangulation and Reliability Check

Methodological triangulation was used to cross-reference a number of independent sources of evidence in order to assure the study’s validity and credibility. These sources included technical reports and white papers that described system resilience architectures, expert interview transcripts, documentation, and case studies from top DevOps-oriented companies like Netflix, Spotify, GitLab, and AWS. By using a triangulated approach, the researcher was able to ensure analytical consistency and minimise potential bias by comparing the subjective observations of practitioners with documented industry practices. Additionally, preliminary topic summaries were sent back to each of the 10 participants for their input in order to perform member checking. The accuracy of the interpretations was verified by each expert, which strengthened the findings’ credibility and the thematic categorisation’s dependability.

DevOps Architecture Analysis

An analysis of the microservices and serverless computing architectures was carried out. The paper considers how these architectures allow for efficient load distribution and isolation of system components. The advantages and disadvantages have been identified. An example to illustrate this approach is taken from the American online publishing platform Medium.11 On the other hand, the study describes the concepts of containerisation and container orchestration, which have become key technologies for modern application management. The benefits and challenges of these technologies were examined. The official documentation of the open-source Kubernetes system provides an example of the use of these technologies.12

The article also considers how cloud technologies can significantly increase the scalability and flexibility of infrastructure. The analysis is based on the examples of Amazon Web Services (AWS) and Amazon Elastic Load Balancing (ELB), using data taken from the GeeksforGeeks platform.13 Data replication and backup mechanisms were also studied, with their advantages and disadvantages illustrated by information from the official website of the software development company Rivery.14 In addition, an in-depth analysis of the monitoring and alerting process was carried out, which is important for tracking system status and responding quickly to anomalies; its benefits and drawbacks were assessed using the experience of New Relic, as presented on its official platform.15 Particular attention is paid to disaster recovery planning, which includes testing recovery plans and automating the necessary processes to minimise downtime, with data from the GitLab documentation used as an example.16

Results

With rapidly changing technologies and high requirements for business continuity, DevOps fault tolerance and disaster recovery are becoming critical tasks. DevOps is a methodology that integrates software development and operational processes to improve the speed and quality of software development. The main goal of DevOps is to ensure continuous delivery and deployment of applications, which requires effective strategies for managing failures and ensuring recovery. In DevOps, disruptions can affect any aspect of the system, including infrastructure, applications, or processes. Therefore, it is critical to implement approaches that allow responding to faults in a timely manner and ensuring rapid recovery.

The first approach to fault tolerance and recovery in DevOps is microservices architecture and serverless computing (Figure 1). Microservices architecture is an approach to software development where a large monolithic system is divided into small, independent services that interact with each other via an application programming interface (API).17,18 Each microservice is responsible for a specific functionality or set of functions and can be deployed, scaled, and updated independently of the others.19 Serverless computing allows the execution of code without the need to manage physical or virtual infrastructure.20 The execution platform automatically handles all the necessary resources, and developers focus on implementing business logic. Typically, serverless computing is implemented through a service or as part of a larger serverless environment.

Figure 1: Architecture diagram of microservices and serverless computing.
Source: Created by the author.
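As an illustrative sketch of the division of responsibility described above, the following Python fragment shows a small, independently deployable service function alongside a serverless-style handler that contains only business logic. The event shape, service name, and local driver are assumptions for illustration, and resource provisioning is presumed to be handled by the platform.

import json

# Hypothetical "payments" microservice: one narrowly scoped responsibility,
# deployable and scalable independently of the rest of the system.
def process_payment(order_id: str, amount: float) -> dict:
    # Business logic only; persistence and transport are handled elsewhere.
    return {"order_id": order_id, "amount": amount, "status": "charged"}

# Serverless-style handler (following the common event/context convention);
# the execution platform is assumed to provision resources, scale instances,
# and retire them when idle.
def handler(event, context=None):
    body = json.loads(event["body"])
    result = process_payment(body["order_id"], body["amount"])
    return {"statusCode": 200, "body": json.dumps(result)}

if __name__ == "__main__":
    # Local simulation of a single invocation.
    fake_event = {"body": json.dumps({"order_id": "A-17", "amount": 9.99})}
    print(handler(fake_event))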

Both the microservices architecture and the serverless computing approach have their advantages and limitations. Microservices architecture offers high flexibility and scalability by dividing the system into independent components, which allows isolating problems and scaling individual services as needed. The range of technologies that can be used for various microservices makes it easier to integrate new solutions. These technologies include programming languages (like Java, Python, and Go), containerisation tools (like Docker), orchestration platforms (like Kubernetes), API management systems (like Kong), and service meshes (like Istio).21,22 Microservices can be easily deployed, scaled, and managed thanks to these technologies, making it easier to add new features or swap out ageing parts. However, this flexibility comes with certain challenges, as coordination between services, deployment, and testing become more complex.

Participant 3 (Software Dev) said, “Microservices gave us a modular structure; if one service failed, others stayed unaffected, which drastically reduced downtime”. Participant 4 (Software Dev) added, “The flexibility was worth the complexity – isolating updates by service improved release frequency”. In addition, communication between services via APIs can lead to additional overheads. Serverless computing provides the ability to automatically scale resources and quickly deploy new features, which greatly simplifies development.23,24 According to Participant 6 (Cloud Comp), “Serverless helped us handle unpredictable traffic spikes automatically; the scaling was invisible to users”. However, there may be limited runtime configuration options and potential delays when first launching a function. For example, Participant 8 (Cloud Comp) said, “We noticed that serverless platforms handled scaling far better than traditional virtual machines, but the unpredictability of cold starts sometimes disrupted time-critical tasks”. Also, integrating serverless functions with existing systems may require additional effort.

As an example, Netflix uses a microservices architecture to manage various functionalities of its service, such as payment processing and content recommendations, which allows it to scale and update system components independently.11 This architecture is also supported by serverless computing, which allows Netflix to focus on implementing business logic without having to worry about infrastructure. In terms of fault tolerance, a hybrid architecture, such as a monolithic core with microservices extensions, can achieve a balance between the simplicity of a monolithic structure and the flexibility provided by microservices. A stable foundation offered by the monolithic core makes management easier and inter-service communication less complicated. However, by incorporating microservices as extensions, modularity and flexibility are made possible, allowing individual components to grow on their own and react swiftly to errors without compromising the system as a whole. This method preserves the monolithic core’s dependability and cohesiveness while enhancing fault tolerance by separating failures to specific microservices. As the interactions between the microservices and the monolithic core can add complexity to deployment, monitoring, and error recovery, the difficulty is in guaranteeing a smooth transition between the two architectures. Therefore, if a hybrid approach is carefully thought out with system interactions and fault isolation techniques in mind, it can strike a balance between flexibility and complexity.
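One common pattern for the failure isolation described here is a circuit breaker, which stops calling a failing microservice extension so that its errors do not cascade into the monolithic core. The Python sketch below is a minimal illustration of that idea; the thresholds, cool-down period, and the flaky service are hypothetical and are not drawn from the study.

import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a
    cool-down period instead of letting its failures cascade upstream."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, func, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                return fallback  # fail fast, protect the core
            self.opened_at = None  # cool-down over, try the dependency again
            self.failures = 0
        try:
            result = func(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # open the circuit
            return fallback

def flaky_recommendation_service(user_id):
    # Stand-in for a microservice extension that is currently failing.
    raise TimeoutError("extension service unavailable")

breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
for _ in range(4):
    # The core keeps serving a degraded default instead of failing outright.
    print(breaker.call(flaky_recommendation_service, "user-1", fallback=[]))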

The second approach is containerisation and container orchestration, which contribute to increased fault tolerance and system recovery efficiency (Figure 2). These technologies simplify the management of applications and their infrastructure, increasing reliability and scalability. Containerisation is a technology that allows packing software applications and all their dependencies (libraries, configuration files) into standardised containers.25 Containers provide runtime isolation, which allows applications to run in the same conditions regardless of the deployment environment. Container orchestration is the management of large-scale containerised applications and services.26 Orchestrators automate container management, including deployment, scaling, load balancing, and self-healing.

Figure 2: Scheme of containerization and container orchestration.
Source: Created by the author.
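The self-healing behaviour attributed to orchestrators can be pictured as a reconciliation loop that repeatedly compares desired and observed state and corrects the difference. The Python sketch below is a toy version of such a loop under simulated failures; it is not how Kubernetes is implemented, and the names and probabilities are illustrative.

import random

DESIRED_REPLICAS = 3  # declared target state, as in a deployment specification

def observe_running(containers):
    """Simulate a health check: each container has a small chance of dying."""
    return [c for c in containers if random.random() > 0.2]

def reconcile(containers):
    """Bring observed state back to desired state (self-healing and scaling)."""
    missing = DESIRED_REPLICAS - len(containers)
    for i in range(missing):
        new_name = f"app-replacement-{i}"
        print(f"restarting container: {new_name}")
        containers.append(new_name)
    return containers

if __name__ == "__main__":
    running = [f"app-{i}" for i in range(DESIRED_REPLICAS)]
    for step in range(5):  # a few iterations of the control loop
        running = observe_running(running)
        running = reconcile(running)
        print(f"step {step}: {len(running)} replicas running")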

There are certain advantages and disadvantages to this approach. Containerisation provides portability of applications due to isolation from the operating system, which simplifies movement between different environments and reduces the risk of conflicts. Rapid deployment of containers significantly speeds up the software delivery process.27 However, configuring containers and integrating them with other systems can be complex, and ensuring security requires additional effort. Orchestrators automate container management, ensuring high availability and efficient use of resources. For example, Spotify uses Docker to containerise its microservices, which allows them to quickly deploy new features and ensure stable operation of the service.12 Participant 10 (IT Ops) stated, “We used Docker to simplify migrations between test and production without downtime”. Using Kubernetes, Spotify automates the management of thousands of containers, which allows it to scale services according to user requirements and provides an uninterrupted music listening experience.28

Containerisation ensures consistent behaviour across different environments by offering isolated environments, which reduces failures caused by network partitioning and misconfiguration. For example, Participant 10 (IT Ops) said, “Containers guaranteed uniform environments; we stopped hearing ‘it works on my machine’”. Although it makes deployment easier and increases portability, it adds operational overhead and setup complexity to managing containers and their lifecycles, particularly in large-scale deployments. Though they add extra overhead in learning and managing infrastructure, container orchestration tools like Kubernetes simplify this somewhat, especially when scaling or integrating with other services. According to Participant 9 (IT Ops), “Kubernetes automated almost everything, but mastering its configuration curve took months”. Through the automation of deployment, scaling, and recovery, they improve fault tolerance; however, they may cause latency problems during network outages or scaling. Though costs and complexity rise with larger systems, these strategies offer enhanced resilience, high availability, and self-healing capabilities despite the trade-offs.

Next, it is worth considering cloud computing and load balancing, which are key components to achieving fault tolerance and high availability in modern systems (Figure 3). These technologies help ensure application stability and scalability, which is critical in a constantly changing and evolving environment. Cloud platforms provide a variety of services, including computing resources, storage, and networking services over the Internet.29 They provide a high level of availability, scalability, and the ability to quickly restore systems.

Figure 3: Scheme of cloud technologies and load balancing.
Source: Created by the author.

Like the others, this approach has its advantages and limitations. Cloud platforms provide high availability due to built-in recovery mechanisms and flexible scalability, which allows for effective cost management.30,31 Rapid adaptation to changing business conditions is another advantage of the cloud. However, the cost of cloud services can increase significantly as the workload increases, and setup and management can be difficult. In addition, dependence on a cloud service provider is a potential risk.32 Load balancing allows efficiently distributing traffic between servers, increasing system performance and availability.33,34 However, setting up load balancing requires specialised knowledge, and the mechanism itself can become a point of failure if configured incorrectly.35 An example of the use of cloud technologies and load balancing is the Amazon system.13 AWS, one of the largest cloud platforms in the world, uses load balancing to ensure high availability of its services. In particular, ELB automatically distributes incoming traffic between several servers, ensuring reliable operation even under high loads. For example, Participant 6 (Cloud Comp) claimed, “Cloud elasticity saved us during seasonal surges; scaling policies kicked in without notice”.

Through the distribution of traffic among several servers or data centres, cloud technologies and load balancing effectively manage network partition and cascading crash failures, guaranteeing that the failure of a single server or network segment does not affect the system as a whole. This method can improve scalability by automatically allocating resources as needed and decrease latency by guiding traffic to the closest available resource. Participant 8 (Cloud Comp) said, “Vendor dependence worried us, but managed load balancing kept the system stable”. It does have cost trade-offs, though, as cloud services can get expensive as usage rises. Coordinating several servers or platforms also makes configuration and management more difficult, particularly when handling failures in dispersed environments.
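To make the traffic-distribution idea concrete, the sketch below implements a toy round-robin balancer that skips backends marked unhealthy, so the loss of a single server does not take requests down with it. The backend names and the manual health flag are illustrative assumptions, not the behaviour of any specific cloud load balancer.

from itertools import cycle

class RoundRobinBalancer:
    """Toy load balancer: rotate across backends, skipping unhealthy ones."""

    def __init__(self, backends):
        self.backends = backends
        self.health = {b: True for b in backends}
        self._ring = cycle(backends)

    def mark_down(self, backend):
        self.health[backend] = False

    def pick(self):
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if self.health[candidate]:
                return candidate
        raise RuntimeError("no healthy backends available")

lb = RoundRobinBalancer(["srv-a", "srv-b", "srv-c"])
lb.mark_down("srv-b")  # simulate one server or network segment failing
for request_id in range(5):
    print(f"request {request_id} -> {lb.pick()}")  # traffic keeps flowing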

A crucial approach for confirming the resilience of systems in failure scenarios is chaos engineering. In order to ensure that a system can recover gracefully without significantly affecting the end-user experience, the fundamental idea behind chaos engineering is to purposefully introduce faults into it.36 Netflix’s Chaos Monkey is one of the most well-known tools for chaos engineering. To test the system’s ability to withstand unforeseen disruptions, Chaos Monkey randomly terminates virtual machine instances. Similarly, Gremlin is another platform that offers a collection of chaos engineering tools intended to replicate different kinds of failures, such as resource depletion and network outages. With the use of these tools, teams can safely stress-test systems, assisting in the discovery of flaws before they appear in real-world settings.
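The principle can be illustrated in miniature with the sketch below, which terminates a random simulated instance and then checks that a stand-in recovery mechanism restores the desired capacity. It is a conceptual illustration only and does not reflect how Chaos Monkey or Gremlin are implemented.

import random

def run_chaos_experiment(instances, desired_count, recover):
    """Kill a random instance, then verify the system heals itself."""
    victim = random.choice(list(instances))
    instances.remove(victim)
    print(f"chaos: terminated {victim}, {len(instances)} instances left")

    instances = recover(instances, desired_count)  # system under test reacts
    assert len(instances) >= desired_count, "recovery failed"
    print(f"recovered to {len(instances)} instances")
    return instances

def naive_recovery(instances, desired_count):
    """Stand-in for an auto-scaling or self-healing mechanism."""
    while len(instances) < desired_count:
        instances.add(f"replacement-{len(instances)}")
    return instances

fleet = {"vm-1", "vm-2", "vm-3"}
run_chaos_experiment(fleet, desired_count=3, recover=naive_recovery)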

Another promising method, Serverless Performance-Resource Optimisation Scheduler (SPES),37 uses serverless function invocation patterns to anticipate function requests in order to optimise cold start performance. Based on these predictions, SPES classifies functions and pre-loads the relevant instances, resulting in a 49.77% reduction in cold start latency and a 56.43% reduction in wasted memory time. Compared to conventional pre-loading techniques, which frequently overlook function invocation patterns and produce subpar optimisation, this approach provides a substantial improvement. For serverless functions, SPES is a unique scheduling method that provides a more precise trade-off between performance and resources. SPES offers a more effective model for serverless environments by addressing cold start latency and resource waste through better function provisioning.
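The prediction-and-pre-loading idea behind SPES can be conveyed with a much simpler sketch: track recent invocation intervals per function and pre-warm instances for functions expected to be called again soon. The heuristic below is an illustrative simplification and does not reproduce the SPES classification scheme or its reported results.

import time
from collections import defaultdict, deque

class PreWarmer:
    """Toy cold-start mitigation: pre-load functions whose recent invocation
    pattern suggests another request is imminent."""

    def __init__(self, horizon=60.0, history=5):
        self.horizon = horizon  # look-ahead window in seconds (illustrative)
        self.calls = defaultdict(lambda: deque(maxlen=history))
        self.warm = set()

    def record_invocation(self, fn_name):
        self.calls[fn_name].append(time.time())

    def predict_and_warm(self):
        now = time.time()
        for fn_name, stamps in self.calls.items():
            if len(stamps) < 2:
                continue
            ts = list(stamps)
            avg_gap = sum(b - a for a, b in zip(ts, ts[1:])) / (len(ts) - 1)
            # If the next call is expected within the horizon, keep it warm.
            if (ts[-1] + avg_gap) - now < self.horizon:
                if fn_name not in self.warm:
                    print(f"pre-loading instance for {fn_name}")
                    self.warm.add(fn_name)

warmer = PreWarmer()
for _ in range(3):
    warmer.record_invocation("generate-report")
warmer.predict_and_warm()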

However, Behera et al.38 suggest function fusion as a different strategy to lower cold-start latency. Their approach minimises redundant initialisations and maximises resource utilisation by integrating several functions within a workflow. The PanOpticon simulator was used to test this function fusion technique, which improves system responsiveness and throughput while also lowering cold-start delays. This technique increases efficiency and eliminates needless overhead by dynamically combining functions while they are being executed. Function fusion incorporates a more comprehensive view of workflow optimisation than traditional methods, which only concentrate on optimising individual functions. This strategy’s trade-off is that it greatly improves serverless computing’s performance and scalability, but it also complicates function management and necessitates a more intricate system architecture.
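A minimal way to picture function fusion is to combine two workflow steps that would otherwise be separate serverless functions, each with its own potential cold start, into a single invocation, as in the hypothetical sketch below; it is unrelated to the PanOpticon simulator used in the cited work.

def validate_order(payload):
    return {**payload, "valid": bool(payload.get("items"))}

def price_order(payload):
    return {**payload, "total": 10 * len(payload.get("items", []))}

def fused_handler(event, context=None):
    """Fused workflow: one invocation and one runtime initialisation instead
    of two chained serverless functions each paying its own cold-start cost."""
    result = validate_order(event)
    if result["valid"]:
        result = price_order(result)
    return result

print(fused_handler({"items": ["book", "pen"]}))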

The next approach is data replication and backup, which are critical to ensuring fault tolerance and protecting system data (Figure 4). These methods help to reduce the risk of data loss and ensure recovery in the event of a disruption. Data replication involves duplicating data on multiple servers or in different geographical locations to ensure availability and reliability. This process can be performed in real time or periodically, depending on the needs of the system and specific solutions. For example, Participant 10 (IT Ops) said, “Incremental backups every six hours became our safety net; full restores were rare but reliable”. Replication can be asynchronous, where changes in the primary storage are applied to the backups with a certain delay, or synchronous, where all changes are instantly reflected in all copies of the data. Backup is the process of creating backup copies of system data and configurations to restore them in the event of failure, loss, or damage. Backups can be stored locally or in the cloud, and they can be created on a regular basis (daily, weekly) or based on events (when data changes).39

Figure 4: Data replication and backup scheme.
Source: Created by the author.
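The synchronous/asynchronous distinction described above can be sketched as follows: a synchronous write returns only after every replica has applied it, whereas an asynchronous write returns immediately and replicas catch up later. The replica model and acknowledgement logic below are illustrative assumptions, not a production replication protocol.

from collections import deque

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.pending = deque()  # backlog used only in asynchronous mode

    def apply(self, key, value):
        self.data[key] = value

class PrimaryStore:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write_sync(self, key, value):
        """Return only after all replicas applied the change
        (lower RPO, higher write latency)."""
        self.data[key] = value
        for r in self.replicas:
            r.apply(key, value)

    def write_async(self, key, value):
        """Return immediately; replicas catch up later (faster writes, but a
        crash before flushing can lose the queued updates)."""
        self.data[key] = value
        for r in self.replicas:
            r.pending.append((key, value))

    def flush_async(self):
        for r in self.replicas:
            while r.pending:
                r.apply(*r.pending.popleft())

replicas = [Replica("dc-eu"), Replica("dc-us")]
store = PrimaryStore(replicas)
store.write_sync("user:1", "alice")   # durable on every copy before returning
store.write_async("user:2", "bob")    # visible on replicas only after flush
store.flush_async()
print({r.name: r.data for r in replicas})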

This approach has certain advantages and disadvantages. Data replication ensures high availability, protecting data from loss in the event of a failure. Data duplication allows for quick system recovery and reduced downtime. However, replication requires significant resources for data storage and management and can lead to delays in updating information.40 Backups provide protection against complete data loss by restoring the system to a certain state, but restoring from backups can be a lengthy process, and storing copies requires additional resources. For instance, Participant 6 (Cloud Comp) claimed, “Synchronous replication raised costs but reduced data-loss anxiety for management”. In general, managing replicas and backups is a complex task, especially in large systems.

For example, Facebook uses data replication to ensure high availability and fault tolerance of its social platform.14 User data is stored in several data centres around the world, which allows for instant switching to another data centre in the event of a failure or accident and ensures the continuity of the platform. Data replication and backup methods are designed specifically to deal with data loss and corruption by keeping copies of data across several systems or geographical locations. This guarantees that the system can be restored to its initial state using a replica in the event that one data store is lost or corrupted. However, as the amount of data grows, synchronisation lags and storage expenses may become substantial. Replication adds overhead because it necessitates regular backup copy management and synchronisation, even though it guarantees dependability and speedy recovery from failure scenarios.

The monitoring and alerting approach is also important (Figure 5). Monitoring is the process of continuously observing various aspects of the system.41,42 It involves collecting and analysing data such as CPU, memory, and disc space usage, as well as network and application status. The main goal of monitoring is to identify anomalies and potential problems in real time, which allows preventing failures or minimising their impact. Participant 9 (IT Ops) said, “Real-time dashboards gave us visibility; anomalies popped up before users even noticed issues”. Monitoring can be implemented through various tools and platforms that provide data visualisation in the form of graphs, tables, and reports.43 Alerts, in turn, are a mechanism for informing the system administrator or responsible individuals about problems or anomalies detected during monitoring. Notifications can be implemented through various communication channels that provide up-to-date information about the system status and allow for a quick response to unforeseen situations. Participant 8 (Cloud Comp) stated, “Automated Slack alerts shortened our mean time to resolution from hours to minutes”.

Figure 5: Monitoring and alerts scheme.
Source: Created by the author.
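A minimal monitoring-and-alerting loop in the spirit of this description samples a metric, compares it with a threshold, and notifies a channel when the threshold is exceeded. In the sketch below, the metric source, threshold, and notification function are placeholders rather than the configuration of any particular monitoring product.

import random
import time

CPU_ALERT_THRESHOLD = 85.0  # percent; illustrative value

def sample_cpu_percent():
    """Placeholder metric source; a real agent would read system counters."""
    return random.uniform(40.0, 100.0)

def send_alert(message):
    """Placeholder notifier; a real setup might post to chat or paging tools."""
    print(f"ALERT: {message}")

def monitor(iterations=5, interval=0.1):
    for _ in range(iterations):
        cpu = sample_cpu_percent()
        print(f"cpu usage: {cpu:.1f}%")
        if cpu > CPU_ALERT_THRESHOLD:
            send_alert(f"CPU at {cpu:.1f}% exceeds {CPU_ALERT_THRESHOLD}%")
        time.sleep(interval)

if __name__ == "__main__":
    monitor()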

It is important to consider both the advantages and limitations of this approach. Monitoring allows proactively identifying problems in the system, ensuring prompt response and optimisation of resources, and detailed analysis of monitoring data helps to identify the root causes of problems. However, setting up monitoring systems is a complex and expensive process, and an excessive number of alerts can reduce efficiency. Alerts, in turn, enable timely notification of issues, facilitating rapid action, but if they are not configured correctly, their effectiveness may be greatly diminished. Improperly configured or overly frequent alerts can result in false alarms or an excessive number of notifications. Users may experience “alert fatigue” as a result of this overload, which impairs their ability to react quickly to real problems by causing them to ignore or dismiss alerts.

An example is New Relic, which uses a comprehensive approach to monitoring and alerting to manage its services.15 They implement continuous monitoring of servers, databases, and applications, collecting data on resource usage, network status, and application performance. If the system detects anomalies, such as high CPU load or application malfunctions, notifications are automatically sent to administrators. This allows for a quick response to potential problems, which can prevent outages and ensure service stability.

Systems that continuously monitor system performance and send notifications when anomalies are found are crucial for identifying misconfigurations and network partition failures. The main advantage is the early identification of problems, which reduces downtime and enables proactive recovery. Setup costs and the difficulty of configuring monitoring systems, especially in large-scale or multi-cloud environments, are the trade-offs, though. Furthermore, improperly tuned alerting systems can lead to false alarms or information overload, which can decrease efficacy and cause alert fatigue in system administrators. To stay ahead of emerging threats or system modifications, effective monitoring necessitates substantial operational resources and frequent updates, which raises the overall operational overhead.

It is also worth considering the approach of automated testing and disaster recovery planning (Figure 6). Automated testing is the process of regularly performing code tests using automatic tools to ensure high software quality and quickly detect errors before they affect the system’s functionality.44,45 It includes various types of testing, such as unit, integration, functional, and regression testing. With the help of automation, these tests can be performed more frequently and more systematically, reducing the likelihood of defects in the production environment. Disaster recovery planning is the process of developing a detailed action plan that includes measures to restore system operations in the event of a serious incident or disaster.

Figure 6: Scheme of automated testing and disaster recovery planning.
Source: Created by the author.
Note: CI/CD – Continuous Integration/Continuous Deployment.
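The two practices can be combined in a small, regularly executed drill: run automated tests that simulate a failure, execute the documented recovery steps, and check the measured recovery time against the target RTO. The recovery steps and the RTO value in the sketch below are hypothetical.

import time
import unittest

TARGET_RTO_SECONDS = 5.0  # hypothetical recovery time objective

def restore_from_backup():
    time.sleep(0.2)  # stand-in for restoring data and configuration

def restart_services():
    time.sleep(0.1)  # stand-in for redeploying application services

def run_disaster_recovery_drill():
    """Execute the documented recovery plan and measure recovery time."""
    start = time.time()
    restore_from_backup()
    restart_services()
    return time.time() - start

class RecoveryDrillTest(unittest.TestCase):
    def test_recovery_meets_rto(self):
        elapsed = run_disaster_recovery_drill()
        self.assertLess(elapsed, TARGET_RTO_SECONDS,
                        f"recovery took {elapsed:.2f}s, exceeding the RTO")

if __name__ == "__main__":
    unittest.main()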

For a better understanding of this approach, it is necessary to analyse its advantages and disadvantages. Automated testing significantly improves the quality of software by enabling quick identification and fixing of bugs. It also reduces development time and ensures test repeatability. However, developing and maintaining automated tests requires significant resources, and managing numerous tests can be difficult. As for the disaster recovery plan, it allows for responding quickly to incidents, reducing downtime. However, developing and maintaining such a plan is resource-intensive, and its effectiveness depends on regular updates and the availability of the necessary resources. As an example of the use of automated testing and disaster recovery planning, it is possible to cite the practice of GitLab.16 They actively use automated testing as part of their CI/CD process to check the quality of code at every stage of development. This allows them to quickly identify and fix errors before the code is deployed to the production environment.

In the software development lifecycle, DevOps techniques like code reviews, pair programming, infrastructure as code, and automation are essential for reducing human error. Through the involvement of multiple team members, code reviews offer a methodical approach to identifying errors early on, improving code quality and lowering the possibility of missed bugs. Pair programming, in which two developers collaborate on the same code in real time and share knowledge, also helps to reduce errors. While automation in testing and deployment ensures consistency and lowers the risk of human error during these processes, infrastructure as code automates infrastructure setup, minimising the possibility of manual misconfiguration. Maintaining high-quality software in hectic DevOps workflows requires a more dependable and error-resistant development environment, which is what these practices collectively foster.
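The infrastructure-as-code practice mentioned above, in which the desired state is declared once and an idempotent tool converges the environment towards it instead of machines being configured by hand, can be sketched as follows; the resource model and apply function are illustrative and do not correspond to any particular provisioning tool.

DESIRED_INFRASTRUCTURE = {
    # Declarative description kept in version control and code-reviewed,
    # which removes a whole class of manual misconfiguration.
    "web": {"instances": 2, "port": 443},
    "db": {"instances": 1, "port": 5432},
}

def apply(desired, current):
    """Idempotently converge the current state towards the desired state."""
    for name, spec in desired.items():
        if current.get(name) != spec:
            print(f"updating {name} -> {spec}")
            current[name] = spec
    for name in list(current):
        if name not in desired:
            print(f"removing unmanaged resource {name}")
            del current[name]
    return current

state = {"web": {"instances": 1, "port": 443}, "legacy-ftp": {"instances": 1}}
state = apply(DESIRED_INFRASTRUCTURE, state)
state = apply(DESIRED_INFRASTRUCTURE, state)  # second run changes nothing
print(state)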

Table 2 presents a comparison matrix of the advantages and disadvantages of the discussed approaches. Since techniques like serverless computing and microservices offer scalable solutions but frequently call for complex management procedures to prevent operational bottlenecks, the trade-offs between flexibility and complexity are especially noticeable.

Table 2: Comparison matrix: Fault tolerance and recovery methods in DevOps processes.
Method | Advantages | Limitations | Example | Metrics | Typical Failure Modes Mitigated
Microservices Architecture | Fault isolation, scalability, flexibility | Complexity of management, coordination overhead, communication overhead via APIs | Netflix, Medium | RTO: Low, RPO: Low, Latency: Medium | Network partition, Cascading failure, Data corruption
Serverless Computing | Reduced infrastructure costs, automatic scaling, fast implementation | Limited configuration options, delays during initial function invocation, integration challenges | Netflix | RTO: Medium, RPO: Medium, Latency: High | Network partition, Cascading failure, Cold start
Containerisation | Portability, rapid deployment, isolation of the environment | Difficulty in setting up, resources, security | Spotify | RTO: Low, RPO: Medium, Latency: Low | Network partition, Data corruption, Cascading failure
Container Orchestration | Automation of management, self-healing, scalable | Difficulty of setting up, resource costs, difficulty of training | Spotify (Kubernetes) | RTO: Low, RPO: Low, Latency: Medium | Network partition, Cascading failure
Cloud Technologies & Load Balancing | High availability, scalability, flexibility, quick recovery from disruptions | Cost, complexity of setup and management, dependence on a supplier, failure points | AWS, ELB | RTO: Low, RPO: Low, Latency: Medium | Cascading failure, Network partition, Load spikes
Data Replication & Backup | Data protection, faster recovery, high reliability, reduces downtime | High storage costs, delays in synchronisation, complex management of multiple copies of data | Facebook | RTO: Low, RPO: Low, Latency: Low | Data corruption, Network partition
Monitoring & Alerts | Early detection of issues, improved resource management, increased system reliability | Expensive setup, information overload, dependence on proper configuration | New Relic | RTO: Medium, RPO: High, Latency: Low | Cascading failure, Network partition
Automated Testing & Disaster Recovery | Improves code quality, reduces downtime, repeatability, faster recovery from failures | Setup costs, complex management, needs regular updates and maintenance | GitLab, Jenkins | RTO: Low, RPO: Low, Latency: Low | Cascading failure, Data corruption, Network partition
Source: Created by the author. Note: RTO – Recovery Time Objective; RPO – Recovery Point Objective.

The ratings in Table 2 are based on information from case studies and interviews that connect performance measures to actual DevOps procedures. While medium delay results from coordination overhead, low RTO/RPO in microservices and containerisation reflects the quick recovery observed at Netflix and Spotify. According to Participants 7–8, serverless computing exhibits medium RTO/RPO and significant latency due to cold-start delays. Although load balancing and the cloud have minimal RTO/RPO, they are vulnerable to configuration errors that result in downtime. While data replication guarantees availability, it also causes synchronisation lag and storage overload. While automation can reduce RTO/RPO, it can also fail when dependencies change, and monitoring and warnings are susceptible to false positives. Therefore, the ratings take into account fault-tolerance techniques’ empirical vulnerabilities as well as their strengths.

In the end, the interaction of these techniques emphasises the necessity of a comprehensive strategy that takes into account the practical constraints of deploying and maintaining these technologies in actual DevOps environments in addition to addressing system resilience. In other words, ensuring fault tolerance and fast recovery from DevOps failures is a critical task, given the complexity of modern information systems and the high requirements for business continuity. These approaches help to increase the reliability and stability of systems, allowing them to quickly adapt to changes and effectively manage resources. The combination of these approaches allows for the creation of resilient and adaptive systems that can withstand disruptions and ensure stable operation in a constantly changing environment. Based on the interviews with industry experts, several key insights emerged regarding the challenges and best practices associated with fault tolerance and disaster recovery in DevOps environments.

First, a lot of experts stressed how crucial microservices architecture is as a fundamental component for improving fault tolerance. A number of them emphasised how microservices can isolate failures to specific services, avoiding system-wide disruptions. They did, however, also draw attention to the difficulties in maintaining inter-service dependencies, which can become a bottleneck in the event of failure. Although microservices’ scalability and flexibility increase system reliability overall, experts pointed out that managing these services requires sophisticated monitoring tools and careful orchestration, particularly as the number of services grows. This is consistent with earlier research on the benefits and drawbacks of microservices architecture, where coordination issues are frequently mentioned.

Serverless computing was also acknowledged as a successful tactic for lowering infrastructure overhead and facilitating quicker scaling. Numerous interviewees pointed out that serverless architectures reduce the possibility of system outages brought on by resource shortages by enabling automatic scaling in response to demand. But as the experts noted, the trade-off is in the requirement for careful resource management and monitoring. One limitation that was often brought up was the start-up latency, also known as the “cold start” issue. According to a number of experts, this issue needs to be carefully managed to guarantee system responsiveness. In order to balance the advantages of serverless computing with the dependability of dedicated resources, some participants suggested utilising hybrid architectures, which combine serverless functions with conventional server-based systems.

Experts emphasised the importance of disaster recovery planning and automated testing, stressing that automation is necessary to minimise human error and guarantee quick failure recovery. By facilitating the prompt identification and fixing of pre-production problems, the integration of continuous integration and continuous deployment pipelines with automated testing frameworks has decreased downtime. One expert stated, “Automated recovery scripts and testing pipelines are the backbone of our fault tolerance strategy; they ensure that even the most complex systems can be quickly restored with minimal manual intervention.”

Lastly, the interviews emphasised the increasing importance of load balancing and cloud computing in enhancing system resilience. According to a number of experts, cloud platforms like AWS and Azure come with built-in fault-tolerance and redundancy features that are essential for guaranteeing high availability in contemporary DevOps settings. According to one participant, load balancing “distributes the traffic across multiple servers to prevent individual system overloads,” improving fault tolerance and performance. However, experts warned that relying too much on the cloud can come with risks, like vendor lock-in and possible cloud provider service outages. Some experts recommended multi-cloud strategies that diversify infrastructure and lessen dependency on a single provider in order to reduce these risks.

Table 3 summarises the main themes that emerged throughout the interviews, making it easier to see how the data supports the claims made in the study. Each theme highlights a core idea about fault tolerance and disaster recovery in DevOps systems, and anonymised statements from professionals in the field show how these themes were expressed. The prevalence column shows how often each theme arose among the participants, indicating how salient each issue was to the study.

Table 3: Themes with representative anonymised quotes and prevalence.
Theme | Representative Anonymised Quote | Prevalence
Microservices Flexibility | “Microservices give us the flexibility to isolate failures to individual services, which helps minimise the impact on the overall system.” – Participant 2 (Software Dev). | 75%
Serverless Scaling | “Serverless computing allows us to scale functions automatically based on demand, making it ideal for unpredictable traffic.” – Participant 6 (Cloud Comp). | 80%
Cold-Start Latency | “The cold-start issue in serverless functions can cause significant delays, especially when low-latency is crucial.” – Participant 3 (Software Dev). | 60%
Containerisation Portability | “Containers make it easier to move applications between different environments without worrying about dependency issues.” – Participant 10 (IT Ops). | 70%
Automated Recovery | “We use automated recovery scripts that help restore services quickly without manual intervention.” – Participant 9 (IT Ops). | 85%
High Availability via Cloud | “Cloud platforms like AWS ensure that we can recover quickly from disruptions, thanks to built-in redundancy and auto-scaling.” – Participant 7 (Cloud Comp). | 90%
Data Replication and Backup | “Data replication across multiple nodes ensures that we don’t lose critical data during a failure, but it’s costly.” – Participant 1 (Software Dev). | 65%
Monitoring and Alerts | “Effective monitoring and timely alerts have been crucial in catching issues early, though managing alerts can be overwhelming.” – Participant 4 (Software Dev). | 80%
Operational Overhead | “Managing microservices can be complex and requires significant resources, especially when it comes to inter-service communication.” – Participant 5 (Software Dev). | 70%
Source: Created by the author.

Table 3 shows how the collected data supports the broader conclusions about the effectiveness of alternative fault tolerance solutions. By providing both direct quotes from participants and the frequency with which each theme arose, it grounds the study’s results in practitioners’ real-life experiences and keeps the analysis anchored in the data. The study of fault tolerance and disaster recovery solutions in DevOps shows that network partition, cascading failures, cold starts, and data corruption are among the most common failure modes. Microservices architectures provide fault isolation, which speeds up recovery by confining failures to specific services, but they require considerable effort to manage inter-service communication. Serverless computing scales automatically to cope with network partition and cascading failures; however, it can suffer from cold-start delay. A hybrid solution that uses microservices for core tasks and serverless for variable workloads can strike a balance between fault isolation and cost-effectiveness.

Cold-start latency is a significant problem in serverless computing, especially for applications that require minimal latency. Microservices, in which services run continuously, are preferable for latency-sensitive functions, whereas serverless remains cheaper when optimisation techniques are used to control cold-start latency. A hybrid paradigm can use serverless to reduce costs and microservices to serve important functions that require low latency. Data corruption, in turn, demands rapid remediation. Both microservices and serverless rely on data replication, but managing it in distributed systems is difficult: synchronous replication costs more but protects against data loss, while asynchronous replication costs less but carries greater risk. A hybrid approach that uses microservices for critical data and serverless for less critical data balances data security and cost.

Cost is another factor that affects the choice between microservices and serverless. Microservices are more expensive to operate because multiple services must be managed, whereas serverless is cheaper for short bursts of activity but can become costly under continuous use. Microservices are the better choice when reliability and fault tolerance are paramount; serverless is preferable when cost-effectiveness is the priority and occasional downtime is acceptable. A hybrid method meets both goals by combining the strengths of the two architectures.

The decision framework was operationalised into a structured evaluative matrix that guides technology selection in DevOps contexts, building on the comparative insights from interview data and thematic synthesis. The framework arranges fault-tolerance tactics along two intersecting axes: (1) Operational Complexity, which is determined by the level of automation, load monitoring, and orchestration needs, and (2) Recovery Criticality, which is determined by acceptable RTO and RPO levels. A corresponding quadrant is assigned to each architecture. For instance, containerised systems are ideal for balanced recovery and control trade-offs, serverless architectures are in the moderate-recovery, low-complexity category, while microservices are in the high-recovery, high-complexity quadrant. At the framework’s intersection, or the adaptive zone, are hybrid models that combine serverless for elastic workloads and microservices for mission-critical activities. In order to operationalise this approach, practitioners should compare the system needs to these two dimensions and choose configurations that minimise RTO while keeping administrative overhead under control. In order to improve deployment tactics over time, the framework also suggests iterative calibration through empirical monitoring of recovery measures and resource usage.
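The two-axis framework can be operationalised in a few lines of code: rate a workload on recovery criticality and tolerable operational complexity, then map the pair to a quadrant. The sketch below follows the quadrant descriptions given in the text, but the assignment of the two mixed quadrants and the example workload ratings are illustrative assumptions.

def recommend_architecture(recovery_criticality, operational_complexity):
    """Map the two framework axes (each rated 'low' or 'high') to the
    quadrants described in the text."""
    quadrants = {
        ("high", "high"): "microservices (high recovery, high complexity)",
        ("low", "low"): "serverless (moderate recovery, low complexity)",
        ("high", "low"): "hybrid: microservices for critical paths, "
                         "serverless for elastic workloads",
        ("low", "high"): "containerised services (balanced trade-off)",
    }
    return quadrants[(recovery_criticality, operational_complexity)]

# Hypothetical workloads scored against the two axes.
for workload, scores in {
    "payment processing": ("high", "high"),
    "nightly reporting": ("low", "low"),
    "checkout during sales peaks": ("high", "low"),
}.items():
    print(f"{workload}: {recommend_architecture(*scores)}")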

This decision framework helps organisations determine their requirements by matching failure modes with trade-offs in recovery time, service levels, latency, and cost. By weighing these factors carefully, they can choose the most effective and cost-efficient strategy for improving both performance and expenditure.

Discussion

The results of the study demonstrated that the introduction of automated monitoring and recovery mechanisms significantly improves fault tolerance in DevOps processes and confirmed that such solutions can reduce risks and increase system stability. However, to comprehensively assess the effectiveness of these approaches and eliminate possible shortcomings, it is important to take into account the results of other studies in this area. Comparison with existing work in this area will allow for a deeper understanding of the advantages and limitations of the proposed solutions, as well as identify areas for further refinements. For example, Banala46 showed that effective implementation of CI/CD practices in DevOps improves software quality and speeds up delivery through automation and tool integration, and Priyanka et al.47 considered CI/CD with a focus on ensuring continuous integration for the deployment of a navigation application. Similarly, the current work focuses on CI/CD to improve reliability and speed of disaster recovery, but it is concentrated on precise aspects of fault tolerance and disaster recovery in DevOps processes.

The study focused on ensuring fault tolerance and rapid recovery in DevOps processes by optimising approaches to fault management. Dakkak et al.48 focused on optimising DevOps to understand and implement new features and solutions in complex systems, where it is relevant to take into account the specifics of the software flow and value for end users, which reflects different approaches to improving DevOps depending on the needs of the system. While the current work considered methods for ensuring reliability and recovery from DevOps failures, in particular through the implementation of effective practices to ensure system continuity, the study by Segovia-Ferreira et al.49 focuses on the cyber resilience of systems in critical infrastructure, which includes the ability of such systems to prepare, absorb, recover, and adapt to cyber threats, emphasising the need to develop certain metrics and assessment methods to ensure resilience. This study found that the integration of test automation practices in DevOps improves system stability, while Mumtaz et al.50 examined in their study how DevOps simplifies software delivery processes by improving the interaction between development and operations teams. In contrast to the study, which focused on automating recovery processes in DevOps to improve fault tolerance, Liu and Wang51 concentrate on the analysis of network recovery mechanisms that ensure the resilience of network structures to failures.

While the results of this work highlighted the importance of automating testing and monitoring to improve system resilience in DevOps, the study by Azad52 focused on the risks associated with DevOps and mitigation strategies, such as continuous testing and disaster recovery planning, without a concrete focus on automation. In addition, the current study, which focused on ensuring disaster recovery in DevOps, differs from the work of Jayakody and Wijayanayake53, which concentrated on analysing and comparing DevOps maturity models to assess the level of DevOps adoption, as the results reflect different aspects of DevOps improvement depending on the needs of organisations.

The results of our work, which focused on improving the fault tolerance and reliability of systems, are consistent with those of Romanelli54, who also showed an increase in the resilience of autonomous systems, but through multi-sensor fusion methods that combine classical and deep learning approaches to improve system perception and reliability. Unlike the current study, which focuses on ensuring the stability of DevOps processes, the work of Mallreddy and Vasa55 demonstrated the use of machine learning to predict system risks and prevent failures in DevOps cloud environments. Moreover, this study focuses on DevOps stability, while Kadaskar56 shows the transformational impact of DevOps on organisational culture and software quality improvement.

While the work of Pando et al.57 provided a systematic review of secondary research on DevOps implementation, the current study demonstrated an increase in the stability and fault tolerance of DevOps processes. The results of this study, which focused on the integration of various technologies in DevOps, also differ from the work of Abudalou58, where the emphasis is on the use of artificial intelligence for automatic threat detection, security testing, and risk prediction in DevOps. At the same time, the current work has shown improvements in system resilience and recovery in the context of DevOps without defining specific metrics to assess implementation success, unlike the study by Amaro et al.59, which developed and applied metrics to improve DevOps processes in enterprises.

Additionally, the present study introduced automation practices to ensure continuity in DevOps processes, while the study by Mohanty et al.60 analysed risk management and recovery in the context of energy infrastructures. While the study by Kumar et al.61 focused on assessing the probability of project success and improving project performance, the current study focused on practical solutions for fault management and system recovery. While the work of Charles do Nascimento Marreiros and José Galvão do Nascimento62 demonstrated the integration of high-volume systems into a DevOps environment to optimise processes in large multi-platform enterprises, the current study showed an improvement in sustainability and management efficiency in this environment. Finally, the study complemented the work of Smuts et al.63, which examined the optimisation of DevOps team composition to improve the efficiency and quality of product delivery, by focusing on improving DevOps resilience through automated system recovery, an important aspect of ensuring system uptime and reliability.

Thus, the current work differs from other comparable studies in that it focuses on automating system recovery to improve the resilience of DevOps processes by offering practical solutions for fault management. This contrasts with other work that focuses on different aspects such as technology integration, the use of artificial intelligence, or the development of metrics to evaluate performance. The approach used in this paper provides a holistic solution for improving the stability and recoverability of systems in a DevOps environment.

Conclusions

The study found that introducing fault tolerance and disaster recovery methods into DevOps processes, notably microservices architecture, ensures scalability and adaptability to load changes, which reduces the likelihood of system failures. Serverless computing reduces infrastructure costs and simplifies scaling, although it requires careful monitoring and resource management. Containerisation and orchestration increase the portability and isolation of applications, but their complexity can cause administration problems. Cloud computing and load balancing improve resource allocation and reduce the load on individual components but require constant monitoring to maintain efficiency. Data replication and backup protect against data loss, but as data volumes grow, these processes need to become more efficient. Monitoring and alerts improve problem detection and response but require regular updates. Automated testing and disaster recovery planning are critical to reducing recovery time, but setting up automated scenarios remains a significant challenge; a minimal example of such a scenario is sketched below.
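
As an illustration of what such an automated scenario can look like, the following Python sketch polls a hypothetical health endpoint and triggers a restart command after several consecutive failures. The endpoint URL, service name, and thresholds are assumptions made for this example; production setups would typically rely on orchestration-level probes and alerting rather than a standalone script.

```python
import subprocess
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/healthz"            # hypothetical health endpoint
RESTART_CMD = ["systemctl", "restart", "demo-service"]   # hypothetical service unit
CHECK_INTERVAL_S = 30
FAILURES_BEFORE_RESTART = 3

def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def watch() -> None:
    """Poll the endpoint and attempt a restart after consecutive failures."""
    failures = 0
    while True:
        if is_healthy(HEALTH_URL):
            failures = 0
        else:
            failures += 1
            print(f"health check failed ({failures}/{FAILURES_BEFORE_RESTART})")
            if failures >= FAILURES_BEFORE_RESTART:
                subprocess.run(RESTART_CMD, check=False)  # recovery attempt
                failures = 0
        time.sleep(CHECK_INTERVAL_S)

if __name__ == "__main__":
    watch()
```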

Recommendations include deeper implementation of automated mechanisms for microservice architectures and serverless computing, as well as continuous improvement of monitoring and backup processes to address growing data volumes and new threats. It is necessary to integrate new technologies to improve resource management, implement adaptive systems to automate testing, and provide staff training to work effectively with new tools.

Key areas for further research include improving automated test and recovery scenarios, as well as exploring new technologies such as artificial intelligence to improve resource monitoring and management. It is important to focus on simplifying container and cloud orchestration processes, as well as developing tools for automated dependency management between microservices. The inclusion of an empirical component, such as a simulation or reproducible case study, is outside the scope of this study, which focuses on qualitative insights from industry professionals. Future research could expand on this by integrating empirical methods to validate and further explore the findings presented here.

Limitations of the study include the specificity of the technologies considered and their adaptation to particular environments, which may limit the generalisability of the findings for some industries or settings. Furthermore, the reliance on expert interviews and the difficulty of establishing and maintaining automated systems introduce potential validity threats such as selection bias, self-report bias, and industry or context specificity. Because the results are context-specific, they may not transfer readily to other industries or organisational settings, which limits their wider applicability.

References
  1. Tkachuk H, Burachek I, Vyhovskyi V, Sotnyk A, Tsaruk I. Analysis of the financial derivatives for risk management in the context of financial market instability. Sci Bull Mukachevo State Univ Ser Econ. 2024;11(4):81–92. https://doi.org/10.52566/msu-econ4.2024.81
  2. Orlov M, Pasichnyk V. Systemic assessment of risks and challenges in implementing the DevOps methodology in corporate IT infrastructures. Comput Integr Technol Educ Sci Prod. 2024;54:171–178. https://doi.org/10.36910/6775-2524-0560-2024-54-21
  3. Onyshchenko R, Kotenko N, Zhyrova T. The role and effectiveness of artificial intelligence tools in software testing. Inf Technol Soc. 2024;2(13):66–70. https://doi.org/10.32689/maup.it.2024.2.10
  4. Lytvynov VA, Myakshylo OM, Bratskyi VO. The task of centralized management of logs in the network of situation centers of public authorities and approaches to prototyping its software solution. Math Mach Syst. 2023;4:33–42. https://doi.org/10.34121/1028-9763-2023-4-33-42
  5. Capizzi A, Distefano S, Mazzara M. From DevOps to DevDataOps: Data management in DevOps processes. In: Bruel JM, Mazzara M, Meyer B, (eds.). Revised Selected Papers of the Second International Workshop, DEVOPS 2019 “Software Engineering Aspects of Continuous Development and New Paradigms of Software Production and Deployment”. Springer. 2020. p. 52–62. https://doi.org/10.1007/978-3-030-39306-9_4
  6. Valenzuela Robles BD, Alvarado Lara IL, Santaolaya Salgado R, Hidalgo-Reyes M. Identification of methods, approaches, and factors in effort estimation for DevOps projects: A systematic literature mapping. In: 2023 Mexican International Conference on Computer Science (ENC). IEEE. 2023. p. 1–6. https://doi.org/10.1109/ENC60556.2023.10508603
  7. Kumar N, Groenewald ES, Kulkarni S, Ashifa KM, Howard E. Self-healing networks AI-based approaches for fault detection and recovery. Power Syst Technol. 2023;47(4):371–386. https://doi.org/10.52783/pst.206
  8. Sandu AK. DevSecOps: Integrating security into the DevOps lifecycle for enhanced resilience. Technol Manag Rev. 2021;6(1):1–19.
  9. Tatineni S. Applying DevOps practices for quality and reliability improvement in cloud-based systems. Int Res J. 2023;10(11):a374–a380.
  10. Craciun PC, Necula RC. Why startups outpace multinationals in leveraging DevOps. Proc Int Conf Bus Excell. 2024;18(1):3421–3429. https://doi.org/10.2478/picbe-2024-0277
  11. Saddam M. Netflix architecture. Medium. 2023. https://medium.com/@saddy.devs/netflix-architecture-72bb8572a102
  12. Case study: Spotify. Kubernetes. 2024. https://kubernetes.io/case-studies/spotify/
  13. Load Balancing using AWS. GeeksforGeeks. 2023. https://www.geeksforgeeks.org/load-balancing-using-aws/
  14. Gubitosa B. What is data replication and why it’s important? Rivery. 2023. https://rivery.io/data-learning-center/data-replication/
  15. Alert monitoring for all new relic products. New Relic. 2024. https://newrelic.com/platform/alerts
  16. Disaster recovery for planned failover. GitLab Docs. 2024. https://docs.gitlab.com/ee/administration/geo/disaster_recovery/planned_failover.html
  17. Rehman HU, Darus M, Salah J. Graphing examples of starlike and convex functions of order β. Appl Math Inf Sci. 2018;12(3):509–515. http://dx.doi.org/10.18576/amis/120305
  18. Imamguluyev R, Umarova N. Application of fuzzy logic apparatus to solve the problem of spatial selection in architectural-design projects. Lect Not Networks Syst. 2022;307:842–848. https://doi.org/10.1007/978-3-030-85626-7_98
  19. Ponce F, Verdecchia R, Miranda B, Soldani J. Microservices testing: A systematic literature review. Inf Softw Technol. 2025;188:107870. https://doi.org/10.1016/j.infsof.2025.107870
  20. Zhou M, Zheng B, Pan L, Liu S. Balancing function performance and cluster load in serverless computing: A reinforcement learning solution. J Netw Comp Appl. 2025;243:104299. https://doi.org/10.1016/j.jnca.2025.104299
  21. Bisenovna KA, Ashatuly SA, Beibutovna LZ, Yesilbayuly KS, Zagievna AA, Galymbekovna MZ, Oralkhanuly OB. Improving the efficiency of food supplies for a trading company based on an artificial neural network. Int J Elect Comp Eng. 2024;14(4):4407–4417. http://doi.org/10.11591/ijece.v14i4.pp4407-4417
  22. Destek MA, Hossain MR, Manga M, Destek G. Can digital government reduce the resource dependency? Evidence from method of moments quantile technique. Resour Pol. 2024;99:105426. https://doi.org/10.1016/j.resourpol.2024.105426
  23. Amourah A, Frasin B, Salah J, Yousef F. Subfamilies of bi-univalent functions associated with the imaginary error function and subordinate to Jacobi polynomials. Symmetry. 2025;17(2):157. https://doi.org/10.3390/sym17020157
  24. Bezshyyko O, Bezshyyko K, Kadenko I, Yermolenko R, Dolinskii A, Ziemann V. Monte Carlo simulation model of internal pellet targets. In: EPAC 2006 – Contributions to the Proceedings. European Physical Society Accelerator Group (EPS-AG). 2006. p. 2239–2241.
  25. Gladka M, Kuchanskyi O, Kostikov M, Lisnevskyi R. Method of allocation of labor resources for IT project based on expert assessements of Delphi. In: SIST 2023 – 2023 IEEE International Conference on Smart Information Systems and Technologies, Proceedings. IEEE. 2023. p. 545–551. https://doi.org/10.1109/SIST58284.2023.10223549
  26. Swetha R, Thriveni J, Venugopal KR. Resource utilization-based container orchestration: Closing the gap for enhanced cloud application performance. SN Comp Sci. 2025;6(3):191. https://doi.org/10.1007/s42979-024-03624-4
  27. Holovko O, Kravchenko O, Pogrebytskyi M, Romaniuk I. Ways to improve legal regulation of critical infrastructure information networks protection. Soc Leg Stud. 2025;8(1):70–81.
  28. Chen Q, Liu Y, Tan R, Jin Z, Xiao J, Wang X, Zhang F, Liu Q. Shadowkube: Enhancing Kubernetes security with behavioral monitoring and honeypot integration. Cybersecur. 2025;8(1):63. https://doi.org/10.1186/s42400-025-00372-7
  29. Orazbayev B, Kozhakhmetova D, Orazbayeva K, Utenova B. Approach to modeling and control of operational modes for chemical and engineering system based on various information. Appl Math Inf Sci. 2020;14(4):547–556. http://dx.doi.org/10.18576/amis/140403
  30. Issayeva A, Niyazbekova S, Semenov A, Kerimkhulle S, Sayimova M. Digital technologies and the integration of a green economy: Legal peculiarities and electronic transactions. Reliab Theory Applicat. 2024;19(6(81)):1088–1096. https://doi.org/10.24412/1932-2321-2024-681-1088-1096
  31. Zikiryaev N, Grishchenko V, Rakisheva Z, Kovtun A. Analysis of the architecture of the hardware and software complex for ground-based ionosphere radiosounding. Eureka Phys Eng. 2022;3:167–174. https://doi.org/10.21303/2461-4262.2022.002381
  32. Laha J, Pattnaik S, Chaudhury KS, Palai G. Reducing makespan and enhancing resource usage in cloud computing with ESJFP method: A new dynamic approach. Internet Technol Lett. 2025;8(5):e608. https://doi.org/10.1002/itl2.608
  33. Yermolenko R, Falko A, Gogota O, Onishchuk Yu, Aushev V. Application of machine learning methods in neutrino experiments. J Phys Stud. 2024;28(3):3001. https://doi.org/10.30970/jps.28.3001
  34. Azieva G, Kerimkhulle S, Turusbekova U, Alimagambetova A, Niyazbekova S. Analysis of access to the electricity transmission network using information technologies in some countries. E3S Web Conf. 2021;258:11003. https://doi.org/10.1051/e3sconf/202125811003
  35. Yang X, Shi Y, Chen R, Xu H. An edge server load balancing method based on particle swarm optimization. Int J Innov Comp Inform Control. 2025;21(4):859–883. https://doi.org/10.24507/ijicic.21.04.859
  36. Owotogbe J. Assessing and enhancing the robustness of LLM-Based Multi-Agent Systems through chaos engineering. In: Proceedings – 2025 IEEE/ACM 4th International Conference on AI Engineering – Software Engineering for AI, CAIN 2025, IEEE. 2025. p. 250–252. https://doi.org/10.1109/CAIN66642.2025.00039
  37. Lee C, Zhu Z, Yang T, Huo Y, Su Y, He P, Lyu MR. SPES: Towards optimizing performance-resource trade-off for serverless functions. In: Proceedings – International Conference on Data Engineering, IEEE. 2024. p. 165–178. https://doi.org/10.1109/ICDE60146.2024.00020
  38. Behera RK, Kumari A, Cho SB. Scalable cold-start optimization in serverless computing: Leveraging function fusion with PanOpticon simulator. IEEE Access. 2025;13:125101–125118. https://doi.org/10.1109/ACCESS.2025.3589393
  39. Khadzhiradieva S, Bezverkhniuk T, Nazarenko O, Bazyka S, Dotsenko T. Personal data protection: Between human rights protection and national security. Soc Leg Stud. 2024;7(3):245–256.
  40. Ramesh C, Srinivasulu B, Suresh Babu M, Keerthi M, Indira Priyadarshini G, Grace Verghese M. An effective data replication strategy and the improvement of the storage environment. Smart Innov Syst Technol. 2025;417:373–383. https://doi.org/10.1007/978-981-97-8355-7_32
  41. Kadenko IM, Sakhno NV, Biró B, Fenyvesi A, Iermolenko RV, Gogota OP. A bound dineutron: Indirect and possible direct observations. Acta Phys Pol B Proceed Suppl. 2024;17(1):1A31–1A39. https://doi.org/10.5506/APhysPolBSupp.17.1-A3
  42. Smolij V, Smolij N, Shvydenko M, Voloshyn S. Hardware and software complex for operational management of a flexible site of highly efficient assembly production. Mach Energ. 2025;16(2):20–35. https://doi.org/10.31548/machinery/2.2025.20
  43. Buzhymska K, Tsaruk I, Biriuchenko S, Pashchenko O, Svitlyshyn I. Impact of diversification on strategic business management. Sci Bull Mukachevo State Univ Ser Econ. 2024;11(3):34–46. https://doi.org/10.52566/msu-econ3.2024.34
  44. Babak VP, Scherbak LM, Kuts YV, Zaporozhets AO. Information and measurement technologies for solving problems of energy informatics. CEUR Workshop Proceed. 2021;3039:24–31.
  45. Smailov N, Tsyporenko V, Ualiyev Z, Issova A, Dosbayev Z, Tashtay Y, Zhekambayeva M, Alimbekov T, Kadyrova R, Sabibolda A. Improving accuracy of the spectral-correlation direction finding and delay estimation using machine learning. East Eur J Enter Tech. 2025;2(5(134)):15–24. https://doi.org/10.15587/1729-4061.2025.327021
  46. Banala S. DevOps essentials: Key practices for continuous integration and continuous delivery. Int Numer J Mach Learn Robots. 2024;8(8):1–14.
  47. Priyanka M, Sindhuja K, Madhuvani V, Prasoona Sowpthika K, Kranthi Kumar K. DevOps optimized navigation: Building a DevOps CI/CD pipeline. EPRA Int J Res Dev. 2024;9(3):376–382. https://doi.org/10.36713/epra16292
  48. Dakkak A, Daniele P, Bosch J, Holmström Olsson H. DevOps value flows in software-intensive system of systems. In: Euromicro Conference Series on Software Engineering and Advanced Applications. IEEE. 2024. p. 387–394. https://doi.org/10.1109/SEAA64295.2024.00065
  49. Segovia-Ferreira M, Hernan JR, Cavalli AR, Garcia-Alfaro J. A survey on cyber-resilience approaches for cyber-physical systems. ACM Comput Surv. 2024;56(8):202. https://doi.org/10.1145/3652953
  50. Mumtaz MH, Khan MA, Koskula J, Luukkonen JJ, Mohammed AK. Waterfall to DevOps transition: Successful DevOps driven digital transformation. Lappeenranta: LUT University; 2024.
  51. Liu Q, Wang B. Network resilience and recovery mechanism: A review. J Cyber Secur. 2024;6(4):44–59.
  52. Azad N. DevOps challenges and risk mitigation strategies by DevOps professionals teams. In: Hyrynsalmi S, Münch J, Smolander K, Melegati J, (eds.). Proceedings of the 14th International Conference “Software Business”. Springer. 2024. p. 369–385. https://doi.org/10.1007/978-3-031-53227-6_26
  53. Jayakody JAVMK, Wijayanayake WMJI. DevOps maturity: A systematic literature review. In: International Research Conference on Smart Computing and Systems Engineering. IEEE. 2024. p. 1–6. https://doi.org/10.1109/SCSE61872.2024.10550493
  54. Romanelli F. Multi-sensor fusion for autonomous resilient perception exploiting classical and deep learning techniques. [PhD thesis]. Rome: Sapienza University of Rome; 2024.
  55. Mallreddy SR, Vasa Y. Predictive maintenance in cloud computing and DevOps: ML models for anticipating and preventing system failures. Nat Volatiles Essent Oils J. 2023;10(1):213–219. https://doi.org/10.53555/nveo.v10i1.5751
  56. Kadaskar HR. Unleashing the power of DevOps in software development. Int J Sci Res Mod Sci Technol. 2024;3(3):1–7. https://doi.org/10.59828/ijsrmst.v3i3.185
  57. Pando B, Silva A, Dávila A. A tertiary study on the DevOps adoption. Iber J Inf Syst Technol. 2024;53(3):23–36. https://doi.org/10.17013/risti.53.23-36
  58. Abudalou MA. Security DevOps: Enhancing application delivery with speed and security. Int J Comput Sci Mob Comput. 2024;13(5):100–104. https://doi.org/10.47760/ijcsmc.2024.v13i05.009
  59. Amaro R, Pereira R, Mira da Silva M. DevOps metrics and KPIs: A multivocal literature review. ACM Comput Surv. 2024;56(9):231. https://doi.org/10.1145/3652508
  60. Mohanty A, Ramasamy AK, Verayiah R, Bastia S, Dash SS, Cuce E, Yunus Khan TM, Soudagar MEM. Power system resilience and strategies for a sustainable infrastructure: A review. Alex Eng J. 2024;105:261–279. https://doi.org/10.1016/j.aej.2024.06.092
  61. Kumar A, Nadeem M, Shameem M. Metaheuristic based cost-effective predictive modeling for DevOps project success. Appl Soft Comput. 2024;163:111834. https://doi.org/10.1016/j.asoc.2024.111834
  62. Charles do Nascimento Marreiros E, José Galvão do Nascimento E. IT process optimization through the implementation of bimodal IT with DevOps. J Interdiscip Debates. 2024;5(2):37–62. https://doi.org/10.51249/jid.v5i02.2096
  63. Smuts H, Louw P, Smit D, Waechter I, Sardinha-Da Silva V. Optimal workforce allocation for quality delivery in DevOps teams: A case study. IADIS Int J Comput Sci Inf Syst. 2024;19(1):96–110.

Appendix
Appendix 1: Codebook.
Theme: Microservices flexibility
Operational definition: Adoption of modular service architecture enabling independent deployment, isolation of faults, and rapid recovery without full-system downtime.
Representative anonymised quotes: “Microservices let us isolate failures to a single component instead of bringing down the whole stack.” – P3 (Software Dev); “We can roll back one microservice without touching the rest of the system.” – P1 (Software Dev)

Theme: Serverless scaling
Operational definition: Implementation of serverless functions that automatically allocate computing resources in response to fluctuating workloads, enhancing elasticity and cost-efficiency.
Representative anonymised quotes: “Serverless functions scale instantly when demand spikes.” – P7 (Cloud Comp); “Auto-scaling reduced our downtime during traffic surges from hours to minutes.” – P8 (Cloud Comp); “We no longer over-provision servers; the system scales and shrinks on its own.” – P6 (Cloud Comp)

Theme: Cold-start latency
Operational definition: Delay experienced during the initial invocation of serverless functions, affecting applications requiring continuous low-latency performance.
Representative anonymised quotes: “Cold-starts add seconds of delay in analytics jobs.” – P6 (Cloud Comp); “For real-time dashboards, even two-second cold-starts break user flow.” – P7 (Cloud Comp); “We pre-warm key functions to avoid cold-start lag during peak loads.” – P8 (Cloud Comp)

Theme: Containerisation portability
Operational definition: Use of container technologies to ensure consistent runtime environments, enabling seamless deployment across development, testing, and production platforms.
Representative anonymised quotes: “Containers guarantee the same setup everywhere, from dev to production.” – P2 (Software Dev); “Docker made migrations painless but Kubernetes adds its own complexity.” – P4 (Software Dev); “We deploy updates in minutes because the container image already contains dependencies.” – P3 (Software Dev)

Theme: Automated recovery
Operational definition: Integration of automated recovery scripts and CI/CD pipelines that restore services or infrastructure automatically after a detected failure.
Representative anonymised quotes: “When a node fails, our pipeline redeploys it automatically within minutes.” – P9 (IT Ops); “Self-healing scripts restart containers before users notice any disruption.” – P10 (IT Ops); “Automation reduced mean-time-to-recovery by almost 60%.” – P2 (Software Dev)

Theme: High availability via cloud
Operational definition: Utilisation of redundant, geographically distributed cloud infrastructures and load-balancing mechanisms to maintain continuous service availability.
Representative anonymised quotes: “Cloud redundancy keeps systems alive even when one region goes down.” – P8 (Cloud Comp); “Multi-zone deployment gives us near-zero downtime.” – P7 (Cloud Comp)

Theme: Data replication and backup
Operational definition: Deployment of synchronous or asynchronous replication and systematic backups to safeguard data integrity and support rapid restoration.
Representative anonymised quotes: “We replicate data across three regions – storage costs rise but reliability is priceless.” – P3 (Software Dev); “Automated nightly backups saved us after a database corruption.” – P9 (IT Ops)

Theme: Monitoring and alerts
Operational definition: Continuous performance tracking and automated alert systems designed to detect anomalies, trigger early responses, and prevent cascading failures.
Representative anonymised quotes: “Real-time alerts help us act before users notice issues.” – P10 (IT Ops); “We use dashboards to track latency spikes and CPU thresholds.” – P8 (Cloud Comp); “Too many alerts cause fatigue; we had to calibrate severity levels.” – P9 (IT Ops)

Theme: Operational overhead
Operational definition: The cumulative administrative, computational, and cognitive burden associated with managing complex distributed DevOps infrastructures.
Representative anonymised quotes: “Keeping orchestration tuned consumes a third of our sprint time.” – P4 (Software Dev); “Complex pipelines require constant monitoring and retraining of staff.” – P9 (IT Ops)
Source: Created by the author.

