An outage of your applications or infrastructure feels like a storm crashing into your organization. But with a proper DevOps on-call process in place, you’ll be able to weather the storm and quickly get back into the sunshine.
In today’s always-on, 24/7/365 environment, efficiency and speed are crucial when addressing an outage, from the second you receive a notification. Agile organizations no longer rely on tiered support structures for issues of varying severity. Instead, the velocity of a modern organization requires a lightweight, integrated approach.
On-call can be a miserable experience for employees. But, with the right expectations and proper assignment of responsibilities, you can reduce burnout and make your organization more resilient to severe outages.
Integrate Development Crews
As DevOps engineers, we treat developers as our customers, or internal stakeholders. They rely on our tools and automation to deliver code quickly and reliably across different environments at any time. We want to keep a customer-service focus and resolve developers’ issues as fast as possible, whether that means updating secrets, addressing outages, or maintaining resilient infrastructure.
Often, the best person to debug an issue is the person who wrote the code. These internal customers are highly likely to understand which behaviors an outage is affecting, so they should be integrated into the on-call response team. Coupled with monitoring tools for both DevOps and the developers, this lets your organization recover quickly from outages and limit the amount of time spent on those dreaded late-night calls.
Implement Extreme Ownership
Involving the developers from the start knocks down any silos that exist between the different roles and delivers insight into an issue that goes beyond what a log alone can show. In a development sprint, if the developers know they’re solely responsible for any errors caused by their code, extreme ownership takes over, driving best practices and ensuring they double- or triple-check their code when it’s deployed.
Stop “Throwing It Over the Wall” for DevOps to Handle
Avoid behaviors like “throwing it over the wall to DevOps to fix” by establishing a response that includes the developers themselves. Then, as your organization continues to evolve, make steady improvements by designing better checks to ensure bad code never makes it to a production environment.
Own Your Infrastructure
Developers should be on call for the code they deploy to environments when following the “you build it, you run it” model. DevOps is in turn responsible for the resiliency of environments the developers depend on. Ownership of infrastructure means identifying performance issues or outages that are impacting your applications. Complete visibility across your entire stack through effective monitoring points you in the right direction off the bat.
Work towards building systems that are “self-healing”, with capabilities like auto-scaling or fail-over modes, to avoid those 3 AM calls that nobody likes to take. Depending on the impact, having a window in which the environments keep functioning in a degraded state leads to happier engineers, and it lessens the urgency and potential impact compared with a system that has stopped working entirely.
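As a rough illustration of a fail-over mode, here is a minimal Python sketch. Everything in it is hypothetical: the `call_with_failover` helper, the `urgency_for` policy, and the `primary` and `standby` stand-ins are invented for this example and do not come from any particular tool. The idea is simply that when a standby can answer, the failure becomes a daytime ticket instead of a 3 AM page.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllBackendsFailed(Exception):
    """Raised when every backend in the list has failed."""


def call_with_failover(backends: Sequence[Callable[[], T]]) -> Tuple[T, int]:
    """Try each backend in order; return (result, index of the backend that answered)."""
    errors = []
    for index, backend in enumerate(backends):
        try:
            return backend(), index
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed: {errors!r}")


def urgency_for(served_by: int) -> str:
    """Hypothetical policy: no action if the primary answered, a daytime ticket otherwise."""
    return "none" if served_by == 0 else "ticket"


# Stand-in backends for illustration only.
def primary() -> str:
    raise ConnectionError("primary is down")


def standby() -> str:
    return "ok from standby"


result, served_by = call_with_failover([primary, standby])
urgency = urgency_for(served_by)
print(result, served_by, urgency)
```

In this sketch, only an `AllBackendsFailed` error would justify waking someone up; a request served by the standby still needs follow-up, but it can wait until morning.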
Give Developers the Keys
Developers also need access to monitoring tools across their entire application stack. Deliver focused dashboards that show only the details that fall within their responsibility. These will more than likely be significantly different from what you, as a DevOps engineer, view day-to-day.
Imagine how much easier troubleshooting will be with a focused log right on a specific application component. Most of the time, the developers can resolve the issue themselves with good error logging and monitoring. These methods also keep most of the application-related issues away from the engineering and support teams.
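To make the idea of a focused log concrete, here is a small Python sketch that filters a stream of log events down to the components a developer owns. The `LogEvent` shape, the `focused_view` function, and the component names are assumptions made up for this example; a real monitoring tool would expose its own query or filter syntax.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Numeric ranks so that levels can be compared.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


@dataclass(frozen=True)
class LogEvent:
    component: str
    level: str
    message: str


def focused_view(
    events: Iterable[LogEvent], owned: Set[str], min_level: str = "WARNING"
) -> List[LogEvent]:
    """Keep only events from owned components at or above min_level."""
    threshold = LEVELS[min_level]
    return [
        e for e in events
        if e.component in owned and LEVELS[e.level] >= threshold
    ]


# Sample events, invented for illustration.
events = [
    LogEvent("checkout-api", "ERROR", "payment provider timeout"),
    LogEvent("checkout-api", "INFO", "request served"),
    LogEvent("load-balancer", "ERROR", "node drained"),
    LogEvent("cart-service", "WARNING", "cache miss rate high"),
]

view = focused_view(events, owned={"checkout-api", "cart-service"})
for event in view:
    print(event.component, event.level, event.message)
```

The load balancer error is real, but it belongs on the DevOps dashboard, so it never reaches the developer's view.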
Developers should be responsible for the code they’re deploying. Furthermore, they should have the access and ability to perform base-level DevOps tasks themselves. A critical part of DevOps is delivering feedback from builds through logs. Developers need to be able to use and understand these logs in order to problem-solve.
If developers can re-trigger builds, are trained on reading logs, and can generate their own secrets seamlessly, you’re on the right track to reducing the number and duration of those dreaded late-night calls.
Every developer should be well-versed in the DevOps tools stack, able to perform light troubleshooting, and able to self-govern the quality of the code they’re sending to environments.
Continually Improve Your On-Call Response
When working with developers on-call, start mapping out and recognizing common issues. Take note of where the developers are running into problems, and strengthen the people and processes involved so that the way issues are handled keeps improving.
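One lightweight way to start mapping common issues is to tally past incidents by category. The Python sketch below assumes a hypothetical incident record with `category` and `service` fields and a made-up `top_recurring_issues` helper; the sample data is invented, and in practice the records would come from whatever incident tracker you already use.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_recurring_issues(
    incidents: Iterable[Dict[str, str]], min_count: int = 2
) -> List[Tuple[str, int]]:
    """Return (category, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(incident["category"] for incident in incidents)
    return [(cat, n) for cat, n in counts.most_common() if n >= min_count]


# Sample on-call log, invented for illustration.
incidents = [
    {"category": "expired-secret", "service": "checkout-api"},
    {"category": "failed-build", "service": "cart-service"},
    {"category": "expired-secret", "service": "cart-service"},
    {"category": "disk-full", "service": "reporting"},
    {"category": "failed-build", "service": "checkout-api"},
    {"category": "expired-secret", "service": "reporting"},
]

recurring = top_recurring_issues(incidents)
print(recurring)
```

Categories that keep reappearing at the top of this list are good candidates for automation, better documentation, or developer training.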
Find the developers who promote a shared-responsibility process and listen to their feedback. Mold your systems and people to handle issues either on their own or with DevOps, side-by-side. Eliminate the mentality of throwing issues over the wall and build partnerships with the developers to deliver better systems.
As you continue to evolve and improve your DevOps tools, you’ll see fewer and fewer incidents hitting your systems at the wrong times, which makes both production deployments and day-to-day development more resilient to outages.
Ideally, as an organization, you want to strive towards eliminating these late-night on-call situations. Whether you are improving the pipelines, making your infrastructure self-healing and resilient, or enabling a comprehensive testing framework, build your developers’ skills and knowledge of their DevOps tools. The result is a team with clear responsibilities and shared objectives that can weather any storm.