What is root cause analysis (RCA)?
Even when you test major new releases, new errors can still surprise you in the production environment. Why? Test environments aren’t always as close to production as you hope. Infrastructure changes can be made to an environment without being documented, so the test and production environments slowly drift apart.
Troubleshooting defects is time-consuming, so learning to troubleshoot faster is one of the best investments you can make as a software developer.
Root cause analysis (RCA) is a specific technique you can use to troubleshoot problems. With this technique, you analyze the issue at hand using a particular set of steps to identify the primary cause of the problem. RCA is based on the principle that it’s not useful to cater to the symptoms of a problem while ignoring its roots.
What are the benefits of root cause analysis?
Because RCA targets the primary cause of a problem rather than its symptoms, its main benefit is that you fix the problem at its source instead of repeatedly treating its effects.
How do I get started with root cause analysis?
Explain the problem
Use the rubber duck approach (rubber-duck debugging) to explain your problem simply. Explaining something forces you to order your thoughts. Jeff Atwood, the cofounder of the popular Q&A site Stack Overflow, has described how often software developers tell him that they started writing a question for the site, figured out the answer in the process, and never submitted it.
Try the following approaches to help you articulate the problem simply:
- Write a Stack Overflow question—even if you never submit it.
- File a detailed bug report.
- Explain it to a co-worker.
Collect log data (and search through it efficiently)
Next, gather more data about the problem and extract insights from it. Logging and monitoring help here: crash logs, application logs, server logs, and so on. You need evidence that the problem happened and, if possible, how long it has been happening and how often.
Within all of that data, you need to find specific data points quickly. Log search and analysis tools help you query the data you’ve gathered and turn it into insights, so you can diagnose and resolve issues faster.
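As a minimal sketch of this kind of search, assume plain-text logs in which each line starts with an ISO 8601 timestamp, followed by a level and a message. That layout, the sample lines, and the function name summarize_errors are hypothetical; real logs will differ. The snippet finds when a given error first and last appeared and how often it occurred per hour.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample log lines; the "timestamp level message" layout is an assumption.
SAMPLE_LOG = """\
2024-05-01T09:58:12 INFO request served path=/home
2024-05-01T10:02:47 ERROR intl extension not loaded
2024-05-01T10:15:03 ERROR intl extension not loaded
2024-05-01T11:01:30 WARNING slow response path=/search
2024-05-01T11:20:44 ERROR intl extension not loaded
"""


def summarize_errors(log_text, needle):
    """Return first/last occurrence and per-hour counts of ERROR lines containing needle."""
    timestamps = []
    for line in log_text.splitlines():
        parts = line.split(" ", 2)  # timestamp, level, message
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed layout
        stamp, level, message = parts
        if level == "ERROR" and needle in message:
            timestamps.append(datetime.fromisoformat(stamp))
    if not timestamps:
        return None  # no matching error found
    per_hour = Counter(ts.strftime("%Y-%m-%d %H:00") for ts in timestamps)
    return {
        "first_seen": min(timestamps),
        "last_seen": max(timestamps),
        "total": len(timestamps),
        "per_hour": dict(per_hour),
    }


summary = summarize_errors(SAMPLE_LOG, "intl")
print(summary)
```

For large volumes of logs, a dedicated search and analytics tool is a better fit than a script like this, but the questions it answers are the same: when did the problem start, and how often does it occur?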
Employ the five-whys technique
Next, identify the causal factors, that is, the immediate causes of the problem at hand. Do not stop after identifying one causal factor. Go further with the five-whys technique: ask “why?” iteratively until you get to the root of the problem. For example, suppose your website is returning an HTTP 500 error.
- Why? Because the web framework’s routing component malfunctioned.
- Why? Because it requires another component, which itself malfunctioned.
- Why? Because this component of the web framework requires the intl extension, which isn’t working.
- Why? Because it was accidentally deactivated after the server software got updated.
Five is only a guideline: you may reach the root cause in fewer steps, or you may need more.
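The chain of questions in the example above can be sketched as a simple lookup from each effect to its immediate cause. This is only an illustration of the reasoning, not a diagnostic tool; the CAUSES mapping and the ask_why helper are hypothetical names, and in practice each link comes from your own investigation.

```python
# Each entry maps an observed effect to its immediate cause,
# paraphrased from the error 500 example in the text.
CAUSES = {
    "website returns error 500": "routing component malfunctioned",
    "routing component malfunctioned": "a required component malfunctioned",
    "a required component malfunctioned": "intl extension is not working",
    "intl extension is not working": "extension deactivated after server software update",
}


def ask_why(symptom, causes, max_depth=10):
    """Follow effect -> cause links until no further cause is known."""
    chain = [symptom]
    # max_depth guards against accidental cycles in the mapping.
    while chain[-1] in causes and len(chain) <= max_depth:
        chain.append(causes[chain[-1]])
    return chain


chain = ask_why("website returns error 500", CAUSES)
for depth, step in enumerate(chain):
    prefix = "why? " if depth else ""
    print("  " * depth + prefix + step)

root_cause = chain[-1]  # the last cause with no known cause of its own
```

The loop stops when it reaches a cause that has no recorded cause of its own, which mirrors the idea of asking “why?” until there is nothing deeper to find.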
Get a second pair of eyes
As with code review, have another, unbiased person take a look at your code. Over time, the expectation of review will help you refine your process. Better yet, troubleshoot the problem in a pair.
How can AWS support your root cause analysis?
One of the primary ways AWS supports root cause analysis is by helping you ingest and analyze your log data. For this, we recommend Amazon OpenSearch Service. OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch. Amazon OpenSearch Service lets you securely run real-time search, monitoring, and analysis of business and operational data for use cases such as interactive log analytics, application monitoring, observability, and website search.
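To make log analytics more concrete, here is a sketch of a search request body in the OpenSearch query DSL, built as a plain Python dictionary. The field names level, message, and @timestamp and the helper build_error_query are assumptions about how your logs might be indexed, not something the service prescribes. The body matches recent error entries that mention a term and buckets them by hour.

```python
import json


def build_error_query(term, window="now-24h", interval="1h"):
    """Build a search body: recent ERROR entries mentioning `term`, bucketed over time."""
    return {
        "size": 20,
        "query": {
            "bool": {
                # Full-text match on the (assumed) message field.
                "must": [{"match": {"message": term}}],
                "filter": [
                    # Assumes `level` is indexed as a keyword field.
                    {"term": {"level": "ERROR"}},
                    # Only entries newer than the start of the window.
                    {"range": {"@timestamp": {"gte": window}}},
                ],
            }
        },
        "aggs": {
            # Count matching entries per interval to see frequency over time.
            "errors_over_time": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": interval}
            }
        },
        "sort": [{"@timestamp": {"order": "asc"}}],
    }


body = build_error_query("intl")
print(json.dumps(body, indent=2))
```

You would send a body like this to the _search endpoint of the index that holds your logs. Sorting in ascending time order puts the earliest matching entry first, which helps answer how long the problem has been happening.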
Get started with root cause analysis on AWS by creating an account today.