How to Improve Your Troubleshooting Skills

While I was working towards my undergrad at Utah State, I had the opportunity to do tech support at a local web hosting company. While I was there, I acquired and honed basic troubleshooting skills that have helped me be a better software engineer not only when debugging problems, but also when designing software. This is my attempt at sharing these skills with you.

Know Your System

The first item that will help you improve your troubleshooting skills is to know your system. While working tech support for a web hosting company, I found that the more I knew about our internal systems as well as the internet in general, the easier it was for me to assist our clients with the issues they encountered. When I understood how a request goes from the client's machine to our servers and back again, I could identify possible failure points based on the issues the client described. The same was true when I gained an understanding of how web servers served files and how the tools we provided interacted with them. The more I understood the system, the easier it was for me to identify possible points of failure to investigate.

Whatever your domain, learning as much as you can about it will boost your troubleshooting skills more than many other items on this list. So, take time to learn about it. Are you working on a web-based system? Learn how the internet works. Learn how web servers work. Learn HTTP and REST and any other technologies your software leverages. Are you working on firmware? Learn about the hardware. Learn how the controllers, microcontrollers, and other devices work. Learn the various protocols that are used to interact with them. Learn the differences between them. Take time to be a master of your craft.

Think Small

My next tip is to break the problem down into small parts. This really comes down to a way of thinking that'll influence the rest of the process. If you take the problem at face value, it's often too big to deal with as one big chunk. The problem needs to be broken down into separate pieces so that each can be considered and analyzed. Domain knowledge will be a big help here as you consider the problem and break it into small parts.

Don't be afraid to think too small. I'd actually encourage it. Many problems that I encounter end up being caused by very small issues. If it could potentially break something, it's not too small to be considered as a root cause. I've seen issues that were caused by misspelled words, files in the wrong place (even though it seemed it should have worked), missing punctuation, and many other seemingly small issues.

One benefit of thinking small is that small theories are easy to test. Is it a missing semicolon on line 85? Pull up the file and have a look. The smaller the hypothesis or theory, the easier it is to test.
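
To make the idea concrete, here is a minimal sketch of testing one very small hypothesis: "a setting name is misspelled." Everything in it is invented for illustration, including the KNOWN_KEYS list, the find_suspect_keys helper, and the sample config. It only assumes Python's standard difflib module.

```python
import difflib

# Hypothetical list of setting names the application understands.
KNOWN_KEYS = ["hostname", "port", "timeout"]


def find_suspect_keys(config: dict, known_keys: list) -> dict:
    """Map each unrecognized key to its closest known key (or None)."""
    suspects = {}
    for key in config:
        if key in known_keys:
            continue
        # get_close_matches returns the best matches above a similarity cutoff.
        close = difflib.get_close_matches(key, known_keys, n=1)
        suspects[key] = close[0] if close else None
    return suspects


# Invented config with one transposed-letter typo.
config = {"hostnmae": "example.com", "port": 8080}
suspects = find_suspect_keys(config, KNOWN_KEYS)
for key, suggestion in suspects.items():
    if suggestion:
        print(f"Unknown key '{key}'. Did you mean '{suggestion}'?")
    else:
        print(f"Unknown key '{key}'.")
```

A check like this takes seconds to run, which is the point: a tiny theory is cheap to confirm or rule out before moving on to bigger ones.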

Root Cause Analysis

Once you have a basic understanding of the system(s) you're working with, you can then leverage that knowledge to perform a root cause analysis. There are various ways to do this, and entire books have been written on the subject. Just Google "root cause analysis" and you'll find many results on a wide variety of methodologies. One basic root cause analysis technique is called the 5 Whys.

In the 5 Whys, you write down a one-sentence summary of the problem. Then you ask yourself, "Why?" Why is the problem happening? What's causing it? Don't spend too much time on it. Just write the first answer that seems reasonable. Then ask "why" again, this time about the answer you just wrote. Keep repeating the process until you've asked "why" five times. Generally, by the time you've asked "why" the fifth time, you've narrowed it down to something that's very likely the root cause.
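
If it helps to see the 5 Whys written out, here is a minimal sketch of the process as a small Python class. The FiveWhys name, its methods, and the web hosting scenario are all made up for this example; pen and paper work just as well.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """Holds a one-sentence problem and the chain of answers to "why?"."""
    problem: str
    answers: list = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        """Record the first reasonable answer to the latest "why?"."""
        self.answers.append(answer)

    def likely_root_cause(self) -> str:
        """The most recent answer is the current best guess at the root cause."""
        return self.answers[-1] if self.answers else self.problem

    def report(self) -> str:
        """Render the chain as indented text, one level per "why?"."""
        lines = [f"Problem: {self.problem}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"{'  ' * depth}Why #{depth}? {answer}")
        return "\n".join(lines)


# Invented scenario, for illustration only.
analysis = FiveWhys("The client's site returns an error page.")
analysis.ask_why("The application cannot connect to its database.")
analysis.ask_why("The database password in the config is rejected.")
analysis.ask_why("The config still holds the old password.")
analysis.ask_why("The new config file was never loaded.")
analysis.ask_why("The new config file was uploaded to the wrong directory.")

print(analysis.report())
print("Likely root cause:", analysis.likely_root_cause())
```

Each answer becomes the subject of the next "why," so the chain moves from the symptom toward something small enough to check directly.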

Whatever your chosen technique or methodology for finding the root cause, use it. Use it again and again. Refine your technique until you have something that works for you. Identifying the root cause can be a little like shooting in the dark, but spending the time to do this before jumping into solutions will greatly reduce the time you spend on the problem later on.

Test a Hypothesis

Now that you've performed a root cause analysis, you should have a good idea of at least one area of the system that could be causing the issue. Maybe you've even narrowed it down to a specific file or specific piece of code. Now is the time to test your hypothesis. To do this, you might set breakpoints in the code or manually test the behavior. You may review config files or do any number of things to see if what you think is the root cause is, in fact, the root cause.

If you did your due diligence in the root cause analysis, this step should hopefully be pretty quick. However, that's not always the case. Testing this hypothesis may uncover a deeper cause that wasn't known before. Whatever the case may be, don't give up if your first hypothesis fails. It may take a few iterations of the process up to this point to uncover the true cause and verify it.

Once you've tested a hypothesis and verified that it is, in fact, the root cause, you can then start formulating solutions. In doing so, you may follow a similar process to the one outlined here. Whatever your process looks like, your domain knowledge will be a great tool for you as you brainstorm and test these solutions.

Be Persistent

While knowledge is one of the most important things to help you troubleshoot, persistence matters even more. Don't give up if your first guess at the root cause is wrong. Don't be afraid to try multiple hypotheses even if one of them doesn't seem very likely. In troubleshooting, the time you spend and the number of failures you have don't matter so much. What's important is finding and fixing the issue. Whatever path you take to get there, and however many paths you go down along the way, doesn't matter so long as the issue is eventually identified and resolved. So, be persistent. If you're stuck on an issue, try the crazy idea that keeps coming into the back of your mind. That might just be the idea that provides the breakthrough.

Remember What You Learn

Throughout this process, you'll likely learn something. Whether you learn more about the system you're working on, the troubleshooting process you follow, or the people you interact with, be sure to remember it. There have been plenty of times that I've solved a problem once and not written it down only to encounter the issue again and have to repeat the troubleshooting process all over again. When I write things down, I'm more prone to remember them for next time. Even if I don't remember them, I'll at least have a record of the issue I saw and how I fixed it so I can speed up my work next time.
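
One low-effort way to keep that record is a plain text log you can search later. The sketch below is just one possible approach, not a recommendation of a particular tool. The record_fix and find_fixes helpers, the file name, and the sample entry are all hypothetical, and the demo writes to a temporary directory so it leaves nothing behind in your project.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def record_fix(log_path: Path, symptom: str, cause: str, fix: str) -> None:
    """Append one solved problem to the log as a single JSON line."""
    entry = {
        "date": date.today().isoformat(),
        "symptom": symptom,
        "cause": cause,
        "fix": fix,
    }
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")


def find_fixes(log_path: Path, keyword: str) -> list:
    """Return past entries whose symptom or cause mentions the keyword."""
    if not log_path.exists():
        return []
    keyword = keyword.lower()
    matches = []
    with log_path.open(encoding="utf-8") as log:
        for line in log:
            if not line.strip():
                continue  # tolerate blank lines added by hand
            entry = json.loads(line)
            if keyword in entry["symptom"].lower() or keyword in entry["cause"].lower():
                matches.append(entry)
    return matches


# Demo with an invented entry, written to a throwaway location.
log_file = Path(tempfile.mkdtemp()) / "troubleshooting_log.jsonl"
record_fix(
    log_file,
    symptom="Site returns 500 after deploy",
    cause="Config file uploaded to the wrong directory",
    fix="Moved the config file into the application root",
)
for entry in find_fixes(log_file, "500"):
    print(entry["date"], "|", entry["cause"], "->", entry["fix"])
```

The next time a similar symptom shows up, a quick search of the log is a cheap first step before repeating the whole troubleshooting process.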

Every experience is a learning experience. Take time to recognize what you learn. You may not always need to write it down, but it is important to recognize what you learn. Doing so will improve you as a person and help you be more mindful about your work.

I hope these thoughts have helped you consider your own troubleshooting practices and identify an area that you can improve. You don't need to follow my suggestions verbatim, but I hope you're always trying to improve your skills. What do you think? Did I hit the nail on the head or did I miss the mark? What's helped you with your troubleshooting skills? Anything you'd like to add? Let me know in the comments below.
