Your poor MTTR isn’t a technical issue

(Feature photo by Markus Spiske)

One of the biggest misconceptions about troubleshooting systems is that it requires deep, specific technical knowledge to locate and solve production issues. This assumption can often extend the time between the discovery and resolution of a problem. At first this may seem counterintuitive, so let’s look at some common scenarios to see which concept makes the most sense.

To start with, most assumptions about broad concepts are generally wrong because they are based on the expectation that there is a single, best way of doing things every time. There are certainly times when the developer of a particular solution can look at a problem a production application is having and instantly say “I know why that is happening”. This happens not because the developer deliberately left an issue but because most solutions have multiple valid approaches. Some of them can have flaws that may not be immediately obvious. In some cases, all options have flaws and it is a matter of choosing the path with the weakness that is least likely to be found “in the wild”. The experienced developer will unconsciously be aware of these potential problems and, when presented with the issue in production, will instantly recognize it. In most cases these things will surface and be addressed in QA before they reach production. But by the nature of production systems (where users are always more inventive than the best QA analyst), the application will eventually encounter something that was not anticipated.

Once in production, the key to identifying the cause of the problem is to look at what is happening, whereas the person with deep, specific knowledge will most likely first look for what is expected to happen. Therein lies the trap. If a reasonable QA effort was put in before release, it is the unexpected that is more likely to be the issue. The easiest way to find an issue that isn’t immediately obvious is to have no expectations and instead observe what the behavior is and trace it back to its origin with no anticipation of what will be found. Finding the root cause is much more about applying a way of thinking than it is about knowing something in advance.

There is also the psychological aspect of having the original developer investigate the issue. For reasons that could fill another article (if not a whole book), the first thing the developer tends to look for is something outside their application as the cause. It is quite possible that something from outside is causing the issue. The more experienced the developer, the more likely this is the case. In troubleshooting, the goal is to fix the problem, and having any assumptions at the start can delay finding the problem wherever it is. Yes, sometimes those intuitive assumptions are useful, so long as they are abandoned if they don’t quickly prove out.

When an issue is determined to be outside the responsibility of the person or team investigating, the mistake most often made is to hand it off to another team before clearly understanding how the external system is causing the issue. Failing to articulate irrefutable evidence of the source of the issue before passing it on to those responsible for that part of the system can result in an unproductive back and forth between developers or teams, as they also expect that the issue is not in their work.

Once the issue is identified, deep knowledge may still hinder resolution and will not always be necessary. I was recently asked to help with an issue where the production support team followed a recommendation from the cloud platform vendor’s support to address an issue with throttling by moving the offending process on-premise in a hybrid solution. While platform support knows their platform really well, there are myriad ways it can be implemented, and it is just not possible to always anticipate how combinations will work out. The support team followed the advice without thinking about why that process was deployed to the cloud to begin with. The change resulted in new issues because there were insufficient resources on the on-premise server. Further, when validating the change they only looked at the cloud monitoring (where the problem originally manifested). The failure point had been moved to the on-premise system and it was the business that reported the new manifestation of the problem (and brought me in to help).

The final solution was to manage the iterations in the process being throttled to bring it within threshold limits. This required no knowledge of the cloud platform beyond the fact that throttling was a factor, and no detailed knowledge of the specific implementation. The logs clearly pointed to where the failure was occurring, which was the point where a counter needed to be added to stay under the threshold.
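As a rough illustration of what “adding a counter” can look like, here is a minimal Python sketch. It is not the code from that engagement: the function name, the parameters, and the fixed-window counting approach are all assumptions made for the example, and a real platform’s limits may be measured differently.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_within_threshold(
    items: Iterable[T],
    handler: Callable[[T], R],
    max_calls_per_window: int,
    window_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[R]:
    """Process items, pausing whenever the per-window call count is used up.

    All names here are hypothetical; the limit and window would come from
    whatever threshold the throttling platform documents.
    """
    if max_calls_per_window < 1:
        raise ValueError("max_calls_per_window must be at least 1")

    results: List[R] = []
    calls_in_window = 0
    window_start = clock()

    for item in items:
        now = clock()
        # A new window has started on its own: reset the counter.
        if now - window_start >= window_seconds:
            window_start = now
            calls_in_window = 0

        # The counter hit the limit: wait out the rest of the window.
        if calls_in_window >= max_calls_per_window:
            remaining = window_seconds - (now - window_start)
            if remaining > 0:
                sleep(remaining)
            window_start = clock()
            calls_in_window = 0

        results.append(handler(item))
        calls_in_window += 1

    return results
```

The `sleep` and `clock` parameters exist only so the pacing can be checked without actually waiting. In normal use they can be left at their defaults.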

To sum up the lesson, the ability to suspend assumptions and ego is far more critical than specific technical knowledge for solving issues in production. During development it is common to be stuck for a while solving a bug and to ask someone else to look at the problem with a fresh perspective. Carrying this process on into production will resolve issues faster and leave more time for working on the next cool iteration.


(Originally published JAN 17, 2018 at InfoWorld as “Production system troubleshooting 101: it’s not always about technical knowledge”)

© Scott S. Nelson

PCAnywhere and Your Firewall

I’m always forgetting what ports to set for PCAnywhere use. This time I thought I’d share the link I found on PCAnywhere ports at http://www.nthelp.com/NT6/pcanywhere_ip_port_usage.htm.

These days, though, I’m using TeamViewer, which has no problem with firewalls but does get filtered by some network admins.


GIS

For those smarter or more up on acronyms than I, ignore this post.

As posted at my parent blog, I was reading this SF story last night in Analog. There was a reference to GIS, which I’ve seen around a lot lately on the job boards but haven’t bothered to look up. For the curious, it stands for geographic information system.


A Real Annoyance

The point of this post is getting rid of that annoying incompatibility notice about Real every time an update is made to Firefox. But first, a rant…

I am not a fan of the Real Player to begin with.  I certainly give it credit for being one of the early multimedia players. I also give them credit for being one of the first major abusers of the installation process, changing extension mappings without asking, installing itself as a service when it is only used occasionally, and being really obtuse in how to fix these problems afterward.   When I did PC maintenance service (before the Geek Squad, which people keep reminding me that I thought of four years before they did) I routinely removed the RealPlayer service and was always thanked for speeding up the machine.

I even tried to give the Real Player a second chance when they bought the Napster name. That lasted about 2 minutes past the installation, where it still did all the things that annoyed me about their 1.0 version. The Real Player is not installed on my personal machine. I used to routinely uninstall it from my work machine until my current employer decided to build their compliance training application using it. Which brings me to my point.

After updating the excellent password manager I use (RoboForm), I was once again confronted with this annoying screen.

Real Extension Annoyance

My first shot in Google (remove incompatible firefox extension) got me pretty close to a solution with a Mozilla Support thread. The last entry in the thread did the trick for me. In case that link is dead, the entry was:

Ok, run the program “regedit” and goto “HKEY_LOCAL_MACHINE\Software\Mozilla\Firefox\Extensions”

If there is nothing there try “HKEY_CURRENT_USER\Software\Mozilla\Firefox\Extensions”

There you should see the extension… delete the registry entry.

That worked for me…

The first path worked for me, too, specifically the key {ABDE892B-13A8-4d1b-88E6-365A6E755758}, with the value of “C:\Program Files\Real\RealPlayer\browserrecord”
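For anyone who would rather look before deleting, here is a small read-only Python sketch that lists whatever is registered under the two Firefox extension locations quoted above. It uses only the standard library `winreg` module. The function name is made up for this example, it deletes nothing, and it assumes the entries are stored as values under those keys (on 64-bit Windows a 32-bit Firefox may keep them under a different node, which this sketch does not check).

```python
import sys

# Registry path shared by both hives mentioned in the support thread.
FIREFOX_EXTENSION_KEY = r"Software\Mozilla\Firefox\Extensions"


def list_firefox_extension_entries():
    """Return (hive name, value name, value data) tuples.

    Read-only. Returns an empty list when not running on Windows.
    """
    if sys.platform != "win32":
        return []

    import winreg  # Only available on Windows.

    hives = [
        ("HKEY_LOCAL_MACHINE", winreg.HKEY_LOCAL_MACHINE),
        ("HKEY_CURRENT_USER", winreg.HKEY_CURRENT_USER),
    ]
    entries = []
    for hive_name, hive in hives:
        try:
            with winreg.OpenKey(hive, FIREFOX_EXTENSION_KEY) as key:
                # QueryInfoKey returns (subkey count, value count, last modified).
                value_count = winreg.QueryInfoKey(key)[1]
                for index in range(value_count):
                    name, data, _value_type = winreg.EnumValue(key, index)
                    entries.append((hive_name, name, data))
        except OSError:
            # The key does not exist in this hive, or cannot be read.
            continue
    return entries


if __name__ == "__main__":
    for hive_name, name, data in list_firefox_extension_entries():
        print(f"{hive_name}: {name} -> {data}")
```

Any entry whose data points at a RealPlayer folder is the one to remove by hand in regedit.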


Integration at the Glass

Ran across this one recently from a UK services sales guy in reference to integrating legacy mainframe applications into a portal.

From http://www.manageability.org/blog/stuff/integration-at-the-glass-and-80-20-point:

“Integration at the glass” is a term associated with Portal development, where developers or users can quickly link portlets together creating composite applications.
