Your poor MTTR isn’t a technical issue

(Feature photo by Markus Spiske)

One of the biggest misconceptions about troubleshooting systems is that it requires deep, specific technical knowledge to locate and solve production issues. This assumption can often extend the time between the discovery and resolution of a problem. At first this may seem counterintuitive, so let’s look at some common scenarios to see which concept makes the most sense.

To start with, most assumptions about broad concepts are generally wrong because they are based on the expectation that there is a single, best way of doing things every time. There are certainly times when the developer of a particular solution can look at a problem a production application is having and instantly say “I know why that is happening”. This happens not because the developer deliberately left an issue but because most solutions have multiple, valid approaches. Some of them can have flaws that may not be immediately obvious. In some cases, all options have flaws and it is a matter of choosing the path with the weakness that is least likely to be found “in the wild”. The experienced developer will unconsciously be aware of these potential problems and, when presented with the issue in production, will instantly recognize it. In most cases these things will surface and be addressed in QA before they reach production. By the nature of production systems (where users are always more inventive than the best QA analyst), the application will encounter something that was not anticipated.

Once in production, the key to identifying the cause of the problem is to look at what is happening, whereas the person with deep, specific knowledge will most likely first look for what is expected to happen. Therein lies the trap. If a reasonable QA effort was put in before release, it is the unexpected that is more likely to be the issue. The easiest way to find an issue that isn’t immediately obvious is to have no expectations and instead observe what the behavior is and trace it back to its origin with no anticipation of what will be found. Finding the root cause is much more about applying a way of thinking than it is about knowing something in advance.

There is also a psychological aspect to having the original developer investigate the issue. For reasons that could fill another article (if not a whole book), the first thing the developer tends to look for is something outside their application as the cause. It is quite possible that something from outside is causing the issue. The more experienced the developer, the more likely this is the case. In troubleshooting, the goal is to fix the problem, and having any assumptions at the start can delay finding the problem wherever it is. Yes, sometimes those intuitive assumptions are useful, so long as they are abandoned if they don’t quickly prove out.

When the issue is determined to be outside the responsibility of the person or team investigating, the mistake most often made is to hand it off to another team before clearly understanding how the external system is causing the issue. Failure to articulate irrefutable evidence of the source of the issue before passing it on to those responsible for that part of the system can result in an unproductive back and forth between developers or teams, as each expects that the problem is not in their own work.

Once the issue is identified, deep knowledge may still hinder resolution and will not always be necessary. I was recently asked to help with an issue where the production support team followed a recommendation from the cloud platform vendor support to address an issue with throttling by moving the offending process on-premises in a hybrid solution. While platform support knows their platform really well, it can be implemented in myriad ways, and it is just not possible to always anticipate how combinations will work out. The support team followed the advice without thinking about why that process was deployed to the cloud to begin with. The change resulted in new issues because there were insufficient resources in the on-premises server. Further, when validating the change they only looked at the cloud monitoring (where the problem originally manifested). The failure point had been moved to the on-premises system and it was the business that reported the new manifestation of the problem (and brought me in to help).

The final solution was to manage the iterations in the process being throttled to bring it within threshold limits. This required no knowledge of the cloud platform beyond the fact that throttling was a factor, and no detailed knowledge of the specific implementation, as the logs clearly pointed to where the failure was occurring, which was the point where the counter needed to be added to avoid the threshold.
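As a rough illustration of that kind of fix, the sketch below counts calls inside a loop and pauses once the count reaches a limit. It is a minimal sketch under stated assumptions, not the code from that engagement: the function name `run_within_threshold`, its parameters, and the simple fixed-window approach are all hypothetical and chosen only for the example.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def run_within_threshold(
    items: Iterable[T],
    handle: Callable[[T], object],
    max_calls_per_window: int,
    window_seconds: float,
    sleep: Callable[[float], object] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call handle(item) for each item, pausing when the counter hits the limit.

    All names and the fixed-window strategy are assumptions for illustration.
    Returns the number of pauses that were taken.
    """
    window_start = clock()
    calls_in_window = 0
    pauses = 0
    for item in items:
        if calls_in_window >= max_calls_per_window:
            # The counter reached the limit: wait out the rest of the window.
            remaining = window_seconds - (clock() - window_start)
            if remaining > 0:
                sleep(remaining)
                pauses += 1
            # Start a new window with a fresh counter.
            window_start = clock()
            calls_in_window = 0
        handle(item)
        calls_in_window += 1
    return pauses
```

A call such as `run_within_threshold(records, send_record, 100, 60.0)` would process at most 100 records per 60 seconds, where `records` and `send_record` stand in for whatever the real process iterates over and calls. The `sleep` and `clock` parameters exist only so the behavior can be checked without real waiting.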

To sum up the lesson, the ability to suspend assumptions and ego is far more critical than specific technical knowledge to solve issues in production. During development it is common to be stuck for a while solving a bug and to ask someone else to look at the problem with a fresh perspective. Carrying this process on into production will resolve issues faster and leave more time for working on the next cool iteration.


(Originally published JAN 17, 2018 at InfoWorld as “Production system troubleshooting 101: it’s not always about technical knowledge”)

If you found this interesting, please share.

© Scott S. Nelson

From Agile to Fragile in 60 sprints

Feature image by Elisa Kennemer on Unsplash

The adoption of agile software development methodologies has been a necessary evolution to support the explosive demand for new and expanded capabilities. There is no doubt that without the broad adoption of agile practices much of the growth in technology, and all of those aspects of everyday life that are driven by technology, simply would not have happened.

Still, the adage “too much of a good thing” applies. Another old adage that comes to mind is “You can have it better, cheaper, faster. Pick any two”. Many organizations have insisted on all three. How did they do it? They sacrificed the documentation.

I’m not talking about saving shipping costs and trees by making manuals virtual, and then saving bandwidth by replacing the documents downloaded with the install files with links to online documentation (which has its own issues in this world of massive M&A). I’m talking about all those wonderful references that development teams, sometimes backed by technical writers, produced so that others may pick up where they left off to maintain and enhance the final applications. Yes, that documentation.

Self-Documenting Code does not make a Self-Documenting Solution

While no one can honestly disagree with the value put forth in the Manifesto for Agile Software Development: “Working software over comprehensive documentation”, I also don’t think the intention was to say that documentation impedes working software. Still, the manifesto has fed the meme (the original definition, not the funny GIFs) “Good code is self-documenting”. When I hear this, my response is “True; and knowing what code to read for a given issue or enhancement requires documentation”. My response lacks the desired impact for two reasons: It doesn’t easily fit on a bumper sticker and it requires putting time and effort into a task that many people do not like to do.

The danger of little or no documentation is that the application becomes dependent on “tribal knowledge”. In a perfect enterprise, this is a dependable approach because employee turnover is low and when people do depart they always do so with adequate notice and thoroughly train their replacements. I have heard these enterprises exist, though I have never spent any time working with one of them. I did, however, recently work with a business intelligence group whose entire ETL staff departed within a few weeks of each other after a few years of furiously building hundreds of data integrations in a dozen different business areas. They spent less than 9 hours in “knowledge transfer” sessions with my team, who were tasked with keeping the lights on until a new crew was hired and trained. There was not one page of documentation at the start of the knowledge transfer and I have yet to find a line of documentation in any of the code.

I’m not advocating the need for waterfall-style detailed design documents. In some ways, those can be worse than no documentation because they are written before the code and configurations they are intended to describe are created and fail to be updated when the actual implementation deviates. In an agile world, writing the documentation after the implementation is a sound approach that will support the manifesto value of “Working software over comprehensive documentation” by being just enough documentation to facilitate maintaining the software in the future.

Meeting between the Lines

How much is just enough? That is going to vary by both application (and/or system) and enterprise. Some applications are so simple that documentation in the code to supplement the “self-documenting” style is sufficient. More complex solutions will need documentation to describe things from different aspects, and the number of aspects is affected by whether maintenance is done by the development team or a separate production support group. The litmus test for whether your documentation is adequate is to take a look at it from the perspective of someone who has never heard of your application and needs to be productive in maintaining or enhancing it in less than a day. If you have difficulty in adopting that point of view (many people do, and twice as many developers), have someone outside your team review the documentation.

I find the following types of documents to be a minimum to ensure that a system can be properly managed once released to production:

  • Logical System Architecture
  • Physical System Architecture
  • Component Relation Diagrams
  • Deployment Procedures

Again, the level of detail and need for additional documentation is going to be driven by complexity and experience. Another factor is how common the relevant skills are. If the candidate pool for a particular platform or framework is shallow, more detail should be provided to act as a springboard for people who may be learning the technology in general while diving into the particular implementation.

Yes, there are Exceptions

Conversely, some solutions are true one-offs that are filling a very specialized need that is unlikely to evolve and may have a short lifespan. These implementations only really need sufficient reference to migrate them to another environment or decommission them to free up resources while not negatively impacting other systems. I do caution you to really be sure that an application falls into this category before deciding to minimize the documentation. What comes to my mind when I think of such decisions is the massive amount of resources dedicated in 1999 to dealing with two-digit years in applications that, when they were developed 10 or 20 years previously, were not expected to still be in use.

A Final Appeal

At the beginning I agreed with the manifesto value of working code prioritized over comprehensive documentation. In the days when most software life cycles began with tons of documentation and meetings to review the documents and meetings to review the results of the review, a great deal more beneficial build and test activities could have been done in that time instead. My experience in documenting the results of agile and other iterative processes toward the end of the development cycle and then reviewing that documentation with people outside the team is that design flaws are discovered when looking at the solution as a whole rather than at the implications for individual stories in a sprint. The broader perspective that waterfall tried to create (and often failed since most waterfall documentation does not match the final implementation) can be achieved better, cheaper and faster when documenting at the end of the epic. In this one case, picking cheaper and faster yields better.

Documenting the fruits of your software and application implementation labors may not be the most exciting part of your team’s work, but the results of not documenting can become the most painful experience for those that follow…or your next gig!


Originally published at InfoWorld

The Differences between IT Consultants and Contractors

I try to post original content. Sometimes that originality may only be in the presentation of the information, in which case I am attempting to provide (I hope) a clearer understanding or a simpler approach. Because of this personal rule of conduct, I first researched this topic, which I have thought and spoken about for quite some time, and was very surprised at what I found. What is already out there on the subject of comparing contractor and consultant roles is sometimes contradictory and has some distinctions that I think are based on only thinking about individuals rather than encompassing companies that provide both services as well. Rather than argue the points others have made (which I don’t necessarily disagree with in certain, specific contexts) I will present my thoughts and experience and leave it to you if you wish to research further.

What’s the Difference?

In short, the basic difference between the two is simple: A contractor is an individual who possesses a specific skill set that they will utilize to your specification, where a consultant is an individual who has experience with developing a solution within a domain where you need assistance.

The basic difference is also inadequate for understanding which one you need for a given project (or aspect of a project) and how to work with them to your best advantage, so let’s dive a little deeper into the more subtle differences.

While you may work with both as individuals it is more common to work with them in groups. A group of consultants will be a team assembled on your behalf by a consulting company (AKA partner, group, professional services provider, etc.) and should be self-managing. A group of contractors may come from the same agency but will require management (which may also be contracted).

Consultants can help you define the problem and work with you to develop a plan to get from current state to target state.  Frequently they also perform and/or manage the tasks and deliverables of the plan. Consultants can direct contractors to execute to the plan, and will often provide those contractors as well.

Another difference is that for a contractor to be valuable, they must be deeply familiar with a specific aspect of the project, where consultants need only be familiar with the general domain of the project. One of the best reasons for engaging consultants is their proven ability to navigate through the unknown.

Working with Consultants vs. Contractors

One difference not included above is cost. There are many different fee structures for either, though they can all be broken down (for the sake of comparison) to cost per hour of effort. Consultants are almost always a higher hourly cost. The difference is usually reflected in the value provided during that time, meaning that you will get more benefit for each hour of consulting. The key word in the previous sentence is usually.

There are two common scenarios where the value is not always higher with consultants. The first is when it is the wrong consultant. The wrong consultant can be engaged for any number of reasons, and once this is determined, it should be corrected. This, however, is not the most common reason for missing out on the full value of a consultant.

The most common reason for not realizing the maximum value of a consultant or consulting team is working with them like they are contractors. Consultants should be actively involved at all levels of the project. During requirements definition they can draw on their experience of what similar projects missed including early on, and help determine prioritization through an understanding of the effort involved in delivering a requirement. Consultants will be able to apply experience in planning, knowing what tasks can be done in parallel to support timelines and where risks are most likely to occur along with mitigation approaches. Once the delivery phase has begun, consultants will recognize issues and opportunities during regular reviews that might go unnoticed by those who have not done similar projects in the past. Every consulting company I have worked with has a project management practice, and if it is a team of consultants engaged on a project it will generally yield the most value if part of that team is a project manager who will, among other contributions, help the client to realize the maximum benefit of working with the consulting team.

Having one or more consultants on a project and then tasking them the same way as contractors is like rowing a power boat. It can still get from one place to another, but the boat is under-utilized, the journey will take more work than required, and it will not be nearly as much fun!

Which is Best for Your Project?

If your project involves technologies that your enterprise is already comfortably familiar with and you just need more hours in the day, contractors should fill the need nicely. You may be implementing a larger project where an isolated area is outside of your experience and a contractor can fill that gap and train your people on how to maintain it afterwards. Or the project you are working on is scaling out your technical landscape and you will need to keep on someone afterwards for maintenance, so contracting can be a “try before you buy” approach to determine the right candidate.

If there is a concern about whether the project is the right thing to do or the technologies are the right ones to use, consultants can bring experience and a fresh viewpoint to increase confidence. If a project will introduce more than one or two completely new aspects to the enterprise, engaging a consultant should certainly be considered. The nature of consulting makes them familiar and comfortable with the unknown. For many organizations, internal teams need to be more focused on the day-to-day operations and introducing change to the technical landscape can be better served by professionals for whom change is the day-to-day operation.

Port Tunneling with PuTTY

Recently I had a situation where a combination of firewalls and load balancers prevented me from testing an application. Fortunately, an experienced server admin had a solution that I am sharing here: Use PuTTY for port tunneling.

Create and save an SSH session for the host

(Screenshot: Create PuTTY Session)

Load the session, then go to Connection > SSH > Tunnels

(Screenshot: Enter PuTTY Tunnel Details)

Enter port and server info then click Add

(Screenshot: Save Tunnel Connection)

Click Open
Return to Session and click Save to store the tunnel settings for future use
Now you can access the remote machine:port by using localhost:port, e.g., http://localhost:8080 will take you to http://anyhostname:8080 in the above examples.
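
If the application still seems unreachable, a quick check is whether anything is accepting connections on the local end of the tunnel. The sketch below is one hedged way to do that from Python; the function name `tunnel_is_listening` is hypothetical, and port 8080 is simply the port used in the example above. Note that it only confirms the local port accepts a TCP connection. It says nothing about whether the remote host behind the tunnel is answering.

```python
import socket


def tunnel_is_listening(port: int, host: str = "localhost", timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration: a True result means only that the
    local end of the tunnel is open, not that the remote service responds.
    """
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `tunnel_is_listening(8080)` while the PuTTY session is open should return `True`; if it returns `False`, the session is probably not connected or the tunnel was not added before clicking Open.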

This can also be done with BitVise Tunnelier (shown below for accessing MySQL):

(Screenshot: BitVise SSH Tunneling)