Investigating Intermittent Problems

I came across a great article by James Bach titled “How to Investigate
Intermittent Problems”. It has plenty of good advice, particularly its
principles of intermittent problems and its suggestions for dealing with
elusive ones. Highly recommended. Read it here

Here’s some goodness from the article:

Some Principles of Intermittent Problems:

  • Be comforted: the cause is probably not evil spirits.
  • If it happened once, it will probably happen again.
  • If a bug goes away without being fixed, it probably didn’t go away for good.
  • Complex and baffling behavior often has a simple underlying cause.
  • Complex and baffling behavior sometimes has a complex set of causes.
  • Intermittent problems often teach you something profound about your product.
  • It’s easy to fall in love with a theory of a problem that is sensible, clever, wise,
    and just happens to be wrong.
  • The key to your mystery might be resting in someone else’s common knowledge.
  • The Pentium Principle of 1994: an intermittent technical problem may pose a *sustained
    and expensive* public relations problem.
  • The problem may be intermittent, but the risk of that problem is ever present.

Some General Suggestions for Investigating Intermittent Problems:

  • Recheck your most basic assumptions: are you using the computer you think you are
    using? are you testing what you think you are testing? are you observing what you
    think you are observing?
  • Invite more observers and minds into the investigation.
  • If someone tells you what the problem can’t possibly be, consider putting extra attention
    into those possibilities.
  • Seek tools that could help you observe and control the system.
  • Establish a central clearinghouse for mystery bugs, so that patterns among them might
    be easier to spot.
  • Systematically cover the input and state spaces.
  • Consider controlling things that you think probably don’t matter.
  • Simplify. Try changing only one variable at a time; try subdividing the system. (helps
    you understand and isolate problem when it occurs)
  • Complexify. Try changing more variables at once; let the state get “dirty”. (helps
    you make a lottery-type problem happen)
  • Beware of burning huge time on a small problem. Keep asking, is this problem worth
    it?
  • When all else fails, let the problem sit a while, do something else, and see if it
    spontaneously recurs.
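To make a few of these suggestions concrete (“Seek tools that could help you observe and control the system” and “Complexify… helps you make a lottery-type problem happen”), here is a minimal Python sketch of a rerun-and-replay harness. It assumes the system under test can be driven entirely by a seeded random number generator, which real systems often can’t be. The names `flaky_operation`, `hunt` and `replay` are hypothetical and do not come from Bach’s article. The idea is to run many trials, record the seed of each one, and then replay any failing seed deterministically, so a rare failure becomes one you can reproduce on demand.

```python
import random


def flaky_operation(rng: random.Random) -> bool:
    """Hypothetical stand-in for the system under test.

    Returns True on success. It fails only when two individually
    uncommon conditions happen to coincide, which is what makes
    the failure look intermittent.
    """
    cache_cold = rng.random() < 0.1
    slow_network = rng.random() < 0.1
    return not (cache_cold and slow_network)


def hunt(trials: int, base_seed: int = 0) -> list[int]:
    """Run many trials, each with its own recorded seed,
    and return the seeds of the runs that failed."""
    failing_seeds = []
    for i in range(trials):
        seed = base_seed + i
        if not flaky_operation(random.Random(seed)):
            failing_seeds.append(seed)
    return failing_seeds


def replay(seed: int) -> bool:
    """Re-run a single trial deterministically from its seed."""
    return flaky_operation(random.Random(seed))


if __name__ == "__main__":
    failures = hunt(10_000)
    print(f"{len(failures)} failures in 10000 runs")
    if failures:
        first = failures[0]
        # The same seed yields the same outcome on every replay.
        assert replay(first) is False
        print(f"seed {first} reproduces the failure")
```

Recording the seed per trial, rather than seeding once for the whole batch, is the design choice that matters here: it lets you rerun just the failing case while you simplify (change one variable at a time) instead of re-running the entire lottery.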

Much more where those came from. Read here
