Break Points

Developing a good bedside manner

Jack Ganssle

9/23/2009 12:00 AM EDT

Why are debugging code and alien abduction so similar? Because in both cases large blocks of time are unaccounted for.

Medicine, too, resembles debugging (and perhaps alien abduction). In an astonishing development, my doctor recently chatted with me for a few minutes. Maybe it was a slow H1N1 day, or perhaps he just wanted a break from all of the runny noses. So I asked him what the hardest part of his job was. I expected a rant against insurance companies, so it was interesting to hear him talk instead about the difficulty of diagnosing diseases. Most people present with simple cases, but sometimes an individual will have a bewildering array of symptoms that suggest no single etiology. Physicians use differential diagnosis to try to tease the cause out of the complaints, a task made very complex because patients may ignore some important symptoms while focusing on those that are less critical. To add spice to the sauce of diagnosis, several illnesses may present at the same time.

I was struck by how closely his comments mirror the art of debugging embedded systems. Our nascent product has a bug, which presents itself via some set of symptoms. Press the green button and nothing happens. Is the switch wired incorrectly? Could its Schmitt trigger gate be shot? Is the ISR invoked? Perhaps the ISR never passes a signal to the code that displays a result.

Or maybe the power is off.

In other cases, just as in medicine, one bug may present a variety of odd effects. Or a single symptom could stem from a combination of bugs all interacting in excruciatingly complex, hard-to-diagnose ways. I wonder if physicians observe the kind of infrequent symptoms we see, which appear in a system once, go away for weeks, and then randomly resurface?

Then there are the bugs we know exist, but just cannot fix. The system gets shipped with the expectation that sometime someone will see a problem. That, too, is like medicine. "Doc, it hurts when I do this." "Don't do that." My health insurance will not cover kidney stones due to three prior episodes, one of which required an expensive lithotripsy. So that's one latent bug in my gut that will surely reappear at some unknown time in the future.






Evgeni

9/24/2009 12:02 AM EDT

Another debug approach is "divide and conquer". Draw the entire HW/SW system as a block diagram, and try to isolate the problem into the smallest possible block.
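As a rough illustration of the divide-and-conquer idea, the sketch below treats the system as a simple linear chain of blocks and bisects it. The block names and the `output_ok` probe are hypothetical, and the code assumes that everything downstream of the faulty block also looks bad; real block diagrams branch, and the real probe is a scope, a logic analyzer, or a breakpoint rather than a function call.

```python
from typing import Callable, Optional, Sequence


def first_bad_block(blocks: Sequence[str],
                    output_ok: Callable[[int], bool]) -> Optional[str]:
    """Return the first block whose output is bad, or None if all are good.

    output_ok(i) is a hypothetical probe: True if the signal observed at the
    output of blocks[i] is correct. Assumes a linear chain in which every
    block downstream of the faulty one also shows a bad output.
    """
    lo, hi = 0, len(blocks)  # outputs before lo are good; from hi onward, bad
    while lo < hi:
        mid = (lo + hi) // 2
        if output_ok(mid):
            lo = mid + 1     # fault lies downstream of mid
        else:
            hi = mid         # fault is at mid or upstream of it
    return blocks[lo] if lo < len(blocks) else None


# Hypothetical chain for the "press the green button, nothing happens" case.
chain = ["switch", "Schmitt trigger", "ISR", "event queue", "display task"]

# Pretend the probe finds good signals up to and including the ISR.
print(first_bad_block(chain, lambda i: i < 3))  # prints "event queue"
```

Each probe halves the suspect region, so even a long chain needs only a handful of measurements.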




Dmd

9/24/2009 12:21 PM EDT

I never much liked the "Generate a hypothesis" part, because, in my experience, people would just sit around a table and speculate about the cause - and this happens mostly when the managers get involved.

Instead I prefer this approach: "Collect data about the problem until the cause becomes bloody obvious". For difficult bugs this often involves designing specific tests to collect that necessary data, and that is a step that most people bypass. The fancy/dancy debugging tools won't get you there all by themselves.

I have many examples from my embedded programming career. Here is one. A portable defibrillator I worked on was going through a battery life testing protocol when it was noticed that every once in a while a unit would stop working. So the first task was to collect more data. I instrumented the code to write out checkpoint information to a serial port, just single characters to indicate where the code was at that moment. Then I rounded up every available computer in the office with serial ports (at that time there were usually 2), connected a defibrillator to every serial port (with a logging capability), and let them run, maybe 20 units overnight, every night. Little by little, by observing the failures and refining the checkpoint printout information, I was able to close in on what was by then an obvious hypothesis: the flash memory chip was misbehaving. It wasn't even our problem! As a footnote, further testing did prove it was indeed the flash memory chip, and when the manufacturer was confronted with this data they stonewalled (wouldn't you know) but shortly after introduced another version of this chip which worked fine.
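The host side of that kind of overnight run might look something like the sketch below. It is only an illustration, not the tooling actually used: the unit names, the checkpoint letters, and the convention that `Z` marks the end of a complete test cycle are all assumptions. The idea is simply to tally the last checkpoint each stalled unit emitted, so that a pattern across many units becomes obvious.

```python
from collections import Counter
from typing import Mapping


def last_checkpoints(logs: Mapping[str, str], cycle_end: str = "Z") -> Counter:
    """Tally the final checkpoint character from each unit that stopped mid-cycle.

    logs maps a unit name to the string of single-character checkpoints
    captured from its serial port. A log ending in cycle_end is treated as a
    unit that finished its cycle normally (an assumed convention).
    """
    tally: Counter = Counter()
    for trace in logs.values():
        trace = trace.strip()
        if trace and not trace.endswith(cycle_end):
            tally[trace[-1]] += 1  # where the code was when the unit stopped
    return tally


# Hypothetical overnight capture: 'F' might mean "flash write started".
overnight = {
    "unit01": "ABCZABCZ",  # completed every cycle
    "unit02": "ABCZABF",   # stopped after checkpoint F
    "unit03": "ABF",       # stopped after checkpoint F
    "unit04": "ABCZAB",    # stopped after checkpoint B
}
print(last_checkpoints(overnight).most_common(1))  # prints [('F', 2)]
```

When one checkpoint dominates the tally, the next night's run can add finer-grained checkpoints around it.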

How likely is it that the managers sitting around the conference table would come up with this hypothesis?




krwada

9/24/2009 4:13 PM EDT

I suppose it is too bad that the human body does not have a printf() function!




Ray Keefe

9/24/2009 9:51 PM EDT

Thanks for another excellent article Jack,

The Bob Pease quote is a classic. I've always loved that one.

I find myself easily falling into some of the traps you outlined. The quick fix is tempting when under pressure to get something out the door.

I agree that, in the end, measuring the symptoms (all of them) and investigating until you understand the problem will usually win the day, but it won't look as much like progress as a flurry of coding trials does. The same applies on the hardware side.

Another rule of thumb is that if you need a quick prototype for a demo, NEVER think that you can just beef it up a bit for the final product. Always go back and architect it properly.

We do contract Electronics and Embedded Software Development and about 20% of the projects we get have been done by someone else and the project is in a lot of trouble. The trouble might be hardware or software and often can be both. In these cases it is important to nail down the real problems and the sort of 'no assumptions' investigation of symptoms is essential for getting to this point.

Thanks again,

Ray Keefe
http://www.successful.com.au




dale@allthingsembedded

10/1/2009 2:44 AM EDT

Dear "Guest" above, I cannot agree more re power supplies. You can have the greatest embedded design, but if its foundations are shaky, i.e. the power supply, you are surely headed for all sorts of unexpected delights! Been there, done that, and will never again.
Dale
www.allthingsembedded.com




Dave Agans

11/19/2009 10:55 PM EST

Jack,

I was quite interested in your use of the medical diagnosis analogy here. I wrote the book "Debugging", published in 2002 and still selling well because it extracts the essence of debugging, which as you point out, is not restricted to hardware and software. I use examples from medicine, car repair and plumbing, to name a few, which is one of the reasons it's popular. (The whole thing is humorous, which helps make it a fun read, too.)

I came up with 9 rules (shown on the website debuggingrules.com in a free poster), and I challenge anyone to prove that they either 1. include a rule you can ignore, or 2. are missing a rule. Your 6 steps (and other important things) are covered by my 9 rules, except for the hypothesis-fix-test sequence, with which I respectfully disagree. My rule #3, "Quit Thinking and Look," means use your hypothesis to decide where to look next, not what fix to try. Trying a fix before you have SEEN the cause of the bug is sometimes effective, but often leads to a long loop of misdirected fixes. (There are examples in the book.) The other rules are equally important; in fact, here they are:

Understand the System
Make it Fail
Quit Thinking and Look
Divide and Conquer
Change One Thing at a Time
Keep an Audit Trail
Check the Plug
Get a Fresh View
If You Didn't Fix It, It Ain't Fixed

I'll send you a copy of Debugging to review if you want. I guarantee you will not want to put it down, either. :-)

Dave Agans
www.DebuggingRules.com




JackGanssle

11/20/2009 7:50 AM EST

Sure, Dave, I'd love to read it (though it takes me forever to get to a book, as my input stack is so high). Drop me an email at jack@ganssle.com.

Also do check out Steve Litt's site: troubleshooters.com.

Jack




ECS_Shadow

11/20/2009 11:16 AM EST

Great article Jack!

Jack writes:

"In other cases, just as in medicine, one bug may present a variety of odd effects. Or a single symptom could stem from a combination of bugs all interacting in excruciating-complex, and hard to diagnose, manners. I wonder if physicians observe the infrequent symptoms we see, that appear in a system once, go away for weeks, and then randomly resurface?"

I liked being the Hero that caught the elusive bug as much as anyone. But do we really have to let the bugs "go away for weeks"? How much time and money are spent chasing these bugs? How much does it cost when we fail to catch them?

Are we not smart people, with systems of our own design and under our own control?

These bugs can be easily captured if we make proper use of our software to help us. The vast majority of embedded systems can be “instrumented” (in software, by the developer) to record and then replay the software execution. The data rate of a proper implementation is surprisingly low (~2KB per MHz of CPU clock), lower than that of typical instrumentation approaches that pump out information we think will help us find these bugs.

The record process saves the minimum data that is needed to capture the exact execution process of the software. Therefore the real-time execution is not being changed by the analysis and debug processes.

The replay process re-creates the recorded execution with the bugs. Complete analysis and debugging takes place in the replay process without changing the re-created execution of the software.

So what’s the big disadvantage?
It requires a change in the typical embedded mindset!
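A toy sketch of the general record-and-replay idea, offered as an illustration rather than as the implementation described above: if the software is deterministic given its inputs, then logging only those inputs is enough to re-create the whole execution later. The `Recorder`, `Replayer`, and `control_loop` names are hypothetical, and a random number generator stands in for a live input such as an ADC reading.

```python
import random
from typing import Callable, List


class Recorder:
    """Wraps a live input source and logs every value the software reads."""

    def __init__(self, source: Callable[[], int]) -> None:
        self._source = source
        self.log: List[int] = []

    def read(self) -> int:
        value = self._source()
        self.log.append(value)  # only the nondeterministic input is saved
        return value


class Replayer:
    """Feeds a recorded log back in place of the live input source."""

    def __init__(self, log: List[int]) -> None:
        self._values = iter(log)

    def read(self) -> int:
        return next(self._values)


def control_loop(read_input: Callable[[], int], steps: int) -> List[int]:
    """Hypothetical application: deterministic except for what it reads."""
    state, trace = 0, []
    for _ in range(steps):
        state = (state * 3 + read_input()) % 256  # 8-bit state update
        trace.append(state)
    return trace


# Live run: inputs are unpredictable, but every one of them is recorded.
recorder = Recorder(lambda: random.randint(0, 255))
live_trace = control_loop(recorder.read, steps=50)

# Replay: the same code, fed the recorded inputs, repeats the same execution.
replay_trace = control_loop(Replayer(recorder.log).read, steps=50)
print(live_trace == replay_trace)  # prints True
```

All the heavy analysis (breakpoints, extra logging, single-stepping) can then be done on the replayed run, leaving the recorded run undisturbed.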




tildejac

9/13/2010 2:16 AM EDT

I originally come from a scientific research background before becoming an engineer, and I always use this process for debugging code. For some reason wherever I work, all the hardest problems eventually end up at my desk. Most of the criticism seems to me to revolve around people wanting to delineate how they observe collateral behavior rather than the process itself. For many complex problems, you often need to rely on the test group's observation of collateral behavior, and this is my biggest problem. How do you get others to report observations and differentiate them from conclusions? I often get vague descriptions of what test did, even when they have specific data. I also tend to get conclusions rather than observations.

Can anyone suggest articles or processes to get test groups and customer service to learn how to best report problems and their observed behaviors?




Zameer

9/28/2010 5:26 AM EDT

To me, this discussion seems similar to asking who is better, Newton or Einstein. I feel it is difficult to say which approach is better; as Jack already said, it is both science and art. I often think about this and feel that it depends on the individual and their upbringing. Some believe in hard work: they try to collect data, run as many tests as possible, try a number of fixes, and so on. Others sit down calmly, look at the problem and its symptoms, then decide what data is needed and choose tests accordingly. Then they run those tests, and from the logs or debug data they know the problem and the fix too.
Personally, I feel "Generate a hypothesis" is the better approach, and it always works provided the person has enough imagination. The reason is that you do not always have the time or the luxury to run tests many times, especially if your product is experiencing the problem when installed in the field.



