Methods to propagate error codes in C firmware
I’d like to discuss how to report errors from C functions up the call chain. Consider firmware which communicates with a handful of external peripherals (say over I2C and SPI), then validates the data it has received. The communication could time out, and that’s an error. The data may turn out to be invalid, and that’s an error. Some of the errors can be handled near their source. Most of the errors (counting species, not the total biomass) need to trickle up to the top, where they are logged before the device shuts down gracefully.
- Functions return error codes. If a function needs to pass a result to the calling code, it does it through an output parameter.
- There’s a global variable which stores errors. If a function detects an error, then it writes the error code to the global error variable. In case the error is handled near the origin then the global error variable is cleared before it trickles up to the top.
Which method is more common? Are there more methods?
¹ The context is microcontroller firmware, which is why I’m posting to the EE board rather than the software board.
² No C++ exceptions. I appreciate the exceptions mechanism, but it bloats the code, unfortunately.
edit:
> Before you decide how the code should handle errors, you need to decide how the system as a whole should handle errors. A lot of this depends on the embedded system's job and what, if any, user interface exists. Does it always need to be running and responding? Is there a way to communicate bad data or no data? Does bad data indicate a hardware failure? What is the overall fail-safe strategy, or is there even one? [from Olin's answer]
The device is a $300 instrument for citizen scientists. It’s a sensor which exchanges information through a microSD card. There’s plenty of space for data, event log, error log. The device is headless, other than a red-green LED. It doesn’t have communications.
Since this is a scientific toy, the error handling strategy prefers integrity over robustness. There are situations where the firmware has to prevent the instrument from self-damage (burning out an IR LED), but that's the extent of safety concerns. There's nothing safety-critical there.
> Before you decide how the code should handle errors, you need to decide how the system as a whole should handle errors. [...]
I think that the firmware mechanism for error code propagation is orthogonal to the overall fail-safe strategy. It’s a low-level question¹ which can support various error handling strategies.
¹ Programming language level
The only architecture I can think of where errors don’t have to propagate is one where a function which detects an error either takes care of it (ignore, default, plausible value, retry), or logs the error and throws a hard fault. Such a function doesn’t propagate errors, and the calling code doesn’t get a chance to handle the error.
2 answers
> Functions return error codes. If a function needs to pass a result to the calling code, it does it through an output parameter
This is the proper way. Static analyzers and compilers often check to ensure that the result of a function is used, so the caller won't be able to ignore it by accident. Furthermore, using the return value means that each module of the program can create its own result code enum, to be used by every function in that module.
Result codes aren't necessarily errors either. For example you could poll some serial bus and receive "no data available" as result. That's a typical result code the caller might just ignore.
Functions should document which result codes they might return and under which conditions. The caller can then decide which results it should handle locally, pass further upward or simply ignore. When it passes a result upward, the caller translates it into the result codes of its own module.
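To make this concrete, here is a minimal sketch of a driver that returns a result code and passes the reading through an output parameter, plus a caller one level up that maps those codes onto its own enum. All names (`sensor_read`, `meas_sample`, the enums, the `fake_bus_*` stand-ins) are hypothetical and not from the question. The `MUST_CHECK` macro assumes GCC or Clang, whose `warn_unused_result` attribute makes the compiler warn when a caller drops the return value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC and Clang can warn when a caller discards the return value. */
#if defined(__GNUC__)
#define MUST_CHECK __attribute__((warn_unused_result))
#else
#define MUST_CHECK
#endif

/* Result codes of a hypothetical sensor driver module. */
typedef enum {
    SENSOR_OK = 0,
    SENSOR_NO_DATA,      /* not an error: nothing to read yet */
    SENSOR_ERR_TIMEOUT,  /* peripheral did not respond in time */
    SENSOR_ERR_RANGE     /* reading failed validation */
} sensor_result_t;

/* Result codes of a hypothetical measurement module, one level up. */
typedef enum {
    MEAS_OK = 0,
    MEAS_SKIPPED,        /* nothing to record this cycle */
    MEAS_ERR_SENSOR      /* sensor failure, to be handled further up */
} meas_result_t;

/* Stand-ins for the real bus transfer, so the sketch runs on a PC. */
static sensor_result_t fake_bus_status = SENSOR_OK;
static uint16_t fake_bus_value = 1234u;

/* Writes the reading to *out and returns SENSOR_OK on success.
 * On any other result, *out is left unchanged. */
MUST_CHECK sensor_result_t sensor_read(uint16_t *out)
{
    assert(out != NULL);  /* a NULL pointer is a programming error */
    if (fake_bus_status != SENSOR_OK) {
        return fake_bus_status;
    }
    if (fake_bus_value > 4095u) {
        return SENSOR_ERR_RANGE;
    }
    *out = fake_bus_value;
    return SENSOR_OK;
}

/* The caller handles "no data" locally and passes real errors upward,
 * translated into its own module's result codes. */
MUST_CHECK meas_result_t meas_sample(uint16_t *out)
{
    assert(out != NULL);
    switch (sensor_read(out)) {
    case SENSOR_OK:      return MEAS_OK;
    case SENSOR_NO_DATA: return MEAS_SKIPPED;
    default:             return MEAS_ERR_SENSOR;
    }
}
```

The translation step in `meas_sample` is what keeps the upper layers from depending on the driver's enum.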
This design makes it easy to create a centralized error handler on the top level, which is preferable. Local modules should not handle program flow, error logging and the like, since that just creates messy "spaghetti programs". Instead the error handler can be the centralized decision maker, deciding when to revert to a safe state, when to reset the MCU, when to report errors and so on.
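A centralized handler of that kind could look roughly like the sketch below. The result codes, the `error_state_t` bookkeeping and the policy itself (tolerate a few consecutive timeouts, shut down on anything unknown) are assumptions made up for illustration; the real policy depends on the product.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical top-level result codes that reach the main loop. */
typedef enum {
    APP_OK = 0,
    APP_ERR_SENSOR_TIMEOUT,
    APP_ERR_BAD_DATA,
    APP_ERR_LED_OVERCURRENT
} app_result_t;

/* What the main loop should do next. */
typedef enum {
    ACTION_CONTINUE,
    ACTION_SAFE_SHUTDOWN
} action_t;

/* Bookkeeping owned by the handler, not by the modules. */
typedef struct {
    unsigned logged_errors;         /* errors that would go to the log */
    unsigned consecutive_timeouts;  /* reset by any non-timeout result */
} error_state_t;

#define MAX_CONSECUTIVE_TIMEOUTS 3u  /* assumed limit */

/* Single place where the reaction to every result code is decided. */
action_t error_handler(error_state_t *st, app_result_t r)
{
    assert(st != NULL);
    if (r == APP_OK) {
        st->consecutive_timeouts = 0;
        return ACTION_CONTINUE;
    }

    st->logged_errors++;  /* a real handler would append to the error log */

    switch (r) {
    case APP_ERR_SENSOR_TIMEOUT:
        st->consecutive_timeouts++;
        return (st->consecutive_timeouts >= MAX_CONSECUTIVE_TIMEOUTS)
                   ? ACTION_SAFE_SHUTDOWN
                   : ACTION_CONTINUE;
    case APP_ERR_BAD_DATA:
        st->consecutive_timeouts = 0;
        return ACTION_CONTINUE;
    default:
        /* Overcurrent and anything unrecognized: go to the safe state. */
        return ACTION_SAFE_SHUTDOWN;
    }
}
```

The main loop would call `error_handler` with whatever each module returned and act on the result, so none of the modules needs to know about logging or shutdown.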
> There’s a global variable which stores errors. If a function detects an error, then it writes the error code to the global error variable. In case the error is handled near the origin then the global error variable is cleared before it trickles up to the top.
This has been tried a lot in the PC world and proven to be bad. There is standard C and *nix with the errno mess and there's GetLastError in Windows. The main issue with these was always re-entrancy and multi-processing, in case several errors happen at once. But also from a design point of view, it is plain bad to drag a global variable around to every single module in your program, coupling them all together and preventing them from functioning stand-alone.
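The "several errors at once" problem is easy to reproduce even without interrupts. In the following sketch (all names are hypothetical, and both reads are hard-wired to fail), two failures happen in a row: the global variable only remembers the last one, while the return-code version keeps both.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ERR_NONE = 0, ERR_TIMEOUT, ERR_BAD_DATA } err_t;

/* Global-variable method: one error slot shared by every module. */
static err_t g_last_error = ERR_NONE;

/* Both reads always fail in this sketch. */
static void read_a_global(void) { g_last_error = ERR_TIMEOUT; }
static void read_b_global(void) { g_last_error = ERR_BAD_DATA; }

void poll_global(void)
{
    read_a_global();
    read_b_global();  /* overwrites the timeout recorded just before */
}

/* Return-code method: each failure travels with its own call. */
static err_t read_a(void) { return ERR_TIMEOUT; }
static err_t read_b(void) { return ERR_BAD_DATA; }

void poll_returning(err_t *a, err_t *b)
{
    assert(a != NULL && b != NULL);
    *a = read_a();
    *b = read_b();
}
```

After `poll_global()` the timeout from the first read is gone unless every caller remembers to check the global between the two calls. An interrupt handler writing to the same variable would make this worse.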
I think the worst problem is perhaps that once you go down the path of using global error variables, you will soon find the need to invent more such variables. Before you know it you have "flaghetti", numerous flags all over the program, which can be set from all over the program. After which it gets very difficult to keep track of program flow and the complexity builds up rapidly, leading to bugs.
"Flaghetti" is one of the main reasons why many old school assembler programs which reached a certain size turned incredibly error-prone, with severe tight coupling problems. You would add a patch somewhere in the program and some completely unrelated part of the program would break unexpectedly, because pretty much everything tended to exist in the global namespace and there was no telling which modules used which flags. I've maintained a few messes like that - they are essentially lost causes that have to be rewritten from scratch.
Asking how to propagate errors in just the code is only one part of handling errors in an embedded system.
Before you decide how the code should handle errors, you need to decide how the system as a whole should handle errors. A lot of this depends on the embedded system's job and what, if any, user interface exists. Does it always need to be running and responding? Is there a way to communicate bad data or no data? Does bad data indicate a hardware failure? What is the overall fail-safe strategy, or is there even one?
There are many possibilities for handling errors. Here are some:
- Ignore it and keep going. This can be appropriate when handling streaming real time data. If you're doing audio filtering, for example, then bad data might cause an audible glitch, but you want things to continue as best as possible after that. What's done is done, and now it's on to getting the next data point right.
- Substitute default or benign values. Similar to above, it might be better to have a dead spot in the audio than a harsh-sounding glitch.
- Substitute plausible values, at least for a while. Kalman filters are often used for this. You'd probably rather have your car GPS ride out short dropouts or obvious noise than suddenly jump to the middle of the Atlantic and back again when you know you're driving in Massachusetts.
- Keep error counts and related statistics. This is orthogonal to how to actually handle bad or missing data. This can be useful if the system occasionally reports to something larger, or a user or field service agent can go into a diagnostic mode to try and understand why the output is glitchy.
- Log errors. This is different from keeping counts and fixed statistics because the size of the log data grows with each error. Of course you have to consider how much space there is, and what to do when that space is used up.
- Cause a hard restart, including rebooting the embedded processor. This is the strategy behind hardware watchdog timers. Keep in mind though that in some cases this is the absolute worst thing to do. Shutting down the engine in a loop every 10 seconds is not the best response to a failed oxygen sensor.
- Declare the world has ended, put outputs in the most benign state possible, shut down everything, and require deliberate human action to continue. When you're told to raise the control rods by 1038 meters, it's probably better to insert them all the way, turn on the cooling pumps to maximum, and not do anything further until a user override.
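As a rough illustration of two of these options combined, the sketch below keeps an error count and substitutes the last good value for a limited number of bad samples. This is a plain hold-last-value scheme, much simpler than a Kalman filter and not a replacement for one. The names and the `HOLDOVER_MAX` limit are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HOLDOVER_MAX 3u  /* assumed: bad samples to ride out in a row */

typedef struct {
    int32_t  last_good;    /* most recent valid sample */
    bool     have_good;    /* false until the first valid sample */
    uint8_t  stale_count;  /* consecutive substituted samples */
    uint32_t error_total;  /* statistic: bad samples seen so far */
} holdover_t;

/* Returns true and writes *out when a usable value exists: either the
 * fresh sample or the held-over last good one. Returns false once the
 * holdover limit is exceeded, or if no good sample was ever seen. */
bool holdover_update(holdover_t *h, bool valid, int32_t sample,
                     int32_t *out)
{
    assert(h != NULL && out != NULL);
    if (valid) {
        h->last_good = sample;
        h->have_good = true;
        h->stale_count = 0;
        *out = sample;
        return true;
    }

    h->error_total++;
    if (h->have_good && h->stale_count < HOLDOVER_MAX) {
        h->stale_count++;
        *out = h->last_good;  /* substitute a plausible value */
        return true;
    }
    return false;  /* too stale: the caller must pick another strategy */
}
```

When `holdover_update` returns false, the decision moves to a higher-level strategy such as logging, restarting or shutting down.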
There are many different error handling strategies, so there are different ways to handle errors in the code that can't be decided until the higher level strategy has been settled upon. Data representation also matters. Are there in-band values for out of range, stale, or unknown? The firmware architecture is the last thing to work out because it needs to be driven by the higher level requirements.
> The device is a $300 instrument for citizen scientists. It’s a sensor which exchanges information through a microSD card. There’s plenty of space for data, event log, error log. The device is headless, other than a red-green LED. It doesn’t have communications.
It sounds like this device simply measures things, then stores the measurements in removable memory. In that case, it may make sense to store status along with each value. Depending on your data format, this can possibly be done in-band. For example, if you're measuring a resistive sensor that is always in the 1 kΩ to 100 kΩ range, then 0 could be used to indicate invalid low (probably a short somewhere), and some other impossible encoding (all 1s when using fixed point, for example) to indicate invalid high (probably an open).
Maybe there is nothing you need to do at all. In the above example, just report the actual measured resistance, and let the analysis program that reads the raw data figure out what to do about it.
In this case, in-band signalling is probably the easiest approach in the firmware because those values can be passed directly to the end user that reads the SD card. For example, let's say your measurements are from a 12 bit A/D on an IIC bus. The low level "get the next reading" routine will return those 12 bits in a 16 bit word. That gives you extra values to communicate status like "sensor didn't ACK", and the like. Values 0-4095 are valid readings, and everything else indicates some sort of exception or error condition. The higher level routines just handle the 16 bit values, and the rest is the problem of the software on the PC that eventually reads the SD card.
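A minimal sketch of that encoding might look like this. The specific status values (`0xFFF0`, `0xFFF1`) and all names are assumptions chosen for illustration; any values above 4095 would do, as long as the PC software knows them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ADC_MAX_READING    0x0FFFu  /* largest valid 12 bit reading */
#define ADC_STATUS_NO_ACK  0xFFF0u  /* assumed code: sensor didn't ACK */
#define ADC_STATUS_TIMEOUT 0xFFF1u  /* assumed code: bus timed out */

/* Status codes must not collide with valid readings. */
static_assert(ADC_STATUS_NO_ACK > ADC_MAX_READING,
              "status codes must lie outside the reading range");
static_assert(ADC_STATUS_TIMEOUT > ADC_MAX_READING,
              "status codes must lie outside the reading range");

/* Outcome of the low level bus transfer (hypothetical). */
typedef enum { BUS_OK, BUS_NO_ACK, BUS_TIMEOUT } bus_status_t;

/* Packs either the 12 bit reading or a status code into one word. */
uint16_t adc_encode(bus_status_t status, uint16_t raw)
{
    switch (status) {
    case BUS_OK:     return (uint16_t)(raw & ADC_MAX_READING);
    case BUS_NO_ACK: return ADC_STATUS_NO_ACK;
    default:         return ADC_STATUS_TIMEOUT;
    }
}

/* True for words 0-4095, false for any in-band status code. */
bool adc_word_is_reading(uint16_t word)
{
    return word <= ADC_MAX_READING;
}
```

The higher level routines can write the 16 bit word to the SD card unchanged; only code that actually interprets the value needs `adc_word_is_reading`.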
