In many of the complex computer RFQs I have seen over the last decade, heated discussions on compliance would arise whenever the BIT (Built-In Test) requirements came up. However complex the compliance assessment, it always came down to compromise and interpretation. In PBIT (Power-On Built-In Test), the usual tradeoff is boot time versus coverage: the more thorough the test, the longer it takes before the application can be operational.
But the worst part was always the CBIT (Continuous Built-In Test). The typical expectation is that some code, running in parallel with the application and constantly assessing the health and performance of the computer, will start signaling when something goes wrong. And most inexperienced customers expect the vendor to deliver this piece of code off the shelf, ready to run, along with the computer board.
Unfortunately, the situation is not that simple. Assessing hardware status “independently” of the application and “while the application runs” is feasible for only some pieces of computing hardware. Computer chips in our architectures have transient values in some registers that you can only read once. Because of this, only one code entity should interface with the hardware: that is the role of the device driver, which, with the help of the operating system, offers its services to the multiple applications running on the same computer. So how would you assess the hardware health? There is only one method: go through the device driver. In a nutshell, nothing the application code could not do itself.
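To make this concrete, here is a minimal sketch, assuming Linux, where the kernel driver exposes link state through sysfs. A “health check” like this goes through exactly the same OS services the application itself would use; the interface name “eth0” is an assumption for illustration.

```python
# Minimal sketch, assuming Linux: the network driver publishes link
# state through sysfs, so this "health" query uses the same OS/driver
# services available to any application. "eth0" is an assumed name.
from pathlib import Path

def link_is_up(interface: str = "eth0") -> bool:
    """Ask the driver (via sysfs) whether the link is currently up."""
    operstate = Path(f"/sys/class/net/{interface}/operstate")
    return operstate.read_text().strip() == "up"

if __name__ == "__main__":
    print("link up:", link_is_up())
```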
Moreover, an important part of the CBIT is supposed to focus on performance, because a computer with degraded performance is likely to impact an operational application.
So let’s assume we want to monitor the performance of a network link or a hard disk drive: what would we do? Again, use the same OS services as the application to send data to or retrieve data from the device. How do you assess performance? By challenging the device with a heavy load and measuring the outcome (this is the essence of benchmarking). Would you want this to happen in parallel with your operational application? I guess not, since it would make your application unreliable.
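A hedged sketch of what “challenging the device” means in practice: time a sequential write with fsync to estimate disk throughput. The file path and size are illustrative; note that running this alongside an operational application would steal exactly the bandwidth the application needs.

```python
# Illustrative benchmark sketch: estimate sequential write throughput
# by timing a flushed, fsync'd write. Path and size are assumptions.
import os
import time

def write_throughput_mb_s(path: str = "/tmp/bit_probe.bin",
                          size_mb: int = 64) -> float:
    chunk = b"\0" * (1024 * 1024)          # 1 MiB buffer
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())               # force the data to the device
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mb / elapsed

if __name__ == "__main__":
    print(f"{write_throughput_mb_s():.1f} MB/s")
```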
Where do I want to go from here? Simple: there is no such thing as off-the-shelf CBIT. So much cooperation and compromise must be achieved between the operational code and the health-management code that only the application designers can lead the work. Here your mileage may vary: if your project is very large and you can assemble a team of application and platform experts to pursue this ultimate grail, good for you. Not many projects fall into this category these days.
Let’s try another, less academic approach to CBIT. What do we really want to achieve? Do we want to monitor the platform or the application? My conclusion is that critical operational applications should monitor themselves from the inside. Using timers, averaging, comparisons, thresholds, time stamping, and callout routines, an application should be designed with self-monitoring in mind. Only the application designer knows whether computing and sending “x” images per second is the most important mission for this system. Applications can check themselves at critical places in the computing flow to raise meaningful alerts or degrade gracefully, rather than relying on a hypothetical outside CBIT element to raise the alarm. I’d rather understand an application message (file not sent) than a platform message (lost connection).
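Here is an illustrative sketch of that idea: the processing loop time-stamps its own work over a sliding window, averages the rate, and fires a designer-chosen callout when the rate drops below a designer-chosen threshold. All the names (IMAGES_PER_SECOND_MIN, on_alert) and values are hypothetical.

```python
# Sketch of in-application self-monitoring: the loop measures its own
# image rate and raises an alert in application terms, not platform
# terms. Threshold, callout, and workload are hypothetical.
import time
from collections import deque

IMAGES_PER_SECOND_MIN = 25.0               # designer-chosen threshold

def on_alert(rate: float) -> None:
    """Callout routine: report the problem in mission terms."""
    print(f"ALERT: only {rate:.1f} images/s, mission requires "
          f"{IMAGES_PER_SECOND_MIN}")

def processing_loop(frames: int = 200) -> None:
    stamps = deque(maxlen=50)              # sliding window of timestamps
    for _ in range(frames):
        time.sleep(0.01)                   # stand-in for real image work
        stamps.append(time.perf_counter())
        if len(stamps) == stamps.maxlen:
            rate = (len(stamps) - 1) / (stamps[-1] - stamps[0])
            if rate < IMAGES_PER_SECOND_MIN:
                on_alert(rate)             # degrade gracefully, or quit

if __name__ == "__main__":
    processing_loop()
```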
And when the application decides to quit and declares the system faulty, troubleshooting mode can be entered. This is where a system PBIT approach shines the most.
At system start, the status information of all circuits, channels, and buses visible from the main CPU is compared with a pre-registered reference. Any discrepancy is listed and traced to a specific piece of equipment, chip, channel, or slot. With the amount of detail available in modern silicon chips, this approach uncovers all kinds of changes, from the obvious (a network link that should normally be up and is down can indicate a loose cable) to the subtle (a PCIe channel synchronized at a different width or speed). Such subtle differences may mean a lot in terms of computer performance and application behavior, and this is why the approach is so powerful. Storing and comparing every device the CPU sees also checks exactly what the application code will use: no more, no less. This elegant approach works beyond the physical boundaries of the computer board (e.g. network links to another cabinet, or neighboring boards in a backplane) and requires no special training to operate.
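A sketch of this “Learn and Compare” idea, again assuming Linux sysfs: snapshot what the CPU can see of each PCIe device (current link speed and width) and each network interface (operstate), store it as a reference at commissioning time, and diff a fresh snapshot against it at each boot. The file name and the choice of attributes are assumptions for illustration.

```python
# "Learn and Compare" sketch, assuming Linux sysfs. learn() is run once
# on a known-good system; compare() is run at each subsequent boot.
import json
from pathlib import Path

ATTRS = ("current_link_speed", "current_link_width")

def read(f: Path) -> str:
    try:
        return f.read_text().strip()
    except OSError:
        return "unreadable"

def snapshot() -> dict:
    """Collect every device status the CPU can see via sysfs."""
    state = {}
    for dev in Path("/sys/bus/pci/devices").iterdir():
        for attr in ATTRS:
            f = dev / attr
            if f.exists():
                state[f"pci/{dev.name}/{attr}"] = read(f)
    for net in Path("/sys/class/net").iterdir():
        state[f"net/{net.name}/operstate"] = read(net / "operstate")
    return state

def learn(ref_file: str = "reference.json") -> None:
    Path(ref_file).write_text(json.dumps(snapshot(), indent=2))

def compare(ref_file: str = "reference.json") -> None:
    ref = json.loads(Path(ref_file).read_text())
    now = snapshot()
    for key in sorted(set(ref) | set(now)):
        if ref.get(key) != now.get(key):
            print(f"DISCREPANCY {key}: was {ref.get(key)}, "
                  f"now {now.get(key)}")
```

A PCIe lane that retrained at x4 instead of x8, or a link that should be up and is not, shows up here as a one-line discrepancy pointing at a specific device.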
I have presented this to many people with many different jobs in our industry, and all of them have seen a way to use this “Learn and Compare” approach to their benefit. Use cases are innumerable.
Can you find one or two yourself? Do you think a good self-testing application, plus a proper system PBIT, covers more than the traditional BIT approach with a lot less effort?
Now it’s your turn to say.