Discussion: Diagnostics
Don Y
2024-10-12 19:58:02 UTC
Typically, one performs some limited "confidence tests"
at POST to catch gross failures. As this activity is
"in series" with normal operation, it tends to be brief
and not very thorough.

Many products offer a BIST capability that the user can invoke
for more thorough testing. This allows the user to decide
when he can afford to live without the normal functioning of the
device.

And, if you are a "robust" designer, you often include invariants
that verify hardware operations (esp to I/Os) are actually doing
what they should -- e.g., verifying battery voltage increases
when you activate the charging circuit, loopbacks on DIOs, etc.
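
As a concrete illustration, a minimal sketch of such an invariant in C,
assuming hypothetical HAL calls (adc_read_battery_mv(), charger_enable(),
delay_ms()) and an illustrative 50 mV threshold:

#include <stdbool.h>
#include <stdint.h>

extern uint32_t adc_read_battery_mv(void);   /* hypothetical HAL calls */
extern void charger_enable(bool on);
extern void delay_ms(uint32_t ms);

#define CHARGE_DELTA_MV 50u    /* minimum rise expected once charging */

bool charger_invariant_ok(void)
{
    uint32_t before = adc_read_battery_mv();

    charger_enable(true);
    delay_ms(100);                        /* let the rail respond */
    uint32_t after = adc_read_battery_mv();
    charger_enable(false);

    /* the invariant: enabling the charger must raise battery voltage */
    return after >= before + CHARGE_DELTA_MV;
}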

But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
And, BIST might not always be convenient (as well as requiring the
user's consent and participation).

There, runtime diagnostics are the only alternative for hardware
revalidation, PFA and diagnostics.

How commonly are such mechanisms implemented? And, how thoroughly?
Waldek Hebisch
2024-10-18 20:30:06 UTC
Post by Don Y
Typically, one performs some limited "confidence tests"
at POST to catch gross failures. As this activity is
"in series" with normal operation, it tends to be brief
and not very thorough.
Many products offer a BIST capability that the user can invoke
for more thorough testing. This allows the user to decide
when he can afford to live without the normal functioning of the
device.
And, if you are a "robust" designer, you often include invariants
that verify hardware operations (esp to I/Os) are actually doing
what they should -- e.g., verifying battery voltage increases
when you activate the charging circuit, loopbacks on DIOs, etc.
But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
And, BIST might not always be convenient (as well as requiring the
user's consent and participation).
There, runtime diagnostics are the only alternative for hardware
revalidation, PFA and diagnostics.
How commonly are such mechanisms implemented? And, how thoroughly?
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations. Even if some safety critical software
does not contain them, nobody is going to admit violating regulations.
And things like PLCs are "dual use": they may be used in a non-safety
role, but vendors claim compliance to safety standards.
--
Waldek Hebisch
George Neuner
2024-10-18 21:42:44 UTC
Post by Waldek Hebisch
Post by Don Y
Typically, one performs some limited "confidence tests"
at POST to catch gross failures. As this activity is
"in series" with normal operation, it tends to be brief
and not very thorough.
Many products offer a BIST capability that the user can invoke
for more thorough testing. This allows the user to decide
when he can afford to live without the normal functioning of the
device.
And, if you are a "robust" designer, you often include invariants
that verify hardware operations (esp to I/Os) are actually doing
what they should -- e.g., verifying battery voltage increases
when you activate the charging circuit, loopbacks on DIOs, etc.
But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
And, BIST might not always be convenient (as well as requiring the
user's consent and participation).
There, runtime diagnostics are the only alternative for hardware
revalidation, PFA and diagnostics.
How commonly are such mechanisms implemented? And, how thoroughly?
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations. Even if some safety critical software
does not contain them, nobody is going to admit violating regulations.
And things like PLCs are "dual use": they may be used in a non-safety
role, but vendors claim compliance to safety standards.
However, only a minor percentage of all devices must comply with such
safety regulations.

As I understand it, Don is working on tech for "smart home"
implementations ... devices that may be expected to run nearly
constantly (though perhaps not 365/24 with 6 9's reliability), but
which, for the most part, are /not/ safety critical.

WRT Don's question, I don't know the answer, but I suspect runtime
diagnostics are /not/ routinely implemented for devices that are not
safety critical. Reason: diagnostics interfere with operation of
<whatever> they happen to be testing. Even if the test is at low(est)
priority and is interruptible by any other activity, it still might
cause an unacceptable delay in a real time situation. To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.

YMMV.
George
Don Y
2024-10-18 22:30:54 UTC
Hi George,

[Hope all is well with you and at home]
Post by George Neuner
WRT Don's question, I don't know the answer, but I suspect runtime
diagnostics are /not/ routinely implemented for devices that are not
safety critical. Reason: diagnostics interfere with operation of
<whatever> they happen to be testing. Even if the test is at low(est)
priority and is interruptible by any other activity, it still might
cause an unacceptable delay in a real time situation.
But, if you *know* when certain aspects of a device will be "called on",
you can take advantage of that to schedule diagnostics when the device is
not "needed". And, in the event that some unexpected "need" arises,
can terminate or suspend the testing (possibly rendering the effort
moot if it hasn't yet run to a conclusion).

E.g., I scrub freed memory pages (zero fill) so information doesn't
leak across protection domains. As long as some minimum number
of *scrubbed* pages are available for use "on demand", why can't
I *test* the pages yet to be scrubbed?
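
A minimal sketch of that test-while-scrubbing idea, assuming 4 KB pages
and word-wide access; the 0x55/0xAA passes catch simple stuck-at bits
and the final zero fill is the scrub itself:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS (4096 / sizeof(uint32_t))

bool test_and_scrub_page(volatile uint32_t *page)
{
    bool ok = true;

    for (size_t i = 0; i < PAGE_WORDS; i++) {
        page[i] = 0x55555555u;                   /* pattern pass */
        if (page[i] != 0x55555555u) ok = false;
        page[i] = 0xAAAAAAAAu;                   /* complement pass */
        if (page[i] != 0xAAAAAAAAu) ok = false;
    }
    for (size_t i = 0; i < PAGE_WORDS; i++)
        page[i] = 0;                             /* the actual scrub */

    return ok;    /* false => retire the page rather than reuse it */
}

Either way the page ends up zeroed, so the scrub is never weaker than
before; a failing page is simply withheld from the free pool.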

If I don't *expect* a car to pull up and call for the garage door
to be opened, why can't I play with the lighting to verify that
the cameras located within *notice* changes?

If there is no anticipated short term need for irrigation, why
can't I momentarily activate individual valves and watch to see that
the expected amount of water is flowing?

If a node is powered down due to lack of expected immediate need,
why not power it *up* and run diagnostics on it? Powering it back
down once completed -- *or*, aborting the diagnostics if the node
is called on to be powered up?
Post by George Neuner
To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
Apparently, there is noise about incorporating such hardware into
*automotive* designs (!). I would have thought the time between
POSTs would have rendered that largely ineffective. OTOH, if
you imagine a failure can occur ANY time, then "just after
putting the car in gear" is as good (bad!) a time as any!
Waldek Hebisch
2024-10-19 01:50:34 UTC
Post by Don Y
Post by George Neuner
To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
Apparently, there is noise about incorporating such hardware into
*automotive* designs (!). I would have thought the time between
POSTs would have rendered that largely ineffective. OTOH, if
you imagine a failure can occur ANY time, then "just after
putting the car in gear" is as good (bad!) a time as any!
TI has for several years had nice processors with two cores which
run almost in sync, but with one something like one cycle behind
the other. And there is circuitry to compare that both cores
produce the same result. This does not cover failures of the
whole chip, but dramatically lowers the chance of undetected errors
due to some transient condition.

For critical functions a car could have 3 processors with
voting circuitry. With separate chips this would be more expensive
than single processor, but increase of cost probably would be
negligible compared to cost of the whole car. And when integrated
on a single chip cost difference would be tiny.
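
For exact-match (binary) results, the voting itself is tiny; a sketch
of a 2-out-of-3 majority vote with identification of the odd channel out:

#include <stdint.h>

uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
{
    /* each output bit is set iff at least two inputs have it set */
    return (a & b) | (a & c) | (b & c);
}

int odd_channel(uint32_t a, uint32_t b, uint32_t c)
{
    if (a == b && b == c) return -1;    /* all agree */
    if (a == b) return 2;               /* c disagrees */
    if (a == c) return 1;
    if (b == c) return 0;
    return 3;                           /* all differ: no majority */
}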

IIUC a car controller may "reboot" during a ride. Instead of
rebooting it could hand work over to a backup controller.
--
Waldek Hebisch
Don Y
2024-10-19 02:38:18 UTC
Post by Waldek Hebisch
Post by Don Y
Post by George Neuner
To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
Apparently, there is noise about incorporating such hardware into
*automotive* designs (!). I would have thought the time between
POSTs would have rendered that largely ineffective. OTOH, if
you imagine a failure can occur ANY time, then "just after
putting the car in gear" is as good (bad!) a time as any!
TI has for several years had nice processors with two cores which
run almost in sync, but with one something like one cycle behind
the other. And there is circuitry to compare that both cores
produce the same result. This does not cover failures of the
whole chip, but dramatically lowers the chance of undetected errors
due to some transient condition.
The 4th bit in memory location XYZ has failed "stuck at zero".
How are you going to detect that?

One of the FETs that controls the shifting of the automatic
transmission has failed open. How do you detect that /and recover
from it/?

The camera/LIDAR that the self-drive feature uses is providing
incorrect data... etc.

There are innumerable failures that can occur to compromise
the "system" and no *easy*/inexpensive/reliable way to detect
and recover from *all* of them.
Post by Waldek Hebisch
For critical functions a car could have 3 processors with
voting circuitry. With separate chips this would be more expensive
than single processor, but increase of cost probably would be
negligible compared to cost of the whole car. And when integrated
on a single chip cost difference would be tiny.
IIUC a car controller may "reboot" during a ride. Instead of
rebooting it could hand work over to a backup controller.
How do you know the circuitry (and other mechanisms) that
implement this hand-over are operational?

It is VERY difficult to design reliable systems. I am not
attempting that. Rather, I am trying to address the fact that
the reassurances POST (and, at the user's perogative, BIST)
are not guaranteed when a device runs "for long periods of time".
Waldek Hebisch
2024-10-19 03:53:40 UTC
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
Post by George Neuner
To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
Apparently, there is noise about incorporating such hardware into
*automotive* designs (!). I would have thought the time between
POSTs would have rendered that largely ineffective. OTOH, if
you imagine a failure can occur ANY time, then "just after
putting the car in gear" is as good (bad!) a time as any!
TI for several years has nice processors with two cores, which
are almost in sync, but one is something like one cycle behind
the other. And there is circuitry to compare that both cores
produce the same result. This does not cover failures of the
whole chip, but dramaticaly lowers chance of undetected erros due
to some transient condition.
The 4th bit in memory location XYZ has failed "stuck at zero".
How are you going to detect that?
The chips that I mentioned use static memory with ECC. Of course,
the ECC circuitry may fail. There may be errors undetected by ECC.
The two cores may have the same error, or the comparison circuitry
may fail to detect the difference. Each may happen, but each
is much less likely than a simple transient error.
Post by Don Y
One of the FETs that controls the shifting of the automatic
transmission has failed open. How do you detect that /and recover
from it/?
Detecting such a thing looks easy. Recovery is tricky, because
if you have a spare FET and activate it there is a good chance that
it will fail for the same reason that the first FET failed.
OTOH, if you have a properly designed circuit around the FET, a
disturbance strong enough to kill the FET is likely to kill
the controller too.
Post by Don Y
The camera/LIDAR that the self-drive feature uses is providing
incorrect data... etc.
Use 3 (or more) and voting. Of course, this increases cost and one
has to judge if the increase in cost is worth the increase in safety
(in a self-driving car using multiple sensors looks like a no-brainer,
but if this is just an assist to increase driver comfort then the
result may be different).
Post by Don Y
There are innumerable failures that can occur to compromise
the "system" and no *easy*/inexpensive/reliable way to detect
and recover from *all* of them.
Sure. But for common failures, or serious failures having non-negligible
probability, redundancy may offer a cheap way to increase reliability.
Post by Don Y
Post by Waldek Hebisch
For critical functions a car could have 3 processors with
voting circuitry. With separate chips this would be more expensive
than single processor, but increase of cost probably would be
negligible compared to cost of the whole car. And when integrated
on a single chip cost difference would be tiny.
IIUC car controller may "reboot" during a ride. Intead of
rebooting it could handle work to a backup controller.
How do you know the circuitry (and other mechanisms) that
implement this hand-over are operational?
It does not matter if handover _always_ works. What matters is
whether a system with handover has a lower chance of failure than a
system without handover. Given statistics of actual failures
(which I do not have but manufacturers should have) and
some testing, one can estimate the failure probability of
different designs and possibly decide to use handover.
Post by Don Y
It is VERY difficult to design reliable systems. I am not
attempting that. Rather, I am trying to address the fact that
the reassurances POST (and, at the user's perogative, BIST)
are not guaranteed when a device runs "for long periods of time".
You may have tests essentially as part of normal operation.
Of course, if you have a single-tasked design with a task which
must be "always" ready to respond, then running tests becomes
more complicated. But in most designs you can spare enough
time slots to run tests during normal operation. Tests may
interfere with normal operation, but here we are in domain-specific
territory: sometimes the results of operation give enough
assurance that the device is operating correctly. And if testing
for correct operation is impossible, then there is nothing to
do; I certainly do not promise to deliver the impossible.
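
One common shape for this, sketched below: a cyclic design that spends
whatever is left of each cycle, minus a reserved budget, on one bounded
self-test step (millis(), do_control_work() and run_test_step() are
assumed helpers, not from any particular framework):

#include <stdbool.h>
#include <stdint.h>

extern uint32_t millis(void);        /* monotonic ms tick (assumed) */
extern void do_control_work(void);   /* the "real" cyclic job */
extern bool run_test_step(void);     /* one bounded chunk of self-test */

#define CYCLE_MS       10u
#define TEST_BUDGET_MS 2u            /* worst-case step must fit here */

void main_loop(void)
{
    for (;;) {
        uint32_t start = millis();

        do_control_work();

        /* run a test step only if it fits in the remaining slack */
        if (millis() - start < CYCLE_MS - TEST_BUDGET_MS)
            run_test_step();

        while (millis() - start < CYCLE_MS)
            ;                        /* pad out the cycle */
    }
}
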
--
Waldek Hebisch
Don Y
2024-10-19 04:17:21 UTC
Post by Waldek Hebisch
Post by Don Y
One of the FETs that controls the shifting of the automatic
transmission has failed open. How do you detect that /and recover
from it/?
Detecting such a thing looks easy. Recovery is tricky, because
if you have a spare FET and activate it there is a good chance that
it will fail for the same reason that the first FET failed.
OTOH, if you have a properly designed circuit around the FET, a
disturbance strong enough to kill the FET is likely to kill
the controller too.
The immediate goal is to *detect* that a problem exists.
If you can't detect, then attempting to recover is a moot point.
Post by Waldek Hebisch
Post by Don Y
The camera/LIDAR that the self-drive feature uses is providing
incorrect data... etc.
Use 3 (or more) and voting. Of course, this increases cost and one
has to judge if the increase in cost is worth the increase in safety
As well as the reliability of the additional "voting logic".
If not a set of binary signals, determining what the *correct*
signal may be can be problematic.
Post by Waldek Hebisch
(in a self-driving car using multiple sensors looks like a no-brainer,
but if this is just an assist to increase driver comfort then the
result may be different).
It is different only in the sense of liability and exposure to
loss. I am not assigning values to those consequences but,
rather, looking to address the issue of run-time testing, in
general.

Even if NONE of the failures can result in injury or loss,
it is unlikely that a user WANTS to have a defective product.
If the user is technically unable to determine when the
product is "at fault" (vs. his own misunderstanding of how it
is *supposed* to work), then those failures contribute to
the user's frustrations with the product.
Post by Waldek Hebisch
Post by Don Y
There are innumerable failures that can occur to compromise
the "system" and no *easy*/inexpensive/reliable way to detect
and recover from *all* of them.
Sure. But for common failures, or serious failures having non-negligible
probability, redundancy may offer a cheap way to increase reliability.
Post by Don Y
Post by Waldek Hebisch
For critical functions a car could have 3 processors with
voting circuitry. With separate chips this would be more expensive
than single processor, but increase of cost probably would be
negligible compared to cost of the whole car. And when integrated
on a single chip cost difference would be tiny.
IIUC car controller may "reboot" during a ride. Intead of
rebooting it could handle work to a backup controller.
How do you know the circuitry (and other mechanisms) that
implement this hand-over are operational?
It does not matter if handover _always_ works. What matters is
whether a system with handover has a lower chance of failure than a
system without handover. Given statistics of actual failures
(which I do not have but manufacturers should have) and
some testing, one can estimate the failure probability of
different designs and possibly decide to use handover.
Again, I am not interested in "recovery" as that varies with
the application and risk assessment. What I want to concentrate
on is reliably *detecting* faults before they lead to product
failures.

I contend that the hardware in many devices has that capability
(to some extent) but that it is underutilized; that the issue
of detecting faults *after* POST is one that doesn't see much
attention. The likely thinking being that POST will flag it the
next time the device is restarted.

And, that's not acceptable in long-running devices.
Post by Waldek Hebisch
Post by Don Y
It is VERY difficult to design reliable systems. I am not
attempting that. Rather, I am trying to address the fact that
the reassurances POST (and, at the user's perogative, BIST)
are not guaranteed when a device runs "for long periods of time".
You may have tests essentially as part of normal operation.
I suspect most folks have designed devices with UARTs. And,
having written a driver for it, have noted that framing, parity
and overrun errors are possible.

Ask yourself how many of those systems ever *use* that information!
Is there even a means of propagating it up out of the driver?
Post by Waldek Hebisch
Of course, if you have a single-tasked design with a task which
must be "always" ready to respond, then running tests becomes
more complicated. But in most designs you can spare enough
time slots to run tests during normal operation. Tests may
interfere with normal operation, but here we are in domain-specific
territory: sometimes the results of operation give enough
assurance that the device is operating correctly. And if testing
for correct operation is impossible, then there is nothing to
do; I certainly do not promise to deliver the impossible.
Waldek Hebisch
2024-10-24 17:52:02 UTC
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
One of the FETs that controls the shifting of the automatic
transmission has failed open. How do you detect that /and recover
from it/?
Detecting such a thing looks easy. Recovery is tricky, because
if you have a spare FET and activate it there is a good chance that
it will fail for the same reason that the first FET failed.
OTOH, if you have a properly designed circuit around the FET, a
disturbance strong enough to kill the FET is likely to kill
the controller too.
The immediate goal is to *detect* that a problem exists.
If you can't detect, then attempting to recover is a moot point.
In a car you have signals from the wheels and engine; you can use
those to compute the transmission ratio and check if it is the expected
one. Or simply have extra inputs which monitor the FET output.
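
A sketch of that plausibility check; the ratio table and the 5%
tolerance are purely illustrative, and a real check would also fold
in the final-drive ratio and filter out shifts in progress:

#include <stdbool.h>

static const float gear_ratio[] = { 0.0f, 3.8f, 2.2f, 1.5f, 1.0f, 0.8f };

bool gear_plausible(int gear, float engine_rpm, float wheel_rpm)
{
    if (gear < 1 || gear > 5)
        return false;                 /* not a forward gear */
    if (wheel_rpm < 1.0f)
        return true;                  /* too slow to judge */

    float actual = engine_rpm / wheel_rpm;
    float expect = gear_ratio[gear];

    return actual > expect * 0.95f && actual < expect * 1.05f;
}
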
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
The camera/LIDAR that the self-drive feature uses is providing
incorrect data... etc.
Use 3 (or more) and voting. Of course, this increases cost and one
has to judge if the increase in cost is worth the increase in safety
As well as the reliability of the additional "voting logic".
If not a set of binary signals, determining what the *correct*
signal may be can be problematic.
Matching images is now a standard technology. And in this case the
"voting logic" is likely to be software, and the main trouble is
possible bugs.
Post by Don Y
Post by Waldek Hebisch
(in a self-driving car using multiple sensors looks like a no-brainer,
but if this is just an assist to increase driver comfort then the
result may be different).
It is different only in the sense of liability and exposure to
loss. I am not assigning values to those consequences but,
rather, looking to address the issue of run-time testing, in
general.
I doubt there are general solutions. Various parts of your system
may have enough common features to allow a single strategy
within your system. But it is unlikely to generalize to other
systems. To put it differently, there are probabilities
of various events and associated costs. Even if you
refuse to quantify probabilities and costs, your design
decisions (assuming they are rational) will give some
estimate of them.
Post by Don Y
Even if NONE of the failures can result in injury or loss,
it is unlikely that a user WANTS to have a defective product.
If the user is technically unable to determine when the
product is "at fault" (vs. his own misunderstanding of how it
is *supposed* to work), then those failures contribute to
the users' frustrations with the product.
Post by Waldek Hebisch
Post by Don Y
There are innumerable failures that can occur to compromise
the "system" and no *easy*/inexpensive/reliable way to detect
and recover from *all* of them.
Sure. But for common failures, or serious failures having non-negligible
probability, redundancy may offer a cheap way to increase reliability.
Post by Don Y
Post by Waldek Hebisch
For critical functions a car could have 3 processors with
voting circuitry. With separate chips this would be more expensive
than single processor, but increase of cost probably would be
negligible compared to cost of the whole car. And when integrated
on a single chip cost difference would be tiny.
IIUC a car controller may "reboot" during a ride. Instead of
rebooting it could hand work over to a backup controller.
How do you know the circuitry (and other mechanisms) that
implement this hand-over are operational?
It does not matter if handover _always_ works. What matters is
whether a system with handover has a lower chance of failure than a
system without handover. Given statistics of actual failures
(which I do not have but manufacturers should have) and
some testing, one can estimate the failure probability of
different designs and possibly decide to use handover.
Again, I am not interested in "recovery" as that varies with
the application and risk assessment. What I want to concentrate
on is reliably *detecting* faults before they lead to product
failures.
I contend that the hardware in many devices has that capability
(to some extent) but that it is underutilized; that the issue
of detecting faults *after* POST is one that doesn't see much
attention. The likely thinking being that POST will flag it the
next time the device is restarted.
And, that's not acceptable in long-running devices.
Well, you write that you do not try to build a high-reliability
device. However, a device which operates correctly for years
without interruption is considered a "high availability" device,
which is a kind of high reliability. And techniques for high
reliability seem appropriate here.
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
It is VERY difficult to design reliable systems. I am not
attempting that. Rather, I am trying to address the fact that
the reassurances POST (and, at the user's perogative, BIST)
are not guaranteed when a device runs "for long periods of time".
You may have tests essentially as part of normal operation.
I suspect most folks have designed devices with UARTs. And,
having written a driver for it, have noted that framing, parity
and overrun errors are possible.
Ask yourself how many of those systems ever *use* that information!
Is there even a means of propagating it up out of the driver?
Well, I always use no-parity transmission mode. The standard way is
to use checksums and acknowledgments. That way you know if
transmission is working correctly. What extra info do you expect
from looking at detailed error info from the UART?
--
Waldek Hebisch
Don Y
2024-10-24 21:49:42 UTC
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
One of the FETs that controls the shifting of the automatic
transmission has failed open. How do you detect that /and recover
from it/?
Detecting such a thing looks easy. Recovery is tricky, because
if you have a spare FET and activate it there is a good chance that
it will fail for the same reason that the first FET failed.
OTOH, if you have a properly designed circuit around the FET, a
disturbance strong enough to kill the FET is likely to kill
the controller too.
The immediate goal is to *detect* that a problem exists.
If you can't detect, then attempting to recover is a moot point.
In a car you have signals from the wheels and engine; you can use
those to compute the transmission ratio and check if it is the expected
one. Or simply have extra inputs which monitor the FET output.
But a *user* can't do that. They can only claim "something doesn't
feel right about the drive"...

So, if the controller doesn't do it, what recourse?
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
The camera/LIDAR that the self-drive feature uses is providing
incorrect data... etc.
Use 3 (or more) and voting. Of course, this increases cost and one
has to judge if the increase in cost is worth the increase in safety
As well as the reliability of the additional "voting logic".
If not a set of binary signals, determining what the *correct*
signal may be can be problematic.
Matching images is now a standard technology. And in this case the
"voting logic" is likely to be software, and the main trouble is
possible bugs.
The data must be available concurrently in order to "vote" on
them. And, must be "close enough" to not consider them to differ.
For high reliability applications, you often *compute* the results
in different ways / algorithms -- to highlight any issues in
one implementation over the other. So, the temporal path to
"their solutions" isn't the same.
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
(in a self-driving car using multiple sensors looks like a no-brainer,
but if this is just an assist to increase driver comfort then the
result may be different).
It is different only in the sense of liability and exposure to
loss. I am not assigning values to those consequences but,
rather, looking to address the issue of run-time testing, in
general.
I doubt there are general solutions. Various parts of your system
may have enough common features to allow a single strategy
within your system. But it is unlikely to generalize to other
systems. To put it differently, there are probabilities
of various events and associated costs. Even if you
refuse to quantify probabilities and costs, your design
decisions (assuming they are rational) will give some
estimate of them.
I've asked for other peoples' experiences. I've not expected
them to have solved *my* problem. Nor do I expect my solution
to solve theirs. Likewise, why something like Linux wouldn't
have "the" solution.
Post by Waldek Hebisch
Post by Don Y
Again, I am not interested in "recovery" as that varies with
the application and risk assessment. What I want to concentrate
on is reliably *detecting* faults before they lead to product
failures.
I contend that the hardware in many devices has that capability
(to some extent) but that it is underutilized; that the issue
of detecting faults *after* POST is one that doesn't see much
attention. The likely thinking being that POST will flag it the
next time the device is restarted.
And, that's not acceptable in long-running devices.
Well, you write that you do not try to build a high-reliability
device. However, a device which operates correctly for years
without interruption is considered a "high availability" device,
which is a kind of high reliability. And techniques for high
reliability seem appropriate here.
No. Most devices can't afford the cost/complexity of a true
high reliability/redundant solution.

Your car has SOME redundancy in how it handles braking (two
chamber master cylinder plus "emergency/parking" brake PLUS
using the engine to slow the vehicle). Yet, absolutely NO
protection against a catastrophic failure of the steering!

Redundancy in braking is relatively easy to provide -- esp
in the volumes produced. So, adds little to the cost and
complexity of the vehicle. Adding a redundant steering
mechanism... where would you even BEGIN to address that?

Cars are a great example of the tradeoffs involved. You invest
a *little* to detect and report problems instead of a LOT to
continue operating in their presence. Why not have duplicate
turn signal indicators (front, rear and side) to guard against
a bulb failure? Much easier and cheaper to detect that a filament
has opened and report that to the driver (and HOPE he gets around
to fixing it).

If the driver can't be TOLD of faults and failures, then he
is in the dark as to how effectively his device is performing its
required actions. "CHECK ENGINE" really does mean that the
engine *needs* attention. What difference, "CHECK DRAM"?
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
It is VERY difficult to design reliable systems. I am not
attempting that. Rather, I am trying to address the fact that
the reassurances POST (and, at the user's perogative, BIST)
are not guaranteed when a device runs "for long periods of time".
You may have tests essentially as part of normal operation.
I suspect most folks have designed devices with UARTs. And,
having written a driver for it, have noted that framing, parity
and overrun errors are possible.
Ask yourself how many of those systems ever *use* that information!
Is there even a means of propagating it up out of the driver?
Well, I always use no-parity transmission mode. The standard way is
to use checksums and acknowledgments. That way you know if
transmission is working correctly. What extra info do you expect
from looking at detailed error info from the UART?
That assumes you can control the messages exchanged. If I
attach a TTY to the console -- routed through a serial port -- on
my computer, what should the checksum be when I see the "login: "
message? When I type my name, what checksum should I append
to the identifier?

I.e., serial port protocols don't *require* these things.
If the computer sees "do~n" -- where '~' indicates an overrun
error in the preceding character's reception -- it KNOWS
that I haven't typed exactly three characters: d, o, n.
So, it shouldn't even ASK for my password, choosing, instead,
to reissue the login: banner (because it wouldn't know which
password to validate).

Likewise, if I saw "log~in: " on my TTY, I *know* that it
isn't saying "login: " because AT LEAST one character has
been omitted in that '~'.

This is easy to fix -- in ALL interactions. But, requires
the driver to propagate these errors up the stack and the
application layer to act on them. I.e., if the application
layer encounters lots of overrun (or parity/framing) errors,
SOMETHING is wrong with the link and/or the driver.
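
A minimal sketch of what "propagating it up" can look like: the driver
hands each character up with its line-error flags attached, and the
application treats a dirty exchange as a reason to re-prompt (the HAL
call uart_hw_read() is a stand-in for real hardware access):

#include <stdbool.h>
#include <stdint.h>

enum { RX_OVERRUN = 1, RX_PARITY = 2, RX_FRAMING = 4 };

typedef struct {
    uint8_t ch;
    uint8_t err;     /* 0 = clean, else RX_* flags */
} rx_event;

extern bool uart_hw_read(uint8_t *ch, uint8_t *err);  /* assumed HAL */

/* application layer: abandon the line on any flagged character, so a
   "do~n" is never mistaken for "don" */
bool read_line(char *buf, int len)
{
    rx_event ev;
    int n = 0;

    while (n < len - 1 && uart_hw_read(&ev.ch, &ev.err)) {
        if (ev.err)
            return false;             /* caller reissues its prompt */
        if (ev.ch == '\n')
            break;
        buf[n++] = (char)ev.ch;
    }
    buf[n] = '\0';
    return true;
}

Counting how often read_line() fails gives exactly the link-health
figure the driver normally throws away.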

David Brown
2024-10-19 12:07:07 UTC
Post by Don Y
Hi George,
[Hope all is well with you and at home]
Post by George Neuner
WRT Don's question, I don't know the answer, but I suspect runtime
diagnostics are /not/ routinely implemented for devices that are not
safety critical. Reason: diagnostics interfere with operation of
<whatever> they happen to be testing. Even if the test is at low(est)
priority and is interruptible by any other activity, it still might
cause an unacceptable delay in a real time situation.
But, if you *know* when certain aspects of a device will be "called on",
you can take advantage of that to schedule diagnostics when the device is
not "needed".  And, in the event that some unexpected "need" arises,
can terminate or suspend the testing (possibly rendering the effort
moot if it hasn't yet run to a conclusion).
E.g., I scrub freed memory pages (zero fill) so information doesn't
leak across protection domains. As long as some minimum number
of *scrubbed* pages are available for use "on demand", why can't
I *test* the pages yet to be scrubbed?
You /could/ do that, but what is the point?

What are you checking for? What is the realistic likelihood of finding
a problem, and what are the consequences of such a problem? How do you
test your test routines - are you able to simulate the problem you are
testing in a good enough manner? What are the circumstances that could
lead to a fault that you detect with the tests but where you would not
already see the problem in other ways? Is it realistic to assume that
your diagnostic test and reporting systems are able to run properly when
this problem occurs? If some kind of problem actually occurs, will
your tests realistically identify it?

/Those/ are the kinds of questions you should be asking before putting
in some kind of tests. They are the important questions. Asking "why
can't I do a test now?" is peanuts in comparison.
George Neuner
2024-10-19 19:25:43 UTC
On Fri, 18 Oct 2024 15:30:54 -0700, Don Y
Post by Don Y
Hi George,
[Hope all is well with you and at home]
Hi Don,

Same ol', same ol'. Nothing much new to report.
Post by Don Y
Post by George Neuner
WRT Don's question, I don't know the answer, but I suspect runtime
diagnostics are /not/ routinely implemented for devices that are not
safety critical. Reason: diagnostics interfere with operation of
<whatever> they happen to be testing. Even if the test is at low(est)
priority and is interruptible by any other activity, it still might
cause an unacceptable delay in a real time situation.
But, if you *know* when certain aspects of a device will be "called on",
you can take advantage of that to schedule diagnostics when the device is
not "needed". And, in the event that some unexpected "need" arises,
can terminate or suspend the testing (possibly rendering the effort
moot if it hasn't yet run to a conclusion).
If you "know" a priori when some component will be needed, then you
can do whatever you want when it is not. The problem is that many
uses can't be easily anticipated.

Which circles back to testing priority: if the test is interruptible
and/or resumable, then it may be done whenever the component is
available ... as long as it won't tie up the component if and when it
becomes needed for something else.
Post by Don Y
E.g., I scrub freed memory pages (zero fill) so information doesn't
leak across protection domains. As long as some minimum number
of *scrubbed* pages are available for use "on demand", why can't
I *test* the pages yet to be scrubbed?
If you're testing memory pages, most likely you are tying up bandwidth
in the memory system and slowing progress of the real applications.

Also because you can't accurately judge the "minimum" needed. BSD and
Linux both have this problem where a sudden burst of allocations
exhausts the pool of zeroed pages, forcing demand zeroing of new pages
prior to their re-assignment. Slows the system to a crawl when it
happens.
Post by Don Y
If there is no anticipated short term need for irrigation, why
can't I momentarily activate individual valves and watch to see that
the expected amount of water is flowing?
Because then you are watering (however briefly) when it is not
expected. What if there was a pesticide application that should not
be wetted? What if a person is there and gets sprayed by your test?

Properly, valve testing should be done concurrently with a scheduled
watering. Check water is flowing when the valve should be open, and
not flowing when the valve should be closed.
Post by Don Y
Post by George Neuner
To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
Apparently, there is noise about incorporating such hardware into
*automotive* designs (!). I would have thought the time between
POSTs would have rendered that largely ineffective. OTOH, if
you imagine a failure can occur ANY time, then "just after
putting the car in gear" is as good (bad!) a time as any!
Automotive is going the way of aircraft: standby running lockstep with
the primary and monitoring its data flow - able to reset the system if
they disagree, or take over if the primary fails.



The point here is that there is no "one fits all" philosophy you can
follow ... what is proper to do depends on what the (sub)system does,
its criticality, and on the components involved that may need to be
tested.
Don Y
2024-10-19 21:32:48 UTC
Post by George Neuner
Same ol', same ol'. Nothing much new to report.
No news is good news!
Post by George Neuner
Post by Don Y
But, if you *know* when certain aspects of a device will be "called on",
you can take advantage of that to schedule diagnostics when the device is
not "needed". And, in the event that some unexpected "need" arises,
can terminate or suspend the testing (possibly rendering the effort
moot if it hasn't yet run to a conclusion).
If you "know" a priori when some component will be needed, then you
can do whatever you want when it is not. The problem is that many
uses can't be easily anticipated.
Granted, I can't know when a user *might* want to do some
asynchronous task. But, the whole point of my system is to
watch and anticipate needs based on observed behaviors.

E.g., if the house is unoccupied, then it's not likely that
anyone will want to watch TV -- unless they have *scheduled*
a recording of a broadcast (in which case, I would know it).

If the occupants are asleep, then it's not likely they will be
going out for a drive.
Post by George Neuner
Which circles back to testing priority: if the test is interruptible
and/or resumeable, then it may be done whenever the component is
available ... as long as it won't tie up the component if and when it
becomes needed for something else.
Exactly. I already have to deal with that in my decisions to
power down nodes. If my actions are incorrect, then it introduces
a delay in getting "back" to whatever state I should have been in.
Post by George Neuner
Post by Don Y
E.g., I scrub freed memory pages (zero fill) so information doesn't
leak across protection domains. As long as some minimum number
of *scrubbed* pages are available for use "on demand", why can't
I *test* the pages yet to be scrubbed?
If you're testing memory pages, most likely you are tying up bandwidth
in the memory system and slowing progress of the real applications.
But, they wouldn't be scrubbed if there were higher "priority"
tasks demanding resources. I.e., some other "lower priority"
task would have been accessing memory.
Post by George Neuner
Also because you can't accurately judge the "minimum" needed. BSD and
Linux both have this problem where a sudden burst of allocations
exhausts the pool of zeroed pages, forcing demand zeroing of new pages
prior to their re-assignment. Slows the system to a crawl when it
happens.
Yes, but you have live users arbitrarily deciding they "need" those
resources. And, have considerably more pages at risk for use.
I've only got ~1G per node and (theoretically), a usage model of
what resources are needed, when (where).

*Not* clearing the pages leaves a side channel open for information
leakage so *that* isn't negotiable. Having some "deliberately
dirty" could be an issue but, even "dirty", they are wiped of
their previous contents after a single pass through the test.
Post by George Neuner
Post by Don Y
If there is no anticipated short term need for irrigation, why
can't I momentarily activate individual valves and watch to see that
the expected amount of water is flowing?
Because then you are watering (however briefly) when it is not
expected. What if there was a pesticide application that should not
be wetted? What if a person is there and gets sprayed by your test?
Irrigation, here, is not airborne. The ground may be wetted in the
*immediate* vicinity of the emitters activated. But, they operate at
very low flow rates (liters per HOUR).

Your goal is to verify the master valve(s) operate (I do that by opening
the purge valve(s) and letting water drain into a sump); the individual
valves are operable; and that water *flows* when commanded.
Post by George Neuner
Properly, valve testing should be done concurrently with a scheduled
watering. Check water is flowing when the valve should be open, and
not flowing when the valve should be closed.
That happens as part of normal operation. But, NOT knowing until that
time can lead to plant death. E.g., if the roses don't get watered twice
a day, they are toast (in this environment). If the cacti valves don't
*close*, they are toast. If a line is "failed open", then you've
a geyser in the yard (and *no* irrigation to those plants).

Repairs of this nature can be time consuming, depending on the nature
of the failure (and cost thousands of dollars in labor). The more I
can deduce about the nature of the failure, the quicker the service
can be brought back up to par and the less the "diagnostic cost"
of having someone do so, manually (digging up a yard to determine where
a line has been punctured; inspecting individual emitters to determine
which are blocked; visually monitoring for water flow per zone; etc.)

[Amazing how much these "minimum wage jobs" actually end up costing
when you have to hire someone! E.g., $160/month to have your "yard
cleaned" -- *if* you can find someone to do it at that rate! Irrigation
work starts at kilobucks and is relatively open-ended (as no one can
assess the nature of the job until they start on it)]
Post by George Neuner
Post by Don Y
Post by George Neuner
To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
Apparently, there is noise about incorporating such hardware into
*automotive* designs (!). I would have thought the time between
POSTs would have rendered that largely ineffective. OTOH, if
you imagine a failure can occur ANY time, then "just after
putting the car in gear" is as good (bad!) a time as any!
Automotive is going the way of aircraft: standby running lockstep with
the primary and monitoring its data flow - able to reset the system if
they disagree, or take over if the primary fails.
The point here is that there is no "one fits all" philosophy you can
follow ... what is proper to do depends on what the (sub)system does,
its criticality, and on the components involved that may need to be
tested.
I am, rather, looking for ideas as to how (others) may have approached
it. Most of the research I've uncovered deals with servers and their
ilk. Or, historical information (e.g., MULTICS' "computing as a service"
philosophy). E.g., *scheduling* testing vs. opportunistic testing.
Don Y
2024-10-23 12:53:44 UTC
Post by Don Y
Post by George Neuner
The point here is that there is no "one fits all" philosophy you can
follow ... what is proper to do depends on what the (sub)system does,
its criticality, and on the components involved that may need to be
tested.
I am, rather, looking for ideas as to how (others) may have approached
it. Most of the research I've uncovered deals with servers and their
ilk. Or, historical information (e.g., MULTICS' "computing as a service"
philosophy). E.g., *scheduling* testing vs. opportunistic testing.
"Opportunistic" seems to work well -- *if* you declare the resources
you will need and wait until you can acquire them.

The downside is that you may NEVER be able to acquire them,
based on what processes are active on a node. You wouldn't want
the diagnostic task to have to KNOW those things!

As different tests may require different resources, this
becomes problematic; do you request the largest set? A
smaller set? Or, design a mechanism to allow for arbitrarily
complex combinations to be specified <frown>
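
A sketch of the declare-and-acquire discipline, assuming a non-blocking
res_try_acquire()/res_release() pair; the test runs only if *every*
declared resource is free, and backs out cleanly otherwise:

#include <stdbool.h>
#include <stddef.h>

typedef struct resource resource;
extern bool res_try_acquire(resource *r);   /* non-blocking (assumed) */
extern void res_release(resource *r);

bool run_diagnostic(resource *needed[], size_t n, bool (*test)(void))
{
    size_t got = 0;

    for (; got < n; got++) {
        if (!res_try_acquire(needed[got])) {
            while (got-- > 0)          /* back out; retry later */
                res_release(needed[got]);
            return false;
        }
    }

    bool ok = test();                  /* owns every declared resource */

    for (size_t i = 0; i < n; i++)
        res_release(needed[i]);
    return ok;
}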

This became apparent when running the DRAM test using the
DRAM emulator (non-production board designed to validate the
DRAM test by allowing arbitrary fault injection, on demand).
While it was known that *some* tests could NOT be run out of
DRAM (which limits their efficacy in a running system), there
were other system resources that were "silently" called upon
that would have impacted other coexecuting tasks. <frown>

The good news (wrt DRAM testing) is that checking for "stuck at"
faults -- the most prevalent described in published research -- places
no special demands on resources, beyond access to DRAM!
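
A sketch of such a stuck-at test over an idle region, in the style of
the classic MATS+ march (and it leaves the region zeroed, i.e. scrubbed):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool march_stuck_at(volatile uint32_t *mem, size_t words)
{
    for (size_t i = 0; i < words; i++)        /* up: w0 */
        mem[i] = 0;

    for (size_t i = 0; i < words; i++) {      /* up: r0, w1 */
        if (mem[i] != 0) return false;        /* stuck-at-1 bit */
        mem[i] = 0xFFFFFFFFu;
    }

    for (size_t i = words; i-- > 0; ) {       /* down: r1, w0 */
        if (mem[i] != 0xFFFFFFFFu) return false;  /* stuck-at-0 bit */
        mem[i] = 0;
    }
    return true;
}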

Moral of story: CAREFULLY enumerate (and declare) ALL such
resources. And, consider how realistic it is to expect
ALL of them to be available serendipitously in a given node.

Else, resort to *scheduling* the diagnostic ("maintenance period")
Nioclásán Caileán de Ghlostéir
2024-10-20 17:15:29 UTC
On Fri, 18 Oct 2024, Don Y wrote:
"> To ensure 100%
Post by George Neuner
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
Apparently, there is noise about incorporating such hardware into
*automotive* designs (!). I would have thought the time between
POSTs would have rendered that largely ineffective. OTOH, if
you imagine a failure can occur ANY time, then "just after
putting the car in gear" is as good (bad!) a time as any!"


Hi Don,

We were lectured by a person whom an expensive-German-car manufacturer
paid to help make cars, but he told us that this rich German company did
not pay him enough to buy one of these cars, so he drives a cheap car
from a different company. He might endanger a client's clients' lives,
but less so his own (though of course a malfunctioning German car could
crash into him).
Waldek Hebisch
2024-10-19 01:25:54 UTC
Post by George Neuner
Post by Waldek Hebisch
Post by Don Y
Typically, one performs some limited "confidence tests"
at POST to catch gross failures. As this activity is
"in series" with normal operation, it tends to be brief
and not very thorough.
Many products offer a BIST capability that the user can invoke
for more thorough testing. This allows the user to decide
when he can afford to live without the normal functioning of the
device.
And, if you are a "robust" designer, you often include invariants
that verify hardware operations (esp to I/Os) are actually doing
what they should -- e.g., verifying battery voltage increases
when you activate the charging circuit, loopbacks on DIOs, etc.
But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
And, BIST might not always be convenient (as well as requiring the
user's consent and participation).
There, runtime diagnostics are the only alternative for hardware
revalidation, PFA and diagnostics.
How commonly are such mechanisms implemented? And, how thoroughly?
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations. Even if some safety critical software
does not contain them, nobody is going to admit violating regulations.
And things like PLCs are "dual use": they may be used in a non-safety
role, but vendors claim compliance to safety standards.
However, only a minor percentage of all devices must comply with such
safety regulations.
Maybe, if you mean domain-specific regulations. But there are
general EC directives. One may spend money on lawyers and
research to conclude that some protections are not required
by law. Or one may implement things as if for a regulated domain.
On a small scale the second is likely to be cheaper.

Anyway, I do not know if there is anything specific about
washing machines, but software for them is clearly written
as if they were regulated. The same for ovens, heaters etc.
Post by George Neuner
As I understand it, Don is working on tech for "smart home"
implementations ... devices that may be expected to run nearly
constantly (though perhaps not 365/24 with 6 9's reliability), but
which, for the most part, are /not/ safety critical.
IMO, "smart home" which matter have safety implications. Even
if they are not regulated now there is potential for liabilty.
And new requlations appear quite frequently.
Post by George Neuner
WRT Don's question, I don't know the answer, but I suspect runtime
diagnostics are /not/ routinely implemented for devices that are not
safety critical. Reason: diagnostics interfere with operation of
<whatever> they happen to be testing. Even if the test is at low(est)
priority and is interruptible by any other activity, it still might
cause an unacceptable delay in a real time situation. To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
IIUC at low levels requirements are not that hard to satisfy,
especially since in most cases a non-working device is deemed "safe".
--
Waldek Hebisch
David Brown
2024-10-19 11:57:30 UTC
Post by George Neuner
Post by Waldek Hebisch
Post by Don Y
Typically, one performs some limited "confidence tests"
at POST to catch gross failures. As this activity is
"in series" with normal operation, it tends to be brief
and not very thorough.
Many products offer a BIST capability that the user can invoke
for more thorough testing. This allows the user to decide
when he can afford to live without the normal functioning of the
device.
And, if you are a "robust" designer, you often include invariants
that verify hardware operations (esp to I/Os) are actually doing
what they should -- e.g., verifying battery voltage increases
when you activate the charging circuit, loopbacks on DIOs, etc.
But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
And, BIST might not always be convenient (as well as requiring the
user's consent and participation).
There, runtime diagnostics are the only alternative for hardware
revalidation, PFA and diagnostics.
How commonly are such mechanisms implemented? And, how thoroughly?
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations. Even if some safety critical software
does not contain them, nobody is going to admit violating regulations.
And things like PLCs are "dual use": they may be used in a non-safety
role, but vendors claim compliance to safety standards.
However, only a minor percentage of all devices must comply with such
safety regulations.
As I understand it, Don is working on tech for "smart home"
implementations ... devices that may be expected to run nearly
constantly (though perhaps not 365/24 with 6 9's reliability), but
which, for the most part, are /not/ safety critical.
WRT Don's question, I don't know the answer, but I suspect runtime
diagnostics are /not/ routinely implemented for devices that are not
safety critical. Reason: diagnostics interfere with operation of
<whatever> they happen to be testing. Even if the test is at low(est)
priority and is interruptible by any other activity, it still might
cause an unacceptable delay in a real time situation. To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non safety critical
device.
That brings up one of the critical points about any kind of runtime
diagnostics - what do you do if there is a failure? Until you can
answer that question, any effort on diagnostics is not just pointless,
but worse than useless because you are adding more stuff that could go
wrong.

I think bad or useless diagnostics are a more common problem than
missing diagnostics. People feel pressured into having them when they
can't measure anything useful and can't do anything sensible with
the results.

I have seen first-hand how the insistence on having all sorts of
diagnostics added to a product so that it could be "safety" certified
actually resulted in a less reliable and less safe product. The only
"safety" they provided was legal safety so that people could claim it
wasn't their fault if it failed, because they had added all the
self-tests required by the so-called safety experts.
Don Y
2024-10-18 22:15:30 UTC
Post by Waldek Hebisch
Post by Don Y
Typically, one performs some limited "confidence tests"
at POST to catch gross failures. As this activity is
"in series" with normal operation, it tends to be brief
and not very thorough.
Many products offer a BIST capability that the user can invoke
for more thorough testing. This allows the user to decide
when he can afford to live without the normal functioning of the
device.
And, if you are a "robust" designer, you often include invariants
that verify hardware operations (esp to I/Os) are actually doing
what they should -- e.g., verifying battery voltage increases
when you activate the charging circuit, loopbacks on DIOs, etc.
But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
And, BIST might not always be convenient (as well as requiring the
user's consent and participation).
There, runtime diagnostics are the only alternative for hardware
revalidation, PFA and diagnostics.
How commonly are such mechanisms implemented? And, how thoroughly?
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations.
Not all devices are covered by "regulations".

And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????

[I.e., how long a device remains "up" is a function of the device,
it's application, environment and user]

Do you just *hope* the device "happens" to fail in a noticeable
manner so a user is left with no doubt but that the device is
no longer operational?
Post by Waldek Hebisch
Even if some safety critical software
does not contain them, nobody is going to admit violating regulations.
And things like PLCs are "dual use": they may be used in a non-safety
role, but vendors claim compliance to safety standards.
So, if a bit in a RAM in said device *dies* some time after power on,
is the device going to *know* that has happened? And, signal its
unwillingness to continue operating? What is going to detect that
failure?

What if the bit's failure is inconsequential to the operation
of the device? E.g., if the bit is part of some not-used
feature? *Or*, if it has failed in the state it was *supposed*
to be in??!

With a "good" POST design, you can reassure the user that the
device *appears* to be functional. That the data/code stored in it
are intact (since last time they were accessed). That the memory
is capable of storing any values that is called on to preserve.
That the hardware I/Os can control and sense as intended, etc.

/But, you have no guarantee that this condition will persist!/
If it WAS guaranteed to persist, then the simple way to make high
reliability devices would be just to /never turn them off/ to
take advantage of this "guarantee"!
Waldek Hebisch
2024-10-19 03:00:48 UTC
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
Typically, one performs some limited "confidence tests"
at POST to catch gross failures. As this activity is
"in series" with normal operation, it tends to be brief
and not very thorough.
Many products offer a BIST capability that the user can invoke
for more thorough testing. This allows the user to decide
when he can afford to live without the normal functioning of the
device.
And, if you are a "robust" designer, you often include invariants
that verify hardware operations (esp to I/Os) are actually doing
what they should -- e.g., verifying battery voltage increases
when you activate the charging circuit, loopbacks on DIOs, etc.
But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
And, BIST might not always be convenient (as well as requiring the
user's consent and participation).
There, runtime diagnostics are the only alternative for hardware
revalidation, PFA and diagnostics.
How commonly are such mechanisms implemented? And, how thoroughly?
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations.
Not all devices are covered by "regulations".
Well, if a device matters then there is implied liability
and nobody wants to admit to doing a bad job. If the device
does not matter, then the answer to the original question
also does not matter.
Post by Don Y
And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
What to test is really domain-specific. Traditional thinking
is that computer hardware is _much_ more reliable than
software, and that software bugs are the major source of misbehaviour.
And among hardware failures, transient upsets like a flipped
bit are more likely than a permanent failure. For example,
at a low safety level you may assume that the hardware of a counter
generating a PWM-ed signal works correctly, but you are
supposed to periodically verify that the configuration registers
keep their expected values. IIUC crystal oscillators are likely to fail,
so you are supposed to regularly check for the presence of the clock
and its frequency (this assumes a hardware design with a backup
clock).
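For illustration, the register-readback part might look like this in C
(a minimal sketch; the register addresses and the failure response are
placeholders, not any particular MCU's map):

  #include <stdint.h>
  #include <stddef.h>

  /* Shadow copies of configuration registers, captured at init.
     The addresses below are illustrative only. */
  typedef struct {
      volatile uint32_t *reg;    /* hardware register            */
      uint32_t          shadow;  /* value it is supposed to hold */
  } reg_check_t;

  static reg_check_t checks[] = {
      { (volatile uint32_t *)0x40001000u, 0 },  /* e.g., PWM period */
      { (volatile uint32_t *)0x40001004u, 0 },  /* e.g., PWM mode   */
  };

  /* Call once, right after the registers have been configured. */
  void regcheck_capture(void)
  {
      for (size_t i = 0; i < sizeof checks / sizeof checks[0]; i++)
          checks[i].shadow = *checks[i].reg;
  }

  /* Call periodically (e.g., once per main-loop cycle).  Returns 0
     while every register still holds its configured value. */
  int regcheck_verify(void)
  {
      for (size_t i = 0; i < sizeof checks / sizeof checks[0]; i++)
          if (*checks[i].reg != checks[i].shadow)
              return -1;   /* rewrite it, log it, or trip a fault */
      return 0;
  }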
Post by Don Y
Post by Waldek Hebisch
Even if some safety critical software
does not contain them, nobody is going to admit violating regulations.
And things like PLC-s are "dual use", they may be used in non-safety
role, but vendors claim compliance to safety standards.
So, if a bit in a RAM in said device *dies* some time after power on,
is the device going to *know* that has happened? And, signal its
unwillingness to continue operating? What is going to detect that
failure?
I do not know how PLC manufacturers implement checks. Small
PLC-s are based on MCU-s with static, parity-protected RAM.
This may be deemed adequate. PLC-s work in cycles and some
percentage of the cycle is dedicated to self-test. So a big
PLC may divide memory into smallish regions and in each
cycle check a single region, walking through the whole memory.
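That "one region per cycle" idea, sketched in C (the region size and
patterns are illustrative; a real march test is more thorough, and
interrupts must be masked while a region is saved, tested and restored):

  #include <stdint.h>

  extern uint32_t ram_words;            /* size of area under test */
  extern volatile uint32_t ram_start[]; /* its first word          */

  #define REGION_WORDS 64u              /* words tested per cycle  */

  /* Test one region per call; successive calls walk the whole
     memory.  Returns 0 on pass, -1 on a stuck bit. */
  int ramtest_step(void)
  {
      static uint32_t next = 0;
      uint32_t save[REGION_WORDS];
      uint32_t n = ram_words - next;
      if (n > REGION_WORDS) n = REGION_WORDS;

      volatile uint32_t *p = &ram_start[next];
      int ok = 1;

      for (uint32_t i = 0; i < n; i++)   /* preserve live contents */
          save[i] = p[i];

      for (uint32_t i = 0; i < n; i++) { /* simple pattern test    */
          p[i] = 0x55555555u; if (p[i] != 0x55555555u) ok = 0;
          p[i] = 0xAAAAAAAAu; if (p[i] != 0xAAAAAAAAu) ok = 0;
      }

      for (uint32_t i = 0; i < n; i++)   /* restore live contents  */
          p[i] = save[i];

      next += n;
      if (next >= ram_words) next = 0;   /* wrap: begin a new pass */
      return ok ? 0 : -1;
  }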
Post by Don Y
What if the bit's failure is inconsequential to the operation
of the device? E.g., if the bit is part of some not-used
feature? *Or*, if it has failed in the state it was *supposed*
to be in??!
I am afraid that usually an inconsequential failure gets
promoted to a complete failure. Before 2000, checking showed
that several BIOS-es "validated" the date, and an "incorrect" (that
is, after 1999) date prevented boot.

Historically OS-es had a map of bad blocks on the disc and
avoided allocating them. In principle, on a system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if the domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).
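If anybody did do it, the mechanics would look much like a disc's
bad-block list; a hypothetical sketch (the page pool is made up):

  #include <stdint.h>
  #include <stdbool.h>

  #define NUM_PAGES 256u

  static bool page_bad[NUM_PAGES];    /* condemned by the RAM test */
  static bool page_used[NUM_PAGES];

  /* Called by the runtime RAM test when a page fails. */
  void page_retire(uint32_t page) { page_bad[page] = true; }

  /* Allocator that skips retired pages, the way an OS skipped
     bad disc blocks.  Returns a page number, or -1 if exhausted. */
  int page_alloc(void)
  {
      for (uint32_t i = 0; i < NUM_PAGES; i++)
          if (!page_bad[i] && !page_used[i]) {
              page_used[i] = true;
              return (int)i;
          }
      return -1;
  }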
Post by Don Y
With a "good" POST design, you can reassure the user that the
device *appears* to be functional. That the data/code stored in it
are intact (since last time they were accessed). That the memory
is capable of storing any values that it is called on to preserve.
That the hardware I/Os can control and sense as intended, etc.
/But, you have no guarantee that this condition will persist!/
If it WAS guaranteed to persist, then the simple way to make high
reliability devices would be just to /never turn them off/ to
take advantage of this "guarantee"!
Everything here is domain-specific. In a cheap MCU-based device the main
source of failures is overvoltage/ESD on MCU pins. This may
kill the whole chip, in which case no software protection can
help. Or some pins fail; sometimes this may be detected by reading
the appropriate port. If you control an electric motor then you probably
do not want to send test signals during normal motor operation.
But you are likely to have some feedback and can verify that the feedback
agrees with expected values. If you get unexpected readings
you probably will stop the motor.
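For the motor case, such a feedback check might be sketched like this
(the thresholds and the sensor/actuator functions are placeholders
supplied by the application, not any real API):

  extern int   motor_commanded_on(void);   /* what we asked for */
  extern float motor_current_amps(void);   /* what we observe   */
  extern void  motor_emergency_stop(void);

  #define I_MIN_RUNNING 0.2f  /* below this while driven: open winding? */
  #define I_MAX_RUNNING 8.0f  /* above this: stall or short?            */

  /* Call periodically: compare feedback against what the commanded
     state implies, and stop the motor on disagreement. */
  void motor_plausibility_check(void)
  {
      float i = motor_current_amps();

      if (motor_commanded_on()) {
          if (i < I_MIN_RUNNING || i > I_MAX_RUNNING)
              motor_emergency_stop();  /* feedback contradicts command */
      } else if (i > I_MIN_RUNNING) {
          motor_emergency_stop();      /* current with the motor "off" */
      }
  }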
--
Waldek Hebisch
Don Y
2024-10-19 04:05:14 UTC
Permalink
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
There, runtime diagnostics are the only alternative for hardware
revalidation, PFA and diagnostics.
How commonly are such mechanisms implemented? And, how thoroughly?
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations.
Not all devices are covered by "regulations".
Well, if a device matters then there is implied liability
and nobody wants to admit to doing a bad job. If the device
does not matter, then the answer to the original question
also does not matter.
In the US, ANYTHING can result in a lawsuit. But, "due diligence"
can insulate the manufacturer, to some extent. No one ever
*admits* to "doing a bad job".

If your doorbell malfunctions, what "damages" are you going
to claim? If your garage door doesn't open when commanded?
If your yard doesn't get watered? If you weren't promptly
notified that the mail had just been delivered? Or, that
the compressor in the freezer had failed and your foodstuffs
had spoiled, as a result?

The costs of litigation are reasonably high. Lawyers want
to see the LIKELIHOOD of a big payout before entertaining
such litigation.
Post by Waldek Hebisch
Post by Don Y
And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
What to test is really domain-specific. Traditional thinking
is that computer hardware is _much_ more reliable than
software, and that software bugs are the major source of misbehaviour.
That hasn't been *proven*. And, "misbehavior" is not the same
as *failure*.
Post by Waldek Hebisch
And among hardware failures, transient upsets like a flipped
bit are more likely than a permanent failure. For example,
That used to be the thinking with DRAM but studies have shown
that *hard* failures are more common. These *can* be found...
*if* you go looking for them!

E.g., if you load code into RAM (from FLASH) for execution,
are you sure the image *in* the RAM is the image from the FLASH?
What about "now"? And "now"?!
Post by Waldek Hebisch
at a low safety level you may assume that the hardware of a counter
generating a PWM-ed signal works correctly, but you are
supposed to periodically verify that the configuration registers
keep their expected values.
Why would you expect the registers to lose their settings?
Would you expect the CPU's registers to be similarly flaky?
Post by Waldek Hebisch
IIUC crystal oscillators are likely to fail,
so you are supposed to regularly check for the presence of the clock
and its frequency (this assumes a hardware design with a backup
clock).
Post by Don Y
Post by Waldek Hebisch
Even if some safety critical software
does not contain them, nobody is going to admit violating regulations.
And things like PLC-s are "dual use", they may be used in non-safety
role, but vendors claim compliance to safety standards.
So, if a bit in a RAM in said device *dies* some time after power on,
is the device going to *know* that has happened? And, signal its
unwillingness to continue operating? What is going to detect that
failure?
I do not know how PLC manufacturers implement checks. Small
PLC-s are based on MCU-s with static, parity-protected RAM.
This may be deemed adequate. PLC-s work in cycles and some
percentage of the cycle is dedicated to self-test. So a big
PLC may divide memory into smallish regions and in each
cycle check a single region, walking through the whole memory.
Post by Don Y
What if the bit's failure is inconsequential to the operation
of the device? E.g., if the bit is part of some not-used
feature? *Or*, if it has failed in the state it was *supposed*
to be in??!
I am afraid that usually an inconsequential failure gets
promoted to a complete failure. Before 2000, checking showed
that several BIOS-es "validated" the date, and an "incorrect" (that
is, after 1999) date prevented boot.
If *a* failure resulted in a catastrophic failure, things would
be "acceptable" in that the user would KNOW that something is
wrong without the device having to tell them.

But, too often, faults can be "absorbed" or lead to unobservable
errors in operation. What then?

Somewhere, I have a paper where the researchers simulated faults
*in* various OS kernels to see how "tolerant" the OS was of these
faults (which we know *happen*). One would think that *any*
fault would cause a crash. Yet, MANY faults are sufferable
(depending on the OS).

Consider, if a single bit error converts a "JUMP" to a "JUMP IF CARRY"
but the carry happens to be set, then there is no difference in the
execution path. If that bit error converts a "saturday" into a
"sunday", then something that is intended to execute on weekdays (or
weekends) won't care. Etc.
Post by Waldek Hebisch
Historically OS-es had a map of bad blocks on the disc and
avoided allocating them. In principle, on a system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if the domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).
Using ECC DRAM doesn't solve the problem. If you see errors
reported by your ECC RAM (corrected errors), then when do
you decide you are seeing too many and losing confidence that
the ECC is actually *detecting* all multibit errors?
Post by Waldek Hebisch
Post by Don Y
With a "good" POST design, you can reassure the user that the
device *appears* to be functional. That the data/code stored in it
are intact (since last time they were accessed). That the memory
is capable of storing any values that it is called on to preserve.
That the hardware I/Os can control and sense as intended, etc.
/But, you have no guarantee that this condition will persist!/
If it WAS guaranteed to persist, then the simple way to make high
reliability devices would be just to /never turn them off/ to
take advantage of this "guarantee"!
Everything here is domain-specific. In a cheap MCU-based device the main
source of failures is overvoltage/ESD on MCU pins. This may
kill the whole chip, in which case no software protection can
help. Or some pins fail; sometimes this may be detected by reading
the appropriate port. If you control an electric motor then you probably
do not want to send test signals during normal motor operation.
That depends on HOW you generate your test signals, what the hardware
actually looks like and how sensitive the "mechanism" is to such
"disturbances". Remember, "you" can see things faster than a mechanism
can often respond. I.e., if applying power to the motor doesn't
result in an observable load current (or "micromotion"), then the
motor is likely not responding.
Post by Waldek Hebisch
But you are likely to have some feedback and can verify that the feedback
agrees with expected values. If you get unexpected readings
you probably will stop the motor.
Waldek Hebisch
2024-10-19 13:53:33 UTC
Permalink
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
There, runtime diagnostics are the only alternative for hardware
revalidation, PFA and diagnostics.
How commonly are such mechanisms implemented? And, how thoroughly?
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations.
Not all devices are covered by "regulations".
Well, if a device matters then there is implied liability
and nobody wants to admit to doing a bad job. If the device
does not matter, then the answer to the original question
also does not matter.
In the US, ANYTHING can result in a lawsuit. But, "due diligence"
can insulate the manufacturer, to some extent. No one ever
*admits* to "doing a bad job".
If your doorbell malfunctions, what "damages" are you going
to claim? If your garage door doesn't open when commanded?
If your yard doesn't get watered? If you weren't promptly
notified that the mail had just been delivered? Or, that
the compressor in the freezer had failed and your foodstuffs
had spoiled, as a result?
The costs of litigation are reasonably high. Lawyers want
to see the LIKELIHOOD of a big payout before entertaining
such litigation.
Each item above may contribute to a significant loss. And
there could be a push to litigation (say, by a consumer advocacy group)
basically to establish a precedent. So, better to have a
record of due diligence.
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
What to test is really domain-specific. Traditional thinking
is that computer hardware is _much_ more reliable than
software, and that software bugs are the major source of misbehaviour.
That hasn't been *proven*. And, "misbehavior" is not the same
as *failure*.
First, I mean the relevant hardware, that is, hardware inside an MCU.
I think that there are strong arguments that such hardware is
more reliable than software. I have seen a claim, based on analysis
of discovered failures, that software written to rigorous development
standards exhibits on average about 1 bug (that leads to failure) per
1000 lines of code. This means that even a small MCU has enough
space for a handful of bugs. And for bigger systems it gets worse.
Post by Don Y
Post by Waldek Hebisch
And among hardware failures, transient upsets like a flipped
bit are more likely than a permanent failure. For example,
That used to be the thinking with DRAM but studies have shown
that *hard* failures are more common. These *can* be found...
*if* you go looking for them!
In another place I wrote that one of the studies I saw claimed that
a significant number of the errors they detected (they monitored changes
to a memory area that was supposed to be unmodified) were due to buggy
software. And DRAM is special.
Post by Don Y
E.g., if you load code into RAM (from FLASH) for execution,
are you sure the image *in* the RAM is the image from the FLASH?
What about "now"? And "now"?!
You are supposed to regularly verify a sufficiently strong checksum.
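In C, that periodic verification might look like this (CRC-32 shown
for brevity; a checklist may demand a stronger code, and the segment
bounds are illustrative linker symbols):

  #include <stdint.h>
  #include <stddef.h>

  extern const uint8_t _text_start[], _text_end[];  /* image in RAM */

  static uint32_t crc32(const uint8_t *p, size_t n)
  {
      uint32_t crc = 0xFFFFFFFFu;
      while (n--) {
          crc ^= *p++;
          for (int k = 0; k < 8; k++)
              crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
      }
      return ~crc;
  }

  static uint32_t text_crc_ref;  /* captured after the copy from flash */

  void image_check_init(void)
  {
      text_crc_ref = crc32(_text_start, (size_t)(_text_end - _text_start));
  }

  /* Run from a low-priority task or the idle loop; checking a slice
     per call (as in the PLC example above) bounds the latency. */
  int image_check(void)
  {
      return crc32(_text_start, (size_t)(_text_end - _text_start))
             == text_crc_ref ? 0 : -1;
  }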
Post by Don Y
Post by Waldek Hebisch
at a low safety level you may assume that the hardware of a counter
generating a PWM-ed signal works correctly, but you are
supposed to periodically verify that the configuration registers
keep their expected values.
Why would you expect the registers to lose their settings?
Would you expect the CPU's registers to be similarly flaky?
First, such checking is not my idea, but one point from a checklist for
low-safety devices. Registers may change due to bugs, EMC events,
cosmic rays and similar.
Post by Don Y
Post by Waldek Hebisch
Historically OS-es had a map of bad blocks on the disc and
avoided allocating them. In principle, on a system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if the domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).
Using ECC DRAM doesn't solve the problem. If you see errors
reported by your ECC RAM (corrected errors), then when do
you decide you are seeing too many and losing confidence that
the ECC is actually *detecting* all multibit errors?
ECC is part of the solution. It may reduce the probability of error
so that you consider errors not serious enough to matter. And if you
really care you may try to increase the error rate (say, by putting
RAM chips at increased temperature) and test that your detection
and recovery strategy works OK.
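One way to turn corrected-error reports into a decision, sketched in C
(the window and threshold are illustrative policy, not from any
standard):

  #include <stdint.h>

  #define WINDOW_SECONDS (24u * 3600u)  /* observation window           */
  #define MAX_CORRECTED  10u            /* tolerated corrections/window */

  static uint32_t corrected;            /* count in the current window  */

  /* Call from the ECC "corrected error" interrupt or poll handler.
     Returns -1 when the rate suggests the array is degrading and
     an uncorrectable multi-bit error is becoming plausible. */
  int ecc_corrected_event(void)
  {
      return (++corrected > MAX_CORRECTED) ? -1 : 0;
  }

  /* Call once per WINDOW_SECONDS from a timer. */
  void ecc_window_reset(void)
  {
      corrected = 0;
  }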
--
Waldek Hebisch
Don Y
2024-10-19 16:55:35 UTC
Permalink
Post by Waldek Hebisch
Post by Don Y
In the US, ANYTHING can result in a lawsuit. But, "due diligence"
can insulate the manufacturer, to some extent. No one ever
*admits* to "doing a bad job".
If your doorbell malfunctions, what "damages" are you going
to claim? If your garage door doesn't open when commanded?
If your yard doesn't get watered? If you weren't promptly
notified that the mail had just been delivered? Or, that
the compressor in the freezer had failed and your foodstuffs
had spoiled, as a result?
The costs of litigation are reasonably high. Lawyers want
to see the LIKELIHOOD of a big payout before entertaining
such litigation.
Each item above may contribute to a significant loss. And
Significant loss? From a doorbell failing to ring? Are you
sure YOUR doorbell has rung EVERY time someone pressed the button?
Post by Waldek Hebisch
there could be a push to litigation (say, by a consumer advocacy group)
basically to establish a precedent. So, better to have a
record of due diligence.
But things can *still* "fail to perform". That's the whole point of
runtime diagnostics: to notice a failure that the user may NOT!
If you can take remedial action, then you have a notification that
it is needed. If this requires the assistance of the user, then
you can REQUEST that. If you can offload some of the responsibilities
of the device (something that I can do, dynamically), then you
can elect to do so. If you can do nothing to keep the device in
service, then you can alert the user of the need for replacement.

*NOT* knowing of a fault means you gleefully keep operating as
if everything was fine.
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
What to test is really domain-specific. Traditional thinking
is that computer hardware is _much_ more reliable than
software, and that software bugs are the major source of misbehaviour.
That hasn't been *proven*. And, "misbehavior" is not the same
as *failure*.
First, I mean the relevant hardware, that is, hardware inside an MCU.
I think that there are strong arguments that such hardware is
more reliable than software. I have seen a claim, based on analysis
of discovered failures, that software written to rigorous development
standards exhibits on average about 1 bug (that leads to failure) per
1000 lines of code. This means that even a small MCU has enough
space for a handful of bugs. And for bigger systems it gets worse.
But bugs need not be consequential. They may be undesirable or
even annoying but need not have associated "costs".

I have a cassette deck (Nakamichi Dragon) that has a design flaw.
When the tape reaches the end of side "A", it is supposed to
autoreverse and play the "back side". So, the revolutions
counter counts *up* while playing side A and then back down
while playing side B.

However, if you eject the tape just as side A finishes and
physically flip it over (so side B is the "front" side)
pressing FORWARD PLAY (which is the direction that the
reels were moving while the tape counter was counting UP),
the tape will move FORWARD but the reels will count backwards.

If you had removed the tape and placed some OTHER tape in
the mechanism, the same behavior results (obviously) -- moving
forward but counting backwards. If you turn the deck OFF
and then back ON, the tape counter moves correctly.

How am I harmed by this? To what monetary extent? It's
a race in the hardware & software (the tape counter is
implemented in a separate MCU). I can avoid the problem
by NOT ejecting the tape just after the completion of
side A...
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
And among hardware failures, transient upsets like a flipped
bit are more likely than a permanent failure. For example,
That used to be the thinking with DRAM but studies have shown
that *hard* failures are more common. These *can* be found...
*if* you go looking for them!
In another place I wrote that one of the studies I saw claimed that
a significant number of the errors they detected (they monitored changes
to a memory area that was supposed to be unmodified) were due to buggy
software. And DRAM is special.
If you have memory protection hardware (I do), then such changes
can't casually occur; the software has to make a deliberate
attempt to tell the memory controller to allow such a change.
Post by Waldek Hebisch
Post by Don Y
E.g., if you load code into RAM (from FLASH) for execution,
are you sure the image *in* the RAM is the image from the FLASH?
What about "now"? And "now"?!
You are supposed to regularly verify a sufficiently strong checksum.
Really? Wanna bet that doesn't happen? How many Linux-based devices
load applications and start a process to continuously verify the
integrity of the TEXT segment?

What are they going to do if they notice a discrepancy? Reload
the application and hope it avoids any "soft spots" in memory?
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
at a low safety level you may assume that the hardware of a counter
generating a PWM-ed signal works correctly, but you are
supposed to periodically verify that the configuration registers
keep their expected values.
Why would you expect the registers to lose their settings?
Would you expect the CPU's registers to be similarly flaky?
First, such checking is not my idea, but one point from a checklist for
low-safety devices. Registers may change due to bugs, EMC events,
cosmic rays and similar.
Then you are dealing with high reliability designs. Do you
really think my microwave oven, stove, furnace, telephone,
etc. are designed to be resilient to those types of faults?
Do you think the user could detect such an occurrence?
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
Historically OS-es had a map of bad blocks on the disc and
avoided allocating them. In principle, on a system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if the domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).
Using ECC DRAM doesn't solve the problem. If you see errors
reported by your ECC RAM (corrected errors), then when do
you decide you are seeing too many and losing confidence that
the ECC is actually *detecting* all multibit errors?
ECC is part of solution. It may reduce probability of error
so that you consider them not serious enough. And if you
really care you may try to increase error rate (say by putting
RAM chips at increased temperature) and test that your detection
and recovery strategy works OK.
Studies suggest that temperature doesn't play the role that
was suspected. What ECC does is give you *data* about faults.
Without it, you have no way to know about faults /as they
occur/.

Testing tries to address faults at different points in their
lifespans. Predictive Failure Analysis tries to alert to the
likelihood of *impending* failures BEFORE they occur. So,
whatever remedial action you might take can happen BEFORE
something has failed. POST serves a similar role but tries to
catch failures that have *occurred* before they can affect the
operation of the device. BIST gives the user a way of making
that determination (or receiving reassurance) "on demand".
Run time diagnostics address testing while the device wants
to remain in operation.

What you *do* about a failure is up to you, your market and the
expectations of your users. If a battery fails in SOME of my
UPSs, they simply won't power on (and, if the periodic run-time
test is enabled, that test will cause them to unceremoniously
power themselves OFF as they try to switch to battery power).
Other UPSs will provide an alert (audible/visual/log message)
of the fact but give me the option of continuing to POWER
those devices in the absence of backup protection.

The latter is far more preferable to me as I can then decide
when/if I want to replace the batteries without being forced
to do so, *now*.

The same is not true of smoke/CO detectors; when they detect
a failed (or failING) battery, they are increasingly annoying
in their insistence that the problem be addressed, now.
So much so, that it leads to deaths due to the detector
being taken out of service to stop the damn bleating.

I have a great deal of latitude in how I handle failures.
For example, I can busy-out more than 90% of the RAM in a device
(if something suggested that it was unreliable) and *still*
provide the functionality of that node -- by running the code
on another node and leaving just the hardware drivers associated
with *this* node in place. So, I can alert a user that a
particular device is in need of service -- yet, continue
to provide the services that were associated with that device.
IMO, this is the best of all possible "failure" scenarios;
the worst being NOT knowing that something is misbehaving.
Waldek Hebisch
2024-10-24 16:34:11 UTC
Permalink
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
What to test is really domain-specific. Traditional thinking
is that computer hardware is _much_ more reliable than
software, and that software bugs are the major source of misbehaviour.
That hasn't been *proven*. And, "misbehavior" is not the same
as *failure*.
First, I mean the relevant hardware, that is, hardware inside an MCU.
I think that there are strong arguments that such hardware is
more reliable than software. I have seen a claim, based on analysis
of discovered failures, that software written to rigorous development
standards exhibits on average about 1 bug (that leads to failure) per
1000 lines of code. This means that even a small MCU has enough
space for a handful of bugs. And for bigger systems it gets worse.
But bugs need not be consequential. They may be undesirable or
even annoying but need not have associated "costs".
The point is that you cannot eliminate all bugs. Rather, you
should have simple code with the aim of preventing the "cost" of bugs.
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
And among hardware failures, transient upsets like a flipped
bit are more likely than a permanent failure. For example,
That used to be the thinking with DRAM but studies have shown
that *hard* failures are more common. These *can* be found...
*if* you go looking for them!
In another place I wrote that one of the studies I saw claimed that
a significant number of the errors they detected (they monitored changes
to a memory area that was supposed to be unmodified) were due to buggy
software. And DRAM is special.
If you have memory protection hardware (I do), then such changes
can't casually occur; the software has to make a deliberate
attempt to tell the memory controller to allow such a change.
The tests were run on Linux boxes with normal memory protection.
Memory protection does not prevent troubles due to bugs in
privileged code. Of course, you can think that you can do
better than Linux programmers.
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
E.g., if you load code into RAM (from FLASH) for execution,
are you sure the image *in* the RAM is the image from the FLASH?
What about "now"? And "now"?!
You are supposed to regularly verify a sufficiently strong checksum.
Really? Wanna bet that doesn't happen? How many Linux-based devices
load applications and start a process to continuously verify the
integrity of the TEXT segment?
Using something like Linux means that you do not care about rare
problems (or are prepared to resolve them without the help of the OS).
Post by Don Y
What are they going to do if they notice a discrepancy? Reload
the application and hope it avoids any "soft spots" in memory?
AFAICS the rule about checking the image was originally intended
for devices executing code directly from flash; if your "primary
truth" fails, the possibilities are limited. With DRAM failures one
can do much better. The question is mainly probabilities and
effort.
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
at a low safety level you may assume that the hardware of a counter
generating a PWM-ed signal works correctly, but you are
supposed to periodically verify that the configuration registers
keep their expected values.
Why would you expect the registers to lose their settings?
Would you expect the CPU's registers to be similarly flaky?
First, such checking is not my idea, but one point from a checklist for
low-safety devices. Registers may change due to bugs, EMC events,
cosmic rays and similar.
Then you are dealing with high reliability designs. Do you
really think my microwave oven, stove, furnace, telephone,
etc. are designed to be resilient to those types of faults?
Do you think the user could detect such an occurrence?
IIUC microwave, stove and furnace should be. In a cell phone
the BMS should be safe and the core radio is tightly regulated. Other
parts seem to be at the quality/reliability level of PCs.

You clearly want to make your devices more reliable. Bugs
and various events happen and extra checking is actually
quite cheap. It is for you to decide if you need/want
it.
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
Historically OS-es had a map of bad blocks on the disc and
avoided allocating them. In principle, on a system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if the domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).
Using ECC DRAM doesn't solve the problem. If you see errors
reported by your ECC RAM (corrected errors), then when do
you decide you are seeing too many and losing confidence that
the ECC is actually *detecting* all multibit errors?
ECC is part of the solution. It may reduce the probability of error
so that you consider errors not serious enough to matter. And if you
really care you may try to increase the error rate (say, by putting
RAM chips at increased temperature) and test that your detection
and recovery strategy works OK.
Studies suggest that temperature doesn't play the role that
was suspected. What ECC does is give you *data* about faults.
Without it, you have no way to know about faults /as they
occur/.
Well, there is evidence that increased temperature increases
the chance of errors. More precisely, expect errors when you
operate DRAM close to the max allowed temperature. The point is
that you can cause errors and that way test your recovery
strategy (untested recovery code is likely to fail when/if
it is needed).
Post by Don Y
Testing tries to address faults at different points in their
lifespans. Predictive Failure Analysis tries to alert to the
likelihood of *impending* failures BEFORE they occur. So,
whatever remedial action you might take can happen BEFORE
something has failed. POST serves a similar role but tries to
catch failures that have *occurred* before they can affect the
operation of the device. BIST gives the user a way of making
that determination (or receiving reassurance) "on demand".
Run time diagnostics address testing while the device wants
to remain in operation.
What you *do* about a failure is up to you, your market and the
expectations of your users. If a battery fails in SOME of my
UPSs, they simply won't power on (and, if the periodic run-time
test is enabled, that test will cause them to unceremoniously
power themselves OFF as they try to switch to battery power).
Other UPSs will provide an alert (audible/visual/log message)
of the fact but give me the option of continuing to POWER
those devices in the absence of backup protection.
The latter is far more preferable to me as I can then decide
when/if I want to replace the batteries without being forced
to do so, *now*.
The same is not true of smoke/CO detectors; when they detect
a failed (failING battery), they are increasingly annoying
in their insistence that the problem be addressed, now.
So much so, that it leads to deaths due to the detector
being taken out of service to stop the damn bleating.
I have a great deal of latitude in how I handle failures.
For example, I can busy-out more than 90% of the RAM in a device
(if something suggested that it was unreliable) and *still*
provide the functionality of that node -- by running the code
on another node and leaving just the hardware drivers associated
with *this* node in place. So, I can alert a user that a
particular device is in need of service -- yet, continue
to provide the services that were associated with that device.
IMO, this is the best of all possible "failure" scenarios;
the worst being NOT knowing that something is misbehaving.
Good.
--
Waldek Hebisch
Don Y
2024-10-24 21:28:44 UTC
Permalink
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
That hasn't been *proven*. And, "misbehavior" is not the same
as *failure*.
First, I mean the relevant hardware, that is, hardware inside an MCU.
I think that there are strong arguments that such hardware is
more reliable than software. I have seen a claim, based on analysis
of discovered failures, that software written to rigorous development
standards exhibits on average about 1 bug (that leads to failure) per
1000 lines of code. This means that even a small MCU has enough
space for a handful of bugs. And for bigger systems it gets worse.
But bugs need not be consequential. They may be undesirable or
even annoying but need not have associated "costs".
The point is that you cannot eliminate all bugs. Rather, you
should have simple code with the aim of preventing the "cost" of bugs.
Code need only be "as simple as possible, /but no simpler/".
The problem defines the complexity of the solution.
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
And among hardware failures, transient upsets like a flipped
bit are more likely than a permanent failure. For example,
That used to be the thinking with DRAM but studies have shown
that *hard* failures are more common. These *can* be found...
*if* you go looking for them!
In another place I wrote that one of the studies I saw claimed that
a significant number of the errors they detected (they monitored changes
to a memory area that was supposed to be unmodified) were due to buggy
software. And DRAM is special.
If you have memory protection hardware (I do), then such changes
can't casually occur; the software has to make a deliberate
attempt to tell the memory controller to allow such a change.
The tests were run on Linux boxes with normal memory protection.
Memory protection does not prevent troubles due to bugs in
privileged code. Of course, you can think that you can do
better than Linux programmers.
Linux code is far from "as simple as possible". They are constantly
trying to make a GENERAL PURPOSE solution for a wide variety of
applications that THEY envision.
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
E.g., if you load code into RAM (from FLASH) for execution,
are you sure the image *in* the RAM is the image from the FLASH?
What about "now"? And "now"?!
You are supposed to regularly verify a sufficiently strong checksum.
Really? Wanna bet that doesn't happen? How many Linux-based devices
load applications and start a process to continuously verify the
integrity of the TEXT segment?
Using something like Linux means that you do not care about rare
problems (or are prepared to resolve them without the help of the OS).
Using <anything> means you don't care about any of the issues
that the <anything> developers considered unimportant or were
incapable of/unwilling to address.
Post by Waldek Hebisch
Post by Don Y
What are they going to do if they notice a discrepancy? Reload
the application and hope it avoids any "soft spots" in memory?
AFAICS the rule about checking the image was originally intended
for devices executing code directly from flash; if your "primary
truth" fails, the possibilities are limited. With DRAM failures one
can do much better. The question is mainly probabilities and
effort.
One typically doesn't assume flash fails WHILE in use (though, of course,
it does). DRAM is documented to fail while in use. If you have
"little enough" of it then you can hope the failures are far enough
apart, in time, that they just look like "bugs". This is especially
true if your device is only running "part time" or is unobserved
for long stretches of time as that "bug" can manifest in numerous
ways depending on the nature of the DRAM fault and *if* the CPU
happens to encounter it. E.g., just like a RAID array, absent
patrol reads, you never know if a file that hasn't been referenced
in months has suffered any corruption.
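The memory analogue of a patrol read is a scrubber: a background task
that touches every word so the ECC (or a checksum) sees latent faults
before the application does. A minimal sketch, assuming a controller
that corrects on read (the slice size is illustrative):

  #include <stdint.h>

  extern uint32_t          dram_words;
  extern volatile uint32_t dram_start[];

  #define SCRUB_WORDS 256u   /* words visited per call */

  /* Read and write back a small slice per call; over many calls the
     whole array is visited, so a latent single-bit error is corrected
     before a second bit in the same word can fail. */
  void scrub_step(void)
  {
      static uint32_t next = 0;
      for (uint32_t i = 0; i < SCRUB_WORDS && next < dram_words; i++, next++)
          dram_start[next] = dram_start[next];  /* read, correct, write back */
      if (next >= dram_words)
          next = 0;                             /* wrap: start a new pass    */
  }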

In many applications, there are large swaths of code that get
executed once or infrequently. E.g., how often does the first
line of code after main() get executed? If it was corrupted
after that initial execution/examination, would you know? or care?
Ah, but if you don't NOTICE that it has been corrupted, then you
will proceed gleefully ignorant of the fact that your memory
system is encountering problem(s) and, thus, won't take any steps
to address them.
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
Post by Don Y
Post by Waldek Hebisch
at a low safety level you may assume that the hardware of a counter
generating a PWM-ed signal works correctly, but you are
supposed to periodically verify that the configuration registers
keep their expected values.
Why would you expect the registers to lose their settings?
Would you expect the CPU's registers to be similarly flaky?
First, such checking is not my idea, but one point from a checklist for
low-safety devices. Registers may change due to bugs, EMC events,
cosmic rays and similar.
Then you are dealing with high reliability designs. Do you
really think my microwave oven, stove, furnace, telephone,
etc. are designed to be resilient to those types of faults?
Do you think the user could detect such an occurrence?
IIUC microwave, stove and furnace should be. In a cell phone
the BMS should be safe and the core radio is tightly regulated. Other
parts seem to be at the quality/reliability level of PCs.
You clearly want to make your devices more reliable. Bugs
and various events happen and extra checking is actually
quite cheap. It is for you to decide if you need/want
it.
Unlike a phone or "small appliance" that you can carry in to a
service center -- or, return to the store where purchased -- I
can't expect a user to just pick up a failed/suspect device and
exchange it for a "new" one. Could you remove the PCB that
controls your furnace and bring it <somewhere> to have someone
tell you if it is "acting up" and in need of replacement?
Would you even THINK to do this?

Instead, if you were having problems (or suspected you were)
with your "furnace", you would call a service man to come to
*it* and have a look. Here (US), that is costly -- they aren't
going to drive to you without some sort of compensation
(and they aren't going to do so for "Uber rates").

E.g., a few winters past, the natural gas supply to our city was
"compromised"; it was unusually cold and demand exceeded the
ability of the system to deliver gas at sufficient pressure to
the entire city.

Most furnaces rely on a certain flow of fuel to operate. And,
contain sensors to shut down the furnace if they sense an
inadequate fuel supply. So, much of the city had no heat.
This resulted in most plumbing contractors being overwhelmed
with calls for service.

Of course, there was nothing they could *do* to correct the problem.
But, that didn't stop them from taking orders for service,
dispatching their trucks and BILLING each of these customers.

One could argue that a more moral industry might have recommended
callers "wait a while as there is a citywide problem with the
gas supply". But, maybe there was something ELSE at fault
as some of those callers? And, it's a great opportunity to
get into their homes and try to make a sale to upgrade your
"old equipment" to unsuspecting homeowners (an HVAC system is
in the $10K price range for a nominal home).

Bring your phone into a store complaining of a problem and
they will likely show you what you are doing wrong. They *may*
suggest you upgrade that 3-year-old model -- but, if that
is the sole reason for their suggested upgrade, you will likely
decline and walk away. "It's just a phone; if it keeps giving
me problems, THEN I will upgrade". That's not the case with
something "magical" (in the eyes of a homeowner) like an
HVAC system. Upgrading later may be even less convenient than
it is now! And, it actually *is* an old system... (who defines
old?)
Post by Waldek Hebisch
Post by Don Y
Studies suggest that temperature doesn't play the role that
was suspected. What ECC does is give you *data* about faults.
Without it, you have no way to know about faults /as they
occur/.
Well, there is evidence that increased temperature inreases
chance of errors. More precisely, expect errors when you
operate DRAM close to max allowed temperature. The point is
that you can cause errors and that way test your recovery
strategy (untested recovery code is likely to fail when/if
it is needed).
That was debunked by the data.
George Neuner
2024-10-19 19:58:13 UTC
Permalink
On Fri, 18 Oct 2024 21:05:14 -0700, Don Y
Post by Don Y
In the US, ANYTHING can result in a lawsuit.
Yes.
Post by Don Y
But, "due diligence" can insulate the manufacturer, to some extent.
No one ever *admits* to "doing a bad job".
Actually due diligence /can't/ insulate a manufacturer if the issue
goes to trial. Members of a jury may feel sorry for the litigant(s),
or conclude that the manufacturer can afford whatever they award ...
or maybe they just don't like the manufacturer's lawyer.

Unlike judges, juries do /not/ have to justify their decisions.
Moreover, in some US jurisdictions, the decision of a civil case need
not be unanimous but only that of a quorum.
Post by Don Y
If your doorbell malfunctions, what "damages" are you going
to claim? If your garage door doesn't open when commanded?
If your yard doesn't get watered? If you weren't promptly
notified that the mail had just been delivered? Or, that
the compressor in the freezer had failed and your foodstuffs
had spoiled, as a result?
The costs of litigation are reasonably high. Lawyers want
to see the LIKELIHOOD of a big payout before entertaining
such litigation.
So they created the "class action", where all the litigants
individually may have very small claims, but when put together the
total becomes significant.
Don Y
2024-10-19 23:26:48 UTC
Permalink
Post by George Neuner
On Fri, 18 Oct 2024 21:05:14 -0700, Don Y
Post by Don Y
But, "due diligence" can insulate the manufacturer, to some extent.
No one ever *admits* to "doing a bad job".
Actually due diligence /can't/ insulate a manufacturer if the issue
goes to trial. Members of a jury may feel sorry for the litigant(s),
or conclude that the manufacturer can afford whatever they award ...
or maybe they just don't like the manufacturer's lawyer.
You missed my "to some extent" qualifier. It allows you to make
the case /to that jury/ that you *thought* about the potential
problems and made a concerted attempt to address them. Contrast
that with "Didn't it occur to the manufacturer that a customer
might LIKELY use their device in THIS manner, resulting in
THESE sorts of problems?"

You can never anticipate every way a device can be "misapplied". But,
not making ANY attempt to address "off label" uses is sure to
result in a hostile attitude from those judging your behavior:
"So, you made HOW MUCH money off this product and still
couldn't afford the time/effort to have considered these issues?"

"Small fish" are seldom targeted by such lawsuits as they have
few assets and can fold without consequences to their owners.
Post by George Neuner
Unlike judges, juries do /not/ have to justify their decisions,
Moreover, in some US juridictions, the decision of a civil case need
not be unanimous but only that of a quorum.
Post by Don Y
If your doorbell malfunctions, what "damages" are you going
to claim? If your garage door doesn't open when commanded?
If your yard doesn't get watered? If you weren't promptly
notified that the mail had just been delivered? Or, that
the compressor in the freezer had failed and your foodstuffs
had spoiled, as a result?
The costs of litigation are reasonably high. Lawyers want
to see the LIKELIHOOD of a big payout before entertaining
such litigation.
So they created the "class action", where all the litigants
individually may have very small claims, but when put together the
total becomes significant.
But you still have to demonstrate a loss. And, be able to
argue some particular value to that loss. "Well, MAYBE
Publishers' Clearinghouse came to my house to give me that
oversized/cartoon check but the doorbell MIGHT not have
rung. So, I want '$3000/week for life' as compensation..."
Nioclásán Caileán de Ghlostéir
2024-10-20 18:08:50 UTC
Permalink
On Sat, 19 Oct 2024, Don Y wrote:
"[. . .]
[. . .] "Well, MAYBE
Publishers' Clearinghouse came to my house to give me that
oversized/cartoon check but the doorbell MIGHT not have
rung. So, I want '$3000/week for life' as compensation...""


Hi Don,

Grady Booch confessed at an International Conference on Software
Engineering in 2000 that his anti-engineering UML policy produced a door
bell which was not rung, so a friend spent money on a cellphone call to
him to open a door! Fools pretend that Grady Booch is an engineering hero!

Mister Fabio Bertella (then at Optisoft Srl,
Via A. Bertoloni, 15,
19038 Sarzana (SP),
Italy) said in 2007 that Grady Booch "has never written a real program."

Regards.
Nioclásán Caileán de Ghlostéir
2024-10-20 17:11:31 UTC
Permalink
On Fri, 18 Oct 2024, Don Y wrote:
"The costs of litigation are reasonably high."


Hi Don,

Court cases' costs are unreasonably high.
George Neuner
2024-10-20 19:44:33 UTC
Permalink
On Sun, 20 Oct 2024 19:11:31 +0200, Nioclásán Caileán de Ghlostéir
Post by Nioclásán Caileán de Ghlostéir
"The costs of litigation are reasonably high."
Hi Don,
Court cases' costs are unreasonably high.
Courts need not be involved ... just lawyers.

A few weeks ago there was an article in Forbes magazine saying that
the /average/ billing by lawyers at moderately sized firms in the US
is now $1500..$1800/hr, and billing by lawyers from top firms is now
$2600..$3000/hr.

Litigating a product liability case routinely takes hundreds to
thousands of hours depending. True, many of those hours will be for
paralegals unless/until the case gets to arbitration or court - but
even small firms now bill their paralegals at over $200/hr.
Nioclásán Caileán de Ghlostéir
2024-10-20 21:14:52 UTC
Permalink
On Sun, 20 Oct 2024, George Neuner wrote:
"A few weeks ago there was an article in Forbes magazine saying that
the /average/ billing by lawyers at moderately sized firms in the US
is now $1500..$1800/hr,"

Yikes!