Over the course of several blogs, I will talk about getting realistic failure rate data: where this failure data comes from, and how different methods of failure data analysis compare. If you understand this, you will begin to get a very good feel for what it takes to generate realistic failure data. This is a subject I find very important, and I hope you will find your time well spent reading this.
In Part 1, I wrote about the fundamental concepts of the functional safety standard for the process industries, IEC 61511, as well as the design phase of the safety lifecycle.
In Part 2, I explained two fundamental techniques that have been developed in the field of reliability engineering: failure rate estimation techniques and failure rate prediction techniques.
Part 3 was about field data collection standards and tools as well as prevalent prediction techniques like B10 and FMEDA approaches.
In this blog, I will cover FMEDA results and accuracy.
FMEDA analysis can show differences in designs, applications, and operating conditions. The picture shows two sets of failure rate data for two different service conditions: clean service and severe service.
The picture shows three different applications. Are you using a double port guided valve in clean service that is a full stroke close on trip, or is it a tight shutoff? The difference, of course, is leakage; the difference is what counts as a failure and what does not. Obviously, tight shutoff has a much higher dangerous undetected failure rate. In a different application, where you open on trip, leakage can be tolerated. If it's a significant leakage, it turns out to be a false trip. So you can see all these failure rates differ as a function of application and as a function of service level. Take a look at those numbers and you can see they are certainly different, though sometimes not by much. FMEDAs can show the differences in these numbers.
Sometimes, FMEDAs have to express the results in terms of functional failure modes. In this case, for example, pressure transmitter results have to be interpreted depending on whether the transmitter is used in a high trip application or in a low trip application. You might think, what's the difference? Well, what if it fails with the output saturated high, probably at 120% of range? Is that fail-safe or fail-dangerous?
It depends on whether you have a high trip or a low trip, and it depends on whether the logic solver is configured to reject out-of-range current levels, or at least annunciate out-of-range current levels as a failure rather than a trip. A high trip that fails high with no out-of-range conditioning is fail-safe. However, a low trip that fails high is fail-dangerous. You have to know the difference! FMEDAs can absolutely provide that level of detail.
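The classification logic just described can be sketched in a few lines. This is an illustrative sketch with assumed names, not anyone's actual tooling: a 4-20 mA transmitter failing with its output saturated high is classified by trip direction and by whether the logic solver treats out-of-range current as a fault.

```python
# Illustrative sketch of classifying a fail-high transmitter failure.
# Function and value names here are assumptions for illustration.

def classify_fail_high(trip_direction, out_of_range_detected):
    """Classify a fail-high (output saturated high) failure
    for a 4-20 mA pressure transmitter in a trip application."""
    if out_of_range_detected:
        # Logic solver annunciates out-of-range current as a fault,
        # not a trip: the failure is detected.
        return "detected"
    if trip_direction == "high":
        # The saturated-high output drives the high trip:
        # at worst a false trip, so fail-safe.
        return "safe"
    # A low trip never sees a saturated-high signal:
    # dangerous undetected.
    return "dangerous"

print(classify_fail_high("high", False))  # safe
print(classify_fail_high("low", False))   # dangerous
```

The same failure mode lands in a different FMEDA category purely as a function of the application, which is exactly why the results have to be expressed per functional failure mode.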
This is good, but there is a caveat: FMEDA accuracy is only as good as the component database. I remember hearing it all kinds of ways over the years: “garbage in, garbage out.”
Component databases have to be calibrated. One way to calibrate the data is to compare total failure rates collected from field failure data (estimation techniques) to the total failure rates generated by an FMEDA analysis. This is a technique we have been using at exida for 15 years, and after hundreds of these comparisons we're starting to get a very good idea of exactly how well the component databases work.
exida has accumulated over 150 billion unit operating hours of field data from the process industries. Some of this is relatively low quality data and yet there is useful information there. Some of it is extremely high quality data with every single failure traced to root cause and we use all of this information to help calibrate our data.
Are you aware that field device failures are not reported very often? So how can you rely on any failure rate data? Well, we're going to show you.
Given these sets of data, some data is very high quality, and I'll talk about one example in particular. One set of data that was vetted very carefully by exida is the Dow field failure studies. Engineers from exida visited the Dow site in Terneuzen, Netherlands on multiple occasions, where we verified the data collection procedures, the estimated population, and equipment aging. They were quite rigorous in their ability to report most failures; some engineers at that site claimed all failures were reported. They did account for their estimate of which failures were reported and which were not. After multiple visits of several days each, the exida engineers became quite confident that the absolute failure rate was a useful number for purposes of comparison and calibration. I'll show you how we use it.
We know that manufacturer field return data studies are not useful for absolute failure rates, for all the reasons I talked about earlier. At exida, therefore, we only use raw warranty return data. In other words, only failures during the warranty period are counted, and only the operational hours during the warranty period are counted. We assume six months from shipment to startup, and we assume return rates of 10%, 50%, or 70% based on the service policies of the manufacturer and the product cost. We got those numbers from a series of surveys that we have done with end users twice over the past few years.
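The arithmetic behind this warranty-return estimate can be sketched as follows. This is a minimal sketch under the stated assumptions (a six-month lag between shipment and startup, an assumed return rate); the function name and example numbers are hypothetical, not exida's actual model.

```python
# Hedged sketch of a lower-bound failure rate from raw warranty
# return data. Assumptions: only warranty-period failures and
# operating hours count, a six-month shipment-to-startup lag,
# and an assumed return rate (10%, 50%, or 70%).

HOURS_PER_YEAR = 8760

def warranty_lower_bound(units_shipped, warranty_years, returns,
                         return_rate, startup_lag_years=0.5):
    """Estimate a lower-bound total failure rate in failures/hour."""
    # Units only accumulate operating hours after startup.
    operating_years = max(warranty_years - startup_lag_years, 0.0)
    unit_hours = units_shipped * operating_years * HOURS_PER_YEAR
    # Scale reported returns up by the assumed return rate.
    estimated_failures = returns / return_rate
    return estimated_failures / unit_hours

# Hypothetical example: 10,000 units, 2-year warranty,
# 30 returns, assumed 50% return rate.
lam = warranty_lower_bound(10_000, 2.0, 30, 0.5)
print(lam)  # roughly 4.6e-7 failures/hour
```

Even with the return-rate adjustment, unreported failures mean this number is best treated as a lower bound on the true rate.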
Now, here's something you may not know about. Some manufacturers actually build power-on-hours counters into the product, so when a failure occurs, we have a precise measurement of time to failure, to the hour. When those data records are collected, they are averaged and inverted to obtain the total failure rate. You absolutely cannot get a more accurate number than that, because it doesn't depend on how many failures occurred; we simply use the time to failure from all the failures that were reported.
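The average-and-invert step is simple enough to show directly. This is a minimal sketch of the method as described above, with hypothetical failure records; it assumes a roughly constant failure rate, under which the inverted mean time to failure gives the total rate.

```python
# Sketch of the power-on-hours method: average the exact
# times to failure from the POH counters, then invert the
# mean to obtain the total failure rate (failures/hour).

def poh_failure_rate(times_to_failure_hours):
    """Return the total failure rate from exact POH times to failure."""
    mean_ttf = sum(times_to_failure_hours) / len(times_to_failure_hours)
    return 1.0 / mean_ttf

# Hypothetical POH counter readings at failure (hours):
records = [41_000, 57_500, 66_200, 83_300]
lam_poh = poh_failure_rate(records)
print(lam_poh)  # about 1.6e-5 failures/hour
```

Note that only failed units contribute: units still running in the field never enter the average, which is why this estimate forms an upper bound rather than an absolute rate.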
When we have a manufacturer's data, we calculate a lower-bound failure rate from the raw warranty return data; that is the bottom gold dot in the picture above. For some products we also have the power-on-hours counter, which gives a very precise estimate of failure rate, but it doesn't include the operating hours of the many units still operating successfully in the field. The power-on-hours method would be as precise as humanly possible if we waited until all units in a given population failed, but we don't, because by then the product is obsolete. So it gives us a very nice upper-bound number. We compare the FMEDA rate against these bounds, and quite frankly we expect the FMEDA predicted failure rate to be near the POH number but under it. Over and over again, this is exactly what we see. This is an FMEDA verification technique, combined with fault injection testing and a number of other methods. It works well.
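The comparison itself can be sketched as a simple bounds check. This is an illustrative sketch of the verification idea described above; the function name, tolerance, and example rates are assumptions for illustration only.

```python
# Hedged sketch of the FMEDA verification check: the FMEDA
# predicted rate should fall between the warranty lower bound
# and the power-on-hours upper bound, ideally near but under
# the POH number. All names and numbers are illustrative.

def fmeda_within_bounds(lambda_fmeda, lambda_lower, lambda_poh):
    """True if the FMEDA prediction lies inside the field-data bounds."""
    return lambda_lower < lambda_fmeda <= lambda_poh

# Hypothetical rates in failures/hour:
lambda_lower = 4.6e-7   # warranty-return lower bound
lambda_poh = 1.6e-5     # power-on-hours upper bound

print(fmeda_within_bounds(1.2e-5, lambda_lower, lambda_poh))  # True
# A prediction below the lower bound signals a problem
# in the FMEDA and triggers a detailed investigation:
print(fmeda_within_bounds(3.0e-7, lambda_lower, lambda_poh))  # False
```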
Sometimes we discover that the FMEDA results fall below or near the manufacturer's lower-bound failure rate. That means something is wrong in the FMEDA. This happens about twice a year, and we do a detailed investigation into the data. This technique has revealed previously unknown failure modes of components, identified by the root cause analysis done by a manufacturer. It has also identified previously unknown application stress variables that impact the failure rate in ways the science never knew about before. This is good stuff! This is valuable stuff! This is stuff that goes into the component database to make it more accurate and more realistic. The bottom line is that every time we make such a discovery, we can improve the component database and calibrate it more accurately.
In the next blog, we will compare failure rates from the OREDA and Dow field data.