Who cares about field failure data? Why are we even here?

IEC 61511 – Fundamental Concepts

One of the fundamental concepts in our functional safety standards is probabilistic, performance-based design. Many of you know that this was terribly controversial when it was first proposed. Even to this day, many people prefer a prescriptive, canned-design approach rather than allowing engineers to create new and innovative designs. The advantage of the performance-based approach is that not only can engineers actually do engineering, but we are allowed to optimize our designs to match the risk and the variables of our plant.

In the roughly 15 years since the standards were released, many people have taken full advantage of these attributes. The whole probabilistic, performance-based system design approach is becoming rather commonplace.

The other fundamental concept, of course, is the use of the safety lifecycle, a detailed engineering process intended to reduce design mistakes.

Within the detailed safety lifecycle, there is a series of steps where we perform the probabilistic performance analysis.


[Figure: Detailed Safety Lifecycle – Design Phase]


Many of you know how to do that, and exida has an entire web seminar on the three barriers to SIL verification. It is recorded and available on the exida website as well as on YouTube. So if you want to see the details of how to perform step number 11 on this drawing, please feel free to watch that web seminar.

The key point for now is very simple: in order to verify a design in terms of its failure probability, we must have realistic failure data. This, of course, was the key point of debate when this whole concept was being proposed, because there were many who simply said, "There is no such thing as realistic failure rate data. We don't have the information we need to implement this standard."

Some people say there is no such thing as realistic failure data.  Why?

One Sunday morning, when I was looking at the paper, I gained some insight from cartoonist Scott Adams. People kind of have a bad attitude toward failure data, and I have to admit that when I've studied some of this failure data, I've been tempted to think the same way.

According to Scott Adams, the way to create a set of failure data is to start by typing random numbers into a spreadsheet, and then you're done. I know some data seems that way, but it isn't if you start digging in and studying.

[Cartoon: Scott Adams]

Getting Failure Data

There are many sources of failure rate data. Industry databases are an excellent source and a valuable way to get data. Manufacturers' field return data studies have some value, although we have to be extremely careful about the assumptions used. In my opinion, they are most valuable for gathering root-cause-of-failure analysis reports, which manufacturers typically produce.

There is the B10 approach used for mechanical and electromechanical products, primarily out of the machine safety arena, but you have to remember that machines typically move fast and the primary failure mechanisms are wear-out. So we concluded, of course, that you have to be very careful about manufacturers' field return studies, and that B10 cycle testing has no place in the process industries; using it there is extremely dangerous.
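
To see why, look at how the machine sector turns B10 data into a failure rate. Here is a minimal sketch, assuming the ISO 13849-style conversion MTTFd = B10d / (0.1 × n_op); the B10d value and the duty cycles below are hypothetical:

```python
# Minimal sketch of the machine-safety B10 conversion, assuming the
# ISO 13849-style relation MTTFd = B10d / (0.1 * n_op).
# The B10d value and the cycle counts below are hypothetical.

B10D_CYCLES = 400_000  # cycles at which 10% of units have failed dangerously

def mttfd_years(cycles_per_year: float) -> float:
    """Mean time to dangerous failure (years) for a given annual demand rate."""
    return B10D_CYCLES / (0.1 * cycles_per_year)

# A machine valve cycling once a minute, 16 h/day, 240 days/year:
machine_cycles = 60 * 16 * 240          # 230,400 cycles/year
# A process safety valve that only moves at an annual proof test:
process_cycles = 1

print(f"Machine duty: MTTFd ~ {mttfd_years(machine_cycles):.0f} years")
print(f"Process duty: MTTFd ~ {mttfd_years(process_cycles):,.0f} years")
```

With almost no cycling, the formula extrapolates to an absurdly long MTTFd, because cycle testing only captures wear-out. A device that rarely moves fails from mechanisms such as stiction and binding, which a cycle test never sees, and that is exactly why carrying B10 numbers into a low-demand process application is dangerous.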

End user field failure data studies are probably the best source of potential data as we move forward, but unfortunately many of the data collection systems behind them are quite weak, which we will talk about in great detail. And of course the FMEDA technique developed by exida is a predictive failure rate approach. It can produce very good results if it is properly calibrated.

Industry Databases

I'm a fan of OREDA. OREDA is a consortium of offshore companies in the North Sea. It is operated by DNV out of Norway, and data analysis is done by SINTEF.

It provides very useful data on process equipment in very generic categories like final element assembly.  It’s updated every few years. I keep up to date on it. I study it.  A group from exida even went to Norway and reviewed the data collection and analysis techniques with some engineers from SINTEF.

The problem is, it's a bit hard to use in safety instrumented function design because the data is significantly aggregated. That sounds like a mighty fancy word, but what it means is that you can't get detailed failure data for a particular type of solenoid valve, a quick exhaust valve, a rack and pinion actuator, a piston actuator, a butterfly valve, or a ball valve. You can't get specific data for a particular brand of PLC. You get very generic categories.

How do you know if your design matches the category?

Well, you just have to do your best, and, of course, data is not available for many years after a new product is designed. You would be quite shocked how often redesign happens, even when you don't think it does. Manufacturers are redesigning all the time, sometimes because they must: parts go obsolete.

End User Field Failure Data Studies

End user field failure data studies, on the other hand, represent the highest potential for good field failure rate data, but we must overcome a number of problems that we have found to be quite chronic, including insufficient information in the failure reports, differing definitions of what counts as a failure, and differing data analysis methods.

These problems can be solved, as I will discuss in more detail below.

Field Failure Data Collection Standards

Why are there all these problems? Are there field failure data collection standards?

Yes. IEC 61508 refers to ISO 14224:2006 and IEC 60300-3-2, and there is also NAMUR NE 93.

They all set up a framework for data collection, but what we have discovered is that there’s not nearly enough detail in there to do a proper job. Perhaps most importantly, many people aren’t even following those standards. That’s okay. We can solve the problem.

Data Analysis

We must be very careful about the input data we have from a set of field failure reports. I once talked to an engineer who was quite proud of his data analysis. He basically said, "I could hire a statistics expert to completely analyze this data!" And he did. They had a lot of very interesting numbers. I asked, "How do you know your numbers are right? All the statistical expertise in the world cannot identify bad data or, perhaps more commonly, inconsistently classified data."

So if we want to set up a good field failure data collection system, how do we do it? What kind of a system should we have? 

A group of us at exida have collected over 150 billion unit operating hours of process equipment field failure data from over 100 locations. Some of it comes directly from end users. Some of it comes from manufacturers and was used to identify failure modes and root causes. Some of it comes from SILStat users who have elected to allow their data to be sent to us.

We've studied many different systems, seen many different types of data collection, and learned a number of lessons. We've taken all this information and integrated it into a set of techniques we call predictive analytics.

Predictive Analytic Analysis

What is predictive analytics?

We're looking for a way of validating a failure rate number, or at least checking it for reasonableness. So we start with our knowledge of design strength from actual products. Everyone knows that a failure rate is a function of stress versus strength. I've always been a big believer in that general philosophy. I learned it at the Eindhoven University of Technology from Dr. Rombecker, and it made sense from the day I learned it.

We can evaluate the design strength of a range of different products and at exida we do that through our FMEDA technique. It is an analysis of design strength.  By looking at a large collection of products, we can tell how the design strength varies from product to product.

The concept is this: given a consistent environmental stress (an "operating profile," it's called), we can predict how the failure rate will vary as a function of the variation in design strength across products. We're very lucky to have data of this extent to analyze.
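
As a rough illustration of the stress-versus-strength idea (all distributions and numbers here are hypothetical, not from our actual models): a unit fails when the stress it experiences exceeds its design strength, so under the same operating profile a weaker or more variable design shows a higher failure probability.

```python
import random

# Toy stress-vs-strength model with hypothetical numbers:
# a unit fails when the applied stress exceeds its design strength.
random.seed(1)

def failure_probability(strength_mean: float, strength_sd: float,
                        stress_mean: float = 50.0, stress_sd: float = 10.0,
                        trials: int = 100_000) -> float:
    """Monte Carlo estimate of P(stress > strength)."""
    failures = sum(
        random.gauss(stress_mean, stress_sd) > random.gauss(strength_mean, strength_sd)
        for _ in range(trials)
    )
    return failures / trials

# Same operating profile (stress), two different product designs:
print(failure_probability(strength_mean=90, strength_sd=10))   # stronger design
print(failure_probability(strength_mean=70, strength_sd=15))   # weaker design
```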

Outlier Limit Graph

Here's a graph showing a study of the design strength of 35 different pressure transmitter designs, combined with field failure data from OREDA and from Dow. We show outlier limits and the expected range. From this chart we can establish a set of benchmark numbers for a pressure transmitter.

Based on the analysis of this data set, we can predict how well a given field failure data collection process is working: whether information is being collected properly, or whether there are serious problems in the data collection process.

Here's what we do: we take a set of failure data collected in the field and perform statistical analysis on it, using one of several different techniques, to produce an estimated failure rate.

We take the predictive analytic model for each device type, including the upper bound, the lower bound, and the statistical averages and deviations, and we compare the estimated lambda with the predicted lambda range.

Then we ask a number of simple questions:

  • Are we out of range?
  • How far are we out of range?
  • What is the probability that the data collection process is inconsistent with the norm?
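
Here is a minimal sketch of that comparison. It estimates lambda as failures divided by accumulated unit hours, puts a chi-squared confidence interval around it (one common choice among several), and checks the result against a predicted range; every number, including the bounds, is hypothetical:

```python
from scipy.stats import chi2

# Hypothetical field data for one device type
failures = 4                # observed failures
unit_hours = 2.0e8          # accumulated unit operating hours

# Point estimate of lambda (failures per hour)
lam_hat = failures / unit_hours

# Two-sided 90% chi-squared confidence interval on a Poisson rate
lam_lo = chi2.ppf(0.05, 2 * failures) / (2 * unit_hours)
lam_hi = chi2.ppf(0.95, 2 * (failures + 1)) / (2 * unit_hours)

# Hypothetical predictive analytic bounds for this device type
pa_lower, pa_upper = 5.0e-8, 5.0e-7

print(f"lambda = {lam_hat:.2e}/h (90% CI {lam_lo:.2e} .. {lam_hi:.2e})")
if lam_hi < pa_lower or lam_lo > pa_upper:
    print("Out of range: review the data collection process.")
else:
    print("Consistent with the predicted range.")
```

In this made-up example the estimate comes out suspiciously low, even at the top of its confidence interval, which is exactly the kind of result that should trigger a review of how failures are being counted.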

We study the data collection process and the failure categories. Issues can be resolved for the purpose of creating consistent and realistic failure rate estimates. That's not a bad approach. When all that is done, we often find problems in data collection. We fix them, and customers get much more useful, relevant, and realistic failure rate data.

Field Failure Study Experience

It's been quite an interesting experience. An extensive set of test results from a test shop indicated that a manufacturer had an exceptionally low failure rate when compared to our predictive analytic bounds, low by a margin that was simply not credible. So we visited the test shop to learn exactly how the instruments were evaluated and classified as failed or not.

What we discovered was that before each instrument was tested to see whether it had failed, it was cleaned up. Cleaning included, you won't believe this, disassembly and replacement of seals and O-rings.

I was thinking to myself, "I'm shocked that any units failed after that refurbishment activity."

The key lesson learned is that the test, repair, and data collection process must be understood in order to correctly analyze field data. The predictive analytic bounds flagged the need for an extensive review of the data collection process.

Key point: as-found conditions must be recorded. I have to say, I'm a little surprised how often this does not happen, but it is an essential component of any good field data collection system.

In another incident, we discovered that data from an on-site repair shop in an end-user facility was entered into a maintenance management system (MMS), where the equipment failure rates were automatically analyzed.


[Image: exSILentia]


The MMS report showed that the failure rates of some instruments being used in a safety function were much lower than those used for SIL verification in the exSILentia tool. Therefore, they concluded they could extend proof test intervals without sacrificing their necessary risk reduction.
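
The logic behind that conclusion is easy to see from the simplified single-channel approximation PFDavg ≈ λDU × TI / 2: if the dangerous undetected failure rate really were half the value assumed at design time, the proof test interval could in principle be doubled while keeping PFDavg, and therefore the claimed risk reduction, the same. A quick sketch with hypothetical numbers:

```python
# Simplified single-channel approximation: PFDavg ~ lambda_DU * TI / 2.
# All numbers here are hypothetical.

HOURS_PER_YEAR = 8760

def pfd_avg(lambda_du: float, ti_years: float) -> float:
    """Average probability of failure on demand for proof test interval TI."""
    return lambda_du * ti_years * HOURS_PER_YEAR / 2

design_lambda = 4.0e-7    # lambda_DU assumed during SIL verification
claimed_lambda = 2.0e-7   # much lower rate reported by the MMS

print(pfd_avg(design_lambda, ti_years=1))   # ~1.75e-3 at a 1-year interval
print(pfd_avg(claimed_lambda, ti_years=2))  # same ~1.75e-3 at a 2-year interval
```

That is exactly why a failure rate that is wrong on the low side is so dangerous: it appears to justify relaxing the proof test.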

However, before proceeding, a wise instrument engineer visited the repair shop. He discovered that failure reports were entered into the MMS only when a device was sent out for repair.

In other words, the only units that were considered failures were those that had to be sent out, because that's what generated a repair order. Units repaired internally were never recorded as failures because they didn't need a purchase order. Oh my goodness. It's no wonder the failure rate was extremely low. The proof test interval was not extended. The test, repair, and data collection process must be understood, and as-found conditions must be recorded.

I have a lot of stories, and you don't want to hear them all, but I've got one more.

A field failure study of a valve showed a very low failure rate when compared to the exida PA benchmark. The conclusion was that this valve would be excellent for safety applications. But we discovered that the valve was designed and manufactured for control applications, and in the analysis of field failures, nobody had ever recorded whether a device was being used in a control application or in a safety application.

Many of you know and understand that any mechanical component used in an application where it is frequently moving has very different failure rates and failure modes than a valve used in an application where it is typically not moving: binding, stiction, cold welding… all potentially dangerous failure modes. It's no wonder this low failure rate was outside the reasonability limits of the PA benchmark.

Lesson learned:  application conditions must be recorded and listed in the analysis report.

Field Failure Studies

We see that field failure studies exist and that they can be valuable, but the outputs have to be checked for reasonability. The data collection process must be checked, and when that is done, the data is what I call validated field failure data. We use it to calibrate FMEDA results in a nice closed-loop system.
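
As a toy illustration only, not the actual exida calibration procedure, the closed loop can be thought of as tracking the ratio of a validated field rate to the FMEDA prediction:

```python
# Toy illustration of closed-loop calibration (hypothetical numbers,
# not the actual exida procedure): compare the validated field failure
# rate against the FMEDA prediction and track the ratio.

fmeda_lambda = 4.0e-7   # predicted failure rate from an FMEDA
field_lambda = 3.2e-7   # validated field failure rate for the same device

calibration_factor = field_lambda / fmeda_lambda
print(f"calibration factor: {calibration_factor:.2f}")

# A ratio near 1.0 means predictions match field reality; a persistent
# bias across many products would prompt adjusting the component data
# that feeds the FMEDA.
```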

Predictive Analytic Analysis


[Figure: Field Failure Data]


We also use predictive analytics to distinguish product-specific failures from site-specific failures. What I mean by that is that several datasets, once the data collection process was validated, have indicated a 2X difference in failure rate from site to site for the same product model number.

Did you hear me? Okay, let's say I've got a Brand XYZ pressure transmitter, and it has a failure rate of 3×10⁻⁷ at one site and 6×10⁻⁷ at another site.

Is that because they're collecting the data differently at the two sites? Perhaps, but once it was determined that was not the case, we went into root cause analysis to separate product-specific failures from site-specific ones. Was it an internal product failure, or was it caused by a weakness in the site's practices? Perhaps a product selection process that picked equipment with the wrong materials… wrong maintenance… wrong testing… or wrong repair. This is a real benefit of data collection, because you can actually identify site-specific causes and fix them.
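
Before chasing root causes, it is also worth checking that a site-to-site difference like that is more than statistical noise. A minimal sketch with hypothetical counts, using the standard conditional test for comparing two Poisson rates (under equal rates, the split of failures between sites is binomial):

```python
from scipy.stats import binomtest

# Hypothetical data: the same transmitter model at two sites
fail_a, hours_a = 6, 2.0e7     # ~3e-7 per hour
fail_b, hours_b = 12, 2.0e7    # ~6e-7 per hour

# Under equal rates, failures at site A are a binomial share of the total,
# with p equal to site A's share of the operating hours.
result = binomtest(fail_a, n=fail_a + fail_b, p=hours_a / (hours_a + hours_b))

print(f"rate A = {fail_a / hours_a:.1e}/h, rate B = {fail_b / hours_b:.1e}/h")
print(f"p-value for equal rates: {result.pvalue:.2f}")
```

With only 18 total failures, the 2X ratio could still be chance (the p-value here is about 0.24); with the operating hours behind a validated multi-site dataset, a persistent 2X difference is worth a root cause investigation.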

What am I saying?  You can actually reduce the failure rate by doing the root cause analysis. It’s especially valuable once you have validated the data collection process.

Failure Data Analysis

I'm absolutely sure that a good data collection system can be implemented without major expense. It has been done at many sites, and quite a few of them are now feeding back high-quality data through a validated data collection system.

We use the predictive analytic statistics as a benchmark to tell us how deep to dig and how good a data collection process might be. Any time there's a discrepancy, you must explain why, and the more you can explain, the more benefit you get from the data collection.


Tagged as:     SILStat     Safety Lifecycle     OREDA     IEC 61508     FMEDA     Failure Rates     Dr. William Goble  
