Blog de MAGBerumen Coaching PNL y Consultoría Empresarial: MTBF and MTTR

Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR) are two important KPIs in plant maintenance.

MTBF = (Total up time) / (number of breakdowns)

MTTR = (Total down time) / (number of breakdowns)
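
The two formulas above can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the sample figures (700 hours running, 20 hours under repair, 4 breakdowns) are hypothetical and do not come from any real plant.

```python
def mtbf(total_up_time_hours: float, breakdowns: int) -> float:
    """MTBF = (total up time) / (number of breakdowns)."""
    if breakdowns <= 0:
        raise ValueError("at least one breakdown is needed to compute MTBF")
    return total_up_time_hours / breakdowns


def mttr(total_down_time_hours: float, breakdowns: int) -> float:
    """MTTR = (total down time) / (number of breakdowns)."""
    if breakdowns <= 0:
        raise ValueError("at least one breakdown is needed to compute MTTR")
    return total_down_time_hours / breakdowns


# Hypothetical month: 700 h of up time, 20 h of down time, 4 breakdowns.
print(mtbf(700, 4))  # 175.0
print(mttr(20, 4))   # 5.0
```

Both functions refuse a breakdown count of zero, because neither average is defined until at least one failure has been observed.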

Mean Time Between Failures & Mean Time To Repair - What do these mean?

“Mean Time” means, statistically, the average time.

“Mean Time Between Failures” (MTBF) is literally the average time elapsed from one failure to the next. Usually people think of it as the average time that something works until it fails and needs to be repaired (again).

“Mean Time To Repair” (MTTR) is the average time that it takes to repair something after a failure.

For something that cannot be repaired, the correct term is “Mean Time To Failure” (MTTF). Some define MTBF for repairable devices as the sum of MTTF and MTTR (MTBF = MTTF + MTTR). Under that definition, the mean time between failures runs from one failure to the next, so it includes the repair time as well as the running time. This distinction matters when the repair time (MTTR) is a significant fraction of MTTF.

Here is an example. A light bulb in a chandelier is not repairable, so MTTF is most appropriate. (The light bulb will be replaced). The MTTF might be 10,000 hours.

On the other hand, without oil changes, an automobile’s engine may fail after 150 hours of highway driving – that is the MTTF. Assuming 6 hours to remove and replace the engine (MTTR), MTBF is 150 + 6 = 156 hours.

Like automobiles, most manufacturing equipment will be repaired rather than replaced after a failure, so MTBF is the more appropriate measurement.

What is a Failure?

“Failure” can have multiple meanings. Let us briefly examine one device’s “failures”:

An Uninterruptible Power Supply (UPS) may have five functions under two conditions:

While the main power is available:

1. Allow power to flow from the main source to the machine being protected
2. Condition the power by limiting surges or brownouts
3. Store power in a battery, up to the battery’s full charge

When the main power is interrupted:

4. Supply continuous power to the machine being protected
5. Emit a signal to indicate that the main power is off

There is no question that the UPS has failed if it prevents main power from flowing to the machine being protected (function 1). Failures for functions 2, 3 or 5 may not be obvious, because the “protected” machine is still running on main power or on the battery supply. Even if noticed, these failures may not trigger immediate corrective measures because the “protected” machine is still running and it may be more important to keep it running than to repair or replace the UPS.

What is Availability?

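The “availability” of a device is, mathematically, MTBF / (MTBF + MTTR) for scheduled working time. Here MTBF means the average up time between failures (total up time divided by the number of breakdowns). If MTBF is instead defined as MTTF + MTTR, the equivalent ratio is MTTF / (MTTF + MTTR).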

The automobile in the earlier example is available for 150/156 = 96.2% of the time. The repair is unscheduled down time.

With an unscheduled half-hour oil change every 50 hours – when a dashboard indicator alerts the driver – availability would increase to 50/50.5 = 99%.

If oil changes were properly scheduled as a maintenance activity, they would fall outside scheduled working time, so availability would be 100%.
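
The availability figures for the automobile can be checked with a short Python sketch. The function name and its parameters are illustrative assumptions; the first argument is treated as the average running time between failures and the second as the average unscheduled repair time.

```python
def availability(mean_up_time_hours: float, mean_repair_hours: float) -> float:
    """Fraction of scheduled working time during which the device is usable."""
    return mean_up_time_hours / (mean_up_time_hours + mean_repair_hours)


# Engine fails after 150 h and takes 6 h to replace.
engine_swap = availability(150, 6)

# Unscheduled half-hour oil change every 50 h.
oil_change = availability(50, 0.5)

print(f"{engine_swap:.1%}")  # 96.2%
print(f"{oil_change:.1%}")   # 99.0%
```

Down time that is planned into the schedule is simply left out of the second argument, which is why properly scheduled oil changes do not reduce availability.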

Why are these important?

“Availability” is a key performance indicator in manufacturing; it is part of the “Overall Equipment Effectiveness” (OEE) metric.

A production schedule that includes down time for preventative maintenance can accurately predict total production. Schedules that ignore MTBF and MTTR are simply future disasters awaiting remediation.

How to calculate actual MTBF

Actual or historic MTBF is calculated using observations in the real world. (There is a separate discipline for equipment designers to predict MTBF, based on the components and anticipated workload).

Calculating actual MTBF requires a set of observations; each observation is:

Uptime_moment: the moment at which a machine began operating (initially or after a repair)
Downtime_moment: the moment at which a machine failed after operating since the previous Uptime_moment

So each Time Between Failures (TBF) is the difference between one Uptime_moment observation and the subsequent Downtime_moment.
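
As a rough sketch of this bookkeeping, each observation can be held as a pair of timestamps and each TBF computed as their difference. The timestamps below are invented for illustration, and averaging the TBF values is an assumption based on the “mean time” definition given earlier.

```python
from datetime import datetime
from statistics import mean

# Hypothetical observations: (Uptime_moment, Downtime_moment) pairs.
observations = [
    (datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 5, 8, 0)),
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 8, 14, 0)),
    (datetime(2024, 1, 8, 20, 0), datetime(2024, 1, 14, 8, 0)),
]


def tbf_hours(pairs):
    """Return each Time Between Failures, in hours, for (up, down) pairs."""
    return [(down - up).total_seconds() / 3600 for up, down in pairs]


tbfs = tbf_hours(observations)
print(tbfs)        # [96.0, 72.0, 132.0]
print(mean(tbfs))  # 100.0
```

The gaps between one Downtime_moment and the next Uptime_moment (6 hours each in this made-up data) are the repair times, so the same list of pairs could also feed an MTTR calculation.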

Three quantities are required: