Article (22 February 2010)

Liebert
Emerson Network Power

Balancing scalability and reliability in the critical power system: when does 'n + 1' become 'too many + 1'?

Summary

Uninterruptible Power Supply (UPS) protection can be delivered through a single-module approach or through redundant systems. Redundancy enables higher availability of critical systems and is typically achieved through either an N + 1 or 1 + 1 design. While 1 + 1 systems deliver a significant improvement in availability over N + 1 systems and are regularly specified for the most critical applications, N + 1 remains a viable and popular option for applications seeking to balance cost, reliability and scalability.

However, the benefits of N + 1 redundancy diminish with the number of modules that are added to the system. In fact, beyond four modules (3 + 1), the complexity of an N + 1 system begins to significantly compromise reliability. System costs and service requirements also increase with the number of UPS and battery modules added. Increased service requirements typically mean increased human intervention, increasing the risk of downtime.

Consequently, when N + 1 redundancy is used, UPS modules should be sized so that the total anticipated load can be carried by at most three modules. While UPS systems are technically scalable beyond this point, 3 + 1 should be considered the threshold at which scalability has such a negative impact on system availability, cost and performance that it is not recommended.

For example, in a data center or computer room that is expected to eventually support 600 kW of critical load, the switchgear and distribution panels are sized to this anticipated load. The UPS configuration that maximizes availability for this room is two 600 kW UPSs operating in parallel (1 + 1). If budget limitations do not allow this configuration, an N + 1 configuration should be considered to effectively balance cost, reliability and scalability. In this case, initial capacities could be met by two 200 kW UPS modules operating in a 1 + 1 configuration. As the load in the room increases, additional 200 kW UPS modules can be added until the capacity of the room is reached. At full capacity the UPS system will include four UPS modules operating in a 3 + 1 configuration. This achieves a balance between reliability and cost management in the context of unpredictable data center growth.
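The sizing rule in this example can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the intermediate load steps (150 kW and 350 kW) are assumed for demonstration, and only the 600 kW room, the 200 kW modules and the 3 + 1 limit come from the text.

```python
import math


def modules_for_load(load_kw: float, module_kw: float) -> int:
    """Return N, the number of modules needed to carry the load (no redundancy)."""
    return max(1, math.ceil(load_kw / module_kw))


def n_plus_1_label(load_kw: float, module_kw: float) -> str:
    """Describe the N + 1 configuration for a given load and module rating."""
    return f"{modules_for_load(load_kw, module_kw)} + 1"


def within_3_plus_1(design_load_kw: float, module_kw: float) -> bool:
    """True if the full design load fits in at most three modules plus one spare."""
    return modules_for_load(design_load_kw, module_kw) <= 3


if __name__ == "__main__":
    # Load steps are hypothetical; the 600 kW room and 200 kW modules are from the text.
    for load_kw in (150, 350, 600):
        print(f"{load_kw} kW load -> {n_plus_1_label(load_kw, 200)}")
    print("600 kW room, 200 kW modules within limit:", within_3_plus_1(600, 200))
    print("500 kW room, 40 kW modules within limit:", within_3_plus_1(500, 40))
```

Checking the module rating against the final design load, rather than the startup load, is what keeps the system from growing past 3 + 1 later.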

UPS Redundancy Options

All UPS equipment and switchgear, regardless of manufacturer, requires regular preventive maintenance during which the UPS system must be taken offline. Redundancy enables individual UPS modules to be taken offline for service without affecting the quality of power delivered to connected equipment. Redundancy also adds fault tolerance to the UPS system, enabling the system to survive a failure of any UPS module without affecting power quality.

There are a number of different approaches to UPS redundancy, which are summarized in Table 1. For more information on these options, see the Emerson Network Power white paper, The Role of Redundancy in Increasing Power System Availability, which describes each approach.

This paper provides a detailed analysis of the commonly used parallel redundant option (N + 1), focusing on the reliability, cost-effectiveness and service requirements of this architecture. The analysis is based on the availability of conditioned power and therefore bypass is not considered for any option.

In a parallel redundant (N + 1) system, multiple UPS modules are sized so that there are enough modules to power connected equipment (N), plus one additional module for redundancy (+ 1). During normal operation the load is shared equally across all modules. If a single module fails or needs to be taken offline for service, the system can continue to provide an adequate supply of conditioned power to connected systems.

N + 1 systems can be configured with either a scalable or modular architecture. The scalable architecture features UPS modules that each include a controller, rectifier, inverter and battery. In the modular architecture, the power modules comprise a rectifier and inverter. A single or redundant controller controls operation and the battery system is shared among modules. All modules in an N + 1 system share a common distribution system.

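| System configuration | Concurrent maintenance: module | Concurrent maintenance: system | Concurrent maintenance: distribution | Fault tolerance: module | Fault tolerance: distribution | Availability |
| Single module | No | No | No | High | No | High |
| Parallel redundant | Yes | Yes | No | Yes | No | Higher |
| Small isolated redundant | Yes | Yes | No | Yes | No | Higher |
| Large isolated redundant | Yes | Yes | No | Yes | No | Higher |
| Distributed redundant | Yes | Yes | Yes | Yes | Yes | Continuous |
| Selective redundant | Yes | Some | Selective | Some | Selective | Continuous |
| Power-Tie™ | Yes | Yes | Yes | Yes | Yes | Continuous |
| Hybrid AC-DC Power System | Yes | Yes | Yes | Yes | Yes | Continuous |

Table 1. Summary of system configurations.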

Figure 1. N + 1 'scalable' architecture (left) and N + 1 'modular' architecture (right). In a scalable N + 1 architecture each UPS has its own controller and battery systems. In a modular N + 1 architecture, power modules may share a controller and battery system. The common battery bank constitutes a single point of failure.

Scalability: IT Systems vs Power Systems

Network and data center managers expect scalability in the systems they use to manage and route data because future requirements are dynamic and difficult to project. A system that can easily “scale” to meet increased requirements enables an organization to invest in technology based on current needs without having to abandon or replace those systems when requirements change.

This is obviously desirable and is often promoted as a benefit of the N + 1 architecture. With this architecture, the argument goes, the UPS system can be sized to initial capacities and additional UPSs or power modules can be added later as capacities increase. This is true to a point. To find that point it is first necessary to understand the difference in how “scalability” applies to IT systems versus power systems.

For IT systems, scalability refers to the ability to add processors, memory, or controllers without swapping out the rack or enclosure. In the power infrastructure, it refers to the ability to add power and battery modules as the load increases. While similar conceptually, there are significant differences in how scalability applies to power systems and IT systems.

  • Failure Consequences: In IT systems, the failure of a single module may create a slight degradation in the performance of the system. A failure of a second module increases this degradation and may make the application unavailable to some users. In an N + 1 power system, a failure of one UPS module has no effect on system performance; however, a failure of two modules results in a complete shutdown of all systems that depend on the UPS. The N + 1 system will not support any equipment if two modules fail, regardless of whether the system has two or fifteen modules.
  • Open vs Closed Scalability: In IT hardware systems, standardization often enables additional memory or processor modules to be added from a manufacturer other than the original equipment manufacturer. In the power system, additional modules must be acquired from the original manufacturer and must be for the same model UPS.
  • Expected Lifespan: IT systems are typically upgraded every three to five years, while the power infrastructure must serve the entire life of the data center, often ten years or more. This makes the previous point even more significant. Will equipment manufacturers support modules for 10 years or more? Will expansion modules be available and at what cost? Will vendors guarantee backward compatibility for that period?
  • Software Cost Optimization: Software licensing costs are becoming an increasingly large component of IT budgets. IT managers need incrementally scalable computing hardware to optimize costs of software licenses that are charged on the basis of number of CPUs or MIPS. There is no such issue with the power system.
  • Expansion Capability: While it can be difficult to project future capacities, it is a necessary step in the design of a data center or computer room. Future capacities are projected and used to size the main input switchgear and main power distribution panels in data centers and server rooms. The UPS system cannot expand beyond the capacity of these components.

These factors, taken together, make scalability completely different for the infrastructure than for the IT systems the infrastructure supports. Certainly scalability is a desirable trait, but it is desirable only if it can be achieved without compromising availability.

Network Criticality and the Cost of Downtime

Organizations don't acquire a UPS system for its own sake; they acquire a UPS system because they understand the importance of ensuring power quality and availability to the hardware and software systems that are critical to business operations. The more important these systems are to the business, the more important the investment in support systems to overall business success.

As a result, this investment needs to be weighed against the value of these systems to the business. That value comes in two forms. The first, and most obvious, is avoidance of downtime and the associated costs, which include loss of productivity, loss of revenue and loss of customer confidence. These costs can be extremely high for businesses that are moderately to highly dependent on network systems, and they provide a basis for making sound business decisions in relation to the network and support system. However, studies indicate a surprising number of organizations do not accurately quantify the cost of network downtime. A recent survey by Forrester Research revealed that 67 percent of enterprises either did not know or could not provide an estimate of the costs of downtime to the business.

Not only is it important to analyze downtime costs, but these costs should also be considered relative to overall business costs. Network criticality is not necessarily a function of the size of the data center or computer room. It is a measure of cost of downtime versus expected profits/gains. A small computer room can be just as critical to the business it supports as a large data center.

The second value of the support system is enabling organizations to do more with technology. Maintaining 7x24 availability of network services and deploying new business applications such as IP telephony are only possible through an appropriate support system.

Availability and the Power System Infrastructure

The relationship between IT system availability and the power system infrastructure is illustrated in Figure 2. The desired business result is at the top of the pyramid, the application layer is in the middle and the critical support infrastructure is at the bottom. The pyramid is inverted because the investment is smallest at the bottom of the pyramid and largest at the top.

Critical support systems
Figure 2. Critical support systems represent a much smaller investment than the network application layer, but are the foundation that supports the application layer’s ability to achieve business objectives.

Interestingly, the relative costs for each layer of the pyramid tend to remain fairly constant regardless of the size of the facility. This is significant because data centers of every size are now being asked to support higher levels of availability, and a misconception persists that it is relatively more expensive for smaller data centers to achieve higher levels of availability than for larger centers. Typically the proportion of capital expenditures dedicated to critical power systems is 2 to 4 percent of the total capital expenditure in the data center, regardless of the size of the facility.

This also puts into perspective the cost of power system "future sizing": the practice of sizing the power system based on projected capacities rather than capacities required at startup. This may add up to 1 percent to total data center capital expenditures, which is definitely worth saving if possible, but only if this can be accomplished without compromising the availability of the middle of the pyramid. As will be seen in the following sections, a power system that does not adequately consider future growth will compromise overall availability and ultimately cost more than a system that is properly sized.

System availability is calculated by dividing the hours in a year the system is available by the total number of hours in a year. Because availability of the systems in the middle of the pyramid is dependent on the systems at the bottom, the total availability of network hardware and software is the product of the availability of those systems multiplied by the availability of the critical power system. This relationship is illustrated in Table 2. Critical power system availability must be 100 times greater than the availability of the systems being supported (that is, its unavailability must be about 100 times lower, or two additional nines) to keep from negatively impacting total system availability.
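The product relationship described above can be checked with a short sketch. The function name is hypothetical; the IT system availability of 0.9999 and the range of power system availabilities are the values used in Table 2.

```python
def total_availability(power_availability: float, it_availability: float) -> float:
    """Total availability when IT systems depend entirely on the power system."""
    return power_availability * it_availability


if __name__ == "__main__":
    it_availability = 0.9999  # value used in Table 2
    for power in (0.99, 0.999, 0.9999, 0.99999, 0.999999):
        total = total_availability(power, it_availability)
        print(f"power={power:<9} IT={it_availability} total={total:.6f}")
```

Only when the power system reaches six nines does the total stay essentially at the four nines of the IT systems themselves.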

Calculating Availability of the N + 1 Architecture

In terms of the power infrastructure, availability can be projected based on the system design and the reliability of system components. Reliability is measured in terms of Mean Time Between Failure (MTBF) and Mean Time to Repair (MTTR). Availability is also calculated as follows:

$$\text{Availability} = \frac{\text{MTBF} - \text{MTTR}}{\text{MTBF}}$$

The critical bus is available if at most one UPS module is down. The probability of this is equal to the probability that all modules are up, plus the probability that exactly one module is down. If R is the probability of single UPS plus battery availability, the availability of a 1 + 1 system will be

$$R^2 + 2R(1 - R)$$

And the availability of a 3 + 1 system will be

$$R^4 + 4R^3(1 - R)$$
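The two expressions above are instances of the same pattern for a system with N + 1 modules in total. A minimal sketch of the general case follows; the function name is hypothetical, and the calculation assumes, as the formulas in the text do, that modules fail independently and share the same availability R.

```python
def n_plus_1_availability(n: int, r: float) -> float:
    """Probability that at most one of the n + 1 modules is down.

    n: modules needed to carry the load; r: availability of one UPS plus battery.
    Assumes independent, identical modules.
    """
    total = n + 1
    all_up = r ** total
    exactly_one_down = total * r ** (total - 1) * (1 - r)
    return all_up + exactly_one_down


if __name__ == "__main__":
    r = 0.999  # the "3 nines" single-module figure assumed in the text
    for n in (1, 2, 3, 4, 13):
        print(f"{n} + 1: {n_plus_1_availability(n, r):.7f}")
```

With n = 1 and n = 3 the function reduces to the 1 + 1 and 3 + 1 expressions given above.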

Figure 3 shows how this translates into power system availability for N + 1 systems from 1 + 1 to 13 + 1.

| Critical power availability | IT system availability | Total availability |
| 0.99 | 0.9999 | 0.9899 |
| 0.999 | 0.9999 | 0.9989 |
| 0.9999 | 0.9999 | 0.9998 |
| 0.99999 | 0.9999 | 0.99989 |
| 0.999999 | 0.9999 | 0.9999 |

Table 2. Total availability of IT systems is a product of the availability of the network hardware and software multiplied by the availability of critical power systems.
Critical bus availability
Figure 3. Critical bus availability drops as more modules are added to an N + 1 system. Beyond 3 + 1, the drop in availability begins to represent a significant risk for the business.

Critical bus availability drops as the number of modules goes up; however, the curve stays fairly flat up to the 3 + 1 level. At 4 + 1 critical bus availability begins dropping precipitously. At 13 + 1, critical bus availability is four nines as opposed to six nines for a 1 + 1 system (assuming a single UPS plus battery system reliability of 3 nines).

This is particularly problematic because modules are added to an N + 1 system as the load increases. Typically an increase in load correlates with an increase in network criticality (i.e. cost of downtime increases). So, an N + 1 architecture is responding to an increase in network criticality by reducing critical power bus availability.

If the reliability of a single UPS and battery system is 0.9995, a 13 + 1 system will be down about 90 times more than a 1 + 1 system (see Figure 4).
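The "about 90 times" comparison can be reproduced with a short sketch. The function names are hypothetical, an 8,760-hour year is assumed, and module failures are treated as independent, as in the availability formulas earlier in this paper.

```python
HOURS_PER_YEAR = 8760  # assumption: non-leap year


def bus_unavailability(n: int, r: float) -> float:
    """Probability that two or more of the n + 1 modules are down at once."""
    total = n + 1
    available = r ** total + total * r ** (total - 1) * (1 - r)
    return 1 - available


def downtime_hours_per_year(n: int, r: float) -> float:
    """Expected hours per year in which the critical bus is unavailable."""
    return bus_unavailability(n, r) * HOURS_PER_YEAR


if __name__ == "__main__":
    r = 0.9995  # single UPS plus battery reliability used in the text
    small = downtime_hours_per_year(1, r)
    large = downtime_hours_per_year(13, r)
    print(f"1 + 1:  {small:.5f} h/year")
    print(f"13 + 1: {large:.5f} h/year")
    print(f"ratio:  {large / small:.1f}")
```

The ratio is close to the number of distinct module pairs in a 14-module system (91), since any two simultaneous failures take the bus down.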

Calculating the Cost of the N + 1 Architecture

Even with the reduced availability of 4 + 1 and higher modular systems, some organizations might be willing to risk a pay-as-you-grow approach to power system design if significant cost savings could be realized. However, it isn’t just availability that drops as the number of modules increases; cost-effectiveness goes down as well. This is because UPS costs go down on a per-kW basis as the size of the UPS increases. This is also true for battery systems: cost per ampere-hour goes down as the ampere-hour rating goes up. As a result, the cost of protecting a 500 kW room may well be less for a 1 + 1 system using two 500 kW UPS plus battery systems than if 14 units of 40 kW UPS, along with 14 battery modules, are used in a 13 + 1 architecture.

Translated into downtime per year
Figure 4. Differences in critical bus availability based on the number of UPS modules create significant differences in the amount of downtime that can be expected.

The Final Nail: Service Requirements

In studies of the causes of critical system downtime, human error typically represents from 15 to 20 percent of occurrences, behind only hardware and software failures. Unfortunately, the N + 1 architecture increases the likelihood of human error-related downtime: the more modules in a system, the higher the probability of human error due to increased human intervention for service.

This can also be analyzed statistically. If R is the probability of single UPS plus battery availability, service will be required whenever any unit goes down. For a 4 + 1 system, which has five units, the probability that at least one unit is down can be calculated as follows:

$$1 - R^5$$

Performing this calculation on various N + 1 configurations produces the graph in Figure 5. This graph shows that a 13 + 1 architecture is 6.6 times more likely to require service attention than a 1 + 1 system and 3.3 times more likely than a 3 + 1 system. Also, remember that Figure 5 does not factor in the increased probability of downtime resulting from other activities, such as the addition of new power or battery modules.
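The service calculation can be sketched as follows. The function name is hypothetical. The text does not state the value of R behind Figure 5, so R = 0.99 is an assumption here, chosen only because it yields ratios close to the quoted 6.6 and 3.3; for R much closer to 1, the 13 + 1 to 1 + 1 ratio approaches 7.

```python
def service_probability(total_modules: int, r: float) -> float:
    """Probability that at least one module is down and needs service.

    total_modules: all modules in the system (n + 1); r: single-module availability.
    Assumes independent, identical modules.
    """
    return 1 - r ** total_modules


if __name__ == "__main__":
    r = 0.99  # assumption: not stated in the text for Figure 5
    p_1_plus_1 = service_probability(2, r)
    p_3_plus_1 = service_probability(4, r)
    p_13_plus_1 = service_probability(14, r)
    print(f"13 + 1 vs 1 + 1: {p_13_plus_1 / p_1_plus_1:.2f}x")
    print(f"13 + 1 vs 3 + 1: {p_13_plus_1 / p_3_plus_1:.2f}x")
```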

Service requirement frequency
Figure 5. Service requirements also increase with the number of modules, increasing the possibility of downtime from human error.

Conclusion

Infrastructure costs are a relatively small percentage of total data center capital expenditures. But they have a significant impact on IT system utilization and availability and therefore on the business itself.

Organizations should seek ways to minimize infrastructure costs where possible, but only if this can be accomplished without compromising availability. Decisions that reduce critical system availability may end up reducing the return on investment in all IT systems and limiting the ability to achieve business objectives.

Availability of the N + 1 architecture becomes unacceptable at system configurations that utilize five or more modules. These configurations present greater risk of downtime, are less cost-effective and require more service attention than systems that use four or fewer modules. As a result, the recommended design standard is to use a 1 + 1 configuration whenever possible. If initial capital limitations dictate an N + 1 architecture, UPS modules should be sized so that no more than three modules are required to support the expected capacity of the room. If a 1 + 1 configuration is used to meet initial requirements, the system can grow to three times its initial capacity (a 3 + 1 configuration) without significantly compromising availability.

