A well-run and well-maintained network should have 99.99 percent availability. There
should be less than 1 percent packet loss and packet turnaround of 80 milliseconds or
less. To achieve this level of availability and performance the network must be moni-
tored. Any time business systems extend to the Internet or to wide area networks
(WANs), internal network monitoring must be supplemented with outside-in monitoring
that checks the availability of the network and business systems.
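To make these targets concrete, the following Python sketch converts the 99.99 percent goal into an annual downtime budget and flags probe samples that violate the packet-loss and turnaround thresholds. The thresholds come from the targets above; the probe values and the evaluation itself are purely illustrative and not part of any particular monitoring product.

```python
# Illustrative only: evaluate hypothetical monitoring samples against the
# availability, packet-loss, and turnaround targets described above.

AVAILABILITY_TARGET = 0.9999   # 99.99 percent availability
MAX_PACKET_LOSS = 0.01         # less than 1 percent packet loss
MAX_TURNAROUND_MS = 80         # packet turnaround of 80 ms or less

# 99.99 percent availability leaves roughly 52.6 minutes of downtime per year.
minutes_per_year = 365 * 24 * 60
downtime_budget = minutes_per_year * (1 - AVAILABILITY_TARGET)
print(f"Annual downtime budget: {downtime_budget:.1f} minutes")

# Hypothetical probe samples: (packet loss ratio, round-trip time in milliseconds).
samples = [(0.000, 35), (0.002, 78), (0.015, 92), (0.000, 41)]

for loss, turnaround in samples:
    within_targets = loss < MAX_PACKET_LOSS and turnaround <= MAX_TURNAROUND_MS
    status = "OK" if within_targets else "ALERT"
    print(f"{status}: loss={loss:.1%}, turnaround={turnaround} ms")
```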
Resources, training, and documentation are essential to ensuring that you can manage
and maintain mission-critical systems. Many organizations cripple the operations team
by staffing minimally. Minimally manned teams will have marginal response times and
nominal effectiveness. The organization must take the following steps:
Staff for success.
Conduct training before deploying new technologies.
Keep the training up-to-date with what’s deployed.
Document essential operations procedures.
Every change to hardware, software, and the network must be planned and executed
deliberately. To do this, you must have established change control procedures and
well-documented execution plans. Change control procedures should be designed to
ensure that everyone knows what changes have been made. Execution plans should
be designed to ensure that everyone knows the exact steps that were or should be per-
formed to make a change.
Change logs are a key part of change control. Each piece of physical hardware deployed
in the operational environment should have a change log. The change log should be
stored in a text document or spreadsheet that is readily accessible to support personnel.
The change log should show the following information:
Who changed the hardware
What change was made
When the change was made
Why the change was made
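A change log of this kind can be as simple as a spreadsheet that people and scripts both append to. The sketch below is a minimal illustration in Python, assuming a hypothetical per-server CSV file; it records the who, what, when, and why fields listed above.

```python
# Illustrative only: append a who/what/when/why entry to a per-server
# change log kept as a CSV file that support staff can open in a spreadsheet.
import csv
from datetime import datetime
from pathlib import Path

LOG_FILE = Path("change_log_server01.csv")  # hypothetical log for one server

def record_change(who: str, what: str, why: str) -> None:
    is_new_file = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new_file:
            writer.writerow(["who", "what", "when", "why"])
        writer.writerow([who, what, datetime.now().isoformat(timespec="minutes"), why])

# Hypothetical entry.
record_change("jsmith", "Replaced failed power supply", "Redundant PSU reported a fault")
```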
Establish and Follow Change Control Procedures
Change control procedures must take into account the need for both planned changes
and emergency changes. All team members involved in a planned change should meet
regularly and follow a specific implementation schedule. No one should make changes
that aren’t discussed with the entire implementation team.
You should have well-defined backup and recovery plans. The backup plan should specifically state the following information:
When full, incremental, differential, and log backups are used
How often and at what time backups are performed
Whether the backups must be conducted online or offline
The amount of data being backed up as well as how critical the data is
The tools used to perform the backups
The maximum time allowed for backup and restore
How backup media is labeled, recorded, and rotated
Backups should be monitored daily to ensure that they are running correctly and that
the media is good. Any problems with backups should be corrected immediately. Mul-
tiple media sets should be used for backups, and these media sets should be rotated on
a specific schedule. With a four-set rotation, there is one set for daily, weekly, monthly,
and quarterly backups. By rotating one media set offsite, support staff can help ensure
that the organization is protected in case of a disaster.
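One way to picture the four-set rotation is as a mapping from the backup date to the media set used that day. The following Python sketch is only an illustration of that idea, assuming daily backups and a simple calendar rule; real rotation schedules and set names vary by organization.

```python
# Illustrative only: choose a media set for a four-set rotation
# (daily, weekly, monthly, quarterly), assuming backups run every day.
from datetime import date

def media_set_for(day: date) -> str:
    if day.day == 1 and day.month in (1, 4, 7, 10):
        return "quarterly"   # first day of each quarter
    if day.day == 1:
        return "monthly"     # first day of the remaining months
    if day.weekday() == 6:
        return "weekly"      # Sundays
    return "daily"           # every other day

for d in (date(2008, 7, 1), date(2008, 7, 6), date(2008, 7, 8), date(2008, 8, 1)):
    print(d, "->", media_set_for(d))
```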
The recovery plan should provide detailed step-by-step procedures for recovering the
system under various conditions, such as procedures for recovering from hard disk
drive failure or troubleshooting problems with connectivity to the back-end database.
The recovery plan should also include system design and architecture documentation
that details the configuration of physical hardware, application logic components, and back-end data. Along with this information, support staff should provide a media set containing all software, drivers, and operating system files needed to recover the system.
Note
One thing administrators often forget about is spare parts. Spare parts for key compo-
nents, such as processors, drives, and memory, should also be maintained as part of the
recovery plan.
You should practice restoring critical business systems using the recovery plan. Practice
shouldn’t be conducted on the production servers. Instead, the team should practice on
test equipment with a configuration similar to the real production servers. Practicing
once a quarter or semiannually is highly recommended.
You should have well-defined problem escalation procedures that document how to
handle problems and emergency changes that might be needed. Many organizations
use a three-tiered help desk structure for handling problems:
Level 1 support staff forms the front line for handling basic problems. They typi-
cally have hands-on access to the hardware, software, and network components
they manage. Their main job is to clarify and prioritize a problem. If the problem
has occurred before and there is a documented resolution procedure, they can
resolve the problem without escalation. If the problem is new or not recognized,
they must understand how, when, and to whom to escalate it.
Level 2 support staff includes more specialized personnel that can diagnose a
particular type of problem and work with others to resolve a problem, such as
system administrators and network engineers. They usually have remote access to
the hardware, software, and network components they manage. This allows them
to troubleshoot problems remotely and to send out technicians after they’ve pin-
pointed the problem.
Level 3 support staff includes highly technical personnel who are subject matter
experts, team leaders, or team supervisors. The level 3 team can include support
personnel from vendors as well as representatives from the user community.
Together, they form the emergency response or crisis resolution team that is
responsible for resolving crisis situations and planning emergency changes.
All crisis situations and emergencies should be responded to decisively and resolved
methodically. A single person on the emergency response team should be responsible
for coordinating all changes and executing the recovery plan. This same person should
be responsible for writing an after-action report that details the emergency response
and resolution process used. The after-action report should analyze how the emergency
was resolved and what the root cause of the problem was.
In addition, you should establish procedures for auditing system usage and detecting
intrusion. In Windows Server 2008, auditing policies are used to track the successful or
failed execution of the following activities:
Account logon events
Tracks events related to user logon and logoff
Account management
Tracks those tasks involved with handling user accounts,
such as creating or deleting accounts and resetting passwords
Directory service access
Tracks access to Active Directory Domain Services (AD DS)
Object access
Tracks system resource usage for files, directories, and objects
Policy change
Tracks changes to user rights, auditing, and trust relationships
Privilege use
Tracks the use of user rights and privileges
Process tracking
Tracks system processes and resource usage
System events
Tracks system startup, shutdown, restart, and actions that affect
system security or the security log
You should have an incident response plan that includes priority escalation of sus-
pected intrusion to senior team members and provides step-by-step details on how to
handle the intrusion. The incident response team should gather information from all
network systems that might be affected. The information should include event logs,
application logs, database logs, and any other pertinent files and data. The incident
response team should take immediate action to lock out accounts, change passwords,
and physically disconnect the system if necessary. All team members participating in
the response should write a postmortem that details the following information:
What date and time they were notified and what immediate actions they took
Who they notified and what the response was from the notified individual
What their assessment of the issue is and the actions necessary to resolve and
prevent similar incidents
The team leader should write an executive summary of the incident and forward this to
senior management.
The following checklist summarizes the recommendations for operational support of
high-availability systems:
Monitor hardware, software, and network components 24/7.
Ensure that monitoring doesn’t interfere with normal systems operations.
Gather only the data required for meaningful analysis.
Establish procedures that let personnel know what to look for in the data.
Use outside-in monitoring any time systems are externally accessible.
Provide adequate resources, training, and documentation.
Establish change control procedures that include change logs.
Establish execution plans that detail the change implementation.
Create a solid backup plan that includes onsite and offsite tape rotation.
Monitor backups and test backup media.
Create a recovery plan for all critical systems.
Test the recovery plan on a routine basis.
Document how to handle problems and make emergency changes.
Use a three-tier support structure to coordinate problem escalation.
Form an emergency response or crisis resolution team.
Write after-action reports that detail the process used.
Establish procedures for auditing system usage and detecting intrusion.
Create an intrusion response plan with priority escalation.
Take immediate action to handle suspected or actual intrusion.
Write postmortem reports detailing team reactions to the intrusion.
Planning for Deploying Highly Available Servers
You should always create a plan before deploying a business system. The plan should
show everything that must be done before the system is transitioned into the produc-
tion environment. After a system is in the production environment, the system is
deemed operational and should be handled as outlined in “Planning for Day-to-Day
Operations” on page 1316.
The deployment plan should include the following items:
Checklists
Contact lists
Test plans
Deployment schedules
Checklists are a key part of the deployment plan. The purpose of a checklist is to
ensure that the entire deployment team understands the steps they need to perform.
Checklists should list the tasks that must be performed and designate individuals to
handle the tasks during each phase of the deployment—from planning to testing to
installation. Prior to executing a checklist, the deployment team should meet to ensure
that all items are covered and that the necessary interactions among team members are
clearly understood. After deployment, the preliminary checklists should become a part
of the system documentation and new checklists should be created any time the system
is updated.
The deployment plan should include a contact list. The contact list should provide
the name, role, telephone number, and e-mail address of all team members, vendors,
and solution provider representatives. Alternative numbers for cell phones and pagers
should be provided as well.
The deployment plan should include a test plan. An ideal test plan has several phases.
In Phase I, the deployment team builds the business system and support structures in a
test lab. Building the system means accomplishing the following tasks:
Creating a test network on which to run the system
Putting together the hardware and storage components
Installing the operating system and application software
Adjusting basic system settings to suit the test environment
Configuring clustering or network load balancing as appropriate
The deployment team can conduct any necessary testing and troubleshooting in the
isolated lab environment. The entire system should undergo burn-in testing to guard
against faulty components. If a component is flawed, it usually fails in the first few days
of operation. Testing doesn’t stop with burn-in. Web and application servers should be
stress tested. Database servers should be load tested. The results of the stress and load
tests should be analyzed to ensure that the system meets the performance requirements
and expectations of the customer. Adjustments to the configuration should be made to
improve performance and optimize for the expected load.
In Phase II, the deployment team tests the business system and support equipment
in the deployment location. They conduct similar tests as before but in the real-world
environment. Again, the results of these tests should be analyzed to ensure that the sys-
tem meets the performance requirements and expectations of the customer. Afterward,
adjustments should be made to improve performance and optimize as necessary. The
team can then deploy the business system.
After deployment, the team should perform limited, nonintrusive testing to ensure that
the system is operating normally. After Phase III testing is completed, the team can use
the operational plans for monitoring and maintenance.
The following checklist summarizes the recommendations for predeployment planning
of mission-critical systems:
Create a plan that covers the entire testing to operations cycle.
Use checklists to ensure that the deployment team understands the procedures.
Provide a contact list for the team, vendors, and solution providers.
Conduct burn-in testing in the lab.
Conduct stress and load testing in the lab.
Use the test data to optimize and adjust the configuration.
Provide follow-on testing in the deployment location.
Follow a specific deployment schedule.
Use operational plans once final tests are completed.
CHAPTER 39
Preparing and Deploying Server Clusters

Clustering technologies allow servers to be connected into multiple-server units
called server clusters. Each computer connected in a server cluster is referred to as a
node. Nodes work together, acting as a single unit, to provide high availability for busi-
ness applications and other critical resources, such as Microsoft Internet Information
Services (IIS), Microsoft SQL Server, or Microsoft Exchange Server. Clustering allows
administrators to manage the cluster nodes as a single system rather than as individual
systems. Clustering allows users to access cluster resources as a single system as well.
In most cases, the user doesn’t even know the resources are clustered.
The main cluster technologies that Windows Server 2008 supports are:
Failover clustering
Failover clustering provides improved availability for appli-
cations and services that require high availability, scalability, and reliability. By
using server clustering, organizations can make applications and data available
on multiple servers linked together in a cluster configuration. The clustered serv-
ers (called nodes) are connected by physical cables and by software. If one of
the nodes fails, another node begins to provide service. This process, known as
failover, ensures that users experience a minimum of disruptions in service. Back-
end applications and services, such as those provided by database servers, are
ideal candidates for failover clustering.
Network Load Balancing
Network Load Balancing (NLB) provides failover sup-
port for Internet Protocol (IP)–based applications and services that require high
scalability and availability. By using Network Load Balancing, organizations can
build groups of clustered computers to support load balancing of Transmission
Control Protocol (TCP), User Datagram Protocol (UDP), and Generic Routing
Encapsulation (GRE) traffic requests. Front-end Web servers are ideal candidates
for Network Load Balancing.
These cluster technologies are discussed in this chapter so that you can plan for and
implement your organization’s high-availability needs.
Introducing Server Clustering
A server cluster is a group of two or more servers functioning together to provide essen-
tial applications or services seamlessly to enterprise clients. The servers are physically
connected together by a network and might share storage devices. Server clusters are
designed to protect against application and service failure, which could be caused by
application software or essential services becoming unavailable; system and hardware
failure, which could be caused by problems with hardware components such as central
processing units (CPUs), drives, memory, network adapters, and power supplies; and
site failure, which could be caused by natural disaster, power outages, or connectivity
outages.
You can use cluster technologies to increase overall availability while minimizing
single points of failure and reducing costs by using industry-standard hardware and
software. Each cluster technology has a specific purpose and is designed to meet differ-
ent requirements. Network Load Balancing is designed to address bottlenecks caused
by Web services. Failover clustering is designed to maintain data integrity and allow a
node to provide service if another node fails.
The clustering technologies can be and often are combined to architect a comprehen-
sive service offering. The most common scenario in which both solutions are combined
is a commercial Web site where the site’s Web servers use Network Load Balancing and
back-end database servers use failover clustering.
Benefits and Limitations of Clustering
A server cluster provides high availability by making application software and data
available on several servers linked together in a cluster configuration. If a server stops
functioning, a failover process can automatically shift the workload of the failed server
to another server in the cluster. The failover process is designed to ensure continuous
availability for critical applications and data.
Although clusters can be designed to handle failure, they are not fault tolerant with
regard to user data. The cluster by itself doesn’t guard against loss of a user’s work.
Typically, the recovery of lost work is handled by the application software, meaning the
application software must be designed to recover the user’s work or it must be designed
in such a way that the user session state can be maintained in the event of failure.
Clusters help to resolve the need for high availability, high reliability, and high scal-
ability. High availability refers to the ability to provide user access to an application or a
service a high percentage of scheduled times while attempting to reduce unscheduled
outages. A cluster implementation is highly available if it meets the organization’s
scheduled uptime goals. Availability goals are achieved by reducing unplanned down-
time and then working to improve total hours of operation for the related applications
and services.
High reliability refers to the ability to reduce the frequency of system failure while
attempting to provide fault tolerance in case of failure. A cluster implementation is
highly reliable if it minimizes the number of single points of failure and reduces the
risk that failure of a single component or system will result in the outage of all appli-
cations and services offered. Reliability goals are achieved by using redundant, fault-
tolerant hardware components, application software, and systems.
High scalability refers to the ability to add resources and computers while attempting to
improve performance. A cluster implementation is highly scalable if it can be scaled up
and out. Individual systems can be scaled up by adding more resources such as CPUs,
memory, and disks. The cluster implementation can be scaled out by adding more
computers.
Design for Availability
A well-designed cluster implementation uses redundant systems and components so that
the failure of an individual server doesn’t affect the availability of the related applications
and services. Although a well-designed solution can guard against application failure,
system failure, and site failure, cluster technologies do have limitations.
Cluster technologies depend on compatible applications and services to operate prop-
erly. The software must respond appropriately when failure occurs. Cluster technology
cannot protect against failures caused by viruses, software corruption, or human error.
To protect against these types of problems, organizations need solid data protection
and recovery plans.
Cluster Organization
Clusters are organized in loosely coupled groups often referred to as farms or packs.
A farm is a group of servers that run similar services but don’t typically share data.
They are called a farm because they handle whatever requests are passed out to them
using identical copies of data that are stored locally. Because they use identical copies
of data rather than sharing data, members of a farm operate autonomously and are also
referred to as clones.
A pack is a group of servers that operate together and share partitioned data. They are
called a pack because they work together to manage and maintain services. Because
members of a pack share access to partitioned data, they have unique operations modes
and usually access the shared data on disk drives to which all members of the pack are
connected.
In most cases, Web and application services are organized as farms, while back-end
databases and critical support services are organized as packs. Web servers running IIS
and using Network Load Balancing are an example of a farm. In a Web farm, identical
data is replicated to all servers in the farm and each server can handle any request that
comes to it by using local copies of data. For example, you might have a group of five
Web servers using Network Load Balancing, each with its own local copy of the Web
site data.
Database servers running SQL Server and failover clustering with partitioned database
views are an example of a pack. Here, members of the pack share access to the data and
have a unique portion of data or logic that they handle rather than handling all data
requests. For example, in a two-node SQL Server cluster, one database server might
handle accounts that begin with the letters A through M and another database server
might handle accounts that begin with the letters N through Z.
Servers that use clustering technologies are often organized using a three-tier structure.
The tiers in the architecture are composed as follows:
Tier 1 includes the Web servers, which are also called front-end Web servers.
Front-end Web servers typically use Network Load Balancing.
Tier 2 includes the application servers, which are often referred to as the middle-
tier servers. Middle-tier servers typically use Windows Communication Foundation (WCF) or other Web Services technologies to implement load bal-
ancing for application components that use COM+. Using a WCF-based load bal-
ancer, COM+ components can be load balanced over multiple nodes to enhance
the availability and scalability of software applications.
Tier 3 includes the database servers, file servers, and other critical support serv-
ers, which are often called back-end servers. Back-end servers typically use
failover clustering.
As you set out to architect your cluster solution, you should try to organize servers
according to the way they will be used and the applications they will be running. In
most cases, Web servers, application servers, and database servers are all organized in
different ways.
By using proper architecture, the servers in a particular tier can be scaled out or up as
necessary to meet growing performance and throughput needs. When you are looking
to scale out by adding servers to the cluster, the clustering technology and the server
operating system used are both important:
All editions of Windows Server 2008 support up to 32-node Network Load Bal-
ancing clusters.
Windows Server Enterprise and Windows Server Datacenter support failover
clustering, allowing up to 8-node clusters.
When looking to scale up by adding CPUs and random access memory (RAM), the
edition of the server operating system used is extremely important. In terms of both
processor and memory capacity, Windows Server Datacenter is much more expandable
than either Windows Server Standard or Windows Server Enterprise.
As you look at scalability requirements, keep in mind the real business needs of the
organization. The goal should be to select the right edition of the Windows operating
system to meet current and future needs. The number of servers needed depends on the
anticipated server load as well as the size and types of requests the servers will handle.
Processors and memory should be sized appropriately for the applications and services
the servers will be running as well as the number of simultaneous user connections.
Cluster Operating Modes
For Network Load Balancing, cluster nodes usually are identical copies of each other.
Because of this, all members of the cluster can actively handle requests, and they can do
so independently of each other. When members of a cluster share access to data, how-
ever, they have unique operating requirements, as is the case with failover clustering.
For failover clustering, nodes can be either active or passive. When a node is active, it
is actively handling requests. When a node is passive, it is idle, on standby waiting for
another node to fail. Multinode clusters can be configured by using different combina-
tions of active and passive nodes.
When you are architecting multinode clusters, the decision as to whether nodes are
configured as active or passive is extremely important. If an active node fails and there
is a passive node available, applications and services running on the failed node can be
transferred to the passive node. Because the passive node has no current workload, the
server should be able to assume the workload of the other server without any problems
(providing all servers have the same hardware configuration). If all servers in a cluster
are active and a node fails, the applications and services running on the failed node can
be transferred to another active node. Unlike a passive node, however, an active server
already has a processing load and must be able to handle the additional processing load
of the failed server. If the server isn’t sized to handle multiple workloads, it can fail as
well.
In a multinode configuration where there is one passive node for each active node, the servers could be configured so that under average workload they use about 50 percent of processor and memory resources. In the four-node configuration depicted in Figure 39-1, in which failover goes from one active node to a specific passive node, this could mean two active nodes (A1 and A2) and two passive nodes (P1 and P2) each with four processors and 4 GB of RAM. Here, node A1 fails over to node P1, and node A2 fails over to node P2 with the extra capacity used to handle peak workloads.
In a configuration in which there are more active nodes than passive nodes, the servers can be configured so that under average workload they use a proportional percentage of processor and memory resources. In the four-node configuration also depicted in Figure 39-1, in which nodes A, B, C, and D are configured as active and failover could go between nodes A and B or nodes C and D, this could mean configuring servers so that they use about 25 percent of processor and memory resources under an average workload. Here, node A could fail over to B (and vice versa) or node C could fail over to D (and vice versa). Because the servers must handle two workloads in case of a node failure, the processor and memory configuration would at least be doubled, so instead of using four processors and 4 GB of RAM, the servers would use eight processors and 8 GB of RAM.
Figure 39-1 Clustering can be implemented in many ways; these are examples. (The original figure shows two four-node examples: a two active/two passive cluster in which each node has 4 CPUs and 4 GB of RAM and failover goes from A1 to P1 and from A2 to P2, and an all-active cluster in which each node has 8 CPUs and 8 GB of RAM and failover goes between A and B and between C and D.)
When failover clustering has multiple active nodes, data must be shared between applications running on the clustered servers. In many cases, this is handled by using a shared-nothing database configuration. In a shared-nothing database configuration, the application is partitioned to access private database sections. This means that a particular node is configured with a specific view into the database that allows it to handle specific types of requests, such as account names that start with the letters A through F, and that it is the only node that can update the related section of the database (which eliminates the possibility of corruption from simultaneous writes by multiple nodes). Both Exchange Server 2003 and SQL Server 2000 support multiple active nodes and shared-nothing database configurations.
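The partitioning idea can be illustrated with a trivial routing function. In the Python sketch below the node names and the A-M/N-Z split are hypothetical, and in practice the partitioned views live in the database rather than in client code; the sketch only shows how each account maps to exactly one owning node.

```python
# Illustrative only: route a request to the node that owns its partition in a
# shared-nothing configuration (accounts A-M on one node, N-Z on the other).

def owning_node(account_name: str) -> str:
    first_letter = account_name.strip().upper()[0]
    return "SQLNODE1" if "A" <= first_letter <= "M" else "SQLNODE2"

for name in ("Adams", "Martinez", "Nguyen", "Zhang"):
    print(name, "->", owning_node(name))
```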
As you consider the impact of operating modes in the cluster architecture, you should
look carefully at the business requirements and the expected server loads. By using
Network Load Balancing, all servers are active and the architecture is scaled out by
adding more servers, which typically are configured identically to the existing Network Load Balancing nodes. By using failover clustering, nodes can be either active or passive, and the configuration of nodes depends on the operating mode (active or passive) as well as how failover is configured. A server that is designated to handle failover must
be sized to handle the workload of the failed server as well as the current workload (if
any). Additionally, both average and peak workloads must be considered. Servers need
additional capacity to handle peak loads.
Multisite Options for Clusters
Some large organizations build disaster recovery and increased availability into their
infrastructure using multiple physical sites. Multisite architecture can be designed in
many ways. In most cases, the architecture has a primary site and one or more remote
sites. Figure 39-2 shows an example of a primary site and a remote site for a large com-
mercial Web site.
Figure 39-2 Enterprise architecture for a large commercial Web site that has multiple physical locations. (The original figure shows geographic load balancers directing Internet traffic through ISP-facing routers to two mirrored sites; each site contains load balancers, switches, cache engines, Microsoft IIS servers, application servers, transaction servers, and SQL database servers, with a back channel connecting the sites to the corporate network.)
As shown in Figure 39-2, the architecture at the remote site mirrors that of the primary
site. The level of integration for multiple sites and the level at which components are
mirrored between sites depends on the business requirements. With a full implementa-
tion, the complete infrastructure of the primary site could be re-created at remote sites.
This allows for a remote site to operate independently or to handle the full load of the
primary site if necessary. Here, the design should incorporate real-time replication and
synchronization for databases and applications. Real-time replication ensures a consis-
tent state for data and application services between sites. If real-time updates are not
possible, databases and applications should be replicated and synchronized as rapidly
as possible.
With a partial implementation, only essential components are installed at remote sites
with the goal of handling overfl ow in peak periods, maintaining uptime on a limited
basis in case the primary site fails, or providing limited services on an ad hoc basis.
One technique is to replicate static content on Web sites and read-only data from data-
bases. This would allow remote sites to handle requests for static content and other
types of data that is infrequently changed. Users could browse sites and access account
information, product catalogs, and other services. If they must access dynamic content
or modify information (add, change, delete), the sites’ geographical load balancers
could redirect the users to the primary site.
Another partial implementation technique is to implement all layers of the infrastruc-
ture but with fewer redundancies in the architecture or to implement only core compo-
nents, relying on the primary site to provide the full array of features. By using either
technique, the design might need to incorporate near real-time replication and synchro-
nization for databases and applications. This ensures a consistent state for data and
application services.
A full or partial design could also use geographically dispersed clusters running
failover clustering. Geographically dispersed clusters use virtual local area networks
(VLANs) to connect storage area networks (SANs) over long distances. A VLAN con-
nection with latency of 500 milliseconds or less ensures that cluster consistency can
be maintained. If the VLAN latency is over 500 milliseconds, the cluster consistency
cannot be easily maintained. Geographically dispersed clusters are also referred to as
stretched clusters.
Windows Server 2008 supports a majority node set quorum resource. Majority node
clustering changes the way the cluster quorum resource is used to allow cluster servers
to maintain consistency in the event of node failure. In a standard cluster configuration, the quorum resource writes information on all cluster database changes to the recovery logs, ensuring that the cluster configuration and state data can be recovered. Here, the
quorum resource resides on the shared disk drives and can be used to verify whether
other nodes in the cluster are functioning.
In a majority node cluster configuration, the quorum resource is configured as a majority node set resource. This allows the quorum data, which includes cluster configuration changes and state information, to be stored on the system disk of each node in the cluster. Because the data is localized, the cluster can be maintained in a consistent state. As the name implies, the majority of nodes must be available for this cluster configuration to operate normally. Should the cluster state become inconsistent, you can force the quorum to get a consistent state. An algorithm also runs on the cluster nodes to help ensure the cluster state.
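The majority requirement itself is simple arithmetic: more than half of the configured nodes must be running for the cluster to keep quorum. The Python sketch below only illustrates that rule; it is not how the Cluster service actually tracks quorum.

```python
# Illustrative only: a majority node set cluster keeps quorum only while
# more than half of its configured nodes are running.

def has_quorum(total_nodes: int, running_nodes: int) -> bool:
    return running_nodes > total_nodes // 2

print(has_quorum(5, 3))   # True  - 3 of 5 nodes is a majority
print(has_quorum(5, 2))   # False - the cluster stops to protect consistency
print(has_quorum(4, 2))   # False - exactly half is not a majority
```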
Using Network Load Balancing
Each server in a Network Load Balancing cluster is referred to as a node. Network Load
Balancing nodes work together to provide availability for critical IP-based resources,
which can include TCP, UDP, and GRE traffic requests.
Using Network Load Balancing Clusters
Network Load Balancing provides failover support for IP-based applications and ser-
vices that require high scalability and availability. You can use Network Load Balancing
to build groups of up to 32 clustered computers, starting with as few as 2 computers
and incrementally scaling out as demand increases. Network Load Balancing is ideally
suited to improving the availability of Web servers, media servers, terminal servers, and
e-commerce sites. Load balancing these services ensures that there is no single point of
failure and that there is no performance bottleneck.
Network Load Balancing uses virtual IP addresses, and client requests are directed to
these virtual IP addresses, allowing for transparent failover and failback. When a load-
balanced resource fails on one server, the remaining servers in the group take over the
workload of the failed server. When the failed server comes back online, the server can
automatically rejoin the cluster group, and Network Load Balancing starts to distribute
the load to the server automatically. Failover takes less than 10 seconds in most cases.
Network Load Balancing doesn’t use shared resources or clustered storage devices.
Instead, each server runs a copy of the TCP/IP application or service that is being load
balanced, such as a Web site running Internet Information Services (IIS). Local storage
is used in most cases as well. As with failover clustering, users usually don’t know that
they’re accessing a group of servers rather than a single server. The reason for this is
that the Network Load Balancing cluster appears to be a single server. Clients connect
to the cluster using a virtual IP address and, behind the scenes, this virtual address is
mapped to a specific server based on availability.
Anyone familiar with load-balancing strategies might be inclined to think of Network
Load Balancing as a form of round robin Domain Name System (DNS). In round robin
DNS, incoming IP connections are passed to each participating server in a specific order. For example, an administrator defines a round robin group containing Server A, Server B, and Server C. The first incoming request is handled by Server A, the second by
Server B, the third by Server C, and then the cycle is repeated in that order (A, B, C, A,
B, C, . . .). Unfortunately, if one of the servers fails, there is no way to notify the group of
the failure. As a result, the round robin strategy continues to send requests to the failed
server. Windows Network Load Balancing doesn’t have this problem.
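The rotation, and the blind spot just described, are easy to see in a short Python sketch (the server names are hypothetical, and real DNS servers implement the rotation internally):

```python
# Illustrative only: round robin hands out servers in a fixed order and keeps
# handing out a failed server because it has no health information.
from itertools import cycle

servers = ["ServerA", "ServerB", "ServerC"]
rotation = cycle(servers)

failed = {"ServerB"}  # suppose Server B has gone down

for request_number in range(1, 7):
    target = next(rotation)
    note = " (request lost - server is down)" if target in failed else ""
    print(f"Request {request_number} -> {target}{note}")
```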
To avoid sending requests to failed servers, Network Load Balancing sends heartbeats
to participating servers. These heartbeats are similar to those used by the Cluster ser-
vice. The purpose of the heartbeat is to track the condition of each participant in the
group. If a server in the group fails to send heartbeat messages to other servers in the
group for a specified interval, the server is assumed to have failed. The remaining serv-
ers in the group take over the workload of the failed server. While previous connections
to the failed host are lost, the IP-based application or service continues to be available.
In most cases, clients automatically retry the failed connections and experience only
a few seconds delay in receiving a response. When the failed server becomes available
again, Network Load Balancing automatically allows the server to rejoin the group and
starts to distribute the load to the server.
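Conceptually, heartbeat-based failure detection comes down to tracking how long it has been since each node was last heard from. The Python sketch below uses hypothetical node names and a made-up five-second timeout purely to illustrate the idea; it does not reflect NLB's actual heartbeat intervals or convergence process.

```python
# Illustrative only: presume a node has failed if no heartbeat has been seen
# within the allowed interval (hypothetical values, not NLB's real timers).
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds without a heartbeat before a node is presumed failed

# Last time a heartbeat was received from each node (node2 is 12 seconds stale).
last_heartbeat = {"node1": time.time(), "node2": time.time() - 12.0}

def failed_nodes(now=None):
    now = time.time() if now is None else now
    return [node for node, seen in last_heartbeat.items()
            if now - seen > HEARTBEAT_TIMEOUT]

print(failed_nodes())   # ['node2'] - its last heartbeat is older than the timeout
```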
Network Load Balancing Configuration
Although Network Load Balancing is normally used to distribute the workload for an application or service, it can also be used to direct a specific type of traffic to a particular server. For example, an administrator might want to load-balance Hypertext Transfer Protocol (HTTP) and File Transfer Protocol (FTP) traffic across a group of servers but might want a single server to handle other types of traffic. In this latter case, Network Load Balancing allows traffic to flow to a designated server and reroutes traffic to another server only in case of failure.
Network Load Balancing runs as a network driver and requires no hardware changes
to install and run. Its operations are transparent to the TCP/IP networking stack.
Because Network Load Balancing is IP-based, IP networking must be installed on all
load-balanced servers. At this time, Network Load Balancing supports Ethernet and
Fiber Distributed Data Interface (FDDI) networks but doesn’t support Asynchronous
Transfer Mode (ATM). Future versions of Network Load Balancing might support this
network architecture. There are four basic models for Network Load Balancing:
Single network adapter in unicast mode
This model is best for an environment in
which ordinary network communication among cluster hosts is not required and
in which there is limited dedicated traffic from outside the cluster subnet to specific cluster hosts.
Multiple network adapters in unicast mode
This model is best for an environ-
ment in which ordinary network communication among cluster hosts is neces-
sary or desirable and in which there is moderate to heavy dedicated traffic from outside the cluster subnet to specific cluster hosts.
Single network adapter in multicast mode
This model is best for an environment
in which ordinary network communication among cluster hosts is necessary or
desirable but in which there is limited dedicated traffic from outside the cluster subnet to specific cluster hosts.
Multiple network adapters in multicast mode
This model is best for an environ-
ment in which ordinary network communication among cluster hosts is neces-
sary and in which there is moderate to heavy dedicated traffic from outside the cluster subnet to specific cluster hosts.
Network Load Balancing uses unicast or multicast broadcasts to direct incoming traffic to all servers in the cluster. The Network Load Balancing driver on each host acts as a filter between the cluster adapter and the TCP/IP stack, allowing only traffic bound for the designated host to be received. For Windows Server 2008, the NLB network driver has been completely rewritten to use the NDIS 6.0 lightweight filter model. NDIS 6.0 features enhanced driver performance and scalability, a simplified driver model, and backward compatibility with earlier NDIS versions.
Network Load Balancing controls only the flow of TCP, UDP, and GRE traffic on specified ports. It doesn't control the flow of TCP, UDP, and GRE traffic on nonspecified ports, and it doesn't control the flow of other incoming IP traffic. All traffic that isn't controlled is passed through without modification to the IP stack. For Windows Server 2008, IP address handling has been extended to accommodate IP version 6 (IPv6) as well as multiple dedicated IP addresses for IPv4 and IPv6. This allows you to configure NLB clusters with one or more dedicated IPv4 and IPv6 addresses.
Note
Network Load Balancing can be used with Microsoft Internet Security and Accelera-
tion (ISA) Server. For Windows Server 2008, both NLB and ISA have been extended for
improved interoperability. With ISA Server, you can confi gure multiple dedicated IP
addresses for each node in the NLB cluster when clients use both IPv4 and IPv6. ISA can
also notify NLB about node overloads and SYN attacks. Synchronize (SYN) and acknowl-
edge (ACK) are part of the TCP connection process. A SYN attack is a denial-of-service
attack that exploits the retransmission and time-out behavior of the SYN-ACK process to
create a large number of half-open connections that use up a computer’s resources.
To provide high-performance throughput and responsiveness, Network Load Balanc-
ing normally uses two network adapters, as shown in Figure 39-3. The first network adapter, referred to as the cluster adapter, handles network traffic for the cluster, and the second adapter, referred to as the dedicated adapter, handles client-to-cluster network traffic and other traffic originating outside the cluster network.
Figure 39-3 Network Load Balancing with two network adapters. (The original figure shows a cluster host in which the Network Load Balancing driver sits between TCP/IP and the network adapter drivers for the cluster network adapter and the dedicated network adapter, with WLBS.exe and the server application running above the Windows kernel.)
Network Load Balancing can also work with a single network adapter. When it does so,
there are limitations. With a single adapter in unicast mode, node-to-node communica-
tions are impossible, which means nodes within the cluster cannot communicate with
each other. Servers can, however, communicate with servers outside the cluster subnet.
By using a single adapter in multicast mode, node-to-node communications are possible
as are communications with servers outside the cluster subnet. However, the configuration is not optimal for handling moderate to heavy traffic from outside the cluster subnet to specific cluster hosts. For handling node-to-node communications and moderate to heavy traffic, two adapters should be used.
Regardless of whether a single adapter or multiple adapters are used, all servers in
the group operate in either unicast or multicast mode—not both. In unicast mode, the
cluster’s Media Access Control (MAC) address is assigned to the computer’s network
adapter and the network adapter’s built-in MAC address is disabled. All participating
servers use the cluster’s MAC address, allowing incoming packets to be received by all
servers in the group and passed to the Network Load Balancing driver for filtering. Filtering ensures that only packets intended for the server are received and all other packets are discarded. To avoid problems with Layer 2 switches, which expect to see unique source addresses, Network Load Balancing uniquely modifies the source MAC address for all outgoing packets. The modified address shows the server's cluster priority in one of the MAC address fields.
Because the built-in MAC address is used, the server group has some communication
limitations when a single network adapter is configured. Although the cluster servers
can communicate with other servers on the network and with servers outside the net-
work, the cluster servers cannot communicate with each other. To resolve this problem,
two network adapters are needed in unicast mode.
In multicast mode, the cluster’s MAC address is assigned to the computer’s network
adapter and the network adapter’s built-in MAC address is maintained so that both can
be used. Because each server has a unique address, only one adapter is needed for net-
work communications within the cluster group. Multicast offers some additional per-
formance benefits for network communications as well. However, multicast traffic can flood all ports on upstream switches. To prevent this, a virtual LAN should be set up for
the participating servers.
SIDE OUT
Using Network Load Balancing with routers
If Network Load Balancing clients are accessing a cluster through a router, be sure
that the router is configured properly. For unicast clusters, the router should accept a
dynamic Address Resolution Protocol (ARP) reply that maps the unicast IP address to its
unicast MAC address. For multicast clusters, the router should accept an ARP reply that
has a MAC address in the payload of the ARP structure. If the router isn’t able to do this,
you can also create a static ARP entry in the router to handle these requirements. Some
routers will require a static ARP entry because they do not support the resolution of uni-
cast IP addresses to multicast MAC addresses.
Network Load Balancing Port and Client Affinity Configurations
Several options can be used to optimize performance of a Network Load Balancing clus-
ter. Each server in the cluster can be configured to handle a specific percentage of client requests, or the servers can handle client requests equally. The workload is distributed statistically and does not take into account CPU, memory, or drive usage. For IP-based traffic, the technique does work well, however. Most IP-based applications handle many
clients, and each client typically has multiple requests that are short in duration.
Many Web-based applications seek to maintain the state of a user’s session within the
application. A session encompasses all the requests from a single visitor within a specified period of time. By maintaining the state of sessions, the application can ensure that the user can complete a set of actions, such as registering for an account or purchasing equipment. Network Load Balancing clusters use filtering to ensure that only packets intended for the server are received and all other packets are discarded. Port rules specify how the network traffic on a port is filtered. Three filtering modes are available:
Disabled
No filtering
Single Host
Direct traffic to a single host
Multiple Hosts
Distribute traffic among the Network Load Balancing servers
Port rules are used to configure Network Load Balancing on a per-port basis. For ease of
management, port rules can be assigned to a range of ports as well. This is most useful
for UDP traffi c when many different ports can be used.
When multiple hosts in the cluster will handle network traffic for an associated port rule, you can configure client affinity to help maintain application sessions. Client affinity uses a combination of the source IP address and source and destination ports to direct multiple requests from a single client to the same server. Three client affinity settings can be used:
None
Specifies that Network Load Balancing doesn't need to direct multiple requests from the same client to the same server
Single
Specifies that Network Load Balancing should direct multiple requests from the same client IP address to the same server
Network
Specifies that Network Load Balancing should direct multiple requests from the same Class C address range to the same server
Network affinity is useful for clients that use multiple proxy servers to access the
cluster.
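Conceptually, the affinity setting changes which part of the client address feeds the load-balancing decision. The Python sketch below contrasts Single affinity (the full source IP address) with Network affinity (only the Class C portion of the address); the two-node cluster and the hashing scheme are hypothetical and do not reflect NLB's actual distribution algorithm.

```python
# Illustrative only: contrast Single affinity (hash the full client IP address)
# with Network affinity (hash only the Class C network portion of the address).
import hashlib

NODES = ["web1", "web2"]  # hypothetical cluster hosts

def pick_node(key: str) -> str:
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

def single_affinity(client_ip: str) -> str:
    return pick_node(client_ip)

def network_affinity(client_ip: str) -> str:
    class_c_prefix = ".".join(client_ip.split(".")[:3])   # 10.0.5.21 -> 10.0.5
    return pick_node(class_c_prefix)

# Two proxies in the same Class C range: Network affinity keeps them on one node.
for ip in ("10.0.5.21", "10.0.5.22"):
    print(ip, "single:", single_affinity(ip), "network:", network_affinity(ip))
```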
Planning Network Load Balancing Clusters
Many applications and services can work with Network Load Balancing, provided they
use TCP/IP as their network protocol and use an identifiable set of TCP or UDP ports. Key services that fit these criteria include the following:
FTP over TCP/IP, which normally uses TCP ports 20 and 21
HTTP over TCP/IP, which normally uses TCP port 80
HTTPS over TCP/IP, which normally uses TCP port 443
IMAP4 over TCP/IP, which normally uses TCP ports 143 and 993 (SSL)
POP3 over TCP/IP, which normally uses TCP ports 110 and 995 (SSL)
SMTP over TCP/IP, which normally uses TCP port 25
Network Load Balancing can be used with virtual private network (VPN) servers, ter-
minal servers, and streaming media servers as well. For Network Load Balancing, most
of the capacity planning focuses on the cluster size. Cluster size refers to the number of
servers in the cluster. Cluster size should be based on the number of servers necessary
to meet anticipated demand.
Stress testing should be used in the lab to simulate anticipated user loads prior to
deployment. Configure the tests to simulate an environment with increasing user
requests. Total requests should simulate the maximum anticipated user count. The
results of the stress tests will determine whether additional servers are needed. The
servers should be able to meet demands of the stress testing with 70 percent or less
server load with all servers running. During failure testing, the peak load shouldn’t rise
above 80 percent. If either of these thresholds is reached, the cluster size might need to
be increased.
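Those two thresholds translate into a simple sizing check: with every server running, the anticipated average load should stay at or below 70 percent of total capacity, and with one server failed, the anticipated peak should stay below 80 percent of the remaining capacity. The Python sketch below applies both rules to hypothetical capacity figures.

```python
# Illustrative only: estimate how many equally sized servers are needed so that
# average load stays at or below 70% with all servers running and peak load
# stays below 80% with one server failed, as described above.

def servers_needed(avg_load, peak_load, per_server_capacity):
    count = 2
    while True:
        average_ok = avg_load / (count * per_server_capacity) <= 0.70
        one_failed_ok = peak_load / ((count - 1) * per_server_capacity) < 0.80
        if average_ok and one_failed_ok:
            return count
        count += 1

# Hypothetical figures: 900 requests/sec on average, 1,400 at peak,
# and each server comfortably handles about 400 requests/sec.
print(servers_needed(900, 1400, 400))   # -> 6 servers in this example
```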
Servers that use Network Load Balancing can benefit from optimization as well. Serv-
ers should be optimized for their role, the types of applications they will run, and the
anticipated local storage they will use. Although you might want to build redundancy
into the local hard drives on Network Load Balancing servers, this adds to the expense
of the server without significant availability gains in most instances. Because of this,
Network Load Balancing servers often have drives that do not use redundant array of
independent disks (RAID) and do not provide fault tolerance, the idea being that if a
drive causes a server failure, other servers in the Network Load Balancing cluster can
quickly take over the workload of the failed server.
If it seems odd not to use RAID, keep in mind that servers using Network Load Balanc-
ing are organized so they use identical copies of data on each server. Because many
different servers have the same data, maintaining the data with RAID sets isn’t as
important as it is with failover clustering. A key point to consider when using Network
Load Balancing, however, is data synchronization. The state of the data on each server
must be maintained so that the clones are updated whenever changes are made. The
need to synchronize data periodically is an overhead that must be considered when
designing the server architecture.