Getting HA (High Availability) and FT (Fault Tolerance) in your VMware environment without an enterprise SAN…
The VMware ESX environment opens up a lot of options for a company looking to replace servers. Adding ESX(i) to the picture can save a lot of money and give you flexibility you wouldn't otherwise have. The idea of easily recovering a failed domain controller to new hardware, or reverting a server to a snapshot taken before that last Windows update was installed, is pretty compelling.
There are some complications in a small- to medium-sized data center, though.
What happens if your ESX or ESXi host's hardware physically fails? If you haven't planned for this scenario, you are looking at installing ESX on a new machine and moving the guest servers' files onto that new machine's storage.
In a vSphere environment with shared networking and shared storage, a second ESX(i) host can start all of the guest servers: you locate each machine's .vmx file (from the failed host) on the shared datastore and register it on the second host.
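The re-registration step can be sketched from the surviving host's console using vim-cmd, the CLI that ships with ESX(i). The datastore name and VM path below are hypothetical examples, not from the actual deployment:

```shell
# On the surviving host, browse the shared datastore to find the
# failed host's VM directories (datastore name "shared-ds" and the
# VM "dc01" are hypothetical examples)
ls /vmfs/volumes/shared-ds/

# Register the failed host's VM with this host's inventory;
# the command prints the new Vmid on success
vim-cmd solo/registervm /vmfs/volumes/shared-ds/dc01/dc01.vmx

# Confirm the Vmid, then power the guest on
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/power.on <Vmid>
```

With vCenter in the picture you would normally do this through the client instead, but the console route is handy when vCenter itself was a guest on the failed host.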
That sounds easy enough in principle, but for a small business with no SAN in place it gets a little tricky. There are plenty of NAS solutions out there that you can use as an NFS or iSCSI target, but redundancy gets expensive quickly. If your shared storage fails, it won't matter that you have two ESX(i) hosts; neither will be able to access the .vmx and .vmdk files of the guest servers that need to be started.
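For reference, attaching a NAS box as an NFS datastore is a one-liner per host with esxcfg-nas on ESX(i) 4.x. The NAS address, export path, and datastore label here are hypothetical examples:

```shell
# Mount an NFS export as a shared datastore on this host
# (repeat on every host; IP, export path, and label are
# hypothetical examples)
esxcfg-nas -a -o 192.168.1.50 -s /export/vmstore nas-datastore

# List mounted NAS datastores to confirm
esxcfg-nas -l
```

The catch, as noted above, is that this NAS is now a single point of failure for every guest, which is what pushes us toward a replicated solution.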
For this specific scenario we are implementing the HP/LeftHand P4000 Virtual SAN Appliance (VSA). In our real-life scenario we are replacing an aging infrastructure for a customer with 100 users in 5 locations who currently has no virtual infrastructure.
What this will look like: we end up with two identical DL380 G6 servers as ESXi hosts, each with 2 TB of local storage. We also need one physical machine to run vCenter Server and act as a third vote to facilitate failover between the hosts; again we're going to use a DL380, but it doesn't need as many CPUs or as much memory. A Cisco 3750 connects everything together and uplinks to the client's existing physical network. Total: 6U in one rack.
What we are replacing:
- 1 FC SAN with 4 LUNs used by 6 physical machines in an MSCS cluster.
- 2 SQL 2000 servers (MSCS)
- 2 Openedge Progress servers (MSCS)
- 2 Terminal Servers/File servers (MSCS and NLB)
- 2 IIS servers in an NLB cluster used for the intranet.
- 2 domain controllers.
- 1 backup server.
- 1 Proxy server.
- 2 DMZ web servers.
- 1 Accounting server (runs Terminal Services and QuickBooks).
- 1 Exchange 2003 server
- 1 Phone server (Shoretel Shoreware director)
- 4 of the 6 Cisco 2900-series switches.
We don't yet have an existing deployment to gauge overall future performance in this scenario, so we may find that some servers don't get migrated to these hosts. For now it looks like we can replace everything and still see a substantial performance gain in places. For ease of writing I am going to ignore that possibility and talk about it more if 'real life' eventually encroaches upon the thread.
Background information: this is one company with a central office and effectively 4 satellite offices that connect to the central office over point-to-point T1s. One of the offices operates as its own company, has some resources dedicated to it on site, and is the only user of the QuickBooks server.
The SAN holds 8 production databases. The largest database is 80 GB but has only 35 users accessing it, through a web-based application that runs on the IIS servers. Users are all on workstations running Windows 2000 or XP, and currently all servers run Windows 2003. The SAN is Fibre Channel and uses HSG80 controllers to present the storage to the Windows machines.

Within this site most applications, storage, server roles and network paths are redundant. Including switches and phone equipment, there are 8 full racks containing all of the server, network and phone hardware. We can potentially replace 6 of those racks with a solution that removes the complexity of MSCS and NLB, decreases latency between the application servers and the database servers, and makes more resources available to all applications.

The potential downside relates to general storage best practices for Exchange, SQL and Progress. In this scenario we are banking on the idea that the performance gain from the new hardware will outweigh the hit we take by spanning the volumes that contain log files, databases and operating systems across the same 16 disks in a RAID 5 array.