How-to: Active/Active iSCSI + VMware (Part 1) - 01/20/10

As promised earlier, here is the first installment of the how-to.  Before we get going too far, this series of how-tos comes with the following disclaimers:

  • A basic knowledge of Linux is assumed.  I’ll provide the commands to perform certain activities, but I’ll assume you know how to get a basic Linux install going
  • This how-to is written with CentOS 5.4 in mind – any distro will do, but you may need to modify some commands to make it happen.
  • This how-to uses virtual machines and is the result of proving out the concept.  I’m sure that when I build the final servers I’ll fine-tune it a bit more (which will result in an addendum to the how-to :) )
  • This how-to will be done in a progressive fashion: each part will layer another level of functionality onto the configuration.

In Part 1 of the how-to, we will create two virtual machines and install Linux, VMtools, the needed storage space, and the required software.  At the end of this part, you will have two functional virtual machines that replicate storage between themselves in a Primary/Secondary fashion, and you will be able to share out the LUN from the primary node via iSCSI.

0.  Most commands below need to be executed with root privileges.

  1. Create two virtual machines.  I created them on my desktop using VMware Workstation.  I created the machines with two network cards, one bridged over the desktop’s NIC, the other on vmnet0 (or any other non-bridged network) to provide a private network connection for the two storage nodes.  Additionally, I removed all fluff hardware from the VM (USB, floppy, etc.) and attached a 2.5 GB thin-provisioned HDD to each machine.
  2. Install Linux – I did this by doing a netinstall and deselecting all packages during the install process.  This means I will need to add everything to the system myself.  The output of df -h on a completed system is below
    • /dev/sda2             1.5G  1.1G  325M  78% /
      /dev/sda4             396M   34M  342M   9% /var
      /dev/sda1              99M   17M   78M  18% /boot
      tmpfs                 189M     0  189M   0% /dev/shm
    • Configure eth0 to communicate on your public network
    • Configure eth1 to communicate on the private VM only network
  3. I disabled SELinux.  I realize this may be a debatable action, but I just don’t understand it well enough, and I want to eliminate it as a potential problem.  The change takes effect at the next reboot (step 5).
    • vi /etc/selinux/config
    • change SELINUX=enforcing to SELINUX=disabled
  4. optional - Next we’re going to remove a few pieces of software that I don’t want on the system -
    • echo y | yum remove iptables
    • echo y | yum remove cups-libs
  5. Next we are going to install perl and then upgrade the system
    • yum install perl
    • echo y | yum upgrade
    • reboot
  6. Install VMtools via your preferred method – once you have installed it, be sure to clean up the install files, as they take up a lot of hard drive space
  7. Now we are going to install all of the software for the core functionality of the system
    • echo y | yum install mdadm
    • echo y | yum install kmod-drbd83
    • echo y | yum install wget
    • echo y | yum install make
    • echo y | yum install gcc
    • echo y | yum install openssl-devel
    • echo y | yum install kernel-devel
    • echo y | yum install patch
  8. Download and install ietd
  9. At this point, add two additional disks to your VM (reboot or rescan the SCSI bus to find them)
  10. Let’s create a single raid 0 array on the two disks we just added
    • mdadm --create /dev/md/d0 --auto=mdp --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    • mdadm --detail --scan >> /etc/mdadm.conf
    • add DEVICE /dev/sd* to mdadm.conf as the first line of the file
  11. Step 10 creates a software raid array that we can partition.  Building on this array is very important for future expansion: it lets us grow the array without disrupting the drbd resource that sits on it.
    • This will be evident down the road when I walk us through dynamically adding additional space to our storage system.
  12. Now let’s partition the new array
    1. fdisk /dev/md/d0
    2. n – command to create a new partition
    3. p – create it as a primary partition
    4. 1 – partition number
    5. <enter> – begin at the beginning of the disk
    6. <enter> – end at the end of the disk
    7. w – write the partition table and exit
  13. It’s time to create your drbd.conf file.  You can scour the internet and make your own, or you can download mine and modify it to match your hostnames/IP config
  14. Create the drbd resource meta data
    • drbdadm create-md r0
  15. At this point you should either clone this VM and modify it appropriately for the second node, or repeat all of the above steps on the second VM.  Do as you feel comfortable
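Steps 3, 7 and 10 above can be collected into a few shell functions.  This is only a rough sketch, not the script I actually used: the function names and the DRY_RUN switch are made up for illustration, and the device names are the ones from step 10.  By default it only prints the commands, so nothing is changed until you ask for it.

```shell
#!/bin/bash
# Sketch of steps 3, 7 and 10.  DRY_RUN defaults to 1, so commands are only
# printed; set DRY_RUN=0 (as root) to really run them.
DRY_RUN=${DRY_RUN:-1}

# run: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Step 3: set SELINUX=disabled in the given file (default /etc/selinux/config).
disable_selinux() {
    local cfg=${1:-/etc/selinux/config}
    run sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$cfg"
}

# Step 7: one yum call; -y answers the prompts instead of "echo y |".
install_packages() {
    run yum -y install mdadm kmod-drbd83 wget make gcc openssl-devel kernel-devel patch
}

# Step 10: partitionable raid 0 array across the two disks added in step 9.
create_array() {
    run mdadm --create /dev/md/d0 --auto=mdp --level=0 --raid-devices=2 /dev/sdb /dev/sdc
}
```

Source the file and call the functions one at a time, for example `create_array` to preview the mdadm command, then again with DRY_RUN=0 once the output looks right.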

At this point we have two identically configured VMs, each with a raid 0 array and with drbd installed and configured to write to that array.  The next step is to fire up drbd for the first time and perform the initial sync between the nodes.
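For reference, a drbd.conf of the kind step 13 calls for might look roughly like the following.  This is a minimal sketch and not my actual file: the IP addresses, the port and the partition name /dev/md/d0p1 are assumptions you will need to adjust, while the resource name r0, the device /dev/drbd1 and the hostnames are the ones used in this how-to.

```
# Hypothetical /etc/drbd.conf for drbd 8.3 (addresses and disk path are assumed)
global {
    usage-count no;
}

common {
    protocol C;                      # fully synchronous replication
}

resource r0 {
    on kroker01 {
        device    /dev/drbd1;
        disk      /dev/md/d0p1;      # partition created in step 12 (assumed name)
        address   192.168.100.1:7789;  # eth1 / private network (assumed)
        meta-disk internal;
    }
    on kroker02 {
        device    /dev/drbd1;
        disk      /dev/md/d0p1;
        address   192.168.100.2:7789;
        meta-disk internal;
    }
}
```

The names in the "on" sections have to match what uname -n returns on each node, and the file should be identical on both machines.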

  1. Before we go any further, let’s set both drbd and iscsi-target to manual startup – this is annoying for a reboot, and makes it unusable in a real sense – but it allows us total control of our test scenarios
    • chkconfig --level 0123456 drbd off
    • chkconfig --level 0123456 iscsi-target off
  2. In step 13 above, you should have created your drbd.conf file.  Going forward I will refer to the nodes as I have them named in my example file.  kroker01 is the primary node with kroker02 the secondary.
  3. On kroker01 start drbd up and tell it to perform a sync to kroker02 (when it comes online)
    • service drbd start
    • drbdadm -- --overwrite-data-of-peer primary r0
  4. Check out /proc/drbd to verify that kroker01 is in a WFConnection state, and is the primary
    •  1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
          ns:10372472 nr:0 dw:0 dr:10380564 al:0 bm:633 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
  5. Go ahead and start drbd on the second node; at this point if you cat /proc/drbd you should see connection status as Connected and ro: as Primary/Secondary, with ds:UpToDate/Inconsistent.  You should also see a line in /proc/drbd detailing the status of the resource.  At this point we are safe to direct disk traffic to /dev/drbd1!
    • Please note
  6. Setup ietd.conf
    • Again, google is your friend or download my example, making any needed modifications for hostname and the like.
    • service iscsi-target start
  7. From any box that has an iscsi-initiator, point it towards the IP/hostname of the primary box (kroker01).
  8. After performing a disk rescan on the client host, you should now have a new disk to work with and format.
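If you get tired of eyeballing /proc/drbd in steps 4 and 5, a small helper can pull out the fields that matter.  This is a sketch that assumes the 8.3 status layout shown in step 4; drbd_field is a name I made up, not part of drbd.

```shell
# drbd_field FIELD [FILE]: print the value of cs, ro or ds from a
# /proc/drbd style status file (FILE defaults to /proc/drbd).
drbd_field() {
    local field=$1 file=${2:-/proc/drbd}
    # Match " cs:VALUE" (or ro:, ds:) and keep only VALUE; first hit wins.
    sed -n "s/.*[[:space:]]${field}:\([^[:space:]]*\).*/\1/p" "$file" | head -n 1
}
```

For example, `drbd_field cs` on kroker01 should report WFConnection while kroker02 is down, and `drbd_field ro` shows the local/peer roles.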

Play around with stopping and restarting drbd on the secondary node, and observe what you see in /var/log/messages and /proc/drbd.  As the secondary node comes up and goes down, drbd tells you various information about what it is doing and the status of the secondary disk.

What can I do with this?

Honestly – at this point we have an expensive, highly redundant, but manual raid 1 disk array for you to play with.  Should your primary node fail, you will have a block for block copy of your data on a second node, that you could pretty easily bring up as the primary node to access your data.
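Bringing the second node up as primary by hand could look something like the sketch below.  It assumes kroker01 is really down, that kroker02 has drbd running with an up-to-date copy, and that kroker02 has its own ietd.conf in place; promote_secondary and DRY_RUN are names invented for this example.

```shell
# Manual failover sketch, to be run on kroker02 after kroker01 has failed.
# DRY_RUN defaults to 1, so the commands are only printed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then printf '+ %s\n' "$*"; else "$@"; fi
}

promote_secondary() {
    local res=${1:-r0}
    run drbdadm primary "$res"        # take over the Primary role for the resource
    run service iscsi-target start    # start exporting the LUN from this node
}
```

After that, the initiators still have to be pointed at kroker02 by hand, which is exactly the manual part later installments aim to remove.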

Home ESX Infrastructure version 3.0 - 01/19/10

So I’ve been bragging to my co-workers and friends about my under-construction ESX environment at home.

Currently my environment is pretty simple: it consists of a generic system running CentOS 5.4 farming out a few LUNs to my ESX server.

  • Pentium 4 2.8 GHz
  • Abit IS7-e w/ 1.5 GB RAM
  • Dual Port Intel E1000 NIC
  • LSILogic MegaRaid 150-4 SATA controller
  • 3×250 GB SATA disks in a RAID 5
  • A couple misc IDE drives

I have CentOS installed onto a 2.5 GB partition; the remainder of the disk (approximately 800 GB usable total) is presented as iSCSI targets using ietd 1.4.19.

As a side note, until just a few days ago this box was running Openfiler 2.3 with a P3 700 + Soyo SY-7VCA2, but I started having some stability problems with the motherboard/CPU, and a MB/CPU swap seems to have made things better.

Said ESX box is currently running 3.5 with some decent hardware. 

  • Rioworks/Accelertech/Arima HDAMA Rev. G
  • 2xOpteron 248 ( Rev E.)
  • 8×1 GB PC1600R Dimms

The ESX box and CentOS system are connected via a crossover cable for iSCSI traffic.

Prior to the split setup between my ESX server and CentOS system, my storage (via the MegaRaid 150-4 card) lived in my ESX box.  I needed more storage but did not want to buy another ESX HCL-listed SATA/SCSI card and the drives to go with it.  By moving it all to an iSCSI system, I removed the need to use only certain types of drives.

So what’s next?  I’m glad you asked – the Bowe ESX farm v3.0 will be a 100% availability infrastructure.  Well not truly 100% – I will have some limitations due to internal house electricity, but the [eventual] purchase of appropriately sized UPS capacity will solve that problem.

How do I intend to accomplish this?  Lots of used hardware.  Some of the hardware I have sitting around from previous spending binges; the rest has been or will be acquired over the next few weeks via careful ebay shopping.  I will list the price I paid, plan to pay, or would expect to pay as appropriate for my purchase situation (got, getting, had).

ESXi (2x)

  • Tyan K8SR (paid $27.50 ea. shipped)
  • Dual Opteron 270 (budget $40 per pair shipped – ebay)
  • 8x 1GB PC1600R (would pay $10-$15/dimm – bought a bunch of these a LONG time ago)
  • 1x Emulex 9802 HBA (paid $5 shipped – ebay)
  • 1x Tyan OOB management card (freebie with the K8SR)

Storage (2x)

  • Rioworks/Accelertech/Arima HDAMA (rev prior to G)  (paid $20 ea. on ebay)
  • Dual Opteron 246HE (freebie CPUs that came with the K8SRs in my host setup.)
  • 2x 1 GB PC1600R (see above for pricing)
  • 1x Qlogic 2340 HBA (paid $11.50 ea. – ebay)
  • 1x Emulex 9802 HBA (paid $5 shipped – ebay)
  • Generic SATA Controller ($20 ea.)
  • 2×250 GB SATA drives (bought a while ago, but market rate is ~$35 ea.)

All systems will boot off a 2.5 GB Compact Flash microdrive in an IDE adapter – ~50 dollars for 4 drives and adapters on ebay.  All systems also have 350/400 watt power supplies: two came with the HDAMA MBs in the storage boxes, and the other three I have on hand but may replace with a Sparkle FSP350-601u, which can be had on ebay for less than 20 bucks.

Infrastructure

  • 16+ port Gigabit managed switch that supports VLAN tagging (I’ve seen some of these on ebay for ~$50 in the past few weeks.  I am budgeting approximately $75 + shipping)
  • 8+ port 2 Gb Fibre Channel switch (budget is $50 + shipping on ebay)
  • GBICs (market seems to be $5-$10 ea. on ebay)

 

So the total infrastructure cost is less than $800 – less if you already own some of the hardware.

Software stack

Storage

  • SCST will be installed on each node to load an Emulex FC target driver to share out disk resources on the SAN Fabric
  • drbd 8.3 (in dual primary configuration) will be installed to perform replication of the disk between the two storage nodes
    • drbd will be using the Qlogic HBAs using an older driver with TCP/IP support for replication
  • pacemaker will be installed (most likely) to help control drbd and to control split-brain and act as STONITH
  • The 250 GB drives will be configured in a software raid 0.
    • Choosing to do a software raid removes dependencies on hardware raid controllers and it also will allow me to effectively scale the arrays outward by simply adding more drives.

ESX – I will be using ESXi 4.

Misc

  • My current ESX hardware will be repurposed as a physical Forefront Security Gateway 2010 (yeah Technet subscription) system.  One NIC into the cable modem, the other NIC into the Gig switch with all VLANS trunked to it

So what does this give me?  Once this is built out I will have a fully redundant ESX farm.  I will be able to power down either ESX server or either storage server for patching, maintenance, etc without taking down my virtual machines.  The only box at “risk” will be the ISA system.

At some point I’ll drop in appropriately sized UPS system(s) to provide 5-10 minutes+ of backup, although this looks like a pretty sweet solution.

Of the environment described above, the only pieces I am missing are the Opteron 270s, the Fibre Channel switch (and GBICs, although I’m going to try to get them both in one auction if possible), and the gig network switch.  Depending on ebay availability I’m looking to have all of the hardware acquired in the next month or so and the entire environment built out shortly after that, although once a few packages arrive this week I should be able to start the actual build out.

What do I know about this setup – DRBD works.  I’m having a blast playing with various failure scenarios and split brain detection.  I feel like I have a setup that is very reliable at picking the right “master” to start sync from in a failure state, but I am going to start looking at pacemaker a bit to see if it makes building some logic into the setup easier.  Otherwise I will probably just code some rough bash scripts to control start-up of DRBD/SCST.

I have a fair bit to learn about the details of Fibre Channel, but I’m looking forward to the challenge.

As I said earlier – once I start the build out, or have finalized my configuration – I’ll post a detailed how-to and possibly sanitized VMs to work with.

Help! My linux server is missing memory! - 11/7/08

A couple days ago I was involved in troubleshooting a linux server that was only reporting 4 GB of memory rather than the 8 GB it was supposed to have.

Since my team technically only supports the hardware of systems running Linux, the only way I have to check installed memory, besides referring to documentation, is to reboot the box and see what is reported during POST.  In this case it was a production server, and it was functioning.  So no reboot.

Hmmm what to do?  I spoke with one of the Unix/Linux administrators and got him to establish a console session on the physical console that I could drive over the IP-KVM.

I confirmed that only 4 GB of RAM was in place by looking at top.  To confirm that something didn’t happen to the server between boot and now, I executed more /var/log/messages* | grep Memory.  The output confirmed that only 4 GB of RAM was detected on the server:

Oct 12 09:49:29 localhost kernel: Memory: 3942888k/4194304k available (1675k kernel code, 74732k reserved, 1278k data, 228k init, 3104728k highmem)

There were several matches for this showing that the system had not successfully detected all 8 GB of memory for some time.  Next thing to check was the kernel compilation options, to check if PAE was enabled – cat /boot/config-2.4.21-57.ELsmp | grep CONFIG_HIGHMEM (I checked uname -a, and grub.conf to find out what kernel was booted).

The result was that CONFIG_HIGHMEM4G=y was set, effectively limiting the system to 4 GB of RAM, when what it really needed was CONFIG_HIGHMEM64G=y, which would have fully enabled PAE on a 32-bit system.
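The two checks above can be wrapped into a pair of small shell helpers.  This is just a sketch: mem_total_kb and highmem_setting are names I made up, and the first one assumes the log line format shown earlier in this post.

```shell
# mem_total_kb: read kernel log lines on stdin and print the total memory
# figure (in kB) from the first "Memory: <avail>k/<total>k available" line.
mem_total_kb() {
    sed -n 's/.*Memory: [0-9]*k\/\([0-9]*\)k available.*/\1/p' | head -n 1
}

# highmem_setting FILE: print the CONFIG_HIGHMEM<size> option that is set to y
# in a kernel config file (plain CONFIG_HIGHMEM=y is ignored on purpose).
highmem_setting() {
    grep -E '^CONFIG_HIGHMEM[0-9]+GB?=y' "$1" | head -n 1
}
```

Something like `grep -h Memory /var/log/messages* | mem_total_kb` gives the detected total, and dividing that by 1048576 turns it into GB (4 for the log line above).  `highmem_setting /boot/config-2.4.21-57.ELsmp` then shows which high memory option the kernel was built with.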
