Archive for November, 2008

Comment VMware Snapshot alternate Datastore - 11/20/08

Frustrated by the amount of disk consumed by multiple machines with snapshots on the same LUN?  Got a server with a VMDK that consumes all but a tiny fraction of the datastore it lives on?

Here’s a potential solution leveraging a little-documented VMX statement.  Please be aware that, while this method works from a functional point of view, I would tread carefully, as its behavior could change unexpectedly in future versions of ESX and/or Virtual Center.

Because this statement by default places the virtual machine swap file in the same alternate datastore as the snapshot, I recommend only executing this if you are running ESX 3.5, which allows you to control the swap file placement.

I won’t hold your hand while executing this, but here is an outline of all the steps required.  Be sure to use the datastore’s GUID and not the label.

Steps

  1. Shut down the virtual machine cleanly.
  2. Log on to the service console of the host where the VM is registered, as root or as an ID with root-level permissions.
  3. Edit the vmx – sudo vi /vmfs/volumes/DatastoreofVM/VMname/VMname.vmx
  4. Remove the following lines from the vmx
    • sched.swap.derivedName
    • workingDir (if present)
    • Save the file
      • [Esc] [colon] w
  5. Re-add the workingDir line as follows
    • workingDir = "/vmfs/volumes/SNAPSHOT-DATASTORE-GUID/VMname/"
    • Don’t forget the trailing /
    • Save the file and exit
      • [Esc] [colon] wq
  6. Create the snapshot directory and set the correct permissions
    • sudo mkdir /vmfs/volumes/SNAPSHOT-DATASTORE-LABEL/VMname
    • sudo chown root.root /vmfs/volumes/SNAPSHOT-DATASTORE-LABEL/VMname
    • sudo chmod 775 /vmfs/volumes/SNAPSHOT-DATASTORE-LABEL/VMname
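
Steps 3 through 5 can be sketched as a small shell function.  This is only a rough sketch, not a tested procedure for any particular ESX release: the function name set_working_dir and the .bak backup copy are my own assumptions, and it should only ever be run against a powered-off VM.  Step 6 (creating the directory and setting permissions) still has to be done separately.

```shell
#!/bin/sh
# Hypothetical helper (not part of ESX): point a VM's working directory
# at another datastore by editing its .vmx file.
#   $1 = path to the .vmx file
#   $2 = new working directory, including the trailing /
set_working_dir() {
    # Keep a backup so the change can be undone by hand.
    cp "$1" "$1.bak" || return 1

    # Step 4: drop the existing swap-name and workingDir entries.
    sed -e '/^sched\.swap\.derivedName/d' \
        -e '/^workingDir/d' "$1.bak" > "$1" || return 1

    # Step 5: re-add workingDir with plain double quotes.
    printf 'workingDir = "%s"\n' "$2" >> "$1"
}
```

It would be called along the lines of set_working_dir /vmfs/volumes/DatastoreofVM/VMname/VMname.vmx /vmfs/volumes/SNAPSHOT-DATASTORE-GUID/VMname/ (under sudo or as root).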

Checkout
In Virtual Center, locate the VM, right-click it and choose Edit Settings….  On the Options tab, the Working directory parameter should be set to [SNAPSHOT-DATASTORE-LABEL]/VMname.

I have executed these steps several times.  Sometimes I had to unregister the machine before beginning and re-register it after completing the changes; other times I just made the changes and powered the VM on.

If you choose to do this, then in addition to support concerns, don’t forget to think about redundancy of access to the datastore, the performance of the datastore, etc.

Personally, I think the best use of this would be to place snapshots on a high-performance NFS mount that can be monitored for space consumption and expanded at will.

Comment Locked VMDK? - 11/7/08

Have you ever tried to power on a virtual machine only to get some cryptic error message about a file being locked?  It’s a frustrating message to get, primarily because it’s so cryptic and provides so little useful information.  Other times the machine will not power on and returns an error stating that a file cannot be found, yet a listing of the virtual machine’s directory shows all files present.
 
You can confirm this problem by examining vmware.log and looking for references to a locked file.  An example of this is:
DISKLIB-LINK  : "/vmfs/volumes/47069165-c4ccb111-0513-001a4bbe40ba/ntadph1187m00/ntadph1187m00_1.vmdk" : failed to open (Device or resource busy).
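
A quick way to pull the affected disk paths out of vmware.log might look like the sketch below.  The function name locked_disks is made up for illustration, and it assumes the message text matches the example above and that the path contains no spaces.

```shell
#!/bin/sh
# Hypothetical helper: list the VMDK paths that vmware.log reports as busy.
#   $1 = path to the VM's vmware.log
locked_disks() {
    # Keep only the "failed to open" lines for busy files, then cut out
    # the /vmfs/volumes/... path and de-duplicate the result.
    grep -F 'failed to open (Device or resource busy)' "$1" |
        grep -o '/vmfs/volumes/[^ ]*\.vmdk' |
        sort -u
}
```

Run it from the VM's directory as locked_disks vmware.log; no output means no matching lines were found.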
 

This indicates that one or more members of the virtual machine’s cartel are still running, but the host running them has forgotten about the VM, hence the powered-off state in VC.
 

When you encounter this situation, the first thing to do is check whether the server is responsive via mstsc.  If the server responds to an RDP session, we have the easiest solution: restart the management agents on the hosts in the cluster, one at a time, until VC reflects the correct state of the VM.
 
If the machine is unresponsive, that means it is dead.  We now have to go on a search-and-destroy mission and eradicate the VM’s cartel.
 
On the service console of a host in the affected cluster, change to the VM’s directory.  Execute the command sudo vmkfstools -D ./name-of-locked-file.
 
The command will execute, but will not print any output.  To get the output of the command, execute sudo tail /var/log/vmkernel.
 
You should see several lines of log output similar to the segment below.  We are looking for the last group of the owner field (0016357c8e59 in this example): this is the MAC address for vmnic0 of the lock holder, sometimes vmnic1 (or whatever you have vswif0 configured for).  You need to execute ifconfig vmnic1 or ifconfig vmnic0 on each host in the cluster until you locate a match.
 
Mar 18 10:34:47 vmvsph6120m00 vmkernel: 160:18:34:52.424 cpu1:1038)FS3: 130: <START ntadph1187m00-flat.vmdk>
Mar 18 10:34:47 vmvsph6120m00 vmkernel: 160:18:34:52.424 cpu1:1038)Lock [type 10c00001 offset 16216064 v 77, hb offset 3406336
Mar 18 10:34:47 vmvsph6120m00 vmkernel: gen 3913, mode 1, owner 470beb69-e4d9d99c-805a-0016357c8e59 mtime 13395847]
Mar 18 10:34:47 vmvsph6120m00 vmkernel: 160:18:34:52.424 cpu1:1038)Addr <4, 105, 8>, gen 24, links 1, type reg, flags 0x0, uid 0, gid 0, mode 100600
Mar 18 10:34:47 vmvsph6120m00 vmkernel: 160:18:34:52.424 cpu1:1038)len 7534018560, nb 7185 tbz 0, zla 3, bs 1048576
Mar 18 10:34:47 vmvsph6120m00 vmkernel: 160:18:34:52.424 cpu1:1038)FS3: 132: <END ntadph1187m00-flat.vmdk>
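
Picking the MAC address out of that log by eye is error-prone, so here is a minimal sketch that does it with sed.  It assumes the MAC is the final 12 hex digits of the owner field, as in the segment above; the function name lock_owner_mac is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read vmkernel log lines on stdin and print the
# lock owner's MAC address in colon-separated form.
lock_owner_mac() {
    # Grab the last 12 hex digits of "owner xxxxxxxx-xxxxxxxx-xxxx-<mac>",
    # keep the most recent match, then insert a colon after every 2 digits.
    sed -n 's/.*owner [0-9a-f]*-[0-9a-f]*-[0-9a-f]*-\([0-9a-f]\{12\}\).*/\1/p' |
        tail -n 1 |
        sed 's/\(..\)/\1:/g; s/:$//'
}
```

For example, sudo tail /var/log/vmkernel | lock_owner_mac prints the address in lowercase, so compare it against the ifconfig output case-insensitively (grep -i).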
 
 
Once you have a match, execute ps -efwww | grep VirtualMachineName on that host.  If you get a line of output similar to:
root     10793     1  0 Mar12 ?        00:00:15 /usr/lib/vmware/bin/vmkload_app /usr/lib/vmware/bin/vmware-vmx -ssched.group=host/user -@ pipe=/tmp/vmhsdaemon-0/vmxf863abd25c13a492;vm=f863abd25c13a492 /vmfs/volumes/47069165-c4ccb111-0513-001a4bbe40ba/ntadph1187m00/ntadph1187m00.vmx
 
You can safely kill -9 PID (10793 in this example), but please be sure you have performed the previous check for VM responsiveness.  kill -9 essentially tears the legs out from under the process; it is not a clean kill.  A normal kill PID may work for this, but I have had much better success with kill -9.  At this point you should be able to power the VM on.
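
To avoid fat-fingering the PID, the lookup can be sketched as a filter over the ps output.  The function name vmx_pids is my own invention, and it assumes the process line looks like the example above (a vmware-vmx command line ending in the VM's .vmx path).  It only prints PIDs; the kill is still up to you.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -efwww` output on stdin and print the PID
# of any vmware-vmx process whose command line mentions the given VM.
#   $1 = virtual machine name
vmx_pids() {
    # The [x] keeps this awk process itself from matching if it shows up
    # in the ps listing; column 2 of `ps -ef` output is the PID.
    awk -v vm="$1" '/vmware-vm[x]/ && index($0, vm) { print $2 }'
}
```

Usage would be along the lines of ps -efwww | vmx_pids ntadph1187m00, which prints nothing if no vmx process is left on that host.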
 
 
If your ps -efwww does not return a vmx line, it means the host attempted to shut the VM down but was unable to stop the entire cartel.  The only option left at this point is to reboot the host.

Comment Help! My linux server is missing memory! - 11/7/08

A couple of days ago I was involved in troubleshooting a Linux server that was only reporting 4 GB of memory rather than the 8 GB it was supposed to have.

Since my team technically only supports the hardware of systems running Linux, the only way I have to verify installed memory, besides referring to documentation, is to reboot the box and see what is reported during POST.  In this case it was a production server and was functioning, so no reboot.

Hmmm what to do?  I spoke with one of the Unix/Linux administrators and got him to establish a console session on the physical console that I could drive over the IP-KVM.

I confirmed that only 4 GB of RAM was in place by looking at top.  To confirm that nothing had happened to the server between boot and now, I executed more /var/log/messages* | grep Memory.  The output confirmed that only 4 GB of RAM was detected on the server:

Oct 12 09:49:29 localhost kernel: Memory: 3942888k/4194304k available (1675k kernel code, 74732k reserved, 1278k data, 228k init, 3104728k highmem)
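
If there are a lot of boots in the logs, the totals can be pulled out with a one-line sed filter.  This is just a sketch: detected_mem_kb is a made-up name, and it assumes the kernel's boot message has the same "Memory: Xk/Yk available" shape as the line above, with the second number being the total the kernel saw.

```shell
#!/bin/sh
# Hypothetical helper: read syslog lines on stdin and print, for each
# kernel "Memory:" boot line, the total memory detected in kilobytes.
detected_mem_kb() {
    # Capture the number after the slash in "Memory: <avail>k/<total>k".
    sed -n 's/.*Memory: [0-9]*k\/\([0-9]*\)k available.*/\1/p'
}
```

Something like cat /var/log/messages* | detected_mem_kb | sort -u shows every distinct total; 4194304 KB is exactly 4 GB (4194304 / 1024 / 1024 = 4).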

There were several matches for this, showing that the system had not successfully detected all 8 GB of memory for some time.  The next thing to check was the kernel compilation options, to see if PAE was enabled: cat /boot/config-2.4.21-57.ELsmp | grep CONFIG_HIGHMEM (I checked uname -a and grub.conf to find out which kernel was booted).

The result was that CONFIG_HIGHMEM4G=y was set, effectively limiting the system to 4 GB of RAM.  What it needed was CONFIG_HIGHMEM64G=y, which would have fully enabled PAE for a 32-bit system.
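
The same check can be wrapped in a tiny function so that only the enabled options are shown and the commented-out "is not set" lines are skipped.  This is a sketch; highmem_options is a hypothetical name, and the pattern simply matches any enabled option with HIGHMEM in its name rather than assuming exact spellings.

```shell
#!/bin/sh
# Hypothetical helper: print the enabled HIGHMEM-related options from a
# kernel config file such as /boot/config-<version>.
#   $1 = path to the kernel config file
highmem_options() {
    # Lines that are enabled look like CONFIG_...=y; disabled options
    # appear as "# CONFIG_... is not set" and will not match.
    grep '^CONFIG_[A-Z0-9_]*HIGHMEM[A-Z0-9_]*=y' "$1"
}
```

For instance, highmem_options /boot/config-2.4.21-57.ELsmp on the server above would be expected to list the 4 GB option but not the 64 GB one.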

Comment A fresh start. - 11/7/08

For better or worse, I’ve installed WordPress on my site.  I’m going to use it to blog about technical things that I feel compelled to write about.
