ESX Incorrectly Resignatured LUNs

We recently had a major storage outage after presenting existing LUNs from one ESX cluster to another. The goal was for both clusters to share the same LUNs so that we could merge them. This normally isn’t an issue as long as the presentation is uniform (i.e. consistent LUN IDs across all hosts/initiators).

However, when a rescan was performed on the new hosts, ESX resignatured some of the LUNs. The existing ESX hosts lost connectivity to those datastores and the VMs on them shut down. We then had to manually edit the .vmx files to point the disks at the updated datastore locations. We had confirmed that the LUN IDs and SAN presentation were the same, so we couldn’t understand why ESX was resignaturing them.

I found a couple of VMware KB articles which described similar symptoms.

Root Cause

It turned out to be due to these two factors:

  • The affected datastores were originally formatted as VMFS2 (i.e. ESX 3.0) and had been upgraded at some point to VMFS3
  • The LVM.EnableResignature flag was enabled on the new ESX hosts but not on the old ones

Details

The messages log on the new ESX host showed that ESX detected the LUN as a snapshot, and that the LVM version was mismatched:

Sep 11 13:21:03 vmkernel: 5:20:20:13.259 cpu20:287470)LVM: 7455: Device naa.6006048000018712071753594d303739:1 detected to be a snapshot:

Sep 11 13:21:03 vmkernel: 5:20:20:13.471 cpu20:287470)WARNING: LVM: 2397: [naa.6782bcb0771cbe00167e415d0b0c0d90:3] LVM major version mismatch (device 5, current 3)

ESX then resignatured the LUN and prepended snap-037b03e4 to the datastore name:

Sep 11 13:21:04 vmkernel: 5:20:20:13.936 cpu20:287470)LVM: 7820: Device naa.6006048000018172071753594d303739:1 unsnapped

Sep 11 13:21:04 vmkernel: 5:20:20:13.939 cpu20:287470)LVM: 4900: Snapshot LV <snap-55656724-4d4b315c-1e142455-9e45-000423e569f1> successfully resignatured

Sep 11 13:21:04 vmkernel: 5:20:20:13.968 cpu20:287470)Vol3: 1070: [label: CORE-PROD-APP-012-DMX717-079, uuid: 4d4b315e-34cbb938-1f4a-000423e569f1] detected as a snapshot file system.

Sep 11 13:21:04 vmkernel: 5:20:20:13.970 cpu20:287470)Vol3: 871: Begin resignaturing volume label: CORE-PROD-APP-012-DMX717-079, uuid: 4d4b315e-34cbb938-1f4a-000423e569f1

Sep 11 13:21:04 vmkernel: 5:20:20:14.022 cpu20:287470)FS3: 6109: Marking HB [HB state abcdef02 offset 3354624 gen 969 stamp 12873297056919 uuid 4f8acb87-dce27b80-6e52-000423e56917 jrnl <FB 227000> drv 8.46] on vol ‘snap-037b03e4-CORE-PROD-APP-012-DMX717-079’

The hosts in the other cluster then lose connectivity because they are still looking for the original UUID, which causes VMs to lose disks or shut down.

VMware confirmed that LVM version 3 was from ESX 3.0, so this datastore was originally formatted as VMFS2 and has since been upgraded to VMFS 3.46. The old ESX hosts didn’t resignature the LUN because their LVM.EnableResignature flag was set to 0 (the default).

Why was the LVM.EnableResignature flag set to 1 on the new ESX hosts? It turns out that Site Recovery Manager (SRM) turns it on when it performs test failovers or failovers. According to VMware, “this is the only immediately-known automated cause for enabling the LVM.enableResignature flag on one or more hosts.”

http://kb.vmware.com/kb/2010051: Setting LVM.enableResignature =1 remains set after a SRM Test Failover

Checking for problem LUNs

VMware recommended that we do the following:

  • Check which hosts have LVM.EnableResignature set to 1 by running this command via SSH on all hosts:
    • esxcfg-advcfg -g /LVM/EnableResignature
  • Set this to 0 on all ESX hosts:
    • esxcfg-advcfg -s 0 /LVM/EnableResignature
  • Perform a rescan
  • Check the ESX messages log and grep for snapshot. This will show which datastores have (incorrectly) been detected as snapshot LUNs. Or, run this command:
    • esxcfg-volume -l
  • If any LUNs were wrongly detected as snapshots, Storage vMotion the VMs on them to other LUNs
  • Then either reformat the problem LUNs, or enable the resignature flag and rescan so that ESX resignatures them
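
The "grep for snapshot" step above can be scripted when there are many hosts to check. The following is only a sketch: the two regular expressions are assumptions derived from the log lines quoted in this post (other ESX builds may word the messages differently), and the function name is hypothetical.

```python
import re

# Assumed message shapes, copied from the vmkernel lines quoted in this post.
LVM_SNAPSHOT = re.compile(r"LVM: \d+: Device (\S+) detected to be a snapshot")
VOL_SNAPSHOT = re.compile(
    r"Vol3: \d+: \[label: ([^,]+), uuid: ([0-9a-f-]+)\] "
    r"detected as a snapshot file system"
)


def find_snapshot_detections(log_text):
    """Return (devices, volumes) that the log reports as snapshots.

    devices is a list of device names; volumes is a list of
    (label, uuid) tuples. Duplicates are dropped, order is preserved.
    """
    devices = []
    volumes = []
    for line in log_text.splitlines():
        match = LVM_SNAPSHOT.search(line)
        if match and match.group(1) not in devices:
            devices.append(match.group(1))
        match = VOL_SNAPSHOT.search(line)
        if match:
            volume = (match.group(1), match.group(2))
            if volume not in volumes:
                volumes.append(volume)
    return devices, volumes


# Two of the lines quoted earlier in this post, used as sample input.
sample = (
    "Sep 11 13:21:03 vmkernel: 5:20:20:13.259 cpu20:287470)LVM: 7455: "
    "Device naa.6006048000018712071753594d303739:1 detected to be a snapshot:\n"
    "Sep 11 13:21:04 vmkernel: 5:20:20:13.968 cpu20:287470)Vol3: 1070: "
    "[label: CORE-PROD-APP-012-DMX717-079, "
    "uuid: 4d4b315e-34cbb938-1f4a-000423e569f1] "
    "detected as a snapshot file system.\n"
)
devices, volumes = find_snapshot_detections(sample)
print(devices)
print(volumes)
```

Any device or volume this reports should still be confirmed with esxcfg-volume -l on the host before acting on it.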

Posted in Uncategorized.


Powering off an unresponsive VM

When you cannot power off a VM, or the VM is unresponsive and cannot be stopped or killed, use these articles:

Posted in Uncategorized.


Enabling CDP on vSwitches

CDP (Cisco Discovery Protocol) is enabled in listen-only mode by default. To also advertise (broadcast) CDP information to the physical switches, run the commands below for each vSwitch on the host.

To check CDP status:

  • esxcfg-vswitch -b vSwitch1

To set CDP to both listen and broadcast:

  • ESXi 4.x: esxcfg-vswitch -B both vSwitch1
  • ESXi 5.x: esxcli network vswitch standard set -c both -v vSwitch1

Repeat for all vSwitches.
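
Because the command has to be repeated for every vSwitch, it can help to generate the command lines up front. This is a minimal sketch that only builds the strings shown above; the function name and the vSwitch names passed to it are hypothetical, and it covers only the two ESXi versions mentioned in this post.

```python
VALID_MODES = ("down", "listen", "advertise", "both")


def cdp_commands(vswitches, esxi_major, mode="both"):
    """Build one CDP configuration command per vSwitch.

    esxi_major is 4 or 5, matching the two command forms in this post.
    """
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of: " + ", ".join(VALID_MODES))
    commands = []
    for name in vswitches:
        if esxi_major == 4:
            commands.append(f"esxcfg-vswitch -B {mode} {name}")
        elif esxi_major == 5:
            commands.append(
                f"esxcli network vswitch standard set -c {mode} -v {name}"
            )
        else:
            raise ValueError("only ESXi 4.x and 5.x are covered here")
    return commands


# Example vSwitch names; substitute the names listed by esxcfg-vswitch -l.
for command in cdp_commands(["vSwitch0", "vSwitch1"], esxi_major=5):
    print(command)
```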

VMware KB 1003885: Configuring the Cisco Discovery Protocol (CDP) with ESX/ESXi

Posted in Uncategorized.


Host takes a long time to reconnect to vCenter and performs slowly due to corrupt esx.conf

I noticed a number of my Dell PowerEdge 2900 ESX 3.5 U4 hosts would take a long time to reconnect to vCenter after a reboot, or would appear as ‘disconnected’ or ‘not responding’ for up to 4 hours.

When logging in to the console I see the following message: “configuration changes were not saved successfully during previous shutdown”.

An /etc/vmware/esx.conf.LOCK file is present, and sometimes other esx.conf.* files with random suffixes too. When I delete the .LOCK file it just comes back, and a host shutdown throws errors because the file is locked. When the host restarts I see the same message on login.

esxcfg-vswitch -l gives the following: “Listing failed for vswitch: vSwitch0, Error: Error interacting with configuration file /etc/vmware/esx.conf: Failed attempting to lock file. Another process has locked the file for more than 10 seconds. The process holding the lock is /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u (2237). This operation will complete if it is run again after the lock is released.”

The esx.conf file is huge (between 600 KB and 9 MB) and contains spurious entries for resource pools, all of which are empty.

The hostd.log file shows errors such as “Destroying unregistered VMkernel resource group ‘host/user/pool0/pool234/pool1’”, and top shows the hostd process running at 100% CPU for up to 4 hours while it works through the incorrect resource pools in esx.conf. A tail of /var/log/vmware/hostd.log shows it takes about 5 seconds to process each incorrect resource pool line.
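
As a rough sanity check on those figures, the reconnect delay can be estimated from the number of stale pool entries. The sketch below assumes that every stale pool produces one hostd.log line containing the message quoted above and that each takes the roughly 5 seconds observed; both function names are hypothetical.

```python
SECONDS_PER_POOL = 5  # approximate figure observed in hostd.log (see above)
MARKER = "Destroying unregistered VMkernel resource group"


def count_stale_pools(hostd_log_text):
    """Count log lines that report a stale resource pool being destroyed."""
    return sum(1 for line in hostd_log_text.splitlines() if MARKER in line)


def estimated_hours(pool_count, seconds_per_pool=SECONDS_PER_POOL):
    """Estimate how long hostd needs to work through pool_count entries."""
    return pool_count * seconds_per_pool / 3600


# Three copies of the quoted message stand in for a real hostd.log.
sample = (MARKER + " 'host/user/pool0/pool234/pool1'\n") * 3
print(count_stale_pools(sample))

# A 4-hour delay at 5 seconds per entry implies about this many entries:
pools_for_four_hours = 4 * 3600 // SECONDS_PER_POOL
print(pools_for_four_hours, estimated_hours(pools_for_four_hours))
```

At 5 seconds per entry, a 4-hour delay works out to roughly 2,880 stale entries, which is in line with an esx.conf containing thousands of empty resource pools.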

The clusters these hosts belong to used to have about 30 resource pools, which were removed recently. I checked a host in a cluster which never had resource pools: its esx.conf file looks fine, is only 40 KB in size, and that host doesn’t show these problems.

So it appears that somehow the esx.conf file has become corrupted, as it contains thousands of empty resource pools.  After finding nothing on Google I finally found a post from another user with the same problem.

VMware Support are looking into this issue and have confirmed it’s a bug. It will shortly be fixed in ESX v4.0 and they say ESX v3.5 will be fixed soon after. :-)

Posted in Issues.


Decoding Machine Check Error (MCE) / Purple Screen of Death (PSOD)

When a PSOD occurs with the error Machine Check Exception, this means there was a hardware fault which was caught by the CPU.

The following VMware articles help with finding the cause of MCEs:

VMware KB 1003560: Determining if virtual machine and ESX host unresponsiveness is caused by hardware issues
VMware KB 1006796: Extracting the log file after an ESX or ESXi host fails with a purple screen error
VMware KB 1005184: Decoding Machine Check Exception (MCE) output after a purple screen error

AMD CPUs

If the CPU is AMD you can use their MCAT (Machine Check Analysis Tool) to find the cause of a Machine Check Exception. You’ll need the machine check exception general status, bank status and bank address codes in hex, which can be found on the PSOD or in the vmkernel-1.log file.

If the above link doesn’t work, search for the latest version here: http://support.amd.com/us/Search/results.aspx?k=MCAT

Look at the MCE decoding example to see how to decode an MCE using the above articles.

Posted in Issues.


Machine Check Error (MCE) decoding example

The PSOD and the vmkernel-1.log file show the following:
28:06:50:01.381 cpu0:1283)ALERT: MCE: 579: Machine Check Exception
28:06:50:01.381 cpu0:1283)ALERT: MCE: 169: Machine Check Exception: General Status 0000000000000004
28:06:50:01.381 cpu0:1283)ALERT: MCE: 193: Machine Check Exception: Bank 0, Status b66d400000000135
28:06:50:01.381 cpu0:1283)ALERT: MCE: 226: Machine Check Exception: Bank 0, Addr 00000000c60e2be0, Valid TRUE
This occurred on CPU 0, and both the General Status register (MCG_STATUS) and the Bank 0 Status register (MC0_STATUS) are populated.
General Status 0x4 is 0100 in binary, so bit 2 is set: MCIP (machine check in progress).
b66d400000000135 in binary is 1011 0110 0110 1101 0100 0000 0000 0000 0000 0000 0000 0000 0000 0001 0011 0101
The 7 most significant bits (63 down to 57) are 1011 011:
Bit 63 = 1 – MC0_STATUS register contents are valid
Bit 62 = 0 – An overflow did not occur
Bit 61 = 1 – Error was not corrected
Bit 60 = 1 – Error checking was enabled
Bit 59 = 0 – Contents of the MC0_MISC register is INVALID
Bit 58 = 1 – Contents of the MC0_ADDR register is valid
Bit 57 = 1 – Processor context is corrupt – register values are unreliable
MCA Error code: Bits 0-15: 0000 0001 0011 0101
This matches the pattern 0000 0001 RRRR TTLL, whose mnemonic is {TT}CACHE{LL}_{RRRR}_ERR
Therefore it’s a Memory Hierarchy Error.
RRRR = 0011 – Operation was a data read
TT = 01 – Data
LL = 01 – Level 1
So the problem is a data read error in the Level 1 data cache of CPU 0.
The AMD MCAT tool confirms this:
C:\Program Files\AMD\MCAT>mcat /cmd 0 0xb66d400000000135 0x00000000c60e2be0 0 /ghx4
Processor Number : Unknown
Bank Number : 0
Time Stamp (0x): 00000000 00000000
Error Status (0x): B66D4000 00000135
Error Address (0x): 00000000 C60E2BE0
Error Misc (0x): 00000000 00000000
Status Bit Decode :
Correctable ECC error
Processor context corrupt
Error address valid
Error enable
Error uncorrected
Error valid
Error Code (0x): 0135
Error Type – Memory
Memory Transaction Type (RRRR) – Data read (DRD)
Transaction Type (TT) – Data
Cache Level (LL) – Level 1 (L1)
Bank 0 Data Cache Errors:
Data Load – A data error occured while accessing or managing data.
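
The manual decode above can also be reproduced in a few lines of code. This is only a sketch of the bit layout described in this post, not a replacement for MCAT: the lookup tables follow the general x86 machine check encoding for the 0000 0001 RRRR TTLL pattern, vendor-specific bits are ignored, and the function names are hypothetical.

```python
# Bits 63..57 of a bank status register, as listed in the decode above.
STATUS_BITS = [
    (63, "valid"),
    (62, "overflow"),
    (61, "uncorrected"),
    (60, "enabled"),
    (59, "misc_valid"),
    (58, "addr_valid"),
    (57, "context_corrupt"),
]
RRRR = {
    0b0000: "generic", 0b0001: "generic read", 0b0010: "generic write",
    0b0011: "data read", 0b0100: "data write", 0b0101: "instruction fetch",
    0b0110: "prefetch", 0b0111: "evict", 0b1000: "snoop",
}
TT = {0b00: "instruction", 0b01: "data", 0b10: "generic"}
LL = {0b00: "level 0", 0b01: "level 1", 0b10: "level 2", 0b11: "generic"}


def decode_general_status(value):
    """Decode the low three bits of the General Status register."""
    names = ["RIPV", "EIPV", "MCIP"]  # bits 0, 1, 2
    return [name for bit, name in enumerate(names) if (value >> bit) & 1]


def decode_bank_status(status):
    """Decode the flag bits and the 16-bit error code of a bank status."""
    result = {name: bool((status >> bit) & 1) for bit, name in STATUS_BITS}
    code = status & 0xFFFF
    result["error_code"] = code
    if (code >> 8) == 0x01:  # pattern 0000 0001 RRRR TTLL
        result["operation"] = RRRR.get((code >> 4) & 0xF, "unknown")
        result["transaction"] = TT.get((code >> 2) & 0x3, "unknown")
        result["cache_level"] = LL[code & 0x3]
    return result


general = decode_general_status(0x0000000000000004)
bank0 = decode_bank_status(0xB66D400000000135)
print(general)
for key, value in bank0.items():
    print(key, value)
```

For the values from this PSOD it reports MCIP set, an uncorrected error with a valid address and corrupt processor context, and a data read on the level 1 data cache, which agrees with the manual decode and the MCAT output.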

Posted in Uncategorized.