This blog shares our knowledge and expertise on Linux System Administration and VMware Administration

Wednesday, February 17, 2016

An ESXi 5.x host running on HP server fails with a purple diagnostic screen and the error: hpsa_update_scsi_devices or detect_controller_lockup_thread

Whenever you find the symptoms below:

    Cannot run the host on Hewlett Packard (HP) hardware
    Running the host on HP hardware fails with a purple diagnostic screen
    You see the error:

    hpsa_update_scsi_devices@<None>#<None>+0x39c
    hpsa_scan_start@<None>#<None>+0x187
    hpsa_kickoff_rescan@<None>#<None>+0x20f
    kthread@com.vmware.driverAPI#9.2+0x185
    LinuxStartFunc@com.vmware.driverAPI#9.2+0x97
    vmkWorldFunc@vmkernel#nover+0x83
    CpuSched_StartWorld@vmkernel#nover+0xfa
    Your host fails with a purple diagnostic screen and you see the error:

    Panic: 892: Saved backtrace: pcpu X TLB NMI
    _raw_spin_failed@com.vmware.driverAPI#9.2+0x5
    detect_controller_lockup_thread@#+0x3a9
     kthread@com.vmware.driverAPI#9.2+0x185
     LinuxStartFunc@com.vmware.driverAPI#9.2+0x97
     vmkWorldFunc@vmkernel#nover+0x83                
     CpuSched_StartWorld@vmkernel#nover+0xfa
     PCPU X locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): X)
    Before the host becomes unresponsive, in the /var/log/vmkernel.log file, you see entries similar to:

    WARNING: LinDMA: Linux_DMACheckConstraints:149: Cannot map machine address = 0xfffffffffff, length = 49160 for device 0000:03:00.0; reason = buffer straddles device dma boundary (0xffffffff)
    WARNING: Heap: 4089: Heap_Align(vmklnx_hpsa, 32768/32768 bytes, 8 align) failed.  caller: 0x41802dcb1f91
    cpu4:1696102)<4>hpsa 0000:09:00.0: out of memory in adjust_hpsa_scsi_table
    Before you see a purple diagnostic screen, in the /var/log/vmkernel.log file, you see entries similar to:

    Note: These are multiple memory error messages from the hpsa driver.

    out of memory at vmkdrivers/src_9/drivers/hpsa/hpsa.c:3562
    out of memory at vmkdrivers/src_9/drivers/hpsa/hpsa.c:3562
    out of memory at vmkdrivers/src_9/drivers/hpsa/hpsa.c:3562
    out of memory at vmkdrivers/src_9/drivers/hpsa/hpsa.c:3562
    WARNING: Heap: 3622: Heap vmklnx_hpsa (39113576/39121768): Maximum allowed growth (8192) too small for size (20480)
    cpu7:1727675)<4>hpsa 0000:06:00.0: out of memory at vmkdrivers/src_9/drivers/hpsa/hpsa.c:3562
    cpu2:1727677)<4>hpsa 0000:0c:00.0: out of memory at vmkdrivers/src_9/drivers/hpsa/hpsa.c:3562
    cpu4:1727676)<4>hpsa 0000:09:00.0: out of memory at vmkdrivers/src_9/drivers/hpsa/hpsa.c:3562
    cpu3:1727738)WARNING: LinDMA: dma_alloc_coherent:726: Out of memory
    cpu3:1727738)<3>hpsa 0000:06:00.0: cmd_special_alloc returned NULL!

Resolution

This is a known issue affecting VMware ESXi 5.x.

To resolve this issue, apply the updated driver supplied by HP. Always check the HCL to determine the latest available driver update.
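
Before engaging support, it can help to confirm which hpsa driver VIB is currently installed on the host. Below is a minimal PowerCLI sketch; the host name is a placeholder and the exact VIB name can vary by driver release (typically scsi-hpsa), so the filter matches loosely:

# Connect-VIServer to your vCenter Server or host first.
$esxcli = Get-EsxCli -VMHost "esx01.example.com"
# List installed VIBs and keep only the hpsa driver entries.
$esxcli.software.vib.list() | Where-Object { $_.Name -match "hpsa" } |
    Select-Object Name, Version, Vendor, InstallDate

Compare the reported Version with the latest driver listed on the HCL for your controller.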

Note: For all BL685c G7 blades and DL360p Gen8 servers, HP recommends updating to the June 2014 release of ESXi 5.5 Update 1.

The reasons for the recommendation are:

    The smx-provider memory leak issue is resolved.
    Several hpsa driver issues are resolved in the .60 driver version included in the June 2014 release of ESXi 5.5 Update 1; the previous .50 driver version was problematic.

For the DL360p Gen8 servers, the iLO firmware needs to be checked. If the iLO firmware is not at 1.51, it is recommended to update the firmware on all servers to 1.51. This is a critical update to avoid NMI events, which could cause a PSOD in your environment.

It is also recommended to check the DL360p Gen8 servers to make sure they are running at least the February 2014 system ROM. This corrects a possible IPMI issue.

If this issue persists after the driver upgrade:

    Open an HP Support Request and reference HP case 4648045806.
    If the issue still persists, open a support request with VMware Support.
    Provide VMware Support with your HP case number.

ESXi 5.0 host experiences a purple diagnostic screen with the errors "Failed to ack TLB invalidate" or "no heartbeat" on HP servers with PCC support

Whenever an ESXi 5.0 host fails with a purple diagnostic screen:

The purple diagnostic screen or core dump contains messages similar to:

PCPU 39 locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): 39).
0x41228efc7b88:[0x41800646cd62]Panic@vmkernel#nover+0xa9 stack: 0x41228efe5000
0x41228efc7cb8:[0x4180064989af]TLBDoInvalidate@vmkernel#nover+0x45a stack: 0x41228efc7ce8

@BlueScreen: PCPU 0: no heartbeat, IPIs received (0/1)....

0x4122c27c7a68:[0x41800966cd62]Panic@vmkernel#nover+0xa9 stack: 0x4122c27c7a98
0x4122c27c7ad8:[0x4180098d80ec]Heartbeat_DetectCPULockups@vmkernel#nover+0x2d3 stack: 0x0

NMI: 1943: NMI IPI received. Was eip(base):ebp:cs [0x7eb2e(0x418009600000):0x4122c2307688:0x4010](Src 0x1, CPU140)

Heartbeat: 618: PCPU 140 didn't have a heartbeat for 8 seconds. *may* be locked up

The cause may be that some HP servers experience a situation where the PCC (Processor Clocking Control, also known as Collaborative Power Control) communication between the VMware ESXi kernel (VMkernel) and the server BIOS does not function correctly.
As a result, one or more PCPUs may remain in SMM (System Management Mode) for many seconds. When the VMkernel notices a PCPU is not available for an extended period of time, a purple diagnostic screen occurs.

Resolution

This issue has been resolved as of ESXi 5.0 Update 2 as PCC is disabled by default.
To work around this issue in versions prior to ESXi 5.0 U2, disable PCC manually.
To disable PCC:

    Connect to the ESXi host using the vSphere Client.
    Click the Configuration tab.
    In the Software menu, click Advanced Settings.
    Select VMkernel.
    Deselect the VMkernel.Boot.usePCC option.
    Restart the host for the change to take effect.
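
If you prefer to script the change, the same option can usually be toggled with PowerCLI. A minimal sketch, assuming the option is exposed as the host advanced setting VMkernel.Boot.usePCC (the host name is a placeholder):

# Disable PCC on one host; a reboot is still required afterwards.
Get-VMHost "esx01.example.com" |
    Get-AdvancedSetting -Name "VMkernel.Boot.usePCC" |
    Set-AdvancedSetting -Value $false -Confirm:$false

Remember that ESXi 5.0 Update 2 and later already ship with PCC disabled by default, so this is only needed on earlier builds.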

Tuesday, December 29, 2015

Differences between upgraded and newly created VMFS-5 datastores:

  • VMFS-5 upgraded from VMFS-3 continues to use the previous file block size, which may be larger than the unified 1MB file block size. Copy operations between datastores with different block sizes won't be able to leverage VAAI. This is the primary reason I recommend creating new VMFS-5 datastores and migrating virtual machines to them rather than performing in-place upgrades of VMFS-3 datastores.
  • VMFS-5 upgraded from VMFS-3 continues to use 64KB sub-blocks and not new 8K sub-blocks.
  • VMFS-5 upgraded from VMFS-3 continues to have a file limit of 30,720 rather than the new file limit of > 100,000 for newly created VMFS-5.
  • VMFS-5 upgraded from VMFS-3 continues to use MBR (Master Boot Record) partition type; when the VMFS-5 volume is grown above 2TB, it automatically switches from MBR to GPT (GUID Partition Table) without impact to the running VMs.
  • VMFS-5 upgraded from VMFS-3 will continue to have a partition starting on sector 128; newly created VMFS-5 partitions start at sector 2,048.

Based on the information above, the best approach to migrate to VMFS-5 is to create new VMFS-5 datastores, provided you have the extra storage space, can afford the number of Storage vMotions required, and have a VAAI-capable storage array holding existing datastores with 2, 4, or 8MB block sizes.
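
To see which existing datastores would benefit, you can list the VMFS version and block size of each one. A minimal PowerCLI sketch, assuming the usual property path exposed through ExtensionData for VMFS datastores:

# Show VMFS version and block size for every VMFS datastore in the inventory.
Get-Datastore | Where-Object { $_.Type -eq "VMFS" } |
    Select-Object Name,
        @{N="VmfsVersion"; E={ $_.ExtensionData.Info.Vmfs.Version }},
        @{N="BlockSizeMB"; E={ $_.ExtensionData.Info.Vmfs.BlockSizeMb }}

Upgraded VMFS-5 volumes keep their old 2, 4, or 8MB block size, while newly created VMFS-5 volumes report 1MB.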

Difference between VMFS 3 and VMFS 5 -- Part1

  • This post explains the major differences between VMFS 3 and VMFS 5. VMFS 5 is available as part of vSphere 5 and introduces a lot of performance enhancements.
  • A newly installed ESXi 5 host is formatted with VMFS 5, but if you have upgraded ESX 4.0 or ESX 4.1 to ESXi 5, the datastore version remains VMFS 3.
  • You will be able to upgrade VMFS 3 to VMFS 5 via the vSphere Client once the ESXi upgrade is complete.




How to Identify the virtual machines with Raw Device Mappings (RDMs) using PowerCLI

Open the vSphere PowerCLI command-line.
Run the command:

Get-VM | Get-HardDisk -DiskType "RawPhysical","RawVirtual" | Select Parent,Name,DiskType,ScsiCanonicalName,DeviceName | fl

This command produces a list of virtual machines with RDMs, along with the backing SCSI device for the RDMs.

The output looks similar to:

Parent            : Virtual Machine Display Name
Name              : Hard Disk n
DiskType          : RawVirtual
ScsiCanonicalName : naa.646892957789abcdef0892957789abcde
DeviceName        : vml.020000000060912873645abcdef0123456789abcde9128736450ab

If you need to save the output to a file, the command can be modified:

Get-VM | Get-HardDisk -DiskType "RawPhysical","RawVirtual" | Select Parent,Name,DiskType,ScsiCanonicalName,DeviceName | fl | Out-File -FilePath RDM-list.txt

Identify the backing SCSI device from either the ScsiCanonicalName or DeviceName identifiers.
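
To see which LUN actually backs an RDM, the ScsiCanonicalName from the output can be passed to Get-ScsiLun. A short sketch, using the example naa identifier above and a placeholder host name:

# Resolve the canonical name to the LUN as seen by a host.
Get-VMHost "esx01.example.com" |
    Get-ScsiLun -CanonicalName "naa.646892957789abcdef0892957789abcde" |
    Select-Object CanonicalName, Vendor, Model, CapacityGB, MultipathPolicy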

Snapshot consolidation "error: maximum consolidate retries was exceeded for scsix:x"

Whenever you cannot perform a snapshot consolidation in VMware ESXi 5.5 or ESXi 6.0.x, or performing a snapshot consolidation in ESXi 5.5 fails.

or

When attempting to consolidate snapshots using the vSphere Client, you see the error:

maximum consolidate retries was exceeded for scsix:x

Consolidate Disks message: The virtual machine has exceeded the maximum downtime of 12 seconds for disk consolidation.

 This issue occurs because ESXi 5.5 introduced a different behavior to prevent the virtual machine from being stunned for an extended period of time.

This message is reported if the virtual machine is powered on and the asynchronous consolidation fails after 10 iterations. An additional iteration is performed if the estimated stun time is over 12 seconds. This occurs when the virtual machine generates data faster than the consolidation rate.

To resolve this issue, turn off the snapshot consolidation enhancement in ESXi 5.5 and ESXi 6.0.x so that it works like earlier versions of ESX/ESXi. This can be done by setting snapshot.asyncConsolidate.forceSync to TRUE.

  Note: If the parameter is set to TRUE, the virtual machine is stunned for a long time to perform the snapshot consolidation, and it may not respond to pings during the consolidation.

To set the parameter snapshot.asyncConsolidate.forceSync to TRUE using the vSphere client:

  • Shut down the virtual machine.
  • Right-click the virtual machine and click Edit settings.
  • Click the Options tab.
  • Under Advanced, right-click General.
  • Click Configuration Parameters, then click Add Row.
  • In the left pane, add this parameter: snapshot.asyncConsolidate.forceSync
  • In the right pane, add this value: TRUE
  • Click OK to save your change, and power on the virtual machine.

To set the parameter snapshot.asyncConsolidate.forceSync to TRUE without shutting down the virtual machine, run this PowerCLI command:

Get-VM virtual_machine_name | New-AdvancedSetting -Name snapshot.asyncConsolidate.forceSync -Value TRUE -Confirm:$False
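
To confirm the change, or to revert it later, the same advanced setting can be read back or removed with PowerCLI. A short sketch (cmdlet behaviour may vary slightly between PowerCLI releases):

# Verify the current value of the setting.
Get-VM virtual_machine_name | Get-AdvancedSetting -Name snapshot.asyncConsolidate.forceSync

# Remove the setting again once consolidation has completed, if desired.
Get-VM virtual_machine_name | Get-AdvancedSetting -Name snapshot.asyncConsolidate.forceSync |
    Remove-AdvancedSetting -Confirm:$false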

How to resolve: Cannot take a quiesced snapshot of a Windows 2008 R2 virtual machine

When creating a snapshot on a Windows 2008 R2 virtual machine on ESXi/ESX 4.1 and later versions, you may experience these symptoms:
  • The snapshot operation fails to complete.
  • Unable to create a quiesced snapshot of the virtual machine.
  • Unable to back up the virtual machine.
  • Cloning a Windows 2008 R2 virtual machine fails.
  • In the Application section of the Event Viewer in the virtual machine, the Windows guest operating system reports a VSS error similar to:
           Volume Shadow Copy Service error: Unexpected error calling routine IOCTL_DISK_SET_SNAPSHOT_INFO(\\.\PHYSICALDRIVE1) fails with winerror 1168. hr = 0x80070490, Element not found.
  •  Any process that creates a quiesced snapshot fails.
  •  You see the error:
    Can not create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine.

Backup applications, such as VMware Data Recovery, fail. You see the error:

  • Failed to create snapshot for vmname, error -3960 (cannot quiesce virtual machine)

This is a known issue with VSS application snapshots which is not caused by VMware software. It affects ESXi/ESX 4.1 and later versions. Currently, there is no resolution.

To work around this issue, disable VSS quiesced application-based snapshots and revert to file system quiesced snapshots. You can disable VSS application quiescing with either the VMware vSphere Client or with VMware Tools. Use one of these procedures:
 Disable VSS application quiescing using the vSphere Client:
  •  Power off the virtual machine.
  •  Log in to the vCenter Server or the ESXi/ESX host through the vSphere Client.
  •  Right-click the virtual machine and click Edit settings.
  •  Click the Options tab.
  •  Navigate to Advanced > General > Configuration Parameters.
  •  Add or modify the row disk.EnableUUID with the value FALSE.
  •  Click OK to save.
  •  Click OK to exit.
  •  Reboot the virtual machine for the changes to take effect.
Note: If this change is made through the command line using a text editor, running the vim-cmd command to reload the .vmx file is enough for the change to take effect.
Alternatively, un-register the virtual machine from the vCenter Server inventory: right-click the virtual machine and click Remove from Inventory, then re-register the virtual machine back to the inventory. A PowerCLI alternative to the vSphere Client steps is sketched below.
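
The same row can usually be added with PowerCLI, mirroring the forceSync example earlier in this blog. A minimal sketch; power off the virtual machine first, and note that -Force is assumed to overwrite an existing disk.EnableUUID row if one is already present:

# Set disk.EnableUUID to FALSE on a powered-off virtual machine.
Get-VM virtual_machine_name |
    New-AdvancedSetting -Name disk.EnableUUID -Value FALSE -Force -Confirm:$false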

Disable VSS application quiescing using VMware Tools:

  • Open the C:\ProgramData\VMware\VMware Tools\Tools.conf file in a text editor, such as Notepad. If the file does not exist, create it.
  • Add these lines to the file:
            [vmbackup]
            vss.disableAppQuiescing = true
  • Save and close the file.
  • Restart the VMware Tools service for the changes to take effect: click Start > Run, type services.msc, and click OK, then right-click the VMware Tools service and click Restart.
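
The same Tools.conf change can be scripted from inside the guest. A minimal PowerShell sketch, assuming the default Tools.conf location and that the VMware Tools service is registered under the name VMTools:

# Run inside the Windows guest from an elevated PowerShell prompt.
$conf = 'C:\ProgramData\VMware\VMware Tools\Tools.conf'
if (-not (Test-Path $conf)) { New-Item -Path $conf -ItemType File -Force | Out-Null }
# Append the vmbackup section that disables application quiescing.
Add-Content -Path $conf -Value '[vmbackup]', 'vss.disableAppQuiescing = true'
# Restart the VMware Tools service so the change takes effect.
Restart-Service -Name 'VMTools'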