Nova Gives Up on VHD Coalesce


When your OpenStack deployment uses XenServer as the compute hypervisor, you may hit this error on instances that are frequently migrated across hypervisors.

If you check the nova-compute log, you will find an error like NovaException: VHD coalesce attempts exceeded. This happens because the xapi plugin is waiting for a coalesce to happen after the snapshot, and it times out.

Here is the detailed error that appears in the nova-compute log:

2017-09-13 12:58:57.807 1648 ERROR nova.virt.xenapi.vmops [req-89edd082-6755-447a-bce8-93464d812ac1 44d172fd7b0b4cfd836329ac8cf3e7b1 3dabca2caf6f48aca388d2117573459a - - -] [instance: 8c8a0f6e-7108-4202-bbd6-9e3dffda4688] _migrate_disk_resizing_up failed. Restoring orig vm due_to: VHD coalesce attempts exceeded (20), giving up....
2017-09-13 12:58:57.807 1648 ERROR nova.virt.xenapi.vmops [instance: 8c8a0f6e-7108-4202-bbd6-9e3dffda4688] Traceback (most recent call last):
2017-09-13 12:58:57.807 1648 ERROR nova.virt.xenapi.vmops [instance: 8c8a0f6e-7108-4202-bbd6-9e3dffda4688]   File "/usr/lib/python2.7/site-packages/nova/virt/xenapi/vmops.py", line 1184, in _migrate_disk_resizing_up
2017-09-13 12:58:57.807 1648 ERROR nova.virt.xenapi.vmops [instance: 8c8a0f6e-7108-4202-bbd6-9e3dffda4688]     self._session, instance, vm_ref, label) as root_vdi_uuids:
2017-09-13 12:58:57.807 1648 ERROR nova.virt.xenapi.vmops [instance: 8c8a0f6e-7108-4202-bbd6-9e3dffda4688]   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2017-09-13 12:58:57.807 1648 ERROR nova.virt.xenapi.vmops [instance: 8c8a0f6e-7108-4202-bbd6-9e3dffda4688]     return self.gen.next()
2017-09-13 12:58:57.807 1648 ERROR nova.virt.xenapi.vmops [instance: 8c8a0f6e-7108-4202-bbd6-9e3dffda4688]   File "/usr/lib/python2.7/site-packages/nova/virt/xenapi/vm_utils.py", line 673, in _snapshot_attached_here_impl
2017-09-13 12:58:57.807 1648 ERROR nova.virt.xenapi.vmops [instance: 8c8a0f6e-7108-4202-bbd6-9e3dffda4688]     vdi_uuid_chain)
2017-09-13 12:58:57.807 1648 ERROR nova.virt.xenapi.vmops [instance: 8c8a0f6e-7108-4202-bbd6-9e3dffda4688]   File "/usr/lib/python2.7/site-packages/nova/virt/xenapi/vm_utils.py", line 2098, in _wait_for_vhd_coalesce
2017-09-13 12:58:57.807 1648 ERROR nova.virt.xenapi.vmops [instance: 8c8a0f6e-7108-4202-bbd6-9e3dffda4688]     raise exception.NovaException(msg)
2017-09-13 12:58:57.807 1648 ERROR nova.virt.xenapi.vmops [instance: 8c8a0f6e-7108-4202-bbd6-9e3dffda4688] NovaException: VHD coalesce attempts exceeded (20), giving up...
2017-09-13 12:58:57.807 1648 ERROR nova.virt.xenapi.vmops [instance: 8c8a0f6e-7108-4202-bbd6-9e3dffda4688]
2017-09-13 12:58:57.850 1648 INFO nova.compute.manager [req-89edd082-6755-447a-bce8-93464d812ac1 44d172fd7b0b4cfd836329ac8cf3e7b1 3dabca2caf6f48aca388d2117573459a - - -] [instance: 8c8a0f6e-7108-4202-bbd6-9e3dffda4688] Setting instance back to ACTIVE after: Instance rollback performed due to: VHD coalesce attempts exceeded (20), giving up...
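
The "(20)" in the message is Nova's retry budget for this coalesce wait. In the Nova releases we were running, both the attempt count and the poll interval are configurable on the compute node; the option names below are the ones from the [xenserver] section of nova.conf (older releases used xenapi_-prefixed options under [DEFAULT]), so verify them against your own version before changing anything:

[xenserver]
vhd_coalesce_max_attempts = 20
vhd_coalesce_poll_interval = 5.0

Raising the attempt count only helps when the coalesce is merely slow. In our case the coalesce itself was failing on Dom0, so the real fix is further down.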

Next there is some work to do on the Dom0 that hosts the instance. First, check /var/log/SMlog on Dom0. You may find information like:

Sep 14 16:36:26 fatxs-1 SMGC: [19805] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
Sep 14 16:36:26 fatxs-1 SMGC: [19805]          ***********************
Sep 14 16:36:26 fatxs-1 SMGC: [19805]          *  E X C E P T I O N  *
Sep 14 16:36:26 fatxs-1 SMGC: [19805]          ***********************
Sep 14 16:36:26 fatxs-1 SMGC: [19805] coalesce: EXCEPTION <class 'util.CommandException'>, Invalid argument
Sep 14 16:36:26 fatxs-1 SMGC: [19805]   File "/opt/xensource/sm/cleanup.py", line 1542, in coalesce
Sep 14 16:36:26 fatxs-1 SMGC: [19805]     self._coalesce(vdi)
Sep 14 16:36:26 fatxs-1 SMGC: [19805]   File "/opt/xensource/sm/cleanup.py", line 1732, in _coalesce
Sep 14 16:36:26 fatxs-1 SMGC: [19805]     vdi._doCoalesce()
Sep 14 16:36:26 fatxs-1 SMGC: [19805]   File "/opt/xensource/sm/cleanup.py", line 692, in _doCoalesce
Sep 14 16:36:26 fatxs-1 SMGC: [19805]     self.parent._increaseSizeVirt(self.sizeVirt)
Sep 14 16:36:26 fatxs-1 SMGC: [19805]   File "/opt/xensource/sm/cleanup.py", line 890, in _increaseSizeVirt
Sep 14 16:36:26 fatxs-1 SMGC: [19805]     self._setSizeVirt(size)
Sep 14 16:36:26 fatxs-1 SMGC: [19805]   File "/opt/xensource/sm/cleanup.py", line 905, in _setSizeVirt
Sep 14 16:36:26 fatxs-1 SMGC: [19805]     vhdutil.setSizeVirt(self.path, size, jFile)
Sep 14 16:36:26 fatxs-1 SMGC: [19805]   File "/opt/xensource/sm/vhdutil.py", line 228, in setSizeVirt
Sep 14 16:36:26 fatxs-1 SMGC: [19805]     ioretry(cmd)
Sep 14 16:36:26 fatxs-1 SMGC: [19805]   File "/opt/xensource/sm/vhdutil.py", line 102, in ioretry
Sep 14 16:36:26 fatxs-1 SMGC: [19805]     errlist = [errno.EIO, errno.EAGAIN])
Sep 14 16:36:26 fatxs-1 SMGC: [19805]   File "/opt/xensource/sm/util.py", line 292, in ioretry
Sep 14 16:36:26 fatxs-1 SMGC: [19805]     return f()
Sep 14 16:36:26 fatxs-1 SMGC: [19805]   File "/opt/xensource/sm/vhdutil.py", line 101, in <lambda>
Sep 14 16:36:26 fatxs-1 SMGC: [19805]     return util.ioretry(lambda: util.pread2(cmd),
Sep 14 16:36:26 fatxs-1 SMGC: [19805]   File "/opt/xensource/sm/util.py", line 189, in pread2
Sep 14 16:36:26 fatxs-1 SMGC: [19805]     return pread(cmdlist, quiet = quiet)
Sep 14 16:36:26 fatxs-1 SMGC: [19805]   File "/opt/xensource/sm/util.py", line 182, in pread
Sep 14 16:36:26 fatxs-1 SMGC: [19805]     raise CommandException(rc, str(cmdlist), stderr.strip())
Sep 14 16:36:26 fatxs-1 SMGC: [19805]
Sep 14 16:36:26 fatxs-1 SMGC: [19805] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
Sep 14 16:36:26 fatxs-1 SMGC: [19805] Coalesce failed, skipping
Sep 14 16:36:26 fatxs-1 SMGC: [19805] In cleanup
Sep 14 16:36:26 fatxs-1 SMGC: [19805] Starting asynch srUpdate for SR 0995f66a-6188-ad63-4d00-117498b7fd46
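
A quick way to spot these coalesce failures without reading the whole file is to grep for the markers shown above (keep in mind SMlog is rotated, so older entries may live in SMlog.1 and friends):

# grep -A3 "E X C E P T I O N" /var/log/SMlog
# grep "Coalesce failed" /var/log/SMlog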

To fix a corrupted VHD whose coalesce keeps failing, repair the VHD using the vhd-util tool.

Get the VHD information

# vmuuid=$(xe vm-list name-label=<INSTANCE NAME> --minimal)
# vbduuid=$(xe vbd-list vm-uuid=$vmuuid --minimal)
# xe vdi-list vbd-uuids=$vbduuid

uuid ( RO)                : 91499ab6-63e9-47cf-b73a-b7fc2b09dcc7
          name-label ( RW): instance-000000c4
    name-description ( RW): root
             sr-uuid ( RO): 0995f66a-6188-ad63-4d00-117498b7fd46
        virtual-size ( RO): 10737418240
            sharable ( RO): false
           read-only ( RO): false
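
If the instance has more than one VBD (extra data disks or an attached ISO), the vbd-list call above returns a comma-separated list of UUIDs and you have to pick the root disk out of it. A hedged shortcut, assuming the root disk is plugged as userdevice 0 (the usual layout for Nova-built instances):

# vdiuuid=$(xe vbd-list vm-uuid=$vmuuid userdevice=0 params=vdi-uuid --minimal)
# xe vdi-list uuid=$vdiuuid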

Then check the VHD condition using vhd-util check:

vhd-util check -n /var/run/sr-mount/<sr-uuid>/<vdi-uuid>.vhd

Example:

# vhd-util check -n /var/run/sr-mount/0995f66a-6188-ad63-4d00-117498b7fd46/91499ab6-63e9-47cf-b73a-b7fc2b09dcc7.vhd
primary footer invalid: invalid cookie
/var/run/sr-mount/0995f66a-6188-ad63-4d00-117498b7fd46/91499ab6-63e9-47cf-b73a-b7fc2b09dcc7.vhd appears invalid; dumping metadata
VHD Footer Summary:
-------------------
Cookie              : conectix
Features            : (0x00000002) <RESV>
File format version : Major: 1, Minor: 0
Data offset         : 512
Timestamp           : Thu Sep 14 15:48:19 2017
Creator Application : 'tap'
Creator version     : Major: 1, Minor: 3
Creator OS          : Unknown!
Original disk size  : 3072 MB (3221225472 Bytes)
Current disk size   : 10240 MB (10737418240 Bytes)
Geometry            : Cyl: 20805, Hds: 16, Sctrs: 63
                    : = 10239 MB (10737377280 Bytes)
Disk type           : Differencing hard disk
Checksum            : 0xffffed5b|0xffffed5b (Good!)
UUID                : 55fd3e99-5cf8-4dfd-97f5-96851bf7f51b
Saved state         : No
Hidden              : 0

VHD Header Summary:
-------------------
Cookie              : cxsparse
Data offset (unusd) : 18446744073709
Table offset        : 1536
Header version      : 0x00010000
Max BAT size        : 5120
Block size          : 2097152 (2 MB)
Parent name         : 16f02a7d-4e4b-4298-a44a-9cc155cbd2b1.vhd
Parent UUID         : 2b0565ca-4f72-4471-a0e9-c7f2b7460491
Parent timestamp    : Thu Sep 14 15:48:19 2017
Checksum            : 0xffffd93a|0xffffd93a (Good!)

VHD Parent Locators:
--------------------
locator:            : 0
       code         : PLAT_CODE_MACX
       data_space   : 512
       data_length  : 49
       data_offset  : 23552
       decoded name : ./16f02a7d-4e4b-4298-a44a-9cc155cbd2b1.vhd

locator:            : 1
       code         : PLAT_CODE_W2KU
       data_space   : 512
       data_length  : 84
       data_offset  : 24064
       decoded name : ./16f02a7d-4e4b-4298-a44a-9cc155cbd2b1.vhd

locator:            : 2
       code         : PLAT_CODE_W2RU
       data_space   : 512
       data_length  : 84
       data_offset  : 24576
       decoded name : ./16f02a7d-4e4b-4298-a44a-9cc155cbd2b1.vhd

VHD Batmap Summary:
-------------------
Batmap offset       : 22528
Batmap size (secs)  : 2
Batmap version      : 0x00010002
Checksum            : 0xfffff312|0xfffff312 (Good!)
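
The footer summary also shows that this is a differencing disk with a parent (Parent name: 16f02a7d-4e4b-4298-a44a-9cc155cbd2b1.vhd), and the parent locators resolve to the same SR directory. Since a coalesce merges a child into its parent, it is worth running the same check against the parent file as well:

# vhd-util check -n /var/run/sr-mount/0995f66a-6188-ad63-4d00-117498b7fd46/16f02a7d-4e4b-4298-a44a-9cc155cbd2b1.vhd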

Repair the VHD
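
vhd-util repair works on the file in place (it rewrites the damaged footer), so if you want a safety net, take a copy of the VHD first. A simple precaution, assuming you have somewhere with enough free space (here /root, as an example) and, ideally, that the instance is stopped so the file is not changing underneath you:

# cp /var/run/sr-mount/<sr-uuid>/<vdi-uuid>.vhd /root/<vdi-uuid>.vhd.bak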

vhd-util repair -n /var/run/sr-mount/<sr-uuid>/<vdi-uuid>.vhd

Example:

# vhd-util repair -n /var/run/sr-mount/0995f66a-6188-ad63-4d00-117498b7fd46/91499ab6-63e9-47cf-b73a-b7fc2b09dcc7.vhd

Check the VHD condition again:

# vhd-util check -n /var/run/sr-mount/0995f66a-6188-ad63-4d00-117498b7fd46/91499ab6-63e9-47cf-b73a-b7fc2b09dcc7.vhd
/var/run/sr-mount/0995f66a-6188-ad63-4d00-117498b7fd46/91499ab6-63e9-47cf-b73a-b7fc2b09dcc7.vhd is valid

If the vhd-util check result is OK, you can try to migrate the instance again.
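
The coalesce itself is performed by the storage garbage collector on Dom0, so once the VHD checks out you can also trigger an SR scan to make it retry right away instead of waiting for the next snapshot or migration (the UUID is the sr-uuid from the vdi-list output above):

# xe sr-scan uuid=0995f66a-6188-ad63-4d00-117498b7fd46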

References:

  • http://lists.openstack.org/pipermail/openstack/2013-March/041259.html
  • https://support.citrix.com/article/CTX201296
