Proxmox: After Migrating VMs to node1, All VM Consoles Broke (“Failed to run vncproxy”) — Full Diagnosis and Fix Log

This post documents the full troubleshooting process (what I checked and why), not just the final workaround.

Environment

There is a dedicated 10G internal network between the cluster nodes and the NAS:

  • node1: 172.16.0.20
  • node2: 172.16.0.21
  • NAS2 (Synology): 172.16.0.10

The NAS also has another reachable service IP:

  • NAS2 service IP: 192.168.10.132

0. Symptoms

After migrating VMs from node2 to node1:

  • No VM console could be opened in the Proxmox Web UI.
  • The UI error was: Failed to run vncproxy

Additional observations:

  • The node-level shell console (node1 → Shell) still worked.

  • Running qm vncproxy <vmid> manually on node1 returned:

    • LC_PVE_TICKET not set, VNC proxy without password is forbidden

    This is expected when calling qm vncproxy outside the Web UI/API context, because it needs a PVE ticket. It was not the root cause.


1. First Checks: Is Proxmox Itself Healthy?

1.1 PVE services

systemctl status pveproxy pvedaemon pve-cluster --no-pager

What I saw:

  • pveproxy, pvedaemon, and pve-cluster (pmxcfs) were all active.
  • pveproxy logs frequently showed: proxy detected vanished client connection (often appears when a console connection drops unexpectedly).

1.2 Cluster / quorum

pvecm status

What I saw:

  • node1 still had quorum (Quorate: Yes).
  • Although the cluster was running in a reduced state (node1 + qdevice), pve-cluster itself was functioning.

Conclusion so far: this was not a “cluster is down” situation.


2. Console Infrastructure Check: termproxy / vncproxy and Ports

2.1 Check if termproxy/vncproxy are running and listening

ss -ltnp | grep -E ':(59[0-9]{2})\b'
ps aux | grep -E 'termproxy|vncproxy' | grep -v grep

What I saw:

  • termproxy appeared (for example, termproxy 5900 ...).
  • But VM consoles still failed.

Conclusion so far: the proxy layer existed, but something deeper prevented it from completing.


3. Find the Real Error: Why Does vncproxy Fail?

3.1 Read recent logs from pvedaemon and pveproxy

journalctl -u pvedaemon -u pveproxy --since "10 min ago" -l --no-pager | tail -200

Key pattern in the logs:

  • Many VMs produced repeated errors like:
    • VM <id> qmp command failed ... unable to connect to VM <id> qmp socket - timeout after 51 retries
  • When attempting a console, the sequence looked like:
    • starting vnc proxy ...
    • qmp command 'set_password' failed ... unable to connect to VM <id> qmp socket ...
    • Failed to run vncproxy.

Conclusion: the console wasn’t failing because “VNC proxy can’t start”; it failed because Proxmox could not talk to the VM’s QEMU via QMP in order to set VNC credentials.
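
To see quickly which VMs are affected, the QMP error pattern above can be pulled out of the journal. This is a minimal sketch: the function name qmp_failed_vmids is my own, and it only assumes the "unable to connect to VM <id> qmp socket" wording shown in the logs above.

```shell
# Read log lines on stdin and print the unique VM IDs that hit a QMP socket error.
qmp_failed_vmids() {
    grep -oE 'unable to connect to VM [0-9]+ qmp socket' \
        | grep -oE '[0-9]+' \
        | sort -un
}

# Example use on a node:
#   journalctl -u pvedaemon -u pveproxy --since "10 min ago" --no-pager | qmp_failed_vmids
```

If the list contains most or all VMs on the node, the problem is unlikely to be specific to one guest.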


4. The Contradiction: VM “running”, QMP sockets exist, but QMP is unreachable

Using VM 502 as an example.

4.1 Verify VM status and QMP/VNC socket files

qm status 502
ls -l /run/qemu-server/502.qmp /run/qemu-server/502.vnc 2>/dev/null || true

What I saw:

  • qm status 502 reported running.
  • /run/qemu-server/502.qmp and /run/qemu-server/502.vnc existed.

4.2 Try QEMU monitor

Because my qm version did not support --cmd/--command, I entered the interactive monitor:

qm monitor 502
# qm> info status

What I saw:

  • Even the monitor command failed:
    • human-monitor-command failed due to QMP timeout.

4.3 Check whether the qemu process actually exists

pgrep -af "(kvm|qemu).*-id 502( |$)" || ps -ef | grep -E "(kvm|qemu).*-id 502( |$)" | grep -v grep

What I saw:

  • No qemu process for VM 502 was found.
  • Yet Proxmox believed the VM was running and the socket files existed.

Conclusion: this strongly suggested a stuck or inconsistent runtime state, often caused by underlying storage/IO problems or stale mounts affecting how Proxmox tracks VM runtime state.
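
The same cross-check can be done for all VMs at once by comparing the IDs Proxmox reports as running with the IDs that have a live process. The sketch below only does the comparison. The function name is hypothetical, and the commands in the usage comment assume the usual qm list column layout (VMID, NAME, STATUS) and a "-id <vmid>" argument on the process command line.

```shell
# Print every VM ID from the first list that is missing from the second list.
#   $1: IDs Proxmox reports as running (whitespace-separated)
#   $2: IDs that have a live QEMU process (whitespace-separated)
vmids_without_process() {
    live=" $(printf '%s ' $2)"      # normalise to " id id id "
    for id in $1; do
        case "$live" in
            *" $id "*) ;;           # a process exists for this ID
            *) echo "$id" ;;        # reported running, but no process found
        esac
    done
}

# Example use on a node (assumed output formats, adjust as needed):
#   reported=$(qm list | awk '$3 == "running" { print $1 }')
#   live=$(pgrep -af '(kvm|qemu).*-id [0-9]+' | sed -E 's/.* -id ([0-9]+).*/\1/')
#   vmids_without_process "$reported" "$live"
```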


5. Secondary Clue: a shared NFS storage looked wrong on node1

I noticed something else that correlated with the timing:

  • A shared NFS storage (nas2-pxshare) showed correct capacity on node2.
  • On node1, it appeared as a grey question mark in the UI.

5.1 Confirm storage status from CLI

timeout 5 pvesm status || echo "pvesm status timeout"

What I saw:

  • nas2-pxshare was inactive on node1 (0 total/used/available).

5.2 Confirm the storage definition

grep -n "nas2-pxshare" -A6 -B2 /etc/pve/storage.cfg

At that time the definition was intended to be:

  • server: 172.16.0.10
  • export: /volume1/PxShare
  • path: /mnt/pve/nas2-pxshare
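
For scripting, the configured server can be extracted from storage.cfg rather than read by eye. This is a small sketch, assuming the usual storage.cfg layout of a "type: id" header line followed by indented "key value" lines; storage_server is a name I made up.

```shell
# Print the "server" value configured for one storage ID.
# Reads storage.cfg content on stdin.
storage_server() {
    awk -v id="$1" '
        /^[a-z]+:[ \t]/ { in_section = ($2 == id); next }   # section header: "type: id"
        in_section && $1 == "server" { print $2; exit }
    '
}

# Example use on a node:
#   storage_server nas2-pxshare < /etc/pve/storage.cfg
```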

5.3 Verify network connectivity to NAS2 on the 10G network

ping -c 2 172.16.0.10
ip route get 172.16.0.10
rpcinfo -p 172.16.0.10 | head -30
showmount -e 172.16.0.10

What I saw:

  • Ping was fast and stable.
  • The route clearly used the internal bridge (vmbr1) and source 172.16.0.20.
  • rpcinfo and showmount succeeded and exports were listed.
  • The export ACL included 172.16.0.20 and 172.16.0.21.

Conclusion: the NAS and network were fine. The problem was about how node1 had mounted the share (and how Proxmox evaluated it).


6. Root Cause at the Mount Layer: “server IP drift” and stacked mounts on the same path

6.1 Check the effective mount source

findmnt /mnt/pve/nas2-pxshare || echo "NOT MOUNTED"
grep "/mnt/pve/nas2-pxshare" /proc/self/mountinfo

What I saw earlier during troubleshooting:

  • findmnt reported the source as:
    • 192.168.10.132:/volume1/PxShare

Even though the Proxmox storage definition was supposed to be 172.16.0.10.

At the same time, the mount options showed internal-network details (for example, clientaddr=172.16.0.20 and references to addr=172.16.0.10), so the actual traffic was not necessarily going over the slower 192.168.10.x network. But from Proxmox’s point of view, the mounted “server identity” did not match the storage.cfg definition, which can lead Proxmox to mark the storage as inactive.
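
The mismatch described here is easy to test for once the mounted source and the configured server are both known. The helper below is an illustrative sketch (check_mount_server is a hypothetical name); it compares only the host part of a single host:/export string.

```shell
# Compare the host part of an NFS mount source (host:/export) with the expected server.
#   $1: mount source as shown in the SOURCE column of findmnt
#   $2: server value from /etc/pve/storage.cfg
check_mount_server() {
    actual=${1%%:*}                 # keep everything before the first colon
    if [ "$actual" = "$2" ]; then
        echo "OK: mounted from $actual"
    else
        echo "MISMATCH: mounted from $actual, expected $2"
        return 1
    fi
}

# Example:
#   check_mount_server "192.168.10.132:/volume1/PxShare" "172.16.0.10"
```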

6.2 The smoking gun: stacked mounts on the same mountpoint

Later, after stopping some PVE services, I observed the same mountpoint appearing twice:

findmnt /mnt/pve/nas2-pxshare -o TARGET,SOURCE,FSTYPE,OPTIONS

It showed two layers:

  • /mnt/pve/nas2-pxshare 192.168.10.132:/volume1/PxShare nfs4 ...
  • /mnt/pve/nas2-pxshare 172.16.0.10:/volume1/PxShare nfs ...

This is a stacked mount situation:

  • The lower layer was NFSv4 showing 192.168.10.132.
  • The upper layer was NFSv3 showing 172.16.0.10.

fuser showed it was only held by the kernel mount, not a user process:

fuser -vm /mnt/pve/nas2-pxshare
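
Stacked mounts can also be detected without reading the findmnt output by eye. The sketch below counts how many entries in /proc/self/mountinfo share one mountpoint (the mountpoint is the fifth field of each line). The function name count_mount_layers is my own.

```shell
# Count how many mounts are layered on one mountpoint.
# Reads mountinfo-formatted lines on stdin.
count_mount_layers() {
    awk -v target="$1" '$5 == target { n++ } END { print n + 0 }'
}

# Example use on a node:
#   count_mount_layers /mnt/pve/nas2-pxshare < /proc/self/mountinfo
```

Any result greater than 1 means the mountpoint is stacked.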

7. Fix (without changing paths): fully unmount stacked layers, then mount only NFSv3

Goal:

  • Keep using /mnt/pve/nas2-pxshare (no new directories).
  • Remove stacked mounts completely.
  • Remount cleanly using 172.16.0.10 (10G network) with a single NFS version.

7.1 Stop PVE services that might trigger storage checks/re-mount behavior

systemctl stop pvestatd pvedaemon pveproxy

7.2 Unmount repeatedly until the mount is truly gone

This step was critical because there were multiple layers on the same mountpoint.

umount -f /mnt/pve/nas2-pxshare 2>/dev/null || umount -l /mnt/pve/nas2-pxshare
findmnt /mnt/pve/nas2-pxshare -o TARGET,SOURCE,FSTYPE,OPTIONS || echo "NOT MOUNTED"

umount -f /mnt/pve/nas2-pxshare 2>/dev/null || umount -l /mnt/pve/nas2-pxshare
findmnt /mnt/pve/nas2-pxshare -o TARGET,SOURCE,FSTYPE,OPTIONS || echo "NOT MOUNTED"

I continued until it returned:

NOT MOUNTED
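
The repeated unmount can also be written as a bounded loop so that it stops on its own. This is a sketch of that idea, not what I ran at the time. The helper names are hypothetical, the attempt limit of 5 is an arbitrary default, and the underlying commands are the same umount and findmnt calls used above.

```shell
# Thin wrappers around the real commands, kept separate so the loop is easy to read.
is_mounted()   { findmnt "$1" >/dev/null 2>&1; }
unmount_once() { umount -f "$1" 2>/dev/null || umount -l "$1"; }

# Unmount every layer on a mountpoint, giving up after a fixed number of attempts.
#   $1: mountpoint   $2: maximum attempts (default 5)
unmount_all_layers() {
    mp=$1
    max=${2:-5}
    i=0
    while is_mounted "$mp"; do
        if [ "$i" -ge "$max" ]; then
            echo "still mounted after $max attempts: $mp" >&2
            return 1
        fi
        unmount_once "$mp" || return 1
        i=$((i + 1))
    done
    echo "NOT MOUNTED"
}

# Example use on a node (as root):
#   unmount_all_layers /mnt/pve/nas2-pxshare
```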

7.3 Remount using NFSv3 on the internal 10G IP

mount -v -t nfs -o vers=3,proto=tcp 172.16.0.10:/volume1/PxShare /mnt/pve/nas2-pxshare
findmnt /mnt/pve/nas2-pxshare -o TARGET,SOURCE,FSTYPE,OPTIONS

Verification: it must show only one entry and the correct source:

/mnt/pve/nas2-pxshare 172.16.0.10:/volume1/PxShare nfs ...

7.4 Start the PVE services again

systemctl start pveproxy pvedaemon pvestatd

7.5 Confirm Proxmox now sees the storage as active

pvesm status | grep nas2-pxshare

Result:

nas2-pxshare         nfs     active     18739479296     16557377408      2182101888   88.36%

At this point, the VM console issue was resolved.
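
For a scripted check of step 7.5, the status column can be read out of the pvesm status output. This is a minimal sketch, assuming the column order shown in the result line above (name, type, status, ...). storage_state is a made-up name.

```shell
# Print the status column for one storage ID, or "missing" if it is not listed.
# Reads `pvesm status` output on stdin.
storage_state() {
    awk -v id="$1" '
        $1 == id { print $3; found = 1 }
        END { if (!found) print "missing" }
    '
}

# Example use on a node:
#   pvesm status | storage_state nas2-pxshare
```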


8. Postmortem: Why This Happened (important background)

After reviewing the history, the most plausible explanation is:

  • Originally, nas2-pxshare was created pointing to 192.168.10.132.
  • Later, I deleted and re-created a storage with the same name nas2-pxshare but pointing to 172.16.0.10.
  • node2 did not show abnormal behavior, but node1 kept a stale mount state and/or ended up stacking mounts (NFSv4 from the old server identity plus NFSv3 from the new one) on the same mountpoint.
  • Once node1’s nas2-pxshare became inconsistent (wrong “server identity” or stacked mounts), Proxmox marked it inactive and started timing out in operations that indirectly depend on storage stability. The QMP timeouts and vncproxy failures were symptoms of the node being in a broken state, not purely a “console feature” issue.

9. Prevention Notes

  • Avoid changing a storage definition to a different server IP while keeping the same storage name, unless you ensure the old mount is fully gone on every node.
  • If a Proxmox NFS storage suddenly becomes inactive, the first command to run is:
findmnt /mnt/pve/<storage> -o TARGET,SOURCE,FSTYPE,OPTIONS

If the same mountpoint appears multiple times, resolve the stacked mounts first before trusting any higher-level Proxmox behavior.

Migrating an IBM Windows Server 2003 Physical Machine to Proxmox and Reducing Disk Size

This post documents the complete process of converting an old IBM physical server running Windows Server 2003 into a virtual machine on Proxmox VE, and then shrinking its 2 TB system disk down to a much smaller, space-efficient image.


1. P2V with VMware vCenter Converter (XP-Compatible Version)

Because the original system dates from the Windows XP era, I used an older release of VMware vCenter Converter (a P2V virtual machine converter), one of the few versions that still runs correctly on XP / Server 2003.

Steps

  1. Install VMware vCenter Converter on the IBM host.
  2. Choose Convert Local Machine.
  3. Set the destination format to VMware Workstation / Other VMware Virtual Machine.
  4. Export the resulting .vmdk files to an external drive or shared folder.

When finished, the converter produces a set of Windows2003.vmdk and .vmx files ready for further conversion.


2. Converting VMDK to Proxmox QCOW2

Copy the exported .vmdk file to the Proxmox storage path, for example:

/mnt/pve/nas2-in/images/602/

Then convert it to the QCOW2 format:

qemu-img convert -O qcow2 Windows2003.vmdk vm-602-disk-0.qcow2

Proxmox can now attach this QCOW2 disk directly.


3. First Boot Test

  1. Create a new virtual machine (SeaBIOS, VGA display).
  2. Attach vm-602-disk-0.qcow2 as the primary disk.
  3. Boot and verify that Windows Server 2003 starts properly.
  4. Check that IIS and custom applications still function.

If boot errors such as “NTLDR is missing” appear, boot from the Windows Server 2003 installation CD, enter the Recovery Console, and run:

fixboot
fixmbr
bootcfg /rebuild

4. Preparing the System for Shrinking

The original disk size was 2 TB, but only around 50 GB was actually used.
Before reducing the virtual disk, the file system must be cleaned and unused blocks released.

4.1 Defragment the Disk

Run the built-in Disk Defragmenter inside Windows 2003 to move all data toward the beginning of the disk.

4.2 Zero Free Space with SDelete

Download Microsoft Sysinternals sdelete.exe and execute:

sdelete -z C:

This fills all free blocks with zeros so that the later re-pack with qemu-img convert can drop them.

Because the virtual disk was attached through an emulated IDE controller, disk I/O was extremely slow.
In my case, SDelete took nearly two weeks to complete, with CPU and I/O utilization at 100% throughout.
With a VirtIO or SCSI disk (and the matching guest drivers installed), this step should finish much faster, likely in hours rather than weeks.
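
To put the two weeks in perspective, here is a rough throughput estimate. The inputs are assumptions for illustration: roughly 2 TB (taken as 2 × 10^12 bytes) of free space zeroed over 14 full days. The function name is my own.

```shell
# Average throughput in MB/s (decimal megabytes) for a byte count written over a number of days.
avg_mb_per_s() {
    awk -v bytes="$1" -v days="$2" \
        'BEGIN { printf "%.2f\n", bytes / (days * 86400) / 1000000 }'
}

# Example:
#   avg_mb_per_s 2000000000000 14
```

With these inputs the average works out to roughly 1.65 MB/s, which shows how much the emulated IDE path was holding the job back.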


5. Shrinking the Partition in Windows 7

Windows Server 2003 cannot shrink a system partition natively.
Initially, I attempted to use Clonezilla and GParted Live, but both reported disk corruption or “invalid file system” errors and refused to resize the partition.
Even after running chkdsk /f back in Windows 2003, both tools continued to mis-detect the NTFS structure as damaged.
The issue appears to come from incompatibility between older NTFS metadata and the NTFS drivers shipped with those utilities.

The reliable workaround was to mount the QCOW2 disk inside a Windows 7 VM and use the native Disk Management utility:

  1. Power off the Windows 2003 VM.
  2. Attach its QCOW2 as a secondary disk to a Windows 7 VM.
  3. Open Disk Management (diskmgmt.msc).
  4. Right-click the Windows 2003 system volume (the original C:) → Shrink Volume, and reduce it to about 49 GB.
  5. The remaining space becomes Unallocated.

This approach worked perfectly and safely adjusted the NTFS partition size.


6. Trimming the QCOW2 Image in Proxmox

  1. Load the nbd kernel module, attach the QCOW2 read-only via NBD, and verify the partition table:

    modprobe nbd max_part=8
    qemu-nbd -r -c /dev/nbd0 vm-602-disk-0.qcow2
    fdisk -l /dev/nbd0

    /dev/nbd0p1 should report roughly 49 GB.

  2. Disconnect the mapping:

    qemu-nbd -d /dev/nbd0

  3. Safely shrink the virtual disk (leave a small buffer):

    qemu-img resize --shrink vm-602-disk-0.qcow2 52G

  4. Re-pack the image to remove zeroed blocks:

    qemu-img convert -O qcow2 vm-602-disk-0.qcow2 vm-602-disk-0-slim.qcow2

After conversion, the new QCOW2 file was only 30–35 GB instead of 2 TB.
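
Before running the shrink in step 3, it is worth confirming that the last partition really ends inside the new image size, because shrinking below that point would cut off data. The helper below is a sketch under stated assumptions: an MBR disk with a single partition, the End sector and sector size read from fdisk -l, and qemu-img interpreting 52G as 52 GiB. fits_after_shrink is a hypothetical name.

```shell
# Succeed if a partition ending at a given sector fits inside the shrunken image.
#   $1: last sector of the partition (End column of fdisk -l)
#   $2: sector size in bytes
#   $3: new image size in GiB
fits_after_shrink() {
    end_bytes=$(( ($1 + 1) * $2 ))              # first byte after the partition
    new_bytes=$(( $3 * 1024 * 1024 * 1024 ))
    [ "$end_bytes" -le "$new_bytes" ]
}

# Example (the end sector is illustrative, not taken from my disk):
#   fits_after_shrink 102762495 512 52 && echo "safe to shrink to 52G"
```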


7. Replace and Test

  1. In the Proxmox web UI, detach the old QCOW2 disk.
  2. Attach the new vm-602-disk-0-slim.qcow2.
  3. Boot the VM and confirm that Windows 2003 loads and all services run normally.
  4. Once verified, delete or archive the old file.

8. Results and Lessons Learned

Key takeaways:

  • The XP-compatible VMware vCenter Converter successfully performed the P2V migration.
  • Modern Windows tools can safely shrink old NTFS partitions when Clonezilla / GParted fail.
  • Clonezilla and GParted Live may misinterpret older NTFS metadata and falsely report corruption.
  • Running sdelete -z is essential for reclaiming space but extremely slow on IDE-based disks.
  • Combining sdelete + qemu-img convert provides a real, measurable disk-size reduction.

Final result

  • The original 2 TB disk image was reduced to a 52 GB virtual size, and the resulting QCOW2 file occupies about 35 GB on disk.
  • The system boots normally; IIS and internal applications work as before.
  • The only truly time-consuming step was SDelete, which required almost two weeks over IDE; a VirtIO disk would likely reduce that time dramatically.