pve – 好冷酒

This post documents the full troubleshooting process (what I checked and why), not just the final workaround.

Environment

There is a dedicated 10G internal network between the cluster nodes and the NAS:

node1: 172.16.0.20
node2: 172.16.0.21
NAS2 (Synology): 172.16.0.10

The NAS also has another reachable service IP:

NAS2 service IP: 192.168.10.132

0. Symptoms

After migrating VMs from node2 to node1:

No VM console could be opened in the Proxmox Web UI.
The UI error was: failed to run vnc proxy

Additional observations:

The node-level shell console (node1 → Shell) still worked.
Running qm vncproxy <vmid> manually on node1 returned:
- LC_PVE_TICKET not set, VNC proxy without password is forbidden
This is expected when calling qm vncproxy outside the Web UI/API context, because it needs a PVE ticket. It was not the root cause.

1. First Checks: Is Proxmox Itself Healthy?

1.1 PVE services

systemctl status pveproxy pvedaemon pve-cluster --no-pager

What I saw:

pveproxy, pvedaemon, and pve-cluster (pmxcfs) were all active.
pveproxy logs frequently showed: proxy detected vanished client connection. This message often appears when a console connection drops unexpectedly.

1.2 Cluster / quorum

pvecm status

What I saw:

node1 still had quorum (Quorate: Yes).
Although the cluster was running in a reduced state (node1 + qdevice), pve-cluster itself was functioning.

Conclusion so far: this was not a “cluster is down” situation.
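
The two checks above can be folded into a single pass/fail helper. This is only a minimal sketch: the function names is_quorate and check_pve_health are hypothetical, and it assumes pvecm status prints a "Quorate:" line as quoted above.

```shell
#!/bin/sh
# Succeeds if the `pvecm status` text on stdin contains "Quorate: Yes".
is_quorate() {
    grep -Eq 'Quorate:[[:space:]]*Yes'
}

# Hypothetical wrapper, meant to be run as root on a PVE node.
check_pve_health() {
    for svc in pveproxy pvedaemon pve-cluster; do
        systemctl is-active --quiet "$svc" || { echo "$svc is not active"; return 1; }
    done
    pvecm status | is_quorate || { echo "node is not quorate"; return 1; }
    echo "services active, node quorate"
}
```

Calling check_pve_health on the node prints one summary line and returns non-zero on the first failed check.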

2. Console Infrastructure Check: termproxy / vncproxy and Ports

2.1 Check if termproxy/vncproxy are running and listening

ss -ltnp | grep -E ':(59[0-9]{2})\b'
ps aux | grep -E 'termproxy|vncproxy' | grep -v grep

What I saw:

termproxy appeared (for example, termproxy 5900 ...).
But VM consoles still failed.

Conclusion so far: the proxy layer existed, but something deeper prevented it from completing.

3. Find the Real Error: Why Does vncproxy Fail?

3.1 Read recent logs from pvedaemon and pveproxy

journalctl -u pvedaemon -u pveproxy --since "10 min ago" -l --no-pager | tail -200

Key pattern in the logs:

Many VMs produced repeated errors like:
- VM <id> qmp command failed ... unable to connect to VM <id> qmp socket - timeout after 51 retries
When attempting a console, the sequence looked like:
- starting vnc proxy ...
- qmp command 'set_password' failed ... unable to connect to VM <id> qmp socket ...
- Failed to run vncproxy.

Conclusion: the console wasn’t failing because “VNC proxy can’t start”; it failed because Proxmox could not talk to the VM’s QEMU via QMP in order to set VNC credentials.
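
To see how many VMs are affected, the journal output can be reduced to a per-VM count of QMP connect failures. A minimal sketch, assuming the message wording quoted above ("unable to connect to VM <id> qmp socket"); the function name qmp_timeout_vms is hypothetical.

```shell
#!/bin/sh
# Reads journal text on stdin and prints "<count> <vmid>" for every VM
# that logged at least one QMP socket connect failure.
qmp_timeout_vms() {
    grep -oE 'unable to connect to VM [0-9]+ qmp socket' \
        | awk '{ n[$6]++ } END { for (id in n) print n[id], id }' \
        | sort -k2,2n
}
```

Usage on the node:

journalctl -u pvedaemon -u pveproxy --since "10 min ago" --no-pager | qmp_timeout_vms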

4. The Contradiction: VM “running”, QMP sockets exist, but QMP is unreachable

Using VM 502 as an example.

4.1 Verify VM status and QMP/VNC socket files

qm status 502
ls -l /run/qemu-server/502.qmp /run/qemu-server/502.vnc 2>/dev/null || true

What I saw:

qm status 502 reported running.
/run/qemu-server/502.qmp and /run/qemu-server/502.vnc existed.

4.2 Try QEMU monitor

Because my qm version did not support --cmd/--command, I entered the interactive monitor:

qm monitor 502
# qm> info status

What I saw:

Even the monitor command failed:
- human-monitor-command failed due to QMP timeout.

4.3 Check whether the qemu process actually exists

pgrep -af "qemu.*-id 502" || ps -ef | grep -E "qemu.*-id 502" | grep -v grep

What I saw:

No qemu process for VM 502 was found.
Yet Proxmox believed the VM was running and the socket files existed.

Conclusion: this strongly suggested a stuck or inconsistent runtime state, often caused by underlying storage/IO problems or stale mounts affecting how Proxmox tracks VM runtime state.
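
A cross-check that does not depend on the process name can help here, because on Proxmox the QEMU binary is usually started as /usr/bin/kvm, which a pattern such as "qemu.*-id 502" can miss. The sketch below is an assumption-laden illustration: it assumes qemu-server keeps one <vmid>.pid file per VM in /run/qemu-server, and the function name stale_vm_pidfiles is hypothetical.

```shell
#!/bin/sh
# Prints the VMID of every pidfile in the given directory whose PID is not alive.
# Assumption: one <vmid>.pid file per VM in /run/qemu-server.
# Run as root, otherwise kill -0 also fails for processes owned by other users.
stale_vm_pidfiles() {
    dir=${1:-/run/qemu-server}
    for f in "$dir"/*.pid; do
        [ -e "$f" ] || continue                       # glob matched nothing
        pid=$(cat "$f" 2>/dev/null)
        case $pid in ''|*[!0-9]*) continue ;; esac    # skip empty or malformed files
        # kill -0 sends no signal; it only tests whether the PID exists.
        kill -0 "$pid" 2>/dev/null || basename "$f" .pid
    done
}
```

If this prints 502, the pidfile really points at a dead process; if it prints nothing, the process exists and the earlier search pattern is the more likely reason nothing was found.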

5. Secondary Clue: a shared NFS storage looked wrong on node1

I noticed something else that correlated with the timing:

A shared NFS storage (nas2-pxshare) showed correct capacity on node2.
On node1, it appeared as a grey question mark in the UI.

5.1 Confirm storage status from CLI

timeout 5 pvesm status || echo "pvesm status timeout"

What I saw:

nas2-pxshare was inactive on node1 (0 total/used/available).
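
With many storages defined, the status column can be filtered directly. A minimal sketch, assuming the column layout name, type, status, total, used, available, percent and a single header line; the function name inactive_storages is hypothetical.

```shell
#!/bin/sh
# Reads `pvesm status` output on stdin and prints the name of every storage
# whose status column (third field) is not "active". The header line is skipped.
inactive_storages() {
    awk 'NR > 1 && $3 != "active" { print $1 }'
}
```

Usage on the node:

timeout 5 pvesm status 2>/dev/null | inactive_storages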

5.2 Confirm the storage definition

grep -n "nas2-pxshare" -A6 -B2 /etc/pve/storage.cfg

At that time the definition was intended to be:

server: 172.16.0.10
export: /volume1/PxShare
path: /mnt/pve/nas2-pxshare

5.3 Verify network connectivity to NAS2 on the 10G network

ping -c 2 172.16.0.10
ip route get 172.16.0.10
rpcinfo -p 172.16.0.10 | head -30
showmount -e 172.16.0.10

What I saw:

Ping was fast and stable.
The route clearly used the internal bridge (vmbr1) and source 172.16.0.20.
rpcinfo and showmount succeeded and exports were listed.
The export ACL included 172.16.0.20 and 172.16.0.21.

Conclusion: the NAS and network were fine. The problem was about how node1 had mounted the share (and how Proxmox evaluated it).
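
The export ACL check can also be scripted. This is a rough sketch under stated assumptions: it expects the usual showmount -e layout (export path, then a comma-separated client list), it only matches literal IPs or a "*" wildcard, and it does not understand subnet or hostname entries. The function name export_allows is hypothetical.

```shell
#!/bin/sh
# Reads `showmount -e <server>` output on stdin.
# Succeeds if the export path $1 lists client $2 literally, or lists "*".
export_allows() {
    awk -v path="$1" -v ip="$2" '
        $1 == path {
            n = split($2, clients, ",")
            for (i = 1; i <= n; i++)
                if (clients[i] == ip || clients[i] == "*") found = 1
        }
        END { exit !found }'
}
```

Usage on node1:

showmount -e 172.16.0.10 | export_allows /volume1/PxShare 172.16.0.20 && echo "export allows this node"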

6. Root Cause at the Mount Layer: “server IP drift” and stacked mounts on the same path

6.1 Check the effective mount source

findmnt /mnt/pve/nas2-pxshare || echo "NOT MOUNTED"
grep "/mnt/pve/nas2-pxshare" /proc/self/mountinfo

What I saw earlier during troubleshooting:

findmnt reported the source as:
- 192.168.10.132:/volume1/PxShare

Even though the Proxmox storage definition was supposed to be 172.16.0.10.

At the same time, the mount options showed internal-network details (for example, clientaddr=172.16.0.20 and references to addr=172.16.0.10), so the actual traffic was not necessarily going through the slow network. But from Proxmox’s point of view, the mounted “server identity” did not match the storage.cfg definition, which can cause the storage to be reported as inactive.

6.2 The smoking gun: stacked mounts on the same mountpoint

Later, after stopping some PVE services, I observed the same mountpoint appearing twice:

findmnt /mnt/pve/nas2-pxshare -o TARGET,SOURCE,FSTYPE,OPTIONS

It showed two layers:

/mnt/pve/nas2-pxshare 192.168.10.132:/volume1/PxShare nfs4 ...
/mnt/pve/nas2-pxshare 172.16.0.10:/volume1/PxShare nfs ...

This is a stacked mount situation:

The lower layer was NFSv4 showing 192.168.10.132.
The upper layer was NFSv3 showing 172.16.0.10.

fuser showed it was only held by the kernel mount, not a user process:

fuser -vm /mnt/pve/nas2-pxshare
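
Stacked mounts can be found for every path at once by counting how often each mount point occurs in /proc/self/mountinfo, where the mount point is the fifth field. A minimal sketch; the function name stacked_mountpoints is hypothetical. Some duplicates are normal (for example an autofs trigger and the filesystem mounted on it), so the output is a list of candidates, not a verdict.

```shell
#!/bin/sh
# Reads mountinfo-format text on stdin and prints "<layers> <mountpoint>"
# for every mount point (field 5) that appears more than once.
stacked_mountpoints() {
    awk '{ n[$5]++ } END { for (m in n) if (n[m] > 1) print n[m], m }' | sort -k2
}
```

Usage on the node:

stacked_mountpoints < /proc/self/mountinfo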

7. Fix (without changing paths): fully unmount stacked layers, then mount only NFSv3

Goal:

Keep using /mnt/pve/nas2-pxshare (no new directories).
Remove stacked mounts completely.
Remount cleanly using 172.16.0.10 (10G network) with a single NFS version.

7.1 Stop PVE services that might trigger storage checks/re-mount behavior

systemctl stop pvestatd pvedaemon pveproxy

7.2 Unmount repeatedly until the mount is truly gone

This step was critical because there were multiple layers on the same mountpoint.

umount -f /mnt/pve/nas2-pxshare 2>/dev/null || umount -l /mnt/pve/nas2-pxshare
findmnt /mnt/pve/nas2-pxshare -o TARGET,SOURCE,FSTYPE,OPTIONS || echo "NOT MOUNTED"

umount -f /mnt/pve/nas2-pxshare 2>/dev/null || umount -l /mnt/pve/nas2-pxshare
findmnt /mnt/pve/nas2-pxshare -o TARGET,SOURCE,FSTYPE,OPTIONS || echo "NOT MOUNTED"

I continued until it returned:

NOT MOUNTED
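
The repeated unmount can be written as a bounded loop so that it cannot spin forever if a layer refuses to go away. A minimal sketch using the same umount fallback as above; the function name unmount_all_layers and the default limit of 5 attempts are arbitrary choices for illustration.

```shell
#!/bin/sh
# Unmounts every layer stacked on mount point $1, giving up after $2 attempts
# (default 5). Returns 0 once findmnt no longer reports the path, 1 otherwise.
unmount_all_layers() {
    mp=$1
    max_tries=${2:-5}
    i=0
    while findmnt "$mp" >/dev/null 2>&1; do
        if [ "$i" -ge "$max_tries" ]; then
            echo "still mounted after $i attempts: $mp" >&2
            return 1
        fi
        # Forced unmount first, lazy unmount if that fails.
        umount -f "$mp" 2>/dev/null || umount -l "$mp"
        i=$((i + 1))
    done
    echo "NOT MOUNTED"
}
```

Usage on the node, after the PVE services have been stopped:

unmount_all_layers /mnt/pve/nas2-pxshare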

7.3 Remount using NFSv3 on the internal 10G IP

mount -v -t nfs -o vers=3,proto=tcp 172.16.0.10:/volume1/PxShare /mnt/pve/nas2-pxshare
findmnt /mnt/pve/nas2-pxshare -o TARGET,SOURCE,FSTYPE,OPTIONS

Verification: it must show only one entry and the correct source:

/mnt/pve/nas2-pxshare 172.16.0.10:/volume1/PxShare nfs ...
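
This verification can be made explicit so that it fails loudly if a second layer reappears or the source is wrong. A minimal sketch; the function name verify_single_mount is hypothetical, and it assumes findmnt prints one line per layer for the given path, as in the stacked output shown earlier.

```shell
#!/bin/sh
# Reads `findmnt -n -o SOURCE <mountpoint>` output on stdin (one line per layer).
# Succeeds only if there is exactly one layer and its source equals $1.
verify_single_mount() {
    awk -v want="$1" '
        NF { n++; src = $1 }
        END { exit !(n == 1 && src == want) }'
}
```

Usage on the node:

findmnt -n -o SOURCE /mnt/pve/nas2-pxshare | verify_single_mount 172.16.0.10:/volume1/PxShare && echo "single layer, correct source"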

7.4 Start the PVE services again

systemctl start pveproxy pvedaemon pvestatd

7.5 Confirm Proxmox now sees the storage as active

pvesm status | grep nas2-pxshare

Result:

nas2-pxshare         nfs     active     18739479296     16557377408      2182101888   88.36%

At this point, the VM console issue was resolved.

8. Postmortem: Why This Happened (important background)

After reviewing the history, the most plausible explanation is:

Originally, nas2-pxshare was created pointing to 192.168.10.132.
Later, I deleted and re-created a storage with the same name nas2-pxshare but pointing to 172.16.0.10.
node2 did not show abnormal behavior, but node1 kept a stale mount state and/or ended up stacking mounts (NFSv4 from the old server identity plus NFSv3 from the new one) on the same mountpoint.
Once node1’s nas2-pxshare became inconsistent (wrong “server identity” or stacked mounts), Proxmox marked it inactive and started timing out in operations that indirectly depend on storage stability. The QMP timeouts and vncproxy failures were symptoms of the node being in a broken state, not purely a “console feature” issue.

9. Prevention Notes

Avoid changing a storage definition to a different server IP while keeping the same storage name, unless you ensure the old mount is fully gone on every node.
If a Proxmox NFS storage suddenly becomes inactive, the first command to run is:

findmnt /mnt/pve/<storage> -o TARGET,SOURCE,FSTYPE,OPTIONS

If the same mountpoint appears multiple times, resolve the stacked mounts first before trusting any higher-level Proxmox behavior.

Proxmox: After Migrating VMs to node1, All VM Consoles Broke (“Failed to run vncproxy”) — Full Diagnosis and Fix Log