This post documents the full troubleshooting process (what I checked and why), not just the final workaround.
Environment
There is a dedicated 10G internal network between the cluster nodes and the NAS:
- node1: 172.16.0.20
- node2: 172.16.0.21
- NAS2 (Synology): 172.16.0.10
The NAS also has another reachable service IP:
- NAS2 service IP: 192.168.10.132
0. Symptoms
After migrating VMs from node2 to node1:
- No VM console could be opened in the Proxmox Web UI.
- The UI error was:
failed to run vnc proxy
Additional observations:
- The node-level shell console (node1 → Shell) still worked.
- Running qm vncproxy <vmid> manually on node1 returned:
LC_PVE_TICKET not set, VNC proxy without password is forbidden
This is expected when calling qm vncproxy outside the Web UI/API context, because it needs a PVE ticket. It was not the root cause.
1. First Checks: Is Proxmox Itself Healthy?
1.1 PVE services
systemctl status pveproxy pvedaemon pve-cluster --no-pager
What I saw:
- pveproxy, pvedaemon, and pve-cluster (pmxcfs) were all active.
- pveproxy logs frequently showed: proxy detected vanished client connection (this often appears when a console connection drops unexpectedly).
1.2 Cluster / quorum
pvecm status
What I saw:
- node1 still had quorum (Quorate: Yes).
- Even though the cluster looked reduced (node1 + qdevice), pve-cluster itself was functioning.
Conclusion so far: this was not a “cluster is down” situation.
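The quorum check above can be turned into a scriptable test. This is only a minimal sketch: the helper name is_quorate is made up for illustration, and it assumes the "Quorate:" line appears in the output the way it did on my node.

```shell
# Hypothetical helper: read `pvecm status` output on stdin and
# succeed only if the node reports "Quorate: Yes".
is_quorate() {
    grep -Eq '^Quorate:[[:space:]]+Yes'
}

# Usage on a cluster node (not executed here):
#   pvecm status | is_quorate && echo "quorate" || echo "NO QUORUM"
```

Reading from stdin keeps the helper usable on saved command output as well as on a live node.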
2. Console Infrastructure Check: termproxy / vncproxy and Ports
2.1 Check if termproxy/vncproxy are running and listening
ss -ltnp | grep -E ':(59[0-9]{2})\b'
ps aux | grep -E 'termproxy|vncproxy' | grep -v grep
What I saw:
- termproxy appeared (for example, termproxy 5900 ...).
- But VM consoles still failed.
Conclusion so far: the proxy layer existed, but something deeper prevented it from completing.
3. Find the Real Error: Why Does vncproxy Fail?
3.1 Read recent logs from pvedaemon and pveproxy
journalctl -u pvedaemon -u pveproxy --since "10 min ago" -l --no-pager | tail -200
Key pattern in the logs:
- Many VMs produced repeated errors like:
VM <id> qmp command failed ... unable to connect to VM <id> qmp socket - timeout after 51 retries
- When attempting a console, the sequence looked like:
starting vnc proxy ...
qmp command 'set_password' failed ... unable to connect to VM <id> qmp socket ...
Failed to run vncproxy.
Conclusion: the console wasn’t failing because “VNC proxy can’t start”; it failed because Proxmox could not talk to the VM’s QEMU via QMP in order to set VNC credentials.
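To see at a glance which VMs are affected, the QMP socket errors can be counted per VM. The sketch below assumes the log lines contain the phrase "unable to connect to VM <id> qmp socket" as quoted above; the function name qmp_timeouts_by_vm is hypothetical.

```shell
# Hypothetical helper: read journal lines on stdin and print
# "<count> <vmid>" for every VM that logged a QMP socket error.
qmp_timeouts_by_vm() {
    grep -E 'unable to connect to VM [0-9]+ qmp socket' \
        | sed -E 's/.*unable to connect to VM ([0-9]+) qmp socket.*/\1/' \
        | sort -n | uniq -c | awk '{print $1, $2}'
}

# Usage on the node (not executed here):
#   journalctl -u pvedaemon -u pveproxy --since "10 min ago" --no-pager | qmp_timeouts_by_vm
```

If nearly every VM on the node shows up in this list, the problem is unlikely to be specific to one guest.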
4. The Contradiction: VM “running”, QMP sockets exist, but QMP is unreachable
Using VM 502 as an example.
4.1 Verify VM status and QMP/VNC socket files
qm status 502
ls -l /run/qemu-server/502.qmp /run/qemu-server/502.vnc 2>/dev/null || true
What I saw:
- qm status 502 reported running.
- /run/qemu-server/502.qmp and /run/qemu-server/502.vnc existed.
4.2 Try QEMU monitor
Because my qm version did not support --cmd/--command, I entered the interactive monitor:
qm monitor 502
# qm> info status
What I saw:
- Even the monitor command failed: human-monitor-command failed due to a QMP timeout.
4.3 Check whether the qemu process actually exists
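# On Proxmox the QEMU binary is usually started as /usr/bin/kvm, so match both names.
pgrep -af '(kvm|qemu).* -id 502( |$)' || ps -ef | grep -E '(kvm|qemu).* -id 502( |$)' | grep -v grep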
What I saw:
- No qemu process for VM 502 was found.
- Yet Proxmox believed the VM was running and the socket files existed.
Conclusion: this strongly suggested a stuck or inconsistent runtime state, often caused by underlying storage/IO problems or stale mounts affecting how Proxmox tracks VM runtime state.
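The same "socket exists but no process" check can be run for every VM at once. This sketch assumes each <vmid>.qmp socket has a matching <vmid>.pid file in the same directory and that it runs as root (kill -0 reports a foreign process as missing otherwise). The function name list_stale_vms is made up for illustration.

```shell
# Hypothetical helper: print the VMID of every <vmid>.qmp socket whose
# pid file is missing or names a process that no longer exists.
# $1: runtime directory (assumed default: /run/qemu-server)
list_stale_vms() (
    run_dir=${1:-/run/qemu-server}
    for sock in "$run_dir"/*.qmp; do
        [ -e "$sock" ] || continue      # the glob matched nothing
        vmid=$(basename "$sock" .qmp)
        pid=$(cat "$run_dir/$vmid.pid" 2>/dev/null)
        # kill -0 only tests whether the process exists; no signal is sent.
        if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null; then
            echo "$vmid"
        fi
    done
)

# Usage on the node (not executed here):
#   list_stale_vms /run/qemu-server
```

Checking the pid file avoids guessing the process name, which differs between setups.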
5. Secondary Clue: a shared NFS storage looked wrong on node1
I noticed something else that correlated with the timing:
- A shared NFS storage (nas2-pxshare) showed correct capacity on node2.
- On node1, it appeared as a grey question mark in the UI.
5.1 Confirm storage status from CLI
timeout 5 pvesm status || echo "pvesm status timeout"
What I saw:
- nas2-pxshare was inactive on node1 (0 total/used/available).
5.2 Confirm the storage definition
grep -n "nas2-pxshare" -A6 -B2 /etc/pve/storage.cfg
At that time the definition was intended to be:
- server: 172.16.0.10
- export: /volume1/PxShare
- path: /mnt/pve/nas2-pxshare
5.3 Verify network connectivity to NAS2 on the 10G network
ping -c 2 172.16.0.10
ip route get 172.16.0.10
rpcinfo -p 172.16.0.10 | head -30
showmount -e 172.16.0.10
What I saw:
- Ping was fast and stable.
- The route clearly used the internal bridge (vmbr1) and source 172.16.0.20.
- rpcinfo and showmount succeeded and exports were listed.
- The export ACL included 172.16.0.20 and 172.16.0.21.
Conclusion: the NAS and network were fine. The problem was about how node1 had mounted the share (and how Proxmox evaluated it).
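The route check can also be scripted so that the interface and source address are compared automatically. A minimal sketch, assuming the usual single-line "dev ... src ..." output of ip route get; the helper name route_dev_src is hypothetical.

```shell
# Hypothetical helper: read `ip route get <ip>` output on stdin and
# print "<dev> <src>" (either part is empty if the keyword is absent).
route_dev_src() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "dev") dev = $(i + 1)
            if ($i == "src") src = $(i + 1)
        }
    }
    END { print dev, src }'
}

# Usage on node1 (not executed here):
#   [ "$(ip route get 172.16.0.10 | route_dev_src)" = "vmbr1 172.16.0.20" ] || echo "unexpected route"
```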
6. Root Cause at the Mount Layer: “server IP drift” and stacked mounts on the same path
6.1 Check the effective mount source
findmnt /mnt/pve/nas2-pxshare || echo "NOT MOUNTED"
grep "/mnt/pve/nas2-pxshare" /proc/self/mountinfo
What I saw earlier during troubleshooting:
- findmnt reported the source as 192.168.10.132:/volume1/PxShare, even though the Proxmox storage definition was supposed to point to 172.16.0.10.
At the same time, mount options showed internal-network details (for example, clientaddr=172.16.0.20 and references to addr=172.16.0.10), so actual traffic was not necessarily going through the slow network. But from Proxmox’s point of view, the mounted “server identity” did not match the storage.cfg definition, which can lead to inactive.
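The mismatch between the configured server and the mounted source can be checked mechanically. The sketch below assumes storage.cfg uses a "type: id" header line followed by indented "key value" lines; the helper names cfg_server and mount_host are made up for illustration.

```shell
# Hypothetical helper: print the "server" value of one storage section,
# reading storage.cfg text on stdin.
# $1: storage ID, e.g. nas2-pxshare
cfg_server() {
    awk -v id="$1" '
        /^[^[:space:]#]/ { in_section = ($2 == id) }   # unindented header line
        in_section && $1 == "server" { print $2; exit }
    '
}

# Hypothetical helper: strip ":/export" from an NFS mount source.
mount_host() {
    printf '%s\n' "${1%%:*}"
}

# Usage on the node (not executed here):
#   want=$(cfg_server nas2-pxshare < /etc/pve/storage.cfg)
#   for src in $(findmnt -n -o SOURCE /mnt/pve/nas2-pxshare); do
#       [ "$(mount_host "$src")" = "$want" ] || echo "mismatch: cfg=$want mounted=$src"
#   done
```

The loop in the usage note compares every reported source, so it also works when more than one mount sits on the path.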
6.2 The smoking gun: stacked mounts on the same mountpoint
Later, after stopping some PVE services, I observed the same mountpoint appearing twice:
findmnt /mnt/pve/nas2-pxshare -o TARGET,SOURCE,FSTYPE,OPTIONS
It showed two layers:
/mnt/pve/nas2-pxshare 192.168.10.132:/volume1/PxShare nfs4 ...
/mnt/pve/nas2-pxshare 172.16.0.10:/volume1/PxShare nfs ...
This is a stacked mount situation:
- The lower layer was NFSv4 showing 192.168.10.132.
- The upper layer was NFSv3 showing 172.16.0.10.
fuser showed it was only held by the kernel mount, not a user process:
fuser -vm /mnt/pve/nas2-pxshare
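The layers on a single mountpoint can also be read straight from /proc/self/mountinfo, where the mount point is field 5 and the filesystem type and source follow the "-" separator. A minimal sketch; the function name mount_layers is hypothetical.

```shell
# Hypothetical helper: read mountinfo text on stdin and print
# "<fstype> <source>" for every entry whose mount point equals $1,
# in the order the kernel lists them.
mount_layers() {
    awk -v mp="$1" '$5 == mp {
        for (i = 7; i <= NF; i++) {
            if ($i == "-") { print $(i + 1), $(i + 2); break }
        }
    }'
}

# Usage on the node (not executed here):
#   mount_layers /mnt/pve/nas2-pxshare < /proc/self/mountinfo
```

More than one output line for the same path means the mounts are stacked.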
7. Fix (without changing paths): fully unmount stacked layers, then mount only NFSv3
Goal:
- Keep using /mnt/pve/nas2-pxshare (no new directories).
- Remove stacked mounts completely.
- Remount cleanly using 172.16.0.10 (10G network) with a single NFS version.
7.1 Stop PVE services that might trigger storage checks/re-mount behavior
systemctl stop pvestatd pvedaemon pveproxy
7.2 Unmount repeatedly until the mount is truly gone
This step was critical because there were multiple layers on the same mountpoint.
umount -f /mnt/pve/nas2-pxshare 2>/dev/null || umount -l /mnt/pve/nas2-pxshare
findmnt /mnt/pve/nas2-pxshare -o TARGET,SOURCE,FSTYPE,OPTIONS || echo "NOT MOUNTED"
umount -f /mnt/pve/nas2-pxshare 2>/dev/null || umount -l /mnt/pve/nas2-pxshare
findmnt /mnt/pve/nas2-pxshare -o TARGET,SOURCE,FSTYPE,OPTIONS || echo "NOT MOUNTED"
I continued until it returned:
NOT MOUNTED
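The repeated umount/findmnt pair can be wrapped in a bounded loop. This is a sketch of the same manual procedure, not a tested tool: the function name unmount_all_layers and the default of 5 attempts are my own choices, and it uses the same force-then-lazy fallback as the commands above.

```shell
# Hypothetical helper: unmount every layer stacked on $1.
# $2: maximum number of attempts (assumed default: 5).
# Returns 0 once findmnt no longer reports the mountpoint.
unmount_all_layers() {
    mp=$1
    tries=${2:-5}
    while findmnt "$mp" >/dev/null 2>&1; do
        [ "$tries" -gt 0 ] || { echo "still mounted: $mp" >&2; return 1; }
        umount -f "$mp" 2>/dev/null || umount -l "$mp"
        tries=$((tries - 1))
    done
    echo "NOT MOUNTED"
}

# Usage as root, after stopping the PVE services (not executed here):
#   unmount_all_layers /mnt/pve/nas2-pxshare
```

The attempt limit keeps the loop from spinning forever if something keeps remounting the share.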
7.3 Remount using NFSv3 on the internal 10G IP
mount -v -t nfs -o vers=3,proto=tcp 172.16.0.10:/volume1/PxShare /mnt/pve/nas2-pxshare
findmnt /mnt/pve/nas2-pxshare -o TARGET,SOURCE,FSTYPE,OPTIONS
Verification: it must show only one entry and the correct source:
/mnt/pve/nas2-pxshare 172.16.0.10:/volume1/PxShare nfs ...
7.4 Start the PVE services again
systemctl start pveproxy pvedaemon pvestatd
7.5 Confirm Proxmox now sees the storage as active
pvesm status | grep nas2-pxshare
Result:
nas2-pxshare nfs active 18739479296 16557377408 2182101888 88.36%
At this point, the VM console issue was resolved.
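The storage check from 7.5 can be made scriptable by reading the third column of the pvesm status output, which held the state in the result line above. A minimal sketch; the helper name storage_state is hypothetical.

```shell
# Hypothetical helper: read `pvesm status` output on stdin and print
# the status column (third field) for the storage named $1.
storage_state() {
    awk -v name="$1" '$1 == name { print $3 }'
}

# Usage on the node (not executed here):
#   [ "$(timeout 5 pvesm status | storage_state nas2-pxshare)" = "active" ] || echo "nas2-pxshare is not active"
```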
8. Postmortem: Why This Happened (important background)
After reviewing the history, the most plausible explanation is:
- Originally, nas2-pxshare was created pointing to 192.168.10.132.
- Later, I deleted and re-created a storage with the same name (nas2-pxshare), but pointing to 172.16.0.10.
- node2 did not show abnormal behavior, but node1 kept a stale mount state and/or ended up stacking mounts (NFSv4 from the old server identity plus NFSv3 from the new one) on the same mountpoint.
- Once node1’s nas2-pxshare became inconsistent (wrong “server identity” or stacked mounts), Proxmox marked it inactive and started timing out in operations that indirectly depend on storage stability. The QMP timeouts and vncproxy failures were symptoms of the node being in a broken state, not purely a “console feature” issue.
9. Prevention Notes
- Avoid changing a storage definition to a different server IP while keeping the same storage name, unless you ensure the old mount is fully gone on every node.
- If a Proxmox NFS storage suddenly becomes inactive, the first command to run is:
findmnt /mnt/pve/<storage> -o TARGET,SOURCE,FSTYPE,OPTIONS
If the same mountpoint appears multiple times, resolve the stacked mounts first before trusting any higher-level Proxmox behavior.
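To check every Proxmox-managed mountpoint on a node in one pass, the duplicates can be counted from /proc/self/mountinfo. A minimal sketch, assuming the storages are mounted under /mnt/pve/; the function name stacked_mountpoints is made up for illustration.

```shell
# Hypothetical helper: read mountinfo text on stdin and print each mount
# point under the prefix $1 (assumed default: /mnt/pve/) that is listed
# more than once.
stacked_mountpoints() {
    awk -v prefix="${1:-/mnt/pve/}" '
        index($5, prefix) == 1 { seen[$5]++ }
        END { for (mp in seen) if (seen[mp] > 1) print mp }
    ' | sort
}

# Usage on each node (not executed here):
#   stacked_mountpoints < /proc/self/mountinfo
```

Empty output means no path under the prefix is mounted twice.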