(Michael Chinen)

debugged

Can’t find CUDA devices after suspend on Linux Mint 22.1 XFCE. How to fix it without rebooting.

mchinen

Since this is an anti-SEO place, I’ll start by saying what worked for my system:

sudo systemctl stop ollama.service # stop that service/process
sudo modprobe -r nvidia_uvm  # Unload the nvidia_uvm module so it gets loaded fresh on next use.

Here’s how I figured it out:

First, verify the problem:


$ python -c "import torch; print('PyTorch version:', torch.__version__); print('CUDA available:', torch.cuda.is_available()); print('CUDA version:', torch.version.cuda)"
PyTorch version: 2.6.0+cu124
/home/mchinen/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
CUDA available: False
CUDA version:
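
If you want to check the driver without importing PyTorch, a minimal sketch is to call the CUDA driver API's cuInit through ctypes. This assumes libcuda.so.1 is on the loader path; the helper name cuda_driver_init_code is made up for this example. A return value of 0 means the driver initialized, and 999 is the driver's CUDA_ERROR_UNKNOWN code. I did not run this on the broken system, so treat it as an illustration rather than a confirmed diagnostic.

```python
import ctypes
from typing import Optional


def cuda_driver_init_code(library: str = "libcuda.so.1") -> Optional[int]:
    """Call cuInit(0) and return its CUresult code, or None if libcuda can't be loaded.

    `cuda_driver_init_code` is a hypothetical helper name for this sketch.
    0 means success; any other value is a CUDA driver error code.
    """
    try:
        lib = ctypes.CDLL(library)
    except OSError:
        # The driver library is not installed or not on the loader path.
        return None
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return lib.cuInit(0)


def describe(code: Optional[int]) -> str:
    """Turn the result code into a short human-readable message."""
    if code is None:
        return "libcuda not found"
    if code == 0:
        return "CUDA driver initialized"
    return "cuInit failed with CUresult %d" % code
```

Calling print(describe(cuda_driver_init_code())) before and after the fix gives a quick comparison that doesn't depend on any Python CUDA package.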


Next, list the processes that have NVIDIA files open:

$ sudo lsof | grep -i nvidia
firefox-b 1559625 1563831 StreamTra mchinen 220u CHR 195,0 0t0 972 /dev/nvidia0
firefox-b 1559625 1563831 StreamTra mchinen 221u CHR 195,0 0t0 972 /dev/nvidia0
nvidia-dr 1560095 root cwd DIR 252,1 4096 2 /
nvidia-dr 1560095 root rtd DIR 252,1 4096 2 /
nvidia-dr 1560095 root txt unknown /proc/1560095/exe
nvidia-pe 1562548 nvidia-persistenced cwd DIR 252,1 4096 2 /
nvidia-pe 1562548 nvidia-persistenced rtd DIR 252,1 4096 2 /
nvidia-pe 1562548 nvidia-persistenced txt REG 252,1 208336 42994272 /usr/bin/nvidia-persistenced
nvidia-pe 1562548 nvidia-persistenced mem REG 252,1 398968 42995367 /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.550.163.01
nvidia-pe 1562548 nvidia-persistenced mem REG 252,1 2125328 42993607 /usr/lib/x86_64-linux-gnu/libc.so.6
nvidia-pe 1562548 nvidia-persistenced mem REG 252,1 14624 42993621 /usr/lib/x86_64-linux-gnu/librt.so.1
nvidia-pe 1562548 nvidia-persistenced mem REG 252,1 14408 42993619 /usr/lib/x86_64-linux-gnu/libpthread.so.0
nvidia-pe 1562548 nvidia-persistenced mem REG 252,1 14408 42993609 /usr/lib/x86_64-linux-gnu/libdl.so.2
nvidia-pe 1562548 nvidia-persistenced mem REG 252,1 236616 42993596 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
nvidia-pe 1562548 nvidia-persistenced 0u unix 0xffff8c92c544b000 0t0 43860614 type=DGRAM (CONNECTED)
nvidia-pe 1562548 nvidia-persistenced 1uW REG 0,26 8 35240 /run/nvidia-persistenced/nvidia-persistenced.pid
nvidia-pe 1562548 nvidia-persistenced 2u unix 0xffff8c92c544e800 0t0 43860615 /var/run/nvidia-persistenced/socket type=STREAM (LISTEN)
nvidia-mo 2258523 root cwd DIR 252,1 4096 2 /
nvidia-mo 2258523 root rtd DIR 252,1 4096 2 /
nvidia-mo 2258523 root txt unknown /proc/2258523/exe
nvidia-mo 2258524 root cwd DIR 252,1 4096 2 /
nvidia-mo 2258524 root rtd DIR 252,1 4096 2 /
nvidia-mo 2258524 root txt unknown /proc/2258524/exe
ollama 2258618 ollama mem CHR 195,0 972 /dev/nvidia0
ollama 2258618 ollama 11u CHR 195,255 0t0 971 /dev/nvidiactl
ollama 2258618 ollama 12u CHR 511,0 0t0 973 /dev/nvidia-uvm
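
The interesting rows in that listing are the ones whose last column is /dev/nvidia-uvm. As a rough sketch, the filtering can be done in Python instead of by eye. The function names holders_of and find_uvm_holders are invented for this example, and the parsing assumes lsof's default column layout (command first, PID second, file name last).

```python
import subprocess
from typing import Set, Tuple

UVM_DEVICE = "/dev/nvidia-uvm"


def holders_of(lsof_output: str, device: str = UVM_DEVICE) -> Set[Tuple[str, int]]:
    """Return (command, pid) pairs for lsof rows whose file name is `device`."""
    holders = set()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Assumed layout: COMMAND first, PID second, NAME last.
        # The header row is skipped because its last field is "NAME".
        if len(fields) >= 3 and fields[-1] == device and fields[1].isdigit():
            holders.add((fields[0], int(fields[1])))
    return holders


def find_uvm_holders() -> Set[Tuple[str, int]]:
    """Run `sudo lsof /dev/nvidia-uvm` and parse the result (needs sudo rights)."""
    # lsof exits non-zero when nothing has the file open, so don't use check=True.
    result = subprocess.run(
        ["sudo", "lsof", UVM_DEVICE], capture_output=True, text=True
    )
    return holders_of(result.stdout)
```

Fed the listing above, holders_of returns only the ollama process, which is the hint that Firefox and nvidia-persistenced were not the ones blocking the module.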

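The only process holding /dev/nvidia-uvm is ollama, so stop it and unload the module:

sudo systemctl stop ollama.service # stop that service/process
sudo modprobe -r nvidia_uvm  # Unload the nvidia_uvm module so it gets loaded fresh on next use.
# Optionally start ollama again now: sudo systemctl start ollama.service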

$ python -c "import torch; print('PyTorch version:', torch.__version__); print('CUDA available:', torch.cuda.is_available()); print('CUDA version:', torch.version.cuda)"
PyTorch version: 2.6.0+cu124
CUDA available: True
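
If this happens after every suspend, the same sequence can be wrapped in a small script. The following is only a sketch of the steps above: the function names recovery_commands and run_recovery are made up, the service and module names are the ones from my system, and it defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess
from typing import List


def recovery_commands(
    service: str = "ollama.service",
    module: str = "nvidia_uvm",
    restart_service: bool = True,
) -> List[List[str]]:
    """Build the command sequence: stop the service, unload the module, optionally restart."""
    commands = [
        ["sudo", "systemctl", "stop", service],
        ["sudo", "modprobe", "-r", module],
    ]
    if restart_service:
        commands.append(["sudo", "systemctl", "start", service])
    return commands


def run_recovery(dry_run: bool = True, **kwargs) -> None:
    """Print each command and, unless dry_run is set, execute it, stopping on the first failure."""
    for command in recovery_commands(**kwargs):
        print("+", " ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

run_recovery() only prints the three commands; run_recovery(dry_run=False) actually executes them and will prompt for your sudo password. If something other than ollama holds the module on your machine, pass a different service name.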

Without stopping ollama.service, I couldn’t unload nvidia_uvm:

$ sudo modprobe -r nvidia_uvm
modprobe: FATAL: Module nvidia_uvm is in use.
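
The "in use" error corresponds to a non-zero use count for the module, which the kernel exposes as the third field of each line in /proc/modules. A small sketch for checking it before trying to unload (module_refcount and read_refcount are hypothetical helper names for this example):

```python
from pathlib import Path
from typing import Optional


def module_refcount(proc_modules_text: str, module: str = "nvidia_uvm") -> Optional[int]:
    """Return the use count of `module` from /proc/modules text, or None if it isn't loaded."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Each line is: name, size, use count, dependents, state, address.
        if len(fields) >= 3 and fields[0] == module:
            return int(fields[2])
    return None


def read_refcount(module: str = "nvidia_uvm") -> Optional[int]:
    """Read the live use count from /proc/modules (Linux only)."""
    return module_refcount(Path("/proc/modules").read_text(), module)
```

A result of 0 suggests modprobe -r should succeed, anything higher means some process still has the module open, and None means it isn't loaded at all.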

You might or might not have ollama installed, but if you’re doing LLM work there’s a decent chance you do.

I read various posts about this, and I was leaning towards this being due to Xorg, Firefox, or an NVIDIA service like nvidia-persistenced. None of those turned out to be the cause (I had Firefox open when I fixed the system, which was fine because it doesn’t use nvidia-uvm). But these posts helped me identify the tools to debug the problem:

Cannot unload Nvidia driver – Graphics / Linux / Linux – NVIDIA Developer Forums

Reset driver without rebooting on linux – CUDA / CUDA Programming and Performance – NVIDIA Developer Forums

BUG: nvidia_uvm needs to be removed and re-inserted in order to work after wakeup from suspend – Graphics / Linux / Linux – NVIDIA Developer Forums
