all CUDA-capable devices are busy or unavailable on "GeForce RTX 2080" on Ubuntu 18.04

Hi,

I am getting the error message below, even though “Compute Mode” is Default and no other process is using the GPU.
Does anyone have an idea how to fix this?

I am using a “GeForce RTX 2080” on Ubuntu 18.04, and used the deb below to install the CUDA drivers.

Thank you.

ERROR MSG

  • all CUDA-capable devices are busy or unavailable

CUDA

http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

  • LINUX
    =======

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"

  • nvidia-smi -a
    ============

==============NVSMI LOG==============

Timestamp : Wed Oct 31 01:15:10 2018
Driver Version : 410.48

Attached GPUs : 1
GPU 00000000:03:00.0
    Product Name : GeForce RTX 2080
    Product Brand : GeForce
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-cd23c077-022c-bbd9-7313-fd050a0c0423
    Minor Number : 0
    VBIOS Version : 90.04.0B.00.86
    MultiGPU Board : No
    Board ID : 0x300
    GPU Part Number : N/A
    Inforom Version
        Image Version : G001.0000.02.04
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization mode : None
    PCI
        Bus : 0x03
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1E8710DE
        Bus Id : 00000000:03:00.0
        Sub System Id : 0x350019DA
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 3
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays since reset : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 30 %
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        HW Thermal Slowdown : Not Active
        HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 7952 MiB
        Used : 0 MiB
        Free : 7952 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 2 MiB
        Free : 254 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending : N/A
    Temperature
        GPU Current Temp : 33 C
        GPU Shutdown Temp : 100 C
        GPU Slowdown Temp : 97 C
        GPU Max Operating Temp : 88 C
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 1.00 W
        Power Limit : 225.00 W
        Default Power Limit : 225.00 W
        Enforced Power Limit : 225.00 W
        Min Power Limit : 105.00 W
        Max Power Limit : 250.00 W
    Clocks
        Graphics : 1515 MHz
        SM : 1515 MHz
        Memory : 7000 MHz
        Video : 1395 MHz
    Applications Clocks
        Graphics : N/A
        Memory : N/A
    Default Applications Clocks
        Graphics : N/A
        Memory : N/A
    Max Clocks
        Graphics : 2205 MHz
        SM : 2205 MHz
        Memory : 7000 MHz
        Video : 1950 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Processes : None

If you google the error message you’ll find more information about it.

The usual scenario where I have seen this error message is when trying to run a CUDA code that uses OpenGL interop, and the CUDA context and OpenGL context are created on separate display devices.

But I have no idea if that applies to your case, because you have not indicated anything about when or under what circumstances you are receiving this error message.

You might also want to verify your CUDA install. Instructions for doing so are provided in the CUDA Linux installation guide.
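As a rough first pass at that verification, something like the following could be run before working through the guide. It is only a sketch: `have` is a helper name made up for this example, and it assumes `nvcc` is on the PATH (a runfile install of CUDA 10.0 normally places it under /usr/local/cuda-10.0/bin, which may need to be added to PATH first).

```shell
#!/bin/sh
# Report whether a command is on PATH; returns non-zero when it is missing.
# "have" is a hypothetical helper name used only in this sketch.
have() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Driver side: version string reported by the loaded NVIDIA kernel module.
if [ -r /proc/driver/nvidia/version ]; then
    cat /proc/driver/nvidia/version
else
    echo "NVIDIA kernel module does not appear to be loaded"
fi

# User-space driver tools: nvidia-smi should report the same driver version.
if have nvidia-smi; then
    nvidia-smi || echo "nvidia-smi is installed but failed to run"
fi

# Toolkit side: compiler version (assumes nvcc is on PATH, see note above).
if have nvcc; then
    nvcc --version
fi
```

If the driver version printed from /proc differs from the one nvidia-smi reports, or if either tool is missing, that points at a mixed or incomplete install. The install guide also describes building the bundled samples and running deviceQuery as a fuller check.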

Thank you for your response.

I did google the error before posting the question. What I found was that some other application or the X server might be using the GPU.
The error message appeared when I was running a simple matrix multiplication.
In my case, there is no other application using the GPU that I am aware of, and no graphics program is running on my server setup.

As you can see from the log I attached, no process is attached to the GPU (Processes : None).

For the Linux install, as I mentioned, I used the deb file from the site linked in my first post and then did apt-get install.

What I was hoping to find out:

  • how to find which process, if any, is using the GPU
  • how to verify the CUDA install
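For the first of these questions, a minimal sketch is shown here. It assumes a single GPU and uses `gpu_field`, a helper name invented for this example, to pull individual fields out of the `nvidia-smi -q` log; `fuser` is assumed to be available (it ships in the psmisc package on Ubuntu).

```shell
#!/bin/sh
# Pull one "Key : Value" field out of an nvidia-smi -q style log read from stdin.
# gpu_field is a hypothetical helper name, not part of any NVIDIA tool.
gpu_field() {
    awk -F' *: *' -v key="$1" '
        { sub(/^[ \t]+/, "", $1) }      # drop indentation in front of the key
        $1 == key { print $2; exit }    # first match only (single-GPU assumption)
    '
}

if command -v nvidia-smi >/dev/null 2>&1; then
    log=$(nvidia-smi -q)
    echo "Compute Mode: $(printf '%s\n' "$log" | gpu_field 'Compute Mode')"
    # Prints "None" when idle; empty when processes are listed underneath.
    echo "Processes:    $(printf '%s\n' "$log" | gpu_field 'Processes')"
    # One CSV row per compute process that currently holds a context on the GPU.
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
fi

# Any process holding the NVIDIA device nodes open, CUDA or not.
# Run as root to see processes owned by other users.
if command -v fuser >/dev/null 2>&1; then
    fuser -v /dev/nvidia* 2>&1 || true
fi
```

If both nvidia-smi and fuser show nothing and the error still appears, a competing process is unlikely to be the cause, and the driver or toolkit install is the next thing to check.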

Thank you again.

As I suspected, the problem was related to the CUDA deb install and the driver.
Once I uninstalled the packages installed through apt-get and reinstalled using just the *.run files, the issue was gone.

Here are the steps:

Uninstall the apt packages from NVIDIA:

sudo apt-get purge nvidia-cuda-dev nvidia-cuda-doc nvidia-cuda-gdb nvidia-cuda-toolkit --yes
sudo apt-get purge nvidia-compute-utils-390 nvidia-driver-390 nvidia-kernel-common-390 nvidia-kernel-source-390 nvidia-utils-390 --yes
sudo apt purge cuda10.0 cuda-cublas-10-0 cuda-cufft-10-0 cuda-curand-10-0 \
    cuda-cusolver-10-0 cuda-cusparse-10-0 libcudnn7 \
    libnccl2 libnccl-dev cuda-command-line-tools-10-0 --yes
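Before moving on to the reinstall, it may be worth confirming that the purge left nothing behind. The following is a sketch only: `leftover_gpu_pkgs` is a made-up helper name, and the name pattern is a guess based on the packages purged above, so it may miss or over-match packages on other systems.

```shell
#!/bin/sh
# Filter dpkg -l output (read from stdin) down to installed packages whose
# names look GPU-related. leftover_gpu_pkgs is a hypothetical helper name and
# the pattern is an assumption based on the packages purged above.
leftover_gpu_pkgs() {
    # dpkg -l columns: status, name, version, architecture, description.
    # Status "ii" means the package is fully installed.
    awk '$1 == "ii" && $2 ~ /nvidia|cuda|cudnn|nccl/ { print $2 }'
}

if command -v dpkg >/dev/null 2>&1; then
    remaining=$(dpkg -l | leftover_gpu_pkgs)
    if [ -n "$remaining" ]; then
        echo "Still installed:"
        echo "$remaining"
    else
        echo "No matching apt-installed NVIDIA/CUDA packages found"
    fi
fi
```

This check is meant to run between the purge and the reinstall; once the cuDNN debs are installed again, libcudnn7 will show up in the list as expected.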

Reinstall from the direct-install packages:

sudo bash ./NVIDIA-Linux-x86_64-410.66.run
sudo sh ./cuda_10.0.130_410.48_linux.run

sudo dpkg -i libcudnn7_7.3.1.20-1+cuda10.0_amd64.deb
sudo dpkg -i libcudnn7-dev_7.3.1.20-1+cuda10.0_amd64.deb