cudaGetDeviceCount returned 100 -> no CUDA-capable device is detected

Hi everyone,

I am trying to get CUDA working on my Lenovo Flex 5 with a GeForce 940MX. I installed CUDA 10.1 and cuDNN 7.5.0.56 for CUDA 10.1. For some reason, whenever I run deviceQuery, all I get is this:

devicequery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 100
-> no CUDA-capable device is detected
Result = FAIL

nvidia-smi returns this:

==============NVSMI LOG==============

Timestamp : Sun Apr 14 13:26:22 2019
Driver Version : 425.31
CUDA Version : 10.1

Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : GeForce 940MX
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : WDDM
Pending : WDDM
Serial Number : N/A
GPU UUID : GPU-c2a5a364-a252-ca1d-9321-0deb5959833a
Minor Number : N/A
VBIOS Version : 82.08.6D.00.1C
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x134D10DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x39CE17AA
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 4x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : N/A
HW Power Brake Slowdown : N/A
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 2048 MiB
Used : 37 MiB
Free : 2011 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 225 MiB
Free : 31 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : N/A
Decoder : N/A
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 46 C
GPU Shutdown Temp : 99 C
GPU Slowdown Temp : 94 C
GPU Max Operating Temp : 90 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : N/A
Power Draw : N/A
Power Limit : N/A
Default Power Limit : N/A
Enforced Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 135 MHz
SM : 135 MHz
Memory : 405 MHz
Video : 135 MHz
Applications Clocks
Graphics : 1084 MHz
Memory : 2505 MHz
Default Applications Clocks
Graphics : 1082 MHz
Memory : 2505 MHz
Max Clocks
Graphics : 1189 MHz
SM : 1189 MHz
Memory : 2505 MHz
Video : 1165 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

Does anyone know what’s wrong here?

I also have the GeForce Game Ready driver, version 425.31, installed.

You have an Optimus laptop. You may need to force your dGPU on in the system BIOS or control panel, or else use an Optimus profile with the CUDA app you want to run. You can google for things like “optimus profile”.

I’m trying to train some basic models with Keras. Is there a way I can force my program to use my dGPU? Or can I make an Optimus profile for my Python executable? I’m trying to avoid the BIOS tweak.

Yes, if you google “optimus profile” this is the first hit:

https://nvidia.custhelp.com/app/answers/detail/a_id/2615/~/how-do-i-customize-optimus-profiles-and-settings%3F

So I made a profile for all the executables under “NVIDIA GPU Computing Toolkit” and ran deviceQuery again (deviceQuery is in one of these folders), and it still fails.

devicequery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 100
-> no CUDA-capable device is detected
Result = FAIL

I also made a profile for python.exe and it doesn’t work.

I’m not sure why it doesn’t automatically run CUDA programs with the dGPU though.
Any other ideas?

UPDATE:

So I restarted my computer; it turns out I was being stupid and hadn’t done that earlier. Then I ran deviceQuery and bandwidthTest again and they passed. However, this was with CUDA_VISIBLE_DEVICES set to 0, and my Python code did not run with that setting. I then set it to 1, restarted my computer, and everything failed, including my program.

Any ideas?

Why are you setting CUDA_VISIBLE_DEVICES?

And if you are/were setting it, you should have mentioned that in your problem description.

It may be that you don’t understand its purpose. It could not possibly make sense to set it to 1 on a system with only a single CUDA-capable GPU, if you actually intend to use that GPU.

There’s almost no sensible use-case for setting it at all (to any value) on a system with only a single CUDA-capable GPU.
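To make that concrete, here is a minimal Python sketch of how the device filtering behaves. It is a simplified model, not NVIDIA's implementation: the function name visible_devices is made up for illustration, it handles only integer indices (not GPU UUIDs, and not duplicate entries), and it assumes the documented rule that enumeration stops at the first invalid index.

```python
def visible_devices(env_value, physical_count):
    """Return the physical GPU indices a CUDA app would see (simplified model).

    env_value: the value of CUDA_VISIBLE_DEVICES, or None if it is unset.
    physical_count: number of CUDA-capable GPUs physically present.
    """
    if env_value is None:
        # Variable not set: every physical GPU is visible.
        return list(range(physical_count))

    visible = []
    for token in env_value.split(","):
        token = token.strip()
        is_index = token.isascii() and token.isdigit()
        # Stop at the first entry that is not a valid index; any
        # entries listed after it are ignored as well.
        if not is_index or int(token) >= physical_count:
            break
        visible.append(int(token))
    return visible


# A laptop with one CUDA-capable GPU, as in this thread:
print(visible_devices(None, 1))  # unset -> [0]
print(visible_devices("0", 1))   # same single GPU -> [0]
print(visible_devices("1", 1))   # index 1 does not exist -> []
```

Under this model, setting the variable to 1 on a single-GPU machine leaves no visible device at all, which matches the "no CUDA-capable device is detected" failure, while setting it to 0 changes nothing compared with leaving it unset.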

I read about it in another forum (https://github.com/fo40225/tensorflow-windows-wheel/issues/63), which suggested setting it to 0 and trying again, so I tried that, but it didn’t work. I then removed the variable, restarted, and ran deviceQuery and bandwidthTest again; both passed. Then I ran my program, and it returns this:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

2019-04-15 15:08:50.998037: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-04-15 15:08:51.634105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.189
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.65GiB
2019-04-15 15:08:51.635133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0

Based on what I can see in the TensorFlow source code, the log snippet you show above indicates that TensorFlow is happy to use your GPU. What more do you need?

I imagine (not being a TensorFlow user at all) that a GPU with only 1.65 GB of free memory may impose pretty severe restrictions on what kind of working sets TensorFlow can operate on with GPU acceleration.
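As a rough sanity check on the memory question, the sketch below counts the parameters of a small fully connected network and converts them to MiB. The layer sizes 3072-1024-512-3 are taken from the tutorial script posted further down in this thread; the helper name dense_param_count is made up for illustration, and float32 storage is an assumption. It counts weights and biases only, so activations, optimizer state, and whatever the framework and CUDA context reserve are not included.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # Each layer has an n_in x n_out weight matrix plus n_out biases.
        total += n_in * n_out + n_out
    return total


BYTES_PER_FLOAT32 = 4  # assumption: parameters are stored as float32

sizes = [3072, 1024, 512, 3]
params = dense_param_count(sizes)
param_mib = params * BYTES_PER_FLOAT32 / 2**20
print(f"{params} parameters, about {param_mib:.1f} MiB as float32")
```

For these sizes the parameters come to roughly 3.7 million values, or about 14 MiB, so the weights of a network this small should not come close to the 1.65 GiB reported as free. Larger models and batch sizes are where the limit would start to matter.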

I’m sorry, I didn’t write the full problem in the post. After that, I ran a tutorial file that I found online and it prints the same thing but stops after “adding visible gpu devices: 0”.

Here is the code:

# python train_simple_nn.py --dataset animals --model output/simple_nn.model --label-bin output/simple_nn_lb.pickle --plot output/simple_nn_plot.png

# set the matplotlib backend so figures can be saved in the background
import matplotlib
matplotlib.use("Agg")

# import the necessary packages
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import SGD
from keras import backend as K
from imutils import paths
import matplotlib.pyplot as plt
import numpy as np
import argparse
import random
import pickle
import cv2
import os

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
	help="path to input dataset of images")
ap.add_argument("-m", "--model", required=True,
	help="path to output trained model")
ap.add_argument("-l", "--label-bin", required=True,
	help="path to output label binarizer")
ap.add_argument("-p", "--plot", required=True,
	help="path to output accuracy/loss plot")
args = vars(ap.parse_args())

K.tensorflow_backend._get_available_gpus()

# initialize the data and labels
print("[INFO] loading images...")
data = []
labels = []

# grab the image paths and randomly shuffle them
imagePaths = sorted(list(paths.list_images(args["dataset"])))
random.seed(42)
random.shuffle(imagePaths)

# loop over the input images
for imagePath in imagePaths:
	# load the image, resize it to 32x32 pixels (ignoring aspect
	# ratio), flatten it into a vector of 32x32x3=3072 values, and
	# store the result in the data list
	image = cv2.imread(imagePath)
	image = cv2.resize(image, (32, 32)).flatten()
	data.append(image)

	# extract the class label from the image path and update the
	# labels list
	label = imagePath.split(os.path.sep)[-2]
	labels.append(label)

# scale the raw pixel intensities to the range [0, 1]
data = np.array(data, dtype="float") / 255.0
labels = np.array(labels)

# partition the data into training and testing splits using 75% of
# the data for training and the remaining 25% for testing
(trainX, testX, trainY, testY) = train_test_split(data,
	labels, test_size=0.25, random_state=42)

# convert the labels from integers to vectors (for 2-class, binary
# classification you should use Keras' to_categorical function
# instead as the scikit-learn's LabelBinarizer will not return a
# vector)
lb = LabelBinarizer()
trainY = lb.fit_transform(trainY)
testY = lb.transform(testY)

# define the 3072-1024-512-3 architecture using Keras
model = Sequential()
model.add(Dense(1024, input_shape=(3072,), activation="sigmoid"))
model.add(Dense(512, activation="sigmoid"))
model.add(Dense(len(lb.classes_), activation="softmax"))

# initialize our initial learning rate and # of epochs to train for
INIT_LR = 0.01
EPOCHS = 75

# compile the model using SGD as our optimizer and categorical
# cross-entropy loss (you'll want to use binary_crossentropy
# for 2-class classification)
print("[INFO] training network...")
opt = SGD(lr=INIT_LR)
model.compile(loss="categorical_crossentropy", optimizer=opt,
	metrics=["accuracy"])

# train the neural network
H = model.fit(trainX, trainY, validation_data=(testX, testY),
	epochs=EPOCHS, batch_size=32)

# evaluate the network
print("[INFO] evaluating network...")
predictions = model.predict(testX, batch_size=32)
print(classification_report(testY.argmax(axis=1),
	predictions.argmax(axis=1), target_names=lb.classes_))

# plot the training loss and accuracy
N = np.arange(0, EPOCHS)
plt.style.use("ggplot")
plt.figure()
plt.plot(N, H.history["loss"], label="train_loss")
plt.plot(N, H.history["val_loss"], label="val_loss")
plt.plot(N, H.history["acc"], label="train_acc")
plt.plot(N, H.history["val_acc"], label="val_acc")
plt.title("Training Loss and Accuracy (Simple NN)")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()
plt.savefig(args["plot"])

# save the model and label binarizer to disk
print("[INFO] serializing network and label binarizer...")
model.save(args["model"])
with open(args["label_bin"], "wb") as f:
	f.write(pickle.dumps(lb))

When I run this file, this is the output:

Using TensorFlow backend.
2019-04-15 21:28:15.566334: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-04-15 21:28:16.189395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.189
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.65GiB
2019-04-15 21:28:16.212093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0

The terminal doesn’t do anything after this.
I have a feeling it is related to the CUDA_VISIBLE_DEVICES variable though.

Sure doesn’t look that way to me. Look at the source code: that message prints a list of usable devices (GPUs), enumerating their IDs. You have a single GPU in your system, and one GPU (with ID 0) is being listed. The next thing that happens after this message is written out is that the function returns with status OK. So TensorFlow is still happy at that point.

You might want to trace execution further to see where it runs into trouble.
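One low-effort way to see where a silent Python script is stuck is the standard library's faulthandler module, which can dump the Python stack of every thread if the process is still running after a timeout. The sketch below is only an illustration: the wrapper name run_with_watchdog and the 60-second default are made up, and it shows Python-level frames only, not what is happening inside native CUDA or TensorFlow code.

```python
import faulthandler
import sys


def run_with_watchdog(func, timeout_s=60, file=None):
    """Call func(); while it runs, dump all thread stacks every timeout_s seconds.

    file must be a real file object with a file descriptor; it defaults
    to sys.stderr.
    """
    target = sys.stderr if file is None else file
    # Arm a watchdog that writes a traceback each time the timeout
    # elapses, for as long as func() has not returned.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        return func()
    finally:
        # Disarm the watchdog once func() returns or raises.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the suspect call, for example run_with_watchdog(lambda: K.tensorflow_backend._get_available_gpus()) in the script from this thread, would print the Python line the interpreter is sitting on each time the timeout expires, which narrows down where to look next.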

If it is happy with that, shouldn’t it be continuing to the next print statement?

K.tensorflow_backend._get_available_gpus()

# initialize the data and labels
print("[INFO] loading images...")
data = []
labels = []

Use standard debugging techniques to zero in on the point of failure. For example, you could start tracing activity from “tensorflow/core/common_runtime/gpu/gpu_device.cc:1512” onward.

BTW, fancy tools are not required for successful debugging of even fairly complicated systems. I once worked on an embedded system with about 1 MLOC where debugging was basically limited to printf() over a serial port. With a logging system employing different levels of detail it was reasonably straightforward to find the source of any issue.
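The same printf-style approach carries over to the Python script in this thread. Below is a minimal sketch using the standard logging module with two levels of detail; the logger name train_trace and the helper make_tracer are made up for illustration, and the checkpoint messages are placeholders for whatever steps are under suspicion.

```python
import logging


def make_tracer(name="train_trace", level=logging.DEBUG):
    """Return a logger that writes timestamped checkpoints to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:
        # StreamHandler flushes after every record, so the last line
        # on screen marks the last checkpoint that was reached.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


log = make_tracer()

log.info("arguments parsed")
log.debug("about to query available GPUs")
# K.tensorflow_backend._get_available_gpus()  # the call under suspicion
log.debug("GPU query returned")
log.info("loading images")
```

Passing level=logging.INFO to make_tracer silences the debug checkpoints without removing them from the code, which is the kind of adjustable detail described above. If "about to query available GPUs" is the last line printed, the hang is inside that call rather than in the image loading.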