Tesla card on Lucid Lynx - no CUDA-capable device is detected

Hi,

I am having trouble with a Tesla C1060 and Ubuntu 10.04 (64-bit). The display card is ATI, so I do not care about the X configuration; I just need the Tesla for CUDA computing purposes.

My tests indicate the driver is installed correctly (I just did an nvidia-installer --force-update to be sure):

$ lspci | grep -i nvidia
05:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060] (rev a1)

$ /sbin/modprobe -l nvidia
kernel/drivers/video/nvidia.ko

However, running examples in the SDK gives me this:

$ deviceQueryDrv
CUDA Device Query (Driver API) statically linked version
Cuda driver error 3 in file ‘deviceQueryDrv.cpp’ in line 42.

$ scan
scan Starting…

Allocating and initializing host arrays…
Allocating and initializing CUDA arrays…
main.cpp(56) : cudaSafeCall() Runtime API error : no CUDA-capable device is detected.

Looking in /dev, I do not see any nvidia* entries. I do not know whether this is normal, or how to fix it if it is not.
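[Editor's note: a quick check for those nodes looks like this. This is a sketch; the node names assume the standard NVIDIA driver layout, with /dev/nvidiactl plus one /dev/nvidiaN per GPU.]

```shell
# CUDA user-space libraries open these character devices; if the
# NVIDIA X11 driver never runs, nothing creates them at boot.
ls -l /dev/nvidia* 2>/dev/null || echo "no nvidia device nodes found"
```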

As for the CUDA toolkit, I also reinstalled it and confirmed that the PATH and /etc/ld.so.conf are correct.

Any suggestions as to what to try next?

Michel Lestrade
Crosslight Software

If you read the Linux release notes for the CUDA toolkit, you will find a description of how to get those devices created at boot time when you don't run the NVIDIA X11 driver.

This one, right ? http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_Toolkit_Release_Notes_Linux.txt

What section are you referring to? The only thing I see is some comments about 32-bit systems and vmalloc values that have to be adjusted in grub.conf. Since I am on a 64-bit system, that did not seem relevant.

I was referring to that document, in the “Known Issues” section. NVIDIA seems to have removed it from the CUDA 3.2 notes, but now supplies a PDF that contains the same information. Sorry for the confusion.

Michel,

Hi. I had the same error message yesterday installing a new GTX465 under Gentoo. In my case, moving to the development driver (currently 270.18) solved the problem. I didn't have to create any /dev entries or anything else.

Hope this helps,

Mark

He will, though, because he is using a non-NVIDIA X11 driver; the NVIDIA X11 driver is what would otherwise create the device entries automatically.

OK, the sample script they provided (+a little Googling on how to add stuff to /etc/init.d) fixed things. Thanks.

Previously he showed that he had modprobed nvidia. Why is that a ‘non-NVIDIA’ X11 driver?

Or is the creation of those entries only done when you are actually going to use the card for display purposes?

Anyway, thanks for the additional info. In my case I do use the GTX465 for both display and CUDA.

Cheers

Let’s read the first line of his first post again, shall we:

He is using an ATI card for display and a Tesla C1060 for compute. The ATI card runs a non-NVIDIA X11 driver. That driver will not create the /dev entries required. The NVIDIA X11 driver, which he is not using, would. That is why it is necessary to manually create those /dev entries. Also note the distinction between a kernel driver and an X11 driver. They are not the same thing.

Ah, OK - I see that.

If I'm understanding you correctly, it raises an interesting point. I was unaware that I could even talk to CUDA without an NVIDIA driver installed. Certainly I installed it because it's doing the X11 stuff, but I assumed it was completely necessary even for CUDA work. I guess not.

Anyway, thanks for the info.

Cheers,

Mark

You are not, and it doesn't. To run CUDA you need the NVIDIA kernel driver installed, along with the CUDA driver API library, libcuda.so. You do not need the NVIDIA X11 driver or the OpenGL client libraries. An X11 driver is a binary “plugin” for the Xorg server that provides API hooks between the X11 server and the device. Two objects, both “drivers”, both completely different: one you need for CUDA, the other you don't.

OK, I'll buy that you're right. However, I am talking to CUDA and I don't (visibly, anyway) have a ‘kernel’ driver installed. I suspect this means the standard NVIDIA X11 driver includes the kernel driver, but if you choose not to install the X11 driver, there is another driver that gives you access to CUDA?

Sorry for the confusion on my part. I only started with CUDA yesterday. I’m fine with the Linux side.

mark@c2stable ~ $ lsmod
Module                 Size  Used by
vmnet                 30852  15
vmblock                9883  1
vsock                 33794  0
vmci                  46073  2 vsock
vmmon                 64450  12
ipv6                 211146  32
vboxnetadp             3948  0
vboxnetflt            14445  0
vboxdrv             1737903  2 vboxnetadp,vboxnetflt
nvidia             10408743  28
snd_hda_codec_nvhdmi  12325  4
snd_hda_codec_analog  64664  1
snd_hda_intel         18536  3
snd_hda_codec         55527  3 snd_hda_codec_nvhdmi,snd_hda_codec_analog,snd_hda_intel
sky2                  38718  0
snd_pcm               60107  2 snd_hda_intel,snd_hda_codec
snd_timer             15597  1 snd_pcm
i2c_i801               6596  0
agpgart               23806  1 nvidia
snd                   38675  11 snd_hda_codec_analog,snd_hda_intel,snd_hda_codec,snd_pcm,snd_timer
soundcore               840  1 snd
snd_page_alloc         5961  2 snd_hda_intel,snd_pcm
rtc_cmos               8078  0
mark@c2stable ~ $ !/opt
/opt/cuda/sdk/C/bin/linux/release/deviceQuery
/opt/cuda/sdk/C/bin/linux/release/deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: “GeForce GTX 465”
CUDA Driver Version: 4.0
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1072889856 bytes
Multiprocessors x Cores/MP = Cores: 11 (MP) x 32 (Cores/MP) = 352 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.25 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce GTX 465

PASSED

Press <Enter> to Quit…

mark@c2stable ~ $

What is this then?

nvidia 10408743 28

That is the X11 nvidia driver. It does both X11 and CUDA as far as I can tell. That's my point: I don't have a ‘kernel’ driver installed; I have the standard nvidia driver. So did the OP, as far as I can tell, since he showed his modprobe results. He just wasn't using it for video.

My original reason to respond to this post was only to say that with the 260.19.36 nvidia driver installed under Gentoo I saw the same problem he saw. With the 270.18 nvidia driver I did not.

Given your explanation, I can only assume the 260.19.36 didn't include a ‘kernel’ driver that supported my GTX465, while 270.18 does.

That isn't the X11 driver. It is the NVIDIA kernel driver. It provides the kernel-space support for NVIDIA's GPUs. It knows nothing about CUDA or OpenCL or OpenGL or X11. User space libraries and code (note the distinction between kernel space and user space) interface to it. The NVIDIA X11 driver can be found in a typical installation at /usr/lib/xorg/modules/drivers/nvidia_drv.so. If you look at your Xorg server log you will see something like this:

(II) LoadModule: "nvidia"

(II) Loading /usr/lib/xorg/modules/drivers//nvidia_drv.so

(II) Module nvidia: vendor="NVIDIA Corporation"

        compiled for 4.0.2, module version = 1.0.0

        Module class: X.Org Video Driver

which shows the X11 driver being loaded in user space by the X11 server. You do not need the X11 driver to run CUDA applications, you need the kernel driver and user space libraries (libcuda.so for the driver API, libcudart.so for the runtime library).
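[Editor's note: a quick way to check that those user-space libraries are present. These are typical 64-bit install locations and may differ per distro and toolkit version.]

```shell
# libcuda.so ships with the driver installer; libcudart.so with the toolkit.
ls /usr/lib/libcuda* /usr/lib64/libcuda* /usr/local/cuda/lib64/libcudart* 2>/dev/null \
    || echo "no CUDA libraries found in the usual places"
```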

Thank you for sticking with me through the explanation.

OK, I see the same thing here:

c2stable ~ # locate nvidia | grep nvidia.ko
/lib64/modules/2.6.36-gentoo-r6/video/nvidia.ko
c2stable ~ # locate nvidia | grep nvidia_drv.so
/usr/lib64/xorg/modules/drivers/nvidia_drv.so
c2stable ~ # lsmod | grep nvidia
nvidia 10408743 28
agpgart 23806 1 nvidia
c2stable ~ # cat /var/log/Xorg.0.log | grep nvidia
[ 20.063] (II) LoadModule: “nvidia”
[ 20.063] (II) Loading /usr/lib/xorg/modules/drivers/nvidia_drv.so
[ 20.135] (II) Module nvidia: vendor=“NVIDIA Corporation”
c2stable ~ #

What I was not aware of was that the X11-side module, nvidia_drv.so, doesn't show up in modprobe.

Thanks,
Mark

It doesn't show up in modprobe because it isn't a kernel module. In a monolithic kernel architecture like Linux, kernel space (where device drivers, file systems, and internal functions like memory management, process scheduling, etc. live) and user space (where X11 and other user programs run) are completely separated. The kernel module has access to the hardware and exposes the GPU via device files and some low-level API to user space “client” code, like the X11 server, or CUDA or OpenCL.

So, returning to the original question: can you now see how it might be possible to run both NVIDIA and ATI kernel drivers and an ATI X11 driver and have the problem the original poster started with? And why your first couple of posts in this thread were complete red herrings?

Between your driver explanation and the Known Issues text you pointed to earlier, I can certainly see why my suggestion of trying the 270.18 driver would not have fixed his problem.

However, I think the ‘red herring’ comment is a little unfair, as we haven't explained why, when using my GTX465 and launching X, the /dev entries were created on my machine but I was still unable to use CUDA. I received the exact same error message the OP got (which is why I posted the 270.18 suggestion).

Why was I unable to find the device?

Thanks,

Mark

“We” haven't explained because it is unrelated and not the topic of the thread. If you don't like the expression red herring, I apologize. Perhaps ignoratio elenchi or thread hijack would be more appropriate?

Right. Except that there are at least six different problems I can think of off the top of my head that will manifest with the same symptom:

1. Kernel module not installed or not loaded

2. Mismatch between the kernel module version and the CUDA driver library version

3. Using a third-party repackaged driver which has broken or missing CUDA support

4. Not running the NVIDIA X11 driver and taking no action to create the necessary /dev entries for the kernel driver [the winning entry in this case]

5. Creating the /dev entries by hand but not setting permissions on them that let non-privileged user space processes gain write access to the device

6. SELinux interactions which leave either the driver installer or user space processes with insufficient permissions to gain write access to the device or filesystem
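[Editor's note: causes 1, 4, and 5 above can be triaged from a shell in a few lines. This is a sketch; it assumes the standard node names (/dev/nvidiactl, /dev/nvidia0) and that CUDA needs them world read/writable (mode 666).]

```shell
# 1) Is the NVIDIA kernel module loaded?
lsmod 2>/dev/null | grep -q '^nvidia ' \
    && echo "nvidia kernel module loaded" \
    || echo "nvidia kernel module NOT loaded"

# 4) Do the device nodes exist?
ls /dev/nvidiactl /dev/nvidia0 >/dev/null 2>&1 \
    && echo "/dev entries present" \
    || echo "/dev entries missing"

# 5) Are the permissions open enough for non-root users? Expect 666.
stat -c '%a %n' /dev/nvidia* 2>/dev/null
```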

Despite the fact that you have conflated the original topic of this thread with your problem, they are almost certainly not the same and not caused by the same thing; nor would the solution that fixed the original problem fix yours, or vice versa. The original poster was running different hardware and trying to do something completely different from you, despite the superficial similarities and the shared eventual goal. It is now impossible to say what caused whatever problem you had, because you fixed it.

So I guess we will go with thread hijack, then?