Trivial example code does not actually use the GPU device .. very strange.

So first, this is a Red Hat Enterprise Linux machine with an NVIDIA Quadro K4200, and I felt it reasonable to at least try a bit of CUDA sample code. I installed the CUDA 9.0 kit earlier this year and nvcc seems to do what it claims to do. However, the most trivial bit of sample code compiles and links yet appears to do nothing with the GPU. Very strange.

See page at https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/

The instructions on that page don’t quite work for me and I don’t know why. However, code is code, and it should compile and link cleanly, which I do as follows:

$ nvcc -I/usr/local/cuda/include -I. -arch=compute_30 -x cu -dc v3.cpp -o v3.o
$ nvcc -I/usr/local/cuda/include -I. -arch=compute_30 -x cu -dc particle.cpp -o particle.o
$ nvcc -I/usr/local/cuda/include -I. -arch=compute_30 -x cu -dc main.cpp -o main.o

That gives me the three object files and then :

$ nvcc -L/usr/local/cuda/lib64 -arch=compute_30 -o app main.o particle.o v3.o

Which results in the particle calculation executable “app” :

$ ./app
Moved 1000000 particles 100 steps. Average distance traveled is |(30.059565, 36.016434, 23.756998)| = 52.584751
$

Wonderful.

However, the result appears to have nothing to do with the GPU and is entirely CPU bound; ldd doesn’t even show any CUDA libraries linked in:

$ ldd app
linux-vdso.so.1 => (0x00007ffd0fd0e000)
librt.so.1 => /lib64/librt.so.1 (0x00007fd33deb7000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd33dc9b000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fd33da96000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fd33d78f000)
libm.so.6 => /lib64/libm.so.6 (0x00007fd33d48d000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fd33d276000)
libc.so.6 => /lib64/libc.so.6 (0x00007fd33cea9000)
/lib64/ld-linux-x86-64.so.2 (0x000055f7ca973000)
$

So I am not sure what the trivial issue is, but I am guessing it has something to do with needing to mark the device code as opposed to the host code bits, and that a bit of linkage with libcudart would also be helpful.
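
For what it is worth, my next experiment will be a bare-bones kernel launch with explicit error checking, just to see whether a kernel can run on this device at all. Something along these lines (a quick sketch of my own, not the code from the article):

    // sanity.cu: launch a trivial kernel and report any CUDA runtime errors
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch(int *p) { p[threadIdx.x] = threadIdx.x; }

    int main(void)
    {
        int *d = NULL;
        cudaError_t err = cudaMalloc(&d, 32 * sizeof(int));
        if (err != cudaSuccess) { printf("cudaMalloc: %s\n", cudaGetErrorString(err)); return 1; }

        touch<<<1, 32>>>(d);             // one block of 32 threads
        err = cudaGetLastError();        // did the launch itself fail?
        if (err != cudaSuccess) { printf("launch: %s\n", cudaGetErrorString(err)); return 1; }

        err = cudaDeviceSynchronize();   // did the kernel actually run to completion?
        if (err != cudaSuccess) { printf("kernel: %s\n", cudaGetErrorString(err)); return 1; }

        printf("kernel ran on the device without error\n");
        cudaFree(d);
        return 0;
    }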

Any hints ?

Dennis

ps: I also tried http://scv.bu.edu/documents/gpu_info.cu, which doesn't get past the compile stage for me. Not sure why, but that is another topic.

nvcc links statically against libcudart by default; it will not show up in ldd output.

If you want to discover whether the GPU is being used, I suggest trying one of the GPU profilers, such as nvprof.
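
For example:

$ nvprof ./app

If the particle kernel really executes on the device, it will appear in nvprof's GPU activities summary; if nothing runs on the GPU, there will be nothing to report there.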

ps:
I didn’t have any trouble compiling that code from the bu.edu website with this command:

nvcc -o gpu_info gpu_info.cu

Thank you for the reply. Sure enough, the gpu_info code seems to work like magic:

$ nvcc -o gpu_info gpu_info.cu
$ file gpu_info
gpu_info: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=3252d10269c76b075ecbc288a1ed014e98f875a3, not stripped
$

$ ./gpu_info
There is 1 device supporting CUDA

====== General Information for device 1 ======
Name: Quadro K4200
Compute capability: 3.0
Clock rate: 784000
Device copy overlap: Enabled
Kernel execution timeout: Enabled
----- Memory Information for device 1:
Total global memory: 4232183808 (4036 MB)
Total constant memory: 65536
Max memory pitch: 2147483647
Texture alignment: 512
----- MP Information for device 1:
Multiprocessor count: 7
Shared mem per block: 49152
Registers per block: 65536
Threads in warp: 32
Max threads per block: 1024
Max thread dimensions: (1024 1024 64)
Max grid dimensions: (2147483647 65535 65535)
$

Okay … interesting. That is not really device code, however; it seems to be entirely host-based.

The code at https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/ is a whole other matter, where I expect the computation to happen on the GPU and it just doesn’t. As for libcudart being linked statically … that baffles me. I was surprised to see the library not referenced in the ELF file as NEEDED; given that this is a dynamic executable, I would have expected it to be:

$ readelf -del app | grep -E "NEED|PATH"
[ 8] .gnu.version_r VERNEED 00000000004019c8 000019c8
0x0000000000000001 (NEEDED) Shared library: [librt.so.1]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000006ffffffe (VERNEED) 0x4019c8
0x000000006fffffff (VERNEEDNUM) 7
$

I can assure you that it is entirely CPU bound and never uses the GPU at all.

$ ./app 100000000 1412341
Moved 100000000 particles 100 steps. Average distance traveled is |(5.368709, 5.368709, 5.368709)| = 9.298877
$

That takes some time; htop shows it as entirely CPU bound, and nvidia-smi shows a dead-cold, nearly idle GPU.

Dennis

ps: I think it may be a good idea for me to flush out CUDA 9.0 and
update to 9.2 and then come back to try out these “Hello World”
samples.

It might be CPU bound, but it does use the GPU; it is running calculations on the GPU. Whether it is a fine example of perfectly GPU-bound code I can’t say, and I would suggest that is beside the point of the article you got the code from. Try running the CUDA nbody sample code in benchmark mode instead.

You don’t seem to understand the difference between static and dynamic linking.

https://kb.iu.edu/d/akqn

libcudart_static.a is linked statically into the executable. That provides the interface to the CUDA runtime library. The libcudart.so shared object is not needed by a CUDA executable linked that way, so it won’t show up in ldd output, nor will it be reflected as NEEDED in the executable.
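
If you want to convince yourself that the runtime really is baked into the binary, one rough check (assuming the executable is not stripped) is to look for the runtime API symbols directly:

$ nm ./app | grep -w cudaMalloc

With the static runtime linked in, that should show cudaMalloc as a symbol defined in app itself rather than an undefined reference to be satisfied by a shared library.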

https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#file-and-path-specifications

--cudart {none|shared|static} (short option: -cudart)
Specify the type of CUDA runtime library to be used: no CUDA runtime library, shared/dynamic CUDA runtime library, or static CUDA runtime library.

Allowed values for this option: none, shared, static.

Default value: static
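
So, for example, relinking with something like:

$ nvcc -L/usr/local/cuda/lib64 -arch=compute_30 --cudart shared -o app main.o particle.o v3.o

should produce an executable whose ldd output does list libcudart.so, at the cost of needing that shared library to be found at run time.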

Hrmmm … I’ll have to try nvcc with --cudart shared and then see what happens.

Also, I’m not sure where or what the “CUDA nbody sample code in benchmark mode” is, but I’ll search around for that too.

Thank you for the help. I am now going to try to install the CUDA 9.2 kit, but I fear it may bork up my machine … should be interesting. I shall return.

Dennis

The sample codes are usually installed with the CUDA toolkit.

They are described here:

https://docs.nvidia.com/cuda/cuda-samples/index.html#samples-reference

you should find them (uncompiled, probably) in

/usr/local/cuda/samples

on your machine.
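
To get the nbody benchmark going, the steps should be roughly as follows (sample paths from memory, and you may need to copy the samples tree somewhere writable first):

$ cd /usr/local/cuda/samples/5_Simulations/nbody
$ make
$ ./nbody -benchmark

nvidia-smi should show the GPU busy while it runs.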

If you compile with

-cudart shared

you should see an appropriate libcudart reference in ldd

Yes indeed the CUDA 9.2 kit installs a whole pile of samples into /usr/local/cuda-9.2
and the symlink is /usr/local/cuda. Very nice.

However, something horrific must have happened during the install process: I now seem to have driver 396.26 and, sure enough, upon reboot there is nothing but a black screen. So I have switched to a laptop and will try to install version 390.67, which is the most recent “Long Lived Driver”. It will be a while before I can get back to trying a compile of those samples.

Dennis

ps: as expected my machine is borked :

$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

So I have been here before but it takes hours to sort out.

Update … still working on this :

https://devtalk.nvidia.com/default/topic/1037160/linux/cuda-9-2-install-on-rhel-7-5-results-in-driver-mismatch-nvidia-ko-396-26-nvidia-modeset-ko-396-24-/

Progress … however, it did take a few hours.

Sure enough, a removal and then reinstall of CUDA 9.2 results in a driver mismatch; however, I sorted that out, along with some libvdpau and libEGL issues, after rebooting half a dozen times.

Everything seems to work now: the nvcc compiler runs fine, and I can specify that libcudart be linked shared to get a nice small output executable:

$ nvcc -v --cudart shared gpu_info.cu -o gpu_info

Works great … going to try the nbody sample now :-)

Dennis