I've got a problem with my second CUDA project. (The first program runs and produces results.)
First I hit the nvcc “ran out of registers” bug/feature. Attempting to work around
it, I made minor changes to the code: I turned the following into a for loop:
ENCCYCLE (0);
ENCCYCLE (1);
…
ENCCYCLE (7);
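The rewrite being described, replacing the eight explicit ENCCYCLE invocations with a loop, looks like this in plain C. The ENCCYCLE body below is a made-up stand-in, since the real macro is not shown in the thread:

```c
/* Illustration only: the real ENCCYCLE body is not shown in the
 * thread, so a simple stand-in round function is used here. */
#define ENCCYCLE(i) (s ^= (s << 3) + (unsigned)(i))

/* the original, manually unrolled form */
unsigned run_unrolled(void)
{
    unsigned s = 1u;
    ENCCYCLE(0); ENCCYCLE(1); ENCCYCLE(2); ENCCYCLE(3);
    ENCCYCLE(4); ENCCYCLE(5); ENCCYCLE(6); ENCCYCLE(7);
    return s;
}

/* the rewritten form: the same eight rounds as a for loop */
unsigned run_loop(void)
{
    unsigned s = 1u;
    for (int i = 0; i < 8; ++i)
        ENCCYCLE(i);
    return s;
}
```

Both functions perform the same eight rounds and return the same value; the difference matters only to the compiler's unroller and register allocator.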
The new code compiles successfully and runs, but never terminates. When run under
X Windows there are no error messages. However, outside X Windows it says
I am 99% sure this is a bug in the toolkit, not in my program. The aim of my project is
to estimate whether the CUDA toolkit + video card is suitable for a certain purpose, and the current status is “CUDA cannot do it due to a bug”.
I’m on 64-bit Linux with NVIDIA driver 169.09 and toolkit version 1.1, if this
matters.
I enclose the full source code and Makefile. The Makefile builds 3 executables:
compiles, terminates within 0.2 seconds
does not compile, with
compiles, does not terminate
The 3 executables are built from a single source file with different preprocessor
directives. The 1st executable is a trimmed-down version of the 2nd. The 3rd
differs from the 2nd only in the for loop mentioned above.
My questions are:
How do I solve the “ran out of registers in integer64” problem without changing
the C source code?
How do I change the code of executable 3 to make it compile and work
properly?
Other than the fact that you’re using an unsupported Linux distribution, I don’t see anything unusual in the bug report. I tried to reproduce this with the code that you attached on a supported Linux distribution, but it failed to build, and I found your build instructions unclear.
Please clarify the build command(s) required to build your test app, or update the Makefile so that it can be built by running ‘make’.
> Please clarify the build command(s) required to build your test app, or update the Makefile so that it can be built by running ‘make’.
Oops? Did you read the file ReAd.It? The build process is described there. I will briefly repeat it here.
The build process involves 2 steps. First make an executable script called CUDA and place it somewhere in $PATH. Then go to the directory containing the Makefile and type make. This should attempt to build and run the 3 executables. The output of each run is redirected into a separate file per executable.
Yes, I read ReAd.It. Your instructions on how to make an executable script called CUDA were unclear. If building this app requires more than just running ‘make’ using the Makefile you provided, then please provide any additional requisite script(s) or build commands.
A separate script (setting up some environment variables) is needed by the Makefile because I don’t know where you installed the CUDA toolkit. To run my code, do the following:
Go to /usr/local/bin.
Open an empty file in your favorite text editor.
Type in the 5 lines found between
==== /usr/local/bin/CUDA start ====
and
==== /usr/local/bin/CUDA end ====
inside the file ReAd.It.
Go to the 1st line of the file and replace < path to toolkit > with the directory where you installed the CUDA toolkit. This directory should contain the 5 subdirectories bin, doc, include, lib, open64; and the directory bin/ should contain the file nvcc among others.
Save the file as CUDA.
Leave the text editor.
Type < ls -l > to check that the file CUDA is present in the current directory /usr/local/bin.
Make the file executable by issuing the command < chmod +x ./CUDA >.
Now go to the directory containing ReAd.It and the Makefile.
Type < CUDA nvcc --version >. This should run the nvcc executable; you should see 4 lines of nvcc introduction.
If you want to build all 3 executables yourself, type < make clean >. This will erase the 2 executable files exe/*.
Type < make >, then watch the executables compile and/or run. You may want to open another window to view the files
rezult.*.
If the 1st executable did not build, this could be because the file common/inc/cutil.h was not located by my Makefile. cutil.h is part of the CUDA SDK.
The standard output and standard error of the 1st executable will be in rezult.1.0.cout and rezult.1.0.cerr respectively.
The second executable won’t build.
The results of the 3rd executable will be in rezult.2.1.cout and rezult.2.1.cerr.
The 3rd executable won’t terminate, yet it stops loading the CPU after a fraction of a second. Type < ps | grep cuda > or < ps | grep make > to check what’s going on.
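For reference, the environment-setup wrapper that these steps build might look roughly like the following. The real five lines live in ReAd.It, so everything here is an illustrative assumption: the install path, and the use of a shell function instead of a standalone /usr/local/bin/CUDA file, just so the example is self-contained.

```shell
# Illustrative sketch only: the real /usr/local/bin/CUDA script's five
# lines are given in ReAd.It. The install path below is an assumption.
CUDA_ROOT=${CUDA_ROOT:-/usr/local/cuda}   # replace with your toolkit path

cuda_env() {
    # put the toolkit's compiler and runtime library on the search paths,
    # then run the wrapped command, e.g. `cuda_env nvcc --version`
    PATH="$CUDA_ROOT/bin:$PATH"
    LD_LIBRARY_PATH="$CUDA_ROOT/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
    "$@"
}
```

The point of the wrapper is only to make nvcc and libcudart findable without hard-coding the toolkit location into the Makefile.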
Since usage of the external script /usr/local/bin/CUDA became a problem, I changed the Makefile to automagically find the toolkit, so the script is no longer needed.
The new Makefile is attached to this message.
The new version is shorter, produces more verbose output, and has correct dependencies. Prior to running an executable it prints a message.
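One plausible way to do the auto-detection (a sketch, not necessarily what the attached Makefile does) is to derive the toolkit root from the location of nvcc on $PATH:

```make
# Sketch: locate nvcc on $PATH and derive the toolkit root from it.
# The attached Makefile may do this differently.
NVCC      := $(shell which nvcc)
CUDA_ROOT := $(patsubst %/bin/nvcc,%,$(NVCC))
CFLAGS    += -I$(CUDA_ROOT)/include
LDFLAGS   += -L$(CUDA_ROOT)/lib -lcudart
```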
I managed to work around the ran-out-of-registers and kernel-loops-forever bugs by changing the source code. The new code compiles, runs, and terminates, slightly outperforming the central processor (see my signature for details).
My code heavily and randomly accesses constant memory, hence the GPU is only slightly faster than the CPU for now. I hope to speed up the CUDA code by moving the constants into shared memory.
Hence I should report the intermediate result of the project:
the NVIDIA compiler is BUGGY,
but sometimes it is worth spending time programming for the GPU
I believe constant memory is as fast as it gets because it is cached, as long as all threads access the same index (if it is an array); otherwise it is indeed smart to put the data into shared memory, or even a texture might do the trick.
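The staging pattern under discussion, copying a constant table into shared memory once per block so that divergent per-thread indexing no longer serializes on the constant cache, might look roughly like this (the table name, its 256-byte size, and the kernel are made up for illustration):

```cuda
// Sketch only: c_table's name and size are assumptions,
// not the thread author's actual code.
__constant__ unsigned char c_table[256];

__global__ void substitute(unsigned char *data, int n)
{
    // Stage the constant table into shared memory once per block.
    __shared__ unsigned char s_table[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_table[i] = c_table[i];
    __syncthreads();  // make the staged copy visible to all threads

    // Random indexing into s_table does not serialize the way
    // divergent constant-cache reads do (bank conflicts aside).
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = s_table[data[idx]];
}
```

The constant cache broadcasts only when all threads of a half-warp read the same address, which is why a randomly indexed table is a poor fit for constant memory.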
Buggy is a strong statement, I think. It does not generate wrong code; it crashes under certain circumstances. I have had that happen once too, but to be honest my code was crappy (in hindsight), and the compiler does not trip over my cleaner code.
FWIW, I’d rather have it crash than generate wrong code; I already have enough trouble debugging my own bugs :P
I prefer to take a working CUDA-unaware program and convert it to CUDA code with a perl/bash script. This ensures the absence of bugs. And very often the compiler-ran-out-of-registers bug stops me (setting Olimit appears to have no effect). I failed several times before creating a variant which compiles and fits entirely in registers. And the code is not optimal: if the compiler worked properly, I could make it better.
Moving the hard-coded tables from constant to shared device memory more than doubled the speed, so now my G84-based card is more than 3 times better than a two-core Athlon for this project (which means that a $250 video card should be >12 times better). I am changing my signature accordingly.
I’m using approximately half of the shared memory, so I will try switching from 1-byte char to 4-byte int.
Enlarging the tables gave a slight improvement. The program now occupies 124 registers per thread and 11 KB of shared memory.