This is my first post here, and I'm starting with a problem I'm experiencing
while porting my C++ code to CUDA.
The algorithm computes a 2D array on the GPU. Sometimes it works correctly,
giving an exact match with the emulation code, but sometimes it doesn't work
at all and returns an array full of zeroes.
The inconsistency is that the output depends on how the C/C++ device
instructions are arranged (which really shouldn't make any difference).
For example, if I have an if-clause like this (where SomeComputation is a class
member variable), I don't get correct results (i.e. the 2D array comes back all zeroes):
if (SomeComputation)
{
    result += SomeComputationFunction();
}
Then if I hack the code to something like this:
if (true) //if (SomeComputation)
{
    result += SomeComputationFunction();
}
then everything is computed fine.
The same thing happens in several other places, which makes the whole port very
"sensitive" and unstable with respect to how the code is written… It's as if the
GPU hits some error and exits prematurely… After a lot of experimentation, I
still haven't figured out what's wrong… It seems like a compiler (optimization?)
issue, or something I still don't understand about CUDA.
Note that I am using the latest drivers, 197.45 (I also tested the developer drivers),
and CUDA 3.0.14, and the results are always correct and stable when using emulation
code. I'm testing on a GT220.
Any ideas? Or any pointers towards a solution? :">
Thanks in advance.
Apparently [font=“Courier New”]SomeComputationFunction()[/font] has some unexpected side effect. One candidate would be a call to [font=“Courier New”]__syncthreads()[/font], which must be reached by all threads of a block.
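To illustrate that hazard, here is a minimal hypothetical sketch (not the poster's actual code; kernel names, sizes, and the shared-memory usage are made up): if [font=“Courier New”]__syncthreads()[/font] sits inside a branch that only some threads of a block take, the behavior is undefined, and the kernel can hang or silently produce garbage.

```cuda
// Hypothetical illustration of the __syncthreads() divergence hazard.
__global__ void broken(float *out, const float *in)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (in[i] > 0.0f)            // divergent branch: not all threads enter
    {
        tile[threadIdx.x] = in[i];
        __syncthreads();         // UNDEFINED: some threads never reach the barrier
        out[i] = tile[threadIdx.x ^ 1];
    }
}

__global__ void fixed(float *out, const float *in)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (in[i] > 0.0f) ? in[i] : 0.0f;
    __syncthreads();             // every thread of the block reaches the barrier
    out[i] = tile[threadIdx.x ^ 1];
}
```

The safe pattern is to hoist the barrier out of the conditional so that all threads of the block execute it unconditionally.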
SomeComputation is a member variable, and the function called afterwards is a member function.
The whole code is C++ through and through: inheritance, templates, etc.
The code above is just an example; the results become inconsistent not only when I
change this specific piece but also when I change other parts. I am quite experienced
with C++, so I doubt the problem lies in the code itself (although, of course, you
never know). But I am not yet very comfortable with CUDA, so it must be something
I am missing…
I am quite sure about the code itself; it is actually a simplification of a much
bigger C++ codebase that compiles and runs fine with various compilers.
Let's say we have a C++ kernel that uses global memory (as const, no writes) and
local memory, and that's all (no shared memory). But that kernel returns results
that "fluctuate", i.e. they are not the same from run to run. I am not calling
__syncthreads() [I don't see any reason to], and on return there is no error.
Assuming the code is fine (it runs correctly in emulation as a first hint of
correctness, although that is not always enough), are there any issues that I am
missing here?
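One thing worth double-checking about "on return, there is no error": kernel launches are asynchronous, so an error status read right after the launch can miss failures; an execution error (e.g. an out-of-bounds access) only surfaces after synchronizing, and an aborted kernel typically leaves the output buffer untouched, which would look exactly like "all zeroes". A generic host-side sketch (the kernel name and launch configuration are placeholders, not from the code above):

```cuda
// Hypothetical launch with full error checking.
myKernel<<<gridDim, blockDim>>>(d_out, d_in);

// Catch launch-configuration errors (bad grid/block size, too many resources, ...).
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

// Kernel launches are asynchronous: block until the kernel finishes,
// then pick up any execution error (e.g. an out-of-bounds access).
err = cudaThreadSynchronize();   // cudaDeviceSynchronize() in later toolkits
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```

If the synchronize call reports an error only on the runs that come back all zeroes, that would point to the kernel aborting rather than to a compiler issue.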