Inconsistent results - can't explain


This is my first post here, and I am starting with a problem I ran into
while porting my C++ code to CUDA.

The kernel computes a 2D array. Sometimes it works correctly, giving an exact
match with the emulation code, but sometimes it does not work at all and
returns an array of zeroes.

The inconsistency is that the output depends on how the C/C++ device
instructions are arranged (which really shouldn’t make any difference).
For example, if I have an if-clause like this (where SomeComputation is a class
member variable), then I don’t get correct results (i.e. the 2D array comes back all zeroes):
    if (SomeComputation)
        result += SomeComputationFunction();
If I then hack the code into something like this:
    if (true) // if (SomeComputation)
        result += SomeComputationFunction();
then everything is computed fine.

The same thing happens in several other places, making the whole port very “sensitive”
and unstable with respect to code changes. It’s as if the GPU encounters some
error and exits prematurely. After a lot of experimentation, I still haven’t
figured out what’s wrong. It seems like a compiler (optimization?) issue,
or something that I still don’t get about CUDA.

Note that I am using the latest drivers, 197.45 (I also tested the developer drivers), and CUDA 3.0.14,
and the results are always fine and stable when using emulation code. I am testing on a GT220.

Any ideas, or any pointers to a solution?
Thanks in advance.

Probably something is not initialized. You need a small test case with real code to show the problem.
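
As an illustration of the kind of initialization bug meant here, consider a parameter object that is built on the host and later copied to the device. The sketch below is plain host-side C++; the `KernelParams` struct, its members, and `makeParams` are hypothetical and not taken from the code in question. It only shows how default member initializers give every field a defined value before any copy takes place. Whether this applies to the actual code is an assumption.

```cpp
// Hypothetical parameter block; the member names are illustrative only.
struct KernelParams {
    bool  someComputation = false;  // default member initializers give
    float scale           = 1.0f;   // every field a defined value
    int   width           = 0;
    int   height          = 0;
};

// Build the parameters on the host. Fields not set here keep their defaults
// instead of holding whatever happened to be in memory.
KernelParams makeParams(int width, int height) {
    KernelParams p;              // all members start from the defaults above
    p.someComputation = true;
    p.width  = width;
    p.height = height;
    return p;
}
```

A member left uninitialized can happen to hold a harmless value in an emulation build and a different one on the device, which is one way the two could disagree.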

Apparently SomeComputationFunction() has some unexpected side effects. One candidate would be the use of __syncthreads(), which has to be reached by every thread of the block.

Well, if I try to make the code “minimal” then everything works.

An initialization problem doesn’t seem probable to me: if that were the case, I should at
least get the same results when replacing the variable with the constant true. I checked
the variable by reading its value back to the host, and it was true (not that I expected
anything different there).

I saw in another post that something like this could be related to a kernel aborting
because it runs out of registers. But I checked for errors after the kernel call with
cudaGetLastError() and cudaThreadSynchronize(), and they both returned cudaSuccess.

What’s going on?

I am using global memory only to read values, and no shared memory at all.

SomeComputationFunction() reads global memory without writing anything to it;
it just performs a computation based on the local variables passed in.

I doubt that there’s something specific about SomeComputationFunction(),
since similar inconsistencies appear in other blocks of the code.

I don’t use any __syncthreads() at all.

Is there anything like an “instruction limit” for kernel execution?

I don’t know what else to think. I am at a dead end here.

The limit on kernel size is 2 million PTX instructions. I doubt you hit that.

What does

    temp = SomeComputationFunction();
    if (SomeComputation)
        result += temp;

do?
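
For reference, the difference between this form and the original one can be sketched in plain host-side C++. The `Worker` struct, the call counter, and the function body are hypothetical stand-ins; only the names SomeComputation and SomeComputationFunction come from the posts above. With the flag set, both forms return the same value. The hoisted form simply runs the function regardless of the flag, so the call no longer sits inside the branch.

```cpp
// Hypothetical stand-in for the class in question; the function body and
// the call counter are invented for illustration.
struct Worker {
    bool someComputation = true;
    int  calls = 0;                 // counts how often the function runs

    float someComputationFunction() {
        ++calls;
        return 2.5f;
    }

    // Original form: the call happens only when the flag is set.
    float conditionalCall(float result) {
        if (someComputation)
            result += someComputationFunction();
        return result;
    }

    // Suggested form: the call always happens, only the add is conditional.
    float hoistedCall(float result) {
        float temp = someComputationFunction();
        if (someComputation)
            result += temp;
        return result;
    }
};
```

If the hoisted form behaves differently on the device, that would point at the call being inside the branch; if it fails the same way, the branch is probably not the cause.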



How can I be certain about this?

Same problem.

You’d have to trust Appendix A of the Programming Guide.

Just to rule out the obvious: You’ve noticed that the two variants you posted are not equivalent?

Sure, but how can I be certain about the number of instructions in my current kernel?
Is there a tool that can tell me that, or will the compiler complain? I doubt
that I am anywhere near the limit; I am mostly asking out of curiosity.

The ones at the beginning? They are not the same, but they should behave the same.
I know that the variable used in the if-clause is true, but of course the compiler
does not know that in the first case.

Better to set temp = some computation.
Are you sure it is a variable, and not some C++ function, macro, etc.?

It doesn’t work…

I saw in another post that it could be an issue with running out of registers,
but I simply don’t get any errors after the kernel call.

SomeComputation is a member variable. The function called afterwards is a member function.
The whole code is C++, the whole thing: inheritance, templates, etc.

The code above is just an example; the results become inconsistent not only when I change
this specific code but also when I change other parts. I am quite experienced with C++, so
I doubt that the problem lies within the code (of course, you never know). But I am not very
comfortable with CUDA yet, so it must be something that I am missing.

Here’s the post I was referring to. I think it’s very similar to my problem
(commenting out or slightly changing the code changes the overall output).

BUT, unlike that post, I don’t get any errors after the kernel call. Any other ideas?

CUDA does not have virtual functions, and it may be some compiler issue with C++ templates. You need to send the code to the NVIDIA compiler engineers.

Yes, there are no virtuals or exceptions in the code.

If only there were a pointer to the problem, I’d be able to fix it. I am sure
there’s something I am missing about CUDA.

What type is SomeComputation? try
Are you sure about the result variable?

I am quite sure about the code itself; it is actually a simplification
of a much bigger C++ code base that compiles and runs fine with various compilers.

Let’s say that we have a kernel written in C++ that uses global memory (read-only,
no writes) and local memory, and that’s all (no shared memory). Yet that kernel gives
back results that “fluctuate” instead of staying the same. I am not calling
__syncthreads() anywhere (I don’t see any reason to), and on return there is no error.

Assuming that the code is fine (it runs correctly in emulation, which is a first hint of
correctness, although not always sufficient), are there any issues that I am missing here?
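
A host-side check along these lines could at least show where the device output first departs from the emulation output, and whether a run produced the all-zero array described earlier. This is a minimal sketch in plain C++; the function names `firstMismatch` and `allZero` and the tolerance parameter are hypothetical and not part of the original code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Index of the first element whose absolute difference exceeds `tolerance`
// (a NaN difference also counts as a mismatch), or -1 when the buffers agree.
// Buffers of different length mismatch at the end of the shorter one.
long firstMismatch(const std::vector<float>& device,
                   const std::vector<float>& reference,
                   float tolerance) {
    const std::size_t n = std::min(device.size(), reference.size());
    for (std::size_t i = 0; i < n; ++i) {
        const float diff = std::fabs(device[i] - reference[i]);
        if (!(diff <= tolerance))          // written this way to catch NaN
            return static_cast<long>(i);
    }
    if (device.size() != reference.size())
        return static_cast<long>(n);
    return -1;
}

// True when every element is exactly zero.
bool allZero(const std::vector<float>& buffer) {
    for (float v : buffer)
        if (v != 0.0f) return false;
    return true;
}
```

Knowing whether the output is zero everywhere or only diverges from some index onward helps separate "the kernel wrote nothing" from "the kernel computed something different".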