different results for same CUDA code

Hey, I am facing a weird problem. In my code, if I add some additional printf statements, the result is correct… and if I remove those unnecessary printf statements, the answer is wrong. What's happening???

Please reply, anyone, if you are facing the same problem and have a solution…

Can you post the code?

I am attaching the code file…

In addition_CUDA.cu at line 241:

for (UINT32 i = 0; i < HISTOGRAM_LENGTH; i++)
{
    //printf("\nweights_H[%u]= %f", i, weights_H[i]);
    sumOfWeights += weights_H[i];
}

If I keep the printf commented out, the result is correct; if I uncomment it, the result changes and becomes wrong…
ObjectTrackerCopy.h (16.1 KB)
addition_CUDA.cu (11.6 KB)
ObjectTrackerCopy.h (16.1 KB)
addition_CUDA.cu (11.6 KB)

You might have a race condition and the printf() is forcing an implicit synchronization that is masking the problem for you.

I didn't get your explanation. This printf is called after the device function completes, so how can it affect the device function's result???

Have you tried replacing each printf with cudaThreadSynchronize(), or __syncthreads() (if the printf is inside a kernel)?

First, you can't put printf inside a kernel. And yes, I tried replacing printf with cudaThreadSynchronize(), but it doesn't work…!!

Well, I am a bit confused by your conflicting statements of your problem.

Anyway, weights_H is not guaranteed to have the correct value before you call cudaThreadSynchronize(). You should put that call after line 238.
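To spell out the usual sequence (a CUDA-style pseudocode sketch with hypothetical names like myKernel and d_weights, not your attached code): launch the kernel, check the launch error, synchronize, check the execution error, then copy results back:

```cuda
// hypothetical kernel and buffers -- a sketch, not the attached code
myKernel<<<grid, block>>>(d_weights, n);

cudaError_t err = cudaGetLastError();      // catch launch errors
if (err != cudaSuccess) { /* handle */ }

cudaThreadSynchronize();                   // wait for the kernel to finish
                                           // (deprecated; cudaDeviceSynchronize()
                                           // in newer CUDA versions)
err = cudaGetLastError();                  // catch execution errors
if (err != cudaSuccess) { /* handle */ }

cudaMemcpy(weights_H, d_weights, n * sizeof(float),
           cudaMemcpyDeviceToHost);        // blocking copy back to the host
```

Only after the synchronize and the copy back is weights_H safe to read on the host.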

If this still doesn’t solve the problem for you, it could be that your kernel does not set weights_H properly, or just that your kernel fails.

I tried cudaThreadSynchronize()… I tried every possible combination of cudaThreadSynchronize() with and without printf(), but it only works in the form given in the file… That's ridiculous and is driving me crazy… and if I go on to the further processing required to complete the algorithm, these results get modified…

This is swirling my brain… code that is going to execute in the future affects the present execution. That's awful.

Please tell me the proper procedure to call and execute any CUDA kernel completely, with all required synchronization statements and other precautions…!!!

Hey everyone… I found the bug… It is the use of an array inside a structure… :)

CUDA supports structures, but using an array inside a structure creates problems. I knew pointers also create problems, but now I have found that using arrays is risky business too. So, always keep arrays separate from structures…

So, it's about how you allocate space in CUDA memory, isn't it?

Say I have a structure named COMPLEX, which has real and imaginary float members, and I want to make a 2048-element array of COMPLEX.

How should I allocate space in CUDA memory? I used this and got errors in the imaginary part only; the real part was correct.

cudaMalloc((void **)&SPACE_FOR_COMPLEX, 2048 * sizeof(COMPLEX));

Regards,

I am not sure about your problem. Can you post the structure definition so I can get a better understanding?

In my case, I was using an array inside a structure that was pointing to host memory, so I had to explicitly copy the whole array into device memory.

Okay. Here is a code sample.

typedef struct complex_t
{
    float real;
    float imaginer;
} COMPLEX;

Then in main:

int main()
{
    COMPLEX *DATA_IN;                                       // I want 2048 DATA_IN elements in a 1D array on the GPU

    cudaMalloc((void **)&DATA_IN, 2048 * sizeof(COMPLEX));  // allocate memory in GPU global memory

    .......

    return something;
}

Well, I have seen some CUDA SDK samples that use the cuComplex.h header, and they use this method, allocating all the data with a single cudaMalloc().

Any idea?