Problem: kernel returning wrong results

Hi all, I'm kind of new to CUDA programming and I'm trying to implement a kernel for this C++ function:

[codebox]int search1match (OneMatch *O, register uchar *text,
                  int from, int to, int *matches)
{
    register int n = from;
    register bool *S = *O;
    register int count = 0;

    if (n < 0) n = 0;

    while (true)
    {
        while (S[text[n++]]);   // skip ahead while the character matches
        if (n > to) break;
        matches[count++] = n;
    }
    return count;
}[/codebox]

and this is my kernel:

[codebox]__global__ void oneMatch_kernel(bool *S, char *text, int *matches,
                                int *count, int from, int to)
{
    if (from < 0) from = 1;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < from) return;
    if (i > to) return;

    matches[count[0]++] = i;
}[/codebox]


The function works by comparing characters: each time the characters match, it increments the variable count. But the values returned by the kernel are wrong, so if anyone can help it would be great.

Thanks in advance.

Try to debug your kernel and see what is going on.

I've been debugging it for days and I can't seem to find out what's going on.

Why `count[0]` on the last line? And `while (S[text[i]]);` is very strange. It looks like you have wrong assumptions about how a GPU program works. It runs in parallel; it is not a loop.

count is what I want to return from the kernel, but I couldn't seem to use a plain int variable, so I used an array and incremented its first element. And I know I shouldn't use this while loop, but I want each thread to loop over the array. The thing is, it works perfectly in the serial code.

Did you check in debug mode that the while loop is performed only once? How do you debug the program on the GPU?

I tried to debug it using Visual Studio, but I couldn't find a way to debug the kernel. If you know any way that I can debug it, that would be of great help.


Use device emulation mode.

How can I use that?

Check the documentation: Programming Guide, section 3.2.9, "Debugging using the Device Emulation Mode".
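In case it helps, on the older toolkits (before CUDA 3.0, where emulation was removed) device emulation is just a compiler switch; something like this, with your own file names substituted:

[codebox]# Build in device emulation mode: the kernel then runs on the CPU,
# so breakpoints and printf inside the kernel work in the debugger.
nvcc -deviceemu -g -o mykernel mykernel.cu[/codebox]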

You can't just copy/paste serial code into CUDA :)

The count[0]++ means that all threads in all blocks will concurrently write to the same location in memory, thus creating a race condition and obviously producing faulty results.

Either use atomic functions or re-implement your algorithm to be thread-safe. This really is not a CUDA/GPU issue but more of a general multi-threading issue.


Can you just give me a quick hint about atomic functions, like how I can use them?

The simpleAtomicIntrinsics sample in the SDK and the programming guide (atomicAdd/atomicSub) might be a good starting point.
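For your kernel specifically, a minimal sketch of the atomicAdd approach could look like this (untested; I've assumed you also want the per-character match test from your serial version, which your posted kernel doesn't have; names mirror your post):

[codebox]// Sketch only: each thread reserves its own unique slot in matches[]
// via atomicAdd, so concurrent threads never write to the same index.
// Requires compute capability 1.1+ for atomicAdd on global memory.
__global__ void oneMatch_kernel(bool *S, char *text, int *matches,
                                int *count, int from, int to)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < from || i > to) return;

    if (S[(unsigned char)text[i]])       // assumed per-position match test
    {
        int slot = atomicAdd(count, 1);  // returns the old value of *count
        matches[slot] = i;
    }
}[/codebox]

Remember to zero *count on the device before each launch, and note the matches come out in no particular order; sort them afterwards if order matters.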

Mind you, atomics are slow, and if you can partition your code to be thread-safe without atomics, it will probably run faster.
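One common atomic-free layout, as a sketch (markMatches_kernel and flags are names I made up, and the match test is an assumption): each thread writes only to its own slot, and a second pass (a prefix sum / stream compaction, on the host or the GPU) gathers the match positions and the count.

[codebox]// Sketch: one flag per text position, no shared counter, no races.
__global__ void markMatches_kernel(bool *S, char *text, int *flags,
                                   int from, int to)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < from || i > to) return;
    flags[i] = S[(unsigned char)text[i]] ? 1 : 0;   // assumed match test
}
// Afterwards, compact flags[] into a list of indices: an exclusive
// prefix sum over flags gives each match its output slot, and the
// final sum is your count.[/codebox]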

OK, I'll check it. Thanks for the help!