problem with dot product code

I am having a problem with code that calculates the dot product of two vectors, r1 and r2. This should be simple, but for some reason I am unable to get it to work.

The vectors are one-dimensional.

dim3 block(BlockSize, 1);
dim3 grid(vec_size / block.x, 1);
dot2<<<grid, block>>>(r1, r2, result);

.........................................

__global__ void dot2(float* r1, float* r2, float* result){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    float sum = r1[tid] * r2[tid];
    __syncthreads();
    *result += sum;
}

The problem is with the kernel code. It looks fine to me, and I can't find the error. By "problem" I mean that the answer does not match the CPU and CUBLAS results, so clearly I am missing something.

Any help will be appreciated.

The problem is that the final sum is being added to by all the threads simultaneously… that’s undefined behavior (other than “one write will succeed.”)

You likely need a fancier shared-memory reduction that computes the sum in a parallel binary tree… take a look at the SDK examples like Scan to see how.

If you were using integers, you could use atomic operations, but those would require N writes since they’d be sequential. The parallel algorithms do it all in log2(N) steps.

The order in which the threads write to the memory location is undefined, but one will succeed - from the manual.

My understanding was that all threads will write to that location, but the order will be undefined. So in this case it would not have been a problem. Shouldn't all threads be allowed to write?

I am using floats so atomic instructions are out.

The manual is correct, but your interpretation is wrong. One write will succeed. That’s all you’re promised… not that ALL will succeed in undefined order, but that at least one will succeed. In practice what will likely happen is one thread per warp will write something. It’s not at all what you want.

Take a look at the Scan example. You need to do a binary reduction to compute the sum.
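
For reference, here is a rough sketch of the kind of block-level tree reduction the SDK sample does, assuming BlockSize is a power of two and vec_size is a multiple of it; the kernel name dot_partial and the partial array are made up for illustration. Each block produces one partial sum instead of every thread hammering a single result location:

// Sketch of a block-level tree reduction for the dot product.
// Assumes blockDim.x is a power of two and vec_size is a multiple of it.
// Each block writes one partial sum to partial[blockIdx.x]; a second
// launch (or the host) adds the partials together afterwards.
#define BLOCK_SIZE 256

__global__ void dot_partial(const float* r1, const float* r2, float* partial)
{
    __shared__ float cache[BLOCK_SIZE];

    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    // Each thread computes one product and parks it in shared memory.
    cache[threadIdx.x] = r1[tid] * r2[tid];
    __syncthreads();

    // Tree reduction: halve the number of active threads each step,
    // so the block's sum is ready after log2(blockDim.x) steps.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    // Only thread 0 writes to global memory, so there is no race.
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];
}

The launch looks the same as yours, except result becomes an array with one slot per block.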

Hmmm… this changes everything. Thanks a lot for the help. I will look at the Scan example.

Hmm, actually, look at the Reduction example. That’s EXACTLY what you want, and it’s even structured as a tutorial with multiple variants and sample code.

Don't forget to take a look at the cuBLAS function cublasSdot, which computes the dot product of two vectors. It may not be as fast as the optimized reduction code, but it couldn't be easier to use.
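
For what it's worth, the call is just a couple of lines. A minimal sketch using the cublas_v2.h interface (d_r1 and d_r2 are assumed to be device pointers already holding vec_size floats each; error checking omitted):

#include <cublas_v2.h>

float dot_with_cublas(const float* d_r1, const float* d_r2, int vec_size)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    float result = 0.0f;
    // result = sum over i of d_r1[i] * d_r2[i], with stride 1 in both vectors.
    cublasSdot(handle, vec_size, d_r1, 1, d_r2, 1, &result);

    cublasDestroy(handle);
    return result;
}

Link with -lcublas.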

I will do that as well.

I have another question. I have not tested it yet.

What if, instead of having each thread write to the same location, we have different blocks writing to the same location? What I mean is: I use shared memory to hold parts of the vectors, get their sum, and at the end write back to a single memory location. Will this give the same error?

Yes, you will get the same error. The only way it would work is with atomic operations on floats, and even then you would need to exit the kernel and start a new one, since there is no device-level thread sync that can be called from within a kernel. I wrote my dot product kernel over a year ago; use the reduction in the example. If you really want to dig into it, there is a very good tutorial that Mark Harris gave at a Supercomputing conference, which you can find on the CUDA Zone site.
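
To make the per-block idea concrete: each block writes its partial sum to its own slot, and the final addition only happens after the kernel has returned. A rough host-side sketch, reusing the hypothetical dot_partial kernel sketched above (the block size must match its BLOCK_SIZE):

#include <cuda_runtime.h>
#include <stdlib.h>

// Two-pass dot product: every block writes its partial sum to its own slot
// in d_partial (no two blocks touch the same location), and the final
// addition happens only after the kernel has finished.
float dot_two_pass(const float* d_r1, const float* d_r2, int vec_size)
{
    const int block = 256;                  // must match BLOCK_SIZE in dot_partial
    const int num_blocks = vec_size / block;

    float* d_partial;
    cudaMalloc(&d_partial, num_blocks * sizeof(float));

    dot_partial<<<num_blocks, block>>>(d_r1, d_r2, d_partial);

    // Copy the per-block results back and finish the sum on the CPU.
    // (For large grids this last step could itself be another reduction kernel.)
    float* h_partial = (float*)malloc(num_blocks * sizeof(float));
    cudaMemcpy(h_partial, d_partial, num_blocks * sizeof(float),
               cudaMemcpyDeviceToHost);

    float dot = 0.0f;
    for (int i = 0; i < num_blocks; ++i)
        dot += h_partial[i];

    free(h_partial);
    cudaFree(d_partial);
    return dot;
}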

Cheers
Eri

Yes, I managed to get the same error. My understanding was that if 30 threads want to write to the same location, all 30 will get queued up and write one after another, just in an undefined order. That would have made my work easier. I completely misunderstood. Quite a costly and stupid error.

What you describe is what happens with atomic operations (int only). It is also much slower than a reduction, so a reduction is the way to go ™ ;)