problem with dot product code

I am having a problem with code that calculates the dot product of two vectors, r1 and r2. This should be simple, but for some reason I am unable to get it to work.

The vectors are one-dimensional.

dim3 block(BlockSize, 1);
dim3 grid(vec_size / block.x, 1);
dot2<<<grid, block>>>(r1, r2, result);

.........................................

__global__ void dot2(float* r1, float* r2, float* result){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    float sum = r1[tid] * r2[tid];
    __syncthreads();
    *result += sum;
}

The problem is with the kernel code. It looks fine to me, and I can't find the error. By "problem" I mean that the answer does not match the CPU and CUBLAS results, so clearly I am missing something.

Any help will be appreciated.

The problem is that the final sum is being added to by all the threads simultaneously… that’s undefined behavior (other than “one write will succeed.”)

You likely need a fancier shared-memory reduction that computes the sum in a parallel binary tree… take a look at the SDK examples like Scan to see how.

If you were using integers, you could use atomic operations, but those would require N writes since they’d be sequential. The parallel algorithms do it all in log2(N) steps.

The order in which the threads write to the memory location is undefined, but one will succeed - from the manual.

My understanding was that all threads will write to that location, but the order will be undefined. So in this case it would not have been a problem. Shouldn't all threads be allowed to write?

I am using floats so atomic instructions are out.

The manual is correct, but your interpretation is wrong. One write will succeed. That’s all you’re promised… not that ALL will succeed in undefined order, but that at least one will succeed. In practice what will likely happen is one thread per warp will write something. It’s not at all what you want.

Take a look at the Scan example. You need to do a binary reduction to compute the sum.
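
For reference, here is a rough sketch of the kind of block-level tree reduction the SDK sample does, assuming BlockSize is a power of two and vec_size is a multiple of it; the kernel name dot_partial and the partial array are made up for illustration. Each block produces one partial sum instead of every thread hammering a single result location:

// Sketch of a block-level tree reduction for the dot product.
// Assumes blockDim.x is a power of two and vec_size is a multiple of it.
// Each block writes one partial sum to partial[blockIdx.x]; a second
// launch (or the host) adds the partials together afterwards.
#define BLOCK_SIZE 256

__global__ void dot_partial(const float* r1, const float* r2, float* partial)
{
    __shared__ float cache[BLOCK_SIZE];

    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    // Each thread computes one product and parks it in shared memory.
    cache[threadIdx.x] = r1[tid] * r2[tid];
    __syncthreads();

    // Tree reduction: halve the number of active threads each step,
    // so the block's sum is ready after log2(blockDim.x) steps.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    // Only thread 0 writes to global memory, so there is no race.
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];
}

The launch looks the same as yours, except result becomes an array with one slot per block.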

Hmmm… this changes everything. Thanks a lot for the help. I will look at the Scan example.

Hmm, actually, look at the Reduction example. That’s EXACTLY what you want, and it’s even structured as a tutorial with multiple variants and sample code.

Don't forget to take a look at the cuBLAS function cublasSdot, which computes the dot product of two vectors. It may not be as fast as the optimized reduction code, but it couldn't be easier to use.
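
For what it's worth, the call is just a couple of lines. A minimal sketch using the cublas_v2.h interface (d_r1 and d_r2 are assumed to be device pointers already holding vec_size floats each; error checking omitted):

#include <cublas_v2.h>

float dot_with_cublas(const float* d_r1, const float* d_r2, int vec_size)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    float result = 0.0f;
    // result = sum over i of d_r1[i] * d_r2[i], with stride 1 in both vectors.
    cublasSdot(handle, vec_size, d_r1, 1, d_r2, 1, &result);

    cublasDestroy(handle);
    return result;
}

Link with -lcublas.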

I will do that as well.

I have another question. I have not tested it yet.

What if, instead of having each thread write to the same location, we have different blocks writing to the same location? What I mean is: I use shared memory to hold parts of the vectors, get their sum, and at the end write back to a single memory location. Will this give the same error?

Yes, you will get the same error. The only way it would work is with atomic operations on floats, and even then you would need to exit the kernel and start a new one, since there is no device-level thread sync that can be called from within a kernel. I wrote my dot product kernel over a year ago; use the reduction in the example. If you really want to dig into it, there is a very good tutorial that Mark Harris gave at a Supercomputing conference, which you can find on the CUDA Zone site.
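
To make the per-block idea concrete: each block writes its partial sum to its own slot, and the final addition only happens after the kernel has returned. A rough host-side sketch, reusing the hypothetical dot_partial kernel sketched above (the block size must match its BLOCK_SIZE):

#include <cuda_runtime.h>
#include <stdlib.h>

// Two-pass dot product: every block writes its partial sum to its own slot
// in d_partial (no two blocks touch the same location), and the final
// addition happens only after the kernel has finished.
float dot_two_pass(const float* d_r1, const float* d_r2, int vec_size)
{
    const int block = 256;                  // must match BLOCK_SIZE in dot_partial
    const int num_blocks = vec_size / block;

    float* d_partial;
    cudaMalloc(&d_partial, num_blocks * sizeof(float));

    dot_partial<<<num_blocks, block>>>(d_r1, d_r2, d_partial);

    // Copy the per-block results back and finish the sum on the CPU.
    // (For large grids this last step could itself be another reduction kernel.)
    float* h_partial = (float*)malloc(num_blocks * sizeof(float));
    cudaMemcpy(h_partial, d_partial, num_blocks * sizeof(float),
               cudaMemcpyDeviceToHost);

    float dot = 0.0f;
    for (int i = 0; i < num_blocks; ++i)
        dot += h_partial[i];

    free(h_partial);
    cudaFree(d_partial);
    return dot;
}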

Cheers
Eri

Yes, I managed to get the same error. My understanding was that if 30 threads want to write to the same location, all 30 will get queued up and write one after another, just in an undefined order. That would have made my work easier. I completely misunderstood. Quite a costly and stupid error.

What you describe is what happens with atomic operations (int only). It is also much slower than a reduction, so a reduction is the way to go ™ ;)