Vector-Vector Multiplication Code : error in basic vector-vector multiplication code

I am writing a very basic version of vector-vector multiplication using CUDA. The setup is as follows:
Length of each vector : 100
Number of grids used : 1
Number of blocks : 1 ( 10 by 10 )

The input values are hard-coded as well. But when I run it, I get the wrong result. Can anybody point out where I am making a mistake?
When running in device emulation mode [ -deviceemu ], the same program produces the correct result.
Please help me out.
Thanks in advance.

with regards
sam

--------------------source-----------------------------
#include<stdio.h>
#include<cuda.h>

#define LEN 100

__global__ void VectVect(float *dMatA, float *dMatB, int length, float *device_result)
{
    int tidx = threadIdx.x;
    int tidy = threadIdx.y;

    float tempResult = 0.0f;

    tempResult = dMatA[ (10 * tidx) + tidy ] * dMatB[ (10 * tidx) + tidy ];
    __syncthreads();
    device_result[0] += tempResult;
    __syncthreads();

}//end of VectVect kernel

int main(int argc, char* argv[])
{
    float *dMatA, *dMatB;
    float *hMatA, *hMatB;
    float *dresult, *hresult;
    int length = LEN, count = 0;

    // allocating host memory
    hMatA = (float*) malloc( LEN * sizeof(float));
    hMatB = (float*) malloc( LEN * sizeof(float));
    hresult = (float*) malloc( sizeof(float));

    // allocating device memory
    cudaMalloc( (void**)&dMatA, LEN * sizeof(float));
    cudaMalloc( (void**)&dMatB, LEN * sizeof(float));
    cudaMalloc( (void**)&dresult, sizeof(float));

    // assigning values to host vectors
    for( count = 0; count < LEN; count++ )
        hMatA[count] = hMatB[count] = 2.00f;

    // copying host vectors to device vectors
    cudaMemcpy((void*)dMatA, (void*)hMatA, LEN * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy((void*)dMatB, (void*)hMatB, LEN * sizeof(float), cudaMemcpyHostToDevice );

    // defining thread grid and block
    dim3 dimGrid(1,1);
    dim3 dimBlock(10,10);

    hresult[0] = 0.00f;
    cudaMemcpy((void*)dresult, (void*)hresult, sizeof(float), cudaMemcpyHostToDevice );

    // calling device kernel
    VectVect<<<dimGrid, dimBlock>>>( dMatA, dMatB, length, dresult );

    // retrieving result from device
    cudaMemcpy((void*)hresult, (void*)dresult, sizeof(float), cudaMemcpyDeviceToHost );
    printf( " Result : %f \n", hresult[0]);

    cudaFree(dMatA);
    cudaFree(dMatB);
    cudaFree(dresult);

    free(hMatA);
    free(hMatB);
    free(hresult);

    return 0;
}// end of main

---------------------------output I got ------------
Result : 4.00

---------------------------expected output------------
Result : 400.00

You are updating one memory position from 100 threads at the same time, so the writes race and only one thread's 2 * 2 = 4 survives. 4 seems like a reasonable number to me.

Thanks for your quick reply.

But I am synchronizing access to the memory location by using __syncthreads(). Isn't that the right way to do it?

Do you have any suggestions on how to modify this code to get the correct result?

__syncthreads() just makes sure that all commands above it have executed and all memory and cache accesses have taken place. It does not make the update itself atomic.

Isn’t there a reduction example in the SDK?

Thanks, I got some clues…

:)

Just to clarify the syncthreads issue a little further:

__syncthreads() marks a thread join for blocks, not a critical section. It forces threads in the block to wait there until all threads in the block reach the __syncthreads() call. You should use __syncthreads() to protect against read-after-write race conditions in shared memory. It cannot protect against race conditions when accessing global memory, as it only forces threads in the same block to wait.
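To make this concrete, here is a minimal sketch of how the kernel in the original post could be restructured so that each thread writes its product into shared memory and a single thread sums the partial results after the barrier. This works here only because all 100 threads live in one block; the kernel name `VectVectReduce` and the serial final sum are my own choices, not from the SDK sample.

```cuda
#include <stdio.h>
#include <cuda.h>

#define LEN 100

// Hypothetical corrected kernel: every thread stores one product into
// shared memory, __syncthreads() guarantees all products are visible,
// and then thread 0 alone accumulates them, so no two threads ever
// write device_result[0] concurrently.
__global__ void VectVectReduce(float *dMatA, float *dMatB, float *device_result)
{
    __shared__ float partial[LEN];

    // flatten the 10x10 thread index into 0..99
    int idx = threadIdx.y * blockDim.x + threadIdx.x;

    partial[idx] = dMatA[idx] * dMatB[idx];
    __syncthreads();   // wait until every product is in shared memory

    if (idx == 0) {
        float sum = 0.0f;
        for (int i = 0; i < LEN; i++)
            sum += partial[i];
        device_result[0] = sum;   // single writer, no race
    }
}
```

A tree-style reduction as in the Harris slides linked below would be faster, but for 100 elements this single-writer version is enough to show why the original kernel races.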

For more information on reductions (I find the reduction example in the SDK a little busy),
take a look at this talk, starting at slide 15:
http://www.gpgpu.org/sc2007/SC07_CUDA_4_Da…allel_Owens.pdf

And this goes into more detail about implementation strategies starting on slide 37:
http://www.gpgpu.org/sc2007/SC07_CUDA_5_Op…tion_Harris.pdf