Matrix Multiplication Buggy

The following is my kernel code for A * B, where A and B are 2x2 matrices.

__global__ void CUDAMatrixMultiplication_kernel(double* lhsContainer, unsigned int lhsRows, unsigned int lhsColumns,
                                                double* rhsContainer, unsigned int rhsRows, unsigned int rhsColumns,
                                                double* resultContainer, unsigned int resultColumns) {
    double Cvalue = 0.0;
    // Row and column of the output element this thread computes.
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int column = blockIdx.x * blockDim.x + threadIdx.x;
    // Dot product of one row of A with one column of B.
    for (unsigned int e = 0; e < lhsColumns; ++e) {
        Cvalue += lhsContainer[row * lhsColumns + e] * rhsContainer[column + e * rhsColumns];
    }
    resultContainer[row * resultColumns + column] = Cvalue;
}

This code is almost verbatim from the example given in the CUDA Programming Guide 3.0. As it is, the resulting matrix’s first row is correct, but the second ends up containing only the second row from B.

A =
1,2
3,4

B =
5,6
7,8

C =
19,22
7,8
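
For reference, the expected product, worked out by hand, is:

C = A * B =
1*5 + 2*7 , 1*6 + 2*8  =  19, 22
3*5 + 4*7 , 3*6 + 4*8  =  43, 50

so only the first row of the GPU result matches.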

Has anyone else encountered such a problem before? Is there any advice on how to get this to work properly?

Jonathan Scott

I am using CUDA driver 3.0. I’ve played around with this for days and still cannot figure out why I cannot get a correct answer for matrix multiplication on the GPU (I have it working correctly on the CPU), except to suspect a driver bug or GPU hardware bug.

The following is my code for allocating memory on the host and the device.

CUDAMatrix::CUDAMatrix(unsigned long rows, unsigned long columns) : BaseMatrix(rows, columns) {
    matrixMemorySize_ = sizeof(double) * rows * columns;
    // Host memory
    rawMatrix_ = (double*) malloc(matrixMemorySize_);
    // Device memory
    cudaError_t status = cudaMalloc((void**)&container_, matrixMemorySize_);
    if (status == cudaErrorMemoryAllocation)
        throw IILException("cudaErrorMemoryAllocation on cudaMalloc()", __FILE__, __LINE__);
}

container_ = device pointer to matrix memory
rawMatrix_ = host pointer to matrix memory

The following is my code for copying the matrix over to GPU-space:

void CUDAMatrix::syncDevice() {
    cudaError_t error = cudaMemcpy(container_, rawMatrix_, rows_ * columns_ * sizeof(double), cudaMemcpyHostToDevice);
    if (error == cudaErrorInvalidValue)
        throw IILException("cudaErrorInvalidValue on cudaMemcpy()", __FILE__, __LINE__);
    if (error == cudaErrorInvalidDevicePointer)
        throw IILException("cudaErrorInvalidDevicePointer on cudaMemcpy()", __FILE__, __LINE__);
    if (error == cudaErrorInvalidMemcpyDirection)
        throw IILException("cudaErrorInvalidMemcpyDirection on cudaMemcpy()", __FILE__, __LINE__);
}
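
For completeness, here is a minimal sketch of the reverse copy for reading results back after the kernel runs. syncHost() is a hypothetical counterpart to syncDevice(), assuming the same members (rawMatrix_, container_, rows_, columns_) as above:

void CUDAMatrix::syncHost() {
    // Hypothetical helper: copy the device matrix back into host memory.
    cudaError_t error = cudaMemcpy(rawMatrix_, container_, rows_ * columns_ * sizeof(double), cudaMemcpyDeviceToHost);
    if (error != cudaSuccess)
        throw IILException(cudaGetErrorString(error), __FILE__, __LINE__);
}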

The logic seems correct, and there seem to be no violations. Still, my tests show that the first row of results is correct; the rest are strange.

Someone please help!
Jonathan Scott

Does it work with device emulation mode?

Are you compiling with -arch=sm_13? Doubles won’t work without that option.
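
For example, the compile line would look something like this (the file name is just a placeholder):

nvcc -arch=sm_13 matrix.cu -o matrix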

Strange that the first line of results is correct… I think the problem may be in the kernel launch: too many threads and blocks.

Thank you very much for your reply!

Yes, I just learned about that one while searching for why nvcc was demoting my doubles to single precision even though I believe my hardware is capable of double precision.

Jonathan Scott

So, does it help?

Thanks for the hint. Let me give that a try. So I should only launch the total number of threads necessary in <<< >>> for the work to be done?

Jonathan

With a multiplication of 2x2 matrices you need 1 block and a very small number of threads. In the kernel launch you specify the grid size and the block size; the block size is in threads.

I made it 1 block and 3 threads with <<<1, 3>>> and could see a change in behaviour: the bottom row remained zeros. I increased the thread count one at a time up to six (the minimum necessary to get the job done), and it still shows the second row as 8, 9, 10 rather than the expected results.

I have thread synchronization in place. Are there any other ideas?

Jonathan Scott

You should only need 4 threads (one per output element) for such a small size, and you should know before you start how many threads you need. Check the matrix multiplication example in the SDK.
I think I know your mistake now. You access threadIdx.y while specifying a 1D block size, so threadIdx.y is always 0 and only row 0 of the result gets written. You made a false assumption about square indexing.
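
For illustration, a minimal launch sketch for the 2x2 case, using the kernel signature from the first post (devA, devB and devC are hypothetical device pointers holding A, B and the result):

// One block of 2x2 threads: one thread per element of the 2x2 result.
dim3 blockSize(2, 2);
dim3 gridSize(1, 1);
CUDAMatrixMultiplication_kernel<<<gridSize, blockSize>>>(devA, 2, 2, devB, 2, 2, devC, 2);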

I have two tests; the one I showed is 2x2, the other was 2x3. So, yes, I am using 4 threads, but I still get the bottom row incorrect.

I tried the two-dimensional specification with dim3 and it worked! I very much appreciate your help. I think I understand this better now!

Jonathan Scott