Matrix Multiplication Buggy

The following is my kernel code for A * B, where A and B are 2x2 matrices.

__global__ void CUDAMatrixMultiplication_kernel(double* lhsContainer, unsigned int lhsRows, unsigned int lhsColumns,
                                                double* rhsContainer, unsigned int rhsRows, unsigned int rhsColumns,
                                                double* resultContainer, unsigned int resultColumns) {
    double Cvalue = 0.0;
    // Row and column of the output element this thread computes.
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int column = blockIdx.x * blockDim.x + threadIdx.x;
    // Dot product of one row of A with one column of B.
    for (unsigned int e = 0; e < lhsColumns; ++e) {
        Cvalue += lhsContainer[row * lhsColumns + e] * rhsContainer[column + e * rhsColumns];
    }
    resultContainer[row * resultColumns + column] = Cvalue;
}

This code is almost verbatim from the example given in the CUDA Programming Guide 3.0. As it is, the resulting matrix’s first row is correct, but the second ends up containing only the second row from B.

A =
1,2
3,4

B =
5,6
7,8

C =
19,22
7,8
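
For reference, the expected product, worked out by hand, is:

C = A * B =
1*5 + 2*7 , 1*6 + 2*8  =  19, 22
3*5 + 4*7 , 3*6 + 4*8  =  43, 50

so only the first row of the GPU result matches.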

Has anyone else encountered such a problem before? Is there any advice on how to get this to work properly?

Jonathan Scott

I am using CUDA driver 3.0. I’ve played around with this for days and still cannot figure out why I cannot get a correct answer for matrix multiplication on the GPU (I have it working correctly on the CPU), except to suspect a driver bug or GPU hardware bug.

The following is my code for allocating memory on the host and the device.

CUDAMatrix::CUDAMatrix(unsigned long rows, unsigned long columns) : BaseMatrix(rows, columns) {
    matrixMemorySize_ = sizeof(double) * rows * columns;
    // Host memory
    rawMatrix_ = (double*) malloc(matrixMemorySize_);
    // Device memory
    cudaError_t status = cudaMalloc((void**)&container_, matrixMemorySize_);
    if (status == cudaErrorMemoryAllocation)
        throw IILException("cudaErrorMemoryAllocation on cudaMalloc()", __FILE__, __LINE__);
}

container_ = device pointer to matrix memory
rawMatrix_ = host pointer to matrix memory

The following is my code for copying the matrix over to GPU-space:

void CUDAMatrix::syncDevice() {
    cudaError_t error = cudaMemcpy(container_, rawMatrix_, rows_ * columns_ * sizeof(double), cudaMemcpyHostToDevice);
    if (error == cudaErrorInvalidValue)
        throw IILException("cudaErrorInvalidValue on cudaMemcpy()", __FILE__, __LINE__);
    if (error == cudaErrorInvalidDevicePointer)
        throw IILException("cudaErrorInvalidDevicePointer on cudaMemcpy()", __FILE__, __LINE__);
    if (error == cudaErrorInvalidMemcpyDirection)
        throw IILException("cudaErrorInvalidMemcpyDirection on cudaMemcpy()", __FILE__, __LINE__);
}
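
For completeness, here is a minimal sketch of the reverse copy for reading results back after the kernel runs. syncHost() is a hypothetical counterpart to syncDevice(), assuming the same members (rawMatrix_, container_, rows_, columns_) as above:

void CUDAMatrix::syncHost() {
    // Hypothetical helper: copy the device matrix back into host memory.
    cudaError_t error = cudaMemcpy(rawMatrix_, container_, rows_ * columns_ * sizeof(double), cudaMemcpyDeviceToHost);
    if (error != cudaSuccess)
        throw IILException(cudaGetErrorString(error), __FILE__, __LINE__);
}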

The logic seems correct, and there seem to be no violations. Still, my tests show that the first row of results is correct; the rest are strange.

Someone please help!
Jonathan Scott

Does it work with device emulation mode?

Are you compiling with -arch=sm_13? Doubles won’t work without that option.
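
For example, the compile line would look something like this (the file name is just a placeholder):

nvcc -arch=sm_13 matrix.cu -o matrix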

Strange that the first line of results is correct… I think the problem may be in the kernel launch: too many threads and blocks.

Thank you very much for your reply!

Yes, I just learned about that one while searching for why nvcc was demoting my doubles to single precision even though I believe my hardware is capable of double precision.

Jonathan Scott

So, does it help?

Thanks for the hint. Let me give that a try. So I should only launch the total number of threads necessary in <<< >>> for the work to be done?

Jonathan

With a multiplication of 2x2 matrices you need 1 block and a very small number of threads. In the kernel launch you specify the grid size and the block size; the block size is in threads.

I made it 1 block and 3 threads with <<<1, 3>>> and could see a change in behaviour: the bottom row remained zeros. I increased the thread count one at a time up to six (the minimum necessary to get the job done), and it still shows the second row as 8, 9, 10 rather than the expected results.

I have thread synchronization in place. Are there any other ideas?

Jonathan Scott

You should only need 4 threads (one per output element) for such a small size, and you should know before you start how many threads you need. Check the matrix multiplication example in the SDK.
I think I know your mistake now. You access threadIdx.y while specifying a 1D block size, so threadIdx.y is always 0 and only row 0 of the result gets written. You made a false assumption about square indexing.
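
For illustration, a minimal launch sketch for the 2x2 case, using the kernel signature from the first post (devA, devB and devC are hypothetical device pointers holding A, B and the result):

// One block of 2x2 threads: one thread per element of the 2x2 result.
dim3 blockSize(2, 2);
dim3 gridSize(1, 1);
CUDAMatrixMultiplication_kernel<<<gridSize, blockSize>>>(devA, 2, 2, devB, 2, 2, devC, 2);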

I have two tests; the one I showed is 2x2, the other was 2x3. So, yes, I am using 4 threads, but I still get the bottom row incorrect.

I tried the two-dimensional specification with dim3 and it worked! I very much appreciate your help. I think I understand this better now!

Jonathan Scott