The following is my kernel code for A * B, where A and B are 2x2 matrices.
__global__ void CUDAMatrixMultiplication_kernel(double* lhsContainer, unsigned int lhsRows, unsigned int lhsColumns,
                                                double* rhsContainer, unsigned int rhsRows, unsigned int rhsColumns,
                                                double* resultContainer, unsigned int resultColumns) {
    // Each thread computes one element of the result matrix.
    double Cvalue = 0.0;
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int column = blockIdx.x * blockDim.x + threadIdx.x;
    // Dot product of row `row` of the left-hand matrix with column `column` of the right-hand matrix.
    for (unsigned int e = 0; e < lhsColumns; ++e) {
        Cvalue += lhsContainer[row * lhsColumns + e] * rhsContainer[column + e * rhsColumns];
    }
    resultContainer[row * resultColumns + column] = Cvalue;
}
This code is almost verbatim from the example given in the CUDA Programming Guide 3.0. As it is, the resulting matrix's first row is correct, but the second row ends up containing only the second row of B.
A =
1,2
3,4
B =
5,6
7,8
C (from the GPU) =
19,22
7,8
The expected second row is 43,50 (3*5 + 4*7 and 3*6 + 4*8); instead it is simply the second row of B.
Has anyone else encountered such a problem before? Is there any advice on how to get this to work properly?
I am using CUDA driver 3.0. I've played around with this for days and still cannot figure out why I cannot get a correct answer for matrix multiplication on the GPU (the same multiplication works correctly in CPU-space), other than suspecting a driver bug or a GPU hardware bug.
The following is my code for allocating memory on the host and the device.
CUDAMatrix::CUDAMatrix(unsigned long rows, unsigned long columns) : BaseMatrix(rows, columns) {
    matrixMemorySize_ = sizeof(double) * rows * columns;
    // Host memory
    rawMatrix_ = (double*) malloc(matrixMemorySize_);
    // Device memory
    cudaError_t status = cudaMalloc((void**)&container_, matrixMemorySize_);
    if (status == cudaErrorMemoryAllocation)
        throw IILException("cudaErrorMemoryAllocation on cudaMalloc()", __FILE__, __LINE__);
}
container_ = device pointer to matrix memory
rawMatrix_ = host pointer to matrix memory
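The matching deallocation is not shown; presumably both pointers are released along these lines (illustrative only, not part of the posted code):
CUDAMatrix::~CUDAMatrix() {
    // Release the host and device allocations made in the constructor.
    free(rawMatrix_);
    cudaFree(container_);
}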
The following is my code for copying the matrix over to GPU-space:
void CUDAMatrix::syncDevice() {
    cudaError_t error = cudaMemcpy(container_, rawMatrix_, rows_ * columns_ * sizeof(double), cudaMemcpyHostToDevice);
    if (error == cudaErrorInvalidValue)
        throw IILException("cudaErrorInvalidValue on cudaMemcpy()", __FILE__, __LINE__);
    if (error == cudaErrorInvalidDevicePointer)
        throw IILException("cudaErrorInvalidDevicePointer on cudaMemcpy()", __FILE__, __LINE__);
    if (error == cudaErrorInvalidMemcpyDirection)
        throw IILException("cudaErrorInvalidMemcpyDirection on cudaMemcpy()", __FILE__, __LINE__);
}
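After the kernel runs, the result also has to be copied back to the host before it is read. That step is not shown above; a minimal sketch of a syncHost() counterpart (the method name and the use of cudaGetErrorString() are illustrative, not part of the original class):
void CUDAMatrix::syncHost() {
    // Copy the device matrix back into host memory so the result can be inspected.
    cudaError_t error = cudaMemcpy(rawMatrix_, container_, rows_ * columns_ * sizeof(double), cudaMemcpyDeviceToHost);
    if (error != cudaSuccess)
        throw IILException(cudaGetErrorString(error), __FILE__, __LINE__);
}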
The logic seems correct, and there seem to be no violations. Still, my tests show that the first row of results is correct, while the rest is strange.
Yes, I just learned about that one while searching for why nvcc demoted it to single precision when I believe my hardware is capable of double precision.
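(If this is the usual demotion warning, the common cause on that generation of hardware is compiling for a pre-1.3 target; compiling for compute capability 1.3, e.g. nvcc -arch=sm_13, keeps doubles as doubles. Whether that applies here depends on the build command, which is not shown.)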
For a multiplication of 2x2 matrices you need one block and a very small number of threads. In the kernel launch you specify the grid size and the block size; the block size is given in threads.
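In other words, the launch has the form CUDAMatrixMultiplication_kernel<<<gridSize, blockSize>>>(...), where gridSize is the number of blocks and blockSize the number of threads per block, and both can be dim3 values; for a 2x2 result a single block is enough.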
I made the blocks 1 and the threads 3 with <<<1, 3>>> and could see a change in behaviour: the bottom row remained zeros. I increased the thread count one at a time up to six (the minimum necessary to get the job done) and it still shows the second row as 8, 9, 10, not the expected results.
I have thread synchronization in place. Are there any other ideas?
You should only need 2 or 4 threads for such a small size, and you should know how many threads you need before you start. Check the matrix multiplication example in the SDK.
I think I now see your mistake. You access threadIdx.y, but you launch with a 1D block size. You made a false assumption about square (2D) indexing; see the sketch below.
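With <<<1, 3>>> the block is one-dimensional, so blockDim.y is 1 and threadIdx.y is always 0: every thread computes an element of row 0, and row 1 of C is never written by the kernel. A minimal sketch of a 2D launch that matches the kernel's row/column indexing (the device pointer names are placeholders for each matrix's container_ member, since the original launch code was not posted):
dim3 blockSize(2, 2);  // blockSize.x indexes columns, blockSize.y indexes rows: one thread per element of C
CUDAMatrixMultiplication_kernel<<<1, blockSize>>>(lhsDevice, 2, 2, rhsDevice, 2, 2, resultDevice, 2);
With a 2x2 block both rows of C are covered, which should give 43,50 for the second row.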