I am currently working on a CUDA project involving matrices, and everything runs smoothly until I call a certain function that launches a kernel. After that problematic call, every subsequent kernel fails to run properly, even though in isolation they all work fine.
The issue appears when I start working on large-scale matrices: at (32x32)^2 it works, but for (64x64)^2 and (128x128)^2 it won't give proper results. The problematic function is called MatPolMulLoop and is supposed to multiply two polynomial matrices (matrices in which every cell is an array representing a polynomial, for example x1+x2 = [1, 1]).
Here is the code for the problematic function:
void MatPolMulLoop(const Matrix A, const Matrix B, Matrix C, int psize, int dir)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);

    Matrix d_B;
    d_B.width = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid((C.width / dimBlock.x) / psize, (C.height / dimBlock.y) * psize);
    MatPolMulLoopKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, psize, dir);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}
*psize is the size of the polynomials.
My best guess is that the problem is linked to the large amount of memory being allocated, but I couldn't confirm it.
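One way I could try to confirm this is to check the CUDA runtime's return codes after each allocation and after the kernel launch. Below is a minimal sketch of such instrumentation; the CHECK_CUDA macro is my own helper, not part of the CUDA API, and the calls shown are just the ones from MatPolMulLoop:

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the error and abort if a CUDA runtime call fails.
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage inside MatPolMulLoop would look like:
//   CHECK_CUDA(cudaMalloc(&d_A.elements, size));
//   CHECK_CUDA(cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice));
//   ...
//   MatPolMulLoopKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, psize, dir);
//   CHECK_CUDA(cudaGetLastError());       // catches an invalid launch configuration
//   CHECK_CUDA(cudaDeviceSynchronize());  // catches errors raised while the kernel runs
```

If the launch configuration is the culprit (for example, dimGrid growing with the matrix size until it exceeds the device's maxGridSize, which one can query via cudaGetDeviceProperties), cudaGetLastError should report it immediately after the launch; an out-of-memory condition would instead show up on the cudaMalloc calls.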
If more code or further explanations are needed, please reply and I'll add them as soon as I can.
Thanks!