Program fails after running function

I am currently working on a CUDA project involving matrices, and everything runs smoothly until I call a certain function that launches a kernel. Every kernel launched after that problematic one fails to run properly, even though those kernels normally work just fine.
The issue appears when I work with large matrices: at (32x32)^2 it works, but for (64x64)^2 and (128x128)^2 it won't give proper results. The problematic function is called MatPolMulLoop and is supposed to multiply two polynomial matrices (meaning matrices in which every cell is an array representing a polynomial, for example x1+x2 = [1, 1]).

Here is the code for the problematic function:

void MatPolMulLoop(const Matrix A, const Matrix B, Matrix C, int psize, int dir)
{
	// Load A and B to device memory
	Matrix d_A;
	d_A.width = A.width; d_A.height = A.height;
	size_t size = A.width * A.height * sizeof(float);
	cudaMalloc(&d_A.elements, size);
	cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
	Matrix d_B;
	d_B.width = B.width; d_B.height = B.height;
	size = B.width * B.height * sizeof(float);
	cudaMalloc(&d_B.elements, size);
	cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);

	// Allocate C in device memory
	Matrix d_C;
	d_C.width = C.width; d_C.height = C.height;
	size = C.width * C.height * sizeof(float);
	cudaMalloc(&d_C.elements, size);

	// Invoke kernel
	dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
	dim3 dimGrid((C.width / dimBlock.x) / psize, (C.height / dimBlock.y) * psize);

	MatPolMulLoopKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, psize, dir);

	// Read C from device memory
	cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

	// Free device memory
	cudaFree(d_A.elements);
	cudaFree(d_B.elements);
	cudaFree(d_C.elements);
}

*psize is the size of the polynomials.

My best guess is that the problem is linked to the large amount of memory being used and allocated, but I couldn't confirm it.
If more code or further explanation is needed, please reply and I'll add it as soon as I can.

use proper CUDA error checking (google that)

and run your code with cuda-memcheck

I did as you asked and it was a memory error. I'm attaching the kernel because I can't figure out why it fails, and only with certain matrix sizes.

__global__ void MatPolMulLoopKernel(Matrix A, Matrix B, Matrix C, int psize, int dir)
{
    int row = (blockIdx.y / psize) * blockDim.y + threadIdx.y;
    int col = (blockIdx.x * psize + blockIdx.y) * blockDim.x + threadIdx.x;

    if (dir == 0)
    {
        for (int i = 0; i < B.height; ++i)
            C.elements[row * C.width + col] += A.elements[row * A.width + i * psize + col % psize] * B.elements[i * B.width + col / psize];
        C.elements[row * C.width + col] = (int)C.elements[row * C.width + col] % 2;
    }
    else
    {
        for (int i = 0; i < A.width; ++i)
            C.elements[row * C.width + col] += A.elements[row * A.width + i] * B.elements[i * B.width + col];
        C.elements[row * C.width + col] = (int)C.elements[row * C.width + col] % 2;
    }
}

Also, I noticed I didn't specify a few things, so:
*psize is the size of the polynomials = MAT_SIZE^2.
*BLOCK_SIZE = 16
*I'm running the code in Visual Studio 2017 on a GTX 1060 GPU.
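
One way to narrow down a size-dependent memory error like this is to replay the kernel's index arithmetic on the host and compare the largest row and col against the dimensions of C. The following is only a sketch in plain host-side C++. The helper names (maxIndices, launchStaysInBounds, gridIsLaunchable), the wrapBlockY switch, and the matrix sizes mentioned afterwards are illustrative assumptions and are not taken from the posted project. It assumes width and height are multiples of the block size, and it reuses the grid formula from MatPolMulLoop and the row/col expressions from the kernel as posted.

```cpp
// Largest row and column index that any thread computes for one launch.
struct IndexRange {
    long long maxRow;
    long long maxCol;
};

// Grid size limits for compute capability 3.0 and later (a GTX 1060 is 6.1).
// A launch outside these limits, or with an empty grid, is rejected.
bool gridIsLaunchable(long long gridX, long long gridY) {
    return gridX >= 1 && gridY >= 1 && gridX <= 2147483647LL && gridY <= 65535LL;
}

// Replays the row/col expressions of MatPolMulLoopKernel for every block,
// using the highest thread of each block (threadIdx.x == threadIdx.y == block - 1).
// wrapBlockY is a hypothetical variant: it uses blockIdx.y % psize instead of
// blockIdx.y in the column expression.
IndexRange maxIndices(long long gridX, long long gridY, long long block,
                      long long psize, bool wrapBlockY) {
    IndexRange r{-1, -1};
    for (long long by = 0; by < gridY; ++by) {
        for (long long bx = 0; bx < gridX; ++bx) {
            long long yTerm = wrapBlockY ? (by % psize) : by;
            long long row = (by / psize) * block + (block - 1);
            long long col = (bx * psize + yTerm) * block + (block - 1);
            if (row > r.maxRow) r.maxRow = row;
            if (col > r.maxCol) r.maxCol = col;
        }
    }
    return r;
}

// True when the grid can be launched and every computed (row, col)
// lies inside a matrix with the given width and height.
bool launchStaysInBounds(long long width, long long height, long long block,
                         long long psize, bool wrapBlockY) {
    long long gridX = (width / block) / psize;   // dimGrid.x in MatPolMulLoop
    long long gridY = (height / block) * psize;  // dimGrid.y in MatPolMulLoop
    if (!gridIsLaunchable(gridX, gridY)) return false;
    IndexRange r = maxIndices(gridX, gridY, block, psize, wrapBlockY);
    return r.maxRow < height && r.maxCol < width;
}
```

For a hypothetical C that is 64 wide and 32 high, with BLOCK_SIZE = 16 and psize = 2, launchStaysInBounds(64, 32, 16, 2, false) reports that the posted expressions step outside the matrix (the largest col is 95), while the same call with a height of 16 stays inside. Passing true for wrapBlockY keeps the 64 by 32 example within bounds, but whether that matches the intended mapping is only a guess. Running cuda-memcheck on the real sizes remains the authoritative check.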