CUDA Matrix Multiplication Issues: threads and blocks problem

Hello everyone!

Here is the situation: I have a kernel that does a matrix multiplication. The matrices are allocated and populated in CPU code as vectors and then transferred to the device with cudaMemcpy. The matrices are referenced (on the CPU and in CUDA) as 2D using the [x*number_of_columns + y] notation, so we have two 1D arrays referenced as 2D in CUDA's global memory.
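Just so the layout is clear, here is a tiny example of what I mean (rows, cols and M here are only for illustration, not the actual variables from my program):

[codebox]/* a rows x cols matrix in one flat array: element at row x, column y lives at index x*cols + y */
int rows = 3, cols = 4, x, y;
float *M = (float *)malloc(rows * cols * sizeof(float));

for(x = 0; x < rows; x++)
	for(y = 0; y < cols; y++)
		M[x*cols + y] = (float)(x + y);

free(M);[/codebox]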

The kernel that multiplies them is as follows:

[codebox]__global__ void CudaMul(float *A, float *B, float *C, int noc0, int noc1, int nor0)
{
	int k;

	/* i indexes a column of the result, j indexes a row of the result */
	int i = blockDim.x*blockIdx.x + threadIdx.x;
	int j = blockDim.y*blockIdx.y + threadIdx.y;

	/* the result C is nor0 x noc1; skip the extra threads of partial blocks */
	if(i < noc1 && j < nor0)
	{
		C[j*noc1 + i] = 0;

		/* dot product of row j of A (nor0 x noc0) with column i of B (noc0 x noc1) */
		for(k = 0; k < noc0; k++)
		{
			C[j*noc1 + i] += A[j*noc0 + k] * B[k*noc1 + i];
		}
	}
}[/codebox]

Here noc0 is the number of columns of the first matrix, noc1 the number of columns of the second matrix, and nor0 the number of rows of the first matrix, so the result is a nor0 by noc1 matrix (and the multiplication requires noc0 to equal the number of rows of the second matrix). The multiplication is done in this order: matrix0 * matrix1, and the result matrix is already allocated.

So now I call the kernel with a block of 16 by 16 threads and a grid of ceil(noc1/float(16)) by ceil(nor0/float(16)); in effect I create one thread to process each element of the resulting matrix. If I load two 4000×4000 matrices it should create a 250 by 250 grid of blocks (4000/16 = 250), with each block having 16×16 = 256 threads. So a total of 16,000,000 threads will be created, equal to the number of elements of the resulting matrix. The multiplication seems to work and it takes approximately 6 seconds.
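In other words, for the 4000×4000 case I expect the launch configuration to come out like this:

[codebox]/* expected launch configuration for two 4000 x 4000 matrices */
dim3 dimBlock(16, 16);     /* 16*16 = 256 threads per block                */
dim3 dimGrid(250, 250);    /* ceil(4000/16) = 250 blocks in each dimension */
/* 250 * 250 blocks * 256 threads = 16,000,000 threads, one per result element */[/codebox]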

How is the scheduling done at this point? Is it possible for blocks to “wait” somewhere before they get scheduled on a multiprocessor for execution?

The device is a GeForce GTX 280 with 30 multiprocessors. If there is any problem, will the kernel notify me, or will it just produce wrong output?
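Right now I don't check for errors anywhere. Would adding something like this around the launch be enough to catch a failure? (This is only my guess at how the check should look.)

[codebox]CudaMul<<<dimGrid, dimBlock>>>(matrix_d[0], matrix_d[1], result_d, noc[0], noc[1], nor[0]);

cudaError_t err = cudaGetLastError();      /* errors from the launch itself */
if(err != cudaSuccess)
	printf("Launch failed: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();             /* errors that occur during execution */
if(err != cudaSuccess)
	printf("Execution failed: %s\n", cudaGetErrorString(err));[/codebox]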

If I increase the size of the matrices to 4500×4500 the execution freezes and nothing happens. Is there some limit (number of threads, blocks, etc.) that I exceed when multiplying 4500×4500 matrices? I don't think it's possible for CUDA to need more than 20 minutes for a 4500×4500 multiplication when it only takes 6 seconds for 4000×4000!
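In case it is device memory that runs out, I suppose I could check the free memory before the allocations with something like this (assuming cudaMemGetInfo is available in my CUDA version; nBytes, nor and noc are the globals from my code below):

[codebox]size_t free_bytes, total_bytes;
cudaMemGetInfo(&free_bytes, &total_bytes);

/* space needed for the two input matrices plus the result */
size_t needed = nBytes[0] + nBytes[1] + (size_t)nor[0]*noc[1]*sizeof(float);

printf("Free device memory: %lu of %lu bytes, need about %lu bytes\n",
       (unsigned long)free_bytes, (unsigned long)total_bytes, (unsigned long)needed);[/codebox]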

Thanks in advance for any information you can provide!

P.S.: Here is also the function that calls the kernel:

[codebox]void MulCuda()

{

float *result_d, *result_h;

char choise;

system("clear");

if(!valid_data)

{

	printf("Please load matrices data first.");

	return;

}

if(noc[0] == nor[1])

{

	cudaEvent_t start, stop;

	cudaEventCreate(&start);

	cudaEventCreate(&stop);



	printf("Allocating space on host...");

	result_h = (float *)malloc(nor[0]*noc[1]*sizeof(float));	/* result matrix is nor[0] x noc[1] */

	printf("[OK]\n");



	printf("Allocating space on the device...");

	cudaMalloc((void**)&matrix_d[0], nBytes[0]);

	cudaMalloc((void**)&matrix_d[1], nBytes[1]);

	cudaMalloc((void**)&result_d, nor[0]*noc[1]*sizeof(float));	/* result matrix is nor[0] x noc[1] */

	printf("[OK]\n");

	printf("Transfering data from host to device...");

	cudaEventRecord(start, 0);

	cudaMemcpy(matrix_d[0], matrix_h[0], nBytes[0], cudaMemcpyHostToDevice);

	cudaMemcpy(matrix_d[1], matrix_h[1], nBytes[1], cudaMemcpyHostToDevice);

	cudaEventRecord(stop, 0);

	cudaEventSynchronize(stop);

	cudaEventElapsedTime(&clock_stat.data_transfer_to_device, start, stop);

	printf("[OK]\n");

	

	dim3 dimBlock(16,16);

	dim3 dimGrid(ceil(noc[1]/float(16)), ceil(nor[0]/float(16)));	/* grid.x spans the columns of the result, grid.y the rows */



	printf("Calculating product of matrices on the device...");

	cudaEventRecord(start, 0);

	CudaMul<<<dimGrid, dimBlock>>>(matrix_d[0], matrix_d[1], result_d, noc[0], noc[1], nor[0]);

	cudaThreadSynchronize();

	printf("skata");

	cudaEventRecord(stop, 0);

	cudaEventSynchronize(stop);

	cudaEventElapsedTime(&clock_stat.operation_execution, start, stop);

	printf("[OK]\n");



	printf("Retrieving result from device...");

	cudaEventRecord(start, 0);

	cudaMemcpy(result_h, result_d, nor[0]*noc[1]*sizeof(float), cudaMemcpyDeviceToHost);

	cudaEventRecord(stop, 0);

	cudaEventSynchronize(stop);

	cudaEventElapsedTime(&clock_stat.data_transfer_to_host, start, stop);

	printf("[OK]\n");



	cudaEventDestroy(start);

	cudaEventDestroy(stop);



	clock_stat.total = clock_stat.data_transfer_to_device + clock_stat.operation_execution + clock_stat.data_transfer_to_host;

	

	printf("\nTime statistics\n");

	printf("===============\n");

	printf("Data transfer to device: %f msec\n", clock_stat.data_transfer_to_device);

	printf("Computation time: %f msec\n", clock_stat.operation_execution);

	printf("Data transfer to host: %f msec\n", clock_stat.data_transfer_to_host);

	printf("Total time needed: %f msec\n", clock_stat.total);



	if(nor[0] < 100 && noc[1] < 9)

	{

		printf("\nResult matrix:\n");

		printf("==============\n");

		print_matrix(result_h, nor[0], noc[1]);

		printf("\n\n");

	}

	else

	{

		printf("\nResulting matrix was too big to display (%d by %d),\ndo you want to save it in a file?(y/n): ", nor[0], noc[1]);

		scanf("%c", &choise);

		scanf("%c", &choise);

		if(choise == 'y')

		{

			printf("Saving file...\n");

			export_result(result_h, nor[0], noc[1]);

		}

	}



	printf("\nReleasing device space...");

	cudaFree(matrix_d[0]);

	cudaFree(matrix_d[1]);

	cudaFree(result_d);

	printf("[OK]\n");



	free(result_h);

	char dummy;

	printf("\nPress any key to return to menu...");

	scanf("%c", &dummy);

	scanf("%c", &dummy);

	system("clear");

}

else

{

	printf("Multiplication is not possible because matrices don't have matching dimentions.");

}

}[/codebox]

Anybody??? :(

Yes, I think you’re running out of memory when multiplying matrices that large.