CUDA Matrix Multiplication Issues: threads and blocks problem

Hello everyone!

Here is the situation: I have a kernel that does a matrix multiplication. The matrices are allocated and populated in CPU code as vectors and then transferred to the device with cudaMemcpy. The matrices are referenced (on the CPU and in CUDA) as 2D using the [x*number_of_columns + y] notation, so we have two 1D arrays referenced as 2D in CUDA's global memory.
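Just so the layout is clear, here is a tiny example of what I mean (rows, cols and M here are only for illustration, not the actual variables from my program):

[codebox]/* a rows x cols matrix in one flat array: element at row x, column y lives at index x*cols + y */
int rows = 3, cols = 4, x, y;
float *M = (float *)malloc(rows * cols * sizeof(float));

for(x = 0; x < rows; x++)
	for(y = 0; y < cols; y++)
		M[x*cols + y] = (float)(x + y);

free(M);[/codebox]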

The kernel that multiplies them is as follows:

[codebox]__global__ void CudaMul(float *A, float *B, float *C, int noc0, int noc1, int nor0)
{
	int k;

	/* i indexes a column of the result, j indexes a row of the result */
	int i = blockDim.x*blockIdx.x + threadIdx.x;
	int j = blockDim.y*blockIdx.y + threadIdx.y;

	/* the result C is nor0 x noc1; skip the extra threads of partial blocks */
	if(i < noc1 && j < nor0)
	{
		C[j*noc1 + i] = 0;

		/* dot product of row j of A (nor0 x noc0) with column i of B (noc0 x noc1) */
		for(k = 0; k < noc0; k++)
		{
			C[j*noc1 + i] += A[j*noc0 + k] * B[k*noc1 + i];
		}
	}
}[/codebox]

Here noc0 is the number of columns of the first matrix, noc1 the number of columns of the second matrix, and nor0 the number of rows of the first matrix, so the result is a nor0 by noc1 matrix (and the multiplication requires noc0 to equal the number of rows of the second matrix). The multiplication is done in this order: matrix0 * matrix1, and the result matrix is already allocated.

So now I call the kernel with a block of 16 by 16 threads and a grid of ceil(noc1/float(16)) by ceil(nor0/float(16)); in effect I create one thread to process each element of the resulting matrix. If I load two 4000×4000 matrices it should create a 250 by 250 grid of blocks (4000/16 = 250), with each block having 16×16 = 256 threads. So a total of 16,000,000 threads will be created, equal to the number of elements of the resulting matrix. The multiplication seems to work and it takes approximately 6 seconds.
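In other words, for the 4000×4000 case I expect the launch configuration to come out like this:

[codebox]/* expected launch configuration for two 4000 x 4000 matrices */
dim3 dimBlock(16, 16);     /* 16*16 = 256 threads per block                */
dim3 dimGrid(250, 250);    /* ceil(4000/16) = 250 blocks in each dimension */
/* 250 * 250 blocks * 256 threads = 16,000,000 threads, one per result element */[/codebox]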

How is the scheduling done at this point? Is it possible for blocks to “wait” somewhere before they get scheduled on a multiprocessor for execution?

The device is a GeForce GTX 280 with 30 multiprocessors. If there is any problem, will the kernel notify me, or will it just produce wrong output?
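Right now I don't check for errors anywhere. Would adding something like this around the launch be enough to catch a failure? (This is only my guess at how the check should look.)

[codebox]CudaMul<<<dimGrid, dimBlock>>>(matrix_d[0], matrix_d[1], result_d, noc[0], noc[1], nor[0]);

cudaError_t err = cudaGetLastError();      /* errors from the launch itself */
if(err != cudaSuccess)
	printf("Launch failed: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();             /* errors that occur during execution */
if(err != cudaSuccess)
	printf("Execution failed: %s\n", cudaGetErrorString(err));[/codebox]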

If I increase the size of the matrices to 4500×4500 the execution freezes and nothing happens. Is there some limit (number of threads, blocks, etc.) that I exceed when multiplying 4500×4500 matrices? I don't think it's possible for CUDA to need more than 20 minutes for a 4500×4500 multiplication when it only takes 6 seconds for 4000×4000!
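In case it is device memory that runs out, I suppose I could check the free memory before the allocations with something like this (assuming cudaMemGetInfo is available in my CUDA version; nBytes, nor and noc are the globals from my code below):

[codebox]size_t free_bytes, total_bytes;
cudaMemGetInfo(&free_bytes, &total_bytes);

/* space needed for the two input matrices plus the result */
size_t needed = nBytes[0] + nBytes[1] + (size_t)nor[0]*noc[1]*sizeof(float);

printf("Free device memory: %lu of %lu bytes, need about %lu bytes\n",
       (unsigned long)free_bytes, (unsigned long)total_bytes, (unsigned long)needed);[/codebox]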

Thanks in advance for any information you can provide!

P.S.: Here is also the function that calls the kernel:

[codebox]void MulCuda()

{

float *result_d, *result_h;

char choise;

system("clear");

if(!valid_data)

{

	printf("Please load matrices data first.");

	return;

}

if(noc[0] == nor[1])

{

	cudaEvent_t start, stop;

	cudaEventCreate(&start);

	cudaEventCreate(&stop);



	printf("Allocating space on host...");

	result_h = (float *)malloc(nor[0]*noc[1]*sizeof(float));	/* result matrix is nor[0] x noc[1] */

	printf("[OK]\n");



	printf("Allocating space on the device...");

	cudaMalloc((void**)&matrix_d[0], nBytes[0]);

	cudaMalloc((void**)&matrix_d[1], nBytes[1]);

	cudaMalloc((void**)&result_d, nor[0]*noc[1]*sizeof(float));	/* result matrix is nor[0] x noc[1] */

	printf("[OK]\n");

	printf("Transfering data from host to device...");

	cudaEventRecord(start, 0);

	cudaMemcpy(matrix_d[0], matrix_h[0], nBytes[0], cudaMemcpyHostToDevice);

	cudaMemcpy(matrix_d[1], matrix_h[1], nBytes[1], cudaMemcpyHostToDevice);

	cudaEventRecord(stop, 0);

	cudaEventSynchronize(stop);

	cudaEventElapsedTime(&clock_stat.data_transfer_to_device, start, stop);

	printf("[OK]\n");

	

	dim3 dimBlock(16,16);

	dim3 dimGrid(ceil(noc[1]/float(16)), ceil(nor[0]/float(16)));	/* grid.x spans the columns of the result, grid.y the rows */



	printf("Calculating product of matrices on the device...");

	cudaEventRecord(start, 0);

	CudaMul<<<dimGrid, dimBlock>>>(matrix_d[0], matrix_d[1], result_d, noc[0], noc[1], nor[0]);

	cudaThreadSynchronize();

	printf("skata");

	cudaEventRecord(stop, 0);

	cudaEventSynchronize(stop);

	cudaEventElapsedTime(&clock_stat.operation_execution, start, stop);

	printf("[OK]\n");



	printf("Retrieving result from device...");

	cudaEventRecord(start, 0);

	cudaMemcpy(result_h, result_d, nor[0]*noc[1]*sizeof(float), cudaMemcpyDeviceToHost);

	cudaEventRecord(stop, 0);

	cudaEventSynchronize(stop);

	cudaEventElapsedTime(&clock_stat.data_transfer_to_host, start, stop);

	printf("[OK]\n");



	cudaEventDestroy(start);

	cudaEventDestroy(stop);



	clock_stat.total = clock_stat.data_transfer_to_device + clock_stat.operation_execution + clock_stat.data_transfer_to_host;

	

	printf("\nTime statistics\n");

	printf("===============\n");

	printf("Data transfer to device: %f msec\n", clock_stat.data_transfer_to_device);

	printf("Computation time: %f msec\n", clock_stat.operation_execution);

	printf("Data transfer to host: %f msec\n", clock_stat.data_transfer_to_host);

	printf("Total time needed: %f msec\n", clock_stat.total);



	if(nor[0] < 100 && noc[1] < 9)

	{

		printf("\nResult matrix:\n");

		printf("==============\n");

		print_matrix(result_h, nor[0], noc[1]);

		printf("\n\n");

	}

	else

	{

		printf("\nResulting matrix was too big to display (%d by %d),\ndo you want to save it in a file?(y/n): ", nor[0], noc[1]);

		scanf("%c", &choise);

		scanf("%c", &choise);

		if(choise == 'y')

		{

			printf("Saving file...\n");

			export_result(result_h, nor[0], noc[1]);

		}

	}



	printf("\nReleasing device space...");

	cudaFree(matrix_d[0]);

	cudaFree(matrix_d[1]);

	cudaFree(result_d);

	printf("[OK]\n");



	free(result_h);

	char dummy;

	printf("\nPress any key to return to menu...");

	scanf("%c", &dummy);

	scanf("%c", &dummy);

	system("clear");

}

else

{

	printf("Multiplication is not possible because matrices don't have matching dimentions.");

}

}[/codebox]

Anybody??? :(

Yes, I think you’re running out of memory when multiplying matrices that large.