Run a million threads or blocks on a single kernel function and it still works. Isn't the maximum supposed to be 512?

Hi,

I am trying to compare the performance of using threads and blocks in my CUDA program.
I use an NVIDIA GeForce GT 750M (2048 MB) on my MacBook Pro for CUDA.

My code is based on this material: http://www.nvidia.com/docs/io/116711/sc11-cuda-c-basics.pdf

So I created three functions like these:

#define N 1000000
// using blocks
__global__ void add_block(int *a, int *b, int *c)
{
	c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

//using threads
__global__ void add_thread(int *a, int *b, int *c)
{
	c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// using blocks and threads
__global__ void add_block_thread(int *a, int *b, int *c)
{
	int index = threadIdx.x + blockIdx.x * blockDim.x;
	c[index] = a[index] + b[index];
}
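For completeness, a guarded form of the combined kernel, so the grid size can be rounded up without indexing past the end of the arrays, would look roughly like this (a sketch; the extra parameter n and the _guarded name are my additions, not from the slides):

// using blocks and threads, with a bounds check
__global__ void add_block_thread_guarded(int *a, int *b, int *c, int n)
{
	int index = threadIdx.x + blockIdx.x * blockDim.x;
	if (index < n)	// the last block may be only partially filled
		c[index] = a[index] + b[index];
}

	// launch with the grid rounded up to cover all n elements
	add_block_thread_guarded<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);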

And then I tried to add two integer arrays with 1 million elements each using each of these functions (I define N = 1,000,000), recorded their execution times, and compared them.

//////calculate with cuda blocks
	start = clock(); //tic
	// execute the operation on the device
	add_block<<<N,1>>>(d_a, d_b, d_c);
	end = clock(); //toc
	cuda_block_seconds = (float)(end - start) / CLOCKS_PER_SEC;

	// copy the result back to host
	cudaMemcpy(c_block, d_c, size, cudaMemcpyDeviceToHost);
	//////////////////////////////////////////

	//////calculate with cuda threads
	start = clock(); //tic
	// execute the operation on the device
	add_thread<<<1,N>>>(d_a, d_b, d_c);
	end = clock(); //toc
	cuda_thread_seconds = (float)(end - start) / CLOCKS_PER_SEC;

	// copy the result back to host
	cudaMemcpy(c_thread, d_c, size, cudaMemcpyDeviceToHost);
	//////////////////////////////////////////

	//////calculate with threads and blocks
	start = clock(); //tic
	// execute the operation on the device
	add_block_thread<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
	end = clock(); //toc
	cuda_combine_seconds = (float)(end - start) / CLOCKS_PER_SEC;

	// copy the result back to host
	cudaMemcpy(c_combine, d_c, size, cudaMemcpyDeviceToHost);
	//////////////////////////////////////////
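(One thing I am not sure about: since kernel launches return to the host immediately, clock() around the launch may only measure the launch overhead rather than the kernel itself. A sketch of event-based timing that waits for the kernel to finish, using the standard cudaEvent API:)

	cudaEvent_t ev_start, ev_stop;
	cudaEventCreate(&ev_start);
	cudaEventCreate(&ev_stop);

	cudaEventRecord(ev_start);
	add_block_thread<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
	cudaEventRecord(ev_stop);
	cudaEventSynchronize(ev_stop);	// block until the kernel has finished

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, ev_start, ev_stop);	// elapsed GPU time in milliseconds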

In the code I set the number of blocks or threads to N, which is 1 million. I checked the results by comparing them against a sequential calculation on the host, and all of them were correct.

I found that 1 block with 1 million threads performs better than 1 million blocks with 1 thread each. Here are the printouts:

Calculation without CUDA (sequential) time is 0.042559 seconds
CUDA with 1 million blocks and 1 thread time is 0.000073 seconds
CUDA with 1 million threads and 1 block time is 0.000007 seconds
CUDA with blocks and threads combined time is 0.000014 seconds

What I don’t understand is: the maximum number of threads per block is supposed to be 512, so why did it still work when I set the number of threads or blocks to 1 million?

Could anyone help me to understand this?

Thank you!

You’re not checking for errors, nor even printing the computed results to verify that it worked. The kernel launch is failing because the block size is too large, so the evaluation is instant, but it didn’t actually compute anything. Always check the error codes in all CUDA code.
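A minimal check right after each launch already catches this, and querying the device shows the real limit (a sketch; on a Kepler part like your GT 750M, maxThreadsPerBlock should report 1024 rather than 512):

	add_thread<<<1,N>>>(d_a, d_b, d_c);
	cudaError_t err = cudaGetLastError();	// catches invalid launch configurations
	if (err != cudaSuccess)
		printf("launch failed: %s\n", cudaGetErrorString(err));

	err = cudaDeviceSynchronize();	// catches errors raised while the kernel runs
	if (err != cudaSuccess)
		printf("kernel failed: %s\n", cudaGetErrorString(err));

	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);	// device 0; adjust if you have several GPUs
	printf("max threads per block: %d\n", prop.maxThreadsPerBlock);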

Thank you for your prompt reply!

Actually, I checked using this code:

// compute sequentially
	for(int i=0; i< N; i++)
	{
		c_seq[i] = a[i] + b[i];
	}

//checking the results with CUDA blocks
	for(int i=0; i< N; i++)
	{
		if(c_seq[i] != c_block[i])
		{
			printf("not the same with block\n");
			break;
		}
	}

//checking the results with CUDA threads
	for(int i=0; i< N; i++)
	{
		if(c_seq[i] != c_thread[i])
		{
			printf("not the same with thread \n");
			break;
		}
	}

And it didn’t print either of those error messages.
I also didn’t get any message that the kernel launch failed.
Should that show up in the console window? I use Eclipse Nsight.

Nsight might catch and report many errors, but you should put such checks right in your code, even in production release builds. There are many small macro wrappers for API calls that people like to use, but the key point is to actually check the return value of each of your API calls, or alternatively to query for errors afterwards with cudaGetLastError() or cudaPeekAtLastError(). You’re currently ignoring errors potentially reported at lines 20, 26, and 31 of your code. (Your real error is at line 26, the kernel call, but the point is to always test every API call, just to prevent your own confusion about why things are “too fast” or “don’t work” when you’re not even checking for any launch or runtime error.) Incidentally, that is probably also why your result check passed: all three launches reuse the same d_c buffer, so the copy after the failed add_thread launch most likely brought back the data left there by the earlier, successful add_block run.
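One of those wrapper macros might look like this (a sketch; the name CUDA_CHECK is arbitrary, and aborting on failure is just one possible policy):

// needs <stdio.h> and <stdlib.h> for fprintf/exit
#define CUDA_CHECK(call)                                              \
	do {                                                              \
		cudaError_t err_ = (call);                                    \
		if (err_ != cudaSuccess) {                                    \
			fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",           \
			        cudaGetErrorString(err_), __FILE__, __LINE__);    \
			exit(EXIT_FAILURE);                                       \
		}                                                             \
	} while (0)

	// wrap API calls directly, and query kernel launches right afterwards
	CUDA_CHECK(cudaMemcpy(c_thread, d_c, size, cudaMemcpyDeviceToHost));
	add_thread<<<1,N>>>(d_a, d_b, d_c);
	CUDA_CHECK(cudaGetLastError());	// this is where your configuration error shows up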

Wow, you are so right. I got the error “invalid configuration argument in …/main.cu at line …”.

Thanks a lot