Run a million threads or blocks on a single kernel function and it still works. Isn't the maximum supposed to be 512?

Hi,

I am trying to compare the performance of using threads and blocks in my CUDA program.
I use an NVIDIA GeForce GT 750M (2048 MB) on my MacBook Pro for CUDA.

My code is based on this material: http://www.nvidia.com/docs/io/116711/sc11-cuda-c-basics.pdf

So I created three functions like these:

#define N 1000000
// using blocks
__global__ void add_block(int *a, int *b, int *c)
{
	c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

//using threads
__global__ void add_thread(int *a, int *b, int *c)
{
	c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// using blocks and threads
__global__ void add_block_thread(int *a, int *b, int *c)
{
	int index = threadIdx.x + blockIdx.x * blockDim.x;
	c[index] = a[index] + b[index];
}
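For completeness, a guarded form of the combined kernel, so the grid size can be rounded up without indexing past the end of the arrays, would look roughly like this (a sketch; the extra parameter n and the _guarded name are my additions, not from the slides):

// using blocks and threads, with a bounds check
__global__ void add_block_thread_guarded(int *a, int *b, int *c, int n)
{
	int index = threadIdx.x + blockIdx.x * blockDim.x;
	if (index < n)	// the last block may be only partially filled
		c[index] = a[index] + b[index];
}

	// launch with the grid rounded up to cover all n elements
	add_block_thread_guarded<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);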

And then I tried to add two integer arrays with 1 million elements each using each of these functions (I define N = 1,000,000), recorded their execution times, and compared them.

//////calculate with cuda blocks
	start = clock(); //tic
	// execute the operation on the device
	add_block<<<N,1>>>(d_a, d_b, d_c);
	end = clock(); //toc
	cuda_block_seconds = (float)(end - start) / CLOCKS_PER_SEC;

	// copy the result back to host
	cudaMemcpy(c_block, d_c, size, cudaMemcpyDeviceToHost);
	//////////////////////////////////////////

	//////calculate with cuda threads
	start = clock(); //tic
	// execute the operation on the device
	add_thread<<<1,N>>>(d_a, d_b, d_c);
	end = clock(); //toc
	cuda_thread_seconds = (float)(end - start) / CLOCKS_PER_SEC;

	// copy the result back to host
	cudaMemcpy(c_thread, d_c, size, cudaMemcpyDeviceToHost);
	//////////////////////////////////////////

	//////calculate with threads and blocks
	start = clock(); //tic
	// execute the operation on the device
	add_block_thread<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
	end = clock(); //toc
	cuda_combine_seconds = (float)(end - start) / CLOCKS_PER_SEC;

	// copy the result back to host
	cudaMemcpy(c_combine, d_c, size, cudaMemcpyDeviceToHost);
	//////////////////////////////////////////
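(One thing I am not sure about: since kernel launches return to the host immediately, clock() around the launch may only measure the launch overhead rather than the kernel itself. A sketch of event-based timing that waits for the kernel to finish, using the standard cudaEvent API:)

	cudaEvent_t ev_start, ev_stop;
	cudaEventCreate(&ev_start);
	cudaEventCreate(&ev_stop);

	cudaEventRecord(ev_start);
	add_block_thread<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
	cudaEventRecord(ev_stop);
	cudaEventSynchronize(ev_stop);	// block until the kernel has finished

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, ev_start, ev_stop);	// elapsed GPU time in milliseconds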

In the code I set the number of blocks or threads to N, which is 1 million. I checked the results by comparing them against a sequential calculation on the host, and all of them were correct.

I found that 1 block with 1 million threads performs better than 1 million blocks with 1 thread each. Here are the printouts:

Calculation without CUDA (sequential) time is 0.042559 seconds
CUDA with 1 million blocks and 1 thread time is 0.000073 seconds
CUDA with 1 million threads and 1 block time is 0.000007 seconds
CUDA with blocks and threads combined time is 0.000014 seconds

What I don’t understand is: the maximum number of threads per block is supposed to be 512, so why did it still work when I set the number of threads or blocks to 1 million?

Could anyone help me to understand this?

Thank you!

You’re not checking for errors, nor even printing the computed results to verify that it worked. The kernel launch is failing because the block size is too large, so the evaluation is instant, but it didn’t actually compute anything. Always check the error codes in all CUDA code.
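A minimal check right after each launch already catches this, and querying the device shows the real limit (a sketch; on a Kepler part like your GT 750M, maxThreadsPerBlock should report 1024 rather than 512):

	add_thread<<<1,N>>>(d_a, d_b, d_c);
	cudaError_t err = cudaGetLastError();	// catches invalid launch configurations
	if (err != cudaSuccess)
		printf("launch failed: %s\n", cudaGetErrorString(err));

	err = cudaDeviceSynchronize();	// catches errors raised while the kernel runs
	if (err != cudaSuccess)
		printf("kernel failed: %s\n", cudaGetErrorString(err));

	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);	// device 0; adjust if you have several GPUs
	printf("max threads per block: %d\n", prop.maxThreadsPerBlock);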

Thank you for your prompt reply!

Actually, I checked using this code:

// compute sequentially
	for(int i=0; i< N; i++)
	{
		c_seq[i] = a[i] + b[i];
	}

//checking the results with CUDA blocks
	for(int i=0; i< N; i++)
	{
		if(c_seq[i] != c_block[i])
		{
			printf("not the same with block\n");
			break;
		}
	}

//checking the results with CUDA threads
	for(int i=0; i< N; i++)
	{
		if(c_seq[i] != c_thread[i])
		{
			printf("not the same with thread \n");
			break;
		}
	}

And it didn’t print either of those error messages.
I also didn’t get any message that the kernel launch failed.
Should that show up in the console window? I use Eclipse Nsight.

Nsight might catch and report many errors, but you should put such checks right in your code, even in production release builds. There are many small macro wrappers for API calls that people like to use, but the key point is to actually check the return value of each of your API calls, or alternatively to query for errors afterwards with cudaGetLastError() or cudaPeekAtLastError(). You’re currently ignoring errors potentially reported at lines 20, 26, and 31 of your code. (Your real error is at line 26, the kernel call, but the point is to always test every API call, just to prevent your own confusion about why things are “too fast” or “don’t work” when you’re not even checking for any launch or runtime error.) Incidentally, that is probably also why your result check passed: all three launches reuse the same d_c buffer, so the copy after the failed add_thread launch most likely brought back the data left there by the earlier, successful add_block run.
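One of those wrapper macros might look like this (a sketch; the name CUDA_CHECK is arbitrary, and aborting on failure is just one possible policy):

// needs <stdio.h> and <stdlib.h> for fprintf/exit
#define CUDA_CHECK(call)                                              \
	do {                                                              \
		cudaError_t err_ = (call);                                    \
		if (err_ != cudaSuccess) {                                    \
			fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",           \
			        cudaGetErrorString(err_), __FILE__, __LINE__);    \
			exit(EXIT_FAILURE);                                       \
		}                                                             \
	} while (0)

	// wrap API calls directly, and query kernel launches right afterwards
	CUDA_CHECK(cudaMemcpy(c_thread, d_c, size, cudaMemcpyDeviceToHost));
	add_thread<<<1,N>>>(d_a, d_b, d_c);
	CUDA_CHECK(cudaGetLastError());	// this is where your configuration error shows up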

Wow, you are so right. I got the error “invalid configuration argument in …/main.cu at line …”.

Thanks a lot