Two questions about too many threads in a block

Hello all,

I've just started practicing CUDA programming.
Here I have some questions about the case where there are too many threads in a block…

Since each block can only have a limited number of threads (256 or 512, say), how do I acquire this limit in my host program?

Also, what happens if we put too many threads in a block? What can be expected?

For example,
I just wrote a very simple CUDA program:

==== following are just pseudo codes ====
==== the kernel ====
__global__ void adder(float *buff) {
    int idx = threadIdx.x;
    buff[idx] = buff[idx] + 1;
}

==== the host ====
#define BUFF_SIZE 16

float *host_buff = (float *) calloc(BUFF_SIZE, sizeof(float)); /* zero-initialized */
float *device_buff;
cudaMalloc((void **) &device_buff, BUFF_SIZE * sizeof(float));

cudaMemcpy(device_buff, host_buff, BUFF_SIZE * sizeof(float), cudaMemcpyHostToDevice);

adder<<<1, BUFF_SIZE>>>(device_buff);

cudaMemcpy(host_buff, device_buff, BUFF_SIZE * sizeof(float), cudaMemcpyDeviceToHost);

for (int i = 0; i < BUFF_SIZE; i++)
    if (host_buff[i] != 1) {
        printf("NOT GOOD\n");
        break;
    }
==== end of pseudo codes ====

This example works well.
However, if I change BUFF_SIZE to 4096, we obviously get 4096 threads in a block, which is not allowed.
In this case, not a single element of host_buff had been incremented, so the check printed "NOT GOOD".

If we put too many threads in a block, does that mean CUDA will do nothing for us at all, instead of at least doing 256 of the additions?

Thanks.

  • Wei-Fan

If you have installed the NVIDIA examples, run the deviceQuery program and look for the property maxThreadsPerBlock. You should submit the kernel like this:

adder<<<(BUFF_SIZE + max_threads - 1) / max_threads, max_threads>>>(device_buff);
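You can also obtain max_threads at runtime instead of reading it off the deviceQuery output — a minimal sketch using cudaGetDeviceProperties():

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);            // query device 0
int max_threads = prop.maxThreadsPerBlock;    // e.g. 512 or 1024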

I think most CUDA-capable video cards made in the last 2 years have max_threads = 512; the Fermi cards have 1024. There are also limits on the sizes of the block and grid dimensions.

In general you would use

dim3 grid(bx, by);
dim3 block(tx, ty, tz);
adder<<<grid, block>>>(device_buff);

with

tx * ty * tz <= 512 (tz itself at most 64)

bx <= 65535

by <= 65535
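Note that once the launch uses more than one block, the kernel itself must compute a global index and guard against the spare threads in the last block — a sketch, assuming the buffer length is passed in as an extra parameter n:

__global__ void adder(float *buff, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (idx < n)                                       // the last block may have extra threads
        buff[idx] = buff[idx] + 1;
}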

I strongly recommend checking the return status of every CUDA API call and the status of every kernel execution. In addition, I would suggest taking a closer look at the functions cudaGetLastError() and cudaGetErrorString().

Either way the result will be wrong: with an invalid launch configuration the kernel never runs, so nothing is added, and only an error check will tell you why.
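For example, a minimal sketch of that kind of checking (the CHECK macro is a hypothetical helper, not part of the CUDA API):

#include <stdio.h>
#include <stdlib.h>

#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err = (call);                                \
        if (err != cudaSuccess) {                                \
            printf("CUDA error at %s:%d: %s\n",                  \
                   __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

CHECK(cudaMemcpy(device_buff, host_buff, BUFF_SIZE * sizeof(float),
                 cudaMemcpyHostToDevice));
adder<<<1, BUFF_SIZE>>>(device_buff);
CHECK(cudaGetLastError());   // reports the invalid launch configuration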

Unless you have a compelling reason not to, hard-code the number of threads per block. You can even hard-code the number of blocks per SM, using the __launch_bounds__ modifier on a kernel function. I wrote a nifty Fermi occupancy table here:
http://www.moderngpu.com/intro/workflow.html#Occupancy

Try to stay above 50% occupancy, although going lower won’t necessarily hurt performance if your kernels have decent ILP. Small blocks are preferred over large blocks. Hard-coding the block size will reduce register pressure and almost certainly increase performance by turning some expressions into compile-time constants.
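A minimal sketch of what that might look like (the 256-thread block size and the 4 blocks per SM are just example choices):

#define NUM_THREADS 256   // hard-coded block size

__global__ void __launch_bounds__(NUM_THREADS, 4)   // request >= 4 resident blocks per SM
adder(float *buff)
{
    int idx = blockIdx.x * NUM_THREADS + threadIdx.x;   // NUM_THREADS is a compile-time constant
    buff[idx] = buff[idx] + 1;
}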

You can also pick a constant number of threads per block and use this modified kernel:

==== following are just pseudo codes ====
==== the kernel ====
__global__ void adder(float *buff, int buff_size)
{
    int idx = threadIdx.x;
    while (idx < buff_size)
    {
        buff[idx] = buff[idx] + 1;
        idx = idx + blockDim.x;
    }
}
==== end of pseudo codes ====
This way each thread will do more than one addition.
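With that version the launch no longer has to match the buffer size — a sketch, assuming the constant is called NUM_THREADS:

adder<<<1, NUM_THREADS>>>(device_buff, BUFF_SIZE);

Each of the NUM_THREADS threads then walks the buffer in strides of blockDim.x, so BUFF_SIZE can be arbitrarily large.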