Hello all,
I have just started practicing CUDA programming.
I have some questions about what happens when a block is given too many threads.
Since each block can only have 256 (or 512) threads, how do I query this limit in my host program?
Also, what happens if we request too many threads in a block? What should I expect?
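On the first question, one possible way to read the limit from the host is the runtime call `cudaGetDeviceProperties`, which fills a `cudaDeviceProp` struct that has a `maxThreadsPerBlock` field. The sketch below is only an illustration: the helper name `maxThreadsPerBlockOf` is made up here, and it assumes device 0 is the GPU of interest.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): returns the per-block thread limit
// of `device`, or -1 if the query fails (for example, no CUDA device present).
int maxThreadsPerBlockOf(int device) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return prop.maxThreadsPerBlock;
}

int main(void) {
    int limit = maxThreadsPerBlockOf(0);  // assumption: device 0 is the one we want
    if (limit < 0) return 1;
    printf("max threads per block: %d\n", limit);
    return 0;
}
```

The value depends on the device, so it is probably safer to query it at run time than to hard-code 256 or 512.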
For example,
I wrote a very simple CUDA program:
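==== the following is simplified code ====
==== the kernel ====
#include <stdio.h>
#include <stdlib.h>
__global__ void adder(float *buff) {
    int idx = threadIdx.x;  // only one block is launched, so threadIdx.x is the element index
    buff[idx] = buff[idx] + 1;
}
==== the host ====
#define BUFF_SIZE 16
int main(void) {
    float *host_buff = (float *)calloc(BUFF_SIZE, sizeof(float));  // zero-initialized
    float *device_buff;
    cudaMalloc((void **)&device_buff, BUFF_SIZE * sizeof(float));
    cudaMemcpy(device_buff, host_buff, BUFF_SIZE * sizeof(float), cudaMemcpyHostToDevice);
    adder<<<1, BUFF_SIZE>>>(device_buff);  // 1 block of BUFF_SIZE threads
    cudaMemcpy(host_buff, device_buff, BUFF_SIZE * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < BUFF_SIZE; i++) {
        if (host_buff[i] != 1) {  // every element should have gone from 0 to 1
            printf("NOT GOOD\n");
            break;
        }
    }
    cudaFree(device_buff);
    free(host_buff);
    return 0;
}
==== end of simplified code ====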
This example works well.
However, suppose I change “BUFF_SIZE” to 4096. Then we will obviously have 4096 threads in one block, which is not allowed.
In this case, none of the elements of “host_buff” were incremented! (All 4096 elements fail the “NOT GOOD” check.)
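The failure in this case is silent because the host code never looks at the launch status. A minimal sketch of such a check follows, using `cudaGetLastError` right after the launch. The helper name `launchAdder` is hypothetical, and the sketch assumes a device whose per-block limit is below 4096, in which case the reported error should be an invalid launch configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void adder(float *buff) {
    int idx = threadIdx.x;
    buff[idx] = buff[idx] + 1;
}

// Hypothetical helper: launches `adder` with one block of `threads` threads
// and returns the launch status reported by the runtime.
cudaError_t launchAdder(float *device_buff, int threads) {
    adder<<<1, threads>>>(device_buff);
    return cudaGetLastError();  // picks up launch-configuration errors
}

int main(void) {
    const int n = 4096;
    float *device_buff = NULL;
    if (cudaMalloc((void **)&device_buff, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(device_buff, 0, n * sizeof(float));

    cudaError_t err = launchAdder(device_buff, n);
    if (err != cudaSuccess) {
        // The kernel was not started, so the buffer is unchanged.
        printf("launch failed: %s\n", cudaGetErrorString(err));
    } else {
        cudaDeviceSynchronize();  // wait for the kernel before reporting success
        printf("launch succeeded\n");
    }
    cudaFree(device_buff);
    return 0;
}
```

Checking the status after every launch makes this kind of problem show up immediately instead of as unchanged data.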
If we put too many threads in a block, does it mean that CUDA does nothing at all, instead of at least doing 256 additions for us?
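A launch whose configuration is invalid is rejected as a whole, so no threads run, not even the first 256. A common way to handle 4096 elements is to split the work across several blocks and compute a global index from `blockIdx.x`, `blockDim.x` and `threadIdx.x`. The sketch below assumes 256 threads per block is within the device limit; the names `adderMany` and `blocksFor` are made up for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements (rounds up).
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

__global__ void adderMany(float *buff, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (idx < n) {                                    // guard a partial last block
        buff[idx] = buff[idx] + 1;
    }
}

int main(void) {
    const int n = 4096;
    const int threadsPerBlock = 256;  // assumption: within the device limit
    float *host_buff = (float *)calloc(n, sizeof(float));
    float *device_buff = NULL;
    if (host_buff == NULL) return 1;
    if (cudaMalloc((void **)&device_buff, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemcpy(device_buff, host_buff, n * sizeof(float), cudaMemcpyHostToDevice);

    adderMany<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(device_buff, n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(host_buff, device_buff, n * sizeof(float), cudaMemcpyDeviceToHost);
    int bad = 0;
    for (int i = 0; i < n; i++) {
        if (host_buff[i] != 1.0f) bad++;
    }
    printf("%d elements not incremented\n", bad);

    cudaFree(device_buff);
    free(host_buff);
    return 0;
}
```

The `idx < n` guard matters when `n` is not a multiple of the block size, since the last block then has more threads than remaining elements.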
Thanks.
- Wei-Fan