When “t” is relatively small (so num_block is small too), such as 10000, 20000, or even 100000, it works well.
But when it is, for example, 10 million, the GPU does not generate 10 million / block_size blocks, but fewer!
Why? Is there a limit on the number that I can pass to dimGrid()?
Much better than just using larger blocks (and I find it weird that this is always missing from the code I’ve seen on this forum) is to have each thread loop through the problem space with a stride equal to the number of threads in the grid (or block).
i.e. instead of:
int index = blockIdx.x*blockDim.x+threadIdx.x;
if( index < count) {
...
}
write:
for( int index = blockIdx.x*blockDim.x+threadIdx.x; index < count; index += blockDim.x*gridDim.x) {
...
}
That way your problem size can greatly exceed the number of cores you have available, i.e. you don’t need 65,000 blocks to solve a 65,000-input problem.
(I don’t understand what kind of confusion would lead one to limit it to the number of cores in the first place, but apparently it’s ubiquitous.)
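
For reference, here is a minimal sketch of the strided loop above as a complete kernel plus a host-side check. The kernel name `add_one` and the helper `count_mismatches` are made up for illustration and are not from the original code. The index is widened to `size_t` so the stride arithmetic cannot overflow on very large inputs, and the grid size is deliberately fixed rather than derived from `count`. On older GPUs the x dimension of the grid is capped at 65535 blocks, which may be the limit the original question ran into; a fixed grid sidesteps it either way.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride loop: each thread starts at its global index and advances by
// the total number of threads in the grid until it runs past `count`.
__global__ void add_one(float *data, size_t count)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t index = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         index < count; index += stride) {
        data[index] += 1.0f;
    }
}

// Hypothetical helper: runs add_one on `count` zeros with a fixed grid and
// returns how many elements did not end up equal to 1.0f (-1 on a CUDA error).
long long count_mismatches(size_t count, int num_blocks, int block_size)
{
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, count * sizeof(float)) != cudaSuccess) return -1;
    cudaMemset(d_data, 0, count * sizeof(float));

    // The grid size is independent of `count`; the loop covers the rest.
    add_one<<<num_blocks, block_size>>>(d_data, count);
    if (cudaGetLastError() != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(d_data);
        return -1;
    }

    std::vector<float> host(count);
    cudaMemcpy(host.data(), d_data, count * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    long long bad = 0;
    for (float v : host) {
        if (v != 1.0f) ++bad;
    }
    return bad;
}
```

Calling `count_mismatches(10000000, 256, 256)` from `main` processes 10 million elements with only 256 blocks of 256 threads, and should return 0 if every element was visited exactly once.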