String search with many threads

Hi, this is my first post on this forum. I’m Alessandro, an Italian computer engineering student. I have a small problem
when I do this:

block_size = 128;
dim3 dimBlock(block_size,1,1);
int num_block = t/block_size;
dim3 dimGrid(num_block,1);
Search<<<dimGrid,dimBlock>>>(da, da2, p_res, lun_par);

When “t” is relatively small (so num_block is small too), like 10000, 20000, or even 100000, it works well.
But when it is, for example, 10 million, the GPU does not launch 10 million / block_size blocks, but fewer!

Why? Is there a limit on the number I can pass to dimGrid()?

Thanks.

The limit is 65535 in any dimension of dimGrid.
See the CUDA programming guide, Appendix G.
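
To make the numbers concrete, here is a minimal host-side sketch (plain C++, which should also compile inside a .cu file) of the block-count arithmetic. The names `blocks_needed`, `fits_in_1d_grid` and `kMaxGridDim` are made up for illustration. The value 65535 is the limit quoted above; newer devices may allow more in the x dimension, so in real code it is probably safer to read `maxGridSize` from `cudaGetDeviceProperties` than to hard-code it.

```cpp
#include <cassert>
#include <cstdint>

// Per-dimension grid limit quoted in this thread (assumption: it matches
// your device; query the device properties to be sure).
constexpr std::int64_t kMaxGridDim = 65535;

// Number of blocks needed to cover t threads. Rounds up so that a partial
// last block is not dropped (plain t / block_size rounds down).
inline std::int64_t blocks_needed(std::int64_t t, std::int64_t block_size) {
    assert(t >= 0 && block_size > 0);
    return (t + block_size - 1) / block_size;
}

// True if that many blocks fit in a one-dimensional grid.
inline bool fits_in_1d_grid(std::int64_t num_blocks) {
    return num_blocks >= 1 && num_blocks <= kMaxGridDim;
}
```

With block_size = 128, t = 100000 needs 782 blocks and fits, while t = 10 million needs 78125 blocks and does not.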

It’s true :) Thanks a lot.

The maximum dimension of a grid is (65535, 65535, 1).

10 million / block_size exceeds the capacity of a one-dimensional grid, so you should check for errors after the launch:

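block_size = 128;
dim3 dimBlock(block_size,1,1);
int num_block = t/block_size;
dim3 dimGrid(num_block,1);
Search<<<dimGrid,dimBlock>>>(da, da2, p_res, lun_par);
// wait for the kernel to finish (deprecated in newer toolkits;
// cudaDeviceSynchronize() is the replacement)
cudaThreadSynchronize();
// an invalid launch configuration, such as a too-large grid, shows up here
cudaError_t status = cudaGetLastError();
if ( cudaSuccess != status ){
    fprintf(stderr, "Error: %s\n", cudaGetErrorString(status)) ;
    exit(1) ;
}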

You should use a 2-dimensional grid.
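
A minimal sketch of how such a 2-dimensional grid could be laid out, assuming the 65535 per-dimension limit above. The struct `Grid2D` and the helpers `make_grid_2d` and `linear_block_id` are hypothetical names, written as plain host-side C++ so the arithmetic can be checked without a GPU. Inside the kernel, the matching global index would be `(blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x`, and the kernel still needs an `index < t` guard because the 2-D grid can hold more blocks than were asked for.

```cpp
#include <cassert>
#include <cstdint>

// Per-dimension limit from this thread (assumption: it matches your device).
constexpr std::int64_t kMaxBlocksPerDim = 65535;

struct Grid2D {
    std::int64_t x;  // blocks along x (would go into dimGrid.x)
    std::int64_t y;  // blocks along y (would go into dimGrid.y)
};

// Split num_blocks into an x-by-y grid with both sides within the limit.
// x * y may be slightly larger than num_blocks, hence the guard in the kernel.
inline Grid2D make_grid_2d(std::int64_t num_blocks) {
    assert(num_blocks >= 1 &&
           num_blocks <= kMaxBlocksPerDim * kMaxBlocksPerDim);
    Grid2D g;
    g.y = (num_blocks + kMaxBlocksPerDim - 1) / kMaxBlocksPerDim;  // rows needed
    g.x = (num_blocks + g.y - 1) / g.y;                            // blocks per row
    return g;
}

// Host-side mirror of blockIdx.y * gridDim.x + blockIdx.x.
inline std::int64_t linear_block_id(const Grid2D& g,
                                    std::int64_t bx, std::int64_t by) {
    return by * g.x + bx;
}
```

For the 78125 blocks of the 10 million case this gives a 39063 x 2 grid, which is one block more than needed; that extra block is exactly what the index guard is for.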

Thanks. Is this the code for checking errors? I’ll try it.

Much better than just using larger blocks (and I find it weird that this is always missing from the code I’ve seen on this forum) is to have each thread loop through the problem space with a stride equal to the number of threads in the grid (or block).

i.e. instead of:

int index = blockIdx.x*blockDim.x+threadIdx.x;
if( index < count) {
   ...
}

write:

for( int index = blockIdx.x*blockDim.x+threadIdx.x; index < count; index += blockDim.x*gridDim.x) {
   ...
}

That way your problem size can greatly exceed the number of threads you launch, i.e. you don’t have to have 65,000 threads to solve a 65,000-input problem.

(I don’t understand what kind of confusion would lead one to limit it to the number of threads in the first place, but apparently it’s ubiquitous.)
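
To see that the strided loop really covers everything, here is a small host-side simulation in plain C++ (no GPU needed). The function names `simulate_grid_stride` and `covers_exactly_once` are made up for this sketch; the first one plays the role of every thread in turn and counts how often each index is visited, using the same start and stride expressions as the kernel loop above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates the kernel loop for every (block, thread) pair on the CPU and
// returns how many times each index in [0, count) was visited.
inline std::vector<int> simulate_grid_stride(int grid_dim, int block_dim,
                                             int count) {
    assert(grid_dim > 0 && block_dim > 0 && count >= 0);
    std::vector<int> visits(static_cast<std::size_t>(count), 0);
    for (int block = 0; block < grid_dim; ++block) {
        for (int thread = 0; thread < block_dim; ++thread) {
            // Same expressions as in the kernel: start at the global thread
            // index, then jump by the total number of threads in the grid.
            for (int index = block * block_dim + thread; index < count;
                 index += block_dim * grid_dim) {
                ++visits[static_cast<std::size_t>(index)];
            }
        }
    }
    return visits;
}

// True if every index was visited exactly once.
inline bool covers_exactly_once(const std::vector<int>& visits) {
    for (int v : visits) {
        if (v != 1) return false;
    }
    return true;
}
```

In this simulation a grid of only 4 blocks of 128 threads covers 100000 inputs, each exactly once, so the number of blocks can stay well under 65535 however large count gets.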
