String search with many threads

Painkiller_1986 · November 4, 2010, 5:27pm

Hi, this is my first post on this forum. I’m Alessandro, an Italian computer engineering student. I have a small problem
when I do this:

block_size = 128;
dim3 dimBlock(block_size,1,1);
int num_block = t/block_size;
dim3 dimGrid(num_block,1);
Search<<<dimGrid,dimBlock>>>(da, da2, p_res, lun_par);

When “t” is relatively small (so num_block is small too), such as 10000, 20000, or even 100000, it works well.
But when it is, for example, 10 million, the GPU does not launch 10 million/block_size blocks, but fewer!

Why? Is there a limit on the number I can pass to dimGrid()?

Thanks.

insmvb00 · November 4, 2010, 5:36pm

The limit is 65535 in any dimension of dimGrid.
See the CUDA programming guide, Appendix G.
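
The limit can also be read from the device at run time rather than looked up. The following is a minimal sketch, assuming device 0 is the GPU in use and reusing the values of t and block_size from the question; it prints the grid limits reported by cudaGetDeviceProperties and the number of blocks the launch would need.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;  // assumption: the first CUDA device is the one being used
    cudaDeviceProp prop;
    cudaError_t status = cudaGetDeviceProperties(&prop, device);
    if (status != cudaSuccess) {
        fprintf(stderr, "Error: %s\n", cudaGetErrorString(status));
        return 1;
    }

    // Per-dimension grid limits and the per-block thread limit for this device.
    printf("Max grid size: (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);

    // Values taken from the question: 10 million elements, 128 threads per block.
    int t = 10000000;
    int block_size = 128;
    int num_block = (t + block_size - 1) / block_size;  // round up, not down

    printf("Blocks needed: %d\n", num_block);
    if (num_block > prop.maxGridSize[0]) {
        printf("This does not fit in the x dimension of the grid.\n");
    }
    return 0;
}
```

On hardware where the x limit is 65535, the comparison at the end flags the 10 million case before any kernel is launched.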

Painkiller_1986 · November 4, 2010, 5:51pm

it’s true :) Thanks a lot.

LSChien · November 4, 2010, 10:47pm

The maximum grid dimension is (65535, 65535, 1).

10 million/block_size (78125 blocks for block_size = 128) exceeds the capacity of a one-dimensional grid. You should check for errors:

block_size = 128;
dim3 dimBlock(block_size,1,1);
int num_block = t/block_size;
dim3 dimGrid(num_block,1);
Search<<<dimGrid,dimBlock>>>(da, da2, p_res, lun_par);
// wait for the kernel to finish (newer toolkits use cudaDeviceSynchronize)
cudaThreadSynchronize();
// pick up any launch or execution error
cudaError_t status = cudaGetLastError();
if ( cudaSuccess != status ){
    fprintf(stderr, "Error: %s\n", cudaGetErrorString(status)) ;
    exit(1) ;
}

You should use a 2-dimensional grid.
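
As a rough illustration of the 2-dimensional grid, here is a minimal sketch. The kernel Touch is a hypothetical stand-in for Search (whose body is not shown in this thread), and the 65535 limit is the one quoted above. The blocks are spread over x and y, and each thread rebuilds a linear index from both block coordinates.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for Search: each thread marks one element.
__global__ void Touch(unsigned char *out, int t) {
    // Flatten the 2D block coordinates into a single block number.
    int block_id = blockIdx.y * gridDim.x + blockIdx.x;
    int index = block_id * blockDim.x + threadIdx.x;
    // The grid is rounded up, so some threads fall past the end.
    if (index < t) {
        out[index] = 1;
    }
}

static void check(cudaError_t status, const char *what) {
    if (status != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(status));
        exit(1);
    }
}

int main() {
    const int t = 10000000;        // problem size from the question
    const int block_size = 128;
    const int max_grid_x = 65535;  // per-dimension limit quoted in this thread

    int num_block = (t + block_size - 1) / block_size;
    int grid_x = num_block < max_grid_x ? num_block : max_grid_x;
    int grid_y = (num_block + grid_x - 1) / grid_x;  // enough rows to cover num_block

    unsigned char *d_out = NULL;
    check(cudaMalloc((void **)&d_out, t), "cudaMalloc");
    check(cudaMemset(d_out, 0, t), "cudaMemset");

    dim3 dimBlock(block_size, 1, 1);
    dim3 dimGrid(grid_x, grid_y, 1);
    Touch<<<dimGrid, dimBlock>>>(d_out, t);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "kernel execution");

    // Copy back and count how many elements were actually reached.
    unsigned char *h_out = (unsigned char *)malloc(t);
    if (h_out == NULL) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    check(cudaMemcpy(h_out, d_out, t, cudaMemcpyDeviceToHost), "cudaMemcpy");
    int touched = 0;
    for (int i = 0; i < t; ++i) {
        touched += h_out[i];
    }
    printf("grid = (%d, %d), touched %d of %d elements\n", grid_x, grid_y, touched, t);

    free(h_out);
    cudaFree(d_out);
    return 0;
}
```

The bounds check in the kernel matters here, because rounding the grid up to whole rows launches more threads than there are elements.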

Painkiller_1986 · November 5, 2010, 12:45pm

Thanks. Is this the code for checking errors? I’ll try it.

happyjack272 · November 5, 2010, 3:59pm

Much better than just using larger blocks (and I find it weird that this is always missing from the code I’ve seen on this forum) is to have each thread loop through the problem space with a stride equal to the number of threads in the grid (or block).

i.e. instead of:

int index = blockIdx.x*blockDim.x+threadIdx.x;
if( index < count) {
   ...
}

write:

for( int index = blockIdx.x*blockDim.x+threadIdx.x; index < count; index += blockDim.x*gridDim.x) {
   ...
}

That way your problem size can greatly exceed the number of cores you have available, i.e. you don’t need 65,000 blocks to solve a 65,000-input problem.

(I don’t understand what kind of confusion would lead one to limit it to the number of cores in the first place, but apparently it’s ubiquitous.)
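
To make the strided loop concrete, here is a minimal self-contained sketch. The kernel name Visit, the grid of 256 blocks, and the per-element counter are assumptions made for illustration and are not from the original Search code. A small fixed grid covers 10 million elements, and the host then verifies that every element was visited exactly once.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread walks the array with a stride equal to
// the total number of threads in the grid.
__global__ void Visit(unsigned char *out, int count) {
    for (int index = blockIdx.x * blockDim.x + threadIdx.x;
         index < count;
         index += blockDim.x * gridDim.x) {
        out[index] += 1;  // each index belongs to exactly one thread
    }
}

static void check(cudaError_t status, const char *what) {
    if (status != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(status));
        exit(1);
    }
}

int main() {
    const int count = 10000000;  // problem size from the question
    const int block_size = 128;
    const int num_block = 256;   // assumption: a small fixed grid, far below 65535

    unsigned char *d_out = NULL;
    check(cudaMalloc((void **)&d_out, count), "cudaMalloc");
    check(cudaMemset(d_out, 0, count), "cudaMemset");

    Visit<<<num_block, block_size>>>(d_out, count);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "kernel execution");

    unsigned char *h_out = (unsigned char *)malloc(count);
    if (h_out == NULL) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    check(cudaMemcpy(h_out, d_out, count, cudaMemcpyDeviceToHost), "cudaMemcpy");

    // Every element should have been incremented exactly once.
    int wrong = 0;
    for (int i = 0; i < count; ++i) {
        if (h_out[i] != 1) {
            ++wrong;
        }
    }
    printf("%d threads covered %d elements, %d visited incorrectly\n",
           num_block * block_size, count, wrong);

    free(h_out);
    cudaFree(d_out);
    return 0;
}
```

With this pattern the grid size becomes a tuning choice rather than something dictated by the input size, so the same kernel works for any count that fits in an int.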
