Call Configuration for 498?


CPU function

void foo( … )
for( int i=0; i<498; ++i )

I want to execute the above function on the GPU, using threads instead of the loop.
I’m stuck on the launch configuration. See the GPU code below.

GPU function

__global__ void foo( … )
int currThread = blockIdx.x * blockDim.x + threadIdx.x; // instead of for loop


I tried the config below, but it does not run the loop body for all 498 iterations (indices 0 to 497)
dim3 numBlocks( 498/16,1,1 );
dim3 numThreads( 16,1,1);
foo<<< numBlocks, numThreads >>>( … );

The above config only covers 496 of them (indices 0 to 495).

How can I run the body of foo( ) for all 498 indices?
Are there any CUDA guidelines for this type of problem?

498/16 = 31.125, but integer division truncates that to 31 blocks, i.e. 31 × 16 = 496 threads.

Since a fractional number of blocks is not possible, your code does not execute as you expect.

A fractional number of blocks is not possible because:

  1. There are infinitely many real numbers between any two real numbers.

  2. This problem already manifests itself as a precision problem.
    Even double precision numbers are not sufficient to represent infinitely many real values.
    A floating point computer will always have some error. If it did not, the computer would still be under construction.

  3. So even if NVIDIA manufactured such a card, another user would complain that it does not work for 7/3 = (2.33333333…)

So how can I execute the above function for all 498 iterations?

Is it not possible?

Add an if statement before your calculations and just launch 32 blocks.

if (currThread < 498) {
    // do calculations
}

Run a for loop baby…

Like: with one CPU you can still initialize that many elements… and if you have “N” CPUs, you should still be fine with it,… should you not? A FOR loop is your answer.
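One possible reading of this suggestion is a strided loop inside the kernel, so that any number of threads can cover all 498 indices. The sketch below is only an illustration under that assumption: the names fooStrided and runFooStrided, the int output buffer, and the "store the index" body are hypothetical stand-ins for the real foo( … ), and error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread starts at its global index and then
// jumps ahead by the total number of launched threads until it passes n.
__global__ void fooStrided(int *out, int n)
{
    int stride = gridDim.x * blockDim.x;  // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        out[i] = i;  // placeholder for the real loop body
    }
}

// Hypothetical host helper: launches the kernel and copies the result back.
std::vector<int> runFooStrided(int n, int numBlocks, int numThreads)
{
    std::vector<int> host(n, -1);
    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemset(dev, 0xFF, n * sizeof(int));  // every int starts as -1
    fooStrided<<<numBlocks, numThreads>>>(dev, n);
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

With this shape the grid size no longer has to match 498: runFooStrided(498, 4, 16) launches only 64 threads, and each one handles several indices.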

Thinking of floating point, I think computer designers could just use a numerator and a denominator to represent all kinds of fractional values.
They could fall back to the normal way when the computation gets tough… This way some accuracy could be preserved.

Like say (2/3) * (2/3) could easily be stored as (4/9) in the computer… which is more accurate than a truncated 0.6666… squared.

I think this must have been considered and ditched by the elites long back… Just a few musings from a lazy soul

wow… I have absolutely no idea what you’re talking about…

If I understand correctly, what Jeroen suggested is the way to go… what are all those floating point calculations and new hardware for floating point block indexes ???

I really don’t understand what you mean… :(

This is a common CUDA idiom.

Say you want to do something N times in parallel, with block size of BLOCKSIZE.

Configure a grid with (N+BLOCKSIZE-1)/BLOCKSIZE blocks. This will round up to the next block if N is not divisible by BLOCKSIZE.

Within your kernel, compute idx = blockIdx.x*BLOCKSIZE + threadIdx.x;

Then, since you might have extra threads in the last block, do:
if (idx >= N) return;
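Put together, the idiom could look like the sketch below, using N = 498 and a block size of 16 from the original question. The names fooGuarded, blocksFor and runFooGuarded, the int output buffer, and the "store the index" body are assumptions made for illustration, not the poster's actual foo( … ). The kernel reads blockDim.x, which equals BLOCKSIZE at launch time, and error checking is omitted to keep it short.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each in-range thread records its own index.
__global__ void fooGuarded(int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;  // extra threads in the last block do nothing
    out[idx] = idx;        // placeholder for the real loop body
}

// Number of blocks needed to cover n items, rounded up.
inline int blocksFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical host helper: launches the kernel and copies the result back.
std::vector<int> runFooGuarded(int n, int blockSize)
{
    std::vector<int> host(n, -1);
    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemset(dev, 0xFF, n * sizeof(int));  // every int starts as -1
    fooGuarded<<<blocksFor(n, blockSize), blockSize>>>(dev, n);
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

For N = 498 and a block size of 16, blocksFor gives (498 + 15) / 16 = 32 blocks, so 512 threads are launched and the last 14 return immediately at the guard.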

Thanks Jamie,

I also thought the same :)

I was just digressing, pursuing a tangent on floating point arithmetic – nothing to do with blockId at all…

Amusing musings, if you would agree.