void foo( … )
{
for( int i=0; i<498; ++i )
{
…
}
}
I want to execute the above function on the GPU using the threads concept.
I'm stuck at the launch configuration. See the GPU code below:
GPU function
__global__ void foo( … )
{
int currThread = blockIdx.x * blockDim.x + threadIdx.x; // instead of for loop
{
…
}
}
I tried the configuration below, but I am unable to execute the for-loop body the complete 498 times:
dim3 numBlocks( 498/16, 1, 1 );
dim3 numThreads( 16, 1, 1 );
foo<<< numBlocks, numThreads >>>( … );
The above config executes the body only 496 times.
How can I make foo( ) execute the body the complete 498 times?
Are there any CUDA guidelines for this type of problem?
Since a fractional (floating-point) number of blocks is not possible, your code does not execute to your expectation: 498/16 is integer division, which truncates 31.125 to 31, so you launch only 31 × 16 = 496 threads.
A floating-point number of blocks is not possible because:
There are infinitely many real numbers between any two floating-point numbers.
This already manifests itself as the familiar precision problem: even double precision cannot represent them all.
Any computer that does floating-point arithmetic carries some error; a computer with none would still be under construction.
So even if NVIDIA manufactured such a card, another user would complain that it does not work for 7/3 = 2.3333…
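The practical fix is the usual CUDA pattern: round the block count *up* with integer ceiling division, and guard the kernel body so the surplus threads in the last block do nothing. A sketch of what that could look like (the parameter `n` and the host wrapper `launchFoo` are my own names, not from the original post):

```cuda
__global__ void foo( int n /* , … other args … */ )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n )                 // guard: threads with i in 498..511 fall through
    {
        // … original loop body, using i in place of the loop counter …
    }
}

void launchFoo( void )
{
    const int n = 498;
    dim3 numThreads( 16, 1, 1 );
    // Ceiling division: (498 + 16 - 1) / 16 = 32 blocks, and 32 * 16 = 512 >= 498.
    dim3 numBlocks( (n + numThreads.x - 1) / numThreads.x, 1, 1 );
    foo<<< numBlocks, numThreads >>>( n );
}
```

With 32 blocks of 16 threads, indices 0–497 each execute the body exactly once, and indices 498–511 are filtered out by the guard.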
Thinking of floating point, I think computer designers could just use a numerator and a denominator to represent all kinds of fractional values,
falling back to the normal representation when the computation gets tough… This way some accuracy could be preserved.
For example, (2/3) * (2/3) could easily be stored as (4/9) in the computer… which is more accurate than squaring a truncated 0.6666….
I think this must have been considered and ditched by the elites long back… Just a few musings from a lazy soul.