void foo( unsigned char* src, int width, int height )
{
    int id = 0;
    for( int i = 0; i < width; i++ )
    {
        for( int j = 0; j < height; j++ )
        {
            if( src[id] == 0 )
                id++;
            else
                id += width;
        }
    }
}
Please look at the above code carefully.
The variable “int id” is updated depending on the value of “src[id]”, and the updated “id” has to carry through both for loops.
Can I write something like this…
__device__ int ID = 0;

__global__ void foo( unsigned char* src, int width, int height )
{
    int id = ID;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    //for( int i = 0; i < width; i++ )
    if( i < width )
    {
        for( int j = 0; j < height; j++ )
        {
            if( src[id] == 0 )
                id++;
            else
                id += width;
            ID = id;
        }
    }
}
Is the above CUDA version of the program correct?
Suppose that in the CPU code, when i = 34 and j = 67, id = 89.
Is it guaranteed that id will have the same value when i = 34 on the GPU?
Because thread execution is not sequential.
The above program is not correct. Each iteration depends on a new value of id because of the reference to src[id]. I don’t see a way to parallelize the algorithm, because the values of i and j have to be processed in order.
What are you trying to do? To take advantage of CUDA (or any parallel method) you are going to have to change your algorithm.
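One way to see the dependence (a minimal sketch; the helper next_id is hypothetical and not from the code above):

/* Hypothetical helper that isolates the body of the inner loop,
   written only to make the loop-carried dependence explicit. */
static int next_id( const unsigned char* src, int id, int width )
{
    /* the next id depends on the current id through src[id] */
    return ( src[id] == 0 ) ? id + 1 : id + width;
}

/* The CPU code is width*height sequential applications of next_id(),
   each one consuming the id produced by the one before it, so
   splitting the i loop across threads does not reproduce the result. */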
Your CUDA program above is not correct, but I think part of your algorithm can be parallelized.
I mean that not all of your algorithm can be parallelized; you can split it into 2 steps.
The first step is to find all the “!= 0” indices of each column and store them in a 2D array (parallel algorithm).
The second step is to calculate the “id” variable from that array (serial algorithm). A rough sketch of this split is shown below.
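Something like the following, as a minimal sketch (the kernel name markZeros, the flags buffer, and the use of a flat per-element flag instead of per-column index lists are my own choices, not from the post above; like the original loop, it assumes id never runs past the end of src):

#include <cuda_runtime.h>

// Step 1 (parallel): precompute, for every element of src, whether it is zero.
__global__ void markZeros( const unsigned char* src, int* flags, int n )
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if( k < n )
        flags[k] = ( src[k] == 0 );
}

// Step 2 (serial, on the host after copying flags back with cudaMemcpy):
// replay the original id walk, reading the precomputed flags instead of src.
int computeId( const int* flags, int width, int height )
{
    int id = 0;
    for( int i = 0; i < width; i++ )
        for( int j = 0; j < height; j++ )
            id += flags[id] ? 1 : width;
    return id;
}

Note that step 2 does the same number of iterations as the original loop; it only replaces the read of src with a read of flags.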
I don’t think it will be faster than your original serial algorithm, though.
:)