Not able to use __syncthreads() inside a loop in emulation mode, but it works fine without emulation!

Hi,

I am writing a kernel in which, at one point, some of the threads in my block process elements from a shared memory array and write the results into another shared memory array. A single thread then sums all the elements of that second array, and this is repeated for a few iterations.

Hence, I need a __syncthreads() after the conditional so that the thread with tid == 0 can read the elements written by the other threads without any conflicts. My code looks like this:

[codebox]for (loop)
{
    if (condition)
    {
        .........
    }
    __syncthreads();

    if (tid == 0)
    {
        .........
    }
}
__syncthreads();
[/codebox]

But for some reason, when I run the code in emulation mode, the program aborts with the message “Incorrect use of __syncthreads()”. No conditionals enclose the given code, so all the threads should definitely reach the __syncthreads() call. As I understand it, the only restriction on __syncthreads() is that it must not be used inside a conditional, at a point that some threads might not reach. So I cannot understand why this happens…

Running the code without -deviceemu gives no problems. I've also checked the output, and it is correct and as expected. But I wanted to clear this up to make sure I am not overlooking something that will bite me later on.

Appreciate any help…

thanks,

Avinash

P.S. I simplified the code to make it easier to look at. I can post the actual code if required.

Hi,

I am posting my kernel code for the problem I described above. It is supposed to take a matrix and normalise each row of the matrix as a vector. If any of you could take a look at it or try giving it a run through on your computer in emulation mode, I would really appreciate that.

[codebox]__global__ void normalise(float *d_array, int mz_range, int rows, int blocksize, int n_itrns)
{
    __shared__ float sh_array[2960];
    __shared__ float sq_array[370];
    int tid = threadIdx.x;
    int blockid = blockIdx.x;
    __shared__ int i, j, sq_arr_size;
    __shared__ float mag;

    mag = 0;
    sq_arr_size = 370;

    /* Copy this block's row from global memory into shared memory */
    for (i = 0; i < n_itrns; i++)
    {
        if ((blocksize*i + tid) < mz_range)
        {
            sh_array[blocksize*i + tid] = d_array[blockid*mz_range + blocksize*i + tid];
        }
    }
    __syncthreads();

    /* Sum of squares of the row, 370 elements at a time (8 * 370 = 2960) */
    for (i = 0; i < 8; i++)
    {
        if (tid < sq_arr_size)
        {
            sq_array[tid] = sh_array[sq_arr_size*i + tid] * sh_array[sq_arr_size*i + tid];
        }
        __syncthreads(); /** This __syncthreads is the one which causes the problem **/

        if (tid == 0)
        {
            for (j = 0; j < sq_arr_size; j++)
            {
                mag = mag + sq_array[j];
            }
        }
        __syncthreads(); /** And this one too **/
    }
    __syncthreads();

    /* Scale every element of the row by 1/sqrt(mag) */
    for (i = 0; i < n_itrns; i++)
    {
        if ((blocksize*i + tid) < mz_range)
        {
            sh_array[blocksize*i + tid] = sh_array[blocksize*i + tid] * rsqrtf(mag);
        }
    }
    __syncthreads();

    /* Write the normalised row back to global memory */
    for (i = 0; i < n_itrns; i++)
    {
        if ((blocksize*i + tid) < mz_range)
        {
            d_array[blockid*mz_range + blocksize*i + tid] = sh_array[blocksize*i + tid];
        }
    }
}
[/codebox]
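
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the loop counters `i` and `j` are declared `__shared__`, so every thread in the block increments the same variable. If the emulator runs threads one after another between barriers, each thread's `i++` would be visible to the next one, the counter would reach the loop bound after a few threads, and the remaining threads would leave the loop and arrive at a different `__syncthreads()` than the threads still inside it. The toy model below is plain host C++, not CUDA, and the function name `threads_staying_in_loop` is hypothetical. It only counts how many threads would pass the loop test at the top of the second iteration under that assumed scheduling.

```cpp
#include <cassert>

// Toy model (an assumption about emulator scheduling, not CUDA code):
// after the first loop iteration, threads take turns executing the loop
// increment and the loop test. Returns how many threads stay in the loop.
// shared_counter == true models `__shared__ int i`;
// shared_counter == false models an ordinary per-thread `int i`.
int threads_staying_in_loop(int n_threads, int trip_count, bool shared_counter)
{
    assert(n_threads >= 0 && trip_count >= 0);

    int shared_i = 0;   // counter value during the first iteration, seen by all threads
    int staying = 0;

    for (int t = 0; t < n_threads; ++t) {
        int private_i = 0;                               // this thread's own copy
        int &i = shared_counter ? shared_i : private_i;  // pick which counter is used
        ++i;                                             // the loop's i++
        if (i < trip_count) {                            // the loop's i < 8 test
            ++staying;
        }
    }
    return staying;
}
```

Under this model, `threads_staying_in_loop(512, 8, true)` gives 7 while `threads_staying_in_loop(512, 8, false)` gives 512. Declaring the counters as ordinary per-thread `int` variables would remove the sharing, though whether that also silences the emulator has not been verified here.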

The following is the main function that calls the kernel:

[codebox]#include <stdio.h>
#include <stdlib.h>

int main()
{
    int mz_range = 2960, rows = 10, n_itrns;
    int blocksize = 512, i, j;
    FILE *fp;
    float *array, *d_array;

    array = (float*) malloc(mz_range * rows * sizeof(float));
    cudaMalloc((void**) &d_array, mz_range * rows * sizeof(float));

    if ((fp = fopen("test_out.txt", "w")) == NULL)
    {
        printf("Error... can't open outfile");
        return 1;
    }

    /* Fill every row with 0, 1, ..., mz_range-1 */
    for (i = 0; i < rows; i++)
    {
        for (j = 0; j < mz_range; j++)
        {
            array[i*mz_range + j] = j;
        }
    }

    /* Number of blocksize-wide passes needed to cover one row, rounded up */
    n_itrns = (mz_range / blocksize) + ((mz_range % blocksize) == 0 ? 0 : 1);

    cudaMemcpy(d_array, array, mz_range * rows * sizeof(float), cudaMemcpyHostToDevice);

    normalise<<<rows, blocksize>>>(d_array, mz_range, rows, blocksize, n_itrns);

    /* Write the first row before normalisation */
    for (i = 0; i < mz_range; i++)
    {
        fprintf(fp, "%f\t", array[i]);
    }
    fprintf(fp, "\n\n");

    cudaMemcpy(array, d_array, mz_range * rows * sizeof(float), cudaMemcpyDeviceToHost);

    /* Write the first row after normalisation */
    for (i = 0; i < mz_range; i++)
    {
        fprintf(fp, "%f\t", array[i]);
    }

    fclose(fp);
    cudaFree(d_array);
    free(array);

    printf("Program done...");
    return 0;
}[/codebox]
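
For checking the kernel's output independently of emulation mode, a host-side reference can help. The sketch below is an assumption about what "normalise each row as a vector" means here (scaling each row to unit Euclidean length); `normalise_rows_ref` and `max_abs_diff` are hypothetical helper names, not part of the program above. Unlike the kernel, this reference leaves an all-zero row unchanged instead of multiplying by `rsqrtf(0)`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical host-side reference: scales each row of a row-major
// rows x cols matrix so that its Euclidean length becomes 1.
void normalise_rows_ref(float *m, int rows, int cols)
{
    assert(m != nullptr && rows >= 0 && cols >= 0);

    for (int r = 0; r < rows; ++r) {
        float *row = m + static_cast<std::size_t>(r) * cols;

        // Accumulate the sum of squares in double to limit rounding error.
        double sum_sq = 0.0;
        for (int c = 0; c < cols; ++c) {
            sum_sq += static_cast<double>(row[c]) * row[c];
        }
        if (sum_sq == 0.0) {
            continue;   // all-zero row: nothing to scale
        }

        const float inv_mag = static_cast<float>(1.0 / std::sqrt(sum_sq));
        for (int c = 0; c < cols; ++c) {
            row[c] *= inv_mag;
        }
    }
}

// Hypothetical helper: largest absolute element-wise difference between
// two arrays of length n, e.g. kernel output versus the reference.
float max_abs_diff(const float *a, const float *b, std::size_t n)
{
    float worst = 0.0f;
    for (std::size_t k = 0; k < n; ++k) {
        const float d = std::fabs(a[k] - b[k]);
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```

One way to use it: keep a second host copy of the input, run `normalise_rows_ref` on that copy, and compare it with the array copied back from the device using `max_abs_diff`. A small tolerance is to be expected, since `rsqrtf` and the summation order on the device differ from the host computation.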

Thanks…