questions about #threads

#pragma acc parallel loop gang worker vector_length(32) pcopyin(A[0:SizeA],B[0:SizeB]) pcopyout(C[0:SizeC])
    for (int i=0; i<iteration; i++){
      f1(A,B,C);
    }

As shown in the code above, I want to parallelize a simple for loop containing a function call. After parallelization, each thread is expected to execute the function f1 with inputs A and B and write its result into a different position in C.

  1. When compiling this program, the compiler reports:
    #pragma acc loop gang, vector(32), worker(4) /* blockIdx.x threadIdx.x threadIdx.y */

So I guess the kernel launches only one block, and each block is two-dimensional, i.e. 4x32. All 128 of these threads would then execute the function f1. Is that correct?

  2. Is there any way to see the generated CUDA code, to better understand it and to help with debugging? I tried -ta=tesla:nollvm, but it failed.

  3. The erratic thing here is that when I increase the number of iterations to about 200, the program generates the wrong answer. This confuses me, because arrays A and B are read-only global arrays and each thread writes its data into a different position in array C. Simply increasing the number of iterations shouldn't lead to different results. Any thoughts on how this might happen, or a reasonable way to debug it?

Actually, I found that when the number of threads is <= 32 the answers are correct, and at 33 the answer is wrong. So I guess it is due to some synchronization problem between warps. However, each thread writes its results into a different position in the global buffer C, which I thought shouldn't cause a synchronization problem.


Thanks.

So I guess the kernel launches only one block, and each block is two-dimensional, i.e. 4x32. All 128 of these threads would then execute the function f1. Is that correct?

Correct, though it would be 32x4, since vector corresponds to threadIdx.x and worker to threadIdx.y.
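
For reference, here is a minimal sketch (not from the original post; the loop bound n is a placeholder) of how the clauses line up with the CUDA launch configuration reported by the compiler:

#pragma acc parallel loop gang worker num_workers(4) vector_length(32)
    for (int i = 0; i < n; i++){
      /* gang   -> blockIdx.x  : one gang maps to one thread block          */
      /* worker -> threadIdx.y : 4 workers give blockDim.y = 4              */
      /* vector -> threadIdx.x : 32 lanes  give blockDim.x = 32             */
      /* so each block is effectively dim3(32, 4), i.e. 32x4 = 128 threads  */
      f1(A,B,C);
    }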

Is there any way to see the generated CUDA code, to better understand it and to help with debugging? I tried -ta=tesla:nollvm, but it failed.

You can use the "keep" sub-option to keep the intermediate files (-ta=tesla:nollvm,keep). The ".gpu" file will contain the generated CUDA C code. Note that it has been highly optimized, so it can be difficult to read. Compiling without optimization, "-O0", helps a bit.
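
For example, assuming the source file is named knn.c (the file name and the -Minfo=accel flag are just illustrative), the invocation could look something like:

pgcc -acc -O0 -ta=tesla:nollvm,keep -Minfo=accel knn.c -o knn

The ".gpu" file should then be left alongside the other intermediate files.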

The erratic thing here is that when I increase the number of iterations to about 200, the program generates the wrong answer.

It does sound like some type of warp synchronization issue, though without a reproducing example I can't really tell what is wrong. It could be a problem with your code or it could be a code-generation issue. Can you post an example, or send one to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me?

Thanks,
Mat

Is it possible to use printf in the parallel region to print out some info?

Also, for a private array declaration, in addition to private(A[0:size]), do we need to use copy to first copy the data into global memory?

Is it possible to use printf in the parallel region to print out some info?

No, sorry.

Also, for a private array declaration, in addition to private(A[0:size]), do we need to use copy to first copy the data into global memory?

Variables in a "private" clause are not initialized. Instead, use "firstprivate", which does initialize the data, or write code within the compute region to initialize the data.

The data copy clauses are for use with globally shared memory and can’t be used to update private variables.
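
As a quick illustration (a minimal sketch, not from the original code; the array name A, its size N, and the loop bound M are placeholders), the two options look like this:

/* "private": each copy of A is uninitialized on entry */
#pragma acc parallel loop private(A[0:N])
    for (int i = 0; i < M; i++){
      /* A must be initialized here, inside the compute region, before use */
    }

/* "firstprivate": each copy of A starts with the host values */
#pragma acc parallel loop firstprivate(A[0:N])
    for (int i = 0; i < M; i++){
      /* A already holds the data that was on the host */
    }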

- Mat

Thanks, Mat

Here is the critical section of the code:

#pragma acc routine seq
void knn (node * in_data, int num_record, int num_feature, float * new_point, int k, node * rst, int id){

  /* Exhaustive Search */
  for (int i=0; i < num_record; i++){
    in_data[i].dis = Dist(in_data[i].record,new_point,num_feature);
  }

  /* Sort records by distance (simple bubble sort) */
  node tmp;
  for (int i = 0; i < num_record-1; i++){
    for (int j = 0; j < num_record-i-1; j++){
      if (in_data[j].dis > in_data[j+1].dis){
        tmp = in_data[j];
        in_data[j] = in_data[j+1];
        in_data[j+1] = tmp;
      }
    }
  }

  /* K nearest result */
  for (int i=0;i<k;i++)
    rst[id*k+i] = in_data[i];


}


#pragma acc parallel loop gang worker vector_length(32) pcopyin(in_data[0:num_record],new_point[0:num_feature]) pcopyout(rst[0:threads*k])
    for (int i=0; i<threads; i++){

      knn(in_data,num_record,num_feature,new_point,k,rst,i);

    }

Basically, the code parallelizes multiple calls to the knn method. The knn method itself is functionally correct: it gives the correct answer in the sequential version and also as a kernel in OpenCL code. The global memory chunk "in_data" is modified inside the method, but that might not necessarily affect the result.

Variables in a "private" clause are not initialized. Instead, use "firstprivate", which does initialize the data, or write code within the compute region to initialize the data.

The data copy clauses are for use with globally shared memory and can’t be used to update private variables.

Thanks, Mat

So I guess I could first copy the data into global memory and then use it to initialize the private data in the compute region.
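
If it helps, here is a minimal sketch of that idea (the names global_buf and priv_buf and the size N are placeholders, not from the original code): the shared data is copied in with pcopyin, and each loop iteration fills its private copy inside the compute region:

#pragma acc parallel loop private(priv_buf[0:N]) pcopyin(global_buf[0:N])
    for (int i = 0; i < threads; i++){
      /* initialize the private copy from the shared, copied-in data */
      for (int j = 0; j < N; j++)
        priv_buf[j] = global_buf[j];

      /* ... work on priv_buf without touching other iterations' copies ... */
    }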