questions about #threads

#pragma acc parallel loop gang worker vector_length(32) pcopyin(A[0:SizeA],B[0:SizeB]) pcopyout(C[0:SizeC])
    for (int i=0; i<iteration; i++){
      f1(A,B,C);
    }

As shown in the code above, I want to parallelize a simple for loop containing a function call. After parallelization, each thread is expected to execute the function f1 with inputs A and B and write its result into a different position in C.

  1. When compiling this program, the compiler reports:
    #pragma acc loop gang, vector(32), worker(4) /* blockIdx.x threadIdx.x threadIdx.y */

So I guess the kernel launches only one block, and each block is two-dimensional, i.e. 4x32. All 128 of these threads would then execute the function f1. Is that correct?

  2. Is there any way to see the generated CUDA code, to better understand it and to help with debugging? I tried -ta=tesla:nollvm, but it failed.

  3. The erratic thing here is that when I increase the number of iterations to about 200, the program generates the wrong answer. This confuses me, because arrays A and B are read-only global arrays and each thread writes its data into a different position in array C. Simply increasing the number of iterations shouldn't lead to different results. Any thoughts on how this might happen, or a reasonable way to debug it?

Actually, I found that when the number of threads is <= 32 the answers are correct, and at 33 the answer is wrong. So I guess it is due to some synchronization problem between warps. However, each thread writes its results into a different position in the global buffer C, which I thought shouldn't cause a synchronization problem.


Thanks.

So I guess the kernel launches only one block, and each block is two-dimensional, i.e. 4x32. All 128 of these threads would then execute the function f1. Is that correct?

Correct, though it would be 32x4, since vector corresponds to threadIdx.x and worker to threadIdx.y.
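
For reference, here is a minimal sketch (not from the original post; the loop bound n is a placeholder) of how the clauses line up with the CUDA launch configuration reported by the compiler:

#pragma acc parallel loop gang worker num_workers(4) vector_length(32)
    for (int i = 0; i < n; i++){
      /* gang   -> blockIdx.x  : one gang maps to one thread block          */
      /* worker -> threadIdx.y : 4 workers give blockDim.y = 4              */
      /* vector -> threadIdx.x : 32 lanes  give blockDim.x = 32             */
      /* so each block is effectively dim3(32, 4), i.e. 32x4 = 128 threads  */
      f1(A,B,C);
    }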

Is there any way to see the generated CUDA code, to better understand it and to help with debugging? I tried -ta=tesla:nollvm, but it failed.

You can use the "keep" sub-option to keep the intermediate files (-ta=tesla:nollvm,keep). The ".gpu" file will contain the generated CUDA C code. Note that it has been highly optimized, so it can be difficult to read. Compiling without optimization, "-O0", helps a bit.
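
For example, assuming the source file is named knn.c (the file name and the -Minfo=accel flag are just illustrative), the invocation could look something like:

pgcc -acc -O0 -ta=tesla:nollvm,keep -Minfo=accel knn.c -o knn

The ".gpu" file should then be left alongside the other intermediate files.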

The erratic thing here is that when I increase the number of iterations to about 200, the program generates the wrong answer.

It does sound like some type of warp synchronization issue, though without a reproducing example I can't really tell what is wrong. It could be a problem with your code or it could be a code-generation issue. Can you post an example, or send one to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me?

Thanks,
Mat

Is it possible to use printf in the parallel region to print out some info?

Also, for a private array declaration, in addition to private(A[0:size]), do we need to use copy to first copy the data into global memory?

Is it possible to use printf in the parallel region to print out some info?

No, sorry.

Also, for a private array declaration, in addition to private(A[0:size]), do we need to use copy to first copy the data into global memory?

Variables in a "private" clause are not initialized. Instead, use "firstprivate", which does initialize the data, or write code within the compute region to initialize the data.

The data copy clauses are for use with globally shared memory and can’t be used to update private variables.
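
As a quick illustration (a minimal sketch, not from the original code; the array name A, its size N, and the loop bound M are placeholders), the two options look like this:

/* "private": each copy of A is uninitialized on entry */
#pragma acc parallel loop private(A[0:N])
    for (int i = 0; i < M; i++){
      /* A must be initialized here, inside the compute region, before use */
    }

/* "firstprivate": each copy of A starts with the host values */
#pragma acc parallel loop firstprivate(A[0:N])
    for (int i = 0; i < M; i++){
      /* A already holds the data that was on the host */
    }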

- Mat

Thanks, Mat

Here is the critical section of the code:

#pragma acc routine seq
void knn (node * in_data, int num_record, int num_feature, float * new_point, int k, node * rst, int id){

  /* Exhaustive Search */
  for (int i=0; i < num_record; i++){
    in_data[i].dis = Dist(in_data[i].record,new_point,num_feature);
  }

  /* Sort records by distance (simple bubble sort) */
  node tmp;
  for (int i = 0; i < num_record-1; i++){
    for (int j = 0; j < num_record-i-1; j++){
      if (in_data[j].dis > in_data[j+1].dis){
        tmp = in_data[j];
        in_data[j] = in_data[j+1];
        in_data[j+1] = tmp;
      }
    }
  }

  /* K nearest result */
  for (int i=0;i<k;i++)
    rst[id*k+i] = in_data[i];


}


#pragma acc parallel loop gang worker vector_length(32) pcopyin(in_data[0:num_record],new_point[0:num_feature]) pcopyout(rst[0:threads*k])
    for (int i=0; i<threads; i++){

      knn(in_data,num_record,num_feature,new_point,k,rst,i);

    }

Basically, the code parallelizes multiple calls to the knn method. The knn method itself is functionally correct: it gives the correct answer in the sequential version and also as a kernel in OpenCL code. The global memory chunk "in_data" is modified inside the method, but that might not necessarily affect the result.

Variables in a "private" clause are not initialized. Instead, use "firstprivate", which does initialize the data, or write code within the compute region to initialize the data.

The data copy clauses are for use with globally shared memory and can’t be used to update private variables.

Thanks, Mat

So I guess I could first copy the data into global memory and then use it to initialize the private data in the compute region.
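
If it helps, here is a minimal sketch of that idea (the names global_buf and priv_buf and the size N are placeholders, not from the original code): the shared data is copied in with pcopyin, and each loop iteration fills its private copy inside the compute region:

#pragma acc parallel loop private(priv_buf[0:N]) pcopyin(global_buf[0:N])
    for (int i = 0; i < threads; i++){
      /* initialize the private copy from the shared, copied-in data */
      for (int j = 0; j < N; j++)
        priv_buf[j] = global_buf[j];

      /* ... work on priv_buf without touching other iterations' copies ... */
    }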