OMP pragmas in a Thrust functor

Hello everybody,

I have a Thrust functor that makes use of the atomicAdd function. When I switch from the CUDA backend of Thrust to the OMP backend, this can't be used, because atomicAdd is a device function. I managed to implement a serial version with the following snippet:

Real operator()(const thrust::tuple<Real &, Real, Real> &t)
{
#ifdef __CUDA_ARCH__
    // CUDA path: atomically add get<1>*get<2> into the element referenced by get<0>
    const Real &t0 = thrust::get<0>(t);
    Real summand = thrust::get<1>(t) * thrust::get<2>(t);
    Real *p_t0 = const_cast<Real *>(&t0);
    return atomicAdd(p_t0, summand);
#else
    // host path: intended to be the equivalent atomic update for the OMP backend
    #pragma omp atomic
    thrust::get<0>(t) += thrust::get<1>(t) * thrust::get<2>(t);
    return thrust::get<0>(t);
#endif
}

As mentioned, this works when I set the thread number to one, but when I use multiple threads it doesn't. I am not sure whether the pragma is even honored inside the functor. Should it be?

Any hints or tips on how an atomic add can be implemented for the OMP backend?

best regards
Franz

Anyone?

Up

That is indeed a very good question. Did you manage to find a solution using #pragma omp atomic?

Also, it would be interesting to know what compiler you are using, and what version of OpenMP it supports.

In the future, if you want help, my suggestion is to provide a complete example, not a snippet. I tend to ignore questions that appear incomplete or confusing, and others may also. I think you’re more likely to get help if you provide a complete example.

It’s not clear to me based on the snippet you’ve provided why atomics would even be necessary or make sense given that functor. Isn’t each operation (thread, or element of an input vector) working on a separate tuple? Why would you need to atomically update the zeroth element of the tuple in that case? Or, if multiple threads are working on the same tuple, could you explain why that would make sense, and how you would construct a thrust algorithm to do that?

So I came up with my own (contrived) example. The OMP atomic seems to work for me. Here’s a worked example with the #pragma omp atomic commented out:

$ cat t702.cu
#include <thrust/device_vector.h>
#include <thrust/for_each.h>

#define DSIZE 64000000

template <typename T>
struct sum_functor
{
  T *result;
  sum_functor(T *_res) : result(_res) {};
  __host__ __device__
  void operator()(const T &d1)
    {
#ifdef __CUDA_ARCH__
      atomicAdd(result, d1);
#else
   // #pragma omp atomic
      *result += d1;
#endif
    }
};

int main(){

  thrust::device_vector<int> result(1);
  thrust::device_vector<int> data(DSIZE, 1);
  thrust::for_each(data.begin(), data.end(), sum_functor<int>(thrust::raw_pointer_cast(result.data())));
  std::cout << "Result:  "  << result[0] << std::endl;
}
$ nvcc -O3 -o t702 t702.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -lgomp
$ OMP_NUM_THREADS=4 ./t702
Result:  16000036
$ OMP_NUM_THREADS=1 ./t702
Result:  64000000
$

My machine happens to have a quad-core CPU. When I run the OMP version of the app with 4 OMP threads, I get an incorrect result. If I switch to a single thread, I get the correct result.

If I then repeat the above process with the #pragma omp atomic line uncommented, I get a correct result regardless of how many OMP threads I specify.

So #pragma omp atomic seems to work for me.

Having said all that, thrust reductions may be a better choice for this kind of operation.
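
For the contrived sum above, a plain reduction would avoid atomics entirely. A minimal sketch along those lines (assuming the same DSIZE and int data as the example above) might look like this:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <iostream>

#define DSIZE 64000000

int main(){

  thrust::device_vector<int> data(DSIZE, 1);
  // sum all elements with a single reduction -- no atomics, and the result is the
  // same on the CUDA and OMP backends regardless of the thread count
  int result = thrust::reduce(data.begin(), data.end(), 0);
  std::cout << "Result:  " << result << std::endl;
}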

Hi there,

Thanks for the reply.
It was not my intention at all to confuse anybody, sorry for that. The reason I need an atomic add in this situation is that the 0th element of the tuple is a permutation of a vector I want to write to. This permutation is realized via Thrust's make_permutation_iterator functionality. The permutation indices have repeating values, and that is why I need atomics. I was fairly sure that I needed atomic operations, so I had no doubt and no reason to discuss the code surrounding my atomic add.
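
To make the write pattern concrete, here is a tiny sketch of what I mean (not my actual code, just an illustration with integer data and made-up names):

#include <thrust/device_vector.h>
#include <thrust/iterator/permutation_iterator.h>
#include <iostream>

int main(){

  thrust::device_vector<int> out(2, 0);   // the vector being written to
  thrust::device_vector<int> map(4);      // permutation indices with repeats: 0, 1, 0, 1
  map[0] = 0; map[1] = 1; map[2] = 0; map[3] = 1;
  typedef thrust::device_vector<int>::iterator Iter;
  thrust::permutation_iterator<Iter, Iter> perm =
      thrust::make_permutation_iterator(out.begin(), map.begin());
  // perm[i] refers to out[map[i]], so perm[0] and perm[2] alias the same element;
  // concurrent "+=" through this iterator is a data race unless the update is atomic
  perm[0] += 1;
  perm[2] += 1;                           // both updates land in out[0]
  std::cout << "out[0]: " << out[0] << std::endl;   // prints 2
}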

In your example you are not writing to one of the input tuples, right? Maybe that is the reason it works in your example and not in mine. Anyway, the fact that you managed to use the pragma has already helped me a lot. Now I have hope of finding a solution again.

best regards

Franz

I’m still not sure what the problem is. Here’s a modified example that uses a zip_iterator and a permutation_iterator the way you describe (I think) and it still seems to have the same behavior: without the pragma, I get invalid results, and with the pragma, I get correct results. If you provide a short, complete example (just like I have; it shouldn’t be that difficult), rather than a snippet, it might be possible to figure out what is happening in your case.

$ cat t702.cu
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>

#define DSIZE 64000000

struct sum_functor
{
  template <typename T>
  __host__ __device__
  void operator()(T t)
    {
#ifdef WITH_PRAGMA
     #pragma omp atomic
#endif
      thrust::get<0>(t) += thrust::get<1>(t)*thrust::get<2>(t);
    }
};

int main(){

  thrust::device_vector<int> result(1);
  thrust::device_vector<int> data1(DSIZE, 1);
  thrust::device_vector<int> data2(DSIZE, 1);
  thrust::device_vector<int> map(DSIZE);
  thrust::for_each_n(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(result.begin(), map.begin()), data1.begin(), data2.begin())), DSIZE, sum_functor());
  std::cout << "Result:  "  << result[0] << std::endl;
}
$  nvcc -O3 -o t702 t702.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -lgomp
$ ./t702
Result:  24107216
$  nvcc -O3 -DWITH_PRAGMA -o t702 t702.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -lgomp
$ ./t702
Result:  64000000
$
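
If the atomics ever become a bottleneck, the reduction route mentioned earlier could also be applied to the repeated-index case: sort the values by their output index, then use reduce_by_key to sum each group. A rough sketch of that idea (with toy data, not tied to your actual vectors):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <iostream>

int main(){

  // values to accumulate, and the (repeating) output indices they map to
  thrust::device_vector<int> vals(6, 1);
  thrust::device_vector<int> keys(6);
  keys[0] = 0; keys[1] = 2; keys[2] = 0; keys[3] = 1; keys[4] = 2; keys[5] = 0;

  // group equal indices together, then sum each group -- no atomics required
  thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
  thrust::device_vector<int> out_keys(3), out_sums(3);   // 3 distinct indices in this toy data
  thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                        out_keys.begin(), out_sums.begin());

  for (int i = 0; i < 3; i++)
    std::cout << "index " << out_keys[i] << " sum " << out_sums[i] << std::endl;
}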