pthreads vs. OpenMP?

I need to develop a CUDA application that should be easy to run on
the following computers:

  1. a workstation with a 6-core Intel processor
  2. a server with two six-core Intel processors
  3. a server with four 12-core AMD processors

Each computer will have multiple Fermi GPUs connected to it.

I want to be able to run multiple threads on the cores of each of the above computers,
and use the many GPUs.
The CPU threads will need to communicate with each other to some extent.

Shall I use pthreads or OpenMP?

Right now I would strongly suggest pthreads or boost threads over OpenMP. Establishing and maintaining thread-context affinity in CUDA with OpenMP is notoriously difficult to get right. There are rumours that things will be changing in a future CUDA release, but today I wouldn’t use OpenMP.

Where can I find an example of using boost threads with GPUs and CUDA?

Which one is better then: pthreads or boost threads?

What do you mean by thread-context affinity, and where do I read how to do it for CUDA and pthreads?

GPUworker, formerly part of the HOOMD project, is often cited, and still available here.

They both implement the POSIX threads API. pthreads is a Unix-style C API. Boost threads is a portable C++ template library which was slated to become part of the C++ standard library.

I mean that the CUDA model relies on having one host thread tied to a given GPU context. GPU context establishment is expensive, so it is normal to have each thread establish a context at the beginning of an application and hold it for the life of the thread. Many OpenMP runtimes use persistent pools of operating system threads, from which OpenMP draws threads to carry out parallel operations. The problem is that a given logical OpenMP thread ID (what omp_get_thread_num returns) may not always be associated with the same operating system thread. Trying to manage GPU contexts inside a pool of operating system threads over which there is no explicit programmer control can be painful. The CUDA driver API has a context migration mechanism that lets contexts be moved from thread to thread, but it entails a lot of extra administrative overhead, both inside the driver at runtime and in your code.
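To make the "one host thread per GPU, held for the life of the thread" idea concrete, here is a minimal sketch with pthreads. It assumes the CUDA runtime call cudaSetDevice is what binds the calling thread to a device. The names worker_t, bind_device, worker_main, run_gpu_workers and MAX_WORKERS are hypothetical and only there for illustration. When built without nvcc the binding step is a stand-in, so only the threading structure is exercised.

```c
/* Sketch: one persistent host thread per GPU, each selecting its device once.
   Build as a .cu file with nvcc for real device binding, or with a plain
   C compiler and -pthread to exercise only the threading structure. */
#include <pthread.h>

#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

#define MAX_WORKERS 16  /* illustrative upper bound, not a CUDA limit */

/* Hypothetical per-thread record. */
typedef struct {
    int device_id;  /* GPU this thread owns for its whole lifetime */
    int bound;      /* set to 1 once the device has been selected */
} worker_t;

/* Select the GPU for the calling thread. Returns 0 on success. */
static int bind_device(int device_id)
{
#ifdef __CUDACC__
    return cudaSetDevice(device_id) == cudaSuccess ? 0 : -1;
#else
    (void)device_id;  /* stand-in when built without CUDA */
    return 0;
#endif
}

static void *worker_main(void *arg)
{
    worker_t *w = (worker_t *)arg;

    /* Bind once, at thread start, and keep it until the thread exits. */
    w->bound = (bind_device(w->device_id) == 0);
    if (w->bound) {
        /* ... allocate device memory, launch kernels, copy results ... */
    }
    return NULL;
}

/* Start up to n threads, one per GPU, wait for all of them,
   and return how many managed to select their device. */
int run_gpu_workers(int n)
{
    pthread_t threads[MAX_WORKERS];
    worker_t workers[MAX_WORKERS];
    int started = 0;
    int bound = 0;

    if (n > MAX_WORKERS)
        n = MAX_WORKERS;

    for (int i = 0; i < n; ++i) {
        workers[i].device_id = i;
        workers[i].bound = 0;
        if (pthread_create(&threads[i], NULL, worker_main, &workers[i]) != 0)
            break;  /* stop launching on the first failure */
        started = i + 1;
    }

    for (int i = 0; i < started; ++i) {
        pthread_join(threads[i], NULL);
        bound += workers[i].bound;
    }
    return bound;
}
```

Calling run_gpu_workers with the number of GPUs in the machine from main would start one thread per device. Because each thread is created explicitly and never handed back to a pool, the thread that selected a device is the same one that uses it until it exits, which is the property that is hard to guarantee with an OpenMP thread pool.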

There are supposed to be some quite big changes in the CUDA APIs to make this easier, but today, pthreads or boost threads is still preferable, even though it requires more code than OpenMP does to get the same thread operation done.

Okay, but how do you use pthreads with nvcc on a Windows platform? I want to use pthreads for the host part of the C code, but it doesn’t recognize #include <pthread.h> in the code or -pthread on the command line. Any suggestions?