GPUworker, formerly part of the HOOMD project, is often cited, and it is still available here.
They both give you the POSIX threading model. pthreads is a Unix-style C API; Boost threads is a portable C++ template library which was slated to become part of the C++ standard library.
I mean that the CUDA model relies on having one host thread tied to a given GPU context. GPU context establishment is expensive, so it is normal to have each thread establish a context at the beginning of an application and hold it for the life of the thread. Many OpenMP runtimes use persistent operating system thread pools from which OpenMP operations draw threads to do parallel work. The problem is that a given logical OpenMP thread ID (what omp_get_thread_num returns) may not at all times be associated with the same operating system thread. Trying to manage GPU contexts inside a pool of operating system threads over which there is no explicit programmer control can be painful. The CUDA driver API has a context migration mechanism that lets contexts be moved from thread to thread, but it entails a lot of extra administrative overhead, both inside the driver at runtime and in your code.
There are supposed to be some quite big changes in the CUDA APIs to make this easier, but today pthreads or Boost threads is still preferable, even though it requires more code than OpenMP does to get the same thread operation done.