Using OpenMP sections with CUBLAS

I get a launch error when parallelizing the following code using OpenMP and CUDA.

#pragma omp parallel sections
{
    #pragma omp section
    {
        cublasDgemm(…);
    }
    #pragma omp section
    {
        some_function_utilizing_CPU(…);
    }
}

Why is this happening and what can I do to fix this problem? Thanks!

Are you declaring the pointer and malloc’ing the memory from within the section that calls CUBLAS? If not, then this is probably the problem.

Also, it is a bit dangerous to do so, because the OpenMP runtime is unlikely to guarantee that the same pthreads are reused across different parallel sections. This would only work if everything (memory allocation, context initialization, etc.) were done in the loop body, which is certainly not what you want in general.

CUDA makes really “weird” assumptions in a multithreaded context, so you should take real care if you are going to use OpenMP + CUDA (or CUBLAS).

Cédric

I would second that advice. I really wouldn’t recommend using OpenMP for multi-GPU. There is an awful lot of implementation-specific, unstandardized stuff that has to go on between your #pragma directives and the context and data hitting the right GPU and working as designed. A threading model that gives you much more control, like pthreads (or Boost threads or similar), is probably a much safer and more reliable bet.

OK, I have a pthreads question. Let’s say I have a for loop where N is large and func1 and func2 are independent of one another.

for (i = 0; i < N; i++)
{
    func1(i);
    func2(i);
}

I can use pthreads to create two threads to operate concurrently on func1 and func2:

for (i = 0; i < N; i++)
{
    for (j = 0; j < NUM_THREADS; j++)
        pthread_create(...);
}

However, this causes a lot of overhead because I am creating threads on every one of the N iterations. So how can I call pthread_create outside of the for loop without putting all the contents of the for loop inside the thread function?

As I suggested in your other thread on this, use persistent threads based on the producer-consumer model. Condition variables and mutexes can be used to keep threads alive (holding their contexts for the lifespan of the application). The pthreads model includes both condition broadcast and unicast mechanisms, which can be used to signal idle threads to wake up, and barriers to synchronize them. Work can be passed to the worker threads via function pointers and the condition variable itself.

Any reasonable reference on pthreads should have everything you need to get started. I have a tattered old copy of a Sun pthreads programmer’s handbook which has served me well; Sun should have a PDF version of it for download in their online reference library. There was also an O’Reilly book that a lot of people like, although I haven’t read it.

I’m not sure how it will interoperate with CUBLAS, but the GPUWorker class makes multi-GPU very easy for generic CUDA:

https://codeblue.umich.edu/hoomd-blue/trac/…/libhoomd/utils

(GPUWorker.cc and GPUWorker.h)