Error with pinned memory and threads on the host


Please consider the following code, which launches threads to perform some actions in parallel on multiple GPUs:

#include <cstdlib>
#include <memory>
#include <thread>

#include <openacc.h>

int main(int argc, char* argv[]) {
	int a[10];
	int numDevices = acc_get_num_devices(acc_get_device_type());
	std::unique_ptr<std::thread[]> threads(new std::thread[numDevices]);

	for (int i = 0; i < numDevices; i++) {
		threads[i] = std::thread([i, &a] ( ) {
			acc_set_device_num(i, acc_get_device_type());
			#pragma acc data create(a[0:10])
			{ }
		});
	}

	for (int i = 0; i < numDevices; i++) {
		threads[i].join();
	}

	return EXIT_SUCCESS;
}


The code works fine with version 17.4 when compiled with

-std=c++11 -acc -ta=nvidia

However, I’d like to use pinned memory in a real application, so I’m passing

-std=c++11 -acc -ta=nvidia:cc60,pinned

Please first note that I have to specify

cc60

(or probably something similar) or else the executable will segfault. But even then the executable fails at runtime with

call to cuMemAlloc returned error 201: Invalid context


  1. Is this code valid, i.e. am I allowed to use OpenACC directives inside a threaded environment? The threads are not accessing the same device.
  2. Is there anything wrong with my compilation, or is this an error in the compiler / runtime?


Hi Jonas,

Not sure if this is a PGI issue or a CUDA issue since I can replicate the error even if I remove all the OpenACC constructs. With “pinned”, we replace the memory allocation calls with calls to “cudaMallocHost”. It seems that the segv occurs when this memory is getting deallocated.

I’ve added a problem report (TPR#24636) and sent it on to engineering for investigation.

Thanks for the report!

Hi Mat,

Thanks for your response and confirmation. cuMemAlloc should be responsible for allocation on the device, no?

But it really looks like it’s somehow related to C++ threads: everything works fine if I use pthreads, and OpenMP looks fine as well! I’ll use this as a workaround…
EDIT: No, the error just does not trigger on every execution. So no workaround for now…


cuMemAlloc should be responsible for allocation on the device, no?

For the device data, yes. However “pinned” pins host data to physical memory via a call to “cudaMallocHost”.