Multiple calls to ACC routine do not work

I have created an ACC routine like below:

void gpuOverlay1(unsigned long size,
YCBYCR (* restrict A),
BGRA (* restrict B),
YCBYCR (* restrict C))
{
float y1,y2,cb,cr,r,g,b;
float alpha;
unsigned long i,total;

#pragma acc kernels pcopyin(A[0:size],B[0:size]) pcopyout(C[0:size])

for (i=0; i<size; i++)
{
// Convert first pixel or pair
y1 = (((float)A_.y1) - 16.0)*1.164;
cb = ((float)A.cb)-128.0;
y2 = (((float)A.y2) - 16.0)*1.164;
cr = ((float)A.cr)-128.0;

r = y1 + ((float)1.793 * cr);
g = y1 - ((float)0.534 * cr) - ((float) 0.213 * cb);
b = y1 + ((float)2.115 * cb);

// First overlay

alpha = ((float)B.a1)/(float)255.0; // get alpha

b += (((float)B.b1) - b)*alpha;
g += (((float)B.g1) - g)*alpha;
r += (((float)B.r1) - r)*alpha;

// Store data

C.y1 = ((float)0.183 * r) + ((float)0.614 * g) + ((float)0.062 * b) + ((float)16.0);
C.cb = ((float) -0.101 * r) - ((float) 0.338 * g) + ((float)0.439 * b) + ((float)128.0);
}

This is compiled with: pgcc -tp=p7 -m64 -ta=tesla:nollvm,nordc -Minfo=accel -Mlargeaddressaware -c gpu_overlay gpuOverlay.c

I then link this object into my VS2010 project.

The routine works fine if I call from a single thread. But as soon as I call from a second thread (both running at same time) the program will simply exit - no crash, no exceptions, just exits.

The memory spaces of the data regions for each thread are unique - allocated separately, so no conflicts.

So, can an ACC routine be called from multiple threads at the same time or is it limited to a single instance?_

So, can an ACC routine be called from multiple threads at the same time or is it limited to a single instance?

You should be able to. We did have a known race condition in our run time when using multiple OpenMP threads, but this only occurred when using OpenACC “async” queues so shouldn’t be relevant here. Granted, some other issue could be happening.

Could you send a reproducing example to PGI Customer Support (trs@pgroup.com)? Understandable if you can’t. In this case I’ll see if I can write a driver for this routine but will ask you to give more detail on how you build your project. Do you build a DLL or static library? What are your link options that you pass to MS CL? Any information that you can provide to help recreate the issue would be welcome.

Best Regards,
Mat

Hi Mat,

Thank you for your reply.

I sent an email which contains information on how to pull sample from my FTP which does reproduce the issue.

One additional piece of information. I had linked the PGI ACC compiled object into a DLL originally and when attempting to run multiple instances of the ACC routine it would just exit.

I created a standalone VS application that links in the same object, and when it fails on the multiple threads, it exits with the message:

call to cuMemAlloc returned error 201: Invalid context

You can reply via this post or directly to the email address which I sent FTP information.

Best Regards,

Gary

As a side note, the crash does not occur if I run multiple instances as separate processes.

Hi Mat,

Have you been able to reproduce the issue with the provided code?

Best Regards,

Gary

Hi Gary,

Sorry, not yet. I had to pass it off to another engineer since I’m attending a conference.

  • Mat

Hi Mat,

This issue is still a problem for me. I have tested with 16.3 and the issue still remains, which is that if I attempt to run multiple threads which call ACC routines it will crash.

I had provided a sample a while a go, but no progress was made examining. I have sent a new sample to TRS which is a standalone version which can be compiled using:

pgcc -acc -o acctest acctest.c

If you run as “ACCTEST 1 100” it will run one thread for 100 iterations. This works. 2 threads will sometimes work. 3 or 4 fail most of the time.

If I run as separate processes, rather than threads, it will work.

If I run under pgdbg with the args as “4 100” it will report "Signalled ACCESS_VIOLATION at 0x14002248C, function _pgi_uacc_move_buffer, line 171. The call stack shows:


0 _pgi_uacc_move_buffer0x14002248c
1 _pgi_uacc_cuda_drain_down -1 0x1400193b5
2 _pgi_uacc_cuda_wait -1 0x14000b7e1
3 _pgi_uacc_computedone -1 0x140004c17
4 gpuOverlay1HD8 171 E:/tools/acctest - pgi/acctest.c 0x140001740
5 overlayTask 251 E:/tools/acctest - pgi/acctest.c 0x140001fb9
6 BaseThreadInitThunk -1 0x7ffe29a78102
7 RtlUserThreadStart -1 0x7ffe29e4c5b4
8 SortGetHandle -1 0x7ffe29a780e0
*** Stack frame numbers 8 and higher may be incorrect ***

If I build without the -acc it works fine.

The performance running under separate processes is very poor as I need to copy buffers into global memory in order to have access in my main code.

Sorry, hit send early.

It is very easy to reproduce running ACCTEST 4 100

The new sample only requires pgi code.

Also, I even tried creating separate ACC routines for each thread, and these still failed. When I tried enabling PGI_ACC_NOTIFY to 15 I saw that it was not reporting the correct names of the routines sometimes. So not sure if this is showing the issue, or that the NOTIFY just does not handle multi-threading, so reports incorrectly.

Regards,
Gary

Gary sent in his example program. It uses multiple WinThreads and looks to expose a race condition with device data buffers. I logged the error as TPR#22568. The work around is to set the environment variable “PGI_ACC_MEM_MANAGE=0” which disables the optimized buffers.

With the work around Gary reported that he still saw issues when increasing the number of iterations to 5000. However, the issues were very intermittent and I was not able to recreate them. I suggested he try using as different async queue for each thread in case there was contention when using the same CUDA stream.

  • Mat