Hello.
-
I have a CUDA program that runs on multiple GPUs.
Each GPU takes its own work and tallies some data.
After the calculation, the data held by each GPU has to be summed up on the host.
In this case, what would be the fastest way to sum up multiple GPUs' data on the host?
-
Do !$ACC DATA and !$ACC HOST_DATA have an implicit barrier?
I want to run processes on host and device simultaneously.
I know that a CUDA kernel is launched asynchronously by default.
However, in my case, the CUDA kernel is called inside the !$ACC DATA region, while the host subprogram is called outside the region.
So I want to know whether !$ACC DATA and !$ACC HOST_DATA have their own barriers.
-
If I try to use !$ACC KERNELS or !$ACC PARALLEL inside an OpenMP parallel region, an internal compiler error occurs.
The following is the full message:
1 Internal compiler error. get_var_line(): block not found for sptr 4929 J:\nTRACER\src\CUDARayTraceKernels.cuf 33
What could be the cause? Maybe it is not allowed to use OpenMP and OpenACC parallel regions together?
Hi CNJ,
In this case, what would be the fastest way to sum up multiple GPUs' data on the host?
Ideally, you would have each GPU perform a partial sum on its part of the data. Then have the host perform the final sum.
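Something along these lines is a minimal, untested sketch of that pattern (the array "a", "partial", "psum", and the even-chunking assumption are all illustrative, not from your code), with one OpenMP thread driving each GPU and the host doing the final reduction:

program multi_gpu_sum
   use openacc
   use omp_lib
   implicit none
   integer, parameter :: n = 1000000
   integer :: ngpus, tid, i, lo, hi, chunk
   real(8) :: a(n), psum, total
   real(8), allocatable :: partial(:)

   a = 1.0d0
   ngpus = acc_get_num_devices(acc_device_nvidia)
   chunk = n / ngpus                    ! assumes n divides evenly across the GPUs
   allocate(partial(0:ngpus-1))

   !$omp parallel num_threads(ngpus) private(tid, i, lo, hi, psum)
   tid = omp_get_thread_num()
   call acc_set_device_num(tid, acc_device_nvidia)
   lo = tid*chunk + 1
   hi = (tid+1)*chunk
   psum = 0.0d0
   ! partial sum of this GPU's chunk, computed on the device
   !$acc parallel loop copyin(a(lo:hi)) reduction(+:psum)
   do i = lo, hi
      psum = psum + a(i)
   end do
   partial(tid) = psum                  ! each thread writes only its own slot
   !$omp end parallel

   total = sum(partial)                 ! final sum on the host
   print *, 'total =', total
end program multi_gpu_sum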
Do !$ACC DATA and !$ACC HOST_DATA have an implicit barrier?
!$ACC DATA regions always have an implicit barrier. Unstructured data regions, “!$ACC ENTER DATA”, have an implicit barrier as well, but they also allow the use of the “async” clause, which makes the data movement asynchronous to the host.
The HOST_DATA directive is a compile time directive telling the compiler to replace host pointer addresses with their device address. Hence, a barrier doesn’t apply.
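As an example of the async case (a small sketch; "a", "n", and "do_host_work" are made-up names), the async queue lets the transfer and the device work proceed while the host stays busy, and you synchronize explicitly with "!$ACC WAIT" when you actually need the results:

subroutine scale_async(a, n)
   implicit none
   integer, intent(in)    :: n
   real(8), intent(inout) :: a(n)
   integer :: i

   !$acc enter data copyin(a(1:n)) async(1)   ! copy-in does not block the host

   call do_host_work()                        ! host work overlaps the transfer (illustrative call)

   !$acc parallel loop present(a) async(1)    ! queued behind the copy-in on queue 1
   do i = 1, n
      a(i) = 2.0d0*a(i)
   end do

   !$acc update host(a(1:n)) async(1)         ! bring the results back asynchronously
   !$acc wait(1)                              ! explicit barrier: the host blocks here
   !$acc exit data delete(a)
end subroutine scale_async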
What could be the cause? Maybe it is not allowed to use OpenMP and OpenACC parallel regions together?
Internal compiler errors (ICE) are always a problem with the compiler. Even if you’re doing something that isn’t allowed, the compiler should issue an informative error, not an ICE. For these cases, please send a reproducing example to PGI Customer Service (trs@pgroup.com).
For this specific error, I do see a similar known issue in our issue tracking system, TPR#22079. Please do send us a reproducing example so we can either add your issue to this or create a new problem report if unrelated. Also, we’ll try to find you a workaround.
OpenACC and OpenMP are allowed to be used together, provided that OpenMP is used outside of OpenACC (OpenMP first); they can't be applied to the same loop at the same time.
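In other words, a pattern like the following sketch is fine (the names are illustrative, one device per OpenMP thread), while putting OpenMP and OpenACC work-sharing on the same loop is not:

subroutine omp_outside_acc(b, n, ngpus)
   use openacc
   use omp_lib
   implicit none
   integer, intent(in)    :: n, ngpus
   real(8), intent(inout) :: b(n, ngpus)
   integer :: tid, i

   ! Allowed: an OpenACC compute construct nested inside an OpenMP parallel region
   !$omp parallel num_threads(ngpus) private(tid, i)
   tid = omp_get_thread_num()
   call acc_set_device_num(tid, acc_device_nvidia)
   !$acc parallel loop copy(b(:, tid+1))
   do i = 1, n
      b(i, tid+1) = b(i, tid+1) + 1.0d0
   end do
   !$omp end parallel

   ! Not allowed: OpenMP and OpenACC work-sharing on the same loop, e.g.
   !   !$omp parallel do
   !   !$acc parallel loop
   !   do i = 1, n
end subroutine omp_outside_acc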
If HOST_DATA is a compile-time directive, then how can it work with acc_set_device_num(), which is a runtime API?
Then, if I use HOST_DATA with multiple GPUs, what is the directive’s behavior?
For example, let’s say that I have some data DAT.
I copied DAT to 2 GPUs by using !$ACC ENTER DATA and acc_set_device_num().
Then,
DO tid = 0, 1
  CALL acc_set_device_num(tid, acc_device_nvidia)
  ierr = cudaSetDevice(tid)
  !$ACC HOST_DATA USE_DEVICE(DAT)
  CALL kernel<<< … >>>(DAT)
  !$ACC END HOST_DATA
ENDDO
or
!$OMP PARALLEL PRIVATE(tid, ierr)
  tid = omp_get_thread_num()
  CALL acc_set_device_num(tid, acc_device_nvidia)
  ierr = cudaSetDevice(tid)
  !$ACC HOST_DATA USE_DEVICE(DAT)
  CALL kernel<<< … >>>(DAT)
  !$ACC END HOST_DATA
!$OMP END PARALLEL
Are these the wrong way of using multiple GPUs?
If HOST_DATA is a compile-time directive, then how can it work with acc_set_device_num(), which is a runtime API?
Fair enough. There is a runtime lookup into the present table to get the device pointer value to set within the host_data region. But there's still no barrier on the host_data construct since there's no device code being used here. You're just defining the region to use the device pointer.
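So, for your second snippet, the lookup happens per thread against whichever device is current at that point; roughly like this sketch (the launch configuration "grid, block" and the explicit cudaDeviceSynchronize() call are illustrative additions, not from your code):

!$omp parallel private(tid, ierr)
tid = omp_get_thread_num()
call acc_set_device_num(tid, acc_device_nvidia)   ! the present-table lookup now targets this device
ierr = cudaSetDevice(tid)
!$acc host_data use_device(DAT)
call kernel<<<grid, block>>>(DAT)    ! DAT resolves to this device's copy; the launch is asynchronous
!$acc end host_data
ierr = cudaDeviceSynchronize()       ! host_data adds no barrier, so synchronize explicitly before using results on the host
!$omp end parallel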