Hello.
-
I have a CUDA program that runs on multiple GPUs.
Each GPU takes its own work and tallies some data.
After the calculation, the data held by each GPU has to be summed up on the host.
In this case, what would be the fastest way to sum up multiple GPUs' data on the host?
-
Do !$ACC DATA and !$ACC HOST_DATA have an implicit barrier?
I want to run processes on host and device simultaneously.
I know that a CUDA kernel is launched asynchronously by default.
However, in my case, the CUDA kernel is called inside the !$ACC DATA region, while the host subprogram is called outside the region.
So I want to know whether !$ACC DATA and !$ACC HOST_DATA have their own barriers.
-
If I try to use !$ACC KERNELS or !$ACC PARALLEL inside an OpenMP parallel region, an internal compiler error occurs.
The following is the full message:
1 Internal compiler error. get_var_line(): block not found for sptr 4929 J:\nTRACER\src\CUDARayTraceKernels.cuf 33
What could be the cause? Maybe it is not allowed to use OpenMP and OpenACC parallel regions together?
Hi CNJ,
In this case, what would be the fastest way to sum up multiple GPUs' data on the host?
Ideally, you would have each GPU perform a partial sum on its part of the data. Then have the host perform the final sum.
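Something along these lines is a minimal, untested sketch of that pattern (the array "a", "partial", "psum", and the even-chunking assumption are all illustrative, not from your code), with one OpenMP thread driving each GPU and the host doing the final reduction:

program multi_gpu_sum
   use openacc
   use omp_lib
   implicit none
   integer, parameter :: n = 1000000
   integer :: ngpus, tid, i, lo, hi, chunk
   real(8) :: a(n), psum, total
   real(8), allocatable :: partial(:)

   a = 1.0d0
   ngpus = acc_get_num_devices(acc_device_nvidia)
   chunk = n / ngpus                    ! assumes n divides evenly across the GPUs
   allocate(partial(0:ngpus-1))

   !$omp parallel num_threads(ngpus) private(tid, i, lo, hi, psum)
   tid = omp_get_thread_num()
   call acc_set_device_num(tid, acc_device_nvidia)
   lo = tid*chunk + 1
   hi = (tid+1)*chunk
   psum = 0.0d0
   ! partial sum of this GPU's chunk, computed on the device
   !$acc parallel loop copyin(a(lo:hi)) reduction(+:psum)
   do i = lo, hi
      psum = psum + a(i)
   end do
   partial(tid) = psum                  ! each thread writes only its own slot
   !$omp end parallel

   total = sum(partial)                 ! final sum on the host
   print *, 'total =', total
end program multi_gpu_sum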
Do !$ACC DATA and !$ACC HOST_DATA have an implicit barrier?
!$ACC DATA regions always have an implicit barrier. Unstructured data regions, “!$ACC ENTER DATA”, have an implicit barrier as well, but they also allow the use of the “async” clause, which makes the data movement asynchronous to the host.
The HOST_DATA directive is a compile time directive telling the compiler to replace host pointer addresses with their device address. Hence, a barrier doesn’t apply.
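As an example of the async case (a small sketch; "a", "n", and "do_host_work" are made-up names), the async queue lets the transfer and the device work proceed while the host stays busy, and you synchronize explicitly with "!$ACC WAIT" when you actually need the results:

subroutine scale_async(a, n)
   implicit none
   integer, intent(in)    :: n
   real(8), intent(inout) :: a(n)
   integer :: i

   !$acc enter data copyin(a(1:n)) async(1)   ! copy-in does not block the host

   call do_host_work()                        ! host work overlaps the transfer (illustrative call)

   !$acc parallel loop present(a) async(1)    ! queued behind the copy-in on queue 1
   do i = 1, n
      a(i) = 2.0d0*a(i)
   end do

   !$acc update host(a(1:n)) async(1)         ! bring the results back asynchronously
   !$acc wait(1)                              ! explicit barrier: the host blocks here
   !$acc exit data delete(a)
end subroutine scale_async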
What could be the cause? Maybe it is not allowed to use OpenMP and OpenACC parallel regions together?
Internal compiler errors (ICE) are always a problem with the compiler. Even if you’re doing something that isn’t allowed, the compiler should issue an informative error, not an ICE. For these cases, please send a reproducing example to PGI Customer Service (trs@pgroup.com).
For this specific error, I do see a similar known issue in our issue tracking system, TPR#22079. Please do send us a reproducing example so we can either add your issue to this or create a new problem report if unrelated. Also, we’ll try to find you a workaround.
OpenACC and OpenMP are allowed to be used together, provided that OpenMP is used outside of OpenACC (OpenMP first); they can't be applied to the same loop at the same time.
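In other words, a pattern like the following sketch is fine (the names are illustrative, one device per OpenMP thread), while putting OpenMP and OpenACC work-sharing on the same loop is not:

subroutine omp_outside_acc(b, n, ngpus)
   use openacc
   use omp_lib
   implicit none
   integer, intent(in)    :: n, ngpus
   real(8), intent(inout) :: b(n, ngpus)
   integer :: tid, i

   ! Allowed: an OpenACC compute construct nested inside an OpenMP parallel region
   !$omp parallel num_threads(ngpus) private(tid, i)
   tid = omp_get_thread_num()
   call acc_set_device_num(tid, acc_device_nvidia)
   !$acc parallel loop copy(b(:, tid+1))
   do i = 1, n
      b(i, tid+1) = b(i, tid+1) + 1.0d0
   end do
   !$omp end parallel

   ! Not allowed: OpenMP and OpenACC work-sharing on the same loop, e.g.
   !   !$omp parallel do
   !   !$acc parallel loop
   !   do i = 1, n
end subroutine omp_outside_acc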
If HOST_DATA is a compile-time directive, then how can it work with acc_set_device_num(), which is a runtime API?
Then, if I use HOST_DATA with multiple GPUs, what is the directive’s behavior?
For example, let’s say that I have some data DAT.
I copied DAT to 2 GPUs by using !$ACC ENTER DATA and acc_set_device_num().
Then,
DO tid = 0, 1
  CALL acc_set_device_num(tid, acc_device_nvidia)
  ierr = cudaSetDevice(tid)
  !$ACC HOST_DATA USE_DEVICE(DAT)
  CALL kernel<<< … >>>(DAT)
  !$ACC END HOST_DATA
ENDDO
or
!$OMP PARALLEL PRIVATE(tid, ierr)
  tid = omp_get_thread_num()
  CALL acc_set_device_num(tid, acc_device_nvidia)
  ierr = cudaSetDevice(tid)
  !$ACC HOST_DATA USE_DEVICE(DAT)
  CALL kernel<<< … >>>(DAT)
  !$ACC END HOST_DATA
!$OMP END PARALLEL
Are these the wrong way of using multiple GPUs?
If HOST_DATA is a compile-time directive, then how can it work with acc_set_device_num(), which is a runtime API?
Fair enough. There is a runtime lookup into the present table to get the device pointer value to set within the host_data region. But there's still no barrier on the host_data construct since there's no device code being used here. You're just defining the region to use the device pointer.
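So, for your second snippet, the lookup happens per thread against whichever device is current at that point; roughly like this sketch (the launch configuration "grid, block" and the explicit cudaDeviceSynchronize() call are illustrative additions, not from your code):

!$omp parallel private(tid, ierr)
tid = omp_get_thread_num()
call acc_set_device_num(tid, acc_device_nvidia)   ! the present-table lookup now targets this device
ierr = cudaSetDevice(tid)
!$acc host_data use_device(DAT)
call kernel<<<grid, block>>>(DAT)    ! DAT resolves to this device's copy; the launch is asynchronous
!$acc end host_data
ierr = cudaDeviceSynchronize()       ! host_data adds no barrier, so synchronize explicitly before using results on the host
!$omp end parallel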