Your assumptions are correct. This was mainly just a toy example, and the jacobi
function looked like this:
```fortran
subroutine jacobi(gridA,gridB,width,height) bind(c)
   use iso_c_binding, only: c_int
   use cudafor, only: dim3
   integer(c_int), value :: width, height
   real(wp), intent(inout), device :: gridA(width,height)
   real(wp), intent(in), device :: gridB(width,height)
   type(dim3) :: grid, tblock
   tblock = dim3(128,4,1)
   ! ceiling division so the grid covers the whole domain
   grid = dim3((width + tblock%x - 1)/tblock%x, &
               (height + tblock%y - 1)/tblock%y, 1)
   call jk<<<grid,tblock>>>(gridA,gridB,width,height)
end subroutine
```
Thanks for the pointers regarding the stream handling. I found advice on mapping between target constructs and streams in the HPC Compilers User's Guide Version 24.9 for ARM, OpenPower, x86. If I got the gist, I'd need something like:
```fortran
subroutine jacobi(gridA,gridB,width,height,stream) bind(c)
   ! ...
   integer(kind=cuda_stream_kind), optional :: stream
   ! when absent, this corresponds to a NULL (cudaStream_t *) on the C side
   ! ...
   if (present(stream)) then
      ! third chevron argument is the dynamic shared-memory size
      call jk<<<grid,tblock,0,stream>>>(...)
   else
      ! default stream = 0
      call jk<<<grid,tblock>>>(...)
      ! >>> alternative would be to synchronize here <<<
   end if
end subroutine
```
In the C code I’d then have to use:
```cpp
#include <cuda_runtime.h>

extern "C" void jacobi(double *gridA,
                       double *gridB,
                       int nx,
                       int ny,
                       cudaStream_t *stream);
// ...
cudaStream_t stream =
    (cudaStream_t) ompx_get_cuda_stream(omp_get_default_device(), 0);
// ...
t1 = dtime();
for (int it = 0; it < niters; it++) {
    #pragma omp target data use_device_ptr(gridA,gridB)
    jacobi(gridA, gridB, nx, ny, &stream);
    std::swap(gridA,gridB);
}
#pragma omp taskwait depend(in: stream)
t2 = dtime();
```
I'm not entirely sure the taskwait would work like this, because omp target data is not a task-generating construct, and the stream instance isn't referenced in any previous depend clause. The dependence could be added to the data-mapping constructs, but I'd like to measure only the computational part. I noticed the OpenMP standard has a depend object, but it is unsupported at the moment. I think it would look like this:
```cpp
cudaStream_t stream =
    (cudaStream_t) ompx_get_cuda_stream(omp_get_default_device(), 0);

// Use a user-provided dependency object, since the stream
// doesn't appear explicitly in a depend clause of a task-generating construct
omp_depend_t obj;
#pragma omp depobj(obj) depend(inout: stream)

#pragma omp target data map(tofrom: gridA[:nx*ny], gridB[:nx*ny])
{
    t1 = dtime();
    for (int it = 0; it < niters; it++) {
        #pragma omp target data use_device_ptr(gridA,gridB)
        jacobi(gridA, gridB, nx, ny, &stream);
        // the update clause takes only a dependence type
        #pragma omp depobj(obj) update(inout)
        std::swap(gridA,gridB);
    }
    #pragma omp taskwait depend(in: stream)
    t2 = dtime();
}
#pragma omp depobj(obj) destroy
```
Obviously, this is a lot of extra typing to achieve essentially the same thing as cudaStreamSynchronize().
An empty task-generating construct like this would probably also work (and involves less typing):

```cpp
#pragma omp target depend(inout: stream)
{ /* empty */ }
```
In either case, it now makes sense to me that an explicit mechanism is needed to synchronize the OpenMP runtime with the CUDA runtime.