I am learning OpenACC and I am stuck on a loop that works on several arrays, where the number of arrays is not known at compile time. For instance, I have an array of device pointers (allocated elsewhere by CUDA), called p, with size n. How can I start a parallel loop with OpenACC and pass all of these pointers in deviceptr()?
From the API documentation, it looks like I can only pass pointers one by one, not an array of them.
You should be able to pass in the array of device pointers via a data directive. Something like the following pseudocode (I’m assuming you’re using C/C++):
float **p = (float **) malloc(n * sizeof(float *));
for (int i = 0; i < n; ++i) {
    // each p[i] holds a device address returned by CUDA
    cudaError_t err = cudaMalloc((void **) &p[i], somesize);
}
// copy the table of n device pointers itself to the device
#pragma acc enter data copyin(p[:n])
...
#pragma acc parallel loop gang present(p)
for (int arrNum = 0; arrNum < n; ++arrNum) {
    #pragma acc loop vector
    for (int eleNum = 0; eleNum < sizeofArr; ++eleNum) {
        p[arrNum][eleNum] = something;
    }
}
Thank you for this idea. I did not realize that simply having the pointers on the device gave the compiler enough information. I will try it as soon as possible.
That said, I don’t really see the point of present(p) here. I guess I could simply put the copyin directly on the #pragma acc parallel loop?
I put “p” in an outer unstructured data region so it could be used across multiple compute regions. If you only use it in this one region, then yes, you can replace present with copyin. But if there is more than one compute region, putting copyin on each of them means the array is copied to the device every time, which is extra data movement.
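To illustrate the point about sharing one data region, here is a minimal sketch. It assumes a plain host array a stands in for the pointer table p, and the function name scale_then_shift is made up for the example. The array is copied to the device once, two compute regions use it through present, and the result is copied back once.

```c
/* Hypothetical helper: two compute regions share one device copy of a[0:n]. */
void scale_then_shift(float *a, int n, float s, float b)
{
    /* One host-to-device transfer for the whole function. */
    #pragma acc enter data copyin(a[0:n])

    /* First compute region: a is expected to be on the device already. */
    #pragma acc parallel loop present(a[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] *= s;
    }

    /* Second compute region reuses the same device copy, no new transfer. */
    #pragma acc parallel loop present(a[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] += b;
    }

    /* One device-to-host transfer, then the device copy is released. */
    #pragma acc exit data copyout(a[0:n])
}
```

Calling scale_then_shift(a, 4, 2.0f, 1.0f) on a = {0, 1, 2, 3} should leave a = {1, 3, 5, 7}. Built without OpenACC enabled, the pragmas are ignored and the same loops simply run on the host.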
The “copy” data clauses use “present_or” semantics, meaning that if the data is already present from an outer data region, the create or copy is skipped. This is useful when you don’t know whether the incoming data is present or not, but when the data is expected to be present, using the “present” clause is considered best practice.
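The “present_or” behaviour can be sketched as follows. This is only an illustration: the routine names add_one and add_one_three_times are hypothetical, and a plain host array is used in place of the device pointer table. The same routine works whether or not the caller has already placed the array on the device.

```c
/* Hypothetical routine that does not know whether a[0:n] is on the device. */
void add_one(float *a, int n)
{
    /* copy acts as present_or_copy: the transfer happens only when
       a[0:n] is not already present from an enclosing data region. */
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] += 1.0f;
    }
}

/* Hypothetical caller that keeps the array on the device across calls. */
void add_one_three_times(float *a, int n)
{
    /* The data region does one copy in and one copy out in total;
       the copy clause inside add_one finds the data already present. */
    #pragma acc data copy(a[0:n])
    {
        add_one(a, n);
        add_one(a, n);
        add_one(a, n);
    }
}
```

Called on its own, add_one transfers the array in and out around its single loop. Called from add_one_three_times, it moves no data at all, and the numerical result is the same either way.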