Problems of using void pointer in the device


I have attached a mini app and it modifies the data partially, pickle the partially modified data to binary streams and un-pickle the data to check the results. It tries to mimic a situation where only part of the data need to be modified after a halo exchange. There is no MPI involved in this for simplicity.

I am using the cudaMemcpy to do the pickle and unpickle. In the attached mini app, my problem is on line 157. I use a void pointer, which is cast from a (double*). If I do it this way, the data are not unpickled correctly. Alternatively if I use the (double*) pointer directly (line 158), the data are unpickled as expected. In the mini app, there is a shift of the pointers by iss to only operate on part of the data.

Could someone please shine some light on this problem?

Many thanks in advance,

main.cpp (4.5 KB) (66 Bytes)

Hi Feng,

The main issue is that the code is passing the host copy of “wptr” which was filled on the device and not updated. Hence, you’re passing in garbage values to unpickle.

Though since you’re capturing device pointers, I’d suggest you make “wptr” a “size_t” array rather than a “void **”. “**” is an array of pointers to arrays so the compiler would be expecting it to be 2 dimensions. This can cause issues when using things like the “update” clause. A “size_t” array will still be able to capture pointer values, but allow for proper use in data clauses.

Example: main.cpp (4.5 KB)

Hope this helps,

Hi Mat,

Many thanks for your reply. This really helps!

I have tried your modified version, it produces the expected results. It seems I have to use size_t for wptr to hold the pointers otherwise the “#pragma acc update self(wptr)” will crash, so it is not happy with void* pointers.

I also noticed that in your modified version, in the “unpckle” function, yours look like:

template <typename type> void unpckle( size_t *len, int n, type *var, const pickle_t buf )
   size_t l= (size_t)(n*sizeof(*var));
  #pragma acc host_data use_device(buf) 
      cudaMemcpy( (void*)var, (const void*)(buf+(*len)), l, cudaMemcpyDeviceToDevice );
   (*len)+= l;

“var” is removed from the “use_device”, actually if I put it back, it crashes. since the memory operation is on the device, how does know it is the device version of “var” would be used?


This really helps!

Glad to hear it!

“var” is already a device pointer so no need to put it into a “use_device” clause.

Hi Mat,

Thanks for your reply!

I also tried uintptr_t to replace the void* pointer, it also works with the openacc update directives. For my previous question, just for curiosity, if I add “var” to “use_device” in the “unpckle” function, it crashes and the error is :

FATAL ERROR: data in use_device clause was not found on device 1: host:0x7f875e0faa28

I thought if it is already in device, add it to “use_device” should not cause any harm? or did I missed something here?


“use_device” looks up the host address in the present table to return the device address associated with that host address. “var” is a device address, so when the function looks through the list of host addressed, it’s not found and hence the error.

Many thanks. I understand it now!


Hi Mat,

I have made my mini app a bit more representative of my full application. “wptr” is now a three dimensional array. it looks like wptr[id][ia][iv] . id is the message index, ia is the cpu index and iv is the variable index. In my previous simple mini app, I only have the variable index iv. I have managed to make it work with wptr as 3D arrays, but when I compile the code, the output is:

    105, Generating enter data copyin(wptr[:nmsg])
    108, Generating enter data copyin(wptr->[:ncpu])
    111, Generating enter data copyin(wptr->->[:nv])
    116, Generating enter data copyin(data[:nv],iref[:nq],sdata[:nq*nv],sbuf[:ssize])
         Generating present(iref[:nq])
         Generating NVIDIA GPU code
        116, #pragma acc loop gang /* blockIdx.x */
    123, Generating present(data[:nv])
         Generating NVIDIA GPU code
        123, #pragma acc loop gang /* blockIdx.x */
        125, #pragma acc loop seq
    125, Complex loop carried dependence of data->,data->-> prevents parallelization
    133, Generating update self(sdata[:nq*nv])
    161, Generating enter data copyin(rdata[:nq*nv])
    170, Generating present(wptr[:][:][:],data[:nv])
    172, Generating NVIDIA GPU code
        172, #pragma acc loop seq
    179, Accelerator clause: upper bound for dimension 1 of array 'wptr' is unknown
        Accelerator clause: upper bound for dimension 0 of array 'wptr' is unknown
        Generating update self(wptr[:1][:1][:nv])**
    187, Generating update self(rdata[:nq*nv])
    200, Generating exit data delete(iref[:1],data[:1][:1],rdata[:1],sdata[:1],sbuf[:1])
    205, Generating exit data delete(wptr->->[:1][:1][:1])
    207, Generating exit data delete(wptr->[:1][:1][:1])
    210, Generating exit data delete(wptr[:1][:1][:1])
void subv<double>(int, int, T1 *, T1 **):
     17, Generating present(a[:n2*n1],as[:n1])
         Accelerator serial kernel generated
         Generating NVIDIA GPU code
     20, Generating NVIDIA GPU code
         20, #pragma acc loop seq
void pckle<double>(unsigned long *, int, int *, const T1 *, char *):
     49, Generating enter data create(tmp[:n])
         Generating present(tmp[:n],var[:n],iref[:n])
         Generating NVIDIA GPU code
         49, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     61, Generating exit data copyout(tmp[:n])

I am concerned with the output related to wptr, such as line 179, is this something I should worry about. I have attached my updated mini app.

Thanks in advance,

main.cpp (5.6 KB)

Good eye. Yes this is a problem with the compiler where it’s still using an old method where the compiler would implicitly add a lower bound when the triplet notation isn’t used. This was out of alignment with some revisions in the OpenACC standard where it should indicate a single element. We made this change to the data directives, but looks like there’s a problem with the update directive. I’ve submitted an issue report, TPR#31692, and have asked engineering to investigate.

The work around is to use triplet notation with a range of one.

#pragma acc update self(wptr[id:1][ia:1][0:nv])


One other thing, while it may be needed as you port your code to the full application, at least for this example, “wptr” doesn’t really need to be a 3D array. It’s only capturing one dimension at a time, which in turn is only used once. You could make it a 1D array, and then reuse the buffer each time through the id and ia loops.

Hi Mat
it works. Many thanks!!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.