Problems of using void pointer in the device

fw407 · April 18, 2022, 5:29pm

Hello,

I have attached a mini app and it modifies the data partially, pickle the partially modified data to binary streams and un-pickle the data to check the results. It tries to mimic a situation where only part of the data need to be modified after a halo exchange. There is no MPI involved in this for simplicity.

I am using the cudaMemcpy to do the pickle and unpickle. In the attached mini app, my problem is on line 157. I use a void pointer, which is cast from a (double*). If I do it this way, the data are not unpickled correctly. Alternatively if I use the (double*) pointer directly (line 158), the data are unpickled as expected. In the mini app, there is a shift of the pointers by iss to only operate on part of the data.

Could someone please shine some light on this problem?

Many thanks in advance,
Feng

main.cpp (4.5 KB)
compile.sh (66 Bytes)

MatColgrove · April 18, 2022, 8:11pm

Hi Feng,

The main issue is that the code is passing the host copy of “wptr” which was filled on the device and not updated. Hence, you’re passing in garbage values to unpickle.

Though since you’re capturing device pointers, I’d suggest you make “wptr” a “size_t” array rather than a “void **”. “**” is an array of pointers to arrays so the compiler would be expecting it to be 2 dimensions. This can cause issues when using things like the “update” clause. A “size_t” array will still be able to capture pointer values, but allow for proper use in data clauses.

Example: main.cpp (4.5 KB)

Hope this helps,
Mat

fw407 · April 19, 2022, 10:16am

Hi Mat,

Many thanks for your reply. This really helps!

I have tried your modified version, it produces the expected results. It seems I have to use size_t for wptr to hold the pointers otherwise the “#pragma acc update self(wptr)” will crash, so it is not happy with void* pointers.

I also noticed that in your modified version, in the “unpckle” function, yours look like:

template <typename type> void unpckle( size_t *len, int n, type *var, const pickle_t buf )
{
   size_t l= (size_t)(n*sizeof(*var));
  #pragma acc host_data use_device(buf) 
  {
      cudaMemcpy( (void*)var, (const void*)(buf+(*len)), l, cudaMemcpyDeviceToDevice );
  }
   (*len)+= l;
}

“var” is removed from the “use_device”, actually if I put it back, it crashes. since the memory operation is on the device, how does know it is the device version of “var” would be used?

Thanks,
Feng

MatColgrove · April 19, 2022, 3:04pm

This really helps!

Glad to hear it!

“var” is already a device pointer so no need to put it into a “use_device” clause.

fw407 · April 19, 2022, 4:06pm

Hi Mat,

Thanks for your reply!

I also tried uintptr_t to replace the void* pointer, it also works with the openacc update directives. For my previous question, just for curiosity, if I add “var” to “use_device” in the “unpckle” function, it crashes and the error is :

FATAL ERROR: data in use_device clause was not found on device 1: host:0x7f875e0faa28

I thought if it is already in device, add it to “use_device” should not cause any harm? or did I missed something here?

Thanks,
Feng

MatColgrove · April 19, 2022, 5:05pm

“use_device” looks up the host address in the present table to return the device address associated with that host address. “var” is a device address, so when the function looks through the list of host addressed, it’s not found and hence the error.

fw407 · April 19, 2022, 5:13pm

Many thanks. I understand it now!

Feng

fw407 · April 20, 2022, 11:42am

Hi Mat,

I have made my mini app a bit more representative of my full application. “wptr” is now a three dimensional array. it looks like wptr[id][ia][iv] . id is the message index, ia is the cpu index and iv is the variable index. In my previous simple mini app, I only have the variable index iv. I have managed to make it work with wptr as 3D arrays, but when I compile the code, the output is:

main:
    105, Generating enter data copyin(wptr[:nmsg])
    108, Generating enter data copyin(wptr->[:ncpu])
    111, Generating enter data copyin(wptr->->[:nv])
    116, Generating enter data copyin(data[:nv],iref[:nq],sdata[:nq*nv],sbuf[:ssize])
         Generating present(iref[:nq])
         Generating NVIDIA GPU code
        116, #pragma acc loop gang /* blockIdx.x */
    123, Generating present(data[:nv])
         Generating NVIDIA GPU code
        123, #pragma acc loop gang /* blockIdx.x */
        125, #pragma acc loop seq
    125, Complex loop carried dependence of data->,data->-> prevents parallelization
    133, Generating update self(sdata[:nq*nv])
    161, Generating enter data copyin(rdata[:nq*nv])
    170, Generating present(wptr[:][:][:],data[:nv])
    172, Generating NVIDIA GPU code
        172, #pragma acc loop seq
    179, Accelerator clause: upper bound for dimension 1 of array 'wptr' is unknown
        Accelerator clause: upper bound for dimension 0 of array 'wptr' is unknown
        Generating update self(wptr[:1][:1][:nv])**
    187, Generating update self(rdata[:nq*nv])
    200, Generating exit data delete(iref[:1],data[:1][:1],rdata[:1],sdata[:1],sbuf[:1])
    205, Generating exit data delete(wptr->->[:1][:1][:1])
    207, Generating exit data delete(wptr->[:1][:1][:1])
    210, Generating exit data delete(wptr[:1][:1][:1])
void subv<double>(int, int, T1 *, T1 **):
     17, Generating present(a[:n2*n1],as[:n1])
         Accelerator serial kernel generated
         Generating NVIDIA GPU code
     20, Generating NVIDIA GPU code
         20, #pragma acc loop seq
void pckle<double>(unsigned long *, int, int *, const T1 *, char *):
     49, Generating enter data create(tmp[:n])
         Generating present(tmp[:n],var[:n],iref[:n])
         Generating NVIDIA GPU code
         49, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     61, Generating exit data copyout(tmp[:n])

I am concerned with the output related to wptr, such as line 179, is this something I should worry about. I have attached my updated mini app.

Thanks in advance,
Feng

main.cpp (5.6 KB)

MatColgrove · April 20, 2022, 4:16pm

Good eye. Yes this is a problem with the compiler where it’s still using an old method where the compiler would implicitly add a lower bound when the triplet notation isn’t used. This was out of alignment with some revisions in the OpenACC standard where it should indicate a single element. We made this change to the data directives, but looks like there’s a problem with the update directive. I’ve submitted an issue report, TPR#31692, and have asked engineering to investigate.

The work around is to use triplet notation with a range of one.

#pragma acc update self(wptr[id:1][ia:1][0:nv])

-Mat

MatColgrove · April 20, 2022, 4:18pm

One other thing, while it may be needed as you port your code to the full application, at least for this example, “wptr” doesn’t really need to be a 3D array. It’s only capturing one dimension at a time, which in turn is only used once. You could make it a 1D array, and then reuse the buffer each time through the id and ia loops.

fw407 · April 20, 2022, 9:45pm

Hi Mat
it works. Many thanks!!
Feng

system · May 4, 2022, 9:46pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Wierd crash in modified VectorAdd example. CUDA Programming and Performance	7	1050	May 25, 2011
Problem when using NPP libirary, nppiMinIndx_32f_C1R() GPU-Accelerated Libraries	8	1479	July 31, 2018
Nvc++ 25.1, stdpar, HtoD for already fetched member variable nvc, nvc++ and nvfortran parallel-computing	9	77	February 28, 2025
Invalid Device Pointer CUDA Programming and Performance	9	24504	January 15, 2009
pointer to shared memory compiler problems CUDA Programming and Performance	19	14584	June 7, 2008
C++ Smart Pointers and OpenACC nvc, nvc++ and nvfortran nvcc	3	335	July 31, 2024
Strange memory gremlins Getting pwned by pointers CUDA Programming and Performance	9	12180	July 1, 2009
Nvfortran: Passing shared arrays of variable size to device subroutine causes memory error nvc, nvc++ and nvfortran cuda	7	85	August 28, 2024
Implicit data copy to device for allocated arrays using compilation option -stdpar=gpu nvc, nvc++ and nvfortran	11	696	May 31, 2023
Problem in getting __device__ dynamic array data back to host CUDA Programming and Performance	13	2141	January 18, 2013

Problems of using void pointer in the device

Related topics