Casting CUdeviceptr to floating point array for CUDA kernel

Hello all,

This is probably a simple question, but say I have a CUDA kernel that takes a floating point array and I have a CUdeviceptr I would like to pass to that kernel.

Can this be done? I am guessing that this is probably easy but I have never really used CUdeviceptr, so this is kind of new to me.

A snippet of code follows:

__global__ void myKernel(float *array, const int N) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if(id < N) {
        // Do something with argument array of floats
        array[id] = array[id] + 1.0f;
    }
}

...
int main(int argc, char **argv) {
    int N = 3;
    float *h_A = nullptr;

    // Allocate and initialize HOST-side array of floats
    h_A = (float*)malloc(sizeof(float) * N);
    for(int i = 0; i < N; ++i) {
        h_A[i] = static_cast<float>(i);
    }

    // This needs to hold/point to floating point array shown above
    CUdeviceptr dptr = 0;

    // How can I copy above dptr to floating point array from HOST 
    // and then pass to the myKernel(...) CUDA kernel ?????

    // Then how can I retrieve data from myKernel(...) CUdeviceptr from
    // DEVICE side to HOST side ??????
    ...

    return 0;
}

Thanks to anyone that can provide a hint or any help and apologies if this is a dumb question.

You can cast from CUdeviceptr to an “ordinary” pointer and things should “just work”.
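A minimal sketch of that cast (assuming `dptr` already refers to `N` valid floats on the device and `myKernel` is the kernel shown above; the names here are illustrative):

```cuda
// dptr is a CUdeviceptr that already points to N floats on the device
CUdeviceptr dptr = /* ... obtained from driver-API code ... */ 0;

// CUdeviceptr is an integer type, so a plain cast yields a usable device pointer
float *d_A = reinterpret_cast<float *>(dptr);

// It can now be passed to a kernel like any other device pointer
myKernel<<<(N + 255) / 256, 256>>>(d_A, N);

// ...and the cast works in the other direction as well
CUdeviceptr back = reinterpret_cast<CUdeviceptr>(d_A);
```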

Thank you @Robert_Crovella for the fast reply. It all looks good and makes sense.

However, I have run into problems when trying to call these cu* driver-API functions, e.g., the linker cannot locate cuMemAlloc_v2 when I build with my Makefile. I know this needs to be addressed (and I plan to), but just to get some simple code going I would like to try the familiar cudaMemcpy, cudaMalloc, etc., calls instead. Is this possible with a CUdeviceptr? Does the code need many modifications to do this?

Thank you again for the assistance.

To use the driver API, you have to

#include <cuda.h>

and link against

-lcuda

There are various sample codes that demonstrate proper usage of the driver API, such as vectorAddDrv.
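Putting those two requirements together, a build line might look like the following (illustrative file names; nvcc links the runtime library automatically, while -lcuda adds the driver library):

```shell
# compile a program that mixes runtime and driver API calls,
# linking the driver library for the cu* symbols
nvcc -o vectorAddDrv vectorAddDrv.cu -lcuda
```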

I feel like this was already addressed with the statement about simple casting from one to the other. It works both ways, AFAIK. Not sure what is unclear about this. Did you read the linked thread?

You can post a simple example of what you actually tried here, it may be clearer.

TBH, I probably don’t understand the issue. You have shown what looks to be entirely host code and CUDA device code. You’ve shown no origin for the CUdeviceptr. Why not just use cudaMalloc/cudaMemcpy and an ordinary pointer?


This is really puzzling:

So if I wanted to “use” that (not sure why, as the silliness will be immediately evident), along with all your other requests, I would:

#include <cuda.h>
...
CUdeviceptr dptr = 0;
float *d_array;
cudaMalloc(&d_array, sizeof(float)*N);
cudaMemcpy(d_array, h_A, sizeof(float)*N, cudaMemcpyHostToDevice);
dptr = (CUdeviceptr)d_array;
float *d_A = (float *)dptr;
myKernel<<<(N+255)/256, 256>>>(d_A, N);
cudaMemcpy(h_A, d_A, sizeof(float)*N, cudaMemcpyDeviceToHost);

and you may need to link against -lcuda. (For this silly case, I’m not really sure. For the general interop case, you would definitely need to.)


Thank you @Robert_Crovella for all your patience and assistance with my admittedly silly and simplistic questions. The answers you gave were exactly what I was looking for and I appreciate it.

If the origin of the CUdeviceptr is from some actual piece of driver API code (e.g. that starts with cuInit/cuCtxCreate/cuMemMalloc), then you’ll need to be aware of driver and runtime API differences in context handling, but that is presumably a question for another day.
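For completeness, one interop-friendly pattern in that situation is to retain the device's primary context, which is the same context the runtime API uses. This is a sketch under the assumption of a single-GPU program, with error checking omitted for brevity:

```cuda
#include <cuda.h>          // driver API: CUdeviceptr, cuMemAlloc, ...
#include <cuda_runtime.h>  // runtime API: kernel launch syntax
#include <cstdio>

__global__ void myKernel(float *array, const int N) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < N) array[id] += 1.0f;
}

int main() {
    const int N = 3;
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    // Use the primary context so runtime-API calls see the same context
    cuDevicePrimaryCtxRetain(&ctx, dev);
    cuCtxSetCurrent(ctx);

    // Driver-API allocation and host-to-device copy
    CUdeviceptr dptr;
    cuMemAlloc(&dptr, sizeof(float) * N);
    float h_A[N] = {0.0f, 1.0f, 2.0f};
    cuMemcpyHtoD(dptr, h_A, sizeof(float) * N);

    // Runtime-API kernel launch on driver-API memory, via a plain cast
    myKernel<<<(N + 255) / 256, 256>>>(reinterpret_cast<float *>(dptr), N);

    // Copy the results back and clean up
    cuMemcpyDtoH(h_A, dptr, sizeof(float) * N);
    printf("%f %f %f\n", h_A[0], h_A[1], h_A[2]);
    cuMemFree(dptr);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```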

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.