Mapped address consistency in Interop-ed buffer/array

Hello,
I’m writing an application which uses CUDA/OpenGL interoperability.

I understand that the documentation for cuGraphicsResourceGetMappedPointer, cuGraphicsResourceGetMappedMipmappedArray and cuGraphicsSubResourceGetMappedArray says that a mapped pointer or a mapped (mipmapped) array may change every time the resource is mapped.
e.g. cuGraphicsResourceGetMappedPointer:

The value set in pPointer may change every time that resource is mapped.

Is this true even when I don’t change the buffer size or array dimensions between map/unmap cycles?

I wouldn’t be surprised if the mapped pointer changed after I reallocate or resize the original buffer, but in the case where I neither reallocate nor resize (I just update the buffer contents), it would be useful to have a guarantee that the mapped pointer doesn’t change.

Thanks.

Yes.

Thanks for the quick reply.

Next question is:
If we want to read/write an interop-ed texture from CUDA every frame, we need to create a surface object and a texture object every frame, because the CUarray obtained from the mapped resource can itself change every frame. Is this correct? (I have both CUDA kernel dispatches and OpenGL calls every frame.)
If this is correct, I’m curious about the performance of such object creation. Is it healthy to create a surface object/texture object every frame? Does it involve any implicit stream synchronization?

I don’t think I can grok what you’re asking.

If I wanted to make a texture available for both reading and writing on the CUDA side, I would just use a surface.

If I wanted to do this frame-by-frame, I would probably set up two surfaces and ping-pong between them.

Regardless of mechanism, the OpenGL<->CUDA handoff involves synchronization.

I don’t have technical details of surface creation (or texture object creation), beyond what is published. It may very well have an effect like cudaMalloc.
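
To make the ping-pong idea concrete, here is a minimal sketch using two plain CUDA arrays/surfaces, with no interop involved; the kernel, format, and sizes are just placeholders:

// ping-pong sketch: each step reads one surface and writes the other, then the roles swap
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

__global__ void step(cudaSurfaceObject_t in, cudaSurfaceObject_t out, int w, int h){
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x < w && y < h){
        float v;
        surf2Dread(&v, in, x*sizeof(float), y);
        surf2Dwrite(v + 1.0f, out, x*sizeof(float), y);
    }
}

int main(){
    const int w = 1024, h = 1024;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray_t arr[2];
    cudaSurfaceObject_t surf[2];
    for (int i = 0; i < 2; i++){
        cudaMallocArray(&arr[i], &desc, w, h, cudaArraySurfaceLoadStore);
        cudaResourceDesc res;
        memset(&res, 0, sizeof(res));
        res.resType = cudaResourceTypeArray;
        res.res.array.array = arr[i];
        cudaCreateSurfaceObject(&surf[i], &res);   // created once, before the frame loop
    }
    // zero-initialize the first array so the ping-pong starts from known data
    std::vector<float> zeros(w*h, 0.0f);
    cudaMemcpy2DToArray(arr[0], 0, 0, zeros.data(), w*sizeof(float), w*sizeof(float), h, cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((w+15)/16, (h+15)/16);
    int src = 0;
    for (int frame = 0; frame < 10; frame++){
        step<<<grid, block>>>(surf[src], surf[1-src], w, h);   // read src, write the other
        src = 1 - src;                                          // swap roles for the next frame
    }
    cudaDeviceSynchronize();
    float check = -1.0f;   // read back element (0,0) of the most recently written array
    cudaMemcpy2DFromArray(&check, sizeof(float), arr[src], 0, 0, sizeof(float), 1, cudaMemcpyDeviceToHost);
    printf("element (0,0) after 10 steps: %f (expect 10), last error: %s\n", check, cudaGetErrorString(cudaGetLastError()));
    for (int i = 0; i < 2; i++){
        cudaDestroySurfaceObject(surf[i]);
        cudaFreeArray(arr[i]);
    }
    return 0;
}

The point is simply that the two surface objects are created once, outside the frame loop, and only kernel launches happen per frame.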

  • Surface object and texture object creation requires an array.
  • Using CUDA after OpenGL calls requires mapping the resource.
  • Mapping the resource may produce a different array each time.

Therefore, if I understand correctly, it seems impossible to create just two surface objects for ping-pong before the rendering loop; the per-frame structure has to look something like the sketch below.
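
To make sure we are talking about the same thing, here is a runnable sketch of the per-frame structure I believe I’m forced into (CUDA runtime API; GLFW provides a hidden OpenGL context standing in for my real renderer; the texture format, sizes and kernel are only illustrative, and error checking is omitted):

// per-frame structure: map -> get array -> create surface -> launch -> destroy -> unmap
#include <GLFW/glfw3.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <cstdio>
#include <cstring>

__global__ void touch(cudaSurfaceObject_t s, int w, int h){
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x < w && y < h)
        surf2Dwrite(make_uchar4(255, 0, 0, 255), s, x*4, y);   // x is a byte offset (uchar4 = 4 bytes)
}

int main(){
    const int w = 1024, h = 1024;
    if (!glfwInit()) return 1;
    glfwWindowHint(GLFW_VISIBLE, GLFW_FALSE);              // hidden window, just to get a GL context
    GLFWwindow *win = glfwCreateWindow(64, 64, "interop", NULL, NULL);
    glfwMakeContextCurrent(win);

    GLuint tex;                                            // the OpenGL texture shared with CUDA
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    cudaGraphicsResource_t res;
    cudaGraphicsGLRegisterImage(&res, tex, GL_TEXTURE_2D, cudaGraphicsRegisterFlagsSurfaceLoadStore);

    dim3 block(16, 16), grid((w+15)/16, (h+15)/16);
    for (int frame = 0; frame < 3; frame++){
        // ... OpenGL rendering for this frame would go here ...
        cudaGraphicsMapResources(1, &res, 0);
        cudaArray_t arr;
        cudaGraphicsSubResourceGetMappedArray(&arr, res, 0, 0);   // may differ from the previous frame
        cudaResourceDesc rd;
        memset(&rd, 0, sizeof(rd));
        rd.resType = cudaResourceTypeArray;
        rd.res.array.array = arr;
        cudaSurfaceObject_t surf;
        cudaCreateSurfaceObject(&surf, &rd);                      // recreated every frame
        touch<<<grid, block>>>(surf, w, h);
        cudaStreamSynchronize(0);   // wait for the kernel before destroying the per-frame surface object
        cudaDestroySurfaceObject(surf);
        cudaGraphicsUnmapResources(1, &res, 0);
    }
    cudaDeviceSynchronize();
    printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaGraphicsUnregisterResource(res);
    glfwDestroyWindow(win);
    glfwTerminate();
    return 0;
}

I build it with something like nvcc -o sketch sketch.cu -lGL -lglfw. The synchronize before destroying the per-frame surface object is only there to keep the sketch obviously correct; in the real application I would want to avoid exactly that kind of wait.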

GPU-GPU synchronization when mapping the resource (i.e. between OpenGL shaders and CUDA dispatches) is not a problem. My concern is whether calling the mapping function on the host side incurs CPU-GPU synchronization, such as waiting for all the CUDA dispatches already issued, regardless of the CUstream passed.

Yes, you need to create the surface object each time you map the resource (acquire the CUDA array) from the OpenGL side. That is part of the hand-off from OpenGL to CUDA, and that hand-off involves synchronization.

I’m not aware, for example, that calling cudaCreateSurfaceObject has its own synchronizing effect (like cudaDeviceSynchronize()), but it may.

From what I can tell, cudaCreateSurfaceObject doesn’t seem to synchronize the CPU thread against pending GPU work:

$ cat t2023.cu

// Includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// Includes CUDA
#include <cuda_runtime.h>

// Utilities and timing functions
#include <helper_functions.h>    // includes cuda.h and cuda_runtime_api.h

// CUDA helper functions
#include <helper_cuda.h>         // helper functions for CUDA error check
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
#define DELAY_CYCLES 10000000000ULL
// busy-wait kernel: spins for approximately `delay` clock cycles to keep the GPU occupied
__global__ void delay_kernel(unsigned long long delay){
  unsigned long long start = clock64();
  while (clock64() < (start+delay));
}

int main(){
    int width = 1024;
    int height = 1024;
    unsigned long long start = dtime_usec(0);
    // Allocate array and copy image data
    cudaChannelFormatDesc channelDesc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray *cuArray;
    checkCudaErrors(cudaMallocArray(&cuArray,
                                    &channelDesc,
                                    width,
                                    height,
                                    cudaArraySurfaceLoadStore));

    dim3 dimBlock(8, 8, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

    cudaSurfaceObject_t outputSurface;
    cudaResourceDesc    surfRes;
    memset(&surfRes, 0, sizeof(cudaResourceDesc));
    surfRes.resType = cudaResourceTypeArray;
    surfRes.res.array.array = cuArray;
    delay_kernel<<<1,1>>>(DELAY_CYCLES);   // several seconds of GPU work, issued first
    checkCudaErrors(cudaCreateSurfaceObject(&outputSurface, &surfRes));
    unsigned long long dk = dtime_usec(start);
    printf("surface object creation completed at: %llu\n", dk);
    checkCudaErrors(cudaDeviceSynchronize());
    unsigned long long ds = dtime_usec(start);
    printf("kernel completed at: %llu\n", ds);

    checkCudaErrors(cudaDestroySurfaceObject(outputSurface));
    checkCudaErrors(cudaFreeArray(cuArray));
}
$ nvcc -o t2023 t2023.cu -I/usr/local/cuda/samples/common/inc
$ ./t2023
surface object creation completed at: 323261
kernel completed at: 7558553
$

CUDA 11.4, Tesla V100, driver 470.57.02, CentOS 7

What would seem logical to me is that the mapping function synchronizes on pending OpenGL work, and the unmapping function synchronizes on pending CUDA work, but I haven’t checked that out. You might be able to use a methodology similar to the one above to answer that.
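
If you do, here is an untested sketch of how I might adapt the same idea to the map/unmap calls themselves, using a registered OpenGL texture; it assumes GLFW for a hidden GL context (built with -lGL -lglfw, like your sketch) and that the GL context lands on the CUDA device, with error checking omitted for brevity:

// sketch: time cudaGraphicsMapResources / cudaGraphicsUnmapResources against
// a long-running kernel that is already in flight
#include <GLFW/glfw3.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <cstdio>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
#define DELAY_CYCLES 10000000000ULL
__global__ void delay_kernel(unsigned long long delay){
    unsigned long long start = clock64();
    while (clock64() < (start+delay));
}

int main(){
    if (!glfwInit()) return 1;
    glfwWindowHint(GLFW_VISIBLE, GLFW_FALSE);          // hidden window, just to get a GL context
    GLFWwindow *win = glfwCreateWindow(64, 64, "timing", NULL, NULL);
    glfwMakeContextCurrent(win);

    GLuint tex;                                        // a GL texture to register with CUDA
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    cudaGraphicsResource_t res;
    cudaGraphicsGLRegisterImage(&res, tex, GL_TEXTURE_2D, cudaGraphicsRegisterFlagsNone);

    unsigned long long start = dtime_usec(0);
    delay_kernel<<<1,1>>>(DELAY_CYCLES);               // several seconds of CUDA work, already issued
    cudaGraphicsMapResources(1, &res, 0);
    printf("map returned at:       %llu\n", dtime_usec(start));
    cudaArray_t arr;
    cudaGraphicsSubResourceGetMappedArray(&arr, res, 0, 0);
    printf("get array returned at: %llu\n", dtime_usec(start));
    cudaGraphicsUnmapResources(1, &res, 0);
    printf("unmap returned at:     %llu\n", dtime_usec(start));
    cudaDeviceSynchronize();
    printf("kernel completed at:   %llu\n", dtime_usec(start));

    cudaGraphicsUnregisterResource(res);
    glfwDestroyWindow(win);
    glfwTerminate();
    return 0;
}

If the map/unmap timestamps come back while the delay kernel is still running, those calls are not blocking the host on previously issued CUDA work; if they only come back after several seconds, they are.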

Thanks for the code.

I’ll try a similar measurement on my side.

Let me mark this issue as resolved.
