Mapped address consistency in Interop-ed buffer/array

Hello,
I’m writing an application which uses CUDA/OpenGL interoperability.

I understand that the documentation for cuGraphicsResourceGetMappedPointer, cuGraphicsResourceGetMappedMipmappedArray and cuGraphicsSubResourceGetMappedArray says that a mapped pointer or a mapped (mipmapped) array may change every time the resource is mapped.
e.g. cuGraphicsResourceGetMappedPointer:

The value set in pPointer may change every time that resource is mapped.

Is this true even when I don’t change the buffer size or array dimensions between map/unmap cycles?

I wouldn’t be surprised if the mapped pointer changed after I reallocate or resize the original buffer, but in the case where I neither reallocate nor resize (I just update the buffer contents), it would be useful to have a guarantee that the mapped pointer doesn’t change.

Thanks.

Yes.

Thanks for the quick reply.

Next question is:
If we want to read/write an interop-ed texture from CUDA every frame, we need to create a surface object and a texture object every frame, because the CUarray obtained from the mapped resource can itself change every frame. Is this correct? (I have both CUDA kernel dispatches and OpenGL calls every frame.)
If this is correct, I’m curious about the performance of such object creation. Is it healthy to create a surface object/texture object every frame? Does it involve any implicit stream synchronization?

I don’t think I can grok what you’re asking.

If I wanted to make a texture available for both reading and writing on the CUDA side, I would just use a surface.

If I wanted to do this frame-by-frame, I would probably set up two surfaces and ping-pong between them.

Regardless of mechanism, the OpenGL<->CUDA handoff involves synchronization.

I don’t have technical details of surface creation (or texture object creation), beyond what is published. It may very well have an effect like cudaMalloc.
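
To make the ping-pong idea concrete, here is a minimal sketch using two plain CUDA arrays/surfaces, with no interop involved; the kernel, format, and sizes are just placeholders:

// ping-pong sketch: each step reads one surface and writes the other, then the roles swap
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

__global__ void step(cudaSurfaceObject_t in, cudaSurfaceObject_t out, int w, int h){
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x < w && y < h){
        float v;
        surf2Dread(&v, in, x*sizeof(float), y);
        surf2Dwrite(v + 1.0f, out, x*sizeof(float), y);
    }
}

int main(){
    const int w = 1024, h = 1024;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray_t arr[2];
    cudaSurfaceObject_t surf[2];
    for (int i = 0; i < 2; i++){
        cudaMallocArray(&arr[i], &desc, w, h, cudaArraySurfaceLoadStore);
        cudaResourceDesc res;
        memset(&res, 0, sizeof(res));
        res.resType = cudaResourceTypeArray;
        res.res.array.array = arr[i];
        cudaCreateSurfaceObject(&surf[i], &res);   // created once, before the frame loop
    }
    // zero-initialize the first array so the ping-pong starts from known data
    std::vector<float> zeros(w*h, 0.0f);
    cudaMemcpy2DToArray(arr[0], 0, 0, zeros.data(), w*sizeof(float), w*sizeof(float), h, cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((w+15)/16, (h+15)/16);
    int src = 0;
    for (int frame = 0; frame < 10; frame++){
        step<<<grid, block>>>(surf[src], surf[1-src], w, h);   // read src, write the other
        src = 1 - src;                                          // swap roles for the next frame
    }
    cudaDeviceSynchronize();
    float check = -1.0f;   // read back element (0,0) of the most recently written array
    cudaMemcpy2DFromArray(&check, sizeof(float), arr[src], 0, 0, sizeof(float), 1, cudaMemcpyDeviceToHost);
    printf("element (0,0) after 10 steps: %f (expect 10), last error: %s\n", check, cudaGetErrorString(cudaGetLastError()));
    for (int i = 0; i < 2; i++){
        cudaDestroySurfaceObject(surf[i]);
        cudaFreeArray(arr[i]);
    }
    return 0;
}

The point is simply that the two surface objects are created once, outside the frame loop, and only kernel launches happen per frame.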

  • Surface object and texture object creation requires an array.
  • Using CUDA after OpenGL calls requires mapping the resource.
  • Mapping the resource may produce a different array each time.

Therefore, if I understand correctly, it seems impossible to create just two surface objects for ping-pong before the rendering loop; the per-frame structure has to look something like the sketch below.
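
To make sure we are talking about the same thing, here is a runnable sketch of the per-frame structure I believe I’m forced into (CUDA runtime API; GLFW provides a hidden OpenGL context standing in for my real renderer; the texture format, sizes and kernel are only illustrative, and error checking is omitted):

// per-frame structure: map -> get array -> create surface -> launch -> destroy -> unmap
#include <GLFW/glfw3.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <cstdio>
#include <cstring>

__global__ void touch(cudaSurfaceObject_t s, int w, int h){
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x < w && y < h)
        surf2Dwrite(make_uchar4(255, 0, 0, 255), s, x*4, y);   // x is a byte offset (uchar4 = 4 bytes)
}

int main(){
    const int w = 1024, h = 1024;
    if (!glfwInit()) return 1;
    glfwWindowHint(GLFW_VISIBLE, GLFW_FALSE);              // hidden window, just to get a GL context
    GLFWwindow *win = glfwCreateWindow(64, 64, "interop", NULL, NULL);
    glfwMakeContextCurrent(win);

    GLuint tex;                                            // the OpenGL texture shared with CUDA
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    cudaGraphicsResource_t res;
    cudaGraphicsGLRegisterImage(&res, tex, GL_TEXTURE_2D, cudaGraphicsRegisterFlagsSurfaceLoadStore);

    dim3 block(16, 16), grid((w+15)/16, (h+15)/16);
    for (int frame = 0; frame < 3; frame++){
        // ... OpenGL rendering for this frame would go here ...
        cudaGraphicsMapResources(1, &res, 0);
        cudaArray_t arr;
        cudaGraphicsSubResourceGetMappedArray(&arr, res, 0, 0);   // may differ from the previous frame
        cudaResourceDesc rd;
        memset(&rd, 0, sizeof(rd));
        rd.resType = cudaResourceTypeArray;
        rd.res.array.array = arr;
        cudaSurfaceObject_t surf;
        cudaCreateSurfaceObject(&surf, &rd);                      // recreated every frame
        touch<<<grid, block>>>(surf, w, h);
        cudaStreamSynchronize(0);   // wait for the kernel before destroying the per-frame surface object
        cudaDestroySurfaceObject(surf);
        cudaGraphicsUnmapResources(1, &res, 0);
    }
    cudaDeviceSynchronize();
    printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaGraphicsUnregisterResource(res);
    glfwDestroyWindow(win);
    glfwTerminate();
    return 0;
}

I build it with something like nvcc -o sketch sketch.cu -lGL -lglfw. The synchronize before destroying the per-frame surface object is only there to keep the sketch obviously correct; in the real application I would want to avoid exactly that kind of wait.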

GPU-GPU synchronization when mapping the resource (i.e. between OpenGL shaders and CUDA dispatches) is not a problem. My concern is whether calling the mapping function on the host side incurs CPU-GPU synchronization, such as waiting for all the CUDA dispatches already issued, regardless of the CUstream passed.

Yes, you need to create the surface object each time you map the resource (acquire the CUDA array) from the OpenGL side. That is part of the hand-off from OpenGL to CUDA, and that hand-off involves synchronization.

I’m not aware, for example, that calling cudaCreateSurfaceObject has its own synchronizing effect (like cudaDeviceSynchronize()), but it may.

From what I can tell, cudaCreateSurfaceObject doesn’t seem to synchronize the CPU thread against pending GPU work:

$ cat t2023.cu

// Includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// Includes CUDA
#include <cuda_runtime.h>

// Utilities and timing functions
#include <helper_functions.h>    // includes cuda.h and cuda_runtime_api.h

// CUDA helper functions
#include <helper_cuda.h>         // helper functions for CUDA error check
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
#define DELAY_CYCLES 10000000000ULL
// busy-wait kernel: spins for approximately `delay` clock cycles to keep the GPU occupied
__global__ void delay_kernel(unsigned long long delay){
  unsigned long long start = clock64();
  while (clock64() < (start+delay));
}

int main(){
    int width = 1024;
    int height = 1024;
    unsigned long long start = dtime_usec(0);
    // Allocate array and copy image data
    cudaChannelFormatDesc channelDesc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray *cuArray;
    checkCudaErrors(cudaMallocArray(&cuArray,
                                    &channelDesc,
                                    width,
                                    height,
                                    cudaArraySurfaceLoadStore));

    dim3 dimBlock(8, 8, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

    cudaSurfaceObject_t outputSurface;
    cudaResourceDesc    surfRes;
    memset(&surfRes, 0, sizeof(cudaResourceDesc));
    surfRes.resType = cudaResourceTypeArray;
    surfRes.res.array.array = cuArray;
    delay_kernel<<<1,1>>>(DELAY_CYCLES);   // several seconds of GPU work, issued first
    checkCudaErrors(cudaCreateSurfaceObject(&outputSurface, &surfRes));
    unsigned long long dk = dtime_usec(start);
    printf("surface object creation completed at: %llu\n", dk);
    checkCudaErrors(cudaDeviceSynchronize());
    unsigned long long ds = dtime_usec(start);
    printf("kernel completed at: %llu\n", ds);

    checkCudaErrors(cudaDestroySurfaceObject(outputSurface));
    checkCudaErrors(cudaFreeArray(cuArray));
}
$ nvcc -o t2023 t2023.cu -I/usr/local/cuda/samples/common/inc
$ ./t2023
surface object creation completed at: 323261
kernel completed at: 7558553
$

CUDA 11.4, Tesla V100, driver 470.57.02, CentOS 7

What would seem logical to me is that the mapping function synchronizes on pending OpenGL work, and the unmapping function synchronizes on pending CUDA work, but I haven’t checked that out. You might be able to use a methodology similar to the one above to answer that.
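
If you do, here is an untested sketch of how I might adapt the same idea to the map/unmap calls themselves, using a registered OpenGL texture; it assumes GLFW for a hidden GL context (built with -lGL -lglfw, like your sketch) and that the GL context lands on the CUDA device, with error checking omitted for brevity:

// sketch: time cudaGraphicsMapResources / cudaGraphicsUnmapResources against
// a long-running kernel that is already in flight
#include <GLFW/glfw3.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <cstdio>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
#define DELAY_CYCLES 10000000000ULL
__global__ void delay_kernel(unsigned long long delay){
    unsigned long long start = clock64();
    while (clock64() < (start+delay));
}

int main(){
    if (!glfwInit()) return 1;
    glfwWindowHint(GLFW_VISIBLE, GLFW_FALSE);          // hidden window, just to get a GL context
    GLFWwindow *win = glfwCreateWindow(64, 64, "timing", NULL, NULL);
    glfwMakeContextCurrent(win);

    GLuint tex;                                        // a GL texture to register with CUDA
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    cudaGraphicsResource_t res;
    cudaGraphicsGLRegisterImage(&res, tex, GL_TEXTURE_2D, cudaGraphicsRegisterFlagsNone);

    unsigned long long start = dtime_usec(0);
    delay_kernel<<<1,1>>>(DELAY_CYCLES);               // several seconds of CUDA work, already issued
    cudaGraphicsMapResources(1, &res, 0);
    printf("map returned at:       %llu\n", dtime_usec(start));
    cudaArray_t arr;
    cudaGraphicsSubResourceGetMappedArray(&arr, res, 0, 0);
    printf("get array returned at: %llu\n", dtime_usec(start));
    cudaGraphicsUnmapResources(1, &res, 0);
    printf("unmap returned at:     %llu\n", dtime_usec(start));
    cudaDeviceSynchronize();
    printf("kernel completed at:   %llu\n", dtime_usec(start));

    cudaGraphicsUnregisterResource(res);
    glfwDestroyWindow(win);
    glfwTerminate();
    return 0;
}

If the map/unmap timestamps come back while the delay kernel is still running, those calls are not blocking the host on previously issued CUDA work; if they only come back after several seconds, they are.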

Thanks for the code.

I’ll try a similar measurement on my side.

Let me mark this issue as resolved.
