Copy image tiles to const memory with cudaMemcpyToSymbol

Hello.
I need to optimize memory access to specific tiles in a large image. Let's say the tile is at position (tx, ty) in the image and has size (DX, DY). I want to copy this tile to a __constant__ memory buffer. I know this copy can be done with cudaMemcpyToSymbol, both host-to-device and device-to-device, the latter being my case.
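
For context, the assumed setup looks roughly like this (the sizes are placeholders, and note that DX and DY must be compile-time constants for a __constant__ array; __constant__ memory totals 64 KB per device):

    #define DX 128   // tile width in bytes (placeholder)
    #define DY 128   // tile height in rows (placeholder)

    __constant__ unsigned char pixels_const_gpu[DX * DY]; // tile staged in constant memory
    unsigned char *img_gpu;   // device image, already offset to the tile origin (tx, ty)
                              // w = image row pitch in bytes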

So far so good, but my problem is that the cudaMemcpyToSymbol and cudaMemcpyToSymbolAsync APIs do not allow the copy to be made in a single call, so I had to do it like this:

    // copy the tile row by row: DX bytes per row, at increasing byte offsets
    // into the symbol; w is the image row pitch in bytes
    for (int i = 0; i < DY; i++)
    {
        CHECK_CUDA(cudaMemcpyToSymbolAsync(pixels_const_gpu, img_gpu, DX, DX * i,
                                           cudaMemcpyDeviceToDevice));
        img_gpu += w;  // advance to the next image row
    }

In this way, I copy the tile row by row to the __constant__ memory buffer, where DX and DY are the tile width and height. Even though this seems slightly faster, these copies are eating up my optimization gains, because DY can be quite large and I am doing this many times. In the end, there are simply too many calls to cudaMemcpyToSymbolAsync.

So my question: is there a better and faster way to dynamically copy the tile to constant GPU memory?

I’m not sure it will be any faster, but it should be possible to use cudaGetSymbolAddress to make the symbol usable as an ordinary destination pointer in a cudaMemcpyXXX operation. Then you could use a single cudaMemcpy2D to do the strided copy.
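
A minimal sketch of that approach, using the names from your post (w is assumed to be the image row pitch in bytes):

    void *sym_ptr = nullptr;
    CHECK_CUDA(cudaGetSymbolAddress(&sym_ptr, pixels_const_gpu));
    CHECK_CUDA(cudaMemcpy2D(sym_ptr, DX,    // dst, dst pitch in bytes (contiguous tile)
                            img_gpu, w,     // src, src pitch in bytes (image row stride)
                            DX, DY,         // copy width in bytes, height in rows
                            cudaMemcpyDeviceToDevice));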

A strided copy like that will still be generally slower than an ordinary contiguous bulk copy, for the same quantity of bytes.

Finally, you could do a 2-stage copy: use cudaMemcpy2D into an ordinary device buffer to perform the strided-to-contiguous conversion, then do a contiguous copy to the symbol with a single cudaMemcpyToSymbol call.
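
Something like this, as a sketch (tile_staging is a hypothetical pre-allocated device buffer of DX * DY bytes):

    // stage 1: strided -> contiguous, into an ordinary device buffer
    CHECK_CUDA(cudaMemcpy2D(tile_staging, DX, img_gpu, w, DX, DY,
                            cudaMemcpyDeviceToDevice));
    // stage 2: one contiguous bulk copy to the symbol
    CHECK_CUDA(cudaMemcpyToSymbol(pixels_const_gpu, tile_staging, DX * DY, 0,
                                  cudaMemcpyDeviceToDevice));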

Just for the usual disclaimers: __constant__ memory is optimized for the uniform-access case. That means the expectation for best performance is that when threads in a warp are reading from constant memory, in any given instruction or cycle, all threads are reading from the same location in constant memory.
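
To illustrate the distinction, a contrived example (cdata is a hypothetical __constant__ array):

    __constant__ float cdata[1024];

    __global__ void access_patterns(float *out, int k)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        float a = cdata[k];          // uniform: all threads in the warp read the
                                     // same address, served by a single broadcast
        float b = cdata[idx % 1024]; // non-uniform: distinct addresses within the
                                     // warp serialize into multiple transactions
        out[idx] = a + b;
    }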

Hello, Robert.
Thanks. I tested the solution with cudaMemcpy2D and cudaGetSymbolAddress, and yes, it is faster, but not by the margin I expected. I wrote my own function for the 2D memory copy, which is even better, but the overall effect of this optimization is still only 10-15% at most compared to plain global memory access, and it shrinks as the image size grows. I expected a much larger performance gain.

You wrote: "__constant__ memory best performance is when threads in a warp are reading from constant memory, in any given instruction or cycle, all threads are reading from the same location in constant memory". My case is exactly like this: many warps from different blocks access this __constant__ memory at the same time, so I expected a significant effect. But can you clarify whether this means access to the same address in __constant__ memory from all threads in the warp, or just any address within the __constant__ memory block? If the threads in a warp are, for example, each accessing consecutive memory addresses, shouldn't that be better?

Yes, by “the same location” I meant “the same address”. If adjacent threads are reading adjacent locations (or following any access pattern other than uniform), then __constant__ memory won’t give the expected performance benefit.

If you want to read adjacent locations in adjacent threads from “read-only” global data, you could try to take advantage of the so-called “read-only” cache. Pass a pointer to that data to your kernel like so:

 __global__ void kernel(const float * __restrict__ my_ro_data, ...)

and you should generally decorate all other pointer kernel parameters in that prototype with __restrict__ as well, using the same syntactic pattern. This gives the compiler the additional information it needs to load the decorated data through the RO-cache mechanism. This mechanism does not suffer the performance loss for non-uniform access patterns that __constant__ does. It will probably still benefit, however, from the same kind of coalesced access patterns (such as adjacent threads reading adjacent locations) that ordinary global access benefits from. It also has no particular size limit like __constant__, so there is probably no need to slice and dice your data.
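
A fuller sketch of the pattern (the names are illustrative, not from your code):

    __global__ void kernel(const float * __restrict__ my_ro_data,
                           float * __restrict__ out_data,
                           int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            // const + __restrict__ lets the compiler route this load
            // through the read-only cache (equivalent to using __ldg())
            out_data[idx] = 2.0f * my_ro_data[idx];
    }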

Thanks again for this explanation.
I use several buffers for image pixel reconstruction algorithms, and they are expected to work exactly as you describe: all threads from many warps (in many blocks) read one RGB value from the tile at the same time, then jump 3 bytes to read the next adjacent RGB value, and so on until the whole DX x DY tile is traversed. Hence my assumption that prefetching the tile into the __constant__ pixel buffer would increase performance greatly.
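
Roughly, the kernel access pattern looks like this (a simplified sketch; the names and sizes are placeholders):

    #define TILE_BYTES (128 * 128 * 3)  // placeholder; must fit within 64 KB of __constant__
    __constant__ unsigned char pixels_const_gpu[TILE_BYTES];

    __global__ void reconstruct(float *out, int tile_bytes)
    {
        float acc = 0.0f;
        // every thread walks the tile in lock-step, so each 3-byte RGB read
        // is uniform across the warp -- the pattern __constant__ favors
        for (int i = 0; i + 2 < tile_bytes; i += 3) {
            acc += pixels_const_gpu[i]       // R
                 + pixels_const_gpu[i + 1]   // G
                 + pixels_const_gpu[i + 2];  // B
        }
        out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
    }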

About the __restrict__ optimization: I tested it as you described, but so far it seems to have no noticeable effect. Maybe it works differently on more recent video cards and CUDA compilers (I use an RTX 3060, Ampere, compute capability 8.6, CUDA 11.8). I also tried some shared-vs-global memory optimizations with tiles for image filters (I will ask about those in a separate thread).
