Fast update for rectangular memory region? How to add a scalar to a submatrix?


I am trying to use CUDA to update a simple 2D array, which I allocate using cudaMalloc. I want to add a scalar value to a rectangular region of the array, but the code is running much slower than a similar operation on the CPU.

I set up the array in device memory, then try to process the rectangle using one thread per row:

__global__ void add_d(
    float * grid, std::size_t pitch,
    std::size_t left, std::size_t bottom,
    std::size_t right, std::size_t top,
    float density)
{
    std::size_t const y = bottom + blockIdx.y * blockDim.y + threadIdx.y;
    float * row = grid + pitch * y;

    if (y <= top)
        for (std::size_t x = left; x <= right; ++x)
            row[x] += density;
}


I call the kernel using

dim3 dimBlock(1, BLOCK_SIZE);
dim3 dimComputeGrid(1, grid_size(height, dimBlock.y));
add_d<<<dimComputeGrid, dimBlock>>>(
    grid, pitch, left, bottom, right, top, density);

I also tried to do the same update “by columns” to try to coalesce my memory accesses, but it is still much slower than the CPU version.

I realize that the computation-to-memory-transfer ratio in this kernel is very small, but it seems that this kind of update operation should be very easy and fast to do on the GPU.

Can someone please tell me how to do this fast? Should I not be using CUDA for this? If not, what do you suggest?


You would likely do much better to assign one thread per array element, since your operation doesn’t have any neighborhood dependency.
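For example, a one-thread-per-element version might look like this (a sketch, not tested; it reuses your parameter names, assumes grid_size is your round-up helper, and assumes width = right - left + 1 and height = top - bottom + 1):

```cuda
__global__ void add_elem_d(
    float * grid, std::size_t pitch,
    std::size_t left, std::size_t bottom,
    std::size_t right, std::size_t top,
    float density)
{
    // One thread per element: adjacent threads in x touch adjacent
    // addresses within a row, so the global memory accesses coalesce.
    std::size_t const x = left   + blockIdx.x * blockDim.x + threadIdx.x;
    std::size_t const y = bottom + blockIdx.y * blockDim.y + threadIdx.y;

    if (x <= right && y <= top)
        grid[pitch * y + x] += density;
}

// Launched with a 2D grid covering the whole rectangle:
dim3 dimBlock(16, 16);
dim3 dimComputeGrid(grid_size(width, dimBlock.x), grid_size(height, dimBlock.y));
add_elem_d<<<dimComputeGrid, dimBlock>>>(grid, pitch, left, bottom, right, top, density);
```

This also removes the per-thread loop, so each thread does one load, one add, and one store.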

You are right, for large blocks that does work much better.

I was testing with relatively small blocks of 50 by 50 elements. It seems that the overhead of calling the CUDA kernel and launching the threads dominates for small blocks. Here are the times I get for doing the operation 10000 times on rectangles of different sizes:


rectangle size   GPU [s]   CPU [s]
4x4              0.25      0
8x8              0.235     0
16x16            0.25      0
32x32            0.234     0.016
64x64            0.25      0.062
128x128          0.266     0.235
256x256          0.344     0.843
512x512          0.64      3.157
1024x1024        1.797     23.78


Is there a way to avoid the high constant cost for the smaller rectangles?


The launch overhead is pretty small - tens of microseconds. So I think there are still some issues in your code, since the 4x4 case takes a quarter of a second. Also, the very small cases don’t really utilize all the hardware - you need at least 16 thread blocks, each with at least a couple hundred threads, for efficient utilization.

Memory coalescing is critical for small, memory-intensive kernels like yours. The difference between the coalesced and uncoalesced versions can be almost a factor of 10.


0.25 seconds is quite reasonable for “tens of microseconds” * 10000.

Actually, many of us consider that to be quite large.
For example, it means the CPU can read back, scan, and write back a roughly 4 KB GPU int array in less time than it takes to launch an empty kernel.

Hi paulius and asadafag,

Thank you very much for replying.

The times that I reported are for 10000 calls of the kernel. I should have divided the numbers by 10000 to be clearer. The 0.25 s therefore corresponds to 25 microseconds per call.

You are also right regarding coalescing. Depending on the block sizes I pick I can get run times that are 10 times larger than the ones I show above.

But what I was wondering is whether there is a way to run the smaller cases faster, i.e., can I “keep the kernel loaded” or something? Right now I need a rectangle of 128x128 or larger for the GPU implementation to be faster.

Moreover, the times for rectangles smaller than 128x128 are all the same which seems to indicate that the time spent in the call is mostly constant overhead. Is there a way to avoid this overhead?


In your code one of the solutions would be using float4 instead of float. You could read 4 floats from global memory at the same time, and then work on registers. I think it would give some speedup.
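Roughly, the idea looks like this (a sketch under assumptions: the pitch and the rectangle’s left edge are aligned to 4 floats, the rectangle width is a multiple of 4, and the indices here are in float4 units - none of this is from the original code):

```cuda
__global__ void add_d4(
    float4 * grid, std::size_t pitch4,   // pitch in float4 elements
    std::size_t left4, std::size_t bottom,
    std::size_t right4, std::size_t top,
    float density)
{
    std::size_t const x = left4  + blockIdx.x * blockDim.x + threadIdx.x;
    std::size_t const y = bottom + blockIdx.y * blockDim.y + threadIdx.y;

    if (x <= right4 && y <= top) {
        // One 16-byte load, four adds in registers, one 16-byte store.
        float4 v = grid[pitch4 * y + x];
        v.x += density; v.y += density;
        v.z += density; v.w += density;
        grid[pitch4 * y + x] = v;
    }
}
```

Each thread then moves 16 bytes per memory transaction instead of 4, which helps a memory-bound kernel like this one.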


I missed the part that times were for 10000 repetitions. It makes sense then. There really isn’t a way to avoid the launch overhead. You could try using the driver API, which may let you reuse some parameters, but I think the gain would not be significant.


This once helped me reduce a roughly 70 us overhead to 60 us. The main benefit seemed to come from reusing texture parameters, though. And it is not likely to help your 25 us.

The driver seems to be queuing kernel calls in your case, while in my case that’s impossible since I need a readback after each call. You’re already getting near top performance in your test case.

Driver overhead for a small memcpy between CPU and GPU is ~2 us, so you could consider using the CPU for really small problems. But again, that may not be helpful for 2D problems.

Using the CPU for the small problems would be nice … but then having a special case based on size would be annoying. I will try to batch the operation for several rectangles in one kernel and see if it helps.
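The batching I have in mind would look something like this (just a sketch; the rect struct and names are made up, blockIdx.x picks the rectangle, and the grid’s y-dimension must cover the tallest rectangle in the batch):

```cuda
struct rect { std::size_t left, bottom, right, top; };

__global__ void add_batch_d(
    float * grid, std::size_t pitch,
    rect const * rects,   // one rectangle per blockIdx.x
    float density)
{
    rect const r = rects[blockIdx.x];
    std::size_t const y = r.bottom + blockIdx.y * blockDim.y + threadIdx.y;

    if (y <= r.top)
        // Stride across the row so any blockDim.x covers any width.
        for (std::size_t x = r.left + threadIdx.x; x <= r.right; x += blockDim.x)
            grid[pitch * y + x] += density;
}
```

That way the per-launch overhead is amortized over all the rectangles in the batch.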

Meanwhile I measured the time it takes to launch an empty kernel with the CUDA runtime API vs. the CUDA driver API, and I got the following results:

Time for launching an empty kernel (overhead):

                    CUDA runtime API   CUDA driver API
1 float argument         7.84 us           5.67 us
2 float arguments        8.85 us           5.80 us
3 float arguments        9.97 us           6.24 us
4 float arguments       11.1 us            6.33 us

These numbers seem to indicate that, on my system, the lower bound on kernel launch overhead is about 5.7 us.

Thank you for all the answers and help!