You are right, for large blocks that does work much better.
I was testing with relatively small blocks of 50 by 50 elements. It seems that the overhead for calling the cuda kernel and launching the threads for the small blocks is large. Here are the times I get for doing the operation 10000 times on rectangles of different sizes:
rectangle size   GPU [s]   CPU [s]
4x4              0.25      0
8x8              0.235     0
16x16            0.25      0
32x32            0.234     0.016
64x64            0.25      0.062
128x128          0.266     0.235
256x256          0.344     0.843
512x512          0.64      3.157
1024x1024        1.797     23.78
Is there a way to avoid the high constant cost for the smaller rectangles?
The launch overhead is pretty small - tens of microseconds. So I think there are still some issues in your code, since the 4x4 case takes a quarter of a second. Also, the very small cases don't really utilize all the hardware - you need at least 16 thread blocks, each with at least a couple hundred threads, for efficient utilization.
Memory coalescing is critical for small, memory-intensive kernels like yours. The difference between a coalesced and an uncoalesced version can be almost a factor of 10.
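To illustrate what the difference looks like in practice, here is a minimal sketch (not your actual code, just an illustrative copy kernel over a row-major width x height rectangle):

```cuda
// Coalesced: threads with consecutive threadIdx.x touch consecutive
// addresses, so each warp's loads merge into a few wide transactions.
__global__ void copy_coalesced(const float *in, float *out,
                               int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height)
        out[y * width + x] = in[y * width + x];
}

// Uncoalesced: the roles of x and y are swapped, so consecutive
// threads access addresses `width` floats apart, and each warp's
// loads split into many separate memory transactions.
__global__ void copy_uncoalesced(const float *in, float *out,
                                 int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // row
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // column
    if (x < height && y < width)
        out[x * width + y] = in[x * width + y];
}
```

The two kernels do identical work; only the mapping of threads to addresses differs, which is exactly the kind of block-size/indexing choice that can swing the run time by an order of magnitude.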
The times that I reported are for 10000 calls of the kernel. I should have divided the numbers by 10000 to be clearer: the 0.25 s therefore corresponds to 25 microseconds per call.
You are also right regarding coalescing. Depending on the block sizes I pick, I can get run times that are 10 times larger than the ones shown above.
But what I was wondering is whether there is a way to run the smaller cases faster, i.e. can I "keep the kernel loaded" or something? Right now I need a rectangle of 128x128 or larger for the GPU implementation to be faster.
Moreover, the times for rectangles smaller than 128x128 are all the same, which seems to indicate that the time spent in the call is mostly constant overhead. Is there a way to avoid this overhead?
In your code, one of the solutions would be to use float4 instead of float. You could read 4 floats from global memory at the same time, and then work on registers. I think it would give some speedup.
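A minimal sketch of the float4 idea (illustrative names; it assumes the element count is a multiple of 4 and the buffers are 16-byte aligned, which cudaMalloc guarantees):

```cuda
// Each thread issues one 16-byte load, works on four floats in
// registers, and issues one 16-byte store.
__global__ void scale_float4(const float4 *in, float4 *out,
                             int n4, float s)   // n4 = n / 4
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];                 // four floats per load
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        out[i] = v;                       // four floats per store
    }
}
```

This cuts the number of memory transactions per thread by four while keeping the accesses fully coalesced.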
I missed the part that the times were for 10000 repetitions. It makes sense then. There really isn't a way to avoid the launch overhead. You could try using the driver API, which may let you reuse some parameters, but I think the gain would not be significant.
Using the CPU for the small problems would be nice … but then having to have a special case based on size would be annoying. I will try to batch the operation for several rectangles in one kernel call and see if it helps.
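One way to sketch that batching (illustrative, assuming all rectangles in the batch have the same size and that you can pass a device array of pointers; the z grid dimension requires compute capability 2.0 or later):

```cuda
// One launch processes `count` independent rectangles: blockIdx.z
// selects the rectangle, so the launch overhead is paid once
// instead of `count` times.
__global__ void process_batch(float *const *rects, int width,
                              int height, float s)
{
    float *r = rects[blockIdx.z];   // this block's rectangle
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        r[y * width + x] *= s;      // placeholder operation
}

// Launch with grid.z = number of rectangles:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16, count);
// process_batch<<<grid, block>>>(d_rects, width, height, 2.0f);
```

As a bonus, batching many small rectangles also fills the machine better: a single 4x4 rectangle cannot occupy even one multiprocessor, but a few thousand of them can.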
Meanwhile, I measured the time it takes to launch an empty kernel with the CUDA runtime API vs. the CUDA driver API, and I got the following results:
Time for launching an empty kernel (overhead)

                    CUDA runtime API   CUDA driver API
1 float argument    7.84 us            5.67 us
2 float arguments   8.85 us            5.80 us
3 float arguments   9.97 us            6.24 us
4 float arguments   11.1 us            6.33 us
This seems to indicate that, on my system, the lower bound on kernel launch overhead is about 5.7 us.
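For reference, a measurement like this can be done as follows (a sketch, not necessarily how the numbers above were obtained): time many back-to-back launches with a host timer and divide, synchronizing only once at the end so the launches stay asynchronous as in normal use.

```cuda
#include <cstdio>
#include <chrono>

__global__ void empty_kernel(float a) {}

int main()
{
    empty_kernel<<<1, 1>>>(0.0f);   // warm-up; creates the context
    cudaDeviceSynchronize();

    const int reps = 10000;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < reps; ++i)
        empty_kernel<<<1, 1>>>(0.0f);
    cudaDeviceSynchronize();        // wait for the queued launches
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    // Note: this measures launch *throughput* (host-side submission
    // cost amortized over many launches), not the latency of a
    // single isolated launch.
    printf("avg launch overhead: %.2f us\n", us / reps);
    return 0;
}
```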