Both cuBLAS and my GEMM/convolution kernels shuffle things around in shared memory prior to writing out to global. If you can do it in a way such that the shuffle addresses all stay within the same region for each warp, you don't need __syncthreads(). Shared memory access is very fast compared to global and the latencies can often be hidden by TLP, so it's most likely going to be faster for you, particularly if you can avoid the __syncthreads().
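To make the "warp-private region" idea concrete, here's a minimal sketch (the reversal permutation and the 4-warp block size are just illustrative stand-ins, not anything from cuBLAS): each warp only ever touches its own slice of shared memory, so __syncthreads() is never needed; a __syncwarp() covers the intra-warp dependency on Volta and later.

```cuda
#define WARPS 4  // assumes blockDim.x == WARPS * 32

// Each warp reorders its own 32 values through a warp-private shared slice.
// No warp reads another warp's slice, so no __syncthreads() is required.
__global__ void warpLocalShuffle(float *out, const float *in)
{
    __shared__ float stage[WARPS][32];
    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    int idx  = blockIdx.x * blockDim.x + threadIdx.x;

    stage[warp][lane] = in[idx];        // stage this warp's data
    __syncwarp();                        // only this warp touched its slice
    out[idx] = stage[warp][31 - lane];   // permuted read, coalesced write
}
```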
If by "same region" you mean never crossing warp boundaries: since my patch size is 19x21 and a warp is 32 threads, it seems that if a single thread processes a 4x4 (or 8x2) tile of data points, all of the shared memory can be processed by one warp.
I'm not sure whether there is any other way to do what you described?
I meant that you don’t have one warp writing to a piece of memory that another warp is reading from. Within a warp you can shuffle things however you like. Another thing to look at is the warp shuffle instruction. That can be quicker than using shared memory but isn’t as flexible.
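As an example of the warp shuffle instruction: the classic use case is a warp-level reduction, which exchanges registers directly between lanes with no shared memory and no __syncthreads() at all. This is a generic sketch using the CUDA 9+ `__shfl_down_sync` intrinsic, not code from any particular kernel:

```cuda
// Warp-wide sum via register shuffles: halve the stride each step.
// 0xffffffff means all 32 lanes participate.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp-wide sum
}
```

The flexibility limit mentioned above is visible here: shuffles can only move data between lanes of the same warp, and the exchange pattern has to be expressible per-lane, whereas shared memory lets any thread read any address.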
In general a write to global memory can be cheaper than a read: writes can be buffered and cached, and the thread does not have to wait for them to complete, whereas a read stalls until the data arrives.
Non-coalesced reads/writes in shared memory have no negative effect on your performance; just watch out for bank conflicts there. Since you did not explain in what way your data is shifted, I'll take matrix transpose as an example. You can divide the matrix into tiles, each handled by one CUDA block. Every block reads coalesced from global memory, flips the indices in shared memory, and writes back coalesced: that's the fastest version you can get. Non-coalesced reads/writes in global memory, on the other hand, will always ruin your performance, so if you can avoid them in any way, do it.
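The transpose pattern above looks roughly like this (a standard sketch assuming 32x32 thread blocks; the +1 padding column is what avoids the shared-memory bank conflicts mentioned earlier):

```cuda
#define TILE 32  // assumes blockDim = (TILE, TILE)

// Tiled transpose: read a tile coalesced, flip indices in shared memory,
// write back coalesced. Both global accesses stay fully coalesced.
__global__ void transpose(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 column breaks bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();  // the tile IS shared across warps here

    // Swap block indices so the write is coalesced in the output layout
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```

Note this version does need __syncthreads(), because the flipped read pulls data that a different warp staged; that's the trade-off against the warp-private approach discussed earlier in the thread.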
As for shuffle being too complicated: you need to decide whether you want to write a fast program or just one that works; if it's the latter, you could also use a CPU. Shuffle is probably the most efficient mechanism you can use at the moment.
It will not always be faster; as I mentioned above, it depends on how the data needs to be shifted. If you need to place your data at totally random locations when you store it, then you'll have trouble writing a kernel that avoids non-coalesced global writes.
I wouldn't be too concerned with __syncthreads(); you'll likely see a speedup with or without it. Though, as in all things, it's best to run some tests to accurately measure the differences.
As for running with 32 threads per block: if each warp is independent from the others, there's not much to gain from larger block sizes, particularly on Maxwell. But if you can effectively share data between warps to reduce device memory bandwidth or compute, then you should use bigger blocks.