The load/store of idx should coalesce, but the load and store of the char won’t, both because there is no coalesced access for byte sized types, and because of the index values themselves and how much “entropy” there is in them. You might want to look at using a texture for reads from s.
In this one, the index read should coalesce, the write won’t. Coalescing the write would require sequential idx values, and I am guessing that isn’t hat you have
If it is almost what you have (so there is some order or manageable range in the idx values a block will get), then one way that you might be able to at least partiailly engineer it is to use a shared memory buffer for the output. If a block chooses an output idx range, and puts stored values into the output buffer in order, then a warp or half warp of threads can then do the write in coalesce chunks to the indexes that need to be stored. Threads that have to write outside the range of the block would still be uncoalesced, but even 50% coalesced would be better than nothing.
Probably not. Compute 1.2 and above can add an additional transaction to a load or store request in cases where coalescing rules would otherwise force full serialisation of the load or store. But that is all it can do, So if your half warp write needs to spread data over a wide range of indices it will still be serialized.