One of the kernels that I am developing computes and then writes data to global memory.
Because the kernel can tell whether the data it is about to write is “redundant” (and will therefore be discarded later), I’m considering an optimization in which redundant data is simply not written to global memory.
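To make the idea concrete, here is a minimal sketch of the kind of kernel in question. Everything in it is a placeholder chosen for illustration: the predicate `is_redundant` (which marks every odd index as redundant), the kernel name `write_non_redundant`, the stand-in computation, and the host helper `count_written`. None of it is the real kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical redundancy test: every odd-indexed result counts as redundant.
__device__ bool is_redundant(int i) { return (i & 1) != 0; }

// Each thread computes one value and stores it only if it is not redundant,
// so some lanes of a warp skip their store to the shared cache line.
__global__ void write_non_redundant(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float value = 2.0f * i;  // stand-in for the real computation
    if (!is_redundant(i)) {
        out[i] = value;
    }
}

// Host helper: fills a buffer with a sentinel, runs the kernel, and returns
// how many elements were overwritten (or -1 on a CUDA error).
int count_written(int n) {
    if (n <= 0) return 0;
    const float sentinel = -1.0f;  // never produced by the kernel
    std::vector<float> host(n, sentinel);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1;
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    write_non_redundant<<<blocks, threads>>>(dev, n);

    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int written = 0;
    for (float v : host) {
        if (v != sentinel) ++written;
    }
    return written;
}
```

With this placeholder predicate, only half of the 4-byte elements in each cache line are written, which is exactly the partially written case I am asking about.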
I know that CUDA memory accesses that fall within the same cache line are coalesced, but I don’t know whether the number of bytes actually written within that cache line has any impact on the latency of the write.
Does the proportion of bytes (in a cache line) to be written to global memory have an impact on the latency of a write?