Write latency and cache line utilization

One of the kernels that I am developing computes and then writes data to global memory.

Because I can tell in advance whether the data to be written is “redundant” (and will therefore be discarded), I’m considering an optimization in which redundant data is not written to global memory at all.
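To make the idea concrete, here is a minimal sketch of the two variants I am comparing. The predicate `is_redundant`, the kernel and helper names, and the doubling computation are all placeholders made up for illustration; my real kernel is more involved.

```cuda
#include <cuda_runtime.h>

// Hypothetical predicate: stands in for whatever test the real kernel
// uses to decide that a result will be discarded later.
__device__ bool is_redundant(float value) {
    return value < 0.0f;
}

// Baseline: every in-range thread stores its result, so a warp writes
// every element it covers.
__global__ void write_all(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * 2.0f;  // placeholder computation
    }
}

// Candidate optimization: threads whose result is redundant skip the
// store, so a warp may write only part of each cache line.
__global__ void write_needed(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i] * 2.0f;  // same placeholder computation
        if (!is_redundant(v)) {
            out[i] = v;
        }
    }
}

// Host helper (assumed launch configuration of 256 threads per block).
// Both pointers must be accessible from the device.
void launch_write_needed(const float* in, float* out, int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    write_needed<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();
}
```

In `write_needed`, elements whose result is redundant keep whatever value was already in `out`, which is fine in my case because those elements are discarded anyway.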

I know that memory accesses from a warp that fall within the same cache line are coalesced, but I don’t know whether the number of bytes actually written in that cache line has any impact on the latency of the write.

Does the proportion of bytes written within a cache line affect the latency of a write to global memory?