Confirm my feeling about distributing memory transaction effort over a kernel execution?

Just posting to make sure that my sensibilities are correct…

Let’s say I have a kernel which launches with 320 blocks and completes several thousand work units over the course of its run (each block takes these work units one at a time until the whole list is finished). These work units involve significant computation and can be considered compute-bound. Now, let’s say that I have some independent arrays of data that I need to initialize (for use by a subsequent kernel) by filling them with zeros. It is also given that I can include information in the work units to have parts of those arrays written at the end of each one, in a way that is well-coalesced and distributed over all threads in the block.

The question is whether there is a difference between two ways of distributing the initialization work. One option is to put it all in the first 320 work units, which uses the whole width of the launch grid but concentrates the writing in the first work unit that each thread block handles. The other is to spread it over the entire list of work units, and thus over both the entire width of the launch grid and the run time of the kernel. Will the compute-bound nature of the kernel work in my favor and help hide the latency of the array initialization work? And am I best off spreading that initialization work as thin as possible, so long as I am utilizing all threads of the block for whatever portion of the writing I assign to each work unit?

My feeling is that yes, it will. The goal is to hide as much of this array initialization latency as possible, and also to eliminate an extra kernel launch or cudaMemset() call that would otherwise be necessary. But, posting here just in case the rosy scenario I’m envisioning is not as likely as I think.
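To make the two options concrete, here is a minimal host-side sketch of how the zero-fill slices could be assigned to work units. All names (`ZeroSlice`, `assign_zero_slices`, `n_carriers`) are hypothetical and not taken from any actual code base; the sketch assumes each work unit can carry a start offset and an element count for the part of the array it clears.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of the zero-fill job attached to one work unit.
struct ZeroSlice {
    std::size_t start;  // first array element this work unit clears
    std::size_t count;  // number of elements it clears (0 means nothing to do)
};

// Split n_elems array elements over the first n_carriers of n_units work units.
//   n_carriers == grid size (e.g. 320): writes concentrated at the start of the run.
//   n_carriers == n_units:              writes spread over the whole run time.
std::vector<ZeroSlice> assign_zero_slices(std::size_t n_elems,
                                          std::size_t n_units,
                                          std::size_t n_carriers) {
    assert(n_carriers > 0 && n_carriers <= n_units);
    std::vector<ZeroSlice> slices(n_units, ZeroSlice{0, 0});
    const std::size_t base = n_elems / n_carriers;
    const std::size_t extra = n_elems % n_carriers;  // first `extra` carriers take one more
    std::size_t offset = 0;
    for (std::size_t u = 0; u < n_carriers; ++u) {
        const std::size_t count = base + (u < extra ? 1 : 0);
        slices[u] = ZeroSlice{offset, count};
        offset += count;
    }
    return slices;
}
```

Calling this with `n_carriers` equal to the number of blocks gives the "first 320 work units" variant, and calling it with `n_carriers` equal to `n_units` gives the fully distributed one. Rounding slice boundaries to a multiple of the warp size would likely help keep each slice well-coalesced, but that refinement is left out here.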

Conventional wisdom is that writes in CUDA code are not critical to performance, as they are (to first order) “fire and forget”. Nothing is waiting for that data to arrive. The requests get stuffed into load/store queues, cache writeback buffers, etc., and will drain away to DRAM over time, and we don’t really care how long that takes as it happens asynchronously.

Now, one can argue that we do care about writes if the amount of data written is so large that it will overwhelm the buffering capacity of the on-chip memory hierarchy and cause back pressure, resulting in stalls of memory instructions. I would not want to hazard a guess as to how extensive the buffering capability of modern GPUs is, and it presumably differs by architecture. One could find out with targeted microbenchmarks.

A practical approach would be to first split the totality of writes into a few chunks that can be written out to memory at convenient times and locations in the code, and see whether this has any beneficial performance effect compared to writing the entire data in bulk. Only if that results in performance improvements outside noise level (> 2%) would I go and try to finely distribute the writes, at potentially significantly larger cost/pain of integrating that into an existing code base, then maintaining that going forward.
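The decision rule described above could be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the timings are assumed to come from repeated runs of each kernel variant measured elsewhere, and the median is my choice of summary statistic rather than anything prescribed here. The 2% default mirrors the noise level mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a set of run times (any consistent unit, e.g. milliseconds).
double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 ? samples[mid]
                              : 0.5 * (samples[mid - 1] + samples[mid]);
}

// Fractional speed-up of the chunked variant relative to the bulk variant.
double relative_gain(const std::vector<double>& bulk_ms,
                     const std::vector<double>& chunked_ms) {
    const double bulk = median(bulk_ms);
    return (bulk - median(chunked_ms)) / bulk;
}

// True only if the coarse chunking beats bulk writing by more than the noise floor,
// i.e. only then is a finer distribution of the writes worth trying.
bool worth_finer_distribution(const std::vector<double>& bulk_ms,
                              const std::vector<double>& chunked_ms,
                              double noise_floor = 0.02) {
    return relative_gain(bulk_ms, chunked_ms) > noise_floor;
}
```

The run times themselves would have to be collected separately, for example with CUDA events around each kernel variant.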

Conventional wisdom is that writes in CUDA code are not critical to performance, as they are (to first order) “fire and forget”. Nothing is waiting for that data to arrive.

That’s what I was thinking, so thanks for confirming. Like you said, I don’t want to lean 100% on the notion that I can queue up a giant number of writes and just expect them to happen asynchronously with no consequences for other pipelines. I can finely distribute the writes practically as easily as I can lump them together (part of the beauty of my work unit structure), so I will write 20-30 extra lines in the underlying C++ layer and just go for the toughest solution.

If it is easy enough to do it either way in your specific context, why not give bulk vs. distributed a quick try and share the results here, if you like? The thing with conventional wisdom is that it may apply for a long time, and then suddenly it doesn’t really apply anymore, but at first nobody really notices and after that it takes years for everybody to catch on to the new reality.

I will see about getting this result.