Confirm my feeling about distributing memory transaction effort over a kernel execution?

dscerutti · September 5, 2022, 3:04am

Just posting to make sure that my intuition is correct.

Let’s say I have a kernel that launches 320 blocks and completes several thousand work units over the course of its run (each block takes these work units one at a time until the whole list is finished). These work units involve significant computation and can be considered compute-bound. Now, let’s say that I have some independent arrays of data that I need to initialize (for use by a subsequent kernel) by filling them with zeros. It is also given that I can include information in the work units so that parts of those arrays are written at the end of each one, in a way that is well coalesced and distributed over all threads in the block.

The question is whether there is a difference between two ways of scheduling the initialization. One is to distribute it over the first 320 work units, which uses the whole width of the launch grid but concentrates the writing in the first work unit that each thread block handles. The other is to distribute it over the entire list of work units, and thus over both the entire width of the launch grid and the run time of the kernel. Will the compute-bound nature of the kernel work in my favor and help hide the latency of the array initialization? And am I best off spreading that initialization as thin as possible, so long as I use all threads of the block for whatever portion of the writing I assign to each work unit?

My feeling is that yes, it will. The goal is to hide as much of this array initialization latency as possible, and also to eliminate an extra kernel launch or cudaMemset() call that would otherwise be necessary. But I am posting here in case the rosy scenario I’m envisioning is not as likely as I think.
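The two options in the question can be sketched on the host side. This is only an illustration under assumptions: ZeroSlice and planZeroing are hypothetical names, not part of any existing code base, and the element counts are left to the caller. Each work unit receives a half-open range of array elements to zero once its computation is done; the "concentrated" plan hands non-empty ranges to the first 320 units only, while the "spread" plan hands one to every unit.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical descriptor: the half-open range [begin, end) of array elements
// that one work unit zeroes after finishing its computation.
struct ZeroSlice {
    std::size_t begin;
    std::size_t end;
};

// Split n_elements of initialization across the first n_carriers work units
// out of n_units in total. Units beyond n_carriers get an empty slice.
// Slice lengths differ by at most one element. Real code would probably round
// slice boundaries to a multiple of the warp size or of a cache line so that
// every slice stays well coalesced; that refinement is omitted here.
// Note: with n_carriers == 0 nothing is assigned and the array stays untouched.
std::vector<ZeroSlice> planZeroing(std::size_t n_elements,
                                   std::size_t n_units,
                                   std::size_t n_carriers) {
    // Start with empty slices everywhere (begin == end).
    std::vector<ZeroSlice> plan(n_units, ZeroSlice{n_elements, n_elements});
    n_carriers = std::min(n_carriers, n_units);
    if (n_carriers == 0) return plan;

    const std::size_t base  = n_elements / n_carriers;
    const std::size_t extra = n_elements % n_carriers;
    std::size_t cursor = 0;
    for (std::size_t i = 0; i < n_carriers; ++i) {
        // The first `extra` carriers take one additional element each.
        const std::size_t len = base + (i < extra ? 1 : 0);
        plan[i] = ZeroSlice{cursor, cursor + len};
        cursor += len;
    }
    return plan;
}
```

With these assumptions, planZeroing(n, n_units, 320) gives the concentrated variant and planZeroing(n, n_units, n_units) gives the fully spread one, so switching between them is a one-argument change.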

njuffa · September 5, 2022, 3:35am

Conventional wisdom is that writes in CUDA code are not critical to performance, as they are (to first order) “fire and forget”. Nothing is waiting for that data to arrive. The requests get stuffed into load/store queues, cache writeback buffers, etc., and will drain away to DRAM over time, and we don’t really care how long that takes as it happens asynchronously.

Now, one can argue that we do care about writes if the amount of data written is so large that it will overwhelm the buffering capacity of the on-chip memory hierarchy and cause back pressure, resulting in stalls of memory instructions. I would not want to hazard a guess how extensive the buffering capability of modern GPUs is and it presumably differs by architecture. One could find out with targeted microbenchmarks.

A practical approach would be to first split the totality of writes into a few chunks that can be written out to memory at convenient times and locations in the code, and see whether this has any beneficial performance effect compared to writing all of the data in bulk. Only if that results in performance improvements outside noise level (> 2%) would I go and try to finely distribute the writes, at the potentially significantly larger cost/pain of integrating that into an existing code base and then maintaining it going forward.
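The "outside noise level" test could be written down roughly as follows. This is a sketch under assumptions: the function names are hypothetical, each variant is assumed to have been timed several times (both vectors non-empty), and using the median of the runs is my choice here, not something prescribed above. The 2% default is the threshold mentioned in the post.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of a set of timings. The vector is taken by value so the caller's
// data is not reordered. Assumes at least one sample.
double medianMs(std::vector<double> samples) {
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    return (n % 2 == 1) ? samples[n / 2]
                        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// True if the chunked variant beats the bulk variant by more than
// `threshold` (as a fraction of the bulk time), comparing medians.
bool gainOutsideNoise(const std::vector<double>& bulk_ms,
                      const std::vector<double>& chunked_ms,
                      double threshold = 0.02) {
    const double bulk    = medianMs(bulk_ms);
    const double chunked = medianMs(chunked_ms);
    return (bulk - chunked) / bulk > threshold;
}
```

Only when gainOutsideNoise returns true for the coarse split would it seem worth moving on to the finely distributed version.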

dscerutti · September 5, 2022, 3:41am

njuffa said: "Conventional wisdom is that writes in CUDA code are not critical to performance, as they are (to first order) 'fire and forget'. Nothing is waiting for that data to arrive."

That’s what I was thinking, so thanks for confirming. As you said, I do not want to rely 100% on the notion that I can queue up a giant number of writes and just expect them to happen asynchronously with no consequences for other pipelines. I can finely distribute the writes practically as easily as I can lump them together (part of the beauty of my work unit structure), so I will write 20-30 extra lines in the underlying C++ layer and just go for the toughest solution.

njuffa · September 5, 2022, 3:49am

If it is easy enough to do it either way in your specific context, why not give bulk vs. distributed a quick try and share the results here, if you like? The thing with conventional wisdom is that it may apply for a long time, and then suddenly it doesn’t really apply anymore, but at first nobody really notices, and after that it takes years for everybody to catch on to the new reality.

dscerutti · September 5, 2022, 2:42pm

I will see about getting this result.

