I’m more than a little perplexed about why, in PTX ISA (2.3), all we get for a Grid ID is the Special Register, %gridid.
As you are all no doubt aware, most CUDA kernels are run on the premise that they’re going to use the thread ID, the Block ID, and/or the Grid ID to calculate one or more indices, which can then be used to index into one or more arrays to access only those elements that are specific to each thread’s particular task.
All very well and good, or at least it would be if it were not for the fact that the Grid ID is not really accessible (so far as I can tell).
What is accessible is this Special Register, %gridid, which the PTX ISA 2.3 documentation describes as merely a “temporal” grid identifier.
This is understandable, given that if I launch a kernel in a 1-D grid of, say, 200, and I don’t have enough SMs to handle that many grids concurrently, the SMs will queue the leftover blocks until such time as they can be executed. The leftover blocks will, in essence, be executed sequentially, and will end up reusing the same Grid numbers.
The end result is that one or more threads will receive the same value in the Special Register %gridid.
But for the purpose of being able to properly index my thread-specific array elements, I need a Grid ID that represents the virtual Grid number - a Grid number that is unique; one that I can be certain won’t be reused.
In short, a Grid number that is “non-temporal”. If I launch a kernel with a Grid size of 200, I need to know that, for example, 1/200th of my threads will receive a Grid ID of 199.
It boggles my mind that no one has mentioned this problem before (as far as I can tell). It would seem obvious to me that the CUDA system needs to provide a “virtual” Grid ID, so that threads can properly index their arrays.
But perhaps I’m misreading the docs, or otherwise missing something. If I launch a kernel with a 1-D <Grid,Block,Thread> dimensionality of <200,8,1024>, how is the kernel supposed to get its hands on the (required) Grid ID of 199?
Does anyone know where such a ‘virtual’ Grid ID is stored, or perhaps know of a good work-around for this dilemma?
I have put in a request for clarification from the author of the PTX specification. Note that CUDA programs use the multi-dimensional thread index (within a thread block = CTA) and the multi-dimensional block index (within a grid) to uniquely identify each thread in a kernel launch, which is then used to assign work within a particular kernel launch. Typically there are multiple threads to a thread block, so multiple threads will have the same ctaid.
Starting with compute capability 2.0, there can be multiple kernel launches (each representing a grid) in flight at any given time, so there needs to be a way of uniquely identifying each of these concurrently running kernels. As the PTX documentation points out, this is primarily of interest to tools such as debuggers. It is not clear to me why and to what purpose a CUDA programmer would want to use the grid ID from inside a kernel. I cannot think of a use case for that off the top of my head.
I do not know by what scheme grid IDs are being generated, but the PTX documentation certainly implies that any concurrently running kernels are guaranteed to have different grid IDs (which seems sufficient for the intended use by tools, independent of the details of grid ID generation).
I am aware of a thread ID and a block ID, each of which may have up to three components x, y, and z. I am not aware of the use of some grid ID. What would be its purpose that cannot be served by the thread and block IDs?
This description does not seem accurate to me. All blocks within the same kernel launch will of course see the same grid ID, and a unique block ID for each of them. What you describe seems rather like the SM ID.
I’m not sure whether your <200,8,1024> notation is meant to resemble the <<<nblocks, nthreads>>> notation of kernel launches. The latter, however, has no ngrids component.
I’m not quite sure what you are trying to achieve, but the y-component of the block ID, %ctaid.y, might serve your purpose.
Thanks very much for putting in that request for clarification, but I feel at this point that such a clarification may turn out to be redundant (i.e. it may not ‘clarify’ anything that I don’t already know).
You are quite right that “the PTX documentation certainly implies that any concurrently running kernels are guaranteed to have different grid IDs”, but the key word there is “concurrently”. If you look at Figure 1-4 on page 5 of the CUDA C Programming Guide Version 4.0 (“Automatic Scalability”), you can see quite clearly what I’ve been attempting to refer to. In that figure, they show an example application consisting of 8 blocks, or more precisely, one grid of 8 blocks. But because of the (hypothetical) hardware limitations, that one grid of 8 blocks may be executed as 4 grids of 2 blocks each, or 2 grids of 4 blocks each (as the diagram shows).
Extrapolating that scenario, it could be said that the same thing could happen to a kernel invocation of 2 grids of 8 blocks each, which might be executed as 8 grids of 2 blocks each, 4 grids of 4 blocks each, or 2 grids of 8 blocks each, depending on the limitations of the SMs present in the hardware.
In that case, what happens to the kernel code that attempts to discern an index into an array based on the thread ID, %tid.x, the Block ID, %ctaid.x, and the Grid ID, %gridid, as follows:
.reg .u32 Ndx; // <== the index into a byte array whose elements determine the thread's task, based on a calculated 'globally unique thread ordinal' for this device
mad.lo.u32 Ndx, %gridid, %nctaid.x, %ctaid.x; // <== Ndx = %gridid * %nctaid.x + %ctaid.x = [globally unique block ordinal]
mad.lo.u32 Ndx, Ndx, %ntid.x, %tid.x; // <== Ndx = [globally unique block ordinal] * %ntid.x + %tid.x = [globally unique thread ordinal]
Answer: the above code won’t work, because the special register %gridid might run from (0-7) twice, or from (0-3) 4 times, or from (0-1) 8 times, where only the latter would yield the expected index…
UNLESS, of course, the special register %nctaid.x (number of blocks per grid) changes accordingly. But I’ve never seen any documentation that says it would (i.e. that it’s “temporal” too)…
The solution, of course, would be to determine, in advance, exactly how many grids each SM can handle concurrently. Based on what, I don’t know yet: most, if not all, of the specs I’ve seen pertaining to the capacity of CUDA SMs concentrate exclusively on their maximum “resident” blocks, which may or may not be (and probably isn’t) the same as the maximum number of “concurrently executing” blocks that an SM can handle. As far as I can tell, “resident” in this sense refers to the total number of blocks that can be either executing, or queued for future execution.
As a sidebar, if there is indeed no other way to ensure the contiguous nature of the Grid ID other than to ensure that no SM is given more kernels than it can handle concurrently, wouldn’t such a requirement tend to nullify any claimed benefit of “Automatic Scalability”? Just saying…
Anyway, maybe I’ve interpreted one or more nuances of the CUDA system wrong here; if so, by all means correct me, but at the moment I don’t believe so. The thing is, the application I’m currently looking at must, of necessity, saturate all available GPUs at all times. That means that every launch has to be in the context of multiple Grids, containing multiple Blocks, containing multiple Threads. So for every device the program finds, the plan was to execute the kernel in (8*[SM Cnt]) Grids, each having 1024 Blocks, each having CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK Threads…
The good news is that in this particular app, it doesn’t matter if one or more SMs decide to ‘queue’ one or more blocks - the execution speed will remain the same regardless. BUT, it is a requirement that the Grid IDs aren’t ‘reused’ - i.e. that they be both contiguous and unique, in the manner that I’ve already described…
Anyway, thanks for the reply, and thanks to all for still listening…
Thank you for pointing out that figure. It seems to be a leftover from times when Nvidia marketing used different terms. You need to replace every mention of “core” in that figure with “streaming multiprocessor” or “SM” (even though that term is only explained far later in the Programming Guide).
Norbert, can you please, please file another request for clarification for the Programming Guide as well to fix this horrible misrepresentation of the CUDA programming model? No wonder people are getting confused all the time if even Nvidia itself in the definitive documentation on the subject doesn’t get it right.
No, one grid is always executed as one grid. Don’t read too much into the blue rectangles that are supposed to indicate which of those blocks execute in parallel; they do not represent separate grids. In any case, it was only true on compute capability 1.x devices that a new wave of blocks is scheduled on all SMs at the same time. Compute capability 2.x devices schedule a new block on an SM as soon as a previous block finishes. Furthermore, the figure completely ignores that more than one block can run on an SM at a time.
The simple answer to this is that the code should never care about the grid id at all. Thread id and block id together uniquely identify the item of work. There is no place for the grid id in this scheme.
No. Your code has no need to deal with this. If however you want a monotonically increasing unique identifier for different kernel invocations (=grids), you can have a counter in your host code that increases by one for each kernel invocation, and is given as an argument to the kernel.
As far as I know, “resident” in this context means exactly the same as “concurrently executing” for a given point in time.
I believe that you are misinterpreting the meaning of “grid” and “grid id” in the PTX documentation and in the CUDA C Programming Guide.
If you are concerned about never having any SM go idle, on compute capability 1.x devices you need to make sure to launch each grid with a total number of blocks that is a multiple of the number of SMs times the number of resident blocks per SM (where you can determine the latter using the Occupancy Calculator spreadsheet). This assumes that all blocks need the same time to finish. If the runtime of blocks varies, you need to run your own block scheduler (I am doing this in my code), and you have to accept a small idle time once there are fewer blocks remaining than can run in parallel.
On compute capability 2.x devices things are a lot easier. You can queue multiple kernel invocations (=grids) in different streams, and as long as there is work remaining, all SMs will always be busy.
Thanks once again tera, for all your amazingly prescient answers which, as usual, have forced me to re-evaluate my assumptions. In doing so, I realize now that I made what can only be described as a huge blunder. Allow me to explain.
Working from memory, I remembered a table wherein the number “8” was associated with a single CUDA SM’s “maximum residency”. Apparently, my mind filled in the gap (without telling me), and I proceeded to code both my algorithm and my kernel code based on the “fact” that each CUDA SM can accommodate eight concurrent “Grids” (of 1024 Blocks each).
Well, no, no, and duh, NO. What I thought I was remembering was the “Maximum number of resident blocks per multiprocessor”, as listed in the table on page 158 of the CUDA C Programming Guide Version 4.0, which is quite obviously NOT the maximum number of resident “Grids”.
And, of course, this changes everything, including the need to work with Grids or Grid IDs at all.
So, instead of having to “execute the kernel in (8*[SM Cnt]) Grids, each having 1024 Blocks, each having CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK Threads”, I now know that to saturate every GPU, my program just needs to execute the kernel in (8*[SM Cnt]) Blocks, each having CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK Threads, for each and every device it finds. Or, as you’ve correctly pointed out, some whole number “multiple” of same.
No Grids, no worries.
BTW, in case I didn’t make it clear, “SM Cnt” in the above is the total number of SMs on a device, obtained by calling cuDeviceGetAttribute() with CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT…
That makes my code much simpler, as there is now only one “mad” instruction required to calculate the “globally unique thread ordinal” from the block and thread IDs.
So my problem is now solved. Thank you.
There does seem to be some confusion left pertaining to this Grid ID business, but being the one who started this thread on a grossly mistaken premise, I think I should bow out gracefully at this point, rather than remain a participant in what, for me, has (thankfully) become a purely academic subject.
So again, thanks to one and all for so graciously tolerating this struggling neophyte’s questions…
It’s not clear for what purpose you are trying to “saturate” the GPU. To operate the GPU efficiently it is necessary to cover basic latencies (in particular memory latency) by running a sufficient number of threads per SM. The exact number will differ from GPU to GPU. As a rule of thumb I usually try to achieve an occupancy of at least 0.33 on Fermi-class GPUs, which corresponds to 512 threads per SM. The exact number of thread blocks is of secondary importance, but in general one would want at least two per SM, especially when the kernel uses __syncthreads(), which causes a loss of effective parallelism around the __syncthreads(). If the resource constraints (registers per thread, shared memory per thread block) allow it, I will try to achieve close to full occupancy, for example six thread blocks of 256 threads, or eight blocks of 192 threads.
For memory-bound streaming kernels on Fermi-class GPUs it is in general beneficial to target a total thread-block count per grid of at least 20x the number of concurrently running thread blocks. Let us assume we are able to concurrently run four thread blocks of 384 threads per SM on a GPU with 14 SMs. One would want to run a grid with at least 20 x 4 x 14 = 1120 thread blocks to achieve the maximum memory throughput. The exact amount of “over-subscription” necessary to maximize the memory throughput is a function of many variables, in particular the ratio of computation to bandwidth, but 20x is a useful rule of thumb. Running with an “over-subscription” factor that is too small can reduce the achievable memory bandwidth by up to 15% if I recall correctly from experiments with the various kernels from the STREAM benchmark.
@tera: I will suggest a review of figure 1-4 in the Programming Guide to the relevant parties.
I have confirmed with the writer of the PTX specification that the purpose of %gridid is to provide a unique identifier for each grid that is running concurrently, and that the grid ID is not provided so programmers can use it to split work between multiple grids, the way %tid and %ctaid are used to split work between the blocks and threads in a grid = kernel launch. The grid IDs of concurrently running kernels are not necessarily contiguous. Based on that I would suggest that a suitable analogy may be the process ID (pid) in an operating system.
Thank you, Norbert. And please accept my apologies for the strong language in the previous post. I was just really surprised when I learned that the cause of all the forum posts confusing “core” and “SM” might actually be the Programming Guide itself.