Does a CUDA thread get assigned to a specific core from the start and until it finishes execution?

Hello everybody,

I have multiple related questions regarding how CUDA threads are scheduled to run on CUDA core:

First, I need to know whether a CUDA thread is assigned to a specific core from the moment it starts running until it finishes execution. In other words, we know that warps run concurrently on an SM as follows: the threads in a warp (call it warp 1) start executing an instruction; the warp may then stall if its threads need some time to finish that instruction (say, waiting on memory for a load/store), so another warp (warp 2) is selected to run in the meantime. My question is: when warp 1 resumes, is each thread in that warp assigned to the same physical core it started on, or may it run on a different core within the same SM?

Second question: was this very low-level scheduling changed or modified by Fermi, and how? I know that Fermi enhanced scheduling with the GigaThread engine for concurrent kernel execution, but I'm specifically interested in the part about "assigning threads to physical cores".

Third question: if I call the same kernel twice consecutively, on the same data and with the same configuration, can I be sure that each thread will run on the same core it ran on during the first call, since the indexing is the same?

I've been trying to find answers to these questions and have searched a lot, but without success.

Thank you.

I think it is totally transparent to the programmer. Also, note that the number of cores is smaller than the warp size. That said, threads are most likely fixed to lanes for the purpose of fast register-file access and coalescing: access to the register file happens at warp granularity, so each thread reads its variables from its own column.

As has been said, this is all transparent to the programmer. The answers may also change from generation to generation, so don't assume anything! Do the answers to these questions really matter at all? The scheduling of blocks on the hardware has been shown to affect performance to some degree, but I find it very hard to imagine why you would need to program for a specific thread schedule.

With that being said, here is what I know.

No. There is no one-to-one mapping from threads to CUDA cores. A single CUDA core processes many threads, from different warps, from one clock to the next. With the dual schedulers on Fermi, it may be possible that the second scheduler picks up a warp that ran on the first one earlier. I don't know how the hardware works at this extremely low level (and I don't care), so this is pure speculation on my part.

As far as I know, GigaThread deals with block scheduling. It enables immediate replacement of blocks on an SM when one finishes executing, and it also enables concurrent kernel execution.

The answer to this is no with 100% certainty.

For simple linear addressing, it might be beneficial to index using the SM_ID instead of blockIdx.x. This way you can ensure locality of data in the L1 cache.

More details in the following forum…

http://forums.nvidia.com/index.php?showtopic=186669

http://forums.nvidia.com/index.php?showtopic=198466

The limitation of this approach is that it is hardware specific (it may not work on older-generation GPUs) and allows only 1D linear indexing.
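For reference, the SM id mentioned above can be read from device code with a bit of inline PTX. This is a sketch of the query itself, not of the full indexing scheme from the linked threads; note that %smid is a hardware register whose value is not architecturally guaranteed to stay fixed for the lifetime of a block.

```cuda
// Read the ID of the SM this thread is currently running on.
// %smid is a predefined, read-only PTX special register; treat the
// value as a hint, since it is hardware specific and may change
// between architectures.
__device__ unsigned int get_smid(void)
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}
```

A kernel could then derive its linear data index from get_smid() rather than blockIdx.x, which is the idea the linked threads discuss.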

After the threads return from stalling (in fact, if one thread stalls, the entire warp stalls), they can be scheduled to run on any core within that SM. They all share the same memory resources, so this should be totally transparent to you.

This is about the same answer as before: the threads are allowed to run on any core within that SM, and they all share the same memory resources (registers/L1 cache), so it should be transparent to you.

There is no such guarantee. In fact, the 2nd kernel execution may reshuffle the blocks to run on different SMs from the original.

Thank you all for your replies, they were of great benefit to me. I’m interested in these answers for the sake of a project I’m working on.

Lev, the problem is that I wanted to know whether I can control which core a thread goes to; if I can, then I can ensure that a second run of a thread executes on a different core. I think I should have phrased my questions this way: "If I run my code twice, is there a way to ensure that a thread will run on a different core the second time, or on a different SM?"

DrAnderson42,

So, scheduling has no determinism whatsoever?

jarjar,

Thanks for the links. What I understood is that I can "query" the SM_ID (it's a read-only predefined variable). Is using the SM_ID I read back the only way to control which SM a block's work goes to, i.e. to distribute the work? I'm trying to explore the different possibilities.

hocheung20,

The "reshuffling" is decided based on what? Do you have any documentation on how the scheduler works?

Thanks all again

This is architecture specific and is probably considered a trade secret and I’m not sure NVIDIA has any papers on how this works exactly.

Not that I’m aware of.

No, NVIDIA does not document this. You can find everything the forums know about G80 and G200 scheduling here: http://forums.nvidia.com/index.php?showtopic=94734&view=findpost&p=531002 Summary: on those cards, block scheduling is static round-robin, assigned per TPC. In any real kernel, slight variations in the runtimes of blocks will cause the blocks of subsequently queued kernels to end up on different TPCs; hence the scheduling is non-deterministic.

And here is a statement from Tim confirming that blocks are replaced as they finish on Fermi: http://forums.nvidia.com/index.php?showtopic=164912&st=0&p=1031594&#entry1031594 - so you get non-determinism for the same reason, just at a finer granularity.

You also don’t have to believe us. Write a mini-app for yourself that saves the sm id for each block into an array. Then launch that kernel many times and observe the distribution of assignments.
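A minimal version of such a mini-app might look like this (a sketch; the kernel and variable names are made up for illustration, and error checking is omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the ID of the SM this thread is running on (PTX special register).
__device__ unsigned int get_smid(void)
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// One thread per block records the SM the block was scheduled on.
__global__ void record_smid(unsigned int *sm_ids)
{
    if (threadIdx.x == 0)
        sm_ids[blockIdx.x] = get_smid();
}

int main(void)
{
    const int nBlocks = 64;
    unsigned int h_ids[nBlocks];
    unsigned int *d_ids;
    cudaMalloc(&d_ids, nBlocks * sizeof(unsigned int));

    // Launch the same kernel several times and print each run's
    // block-to-SM assignment; on real hardware the pattern typically
    // varies from run to run.
    for (int run = 0; run < 4; ++run) {
        record_smid<<<nBlocks, 32>>>(d_ids);
        cudaMemcpy(h_ids, d_ids, sizeof(h_ids), cudaMemcpyDeviceToHost);
        printf("run %d:", run);
        for (int b = 0; b < nBlocks; ++b)
            printf(" %u", h_ids[b]);
        printf("\n");
    }
    cudaFree(d_ids);
    return 0;
}
```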

Why are you looking for guarantees about block scheduling? If you’re trying to perform global synchronization, that’s a bad idea. I speak from experience.

Thanks all for the replies.

hocheung20, seems so since I didn’t find any documentation on how the scheduling works on this low level. This is the best I could find.

DrAnderson42, I tried it, and I saw the non-determinism you described. Thanks for the suggestion.

tmurray, I’m trying to apply some fault detection concepts on CUDA GPUs.

I would appreciate any answer or suggestion on how to control the scheduling at the software level (if possible) to guarantee running on a different core or SM in a second run of a program. Thanks.

I don't see how you could do this using CUDA, since the idea of any "high-level" language is to abstract away the low-level implementation details.

In fact, I don't think you could do this even in PTX, since it is a virtual assembly language.

This is probably a long shot, but maybe you could get access at the device-driver level (dig around in Nouveau, the open-source Linux NVIDIA GPU driver).

But from what I understand, scheduling is determined by the hardware (scoreboarding, etc.), so maybe you need a hardware debugger for your purposes.

You can find out which SM your block is executing on, so all you need to do is come up with a clever way of redistributing work between blocks.

Googling for "persistent threads" may provide some inspiration on how to do this.
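The persistent-threads idea, roughly: launch only as many blocks as the GPU can keep resident, and have them pull work from a global queue instead of deriving it from blockIdx.x. A hedged sketch (kernel and parameter names are illustrative; assumes data holds nTiles * blockDim.x floats and *workCounter is zero-initialized):

```cuda
// Persistent-threads sketch: each resident block repeatedly claims the
// next tile of work from a global counter until the queue is drained.
__global__ void persistent_kernel(float *data, unsigned int nTiles,
                                  unsigned int *workCounter)
{
    __shared__ unsigned int tile;
    for (;;) {
        if (threadIdx.x == 0)
            tile = atomicAdd(workCounter, 1u);  // claim the next tile
        __syncthreads();
        if (tile >= nTiles)
            break;                              // all work consumed
        // Placeholder work: one element per thread within the tile.
        data[tile * blockDim.x + threadIdx.x] *= 2.0f;
        __syncthreads();                        // before 'tile' is reused
    }
}
```

Because the work-to-block mapping is decided at run time by the atomic counter, this also gives you a software-level handle on which SMs end up processing which data, which is about as close to "controlling the scheduling" as CUDA lets you get.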