Number of blocks per multi-processor: understanding Fermi block scheduling

Hello,

It used to be the case in pre-Fermi times that the number of blocks scheduled to run concurrently on a multi-processor could be deduced from the object file alone: the GPU scheduler would run as many blocks as possible given the register, shared memory, and thread count constraints (correct me if I’m wrong here).
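
(To be concrete, by “deduced from the object file” I mean a back-of-the-envelope calculation along the lines of the sketch below; myKernel and the block size are placeholders for my real kernel and launch configuration, and I am ignoring register/shared-memory allocation granularity, so the true number can only be lower.)

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* stand-in for the real kernel */ }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);

    const int threadsPerBlock = 256;   // whatever the launch configuration uses

    // Limit from the thread count per multi-processor.
    int byThreads = prop.maxThreadsPerMultiProcessor / threadsPerBlock;

    // Limit from registers (regsPerBlock used as a stand-in for the per-SM
    // register file; the two coincide on Fermi).
    int byRegs = attr.numRegs
               ? prop.regsPerBlock / (attr.numRegs * threadsPerBlock)
               : byThreads;

    // Limit from shared memory -- this is the term the cache configuration
    // changes, which is exactly the problem below.
    int bySmem = attr.sharedSizeBytes
               ? (int)(prop.sharedMemPerBlock / attr.sharedSizeBytes)
               : byThreads;

    int blocksPerMP = byThreads;
    if (byRegs < blocksPerMP) blocksPerMP = byRegs;
    if (bySmem < blocksPerMP) blocksPerMP = bySmem;

    printf("at most ~%d concurrent blocks per MP\n", blocksPerMP);
    return 0;
}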

The Fermi architecture offers flexibility in the shared memory vs. cache split by means of the cuda[Thread/Func]SetCacheConfig(…) functions.

As far as I understand, calling cuda[Thread/Func]SetCacheConfig(…) sets the PREFERRED, not the REQUIRED, shared memory/cache split. The documentation is fairly vague about how much on-chip memory the GPU actually allocates to cache even given that preference. This seems like an unfortunate design decision to me: given how critical the amount of shared memory is for my application, I would like to be able to tell the GPU whether I want more cache or more shared memory, and get an error back if the requested allocation is not possible for whatever reason.
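
(What I am doing at the moment is essentially the following; myKernel is again a placeholder, and as far as I can tell the returned error only catches invalid arguments, not a preference that the hardware chooses not to honor.)

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* shared-memory-heavy kernel */ }

int main()
{
    // Ask for the 48 KB shared / 16 KB L1 split on Fermi -- a preference only.
    cudaError_t err = cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    // This does not appear to tell me whether the split will actually be used.
    if (err != cudaSuccess)
        printf("cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));

    myKernel<<<1, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}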

Given the uncertainty about the shared memory vs. cache split, I can no longer determine how many blocks run concurrently on an MP from the object file alone.

Hence the question: can I see from some kind of profiling output how many blocks the scheduler actually runs on each MP (say, on average over a kernel launch)?
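
(To clarify what I am after: I could presumably measure this by hand with something like the sketch below, which reads the %smid special register and keeps a per-SM high-water mark of resident blocks, but I would much rather read the number off a profiler report.)

#include <cstdio>
#include <cuda_runtime.h>

// Zero-initialized per-SM counters: current resident blocks and the peak seen.
__device__ unsigned int currentBlocks[64];
__device__ unsigned int peakBlocks[64];

__device__ unsigned int smid()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void instrumentedKernel()
{
    __shared__ unsigned int sm;
    if (threadIdx.x == 0) {
        sm = smid();
        unsigned int now = atomicAdd(&currentBlocks[sm], 1u) + 1u;
        atomicMax(&peakBlocks[sm], now);
    }
    __syncthreads();

    // ... real kernel work would go here ...

    __syncthreads();
    if (threadIdx.x == 0)
        atomicSub(&currentBlocks[sm], 1u);
}

int main()
{
    instrumentedKernel<<<1024, 128>>>();
    cudaDeviceSynchronize();

    unsigned int peak[64];
    cudaMemcpyFromSymbol(peak, peakBlocks, sizeof(peak));

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    for (int i = 0; i < prop.multiProcessorCount; ++i)
        printf("SM %2d: peak of %u resident blocks\n", i, peak[i]);
    return 0;
}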

Thanks!
