So, first off, to be up-front: I’m working with CUDA on my honours thesis. I’ve implemented my algorithm (an optimised version of the double description method) and it runs fine on my GT240 using all 96 cores.
I’ve also got a single-threaded, CPU-only, version of the algorithm. It runs fine as well.
What I want to do is test my algorithm with CUDA, but using a smaller number of cores. I’m basically hoping to run my code with maybe 12/24/48/96 cores enabled, so I can get a sense of how well my algorithm scales. However, I don’t have any other Nvidia devices lying around.
Is there any way to tell CUDA to limit itself to a certain number of cores, beyond manually editing the program so the algorithm only runs a certain number of threads at a time?
I don’t know how to disable individual SMs, but you can artificially increase shared memory usage to decrease occupancy.
For example, the GT240 has 12 SMs. Suppose your kernel can run 4 thread blocks per SM and each block has 128 threads; then you have 12 × 4 × 128 = 6144 threads in flight. If you enlarge the shared memory declaration in your kernel so that only 2 thread blocks fit per SM, you have 12 × 2 × 128 = 3072 threads in flight.
Decreasing occupancy increases the penalty of global memory accesses, because memory latency is no longer hidden by arithmetic operations as well. Disabling some SMs would also increase the global memory penalty, because fewer memory controllers would be kept busy, so bandwidth decreases. For this reason, both methods have roughly the same effect.
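A minimal sketch of the idea, with made-up names (myKernel, occupancyPad): rather than editing the static __shared__ declaration itself, one variant is to pass unused dynamic shared memory as the third launch parameter. It counts against the per-SM shared memory just like a static declaration, so it caps blocks per SM without touching the kernel body. The GT240 (compute capability 1.2) has 16 KB of shared memory per SM.
```
#include <cuda_runtime.h>

// Placeholder kernel; the real algorithm goes here. Name is illustrative.
__global__ void myKernel(float *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

int main()
{
    const int nBlocks = 96, nThreads = 128;
    float *d_data;
    cudaMalloc(&d_data, nBlocks * nThreads * sizeof(float));

    // With 16 KB of shared memory per SM, asking for ~6 KB per block means
    // at most 2 blocks are resident on an SM at once; shrink or grow the
    // pad to hit other occupancy targets. The kernel never has to use it.
    size_t occupancyPad = 6 * 1024;
    myKernel<<<nBlocks, nThreads, occupancyPad>>>(d_data);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```
Note this throttles blocks per SM rather than SM count, so all 12 SMs still run and still share the memory system, which is exactly why the effect is only similar, not identical, to having fewer SMs.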
You can find out which SM a block is running on by reading the %smid special register from inline PTX code. You could then have all blocks that execute on certain SMs just do nothing.
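A minimal sketch of that (get_smid and ALLOWED_SMS are names I made up; the %% escapes the literal % in the inline-PTX template):
```
// Read the id of the SM this block is running on via inline PTX.
__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Pretend only the first ALLOWED_SMS of the GT240's 12 SMs exist.
#define ALLOWED_SMS 6

__global__ void myKernel(float *data)
{
    // %smid is uniform across a block, so this early return retires
    // the whole block whenever it lands on a "disabled" SM.
    if (get_smid() >= ALLOWED_SMS)
        return;

    // ... real work ...
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}
```
One caveat from the PTX manual: %smid values are not guaranteed to be contiguous, so you may want to dump the ids you actually observe before picking a cutoff.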
If you do your own round-robin block scheduling (to redistribute the work to the remaining SMs), I think you could get close to how a device with fewer SMs would operate. This of course assumes that the work done in each block is significant enough that the scheduling overhead does not skew the results.
EDIT: Thinking about this a bit more, I think you don’t need to run your own scheduler. Since you know beforehand what fraction of the blocks is going to land on SMs where blocks refuse to execute, you can just schedule proportionally more blocks and have the surviving blocks take their “effective” ids from an atomic counter instead of from blockIdx. After the kernel finishes, you can check whether the counter has reached the expected value.
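A sketch of that combined scheme, again with made-up names (get_smid and ALLOWED_SMS are as in the previous snippet): surviving blocks claim work items from a global counter, surplus blocks that find nothing left simply exit, and the host verifies afterwards that every work item was claimed.
```
// Zeroed from the host (cudaMemcpyToSymbol) before each launch.
__device__ unsigned int nextWorkItem;

__global__ void myKernel(float *data, unsigned int nWorkBlocks)
{
    // Blocks on "disabled" SMs refuse to work, as in the previous snippet.
    if (get_smid() >= ALLOWED_SMS)
        return;

    // One thread per block claims the next effective block id. Global
    // atomics need compute capability 1.1+, which the GT240 (1.2) has.
    __shared__ unsigned int effectiveId;
    if (threadIdx.x == 0)
        effectiveId = atomicAdd(&nextWorkItem, 1u);
    __syncthreads();

    // The grid is over-provisioned, so late blocks may find no work left.
    if (effectiveId >= nWorkBlocks)
        return;

    // ... real work, indexed by effectiveId instead of blockIdx.x ...
    data[effectiveId * blockDim.x + threadIdx.x] *= 2.0f;
}
```
After the launch, read nextWorkItem back with cudaMemcpyFromSymbol; if it is still below nWorkBlocks, too few blocks survived the %smid filter and you need to launch a larger grid.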