In a single GPU such as the P100 there are 56 SMs (streaming multiprocessors), and different SMs may have little correlation. I would like to know how application performance varies with the number of SMs. So is there any way to disable some SMs on a given GPU? I know CPUs offer corresponding mechanisms, but I haven't found a good one for GPUs yet. Thanks!
What is the "correlation" you are talking about? Are you going to disable parts of the L2 cache and memory bandwidth too? Do you know that there are no other CC 6.0 devices, so your emulation wouldn't translate to any other real device?
That said, there is a trick: run an endless loop in the kernel after checking which SM it is running on. It's called the "persistent kernel" technique; you can google it.
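A minimal sketch of that trick, assuming the usual approach of reading the SM id from the `%smid` special register via inline PTX: launch one blocking kernel with enough blocks (and per-block resources) to occupy every SM, let the blocks that land on the SMs you want "disabled" spin on a host-visible flag, and let the rest exit immediately. The kernel under test, launched on a second stream, then only gets scheduled on the free SMs. The names here (`occupy_sms`, `first_free_sm`) are illustrative, not an established API, and this is untested pseudocode-adjacent CUDA, not a turnkey recipe.

```cuda
#include <cstdio>

// Read the id of the SM this thread is running on (special register %smid).
__device__ unsigned int smid() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Blocks that land on an SM with id below first_free_sm spin until the host
// sets *stop, keeping that SM busy; blocks on the remaining SMs return at
// once, leaving those SMs available for the benchmark kernel.
__global__ void occupy_sms(unsigned int first_free_sm, volatile int *stop) {
    if (smid() < first_free_sm) {
        while (*stop == 0) { /* burn this SM */ }
    }
}
```

On the host you would allocate `stop` in mapped (zero-copy) memory, launch `occupy_sms` on its own stream with at least as many blocks as SMs (sized so each SM can hold only the blocks you intend), launch the matrix multiplication on a second stream, and finally write `*stop = 1` to release the spinners. Note the caveats from the reply above still apply: the idle SMs still share the L2 cache and memory bandwidth with the active ones.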
Hey, thanks for your reply. "Little correlation" means that if we could turn off some SMs, it would have no effect on the remaining SMs. This is an assumption on my part.
The basic idea is easy to understand: if we implement a matrix multiplication on the GPU, the task mapping is handled by the kernel and the CUDA runtime, and we don't care about the number of SMs. Now I would like to know: if we run the same matrix multiplication on 5, 10, 15, 20, 25, 30, ... SMs of a given GPU, how long will the application take? Will performance rise with the number of SMs (i.e., the device compute power) used? Ideally we could disable the L2 cache and memory bandwidth as well, but we would also accept a solution that disables only SMs. Thanks!
I assume you are referring to Linux and Windows configuration settings that instruct the operating system not to use a CPU core (rather than physically disabling it through hardware configuration).
To my knowledge, such fine-grained control does not exist for the GPU at this time. CUDA_VISIBLE_DEVICES lets you disable entire GPUs (again, from a software perspective), but there is no corresponding setting to disable individual SMs on a GPU.
If you have a roofline model for your application, you should be able to estimate the performance across different GPUs based on GPU specifications (such as FLOPS and memory bandwidth). I have not actually gone through that exercise, so I couldn't tell you how accurate such estimates would be, but simple models like this tend to have errors in the 10% range. If you need to know the performance with certainty, stock your lab with a bunch of different GPUs, e.g. a GT 1030, GTX 1050, GTX 1060, GTX 1070, and GTX 1080, and measure it.
What higher-level problem are you trying to solve here?