Back with another question. I'm just curious whether there is any way to emulate an older CUDA-capable GPU so I can compare runtimes. I'm not getting my hopes up, and I realise that comparing architectures is a hazy area, but I thought I'd check.
In my case I'm hoping to emulate an 8800 GTX using a GTX 480.
If code is compiled for a lower compute capability than the device it is being run on, does it still make use of increased memory sizes etc.? For example, can a program compiled with sm_10 still allocate the full 48 KB of shared memory on an sm_20 device?
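One way to test this empirically is a small probe like the one below. It is only a sketch: the kernel name `probe` is made up for illustration, and it assumes device 0 is the GTX 480 and a toolkit that still accepts sm_10 and provides `cudaDeviceSynchronize` (e.g. built with `nvcc -arch=sm_10 probe.cu`). It prints what the runtime reports for the device, then tries to launch a kernel that requests that full per-block amount as dynamic shared memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel; only the launch configuration matters here.
__global__ void probe() {}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "could not query device 0\n");
        return 1;
    }
    std::printf("%s: compute capability %d.%d, %zu bytes shared memory per block\n",
                prop.name, prop.major, prop.minor, prop.sharedMemPerBlock);

    // Request the device-reported per-block maximum as dynamic shared memory.
    probe<<<1, 1, prop.sharedMemPerBlock>>>();

    // Launch-configuration errors surface here; execution errors after the sync.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    std::printf("launch with %zu bytes of shared memory: %s\n",
                prop.sharedMemPerBlock, cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

If the launch succeeds in a binary built only for sm_10, that would suggest the shared memory limit comes from the device rather than from the compile target. The sketch does not show what the compiler or the JIT step does with the rest of the code.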
That might explain why there is a noticeable delay whenever the program is started. How do I prevent this from happening? Using arch_20, sm_20 does not seem to make a difference to this delay…
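To narrow down where the startup delay comes from, a rough timing sketch like the following can help. It assumes a host compiler with C++11 support (for `<chrono>`, e.g. `nvcc -std=c++11`). The names `noop` and `ms_since` are made up for illustration. It times the first runtime call, the first kernel launch, and a second launch separately. Which of the first two absorbs the one-time setup cost may depend on the toolkit and driver version.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel used only to time launches.
__global__ void noop() {}

// Wall-clock milliseconds elapsed since `start`.
static double ms_since(std::chrono::steady_clock::time_point start) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

int main() {
    // The first runtime call forces the runtime to initialise and create its context.
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(0);
    std::printf("first runtime call: %.1f ms (%s)\n",
                ms_since(t0), cudaGetErrorString(err));

    // First launch: includes any remaining one-time setup for this kernel.
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    noop<<<1, 1>>>();
    err = cudaDeviceSynchronize();
    std::printf("first launch:       %.1f ms (%s)\n",
                ms_since(t1), cudaGetErrorString(err));

    // Second launch: a baseline with all one-time costs already paid.
    std::chrono::steady_clock::time_point t2 = std::chrono::steady_clock::now();
    noop<<<1, 1>>>();
    err = cudaDeviceSynchronize();
    std::printf("second launch:      %.1f ms (%s)\n",
                ms_since(t2), cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

Building this once for sm_10 and once for sm_20 and comparing the first two numbers should show whether the compile target affects the delay at all, or whether it is mostly fixed initialisation cost.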