Effect of compiling CUDA for an older compute capability


I’m using a GPU with compute capability (CC) 7.0 and CUDA 10, together with legacy code that explicitly builds for CC 3.5 (--gpu-architecture compute_35). I’m wondering what exactly the effect of this configuration is. As far as I can tell, the documentation on CC only answers the question “is my hardware capable of supporting this feature?”. I can’t find a description of how code built for an older CC behaves on newer hardware, other than that CUDA 10 still supports targets down to CC 3.x (though, as I understand it, support for CC 3.x is being phased out).

For example, Table 15 in the CUDA programming guide states that the maximum number of resident blocks per multiprocessor is 16 for CC 3.5, whereas my CC 7.0 device has a maximum of 32. Will compiling for CC 3.5 limit my device to 16 resident blocks per multiprocessor, and thus simulate or force CC 3.5 behavior on my CC 7.0 device?
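
To make the question concrete, here is a minimal sketch of how I imagine this could be checked empirically. The kernel name tinyKernel, the file name check.cu and the block size of 32 are my own choices for illustration and are not part of the legacy code. The sketch prints the CC of the physical device, the value of __CUDA_ARCH__ the kernel was compiled with, and the number of resident blocks per multiprocessor reported by the runtime's occupancy API.

```cuda
// Assumed build command (file name is hypothetical):
//   nvcc --gpu-architecture=compute_35 check.cu -o check
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel: records the architecture value the device code
// was compiled for. __CUDA_ARCH__ is fixed at compile time and is only
// defined during the device compilation pass, hence the guard.
__global__ void tinyKernel(int *archOut)
{
#ifdef __CUDA_ARCH__
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *archOut = __CUDA_ARCH__;
    }
#endif
}

int main()
{
    // Compute capability of the physical device.
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    std::printf("Device 0: %s, CC %d.%d\n", prop.name, prop.major, prop.minor);

    // Run the kernel once and read back the compile-time architecture.
    int *dArch = nullptr;
    int hArch = 0;
    CHECK(cudaMalloc(&dArch, sizeof(int)));
    CHECK(cudaMemset(dArch, 0, sizeof(int)));
    tinyKernel<<<1, 32>>>(dArch);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpy(&hArch, dArch, sizeof(int), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(dArch));
    std::printf("__CUDA_ARCH__ seen by the kernel: %d\n", hArch);

    // Ask the runtime how many blocks of 32 threads can be resident on
    // one multiprocessor for this kernel (no dynamic shared memory).
    int blocksPerSM = 0;
    CHECK(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, tinyKernel, 32, 0));
    std::printf("Resident blocks per SM at 32 threads/block: %d\n", blocksPerSM);
    return 0;
}
```

The block size of 32 is chosen so that, for such a small kernel, the per-multiprocessor block limit rather than the thread or register limit should be the binding constraint. If the last line reports more than 16, I would take that as a sign that the CC 3.5 limit is not being imposed on the CC 7.0 device.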

What else should I be wary of when porting CUDA code from CC 3.5 to CC 7.0?

Thank you for your time!