There is something I am unhappy about, and it’s creating problems.
Programmers and users are faced with two different compute capabilities:
Kernel Compute Capability (as specified in the PTX code, at compile time).
Device Compute Capability (as reported by the Driver API, at run time).
For me as a programmer it is natural to expect the driver to try and achieve maximum performance.
So I expect the driver to apply the higher compute capability's limits to kernels of a lower compute capability wherever possible. This is also better for backward/future compatibility.
Currently a compute 2.0 kernel is being launched with block/grid dimensions which exceed the compute 2.0 limits, but it is being launched on a device of much higher compute capability.
However the launch fails?! <- This makes me unhappy, and it also sucks for users.
So basically the question is which compute capability should be assumed when a programmer/user is faced with this situation:
Kernel Compute Capability versus Device Compute Capability?!?
This seems somewhat undocumented?!
What is the order? What is the preference?
Should the programmer limit the launch parameters to the “lowest common” compute capability? (Which would probably suck for performance? Perhaps not… but it’s weird/awkward.)
Or should the driver be fixed/changed/adjusted to work better and strive to lift restrictions when possible?
And/or should the documentation be updated to clarify this situation?
As far as I can tell, the documentation does not mention which compute capability has precedence?! (Device Compute Capability versus Kernel Compute Capability?!)
The compute capability that is specified for the virtual architecture phase of compilation (source->ptx) has precedence. Any device features in use (dynamic parallelism, warp shuffle, grid dimensions, block dimensions, etc.) should conform to and be supported by the specified virtual architecture.
Since the virtual architecture (and PTX) are not executable, there is a necessary step to create machine code, whether that creation step be at compile-time or at run-time. This step cannot contravene any limitations imposed by the specification of the virtual architecture.
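To make the two creation paths concrete, here is a sketch (hypothetical file names) of the corresponding nvcc invocations; in both cases the compute_20 virtual architecture fixes the limits the kernel must observe:

# Build-time path: compute_20 PTX is compiled to sm_20 machine code by nvcc.
nvcc -gencode arch=compute_20,code=sm_20 kernel.cu -o app
# Run-time path: only compute_20 PTX is embedded; the driver JIT-compiles it
# for the actual device at launch, still subject to the compute_20 limits.
nvcc -gencode arch=compute_20,code=compute_20 kernel.cu -o app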
I want my software to always work, no matter what. In principle the software ultimately has to run on the device, so limiting the launch parameters to what the device can handle seems best in general.
In the event that a future compute capability actually has fewer resources available than compute 2.0, and my software assumed it would be valid to launch with compute 2.0 parameters on a device which can no longer support them (for example, maybe nano-devices in the human body), then the software would still fail. Especially since there is no further documentation and no guarantee that future compute capabilities will always have more resources, I must come to the conclusion that to be absolutely safe I will have to limit my software to the lowest compute capability available, and to the very minimum, to make sure it runs.
So the final compute capability would be determined as follows, conceptually:
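A minimal sketch of that idea (kernel_cc, device_cc, and final_cc are hypothetical names, encoded for example as major*10 + minor):

// Assumed reconstruction: the limits that apply are those of the lower
// of the two compute capabilities.
int final_cc = min(kernel_cc, device_cc);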
Then again, my line of reasoning does not really make sense in itself.
The assumption in that code would be that a lower compute capability number is a minimum, while in reality there is no such guarantee.
Compute Capability 10.0 could have fewer capabilities than Compute Capability 2.0.
For example, the nano-device line of reasoning above.
This situation is a bit messy.
What Bob wrote seems to make the most sense. At the time of writing the kernel, a certain compute capability was kept in mind by the programmer, so assuming and requesting those device capabilities seems reasonable. If it fails… the software can still adapt.
For now I will alter my software and apply the kernel compute capability restrictions to the launch parameters and so forth.
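One way to obtain the kernel's restrictions at run time (a sketch, not from the original thread) is cudaFuncGetAttributes, whose ptxVersion and binaryVersion fields report which compute capability a kernel was compiled for:

#include <stdio.h>

__global__ void my_kernel(){ }

int main(){
  cudaFuncAttributes attr;
  if (cudaFuncGetAttributes(&attr, my_kernel) != cudaSuccess){
    printf("query failed\n"); return 1;
  }
  // Both fields are encoded as major*10 + minor, e.g. 20 for cc2.0.
  printf("kernel PTX version:  %d.%d\n", attr.ptxVersion / 10, attr.ptxVersion % 10);
  printf("kernel SASS version: %d.%d\n", attr.binaryVersion / 10, attr.binaryVersion % 10);
  return 0;
}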
Concerning block/grid dimensions… this could lead to situations where an older kernel might not be able to execute on a more capable device, simply because its restrictions are being exceeded. This can be solved, though, by checking the maximums of each compute capability, making sure the launch parameters do not exceed them, and, if necessary, performing multiple launches of older kernels, as in the sketch below. This is less ideal… but it should at least work.
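A sketch of that multiple-launch idea (hypothetical kernel and names): slice the work so that grid.x never exceeds a given per-compute-capability limit:

// Each block handles one logical index; `offset` shifts the slice.
__global__ void work_kernel(int offset, int total){
  int i = offset + blockIdx.x;
  if (i < total){
    // ... per-block work for logical block i ...
  }
}

// Launch `total` logical blocks in slices of at most max_grid_x blocks,
// e.g. max_grid_x = 65535 for a cc2.0 kernel.
void launch_all(int total, int max_grid_x){
  for (int offset = 0; offset < total; offset += max_grid_x){
    int blocks = (total - offset < max_grid_x) ? (total - offset) : max_grid_x;
    work_kernel<<<blocks, 1>>>(offset, total);
  }
}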
So by computing a minimum compute capability specification it’s probably possible to guarantee that a kernel will always launch, no matter which of the two (kernel vs. device) is the bigger one. The minimum is always calculated correctly. I like this solution.
It’s certainly possible to write CUDA codes that run correctly (observing the limits) on a cc2.0 architecture device, but also maximally utilize a cc3.0 or higher device. In particular this is true for kernel grid dimensions. I can have a code that detects a cc2.0 device, launches a cc2.0 kernel, and makes sure that the grid dimensions do not exceed 65535. It’s also possible for the same code to detect a cc3.0 device, launch a cc3.0 kernel, and allow the first grid dimension (.x) to exceed 65535.
Here’s a worked example:
$ cat t624.cu
#include <stdio.h>

// Device-global counter, incremented once per launched block.
__device__ int my_count = 0;

__global__ void my_kernel(){
  atomicAdd(&my_count, 1);
}

int main(){
  cudaDeviceProp d_prop;
  cudaError_t my_error = cudaGetDeviceProperties(&d_prop, 0);
  if (my_error != cudaSuccess) {printf("failure!\n"); return 1;}
  printf("running on device: %s\n", d_prop.name);
  // cc2.0 caps grid.x at 65535; cc3.0 and later allow much larger values.
  int block_x = 65535;
  if (d_prop.major >= 3) {block_x = 100000;}
  my_kernel<<<block_x, 1>>>();
  int h_count = 0;
  my_error = cudaMemcpyFromSymbol(&h_count, my_count, sizeof(int));
  if (my_error != cudaSuccess) {printf("failure!\n"); return 1;}
  printf("result = %d\n", h_count);
  return 0;
}
$ nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 t624.cu -o t624
$ CUDA_VISIBLE_DEVICES="0" ./t624
running on device: Quadro 5000
result = 65535
[bob@cluster1 misc]$ CUDA_VISIBLE_DEVICES="1" ./t624
running on device: GeForce GT 640
result = 100000
$
(I created a new topic, “how to determine the correct compute capability version for a loaded PTX file”; please continue discussion of this specific subtopic there.)