Kernel Compute Capability versus Device Compute Capability

Hello,

There is something I am unhappy about, and it’s creating problems.

The programmer and users are faced with two different compute capabilities:

  1. Kernel Compute Capability (as specified in PTX code, during compile time).
  2. Device Compute Capability (as specified via Driver API return information, during run time).
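
To make the distinction concrete, here is a minimal sketch (driver API, error checking omitted) of how the second one is obtained at run time; the first one is fixed at compile time by the virtual architecture passed to nvcc, e.g. -arch=compute_20:

#include <cuda.h>
#include <stdio.h>

int main(){
  CUdevice dev;
  int cc_major = 0, cc_minor = 0;
  cuInit(0);                    // initialize the driver API
  cuDeviceGet(&dev, 0);         // first device
  cuDeviceGetAttribute(&cc_major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
  cuDeviceGetAttribute(&cc_minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);
  printf("device compute capability: %d.%d\n", cc_major, cc_minor);
  return 0;
}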

For me as a programmer it is natural to expect the driver to try and achieve maximum performance.

So I expect the driver to apply the higher compute capability's limits to kernels compiled for a lower compute capability wherever possible. This is also better for backwards/future compatibility.

Currently a compute 2.0 kernel is being launched with block/grid dimensions which exceed the compute 2.0 limits, but on a device of much higher compute capability.

However the launch fails ?! <- This makes me unhappy, and this also sucks for users.

So basically the question is which compute capability should be assumed when a programmer/user is faced with these compute capability situations:
Kernel Compute Capability versus Device Compute Capability ?!?
This seems somewhat undocumented ?!
What is the order ? What is the preference ?
Should the programmer limit their launch parameters to the “lowest common” compute capability ? (which would probably suck for performance ? perhaps not… but it’s weird/awkward).
Or should the driver be fixed/changed/adjusted to work better and strive to lift restrictions when possible ?
And/or should the documentation be updated to clarify this situation ?

As far as I can tell the documentation does not mention which compute capability has precedence ?! (Device Compute Capability versus Kernel Compute Capability ?!)

Bye,
Skybuck.

The compute capability that is specified for the virtual architecture phase of compilation (source->ptx) has precedence. Any device features in use (dynamic parallelism, warp shuffle, grid dimensions, block dimensions, etc.) should conform to and be supported by the specified virtual architecture.

Since the virtual architecture (and PTX) are not executable, there is a necessary step to create machine code, whether that creation step be at compile-time or at run-time. This step cannot contravene any limitations imposed by the specification of the virtual architecture.
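
For example (a hypothetical compile line; the file name is illustrative), you can embed both machine code for a real architecture and PTX for the virtual architecture; the PTX can then be JIT-compiled by the driver for newer devices at run time, but always within the limits of the chosen virtual architecture:

$ nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=compute_20 mykernel.cu -o mykernel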

From the NVCC manual:

http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architectures

“The chosen virtual architecture is more of a statement on the GPU capabilities that the application requires…”

You should choose your virtual architecture based on the capabilities your application requires.

If you think it should be some other way, I encourage you to file an RFE (bug):

http://developer.nvidia.com

I want my software to always work, no matter what. In principle the software ultimately has to run on the device, so limiting the launch parameters to what the device can handle seems best in general.

In the event that a future compute capability actually has fewer resources available than compute 2.0, software that assumes a compute 2.0 launch is always valid would still fail on a device which can no longer support it (for example, maybe nano-devices in the human body). Since there is no further documentation and no guarantee that future compute capabilities will always have more resources, I must conclude that to be absolutely safe I have to limit my software to the lowest compute capability available, and to the very minimum, to make sure it runs.

So final compute capability will be determined as follows, conceptually:

FinalComputeCapability = MinimumOf( KernelComputeCapability, DeviceComputeCapability )
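
As a sketch in code (encoding a compute capability as 10*major + minor, so 2.0 becomes 20 and 3.5 becomes 35):

// Sketch only: assumes a lower compute capability number always means lower limits.
int final_compute_capability(int kernel_cc, int device_cc){
  // e.g. kernel CC 20 (2.0) on a CC 35 (3.5) device -> 20
  return (kernel_cc < device_cc) ? kernel_cc : device_cc;
}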

Then again, my line of reasoning does not really make sense in itself.

The assumption behind that code is that a lower compute capability number always implies lower limits. In reality there is no such guarantee.

Compute Capability 10.0 could have fewer capabilities than Compute Capability 2.0.

For example, the nano-device line of reasoning above.

This situation is a bit messy.

What Bob wrote seems to make the most sense. At the time of writing a kernel, the programmer had a certain compute capability in mind, so assuming and requesting those capabilities from the device seems reasonable. If it fails… the software could still adapt.
For now I will alter my software and apply the kernel compute capability restrictions to the launch parameters and so forth.

Concerning block/grid dimensions… this could lead to situations where an older kernel might not be able to execute on a more capable device, simply because its restrictions are exceeded. This can be solved by checking the maximums of each compute capability, making sure the launch parameters do not exceed them, and if necessary performing multiple launches of the older kernel… less ideal, but it should at least work.
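
A sketch of such a chunked launch (hypothetical kernel and helper, assuming the cc2.0 gridDim.x limit of 65535):

// Hypothetical kernel: each block derives its logical block index from a
// launch offset, so several smaller launches can cover one large logical grid.
__global__ void my_kernel(int block_offset){
  int logical_block = blockIdx.x + block_offset;
  (void)logical_block; // placeholder for real work based on logical_block
}

// Launch total_blocks blocks in chunks that respect the older grid limit.
void launch_chunked(int total_blocks){
  const int max_grid_x = 65535;  // gridDim.x limit for cc2.0
  for (int offset = 0; offset < total_blocks; offset += max_grid_x){
    int blocks = total_blocks - offset;
    if (blocks > max_grid_x) blocks = max_grid_x;
    my_kernel<<<blocks, 1>>>(offset);
  }
}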

I just thought of a better solution.

The compute capability number is useless.

However the actual capabilities behind each number are known and can also be requested.

Thus it is possible to use the API to query the capabilities of the device.

It’s also possible to store the limits of all older compute capabilities in a table, and thus it’s still possible to calculate a safe minimum.

Such code could look like:

MinimumBlockWidth = MinimumOf( BlockWidth(KernelComputeCapability), BlockWidth(DeviceComputeCapability) )

So by computing a minimum compute capability specification it’s probably possible to guarantee that a kernel will always launch… no matter which of the two is bigger, kernel vs device. The minimum is always calculated correctly. I like this solution.
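
A minimal sketch of this per-limit minimum (runtime API for brevity; the table values are the documented gridDim.x maximums, and the kernel's target compute capability is assumed known):

#include <stdio.h>

// Documented gridDim.x maximum per compute capability major version.
int max_grid_x_for_cc(int cc_major){
  return (cc_major >= 3) ? 2147483647 : 65535;
}

int main(){
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
  int kernel_cc_major = 2;                 // assumption: CC the kernel targets
  int kernel_limit = max_grid_x_for_cc(kernel_cc_major);
  int device_limit = prop.maxGridSize[0];  // queried from the device
  int safe_limit = (kernel_limit < device_limit) ? kernel_limit : device_limit;
  printf("safe gridDim.x maximum: %d\n", safe_limit);
  return 0;
}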

It’s certainly possible to write CUDA codes that run correctly (observing the limits) on a cc2.0 architecture device, but also maximally utilize a cc3.0 or higher device. In particular this is true for kernel grid dimensions. I can have a code that detects a cc2.0 device, launches a cc2.0 kernel, and makes sure that the grid dimensions do not exceed 65535. It’s also possible for the same code to detect a cc3.0 device, launch a cc3.0 kernel, and allow the first grid dimension (.x) to exceed 65535.

Here’s a worked example:

$ cat t624.cu
#include <stdio.h>

__device__ int my_count = 0;   // device-side counter

__global__ void my_kernel(){

  atomicAdd(&my_count, 1);     // each block increments the counter once
}

int main(){

  cudaDeviceProp d_prop;
  cudaError_t my_error = cudaGetDeviceProperties(&d_prop, 0);
  if (my_error != cudaSuccess) {printf("failure!\n"); return 1;}
  printf("running on device: %s\n", d_prop.name);
  int block_x = 65535;                        // cc2.0 gridDim.x limit
  if (d_prop.major >= 3) {block_x = 100000;}  // cc3.0+ allows a larger grid .x
  my_kernel<<<block_x, 1>>>();                // one thread per block
  int h_count = 0;
  my_error = cudaMemcpyFromSymbol(&h_count, my_count, sizeof(int));
  if (my_error != cudaSuccess) {printf("failure!\n"); return 1;}
  printf("result = %d\n", h_count);           // equals the number of blocks launched
  return 0;
}


$ nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 t624.cu -o t624
$ CUDA_VISIBLE_DEVICES="0" ./t624
running on device: Quadro 5000
result = 65535
[bob@cluster1 misc]$ CUDA_VISIBLE_DEVICES="1" ./t624
running on device: GeForce GT 640
result = 100000
$

You are basically cheating by compiling the same source code twice, and thus creating two different PTX versions.

The goal is to compile once, and generate one PTX version that runs on all CUDA devices, past, present and future.

The problem is that it might not be possible to re-compile the code in the future, so it must be future-proof today.

Unfortunately there is a big problem with my solution, and CUDA in general.

It seems to be impossible to tell for which compute capability a kernel was compiled, once the kernel is loaded via the driver API.

So the question is basically:

Is there a driver API call which can be used to discover the compute capability version for which the loaded kernel was compiled ?

So far I have seen two fields of some interest:

PTX version
Binary version
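
For what it's worth, both fields can be read per loaded function via cuFuncGetAttribute; a sketch ('func' is assumed to have been obtained via cuModuleGetFunction):

#include <cuda.h>
#include <stdio.h>

// Sketch: query the PTX and binary version of an already-loaded function.
void print_function_versions(CUfunction func){
  int ptx_version = 0, bin_version = 0;
  cuFuncGetAttribute(&ptx_version, CU_FUNC_ATTRIBUTE_PTX_VERSION, func);
  cuFuncGetAttribute(&bin_version, CU_FUNC_ATTRIBUTE_BINARY_VERSION, func);
  // Both values are encoded as 10*major + minor (e.g. 20).
  printf("ptx version: %d, binary version: %d\n", ptx_version, bin_version);
}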

However I doubt that these fields correspond directly to a compute capability version.

THIS IS A BIG PROBLEM ?!

How am I supposed to apply the correct compute capability settings if my software cannot determine the compute capability version of a PTX kernel ?!?!

(I created a new topic on how to determine the correct compute capability version for a loaded PTX file; please continue discussion of this specific sub-topic there.)