Code compiled for newer architectures runs on older architectures but produces errors/incorrect results?


New to this, so I’m probably missing something. Setup: CUDA 8, gcc 4.9.1, and a K20X card, which has compute capability 3.5. NVIDIA driver 375.26. I also tried this on CUDA 9 with driver 387.26.

Simple example code that copies individual integers to the GPU in a loop and calculates a cumulative sum. I don’t think the code is relevant, but I pasted it at the bottom anyway (ignore that it uses the GPU in an obviously insane way). If I compile with -gencode=arch=compute_35,code=sm_35 it works as expected, returning the sum of the integers from 1 to 10^6. If I compile with -gencode=arch=compute_60,code=sm_60 it compiles fine and cuobjdump shows that only code for sm_60 was generated:

$ cuobjdump -all add_loop
Fatbin elf code:

arch = sm_60
code version = [1,7]
producer =
host = linux
compile_size = 64bit

Fatbin elf code:

arch = sm_60
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

I would have expected this not to run at all on the K20X, but it does run, only it returns 0 instead of the sum. That same binary returns the correct sum on a P100. So why does the runtime select sm_60 code to run on an sm_35 device without complaint?

As an aside, I also did this with matrixMul from cuda_samples: compiling for sm_60, it still runs on sm_35 but produces errors.


#include <stdio.h>
#include <stdlib.h>

static void HandleError( cudaError_t err,
                         const char *file,
                         int line ) {
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
                file, line );
        exit( EXIT_FAILURE );
    }
}

#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

__global__ void add(long int *a, long int *b) {
    *a = *a + *b;
}
int main(void) {
    long int a;
    long int *da, *db;
    size_t si = sizeof(long int);
    // allocate device memory
    HANDLE_ERROR(cudaMalloc((void **)&da, si));
    HANDLE_ERROR(cudaMalloc((void **)&db, si));
    // initial value
    a = 0;
    HANDLE_ERROR(cudaMemcpy(da, &a, si, cudaMemcpyHostToDevice));
    for (long int i = 0; i <= 1000000; i++) {
        // copy to device
        HANDLE_ERROR(cudaMemcpy(db, &i, si, cudaMemcpyHostToDevice));
        // launch kernel
        add<<<1,1>>>(da, db);
    }
    // copy results back
    HANDLE_ERROR(cudaMemcpy(&a, da, si, cudaMemcpyDeviceToHost));
    // cleaning
    cudaFree(da);
    cudaFree(db);
    printf("result = %li\n", a);
    return 0;
}
The runtime isn’t selecting sm_60 code to run on an sm_35 device. What’s happening is that the runtime is reporting an error, but your code is ignoring that error report. So all you see is an app that appears to run correctly but produces incorrect results.

Do a Google search on “proper CUDA error checking”, take the first hit, study it (carefully!), and apply it to your code. Your code will then report errors when you try to run the sm_60 binary on an sm_35 device.

Alternatively just look at the error checking methodology in one of the CUDA sample codes.

Anticipating the next question “But I am doing error checking???”

Yes, you are, but it is incomplete. Immediately after the kernel call (or any time after the kernel call, say, at the end of your code), add this:

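A sketch of the kind of post-launch check being described here, reusing the HANDLE_ERROR macro from the code above (the exact calls are my assumption; the key point is that kernel launch errors are only surfaced by an explicit check after the launch):

// Kernel launches are asynchronous and do not return errors directly.
// cudaPeekAtLastError() catches launch-time failures such as
// "no kernel image is available for execution on the device";
// cudaDeviceSynchronize() waits for the kernel and catches
// errors that occur during execution.
add<<<1,1>>>(da, db);
HANDLE_ERROR(cudaPeekAtLastError());
HANDLE_ERROR(cudaDeviceSynchronize());

With these two lines in place, running the sm_60 binary on the K20X should abort with an error message instead of silently returning 0.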

Ah, thanks. I was missing something basic because I made an incorrect assumption that the runtime would ensure that images for the current hardware were available in the binary. Now I do get ‘no kernel image available…’. I did not look at enough of the samples, and matrixMul happens to skip this step as well…