Code compiled for newer architectures runs on older architectures but produces errors/incorrect results?


New to this, so I’m probably missing something. Setup: CUDA 8, gcc 4.9.1, and a K20X card, which has compute capability 3.5. NVIDIA driver 375.26. I also tried this on CUDA 9 with driver 387.26.

Simple example code that copies individual integers to the GPU in a loop and calculates a cumulative sum. I don’t think the code is relevant, but I pasted it at the bottom anyway (ignore that it uses the GPU in an obviously insane way). If I compile with -gencode=arch=compute_35,code=sm_35 it works as expected, returning the sum of the integers from 1 to 10^6. If I compile with -gencode=arch=compute_60,code=sm_60 it compiles fine and cuobjdump shows that only code for sm_60 was generated:

$ cuobjdump -all add_loop
Fatbin elf code:

arch = sm_60
code version = [1,7]
producer =
host = linux
compile_size = 64bit

Fatbin elf code:

arch = sm_60
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

I would have expected this not to run at all on the K20X, but it does run, only it returns 0 instead of the sum. That same binary returns the correct sum on a P100. So why does the runtime select sm_60 code to run on an sm_35 device without complaint?

As an aside, I also did this with matrixMul from cuda_samples: compiling for sm_60, it still runs on sm_35 but produces errors.


#include <stdio.h>
#include <stdlib.h>

static void HandleError( cudaError_t err,
                         const char *file,
                         int line ) {
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
                file, line );
        exit( EXIT_FAILURE );
    }
}

#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

__global__ void add(long int *a, long int *b) {
    *a = *a + *b;
}
int main(void) {
    long int a;
    long int *da, *db;
    size_t si = sizeof(long int);
    // allocate device memory
    HANDLE_ERROR(cudaMalloc((void **)&da, si));
    HANDLE_ERROR(cudaMalloc((void **)&db, si));
    // initial value
    a = 0;
    HANDLE_ERROR(cudaMemcpy(da, &a, si, cudaMemcpyHostToDevice));
    for (long int i = 0; i <= 1000000; i++) {
        // copy to device
        HANDLE_ERROR(cudaMemcpy(db, &i, si, cudaMemcpyHostToDevice));
        // launch kernel
        add<<<1,1>>>(da, db);
    }
    // copy results back
    HANDLE_ERROR(cudaMemcpy(&a, da, si, cudaMemcpyDeviceToHost));
    // cleaning
    cudaFree(da);
    cudaFree(db);
    printf("result = %li\n", a);
    return 0;
}
The runtime isn’t selecting sm_60 code to run on an sm_35 device. What’s happening is that the runtime is reporting an error, but your code is ignoring that error report. So all you see is an app that appears to run correctly but produces incorrect results.

Do a Google search on “proper CUDA error checking”, take the first hit, study it (carefully!), and apply it to your code. Your code will then report errors when you try to run the sm_60 binary on an sm_35 device.

Alternatively just look at the error checking methodology in one of the CUDA sample codes.

Anticipating the next question “But I am doing error checking???”

Yes, you are, but it is incomplete. Immediately after the kernel call (or any time after the kernel call, say, at the end of your code), add this:

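A sketch of the kind of post-launch check being described here, reusing the HANDLE_ERROR macro from the code above (the exact calls are my assumption; the key point is that kernel launch errors are only surfaced by an explicit check after the launch):

// Kernel launches are asynchronous and do not return errors directly.
// cudaPeekAtLastError() catches launch-time failures such as
// "no kernel image is available for execution on the device";
// cudaDeviceSynchronize() waits for the kernel and catches
// errors that occur during execution.
add<<<1,1>>>(da, db);
HANDLE_ERROR(cudaPeekAtLastError());
HANDLE_ERROR(cudaDeviceSynchronize());

With these two lines in place, running the sm_60 binary on the K20X should abort with an error message instead of silently returning 0.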

Ah, thanks. I was missing something basic because I made an incorrect assumption that the runtime would ensure that images for the current hardware were available in the binary. Now I do get ‘no kernel image available…’. I did not look at enough of the samples, and matrixMul happens to skip this step as well…