I have a system with C1060s and I’m trying to convert the “template” project into my own routine. The problem I’m having is that the runtime gives me an error if I request more than 512 threads per block, but the data sheets say I should have 1024 threads per block. I have forced the compiler to use sm_13 with the line
CUFILES_sm_13 := Lebesgue_genpower.cu
which does what I expect with nvcc:
/usr/local/cuda/bin/nvcc -o obj/x86_64/release/Lebesgue_genpower.cu_sm_13.o -c Lebesgue_genpower.cu --compiler-options -fno-strict-aliasing -I. -I/usr/local/cuda/include -I…/…/common/inc -I…/…/…/shared//inc -DUNIX -O2 -arch sm_13 …
How do I tell the runtime which compute capability to use? I thought cudaSetDevice() would set that up automatically.
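For reference, here is a minimal sketch of how the limits the runtime actually reports for each device could be checked, using cudaGetDeviceProperties(). It is a standalone test program, not part of the "template" project, and the output format is just my own choice. It only prints what the runtime returns, so it should show whether the per-block limit on these cards is 512 or 1024.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Standalone diagnostic sketch: prints the per-block limits that the
// CUDA runtime reports for every device in the system.
int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties(%d) failed: %s\n",
                    dev, cudaGetErrorString(err));
            return 1;
        }

        // Compute capability as reported by the runtime for this device.
        printf("device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);

        // Upper bound on the total number of threads in one block.
        printf("  maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);

        // Upper bound on each dimension of the block.
        printf("  maxThreadsDim      = (%d, %d, %d)\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1],
               prop.maxThreadsDim[2]);
    }
    return 0;
}
```

This should build with something like `nvcc -o devlimits devlimits.cu` (the file name is hypothetical). If maxThreadsPerBlock comes back smaller than the block size being requested, the launch error would come from the device limit rather than from the -arch setting.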