Launching kernels fails with sm_35, but works fine with sm_30

Hello

Well, I have (almost) just started my experience with CUDA programming, and it has been successful so far, with one little exception. I have written several kernels, ran them successfully, and got all I wanted, but then came to the point where I wanted one kernel to launch subsequent kernels, etc. Since I read that I have to enable the sm_35 architecture, I just added -arch=sm_35 and found the compilation running smoothly, but the application itself gave me very strange results. I’ve done some research and found out that no kernels run successfully with -arch=sm_35, but everything seems to be OK with -arch=sm_30. I’m running a GTX 670, so I assume it’s not a GPU incompatibility, but rather something with the compiler.
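To illustrate what I mean by one kernel launching subsequent kernels, here is a minimal sketch (names are made up, and as far as I understand, building it would also need -rdc=true and linking against cudadevrt):

```
// Hypothetical dynamic-parallelism sketch (requires compute capability 3.5+).
__global__ void child(int *data)
{
	data[threadIdx.x] += 1;
}

__global__ void parent(int *data)
{
	// A kernel launched from device code; this is what requires sm_35.
	// Build sketch: nvcc -arch=sm_35 -rdc=true errtest.cu -lcudadevrt
	child<<<1, 32>>>(data);
}
```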

Here is my nvcc invocation for the simplest test:

nvcc -m64 -arch=sm_35 -c errtest.cu
nvcc -m64 -arch=sm_35 -o err errtest.obj

where errtest.cu is:

#include "helper_cuda.h"
#include <iostream>
#include <iomanip>

typedef unsigned long long uint64;

__global__ void add1(const uint64* d_in, uint64* d_out)
{
	int pos = threadIdx.x;
	
	d_out[pos] = d_in[pos] + 1;
}

__global__ void nothing()
{
}

int main()
{
	const size_t ARRAY_SIZE = 64, ARRAY_BYTES = ARRAY_SIZE * sizeof(uint64);
	
	uint64 h_arr[ARRAY_SIZE];
	
	for(int i=0; i<ARRAY_SIZE; i++)
		h_arr[i] = i;
		
	uint64 *d_in;
	uint64 *d_out;
	
	cudaMalloc(&d_in , ARRAY_BYTES);
	cudaMalloc(&d_out, ARRAY_BYTES);
	
	cudaMemcpy(d_in, h_arr, ARRAY_BYTES, cudaMemcpyHostToDevice);
	getLastCudaError("Memcpy H2D failed");

	#ifndef NOTHING
	add1<<<1, ARRAY_SIZE>>>(d_in, d_out);
	#else
	nothing<<<1, ARRAY_SIZE>>>();
	#endif
	getLastCudaError("Kernel execution failed");
	
	cudaMemcpy(h_arr, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
	getLastCudaError("Memcpy D2H failed");
	
	std::cout<<std::endl<<"add1:"<<std::endl;
	for(int i=0; i<ARRAY_SIZE; i++)
		std::cout << std::hex << std::setw(16) << h_arr[i] << ((i%3 == 2) ? "\n" : "\t");
	std::cout<<std::endl;
	
	cudaFree(d_in);
	cudaFree(d_out);
}

(helper_cuda.h is the one from the CUDA 5 Toolkit samples)

As I run the executable, I get the following:

errtest.cu(41) : getLastCudaError() CUDA error : Kernel execution failed : (8) invalid device function .

But when I switch from sm_35 to sm_30, everything runs smoothly.

Am I missing something? A compiler directive, a function call? Once again, everything runs smoothly and without any errors on all other architectures.

Thanks in advance, M.

Hi Marandil,

The GTX 670 is a compute capability 3.0 device (not 3.5), so, as far as I know, it does not support dynamic parallelism. Code compiled only for sm_35 cannot run on a 3.0 device at all, which is why every kernel launch fails with “invalid device function”.

More details regarding compute capability: https://developer.nvidia.com/cuda-gpus
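If you ever want a single binary that runs on both 3.0 and 3.5 devices, nvcc can embed code for several architectures in one fat binary via -gencode (untested sketch for the errtest.cu above):

```shell
nvcc -m64 \
     -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -o err errtest.cu
```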

M. López.

Thanks

I got confused because I found somewhere (can’t find the source atm) that 3.5 was introduced with the Kepler architecture, but it turns out it’s only available in GK110, not in the earlier releases (although all of them are called Kepler).

Well, at least now I won’t waste my time trying to figure out what I’m doing wrong.