Hello
Well, I have (almost) just started my experience with CUDA programming, and I have been successful so far, with one little exception. I have written several kernels, run them successfully, and got everything I wanted, but then I reached the point where I wanted one kernel to launch subsequent threads, and so on. Since I was told that this requires the sm_35 architecture, I just added -arch=sm_35. Compilation ran smoothly, but the application itself gave me very strange results. After some research I found that no kernel runs successfully when built with -arch=sm_35, while everything seems to be OK with -arch=sm_30. I'm running a GTX 670, so I assume that it's not a GPU incompatibility but rather something with the compiler.
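A quick way to test that assumption instead of taking it on trust would be to ask the runtime which compute capability the card reports. This is only a minimal sketch: it assumes the GTX 670 is device 0, and the target values 3 and 5 are simply the -arch=sm_35 flag written out as numbers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int device = 0;  // assumption: the GTX 670 is device 0

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // prop.major and prop.minor hold the compute capability,
    // e.g. 3 and 0 for a device that matches sm_30.
    std::printf("%s: compute capability %d.%d\n",
                prop.name, prop.major, prop.minor);

    // Target architecture from the nvcc command line (-arch=sm_35).
    const int target_major = 3;
    const int target_minor = 5;

    // Code built only for sm_XY needs a device of capability X.Y or higher.
    const bool can_run =
        (prop.major > target_major) ||
        (prop.major == target_major && prop.minor >= target_minor);

    std::printf("sm_%d%d code %s run on this device\n",
                target_major, target_minor,
                can_run ? "can" : "cannot");
    return 0;
}
```

If the reported capability turns out to be lower than 3.5, the binary contains no code the device can load, which would be consistent with an "invalid device function" error at kernel launch.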
Here is my nvcc invocation for the simplest test:
nvcc -m64 -arch=sm_35 -c errtest.cu
nvcc -m64 -arch=sm_35 -o err errtest.obj
where errtest.cu is:
#include "helper_cuda.h"
#include <iostream>
#include <iomanip>

typedef unsigned long long uint64;

__global__ void add1(const uint64* d_in, uint64* d_out)
{
    int pos = threadIdx.x;
    d_out[pos] = d_in[pos] + 1;
}

__global__ void nothing()
{
}

int main()
{
    const size_t ARRAY_SIZE = 64, ARRAY_BYTES = ARRAY_SIZE * sizeof(uint64);
    uint64 h_arr[ARRAY_SIZE];
    for(int i=0; i<ARRAY_SIZE; i++)
        h_arr[i] = i;
    uint64 *d_in;
    uint64 *d_out;
    cudaMalloc(&d_in , ARRAY_BYTES);
    cudaMalloc(&d_out, ARRAY_BYTES);
    cudaMemcpy(d_in, h_arr, ARRAY_BYTES, cudaMemcpyHostToDevice);
    getLastCudaError("Memcpy H2D failed");
#ifndef NOTHING
    add1<<<1, ARRAY_SIZE>>>(d_in, d_out);
#else
    nothing<<<1, ARRAY_SIZE>>>();
#endif
    getLastCudaError("Kernel execution failed");
    cudaMemcpy(h_arr, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    getLastCudaError("Memcpy D2H failed");
    std::cout<<std::endl<<"add1:"<<std::endl;
    for(int i=0; i<ARRAY_SIZE; i++)
        std::cout << std::hex << std::setw(16) << h_arr[i] << ((i%3 == 2) ? "\n" : "\t");
    std::cout<<std::endl;
}
(helper_cuda.h is the one from the CUDA 5 Toolkit examples)
When I run the executable, I get the following:
errtest.cu(41) : getLastCudaError() CUDA error : Kernel execution failed : (8) invalid device function .
But when I switch from sm_35 to sm_30, everything runs smoothly.
Here's the full experiment: [attached screenshot]
Am I missing something? A compiler directive, a function call? Once again, everything runs smoothly and without any errors on all other architectures.
Thanks in advance, M.