However, the following code works well on CUDA 2.2. So, if I write kernels like this, does it mean I am in unsupported land?
#include <stdio.h>

class base
{
protected:
    int data[100]; /* __device__ qualification FAILS for data members of a class */
public:
    __device__ __host__ int store(int i, int d)
    {
        if (i < 100)
        {
            data[i] = d;
        }
        return 0;
    }
};
/*
 * __device__ and __host__ qualifiers work only for FUNCTIONS.
 *
 * Using them in front of the "class" specifier does NOT work.
 *
 * To use a C++ object in a kernel, one needs to put __device__ in front of every function
 * of that class.
 * The data members must not carry any qualification; the compiler errors out otherwise.
 * The data members reside in the same memory space where the object is declared.
 */
class sample : public base
{
public:
    __device__ sample()
    {
        for (int i = 0; i < 100; i++)
            data[i] = i;
    }
    __device__ __host__ int fetch(int i)
    {
        return data[i];
    }
};

__global__ void mykernel(float *result)
{
    __shared__ sample d; /* the inherited data[100] member occupies SHARED MEMORY space */
    int sum = 0;
    for (int i = 0; i < 100; i++)
        sum += d.fetch(i);
    *result = sum;
    d.store(0, sum);
}

int main(void)
{
    void *result;
    float data;
    cudaMalloc(&result, sizeof(float));
    mykernel<<<1, 1>>>((float*)result);
    cudaThreadSynchronize();
    cudaMemcpy(&data, result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", data);
    return 0;
}
In practice, simple classes with public/private data and methods work fine.
It’s pointers that are the sticking point; that rules out pretty much any C++ library, virtual inheritance, etc.
But I’ve found that simple classes are enough for a lot of nice techniques… especially when used as functors with templates. I even use template metaprogramming a lot in CUDA for a PRNG.
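To make the functor idea concrete, here is a minimal sketch of the pattern I mean (the kernel, functor, and variable names are made up for illustration, not taken from any library):

template <typename Op>
__global__ void apply(float *out, const float *in, int n, Op op)
{
    /* one thread per element; the functor supplies the per-element operation */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = op(in[i]);
}

struct scale
{
    float factor;
    __device__ __host__ scale(float f) : factor(f) {}
    __device__ float operator()(float x) const { return x * factor; }
};

/* usage (assuming d_out/d_in already point to device memory):
 *   apply<<<(n + 255) / 256, 256>>>(d_out, d_in, n, scale(2.0f));
 */

The compiler inlines operator() into each instantiation of the kernel, so the abstraction costs essentially nothing at run time.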
There’s a small gotcha where device emulation mode doesn’t like private data or methods. That’s likely a bug; I haven’t tested it since CUDA 2.0. The workaround is to just do something simple like:
#ifndef __DEVICE_EMULATION__
private:
#endif
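For context, this is roughly how that guard sits inside a class definition (a minimal illustrative sketch; the class and member names are invented):

class counter
{
public:
    __device__ __host__ counter() : count(0) {}
    __device__ __host__ int next() { return count++; }

#ifndef __DEVICE_EMULATION__
private:   /* only hide the data when NOT building for device emulation */
#endif
    int count;
};

In a normal device build the data stays private; under emulation it simply remains in the preceding public section, which sidesteps the (presumed) bug without changing the interface you actually call.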
Study all of Mark Harris’s code, like the scan examples and CUDPP. He’s really good at using templates to generalize and parameterize kernels. I learned a lot from him (and it’s not even GPU-specific).
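As a rough idea of what that parameterization looks like, here is a simplified, hypothetical sketch in the spirit of those reduction/scan kernels (not code taken from them):

/* BLOCK_SIZE is a compile-time template parameter (assumed to be a power of two),
 * so the reduction loop bound is known to the compiler and can be unrolled. */
template <unsigned int BLOCK_SIZE>
__global__ void block_sum(float *out, const float *in)
{
    __shared__ float buf[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;

    buf[tid] = in[blockIdx.x * BLOCK_SIZE + tid];
    __syncthreads();

    /* tree reduction within the block */
    for (unsigned int s = BLOCK_SIZE / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];
}

/* the host picks the instantiation, e.g. block_sum<256><<<grid, 256>>>(d_out, d_in); */

One host-side switch on block size gives you a whole family of specialized kernels from a single source.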
@sarnath, that class fails with the int data[100] because you’re then effectively asking for dynamic device memory allocation, and that just ain’t gonna happen. This is true even if it were going to be local memory.
This is where memory becomes a gotcha: when a fundamental assumption like that becomes invalid, it affects everything built on top of it.
Maybe a workaround is to statically allocate the local memory yourself and then pass it to your class in the constructor. That’s ugly too, but it’s likely to work.
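Something along these lines is what I have in mind (a sketch only, with invented names, reworking the earlier example):

class sample2
{
    int *data;   /* borrows storage owned by the caller instead of holding an array member */
public:
    __device__ sample2(int *storage) : data(storage)
    {
        for (int i = 0; i < 100; i++)
            data[i] = i;
    }
    __device__ int fetch(int i) const { return data[i]; }
};

__global__ void mykernel2(float *result)
{
    int storage[100];     /* per-thread local memory, statically allocated by the kernel */
    sample2 d(storage);   /* the object just wraps it */
    int sum = 0;
    for (int i = 0; i < 100; i++)
        sum += d.fetch(i);
    *result = sum;
}

The kernel (or the host, for global memory) owns the allocation; the class only ever sees a pointer it was handed, so nothing inside the class needs to allocate anything.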