Declare global variable in function scope or suggest alternate solution

Hello, I have a GeForce GTX 470 card and I am trying to take advantage of the C++ capabilities of compute capability 2.0 and CUDA 3.2.

On my old compute capability 1.3 card I implemented vectors like this:

struct DVector{
	size_t length;
	float *data;
};
typedef struct DVector DVector;

Memory allocation was done in the host using:

__host__ DVector DV (size_t n){
	DVector v;
	v.length=n;
	cudaMalloc((void **) &(v.data), n*sizeof(float));
	return v;
}

Now I’m trying to take advantage of templates (no problems), in-kernel malloc(), and member methods.

I have a Vector class (1D array + arithmetic operations).

I can make the first thread allocate memory using malloc(), but I don’t know how to put this pointer in global memory so that all threads across all the blocks have access to it.

The only way I know of declaring global variables is to use the __device__ qualifier at file scope, but I’d like to initialize an arbitrary number of vectors during execution.

I tried to get around this limitation by declaring a global pointer, __device__ T *temp;, that stores the pointer returned by malloc(), and then having each thread copy this pointer into its local memory. However, this crashes my kernel code (Unspecified kernel failure).

Can you suggest a better way to do this? I’d like to be able to allocate memory in device code rather than using cudaMalloc and passing pointers from the host.

My second problem is:

Even though Compute 2.0 is supposed to support classes, I get

Error: External calls are not supported (found non-inlined call to _ZN7DVectorIdED1Ev)

when trying to call member functions, e.g.

DVector<double> v(1000); v.square();

The DVector class is defined in a header included in main.cu so I’m aware that there is no linker.

template<class T>
class DVector
{
private:
	size_t length;
	T *data;
public:
	//Constructors
	__device__ DVector (size_t n){
		length=n;
		// Don't know how to allocate Memory
	}
.
.
.
	/*********************************
	 * VECTOR INITIALIZATION
	 *********************************/
	__device__ void init (T x){
		// Grid-stride loop: each thread handles elements tid, tid + stride, ...
		unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
		while (tid < length) {
			data[tid]=x;
			tid += blockDim.x * gridDim.x;
		}
		// Barrier for the threads of this block only
		__syncthreads();
	}
.
.
.
etc.
};

Any suggestions are welcome.
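
On the second problem: the mangled name in the error, _ZN7DVectorIdED1Ev, demangles to DVector<double>::~DVector(), so the call the compiler could not resolve is the destructor, not square(). The destructor is not shown in the post, so the following is only a guess at a fix: define the destructor in the class body and mark it __device__, like the other members. The sketch below is a simplified stand-in for the class (one private vector per thread, with hypothetical helpers fill(), at(), ok() and runSquareDemo()), not the original code.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Simplified stand-in for the DVector class: every member that device code
// calls, including the destructor, is __device__ and defined in the class body.
template <class T>
class DVector
{
    size_t length;
    T *data;
public:
    __device__ DVector(size_t n) : length(n), data((T *)malloc(n * sizeof(T))) {}
    __device__ ~DVector() { free(data); }          // free(NULL) is ignored
    __device__ bool ok() const { return data != NULL; }
    __device__ void fill(T x) { for (size_t i = 0; i < length; ++i) data[i] = x; }
    __device__ void square() { for (size_t i = 0; i < length; ++i) data[i] *= data[i]; }
    __device__ T at(size_t i) const { return data[i]; }
};

__global__ void squareDemo(double *out)
{
    DVector<double> v(8);          // private to the calling thread
    if (!v.ok()) { *out = -1.0; return; }
    v.fill(3.0);
    v.square();
    *out = v.at(7);
}                                  // ~DVector<double>() runs here, on the device

// Hypothetical host helper: runs the kernel with one thread, returns its result.
double runSquareDemo()
{
    double *dOut = NULL, hOut = 0.0;
    cudaMalloc((void **)&dOut, sizeof(double));
    squareDemo<<<1, 1>>>(dOut);
    cudaMemcpy(&hOut, dOut, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return hOut;
}
```

This needs to be compiled for compute capability 2.0 (for example with -arch=sm_20), since in-kernel malloc() is not available on older targets.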

Did you sync your threads before they access the pointer in global memory? If you did, did you make sure that every single thread executed the sync?

Also, you can put the pointer in shared memory, if only threads from the same block need to access this DVector. Otherwise, from what I know, you can only use global memory as you already did, though you need to check your syncing.
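
The shared-memory variant could look roughly like the sketch below. It assumes each block wants its own buffer; the kernel name perBlockBuffer and the host helper runPerBlockBuffer are made up for illustration. One thread per block calls malloc(), the pointer lives in a __shared__ variable, and every thread of the block passes the same __syncthreads() before reading it.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Each block allocates its own buffer. The pointer is kept in shared memory,
// so it is visible to the threads of that block only.
__global__ void perBlockBuffer(int *blockSums, size_t n)
{
    __shared__ float *buf;                       // one pointer per block

    if (threadIdx.x == 0)
        buf = (float *)malloc(n * sizeof(float));
    __syncthreads();                             // executed by every thread

    if (buf != NULL)                             // same answer for the whole block
        for (size_t i = threadIdx.x; i < n; i += blockDim.x)
            buf[i] = 1.0f;
    __syncthreads();

    if (threadIdx.x == 0) {
        int sum = -1;                            // -1 marks a failed allocation
        if (buf != NULL) {
            sum = 0;
            for (size_t i = 0; i < n; ++i) sum += (int)buf[i];
            free(buf);                           // freed by the allocating thread
        }
        blockSums[blockIdx.x] = sum;
    }
}

// Hypothetical host helper: returns the number of blocks whose sum is not n.
int runPerBlockBuffer(int blocks, int threads, size_t n)
{
    int *dSums = NULL;
    cudaMalloc((void **)&dSums, blocks * sizeof(int));
    perBlockBuffer<<<blocks, threads>>>(dSums, n);
    int *hSums = (int *)malloc(blocks * sizeof(int));
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(hSums, dSums, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    int bad = 0;
    for (int b = 0; b < blocks; ++b)
        if (hSums[b] != (int)n) ++bad;
    free(hSums);
    cudaFree(dSums);
    return bad;
}
```

Both barriers sit outside any branch that only some threads take, which is the "every single thread executed the sync" condition mentioned above.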

Thank you, that was the problem, it works now.

However, is there a way to create global variables during kernel execution?

Right now I’m using a pre-declared global pointer to hold the address returned by malloc(), and then each thread copies it into its local memory.

__device__ double* temp;
template<class T>
class DVector
{
public:
	size_t length;
	T *data;
	__device__ DVector (size_t n){
		unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
		length=n;
		if(tid==0)
			temp=(double*) malloc(n*sizeof(double));
		__syncthreads();
		data=temp;
	}
.
.
.
};

Thanks again
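
One caveat with the constructor above: __syncthreads() is a barrier for the threads of a single block, so with more than one block, threads outside block 0 may read temp before thread 0 has written it. A possible alternative, sketched below under the assumption that splitting the work into several kernel launches is acceptable, is to allocate in a one-thread kernel and use the pointer in later kernels; memory from in-kernel malloc() stays valid across launches until it is freed. The names gData, allocKernel, fillKernel, copyOutAndFree and runTwoKernel are made up for this example.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

__device__ float *gData;                 // file-scope slot holding the heap pointer

// Launched with a single thread: allocates from the device heap.
__global__ void allocKernel(size_t n)
{
    gData = (float *)malloc(n * sizeof(float));
}

// Launched with any grid: allocKernel has finished, so gData is already set.
__global__ void fillKernel(size_t n, float x)
{
    float *data = gData;                 // per-thread copy of the pointer
    if (data == NULL) return;
    for (size_t i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
         i += blockDim.x * gridDim.x)
        data[i] = x;
}

// Launched with a single thread: copies the heap buffer into a cudaMalloc
// buffer so the host can read it, then releases the heap memory.
__global__ void copyOutAndFree(float *out, size_t n)
{
    if (gData == NULL) return;
    for (size_t i = 0; i < n; ++i) out[i] = gData[i];
    free(gData);
    gData = NULL;
}

// Hypothetical host helper: returns the number of elements that differ from x.
int runTwoKernel(size_t n, float x)
{
    float *dOut = NULL;
    cudaMalloc((void **)&dOut, n * sizeof(float));
    cudaMemset(dOut, 0, n * sizeof(float));
    // Kernels launched into the same stream run one after another.
    allocKernel<<<1, 1>>>(n);
    fillKernel<<<4, 64>>>(n, x);
    copyOutAndFree<<<1, 1>>>(dOut, n);
    float *hOut = (float *)malloc(n * sizeof(float));
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    int bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (hOut[i] != x) ++bad;
    free(hOut);
    cudaFree(dOut);
    return bad;
}
```

This still relies on one file-scope __device__ variable. For an arbitrary number of vectors, that variable could instead be a pointer to a table of data pointers that is itself allocated with malloc() in the first kernel.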