Conversion from C++ functors to __device__ functors

Hi everybody. I’m writing a library to help people who don’t know CUDA programming use its features. In particular, I’m translating the most common STL algorithms into parallel versions. I’d like to hide the .cu files completely, so that a user can write parallel code with no knowledge of the implementation details. Now I’m facing an apparently insuperable obstacle: functors. I’d like to give the user the opportunity to write a brand-new functor in their .cpp code and pass it to a parallelized STL algorithm (for instance, a for_each_parallel or a generate_parallel) that invokes kernels using a “deviced” version of the user-defined functor. Translating algorithms without functors is simple, but I really don’t know how to translate .cpp code (with restrictions, of course) into .cu code using only libraries. Any ideas? Thank you to everybody.
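To make the goal concrete, here is the kind of user code I’d like to support (a sketch; for_each_parallel and the add_step functor are hypothetical, nothing like this exists yet):

// user.cpp: ordinary host code, no CUDA in sight
struct add_step
{
	float step;
	explicit add_step(float s) : step(s) {}
	void operator()(float& x) const { x += step; }
};

std::vector<float> v(N);
for_each_parallel(v.begin(), v.end(), add_step(0.5f), USING_CUDA);

The problem is that add_step is compiled by the host compiler inside a .cpp translation unit, so the .cu library never sees its definition and cannot produce a __device__ version of its operator().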

Biagio

Are you sure you are not just reinventing Thrust?

Yes, I am! The difference between my project and Thrust is, fundamentally, the “visibility” of the CUDA code. My aim is to let the user write .cpp code, NOT .cu code! By the end of the project, I’d like to be able to switch between CUDA parallelism and multicore parallelism by passing a single parameter like “USING_CUDA” or “USING_MULTI_CORE”. That’s why I must hide the .cu files in a library. For instance, this is how I wrote a parallel fill algorithm:

main.cpp

#include <vector>
#include "STL_parallel.h"

#define N 100

//...

int main()
{
	// ...
	std::vector<float> vett(N);
	fill_p(vett.begin(), vett.end(), 4.0f, USING_CUDA);
	// ...
}

STL_parallel.h

#include <algorithm>  // std::fill
#include <iterator>   // std::distance

template<class ForwardIterator, class T>
void fill_p(ForwardIterator first, ForwardIterator last, T value, Parallelism p_type)
{
	// Assumes a contiguous container (e.g. std::vector), so &first[0]
	// points at the underlying array.
	T* p = &first[0];

	switch (p_type)
	{
	case USING_CUDA:
		fill_CUDA(p, (int)std::distance(first, last), value);
		break;
	case USING_MULTI_CORE:
		// multi-core version of fill
		break;
	default:
		std::fill(first, last, value);
		break;
	}
}
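For context, the enum and declarations that STL_parallel.h relies on look roughly like this (a sketch; the exact spelling of the enum and the SEQUENTIAL fallback value are assumptions):

enum Parallelism { USING_CUDA, USING_MULTI_CORE, SEQUENTIAL };

// Implemented in STL_CUDA.cu, one overload per supported element type.
void fill_CUDA(float* first, int dist, float value);
void fill_CUDA(int* first, int dist, int value);
void fill_CUDA(double* first, int dist, double value);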

STL_CUDA.cu (analogous overloads exist for int and double)

void fill_CUDA(float* first, int dist, float value)
{
	float* d_vett;
	CUDA_SAFE_CALL( cudaMalloc(&d_vett, dist * sizeof(float)) );

	// One thread per element, rounded up to a whole number of blocks.
	dim3 blocks_per_grid((dist + BLOCK_SIZE - 1) / BLOCK_SIZE, 1, 1);
	dim3 threads_per_block(BLOCK_SIZE, 1, 1);

	fill_kernel<<<blocks_per_grid, threads_per_block>>>(d_vett, dist, value);

	CUDA_SAFE_CALL( cudaThreadSynchronize() );
	CUDA_SAFE_CALL( cudaMemcpy(first, d_vett, dist * sizeof(float), cudaMemcpyDeviceToHost) );
	CUDA_SAFE_CALL( cudaFree(d_vett) );
}
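CUDA_SAFE_CALL comes from cutil.h in the old SDK samples; for a self-contained library, a minimal replacement (a sketch, not the SDK’s exact macro) could be:

#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_SAFE_CALL(call)                                           \
	do {                                                               \
		cudaError_t err__ = (call);                                    \
		if (err__ != cudaSuccess) {                                    \
			fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",            \
			        cudaGetErrorString(err__), __FILE__, __LINE__);    \
			exit(EXIT_FAILURE);                                        \
		}                                                              \
	} while (0)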

STL_CUDA_kernels.cu

template<class T>
__global__ void fill_kernel(T* v, int dist, T value)
{
	// Global thread index; threads past the end of the array do nothing.
	int index = blockIdx.x * blockDim.x + threadIdx.x;
	if (index < dist)
		v[index] = value;
}
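For comparison, the functor version of such a kernel would look like this on the device side (a sketch; for_each_kernel is hypothetical). This is exactly where my obstacle shows up: the functor type F must be visible to nvcc and its operator() must be callable from device code, so it cannot be an arbitrary type defined in a plain .cpp file:

template<class T, class F>
__global__ void for_each_kernel(T* v, int dist, F f)
{
	int index = blockIdx.x * blockDim.x + threadIdx.x;
	if (index < dist)
		f(v[index]);   // requires F::operator() to be a __device__ function
}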