Need help with classes (OOP) and CUDA

Hi!
I am trying to make my CUDA code more object-oriented, but I keep running into CUDA's limitations. I hope you can help me. I want to do something like this (simplified):

//main.cu

__global__ void compute(Swarm *swarm){
	int threadID = threadIdx.x;
	swarm->getParticleOnPositionI(threadID)->computeFitness();
}

int main(){
	Swarm *deviceSwarm, *hostSwarm;

	hostSwarm = new Swarm();

	size_t sizeOfSwarm = sizeof(Swarm); // I think here is the first mistake?

	cudaMalloc((void**)&deviceSwarm, sizeOfSwarm);
	cudaMemcpy(deviceSwarm, hostSwarm, sizeOfSwarm, cudaMemcpyHostToDevice);

	compute(deviceSwarm);

	cudaMemcpy(hostSwarm, deviceSwarm, sizeOfSwarm, cudaMemcpyDeviceToHost);

	return 0;
}
//Swarm.h

#ifdef __CUDACC__
#define CUDA_CALLABLE_MEMBER   __device__
#else
#define CUDA_CALLABLE_MEMBER
#endif

class Swarm
{
public:
	Swarm(void);
	~Swarm(void);

	#ifdef __CUDACC__
	CUDA_CALLABLE_MEMBER Particle* getParticleOnPositionI(int i){
		return &particles[i];
	}
	#endif

private:
	Particle *particles;
};
//Particle.h
#ifdef __CUDACC__
#define CUDA_CALLABLE_MEMBER   __device__
#else
#define CUDA_CALLABLE_MEMBER
#endif

class Particle
{
public:
	Particle(void);
	~Particle(void);

	#ifdef __CUDACC__
	CUDA_CALLABLE_MEMBER void computeFitness(){
		int i = threadIdx.x;
		*fitness = i*i;
	}
	#endif

private:
	double *fitness;
};

I have posted only the header files for convenience.
This code does not work as intended. Perhaps you can help me modify it and get it to work. I have tried to find literature about OOP with CUDA, but nothing fits my problem :(

A few items:
The compute kernel call has no grid/block information (no <<<grid, block>>> launch configuration).
The memory pointed to by “fitness” and “particles” is never allocated (and consequently no data is ever copied to it).
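
As a minimal sketch of the first item, this is what a launch with an explicit grid/block configuration can look like. The kernel squareIndex, the helper lastSquare, and the sizes are hypothetical and not taken from the original post; the buffer is allocated with cudaMalloc before the kernel touches it, which also illustrates the second item.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores the square of its global index.
__global__ void squareIndex(double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                       // the last block may be only partly used
        out[i] = static_cast<double>(i) * i;
    }
}

// Hypothetical helper: runs the kernel over n (> 0) elements and returns the
// last element copied back from the device, or -1.0 if any CUDA call failed.
double lastSquare(int n)
{
    double *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(double)) != cudaSuccess) return -1.0;

    const int threadsPerBlock = 128;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    squareIndex<<<blocks, threadsPerBlock>>>(d_out, n);

    double last = -1.0;
    cudaError_t err = cudaGetLastError();   // reports an invalid launch
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel on the default stream before copying.
        err = cudaMemcpy(&last, d_out + (n - 1), sizeof(double),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_out);
    return err == cudaSuccess ? last : -1.0;
}

int main()
{
    std::printf("last square for n = 1000: %.1f\n", lastSquare(1000));
    return 0;
}
```

Without the <<<blocks, threadsPerBlock>>> part, nvcc rejects the call to a __global__ function, so the original compute(deviceSwarm) line cannot compile as posted.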

As I understand it, “swarm” is not an array but a single object. You could pass that object directly to the kernel by value, so that it is copied at launch (provided the object does not hold a pointer to a host-allocated array, as it does now).
Similarly for “fitness”: if it is a scalar value, you can drop the pointer and make it a plain double member.

I would say your biggest problem for now is the array (?) of particles, which is never allocated with cudaMalloc.

Hope this helps.

Yes, thank you very much, that helps a lot!
How can I cudaMalloc the particles array of the deviceSwarm? I don’t know how to allocate an element within an element (the particles are elements of the swarm).

You do it as you would with any device pointer that you want to allocate. In this case you could do it in the constructor of the Swarm class, if you already know how many particles you will need, with a matching cudaFree in the destructor. Alternatively, do it through some setParticles(const Particle* const h_particles, const size_t numParticles); function, in which you cudaMalloc the device Particle array and cudaMemcpy the particles into it.

Unless the particles are populated on the device itself, you will also need a host array of particles to copy from.
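
A minimal sketch of the setParticles idea, under a few assumptions that are not in the original posts: Particle is reduced to a struct with a scalar fitness, the class owns only the device array, copying the class is disabled so the destructor cannot free the same memory twice, and copyParticlesToHost and roundTrip are hypothetical helpers added to make the example checkable.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

struct Particle { double fitness; };   // simplified stand-in for the real class

class Swarm
{
public:
    Swarm() : d_particles(nullptr), numParticles(0) {}
    ~Swarm() { cudaFree(d_particles); }          // cudaFree(nullptr) does nothing

    // The class owns device memory, so copies are disabled in this sketch.
    Swarm(const Swarm&) = delete;
    Swarm& operator=(const Swarm&) = delete;

    // Allocates the device array and copies the host particles into it.
    cudaError_t setParticles(const Particle* const h_particles, const size_t count)
    {
        cudaFree(d_particles);                   // drop any previous array
        d_particles = nullptr;
        numParticles = 0;
        cudaError_t err = cudaMalloc(&d_particles, count * sizeof(Particle));
        if (err != cudaSuccess) { d_particles = nullptr; return err; }
        numParticles = count;
        return cudaMemcpy(d_particles, h_particles, count * sizeof(Particle),
                          cudaMemcpyHostToDevice);
    }

    // Hypothetical helper: copies the device particles back into h_out.
    cudaError_t copyParticlesToHost(Particle* h_out) const
    {
        return cudaMemcpy(h_out, d_particles, numParticles * sizeof(Particle),
                          cudaMemcpyDeviceToHost);
    }

private:
    Particle* d_particles;   // points to device memory only
    size_t numParticles;
};

// Hypothetical check: host -> device -> host must preserve the values.
bool roundTrip()
{
    Particle in[4] = {{1.0}, {2.0}, {3.0}, {4.0}};
    Particle out[4] = {};
    Swarm swarm;
    if (swarm.setParticles(in, 4) != cudaSuccess) return false;
    if (swarm.copyParticlesToHost(out) != cudaSuccess) return false;
    for (int i = 0; i < 4; ++i) {
        if (out[i].fitness != in[i].fitness) return false;
    }
    return true;
}

int main()
{
    std::printf("round trip %s\n", roundTrip() ? "ok" : "failed");
    return 0;
}
```

All CUDA calls here run on the host and are made through a host object; the device pointer is only stored, never dereferenced in host code.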

Great! I think I have understood the concept now! I will try it on Monday, as I have to do something else at the moment, and will let you know if it works for me :)
Thank you very very much for your help!

So, I tried to do it with some kind of setParticles(…) (as you said), but still without success.
I have problems accessing the members of the deviceSwarm. Doing this in the main.cu file:

	Swarm *deviceSwarm, *hostSwarm;

	hostSwarm = new Swarm();
	hostSwarm->initializeParticles();

	size_t sizeOfSwarm = sizeof(Swarm);
	size_t sizeOfParticles = sizeof(hostSwarm->getParticles());

	cudaMalloc((void**)&deviceSwarm, sizeOfSwarm);
	deviceSwarm->allocateDeviceMemories(sizeOfParticles);

will not work, as I cannot access device objects like this on the host. My question is: how do I get at the particles of the deviceSwarm, so that I can set them and copy the data from the hostParticles to the deviceParticles?

That is indeed impossible: a pointer returned by cudaMalloc refers to device memory and cannot be dereferenced in host code.

I have not seen the rest of the object, but you most likely do not need to allocate a separate copy of the object on the device. When you launch your __global__ kernel, pass the host object by value (‘*hostSwarm’) as a parameter, and a copy of the object will be available inside the kernel.

As in regular C++, what you need to worry about when copying objects (which is what happens when you launch the kernel: a device copy of the object is created) is shallow copies of objects that own dynamic memory. In this case, if your Swarm object has a pointer to an array of particles, that pointer points to memory either on the host or on the device, never both.

So what you need in your object are two pointers: one pointing to the array of particles on the host, which you use from host code, and one pointing to the array of particles on the device, which you have allocated with cudaMalloc and filled with cudaMemcpy.

I’ll post some code, but don’t expect it to compile (or even to be close to compiling).

class Swarm
{
public : 
      void mallocAndCopyCurrentParticlesOnDevice();
   
private:
...
Particle* h_particles;   //array of particles in host memory
Particle* d_particles;   //array of particles in device memory
unsigned int numParticles;
};

//This takes your current host array of particles and mallocs/copies it to the device.
//After this function is called, the d_particles pointer can be used from device code to access
//the particles.
void Swarm::mallocAndCopyCurrentParticlesOnDevice()
{
     cudaMalloc(&d_particles, sizeof(Particle)*numParticles);
     cudaMemcpy(d_particles, h_particles, sizeof(Particle)*numParticles, cudaMemcpyHostToDevice);
}

__global__ void myKernelCallThatUsesSwarm(Swarm swarm, OtherStuff otherstuff)
{
.... do stuff
//The swarm.d_particles points to particles on the device, you can use it here
//(it has to be reachable from the kernel, e.g. public or returned by a __device__ getter).
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
Particle myParticle = swarm.d_particles[idx];
}

int main(void)
{
    Swarm* myHostSwarm = new Swarm();
    //whatever code is needed here in order to fill h_particles and numParticles
    myHostSwarm->mallocAndCopyCurrentParticlesOnDevice();
    
    //numBlocks and threadsPerBlock are placeholders; choose them so that every particle gets a thread
    myKernelCallThatUsesSwarm<<<numBlocks, threadsPerBlock>>>(*myHostSwarm, otherStuff);
    cudaDeviceSynchronize();
    return 0;
}
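
For comparison, here is one way the sketch above could be turned into a small program that compiles and runs. It is a variant under assumptions, not the poster's actual code: Particle holds a scalar fitness, Swarm has no destructor so that the by-value copy made at launch cannot free anything (release() is called explicitly instead), and copyParticlesBackToHost, hostParticles, size, release and runSwarmDemo are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Particle
{
    double fitness;   // scalar instead of an unallocated pointer
    __device__ void computeFitness(unsigned int i) { fitness = double(i) * i; }
};

class Swarm
{
public:
    explicit Swarm(unsigned int n)
        : h_particles(new Particle[n]()), d_particles(nullptr), numParticles(n) {}

    // No destructor on purpose: the kernel receives a shallow copy that shares
    // both pointers, so ownership is released explicitly by the host.
    void release()
    {
        cudaFree(d_particles);
        delete[] h_particles;
        d_particles = nullptr;
        h_particles = nullptr;
    }

    cudaError_t mallocAndCopyCurrentParticlesOnDevice()
    {
        cudaError_t err = cudaMalloc(&d_particles, sizeof(Particle) * numParticles);
        if (err != cudaSuccess) { d_particles = nullptr; return err; }
        return cudaMemcpy(d_particles, h_particles, sizeof(Particle) * numParticles,
                          cudaMemcpyHostToDevice);
    }

    cudaError_t copyParticlesBackToHost()
    {
        return cudaMemcpy(h_particles, d_particles, sizeof(Particle) * numParticles,
                          cudaMemcpyDeviceToHost);
    }

    __device__ Particle* getParticleOnPositionI(unsigned int i) { return &d_particles[i]; }
    __host__ __device__ unsigned int size() const { return numParticles; }
    const Particle* hostParticles() const { return h_particles; }

private:
    Particle* h_particles;   // valid in host code only
    Particle* d_particles;   // valid in device code only
    unsigned int numParticles;
};

// The Swarm object is passed by value; only d_particles is used on the device.
__global__ void computeNewFitness(Swarm swarm)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < swarm.size()) {
        swarm.getParticleOnPositionI(idx)->computeFitness(idx);
    }
}

// Returns true if every particle i ends up with fitness i*i on the host.
bool runSwarmDemo(unsigned int n)
{
    Swarm swarm(n);
    bool ok = swarm.mallocAndCopyCurrentParticlesOnDevice() == cudaSuccess;
    if (ok) {
        const unsigned int threads = 64;
        const unsigned int blocks = (n + threads - 1) / threads;
        computeNewFitness<<<blocks, threads>>>(swarm);
        ok = cudaGetLastError() == cudaSuccess &&
             swarm.copyParticlesBackToHost() == cudaSuccess;
    }
    for (unsigned int i = 0; ok && i < n; ++i) {
        ok = swarm.hostParticles()[i].fitness == double(i) * i;
    }
    swarm.release();
    return ok;
}

int main()
{
    std::printf("swarm demo %s\n", runSwarmDemo(100) ? "ok" : "failed");
    return 0;
}
```

The results only become visible on the host after copyParticlesBackToHost, because the kernel writes to the device array and h_particles is a separate allocation.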

Thanks again very much for your help! Sorry that I have so many questions about such a simple problem -.-
I have tried it the way you suggested, but there is still something wrong (damn it).
My global function looks like this:

__global__ void computeNewFitness(Swarm *swarm){
	int threadID = threadIdx.x;
	Particle *myParticle = swarm->getParticleOnPositionI(threadID);
	myParticle->computeFitness();	
	
}

while the Particle class looks as shown in my first post. The computeFitness() function does not seem to work (at least, this is one of the problems…), but why?

If it’s exactly like in your first post, then ‘double *fitness’ points to nothing, so writing through it is invalid. It does not need to be a pointer (as far as I understand it, it only holds a scalar value).

I suggest you use Nsight and debug your host code. You will be able to see which lines of code do not work, whether you even get inside getParticleOnPositionI, etc. It is easy to forget something and, for me anyway, the easiest way to find what has been forgotten is to step through the code line by line and check every variable and pointer value.
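
Alongside a debugger, checking the status returned by the CUDA runtime can show where things go wrong, since a failed launch is otherwise silent. The following is a minimal sketch; the macro CUDA_CHECK, the kernel doNothing and the helper launchWithBlockSize are hypothetical names, and the second call deliberately requests far more threads per block than any device supports.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro: prints file, line and message of a failed call.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                         cudaGetErrorString(err_));                       \
        }                                                                 \
    } while (0)

__global__ void doNothing() {}

// Launches one block of the given size and returns the resulting status.
cudaError_t launchWithBlockSize(unsigned int threadsPerBlock)
{
    doNothing<<<1, threadsPerBlock>>>();
    cudaError_t err = cudaGetLastError();   // errors from the launch itself
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // errors raised while the kernel ran
}

int main()
{
    CUDA_CHECK(launchWithBlockSize(128));      // valid configuration
    CUDA_CHECK(launchWithBlockSize(100000));   // invalid: gets reported on stderr
    return 0;
}
```

Wrapping cudaMalloc and cudaMemcpy calls in the same macro would point to the first call that fails, which narrows down where to set breakpoints.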

OK, I will try it. I use Nsight, but I am having problems with it. I will either install a new version of it or use the terminal debugger, cuda-gdb. Thanks very much, let’s see what the next problems will be :D