Instantiating objects directly on the GPU (totally new to CUDA)

I’ve been developing in C and C++ for a few years and I’m interested in getting into CUDA. We currently run a C-based in-house app that requires millions of objects to be stored in memory. There’s a lot of initializing and destroying of objects throughout its execution. We could easily integrate CUDA into our infrastructure if it allows objects to be stored in the GPU’s memory and lets the GPU itself instantiate those objects from structs. Is this possible? From what I’ve been reading so far, the main things the CUDA SDK supports are executing formulas and moving chunks of memory from GPU memory to system memory and vice versa.

I would also be interested to know

The following code worked. The CUBIN said smem usage was 420 bytes.

Thus, the data portion of the object is stored in shared memory if the object is instantiated in smem.

I think a similar declaration of an object should also be possible with global memory.

I am using CUDA 2.2. I hope this is officially supported by NVIDIA. If someone could confirm, it would be useful.

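#include <stdio.h>

class sample
{
public:
	int data[100];

	// __shared__ variables cannot be initialized at their declaration,
	// so the kernel fills the array through an explicit call.
	__host__ __device__ void init()
	{
		for (int i = 0; i < 100; i++)
			data[i] = i;
	}

	__host__ __device__ int fetch(int i) const
	{
		return data[i];
	}
};

__global__ void mykernel(float *result)
{
	// The 400 bytes of data[] live in shared memory.
	__shared__ sample d;
	d.init();

	int sum = 0;
	for (int i = 0; i < 100; i++)
		sum += d.fetch(i);

	*result = sum;
}

int main(void)
{
	float *result;
	float data;

	cudaMalloc((void **)&result, sizeof(float));
	mykernel<<<1, 1>>>(result);

	// cudaMemcpy waits for the kernel to finish before copying.
	cudaMemcpy(&data, result, sizeof(float), cudaMemcpyDeviceToHost);
	cudaFree(result);

	printf("%f\n", data);
	return 0;
}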
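On the global-memory question raised above, here is a minimal sketch of objects that are created by the GPU and live in global memory. The `Item` struct, the `create_items` kernel and the `build_and_sum` helper are hypothetical names made up for illustration, not part of any NVIDIA API. The sketch assumes a fixed number of objects allocated up front with `cudaMalloc`, and the copy back to the host is there only to check the result.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical object type standing in for the app's real struct.
struct Item {
    int   id;
    float value;

    // Plain init method, callable from host or device code.
    __host__ __device__ void init(int i) {
        id = i;
        value = 0.5f * i;
    }
};

// One thread per object: each object is built in place in global memory.
__global__ void create_items(Item *items, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        items[i].init(i);
}

// Builds n objects on the GPU, then copies them back once to check them.
// Returns the sum of all `value` fields, or -1.0 if a CUDA call fails.
double build_and_sum(int n) {
    if (n <= 0) return 0.0;

    Item *d_items = nullptr;
    if (cudaMalloc((void **)&d_items, n * sizeof(Item)) != cudaSuccess)
        return -1.0;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    create_items<<<blocks, threads>>>(d_items, n);

    std::vector<Item> h_items(n);
    cudaError_t err = cudaMemcpy(h_items.data(), d_items, n * sizeof(Item),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_items);
    if (err != cudaSuccess) return -1.0;

    double sum = 0.0;
    for (const Item &it : h_items)
        sum += it.value;
    return sum;
}
```

Calling `build_and_sum(1000)` from `main` exercises the whole path. In a real port the array would stay allocated between kernel launches, since global memory persists until `cudaFree` is called. Later CUDA releases also allow `new` and `malloc` inside kernels on devices of compute capability 2.0 and higher.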


I guess the main question you should ask yourself is: “Do I need a performance boost? If yes, how can CUDA and the GPU help me?”

Building classes and structs is not such a crucial thing, I think. Worst case, you can come up with some sort of workaround.

From what you describe, you have a lot of objects/structures created dynamically (if I understand correctly); I’m not sure the GPU is the best thing for this.

But maybe I’m wrong, or maybe you can elaborate more on what you want to accomplish with the GPU…


Yes, I do need a performance boost. True, building classes and structs isn’t that performance-intensive, but when the app is constantly building and destroying them, it does cause noticeable load. I figured that if I could build the objects using the GPU and store them in the GPU’s memory, it would take a lot of the load off the host CPU and host memory. Yes, I could create and store the objects using the system CPU and memory and then copy them over into the GPU’s memory, but I figure that would be too slow, since the data has to travel over the host bus that everything else on the system moves through. I don’t know the exact bus width between the GPU and its memory, but I imagine it’s much, much bigger than the one between the CPU and system memory.

What I want to accomplish is to move all of this one app’s main CPU and memory utilization onto the GPU. Granted, I understand the simpler stuff can be done on the CPU without any performance hit, but the main guts of this app and its biggest performance burdens are…

  • the creation/destruction of objects

  • storing and reading the objects

  • heavy formulas done on values found in objects

Everything else the app does is total fluff, like echoing text to the screen, writing files to the hard drive and generating output images. The CPU can do that just fine and we don’t need the GPU for that.

Heavy formulas on the objects sound good for the GPU. Make sure you don’t move objects back and forth between the host and the device; that would kill your performance. Load the data onto the GPU, calculate as much as possible, and then go on to the next objects to be calculated.
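The "load once, calculate as much as possible" advice could look roughly like this. It is a sketch under assumptions: the `step` kernel, its placeholder formula and the `run_passes` helper are invented for illustration, and error checking is omitted for brevity. The point is only that there is one upload and one download around any number of kernel launches.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder for one of the app's "heavy formulas" (hypothetical).
__global__ void step(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = v[i] * 0.5f + 1.0f;
}

// Uploads the data once, runs `passes` kernel launches on it while it
// stays in device memory, then downloads the result once.
std::vector<float> run_passes(std::vector<float> host, int passes) {
    int n = (int)host.size();
    if (n == 0) return host;

    size_t bytes = n * sizeof(float);
    float *dev = nullptr;
    cudaMalloc((void **)&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);  // single upload

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int p = 0; p < passes; ++p)
        step<<<blocks, threads>>>(dev, n);  // no host traffic between passes

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);  // single download
    cudaFree(dev);
    return host;
}
```

Kernels launched on the same stream run in order, so each pass sees the results of the previous one without any explicit synchronization in between.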


Another thing to remember: CUDA likes a structure containing several arrays rather than an array of structures. Without seeing your actual code, it’s impossible to say for sure, but the most complicated part of the porting process can be tracking down which data you need and packing it into nice 1D arrays (not to mention the corresponding unpack once the GPU is done).
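As an illustration of the pack and unpack steps, here is a small sketch. The `Body` struct, its fields, the `weighted_x` kernel and the `weighted_x_on_gpu` helper are all hypothetical and stand in for whatever the real objects hold; error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical host-side object: the array-of-structures form.
struct Body { float x; float mass; };

// Structure-of-arrays form used on the device: one 1D array per field.
struct BodiesSoA { float *x; float *mass; };

// Thread i reads x[i] and mass[i], so neighbouring threads read
// neighbouring floats in memory.
__global__ void weighted_x(BodiesSoA b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = b.x[i] * b.mass[i];
}

std::vector<float> weighted_x_on_gpu(const std::vector<Body> &bodies) {
    int n = (int)bodies.size();
    std::vector<float> result(n);
    if (n == 0) return result;

    // Pack: split the objects into flat per-field arrays.
    std::vector<float> hx(n), hm(n);
    for (int i = 0; i < n; ++i) {
        hx[i] = bodies[i].x;
        hm[i] = bodies[i].mass;
    }

    size_t bytes = n * sizeof(float);
    BodiesSoA d;
    float *d_out = nullptr;
    cudaMalloc((void **)&d.x, bytes);
    cudaMalloc((void **)&d.mass, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d.x, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d.mass, hm.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    weighted_x<<<blocks, threads>>>(d, d_out, n);

    // Unpack: here the result is already a single flat array.
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d.x);
    cudaFree(d.mass);
    cudaFree(d_out);
    return result;
}
```

With an array of `Body` structs on the device instead, thread i and thread i+1 would read addresses `sizeof(Body)` apart, which makes less efficient use of each memory transaction.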

The way the app works is that it collects directly related data through objects that are associated with other objects. The app never just loops through an array of all the objects; that would melt the server. Instead it picks an object, jumps to the objects directly associated with it, then jumps to the objects directly associated with those, and so on, collecting data that way.

Doing random walks one pointer at a time does not sound too promising. You’ll have to do very similar things for all objects to be efficient. You could have, say, a class ‘Mammals’, but you cannot specialize it into individual species. Even monkeys will have to grow their antlers, although only reindeer will have this growth set to a non-zero value.
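The antler analogy can be sketched as follows. The names `grow_antlers` and `grow_on_gpu` and the two arrays are hypothetical, chosen only to mirror the example above, and error checking is omitted. Every thread executes the same instructions, and animals without antlers simply carry a growth rate of zero.

```cuda
#include <cuda_runtime.h>
#include <vector>

// No per-species branch: monkeys and reindeer run identical code.
// The only conditional is the bounds check on the thread index.
__global__ void grow_antlers(float *length, const float *growth_rate,
                             float dt, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        length[i] += growth_rate[i] * dt;
}

// Applies one growth step to every animal and returns the new lengths.
// Returns the input unchanged if the two arrays differ in size.
std::vector<float> grow_on_gpu(std::vector<float> length,
                               const std::vector<float> &growth_rate,
                               float dt) {
    int n = (int)length.size();
    if (n == 0 || growth_rate.size() != length.size()) return length;

    size_t bytes = n * sizeof(float);
    float *d_len = nullptr;
    float *d_rate = nullptr;
    cudaMalloc((void **)&d_len, bytes);
    cudaMalloc((void **)&d_rate, bytes);
    cudaMemcpy(d_len, length.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_rate, growth_rate.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    grow_antlers<<<blocks, threads>>>(d_len, d_rate, dt, n);

    cudaMemcpy(length.data(), d_len, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_len);
    cudaFree(d_rate);
    return length;
}
```

The wasted multiply-by-zero for the monkeys is usually cheaper than a branch that sends threads of the same warp down different code paths, because a warp executes divergent paths one after the other.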

Think in terms of wide vectors when you walk around in memory, or you’ll immediately hit a bandwidth brick wall:…hots/Worms9.JPG