Sending a large number of vertices to the kernel

Hi everyone,

I am trying to send a very large number of vertices to the kernel. The kernel will do some computation on these, and send the result of the computations back. The vertices themselves will not be modified.

I know this is a simple task, but I’m a bit lost.

Should I store these n vertices as a vector of vectors using the std::vector class? If so, how would I pass such a thing to the kernel?

Or should I allocate the space with cudaMalloc3D or cudaMalloc3DArray, then fill in that block with the vertices created by float3 vertices = make_float3(x,y,z) ? Where would cudaMemcpy come in?

Or should I use vertex buffer objects?

Or some other technique that I haven’t mentioned? Are any of these legitimate approaches, and what is the best way to do this?

Thanks so much!

Here’s a simple example of how you could do it:

int num_vertices;
// set num_vertices to however many vertices you have

size_t memsize = sizeof(float) * num_vertices;

float *h_x = (float *) malloc(memsize);
float *h_y = (float *) malloc(memsize);
float *h_z = (float *) malloc(memsize);

for (int i = 0; i < num_vertices; i++) {
  // fill in entries of h_x, h_y, h_z
}

float *d_x, *d_y, *d_z;

CUDA_SAFE_CALL(cudaMalloc((void**) &d_x, memsize));
CUDA_SAFE_CALL(cudaMalloc((void**) &d_y, memsize));
CUDA_SAFE_CALL(cudaMalloc((void**) &d_z, memsize));

CUDA_SAFE_CALL(cudaMemcpy(d_x, h_x, memsize, cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(cudaMemcpy(d_y, h_y, memsize, cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(cudaMemcpy(d_z, h_z, memsize, cudaMemcpyHostToDevice));

// Now call your kernel, passing in d_x, d_y, d_z

(I’ve used the CUDA_SAFE_CALL macro from cutil.h in the SDK (common/inc/cutil.h), which only checks error codes if _DEBUG is #defined. Roll your own error checking if required.)
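To round out the sketch, a kernel consuming those three arrays might look like the following. The kernel name, the block size, and the per-vertex computation (Euclidean length) are just illustrative assumptions; substitute whatever computation you actually need.

```cuda
__global__ void vertex_length(const float *x, const float *y, const float *z,
                              float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // example computation: length of each vertex as a vector
        out[i] = sqrtf(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
    }
}

// Host side: launch enough blocks to cover all vertices,
// then copy the results back (d_out/h_out allocated like the arrays above).
int threads = 256;
int blocks = (num_vertices + threads - 1) / threads;
vertex_length<<<blocks, threads>>>(d_x, d_y, d_z, d_out, num_vertices);
CUDA_SAFE_CALL(cudaMemcpy(h_out, d_out, memsize, cudaMemcpyDeviceToHost));
```

Because each thread with index i reads x[i], y[i], and z[i], consecutive threads touch consecutive floats in each array, which is exactly the access pattern that coalesces well.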

Regarding the options you mention:

  • Passing C++ objects (especially STL) to the GPU generally does not work. You will be much more successful with plain C-style arrays and simple structs.

  • cudaMalloc3D() and cudaMalloc3DArray() are for allocating three-dimensional arrays, not arrays of 3D objects (like vertices). cudaMalloc3D() just allocates a big linear chunk of memory anyway, padding the dimensions where they would otherwise cause alignment problems. cudaMalloc3DArray() allocates a special cudaArray object, which is required for 3D textures.

  • Creating a C-style array of float3 objects is on the right track and would certainly work. The only problem with float3 is that it is the wrong size for coalesced memory access on older devices (GeForce 8 and 9 series), which can only coalesce reads of 32, 64, and 128 bits per thread. You can access an array of float3 in a coalesced way by using shared memory as a staging area, but that is a little complicated when you are starting out. Keeping a separate array for each of the x, y, and z components is a simple way to avoid the problem.

  • Vertex buffer objects are a DirectX/OpenGL concept, and don’t exist in CUDA except for graphics interoperability (e.g. writing a CUDA kernel whose output is going to be rendered directly).