3D data structure woes T_T

Hello all, I’m attempting to create a 3D data structure of nodes.

So here’s what I’m trying to do: create a 3D structure where every element is of type ‘struct node’, which I define below. I’m attempting to make a particle mesh, so each node also has to point to another structure, ‘struct particle’. I’ve looked up various methods for doing this, but so far I haven’t managed to manipulate my mesh on the device. I’m told that my indices are out of bounds, so here is my current code:

#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>

struct particle {

   int type;

   double p[3];
   double v[3];
   double f[3];

   double mass;
};

struct node {

   int full;
   struct particle *part;
};

__global__ void mesh(cudaPitchedPtr devPitchedPtr, cudaExtent extent) {

   char* devPtr = (char*) devPitchedPtr.ptr;
   size_t pitch = devPitchedPtr.pitch;
   size_t slicePitch = pitch*extent.height;
   for (int k=0; k<extent.depth; k++) {

      char* slice = devPtr + k*slicePitch;
      for (int j=0; j<extent.height; j++) {

         struct node* row = (struct node*) (slice+j*pitch);
         for (int i=0; i<(extent.width/sizeof(struct node*)); i++) {

            row[i].full = 1;
         }
      }
   }
}

int main (void) {

   int numPart = 100;
   int gl = 128;

   struct node ***root;
   root = (struct node***)malloc(gl*sizeof(*root));
   for (int i=0; i<gl; i++) {

      root[i] = (struct node**)malloc(gl*sizeof(*root[i]));
      for (int j=0; j<gl; j++) {

         root[i][j] = (struct node*)malloc(gl*sizeof(*root[i][j]));
         for (int k=0; k<gl; k++) {

            root[i][j][k].full = 0;
            root[i][j][k].part = NULL;
         }
      }
   }

   cudaError_t status = cudaSuccess;

   printf("Size of node : %zu\n", sizeof(struct node));

   cudaExtent extent;
   extent.width = gl*sizeof(struct node);
   extent.height = gl;
   extent.depth = gl;

   cudaPitchedPtr mem_device;

   status=cudaMalloc3D(&mem_device, extent);

   if(status != cudaSuccess){
      fprintf(stderr, "MemcpyHtD: %s\n", cudaGetErrorString(status));
   }

   cudaMemcpy3DParms p = { 0 };
   p.srcPtr = make_cudaPitchedPtr((void*)root, gl*sizeof(struct node), gl, gl);
   p.dstPtr = mem_device;
   p.extent = extent;
   p.kind = cudaMemcpyHostToDevice;
   if(status != cudaSuccess){fprintf(stderr, "MemcpyHtD: %s\n", cudaGetErrorString(status));}

   mesh<<<1, 1>>>(mem_device, extent);

   for (int i=0; i<gl; i++) {
      for (int j=0; j<gl; j++) {
         free (root[i][j]);
      }
      free (root[i]);
   }
   free (root);

   return 0;
}

So basically, I double-checked the example I copied and it seems like I’m doing the copying into device memory correctly. The error I’m getting is that my indices are out of bounds. So how do I find the proper bounds for this or am I even doing this correctly? I tried to extrapolate from the example but I’m iffy.
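
For reference, three things in the code above look like plausible sources of the out-of-bounds report. The inner kernel loop divides by sizeof(struct node*) instead of sizeof(struct node), so on a 64-bit build it walks twice as many nodes as a row holds. The host mesh is a tree of separately malloc'ed rows, while make_cudaPitchedPtr expects one contiguous block. And cudaMemcpy3D itself is never called. The following is a minimal sketch under those assumptions, not a definitive fix; fillMesh is a hypothetical helper name, and the flat host layout is a swapped-in replacement for the triple pointer.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

struct particle { int type; double p[3], v[3], f[3]; double mass; };
struct node { int full; struct particle *part; };

// Same single-thread traversal as the posted kernel, except that the inner
// bound divides by sizeof(struct node) rather than sizeof(struct node*).
__global__ void mesh(cudaPitchedPtr devPitchedPtr, cudaExtent extent) {
   char *devPtr = (char *)devPitchedPtr.ptr;
   size_t pitch = devPitchedPtr.pitch;
   size_t slicePitch = pitch * extent.height;
   size_t nodesPerRow = extent.width / sizeof(struct node);
   for (size_t k = 0; k < extent.depth; k++) {
      char *slice = devPtr + k * slicePitch;
      for (size_t j = 0; j < extent.height; j++) {
         struct node *row = (struct node *)(slice + j * pitch);
         for (size_t i = 0; i < nodesPerRow; i++) {
            row[i].full = 1;
         }
      }
   }
}

// Hypothetical helper (not from the thread): round-trips a gl^3 mesh and
// returns how many nodes came back with full == 1, or -1 on any error.
long fillMesh(int gl) {
   size_t count = (size_t)gl * gl * gl;
   size_t rowBytes = gl * sizeof(struct node);
   // One contiguous block; node (i, j, k) sits at index (k*gl + j)*gl + i.
   struct node *root = (struct node *)calloc(count, sizeof(struct node));
   if (root == NULL) return -1;

   cudaExtent extent = make_cudaExtent(rowBytes, gl, gl);
   cudaPitchedPtr dev;
   if (cudaMalloc3D(&dev, extent) != cudaSuccess) { free(root); return -1; }

   cudaMemcpy3DParms p = { 0 };
   p.srcPtr = make_cudaPitchedPtr((void *)root, rowBytes, gl, gl);
   p.dstPtr = dev;
   p.extent = extent;
   p.kind = cudaMemcpyHostToDevice;
   cudaError_t status = cudaMemcpy3D(&p);   // the copy is actually issued here

   if (status == cudaSuccess) {
      mesh<<<1, 1>>>(dev, extent);
      status = cudaGetLastError();
      if (status == cudaSuccess) status = cudaDeviceSynchronize();
   }
   if (status == cudaSuccess) {
      cudaMemcpy3DParms q = { 0 };
      q.srcPtr = dev;
      q.dstPtr = make_cudaPitchedPtr((void *)root, rowBytes, gl, gl);
      q.extent = extent;
      q.kind = cudaMemcpyDeviceToHost;
      status = cudaMemcpy3D(&q);
   }

   long full = -1;
   if (status == cudaSuccess) {
      full = 0;
      for (size_t n = 0; n < count; n++) full += root[n].full;
   } else {
      fprintf(stderr, "fillMesh: %s\n", cudaGetErrorString(status));
   }
   cudaFree(dev.ptr);
   free(root);
   return full;
}

int main(void) {
   long full = fillMesh(128);
   printf("nodes marked full: %ld\n", full);
   return full < 0 ? 1 : 0;
}
```

Note that any non-NULL part pointers copied this way would still be host addresses; the particles would need their own device allocation before a kernel could follow them.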

It appears that you are launching only one block with one thread in it. That works, but it will use only about 1% of the GPU’s capability, and for a problem of dimensions [128,128,128] it will not be as fast as it could be.

Have you had a look at the n-body example in the Software Development Kit (SDK), or at any of the other SDK examples?

NB You should also test status after launching your kernel.
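
A minimal sketch of what testing status after a launch could look like. A kernel launch returns no status itself, so the usual pattern is cudaGetLastError() for launch-time problems followed by cudaDeviceSynchronize() for errors raised while the kernel runs. The helper checkKernel and the kernel touch are hypothetical names used only for illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Trivial kernel so there is something to launch.
__global__ void touch(int *flag) { *flag = 1; }

// Hypothetical helper: reports launch-time and execution-time errors.
cudaError_t checkKernel(const char *label) {
   cudaError_t status = cudaGetLastError();          // bad launch configuration etc.
   if (status == cudaSuccess) {
      status = cudaDeviceSynchronize();              // faults during execution
   }
   if (status != cudaSuccess) {
      fprintf(stderr, "%s: %s\n", label, cudaGetErrorString(status));
   }
   return status;
}

int main(void) {
   int *flag = NULL;
   if (cudaMalloc((void **)&flag, sizeof(int)) != cudaSuccess) return 1;

   touch<<<1, 1>>>(flag);
   cudaError_t status = checkKernel("touch");

   cudaFree(flag);
   return status == cudaSuccess ? 0 : 1;
}
```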

Hope this helps

Okay, so I changed my code. I checked and I have 128^3 elements so I have 2,097,152 possible threads, right? Or at least, I can implement that many threads at once if I should so choose as each structure is independent of the other.

I checked using cudaGetDeviceProperties() and I have 1024 max threads per block so I need to pass 2,097,152/1024 = 2048 blocks.
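
The arithmetic above can be sketched in code as follows. This assumes device 0 and a hypothetical helper blocksFor that rounds up, so the count still covers every element when the total is not a multiple of the block size.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover `total` elements,
// using ceiling division.
unsigned int blocksFor(size_t total, size_t threadsPerBlock) {
   return (unsigned int)((total + threadsPerBlock - 1) / threadsPerBlock);
}

int main(void) {
   cudaDeviceProp prop;
   cudaError_t status = cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
   if (status != cudaSuccess) {
      fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(status));
      return 1;
   }

   size_t total = (size_t)128 * 128 * 128;   // one thread per node
   printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
   printf("max grid size in x:    %d\n", prop.maxGridSize[0]);
   printf("blocks needed:         %u\n",
          blocksFor(total, (size_t)prop.maxThreadsPerBlock));
   return 0;
}
```

With 1024 threads per block, blocksFor(2097152, 1024) gives the 2048 blocks computed above.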

So now I’m using mesh<<<2048, 1024>>> and now when I compile, I get the following warnings:

“./pmesh.cu(28): Warning: Cannot tell what pointer points to, assuming global memory space”

This is done using explicit struct indices (root[0][0][0]) and I’m not faulting. The memchecker complains like crap though. I think if I pass this many blocks then I need to use the thread IDs, don’t I? I’m assuming that each block is assigned a segment of the memory, correct?
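
On the last question: blocks are not handed a segment of memory automatically. Every thread runs the same kernel and has to work out which element it owns from blockIdx, blockDim, and threadIdx. A minimal sketch, assuming the mesh is stored as one flat array of nodes (a simplification of the pitched layout in the original post); markFull and markAll are hypothetical names.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

struct particle;   // defined as in the original post; only a pointer is needed here
struct node { int full; struct particle *part; };

// Each thread derives one global index and touches exactly one node.
__global__ void markFull(struct node *nodes, size_t count) {
   size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < count) {            // guard: the last block may overshoot
      nodes[idx].full = 1;
   }
}

// Hypothetical helper: returns how many nodes ended up full, or -1 on error.
long markAll(int gl, int threadsPerBlock) {
   size_t count = (size_t)gl * gl * gl;
   size_t bytes = count * sizeof(struct node);
   struct node *host = (struct node *)calloc(count, sizeof(struct node));
   struct node *dev = NULL;
   if (host == NULL) return -1;
   if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) { free(host); return -1; }

   cudaError_t status = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
   if (status == cudaSuccess) {
      unsigned int blocks =
         (unsigned int)((count + threadsPerBlock - 1) / threadsPerBlock);
      markFull<<<blocks, threadsPerBlock>>>(dev, count);
      status = cudaGetLastError();
   }
   if (status == cudaSuccess) {
      // This blocking copy also waits for the kernel to finish.
      status = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
   }

   long full = -1;
   if (status == cudaSuccess) {
      full = 0;
      for (size_t n = 0; n < count; n++) full += host[n].full;
   } else {
      fprintf(stderr, "markAll: %s\n", cudaGetErrorString(status));
   }
   cudaFree(dev);
   free(host);
   return full;
}

int main(void) {
   long full = markAll(128, 1024);   // 2048 blocks of 1024 threads
   printf("nodes marked full: %ld\n", full);
   return full < 0 ? 1 : 0;
}
```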

For a 3D problem, a 3D grid would be better:
blocks per grid = 4096 (3D grid, 16 in each grid dimension)
threads per block = 512 (3D block, 8 in each block dimension)

Another thing: try doing more than one particle per thread.
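
One common way to give each thread more than one element is a grid-stride loop. This is a sketch of that general technique, not necessarily what the poster above had in mind; it assumes a flat array of nodes, and markFullStrided and markAllStrided are hypothetical names.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

struct particle;   // defined as in the original post; only a pointer is needed here
struct node { int full; struct particle *part; };

// Grid-stride loop: each thread starts at its global index and then jumps
// ahead by the total number of threads in the grid until it runs past the end.
__global__ void markFullStrided(struct node *nodes, size_t count) {
   size_t stride = (size_t)gridDim.x * blockDim.x;
   for (size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        idx < count; idx += stride) {
      nodes[idx].full = 1;
   }
}

// Hypothetical helper: returns how many nodes ended up full, or -1 on error.
long markAllStrided(int gl, int blocks, int threadsPerBlock) {
   size_t count = (size_t)gl * gl * gl;
   size_t bytes = count * sizeof(struct node);
   struct node *host = (struct node *)malloc(bytes);
   struct node *dev = NULL;
   if (host == NULL) return -1;
   if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) { free(host); return -1; }

   cudaError_t status = cudaMemset(dev, 0, bytes);   // full = 0, part = NULL
   if (status == cudaSuccess) {
      markFullStrided<<<blocks, threadsPerBlock>>>(dev, count);
      status = cudaGetLastError();
   }
   if (status == cudaSuccess) {
      // This blocking copy also waits for the kernel to finish.
      status = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
   }

   long full = -1;
   if (status == cudaSuccess) {
      full = 0;
      for (size_t n = 0; n < count; n++) full += host[n].full;
   } else {
      fprintf(stderr, "markAllStrided: %s\n", cudaGetErrorString(status));
   }
   cudaFree(dev);
   free(host);
   return full;
}

int main(void) {
   // 256 blocks of 256 threads = 65,536 threads, so 32 nodes per thread.
   long full = markAllStrided(128, 256, 256);
   printf("nodes marked full: %ld\n", full);
   return full < 0 ? 1 : 0;
}
```

The kernel stays correct for any launch size, which makes it easy to experiment with how many elements per thread perform best.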

So let me see if I’m understanding this correctly, I choose 16^3 because the size of each node is 16 bytes. Therefore we assume that we write bytes along every Cartesian axis, yielding the cube, right? The threads per block I’m not sure of, actually.

And how am I supposed to thread this out? I copy-pasted the 3D indexing syntax from some example without really understanding it. As a simple test, should I just start by setting each structure’s int value equal to 1? So then I’d change 512 structures per block, right?

I was referring to the organization of the execution grid, provided you need 2048*1024 threads. If you have a 3D input/output structure, it is better to read/write it in a natural way, with 3D thread indices. In other words, call your kernel this way:

kernel<<<dim3(16, 16, 16), dim3(8, 8, 8)>>>(...)
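
A sketch of a kernel written for that launch, assuming the pitched allocation from the original post. With 16 blocks of 8 threads along each axis, the grid covers exactly 128 nodes per dimension. The names mesh3D and fill3D are hypothetical, and 3D grids need a device of compute capability 2.0 or later.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

struct particle;   // defined as in the original post; only a pointer is needed here
struct node { int full; struct particle *part; };

// One thread per node; (i, j, k) come straight from the 3D thread indices.
__global__ void mesh3D(cudaPitchedPtr devPitchedPtr, int nx, int ny, int nz) {
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   int j = blockIdx.y * blockDim.y + threadIdx.y;
   int k = blockIdx.z * blockDim.z + threadIdx.z;
   if (i >= nx || j >= ny || k >= nz) return;   // guard for partial blocks

   char *devPtr = (char *)devPitchedPtr.ptr;
   size_t pitch = devPitchedPtr.pitch;          // bytes per padded row
   size_t slicePitch = pitch * ny;              // bytes per z-slice
   struct node *row = (struct node *)(devPtr + k * slicePitch + j * pitch);
   row[i].full = 1;
}

// Hypothetical helper: returns how many nodes ended up full, or -1 on error.
long fill3D(int gl) {
   size_t count = (size_t)gl * gl * gl;
   size_t rowBytes = gl * sizeof(struct node);
   struct node *host = (struct node *)malloc(count * sizeof(struct node));
   if (host == NULL) return -1;

   cudaExtent extent = make_cudaExtent(rowBytes, gl, gl);
   cudaPitchedPtr dev;
   if (cudaMalloc3D(&dev, extent) != cudaSuccess) { free(host); return -1; }

   cudaError_t status = cudaMemset3D(dev, 0, extent);   // full = 0, part = NULL
   if (status == cudaSuccess) {
      dim3 block(8, 8, 8);
      dim3 grid((gl + 7) / 8, (gl + 7) / 8, (gl + 7) / 8);   // 16^3 for gl = 128
      mesh3D<<<grid, block>>>(dev, gl, gl, gl);
      status = cudaGetLastError();
   }
   if (status == cudaSuccess) {
      cudaMemcpy3DParms q = { 0 };
      q.srcPtr = dev;
      q.dstPtr = make_cudaPitchedPtr((void *)host, rowBytes, gl, gl);
      q.extent = extent;
      q.kind = cudaMemcpyDeviceToHost;
      status = cudaMemcpy3D(&q);   // blocking copy; also waits for the kernel
   }

   long full = -1;
   if (status == cudaSuccess) {
      full = 0;
      for (size_t n = 0; n < count; n++) full += host[n].full;
   } else {
      fprintf(stderr, "fill3D: %s\n", cudaGetErrorString(status));
   }
   cudaFree(dev.ptr);
   free(host);
   return full;
}

int main(void) {
   long full = fill3D(128);
   printf("nodes marked full: %ld\n", full);
   return full < 0 ? 1 : 0;
}
```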