Why do nested 3D data structures cause a CUDA build crash?

I use a simple organization of nested 3D data structures to build a grid on the CPU and to search occupied cells quickly.

struct Voxel
{
  uint8_t ID = 0;
  uint8_t hit_counter = 0;
};

struct Chunk_level_1
{
  uint8_t counter = 0;
  Voxel voxel [10][10][10] = {0}; // 1 000
};

struct Chunk_level_2
{
  uint8_t counter = 0;
  Chunk_level_1 Chk_Lv1 [10][10][10]= {0};  // 1 000 000
};

Chunk_level_2 Chk_Lv2 [10][10][10] = {0};  //1 000 000 000
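
The counters at each level are meant to let a search skip empty regions without visiting individual voxels. Roughly (a sketch only, not my exact search code; it assumes the counters are kept in sync when voxels are hit):

bool cell_occupied(int x, int y, int z)   // x, y, z in [0, 1000)
{
  const Chunk_level_2 &c2 = Chk_Lv2[x / 100][y / 100][z / 100];
  if (c2.counter == 0) return false;      // whole 100^3 region is empty
  const Chunk_level_1 &c1 = c2.Chk_Lv1[(x / 10) % 10][(y / 10) % 10][(z / 10) % 10];
  if (c1.counter == 0) return false;      // whole 10^3 region is empty
  return c1.voxel[x % 10][y % 10][z % 10].hit_counter > 0;
}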

I want to declare it as a global variable so that the entire grid is stored in GPU memory, update it on the GPU from RGB-D data, and periodically synchronize the filled cells with a similar structure on the CPU. But when I try to compile this, the build runs in what looks like an infinite loop until my SSD runs out of space.

__device__ struct Voxel
{
  uint8_t ID = 0;
  uint8_t hit_counter = 0;
};

__device__ struct Chunk_level_1
{
  uint8_t counter = 0;
  Voxel voxel [10][10][10] = {0}; // 1 000
};

__device__ struct Chunk_level_2
{
  uint8_t counter = 0;
  Chunk_level_1 Chk_Lv1 [10][10][10]= {0};  // 1 000 000
};

__device__ Chunk_level_2 Chk_Lv2 [10][10][10] = {0};  //1 000 000 000

I am using CUDA Toolkit 12.0.1 on a mobile RTX 3050 Ti; here is my qmake config:

# CUDA
# nvcc flags (ptxas option verbose is always useful)
NVCCFLAGS = --compiler-options -fno-strict-aliasing -use_fast_math --ptxas-options=-v
# Path to cuda toolkit install
CUDA_DIR = /usr/local/cuda-12.0
# GPU architecture (ADJUST FOR YOUR GPU)
CUDA_GENCODE  = arch=compute_86,code=sm_86
# manually add CUDA sources (ADJUST MANUALLY)
CUDA_SOURCES += cudamap.cu
# Path to header and libs files
INCLUDEPATH  += $$CUDA_DIR/include
# libs used in your code
LIBS += -L $$CUDA_DIR/lib64 -lcudart -lcuda

cuda.commands        = $$CUDA_DIR/bin/nvcc -c -gencode $$CUDA_GENCODE $$NVCCFLAGS -o ${QMAKE_FILE_OUT} ${QMAKE_FILE_NAME}
cuda.dependency_type = TYPE_C
cuda.depend_command  = $$CUDA_DIR/bin/nvcc -M ${QMAKE_FILE_NAME} | sed \"s/^.*: //\" #For Qt 5.12.2
cuda.input           = CUDA_SOURCES
cuda.output          = ${OBJECTS_DIR}${QMAKE_FILE_BASE}_cuda.o
# Tell Qt that we want add more stuff to the Makefile
QMAKE_EXTRA_COMPILERS += cuda

You don’t do this in CUDA: what you have there is a struct definition, not an instantiation, and struct definitions don’t get tagged with __device__.

If you would like to instantiate a __device__ variable of that type, you would do:

struct Voxel
{
  uint8_t ID = 0;
  uint8_t hit_counter = 0;
};

__device__ struct Voxel v;  // the actual global device variable
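
For completeness, host code then reads or writes such a __device__ variable through the symbol-copy APIs, not through an ordinary pointer. A minimal sketch (the values are just examples):

Voxel h = {42, 1};
cudaMemcpyToSymbol(v, &h, sizeof(Voxel));    // host -> device
cudaMemcpyFromSymbol(&h, v, sizeof(Voxel));  // device -> host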

Beyond that, if you are still having trouble, I suggest:

  1. test against the latest version of CUDA
  2. provide a short, complete example, a description of the compile failure, the CUDA version you are using, and the compile command line that caused the failure.

This isn’t really a forum for assistance with Qt. Also, please don’t post pictures of text on this forum.

I also note that your variable will eventually occupy over 2GB of space. Is that your intent? I would suggest managing that with a dynamic allocation (e.g. cudaMalloc()) rather than a static allocation.
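
For reference, a rough size check (assuming the compiler adds no padding, which is expected here since every member is a single byte):

sizeof(Voxel)         = 2 bytes
sizeof(Chunk_level_1) = 1 + 1000 * 2         =         2 001 bytes
sizeof(Chunk_level_2) = 1 + 1000 * 2 001     =     2 001 001 bytes
sizeof(Chk_Lv2)       = 1000 * 2 001 001     = 2 001 001 000 bytes, i.e. just over 2GB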

On CUDA 12.0.1, I see that compiling this:

#include <cstdio>
#include <cstdint>
struct Voxel
{
  uint8_t ID = 0;
  uint8_t hit_counter = 0;
};

struct Chunk_level_1
{
  uint8_t counter = 0;
  Voxel voxel [10][10][10] = {0}; // 1 000
};

struct Chunk_level_2
{
  uint8_t counter = 0;
  Chunk_level_1 Chk_Lv1 [10][10][10]= {0};  // 1 000 000
};

__device__ Chunk_level_2 Chk_Lv2 [10][10][10] = {0};  //1 000 000 000

__global__ void k(){
        printf("%c\n", Chk_Lv2[1][2][3].counter);
}

int main(){

        k<<<1,1>>>();
        cudaDeviceSynchronize();
}

takes a “very long” time. On the other hand, I don’t have trouble with this:

#include <cstdio>
#include <cstdint>
struct Voxel
{
  uint8_t ID = 0;
  uint8_t hit_counter = 0;
};

struct Chunk_level_1
{
  uint8_t counter = 0;
  Voxel voxel [10][10][10] = {0}; // 1 000
};

struct Chunk_level_2
{
  uint8_t counter = 0;
  Chunk_level_1 Chk_Lv1 [10][10][10]= {0};  // 1 000 000
};


__global__ void k(Chunk_level_2 *c){
        printf("%c\n", c[(((1*10)+2)*10)+3].counter);
}

int main(){

        Chunk_level_2 *dc;
        cudaMalloc(&dc,  sizeof(Chunk_level_2)*10*10*10);
        cudaMemset(dc,0, sizeof(Chunk_level_2)*10*10*10);
        k<<<1,1>>>(dc);
        cudaDeviceSynchronize();
}
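
Since you also want to periodically synchronize filled cells with a similar structure on the CPU, that fits naturally with the dynamic allocation. A minimal sketch (host_grid and the chunk index are illustrative; host_grid plays the role of your CPU-side Chk_Lv2):

// Copy one Chunk_level_2 from the device grid back to a host-side mirror.
Chunk_level_2 *host_grid = new Chunk_level_2[10*10*10];  // heap, not stack: ~2MB per chunk
int idx = (((1*10)+2)*10)+3;                             // flattened [1][2][3], same indexing as the kernel
cudaMemcpy(&host_grid[idx], dc + idx, sizeof(Chunk_level_2), cudaMemcpyDeviceToHost);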

So if you’re not happy with that, and you want the __device__ variable instead, I would suggest:

  1. retest on the latest version of CUDA
  2. if it still manifests, file a bug.

According to my testing, it doesn’t seem to be possible to create a __device__ variable larger than 2GB. So I wouldn’t bother with that approach.

Compilation time and compiler memory requirements will grow with the size of the variable. I have not checked to confirm, but as I recall a statically initialized variable requires the entire variable to be written out to the object file, so with a 2GB statically initialized variable the object file will be larger than 2GB. Given enough system memory, disk space, and compilation time (20 minutes or so :-), the build should complete eventually.

Don’t do that. Dynamic allocation plus cudaMemset() is the approach I would advise.

That seems like a compelling reason :-) Presumably it is a quite reasonable limitation: a signed 32-bit integer is used somewhere (in the object file format?) to represent the data length.
