Dynamic Parallelism with texture memory

Hi all,

I’m trying to use dynamic parallelism as described here (https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/). I first created a simple example, which works fine. The problems start when I use it in my bigger project with a lot of data and many kernel calls.
When I do the calculation in the parent kernel (using the commented-out code instead of the child kernel), everything is fine. But if I launch a single child kernel as shown in the code, passing some data pointers, a texture object, and an index, I don’t get an error, yet the images I’m working on show weird artifacts and I don’t know what is causing them. Is there some kind of memory restriction when using child kernels?

The actual image processing is done in the getData function, which is a device function and is identical whether called from the parent or the child kernel. It reads from the texture, does some calculation, and writes the result back to global memory. I tried to simplify my code for this post as much as possible. The general aim is to launch fewer parent threads (by a factor of 32) and have each one launch a child kernel that calculates the data for all 32 elements. As said, doing this in the parent works fine, but I want to use child kernels because they will fit better with later image-processing steps.

My code structure right now is as follows:


__device__ Mydata getData(uint proj_idx, cudaTextureObject_t proj_tex, float *vol,
                          uint col_idx, uint row_idx, uint s_idx, int x, int y, int z)
{
    Mydata myData;
    float weight = getweight(col_idx, row_idx, s_idx, proj_idx);
    myData.proj = tex2D<float>(proj_tex, col_idx + 0.5f, row_idx + 0.5f);
    myData.pos  = (s_idx + z) * num_rows * num_cols + (row_idx + y) * num_cols + col_idx + x;
    myData.cur  = vol[myData.pos];
    vol[myData.pos] += myData.proj * weight;
    return myData;
}

__global__ void child_ker(uint proj_idx, cudaTextureObject_t proj_tex, float *vol, uint col_idx, uint row_idx, uint s_idx)
{
    for (int i = 0; i < 32; i++) {
        int x = i % 4;        // 0..3
        int y = (i / 4) % 4;  // 0..3
        int z = i / 16;       // 0..1
        Mydata myData = getData(proj_idx, proj_tex, vol, col_idx, row_idx, s_idx, x, y, z);
    }
}

__global__ void parentKern(uint proj_idx, cudaTextureObject_t proj_tex, float *vol)
{
    uint const col_idx   = blockIdx.x * blockDim.x * 4 + threadIdx.x * 4;
    uint const row_idx   = blockIdx.y * blockDim.y * 4 + threadIdx.y * 4;
    uint const slice_idx = blockIdx.z * blockDim.z * 2 + threadIdx.z * 2;

//    for (int i = 0; i < 32; i++) {
//        int x = i % 4;        // 0..3
//        int y = (i / 4) % 4;  // 0..3
//        int z = i / 16;       // 0..1
//        Mydata myData = getData(proj_idx, proj_tex, vol, col_idx, row_idx, slice_idx, x, y, z);
//    }

    child_ker<<<1, 1>>>(proj_idx, proj_tex, vol, col_idx, row_idx, slice_idx);
}

Yes, there are some restrictions. For example, it’s not legal to pass to a child kernel (and use there) a pointer to the parent kernel’s local or shared memory spaces. But you’re not doing that here. I don’t see any obvious issues with what you have shown, so I suspect the typical debug steps are in order at this point.

Thanks a lot for the quick answer. The next course is in September, but I will take a look at compute-sanitizer later, as suggested in the slides. I also ran nsys profile, but in the ‘cuda_gpu_kern_sum’ stats report I only see the parent kernel, not the child kernel, even in my simple example where the child kernel works fine. Am I missing something?

I think you’re misreading things. Unit 12, which I linked, is the one on debugging. There is no next course; those courses are all in the past and archived there, both slides and session recordings.

I would suggest doing that first, just as I suggested in unit 12 of that training course.

And beyond that it requires debugging. I certainly cannot solve it from what I see here.

Ah, you’re right.
compute-sanitizer found some invalid global reads. Thanks for suggesting it. I hope I can find the reason for those reads. Thanks a lot for your help!