Hi all,
I’m trying to use dynamic parallelism as described here (https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/). I first created a simple example, which works fine. The problems start when I use it in my bigger project with a lot of data and many kernel calls.
When I do the calculation in the parent kernel, everything is fine (using the commented-out code instead of the child kernel). But if I launch a single child kernel as shown in the code below, passing some data pointers, a texture object and an index, I don’t get an error, yet the images I’m working on show weird artifacts and I just don’t know what is causing them. Is there some kind of memory restriction when using child kernels?
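The only error checking I have around the child launch is roughly this (a simplified sketch of the relevant lines inside the parent kernel), and it never reports anything:

child_ker<<<1, 1>>>(proj_idx, proj_tex, vol, col_idx, row_idx, slice_idx);
cudaError_t err = cudaGetLastError();   // device-side check of the child launch
if (err != cudaSuccess)
    printf("child launch failed with error %d\n", (int)err);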
The actual image processing is done in the getData function, which is a device function and is the same whether it is called from the parent or the child kernel. It reads from the texture, does some calculation and writes the result back to the volume in global memory. I tried to simplify my code for this post as much as possible. The general aim is to launch fewer parent threads (by a factor of 32) and have each of them launch a child kernel that calculates the data for all 32 voxels. As said, doing this directly in the parent works fine, but I want to use child kernels because they will fit better with later image processing steps.
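For context, the parent kernel is launched from the host roughly like this (a simplified sketch; the block size and the names num_cols/num_rows/num_slices are placeholders for my real dimensions), and the project is built with -rdc=true as required for dynamic parallelism:

#include <cstdio>

void launchParent(uint proj_idx, cudaTextureObject_t proj_tex, float *d_vol,
                  uint num_cols, uint num_rows, uint num_slices)
{
    dim3 block(8, 8, 4);   // example block size
    // each parent thread covers a 4x4x2 voxel block, so the grid shrinks by a factor of 32
    dim3 grid((num_cols   + block.x * 4 - 1) / (block.x * 4),
              (num_rows   + block.y * 4 - 1) / (block.y * 4),
              (num_slices + block.z * 2 - 1) / (block.z * 2));
    parentKern<<<grid, block>>>(proj_idx, proj_tex, d_vol);
    cudaError_t err = cudaGetLastError();   // host-side check of the parent launch
    if (err != cudaSuccess)
        printf("parent launch failed: %s\n", cudaGetErrorString(err));
}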
My code structure right now is as follows:
__device__ Mydata getData(uint proj_idx, cudaTextureObject_t proj_tex, float *vol,
                          uint col_idx, uint row_idx, uint s_idx, int x, int y, int z)
{
    Mydata myData;
    // getweight(), num_rows and num_cols are defined elsewhere in the project
    float weight = getweight(col_idx, row_idx, s_idx, proj_idx);
    myData.proj = tex2D<float>(proj_tex, col_idx + 0.5f, row_idx + 0.5f);
    myData.pos = (s_idx + z) * num_rows * num_cols + (row_idx + y) * num_cols + col_idx + x;
    myData.cur = vol[myData.pos];
    vol[myData.pos] += myData.proj * weight;
    return myData;
}
__global__ void child_ker(uint proj_idx, cudaTextureObject_t proj_tex, float *vol,
                          uint col_idx, uint row_idx, uint s_idx)
{
    // one child thread loops over the 4x4x2 block of voxels
    for (int i = 0; i < 32; i++) {
        int x = i % 4;        // 0..3
        int y = (i / 4) % 4;  // 0..3
        int z = i / 16;       // 0..1
        Mydata myData = getData(proj_idx, proj_tex, vol, col_idx, row_idx, s_idx, x, y, z);
    }
}
__global__ void parentKern(uint proj_idx, cudaTextureObject_t proj_tex, float *vol)
{
    // each parent thread is responsible for a 4x4x2 block of voxels
    uint const col_idx   = blockIdx.x * blockDim.x * 4 + threadIdx.x * 4;
    uint const row_idx   = blockIdx.y * blockDim.y * 4 + threadIdx.y * 4;
    uint const slice_idx = blockIdx.z * blockDim.z * 2 + threadIdx.z * 2;
    // for (int i = 0; i < 32; i++) {
    //     int x = i % 4;        // 0..3
    //     int y = (i / 4) % 4;  // 0..3
    //     int z = i / 16;       // 0..1
    //     Mydata myData = getData(proj_idx, proj_tex, vol, col_idx, row_idx, slice_idx, x, y, z);
    // }
    child_ker<<<1, 1>>>(proj_idx, proj_tex, vol, col_idx, row_idx, slice_idx);
}