Hi, I am a beginner in CUDA.
Recently I was trying to implement BFS with CUDA Dynamic Parallelism (CDP).
if (id < TEST_NODE && Fa[id] == true && Xa[id] == true)
{
Fa[id] = false;
int start = Va[id].start;
int end = start + Va[id].length;
if (Va[id].length >= THRESHOLD) // CDP condition
{
int num_blks = (Va[id].length / 32) + (Va[id].length % 32 != 0);
int threads = 32;
// launch child kernel
CUDA_BFS_NEIGHBOR_KERNEL <<<num_blks, threads>>> (Va, Ea, Fa, Xa, Ca, done, d, start, end);
}
else
{
for (int i = start; i < end; i++)
{
int nid = Ea[i];
if (Xa[nid] == false)
{
Ca[nid] = Ca[id] + 1;
Fa[nid] = true;
*done = false;
}
}
}
}
The above code is part of the BFS kernel. When the number of neighbors of a node is greater than the given threshold, the child kernel will be launched.
if (Va[id].length >= THRESHOLD) // THRESHOLD = 1048576
{
int num_blks = (Va[id].length / 32) + (Va[id].length % 32 != 0);
int threads = 32;
CUDA_BFS_NEIGHBOR_KERNEL <<<num_blks, threads>>> (Va, Ea, Fa, Xa, Ca, done, d, start, end);
}
else
{
...; // visit neighbors w/o CDP
}
}
However, after I turned CDP off (set the threshold to a very large number), I found that even though the conditions for CDP were never hold, the child kernel launch overhead was still there (elapsed time ≈ 125ms for BFS).
if (Va[id].length >= THRESHOLD) // THRESHOLD = 1048576
{
int num_blks = (Va[id].length / 32) + (Va[id].length % 32 != 0);
int threads = 32;
// CUDA_BFS_NEIGHBOR_KERNEL <<<num_blks, threads>>> (Va, Ea, Fa, Xa, Ca, done, d, start, end);
}
else
{
...; // visit neighbors w/o CDP
}
}
if (Va[id].length >= THRESHOLD) // THRESHOLD = 1048576
{
int num_blks = (Va[id].length / 32) + (Va[id].length % 32 != 0);
int threads = 32;
while(true) printf("Oh NO...");
CUDA_BFS_NEIGHBOR_KERNEL <<<num_blks, threads>>> (Va, Ea, Fa, Xa, Ca, done, d, start, end);
}
else
{
...; // visit neighbors w/o CDP
}
}
And when I delete the line of code that launches the child kernel, or add while(true);
above it, the performance is improved. (elapsed time ≈ 25ms for BFS)
What I want to ask is:
- Where does this launch overhead come from (when the CDP condition is never met)?
- Can I avoid unnecessary launch overhead when I don’t want to start the kernel (the number of neighbors is less than the threshold)?