Dynamic parallelism: kernel didn't launch

Hi everyone,

Today I'm trying to do some dynamic parallelism.

So, to begin with:
I compile with: -rdc=true --machine 64 compute_35,sm_35
I added cudadevrt.lib to Linker > Input
Common > Host > Runtime Library is set to "Multi-Threaded Debug (/MTd)"
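
For reference, the equivalent nvcc command line should be roughly the following (the file and output names are just placeholders):

nvcc -rdc=true --machine 64 -gencode=arch=compute_35,code=sm_35 kernel.cu -o test -lcudadevrt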

I tried something simple, but it didn't work at all.

__global__ void child(int index)
{
	int			tid;

	if ((tid = threadIdx.x + blockIdx.x * blockDim.x) < 1)
	{
		printf("child\ttid = %d and index = %d\n", tid, index);
	}
}

__global__ void	kernel(void)
{
	int			tid;
	printf("father");
	tid = threadIdx.x + blockIdx.x * blockDim.x;
	child<<<1, 1>>>(tid);
}

int	main(void)
{
	printf("start\n");
	kernel<<<1, 1>>>();
	gpuErrchk(cudaPeekAtLastError());
	printf("end\n");
}

The terminal output is:
start
end

Yesterday I wrote a more complicated piece of code and it only worked partially, and I didn't get why…

In main I launch my parent kernel like this:

mainKernel <<< 780, 1024 >>> (/* some args */);

and in this mainKernel I used printf to print the tid (threadIdx.x + blockDim.x * blockIdx.x).

The max tid there was 798719. Then I tried to launch a childKernel from this parent kernel, and in this child kernel I used printf to print the tid of the parent and the tid of the child itself. But the max parent tid printed never reached 798719, and the number of child kernels seems to be random… So if I'm doing something wrong, I don't know what it is.

I read that a parent kernel waits until its child kernels have finished their work before it completes. A function called cudaDeviceSynchronize can be used inside the parent to make sure the parent waits. But whether I use it or not, the output stays random.
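
To make it concrete, here is roughly what I mean inside the parent (just a sketch reusing the child kernel from above; parentWithWait is only a name for the example):

__global__ void	parentWithWait(void)
{
	int			tid;

	tid = threadIdx.x + blockIdx.x * blockDim.x;
	child<<<1, 1>>>(tid);		// launch the child grid from the device
	cudaDeviceSynchronize();	// wait for child grids launched by this block before continuing
	printf("father\ttid = %d, child done\n", tid);
}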

Thanks for the help.

Your initial example doesn't wait in main until the kernel finishes. Also add error checking around CUDA calls, and run cuda-memcheck before asking for help.

Since you haven't posted the entire second example it's hard to tell what the problem is, but it's probably the same thing.
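
For the first example, the minimal host-side fix is something like this (same kernels you posted, using your gpuErrchk macro, just waiting for the device and checking errors before exiting):

int	main(void)
{
	printf("start\n");
	kernel<<<1, 1>>>();
	gpuErrchk(cudaPeekAtLastError());	// catches launch-time errors
	gpuErrchk(cudaDeviceSynchronize());	// waits for the kernel and its child to finish, catches runtime errors
	printf("end\n");
	return (0);
}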

I just added a call to cudaDeviceSynchronize in main after the kernel launch and the code works, but then I modified my code by adding a condition:

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
	if (code != cudaSuccess)
	{
		fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
		if (abort) exit(code);
	}
}

__global__ void child(int index)
{
	int			tid;

	if ((tid = threadIdx.x + blockIdx.x * blockDim.x) < 1)
	{
		index > 798000 ? printf("child\ttid = %d and index = %d\n", tid, index) : 0;
	}
}

__global__ void	kernel(void)
{
	int			tid;
	//printf("father\n");
	tid = threadIdx.x + blockIdx.x * blockDim.x;
	tid > 798000 ? printf("%d\n", tid) : 0;
	child<<<1, 1>>>(tid);
}

int	main(void)
{
	printf("start\n");
	kernel<<<780, 1024>>>();
	gpuErrchk(cudaPeekAtLastError());
	gpuErrchk(cudaDeviceSynchronize());
	printf("end\n");
}

In the output I get:

GPUassert : unknown error […]path[…] kernel.cu 39

Line 39 is

cudaDeviceSynchronize();

The same happens if I add this line after the call to child.

Edit: I don't know how to run cuda-memcheck at launch. I use Visual Studio 2013.
Edit 2: Why would there be a memory error? There isn't any array or anything, just a variable.

You’re not doing any error checking on the child kernel launch.

You should do so. You can do it in a similar fashion to how you do it on the host.

Note that child kernel launches have various limits. You might want to read the programming guide section on CDP (CUDA Dynamic Parallelism).

For example there is a pending launch limit. If you are hitting the pending launch limit, and you do proper error checking on the child launch, you will discover it.
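
As a sketch, the check in the parent could look like this (cudaGetLastError and cudaGetErrorString are available in device code under CDP; the limit value at the end is only an example):

__global__ void	kernel(void)
{
	int			tid;

	tid = threadIdx.x + blockIdx.x * blockDim.x;
	child<<<1, 1>>>(tid);
	cudaError_t err = cudaGetLastError();	// did the child launch get queued at all?
	if (err != cudaSuccess)
		printf("child launch failed in thread %d: %s\n", tid, cudaGetErrorString(err));
}

// On the host, before launching the parent grid, the pending launch limit can be raised:
// cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 32768);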

OK, so I'm coming back.

I do my error checking with a define; for cuda-memcheck I don't really know how to run it from VS2013 (Windows).
I wrote a program with kernels, and if I just check the indices of the kernels (to see whether I get all the indices or only some of them) it works (so I guess I didn't reach the launch limit). But if I do actual work in the kernels, like the calculation I need to do, then the program crashes on the cudaDeviceSynchronize in main with the error:
"unknown error"

Here is my code that works:

__global__ void	secondKernel(int realTid, Cartesian Sat, Cartesian Base, Cartesian Geo)
{
	int			tid;

	if ((tid = threadIdx.x + blockIdx.x * blockDim.x) < 209440)
	{
		realTid < 1 && tid > 209438 ? printf("Sat %.3f, %.3f, %.3f\nBase %.3f %.3f %.3f\nGeo %.3f %.3f %.3f\n", Sat.m_X, Sat.m_Y, Sat.m_Z, Base.m_X, Base.m_Y, Base.m_Z, Geo.m_X, Geo.m_Y, Geo.m_Z) : 0;
	}
}

__global__ void		kernelStream(double mult, Propagator *sat)
{
	int				tid;
	Cartesian		sat_position;
	Cartesian		geo_position;
	Cartesian		base_position(1597885.53777688, 1253552.16551859, 6046164.27311665);

	tid = (threadIdx.x + blockIdx.x * blockDim.x) + mult;
	if (tid < 798132)
	{
		sat_position = sat[0].evaluate(tid * STEP, SIMULATION_DURATION, 0);
		geo_position = findStartGeo(base_position);
		secondKernel <<< 205, 1024 >>> (tid, sat_position, base_position, geo_position);
		gpucheckError(cudaGetLastError());
		gpucheckError(cudaDeviceSynchronize());
	}
}

int		main(void)
{
	cudaStream_t	stream[NB_STREAM];
	cudaEvent_t		start, stop;
	dim3			nb_threads(98, 1, 1);
	dim3			nb_blocks(1, 1, 1);
	float			elapsedTime;
	Propagator		*sat, *cuda_sat;

	cpucheckError(cudaHostAlloc((void **)&sat, sizeof(Propagator) * NB_SAT, cudaHostAllocDefault));
	cpucheckError(cudaMalloc((void **)&cuda_sat, sizeof(Propagator) * NB_SAT));
	sat[0].propagator("Sat 1", 7847.3, 53, 0, 18, 0, 67.5, true, 5, true, 3.4000000596279278E-05);
	cpucheckError(cudaMemcpy(cuda_sat, sat, sizeof(Propagator) * NB_SAT, cudaMemcpyHostToDevice));
	for (int i = 0; i < NB_STREAM; i++)
		cpucheckError(cudaStreamCreate(&stream[i]));
	cpucheckError(cudaEventCreate(&start));
	cpucheckError(cudaEventCreate(&stop));
	cpucheckError(cudaEventRecord(start, 0));
	for (int i = 0; i < NB_STREAM; i++)
	{
		kernelStream<<<nb_blocks, nb_threads, 0, stream[i]>>>(i * nb_threads.x * nb_blocks.x, cuda_sat);
		cpucheckError(cudaGetLastError());
	}
	cpucheckError(cudaDeviceSynchronize());
	cpucheckError(cudaEventRecord(stop, 0));
	cpucheckError(cudaEventSynchronize(stop));
	cpucheckError(cudaEventElapsedTime(&elapsedTime, start, stop));
	for (int i = 0; i < NB_STREAM; i++)
		cpucheckError(cudaStreamDestroy(stream[i]));
	cpucheckError(cudaEventDestroy(start));
	cpucheckError(cudaEventDestroy(stop));
	printf("time : %f ms\n", elapsedTime);
	return (0);

}

Here are my defines for the error checks:

# define cpucheckError(value) { cpuAssert((value), __FILE__, __LINE__); }
__host__ inline void cpuAssert(cudaError_t code, const char *file, int line)
{
	if (code != cudaSuccess)
	{
		printf("%s %s %d\n", cudaGetErrorString(code), file, line);
		exit(0);
	}
}
# define gpucheckError(value) { gpuAssert((value), __FILE__, __LINE__); }
__device__ inline void gpuAssert(cudaError_t code, const char *file, int line)
{
	if (code != cudaSuccess)
	{
		printf("%s %s %d\n", cudaGetErrorString(code), file, line);
		return;
	}
}

Here is the code that doesn't work:

__constant__ double CUDA_result[BLOCKS_ANGLE];
__constant__ int	CUDA_tid[BLOCKS_ANGLE];

__global__ void	secondKernel(int realTid, Cartesian Sat, Cartesian Base, Cartesian Geo)
{
	int			tid;

	if ((tid = threadIdx.x + blockIdx.x * blockDim.x) < 209440)
	{
		__shared__ double	tmp[THREADS_ANGLE][2];
		Cartesian			new_pos, vecU, vecV;
		Global				global;
		int					tid, idx, i;
	
		idx = threadIdx.x;
		if ((tid = threadIdx.x + blockIdx.x * blockDim.x) < IT_ANGLE)
		{
			new_pos = global.rotationZAxis(Geo, STEP_ANGLE * tid);
			vecU.set(Sat.m_X - Base.m_X, Sat.m_Y - Base.m_Y, Sat.m_Z - Base.m_Z);
			vecV.set(new_pos.m_X - Base.m_X, new_pos.m_Y - Base.m_Y, new_pos.m_Z - Base.m_Z);
			tmp[idx][0] = global.dotProduct(vecU, vecV);
			tmp[idx][1] = (double)tid;
		}
		__syncthreads();
		i = THREADS_ANGLE / 2;
		while (i != 0)
		{
			(idx < i && tmp[idx][0] > tmp[idx + i][0]) ? (tmp[idx][0] = tmp[idx + i][0], tmp[idx][1] = tmp[idx + i][1]) : (0);
			__syncthreads();
			i /= 2;
		}
		(idx == 0) ? (CUDA_result[blockIdx.x] = tmp[0][0], CUDA_tid[blockIdx.x] = (int)tmp[0][1]) : (0);
	}
}

__global__ void		kernelStream(double mult, Propagator *sat)
{
	int				tid, index;
	Cartesian		sat_position;
	Cartesian		geo_position;
	Cartesian		base_position(1597885.53777688, 1253552.16551859, 6046164.27311665);
	Cartesian		output;

	tid = (threadIdx.x + blockIdx.x * blockDim.x) + mult;
	if (tid < 798132)
	{
		sat_position = sat[0].evaluate(tid * STEP, SIMULATION_DURATION, 0);
		geo_position = findStartGeo(base_position);
		secondKernel <<< 205, 1024 >>> (tid, sat_position, base_position, geo_position);
		gpucheckError(cudaGetLastError());
		gpucheckError(cudaDeviceSynchronize());
		index = getIndex(CUDA_result, CUDA_tid);
		output.set(geo_position.m_X * cos(STEP * index) + geo_position.m_Y * -sin(STEP * index), geo_position.m_X * sin(STEP * index) + geo_position.m_Y * cos(STEP * index), geo_position.m_Z);
	}
}

int		main(void)
{
	cudaStream_t	stream[NB_STREAM];
	cudaEvent_t		start, stop;
	dim3			nb_threads(98, 1, 1);
	dim3			nb_blocks(1, 1, 1);
	float			elapsedTime;
	Propagator		*sat, *cuda_sat;

	cpucheckError(cudaHostAlloc((void **)&sat, sizeof(Propagator) * NB_SAT, cudaHostAllocDefault));
	cpucheckError(cudaMalloc((void **)&cuda_sat, sizeof(Propagator) * NB_SAT));
	sat[0].propagator("Sat 1", 7847.3, 53, 0, 18, 0, 67.5, true, 5, true, 3.4000000596279278E-05);
	cpucheckError(cudaMemcpy(cuda_sat, sat, sizeof(Propagator) * NB_SAT, cudaMemcpyHostToDevice));
	for (int i = 0; i < NB_STREAM; i++)
		cpucheckError(cudaStreamCreate(&stream[i]));
	cpucheckError(cudaEventCreate(&start));
	cpucheckError(cudaEventCreate(&stop));
	cpucheckError(cudaEventRecord(start, 0));
	for (int i = 0; i < NB_STREAM; i++)
	{
		kernelStream<<<nb_blocks, nb_threads, 0, stream[i]>>>(i * nb_threads.x * nb_blocks.x, cuda_sat);
		cpucheckError(cudaGetLastError());
	}
	cpucheckError(cudaDeviceSynchronize());
	cpucheckError(cudaEventRecord(stop, 0));
	cpucheckError(cudaEventSynchronize(stop));
	cpucheckError(cudaEventElapsedTime(&elapsedTime, start, stop));
	for (int i = 0; i < NB_STREAM; i++)
		cpucheckError(cudaStreamDestroy(stream[i]));
	cpucheckError(cudaEventDestroy(start));
	cpucheckError(cudaEventDestroy(stop));
	printf("time : %f ms\n", elapsedTime);
	return (0);
}

I set the number of streams (NB_STREAM) to 12471.

Can’t run your code. Don’t know what BLOCKS_ANGLE, STEP_ANGLE, THREADS_ANGLE are, and who knows what else.

Run your program from the command line in a Windows command prompt console.

If your program is called test.exe, run it like this:

cuda-memcheck test

You'll never be able to run my program with just the code given earlier, because there are a lot more functions being called and I'd have to post 50-60 functions to let you run it. And BLOCKS_ANGLE and the others are all defines; I've got around 50 defines.

Wow, cuda-memcheck is really, really slow compared to a normal CUDA run! I launched it and after 15 minutes I got to idx = 10 000.

To continue, in case you want to know: kernelStream is called 10 times on 38950 streams, and in this kernel I call another kernel which is launched with 205 blocks and 1024 threads. Maybe it's because I reach the limits of the hardware that I get the "unknown error".

Edit: cuda-memcheck detects 0 errors.
Edit 2: If I launch the program without cuda-memcheck it "works", and if I launch it with cuda-memcheck I get "streamTest has stopped working".