Dynamic parallelism: kernel didn't launch

Hi everyone,

Today I'm trying to do some dynamic parallelism.

So, to begin with:
I compile with: -rdc=true --machine 64 compute_35,sm_35
I added cudadevrt.lib to Linker > Input
Common > Host > Runtime Library is set to "Multi-Threaded Debug (/MTd)"
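
For reference, the equivalent nvcc command line should be roughly the following (the file and output names are just placeholders):

nvcc -rdc=true --machine 64 -gencode=arch=compute_35,code=sm_35 kernel.cu -o test -lcudadevrt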

I tried something simple, but it didn't work at all.

__global__ void child(int index)
{
	int			tid;

	if ((tid = threadIdx.x + blockIdx.x * blockDim.x) < 1)
	{
		printf("child\ttid = %d and index = %d\n", tid, index);
	}
}

__global__ void	kernel(void)
{
	int			tid;
	printf("father");
	tid = threadIdx.x + blockIdx.x * blockDim.x;
	child<<<1, 1>>>(tid);
}

int	main(void)
{
	printf("start\n");
	kernel<<<1, 1>>>();
	gpuErrchk(cudaPeekAtLastError());
	printf("end\n");
}

The terminal output is:
start
end

Yesterday I wrote a more complicated piece of code and it only worked partially, and I didn't get why…

In main I launch my parent kernel like this:

mainKernel <<< 780, 1024 >>> (/* some args */);

and in this mainKernel I used printf to print the tid (threadIdx.x + blockDim.x * blockIdx.x).

The max tid there was 798719. Then I tried to launch a childKernel from this parent kernel, and in this child kernel I used printf to print the tid of the parent and the tid of the child itself. But the max parent tid printed never reached 798719, and the number of child kernels seems to be random… So if I'm doing something wrong, I don't know what it is.

I read that a parent kernel waits until its child kernels have finished their work before it completes. A function called cudaDeviceSynchronize can be used inside the parent to make sure the parent waits. But whether I use it or not, the output stays random.
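
To make it concrete, here is roughly what I mean inside the parent (just a sketch reusing the child kernel from above; parentWithWait is only a name for the example):

__global__ void	parentWithWait(void)
{
	int			tid;

	tid = threadIdx.x + blockIdx.x * blockDim.x;
	child<<<1, 1>>>(tid);		// launch the child grid from the device
	cudaDeviceSynchronize();	// wait for child grids launched by this block before continuing
	printf("father\ttid = %d, child done\n", tid);
}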

Thanks for the help.

Your initial example doesn't wait in main until the kernel finishes. Also add error checking around CUDA calls, and run cuda-memcheck before asking for help.

Since you haven't posted the entire second example it's hard to tell what the problem is, but it's probably the same thing.
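
For the first example, the minimal host-side fix is something like this (same kernels you posted, using your gpuErrchk macro, just waiting for the device and checking errors before exiting):

int	main(void)
{
	printf("start\n");
	kernel<<<1, 1>>>();
	gpuErrchk(cudaPeekAtLastError());	// catches launch-time errors
	gpuErrchk(cudaDeviceSynchronize());	// waits for the kernel and its child to finish, catches runtime errors
	printf("end\n");
	return (0);
}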

I just added a call to cudaDeviceSynchronize in main after the kernel launch and the code works, but then I modified my code by adding a condition:

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
	if (code != cudaSuccess)
	{
		fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
		if (abort) exit(code);
	}
}

__global__ void child(int index)
{
	int			tid;

	if ((tid = threadIdx.x + blockIdx.x * blockDim.x) < 1)
	{
		index > 798000 ? printf("child\ttid = %d and index = %d\n", tid, index) : 0;
	}
}

__global__ void	kernel(void)
{
	int			tid;
	//printf("father\n");
	tid = threadIdx.x + blockIdx.x * blockDim.x;
	tid > 798000 ? printf("%d\n", tid) : 0;
	child<<<1, 1>>>(tid);
}

int	main(void)
{
	printf("start\n");
	kernel<<<780, 1024>>>();
	gpuErrchk(cudaPeekAtLastError());
	gpuErrchk(cudaDeviceSynchronize());
	printf("end\n");
}

In the output I get:

GPUassert : unknown error […]path[…] kernel.cu 39

Line 39 is

cudaDeviceSynchronize();

The same happens if I add this line after the call to child.

Edit: I don't know how to run cuda-memcheck at launch. I use Visual Studio 2013.
Edit 2: Why would there be a memory error? There isn't any array or anything, just a variable.

You’re not doing any error checking on the child kernel launch.

You should do so. You can do it in a similar fashion to how you do it on the host.

Note that child kernel launches have various limits. You might want to read the programming guide section on CDP (CUDA Dynamic Parallelism).

For example there is a pending launch limit. If you are hitting the pending launch limit, and you do proper error checking on the child launch, you will discover it.
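
As a sketch, the check in the parent could look like this (cudaGetLastError and cudaGetErrorString are available in device code under CDP; the limit value at the end is only an example):

__global__ void	kernel(void)
{
	int			tid;

	tid = threadIdx.x + blockIdx.x * blockDim.x;
	child<<<1, 1>>>(tid);
	cudaError_t err = cudaGetLastError();	// did the child launch get queued at all?
	if (err != cudaSuccess)
		printf("child launch failed in thread %d: %s\n", tid, cudaGetErrorString(err));
}

// On the host, before launching the parent grid, the pending launch limit can be raised:
// cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 32768);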

OK, so I'm coming back.

I do my error checking with a define; for cuda-memcheck I don't really know how to run it from VS2013 (Windows).
I wrote a program with kernels, and if I just check the indices of the kernels (to see whether I get all the indices or only some of them) it works (so I guess I didn't reach the launch limit). But if I do actual work in the kernels, like the calculation I need to do, then the program crashes on the cudaDeviceSynchronize in main with the error:
"unknown error"

Here is my code that works:

__global__ void	secondKernel(int realTid, Cartesian Sat, Cartesian Base, Cartesian Geo)
{
	int			tid;

	if ((tid = threadIdx.x + blockIdx.x * blockDim.x) < 209440)
	{
		realTid < 1 && tid > 209438 ? printf("Sat %.3f, %.3f, %.3f\nBase %.3f %.3f %.3f\nGeo %.3f %.3f %.3f\n", Sat.m_X, Sat.m_Y, Sat.m_Z, Base.m_X, Base.m_Y, Base.m_Z, Geo.m_X, Geo.m_Y, Geo.m_Z) : 0;
	}
}

__global__ void		kernelStream(double mult, Propagator *sat)
{
	int				tid;
	Cartesian		sat_position;
	Cartesian		geo_position;
	Cartesian		base_position(1597885.53777688, 1253552.16551859, 6046164.27311665);

	tid = (threadIdx.x + blockIdx.x * blockDim.x) + mult;
	if (tid < 798132)
	{
		sat_position = sat[0].evaluate(tid * STEP, SIMULATION_DURATION, 0);
		geo_position = findStartGeo(base_position);
		secondKernel <<< 205, 1024 >>> (tid, sat_position, base_position, geo_position);
		gpucheckError(cudaGetLastError());
		gpucheckError(cudaDeviceSynchronize());
	}
}

int		main(void)
{
	cudaStream_t	stream[NB_STREAM];
	cudaEvent_t		start, stop;
	dim3			nb_threads(98, 1, 1);
	dim3			nb_blocks(1, 1, 1);
	float			elapsedTime;
	Propagator		*sat, *cuda_sat;

	cpucheckError(cudaHostAlloc((void **)&sat, sizeof(Propagator) * NB_SAT, cudaHostAllocDefault));
	cpucheckError(cudaMalloc((void **)&cuda_sat, sizeof(Propagator) * NB_SAT));
	sat[0].propagator("Sat 1", 7847.3, 53, 0, 18, 0, 67.5, true, 5, true, 3.4000000596279278E-05);
	cpucheckError(cudaMemcpy(cuda_sat, sat, sizeof(Propagator) * NB_SAT, cudaMemcpyHostToDevice));
	for (int i = 0; i < NB_STREAM; i++)
		cpucheckError(cudaStreamCreate(&stream[i]));
	cpucheckError(cudaEventCreate(&start));
	cpucheckError(cudaEventCreate(&stop));
	cpucheckError(cudaEventRecord(start, 0));
	for (int i = 0; i < NB_STREAM; i++)
	{
		kernelStream<<<nb_blocks, nb_threads, 0, stream[i]>>>(i * nb_threads.x * nb_blocks.x, cuda_sat);
		cpucheckError(cudaGetLastError());
	}
	cpucheckError(cudaDeviceSynchronize());
	cpucheckError(cudaEventRecord(stop, 0));
	cpucheckError(cudaEventSynchronize(stop));
	cpucheckError(cudaEventElapsedTime(&elapsedTime, start, stop));
	for (int i = 0; i < NB_STREAM; i++)
		cpucheckError(cudaStreamDestroy(stream[i]));
	cpucheckError(cudaEventDestroy(start));
	cpucheckError(cudaEventDestroy(stop));
	printf("time : %f ms\n", elapsedTime);
	return (0);

}

Here are my defines for the error checks:

# define cpucheckError(value) { cpuAssert((value), __FILE__, __LINE__); }
__host__ inline void cpuAssert(cudaError_t code, const char *file, int line)
{
	if (code != cudaSuccess)
	{
		printf("%s %s %d\n", cudaGetErrorString(code), file, line);
		exit(0);
	}
}
# define gpucheckError(value) { gpuAssert((value), __FILE__, __LINE__); }
__device__ inline void gpuAssert(cudaError_t code, const char *file, int line)
{
	if (code != cudaSuccess)
	{
		printf("%s %s %d\n", cudaGetErrorString(code), file, line);
		return;
	}
}

Here is the code that doesn't work:

__constant__ double CUDA_result[BLOCKS_ANGLE];
__constant__ int	CUDA_tid[BLOCKS_ANGLE];

__global__ void	secondKernel(int realTid, Cartesian Sat, Cartesian Base, Cartesian Geo)
{
	int			tid;

	if ((tid = threadIdx.x + blockIdx.x * blockDim.x) < 209440)
	{
		__shared__ double	tmp[THREADS_ANGLE][2];
		Cartesian			new_pos, vecU, vecV;
		Global				global;
		int					tid, idx, i;
	
		idx = threadIdx.x;
		if ((tid = threadIdx.x + blockIdx.x * blockDim.x) < IT_ANGLE)
		{
			new_pos = global.rotationZAxis(Geo, STEP_ANGLE * tid);
			vecU.set(Sat.m_X - Base.m_X, Sat.m_Y - Base.m_Y, Sat.m_Z - Base.m_Z);
			vecV.set(new_pos.m_X - Base.m_X, new_pos.m_Y - Base.m_Y, new_pos.m_Z - Base.m_Z);
			tmp[idx][0] = global.dotProduct(vecU, vecV);
			tmp[idx][1] = (double)tid;
		}
		__syncthreads();
		i = THREADS_ANGLE / 2;
		while (i != 0)
		{
			(idx < i && tmp[idx][0] > tmp[idx + i][0]) ? (tmp[idx][0] = tmp[idx + i][0], tmp[idx][1] = tmp[idx + i][1]) : (0);
			__syncthreads();
			i /= 2;
		}
		(idx == 0) ? (CUDA_result[blockIdx.x] = tmp[0][0], CUDA_tid[blockIdx.x] = (int)tmp[0][1]) : (0);
	}
}

__global__ void		kernelStream(double mult, Propagator *sat)
{
	int				tid, index;
	Cartesian		sat_position;
	Cartesian		geo_position;
	Cartesian		base_position(1597885.53777688, 1253552.16551859, 6046164.27311665);
	Cartesian		output;

	tid = (threadIdx.x + blockIdx.x * blockDim.x) + mult;
	if (tid < 798132)
	{
		sat_position = sat[0].evaluate(tid * STEP, SIMULATION_DURATION, 0);
		geo_position = findStartGeo(base_position);
		secondKernel <<< 205, 1024 >>> (tid, sat_position, base_position, geo_position);
		gpucheckError(cudaGetLastError());
		gpucheckError(cudaDeviceSynchronize());
		index = getIndex(CUDA_result, CUDA_tid);
		output.set(geo_position.m_X * cos(STEP * index) + geo_position.m_Y * -sin(STEP * index), geo_position.m_X * sin(STEP * index) + geo_position.m_Y * cos(STEP * index), geo_position.m_Z);
	}
}

int		main(void)
{
	cudaStream_t	stream[NB_STREAM];
	cudaEvent_t		start, stop;
	dim3			nb_threads(98, 1, 1);
	dim3			nb_blocks(1, 1, 1);
	float			elapsedTime;
	Propagator		*sat, *cuda_sat;

	cpucheckError(cudaHostAlloc((void **)&sat, sizeof(Propagator) * NB_SAT, cudaHostAllocDefault));
	cpucheckError(cudaMalloc((void **)&cuda_sat, sizeof(Propagator) * NB_SAT));
	sat[0].propagator("Sat 1", 7847.3, 53, 0, 18, 0, 67.5, true, 5, true, 3.4000000596279278E-05);
	cpucheckError(cudaMemcpy(cuda_sat, sat, sizeof(Propagator) * NB_SAT, cudaMemcpyHostToDevice));
	for (int i = 0; i < NB_STREAM; i++)
		cpucheckError(cudaStreamCreate(&stream[i]));
	cpucheckError(cudaEventCreate(&start));
	cpucheckError(cudaEventCreate(&stop));
	cpucheckError(cudaEventRecord(start, 0));
	for (int i = 0; i < NB_STREAM; i++)
	{
		kernelStream<<<nb_blocks, nb_threads, 0, stream[i]>>>(i * nb_threads.x * nb_blocks.x, cuda_sat);
		cpucheckError(cudaGetLastError());
	}
	cpucheckError(cudaDeviceSynchronize());
	cpucheckError(cudaEventRecord(stop, 0));
	cpucheckError(cudaEventSynchronize(stop));
	cpucheckError(cudaEventElapsedTime(&elapsedTime, start, stop));
	for (int i = 0; i < NB_STREAM; i++)
		cpucheckError(cudaStreamDestroy(stream[i]));
	cpucheckError(cudaEventDestroy(start));
	cpucheckError(cudaEventDestroy(stop));
	printf("time : %f ms\n", elapsedTime);
	return (0);
}

I set the number of streams (NB_STREAM) to 12471.

Can’t run your code. Don’t know what BLOCKS_ANGLE, STEP_ANGLE, THREADS_ANGLE are, and who knows what else.

Run your program from the command line in a Windows command prompt console.

If your program is called test.exe, run it like this:

cuda-memcheck test

You'll never be able to run my program with just the code given earlier, because there are a lot more functions being called and I'd have to post 50-60 functions to let you run it. And BLOCKS_ANGLE and the others are all defines; I've got around 50 defines.

Wow, cuda-memcheck is really, really slow compared to a normal CUDA run! I launched it and after 15 minutes I got to idx = 10 000.

To continue, in case you want to know: kernelStream is called 10 times on 38950 streams, and in this kernel I call another kernel which is launched with 205 blocks and 1024 threads. Maybe it's because I reach the limits of the hardware that I get the "unknown error".

Edit: cuda-memcheck detects 0 errors.
Edit 2: If I launch the program without cuda-memcheck it "works", and if I launch it with cuda-memcheck I get "streamTest has stopped working".