Strannge behaviour of kernel: unspecified launch failure

This kernel that actually do nothing generates unspecified launch failure in debug mode on various GPUs (1060,1080)

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

unsigned const GROUP_ID = 2;
unsigned const WORD_ID = 16;
size_t const BLOCK_ID = 16;
size_t const ITER = 128;

__global__ void Test()
	__shared__ unsigned sharedBlocks[GROUP_ID][BLOCK_ID][WORD_ID + 1]; // [2][16][16+1]

	unsigned group_id = threadIdx.x / BLOCK_ID;
	unsigned thread_id = threadIdx.x % BLOCK_ID;

	for (size_t iter = 0; iter < ITER; iter++)
		for (size_t blockId = 0; blockId < BLOCK_ID; blockId++)
			sharedBlocks[group_id][blockId][thread_id] = unsigned(group_id * blockId * thread_id);

int main()
	unsigned const THREADS_PER_CUDA_BLOCK = 32;
	unsigned const CUDA_BLOCKS = 4096 * 128;


	cudaError_t cudaStatus = cudaDeviceSynchronize();
	if (cudaStatus != cudaSuccess)
		printf("Error: %s\n", cudaGetErrorName(cudaStatus));
		printf("Error: %s\n", cudaGetErrorString(cudaStatus));

	return 0;

I’m just blind and I can not find where the error is.

When I change the size of a shared memory to shared unsigned sharedBlocks[GROUP_ID][BLOCK_ID][WORD_ID] then it works just fine.

You’re evidently on windows.
On a very fast GPU (Tesla V100) when compiled with debug switch, your kernel takes about 0.8s to execute. Your GPUs that you list are slower, so the kernel execution will be longer.

I suspect you’re just hitting a WDDM TDR timeout. If a kernel takes more than about 2s to execute on windows, you will hit a WDDM TDR timeout. Your code as-is (compiled in debug mode) does not generate any error for me under cuda-memcheck.


If you reduce the ITER count to say 1, I think this will go away, further suggesting this is a timeout issue (since that affects no other aspect of code behavior). If you reduce the ITER count, then learn to use the profiler(s) available to you, you can monitor kernel execution time, and then repeat with larger ITER counts until it starts failing. Should be around 2s kernel duration.

Or you can build kernel timing directly into the code, and do a similar experiment, although windows WDDM can get in the way here too.

The modification to the shared memory dimension that makes the issue go away somehow is affecting kernel duration. For example, it may be affecting the complexity of indexing calculations.

Thank you for the quick response!

Yes, I use Windows, and it is possible that it is due to WDDM TDR timeout.

This kernel is just a part of a bigger kernel, but I narrowed error to the fragment I posted.
However, it’s strange that the error does not happen when I change the size of the shared memory ([GROUP_ID][BLOCK_ID][WORD_ID + 1] → [GROUP_ID][BLOCK_ID][WORD_ID]. Also, when I change the type “size_t” to “unsigned” then the error disappears (in posted kernel, not in the bigger).

Really strange

I directly addressed this topic in my previous response. I suspect it is affecting kernel execution time. The last dimension you are changing is the “column” dimension, i.e. the width of the fixed-size array. The compiler will generate indexing calculations off of that dimension specifically. Changing it from a power-of-two to 17 could definitely have a substantial impact on the complexity of indexing calculations (for example introducing an actual integer division or multiplication operation where previously there was none).

That is switching from a 64-bit type to a 32-bit type. So, building off the previous argument, if I have to do 64-bit integer division or multiplication, it’s going to take longer than 32-bit.

If you are bouncing right around the TDR limit, there are probably all sorts of tweaks like this that could pop you over the limit or drop you under the limit.

Thank you, txbob! My case is resolved. Next for those, who’s interested in details.

I have had very similar issue on Windows 10 with GeForce 840M while working with JCuda via my own Java API. The problem always looked like as code allocated more memory than some supposed “internal limit” (250 MB via cudaHostAlloc or 500Mb via cuMemAlloc, real device limit was found 1,772,542,361 bytes). Allocation went fine but kernel failed at run time, as described in posts above. But my “reason” appears not true.

After TDM was reset to 10 seconds, the problem has gone. Below is a log for a test that previously failed with default TDM=2s. As one can see, code allocated about 1.5 GB and processed it more than 7 (6 for kernel) seconds. 2 seconds were obviously not enough.

Allocated input field memory of 11 520 000 bytes <== cudaHostAlloc 
Stuffed input field memory
Allocated matrix memory of 1 474 560 000 bytes <== cuMemAlloc
Stuffed matrix memory
Allocated output field memory of 11 520 000 bytes <== cudaHostAlloc 
Loaded module "/ptx/FlatMatrix4D_0.ptx"
Obtained module function "compute"
Constant module memory "size" has size of 16 bytes
Copied 16 bytes to static memory
Launching the kernel
The kernel finished. <== cuCtxSynchronize
Test PASSED in 7 382 601 453(6 039 020 606) ns

Offtopic: Java 8 on i7-4510U CPU runs same matrix 3x times faster after 4D matrix array was replaced by flat 1D array in Java. I observe Java runtime enforced multithreading internally so computing goes to all 4 CPU cores. GPU has 3 SM’s with 128(2048?) thread limit for each. To compare, before flattening, GPU took only 24% of CPU time and I was so happy :) A time to investigate it.