Kernel works sometimes

I am implementing the AES algorithm in ECB mode. My basic data type is a custom struct (Block) that contains
an unsigned char[4][4] and represents one 4x4 block.

First, I read a text file and store it in these blocks. For example, if the text file contains 16000 characters, I create an array of 1000 structs (4x4 blocks), so the number of blocks is 1000 (I use it later as dim3 numBlocks(1, number_of_blocks)).
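The packing step described above can be sketched on the host side as follows. This is only an illustration: the member name data, the helper pack_blocks, the row-by-row byte order, and the zero-padding of a final partial block are all assumptions, since the post does not show the real struct or loader.

```cuda
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// One 4x4 AES state block (16 bytes). The member name `data` is hypothetical.
struct Block {
    unsigned char data[4][4];
};

// Pack text into 16-byte blocks, filling each block row by row.
// Zero-padding the last partial block is an assumption of this sketch.
std::vector<Block> pack_blocks(const std::string& text) {
    // Round up so a trailing partial block still gets its own Block.
    const std::size_t number_of_blocks =
        (text.size() + sizeof(Block) - 1) / sizeof(Block);
    std::vector<Block> blocks(number_of_blocks);  // value-initialized: all bytes zero
    if (!text.empty()) {
        std::memcpy(blocks.data(), text.data(), text.size());
    }
    return blocks;
}
```

With 16000 input characters this yields 1000 blocks, matching the example in the post.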

After that, I allocate memory on the host and on the device for the plaintext, the ciphertext and the encrypted text, all of them the same size. Those are arrays of structs.

Like this:
cudaMallocHost((void**)&plaintext, number_of_blocks * sizeof(Block));
cudaMalloc((void**)&plaintext, number_of_blocks * sizeof(Block));

Then I use cudaMemcpy to copy the data from host to device.
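For reference, a minimal sketch of the allocate-and-copy step with every return code checked. The helper name upload_blocks and the h_/d_ pointer prefixes are hypothetical; separate names are used here so the host and device pointers cannot be confused, which the abbreviated snippet above does not show.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Same 16-byte layout as in the post; the member name `data` is hypothetical.
struct Block {
    unsigned char data[4][4];
};

// Allocate a device buffer and copy number_of_blocks Blocks into it.
// On any failure *d_plaintext is left as nullptr and the error is returned.
cudaError_t upload_blocks(const Block* h_plaintext, size_t number_of_blocks,
                          Block** d_plaintext) {
    *d_plaintext = nullptr;
    const size_t bytes = number_of_blocks * sizeof(Block);

    Block* d_buf = nullptr;
    cudaError_t err = cudaMalloc((void**)&d_buf, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(d_buf, h_plaintext, bytes, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        cudaFree(d_buf);  // do not leak the buffer if the copy failed
        return err;
    }

    *d_plaintext = d_buf;
    return cudaSuccess;
}
```

The caller owns the returned device pointer and should release it with cudaFree when done.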

Before calling the kernel I set up:
dim3 threadsPerBlock(4, 4);
dim3 numBlocks(1, number_of_blocks);

Now there is a problem. If the input text file contains fewer than one million characters, everything works fine, but if the number of characters is greater than one million, the kernel returns an empty array, nothing.

What could be the problem? Is it more likely to be a problem with memory or with my code (chances are 90%)?
I can provide my code later if necessary.

Those launch dimensions are poor choices if you want an efficient kernel: a 4x4 thread block has only 16 threads, half of a 32-thread warp, so the kernel may be running slowly. As you increase the data set size, the kernel duration increases.

If you are running on a GPU that is hosting a display, you may be hitting a kernel timeout.

Windows:

https://docs.nvidia.com/gameworks/content/developertools/desktop/nsight/timeout_detection_recovery.htm

Linux:

https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x

Usually if your kernel duration is longer than about 2 seconds, you may run into this issue.
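To tell a watchdog timeout apart from a launch that was rejected before it ever ran, it helps to check the error status right after the launch and again after synchronizing. The sketch below assumes a stand-in kernel named encrypt_kernel that only copies bytes (the real AES rounds are not shown in the thread) and a hypothetical wrapper launch_checked.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Block {
    unsigned char data[4][4];  // member name is hypothetical
};

// Stand-in kernel: one thread block per AES block, one thread per byte.
// The real encryption logic is omitted.
__global__ void encrypt_kernel(const Block* in, Block* out) {
    const unsigned int b = blockIdx.y;
    out[b].data[threadIdx.y][threadIdx.x] = in[b].data[threadIdx.y][threadIdx.x];
}

// Launch with the configuration from the post and report any error.
cudaError_t launch_checked(const Block* d_in, Block* d_out,
                           unsigned int number_of_blocks) {
    dim3 threadsPerBlock(4, 4);
    dim3 numBlocks(1, number_of_blocks);
    encrypt_kernel<<<numBlocks, threadsPerBlock>>>(d_in, d_out);

    // Launch-time errors (for example an invalid grid size) show up here.
    cudaError_t err = cudaGetLastError();
    // Execution-time errors (for example a display watchdog timeout) show up
    // only once the kernel has finished or been killed.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

If the message is printed immediately and the kernel never ran, the launch configuration is the likely culprit rather than the timeout.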

Sorry for the late answer, but I managed to solve this problem. The issue was that number_of_blocks was too big for the kernel to run at all. So I replaced the single launch with several kernel calls, each with a smaller number_of_blocks.
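The post does not say which limit was exceeded. One plausible candidate is the CUDA cap of 65535 on the y dimension of the grid, since 65535 blocks of 16 bytes is roughly one million characters. Under that assumption, the "several kernel calls" fix could look like the sketch below; encrypt_kernel is a stand-in that only copies bytes, and plan_chunks and encrypt_in_chunks are hypothetical helper names.

```cuda
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

struct Block {
    unsigned char data[4][4];  // member name is hypothetical
};

// Stand-in kernel: one thread block per AES block (real AES logic omitted).
__global__ void encrypt_kernel(const Block* in, Block* out) {
    const unsigned int b = blockIdx.y;
    out[b].data[threadIdx.y][threadIdx.x] = in[b].data[threadIdx.y][threadIdx.x];
}

// Split `total` blocks into (offset, count) chunks of at most max_per_launch.
// Precondition: max_per_launch > 0.
std::vector<std::pair<size_t, size_t>> plan_chunks(size_t total,
                                                   size_t max_per_launch) {
    std::vector<std::pair<size_t, size_t>> chunks;
    for (size_t offset = 0; offset < total; offset += max_per_launch) {
        chunks.emplace_back(offset, std::min(max_per_launch, total - offset));
    }
    return chunks;
}

// Run one launch per chunk so gridDim.y never exceeds the assumed limit.
cudaError_t encrypt_in_chunks(const Block* d_in, Block* d_out,
                              size_t number_of_blocks) {
    const size_t kMaxGridY = 65535;  // documented maximum for gridDim.y
    for (const auto& chunk : plan_chunks(number_of_blocks, kMaxGridY)) {
        dim3 threadsPerBlock(4, 4);
        dim3 numBlocks(1, static_cast<unsigned int>(chunk.second));
        // Offset the pointers so blockIdx.y stays relative to the chunk start.
        encrypt_kernel<<<numBlocks, threadsPerBlock>>>(d_in + chunk.first,
                                                       d_out + chunk.first);
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) return err;
    }
    return cudaDeviceSynchronize();
}
```

An alternative under the same assumption is a single launch that puts the block count in the x dimension of the grid, which allows up to 2^31 - 1 blocks on devices of compute capability 3.0 and later, with the kernel indexing by blockIdx.x instead.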