Why does increasing the number of parallel threads slow down my application?

Dear All, I am new to the community, so sorry if my question is trivial.

I wrote some simple CUDA C++ code that implements a Transformer. The problem is that the performance of my app depends on the number of parallel threads, and I do not know why. For example, I have the following code:

dim3 numberOfBlocks(A,B,C);
dim3 numberOfThreadsPerBlock(C,D,E);

If A, B, C, or likewise C, D, E, are small, my app runs fast, but if I slightly change just one of these parameters, say from 10 to 50, the app slows down. I thought that if everything happens in parallel, and the limits on the maximum number of threads per block and the number of blocks are respected, everything should run equally fast. But that is not the case.

It certainly depends on code design.

You can write code that uses whatever grid is available to work on the data. An example of such a design paradigm is the grid-stride loop. Such a design does a variable amount of work per thread, depending on the problem size and the grid size. Within various ranges of grid size, you might observe that the kernel runtime is mostly independent of the grid sizing.
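
To make the grid-stride idea concrete, here is a minimal sketch. The kernel name scaleGridStride, the host helper scaleOnDevice, and the element-wise scaling task are all hypothetical, chosen only for illustration, and error checking is omitted for brevity. The total work is fixed by n, so changing the launch configuration changes how the work is divided among threads, not how much work there is.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread starts at its global index and then
// advances by the total number of threads in the grid, so any grid size
// covers all n elements.
__global__ void scaleGridStride(float* data, int n, float factor)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Hypothetical host helper: copies the input to the device, runs the kernel
// with the requested 1D launch configuration, and returns the scaled copy.
std::vector<float> scaleOnDevice(const std::vector<float>& input, float factor,
                                 int blocks, int threadsPerBlock)
{
    assert(blocks > 0 && threadsPerBlock > 0);
    int n = static_cast<int>(input.size());
    std::vector<float> output(input.size());

    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, input.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scaleGridStride<<<blocks, threadsPerBlock>>>(d_data, n, factor);

    // This blocking copy waits for the kernel to finish first.
    cudaMemcpy(output.data(), d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return output;
}
```

Calling scaleOnDevice(input, 3.0f, 1, 32) and scaleOnDevice(input, 3.0f, 8, 128) should give the same result, because the loop bound depends on n rather than on the grid size.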

You can also write code that does a fixed amount of work per thread. In that case, if you increase the number of threads, you increase the total work and most likely the kernel duration, whether or not that was your intent.
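
For contrast, here is a rough sketch of the fixed-work-per-thread case. The kernel fixedWorkPerThread and the helper totalWork are hypothetical names, and the "work" is only a counter standing in for whatever each thread really computes. Every thread performs the same number of iterations regardless of the launch configuration, so the total grows in proportion to the number of threads launched.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread does the same number of iterations, no
// matter how many threads exist, and adds its count to a global total.
__global__ void fixedWorkPerThread(unsigned long long* totalIterations,
                                   int iterationsPerThread)
{
    unsigned long long done = 0;
    for (int i = 0; i < iterationsPerThread; ++i) {
        ++done;  // stand-in for one unit of real work
    }
    atomicAdd(totalIterations, done);
}

// Hypothetical host helper: launches the kernel and returns the total number
// of iterations performed across the whole grid.
unsigned long long totalWork(dim3 numberOfBlocks, dim3 numberOfThreadsPerBlock,
                             int iterationsPerThread)
{
    unsigned long long* d_total = nullptr;
    cudaMalloc(&d_total, sizeof(unsigned long long));
    cudaMemset(d_total, 0, sizeof(unsigned long long));

    fixedWorkPerThread<<<numberOfBlocks, numberOfThreadsPerBlock>>>(
        d_total, iterationsPerThread);

    unsigned long long h_total = 0;
    // This blocking copy waits for the kernel to finish first.
    cudaMemcpy(&h_total, d_total, sizeof(h_total), cudaMemcpyDeviceToHost);
    cudaFree(d_total);
    return h_total;
}
```

With these assumptions, going from totalWork(dim3(10,10,10), dim3(8,8,4), 100) to totalWork(dim3(50,10,10), dim3(8,8,4), 100) changes a single grid dimension from 10 to 50 and multiplies the total work by five. Once the GPU is already fully occupied, that extra work has to be paid for in runtime.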

Thank you very much for the useful information.