Kernel invocation from another CUDA file

Hello,
I am new to CUDA programming and I am practicing on the BlackScholes sample. I would like to create a separate CUDA file that can invoke the kernel, because I want to change the number of threads used. So far I have tried to do it by calling CUDA functions such as cudaLaunchKernel, but it doesn’t work. I suspect there is a general problem with this approach, because there are factors I cannot work around, such as the parameters used by the kernel function. But is there any way to do this?

Any help appreciated.

Kind regards

1: What is your GPU model and, above all, its compute capability?
For dynamic parallelism you must have at least a Kepler architecture card.
Compute capability 3.0 or later.

2: Have you checked the section on dynamic parallelism in the programming guide?

That’s a good starting point, in the absence of more details about your environment.

I have a Quadro P1000, a Quadro K2200, and a GeForce GT 720. The truth is that I haven’t checked the dynamic parallelism section yet, but I will. The general idea is that I don’t want to change the BlackScholes code, but I want to call the kernel with a different number of blocks than it already uses.

Thank you very much for your reply!

Should work fine, as far as I can tell, for all your cards.

The P1000 should be Pascal-based.
The K2200 should be Kepler-based.

About the GT 720 I am uncertain, because NVIDIA’s spec pages still don’t always list the compute capability or architecture for every GeForce card, but the 700 series should be either Kepler or Maxwell, if I have the naming system right in my mind.

Only Fermi (compute capability 2.x) and older architectures are unable to handle dynamic parallelism.

Actually dynamic parallelism (CDP) requires compute capability 3.5 or higher:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism

Kepler devices of compute capability 3.0 or 3.2 are not supported for CDP.

If you have devices in your possession, an easy way to find out their compute capability is to run the deviceQuery sample code on them.
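If you prefer a programmatic check instead, a minimal sketch along these lines (the file name is made up; cudaGetDeviceCount and cudaGetDeviceProperties are standard runtime API calls) prints the compute capability of every visible device:

// check_cc.cu - print the compute capability of each visible CUDA device
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}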

Having said all that, I doubt dynamic parallelism is the answer to OP’s request, but the request is a little unclear to me, so I may be wrong.

Hello txbob,

My problem is that I have a kernel call like this:

kernelInvocation<<<10, 128>>>();

and, without changing the BlackScholes code, I want to write something or use a CUDA function that will make this call execute with as many threads as I want. I don’t know if that is possible; I have tried it in many ways, mostly with CUDA functions. I want to call this kernel with fewer threads and see how this affects the execution time. Is there any way to do this, or do I have to change the “128” manually to see a difference?

Thank you very much for your reply.

For starters, you would have to manually change the 10 and 128 numbers.

Beyond that, depending on the kernelInvocation code itself, other changes might be necessary.
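To illustrate why the kernel code matters (this is a generic, made-up kernel, not the actual BlackScholes one): a kernel written with a grid-stride loop stays correct for any launch configuration, so the host-side numbers can be varied freely for timing experiments, while a kernel that hard-codes assumptions about 128 threads per block would not.

// Hypothetical kernel: a grid-stride loop covers all n elements
// no matter how many blocks/threads the host launches.
__global__ void scaleKernel(float *data, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        data[i] *= 2.0f;
    }
}

// Host side: these two numbers are the only place the thread count is chosen.
int blocks  = 10;    // try different values and compare the execution time
int threads = 128;
scaleKernel<<<blocks, threads>>>(d_data, n);   // d_data and n assumed allocated/set elsewhere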

You probably need to learn CUDA programming.

Yes, the truth is that I have just started to learn CUDA. OK, I will do it manually then, as the kernelInvocation code itself does indeed need other changes.

Thank you again.

So, apart from the fact that I am a beginner, I just want to know whether there is, in general, a way to configure the device so that it uses a specific number of threads. For now I will do it manually, but in the near future I may try to do it more automatically, so I want to know what to look for.
Is that possible? Are there any CUDA functions to help me with it, or is doing it manually the only way? Is it something that needs a deeper, more specific understanding of CUDA programming?

Either way, it is important for me to know, as I am studying this as part of my BSc thesis.

No, there are no device configuration settings that affect or override the number of threads that will be launched by a kernel. The kernel launch itself has exclusive control over the total number of threads to launch, as well as the number of threads per block.
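To tie this back to the original question about invoking the kernel from another CUDA file: since the launch configuration is always decided at the launch site, one option is to declare the kernel in a second .cu file and launch it from there with whatever dimensions you want. A rough sketch follows (the names are made up, the real BlackScholes kernel takes parameters that would have to be passed through, and depending on how the project is built you may need nvcc separate compilation, i.e. -rdc=true):

// launcher.cu - hypothetical second file; the kernel itself is defined in the original .cu
#include <cuda_runtime.h>

__global__ void kernelInvocation();   // declaration only; the definition lives elsewhere

void launchWithConfig(int blocks, int threadsPerBlock)
{
    kernelInvocation<<<blocks, threadsPerBlock>>>();
    cudaDeviceSynchronize();           // wait for completion so any timing is meaningful
}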

Ok thank you very much! :)