Why would you keep calling a kernel? If I understand correctly, that’s a pretty expensive operation – make a scheduler call, load the binary blob, start a new thread, etc.
Why wouldn’t you just let your kernel keep running (in a while loop?) and then communicate with it (pass the next set of parameters, I assume?) by simply updating some memory region? To avoid collisions you could implement some locking mechanism or use double/triple buffers.
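Roughly what I have in mind – an untested, minimal sketch assuming zero-copy (mapped pinned) host memory as the shared region. All the names here (job_flag, params, result) and the toy “work” are just placeholders, not anyone’s real code:

// Persistent-kernel sketch: launch once, then post jobs through mapped host memory.
#include <cuda_runtime.h>
#include <atomic>
#include <cstdio>

__global__ void persistentKernel(volatile int *job_flag,
                                 volatile float *params,
                                 volatile float *result)
{
    __shared__ int cmd;
    while (true) {
        if (threadIdx.x == 0) {
            while (*job_flag == 0) { }       // spin until the host posts a job
            cmd = *job_flag;
        }
        __syncthreads();                     // broadcast the command to the block
        if (cmd == -1) return;               // host asked us to exit

        result[threadIdx.x] = params[threadIdx.x] * 2.0f;   // the "work"
        __syncthreads();
        if (threadIdx.x == 0) *job_flag = 0; // signal completion
    }
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *job_flag; float *params, *result;
    cudaHostAlloc((void**)&job_flag, sizeof(int),         cudaHostAllocMapped);
    cudaHostAlloc((void**)&params,   256 * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&result,   256 * sizeof(float), cudaHostAllocMapped);
    *job_flag = 0;

    int *d_flag; float *d_params, *d_result;
    cudaHostGetDevicePointer((void**)&d_flag,   job_flag, 0);
    cudaHostGetDevicePointer((void**)&d_params, params,   0);
    cudaHostGetDevicePointer((void**)&d_result, result,   0);

    persistentKernel<<<1, 256>>>(d_flag, d_params, d_result);   // launched exactly once

    for (int job = 0; job < 10; ++job) {
        for (int i = 0; i < 256; ++i) params[i] = (float)(job + i);
        std::atomic_thread_fence(std::memory_order_seq_cst);    // publish params before the flag
        *job_flag = 1;                                          // post the job
        while (*(volatile int *)job_flag != 0) { }              // wait until the kernel clears it
        printf("job %d: result[0] = %f\n", job, result[0]);
    }

    *job_flag = -1;                          // tell the kernel to exit
    cudaDeviceSynchronize();
    return 0;
}

In a real application you would replace the busy-wait on the host with something smarter, but the idea is the same: the kernel stays resident and only the parameters move.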
Interesting product. On your documentation page you mention that CUDA drivers are available only for Windows. That’s not accurate, as we provide CUDA drivers for a variety of Linux distributions as well as Mac OS. Vista is also supported in the CUDA 2.0 beta.
I would say that this is normal. 10’000 invocations per second translates to a running time of about 100 µs per call, which is far too short for CUDA. You should increase the amount of work performed in one kernel call (for example, by increasing the grid size) so that the running time is at least a few ms. You should see a performance improvement, too.
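For illustration only (the kernel and sizes here are made up, not your raytracer): one launch covering 10’000 small jobs pays the launch overhead once instead of 10’000 times.

#include <cuda_runtime.h>

__global__ void scaleOnePerLaunch(float *data, int offset, int n)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void scaleBatched(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;              // same total work, one launch
}

int main()
{
    const int jobs = 10000, per_job = 256, n = jobs * per_job;
    float *d; cudaMalloc(&d, n * sizeof(float));

    // Many tiny launches: launch overhead dominates.
    for (int j = 0; j < jobs; ++j)
        scaleOnePerLaunch<<<1, per_job>>>(d, j * per_job, n);

    // One big launch doing the same total work.
    scaleBatched<<<jobs, per_job>>>(d, n);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}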
-----------------------------------------
making data: 21%
cudaMemcpyHostToDevice (~300 bytes): 15%
kernel launch (Kernel<<< grid, threads >>>): 11%
ray tracing (249 threads): 6%
cudaMemcpyDeviceToHost (1992 bytes): 47%
-----------------------------------------
Total time: 100%
The ray tracing itself takes 6%, but the data exchange is more than half of the total time. So implementing your suggestion would remove only ~11% (the kernel launch overhead).
It will likely remove more than 11%, as the per-byte overhead of transferring data to/from the GPU gets lower when more data is transferred in one call. So the time to transfer 100 bytes 100 times is (much) longer than the time to transfer 10’000 bytes once.
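A quick way to see this on your own machine (nothing LINZIK-specific, just CUDA events timing 100 copies of 100 bytes against one copy of 10’000 bytes):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int chunks = 100, chunk_bytes = 100;
    const int total_bytes = chunks * chunk_bytes;

    char host[total_bytes] = {0};
    char *dev; cudaMalloc(&dev, total_bytes);

    cudaEvent_t t0, t1; cudaEventCreate(&t0); cudaEventCreate(&t1);
    float ms_small, ms_big;

    // 100 transfers of 100 bytes each
    cudaEventRecord(t0);
    for (int i = 0; i < chunks; ++i)
        cudaMemcpy(dev + i * chunk_bytes, host + i * chunk_bytes,
                   chunk_bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_small, t0, t1);

    // one transfer of 10'000 bytes
    cudaEventRecord(t0);
    cudaMemcpy(dev, host, total_bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_big, t0, t1);

    printf("100 x 100 B: %.3f ms, 1 x 10000 B: %.3f ms\n", ms_small, ms_big);

    cudaFree(dev);
    return 0;
}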
In the current version of LINZIK the lens optimizer runs on the CPU and calls the goal (merit) function on CUDA. The optimizer creates jobs for the raytracer step by step, so it never has many jobs available simultaneously.
It would be possible to move the optimizer into the CUDA kernel, but that would mean the LINZIK language interpreter (or at least its virtual machine) would have to run there as well…