LINZIK: freeware optical CAD: powered by CUDA Ray tracing using Feder's formulas ...

Hi All!

Subj: http://www.linzik.com/ This is a beta version; it has an English interface, but the Help is available only in Russian so far. There is also a small LINZIK forum thread, "My LINZIK is not ZEMAX yet, but ..." (page 8, Telescope Making and Optics, Astroforum astronomy portal): http://www.astronomy.ru/forum/index.php/to.../topicseen.html

My problem: the maximum rate at which I can call a CUDA kernel is always below roughly 10,000 to 15,000 calls per second. Can I improve this radically?

Thanks for answers and attention!

Arkady

P.S. I’m using a GeForce 8800GT and an Intel E2180 CPU. This low CUDA kernel call rate certainly seems to be a bottleneck :(

Cool little program!

Why would you keep calling a kernel? If I understand correctly, it’s a pretty expensive operation: make a scheduler call, load the binary blob, start a new thread, etc.

Why not just let your kernel keep running (in a while loop?) and communicate with it (to pass the next set of parameters, I assume) by simply updating some memory region? To avoid collisions you could implement a locking mechanism or use double/triple buffers.

Hi Arkady,

Interesting product. In your documentation page you mention that CUDA drivers are available only for Windows. That’s not accurate, as we provide CUDA drivers for a variety of Linux distributions as well as Mac OS. CUDA 2.0 beta is also supported for Vista.

Paulius

I would say that this is normal. 10,000 invocations per second translates to a running time of 100 µs per call, which is far too short for CUDA. You should increase the amount of work performed in one kernel call (possibly by increasing the grid size) so that the running time is at least a few ms. You should see a performance improvement, too.
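The 100 µs figure is just 1 s / 10,000 calls. To see how much of that is fixed launch cost, one can time a kernel that does nothing. The following is a minimal sketch, not LINZIK code: `empty_kernel` and `microseconds_per_launch` are hypothetical names, and it assumes a current CUDA toolkit (`cudaDeviceSynchronize`; toolkits from the 8800GT era used `cudaThreadSynchronize` instead). It measures back-to-back asynchronous launches, so it gives a lower bound rather than the cost of a full copy, launch, copy round trip.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that does no work, so only launch overhead is timed.
__global__ void empty_kernel() {}

// Average microseconds per launch of empty_kernel, or a negative value on error.
float microseconds_per_launch(int launches) {
    if (launches <= 0) return -1.0f;

    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    // Warm-up launch so one-time initialization is not included in the timing.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i) {
        empty_kernel<<<1, 1>>>();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all launches have finished

    float elapsed_ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&elapsed_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    if (err != cudaSuccess) return -1.0f;

    return elapsed_ms * 1000.0f / static_cast<float>(launches);
}
```

Calling `microseconds_per_launch(10000)` and comparing the result with the 100 µs budget shows how much time per call is left for actual work.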

Here is the profile of one of my kernel calls:

-----------------------------------------
                         making data: 21%
 cudaMemcpyHostToDevice (~300 bytes): 15%
         Kernel<<< grid, threads >>>: 11%
                         249 threads:  6%
 cudaMemcpyDeviceToHost (1992 bytes): 47%
-----------------------------------------
                          Total time: 100%

The ray tracing itself takes only 6%, while the data exchange (15% + 47% = 62%) takes more than half of the total time. So implementing your suggestion would remove only ~11% (the kernel launch overhead).

Thank you, Paulius, and sorry for the inaccurate information; I will correct it.

It will likely remove more than 11%, because the per-byte overhead of transferring data to/from the GPU drops as more data is transferred at once. So transferring 100 bytes 100 times takes (much) longer than transferring 10,000 bytes once.
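A minimal sketch of that idea, assuming several ray-tracing jobs can be collected before the GPU is called: one host-to-device copy, one kernel launch and one device-to-host copy serve the whole batch. `Job`, `Result`, `trace_batch` and `run_batch` are hypothetical placeholders, not LINZIK's actual types, and the kernel body only stands in for the real trace.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-ins for one ray-tracing job and its result.
struct Job    { float x; };
struct Result { float y; };

// One thread per job; the arithmetic is a placeholder for the real ray trace.
__global__ void trace_batch(const Job* jobs, Result* results, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        results[i].y = 2.0f * jobs[i].x;
    }
}

// Processes all jobs with a single copy in, a single launch and a single copy out.
cudaError_t run_batch(const std::vector<Job>& jobs, std::vector<Result>& results) {
    const int n = static_cast<int>(jobs.size());
    results.resize(n);
    if (n == 0) return cudaSuccess;

    Job* d_jobs = nullptr;
    Result* d_results = nullptr;
    cudaError_t err = cudaMalloc(&d_jobs, n * sizeof(Job));
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_results, n * sizeof(Result));
    if (err != cudaSuccess) {
        cudaFree(d_jobs);
        return err;
    }

    // One transfer for the whole batch instead of one per job.
    err = cudaMemcpy(d_jobs, jobs.data(), n * sizeof(Job), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        trace_batch<<<blocks, threads>>>(d_jobs, d_results, n);
        err = cudaGetLastError();  // reports launch configuration errors
    }
    if (err == cudaSuccess) {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(results.data(), d_results, n * sizeof(Result),
                         cudaMemcpyDeviceToHost);
    }

    cudaFree(d_jobs);
    cudaFree(d_results);
    return err;
}
```

In a real application the device buffers would be allocated once and reused across calls; they are allocated inside `run_batch` here only to keep the sketch self-contained.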

In the current version of LINZIK the lens optimizer runs on the CPU and calls the goal (merit) function on CUDA. The optimizer creates jobs for the ray tracer step by step, so it never has many jobs at the same time.

It’s possible to move the optimizer into the CUDA kernel, but that would mean the LINZIK language interpreter (or at least its virtual machine) must be there too…

The new question.

What do error codes 7 and 9 mean? The “CudaError_enum” type has no explanation for these values.
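One way to look such codes up, sketched under the assumption that they were returned by a runtime API call (type `cudaError_t`) and not by the driver API, whose `cudaError_enum` (`CUresult`) uses different values: ask the runtime for its own description with `cudaGetErrorString`. The numeric values behind each name have changed between toolkit versions, so the text printed depends on the installed runtime. `describe_cuda_error` and `print_codes_from_question` are hypothetical helper names.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: the text the installed runtime associates with a code.
std::string describe_cuda_error(int code) {
    return cudaGetErrorString(static_cast<cudaError_t>(code));
}

// Hypothetical helper: prints the descriptions of the two codes asked about.
void print_codes_from_question() {
    const int codes[] = {7, 9};
    for (int code : codes) {
        std::printf("error %d: %s\n", code, describe_cuda_error(code).c_str());
    }
}
```

Checking the return value of every runtime call, and calling `cudaGetLastError()` right after a kernel launch, helps pin down which call produced the code.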