I plan to code something in OpenCL, using the überKernel pattern.
It means that a given kernel would have this structure:
__kernel void my_uber_kernel(void) {
    /* ... device_function_X call for the current stage ... */
    stage = stage + 1;
}
Each one of the device_function_X calls potentially contains a substantial amount of code.
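For illustration, the structure described above might look like the sketch below. This is only a guess at the shape, not the original poster's code: NUM_STAGES, the data buffer argument and the bodies of device_function_0..2 are placeholders. The shim at the top exists only so that the same text can be tried as ordinary C on the host.

```c
/* Host-side shim: when not built by an OpenCL compiler, drop the qualifiers
   and fake get_global_id() so the sketch compiles as plain C. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t fake_gid = 0;
static size_t get_global_id(unsigned dim) { (void)dim; return fake_gid; }
#endif

#define NUM_STAGES 3   /* hypothetical number of stages */

/* Placeholder stages; in a real uberkernel each one could be large. */
float device_function_0(float x) { return x + 1.0f; }
float device_function_1(float x) { return x * 2.0f; }
float device_function_2(float x) { return x - 3.0f; }

__kernel void my_uber_kernel(__global float *data)
{
    size_t gid = get_global_id(0);
    float value = data[gid];          /* one read from global memory */
    int stage = 0;

    while (stage < NUM_STAGES) {
        switch (stage) {              /* dispatch to the current stage */
        case 0: value = device_function_0(value); break;
        case 1: value = device_function_1(value); break;
        case 2: value = device_function_2(value); break;
        }
        stage = stage + 1;
    }

    data[gid] = value;                /* one write back to global memory */
}
```

In this sketch every work-item walks through the same stages in the same order, so the switch itself does not diverge; the intermediate value never leaves a private variable between stages.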
I’m wondering whether there are known limits on the number of instructions supported (per thread?) before performance is impacted.
Does splitting the processing into small device function calls help with optimization?
Or do I have to split the processing into several kernel calls (so that the above-mentioned device_function_X become kernels)?
Doing this in separate kernel launches adds the overhead of reading from and writing to global memory (hundreds of cycles). The best thing to do here is to compute everything in one kernel, keep temporary results in registers, and write the results just once.
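To make the trade-off concrete, here is a small host-side C emulation (not real OpenCL): each loop iteration stands for one work-item, and the arrays stand for global memory buffers. The names step_a, step_b, launch_a, launch_b and launch_fused are made up for the example.

```c
#include <stddef.h>

/* Hypothetical per-element steps standing in for device functions. */
float step_a(float x) { return x * x; }
float step_b(float x) { return x + 0.5f; }

/* Split version: two "kernel launches". The intermediate result makes a
   round trip through tmp[], which stands for a global memory buffer. */
void launch_a(const float *in, float *tmp, size_t n)
{
    for (size_t i = 0; i < n; ++i)      /* one iteration = one work-item */
        tmp[i] = step_a(in[i]);         /* global read + global write */
}

void launch_b(const float *tmp, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = step_b(tmp[i]);        /* global read + global write */
}

/* Fused version: one "launch". The intermediate stays in a local variable
   (a register on the device), so each item does one read and one write. */
void launch_fused(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float t = step_a(in[i]);        /* global read */
        out[i] = step_b(t);             /* global write */
    }
}
```

Both versions produce the same output, but the split one touches global memory four times per element (two reads, two writes) where the fused one touches it twice.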
The memory overhead is one reason why I chose the überkernel approach, but what if, in the end, the kernel contains something like 10,000 lines of code (all calls inlined)?
The maximum kernel size (the limit is per kernel, not per thread) is 2,000,000 assembly instructions (I don’t think that changed with Fermi).
The thing that you may need to watch is instruction cache pollution. You don’t want too much code inside an if conditional where the block diverges, as that pollutes the instruction cache and can degrade performance. It can also cause issues if you have multiple blocks per multiprocessor and they diverge.
Whether it’s better to split into multiple kernels or use a single überKernel depends on your actual code. Going to global memory is very expensive, generally much more so than instruction cache pollution, but there are exceptions.
And do you know the order of magnitude of the program cache (instruction cache) size?
Something like 64KB?
The instruction cache is in the constant cache. If memory serves, it’s 8KB.
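As a rough back-of-the-envelope check, the 8KB figure can be turned into an instruction count. The helper name below is made up, and the 8 bytes per instruction used in the usage note is an assumption for illustration, not a figure from this thread.

```c
/* Number of whole instructions that fit in a cache of cache_bytes,
   given an assumed fixed encoding size per instruction. */
unsigned long instructions_that_fit(unsigned long cache_bytes,
                                    unsigned long bytes_per_instruction)
{
    return cache_bytes / bytes_per_instruction;
}
```

With 8KB and an assumed 8 bytes per instruction, instructions_that_fit(8UL * 1024UL, 8UL) gives 1024. Under that assumption a kernel of 10,000 inlined lines would be far larger than the cache, so what would matter is how much of that code is actually executed repeatedly.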