Sorry if this is a silly question…
In my OpenCL programme the slowest part of my kernel is copying a const float * array from global to local memory. Each time the kernel runs it requires this array in local memory, so if there were a way to transfer the array to the local memory of each multiprocessor core just once per clEnqueueNDRangeKernel() call I’d have a good speedup.
I’m guessing that this is not possible, and I can think of alternative ways to get around this by rewriting the kernel, but I’m new to OpenCL and just wanted to check…
Thanks in advance!
This is not possible. The contents of local memory only live as long as the work-group that wrote them, and that is a hardware limitation which is probably never going to change (it exists for a very specific reason). A good way to get around this is to combine the work of multiple kernels into a single kernel, so that the data stored in local memory can be reused for multiple operations. This of course assumes that you don’t need the global synchronization that you get between separate kernel launches.
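For what it’s worth, here is a rough sketch of that approach in OpenCL C. It is only an illustration under assumptions, not your actual kernel: the names fused_stages, table, in, out, n and TABLE_SIZE are all made up, the two "stages" are placeholder arithmetic, and the table is assumed to be small enough to fit in local memory with a size known at compile time.

```c
// OpenCL C kernel sketch. All names here are hypothetical.
// TABLE_SIZE is an assumed compile-time constant, e.g. set with
// the build option -DTABLE_SIZE=256.
#ifndef TABLE_SIZE
#define TABLE_SIZE 256
#endif

__kernel void fused_stages(__global const float *table, // shared read-only array
                           __global const float *in,
                           __global float *out,
                           const uint n)
{
    __local float tile[TABLE_SIZE];

    const size_t lid = get_local_id(0);
    const size_t lsz = get_local_size(0);

    // Cooperative copy: each work-item loads a strided slice, so the
    // global-to-local transfer is done once per work-group.
    for (size_t i = lid; i < TABLE_SIZE; i += lsz)
        tile[i] = table[i];

    // Every work-item in the group must reach this barrier, so there is
    // deliberately no early return above it.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each work-item handles several elements, so one local copy of the
    // table serves many elements and both stages.
    for (size_t gid = get_global_id(0); gid < n; gid += get_global_size(0)) {
        const float x = in[gid];

        // Stage 1 (placeholder for what used to be the first kernel).
        const float a = x * tile[gid % TABLE_SIZE];

        // Stage 2 (placeholder for the second kernel) reuses the same tile.
        out[gid] = a + tile[(gid + 1) % TABLE_SIZE];
    }
}
```

Because of the loop over gid, you can launch fewer work-groups than you have elements, and each group then amortizes its one copy of the table over more work. This only works when stage 2 needs nothing from other work-groups’ stage 1 results.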
Thanks for the reply. Out of interest - what is the very specific reason?
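The idea is to be able to support a very large number of work-groups (millions or billions). Each GPU core runs many work-groups one after another and simply overwrites local memory when it starts the next one. If local memory were persistent, the core would somehow have to store all of that memory instead. As a rough illustration, a million work-groups each using 16 KB of local memory would need about 16 GB of storage.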