No, each thread reads a number of values from global memory. There are 13 input-only arrays total (which remain unchanged throughout the execution of the program). Ten of them are pretty small, so I was able to move those into constant memory easily. The remaining 3 are pretty large (sized equally to the I/O arrays), so I am not sure what I can do with those.
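For the three large read-only arrays, one option (assuming a compute-capability 3.5 or later device) is to pass them as `const ... __restrict__` pointers so the compiler can route their loads through the read-only data cache. A minimal sketch — the array names, sizes, and arithmetic are placeholders, not taken from my actual program:

```cuda
// Sketch: one small read-only table in constant memory, plus three
// large read-only arrays hinted for the read-only cache on sm_35+.
#define TABLE_SIZE 256

__constant__ float d_table[TABLE_SIZE];   // stand-in for one of the 10 small arrays

__global__ void compute(const float* __restrict__ bigA,
                        const float* __restrict__ bigB,
                        const float* __restrict__ bigC,
                        float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // const __restrict__ tells the compiler these loads never alias
        // the output, so it may serve them from the read-only cache.
        out[i] = bigA[i] * d_table[i % TABLE_SIZE] + bigB[i] - bigC[i];
    }
}
```

The constant-memory table is filled from the host once, before the first launch, with `cudaMemcpyToSymbol(d_table, h_table, sizeof(h_table))`.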
Are there any tutorials out there on what shared memory does or how to use it effectively? I was starting to look into that yesterday…
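From what I've read so far, the basic pattern is: each block stages a tile of global data into on-chip shared memory, synchronizes, and then the threads reuse the staged values. It only pays off when threads in a block read the same values more than once. A minimal sketch of the staging pattern (illustrative kernel, not my real code):

```cuda
// Each block copies a tile of 'in' into fast shared memory, then every
// thread reads its neighbor's value from the tile instead of from
// global memory.
#define TILE 128

__global__ void tiled_avg(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();   // the whole tile is now visible to every thread in the block

    // Example reuse: average with the neighboring element, fetched from
    // shared memory rather than a second global load.
    float neighbor = tile[(threadIdx.x + 1) % TILE];
    if (i < n) out[i] = 0.5f * (tile[threadIdx.x] + neighbor);
}
```

The shared-memory chapter of the CUDA C Programming Guide covers this pattern (and its bank-conflict pitfalls) in depth.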
Currently, it is. I am working on a way to overcome this issue. My initial testing indicates that I can cut my time in half if I remove the current branching statements.
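The rewrite I am testing replaces data-dependent `if`/`else` with arithmetic selection, so all threads in a warp stay on one path. A toy before/after sketch (both kernels compute the same result):

```cuda
// Before: threads within a warp may take different paths here,
// serializing the warp.
__global__ void with_branch(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] > 0.0f)
        y[i] = 2.0f * x[i];
    else
        y[i] = 0.5f * x[i];
}

// After: the ternary typically compiles to a predicated select
// instruction, not a branch, so the warp never diverges.
__global__ void branchless(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float scale = (x[i] > 0.0f) ? 2.0f : 0.5f;
    y[i] = scale * x[i];
}
```

The bounds check (`i >= n`) stays; that kind of branch only diverges in the last partial warp and is harmless.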
I have played around with the Occupancy calculator. I am not absolutely certain that I have used it correctly, but I chose the block size that seemed appropriate according to the calculator. My experiments with different block sizes seem to bear that out for now.
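As a cross-check on the spreadsheet, newer toolkits (CUDA 6.5 and later) expose the occupancy calculator as a runtime API, which suggests a block size directly from the kernel's compiled resource usage. A sketch, with `myKernel` standing in for the real kernel:

```cuda
#include <cstdio>

__global__ void myKernel(float* data) { /* placeholder kernel body */ }

int main()
{
    int minGridSize = 0, blockSize = 0;
    // Suggests the block size that maximizes theoretical occupancy
    // for this kernel on the current device.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       myKernel, 0 /* dynamic smem */, 0);
    printf("suggested block size: %d\n", blockSize);
    return 0;
}
```

Agreement between this and the spreadsheet-chosen size would confirm the calculator was used correctly.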
I am currently using 14 registers in my kernel. I originally started with 20. I don’t think I can reduce it any further (or at least much) unless I split the kernel into multiple kernels. However, that will increase the number of global memory reads per iteration of the program. I assume this will ultimately slow my program down.
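Two ways to inspect and steer register usage without splitting the kernel: compile with `nvcc -Xptxas -v` to see registers per thread, and use `__launch_bounds__` to let the compiler trade registers for occupancy at a specific launch configuration. A sketch (the bounds values are illustrative):

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) tells the
// compiler the launch shape, so it can budget registers accordingly.
__global__ void __launch_bounds__(256, 4)
my_kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // placeholder body
}
```

One caveat: forcing registers lower (via `__launch_bounds__` or `-maxrregcount`) can make the compiler spill to local memory, and those extra memory accesses may cost more than the occupancy gained.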
Fortunately, I do not use any of these functions. Everything is straightforward multiplication, division, addition, and subtraction.
Any constants are computed prior to the first execution of the kernel and passed in.
Thanks for the comments!