I am developing CUDA code for a scientific application that uses three constant arrays. They are read throughout the code, but I am unsure how to declare and use them for maximum performance. The problem is in polar coordinates, so the constant arrays hold r, sin theta and cos theta.
I assume I am not the first to encounter this problem, so any advice would be greatly appreciated.
For performance, the access pattern is important as well, not just the fact that the data is constant.
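To make the access-pattern point concrete, here is a minimal sketch of two ways a kernel might read a table held in `__constant__` memory. The table size `N_THETA`, the name `c_sin_theta` and both kernels are made up for illustration and are not from the original post. Constant memory is limited to 64 KB and works best when all threads of a warp read the same address at the same time; when they read different addresses, the reads are serialized.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#define N_THETA 256  // hypothetical table size; must fit in the 64 KB constant space

__constant__ float c_sin_theta[N_THETA];  // hypothetical name for the sin(theta) table

// Uniform access: at each loop step every thread of a warp reads the same
// element, so the constant cache can broadcast it to the whole warp.
__global__ void uniform_access(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < N_THETA; ++k)
        acc += c_sin_theta[k];
    out[i] = acc;
}

// Per-thread access: neighbouring threads read different elements, which
// constant memory serializes within a warp.
__global__ void per_thread_access(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = c_sin_theta[i % N_THETA];
}

int main()
{
    // Fill the host table once, then copy it into constant memory.
    float h_sin[N_THETA];
    for (int k = 0; k < N_THETA; ++k)
        h_sin[k] = sinf(2.0f * 3.14159265f * k / N_THETA);
    cudaMemcpyToSymbol(c_sin_theta, h_sin, sizeof(h_sin));

    const int n = 1 << 20;
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    uniform_access<<<(n + 255) / 256, 256>>>(d_out, n);
    per_thread_access<<<(n + 255) / 256, 256>>>(d_out, n);

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

If your kernels look more like `per_thread_access`, a plain global array read through the regular caches may well do better than `__constant__`; only a measurement on your hardware will tell.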
Why don’t you try out all three methods and choose the fastest?
There are even more ways. You could load the coefficients into local variables, hardcode them into your instructions, put them into global memory and hope for the L1 cache, or calculate them every time you use them. Besides the access pattern, the size of the arrays is important. Also, do all threads of a warp, or of several warps, access the same variables? Do you use the Tensor Cores? Perhaps the decision does not even affect the performance of your program because the bottleneck is elsewhere.
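As a rough sketch of how one might "try and measure", the program below times the same trivial kernel twice with CUDA events: once reading the table from `__constant__` memory and once from a global array marked `const __restrict__`. It covers only two of the options mentioned in this thread. The names `c_r`, `scale_const`, `scale_global` and the sizes are assumptions chosen for illustration, and the kernel is far simpler than a real application, so treat the numbers as a starting point only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N_R 1024  // hypothetical length of the r table

__constant__ float c_r[N_R];  // hypothetical constant-memory copy of r

// Variant 1: read r from constant memory.
__global__ void scale_const(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * c_r[i % N_R];
}

// Variant 2: read r from global memory; const + __restrict__ lets the
// compiler route the loads through the read-only data path where available.
__global__ void scale_global(const float *__restrict__ r, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * r[i % N_R];
}

int main()
{
    float h_r[N_R];
    for (int k = 0; k < N_R; ++k) h_r[k] = 0.01f * k;

    const int n = 1 << 24;
    float *d_r, *d_out;
    cudaMalloc(&d_r, sizeof(h_r));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_r, h_r, sizeof(h_r), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_r, h_r, sizeof(h_r));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms_const = 0.0f, ms_global = 0.0f;
    dim3 block(256), grid((n + 255) / 256);

    // Warm-up launches so one-time startup costs are not measured.
    scale_const<<<grid, block>>>(d_out, n);
    scale_global<<<grid, block>>>(d_r, d_out, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    scale_const<<<grid, block>>>(d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_const, start, stop);

    cudaEventRecord(start);
    scale_global<<<grid, block>>>(d_r, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_global, start, stop);

    printf("constant memory: %.3f ms\n", ms_const);
    printf("global memory:   %.3f ms\n", ms_global);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_r);
    cudaFree(d_out);
    return cudaGetLastError() == cudaSuccess ? 0 : 1;
}
```

If the two timings are close, or if a profiler shows that the real kernels are limited by something other than these table reads, the choice of memory space probably does not matter much for your code.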