Reassigning array elements inside kernel

I have a basic question that’s been bugging me.

Say I have an array A that I pass to a kernel, and within each thread I need to do a few calculations with one element of that array (unique to each thread). Is it better to

(1) do each calculation with A[i] explicitly in the equations
or
(2) assign a new variable “float Ai = A[i];” within each thread, then do each calculation with Ai

I could see (1) fetching A[i] from memory each time it’s needed, increasing data transfer time, and I could see (2) defining an unnecessary extra variable that takes up extra space. Does the compiler optimize this away? Or is one way better than the other?

Thank you!

I think no matter what, you’re caching the data. I’m not sure about register usage though. I think declaring float Ai would put the number in the registers vs a raw read which would put it in the cache.

I’m not sure though. But that’s my assumption. And I think registers would be faster than cache.

The compiler will keep the value in a register in both versions if there’s room. If you think there’s not going to be room, then I’d transfer the array to shared memory and work from there. You can always inspect the SASS (cuobjdump -sass) to see what the compiler chooses. Also keep in mind that the more registers each thread uses, the lower your occupancy will be.
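To make the comparison concrete, here is a minimal sketch of both variants. The kernel names, the arithmetic, the `out` buffer, and the `max_abs_diff` helper are all made up for illustration. With `const float* __restrict__` on the input, the compiler is free to load A[i] once and keep it in a register in either version. Without that hint, a store through another pointer between the reads could force a reload in the explicit version, because the compiler can't rule out aliasing.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Variant (1): A[i] written out explicitly in every expression.
__global__ void explicit_reads(const float* __restrict__ A,
                               float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s = A[i] * A[i];
        float t = 2.0f * A[i] + 1.0f;
        out[i] = s + t * A[i];
    }
}

// Variant (2): copy A[i] into a local variable first, then use the copy.
__global__ void local_copy(const float* __restrict__ A,
                           float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float Ai = A[i];
        float s = Ai * Ai;
        float t = 2.0f * Ai + 1.0f;
        out[i] = s + t * Ai;
    }
}

// Hypothetical helper: runs both kernels on the same input and returns the
// largest absolute difference between their outputs.
// Error checking on the CUDA calls is omitted to keep the sketch short.
float max_abs_diff(int n)
{
    std::vector<float> hA(n), h1(n), h2(n);
    for (int i = 0; i < n; ++i) hA[i] = 0.001f * i;

    size_t bytes = n * sizeof(float);
    float *dA, *d1, *d2;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&d1, bytes);
    cudaMalloc(&d2, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    explicit_reads<<<blocks, threads>>>(dA, d1, n);
    local_copy<<<blocks, threads>>>(dA, d2, n);

    // These copies run after the kernels on the default stream.
    cudaMemcpy(h1.data(), d1, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h2.data(), d2, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(d1);
    cudaFree(d2);

    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = std::fmax(worst, std::fabs(h1[i] - h2[i]));
    return worst;
}

int main()
{
    printf("max |diff| between variants: %g\n", max_abs_diff(1 << 20));
    return 0;
}
```

Building with `nvcc -Xptxas -v` prints the register count per kernel, and running `cuobjdump -sass` on the resulting binary shows whether the load appears once or several times. I'd expect the two kernels to come out the same, but it's worth checking on your own compiler and architecture.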

But high throughput has been shown to be achievable at low occupancy.

There’s a Berkeley slideshow that demonstrates this (http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf) and depending upon the task, it makes sense. That’s how I even knew about register space in the first place XD

I am aware of that, as I’ve written kernels using up to 128 registers that perform at 95% of theoretical throughput. Achieving this level of instruction-level parallelism takes some pretty in-depth knowledge of the hardware though. But thread-level parallelism isn’t something you want to discount too much. This hardware really shines when you can achieve both.

That is, whether you hide memory latency by executing instructions from the current thread or from some other thread isn’t that important. What is important is that some instructions are ready to issue whenever there is latency to hide. If you have two pools to draw from… that’s all the better.
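As a rough illustration of drawing from both pools, here is a sketch in the spirit of the slides linked above: each thread handles several independent elements, so all of their loads can be in flight before the first result is needed, while other warps still provide thread-level parallelism. The kernel name, `ELEMS_PER_THREAD`, the arithmetic, and the `count_mismatches` helper are placeholders of mine, not taken from the talk.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Assumed value for illustration; the best choice depends on the kernel and GPU.
constexpr int ELEMS_PER_THREAD = 4;

__global__ void scale_ilp(const float* __restrict__ in,
                          float* __restrict__ out, int n)
{
    // Each block covers blockDim.x * ELEMS_PER_THREAD consecutive elements.
    // A thread's elements are blockDim.x apart, so accesses stay coalesced.
    int base = blockIdx.x * blockDim.x * ELEMS_PER_THREAD + threadIdx.x;

    // Small fixed-size array; with the loops unrolled the compiler will
    // normally keep it in registers.
    float v[ELEMS_PER_THREAD];

    // Issue all loads first. They do not depend on each other, so the
    // hardware can overlap their latencies within this one thread.
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int i = base + k * blockDim.x;
        v[k] = (i < n) ? in[i] : 0.0f;
    }

    // Then do the arithmetic and the stores.
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int i = base + k * blockDim.x;
        if (i < n) out[i] = 2.0f * v[k] + 1.0f;
    }
}

// Hypothetical helper: runs the kernel and counts elements that differ from
// a host-side reference. Error checking is omitted for brevity.
int count_mismatches(int n)
{
    std::vector<float> hIn(n), hOut(n);
    for (int i = 0; i < n; ++i) hIn[i] = static_cast<float>(i % 1000);

    size_t bytes = n * sizeof(float);
    float *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int perBlock = threads * ELEMS_PER_THREAD;
    int blocks = (n + perBlock - 1) / perBlock;
    scale_ilp<<<blocks, threads>>>(dIn, dOut, n);

    cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (std::fabs(hOut[i] - (2.0f * hIn[i] + 1.0f)) > 1e-4f) ++bad;
    return bad;
}

int main()
{
    printf("mismatches: %d\n", count_mismatches(1 << 20));
    return 0;
}
```

The trade-off is the one discussed above: a larger `ELEMS_PER_THREAD` means more registers per thread and fewer resident threads, so whether it helps is something to measure rather than assume.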