Prefetching vs. bank conflicts vs. float operators

Hello,

I’m trying to find the best way to access shared memory in a particular problem.

Consider this code:

if (a < 0.f)
   res = shared_mem[idx + 1];
else
   res = shared_mem[idx];

*** for reasons specific to my problem, I don’t care about the value of res when a == 0.f ***

Then I can translate it into:

res = a < 0.f ? shared_mem[idx + 1] : shared_mem[idx];

It looks neat, but in this case, could a possible bank conflict affect instruction prefetching?

I could also do this:

int newidx = idx + (a < 0.f);

res = shared_mem[newidx];

This also has bank conflicts for some values, but is it preferable to the “? :” operator?

Finally, one last possibility:

float fakebool = 1.f * (a < 0.f);

res = fakebool * shared_mem[idx + 1] + (1.f - fakebool) * shared_mem[idx];

Here there are no bank conflicts (I guess!), but it costs two “x”, one “+” and one “-”.

I’d really be interested in your thoughts on these possibilities: what would you do, with and without assuming good data “locality” (i.e. assuming the values of ‘a’ within each half-warp “often” share the same sign)?

Thank you very much
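
For reference, the four variants from the question can be put side by side. This is only a host-side sketch in plain C++ (so it runs without a GPU), with a std::vector standing in for shared_mem; the function names are hypothetical and do not come from the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Variant 1: explicit branch.
float select_branch(const std::vector<float>& mem, std::size_t idx, float a) {
    assert(idx + 1 < mem.size());  // both candidate elements must exist
    if (a < 0.f)
        return mem[idx + 1];
    return mem[idx];
}

// Variant 2: ternary operator.
float select_ternary(const std::vector<float>& mem, std::size_t idx, float a) {
    assert(idx + 1 < mem.size());
    return a < 0.f ? mem[idx + 1] : mem[idx];
}

// Variant 3: fold the comparison into the index (bool converts to 0 or 1).
float select_index(const std::vector<float>& mem, std::size_t idx, float a) {
    assert(idx + 1 < mem.size());
    std::size_t newidx = idx + (a < 0.f);
    return mem[newidx];
}

// Variant 4: read both elements and blend them arithmetically.
float select_blend(const std::vector<float>& mem, std::size_t idx, float a) {
    assert(idx + 1 < mem.size());
    float fakebool = 1.f * (a < 0.f);  // 1.f when a is negative, else 0.f
    return fakebool * mem[idx + 1] + (1.f - fakebool) * mem[idx];
}
```

For finite data all four return the same value. Variant 4 always reads both elements, so if the element that is not selected holds a NaN or an infinity, multiplying it by zero still yields NaN and the result differs from the other three.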

You’re likely worrying too much about bank conflicts. They tend to be very minor and sometimes unavoidable anyway.

In particular, none of your possibilities are immune to bank conflicts. Bank conflicts happen when different threads try to read different addresses on the same bank. Here you have a per-thread idx value which could be anything, so you can’t avoid potential conflicts anyway.

If you happened to know that idx was equal to the thread index (common!), then your examples 1, 2 and 4 above have no bank conflicts, but #3 could potentially hit them.
But again, I stress that this is such a minor effect that you’re likely worrying too much. Even your case #3 will hit them, but they will be resolved in just one more clock tick, since your index delta is constant and any thread that missed its read the first time would be clear for the second.
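
The effect of a constant index delta can be checked with a small host-side model. This is only a sketch in plain C++ (no GPU needed). It assumes 16 banks of 32-bit words and a 16-thread half-warp, which the thread does not state, and it simply counts distinct words per bank while ignoring any hardware broadcast rules. conflict_degree and shifted_indices are hypothetical helper names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <set>

constexpr std::size_t kBanks = 16;     // assumption: 16 shared-memory banks
constexpr std::size_t kHalfWarp = 16;  // assumption: 16 threads per half-warp

// Largest number of distinct words requested from any single bank.
// 1 means no conflict in this model; 2 means a two-way conflict, and so on.
std::size_t conflict_degree(const std::array<std::size_t, kHalfWarp>& word_index) {
    std::array<std::set<std::size_t>, kBanks> per_bank;
    for (std::size_t w : word_index)
        per_bank[w % kBanks].insert(w);  // same word twice is counted once
    std::size_t worst = 0;
    for (const auto& words : per_bank)
        worst = std::max(worst, words.size());
    assert(worst >= 1);  // a half-warp always touches at least one word
    return worst;
}

// Word index per thread when idx equals the thread ID and the index is
// shifted by one wherever the thread's value of a is negative.
std::array<std::size_t, kHalfWarp> shifted_indices(
        const std::array<bool, kHalfWarp>& negative) {
    std::array<std::size_t, kHalfWarp> out{};
    for (std::size_t t = 0; t < kHalfWarp; ++t)
        out[t] = t + (negative[t] ? 1 : 0);
    return out;
}
```

Under these assumptions, a half-warp where no thread or every thread shifts has degree 1. The only two-way case is when the last thread shifts to word 16 while thread 0 stays on word 0, since both words map to bank 0. The degree never exceeds 2.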

Yep, SPWorley hit it right on the head. Also, #2 and #3 should get translated into the same PTX instructions (probably something like “mov.f32”, “setp.le.f32”, “selp.s32”, “add.s32”, “mov.s32”…this last one depends on the type of “res”).

Thank you very much for your advice; that’s exactly what I wanted!

I just forgot to mention that the ‘idx’ values are the thread IDs, so there are no bank conflicts except in cases where there is a non-uniform index shift within a half-warp. You say that the #2 case is bank-conflict free, and I can’t figure out why! What is the exact mechanism of instruction prefetching? I’m desperate to find any technical CUDA-related paper on this…

Also, profquail and you seem to disagree about that case. I’d tend to agree with profquail, but I’m not sure what to think.

Anyway, thanks again to you both for your valuable help. I’ll go with the #3 solution (I use this ‘newidx’ throughout my kernel, and a quick benchmark shows a small performance gain over all the other solutions).

I thought I could add this one to your nice list of variations:

res = shared_mem[idx + (a < 0.f)];
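
This inline form relies on the comparison result converting to the integer 0 or 1 before it is added to the index. A minimal host-side C++ sketch of just that index computation follows; shifted_index and demo_shift are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Index that would be used for the shared memory read:
// idx + 1 when a is negative, idx otherwise.
std::size_t shifted_index(std::size_t idx, float a) {
    return idx + (a < 0.f);  // bool converts to 0 or 1
}

// A few sample values, including both signed zeros.
void demo_shift() {
    assert(shifted_index(5, -1.f) == 6);  // negative: shifted
    assert(shifted_index(5, 3.f) == 5);   // positive: not shifted
    assert(shifted_index(5, 0.f) == 5);   // zero: not shifted
    assert(shifted_index(5, -0.f) == 5);  // negative zero is not < 0
}
```

Note that negative zero compares equal to zero, so it takes the same path as a == 0.f.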
