mis-compile causes small perf issue in 7.5 with global reads

this:
global void kernel(const int dstcnt,const int2 * dstactive,…)
{
for (int activationIdId = threadIdx.x; activationIdId < dstcnt; activationIdId += blockDim.x)
{
int2 activationId = dstactive[activationIdId];
int cid = activationId.x >> 16;
activationId.x &= 0xffff;
activationId.y &= 0xffff;

generates 2 global reads for activationId - one 16bit, one 32bit. Apparently it tries to optimize activationId.y &=0xffff away, but one 64bit read is faster since the buffer is well-aligned.

Perhaps you can override this behavior with some inline ptx? Or perhaps not if ptxas is making this optimization. Maybe find some way to use the high bits of y so the vector load is preserved.

The difference in speed may be small since most of the overhead would be in pulling the data to the texture cache and both versions would make the same transactions. Though you could look at the sass and confirm if LDG.E.CI is being used here (cache incoherent).

I notice you already have dstactive marked const. You don’t mention which device you are compiling for.

If you mark it const restrict and compile for cc3.5 or higher, I believe cuda 7.5 will do a single 64-bit LDG load.

Otherwise if the perf difference is important, I would file a bug with a demonstrator showing the perf difference.

thanks. I can work around the issue. It’s about 5us difference for the whole kernel launch, but it’s visible for a 30us kernel. device was a 970 - I’ll go look for a bugreport form now.

@MrVlad, I’ve seen this behavior as well (here).

A workaround might be defining a union type of a u64, int2 and a short4 or your own vector and bit types.

After loading/initializing the union, the cid would be the first u16 and you could mask the low u16s in each int with a 64 bit mask applied to the u64 member.

That might be enough… although ptxas is very aggressive.

If it’s not enough you could try masking first (48 bits) or just don’t mask at all and use the union members.

Worst case, throw in a single instruction dependency on the hi y bits.

Just make sure you inspect the SASS with cuobjdump or nvdisasm. :)