Problem with Tesla D870 compute capability

Hi all,

I have a Tesla D870 (2x C870) GPU. Unfortunately its compute capability is 1.0, and I badly need the atomicXor function. This is the snippet of code I want to write in CUDA:

for (i = 0; i < 4; i++)
{
    t   = state[0][i];
    Tmp = state[0][i] ^ state[1][i] ^ state[2][i] ^ state[3][i];

    Tm = state[0][i] ^ state[1][i];
    Tm = xtime(Tm);
    state[0][i] ^= Tm ^ Tmp;

    Tm = state[1][i] ^ state[2][i];
    Tm = xtime(Tm);
    state[1][i] ^= Tm ^ Tmp;

    Tm = state[2][i] ^ state[3][i];
    Tm = xtime(Tm);
    state[2][i] ^= Tm ^ Tmp;

    Tm = state[3][i] ^ t;
    Tm = xtime(Tm);
    state[3][i] ^= Tm ^ Tmp;
}

int xtime(int x)
{
    /* Multiply by x in GF(2^8), reduced mod x^8 + x^4 + x^3 + x + 1;
       the original version discarded the result and returned x unchanged. */
    x = ((x << 1) ^ (((x >> 7) & 1) * 0x1b)) & 0xFF;
    return x;
}

Here ^ is XOR, and >> and << are right and left shifts. I'm unable to do this myself because my GPU doesn't support atomic functions. Can anyone translate this to CUDA code? I'll then integrate it back into my full project. Hoping for favorable replies.

Thanks in advance!

As a check: initially, if state[0][0] = 7D h, state[1][0] = 2B h, state[2][0] = 30 h, state[3][0] = 67 h,

then after one round of the for loop it should give state[0][0] = D0 h, state[1][0] = 1C h, state[2][0] = 9F h, state[3][0] = 52 h.

PS : All the values are in hex.

Hi all,

The above code is a small part of the ADVANCED ENCRYPTION STANDARD. Apart from this I've completed the full implementation. If anyone is interested in helping me out with this part, let me know. I'm ready to share my project.

You can create mutexes and atomic ops in shared memory, even on G80, by using the write guarantees of CUDA (when multiple threads write to the same location, ONE write is guaranteed to succeed). I posted the method and code here many months back.

However, global memory atomics are trickier, since there's no synchronization across blocks except for ending the kernel and starting a new one.
You can sometimes assign unique global memory ranges to BLOCKS and then at least use block-wide atomics via shared memory.

In general, atomics tend to be used for data management, not computation: things like requesting jobs from and reporting results to a queue that many threads may be trying to read or write. For your code snippet above, though, it's very unclear what you're trying to do or why you'd need atomics to do it.

Thanks for responding Worley!

The above code is MixColumns, a part of the AES standard. I tried it on the CPU and it works perfectly fine, so I wanted to translate the same code to the GPU. But when I used the built-in function for XOR, I got an error saying the Tesla D870 doesn't support atomic XOR functions.

Now my question is: is there any other way I can implement BITWISE XOR efficiently? Any idea/lead would help me a lot.

Waiting for your replies…

umm… XOR and atomic XOR are completely different things…

XOR is supported on G80, and the ^ operator will do the same thing in CUDA as it does in your normal C compiler.
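To illustrate the point: MixColumns needs no atomics at all if each thread owns one complete AES state, because no two threads ever touch the same bytes; plain ^ works in device code exactly as on the CPU. A hypothetical kernel sketch (the kernel name, layout of 16 bytes per state, and launch indexing are my assumptions, not from the original project):

```cuda
// Hypothetical sketch: one thread per AES state, plain ^ in device code.
// No atomics needed, since no two threads write the same state.
__device__ int xtime(int x)
{
    return ((x << 1) ^ (((x >> 7) & 1) * 0x1b)) & 0xFF;
}

__global__ void mix_columns_kernel(unsigned char *states, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    unsigned char *s = states + idx * 16;  // assumed: 16 bytes per state, row r column i at s[r*4 + i]
    for (int i = 0; i < 4; i++) {
        int t   = s[0*4 + i];
        int Tmp = s[0*4 + i] ^ s[1*4 + i] ^ s[2*4 + i] ^ s[3*4 + i];
        int Tm;
        Tm = s[0*4 + i] ^ s[1*4 + i];  s[0*4 + i] ^= xtime(Tm) ^ Tmp;
        Tm = s[1*4 + i] ^ s[2*4 + i];  s[1*4 + i] ^= xtime(Tm) ^ Tmp;
        Tm = s[2*4 + i] ^ s[3*4 + i];  s[2*4 + i] ^= xtime(Tm) ^ Tmp;
        Tm = s[3*4 + i] ^ t;           s[3*4 + i] ^= xtime(Tm) ^ Tmp;
    }
}
```

Atomic XOR (atomicXor) would only matter if several threads had to XOR into the same memory location concurrently, which MixColumns never does.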