Unrolled loop does not work

Hi all!!!
I’ve written a simple script using Python and PyOpenCL that implements AES-128 encryption in ECB mode. The first part of the algorithm (the Rijndael key schedule) was initially implemented inside a for loop. As an optimization (possibly unnecessary, since the compiler probably unrolls the loop automatically), I’ve unrolled the loop manually. Here’s the code:


Lines 88 to 122 contain the original loop (commented out), followed by the same loop unrolled. I’m working with a work group size of 256, and in each work group only the first work item executes the loop. Since the result of the computation is written to local memory, I’ve put a barrier at the end of the loop to ensure local memory consistency among work items. I’m pretty sure the implementation is correct: I get the correct result when I run the code on the CPU (using the AMD SDK). With the OpenCL libraries provided by the CUDA toolkit, the encryption is correct with the for loop but fails when the loop is unrolled. I have no idea what I’m doing wrong. Do you have any suggestions?
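One way to narrow this down is to compute the expected round keys on the host and compare them with what the kernel leaves in local memory. What follows is only a sketch under assumptions: the kernel itself is not shown here, so the function names (expand_key_loop, expand_key_unrolled) are made up for illustration, and the word layout (big-endian 32-bit words, as in FIPS-197) may not match the one used on the device.

```python
# Host-side reference for the AES-128 key schedule (FIPS-197 layout).
# Hypothetical helper names; not taken from the original kernel.

def _build_sbox():
    """Build the AES S-box from GF(2^8) inverses plus the affine transform."""
    exp, log = [0] * 256, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        # Multiply x by the generator 3: x*3 = x*2 XOR x (mod 0x11B).
        x ^= (x << 1) ^ (0x11B if x & 0x80 else 0)
    sbox = [0] * 256
    for v in range(256):
        inv = 0 if v == 0 else exp[(255 - log[v]) % 255]
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox[v] = s ^ 0x63
    return sbox

SBOX = _build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def sub_word(w):
    """Apply the S-box to each byte of a 32-bit word."""
    return (SBOX[w >> 24] << 24 | SBOX[(w >> 16) & 0xFF] << 16
            | SBOX[(w >> 8) & 0xFF] << 8 | SBOX[w & 0xFF])

def rot_word(w):
    """Rotate a 32-bit word left by one byte."""
    return ((w << 8) & 0xFFFFFFFF) | (w >> 24)

def _key_words(key):
    return [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]

def expand_key_loop(key):
    """Loop form: one word per iteration, 44 words in total."""
    w = _key_words(key)
    for i in range(4, 44):
        t = w[i - 1]
        if i % 4 == 0:
            t = sub_word(rot_word(t)) ^ (RCON[i // 4 - 1] << 24)
        w.append(w[i - 4] ^ t)
    return w

def expand_key_unrolled(key):
    """Same schedule with the four words of each round written out."""
    w = _key_words(key)
    for r in range(10):
        b = 4 * r
        w0 = w[b] ^ sub_word(rot_word(w[b + 3])) ^ (RCON[r] << 24)
        w1 = w[b + 1] ^ w0
        w2 = w[b + 2] ^ w1
        w3 = w[b + 3] ^ w2
        w += [w0, w1, w2, w3]
    return w

key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")  # FIPS-197 example key
assert expand_key_loop(key) == expand_key_unrolled(key)
```

If the kernel is changed temporarily to copy its round keys to a global buffer, the first word that differs from this reference should point at the unrolled statement that misbehaves on the GPU.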

Some info about my dev platform:

Fedora 11, kernel
CUDA Toolkit 4.1
Dev drivers 285.05.33
NVIDIA GTX 260 (216-core model)


Nothing immediately jumps out as wrong, but what happens if you use #pragma unroll instead of unrolling the loop manually?
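For reference, here is roughly what that could look like from PyOpenCL. This is a hypothetical kernel, not the one from the post: the name schedule_sketch and the buffer arguments are invented, and the recurrence inside the loop is a placeholder rather than the real key schedule step. It assumes the compiler honors #pragma unroll, which is a vendor hint that other compilers may simply ignore. The barrier sits outside the if so that every work item in the group reaches it.

```python
# Hypothetical kernel source illustrating "#pragma unroll" plus a barrier
# placed outside the single-work-item branch. Not the poster's code.
KERNEL_SRC = r"""
#define NUM_WORDS 44

__kernel void schedule_sketch(__global const uint *key, __global uint *out)
{
    __local uint w[NUM_WORDS];
    const size_t lid = get_local_id(0);

    if (lid == 0) {
        for (int i = 0; i < 4; ++i)
            w[i] = key[i];

        /* Ask the compiler to unroll instead of unrolling by hand. */
        #pragma unroll
        for (int i = 4; i < NUM_WORDS; ++i)
            w[i] = w[i - 4] ^ w[i - 1];  /* placeholder, not the AES step */
    }

    /* Outside the branch: all work items in the group must reach it. */
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid < NUM_WORDS)
        out[get_group_id(0) * NUM_WORDS + lid] = w[lid];
}
"""

def build_program(ctx):
    """Compile the sketch for an existing pyopencl.Context."""
    import pyopencl as cl  # imported lazily so the source can be inspected without a device
    return cl.Program(ctx, KERNEL_SRC).build()
```

If the pragma version gives correct results where the hand-unrolled one does not, the difference between the two is the place to look, for example a mistyped index in one of the unrolled statements.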