Two symmetric kernel organisations with vastly different performance Switch the role of thread and

Running on a Tesla C2050 CUDA card, compute capability 2.0,
with 3,220,897,792 bytes of global memory.
14 multiprocessors (MP).
Shared memory: 49152 bytes per MP.

Mystery: Why is the Alternate organization 30 times slower than the Regular organization?

Regular organization:
Each kernel launch uses 1024 blocks and 1024 threads per block.
Each thread block is assigned a particular pol, of the 1024 of them.
Each thread is assigned a particular scn, of the 1024 of them.

Each kernel launch produces output array O.
The output arrays O are keyed by {pol,scn,i} i={0,1,2}
so O is 1024 * 1024 *3 doubles in size and is kept in mapped memory.

Pol and scn both have associated arrays.
The 4 scn arrays are in GPU device memory, loaded once at the beginning of the run and kept in place for all pol.
The 2 pol arrays are in mapped memory.
All arrays, pol and scn, are 1440 doubles in length and indexed by mo, mo = 1..1440.
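
The regular mapping described above can be sketched roughly as follows. This is an illustration, not the actual kernel: the names regular_kernel, pol_a, scn_a and o_index are made up, only one pol array and one scn array are shown, the [index][mo] and [pol][scn][i] layouts are assumptions, the multiply-and-sum is placeholder work, and mo runs from 0 here instead of 1.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

#define N_MO  1440   // doubles per array (from the post)
#define N_SCN 1024   // scns, one per thread in this organization

// Flat index into O, keyed by {pol, scn, i}. The [pol][scn][i] order is an assumption.
__host__ __device__ inline size_t o_index(int pol, int scn, int i)
{
    return ((size_t)pol * N_SCN + scn) * 3 + i;
}

// Regular organization: one block per pol, one thread per scn.
// Intended launch: regular_kernel<<<1024, 1024>>>(pol_a, scn_a, O);
__global__ void regular_kernel(const double* pol_a,  // mapped memory
                               const double* scn_a,  // device memory
                               double* O)            // mapped memory
{
    int pol = blockIdx.x;
    int scn = threadIdx.x;

    double acc = 0.0;
    for (int mo = 0; mo < N_MO; ++mo) {
        // All threads of a block read the same pol_a element at each step.
        acc += pol_a[(size_t)pol * N_MO + mo] * scn_a[(size_t)scn * N_MO + mo];
    }

    // Each thread writes its three doubles; neighbouring threads are 3 doubles apart.
    for (int i = 0; i < 3; ++i)
        O[o_index(pol, scn, i)] = acc;  // placeholder result
}
```

Under these assumed layouts, the mapped-memory traffic per block is one pol element per step, shared by every thread in the block.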

Alternate organization: inverts the (pol,scn) roles of thread block and thread within a thread block.
Same size kernel launch: 1024 blocks, and each thread block is 1024 threads in size.
Each thread block does a scn (scenario)
and each thread does a pol.
Assignment of scn and pol arrays by memory type remains the same.
The output O is exactly the same.
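
A rough sketch of the alternate mapping, to make the change in access pattern concrete. It is not the actual kernel: the names alternate_kernel, pol_a, scn_a, pol_index and o_index are made up, only one pol array and one scn array are shown, the [pol][mo], [scn][mo] and [pol][scn][i] layouts are assumptions, the arithmetic is a placeholder, and mo runs from 0 here.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

#define N_MO  1440   // doubles per array (from the post)
#define N_SCN 1024   // scns, one per block in this organization

// ASSUMED layouts: pol_a as [pol][mo], O as [pol][scn][i].
__host__ __device__ inline size_t pol_index(int pol, int mo)
{
    return (size_t)pol * N_MO + mo;
}

__host__ __device__ inline size_t o_index(int pol, int scn, int i)
{
    return ((size_t)pol * N_SCN + scn) * 3 + i;
}

// Alternate organization: one block per scn, one thread per pol.
// Intended launch: alternate_kernel<<<1024, 1024>>>(pol_a, scn_a, O);
__global__ void alternate_kernel(const double* pol_a,  // mapped memory
                                 const double* scn_a,  // device memory
                                 double* O)            // mapped memory
{
    int scn = blockIdx.x;
    int pol = threadIdx.x;

    double acc = 0.0;
    for (int mo = 0; mo < N_MO; ++mo) {
        // Neighbouring threads read pol_a elements N_MO doubles apart.
        acc += pol_a[pol_index(pol, mo)] * scn_a[(size_t)scn * N_MO + mo];
    }

    // Neighbouring threads write N_SCN * 3 doubles apart.
    for (int i = 0; i < 3; ++i)
        O[o_index(pol, scn, i)] = acc;  // placeholder result
}
```

If the data really is laid out this way, adjacent threads in a warp touch mapped-memory addresses 1440 doubles apart on reads and 3072 doubles apart on writes, so each thread needs its own memory transaction. That might account for a large part of the slowdown, but it depends on the layout assumption.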

(1) The alternate organization is 30 times slower for some reason.
(2) NO communication between different blocks in either organization.
(3) NO communication between threads within a particular thread block in either organization.
(4) Processing in the kernel thumbs through all array elements, scn and pol, by mo.
(5) All arrays, scn and pol, are read-only. Only the output array O is written to by all CUDA threads.
(6) Each CUDA thread (bid,tid) in the kernel launch is responsible for writing only three
doubles in O: O[i], i={0,1,2}.

To fix (1), I could try to move some of the pol arrays into shared memory, and because
this kind of memory is faster than the mapped memory the pol arrays live in,
maybe this could help, but I don't really need anything shared.
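
A quick size check on the shared memory idea, using only the figures quoted in this post. The helper name array_bytes is made up, and the check assumes each pol owns 2 arrays and each scn owns 4 arrays of 1440 doubles.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

#define N_MO          1440   // doubles per array (from the post)
#define N_THREADS     1024   // threads per block (from the post)
#define SHARED_PER_MP 49152  // bytes of shared memory per MP (from the post)

// Bytes needed to hold n_arrays arrays of N_MO doubles each.
__host__ __device__ inline size_t array_bytes(size_t n_arrays)
{
    return n_arrays * N_MO * sizeof(double);
}

// array_bytes(2)             = 23040 bytes:    the 2 pol arrays of one pol
// array_bytes(4)             = 46080 bytes:    the 4 scn arrays of one scn
// array_bytes(2 * N_THREADS) = 23592960 bytes: pol arrays for all 1024 pols
```

So in the alternate organization, where every thread of a block works on a different pol, the pol arrays a block needs are far larger than 49152 bytes. The arrays of a single pol, or of a single scn, would fit.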

I think the reason is the very inferior memory performance of mapped memory compared to device memory.
I will change the memory used in the alternate organization to device memory and see if this is what's
going on. I hope it's that simple.
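
A minimal sketch of that change, assuming the pol data is reachable through an ordinary host pointer. The function name stage_on_device and the variable names in the usage note are made up; cudaMalloc, cudaMemcpy and cudaFree are the standard CUDA runtime calls.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Copy n doubles from a host buffer (for example a mapped pol array) into a
// newly allocated device buffer. Returns NULL on failure.
// The caller releases the buffer with cudaFree.
double* stage_on_device(const double* h_src, size_t n)
{
    double* d_dst = NULL;
    if (cudaMalloc((void**)&d_dst, n * sizeof(double)) != cudaSuccess)
        return NULL;

    if (cudaMemcpy(d_dst, h_src, n * sizeof(double),
                   cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d_dst);
        return NULL;
    }
    return d_dst;
}
```

Usage would be something like double* d_pol = stage_on_device(h_pol, 1024 * 1440); and then passing d_pol to the kernel in place of the mapped pointer. The output O could be handled the same way in reverse, with a device buffer copied back using cudaMemcpyDeviceToHost after the launch.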