This is the problem:
120,000 doubles * 8 bytes/double = 960,000 bytes = over 900 Kbytes
The amount of local memory available per thread is limited, and the limit depends on the GPU:
For cc1.x devices it is 16 Kbytes per thread. For cc2.x and newer devices it is 512 Kbytes per thread. Based on the exact error you are receiving, it looks like you are compiling for a cc1.x device. Either way, 960,000 bytes exceeds both limits, so you'll need to reduce the size of that GU22 declaration. A straightforward approach would be to allocate this array in global memory instead of local memory.
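As a rough sketch of the global-memory approach (not your actual code): the host allocates one scratch buffer with cudaMalloc, and each thread takes its own 120,000-element slice of it in place of the per-thread GU22 array. The kernel name, the launch size of 64 threads, and the zero-fill loop are placeholders I made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Per-thread element count, taken from the 120,000-double figure above.
#define N_PER_THREAD 120000

// Hypothetical kernel: each thread uses its own slice of a global-memory
// scratch buffer instead of declaring a 120,000-element local array.
__global__ void scratchKernel(double *scratch, int nThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nThreads) return;   // threads past the end do nothing

    // This thread's private slice (stands in for the local GU22 array).
    double *GU22 = scratch + (size_t)tid * N_PER_THREAD;
    for (int k = 0; k < N_PER_THREAD; ++k)
        GU22[k] = 0.0;             // placeholder for the real work
}

int main()
{
    const int nThreads = 64;       // assumed small launch, for illustration only
    const int block = 32;
    const size_t bytes = (size_t)nThreads * N_PER_THREAD * sizeof(double);

    double *scratch = NULL;
    cudaError_t err = cudaMalloc((void **)&scratch, bytes);
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    scratchKernel<<<(nThreads + block - 1) / block, block>>>(scratch, nThreads);
    err = cudaDeviceSynchronize(); // also surfaces launch/execution errors
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(scratch);
    return err == cudaSuccess ? 0 : 1;
}
```

Note that the buffer grows with the number of threads launched (about 0.96 MB per thread here), so the total has to fit in device memory.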
If you can switch to a cc2.x or newer device (and compile for it), you should be able to use about half that much local memory: 60,000 doubles is 480,000 bytes, which is under the 512 Kbyte limit.
While we’re at it, this combination of statements:
GU22[i] = GU2[i] ;
seems like it could allow out-of-bounds accesses to GU22, depending on the value of i (i.e. on the grid size).
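One common way to guard against that, sketched here under assumptions: I am assuming i is the usual global thread index and that both arrays hold n elements. The kernel name, the parameter n, and the sizes in main are hypothetical, and GU22 is shown as a global buffer rather than a local array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the copy only happens when i is a valid index,
// so a grid that is larger than n cannot write past the end of GU22.
__global__ void guardedCopy(double *GU22, const double *GU2, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // assumed definition of i
    if (i < n)
        GU22[i] = GU2[i];
}

int main()
{
    const int n = 1000;            // assumed element count, not a multiple of the block size
    const int block = 256;
    const int grid = (n + block - 1) / block;       // rounds up: 1024 threads for 1000 elements

    double h_in[n], h_out[n];
    for (int k = 0; k < n; ++k) h_in[k] = (double)k;

    double *d_in = NULL, *d_out = NULL;
    if (cudaMalloc((void **)&d_in, n * sizeof(double)) != cudaSuccess ||
        cudaMalloc((void **)&d_out, n * sizeof(double)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_in, h_in, n * sizeof(double), cudaMemcpyHostToDevice);

    guardedCopy<<<grid, block>>>(d_out, d_in, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_out, d_out, n * sizeof(double), cudaMemcpyDeviceToHost);

    // Check that every element was copied.
    int mismatches = 0;
    for (int k = 0; k < n; ++k)
        if (h_out[k] != h_in[k]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

Without the `if (i < n)` test, the 24 extra threads in the last block would read and write past the end of both arrays.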