Device Register Array and Device Malloc

Hi folks! I have been scouring the forums, the book (Programming Massively Parallel Processors by Kirk & Hwu), and the Programming Guide, and can’t seem to find an answer to this question. I am using a GTX 470 and intend to target only devices with compute capability 2.x and higher.

I am building a genetic algorithm using CUDA, so I have each block set to the maximum number of threads (1024 for the device I am using).

I need to make a few short arrays for each thread. I calculated that 256 bytes per thread would work well for what I am doing, but I may need to increase that to 512 bytes per thread. I would like to store these arrays in registers for speed, but I have a few problems. The max registers per block for my device is 32768 bytes and the max threads per SM is 1024. With 256 bytes per thread, only 128 threads can be scheduled per SM before the register limit is reached.

From reading other posts, I know there is a threshold at which register memory will be moved to global memory to increase performance. I am guessing that trying to allocate 256 bytes per thread would cause this, since each SM would be at 1/8th capacity. Is that correct? Is there a way to force these arrays into register memory (e.g. using the register keyword), and if so, can someone please post a short example? I know I can use coalesced global memory; I was just curious whether there is another way.

Finally, with the L2 cache (640 kB on my device), do I really need to go through all this trouble to force this memory into registers or to coalesce global accesses? (It probably depends on my total data size in global memory.)

Additionally, since I have a compute capability 2.0 device, I can call malloc() from device code. I figure this gives me global device memory, but I have not been able to confirm that. Does anyone know for sure that a device-level malloc() returns global device memory?

Thanks!

-matto-

Compute capability 2.x devices can use at most 63 registers per thread no matter how few threads you have, because the binary instruction format reserves only 6 bits to encode the register number (and one of the 64 encodings is not available as a general-purpose register).
Also, registers cannot be indexed at run time, so you can only use them for arrays if all array indices are compile-time constants (for example, after a loop has been fully unrolled).
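To illustrate the constant-index point, here is a minimal sketch, not a definitive recipe. It assumes a small hypothetical array length (ARRAY_LEN), a made-up kernel name (sum_of_squares), and an interleaved input layout. Fully unrolling the loops turns every v[i] into an access with a constant index, which is what lets the compiler map the elements to registers. Whether it actually does so is still the compiler's decision; compiling with nvcc -Xptxas -v reports the register and local memory usage per kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define ARRAY_LEN 8   // hypothetical per-thread array length, kept small on purpose

__global__ void sum_of_squares(const float *in, float *out, int n_threads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_threads) return;

    float v[ARRAY_LEN];   // per-thread array

    // After full unrolling every index below is a compile-time constant,
    // so the compiler is free to keep v[0]..v[ARRAY_LEN-1] in registers.
    #pragma unroll
    for (int i = 0; i < ARRAY_LEN; ++i)
        v[i] = in[i * n_threads + tid];   // interleaved layout gives coalesced reads

    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < ARRAY_LEN; ++i)
        acc += v[i] * v[i];

    out[tid] = acc;
}

// Host helper for this sketch: returns the number of wrong results,
// or -1 if a CUDA call failed.
int register_array_mismatches()
{
    const int n_threads = 256;
    std::vector<float> h_in(ARRAY_LEN * n_threads), h_out(n_threads);
    for (int i = 0; i < ARRAY_LEN; ++i)
        for (int t = 0; t < n_threads; ++t)
            h_in[i * n_threads + t] = (float)i;   // element i of every thread is i

    float *d_in = NULL, *d_out = NULL;
    if (cudaMalloc(&d_in, h_in.size() * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, h_out.size() * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);

    sum_of_squares<<<2, 128>>>(d_in, d_out, n_threads);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int t = 0; t < n_threads; ++t)
        if (h_out[t] != 140.0f) ++bad;   // 0^2 + 1^2 + ... + 7^2 = 140
    return bad;
}
```

Calling register_array_mismatches() from main() should return 0 on a working device. With 256 or 512 bytes per thread (64 or 128 floats) the array would not fit in the per-thread register budget described above, so this trick only applies to much smaller arrays.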

Yes, malloc() will give you global memory.
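A minimal sketch of in-kernel malloc(), assuming a made-up kernel (scratch_kernel) and a hypothetical scratch size of four ints per thread. The allocation comes from the device heap, which lives in global memory. The heap defaults to 8 MB, and cudaDeviceSetLimit with cudaLimitMallocHeapSize can resize it before the first kernel that uses malloc() is launched.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void scratch_kernel(int *out, int len)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Device-side malloc() (compute capability 2.0 and higher) allocates
    // from the device heap, which resides in global memory.
    int *scratch = (int *)malloc(len * sizeof(int));
    if (scratch == NULL) {        // the heap can run out, so always check
        out[tid] = -1;
        return;
    }

    for (int i = 0; i < len; ++i)
        scratch[i] = tid + i;

    int sum = 0;
    for (int i = 0; i < len; ++i)
        sum += scratch[i];

    out[tid] = sum;
    free(scratch);                // memory from device malloc() is freed in device code
}

// Host helper for this sketch: returns the number of wrong results,
// or -1 if a CUDA call failed.
int device_malloc_mismatches()
{
    const int blocks = 2, threads = 64, len = 4;   // hypothetical sizes
    const int n = blocks * threads;

    // Optional here: the default heap is already large enough for this sketch,
    // so the return value is deliberately ignored.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 16 * 1024 * 1024);

    int *d_out = NULL;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;

    scratch_kernel<<<blocks, threads>>>(d_out, len);

    std::vector<int> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int t = 0; t < n; ++t)
        if (h_out[t] != 4 * t + 6) ++bad;   // (t+0) + (t+1) + (t+2) + (t+3)
    return bad;
}
```

Calling device_malloc_mismatches() from main() should return 0. Note that each thread's allocation is a separate region of global memory, so accesses across threads are not laid out for coalescing the way a single interleaved buffer allocated from the host would be.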

And yes, since 2.x devices cache global memory accesses, it is not worth going through all this trouble.
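For comparison, the "no trouble" version can be sketched as below, under some assumptions: the 256-byte array size comes from the question, while the function names, the fill pattern, and the number of steps are made up for illustration. The array is simply declared per thread and indexed at run time. The compiler will typically place such an array in local memory, which is backed by device memory and goes through the L1/L2 caches on 2.x devices.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define GENE_BYTES 256   // per-thread array size mentioned in the question
#define WALK_STEPS 16    // hypothetical amount of work per thread

// Shared by host and device so the host can check the device results.
__host__ __device__ int walk_genes(int tid)
{
    unsigned char genes[GENE_BYTES];   // plain per-thread array, no special tricks

    for (int i = 0; i < GENE_BYTES; ++i)
        genes[i] = (unsigned char)((i * 7 + tid) & 0xFF);

    int idx = tid & (GENE_BYTES - 1);
    for (int s = 0; s < WALK_STEPS; ++s)
        idx = genes[idx];   // data-dependent index, so this cannot be a register access

    return idx;
}

__global__ void walk_kernel(int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = walk_genes(tid);
}

// Host helper for this sketch: returns the number of results that differ
// from the host computation, or -1 if a CUDA call failed.
int local_array_mismatches()
{
    const int n = 512;
    int *d_out = NULL;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;

    walk_kernel<<<(n + 127) / 128, 128>>>(d_out, n);

    std::vector<int> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int t = 0; t < n; ++t)
        if (h_out[t] != walk_genes(t)) ++bad;
    return bad;
}
```

Calling local_array_mismatches() from main() should return 0. Profiling this simple version first is a reasonable way to find out whether any further memory layout work is needed at all.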