Compute capability and registers

Hello,
I need some information about registers. I have written some code to compute geometric primitives. When I run the kernel with a specific configuration, such as 10 blocks of 512 threads each, the run fails! But when I change the configuration to, for example, 20 blocks of 200 threads, it works! I guess the problem is with the number of registers a multiprocessor has. My card has compute capability 1.1, so each multiprocessor has 8192 registers. What exactly are registers for? Does a computation need one register or more?

Thanks to all!

Registers are fast scratch space for the calculation each thread is doing. Each register holds one 32-bit value (float or int), and the compiler determines how many your kernel needs when you compile it. To see how many registers your kernel requires per thread, pass the option --ptxas-options=-v to nvcc and it will print the count. Due to the way registers are allocated, the number of registers required for a block can be slightly larger than [number of threads per block] * [number of registers per thread]. The programming guide shows the exact formula. If that total exceeds the 8192 registers available on a multiprocessor, the block cannot be launched, which would explain why 512 threads per block fails while 200 works.
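To make the arithmetic concrete, here is a minimal host-side sketch of the register budget. It uses only the simple product [threads per block] * [registers per thread], so it is a lower bound; the real allocation can round up, as noted above. The function names and the example value of 20 registers per thread are hypothetical; take the real per-thread count from the --ptxas-options=-v output.

```cpp
// Registers per multiprocessor on compute capability 1.1 (figure quoted in this thread).
const int kRegistersPerMultiprocessor = 8192;

// Lower bound on the registers one block needs.
// Assumption: ignores the allocation rounding described in the programming guide.
int registersPerBlock(int threadsPerBlock, int registersPerThread) {
    return threadsPerBlock * registersPerThread;
}

// Largest per-thread register count for which a single block can still fit.
int maxRegistersPerThread(int threadsPerBlock) {
    return kRegistersPerMultiprocessor / threadsPerBlock;
}

// True if one block fits into the register file under the lower-bound estimate.
bool blockFits(int threadsPerBlock, int registersPerThread) {
    return registersPerBlock(threadsPerBlock, registersPerThread)
           <= kRegistersPerMultiprocessor;
}
```

With 512 threads per block, maxRegistersPerThread(512) gives 16, while 200 threads per block allows up to 40. So a kernel that needs, say, 20 registers per thread would not fit at 512 threads (10240 registers) but would fit at 200 threads (4000 registers).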

Thanks a lot! I have seen that this compiler option shows all the information. So far so good… I am computing Bezier curves with different parallelization strategies, and one of them is block-wise allocation. My strategy is that every thread computes a whole curve. I have also seen that classical programming is not possible with CUDA, i.e. I have to be very careful with memory. My question is: Is there a difference between these memory allocations:

__device__ float3 memory;

void function() {
   cudaMalloc((void**)&memory, ....);
   ....
}

or

void function() {
   float3 memory;
   cudaMalloc((void**)&memory, ....);
   ....
}

I have a memory problem: for some configurations with N Beziers, M control points and P curve points, the program doesn't work. I model the points as float3. I don't know where the problem is… In the same situation with float2, the program works. This seems very strange to me, because my graphics card has 1024 MB of memory on board and I calculated that the configurations described need only 3 MB of memory. Where is my mistake? What do I have to consider when allocating memory on the device? Thanks a lot for your help.