CUDA Fortran - threads

Hi there,

I have quite a large code that I have ported to CUDA Fortran, and it uses large local arrays.

Now, I was playing around with the number of threads per block and found an unexpected bug: when I increase the number of threads above 128, my results become incorrect. I can’t seem to find a problem with the actual code, so I was wondering whether this could be a result of local memory limitations per multiprocessor.

Any information or suggestions would be much appreciated.

Cheers,
Crip_crop

Okay, I found the solution to my problem. It might be of interest to some of you.

The problem was that too many resources were being requested for launch, which meant that the kernels in the code weren’t actually being executed.
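For anyone else who hits this: the failed launch is easy to miss, because nothing obviously crashes. Here’s a minimal sketch of how you might check for it in CUDA Fortran; the module, kernel and array names are placeholders, not from my actual code:

    program check_launch
       use cudafor
       use my_kernels                  ! hypothetical module containing mykernel
       implicit none
       integer, parameter :: n = 1024
       real, device :: a_d(n)
       type(dim3) :: grid, tBlock
       integer :: istat

       a_d = 0.0
       tBlock = dim3(256, 1, 1)                          ! threads per block
       grid   = dim3((n + tBlock%x - 1)/tBlock%x, 1, 1)
       call mykernel<<<grid, tBlock>>>(a_d, n)           ! hypothetical kernel

       istat = cudaGetLastError()      ! reports launch failures such as
       if (istat /= cudaSuccess) then  ! "too many resources requested for launch"
          print *, 'Kernel launch failed: ', cudaGetErrorString(istat)
       end if
    end program check_launch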

In order to prevent this problem, the number of registers needs to be limited using the -Mcuda=maxregcount:n flag at compile time. The general rule of thumb is that:

no. of registers per thread * blocksize should not be greater than 8192

So, for a blocksize of 256 the regcount must be no more than 32…

… and hey presto, it works!
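To spell out the arithmetic: 8192 / 256 = 32 registers per thread, so for a 256-thread block the compile line would look something like this (the source file name here is just a placeholder):

    pgfortran -Mcuda=maxregcount:32 mycode.cuf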

Crip_crop

Hi Crip_crop,

I’m a beginner with CUDA. Could you tell me why it can’t be larger than 8192?

I have a problem with registers. My GPU is a Tesla M2050 (cc 2.0), which should have 32K registers per SM.
But the compiler seems to stop the kernel code from using more than 8K registers. I tried converting some small arrays into registers, but that resulted in lower performance. Thanks in advance.

gfwang

Hi GfWang,

Could you tell me why it can’t be larger than 8192?

I think it’s simply a hardware limitation. However, the maximum number of registers per block depends on the compute capability you’re using; I’m using 1.3, and the limit is higher for higher compute capabilities.

But the compiler seems to stop the kernel code from using more than 8K registers

Do you mean that this is the maximum number of registers your kernel is using? If so, to allow the compiler to allocate more registers to your kernel, you need to lower the number of threads per block that you’re using. Each multiprocessor allocates registers per thread block, and all the threads on a multiprocessor have to share that limited register file. This means that the more threads in a thread block, the fewer registers are available to each thread.
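To make that concrete, here’s a rough sketch using the 8192-register rule of thumb from earlier in the thread (reusing the placeholder names from the sketch in my earlier post):

    ! 256 threads per block: 8192 / 256 = 32 registers available per thread
    ! 128 threads per block: 8192 / 128 = 64 registers available per thread
    tBlock = dim3(128, 1, 1)
    grid   = dim3((n + tBlock%x - 1)/tBlock%x, 1, 1)
    call mykernel<<<grid, tBlock>>>(a_d, n)   ! hypothetical kernel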

However, it’s not always useful to give each thread the maximum number of registers, as this limits the number of warps that can be resident on a multiprocessor at any one time, and with too few active warps the hardware can’t hide memory latency, which could hurt performance. As you can see, it’s quite a complex issue which you should probably read more about. Here’s a link to some documentation you might find useful… the metric is known as occupancy.
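As a rough back-of-envelope example of what the occupancy metric measures (assuming a cc 1.3 multiprocessor, which can track up to 32 warps, i.e. 1024 threads, at once, and assuming register pressure only lets one block be resident):

    program occupancy_estimate
       implicit none
       integer :: threads_per_block, blocks_per_sm, active_warps, max_warps_per_sm
       real    :: occupancy
       max_warps_per_sm  = 32      ! warps a cc 1.3 multiprocessor can track
       threads_per_block = 256
       blocks_per_sm     = 1       ! assume register use only allows one resident block
       active_warps      = blocks_per_sm * threads_per_block / 32   ! 32 threads per warp -> 8 warps
       occupancy         = real(active_warps) / real(max_warps_per_sm)
       print *, 'occupancy =', occupancy    ! 8/32 = 0.25, i.e. 25%
    end program occupancy_estimate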


I tried converting some small arrays into registers, but that resulted in lower performance.

I think this could be because there aren’t enough registers available to hold your arrays, so they’re spilling over to local memory… which is essentially thread-private global memory, so the poor performance could come from the latency of fetching that data.
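If you want to check whether that’s happening, I believe the PGI compiler will print each kernel’s register and local-memory (lmem) usage if you add the ptxinfo suboption, i.e. compile with -Mcuda=ptxinfo. And as a sketch of the kind of thread-private array I mean (the names are placeholders, not your code):

    module spill_demo
       use cudafor
       implicit none
    contains
       attributes(global) subroutine scale_rows(a, n)
          integer, value :: n
          real :: a(n, *)
          ! Small thread-private array: the compiler keeps it in registers if it
          ! can; if registers run out, it spills to (slow, off-chip) local memory.
          real :: work(16)
          integer :: i, j
          j = (blockIdx%x - 1)*blockDim%x + threadIdx%x
          do i = 1, 16
             work(i) = a(i, j)
          end do
          do i = 1, 16
             a(i, j) = 2.0*work(i)
          end do
       end subroutine scale_rows
    end module spill_demo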

Hope that helps,
Crip_crop

Hi crip_crop,

Thanks a lot for your kind reply. Your information is very helpful to me.

Gaofeng

No problem. I just realised that I forgot to post the link. Here goes…

http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_Best_Practices_Guide.pdf

Crip_crop