Confusion regarding register use and occupancy calculator

My cuda project compiled but when I ran it one kernel reported “too many resources requested for launch”. So I recompiled with --ptxas-options="-v" to find that the troublesome kernel required
59 registers
256+256 bytes smem
132 bytes cmem[1]
8 bytes cmem[14]

However the occupancy calculator said this was OK for the blocksize of 128 I am using. It also said a blocksize of 64 was OK (and even better in some respects) but I still got the same error with a blocksize of 64.

So what is going on?

The GPU is a C1060.

ps And in the limited table, what does it mean if a number is highlighted in red?

How much dynamic shared memory are you allocating at launch?

None, but now you have raised that point it is possible that it is the kernel being called immediately before the kernel I thought was the culprit. That previous kernel reports resource usage of

9 registers

1232+1232 bytes smem

8 bytes cmem[1]

8 bytes cmem[14]

and contains this line to dynamically allocate shared memory

shared uint sharedmemvar[257];

When I add these the occ calc appears to give the thumbs up, but I don’t like the look of the Limited by Shared Memory / Multiprocessor entry which is highlighted in red, which is why I also asked about that.

The error reporting functions report the last error that occurred. It will give you the error from a previous call if you don’t check it religiously after every interaction with the API. So you should call cudaGetLastError after every launch, and check the return status of every API call to pin down where the error is occurring.

That isn’t dynamically shared memory. That is static shared memory, and the compiler is including that in the 1232 bytes it is reporting at compile time. What about shared memory specified at launch?

That would be a static allocation (with a CUDA-specific meaning of “static”, not the C one). This is already included in the summary printed by the compiler. Avidday probably wanted to know about the third argument inside the <<< >>> on kernel launch, which gets added to the number given by the compiler.

I thought that’s what dynamically shared memory was. As it is clearly not, would you describe what you mean by dynamically shared memory?

When you launch a kernel in the runtime API the syntax is:

kernelname <<< grid dimensions, block dimensions, shared memory, stream number >>> (kernel argument list)

The shared memory number is an additional amount of shared memory which is dynamically allocated for each block when the kernel launches. The ptx assembler knows nothing about shared memory at compile time and can’t account for it in the output statistics you get during compilation. When considering resource usage of a kernel, the dynamic shared memory amount must be added to the amount statically declared inside the kernel (and which the assembler reports at compile time).

In that case, none, and that is for all kernels. All I pass is the kernelname<<<NumBlocks,NumThreads>>>(…)

In that case it most likely isn’t shared memory causing it. So what is the exact block and grid dimensions you are using to launch the kernel that fails?

The full call is

[codebox]

CalculateForces<<<numBlocks,numThreads>>>(x,v,hsml,mass,rho,p,c,type,d_sorted_x,d_sorted_v,

		d_sorted_mass,d_sorted_hsml,d_sorted_rho,d_sorted_p,d_sorted

_c,

		d_sorted_type,d_av,particleHash,cellStart,outfile,

		thetime,d_viscdt,d_tf,d_interactions,piac,d_sorted_TotalE,

		d_dVoldt,d_dVolrhodt,d_dVolrhovdt,d_dVolTotalEdt,d_Vol,d_sor

ted_Vol);

cudaThreadSynchronize();

printf("\n\nCalcForces %s\n", cudaGetErrorString(cudaGetLastError()));

[/codebox]

The following are evaluated dynamically

numBlocks = 16

numThreads = 128

the variable lists consists of textures, arrays of int, float or float2, and a FILE*. The textures and arrays are of size 2048(=16x128) and the arrays are created dynamically but not in shared memory but with cudaMalloc.

Are numBlocks and numThreads being passed as scalars or do you form them as dim3, and if so what are the actual dimensions? (I ask this because if you manage to form a dim3 with a zero dimension, it will cause the kernel launch to abort).

Also you error checking is incorrect. You should call and check cudaGetLastError directly after the launch (so before you do any other API calls).

numBlocks and numThreads are int, not dim3.

And on your suggestion about cudaGetLastError,

Before CalcForces no error

After CalcForces too many resources requested for launch

So it looks like it is the CalcForces kernel, and that I’m going to be occupied this weekend.

Thanks alot for trying to sort this out for me, but I’ve taken up more time than I expected.

What is the combined length of the argument list? You only have 256 bytes and you seem to be passing a lot of arguments by value. It could be that the argument list is too large. Also you say you are passing textures as arguments? I am pretty sure that is illegal. Textures have to be statically declared at file scope in the file they are using in.

When I say I am passing textures I mean I am binding an array to a texture and then passing the array not the texture to the kernel, which is legal (I think, because it is done in one of the CUDA SDK projects).

But yes I do have a lot of variables being passed to it. I was not aware of the 256 byte limit.

The kernel declaration is

[codebox]global void CalculateForces(

float2* x,

float2* v,

float* hsml,

float* mass,

float* rho,

float* p,

float* c,

int* type,

float2* d_sorted_x,

float2* d_sorted_v,

float* d_sorted_mass,

float* d_sorted_hsml,

float* d_sorted_rho,

float* d_sorted_p,

float* d_sorted_c,

int* d_sorted_type,

float2* av,

uint2* particleHash,

uint* cellStart,

FILE* outfile,

float time,

float* d_viscdt,

float* d_tf,

int* d_interactions,

int* piac,

float* d_sorted_TotalE,

float* d_dVoldt,

float* d_dVolrhodt,

float2* d_dVolrhovdt,

float* d_dVolTotalEdt,

float* d_Vol,

float* d_sorted_Vol)[/codebox]

How much is that?

If I count correctly, that’s 128 bytes on a 32-bit system or 252 bytes on a 64-bit system.
Makes one wonder if the limit actually is a bit lower than the Programming Guide says. The number 240 comes to mind. Could you try to remove at least 3 pointer arguments?

That does look pretty close to 256 bytes. I must say the FILE argument rather stands out. If you don’t mind me asking, what are you passing a FILE type to a kernel for?

The FILE* is a hangover from emulation mode, so can be removed.

As for the other variables I think interactions and piac can also be removed as they are from debugging, but after that I think we are down to the minimum, unless I could used a float4* to hold two float2*? Is a float4* pointer the same size as a float2* pointer?

They are all just pointers, so 8 bytes each on 64 bit plaforms and 4 bytes each on 32 bit platforms.

You can always replace the argument list by a pointer to a struct in constant memory and place the actual arguments there. However, as you have already identified three pointers that can easily be removed, try that first to see what happens once the total argument size gets to 240.

I assume you are on e 64 bit system and your GPU is not a Fermi?

BINGO!

After removing the FILE* and the interaction stuff there are no runtime errors from any kernel!

Thankyou very much Avidday and Tera for all your help. I’ll never forget about that limit now!

:D