Too Many Resources Requested

Hi,

I’ve attached my CUDA code below. The problem is that when I run this code as part of my C++ code, the kernel launch fails with the error “too many resources requested”.

So I put the functions into a .cu file by themselves and ran the profiler on it, but it didn’t produce any results.

sample.txt (18.3 KB)

The funny thing is that when I change the very last line of the kernel function from

divigden[index] = delta_rho; to divigden[index] = 0;

the kernel launches. The profiler details are attached.

Can anyone please help me figure this out and tell me how I could fix it?

I am working on an 8800 GTS card.
test_Session1_Device_0.xls (18 KB)

You’re probably using too many registers, too much shared memory, or something similar.

Compile your .cu file with: --ptxas-options="-v -mem "

That will show you the resource usage of your kernels.

Regarding the changed line: that’s probably because, once the result is no longer stored, the compiler optimized away some or all of your code, so the kernel needed fewer resources and was able to run.

eyal

Here is the output from --ptxas-options="-v -mem ". This was done with the line divigden[index] = delta_rho; in place:

1>ptxas info : Compiling entry function ‘_Z21calcDivergenceDensityPiS_S_S_S_S_S_S_PfS0_S0_S0_ii’

1>ptxas info : Used 42 registers, 12+0 bytes lmem, 72+68 bytes smem, 44 bytes cmem[1]

1>Memory space statistics for ‘OCG memory pool for function _Z21calcDivergenceDensityPiS_S_S_S_S_S_S_PfS0_S0_S0_ii’

1>=================================================================================================================

1>Page size : 0x1000 bytes

1>Total allocated : 0x57d410 bytes

1>Total available : 0x36870 bytes

1>Nrof small block pages : 880

1>Nrof large block pages : 306

1>Longest free list size : 1

1>Average free list size : 0

1>Memory space statistics for ‘Top level ptxas memory pool’

1>=========================================================

1>Page size : 0x1000 bytes

1>Total allocated : 0x20e10 bytes

1>Total available : 0x129f0 bytes

1>Nrof small block pages : 29

1>Nrof large block pages : 2

1>Longest free list size : 0

1>Average free list size : 0

1>Memory space statistics for ‘Permanent OCG memory pool’

1>=======================================================

1>Page size : 0x1000 bytes

1>Total allocated : 0x39048 bytes

1>Total available : 0x47b0 bytes

1>Nrof small block pages : 4

1>Nrof large block pages : 16

1>Longest free list size : 1

1>Average free list size : 0

1>Memory space statistics for ‘PTX parsing state’

1>===============================================

1>Page size : 0x1000 bytes

1>Total allocated : 0x8cd98 bytes

1>Total available : 0xd0f8 bytes

1>Nrof small block pages : 134

1>Nrof large block pages : 6

1>Longest free list size : 1

1>Average free list size : 0

1>Memory space statistics for ‘Command option parser’

1>===================================================

1>Page size : 0x1000 bytes

1>Total allocated : 0x6058 bytes

1>Total available : 0x4f80 bytes

1>Nrof small block pages : 6

1>Nrof large block pages : 0

I am really new at this, so any help with how many registers I am over by, and how I can fix it, would be very much appreciated.

The important line is:
ptxas info : Used 42 registers, 12+0 bytes lmem, 72+68 bytes smem, 44 bytes cmem[1]

Your kernel uses 42 registers per thread. Each block runs on a single multiprocessor, so 42 * num_threads_per_block must be less than or equal to the number of registers on the multiprocessor. That is 8192 on older hardware and 16384 on GT200.
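As a rough sketch of that rule, here is the arithmetic in host-side C++. The helper name maxThreadsPerBlock is made up for illustration and is not a CUDA API. It ignores the extra rounding the hardware applies when it allocates registers, so treat the result as an upper bound rather than a guarantee.

```cpp
// Sketch only: maxThreadsPerBlock is a hypothetical helper, not a CUDA API.
// It applies regs_per_thread * threads_per_block <= regs_per_multiprocessor
// and then rounds down to a whole number of 32-thread warps.
constexpr int kWarpSize = 32;

constexpr int maxThreadsPerBlock(int regsPerThread, int regsPerMultiprocessor) {
    return (regsPerMultiprocessor / regsPerThread) / kWarpSize * kWarpSize;
}

// 42 registers per thread, as reported by ptxas in the output above.
static_assert(maxThreadsPerBlock(42, 8192) == 192, "8192-register hardware");
static_assert(maxThreadsPerBlock(42, 16384) == 384, "16384-register hardware");
```

So with 42 registers per thread, a block of more than about 192 threads cannot fit on the older parts under this simple model.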

42 registers is very high. How do you run your kernel? How many blocks and threads?

You should put this info in the Occupancy calculator (the excel file) and see where you’re at…

You could also try to optimize your code to reduce the register usage, break the kernel into two (or more) kernels so that each one uses fewer registers (if possible), or use the “-maxrregcount=XX” compiler option to limit the register usage (that will cause registers to spill to local memory).

eyal

So if I reduce the number of threads per block and increase the number of blocks, it could potentially run? Is there any way I can reduce the number of registers?

This is how I am currently calling the kernel

int N = nodes; // 76896

int block_size = 16;

int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);

calcDivergenceDensity<<<n_blocks,block_size>>>
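The n_blocks expression above is a round-up division. A minimal sketch of the same calculation, with a hypothetical helper name (blocksFor) and an illustrative second block size of 256 that is not taken from the code above:

```cpp
// Sketch: blocksFor is a hypothetical helper name, not part of CUDA.
// It rounds up so that n_blocks * block_size >= n, which is equivalent to
// N/block_size + (N%block_size == 0 ? 0:1) for positive values.
constexpr int blocksFor(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

static_assert(blocksFor(76896, 16) == 4806, "16 divides 76896 exactly");
static_assert(blocksFor(76896, 256) == 301, "the last block is only partly used");
```

When the block size does not divide N, the last block has threads whose index runs past N, so this assumes the kernel compares its index against N before it writes anything.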

I think I have narrowed the problem down to exceeding my shared memory by nearly double, as

1>ptxas info : Used 38 registers, 12+0 bytes lmem, 72+68 bytes smem, 44 bytes cmem[1]

and

int N = nodes; // 76896
int block_size = 208;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1); // i.e. 370 blocks

This would mean that per block I would have:

38*208 = 7904 registers < 8192 registers that I’m allowed

12 bytes * 208 = 2496 bytes lmem (not sure what lmem is?)

72+68 bytes shared memory = 140 bytes shared memory

140*208 = 29120 bytes > 16384 bytes of shared memory I am allowed

44 bytes constant memory * 208 = 9152 bytes < 65536 bytes of constant memory I am allowed

Is there any way I can reduce my shared memory, or is there a card that allows this much shared memory?

You should really read the programming guide. You don’t compute things like that.

You don’t need to multiply the shared memory and constant memory by the number of threads.

Also, 72+68 actually means you only use 72 bytes per block, not per thread.
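To make the per-thread versus per-block distinction concrete, here is a small sketch in host-side C++. The function names are illustrative only. It deliberately takes the thread count as an argument and ignores it for shared memory, and it uses the larger figure of 72+68 = 140 bytes to show that even then the shared memory is nowhere near the limit.

```cpp
// Sketch: these helper names are illustrative, not CUDA APIs.
constexpr int kSmemPerMultiprocessor = 16384;  // bytes, limit quoted in the thread

// The smem figure printed by ptxas is a per-block total, so the thread
// count is deliberately unused here.
constexpr int smemPerBlock(int smemReportedByPtxas, int /*threadsPerBlock*/) {
    return smemReportedByPtxas;
}

// Registers, by contrast, are reported per thread and do scale with the block.
constexpr int naiveRegsPerBlock(int regsPerThread, int threadsPerBlock) {
    return regsPerThread * threadsPerBlock;
}

static_assert(smemPerBlock(72 + 68, 208) == 140, "not 140 * 208");
static_assert(smemPerBlock(72 + 68, 208) <= kSmemPerMultiprocessor, "fits easily");
static_assert(naiveRegsPerBlock(38, 208) == 7904, "before any hardware rounding");
```

The register product is the naive figure only. The hardware allocates registers in whole warps and in fixed-size units, so the real per-block demand can be higher than 7904.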

eyal

Using the occupancy calculator with compute capability 1.3, I get the following:

1.) Select Compute Capability (click): 1.3

2.) Enter your resource usage:
Threads Per Block: 256
Registers Per Thread: 38
Shared Memory Per Block (bytes): 72

(Don’t edit anything below this line)

3.) GPU Occupancy Data is displayed here and in the graphs:
Active Threads per Multiprocessor: 256
Active Warps per Multiprocessor: 8
Active Thread Blocks per Multiprocessor: 1
Occupancy of each Multiprocessor: 25%

Physical Limits for GPU: 1.3
Threads / Warp: 32
Warps / Multiprocessor: 32
Threads / Multiprocessor: 1024
Thread Blocks / Multiprocessor: 8
Total # of 32-bit registers / Multiprocessor: 16384
Register allocation unit size: 512
Shared Memory / Multiprocessor (bytes): 16384
Warp allocation granularity (for register allocation): 2

Allocation Per Thread Block
Warps: 8
Registers: 9728
Shared Memory: 512
These data are used in computing the occupancy data in blue

Maximum Thread Blocks Per Multiprocessor (blocks)
Limited by Max Warps / Multiprocessor: 4
Limited by Registers / Multiprocessor: 1 (RED)
Limited by Shared Memory / Multiprocessor: 32
Thread Block Limit Per Multiprocessor highlighted RED

Does this mean that the code would work on a card with a compute capability of 1.3?
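For reference, the calculator figures above can be reproduced with a few lines of arithmetic. This is only a sketch: the constants are the compute capability 1.3 limits listed in the output, the 512-byte shared memory allocation unit is an assumption inferred from 72 bytes being shown as 512, and the rounding rules are my reading of the spreadsheet rather than an official API.

```cpp
// Sketch of the occupancy arithmetic for compute capability 1.3.
// All names are illustrative; limits are copied from the calculator output.
constexpr int kWarpSize = 32;
constexpr int kMaxWarpsPerSM = 32;
constexpr int kMaxBlocksPerSM = 8;
constexpr int kRegsPerSM = 16384;
constexpr int kRegAllocUnit = 512;
constexpr int kSmemPerSM = 16384;
constexpr int kSmemAllocUnit = 512;        // assumption: inferred from 72 -> 512
constexpr int kWarpAllocGranularity = 2;

constexpr int roundUp(int x, int unit) { return (x + unit - 1) / unit * unit; }
constexpr int minOf(int a, int b) { return a < b ? a : b; }

constexpr int warpsPerBlock(int threads) {
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Registers are allocated per block: warps rounded up to the granularity,
// then the total rounded up to the register allocation unit.
constexpr int regsPerBlock(int threads, int regsPerThread) {
    return roundUp(roundUp(warpsPerBlock(threads), kWarpAllocGranularity)
                       * kWarpSize * regsPerThread,
                   kRegAllocUnit);
}

constexpr int blocksBySmem(int smemBytes) {
    return smemBytes > 0 ? kSmemPerSM / roundUp(smemBytes, kSmemAllocUnit)
                         : kMaxBlocksPerSM;
}

// Resident blocks per multiprocessor: the tightest of the four limits.
constexpr int activeBlocks(int threads, int regsPerThread, int smemBytes) {
    return minOf(minOf(kMaxBlocksPerSM, kMaxWarpsPerSM / warpsPerBlock(threads)),
                 minOf(kRegsPerSM / regsPerBlock(threads, regsPerThread),
                       blocksBySmem(smemBytes)));
}

constexpr int occupancyPercent(int threads, int regsPerThread, int smemBytes) {
    return 100 * activeBlocks(threads, regsPerThread, smemBytes)
               * warpsPerBlock(threads) / kMaxWarpsPerSM;
}

static_assert(regsPerBlock(256, 38) == 9728, "matches 'Registers 9728'");
static_assert(activeBlocks(256, 38, 72) == 1, "limited by registers");
static_assert(occupancyPercent(256, 38, 72) == 25, "matches 'Occupancy 25%'");
```

A result of at least one active block suggests that a 256-thread block fits within these limits on 1.3 hardware. The same 9728 registers are more than the 8192 quoted earlier in the thread for older hardware, so a 256-thread block would not fit there.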