I’ve attached my CUDA code below. The problem is that when I run this code as part of my C++ code I get the error “too many resources requested” and the kernel fails to launch.
So I put the functions into a .cu file by themselves and ran the profiler on it, but it didn’t produce any results.
The important line is:
ptxas info : Used 42 registers, 12+0 bytes lmem, 72+68 bytes smem, 44 bytes cmem[1]
Your kernel uses 42 registers per thread. Each block runs on a single multiprocessor, so 42 * num_threads_per_block must be less than or equal to the number of registers on the multiprocessor. That is 8192 on older hardware and 16384 on GT200.
42 registers is very high. How do you launch your kernel? How many blocks, and how many threads per block?
You should put this info into the Occupancy Calculator (the Excel file) and see where you’re at…
You could also try to optimize your code to reduce register usage, break the kernel into two (or more) kernels so that each one uses fewer registers (if possible), or use the “-maxrregcount=XX” compiler option to cap register usage (which will cause the excess registers to spill to local memory).
So if I reduce the number of threads per block and increase the number of blocks, it could potentially run? Is there any way I can try to reduce the number of registers?
This is how I am currently calling the kernel:
int N = nodes; // 76896
int block_size = 16;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
Using the occupancy calculator with compute capability 1.3, I get the following:
1.) Select Compute Capability (click): 1.3
2.) Enter your resource usage:
Threads Per Block 256
Registers Per Thread 38
Shared Memory Per Block (bytes) 72
(Don’t edit anything below this line)
3.) GPU Occupancy Data is displayed here and in the graphs:
Active Threads per Multiprocessor 256
Active Warps per Multiprocessor 8
Active Thread Blocks per Multiprocessor 1
Occupancy of each Multiprocessor 25%
Allocation Per Thread Block
Warps 8
Registers 9728
Shared Memory 512
These data are used in computing the occupancy data in blue
Maximum Thread Blocks Per Multiprocessor Blocks
Limited by Max Warps / Multiprocessor 4
Limited by Registers / Multiprocessor 1 (RED)
Limited by Shared Memory / Multiprocessor 32
Thread Block Limit Per Multiprocessor highlighted RED
Does this mean that the code would work on a card with a compute capability of 1.3?