How to calculate register resources correctly? I'm hitting a strange problem when calculating resources such as registers

Hi, guys

I spent hours debugging this problem. In a word: cudaDeviceProp.regsPerBlock doesn't seem to give the right limit. OK, let me describe my situation.

First, I have the following piece of kernel code; using a template, I tried to make it support both float and double:

[codebox]

template <typename RealType> __global__ void xSweep(…) { … }

template <typename RealType> __global__ void ySweep(…) { … }

template <typename RealType> cudaError_t callKernel(size_t maxThreadNum, …)

{

.....

size_t xSweepGridSize = ny+1;

size_t xSweepBlockSize = nx+1;

if (xSweepGridSize > prop.maxGridSize[0]) xSweepGridSize = prop.maxGridSize[0];

if (xSweepBlockSize > maxThreadNum) xSweepBlockSize = maxThreadNum;

size_t xSweepSharedMem = (nx+1)*sizeof(RealType)*4;

size_t ySweepGridSize = nx+1;

size_t ySweepBlockSize = ny+1;

if (ySweepGridSize > prop.maxGridSize[0]) ySweepGridSize = prop.maxGridSize[0];

if (ySweepBlockSize > maxThreadNum) ySweepBlockSize = maxThreadNum;

size_t ySweepSharedMem = (ny+1)*sizeof(RealType)*4;

printf(" >> xSweep: grid=%zu block=%zu shared=%zu\n", xSweepGridSize, xSweepBlockSize, xSweepSharedMem);

printf(" >> ySweep: grid=%zu block=%zu shared=%zu\n", ySweepGridSize, ySweepBlockSize, ySweepSharedMem);

while (…some condition…)

{

    .....

// x-direction sweep

    xSweep<RealType><<<xSweepGridSize, xSweepBlockSize, xSweepSharedMem>>>(x, y, nx, ny, Cx, Cy, dt, t0, t1, u0, u1);

    error = cudaGetLastError();

    if (error != cudaSuccess) { printf("%s(line %d): %s\n", __FILE__, __LINE__, cudaGetErrorString(error)); return error; }

    error = cudaThreadSynchronize();

    if (error != cudaSuccess) { printf("%s(line %d): %s\n", __FILE__, __LINE__, cudaGetErrorString(error)); return error; }

// y-direction sweep

    ySweep<RealType><<<ySweepGridSize, ySweepBlockSize, ySweepSharedMem>>>(x, y, nx, ny, Cx, Cy, dt, t0, t1, u1, u0);

    error = cudaGetLastError();

    if (error != cudaSuccess) { printf("%s(line %d): %s\n", __FILE__, __LINE__, cudaGetErrorString(error)); return error; }

    error = cudaThreadSynchronize();

    if (error != cudaSuccess) { printf("%s(line %d): %s\n", __FILE__, __LINE__, cudaGetErrorString(error)); return error; }

}

return cudaGetLastError();

}

[/codebox]

This code has two kernels, and to compile them I first need to instantiate the template; for example, I instantiate it with float somewhere in my code:

[codebox]

callKernel<float>(512, …);

[/codebox]

Then I compiled it for compute capability 1.1 with verbose output to see how many registers each kernel uses:

[codebox]

nvcc -arch sm_11 --ptxas-options=-v …

ptxas info : Compiling entry function 'ZN4alea4math3pde4dim24cuda6ySweepIfEEvPT_S6_iiS5_S5_S5S5_S5_NS3_6MatrixIS5_EES8_' for 'sm_11'

ptxas info : Used 31 registers, 176+0 bytes lmem, 80+16 bytes smem, 24 bytes cmem[0], 120 bytes cmem[1]

ptxas info : Compiling entry function 'ZN4alea4math3pde4dim24cuda6xSweepIfEEvPT_S6_iiS5_S5_S5S5_S5_NS3_6MatrixIS5_EES8_' for 'sm_11'

ptxas info : Used 30 registers, 176+0 bytes lmem, 80+16 bytes smem, 24 bytes cmem[0], 120 bytes cmem[1]

[/codebox]

From the output I can see two compiled kernels: one uses 31 registers, the other 30.

Question 1: The two kernels xSweep and ySweep are very much alike; they differ only in the direction of the data they handle. Why do they differ in register usage?

Next, I want to calculate the resources correctly to determine maxThreadNum (the maximum number of threads per block that can be launched). The per-block resources are determined by two things: registers and shared memory. In my situation the shared memory is set correctly and will not change, so the problem is how to account for the register resource.

To calculate maxThreadNum, I coded:

[codebox]

template <typename T> struct KernelParam;

template <> struct KernelParam<float>

{

static size_t registersPerThread() { return 31; }

};

template <typename T>

struct KernelQuery : public KernelParam<T>

{

cudaDeviceProp prop;

KernelQuery()

{

    int id;

    cudaGetDevice(&id);

    cudaGetDeviceProperties(&prop, id);

}

typedef KernelParam<T> param_t;

int maxThreadNumPerBlock() const

{

    int num;

    size_t regs = param_t::registersPerThread();

    bool good = false;

    for (num = prop.warpSize; num <= prop.maxThreadsDim[0] && num*regs <= prop.regsPerBlock; num += prop.warpSize, good = true);

    if (good) return num - prop.warpSize;

    else return -1;

}

};

[/codebox]

The class KernelParam is specialized for float and returns 31 registers per thread, the larger of the two numbers from the nvcc output (30 and 31). The algorithm: maxThreadNum starts at the warp size; while it does not exceed the maximum size of the first block dimension and the block's total registers do not exceed the device's regsPerBlock, it steps up by one warp size, so it finds the largest value that satisfies both conditions. Then I can instantiate the kernel call like this:

[codebox]

KernelQuery<float> query;

callKernel<float>(query.maxThreadNumPerBlock(), …);

[/codebox]

Up to now everything works well; I ran my program and it generates the correct answer and this output:

[codebox]

Calling cudaRoutine with tstop=1 on (0023BD20: [#.0 GeForce GTX 260: 1.3, 27 SM, GMEM=879.375 MB, SMEM=16384 B])

CUDA kernel calling…

xSweep: grid=1008 block=512 shared=16128

ySweep: grid=1008 block=512 shared=16128

Solved by CUDA (float): 1008 x 1008

[/codebox]

This shows it launches the kernels with 512 threads per block; times 31 (the larger register count of the two kernels), that is 15872 registers per block. And checking my device properties:

[codebox]

Device 0: “GeForce GTX 260”

CUDA Driver Version: 3.10

CUDA Runtime Version: 3.10

CUDA Capability Major revision number: 1

CUDA Capability Minor revision number: 3

Total amount of global memory: 922091520 bytes

Number of multiprocessors: 27

Number of cores: 216

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 16384

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 2147483647 bytes

Texture alignment: 256 bytes

Clock rate: 1.44 GHz

Concurrent copy and execution: Yes

Run time limit on kernels: Yes

Integrated: No

Support host page-locked memory mapping: Yes

Compute mode: Default (multiple host threads can use this device simultaneously)

Concurrent kernel execution: No

Device has ECC support enabled: No

[/codebox]

The registers available per block are 16384 and I used 15872, so everything is OK. But when I moved on, the problem showed up.

My next step: with the same code, I compiled with '-arch sm_13', but without instantiating the double version of the kernel. The compiler generated:

[codebox]

nvcc -arch sm_13 --ptxas-options=-v …

ptxas info : Compiling entry function 'ZN4alea4math3pde4dim24cuda6ySweepIfEEvPT_S6_iiS5_S5_S5S5_S5_NS3_6MatrixIS5_EES8_' for 'sm_13'

ptxas info : Used 33 registers, 176+0 bytes lmem, 80+16 bytes smem, 24 bytes cmem[0], 120 bytes cmem[1]

ptxas info : Compiling entry function 'ZN4alea4math3pde4dim24cuda6xSweepIfEEvPT_S6_iiS5_S5_S5S5_S5_NS3_6MatrixIS5_EES8_' for 'sm_13'

ptxas info : Used 32 registers, 176+0 bytes lmem, 80+16 bytes smem, 24 bytes cmem[0], 120 bytes cmem[1]

[/codebox]

This time the register counts became 33 and 32 (previously 31 and 30 when compiled with -arch sm_11).

Question 2: Why does the register count differ between '-arch sm_11' and '-arch sm_13'? My code didn't change.

Then I had to update my class KernelParam, changing the 31 to 33:

[codebox]

template <> struct KernelParam<float>

{

static size_t registersPerThread() { return 33; }

};

[/codebox]

But this time, when I ran the program, it generated a runtime error:

[codebox]

Calling cudaRoutine with tstop=1 on (001BBCF0: [#.0 GeForce GTX 260: 1.3, 27 SM, GMEM=879.375 MB, SMEM=16384 B])

CUDA kernel calling…

xSweep: grid=1008 block=480 shared=16128

ySweep: grid=1008 block=480 shared=16128

kernel.cu(line 298): too many resources requested for launch

Kernel running error: too many resources requested for launch

[/codebox]

So let's check the registers. Since I changed the 31 to 33, maxThreadNum became 480 this time, not 512.

Question 3: 480 * 33 = 15840 registers per block, still below the regsPerBlock limit of 16384. Why does it report the CUDA error "too many resources requested for launch"?

I tried setting that value to 34: still not working. But when I changed it to 35, it worked, and this time maxThreadNum became 448:

[codebox]

Calling cudaRoutine with tstop=1 on (002BBCF0: [#.0 GeForce GTX 260: 1.3, 27 SM, GMEM=879.375 MB, SMEM=16384 B])

CUDA kernel calling…

xSweep: grid=1008 block=448 shared=16128

ySweep: grid=1008 block=448 shared=16128

[/codebox]

Question 4: 448 * 33 = 14784 registers, far below the 16384 limit. I don't know what is happening with these registers.

Continuing the test, I instantiated the double version of the kernel:

[codebox]

KernelQuery<float> query;

callKernel<float>(query.maxThreadNumPerBlock(), …);

callKernel<double>(512, …);

[/codebox]

Compiling with '-arch sm_13', I now got four kernels (two for float, two for double):

[codebox]

nvcc -arch sm_13 --ptxas-options=-v …

ptxas info : Compiling entry function 'ZN4alea4math3pde4dim24cuda6ySweepIdEEvPT_S6_iiS5_S5_S5S5_S5_NS3_6MatrixIS5_EES8_' for 'sm_13'

ptxas info : Used 56 registers, 320+0 bytes lmem, 96+16 bytes smem, 168 bytes cmem[0], 68 bytes cmem[1]

ptxas info : Compiling entry function 'ZN4alea4math3pde4dim24cuda6xSweepIdEEvPT_S6_iiS5_S5_S5S5_S5_NS3_6MatrixIS5_EES8_' for 'sm_13'

ptxas info : Used 57 registers, 320+0 bytes lmem, 96+16 bytes smem, 168 bytes cmem[0], 68 bytes cmem[1]

ptxas info : Compiling entry function 'ZN4alea4math3pde4dim24cuda6ySweepIfEEvPT_S6_iiS5_S5_S5S5_S5_NS3_6MatrixIS5_EES8_' for 'sm_13'

ptxas info : Used 33 registers, 176+0 bytes lmem, 80+16 bytes smem, 168 bytes cmem[0], 120 bytes cmem[1]

ptxas info : Compiling entry function 'ZN4alea4math3pde4dim24cuda6xSweepIfEEvPT_S6_iiS5_S5_S5S5_S5_NS3_6MatrixIS5_EES8_' for 'sm_13'

ptxas info : Used 32 registers, 176+0 bytes lmem, 80+16 bytes smem, 168 bytes cmem[0], 120 bytes cmem[1]

[/codebox]

Now I know the double version's maximum register count per thread is 57, so I changed my code to specialize KernelParam for double:

[codebox]

template <> struct KernelParam<double>

{

static size_t registersPerThread() { return 57; }

};

// and call with the calculated maxThreadNum

KernelQuery<float> queryFloat;

callKernel<float>(queryFloat.maxThreadNumPerBlock(), …);

KernelQuery<double> queryDouble;

callKernel<double>(queryDouble.maxThreadNumPerBlock(), …);

[/codebox]

Running this program, it still works well:

[codebox]

Calling cudaRoutine with tstop=1 on (0030BCD8: [#.0 GeForce GTX 260: 1.3, 27 SM, GMEM=879.375 MB, SMEM=16384 B])

CUDA kernel calling…

xSweep: grid=504 block=256 shared=16128

ySweep: grid=504 block=256 shared=16128

[/codebox]

But this time maxThreadNum became 256, and 256*57 = 14592, which is below the 16384 limit, so no problem there.

So, from my tests, it seems cudaDeviceProp.regsPerBlock is not good enough for this check, or there is some hidden register usage? Can anyone give me ideas?

Thanks.

Say, you have the following variables initialized:

pDeviceProps - CUDA properties of the device you are using;

nCubinRegUsage - number of registers your kernel requires (you can check it from the verbose compiler output or using the cudaFuncGetAttributes routine);

nCubinMemUsage - amount of shared memory your kernel requires (you can check it from the verbose compiler output or using cudaFuncGetAttributes routine);

nSharedMemPerThread - amount of shared memory you are going to use for a single thread;

Now you can do this:

[codebox]

#define MAX_THREADS_PER_BLOCK_REGS(DeviceProps, RegUsage)\
	min(DeviceProps->maxThreadsPerBlock,\
	((DeviceProps->regsPerBlock / (16 * RegUsage)) & ~3) * 16)

#define MAX_THREADS_PER_BLOCK_SHMEM(DeviceProps, SharedMemUsage, SharedMemPerThread)\
	(((((int)DeviceProps->sharedMemPerBlock - SharedMemUsage) / SharedMemPerThread) /\
	DeviceProps->warpSize) * DeviceProps->warpSize)

int nThreadsPerBlockRegs = MAX_THREADS_PER_BLOCK_REGS(pDeviceProps, nCubinRegUsage);

int nThreadsPerBlockShMem = MAX_THREADS_PER_BLOCK_SHMEM(pDeviceProps, nCubinMemUsage,
	nSharedMemPerThread);

[/codebox]

nThreadsPerBlockRegs - the number of threads per block you can run when limited by registers;

nThreadsPerBlockShMem - the number of threads per block you can run when limited by shared memory;

Obviously, you can run with the minimum of the two:

int nThreadsPerBlock = min(nThreadsPerBlockRegs, nThreadsPerBlockShMem);

And, finally:

int nSharedMemPerBlock = nSharedMemPerThread * nThreadsPerBlock;

Kernel<<<Number of blocks desired, nThreadsPerBlock, nSharedMemPerBlock>>>(…);

Hope this helps.

Not sure if I got something wrong in what you presented, but if you change your kernel from float to double, register usage is expected to increase under normal conditions: a register holds 4 bytes, so a float takes 1 register and a double takes 2 registers.

Hope this helps,
kynan