My kernel is quite large. I'll try to simplify it and post it, but I also have to check whether I'm allowed to post the code, since I don't own the original code; I have just ported it to CUDA.
I have the following in my code:
#define MULTIPLIER 1
#define SUB_FRINGE_SIZE ( BLOCK_SIZE*GRID_SIZE*MULTIPLIER )
What I found weird is that when I change the MULTIPLIER constant, I get different register, lmem and cmem[1] usage. I use SUB_FRINGE_SIZE to allocate memory on the device, e.g.:
cudaMalloc( (void **) &r_tableDevice, SUB_FRINGE_SIZE * GRID_SIZE * BLOCK_SIZE * sizeof(int) );
cudaMalloc( fringeSize, SUB_FRINGE_SIZE * SUB_FRINGE_SIZE * sizeof(float) );
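For reference, here is a rough host-side sketch of how many bytes those two cudaMalloc calls request for each MULTIPLIER, using the BLOCK_SIZE and GRID_SIZE values from my code. The helper names (subFringeSize, rTableBytes, fringeBytes, printAllocationSizes) are made up for illustration and are not in my real code, and it assumes 4-byte int and float.

```cpp
#include <cstdint>
#include <cstdio>

#define BLOCK_SIZE 16
#define GRID_SIZE 256

// Hypothetical helper: SUB_FRINGE_SIZE with MULTIPLIER passed as an argument.
constexpr std::uint64_t subFringeSize(std::uint64_t multiplier) {
    return static_cast<std::uint64_t>(BLOCK_SIZE) * GRID_SIZE * multiplier;
}

// Bytes requested for r_tableDevice (assumes sizeof(int) == 4).
constexpr std::uint64_t rTableBytes(std::uint64_t multiplier) {
    return subFringeSize(multiplier) * GRID_SIZE * BLOCK_SIZE * sizeof(int);
}

// Bytes requested for the fringe buffer (assumes sizeof(float) == 4).
constexpr std::uint64_t fringeBytes(std::uint64_t multiplier) {
    return subFringeSize(multiplier) * subFringeSize(multiplier) * sizeof(float);
}

// Prints both allocation sizes in MiB for MULTIPLIER 1 through 6.
void printAllocationSizes() {
    for (std::uint64_t m = 1; m <= 6; ++m) {
        std::printf("MULTIPLIER %llu: r_table %llu MiB, fringe %llu MiB\n",
                    static_cast<unsigned long long>(m),
                    static_cast<unsigned long long>(rTableBytes(m) >> 20),
                    static_cast<unsigned long long>(fringeBytes(m) >> 20));
    }
}
```

The sketch does the arithmetic in 64 bits on purpose. In the macro version the first multiplications are done in int, so with these values SUB_FRINGE_SIZE * SUB_FRINGE_SIZE would overflow from MULTIPLIER 12 upwards; that is not an issue for the values 1 to 6 I tested.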
I’m using the following values currently in my code:
#define BLOCK_SIZE 16
#define GRID_SIZE 256
Using __launch_bounds__( BLOCK_SIZE*BLOCK_SIZE, 2 )
#define MULTIPLIER 1
#define SUB_FRINGE_SIZE ( BLOCK_SIZE*GRID_SIZE*MULTIPLIER )
compiler output:
ptxas info : Used 32 registers, 88+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 112 bytes cmem[1]
#define MULTIPLIER 2
#define SUB_FRINGE_SIZE ( BLOCK_SIZE*GRID_SIZE*MULTIPLIER )
compiler output:
ptxas info : Used 30 registers, 80+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 112 bytes cmem[1]
#define MULTIPLIER 3
#define SUB_FRINGE_SIZE ( BLOCK_SIZE*GRID_SIZE*MULTIPLIER )
compiler output:
ptxas info : Used 32 registers, 96+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 116 bytes cmem[1]
#define MULTIPLIER 4
#define SUB_FRINGE_SIZE ( BLOCK_SIZE*GRID_SIZE*MULTIPLIER )
compiler output:
ptxas info : Used 30 registers, 80+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 116 bytes cmem[1]
#define MULTIPLIER 5
#define SUB_FRINGE_SIZE ( BLOCK_SIZE*GRID_SIZE*MULTIPLIER )
compiler output:
ptxas info : Used 32 registers, 96+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 112 bytes cmem[1]
#define MULTIPLIER 6
#define SUB_FRINGE_SIZE ( BLOCK_SIZE*GRID_SIZE*MULTIPLIER )
compiler output:
ptxas info : Used 32 registers, 96+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 116 bytes cmem[1]
Any ideas why changing the MULTIPLIER define would cause this? Is it something to do with power-of-2 allocations?
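To check the power-of-2 idea against the numbers above, here is a small host-side sketch that marks which MULTIPLIER values make SUB_FRINGE_SIZE a power of two and prints them next to the ptxas figures. The names (isPowerOfTwo, subFringeSize, PtxasRow, kRows, printComparison) are hypothetical helpers written just for this post; the numbers in the table are copied from the compiler output above.

```cpp
#include <cstdio>

#define BLOCK_SIZE 16
#define GRID_SIZE 256

// True when n is a non-zero power of two.
constexpr bool isPowerOfTwo(unsigned n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// SUB_FRINGE_SIZE for a given multiplier, mirroring the define above.
constexpr unsigned subFringeSize(unsigned multiplier) {
    return BLOCK_SIZE * GRID_SIZE * multiplier;
}

// One line of ptxas output per MULTIPLIER value.
struct PtxasRow {
    unsigned multiplier;
    unsigned registers;
    unsigned lmemBytes;
    unsigned cmem1Bytes;
};

// Values copied from the ptxas info lines above.
static const PtxasRow kRows[] = {
    {1, 32, 88, 112}, {2, 30, 80, 112}, {3, 32, 96, 116},
    {4, 30, 80, 116}, {5, 32, 96, 112}, {6, 32, 96, 116},
};

// Prints each row together with whether SUB_FRINGE_SIZE is a power of two.
void printComparison() {
    for (const PtxasRow& row : kRows) {
        std::printf("MULTIPLIER %u: SUB_FRINGE_SIZE %u (%s), %u regs, %u lmem, %u cmem[1]\n",
                    row.multiplier, subFringeSize(row.multiplier),
                    isPowerOfTwo(subFringeSize(row.multiplier)) ? "pow2" : "not pow2",
                    row.registers, row.lmemBytes, row.cmem1Bytes);
    }
}
```

Going by this, the power-of-two sizes (MULTIPLIER 1, 2 and 4) do not line up cleanly with the lower register counts: MULTIPLIER 1 uses 32 registers like 3, 5 and 6, and cmem[1] is 112 bytes for 1, 2 and 5 but 116 bytes for 3, 4 and 6. So if powers of two are involved, it does not look like the whole story.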