Skybuck's RAM Test version 0.07 kernel discussion

Hello,

In this forum posting I posted details about my RAM test (which is a performance test):

http://forums.nvidia.com/index.php?showtopic=205099

In this posting I will post the kernel it uses (which was also included in the rar file):

I welcome any ideas on how to make the kernel faster without changing what it needs to do ;) which is: “traverse the whole index chain for its block”.

The kernel 0.02 is:


// test cuda random memory access performance
//
// cuda kernel version 0.02 created on 12 july 2011 by Skybuck Flying
//
extern "C"
{ // extern c begin

__global__ void Kernel( int ElementCount, int BlockCount, int LoopCount, int *Memory, int *BlockResult )
{
	int BlockIndex;
	int ElementIndex;
	int LoopIndex;
	int LinearIndex;

	// uses 9 registers for sm_10
	// alternative ways to calculate BlockIndex
/*
	BlockIndex =
		(threadIdx.x) +
		(threadIdx.y * blockDim.x) +
		(threadIdx.z * blockDim.x * blockDim.y) +
		(blockIdx.x * blockDim.x * blockDim.y * blockDim.z) +
		(blockIdx.y * blockDim.x * blockDim.y * blockDim.z * gridDim.x) +
		(blockIdx.z * blockDim.x * blockDim.y * blockDim.z * gridDim.x * gridDim.y);
*/

	// uses 8 registers for sm_10
	int LinearDimension;

	BlockIndex = threadIdx.x;
	LinearDimension = blockDim.x;

	BlockIndex = BlockIndex + threadIdx.y * LinearDimension;
	LinearDimension = LinearDimension * blockDim.y;

	BlockIndex = BlockIndex + threadIdx.z * LinearDimension;
	LinearDimension = LinearDimension * blockDim.z;

	BlockIndex = BlockIndex + blockIdx.x * LinearDimension;
	LinearDimension = LinearDimension * gridDim.x;

	BlockIndex = BlockIndex + blockIdx.y * LinearDimension;
	LinearDimension = LinearDimension * gridDim.y;

	BlockIndex = BlockIndex + blockIdx.z * LinearDimension;
	LinearDimension = LinearDimension * gridDim.z;

	if (BlockIndex < BlockCount)
	{
		ElementIndex = 0;

		for (LoopIndex = 0; LoopIndex < LoopCount; LoopIndex++)
		{
			LinearIndex = ElementIndex + (BlockIndex * ElementCount);

			// get next element index
			ElementIndex = Memory[ LinearIndex ];
		}

		// each block should output the last element index for a check up and to prevent the kernel from being reduced away ! ;)
		BlockResult[ BlockIndex ] = ElementIndex;
	}
}

} // extern c end

And its PTX is:


.version 2.3
	.target sm_20
	.address_size 32
	// compiled with C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../open64/lib//be.exe
	// nvopencc 4.0 built on 2011-05-13
	//-----------------------------------------------------------
	// Compiling C:/Users/Skybuck/AppData/Local/Temp/tmpxft_000010cc_00000000-11_CudaMemoryTest.cpp3.i (C:/Users/Skybuck/AppData/Local/Temp/ccBI#.a04384)
	//-----------------------------------------------------------
	//-----------------------------------------------------------
	// Options:
	//-----------------------------------------------------------
	//  Target:ptx, ISA:sm_20, Endian:little, Pointer Size:32
	//  -O3	(Optimization level)
	//  -g0	(Debug level)
	//  -m2	(Report advisories)
	//-----------------------------------------------------------
	.file	1	"C:/Users/Skybuck/AppData/Local/Temp/tmpxft_000010cc_00000000-10_CudaMemoryTest.cudafe2.gpu"
	.file	2	"c:\tools\microsoft visual studio 10.0\vc\include\codeanalysis\sourceannotations.h"
	.file	3	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\crt/device_runtime.h"
	.file	4	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\host_defines.h"
	.file	5	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\builtin_types.h"
	.file	6	"c:\tools\cuda\toolkit 4.0\v4.0\include\device_types.h"
	.file	7	"c:\tools\cuda\toolkit 4.0\v4.0\include\driver_types.h"
	.file	8	"c:\tools\cuda\toolkit 4.0\v4.0\include\surface_types.h"
	.file	9	"c:\tools\cuda\toolkit 4.0\v4.0\include\texture_types.h"
	.file	10	"c:\tools\cuda\toolkit 4.0\v4.0\include\vector_types.h"
	.file	11	"c:\tools\cuda\toolkit 4.0\v4.0\include\builtin_types.h"
	.file	12	"c:\tools\cuda\toolkit 4.0\v4.0\include\host_defines.h"
	.file	13	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\device_launch_parameters.h"
	.file	14	"c:\tools\cuda\toolkit 4.0\v4.0\include\crt\storage_class.h"
	.file	15	"C:\Tools\Microsoft Visual Studio 10.0\VC\bin/../../VC/INCLUDE\time.h"
	.file	16	"O:/CUDA C/Tests/test cuda memory test/version 0.02/CudaMemoryTest.cu"
	.file	17	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\common_functions.h"
	.file	18	"c:\tools\cuda\toolkit 4.0\v4.0\include\math_functions.h"
	.file	19	"c:\tools\cuda\toolkit 4.0\v4.0\include\math_constants.h"
	.file	20	"c:\tools\cuda\toolkit 4.0\v4.0\include\device_functions.h"
	.file	21	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_11_atomic_functions.h"
	.file	22	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_12_atomic_functions.h"
	.file	23	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_13_double_functions.h"
	.file	24	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_20_atomic_functions.h"
	.file	25	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_20_intrinsics.h"
	.file	26	"c:\tools\cuda\toolkit 4.0\v4.0\include\surface_functions.h"
	.file	27	"c:\tools\cuda\toolkit 4.0\v4.0\include\texture_fetch_functions.h"
	.file	28	"c:\tools\cuda\toolkit 4.0\v4.0\include\math_functions_dbl_ptx3.h"

	.entry Kernel (
		.param .s32 __cudaparm_Kernel_ElementCount,
		.param .s32 __cudaparm_Kernel_BlockCount,
		.param .s32 __cudaparm_Kernel_LoopCount,
		.param .u32 __cudaparm_Kernel_Memory,
		.param .u32 __cudaparm_Kernel_BlockResult)
	{
	.reg .u32 %r<43>;
	.reg .pred %p<5>;
	.loc	16	4	0
$LDWbegin_Kernel:
	cvt.s32.u32 	%r1, %ntid.x;
	mov.u32 	%r2, %tid.y;
	mul.lo.u32 	%r3, %r1, %r2;
	mov.u32 	%r4, %ntid.y;
	mul.lo.u32 	%r5, %r1, %r4;
	cvt.s32.u32 	%r6, %tid.x;
	add.u32 	%r7, %r6, %r3;
	mov.u32 	%r8, %tid.z;
	mul.lo.u32 	%r9, %r8, %r5;
	mov.u32 	%r10, %ntid.z;
	mul.lo.u32 	%r11, %r10, %r5;
	add.u32 	%r12, %r7, %r9;
	mov.u32 	%r13, %ctaid.x;
	mul.lo.u32 	%r14, %r13, %r11;
	mov.u32 	%r15, %nctaid.x;
	mul.lo.u32 	%r16, %r15, %r11;
	add.u32 	%r17, %r12, %r14;
	mov.u32 	%r18, %ctaid.y;
	mul.lo.u32 	%r19, %r18, %r16;
	mov.u32 	%r20, %nctaid.y;
	mul.lo.u32 	%r21, %r20, %r16;
	add.u32 	%r22, %r17, %r19;
	mov.u32 	%r23, %ctaid.z;
	mul.lo.u32 	%r24, %r23, %r21;
	add.u32 	%r25, %r22, %r24;
	ld.param.s32 	%r26, [__cudaparm_Kernel_BlockCount];
	setp.le.s32 	%p1, %r26, %r25;
	@%p1 bra 	$Lt_0_2050;
	ld.param.s32 	%r27, [__cudaparm_Kernel_LoopCount];
	mov.u32 	%r28, 0;
	setp.le.s32 	%p2, %r27, %r28;
	@%p2 bra 	$Lt_0_3586;
	mov.s32 	%r29, %r27;
	ld.param.s32 	%r30, [__cudaparm_Kernel_ElementCount];
	mul.lo.s32 	%r31, %r30, %r25;
	ld.param.u32 	%r32, [__cudaparm_Kernel_Memory];
	mov.s32 	%r33, 0;
	mov.s32 	%r34, 0;
	mov.s32 	%r35, %r29;
$Lt_0_3074:
 //<loop> Loop body line 4, nesting depth: 1, estimated iterations: unknown
	.loc	16	55	0
	add.s32 	%r36, %r31, %r34;
	mul.lo.u32 	%r37, %r36, 4;
	add.u32 	%r38, %r32, %r37;
	ld.global.s32 	%r34, [%r38+0];
	add.s32 	%r33, %r33, 1;
	setp.ne.s32 	%p3, %r27, %r33;
	@%p3 bra 	$Lt_0_3074;
	bra.uni 	$Lt_0_2562;
$Lt_0_3586:
	mov.s32 	%r34, 0;
$Lt_0_2562:
	.loc	16	59	0
	ld.param.u32 	%r39, [__cudaparm_Kernel_BlockResult];
	mul.lo.u32 	%r40, %r25, 4;
	add.u32 	%r41, %r39, %r40;
	st.global.s32 	[%r41+0], %r34;
$Lt_0_2050:
	.loc	16	68	0
	exit;
$LDWend_Kernel:
	} // Kernel

Bye,

Skybuck.

Pretty important: this probably explains the low random memory access performance of a multiprocessor:

I think I now also understand better why the random access memory test performed so badly.

The random memory access test does 1 memory access per thread.

So let’s assume 32 threads are executed in parallel; this means 32 memory accesses per clock cycle.

The multiprocessor only has room for 1024 threads. Because the first 32 threads stall immediately, it switches to the next warp.

So 1024 / 32 = 32.

This means after 32 clock cycles all thread contexts have been used up… and all 1024 threads are now stalled waiting for memory.

The memory latency is said to be about 600 clock cycles.

So for 600 - 32 = 568 clock cycles CUDA is waiting and doing nothing :(

If the thread resource limit were higher, for example, then it would be:

1536 / 32 = 48 clock cycles… and 600 - 48 = 552 is still a lot of waiting time.

This even assumes the worst case scenario; in reality it probably executes 48 threads in parallel.

So real numbers are probably:

1024 / 48 ≈ 21 clock cycles.

After 21 clock cycles all threads are stalled and waiting for memory :(

So an interesting question for hardware developers would be:

“How many thread contexts/resources does CUDA need to completely hide memory latency?”

Let’s leave branches and other slight instruction overhead out of the equation.

Assuming CUDA issues 48 memory requests per clock cycle, it’s a pretty easy formula:

cuda cores * memory latency = number of thread contexts needed.

So in this case:

48 * 600 clock cycles = 28800 thread contexts.

So CUDA should have at least 28800 thread resources per multiprocessor to completely hide memory latency.

This would be the best case/extreme case.

In reality perhaps some clock cycles per memory request are spent on branching or increasing an index or so…

Still having it maxed out would be nice.

Now let’s compare best case to current situation:

28800 / 1024 ≈ 28 clock cycles.

So CUDA effectively assumes that each thread will spend 28 clock cycles on overhead.

For my ram test this is probably not the case… and the overhead is perhaps 3 clock cycles or so… maybe even less…

So at least to me CUDA seems “thread contexts/resources” starved, at least for random memory access.

This seems to be the bottleneck for now; once this bottleneck is lifted in the future, perhaps only then would the DRAM 32-byte memory transaction size become a limit.

But for now, CUDA seems thread resource starved :(

Hmm, now I am not so sure anymore: by changing the threads per block from 1024 to 256, according to the occupancy calculator, this should max out the number of threads in use on the multiprocessor, which would be 1536 instead of just 1024.

This should have given a higher number of memory transactions per second, but it didn’t… so perhaps the bottleneck is somewhere else…

I am also unsure why 256 threads per block would give 100% occupancy for compute 2.1?!

Ok, this is a bit wacky but here goes; there are apparently further constraints as follows:

Maximum number of resident warps per multiprocessor = 48 for compute 2.0

^ This number is the number of groups (each group being 32 threads, so a total of 48x32 = 1536 threads).

However, each multiprocessor can only have 8 blocks (maximum number of resident blocks per multiprocessor = 8 for compute 2.0); since warps are responsible for executing the blocks, the warps need to be distributed over the blocks.

So this gives the following formula:

MaxResidentWarps / MaxResidentBlock = MaxResidentWarpsPerBlock.

So plugging in the numbers gives:

48 / 8 = 6 resident warps per block.

Since each warp has 32 threads this gives:

6x32 = 192 resident threads per block.

Since there are 8 blocks this gives: 8 x 192 resident threads = 1536 threads.

So 256 threads per block probably wasn’t optimal. Maybe the calculator was wrong, maybe it used some extra available threads, or maybe I made a mistake in the formulas above. When I first did some calculations with the calculator, 256 seemed to make sense, but now it doesn’t make that much sense to me anymore…

I am going to give 192 a try and see what happens; so far the occupancy calculator still says: 100%

Well, these constraints allow multiple optimal solutions, at least when it comes to occupancy.

So far 128, 192, 256 all give 100% occupancy, though 192 seems to perform slightly worse than the rest.

Also no further increase, bottleneck remains 92 million memory transactions per second.

128 is probably not optimal (google didn’t refresh the results, I think…); 128 threads per block would give too many blocks: 1536 / 128 = 12.

So it’s either: 1536 / 256 = 6 blocks each of (48/6) = 8 warps = 8 * 32 = 256 threads again.

or

1536 / 192 = 8 blocks each of (48/8) = 6 warps = 6 * 32 = 192 threads again.

The complete list of optimal occupancy for thread block size is:

192, 256, 384, 512, 768

This is pretty easy to try out:

1536/8 = 192
1536/7 = bad
1536/6 = 256
1536/5 = bad
1536/4 = 384
1536/3 = 512
1536/2 = 768

Number of threads cannot exceed 1024 so /1 falls off.
Number of blocks cannot exceed 8 so /9 and above falls off.

Some divisions lead to fractions so those fall off.

Which leaves the 5 solutions above.

However, the warps must also be distributed across the blocks, so further calculations could be done to see if they’re nicely distributed, just to make sure each block completes in the same time. This is probably not a requirement, but it’s interesting anyway:

48 / 8 = 6
48 / 6 = 8
48 / 4 = 12
48 / 3 = 16
48 / 2 = 24

So surprisingly, even 3 blocks gives a nice warp distribution! =D

1536 (maximum resident threads per multiprocessor)
8 (maximum resident blocks per multiprocessor)
48 (maximum resident warps per multiprocessor)