Skybuck's CUDA RAM (Speed) Test version 0.10 now available! (Kernel discussion)

Hello,

Skybuck’s CUDA RAM (Speed) Test version 0.10 is now available at the following link:

Link to compressed archive:

http://www.skybuck.org/CUDA/RAMTest/version%200.10/Zipped/RAMTestVersion010.rar

Link to folder with individual files:

http://www.skybuck.org/CUDA/RAMTest/version%200.10/Unzipped/

Some improvements have been made to the test program; it looks a bit more professional to me ;)

The test program now uses 123 MB of host RAM and device RAM, so this should make it possible to run it on practically any graphics card with compute capability 2.0 or so.

The test program now also pauses after it's done, so it's easy to see the results and make a screenshot of them or copy & paste the text.
(If a command line parameter like --noprompt is specified, it will simply terminate as soon as it's done; this allows profiling with the Visual Profiler.)

The number of blocks has been reduced from 20,000 to just 4,000 so the program doesn't take as long to run.

Some additional interesting information is displayed about the device, about the settings, and about the calculated optimal dimensions and kernel launch parameters/dimensions.

A screenshot of what to expect is included as well:

And here is the output in text form (new/fresh run):

"
Test Cuda Random Memory Access Performance.
version 0.10 created on 21 july 2011 by Skybuck Flying.
program started.
Device[0].Name: GeForce GT 520
Device[0].MemorySize: 1008402432
Device[0].MemoryClockFrequency: 600000000
Device[0].GlobalMemoryBusWidthInBits: 64
Device[0].Level2CacheSize: 65536
Device[0].MultiProcessorCount: 1
Device[0].ClockFrequency: 1620000000
Device[0].MaxWarpSize: 32
Setup…
ElementCount: 8000
BlockCount: 4000
LoopCount: 80000
Initialize…
LoadModule…
OpenEvents…
OpenStream…
SetupKernel…
mKernel.Parameters.CalculateOptimalDimensions successfull.
mKernel.Parameters.ComputeCapability: 2.1
mKernel.Parameters.MaxResidentThreadsPerMultiProcessor: 1536
mKernel.Parameters.MaxResidentWarpsPerMultiProcessor: 48
mKernel.Parameters.MaxResidentBlocksPerMultiProcessor: 8
mKernel.Parameters.OptimalThreadsPerBlock: 256
mKernel.Parameters.OptimalWarpsPerBlock: 6
mKernel.Parameters.ThreadWidth: 256
mKernel.Parameters.ThreadHeight: 1
mKernel.Parameters.ThreadDepth: 1
mKernel.Parameters.BlockWidth: 16
mKernel.Parameters.BlockHeight: 1
mKernel.Parameters.BlockDepth: 1
ExecuteKernel…
ReadBackResults…
DisplayResults…
CloseStream…
CloseEvents…
UnloadModule…
ExecuteCPU…
Kernel execution time in seconds: 3.4775507812500000
CPU execution time in seconds : 1.4399700939644564
Cuda memory transactions per second: 92018785.6710395765000000
CPU memory transactions per second : 222226837.4470134930000000
program finished.
"

I hope you will give it a try and post some results here… that would be interesting!

If anybody wants to discuss the kernel used by the test program, this is the place to do that! ;) =D

Bye,
Skybuck.

The block index calculation can also be done a bit more efficiently, so a third way has been introduced in the middle, and it is currently the active one. It's pretty nice: just 5 additions and 5 multiplications plus some temporaries (uses just 4 to 5 registers). The register-count information for sm_10 and sm_20 has been updated as well:

(See * new way *)

	int BlockIndex;
	int ElementIndex;
	int LoopIndex;
	int LinearIndex;

	// alternative ways to calculate BlockIndex

/*
	// uses 9 registers for sm_10
	// uses 7 registers for sm_20
	BlockIndex =
		(threadIdx.x) +
		(threadIdx.y * blockDim.x) +
		(threadIdx.z * blockDim.x * blockDim.y) +
		(blockIdx.x * blockDim.x * blockDim.y * blockDim.z) +
		(blockIdx.y * blockDim.x * blockDim.y * blockDim.z * gridDim.x) +
		(blockIdx.z * blockDim.x * blockDim.y * blockDim.z * gridDim.x * gridDim.y);
*/

	// * new way *
	// uses 4 registers for sm_10
	// uses 5 registers for sm_20
	// still needs to be verified, but looks correct to me ;)
	BlockIndex =
		threadIdx.x + blockDim.x *
		(
			threadIdx.y + blockDim.y *
			(
				threadIdx.z + blockDim.z *
				(
					blockIdx.x + gridDim.x *
					(
						blockIdx.y + gridDim.y *
						(
							blockIdx.z
						)
					)
				)
			)
		);

/*
	// uses 8 registers for sm_10
	// uses 5 registers for sm_20
	int LinearDimension;

	BlockIndex = threadIdx.x;
	LinearDimension = blockDim.x;
	BlockIndex = BlockIndex + threadIdx.y * LinearDimension;
	LinearDimension = LinearDimension * blockDim.y;
	BlockIndex = BlockIndex + threadIdx.z * LinearDimension;
	LinearDimension = LinearDimension * blockDim.z;
	BlockIndex = BlockIndex + blockIdx.x * LinearDimension;
	LinearDimension = LinearDimension * gridDim.x;
	BlockIndex = BlockIndex + blockIdx.y * LinearDimension;
	LinearDimension = LinearDimension * gridDim.y;
	BlockIndex = BlockIndex + blockIdx.z * LinearDimension;
	LinearDimension = LinearDimension * gridDim.z;
*/