Skybuck's CUDA RAM (Speed) Test version 0.10 now available!

Hello,

Skybuck’s CUDA RAM (Speed) Test version 0.10 is now available at the following link:

Link to compressed archive:

http://www.skybuck.org/CUDA/RAMTest/version%200.10/Zipped/RAMTestVersion010.rar

Link to folder with individual files:

http://www.skybuck.org/CUDA/RAMTest/version%200.10/Unzipped/

Some improvements have been made to the test program; it looks a bit more professional to me.

The test program now uses 123 MB of host RAM and device RAM, so this should make it possible to run it on practically any graphics card with compute capability 2.0 or so.

The test program now also pauses after it's done, so it's easy to see the results and make a screenshot of them or copy & paste the text.
(If a command-line parameter like --noprompt is specified, it will simply terminate as soon as it's done; this allows profiling with the Visual Profiler.)
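Handling a flag like that on the host side is simple; a minimal sketch (the flag name --noprompt is from the post above, everything else is my own illustration, not the actual program, which isn't shown):

#include <cstdio>
#include <cstring>

int main(int argc, char** argv)
{
    bool prompt = true;
    for (int i = 1; i < argc; ++i)
        if (std::strcmp(argv[i], "--noprompt") == 0)
            prompt = false;   // terminate immediately when done (for profiling)

    // ... run the benchmark and print the results ...

    if (prompt)
    {
        std::printf("Press enter to exit...\n");
        std::getchar();       // pause so the results can be read or screenshotted
    }
    return 0;
}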

The number of blocks has been reduced from 20,000 to just 4,000, so the program doesn't take so long to run.

Some additional interesting information is displayed about the device, the settings, and the calculated optimal dimensions and kernel launch parameters/dimensions.
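For anyone curious how that device information can be obtained: a minimal sketch using the CUDA runtime API (the actual program may well use the driver API instead, given the LoadModule step below; note that memoryClockRate and clockRate are reported in kHz, so the Hz values in the output presumably come from multiplying by 1000):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);                                 // Device[0]
    std::printf("Name: %s\n", p.name);
    std::printf("MemorySize: %zu\n", p.totalGlobalMem);             // bytes
    std::printf("MemoryClockFrequency: %d\n", p.memoryClockRate);   // kHz
    std::printf("GlobalMemoryBusWidthInBits: %d\n", p.memoryBusWidth);
    std::printf("Level2CacheSize: %d\n", p.l2CacheSize);            // bytes
    std::printf("MultiProcessorCount: %d\n", p.multiProcessorCount);
    std::printf("ClockFrequency: %d\n", p.clockRate);               // kHz
    std::printf("MaxWarpSize: %d\n", p.warpSize);
    return 0;
}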

A screenshot of what to expect is included as well.

And here is the output in text form (new/fresh run):

"
Test Cuda Random Memory Access Performance.
version 0.10 created on 21 july 2011 by Skybuck Flying.
program started.
Device[0].Name: GeForce GT 520
Device[0].MemorySize: 1008402432
Device[0].MemoryClockFrequency: 600000000
Device[0].GlobalMemoryBusWidthInBits: 64
Device[0].Level2CacheSize: 65536
Device[0].MultiProcessorCount: 1
Device[0].ClockFrequency: 1620000000
Device[0].MaxWarpSize: 32
Setup…
ElementCount: 8000
BlockCount: 4000
LoopCount: 80000
Initialize…
LoadModule…
OpenEvents…
OpenStream…
SetupKernel…
mKernel.Parameters.CalculateOptimalDimensions successfull.
mKernel.Parameters.ComputeCapability: 2.1
mKernel.Parameters.MaxResidentThreadsPerMultiProcessor: 1536
mKernel.Parameters.MaxResidentWarpsPerMultiProcessor: 48
mKernel.Parameters.MaxResidentBlocksPerMultiProcessor: 8
mKernel.Parameters.OptimalThreadsPerBlock: 256
mKernel.Parameters.OptimalWarpsPerBlock: 6
mKernel.Parameters.ThreadWidth: 256
mKernel.Parameters.ThreadHeight: 1
mKernel.Parameters.ThreadDepth: 1
mKernel.Parameters.BlockWidth: 16
mKernel.Parameters.BlockHeight: 1
mKernel.Parameters.BlockDepth: 1
ExecuteKernel…
ReadBackResults…
DisplayResults…
CloseStream…
CloseEvents…
UnloadModule…
ExecuteCPU…
Kernel execution time in seconds: 3.4775507812500000
CPU execution time in seconds : 1.4399700939644564
Cuda memory transactions per second: 92018785.6710395765000000
CPU memory transactions per second : 222226837.4470134930000000
program finished.
"

I hope you will give it a try, run it, and then post some results here… that would be interesting!

Bye,
Skybuck.

Might be good to explicitly output GB/s for device->host and host->device, if this is a RAM speed test.

No, it doesn't measure the RAM of the host; it only measures the RAM of the graphics card.

(Though the CPU part of the test does measure the host RAM.) Transfer between CPU and GPU is not interesting for me; this benchmark serves a particular purpose:

Measure the RAM access speed of the CPU vs the GPU.

Actually, what happens in this test is the following:

  1. The CPU is probably using its L1 cache completely. That's why the CPU is so fast: it's probably not doing a lot of RAM access… it all gets cached.

  2. The GPU is doing a lot of actual RAM access, which makes it slow. (A sketch of what such a random-access kernel might look like follows below.)
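The actual kernel isn't shown here, but a random-access test of this kind typically chases a chain of indices through global memory, so every load depends on the previous one and defeats coalescing and caching. A sketch of that pattern (my assumption of the shape, not the real code; names and parameters are made up):

__global__ void globalRandomAccess(const unsigned int* indices,  // a permutation of 0..elementCount-1
                                   unsigned int* result,
                                   int elementCount,
                                   int loopCount)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int i = tid % elementCount;   // each thread starts somewhere in the chain
    for (int loop = 0; loop < loopCount; ++loop)
        i = indices[i];                    // dependent, effectively random global load
    result[tid] = i;                       // store so the loop isn't optimized away
}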

I shall write another kernel which uses 1 CUDA thread and shared memory, to compare shared-memory performance with CPU L1 cache performance.
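Something like the following is what I have in mind; a rough sketch under my own assumptions (8000 elements x 4 bytes = 32,000 bytes, which fits in the 48 KB of shared memory per multiprocessor on compute 2.x):

#define ELEMENT_COUNT 8000   // 8000 * 4 bytes = 32000 bytes, fits in 48 KB shared memory

__global__ void sharedRandomAccess(const unsigned int* indices,
                                   unsigned int* result,
                                   int loopCount)
{
    __shared__ unsigned int table[ELEMENT_COUNT];

    for (int j = 0; j < ELEMENT_COUNT; ++j)   // the single thread stages the chain on-chip
        table[j] = indices[j];

    unsigned int i = 0;
    for (int loop = 0; loop < loopCount; ++loop)
        i = table[i];                         // chase the chain entirely in shared memory
    result[blockIdx.x] = i;                   // keep the result live
}

// launched with 1 thread per block, e.g. sharedRandomAccess<<<blocks, 1>>>(...)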

Though the CPU cache is probably bigger, so that would scale better in the end; not entirely sure though.

CUDA also has some L2 cache whose behaviour seems largely unspecified… I haven't read much about how the L2 cache works…

So after that kernel, perhaps yet another kernel could be written which assumes L2 cache access for CUDA/GPU, to see how that performs.

If it works well, then at least it would run nicely on a single multiprocessor, with just one thread/block.

On graphics cards with multiple multiprocessors it would then work even faster, which would be interesting.
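A rough sketch of what such an L2-oriented kernel could look like (my assumptions again: the same pointer chase, but with the table sized to stay within the 64 KB L2 reported above, so after a warm-up pass the loads should mostly hit in L2):

__global__ void l2RandomAccess(const unsigned int* indices,   // table of at most 64 KB / 4 = 16384 entries
                               unsigned int* result,
                               int loopCount)
{
    unsigned int i = 0;
    for (int loop = 0; loop < loopCount; ++loop)
        i = indices[i];   // after the first pass these loads should hit in L2
    result[blockIdx.x * blockDim.x + threadIdx.x] = i;
}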

But I suspect the GPU L1 cache/shared memory will have lots of bank conflicts, so time to find out! :)
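For reference, the kind of access pattern that causes the worry; a small illustration (assuming 32 banks of 4-byte words, as on compute 2.x):

__global__ void bankConflictDemo(unsigned int* out)
{
    __shared__ unsigned int s[32 * 32];

    for (int j = threadIdx.x; j < 32 * 32; j += blockDim.x)   // fill the table cooperatively
        s[j] = j;
    __syncthreads();

    // Conflict-free: consecutive threads read consecutive 4-byte words,
    // which land in 32 different banks.
    unsigned int a = s[threadIdx.x % (32 * 32)];

    // 32-way conflict: consecutive threads read words 32 apart, which all
    // map to the same bank, so a warp's 32 reads serialize.
    unsigned int b = s[(threadIdx.x * 32) % (32 * 32)];

    out[threadIdx.x + blockIdx.x * blockDim.x] = a + b;       // keep both reads live
}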

OK, the shared memory kernel is done… it also executes 4000 blocks, but this time sequentially…

These test results made my jaw drop! LOL… which offers possibilities/hope for CUDA:

Just a single CUDA thread did this:

Text:

"
Test Cuda Random Memory Access Performance.
version 0.12 created on 21 july 2011 by Skybuck Flying.
program started.
Device[0].Name: GeForce GT 520
Device[0].MemorySize: 1008402432
Device[0].MemoryClockFrequency: 600000000
Device[0].GlobalMemoryBusWidthInBits: 64
Device[0].Level2CacheSize: 65536
Device[0].MultiProcessorCount: 1
Device[0].ClockFrequency: 1620000000
Device[0].MaxWarpSize: 32
Setup…
ElementCount: 8000
BlockCount: 4000
LoopCount: 80000
Initialize…
LoadModule…
OpenEvents…
OpenStream…
SetupKernel…
mKernel.Parameters.CalculateOptimalDimensions successfull.
mKernel.Parameters.ComputeCapability: 2.1
mKernel.Parameters.MaxResidentThreadsPerMultiProcessor: 1536
mKernel.Parameters.MaxResidentWarpsPerMultiProcessor: 48
mKernel.Parameters.MaxResidentBlocksPerMultiProcessor: 8
mKernel.Parameters.OptimalThreadsPerBlock: 256
mKernel.Parameters.OptimalWarpsPerBlock: 6
mKernel.Parameters.ThreadWidth: 256
mKernel.Parameters.ThreadHeight: 1
mKernel.Parameters.ThreadDepth: 1
mKernel.Parameters.BlockWidth: 16
mKernel.Parameters.BlockHeight: 1
mKernel.Parameters.BlockDepth: 1
ExecuteKernel…
ReadBackResults…
DisplayResults…
CloseStream…
CloseEvents…
UnloadModule…
ExecuteCPU…
Kernel execution time in seconds: 0.3385913085937500
CPU execution time in seconds : 1.4263124922301578
Cuda memory transactions per second: 945092186.0015719590000000
CPU memory transactions per second : 224354762.1879504710000000
program finished.
"

Conclusion: shared memory is HELL/SUPER FAST!

Almost 4 times faster than the CPU?!

I am going to do a little debug test with VS 2010, because this is almost unbelievable! LOL. But I believe it… geez! Cool.

Though the GPU L1 cache is probably smaller than the CPU L1 cache, which could explain its higher speed.

For real purposes I might require an even larger cache, and then maybe the results will be different… but for now it's hopeful.

In reality this probably means the GPU is twice as fast as a dual core, since the dual core will also probably be double as fast as a single core.

So if a quad-core processor faced a GT 520, my estimate is they would both be about the same speed, unless newer CPUs have even faster caches.

Whoops, there was something wrong with the kernel and also with the kernel launch parameters.

The kernel was doing only 1 block, and the launch parameters were 4000 threads.

The situation has now been corrected: the kernel is doing 4000 blocks and only 1 thread (see the sketch at the end of this post).

It turns out it's fricking slow!

"
Test Cuda Random Memory Access Performance.
version 0.12 created on 21 july 2011 by Skybuck Flying.
program started.
Device[0].Name: GeForce GT 520
Device[0].MemorySize: 1008402432
Device[0].MemoryClockFrequency: 600000000
Device[0].GlobalMemoryBusWidthInBits: 64
Device[0].Level2CacheSize: 65536
Device[0].SharedMemoryPerMultiProcessor: 49152
Device[0].RegistersPerMultiProcessor: 32768
Device[0].ConstantMemory: 65536
Device[0].MultiProcessorCount: 1
Device[0].ClockFrequency: 1620000000
Device[0].MaxWarpSize: 32
Setup…
ElementCount: 8000
BlockCount: 4000
LoopCount: 80000
Initialize…
LoadModule…
OpenEvents…
OpenStream…
SetupKernel…
mKernel.Parameters.CalculateOptimalDimensions successfull.
mKernel.Parameters.ComputeCapability: 2.1
mKernel.Parameters.MaxResidentThreadsPerMultiProcessor: 1536
mKernel.Parameters.MaxResidentWarpsPerMultiProcessor: 48
mKernel.Parameters.MaxResidentBlocksPerMultiProcessor: 8
mKernel.Parameters.OptimalThreadsPerBlock: 256
mKernel.Parameters.OptimalWarpsPerBlock: 6
mKernel.Parameters.ThreadWidth: 1
mKernel.Parameters.ThreadHeight: 1
mKernel.Parameters.ThreadDepth: 1
mKernel.Parameters.BlockWidth: 1
mKernel.Parameters.BlockHeight: 1
mKernel.Parameters.BlockDepth: 1
ExecuteKernel…
ReadBackResults…
DisplayResults…
CloseStream…
CloseEvents…
UnloadModule…
ExecuteCPU…
Kernel execution time in seconds: 24.2583750000000000
CPU execution time in seconds : 1.4263193366754714
Cuda memory transactions per second: 13191320.5233244183900000
CPU memory transactions per second : 224353685.5819891260000000
program finished.
"

(Picture already updated above).
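To make the launch mix-up described above concrete, here is a minimal standalone illustration (my reconstruction; the kernel is a stand-in, not the real one):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void oneThreadPerBlock(unsigned int* out)
{
    out[blockIdx.x] = blockIdx.x;   // stand-in for the real per-block work
}

int main()
{
    unsigned int* d;
    cudaMalloc(&d, 4000 * sizeof(unsigned int));

    // Wrong: 1 block of 4000 threads. Compute 2.x caps a block at 1024
    // threads, so this launch fails with an invalid-configuration error.
    oneThreadPerBlock<<<1, 4000>>>(d);
    std::printf("<<<1, 4000>>>: %s\n", cudaGetErrorString(cudaGetLastError()));

    // Corrected: 4000 blocks of 1 thread each.
    oneThreadPerBlock<<<4000, 1>>>(d);
    cudaDeviceSynchronize();
    std::printf("<<<4000, 1>>>: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d);
    return 0;
}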