Skybuck's RAM Test version 0.07 available.

Hello,

Skybuck's RAM Test version 0.07 is now available at the following link, either as a WinRAR archive or as loose files (3):

File:

http://www.skybuck.org/CUDA/RAMTest/SkybuckRAMTestV007b.rar

Folder:

http://www.skybuck.org/CUDA/RAMTest/

What the test does is the following:

It creates 20,000 blocks. Each block has 8,000 elements. Each element is a 32-bit integer (4 bytes).

Each block has one execution thread.

The execution thread "travels" through the elements in a random fashion (RAM read test).

It stores/writes the last element it processed to BlockResult[BlockIndex], to verify that it actually did any processing at all.
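To make the setup concrete, here is a minimal sketch of what such a kernel could look like in CUDA C. The names (RandomWalkKernel, Memory, BlockResult) and the pointer-chasing layout (each element storing the index of the next element to visit within its block) are my own illustration of the idea, not the actual test source:

__global__ void RandomWalkKernel(const int *Memory, int *BlockResult,
                                 int ElementCount, int LoopCount)
{
    int Base  = blockIdx.x * ElementCount; // start of this block's elements
    int Index = 0;                         // current position within the block

    // Each read depends on the previous one, so the hardware cannot
    // coalesce or prefetch anything: a pure random-access read test.
    for (int Loop = 0; Loop < LoopCount; Loop++)
        for (int Element = 0; Element < ElementCount; Element++)
            Index = Memory[Base + Index];

    BlockResult[blockIdx.x] = Index; // store the last element processed
}

With the settings above this would be launched with one thread per block, something like RandomWalkKernel<<< 20000, 1 >>>( Memory, BlockResult, 8000, LoopCount );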

This test is performed on both GPU and CPU. (On the CPU only one thread/core is used for now; perhaps a future test will include multi-threading.)

The timing and performance results are then displayed at the bottom.

The GT 520 GPU and the AMD X2 3800+ dual core (using a single thread) performed as follows:

Kernel execution time in seconds: 25.0870683593750000
CPU execution time in seconds : 11.8696194628088207

Cuda memory transactions per second: 63777878.5898704829000000
CPU memory transactions per second : 134797918.7549603890000000

Conclusion: a single CPU thread is twice as fast as the GPU.

Note: this test requires 611 megabytes (20,000 x 8,000 x 4 = 640,000,000 bytes) to be free/available on both CPU and GPU.

I would be very much interested in how this test performs on your system/CPU/GPU.

So if you do run this test on your system, please post the results below (just the four lines as above is enough/fine).

You can also e-mail results to:

skybuck2000@hotmail.com

Bye,
Skybuck.

I did some more tests with different settings after seeing the depressing results for random memory access with CUDA, probably caused by register dependencies and such.

These graphics cards are supposed to be good at linear/vector-like access, so I tested that somewhat as well.

If the number of elements is just one, CUDA performs extremely well: 10 times as fast as the CPU.

If the number of elements is 10, CUDA still performs roughly 5 times as fast as the CPU.

So there is still some hope inside of me that CUDA will be useful for, for example, video codecs.

I was hoping to use CUDA for something else, but I guess that will have to go back into the freezer for now.

Or I could give OpenCL a try and see if somebody's ATI card does better, but OpenCL seems somewhat boring, and there is very little information about the instruction set used by OpenCL programs.

So perhaps I should spend some time giving my lossless video codec another try, but this time use CUDA to see if it can achieve faster performance and perhaps even higher resolution, which would be nice.

It needs to be at least twice as fast for somewhat decent frame rates at normal resolution, and then 4 times as fast on top of that for double resolution (4 times the pixels)… so it needs to be 8 times as fast in total.

Seeing a speed up of 10 is nice.

However, a parallel algorithm might also require some rounds… but the test settings already covered that somewhat: loops were set to 10. I just did another test with 100 loops, and CUDA is still 3 times faster than the CPU.

Time for a more serious test. I set elements to 1, which would correspond to 32-bit colors. I set blocks to 1920x1200, and I set loops to 22 (for a parallel scan simulation) * 60 (for the video frequency) = 1320.

I won't reveal the numbers, lol. But I can tell you: the GPU is 40 times as fast as the CPU! LOL.

That puts a big smile on my face! =D

CUDA just made my day really happy! =D

Sigh… so CUDA should be perfectly suited for writing video codecs, as long as the video codecs do their work as sequentially as possible.

Bye,
Skybuck.

This is what I get:

LoadModule...

vCudaError: 301

Exception: TCudaMemoryTest.LoadModule failed.

Do I need an sm20 device to run the ptx?

I think so… the test program uses a stream and events to measure the kernel execution time… I think this requires compute capability 2.0.

You could try re-compiling the .cu file to ptx with sm10 or compute10 or so (see the example command below).

Or simply change the .target sm_20 directive to sm_10 in the ptx file.

And then run the program… but it will probably crash or something.
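Something along these lines should do it (assuming the source file is called CudaMemoryTest.cu; the exact flags may differ per toolkit version):

nvcc -ptx -arch=compute_10 CudaMemoryTest.cu -o CudaMemoryTestCompute10.ptx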

I have uploaded two more files to the web folder.

Folder:

http://www.skybuck.org/CUDA/RAMTest/

Files:

http://www.skybuck.org/CUDA/RAMTest/CudaMemoryTestCompute10.ptx
http://www.skybuck.org/CUDA/RAMTest/CudaMemoryTestCompute20.ptx

The first one is compiled for compute 1.0, the second for compute 2.0.

So download these two files to where the test program is.

Then replace CudaMemoryTest.ptx with one of the two files above.

And then run the test program.

Compute 1.0 and 2.0 have both been tested with the test program on my GT 520, and both versions work.

(However, the test program assumes the CUDA 4.0 API is present, so I am not sure if that will work on older cards; perhaps a driver update might help.)

Hmm, strange:

CUDA error 301 means "file not found".

You should make sure that the *.ptx files are present in the same folder as TestProgram.exe ?!?
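For what it's worth, the loading step boils down to a driver API call like the following C sketch (the test program itself appears to be written in Delphi, judging by the TCudaMemoryTest class name); cuModuleLoad returns error 301 (CUDA_ERROR_FILE_NOT_FOUND) when the ptx is not found next to the executable:

#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice  Device;
    CUcontext Context;
    CUmodule  Module;

    cuInit(0);
    cuDeviceGet(&Device, 0);
    cuCtxCreate(&Context, 0, Device); // a context is needed before loading

    // Fails with 301 (CUDA_ERROR_FILE_NOT_FOUND) if the file is missing.
    CUresult Result = cuModuleLoad(&Module, "CudaMemoryTest.ptx");
    if (Result != CUDA_SUCCESS)
        printf("vCudaError: %d\n", (int)Result);

    return 0;
}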

The number of blocks doesn't really matter.

I tested with 2000 and it gives the same performance results; it just takes less time to test, since the result is, after all, divided by the seconds taken.

The higher block numbers were just to test whether they might help.

Anyway, I have managed to find a little optimization trick via PTX.

It's quite significant too.

By adding a "cop" ("cache operation") specifier, CUDA can be made to run faster:

The following instruction was changed from:

ld.global.s32 %r34, [%r38+0];

to:

ld.global.cg.s32 %r34, [%r38+0];

This seems to give 50% more performance for random memory access with CUDA!

However, more careful tests should be done to be sure… maybe it's just for this particular situation, but the difference is so big that there is probably something to it!
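For those who would rather not patch the ptx by hand, the same load can be emitted from CUDA C with inline PTX. A minimal sketch (the helper name is mine; the "l" constraint assumes a 64-bit pointer, use "r" on 32-bit platforms):

// Issue an ld.global.cg.s32 (cache at global level, i.e. in L2 but not L1)
// instead of the plain ld.global.s32 the compiler would generate.
__device__ int LoadGlobalCG(const int *Address)
{
    int Value;
    asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(Value) : "l"(Address));
    return Value;
}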

Surprisingly, the .cs "cop" (cache streaming) did not give more performance, which is what I tried first.

I still have others to try, but this is already pretty spectacular!

Especially since everything else I tried with code adjustments didn't help!

So there is still hope yet to squeeze some more performance out of it! =D

The CPU is still faster by a large margin, though!

The following technique works more or less the same way at the source level:

The following parameter is altered

from:

int *Memory,

to:

volatile int *Memory,

This produces the instruction:

ld.volatile.global.s32 %r34, [%r38+0];

I also tried adding .cg after .global in the volatile version, but that is not allowed; the two specifiers conflict…

volatile already indicates that no cache operations are allowed.

This gives the same 50% performance increase, which is very nice!

Bye,
Skybuck.

I was reading the PTX 2.3 manual to see if there are any more juicy tricks in it… at the very start it mentions using 1 thread per pixel, or 1 thread per tiniest element! :)

I wrote about that idea, and it feels like I had it a long time ago…

Well, if CUDA wants tiny elements then CUDA will get tiny elements… I have plenty of tiny elements for CUDA! LOL.

Tomorrow I am going to give each memory cell its own thread.

And then it will probably crash, or complain about running out of CUDA memory or resources or something, but it's still an interesting idea to see how it works out! :)
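The canonical one-thread-per-element indexing would look like this (a generic sketch, not the actual test code):

__global__ void PerElementKernel(const int *Memory, int *Result, int ElementCount)
{
    // One thread per memory cell: compute this thread's global element index.
    int Element = blockIdx.x * blockDim.x + threadIdx.x;

    if (Element < ElementCount)            // guard threads past the end
        Result[Element] = Memory[Element]; // one read, one write per thread
}

Launched with something like PerElementKernel<<< (ElementCount + 255) / 256, 256 >>>( Memory, Result, ElementCount );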

Stay tuned ! =D

I think I just profiled/reverse-engineered this GT 520 graphics card!

If the graphics card has a memory system running at 1200 MHz, a bus width of 64 bits, and it only achieves roughly 64 million reads per second without cache tricks, then this leads to the following reverse-engineering/profiling formula:

( ( hertz * bus bits ) / 8 (bits per byte) ) / ( observed maximum reads per second ) = bytes per read

( bytes per read / element size ) = minimum transfer size
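Plugging in the GT 520 numbers from above (1200 MHz, 64-bit bus, roughly 64 million observed reads per second, 4-byte elements) as a worked example:

( 1,200,000,000 * 64 / 8 ) / 64,000,000 = 9,600,000,000 / 64,000,000 = 150 bytes per read

150 bytes per read / 4 bytes per element = 37.5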

So plugging in the numbers leads to 37.5, which is close to 32 bytes.

The extra 5.5 is probably cache hits.

The 32 bytes as a minimum transfer size seems to match what somebody else wrote on the forum somewhere.

However, perhaps it was a little mistake to divide the bytes per read by the element size… since the integer itself is probably already part of the transfer…

Without that division it gives 150 bytes as the minimum transfer size.

Perhaps it's 128 bytes as the minimum transfer size on compute 2.0 / the GT 520, and the rest is cache hits?!?

Pretty important: this probably explains the low random memory access performance of a multiprocessor.

I think I now also understand better why the random access memory test performed so badly.

The random memory access test does 1 memory access per thread.

So let's assume 32 threads are executed in parallel; this means 32 memory accesses per clock cycle.

The multiprocessor only has room for 1024 threads. Because the first 32 threads stall immediately, it switches to the next warp.

So 1024 / 32 = 32.

This means that after 32 clock cycles all thread contexts have been used up… and all 1024 threads are now stalled, waiting for memory.

The memory latency is said to be about 600 clock cycles.

So for 600 - 32 = 568 clock cycles CUDA is waiting and doing nothing :(

If the thread resources were higher, then it would for example be:

1536 / 32 = 48 clock cycles… and 600 - 48 still leaves a lot of waiting time.

This even assumes the worst-case scenario; in reality it probably executes 48 threads in parallel.

So the real numbers are probably:

1024 / 48 ≈ 21 clock cycles.

After 21 clock cycles, all threads are stalled and waiting for memory :(

So an interesting question for hardware developers would be:

"How many thread contexts/resources does CUDA need to completely hide memory latency?"

Let's leave branches and other slight instruction overhead out of the equation.

Assuming CUDA issues 48 memory requests per clock cycle, it's a pretty easy formula:

cuda cores * memory latency = number of thread contexts needed.

So in this case:

48 * 600 clock cycles = 28800 thread contexts.

So CUDA should have at least 28800 thread resources per multiprocessor to completely hide memory latency.

This would be the best case/extreme case.

In reality perhaps some clock cycles per memory request are spent on branching or increasing an index or so…

Still, having it maxed out would be nice.

Now let's compare the best case to the current situation:

28800 / 1024 ≈ 28 clock cycles.

So CUDA assumes that each thread will spend 28 clock cycles on overhead.

For my RAM test this is probably not the case… the overhead is perhaps 3 clock cycles or so… maybe even less…

So at least to me, CUDA seems "thread context/resource" starved, at least for random memory access.

This seems to be the bottleneck for now; once this bottleneck is lifted in the future, perhaps only then would the 32-byte DRAM memory transaction size become a limit.

But for now, CUDA seems thread-resource starved :(

Hmm, now I am not so sure anymore: according to the occupancy calculator, changing the threads per block from 1024 to 256 should max out the number of threads in use on the multiprocessor, which would be 1536 instead of just 1024.

This should have given a higher amount of memory transactions per second, but it didn't… so perhaps the bottleneck is somewhere else…

Ok, this is a bit whacky, but here goes; there are apparently further constraints, as follows:

Maximum number of resident warps per multiprocessor = 48 for compute 2.0

^ This number is the number of warps (each warp being a group of 32 threads, so a total of 48 x 32 = 1536 threads).

However, each multiprocessor can only have 8 blocks; since warps are responsible for executing the blocks, the warps need to be distributed over the blocks, so this gives:
(Maximum number of resident blocks per multiprocessor = 8 for compute 2.0)

So this gives the following formula:

MaxResidentWarps / MaxResidentBlocks = MaxResidentWarpsPerBlock.

So plugging in the numbers gives:

48 / 8 = 6 resident warps per block.

Since each warp has 32 threads this gives:

6 x 32 = 192 resident threads per block.

Since there are 8 blocks, this gives: 8 x 192 = 1536 resident threads.

So 256 threads per block probably wasn't optimal. Maybe the calculator was wrong, or maybe it used some extra available threads, or maybe I made a mistake in the formulas above; when I first did some calculations with the calculator, 256 seemed to make sense, but now it doesn't make that much sense to me anymore…

I am going to give 192 a try and see what happens; so far the occupancy calculator still says: 100%.

Well, these constraints allow multiple optimal solutions, at least when it comes to occupancy.

So far 128, 192 and 256 all give 100% occupancy, though 192 seems to perform slightly worse than the rest.

128 is probably not optimal; Google didn't refresh the results, I think… 128 threads per block would give too many blocks: 12.

So it's either: 1536 / 256 = 6 blocks, each of (48 / 6) = 8 warps = 8 * 32 = 256 threads again.

or

1536 / 192 = 8 blocks, each of (48 / 8) = 6 warps = 6 * 32 = 192 threads again.

The complete list of optimal occupancy for thread block size is:

192, 256, 384, 512, 768

This is pretty easy to try out:

1536/8 = 192
1536/7 = bad
1536/6 = 256
1536/5 = bad
1536/4 = 384
1536/3 = 512
1536/2 = 768

The number of threads cannot exceed 1024, so /1 falls off.
The number of blocks cannot exceed 8, so /9 and above fall off.

Some divisions lead to fractions, so those fall off as well.

Which leaves the 5 solutions above.
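Just to double-check that enumeration, here is a throwaway host-side snippet (my own, not from the test program) applying the constraints above:

#include <stdio.h>

int main(void)
{
    // Block sizes that exactly fill a compute 2.0 multiprocessor:
    // 1536 resident threads, at most 8 resident blocks,
    // at most 1024 threads per block, and whole warps of 32 only.
    for (int Blocks = 8; Blocks >= 1; Blocks--)
    {
        if (1536 % Blocks != 0) continue;  // fractions fall off
        int Threads = 1536 / Blocks;
        if (Threads > 1024) continue;      // /1 falls off
        if (Threads % 32 != 0) continue;   // whole warps only
        printf("%d blocks x %d threads\n", Blocks, Threads);
    }
    return 0;
}

It prints exactly the five solutions: 8 x 192, 6 x 256, 4 x 384, 3 x 512 and 2 x 768.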

However, the warps must also be distributed across the blocks, so further calculations could be done to see whether they are nicely distributed, just to make sure each block completes within the same time; this is probably not a requirement, but it's interesting anyway:

48 / 8 = 6
48 / 6 = 8
48 / 4 = 12
48 / 3 = 16
48 / 2 = 24

So surprisingly, even /3 produces a nice warp distribution! =D

For reference, the compute 2.0 limits used above:

1536 (maximum resident threads per multiprocessor)
8 (maximum resident blocks per multiprocessor)
48 (maximum resident warps per multiprocessor)

I see.