Global Memory vs Constant vs Texture Fetch Performance

Hello,

I’m writing a Perlin noise generator kernel, and I’m trying to optimize it.

The kernel uses two small tables (8 KB total) of precomputed random values (part of the Perlin noise algorithm).

Each thread performs 30 reads from the tables, at essentially random locations.

At the end, the kernel writes a single float value to global memory as its result.

I have tried three different versions of the kernel, placing the tables in different memory spaces: global memory, constant memory, and textures.
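
Roughly, the three variants place the tables like this (a simplified sketch, not my exact code; names like c_table and noiseGlobal are placeholders):

// Sketch of the three table placements. The real tables are ~8 KB of
// precomputed random values (2048 floats in this sketch).

// 1) Global memory: allocated with cudaMalloc and passed as a kernel argument.
__global__ void noiseGlobal(const float *table, float *out);

// 2) Constant memory: filled once from the host with cudaMemcpyToSymbol.
__constant__ float c_table[2048];
__global__ void noiseConstant(float *out);

// 3) Texture: bound to a linear buffer with cudaBindTexture, read with tex1Dfetch.
texture<float, 1, cudaReadModeElementType> t_table;
__global__ void noiseTexture(float *out);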

The execution time of the three methods is almost the same (less than 1% difference).

I’m using the CUDA Visual Profiler, and these are the results:

Global Memory
[profiler screenshot]

Constants
[profiler screenshot]

Texture Fetching
[profiler screenshot]

The benchmark tries all the possible <<<numBlocks, blockSize>>> combinations and selects the best:

As you can see, the execution times are almost the same for the three methods.

Global memory: 77% gld coalesced / 22% instructions. GPU Time: 2213 / Occupancy: 0.25

Constants: 68% warp serialize / 30% instructions. GPU Time: 1657 / Occupancy: 0.75

Textures: 2% gst coalesced / 97% instructions. GPU Time: 1118 / Occupancy: 0.25

I’m really confused.

This code is going to be part of a personal project: http://www.coopdb.com/modules.php?name=BR2fsaa&op=Info

Please, I need advice on optimizing my code.

I run a quad-core Xeon 3350 @ 3.6 GHz and an eVGA GTX 285 SSC.

By the way, the code runs 27× faster on the GPU than on the CPU, but I think it could be faster.

Thank you very much!

Well, the first thing you need to do is add some error checking. Are the results even what you’d expect them to be? I’d be surprised if they were: a majority of your benchmarked configurations are returning immediately because you are passing invalid block/grid configurations.

Specifically, the maximum number of threads per block is 512 (or fewer if your kernel uses many registers), so most of your initial tests are returning immediately with errors. And the maximum value of either grid dimension is 65535, so your last run is also returning immediately with an error.
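
For example, something along these lines around each launch (a minimal, untested sketch; dummyKernel and tryLaunch are just placeholders for your benchmark loop):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *out) { out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f; }

// Launch one configuration and report any error (sketch).
static bool tryLaunch(int numBlocks, int blockSize, float *d_out)
{
    dummyKernel<<<numBlocks, blockSize>>>(d_out);
    cudaError_t err = cudaGetLastError();       // invalid block/grid configurations show up here
    if (err == cudaSuccess)
        err = cudaThreadSynchronize();          // errors during execution show up here
    if (err != cudaSuccess) {
        printf("<<<%d, %d>>> failed: %s\n", numBlocks, blockSize, cudaGetErrorString(err));
        return false;
    }
    return true;
}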

Also note that it really only makes sense to use block sizes that are multiples of 32 (the warp size).

So with this in mind, if you look at a reasonable configuration that is likely running without errors, [1024, 64], you see that the performance is indeed very different among the three runs.

I think that you are right.

I ran the verification test at <<<256, 256>>>, and because of that I wasn’t doing any error checking.

So the texture-fetch version has the best performance.

But now I have a terrible fear. The best result with <<<1024, 64>>> is 0.255724 s, and my multi-threaded SSE3 Perlin version needs 0.65 s on my quad-core to run the same test. The GPU is only 2.5× faster than my CPU, and I’ve got one of the fastest GPUs at the moment :(

Well, thank you very much for your input.

I’ll be busy trying to make it run faster.

It works now, but the performance isn’t as great as expected.

Which flavor of Perlin noise are you using?

If you’re not tied to a table of precomputed gradients for random lookups, it may be cheaper to have no lookup at all and do some hashing right in the kernel, replacing a table read with some math. On a CPU a table lookup is often faster, but GPUs have FLOPS to spare.

A very cheap hash for Perlin noise, mapping an integer (x, y, z) triple to a float in [0, 1), might look like the following code. (This is NOT a good PRNG, just a little drop-in snippet for a quick test.)

This hash does tile with a period of 2^24 in x, y, and z. That can be fixed, but it’s likely fine for Perlin noise. [Perlin’s own original implementation tiled with a period of 2^8.]

I’m typing this in the forum without actually trying this, but it’s a strategy I’ve used a lot in the past. If it is faster than the table, and you need to improve randomness, I can help with it.

__device__ float Hash3(int x, int y, int z)
{
    // Mix the coordinates with three large odd constants (they must fit in 32 bits).
    int h = __mul24(x, 0xDEADBEEF) ^ __mul24(y, 0xABCDEF01) ^ __mul24(z, 0x98765431);
    h = h ^ (h >> 11);  // move information from higher bits down to lower bits; signed shift is OK!
    h = h ^ (h >> 6);
    // Full 32-bit integer multiply (overkill?), then scale to [0, 1).
    return (1.0f / 4294967296.0f) * (((unsigned int)h) * 0x64213579);
}
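
If you want to drop it in for the gradient table, you could build a pseudo-random gradient from three hashes, something like this (again untested; the offsets are arbitrary decorrelation constants):

// Untested sketch: a pseudo-random gradient for lattice point (x, y, z)
// built from three decorrelated hashes instead of a table lookup.
__device__ float3 GradientFromHash(int x, int y, int z)
{
    float3 g;
    g.x = 2.0f * Hash3(x, y, z)       - 1.0f;  // map [0,1) to [-1,1)
    g.y = 2.0f * Hash3(x + 57, y, z)  - 1.0f;  // arbitrary offsets decorrelate the hashes
    g.z = 2.0f * Hash3(x, y + 131, z) - 1.0f;
    return g;
}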

PM sent :)

I have a new version.

It uses two methods: texture fetching and shared memory.
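
The shared-memory path copies the tables into shared memory at the start of each block, roughly like this (simplified sketch; the real names and table layout differ):

#define TABLE_SIZE 2048  // 8 KB of floats in this sketch

__global__ void noiseShared(const float *g_table, float *out)
{
    __shared__ float s_table[TABLE_SIZE];

    // Cooperative copy: the threads of the block stride over the table.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        s_table[i] = g_table[i];
    __syncthreads();

    // ... per-thread Perlin noise evaluation reading s_table[...] ...
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = s_table[idx & (TABLE_SIZE - 1)];  // placeholder for the real result
}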

It’s 6.5× faster than the CPU. It will be hard to make it faster.

You can leech it here: http://www.speedyshare.com/455357158.html

I have problems running it on my old G80. If somebody can try it, I would like to know if it works on other cards.

A new “beta” version 0.46:

http://www.speedyshare.com/633582280.html

It will benchmark your CPU vs your GPU.

It supports multi-GPU rigs too.

You can specify the number of GPUs to use from the command line (see the sketch after the examples below). You will need to disable SLI to use multiple GPUs with CUDA, according to NVIDIA’s documentation.

Examples:

br2perlin 1 5 → This will use just 1 GPU

br2perlin 2 5 → This will use 2 GPUs
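
Each GPU has to be driven from its own host thread (one CUDA context per thread); the idea is roughly this (sketch, not the actual library code; names are placeholders):

#include <cstdlib>
#include <cuda_runtime.h>

// Sketch: bind one host thread to one GPU, then split the output rows between GPUs.
// Names and the row-splitting scheme are placeholders.
void gpuWorker(int deviceId, int firstRow, int numRows)
{
    cudaSetDevice(deviceId);  // must be called in this thread before any other CUDA call
    // allocate buffers and launch the Perlin kernel for rows [firstRow, firstRow + numRows) ...
}

int main(int argc, char **argv)
{
    int requested = (argc > 1) ? atoi(argv[1]) : 1;
    int available = 0;
    cudaGetDeviceCount(&available);
    int gpus = (requested < available) ? requested : available;
    // spawn one host thread per GPU (e.g. with pthreads), each calling gpuWorker() ...
    return gpus > 0 ? 0 : 1;
}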

The library also supports mixing the CPU and GPU at the same time. In theory, when I designed it, I thought that CPU+GPU was going to be faster, but, due to the asynchronous nature of CUDA, it ends up slower than the CPU or GPU alone.

My BR2 Patch is using the new CUDA code now, and the Perlin effects run on the GPU.

Unfortunately, if you only have one graphics card, this is not a good idea, because the framerate drops due to the resources used for the CUDA calculations. But if you have two graphics cards, you won’t lose any fps, and the Perlin code will run faster on the GPU (allowing bigger and more complex effects).

Basically, I’ve written this to use my old 8800 GTX to run the Perlin effects, and my GTX 285 to render the shiny graphics at 1920x1200 with 2x SSAA :)

The results of my Xeon 3350 @ 3.6 GHz + eVGA GTX 285 SSC:

On my system, the GPU is 6.5× faster than the CPU.

GF 8600 GTS 256MB vs. PentiumD 3.0

CPU SSE3 4 Threads

Total Time: 8.217884, Min: -0.698635, Max: 0.798246, Range: 1.496881

GPU

Total Time: 0.111804

I’m surprised that my card is only 0.01 s slower than the fastest card on the market…

I guess there is more room for improvement in your algorithm.

The GTX 285 has 2× the shared memory, 2× the number of registers, and 7.5× the number of stream processors, if I remember correctly.

No, all CUDA GPUs have the same amount of shared memory (16 KB per multiprocessor).

I think that the kernel is returning with errors.

Try this version: http://www.speedyshare.com/633582280.html

On my old 8800 GTX, it needs 0.22 s.

The code is working properly, and it’s integrated into the BR2 patch:

CPU - SSE3 - 3 Threads: 76 fps

GPU - CUDA - 1 Thread (= 1 GPU): 46 fps

I’m just worried about the performance.

I removed all the memory accesses to check the impact of memory latency on performance, and the code ran 2× faster.

I also changed the algorithm to process several ‘pixels’ per CUDA thread. This should have helped reduce the shared-memory initialization time, because you need fewer thread blocks, but it ended up 33% slower.
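
The several-pixels-per-thread variant just wraps the evaluation in a small per-thread loop, roughly like this (sketch; PIXELS_PER_THREAD and noiseMultiPixel are placeholders):

#define PIXELS_PER_THREAD 4  // placeholder value

__global__ void noiseMultiPixel(float *out, int totalPixels)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * PIXELS_PER_THREAD;
    for (int i = 0; i < PIXELS_PER_THREAD; ++i) {
        int p = base + i;
        if (p < totalPixels)
            out[p] = 0.0f;  // placeholder for the actual Perlin evaluation of pixel p
    }
}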

I also reduced the table lookups from 3 levels to just 1 by using a big 256 MB 3D texture holding the float4 gradient values, but again, it ended up slower than the shared-memory version. And this version used only 8 memory accesses.

I don’t know how to make it run faster.

Anyway, I’m happy with my new multi-threading code. I have rewritten all my classes from scratch, and the new code is great. I can use several work queues, divide the work across several devices (CPU + GPU), use any number of threads, and assign a weight per thread to do load balancing between the different devices (CPU / GPU), as sketched below.
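
The weights are simply fractions used to split each job between the workers; the idea is roughly this (sketch; splitWork and the row-based split are placeholders, not my actual class names):

// Sketch: split a job's rows between workers according to per-worker weights
// (e.g. measured relative throughput of each CPU/GPU worker).
void splitWork(int totalRows, const float *weights, int numWorkers, int *rowsPerWorker)
{
    float sum = 0.0f;
    for (int w = 0; w < numWorkers; ++w)
        sum += weights[w];

    int assigned = 0;
    for (int w = 0; w < numWorkers; ++w) {
        rowsPerWorker[w] = (int)(totalRows * weights[w] / sum);
        assigned += rowsPerWorker[w];
    }
    rowsPerWorker[numWorkers - 1] += totalRows - assigned;  // hand the remainder to the last worker
}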