Global Memory vs Constant vs Texture Fetch Performance

Hello,

I’m writing a Perlin noise generator kernel, and I’m trying to optimize it.

The kernel uses two small tables (8 KB total) of precomputed random values (part of the Perlin noise algorithm).

Each thread performs 30 reads from the tables, at essentially random locations.

At the end, the kernel writes a single float value to global memory as its result.

I have tried three different versions of the kernel, placing the tables in different memory spaces: global memory, constant memory, and textures.
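
Roughly, the three variants place the tables like this (a simplified sketch, not my exact code; names like c_table and noiseGlobal are placeholders):

// Sketch of the three table placements. The real tables are ~8 KB of
// precomputed random values (2048 floats in this sketch).

// 1) Global memory: allocated with cudaMalloc and passed as a kernel argument.
__global__ void noiseGlobal(const float *table, float *out);

// 2) Constant memory: filled once from the host with cudaMemcpyToSymbol.
__constant__ float c_table[2048];
__global__ void noiseConstant(float *out);

// 3) Texture: bound to a linear buffer with cudaBindTexture, read with tex1Dfetch.
texture<float, 1, cudaReadModeElementType> t_table;
__global__ void noiseTexture(float *out);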

The execution time of the three methods is almost the same (less than 1% difference).

I’m using the CUDA Visual Profiler, and these are the results:

Global Memory
[profiler screenshot]

Constants
[profiler screenshot]

Texture Fetching
[profiler screenshot]

The benchmark tries all the possible <<<numBlocks, blockSize>>> combinations and selects the best:

As you can see, the execution times are almost the same for the three methods.

Global memory: 77% gld coalesced / 22% instructions. GPU Time: 2213 / Occupancy: 0.25

Constants: 68% warp serialize / 30% instructions. GPU Time: 1657 / Occupancy: 0.75

Textures: 2% gst coalesced / 97% instructions. GPU Time: 1118 / Occupancy: 0.25

I’m really confused.

This code is going to be part of a personal project: http://www.coopdb.com/modules.php?name=BR2fsaa&op=Info

Please, I need advice on optimizing my code.

I run a quad-core Xeon 3350 @ 3.6 GHz and an eVGA GTX 285 SSC.

By the way, the code runs 27× faster on the GPU than on the CPU, but I think it could be faster.

Thank you very much!

Well, the first thing you need to do is add some error checking. Are the results even what you’d expect them to be? I’d be surprised if they were: a majority of your benchmarked configurations are returning immediately because you are passing invalid block/grid configurations.

Specifically, the maximum number of threads per block is 512 (or fewer if your kernel uses many registers), so most of your initial tests are returning immediately with errors. And the maximum value of either grid dimension is 65535, so your last run is also returning immediately with an error.
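
For example, something along these lines around each launch (a minimal, untested sketch; dummyKernel and tryLaunch are just placeholders for your benchmark loop):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *out) { out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f; }

// Launch one configuration and report any error (sketch).
static bool tryLaunch(int numBlocks, int blockSize, float *d_out)
{
    dummyKernel<<<numBlocks, blockSize>>>(d_out);
    cudaError_t err = cudaGetLastError();       // invalid block/grid configurations show up here
    if (err == cudaSuccess)
        err = cudaThreadSynchronize();          // errors during execution show up here
    if (err != cudaSuccess) {
        printf("<<<%d, %d>>> failed: %s\n", numBlocks, blockSize, cudaGetErrorString(err));
        return false;
    }
    return true;
}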

Also note that it really only makes sense to use block sizes that are multiples of 32 (the warp size).

So with this in mind, if you look at a reasonable configuration that is likely running without errors, [1024, 64], you see that the performance is indeed very different among the three runs.

I think that you are right.

I ran the verification test at <<<256, 256>>>, and because of that I wasn’t doing any error checking.

So the texture-fetch version has the best performance.

But now I have a terrible fear. The best result with <<<1024, 64>>> is 0.255724 s, and my multi-threaded SSE3 Perlin version needs 0.65 s on my quad-core to run the same test. The GPU is only 2.5× faster than my CPU, and I’ve got one of the fastest GPUs at the moment :(

Well, thank you very much for your input.

I’ll be busy trying to make it run faster.

It works now, but the performance isn’t as great as expected.

Which flavor of Perlin noise are you using?

If you’re not tied to a table of precomputed gradients for random lookups, it may be cheaper to have no lookup at all and do some hashing right in the kernel, replacing a table read with some math. On a CPU a table lookup is often faster, but GPUs have FLOPS to spare.

A very cheap hash for Perlin noise, mapping an integer (x, y, z) triple to a float in [0, 1), might look like the following code. (This is NOT a good PRNG, just a little drop-in snippet for a quick test.)

This hash does tile with a period of 2^24 in x, y, and z. That can be fixed, but it’s likely fine for Perlin noise. [Perlin’s own original implementation tiled with a period of 2^8.]

I’m typing this in the forum without actually trying this, but it’s a strategy I’ve used a lot in the past. If it is faster than the table, and you need to improve randomness, I can help with it.

__device__ float Hash3(int x, int y, int z)
{
    // Mix the coordinates with three large odd constants (they must fit in 32 bits).
    int h = __mul24(x, 0xDEADBEEF) ^ __mul24(y, 0xABCDEF01) ^ __mul24(z, 0x98765431);
    h = h ^ (h >> 11);  // move information from higher bits down to lower bits; signed shift is OK!
    h = h ^ (h >> 6);
    // Full 32-bit integer multiply (overkill?), then scale to [0, 1).
    return (1.0f / 4294967296.0f) * (((unsigned int)h) * 0x64213579);
}
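
If you want to drop it in for the gradient table, you could build a pseudo-random gradient from three hashes, something like this (again untested; the offsets are arbitrary decorrelation constants):

// Untested sketch: a pseudo-random gradient for lattice point (x, y, z)
// built from three decorrelated hashes instead of a table lookup.
__device__ float3 GradientFromHash(int x, int y, int z)
{
    float3 g;
    g.x = 2.0f * Hash3(x, y, z)       - 1.0f;  // map [0,1) to [-1,1)
    g.y = 2.0f * Hash3(x + 57, y, z)  - 1.0f;  // arbitrary offsets decorrelate the hashes
    g.z = 2.0f * Hash3(x, y + 131, z) - 1.0f;
    return g;
}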

PM sent :)

I have a new version.

It uses two methods: texture fetching and shared memory.
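
The shared-memory path copies the tables into shared memory at the start of each block, roughly like this (simplified sketch; the real names and table layout differ):

#define TABLE_SIZE 2048  // 8 KB of floats in this sketch

__global__ void noiseShared(const float *g_table, float *out)
{
    __shared__ float s_table[TABLE_SIZE];

    // Cooperative copy: the threads of the block stride over the table.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        s_table[i] = g_table[i];
    __syncthreads();

    // ... per-thread Perlin noise evaluation reading s_table[...] ...
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = s_table[idx & (TABLE_SIZE - 1)];  // placeholder for the real result
}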

It’s 6.5× faster than the CPU. It will be hard to make it faster.

You can leech it here: http://www.speedyshare.com/455357158.html

I have problems running it on my old G80. If somebody can try it, I would like to know if it works on other cards.

A new “beta” version 0.46:

http://www.speedyshare.com/633582280.html

It will benchmark your CPU vs your GPU.

It supports multi-GPU rigs too.

You can specify the number of GPUs to use from the command line (see the sketch after the examples below). You will need to disable SLI to use multiple GPUs with CUDA, according to NVIDIA’s documentation.

Examples:

br2perlin 1 5 → This will use just 1 GPU

br2perlin 2 5 → This will use 2 GPUs
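
Each GPU has to be driven from its own host thread (one CUDA context per thread); the idea is roughly this (sketch, not the actual library code; names are placeholders):

#include <cstdlib>
#include <cuda_runtime.h>

// Sketch: bind one host thread to one GPU, then split the output rows between GPUs.
// Names and the row-splitting scheme are placeholders.
void gpuWorker(int deviceId, int firstRow, int numRows)
{
    cudaSetDevice(deviceId);  // must be called in this thread before any other CUDA call
    // allocate buffers and launch the Perlin kernel for rows [firstRow, firstRow + numRows) ...
}

int main(int argc, char **argv)
{
    int requested = (argc > 1) ? atoi(argv[1]) : 1;
    int available = 0;
    cudaGetDeviceCount(&available);
    int gpus = (requested < available) ? requested : available;
    // spawn one host thread per GPU (e.g. with pthreads), each calling gpuWorker() ...
    return gpus > 0 ? 0 : 1;
}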

The library also supports mixing the CPU and GPU at the same time. In theory, when I designed it, I thought that CPU+GPU was going to be faster, but, due to the asynchronous nature of CUDA, it ends up slower than the CPU or GPU alone.

My BR2 Patch is using the new CUDA code now, and the Perlin effects run on the GPU.

Unfortunately, if you only have one graphics card, this is not a good idea, because the framerate drops due to the resources used for the CUDA calculations. But if you have two graphics cards, you won’t lose any fps, and the Perlin code will run faster on the GPU (allowing bigger and more complex effects).

Basically, I’ve written this to use my old 8800 GTX to run the Perlin effects, and my GTX 285 to render the shiny graphics at 1920x1200 with 2x SSAA :)

The results of my Xeon 3350 @ 3.6 GHz + eVGA GTX 285 SSC:

On my system, the GPU is 6.5× faster than the CPU.

GF 8600 GTS 256MB vs. PentiumD 3.0

CPU SSE3 4 Threads

Total Time: 8.217884, Min: -0.698635, Max: 0.798246, Range: 1.496881

GPU

Total Time: 0.111804

I’m surprised that my card is only 0.01 s slower than the fastest card on the market…

I guess there is more room for improvement in your algorithm.

The GTX 285 has 2× the shared memory, 2× the number of registers, and 7.5× the number of stream processors, if I remember correctly.

No, all CUDA GPUs have the same amount of shared memory (16 KB per multiprocessor).

I think that the kernel is returning with errors.

Try this version: http://www.speedyshare.com/633582280.html

On my old 8800 GTX, it needs 0.22 s.

The code is working properly, and it’s integrated into the BR2 patch:

CPU - SSE3 - 3 Threads: 76 fps

GPU - CUDA - 1 Thread (= 1 GPU): 46 fps

I’m just worried about the performance.

I removed all the memory accesses to check the impact of memory latency on performance, and the code ran 2× faster.

I also changed the algorithm to process several ‘pixels’ per CUDA thread. This should have helped reduce the shared-memory initialization time, because you need fewer thread blocks, but it ended up 33% slower.
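
The several-pixels-per-thread variant just wraps the evaluation in a small per-thread loop, roughly like this (sketch; PIXELS_PER_THREAD and noiseMultiPixel are placeholders):

#define PIXELS_PER_THREAD 4  // placeholder value

__global__ void noiseMultiPixel(float *out, int totalPixels)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * PIXELS_PER_THREAD;
    for (int i = 0; i < PIXELS_PER_THREAD; ++i) {
        int p = base + i;
        if (p < totalPixels)
            out[p] = 0.0f;  // placeholder for the actual Perlin evaluation of pixel p
    }
}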

I also reduced the table lookups from 3 levels to just 1 by using a big 256 MB 3D texture holding the float4 gradient values, but again, it ended up slower than the shared-memory version. And this version used only 8 memory accesses.

I don’t know how to make it run faster.

Anyway, I’m happy with my new multi-threading code. I have rewritten all my classes from scratch, and the new code is great. I can use several work queues, divide the work across several devices (CPU + GPU), use any number of threads, and assign a weight per thread to do load balancing between the different devices (CPU / GPU), as sketched below.
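
The weights are simply fractions used to split each job between the workers; the idea is roughly this (sketch; splitWork and the row-based split are placeholders, not my actual class names):

// Sketch: split a job's rows between workers according to per-worker weights
// (e.g. measured relative throughput of each CPU/GPU worker).
void splitWork(int totalRows, const float *weights, int numWorkers, int *rowsPerWorker)
{
    float sum = 0.0f;
    for (int w = 0; w < numWorkers; ++w)
        sum += weights[w];

    int assigned = 0;
    for (int w = 0; w < numWorkers; ++w) {
        rowsPerWorker[w] = (int)(totalRows * weights[w] / sum);
        assigned += rowsPerWorker[w];
    }
    rowsPerWorker[numWorkers - 1] += totalRows - assigned;  // hand the remainder to the last worker
}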