I’m writing a Perlin noise generator kernel in CUDA, and I’m trying to optimize it.
The kernel uses two small tables (8 KB total) of precomputed random values, as the Perlin noise algorithm requires.
Each thread performs 30 reads from the tables, at effectively random locations.
At the end, each thread writes a single float result to global memory.
I have tried three versions of the kernel, placing the tables in different memory spaces: global memory, constant memory, and textures.
The overall execution times of the three methods are almost the same (less than 1% difference).
I’m using the CUDA Visual Profiler. The benchmark tries every possible <<<numBlocks, blockSize>>> combination and selects the best one. These are the profiler results for the three methods:
- Global memory: 77% gld coalesced / 22% instructions — GPU time: 2213 — occupancy: 0.25
- Constant memory: 68% warp serialize / 30% instructions — GPU time: 1657 — occupancy: 0.75
- Textures: 2% gst coalesced / 97% instructions — GPU time: 1118 — occupancy: 0.25
I’m really confused.
This code is going to be part of a personal project: http://www.coopdb.com/modules.php?name=BR2fsaa&op=Info
Please, I need advice on how to optimize my code.
My system is a quad-core Xeon 3350 @ 3.6 GHz with an eVGA GTX 285 SSC.
By the way, the code already runs 27x faster on the GPU than on the CPU, but I think it could be faster.
Thank you very much!