I’ve recently started optimizing some of my CUDA kernels, as most of them have been rather brute-force / naive implementations. However, while optimizing one of my CUDA functions that does a ‘max filter’ on an image, I’ve run into performance issues regarding reading from shared memory…
I’m reading approximately 117MB of data from shared memory for the case I’m testing (a 40x40 max filter on a 320x240 image) - that’s 1600 bytes per thread (one thread per pixel: 76800 pixels x 1600 bytes ≈ 117MB), with 256 threads per block, totaling 300 blocks.
I’m currently caching the ‘block + filter size’ in pixels for each block (16x16 block + 40x40 filter, giving a 56x56 tile) - which comes out to 3136 bytes of shared memory per block - which would indicate I can only run 5 blocks at a time per multiprocessor (with 16KB smem). I’m currently using a Quadro FX 570 in this case, which has 4 multiprocessors - which means I can run 20 blocks at once across the whole card.
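Roughly, the tile layout looks like this (a simplified sketch, not my exact code - the names are illustrative):

#define TILE_DIM   16                       // 16x16 threads per block, one per pixel
#define FILTER_DIM 40                       // 40x40 max filter
#define SMEM_DIM   (TILE_DIM + FILTER_DIM)  // 56x56 apron-inclusive tile

__global__ void maxFilterKernel( const unsigned char* src, unsigned char* dst, int width, int height )
{
    __shared__ unsigned char tile[SMEM_DIM][SMEM_DIM];  // 56x56 = 3136 bytes per block
    // ... cooperative, coalesced load of the tile from gmem goes here ...
}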
The first half of the kernel takes ~5ms (which is quite a while for what it does, but I can ‘live’ with 5ms for now) - it simply reads the data from global memory (properly coalesced) and writes it into shared memory (avoiding bank conflicts the whole time).
The second half then loops over the filter window around the pixel the thread is assigned to, finding the maximum pixel value (reading from the cached smem values). This second half of the kernel takes 35ms. To simplify the test case, I made it simply loop 1600 times, reading the same memory address each iteration (each thread’s address is 32 bits apart from the previous thread’s, so it should avoid bank conflicts)…
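The second half is essentially this (again a simplified sketch - x/y are the thread’s pixel coordinates, not shown here):

unsigned char best = 0;
for ( int fy = 0; fy < FILTER_DIM; ++fy )
    for ( int fx = 0; fx < FILTER_DIM; ++fx )
        best = max( best, tile[threadIdx.y + fy][threadIdx.x + fx] );  // 1600 smem reads per thread
dst[y * width + x] = best;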
My question is: what could be taking 35ms when simply reading from shared memory? (Would this indicate I DO have bank conflicts, and that I’m either miscalculating my indices, or don’t understand how smem banks are laid out?)
After further investigation, it would appear my kernel is using 16 registers - despite my having told nvcc to cap the register count at 10… Thus I’m only getting 67% occupancy…
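(For reference, I’m capping registers via the compiler flag, i.e. something like:

nvcc --maxrregcount=10 -o maxfilter maxfilter.cu
)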
This still wouldn’t explain why it’s taking 35ms, though - I’d expect it to take a few µs, maybe 1ms tops - but I’m sure the lower occupancy is slowing me down a bit.
Edit: After ripping apart my project and removing my test case from my internal test harness, I managed to get the visual profiler working with my kernel - however, it didn’t tell me anything I didn’t already know: it simply reported the kernel time and occupancy (nothing about divergence, uncoalesced memory accesses, or anything else I would have expected the profiler to tell me).
Okay, the Programming Guide is either wrong yet again, or lacking in detail, or I’m misinterpreting it. (This is becoming all too familiar.)
Appendix B.1.3 - Integer Functions
“The CUDA runtime library supports integer min(x,y) and max(x,y) which map to a single instruction on the device.”
I’m calling “foo = max( a, b )” inside my loop (which iterates 1600 times per thread) - and it takes my kernel execution time from 5ms to 45-50ms… foo/a/b are all “unsigned char” types. I have, however, attempted to use “int” and “unsigned int”, which lowered the kernel time from ~45ms to ~28ms - but it’s certainly not 5ms.
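The test loop boils down to this (simplified - smem/idx stand in for my actual names):

for ( int i = 0; i < 1600; ++i )
    foo = max( foo, smem[idx] );  // same smem address every iteration in this test

Removing the max() call (leaving a plain assignment) drops the kernel back to ~5ms.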
It’s approximately 5ms faster than using an if statement, and exactly the same as using a ternary operator… but it’s certainly not a single instruction… (unless a single instruction takes an absurd number of cycles?)
Edit - This is using the following setup:
Quadro FX 570 (compute capability 1.1)
Driver version: 177.84
I was still writing the results back to gmem - I just wasn’t filtering them - thus avoiding the ‘max’ function - which brought the time down to 5ms (the time it takes to read the memory into smem, then write it straight back out to gmem, without filtering).
I’m aware of that; I also ensured the memory was properly aligned, and that the first thread of the warp started at the first word of the aligned memory address - thus coalescing the reads appropriately for 1.0+ capable hardware.
In practice that takes ~5ms - but that includes copying it into smem. In theory, given 100% bandwidth utilization, it should take ~1ms for 117MB on my card (I think).
That’s 50KB per warp. An SM has only 16KB of shared memory total. Also, a 40x40 array of ints/floats is actually 40x40x4 bytes = 6400 bytes per thread. I’m not sure I understand how your kernel works, but are you sure stuff doesn’t leak to local memory?
I’m using unsigned chars - which makes memory coalescing and avoiding bank conflicts a LOT of fun… but I managed to get by.
And yes, I’m reading 50KB per warp - but I only have 56x56 (block size + filter size = (16+40)^2 = 56x56) = 3136 bytes of shared memory per BLOCK.
Once again, to reiterate - all of this works fine, and takes ~5-6ms, as long as I avoid the min/max functions… (which sadly I need…).
I’ve attempted to use branchless min/max replacements, but they appear to take the exact same amount of time as the CUDA min/max functions.
e.g. max( a, b ) = a - (( a - b ) & -( a < b )) - which is at least 5 instructions… (which would suggest CUDA’s min/max functions are about 5 instructions too)
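i.e. as a device function:

__device__ unsigned int branchlessMax( unsigned int a, unsigned int b )
{
    // if a < b, the mask is all ones and we get a - (a - b) = b;
    // otherwise the mask is zero and we get a
    return a - ( ( a - b ) & -( a < b ) );
}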
Edit: Hmm, I have a horrible feeling that when I removed the min/max functions from my loop to test the speed, the compiler may have optimized out the previous 1599 iterations (as I’m simply assigning the resulting value to an smem variable without the ‘max’ function - so only the last assignment really matters) - which would account for the speed increase I’m seeing when I remove the max function. I’ll have to test this next week.
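Something like this should stop nvcc from throwing the loop away (an assumption on my part - accumulating instead of overwriting, so the result depends on every read):

unsigned int sum = 0;
for ( int i = 0; i < 1600; ++i )
    sum += smem[idx];  // can’t be dead-stripped: every iteration feeds the result
dst[y * width + x] = (unsigned char)sum;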
I’m launching 256 threads per block, and 300 blocks.
Setting the max register limit to 128 didn’t change much (it runs at more or less the same speed) - for both the filter kernel itself and the simple loop test case. The register count went up from 16 to 20 (for the filter kernel), but besides that, nothing interesting.
I did run the cubin through decuda - I didn’t quite understand it all, though; I’m going to have to brush up on my PTX before I can get anything useful out of it.