Extremely slow smem reads: possible bank conflicts?

Hi all…

I’ve recently started optimizing some of my CUDA kernels, as most of them have been rather brute-force / naive implementations. While optimizing one of my kernels that does a ‘max filter’ on an image, I’ve run into performance issues reading from shared memory.

I’m reading approximately 117MB of data from shared memory for the case I’m testing (a 40x40 max filter on a 320x240 image): that’s 1600 bytes per thread (one thread per pixel), 256 threads per block, 300 blocks in total.

I’m currently caching the ‘block + filter size’ in pixels for each block (16x16 block + 40x40 filter gives a 56x56 = 3136-pixel tile), which comes out to 3136 bytes of shared memory per block. That would indicate I can only run 5 blocks at a time per multiprocessor (with 16KB of smem). I’m currently using a Quadro FX 570 in this case, which has 4 multiprocessors, so I can run 20 blocks at once across the whole card.
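For what it’s worth, the sizing arithmetic above can be checked on the host. This is only a sketch; the constant names are mine, not from the actual kernel:

```cpp
#include <cassert>

// Host-side sketch of the shared-memory sizing arithmetic from the post.
const int kBlockDim     = 16;                      // 16x16 threads per block
const int kFilterSize   = 40;                      // 40x40 max filter
const int kTileDim      = kBlockDim + kFilterSize; // 56
const int kSmemPerBlock = kTileDim * kTileDim;     // bytes, 1 byte per pixel
const int kSmemPerSM    = 16 * 1024;               // 16 KB on compute 1.x

// How many blocks one multiprocessor can host, limited by smem alone.
int blocksPerSM() { return kSmemPerSM / kSmemPerBlock; }
```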

The first half of the kernel takes ~5ms (quite a while for what it does, but I can ‘live’ with 5ms for now): it simply reads from global memory (properly coalesced) and writes into shared memory (avoiding bank conflicts the whole time).

The second half then loops over the filter window around each thread’s pixel, finding the maximum pixel value (reading from the cached smem values). This second half of the kernel takes 35ms. To simplify the test case, I made it simply loop 1600 times reading the same memory address (each thread’s address is 32 bits apart from the previous thread’s, so it should avoid bank conflicts).
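A host-side reference of what that second half computes might look like the following; the tile layout and names are my assumptions, not the poster’s actual code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-side reference of the kernel's second half: each "thread" scans the
// 40x40 window anchored at its pixel in the cached 56x56 tile and keeps
// the maximum value.
unsigned char windowMax(const std::vector<unsigned char>& tile,
                        int tileDim, int tx, int ty, int filterSize)
{
    unsigned char best = 0;
    for (int dy = 0; dy < filterSize; ++dy)
        for (int dx = 0; dx < filterSize; ++dx)
            best = std::max(best, tile[(ty + dy) * tileDim + (tx + dx)]);
    return best;
}
```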

My question is: what could be taking 35ms when simply reading from shared memory? (Would this indicate I DO have bank conflicts, and I’m either miscalculating my indices or don’t understand how smem banks are laid out?)

Thanks in advance…

After further investigation, it would appear my kernel is using 16 registers, despite my having told nvcc to set the max register count to 10… so I’m only getting 67% occupancy.

This still wouldn’t explain why it’s taking 35ms, though, when I’d expect it to take a few µs (maybe 1ms tops) - but I’m sure it’s slowing me down a bit.

Edit: After ripping apart my project and removing my test case from my internal test harness, I managed to get the visual profiler working with my kernel - however, it didn’t tell me anything I didn’t already know. It simply reported the kernel time and occupancy (nothing about divergence, coalescing, or anything else I would have expected the profiler to tell me).

Okay, the Programming Guide is either wrong yet again, lacking in detail, or I’m misinterpreting it (this is becoming all too familiar).

Appendix B.1.3 - Integer Functions:
“The CUDA runtime library supports integer min(x,y) and max(x,y) which map to a single instruction on the device.”

I’m calling “foo = max ( a, b )” inside my loop (which iterates 1600 times per thread), and it takes my kernel execution time from 5ms to 45-50ms. foo/a/b are all “unsigned char” types - I have however tried “int” and “unsigned int”, which lowered the kernel time from ~45ms to ~28ms - but that’s certainly not 5ms.

It’s approximately 5ms faster than using an if statement, and exactly the same as using a ternary operator… but it’s certainly not a single instruction (unless it takes an absurd number of cycles?).

Edit - This is using the following setup:
CUDA 2.0
Quadro FX 570 (compute capability 1.1)
Driver version: 177.84

How did you measure this?

If you have disabled the second part of your kernel, the dead-code optimizer has probably removed all of the first part’s code too, because no result is written back to global memory.

Also, having the addresses 32 bits apart is, as far as I know, not enough to avoid bank conflicts. The matrixMul example shows how to avoid bank conflicts when you load 16x16 values into shared memory.
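To make the bank layout concrete: on compute 1.x hardware shared memory has 16 banks, each 32 bits wide, and the word index modulo 16 selects the bank. A quick host-side check of the matrixMul-style padding trick (the function names are illustrative):

```cpp
#include <cassert>
#include <set>

// Word index mod 16 selects the bank on compute 1.x.
int bankOf(int wordIndex) { return wordIndex % 16; }

// Walking a 16-wide tile by column puts every thread of a half-warp in the
// same bank; padding each row to 17 words spreads them across all 16 banks.
int distinctBanksForColumn(int rowStrideWords)
{
    std::set<int> banks;
    for (int row = 0; row < 16; ++row)  // 16 threads of a half-warp
        banks.insert(bankOf(row * rowStrideWords));
    return (int)banks.size();
}
```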

Other than that, I would calculate how much memory you are reading and writing from global memory, and see how much time that alone takes (given your card’s memory bandwidth).

Hmm, I may have some numbers on the cycle counts of such instructions at work; I’ll check. Integer instructions are in general slower than floating point, that is for sure.

I was still writing the results back to gmem - I just wasn’t filtering them (thus avoiding the ‘max’ function), which brought it down to 5ms (the time to read the memory into smem, then write it straight back out to gmem without filtering).

I’m aware of that; I also ensured the memory was properly aligned, and that the first thread of the warp started at the first word of the aligned memory address - thus coalescing the reads appropriately for 1.0+ capable hardware.

In practice that takes ~5ms - but that includes staging through smem. In theory, at 100% bandwidth throughput, it should take ~1ms for 117MB on my card (I think).
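As a back-of-envelope check, transfer time is just bytes moved divided by peak bandwidth. I haven’t verified the FX 570’s exact bandwidth figure, so it’s left as a parameter here; at a DDR2-era number like ~12.8 GB/s, 117 MB works out closer to ~9ms than ~1ms:

```cpp
#include <cassert>

// Naive transfer-time estimate: MB moved / peak bandwidth, in milliseconds.
// The bandwidth figure is an input because card specs vary.
double transferMs(double megabytes, double gbPerSec)
{
    return megabytes / 1024.0 / gbPerSec * 1000.0;
}
```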

That’s 50 KB per warp. An SM has only 16 KB of shared memory total. Also, a 40x40 array of ints/floats is actually 40x40x4 bytes = 6400 bytes per thread. I’m not sure I understand how your kernel works, but are you sure stuff doesn’t leak to local memory?

I’m using unsigned chars - which makes memory coalescing and avoiding bank conflicts a LOT of fun… but I managed to get by.
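One common way to “get by” with unsigned chars is to have each thread move four pixels at a time as one 32-bit word, then unpack. This is only a host-side sketch of the packing idea, not the poster’s actual kernel code:

```cpp
#include <cassert>
#include <cstdint>

// Pack four 8-bit pixels into one 32-bit word (little-endian byte order),
// so each thread can issue a single coalesced 32-bit load/store.
uint32_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint32_t)a | ((uint32_t)b << 8) |
           ((uint32_t)c << 16) | ((uint32_t)d << 24);
}

// Extract pixel i (0..3) back out of the packed word.
uint8_t unpack(uint32_t word, int i)
{
    return (uint8_t)(word >> (8 * i));
}
```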

And yes, I’m reading 50 KB per warp - but I only have (block size + filter size)^2 = (16+40)^2 = 56x56 = 3136 bytes of shared memory per BLOCK.

Once again, to re-iterate - All of this works fine, and takes ~5-6ms as long as I avoid the min/max functions… (which sadly I need…).

I’ve attempted to use branchless min/max replacements, but it appears to take the exact same amount of time as the CUDA min/max functions.

e.g.: max( a, b ) = a - (( a - b ) & -( a < b )), which is at least 5 instructions (which would suggest CUDA’s min/max functions are about 5 instructions too).
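That identity does hold: when a < b, -(a < b) is all ones so the masked term collapses the expression to b; otherwise the masked term is zero. A quick host-side check (written for int, so the subtraction can’t wrap the way unsigned char would):

```cpp
#include <cassert>

// Branchless max: when a < b, -(a < b) == -1 (all ones), so
// a - ((a - b) & -1) == a - (a - b) == b; otherwise the mask is 0.
int branchlessMax(int a, int b)
{
    return a - ((a - b) & -(a < b));
}
```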

Edit: Hmm, I have a horrible feeling that when I removed the min/max functions from my loop to test the speed, the compiler may have optimized out the previous 1599 iterations (since I’m simply assigning the resulting value to an smem location without the ‘max’ function, only the last assignment really matters), which would account for the speed increase I’m seeing when I remove the max function. I’ll have to test this next week.

Are you checking whether you’re using local memory? Limiting registers to 10 is a bad idea; occupancy buys nothing after a certain point, and you don’t want 100%.

Also, have you tried the visual profiler? This will report bank conflicts, etc. (At least, when using XP.)

I’m unable to profile this at the moment, I no longer have access to an XP machine (I’m working on getting access to XP again).

I also checked the cubin file and I’m using 0 lmem.

I’ve just realised however that looping in CUDA seems to be a lot slower than I thought…

Something nice and simple like the following takes ~15ms (unrolled) and ~32ms normally:

[codebox]unsigned int max_value = 0;

for (unsigned int i = 0; i < 1600; ++i)
    max_value = max_value + i + threadIdx.x;

// The above loop takes 15-30ms
// result[index] = max_value;[/codebox]

This quite easily explains why my kernel takes so long… (even without the smem reads inside the loop).

But algorithmically it’s very… improbable that I can write a max filter without looping - and even after unrolling it fully, it takes ~15ms… which is insane.

So… is this the kind of performance I should expect from loops? (I’m assuming not, as I use loops in other kernels which only take a fraction of the time.)

There’s no divergence, I’m doing very very simple arithmetic inside the loop… on something which is apparently a register - so it ‘should’ be quite fast, no?

Edit: Added threadIdx.x to the loop arithmetic (without this, the loop unrolls rather quickly).
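That edit is telling: the test loop has a closed form, sum over i of (i + tid) = 1600*1599/2 + 1600*tid, so with a compile-time-constant thread index the compiler can fold the entire loop to one expression, which would explain the timing collapsing. A host-side check of the equivalence (function names are mine):

```cpp
#include <cassert>

// The test loop from the post, with threadIdx.x replaced by a parameter.
unsigned int loopVersion(unsigned int tid)
{
    unsigned int max_value = 0;
    for (unsigned int i = 0; i < 1600; ++i)
        max_value = max_value + i + tid;
    return max_value;
}

// Its closed form: sum_{i=0}^{1599} (i + tid) = 1600*1599/2 + 1600*tid.
// A constant tid lets the compiler fold the whole loop into this.
unsigned int closedForm(unsigned int tid)
{
    return 1600u * 1599u / 2u + 1600u * tid;
}
```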

And how many blocks/threads are you launching?

The surest way to find out what’s going on in a kernel is to run it through decuda. You can then see if it’s doing what you expect or if it’s flying to China every iteration.

P.S. I haven’t heard of loops being particularly slow. I think your observation has to do with something else, like register use.
P.P.S. try setting maxrregcount to 128, to maximize optimization

I’m launching 256 threads per block, and 300 blocks.

Setting the max register limit to 128 didn’t change much (runs at more or less the same speed) - for both the filter kernel itself, and the simple loop test case. The register count went up to 20 from 16 (for the filter kernel), but besides that - nothing interesting.

I did run the cubin through decuda - I didn’t quite understand it all though, I’m going to have to brush up on my PTX before I can get anything useful out of it.