Maximum of a Histogram with NPP: what am I doing wrong?

Hi guys,

this is probably a stupid question as there must be some obvious mistake in my code, but I can’t find out where.

What I want to do is this: I have a 14 bit image and want to calculate its histogram, then calculate the maximum of that histogram and paint it in a kernel.

d_Hist = nppsMalloc_32s(levels - 1);

nppiHistogramEvenGetBufferSize_16u_C1R(size_img, levels, &hbufsize);

d_buffer = nppsMalloc_8u(hbufsize);

nppiHistogramEven_16u_C1R(d_img, d_img_pitch, size_in, d_Hist, levels, 0, 16384, d_buffer);

nppsMax_32s(d_Hist, levels - 1, d_hMax, d_buffer);

PaintHist_kernen<<<nBlocks, threadsPerBlock>>>(d_Hist, *d_hMax, d_histImg, d_histImg_pitch);

d_img is my 14 bit image on the device, d_Hist its histogram, and d_histImg is a 384x288 image into which I want to paint the histogram, normalized to the maximum.

The histogram I get looks correct (when I copy d_Hist back to the host), and re-using the buffer doesn’t seem to be the problem (I get the same result when I create a fresh buffer for nppsMax_32s). But the maximum I get is completely wrong. And the strangest thing is that the maximum is correct if I calculate it over less than (levels-1) signal length. So the last value seems to cause the problem, but it’s not out of bounds or anything like that, as I can easily check with an overexposed image.

What could be the problem?


OK, after some further testing it seems like NPP has a severe bug:
The described problem only occurs when the signal length is a power of 2. I had n = 64 and therefore got this wrong result; for any n that is not a power of 2 it works.

Unfortunately, using a length of 64 is most convenient for me. Can we expect a fix for this soon?

I don’t have the docs in front of me but …
can you mix 16 bit unsigned with 32 bit signed functions?

What version of NPP are you using?

Do I understand this correctly, you have traced the problem to the Max primitive. So if the signal’s (S) length is l = 2[sup]n[/sup], the maximum isn’t found if it is the last value of the signal, i.e. S[l]? But for other signal lengths a maximum in the last slot is properly found?

Also, I have one question about your code:

PaintHist_kernen<<<nBlocks,threadsPerBlock>>>(d_Hist, *d_hMax, d_histImg, d_histImg_pitch);

I’m assuming d_hMax is a device pointer? What do you expect the [font=“Courier New”]*d_hMax[/font] to do?

I’m using the latest version of NPP.

If the signal length is 2[sup]n[/sup] then no correct maximum is found, no matter where it lies. For other signal lengths I tried it worked as it should.

You mean what I use d_hMax for in the kernel? Well, the kernel paints an image of the histogram at a defined size (say 400x300). Then I need to know the histogram maximum, so that the bar at the maximum is exactly 300 pixels high and all other bars are scaled accordingly, as in height = d_hist / d_hMax * 300.

You’d better try to isolate the problem: if the problem is in nppsMax_32s, then create a separate test case for it.

Frank hasn’t jumped back in, but I think he’s asking why your code is

PaintHist_kernen<<<nBlocks,threadsPerBlock>>> (d_Hist, *d_hMax, d_histImg, d_histImg_pitch);

and not

PaintHist_kernen<<<nBlocks,threadsPerBlock>>> (d_Hist, d_hMax, d_histImg, d_histImg_pitch);

Note the pointer difference in d_hMax.

What are you trying to say? It is perfectly clear to me where the problem is: the NPP function has a bug.

And why would that matter? It has nothing to do with the issue being discussed.

Basically this thread is now a bug report, I have resolved the problem for me by choosing a different signal length (despite 64 being most natural and convenient for my application). Now it’s up to the NVIDIA developers to find out what is wrong with their function(s) and correct it.

Maybe it is a side effect of some error elsewhere in your code. Can you present a test case with that function alone? Then you can submit it to the NVIDIA developers.

Hi Ygrene,

thanks for the clarification. I’m adding this as an NPP bug and we’ll try to reproduce based on the information you provided.

Regarding my question about the *d_hMax: I believe I understand what you’re attempting to do. However, if d_hMax is a device pointer, it would not be valid to dereference it in host code. You would have to pass the maximum to your draw kernel by pointer, not by value, or alternatively read back the device maximum to a host variable first.

LOL. If I had a dollar for every claim that CUDA has a bug in it.

Pasting a piece of incomplete code, especially one with an identified (unrelated?) bug, does not constitute a bug report.

And if I had a dollar for every cocky, unhelpful reply trying to trace the problem to the user himself instead of actually dealing with what he said.

In the full code there is a simple parameter: int levels = (…). If I set levels to 65 (giving a histogram length of 64) it doesn’t work. If I set it to 66, 67, 68, … it works. With 33 and 129 it doesn’t. With 34 and 130 it does. But yeah, keep searching for other mistakes in code I didn’t copy-paste but typed out to make the problem clear. I didn’t expect people wouldn’t even try to understand, just nag about unimportant stuff. Well done guys!

Maybe it’s a memory allocation problem somewhere, in that function’s code or elsewhere, if it allocates memory blocks of 2^n size and accesses the element at index 2^n.

Hmmm. Maybe you should re-read your comments to see who’s being cocky.

People come here to help and be helped. EVERYONE makes mistakes and not every suggestion is a gem, but if you want to be helped YOU need to lose the attitude. We’re not being paid to give you technical support.

Lev has given a couple of useful suggestions; I suggest you follow up on them.

And there is NO ‘unimportant stuff’ when it comes to debugging.


I think it would matter because (as Frank tried to say) if d_hMax is a device pointer, “*d_hMax” is invalid in host code (as in the kernel call in your code). The code as written would pass “garbage” to the PaintHist_kernen() kernel, since you are passing by value the dereference of a device memory pointer.

one more thing - although it’s a long shot - I’ve seen that the latest NPP release supports CUDA 3.2 RC (and not the final CUDA 3.2) - maybe there were some last-minute changes from the RC to the final release that are causing a compatibility problem for the NPP library…

happy bug hunting,


I attempted to reproduce this problem but without success.

I did run a number of tests of nppsMax_32s on arrays/signals of length 64.

At this point, I would like to ask for a small stand-alone reproducer app. I can also attempt to reproduce if you could provide a complete set of the parameters passed to the function for the failing invocation (i.e. the values of the pointers and parameters).
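For reference, a stand-alone reproducer might look something like the sketch below (untested; it assumes the npps API signatures used earlier in this thread, omits status and error checks for brevity, and guesses at the scratch-buffer size, since this NPP version may not expose a buffer-size query for nppsMax_32s):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <npps.h>

int main()
{
    const int n = 64;                         // power-of-2 length that reportedly fails
    Npp32s h_sig[n];
    for (int i = 0; i < n; ++i) h_sig[i] = i; // maximum (63) lands in the last slot

    Npp32s *d_sig = nppsMalloc_32s(n);
    Npp32s *d_max = nppsMalloc_32s(1);
    Npp8u  *d_buf = nppsMalloc_8u(4 * n);     // generous guess for the scratch buffer

    cudaMemcpy(d_sig, h_sig, n * sizeof(Npp32s), cudaMemcpyHostToDevice);

    nppsMax_32s(d_sig, n, d_max, d_buf);      // the call under suspicion

    Npp32s h_max = 0;
    cudaMemcpy(&h_max, d_max, sizeof(Npp32s), cudaMemcpyDeviceToHost);
    printf("nppsMax_32s over %d elements: %d (expected 63)\n", n, h_max);

    nppsFree(d_sig);
    nppsFree(d_max);
    nppsFree(d_buf);
    return 0;
}
```

Running the same program with n = 63 or n = 65 should, per the report, print the correct maximum, which would isolate the power-of-2 behaviour.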