histogram algorithm problems with the SDK example

A bit of background: I’m trying to alter the basic 256-bin histogram algorithm presented in the SDK samples so it provides more bins. I was seeing intermittent ‘unspecified launch failures’, even when always using the same data. To rule out my own changes, I ran the SDK sample code unaltered:

With random data (as used in the sample app), everything worked fine.

With real image data, intermittent ‘unspecified launch failure’.

I modified the sample application to use a single value for all of the data points (rather than a random value 0-255 for each), and now the sample application crashes every time. It appears that the addByte() function is not correctly resolving per-warp shared memory write conflicts. Am I doing something wrong here, or is this sample broken such that it only works on random data (where the chance of multiple threads in the same warp trying to write to the same bin is rather low)?

I’ve confirmed that, when run on a uniform ‘image’, the addByte() function can loop forever, causing an ‘unspecified launch failure’. I modified the loop to run at most 64 times (per the documentation, it should only need a maximum of 32 iterations – the number of threads per warp). This produces no launch error, but the histogram results are incorrect, since the loop is cut off before some values actually get incremented.

inline __device__ void addByte(volatile uint *s_WarpHist, uint data, uint threadTag){
    uint count;
    uint max = 64;    // my addition: cap the retry loop so the kernel can't hang
    do{
        // Read the current count for this bin (low bits), discarding the tag bits.
        count = s_WarpHist[data] & TAG_MASK;
        // Increment it and stamp this thread's tag into the high bits.
        count = threadTag | (count + 1);
        // Write it back; other threads in the warp may be writing the same bin.
        s_WarpHist[data] = count;
        // Retry until this thread's tagged write survives, or the cap runs out.
    }while(--max && s_WarpHist[data] != count);
}
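
For reference, my understanding of the tagging scheme (a sketch only; makeThreadTag and the exact constants are my own illustrative names, not copied verbatim from the SDK sample): the upper LOG2_WARP_SIZE bits of each shared-memory bin hold the id of the warp lane that last wrote it, and the lower bits hold the actual count.

typedef unsigned int uint;   // assuming uint is unsigned int, as in the sample

#define LOG2_WARP_SIZE 5U
#define WARP_SIZE      (1U << LOG2_WARP_SIZE)
#define UINT_BITS      32U
#define TAG_MASK       ((1U << (UINT_BITS - LOG2_WARP_SIZE)) - 1U)   // low 27 bits hold the count

// Computed once per thread inside the kernel: the lane id within the warp,
// shifted into the high bits, so colliding writes from different lanes
// produce distinguishable values.
__device__ uint makeThreadTag(void){
    return (threadIdx.x & (WARP_SIZE - 1U)) << (UINT_BITS - LOG2_WARP_SIZE);
}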

Any help or pointers people can give would be greatly appreciated.

-Ian

Are you using a GPU or GPUs that support shared memory atomics in hardware? I’ve found the hardware path to be much more robust than the software approach, which often seems to result in unspecified launch failures, particularly with corner cases like constant-valued data.
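
Roughly, the whole tag-and-retry loop collapses to a single shared-memory atomicAdd(), something like this sketch (assumes a compute capability 1.2 or newer GPU; addByteAtomic is just an illustrative name, not part of the SDK sample):

inline __device__ void addByteAtomic(uint *s_WarpHist, uint data){
    // One hardware-resolved increment of the bin; no tag, no retry loop.
    atomicAdd(&s_WarpHist[data], 1U);
}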

Unfortunately not, we’re using Compute 1.1 GPUs.

I believe I’ve found a solution (though I’m not sure it’s the solution).
I found references to older GPUs only being synchronized across a half-warp, rather than the full warp. I decreased the warp size used in the code (LOG2_WARP_SIZE 5U → 4U), and now things seem to be running smoothly (albeit a bit more slowly). Perhaps the sample code should be changed to use half-warps so that it works on any GPU?
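
Concretely, this is the only line I changed (a sketch using the same illustrative names as above, assuming WARP_SIZE and the tag constants are all derived from LOG2_WARP_SIZE):

#define LOG2_WARP_SIZE 4U                       // was 5U: treat a half-warp (16 threads) as the lockstep unit
#define WARP_SIZE      (1U << LOG2_WARP_SIZE)   // now 16 instead of 32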

-Ian
