histogram algorithm problems with the SDK example

A bit of background: I’m trying to alter the basic 256-bin histogram algorithm presented in the SDK samples so it provides more bins. I was seeing intermittent ‘unspecified launch failures’, even when always using the same data. To rule out my own changes, I ran the SDK sample code unaltered:

With random data (as used in the sample app), everything worked fine.

With real image data, intermittent ‘unspecified launch failure’.

I modified the sample application to use a single value for all of the data points (rather than a random value 0-255 for each), and now the sample application crashes every time. It appears that the addByte() function is not correctly resolving per-warp shared memory write conflicts. Am I doing something wrong here, or is this sample broken such that it only works on random data (where the chance of multiple threads in the same warp trying to write to the same bin is rather low)?

I’ve confirmed that, when run on a uniform ‘image’, the addByte() function can loop forever, causing an ‘unspecified launch failure’. I modified the loop to run at most 64 times (per the documentation, it should only need a maximum of 32 iterations – the number of threads per warp). This produces no launch error, but the histogram results are incorrect, since the loop is cut off before some values actually get incremented.

inline __device__ void addByte(volatile uint *s_WarpHist, uint data, uint threadTag){
    uint count;
    uint max = 64;    // my addition: cap the retry loop so the kernel can't hang
    do{
        // Read the current count for this bin (low bits), discarding the tag bits.
        count = s_WarpHist[data] & TAG_MASK;
        // Increment it and stamp this thread's tag into the high bits.
        count = threadTag | (count + 1);
        // Write it back; other threads in the warp may be writing the same bin.
        s_WarpHist[data] = count;
        // Retry until this thread's tagged write survives, or the cap runs out.
    }while(--max && s_WarpHist[data] != count);
}
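
For reference, my understanding of the tagging scheme (a sketch only; makeThreadTag and the exact constants are my own illustrative names, not copied verbatim from the SDK sample): the upper LOG2_WARP_SIZE bits of each shared-memory bin hold the id of the warp lane that last wrote it, and the lower bits hold the actual count.

typedef unsigned int uint;   // assuming uint is unsigned int, as in the sample

#define LOG2_WARP_SIZE 5U
#define WARP_SIZE      (1U << LOG2_WARP_SIZE)
#define UINT_BITS      32U
#define TAG_MASK       ((1U << (UINT_BITS - LOG2_WARP_SIZE)) - 1U)   // low 27 bits hold the count

// Computed once per thread inside the kernel: the lane id within the warp,
// shifted into the high bits, so colliding writes from different lanes
// produce distinguishable values.
__device__ uint makeThreadTag(void){
    return (threadIdx.x & (WARP_SIZE - 1U)) << (UINT_BITS - LOG2_WARP_SIZE);
}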

Any help or pointers people can give would be greatly appreciated.

-Ian

Are you using a GPU or GPUs that support shared memory atomics in hardware? I’ve found the hardware path to be much more robust than the software approach, which often seems to result in unspecified launch failures, particularly with corner cases like constant-valued data.
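
Roughly, the whole tag-and-retry loop collapses to a single shared-memory atomicAdd(), something like this sketch (assumes a compute capability 1.2 or newer GPU; addByteAtomic is just an illustrative name, not part of the SDK sample):

inline __device__ void addByteAtomic(uint *s_WarpHist, uint data){
    // One hardware-resolved increment of the bin; no tag, no retry loop.
    atomicAdd(&s_WarpHist[data], 1U);
}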

Unfortunately not, we’re using Compute 1.1 GPUs.

I believe I’ve found a solution (though I’m not sure it’s the solution).
I found references to older GPUs only being synchronized across a half-warp, rather than the full warp. I decreased the warp size used in the code (LOG2_WARP_SIZE 5U → 4U), and now things seem to be running smoothly (albeit a bit more slowly). Perhaps the sample code should be changed to use half-warps so that it works on any GPU?
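
Concretely, this is the only line I changed (a sketch using the same illustrative names as above, assuming WARP_SIZE and the tag constants are all derived from LOG2_WARP_SIZE):

#define LOG2_WARP_SIZE 4U                       // was 5U: treat a half-warp (16 threads) as the lockstep unit
#define WARP_SIZE      (1U << LOG2_WARP_SIZE)   // now 16 instead of 32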

-Ian
