CUB Histogram

Hi,
In the scope of image processing I would need a fast pix values histogram from my images,
I’m desperately trying to use CUB Histogram rather than atomicHistogram as it is probalby much faster.

I derived a test piece of code by copying the provided sample in the CUB library but couldn’t get sensible output. neither found on the web any other example to mimic.

Here is my code

#include <stdio.h>
#include <time.h>
#include <math.h>
#include <cub/cub.cuh>   // or equivalently <cub/device/device_histogram.cuh>

#define TPB 16
#define RANGE 16

// Create and compute histogram
int main (int argc, char * argv[]) 
{
	// Declare, allocate, and initialize device-accessible pointers for input samples 
	// and output histogram
	int num_samples = 10;    
	int *d_samples;
	int *d_histogram;
	int num_levels = 10;     //    (10 level boundaries for 9 bins)
	int lower_level = 0;    // lower sample value boundary of lowest bin)
	int upper_level = 9;    // (upper sample value boundary of upper bin)
		
	//Initialize vector with random values
	int h_samples[num_samples] = {1,2,3,4,5,6,7,8,9,10};
	printf("Samples\n");
	for (int i = 0; i < num_samples; i++)
		printf ("%d ", h_samples[i]);
	cudaMalloc (&d_samples, num_samples * sizeof(int));
	cudaMemcpy (&d_samples, h_samples, num_samples * sizeof(int), cudaMemcpyHostToDevice);

	//Allocate device and host memory for histogram
	cudaMalloc (&d_histogram, (num_levels+1) * sizeof(int));
	//int *h_histogram = (int *) malloc ((num_levels+1) * sizeof(int));
	int h_histogram[num_levels];

	// Determine temporary device storage requirements
	void *d_temp_storage = NULL;
	size_t temp_storage_bytes = 0;
	cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
			d_samples, d_histogram, num_levels, lower_level, upper_level, num_samples);
	
	// Allocate temporary storage
	cudaMalloc(&d_temp_storage, temp_storage_bytes);
	
	// Compute histograms
	cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
    	d_samples, d_histogram, num_levels, lower_level, upper_level, num_samples);
	
	cudaMemcpy (&h_histogram, d_histogram, (num_levels+1) * sizeof(int), cudaMemcpyDeviceToHost);
	printf("\nHistogram\n");
	for (int i = 0; i <= num_levels; i++)
		printf ("%d ", h_histogram[i]);

	// Cleanup and closing
	cudaFree(d_samples); cudaFree(d_histogram); cudaFree(d_temp_storage); 

	printf("\n");
	return 0;
}

Here is the ouput with wrong histogram values
Samples
1 2 3 4 5 6 7 8 9 10
Histogram
10 0 0 0 0 0 0 0 0 0 0

Obviously it is not what is expected but cannot get what is wrong…

Please help, many thanks in advance

Any time you are having trouble with a CUDA code, I strongly encourage you to use proper CUDA error checking, and run your code with cuda-memcheck, before asking others for help. Not sure what either one of those are? Just google:

proper cuda error checking

or

cuda-memcheck

Even if you don’t understand the output from the above checks, it will be useful for others trying to help you.

Your code has incorrectly written cudaMemcpy statements. The error checking would have drawn your attention to this. Remove the ampersands.

Your final for-loop is also incorrectly iterating beyond the end of the h_histogram array. That is a basic C/C++ coding error, not specific to CUDA.

Many Thanks for pointing out my basic mistakes in cudaMemcpy and final printloop, this is fixed now.
However I’m still confused on the meaning of upper_level and relation to num_levels.
CUB documentation and code snippet looks contradictory to me.
See below: it says on the comment “seven level boundaries for six bins”
and shows any array of 8 [ -, -, -, -, -, -, -, -]
This confuses me

I would really appreciate if you can clarify this point.
is the d_histogram an array of num_levels?

Documentation states that lower_level is included and upper_level is excluded.
To remove any ambiguity say I have integer values in range [0-9] what are the parameters?

many thanks for you guidance

#include <cub/cub.cuh>   // or equivalently <cub/device/device_histogram.cuh>
// Declare, allocate, and initialize device-accessible pointers for input samples and
// output histogram
int      num_samples;    // e.g., 10
float*   d_samples;      // e.g., [2.2, 6.0, 7.1, 2.9, 3.5, 0.3, 2.9, 2.0, 6.1, 999.5]
int*     d_histogram;    // e.g., [ -, -, -, -, -, -, -, -]
int      num_levels;     // e.g., 7       (seven level boundaries for six bins)
float    lower_level;    // e.g., 0.0     (lower sample value boundary of lowest bin)
float    upper_level;    // e.g., 12.0    (upper sample value boundary of upper bin)

The size of d_histogram and therefore h_histogram should be equal to the number of bins. The number of bins is 1 less than the number of levels.

Your code had one additional error in that the size of the cudaMemcpy operation from device to host was incorrect.

Here’s a fixed version of the code you wrote, demonstrating the correct sizes, and showing that the histogram correctly reports 1 sample per bin.

$ cat t11.cu
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <cub/cub.cuh>   // or equivalently <cub/device/device_histogram.cuh>

#define TPB 16
#define RANGE 16

// Create and compute histogram
int main (int argc, char * argv[])
{
        // Declare, allocate, and initialize device-accessible pointers for input samples
        // and output histogram
        int num_samples = 10;
        int *d_samples;
        int *d_histogram;
        int num_levels = 11;     //    (11 level boundaries for 10 bins)
        int num_bins = num_levels-1;
        int lower_level = 1;    // lower sample value boundary of lowest bin)
        int upper_level = 11;    // (upper sample value boundary of upper bin)

        //Initialize vector with random values
        int h_samples[num_samples] = {1,2,3,4,5,6,7,8,9,10};
        printf("Samples\n");
        for (int i = 0; i < num_samples; i++)
                printf ("%d ", h_samples[i]);
        cudaMalloc (&d_samples, num_samples * sizeof(int));
        cudaMemcpy (d_samples, h_samples, num_samples * sizeof(int), cudaMemcpyHostToDevice);

        //Allocate device and host memory for histogram
        cudaMalloc (&d_histogram, num_bins * sizeof(int));
        //int *h_histogram = (int *) malloc ((num_levels+1) * sizeof(int));
        int h_histogram[num_bins];

        // Determine temporary device storage requirements
        void *d_temp_storage = NULL;
        size_t temp_storage_bytes = 0;
        cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                        d_samples, d_histogram, num_levels, lower_level, upper_level, num_samples);

        // Allocate temporary storage
        cudaMalloc(&d_temp_storage, temp_storage_bytes);

        // Compute histograms
        cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
        d_samples, d_histogram, num_levels, lower_level, upper_level, num_samples);

        cudaMemcpy (h_histogram, d_histogram, (num_bins) * sizeof(int), cudaMemcpyDeviceToHost);
        printf("\nHistogram\n");
        for (int i = 0; i < num_bins; i++)
                printf ("%d ", h_histogram[i]);

        // Cleanup and closing
        cudaFree(d_samples); cudaFree(d_histogram); cudaFree(d_temp_storage);

        printf("\n");
        return 0;
}
$ nvcc -o t11 t11.cu
$ cuda-memcheck ./t11
========= CUDA-MEMCHECK
Samples
1 2 3 4 5 6 7 8 9 10
Histogram
1 1 1 1 1 1 1 1 1 1
========= ERROR SUMMARY: 0 errors
$

Since you specifically asked for samples of values 0-9, here is the same code reworked to have samples from 0-9, one sample per histogram bin:

$ cat t11.cu
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <cub/cub.cuh>   // or equivalently <cub/device/device_histogram.cuh>

#define TPB 16
#define RANGE 16

// Create and compute histogram
int main (int argc, char * argv[])
{
        // Declare, allocate, and initialize device-accessible pointers for input samples
        // and output histogram
        int num_samples = 10;
        int *d_samples;
        int *d_histogram;
        int num_levels = 11;     //    (11 level boundaries for 10 bins)
        int num_bins = num_levels-1;
        int lower_level = 0;    // lower sample value boundary of lowest bin)
        int upper_level = 10;    // (upper sample value boundary of upper bin)

        //Initialize vector with random values
        int h_samples[num_samples] = {0,1,2,3,4,5,6,7,8,9};
        printf("Samples\n");
        for (int i = 0; i < num_samples; i++)
                printf ("%d ", h_samples[i]);
        cudaMalloc (&d_samples, num_samples * sizeof(int));
        cudaMemcpy (d_samples, h_samples, num_samples * sizeof(int), cudaMemcpyHostToDevice);

        //Allocate device and host memory for histogram
        cudaMalloc (&d_histogram, num_bins * sizeof(int));
        //int *h_histogram = (int *) malloc ((num_levels+1) * sizeof(int));
        int h_histogram[num_bins];

        // Determine temporary device storage requirements
        void *d_temp_storage = NULL;
        size_t temp_storage_bytes = 0;
        cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                        d_samples, d_histogram, num_levels, lower_level, upper_level, num_samples);

        // Allocate temporary storage
        cudaMalloc(&d_temp_storage, temp_storage_bytes);

        // Compute histograms
        cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
        d_samples, d_histogram, num_levels, lower_level, upper_level, num_samples);

        cudaMemcpy (h_histogram, d_histogram, (num_bins) * sizeof(int), cudaMemcpyDeviceToHost);
        printf("\nHistogram\n");
        for (int i = 0; i < num_bins; i++)
                printf ("%d ", h_histogram[i]);

        // Cleanup and closing
        cudaFree(d_samples); cudaFree(d_histogram); cudaFree(d_temp_storage);

        printf("\n");
        return 0;
}
$ nvcc -o t11 t11.cu
$ cuda-memcheck ./t11
========= CUDA-MEMCHECK
Samples
0 1 2 3 4 5 6 7 8 9
Histogram
1 1 1 1 1 1 1 1 1 1
========= ERROR SUMMARY: 0 errors
$

Dear txbob,

Many thanks for the time spent to help me understand the correct parameters usage and the examples code. You really save on my time, now I feel comfortable to use the histogram for processing my astro images…

Best,