Setting begin/end bit parameters in cub::DeviceRadixSort::SortPairs()

The CUB function cub::DeviceRadixSort::SortPairs() has two optional parameters begin_bit and end_bit that specify the “bit subrange [begin_bit, end_bit) of differentiating key bits. … This can reduce overall sorting overhead and yield a corresponding performance improvement.”

When I profiled my application, I found that 87% of the time is consumed by cub::DeviceRadixSortDownsweepKernel() and cub::DeviceRadixSortUpsweepKernel() – which I assume are called by cub::DeviceRadixSort::SortPairs().

The keys in my input data are unsigned integers in the range [1, N], where N depends on the problem but is always small, for example 30. What would I set begin/end bit to? Would end bit be the number of bits needed for N (e.g. 30), i.e. 5? And what would begin bit be, 1?

A radix sort typically works bitwise from LSB to MSB.

The begin bit would be the position of the least significant bit that matters for the comparison.

The end bit would be the position of the most significant bit that matters for the comparison.
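As a CPU-side illustration of the idea (this is just a sketch, not CUB's actual kernels; the function name `radixSortBits` is made up), here is a minimal LSB-first binary radix sort that only visits the bit range [begin_bit, end_bit). Keys whose differentiating bits all lie inside that range come out fully sorted:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Minimal LSB-first binary radix sort over the bit range [begin_bit, end_bit).
// Each pass is a stable partition on one bit, from least to most significant;
// stability is what makes the passes compose into a full sort.
std::vector<uint32_t> radixSortBits(std::vector<uint32_t> keys,
                                    int begin_bit, int end_bit)
{
    for (int bit = begin_bit; bit < end_bit; ++bit)
    {
        std::stable_partition(keys.begin(), keys.end(),
            [bit](uint32_t k) { return ((k >> bit) & 1u) == 0u; });
    }
    return keys;
}
```

With begin_bit = 0 and end_bit = 5 this sorts any keys in [0, 31] correctly, because all of their differentiating bits lie in that range; bits at or above end_bit are simply never inspected.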

I would think 0 and 4 should work in your case (integers from 1 to 30, inclusive), but you may want to experiment with that.

Thanks txbob. I’ve cobbled together a small program that uses cub::DeviceRadixSort::SortPairs() with the begin/end bit values that you recommend. In this case all the numbers in the key array are in the range [1, 20] inclusive. Unfortunately the result of the sort is incorrect when begin/end bit is set as recommended.

BTW, I’m using the latest version of cub (1.7.4).

#include <iostream>
#include <cub/cub.cuh>

// Helper Functions Decl
void print( short* data , unsigned size , const char* name );
void allocateDeviceMemory( void* devPtr , unsigned size , int lineNumber );
void copyDataToHost( void* hostPtr , void* devPtr , unsigned size , int lineNumber );
void copyDataToDevice( void* devPtr , void* hostPtr , unsigned size , int lineNumber );

#define CHECK_ERROR( error )                                                    \
    if( error != cudaSuccess )                                                  \
    {                                                                           \
        std::string msg( "[Error] " );                                          \
        msg += cudaGetErrorString( error );                                     \
        std::cerr << msg << std::endl;                                          \
        exit(-1);                                                               \
    }

////////////////////////////////////////////////////////////////////////////////////////////////

int main(int argc, char** argv)
{
	cudaError_t status;
	void* tmpStorage = 0;
	size_t tmpStorageSize = 0;

	static const unsigned COUNT = 20;
	short h_values[COUNT] = { 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0};
	short h_keys[COUNT] = { 17, 11, 16, 3, 10, 13, 1, 17, 14, 15, 18, 15, 10, 4, 8, 11, 15, 20, 10, 18 };

	short* d_keys = 0;
	short* d_values = 0;
	unsigned dataSize = COUNT * sizeof(short);

	allocateDeviceMemory( &d_keys , dataSize , __LINE__ );
	allocateDeviceMemory( &d_values , dataSize , __LINE__ ); 

	copyDataToDevice( d_keys , h_keys , dataSize , __LINE__ );
	copyDataToDevice( d_values , h_values , dataSize , __LINE__ );

	print( h_keys , COUNT , "Pre-sort Keys" );
	print( h_values , COUNT , "Pre-sort Values" );

	status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT );			// Correct
	//status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT , 0 , 4 );  // BUG
	CHECK_ERROR( status );

	allocateDeviceMemory( &tmpStorage , tmpStorageSize , __LINE__ );

	status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT );			// Correct
	//status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT , 0 , 4 );	// BUG
	CHECK_ERROR( status );

	copyDataToHost( h_keys , d_keys , dataSize , __LINE__ );
	copyDataToHost( h_values , d_values , dataSize , __LINE__ );
    
	print( h_keys , COUNT , "Post-sort Keys" );
	print( h_values , COUNT , "Post-sort Values" );

    return 0;
}

////////////////////////////////////////////////////////////////////////////////////////////////
// Helper Functions Impl

void print( short* data , unsigned size , const char* name )
{
	std::cout << "[" << name << "]\n[ " << data[0];
	for( unsigned i = 1; i < size; ++i )
	{
		std::cout << " , " << data[i];
	}
	std::cout << " ]" << std::endl;
}

void allocateDeviceMemory( void* devPtr , unsigned size , int lineNumber )
{
    cudaError_t error = cudaMalloc( (void**) devPtr , size );
    if( error != cudaSuccess )
    {
    	std::cout << "[Line " << lineNumber << " -- Error " << error << " : Unable to allocate device memory] " << cudaGetErrorString( error ) << std::endl;
    	exit(-1);
    }
}

void copyDataToHost( void* hostPtr , void* devPtr , unsigned size , int lineNumber  )
{
    cudaError_t error = cudaMemcpy( hostPtr , devPtr , size , cudaMemcpyDeviceToHost );
    if( error != cudaSuccess )
    {
    	std::cout << "[Line " << lineNumber << " -- Error " << error << " : Unable to copy device data to host memory] " << cudaGetErrorString( error ) << std::endl;
    	exit(-1);
    }
}

void copyDataToDevice( void* devPtr , void* hostPtr , unsigned size , int lineNumber  )
{
    cudaError_t error = cudaMemcpy( devPtr , hostPtr , size , cudaMemcpyHostToDevice );
    if( error != cudaSuccess )
    {
    	std::cout << "[Line " << lineNumber << " -- Error " << error << " : Unable to copy host data to device memory] " << cudaGetErrorString( error ) << std::endl;
    	exit(-1);
    }
}

Sorry, the end bit needs to be 1 higher than the position of the most significant set bit in your data set.

So use 0, 5 in your case.

This is evident if you study the default values for these parameters in the documentation.
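For reference, the required end bit for keys in [0, maxKey] can be computed on the host as one past the position of the most significant set bit of the largest key (the helper name `endBitFor` here is just illustrative):

```cpp
// One past the most-significant set bit of maxKey, i.e. the number of
// low-order bits needed to represent every key in [0, maxKey].
// This is the value to pass as end_bit (with begin_bit = 0).
int endBitFor(unsigned maxKey)
{
    int bits = 0;
    while (maxKey > 0)
    {
        ++bits;
        maxKey >>= 1;
    }
    return bits;
}
```

For example, keys up to 30, 20, or 23 all need 5 bits, so end_bit = 5; keys up to 32 would need 6.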

An end bit value of 5 works on the initial toy example, but it fails on my humble mobile Fermi card when I throw a bigger data set at it. I’ve modified my sample program accordingly (mostly to reduce the amount of data printed).

#include <iostream>
#include <cub/cub.cuh>

#include "Data.h"  // see github gist

// nvcc -ccbin g++ -g -I/path/to/cub-1.7.4 -m64 -gencode arch=compute_30,code=sm_30 -o CubRadixSortTest CubRadixSortTest.cu

// Helper Functions Decl
void allocateDeviceMemory( void* devPtr , unsigned size , int lineNumber );
void copyDataToHost( void* hostPtr , void* devPtr , unsigned size , int lineNumber );
void copyDataToDevice( void* devPtr , void* hostPtr , unsigned size , int lineNumber );

#define CHECK_ERROR( error )                                                    \
    if( error != cudaSuccess )                                                  \
    {                                                                           \
        std::string msg( "[Error] " );                                          \
        msg += cudaGetErrorString( error );                                     \
        std::cerr << msg << std::endl;                                          \
        exit(-1);                                                               \
    }

template<typename T>
void print( T* data , unsigned size , const char* name )
{
	std::cout << "[" << name << "]\n[ " << data[0];
	for( unsigned i = 1; i < size; ++i )
	{
		std::cout << " , " << data[i];
	}
	std::cout << " ]" << std::endl;
}

////////////////////////////////////////////////////////////////////////////////////////////////

int main(int argc, char** argv)
{
	cudaError_t status;
	void* tmpStorage = 0;
	size_t tmpStorageSize = 0;

	short* d_keys = 0;
	unsigned* d_values = 0;

	unsigned keysDataSize = COUNT * sizeof(short);
	unsigned valuesDataSize = COUNT * sizeof(unsigned);

	allocateDeviceMemory( &d_keys , keysDataSize , __LINE__ );
	allocateDeviceMemory( &d_values , valuesDataSize , __LINE__ ); 

	copyDataToDevice( d_keys , h_keys , keysDataSize , __LINE__ );
	copyDataToDevice( d_values , h_values , valuesDataSize , __LINE__ );

	//print( h_keys , COUNT , "Pre-sort Keys" );
	//print( h_values , COUNT , "Pre-sort Values" );

	status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT );			// Correct
	//status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT , 0 , 5 );  // BUG
	CHECK_ERROR( status );

	allocateDeviceMemory( &tmpStorage , tmpStorageSize , __LINE__ );

	status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT );			// Correct
	//status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT , 0 , 5 );	// BUG
	CHECK_ERROR( status );

	copyDataToHost( h_keys , d_keys , keysDataSize , __LINE__ );
	copyDataToHost( h_values , d_values , valuesDataSize , __LINE__ );
    
	//print( h_keys , COUNT , "Post-sort Keys" );
	print( h_values , COUNT , "Post-sort Values" );

    return 0;
}

////////////////////////////////////////////////////////////////////////////////////////////////
// Helper Functions Impl

void copyDataToHost( void* hostPtr , void* devPtr , unsigned size , int lineNumber  )
{
    cudaError_t error = cudaMemcpy( hostPtr , devPtr , size , cudaMemcpyDeviceToHost );
    if( error != cudaSuccess )
    {
    	std::cout << "[Line " << lineNumber << " -- Error " << error << " : Unable to copy device data to host memory] " << cudaGetErrorString( error ) << std::endl;
    	exit(-1);
    }
}

void copyDataToDevice( void* devPtr , void* hostPtr , unsigned size , int lineNumber  )
{
    cudaError_t error = cudaMemcpy( devPtr , hostPtr , size , cudaMemcpyHostToDevice );
    if( error != cudaSuccess )
    {
    	std::cout << "[Line " << lineNumber << " -- Error " << error << " : Unable to copy host data to device memory] " << cudaGetErrorString( error ) << std::endl;
    	exit(-1);
    }
}

void allocateDeviceMemory( void* devPtr , unsigned size , int lineNumber )
{
    cudaError_t error = cudaMalloc( (void**) devPtr , size );
    if( error != cudaSuccess )
    {
    	std::cout << "[Line " << lineNumber << " -- Error " << error << " : Unable to allocate device memory] " << cudaGetErrorString( error ) << std::endl;
    	exit(-1);
    }
}

////////////////////////////////////////////////////////////////////////////////////////////////

Data.h is stored as a GitHub gist. In this data the keys range over [1, 23] inclusive, so the number of key bits is 5.

PS:
The problem appears to be that while the array length remains the same, duplicate entries appear in the list post-sort. (I checked the output data in Python.)

If you’re compiling this way for Fermi, I would expect that to fail:

-gencode arch=compute_30,code=sm_30

On a laptop you need to make sure you don’t exceed any display timeouts. Fermi is also limited to 65535 blocks in the x grid dimension. And CUDA 9 doesn’t support Fermi. The latest cub probably received little testing on Fermi.

Your most recent posting involves files from the web. If you want my help, make it super easy for me to help you. I’m not going to run around assembling files from multiple places to build a project. Sorry. Create a single-file test, posted here only, that I can copy, paste, compile and run, and see the issue obviously, without having to add or change anything. If that’s too much trouble, perhaps someone else will be able to help you. I also don’t want to have to check the output with a Python script. Build result-validity verification into your test case. Yes, it requires effort on your part. If you’re not willing to expend that effort, neither am I.

Apologies for the reference to Python. As you will see, it was an afterthought (the PS) that I added in case you asked how I knew that the sorted data includes duplicates. My production code includes such a check, and I’ve updated my sample code to reflect that.

I did not post the previous data because there was so much of it. I found a much smaller data set that fails (7140 elements compared to the 44850), but this was still too much to include in a post (it got truncated). Fortunately, I’ve found the attachment feature, so no more external links.

BTW, I’m using CUDA 8 for the reason that you pointed out.

#include <vector>
#include <iostream>
#include <cub/cub.cuh>

#include "Data2.h" // see attachment

// Helper Functions Decl
void validateValues( unsigned* d_values , const char* msg );
void allocateDeviceMemory( void* devPtr , unsigned size , int lineNumber );
void copyDataToHost( void* hostPtr , void* devPtr , unsigned size , int lineNumber );
void copyDataToDevice( void* devPtr , void* hostPtr , unsigned size , int lineNumber );

#define CHECK_ERROR( error )                                                    \
    if( error != cudaSuccess )                                                  \
    {                                                                           \
        std::string msg( "[Error] " );                                          \
        msg += cudaGetErrorString( error );                                     \
        std::cerr << msg << std::endl;                                          \
        exit(-1);                                                               \
    }

template<typename T>
void print( T* data , unsigned size , const char* name )
{
	std::cout << "[" << name << "]\n[ " << data[0];
	for( unsigned i = 1; i < size; ++i )
	{
		std::cout << " , " << data[i];
	}
	std::cout << " ]" << std::endl;
}

////////////////////////////////////////////////////////////////////////////////////////////////

int main(int argc, char** argv)
{
	cudaError_t status;
	void* tmpStorage = 0;
	size_t tmpStorageSize = 0;

	ushort* d_keys = 0;
	unsigned* d_values = 0;

	unsigned keysDataSize = COUNT * sizeof(ushort);
	unsigned valuesDataSize = COUNT * sizeof(unsigned);

	allocateDeviceMemory( &d_keys , keysDataSize , __LINE__ );
	allocateDeviceMemory( &d_values , valuesDataSize , __LINE__ ); 

	copyDataToDevice( d_keys , h_keys , keysDataSize , __LINE__ );
	copyDataToDevice( d_values , h_values , valuesDataSize , __LINE__ );

	//print( h_keys , COUNT , "Pre-sort Keys" );
	//print( h_values , COUNT , "Pre-sort Values" );

	validateValues( d_values , "Pre-sorted values" );

	//status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT );			// Correct
	status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT , 0 , 4 );  // BUG
	CHECK_ERROR( status );

	allocateDeviceMemory( &tmpStorage , tmpStorageSize , __LINE__ );

	//status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT );			// Correct
	status = cub::DeviceRadixSort::SortPairs( tmpStorage , tmpStorageSize , d_keys , d_keys , d_values , d_values , COUNT , 0 , 4 );	// BUG
	CHECK_ERROR( status );

	copyDataToHost( h_keys , d_keys , keysDataSize , __LINE__ );
	copyDataToHost( h_values , d_values , valuesDataSize , __LINE__ );
    
	//print( h_keys , COUNT , "Post-sort Keys" );
	//print( h_values , COUNT , "Post-sort Values" );

	validateValues( d_values , "Sorted values" );

    return 0;
}

////////////////////////////////////////////////////////////////////////////////////////////////
// Helper Functions Impl

void validateValues( unsigned* d_values , const char* msg )
{
	std::vector<unsigned> h_values( COUNT );
    copyDataToHost( h_values.data() , d_values , COUNT * sizeof(unsigned) , __LINE__ );

	std::vector<bool> exists( COUNT , false );
	for( unsigned val : h_values )
	{
		if( exists[val] )
		{
			std::cerr << msg << " contain duplicate " << val << "  -- terminating on first duplicate found" << std::endl;
			exit(-1);
		}
		else
		{
			exists[val] = true;
		}
	}
}

void copyDataToHost( void* hostPtr , void* devPtr , unsigned size , int lineNumber  )
{
    cudaError_t error = cudaMemcpy( hostPtr , devPtr , size , cudaMemcpyDeviceToHost );
    if( error != cudaSuccess )
    {
    	std::cerr << "[Line " << lineNumber << " -- Error " << error << " : Unable to copy device data to host memory] " << cudaGetErrorString( error ) << std::endl;
    	exit(-1);
    }
}

void copyDataToDevice( void* devPtr , void* hostPtr , unsigned size , int lineNumber  )
{
    cudaError_t error = cudaMemcpy( devPtr , hostPtr , size , cudaMemcpyHostToDevice );
    if( error != cudaSuccess )
    {
    	std::cerr << "[Line " << lineNumber << " -- Error " << error << " : Unable to copy host data to device memory] " << cudaGetErrorString( error ) << std::endl;
    	exit(-1);
    }
}

void allocateDeviceMemory( void* devPtr , unsigned size , int lineNumber )
{
    cudaError_t error = cudaMalloc( (void**) devPtr , size );
    if( error != cudaSuccess )
    {
    	std::cerr << "[Line " << lineNumber << " -- Error " << error << " : Unable to allocate device memory] " << cudaGetErrorString( error ) << std::endl;
    	exit(-1);
    }
}

////////////////////////////////////////////////////////////////////////////////////////////////

Data2.h (68.8 KB)

Got the same pattern of behavior using CUDA 9 on a 64-bit Windows 7 workstation with a Quadro K600 GPU. (Previous tests were run on Ubuntu x86_64.)

Are you sure you can alias the input and output data? (As in passing d_keys and d_values twice to SortPairs.)
Just throwing out an idea…
I don’t find a line in the docs that says you cannot, but neither one saying that you can…

It says that the input arrays are not altered, but passing the same pointer as the output argument means that it will be altered. The question is whether this alteration happens after the sort has ended (so there would be no problem), or while the sort takes place (which will cause weird results for sure).
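A host-side toy shows why scattering into the same array is dangerous: each radix-sort pass writes keys to new positions, and if the destination aliases the input, an element can be overwritten before it has been read. This is a conceptual sketch (the function names are made up, and this is not CUB’s code):

```cpp
#include <cstddef>
#include <vector>

// Scatter data[i] to position perm[i] using a separate output buffer.
// Always safe: every source element is read before anything is overwritten.
std::vector<int> scatterSafe(const std::vector<int>& data,
                             const std::vector<std::size_t>& perm)
{
    std::vector<int> out(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
        out[perm[i]] = data[i];
    return out;
}

// The same scatter, but writing into the input array itself.
// A later read may pick up an already-overwritten slot.
std::vector<int> scatterAliased(std::vector<int> data,
                                const std::vector<std::size_t>& perm)
{
    for (std::size_t i = 0; i < data.size(); ++i)
        data[perm[i]] = data[i];
    return data;
}
```

For data {10, 20, 30} and permutation {2, 0, 1}, the safe version yields {20, 30, 10}, while the aliased version yields {20, 10, 10}: some elements are lost and others duplicated, exactly the duplicate-values symptom reported above.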

I found out recently that cub::RadixSort makes some decisions based on end_bit (for example, swapping the references in the DoubleBuffer version), so maybe you are blaming end_bit for something that is actually unrelated to it. It has happened to me before: https://groups.google.com/forum/#!category-topic/cub-users/bugs-and-issues/UWq-8cElkPg

Maybe check using a different array for input and output?

Good luck!

By jove, you may be right! My example passes the validation when the keys and values do not alias.

Many thanks.

All right!!
If you do not need to keep the original (unsorted) arrays, the DoubleBuffer version reserves much less temporary memory (O(P) vs O(N+P), where P scales with the number of SMs on the GPU), if I am not mistaken. In exchange it keeps the right to modify both input/output arrays.

So I prefer that version. But make sure to ask the DoubleBuffer for .Current() after the sorting to get the sorted array!
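The ping-pong idea behind cub::DoubleBuffer can be mimicked on the host. The sort is free to leave the result in either buffer and just flips a selector, which is why you must read Current() afterwards rather than the pointer you started with. This is an illustrative sketch, not CUB’s implementation (the names PingPong and sortPass are made up):

```cpp
#include <algorithm>
#include <cstddef>

// Host-side mimic of the cub::DoubleBuffer idea: two buffers plus a
// selector. Each pass writes into the alternate buffer and flips the
// selector, so the final result may live in either buffer.
template <typename T>
struct PingPong
{
    T* bufs[2];
    int selector;
    T* Current()   { return bufs[selector]; }
    T* Alternate() { return bufs[selector ^ 1]; }
    void Flip()    { selector ^= 1; }
};

// One "pass": produce sorted data in the alternate buffer, then flip,
// leaving the result reachable via Current().
template <typename T>
void sortPass(PingPong<T>& pp, std::size_t n)
{
    std::copy(pp.Current(), pp.Current() + n, pp.Alternate());
    std::sort(pp.Alternate(), pp.Alternate() + n);
    pp.Flip();
}
```

After a pass, Current() points at the buffer that was originally the alternate one, so holding on to the original input pointer would give you stale data.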

Thanks again. I must admit I didn’t really get the point of the DoubleBuffer versions of the interface, and looking over the docs again it still isn’t clear.

I can’t help but wonder what makes the DoubleBuffer version reserve much less memory, considering that the alternate key and value buffers must still be created.

I think it is because in the DoubleBuffer version SortPairs is allowed to modify both arrays in the DoubleBuffer, so it has a larger workspace. But this is just guessing. I have not dived into the code…