Using Thrust to sort Unified Memory Buffer?

Hi All,

I’m running some tests using UMA, and I wanted to try and sort the complex output of an FFT .

In the code below I’m using libsndfile to read in a .wav file and cuFFT to convert it to the frequency domain. I’d then like to sort it using thrust.

All of that seems to be working, but what I’m actually interested in is seeing how that works with UMA memory. Let’s look at the code

#include <thrust/sort.h>
#include <thrust/functional.h>
#include <thrust/device_ptr.h>
#include <cufft.h>
#include <stdio.h>
#include <sndfile.h>

// Helper functions, main method below

// Kernel to fill indices buffer with correct value
__global__
void fillIdx(int * indices, int N){
   int idx = threadIdx.x+blockIdx.x*blockDim.x;
   if (idx<N)
      indices[idx]=idx;
}

// Magnitude of a complex number
__inline__
__host__ __device__
float cplxMag(cufftComplex cplx){
   return sqrt(pow(cplx.x,2)+pow(cplx.y,2));
}

// Less functor for cufttComplex
struct cplxLess : public thrust::less<cufftComplex>
{
   __inline__
   __host__ __device__
   bool operator()(const cufftComplex& a, const cufftComplex& b) const{
      return cplxMag(a)<cplxMag(b);
   }
};

// Main method
int main(){
   // Indices ptr, complex ptr, FFT plan
   int * A(0);
   cufftComplex * B(0);
   cufftHandle plan;

   // Soundfile info, file ptr
   SF_INFO info;
   SNDFILE * file = sf_open("drTest2.wav", SFM_READ, &info);

   // Pad # of frames till N%1024 == 0
   int N = info.frames;
   while (N%1024) N++;

   // Allocate managed memory for indices, fourier components
   cudaMallocManaged((void **)&A, sizeof(int)*N);
   cudaMallocManaged((void **)&B, sizeof(cufftComplex)*N);

   // Read wav file (as floats) into complex buffer (inplace transform)
   sf_readf_float(file, (float *)B, info.frames);

   // Close the file
   sf_close(file); file=0;

   // # threads, # blocks
   int nT=1024, nB=N/nT;

   // Fill the index array with a kernel
   fillIdx<<<nB,nT>>>(A,N);

   // Create the forward FFT, execute in place
   cufftPlan1d(&plan, N, CUFFT_R2C,, 1);
   cufftExecR2C(plan, (float *)B, B);

   // Create thrust device pointers
   thrust::device_ptr<int> indices(A);
   thrust::device_ptr<cufftComplex> fourier(B);

   // Why do I need a cudaDeviceSynchronize call before the sort?
   cudaDeviceSynchronize();
   thrust::sort_by_key(fourier.get(), fourier.get()+N, indices.get(), cplxLess());

   // Just to read back
   //cudaDeviceSynchronize();
   for (int i=N-10;i<N;i++){
      float freq = float(A[i])*(float(info.frames)/float(info.samplerate));
      printf("%f ", freq);
   } printf("\n");

   // Teardown
   cudaFree(A);
   cudaFree(B);
   cufftDestroy(plan);

   return 0;
}

In that code, I’ve got to call cudaDeviceSynchronize() before I can sort or else I get a segfault. The docs say that cudaDeviceSynchronize() blocks until all CUDA operations complete, but in the past when I’ve used UMA I’ve also had to call that function before I can access my buffer from a host function.

Basically, I’m concerned that the thrust::sort_by_key call is occurring on the host, rather than on the device. Is such a thing possible with thrust? Or are all algorithm calls assumed to be running on the device?

Sorry if this is a basic question, I found plenty of thrust sorting examples but none of them dealt directly with UMA. You can see that I tried to use thrust::device_ptrs to wrap my unified memory buffers, and while this compiles I have no idea if it’s kosher or not.

Sorry if that was too much text, but if you read it, thanks for reading and hope you can help.

  • John

p.s cuFFT calls can be done in place, right? I tried graphing the output and everything seemed ok…

The requirement to call cudaDeviceSynchronize() (after kernel usage) before UM data can be used on the host is spelled out and explained in the UM documentation:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-gpu-exclusive

thrust is a combination of host code and device code, and thrust may be doing unexpected things. You can discover this by parsing through the thrust code or by running it through a profiler. You’ll get an idea of additional kernel calls and cudaMemcpy operations that thrust may be triggering that you hadn’t anticipated. I would assume something like that is occurring if the cudaDeviceSynchronize() is needed before the thrust sort operation. (But based on my simple test below, it’s not needed.)

If you have concerns about where you think thrust is executing an algorithm, you can use thrust execution policies to ensure that it executes the algorithm where you want it to:

https://thrust.github.io/doc/group__execution__policies.html

by specifying thrust::device as the first parameter to your sort, you can guarantee that thrust will interpret the pointers as device pointers and sort the data on the device. However, if you simply use the device_ptr directly:

thrust::sort_by_key(fourier, fourier+N, indices, cplxLess());

thrust recognizes that the usage of device_ptr indicates the algorithm is to be performed on the device (implicit dispatch).

Yes, CUFFT can be done in place. That is discussed in the documentation also:

http://docs.nvidia.com/cuda/cufft/index.html#using-the-cufft-api

just search for “in-place”

In the future, I’d suggest building a simpler test case than what you have shown. The CUFFT and soundfile manipulation are mostly irrelevant to your questions, I think. (the kernel is sufficient to demonstrate the device activity preceding the UM activity.)

Here’s an abbreviated code that seems to work for me without a cudaDeviceSynchronize() prior to the sort call:

$ cat t750.cu
#include <thrust/sort.h>
#include <thrust/device_ptr.h>
#include <stdio.h>

// Helper functions, main method below

// Kernel to fill indices buffer with correct value
__global__
void fillIdx(int * indices, int N){
   int idx = threadIdx.x+blockIdx.x*blockDim.x;
   if (idx<N)
      indices[idx]=N-idx;
}

// Main method
int main(){
   // Indices ptr, complex ptr, FFT plan
   int * A(0);
   int * B(0);

   int N = 65535;
   while (N%1024) N++;

   // Allocate managed memory for indices, fourier components
   cudaMallocManaged((void **)&A, sizeof(int)*N);
   cudaMallocManaged((void **)&B, sizeof(int)*N);

   // # threads, # blocks
   int nT=1024, nB=N/nT;

   // Fill the index array with a kernel
   fillIdx<<<nB,nT>>>(A,N);
   fillIdx<<<nB,nT>>>(B,N);

   // Create thrust device pointers
   thrust::device_ptr<int> indices(A);
   thrust::device_ptr<int> fourier(B);

   // Why do I need a cudaDeviceSynchronize call before the sort?
//   cudaDeviceSynchronize();
   thrust::sort_by_key(fourier, fourier+N, indices);

   // Just to read back
   cudaDeviceSynchronize();
   for (int i=N-10;i<N;i++){
      printf("%d, %d \n", A[i], B[i]);
   } printf("\n");

   // Teardown
   cudaFree(A);
   cudaFree(B);

   return 0;
}
$ nvcc -arch=sm_35 -o t750 t750.cu
$ cuda-memcheck ./t750
========= CUDA-MEMCHECK
65527, 65527
65528, 65528
65529, 65529
65530, 65530
65531, 65531
65532, 65532
65533, 65533
65534, 65534
65535, 65535
65536, 65536

========= ERROR SUMMARY: 0 errors
$

Thanks for the response,

I thought about trimming down the code to its essentials, but I didn’t think it was terribly complex to begin with and figured no one actually needed to run it. Still I think you’re right, I’ll create a simple worked example for questions in the future.

I got a thrust::system::system_error exception thrown at me when I used the device pointer directly, but I think it’s because I’m not overloading the less functor correctly for a device pointer to that type (thrust::device_ptr), and hopefully I’ll be able to get through it.

I’ll continue to tinker and check out the profiler. My aim here is to see if using Managed memory provides any sort of performance increase in an application like this.

Thanks for your help,

John

edit: and thanks for your code revision, I’ll check that out immediately

Here’s a more complete example based on your code:

$ cat t750.cu
#include <thrust/sort.h>
#include <thrust/functional.h>
#include <thrust/device_ptr.h>
#include <stdio.h>
#include <cufft.h>
#include <thrust/execution_policy.h>

// Helper functions, main method below

// Kernel to fill indices buffer with correct value
__global__
void fillIdx(int * indices, int N){
   int idx = threadIdx.x+blockIdx.x*blockDim.x;
   if (idx<N)
      indices[idx]=idx;
}
__global__
void fillcplx(cufftComplex * indices, int N){
   int idx = threadIdx.x+blockIdx.x*blockDim.x;
   if (idx<N){
      indices[idx].x=N-idx;
      indices[idx].y= 0;}
}

// Magnitude of a complex number
__inline__
__host__ __device__
float cplxMag(cufftComplex cplx){
   return sqrt(pow(cplx.x,2)+pow(cplx.y,2));
}

// Less functor for cufttComplex
struct cplxLess
{
   __inline__
   __host__ __device__
   bool operator()(const cufftComplex& a, const cufftComplex& b) const{
      return cplxMag(a)<cplxMag(b);
   }
};

// Main method
int main(){
   // Indices ptr, complex ptr, FFT plan
   int * A(0);
   cufftComplex * B(0);

   int N = 65535;
   while (N%1024) N++;

   // Allocate managed memory for indices, fourier components
   cudaMallocManaged((void **)&A, sizeof(int)*N);
   cudaMallocManaged((void **)&B, sizeof(cufftComplex)*N);

// # threads, # blocks
   int nT=1024, nB=N/nT;

   // Fill the index array with a kernel
   fillIdx<<<nB,nT>>>(A,N);
   fillcplx<<<nB,nT>>>(B,N);

   thrust::sort_by_key(thrust::device, B, B+N, A, cplxLess());

   // Just to read back
   cudaDeviceSynchronize();
   for (int i=N-10;i<N;i++){

      printf("%d:%f \n", A[i], B[i].x);
   } printf("\n");

   // Teardown
   cudaFree(A);
   cudaFree(B);

   return 0;
}
$ nvcc -arch=sm_35 -o t750 t750.cu
$ cuda-memcheck ./t750
========= CUDA-MEMCHECK
9:65527.000000
8:65528.000000
7:65529.000000
6:65530.000000
5:65531.000000
4:65532.000000
3:65533.000000
2:65534.000000
1:65535.000000
0:65536.000000

========= ERROR SUMMARY: 0 errors
$

I think it’s unlikely that managed memory would provide any performance benefit vs. manually copying the data as needed.

You’re a true hero, Bob, and I agree.

Thank you.

edit:
Unfortunately I’ve hit some ugly stl-reminiscent errors when compiling your example on my Jetson.
According to devicequery the cuda capability is 3.2, so unless I’m misinterpreting your command I ran

nvcc -arch=sm_32 txbob_test.cu -o unifiedSort

after pasting your code into txbob_test.cu. I figure that’s because of some error with my overloaded less functor, but seeing as how you’ve got the compiled output right there I must be making a mistake. I’ll continue to look into it since you’ve already done more than I could have asked for, but for posterity here’s a sampling of the type of error I got

        instantiation of "thrust::system::cuda::detail::stable_sort_detail::enable_if_comparison_sort<RandomAccessIterator1, StrictWeakOrdering>::type thrust::system::cuda::detail::stable_sort_detail::stable_sort_by_key(thrust::system::cuda::detail::execution_policy<DerivedPolicy> &, RandomAccessIterator1, RandomAccessIterator1, RandomAccessIterator2, StrictWeakOrdering) [with DerivedPolicy=thrust::system::cuda::detail::par_t, RandomAccessIterator1=cufftComplex *, RandomAccessIterator2=int *, StrictWeakOrdering=cplxLess]" 

/usr/local/cuda/bin/…/targets/armv7-linux-gnueabihf/include/thrust/system/cuda/detail/sort.inl(279): here
instantiation of “void thrust::system::cuda::detail::stable_sort_by_key(thrust::system::cuda::detail::execution_policy &, RandomAccessIterator1, RandomAccessIterator1, RandomAccessIterator2, StrictWeakOrdering) [with DerivedPolicy=thrust::system::cuda::detail::par_t, RandomAccessIterator1=cufftComplex *, RandomAccessIterator2=int *, StrictWeakOrdering=cplxLess]”
/usr/local/cuda/bin/…/targets/armv7-linux-gnueabihf/include/thrust/detail/sort.inl(131): here
instantiation of “void thrust::stable_sort_by_key(const thrust::detail::execution_policy_base &, RandomAccessIterator1, RandomAccessIterator1, RandomAccessIterator2, StrictWeakOrdering) [with DerivedPolicy=thrust::system::cuda::detail::par_t, RandomAccessIterator1=cufftComplex *, RandomAccessIterator2=int *, StrictWeakOrdering=cplxLess]”
/usr/local/cuda/bin/…/targets/armv7-linux-gnueabihf/include/thrust/system/detail/generic/sort.inl(89): here
instantiation of “void thrust::system::detail::generic::sort_by_key(thrust::execution_policy &, RandomAccessIterator1, RandomAccessIterator1, RandomAccessIterator2, StrictWeakOrdering) [with DerivedPolicy=thrust::system::cuda::detail::par_t, RandomAccessIterator1=cufftComplex *, RandomAccessIterator2=int *, StrictWeakOrdering=cplxLess]”
/usr/local/cuda/bin/…/targets/armv7-linux-gnueabihf/include/thrust/detail/sort.inl(103): here
instantiation of “void thrust::sort_by_key(const thrust::detail::execution_policy_base &, RandomAccessIterator1, RandomAccessIterator1, RandomAccessIterator2, StrictWeakOrdering) [with DerivedPolicy=thrust::system::cuda::detail::par_t, RandomAccessIterator1=cufftComplex *, RandomAccessIterator2=int *, StrictWeakOrdering=cplxLess]”
txbob_test.cu(62): here

I’ve tried compiling the code with

nvcc -arch=sm_32 …

on both CUDA 6.5 and CUDA 7.0 without issue.

What version of CUDA is installed on that Jetson?

Ah, maybe that’s it, it seems this Jetson has 6.0. I’ll look into upgrading it ASAP…

Here’s the output of devicequery

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GK20A”
CUDA Driver Version / Runtime Version 6.0 / 6.0
CUDA Capability Major/Minor version number: 3.2
Total amount of global memory: 1746 MBytes (1831051264 bytes)
( 1) Multiprocessors, (192) CUDA Cores/MP: 192 CUDA Cores
GPU Clock rate: 852 MHz (0.85 GHz)
Memory Clock rate: 924 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 131072 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 1, Device0 = GK20A
Result = PASS

Do you recommend that I upgrade?

edit:

I just tried this on another machine I have… it’s got a Tesla C2050, Cuda 6.5, Cuda Capability 2.0. I’ve noticed that, at least with the driver I’ve installed for it, things like UMA and context sharing do not run. However, the do compile, and I still get the same error after directly copying and pasting your code. The code compiles fine on this machine and on my Jetson with the call to thrust::sort commented out.

I’m sure the error is on my end… my hardware is a bit unorthodox. I’ll let you know what happens if I ever get Cuda 6.5 on my Jetson.

For the record, here’s the output of devicequery on my Tesla C2050 machine, where again I ran into the same compilation error:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “Tesla C2050”
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 3071 MBytes (3220373504 bytes)
(14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1500 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

I am able to reproduce the same compile error on CUDA 6.0. The thrust version in CUDA 6.0 does not like the thrust execution policy of thrust::device with the raw pointers A and B, it seems.

Based on my testing, the compile error (in CUDA 6.0) can be fixed by adding back the device_ptr usage:

thrust::device_ptr<cufftComplex> dpB(B);
thrust::device_ptr<int>          dpA(A);

thrust::sort_by_key(thrust::device, dpB, dpB+N, dpA, cplxLess());

regarding CUDA 6.5, it’s possible I was testing on a CUDA 6.5 machine that had a newer version of thrust on it.

regarding UM, you are correct it is not supported on cc2.0 devices like your C2050. UM requires cc3.0+

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements

Lovely, that compiles. I thought I had tried that…

Thanks, you’ve been a huge help.