Hi All,
I’m running some tests using UMA, and I wanted to try and sort the complex output of an FFT .
In the code below I’m using libsndfile to read in a .wav file and cuFFT to convert it to the frequency domain. I’d then like to sort it using thrust.
All of that seems to be working, but what I’m actually interested in is seeing how that works with UMA memory. Let’s look at the code
#include <thrust/sort.h>
#include <thrust/functional.h>
#include <thrust/device_ptr.h>
#include <cufft.h>
#include <stdio.h>
#include <sndfile.h>
// Helper functions, main method below
// Kernel to fill indices buffer with correct value
__global__
void fillIdx(int * indices, int N){
int idx = threadIdx.x+blockIdx.x*blockDim.x;
if (idx<N)
indices[idx]=idx;
}
// Magnitude of a complex number
__inline__
__host__ __device__
float cplxMag(cufftComplex cplx){
return sqrt(pow(cplx.x,2)+pow(cplx.y,2));
}
// Less functor for cufttComplex
struct cplxLess : public thrust::less<cufftComplex>
{
__inline__
__host__ __device__
bool operator()(const cufftComplex& a, const cufftComplex& b) const{
return cplxMag(a)<cplxMag(b);
}
};
// Main method
int main(){
// Indices ptr, complex ptr, FFT plan
int * A(0);
cufftComplex * B(0);
cufftHandle plan;
// Soundfile info, file ptr
SF_INFO info;
SNDFILE * file = sf_open("drTest2.wav", SFM_READ, &info);
// Pad # of frames till N%1024 == 0
int N = info.frames;
while (N%1024) N++;
// Allocate managed memory for indices, fourier components
cudaMallocManaged((void **)&A, sizeof(int)*N);
cudaMallocManaged((void **)&B, sizeof(cufftComplex)*N);
// Read wav file (as floats) into complex buffer (inplace transform)
sf_readf_float(file, (float *)B, info.frames);
// Close the file
sf_close(file); file=0;
// # threads, # blocks
int nT=1024, nB=N/nT;
// Fill the index array with a kernel
fillIdx<<<nB,nT>>>(A,N);
// Create the forward FFT, execute in place
cufftPlan1d(&plan, N, CUFFT_R2C,, 1);
cufftExecR2C(plan, (float *)B, B);
// Create thrust device pointers
thrust::device_ptr<int> indices(A);
thrust::device_ptr<cufftComplex> fourier(B);
// Why do I need a cudaDeviceSynchronize call before the sort?
cudaDeviceSynchronize();
thrust::sort_by_key(fourier.get(), fourier.get()+N, indices.get(), cplxLess());
// Just to read back
//cudaDeviceSynchronize();
for (int i=N-10;i<N;i++){
float freq = float(A[i])*(float(info.frames)/float(info.samplerate));
printf("%f ", freq);
} printf("\n");
// Teardown
cudaFree(A);
cudaFree(B);
cufftDestroy(plan);
return 0;
}
In that code, I’ve got to call cudaDeviceSynchronize() before I can sort or else I get a segfault. The docs say that cudaDeviceSynchronize() blocks until all CUDA operations complete, but in the past when I’ve used UMA I’ve also had to call that function before I can access my buffer from a host function.
Basically, I’m concerned that the thrust::sort_by_key call is occurring on the host, rather than on the device. Is such a thing possible with thrust? Or are all algorithm calls assumed to be running on the device?
Sorry if this is a basic question, I found plenty of thrust sorting examples but none of them dealt directly with UMA. You can see that I tried to use thrust::device_ptrs to wrap my unified memory buffers, and while this compiles I have no idea if it’s kosher or not.
Sorry if that was too much text, but if you read it, thanks for reading and hope you can help.
- John
p.s cuFFT calls can be done in place, right? I tried graphing the output and everything seemed ok…