How do the Bessel functions work?

Hi all,
I am using CUDA 4.2 and call the Bessel function jn in both CPU and GPU code. I wrap my code with Thrust, but I have found that the jn function only works on the host, not on the device. If I use it on the device, even jn(0, 1) gives the wrong result.

#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

struct OP
{
    __host__ __device__ double operator()(double k) const { return jn(0, 1.0); }
};

int main(int argc, char **argv)
{
    thrust::device_vector<double> V(5);
    thrust::transform(V.begin(), V.end(), V.begin(), OP());
    std::cout << V[0] << std::endl;

    return 0;
}

Does jn(0,1) return the correct answer when it is NOT wrapped in Thrust, i.e. in a plain CUDA kernel? Does upgrading to CUDA 5.0 or 5.5 solve the problem?

If I use Thrust with a host_vector it works, but it does not work with a device_vector. I don’t have the rights to install CUDA 5.0 or 5.5, so I don’t know whether that would solve the problem. Hopefully someone will get a chance to test that ;)

I have tested CUDA’s j0 function for fun, and indeed it works on the host but does not return correct results on the device. I have not used the Thrust library.

My configuration: CUDA 5.0 and Win7.

Your code uses double-precision data, so you need a device with double-precision support, i.e. at least compute capability 1.3. Make sure you compile with the appropriate -arch flag; nvcc defaults to code generation for compute capability 1.0, which does not support double precision.
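For example, on a compute capability 1.3 device the compile line might look like this (the file name is illustrative):

nvcc -arch=sm_13 -o bessel_test bessel_test.cu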

If, once that is taken care of, you are still getting incorrect results, I would suggest trying the Bessel functions in a minimal test app without the use of Thrust. By my understanding of the Bessel functions, jn(0, 1.0) should be equivalent to j0(1.0), so you might want to give the latter a try as well (it could also be more efficient to call the specific rather than the generic function).

The device is an NVIDIA Tesla C2075, and I compile with nvcc with the flag -arch=sm_20. It should be fine, shouldn’t it?

Yes, that looks fine. I have not used the Bessel functions before; I will give them a try in a bit (without Thrust) and let you know what I find. Can you show the expected vs the actual result from calling jn(0,1.0) so we can compare notes?

I tried changing all doubles to floats, but it still gives the same wrong numbers.

I just asked our IT staff to test the code with CUDA 5.5, but again, it gives the same wrong numbers.

Does your code check the return status of all CUDA API calls and of all kernel launches? I suspect your GPU code may not actually be running at all and is picking up random data.

So how do I check that? I have just started to learn CUDA and currently use Thrust only; I don’t know how to do that without the Thrust wrapper.

Sorry, I have never used Thrust.
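In plain CUDA code (outside Thrust), a minimal error-checking pattern could look like this (a sketch; the macro name is made up):

#include <cstdio>
#include <cstdlib>

// Minimal CUDA error check (illustrative). Wrap every runtime API call in it,
// and call it on cudaGetLastError() after each kernel launch.
#define CHECK_CUDA(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage after a kernel launch:
//   mykernel<<<grid, block>>>(...);
//   CHECK_CUDA(cudaGetLastError());      // detects launch errors
//   CHECK_CUDA(cudaDeviceSynchronize()); // detects execution errors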

OK, I have written a plain-CUDA-only version for testing.

#include <iostream>
#include <cuda_runtime.h>

using namespace std;

__global__ void filldevice(double* ddata, int num)
{
  for (int k=0; k<num; k++) ddata[k] = jn(0, 1.0);
}

int main(void)
{
  const int num=5;
  double* ddata, hdata[num], hdata2[num];
  cudaError_t r;

  // fill reference data on the host
  for (int k=0; k<num; k++) hdata[k] = jn(0, 1.0);

  r = cudaMalloc((void**)&ddata, sizeof(double)*num);
  cout << "cudaMalloc: " << cudaGetErrorString(r) << endl;

  r = cudaMemcpy(ddata, hdata, sizeof(double)*num, cudaMemcpyHostToDevice);
  cout << "cudaMemcpy hdata => ddata: " << cudaGetErrorString(r) << endl;

  filldevice<<<1,1>>>(ddata, num);
  r = cudaGetLastError();   // check the kernel launch as well
  cout << "kernel launch: " << cudaGetErrorString(r) << endl;

  cout << "Host Data: ";
  for (int k=0; k<num; k++) cout << hdata[k] << " ";
  cout << endl;

  r = cudaMemcpy(hdata2, ddata, sizeof(double)*num, cudaMemcpyDeviceToHost);
  cout << "cudaMemcpy ddata => hdata2: " << cudaGetErrorString(r) << endl;

  cout << "Host Data2: ";
  for (int k=0; k<num; k++) cout << hdata2[k] << " ";
  cout << endl;

  r = cudaFree(ddata);
  cout << "cudaFree: " << cudaGetErrorString(r) << endl;

  return 0;
}

I didn’t see any error messages, but the result is not right for jn(0, 1).

I have tested your code (CUDA 5.0 and Win7), and it gives a wrong number for the device data. I have also used j0 in a standard CUDA kernel and obtained the same result for the device data.

If you are in urgent need of a temporary workaround (until the issue is clarified), you can immediately use the routines from

W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, “Numerical Recipes in C”.

Basically, you need to put the __device__ keyword in front of those function headers. The drawback of those routines is that their accuracy is limited (they do not reach double-precision accuracy, if I remember correctly). For better accuracy, there are alternative solutions.

Thanks for your suggestion. So does this mean I can use any self-defined function in device code as long as I add the __device__ keyword?

Yes. __device__ means that the function can be called only from the device, not from the host.

Basically, instead of having

__global__ void filldevice(double* ddata, int num)
{
  for (int k=0; k<num; k++) ddata[k] = jn(0, 1.0);
}

you should have something like

__device__ float bessj0(float x) {
   // do some stuff
}

__global__ void filldevice(float* ddata, int num)
{
  for (int k=0; k<num; k++) ddata[k] = bessj0(1.0f);
}

I repeat that those functions will not reach double-precision accuracy unless properly modified.
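For illustration, here is a minimal self-contained __device__ sketch (a truncated power series for J0, usable only for small |x|; this is not the Numerical Recipes routine, and the function name is made up):

// Illustrative sketch only: truncated power series
//   J0(x) = sum_{m>=0} (-1)^m (x/2)^(2m) / (m!)^2
// Reasonable for |x| up to a few; NOT double-precision accurate in general.
__device__ double my_bessj0(double x)
{
    double q    = 0.25 * x * x; // (x/2)^2
    double term = 1.0;          // m = 0 term
    double sum  = 1.0;
    for (int m = 1; m <= 20; ++m) {
        term *= -q / ((double)m * (double)m); // ratio of consecutive terms
        sum  += term;
    }
    return sum;
}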

j0(x) seems to give the expected results.

Host Data  : 0.765198 0.765198 0.765198 0.765198 0.765198 
Host Data2: 0.765198 0.765198 0.765198 0.765198 0.765198

I am following up with the library team, thanks for reporting the issue.

I confirmed that double-precision jn(0,1.0) delivers an incorrect result. But double-precision j0(1.0) delivers the correct result. For now, I would suggest calling j0() directly instead of via the generic function jn(). Please file a bug report for this issue using the bug reporting form linked from the registered developer website. Sorry for the inconvenience and thank you for your help.

reference: jn(0,1.0) = 7.651976865579665514497e-1
CUDA j0(1.0) = 7.6519768655796638e-1
CUDA jn(0,1.0) = 8.8256964215676956e-2

Best I can tell, the mappings jn(0,x) --> j0(x) and jn(1,x) --> j1(x) are broken at the moment. As a workaround, please invoke j0() and j1() directly. This applies to both the single- and the double-precision versions.

Thanks for pointing that out. But the point is that I need arbitrary orders, from 0 up to arbitrarily high order; I used jn(0, x) in the code only to demonstrate the issue.
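Since j0() and j1() do work on the device, one possible stopgap for higher orders is the standard upward recurrence J_{k+1}(x) = (2k/x) J_k(x) - J_{k-1}(x). Note that upward recurrence becomes numerically unstable once the order exceeds x, so treat this as a sketch only (the function name is made up):

// Sketch: jn(n, x) built from the working device j0()/j1() via upward
// recurrence. Caution: loses accuracy for orders n > x.
__device__ double my_jn(int n, double x)
{
    if (n == 0) return j0(x);
    if (n == 1) return j1(x);
    double jkm1 = j0(x); // J_{k-1}(x)
    double jk   = j1(x); // J_k(x)
    for (int k = 1; k < n; ++k) {
        double jkp1 = (2.0 * k / x) * jk - jkm1;
        jkm1 = jk;
        jk   = jkp1;
    }
    return jk;
}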