CUDA Hyperbolic Trig Functions Calculate Different Results in Different Locations

I’m running into an odd issue with a simulation I’ve coded. I’ve setup an interface function for other code to call: BFieldAtS. This function returns the result of a function pointed to by a global function pointer which I define in another place before calling this interface function. Imagine, if you would, a simulation of the earth’s magnetic field. There are a number of models that are more or less accurate in different regions. I would like the user to be able to define the model used at run time and for the BFieldAtS interface to call whatever model the user defines - hence the callback function. Perhaps there is a better way to achieve this - I’m open to that. However, that is not specifically where my question is.

The issue:
In order to accomplish the above, I’ve placed this BFieldAtS function in a .cu file and the assignment and implementation of the callback in another file. This is where I run into issues. An equation is producing blatantly wrong results. The equation is:

double x{ asinh(sqrt(3.0) * sin(lambdaDegrees * 3.1415927 / 180.0)) };

I’ve found hard-coding values for the angle (lambdaDegrees) in the equation in question produces the (correct) expected results. Additionally, if I place everything in one file and consolidate headers/remove externs, the function produces the expected results again. The rest of the functions are working correctly - i.e. with the wrong x value, the functions still calculate what would be expected with that x value - compared with hand/excel calculations.

The issue is when the implementation of the model is separated. With the code separated into two files, printing the angle before and after the equation for “x” is executed, the angle is unchanged, yet without hard-coding the angle, the equation produces the wrong results. Furthermore, it produces the same wrong value over and over. The weirdest part is simply adding another diagnostic printf function caused the function to produce yet another (different, also wrong) result. I’m wondering if it has anything to do with the way I’ve set up a callback function on the GPU…maybe multiple threads using the function at the same time leading to some (consistent) undefined behavior? Taking this one step further, sin and sqrt seems to calculate correctly, and the wrong value produced is greater than the equation allows (i.e. there is no angle that produces the output result). It seems as if the issue is with asinh specifically.

Expected value of this equation, producing “x” within getSAtLambda (the printf statement) is 1.268… Result is 1.768… Let me know what you think. By the way, I’m aware that PI as defined here has less accuracy than double (thanks njuffa). However, I don’t feel like this should impact the issue. Simply placing all code in one file (with the same definition for PI) corrects the issue.

CUDA Version: release 8.0, V8.0.60
Flags: -rdc=true, -gencode=arch=compute_50,code=“sm_50,compute_50”, otherwise normal visual studio flags (windows)

otherfuncs.cu

#include <cmath>
#include <iostream>

//CUDA includes
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda_profiler_api.h"

typedef double(*callbackFcn)(double*, int, double, double, int);

__device__ double*     fieldConstArray_GPU{ nullptr };
__device__ int         arraySize_GPU{ 7 };
__device__ callbackFcn callback_GPU{ nullptr };

__host__ __device__ double getSAtLambda(double* consts, int arrayLength, double lambdaDegrees, double simtime, int thdInd)
{//returns s in units of L
	double x{ asinh(sqrt(3.0) * sin(lambdaDegrees * 3.1415927 / 180.0)) };

	if (simtime == 0.0 && thdInd == 31487) { printf("\n\ngetSAtLambda: %f, %f\n\n", lambdaDegrees, x); }

	return (0.5 * consts[2] / sqrt(3.0)) * (x + sinh(x) * cosh(x));
}

__host__ __device__ double getLambdaAtS(double* consts, int arrayLength, double s, double simtime, int thdInd)
{// consts: [ B0, ILATDeg, L, L_norm, s_max, ds, errorTolerance ]
	double lambda_tmp{ (-consts[1] / consts[4]) * s + consts[1] }; //-ILAT / s_max * s + ILAT
	double s_tmp{ consts[4] - getSAtLambda(consts, arrayLength, lambda_tmp, simtime, thdInd) };
	double dlambda{ 1.0 };
	bool   over{ 0 };

	while (abs((s_tmp - s) / s) > 1e-4) //errorTolerance
	{
		while (1)
		{
			over = (s_tmp >= s);
			if (over)
			{
				lambda_tmp += dlambda;
				s_tmp = consts[4] - getSAtLambda(consts, arrayLength, lambda_tmp, simtime, 0);
				if (s_tmp < s)
					break;
			}
			else
			{
				lambda_tmp -= dlambda;
				s_tmp = consts[4] - getSAtLambda(consts, arrayLength, lambda_tmp, simtime, 0);
				if (s_tmp >= s)
					break;
			}
		}
		if (dlambda < 1e-4 / 100.0) //errorTolerance
			break;
		dlambda /= 5.0; //through trial and error, this reduces the number of calculations usually (compared with 2, 2.5, 3, 4, 10)
	}

	return lambda_tmp;
}

__host__ __device__ double BFieldAtS(double* consts, int arrayLength, double s, double simtime, int thdInd)
{// consts: [ B0, ILATDeg, L, L_norm, s_max, ds, errorTolerance ]
	double lambda_deg{ getLambdaAtS(consts, arrayLength, s, simtime, thdInd) };
	double lambda_rad{ lambda_deg * 3.1415927 / 180.0 };
	double rnorm{ consts[3] * pow(cos(lambda_rad), 2) };

	return -consts[0] / pow(rnorm, 3) * sqrt(1.0 + 3 * pow(sin(lambda_rad), 2));
}

__host__ __device__ double gradBAtS(double* consts, int arrayLength, double s, double simtime, int thdInd)
{
	return (BFieldAtS(consts, arrayLength, s + 6371.0, simtime, thdInd) - BFieldAtS(consts, arrayLength, s - 6371.0, simtime, thdInd)) / (2 * 6371.0);
}

__global__ void setupEnvironmentGPU(double* constArrayPtr)
{
	callback_GPU = gradBAtS; //sets pointer to callback function
	arraySize_GPU = 7;
	fieldConstArray_GPU = constArrayPtr;
}

main.cu

//CUDA includes
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda_profiler_api.h"

typedef double(*callbackFcn)(double*, int, double, double, int);

//on GPU global variables
extern __device__ double*     fieldConstArray_GPU;
extern __device__ int         arraySize_GPU;
extern __device__ callbackFcn callback_GPU;

__host__ __device__ double BFieldAtS(double* consts, int arrayLength, double s, double simtime, int thdInd);
__host__ __device__ double gradBAtS(double* consts, int arrayLength, double s, double simtime, int thdInd);
__global__ void setupEnvironmentGPU(double* constArrayPtr);

__global__ void execute()
{
	int thdInd{ blockIdx.x * blockDim.x + threadIdx.x };
	callback_GPU(fieldConstArray_GPU, arraySize_GPU, (thdInd == 31487) ? 1233005.097 : ((115200 - thdInd) / 50000.0 * 6.371e6), 0.0, thdInd ); //3rd argument are example values
}

void setupEnvironment()
{// consts: [ B0, ILATDeg, L, L_norm, s_max ]
	double fieldConstArray_h[]{ 3.12e-5, 72.0, 66717978.17, 10.47213595, 85670894.1 };
	double* fieldConstants_d{ nullptr };

	cudaMalloc((void **)&fieldConstants_d, 5 * sizeof(double));
	cudaMemcpy(fieldConstants_d, fieldConstArray_h, 5 * sizeof(double), cudaMemcpyHostToDevice);
	
	setupEnvironmentGPU <<< 1, 1 >>> (fieldConstants_d);
}

int main()
{
	setupEnvironment();
	int loops{ 0 };

	while (loops < 3)
	{
		execute <<< 115200 / 256, 256 >>> ();
		cudaDeviceSynchronize();
		loops++;
	}

	return 0;
}

A repro case also includes information about the CUDA version used, host platform used, and relevant compilation and link flags passed to nvcc.

In determining correctness, what is being compard here? It sounds like the assumption is that at the point of the printf(), the values of ‘x’ and ‘lambdaDegrees’ should be the same, however I don’t understand why this should be the case. Please clarify.

Here is the output I get when running your application (CUDA 8, Win64, sm_50). I have no idea what to do with this output:

getSAtLambda: 70.963751, 1.401904
getSAtLambda: 70.963751, 1.401904
getSAtLambda: 70.963751, 1.401904
getSAtLambda: 70.963751, 1.401904
getSAtLambda: 70.963751, 1.401904
getSAtLambda: 70.963751, 1.401904

When I run the application under control of cuda-memcheck, it indicates out-of-bounds memory accesses, in particular loads, so the code picks up random data from memory.

Great! Looks like a good observation. Updated the original post to include the requested info. So, basically the code as a whole takes in a position “s” along a magnetic field line (of a magnetic dipole) and calculates the gradient of the magnetic field at that position. The function getSAtLambda returns the position “s” of a particle at a specified angle “lambda” (one of the function arguments) from the equatorial plane (plane passing through the equator, if that visual makes sense).

The passed in “s” value is contrived for this example. In the real world application it would refer to the position of a particle in meters from the surface of the earth along the specified field line (specified by a passed in constant).

What is printed with printf is the value of the passed in latitude (lambda) in degrees for one of the particles/threads (one particle per thread) (specifically index 31487 but I think this happens in every thread). Next in the printf statement is “x” which is the equation: asinh(sqrt(3.0) * sin(lambdaDegrees * 3.1415927 / 180.0)). What is in error is the second number “x”. If you do a quick compute by hand - asinh(sqrt(3) * sin(70.963751 in radians)) is 1.2686… The value shown from your output is wrong, as is mine (and it’s a different value - 1.7 yada yada).

Seems like this out-of-bounds access is the issue. The interesting thing is that if these functions are put into the same file, the function will output the correct number for “x”. How do I track down where this error is happening? Also, is the general layout described in the first paragraph a good design, or is there a better way to achieve this? I’m also happy to explain more if needed.

Thanks for your help.

EDITS: for clarity

If you use a debug build, cuda-memcheck should point you to the line with the out-of-bounds access. Note: I am getting a watchdog timer event when I try to run the debug version, meaning the kernel runs too long. So you might want to cut down the example to a smaller version first.

There might be a compiler issue at the bottom of all this, because when I add -Xptxas -O1 to my build, I get what you describe as the correct result:

getSAtLambda: 70.963751, 1.268603
getSAtLambda: 70.963751, 1.268603
getSAtLambda: 70.963751, 1.268603
getSAtLambda: 70.963751, 1.268603
getSAtLambda: 70.963751, 1.268603
getSAtLambda: 70.963751, 1.268603

But if there are out-of-bounds accesses or race conditions in the code, it is also possible that it may just “happen to work” in some compilation modes.

The current code still seems quite complicated for a repro case; I am certainly not really being able to follow the computation. My suggestion would be to cut the code down to the bare minimum that still reproduces the issue, and then work out whether the root cause is a bug in the code or an issue with the compiler. At the current level of tool chain maturity, one should normally suspect a bug in one’s code, although compiler bugs are not unheard of.

Also: Proper status checking currently is absent from the code and I would urge checking the return status of every CUDA API call and every kernel launch, because from the output of cuda-memcheck it is clear that the out-of-bounds accesses cause kernel launch failures that are being silently ignored.

Thanks for your work on this. I’ll have to do some more reading. I don’t exactly know how to status-check. Can you point me in the right direction? Something like this macro?

#define CUDA_CALL(x) do { if((x) != cudaSuccess) { printf("Error %d at %s:%d\n",EXIT_FAILURE,__FILE__,__LINE__);}} while(0)

Also, how would I use cuda-memcheck? I’m somewhat unfamiliar with testing code in general (I’m new to this sort of thing). I can also try pairing down the example and explaining it better. Give me a bit.

Thanks.

A quick explanation:
I’m using a simple model of the Earth’s magnetic field to track the motion of electrons/ions under certain conditions. A dipole field looks something like so: http://ec.europa.eu/health/scientific_committees/opinions_layman/en/electromagnetic-fields/images/dipole.gif Those loopy lines you see coming out of the sphere are referred to as field lines. The particles follow those like a bead on a string. What this code attempts to do is to calculate the strength of the change of the magnetic field as a function of the distance of the particle along the field line from the earth’s surface - s. So, in other words, it’s like the “slope” of the magnetic field at the measured point. By the way - “magnetic field” and “B field” are interchangeable. B is the letter used to represent “magnetic field” - no idea why - maybe “M” was already used.

Now if you place your mouse cursor on one of those field lines at some point and imagine a straight line from the mouse cursor on that field line to the center of the circle (Earth), the latitude, lambda, is the angle between that straight line from the mouse cursor and the horizontal line running through the middle of the circle. Now if you move your mouse cursor along the field line, you can see that this angle, lambda, changes, as well as the distance of your mouse cursor from that central point of the circle - which we’ll call “r”. The distance that you have moved along the curved field line has also changed - we’ll call that “s”. Note the difference - “r” is the straight line distance from the center of the circle/earth to your mouse cursor on the field line, while “s” is the length along the field line to your mouse cursor from the surface of the circle.

Normally, we know the distance along the field line “s” of the particle and need to know what the magnetic field (and slope of the magnetic field - known as the gradient) is at that point.

Okay, so now I can offer a quick explanation of the functions:
otherfuncs.cu

  • getSAtLambda: returns “s” for a specified “lambda” - there is a simple function for this and it’s useful to define an intermediate value “x” to make the actual equation easier to read - printf statement is here
  • getLambdaAtS: returns “lambda” for a specified “s” - there is no simple equation for this, that I’m aware of anyway, so this function takes a(n intelligent) guess at lambda, checks to see for the “guess lambda” what the associated “guess s” value is, and if this “guess s” is not close enough to the actual s of the particle - adds/subtracts some small increment to lambda depending on if we are under/over, checks again, rinse and repeat until we are close enough
  • BFieldAtS: utilizes getLambdaAtS to find lambda, uses lambda to find r, then uses lambda and r in an equation that spits out the strength of the field
  • GradBAtS: utilizes BFieldAtS to measure the B Field at two locations: s minus some small step and s plus some small step then find the gradient through: “change in B / change in s”
  • setupEnvironmentGPU: a kernel called from the host, sets the global function pointer and constant array (which holds constants related to the simulation - necessary for some of the above equations)

variables:

  • fieldConstArray_GPU: a pointer to the first element of the constant array mentioned above
  • arraySize_GPU: the size of the array
  • callback_GPU: a pointer to gradBAtS

main.cu

  • setupEnvironment: calls setupEnvironmentGPU kernel, copies the constant array to the GPU
  • execute: simply calls gradBAtS many times (115200 in this example)
  • main: calls the two above functions, loops “execute” a few times to show that the error is consistent

Hopefully that was all interesting/relevant. Let me work on pairing the example down.

Alright, I have more info and an improved example:

Apparently it is possible to cause this behavior with everything in one file and quite a few functions cut out. The issue seems to be due to the way I have set up a callback function on the GPU but I can’t figure out what I’ve done wrong…maybe the logic behind the implementation is wrong. Note: calling the function getSAtLambda() directly yields the correct result, while calling it through a function pointer yields the wrong result (in the kernel execute() ). Thoughts?

EDIT: Running through cuda-memcheck says 0 errors, yet the wrong answer is displayed compiling as “Release”. Simply compiling as “Debug” fixes the issue and still yields 0 errors with cuda-memcheck. At this point, compiler bug? If so, what are the next steps?

#include <cmath>
#include <iostream>

//CUDA includes
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda_profiler_api.h"

typedef double(*callbackFcn)(double, int);

//on GPU global variables
__device__ callbackFcn callback_GPU;

__host__ __device__ double getSAtLambda(double lambdaDegrees, int thdInd)
{//returns s in units of L
	double x{ asinh(sqrt(3.0) * sin(lambdaDegrees * 3.1415927 / 180.0)) };

	if (thdInd == 31487) { printf("\n\ngetSAtLambda: %f, %f\n\n", lambdaDegrees, x); }

	return 0.0; //normally there would be some equation here, but not relevant to the issue
}

__global__ void setupEnvironmentGPU()
{
	callback_GPU = getSAtLambda; //sets pointer to callback function
}

__global__ void execute()
{
	int thdInd{ blockIdx.x * blockDim.x + threadIdx.x };
	callback_GPU(70.963751, thdInd); //this produces the wrong result
	//getSAtLambda(70.963751, thdInd); //this produces the right result - namely x = 1.268603
}

int main()
{
	setupEnvironmentGPU << < 1, 1 >> > ();
	int loops{ 0 };

	while (loops < 3)
	{
		execute <<< 115200 / 256, 256 >>> ();
		cudaDeviceSynchronize();
		loops++;
	}

	return 0;
}

The plot thickens. Using a trig identity for asinh - namely ln(xtmp + sqrt(xtmp * xtmp + 1)) where xtmp = sqrt(3) * sin(lambda in rads) (this is what is inside the asinh above) produces the correct results. So a quick recap:

Correct results are produced when the code above:

  1. Is compiled as debug instead of release (except for -O1)
  2. When a trig identity of asinh is used instead of the actual asinh function
  3. When the argument for asinh is hardcoded
  4. With -O1 instead of -O2 for both release and debug
  5. (Paradoxically) When the function getSAtLambda is called directly instead of through a function pointer

Incorrect results are produced for asinh(x) when:

  1. Compiled as release with -O2 with a non-hardcoded value through a function pointer

Safe to say a compiler bug / something on NVIDIA’s end?

Tom

In you functions you read from consts[5], but consts only has five elements. I guess you read random memory, which affects your results.

Thanks for the response. That was an oversight on my part when writing the example (not the case in the code where consts[5] exists). If you take a look at the bottom example, I’ve done away with the consts array entirely and just hard coded the values. The post immediately above yours details a sort of “summary” of what I’ve noticed with that out of bounds error fixed, as well as the code consolidated to one file and unnecessary fluff taken out.

Sorry, I did not pay enough attention.

What GPU are you using and what is the exact nvcc command for the example in post #6?

It runs fine on my linux machine when I use
nvcc -O2 -std=c++11 -rdc=true -gencode=arch=compute_50,code=“sm_50,compute_50” main.cu -o main

It works fine with nvcc V9.1.85 and nvcc V8.0.61 on a Titan V and a Titan Xp.
Output is always “getSAtLambda: 70.963751, 1.268603”

Maybe it is a windows problem?

GPU is GTX 960M, Windows built through Visual Studio 15, now the latest CUDA (I updated) V9.1.85.

Exact command run by visual studio is:
nvcc -gencode=arch=compute_50,code=“sm_50,compute_50” --use-local-env --cl-version 2015 -ccbin “C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64” -rdc=true -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -gencode=arch=compute_50,code=“sm_50,compute_50” -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /FS /Zi /MD " -o x64\Release\main.cu.obj “C:\folder_of_project\main.cu”

Linked by:
nvcc -dlink -o x64\Release\CUDACallbacksTest.device-link.obj -Xcompiler “/EHsc /W3 /nologo /O2 /Zi /MD " -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\lib\x64” cudart.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib -gencode=arch=compute_50,code=sm_50 --machine 64 x64\Release\main.cu.obj

Interesting all the other stuff VS tacks on.

I’m also not super committed to the layout of the code (callback function) if there’s a better way to do it…

You state that you are using CUDA 9.1, yet the MSVS log includes the following:

-I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include"

That’s a red flag. The compiler needs to be the same version as the include files it pulls in, otherwise Bad Stuff ™ can happen.

Interesting…that must have happened recently when I updated (30 min ago). How do I force VS to fix that?

Although, it was doing the same thing when I was back on CUDA 8

OK. Think I got it. I just searched the vcxproj file and replaced all (2) 8.0 with 9.1. Easy peasy. Now everything looks like its 9.1

Well alright. Now that I’m compiling with 9.1 the problem is solved!!!1!11

Thanks for all your help, and yours njuffa.