Why do errors occur when I try to use the "cosf" function in CUDA?

__global__ void TukeyWin(int N, float r, float* t, float* result)
{
	int i = threadIdx.x + blockDim.x * blockIdx.x;

	float per = 0.5 * r;

	int tl = floor(per * (N - 1)) + 1;

	int th = N - tl + 1;

	if (i < N)
	{
		t[i] = (float)i / (float)(N - 1);
	}
	__syncthreads();
	if (i < tl)
	{
		result[i] = (1 + cosf(pi / per * (t[i] - per))) / 2;
	}
	__syncthreads();
	if (i >= tl && i < th - 1)
	{
		result[i] = 1;
	}
	__syncthreads();
	if (i >= th - 1 && i < N)
	{
		result[i] = (1 + cosf(pi / per * (t[i] - 1 + per))) / 2;
	}
}

Errors like:

identifier "acos<int, (int)0> " is undefined in device code	
calling a __host__ function("double  ::acos<int, (int)0> (T1)") from a __global__ function("TukeyWin") is not allowed

I have included the header "cuda_runtime.h".

I do not see a call to acos() in the posted code, yet the error message seems to complain about such a call. Is the code posted here identical to what you are actually compiling?

The standard math functions have a limited number of overloads. Overloads for float and double definitely exist. I am not aware of an overload double acos (int) being supported by CUDA and am not aware that such an overload is required by C++.

There is a possibility that such an overload may have been added in one of the more recent revisions of the ISO-C++ standard (check the standard document). If such an overload is required, and the particular revision of the C++ standard that requires it is claimed to be supported by CUDA, you may want to file a bug.

Hm, cppreference says the int overloads were added in C++11:

Additional overloads (since C++11)
Defined in header <cmath>
template< class Integer >
double acos ( Integer num );

I am looking at the list of supported C++11 features in CUDA, but they are organized by specific proposal to the standard committee, which is a bit inconvenient for regular folks. None of the proposals listed looks like it has anything to do with adding integer overloads for standard math functions, which likely means that this C++11 feature is not supported by CUDA.

Assuming the problem is reproducible with CUDA 12.2, you could file a feature request with NVIDIA. Use the bug reporting form and make a note that this is an enhancement request.

Many thanks. I can’t figure this problem out for now, so I rewrote my code in C, which is a little slower but more stable and compiles without errors. 0_0

If the only problem is the passing of int arguments to trigonometric functions as indicated by the error message in the thread-starting post, I would suggest simply casting such function arguments to double. Then the existing overload for double will match.
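
As a minimal sketch of that suggestion: an integer argument can be cast to double before it reaches acos, so that the existing double overload is selected in device code. The helper AcosOfInt and the kernel PiKernel are hypothetical names and do not appear in the posted code. The thread never shows how pi was defined in the failing file. A macro such as acos(-1) would be one way an int argument could end up on the cosf line, but that is only a guess.

```cuda
#include <cmath>
#include <cstdio>

// Hypothetical helper (not in the posted code): takes an int and casts it
// explicitly, so the call resolves to the double overload of acos, which CUDA
// provides for device code. Passing n directly would ask for acos(int).
__host__ __device__ inline double AcosOfInt(int n)
{
    return acos((double)n);
}

// Hypothetical kernel that stores acos(-1), i.e. pi, computed on the device.
__global__ void PiKernel(double* out)
{
    out[0] = AcosOfInt(-1);
}

int main()
{
    double* d_out = nullptr;
    double h_out = 0.0;

    if (cudaMalloc(&d_out, sizeof(double)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    PiKernel<<<1, 1>>>(d_out);
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("acos(-1) on the device = %.6f\n", h_out);
    return 0;
}
```

Writing the literal as -1.0 instead of -1 would have the same effect as the cast.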

It’s really weird. I didn’t call any acos function in my code, and the error referred to this line:
result[i] = (1 + cosf(pi / per * (t[i] - per))) / 2;
I guessed that cosf uses acos in its definition, but I didn’t actually see that in the definition:
_Check_return_ __inline float __CRTDECL cosf(_In_ float _X) { return (float)cos(_X); }
Besides, I didn’t use the “int” type in my code. To make sure that the argument of cosf is a float, I even cast it to float explicitly:
result[i] = (1 + cosf(float(pi / per * (t[i] - per)))) / 2;
but the error was still there.

Can’t reproduce. I took the code from the original post and put it into a .cu file. I added:

#include <math.h>
#define pi 3.141592653f

at the start, and

int main (void)
{
    TukeyWin<<<1,1>>>(0,0,0,0);
    return 0;
}

at the end. Compiles without errors. Either your posted code does not reflect what you are compiling, or there is an issue with your tool chain. Do you have multiple versions of CUDA installed on your computer by any chance?

I am sure that I don’t have another version of CUDA. This is ridiculous: I placed this global function in another CUDA file and tried to compile it. It worked!

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <cuComplex.h>
#include <iostream>
#include <fstream>
#include <cstring>
#include <cmath>
#include <cufft.h>
#include <chrono>
#include <cublas_v2.h>

#define pi 3.14

__global__ void TukeyWin(int N, float r, float* t, float* result)
{
	int i = threadIdx.x + blockDim.x * blockIdx.x;

	float per = 0.5 * r;

	int tl = floor(per * (N - 1)) + 1;

	int th = N - tl + 1;

	if (i < N)
	{
		t[i] = (float)i / (float)(N - 1);
	}
	__syncthreads();
	if (i < tl)
	{
		result[i] = (1 + cosf(pi / per * (t[i] - per))) / 2;
	}
	__syncthreads();
	if (i >= tl && i < th - 1)
	{
		result[i] = 1;
	}
	__syncthreads();
	if (i >= th - 1 && i < N)
	{
		result[i] = (1 + cosf(pi / per * (t[i] - 1 + per))) / 2;
	}
}

int main()
{
	std::cout << "Hello world!" <<std::endl;
}

Magic?! Is it possible that something overloads cosf in CUDA? I think that is actually impossible, though.

There is no need to do this in a CUDA source file (.cu suffix):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

and in fact I would strongly advise against adding such includes.

Those includes are needed in my code, so I put them in. I need to set this problem aside for now, because I have to finish the rest of my project. Anyway, thank you very much for your help. Are you Asian? If you are American, it’s almost 3 o’clock there. Once again, thank you for your selfless help.

If you have time, I also have a performance problem and hope you can give me a hand.
I have two pointers and want to relocate the data of the first one into the second one. For example, the first pointer holds {1,2,3,4,5,6,7,8,9,10} and the second one holds {0,0,0,0,0,0,0,0,0,0,0,0,0,0}. After relocating,
the second one becomes {1,2,3,4,5,0,0,6,7,8,9,10,0,0}.
I tried to relocate the data by manipulating pointers, including with cublasScopy and cudaMemcpy, but that was too slow. Finally I chose to use a global function.
I first wrote my code like this:

__global__ void GetOverlapData(cuFloatComplex* Input, cuFloatComplex* Output, float* tukey, int Unit, int Interval, int num)
{
	int i = threadIdx.x + blockDim.x * blockIdx.x;

	for (int j = 0; j < num; j++)
	{
		if (i < Unit)
		{
			Output[i + j * Interval] = complexMul(Input[i + j * Unit], tukey[i]);
		}
	}
}

__device__ cuFloatComplex complexMul(cuFloatComplex a, float constant) 
{
	cuFloatComplex result;
	result.x = a.x * constant;
	result.y = a.y * constant;
	return result;
}

That is, I used a 1-dimensional thread index, and the answer was right. complexMul is a function I use to multiply a complex number by a constant. But I still thought it was not fast enough, so I rewrote this function with a 2-dimensional index:

__global__ void GetOverlapData(cuFloatComplex* Input, cuFloatComplex* Output, float* tukey, int Unit, int Interval, int num)
{
	int row = threadIdx.y + blockDim.y * blockIdx.y;
	int col = threadIdx.x + blockDim.x * blockIdx.x;

	if (row < num && col < Unit)
	{
		Output[row * Interval + col] = complexMul(Input[row * Unit + col], tukey[Unit]);	
	}
}

Then problems appeared: I can’t get the right answer. When I copied the data from device to host and printed it, some values were correct, but most were zeros.
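
A hedged sketch of what may be going wrong, based only on the two kernels shown. The 1-D version multiplies by tukey[i], while the 2-D version reads tukey[Unit], which is the same element for every thread and is one past the end if tukey holds Unit values. Separately, a 2-D kernel needs a 2-D launch configuration. With a 1-D grid and block, row is always 0, so only the first segment would be written and the rest of Output would stay zero. The launch configuration, buffer sizes and the all-ones window below are assumptions, since the thread does not show the host code.

```cuda
#include <cstdio>
#include <vector>
#include <cuComplex.h>

// Multiply a complex number by a real constant, as complexMul does in the post.
// Defined before the kernel so it is declared at the point of use.
__host__ __device__ inline cuFloatComplex complexMul(cuFloatComplex a, float constant)
{
    return make_cuFloatComplex(a.x * constant, a.y * constant);
}

// 2-D version: row selects the segment, col the element inside the segment.
__global__ void GetOverlapData(const cuFloatComplex* Input, cuFloatComplex* Output,
                               const float* tukey, int Unit, int Interval, int num)
{
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int col = threadIdx.x + blockDim.x * blockIdx.x;

    if (row < num && col < Unit)
    {
        // tukey[col], matching tukey[i] in the 1-D kernel (the post has tukey[Unit]).
        Output[row * Interval + col] = complexMul(Input[row * Unit + col], tukey[col]);
    }
}

int main()
{
    // Sizes assumed from the example in the post: 2 segments of 5, spaced 7 apart.
    const int Unit = 5, Interval = 7, num = 2;

    std::vector<cuFloatComplex> hIn(Unit * num);
    for (int k = 0; k < Unit * num; k++) hIn[k] = make_cuFloatComplex(k + 1.0f, 0.0f);
    std::vector<float> hWin(Unit, 1.0f);  // all-ones window so the copy is easy to check
    std::vector<cuFloatComplex> hOut(Interval * num);

    cuFloatComplex *dIn = nullptr, *dOut = nullptr;
    float* dWin = nullptr;
    cudaMalloc(&dIn, hIn.size() * sizeof(cuFloatComplex));
    cudaMalloc(&dOut, hOut.size() * sizeof(cuFloatComplex));
    cudaMalloc(&dWin, hWin.size() * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), hIn.size() * sizeof(cuFloatComplex), cudaMemcpyHostToDevice);
    cudaMemcpy(dWin, hWin.data(), hWin.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, hOut.size() * sizeof(cuFloatComplex));  // gaps stay zero

    // Assumed launch configuration: a 2-D block and a 2-D grid covering Unit x num.
    dim3 block(16, 16);
    dim3 grid((Unit + block.x - 1) / block.x, (num + block.y - 1) / block.y);
    GetOverlapData<<<grid, block>>>(dIn, dOut, dWin, Unit, Interval, num);

    cudaError_t err = cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(cuFloatComplex),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(err)); return 1; }

    // Print the real parts of the relocated data.
    for (size_t k = 0; k < hOut.size(); k++) printf("%g ", cuCrealf(hOut[k]));
    printf("\n");

    cudaFree(dIn); cudaFree(dOut); cudaFree(dWin);
    return 0;
}
```

With these sizes and the all-ones window, the printed real parts should be the values 1 to 5, two zeros, the values 6 to 10, and two zeros. If the real launch already uses a 2-D grid, the tukey index is the first thing to check.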