Main.cu(446): error: a host function call cannot be configured


The total horror show of dealing with NVidia CUDA is that code which works
and produces good results can be a disaster to compile. Why? Because nvcc
is a horror show. A nightmare.

I wrote some code that simply does a bit of math that we have seen since
the early 1980s. Easy stuff. The good old Mandelbrot is always fun. We
should all be thankful to Scientific American for the article ( Aug 1985 ) that
was a great joy to those of us with a mainframe handy.

So it seemed clear to me that any decent GPU from NVidia can crank the
numbers on this in a jiffy. Well, after some hacking and testing it was clear
that I was correct. The GPU will rip through the math faster than anything
I have ever seen. This includes modern IBM POWER9 killers.

A pile of people have run this code and they can produce good, stable
numerical data. That is fancy salespeople talk for "it works". The numbers
make sense. Certainly when we plot them onto a pretty colour screen. Hey,
let's be honest, the original FORTRAN code would crank out the image onto
a line printer. So many trees lost to that way back in the '80s.

However, the joy goes away when I try to make the code into little easy bits
we can all chew.  There is a ( poorly written ) article that suggests we can
break the code down into little pieces and then compile the objects and link
them later : 

Separate Compilation and Linking of CUDA C++ Device Code
https://developer.nvidia.com/blog/separate-compilation-linking-cuda-device-code/
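
The recipe in that article boils down to compiling every .cu with relocatable
device code ( nvcc -dc ) and then letting nvcc do the device link when the
objects are pulled together. Roughly this sort of split, with made-up file
names rather than the article's exact code :

    /* dev_square.cuh -- sketch only : declaration shared between files     */
    __device__ double dev_square( double x );

    /* dev_square.cu -- compiled on its own :  nvcc -dc dev_square.cu       */
    __device__ double dev_square( double x ) { return x * x; }

    /* main.cu -- calls the __device__ code that lives in another object    */
    /*   nvcc -dc main.cu                                                   */
    /*   nvcc main.o dev_square.o -o app   ( nvcc does the device link )    */
    #include "dev_square.cuh"

    __global__ void kern( double *out, int n )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ( i < n )
            out[i] = dev_square( (double) i );
    }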

Well guess what ? 

That code and makefile are history. Nope, they will not work anymore.

Easy to fix : 

https://git.sr.ht/~blastwave/bw/tree/bw/item/nvidia_cuda/separate_compile_link/Makefile

Okay, not so easy. Suffice it to say that I wish I had the GV100 killer or
the all-singing, all-dancing Ada Lovelace. Who was in the room when the
marketing and sales people came up with that name? "Defining a dramatically
higher baseline of GPU performance, it marks the tipping point for AI, ray
tracing, and neural graphics." Everyone goes dancing ... except the code
will not compile.

So the code that was working could crank all the numbers and create an
output file of data quite neatly. I gave up on the NVidia CUDA compiler beer
magic and tossed it all into a single dumb file : 

https://git.sr.ht/~blastwave/bw/tree/bw/item/nvidia_cuda/mandel_hack/one_big.cu

            T H A T    W O R K S

It cranks the numbers and creates an output file whose SHA512 hash matches
perfectly on IBM POWER9 and on any damn thing you have. The math
is clean. The data can be checked. It even plots to make pretty pictures.

Feel free to run the stuff at : 

    https://git.sr.ht/~blastwave/bw/tree/bw/item/xmand

Regardless, I did break the code out into little easy pieces : 

    https://git.sr.ht/~blastwave/bw/tree/bw/item/nvidia_cuda/mandel_hack

However, there is no EASY way to compile and link. Why ? What is going
on here : 

titan$ ./do.sh 
gpu_mbrot.cu(22): warning #20039-D: a __host__ function("gpu_mbrot") redeclared with __device__, hence treated as a __host__ __device__ function

gpu_mbrot.cu(22): warning #20039-D: a __host__ function("gpu_mbrot") redeclared with __device__, hence treated as a __host__ __device__ function

main.cu(446): error: a host function call cannot be configured

main.cu(617): warning #550-D: variable "err_status" was set but never used

1 error detected in the compilation of "main.cu".
titan$

What does that mean?

What is going on here ? 

https://www.genunix.com/dclarke/nvidia_cuda_wat.txt

--
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken

PS: I gave up on a Makefile 

In mand.h, the function is declared as an implicit __host__ function:

void gpu_mbrot( const double *c_r, const double *c_i, uint32_t *mval, int num_elements );

In gpu_mbrot.cu, it is declared as a __device__ function:

__device__ 
void gpu_mbrot( const double *c_r, const double *c_i, uint32_t *mval, int num_elements )

In main.cu, it is assumed to be a __global__ function, which it is not:

gpu_mbrot<<<blocksPerGrid, threadsPerBlock>>>( device_r, device_j, device_mval, num_elements );

That won't work, and it is reported by your nvcc compiler.

Okay, so what is the deal here ? 

Perhaps in mand.h we need to say __device__ there ? 

stick "  __device__ " inside mand.h and then boom : 

$ nvcc -gencode arch=compute_35,code=sm_35 -gencode arch=compute_60,code=sm_60 -Wno-deprecated-gpu-targets --ftz=false --prec-div=true --prec-sqrt=true --fmad=true -c -o array_offset.o array_offset.c
In file included from array_offset.c:3:
mand.h:55:11: error: expected ‘;’ before ‘void’
   55 | __device__ void gpu_mbrot( const double *c_r, const double *c_i, uint32_t *mval, int num_elements );
      |           ^~~~~
      |           ;
$ 

So that will not work. 
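
One standard way to keep the plain C compiler happy, for what it is worth, is
to hide the CUDA qualifier behind the __CUDACC__ macro that nvcc defines when
it compiles CUDA source. A sketch of what mand.h could look like ( just a
sketch, not the real header ) :

    /* mand.h -- sketch : plain C compiles never see the CUDA qualifier     */
    #include <stdint.h>

    #ifdef __CUDACC__              /* defined when nvcc compiles .cu files  */
    #define MAND_QUAL __global__   /* or __device__, whichever is correct   */
    #else
    #define MAND_QUAL              /* gcc building array_offset.c sees none */
    #endif

    MAND_QUAL
    void gpu_mbrot( const double *c_r, const double *c_i,
                    uint32_t *mval, int num_elements );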

Perhaps you mean something else?

I won't track this issue down for you. __device__ functions cannot be executed from the host anyway; you need to use __global__.

If you are not familiar with CUDA, I would suggest reading the official CUDA
C Programming Guide or other training material, and starting with a simple
"hello world" program to better understand the different types of annotations,
rather than a complex multi-file project.
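
To make the __global__ point concrete : the thing that gets launched with
<<< blocksPerGrid, threadsPerBlock >>> has to be a kernel entry point, so
gpu_mbrot would need to look roughly like this ( a sketch with the same
signature; the real escape-time loop and the iteration cap in gpu_mbrot.cu
may well differ ) :

    /* gpu_mbrot.cu -- sketch of a launchable kernel, not the real code     */
    #include <stdint.h>

    __global__
    void gpu_mbrot( const double *c_r, const double *c_i,
                    uint32_t *mval, int num_elements )
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if ( idx >= num_elements )
            return;

        /* plain escape-time iteration with a made-up cap of 65535 */
        double zr = 0.0, zi = 0.0;
        uint32_t k = 0;
        while ( ( zr * zr + zi * zi ) <= 4.0 && k < 65535u ) {
            double t = zr * zr - zi * zi + c_r[idx];
            zi = 2.0 * zr * zi + c_i[idx];
            zr = t;
            k = k + 1;
        }
        mval[idx] = k;
    }

With that annotation the gpu_mbrot<<< ... >>>( ... ) launch in main.cu should
be what nvcc expects, and the "a host function call cannot be configured"
error should go away.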


The code works just fine. 

I can slam all the sources into one big dumb file and it compiles
and runs perfectly.

The only problem appears when I need to "decorate" the code bits with the
NVidia CUDA special spices in order for the NVidia CUDA compiler to
not choke. Which it does.

In any case I will go back and look at the cute flavour decorations that
NVidia wants. Either that or NVidia just wants to sell me something for
more money.