Inheritance issue in CUDA


I came across a strange issue trying to implement a simple inheritance structure in CUDA. The following code compiles fine on Windows but fails on Ubuntu. My parent class is Mat2D, which is inherited by the class clamp. On Ubuntu, the parent class's member variables _srcPtr, _dstPtr, _srcStepSize, and _dstStepSize are not recognized when compiling the child class. Any idea why? And if so, why does it compile on Windows?

#include <algorithm>
#include <vector>

#include <cuda_runtime.h>

template<typename srcT, typename dstT>
struct Mat2D {

    Mat2D(const srcT* __restrict srcPtr, int srcStepSize,
          dstT* __restrict dstPtr, int dstStepSize) {
        _srcPtr = srcPtr;
        _dstPtr = dstPtr;

        _srcStepSize = srcStepSize;
        _dstStepSize = dstStepSize;
    }

    const srcT* __restrict _srcPtr;
    dstT* __restrict _dstPtr;

    int _srcStepSize;
    int _dstStepSize;
};

template<typename T>
struct clamp : Mat2D<T, T> {

    clamp(const T* __restrict srcPtr, int srcStepSize,
          T* __restrict dstPtr, int dstStepSize, T lowVal, T highVal)
        : Mat2D<T, T>(srcPtr, srcStepSize, dstPtr, dstStepSize) {
        _operation._low_val  = lowVal;
        _operation._high_val = highVal;
    }

    struct operation {
        __host__ __device__ __forceinline__ T operator()(T val) {
            T clamp_min = val < _low_val ? _low_val : val;
            T res = clamp_min > _high_val ? _high_val : clamp_min;
            return res;
        }

        T _low_val;
        T _high_val;
    };

    __host__ __device__ void operator()(int x, int y) {
        auto val = _srcPtr[y * _srcStepSize + x];
        _dstPtr[y * _dstStepSize + x] = _operation(val);
    }

    operation _operation;
};

int main() {
    //clamp a;
}

The posted code does not compile because the identifier "GpuMat" is undefined. Pro tip: post a minimal, complete example that others can use to reproduce the issue you are observing.

Updated the code, I accidentally pressed the submit button :)
I'm using the following CMake file for compilation. The error I'm getting is error: identifier "XXX" is undefined for the Mat2D member variables _srcPtr, _dstPtr, _srcStepSize, _dstStepSize

cmake_minimum_required(VERSION 3.8 FATAL_ERROR)
project(cmake_and_cuda LANGUAGES CXX CUDA)


target_compile_features(test_cuda PUBLIC cxx_std_11)

set_target_properties( test_cuda

I should have said that what is needed is a minimal, complete, and self-contained example. That means no external dependencies like CMake. Instead, post the exact nvcc invocation used, and note the compiler versions. Also, copy and paste any error message produced by the compiler.

So far, I have been unable to reproduce the compilation failure on Linux. I used this compiler invocation based on what I think is being selected via CMake:

nvcc --std=c++11 -dc -rdc=true

I tried multiple CUDA 11.x toolchains, but don’t get any compilation errors. What CUDA version are you using? Is the version of Ubuntu and gcc used a supported configuration for that CUDA version?

I’m compiling the code on the Jetson Nano using the latest available image, the NVCC info is as follows:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_28_22:34:44_PST_2021
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0

the GCC info is as follows:

 Using built-in specs.
 Target: aarch64-linux-gnu
 Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 7.5.0-3ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
 Thread model: posix
 gcc version 7.5.0 (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)

I tried using your suggested compiler invocation and I'm getting the same error, which is:

error: identifier "_srcPtr" is undefined
error: identifier "_srcStepSize" is undefined
error: identifier "_dstPtr" is undefined
error: identifier "_dstStepSize" is undefined
4 errors detected in the compilation of "/tmp/tmpxft_00005efc_00000000-4_main.cpp4.ii".

You stated that the compilation completes without errors on Windows. My assumption is that the Jetson Nano does not run Windows, and that you tried on an x86-based Windows PC, where it works fine. I also tried that and see no errors reported. Similarly, it seems to work fine on an x86 Linux PC: I cannot get it to fail there when using either CUDA 10.2 or CUDA 11.x. Your observation might be specific to the Jetson Nano environment.

I do not have access to a Jetson Nano and the associated toolchain, so I am afraid I cannot assist any further. In case the observed behavior is due to a bug in the toolchain, have you tried CUDA 11.5? You may also wish to consult with the folks over in the Jetson Nano sub-forum.

When dealing with issues that occur with NVIDIA’s embedded products, asking questions in the dedicated sub forums for these is generally preferable over posting in this forum, because very few of the people active here have experience with and/or ready access to those products.

You’re running into this. It has nothing to do with CUDA (this is easily demonstrable). If you precede all the undefined identifiers with this-> then the errors disappear.

I personally don’t wish to explain it, I think the linked post does better than anything I could say.


I compiled the posted code with multiple CUDA 11.x toolchains on Compiler Explorer. I double-checked, and these are definitely hosted on a Linux platform using gcc as the host compiler. No compilation errors are reported: Compiler Explorer

Supposing that the posted code can be compiled without errors only with the Microsoft toolchain due to MSVC's deviation from the C++ language standard (as alluded to in the linked Stackoverflow answer), what would explain the error-free compilation I observe with these Linux-hosted CUDA toolchains?

Here is a godbolt failure. From what I can see, that code fails the same way for g++ versions in the 7-10 range, at least. (Yes, I have modified that source somewhat. I don't think any of the modifications materially impact my claims.) OP's original code fails exactly as OP describes on my Linux machine using nvcc from CUDA 11.4 with g++ 7.3.1 as the host compiler. Yes, I can see on godbolt that CUDA/nvcc seems happy with the code. On my machine, it is not happy. I'm not a godbolt expert, so I don't really know how it chooses the underlying g++ toolchain; it doesn't seem selectable. But I tried CUDA versions back to CUDA 9 on godbolt and it seems to pass.

I guess it requires more careful analysis to discover what is going on.

I am not a C++ language lawyer, so I don’t know what exactly constitutes correct behavior in C++11. My focus was: Does the CUDA toolchain behave differently between Ubuntu and Windows in regard to the posted code, and best I can tell so far, the answer is “no”.

FWIW, the godbolt example I linked in my previous post uses gcc 10.2 as the host compiler, as I established from the values of __GNUC__ and __GNUC_MINOR__.

One difference between compiling plain C++ host code with the host compiler and compiling host code with nvcc is that the CUDA toolchain performs some amount of processing before passing the host code to the host compiler.

Thanks for the quick help! Adding this-> does solve the issue.

Studying --verbose output:

  • on two linux machines I looked at, both with different versions of CUDA and g++, I observe that the failures are reported from cudafe++
  • when I add --verbose to godbolt nvcc compilation (which doesn’t report any errors) I observe that cudafe++ is not being run.
  • If I add --ptx to the compilation command lines on my linux machines, the errors disappear, and the verbose output seems to roughly match godbolt nvcc.

So I'm not really sure what godbolt is doing at this point. On an "actual" CUDA Linux installation running nvcc "normally", I observe the failures. I have not yet seen an "actual" Linux install that does not fail.

I have not run on Windows. I'm taking OP's word for it that it does not fail on Windows, and the article I linked (which has nothing to do with CUDA) seems to corroborate that.

On Linux, if we leave CUDA out of it, godbolt fails.

So right now it seems to me that this is not a CUDA issue, and the CUDA toolchain has no bearing on the observability of the problem. Yes, I cannot explain the godbolt CUDA toolchain's behavior, but I have evidence that there is some kind of unexpected behavior (from my own perspective) there.

Thanks for digging.

My understanding of godbolt CUDA compilation is that it produces PTX by default, and presumably this happens by specifying --ptx. To see SASS, one has to turn on the “Compile to binary” option under the Output tab. For the posted code as-is, the device code output is empty because the kernel isn’t actually invoked.

Looking at the compilation trajectory of a CUDA program, it looks different from what I, possibly incorrectly, remembered (cudafe++ running early).

First the host compiler is invoked, then cicc (the device code frontend compiler), then ptxas (the device code backend compiler), then cudafe++. godbolt stops after cicc (by default) or ptxas (if "Compile to binary" is selected) since it is focused on the device code generation. In any event it never gets to cudafe++, so any error messages originating there would not be visible, as you noted.

I observe that passing --ptx to nvcc terminates the compilation trajectory after invoking cicc, that is, compilation never gets around to invoking cudafe++. Which explains why the error disappears when this nvcc option is specified.

I conclude that Compiler Explorer appears to be unsuitable for investigating the kind of issue discussed here.
