First CUDA program -- Integrating CUDA with existing code base -- not working.

I’ve created a MWE that demonstrates how I intend to integrate CUDA kernels into my existing C++ code base. The plan is to compile the CUDA code as a library and link it to the existing code. The code (shown below) compiles but does not run correctly, i.e. the “Hello” message is not printed.

Please note that I am a CUDA beginner. Thanks.

[main.cpp]

#include <iostream>

extern "C" void runKernel();

int main(int argc, char **argv)
{
	runKernel();
}

[Test.cu]

__global__ void testKernel( unsigned *data )
{
	int tId = threadIdx.x;
	data[tId] = tId;
}

extern "C" void runKernel()
{
	std::cout << "Running Kernel" << std::endl;

	const unsigned NUM_THREADS = 32;

	unsigned *hostData;
	unsigned *devPtrData;
	cudaMalloc( (void**) devPtrData , NUM_THREADS );

	testKernel<<<1,NUM_THREADS>>>();

	cudaMemcpy( hostData , devPtrData , NUM_THREADS * sizeof(unsigned) , cudaMemcpyDeviceToHost );

	for( unsigned i = 0; i < NUM_THREADS; ++i )
	{
		std::cout << hostData[i] << std::endl;
	}
}

Compilation steps:

nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -lib -o Test.a Test.o
nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -o main.o -c main.cpp
nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -o test Test.a main.o

I don’t see any indication that your posted code should print “Hello”.

Your posted compilation steps never show Test.cu being compiled. If you want the compiler to do something with a file called Test.cu, that file name usually has to appear explicitly in a compile command somewhere.
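For example, a step along these lines (reusing the flags you already have), placed before the library step, would actually compile it:

nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -dc -o Test.o -c Test.cu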

Thanks txbob. I’ve fixed my build steps. Here’s Test.cu

extern "C" void runKernel()
{
	cudaError_t error;
	const unsigned NUM_THREADS = 32;

	unsigned *devPtrData = NULL;
	unsigned *hostData = NULL;

	error = cudaMalloc( (void**)&devPtrData , NUM_THREADS * sizeof(unsigned) );
	if( error != cudaSuccess )
	{
		std::cout << "CUDA cudaMalloc error: " << cudaGetErrorString(error) << std::endl;
		exit(-1);
	}

	testKernel<<<1,NUM_THREADS>>>( devPtrData );

	error = cudaMemcpy( hostData , devPtrData , NUM_THREADS * sizeof(unsigned) , cudaMemcpyDeviceToHost );

	if( error != cudaSuccess )
	{
		std::cout << "CUDA cudaMemcpy error: " << cudaGetErrorString(error) << std::endl;
		exit(-1);
	}

	for( unsigned i = 0; i < NUM_THREADS; ++i )
	{
		std::cout << hostData[i] << std::endl;
	}
}

Updated build steps:

nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -dc -o Test.o -c Test.cu
nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -lib -o Test.a Test.o
nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -o main.o -c main.cpp
nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -o test Test.a main.o

The problem is that now I’m getting the error:

CUDA cudaMalloc error: invalid argument

So your Test.cu now has no definition for testKernel?

and you’re not providing any allocation for hostData?

You may possibly be struggling with basic C/C++ concepts. It’s usually a good idea to have some basic competence in C/C++ before tackling CUDA, as it leverages the underlying concepts quite a bit.

The error I get when I compile an approximate version of what you have is not

CUDA cudaMalloc error: invalid argument

but

CUDA cudaMemcpy error: invalid argument

And this makes sense to me. If we provide a proper host side allocation for hostData, this error will go away.
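For example, a minimal fix (a sketch, using the NUM_THREADS constant already in your code) is to give cudaMemcpy a real host-side destination buffer:

hostData = new unsigned[NUM_THREADS];	// host buffer for cudaMemcpy to write into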

I also consider your error checking to be incomplete, and I find the compile arch commands overly restrictive (they provide only SASS for cc3.0 devices).

I made the following changes and it works correctly for me:

Test.cu:

#include <iostream>

__global__ void testKernel( unsigned *data )
{
        int tId = threadIdx.x;
        data[tId] = tId;
}

extern "C" void runKernel()
{
        cudaError_t error;
        const unsigned NUM_THREADS = 32;

        unsigned *devPtrData = NULL;
        unsigned *hostData = NULL;
        hostData = new unsigned[NUM_THREADS];
        error = cudaMalloc( (void**)&devPtrData , NUM_THREADS * sizeof(unsigned) );
        if( error != cudaSuccess )
        {
            std::cout << "CUDA cudaMalloc error: " << cudaGetErrorString(error) << std::endl;
            exit(-1);
        }

        testKernel<<<1,NUM_THREADS>>>( devPtrData );

        error = cudaMemcpy( hostData , devPtrData , NUM_THREADS * sizeof(unsigned) , cudaMemcpyDeviceToHost );

        if( error != cudaSuccess )
        {
            std::cout << "CUDA cudaMemcpy error: " << cudaGetErrorString(error) << std::endl;
            exit(-1);
        }

        for( unsigned i = 0; i < NUM_THREADS; ++i )
        {
                std::cout << hostData[i] << std::endl;
        }
        error = cudaGetLastError();
        if( error != cudaSuccess )
        {
            std::cout << "CUDA error: " << cudaGetErrorString(error) << std::endl;
            exit(-1);
        }


}

build sequence:

nvcc -ccbin g++ -m64 -arch=sm_30 -dc -o Test.o -c Test.cu
nvcc -ccbin g++ -m64 -arch=sm_30 -lib -o Test.a Test.o
nvcc -ccbin g++ -m64 -arch=sm_30 -o main.o -c main.cpp
nvcc -ccbin g++ -m64 -arch=sm_30  -o test Test.a main.o

Thanks txbob. Your observation was spot on.

In spite of what the evidence may suggest, I do know C++ and have been using it for over 12 years. I’ve also read CUDA by Example cover to cover and watched a ton of CUDA, thrust and GPU architecture presentations before writing my first CUDA code. Nevertheless, I find that it takes a bit of practice to remember all the basics and/or spot one’s mistakes. This is probably quite natural when learning a new framework (e.g. CUDA, OpenMP).

The arch type that I’m compiling for is merely for the purpose of experimentation. Even then, I can’t help but wonder what compiling for several architectures does to the size of the binary.

What concrete changes would you suggest to ‘complete’ the error checking?

Complete error checking, at a minimum, means your program will not exit without alerting the user to a CUDA error, if one occurred during execution. At a lazy minimum, this means immediately prior to exiting, your program should check for any CUDA error that may have occurred during execution. I have demonstrated one possible lazy approach in the code I posted.

For a better description of rigorous error checking, google “proper CUDA error checking”, and take the first hit, and start reading.
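Roughly, the approach described there is to wrap every CUDA runtime API call in a checking macro that reports the file and line on failure, and to check both cudaGetLastError() and cudaDeviceSynchronize() after each kernel launch. A minimal sketch (the macro and function names here are arbitrary, not anything your code requires):

#include <cstdio>
#include <cstdlib>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert( cudaError_t code, const char *file, int line )
{
	// report any failed runtime call with its location, then abort
	if( code != cudaSuccess )
	{
		fprintf( stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line );
		exit( (int)code );
	}
}

Used like gpuErrchk( cudaMalloc( (void**)&devPtrData , NUM_THREADS * sizeof(unsigned) ) ); and, after a kernel launch, gpuErrchk( cudaGetLastError() ); followed by gpuErrchk( cudaDeviceSynchronize() );.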

As a trivial example: without the error checking addition that I made, my modified code (and your original code, if it were otherwise in working order), combined with your “restrictive” choice of embedding only cc3.0 SASS, will “silently” fail when run on a mismatched architecture. That is, it will spit out bogus results, but otherwise give no indication of an explicit error (because of “incomplete” error checking, in my view).

In my case I used your build commands on my GTX 960 (a cc5.2 device), and since only cc3.0 SASS was embedded, the kernel launch failed (“silently”) and I just got bogus data.

At least if you throw in the error check at the end, I get “invalid device function” or a similar error, which immediately translates for me to “aha, I probably compiled for the wrong architecture”.

By switching to -arch=sm_30, I get cc3.0 SASS and also cc3.0 PTX. The cc3.0 SASS will not run on my GTX960, but the cc3.0 PTX will be jit-compiled to run. I find that convenient, especially when I’m exploring and don’t want to be troubled by extraneous issues.
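If you do eventually want to embed SASS for more than one architecture (at the cost of a larger fat binary), the usual approach is to repeat -gencode; as a sketch, something like the following would embed cc3.0 and cc5.2 SASS plus cc5.2 PTX for forward compatibility:

nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 -dc -o Test.o -c Test.cu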

Do whatever you wish. It is your time/productivity that you should judge.