First CUDA program -- Integrating CUDA with existing code base -- not working.

I’ve created a MWE that demonstrates how I intend to integrate CUDA kernels into my existing C++ code base. The plan is to compile the CUDA code as a library and link it to the existing code. The code (shown below) compiles but does not run correctly, i.e. the “Hello” message is not printed.

Please note that I am a CUDA beginner. Thanks.

[main.cpp]

#include <iostream>

extern "C" void runKernel();

int main(int argc, char **argv)
{
	runKernel();
}

[Test.cu]

__global__ void testKernel( unsigned *data )
{
	int tId = threadIdx.x;
	data[tId] = tId;
}

extern "C" void runKernel()
{
	std::cout << "Running Kernel" << std::endl;

	const unsigned NUM_THREADS = 32;

	unsigned *hostData;
	unsigned *devPtrData;
	cudaMalloc( (void**) devPtrData , NUM_THREADS );

	testKernel<<<1,NUM_THREADS>>>();

	cudaMemcpy( hostData , devPtrData , NUM_THREADS * sizeof(unsigned) , cudaMemcpyDeviceToHost );

	for( unsigned i = 0; i < NUM_THREADS; ++i )
	{
		std::cout << hostData[i] << std::endl;
	}
}

Compilation steps:

nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -lib -o Test.a Test.o
nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -o main.o -c main.cpp
nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -o test Test.a main.o

I don’t see any indication that your posted code should print “Hello”.

Your posted compilation steps never show Test.cu being compiled. If you want the compiler to do something with a file called Test.cu, that file name usually has to appear explicitly in a compile command somewhere.
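For example, a step along these lines (reusing the flags you already have), placed before the library step, would actually compile it:

nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -dc -o Test.o -c Test.cu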

Thanks txbob. I’ve fixed my build steps. Here’s Test.cu

extern "C" void runKernel()
{
	cudaError_t error;
	const unsigned NUM_THREADS = 32;

	unsigned *devPtrData = NULL;
	unsigned *hostData = NULL;

	error = cudaMalloc( (void**)&devPtrData , NUM_THREADS * sizeof(unsigned) );
	if( error != cudaSuccess )
	{
		std::cout << "CUDA cudaMalloc error: " << cudaGetErrorString(error) << std::endl;
		exit(-1);
	}

	testKernel<<<1,NUM_THREADS>>>( devPtrData );

	error = cudaMemcpy( hostData , devPtrData , NUM_THREADS * sizeof(unsigned) , cudaMemcpyDeviceToHost );

	if( error != cudaSuccess )
	{
		std::cout << "CUDA cudaMemcpy error: " << cudaGetErrorString(error) << std::endl;
		exit(-1);
	}

	for( unsigned i = 0; i < NUM_THREADS; ++i )
	{
		std::cout << hostData[i] << std::endl;
	}
}

Updated build steps:

nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -dc -o Test.o -c Test.cu
nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -lib -o Test.a Test.o
nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -o main.o -c main.cpp
nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -o test Test.a main.o

The problem is that now I’m getting the error:

CUDA cudaMalloc error: invalid argument

So your Test.cu now has no definition for testKernel?

and you’re not providing any allocation for hostData?

You may possibly be struggling with basic C/C++ concepts. It’s usually a good idea to have some basic competence in C/C++ before tackling CUDA, as it leverages the underlying concepts quite a bit.

The error I get when I compile an approximate version of what you have is not

CUDA cudaMalloc error: invalid argument

but

CUDA cudaMemcpy error: invalid argument

And this makes sense to me. If we provide a proper host side allocation for hostData, this error will go away.
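For example, a minimal fix (a sketch, using the NUM_THREADS constant already in your code) is to give cudaMemcpy a real host-side destination buffer:

hostData = new unsigned[NUM_THREADS];	// host buffer for cudaMemcpy to write into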

I also consider your error checking to be incomplete, and I find the compile arch commands overly restrictive (they provide only SASS for cc3.0 devices).

I made the following changes and it works correctly for me:

Test.cu:

#include <iostream>

__global__ void testKernel( unsigned *data )
{
        int tId = threadIdx.x;
        data[tId] = tId;
}

extern "C" void runKernel()
{
        cudaError_t error;
        const unsigned NUM_THREADS = 32;

        unsigned *devPtrData = NULL;
        unsigned *hostData = NULL;
        hostData = new unsigned[NUM_THREADS];
        error = cudaMalloc( (void**)&devPtrData , NUM_THREADS * sizeof(unsigned) );
        if( error != cudaSuccess )
        {
            std::cout << "CUDA cudaMalloc error: " << cudaGetErrorString(error) << std::endl;
            exit(-1);
        }

        testKernel<<<1,NUM_THREADS>>>( devPtrData );

        error = cudaMemcpy( hostData , devPtrData , NUM_THREADS * sizeof(unsigned) , cudaMemcpyDeviceToHost );

        if( error != cudaSuccess )
        {
            std::cout << "CUDA cudaMemcpy error: " << cudaGetErrorString(error) << std::endl;
            exit(-1);
        }

        for( unsigned i = 0; i < NUM_THREADS; ++i )
        {
                std::cout << hostData[i] << std::endl;
        }
        error = cudaGetLastError();
        if( error != cudaSuccess )
        {
            std::cout << "CUDA error: " << cudaGetErrorString(error) << std::endl;
            exit(-1);
        }


}

build sequence:

nvcc -ccbin g++ -m64 -arch=sm_30 -dc -o Test.o -c Test.cu
nvcc -ccbin g++ -m64 -arch=sm_30 -lib -o Test.a Test.o
nvcc -ccbin g++ -m64 -arch=sm_30 -o main.o -c main.cpp
nvcc -ccbin g++ -m64 -arch=sm_30  -o test Test.a main.o

Thanks txbob. Your observation was spot on.

In spite of what the evidence may suggest, I do know C++ and have been using it for over 12 years. I’ve also read CUDA by Example cover to cover and watched a ton of CUDA, thrust and GPU architecture presentations before writing my first CUDA code. Nevertheless, I find that it takes a bit of practice to remember all the basics and/or spot one’s mistakes. This is probably quite natural when learning a new framework (e.g. CUDA, OpenMP).

The arch type that I’m compiling for is merely for the purpose of experimentation. Even then, I can’t help but wonder what compiling for several architectures does to the size of the binary.

What concrete changes would you suggest to ‘complete’ the error checking?

Complete error checking, at a minimum, means your program will not exit without alerting the user to a CUDA error, if one occurred during execution. At a lazy minimum, this means immediately prior to exiting, your program should check for any CUDA error that may have occurred during execution. I have demonstrated one possible lazy approach in the code I posted.

For a better description of rigorous error checking, google “proper CUDA error checking”, and take the first hit, and start reading.
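Roughly, the approach described there is to wrap every CUDA runtime API call in a checking macro that reports the file and line on failure, and to check both cudaGetLastError() and cudaDeviceSynchronize() after each kernel launch. A minimal sketch (the macro and function names here are arbitrary, not anything your code requires):

#include <cstdio>
#include <cstdlib>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert( cudaError_t code, const char *file, int line )
{
	// report any failed runtime call with its location, then abort
	if( code != cudaSuccess )
	{
		fprintf( stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line );
		exit( (int)code );
	}
}

Used like gpuErrchk( cudaMalloc( (void**)&devPtrData , NUM_THREADS * sizeof(unsigned) ) ); and, after a kernel launch, gpuErrchk( cudaGetLastError() ); followed by gpuErrchk( cudaDeviceSynchronize() );.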

As a trivial example: without the error checking addition that I made, my modified code (and your original code, if it were otherwise in working order), combined with your “restrictive” choice of embedding only cc3.0 SASS, will “silently” fail when run on a mismatched architecture. That is, it will spit out bogus results, but otherwise give no indication of an explicit error (because of “incomplete” error checking, in my view).

In my case I used your build commands on my GTX 960 (a cc5.2 device), and since only cc3.0 SASS was embedded, the kernel launch failed (“silently”) and I just got bogus data.

At least if you throw in the error check at the end, I get “invalid device function” or a similar error, which immediately translates for me to “aha, I probably compiled for the wrong architecture”.

By switching to -arch=sm_30, I get cc3.0 SASS and also cc3.0 PTX. The cc3.0 SASS will not run on my GTX960, but the cc3.0 PTX will be jit-compiled to run. I find that convenient, especially when I’m exploring and don’t want to be troubled by extraneous issues.
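If you do eventually want to embed SASS for more than one architecture (at the cost of a larger fat binary), the usual approach is to repeat -gencode; as a sketch, something like the following would embed cc3.0 and cc5.2 SASS plus cc5.2 PTX for forward compatibility:

nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 -dc -o Test.o -c Test.cu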

Do whatever you wish. It is your time/productivity that you should judge.