My program is going too slow. CUDA problem?

Hi. I’m trying to parallelize my C++ code, and it’s going really slowly when I add the CUDA part.

I’ve got a list of objects, and for every object I have to do some operations. Instead of using a “for” loop to go through the list, my idea was to parallelize the work across the list.

The problem is that this block of code (the “for” loop with the operations) is called several times during the execution of the program, and when I add the call to the CUDA function it adds a big delay to the final result (not perceptible in a single iteration, but noticeable in the overall execution time of the program).

I don’t know if there is a communication delay between the C++ code and the CUDA code, or whether my CUDA code is not written properly. Here is a sample of what I want to do with CUDA.

My cpp function

void Env::check ( void ) {

    if ( population->size() < get_param ( "max_size" ) ) {

        std::list<BAC *>::iterator j;

        // CALL TO CUDA FUNCTION
        CUDA_envCheck();

        for ( j = population->begin(); j != population->end(); j++ ) {

            /* Set of operations */

        }
    }
}

And this is the CU File

// CUDA-C includes
#include <cuda.h>
#include "device_launch_parameters.h"
#include <cuda_runtime.h>
#include <stdio.h>

#include *******
#include *******

cudaError_t envCheckCuda(int *c);

__global__ void worldUpdKernel(int *c)
{
    int i = threadIdx.x;
    // Here will go the functions for each object of the list. For the moment it is
    // empty, just to check that the code doesn't add too much delay.
    (void)c; (void)i;   // silence unused warnings while the body is a stub
}

extern "C" int* CUDA_envCheck(void){

    static const int arraySize = 16;
    static int c[arraySize] = { 0 };   // static, so the returned pointer stays valid

    // Run the (for now empty) kernel on the GPU.
    cudaError_t cudaStatus = envCheckCuda(c);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "envCheckCuda failed!");
        //return NULL;
    }
    
    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        //return 1;
    }

    return c;
}

cudaError_t envCheckCuda(int *c)
{
    int size=16;
    cudaError_t cudaStatus;
    int *dev_c = 0;

// Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate a GPU buffer for the output vector.
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

// Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_c, c, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch a kernel on the GPU with one thread for each element.
    worldUpdKernel<<<1, size>>>(dev_c);

// Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "worldUpdKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching worldUpdKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);

    return cudaStatus;
}

Thank you

Based on the preceding cudaMalloc and cudaMemcpy,

should c in:

worldUpdKernel<<<1, size>>>(c);

not be dev_c instead…?

Yes, it is dev_c; I made a mistake while typing the code into the post :).

Any idea what the reason for the delays in my program might be?

your program’s broad path seems to be:

CUDA_envCheck();
envCheckCuda(c);
worldUpdKernel<<<1, size>>>(dev_c);

Perhaps use the debugger: first step into envCheckCuda(c), and then step over its instructions, as well as the kernel, noting the time each takes to execute.
That way you can confirm whether the kernel is responsible for the significant delay, or otherwise determine which other instruction is causing it, through the same means.
If it is indeed the kernel, then post its code, as I currently do not see it.

OK, I’ll try to debug it as you said :)

That is the code of the kernel. I haven’t implemented the set of functions yet; I simply put the structure in place to check that it works right (I’m new to CUDA).

I’ll try to debug it, but it will be impossible for me to do it until Monday :( Thanks for your advice!!! I’ll come back with news :)

Hi!!

I solved my problem, but I don’t fully understand it XD. The delay was produced when envCheckCuda() returned the cudaError_t to the function CUDA_envCheck().

If I change this cudaError_t to another type of value, for example a char * with a message code for my own internal control, it works as it should.

But I have another question. What is the reason for such a delay when returning from this function? Do you suggest I keep using it despite the delays, or is it OK to use another way to return the error (and success) information in my program?

Thanks for your help :)

the cudaDeviceReset() would be killing your performance, I suppose.

I don’t know; the fact is that cudaDeviceReset() doesn’t give me any timing problem. I don’t know why, but the delay is in the act of returning the cudaError_t from envCheckCuda().

As you point out, you could return the status data in another way / as another type: a boolean, or an integer value perhaps.
I would think the error types are enumerations, so an integer should produce the same result.

Alternatively, can you not (return nothing and) simply use cudaGetLastError()?

I didn’t think about it…it seems a great idea!!! Thank you very much :D