CUDA very slow performance

I am working on a project in which I have to port a molecular prediction program to the GPU. It uses Boost; I made it compile with the NVCC compiler and it works. The problem is that whenever I add a CUDA kernel, it gets slower and slower. I have now written a kernel for each function, but when I call any of these kernels the program runs extremely slowly. I am talking really, really slow: you could go to the toilet, wash your hands, come back, and it would still be running. I have tried adding various other compiler flags but nothing works; it is still far too slow. Am I missing something?

There is no information in the post that would be sufficient even for generic advice, let alone specific suggestions as to what you could try. Have you tried using the CUDA profiler to point out bottlenecks?

Even before it actually starts running any GPU code, it freezes for a while and then runs really slowly. Does anybody have an idea?

Let me explain it more clearly:
Without any CUDA kernel being called, the program performs at normal speed.
But when I call one CUDA kernel, the program becomes slower, and the more kernels I call, the slower it runs.
It is a big program and I only run the CUDA code in a small part of it; the results are correct, but the performance is far too slow. Is it clearer now? Thank you.

Sorry, it is no clearer than before. If you were presented solely with the information you provided here by someone else, what advice would you give them?

At this point, we know nothing about the nature of the app, how it calls the CUDA kernels (actual code), what those CUDA kernels look like (more code), what your host platform and GPU are, how you are building the CUDA code, what version of CUDA you are using.

No, that is not clearer.

It is the same as asking a mechanic to fix your car for free, but not letting him see the car or even telling him what kind of car it is.

What GPU are you using?

Give a code example of the kernels called, otherwise no one can help you.

Thanks for replying to my post.
Sorry for being unclear; I am new to CUDA.
I am using CUDA 6.5 with an NVIDIA GTX 650, and here is the part of my code where all the kernels are called:

fl bfgs(quasi_newton_aux& f, conf& x, change& g, const unsigned max_steps, const fl average_required_improvement, const sz over) { // x is I/O, final value is…

    sz n = g.num_floats();
    flmat h(n, 0);
    set_diagonal(h, 1);

    change g_new(g);
    conf x_new(x);
    fl f0 = f(x, g);

    fl f_orig = f0;
    change g_orig(g);
    conf x_orig(x);

    change p(g);

    flv f_values; f_values.reserve(max_steps+1);

    /* Device pointer declarations */
    change* dev_p;
    change* dev_y;
    change* dev_g;
    change* dev_minus_hy;
    fl* dev_yp;
    fl* gpu_result;
    fl* gpu_yy;
    fl* scalar_product_gpu;
    flmat* dev_h;
    fl* dev_yhy;
    fl* dev_set_diag;

    /* Device pointer allocations */
    cudaMalloc((void**)&dev_yhy, sizeof(fl));
    //cudaMalloc((void**)&dev_set_diag, sizeof(fl));
    cudaMalloc((void**)&dev_p, sizeof(struct change));
    cudaMalloc((void**)&dev_y, sizeof(struct change));
    cudaMalloc((void**)&dev_yp, sizeof(fl));
    cudaMalloc((void**)&dev_g, sizeof(struct change));
    cudaMalloc((void**)&gpu_result, sizeof(fl));
    cudaMalloc((void**)&gpu_yy, sizeof(fl));
    cudaMalloc((void**)&scalar_product_gpu, sizeof(fl));
    cudaMalloc((void**)&dev_h, sizeof(flmat));
    cudaMalloc((void**)&dev_minus_hy, sizeof(struct change));

    cudaMemcpy(dev_p, &p, sizeof(struct change), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_h, &h, sizeof(flmat), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_g, &g, sizeof(struct change), cudaMemcpyHostToDevice);

    for(sz step = 0; step < max_steps; step++) {

        //minus_mat_vec_product(h, g, p); -- already implemented on the GPU:
        gpu_minus_mat_vec_product<<<h.dim(), h.dim()>>>(dev_h, dev_g, dev_p);

        // copy only dev_p back to p because gpu_minus_mat_vec_product only modified dev_p
        cudaMemcpy(&p, dev_p, sizeof(struct change), cudaMemcpyDeviceToHost);

        fl f1 = 0;
        const fl alpha = line_search(f, n, x, g, f0, p, x_new, g_new, f1);
        change y(g_new);

        // 1 - implementing change y(g_new) on the GPU:
        copy_change<<<1,1>>>(dev_y, g_new);

        // 2 - implementing subtract_change(y, g, n) on the GPU:
        // dev_y is already allocated and dev_g has just been copied, so no copies are needed
        gpu_subtract_change<<<1,n>>>(dev_y, dev_g, n);

        // copy back y so it has an updated value
        //cudaMemcpy(&y, dev_y, sizeof(struct change), cudaMemcpyDeviceToHost);

        f0 = f1;
        x = x_new;

        // implementing if(!(std::sqrt(scalar_product(g, g, n)) >= 1e-5)) break; on the GPU:
        // no need to copy dev_g because it hasn't changed since the last copy
        gpu_scalar_product<<<1, 32>>>(dev_g, dev_g, n, scalar_product_gpu);
        fl result;
        cudaMemcpy(&result, scalar_product_gpu, sizeof(fl), cudaMemcpyDeviceToHost);

        if(!(std::sqrt(result) >= 1e-5)) break; // breaks for nans too // FIXME !!??

        g = g_new;
        // this is the only place where g is changed, so no more copies are needed from now on
        copy_change<<<1,1>>>(dev_g, g_new);

        if(step == 0) {
            // implementing const fl yy = scalar_product(y, y, n); on the GPU:
            fl yy;
            gpu_scalar_product<<<1,32>>>(dev_y, dev_y, n, gpu_yy);
            cudaMemcpy(&yy, gpu_yy, sizeof(fl), cudaMemcpyDeviceToHost);

            if(std::abs(yy) > epsilon_fl) {
                // implementing scalar_product(y, p, n) on the GPU:
                gpu_scalar_product<<<1,32>>>(dev_y, dev_p, n, dev_yhy);
                fl set_diag;
                cudaMemcpy(&set_diag, dev_yhy, sizeof(fl), cudaMemcpyDeviceToHost);

                // implementing set_diagonal(h, alpha * set_diag / yy) on the GPU:
                gpu_set_diagonal<<<1, h.dim()>>>(dev_h, alpha * set_diag / yy);
                //cudaMemcpy(&h, dev_h, sizeof(flmat), cudaMemcpyDeviceToHost);
            }
        }

        // Last part: bool h_updated = bfgs_update(h, p, y, alpha);
        bool h_updated;

        // implementing const fl yp = scalar_product(y, p, h.dim()); on the GPU:
        fl yp;
        gpu_scalar_product<<<1,32>>>(dev_y, dev_p, h.dim(), dev_yp);
        cudaMemcpy(&yp, dev_yp, sizeof(fl), cudaMemcpyDeviceToHost);

        if(alpha * yp < epsilon_fl)
            h_updated = false; // FIXME?

        change minus_hy(y);
        // implementing change minus_hy(y); minus_mat_vec_product(h, y, minus_hy); on the GPU:
        // 1 - change minus_hy(y):
        copy_change_2<<<1,1>>>(dev_minus_hy, dev_y);
        // 2 - minus_mat_vec_product(h, y, minus_hy):
        gpu_minus_mat_vec_product<<<h.dim(), h.dim()>>>(dev_h, dev_y, dev_minus_hy);
        //cudaMemcpy(&minus_hy, dev_minus_hy, sizeof(struct change), cudaMemcpyDeviceToHost);

        // implementing const fl yhy = - scalar_product(y, minus_hy, h.dim()); on the GPU:
        fl yhy;
        gpu_scalar_product<<<1,32>>>(dev_y, dev_minus_hy, h.dim(), dev_yhy);
        cudaMemcpy(&yhy, dev_yhy, sizeof(fl), cudaMemcpyDeviceToHost);
        yhy = - yhy;

        const fl r = 1 / (alpha * yp); // 1 / (s^T * y), where s = alpha * p // FIXME ... < epsilon
        const sz n_2 = p.num_floats();

        // implementing h_updated = bfgs_update_2(h, p, y, alpha, minus_hy, yhy, r, n); on the GPU:
        gpu_bfgs_update<<<n_2, n_2>>>(dev_h, dev_p, dev_y, dev_minus_hy, alpha, yhy, r, n_2);
        h_updated = true;
        //cudaMemcpy(&h, dev_h, sizeof(flmat), cudaMemcpyDeviceToHost);
    }

    if(!(f0 <= f_orig)) { // succeeds for nans too
        f0 = f_orig;
        x = x_orig;
        g = g_orig;
    }
    return f0;
}

That is all for the section that calls the kernels. I have already tested the results returned by the kernels against their CPU implementations and the results are correct; the only problem is that it is really slow. I even removed the cudaMemcpy calls from the for loop where the kernels are called, but it is still really slow.

I have already posted the code and more details; I would be really thankful if you could help me.

You did not state what GPU you are using, but even for a low-end device you are not exploiting parallelism. GPUs are throughput devices that require thousands, or for high-end devices tens of thousands, of threads running concurrently to make meaningful use of the hardware resources.

The configuration of the following kernel calls suggests there is not nearly enough parallelism exposed in the code:

gpu_scalar_product<<<1, 32>>>
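For contrast, here is a minimal sketch of a dot product that exposes enough parallelism: many blocks, a grid-stride loop, a per-block shared-memory reduction, and an atomicAdd to combine block results. All names are hypothetical and not the poster's actual types; this is an illustration of the pattern, not a drop-in replacement.

```cuda
// Hypothetical sketch: a dot product with real parallelism.
// Assumes blockDim.x == 256 (a power of two) and compute capability >= 2.0
// (for float atomicAdd).
__global__ void dot_kernel(const float* a, const float* b, int n, float* result)
{
    __shared__ float cache[256];

    // Grid-stride loop: each thread accumulates a partial sum over many elements.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += a[i] * b[i];

    cache[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    // One atomic per block combines the partial results.
    if (threadIdx.x == 0)
        atomicAdd(result, cache[0]);
}

// Launch with, e.g., 64 blocks of 256 threads instead of <<<1, 32>>>:
//   cudaMemset(dev_result, 0, sizeof(float));
//   dot_kernel<<<64, 256>>>(dev_a, dev_b, n, dev_result);
```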

Synchronous host/device copies inside a loop also raise a red flag:

cudaMemcpy(&p, dev_p, sizeof(struct change), cudaMemcpyDeviceToHost)

You would want to avoid frequent transfers of data between the host and the device (in either direction). Instead, copy all relevant source data to the device once, process it with kernels, then copy the results back once. Depending on how much processing occurs on the device, the resulting code may still be bottlenecked by PCIe traversals.
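Schematically (with hypothetical names), the difference between the two patterns looks like this:

```cuda
// Anti-pattern: a host/device round trip on every iteration.
// Each cudaMemcpy is synchronous and serializes the pipeline.
for (int step = 0; step < max_steps; ++step) {
    cudaMemcpy(dev_x, &x, sizeof(x), cudaMemcpyHostToDevice); // upload every step
    kernel<<<grid, block>>>(dev_x);
    cudaMemcpy(&x, dev_x, sizeof(x), cudaMemcpyDeviceToHost); // download every step
}

// Preferred: transfer once, keep intermediate results resident on the device.
cudaMemcpy(dev_x, &x, sizeof(x), cudaMemcpyHostToDevice);     // once, up front
for (int step = 0; step < max_steps; ++step)
    kernel<<<grid, block>>>(dev_x);                           // data stays on the GPU
cudaMemcpy(&x, dev_x, sizeof(x), cudaMemcpyDeviceToHost);     // once, at the end
```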

If you are new to CUDA, studying some relevant example programs may be helpful. While the example programs that ship with CUDA are not themselves performance-optimized, they demonstrate useful idioms and techniques. I would also suggest familiarizing yourself with the CUDA profiler. At minimum, it would have pointed out here that occupancy is very low, but it can do fairly sophisticated analysis, assuming you use fairly recent GPU hardware and a fairly recent version of CUDA.

I am using an NVIDIA GeForce GTX 650 and CUDA 6.5. Even after removing that memory copy from the for loop, the program still runs really slowly. Before it even gets to the for loop, it freezes for a while and only later starts running the GPU code. Launching many threads does not help; I have already tried thousands of threads. Another thing: I ran an experiment in which I deleted all those kernels and instead wrote some kernels that just add and subtract vectors, then called them inside that for loop to see how it would behave. I found that the more kernels I add, the slower the program becomes, no matter the number of threads. That's why I feel like I am missing something, maybe in the compilation, or some CUDA function that I should call before launching any kernel.

Since your code is not a complete, runnable program, there is no way for me to reproduce “freezes for a while”. I would suggest:

  1. Add proper status checks to all your CUDA API calls and kernel launches.
  2. Make sure you are compiling for the appropriate GPU architecture (i.e. the compute capability of the GTX 650), so you are not impacted by JIT compilation overhead.
  3. The first CUDA API call typically triggers CUDA context creation. If you are on Linux, you can place the CUDA driver into persistence mode to avoid unloading of the driver, which can significantly lengthen the time for a subsequent CUDA context creation.
  4. Use the profiler to determine where time is spent, then drill down on the areas it identifies as bottlenecks.
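For item 1, a common pattern (this is a generic sketch, not code from the poster's program) is a checking macro wrapped around every API call, plus a cudaGetLastError()/cudaDeviceSynchronize() pair after each kernel launch:

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime API call; abort with file/line on failure.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void**)&dev_p, sizeof(struct change)));
//   my_kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during kernel execution
```

For item 2, since the GTX 650 is a compute capability 3.0 device, building with something like `nvcc -gencode arch=compute_30,code=sm_30` generates native machine code for it and avoids JIT compilation at startup.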

The GTX 650 is not a high-end device, but it should still perform better than you observe.
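Regarding item 3, on Linux persistence mode can be enabled with nvidia-smi (requires root privileges):

```shell
# Keep the NVIDIA driver loaded between CUDA runs to avoid
# repeated context-creation overhead.
sudo nvidia-smi -pm 1

# Verify: the "Persistence-M" column should now read "On".
nvidia-smi
```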

Which OS? Windows, Linux, or Mac OS?

Best to look at other samples of CUDA code because, as njuffa mentions, those kernel launches are very small and do not take advantage of the GPU's parallel capabilities.

This site has some excellent examples of CUDA code which may give you some ideas:

Thank you for all the replies, guys!
I am using Linux, and I have already tried running my code on a high-end device (a Tesla K40), but it runs even slower; that is why I thought I was doing something wrong before the kernel calls.
The part where it freezes most is right before the kernel calls, and during the kernel calls it also keeps freezing for a long time. The freezes are so long that it makes no sense to use the GPU at all. How can I try this persistence mode thing? I have been stuck on this speed problem for almost a month now.

Do the CUDA SDK samples run in good time? In other words does other ‘professional’ CUDA code run correctly without freezing?

Run deviceQuery and matrixMulCUBLAS and post their output.

Once again, thanks a lot for even reading my posts.
Here goes the device query result:

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GeForce GTX 650”
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 1024 MBytes (1073414144 bytes)
( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 1072 MHz (1.07 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GeForce GTX 650
Result = PASS

Yes, the other codes run really well! all the examples work fine!

And here is the result for matrixMulCUBLAS:

[Matrix Multiply CUBLAS] - Starting…
GPU Device 0: “GeForce GTX 650” with compute capability 3.0

MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Computing result using CUBLAS…done.
Performance= 341.11 GFlop/s, Time= 0.384 msec, Size= 131072000 Ops
Computing result using host CPU…done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

The persistence mode didn't do anything!!!

It appears your ‘!’ key is stuck, I would suggest fixing that.

If turning on persistence mode didn't change anything, that just means that unloading/reloading of the CUDA driver wasn't a contributing factor to the momentary "freezes". Still knowing next to nothing about your code, I provided a brief checklist of generic items to check. The chances of pinpointing the root cause by such a process are slim indeed.

As I stated previously, a basic issue in your code appears to be insufficient parallelism. This can lead to lengthy kernel run times, as a GPU running just a few threads will be very slow. If you run your machine with a GUI, it may even cause kernels to be terminated by the OS's watchdog timer. That can sometimes lead to a lengthy recovery process in the CUDA driver; I have seen it take up to several seconds (not sure whether this is by design or a bug). Careful checking of return status and running with the profiler should clarify whether that is the problem here.