How to improve the speedup?


Program for the addition of two arrays (array A and array B, each having ARY_N elements), storing the result in a third array.

I am calling the kernel function as follows

AddGPU<<<1, 1>>>(d_ainp, d_binp, d_Cadd, ARY_N);

The Kernel function is


__global__ void AddGPU(
            int *d_ainp,
            int *d_binp,
            int *d_Cadd,
            const int ARY_N)
{
    //Thread index
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;

    //Total number of threads in execution grid
    const int THREAD_N = blockDim.x * gridDim.x;

    //Grid-stride loop: each thread handles elements tid, tid + THREAD_N, ...
    for (int ar = tid; ar < ARY_N; ar += THREAD_N)
        d_Cadd[ar] = d_ainp[ar] + d_binp[ar];
}

My problem is:

I want to focus on the kernel<<<1,1>>> case.

When I access the two operands of the addition from global memory and then write the result back to global memory, the kernel runs slowly. With the <<<1,1>>> configuration the GPU core sits idle because access to global memory is slow: I am using one GPU core and only one thread. Overall, the GPU core is working only 1/6th (17%) of the time and is idle (waiting for memory accesses) during 83% of the time.

Any ideas about how to improve the speed of the <<<1,1>>> kernel?

How can I improve its execution speed, so that utilization rises from 17% as close to 100% as possible?

So can anyone please help me make my program run faster with kernel<<<1,1>>>?

Thank You in advance


There is no sense in running a <<<1,1>>> kernel on a GPU. The GPU was built to run thousands of threads and will only perform well at those dimensions. Memory access latency is usually hidden by computation in other threads; if there are no other threads, you will stall for a couple of hundred cycles.

I know that, but I am trying to compare the GPU speedup for different block sizes and different numbers of threads against this <<<1,1>>> kernel.

So if you have any ideas for this case, could you suggest them?
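For a comparison like that, one common approach is to time each launch configuration with the CUDA event API. A minimal sketch, assuming d_ainp, d_binp, d_Cadd and ARY_N are set up as in the original program:

```cuda
// Timing harness sketch (assumes device pointers are already allocated
// with cudaMalloc and filled with input data).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
AddGPU<<<1, 1>>>(d_ainp, d_binp, d_Cadd, ARY_N);   // vary <<<blocks, threads>>> here
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("<<<1,1>>> took %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Running the same harness for each <<<blocks, threads>>> combination gives comparable numbers, since the events are recorded on the GPU itself rather than with a CPU timer.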

Read Chapter 5 of the CUDA programming guide. What you are trying to do is an exercise in futility.

Yes, I read that, but my problem is: in my program I access the two operands of the addition from global memory and then write the result back to global memory, and the GPU core is idle because I am using only one core and only one thread. Overall, the GPU core is working only 1/6th of the time (17%) and is idle (waiting for memory access) during 83% of the time. From Chapter 5 I understood that shared memory is faster than global and local memory, so I wanted to use shared memory in my kernel function. I tried to do it, but I am getting errors, so can you please help me?

Then may I suggest you re-read it, because clearly you didn’t understand it sufficiently.

You cannot use shared memory in the way you imagine. You cannot avoid reading data from global memory. You can only ensure that global memory reads are coalesced to minimize the very large global memory access latency. If your kernel requires access to a relatively compact region of memory, but in a fashion that is not optimal for minimizing global memory latency, then you have the option of having many threads in a block read from global memory in a coalesced fashion and then assemble a copy of that compact region in shared memory, which has much lower latency. The key point is that shared memory is SHARED. How do you imagine a single thread in a block containing only one thread can share memory? With what? What possible benefit can it provide?
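To illustrate the point: shared-memory staging only makes sense when many threads in a block cooperate. A typical pattern looks like the sketch below (the kernel name and layout are illustrative, not from the posted code); note that for plain vector addition it buys nothing, because each element is read exactly once:

```cuda
__global__ void AddShared(const int *d_ainp, const int *d_binp,
                          int *d_Cadd, int ARY_N)
{
    // One tile per block, one element per thread. Launch with dynamic
    // shared memory size 2 * blockDim.x * sizeof(int).
    extern __shared__ int tile[];
    int *a = tile;                  // first half of the tile
    int *b = tile + blockDim.x;     // second half of the tile

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < ARY_N) {
        a[threadIdx.x] = d_ainp[i]; // coalesced load, global -> shared
        b[threadIdx.x] = d_binp[i];
    }
    __syncthreads();                // all cooperative loads must finish first

    if (i < ARY_N)
        d_Cadd[i] = a[threadIdx.x] + b[threadIdx.x];
}
```

With blockDim.x == 1 the tile holds a single pair of values and the staging step is pure overhead, which is exactly why shared memory cannot help a <<<1,1>>> kernel.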

You are not understanding what I am trying to explain. I am starting with a <<<1,1>>> kernel and then increasing the number of threads as well as the number of blocks, so that I can compare the GPU speedup; later I plan to reduce the time required for the calculation. Actually, I have already done that, and now I want to reduce the execution time and increase the speedup. So instead of accessing the array elements from global memory every time and storing the result back to global memory, I am wondering whether I can use shared memory to reduce the access time. But I have some problems using shared memory. I am not just trying to use <<<1,1>>>, but also to increase the block size and grid size. If you have any ideas, could you please help me?

I understand it perfectly well.

My point is that your kernel (and your code in the other thread) is completely nonsensical and suggests that you don’t understand the CUDA programming model or the architecture at all. Your kernel should ideally perform the basic mathematical operation (in this case the addition) on a single element of your input vector, not on many elements in a loop. You then launch many thousands of parallel threads to complete the operation on the full data in global memory. Only by having many thousands of parallel threads will you be able to hide the global memory latency and achieve any kind of acceptable performance.

Attempting to “optimize” the execution of a kernel such as the one you posted here is a waste of time, because you can’t. Attempting to use shared memory in this sort of naive kernel is also going to be pointless.
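A sketch of that one-element-per-thread idea, using the same parameter names as the posted kernel (the 256-thread block size below is an assumption, not from the post):

```cuda
__global__ void AddGPU(const int *d_ainp, const int *d_binp,
                       int *d_Cadd, int ARY_N)
{
    // One element per thread; no loop inside the kernel.
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < ARY_N)                  // guard against the last, partial block
        d_Cadd[i] = d_ainp[i] + d_binp[i];
}

// Host side: launch enough blocks to cover all ARY_N elements.
// int threads = 256;
// int blocks  = (ARY_N + threads - 1) / threads;
// AddGPU<<<blocks, threads>>>(d_ainp, d_binp, d_Cadd, ARY_N);
```

With a large ARY_N this launches thousands of threads, which is what lets the hardware overlap memory latency in one warp with arithmetic in another.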

It is at 17% because you are using <<<1,1>>>. To reach the absolute maximum (100% occupancy) you should run kernels with at least the following configuration (for a GTX 280):


and each kernel should

  • use at most 16 registers per thread

  • use at most 8 KB of shared memory per block

and preferably each kernel should access global memory in a coalesced way.

Such an optimal configuration is often hard to reach, but even if you only get halfway there, you can often get close to optimal performance.

<<<1,1>>> is, by definition, slow. As long as you stick to this configuration, your program will be slow and your GPU will be idle most of the time.