Hi,
I have a program that adds two arrays element by element (array A and array B, each having ARY_N elements) and puts the result into a third array.
I am calling the kernel function as follows:
AddGPU<<<1, 1>>>(
d_ainp,
d_binp,
d_Cadd,
ARY_N
);
The kernel function is:
[codebox]
__global__ void AddGPU(
    int *d_ainp,
    int *d_binp,
    int *d_Cadd,
    const int ARY_N)
{
    // Global thread index
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    // Total number of threads in the execution grid
    const int THREAD_N = blockDim.x * gridDim.x;
    // Grid-stride loop: each thread handles elements tid, tid + THREAD_N, ...
    for (int ar = tid; ar < ARY_N; ar += THREAD_N)
    {
        d_Cadd[ar] = d_ainp[ar] + d_binp[ar];
    }
}
[/codebox]
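For context, here is a minimal sketch of the host side that such a launch needs: allocate the device arrays, copy the inputs, launch the kernel, and copy the result back. The helper name `runAddGPU` and the test data are assumptions made for illustration; they are not taken from the original program.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Same grid-stride kernel as in the post.
__global__ void AddGPU(int *d_ainp, int *d_binp, int *d_Cadd, const int ARY_N)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int THREAD_N = blockDim.x * gridDim.x;
    for (int ar = tid; ar < ARY_N; ar += THREAD_N)
        d_Cadd[ar] = d_ainp[ar] + d_binp[ar];
}

// Hypothetical helper (not from the original program): runs AddGPU on n
// elements with the given launch configuration and returns true if the
// GPU result matches a CPU reference.
bool runAddGPU(int n, int blocks, int threads)
{
    // Illustrative input data: a[i] = i, b[i] = 2 * i.
    std::vector<int> a(n), b(n), c(n, 0);
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2 * i; }

    const size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *d_ainp = nullptr, *d_binp = nullptr, *d_Cadd = nullptr;

    // Allocate device memory and copy both inputs to the GPU.
    bool ok = cudaMalloc(&d_ainp, bytes) == cudaSuccess
           && cudaMalloc(&d_binp, bytes) == cudaSuccess
           && cudaMalloc(&d_Cadd, bytes) == cudaSuccess
           && cudaMemcpy(d_ainp, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_binp, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        AddGPU<<<blocks, threads>>>(d_ainp, d_binp, d_Cadd, n);
        // The blocking cudaMemcpy waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c.data(), d_Cadd, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // Compare against the CPU result.
    for (int i = 0; ok && i < n; ++i)
        ok = (c[i] == a[i] + b[i]);

    cudaFree(d_ainp);
    cudaFree(d_binp);
    cudaFree(d_Cadd);
    return ok;
}
```

Calling `runAddGPU(ARY_N, 1, 1)` corresponds to the `<<<1, 1>>>` launch shown above; other block and thread counts go through the same grid-stride loop without changes to the kernel.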
My problem is:
I want to focus on the <<<1,1>>> launch configuration.
When the kernel reads the two operands of each addition from global memory and then writes the result back to global memory, it runs rather slowly. With <<<1,1>>> I am using 1 GPU core and only 1 thread, and that core sits idle much of the time because access to global memory is slow. Overall, the GPU core is working only about 1/6 (17%) of the time and is idle (waiting for memory accesses) during the remaining 83%.
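A minimal sketch of how the kernel time for a given launch configuration could be measured with CUDA events is shown here. The helper name `timeAddGPU`, the zero-filled inputs, and the warm-up launch are assumptions for illustration; this is not how the 17% figure above was obtained.

```cuda
#include <cuda_runtime.h>

// Same grid-stride kernel as in the post.
__global__ void AddGPU(int *d_ainp, int *d_binp, int *d_Cadd, const int ARY_N)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int THREAD_N = blockDim.x * gridDim.x;
    for (int ar = tid; ar < ARY_N; ar += THREAD_N)
        d_Cadd[ar] = d_ainp[ar] + d_binp[ar];
}

// Hypothetical helper (not from the original program): returns the time of
// one AddGPU launch in milliseconds, or -1.0f if any CUDA call fails.
float timeAddGPU(int n, int blocks, int threads)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *d_ainp = nullptr, *d_binp = nullptr, *d_Cadd = nullptr;
    cudaEvent_t start = nullptr, stop = nullptr;
    float ms = -1.0f;

    // Allocate the arrays and zero the inputs; the values do not matter
    // for timing, only the memory traffic does.
    bool ok = cudaMalloc(&d_ainp, bytes) == cudaSuccess
           && cudaMalloc(&d_binp, bytes) == cudaSuccess
           && cudaMalloc(&d_Cadd, bytes) == cudaSuccess
           && cudaMemset(d_ainp, 0, bytes) == cudaSuccess
           && cudaMemset(d_binp, 0, bytes) == cudaSuccess
           && cudaEventCreate(&start) == cudaSuccess
           && cudaEventCreate(&stop) == cudaSuccess;

    if (ok) {
        // Warm-up launch so one-time startup cost is not included.
        AddGPU<<<blocks, threads>>>(d_ainp, d_binp, d_Cadd, n);

        // Timed launch, bracketed by events on the default stream.
        cudaEventRecord(start);
        AddGPU<<<blocks, threads>>>(d_ainp, d_binp, d_Cadd, n);
        cudaEventRecord(stop);

        ok = cudaEventSynchronize(stop) == cudaSuccess
          && cudaGetLastError() == cudaSuccess
          && cudaEventElapsedTime(&ms, start, stop) == cudaSuccess;
    }

    if (start) cudaEventDestroy(start);
    if (stop) cudaEventDestroy(stop);
    cudaFree(d_ainp);
    cudaFree(d_binp);
    cudaFree(d_Cadd);
    return ok ? ms : -1.0f;
}
```

Comparing `timeAddGPU(ARY_N, 1, 1)` with a larger configuration such as `timeAddGPU(ARY_N, 64, 256)` gives a rough idea of how much of the single-thread time goes to waiting on global memory, since many threads let the hardware overlap those waits.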
Any ideas on how to improve the speed of the <<<1,1>>> kernel?
How can I raise its utilization from 17% to as close to 100% as possible?
Can anyone please help me make my program run faster with the <<<1,1>>> launch?
Thank you in advance,
kirti