Explanation of OpenCL profiler results

Hi,
I'm running tests and collecting results with the NVIDIA Compute Visual Profiler.
The problem is that, when a kernel executes, the CPU time is comparable to the GPU time,
whereas I would expect a much lower CPU time.
Can someone explain this strange behaviour?
Thanks in advance
V.

It could be anything: a problem that is not well suited to parallel execution, high register usage or high local memory usage (→ low occupancy on the GPU), a bad global memory access pattern, … Please describe your kernel in more detail.

Thanks for the reply.
My program compares two vectors with a size of 12 MB.
I use no local memory.
To create the buffers I use:

    Input1P = clCreateBuffer(Contesto, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                             Msg_size1, NULL, &Error);
    OutputP = clCreateBuffer(Contesto, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                             Nres * sizeof(unsigned), NULL, &Error);

My kernel is:
    
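    __kernel void Confronto(__global char* msg1, __global char* msg2, int size,
                            __global unsigned *out, int ln)
    {
        // Flatten the 2D global id into a single work-item index.
        int gidx  = get_global_id(0);
        int gidy  = get_global_id(1);
        int index = gidy * get_global_size(0) + gidx;
        int temp  = index * ln;   // first byte of this work-item's chunk

        // Compare ln bytes; on the first mismatch store its offset and stop.
        // Note: size is not checked here, so (global size * ln) must not exceed it.
        for (int i = 0; i < ln; i++)
        {
            if (msg1[temp+i] != msg2[temp+i]) { out[index] = temp+i; return; }
        }
        out[index] = 0;   // no mismatch found in this chunk
    }

The NDRange size is 159 x 159.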
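
For checking the output, what the whole NDRange computes can be sketched in plain C on the host. This is only an emulation of the kernel above, not code from my program: `confronto_reference`, `gx` and `gy` are names made up for the sketch, and `gx`, `gy` stand in for the two NDRange dimensions.

```c
#include <stddef.h>

/* Host-side emulation of the Confronto kernel (hypothetical helper).
 * gx and gy play the role of the global sizes in dimensions 0 and 1.
 * The caller must provide at least gx * gy * ln bytes in msg1 and msg2
 * and gx * gy entries in out. */
void confronto_reference(const char *msg1, const char *msg2,
                         unsigned *out, size_t gx, size_t gy, int ln)
{
    for (size_t gidy = 0; gidy < gy; ++gidy) {
        for (size_t gidx = 0; gidx < gx; ++gidx) {
            size_t index = gidy * gx + gidx;    /* flattened work-item id */
            size_t temp  = index * (size_t)ln;  /* first byte of the chunk */

            out[index] = 0;                     /* 0 = no mismatch found */
            for (int i = 0; i < ln; ++i) {
                if (msg1[temp + i] != msg2[temp + i]) {
                    out[index] = (unsigned)(temp + i);  /* first mismatch */
                    break;
                }
            }
        }
    }
}
```

Calling it with `gx = gy = 159` and the same `ln` as the kernel should give an `out` array that can be compared entry by entry with the one read back from the GPU. Note that, as in the kernel, a mismatch at byte 0 cannot be told apart from "no mismatch".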
    
The profiler returns output like:

                                   GPU time  CPU time
    1.29072e+15     Confronto       125761    125979

I don't understand why the CPU time is so similar to the GPU time.
I would expect a much lower CPU time.

Thanks in advance
V.

Ah, you meant the CPU time reported by the OpenCL profiler! From your first post I thought you were running your OpenCL code on the CPU and comparing that to the GPU time…

Here is the definition of CPU time, copied from the profiler help:
CPU Time: It is sum of GPU time and CPU overhead to launch that Method. At driver generated data level, CPU Time is only CPU overhead to launch the Method for non-blocking Methods; for blocking methods it is sum of GPU time and CPU overhead. All kernel launches by default are non-blocking. But if any profiler counters are enabled kernel launches are blocking. Asynchronous memory copy requests in different streams are non-blocking.
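
To make that concrete with your numbers: assuming profiler counters were enabled, so that the launch was blocking, the CPU overhead is simply the difference between the two columns. A minimal sketch in C; the function names are mine and are not part of the profiler:

```c
/* For a blocking launch the profiler help quoted above says:
 *   CPU time = GPU time + CPU overhead
 * so the overhead is the difference. Both values are in the
 * profiler's own time units. Hypothetical helper names. */
long launch_overhead(long gpu_time, long cpu_time)
{
    return cpu_time - gpu_time;
}

/* Share of the reported CPU time that is launch overhead (0.0 .. 1.0). */
double overhead_fraction(long gpu_time, long cpu_time)
{
    return (double)(cpu_time - gpu_time) / (double)cpu_time;
}
```

With your values, `launch_overhead(125761, 125979)` gives 218, which is roughly 0.17% of the reported CPU time. In other words, almost all of the "CPU time" you see is the kernel's GPU time, which is what the definition predicts for a blocking launch.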

I understand.
Thanks for the explanation.
V.