Performance CPU/GPU


I have a urge trouble in simple morphology program.
I use the nppiErode_8u_C1R function to erode a gray image and I do the same think on CPU.
On HD images, the GPU takes ~25ms and on the CPU it takes 3ms…
Why is there that hude diffence? Should not the GPU be faster thant CPU?

I just had some information. This is my kernel version of erode:

    /// Local copy of the Image +size of the kernel and the kernel coordinate
    __shared__ u_int8_t LocalImage[(BLOCK_SIZE_X+kernelSize_X)*(BLOCK_SIZE_Y+kernelSize_Y)];
    __shared__ int16_t LocalKernelCoord[kernelSize_X*kernelSize_Y];

    int x = blockIdx.x*blockDim.x;
    int y = blockIdx.y*blockDim.y;

    for(int i=threadIdx.x;i<BLOCK_SIZE_X+kernelSize_X && (x+i)<w;i=i+blockDim.x)
        for(int j=threadIdx.y;j<BLOCK_SIZE_Y+kernelSize_Y && (y+j)<h;j=j+blockDim.y)
            LocalImage[i+j*(BLOCK_SIZE_X+kernelSize_X)] = (u8_ImageIn)[x+i + (y+j)*w];

    if(threadIdx.x + threadIdx.y*BLOCK_SIZE_X <isizeCoord)
        int ptrk =  threadIdx.x + threadIdx.y*BLOCK_SIZE_X;
        LocalKernelCoord[ptrk] = i16_Kernel[ptrk];



    int iMidX = (kernelSize_X-1)/2;
    int iMidY = (kernelSize_Y-1)/2;

    int xglobal = x+threadIdx.x+iMidX;
    int yglobal = y+threadIdx.y+iMidY;

    if(/*xglobal <= kernelSize_X/2 ||*/ xglobal >= w-kernelSize_X/2 || /*yglobal <= kernelSize_Y/2 ||*/ yglobal >= h-kernelSize_Y/2)
        // Change the output pixel to 0

    //    /*if(threadIdx.x>kernelSize_X  || threadIdx.y>kernelSize_Y)
    //        return;*/

int ptrl = threadIdx.x+iMidX + (threadIdx.y+iMidY) *(BLOCK_SIZE_X+kernelSize_X);
    bool IsErode =  false;
    if(LocalImage[ptrl] ==0)
        u8_ImageOut[xglobal+yglobal*w] =  0;

for(int k=0;k<isizeCoord;k++)

        if(LocalImage[ptrl +i16_Kernel[k]] == 0)
            IsErode = true;

        //float val;
        //atomicAdd(&val, LocalImage[ptrl+i16_Kernel[k]] );

if(IsErode == true)
        u8_ImageOut[xglobal+yglobal*w] =  0;//u8_ImageIn[xglobal+yglobal*w];
        u8_ImageOut[xglobal+yglobal*w] =  255;//u8_ImageIn[xglobal+yglobal*w];

The kernel is an ellipsoide kernel so it is not separable. This version takes ~35ms and on opencv ~3ms.
I don’t understand why…

Lots of questions remain:

Size of image: how many pixels are we talking about?

Kernel Launch parameters: (Number of blocks in total, number of threads per block (BLOCK_SIZE_X, BLOCK_SIZE_Y))

Timing: Did you measure only kernel execution time, or time including memory transfers from host to device memory and back? Did you measure timings after pre-cooking your kernel (i.e. running the kernel once before taking timing measurements to prevent measuring any one time initialization overhead)

Hardware: what’s your GPU?

Profiling: did you run this through any profiling tools such as the Visual Profiler, or nVidia nSight? did it highlight any shared memory bank conflicts or uncoalesced memory access?


Thank you for the answer.

Here is the information about my card:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce 930M"
  CUDA Driver Version / Runtime Version          8.0 / 7.5
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2048 MBytes (2147352576 bytes)
MapSMtoCores for SM 5.0 is undefined.  Default to use 192 Cores/SM
MapSMtoCores for SM 5.0 is undefined.  Default to use 192 Cores/SM
  ( 3) Multiprocessors x (192) CUDA Cores/MP:    576 CUDA Cores
  GPU Clock rate:                                941 MHz (0.94 GHz)
  Memory Clock rate:                             900 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 1048576 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           9 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce 930M

I want to process a HD image (1920x1080) and the information about the define variable:

#define BLOCK_SIZE_X 32
#define BLOCK_SIZE_Y 32

#define kernelSize_X 15
#define kernelSize_Y 15

The time I gave you is only the kernel process without the memcpy. I don’t notice any shared memory trouble. I will try with my nsight.


I just notice that when I remove that part:

if(IsErode == true)
        u8_ImageOut[xglobal+yglobal*w] =  0;//u8_ImageIn[xglobal+yglobal*w];
        u8_ImageOut[xglobal+yglobal*w] =  255;//u8_ImageIn[xglobal+yglobal*w];

the code take ~ 1ms. Is it a normal behavior?

I wouldn’t be able to help unless you provided a complete test case, but your last observation is not surprising. When you remove those last lines of code, there is much other code in your kernel that is not doing anything observable globally, so the optimizing compiler is free to remove that also. The result is that your kernel execution time drops a lot.

This is a typical issue associated with this style of “commenting code” analysis. The compiler may do things that defeat your purposes when you try to tactically comment out lines of code.

Here is the project I made on Qt on share website.

Fichier :
Utilisateur : ffgnullk
Mot de passe : hvwf3c6a

the 930M is a relatively low end product with only a 64 bit memory interface. It may be able to surpass the CPU’s performance in a some tasks when using well optimized code, but certainly not in all tasks.


The single-precision GFLOPS provided by the GTX 930M are about 3x that of a four-core x86 processor, but this image processing task would appear to me memory bound, rather than compute bound.

The DDR3-based memory subsystem of the GTX 930M provides, at 16 GB/sec (according to, lower memory bandwidth than the system memory of a simple PC, which is about 24 GB/sec (high-performance x86 systems can provide up to 60 GB/sec). Compare also with high-end GPUs that provide memory bandwidth of up to 336 GB/sec, or 20x as much.

Erosion has some data re-use and unless my math if off, each image should be <= 8 MB and therefore is small enough to fit into the last-level cache of a modern i7 CPU, meaning this is a task well-suited to efficient CPU work.

[s]You are loading the same data into shared memory multiple times:

for(int i=threadIdx.x;i<BLOCK_SIZE_X+kernelSize_X && (x+i)<w;i=i+blockDim.x)
        for(int j=threadIdx.y;j<BLOCK_SIZE_Y+kernelSize_Y && (y+j)<h;j=j+blockDim.y)
            LocalImage[i+j*(BLOCK_SIZE_X+kernelSize_X)] = (u8_ImageIn)[x+i + (y+j)*w];

This code has the structure of a (nonseparable) 2d convolution, getting executed for each combination of threadIdx.x, threadIdx.y, i, and j.
However all it does is load data into shared memory, for which no loop over i and j should be necessary.

Thank you tera.

Why do you think I am loading the same data into the shared mémory?

Sorry, my bad, it’s all fine. Seems I had been glancing over your code too quickly.

Is there any other improvement to reduce the time of computing?

I tried to find the acceleration process to reduce the erosion process , it only works on square because it is separable but in other case I didn’t find any other solution.

The other solution would be to change the card. I look at the specification between nvidia normal and M version. It seems that the Higher memory bandwidth are really different 105.8 GB/s vs 28.8 GB/s. Is the “Higher memory bandwidth” change the time to access to the global memory?

The access time of global memory (= latency) is very similar on all GPUs, and is quite high, about 5x to 8x to what you would observe on an x86 CPU (~80 ns). However, GPUs are designed as latency tolerant throughput machines, so to first order we don’t care for latency.

Memory throughput (for global memory) is pretty much inversely linear to execution time for memory-bound kernels (so triple the bandwidth -> 1/3 of the execution time). There is some variation because of ECC vs non-ECC, different types of DRAM used on different GPUs, specifics of the global memory access pattern. A well optimized code tyypically achieves 75% to 85% of the theoretical bandwidth listed in NVIDIA’s specs, and you can use that number to estimate performance of memory-bound kernels.

I would suggest to alway use the NVIIDA profiler to check that your code’s memory accesses are close to optimal, so the code makes the best possible use of the available memory bandwidth.

High-speed memories, and also the processor interfaces to high-speed memory, are quite power hungry due to very fast switching of large capacitive loads, which is presumably the reason mobile GPUs run narrower memory interfaces at reduced memory clocks, compared to GPUs for workstations or servers.

How can I access the performance in profiler. I run nsight but I just have the memory rate to transfert data from host to device or device to host.

So the bandwidth indicate in the secification (for nvidia 930M Memory bandwidth 14.4 GB/s or 16 depends on the website) is the speed to read global variable if I understand what you wrote? And the write operation are slower. What is the speed of local access or shared variable?

So if I take a processor with more bandwidth the read/write in global memory will be speeded?

A last question for that post in cpu memory copy, to read and write information faster it is better to use 32bits pointer than 8bits pointers.:Example

int8_t *ptdataSrc = mydatatoread;
int8_t *ptdataDst =Tosomewhre ;
for(int i=0;i< sizei;i++)
     ptdataSrc[i] = ptdataDst[i]

///the better solution
int32_t *ptdataDst =Tosomewhre ;
for(int i=0;i< sizei/4;i++)
     ptdataSrc[i] = ptdataDst[i];

Are there the same trouble to access to the global memory?