Strange memory read.


I am developing image processing on a 1920x1080 image. All the processing works fine, but when I read the memory back with cudaMemcpy, all the data above line 1056 is random.
Here is the deviceQuery output for my card:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce 930M"
  CUDA Driver Version / Runtime Version          8.0 / 7.5
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2048 MBytes (2147352576 bytes)
MapSMtoCores for SM 5.0 is undefined.  Default to use 192 Cores/SM
MapSMtoCores for SM 5.0 is undefined.  Default to use 192 Cores/SM
  ( 3) Multiprocessors x (192) CUDA Cores/MP:    576 CUDA Cores
  GPU Clock rate:                                941 MHz (0.94 GHz)
  Memory Clock rate:                             900 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 1048576 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           9 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce 930M

Could someone explain this strange memory read to me?

Add proper error checking after all API calls. In particular, call cudaGetLastError() after the cudaDeviceSynchronize() that follows your kernel launch. If the kernel crashed for whatever reason, this is how you will find out.
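A minimal sketch of that pattern (the macro name and the kernel in the comment are illustrative, not from the original code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call failed.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage around a kernel launch (myKernel is a placeholder):
// myKernel<<<grid, block>>>(d_img, width, height);
// CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during kernel execution
```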

Run your program through cuda-memcheck to catch possible memory access violations (cuda-memcheck also reports some CUDA API errors).

There are several reasons why part of your readback data could be corrupt: an undersized buffer allocation, a kernel crash just before the computation finished, …

Sorry, I’m not a developer. I see you have a problem with low available memory being reported, is that correct?
And are you running on Windows 10?

If so, please be kind enough to read the link below.

You might be suffering from the Windows 10 WDDM 2 …

Thank you, oguzbir, but I am working on Ubuntu.

And thank you, cbuchner1, I will try your suggestion.

My bad, sorry. Best of luck!

In addition to adding proper CUDA error checking to your code, try running your app with cuda-memcheck and fix any issues it reports.

So I tried many improvements to my code; there are no errors according to cuda-memcheck.
I thought that below a minimum amount of data, memcpy could not process the transfer and would fail to execute (for example, transferring only 500 bytes would not be possible). Could the same be true above some maximum amount of data?

As I mentioned in my first post, above line 1056 of my image (1920x1080) the artifact appears. Could this come from a memory limitation?

memcpy() and cudaMemcpy() can certainly copy amounts of data smaller than 500 bytes. It may not be very efficient to use these functions for moving small amounts of data, but functionally there is nothing that prevents programmers from doing that.
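For what it’s worth, a transfer far smaller than 500 bytes round-trips just fine; a sketch (error checking omitted for brevity):

```cuda
#include <cstring>
#include <cuda_runtime.h>

int main()
{
    char host_in[16] = "tiny payload";
    char host_out[16] = {0};
    char *dev = nullptr;

    cudaMalloc(&dev, sizeof(host_in));
    // 16 bytes each way -- valid, just not an efficient use of the PCIe bus
    cudaMemcpy(dev, host_in, sizeof(host_in), cudaMemcpyHostToDevice);
    cudaMemcpy(host_out, dev, sizeof(host_in), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    return strcmp(host_in, host_out) == 0 ? 0 : 1;  // expect an exact copy back
}
```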

Based on what little (and partially conflicting) information you have provided, your code likely contains one or several bugs. I would suggest using standard debugging techniques to track down these issues. CUDA comes with a powerful debugger that should help you with this, but you can likely get pretty far even by simply logging activity with printf().

Indeed, it came from my code. In fact, the number of blocks was initialized incorrectly.
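For anyone landing here with the same symptom: 1056 is exactly 33 × 32, which is what truncating integer division produces when sizing the grid — 1080 / 32 = 33 blocks of 32 rows covers only 1056 rows, leaving the last 24 unprocessed. A sketch of the usual fix, assuming a 32×32 block (the actual block shape and kernel in the original code are not shown):

```cuda
#include <cuda_runtime.h>

// Guard against out-of-range threads inside the kernel.
__global__ void process(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)  // last block row/column overhangs the image
        return;
    // ... per-pixel work on img[y * width + x] ...
}

void launch(unsigned char *d_img)
{
    dim3 block(32, 32);
    // Ceiling division: (1080 + 31) / 32 = 34 blocks, not 1080 / 32 = 33
    dim3 grid((1920 + block.x - 1) / block.x,
              (1080 + block.y - 1) / block.y);
    process<<<grid, block>>>(d_img, 1920, 1080);
}
```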

Thank you everyone.