Misaligned access violation

Hi,

I have a problem with misaligned violation access errors. My Setup is:
Windows 10 Pro, v1703
NVIDIA GeForce GTX 1080 Ti, driver version 22.21.13.8253 (2017/06/07)
Visual Studio Professional 2015, Version 14.0.25425.01 Update 3
cuda_8.0.61
NVIDIA Nsight Visual Studio Edition 5.3.0.17162

Originally I wanted to compare two image regions. I reduced the code to a minimum, still producing the behaviour that I don’t understand.

#include "cuda_runtime.h"

#include <stdio.h>
#include <iostream>

cudaError_t doSomethingInCuda(float *data0, unsigned int width, unsigned int height);

__global__ void testKernel(float *data0, unsigned int width, unsigned int height)
{
  for (size_t i = 0; i < width * height; i++)
  {
    double im0;
    double sum = 0.0;
    for (int j = -70; j <= 70; j++)
    {        
        im0 = data0[50000];
        sum += im0;
    }
  }
}


int main()
{
  const unsigned int width = 512;
  const unsigned int height = 1024;
  const unsigned int dataSize = width * height;
  const unsigned int memSize = dataSize * sizeof(float);

  float* data0 = (float*)malloc(memSize);
  //Fill data array...
  for (size_t i = 0; i < dataSize; i++)
  {
    data0[i] = (float)i;
  }

  doSomethingInCuda(data0, width, height);

  free(data0);

  std::cout << "Computation finished. Press any key...";
  std::cin.ignore();

  return 0;
}


cudaError_t doSomethingInCuda(float *data0, unsigned int width, unsigned int height)
{
  float *d_data0 = NULL;
  cudaError_t cudaStatus;
  const unsigned int size = width * height;
  const unsigned int memSize = size * sizeof(float);
  // Choose which GPU to run on, change this on a multi-GPU system.
  cudaStatus = cudaSetDevice(0);
  if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
    goto Error;
  }

  // Allocate GPU buffers for three vectors (two input, one output)    .
  cudaStatus = cudaMalloc((void**)&d_data0, memSize);
  if (cudaStatus != cudaSuccess) {
    goto Error;
  }

  // Copy input vectors from host memory to GPU buffers.
  cudaStatus = cudaMemcpy(d_data0, data0, memSize, cudaMemcpyHostToDevice);
  if (cudaStatus != cudaSuccess) {
    goto Error;
  }
    
  // Launch a kernel on the GPU with one thread for each element.
  testKernel << <1, 1 >> > (d_data0, width, height);

  // Check for any errors launching the kernel
  cudaStatus = cudaGetLastError();
  if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
    goto Error;
  }

  // cudaDeviceSynchronize waits for the kernel to finish, and returns
  // any errors encountered during the launch.
  cudaStatus = cudaDeviceSynchronize();
  if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
    goto Error;
  }
  
Error:
  cudaFree(d_data0);

  return cudaStatus;
}

This code leads to misaligned access violations, which seem to be random. Here are three examples:

i = 391465, j = -14
CUDA context created : 279dfef6140
CUDA module loaded:   279e228af00 kernel.cu
================================================================================
CUDA Memory Checker detected 1 threads caused an access violation:
Launch Parameters
    CUcontext    = 279dfef6140
    CUstream     = 279e228c900
    CUmodule     = 279e228af00
    CUfunction   = 279ef77c680
    FunctionName = _Z10testKernelPfjj
    GridId       = 1
    gridDim      = {1,1,1}
    blockDim     = {1,1,1}
    sharedSize   = 256
    Parameters:
        data0 = 0x0000000b09600000  0
        width = 512
        height = 1024
    Parameters (raw):
         0x09600000 0x0000000b 0x00000200 0x00000400
GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx                         PC  Source
------------------------------------------------------------------------------------------------------------------
 b09630d40     4    mis ld    g           0       0          {0,0,0}    {0,0,0}  _Z10testKernelPfjj+0003b0  ...\kernel.cu:16


Summary of access violations:
...\kernel.cu(16): error MemoryChecker: #misaligned=1  #invalidAddress=0
================================================================================

Memory Checker detected 1 access violations.
error = misaligned load (global memory)
gridid = 1
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0xb09630d40
accessSize = 4
i = 26934, j = 19
CUDA context created : 1f1618a5d60
CUDA module loaded:   1f163c849b0 kernel.cu
================================================================================
CUDA Memory Checker detected 1 threads caused an access violation:
Launch Parameters
    CUcontext    = 1f1618a5d60
    CUstream     = 1f163c84fb0
    CUmodule     = 1f163c849b0
    CUfunction   = 1f1711b9ce0
    FunctionName = _Z10testKernelPfjj
    GridId       = 1
    gridDim      = {1,1,1}
    blockDim     = {1,1,1}
    sharedSize   = 256
    Parameters:
        data0 = 0x0000000d09600000  0
        width = 512
        height = 1024
    Parameters (raw):
         0x09600000 0x0000000d 0x00000200 0x00000400
GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx                         PC  Source
------------------------------------------------------------------------------------------------------------------
 d09630d40     4    mis ld    g           0       0          {0,0,0}    {0,0,0}  _Z10testKernelPfjj+0003b0  ...\kernel.cu:16


Summary of access violations:
...\kernel.cu(16): error MemoryChecker: #misaligned=1  #invalidAddress=0
================================================================================

Memory Checker detected 1 access violations.
error = misaligned load (global memory)
gridid = 1
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0xd09630d40
accessSize = 4
i = 372617, j = -3
CUDA context created : 23a134f5d00
CUDA module loaded:   23a158ae630 kernel.cu
================================================================================
CUDA Memory Checker detected 1 threads caused an access violation:
Launch Parameters
    CUcontext    = 23a134f5d00
    CUstream     = 23a158add30
    CUmodule     = 23a158ae630
    CUfunction   = 23a22da7430
    FunctionName = _Z10testKernelPfjj
    GridId       = 1
    gridDim      = {1,1,1}
    blockDim     = {1,1,1}
    sharedSize   = 256
    Parameters:
        data0 = 0x0000000b09600000  0
        width = 512
        height = 1024
    Parameters (raw):
         0x09600000 0x0000000b 0x00000200 0x00000400
GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx                         PC  Source
------------------------------------------------------------------------------------------------------------------
 b09630d40     4    mis ld    g           0       0          {0,0,0}    {0,0,0}  _Z10testKernelPfjj+0003b0  ...\kernel.cu:16


Summary of access violations:
...\kernel.cu(16): error MemoryChecker: #misaligned=1  #invalidAddress=0
================================================================================

Memory Checker detected 1 access violations.
error = misaligned load (global memory)
gridid = 1
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0xb09630d40
accessSize = 4

If I remove the inner loop (for j=…) I don’t get an Error. If I reduce the range of j, the probability of getting the misaligned access error decreases.

What I found so far is a similar problem but no solution:
https://devtalk.nvidia.com/default/topic/1005607/shared-memory-debug-errors-in-nsight/

Thanks for your help
Marcus

Run your code outside of visual studio, from a windows command prompt. Suppose your executable is called my.exe, then run it with cuda-memcheck like this, from a windows command prompt:

cuda-memcheck my.exe

If you don’t get the error report from cuda-memcheck, then I would say the problem is with nsight VSE, not your code. In that case, I would file a bug against nsight VSE with your simple reproducer.

Upon inspection of your code, I don’t see how it could trigger a misaligned access. In fact, I’m assuming you are building a debug project, because in release mode that kernel should reduce to an empty kernel. It modifies no global state.

Thanks for your reply.

I forgot to mention, that it is an x64 debug build.

I have already tried executing with cuda-memcheck. Both my displays become black for a short time just before the application crashes.
I have shotened the path in the error output. The error message is

...\CUDA\x64\Debug>cuda-memcheck TestCuda.exe
========= CUDA-MEMCHECK
cudaDeviceSynchronize returned error code 30 after launching addKernel!
Computation finished. Press any key...
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuTexObjectGetTextureDesc + 0x2b7527) [0x2c44cb]
=========     Host Frame:...\CUDA\x64\Debug\cudart64_80.dll (cudaDeviceSynchronize + 0xf9) [0x1da99]
=========     Host Frame:...\CUDA\x64\Debug\TestCuda.exe (doSomethingInCuda + 0x1dc) [0x19ac]
=========     Host Frame:...\CUDA\x64\Debug\TestCuda.exe (main + 0xa5) [0x1b05]
=========     Host Frame:...\CUDA\x64\Debug\TestCuda.exe (invoke_main + 0x34) [0x50d4]
=========     Host Frame:...\CUDA\x64\Debug\TestCuda.exe (__scrt_common_main_seh + 0x127) [0x4f97]
=========     Host Frame:...\CUDA\x64\Debug\TestCuda.exe (__scrt_common_main + 0xe) [0x4e5e]
=========     Host Frame:...\CUDA\x64\Debug\TestCuda.exe (mainCRTStartup + 0x9) [0x50f9]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to cudaFree.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuTexObjectGetTextureDesc + 0x2b7527) [0x2c44cb]
=========     Host Frame:...\CUDA\x64\Debug\cudart64_80.dll (cudaFree + 0xfd) [0x230ad]
=========     Host Frame:...\CUDA\x64\Debug\TestCuda.exe (doSomethingInCuda + 0x210) [0x19e0]
=========     Host Frame:...\CUDA\x64\Debug\TestCuda.exe (main + 0xa5) [0x1b05]
=========     Host Frame:...\CUDA\x64\Debug\TestCuda.exe (invoke_main + 0x34) [0x50d4]
=========     Host Frame:...\CUDA\x64\Debug\TestCuda.exe (__scrt_common_main_seh + 0x127) [0x4f97]
=========     Host Frame:...\CUDA\x64\Debug\TestCuda.exe (__scrt_common_main + 0xe) [0x4e5e]
=========     Host Frame:...\CUDA\x64\Debug\TestCuda.exe (mainCRTStartup + 0x9) [0x50f9]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= ERROR SUMMARY: 2 errors

But I can’t get useful information out of that.

There is no difference if the Nsight Monitor is sctive in the tray bar or not.

With my useful code (as I said, I have minimized the example) the release build application behaves equally.

The displays becoming black and the unknown error are due to the WDDM TDR mechanism resetting the GPU.

You should google that (“WDDM TDR timeout”) and adjust your timeout mechanism. It’s possible that an incorrect (too short) timeout period is also causing nsight VSE to misbehave

I have just made some experiments. I can say, that “something is happening”. I will work that out tomorrow and let you know, what is actually happening.

Thanks a lot!

I increased the TDR timeouts heavily.
This leads to the behaviour that cuda-memcheck does not show an error. But everything freezes during the computation which took 14 minutes with cuda-memcheck. I will vary my TDR configuration and see, if I can remove the freezing. I’ll publish my final configuration values, if I can find suitable values.

Nsight sill reports misaligned access violations. My code does not seem to be very exciting. Does nobody else experience that behaviour?

  1. your code is serial and use non-optimized compilation, so its slowness isn’t indicative for real program performance

  2. nsight probably has similar TDR that can’t be disabled and just report random error when time goes

so your experience doesn’t reveal any actual problems except for unfriendliness of CUDA ecosystem, especially for newcomers

Thank you for your comments. I agree with you that I have produced very non-optimized code. I am new to CUDA and wanted to go the way: Step one: make it run; Next steps: make it run faster.

I am sure that my problems will be reduced or maybe vanish when I have an optimized solution.
I have removed the outer loop and called the kernel with 1024 blocks and 512 threads. Now everything is fine and quick (using Nsight or cuda-memcheck). Yesterday I did not know that the execution time on the GPU is significant.

If Nsight had an internal TDR, that would explain why I still have these random errors. If I reduce the range of my loops, the error propability decreases.

So do you recommend to use several threads and blocks at the first step when implementing new kernels?

yes. GPU programming is about dividing work into many thousands of independent jobs (i.e. grid). If your code can’t be split into that many jobs, GPU will be slower than CPU

Yes, that’s why I started GPU programming, but I thought it would be possible to start with one task and develop step by step to a fully parallel code design. So I was wrong.

And I thought programming CUDA would be strongly device independent. But what happens if my code runs on older/slower hardware. Might the older hardware be too slow for exhaustive tasks? So I could not say on which device my kernel runs successfully. Is that right?

These are all characteristic of running codes on WDDM GPUs. The solution is to use a GPU in TCC mode, or else use Linux.

In TCC mode, none of these timeout concepts apply. You can readily run codes for 15 minutes or 15 hours on a TCC GPU, without disturbing anything else.

The usage of a WDDM GPU that is effectively driving a display (at least the Windows OS thinks it is or may be) is what complicates things. The CUDA ecosystem in that case must run within the confines of what the OS requires of a WDDM GPU, and yes, the situation is not perfect.

But it would be senseless to paint all of GPU computing with this particular brush. It’s not representative of the experience in TCC mode or in Linux.

Thanks again for your explainations!
Using Linux unfortunately is not an option. I understood, that the TCC mode is not supported by all devices. And I recognized, that switching on TCC mode follows that the device cannot be connected to a display.

So I am wondering if there could be a (low budget) device for the display together with a TCC device for the CUDA computations in a windows workstation. Or would maybe all problems vanish if a had two devices (one for display, one for computation) even if they do not support TCC mode?

TCC mode should be supported by most Quadro, Tesla and Titan family products. (Don’t make a buying decision based on this statement, however. Verify first. There are some exceptions, I believe.)

A reasonably-current (e.g. Kepler or newer) low end GeForce product combined with one of these makes a good GPU computing platform, IMO, ATM. Two devices that don’t support TCC mode doesn’t really get you out of the woods, IMO. Others may disagree. However, if you have 2 non-TCC devices, one of which is hosting the display, and the other is running GPU computation, and you increase the WDDM TDR timeout, then your display interactivity should not be impacted. That is beneficial, probably. However you are still subject to (meaning you still have to modify) WDDM TDR. Currently, a GPU that is running a kernel will not be responding to display task requests, giving rise to the “freezing” behavior you witness, when that GPU is both computing and servicing a display.

There are other differences in behavior between WDDM and TCC mode as well.

In the future, it might be possible (for future GPUs, perhaps not for current products i.e. Kepler/Maxwell/Pascal) that additional integration of compute preemption by the CUDA development team may allow a WDDM GPU to service display tasks while performing long-running compute tasks. These two activities are quite dissimilar, however, so sharing a processor in such cases will still have some aspects of “non-perfection”, but at least you may not be as directly subject to WDDM TDR timeout. This last paragraph is purely speculation, and should not be construed as a forward-looking statement of what to expect for future products. I wouldn’t be able to answer questions about what may or will happen in the future.

A high-end GPU for compute plus a low-end GPU for the operating system’s GUI is a common configuration for a GPU computing workstation, and one I would recommend. On all operating systems supported by CUDA there are watchdog timers for the GUI, so if you run a Linux workstation with a graphical desktop you will also encounter the timeout issue when running with just one GPU.

I used such big/tiny GPU setups for many years across Windows and Linux, using a highest-end Tesla or Quadro for compute and a lowest-end Quadro for the GUI. Even when running both GPUs with WDDM on Windows this was quite workable (but may require increasing the TDR timeout, as mentioned by txbob). But TCC is definitely recommended, as use of WDDM can also cause weird performance artifacts.

A word of general programming advice: An approach of “I will get the serial version to run on the GPU before converting it into a parallel version” is not realistically workable in my experience. Massively parallel programming – of any kind, not just on GPUs – requires re-thinking a problem from the ground up. GPUs are designed for massively parallel computation, using serial code on them won’t make you happy as a programmer, as you just found out.

If you need a serial version of your code to cross check results, run that on the CPU. If and when you do that, make sure to read NVIDIA’s white paper on cross-platform floating-point issue first and do not expect your results to match bitwise. https://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf

Maybe I am an outlier but have had very little issue with WDDM in Windows 7/8.1 once I set the TDR in the registry. Even on my laptop with only 1 GTX 980m I can run a kernel for 30 minutes successfully even though connected to the display.

I do agree that the ‘big/small’ configuration is best, even if the ‘big’ uses WDDM mode. On one of my systems I have Windows 8.1, with a GTX 980 for video and a GTX 1080ti for compute and I run that thing for hours (Machine learning on the GTX 1080ti WDDM) while still being able to work on other projects, watch videos etc using the GTX 980.

Linux also has a timeout issue as well, which is a pain to configure, so it is not like this issue is specific to Windows. Windows has other negatives, but as far as CUDA goes my experience has been more positive than when working with Ubuntu.

Thank to all for your opinions.

I have already made my original code run concurrently where I replaced the loops with bolckIdx and threadIdx ;-). Evenything is fine and quick.
So thanks again for the discussion about how to start programming a CUDA kernel. So I learnt that iterating the source code from sequential to fully parallel is not the way.

@njuffa I had already expected differences in floating point computations. It’s a very clear article that you linked. I never heared of FMA. Thanks!