cuCtxSynchronize failure using pyCuda with semi-large allocations

I have a pretty simple PyCUDA script here that’s supposed to load a grayscale image of a truck (2048x1365), invert the colors, and save it back:

import pycuda.autoinit
import pycuda.driver as device
from pycuda.compiler import SourceModule as cpp

import numpy as np
import cv2

modify_image = cpp("""
__global__ void modify_image(int pixelcount, int* inputimage, int* outputimage)
{
  int id = threadIdx.x + blockIdx.x * blockDim.x;
  if (id >= pixelcount)
    return;

  outputimage[id] = 255 - inputimage[id];
}
""").get_function("modify_image")

print("Loading image")

image = cv2.imread("truck.jpg", cv2.IMREAD_GRAYSCALE)

print("Processing image")

pixels = image.shape[0] * image.shape[1]
output = np.zeros_like(image)
modify_image(
  device.In(np.int32(pixels)),
  device.In(image), 
  device.Out(output),
  block=(1024,1,1), grid=(pixels // 1024, 1))

print("Saving image")

cv2.imwrite("processed.png", output)

print("Done")

However, when trying to run it, I get a cuCtxSynchronize error and the program fails. I made some small modifications to the code and did further testing, and found that it works fine up to a certain image size, but trying to allocate any more than around 420,000 bytes causes the error to occur.

I used MSI Afterburner to monitor my GPU’s memory usage, and it never goes above 1 GB, even when running the program. My GPU is a GTX 980 with 4 GB of VRAM, so I shouldn’t even be close to its limit. If anybody knows what’s going on here, I would really appreciate the help.

Here is the exact output produced by the program:

Loading image
Processing image
Traceback (most recent call last):
  File "<path to source file>\imfiltertest.py", line 36, in <module>
    block=(1024,1,1), grid=(pixels // 1024, 1))
  File "C:\Users\<me>\AppData\Local\Programs\Python\Python36\lib\site-packages\pycuda\driver.py", line 405, in function_call
    Context.synchronize()
pycuda._driver.LaunchError: cuCtxSynchronize failed: unspecified launch failure
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: unspecified launch failure
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: unspecified launch failure
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: unspecified launch failure
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: unspecified launch failure

I’ve already attempted disabling Windows’ TDR with a registry key, but it had no effect on the issue. The whole program takes less than a second to run anyway, so I doubt it’s relevant to my problem.

cross posting:

http://stackoverflow.com/questions/44057516/cuda-cuctxsynchronize-error-using-pycuda-2016-1-2

I think your kernel should be using unsigned char * pointers, not int * pointers.
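Roughly like this (untested sketch; only the pointer types change, the launch and the rest of your script stay the same, and unsigned char matches the uint8 data that cv2.imread gives you):

modify_image = cpp("""
__global__ void modify_image(int pixelcount, unsigned char* inputimage, unsigned char* outputimage)
{
  int id = threadIdx.x + blockIdx.x * blockDim.x;
  if (id >= pixelcount)
    return;

  // read and write one byte per pixel instead of four
  outputimage[id] = 255 - inputimage[id];
}
""").get_function("modify_image")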

Is crossposting against the rules?

Regardless, I’ll try your solution when I next have time available. But why would it be an issue in the first place? Is it because the int in the kernel definition is 32 bits while my image only contains 8-bit items?

Also, wouldn’t that cause issues no matter how small the image is? Why does it only cause access errors once the data approaches a certain size?

I implemented your change, and it was a success! Thank you. If you want, you can make it an answer on my Stack Overflow post and I’ll accept it.

But I’m still confused as to how it ever worked in the first place. Shouldn’t even the smallest of illegal accesses have caused it to fail?

Because GPUs don’t generally check accesses at byte-level granularity. If you have an allocation of 24 kB, the GPU may only have an error-checking mechanism that intercepts an illegal access if it is, for example, more than 1 MB beyond your original allocation.

Basically there is a granularity: you have to go outside a window before the invalid access will be detected at the hardware level.
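To put rough numbers on it (assuming the image really is 2048x1365 and 8-bit, as you describe): that’s 2048 x 1365 = 2,795,520 pixels, so the buffer you copy over is about 2.8 MB of bytes. But because the kernel indexes it as int*, the highest-numbered threads read at byte offsets around 4 x 2,795,520 ≈ 11.2 MB, i.e. roughly 8.4 MB past the end of the allocation. A smaller image stays within the window the hardware tolerates; a larger one eventually doesn’t.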

You can tighten this up by running your code with cuda-memcheck, which has a much finer-grained ability to detect out-of-bounds accesses. That checking is done in software, so it will generally make your code run noticeably slower.
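For example, assuming the script is called imfiltertest.py as in your traceback, something like:

cuda-memcheck python imfiltertest.py

should report each out-of-bounds read or write along with the offending thread and block, instead of just the unspecified launch failure.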

Crossposting is not against the rules. I sometimes mark posts that are crossposted, because if I were reading your question and it were already answered somewhere else (which it is, effectively, in the comments on the SO question) then I would not want to waste my time trying to figure out what is going on. It’s not really a message directed at you but for the benefit of other readers.

Oh, that wasn’t your comment on my post? You posted at a very similar time.

But thank you very much, that was an incredible explanation. I’ll make sure to keep my datatypes matching in the future.