Could not find an applicable discussion about our specific problem after searching online and in this forum.

First attempt at real world CUDA programming. Have a C++ CUDA project which is loaded and called by a C# using ManagedCuda. (yes, the GPU usage is inefficient but first we need to get it to just work)

The CUDA function works fine when called during unit testing of the C# method.

But when the CUDA function is called in the application it always crashes with CUDA_ERROR_ILLEGAL_ADDRESS.

Debugged using NSight CUDA Debugging. Says all of the input array variables point to invalid memory - ???
The first two int parameters have correct values.

Any ideas why the memory pointers are wrong? TIA!

Here are the calling function and the CUDA function:

    private bool CalculateOnCUDAThread()
        CudaDeviceVariable<int> d_data = _lines.Data;

        int[] h_lines = new int[_lines.FrameHeight * 2];
        for (int i = 0, j = 0; i < _lines.LineCount; i++)
            h_lines[j++] = (int)_lines.Lines[i].Start.X;
            h_lines[j++] = (int)_lines.Lines[i].End.X;
        CudaDeviceVariable<int> d_lines = h_lines;

        int dimResults = 10;
        CudaDeviceVariable<Int64> d_results = new CudaDeviceVariable<Int64>(_lines.FrameHeight * dimResults);

        int threadsPerBlock = Math.Min(256, _lines.FrameHeight);
        cudaPeak.BlockDimensions = threadsPerBlock;
        cudaPeak.GridDimensions = (_lines.FrameHeight + threadsPerBlock - 1) / threadsPerBlock;

        cudaPeak.Run(_lines.FrameWidth, _lines.FrameHeight, d_data.DevicePointer, d_lines.DevicePointer, d_results.DevicePointer);

        Int64[] h_results = d_results;


global void cudaPeak(int width, int height, int *data, int *lines, __int64 *result)
int y = threadIdx.x + blockIdx.x * blockDim.x;
if (y >= height) return;

int offset = y*width;
int xstart = lines[y * 2 + 0];
int xend = lines[y * 2 + 1];

__int64 total = 0;
__int64 totalX = 0;
__int64 totalY = 0;
int min = INT_MAX;
int max = INT_MIN;
int maxX = -1;
result[y * 10 + 3] = LONG_MAX;
result[y * 10 + 4] = LONG_MIN;
for (int x = xstart; x <= xend; x++)
    int dataValue = data[x + offset];
    totalX += dataValue * x;
    totalY += dataValue * y;
    if (dataValue < min)
        min = dataValue;
        if (dataValue > max)
            max = dataValue;
            maxX = x;
        total += dataValue;
result[y * 10 + 0] = total;
result[y * 10 + 1] = totalX;
result[y * 10 + 2] = totalY;
result[y * 10 + 3] = min;
result[y * 10 + 4] = max;
result[y * 10 + 5] = maxX;

result[y * 10 + 6] = threadIdx.x;
result[y * 10 + 7] = blockIdx.x;
result[y * 10 + 8] = blockDim.x;


This gives a possible methodology using cuda-memcheck to debug illegal address issues:

But, doesn’t CUDA-MEMCHECK run the CU module in isolation?

Our CUDA function runs without error and returns the correct answers when called by the C# method when called by unit tests.

possibly one of the memory allocations you are passing to the kernel is working in one case but not done correctly in the other case.

If I pass a pointer to a kernel with memory not properly allocated to that pointer, and attempt to use it in the device code, I’ll get an illegal address error.

If you have one case where the kernel works correctly, and another case where it doesn’t, then I would inspect the allocations you are passing. Also, it may be a data-dependent problem. For example, if your kernel is passed a paramter that indicates the extent or size of memory to access, then one set of data passed the kernel may be OK, whereas another data set is not.

no, cuda-memcheck doesn’t run the CU module “in isolation”. It runs the whole program, as-is. If you have an illegal address violation that is reported as a CUDA runtime error (what is happening here in the C# wrapper), then running the same scenario with cuda-memcheck should give the same error, and give you more information about the error. If you use -lineinfo during compilation (this may be hard to do or figure out how to do in your C# environment), then you may get additional information such as the actual line of kernel code that is causing the error.

It’s just a suggestion. Feel free to ignore it.

thanks for your reply. maybe I am not being clear.

If you have one case where the kernel works correctly, and another case where it doesn’t, then I would inspect the allocations you are passing
as you can see in the C# code, allocations are done just before calling the kernel. not sure how these could be different in one case versus another. verified that the root C# data is correct in both cases.

if your kernel is passed a parameter that indicates the extent or size of memory to access
one of the unit tests passes in same size data. this test calls CalculateOnCUDAThread() which calls cudaPeak.Run()

this may be hard to do or figure out how to do in your C# environment
the CUDA project itself is C++. turned on line numbers and verbose output but this did not add any information.

what we have here is a C# assembly that loads and calls a CUDA kernel

  1. when the assembly is loaded and called from the unit test framework everything works fine
  2. when the assembly is loaded and called from the controlling application, we get CUDA memory error
  3. is it possible to use cuda-memcheck to load and call into a C# assembly?

using NSignht to debug the controlling application I can break in the CUDA code then attach to the controlling application.

in the controlling application:

in the cudaPeakfunction

sensing a pattern here… BUT, with y=0, the line “int xstart = lines[y * 2 + 0]” throws an error

[i]CUDA Memory Checker detected 32 threads caused an access violation:
Launch Parameters
CUcontext = 1c62f410
CUstream = 1db5f580
CUmodule = 1e054450
CUfunction = 1e08c3e0
FunctionName = cudaPeak
GridId = 1
gridDim = {6,1,1}
blockDim = {256,1,1}
sharedSize = 256
width = 1928
height = 1448
data = 0x00c60000 ???
lines = 0x01720000 ???
result = 0x01820000 ???
Parameters (raw):
0x00000788 0x000005a8 0x00c60000 0x01720000
GPU State:
Address Size Type Mem Block Thread blockIdx threadIdx PC Source

00720000 4 adr ld s 0 0 {0,0,0} {0,0,0} cudaPeak+000120 e:\beamstack\common\source\resultslib.cuda.powerenergy.kernel\

007200f8 4 adr ld s 0 31 {0,0,0} {31,0,0} cudaPeak+000120 e:\beamstack\common\source\resultslib.cuda.powerenergy.kernel\

Summary of access violations:
e:\beamstack\common\source\resultslib.cuda.powerenergy.kernel\ error MemoryChecker: #misaligned=0 #invalidAddress=32

Memory Checker detected 32 access violations.
error = access violation on load (shared memory)
gridid = 1
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0x00720000
accessSize = 4


My guess is an incompatible setting of 32/64 bit code. Make sure that Cuda-kernels, unit tests and application are all compiled to the same settings. Especially the “Any CPU” settings in C# can lead to a change during runtime why you should better fix it to 32bit or 64 bits.
(PTX modules don’t seem to be checked while loading and then the pointers have the wrong size…)

By the way, managedCuda wrapps the driver API of Cuda. Any debugging or memCheck that works with the driver API does also work with managedCuda. There’s no limitation here…

bitness does seem to be the problem. configured C# and PTX project for 64-bit. Now the unit test fails with CUDA_ERROR_ILLEGAL_ADDRESS.

And was able to run the EXE under cude-memcheck. Thank you.

But “cuda-memcheck --log app.log app.exe” did not create a LOG file. .OUT built during execution but then it went away. how do I get a log file?

Debugged again using NSight. C# calling function seems to be using 64-bit addresses. Why does the CU function seem to be using 32-bit addresses?