Assigning a value to a Unified Memory array

I’ve written a simple program to experiment with Unified Memory in CUDA. The main function runs until I try to copy the data back out of the managed (“shared”) memory buffer [line 103, the fromSharedMem call].

However, if I change line 40 (the output[index] = out; assignment in the kernel) to something like output[index] = input[index]*0.2f; then the program runs properly and displays a correct (darker) image. In my mind, the current line 40 and the alternative above are doing the same thing: assigning a value to the array. There must be some nuance that I am missing. Can someone please shed some light on where I am going wrong?

void __device__ getCoordFromIndex(int index, int stride, int channels, int* X, int* Y, int* C){
    *Y = (index / stride);
    *X = (index - (*Y*stride)) / channels;
    *C =  index - (*Y*stride)  - (*X*channels);
}

void __global__ cudaBoxBlur(unsigned char* input, unsigned char* output, int width, int height, int channels, int radius){
    
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int W = width;
    int H = height;
    int C = channels;
    int rowstride = W*C;
    int samples = radius * 2 + 1;
    samples *= samples;
    float sum = 0.0f;
    
    int X, Y, Z;
    X = 0;
    Y = 0;
    Z = 0;
    getCoordFromIndex(index,rowstride,C,&X,&Y,&Z);
    
    //add border condition checks
    for (int y = -radius; y <= radius; y++){
        int rowoffset = 0;
        rowoffset = min(max(Y+y,0),H-1);
        rowoffset*=rowstride;
        for (int x = -radius; x <= radius; x++){
            int coloffset = 0;
            coloffset = min(max(X+x,0),W-1);
            coloffset *= C;
            int i = index + coloffset + rowoffset;
            sum += input[i];
        }
    }
    
    sum /= samples;
    unsigned char out = (unsigned char)(int(sum + 0.5f));
    output[index] = out;

}

void toSharedMem(cv::Mat* image, unsigned char* shared){
    int p = 0;
    int H = image->rows;
    int W = image->cols;
    int C = image->channels();
    for (int y = 0; y < H; y++){
        uchar* row = image->ptr<uchar>(y);
        std::copy_n(row,W*C,&shared[p]);
        p+=W*C;
    }
}


void fromSharedMem(cv::Mat* output, unsigned char* shared){
    int p = 0;
    int H = output->rows;
    int W = output->cols;
    int C = output->channels();
    int rowlength = W*C;
    std::cout<<H<<" "<<W<<" "<<C<<std::endl;
    for (int y = 0; y < H; y++){
        uchar* row = output->ptr<uchar>(y);
        std::cout<<"actual copy"<<std::endl;        
        std::copy_n(&shared[p],rowlength,row);
        
        p+=W*C;
    }
}

int main(int argc, char** argv){
    cudaSetDevice(1);

    cv::Mat input = cv::imread(argv[1],-1);
    int H = input.rows;
    int W = input.cols;
    int C = input.channels();

    
    unsigned char* input_imgdata = new unsigned char[H*W*C];
    unsigned char* output_imgdata = new unsigned char[H*W*C];        
    cudaMallocManaged(&input_imgdata, H*W*C*sizeof(char));
    cudaMallocManaged(&output_imgdata,H*W*C*sizeof(char));
    
    toSharedMem(&input,input_imgdata);
    toSharedMem(&input,output_imgdata);
    
    int blockSize = 1024;
    int numBlocks = (H*W*C + blockSize - 1) / blockSize;
    
    int radius = 10;
    std::cout << "Begin\n";
    cudaBoxBlur<<<numBlocks,blockSize>>>(input_imgdata, output_imgdata, W, H, C, radius);
    cudaDeviceSynchronize();
    
    cv::Mat output;
    input.copyTo(output);
    output*=0;

    std::cout<<"Copying"<<std::endl;
    fromSharedMem(&output,output_imgdata);
    std::cout<<"Copied"<<std::endl;

    cv::resize(output,output,cv::Size(1280,960));
    cv::imshow("",output);
    int k = (int)cv::waitKey(0);
    
    cudaFree(input_imgdata);
    cudaFree(output_imgdata);
    
    return 0;
}

What happens when you print the value of “out” after line 39?
Does it show exactly the same thing as “input[index]*0.2f”? By “exactly” I mean, does it show a value you would expect?
Do you get any error when running with cuda-memcheck?

Printing the value of out gives the expected values (meaning an unsigned integer within an 8-bit range).
Printing input[index]*0.2f gives an expected value as well, a float within an 8-bit range. The float is then converted to the appropriate unsigned char upon assignment.
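
(For reference, the exact debugging code used for this spot-check isn’t shown in the thread; a guarded device-side printf of roughly this form, placed in cudaBoxBlur after out is computed, is one way to do it:)

// Hypothetical spot-check inside the kernel: print one thread's result.
if (index == 0) {
    printf("sum = %f, out = %u\n", sum, (unsigned int)out);
}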

Below are the cuda-memcheck outputs for each case. The only thing changed between these two runs is that single line.

When using the line: “output[index] = out” I get the following memcheck output:
========= CUDA-MEMCHECK
Begin
Copying
2440 3050 3
========= Error: process didn’t terminate successfully
========= Program hit cudaErrorInvalidDevice (error 10) due to “invalid device ordinal” on CUDA API call to cudaSetDevice.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\SYSTEM32\nvcuda.dll (cuModuleGetSurfRef + 0x2d60c6) [0x2e3c8b]
========= Host Frame:C:\Users\youser\Desktop\cuda_stuff\cuda_test\build\Release\cuda_blur.exe [0x123f]
========= Host Frame:C:\Users\youser\Desktop\cuda_stuff\cuda_test\build\Release\cuda_blur.exe [0xe670]
========= Host Frame:C:\Users\youser\Desktop\cuda_stuff\cuda_test\build\Release\cuda_blur.exe [0xf199]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x13034]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x73691]

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\SYSTEM32\nvcuda.dll (cuModuleGetSurfRef + 0x2d60c6) [0x2e3c8b]
========= Host Frame:C:\Users\youser\Desktop\cuda_stuff\cuda_test\build\Release\cuda_blur.exe [0x110e]
========= Host Frame:C:\Users\youser\Desktop\cuda_stuff\cuda_test\build\Release\cuda_blur.exe [0xe8ec]
========= Host Frame:C:\Users\youser\Desktop\cuda_stuff\cuda_test\build\Release\cuda_blur.exe [0xf199]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x13034]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x73691]

========= No CUDA-MEMCHECK results found

When using the line: “output[index] = input[index]*0.2f” I get the following memcheck output:
========= CUDA-MEMCHECK
Begin
Copying
2440 3050 3
Copied
========= Program hit cudaErrorInvalidDevice (error 10) due to “invalid device ordinal” on CUDA API call to cudaSetDevice.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\SYSTEM32\nvcuda.dll (cuModuleGetSurfRef + 0x2d60c6) [0x2e3c8b]
========= Host Frame:C:\Users\youser\Desktop\cuda_stuff\cuda_test\build\Release\cuda_blur.exe [0x123f]
========= Host Frame:C:\Users\youser\Desktop\cuda_stuff\cuda_test\build\Release\cuda_blur.exe [0xe670]
========= Host Frame:C:\Users\youser\Desktop\cuda_stuff\cuda_test\build\Release\cuda_blur.exe [0xf199]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x13034]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x73691]

========= ERROR SUMMARY: 1 error

You actually have an error that is not yet related to UMA, according to cuda-memcheck.
Unless you have two (or more) CUDA-enabled cards and want to run this code on the second card, you should do:

cudaSetDevice(0);

so that it selects the first card, or comment out line #74 (the cudaSetDevice call). If you only have one CUDA card, you are selecting a card that doesn’t exist in your system. Of course, I’m basing this on what cuda-memcheck is telling us.
You can also copy my signature to your program and wrap your API calls with the macro, such as:

__CUDA_SAFE_CALL(cudaSetDevice(1));

Then it would immediately show this error. Give it a try and see what happens.
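
(The exact __CUDA_SAFE_CALL macro from that signature isn’t reproduced in this thread; a typical CUDA error-checking macro along those lines looks roughly like this:)

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal sketch of an error-checking macro: run the call and, if it returns
// anything other than cudaSuccess, report where it failed and abort.
#define __CUDA_SAFE_CALL(call)                                              \
    do {                                                                    \
        cudaError_t err__ = (call);                                         \
        if (err__ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                         __FILE__, __LINE__, cudaGetErrorString(err__));    \
            std::exit(EXIT_FAILURE);                                        \
        }                                                                   \
    } while (0)

With something like that in place, __CUDA_SAFE_CALL(cudaSetDevice(1)); stops the program at the cudaSetDevice call itself instead of letting the failure go unnoticed until you run cuda-memcheck.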

Thanks for the reply and the help on this. My laptop does have two GPUs:

integrated Intel GPU listed as GPU 0

and

GTX 1060 card listed as GPU 1

It is confusing because when I run the program with the output = input * scalar version, everything works correctly. I’ve also written other CUDA programs using cudaSetDevice(1) that do things on my NVIDIA card, and I can confirm the NVIDIA card is being utilized when I run them with my GPU profiling software. However, when I run the cudaSetDevice call with your SAFE_CALL macro, the program fails no matter what I do to the code.

OK. I also have a laptop with two cards, and the first time I tried setting the device to 1 it wouldn’t work; it had to be 0 (or no cudaSetDevice call at all). What does it output when you call cudaSetDevice with the macro?
I don’t have a compiler at hand right now to test anything.

cudaSetDevice operates on CUDA GPUs. Your integrated Intel GPU is not a CUDA GPU. You can confirm this by running the deviceQuery app; it will list all CUDA-capable devices in their enumeration order.
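
(If you don’t have the deviceQuery sample handy, a minimal sketch like this prints the same enumeration information:)

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess){
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("CUDA-capable devices: %d\n", count);
    for (int i = 0; i < count; i++){
        // Query and print the name of each CUDA device in enumeration order.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("device %d: %s\n", i, prop.name);
    }
    return 0;
}

On a laptop where the GTX 1060 is the only CUDA-capable device, this should report one device at ordinal 0, regardless of how Windows numbers the GPUs.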

If you do a cudaSetDevice(x) with an ordinal x that does not map to a valid CUDA GPU, the call fails but is effectively a no-op: subsequent work simply runs on the default device (ordinal 0). That is why the program still “works” when you pass it an out-of-range value x.
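
(A minimal sketch of what that looks like in practice, assuming a system where ordinal 1 is out of range:)

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaError_t err = cudaSetDevice(1);   // out of range on a single-CUDA-GPU system
    if (err != cudaSuccess){
        // Under cuda-memcheck this is the "invalid device ordinal" error seen above.
        std::printf("cudaSetDevice(1): %s\n", cudaGetErrorString(err));
    }
    int current = -1;
    cudaGetDevice(&current);              // the runtime stays on the default device
    std::printf("active device: %d\n", current);  // should print 0
    return 0;
}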