Calling cudaHostUnregister() on the same 4KB page twice (CUDA 9.1)

I have an application that uses cudaHostRegister() to pin some memory that is allocated externally (by the client). Everything seems to work EXCEPT when I run with cuda-gdb, where it produces an error ‘CUDBG_ERROR_INVALID_MEMORY_SEGMENT(0x7)’ at the point where I try to call cudaHostUnregister().

My first thought was that the arguments passed to cudaHostRegister (ptr and size) must be page-aligned and that this is what is causing the error. Older versions of the documentation seem to state this; however, in the current version of the documentation (v9.1.85) I do not see it stated anywhere.

Can anyone clarify whether the arguments passed to cudaHostRegister need to be page-aligned? Thanks in advance!

Here is the documentation in question:
http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge8d5c17670f16ac4fc8fcb4181cb490c

EDIT - I have discovered that this issue is not related to alignment but is due to trying to register/unregister small segments of memory that reside in the same 4KB page. Please see my next post below.

I don’t know if it is an absolute requirement, but I would assume it is for at least some of the flags you can pass with it. In my experience, everything that deals with pages and locking them in memory should be page-aligned.
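
If it were required, I would expect to have to round the range out to page boundaries myself before registering, something like the sketch below (this assumes a Linux/POSIX page-size query via sysconf; I am not claiming the runtime actually demands it):

#include <unistd.h>        // sysconf (POSIX; assuming Linux here)
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helper: expand [ptr, ptr + size) to page boundaries before
// registering. Rounds the start down and the end up to the page size.
cudaError_t registerPageAligned(void *ptr, size_t size, unsigned int flags) {
    const uintptr_t page  = static_cast<uintptr_t>(sysconf(_SC_PAGESIZE));
    const uintptr_t begin = reinterpret_cast<uintptr_t>(ptr) & ~(page - 1);
    const uintptr_t end   = (reinterpret_cast<uintptr_t>(ptr) + size + page - 1) & ~(page - 1);
    return cudaHostRegister(reinterpret_cast<void *>(begin), end - begin, flags);
}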

I have reproduced the cuda-gdb error I am seeing outside my code base. It turns out the issue is not one of alignment but one of trying to pin/unpin two small segments of memory that reside within the same 4KB page. Consider the following example that tries to pin two segments of 100 bytes that are separated by dist bytes:

#include <cassert>
#include <iostream>

#include <cuda_runtime.h>  // cudaHostRegister / cudaHostUnregister

int main() {

    using namespace std;

    // length of each segment in bytes
    size_t len  = 100;

    // distance between segments
    size_t dist = 0;

    // allocate contiguous memory for both segments
    char * x = new char[2*len+dist];

    // segment 1
    char * x1 = &x[0];

    // segment 2
    char * x2 = &x[len+dist];

    cudaError_t chk;

    // pin x1
    chk = cudaHostRegister(x1, len*sizeof(char), cudaHostRegisterMapped);
    assert(chk == cudaSuccess);
    cout << "pinned x1" << endl;

    // pin x2
    chk = cudaHostRegister(x2, len*sizeof(char), cudaHostRegisterMapped);
    assert(chk == cudaSuccess);
    cout << "pinned x2" << endl;

    // unpin x1
    chk = cudaHostUnregister(x1);
    assert(chk == cudaSuccess);
    cout << "unpinned x1" << endl;

    // unpin x2
    chk = cudaHostUnregister(x2);
    assert(chk == cudaSuccess);
    cout << "unpinned x2" << endl;

    // free
    delete [] x;

    return 0;
}

This code executes without problem (none of the CUDA API calls return an error) and the output is:

pinned x1
pinned x2
unpinned x1
unpinned x2

However, if one runs with cuda-gdb the following error arises:

pinned x1
pinned x2
unpinned x1
Error: Internal error reported by CUDA debugger API (error=CUDBG_ERROR_INVALID_MEMORY_SEGMENT(0x7)). The application cannot be further debugged.

If dist is set to >= 4096 bytes, the cuda-gdb problem goes away, which leads me to believe this issue is related to trying to unregister the same 4KB page twice. Similarly, one can check that if you unpin x1 before pinning x2, cuda-gdb does not produce any errors.

My question is therefore: if unregistering the same 4KB page twice is not allowed, shouldn’t the CUDA API call return an error code? On the other hand, if the calls do not return an error code by design (i.e., this behaviour is perfectly fine), then why does cuda-gdb produce such an error? This behaviour seems inconsistent to me.
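
For completeness, the repro stays clean under cuda-gdb if each segment gets its own page, e.g. by padding the allocations out to the page size. Here is a sketch of that workaround (it assumes POSIX posix_memalign; in my real application the memory is allocated by the client, so this only demonstrates the dist >= 4096 observation, not a fix I can always apply):

#include <unistd.h>        // sysconf (POSIX)
#include <stdlib.h>        // posix_memalign, free
#include <cassert>
#include <cuda_runtime.h>

int main() {
    // page size is typically 4096 bytes
    const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    const size_t len  = 100;

    // give each segment its own page-aligned, page-sized slot so the two
    // registrations never touch the same 4KB page
    void *x1 = nullptr, *x2 = nullptr;
    assert(posix_memalign(&x1, page, page) == 0);
    assert(posix_memalign(&x2, page, page) == 0);

    assert(cudaHostRegister(x1, len, cudaHostRegisterMapped) == cudaSuccess);
    assert(cudaHostRegister(x2, len, cudaHostRegisterMapped) == cudaSuccess);

    assert(cudaHostUnregister(x1) == cudaSuccess);
    assert(cudaHostUnregister(x2) == cudaSuccess);

    free(x1);
    free(x2);
    return 0;
}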

Hello, did you resolve the problem? I am getting the same error. When I first ran the cuda-gdb command in my terminal I got a Python error (there are several Python versions on my computer); after fixing that, running the program works fine, but I cannot debug it, and I get the same error:

Error: Failed to suspend device for CUDA device0, error=CUDBG_ERROR_INVALID_MEMORY_SEGMENT(0x7).

I can use remote debugging to debug the program on another computer with a GPU, and no error occurs there.

Here is the code I am using.
Working code:

int w = 3 * 1024 * 1024;
int n = 1;
add<<<w/n, n>>>(1, 2, dev_addr);

Failing code:

int w = 3 * 1024 * 1024 + 1;
int n = 1;
add<<<w/n, n>>>(1, 2, dev_addr);

Working code:

int w = 4 * 1024 * 1024;
int n = 256;
add<<<w/n, n>>>(1, 2, dev_addr);

Failing code:

int w = 4 * 1024 * 1024 + 256;
int n = 256;
add<<<w/n, n>>>(1, 2, dev_addr);

OS: Ubuntu 16.04
CUDA Toolkit: 10.0
Python Version: 2.7.16
Nsight Version: 10.0

I also tried the CUDA 9.2 cuda-gdb on my system, and the error happens there too.

I don’t know what is going wrong with cuda-gdb; if anyone can help resolve it, thanks a lot.
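
One thing I want to rule out myself (this is only a guess, since I have not shown here how dev_addr is allocated or what add() does): if dev_addr is sized for a round number of elements, the extra elements in the "+1" cases would be written past the end of the buffer, and cuda-gdb is much stricter about that than a normal run. Below is a sketch of the guarded launch I plan to try; the count parameter and the cudaMalloc sized from w are assumptions I am adding for the sketch, not something from my original code.

#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel with the same shape as my add(1, 2, dev_addr), plus a
// count parameter so threads past the end of the buffer do nothing.
__global__ void add(int a, int b, int *out, size_t count) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < count)                 // guard against out-of-bounds writes
        out[i] = a + b;
}

int main() {
    const size_t w = 3 * 1024 * 1024 + 1;   // one of the "+1" sizes that fails for me
    const int    n = 256;

    int *dev_addr = nullptr;
    assert(cudaMalloc(&dev_addr, w * sizeof(int)) == cudaSuccess); // sized from w

    const size_t blocks = (w + n - 1) / n;  // round up so all w elements are covered
    add<<<blocks, n>>>(1, 2, dev_addr, w);
    assert(cudaDeviceSynchronize() == cudaSuccess);

    cudaFree(dev_addr);
    return 0;
}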
