cublasScnrm2(...) keeps crashing with "Segmentation fault (core dumped) $EXECUTABLE"

I am building proof-of-concept code for signal processing.
I am trying to use the function cublasScnrm2 (cuBLAS nrm2) to compute the absolute value of a matrix (a vector, for that matter).
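
For reference, the nrm2 routines return a single scalar, the Euclidean norm of the whole vector, rather than one magnitude per element. A minimal CPU sketch of the difference, using hypothetical helper names (`scnrm2Reference`, `elementwiseAbs`) that are not part of cuBLAS:

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Hypothetical CPU reference (not a cuBLAS function): returns the Euclidean
// norm sqrt(sum_i |x[i]|^2) of a complex vector, i.e. one scalar for the
// whole input, which is what the nrm2 family computes.
float scnrm2Reference(const std::vector<std::complex<float>>& x) {
    double sumOfSquares = 0.0;  // accumulate in double to limit rounding error
    for (const std::complex<float>& value : x) {
        sumOfSquares += std::norm(std::complex<double>(value));  // |value|^2
    }
    return static_cast<float>(std::sqrt(sumOfSquares));
}

// Element-wise magnitude, shown only for contrast: one output per input.
std::vector<float> elementwiseAbs(const std::vector<std::complex<float>>& x) {
    std::vector<float> out;
    out.reserve(x.size());
    for (const std::complex<float>& value : x) {
        out.push_back(std::abs(value));
    }
    return out;
}
```

For the two-element input {3+4i, 12i}, `scnrm2Reference` gives 13 while `elementwiseAbs` gives {5, 12}.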

This is a code snippet, with the parameter values printed:

void complexAbsWrepper(cuComplex* inputMatrixComplex, float* outputMatrixReal, int numRows, int numCols, int numThreads, cublasHandle_t &handle)
{
    cublasStatus_t cublasStatus;
    printf("\ncomplexAbsWrepper(...) Start\n\n");
    printf("Input Arguments:\n");
    printf("inputMatrixComplex: %p\n", inputMatrixComplex);
    printf("outputMatrixReal: %p\n", outputMatrixReal);
    printf("handle: %p\n", &handle);
    printf("numRows: %d\n", numRows);
    printf("numCols: %d\n", numCols);

    cublasStatus = cublasScnrm2(
        handle,                     // cublasHandle_t handle
        numRows * numCols,          // int n
        inputMatrixComplex,         // const cuComplex *x
        1,                          // int incx
        outputMatrixReal);          // float *result

    if (cublasStatus != CUBLAS_STATUS_SUCCESS)
        printf("\ncomplexAbsWrepper(...) Error - %d\n", cublasStatus);
    else
        printf("\ncomplexAbsWrepper(...) Success - %d\n", cublasStatus);
    printf("\ncomplexAbsWrepper(...) End\n\n");
}


The code crashes with a segmentation fault as soon as it reaches the call to cublasScnrm2.
This is the output printed before the crash:

Input Arguments:
inputMatrixComplex: 0x2165d3000
outputMatrixReal: 0x2265d3000
handle: 0xffffcf9b7eb0
numRows: 65536
numCols: 512
./ line 39:  7565 Segmentation fault      (core dumped) $EXECUTABLE

Here is some information about the hardware and environment:
I am using the AGX Orin 64GB Development kit.

noam@noam-desktop:~/Documents/HelloWorld$ lscpu
Architecture:            aarch64
  CPU op-mode(s):        32-bit, 64-bit
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               ARM
  Model name:            Cortex-A78AE
    Model:               1
    Thread(s) per core:  1
    Core(s) per cluster: 4
    Socket(s):           -
    Cluster(s):          3
    Stepping:            r0p1
    CPU max MHz:         2201.6001
    CPU min MHz:         115.2000
    BogoMIPS:            62.50
    Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp uscat ilrcpc flagm paca pacg
Caches (sum of all):     
  L1d:                   768 KiB (12 instances)
  L1i:                   768 KiB (12 instances)
  L2:                    3 MiB (12 instances)
  L3:                    6 MiB (3 instances)
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11

noam@noam-desktop:~/Documents/HelloWorld$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:08:11_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

One of your arguments to the call may be set up incorrectly. For example, perhaps you have not initialized the handle correctly. It’s impossible to say which argument has a problem from what you have posted.

The matrix pointers are used in other functions (not cuBLAS functions) and work correctly.
The handle is initialized outside, in the main function, using cublasCreate(…):

cublasHandle_t handle;
statusCuBlasHandle = cublasCreate(&handle);

The return code does not indicate any error.
Is there another function required to perform initialization?

From some reading: if the handle is created in the main host code, is it considered a host memory pointer, and therefore not valid to use in a cuBLAS function because it is not in the GPU context? Is that a thing, or am I way off here?

I wouldn’t be able to help further without a short complete reproducer.
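
A short complete reproducer for this call could look like the sketch below. It is not the original poster's code: the function name `runScnrm2Reproducer`, the test data, and the numeric error codes are assumptions made for illustration. It passes a host `float` for the result, because cuBLAS defaults to `CUBLAS_POINTER_MODE_HOST`, in which scalar results are written through a host pointer. If `outputMatrixReal` in the snippet above is a device allocation (the post does not say), that would be one argument worth checking.

```cuda
#include <cstdio>
#include <vector>
#include <cuComplex.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post). Returns 0 on success,
// or a small nonzero code identifying the step that failed.
int runScnrm2Reproducer(int n) {
    // Host input: n copies of 3 + 4i, so the expected norm is 5 * sqrt(n).
    std::vector<cuComplex> hostInput(n, make_cuComplex(3.0f, 4.0f));

    // Device copy of the input vector.
    cuComplex* deviceInput = nullptr;
    if (cudaMalloc(&deviceInput, n * sizeof(cuComplex)) != cudaSuccess) {
        return 1;
    }
    if (cudaMemcpy(deviceInput, hostInput.data(), n * sizeof(cuComplex),
                   cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(deviceInput);
        return 2;
    }

    // The handle itself is an ordinary host-side variable.
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        cudaFree(deviceInput);
        return 3;
    }

    // Assumption: default pointer mode (CUBLAS_POINTER_MODE_HOST), so the
    // scalar result is written to host memory.
    float hostResult = 0.0f;
    cublasStatus_t status =
        cublasScnrm2(handle, n, deviceInput, 1, &hostResult);

    printf("status = %d, norm = %f\n", static_cast<int>(status), hostResult);

    cublasDestroy(handle);
    cudaFree(deviceInput);
    return status == CUBLAS_STATUS_SUCCESS ? 0 : 4;
}
```

Calling `runScnrm2Reproducer(1 << 20)` from `main` and building with `nvcc` plus `-lcublas` gives a complete program. To keep the result in device memory instead, `cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE)` can be called before `cublasScnrm2`.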