Jetson memory question

I’m running some experiments on my Jetson Nano SDK. It’s my first time developing for CUDA. The sample application I’m working on is an ADS-B demodulator.

My question is about memory. My data comes in as IQ samples at 2 megasamples/second. The first step of processing is to take the IQ data and compute the vector magnitude (for this stage of demodulation I don’t care about the phase, just the amplitude). That’s great, I can do that with a hypotf() call. It’s a little expensive to convert the offset unsigned 8-bit ints into two floats, but it’s not terrible: I can do this step on 60 seconds of data in about 750 milliseconds. The kernel is very simple:

__global__
void conv16(SAMPLE *din, float *dout) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float f1 = (float) (din[i].i) - 128.0f;
    float f2 = (float) (din[i].q) - 128.0f;
    dout[i] = hypotf(f1, f2);
}

I send data into the kernel in batches of 16384. I don’t know why, but if I pick a number greater than about 20,000 it just doesn’t work.
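In case it matters, I launch it along these lines (reconstructed from memory, so the exact numbers and names may be off from my actual code):

```cuda
// Sketch of my host-side launch; BATCH/THREADS and the error check are how
// I *think* it should look, not a paste of my real code.
#define BATCH   16384
#define THREADS 256            // must stay <= 1024 threads per block

conv16<<<BATCH / THREADS, THREADS>>>(d_in, d_out);
cudaError_t err = cudaGetLastError();   // catches bad launch configurations
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
cudaDeviceSynchronize();
```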

My first question is this. I allocate input and output buffers for the kernel using cudaMalloc() and copy data in and out using cudaMemcpy(). The documentation says that cudaMalloc() is allocating “device memory” out of the “global” address space, but on the Jetson Nano that’s still actually just system memory allocated by the Linux kernel, correct? In other words, the Linux kernel has its 4 GB of RAM and my kernel is working off of that same physical memory, just allocated in a way that’s properly aligned etc. for the GPU? Do I understand that correctly?

My second question is more of a design question. The second part of my demodulator is a preamble detector. The way that works is that there’s a pattern of bits that make up a preamble. In a sample set of say 10,000 samples that preamble could be anywhere. My original thought was to set up a bunch of threads and each one looks for the preamble starting at a different place in the 10,000 samples:

thread 1 looks for the preamble between samples 0 and 100
thread 2 looks for the preamble between samples 1 and 101
thread 3 looks for the preamble between samples 2 and 102
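
In kernel form I’m imagining something like this (just a sketch — the correlation itself is elided, and the window length and names are made up):

```cuda
// Sketch of the per-thread window search: thread i scans samples
// [i, i + PREAMBLE_LEN) for the preamble. PREAMBLE_LEN and the match
// test are placeholders, not real code from my demodulator.
#define PREAMBLE_LEN 100

__global__
void findPreamble(const float *mag, int n, int *hit) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + PREAMBLE_LEN > n) return;

    // ... correlate mag[i .. i+PREAMBLE_LEN-1] against the preamble
    //     bit pattern and record whether it matched ...
    hit[i] = 0;   // placeholder for the match result
}
```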

That may or may not be a terrible design, but that’s not really my question. What I’m wondering is this. I have some number of threads all accessing basically the same chunk of memory at slightly different offsets. If I have 100 threads then at SOME point in time they’ll all be reading sample 99. So my question is: does that defeat coalescing, and cause a huge bottleneck?

What got me thinking about this is the way images are processed. If we divide a big image up into 100 pieces to, say, convert them from RGBA to YUV420, all those threads/cores/blocks can work on different pieces of memory at the same time and nobody overlaps. In my case the accesses aren’t random, but they’re certainly concurrent and overlapping, so am I going to choke myself on memory bandwidth by organizing it this way?

I’m just trying to get my head around all these kinds of memory and how I need to structure these algorithms to work best.

Hi,

1. Yes. On Jetson the CPU and GPU share the same physical memory, so cudaMalloc() allocates out of the same 4 GB of system RAM.
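
Because it is the same DRAM, you can also drop the explicit copies by using managed memory; a rough sketch (the buffer names, size, and fill/consume steps are placeholders — your conv16 kernel itself would not change):

```cuda
// Sketch: on Jetson, cudaMallocManaged() returns one pointer valid on
// both CPU and GPU, so the explicit cudaMemcpy() in/out steps go away.
SAMPLE *samples;
float  *mags;
int n = 16384;
cudaMallocManaged(&samples, n * sizeof(SAMPLE));
cudaMallocManaged(&mags,    n * sizeof(float));

// ... fill samples[] from the SDR on the CPU ...

conv16<<<n / 256, 256>>>(samples, mags);
cudaDeviceSynchronize();      // required before the CPU reads mags[]

// ... use mags[] directly on the CPU ...

cudaFree(samples);
cudaFree(mags);
```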

2. Sure.
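
One common way to reduce the repeated global reads from overlapping windows is to stage each block’s samples in shared memory once; a rough sketch (the 256-thread block size, 100-sample window, and all names are assumptions):

```cuda
// Sketch: each block cooperatively loads its samples (plus the overlap)
// into shared memory with coalesced reads -- consecutive threads touch
// consecutive addresses -- then every thread scans its window out of
// fast on-chip memory, so each sample is fetched from DRAM once per block.
#define WIN 100

__global__
void findPreambleTiled(const float *mag, int n, int *hit) {
    __shared__ float tile[256 + WIN];   // assumes blockDim.x == 256
    int t    = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    // cooperative, coalesced load of the block's window plus the overlap
    for (int j = t; j < blockDim.x + WIN; j += blockDim.x)
        if (base + j < n)
            tile[j] = mag[base + j];
    __syncthreads();

    int i = base + t;
    if (i + WIN > n) return;
    // ... correlate tile[t .. t+WIN-1] against the preamble pattern ...
    hit[i] = 0;   // placeholder for the match result
}
```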

If you only read each sample once, you could put them in pinned memory.
That saves the transfer overhead, but adds some latency on each access.
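
A minimal sketch of that pinned, zero-copy approach on Tegra (buffer names and size are placeholders):

```cuda
// Sketch: mapped pinned memory -- the GPU reads the host buffer directly
// over the shared DRAM, so no cudaMemcpy() is needed. GPU reads of this
// buffer are uncached, which is acceptable for data touched only once.
SAMPLE *h_samples, *d_samples;
float  *d_out;
int n = 16384;
cudaHostAlloc((void **)&h_samples, n * sizeof(SAMPLE), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_samples, h_samples, 0);
cudaMalloc(&d_out, n * sizeof(float));

// ... fill h_samples[] on the CPU, then launch with the device alias:
conv16<<<n / 256, 256>>>(d_samples, d_out);
```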

Please check this document for memory selection on Tegra systems:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management

Thanks.