CUDA 11.4.19 : median on ~40 elements integer vector + if statement

That is done in this line here:

This assumes the vectors are stored contiguously in memory.

You might be running into the limits of printf from kernel output.

Hi Robert,

It’s working great.
In the line:

typedef cub::BlockRadixSort<T, NS, 1> BlockRadixSort;

NS must be a #define. Is it possible to make it a variable?
For example:

uint32_t Ns = NS;
typedef cub::BlockRadixSort<T, Ns, 1> BlockRadixSort;

This code fails to compile.

Thank you,
Zvika

No, it can’t be a variable. It’s being used as a template argument there, which must be resolvable at compile time.
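To illustrate the compile-time requirement, here is a minimal host-side C++ sketch; the same rule applies in CUDA device code. `BlockSorter` is a hypothetical stand-in for a class template such as cub::BlockRadixSort, and `select_block_threads` is an assumed helper name that is not part of CUB. A runtime value can only choose among instantiations that were already compiled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a class template whose size is a non-type
// template parameter, as the block size is for cub::BlockRadixSort.
template <typename T, int BLOCK_THREADS>
struct BlockSorter {
    static constexpr int kThreads = BLOCK_THREADS;
};

constexpr int NS = 64;                        // a constant expression
using Sorter64 = BlockSorter<uint32_t, NS>;   // OK: NS is known at compile time
static_assert(Sorter64::kThreads == 64, "instantiated with NS");

// uint32_t Ns = NS;
// using Bad = BlockSorter<uint32_t, Ns>;     // error: Ns is not a constant expression

// Assumed helper: pick the smallest precompiled size that fits a runtime
// length. Returns -1 when no instantiation is large enough.
int select_block_threads(int runtime_len) {
    assert(runtime_len > 0);
    if (runtime_len <= 32)  return BlockSorter<uint32_t, 32>::kThreads;
    if (runtime_len <= 64)  return BlockSorter<uint32_t, 64>::kThreads;
    if (runtime_len <= 128) return BlockSorter<uint32_t, 128>::kThreads;
    return -1;
}
```

With this pattern a length of 40 maps onto the 64-wide instantiation, at the cost of compiling one instantiation per supported size.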

For values in the 40-64 range, I would just set it to 64 and then pad each vector.
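A minimal sketch of the padding idea, written as host-side C++ so it can be checked without a GPU. Here `std::sort` stands in for the block-wide sort, `median_padded` is an assumed name, and taking the lower median for an even count is an assumption, since the thread does not say which median is wanted. Padding with the maximum value pushes the padding to the end of an ascending sort, so the first m sorted entries are still the original data.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <limits>

constexpr int NS = 64;  // compile-time padded length (the sort width)

// Copy m valid values into a fixed NS-wide buffer, fill the remainder with
// the maximum value so padding sorts last, then read the median of the
// first m sorted entries.
uint32_t median_padded(const uint32_t* v, int m) {
    assert(m > 0 && m <= NS);
    std::array<uint32_t, NS> buf;
    buf.fill(std::numeric_limits<uint32_t>::max());  // padding sentinel
    std::copy(v, v + m, buf.begin());
    std::sort(buf.begin(), buf.end());               // stand-in for the block sort
    return buf[(m - 1) / 2];                         // lower median (assumption)
}
```

The result stays correct even if the data itself contains the maximum value, because only the positions of the m smallest entries are read.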

Thank you very much !

Hi All,

I used the code from Quick Select Algorithm
to develop a kernel that calculates the median of each of N vectors, each containing M elements.
The run time is ~4 ms, compared to ~7 ms with BlockRadixSort.

In the kernel I copied the vector to a local array and then ran the median code.
My code is naïve.
I’m quite sure it can run faster.
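For readers without the attachment, the approach described above might look roughly like the following sketch. It is not the code from signal.cu.txt. The `HD` macro, the function names, the Lomuto partition scheme, and the choice of the lower median are all assumptions made for illustration. It compiles as plain C++ and, under nvcc, as host and device code.

```cpp
#include <cassert>
#include <cstdint>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

constexpr int M = 40;  // elements per vector, as in the thread

// Iterative quickselect with Lomuto partitioning: returns the k-th smallest
// (0-based) value of a[0..n-1]. Reorders a in place.
HD inline uint32_t quickselect(uint32_t* a, int n, int k) {
    assert(n > 0 && k >= 0 && k < n);
    int lo = 0, hi = n - 1;
    while (lo < hi) {
        uint32_t pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; ++i) {
            if (a[i] < pivot) {
                uint32_t t = a[i]; a[i] = a[store]; a[store] = t;
                ++store;
            }
        }
        uint32_t t = a[store]; a[store] = a[hi]; a[hi] = t;  // pivot to final slot
        if (k == store) return a[k];
        if (k < store) hi = store - 1; else lo = store + 1;
    }
    return a[lo];  // lo == hi == k here
}

// Median of vector vec_index when vectors of M elements are stored back to
// back. In a kernel, each thread would call this with its own vector index.
HD inline uint32_t median_of_vector(const uint32_t* data, int vec_index) {
    uint32_t local[M];  // private working copy, so the input is not modified
    for (int j = 0; j < M; ++j) local[j] = data[vec_index * M + j];
    return quickselect(local, M, (M - 1) / 2);  // lower median (assumption)
}
```

In a kernel, a thread with global index `i` would write `median_of_vector(data, i)` to its output slot after checking `i < N`.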

Each thread handles a 40-element vector.
All vectors are consecutive in RAM.
Can you please tell me whether I’m facing bank conflicts?

signal.cu.txt (3.8 KB)

Thank you,
Zvika

Are you using shared memory? (Hint: you are not.)
Then there is no possibility of bank conflicts.
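For context, bank conflicts would only become a concern if the vectors were staged in shared memory. The following host-side sketch models that hypothetical case, assuming 32 banks of 4-byte words and a 32-thread warp, which matches current NVIDIA architectures in their default configuration. `bank_conflict_degree` is an assumed helper name for illustration, not a CUDA API.

```cpp
#include <algorithm>
#include <cassert>

constexpr int kWarpSize = 32;
constexpr int kNumBanks = 32;  // successive 32-bit words map to successive banks

// Worst-case number of threads in one warp that hit the same bank when
// thread t accesses the 32-bit word at index t * stride_words + offset.
// With stride_words >= 1 every thread touches a distinct word, so any
// sharing of a bank is a real conflict rather than a broadcast.
int bank_conflict_degree(int stride_words, int offset = 0) {
    assert(stride_words >= 1 && offset >= 0);
    int hits[kNumBanks] = {0};
    for (int t = 0; t < kWarpSize; ++t) {
        int word = t * stride_words + offset;
        ++hits[word % kNumBanks];
    }
    return *std::max_element(hits, hits + kNumBanks);
}
```

Under this model a per-thread stride of 40 words gives an 8-way conflict, while padding each vector to 41 words makes the accesses conflict-free.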

Hi Robert,

Thank you very much for your reply.
Does it matter where each vector starts in global memory?
For example, what if each vector starts at an address aligned to 256?

Can I be sure that “uint32_t arr[40]” is held entirely in registers (which are fast compared to local and shared memory)?

Best regards,
Zvika

I haven’t studied your code carefully and at the moment, don’t intend to.

It will depend on your access patterns, to be sure, but generally I wouldn’t expect it to matter much if your vectors are stored in adjacent locations. If you want to learn about the effect of access patterns on global memory, you can search for forum topics on “coalescing” in CUDA, and/or study the programming guide, and/or study section 4 of this online training series.
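As a rough way to reason about access patterns, the sketch below counts how many 32-byte memory sectors one warp touches in a single load, assuming 4-byte elements and a sector-aligned base address. `sectors_touched` is an assumed helper name, and the model ignores caching, so treat it as intuition rather than a performance prediction.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

constexpr int kWarpSize = 32;
constexpr int kSectorBytes = 32;

// Number of distinct 32-byte sectors touched when thread t of a warp reads
// the 4-byte element at index t * stride_elems + offset_elems.
int sectors_touched(int stride_elems, int offset_elems = 0) {
    assert(stride_elems >= 0 && offset_elems >= 0);
    std::set<std::int64_t> sectors;
    for (int t = 0; t < kWarpSize; ++t) {
        std::int64_t byte = 4 * (std::int64_t(t) * stride_elems + offset_elems);
        sectors.insert(byte / kSectorBytes);
    }
    return static_cast<int>(sectors.size());
}
```

With one thread per 40-element vector, adjacent threads are 40 elements apart, so each load touches 32 sectors in this model. A transposed layout with stride 1, which is not something the thread's code uses, would touch 4. Shifting the start by one element changes the stride-1 case only from 4 to 5, consistent with the point that the starting address matters much less than the stride between threads.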

I definitely would not make that assumption without careful study. In most cases I expect such an array would not be “all in registers”. This topic is covered in a number of forum posts such as this one and this one if you want to understand the general requirements for that to happen, and how to inspect for it. With a bit of searching you can find more related forum posts for additional learning.