CUDA 11.4.19 : median on ~40 elements integer vector + if statement

Hello,

I have ~1000 vectors each ~50 elements long.
For each vector I have to calculate the median and multiply only the median by factor.
Then, for each element in the vector I have to check if the result > element value.
Do you think CUDA kernel can do it with reasonable performance ?
I’m asking it because from what I know, GPU kernel has great performance in number crunching but not with decisions.

Currently we are running this Intel based x64 server.

Thank you,
Zvika

I would be more concerned about insufficient inherent parallelism. Ideally, for modern GPUs you would want parallelism on the order of 10K threads.

GPUs do not have a problem with “decisions” (which I take to mean if-statements) per se, but rather with control flow that diverges for an extended time. As long as control flow quickly re-converges, it is not a significant distractor from performance. Plus for small “local” branching the compiler may convert branch-y code into non-branch-y code.

What type are the items your are computing the median of? floats, ints, something else? How are you computing the median currently?40 to 50 elements is a bit ambitious for a sorting network, which has, however, the advantage that it can be built from max and min primitives without any branches. Are you computing the exact median, or a median of medians (say a 7x7 matrix in which you compute the median of each row, then the median of the row-medians)?

Hello njuffa,

Highly appreciate your detailed and fast reply.

The items are integer (32bit)
I did not start writing code yet.
For each ~50 elements in vector I need the exact median (not median of medians).

Best regards,
Zvika

GPUs have integer MIN and MAX instructions, exposed via min() and max()functions, that one can use to implement networks. The biggest sorting network I have encountered in internet resources has 32 inputs and requires 185 comparison elements, each comprising a MIN and a MAX. Now sorting networks provide more information than is needed for a median, and compilers do a reasonable job of trimming away unneeded comparisons that do not contribute to the median result (this would happen automatically as part of dead code elimination).

Are the 32-bit integers completely random, or are they restricted to a limited range or have some internal structure?

You might want to check whether radix sort is applicable.

The benefits really depend on the used system, cpu, gpu, and if this is the only computation you intent to perform on the GPU. Memory allocation and data transfer could be a bottleneck
I would suggest you set up a simple benchmark that compares computing the median of each segment on the CPU and on the GPU. You could use 32 threads per block, 1 block per segment, and uses cub’s blockradixsort with 2 elements per thread. With 32 threads, you can then simply use warp shuffles to find the median.

You could possibly use something like this as a starting point:

#include <cub/cub.cuh>


using mt = int;
const int NS = 64;
const int NV = 1000;

template <typename T>
__global__ void k(const T * __restrict__ d, bool * __restrict__ r, const T f1) {
    // Specialize BlockRadixSort for a 1D block of NS threads owning 1 item each
    typedef cub::BlockRadixSort<T, NS, 1> BlockRadixSort;
    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    __shared__ T median;
    if (!threadIdx.x) median = 0;
    // Obtain a segment of consecutive items that are blocked across threads
    T thread_keys[1];
    thread_keys[0] = d[blockIdx.x*NS+threadIdx.x];
    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
    if (NS & 1) { // set length is odd
      if (threadIdx.x == (NS/2)) median = 2*thread_keys[0];}
    else // set length is even
      if ((threadIdx.x == (NS/2)) || (threadIdx.x == ((NS/2)-1))) atomicAdd(&median, thread_keys[0]);
    __syncthreads();
    r[blockIdx.x*NS+threadIdx.x] = (thread_keys[0] > f1*(median/2));
}


int main(){

  mt *d;
  bool *r;
  mt fac = 2;
  cudaMallocManaged(&d, NS*NV*sizeof(d[0]));
  cudaMallocManaged(&r, NS*NV*sizeof(r[0]));
  // initialize data in d
  // ...
  k<<<NV, NS>>>(d, r, fac);
  cudaDeviceSynchronize();
}

If this is the only thing you intend to do on the GPU, I doubt it will be meaningfully faster than a reasonably good CPU implementation. For vector length less than 64 (e.g. 50) I would recommend padding each vector and adjust the median point accordingly. Note that padding could be done at the point of data load by each threadblock; you do not necessarily have to pad the data in global memory. This code provides results for the elements in sorted order. If you wanted results for elements in unsorted order, you could use cub sort pairs functionality, sorting each data set item with an index. Due to the median handling, this approach will have some limitations. For example, for integer types, it won’t work if the actual median is greater than half the available integer range. There are ways to address this with additional code complexity.

1 Like

Hi njuffa, striker159 and Robert,

Thank you very much for your replies.

Do you think it’s worthwhile checking also the sample:
/usr/local/cuda/samples/6_Advanced/cdpAdvancedQuicksort ?

Best regards,
Zvika

Traditionally, quickselect has been a popular algorithm for finding the k-th smallest number in an array and in particular for finding the median. I have not personally used quickselect, and the last time I worked with quicksort was as an undergrad, decades ago. Quickselect is related to Quicksort, as far as I recall, so it would not hurt to look at a GPU implementation of either of them.

Sorting and selecting is not really my area of expertise (I am more of an expert in fixed-point and floating-point arithmetic as well as general bit-twiddling), but my general sense is that variants of radix sort are usually better suited to GPU work than variants of quicksort; you have already received a concrete suggestion in that direction.

It may be worthwhile; you may learn something.

However if it were me, using cdp (CUDA Dynamic Parallelism) is definitely not the way I would go for a problem of the type you described. I haven’t looked at that sample code lately, but years ago it was a foolish sorting method (as a whole). It’s purpose was not to demonstrate high-performance sorting, but instead to show the use of CDP in an iterative algorithm where synchronization was needed. So if you do visit that code, I would look for things to learn without getting wrapped around the CDP axle. There is a warp-level sort implementation in there, for example (which is not foolish).

I’d be very surprised and would like to see it, if anyone were able to come up with a GPU method that is faster than using cub block level sort.

You can worry about branchy or non-branchy if you wish. But sorting is one of those algorithms that has been subject to ninja-level tuning. It would be absolutely foolish IMO to leave that value aside and begin to write your own code, just because you wanted to avoid branchy-ness.

There is a general principle (I believe) in computer science, that roughly says “don’t reinvent the wheel”. When applied to libraries, I would say, “Don’t write your own code when a high quality library implementation exists”. You can further characterize that with “without good reason” if you wish; all such directions probably have that disclaimer.

I haven’t seen a good reason not to try using the cub library for this, so far, in this thread.

1 Like

@Robert_Crovella Specifically for segmented sort, there are papers “Fast segmented sort on GPUs” (2017) and “Faster Segmented Sort on GPUs” (2023) which achieve greater performance than CUB for various segment sizes using sorting networks. Those methods could be applicable here, as well.

1 Like

Yes, I probably didn’t phrase my challenge correctly. I would love to see a demonstration of it, benchmarked against today’s cub.

The 2017 paper makes claims that I find curious for a research paper in 2017, or 2015, or 2013, or anything with Kepler or Pascal in view. The 2023 paper seems to be behind a paywal.

Anyway I used poor phrasing. I would restate it as:

“I would love to see an actual demonstration here, in this thread, of a technique on a GPU that did better on this problem, tested and benchmarked against today’s cub.”

1 Like

Hi Robert,

I slightly modified your code:

...
__syncthreads();
r[blockIdx.x*NS+threadIdx.x] = median;

So the first element in each vector is the median.
For example:
The CPU calculated median is:
126 134 128 127 121 124 123 130 125

The GPU calculated median is:
126 134 128 128 122 124 123 131 125

It seems the output is wrong. For some vectors, the GPU calculated median is ±1 compared to the CPU calculated median.

Can you please tell where is my mistake ?

Thank you very much,
Zvika

Hi striker159,

Do you have a CUDA code implementing those papers ?

Thank you,
Zvika

That won’t work. r points to a boolean type.

Anyway I would suggest you provide a complete test case, both for your GPU and CPU versions. I can’t comment too much on the behavior of code you haven’t shown. It’s also worth noting that I don’t know if the median has an actual definition for an even vector length. if it does, I may have not done it correctly.

Looking things up, it seems the median of an even set of numbers is the average of the value just below and just above the midpoint. My previously posted code is wrong in median handling. I’ve modified my previous posting to handle it correctly, I think.

The common approach for determining the median of an even number of data items is to compute the numerical average of the middle two numbers. Other conventions may apply when such computation is deemed too expensive or otherwise inappropriate, e.g. a requirement that the median must be one of the input items.

1 Like

Hi Robert,

If length is odd, then median is:

atomicAdd(&median, thread_keys[0]);

I think (not sure) the median should be atomicAdd/2. It seems the median type should be float.

The attached code contains the median calculated by CPU (in main.cpp) and also your kernel code (in signal.cu)

Median.zip (6.7 KB)

Thank you very much,
Zvika

In this code, the median for even number of elements uses the usual convention of computing the numerical average of the two middle elements:

    // If n is odd 
    if (n % 2 == 1) {
        MedianUtil(arr, 0, n - 1,
            n / 2, a, b);
        ans = (float)b;
    }

    // If n is even 
    else {
        MedianUtil(arr, 0, n - 1,
            n / 2, a, b);
        ans = (float)(a + b) / 2;
    }
1 Like

Paper codes are available here GitHub - vtsynergy/bb_segsort and here https://gitlab.rlp.net/pararch/faster-segmented-sort-on-gpus

2 Likes

I crafted a benchmark for median computation of 40 elements per segment, 1 million segments, using 1 warp per segment with 2 elements per thread. It compares cub::blockradixsort, cub::warpmergesort, and a sorting network for this thread configuration found in the linked repository (https://gitlab.rlp.net/pararch/faster-segmented-sort-on-gpus/-/blob/main/gtx_1080_ti_keys_only/my_regsort_kernels.h?ref_type=heads#L4582)

Using cuda 12.2 on an A100, I observe the following timings:

medianKernel_sortingnetwork: 0.000992722s
medianKernel_cubradixsort:   0.00561694s
medianKernel_cubmergesort:   0.00165363s

I did not check other thread configurations for 64 elements, so I cannot tell whether 32*2 is optimal for any of the three algorithms.

//nvcc -std=c++17 -O3 -arch=sm_80 --expt-relaxed-constexpr main.cu -o main
#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <vector>
#include <numeric>
#include <iostream>
#include <algorithm>
#include <cassert>
#include <chrono>
#include <random>

#define CMP_SWP(t1,_a,_b) if(_a>_b)  {t1 _t=_a;_a=_b;_b=_t;}
#define EQL_SWP(t1,_a,_b) if(_a!=_b) {t1 _t=_a;_a=_b;_b=_t;}
#define     SWP(t1,_a,_b)            {t1 _t=_a;_a=_b;_b=_t;}

template<class K>
__device__ inline void exch_intxn(K &k0, K &k1, int mask, const int bit) {
    K ex_k0, ex_k1;
    ex_k0 = k0;
    ex_k1 = __shfl_xor_sync(0xffffffff,k1, mask);
    CMP_SWP(K, ex_k0, ex_k1);
    if(bit) EQL_SWP(K, ex_k0, ex_k1);
    k0 = ex_k0;
    k1 = __shfl_xor_sync(0xffffffff,ex_k1, mask);
}

template<class K>
__device__ inline void exch_paral(K &k0, K &k1, int mask, const int bit) {
    K ex_k0, ex_k1;
    if(bit) SWP(K, k0, k1);
    ex_k0 = k0;
    ex_k1 = __shfl_xor_sync(0xffffffff,k1, mask);
    CMP_SWP(K, ex_k0, ex_k1);
    if(bit) EQL_SWP(K, ex_k0, ex_k1);
    k0 = ex_k0;
    k1 = __shfl_xor_sync(0xffffffff,ex_k1, mask);
    if(bit) SWP(K, k0, k1);
}


template<class K>
__device__
void sortnetwork64_32_2(
    K& rg_k0,
	K& rg_k1
){
	const int tid = (threadIdx.x & 31);
	const int bit1 = (tid>>0)&0x1;
	const int bit2 = (tid>>1)&0x1;
	const int bit3 = (tid>>2)&0x1;
	const int bit4 = (tid>>3)&0x1;
	const int bit5 = (tid>>4)&0x1;

    CMP_SWP(K, rg_k0, rg_k1);
    exch_intxn(rg_k0, rg_k1, 0x1, bit1);
    CMP_SWP(K, rg_k0, rg_k1);
    exch_intxn(rg_k0, rg_k1, 0x3, bit2);
    exch_paral(rg_k0, rg_k1, 0x1, bit1);
    CMP_SWP(K, rg_k0, rg_k1);
    exch_intxn(rg_k0, rg_k1, 0x7, bit3);
    exch_paral(rg_k0, rg_k1, 0x2, bit2);
    exch_paral(rg_k0, rg_k1, 0x1, bit1);
    CMP_SWP(K, rg_k0, rg_k1);
    exch_intxn(rg_k0, rg_k1, 0xf, bit4);
    exch_paral(rg_k0, rg_k1, 0x4, bit3);
    exch_paral(rg_k0, rg_k1, 0x2, bit2);
    exch_paral(rg_k0, rg_k1, 0x1, bit1);
    CMP_SWP(K, rg_k0, rg_k1);
    exch_intxn(rg_k0, rg_k1, 0x1f, bit5);
    exch_paral(rg_k0, rg_k1, 0x8, bit4);
    exch_paral(rg_k0, rg_k1, 0x4, bit3);
    exch_paral(rg_k0, rg_k1, 0x2, bit2);
    exch_paral(rg_k0, rg_k1, 0x1, bit1);
    CMP_SWP(K, rg_k0, rg_k1);
}



//uses 1 warp per segment, 2 elements per thread. assumes max segment size of 64
__global__
void medianKernel_sortingnetwork(float* medianOut, const int* data, const int* beginOffsets, const int* endOffsets, int numSegments){
    __builtin_assume(blockDim.x == 32);

    const int warpId = blockIdx.x;
    
    if(warpId < numSegments){
        const int segmentBegin = beginOffsets[warpId];
        const int segmentEnd = endOffsets[warpId];
        const int segmentSize = segmentEnd - segmentBegin;

        int items[2];
        cub::LoadDirectStriped<32>(threadIdx.x, data + segmentBegin, items, segmentSize, std::numeric_limits<int>::max());
        sortnetwork64_32_2(items[0], items[1]);

        float median = 0;
        if(segmentSize % 2 == 1){
            const int medianIndex = segmentSize / 2;
            const int medianThread = medianIndex / 2;
            const int medianIndexInThread = medianIndex % 2;
            const int src = (medianIndexInThread == 0) ? items[0] : items[1];
            median = __shfl_sync(0xFFFFFFFF, src, medianThread);
        }else{
            float median_l = 0;
            float median_r = 0;
            const int medianIndex_r = segmentSize / 2;
            const int medianThread_r = medianIndex_r / 2;
            const int medianIndexInThread_r = medianIndex_r % 2;
            const int src_r = (medianIndexInThread_r == 0) ? items[0] : items[1];
            median_r = __shfl_sync(0xFFFFFFFF, src_r, medianThread_r);
            const int medianIndex_l = (segmentSize-1) / 2;
            const int medianThread_l = medianIndex_l / 2;
            const int medianIndexInThread_l = medianIndex_l % 2;
            const int src_l = (medianIndexInThread_l == 0) ? items[0] : items[1];
            median_l = __shfl_sync(0xFFFFFFFF, src_l, medianThread_l);
            
            median = (median_l + median_r) / 2.0f;
        }

        if(threadIdx.x == 0){
            medianOut[warpId] = median;
        }
    }
}



//uses 1 warp per segment, 2 elements per thread. assumes max segment size of 64
__global__
void medianKernel_cubradixsort(float* medianOut, const int* data, const int* beginOffsets, const int* endOffsets, int numSegments){
    __builtin_assume(blockDim.x == 32);

    using BlockRadixSort = cub::BlockRadixSort<int, 32, 2>;
    __shared__ typename BlockRadixSort::TempStorage tempSort;

    const int warpId = blockIdx.x;

    
    if(warpId < numSegments){
        const int segmentBegin = beginOffsets[warpId];
        const int segmentEnd = endOffsets[warpId];
        const int segmentSize = segmentEnd - segmentBegin;

        int items[2];
        cub::LoadDirectStriped<32>(threadIdx.x, data + segmentBegin, items, segmentSize, std::numeric_limits<int>::max());   
        BlockRadixSort(tempSort).Sort(items);

        float median = 0;
        if(segmentSize % 2 == 1){
            const int medianIndex = segmentSize / 2;
            const int medianThread = medianIndex / 2;
            const int medianIndexInThread = medianIndex % 2;
            const int src = (medianIndexInThread == 0) ? items[0] : items[1];
            median = __shfl_sync(0xFFFFFFFF, src, medianThread);
        }else{
            float median_l = 0;
            float median_r = 0;
            const int medianIndex_r = segmentSize / 2;
            const int medianThread_r = medianIndex_r / 2;
            const int medianIndexInThread_r = medianIndex_r % 2;
            const int src_r = (medianIndexInThread_r == 0) ? items[0] : items[1];
            median_r = __shfl_sync(0xFFFFFFFF, src_r, medianThread_r);
            const int medianIndex_l = (segmentSize-1) / 2;
            const int medianThread_l = medianIndex_l / 2;
            const int medianIndexInThread_l = medianIndex_l % 2;
            const int src_l = (medianIndexInThread_l == 0) ? items[0] : items[1];
            median_l = __shfl_sync(0xFFFFFFFF, src_l, medianThread_l);
            
            median = (median_l + median_r) / 2.0f;
        }

        if(threadIdx.x == 0){
            medianOut[warpId] = median;
        }
    }
}

//uses 1 warp per segment, 2 elements per thread. assumes max segment size of 64
__global__
void medianKernel_cubmergesort(float* medianOut, const int* data, const int* beginOffsets, const int* endOffsets, int numSegments){
    __builtin_assume(blockDim.x == 32);

    using WarpMergeSort = cub::WarpMergeSort<int, 2>;
    __shared__ typename WarpMergeSort::TempStorage tempSort;

    const int warpId = blockIdx.x;

    
    if(warpId < numSegments){
        const int segmentBegin = beginOffsets[warpId];
        const int segmentEnd = endOffsets[warpId];
        const int segmentSize = segmentEnd - segmentBegin;

        int items[2];
        cub::LoadDirectStriped<32>(threadIdx.x, data + segmentBegin, items, segmentSize, std::numeric_limits<int>::max());      
        WarpMergeSort(tempSort).Sort(items, std::less<int>{});

        float median = 0;
        if(segmentSize % 2 == 1){
            const int medianIndex = segmentSize / 2;
            const int medianThread = medianIndex / 2;
            const int medianIndexInThread = medianIndex % 2;
            const int src = (medianIndexInThread == 0) ? items[0] : items[1];
            median = __shfl_sync(0xFFFFFFFF, src, medianThread);
        }else{
            float median_l = 0;
            float median_r = 0;
            const int medianIndex_r = segmentSize / 2;
            const int medianThread_r = medianIndex_r / 2;
            const int medianIndexInThread_r = medianIndex_r % 2;
            const int src_r = (medianIndexInThread_r == 0) ? items[0] : items[1];
            median_r = __shfl_sync(0xFFFFFFFF, src_r, medianThread_r);
            const int medianIndex_l = (segmentSize-1) / 2;
            const int medianThread_l = medianIndex_l / 2;
            const int medianIndexInThread_l = medianIndex_l % 2;
            const int src_l = (medianIndexInThread_l == 0) ? items[0] : items[1];
            median_l = __shfl_sync(0xFFFFFFFF, src_l, medianThread_l);
            
            median = (median_l + median_r) / 2.0f;
        }

        if(threadIdx.x == 0){
            medianOut[warpId] = median;
        }
    }
}


int main(){
    int numSegments = 1000000;
    int elementsPerSegment = 40;
    const int timingIterations = 10;

    std::mt19937 gen(42);
    std::uniform_int_distribution<> distrib(0, 65536);

    std::vector<int> data(numSegments * elementsPerSegment);
    std::generate(data.begin(), data.end(), [&](){return distrib(gen);});
    std::vector<int> offsets(numSegments+1);
    for(int i = 0; i < numSegments+1; i++){
        offsets[i] = elementsPerSegment * i;
    }

    thrust::device_vector<int> d_data = data;
    thrust::device_vector<int> d_offsets = offsets;
    thrust::device_vector<float> d_medianOut(numSegments);
    std::vector<float> medianOutGpu(numSegments);


    std::vector<float> medianOutCpu(numSegments);
    auto timebegin = std::chrono::system_clock::now();
    for(int iter = 0; iter < timingIterations; iter++){
        for(int i = 0; i < numSegments; i++){
            const int segmentBegin = offsets[i];
            const int segmentEnd = offsets[i+1];
            const int segmentSize = segmentEnd - segmentBegin;
            if (segmentSize % 2 == 0) {    
                std::nth_element(
                    data.data() + segmentBegin, 
                    data.data() + segmentBegin + segmentSize / 2, 
                    data.data() + segmentEnd
                );     
                std::nth_element(
                    data.data() + segmentBegin, 
                    data.data() + segmentBegin + (segmentSize - 1) / 2, 
                    data.data() + segmentEnd
                );    
                medianOutCpu[i] = (data[segmentBegin + (segmentSize - 1) / 2] + data[segmentBegin + segmentSize / 2]) / 2.0; 
            }else{ 
                std::nth_element(
                    data.data() + segmentBegin, 
                    data.data() + segmentBegin + segmentSize / 2, 
                    data.data() + segmentEnd
                );     
                medianOutCpu[i] = data[segmentBegin + segmentSize / 2]; 
            } 
        }
    }
    auto timeend = std::chrono::system_clock::now();
    std::chrono::duration<double> delta = timeend - timebegin;
    std::cout << "median CPU: " << delta.count() / timingIterations << "s\n";

    timebegin = std::chrono::system_clock::now();
    for(int iter = 0; iter < timingIterations; iter++){
        medianKernel_sortingnetwork<<<numSegments, 32>>>(
            d_medianOut.data().get(), 
            d_data.data().get(), 
            d_offsets.data().get(), 
            d_offsets.data().get() + 1, 
            numSegments
        );
    }
    cudaDeviceSynchronize();
    timeend = std::chrono::system_clock::now();
    delta = timeend - timebegin;
    std::cout << "medianKernel_sortingnetwork: " << delta.count() / timingIterations << "s\n";
    thrust::copy(d_medianOut.begin(), d_medianOut.end(), medianOutGpu.begin());

    for(int i = 0; i < numSegments; i++){
        if(std::abs(medianOutGpu[i] - medianOutCpu[i]) > 1e-5){
            std::cout << "medianKernel_sortingnetwork error segment " << i << " " << medianOutGpu[i] << " " << medianOutCpu[i] << "\n";
            break;
        }
    }

    thrust::fill(d_medianOut.begin(), d_medianOut.end(), 0);

    timebegin = std::chrono::system_clock::now();
    for(int iter = 0; iter < timingIterations; iter++){
        medianKernel_cubradixsort<<<numSegments, 32>>>(
            d_medianOut.data().get(), 
            d_data.data().get(), 
            d_offsets.data().get(), 
            d_offsets.data().get() + 1, 
            numSegments
        );
    }
    cudaDeviceSynchronize();
    timeend = std::chrono::system_clock::now();
    delta = timeend - timebegin;
    std::cout << "medianKernel_cubradixsort: " << delta.count() / timingIterations << "s\n";
    thrust::copy(d_medianOut.begin(), d_medianOut.end(), medianOutGpu.begin());

    for(int i = 0; i < numSegments; i++){
        if(std::abs(medianOutGpu[i] - medianOutCpu[i]) > 1e-5){
            std::cout << "medianKernel_cubradixsort error segment " << i << " " << medianOutGpu[i] << " " << medianOutCpu[i] << "\n";
            break;
        }
    }


    thrust::fill(d_medianOut.begin(), d_medianOut.end(), 0);

    timebegin = std::chrono::system_clock::now();
    for(int iter = 0; iter < timingIterations; iter++){
        medianKernel_cubmergesort<<<numSegments, 32>>>(
            d_medianOut.data().get(), 
            d_data.data().get(), 
            d_offsets.data().get(), 
            d_offsets.data().get() + 1, 
            numSegments
        );
    }
    cudaDeviceSynchronize();
    timeend = std::chrono::system_clock::now();
    delta = timeend - timebegin;
    std::cout << "medianKernel_cubmergesort: " << delta.count() / timingIterations << "s\n";
    thrust::copy(d_medianOut.begin(), d_medianOut.end(), medianOutGpu.begin());

    for(int i = 0; i < numSegments; i++){
        if(std::abs(medianOutGpu[i] - medianOutCpu[i]) > 1e-5){
            std::cout << "medianKernel_cubmergesort error segment " << i << " " << medianOutGpu[i] << " " << medianOutCpu[i] << "\n";
            break;
        }
    }

    

    



}
3 Likes

Hello,

I used Robert’s code:

#define NS 64
#define NV 1000
template <typename T>
__global__ void mat_median_kernel(const T * __restrict__ d, T * __restrict__ r) 
{
	// Specialize BlockRadixSort for a 1D block of NS threads owning 1 item each
    	typedef cub::BlockRadixSort<T, NS, 1> BlockRadixSort;
        ...
        __syncthreads();
        printf ("Completed\n");
}

mat_median_kernel <<<NV, NS>>> (pSrc, pMedian);

The goal is to calculate median of 1000 (NV) vectors, each contains 64 (NS) elements.
How should I specify this kernel where is the start of each vector ?
It seems that with my code, the “Completed” is printed 64 times. I expected 1000 times.

Thank you,
Zvika