Function is much slower on GPU than on CPU

Hello,

I'm trying to speed up some functions by moving them to the GPU, but the result is much slower on the GPU than on the CPU.
The measured time on the CPU is 103.796 msecs; on the GPU it is 8547.94 msecs.

I'm calling the functions like this:
SetBitAllGpu<<<50, 128, 0, streamSetBitGpu>>>(gpuBait, gpuReducedMetaR, gpuReducedMetaRPerR, container->rCols, gpuMetaRSizes, container->bitarrayLength);
cudaDeviceSynchronize(); // wait until it's finished
SetBarFirstValueGpu<<<50, 128, 0, streamSetBarFirstValueGpu>>>(gpuBar, gpuBait, container->bitarrayLength);
cudaDeviceSynchronize(); // wait until it's finished
AndOpAllGpu<<<50, 128, 0, streamAndOpAllGpu>>>(gpuBar, gpuBait, container->ramboCols, container->bitarrayLength);
cudaDeviceSynchronize(); // wait until it's finished

The methods are:

__global__ void SetBitAllGpu(char *bait, int *rMRs, uint* noOfrMRpr, uint R, uint* mRS, uint bal)
{
    uint mrsc = 0;
    for (uint r = 0; r < R; r++)
    {
        for (uint n = 0; n < noOfrMRpr[r]; n++)
        {
            for (uint s = 0; s < mRS[n]; s++)
            {
                bait[r*bal + (rMRs[mrsc + s] / 8)] |= (1 << (rMRs[mrsc + s] % 8));
                __syncthreads();
            }
            __syncthreads();
            mrsc += mRS[n];
            __syncthreads();
        }
        __syncthreads();
    }
    __syncthreads();
}

__global__ void SetBarFirstValueGpu(char *bar, char *bait, int bitarrayLength)
{

    for (int len = 0; len < bitarrayLength; len++)
    {
        bar[len] = bait[len]; //+ 1];
        __syncthreads();
    }
    __syncthreads();
}

__global__ void AndOpAllGpu(char *bar, char *bait, int ramboCols, int bitarrayLength)
{

    for (int baitNumber = 1; baitNumber < ramboCols; baitNumber++)
    {
        for (int len = 0; len < bitarrayLength; len++)
        {

            bar[len] &= bait[bitarrayLength * baitNumber + len]; //+ 1];
            __syncthreads();
        }
        __syncthreads();
    }
    __syncthreads();
}

I would suggest taking a look at some CUDA tutorials. Your kernels are effectively single-threaded.

This is my first CUDA attempt; can you give me more information?
Why are they single-threaded? I thought the <<<50, 128, ...>>> call sets the number of threads.

Can you point me to the right tutorial, please?

I recommend this one.

Yes, <<<50, 128>>> sets the total number of threads that will be launched: 50 blocks of 128 threads each, i.e. 6400 threads.

How does each thread know what to do?

As you have written it, every one of those 50 x 128 threads is doing precisely the same thing. And I do mean precisely. Is that what you want? Probably not. That is the reason for the comment above that your kernels are effectively single-threaded.

In CUDA we usually want each thread to do something similar but not precisely the same thing. We do this by differentiating thread behavior in kernel code, so that, for example, each thread participating in a vector add will work on a different vector element. It would be senseless to have all threads working on exactly the same vector element.

In CUDA, the most typical way to do this starts with a definition like:

size_t idx = blockDim.x*blockIdx.x+threadIdx.x;

This creates a globally unique thread index (every thread gets a different value for idx), allowing us to accomplish similar work across threads in a sensible fashion.
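
For illustration, here is a minimal sketch of the vector-add pattern mentioned above, using hypothetical arrays a, b, c of n float elements (none of these names come from your code):

__global__ void vecAdd(const float *a, const float *b, float *c, size_t n)
{
    // globally unique index: each thread works on a different element
    size_t idx = blockDim.x*blockIdx.x+threadIdx.x;
    if (idx < n) // guard threads whose index falls past the end of the arrays
        c[idx] = a[idx] + b[idx];
}

// launched with enough 128-thread blocks to cover all n elements:
// vecAdd<<<(n + 127) / 128, 128>>>(a, b, c, n);

Note there is no loop over the whole array and no __syncthreads(): each thread touches exactly one element, and the threads do not need to wait for each other.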

The absence of any of that in your kernels means you are using CUDA in a way that no one would consider sensible (and it won't be performant).
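
Applied to your SetBarFirstValueGpu, for example, one common way to parallelize it is a grid-stride loop, where each thread copies a distinct subset of the elements instead of every thread copying all of them (a sketch only, not tested against your data):

__global__ void SetBarFirstValueGpu(char *bar, char *bait, int bitarrayLength)
{
    int idx = blockDim.x*blockIdx.x+threadIdx.x;
    // each thread starts at its own index and strides by the total thread
    // count, so this works for any bitarrayLength and launch configuration
    for (int len = idx; len < bitarrayLength; len += gridDim.x*blockDim.x)
        bar[len] = bait[len];
}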

You can get a more orderly treatment of this material via the tutorial I linked.

I think I understand my mistake.
I will check the tutorials.
Thank you.