Function is much slower on GPU than on CPU

Hello,

I'm trying to speed up some functions by moving them to the GPU, but the result is much slower on the GPU than on the CPU.
The measured time on the CPU is 103.796 msecs; on the GPU it is 8547.94 msecs.

I'm calling the functions like this:
SetBitAllGpu<<<50, 128, 0, streamSetBitGpu>>>(gpuBait, gpuReducedMetaR, gpuReducedMetaRPerR, container->rCols, gpuMetaRSizes, container->bitarrayLength);
cudaDeviceSynchronize(); // wait until it's finished
SetBarFirstValueGpu<<<50, 128, 0, streamSetBarFirstValueGpu>>>(gpuBar, gpuBait, container->bitarrayLength);
cudaDeviceSynchronize(); // wait until it's finished
AndOpAllGpu<<<50, 128, 0, streamAndOpAllGpu>>>(gpuBar, gpuBait, container->ramboCols, container->bitarrayLength);
cudaDeviceSynchronize(); // wait until it's finished

The methods are:

__global__ void SetBitAllGpu(char *bait, int *rMRs, uint* noOfrMRpr, uint R, uint* mRS, uint bal)
{
    uint mrsc = 0;
    for (uint r = 0; r < R; r++)
    {
        for (uint n = 0; n < noOfrMRpr[r]; n++)
        {
            for (uint s = 0; s < mRS[n]; s++)
            {
                bait[r*bal + (rMRs[mrsc + s] / 8)] |= (1 << (rMRs[mrsc + s] % 8));
                __syncthreads();
            }
            __syncthreads();
            mrsc += mRS[n];
            __syncthreads();
        }
        __syncthreads();
    }
    __syncthreads();
}

__global__ void SetBarFirstValueGpu(char *bar, char *bait, int bitarrayLength)
{

    for (int len = 0; len < bitarrayLength; len++)
    {
        bar[len] = bait[len]; //+ 1];
        __syncthreads();
    }
    __syncthreads();
}

__global__ void AndOpAllGpu(char *bar, char *bait, int ramboCols, int bitarrayLength)
{

    for (int baitNumber = 1; baitNumber < ramboCols; baitNumber++)
    {
        for (int len = 0; len < bitarrayLength; len++)
        {

            bar[len] &= bait[bitarrayLength * baitNumber + len]; //+ 1];
            __syncthreads();
        }
        __syncthreads();
    }
    __syncthreads();
}

I would suggest taking a look at some CUDA tutorials. Your kernels are effectively single-threaded.

This is my first CUDA attempt; can you give me more information?
Why are they single-threaded? I thought the <<<50, 128, ...>>> call sets the number of threads.

Can you point me to the right tutorial, please?

I recommend this one.

Yes, <<<50, 128>>> sets the total number of threads that will be launched: 50 blocks of 128 threads each, i.e. 6400 threads.

How does each thread know what to do?

As you have written it, every one of those 50 x 128 threads is doing precisely the same thing. And I do mean precisely. Is that what you want? Probably not. That is the reason for the comment above that your kernels are effectively single-threaded.

In CUDA we usually want each thread to do something similar but not precisely the same thing. We do this by differentiating thread behavior in kernel code, so that, for example, each thread participating in a vector add will work on a different vector element. It would be senseless to have all threads working on exactly the same vector element.

In CUDA, the most typical way to do this starts with a definition like:

size_t idx = blockDim.x*blockIdx.x+threadIdx.x;

This creates a globally unique thread index (every thread gets a different value for idx), allowing us to accomplish similar work across threads in a sensible fashion.
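
For illustration, here is a minimal sketch of the vector-add pattern mentioned above, using hypothetical arrays a, b, c of n float elements (none of these names come from your code):

__global__ void vecAdd(const float *a, const float *b, float *c, size_t n)
{
    // globally unique index: each thread works on a different element
    size_t idx = blockDim.x*blockIdx.x+threadIdx.x;
    if (idx < n) // guard threads whose index falls past the end of the arrays
        c[idx] = a[idx] + b[idx];
}

// launched with enough 128-thread blocks to cover all n elements:
// vecAdd<<<(n + 127) / 128, 128>>>(a, b, c, n);

Note there is no loop over the whole array and no __syncthreads(): each thread touches exactly one element, and the threads do not need to wait for each other.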

The absence of any of that in your kernels means you are using CUDA in a way that no one would consider sensible (and it won't be performant).
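
Applied to your SetBarFirstValueGpu, for example, one common way to parallelize it is a grid-stride loop, where each thread copies a distinct subset of the elements instead of every thread copying all of them (a sketch only, not tested against your data):

__global__ void SetBarFirstValueGpu(char *bar, char *bait, int bitarrayLength)
{
    int idx = blockDim.x*blockIdx.x+threadIdx.x;
    // each thread starts at its own index and strides by the total thread
    // count, so this works for any bitarrayLength and launch configuration
    for (int len = idx; len < bitarrayLength; len += gridDim.x*blockDim.x)
        bar[len] = bait[len];
}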

You can get a more orderly treatment of this material via the tutorial I linked.

I think I understand my mistake.
I will check the tutorials.
Thank you.