Launch Parameters for Large Problems

I have a long list of numbers (size N) and I want to process every unordered pair from that list. I don’t know N ahead of time, but I do know that N will be at least 2000, which means at least N(N-1)/2 ≈ 2 million pair operations. This means that launching one thread per operation is not an option.

I have designed my kernel as follows:

__global__ void pairwise(int n, float *l)
{
    int id = threadIdx.x + (blockIdx.x * blockDim.x);
    int total_threads = blockDim.x * gridDim.x;   // stride across the whole grid

    for (int i = 0; i < n; i++)
        for (int j = i + 1 + id; j < n; j += total_threads) {
            // process l[i] and l[j]
        }
}

To launch the kernel I make a call like so:

const int THREADS_PER_BLOCK = 256, BLOCKS = 3;
const int TOTAL_THREADS = THREADS_PER_BLOCK * BLOCKS;

/* ... */

pairwise<<<BLOCKS, THREADS_PER_BLOCK>>>(n, l);

However, no matter what launch configuration I use, the kernel always takes the same amount of time to run; I can use <<<1, 32>>> and get the same results as when I use <<<3, 768>>>.

I have used cudaOccupancyMaxPotentialBlockSize, which indicated that the best grid and block sizes would be 3 and 768, but to no avail.

Also note that the kernel requires no synchronization primitives like __syncthreads(), and I do not use any shared memory. I have also tried on two different GPUs (a GTX 1060 and an MX 150) and the results are the same.

How can I choose the correct launch parameters to make the best use of the GPU?

This means that launching one thread per operation is not an option.

The logic there escapes me.

Perhaps your time measurement method is flawed?

The logic there escapes me.

If it is the case that I can launch a thread for each operation, I still don’t understand how I would find the best execution configuration. How would I go about doing that?

Perhaps your time measurement method is flawed?

I am measuring using steady_clock. The code looks something like this (error checking omitted for clarity):

    // using namespace std::chrono;
    auto begin = steady_clock::now();
    pairwise<<<...>>>(n, l);
    cudaDeviceSynchronize();   // kernel launches are asynchronous; wait for completion
    float time = duration_cast<microseconds>(steady_clock::now() - begin).count() / 1000000.0;
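For comparison, the same measurement can be made with CUDA events, which time only the GPU work; a minimal sketch, reusing the BLOCKS and THREADS_PER_BLOCK constants from above:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
pairwise<<<BLOCKS, THREADS_PER_BLOCK>>>(n, l);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait until the stop event has occurred
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);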

A reasonable starting point for this is one thread per op (i.e. per output point creation/calculation). For example, if your output dataset consists of N points, and you calculate one point per thread (a common thread strategy), then you would launch N threads in total, divided up among a number of blocks.

The fact that you measure no difference in execution time for a launch config of <<<1,32>>> and <<<3,768>>> suggests to me that something else is clouding your vision (e.g. a coding defect, a launch failure of some sort, etc.). I probably wouldn’t be able to say more without seeing a complete code.

In case you are not already doing so, I suggest using proper CUDA error checking (just google that, take the first hit) and running your code with cuda-memcheck before trying to assess performance. I also suggest making sure you are getting correct results before assessing performance.
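In case it is useful, a minimal sketch of such a check (the macro name is illustrative, not from this thread):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call reports an error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage after a kernel launch:
//   pairwise<<<BLOCKS, THREADS_PER_BLOCK>>>(n, l);
//   CUDA_CHECK(cudaGetLastError());       // catches invalid launch configurations
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution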

Ok, so as I understand it, I can use launch parameters like this to launch my kernel:

const int BLOCK_SIZE = 512;

// total number of unordered pairs; 64-bit to avoid overflow for large n
long long total_ops = (long long)n * (n - 1) / 2;
int grid_size = static_cast<int>((total_ops + BLOCK_SIZE - 1) / BLOCK_SIZE);  // integer round-up

pairwise<<<grid_size, BLOCK_SIZE>>>(n, l);

Then I can modify the kernel to perform a range check so that the last, partially filled block does not access out-of-bounds memory (sketched below).
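A sketch of what that could look like; the closed-form decoding of a linear pair index into (i, j) is a common trick rather than something from this thread, and the floating-point sqrt may need care for very large n:

__global__ void pairwise(int n, float *l)
{
    long long k = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    long long total_ops = (long long)n * (n - 1) / 2;
    if (k >= total_ops) return;   // range check for the last block

    // Decode linear index k into the pair (i, j), with i < j.
    int i = n - 2 - (int)floor(sqrt(-8.0 * k + 4.0 * n * (n - 1) - 7) / 2.0 - 0.5);
    int j = (int)(k + i + 1 - (long long)n * (n - 1) / 2
                  + (long long)(n - i) * (n - i - 1) / 2);

    // process l[i] and l[j]
}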

If the above is correct, is there any way to determine the correct BLOCK_SIZE? Is cudaOccupancyMaxPotentialBlockSize correct here?

However, I have another problem. Each operation needs a copy of some data to work with. The data is 80+ kB and is the same for every operation. In other words, each active thread needs its own copy of that data, but once it is finished, another thread can reuse the same copy. So, if my GPU can only run 96 threads in parallel, instead of creating one copy per operation (millions of copies) I could create only 96 copies and reuse them. Is this possible, or will I have to copy the data for every operation?

I am thinking of something like this:

__global__ void pairwise(int n, float *l, large_struct *data)
{
    // warp_size, warp_id and warp_index stand in for some per-thread slot index
    large_struct *my_data = &data[ (warp_size * warp_id) + warp_index ];
    // operations on my_data
}

People either do some testing or use a heuristic such as the one in cudaOccupancyMaxPotentialBlockSize. Many codes don’t show much variation in performance across a range of power-of-2 threadblock sizes such as 128, 256, and 512.
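For reference, the usual call pattern for that API looks something like this (a sketch):

int min_grid_size = 0, block_size = 0;
// Suggests a block size that maximizes occupancy for this kernel
// (0 bytes of dynamic shared memory, no block size limit).
cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, pairwise, 0, 0);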

Based on the code you have shown, you are not making a “copy” of the data, but merely establishing a pointer to a relevant section of it. This is usually a best practice.

That is rarely the right way to think about GPU programming, especially for someone who might not yet be an expert. A basic performance goal should be to fill the GPU with threads. This is roughly 2048 × the number of SMs in your GPU (it can be lower for GPUs of compute capability 7.5 and above). Aim for kernel launches with at least that many threads.
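If it helps, this figure can be queried at runtime rather than hard-coded; a small sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    // Maximum number of threads the device can have resident at once.
    int max_resident = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("Aim for at least %d threads per launch\n", max_resident);
    return 0;
}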

Great, I’ll do some testing to find the best block size.

I think my example might have misled you. Let me make my intent a little clearer.

I could, for every thread, make a copy of the data required like so:

__global__ void process(large_struct *data)
{
    // get a copy of data for this thread
    //   80+ kB of data copied
    large_struct my_data = *data;

    // operations on my_data
}

int main() {
    // construct data on host
    large_struct h_data;                        
    
    // allocate memory for data on device
    large_struct *data;                        
    cudaMalloc(&data, sizeof(large_struct));   

    // copy data from host to device
    cudaMemcpy(data, &h_data, sizeof(large_struct), cudaMemcpyHostToDevice);

    // call kernel
    process<<<...>>>(data);
}

Since there is a large chunk of data being copied for so many threads, the performance implications of this are certainly non-trivial.

My idea is that, instead of copying the data millions of times (once per thread), I could create one copy for each thread that is physically active and give each active thread a pointer to its own slot. This would reduce that overhead by orders of magnitude.

    // before calling kernel...
    cudaMalloc(&data, active_threads * sizeof(large_struct)); 
    // fill data[...] with copies of h_data

    // call kernel

    // inside kernel...
    large_struct *my_data = &data[ active_thread_id ];

Is this possible, or am I stuck with creating a copy for each thread (a copy per operation)? I will also add that for larger problems the size of large_struct could easily be in excess of 250 kB.

I don’t know why you are stuck with creating a copy for each thread, or why you would need to do any copying at all. The single copy in global memory is accessible by all threads, at the same time. And I definitely agree you do not want to be doing this:

large_struct my_data = *data;

Just use the version in global memory (*data) directly, as needed.

To the extent that you are only reading this data, I simply cannot imagine what the concern would be. If you are in fact writing to it from each thread, then of course that has to be sorted out. You haven’t provided enough information about how you will be using the data to carry this any further.

But you seem fixated on the idea of making a copy per thread. I’m not sure why you have spent so much time on that idea, and I’m concerned there is a gap in your knowledge that is giving rise to this.

All threads can read the data from global memory directly, without making any copies.

Of course I agree that if each thread were only reading from the data, it would be ideal for them all to access the same copy; however, each thread needs to modify its own copy of the data as part of the operation. That is why I am concerned with making a copy of the data for each thread.

In light of this, is there some alternative that could solve the performance issue here?

To give some additional context, I will add that to have this operation use multiple CPU threads, I might have each thread do something like:

large_struct my_data = *data;   // copy the data
my_data.set_rollback_point();

for(/* each operation assigned to this thread */) {
    // perform read/write operations on my_data for this operation
    my_data.rollback();             // restore the data for the next operation
}

It is much less expensive for me to rollback my_data and reuse the data structure (almost free in most cases) rather than make a new copy of data for each operation.

It’s still not obvious to me why you need to make a copy. If you have an array of structures (AoS) (in global memory) and each thread will update a single, unique member in that array, then there should be no problem just letting each thread work on its structure in the array, in global memory, without making any copies. (By the way, AoS is a canonically bad storage scheme for a compute arrangement where each thread works on a particular structure.)

If on the other hand, you have a situation where multiple threads are updating the same structure, then that is a different issue, and nothing you’ve described here will sort that out, with or without copies of the structure. Even if you make a local copy of the structure, if each thread is updating their own local copy, you haven’t indicated how you intend to “merge” those back to a single structure result.

This is a non-issue, since once the operation is complete the result is discarded and I only maintain a score (a single integer value). Each operation needs to do some work on the data, calculate a score for the result, and return the score. I am not concerned with merging those structures after the fact (I only want to know the maximum score).

Each operation will have the same starting point, it will mutate that starting point to produce a result which is then scored and discarded. However, it is a trivial operation to roll back the result to the starting point so that it could potentially be reused by a different thread.

I am interested in reusing the structure so that the expensive copy doesn’t have to occur once for each operation and instead I can use the cheap roll back.
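As an aside, since only the maximum score matters, each thread could fold its result into a single global maximum with an atomic; a sketch, assuming integer scores and a best_score pointer initialized to INT_MIN on the host before the launch:

// best_score points to a single int in global memory, initialized to INT_MIN
// on the host (e.g. via cudaMemcpy) before the launch.
__global__ void pairwise(int n, float *l, int *best_score)
{
    int score = 0;
    // ... perform this thread's operation(s) and compute score ...
    atomicMax(best_score, score);   // keep only the global maximum
}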

I’m not really sure what is best here; I would usually like a complete example to study. However, I think what you are describing for your CPU thread implementation could be done in a CUDA model as well.

I would essentially use the methodology here to create a number of reusable structures, one per “active” thread. When a thread retires and releases its structure, a newly spun-up thread can reuse it (first performing the “rollback”). This methodology ensures that only one consumer (i.e. thread) is using a particular area at a particular time, and that you only need enough concurrent reusable structures to satisfy the maximum instantaneous thread count.

With such an arrangement, there would be no need to create any local copies per thread. You would allocate all the structures you need in global memory, and each thread would acquire use of one of them, perform the “rollback” operation, then begin to use the structure in global memory.
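A sketch of that arrangement, assuming large_struct has a __device__-callable rollback() and that one slot is allocated per launched thread (all identifiers here are illustrative):

// Host side: one reusable structure per launched thread.
//   large_struct *pool;
//   cudaMalloc(&pool, (size_t)grid_size * BLOCK_SIZE * sizeof(large_struct));
//   // fill each slot with a copy of h_data before the launch

__global__ void pairwise(int n, float *l, large_struct *pool, long long total_ops)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    large_struct *my_data = &pool[tid];    // exclusively owned by this thread

    // Grid-stride loop: each thread performs many operations, reusing its slot.
    for (long long k = tid; k < total_ops; k += stride) {
        // ... mutate *my_data for operation k, score the result ...
        my_data->rollback();               // cheap restore for the next operation
    }
}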

Thanks, this is exactly what I needed!