Requesting advice on kernel efficiency

Hello everyone, I am relatively new to CUDA (I've only been working with it for about a year), so please bear with me.

I have a kernel that nvprof tells me is taking 50 seconds to run. I mainly wanted to ask 2 questions:

  1. Are there any glaring inefficiencies in my kernel code that you can see (keeping in mind my description of the goal and the reasoning behind what I wrote)?
  2. Can someone point me to some nvprof command options I can use to gather more insightful information? I did take a look at --query-events to see all the events, but there are so many options/events/metrics that I can't determine what is relevant to my situation. I haven't used nvprof much, so I have not yet learned what kind of information to look for when dealing with a long-running kernel like this one. Any help with this would be much appreciated.

Here is the kernel in question; it's half-finished right now. Its purpose is to perform statistical bootstrap sampling on an original sample (a sample with 11,897,026 entries). Bootstrap sampling means taking multiple re-samplings of the original, each the same size as the original. The original sample is a Random Sample WITH Replacement, while all the Bootstrap re-samples are Random Samples WITHOUT Replacement. Currently, each bootstrap sample corresponds to one block of 1024 threads, where the 1024 threads handle the random sampling of the 11,897,026 numbers within the for loop (where MAX_ENTRIES = 11,897,026). I'm tasked with performing a bootstrap with 2048 bootstrap re-samples, so this kernel is launched with <<<2048, 1024>>>( … )

__global__ void bootstrap(int *output_mean, int *d_sample, curandState *state){
    unsigned int idx = threadIdx.x + (blockIdx.x*blockDim.x);
    unsigned int tNum = threadIdx.x;
    unsigned int bSize = blockDim.x;
    unsigned long int ts = 0;
    long long int tSum = 0;
    /*__shared__*/ int partial_Sums[1024];
    int count = 0;

    for(unsigned int i=tNum; i<MAX_ENTRIES; i+=bSize){
        ts = getnextrandscaled(&state[idx], MAX_ENTRIES);
        tSum += d_sample[ts];
        count++;    // count the draws so the per-thread mean below is defined
    }

    partial_Sums[tNum] = tSum/count;
}

I have plans to make use of shared memory, but that part isn't finished yet. Likewise, my plans for the rest of the kernel code are not finished either.
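For reference, what each block is meant to compute boils down to this serial CPU sketch (one re-sample mean; the kernel spreads the loop across the block's 1024 threads). This is just an illustration of the intent, not the kernel itself:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Serial sketch of one bootstrap re-sample: draw n indices uniformly
// (with replacement) from [0, n) and average the selected entries.
// The kernel spreads this loop across the 1024 threads of one block.
double bootstrap_mean(const std::vector<int>& sample, std::mt19937_64& rng)
{
    std::uniform_int_distribution<std::size_t> pick(0, sample.size() - 1);
    long long sum = 0;
    for (std::size_t i = 0; i < sample.size(); ++i)
        sum += sample[pick(rng)];   // random/scattered read, like d_sample[ts]
    return static_cast<double>(sum) / sample.size();
}
```

Each of the 2048 blocks would produce one such mean.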

I hope I properly formulated my question. Thank you!

(1) Make sure you are doing all performance work using release (optimized) builds
(2) The CUDA profiler can tell you what the principal bottlenecks are for your code. Address those.
(3) Read the Best Practices Guide and apply lessons learned

That’s not nearly enough total threads, and way too few blocks to fill up the average GPU. Rule of thumb: At least two blocks per SM, on the order of 10,000 total threads.

Experiment with the PRNG you are using. Philox is often a good choice as it has small startup cost and is among the fastest PRNGs offered by CURAND.

I’m sorry, but I must have been vague in my description. I did not mean to say that I only have 1 block. I am running 2048 blocks, and each one has 1024 threads.

Does this change your recommendation at all?

Thanks for the response!

All of the efficiency concerns for what you have shown would revolve around this:

tSum += d_sample[ts];

You are doing a random/scattered load. That is the worst case scenario for efficient use of global memory. Since you’ve shown very little code, I wouldn’t be able to make any definite suggestions for improvement.

It's clear that you want to take a random selection of a large data set (d_sample). If you were only doing this once, my suggestion would be to reorder the data rather than doing a random selection from it. But since you are doing it many times, I'm not sure that makes sense. And I don't know enough about the rest of what you are doing to make any more concrete suggestions.

Apart from that, this sequence doesn’t make sense to me:

unsigned int tNum = threadIdx.x;
/*__shared__*/ int partial_Sums[1024];
partial_Sums[tNum] = tSum/count;

Each thread will only have a single value for tNum. Therefore each thread will have a local array, partial_Sums, with 1024 entries, only one of which (per thread) will be populated.

But it's not a problem per se, and should not be any sort of performance issue.

If my understanding is correct, the code you have shown here is the WITH REPLACEMENT code.

Sorry, I must have misread your description. What does the profiler say the bottlenecks are in your code? From a cursory look your code could be memory bound but one shouldn’t guess.

Forgive me for the confusion, and let me elaborate:

That whole kernel is performing the bootstrap re-samples, that is to say the part WITHOUT replacement. The sample that is WITH replacement is already taken care of and it is passed to the kernel; that would be the d_sample variable.

And the bit with the partial sums is unfinished, but it's supposed to go something like this:
Each thread will put its sum into shared memory (i.e. __shared__ int partial_Sums[1024]).
After all the threads are synchronized, I will perform something like an atomic add across all these per-thread sums.

In the end I will have all these sums condensed into one number: the mean across all the threads. Each block will do this, so I will have 2048 numbers (means) that I will then send back to the host.
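In serial form, the per-block combining step I have in mind would amount to something like this halving reduction (a CPU sketch; whether I end up using atomics or this pattern on the GPU is still undecided):

```cpp
#include <cstddef>

// Serial sketch of the per-block reduction: fold the upper half of
// partial_Sums onto the lower half, halving the active size each pass.
// On the GPU each pass would be one parallel step (threads t < stride)
// followed by __syncthreads(); n must be a power of two (here, 1024).
long long reduce_partial_sums(long long partial_Sums[], std::size_t n)
{
    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
        for (std::size_t t = 0; t < stride; ++t)
            partial_Sums[t] += partial_Sums[t + stride];
    return partial_Sums[0];
}
```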

Does that clear things up? Thank you for the Reply!

I'm sorry, but I'm not sure how to figure that out. As I said, I ran nvprof without any command line options, and it merely said that the kernel takes 50 seconds to run, and that the kernel makes up 99% of the runtime of the program. I don't know any more than that, because I am not sure which options to use with nvprof to find that additional information.

Plus, before you suggest it: I cannot use something like Nsight Compute, or any visual GUI-based profiler. This is because I'm working on the project by remotely connecting to a server via ssh. This server belongs to the university that I go to, and I do not have full permissions on it. I've tried, but it seems that nvprof is all I am able to use.

The most important thing I said in my response was the first thing. To whatever extent you give it any thought, you might want to spend your time there and not on anything else in my response. (None of your commentary seems to be responsive to that item, anyway.)

At the risk of further clouding things, though, I'll delve into one of the lesser topics, because it's of interest to me.

For this code:

ts = getnextrandscaled(&state[idx], MAX_ENTRIES);
tSum += d_sample[ts];

my guess is that there is nothing about your getnextrandscaled function (it would be nice if you showed that code) that prevents the same value of ts being returned multiple times. If that guess is incorrect, I'd certainly like to know how you constructed it. I know of a couple of ways to do it, but it's not trivial random number generation.

If my guess is correct, then it's not clear to me how such a sampling qualifies as "WITHOUT REPLACEMENT".
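For reference, one of the ways I had in mind is Floyd's algorithm, which draws k distinct indices from [0, n) using exactly k random numbers. A serial C++ sketch (making it efficient in parallel is the non-trivial part):

```cpp
#include <cstddef>
#include <random>
#include <set>

// Floyd's algorithm: draw k distinct indices uniformly from [0, n)
// using exactly k random numbers. The set membership test is what
// enforces "without replacement".
std::set<std::size_t> sample_without_replacement(std::size_t n, std::size_t k,
                                                 std::mt19937_64& rng)
{
    std::set<std::size_t> chosen;
    for (std::size_t j = n - k; j < n; ++j) {
        std::size_t t = std::uniform_int_distribution<std::size_t>(0, j)(rng);
        if (!chosen.insert(t).second)  // t was already taken: take j instead
            chosen.insert(j);
    }
    return chosen;
}
```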

Both nsight compute and nsight systems (the modern profilers) have CLI (command-line-interface) modes; see their respective documentation.

However, it's evident that you don't have a mental model of how to profile code. You start by establishing a Pareto list of performance limiters. This is non-trivial and goes by the moniker "analysis-driven optimization".

There are presentations that discuss how to do it with a GPU focus. Googling “gtc cuda optimization” will show you many good treatments.

Also, FWIW, the nsight compute blog chooses global load efficiency as its focus example, and this is precisely the biggest problem with the code you have shown.

I saw what you wrote first. I think I now slightly understand why the part with d_sample[ts] is so inefficient, but I have no idea about any alternative. I chose to do it that way because it was the first thing that came to mind, so I guess I'm having a hard time grasping what to do at all.

Let me post more code, and hopefully that can paint a fuller picture.

__device__ float getnextrand(curandState *state){
    return curand_uniform(state);
}

__device__ unsigned long int getnextrandscaled(curandState *state, unsigned long int scale){
    return (unsigned long int)(scale * getnextrand(state));
}
I'm doing this to randomize the index of the next element I intend to pull out of d_sample. This is where "WITHOUT REPLACEMENT" comes in.

And looking back on it now, I realize that I accidentally got the two switched: d_sample is a sample performed WITHOUT replacement before the bootstrap kernel executes, and the bootstrap re-samples are random samples WITH replacement. Please forgive me for that confusion.

Anyway, this is where WITH/WITHOUT replacement comes in. When I call getnextrandscaled, it returns a random index into d_sample. WITH replacement implies that I do not keep track of which random numbers I've already gotten, so it is allowable for the same value to occur multiple times. So your guess about the ts variable is correct: it can hold the same value multiple times, and that is the point.

And here is the kernel I use for initializing the curandStates:

__global__ void initCurand(curandState *state, unsigned long seed){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(idx+seed, 0, 0, &state[idx]);
}
I'm giving each thread a unique seed here, because it is much faster than what I was doing previously. Before, I had "1234" as the seed for each thread, and idx was used for the sequence number. For some reason that I still do not understand, that caused it to take far longer than it should. Now it takes around 20 milliseconds.

The Host code merely consists of allocation, and then launching these two kernels. I’m not sure if you still want me to post the host code regardless, so let me know.

I hope this helps clear some things up. As for the random/scattered load issue: I cannot come up with any way on my own to improve it, so if you intend to make a suggestion, I will provide any requested information that I can.

Thank you, so much!

You’re correct. I’ve never learned to profile code, and I’m pretty lost when it comes to doing it.
I appreciate your advice, and tips, but I wonder if I’ll have the time to do it. I have a deadline coming up, but I’ll check it out if I can.

I am not quite sure what change you made, but it may have fundamentally changed the properties of your random number generation.

Usually, when generating random numbers in parallel, we want the generated sequences to be uncorrelated. Partitioning the PRNG output into uncorrelated sequences can be computationally expensive. How expensive is a function of the generator used. Among the choices offered by CURAND, it is typically cheapest with the Philox generator.

See the CURAND documentation for details.

It might be that the random number generation is a more significant performance factor than the random reads. I wouldn’t be able to say for sure without a complete code to test. Since you have not provided one, I’ll leave that experiment up to you to investigate if you wish.

And since I have no idea what the rest of your code does, all this discussion could be in the noise.

Yes, I don't doubt it, but I couldn't be sure. Before I made the change I asked many people about it, but no one could come up with an explanation for why my initCurand kernel was taking 115 seconds to complete. As for the change I made: it was suggested that I try giving every thread a unique seed and setting the sequence number to zero for all of them.

So at first I was using curand_init(1234, idx, 0, &state[idx]);, which took 115 seconds to run.
Then I changed it to curand_init(idx+seed, 0, 0, &state[idx]);, which only took around 200 milliseconds.
That is the only change to the kernel; the only other line in it is the one storing the thread ID into idx. I hope this clears up which changes I was referring to.

Based on what I've read so far, I suspect that this approach with curand_init(idx+seed, 0, 0, &state[idx]); is not guaranteed to produce uncorrelated results across threads; would you agree? If so, how much of an issue do you think this can cause?

In any event, thank you for the tip about Philox being the cheapest generator for this. My code defaults to XORWOW, and I didn't think to suspect any potential downsides to using it. I'll definitely check out the documentation on Philox.

Forgive me for not posting enough code. I was wondering only if anything could be suggested based on the code I did post, but I did wish to personally investigate this as you mention. I feel like I should learn this sooner or later, and it was one of the main reasons I made this post.

I forgot to mention this earlier, but the reason I stated that I cannot use it is that I had already tried. After many attempts, and after asking people about it, I concluded that my limited permissions on the server are the cause. Specifically, it wasn't working because my project files couldn't be found through Nsight; in fact, even my user account's home directory and all of its sub-directories on that server could not be found in Nsight. So I tried just connecting as root without specifying the file I wished to work on, but ran into many problems with that too. That is why I concluded that I couldn't use Nsight.

Thank you for your suggestions and help here. As I expressed earlier in this reply, I wish to learn how to do what you're referring to, so I will be looking into it. While I read up on the documentation and presentations, are there any hints about analysis-driven optimization that you can share off the top of your head?

As an aside, I was thinking about just copying my work from the remote server over to my own system and trying to profile it there. Although I think that might be folly, as the remote server runs CentOS 7 while I run Windows 10. The main thing, though, is learning how to profile through the command line on the remote server. So again, thank you for your suggestions/tips/hints; if you have any quick tips and tricks for profiling on the command line, I would love to hear them. I'll be looking into the documentation and presentations on that as well.

Sorry if this was long-winded. Thank you very much for your responses!

I am not an expert on CURAND or parallel PRNGs.

What I do know is that using a different seed in every thread does not, in general, provide independent, uncorrelated sequences of random numbers for each thread. PRNGs produce a fixed sequence of a certain length based on an algorithm. If, for example, two seeds cause the PRNG to start at positions N and N+1, respectively, in that sequence, the two resulting sequences of random numbers are highly correlated.
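As a toy illustration (using a bare LCG, not anything from CURAND, but the principle is the same): if "thread B" happens to be seeded with the state "thread A" reaches after one step, B's supposedly independent stream is just A's stream shifted by one:

```cpp
#include <cstdint>
#include <vector>

// Minimal LCG in which the seed IS the state. If thread B is seeded
// with the state thread A reaches after one step, B's stream is just
// A's stream shifted by one -- maximally correlated, not independent.
std::uint64_t lcg_next(std::uint64_t& s)
{
    s = s * 6364136223846793005ULL + 1442695040888963407ULL;
    return s;
}

// Generate n outputs starting from a given seed/state.
std::vector<std::uint64_t> stream(std::uint64_t seed, int n)
{
    std::vector<std::uint64_t> out;
    for (int i = 0; i < n; ++i) out.push_back(lcg_next(seed));
    return out;
}
```

Real generators scramble the seed into the state, so nearby integer seeds do not necessarily collide like this, but nothing rules out such overlaps either.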

There are two common ways of addressing that issue in parallel PRNGs: (1) blocking (2) leap frogging. In the first technique, the total sequence of the PRNG is split into M contiguous blocks, where each block comprises roughly 1/M of the total sequence produced by the PRNG. With leapfrogging, every thread skips M elements in the PRNG sequence every time it pulls a random number. I think CURAND uses blocking, as this generally is the faster way of providing a random sequence after initial setup. For either case, the number of random numbers that can be extracted per thread is diminished compared to using the full PRNG sequence, so it may be necessary to use a PRNG with a longer period (e.g. Mersenne Twister) if many random numbers per thread are needed.
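The two partitioning schemes can be sketched over an idealized sequence, modelling the PRNG output as simply 0, 1, 2, … (indices only, no real generator):

```cpp
#include <vector>

// Partitioning one PRNG sequence (idealized here as 0, 1, 2, ...) among
// M threads. Blocking gives thread t a contiguous 1/M chunk of the
// sequence; leapfrogging gives it every M-th element.
std::vector<int> blocking(int t, int M, int total)
{
    std::vector<int> out;
    int per = total / M;                      // elements per thread
    for (int i = t * per; i < (t + 1) * per; ++i)
        out.push_back(i);
    return out;
}

std::vector<int> leapfrog(int t, int M, int total)
{
    std::vector<int> out;
    for (int i = t; i < total; i += M)
        out.push_back(i);
    return out;
}
```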

It is possible to do the skipping or blocking in less than linear time by precomputing the state advancement “formula” necessary. Often this involves generating a matrix, or series of matrices, that the PRNG state vector is multiplied with to effect state advancement by a certain number of elements. This computation may be dirt cheap (in Philox) or quite expensive (in XORWOW). If each thread generates only a few random numbers, the cost of pre-computation can dominate to the point of exceeding the cost of subsequently generating random numbers by orders of magnitude.

In CURAND, while XORWOW is the fastest generator at sequentially generating random numbers, I am reasonably sure it is also one of the most expensive in terms of the blocking computation. Conversely, Philox is the fastest generator in terms of blocking, while still being very fast at generating random numbers sequentially. So when generating, say, fewer than 10,000 random numbers per thread, I would recommend Philox. That said, I would suggest running your own experiments, as speeds may differ by hardware platform and CUDA version, resulting in different trade-offs.

The consequences of using potentially / partially correlated sequences of random numbers across threads may range from negligible to catastrophic: it very much depends on the use case. Catastrophic means that simulation results are invalid, which may ultimately result in real-world harm.