Hello everyone, I am relatively new to CUDA (I have only been working with it for about a year), so please bear with me.

I have a kernel that nvprof tells me is taking 50 seconds to run. I mainly wanted to ask 2 questions:

- Are there any glaring inefficiencies in my kernel code that you can see (keeping in mind my description of the goal and the reasoning behind what I wrote)?
- Can someone point me to some nvprof command options that I can use to gather more insightful information? I did take a look at --query-events to see all the events, but there are so many options, events, and metrics that I can't tell which ones are relevant to my situation. I haven't used nvprof much, so I have not yet learned what kind of information to look for when dealing with a kernel that takes this long. Any help with this would be much appreciated.

Here is the kernel in question; it's half-finished right now. Its purpose is to perform statistical bootstrap sampling on an original sample (a sample with 11,897,026 entries). Bootstrap sampling means re-sampling the original multiple times, with each re-sample the same size as the original. The original sample is a Random Sample WITH Replacement, while all the Bootstrap re-samples are Random Samples WITHOUT Replacement. Currently each bootstrap re-sample corresponds to 1 block of 1024 threads, and those 1024 threads split the 11,897,026 random draws between them in the for loop (where MAX_ENTRIES = 11,897,026). I'm tasked with performing a bootstrap with 2048 re-samples, so the kernel is launched with <<<2048, 1024>>>( … )

```
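__global__ void bootstrap(int *output_mean, int *d_sample, curandState *state)
{
    unsigned int idx = threadIdx.x + (blockIdx.x*blockDim.x); // global thread index, one curandState each
    unsigned int tNum = threadIdx.x;                          // thread index within the block
    unsigned int bSize = blockDim.x;                          // threads per block (1024)
    unsigned long int ts = 0;                                 // randomly drawn index into d_sample
    long long int tSum = 0;                                   // running sum of this thread's draws
    /*__shared__*/ int partial_Sums[1024];                    // per-thread for now; intended to become shared
    int count = 0;                                            // number of draws made by this thread

    // Block-stride loop: the block's threads share the MAX_ENTRIES draws between them.
    for(unsigned int i=tNum; i<MAX_ENTRIES; i+=bSize){
        ts = getnextrandscaled(&state[idx], MAX_ENTRIES);
        tSum += d_sample[ts];
        count++;
    }
    partial_Sums[tNum] = tSum/count;                          // mean of this thread's draws
    //__syncthreads();
    /* REMAINING KERNEL CODE WITHHELD FOR THE MOMENT */
}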
```
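To make the launch layout described above concrete, here is a minimal, scaled-down sketch of the host side: one cuRAND state per thread, one block per re-sample, and a block-stride loop over the draws. It is not the actual host code. `next_index_sketch` is a hypothetical stand-in for `getnextrandscaled` (whose body is not shown in this post), and `init_states`, `draw_sketch`, `block_sums_sketch`, and the seed `1234` are names and values invented purely for illustration. Error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Hypothetical stand-in for getnextrandscaled(): assumed to return an index in [0, n).
// curand() yields a 32-bit unsigned value; the modulo introduces a slight bias.
__device__ unsigned int next_index_sketch(curandState *s, unsigned int n)
{
    return curand(s) % n;
}

// One state per thread: same seed, distinct subsequence per global thread index.
__global__ void init_states(curandState *state, unsigned long long seed)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, idx, 0, &state[idx]);
}

// Same loop shape as the kernel in the post: each block makes n draws in total.
__global__ void draw_sketch(long long *threadSums, const int *sample,
                            curandState *state, unsigned int n)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    long long tSum = 0;
    for (unsigned int i = threadIdx.x; i < n; i += blockDim.x)
        tSum += sample[next_index_sketch(&state[idx], n)];
    threadSums[idx] = tSum;
}

// Launches <<<blocks, threads>>> over a sample of n ones and returns the
// per-block sum of drawn values (which then equals the draws per block).
std::vector<long long> block_sums_sketch(int blocks, int threads, unsigned int n)
{
    assert(blocks > 0 && threads > 0 && n > 0);
    size_t total = (size_t)blocks * threads;
    std::vector<int> h_sample(n, 1);

    int *d_sample; long long *d_sums; curandState *d_state;
    cudaMalloc(&d_sample, n * sizeof(int));
    cudaMalloc(&d_sums, total * sizeof(long long));
    cudaMalloc(&d_state, total * sizeof(curandState));
    cudaMemcpy(d_sample, h_sample.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    init_states<<<blocks, threads>>>(d_state, 1234ULL);
    draw_sketch<<<blocks, threads>>>(d_sums, d_sample, d_state, n);

    std::vector<long long> h_sums(total);
    cudaMemcpy(h_sums.data(), d_sums, total * sizeof(long long), cudaMemcpyDeviceToHost);
    cudaFree(d_sample); cudaFree(d_sums); cudaFree(d_state);

    std::vector<long long> perBlock(blocks, 0);
    for (size_t t = 0; t < total; ++t) perBlock[t / threads] += h_sums[t];
    return perBlock;
}
```

Calling `block_sums_sketch(2048, 1024, 11897026)` would mirror the sizes in this post, which means allocating one `curandState` for each of 2,097,152 threads.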

I have plans to make use of shared memory, but that isn't finished yet. Likewise, the rest of the kernel code is not finished either.
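For readers wondering what a shared-memory step of this kind usually looks like, here is a minimal sketch of a standard tree reduction within one block. It is not the withheld kernel code. It assumes a power-of-two block size (`BLOCK_THREADS`, set to 256 here, whereas the post uses 1024), and it sums raw per-thread totals rather than the per-thread means `tSum/count` that the kernel above stores. `block_sum_sketch` and `reduce_blocks_sketch` are hypothetical names.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_THREADS 256  // assumption: power of two (the post uses 1024)

// Each thread contributes one value; the block combines them in shared memory.
__global__ void block_sum_sketch(long long *blockTotals, const long long *threadSums)
{
    __shared__ long long partial[BLOCK_THREADS];
    unsigned int t = threadIdx.x;
    partial[t] = threadSums[blockIdx.x * blockDim.x + t];
    __syncthreads();  // all partial sums must be written before any are read

    // Tree reduction: halve the number of active threads each step.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();
    }
    if (t == 0) blockTotals[blockIdx.x] = partial[0];
}

// Host wrapper: reduces blocks * BLOCK_THREADS values to one total per block.
std::vector<long long> reduce_blocks_sketch(const std::vector<long long> &threadSums, int blocks)
{
    assert(blocks > 0 && threadSums.size() == (size_t)blocks * BLOCK_THREADS);
    long long *d_in, *d_out;
    cudaMalloc(&d_in, threadSums.size() * sizeof(long long));
    cudaMalloc(&d_out, blocks * sizeof(long long));
    cudaMemcpy(d_in, threadSums.data(), threadSums.size() * sizeof(long long),
               cudaMemcpyHostToDevice);

    block_sum_sketch<<<blocks, BLOCK_THREADS>>>(d_out, d_in);

    std::vector<long long> totals(blocks);
    cudaMemcpy(totals.data(), d_out, blocks * sizeof(long long), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    return totals;
}
```

Dividing each block total by the number of draws on the host (or in thread 0) would then give one mean per re-sample.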

I hope I have formulated my question properly. Thank you.