I recently ran a kernel through the profiler, but to my dismay, some of the runs did not show any data for grid size, block size, occupancy, or memory access usage, even though they did have timestamps and timing data. All of my earlier runs used a data size of 512 floats; when I ran the algorithm with 1024 floats in my array, the profiler no longer showed fields such as grid size.
My question is: is a problem like this caused by the way I transfer memory to the GPU, or is there some setting I'm missing in the profiler? Unfortunately, posting the code I have is not an option, so please don't ask.
I am fairly confident that I am using cudaMalloc correctly; when I run the kernel on larger data sets, the results are correct. The only problem is that the profiler is not showing me some of the data I want to see.
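
Since the real code can't be posted, a generic stand-in for the pattern described above (allocate with cudaMalloc, copy 1024 floats to the GPU, launch, copy back) might look roughly like the sketch below. The kernel `scale`, the helper `run_scale`, and the block size of 256 are made up for illustration and are not taken from the actual code; the sketch only shows where return codes can be checked and is not a known fix for the missing profiler fields.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: doubles every element of the array.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Prints the CUDA error string and returns true if `err` is a failure.
static bool failed(cudaError_t err, const char *what) {
    if (err == cudaSuccess) return false;
    fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    return true;
}

// Runs the stand-in kernel on n floats.
// Returns 0 on success, 1 on a CUDA error or a wrong result.
int run_scale(int n) {
    size_t bytes = (size_t)n * sizeof(float);
    float *h = (float *)malloc(bytes);
    if (!h) return 1;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d = nullptr;
    int status = 1;
    if (failed(cudaMalloc((void **)&d, bytes), "cudaMalloc")) {
        free(h);
        return 1;
    }

    do {
        if (failed(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice), "copy to device")) break;

        int block = 256;                      // assumed block size
        int grid = (n + block - 1) / block;   // enough blocks to cover n elements
        scale<<<grid, block>>>(d, n);

        // Launch errors are reported by cudaGetLastError; errors raised while
        // the kernel runs surface at the next synchronizing call.
        if (failed(cudaGetLastError(), "kernel launch")) break;
        if (failed(cudaDeviceSynchronize(), "kernel execution")) break;

        if (failed(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost), "copy to host")) break;

        status = 0;
        for (int i = 0; i < n; ++i) {
            if (h[i] != 2.0f * (float)i) { status = 1; break; }
        }
    } while (0);

    cudaFree(d);
    free(h);
    return status;
}

int main() {
    int status = run_scale(1024);   // the data size from the failing profiler runs
    printf("%s\n", status == 0 ? "ok" : "failed");
    return status;
}
```

Checking every return code this way at least rules out a silent allocation, copy, or launch failure at the larger size.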