Need help with using a card to its max

I need some help with using a card to 100%. I have a large data set that I have to run the same kernel on. I had planned to set up the data structure on the host, then hand the buffer off to the GPU for processing. How do I set the launch parameters so that all cores are doing something at all times? The kernel function is not huge but also not small; will I need to put the kernel launch in a for loop to prevent a TDR? I am not too new to CUDA, but I am also not a pro, as you can tell. The launch parameters need to be calculated at runtime so that the code will be portable. What is the best way to lay out the memory block that I will copy to device memory so that access to it from each kernel thread is streamlined?

thanks for any help,

Try nvvp or nvprof?

I need code help; I can’t profile code that isn’t written yet. I have read two books, but neither goes into detail about how threads, blocks, warps, and grids work together. Also, nothing has explained how to get the device properties and translate the returned info into block size, thread count, etc.


Oh, okay. Well, I’m sure there’s a fair bit of information out there if you google it. There are many presentations/PowerPoint slides covering the info you’re looking for.

And instead of using deviceProps, there are occupancy calculators now, which, again, you should research on your own first. This newer API makes the task of choosing a block size simpler.

Edit : But I think you should learn enough CUDA to write something compilable, and then you should profile it to identify where the bottlenecks actually are. Learning CUDA theoretically is good, but so is learning it in the wild, and you shouldn’t be scared of writing something slow.

This stuff can’t be pre-computed. When you move from one machine to another, it might not have the same capabilities as the one you coded for. So for my problem, it has to be solved in code. I have written several utils that use CUDA, and I am comfy writing CUDA code. I can compile and run CUDA apps. I just asked how to work out the number of blocks, threads, warps, etc. in code.


Did you not see me talk about the occupancy calculator?

Yes, I saw that and I dl’d it. This is a great tool if the hardware does not change, unless I am missing something. What I am asking is how to make code use maximum resources on any CUDA-enabled card it might run on, from a GTX 560 to a Titan to a Tesla. Am I missing something in this tool?

Edit : Oh, I see now. You can try to write code based on different shared memory configurations, and you can detect grid limits and the like at runtime using cudaGetDeviceProperties().

Sorry, it took me a while to get what you were really asking.

Edit edit : Are you sure you can’t use deviceProps to extract all the info you need? It gives you darn near everything.
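For anyone finding this later, here’s a minimal sketch of pulling launch-sizing info out of `cudaDeviceProp` at runtime. The property fields are from the CUDA runtime API; the sizing heuristic at the end is just one crude illustration, not a recommendation:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Warp size: %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);

    // One crude way to size a launch: pick a block size that is a
    // multiple of the warp size, then launch enough blocks to fill
    // every SM with resident threads.
    int blockSize   = 256;  // assumed multiple of prop.warpSize
    int blocksPerSM = prop.maxThreadsPerMultiProcessor / blockSize;
    int gridSize    = prop.multiProcessorCount * blocksPerSM;
    printf("Launch config: %d blocks x %d threads\n", gridSize, blockSize);
    return 0;
}
```

This only accounts for thread counts; real occupancy also depends on registers and shared memory per block, which is what the occupancy API below handles for you.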

I am sure that deviceProps does; I am just not sure how to use it for what I want to accomplish. :) Thanks for being patient during the misunderstanding.


CUDA 6.5 includes a new occupancy API for making occupancy calculations at run-time. May be of interest:
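The entry point for that API is `cudaOccupancyMaxPotentialBlockSize`. A hedged sketch of typical usage, with a placeholder kernel standing in for real work:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder work
}

int main() {
    int minGridSize = 0;   // minimum grid size for maximum occupancy
    int blockSize   = 0;   // suggested block size

    // Ask the runtime for the block size that maximizes occupancy
    // for this specific kernel on whatever device is active.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel,
                                       0 /*dynamic smem*/, 0 /*no block cap*/);

    int n = 1 << 20;
    int gridSize = (n + blockSize - 1) / blockSize;  // round up to cover n
    printf("Suggested block size: %d, min grid size: %d, launching %d blocks\n",
           blockSize, minGridSize, gridSize);

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    myKernel<<<gridSize, blockSize>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Because the query runs at launch time against the active device, the same binary adapts its launch configuration from card to card, which is exactly the portability asked about above.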

That’s what I was trying to find!

Correct me if I am wrong, but this helps determine whether your code is using all the resources of the GPU. I want to calculate how many threads, blocks, and grids I can use so that I use the entire GPU.

The Simple Occupancy sample literally shows you how to launch kernels with an estimated ideal block size.

Here’s my output, as an example :

starting Simple Occupancy

[ Manual configuration with 32 threads per block ]
Potential occupancy: 50%
Elapsed time: 0.210944ms

[ Automatic, occupancy-based configuration ]
Suggested block size: 1024
Minimum grid size for maximum occupancy: 10
Potential occupancy: 100%
Elapsed time: 0.137696ms
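One more note for the original question about keeping every core busy on a huge input without looping over kernel launches: a common pattern is a grid-stride loop, where the grid size can come from the occupancy numbers above and each thread strides over the data. This is a sketch only; the doubling is placeholder work:

```cuda
__global__ void process(float *data, size_t n) {
    // Grid-stride loop: a fixed-size grid walks the whole array, so
    // the launch dimensions can be chosen for occupancy while the
    // data size can be anything.
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        data[i] = data[i] * 2.0f;   // placeholder work
    }
}
```

If a single launch still runs long enough to risk a TDR on a display GPU, the usual fallback is to split the input into chunks and launch the kernel once per chunk.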


more info:

Thanks to all who responded. This article explained what was actually going on, rather than leaving me to assume.