Hi all,
I have a basic mass-spring system implemented in CUDA, and I am now looking to improve its performance. My question is: where do I start optimizing? I know the launch configuration is also critical, but how do I determine the optimal configuration? What tools can I use to help me with all this? If I add shared memory to reduce the read latency, what should the size of the shared memory array be? Is there an optimization guide available? Can anyone help me with the optimization, or help me find the pressure points in my kernel? If so, I am willing to share my basic code.
Instead of just dumping a tarfile with 1359 lines of code on us, I was thinking more of posting just the kernel, together with its execution configuration.
Your kernel is clearly memory bound. What compute capability is your device? On 1.x devices, using either a texture for _vertexData or preloading blocks from it into shared memory would be essential to avoid a lot of unnecessary memory accesses.
Thanks for the fast response and sorry for the zip file dump.
The total simulation grid size is 256x256 and the execution config. is 8 x 8.
I am on compute capability 1.3. If I cache the data in shared memory, what should the size of the shared memory array be? How do I determine this size?
10x10 would naturally allow each thread to read all of its neighbors.
Using a texture you can potentially avoid re-reading the borders from global memory. Part of that gain may also be realized from larger blocks. I’d try something like 32x4 or so, as horizontal borders are more expensive than vertical borders due to coalescing rules/partly wasted bandwidth.
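If it helps, here is a minimal sketch of the texture route on 1.x hardware (a sketch only: I am assuming _vertexData holds float4 positions in linear memory, and texVertexData / fetchVertex are made-up names):

[code]
// Texture reference for the vertex positions (CC 1.x style, at file scope).
texture<float4, 1, cudaReadModeElementType> texVertexData;

// Host side, before the kernel launch:
//   cudaBindTexture(0, texVertexData, d_vertexData, width * height * sizeof(float4));

__device__ float4 fetchVertex(int x, int y, int width)
{
    // Reads go through the texture cache, so a neighbour that an adjacent
    // thread already fetched usually does not cost another global load.
    return tex1Dfetch(texVertexData, y * width + x);
}
[/code]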
Before doing the optimization with shared memory/texture usage, I thought I would time my kernel without these optimizations, and here are the results.
So you can see that the kernel performs best with the 8x8 configuration. Now I have one question about using a texture. In the convolution example that ships with the SDK, the texture version performs worse than the array version; since a texture handles the edge cases well, do you think I will get any speedup? Currently, I map the VBO and pass the mapped pointer directly to the kernel, which is probably the fastest method to push data to the GPU if I am not mistaken. What do you say?
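For reference, the mapping I do looks roughly like this (a sketch only, using the old cudaGL interop API; vbo and d_vertexData are just placeholder names here):

[code]
#include <cuda_gl_interop.h>

// Registered once at startup (pre-CUDA-3.0 style interop):
//   cudaGLRegisterBufferObject(vbo);

void stepSimulation(GLuint vbo)
{
    float4 *d_vertexData = 0;

    // Map the VBO and hand the device pointer straight to the kernel,
    // so no extra host<->device copy of the vertex data is needed.
    cudaGLMapBufferObject((void **)&d_vertexData, vbo);

    // ... launch the mass-spring kernel on d_vertexData ...

    cudaGLUnmapBufferObject(vbo);
}
[/code]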
The larger blocks will only be useful once you cache data in shared memory, so I’m not surprised (apart from 8x8 performing so much faster - apparently there is not enough work to load your whole device with larger blocks).
I would expect the texture version to be slightly faster, but it's hard to predict.
I second that the texture version would probably be faster. Make sure to test a variety of block sizings if you go that route - 16x4 will probably be best on Tesla-class hardware, a different size might work better on Fermi.
The shared memory optimization is key.
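To make that concrete, here is a rough sketch of the usual tiling pattern (a sketch, not your code: BLOCK_X/BLOCK_Y, d_vertexData and the float4 layout are assumptions). Each block stages its own tile plus a one-cell halo in shared memory, so it works for any grid size rather than only grids that fit in shared memory as a whole:

[code]
#define BLOCK_X 16
#define BLOCK_Y 16

__global__ void springKernel(const float4 *d_vertexData, int width, int height)
{
    // Tile plus a one-cell halo on every side.
    __shared__ float4 s_pos[BLOCK_Y + 2][BLOCK_X + 2];

    int gx = blockIdx.x * BLOCK_X + threadIdx.x;
    int gy = blockIdx.y * BLOCK_Y + threadIdx.y;

    // Each thread loads its own vertex into the tile interior.
    s_pos[threadIdx.y + 1][threadIdx.x + 1] = d_vertexData[gy * width + gx];

    // Edge threads also load the halo cells (clamped at the grid border).
    if (threadIdx.x == 0)
        s_pos[threadIdx.y + 1][0]           = d_vertexData[gy * width + max(gx - 1, 0)];
    if (threadIdx.x == BLOCK_X - 1)
        s_pos[threadIdx.y + 1][BLOCK_X + 1] = d_vertexData[gy * width + min(gx + 1, width - 1)];
    if (threadIdx.y == 0)
        s_pos[0][threadIdx.x + 1]           = d_vertexData[max(gy - 1, 0) * width + gx];
    if (threadIdx.y == BLOCK_Y - 1)
        s_pos[BLOCK_Y + 1][threadIdx.x + 1] = d_vertexData[min(gy + 1, height - 1) * width + gx];

    __syncthreads();

    // Force calculation now reads neighbours from s_pos instead of global memory.
    // ...
}
[/code]

The diagonal (corner) halo cells are left out for brevity; a cloth with shear springs would need those loaded as well.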
A lot of duplicated code in this file could be eliminated, with no performance penalty, by moving the force calculation to a separate device function. The compiler will inline calls to that function.
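Something along these lines (a sketch; the name springForce and its parameters are made up, not taken from your file):

[code]
// One place for the spring force; the compiler inlines __device__ functions,
// so factoring this out costs nothing at run time.
__device__ float3 springForce(float3 p, float3 q, float restLength, float stiffness)
{
    float3 d   = make_float3(q.x - p.x, q.y - p.y, q.z - p.z);
    float  len = sqrtf(d.x * d.x + d.y * d.y + d.z * d.z);
    float  s   = stiffness * (len - restLength) / len;
    return make_float3(s * d.x, s * d.y, s * d.z);
}
[/code]

The kernel then just calls springForce once per neighbour instead of repeating the same expression for every spring direction.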
You might want to consider using the intrinsics for reciprocal and reciprocal square root, if your application can tolerate the precision loss.
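For example, a sketch using rsqrtf and __fdividef (how much precision you can give up depends on your stiffness and time step):

[code]
// Approximate normalisation using the fast reciprocal square root.
__device__ float3 fastNormalize(float3 v)
{
    float invLen = rsqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
    return make_float3(v.x * invLen, v.y * invLen, v.z * invLen);
}

// Fast approximate division, e.g. for (len - restLength) / len.
__device__ float fastRatio(float num, float den)
{
    return __fdividef(num, den);
}
[/code]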
Thanks for your replies. Actually, this code is based on my CPU cloth code. I am quite new to CUDA, so I am trying to do the conversion myself while learning CUDA at the same time, and it is not easy, at least for me. Could you suggest a good site/article that may help me convert this code to use shared/texture memory?
While this version performs better, there are two things I want to ask:
Are there any other optimizations that I could add to speed up the code further?
Done this way, I am restricted by the size of the shared memory: I cannot use this code for any grid size larger than 8x8 on my device. Could anyone tell me how to use shared memory so that it works with larger grid sizes?
Hi, I think there is a bank conflict …
Let's say for threadIdx.y=0 we are accessing s_Data[threadIdx.x][threadIdx.y][0].
Then for
threadIdx.x=0 we access location 0 … access to bank 0
threadIdx.x=1 we access location 24 … access to bank 8
threadIdx.x=2 we access location 48 … access to bank 0
threadIdx.x=3 we access location 72 … access to bank 8
threadIdx.x=4 we access location 96 … access to bank 0
and so on …
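If the declaration is something like float s_Data[8][8][3] (an assumption on my part; I have not seen the exact declaration), one way to avoid that conflict is to reorder the indices so threadIdx.x is the fastest-varying one, for instance:

[code]
#define BLOCK 8

__global__ void exampleKernel()
{
    // Conflict-prone layout from the analysis above: the stride per
    // threadIdx.x step is 24 floats, so only banks 0 and 8 are ever hit.
    //   __shared__ float s_Data[BLOCK][BLOCK][3];
    //   float v = s_Data[threadIdx.x][threadIdx.y][0];

    // Reordered layout: threadIdx.x is now the fastest-varying index, so the
    // 16 threads of a half-warp read 16 consecutive words, one per bank.
    __shared__ float s_Data[3][BLOCK][BLOCK];

    s_Data[0][threadIdx.y][threadIdx.x] = (float)threadIdx.x;   // fill the tile somehow
    __syncthreads();

    float v = s_Data[0][threadIdx.y][threadIdx.x];              // conflict-free read
    (void)v;
}
[/code]

Padding one of the outer dimensions (e.g. [BLOCK][BLOCK + 1][3]) would also break the conflict if the original index order has to stay.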