Optimizing the execution dependency

I am new to CUDA and I need some suggestions to optimize my code.
Although the program works well, I wonder whether more parallelism can be applied to this code. I used the Visual Profiler and found that performance is bound by instruction and memory latency, and that there is an execution dependency. I also tried 3D indexing (x, y, z), but the execution time did not improve.
I simplified my kernel as follows; it basically consists of three independent instructions. Any advice, please? Can I make each instruction execute on its own thread?

__global__ void simple_vbo_kernel(float4 *pos, float slopx, float slopy, float slopz)
{
    // Flatten the 2D grid of 2D blocks into a single linear index.
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    unsigned int idx = blockId * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;

    // Three independent multiplies, one per coordinate.
    float u = 0.0f + idx * slopx;
    float v = 0.0f + idx * slopy;
    float w = 0.0f + idx * slopz;

    // One 16-byte store per thread.
    pos[idx] = make_float4(u, v, w, 1.0f);
}

You can certainly crank up your parallelism by a factor of 4 (i.e. increase total thread count by a factor of 4) by getting rid of the float4 treatment.

I certainly would not recommend that if you are launching “enough threads”. If you are not launching enough threads, you might start by considering why that is, and whether your problem is too small to be interesting on the GPU.
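For illustration only, here is a minimal sketch of what "getting rid of the float4 treatment" could look like: the buffer is viewed as plain floats and each thread writes one component, so four times as many threads are launched. The names `simple_vbo_kernel_scalar` and `fill_scalar`, the 1D launch, and the block size of 256 are assumptions made for this sketch, not part of the original code. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical scalar variant: one thread per float instead of one per float4.
__global__ void simple_vbo_kernel_scalar(float *pos, unsigned int n_floats,
                                         float slopx, float slopy, float slopz)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_floats) return;      // guard the partial last block

    unsigned int idx  = tid / 4;      // which float4 element this float belongs to
    unsigned int comp = tid % 4;      // 0 = x, 1 = y, 2 = z, 3 = w

    float slope = (comp == 0) ? slopx : (comp == 1) ? slopy : slopz;
    pos[tid] = (comp == 3) ? 1.0f : idx * slope;
}

// Hypothetical host helper: fills n_points positions and copies them back.
// The result is laid out as x0, y0, z0, w0, x1, y1, z1, w1, ...
std::vector<float> fill_scalar(unsigned int n_points,
                               float slopx, float slopy, float slopz)
{
    unsigned int n_floats = 4 * n_points;
    float *d_pos = nullptr;
    cudaMalloc(&d_pos, n_floats * sizeof(float));

    unsigned int threads = 256;       // assumed block size
    unsigned int blocks  = (n_floats + threads - 1) / threads;
    simple_vbo_kernel_scalar<<<blocks, threads>>>(d_pos, n_floats,
                                                  slopx, slopy, slopz);

    std::vector<float> h_pos(n_floats);
    cudaMemcpy(h_pos.data(), d_pos, n_floats * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_pos);
    return h_pos;
}
```

Each thread now issues a 4-byte store instead of a 16-byte store, so the total number of bytes written is unchanged. On a grid that is already large, this version is not expected to be faster than the float4 one, which is consistent with the caution above.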

More generally I question your diagnosis and suggest there is likely not much you can do to improve this code. It is almost certainly a memory bound code, so for performance analysis you should determine what is a reasonable bandwidth target, then see if your kernel is hitting that bandwidth target or not. If it is, there is nothing else you can do, focusing on just what you have shown here. It would shock me if this code, launched on a sufficiently large grid, did not come very close to the peak achievable bandwidth for your GPU.
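One way to check the kernel against a bandwidth target without a profiler is to time it with CUDA events and divide the bytes written by the elapsed time. The following is a rough sketch under stated assumptions: the helper name `measure_write_bandwidth_gbs`, the 1D launch with a bounds check, the block size of 256, and the slope values are all choices made for this example. It should be built in release mode, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// 1D variant of the kernel from the question, with a bounds check added.
__global__ void simple_vbo_kernel_1d(float4 *pos, unsigned int n,
                                     float slopx, float slopy, float slopz)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        pos[idx] = make_float4(idx * slopx, idx * slopy, idx * slopz, 1.0f);
}

// Hypothetical helper: returns the achieved write bandwidth in GB/s
// (1 GB = 1e9 bytes), averaged over `repeats` back-to-back launches.
double measure_write_bandwidth_gbs(unsigned int n, int repeats)
{
    float4 *d_pos = nullptr;
    cudaMalloc(&d_pos, n * sizeof(float4));

    unsigned int threads = 256;       // assumed block size
    unsigned int blocks  = (n + threads - 1) / threads;

    // Warm-up launch so one-time startup costs are not included in the timing.
    simple_vbo_kernel_1d<<<blocks, threads>>>(d_pos, n, 1.0f, 2.0f, 3.0f);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < repeats; ++i)
        simple_vbo_kernel_1d<<<blocks, threads>>>(d_pos, n, 1.0f, 2.0f, 3.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);       // wait until all timed launches finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_pos);

    // The kernel only writes: 16 bytes per element per launch.
    double bytes = (double)n * sizeof(float4) * repeats;
    return bytes / (ms * 1.0e-3) / 1.0e9;
}
```

Calling, for example, `measure_write_bandwidth_gbs(1u << 22, 20)` writes 64 MB per launch, which is large enough that the result mostly reflects device memory rather than cache. The number it returns is the one to compare against the bandwidth target.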

To make improvements beyond that you would need to give the GPU more work to do. This kernel is performing a trivial operation.

Thank you for your reply, Robert_Crovella, but how can I get rid of the float4? Is there example code for that?
I am also confused about the bandwidth shown in the attachments. My GPU is a GeForce 1050, whose effective bandwidth is about 112 GB/sec, yet in my Visual Profiler results the achieved bandwidth is 137 GB/sec for a 64x64 grid. Is that right? If not, what does this value represent?

It looks to me like you are compiling a debug project. Never never never evaluate performance using a debug project.

Switch to a release project.

The line to focus on is Device Memory - Writes

Run the bandwidthTest sample project to get a proxy measurement of the device memory bandwidth of your GPU. Use the largest number reported, which is the device-to-device bandwidth.

Then compare that to the Device Memory - Writes number. They should be close.

Also I would change your grid size from 64x64 to 256x256. This will minimize the residual effect of the L2 cache on the measurement.

many thanks!

Indeed, I had used the Visual Profiler to test performance on a debug build! I therefore turned to Nsight to measure the performance. When I switched to a release project, the execution time decreased and the bandwidth is close to the device-to-device bandwidth, which is about 95.4 GB/sec, as you expected. But what about the 112 GB/sec calculated according to the CUDA toolkit documentation? Is it a maximum theoretical value, or does ECC take 20% off this value?

One last question, please: how can I print the .nvreport in Nsight?

GeForce 1050 doesn’t have ECC

The achievable bandwidth is less than the calculated or peak theoretical bandwidth, due to various overheads. This is commonly observed across all GPUs and all CPUs that I am familiar with.
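As a worked example of the calculated peak figure, the usual formula multiplies the memory clock by the bus width in bytes and by 2 for double data rate. The sketch below assumes a memory clock of 3504 MHz and a 128-bit bus for the GeForce 1050; those two values are not stated in this thread and are used only because they reproduce the 112 GB/sec figure. The function names are hypothetical.

```cuda
#include <cuda_runtime.h>

// Pure arithmetic: theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).
// The factor 2 accounts for double-data-rate memory.
double theoretical_bandwidth_gbs(int mem_clock_khz, int bus_width_bits)
{
    return 2.0 * mem_clock_khz * 1.0e3 * (bus_width_bits / 8.0) / 1.0e9;
}

// Same calculation, using the values the CUDA runtime reports for a device.
double device_theoretical_bandwidth_gbs(int device)
{
    int clock_khz = 0;   // memory clock in kHz
    int bus_bits  = 0;   // memory bus width in bits
    cudaDeviceGetAttribute(&clock_khz, cudaDevAttrMemoryClockRate, device);
    cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, device);
    return theoretical_bandwidth_gbs(clock_khz, bus_bits);
}
```

With the assumed values, `theoretical_bandwidth_gbs(3504000, 128)` gives 112.128 GB/s. The measured 95.4 GB/sec is about 85% of that, which fits the statement above that achievable bandwidth falls short of the theoretical peak.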

I’m not aware of a generalized print-to-printer function in the visual profiler. However various sections have export capabilities. For example the analysis tab has an export to PDF function, and the GPU details tab has an export to CSV function.

In the Visual Profiler I can export to PDF, but my question is about Nsight. The file has the .nvreport extension, and there is no export function!


This exports the performance data as tables only. I think the charts (such as issue efficiency and occupancy) can’t be exported.

If no other method is provided, you can always use screen capture on a Windows machine. That has nothing to do with CUDA. Or you can export the data, import it into Excel, and build charts there.

Furthermore, if you’d like to see a change in behavior, you can express that by filing a bug. The instructions are linked in a sticky post at the top of this sub-forum.

thanks, I’ll try it.