Slow multigpu When converting the code to run as multigpu, the kernel is much slower

So I’ve updated my project to work in a multiGPU environment and utilize all available GPUs to compute the final result. I’m still testing it on one GPU and the same machine I’ve always used for development. However, the multiGPU implementation is much slower than the single GPU version. While there are some minor changes here and there between the attached versions, the major difference is supporting multiGPU. The slow multiGPU version takes about 2 minutes to run whereas the fast single GPU version takes about 20 seconds. It appears the bulk of the kernel time is spent reading from texture memory and a write to global memory. Any ideas on what needs to be changed would be greatly appreciated. Unfortunately I cannot provide test data and runtime environments because all of it takes a couple of gigs.
fast.tar.gz (9.37 KB)
slow.tar.gz (10.4 KB)

Over on Win 7 today, so not being in zip format did not look at your attachments. Are you creating separate mem objects for each device? Did the flag for the mem change to CL_MEM_USE_HOST_PTR?

I managed to fix the problem. I was really surprised that it was all caused by the access speed to a minor variable. Simply changed how I passed two ints and things went back to normal.

Problematic code:

[codebox]__kernel void kernel_fdk(

__global float *dev_vol,

__read_only image2d_t dev_img,

__constant float *dev_matrix,

__constant float4 *nrm,

__constant float4 *vol_offset,

__constant float4 *vol_pix_spacing,

__constant int4 *vol_dim,

__constant float2 *ic,

__constant int2 *img_dim,

__constant float *sad,

__constant float *scale,

__constant int4 *offset,

__global int4 *ndevice

) {

uint i = get_global_id(0);

uint j = get_global_id(1);

uint k = get_global_id(2);

if (i >= (*ndevice).x || j >= (*ndevice).y || k >= (*ndevice).z)

	return;

// Index row major into the volume

long vol_idx = i + (j * (*vol_dim).x) + (k * (*vol_dim).x * (*vol_dim).y);

vol_idx -= (*offset).w;

k += (*offset).z;

…[/codebox]

Solution code:

[codebox]// FDK kernel

__kernel void kernel_fdk(

__global float *dev_vol,

__read_only image2d_t dev_img,

__constant float *dev_matrix,

__constant float4 *nrm,

__constant float4 *vol_offset,

__constant float4 *vol_pix_spacing,

__constant int4 *vol_dim,

__constant float2 *ic,

__constant int2 *img_dim,

__constant float *sad,

__constant float *scale,

__constant int4 *ndevice,

int offset,

int offset_size

) {

uint i = get_global_id(0);

uint j = get_global_id(1);

uint k = get_global_id(2);

if (i >= (*ndevice).x || j >= (*ndevice).y || k >= (*ndevice).z)

	return;

// Index row major into the volume

long vol_idx = i + (j * (*vol_dim).x) + (k * (*vol_dim).x * (*vol_dim).y) - offset_size;

k += offset;

…[/codebox]