Segmentation fault using the command-line profiler in a multi-GPU application

Hello!
I want to profile my application with the command-line profiler (the one that is enabled with the COMPUTE_PROFILE environment variable). For various reasons, this is the only profiler I can use.

My node has two GPUs. On each one I issue the following calls:

  cudaStreamCreate x 1, cudaEventCreate x 7, cudaMalloc x 2

If I use the first GPU alone, there is no problem. If I use the second GPU alone, there is no problem either. But if I use both GPUs in the same run, first issuing all of the calls listed above to GPU0 and then issuing the same calls to GPU1, I get a segmentation fault when the second GPU (GPU1) is about to create its stream.

I have made sure to set the device (cudaSetDevice) before issuing the calls, and I haven't found anything in the command-line profiler documentation saying that it is only suitable for profiling a single GPU.

Source code here:
https://docs.google.com/open?id=0BzrWAKyLcuZeTDY3WS05Y3lCRXM

I am using CUDA 4.0 with the CUDA 4.1 driver. Each node in the cluster has two Tesla M2090 cards (compute capability 2.0). My OS is Red Hat Enterprise Linux Server release 5.3 (Tikanga), x86_64, and the PCI-Express version is 2.0.

Can anyone suggest what might be going on?
Thanks in advance!