<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
      <title>Tagged with tesla - NVIDIA Developer Forums</title>
      <link>http://forums.developer.nvidia.com/devforum/discussions/tagged/tesla/feed.rss</link>
      <pubDate>Wed, 16 May 2012 17:37:17 -0400</pubDate>
         <description>Tagged with tesla - NVIDIA Developer Forums</description>
   <language>en-CA</language>
   <atom:link href="/devforum/discussions/tagged/tesla/feed.rss" rel="self" type="application/rss+xml" />
   <item>
      <title>Tesla vs GTX560M this is weird!</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8311/tesla-vs-gtx560m-this-is-wierd</link>
      <pubDate>Wed, 16 May 2012 16:32:03 -0400</pubDate>
      <dc:creator>Bleiner</dc:creator>
      <guid isPermaLink="false">8311@/devforum/discussions</guid>
      <description><![CDATA[Hello Everyone,<br /><br />I have been porting many of the more computationally intensive portions of my code from MATLAB to .ptx GPU code.  I have been doing the development on my laptop, where I have seen HUGE increases in speed: at least a 10x speedup in many areas.  All my work is double precision and involves millions of objects.  My laptop, with a gaming GPU (which by design has hampered double-precision performance and fewer processing cores), can complete a task in about 46 seconds.  <br /><br />I have a desktop with a Tesla C2075 that typically outperforms my laptop by a factor of two.  When I bring this code over to the Tesla machine, it takes anywhere from 42 to 48 seconds to complete the same task.  Does anyone have any idea why this would be?  <br /><br />The only thing that comes to mind is that I upgraded my laptop to driver version 301.40 to use Nsight Visual Studio, while the Tesla machine is still on 301.32 (or something like that).  When I attempted to upgrade the Tesla machine, it appeared that the 301.40 drivers had been removed from the website.  When I upgraded to the beta CUDA 5 drivers, version 301.53 (I think), MATLAB no longer recognized that there was a GPU attached to the system.  <br /><br />Could the issue be the driver?  Did it improve that much from 301.27 to 301.40?  Does it have anything to do with the GPU in the laptop being compute capability 2.1 while the Tesla is 2.0?  Is there a memory management issue that 2.1 handles a lot better?  <br /><br />The strange thing is that the Tesla USED to perform at twice the speed of the laptop (and that includes all the overhead in MATLAB that takes place on the CPU).  So the Tesla must have been significantly more than 2x faster.  <br /><br />Any thoughts?<br /><br />Thanks<br />Ben<br /><br />As an afterthought, here are the specs of the machines in question:
<br />Tesla workstation<br />Dell Precision T5500<br />Xeon E5620 2.4 GHz<br />12 GB RAM<br />Tesla C2075 with 6 GB RAM, driver 301.32<br /><br />Laptop<br />Asus G74S<br />Core i7 2670QM 2.2 GHz<br />12 GB RAM<br />GTX 560M with 3 GB RAM, driver 301.40<br />]]></description>
   </item>
      <item>
      <title>Cuda, PTX and Debugging symbols</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6511/cuda-ptx-and-debugging-symbols</link>
      <pubDate>Wed, 28 Mar 2012 13:08:21 -0400</pubDate>
      <dc:creator>Bleiner</dc:creator>
      <guid isPermaLink="false">6511@/devforum/discussions</guid>
      <description><![CDATA[Hello everyone,<br /><br />I have a question about .ptx files and debugging in Visual Studio 10 Professional.  I am attempting to write some PTX code to integrate into MATLAB.  However, I am ONLY writing the __global__ functions; I have not written a main function or any host code.  The functions are fairly simple, but I want to be able to debug my code while MATLAB is running it.  I am compiling my code with nvcc -ptx 'functionname.cu', and when I use nvcc -G -ptx 'functionname.cu' to get debugging information, nothing is produced except the .ptx file.  I should mention that I am relatively new to Visual Studio, which is why I am compiling from the command line.  <br /><br />I believe that if I had the symbols generated, I could attach the debugger to the MATLAB process and debug my code.  Since .ptx is supposed to be just-in-time compiled, does that mean I cannot have debugging symbols?  <br /><br />Any help would be greatly appreciated.<br />Ben]]></description>
   </item>
      <item>
      <title>OpenCL callbacks scheduling</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8031/opencl-callbacks-scheduling</link>
      <pubDate>Thu, 10 May 2012 09:59:48 -0400</pubDate>
      <dc:creator>rjmarques</dc:creator>
      <guid isPermaLink="false">8031@/devforum/discussions</guid>
      <description><![CDATA[Greetings,<br /><br />I am seeing a huge overhead when using callbacks on Linux. After I enqueue the necessary read operation, I set a callback for the appropriate handling. The read takes less than a millisecond to complete, but the callback is only issued about 19 milliseconds later. Is this a driver issue?<br /><br />The graphics card is a Tesla C2050.<br />The driver version is 295.41.<br />And the GCC version is 4.4.3.<br /><br />Thanks,<br />Ricardo Marques<br /><br /> ]]></description>
   </item>
      <item>
      <title>I wrote this program but it does not work; can you please tell me what is wrong with it? I'm a new programmer</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8006/i-write-this-program-but-its-not-work-please-can-you-tell-me-what-wrong-with-it-im-new-programmer</link>
      <pubDate>Thu, 10 May 2012 04:42:48 -0400</pubDate>
      <dc:creator>analdo</dc:creator>
      <guid isPermaLink="false">8006@/devforum/discussions</guid>
      <description><![CDATA[#include &lt;stdio.h&gt;<br />#include &lt;cuda_runtime.h&gt;<br />#define N 16<br /><br />__global__ void matAdd(float* A, float* B, float* C) <br />{ <br />		int i= threadIdx.x + threadIdx.y*blockDim.x; <br />		C[i]= A[i] + B[i];<br />	}<br />int main()<br />	{<br />		int i;<br />		int numBlocks = 1;<br />		size_t size = N* sizeof(float);<br />		dim3 threadsPerBlock (4, 4);<br /><br />		// Initialize the vectors <br />		float A[]= {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};<br />		float B[]= {1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5};<br />		float C[N];			<br /><br />		// Allocate the vectors in GPU memory <br />		float* d_A; <br />		cudaMalloc(&amp;d_A, size); <br />		float* d_B; <br />		cudaMalloc(&amp;d_B, size); <br />		float* d_C; <br />		cudaMalloc(&amp;d_C, size);<br /><br />		// Copy the vectors from CPU memory to GPU memory <br />		cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice); <br />		cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);<br /><br /><br /><br />		// Launch the kernel<br />		matAdd&lt;&lt;&lt;numBlocks, threadsPerBlock&gt;&gt;&gt;(A, B, C);<br /><br />		cudaMemcpy(C,d_C,size,cudaMemcpyDeviceToHost);<br /><br />		printf("C= {");<br />		for(i=0;i&lt;N;i++)<br />			{<br />				printf("%2.2f ", C);<br />			}<br />		printf("}\n");<br />		cudaFree(d_A), cudaFree(d_B), cudaFree(d_C);<br />}<br />]]></description>
   </item>
      <item>
      <title>Streaming Multiprocessors</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7676/streaming-multiprocessors</link>
      <pubDate>Sat, 28 Apr 2012 16:08:59 -0400</pubDate>
      <dc:creator>Saouli</dc:creator>
      <guid isPermaLink="false">7676@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br />How can we find out the number of streaming multiprocessors (SMs) in NVIDIA devices, and how many threads each one can hold? For example, I believe the NVIDIA G80 has 16 SMs, each of which can hold 8 blocks of threads, and the maximum should be 768 threads.]]></description>
   </item>
      <item>
      <title>cudaEvent timers vs. Host timers</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7541/cudaevent-timers-vs-host-timers</link>
      <pubDate>Wed, 25 Apr 2012 16:35:14 -0400</pubDate>
      <dc:creator>dlowell</dc:creator>
      <guid isPermaLink="false">7541@/devforum/discussions</guid>
      <description><![CDATA[For performance testing we are trying two methods of timing:<br />CUDA event-based timing and a system timer.<br />We are running on a Fermi 2070 (sm_20) with CUDA SDK 4.2.<br />I haven't seen anything on the internet that makes it clear whether one is superior to the other in terms of timing. I've seen cudaDeviceSynchronize() used for this purpose; any insight would be valuable. <br /><br />Thanks ahead of time!<br /><br /><br />The first is the built-in event-based timing:<br /><br /><code>cudaEventRecord(start, 0);<br />kernel&lt;&lt;&lt;grid,block&gt;&gt;&gt;(devy,devx,alpha,length);<br />cudaEventRecord(stop, 0);<br />cudaEventSynchronize(stop);</code><br /><br /><br />The second is a system-time-based timer using a barrier:<br /><br /><code>  start = getclock();<br />  kernel&lt;&lt;&lt;dimGrid,dimBlock&gt;&gt;&gt;(devy,devx,alpha,length);<br />  cudaDeviceSynchronize();<br />  finish = getclock();</code><br /><br />where getclock() is defined as:<br /><br /><code>#include &lt;sys/time.h&gt;<br />double getclock(){<br />  struct timezone tzp;<br />  struct timeval tp;<br />  gettimeofday (&amp;tp, &amp;tzp);<br />  return (tp.tv_sec + tp.tv_usec*1.0e-6);<br />}</code><br /><br /><br />The kernel we are running is:<br /><br /><code>__global__ void  kernel(double* devY,double* devX, double alpha, int length){<br /> /* y &lt;- alpha*x */<br />  int tid = blockIdx.x*blockDim.x+threadIdx.x;<br />  if(tid&lt;length){<br />    devY[tid]=alpha*devX[tid];<br />  }<br />}</code><br />]]></description>
   </item>
      <item>
      <title>Non-deterministic results in CUDA program</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7711/uncertain-results-in-cuda-program</link>
      <pubDate>Tue, 01 May 2012 05:07:32 -0400</pubDate>
      <dc:creator>tanjun2525</dc:creator>
      <guid isPermaLink="false">7711@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br />I'm trying to write a matrix iteration program in CUDA, but I have run into some problems.<br />The overall procedure is: the CPU copies the matrix to the GPU; the GPU iterates the matrix according to a formula; the CPU copies back the new matrix; the CPU checks whether the matrix has converged and, if it has, stops iterating; otherwise it continues.<br />Matrices are flattened to 1-D arrays when transferred between CPU and GPU.<br />I cannot get correct results because the iteration is non-deterministic: running the program several times gives different results.<br />Each element of the matrix is assigned one thread, so I defined n1*n2 threads to iterate an n1*n2 matrix.<br />I cannot figure out why the result is non-deterministic.<br /><strong>Maybe some "volatile" keywords are needed, but I don't know how to define a volatile variable in global memory. I tried, but "error: argument of type 'volatile double *' is incompatible with parameter of type 'void *'" was reported when assigning values to them using cudaMemcpy.</strong><br />The code is shown below. 
I'm sorry for my poor English.<br />Any suggestions would be highly appreciated.<br /><br /><br />CPU_function{<br />  const int num_threads = n1 * n2;<br />  const int threadsPerBlock = 16 * 16;<br />  const int blocksPerGrid = (num_threads + threadsPerBlock - 1) / threadsPerBlock;<br /><br />  ...// variable definitions and value assignments<br /><br />  do<br />  {<br />    if(times &gt; 1000)<br />      break;<br /><br />    // assign the new result to the next iteration step<br />    CUDA_SAFE_CALL( cudaMemcpy( d_Sk2, h_Sk2, n1*n2*sizeof(double), cudaMemcpyHostToDevice) );<br /><br />    // core function<br />    ComputeSim_Kernel&lt;&lt;&lt;blocksPerGrid, threadsPerBlock&gt;&gt;&gt;(d_Sim, <br />      d_Sk2,<br />      d_Sk1,<br />      d_adjMatrix_Yeast,<br />      d_adjMatrix_Fly,<br />      d_yeastIndex,<br />      d_flyIndex,<br />      n1, n2<br />      );<br />    cudaThreadSynchronize();<br /><br />    // get the new matrix from the GPU<br />    CUDA_SAFE_CALL( cudaMemcpy( h_Sk1, d_Sk1, n1*n2*sizeof(double), cudaMemcpyDeviceToHost) );<br /><br />    // get the maximum element of the matrix<br />    maxw = h_Sk1[0];<br />    for(int i=1; i&lt;n1*n2; i++)<br />    {<br />      if(h_Sk1[i] &gt; maxw)<br />        maxw = h_Sk1[i];<br />    }<br /><br />    minSk = 1;<br />    maxDeltaSk01 = 0;<br />    maxDeltaSk02 = 0;<br />    deltaSk01 = 0;<br />    deltaSk02 = 0;<br /><br />    ...// some computation for judging convergence<br /><br />    tmpsk = h_Sk2;<br />    h_Sk2 = h_Sk1;<br />    h_Sk1 = h_Sk0;<br />    h_Sk0 = tmpsk;<br /><br />    ++times;<br /><br />  }while((maxDeltaSk01 &gt; 0.01) &amp;&amp; (maxDeltaSk02 &gt; 0.01));<br />}<br /><br /><br />Code on the device:<br /><br />#ifndef __COMPUTESIM_KERNEL_H__<br />#define __COMPUTESIM_KERNEL_H__<br /><br />#include "cuda.h"<br />#include "cutil.h"<br />#include "computeSim_kernel.h"<br /><br />const int threadsPerBlock = 256;<br /><br />inline __global__ void<br />ComputeSim_Kernel(double *Sim, double *Sk2, double *Sk1, int *adjMatrix_Yeast, <br />    int *adjMatrix_Fly, int *yeastIndex, int *flyIndex, int n1, int n2)<br />{<br />    for(int i=0; i&lt;n1; i++)<br />        yeastIndex[i] = 0;<br />    for(int j=0; j&lt;n2; j++)<br />        flyIndex[j] = 0;<br /><br />    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;<br /><br />    const unsigned int index = threadIdx.x;<br />    const unsigned int stride = blockDim.x * gridDim.x;<br />    int iIndex, jIndex;<br />    int iDegree, jDegree;<br />    double N1, N2;<br /><br />    N1 = 0.0;<br />    N2 = 0.0;<br />    iDegree = 0;<br />    jDegree = 0;<br /><br />    // iterate the n1*n2 matrix<br />    while(tid &lt; (n1*n2))<br />    {<br />        Sk1[tid] = 0; // for storing the new value<br />        if(Sim[tid] == 0) // if the element is 0, no need to iterate<br />        {<br />            tid += stride;<br />            continue;<br />        }<br /><br />        iIndex = tid / n2; // get the row index in the matrix<br />        jIndex = tid % n2; // get the column index in the matrix<br /><br />        // some data structure for iteration<br />        for(int i=0; i&lt;n1; i++)<br />            yeastIndex[i] = adjMatrix_Yeast[iIndex * (n1+1) + i];<br />        iDegree = adjMatrix_Yeast[iIndex * (n1+1) + n1];<br /><br />        // some data structure for iteration<br />        for(int j=0; j&lt;n2; j++)<br />            flyIndex[j] = adjMatrix_Fly[jIndex * (n2+1) + j];<br />        jDegree = adjMatrix_Fly[jIndex * (n2+1) + n2];<br /><br /><br />        // compute N1 for iteration<br />        if((iDegree != 0) &amp;&amp; (jDegree != 0))<br />        {<br />            for(int i=0; i&lt;n1; i++)<br />            {<br />                if(yeastIndex[i] == 1)<br />                    for(int j=0; j&lt;n2; j++)<br />                    {<br />                        // a2&lt;-&gt;a, b2&lt;-&gt;b<br />                        if(flyIndex[j] == 1)<br />                            N1 += Sk2[i * n2 + j];<br />                    }<br />            }<br />            N1 /= (iDegree * jDegree);<br />        }<br />        else if((iDegree == 0) &amp;&amp; (jDegree == 0))<br />        {<br />            for(int i=0; i&lt;n1; i++)<br />            {<br />                for(int j=0; j&lt;n2; j++)<br />                    N1 += Sk2[i * n2 + j];<br />            }<br />            N1 /= (n1 * n2);<br />        }<br />        else<br />            N1 = 0;<br /><br />        // compute N2 for iteration<br />        if((iDegree != n1) &amp;&amp; (jDegree != n2))<br />        {<br />            for(int i=0; i&lt;n1; i++)<br />            {<br />                if(yeastIndex[i] == 0)<br />                    for(int j=0; j&lt;n2; j++)<br />                    {<br />                        // a2 !&lt;-&gt;! a, b2 !&lt;-&gt;! b<br />                        if(flyIndex[j] == 0)<br />                            N2 += Sk2[i * n2 + j];<br />                    }<br />            }<br />            N2 /= ((n1 - iDegree) * (n2 - jDegree));<br />        }<br />        else if((iDegree == n1) &amp;&amp; (jDegree == n2))<br />        {<br />            for(int i=0; i&lt;n1; i++)<br />            {<br />                for(int j=0; j&lt;n2; j++)<br />                    N2 += Sk2[i * n2 + j];<br />            }<br />            N2 /= (n1 * n2);<br />        }<br />        else<br />            N2 = 0;<br /><br />        // update the matrix using N1 and N2<br />        Sk1[tid] = (N1 + N2)/2 * Sim[tid];<br /><br />        tid += stride;<br /><br />    } // while(tid &lt; (n1*n2))<br />}<br /><br />#endif // __COMPUTESIM_KERNEL_H__]]></description>
   </item>
      <item>
      <title>Level of detail in Nsight profiling?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7621/level-of-detail-in-nsight-profiling</link>
      <pubDate>Thu, 26 Apr 2012 20:00:54 -0400</pubDate>
      <dc:creator>obuzko</dc:creator>
      <guid isPermaLink="false">7621@/devforum/discussions</guid>
      <description><![CDATA[Hi all,<br />I'm trying to optimize a fairly complex kernel that includes quite a few functions. Is Parallel Nsight capable of determining function-level performance bottlenecks within a kernel? <br />I'm using version 2.0 with VS2010 and a Tesla2050, and it appears to be limited to the kernel as a unit with no further breakdown. Can the latest version provide function-level profiling? If not, could someone suggest an alternative (if available)?<br />Thanks in advance<br /><br />Sasha<br />]]></description>
   </item>
      <item>
      <title>M2050 performance and bandwidth test</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7356/m2050-performance-and-bandwidth-test</link>
      <pubDate>Fri, 20 Apr 2012 14:11:27 -0400</pubDate>
      <dc:creator>ddjbrown</dc:creator>
      <guid isPermaLink="false">7356@/devforum/discussions</guid>
      <description><![CDATA[We have 44 IBM iDataPlex dx360-M3 machines with two M2050 GPUs attached via PCIe.  The results of running bandwidthTest vary by up to 25% on various machines, but in general seem to be about half of what we would expect from PCIe Gen2.  Here's the output from the best-performing node:<br /><br />Device 0: Tesla M2050<br /> Quick Mode<br /><br /> Host to Device Bandwidth, 1 Device(s), Paged memory<br />   Transfer Size (Bytes)	Bandwidth(MB/s)<br />   33554432			3972.9<br /><br /> Device to Host Bandwidth, 1 Device(s), Paged memory<br />   Transfer Size (Bytes)	Bandwidth(MB/s)<br />   33554432			3339.3<br /><br /> Device to Device Bandwidth, 1 Device(s)<br />   Transfer Size (Bytes)	Bandwidth(MB/s)<br />   33554432			123992.6<br /><br /><br />[bandwidthTest] - Test results:<br />PASSED<br /><br />Also, nvidia-smi on two nodes shows the "SM" clock rate as half of MAX, and one node shows PCIe Gen 1 rather than 2.  Finally, there seems to be no temperature and fan info, and a lot of other stuff shows up as N/A.  
WTF???<br /><br />Here is the nvidia-smi excerpt:<br />        GPU Link Info<br />            PCIe Generation<br />                Max             : 2<br />                Current         : 1<br />            Link Width<br />                Max             : 16x<br />                Current         : 16x<br />    Fan Speed                   : N/A<br />.<br />.<br />.<br />    Temperature<br />        Gpu                     : N/A<br />    Power Readings<br />        Power Management        : N/A<br />        Power Draw              : N/A<br />        Power Limit             : N/A<br />    Clocks<br />        Graphics                : 270 MHz      &lt;&lt;&lt;&lt;&lt;<br />        SM                      : 540 MHz      &lt;&lt;&lt;&lt;&lt;<br />        Memory                  : 1546 MHz<br />    Max Clocks<br />        Graphics                : 573 MHz<br />        SM                      : 1147 MHz<br />        Memory                  : 1546 MHz<br />]]></description>
   </item>
      <item>
      <title>Compute-modify with __threadfence()</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7276/compute-modify-with-__threadfence</link>
      <pubDate>Wed, 18 Apr 2012 11:44:28 -0400</pubDate>
      <dc:creator>chrism0dwk</dc:creator>
      <guid isPermaLink="false">7276@/devforum/discussions</guid>
      <description><![CDATA[Hi All,<br /><br />I have an algorithm which requires a summary measure of a dataset to be computed before that dataset is modified (necessarily in that order).  The modification is small: only one element of a large array is changed.  My current code does the following:<br /><br />1. myKernel&lt;&lt;&lt;...&gt;&gt;&gt;(...) ;<br />2. (thrust::device_vector) theDataset[modIdx] = modVal ; // Memcpy implemented as a thrust::device_vector<br /><br />According to the profiler, there is a long latency associated with the cudaMemcpy() call. Since I'm passing the modified value to the kernel anyway, I wondered whether there is a sensible way of getting the kernel to do the update, but only after all threads have done their part of the calculation.<br /><br />I wondered whether the following kernel definition:<br /><br /><code>__global__<br />void<br />myKernel(float modifiedVal, int modifiedIdx, float* dataset,...)<br />{<br />  int tid = threadIdx.x + blockIdx.x*blockDim.x;<br /><br />  // Calculations here...<br /><br />  __threadfence(); // threads in all blocks must have <br />                   // read from dataset before modification<br /><br />  if(tid == 0) dataset[modifiedIdx] = modifiedVal;<br />}</code><br /><br />might be a good idea?  The bit that concerns me is the "if(tid==0)" line: does this mean that the first block will sit idle, consuming resources, while other blocks are executed around it?  Is there a better way to achieve my aim?<br /><br />Thanks,<br /><br />Chris<br />]]></description>
   </item>
      <item>
      <title>Remote Debugging with EC2 GPU Cluster Instance</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6941/remote-debugging-with-ec2-gpu-cluster-instance</link>
      <pubDate>Wed, 11 Apr 2012 20:35:29 -0400</pubDate>
      <dc:creator>psstatuser</dc:creator>
      <guid isPermaLink="false">6941@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />I am trying to set up a Parallel Nsight remote debugging infrastructure with one of Amazon's EC2 GPU instances. My setup is as follows:<br /><br />1. A local Windows 7 x64 machine that has Visual Studio 2008 installed, fully configured for CUDA development. I am able to compile and run CUDA programs locally without issue.<br /><br />2. A remote EC2 GPU instance with 2 Tesla GPUs and with all NVIDIA drivers and the Nsight monitor installed.<br /><br />I have an SSH tunnel between the two machines, forwarding port 8000 from the local machine to the remote machine running the Nsight monitor. When I start Nsight debugging, I am able to connect to the remote machine; however, I receive an error complaining that the remote machine is connected via Remote Desktop. After doing some research in the various forums, I noticed that session 0 isolation in Windows Server 2008 may be the culprit and that many people have successfully debugged by running a remote session via the console using remote-access software like VNC. See here: <br /><br /><a href="http://forum-archive.developer.nvidia.com/index.php?showtopic=5371&amp;st=20&amp;p=18346&amp;#entry18346" target="_blank" rel="nofollow">http://forum-archive.developer.nvidia.com/index.php?showtopic=5371&amp;st=20&amp;p=18346&amp;#entry18346</a><br /><br />However, this requires physical access to the remote machine, which I do not have, in order to start the VNC server and connect to the remote machine in a non-session-0 environment. As this is an Amazon EC2 instance, I cannot do this.<br /><br />I was under the impression that the TCC (Tesla Compute Cluster) drivers solved this problem.<br /><br />Any ideas on how to make this work?<br /><br />Thanks.<br />]]></description>
   </item>
      <item>
      <title>tesla card issue</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5081/tesla-card-issue</link>
      <pubDate>Thu, 23 Feb 2012 23:16:28 -0500</pubDate>
      <dc:creator>mushymac2000</dc:creator>
      <guid isPermaLink="false">5081@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />As you know, I'm trying to get the CUDA Toolkit up and running on the SUSE Linux box for the job.<br /><br />I managed to download and install the latest NVIDIA driver (ver. 285.05.33), CUDA Toolkit (4.1) and GPU Computing SDK.<br /><br />When I attempted to run the programs included in the SDK, I got the following error:<br /><br />CUDA Runtime API error 39: uncorrectable ECC error encountered.<br /><br />When I disabled the ECC feature on the card, I was able to run the programs.<br /><br />In the attachment I've included the output I got with ECC enabled and ECC disabled, in two separate folders. The programs I tested with were: <br />•	deviceQuery<br />•	bandwidthTest<br />•	matrixMul<br />•	oceanFFT<br /><br />The details of the environment are as follows (refer to the attachment for more info):<br />•	SUSE Linux Enterprise Server 11 - x86_64 (kernel 2.6.32.12-0.7)<br />•	gcc 4.3.4<br />•	NVIDIA driver 285.05.33<br />•	CUDA Toolkit 4.1<br /><br />The details of the Tesla card are as follows:<br />•	Tesla C2075 (S/N: 0323111025730)<br />•	BIOS version 70.10.46.00.05<br /><br />The motherboard is an Asus Maximus IV Extreme-Z.<br />]]></description>
   </item>
      <item>
      <title>PhysX with CUDA on Linux</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6671/physx-with-cuda-on-linux</link>
      <pubDate>Tue, 03 Apr 2012 23:12:10 -0400</pubDate>
      <dc:creator>Bonaducci</dc:creator>
      <guid isPermaLink="false">6671@/devforum/discussions</guid>
      <description><![CDATA[I'm currently working on physics for an MMORPG with extensive scenery modification. Basically, I'd like to use PhysX to perform most of the expected actions, but I have a Linux server running on a Tesla. I've been using it mainly for sparse matrix-vector multiplication and similar computations. Of course CUDA works fine on Linux, but I'm new to PhysX and I'm wondering whether it's possible to use CUDA for PhysX on Linux.]]></description>
   </item>
      <item>
      <title>device query</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6106/device-query</link>
      <pubDate>Mon, 19 Mar 2012 12:49:11 -0400</pubDate>
      <dc:creator>vishalthelegend</dc:creator>
      <guid isPermaLink="false">6106@/devforum/discussions</guid>
      <description><![CDATA[Can anybody let me know how many grids, blocks per grid and threads per block are available on the <br />GeForce 8200M G? <br />Guys, help me out.]]></description>
   </item>
      <item>
      <title>how to use multiple blocks??</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6291/how-to-use-multiple-blocks</link>
      <pubDate>Fri, 23 Mar 2012 12:15:53 -0400</pubDate>
      <dc:creator>vishalthelegend</dc:creator>
      <guid isPermaLink="false">6291@/devforum/discussions</guid>
      <description><![CDATA[How do I make use of multiple blocks in CUDA?<br />I am using dim3 gridsize(n,n)<br />but it is not working unless n=1; that is, I am able to use only one block.<br />Help me out.<br />I have a GeForce 8200M GPU.]]></description>
   </item>
      <item>
      <title>unexpected situation when using nvidia-smi command</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6246/unexpected-situation-when-using-nvidia-smi-command</link>
      <pubDate>Fri, 23 Mar 2012 03:44:26 -0400</pubDate>
      <dc:creator>AlanKao2012</dc:creator>
      <guid isPermaLink="false">6246@/devforum/discussions</guid>
      <description><![CDATA[Hi there:<br /><br />When everything is working, running "nvidia-smi" prints the following:<br />+------------------------------------------------------+                       <br />| NVIDIA-SMI 2.285.05   Driver Version: 285.05.33      |                       <br />|-------------------------------+----------------------+----------------------+<br />| Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |<br />| Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |<br />|===============================+======================+======================|<br />| 0.  Tesla C2070               | 0000:02:00.0  On     |       Off            |<br />|  38%   80 C  P8    Off /  Off |   1%   54MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| 1.  Tesla C2070               | 0000:03:00.0  Off    |       Off            |<br />|  30%   57 C  P8    Off /  Off |   0%   10MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| 2.  Tesla C2070               | 0000:83:00.0  Off    |       Off            |<br />|  30%   66 C  P8    Off /  Off |   0%   10MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| 3.  
Tesla C2070               | 0000:84:00.0  Off    |       Off            |<br />|  30%   67 C  P8    Off /  Off |   0%   10MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| Compute processes:                                               GPU Memory |<br />|  GPU  PID     Process name                                       Usage      |<br />|=============================================================================|<br />|  No running compute processes found                                         |<br />+-----------------------------------------------------------------------------+<br /><br />But sometimes, it halts for a while, and:<br />+------------------------------------------------------+                       <br />| NVIDIA-SMI 2.285.05   Driver Version: 285.05.33      |                       <br />|-------------------------------+----------------------+----------------------+<br />| Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |<br />| Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |<br />|===============================+======================+======================|<br />| 0.  Tesla C2070               | 0000:02:00.0  On     |       Off            |<br />|  38%   ERR!  P8    Off /  Off |   1%   54MB / 6143MB |   99%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| 1.  Tesla C2070               | 0000:03:00.0  Off    |       Off            |<br />|  30%   57 C  P8    Off /  Off |   0%   10MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| 2.  
Tesla C2070               | 0000:83:00.0  Off    |       Off            |<br />|  30%   66 C  P8    Off /  Off |   0%   10MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| 3.  Tesla C2070               | 0000:84:00.0  Off    |       Off            |<br />|  30%   67 C  P8    Off /  Off |   0%   10MB / 6143MB |    0%     Default    |<br />|-------------------------------+----------------------+----------------------|<br />| Compute processes:                                               GPU Memory |<br />|  GPU  PID     Process name                                       Usage      |<br />|=============================================================================|<br />|  No running compute processes found                                         |<br />+-----------------------------------------------------------------------------+<br /><br />It happens to a supermicro 7046GT server with 4 tesla C2070. Each time this error occurs, I always reboot it, but soon I would find that it would not successfully boot up and need to be rebooted again.<br />It is so disturbing and it DOES cause some errors when trying to do computing on GPU.<br /><br />Is there anyone who has encountered this situation?<br />Is it more likely to be a hardware problem(motherboard, GPU, ...)? or software problem(bad usage of program, driver, OS)?<br /><br />Any reply will be appreciated. Thank you very much.]]></description>
   </item>
      <item>
      <title>Enabling Double Precision on Tesla C2050 Using arch=sm_20</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6101/enabling-double-precision-on-tesla-c2050-using-archsm_20</link>
      <pubDate>Mon, 19 Mar 2012 08:53:09 -0400</pubDate>
      <dc:creator>waltee1000</dc:creator>
      <guid isPermaLink="false">6101@/devforum/discussions</guid>
      <description><![CDATA[Hello All,<br />I declared double variables in my .cu file and compiled the code with -arch=sm_20. The code compiles without any error or warning, but when I run it I get the following message and the run aborts.<br />*** glibc detected *** ./a.out: double free or corruption (out): 0x0000000000a83da0 ***<br />======= Backtrace: =========<br />/lib/libc.so.6[0x7f38fb293928]<br />/lib/libc.so.6(cfree+0x76)[0x7f38fb295a36]<br />./a.out[0x41b744]<br />/lib/libc.so.6(__libc_start_main+0xe6)[0x7f38fb23e1a6]<br />./a.out(__gxx_personality_v0+0xa1)[0x401449]<br />======= Memory map: ========<br /><br />If I replace double with float, everything works fine: the code compiles and runs to completion with correct answers.<br /><br />I saw on the internet that the only change needed to make the double data type work is to compile the code with the -arch=sm_20 argument. Why is it not working even after setting that argument?<br /><br />Regards,<br /><br />Walter   <br /><br />]]></description>
   </item>
      <item>
      <title>Microbenchmarking using clock()</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5981/microbenchmarking-using-clock</link>
      <pubDate>Thu, 15 Mar 2012 14:40:04 -0400</pubDate>
      <dc:creator>fwende</dc:creator>
      <guid isPermaLink="false">5981@/devforum/discussions</guid>
      <description><![CDATA[hi,<br /><br />i'm going to do some microbenchmarking on a tesla m2090. for that purpose i decided to use the clock() function within the respective kernels. unfortunately, the difference of timestamps returned from successive calls of the clock() function seems to depend on the number of thread blocks used and the number of threads within the thread blocks. consider the following kernel:<br /><br /><code><br />__global__ void kernel( uint *time ) {<br /><br />  uint<br />    t1,t2;<br /><br />  t1 = clock();<br />  t2 = clock();<br /><br />  time[blockIdx.x*blockDim.x+threadIdx.x] = t2-t1;<br /><br />}<br /></code><br /><br />if i run the kernel with less than 17 thread blocks and less than 385 threads per block, 't2-t1' is equal to 24. if the number of thread blocks is less than 17 but the number of threads per block is between 385 and 416, 't2-t1' is 24 for threads within half the number of warps, and 28 for threads within the other warps. if the number of threads is between 417 and 448, 't2-t1' is 28 for all threads. with increasing number of threads per block the value of 't2-t1' also increases.<br /><br />the same happens if the number of thread blocks becomes larger than 16. <br /><br />for me it is not obvious what causes this behavior.<br /><br />does anyone have an idea?<br /><br />thanks]]></description>
   </item>
      <item>
      <title>CUDA Runtime Error 39</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5241/cuda-runtime-error-39</link>
      <pubDate>Wed, 29 Feb 2012 04:44:46 -0500</pubDate>
      <dc:creator>dyanwithu</dc:creator>
      <guid isPermaLink="false">5241@/devforum/discussions</guid>
      <description><![CDATA[Hi everybody,<br />I am getting CUDA runtime error 39<br />and my device is a Tesla C2070 with ECC enabled.<br /><br />Does anyone know the details of this error code? I just can't find it.<br /><br />thx]]></description>
   </item>
      <item>
      <title>System with 8 GPUs and 1 IOH chip</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1146/system-with-8-gpus-and-1-ioh-chip</link>
      <pubDate>Wed, 26 Oct 2011 04:18:00 -0400</pubDate>
      <dc:creator>Voronyuk Mikhail</dc:creator>
      <guid isPermaLink="false">1146@/devforum/discussions</guid>
      <description><![CDATA[Hello!<br /><br />In the webinar "Multi-GPU and Host Multi-Threading Considerations" by Dr Paulius Micikevicius (<a href="http://developer.download.nvidia.com/CUDA/training/cuda_webinars_multi_gpu.pdf" target="_blank" rel="nofollow">http://developer.download.nvidia.com/CUDA/training/cuda_webinars_multi_gpu.pdf</a>) there was an example of an 8-GPU system with only 1 IOH chip (p. 19), so it is possible to use GPUDirect technology for peer copies between any pair of devices.<br /><br />Could you please tell me exactly which system this is?<br /><br />Thanks in advance]]></description>
   </item>
      <item>
      <title>Overclocking/underclock</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5496/overclockingunderclock</link>
      <pubDate>Tue, 06 Mar 2012 10:26:29 -0500</pubDate>
      <dc:creator>jbraswe3</dc:creator>
      <guid isPermaLink="false">5496@/devforum/discussions</guid>
      <description><![CDATA[Is it possible to overclock/underclock the Tesla GPU programmatically?  I have read and seen things about using MSI Afterburner and other such programs to adjust the clock speeds, but for my purposes I would need my program to be able to adjust the speeds dynamically.<br /><br />Any info would be greatly appreciated.]]></description>
   </item>
      <item>
      <title>shall I install the developer drivers?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5096/shall-i-install-the-developer-drivers</link>
      <pubDate>Fri, 24 Feb 2012 05:18:18 -0500</pubDate>
      <dc:creator>martinaG</dc:creator>
      <guid isPermaLink="false">5096@/devforum/discussions</guid>
      <description><![CDATA[Hi all,<br /><br />I need to set up my workstation with CUDA. It already has the NVIDIA runtime installed:<br /><code><br />NVRM version: NVIDIA UNIX x86_64 Kernel Module  280.13  Wed Jul 27 16:53:56 PDT 2011<br /></code><br />I installed the toolkit 4.1, then I saw the <em>Developer Drivers Downloads</em> section. <br />What are they? Shall I install these drivers too?<br />In the installation guide, they suggest installing the drivers before installing the toolkit. <br /><br />So, if I need the dev drivers too, should I reinstall the toolkit?<br /><br />Thank you in advance for any help you can give me.<br /><br />Best,<br /><br />Martina<br /><br /><br />My OS: Ubuntu 11.10<br />My GPUs: Quadro FX 4800<br />         Tesla C1060]]></description>
   </item>
      <item>
      <title>NVIDIA Parallel Nsight 2.1 Release Candidate 2 now available!</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2171/nvidia-parallel-nsight-2-1-release-candidate-2-now-available</link>
      <pubDate>Mon, 05 Dec 2011 19:16:15 -0500</pubDate>
      <dc:creator>Sebastien Domine</dc:creator>
      <guid isPermaLink="false">2171@/devforum/discussions</guid>
      <description><![CDATA[<br /><p class="MsoNormal">NVIDIA Parallel Nsight 2.1 Release Candidate 2 now available! </p><br /><p class="MsoNormal">Dear Parallel Nsight User,</p><br /><p class="MsoNormal">Building on the NVIDIA Parallel Nsight™ 2.1 Release Candidate 1 release with multiple bug fixes and stability improvements, we are proud to announce the release of <b>NVIDIA Parallel Nsight™ 2.1 Release Candidate 2</b>. This release brings support for the new <b>CUDA Toolkit 4.1 </b>Release Candidate 2, which can be downloaded under the CUDA Registered Developer Program (<a href="http://www.developer.nvidia.com/join">www.developer.nvidia.com/join</a>). Parallel Nsight 2.1 adds a number of new features to enhance debugging and profiling capabilities. </p><br /><p class="MsoNormal">This release requires <b>NVIDIA Display Driver Release 285.86</b>, available on the same download site. </p><br /><ul style="list-style-type:disc;margin-top:0in;"><li class="MsoNormal" style="margin-bottom:.0001pt;"> Traced workloads can now <b>navigate the dependencies and call stack</b> to allow the developer to follow through GPU workloads, corresponding API calls and host code that was the cause of the activity.</li><li class="MsoNormal" style="margin-bottom:.0001pt;"><b>CUDA warp watch</b> visualizes variables and expressions across an entire CUDA warp.</li></ul>]]></description>
   </item>
      <item>
      <title>Typecasting to custom struct in kernel</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5161/typecasting-to-custom-struct-in-kernel</link>
      <pubDate>Sun, 26 Feb 2012 17:17:30 -0500</pubDate>
      <dc:creator>avion85</dc:creator>
      <guid isPermaLink="false">5161@/devforum/discussions</guid>
      <description><![CDATA[Since this is my first post, greetings to everyone.<br /><br />I've encountered a problem with casting to a custom struct in a kernel; hopefully someone else has been in the same situation.<br /><br />I'm passing a large float array as a parameter to a kernel in a .cu file, and I would like to cast it to a struct pointer and access it as an array of structures.<br /><br />pseudo-code:<br /><br />kernels.cu (with nvcc)<br /><code><br />struct myMatrix<br />{<br />	float e[6];<br />};<br />__global__ void myKernel(float *raw, myMatrix *p){<br /> int myID = blockIdx.x * blockDim.x + threadIdx.x;<br /><br /> myMatrix m = p[myID];	  //does not work - "???" in nsight for all values <br /><br /> myMatrix n =((myMatrix *)raw)[myID];     //does not work either - "???"<br /><br /> float a = raw[0];    //works and I get correct single float values, but unstructured<br /><br /> float4 b = ((float4*)raw)[0];  //works and I get correct tuples<br /><br />//what I want:<br />//myMatrix m = p[myID];<br />//float something = m.e[3];<br />}<br /></code><br /><br /><br />main.cu (with microsoft c compiler)<br /><code><br />float *p = [large array];<br />myKernel&lt;&lt;&lt;block,thread&gt;&gt;&gt;(p,(myMatrix*)p);<br /></code><br /><br />I am using Parallel Nsight to inspect the values, and what I get is "???" while stepping through the program. I have never had problems when using the built-in types like float4. However, I would of course like to have my own structures working properly.<br />Maybe the problem is in the alignment? If so, to which value do I align? <br /><br />Appreciate the help.<br />Avion<br /><br />PS. Working with Visual Studio, everything is 64-bit.<br /><br />EDIT: added another example that works.]]></description>
   </item>
      <item>
      <title>GPUs stopped working after being left idling</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1081/gpus-stopped-working-after-being-left-idling</link>
      <pubDate>Thu, 20 Oct 2011 13:03:01 -0400</pubDate>
      <dc:creator>Gary Chandler</dc:creator>
      <guid isPermaLink="false">1081@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />My university department bought a system with 2 Tesla 2050 GPUs about 1 year ago. They were working when new, but after leaving them idle for a few months they stopped working.  A few days after I first tried to access them after leaving them idle, they "woke up" and worked OK again. After leaving them idle again for a few months they have stopped working again.<br /><br />When they stop working it is like they do not exist. <br />lspci | grep -i nvidia   returns nothing.<br /><br />nvidia-smi   returns:<br />NVIDIA: could not open the device file /dev/nvidiactl (No such file or directory).<br />Failed to allocate an RM client<br />Could not allocate resources!<br /><br />From a CUDA code that calls CUFFT:<br />Cuda error in file 'cuda.cu' in line 11 : unspecified driver error.<br /><br />Does anyone know what this problem might be? and is there any way to "wake" them up straight away? Instead of having to wait a few days. <br /><br />My operating system is openSUSE 11.2 <br /><br />Thanks<br /><br />Gary<br />]]></description>
   </item>
      <item>
      <title>Performance Degradation 4.1 over 4.0</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4436/degradion-performance-4-1-over-4-0</link>
      <pubDate>Wed, 08 Feb 2012 09:29:40 -0500</pubDate>
      <dc:creator>kalman</dc:creator>
      <guid isPermaLink="false">4436@/devforum/discussions</guid>
      <description><![CDATA[Hi all,<br />because our application must not simply be fast but must also perform<br />some operations within fixed deadlines (we analyze a continuous radio signal),<br />we benchmark all our algorithms several times per day.<br /><br />We are seeing a clear performance degradation with CUDA 4.1 compared to the old CUDA 4.0.<br /><br />I have attached 4 images showing the historical performance data of 4 algorithms <br />(they are not the only affected ones, but they are the simplest for showing you the kernel code).<br /><br />For all graphs the reported time is in milliseconds (y-axis).<br /><br />All kernels are launched in this way:<br /><br />#define BLOCK_SIZE (1&lt;&lt;9)<br /><br />dim3 myThreads(BLOCK_SIZE);<br />dim3 myGrid( (aSize + BLOCK_SIZE - 1) / BLOCK_SIZE);<br /><br />Kernel&lt;&lt;&lt; myGrid, myThreads&gt;&gt;&gt;(.....);<br /><br />We have the C2050 cards with ECC off.<br /><br />============================================================================<br />Sum of two complex vectors (2^20 complex):  add_cc.png<br /><br />__global__ void<br />VectorVectorSumKernelCC_O(const float2* aIn1,<br />	                  const float2* aIn2,<br />                          float2* aOut,<br />                          const unsigned int aSize) {<br />  const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;<br />  if (myPos &lt; aSize) {<br />    aOut[myPos].x = aIn1[myPos].x + aIn2[myPos].x;<br />    aOut[myPos].y = aIn1[myPos].y + aIn2[myPos].y;<br />  }<br />}<br />============================================================================<br />Product of two complex vectors (2^20 complex): mul_cc.png<br /><br />__global__ void<br />MulKernel_cv_cv_o(const float2* aIn1,<br />                  const float2* aIn2,<br />                  float2* aOut,<br />                  const unsigned int aSize) {<br />  const unsigned int myPos = blockIdx.x * blockDim.x + 
threadIdx.x;<br />  if (myPos &lt; aSize) {<br />    const float myReal1 = aIn1[myPos].x;<br />    const float myReal2 = aIn2[myPos].x;<br />    const float myImag1 = aIn1[myPos].y;<br />    const float myImag2 = aIn2[myPos].y;<br />    aOut[myPos].x = myReal1 * myReal2 - myImag1 * myImag2;<br />    aOut[myPos].y = myReal1 * myImag2 + myImag1 * myReal2;<br />  }<br />}<br />============================================================================<br />Product of two complex vectors (2^20 complex), in place: mul_cc_i.png<br /><br />__global__ void<br />MulKernel_cv_cv_i(const float2* aIn,<br />                        float2* aInOut,<br />                  const unsigned int aSize) {<br />  const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;<br />  if (myPos &lt; aSize) {<br />    const float myTmp = aInOut[myPos].x;<br />    const float myInR = aIn[myPos].x;<br />    const float myInI = aIn[myPos].y;<br />    aInOut[myPos].x = myInR * aInOut[myPos].x - myInI * aInOut[myPos].y;<br />    aInOut[myPos].y = myInR * aInOut[myPos].y + myInI * myTmp;<br />  }<br />}<br />============================================================================<br />Tone generation (2^20 vector long): tone.png<br /><br />__global__ void<br />ComplexExpKernel(float2* aInOut,<br />                 const unsigned int aSize,<br />                 const float aMagnitude,<br />                 const float aNormalizedFrequency,<br />                 const float aInverseFrequency,<br />                 const float aPhase) {<br /><br />  const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;<br />  if (myPos &lt; aSize) {<br />    const float myArgument = aNormalizedFrequency * fmodf((float)myPos, aInverseFrequency) + aPhase;<br />    aInOut[myPos].x = aMagnitude * __cosf(myArgument);<br />    aInOut[myPos].y = aMagnitude * __sinf(myArgument);<br />  }<br />}<br />============================================================================ ]]></description>
   </item>
      <item>
      <title>kepler device memory architecture</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5016/kepler-device-memory-architecture</link>
      <pubDate>Wed, 22 Feb 2012 21:18:04 -0500</pubDate>
      <dc:creator>iourikarpov</dc:creator>
      <guid isPermaLink="false">5016@/devforum/discussions</guid>
      <description><![CDATA[Hello, with plenty of rumours swirling around about Kepler's number of cores, processors, etc., there has been very little info (that I could find) on the L1 memory size, shared and constant memory sizes, and other key metrics that are extremely important to developers.  Does anyone have any links, leaks, or ideas on what the Kepler architecture will bring on this front?<br /><br />Thanks in advance, Joe]]></description>
   </item>
      <item>
      <title>Out of memory on first CUDA runtime call</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4776/out-of-memory-on-first-cuda-runtime-call</link>
      <pubDate>Thu, 16 Feb 2012 06:51:05 -0500</pubDate>
      <dc:creator>chrism0dwk</dc:creator>
      <guid isPermaLink="false">4776@/devforum/discussions</guid>
      <description><![CDATA[Hi all,<br /><br />I've hit a strange bug while porting my CUDA code from my MacBook Pro (NVIDIA GeForce GT 330M) to my university's HPC gpu node with its two Tesla M2050s.  My code is as follows:  I have 2 classes defined as follows:<br /><br /><code><br />class MatLikelihood<br />{<br /> // Class to manage maths environment<br /> MatLikelihood(DataFromMain&amp; data)<br />  {<br />    gpu_ = new GpuLikelihood(data.data, data.number, data.moredata, data.etc);<br />    ...<br />  }<br />private:<br />  GpuLikelihood* gpu_;<br />}<br /><br />class GpuLikelihood<br />{<br />  // Class to manage GPU computation<br />  GpuLikelihood(float* data, int number, float* moredata, int etc)<br />  {<br />    int numDevices; cudaError_t err;<br />    err = cudaGetDeviceCount(&amp;numDevices);<br />    if (err != cudaSuccess) throw AnException("cudaGetDeviceCount bombs out",err);<br /><br />    &lt;do a bunch of cudaMallocs here&gt;<br />  }<br />}<br /></code><br /><br />These classes are in separate files (MatLikelihood.hpp/cpp and GpuLikelihood.hpp/cu), and I compile .cpp files with the system g++ and .cu files with nvcc, before linking with the system g++.  GpuLikelihood makes use of both cudablas_v2 and cusparse (v1) libraries.<br /><br />On my MacBook Pro (CUDA 4.1, GCC 4.2.1) all works well.  However, I have a problem on the HPC node (CUDA 4.0.17, GCC 4.3.4). Every single call I make to the CUDA runtime, even cudaGetDeviceCount, fails with error 2 "out of memory"!  This also occurs if I try to explicitly set the device id using cudaSetDevice.  cuda-memcheck doesn't tell me anything.  cuda-gdb gives me an error "The CUDA driver failed initialization. (error code = 20)".<br /><br />The interesting thing is that if I write a standalone CUDA test code (eg. call cudaGetDeviceCount, do a cudaMalloc or two, and end with cudaFrees) in one .cu file, all works fine.  
I'm wondering if my problem is anything to do with my class hierarchy, or the separate gcc linking?<br /><br />Any ideas?<br /><br />Thanks,<br /><br />Chris<br /><br /><br />]]></description>
   </item>
      <item>
      <title>Developing Windows applications for the Tesla GPU</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4526/developing-windows-applications-for-the-tesla-gpu</link>
      <pubDate>Fri, 10 Feb 2012 06:51:21 -0500</pubDate>
      <dc:creator>paysonwelch</dc:creator>
      <guid isPermaLink="false">4526@/devforum/discussions</guid>
      <description><![CDATA[Greetings I have a very quick question. I am a Windows application developer who needs to crunch lots of data.  Can anyone tell me if it is relatively easy to write code that can be offloaded to a Tesla GPU using Visual Studio? <br /><br />Specifically my application has a GUI front-end, at the very least is it possible to compile a C++ library that can be included without much hassle?<br /><br />Also any tips or suggestions for research would be greatly appreciated. Thanks in advance.]]></description>
   </item>
      <item>
      <title>Upgrading an S870</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4576/upgrading-an-s870</link>
      <pubDate>Sat, 11 Feb 2012 18:50:29 -0500</pubDate>
      <dc:creator>paysonwelch</dc:creator>
      <guid isPermaLink="false">4576@/devforum/discussions</guid>
      <description><![CDATA[I'm looking at an S870 1U rackmount enclosure with 4xC870 GPU cards.  I am wondering if it is possible to swap out the C870 cards for something with a little more punch.  Does anyone know if this is possible?  E.g. if I replace the C870 cards will the S870 accept new ones and continue to function properly?  <br /><br />]]></description>
   </item>
      <item>
      <title>Concurrent Kernel Execution - Scheduling Mechanism</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4356/concurrent-kernel-execution-scheduling-mechanism</link>
      <pubDate>Mon, 06 Feb 2012 16:51:07 -0500</pubDate>
      <dc:creator>fwende</dc:creator>
      <guid isPermaLink="false">4356@/devforum/discussions</guid>
      <description><![CDATA[currently, i'm doing some experiments on how to efficiently adapt concurrent kernel execution to my projects. to get a deeper insight into how kernels are scheduled on the gpu/device, i played around with a small kernel doing some simple math on a vector of O(10^6) floats. the kernel itself is designed such that it uses just one threadblock consisting of 256 threads, so that each kernel can be mapped onto one multiprocessor.<br /><br />if i now run n=32 of these kernels (each of them manipulating its own vector) using the same stream, i get a total runtime for all kernels which is t_0=432ms. running all 32 kernels using 32 different streams gives a total runtime t_1=28ms, which is approximately t_0/15, which is close to 16 (i run the program on a tesla m2090, which is capable of running up to 16 kernels concurrently). using just 16 different streams also results in t_1=28ms, which is clear, as the device allows for at most 16 concurrent kernels at the same time.<br /><br /><code><br />#define NUM_KERNELS 32<br />#define NUM_STREAMS 16<br />#define NUM_VEC_ELEMENTS ( 1024*1024 )<br /><br />static float *d_array[NUM_KERNELS];<br />static float *h_array[NUM_KERNELS];<br /><br />static __global__ void kernel( float *ptr ) {<br />  // do some math<br />}<br /><br />int main() {<br /><br />  cudaStream_t <br />    streamPool[NUM_KERNELS],<br />    *stream[NUM_KERNELS];<br /><br />  // create arrays on device &amp; host.<br />  for( int i=0; i&lt;NUM_KERNELS; i++ ) {<br />    cudaMalloc( (void **) &amp;d_array[i], NUM_VEC_ELEMENTS*sizeof( float ) );<br />    h_array[i] = new float[NUM_VEC_ELEMENTS];<br />  }<br /><br />  // fill 'h_array' at random using drand48(),<br />  // and copy data to device ('d_array').<br /><br />  // create 'NUM_KERNELS' streams.<br 
/>  for( int i=0; i&lt;NUM_KERNELS; i++ )<br />    cudaStreamCreate( &amp;streamPool[i] );<br /><br />  // map streams onto 'streamPool' using this pattern:<br />  // 0 1 2 3 ... 15 0 1 2 3 ... 15<br />  for( int i=0; i&lt;NUM_KERNELS; i++ )<br />    stream[i] = &amp;streamPool[i%NUM_STREAMS];<br /><br />  // start timer, and then run kernels.<br />  for( int i=0; i&lt;NUM_KERNELS; i++ )<br />    kernel&lt;&lt;&lt; 1, 256, 0, ( *stream[i] ) &gt;&gt;&gt;( d_array[i] );<br /><br />  // cudaDeviceSynchronize &amp; stop timer.<br /><br />  // copy data from device in order to check results.<br /><br />  // free memory.<br /><br />  return 0;<br /><br />}<br /></code><br /><br />up to this point, all is well. if i now switch to the following stream mapping<br /><br /><code><br />// code as above<br /><br />int main() {<br />  ...<br />  // map streams onto 'streamPool' using this pattern.<br />  // 0 0 1 1 2 2 3 3 ... 15 15<br />  for( int i=0; i&lt;NUM_KERNELS; i++ )<br />    stream[i] = &amp;streamPool[i/( NUM_KERNELS/NUM_STREAMS )];<br />  ...<br />}<br /></code><br /><br />the total runtime is t_2=230ms, which gives a speedup of almost a factor 2 compared with serial kernel execution. my understanding of this is as follows:<br /><br />kernels sent to gpu are placed into some kind of gpu-internal queue (i have a dequeue in mind). say we have some kind of scheduler on the gpu, then this scheduler takes kernels from the queue's front and makes them run on any available multiprocessor. if successive kernels in the queue should run on different streams, the scheduler is free to run them concurrently on the device, but at most 16 of them. if two (or more) successive kernels should run on the same stream, the scheduler takes the first of them and then stops dequeuing kernels from the queue, since they may be not independent (it also stops dequeuing even if after the stop-dequeuing-kernel there is a kernel that should run on a different stream). 
it restarts dequeuing kernels once the previous kernel (the one that made the scheduler stop dequeuing) has finished. <br />with respect to my second code sample (which gives a speedup of almost 2, although 16 different streams are used), this would explain the bad performance. instead of 16 kernels, there are just 2 kernels that can run concurrently on the device. all in all, this happens 15 times, so that the speedup over serial execution is 32/(32-32/2+1)=1.88 -&gt; execution time t_2=(432/1.88)ms=230ms.<br /><br />my question now is: are my considerations right? is there a scheduler on the device that acts similarly to what i described above?<br /><br />if yes: the speedup that can be achieved using concurrent kernel execution depends significantly on the order in which kernels are sent to the gpu/device.]]></description>
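<!--
A minimal sketch of the queue model hypothesized in the post above, written as plain host-side C++ so it can be checked without a GPU. It is not a description of the real hardware scheduler. The names countRounds, roundRobinMapping and blockedMapping are hypothetical, and the model assumes that every kernel takes the same time and that a round ends as soon as the next queued kernel targets a stream that is already busy.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Stream pattern 0 1 2 ... (numStreams-1) 0 1 2 ..., as in the first listing.
// Assumes numStreams > 0.
std::vector<int> roundRobinMapping(std::size_t numKernels, std::size_t numStreams) {
    std::vector<int> streamOf(numKernels);
    for (std::size_t i = 0; i < numKernels; ++i)
        streamOf[i] = static_cast<int>(i % numStreams);
    return streamOf;
}

// Stream pattern 0 0 1 1 2 2 ..., as in the second listing.
// Assumes numStreams > 0 and numKernels is a positive multiple of numStreams.
std::vector<int> blockedMapping(std::size_t numKernels, std::size_t numStreams) {
    std::vector<int> streamOf(numKernels);
    const std::size_t perStream = numKernels / numStreams;
    for (std::size_t i = 0; i < numKernels; ++i)
        streamOf[i] = static_cast<int>(i / perStream);
    return streamOf;
}

// Hypothetical model: kernels leave a single in-order queue. A round keeps
// dequeuing until maxConcurrent kernels are running or the kernel at the
// front targets a stream that is already busy in this round.
// Returns the number of rounds, i.e. the runtime in units of one kernel.
std::size_t countRounds(const std::vector<int>& streamOf, std::size_t maxConcurrent) {
    if (maxConcurrent == 0) return 0;  // nothing can run; avoid looping forever
    std::size_t rounds = 0;
    std::size_t next = 0;
    while (next < streamOf.size()) {
        std::set<int> busy;  // streams with a kernel running in this round
        while (next < streamOf.size() && busy.size() < maxConcurrent &&
               busy.count(streamOf[next]) == 0) {
            busy.insert(streamOf[next]);
            ++next;
        }
        ++rounds;
    }
    return rounds;
}
```

Under these assumptions, 32 kernels with at most 16 running at once take 2 rounds with the round-robin pattern, 17 rounds with the 0 0 1 1 ... pattern, and 32 rounds on a single stream. The middle case corresponds to the 32/17 = 1.88 estimate in the post.
-->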
   </item>
      <item>
      <title>Tesla multi copy not as fast as expected</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4421/tesla-multi-copy-not-as-fast-as-expected</link>
      <pubDate>Wed, 08 Feb 2012 04:37:54 -0500</pubDate>
      <dc:creator>paulvisschers</dc:creator>
      <guid isPermaLink="false">4421@/devforum/discussions</guid>
      <description><![CDATA[When I run the simpleMultiCopy in the SDK (4.0) on the Tesla C2050 I get the following results:<br /><code>[simpleMultiCopy] starting...<br />[Tesla C2050] has 14 MP(s) x 32 (Cores/MP) = 448 (Cores)<br />&gt; Device name: Tesla C2050<br />&gt; CUDA Capability 2.0 hardware with 14 multi-processors<br />&gt; scale_factor = 1.00<br />&gt; array_size   = 4194304<br /><br /><br />Relevant properties of this CUDA device<br />(X) Can overlap one CPU&lt;&gt;GPU data transfer with GPU kernel execution (device property "deviceOverlap")<br />(X) Can overlap two CPU&lt;&gt;GPU data transfers with GPU kernel execution<br />    (compute capability &gt;= 2.0 AND (Tesla product OR Quadro 4000/5000)<br /><br />Measured timings (throughput):<br /> Memcpy host to device	: 2.725792 ms (6.154988 GB/s)<br /> Memcpy device to host	: 2.723360 ms (6.160484 GB/s)<br /> Kernel			: 0.611264 ms (274.467599 GB/s)<br /><br />Theoretical limits for speedup gained from overlapped data transfers:<br />No overlap at all (transfer-kernel-transfer): 6.060416 ms <br />Compute can overlap with one transfer: 5.449152 ms<br />Compute can overlap with both data transfers: 2.725792 ms<br /><br />Average measured timings over 10 repetitions:<br /> Avg. time when execution fully serialized	: 6.113555 ms<br /> Avg. time when overlapped using 4 streams	: 4.308822 ms<br /> Avg. speedup gained (serialized - overlapped)	: 1.804733 ms<br /><br />Measured throughput:<br /> Fully serialized execution		: 5.488530 GB/s<br /> Overlapped using 4 streams		: 7.787379 GB/s<br />[simpleMultiCopy] test results...<br />PASSED</code><br />This shows that the expected runtime is 2.7 ms, while it actually takes 4.3. What is it exactly that causes this discrepancy?]]></description>
   </item>
      <item>
      <title>On streams and asynchronous execution</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3496/on-streams-and-asynchronous-execution</link>
      <pubDate>Mon, 16 Jan 2012 03:32:28 -0500</pubDate>
      <dc:creator>lucana</dc:creator>
      <guid isPermaLink="false">3496@/devforum/discussions</guid>
      <description><![CDATA[This question is just to make sure I understand correctly how CUDA streams work. <br /><br />Imagine I have a for loop like this. I am using only one stream.<br /><br />for (i=0; i &lt; N; i++)<br />{<br />	 run operations on CPU <br />	 copy results of CPU operations to CUDA kernel with cudaMemcpyAsync<br />	 call kernel &lt;&lt;&lt;  , &gt;&gt;&gt;<br />}<br /><br />My understanding is that the kernel for i and the CPU operations for i+1 at the beginning of the loop will execute concurrently, but the kernel won't start for i+1 until the CPU has finished computing results for i+1.<br /><br />Is this right? Or will the operations on CPU and GPU never overlap? Will the kernel start before the CPU has computed the proper results? Is it necessary to add some control flags to make sure the operations on the CPU have finished before the kernel starts?<br /><br />This diagram shows what I want to do. In fact it is a pipeline, but I'm still unsure if it is possible with CUDA. <br /><br />----------i = 0 -------------------- i = 1 --------------------------- i = 2<br />(t0) compute results on CPU<br />(t1) copy results to CUDA kernel -- compute results on CPU<br />(t2) execute kernel --------------- copy results to CUDA kernel -- compute results on CPU<br />(t3) ------------------------------ execute kernel --------------- copy results to CUDA kernel <br />(t4)-------------------------------------------------------------- execute kernel<br /><br />Finally, I would like to ask if it makes sense to use CUDA streams when there is a data dependency between streams, with a pipeline like the one shown above. ]]></description>
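<!--
A minimal host-side C++ sketch of the timing implied by the diagram in the post above, assuming an idealized three-stage pipeline (CPU work, copy, kernel) in which each stage runs on its own resource and takes the same time in every iteration. The function names are hypothetical, and the sketch says nothing about whether a given CUDA program actually achieves this overlap.

```cpp
#include <algorithm>

// Total time when every iteration runs CPU work, copy and kernel back to back.
double serializedTime(int iterations, double cpu, double copy, double kernel) {
    if (iterations <= 0) return 0.0;
    return iterations * (cpu + copy + kernel);
}

// Ideal pipeline: the first iteration passes through all three stages,
// and every later iteration adds only the duration of the slowest stage.
double pipelinedTime(int iterations, double cpu, double copy, double kernel) {
    if (iterations <= 0) return 0.0;
    const double slowest = std::max(std::max(cpu, copy), kernel);
    return (cpu + copy + kernel) + (iterations - 1) * slowest;
}
```

For the diagram's case of 3 iterations with stages of equal length (one time slot each), pipelinedTime gives 5 slots, matching t0 through t4, while serializedTime gives 9.
-->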
   </item>
      <item>
      <title>Can we change the number of running cores in Tesla Card?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4021/can-we-change-the-number-of-running-cores-in-tesla-card</link>
      <pubDate>Thu, 26 Jan 2012 14:14:30 -0500</pubDate>
      <dc:creator>hugepuff</dc:creator>
      <guid isPermaLink="false">4021@/devforum/discussions</guid>
      <description><![CDATA[I want to know how to change the number of running cores on a Tesla card.<br />When you launch a kernel, you specify how many threads and stream processors you are going to use.<br />However, if you have a 480-core GPU card and you are running a program with only 120 threads, what is the status of the other 360 stream processors?<br />Are they automatically shut down by NVIDIA's power-saving strategy to save power? ]]></description>
   </item>
      <item>
      <title>Kernel panics with 285.5.32 driver</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4171/kernel-panics-with-285-5-32-driver</link>
      <pubDate>Tue, 31 Jan 2012 12:12:07 -0500</pubDate>
      <dc:creator>jlong</dc:creator>
      <guid isPermaLink="false">4171@/devforum/discussions</guid>
      <description><![CDATA[I installed the 285.5.32 driver pointed to by the Cuda 4.1<br />download page on to my Dell C6145 compute node.  It is connected<br />to eight M2070 GPUs through the Dell C410X chassis.  The driver<br />installed without error, loads without issues, and created the<br />/dev/nvidia* files as expected.    When I run something like<br />nvidia-smi -L, the system immediately kernel panics.<br /><br /><br />kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000048<br /><br />Has anyone else experienced this issue?]]></description>
   </item>
      <item>
      <title>Nvidia X11 driver does not allow stereo with Tesla M2090.</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3941/nvidia-x11-driver-does-not-allow-stereo-with-tesla-m2090-</link>
      <pubDate>Wed, 25 Jan 2012 08:57:38 -0500</pubDate>
      <dc:creator>tripiana</dc:creator>
      <guid isPermaLink="false">3941@/devforum/discussions</guid>
      <description><![CDATA[Hi all,<br /><br />I'm trying to enable the stereo 3D using a Tesla M2090 but the Nvidia X11 driver doesn't allow me to do that.<br /><br />Here are some parts of the Xorg log:<br /><code>[...]<br />(**) NVIDIA(0): Option "NoLogo" "true"<br />(**) NVIDIA(0): Option "Stereo" "3"<br />(**) NVIDIA(0): Option "NvAGP" "0"<br />(**) NVIDIA(0): Option "UseEDID" "false"<br />(**) NVIDIA(0): Option "MetaModes" "1920x1200_60 +0+0"<br />(==) Jan 25 13:55:41 NVIDIA(0): Using HW cursor<br />(**) Jan 25 13:55:41 NVIDIA(0): Onboard stereo requested (DIN connector)<br />(==) Jan 25 13:55:41 NVIDIA(0): Video key set to default value of 0x101fe<br />(==) Jan 25 13:55:41 NVIDIA(0): Not mapping the primary surface by default.<br />(**) Jan 25 13:55:41 NVIDIA(0): Use of AGP disabled per request<br />(**) Jan 25 13:55:41 NVIDIA(0): Ignoring EDIDs<br />(II) Jan 25 13:55:42 NVIDIA(0): Implicitly enabling NoScanout<br />(II) Jan 25 13:55:42 NVIDIA(0): NVIDIA GPU Tesla M2090 (GF110) at PCI:2:0:0 (GPU-0)<br />[...]<br />(--) Jan 25 13:55:42 NVIDIA(0): Connected display device(s) on Tesla M2090 at PCI:2:0:0<br />(--) Jan 25 13:55:42 NVIDIA(0):     none<br />(II) Jan 25 13:55:42 NVIDIA(0): NoScanout X screen configured with resolution 1920x1200 (from<br />(II) Jan 25 13:55:42 NVIDIA(0):     Virtual X configuration option)<br />(II) Jan 25 13:55:42 NVIDIA(0): Validated modes:<br />(II) Jan 25 13:55:42 NVIDIA(0): MetaMode "nvidia-auto-select":<br />(II) Jan 25 13:55:42 NVIDIA(0):     Bounding Box: [0, 0, 1920, 1200]<br />(**) Jan 25 13:55:42 NVIDIA(0): Virtual screen size configured to be 1920 x 1200<br />(==) Jan 25 13:55:42 NVIDIA(0): DPI set to (75, 75); computed from built-in default<br />(WW) Jan 25 13:55:42 NVIDIA(0): Stereo not supported with NoScanout; disabling Stereo.<br />[...]<br />(**) NVIDIA(1): Option "NoLogo" "true"<br />(**) NVIDIA(1): Option "Stereo" "3"<br />(**) NVIDIA(1): Option "NvAGP" "0"<br />(**) NVIDIA(1): Option "UseEDID" "false"<br />(**) 
NVIDIA(1): Option "MetaModes" "1920x1200_60 +0+0"<br />(==) Jan 25 13:55:42 NVIDIA(1): Using HW cursor<br />(**) Jan 25 13:55:42 NVIDIA(1): Onboard stereo requested (DIN connector)<br />(==) Jan 25 13:55:42 NVIDIA(1): Video key set to default value of 0x101fe<br />(==) Jan 25 13:55:42 NVIDIA(1): Not mapping the primary surface by default.<br />(**) Jan 25 13:55:42 NVIDIA(1): Use of AGP disabled per request<br />(**) Jan 25 13:55:42 NVIDIA(1): Ignoring EDIDs<br />(II) Jan 25 13:55:43 NVIDIA(0): Implicitly enabling NoScanout<br />(II) Jan 25 13:55:43 NVIDIA(1): NVIDIA GPU Tesla M2090 (GF110) at PCI:131:0:0 (GPU-1)<br />[...]<br />(--) Jan 25 13:55:43 NVIDIA(1): Connected display device(s) on Tesla M2090 at PCI:131:0:0<br />(--) Jan 25 13:55:43 NVIDIA(1):     none<br />(II) Jan 25 13:55:43 NVIDIA(1): NoScanout X screen configured with resolution 1920x1200 (from<br />(II) Jan 25 13:55:43 NVIDIA(1):     Virtual X configuration option)<br />(II) Jan 25 13:55:43 NVIDIA(1): Validated modes:<br />(II) Jan 25 13:55:43 NVIDIA(1): MetaMode "nvidia-auto-select":<br />(II) Jan 25 13:55:43 NVIDIA(1):     Bounding Box: [0, 0, 1920, 1200]<br />(**) Jan 25 13:55:43 NVIDIA(1): Virtual screen size configured to be 1920 x 1200<br />(==) Jan 25 13:55:43 NVIDIA(1): DPI set to (75, 75); computed from built-in default<br />(WW) Jan 25 13:55:43 NVIDIA(1): Stereo not supported with NoScanout; disabling Stereo.<br />[...]<br /></code><br /><br />"Stereo not supported with NoScanout; disabling Stereo". That's the problem. "NoScanout" is enabled by default. "NoScanout" is also possible to be enabled using "UseDisplayDevice" "none". 
For this reason I tried "UseDisplayDevice" "DFP-0" with no result.<br /><br />After that, I looked with nvidia-settings for more info and I found this:<br /><br /><code>Attribute 'ConnectedDisplays' (nvb119:0.0): 0x00000000.<br />    'ConnectedDisplays' is a bitmask attribute.<br />    'ConnectedDisplays' is a read-only attribute.<br />    'ConnectedDisplays' can use the following target types: X Screen, GPU.<br /></code><br /><br />I suppose the problem is that there are physically no display connectors, and the driver cannot turn stereo on for virtual displays.<br /><br />I think it should be possible, but I don't know whether I have missed something or I'm dealing with an unimplemented feature. Can anyone help me, please? It is very important for my company: we are trying to enable stereo on a remote cluster.<br /><br />Thanks in advance!]]></description>
   </item>
      <item>
      <title>Need your expertise?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3901/need-your-expertise</link>
      <pubDate>Tue, 24 Jan 2012 15:43:49 -0500</pubDate>
      <dc:creator>mabouali</dc:creator>
      <guid isPermaLink="false">3901@/devforum/discussions</guid>
      <description><![CDATA[Hi, <br />I have a CUDA code (a __global__ function which calls a couple of __device__ functions). Every variable is float, all the constants have f at the end (for example, 0.4f), and I have replaced all the sin, exp, ... functions with sinf, expf, ...<br />Yet when I compile it with -arch=sm_11, it warns me that double is not supported and is being demoted to float, and the line number given is in the .ptx file, not the .cu file. I tried looking at the .ptx file, but had no luck there.<br />I am using nvcc 4.0 along with gcc 4.2.1 on a Mac OS X 10.6.8 system, but I get the same warning on CentOS Linux.<br />The code is doing fine (amazing speedup, more than 86 times faster). It does what it should do and the output is correct. My concern is that I have to compile it with sm_11 even though I am running it on a Tesla M1060. If I compile it with sm_13, I of course don't get the warning, but the code runs slower (obviously something is using double there).<br />Any clues on how I can pinpoint where the double operation is performed, considering I have changed everything to float? I appreciate your help.]]></description>
   </item>
      <item>
      <title>NSIGHT doesn&#039;t let me choose threads with id greater than 15.</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3696/nsight-doesnt-let-me-choose-threads-with-id-greater-than-15-</link>
      <pubDate>Thu, 19 Jan 2012 10:41:02 -0500</pubDate>
      <dc:creator>lucana</dc:creator>
      <guid isPermaLink="false">3696@/devforum/discussions</guid>
      <description><![CDATA[I have managed to stop CUDA debugging at breakpoints. I'm working with VS2010. I can use the Debug Focus to select threads and blocks to follow, but I can't select just any of the threads/blocks I defined. The grid and block dimensions shown there are wrong. For example, I launched 1024 threads (kernel&lt;&lt;&lt;1, 1024&gt;&gt;&gt;), but it only lets me choose up to thread number 15. Is this normal? Am I doing something wrong? ]]></description>
   </item>
      <item>
      <title>Different results between debug and release version when working in VS2010</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3691/different-results-between-debug-and-release-version-when-working-in-vs2010</link>
      <pubDate>Thu, 19 Jan 2012 10:32:58 -0500</pubDate>
      <dc:creator>lucana</dc:creator>
      <guid isPermaLink="false">3691@/devforum/discussions</guid>
      <description><![CDATA[I have a CUDA application which runs on a Tesla C2075. I'm using CUDA toolkit 4.0 with 285.86 drivers. <br /><br />I'm experiencing strange problems. When I run my application in debug mode (no CUDA debugging, only host debugging) things work. When I try to run the release version, strange results arise. Has anyone experienced something similar? Do any of you know what the problem could be? <br /><br />Also, has anyone had good results working with CUDA in the VS2010 environment? By good I mean accurate from a scientific point of view. Would you recommend switching to a Linux environment with the gcc compiler?<br /><br />Thanks for your help.]]></description>
   </item>
      <item>
      <title>Hardware H264 encoding/decoding</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3606/hardware-h264-encodingdecoding</link>
      <pubDate>Wed, 18 Jan 2012 05:39:53 -0500</pubDate>
      <dc:creator>jankowalski8234</dc:creator>
      <guid isPermaLink="false">3606@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />I am the developer of an app which allows live video streaming. I plan to spend some money on NVIDIA Tesla GPUs to do the video encoding, but I am not convinced it is a good choice.<br /><br />Here are the details. I have source stream frames already decoded and I am encoding them again using an open source H264 codec. This work is really time consuming for the CPU, and CPU time is relatively expensive. This is why I want to move the hard work to a high-end GPU, and I believe it should be possible, with a small amount of work, to replace the open source codec with an 'H264 GPU API' of some sort.<br /><br />I know there is a Windows API for the H264 hardware codec, and an example. It works fine; however, I cannot run a second instance of this application. Is this a hardware limitation? Also, I know I can run up to 4 NVIDIA CUDA kernels, but is this problem related to that limit? The Windows platform is also not a great choice for me, and there is no such video encoding API on the Linux platform. How much must I pay to obtain such an API for Linux?<br /><br />How should I accomplish this? Real-time H264 encoding on the GPU is the primary goal, but the problem is that, in my opinion, I cannot have more video streams than physical GPU cards.<br /><br />If there is no way NVIDIA could license this API for Linux, then which commercial GPU H264 codec would you recommend for such a solution?<br /><br />I hope someone can clear this up for me. Help! ]]></description>
   </item>
      <item>
      <title>driver for GTX 295 + tesla?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/2106/driver-for-gtx-295-tesla</link>
      <pubDate>Sat, 03 Dec 2011 14:54:30 -0500</pubDate>
      <dc:creator>haynor</dc:creator>
      <guid isPermaLink="false">2106@/devforum/discussions</guid>
      <description><![CDATA[hi gang,<br /><br />i have a 64-bit Win7 system with GTX 295 graphics card and 2 Tesla (C1060) cards.  if i uninstall the nvidia driver, i see all the cards.  if i install the current 270.81 driver, though, the Teslas disappear from Device Manager and aren't accessible in CUDA.  any advice?  anyone with a similar system -- if so, which drivers do you use?<br /><br />thanks in advance.]]></description>
   </item>
      <item>
      <title>Bad performance with peer to peer memory transfer</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1736/bad-performance-with-peer-to-peer-memory-transfer</link>
      <pubDate>Tue, 22 Nov 2011 18:04:14 -0500</pubDate>
      <dc:creator>Pyt</dc:creator>
      <guid isPermaLink="false">1736@/devforum/discussions</guid>
      <description><![CDATA[Hello everyone,<br /><br />I am currently using a multi-GPU cluster equipped with 3 Tesla M2090 cards. <br />I wanted to monitor the speed at which data is transferred from one GPU to another.<br />I used the "simpleP2P" code from the SDK and modified it to obtain the transfer rate for all GPUs (see code attached - you'll need to rename it to .cu). The memory allocated is still 64 MB, though I tried with bigger arrays (128,1024,2048 and 4096 MB - the results are still the same). <br />The arrays are transferred a hundred times. I tried to transfer memory with cudaMemcpy() and cudaMemcpyPeer() but obviously both give the same results.<br /><br />I get the following speeds, which are not that good for a GPU to GPU transfer rate:<br /><code>cudaMemcpy between GPU0 and GPU1: 316.54MB/s<br />cudaMemcpy between GPU0 and GPU2: 316.55MB/s<br />cudaMemcpy between GPU1 and GPU0: 316.86MB/s<br />cudaMemcpy between GPU2 and GPU0: 316.93MB/s<br />cudaMemcpy between GPU1 and GPU2: 3699.74MB/s<br />cudaMemcpy between GPU2 and GPU1: 3669.23MB/s<br /></code> <br />According to Dr. 
Micikevicius in the multi-GPU webinar, speeds of roughly 6.6 GB/s can be achieved with transfers through PCI-e, so the results here are quite puzzling.<br /><br />Additional details:<br />I compiled my code with the flags -O3 and -m 64.<br />All the drivers are up to date (version 4.0 for CUDA).<br />As for the operating system, uname -a gives me the following:<br /><code>2.6.32-131.17.1.el6.x86_64 <a href="/devforum/search?Search=%231&amp;Mode=like">#1</a> SMP Thu Sep 29 10:24:25 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux</code><br /><br />The deviceQuery program from the SDK gives the following output:<br /><code>Found 3 CUDA Capable device(s)<br /><br />Device 0: "Tesla M2090"<br />  CUDA Driver Version / Runtime Version          4.0 / 4.0<br />  CUDA Capability Major/Minor version number:    2.0<br />  Integrated GPU sharing Host Memory:            No<br />  Device has ECC support enabled:                Yes<br />  Device is using TCC driver mode:               No<br />  Device supports Unified Addressing (UVA):      Yes<br />  Device PCI Bus ID / PCI location ID:           8 / 0<br /><br />Device 1: "Tesla M2090"<br />  CUDA Driver Version / Runtime Version          4.0 / 4.0<br />  CUDA Capability Major/Minor version number:    2.0<br />  Integrated GPU sharing Host Memory:            No<br />  Device has ECC support enabled:                Yes<br />  Device is using TCC driver mode:               No<br />  Device supports Unified Addressing (UVA):      Yes<br />  Device PCI Bus ID / PCI location ID:           27 / 0<br /><br />Device 2: "Tesla M2090"<br />  CUDA Driver Version / Runtime Version          4.0 / 4.0<br />  CUDA Capability Major/Minor version number:    2.0<br />  Integrated GPU sharing Host Memory:            No<br />  Device has ECC support enabled:                Yes<br />  Device is using TCC driver mode:               No<br />  Device supports Unified Addressing (UVA):      Yes<br />  Device PCI Bus ID / PCI location ID:           21 / 0<br 
/></code><br /><br />The output of the command lspci -tvv can be found attached. It seems that two of the cards are on the same IOH chip but the third is not, which would explain a transfer rate of 315 MB/s.<br />However, it does not explain a transfer rate of only 3.7 GB/s between the other two. Maybe the quality of the PCI-e connection is not very good?<br /><br />What would be a possible explanation for such a low speed?<br /><br />Thank you for your answers.<br />]]></description>
   </item>
      <item>
      <title>OpenMPI + CUDA usage</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1766/openmpi-cuda-usage</link>
      <pubDate>Wed, 23 Nov 2011 09:00:59 -0500</pubDate>
      <dc:creator>ph0enix</dc:creator>
      <guid isPermaLink="false">1766@/devforum/discussions</guid>
      <description><![CDATA[Hello!<br /><br />I want to try GPUDirect with Open MPI with the CUDA extension, but I'm a little confused. The FAQ says that cuInit() and cuCtxCreate() should be called prior to MPI_Init().<br /><br />But how can I do this if I need to bind a GPU in each process after I call MPI_Init()? Before this call, the other processes aren't available.]]></description>
   </item>
      <item>
      <title>NVIDIA Parallel Nsight 2.1 Release Candidate 1 now available for download!</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1431/nvidia-parallel-nsight-2-1-release-candidate-1-now-available-for-download</link>
      <pubDate>Thu, 10 Nov 2011 20:50:50 -0500</pubDate>
      <dc:creator>Sebastien Domine</dc:creator>
      <guid isPermaLink="false">1431@/devforum/discussions</guid>
      <description><![CDATA[Dear Parallel Nsight Users,<br /><br />NVIDIA is proud to announce the new release of NVIDIA Parallel Nsight™ 2.1 Release Candidate 1. This new release brings support for the new CUDA Toolkit 4.1 Release Candidate 1, which can be downloaded under the CUDA Registered Developer Program (www.developer.nvidia.com/join). Parallel Nsight 2.1 also adds new features for an enhanced CUDA debugging and profiling experience, and new features for DirectX graphics developers such as the ability to edit shaders, while the application is running, and measure drawcall timings for more advanced profiling analysis. <br /><br />Parallel Nsight 2.1 RC1 can be downloaded from <a href="http://parallelnsight.nvidia.com/content/parallel-nsight-early-access" target="_blank" rel="nofollow">http://parallelnsight.nvidia.com/content/parallel-nsight-early-access</a>, and requires Driver Release 285.67, available on the same download site.<br /><br />For a complete list of the new exciting 2.1 features, go to <a href="http://parallelnsight.nvidia.com/content/parallel-nsight-early-access" target="_blank" rel="nofollow">http://parallelnsight.nvidia.com/content/parallel-nsight-early-access</a>.<br /><br />We recommend developers to give this new release an early trial and send us feedback via the developer tools forums or report issues by emailing ParallelNsight-Support@nvidia.com. <br /><br />The NVIDIA Developer Tools Team<br />]]></description>
   </item>
      <item>
      <title>How do I see/control which programs use which GPUs?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1091/how-do-i-seecontrol-which-programs-use-which-gpus</link>
      <pubDate>Thu, 20 Oct 2011 13:21:38 -0400</pubDate>
      <dc:creator>Gary Chandler</dc:creator>
      <guid isPermaLink="false">1091@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />I have a system with 2 Tesla 2050 GPUs and Matrox G200eW onboard graphics. Obviously, when I run CUDA code I can use cudaSetDevice() to specify which GPU to use for computations, but how do I know which GPU is being used for normal graphics? Will running computations on the Tesla GPUs affect other programs that put graphics on the screen? I also use visualization software to render images from my data. Will running computations on the Tesla GPUs affect rendering? Does rendering automatically use one of the Tesla GPUs?<br /><br />Is it possible to see and control which programs use which GPUs? <br /><br />I access the system with the Tesla cards remotely from my desktop with ssh -X. Does this change the graphics behaviour in any way?<br /><br />Thanks<br /><br />Gary<br />  <br />]]></description>
   </item>
      <item>
      <title>Question about race condition</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1096/question-about-race-condition</link>
      <pubDate>Thu, 20 Oct 2011 21:04:08 -0400</pubDate>
      <dc:creator>salahsaleh</dc:creator>
      <guid isPermaLink="false">1096@/devforum/discussions</guid>
      <description><![CDATA[Hello!<br />In situations where different threads access the same shared memory location (read-only), how much overhead will I notice compared to the overhead in a read/write situation?]]></description>
   </item>
      <item>
      <title>GPU doesn&#039;t work in exclusive compute mode</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/1086/gpu-doesnt-work-in-exclusive-compute-mode</link>
      <pubDate>Thu, 20 Oct 2011 13:08:40 -0400</pubDate>
      <dc:creator>Gary Chandler</dc:creator>
      <guid isPermaLink="false">1086@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />I have a system with 2 Tesla 2050 GPUs. They both work fine in shared compute mode (mode 0), but after I switched them both to exclusive mode (mode 1), only GPU 1 works. The call to cudaSetDevice(0) works, but when I make a call to a CUFFT routine I get:<br />CUFFT error in file 'cuda.cu' in line 17<br /><br />Does anyone know why my GPU 0 stops working in compute mode 1? Does anyone know how to fix this?<br /><br />Thanks<br /><br />Gary<br /><br />]]></description>
   </item>
      <item>
      <title>CUDA 4.0 with AS 5.6</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/991/cuda-4-0-with-as-5-6</link>
      <pubDate>Thu, 13 Oct 2011 21:12:34 -0400</pubDate>
      <dc:creator>jackson312</dc:creator>
      <guid isPermaLink="false">991@/devforum/discussions</guid>
      <description><![CDATA[I just installed CUDA 4.0 on a Dell R5500 with a C2050 card in it. I can see the card using lspci, but I am not able to query the card using the deviceQuery sample that is built with the SDK. <br /><br />I have used CUDA 4.0 successfully on a Dell Optiplex 740 with a 9800GT card on AS 5.5. Are there any issues with CUDA 4.0 and AS 5.6? <br /><br />I did not have any problems getting the development driver to build or load, and the SDK compiled fine.<br /><br />Thanks,<br /><br />Jackson]]></description>
   </item>
      <item>
      <title>Problems getting CUDA samples running on Red Hat 5.5 (64-bit)</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/951/problems-getting-cuda-samples-running-on-red-hat-5-5-64-bit</link>
      <pubDate>Wed, 12 Oct 2011 13:48:05 -0400</pubDate>
      <dc:creator>brucep</dc:creator>
      <guid isPermaLink="false">951@/devforum/discussions</guid>
      <description><![CDATA[We have a research cluster and just added a new compute node into it that has a GPU so we can start providing GPU capabilities to our users.  The server is an IBM iDataPlex running Red Hat linux 5.5 (64-bit).  When I run lspci it shows:<br /><br />[root@node46 ~]# lspci | grep -i nvidia<br />19:00.0 3D controller: nVidia Corporation GF100 [Tesla S2050] (rev a3)<br />19:00.1 Audio device: nVidia Corporation GF100 High Definition Audio Controller (rev a1)<br />1a:00.0 3D controller: nVidia Corporation GF100 [Tesla S2050] (rev a3)<br />1a:00.1 Audio device: nVidia Corporation GF100 High Definition Audio Controller (rev a1)<br /><br />So I downloaded and installed NVIDIA-Linux-x86_64-270.41.34.run, cudatoolkit_4.0.17_linux_64_rhel5.5.run, and gpucomputingsdk_4.0.17_linux.run.  After installing each of those and adding the appropriate paths to my PATH &amp; LD_LIBRARY_PATH I rebooted the server to ensure everything is kosher.  lsmod shows that the nvidia driver is installed:<br /><br />[root@node46 ~]# lsmod | grep -i nvidia<br />nvidia              10765936  0 <br />i2c_core               57537  3 i2c_ec,i2c_i801,nvidia<br /><br />At this point I went into the NVIDIA_GPU_Computing_SDK/C directory and did a "make x86_64=1" to build all the samples. I then went into /NVIDIA_GPU_Computing_SDK/C/bin/linux/release and tried to run matrixMul as the SDK documentation suggested.  However when I try running it I get:<br /><br /><br />./matrixMul<br />[matrixMul] starting...<br />[ matrixMul ]<br />./matrixMul Starting (CUDA and CUBLAS tests)...<br /><br />matrixMul.cu(83) : cudaSafeCall() Runtime API error 38: no CUDA-capable device is detected.<br /><br />So what am I missing here?<br /><br />Thanks,<br /><br />-Bruce<br /><br />]]></description>
   </item>
      <item>
      <title>GPU computing with National Instruments LabWindows</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/491/gpu-computing-with-national-instruments-labwindows</link>
      <pubDate>Mon, 12 Sep 2011 16:58:56 -0400</pubDate>
      <dc:creator>LarryM</dc:creator>
      <guid isPermaLink="false">491@/devforum/discussions</guid>
      <description><![CDATA[I'm developing a numerical application using the National Instruments LabWindows development environment. We believe it can be sped up considerably by using a GPU. I've never used one before.  Has anyone ever used CUDA or OpenCL with LabWindows?<br /> ]]></description>
   </item>
      </channel>
</rss>
