<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
      <title>Tagged with cuda-toolkit - NVIDIA Developer Forums</title>
      <link>http://forums.developer.nvidia.com/devforum/discussions/tagged/cuda-toolkit/feed.rss</link>
      <pubDate>Wed, 16 May 2012 17:31:19 -0400</pubDate>
         <description>Tagged with cuda-toolkit - NVIDIA Developer Forums</description>
   <language>en-CA</language>
   <atom:link href="/devforum/discussions/tagged/cuda-toolkit/feed.rss" rel="self" type="application/rss+xml" />
   <item>
      <title>cudaFree returning cudaErrorMemoryAllocation - bug?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8181/cudafree-returning-cudaerrormemoryallocation-bug</link>
      <pubDate>Mon, 14 May 2012 13:38:35 -0400</pubDate>
      <dc:creator>ajsimmonds</dc:creator>
      <guid isPermaLink="false">8181@/devforum/discussions</guid>
      <description><![CDATA[I have been encountering a strange problem using the CUDA 4.2 tools where our application eventually receives a cudaErrorMemoryAllocation error when trying to perform a cudaFree on cudaMalloc'd memory. The number of allocations and deallocations performed varies, but the problem can be reproduced in the app relatively easily. Once the error has been received, further frees and also cudaMemGetInfo continue to return the error.<br /><br />To further narrow down the error I have also written a test program that simply allocates areas using cudaMalloc and, when this returns an out of memory error, releases one or more of the previously allocated areas to make space. This program, which launches no kernels, fails with the same symptoms. I have also tried this with the 4.0 tools and still receive the same error condition.<br /><br />If I limit the number of iterations such that the error is not encountered then it is quite likely that the free memory value returned by cudaMemGetInfo is larger than the value the program started with.<br /><br />This looks for all the world to me as if there is a problem with the tracking of memory allocations within the SDK, so can anyone confirm this or possibly point me at things I may be doing wrong?!<br /><br />Many thanks<br />Andrew]]></description>
   </item>
      <item>
      <title>simpleStreams example in SDK not working</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8101/simplestreams-example-in-sdk-not-working</link>
      <pubDate>Sat, 12 May 2012 03:36:29 -0400</pubDate>
      <dc:creator>madhur13490</dc:creator>
      <guid isPermaLink="false">8101@/devforum/discussions</guid>
      <description><![CDATA[I've installed the CUDA 4.1 GPUComputingSDK and GPUComputing toolkit. I'm trying to see the performance improvement for the simpleStreams example given in the src folder, but there seems to be some problem in the new version. The streamed version consistently takes more time than the non-streamed version. I have not modified the code. It seems there is a bug in the new examples.]]></description>
   </item>
      <item>
      <title>Trouble with processing image in rows</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/8121/trouble-with-processing-image-in-rows</link>
      <pubDate>Sun, 13 May 2012 05:29:07 -0400</pubDate>
      <dc:creator>laz007</dc:creator>
      <guid isPermaLink="false">8121@/devforum/discussions</guid>
      <description><![CDATA[Hello!<br />I'm making an image filter that processes the image in rows.<br />For two weeks I've been trying to figure out why it's not working when executed in parallel.<br />I use only threads in the Y dimension. Is that a problem?<br /><br /><br /><br />Here is part of the code:<br />BLOCKDIM_Y=16;<br />....<br />dim3 threads(1, BLOCKDIM_Y);<br />dim3 grid(1,  iDivUp(h, BLOCKDIM_Y));<br /><br />my_CUDA_filter&lt;&lt;&lt; grid, threads&gt;&gt;&gt;(sumR, sumG, sumB, mask,h,w, inD, outD, test);<br />...<br /><br />__global__ void my_CUDA_filter_simple222(int* sumR, int* sumG, int* sumB, int mask,int h,int w, u_int8_t *in, u_int8_t *out, int* test){<br />...<br />int iy = blockDim.y * blockIdx.y + threadIdx.y;<br />int ix=0;<br /><br />	if (iy&gt;=m &amp;&amp; iy&lt;(h-m)) {<br /><br />	//for(iy=m; iy&lt;h-m; iy++){<br /><br />	 ...<br />	for(ix=m+1;ix&lt;w-m;ix++){<br />	 ...<br />	 }<br />}<br /><br />The result image is messed up...<br />If I use for(iy=m; iy&lt;h-m; iy++){ <br />and run the kernel with one single thread (that means there is no parallelization) everything is OK.<br /><br />Any ideas?<br /><br />]]></description>
   </item>
      <item>
      <title>Portable pinned memory and multiple GPUs: Performance and stability</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7386/portable-pinned-memory-and-multiple-gpus-performance-and-stability</link>
      <pubDate>Sun, 22 Apr 2012 13:47:59 -0400</pubDate>
      <dc:creator>tbenson</dc:creator>
      <guid isPermaLink="false">7386@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />I am having some problems using portable pinned memory to share one pinned buffer between multiple GPUs.  I have two separate issues:<br /><br />	 1) Performance of transfers for the GPUs not corresponding to the allocation context are massively degraded; and<br />	 2) It tends to crash my Linux host and force a reboot.<br /><br />I included the code at the end.  There are several flags at the top of the source file to control behavior, including NDEVICES, NBUFFERS, and USE_PINNED_MEMORY.  NDEVICES is the number of GPUs to use, NBUFFERS is the number of buffers to be allocated, and USE_PINNED_MEMORY determines whether or not the buffers are pinned.  The case that fails is NDEVICES = 2, NBUFFERS = 1, and USE_PINNED_MEMORY = true. If I use as many buffers as devices, then things work with or without pinned memory.  It also works without pinned memory for any number of buffers.  However, with the failing case, I get the following:<br /><br />[host:portable_pinned]$ ./portable <br />id = 0, cudaMemcpy time = 22.32 ms<br />id = 0, val = 3.000000 (should be 3.000000)<br />id = 1, cudaMemcpy time = 5457.76 ms<br />id = 1, val = 6.000000 (should be 6.000000)<br /><br />Message from syslogd@host at Apr 22 13:25:16 ...<br /> kernel:[41786.826763] Stack:<br /><br />Message from syslogd@host at Apr 22 13:25:16 ...<br /> kernel:[41786.828257] Call Trace:<br /><br />Message from syslogd@host at Apr 22 13:25:16 ...<br /> kernel:[41786.845154] Code: f6 62 00 85 c0 74 10 e8 69 e4 65 00 0f 1f 00 eb 06 89 77 6c 89 4f 70 48 83 c5 10 5b c3 41 54 53 48 83 ec 08 48 83 ed 08 41 89 f4 &lt;39&gt; 77 6c 73 17 39 77 70 0f 87 ac 00 00 00 39 77 6c 73 09 39 77 <br /><br />The host at this point is only partially responsive and needs to be rebooted.  The system log is full of errors, but a sampling is attached.  This is using driver version 285.05.33, CUDA 4.1, Fedora 14, and kernel 2.6.35.6-45.  
The GPUs are two Tesla C2050s that reside in a Tesla S2050 compute server.  They are connected to the host via a single PCI-e cable.  This is a single host in a cluster, so updating the driver is not trivial, although I will do so if this is a known bug.<br /><br />In any case, I suspect that the kernel/driver error is just a bug as I have done something similar in the past without this problem.  However, I still had the poor performance in the past.  Above, the PCIe transfer to the CUDA context in which the allocation was not made is over 200x slower than the transfer for the context in which the allocation was made.  Is this normal?  The documentation just says that cudaHostAllocPortable allows pinned memory to be recognized by other contexts, but does not mention the performance implications of accessing the memory.<br /><br />Thanks for any help/comments,<br /><br />Tom<br /><br />The code is below.  The Timing class is just a wrapper that I have for host timing.  I can include it if needed, but already had to rework this email due to character limitations.  
The references can be commented out to compile  the code.<br /><br /><code><br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> &lt;cuda_runtime.h&gt;<br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> &lt;cassert&gt;<br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> &lt;cstdio&gt;<br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> &lt;pthread.h&gt;<br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> "timing.hpp"<br /><br />namespace<br />{<br />    const size_t BUFSIZE = 32*1024*1024;<br />    const int NDEVICES = 2;<br />    const int NBUFFERS = 1;<br />    const bool USE_PINNED_MEMORY = true;<br />}<br /><br />struct Params<br />{<br />    float *buf;<br />    int id;<br />};<br /><br />__global__ void test_kernel(float *buf, float val) { buf[0] = val; }<br /><br />void *gpu_thread(void *v)<br />{<br />    Params *params = (Params *) v;<br /><br />    cudaSetDevice(params-&gt;id);<br /><br />    float *dev_buf;<br />    assert(cudaMalloc((void **) &amp;dev_buf, sizeof(float)*BUFSIZE) == cudaSuccess);<br /><br />    double start = Timing::ElapsedTimeMs();<br />    assert(cudaMemcpy(dev_buf, params-&gt;buf, sizeof(float)*BUFSIZE, cudaMemcpyHostToDevice) == cudaSuccess);<br />    double elapsed = Timing::ElapsedTimeMs() - start;<br />    printf("id = %d, cudaMemcpy time = %.2f ms\n", params-&gt;id, elapsed);<br /><br />    test_kernel&lt;&lt;&lt;1,1&gt;&gt;&gt;( dev_buf, (params-&gt;id+1) * 3.0f );<br />    assert(cudaThreadSynchronize() == cudaSuccess);<br /><br />    float retval;<br />    assert(cudaMemcpy(&amp;retval, dev_buf, sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess);<br /><br />    printf("id = %d, val = %f (should be %f)\n", params-&gt;id, retval, (params-&gt;id+1)*3.0f);<br /><br />    assert(cudaFree(dev_buf) == cudaSuccess);<br /><br />    return NULL;<br />}<br /><br />int main(int argc, char **argv)<br />{<br />   
 float *pinned[NDEVICES];<br /><br />    assert(NBUFFERS &lt;= NDEVICES);<br /><br />    for (int i = 0; i &lt; NBUFFERS; ++i)<br />    {<br />        assert(cudaSetDevice(i) == cudaSuccess);<br />        if (USE_PINNED_MEMORY)<br />        {<br />            assert(cudaHostAlloc((void **) &amp;pinned[i], sizeof(float)*BUFSIZE, cudaHostAllocPortable) == cudaSuccess);<br />        }<br />        else<br />        {<br />            pinned[i] = new float[BUFSIZE];<br />        }<br />        for (size_t k = 0; k &lt; BUFSIZE; ++k) { pinned[i][k] = 1.0f; }<br />    }<br /><br />    pthread_t tid[NDEVICES];<br />    Params params[NDEVICES];<br /><br />    for (int i = 0; i &lt; NDEVICES; ++i)<br />    {<br />        params[i].id = i;<br />        params[i].buf = pinned[i%NBUFFERS];<br />        assert(pthread_create(tid+i, NULL, gpu_thread, (void *) &amp;params[i]) == 0);<br />    }<br /><br />    for (int i = 0; i &lt; NDEVICES; ++i)<br />    {<br />        assert(pthread_join(tid[i], NULL) == 0);<br />    }<br /><br />    for (int i = 0; i &lt; NBUFFERS; ++i)<br />    {<br />        if (USE_PINNED_MEMORY)<br />        {<br />            assert(cudaFreeHost(pinned[i]) == cudaSuccess);<br />        }<br />        else<br />        {<br />            delete [] pinned[i];<br />        }<br />    }<br /><br />    return 0;<br />}<br /></code>]]></description>
   </item>
      <item>
      <title>Linker error with c function in .cu file</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7881/linker-error-with-c-function-in-cu-file</link>
      <pubDate>Fri, 04 May 2012 16:08:40 -0400</pubDate>
      <dc:creator>basementscientist</dc:creator>
      <guid isPermaLink="false">7881@/devforum/discussions</guid>
      <description><![CDATA[I've created a kernel inside a .cu file. Also inside the .cu file is a C++ function that calls<br />the kernel. Everything compiles OK, but at the final link step the C++ function is not visible to the rest of the program. How do I make the function visible?<br /><br />I am using Visual Studio 2010 on Windows 8, and the newest SDK and Toolkit.]]></description>
   </item>
      <item>
      <title>npp problems</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7321/npp-problems</link>
      <pubDate>Thu, 19 Apr 2012 11:05:15 -0400</pubDate>
      <dc:creator>lancewellspring</dc:creator>
      <guid isPermaLink="false">7321@/devforum/discussions</guid>
      <description><![CDATA[I have 2 problems.<br />1) Function call to <code>nppiGetAffineTransform</code> is returning NPP_AFFINE_QUAD_INCORRECT_WARNING.<br /><em>parameter srcRoi is:</em><br />x	0	int<br />y	0	int<br />width	5000	int<br />height	5000	int<br /><em>parameter quad is:</em><br />[0]	0x00000000002af1d0	double [2]<br />	[0]	0.00000000000000000	double<br />	[1]	102.69965808786287	double<br />[1]	0x00000000002af1e0	double [2]<br />	[0]	5128.9289884048958	double<br />	[1]	0.00000000000000000	double<br />[2]	0x00000000002af1f0	double [2]<br />	[0]	5230.9576202374628	double<br />	[1]	5149.2023110261380	double<br />[3]	0x00000000002af200	double [2]<br />	[0]	102.53232406430637	double<br />	[1]	5251.8818716857804	double<br /><br />Does the function expect the points of quad in a specific order?  Right now they are: topleft, topright, botleft, botright.<br /><br />2) Function call to <code>nppiWarpAffine_8u_C3R</code> is returning NPP_STEP_ERROR.<br /><em>parameter pSrc is 75000000 bytes.</em> <br /><em>parameter srcSize is:</em><br />width	5000	int<br />height	5000	int<br /><em>parameter nSrcStep is 15000.</em> <br /><em>parameter srcRoi is:</em><br />x	0	int<br />y	0	int<br />width	5000	int<br />height	5000	int<br /><em>parameter pDst is 82419636 bytes.</em><br /><em>parameter nDstStep is 15693.</em><br /><em>parameter dstRoi is:</em><br />x	0	int<br />y	0	int<br />width	5231	int<br />height	5252	int<br /><em>parameter coeffs is:</em><br /><br />coeffs	0x00000000002af328	double [2][3]<br />[0]	0x00000000002af328	double [3]<br />	[0]	1.0259909958801552	double<br />	[1]	0.020409808328179044	double<br />	[2]	0.00000000000000000	double<br />[1]	0x00000000002af340	double [3]<br />	[0]	-0.020544040425657707	double<br />	[1]	1.0300464714995274	double<br />	[2]	102.69965808786287	double<br /><em>parameter interpolation is NPPI_INTER_CUBIC.</em><br /><br />I dont have any idea what is going wrong here.<br /><br />Any help is greatly appreciated!<br /><br />I'm 
running on a Windows 7 machine, with a Quadro FX 1800M, using Visual Studio 2010.  Running the basic CUDA examples works just fine.]]></description>
   </item>
      <item>
      <title>Cuda Kernels Stop Running After Few Iterations</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7381/cuda-kernels-stop-running-after-few-iterations</link>
      <pubDate>Sat, 21 Apr 2012 14:15:31 -0400</pubDate>
      <dc:creator>Eman</dc:creator>
      <guid isPermaLink="false">7381@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />I am writing code that calls a number of kernels inside a for loop. The number of loop iterations is 1000. When I run the program, the kernels stop running after a number of iterations. I tried to use cudaGetLastError(), but it didn't give me any information, as the output was "Error: unknown error". As I increase the size of the blocks and the number of threads, the kernels stop running sooner. For example, when the block size is 8 it stopped at iteration 740, while when the block size is 16 it stopped at iteration 440.  In each iteration the same resources are being re-used, so I really don't understand what the problem is! <br /><br />Any help will be appreciated. <br /><br />Thanks, <br /> ]]></description>
   </item>
      <item>
      <title>trouble with cudaBindTexture2D/tex2D and toolkit 4.0/4.1</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4341/trouble-with-cudabindtexture2dtex2d-and-toolkit-4-04-1</link>
      <pubDate>Mon, 06 Feb 2012 06:21:04 -0500</pubDate>
      <dc:creator>franzdaubner</dc:creator>
      <guid isPermaLink="false">4341@/devforum/discussions</guid>
      <description><![CDATA[I recently tried to switch all my kernels from toolkit 3.2 to toolkit 4.1.<br /><br />One of the kernels (the only one using textures) is not working anymore; it just behaves like an empty kernel with no code in it. There's no error message, no warning, it just does nothing. This behaviour is also reproducible with toolkit 4.0.<br /><br />If I switch back to toolkit 3.2 everything works fine.<br /><br />So here is my question:<br />Did anyone use cudaBindTexture2D/tex2D with success on toolkit 4.1? On what hardware?<br />I'm using a GeForce GTX 285 on Windows 7 (64) and Visual Studio 2008.]]></description>
   </item>
      <item>
      <title>Cuda Multip Kernel</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7641/cuda-multip-kernel</link>
      <pubDate>Fri, 27 Apr 2012 11:44:56 -0400</pubDate>
      <dc:creator>Saouli</dc:creator>
      <guid isPermaLink="false">7641@/devforum/discussions</guid>
      <description><![CDATA[Hello there,<br />My question is: can we invoke a kernel inside another kernel?<br />Example:<br />__global__ void kernel1(.....)<br />{<br />//do something<br />kernel2 &lt;&lt;&lt;...&gt;&gt;&gt;(...);<br />//with the result of kernel2, do the rest of the work of kernel1<br />}<br />Please, I need answers. Thank you for your time reading this.<br />Abdelhak]]></description>
   </item>
      <item>
      <title>Driver issue at Linux 11.10</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5411/driver-issue-at-linux-11-10</link>
      <pubDate>Mon, 05 Mar 2012 04:39:28 -0500</pubDate>
      <dc:creator>sajis</dc:creator>
      <guid isPermaLink="false">5411@/devforum/discussions</guid>
      <description><![CDATA[Hello forum,<br /><br />I have installed Ubuntu Linux 11.10 on an ASUS G74 and installed the most recent NVIDIA driver instead of the one that comes as default. It has a GeForce GTX 560M on board.<br /><br />Now the system hangs whenever I try to run any of the CUDA examples. The very same samples run fine in Windows on the same machine. I want to test OpenCL apps in Linux, but the toolkit does not come with any OpenCL examples for Linux.<br /><br />Any suggestions on how to deal with this issue?<br /><br /><br />Regards<br />Sajjad]]></description>
   </item>
      <item>
      <title>Origin of CUDA_ERROR_INVALID_IMAGE when calling cuModuleLoad - invalid nvcc call?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5946/origin-of-cuda_error_invalid_image-when-calling-cumoduleload-invalid-nvcc-call</link>
      <pubDate>Thu, 15 Mar 2012 04:59:27 -0400</pubDate>
      <dc:creator>mkastrop</dc:creator>
      <guid isPermaLink="false">5946@/devforum/discussions</guid>
      <description><![CDATA[I always get the CUDA_ERROR_INVALID_IMAGE CUresult when calling cuModuleLoad on my .cubin-file. Even if I break it down to the following most simple kernel:<br /><br /><code><br />// minimal.cu<br /><br />__global__ void<br />minimal()<br />{<br />}<br /></code><br /><br />My compiler call looks like the following:<br /><br /><code><br />nvcc.exe -cubin -arch=sm_21 -o "minimal.cubin" "minimal.cu"<br /></code><br /><br />Its output looks good, doesn't it?<br /><br /><code><br />minimal.cu<br />minimal.cu<br />tmpxft_00000b38_00000000-3_minimal.cudafe1.gpu<br />tmpxft_00000b38_00000000-10_minimal.cudafe2.gpu<br /></code><br /><br />My system is a Win7 machine with one NVIDIA Quadro 600 in it. Can you please explain what I am doin' wrong? Unfortunately there are no concrete explanations to the CUDA error codes like I know them from OpenCL... That would be preferable.<br />]]></description>
   </item>
      <item>
      <title>linker error using cuda toolkit 4.1</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7091/linker-error-using-cuda-toolkit-4-1</link>
      <pubDate>Mon, 16 Apr 2012 05:39:43 -0400</pubDate>
      <dc:creator>sicherer</dc:creator>
      <guid isPermaLink="false">7091@/devforum/discussions</guid>
      <description><![CDATA[I just upgraded to version 4.1 of the cuda toolkit and now I get a linker error (Ubuntu 10.04):<br />CUDAPACKAGE/ipdiagsolver/CG.o: In function `cublasSdot':<br />tmpxft_00004800_00000000-1_CG.cudafe1.cpp:(.text+0x1c): undefined reference to `cublasGetCurrentCtx'<br />Using readelf -Ws I found that this symbol is no longer present in libcublas.so as it was in the 4.0 version. This is strange. I get no compiler errors, only the linker complains. <br />How can I get my code to link again? Please help!]]></description>
   </item>
      <item>
      <title>Cannot find Reduce1.sln</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7416/cannot-find-reduce1-sln</link>
      <pubDate>Mon, 23 Apr 2012 05:21:04 -0400</pubDate>
      <dc:creator>celebisait</dc:creator>
      <guid isPermaLink="false">7416@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />I'm new to GPU programming and CUDA. I read the CUDA C Programming Guide. It was very helpful for me. And now I'm reading a tutorial* from Cyril Zeller. However, it says "Open up reduce\src\reduce1.sln" on page 36/157, and I couldn't find that Visual Studio solution file.<br /><br />I have NVIDIA GPU Computing SDK 4.1. I searched in the SDK and found something at:<br /><br />"C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\src\reduction"<br /><br />but I'm not sure whether that is the same thing as in the PDF, because it doesn't have separate solution files like reduce1.sln, reduce2.sln etc.<br /><br />I would appreciate any help,<br />Sait.<br /><br />*<a href="http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf" target="_blank" rel="nofollow">http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf</a>]]></description>
   </item>
      <item>
      <title>Crash with the new LLVM compiler</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6606/crash-with-the-new-llvm-compiler</link>
      <pubDate>Mon, 02 Apr 2012 06:41:37 -0400</pubDate>
      <dc:creator>Tofic</dc:creator>
      <guid isPermaLink="false">6606@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br /><br />  OpenCL 1.1, drivers 296.10, GTX 580, 64-bits compiler, Windows 7 64-bits.<br /><br />  100% crash inside the compiler, when trying to compile this construction (well-compilable with the old compiler):<br /><strong>const struct BBox bbox = { (float4)(-.5f,-.5f,-.5f,0), (float4)(.5f,.5f,.5f,0) };<br /><br />	....</strong><br /><br />  Error: <em>OpenCL error 'Invalid binary': compilation error<br />	 ptxas application ptx input, line 13; error : Module-scoped variables in .local state space are not allowed with ABI</em><br /><br />       or<br /><br /><em>UNREACHABLE executed.</em><br /><br /><br />  Fix: remove the "const" modifier. Started with new LLVM compiler.<br /><br />Best wishes,<br />Anton]]></description>
   </item>
      <item>
      <title>Unable to build cutil64D.lib library file.</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7406/unable-to-build-cutil64d-lib-library-file-</link>
      <pubDate>Mon, 23 Apr 2012 03:22:01 -0400</pubDate>
      <dc:creator>atul2188</dc:creator>
      <guid isPermaLink="false">7406@/devforum/discussions</guid>
      <description><![CDATA[Dear All,<br /><br />	 I am new to CUDA programming and I am trying to run a CUDA program. But building the project fails with the error: cutil64D.lib file not found.<br /><br />Though I tried to build the library file by the proper methods, I am still unable to get the file.<br /><br />Please suggest something.<br /><br />Thanks.]]></description>
   </item>
      <item>
      <title>CUDA Toolkit 4.2 Released for GeForce GTX 600 Series</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7426/cuda-toolkit-4-2-released-for-geforce-gtx-600-series</link>
      <pubDate>Mon, 23 Apr 2012 12:23:29 -0400</pubDate>
      <dc:creator>Nadeem Mohammad</dc:creator>
      <guid isPermaLink="false">7426@/devforum/discussions</guid>
      <description><![CDATA[An updated CUDA Toolkit is now available for developers targeting GeForce GTX 600 Series GPUs.<br /><br /><br />This release includes updated versions of the CUDA-GDB debugger, Visual Profiler, and other tools that support the Kepler architecture GPUs.  Updated versions of NVIDIA’s GPU-accelerated libraries are also provided in this release, including cuBLAS, cuSPARSE, cuFFT, and the NVIDIA Performance Primitives (NPP) library for image and signal processing.<br />CUDA Toolkit 4.2 is now available at <a href="http://developer.nvidia.com/cuda-downloads">www.nvidia.com/getcuda</a><br /><br />There will be more information about the next release of the CUDA Toolkit and the compute capabilities of the Kepler architecture at the GPU Technology Conference 2012, to be held in San Jose, CA, May 14 to 17. Find out more at <a href="http://www.gputechconf.com">www.gputechconf.com</a><br />]]></description>
   </item>
      <item>
      <title>CUDA Toolkit and GPU Computing SDK</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/186/cuda-toolkit-and-gpu-computing-sdk</link>
      <pubDate>Mon, 29 Aug 2011 18:02:51 -0400</pubDate>
      <dc:creator>Nadeem Mohammad</dc:creator>
      <guid isPermaLink="false">186@/devforum/discussions</guid>
      <description><![CDATA[If you have general questions about the CUDA Toolkit that do not relate to the included libraries, just tag the question or discussion with the cuda-toolkit tag so it's easy to find.<br />The SDK contains 100's of samples for CUDA, OpenCL, DirectCompute and the use of many libraries. Use these forums to discuss any aspect of the SDK, and be sure to use the TAG below or add ones yourself.]]></description>
   </item>
      <item>
      <title>CUDA 4.X UVA and P2P is broken for 2 x GTX 680 running on AMD 990FX chipset</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7336/cuda-4-x-uva-and-p2p-is-broken-for-2-x-gtx-680-running-on-amd-990fx-chipset</link>
      <pubDate>Thu, 19 Apr 2012 16:13:18 -0400</pubDate>
      <dc:creator>mdvornik</dc:creator>
      <guid isPermaLink="false">7336@/devforum/discussions</guid>
      <description><![CDATA[CUDA kernels with UVA fetching are not running properly on 2 x GTX 680 with an AMD 990FX mobo. So far, it has been confirmed only for Scientific  Linux 6.2 (2.6.32-220.13.1 x86_64) with 295.40 Nvidia driver and CUDA 4.1(2 RC1).<br /><br />Symptoms: CUDA kernels run extremely slowly and eventually the execution hangs. When running simpleP2P the reported bandwidth is 1GB/s. When simpleP2P is run repeatedly, it hangs at some point just like the CUDA kernels from our software.<br /><br />The kernels run just fine with 2 x GTX 480 on an Intel X58 mobo.<br /><br />Finally, 2 x GTX 480 are also happy with the 990FX mobo!<br /><br />So the question is: Does consumer-grade Kepler have full-featured GPUDirect enabled?]]></description>
   </item>
      <item>
      <title>CUDA 4.2 Release Candidate 1</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7046/cuda-4-2-release-candidate-1</link>
      <pubDate>Sat, 14 Apr 2012 15:27:13 -0400</pubDate>
      <dc:creator>jack-sky</dc:creator>
      <guid isPermaLink="false">7046@/devforum/discussions</guid>
      <description><![CDATA[Is "CUDA 4.2 RC1" available for download? I have a "REGISTERED DEVELOPER PROGRAMS" account approved for the "Parallel Nsight Registered Developer Program" and the "CUDA/GPU Computing Registered Developer Program", but I can't find where to download "CUDA 4.2 RC1". Thanks.]]></description>
   </item>
      <item>
      <title>Is there a bug in surf3Dread/surf3Dwrite on CUDA 4.1?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5726/is-there-a-bug-in-surf3dreadsurf3dwrite-on-cuda-4-1</link>
      <pubDate>Fri, 09 Mar 2012 18:10:59 -0500</pubDate>
      <dc:creator>vitaminSP</dc:creator>
      <guid isPermaLink="false">5726@/devforum/discussions</guid>
      <description><![CDATA[For some reason, surf3Dread/surf3Dwrite is ignoring the depth parameter.<br />I am attaching a small program to reproduce the bug.<br />Also, using a depth outside the array's bounds does _NOT_ trigger an error (cudaBoundaryModeTrap), while using an X or Y outside the bounds _DOES_ trigger.<br /><br />Is anyone else experiencing the same behavior? I am running windows 7 x64 and the bug can be reproduced on both Quadro 6000 and GT 520.<br /><br />Edit: About the program - it should output "surf[?, ?, 2].x = 2" on the screen, as it always reads at depth 2, but instead it outputs "surf[?, ?, 2].x = 11"]]></description>
   </item>
      <item>
      <title>Async memory copy to executing kernel</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/7136/async-memory-copy-to-executing-kernel</link>
      <pubDate>Mon, 16 Apr 2012 16:25:52 -0400</pubDate>
      <dc:creator>pikecillo</dc:creator>
      <guid isPermaLink="false">7136@/devforum/discussions</guid>
      <description><![CDATA[I am trying to perform an async memory copy to an executing kernel, and get a reply on the host when the updated memory has been read. That is, I want to be able to process new data (created at the host) in the kernel without interrupting it, and then let the host know when it has been processed. I'm trying to do that using asynchronous memory and streams (for H-&gt;D data transfers) and host mapped memory (for the D-&gt;H reply). Bellow is a code draft of what I want to do, but it is not working. So what's wrong with my approach? Am I doing something that doesn't make sense? Actually, I'm not even sure if what I want to do is possible.<br />I have a GeForce GT 435M<br />CUDA Capability Major/Minor version number: 2.1<br />Concurrent copy and execution: Yes with 1 copy engine(s)<br /><br />Any kind of help is welcome.<br /><br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> <br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> <br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> <br /><br /><a href="/devforum/search?Search=%23define&amp;Mode=like">#define</a> CUDA_SAFE(statement) {						\<br />  statement;								\<br />  cudaError_t __result = cudaGetLastError();				\<br />if(__result != cudaSuccess){						\<br />  printf("CUDA error: %s\n", cudaGetErrorString(__result));		\<br />  assert(false);							\<br /> }									\<br />}									\<br /><br />__global__<br />void kernel(int *global, int *barrier) {<br />  int local_id = threadIdx.x;<br /><br />  if(local_id == 0) {<br />    while(true) { <br />      if(global[0] != 0) break;<br />    }<br /><br />    // Reply to host<br />    barrier[0] = 1;<br />  }<br />}<br /><br />cudaStream_t kernel_stream, memory_stream;<br /><br />void async_copy(int *d_global, int *h_barrier) {<br />   cudaEvent_t sync_event;<br />   int *h_global;<br /><br />   CUDA_SAFE(cudaEventCreate(&amp;sync_event));<br /><br />   
CUDA_SAFE(cudaHostAlloc(&amp;h_global, sizeof(int),<br />			   cudaHostAllocDefault));<br /><br />   h_global[0] = 1;<br />   // Update device memory<br />   CUDA_SAFE(cudaMemcpyAsync(d_global, h_global,<br />			     sizeof(int),<br />			     cudaMemcpyHostToDevice,<br />			     memory_stream));<br />   // Wait until the update has been made<br />   CUDA_SAFE(cudaEventRecord(sync_event, memory_stream));<br />   CUDA_SAFE(cudaEventSynchronize(sync_event));<br /><br />   while(h_barrier[0] == 0);<br /><br />   CUDA_SAFE(cudaFreeHost(h_global));<br />   CUDA_SAFE(cudaEventDestroy(sync_event));<br />}<br /><br />int main() {<br />  // Allow mapped host memory<br />  CUDA_SAFE(cudaSetDeviceFlags(cudaDeviceMapHost));<br /><br />  // Create streams, one for kernel execution and one for<br />  // async memory copies<br />  CUDA_SAFE(cudaStreamCreate(&amp;kernel_stream));<br />  CUDA_SAFE(cudaStreamCreate(&amp;memory_stream));<br /><br />  int *d_global;<br />  int *h_barrier, *d_barrier_ptr;<br /><br />  CUDA_SAFE(cudaHostAlloc(&amp;h_barrier, sizeof(int),<br />			  cudaHostAllocMapped));<br />  CUDA_SAFE(cudaHostGetDevicePointer(&amp;d_barrier_ptr,<br />				     h_barrier, 0));<br />  CUDA_SAFE(cudaMalloc(&amp;d_global, sizeof(int)));<br /><br />  CUDA_SAFE(cudaMemset(d_global, 0, sizeof(int)));<br /><br />  h_barrier[0] = 0;<br /><br />  int block = 1;<br />  int grid = 1;<br /><br />  kernel &lt;&lt;&lt; grid, block, 0, kernel_stream &gt;&gt;&gt; (d_global, d_barrier_ptr);<br />  async_copy(d_global, h_barrier);<br />  CUDA_SAFE(cudaStreamSynchronize(kernel_stream));<br />  CUDA_SAFE(cudaFreeHost(h_barrier));<br />  CUDA_SAFE(cudaFree(d_global));<br />  CUDA_SAFE(cudaStreamDestroy(kernel_stream));<br />  CUDA_SAFE(cudaStreamDestroy(memory_stream));<br />}<br />]]></description>
   </item>
      <item>
      <title>Is it possible to debug a .exe cuda app, in other words, Can cuda app be reversed ?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6881/is-it-possible-to-debug-a-exe-cuda-app-in-other-words-can-cuda-app-be-reversed-</link>
      <pubDate>Wed, 11 Apr 2012 04:08:27 -0400</pubDate>
      <dc:creator>leolord</dc:creator>
      <guid isPermaLink="false">6881@/devforum/discussions</guid>
      <description><![CDATA[Almost all software is at risk of being reverse-engineered. I want to know whether a cracker can debug an executable file that ships without source code or debug symbols.]]></description>
   </item>
      <item>
      <title>Is it possible to have be.exe compiled as 64-bit?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6646/is-it-possible-to-have-be-exe-compiled-has-64bits</link>
      <pubDate>Tue, 03 Apr 2012 10:09:14 -0400</pubDate>
      <dc:creator>MarcoSilva</dc:creator>
      <guid isPermaLink="false">6646@/devforum/discussions</guid>
      <description><![CDATA[Hi All,<br /><br />Before anything else, here is my test machine and CUDA version:<br />Win7 64-bit with 6GB of RAM.<br />CUDA Toolkit 4.0, also 64-bit.<br /><br />Now my problem:<br />I have a pretty big kernel to compile and it stops at be.exe.<br />After some investigation I found that be.exe stops with an out-of-memory error once it starts using more than 2GB of RAM, and that makes sense, because be.exe is a 32-bit application (even though the toolkit is the 64-bit version).<br />Hoping that my kernel compile wouldn't exceed 3GB and that be.exe would be compatible with the /LARGEADDRESSAWARE flag, I set that flag on it.<br />But alas... The compilation quickly consumed the 3GB and the same error occurred...<br /><br />Now I am stuck... The only chance is to have be.exe compiled as a 64-bit executable; then it would be able to use all the RAM available (unfortunately cutting stuff out of the kernel is not a possibility)....<br /><br />Or is there any workaround that I am missing?<br /><br />Edit:<br />SOLVED!!!<br />Thank you Tera!<br /><br />Solution: use the -nvvm switch on nvcc.<br />With this, both sm1x and sm2x can be compiled with the new LLVM compiler (since this new compiler is 64-bit, the out-of-memory problem is gone).<br /><br />Best regards to all!<br />Marco Silva<br />]]></description>
   </item>
      <item>
      <title>seeking a movie via CUDA Video Decoder API</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6926/seeking-a-movie-via-cuda-video-decoder-api</link>
      <pubDate>Wed, 11 Apr 2012 12:30:16 -0400</pubDate>
      <dc:creator>Joseph Laurino</dc:creator>
      <guid isPermaLink="false">6926@/devforum/discussions</guid>
      <description><![CDATA[While modifying the CUDA Video Decoder D3D9 sample in the GPU Computing SDK, we could not find a way to use any of the available APIs to seek to a specific time within a video.<br /><br />We are wondering if seeking within a movie is possible.<br /><br />Thank you,<br />-Joseph<br /><br />]]></description>
   </item>
      <item>
      <title>Nsight 2.1 RC2 not working?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3701/nsight-2-1-rc2-not-working</link>
      <pubDate>Thu, 19 Jan 2012 11:21:55 -0500</pubDate>
      <dc:creator>Vogi</dc:creator>
      <guid isPermaLink="false">3701@/devforum/discussions</guid>
      <description><![CDATA[Hi!<br /><br />I have now installed the CUDA 4.1 RC2, Nsight 2.1 RC2 and the 285.86-driver on my machines (Win7 Prof. 64-bit, VS2010 Ultimate, GTX480 + a second Nvidia-card (GT520 on the one and a Geforce7xxx on the other one)).<br /><br />I did the same setup and configurations as I had before with CUDA 4.0 and Nsight 2.0 and the old driver (I guess it was 280.xx). Unfortunately, when starting the debugger, now Nsight does not hit any breakpoint anymore. VS just shows the "No source correspondence". :(<br /><br />I found this in the User Guide:<br /><code> If a breakpoint cannot be resolved in a loaded module, the breakpoint will display a warning icon during the debug session. This occurs if the debugger is unable to find the source location of the breakpoint. Make sure you are using the CUDA toolkit that ships with the Parallel Nsight tools. If you are writing code based on the CUDA driver API, you can check to see if the symbol files are being generated (.cubin.elf.o files located alongside your .cubin files). Code that is based on the CUDART API does not create .cubin.elf.o files.</code><br /><br />Well, I am using the driver API and compile to PTX. 
All I can find in the output directory are files like "tmpxft_000013d0_00000000-11_RaytracingKernel.cpp3.o" with a size of 0KB.<br /><br />The command line I use for compiling is:<br /><code>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\bin\nvcc.exe" -G0 -ptx --cl-version 2010 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC" -I "[path to my includes]" -I "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\include" -L "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib" -m 64 -arch sm_20 -Xptxas -v myKernel.cu<br /></code><br /><br />For more information, the PTX starts with the following:<br /><code>//<br />// Generated by NVIDIA NVVM Compiler<br />// Compiler built on Sat Nov 19 07:29:21 2011 (1321684161)<br />// Cuda compilation tools, release 4.1, V0.2.1221<br />//<br /><br />.version 3.0<br />.target sm_20, debug<br />.address_size 64<br /></code><br /><br />So, what am I doing wrong here?<br />Do I need any additional setup for Nsight 2.0?<br /><br />Thanks for your help!<br /><br />Greetings,<br />  Vogi<br />]]></description>
   </item>
      <item>
      <title>error MSB3721 when compiling</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6391/error-msb3721-when-compiling</link>
      <pubDate>Mon, 26 Mar 2012 14:09:09 -0400</pubDate>
      <dc:creator>brachistochron</dc:creator>
      <guid isPermaLink="false">6391@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br />Here is the compilation error I get when compiling a .cu file.<br /><br />(There are some Cyrillic characters because I have the Russian version of VS2010.)<br />&gt;  Compiling CUDA source file c.cu...<br />1&gt;  <br />1&gt;  c:\Users\Андрей\Documents\Visual Studio 2005\Projects\cudatest3\cudatest3&gt;"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2008 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\include"  -G0  --keep-dir "Debug" -maxrregcount=0  --machine 32 --compile  -g    -Xcompiler "/EHsc /nologo /Od /Zi  /MDd " -o "Debug\c.cu.obj" "c:\Users\??????\Documents\Visual Studio 2005\Projects\cudatest3\cudatest3\c.cu" <br />1&gt;c1xx : fatal error C1083: Cannot open source file: c:/Users/??????/Documents/Visual Studio 2005/Projects/cudatest3/cudatest3/c.cu: Invalid argument<br />1&gt;C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\BuildCustomizations\CUDA 4.1.targets(361,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2008 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\include"  -G0  --keep-dir "Debug" -maxrregcount=0  --machine 32 --compile  -g    -Xcompiler "/EHsc /nologo /Od /Zi  /MDd " -o "Debug\c.cu.obj" "c:\Users\Андрей\Documents\Visual Studio 2005\Projects\cudatest3\cudatest3\c.cu"" exited with code "2".<br /><br />I used this guide to configure VS:<br />http://www.aimantarek.com/2011/01/how-to-make-new-cuda-project-in-vs-2010.html<br />My configuration is i7 + GTX560 + Win7 x64.<br /><br /><br /><br />Can anybody help me?]]></description>
   </item>
      <item>
      <title>Cuda 4.1/4.2 (latest releases) on a Mac book Pro 3.1 with Mac OS X 10.7.3 is not running at all</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6361/cuda-4-14-2-latest-releases-on-a-mac-book-pro-3-1-with-mac-os-x-10-7-3-is-not-running-at-all</link>
      <pubDate>Mon, 26 Mar 2012 08:15:57 -0400</pubDate>
      <dc:creator>slajar</dc:creator>
      <guid isPermaLink="false">6361@/devforum/discussions</guid>
      <description><![CDATA[Hi all,<br /><br />I am trying to run the samples from the CUDA toolkit and I always get this error:<br />"Runtime API error 2: out of memory." <br /><br />I read somewhere that I should avoid running Parallels at the same time. Unfortunately, that does not make any difference. Maybe someone else can help me out?<br /><br />Kind regards,<br />Matthias]]></description>
   </item>
      <item>
      <title>GTX680 - CUDA support on Linux</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6206/gtx680-cuda-support-on-linux</link>
      <pubDate>Thu, 22 Mar 2012 11:49:18 -0400</pubDate>
      <dc:creator>jirim</dc:creator>
      <guid isPermaLink="false">6206@/devforum/discussions</guid>
      <description><![CDATA[Dear NVIDIA Developer Team,<br /><br />Is there a CUDA Toolkit supporting the GTX 680 on Linux? I am using the 295.20 Linux driver, which seems to be working; at least there are no complaints in dmesg when I load it. But when I try to run any CUDA-related application (one of mine or even deviceQuery) I get an "invalid device ordinal" error message at the first call of a CUDA API function (e.g. cudaGetDeviceCount()). After that, dmesg reports the following messages:<br /><br />[ 3240.924534] NVRM: failed to copy vbios to system memory.<br />[ 3240.931041] NVRM: RmInitAdapter failed! (0x30:0xffffffff:858)<br />[ 3240.931051] NVRM: rm_init_adapter(0) failed<br /><br />I am using the latest Toolkit 4.1, the 295.20 driver, and a GTX680.]]></description>
   </item>
      <item>
      <title>Anyone got Nsight 2.1 working with C#?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5861/anyone-got-nsight-2-1-working-with-c</link>
      <pubDate>Tue, 13 Mar 2012 12:28:49 -0400</pubDate>
      <dc:creator>Vogi</dc:creator>
      <guid isPermaLink="false">5861@/devforum/discussions</guid>
      <description><![CDATA[Hi!<br /><br />I was using Nsight version &lt;=2.0 successfully with C#.<br />Nsight 2.1 does not stop at breakpoints anymore when using C# as the host language. It works fine with C++, however.<br /><br />So, has anyone got it working? (Or am I the only one using Nsight with C# anyway?)<br /><br />Furthermore:<br />Does anyone know how to get the CUDA device printf() function working with C#?<br /><br />Thanks!<br /><br />Greetings,<br />  Vogi<br />]]></description>
   </item>
      <item>
      <title>Parallel Nsight debugging won't detect my two GPUs</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/6061/parallel-nsight-debugging-wont-detect-my-two-gpus</link>
      <pubDate>Sun, 18 Mar 2012 08:03:41 -0400</pubDate>
      <dc:creator>EmilNorden</dc:creator>
      <guid isPermaLink="false">6061@/devforum/discussions</guid>
      <description><![CDATA[Hello there,<br /><br />I'm sure this is a question that has been asked previously, but I have been unable to find a solution that works. I apologize if I have missed something obvious.<br /><br />I recently began trying out CUDA development and installed the Parallel Nsight tool the other day. When trying to start a CUDA debugging session from Visual Studio, I got the message:<br /><br /><em>Local debugging failed. Nsight debugging cannot be performed when there is only one GPU detected on the system.</em><br /><br />I read about this online and it made sense that it would not work. So I went out and got myself a second card to run in SLI. However, it still gives me that exact message.<br />I have followed the "Tutorial: Using the CUDA Debugger" section of the User Guide, but I must have missed something.<br /><br />I am using 2x Geforce GTX 560 Ti.<br /><br />I would appreciate some help, thanks in advance.]]></description>
   </item>
      <item>
      <title>nvcc: internal error: assertion failed: scope_of_local_variable: scope not found</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5426/nvcc-internal-error-assertion-failed-scope_of_local_variable-scope-not-found</link>
      <pubDate>Mon, 05 Mar 2012 06:58:38 -0500</pubDate>
      <dc:creator>stevenlovegrove</dc:creator>
      <guid isPermaLink="false">5426@/devforum/discussions</guid>
      <description><![CDATA[Since moving to CUDA 4.1, I'm having trouble with code that references a template library that previously compiled fine with CUDA 3.X (a library called TooN). The code is not being used inside a kernel, but within a .cu file. The code listing below is the smallest code sample I could think of to reproduce the (admittedly fringe) error outside of the library.<br /><br />The code can be made to compile by changing the static const definition of NewSize to just const, or by removing the inline attribute. Maybe LLVM is confused by the 'local_var.template ...' syntax?<br /><br />// Code listing test.cu (edited as per comment below)<br />//////////////////////////////////<br /><br /><code><br />template&lt;int Size&gt;<br />struct TestStruct {<br />    template&lt;int a&gt;<br />    void TestStructTemplateMethod() { }<br />};<br /><br />template&lt;int Size&gt;<br />inline void TestTemplateMethod(){<br />    TestStruct&lt;Size&gt; local_var;<br />    static const int NewSize = Size+1;<br />    local_var.template TestStructTemplateMethod&lt;NewSize&gt;();<br />}<br /><br />void TestMethod() {<br />  TestTemplateMethod&lt;3&gt;();<br />}<br /></code><br /><br />//////////////////////////////////<br /><br />The build fails with:<br />/home/sl203/code/cuda_bug/test.cu(14): internal error: assertion failed: scope_of_local_variable: scope not found (/home/buildmeister/nightly/rel/gpgpu/toolkit/r4.1/compiler/edg/EDG_4.3/src/il.c, line 11476)<br /><br />1 catastrophic error detected in the compilation of "/tmp/tmpxft_00003ce3_00000000-9_test.cpp4.ii".<br />Compilation aborted.<br />Aborted<br /><br />&gt; nvcc --version<br />nvcc: NVIDIA (R) Cuda compiler driver<br />Copyright (c) 2005-2011 NVIDIA Corporation<br />Built on Thu_Jan_12_14:41:45_PST_2012<br />Cuda compilation tools, release 4.1, V0.2.1221<br /><br />&gt; gcc --version<br />gcc (Ubuntu/Linaro 4.4.6-11ubuntu2) 4.4.6<br /><br />&gt; uname -a<br />Linux caravaggio 3.0.0-15-generic #26-Ubuntu SMP Fri Jan 20 17:23:00 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux<br /><br />Thanks]]></description>
   </item>
      <item>
      <title>controlling fan speed of tesla</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5216/controlling-fan-speed-of-tesla</link>
      <pubDate>Tue, 28 Feb 2012 05:57:17 -0500</pubDate>
      <dc:creator>AlanKao2012</dc:creator>
      <guid isPermaLink="false">5216@/devforum/discussions</guid>
      <description><![CDATA[Hi there:<br /><br />I recently set up a Supermicro 7046GT with 4 Tesla C2070 cards and found that the GPU temperature is really high, so I would like to adjust the fan speed to cool the GPUs a little.<br />However, after following some instructions in a geek's blog post, the fan speed of 3 of the cards still cannot be changed.<br /><br />Has anyone tried this on Tesla cards and succeeded?<br />Or should I give up NOW and move the machine to an air-conditioned room?<br /><br />Any reply will be appreciated.<br />]]></description>
   </item>
      <item>
      <title>Windows 7, Visual studio C++ 2010, Error on cutil32.dll</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3046/windows-7-visual-studio-c-2010-error-on-cutil32-dll</link>
      <pubDate>Wed, 04 Jan 2012 13:19:05 -0500</pubDate>
      <dc:creator>hassy1977</dc:creator>
      <guid isPermaLink="false">3046@/devforum/discussions</guid>
      <description><![CDATA[Hi,<br />I am installing CUDA on my PC running 64-bit Windows 7.<br />I am going to use Visual Studio C++ 2010 to write programs.<br />I installed everything according to the procedure shown on the following page:<br /><a href="http://forums.nvidia.com/index.php?showtopic=216829">http://forums.nvidia.com/index.php?showtopic=216829</a><br />The sample programs work correctly.<br /><br />I wrote a simple test program:<br />#include <br />#include <br />#include <br /><br />     int main(int argc, char** argv){<br /><br />         CUT_DEVICE_INIT(argc, argv);<br />         CUT_EXIT(argc, argv);<br />         return 0;<br />     }<br /><br />The program appears to compile successfully, but when it is run it shows this error message:<br />   "The program can't start because cutil32.dll is missing from your computer.<br />   Try reinstalling the program to fix this problem."<br /><br />I rebuilt cutil32.dll again and again with cutil_vc2010.sln, but the result is the same.<br /><br />Has anyone else run into the same problem?<br />]]></description>
   </item>
      <item>
      <title>kepler device memory architecture</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/5016/kepler-device-memory-architecture</link>
      <pubDate>Wed, 22 Feb 2012 21:18:04 -0500</pubDate>
      <dc:creator>iourikarpov</dc:creator>
      <guid isPermaLink="false">5016@/devforum/discussions</guid>
      <description><![CDATA[Hello, with plenty of rumours swirling around about Kepler's number of cores, processors, etc., there has been very little info (that I could find) on the L1 memory size, shared and constant memory sizes, and other key metrics that are extremely important to developers.  Does anyone have any links, leaks, or ideas on what the Kepler architecture will bring on this front?<br /><br />Thanks in advance, Joe]]></description>
   </item>
      <item>
      <title>P2P mem transfer between multiple CPU processes</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4821/p2p-mem-transfer-between-multiple-cpu-processes</link>
      <pubDate>Fri, 17 Feb 2012 16:15:15 -0500</pubDate>
      <dc:creator>tgramicc</dc:creator>
      <guid isPermaLink="false">4821@/devforum/discussions</guid>
      <description><![CDATA[After viewing the webinars on GPU-Direct/UVA and Multi-GPU, I am still confused about whether it is possible to perform a P2P memory copy between two GPUs when each GPU context is owned by a different CPU process.  I have looked at the SDK example, threadMigration, and I see how it is possible to perform a P2P copy from different threads within a single process; however, I am wondering if it is possible to share a GPU's memory pointer between two CPU processes using an IPC shared memory space.  Is this possible, or does the UVA structure make this impossible?  Thanks for your reply.]]></description>
   </item>
      <item>
      <title>How to develop for RTOS(VxWorks, RT Linux, etc.)?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4781/how-to-develop-for-rtosvxworks-rt-linux-etc-</link>
      <pubDate>Thu, 16 Feb 2012 06:59:01 -0500</pubDate>
      <dc:creator>leventk</dc:creator>
      <guid isPermaLink="false">4781@/devforum/discussions</guid>
      <description><![CDATA[Nowadays, RTOSs (VxWorks, RT Linux, etc.) can share a workstation or a PC with non-deterministic OSs (Windows/Linux). This enhancement has led us to create products on desktop computers as well. RTOSs are deterministic, which is a must for real-time products.<br /><br />My questions are:<br />Can I develop an application using GPGPU on the VxWorks OS?<br />Can I create an application using the NVIDIA API in, for example, Wind River Workbench?<br />If not, is there any plan to support this?<br /><br />Best Regards,<br />Levent]]></description>
   </item>
      <item>
      <title>Is this an nvcc preprocessor bug?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4606/is-this-an-nvcc-preprocessor-bug</link>
      <pubDate>Sun, 12 Feb 2012 04:16:02 -0500</pubDate>
      <dc:creator>chrism0dwk</dc:creator>
      <guid isPermaLink="false">4606@/devforum/discussions</guid>
      <description><![CDATA[Hi All,<br /><br />I appear to have found a bug in nvcc.  I wish to use Boost::ublas for sparse matrices/vectors on Linux (using gcc 4.3.4) with CUDA.  Unfortunately, nvcc is unable to compile even the simplest of code incorporating sparse vectors/matrices, e.g.:<br /><br />#include "boost/numeric/ublas/vector_sparse.hpp"<br />int main(int argc, char* argv[])<br />{<br />  compressed_vector v(10);<br />  return 0;<br />}<br /><br />The compile fails at the final stage of nvcc's toolchain with<br />&gt; nvcc --keep -I$HOME/include nvccboostvector.cu<br />...<br />#$ gcc -c -x c++ -I"/home/stats/stsiab/include" "-I/hpcwarwick/gpu/cuda/4.0.17/cuda/bin/../include" "-I/hpcwarwick/gpu/cuda/4.0.17/cuda/bin/../include/cudart"   -fpreprocessed -m64 -o "nvccboostvector.o" "nvccboostvector.cu.cpp" <br />/home/stats/stsiab/include/boost/numeric/ublas/vector_sparse.hpp: In member function ‘const T* boost::numeric::ublas::mapped_vector::find_element(typename A::size_type) const’:<br />/home/stats/stsiab/include/boost/numeric/ublas/vector_sparse.hpp:390: error: ‘__T13’ has not been declared<br /><br />Many messages about '__T13' appear, and seem related to nvcc's rewriting of the element access and assign methods for compressed_vector.  I cannot find any declaration of '__T13' in any of the nvcc intermediate output.  Interestingly, I only have this bug on my Linux platform (both CUDA 4.0 and 4.1, with gcc 4.3.4).  My installation on Mac OSX 10.7 appears to work fine.  <br /><br />Is there a workaround?  I feel it is quite important to have nvcc working with Boost::ublas as this is turning out to be a useful C++ BLAS library.<br /><br />Thanks,<br /><br />Chris]]></description>
   </item>
      <item>
      <title>Max number of GPU supported per host?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4566/max-number-of-gpu-supported-per-host</link>
      <pubDate>Sat, 11 Feb 2012 12:18:24 -0500</pubDate>
      <dc:creator>hammer256</dc:creator>
      <guid isPermaLink="false">4566@/devforum/discussions</guid>
      <description><![CDATA[I've recently converted my simulation to support multiple CUDA GPUs, and I'm wondering: what is the maximum number of GPUs supported by the driver on a single host computer? My algorithm should be able to handle up to 32 GPUs. <br />I'm running 64-bit Linux (Gentoo) with CUDA 4.1, currently with a GTX 470 and a GTX 580 (hardware we had lying around in the lab) on a Core i7-2600K host. I only transfer KBs of data over PCIe per time step in the simulation, so PCIe bandwidth is not a concern. I'm thinking about building an 8 (or 16, using expanders) GPU computer, so I want to know what the driver's limitations are for the number of GPUs.<br /><br />Thanks,<br />Wen]]></description>
   </item>
      <item>
      <title>Display driver vs Developer driver  Version</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4496/display-driver-vs-developer-driver-version</link>
      <pubDate>Thu, 09 Feb 2012 11:03:13 -0500</pubDate>
      <dc:creator>4fermi</dc:creator>
      <guid isPermaLink="false">4496@/devforum/discussions</guid>
      <description><![CDATA[<br />The latest "Display Driver" from <a href="http://www.nvidia.com/object/linux-display-amd64-290.10-driver.html">the products page</a> is <strong>ver 290.10</strong>.  But the latest "Developer Driver" from <a href="http://www.developer.nvidia.com/cuda-toolkit-41#s=bcb">the CUDA Developer Toolkit 4.1 download page</a> is <strong>ver 285.05.33</strong>.<br /><br />Question:  why must CUDA developers use an older version of the driver?  After all, the CUDA product made by the developer will be used by people with the newer driver!<br />]]></description>
   </item>
      <item>
      <title>nsight OpenCL profiling issue</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4316/nsight-opencl-profiling-issue</link>
      <pubDate>Sun, 05 Feb 2012 10:49:26 -0500</pubDate>
      <dc:creator>ivanmalin</dc:creator>
      <guid isPermaLink="false">4316@/devforum/discussions</guid>
      <description><![CDATA[Hi! I'm a beginner in OpenCL and would like to ask a question.<br />I'm trying to get profiling information about my OpenCL app, without success. <br />I'm using MS VS 2010 and Parallel Nsight 2.1 under Windows 7 x64 with a GTX 550 Ti card.<br />I start a new analysis activity, set its type to "Profile", set the experiment configuration to "Memory", and launch. But after it finishes, I get no information except for the "Session Summary", and in the report the events collection status is "No Events Captured".<br />How can I get all the counters, statistics, etc.?<br /><br />P.S. Also, when I use the NVIDIA Visual Profiler from the 4.1 Toolkit, I get an error saying "Unable to collect metric and event values" CUPTI_ERROR_PARAMETER_SIZE_NOT_SUFFICIENT]]></description>
   </item>
      <item>
      <title>Dynamic memory allocation in 2.x CUDA devices</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4321/dynamic-memory-allocation-in-2-x-cuda-devices</link>
      <pubDate>Sun, 05 Feb 2012 13:34:55 -0500</pubDate>
      <dc:creator>IndrajeetK</dc:creator>
      <guid isPermaLink="false">4321@/devforum/discussions</guid>
      <description><![CDATA[  C:\Users\DELL\Desktop\template(CUDA)&gt;"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --use-local-env --cl-version 2010 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin" -I"./" -I"../../common/inc" -I"../../../shared/inc" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\include"  -G0  --keep-dir "Debug" -maxrregcount=0  --machine 32 --compile  -g    -Xcompiler "/EHsc /nologo /Od /Zi  /MTd " -o "Win32/Debug/template.cu.obj" "C:\Users\DELL\Desktop\template(CUDA)\template.cu" <br />1&gt;  template.cu<br />1&gt;C:/Users/DELL/Desktop/template(CUDA)/template.cu(6): error : calling a host function("operator new ") from a __device__/__global__ function("mallocTest") is not allowed<br />1&gt;  <br />1&gt;C:/Users/DELL/Desktop/template(CUDA)/template.cu(7): error : calling a host function("free") from a __device__/__global__ function("mallocTest") is not allowed<br /><br /><br /><br />I am using an NVIDIA GeForce 525 with nvcc 4.1.<br /><br />The code is pretty much the same as in http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf<br /><br />(pages 123 and 124).<br />Please help!<br />Thanks]]></description>
   </item>
      <item>
      <title>Segmentation fault when using CUPTI</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4336/segmentation-fault-when-using-cupti</link>
      <pubDate>Sun, 05 Feb 2012 17:59:26 -0500</pubDate>
      <dc:creator>30716160</dc:creator>
      <guid isPermaLink="false">4336@/devforum/discussions</guid>
      <description><![CDATA[<code><br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> &lt;stdio.h&gt;<br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> &lt;cuda.h&gt;<br /><a href="/devforum/search?Search=%23include&amp;Mode=like">#include</a> &lt;cupti.h&gt;<br /><br /><a href="/devforum/search?Search=%23define&amp;Mode=like">#define</a> __MALLOC__<br /><br /><br />// Vector addition kernel<br />__global__ void kernel(const int* A)<br />{<br /><br />}<br /><br />void CUPTIAPI getTimestampCallback(void *userdata, CUpti_CallbackDomain domain,<br />                     CUpti_CallbackId cbid, const CUpti_CallbackData *cbInfo)<br />{<br />	uint64_t startTimestamp = 0;<br />	uint64_t endTimestamp = 0;<br />	CUptiResult cuptiErr;<br /><br />	//Kernal Launch<br />	if ( (domain == CUPTI_CB_DOMAIN_RUNTIME_API)&amp;&amp;(cbid == CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020) )<br />	{<br />		if ( cbInfo-&gt;callbackSite == CUPTI_API_ENTER )<br />		{<br />			cuptiErr = cuptiDeviceGetTimestamp(cbInfo-&gt;context, &amp;startTimestamp);<br />			printf(" Entry of %s:\n %llu \n",<br />	 	 	 	 	 cbInfo-&gt;symbolName, <br />	 	 	 	 	 (long long unsigned int)startTimestamp);<br />		}	<br />		if ( cbInfo-&gt;callbackSite == CUPTI_API_EXIT )<br />		{<br />    		cuptiErr = cuptiDeviceGetTimestamp(cbInfo-&gt;context, &amp;endTimestamp);<br />			printf(" Exit  of %s:\n %llu \n",<br />	 	 	 	 	 cbInfo-&gt;symbolName, <br />	 	 	 	 	 (long long unsigned int)endTimestamp);		<br />		}<br />	}	<br />	//cudaMemCpy<br />	if ( (domain == CUPTI_CB_DOMAIN_RUNTIME_API)&amp;&amp;(cbid == CUPTI_RUNTIME_TRACE_CBID_cudaMemcpy_v3020) )<br />	{	<br />		if ( cbInfo-&gt;callbackSite == CUPTI_API_ENTER )<br />		{<br />			cuptiErr = cuptiDeviceGetTimestamp(cbInfo-&gt;context, &amp;startTimestamp);<br />			printf(" Entry of cudaMemCpy:\n %llu \t \n",<br />	 	 	 	 	 (long long unsigned int)startTimestamp);<br />		}<br />		if ( cbInfo-&gt;callbackSite == 
CUPTI_API_EXIT )<br />		{<br />      		cuptiErr = cuptiDeviceGetTimestamp(cbInfo-&gt;context, &amp;endTimestamp);<br />			printf(" Exit  of cudaMemCpy:\n %llu \t \n",<br />	 	 	 	 	 (long long unsigned int)endTimestamp);<br />		}	<br />	}<br /><br /><a href="/devforum/search?Search=%23ifdef&amp;Mode=like">#ifdef</a> __MALLOC__<br />	//cudaMalloc<br />	if( (domain == CUPTI_CB_DOMAIN_RUNTIME_API)&amp;&amp;(cbid == CUPTI_RUNTIME_TRACE_CBID_cudaMalloc_v3020)  )<br />	{<br />		if( cbInfo-&gt;callbackSite == CUPTI_API_ENTER )<br />		{<br />			cuptiErr = cuptiDeviceGetTimestamp(cbInfo-&gt;context, &amp;startTimestamp);<br />			printf(" Entry of cudaMalloc:\n %llu \t \n",<br />	 	 	 	 	 (long long unsigned int)startTimestamp);<br />		}<br />		if( cbInfo-&gt;callbackSite == CUPTI_API_EXIT)<br />		{<br />			cuptiErr = cuptiDeviceGetTimestamp(cbInfo-&gt;context, &amp;endTimestamp);<br />			printf(" Exit  of cudaMalloc:\n %llu \t \n",<br />	 	 	 	 	 (long long unsigned int)endTimestamp);<br />		}<br />	}<br /><a href="/devforum/search?Search=%23endif&amp;Mode=like">#endif</a>	// __MALLOC__<br />	//devicechoose<br />	if( (domain == CUPTI_CB_DOMAIN_RUNTIME_API)&amp;&amp;(cbid == CUPTI_RUNTIME_TRACE_CBID_cudaChooseDevice_v3020) )<br />	{<br />		if( cbInfo-&gt;callbackSite == CUPTI_API_ENTER )<br />		{<br />			cuptiErr = cuptiDeviceGetTimestamp(cbInfo-&gt;context, &amp;startTimestamp);<br />			printf(" Entry of cudaChooseDevice:\n %llu \t Device:%d \n",<br />	 	 	 	 (long long unsigned int)startTimestamp,<br />	 	  *((cudaChooseDevice_v3020_params*)(cbInfo-&gt;functionParams))-&gt;device);<br />		}<br />		if( cbInfo-&gt;callbackSite == CUPTI_API_EXIT )<br />		{<br />			cuptiErr = cuptiDeviceGetTimestamp(cbInfo-&gt;context, &amp;endTimestamp);<br />			printf( " Exit  of cudaChooseDevice:\n %llu \t Device:%d \n\n",<br />	 	 	 	 (long long unsigned int)endTimestamp,<br />	 	 	  *((cudaChooseDevice_v3020_params*)(cbInfo-&gt;functionParams))-&gt;device);<br />		}<br />	}<br />}<br 
/><br /><br />int main()<br />{<br />  CUcontext context = 0;<br />  CUdevice device = 0;<br />  CUresult cuerr;<br />  CUptiResult cuptierr;<br />  cudaError_t cudaerr;<br /><br />  int N = 500;<br />  size_t size = N * sizeof(int);<br />  int *h_A;<br />  int *d_A;<br />  h_A = (int *)malloc(size);<br />  cudaerr = cudaSetDevice(0);<br />  if ( cudaerr != cudaSuccess )<br />  {<br />	printf("cudaSetDevice Error!\n");<br />	exit(0);<br />  }<br />  CUpti_SubscriberHandle subscriber;<br /><br />// cuerr = cuInit(0);<br />// cuerr = cuCtxCreate(&amp;context, 0, device);<br /><br />  cuptierr = cuptiSubscribe(&amp;subscriber, (CUpti_CallbackFunc)getTimestampCallback , NULL);<br />  cuptierr = cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);<br /><br />  // Allocate vectors in device memory<br />  cudaMalloc((void**)&amp;d_A, size);<br />  // Copy vectors from host memory to device memory<br />  cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);<br /><br />  kernel&lt;&lt;&lt;16, 128&gt;&gt;&gt;(d_A);<br />  cudaThreadSynchronize();<br /><br />  cuptierr = cuptiUnsubscribe(subscriber);<br />  if (d_A)<br />    cudaFree(d_A);<br />  if (h_A)<br />    free(h_A);<br />  return 0;<br />}<br /></code><br /><br />Compile the above code and run it. It fails with a "Segmentation Fault".<br />However, if I comment out the "cudaMalloc" part of the callback code, it works fine.  The project is attached, so you can try it.<br /><br />Two other ways to avoid the problem are:<br />1. Call "cuCtxCreate" if cudaSetDevice is called.<br />or<br />2. Don't use cudaSetDevice.<br /><br />I don't understand why. Maybe it's a bug in CUPTI. Any help or explanation is appreciated.<br /><br />]]></description>
   </item>
      <item>
      <title>CUDA Toolkit 4.1.15 on openSUSE 12.1?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3651/cuda-toolkit-4-1-15-on-opensuse-12-1</link>
      <pubDate>Wed, 18 Jan 2012 15:55:04 -0500</pubDate>
      <dc:creator>gue22</dc:creator>
      <guid isPermaLink="false">3651@/devforum/discussions</guid>
      <description><![CDATA[Went to great lengths to install openSUSE 12.1 on a physical machine (as opposed to the VMware and Hyper-V VMs I normally use to try all kinds of things) only to find out upon closer inspection that the CUDA Toolkit 4.1.15 downloads are targeted for 11.2.<br /><br />Any chance for success on 12.1 (quite different core from 11.x) or should I set up Yet Another variant with 11.2?<br />Thx<br />G.]]></description>
   </item>
      <item>
      <title>On streams and asynchronous execution</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3496/on-streams-and-asynchronous-execution</link>
      <pubDate>Mon, 16 Jan 2012 03:32:28 -0500</pubDate>
      <dc:creator>lucana</dc:creator>
      <guid isPermaLink="false">3496@/devforum/discussions</guid>
      <description><![CDATA[This question is just to make sure I understand how CUDA streams work.<br /><br />Imagine I have a for loop like this. I am using only one stream.<br /><br />for (i=0; i &lt; N; i++)<br />{<br />	 run operations on CPU <br />	 copy results of CPU operations to the GPU with cudaMemcpyAsync<br />	 call kernel &lt;&lt;&lt;  , &gt;&gt;&gt;<br />}<br /><br />My understanding is that the kernel for i and the CPU operations for i+1 at the beginning of the loop will execute concurrently, but the kernel for i+1 won't start until the CPU has finished computing the results for i+1.<br /><br />Is this right? Or will the operations on the CPU and GPU never overlap? Could the kernel start before the CPU has computed the proper results? Is it necessary to add some control flags to make sure the CPU operations have finished before the kernel starts?<br /><br />This diagram shows what I want to do. It is in fact a pipeline, but I'm still unsure whether it is possible with CUDA.<br /><br />----------i = 0 -------------------- i = 1 --------------------------- i = 2<br />(t0) compute results on CPU<br />(t1) copy results to CUDA kernel -- compute results on CPU<br />(t2) execute kernel --------------- copy results to CUDA kernel -- compute results on CPU<br />(t3) ------------------------------ execute kernel --------------- copy results to CUDA kernel <br />(t4)-------------------------------------------------------------- execute kernel<br /><br />Finally, I would like to ask whether it makes sense to use CUDA streams when there is a data dependency between streams, as in the pipeline shown above.]]></description>
   </item>
      <item>
      <title>How many H.264 HD (1280x720) streams can a GTX 460 decode at the same time?</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4251/how-many-can-itgtx-460-create-h-264-codec-to-decode-hd1280x720-at-the-same-time-</link>
      <pubDate>Thu, 02 Feb 2012 01:57:00 -0500</pubDate>
      <dc:creator>shlee7708</dc:creator>
      <guid isPermaLink="false">4251@/devforum/discussions</guid>
      <description><![CDATA[Hello,<br /><br />I want to decode 32 channels of HD 720p H.264 at the same time using CUDA.<br /><br />Is it possible?<br /><br />If it is possible, what kind of NVIDIA GPU should I use?]]></description>
   </item>
      <item>
      <title>Please update the openSUSE packages.</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4086/please-update-the-opensuse-packages-</link>
      <pubDate>Sat, 28 Jan 2012 09:51:42 -0500</pubDate>
      <dc:creator>Deanjo</dc:creator>
      <guid isPermaLink="false">4086@/devforum/discussions</guid>
      <description><![CDATA[Can you guys please update the openSUSE packages? Support for openSUSE 11.2 was discontinued on May 12th 2011, and support for 11.3 was discontinued on January 20th 2012. 12.1 is the current release, and all we are asking is for the package to be updated and a bit of equality in support. Just to give a bit of perspective, support for these openSUSE versions was discontinued around the same time the latest CUDA-supported version of Ubuntu was released.]]></description>
   </item>
      <item>
      <title>New CUDA Toolkit 4.1, Now in Production</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/4036/new-cuda-toolkit-4-1-now-in-production</link>
      <pubDate>Thu, 26 Jan 2012 16:24:57 -0500</pubDate>
      <dc:creator>Nadeem Mohammad</dc:creator>
      <guid isPermaLink="false">4036@/devforum/discussions</guid>
      <description><![CDATA[A new production release of CUDA has been posted. This new release makes it faster and easier to accelerate scientific research with GPUs.  Key features include a re-designed Visual Profiler with automated performance analysis, a new LLVM-based compiler that helps your apps run up to 10% faster, and 1000+ new imaging and signal processing functions in the NPP library.  We’ve also added a new tri-diagonal solver, 2x faster SpMV using the ELL hybrid format, and some great improvements to the debugging and performance analysis tools.  Learn more and download from <a href="http://bit.ly/w3H6Z7">CUDAZone</a>]]></description>
   </item>
      <item>
      <title>GPU Accelerated 2D to Stereo 3D Video Conversion</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3906/gpu-accelerated-2d-to-stereo-3d-video-conversion</link>
      <pubDate>Tue, 24 Jan 2012 16:43:20 -0500</pubDate>
      <dc:creator>DryRiver</dc:creator>
      <guid isPermaLink="false">3906@/devforum/discussions</guid>
      <description><![CDATA[Hello All,<br /><br />I have written a pretty good 2D-to-3D video conversion algorithm in C# .NET. (It took a little over 2 years of experimenting to get it right.)<br /><br />I now want to GPU accelerate this 2D-to-3D conversion algorithm. I am hoping for a 10x to 20x speedup by using the GPU, instead of the CPU, to do the pixel crunching.<br /><br />My requirements are:<br /><br />- The GPU code needs to execute inside a C# .NET Windows Forms Application<br /><br />- I want to use the easiest, most beginner-friendly GPU coding method possible<br /><br />Where should I start with this? CUDA.NET? OpenCL.NET? Brahma (for C#)?<br /><br />Are there any beginner tutorials for using CUDA/OpenCL inside C# .NET?<br /><br />Are there, specifically, any image processing tutorials/examples for CUDA/OpenCL?<br /><br />Thank you for any feedback. I am a complete CUDA/OpenCL noob and am hoping for expert advice on making my first GPU accelerated project happen.<br /><br />Best Regards,<br /><br />                  DryRiver]]></description>
   </item>
      <item>
      <title>device function pointers</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3921/device-function-pointers</link>
      <pubDate>Wed, 25 Jan 2012 04:12:51 -0500</pubDate>
      <dc:creator>micheletuttafesta</dc:creator>
      <guid isPermaLink="false">3921@/devforum/discussions</guid>
      <description><![CDATA[Dear Sirs,<br />I need a device version of the following host code:<br /><br />#include &lt;stdio.h&gt;<br />#include &lt;stdlib.h&gt;<br /><br />double (**func)(double x);<br /><br />double func1(double x)<br />{<br /> return x+1.;<br />}<br /><br />double func2(double x)<br />{<br /> return x+2.;<br />}<br /><br />double func3(double x)<br />{<br /> return x+3.;<br />}<br /><br />void test(void)<br />{<br /> double x;<br /><br /> for(int i=0;i&lt;3;++i){<br />  x=func[i](2.0);<br />  printf("%g\n",x);<br /> }<br /><br />}<br /><br />int main(void)<br />{<br /> func=(double (**)(double))malloc(10*sizeof(double (*)(double)));<br /><br /> /* fill the table before test() calls through it */<br /> func[0]=func1;<br /> func[1]=func2;<br /> func[2]=func3;<br /><br /> test();<br /><br /> free(func);<br /> return 0;<br />}<br /><br /><br />where func1, func2 and func3 have to be __device__ functions<br />and "test" has to be a (suitably modified) __global__ kernel.<br /><br />I have an NVIDIA GeForce GTS 450 (compute capability 2.1).<br />Thank you in advance<br />Michele]]></description>
   </item>
      <item>
      <title>NSIGHT doesn&#039;t let me choose threads with id greater than 15.</title>
      <link>http://forums.developer.nvidia.com/devforum/discussion/3696/nsight-doesnt-let-me-choose-threads-with-id-greater-than-15-</link>
      <pubDate>Thu, 19 Jan 2012 10:41:02 -0500</pubDate>
      <dc:creator>lucana</dc:creator>
      <guid isPermaLink="false">3696@/devforum/discussions</guid>
      <description><![CDATA[I have managed to stop CUDA debugging at breakpoints. I'm working with VS2010. I can use the Debug Focus to select threads and blocks to follow, but I can't select all of the threads/blocks I defined. The grid and block dimensions shown there are wrong. For example, I launched 1024 threads (kernel&lt;&lt;&lt;1, 1024&gt;&gt;&gt;), but it only lets me choose up to thread number 15. Is this normal? Am I doing something wrong?]]></description>
   </item>
      </channel>
</rss>
