<?xml version="1.0" encoding="UTF-8"?>
<!--Generated by Squarespace V5 Site Server v5.13.156 (http://www.squarespace.com) on Mon, 20 May 2013 06:17:22 GMT--><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><title>Pixel I/O Blog</title><link>http://www.pixel.io/blog/</link><description></description><lastBuildDate>Mon, 20 May 2013 03:11:13 +0000</lastBuildDate><copyright></copyright><language>en-US</language><generator>Squarespace V5 Site Server v5.13.156 (http://www.squarespace.com)</generator><item><title>High Register-Count GK110 Kernels in HotSort</title><category>CUDA</category><category>GK110</category><category>HotSort</category><category>K20c</category><category>Kepler</category><category>Tesla</category><category>gpu</category><dc:creator>Allan MacKinnon</dc:creator><pubDate>Sun, 19 May 2013 21:37:27 +0000</pubDate><link>http://www.pixel.io/blog/2013/5/19/high-register-count-gk110-kernels-in-hotsort.html</link><guid isPermaLink="false">424276:4970759:33732002</guid><description><![CDATA[<p>Last week I returned to working on HotSort in order to add a few new features.  One of the "free" features on my list was to implement high register-count merge kernels on GK110 and GT200 architectures.</p>

<p>The merge kernels in HotSort minimize global loads and stores by maximizing the number of element comparisons performed per thread.</p>

<p>Up until now, the same merging algorithm and register configurations were being used across all CUDA architectures and the resulting merge kernels were approaching the Fermi and GK104 63 register-per-thread limit.</p>

<p>Since GK110 and GT200 devices support high register-count kernels, two new merge kernels have been implemented to exploit this capability.</p>

<p>These new kernels further reduce the total number of global memory transactions resulting in an ~8-12% performance increase in sorting large arrays of 32-bit and 64-bit elements.</p>
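
<p>To make the idea concrete, here is a minimal sketch of a kernel that trades registers for memory traffic. It is not HotSort's merge kernel: the name <code>sortInRegisters</code>, the <code>ELEMS</code> parameter and the odd-even transposition network are illustrative assumptions only. Each thread loads its keys once, performs every comparison in registers and stores once.</p>

<pre><code>    #include &lt;cuda_runtime.h&gt;

    // Illustrative only: each thread sorts ELEMS keys held in registers.
    // Assumes the grid covers the array exactly (threads * ELEMS keys).
    template &lt;int ELEMS&gt;
    __global__ void __launch_bounds__(128)
    sortInRegisters(unsigned int* keys)
    {
        unsigned int r[ELEMS];
        const size_t tid  = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int* base = keys + tid * ELEMS;

        // One global load per key.
    #pragma unroll
        for (int i = 0; i &lt; ELEMS; i++)
            r[i] = base[i];

        // Odd-even transposition sort: ELEMS passes of compare-exchanges.
        // Full unrolling keeps every index constant so r[] can stay in registers.
    #pragma unroll
        for (int pass = 0; pass &lt; ELEMS; pass++)
        {
    #pragma unroll
            for (int i = pass &amp; 1; i + 1 &lt; ELEMS; i += 2)
            {
                const unsigned int lo = min(r[i], r[i + 1]);
                const unsigned int hi = max(r[i], r[i + 1]);
                r[i]     = lo;
                r[i + 1] = hi;
            }
        }

        // One global store per key.
    #pragma unroll
        for (int i = 0; i &lt; ELEMS; i++)
            base[i] = r[i];
    }
</code></pre>

<p>Raising <code>ELEMS</code> raises the register count, which is where the 63 register-per-thread limit on Fermi and GK104 starts to bite and where the 255 registers available on sm_35 help. Building with <code>nvcc -arch=sm_35 -Xptxas -v</code> reports the registers used by each instantiation.</p>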

<p>Not a bad result for simply adjusting a few configuration files and rebuilding!</p>

<p>You can see a comparison between the old and new Tesla K20c kernels <a href="http://www.pixel.io/storage/post-files/k20.pdf">here</a>.</p>

<p>The updated HotSort Benchmarks doc for all architectures including GT200 is <a href="http://www.pixel.io/storage/hotSortBench.pdf">here</a>.</p>

<p><em><strong>Update:</strong></em></p>

<p>One last note on performance: I can actually achieve an extra ~2% improvement on large arrays, as well as on large numbers of small arrays, if the new high register-count merge kernels are used as early as possible in the sorting process. However, this leaves the very small single-array benchmarks ~1% below their peak. My assumption is that the dip is entirely due to SMX under-utilization. The fix is straightforward: launch the smaller merging kernels when performing small sorts. Right now I'm mostly interested in small array performance, so I chose not to disturb the small array sorting kernel launch logic and will save that work for later.</p>

<p><span class="thumbnail-image-block ssNonEditable"><span><a href="javascript:showFullImage('/display/ShowImage?imageUrl=%2Fstorage%2Fpost-images%2Fk20c_hireg.png%3F__SQUARESPACE_CACHEVERSION%3D1368999719618',776,748);"><img src="http://www.pixel.io/storage/thumbnails/4681329-22723421-thumbnail.jpg?__SQUARESPACE_CACHEVERSION=1368999728016" alt=""/></a></span></span></p>
]]></description><wfw:commentRss>http://www.pixel.io/blog/rss-comments-entry-33732002.xml</wfw:commentRss></item><item><title>Kernel arguments vs __constant__ variables?</title><category>C</category><category>C++</category><category>C99</category><category>CUDA</category><category>__constant__</category><category>gpu</category><category>kernel</category><dc:creator>Allan MacKinnon</dc:creator><pubDate>Thu, 09 May 2013 23:39:09 +0000</pubDate><link>http://www.pixel.io/blog/2013/5/9/kernel-arguments-vs-__constant__-variables.html</link><guid isPermaLink="false">424276:4970759:33653289</guid><description><![CDATA[<p>It's not uncommon to have a GPU kernel where a number of the kernel parameters are left as constants throughout the entire call chain and, if the target were a CPU, would have been implemented as traditional <code>static const</code> file scope variables.</p>

<p>For this reason, you might think that exploiting the <code>__constant__</code> qualifier and its associated <code>cudaXXXSymbol()</code> routines is a good way of reducing the number of kernel arguments and generally cleaning up your code.</p>

<p>CUDA <code>__constant__</code> variables can definitely simplify your kernel code. However, you're not only shifting some complexity toward module symbol initialization, you're also potentially closing off the opportunity to launch back-to-back grids with different <code>__constant__</code> variable values.</p>

<p>Let's look at what is actually being generated by the compiler and see if there is any major difference between kernel args and <code>__constant__</code> variables at the SASS level. The expectation is that there is not (see Section D.2.5.2 in the <em>CUDA C Programming Guide</em>).</p>

<p>These two kernels are functionally the same but one uses parameters and the other constants:</p>

<p><span class="full-image-block ssNonEditable"><span><img src="http://www.pixel.io/storage/post-images/symarg.cu.png?__SQUARESPACE_CACHEVERSION=1368145921880" alt=""/></span></span></p>

<p>They produce nearly identical SASS code:</p>

<p><span class="full-image-block ssNonEditable"><span><img src="http://www.pixel.io/storage/post-images/symarg.sass.png?__SQUARESPACE_CACHEVERSION=1368145976200" alt=""/></span></span></p>

<p>Since kernel parameters are passed in constant memory on sm_20+, I expected both kernels to produce similar SASS. The only apparent difference is that the values are pulled from different constant banks.</p>

<p>If you look at pre-Fermi output you'll see the SASS is <em>not</em> the same because kernel parameters are passed via shared memory.</p>

<p>So back to the original question.  Are module-scoped <code>__constant__</code> variables useful given that strictly using kernel parameters produces similar code?  Sure, you can write simpler code as long as you're sure you'll never need to update the <code>__constant__</code> variables in a context. This very issue and a workaround were <a href="https://devtalk.nvidia.com/default/topic/540182/cuda-programming-and-performance/how-is-__constant__-memory-with-respect-to-cuda-streams-/post/3785547/">recently discussed in the CUDA Developer Forums</a>.</p>
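
<p>As a rough sketch of the trade-off, here is a hypothetical pair of kernels in the same spirit as the two compared above. The names <code>scaleArgs</code>, <code>scaleConst</code>, <code>c_n</code> and <code>c_scale</code> are made up for illustration and this is not the code from the screenshots. The argument version carries its values with each launch, while the <code>__constant__</code> version needs its symbols initialized first and shares them across every launch in the context.</p>

<pre><code>    #include &lt;cuda_runtime.h&gt;

    // Module-scoped constants (hypothetical names).
    __constant__ int   c_n;
    __constant__ float c_scale;

    // Variant 1: everything arrives as a kernel argument.
    __global__ void scaleArgs(float* out, const float* in, int n, float scale)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i &lt; n)
            out[i] = in[i] * scale;
    }

    // Variant 2: the same work, reading __constant__ variables instead.
    __global__ void scaleConst(float* out, const float* in)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i &lt; c_n)
            out[i] = in[i] * c_scale;
    }

    void launchBoth(float* d_out, const float* d_in, int n, float scale)
    {
        const int blocks = (n + 255) / 256;

        // Arguments: each launch carries its own values.
        scaleArgs&lt;&lt;&lt;blocks, 256&gt;&gt;&gt;(d_out, d_in, n, scale);

        // Symbols: must be written before the launch that reads them.
        cudaMemcpyToSymbol(c_n,     &amp;n,     sizeof(n));
        cudaMemcpyToSymbol(c_scale, &amp;scale, sizeof(scale));
        scaleConst&lt;&lt;&lt;blocks, 256&gt;&gt;&gt;(d_out, d_in);
    }
</code></pre>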

<p>But what I really want is the ability to <em>optionally</em> declare that a <code>const</code> kernel parameter has been raised to <em>kernel scope</em> (file scope?) as if it had been declared as a CUDA <code>__constant__</code> variable.</p>

<p>It would be convenient and powerful to be able to write something like:</p>

<pre><code>    __global__ foo(__constant__ int* bar, int* baz)
    {
       ...
    }
</code></pre>

<p>This would indicate that the parameter <code>bar</code> should be raised to kernel scope. The benefit being that the developer wouldn't have to invoke <code>cudaXXXSymbol()</code> initialization routines after loading the kernel module and before kernel launch.</p>

<p>An earlier file scope declaration that matches the parameter name and type might also be appropriate.</p>

<p>This syntax won't (ever) be appearing in CUDA C99/C++ but I think it's conceptually interesting for anyone writing a performance-focused DSL for GPUs to understand where there are mismatches between C99/C++ and CUDA.</p>

<p><strong><em>Update:</em></strong> A compromise might be to create a structure that contains all of the constants you wish to raise to kernel scope and pass it (by value) as a parameter to the kernel. This way only a single reference needs to be passed throughout the kernel to access a number of launch-time CTA-specific constants.</p>

<p>The idiom would look something like this:</p>

<pre><code>    typedef struct
    {
       int* bar;
       int* baz;
    } KernelEnv;

    __global__ foo(const KernelEnv kenv)
    {
       myxl(kenv,...);
       plyx(kenv,...);
    }
</code></pre>

<p>The code snippet and PTX/SASS dumps are <a href="https://gist.github.com/allanmac/a1cc8ab7b56d23ccd6d3">here</a>.</p>
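
<p>For completeness, a fuller sketch of the struct-by-value idiom, including the host side, might look like the following. The field names, <code>loadBiased</code>, <code>addBias</code> and <code>launchTwice</code> are hypothetical and chosen only to keep the example self-contained.</p>

<pre><code>    #include &lt;cuda_runtime.h&gt;

    // Hypothetical environment: every field is a launch-time constant.
    struct KernelEnv
    {
        const int* in;
        int*       out;
        int        n;
        int        bias;
    };

    // Device helpers receive the whole environment through one reference.
    __device__ int loadBiased(const KernelEnv&amp; kenv, int i)
    {
        return kenv.in[i] + kenv.bias;
    }

    __global__ void addBias(const KernelEnv kenv)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i &lt; kenv.n)
            kenv.out[i] = loadBiased(kenv, i);
    }

    // Host side: fill the struct and launch. No symbol initialization is
    // needed, so back-to-back launches can each carry different values.
    void launchTwice(const int* d_in, int* d_out, int n)
    {
        const int blocks = (n + 255) / 256;
        KernelEnv env = { d_in, d_out, n, 1 };

        addBias&lt;&lt;&lt;blocks, 256&gt;&gt;&gt;(env);
        env.bias = 2;                       // second launch overwrites d_out
        addBias&lt;&lt;&lt;blocks, 256&gt;&gt;&gt;(env);
    }
</code></pre>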
]]></description><wfw:commentRss>http://www.pixel.io/blog/rss-comments-entry-33653289.xml</wfw:commentRss></item><item><title>Fast matrix transposition without shuffling or shared memory</title><dc:creator>Allan MacKinnon</dc:creator><pubDate>Mon, 08 Apr 2013 02:51:21 +0000</pubDate><link>http://www.pixel.io/blog/2013/4/7/fast-matrix-transposition-without-shuffling-or-shared-memory.html</link><guid isPermaLink="false">424276:4970759:33265304</guid><description><![CDATA[<p><a href="http://www.pixel.io/blog/2013/3/25/fast-matrix-transposition-on-kepler-without-using-shared-mem.html">Last time</a> I presented a clever way of transposing that exploited the GPU's support of 32-byte stores as its smallest "100% efficient" transaction size.</p>

<p>An open question was whether designing a transpose that performed 64-byte stores would achieve better performance at the cost of more instructions.</p>

<p>The answer is yes. Performing 64-byte transactions improves throughput by ~8%.  The new 64-byte transpose kernel reaches ~148 GB/sec. on a 1024x1024 matrix of 32-bit elements on a K20c (758MHz).  This is getting very close to the ~153 GB/sec. demonstrated by the shared transpose example.</p>

<pre><code>    Tesla K20c : sm_35 * 13
    transposeKernel&lt;&lt;&lt;1024,64&gt;&gt;&gt;(1024 x 1024)
    .... Validated!
    loops (100)  : avg   0.05274 ms. =  148.12 GB/sec.
</code></pre>

<p>The interesting difference from last time is that, of the approaches I tried, the highest-performing kernel dispensed with <code>SHFL</code>s and instead performed the rotations implicitly by manipulating the lower bits of the load and store pointers.</p>

<p>Similar to last time, neither <code>__syncthreads()</code> calls nor shared memory are required.</p>

<p>Here is an illustration of how I chose to rearrange a tile of 4 lanes each holding 16 32-bit values:</p>

<p><span class="thumbnail-image-block ssNonEditable"><span><a href="javascript:showFullImage('/display/ShowImage?imageUrl=%2Fstorage%2Fpost-images%2Fimg_transpose_4x4.png%3F__SQUARESPACE_CACHEVERSION%3D1365439891215',520,680);"><img src="http://www.pixel.io/storage/thumbnails/4681329-22391894-thumbnail.jpg?__SQUARESPACE_CACHEVERSION=1365439892517" alt=""/></a></span></span></p>

<p>A quick explanation:</p>

<p>Each lane loads 16 values.  A logical tile contains 16x4 elements and occupies 4 lanes.  I explicitly show the rotations that are performed on load but do not show the counter-rotations that must be performed on store. More on that later.</p>

<p>The rotations can be performed by either <code>SHFL</code>'s or implicitly through pointer manipulation. Implicit "rotation pointers" are created by add-masking or using a <code>BFI</code> instruction.  On load, each 4x4 block of registers is rotated one lane to the right of the block directly above it. With either approach, the warp performs 16 standard coalesced load transactions.</p>
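
<p>Here is a minimal sketch of the two ways to rotate within a 4-lane tile. It is a simplification of the kernel described here: it rotates a single value per lane by <code>k</code> lanes, and the names <code>rotatedLane</code> and <code>rotateLoad</code> are hypothetical. It also assumes a toolkit that provides <code>__shfl_sync</code>; the Kepler-era intrinsic was <code>__shfl</code>. Both outputs should hold the same values.</p>

<pre><code>    #include &lt;cuda_runtime.h&gt;

    // Add-mask: keep the tile base (upper bits), rotate the low two bits.
    __device__ __forceinline__ int rotatedLane(int lane, int k)
    {
        return (lane &amp; ~3) | ((lane + k) &amp; 3);
    }

    // Assumes blockDim.x is a multiple of 32 and the arrays cover the grid.
    __global__ void rotateLoad(const int* in, int* outImplicit, int* outShfl, int k)
    {
        const int tid      = blockIdx.x * blockDim.x + threadIdx.x;
        const int lane     = threadIdx.x &amp; 31;
        const int warpBase = tid &amp; ~31;
        const int src      = rotatedLane(lane, k);

        // Implicit rotation: the load address itself is rotated.
        const int a = in[warpBase + src];

        // Explicit rotation: load in lane order, then shuffle.
        const int v = in[warpBase + lane];
        const int b = __shfl_sync(0xffffffffu, v, src);

        outImplicit[tid] = a;
        outShfl[tid]     = b;
    }
</code></pre>

<p>In the implicit version the rotation costs a few integer instructions on the address, and the warp still touches the same set of addresses, so the load remains coalesced.</p>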

<p>The exchange phase simply rearranges values in lanes 2 and 3 of the tile so that the next phase can use the <code>SLCT</code> opcode to choose which value to store back to device memory.  Put more simply, these are the warp lanes whose lane id has the 2s bit set (<code>laneId &amp; 2</code>).</p>

<p>Finally, the select phase takes advantage of the symmetries created by the rotation and exchange phases to enable all 4 lanes to coordinate a 64-byte (16-element) write.  If you inspect each set of same-colored blocks you will see that they can be selected and written as <code>vec4</code> instances simply by checking whether the lane is odd or even.</p>

<p>The 16x4 registers are written by the 4 coordinating lanes as 4 <code>vec4</code>'s in the order: red, orange, green, blue.  They're each separated by a matrix width of elements.</p>

<p>As noted above, the 4x4 blocks of registers across the 4 lanes can either be counter-rotated using SHFL or the store pointer can be manipulated to perform the same operation implicitly.</p>

<p>That's it and the performance is pretty good.  I found that the pointer manipulation approach was slightly faster than using <code>SHFL</code>'s. Conversely, in the <a href="http://www.pixel.io/blog/2013/3/25/fast-matrix-transposition-on-kepler-without-using-shared-mem.html">previous blog post</a> the <code>SHFL</code> implementation was slightly faster than the pointer approach when performing 32-byte stores.</p>

<p>There are quite a few ways to avoid using shared memory but... it's probably not possible to get that last ~5 GB/sec. of throughput achieved by NVIDIA's shared transpose example on 1024x1024x32-bit matrices.  The number of instructions in that kernel is tiny.</p>

<p>But if you're interested in creating your own specialized transpose or unique data-rearrangement kernel, here is a list of low-level GPU features that you can mix-and-match:</p>

<ul>
<li>writing 16-byte <code>v2</code> or <code>v4</code> types to simplify inter-lane word movement</li>
<li>explicit shuffling</li>
<li><a href="http://www.pixel.io/blog/2013/3/14/experiments-with-shfl.html">rotations via shuffling</a></li>
<li>implicit shuffling on store</li>
<li>implicit shuffling on load</li>
<li>maximizing use of the <code>SELP</code> opcode: <code>d = pred ? a : b;</code></li>
<li>thinking in terms of the non-existent opcode <code>MOVP</code>: <code>if (pred) a = c; else b = c;</code></li>
<li>exchange two registers with an <code>XCHG</code>: <code>t = a; a = b; b = t;</code></li>
</ul>
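
<p>The last three items in the list can be written as tiny device helpers. This is only a sketch: the helper names mirror the opcodes mentioned above, <code>selectDemo</code> is a hypothetical kernel, and whether the compiler actually emits predicated selects for them depends on the surrounding code.</p>

<pre><code>    #include &lt;cuda_runtime.h&gt;

    // d = pred ? a : b;
    __device__ __forceinline__ unsigned selp(unsigned a, unsigned b, bool pred)
    {
        return pred ? a : b;
    }

    // if (pred) a = c; else b = c;   (written as two selects, no branch)
    __device__ __forceinline__ void movp(unsigned&amp; a, unsigned&amp; b, unsigned c, bool pred)
    {
        a = pred ? c : a;
        b = pred ? b : c;
    }

    // t = a; a = b; b = t;
    __device__ __forceinline__ void xchg(unsigned&amp; a, unsigned&amp; b)
    {
        const unsigned t = a;
        a = b;
        b = t;
    }

    // Each thread reads two values; odd lanes write them in swapped order.
    __global__ void selectDemo(const unsigned* in, unsigned* out)
    {
        const unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
        const bool     odd = (threadIdx.x &amp; 1) != 0;

        const unsigned a = in[2 * tid];
        const unsigned b = in[2 * tid + 1];

        out[2 * tid]     = selp(b, a, odd);
        out[2 * tid + 1] = selp(a, b, odd);
    }
</code></pre>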

<p>Please let me know if you have any questions or feedback.</p>
]]></description><wfw:commentRss>http://www.pixel.io/blog/rss-comments-entry-33265304.xml</wfw:commentRss></item><item><title>Fast matrix transposition on Kepler without using shared memory</title><category>Kepler</category><category>gpu</category><category>gpu hacks</category><category>matrix transpose</category><category>shfl</category><category>small code</category><dc:creator>Allan MacKinnon</dc:creator><pubDate>Mon, 25 Mar 2013 21:23:25 +0000</pubDate><link>http://www.pixel.io/blog/2013/3/25/fast-matrix-transposition-on-kepler-without-using-shared-mem.html</link><guid isPermaLink="false">424276:4970759:33149964</guid><description><![CDATA[<p>[3/26/2013: <em>Updated benchmark.</em>]<br />[3/29/2013: <em>Updated NV transpose tile size.</em>]</p>
<p>I needed a fast and minimal routine that could transpose a warp with 32 elements per lane and store the result to device memory as the final step in a complex kernel.</p>
<p>My hypothesis was that Kepler SHFL instructions combined with 16-byte stores by pairs of lanes would be functionally equivalent to the standard matrix transpose approach, yet would require no shared memory or thread block synchronization and would still achieve 100% memory store efficiency.</p>
<p>I banged out the code over the weekend to see what the PTX/SASS would look like and after getting encouraging performance results I used the hypothesized warp-centric "pseudo-transpose" primitive to implement a full matrix transpose.</p>
<h3>Results</h3>
<p>A 1024x1024 matrix of 32-bit elements transposes on a K20c at ~137 GB/sec. &nbsp;<span style="text-decoration: line-through;">This is ~22% better than the "optimized outer" variant example in the CUDA Toolkit which averages 112 GB/sec</span>.</p>
<p>Here's the output from the micro benchmark:</p>
<pre><code>    Tesla K20c : sm_35 * 13
    transposeKernel&lt;&lt;&lt;2048,64&gt;&gt;&gt;(1024 x 1024)
    .... Validated!
    loops (100)  : avg   0.05677 ms. =  137.62 GB/sec.
</code></pre>
<p>This is pretty good and beats the default NVIDIA example. &nbsp;</p>
<p>But if the "optimized outer" transpose example in the CUDA Toolkit is modified to use a 32x8 block size (32x32 tile size) it jumps from 112 GB/sec to 153.5 GB/sec. on a K20c (758 MHz). &nbsp;Kudos!</p>
<p>If you have a reason to avoid using shared memory then the approach I describe below remains a good option.</p>
<h3>Implementation</h3>
<p>The excellent&nbsp;<a href="http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf">Matrix Transpose Example</a>&nbsp;in the CUDA Toolkit uses a square tile of padded shared memory to switch the row-column ordering. 256 elements are loaded from device memory, stored to the shared tile, synchronized, loaded from the shared tile in transposed order and stored back out to device memory. &nbsp;It's simple and fast. &nbsp;Note that very few registers are actually put to use with this approach.  This isn't a problem since the kernel has a 1:1 thread to element ratio -- i.e. <em>many</em> threads are being used.</p>
<p>My transpose routine takes a different approach. Instead of striving to construct wide 128-byte aligned stores, each warp in this routine performs a trivial rearrangement of each lane pair's elements followed by every lane storing half of an aligned 32-byte transaction.</p>
<p>It's easier to show how this works with a few illustrations.</p>
<p>We start with a conceptual warp of 8 lanes with 8 32-bit elements per lane:&nbsp;</p>
<p style="text-align: center;"><span class="full-image-block ssNonEditable"><img src="http://www.pixel.io/storage/post-images/img_tile.png?__SQUARESPACE_CACHEVERSION=1364267347815" alt="" /></span></p>
<p style="text-align: left;">Next, exchange the odd 16 bytes in even lanes with the even 16 bytes in odd lanes:</p>
<p style="text-align: center;"><span class="full-image-block ssNonEditable"><img src="http://www.pixel.io/storage/post-images/img_shuffle.png?__SQUARESPACE_CACHEVERSION=1364267641554" alt="" /></span></p>
<p style="text-align: left;">Once the shuffle is complete, 32 bytes of elements from a lane column have been transformed into two lanes of 16 bytes in the same register row. The 32-byte groupings are highlighted in different colors:</p>
<p style="text-align: center;"><span class="full-image-block ssNonEditable"><img src="http://www.pixel.io/storage/post-images/img_coordinate.png?__SQUARESPACE_CACHEVERSION=1364267806082" alt="" /></span></p>
<p style="text-align: left;">At this point, each lane pair can perform an efficient 32-byte <code>st.global.vec4.u32</code> (or equivalent). &nbsp;The output address calculation is very simple and dependent on the size of the tile and matrix. &nbsp;</p>
<p style="text-align: left;">Each lane stores twice in this example. &nbsp;The conceptual 8-lane warp will perform two 128-bit stores per lane. &nbsp;Each 128-bit store results in four 32-byte aligned global store transactions.</p>
<p style="text-align: left;"><span class="full-image-block ssNonEditable"><img style="width: 630px;" src="http://www.pixel.io/storage/post-images/img_write.png?__SQUARESPACE_CACHEVERSION=1364268507185" alt="" /></span></p>
<p style="text-align: left;">Finally, if the transposed warp were reloaded from device memory here is where each color-coded 32-byte store transaction occurred:</p>
<p style="text-align: center;"><img src="http://www.pixel.io/storage/post-images/img_transposed.png?__SQUARESPACE_CACHEVERSION=1364316191263" alt="" /></p>
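<p>The steps illustrated above can be sketched for a full 32-lane warp holding 8 elements per lane. This is a simplified illustration, not the kernel that was benchmarked: the names <code>shflXor4</code> and <code>transpose8x32</code> are hypothetical, the input is assumed to be a single row-major 8x32 tile, and it uses <code>__shfl_xor_sync</code> where Kepler-era code would have used <code>__shfl_xor</code>.</p>
<pre><code>    #include &lt;cuda_runtime.h&gt;

    // Exchange a 16-byte value with the neighbouring lane (lane ^ laneMask).
    __device__ __forceinline__ uint4 shflXor4(uint4 v, int laneMask)
    {
        v.x = __shfl_xor_sync(0xffffffffu, v.x, laneMask);
        v.y = __shfl_xor_sync(0xffffffffu, v.y, laneMask);
        v.z = __shfl_xor_sync(0xffffffffu, v.z, laneMask);
        v.w = __shfl_xor_sync(0xffffffffu, v.w, laneMask);
        return v;
    }

    // in : 8 rows x 32 columns, row-major.  out : 32 rows x 8 columns.
    // Launch with one warp: transpose8x32&lt;&lt;&lt;1,32&gt;&gt;&gt;(d_in, d_out);
    __global__ void transpose8x32(const unsigned* in, unsigned* out)
    {
        const int  lane = threadIdx.x &amp; 31;
        const bool odd  = (lane &amp; 1) != 0;

        // Coalesced loads: lane = column, register = row.
        uint4 a = make_uint4(in[0 * 32 + lane], in[1 * 32 + lane],
                             in[2 * 32 + lane], in[3 * 32 + lane]);
        uint4 b = make_uint4(in[4 * 32 + lane], in[5 * 32 + lane],
                             in[6 * 32 + lane], in[7 * 32 + lane]);

        // Even lanes give up their upper 16 bytes, odd lanes their lower 16.
        const uint4 recv = shflXor4(odd ? a : b, 1);
        if (odd) a = recv; else b = recv;

        // Each lane pair now writes one aligned 32-byte run per store.
        uint4* out4 = reinterpret_cast&lt;uint4*&gt;(out);
        out4[(lane &amp; ~1) * 2 + (lane &amp; 1)] = a;  // column of the even lane
        out4[(lane |  1) * 2 + (lane &amp; 1)] = b;  // column of the odd lane
    }
</code></pre>
<p>After the exchange, register <code>a</code> in a lane pair holds all 8 elements of the even lane's column and register <code>b</code> holds all 8 elements of the odd lane's column, which is why two 16-byte stores per lane are enough.</p>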
<h3>Extensions</h3>
<p>Pre-Kepler devices require minimal use of shared memory to simulate a SHFL operation but synchronization is still not required as all shared stores and loads are warp-centric. &nbsp;</p>
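<p>A sketch of what such a simulated shuffle could look like is shown below. The names <code>shflIdxShared</code> and <code>rotateByOne</code> are hypothetical, and the code leans on the implicit warp-synchronous execution of the pre-Kepler parts discussed here. It is an assumption that does not carry over to newer architectures with independent thread scheduling, where an explicit warp-level barrier would be needed between the store and the load.</p>
<pre><code>    #include &lt;cuda_runtime.h&gt;

    #define WARP_SIZE 32

    // slots: WARP_SIZE words of shared memory private to the calling warp.
    // volatile keeps the compiler from caching the values in registers.
    __device__ unsigned shflIdxShared(volatile unsigned* slots, unsigned v, int srcLane)
    {
        const int lane = threadIdx.x % WARP_SIZE;
        slots[lane] = v;                           // every lane publishes its value
        return slots[srcLane &amp; (WARP_SIZE - 1)];   // then reads the requested lane
    }

    // Assumes blockDim.x == 128 (4 warps) and arrays that cover the grid.
    __global__ void rotateByOne(const unsigned* in, unsigned* out)
    {
        __shared__ unsigned slots[4][WARP_SIZE];

        const int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        const int warp = threadIdx.x / WARP_SIZE;
        const int lane = threadIdx.x % WARP_SIZE;

        // Lane 0 wraps around and reads lane 31.
        out[tid] = shflIdxShared(slots[warp], in[tid], lane - 1);
    }
</code></pre>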
<p>Transposing 64-bit elements is nearly identical except that only two elements are exchanged per lane pair.</p>
<h3>Conclusion</h3>
<p>It's not only possible to transpose a matrix without using shared memory or synchronization but it's also efficient!</p>]]></description><wfw:commentRss>http://www.pixel.io/blog/rss-comments-entry-33149964.xml</wfw:commentRss></item><item><title>Experiments with SHFL</title><category>Kepler</category><category>gpu hacks</category><category>shfl</category><category>small code</category><dc:creator>Allan MacKinnon</dc:creator><pubDate>Fri, 15 Mar 2013 01:06:16 +0000</pubDate><link>http://www.pixel.io/blog/2013/3/14/experiments-with-shfl.html</link><guid isPermaLink="false">424276:4970759:33046511</guid><description><![CDATA[<p><span style="font-family: Verdana, Arial, Helvetica, sans-serif;">I wanted to double-check my understanding of&nbsp;</span><span style="font-family: 'courier new', monospace;">shfl</span><span style="font-family: Verdana, Arial, Helvetica, sans-serif;">&nbsp;when using negative indices or negative index offsets. &nbsp;The PTX documentation on this instruction is accurate but a little terse so a micro-test was in order.</span></p>
<p><span>The summary results are:</span>&nbsp;</p>
<ul>
<li><span style="font-family: 'courier new', monospace;">shfl.idx</span>&nbsp;handles negative indices without any problem which means "shuffle rotations" are feasible.</li>
<li><span style="font-family: 'courier new', monospace;">shfl.up</span>&nbsp;has no chance to mask the "<span style="font-family: 'courier new', monospace;">lane - bval</span>" value, so when a negative index is produced it sets the in-range predicate to false and the lane keeps its current value.</li>
<li><span style="font-family: 'courier new', monospace;">shfl.down</span>&nbsp;has similar behavior.</li>
<li>negative&nbsp;<span style="font-family: 'courier new', monospace;">shfl.[up|down]</span>&nbsp;offsets are treated as unsigned 5-bit values. &nbsp;e.g. -5 = 27.</li>
</ul>
<p><span>It's also important to note that invoking&nbsp;</span><span style="font-family: 'courier new', monospace;">shfl.idx</span><span style="font-family: Verdana, Arial, Helvetica, sans-serif;">&nbsp;with a signed offset subtracted from the&nbsp;</span><span style="font-family: 'courier new', monospace;">laneId</span><span style="font-family: Verdana, Arial, Helvetica, sans-serif;">&nbsp;results in no surprises and is a two or three-instruction SASS sequence:</span></p>
<div style="padding-left: 30px;"><span style="font-family: 'courier new', monospace;">S2R R4, SR_LaneId;</span></div>
<div style="padding-left: 30px;"><span style="font-family: 'courier new', monospace;">IADD R4, R4, -&lt;register or constant&gt;;</span></div>
<div style="padding-left: 30px;"><span style="font-family: 'courier new', monospace;">SHFL.IDX pt, R4, R0, R4, 0x1f;</span></div>
<div><span style="font-family: 'courier new', monospace;"><br /></span></div>
<p><span style="font-family: Verdana, Arial, Helvetica, sans-serif;">I like to think of this operation as a&nbsp;</span>"<span style="font-family: 'courier new', monospace;">shfl.rot</span>".</p>
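<p>In CUDA C the same idea can be sketched as below. The helper name <code>shflRot</code> and the kernel <code>rotateWarp</code> are hypothetical, and the sketch uses <code>__shfl_sync</code> where Kepler-era code would have called <code>__shfl</code>. It relies on the behavior reported above, namely that an out-of-range or negative source lane wraps around within the warp.</p>
<pre><code>    #include &lt;cuda_runtime.h&gt;

    // Lane L receives the value held by lane (L - k) mod 32.
    __device__ __forceinline__ int shflRot(int v, int k)
    {
        const int lane = threadIdx.x % 32;
        return __shfl_sync(0xffffffffu, v, lane - k);
    }

    // Assumes blockDim.x is a multiple of 32 so every warp is full.
    __global__ void rotateWarp(const int* in, int* out, int k)
    {
        const int tid = blockIdx.x * blockDim.x + threadIdx.x;
        out[tid] = shflRot(in[tid], k);
    }
</code></pre>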
<p>Source code can be found <a href="https://gist.github.com/allanmac/5166783">here</a>.</p>
<p><span>Here's a screenshot of the output. &nbsp;An 'x' indicates a false predicate.</span></p>
<p><span class="full-image-block ssNonEditable"><span><img src="http://www.pixel.io/storage/post-images/shflrot.png?__SQUARESPACE_CACHEVERSION=1363319098804" alt="" /></span></span></p>
<div>
<ul>
</ul>
</div>]]></description><wfw:commentRss>http://www.pixel.io/blog/rss-comments-entry-33046511.xml</wfw:commentRss></item><item><title>HotSort -- now with support for 32+32 and 64-bit keys</title><category>HotSort</category><category>algorithms</category><category>gpu</category><dc:creator>Allan MacKinnon</dc:creator><pubDate>Fri, 19 Oct 2012 23:19:46 +0000</pubDate><link>http://www.pixel.io/blog/2012/10/19/hotsort-now-with-support-for-3232-and-64-bit-keys.html</link><guid isPermaLink="false">424276:4970759:29962326</guid><description><![CDATA[<p>HotSort has been updated to support 32+32 key-val and 64-bit keys.</p>

<p>The results are very good.</p> 

<p>When sorting 64-bit keys, Kepler achieves ~49% of the throughput of the 32-bit key benchmarks. The wider comparison sort performs twice the number of SASS comparisons and triple the number of calls to <tt>__syncthreads()</tt> on types that are twice as wide so getting half the throughput is excellent.</p>

<p>Additional optimizations were made in the past few weeks and there is now a general performance improvement across all architectures: approximately 12% on GT200 and almost 5% on Kepler. Fermi's improvement was the smallest at ~1%.</p>

<p>HotSort dominates Thrust Radix Sort until ~8m keys but, as before, the most important performance numbers to observe are HotSort's "binned" sorting rates.  On a GTX 680, HotSort posts some pretty incredible numbers:</p>

 <ul>
    <li>1.1 to 10.3 billion keys/sec. for subarrays containing 1K to
1M 32-bit keys.</li>
    <li>500 million to 5.3 billion keys/sec. for subarrays containing
512 to 512K 64-bit keys.</li>
  </ul>

<p>Some pretty plots of the 64-bit <a href="http://www.pixel.io/storage/hotSortBench.pdf">results</a> are below starting on page 8.  The 64-bit binned sorting results for Kepler are on page 9.</p>

<div id="pdf">
  <object width="630" height="800" type="application/pdf" data="/storage/hotSortBench.pdf#view=Fit&page=8&scrollbar=1&toolbar=0&navpanes=0" id="pdf_content">
<p>A datasheet containing detailed information on HotSort is <a href="http://www.pixel.io/storage/hotSortBench.pdf">here</a>.</p>
  </object>
</div>]]></description><wfw:commentRss>http://www.pixel.io/blog/rss-comments-entry-29962326.xml</wfw:commentRss></item><item><title>HotSort -- a new GPU sorting algorithm</title><category>CUDA</category><category>HotSort</category><category>algorithms</category><category>gpu</category><category>programming</category><dc:creator>Allan MacKinnon</dc:creator><pubDate>Tue, 04 Sep 2012 19:08:24 +0000</pubDate><link>http://www.pixel.io/blog/2012/9/4/hotsort-a-new-gpu-sorting-algorithm.html</link><guid isPermaLink="false">424276:4970759:27457003</guid><description><![CDATA[<p>Earlier this year I determined I was going to need a specialized sorting algorithm in order to complete another GPU project.</p>

<p>I needed a sorter that was portable, in-place, supported key-vals wider than 32 bits and could sort binned (tiled) independent data sets output by other GPU kernels.</p>

<p>But most of all, the sorting algorithm had to be uber fast on small GPUs.</p>

<p>A number of months later HotSort was completed. I've achieved most of my design goals and exceeded a few of them.</p>

<p>The HotSort algorithm outperforms the current GPU champ — Thrust Radix Sort — by a large margin up until ~8m elements on an NVIDIA Kepler GTX 680 GPU.  You can see some performance plots <a href="http://www.pixel.io/products/">here</a>.  The implementation shows similar advantages on both small and large Fermi and GT200 devices.</p>

<p>When simultaneously sorting subarrays containing 1k to 1m elements, HotSort can achieve sustained sorting rates from 1 billion to over 10 billion keys/sec. on a Kepler GTX 680.  This is exactly what I wanted.</p>

<p>I'm currently testing support for wider key-val sizes — u32b32, u64 and u64b32.</p>

<p>Also, for those of you with eagle eyes, you can see that the algorithm shows significant roll-off after its peak throughput.  Clearly that implies the algorithm is not exhibiting O(n lg n) complexity past this point.  Don't fret, the roll-off will be fixed and HotSort will become competitive on very large arrays.</p>

<p>I'm expecting the sorting gurus from NVIDIA — Merrill, Baxter, Harris, Garland, et al. — to improve their own truly awesome implementations now that they have a target.  Always good to have a competitor!</p>

<p>Watch here for more updates on HotSort!</p>
]]></description><wfw:commentRss>http://www.pixel.io/blog/rss-comments-entry-27457003.xml</wfw:commentRss></item><item><title>GPU Hack #2 -- Use LLVM to compile pre-Fermi kernels</title><category>Fermi</category><category>gpu</category><category>gpu hacks</category><category>gt200</category><category>llvm</category><category>nvcc</category><category>register spills</category><dc:creator>Allan MacKinnon</dc:creator><pubDate>Mon, 28 May 2012 00:42:34 +0000</pubDate><link>http://www.pixel.io/blog/2012/5/27/gpu-hack-2-use-llvm-to-compile-pre-fermi-kernels.html</link><guid isPermaLink="false">424276:4970759:16465803</guid><description><![CDATA[<p>I've been cleaning up a set of kernels so that they will run optimally on GT200 devices (sm_1x). &nbsp;The kernels run extremely well on Fermi so I was disappointed when the <code>opencc</code> compiler struggled to use a reasonable number of registers despite having access to the same number of registers per thread.</p>

<p>I was getting over 200 bytes of spills in a critical kernel that had no spills at all on Fermi.</p>

<p>Not good! &nbsp;So what could I do?</p>

<p>In my case, the answer was to force use of the LLVM compiler with the&nbsp;<code>--nvvm</code>&nbsp;switch. This produced kernels with either zero or at most 8 bytes of locals.</p>

<p>My understanding is that this switch is <a href="http://forums.nvidia.com/index.php?showtopic=227024&amp;view=findpost&amp;p=1395383">unsupported</a> for pre-Fermi devices but it worked very well for me and all of the kernels passed their verification tests.</p>

<p>On an old GT215 @ 550 MHz:</p>

<pre><code>opencc:   29.14 MKeys/sec
4.1-nvvm: 42.56 MKeys/sec
5.0-nvvm: 43.30 MKeys/sec
</code></pre>

<p>Almost a 50% improvement... I'll take it!</p>

<p><span style="text-decoration: underline;"><strong>Update Aug. 17, 2012:</strong></span></p>

<p>CUDA 5.0 RC finally removed the&nbsp;<code>--nvvm</code>&nbsp;compiler switch.</p>

<p>The workaround is to generate sm_1x PTX with a 4.x compiler and generate the cubin with the bug-fixed 5.0 ptxas.</p>

<p>It wasn't mentioned above, but 4.x and 5.0 Preview had a bug where sm_12 devices were treated as if they were sm_11 devices with only 8192 registers. &nbsp;That bug is fixed in 5.0 RC thus the baroque workaround I'm suggesting.&nbsp;</p>
]]></description><wfw:commentRss>http://www.pixel.io/blog/rss-comments-entry-16465803.xml</wfw:commentRss></item><item><title>Does anyone actually use CUDA's built-in "warpSize" variable?</title><category>gpu</category><category>gpu hacks</category><category>programming</category><category>warp size</category><dc:creator>Allan MacKinnon</dc:creator><pubDate>Thu, 19 Apr 2012 22:02:53 +0000</pubDate><link>http://www.pixel.io/blog/2012/4/19/does-anyone-actually-use-cudas-built-in-warpsize-variable.html</link><guid isPermaLink="false">424276:4970759:15919035</guid><description><![CDATA[<p>In CUDA C, the built-in variable <tt>warpSize</tt> is initially treated as a variable at compile-time and doesn't appear to be recognized as a constant until the PTX generation phase. This could be an issue if the warp width is part of some tricky preprocessing early in the compilation.</p>

<p>The simple line:</p>

<pre>
  const unsigned int w99 = warpSize * 99; 
</pre>

<p>is resolved to the following PTX:</p>

<pre>
  mov.u32    %r5, WARP_SZ;
  mul.lo.s32 %r6, %r5, 99;
</pre>

<p>Yes, the stage after PTX will do a great job folding/propagating away WARP_SZ but sometimes you need to resolve logic in the preprocessor.</p>
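
<p>As a small sketch of the kind of logic that has to be resolved before PTX generation, consider sizing a shared array and checking a block size in the preprocessor. The macro names, <code>warpIds</code> and <code>warpSizeMatches</code> are hypothetical. The host-side check is there so a mismatch with the device's reported warp size is caught at run time instead of being silently assumed.</p>

<pre><code>    #include &lt;cuda_runtime.h&gt;

    #define WARP_SIZE      32
    #define BLOCK_THREADS  128
    #define BLOCK_WARPS    (BLOCK_THREADS / WARP_SIZE)

    // Resolved by the preprocessor; warpSize could not be used here.
    #if (BLOCK_THREADS % WARP_SIZE) != 0
    #error "BLOCK_THREADS must be a multiple of WARP_SIZE"
    #endif

    // Launch with BLOCK_THREADS threads per block.
    __global__ void warpIds(int* out)
    {
        // The array bound must be a compile-time constant.
        __shared__ int slot[BLOCK_WARPS];

        const int warp = threadIdx.x / WARP_SIZE;
        const int lane = threadIdx.x % WARP_SIZE;

        if (lane == 0)
            slot[warp] = blockIdx.x * BLOCK_WARPS + warp;
        __syncthreads();

        out[blockIdx.x * BLOCK_THREADS + threadIdx.x] = slot[warp];
    }

    // Host-side sanity check against the device's reported warp size.
    bool warpSizeMatches(int device)
    {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&amp;prop, device) != cudaSuccess)
            return false;
        return prop.warpSize == WARP_SIZE;
    }
</code></pre>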

<p>Including a #define WARP_SIZE 32 in your kernel is just fine until NVIDIA tells us <a href="https://twitter.com/pixelio/status/258254722035748864">otherwise</a>.</p>]]></description><wfw:commentRss>http://www.pixel.io/blog/rss-comments-entry-15919035.xml</wfw:commentRss></item><item><title>GPU Hack #1 -- High lane wins in shared memory write conflicts</title><category>GF100</category><category>gpu</category><category>gpu hacks</category><category>programming</category><category>software rasterization</category><dc:creator>Allan MacKinnon</dc:creator><pubDate>Wed, 27 Jul 2011 22:28:41 +0000</pubDate><link>http://www.pixel.io/blog/2011/7/27/gpu-hack-1-high-lane-wins-in-shared-memory-write-conflicts.html</link><guid isPermaLink="false">424276:4970759:12303505</guid><description><![CDATA[<p>A useful GF100 GPU hack is revealed in Laine &amp; Karras' <a href="http://code.google.com/p/cudaraster/">paper</a>&nbsp;<em>"High-Performance Software Rasterization on GPUs"</em>&nbsp;on page 6. &nbsp;They state that:</p>
<blockquote>
<p>When there are <em>shared memory write conflicts within the warp, the&nbsp;write from a thread on a higher lane</em>, therefore containing a&nbsp;later triangle, <em>will override a write from a thread on a lower&nbsp;lane</em>, containing an earlier triangle. The CUDA programming guide&nbsp;explicitly leaves it undefined which thread will succeed in the write, but at least on GF100 the behavior is consistent and can&nbsp;be exploited.</p>
</blockquote>
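<p>A tiny experiment along these lines might look like the sketch below. The kernel name <code>whoWins</code> is hypothetical. Since the programming guide leaves the outcome undefined, the result is only an observation about the device it runs on, not something portable code should depend on.</p>
<pre><code>    #include &lt;cuda_runtime.h&gt;

    // Launch with a single warp: whoWins&lt;&lt;&lt;1,32&gt;&gt;&gt;(d_winner);
    __global__ void whoWins(int* winner)
    {
        __shared__ int slot;

        // Every lane writes its own lane id to the same shared address.
        slot = threadIdx.x % 32;
        __syncthreads();

        // Report whichever write survived.
        if (threadIdx.x == 0)
            *winner = slot;
    }
</code></pre>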
<p>Good to know!</p>]]></description><wfw:commentRss>http://www.pixel.io/blog/rss-comments-entry-12303505.xml</wfw:commentRss></item></channel></rss>