GPU Hack #1 -- High lane wins in shared memory write conflicts

A useful GF100 GPU hack is revealed in Laine & Karras' paper "High-Performance Software Rasterization on GPUs" on page 6.  They state that:

"When there are shared memory write conflicts within the warp, the write from a thread on a higher lane, therefore containing a later triangle, will override a write from a thread on a lower lane, containing an earlier triangle. The CUDA programming guide explicitly leaves it undefined which thread will succeed in the write, but at least on GF100 the behavior is consistent and can be exploited."

Good to know!
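
If you want to see which lane wins on your own hardware, a minimal sketch along these lines should do it.  This is not code from the paper: the kernel name `conflictingWrite` and the variable names are made up for illustration, and since the programming guide leaves the outcome undefined, the printed lane may differ on anything other than GF100.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical test kernel: every lane of one warp writes its own
// lane index to the same shared memory word.
__global__ void conflictingWrite(int *winner)
{
    __shared__ volatile int slot;

    slot = threadIdx.x;      // 32 conflicting writes to a single address
    __syncthreads();

    if (threadIdx.x == 0)
        *winner = slot;      // report whichever lane's write survived
}

int main()
{
    int *dWinner = NULL;
    int hWinner = -1;

    if (cudaMalloc(&dWinner, sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // One block of exactly one warp, so all writers are lanes 0..31.
    conflictingWrite<<<1, 32>>>(dWinner);

    // cudaMemcpy waits for the kernel and surfaces any launch error.
    cudaError_t err = cudaMemcpy(&hWinner, dWinner, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(dWinner);
        return 1;
    }

    // The paper reports the highest lane winning on GF100; other
    // architectures are free to behave differently.
    printf("surviving lane: %d\n", hWinner);

    cudaFree(dWinner);
    return 0;
}
```

Treat the result as an observation about one GPU and one driver, not as something to build portable code on.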

Elliott Bay

Blog updates will return shortly.

Meanwhile, here is the view from my desk:

CUDA ION2 Benchmarks

I needed to test my CUDA application on a low-end configuration so I plunked down $210 and bought a Jetway Mini-TOP D525+ION2 computer.  Below are some basic CUDA benchmark results.

First, some relevant specifications on this machine:

  • Dual-core Atom D525 @ 1.8GHz
  • Intel NM10 chipset
  • NVIDIA ION2 GPU (GT218) with 512MB of DDR3
  • DDR2 667/800 SO-DIMMs on a 64-bit bus
  • A Realtek Gigabit PCIe Ethernet port

There are a number of netbooks and nettops that have nearly identical specifications.

The NM10 has 4 PCIe 1.0a root ports but is connected to the ION2 over a single-lane (x1) PCIe link.  The link has a theoretical data rate of 250 MB/s in each direction (2.5 GT/s with 8b/10b encoding).

As a result, according to the "bandwidthTest.exe" benchmark, the ION2 copies from host-to-device at 165 MB/s and from device-to-host at 204 MB/s.  On the GPU itself, device-to-device copies reach a much higher ~8 GB/s with the default GPU clock setting.
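
For anyone who wants to reproduce the host-to-device number without the SDK, here is a rough sketch of that kind of measurement.  It is not the bandwidthTest source: the 32 MiB transfer size, the repetition count and the `CHECK` macro are my own choices, and it reports decimal MB/s, so expect small differences from the SDK tool.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: bail out of main() on the first CUDA error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(e_));                          \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main()
{
    const size_t bytes = 32u << 20;     // 32 MiB per copy (arbitrary)
    const int    reps  = 10;            // copies to average over
    const double theoretical = 250.0;   // MB/s, PCIe 1.x x1 figure from the post

    void *hBuf = NULL, *dBuf = NULL;
    cudaEvent_t start, stop;

    // Pinned, write-combined host memory, as with --memory=pinned --wc.
    CHECK(cudaHostAlloc(&hBuf, bytes, cudaHostAllocWriteCombined));
    CHECK(cudaMalloc(&dBuf, bytes));
    memset(hBuf, 0, bytes);

    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Time a batch of host-to-device copies with GPU events.
    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    // Total bytes moved divided by elapsed seconds, in decimal MB/s.
    double mbPerSec = ((double)bytes * reps / 1.0e6) / (ms / 1.0e3);
    printf("host-to-device: %.1f MB/s (%.0f%% of %.0f MB/s)\n",
           mbPerSec, 100.0 * mbPerSec / theoretical, theoretical);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(dBuf));
    CHECK(cudaFreeHost(hBuf));
    return 0;
}
```

Swapping the copy direction to `cudaMemcpyDeviceToHost` (and the pointer order) gives the device-to-host figure.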

The GT218 is a standard Compute Capability 1.2 device and has 2 multiprocessors, each with 8 cores, for 16 CUDA cores in total.
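
The same facts can be read straight from the runtime.  The following is only a cut-down sketch in the spirit of deviceQuery, not its actual source; the 8-cores-per-multiprocessor factor is hard-coded for Compute Capability 1.x parts and is an assumption that does not hold for later architectures.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;

    // Query the first CUDA device in the system.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("%s: Compute Capability %d.%d, %d multiprocessors\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);

    // Assumption: 8 cores per multiprocessor, valid for 1.x devices only.
    if (prop.major == 1)
        printf("CUDA cores: %d\n", prop.multiProcessorCount * 8);

    return 0;
}
```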

Below are results from the following tools, using the 260.63 driver and the CUDA Toolkit 3.2 RC:

  • GPU-Z
  • CUDA "deviceQuery.exe"
  • CUDA "bandwidthTest.exe --memory=pinned --wc"
  • CUDA "bandwidthTest.exe --memory=pinned --wc --mode=shmoo"
  • CUDA "nbody.exe"

The GPU is running at its default speed, and system memory is 4GB of DDR2-667 overclocked to 800.

Here is a GPU-Z screen capture:

GPU-Z incorrectly reports that the ION2 is running on a PCIe 2.0 x1 link.

The CUDA 3.2 Toolkit's "deviceQuery" output is here.

The "bandwidthTest" results with write-combined, pinned memory are here, and the "shmoo" results are here.

And, finally, a CUDA "N-Body" screen capture:

It's also worth noting that this machine is running Windows 7 Ultimate x64 and, in its off hours, operating as a Media Center serving two Xbox 360s.  The CPU speed is acceptable, and on the video side the ION2 has played everything I've tried with excellent results.

Feel free to leave a comment below if you want to see additional CUDA benchmarks.

Classic Video: The Origins of APL

In the same theme as my last post, what's old is often new again...

While I was on vacation, I read Shasha and Lazere's new book Natural Computing and was surprised to see a small section dedicated to Iverson's APL, J and K languages.

The last and only time I used J was in 1993 for a grad school class on queueing networks.  A half-page of terse code was dropped into my paper and I was done.  I would not have wanted, or been able, to do this in C or Fortran.

At some point, I'd like to investigate porting a J-like language to CUDA or OpenCL.  The obvious array processing parts of the interactive language would map well onto a GPU, but the language might need a scheme to ensure that enough work is issued at once to keep the GPU fully utilized.  It would be a fun project if there were interest.

Below is a brilliant roundtable discussion on APL featuring Ken Iverson and others.  I saw it a year ago, but it's worth watching again if you've had any exposure to (or mental damage from) APL/J/K.  Pour yourself some scotch, sit back and enjoy:

There is also a detailed description on this video's original page and a discussion at LtU.

Also check out Catherine Lathwell's cleverly named blog Chasing Men Who Stare at Arrays.  Catherine is working on a documentary about the history of the APL language and its founders.

Chromatic Research, GPUs and the Wheel of Reincarnation

Last week I chatted with a couple of friends about the x86 clones that appeared in the 1990s from companies like Rise, NexGen and Centaur.

When I got back to my desk I recalled that the mid-1990s also had a number of companies building media processors to offload MPEG-1/2/4 video, audio, 2D graphics and even modem codec processing.

Googling reveals a great example: the 1996 Hot Chips 8 presentation by Paul Kalapathy of Chromatic Research, which covers the Mpact multimedia processor.  A quick skim reveals an architecture that should be familiar to any GPU developer: a sea of general-purpose registers, huge on/off-chip bandwidth and SIMD vector opcodes driving integer ALUs and SFUs.  The part reportedly achieved 1 BOPS at 120MHz.

Very impressive and another great spoke on the wheel of reincarnation!