Big Mess o’ Wires


A home-built CPU, and other messy electronics adventures

More on Memory

I’ve been working hard the past week on a DDR2 memory controller for the Xilinx starter kit, and refining my estimates for 3D Graphics Thingy’s memory bandwidth requirements. There’s been progress, but it feels like things are moving at a snail’s pace.

I wrote earlier about some basic bandwidth estimates, and have revised them somewhat here. Assume pixels and texels are 16 bits (5-6-5 RGB format), z-buffer entries are 24 bits, and the screen resolution is 640×480 @ 60Hz. Let’s also assume a simple case where there’s no alpha blending being performed, and every triangle has one texture applied to it, using point sampling for the texture lookup. For every pixel, every frame, the hardware must:

  1. Clear the z-buffer, at the start of the frame: 3 bytes
  2. Clear the frame buffer, at the start of the frame: 2 bytes
  3. Read the z-buffer, when a new pixel is being drawn: 3 bytes
  4. Write the z-buffer, if the Z test passes: 3 bytes
  5. Read the texture data, if the Z test passes: 2 bytes
  6. Write the frame buffer, if the Z test passes: 2 bytes
  7. Read the frame buffer, when the display circuit paints the screen: 2 bytes

Assume too that the scene’s depth complexity is 4, meaning the average pixel is covered by 4 triangles, and steps 3-6 will be repeated 4 times. Add everything up, and that’s 3 + 2 + 2 = 7 bytes for the once-per-frame steps plus 4 x 10 = 40 bytes for the repeated ones, or 47 bytes per pixel. Times 640×480 is 14.44 MB per frame, times 60 Hz is 866.3 MB/s.
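The arithmetic above is easy to check in a few lines of C. This is only a sketch of the estimate as stated in this post; the names (bytes_per_pixel, frame_bytes_per_second) are made up for illustration and aren't part of any real design.

```c
/* Per-pixel byte costs, taken from the numbered list above. */
enum {
    Z_CLEAR = 3, FB_CLEAR = 2,   /* steps 1-2: once per frame */
    Z_READ = 3, Z_WRITE = 3,     /* steps 3-4: once per covering triangle */
    TEX_READ = 2, FB_WRITE = 2,  /* steps 5-6: once per covering triangle */
    FB_SCANOUT = 2               /* step 7: once per frame */
};

/* Bytes touched per pixel per frame, for a given depth complexity. */
int bytes_per_pixel(int depth_complexity) {
    int once_per_frame = Z_CLEAR + FB_CLEAR + FB_SCANOUT;
    int per_triangle = Z_READ + Z_WRITE + TEX_READ + FB_WRITE;
    return once_per_frame + depth_complexity * per_triangle;
}

/* Total memory traffic in bytes per second. */
long long frame_bytes_per_second(int width, int height, int hz, int depth_complexity) {
    return (long long)bytes_per_pixel(depth_complexity) * width * height * hz;
}
```

Calling frame_bytes_per_second(640, 480, 60, 4) gives 866,304,000 bytes per second, which is the 866.3 MB/s figure when MB means a million bytes.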

The DDR2 memory on the Xilinx starter kit board has a theoretical maximum bandwidth of 1064 MB/s, so that might just fit. I have serious reservations about my ability to later recreate such a high-speed memory interface on a custom PCB, but ignore that for now. Unfortunately you’ll never get anything close to the theoretical bandwidth in real world usage, unless you’re streaming a huge chunk of data to consecutive memory addresses. Even half the theoretical bandwidth would be doing well. I’ll be conservative and assume I can reach 1/3 of the theoretical bandwidth, which means 355 MB/s. That’s not enough. And I’ll also need some bandwidth for vertex manipulations, since I’ve only considered pixel rasterization, and possibly for CPU operations too. It looks like things will definitely be bandwidth constrained.

Fortunately there are some clever tricks that can be used to save lots of memory bandwidth.

  1. Z occlusion: When a pixel fails the Z test at step 3, then steps 4-6 can be skipped. With a depth complexity of 4, and assuming randomly-ordered triangles, then on average 1 + 1/2 + 1/3 + 1/4 = 2.08 triangles will pass the Z test and get drawn, not 4. That’s a savings of 14 bytes per pixel, or 258 MB/s!
  2. Back-face culling: When drawing solid objects, it’s guaranteed that any triangle facing away from the camera will be overdrawn by some other triangle facing towards the camera. These back-face triangles can be ignored completely, skipping steps 3-6 and saving 10 bytes per culled pixel. Assuming half the pixels are part of back-facing triangles, then that’s a savings of 369 MB/s. Of course some of the pixels rejected due to back-face culling would also have been rejected by Z occlusion, so it’s not valid to simply add the savings from the two techniques.
  3. Z pre-pass: Another technique is to draw the entire scene while skipping steps 5 and 6, so only the Z buffer is updated. Then the scene is drawn again, but step 3 is changed to test for an exactly equal Z value, and step 4 is eliminated. This guarantees that steps 5 and 6 are only performed once per pixel, for the front-most triangle. However, step 3 must now be performed twice as many times, and all the vertex transformation and triangle setup work not accounted for here must be done twice. Whether this results in an appreciable overall savings depends on many factors.
  4. Skip frame buffer clear: If the rendered scene is indoors and covers the entire screen, then the frame buffer clear in step 2 can be omitted. That’s a savings of 37 MB/s.
  5. Skip Z-buffer clear: If the rendered scene covers the entire screen, then the Z-buffer clear in step 1 can also be omitted, but sacrificing one bit of Z-buffer accuracy. On even frames, the low half of the Z-buffer range can be used. On odd frames, the high half can be used, along with a reversal in the sense of direction, so larger values are treated as being closer to the camera. This means that every Z value from an even frame is farther away than any Z value from an odd frame, so each frame effectively clears the Z-buffer for the next one. This provides a savings of 55 MB/s.
  6. Texture compression: Compression formats like DXT1 can provide a 4:1 or better compression ratio for texture data. If the rasterizer can be structured so that an entire texture is read into a cache, and then used for calculations on many adjacent pixels, this can translate directly into a 4:1 bandwidth savings on step 5. Assuming less than perfect gains of 2:1, that translates to a savings of 18 MB/s.
  7. Texture cache: Neighboring pixels on the screen are likely to access the same texels, when the textures are drawn magnified. A texture that’s tiled many times across the face of a triangle may also result in many reads of the same texel. The expected savings depend on the particular model that’s rendered, but are probably similar to those for texture compression, or about 18 MB/s.
  8. Tiled Z-Buffer: The Z-buffer can be divided into many 8×8 squares, with a small amount of state data cached for each square: the farthest point (largest Z value) in the square, and a flag indicating if the square has been cleared. That’s 25 bits per square, or 15 KB for a 640×480 Z-buffer. That should fit in the FPGA’s block RAM. Then when considering a pixel before step 3, if the pixel’s Z value is larger than the cached Z-max for that square, the pixel can be rejected without actually doing the Z-buffer read. Furthermore, when the Z-buffer needs to be cleared, the cleared flag for the block can be set without actually clearing the Z-buffer values. Then the next time that Z-buffer square is read, if the cleared flag is set, the hardware can return a square filled with Z-far without actually reading the Z-buffer values. This skips both a Z write and a Z read for the entire square. In order to gain the benefit of the cleared flag, the hardware must operate on entire 8×8 blocks at once before writing the result back to the Z-buffer. The total savings for both these techniques is at least 110 MB/s, and possibly as much as 165 MB/s depending on how much is occluded with the square-level Z test.
  9. Z-buffer compression: 8×8 blocks of Z-buffer data can be stored compressed in memory, using some kind of differential encoding scheme. Like the previous technique, this would require the hardware to operate on an entire 8×8 block at a time in order to see any benefit. The cost of all Z-buffer reads and writes might be reduced by 2:1 to 4:1, at the cost of additional latency and hardware complexity to handle the compression. This could provide a savings in the range of 350 MB/s.
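The alternating Z-range trick in item 5 is the least obvious of these, so here is a minimal C sketch of how I read it. The encoding (a 23-bit depth with 0 as the nearest value, mirrored into the top half on odd frames) and the function names are assumptions for illustration, not a worked-out hardware design.

```c
#include <stdbool.h>
#include <stdint.h>

#define Z_HALF 0x800000u  /* 2^23: half of the 24-bit Z range */
#define Z_MAX  0xFFFFFFu  /* largest 24-bit Z value */

/* depth is a 23-bit scene depth, 0 = nearest to the camera.
   Even frames store it in the low half of the range (smaller = closer).
   Odd frames mirror it into the high half (larger = closer). */
uint32_t encode_z(uint32_t depth, bool odd_frame) {
    depth &= Z_HALF - 1;
    return odd_frame ? Z_MAX - depth : depth;
}

/* True if new_z is closer than stored_z under this frame's convention. */
bool z_test(uint32_t new_z, uint32_t stored_z, bool odd_frame) {
    return odd_frame ? (new_z > stored_z) : (new_z < stored_z);
}
```

A value left over from the previous frame always sits in the other half of the range, so it loses the Z test against anything drawn in the current frame. That only works if every pixel is drawn at least once per frame, which is why the scene has to cover the entire screen.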

Unfortunately the savings from all these techniques can’t merely be summed, and the savings I’ve estimated for each one are assuming it’s done by itself, without any of the other techniques. However, when used together, the combination of backface culling plus Z-occlusion should provide at least 400 MB/s in savings, texture compression and caching another 30 MB/s, and Z-buffer tiling another 110 MB/s. That lowers the total bandwidth needs down to 326 MB/s, roughly the same as my conservative estimate of real-world available bandwidth.
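For anyone who wants to replay the numbers, here is a small C sketch of the Z-occlusion estimate and the combined budget. The helper names are made up for illustration, and the savings figures are simply the ones quoted above.

```c
/* Expected number of Z-test passes when n randomly ordered triangles cover a
   pixel: the k-th triangle drawn is the nearest so far with probability 1/k. */
double expected_z_passes(int n) {
    double sum = 0.0;
    for (int k = 1; k <= n; k++)
        sum += 1.0 / k;
    return sum;
}

/* Convert bytes per pixel per frame into MB/s (millions of bytes)
   at 640x480 and 60 Hz. */
double mb_per_second(double bytes_per_pixel) {
    return bytes_per_pixel * 640 * 480 * 60 / 1e6;
}

/* Remaining requirement in MB/s after the combined savings quoted above. */
int remaining_mb_per_second(void) {
    int unoptimized = 866;
    int culling_and_occlusion = 400;
    int texture_compression_and_cache = 30;
    int z_tiling = 110;
    return unoptimized - culling_and_occlusion
         - texture_compression_and_cache - z_tiling;
}
```

expected_z_passes(4) comes out to about 2.08, and mb_per_second(14) to about 258. Using the unrounded figure, the skipped work is (4 - 2.08) x 7, or roughly 13.4 bytes per pixel, which is closer to 247 MB/s, so the 258 MB/s estimate for Z occlusion is slightly generous.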


FPGA Pong

Everyone working on a video application using an FPGA seems to start with Pong, so why should I be any different? I put together this Pong demo as an exercise to help get more familiar with Verilog, and gain some experience working with the Xilinx tools and the Spartan 3A FPGA starter kit. It was very slow going at first, but things are slowly beginning to make sense to me now. And hey, I’ve got Pong!!

pong.v - Verilog HDL source
pong.ucf - user constraints file, with pin-mappings for the Spartan 3A starter kit

Some of the ideas were taken from fpga4fun’s Pong tutorial, and the quadrature decoding logic for the rotary knob was ripped from the tutorial verbatim.

I found that the most difficult part to get working was the collision detection. My brain kept getting tripped up by the difference between writing software where statements are executed sequentially, and HDL statements defining a bunch of operations that all happen at once. I ended up with something like this:

always @(posedge clk) begin
    if (collided)
        direction <= !direction;
    if (direction)
        position <= position + 1;
    else
        position <= position - 1;
end

That looks fine for sequential code, but in hardware it didn’t work. When the ball reached a point where a collision was detected, the hardware would switch the direction and increment the position simultaneously. The position increment used the old direction value, since it was happening in parallel. The result was that the ball would move one step deeper into collision territory. On the next clock, it would move in the new direction, but since it had been two steps into collision territory, one step wasn’t enough to get out, so another collision was detected and it reversed direction yet again. This caused the ball to get stuck in the wall. Confused? Me too.

I ended up solving this by introducing an XOR:

always @(posedge clk) begin
    if (collided)
        direction <= !direction;
    if (direction ^ collided)
        position <= position + 1;
    else
        position <= position - 1;
end

I think there must be some more elegant solution, but I didn’t find it.
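To convince myself of what was going on, it helps to model the two always blocks in software. The C sketch below is a stand-in, not the Pong source: it assumes a one-dimensional ball bouncing between walls at positions 0 and 10, and a collided signal that is true whenever the ball is at or beyond a wall. The key point is that both next-state values are computed from the old state and committed together, the way nonblocking assignments behave.

```c
#include <stdbool.h>

#define WALL_LO 0   /* hypothetical wall positions */
#define WALL_HI 10

typedef struct {
    int position;
    bool direction;  /* true = position increasing */
} Ball;

bool collided(const Ball *b) {
    return b->position <= WALL_LO || b->position >= WALL_HI;
}

/* One clock edge: compute both next values from the OLD state, then commit. */
void ball_clock_edge(Ball *b, bool use_xor_fix) {
    bool hit = collided(b);
    bool next_direction = hit ? !b->direction : b->direction;
    bool step_up = use_xor_fix ? (b->direction ^ hit) : b->direction;
    int next_position = b->position + (step_up ? 1 : -1);
    b->direction = next_direction;
    b->position = next_position;
}

/* Run n clocks from the middle of the field and count how many of them
   begin with the ball inside a wall. */
int cycles_in_wall(bool use_xor_fix, int n) {
    Ball b = { 5, true };
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (collided(&b))
            count++;
        ball_clock_edge(&b, use_xor_fix);
    }
    return count;
}
```

Over 100 clocks, the first version reaches the wall after 5 clocks and then rattles between positions 10 and 11 for the remaining 95. The XOR version touches a wall for a single clock on each bounce.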

I also had trouble handling the initial conditions, setting the ball position and direction to appropriate values at startup. Normally I’d do this with a reset line, but I’m not sure how to do it in an FPGA. The Spartan 3A starter kit doesn’t provide any kind of reset input that I could find in the documentation. And at any rate, it would need to reset the logic of the design, without resetting the FPGA itself, which clears the design entirely until it’s programmed again.

I later discovered from James Newman that “initial” blocks are actually synthesizable, so you can do something like:

initial begin
    position <= 200;
end

That blew my mind, because I must have read 100 different Verilog guides that say initial blocks can’t be synthesized into hardware. And yet, it’s true. I’m very curious how this actually works.

I want to review the Pong design, because it feels way more complicated than it needs to be. I described this to James as Verilog making me lazy. It seems way too easy to write some complicated expression involving a dozen equality comparisons, greater/less than comparisons, and ANDs and ORs, when with a little more thought, the logic could probably be substantially simplified. For instance, I think

wire ball = (xpos >= ballX && xpos <= ballX+7);

could be rewritten as

wire [9:0] delta = xpos - ballX;
wire ball = (delta[9:3] == 0); // or even wire ball = ~(|delta[9:3])

which only requires a 10-bit subtractor and a 7-input NOR, instead of two 10-bit comparators and a 2-input AND. Or maybe:

reg[2:0] ballCount;
always @(posedge clk) begin
    if (xpos == ballX)
        ballCount <= 7;
    else if (ballCount != 0)
        ballCount <= ballCount - 1;
end

wire ball;
assign ball = ballCount != 0;

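That’s a 3-bit down counter, and a bunch of XORs and NORs. As written it isn’t an exact replacement, since ball goes high one clock after xpos matches ballX and stays high for 7 pixels instead of 8, so the constants would need some adjusting. I’m not sure if that’s better, but you get the idea. Once you start writing complex clauses of nested ifs and lengthy boolean expressions, it’s easy to lose sight of the underlying hardware implementation.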
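The subtract-and-check-the-high-bits rewrite is the kind of thing that’s worth testing exhaustively before trusting it. This C sketch models both forms with 10-bit values and counts the inputs where they disagree. It assumes the ballX+7 sum in the original expression is not truncated to 10 bits; the function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MASK10 0x3FFu  /* 10-bit values, like the Verilog wires */

/* Original form: two magnitude comparisons. */
bool ball_compare(uint32_t xpos, uint32_t ballX) {
    return xpos >= ballX && xpos <= ballX + 7;
}

/* Rewritten form: 10-bit subtract, then check that bits 9..3 are all zero. */
bool ball_subtract(uint32_t xpos, uint32_t ballX) {
    uint32_t delta = (xpos - ballX) & MASK10;  /* wraps like a 10-bit subtractor */
    return (delta >> 3) == 0;
}

/* Count disagreements over every xpos, for ballX in 0..max_ballX. */
int count_mismatches(uint32_t max_ballX) {
    int mismatches = 0;
    for (uint32_t ballX = 0; ballX <= max_ballX; ballX++)
        for (uint32_t xpos = 0; xpos <= MASK10; xpos++)
            if (ball_compare(xpos, ballX) != ball_subtract(xpos, ballX))
                mismatches++;
    return mismatches;
}
```

The two forms agree for every ballX up to 1016. Beyond that, the 10-bit subtraction wraps around and the rewritten version also lights up a few pixels at the left edge, which shouldn’t matter if the ball never gets that far past the visible 640 pixels.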


Memory Bandwidth

I did some preliminary memory bandwidth calculations for 3D Graphics Thingy, based upon the discussion in the comments of the previous post, and the numbers aren’t encouraging. Even for the simplest possible case, I don’t think there will be enough bandwidth to do what I’m imagining, let alone any more complex cases involving more interesting rendering effects.

For every pixel, every frame, this is the minimum that must be done:

  1. Clear the z-buffer, at the start of the frame
  2. Clear the frame buffer, at the start of the frame
  3. Read the z-buffer, when a new pixel is being drawn
  4. Write the z-buffer, if the Z test passes
  5. Write the frame buffer, if the Z test passes
  6. Read the frame buffer, when the display circuit paints the screen

That’s 6 memory operations, per pixel, per frame. Assuming a pixel is 3 bytes (one byte each for red, green, and blue), a z-buffer entry is also 3 bytes, the frame buffer is 640 x 480, and the refresh rate is 60 Hz, then that’s:

6 * 3 * 640 * 480 * 60 = 331,776,000 bytes/sec, or about 316MB/sec

If the DRAM datapath is 16 bits wide, as is common, then that’s about 166 million memory transactions per second, so the DRAM must run at about 166MHz. This is maybe within the realm of possibility, but not by much, I think.
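The same calculation in C, as a rough sketch with made-up function names. It assumes one bus transfer per clock and ignores refresh, row activation, and every other real-world DRAM overhead.

```c
/* Bytes per second for `ops` memory operations per pixel per frame. */
long long simple_bytes_per_second(int ops, int bytes_per_op,
                                  int width, int height, int hz) {
    return (long long)ops * bytes_per_op * width * height * hz;
}

/* Transfers per second needed on a bus that is bus_bits wide. */
long long transfers_per_second(long long bytes_per_sec, int bus_bits) {
    return bytes_per_sec / (bus_bits / 8);
}
```

simple_bytes_per_second(6, 3, 640, 480, 60) is 331,776,000 bytes per second. That’s about 316 if you divide by 2^20 and about 332 if you divide by a million, and on a 16-bit bus it works out to just under 166 million transfers per second.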

A more realistic estimate would involve steps 3, 4, and 5 happening many times, as fragments of different triangles overlap the same pixel. Performing alpha blending would add an additional “read the frame buffer” step for each pixel. And drawing textured triangles rather than flat-shaded ones would involve one or more additional texture memory reads for each pixel. A more realistic memory bandwidth estimate for a scene with an average depth complexity of 4, with alpha blending and texturing, is probably about 1.4 GB/second, requiring a memory speed of 700MHz. That’s definitely out of reach.

There are certainly some tricks I could use to improve things, starting with using several DRAMs in parallel. Unfortunately each new DRAM requires about 40 FPGA pins to interface with it, and since I need to limit myself to low pin-count FPGAs that can be hand-soldered, realistically I probably can’t do more than two DRAMs in parallel. Using a smaller frame buffer or fewer bits per pixel would also help, but that’s trading away image quality, which I’d like to avoid.

Caching seems like it should play a role here, but I’m not sure exactly how. If there are many shader units operating in parallel, they all must keep their caches in sync somehow, or share a single cache. And even if there’s only one shader unit, so cache coherency isn’t an issue, it’s not obvious to me that a traditional cache would actually speed things up. A pixel shader won’t spend lots of time manipulating the same few bytes of memory over and over, the way a CPU does when executing a small loop. Instead, it traverses the interior of a triangle, visiting each pixel exactly once. The next triangle it processes is unlikely to have any overlap with the previous one. Given these patterns, a cache probably won’t help.


FPGAngst

I’ve spent the past few days getting familiar with my Xilinx Spartan 3A Starter Kit, and so far, it’s not going well. I’d thought I was pretty competent with the basics of digital electronics, and the concepts of HDL programming and FPGAs. But working through a “blink the LED” example using the Xilinx ISE WebPack software has been an exercise in frustration. The learning curve is more like a brick wall, and I’m getting dizzy from banging my head into it over and over.

I’ll begin with the starter kit itself. Given the name, you might think it’s aimed at people who want to get started with FPGAs. Forget it. I was very disappointed to find that the starter kit came with almost no documentation at all. Instead, it just had a DVD with a two year old version of the ISE software (now two major releases out of date), and a leaflet with a URL to find more information. The only printed documentation was for the Embedded Development Kit, which is a separate product and doesn’t even work with the free version of the Xilinx ISE. Following the URL, I found the manual for the starter kit, but it’s little more than a catalog of all the hardware on the board. If you want any kind of tutorial for an FPGA “hello world” using this board, or a high-level overview of the various steps involved in creating and programming an FPGA design, or any kind of “starter” information at all, you’ll have to look elsewhere.

Plowing through the ISE software on my own, the first issue I faced was the need to choose what FPGA model I wanted to target. You might think there would be a predefined choice for “Spartan 3A Starter Kit”, but you’d be wrong. After some digging, I found that the starter kit has a XC3S700A, but that wasn’t enough. I needed to specify what package it was, and what speed grade too. How do you tell this? It’s mentioned nowhere in the starter kit manual. After about 20 minutes of searching around, I finally managed to find the web page that deciphered the tiny, near-illegible numbers printed on the chip to determine the package and speed. It’s FG484-4, if you’re keeping score at home.

The ISE itself is really bewildering. It’s basically a shell application that coordinates half a dozen other tools, each of which has its own UI and terminology. The other tools look like old command-line apps that someone slapped together a GUI for using Tcl/Tk. The ISE uses a strange (to me at least) “process” metaphor, which is a context-sensitive subpanel that fills with different actions, depending on what you’ve selected in the main GUI. It took me two days of hunting to figure out what I needed to click on to make the simulation-related process options magically appear. The processes are also arranged in a hierarchical list, so in most cases, running a process requires running all the ones in the tree before it. I still haven’t figured out how to check if my Verilog compiles without doing a complete synthesize, place, and route for the entire design.

Other ISE headaches:

  • The GUI-based Plan Ahead tool used to assign signals to physical pins bears no relation to the text-based UCF (user constraints) file examples in the starter kit online manual.
  • ISE keeps getting confused about the UCF file, and I have to remove it from the project and re-add it. It’ll complain that I don’t have a UCF file, then when I try to add one, it complains there already is one.
  • Integration with iMPACT (the programming tool) is apparently broken. ISE says it’s launching it, but doesn’t. iMPACT must be launched manually.
  • After using a wizard to create a DCM module to divide the input clock by two, there’s no “results” page or other info that defines what ports the wizard-created module has. It doesn’t let you actually look at the module code: clicking on it just relaunches the wizard. I had to go poke through random files on disk to discover the module ports.

In comparison to the software, the hardware itself seems pretty good, but maybe a little TOO good. There are no fewer than four different configuration EEPROMs that can be programmed, with a complicated system of jumpers for controlling which one to program and which to use at startup. This just makes life more complicated than it needs to be.

The only big negative about the hardware is that there’s no SRAM at all. I don’t know how I missed this when I was looking at the specs. Instead, it has 64MB of DDR2 SDRAM. Yeah, that’s a lot of RAM, but creating a memory controller interface for DDR2 RAM is a big honking complicated task all in itself. That means that if you want to do any kind of project involving RAM, you either need to be content with the few kilobytes of block RAM in the FPGA itself, or go on a long painful detour to design a DDR2 memory controller first. The 133MHz oscillator for the DDR2 RAM also occupies the only free clock header, so it’s impossible to introduce another clock to the design (for example, a 25.175MHz oscillator for generating VGA video).

Stumbling blindly through the software, I did finally manage to design, simulate, program, and run a simple example that blinked three LEDs. I’m sure everything will make more sense in time, but it’s hard for me not to feel grumpy right now. I feel like I’m spending all my energy wrestling with the tool software, and none on the project itself. In short, it feels like a software project, not a hardware one. I’ve barely touched the board, other than to plug in the USB cable and flip the power switch. My multimeter, chip puller, wire stripper, and other tools sit unused in my toolbox. Instead, I’m spending time reading a lot of manuals and guessing at what some opaque piece of software is actually doing under the hood. The experience with the clock generation wizard was downright depressing: it just writes some HDL code for you and doesn’t even let you see it, so you’re at least two levels removed from having a prayer of actually understanding what’s going on. Since my end goal in all my homebrew hardware is to gain a better understanding of how things work, that’s especially galling.

I’m going to search out some more ISE tutorials and any other good learning tools I can find, but I’m also going to take another look at the Altera tools. I’ve heard that the Altera software is more beginner-friendly, but I went with Xilinx because their starter kit appeared more powerful. I’m now realizing that the quality of the software tools and ease of the development experience is much more important than the number of gates on a particular FPGA. Altera’s Cyclone II starter kit isn’t as full-featured as the Xilinx kit I have now, but it’s decent, and it has some SRAM too. More than likely, the Altera tools will just be a different flavor of incomprehensibility, but it’s worth a look.


Goodbye BMOW, and a Contest

It’s a bittersweet day today. After 18 months of development, occupying the top of my desk and the majority of my spare time, BMOW 1 has been officially retired to the closet. I packed up the case, packed up the power supply, keyboard, cables, EPROM programmer, and everything else. At least it had a good send-off party at the Maker Faire. Now I can actually see the surface of my desk for the first time since 2007, and I’m ready to turn my full attention to my next project.


To mark the occasion, I’m running a small contest, with ten fabulous BMOW stickers as the prize. If you’ve followed my progress for a while, then you know that the “C” key doesn’t work in BMOW BASIC, due to a bug related to control-C handling that I never bothered to fix. The contest is simple: write a BMOW BASIC program that prints the letter C to the screen, without typing “C” as part of the program. This isn’t as easy as it might seem, since many of the relevant BASIC keywords also have the letter C in their name. The first person to reply with a working solution, as judged by me, will receive the stickers by mail, and the honorary title of “BMOW Guru”. It’s not quite like being knighted, but it’s close.

Some hints:

  • BMOW BASIC is a straight port of Microsoft BASIC, but does not include any machine specific commands for file I/O, graphics, etc. Do a Google search to learn more about MS BASIC keywords.
  • The video memory is not mapped anywhere into the BASIC address space, so you can’t just POKE a byte into screen memory.
  • Try this Javascript Applesoft BASIC interpreter, which is also a Microsoft BASIC variant. But remember, BMOW BASIC isn’t 100% identical to Applesoft. Try the BMOW simulator on the Downloads page if you’re unsure.

Uzebox, Take 2

It’s done. Yes, it cost twice as much money and 100X as much time as just buying one from Adafruit or Sparkfun, but I’ve finished my home-made Uzebox.

Notable features vs. the “stock” Uzebox:

  • vertical mount
  • mini-stereo jack for audio
  • internal speaker
  • transparent acrylic case
  • lots of glue

And it plays Arkanoid! Now I can return to my regularly scheduled life. :-)

   

I wrote about the Uzebox earlier: it’s an open-source hardware project utilizing a microcontroller to synthesize an NTSC video signal on the fly, in software. Many classic games have been ported to it, and there’s an active developer community.


Care Package

After reading about BMOW on Slashdot last week, Jim George offered up some Augat wire-wrap boards and old-school ICs that were sitting around gathering dust. His care package arrived today, just in time for a weekend of tinkering.

There are four Augat boards, each one about 7 x 2.5 inches, or about 25% of the area of the BMOW system board. Each board has space for five columns of skinny DIP 0.3 inch chips. The undersides (not shown) are pre-populated with about 600 wire-wrap pins.


Jim also threw in a few dozen wire-wrap tags, which are just little plastic cards with holes in them that can be placed on the pin side of the board, showing where the chips are placed and marking the pin numbers. I can’t believe I built all of BMOW without these. They seem like such an obvious thing. Staring at a featureless green board with a thousand pins on it, it’s easy to get disoriented without markers like these.

To round out the package, Jim also included a handful of 7400 series logic chips, and other related parts:

  • 74AS181 x 5, 4-bit ALU
  • 74ALS374 x 11, 8-bit register
  • 74F323 x 7, 8-bit shift register
  • 74F299 x 6, 8-bit shift register
  • 6116 x 3, 2K x 8 SRAM
  • Intel 8255 peripheral interface adapter

Thank you Jim!


FillRect

OK, time to get back to 3D Graphics Thingy! 3D graphics rendering, implemented in hardware. Here we go. Starting right now. Any time now. 1, 2, 3, go. Getting ready, this is it, here we go. OK, really, going to start now.

I stared at a blank sheet of paper for a long while yesterday, and realized I have no idea what I’m doing. How do you make a digital circuit that draws stuff? Where do you even start?

Assuming I had some video memory, and a circuit to display the contents of video memory, I might implement a FillRect function in C like this:

void FillRect(int left, int top, int right, int bottom, int color) {
    for (int y=top; y<=bottom; y++) {
        for (int x=left; x<=right; x++) {
            memory[x][y] = color;
        }
    }
}

Taking this a little further, I could assume a screen width and height of 256 pixels, and one byte per pixel, for a total video memory size of 64K. That would enable me to directly use the X and Y coordinates of a pixel to determine its memory address, by using Y as the upper 8 bits of address, and X as the lower 8 bits. The revised C code would be:

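typedef unsigned char uchar;
uchar memory[65536]; // 256 x 256 pixels, one byte each

void FillRect(uchar left, uchar top, uchar right, uchar bottom, uchar color) {
    // int loop counters: a uchar counter would wrap from 255 back to 0
    // and never terminate when right or bottom is 255
    for (int y=top; y<=bottom; y++) {
        for (int x=left; x<=right; x++) {
            memory[(y<<8)|x] = color;
        }
    }
}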

Now how do you build hardware to do that? After staring at a blank sheet of paper for a while longer, I started drawing things, and eventually came up with this:

My scrawl is difficult to follow, so I cleaned it up in a drawing program. In the diagram below, the data path is drawn in blue, and the control path in orange. All the control signals are assumed to use positive logic for clarity (1 = active), although in a real circuit with real chips, most would actually use negative logic. The five blue boxes along the top are all 8-bit registers, something like a 74HC377. The two darker blue boxes for X and Y are 8-bit counters, like a 74HC393 maybe, although the 74HC393 has no parallel load input, so a presettable counter such as a pair of 74HC163s would be a closer fit. For the counters, I assume that their load input takes priority over their count input, if both are asserted simultaneously. The two boxes labeled “=” are comparators that output true when their inputs are equal, like a 74HC688. The orange run box is a single flip-flop with synchronous set and reset inputs.

How does this work? Initially run is false, so the output of the AND gate connected to the RAM’s write input is also always false, and nothing gets written. To start things going, the CPU (not shown) writes the desired values to the bottom, top, left, right, and color registers for the rectangle to be filled, and then asserts start. At the next clock edge, three things happen:

  1. run is set to true
  2. Y is loaded from top
  3. X is loaded from left

Notice the clock signal itself is used as a control signal, connected through an inverter to the AND gate feeding the RAM’s write input. On the next clock cycle following start, during the second half when the clock signal is low, the output of the AND gate becomes true, and the value in the color register is written to the address specified by the X and Y registers. Oh my God, a pixel was just filled!

The X register’s count input is connected directly to true. At the next clock edge, X is incremented by one, and on the next clock cycle, the neighboring pixel one spot to the right is filled. This continues until X reaches the value stored in the right register, at which point the output of the comparator becomes true. This forces X to be reloaded with the value from left (load preempts count), and also increments Y by one, accomplishing a movement to the start of the next line.

Eventually the filling operation reaches the last pixel of the last line, when Y equals bottom and X equals right, and the output of both comparators becomes true. This resets the flip-flop, forcing run to false, and disabling any further writes to video memory. The FillRect operation is complete.
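The sequence described above can be sketched as a cycle-by-cycle software model. This is my reading of the circuit in C, with made-up names, and it assumes start is asserted for exactly one clock. Every register takes its new value from what was visible before the clock edge, and the write happens during the low half of the following cycle.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t left, top, right, bottom, color;  /* registers written by the CPU */
    uint8_t x, y;                             /* loadable counters */
    bool run;                                 /* run flip-flop */
} FillRectHw;

uint8_t vram[65536];  /* 256 x 256 pixels, address = (Y << 8) | X */

/* Low half of a clock cycle: the RAM write strobe is run AND (NOT clock). */
void fillrect_clock_low(const FillRectHw *hw) {
    if (hw->run)
        vram[(hw->y << 8) | hw->x] = hw->color;
}

/* Clock edge: all registers update from the values seen before the edge. */
void fillrect_clock_edge(FillRectHw *hw, bool start) {
    bool x_done = (hw->x == hw->right);   /* "=" comparator on X */
    bool y_done = (hw->y == hw->bottom);  /* "=" comparator on Y */
    FillRectHw next = *hw;
    if (start) {
        next.run = true;
        next.x = hw->left;
        next.y = hw->top;
    } else {
        /* X always counts, but load preempts count at the end of a line */
        next.x = x_done ? hw->left : (uint8_t)(hw->x + 1);
        if (x_done)
            next.y = (uint8_t)(hw->y + 1);
        if (x_done && y_done)
            next.run = false;
    }
    *hw = next;
}

/* Drive the circuit until run drops; returns the number of pixels written. */
int fill_rect_hw(uint8_t left, uint8_t top, uint8_t right, uint8_t bottom,
                 uint8_t color) {
    FillRectHw hw = { left, top, right, bottom, color, 0, 0, false };
    int writes = 0;
    fillrect_clock_edge(&hw, true);  /* CPU asserts start */
    while (hw.run) {
        fillrect_clock_low(&hw);
        writes++;
        fillrect_clock_edge(&hw, false);
    }
    return writes;
}
```

Because the hardware stops on an equality comparison rather than a less-than-or-equal test, a rectangle that touches column 255 or row 255 still terminates correctly.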

With work, this FillRect hardware could be generalized to FillTriangle hardware more similar to what 3DGT will need. If left and right were initially equal, and then incremented or decremented by a fixed step on each new line, then flat-bottomed triangles of any shape could be drawn. The per-line steps for left and right would be the inverse slopes of the triangle edges. To draw any general triangle without the flat-bottom limitation, the same circuit could be used again in reverse to draw a flat-top triangle connected to the first flat-bottom one. Alternatively, the hardware could be extended to draw the complete triangle directly, by adding a new knee register, and changing one of the line slopes when Y reaches knee.
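Here is a rough C sketch of the flat-bottomed case, to make the inverse-slope idea concrete. The 16.16 fixed-point format, the rounding, and the function name are all assumptions for illustration; real hardware would probably use fewer fraction bits.

```c
#include <stdint.h>

uint8_t tri_vram[65536];  /* 256 x 256 pixels, address = (y << 8) | x */

/* Fills a flat-bottomed triangle with its apex at (apex_x, top) and its
   bottom edge running from (bottom_left, bottom) to (bottom_right, bottom).
   Assumes bottom > top, bottom_left <= bottom_right, and all coordinates
   in 0..255. Returns the number of pixels written. */
int fill_flat_bottom_triangle(int apex_x, int top, int bottom_left,
                              int bottom_right, int bottom, uint8_t color) {
    int height = bottom - top;
    int32_t left = apex_x * 65536;  /* 16.16 fixed point */
    int32_t right = left;           /* both edges start at the apex */
    /* Per-line steps: the inverse slopes of the two edges. */
    int32_t left_step = (bottom_left - apex_x) * 65536 / height;
    int32_t right_step = (bottom_right - apex_x) * 65536 / height;
    int writes = 0;

    for (int y = top; y <= bottom; y++) {
        int x0 = (left + 32768) >> 16;   /* round to the nearest pixel */
        int x1 = (right + 32768) >> 16;
        for (int x = x0; x <= x1; x++) {
            tri_vram[(y << 8) | x] = color;
            writes++;
        }
        left += left_step;
        right += right_step;
    }
    return writes;
}
```

For example, fill_flat_bottom_triangle(4, 0, 0, 8, 4, 1) fills rows that are 1, 3, 5, 7, and 9 pixels wide. The inner loop is the FillRect scanline, and the only new hardware is the pair of adders that nudge left and right on each new line.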


Aftermath

Photo by Joe Pankow

Oh ye gods! Crazy, nutty, insane. The past few days have been among the strangest of my life. I think I’ve had my 15 minutes of fame and then some. First there was the article about BMOW 1 on Wired.com, then the story got picked up by CNet, Digg, Slashdot, Engadget, Gizmodo, Reddit, and many others, all over the web. My inbox overflowed with people asking about the project. Then came the Maker Faire, where a few thousand people came through the BMOW booth over a two-day period. People were amazingly enthusiastic, and quite a few people told me they came to the Maker Faire specifically to see BMOW! In the Wired article about “What to do at the Maker Faire”, BMOW was even the featured attraction for the entire event. In the end, it won an editor’s choice award for the show, and I talked myself hoarse.

While I appreciate all the attention this project has suddenly received, I have to admit I feel like a fraud. For one thing, Bill Buzbee’s fabulous Magic-1 homebrew computer was also at the Maker Faire, and it’s twice as cool as BMOW, but didn’t get nearly the press coverage. I also think that most of the people talking about BMOW thought it was something it’s not. A lot of people seemed to have the idea that I’d built a CPU entirely out of wires, as if wires themselves could perform computations. Or they thought I’d built some kind of giant machine the size of a refrigerator. When they saw a rather ordinary-looking 12 x 8 inch board at the show, they looked disappointed. Many people also seemed to have the idea that I’d built a CPU out of thousands of individual transistors. Nope. BMOW is made from sixty-five chips including 7400-series parts, 22v10 PALs, ROMs, SRAM, a video DAC, and an AY audio chip.

It was an incredible time at the Maker Faire. Setup began on Friday, so I was able to get in before the show opened to the public, and chat with some of the other Makers before the crowds arrived. The show is sprawling, massive: two giant expo halls, plus all the grounds between and around them, and half of an enormous parking lot. It was really too much to experience in a single day.

One downside of being a Maker presenting BMOW is that I didn’t get much chance to visit the other exhibits. Fortunately, I did have a chance to talk with a few amazingly brilliant people. I spoke to Jeri Ellsworth, creator of the C-One reconfigurable retro-computer (who was demoing DIY transistors), Limor “Lady Ada” Fried, inventor of all manner of Arduino and other electronic projects, and evilmadscientist.com’s Windell Oskay, who was demonstrating the Candy Fab 6000 sugar-based 3D printer. There were also many other amazing and creative people I spoke to, not all of whose names I caught, but it was great to get wrapped up in an aura of geeky achievements with them all.

Thanks to everyone who came by the BMOW booth at the Faire, and to my friends Kevin and Eric for helping out as BMOW crew members for the weekend. The booth was packed almost non-stop from open to close, both days. If I’d had the foresight to make extras, I could have sold a ton of books and T-shirts, and lots of people asked about them. Some people even asked about buying kits. I can only assume they had a spare year with no particular plans, and were looking to fill it. A few people came by the booth and gave me free stuff! A guy from Parallax gave me a Propeller development board. I’ll definitely have to play around with that.


A few questions about BMOW came up again and again:

  • “What’s the operating system?” When I answered “there isn’t one”, this seemed to blow people’s minds. How could you have a computer without an OS? For BMOW the hardware is so simple, the programs themselves essentially *are* the OS.
  • “What compiler did you use to write programs?” Some people seemed genuinely astounded that it’s possible for a human to write programs in assembly language. Whether they’re too young to remember when that was the norm, or have just spent too much time coding in Python or something, I don’t know. Yes, it’s all BMOW assembly language, which is mostly identical to 6502 assembly. I tinkered with retargeting a C compiler for BMOW, but never went very far with it.
  • “Where are the wires?” Ah yeah. I suppose it’s false advertising to have a project called “Big Mess o’ Wires”, and not have a giant hairball of wiring hanging out somewhere. The wires are all hidden on the underside of the system board. Sorry.

In the end, BMOW won a Maker Faire Editor’s Choice award. In fact, it won one twice. I’m not really sure what happened there, but when the second editor came by with Lady Ada to give me the award, and heard that someone else had already given me one, he seemed pretty ticked. I’m guessing maybe different editors were supposed to give awards in different categories, and two different editors claimed BMOW in their category. Regardless, I’m thrilled and excited to be recognized by Make.

To the guy I talked with who’s got some unused wire-wrap boards, please send me an email, and I promise to give them a good home. To the guy who turned out to live down the street from me in Belmont, email me, and maybe we can hook up for some neighborhood nerd projects. For anyone else who emailed me already, if I didn’t reply, try me again.

Last but not least, I’m taking orders for SWAG. If you’d like a few BMOW stickers, send me a SASE or $0.50 by PayPal, and I’ll get you some. I’ll also be placing another order for BMOW T-shirts in a week, on June 7. If you’re in the USA, send me $28 by PayPal before the 7th, along with your shirt size, and you’ll get a shirt in a couple of weeks. If you’re outside the USA, email me to ask about shipping costs. Sorry, but 5-color silkscreened T-shirts for small run orders aren’t cheap. I’m not making any profit off these.

Whew! It’s been an amazing couple of days, but I’ve had enough. Time to go crack open a beer.

Meeting Bill Buzbee, creator of the Magic-1:

A day at the Maker Faire:


BMOW Project Summary

Crazy day today! Maker Faire setup begins tomorrow, and BMOW is featured on wired.com, and is the #1 Top in All Topics story on Digg! Oh man, this poor server is getting hammered.

Many people have asked for high-res photos. See this entry from February, and click any of the thumbnails to get the high-res versions of wire-wrapping craziness.

For people interested in viewing or buying the “Making of BMOW” photo book, you can order it from Shutterfly here.

 

Here’s a summary of the project, for everyone following the link from the article.

Big Mess o’ Wires 1 is an original CPU design. It does not use any commercial CPU, but instead has a custom CPU constructed from dozens of simple logic chips. Around this foundation is built a full computer with support for a keyboard, sound, video, and external peripherals.

My original goals were:

  • Build the CPU from scratch, primarily using basic 7400-series logic. No 6502, Z-80, etc.
  • Keep the hardware complexity to a minimum. I’m not an electrical engineer.
  • Be capable of running “real” programs, not a 4-bit CPU or toy machine.
  • Provide a way to interface with a PC.
  • Be fast enough to run interesting programs interactively.

Stretch goals:

  • Boot into a simple integer BASIC program, capable of interactively editing and running its own programs.
  • Support multiple programs executing simultaneously, via a pre-emptive multitasking OS.
  • Provide keyboard input, VGA video and sound output.

Initial design began in November 2007 with a high-level sketch of the CPU internal design. A simplified Verilog hardware simulation proved the key design details. Construction began in earnest in February 2008, using a large wire-wrap board to interconnect the 50 or so chips needed. In April, a half-finished BMOW 1 booted up for the first time, computing fibonacci(12) = 144 using a simple ROM-based program. One by one the original system goals and stretch goals were met, including VGA video, three-voice audio, BASIC, and a bootloader for communication with an attached PC. BMOW 1 eventually gained the ability to run complex programs written in assembly or compiled from C. The main construction phase ended in February 2009, with the completion of a customized case to house everything. As of March 2009, Big Mess o’ Wires 1 is fully functional, but will probably never be “finished”.

Architecture

BMOW 1 borrows liberally from other homebrew designs, as well as the MAYBE design presented in the book Computation Structures by Stephen Ward and Robert Halstead. Data busses are 8 bits wide, and the address bus is 24 bits. Four 8-bit registers are used for general data, and three 24-bit registers store the program counter, stack pointer, and a scratch/working address pointer. Registers and the arithmetic and logic unit are interconnected by one data bus, while RAM, ROM, and memory-mapped hardware devices use a second data bus. The ALU also has dedicated left and right data input busses.

Machine language instructions are implemented as a series of micro-instructions, stored in three parallel ROMs to create a 24-bit microcode word. One micro-instruction is executed each clock cycle, and the micro-instruction bits are used directly as enable and select inputs to control all the chips in the machine. Up to 16 micro-instructions may be needed to implement a single machine language instruction.
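As a rough illustration of how three byte-wide ROMs can supply one 24-bit microcode word, here is a minimal Python sketch. The byte order, the `microcode_address` layout (an 8-bit opcode plus a 4-bit step counter), and the helper names are all assumptions made for the example; the actual bit assignments in BMOW 1 aren't described here.

```python
# Sketch: three 8-bit ROMs read in parallel form one 24-bit micro-instruction.
# Byte order and address layout are hypothetical, not BMOW 1's real wiring.

MAX_STEPS = 16  # up to 16 micro-instructions per machine language instruction


def microcode_address(opcode, step):
    """Hypothetical ROM address: 8-bit opcode followed by a 4-bit step counter."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in 8 bits")
    if not 0 <= step < MAX_STEPS:
        raise ValueError("step must be in 0..15")
    return (opcode << 4) | step


def microcode_word(rom_low, rom_mid, rom_high, address):
    """Combine the byte at `address` from each ROM into one 24-bit word."""
    return (rom_high[address] << 16) | (rom_mid[address] << 8) | rom_low[address]


def control_bit(word, n):
    """Return bit n of the word (0 = least significant), i.e. one control line."""
    return (word >> n) & 1
```

Each bit of the combined word would then drive one enable or select input somewhere in the machine, which is why no further decoding step appears in the sketch.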

Note: Some additional devices are not shown here, including the VGA display circuitry and real-time clock.

24-bit addresses allow for up to 16MB of memory, but only a little more than 1MB of combined RAM and ROM is installed. The most-significant byte of the address is called the bank byte, and is normally invisible to programs. The standard instruction set presents a 16-bit interface to programs, with most instructions implicitly referencing the current bank. Cross-bank references are possible, but awkward (think x86 segment registers).
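The bank scheme can be sketched in a few lines of Python. This only shows the arithmetic of splitting a 24-bit address into a bank byte and a 16-bit offset; the function names are made up for the example and don't correspond to anything in the BMOW instruction set.

```python
# Sketch: a 24-bit address seen as a bank byte plus a 16-bit in-bank offset.
# Function names are hypothetical, for illustration only.

ADDRESS_SPACE = 1 << 24                   # 16 MB addressable in total
BANK_SIZE = 1 << 16                       # 64 KB visible through 16-bit addresses
BANK_COUNT = ADDRESS_SPACE // BANK_SIZE   # 256 banks


def split_address(addr24):
    """Return (bank, offset) for a 24-bit address."""
    if not 0 <= addr24 < ADDRESS_SPACE:
        raise ValueError("address must fit in 24 bits")
    return addr24 >> 16, addr24 & 0xFFFF


def join_address(bank, offset):
    """Build the full 24-bit address from a bank byte and a 16-bit offset."""
    return ((bank & 0xFF) << 16) | (offset & 0xFFFF)


# An ordinary instruction supplies only the offset; the bank comes from
# wherever the machine keeps the current bank byte.
current_bank = 0x08
print(hex(join_address(current_bank, 0xC000)))  # 0x8c000
```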

A 512K ROM contains a bootloader/menu program. A USB-to-TTL interface based on an FTDI chip provides an easy way to move data to and from a connected PC. A standard PC keyboard with PS/2 connector is used for keyboard input, and a 24×2 character text LCD serves as a debug output display. Custom video circuitry drives a standard VGA monitor, with a maximum resolution of 512 x 480. A three-voice programmable sound generator provides music and sounds.

BMOW 1 is built on an Augat wire-wrap board pre-populated with thousands of wire-wrap pins. The chips are pushed into the board without soldering, and can be easily removed, similar to a prototyping breadboard. Unlike a breadboard, the pins are individually connected on the underside of the board according to the needs of the circuit design. A wire-wrap tool is used to wrap stripped wire ends tightly around each pin. Wires can be removed fairly easily in case of a mistake. BMOW 1 contains about 2500 such wire wraps.

Specs

  • Current clock speed is 2MHz. It could theoretically go to about 3MHz (untested).
  • 512 KBytes of RAM, 512 KBytes of ROM.
  • Power draw is 10 Watts, 2.0A at 5V.
  • VGA video output is 512×480 with two colors, or 128×240 with 256 colors.
  • Audio and music are provided by a three-voice programmable sound generator.
  • Keyboard input is a standard PC keyboard with PS/2 connector.
  • Debug display is a 24×2 character text LCD.
  • There are roughly 1250 wires connecting the components, so 2500 individual hand-turned wire wraps.
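A few of these numbers can be cross-checked with simple arithmetic. The Python sketch below assumes pixels are packed with no padding, which is a guess about the video memory layout rather than something stated above; under that assumption, both video modes need exactly the same amount of memory per frame.

```python
# Sketch: sanity-check the spec numbers. Packed pixels with no padding
# are an assumption about the video memory layout.

def framebuffer_bytes(width, height, colors):
    """Bytes needed for one frame at the given resolution and color count."""
    bits_per_pixel = (colors - 1).bit_length()  # 2 colors -> 1 bit, 256 -> 8 bits
    return width * height * bits_per_pixel // 8


mono_bytes = framebuffer_bytes(512, 480, 2)     # 30720 bytes (30 KB)
color_bytes = framebuffer_bytes(128, 240, 256)  # 30720 bytes (30 KB)

power_watts = 2.0 * 5   # 2.0 A at 5 V = 10 W
wire_wraps = 1250 * 2   # each wire is wrapped at both ends = 2500 wraps

print(mono_bytes, color_bytes, power_watts, wire_wraps)
```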

Please send me your thoughts and questions!

BMOW 1 Photo Gallery



