Archive for December, 2007
Universal Programmer
I got a TOP 2007 universal programmer yesterday, for burning EPROMs, GALs, and Flash memory. Whenever I hear the term “universal programmer”, I imagine a nerd-equivalent of the 1992 movie Universal Soldier, starring Jean-Claude Van Damme. Plot: a mild-mannered C++ programmer gets cybernetic implants and goes on a killing spree.

TOP 2007 Pros: It’s pretty cheap (about $100), and it programs GALs, unlike most low-cost programmers.
TOP 2007 Cons: Horrible, horrible software. I’ve seen slipshod, poorly-translated software before, but this really elevates it to an art form. Maybe 25% of the text isn’t translated at all, and is still in Chinese, including some status and error messages. What text is in English is so poorly translated, it’s difficult to tell what it means. “Driver not ready ,if load it now?” Damned if I know. And what’s a “cussor?” Clicking the close box minimizes the window instead. Reading from a chip overwrites the contents of any data file you previously had open for writing. Lots of options and menu items do mysterious unknown functions, or nothing at all. Device recognition doesn’t seem to work: you have to manually select the right device from a list of hundreds, and pray you got it right.
Despite these issues, programming a 128K Flash memory worked fine. I had problems with some Lattice 22V10D GALs, however. That’s especially frustrating, since GAL support is the whole reason I chose the TOP 2007 over other choices. The 22V10 is on the supported device list, and it does almost work, but not quite. What appears to happen is that it successfully programs all the 5000+ fuses needed to implement the logic expressions, but won’t program the last 20 fuses that configure the output pins to be inverting or non-inverting, registered or combinatorial.
Any sane person would have attempted to return the programmer, or junk it and get a better one. Instead I spent an afternoon trying to reverse engineer the software to see if I could fix the problem. By accident, I discovered that I could program 14 of the 20 unprogrammable fuses by adding extra fuse data to the data file, beyond the number of fuses that are actually in the GAL. I also found that 3 more of those 20 seemed to copy the data from elsewhere in the fuse map. But the last 3 fuses appear permanently stuck at 0. The net result is that with some complicated effort, I can completely configure 8 of the 10 GAL outputs. The other two are stuck in registered, inverting mode. I can probably work with that for most purposes. Worst case, those 2 outputs will just be unused.
GAL Counter: The hardware design calls for a stack pointer that’s a 24-bit up/down counter with output enable. The simplest approach would be to use six 4-bit up/down counters, and three 74LS244 drivers, requiring nine chips total. A better solution would use the hard-to-find 74LS569 4-bit up/down counter with integrated output enable, requiring only six chips. Better still would be to use three 22V10 GALs to make 8-bit up/down counters with output enables, requiring just three chips.
It turns out that programming an 8-bit up/down counter into a GAL is quite a challenge. For starters, there are barely enough pins. The 22V10 has 22 data pins, of which at most 10 can be outputs. I was able to barely cram it in by encoding some of the control inputs into a 2-bit function code. But the bigger challenge is that the logic equations for computing the new value of each counter bit require more product terms than the GAL supports. My equation for the MSB of the counter includes 19 AND terms all OR-ed together, but the 22V10 supports at most 16 terms. Here’s my equation:
/q7 := f1*f0*/q7 + /f1*/cet*/q7 + f1*/f0*/d7 + /f1*f0*cet*q0*q1*q2*q3*q4*q5*q6*q7 + /f1*f0*cet*/q7*/q0 + /f1*f0*cet*/q7*/q1 + /f1*f0*cet*/q7*/q2 + /f1*f0*cet*/q7*/q3 + /f1*f0*cet*/q7*/q4 + /f1*f0*cet*/q7*/q5 + /f1*f0*cet*/q7*/q6 + /f1*/f0*cet*q7*/q6*/q5*/q4*/q3*/q2*/q1*/q0 + /f1*/f0*cet*/q7*q0 + /f1*/f0*cet*/q7*q1 + /f1*/f0*cet*/q7*q2 + /f1*/f0*cet*/q7*q3 + /f1*/f0*cet*/q7*q4 + /f1*/f0*cet*/q7*q5 + /f1*/f0*cet*/q7*q6
cet is the count enable (active high). F=00 means count down, F=01 means count up, F=10 means load, and F=11 means no change. So this crazy equation says that bit 7 should be zero if we’re not changing and the current bit 7 is zero, or we’re trying to count up/down but counting is disabled and the current bit 7 is zero, or we’re loading and the input bit 7 is zero, or we’re counting up and the counter is at 11111111, or we’re counting up and the current bit 7 is zero and any of bits 0-6 are zero, or we’re counting down and the counter is at 10000000, or we’re counting down and the current bit 7 is zero and any of bits 0-6 are one. Phew! If you can simplify that to 16 product terms, you’ll win a prize. Maybe I should create a truth table with 4096 entries, and build a Karnaugh map. Or not.
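As a sanity check on the equation above, here is a minimal Python sketch that evaluates all 19 product terms and compares them with a plain behavioral model of the counter over every one of the 4096 input combinations. The function names (`not_q7_terms`, `model_q7`, `count_mismatches`) are made up for this illustration and are not part of any GAL toolchain.

```python
from itertools import product

def not_q7_terms(f1, f0, cet, q, d7):
    """Evaluate the 19 product terms of the /q7 equation; q is the 8-bit count."""
    b = [(q >> i) & 1 for i in range(8)]    # b[i] is counter bit qi
    q7 = b[7]
    up = (not f1) and f0 and cet            # F=01 with counting enabled
    down = (not f1) and (not f0) and cet    # F=00 with counting enabled
    terms = [
        f1 and f0 and not q7,               # no change, bit 7 already zero
        (not f1) and (not cet) and not q7,  # counting disabled, bit 7 already zero
        f1 and (not f0) and not d7,         # load, input bit 7 is zero
        up and all(b),                      # count up from 11111111
    ]
    terms += [up and not q7 and not b[i] for i in range(7)]
    terms.append(down and q7 and not any(b[:7]))  # count down from 10000000
    terms += [down and not q7 and b[i] for i in range(7)]
    return [bool(t) for t in terms]

def model_q7(f1, f0, cet, q, d7):
    """Behavioral model: the value bit 7 should take after the clock edge."""
    if f1 and f0:          # F=11: no change
        return q >> 7
    if f1:                 # F=10: load
        return d7
    if not cet:            # counting disabled
        return q >> 7
    step = 1 if f0 else -1 # F=01 counts up, F=00 counts down
    return ((q + step) & 0xFF) >> 7

def count_mismatches():
    """Compare equation and model over all 2**12 = 4096 input combinations."""
    bad = 0
    for f1, f0, cet, d7 in product((0, 1), repeat=4):
        for q in range(256):
            eq_q7 = 0 if any(not_q7_terms(f1, f0, cet, q, d7)) else 1
            if eq_q7 != model_q7(f1, f0, cet, q, d7):
                bad += 1
    return bad
```

A check like this only confirms that the 19-term form matches the intended behavior; it says nothing about whether a 16-term form exists.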
I need to take another look to make sure there isn’t some clever way I can simplify the equation to get under 16 terms. If not, then I’ll probably either make four 6-bit up/down GAL counters (which would be a little strange, since bytes would be awkwardly divided among different GALs), or six 4-bit up/down GAL counters, essentially replicating a 74LS569 in a GAL. So far I haven’t found any place that has 74LS569s in stock.
Schematics
I’ve started creating schematics for the machine, using an evaluation version of CSiEDA 5. Making a real schematic that shows all the parts, pins, and interconnections is amazingly time-consuming, but it’s essential if I want to avoid making construction mistakes.

It took me a couple of hours to get familiar with CSiEDA and create this schematic for the clock and reset generation circuitry. A crystal oscillator is used to clock a pair of flip-flops, wired so as to produce two new clock signals, Q0 and Q1, at half the frequency of the oscillator input. Q0 is the primary clock signal, and is used by most other clocked components in the system. Q1 lags Q0 by 90 degrees (one-quarter of a cycle), which is useful for generating other timing signals. The clock signals are buffered by a 74LS244, which has a higher drive current than most TTL chips, meaning each output pin can drive up to 16 other TTL inputs. Some signals appear on multiple ‘244 output pins, where I expect to need them at more than 16 inputs elsewhere in the system.
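The post doesn’t spell out how the two flip-flops are wired, so the following is only one plausible arrangement, assumed for illustration: Q0 toggles on every rising edge of the oscillator, and Q1 copies Q0 on every falling edge. The sketch steps through oscillator half-periods and records both outputs; `simulate_clock` is a hypothetical helper name.

```python
def simulate_clock(half_periods):
    """Simulate Q0 and Q1, one sample per oscillator half-period.

    Assumed wiring (not taken from the schematic): Q0 toggles on rising
    oscillator edges, Q1 samples Q0 on falling oscillator edges.
    """
    q0 = q1 = 0
    q0_trace, q1_trace = [], []
    for t in range(half_periods):
        if t % 2 == 0:   # rising oscillator edge
            q0 ^= 1
        else:            # falling oscillator edge
            q1 = q0
        q0_trace.append(q0)
        q1_trace.append(q1)
    return q0_trace, q1_trace

q0_trace, q1_trace = simulate_clock(8)
print(q0_trace)  # [1, 1, 0, 0, 1, 1, 0, 0]
print(q1_trace)  # [0, 1, 1, 0, 0, 1, 1, 0]
```

Under that assumption, each output repeats every four half-periods (two oscillator cycles, so half the oscillator frequency), and Q1 is Q0 delayed by one half-period, which is one quarter of its own cycle.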
A Microchip TC1232 supervisor chip is used to generate the /RESET signal. Whenever the +5V input to the TC1232 is too low (during power-on, power-off, or power glitches), it forces the /RESET signal active for about 100ms, restoring the machine to its initial state. It also has an input for a reset switch, with built-in pull-up resistor and debouncing circuitry. The /RESET signal is clocked through another flip-flop, so the rest of the system will only see /RESET change at a clock edge, rather than in the middle of a clock cycle.
The TC1232 also has a “watchdog” feature that can be used to auto-reset the machine if it crashes. The CPU is expected to toggle the TC1232’s strobe (/ST) input periodically to indicate that it’s still alive. If too much time passes without /ST being toggled, the TC1232 will force the /RESET signal active. I don’t plan to use this feature, so I’ve tied /ST directly to the clock signal, so it will always be toggling.
In the final implementation, the flip-flops and 74LS244 may all be replaced by a single GAL, but the functionality will remain the same.
Final Design Tweaks?
I’m trying to finish up the final hardware design now, so I can get started with actually building this thing. Although it will probably never be truly “done”, I don’t want to end up ripping out half the components and wiring to accommodate some new design feature I should have anticipated in the first place.
Here’s what I’ve been considering:
Improved condition code register: I could use a GAL to replace the 4-bit shift register with a custom dual parallel-in, parallel-out register. That would make it possible to load and store the entire CC register in a single clock cycle, instead of shifting data in/out over multiple cycles. The savings would help speed up the BRK and RTI instructions used during interrupt processing, shaving a total of 9 clock cycles off the total time needed to invoke an interrupt service routine and then return to the original program.
Conclusion: Skip it. I expect that a typical interrupt service routine will be tens of instructions long, taking probably 50 to 100 clock cycles, so a savings of 9 clock cycles isn’t that compelling.
Zero-page addressing: The 6502 CPU, from which I’ve borrowed the assembler syntax, has a mode known as zero-page addressing. Instructions using this addressing mode have an implied high-byte of zero for the address, so only the low-byte is specified. This means the instruction requires one less byte, resulting in more compact program code. On the 6502, zero-page addressing instructions also execute in fewer clock cycles than their absolute addressing counterparts. It’s sort of like having an extra 256 registers (the size of the zero page) that can be manipulated with a speed somewhere between true CPU registers and generic memory locations.
To gain a speed benefit from zero-page addressing, the BMOW hardware would require a change to permit zeroing of the high-byte of the address register simultaneously with loading of the low-byte. It would probably also require some tweaking of the memory mapping and reset circuitry, since page zero is currently part of ROM, and the machine begins program execution at memory address 0 after a reset.
If a program could be written such that one in every four instructions employed zero-page addressing, then I estimate it would be about 8% smaller and 6% faster than a program that never used zero-page addressing. In the limiting case where every instruction employed zero-page addressing (not realistic), the program would be 33% smaller and 25% faster.
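The estimates above can be reproduced with a little arithmetic. This sketch assumes, purely for illustration, that a typical absolute-addressed instruction is 3 bytes and 4 clock cycles, and that its zero-page counterpart is 2 bytes and 3 cycles; those per-instruction figures are assumptions chosen to match the 33% and 25% limiting case, not numbers taken from the BMOW microcode.

```python
# Assumed per-instruction costs (hypothetical, chosen to match the limiting case).
ABS_BYTES, ZP_BYTES = 3, 2
ABS_CYCLES, ZP_CYCLES = 4, 3

def zero_page_savings(zp_fraction):
    """Return (size saving, cycle saving) as fractions of the original program,
    given the fraction of instructions that switch to zero-page addressing."""
    size_saving = zp_fraction * (ABS_BYTES - ZP_BYTES) / ABS_BYTES
    cycle_saving = zp_fraction * (ABS_CYCLES - ZP_CYCLES) / ABS_CYCLES
    return size_saving, cycle_saving

for fraction in (0.25, 1.0):
    size, cycles = zero_page_savings(fraction)
    print(f"{fraction:.0%} zero-page: {size:.1%} smaller, {cycles:.1%} fewer cycles")
```

With one instruction in four using zero-page addressing, this gives roughly 8% smaller and 6% fewer cycles, in line with the figures quoted above.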
Conclusion: Skip it, mostly. A typical improvement of under 10% doesn’t seem worth the hassle of changing the hardware design yet again. I may still choose to implement the zero-page addressing mode instructions later as a software-only change (new microcode), which would provide the program size savings, but no speed benefit. It would just substitute a clock cycle where the high-byte of the address register is loaded with some constant value for a cycle where the high-byte would otherwise have been loaded with a byte from the program code.
Add a Y register: I’ve been talking about this for a while, and I think I’ve figured out how to shoehorn a Y register into the left ALU input, where it must be in order to work as intended. The left input already has 4 possible sources and there are no spare control ROM outputs, so I was originally stumped as to how to support a fifth source for the left input.
My solution is to make X and X7 (a pseudo-register containing X’s sign bit) share a single enable signal from the control ROM. This signal would be AND-ed with the load enable signals for PCHI and ARHI, the high-bytes of the program counter and address register, in order to create the individual enable signals for X and X7. If the load destination were PCHI or ARHI, then X7 would be enabled, otherwise X would be enabled. While this is arbitrary and potentially limiting, in practice it mirrors exactly how X7 is already used for address calculations by the microcode. With X and X7 now sharing a control signal, there would be a free one for the Y register.
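The gating described above boils down to two small logic expressions. Here is a sketch of it in Python, with all signals treated as active-high booleans for readability (the real signal polarities are not stated here, so that is an assumption, as are the function and parameter names).

```python
def left_input_enables(x_shared, load_pchi, load_arhi):
    """Derive the separate X and X7 output enables from one shared control bit.

    x_shared  -- the single enable signal from the control ROM
    load_pchi -- load enable for the high byte of the program counter
    load_arhi -- load enable for the high byte of the address register
    """
    high_byte_dest = load_pchi or load_arhi
    x7_enable = x_shared and high_byte_dest      # sign-extension byte for address math
    x_enable = x_shared and not high_byte_dest   # ordinary use of X
    return x_enable, x7_enable

print(left_input_enables(True, False, False))  # (True, False): X drives the left input
print(left_input_enables(True, True, False))   # (False, True): X7 drives the left input
```

The two enables can never be active together, so X and X7 never fight over the left ALU input.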
Conclusion: Do it. While the solution is a bit ugly, it’s relatively isolated. Adding Y will give the machine three general-purpose data registers rather than two, which is a significant improvement that should enable writing substantially faster/simpler programs. It will also make it much easier to port 6502 assembly programs to BMOW.
More than 64K Memory: 64K is the standard memory space for an 8-bit machine, but something larger would open up many interesting possibilities related to multi-tasking, for which 64K is probably too small to be practical. It would also allow the creation of single programs operating on larger data sets. Realistic values for the total amount of memory are in the 128K to 4MB range, assuming the use of standard SRAM.
A key consideration is how the extra memory should be addressed. One option is to have a separate segment register to hold the highest address bits. This register might be explicitly loadable by programs, or might be controlled by the OS, with each process given a separate segment. With this scheme, the bulk of the instructions would still use 16-bit addresses, and the segment register would presumably be altered infrequently. The alternative is to change all the instructions to use 24-bit addresses, providing for a totally generic flat address space (24 bits can address up to 16MB, more than enough to cover the 4MB upper end of the range). That would negatively impact program size and speed, however, due to the extra byte of address data in most instructions. Fortunately these approaches all require the same underlying hardware, with the differences lying entirely in the instruction set design and microcode.
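To make the segment-register option concrete, here is a minimal sketch of how a physical address might be formed: the segment register supplies the bits above the ordinary 16-bit address. The 6-bit segment width (giving 22 address bits, or 4MB) is an assumption picked for the example, and `physical_address` is a hypothetical name.

```python
SEGMENT_BITS = 6  # assumed width: 6 + 16 = 22 address bits, i.e. 4MB

def physical_address(segment, addr16):
    """Combine a segment number with a 16-bit address from the instruction."""
    if not 0 <= segment < (1 << SEGMENT_BITS):
        raise ValueError("segment out of range")
    if not 0 <= addr16 <= 0xFFFF:
        raise ValueError("address out of range")
    return (segment << 16) | addr16

print(hex(physical_address(0, 0x1234)))     # 0x1234
print(hex(physical_address(0x3F, 0xFFFF)))  # 0x3fffff, the top of a 4MB space
```

The 24-bit alternative would use the same address lines; the only difference is that every instruction would carry the upper bits itself instead of taking them from the register.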
Conclusion: Do it. The extra hardware needed is trivial, and the decision regarding how to use the additional memory can be made later.
But wait, there’s more! On top of these four issues, there are several other half-conceived ideas flying around my head as well:
- Direct connection of a keyboard and monitor (or LCD panel?), instead of using a PC as a terminal.
- Compact flash or IDE-based file storage.
- Integration of a real-time clock with timer interrupts.
- Two-phase clock. Investigate the necessity of buffering for clock signals due to TTL fanout limits.
- Physical construction. I need a case, a power supply, on/off switch, reset button, maybe a fuse? The case must also provide easy access to all the hardware, as well as space/power/mounting points for future add-ons I haven’t yet thought of.
I think I’m getting a little carried away. It’s time to build the basic machine and get it working, then I can return to these other ideas.
Protoboardin’
My long-awaited delivery of hardware arrived yesterday, including a protoboard, tools, and about half the components I need to build the machine. I’m using the protoboard to try various test circuits, before I construct the real thing on the wire wrap board. Last night I geeked out with the protoboard, and threw together some sample circuits.

I cut the tip off an old 9V laptop power supply, and wired it to the Vcc and ground terminals of the protoboard. Then I used a 5V voltage regulator and some capacitors (at right in the photo) to get a smooth 5V supply for the other components. The silver rectangle is a 1MHz clock oscillator. The yellow chips are resistors used to limit the current through the LED, and pull up the voltage at the push-switch. The black chip is a 4-bit counter. I wired the push-switch to the counter’s clock input, and the LED to the lowest bit of its output. Every time I press and release the switch, it clocks the counter, the low bit switches between 0 and 1, and the LED toggles off and on. It’s digital, baby!
Max Clock Speed
I used Verilog to do a simple test to estimate the machine’s top clock speed. I kept increasing the clock rate until the validation test suite started failing, and it topped out at about 2.63MHz. That feels respectable for a home-built machine. In reality I think I can go faster than that, since the timing data for my Verilog simulation is built around the worst-case estimates. Depending where the critical path is (I didn’t investigate to see), I may also be able to speed it up further by using a two-phase clock. Of course this all assumes I’m not limited by signal noise or some other intrusion of physics into the digital domain!
Microcode 1.0 Complete
Time to celebrate! I powered through the remainder of the microcode for the core machine instructions, along with a full test suite to verify each instruction, and it all works. That’s 50 different instructions, and a couple of hundred individual tests. The test suite program alone fills more than half the machine’s ROM. Whew! I’m beat.
What I call the “core” instructions are the overlap between the 6502 instruction set (from which I’ve largely adopted the assembler syntax) and BMOW’s instruction set. That means jumps, subroutines, branches, math, boolean operators, comparisons, stack manipulation… everything you would need to write a real program. It obviously doesn’t include instructions related to 6502-specific features, like the Y register and zero-page addressing. It also doesn’t include planned BMOW-specific instructions and addressing modes; these will come later, but they aren’t essential. The core instructions provide all the functionality needed for most programs.
I resolved the problem regarding overwritten condition codes with the quick and dirty solution I first thought of. The machine saves and restores the CC’s on the stack during the instruction’s execution. ADC, SBC, ROL, and ROR all faced this problem, and my solution adds 3-4 clock cycles to those instructions. That stinks, but at least it works.
Ultimately, I think I may implement the extra carry pseudo-flag I described previously. That would let me reclaim the lost clock cycles on those four instructions, and also save a cycle on several other instructions in certain cases. Perhaps more importantly, it would also remove the need for many other instructions (like branches, or anything using x-indexed addressing mode) to modify the condition codes.
Here’s the source for BMOW Microcode 1.0. The microcode syntax is mostly self-explanatory. Each line is prefixed by a set of condition codes. A prefix of “*” means the line should be used no matter what the values of the condition codes, while a prefix like c=1 means the line should only be used if the carry flag equals 1. Each line (or pair of lines, in cases like c=0, c=1) represents one phase (one clock cycle) of the instruction’s execution.
The longest instructions are BRK (all the machine state must be saved) and ROR (rotate right is performed by repeated rotate lefts). Each is 16 cycles. At the other extreme, INX, DEX, ASL, TAX, and TXA are all 1 cycle, and many others are 2 cycles.
ADC $NNNN,X
Uh oh. I was cranking through the microcode implementation for BMOW’s instruction set, making good progress. I’d finished about two-thirds of the microcode, when I came to ADC absolute, X-indexed. That’s when the wheels came off the cart.
This instruction (add with carry) is supposed to take an absolute memory address, adjust the address by adding the contents of the X register, then add the value at the effective address to the accumulator, plus whatever value was already in the carry bit of the condition code register. Unfortunately, I don’t see how I can reasonably implement it with my current hardware design.
The reason that ADC $NNNN,X is problematic is that it modifies the condition codes (to test and propagate a carry from the low byte to the high byte of the effective address when adding X), but it also depends on the current value of the condition codes (for the carry flag). The effective address computation destroys the carry bit that’s needed for the add step. SBC $NNNN,X (subtract with carry, absolute, X-indexed) has the same problem.
I see two possible ways to fix this, neither one great:
- Store the old condition codes somewhere before computing the effective address, then restore them for the add step. That would make this instruction unreasonably slow, however. It might not even fit within the 16-clock limitation for a single instruction.
- Add a fifth pseudo-flag to the condition code register for handling the carry propagation during effective address computation. This flag would be invisible to the programmer, and only accessible at the microcode level. But that would give me 5 bits in my 4-bit condition code register, and also require compensatory changes in the control subsystem and microcode assembler.
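The conflict is easy to see in a small behavioral model. The sketch below is not the BMOW microcode; it is a hypothetical Python rendering of ADC absolute, X-indexed in which the effective address calculation overwrites the carry flag, with a switch to model the first fix (saving and restoring the old carry). It treats X as an unsigned offset for simplicity.

```python
def adc_abs_x(a, carry, base, x, memory, preserve_carry):
    """Model ADC $NNNN,X. Returns (new accumulator, new carry).

    memory is a dict mapping 16-bit addresses to byte values.
    """
    saved_carry = carry
    # Effective address: add X to the low byte, propagate the carry to the high byte.
    low_sum = (base & 0xFF) + x
    carry = low_sum >> 8                    # the address carry overwrites the C flag
    high = ((base >> 8) + carry) & 0xFF
    effective = (high << 8) | (low_sum & 0xFF)
    if preserve_carry:
        carry = saved_carry                 # fix 1: restore the programmer's carry
    total = a + memory[effective] + carry
    return total & 0xFF, total >> 8

memory = {0x2005: 0x20}
print(adc_abs_x(0x10, 1, 0x2000, 5, memory, preserve_carry=True))   # (49, 0), i.e. $31
print(adc_abs_x(0x10, 1, 0x2000, 5, memory, preserve_carry=False))  # (48, 0), carry-in lost
```

The second fix, the hidden fifth flag, would amount to keeping the address carry in its own variable so that `carry` is never touched before the add step.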
Poop. I’m half hoping that if I stare at it for a while, I’ll think of a better solution.
Optimization, or Distraction?
I think I may be getting too concerned about potential optimizations to the hardware design, before I’ve even built or simulated the initial design. A couple of possible optimizations occurred to me recently.
Improved CC Register: I described earlier how the condition codes are stored in a 4-bit shift register with parallel output, enabling the control circuitry to read all the flags simultaneously. Copying the condition codes to/from a register is done serially, requiring 4 clock cycles of bit shifting.
My latest improvement idea is to use a GAL to make a custom 4-bit register with two independent parallel load inputs, one connected to the ALU and one connected to the low 4 bits of the data bus. That would permit loading of all the condition codes from a stored value in a single clock cycle. The outputs of this GAL-register could also drive the data bus, either directly or through the ALU, making it possible to store all the condition codes in a single clock cycle as well.
These improvements would only benefit instructions that load/store the condition codes, though, which really only happens during interrupt processing, so maybe it’s not worth the effort. There are also a few problems regarding how to connect the GAL-register output to the data bus that I would need to resolve.
An Extra Data Register: A while ago, I considered adding a Y register to the machine, but ran into the limit of 8 possible load destinations. Now that I’ve increased the limit to 16, adding a Y register would be as simple as adding 2 more chips to store and drive the data. The required load enable and output enable lines are already there.
Unlike the A and X registers, the Y register would be connected to the right ALU input, which presents some problems. The T (temporary) register is connected to the right ALU input as well, which means it would be impossible to directly compute any functions of Y and T. For example, to add a constant value to X, the machine can load the constant into T, add X and T, and store the result where needed. But to add a constant value to Y, it would first need to copy X to T, load the constant into X, add X and Y, then restore the old value of X from T.
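The extra shuffling is easier to see written out step by step. This is a rough sketch only: registers are modeled as entries in a dict, each assignment stands for one register transfer, and the helper names are invented for the example. It makes no claim about how many clock cycles each step would actually take.

```python
def add_constant_to_x(regs, constant):
    """X is on the left ALU input and T on the right, so one load suffices."""
    regs["T"] = constant
    return (regs["X"] + regs["T"]) & 0xFF

def add_constant_to_y(regs, constant):
    """Y would share the right ALU input with T, so X must be borrowed."""
    regs["T"] = regs["X"]                      # park X in T
    regs["X"] = constant                       # constant goes to the left input via X
    result = (regs["X"] + regs["Y"]) & 0xFF    # left (X) + right (Y)
    regs["X"] = regs["T"]                      # restore X
    return result

regs = {"X": 3, "Y": 10, "T": 0}
print(add_constant_to_x(regs, 5))  # 8
print(add_constant_to_y(regs, 5))  # 15
print(regs["X"])                   # 3, X is back to its old value
```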
That would give the machine the odd property that operations involving Y are slower than those involving X and A. To be most useful, the proposed Y register really needs to be connected to the left ALU input, but that input is already “full”, and I don’t think I can relocate or remove any of the existing left inputs.
Too Many GALs? In my drive to optimize the design and reduce chip count, I’m beginning to wonder if I’m using too many GALs. I keep finding more and more places where two or more 7400-series parts could be replaced by a single GAL. For example, every combination of a ‘377 register and ‘244 output driver could be replaced by a single GAL, and the 16-bit pointer registers could be built from two GALs each, instead of four ‘569 counters. In fact, I could replace almost every chip with a GAL, except for the ALU, memory, field decoders, and a few drivers. In that case, looking at the schematic wouldn’t tell you anything at all about how the machine worked, and you’d have to read the GAL program data to understand anything.
I’m not sure I like that idea. Although it would still be a real hardware implementation, it would feel more like a simulation in many ways. It would also once again raise the question of why not just implement the bulk of the machine as a single FPGA?
For the sake of comparison, I estimate that using GALs as much as possible would reduce the component count to about 40, while not using GALs at all would increase the count to about 60.
Interrupts and Condition Codes
I’ve finished the design changes for interrupts. It worked out pretty much how I’d outlined earlier, using a GAL to implement the special OP register that has both a load enable and a clear input, since no such 7400-family part exists. I also added an interrupt enable bit that can be set and cleared from the microcode. When OP is about to be loaded with the next instruction, and interrupts are enabled, and an interrupt is pending, then the OP register will be synchronously cleared upon the next clock edge. This will force the machine to execute opcode 0, IRQ, which saves the processor state and jumps to an interrupt service routine.
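The behavior of the custom OP register at each clock edge can be summarized as a small next-state function. This is a sketch of the rule as described above, not the actual GAL equations, and the parameter names are invented for the example.

```python
IRQ_OPCODE = 0x00  # opcode 0 is the interrupt entry instruction

def next_op(op, fetched_opcode, load_op, irq_enabled, irq_pending):
    """Value of the OP register after the next clock edge."""
    if load_op and irq_enabled and irq_pending:
        return IRQ_OPCODE        # synchronous clear hijacks the instruction fetch
    if load_op:
        return fetched_opcode    # normal fetch
    return op                    # mid-instruction: hold the current opcode

print(hex(next_op(0x12, 0x34, load_op=True, irq_enabled=True, irq_pending=True)))   # 0x0
print(hex(next_op(0x12, 0x34, load_op=True, irq_enabled=False, irq_pending=True)))  # 0x34
```

Because the clear only takes effect when OP is about to be loaded, an interrupt can never cut an instruction off partway through.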
I also made a necessary hardware change to support multitasking: making the stack pointer loadable from microcode, so it can be saved and restored across processes. The machine now has 10 possible destinations for storing data, instead of the 8 it had previously. That change required me to spend my last unused control ROM bit to make the store destination a 4-bit field rather than 3 bits. If I end up needing another control ROM bit for something else later, it’s going to be a problem. I also had to add a second 74LS138 to decode the additional store destinations, so now the component count is up to 51.
Condition Code Optimization: I originally planned to use 8K control ROMs, with 13 address lines: 8 for the opcode, 4 for the phase, and 1 for the current condition code. That’s why the condition codes are stored in a shift register. If the desired condition code isn’t already in the least-significant bit of the shift register, then it must be right-shifted until it’s in the right spot. Since it takes one extra clock cycle for each shift of the condition codes, this design results in the somewhat bizarre property of testing the carry flag being faster than testing the negative flag, which itself is faster than the other flags.
It turns out that the most commonly-available (and cheapest) ROMs now are 64K or 128K. A 64K ROM has 16 address lines, providing enough width for the 8-bit opcode, 4-bit phase, and all 4 condition codes simultaneously. No more shifting! That simplifies the microcode, and also puts all the flags on even par with one another with regard to testing time.
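Here is a sketch of how the 16 address bits might be packed. The field widths come from the text above, but the ordering (opcode in the top byte, then phase, then flags) is an assumption made for the example, as is the function name.

```python
def control_rom_address(opcode, phase, flags):
    """Pack an 8-bit opcode, 4-bit phase, and 4 condition code flags into 16 bits.

    Assumed layout: bits 15-8 opcode, bits 7-4 phase, bits 3-0 flags.
    """
    if not (0 <= opcode <= 0xFF and 0 <= phase <= 0xF and 0 <= flags <= 0xF):
        raise ValueError("field out of range")
    return (opcode << 8) | (phase << 4) | flags

print(hex(control_rom_address(0x6D, 2, 0b0001)))  # 0x6d21
print(hex(control_rom_address(0xFF, 0xF, 0xF)))   # 0xffff, the last of 64K entries
```

With every flag present in the address, each microcode line simply occupies all the ROM entries whose flag bits match its condition prefix.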
Although bits are no longer shifted out of the condition codes in order to test them, the shift register is still needed for a different reason. When restoring the condition codes after an interrupt, the old values are shifted in, one at a time. Since the parallel load inputs of the register are connected directly to the ALU, there’s no other way to set the condition code values without adding extra multiplexing hardware.
Parts List
I made an accounting of all the components needed by my design as it stands today. It totals up to 50 components altogether, with a combined cost of $114.32.
| Count | Part | Purpose | Price Each | Price Total |
|---|---|---|---|---|
| ALU Module | ||||
| 2 | 74LS181 | core ALU functions | $2.58 | $5.16 |
| 1 | 22V10 GAL | computation of Z and V condition code flags | $4.49 | $4.49 |
| 1 | 74LS244 | bus driver for ALU output | $0.53 | $0.53 |
| Data Registers | ||||
| 3 | 74LS377 | A, X, and T registers | $0.43 | $1.29 |
| 7 | 74LS244 | bus drivers for ALU inputs | $0.53 | $3.71 |
| Address Registers | ||||
| 12 | 74LS569A | 4-bit counter, 4 each for 16-bit PC, AR, SP | $1.25 | $15.00 |
| Control System | ||||
| 3 | 29F010-70 | 128KByte 70ns Flash ROM for control ROMs | $6.09 | $18.27 |
| 1 | 74LS163A | 4-bit phase counter | $0.46 | $0.46 |
| 1 | 22V10 GAL | custom OP register | $4.49 | $4.49 |
| 1 | 74LS138 | 3-to-8 line decoder, for load enable signals | $0.52 | $0.52 |
| 2 | 74LS139 | dual 2-to-4 line decoder, for ALU input and address drive enable signals | $0.37 | $0.74 |
| 1 | 74LS194A | 4-bit shift register, for condition code flags | $1.23 | $1.23 |
| Memory System | ||||
| 1 | TI BQ4013MA-85 | 128KByte 85ns SRAM | $13.29 | $13.29 |
| 1 | 29F010-70 | 128KByte 70ns Flash ROM for OS | $6.09 | $6.09 |
| 1 | Futurlec USBMOD4 | USB I/O | $24.90 | $24.90 |
| 1 | 22V10 GAL | address decoding, generation of enable signals | $4.49 | $4.49 |
| 2 | 7-segment LED | data display | $0.95 | $1.90 |
| 1 | 74LS377 | data display register | $0.43 | $0.43 |
| 1 | 8-wide DIP switch | console input/debugging | $1.25 | $1.25 |
| 1 | momentary pushbutton | console input/debugging | $1.18 | $1.18 |
| 2 | 74LS244 | bus drivers for switch/button | $0.53 | $1.06 |
| 1 | 74LS245 | bidirectional bus driver to/from ALU data bus | $0.64 | $0.64 |
| Miscellaneous | ||||
| 1 | crystal oscillator | 1.0MHz clock oscillator | $1.49 | $1.49 |
| 1 | 74LS244 | clock signal buffer | $0.53 | $0.53 |
| 1 | momentary pushbutton | reset button | $1.18 | $1.18 |
| Grand Total | ||||
| 50 | | | | $114.32 |
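The totals are easy to cross-check by summing the rows. This sketch copies the counts and unit prices from the table, working in cents to avoid floating-point rounding; the list and variable names are only for illustration.

```python
# (count, unit price in cents, part), copied row by row from the table above
PARTS = [
    (2, 258, "74LS181"), (1, 449, "22V10 GAL"), (1, 53, "74LS244"),
    (3, 43, "74LS377"), (7, 53, "74LS244"),
    (12, 125, "74LS569A"),
    (3, 609, "29F010-70"), (1, 46, "74LS163A"), (1, 449, "22V10 GAL"),
    (1, 52, "74LS138"), (2, 37, "74LS139"), (1, 123, "74LS194A"),
    (1, 1329, "TI BQ4013MA-85"), (1, 609, "29F010-70"),
    (1, 2490, "Futurlec USBMOD4"), (1, 449, "22V10 GAL"),
    (2, 95, "7-segment LED"), (1, 43, "74LS377"),
    (1, 125, "8-wide DIP switch"), (1, 118, "momentary pushbutton"),
    (2, 53, "74LS244"), (1, 64, "74LS245"),
    (1, 149, "crystal oscillator"), (1, 53, "74LS244"),
    (1, 118, "momentary pushbutton"),
]

total_count = sum(count for count, _, _ in PARTS)
total_cents = sum(count * cents for count, cents, _ in PARTS)
print(total_count)                      # 50
print(f"${total_cents / 100:.2f}")      # $114.32
```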