Archive for April, 2010
Experimental Hardware
With the design of the Tiny CPU core more or less finished, I’ve started thinking about how to build a small computer around it. My goal is to create a simple machine with a keyboard input, a 4-line LCD output, and a few buttons and LEDs for debugging. Everything should be mounted on a custom PCB that I’ll design as well.

I’ve purchased an Altera USB-Blaster, and a CPLD prototyping board containing the same CPLD model that I plan to use. This will let me see exactly how someone else built a working system around this device, and give me something to compare to when my own machine inevitably fails to even turn on after it’s built. I can also add a few components to the prototyping board, to try out a scaled-back version of the computer design before I commit to manufacturing my custom PCB.
The documentation with this board was pretty sparse, and the USB-Blaster clone had none at all, but after a little work I managed to figure it out. I’ve been able to reprogram the CPLD on the board, and do a few basic LED blinking types of tests. If I get motivated, I may try to fit a RAM, ROM, and a few other parts in that empty area on the right, and see what I can do.
For the ultimate Tiny CPU PCB, even for a “simple” system, there are going to be quite a lot of components. Assuming I use the free version of Eagle for the PCB layout, with its 10cm x 8cm area limit, I may need to stack two or even three boards to get everything in. The semi-final parts list is:
- CPLD #1 - for the Tiny CPU
- CPLD #2 - for address decoding, LCD interface, keyboard interface, etc
- ROM (in a ZIF socket, or maybe a JTAG-programmable ROM)
- RAM
- clock oscillator
- DC power jack
- voltage regulator
- reverse voltage protection diode
- capacitors for voltage regulator
- PS2 keyboard jack
- pull-up resistor for keyboard clock
- Schottky inverter for keyboard clock, to address very slow slew rate (based on BMOW experience)
- LCD connector header
- resistor for LCD backlight
- variable resistor for LCD contrast
- piezo beeper
- variable resistor for volume
- transistor for piezo power
- 7-segment LED
- current-limiting resistors for 7-segment LED
- reset button
- pull-up resistor for reset button
- power LED
- current-limiting resistor for power LED
- on/off switch
- rotary encoder
- push button
- ISP/JTAG header (connect both CPLDs into a JTAG chain)
- RC reset circuit
- debug headers
That’s a lot of stuff to fit into 80 cm^2. For comparison, the board in the photo above is about 126 cm^2, but contains less hardware than what I think I’ll need.
Variable Size Instructions
My analysis of the advantages of fixed-size instructions proved to be badly flawed. The improvements I saw when switching to a 16-bit fixed instruction size were not what I originally thought: the size and speed gains came from the reduction in address size, which reduced the size of many instructions, and sped up their execution. The gains had nothing to do with the fact that all instructions were now a fixed size. In fact, going to a fixed size made matters worse for instructions like push and increment, which were now larger and slower.
Fortunately, this was almost trivially easy to fix. With just a few lines changed in the assembler and Verilog source, I was able to restore all the implicit instructions to a single byte, while keeping address-oriented instructions at two bytes (with an embedded 10 bit address). That provides the best of both worlds:
| | Variable Size, 16-bit addr | Fixed Size, 10-bit addr | Variable Size, 10-bit addr |
|---|---|---|---|
| macrocells | 119 | 112 | 116 |
| verification program size (bytes) | 2055 | 1890 | 1629 |
| verification program execution time (clocks) | 835 | 574 | 552 |
The gains aren’t amazing, but every little bit helps. The space savings are especially nice, since with the 10-bit address space, I’ll need to make the most of every byte.
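To make the size difference concrete, here is a rough Python sketch that counts bytes for a short instruction sequence under the two 10-bit schemes. The mnemonics, the sample program, and the split into "implicit" versus "address-oriented" instructions are illustrative assumptions, not the actual Tiny CPU opcode table.

```python
# Hypothetical set of implicit (operand-less) mnemonics; the real
# Tiny CPU opcode table may differ.
IMPLICIT = {"PHA", "PLA", "PHX", "PLX", "INX", "DEX", "RTS"}

def fixed_size(program):
    """Every instruction occupies 2 bytes."""
    return 2 * len(program)

def variable_size(program):
    """Implicit instructions take 1 byte; all others take 2 bytes."""
    return sum(1 if mnemonic in IMPLICIT else 2 for mnemonic, _ in program)

# Illustrative program: (mnemonic, operand) pairs.
program = [
    ("LDA", 0x010), ("PHA", None), ("INX", None),
    ("ADD", 0x011), ("PLA", None), ("STA", 0x012), ("RTS", None),
]

print(fixed_size(program), variable_size(program))
```

The more stack and increment/decrement instructions a program uses, the bigger the saving from the variable-size encoding.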
Tiny CPU Architecture
As promised, here’s the Tiny CPU architecture diagram. SP is the stack pointer, and is 6 bits, providing a 64-entry stack. EA is the effective address, used for data load/store from absolute or computed addresses. PC is the program counter. The accumulator A and index register X are the only data registers. The datapath is controlled by a state machine and combinatorial logic, using the current opcode, state, and arithmetic/logic flags as input.
The diagram glosses over a few details, such as how the 8-bit data bus is connected to 10-bit address registers. Where busses and registers of differing sizes are connected, additional logic selects the low or high byte as needed.

Tiny Asm
I’ve finished writing the Tiny CPU assembler, and it works. It took about four hours across two nights to get something with basic functionality. The curious can take a look at the assembler source code for details.
I don’t have much experience with writing these kinds of tools, so my parser is a little ugly. It goes line by line, ignoring whitespace and comments, until it finds a line beginning with a token. This token must either be an instruction mnemonic, or a label. If it’s a mnemonic, a few additional checks determine the operand and address mode, and then a table lookup determines the opcode value for that instruction and address mode combination. If it’s a label, its address is stored, and all previously-pending references to that label are resolved. Anonymous forward and backward labels are also supported.
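The line-by-line approach with pending label references can be sketched in a few lines of Python. This is not the actual assembler source: the opcode values, the `label:` syntax on its own line, the `;` comment character, and the 16-bit word output (opcode in the upper 6 bits, operand in the lower 10) are all assumptions made for illustration. Anonymous labels and range checking are left out.

```python
# Hypothetical opcode table keyed by (mnemonic, address mode).
OPCODES = {
    ("LDA", "immediate"): 0x01,
    ("LDA", "absolute"): 0x02,
    ("JMP", "absolute"): 0x03,
    ("PHA", "implicit"): 0x04,
}

def parse_number(text):
    """Parse '$7F' as hex or '127' as decimal."""
    return int(text[1:], 16) if text.startswith("$") else int(text)

def assemble(source):
    words = []    # assembled 16-bit instruction words
    labels = {}   # label name -> byte address
    pending = {}  # label name -> indexes of words awaiting that address

    for raw in source.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        token, _, rest = line.partition(" ")
        if token.endswith(":"):              # label definition
            name = token[:-1]
            labels[name] = 2 * len(words)
            for index in pending.pop(name, []):
                words[index] |= labels[name]  # resolve pending references
            continue
        operand = rest.strip()
        if not operand:
            mode, value = "implicit", 0
        elif operand.startswith("#"):
            mode, value = "immediate", parse_number(operand[1:])
        else:
            mode = "absolute"
            if operand[0] == "$" or operand[0].isdigit():
                value = parse_number(operand)
            elif operand in labels:
                value = labels[operand]       # backward reference
            else:
                pending.setdefault(operand, []).append(len(words))
                value = 0                     # patched when label appears
        words.append((OPCODES[(token.upper(), mode)] << 10) | value)

    if pending:
        raise ValueError(f"undefined labels: {sorted(pending)}")
    return words

example = """
start:
    LDA #$05   ; load a constant
    JMP end    ; forward reference
    PHA
end:
    JMP start  ; backward reference
"""
print([hex(word) for word in assemble(example)])
```

Patching forward references as soon as the label is seen keeps this to a single pass over the source, at the cost of remembering where each unresolved operand lives.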
It would be nice to add features like named constants, conditional compilation, and macros. The assembler also lacks directives for setting the assembly address, or embedded constant data like tables and strings. I’ll add some of those features later, as the need arises.
Fixed Size Instructions
I’ve finished my experiment with fixed-size instructions for Tiny CPU, and the results are encouraging. I did a straightforward conversion to a 16-bit instruction size, with the opcode in the upper 6 bits and the address (if any) in the lower 10. Here’s how it compares to the original, variable-size instruction version:
| | Variable Size | Fixed Size | Percent Reduction |
|---|---|---|---|
| macrocells | 119 | 112 | 6 |
| verification program size (bytes) | 2055 | 1890 | 8 |
| verification program execution time (clocks) | 835 | 574 | 31 |
So it’s an improvement across the board. The only drawback is that increasing the address size to something larger than 10 would be fairly difficult. It’s technically possible to fit all the opcodes into 5 bits (there are 31 unique opcodes), allowing for 11 bits of address. However, it would be a poor encoding that would probably require the decoding logic to be substantially more complex, increasing the macrocell count.
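For reference, the 16-bit layout described above (opcode in the upper 6 bits, address in the lower 10) can be packed and unpacked like this. It is a minimal Python sketch; the function names and the sample opcode value are made up for illustration.

```python
OPCODE_BITS = 6
ADDRESS_BITS = 10

def encode(opcode, address=0):
    """Pack an opcode and address into one 16-bit instruction word."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in 6 bits")
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in 10 bits")
    return (opcode << ADDRESS_BITS) | address

def decode(word):
    """Split a 16-bit instruction word back into (opcode, address)."""
    return word >> ADDRESS_BITS, word & ((1 << ADDRESS_BITS) - 1)

word = encode(0x15, 0x2A7)   # 0x15 is an arbitrary example opcode
print(hex(word), decode(word))
print(1 << ADDRESS_BITS, "addressable bytes")
```

Changing the split to 5 and 11 bits would double the addressable range to 2048 bytes, which is the trade-off mentioned above.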
I wrote a tool to convert variable-sized program binaries into fixed-size, but it’s ugly and brittle. My next step, therefore, will be to write a custom Tiny Assembler for my Tiny CPU.
Tiny CPU
I’ve got a working CPU! You can grab the Verilog source and a testbench here. The instruction set and addressing modes are as I described them in my previous post, except that I shrank the stack pointer to 6 bits (64 byte stack), and was able to add the missing “branch if carry/zero not set” instructions. The CPU has a 10-bit (1K) address space, and fits in 119 macrocells of an Altera EPM7128S, when set to optimize for area and with Parallel Expander Chain Length set to 0. Sometime soon, I’ll make some nice datapath diagrams and post them.
In addition to the small address space and limited instruction set, there are a few ugly elements of the design that were necessary to make it fit the device. The absence of a Compare X instruction is glaring, but is impossible to include without significant changes. There’s also a wasted state after many of the math/logic ops, in which the Zero flag is redundantly set. This makes those instructions take one clock cycle longer than actually necessary, but was necessary to avoid more complicated state transition logic. The Zero flag handling in general is definitely awkward.
So what’s next? I hope to shrink the design slightly further, by simplifying logic, using more Altera primitives, or by using a smarter instruction set encoding that uses instruction bits directly as control signals. If I can save a few more macrocells, I hope to increase the address space to 11 or 12 bits (2K or 4K), because 1K feels very limited.
Beyond that, I’m considering a few larger changes:
Fixed Instruction Size
The current design has instructions that are one, two, or three bytes in size. I’m considering moving to a fixed instruction size of 2 bytes: 6 bits for the opcode, and 10 bits for an address or constant value. This would simplify the state machine logic, eliminating extra states needed to perform operand fetches, and reducing the logic resources needed to implement the state machine. It would probably also result in slightly more compact code, making more efficient use of the limited address space.
The downside of a fixed instruction size is that it would also fix the address size at 10 bits (or maybe 11 if I’m really clever), with no hope of increasing it. It would also require an opcode register, to hold the first 8 bits of the instruction while the second 8 bits are fetched. And it would force me to throw out the bastardized 6502 assembler I’ve been using, and create some new software tools.
Larger Bus Size
If I switch to a fixed 16-bit instruction size, it may also be worthwhile to switch to a 16-bit data bus. This would permit loading an entire instruction in one clock cycle, eliminating the need for the opcode register, and further simplifying the state machine. The downside is that I’d then need extra logic to make the memory byte-addressable for load/store of data, or else increase the data word size to 16 bits and forget about byte addressing entirely. A larger data bus output mux would also be needed. And of course, two parallel 8-bit RAMs would be needed on the CPU board.
Harvard Architecture
Not a Colonial Period building at Harvard University, but a computer with separate address spaces for programs and data. This would permit a 16-bit interface to program memory, and 8-bit interface to data memory, which is seemingly the best of both worlds. The program memory address bus wouldn’t need a mux, because it would always be driven by the program counter. Separate program and data memories would also allow for faster CPU operation, by enabling instruction fetches and data access to happen in parallel. The total amount of addressable memory would also increase, because the program and data memories could each be 1K in size, for 2K total.
Separate program and data memories mean the CPU board would need two 8-bit ROMs as well as an 8-bit RAM, further increasing the component count.
The major drawback of the Harvard Architecture is that working with large data constants like strings and tables is cumbersome, because they must be loaded or copied a byte at a time using Load Immediate instructions. The indexed address instructions typically used to access such structures operate on the data memory. The standard solution to this problem is to use a Modified Harvard Architecture, adding new instructions like Load Constant Indexed to fetch values from program memory. Unfortunately that negates some of the original advantages, requiring an additional address register for program memory, an address bus mux, and additional complexity in the state machine.
Back-Annotation
I think I’m close to having this CPU fit the 128 macrocell CPLD, but I’m running into some problems with the final details. Soon I’m going to work on a physical board layout for this CPU, at which point all the pin assignments need to be finalized. Any further design tweaks need to maintain those pin assignments, or else they could not use the same board.
Altera’s Quartus II software meets this need with a tool called back-annotation. Once you’ve synthesized the Verilog design, and computed a fit for the particular device you’re using, you can back-annotate the original design with the pin placements determined by the fitter. Then if you later change the design, the software will attempt to keep those same placements, or report failure if it can’t.
That sounds great, except when I use back-annotation, it always causes fitting to fail. Starting with no constraints, I can synthesize and fit my design successfully (currently 117 macrocells), then back-annotate the device and pin assignments, and run synthesize and fit again. Since I made no changes to the design, the result the second time should be identical to the first, and should match the back-annotation constraints perfectly. But what actually happens is that the second run of synthesize and fit fails, complaining that it’s unable to pack the cells into LABs successfully. This is proving very frustrating, since it’s a showstopper problem if I can’t find a solution.
Success?
I think I’ve succeeded at cramming a decently useful CPU into this 128 macrocell device! I threw away my first design and started over from scratch, abandoning almost all the 6502-related elements, and working closer to a minimal instruction set design. I also changed the Verilog structure to explicitly specify the internal datapaths between CPU registers, rather than the more behavior-oriented design I tried originally.
The major caveat is that none of this is tested yet. No testing whatsoever. It’s virtually certain that the design contains many mistakes, some of which may cause it to fit into fewer macrocells than it would otherwise. As of now, the design occupies 121 of 128 macrocells. Due to routability constraints, however, I can’t really make use of the last seven.
My primary goal was to make a working 8-bit CPU, with a stack, index register, and indexed addressing mode. I’ve accomplished that, but a lot of other nice stuff had to be tossed out. Here are the features that the CPLD CPU supports, and doesn’t support:
Addressing Modes
Supported: immediate, absolute, absolute with index
Not supported: indirect, indirect with index, read-modify-write
Indirect addressing would be nice, but I’m not going to lose sleep over its absence. Self-modifying code can be used as a sort of poor-man’s indirect addressing where needed.
Program Flow
Supported: jump, branch if carry set, branch if zero flag set, call, return
Not Supported: indirect jump, branch if carry not set, branch if zero flag not set, negative flag, overflow flag
The overflow flag is rarely used, and a subroutine can be written to do the same thing if needed. The negative flag would be more useful, but not critical. Branch on flag not set would be very useful, but can always be avoided by modifying the branch test.
Math and Logical Operations
Supported: add A, sub A, compare A, nor A, load/store/push/pull A, load/store/push/pull X, increment/decrement X
Not supported: add/subtract with carry-in, compare X, and, or, xor, shift, rotate, test, register-to-register transfers, direct set/clear of flags
I decided the lack of a carry-in wasn’t a big deal. You can always test the carry out, and manually add 1 to the next stage of a multi-byte addition or subtraction.
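As a sanity check on that claim, here is a small Python model of a 16-bit addition built only from 8-bit adds with no carry-in. The helper names are hypothetical, and the `if carry` line stands in for a "branch if carry set" followed by an add of 1.

```python
def add8(a, b):
    """8-bit add with no carry-in; returns (sum, carry_out)."""
    total = a + b
    return total & 0xFF, total >> 8

def add16(x, y):
    """16-bit add built from 8-bit adds, propagating carry by hand."""
    low, carry = add8(x & 0xFF, y & 0xFF)
    high, _ = add8(x >> 8, y >> 8)
    if carry:                    # the "branch if carry set" test
        high, _ = add8(high, 1)  # manually add 1 to the next stage
    return (high << 8) | low

print(hex(add16(0x12FF, 0x0001)))
```

The cost is an extra branch and add per byte, which seems acceptable for a machine this small.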
Having NOR as the only logical operation seems strange, but is surprisingly powerful:
not A = A nor 0
A or B = (A nor B) nor 0
A and B = (A nor 0) nor (B nor 0)
For the common task of AND-ing a number with an immediate value to check if a particular bit is set, this can be done with NOR in a single step if you use the bitwise-complement of the immediate value, and also reverse the sense of the test:
NOR #$7F
BZ highBitIsOne
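These identities are easy to check exhaustively. The following Python sketch models NOR on 8-bit values and verifies NOT, OR, AND, and the high-bit test against the ordinary bitwise operators; the function names are just for illustration.

```python
def nor(a, b):
    """Bitwise NOR of two 8-bit values."""
    return ~(a | b) & 0xFF

def not_(a):
    return nor(a, 0)

def or_(a, b):
    return nor(nor(a, b), 0)

def and_(a, b):
    return nor(nor(a, 0), nor(b, 0))

def high_bit_is_one(a):
    """NOR #$7F leaves zero exactly when bit 7 of A is set."""
    return nor(a, 0x7F) == 0

# Check every 8-bit input combination.
for a in range(256):
    assert not_(a) == a ^ 0xFF
    assert high_bit_is_one(a) == bool(a & 0x80)
    for b in range(256):
        assert or_(a, b) == a | b
        assert and_(a, b) == a & b

print("all NOR identities hold")
```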
Other missing operations can be easily simulated:
clear carry = ADD #0
set carry = SUB #0
left shift = STA temp, ADD temp
transfer A to X = PHA, PLX
transfer X to A = PHX, PLA
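Here is a toy Python model of those substitutions. It assumes the 6502-style convention that SUB sets carry when no borrow occurs, which is what "set carry = SUB #0" implies; the class and method names are hypothetical and this is not the real datapath.

```python
class TinyModel:
    """Toy model of A, X, carry, and the stack; not the real datapath."""

    def __init__(self):
        self.a = self.x = self.carry = self.temp = 0
        self.stack = []

    def add(self, value):
        total = self.a + value
        self.a, self.carry = total & 0xFF, total >> 8

    def sub(self, value):
        # Assumed 6502-style convention: carry set means "no borrow".
        self.carry = int(self.a >= value)
        self.a = (self.a - value) & 0xFF

    def clear_carry(self):      # ADD #0
        self.add(0)

    def set_carry(self):        # SUB #0
        self.sub(0)

    def shift_left(self):       # STA temp, ADD temp
        self.temp = self.a
        self.add(self.temp)

    def transfer_a_to_x(self):  # PHA, PLX
        self.stack.append(self.a)
        self.x = self.stack.pop()

cpu = TinyModel()
cpu.a = 0x81
cpu.set_carry()
cpu.clear_carry()
cpu.shift_left()
cpu.transfer_a_to_x()
print(hex(cpu.a), hex(cpu.x), cpu.carry)
```

Note that the left shift leaves the old bit 7 in the carry flag, just as a real shift instruction would.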
Of all the missing functions, the only ones I really wish I could squeeze in are the branch if not set, and compare X. Not having compare X means that any loop over X has to start at some number and go down to zero, instead of start at zero and go up. Most of the time that’s probably OK. In an emergency, compare X could be simulated as:
PHA, PHX, PLA, CMP, PLA
But that’s pretty ugly, and it also assumes that PLA doesn’t modify the flags (still undecided).
I will post the Verilog code once it’s tidied up a bit more, and I’m confident I’ve squeezed it as much as possible.
Synthesis Puzzles
The more I try to understand the Verilog synthesis tool behavior, the less I understand it. I decided to go back to square 1 with my design, and start by implementing a basic 8-bit counter that can be reset, loaded, incremented, or decremented. Here’s the source:
module counter
    (input clk,
     input reset,
     input [3:0] state,
     input [7:0] d,
     output reg [7:0] q);

    localparam [3:0] load = 4'b0000,
                     inc  = 4'b0101,
                     dec  = 4'b1111;

    always @(posedge clk or negedge reset) begin
        if (!reset)
            q <= 0;
        else if (state == load)
            q <= d;
        else if (state == inc)
            q <= q + 1'b1;
        else if (state == dec)
            q <= q - 1'b1;
    end
endmodule
If I synthesize that using Altera Quartus II v9.0, with a Max 7000S series target device (EPM7128SLC84-15), and the optimization set for minimum area, it uses 10 macrocells. 10 macrocells for an 8-bit counter. The “Timing Closure Floorplan” view will show which macrocells were used, and the equations implemented in each one. It turns out that the first 6 bits of the counter each fit in a single macrocell, and use a T-type flip-flop with four product terms each, which follow a recognizable pattern. But for unknown reasons, the software implements the last two bits completely differently, using D-type flip-flops, an extra macrocell per bit for additional product terms, and one shared expander term. If you’ve got Quartus II Web Pack installed, you can easily confirm this yourself.
I couldn’t understand why the software didn’t just follow the pattern of the first 6 bits for the 7th and 8th bits too. There didn’t seem to be any limit on number of product terms or inputs that it would run into, as far as I could tell. I decided to try it, by using Altera primitives to explicitly specify a T-type flip-flop for all 8 bits, and listing out the exact logic equations for each bit. You can view the Verilog code here: counter_v2.v
The new version worked fine. In fact, it worked better than fine. It fit in 8 macrocells instead of 10, and used no shared expanders or other magic. And it was not only smaller, it was also faster. The software computed a maximum speed of 76.92MHz, compared to only 45.45MHz for the first version.
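To illustrate the kind of per-bit toggle logic a T-type flip-flop counter needs, here is a Python model of one clock edge. It is not the contents of counter_v2.v, which isn't reproduced here; the toggle rules below are the textbook ones for an up/down counter, assumed for illustration, and the state codes are copied from the Verilog above.

```python
LOAD, INC, DEC = 0b0000, 0b0101, 0b1111

def next_q(q, state, d):
    """One clock edge of an 8-bit counter built from T flip-flops."""
    result = q
    for bit in range(8):
        mask = (1 << bit) - 1
        lower = q & mask                    # bits below this one
        if state == LOAD:
            toggle = ((q ^ d) >> bit) & 1   # toggle where q and d differ
        elif state == INC:
            toggle = lower == mask          # all lower bits are 1
        elif state == DEC:
            toggle = lower == 0             # all lower bits are 0
        else:
            toggle = 0                      # any other state holds q
        if toggle:
            result ^= 1 << bit
    return result

# Compare against plain arithmetic for every starting value.
for q in range(256):
    assert next_q(q, INC, 0) == (q + 1) & 0xFF
    assert next_q(q, DEC, 0) == (q - 1) & 0xFF
    assert next_q(q, LOAD, 0xA5) == 0xA5
    assert next_q(q, 0b0001, 0xA5) == q

print("toggle equations match the behavioral counter")
```

Each bit's toggle condition is a single AND of the lower bits (or their complements) with the decoded state, which is consistent with each bit fitting in one macrocell.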