Big Mess o’ Wires


A home-built CPU, and other messy electronics adventures

Archive for the 'Uncategorized' Category

Verilog Examples Synthesized

I decided to take the advice I gave myself in the comments of my previous post, and actually synthesize the three Verilog adder examples to see what would happen. I tried each of the examples under Quartus II Web Edition 9.0, set to optimize for area. I made a, b, c, d0, d1, and d2 each 8 bits wide.

1.  44 macrocells. Yes, it created 3 separate dedicated adders. The RTL showed three registers for d0, d1, d2, each with a mux leading into it, as well as the adders and a single decoder for state. The Technology Map Viewer showed 24 mc’s used by the registers, and 20 mc’s total by the three adders.

2.  This design is broken. Because in1 and in2 have no default values, the software inferred latches to hold them in any state other than s0, s1, or s2 (a hypothetical s3, say). After fixing that, the design consumed 52 macrocells. Again the RTL showed three registers for d0, d1, d2, each with a mux leading into it. It showed a single adder, with 2 cascaded muxes at each adder input. It also showed a decoder and a stray OR gate. The Technology Map Viewer showed 24 mc’s used by the registers, 27 by the single adder, and 1 more that I couldn’t exactly account for; part of one of the muxes, maybe.

3.  This design is also broken in the same way as #2. There’s also a copy-paste error in the enable signal in the s2 state: it sets loadEnableD1 where it should set loadEnableD2. After fixing those mistakes, the design consumed 52 macrocells. The RTL looked very similar to #2, and the Technology Map Viewer output was identical to #2.
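
For reference, the latch fix for #2 might look something like the following. This is a sketch rather than the exact code that was synthesized: the module and port names, the 2-bit state input, and the choice of a and b as the default adder inputs are all assumptions. The same idea applies to #3, where in1Select and in2Select need defaults.

```verilog
// Sketch of example 2 with the latch fix applied.
// Module/port names, the 2-bit state encoding, and the default adder
// inputs are assumptions made for illustration.
module adder_shared (
    input  wire       clk,
    input  wire [1:0] state,
    input  wire [7:0] a, b, c,
    output reg  [7:0] d0, d1, d2
);
    localparam s0 = 2'd0, s1 = 2'd1, s2 = 2'd2;

    reg  [7:0] in1, in2;
    reg  [7:0] d0_next, d1_next, d2_next;
    wire [7:0] mout = in1 + in2;   // the single shared adder

    always @* begin
        // defaults: registers hold their values, and the adder inputs
        // always get *some* value, so no latch is needed to remember them
        d0_next = d0;
        d1_next = d1;
        d2_next = d2;
        in1 = a;
        in2 = b;

        case (state)
            s0: begin in1 = a; in2 = b; d0_next = mout; end
            s1: begin in1 = b; in2 = c; d1_next = mout; end
            s2: begin in1 = a; in2 = c; d2_next = mout; end
            default: ;  // unused state: keep the defaults above
        endcase
    end

    always @(posedge clk) begin
        d0 <= d0_next;
        d1 <= d1_next;
        d2 <= d2_next;
    end
endmodule
```

Which default values are used shouldn't matter functionally, since the adder output is ignored in any state that doesn't load a register.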

There’s a lot to investigate further here, such as how the single adder in #2 could require 27 mc’s when the three adders in #1 only require a combined total of 20 mc’s. But the major conclusion is that all my attempts at “improving” the design only made the results worse.


Verilog Headaches

I’m having some trouble finding the best way to structure the Verilog code for this CPU. In particular, I’ve encountered one small headache and one larger one.

The small headache relates to the best way to describe complex combinatorial logic that doesn’t involve any registers. Consider some hypothetical logic that determines the value of the incrementPC and loadA control signals, based on the current state. One way to do this would be:

    wire incrementPC, loadA;
    assign incrementPC = (state == s1) || (state == s3) || (state == s4);
    assign loadA = (state == s0) || (state == s2) || (state == s4);

That works fine, and it’s pretty clear what it does. But for more complex designs, it’s clearer to use procedural assignment and a case statement, grouping all of the control signals for each state together:

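    reg incrementPC, loadA;
    always @* begin
        case (state)
            s0: begin
                incrementPC = 1'b0;
                loadA = 1'b1;
                // other control signals...
            end
            s1: begin
                incrementPC = 1'b1;
                loadA = 1'b0;
                // other control signals...
            end
            s2: begin
                incrementPC = 1'b0;
                loadA = 1'b1;
                // other control signals...
            end
            s3: begin
                incrementPC = 1'b1;
                loadA = 1'b0;
                // other control signals...
            end
            s4: begin
                incrementPC = 1'b1;
                loadA = 1'b1;
                // other control signals...
            end
            default: begin
                // any state not listed above: deassert everything,
                // so every signal is assigned on every path
                incrementPC = 1'b0;
                loadA = 1'b0;
            end
        endcase
    end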


The problem with this approach is visible in the first line: incrementPC and loadA must be declared as type “reg”, even though they are not registers. During synthesis, no register will be created as long as your code is correct, but Verilog demands that the target of a procedural assignment like this always be type “reg”. So reg does not always mean that something is a register. I find this very confusing and misleading, because it means you can’t just look at the Verilog code to see which signals are registers and which are purely combinatorial.
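
To make the distinction concrete, here’s a small sketch in which two signals are both declared reg, but only one of them becomes a flip-flop. The module and signal names are made up for illustration, and the _q suffix on the true register is just one possible naming convention, not anything Verilog requires.

```verilog
// Hypothetical example: both outputs are declared "reg",
// but only sum_q is synthesized as a register.
module reg_demo (
    input  wire       clk,
    input  wire [7:0] a,
    input  wire [7:0] b,
    output reg  [7:0] sum,     // "reg", but purely combinational
    output reg  [7:0] sum_q    // "reg", and a real register
);
    // Combinational: sum is assigned on every pass through the block,
    // so no storage is inferred.
    always @* begin
        sum = a + b;
    end

    // Sequential: the clock edge in the sensitivity list is what
    // makes the synthesis tool create flip-flops.
    always @(posedge clk) begin
        sum_q <= sum;
    end
endmodule
```

The only reliable way to tell the two apart is to look at the always block that assigns the signal, not at its declaration.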

My bigger problem is more subtle, and is about good HDL design practices rather than any quirk of the Verilog standard. I’m unsure how explicit I should be in defining the structure of the virtual hardware described by the Verilog code. At one extreme, I could write a high-level functional description of *what* the CPU does, ignoring *how* it does it, and leave the synthesis software to figure it out. Or at the other extreme, I could work out a block diagram of the CPU consisting of familiar real-world elements like registers, arithmetic unit, muxes, and buses, and then write Verilog code to describe these elements and how they’re all connected.

To help make this distinction clearer, here’s an example based on section 6.2.4 of the book FPGA Prototyping by Verilog Examples. Imagine a state-driven system that can add two input registers, and store the output in a third register. One way to describe this would be high-level, functional:


    always @(posedge clk) begin
        case (state)
            s0:
                d0 <= a + b;
            s1:
                d1 <= b + c;
            s2:
                d2 <= a + c;
        endcase
    end
    

Great, that’s compact and clear. But what does the datapath of this hardware look like? Is there one adder unit, or three? Who knows? It’s a black box, relying entirely on the synthesis software to do the right thing.
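
To hand this fragment to a synthesis tool, it needs a module around it. Here is a minimal sketch of what that could look like; the module name, the port list, the 2-bit state input, and the state encoding are assumptions for illustration, and the 8-bit widths match the ones used in the “Verilog Examples Synthesized” post.

```verilog
// Sketch of a complete module around the first (functional) example.
// Module/port names and the state encoding are assumptions.
module adder_functional (
    input  wire       clk,
    input  wire [1:0] state,
    input  wire [7:0] a, b, c,
    output reg  [7:0] d0, d1, d2
);
    localparam s0 = 2'd0, s1 = 2'd1, s2 = 2'd2;

    always @(posedge clk) begin
        case (state)
            s0: d0 <= a + b;
            s1: d1 <= b + c;
            s2: d2 <= a + c;
            default: ;  // any other state: all three registers hold
        endcase
    end
endmodule
```

Because this is a clocked block, a register that isn’t assigned in a given state simply keeps its value. That is ordinary flip-flop behavior, not an inferred latch.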

A second approach would be to explicitly define a single adder unit:


    assign mout = in1 + in2;
    
    always @* begin
        // default: maintain same values
        d0_next = d0;
        d1_next = d1;
        d2_next = d2;
        
        case (state)
            s0:
                begin
                    in1 = a;
                    in2 = b;
                    d0_next = mout;
                end
            s1:
                begin
                    in1 = b;
                    in2 = c;
                    d1_next = mout;
                end
            s2:
                begin
                    in1 = a;
                    in2 = c;
                    d2_next = mout;
                end
        endcase
    end
    
    always @(posedge clk) begin
        d0 <= d0_next;
        d1 <= d1_next;
        d2 <= d2_next;
    end
    

That makes the hardware design clearer, so it’s unambiguous that there’s only one adder. Is this second approach better than the first, then? Maybe, maybe not. If you’re optimizing for space, and don’t trust the synthesis software to be as smart as you are, then the second example is probably better. But if you’re optimizing for speed, having three separate adders (or at least the possibility of three) may actually be better.

Even this second design is somewhat ambiguous. Presumably there are some muxes at the input to the adder, and a mux or load enable at the input to each D register too. But the Verilog code leaves this all implied and unspecified. Here’s a third example that spells everything out in full detail:


    // the selects and load enables are assigned in the always block below,
    // so they must be declared reg rather than wire
    reg [1:0] in1Select, in2Select;
    assign in1 = (in1Select == 2'b00) ? a :
        (in1Select == 2'b01) ? b :
        (in1Select == 2'b10) ? c :
        d;
    assign in2 = (in2Select == 2'b00) ? a :
        (in2Select == 2'b01) ? b :
        (in2Select == 2'b10) ? c :
        d;
    
    assign mout = in1 + in2;
    
    reg loadEnableD0;
    reg loadEnableD1;
    reg loadEnableD2;
    
    always @* begin
        // default: disable all loads
        loadEnableD0 = 1'b0;
        loadEnableD1 = 1'b0;
        loadEnableD2 = 1'b0;
    
        case (state)
            s0:
                begin
                    in1Select = 2'b00;
                    in2Select = 2'b01;
                    loadEnableD0 = 1'b1;
                end
            s1:
                begin
                    in1Select = 2'b01;
                    in2Select = 2'b10;
                    loadEnableD1 = 1'b1;
                end
            s2:
                begin
                    in1Select = 2'b00;
                    in2Select = 2'b10;
                    loadEnableD1 = 1'b1;
                end
        endcase
    end
    
    always @(posedge clk) begin
        if (loadEnableD0)
            d0 <= mout;
        if (loadEnableD1)
            d1 <= mout;
        if (loadEnableD2)
            d2 <= mout;
    end
    

This approach makes it very clear what’s happening in terms of the hardware, and you could build an equivalent physical circuit from 7400 parts. Is this better or worse than the other two approaches? I find it better in terms of understanding what will be synthesized, but it’s worse in terms of length. I also suspect that by specifying all the details in this way, it may be over-constraining the synthesis software, preventing it from using some clever optimizations to pack the same amount of logic into less space.

I find myself going around in circles with variations of these three approaches, unable to really get started with the actual CPU design work.


RC Servo Signal Decoder, Part 2

It works! I’ve continued poking away at this circuit to decode an RC airplane servo signal and trigger a camera shutter during flight, and I’m happy to report success!

Once I switched to using the CD4013 flip-flop with a positive logic clear input instead of negative logic, it was a piece of cake. I have to say, living just a mile from one of the USA’s largest electronics dealers (Jameco) is pretty sweet. I can hit their web site and place an order for practically any obscure electronic component I can think of, then cruise down to their offices and pick it up from the will-call desk an hour later. Nice!

I rebuilt the decoder circuit that I discussed last time, soldering everything together “dead bug” style. This was necessary in order to keep everything as small as possible, so I could fit it inside the camera body.  I forgot to take a photo before I closed everything up, but it looks very similar to this example from laureanno.com:

When I first connected the servo, decoder, and camera, it didn’t work. Nothing happened when I toggled the switch on my RC transmitter. Setting up the oscilloscope again, I was able to see that the reference pulse width generated by the RC circuit I’d built was about twice as long as it should have been. I’m not sure how that happened, even with 20% tolerance components, but I was able to quickly swap in a different value resistor, and get it working perfectly. Then with a bit of creative packing, I managed to cram it all back inside the camera body.

Today during my lunch hour, I was able to try it out for the first time. The shutter trigger worked fabulously! I wish I could say the same for the quality of the pictures, but unfortunately the focus wasn’t set quite right, and the photos are a little blurry. They’re still pretty fun to look at though. I was flying next to the headquarters of Oracle Corporation in Redwood City, California. Those are the clustered cylinder-shaped mirrored buildings you see in the photos. The plane looks like it was a little higher than the tallest building, which I think is 20 stories tall. See if you can find me in some of the photos!


February 27 Edit: I corrected the focus problem, and tried again. Unfortunately I got the propeller in some of the shots, and this new set wasn’t from as high an altitude. But I did get some great shots of the bay, an aerial self-portrait, and a flock of Canada geese.

   

   


SDRAM

I think I’m making life more difficult than it needs to be, trying to get this DDR2 SDRAM interface to work. It’s not that the logical interface is so complicated, really… you set your row and column addresses, do a burst transaction, check for refresh… not trivial, but not rocket science either. And the Xilinx MIG or other vendor-specific wizard will generate a memory interface for you to use as a starting point.
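
As a rough illustration of the “set your row and column addresses” part, a controller typically splits a flat address into bank, row, and column fields. The sketch below assumes a hypothetical part with 4 banks, 8192 rows, and 1024 columns; those widths and all the names are made up for illustration, not taken from any datasheet or from the MIG output.

```verilog
// Hypothetical address split for an SDRAM with 4 banks, 8192 rows,
// and 1024 columns. The field widths are assumptions.
module sdram_addr_split (
    input  wire [24:0] addr,   // flat word address from the rest of the design
    output wire [1:0]  bank,   // selects one of 4 banks
    output wire [12:0] row,    // presented with the ACTIVE command
    output wire [9:0]  col     // presented with the READ or WRITE command
);
    // top 2 bits are the bank, next 13 the row, low 10 the column
    assign {bank, row, col} = addr;
endmodule
```

Putting the column bits at the bottom means consecutive addresses fall in the same row, which is what makes burst transfers worthwhile.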

No, what seems to be difficult is that the margin for error with DDR2 SDRAM is much smaller than with SRAM or plain (single data rate) SDRAM. The voltages are lower, the timing tolerances are tighter, and much more care must be given to compensating for things like possible skew, process variation between different FPGAs, power supply tolerances, and a host of other worries.

I’ve been reading a LOT on this topic in the past couple of weeks, and I’ve been struck by one thing. Except for my Xilinx Spartan 3A starter board, and Altera’s comparable Cyclone III board, I’ve seen zero boards that use DDR or DDR2 memory. They all use plain SDR SDRAM, also known as PC100 or PC133 depending on the speed. I looked at boards in the $150 to $300 range from Opal Kelly, KNJN, XESS, and others, and they all use plain SDR SDRAM. Maybe I should take a hint?

Meanwhile, I’ve been digesting as much FPGA documentation as I can. So far I’ve chewed through about 1500 pages of the Xilinx MIG user manual, Spartan 3 series user manual, and Spartan 3A addendum, and I’m midway through the comprehensive book FPGA Prototyping by Verilog Examples: Xilinx Spartan-3 Version. It’s the best “getting started” reference I’ve seen yet, with good coverage of Verilog, FPGA hardware, and the Xilinx software tools.


Small Progress

Finally, some small progress on the memory interface. After banging my head every which way against the Xilinx tools, and reading everything I could find on the subject, I came across Leo Silvestri’s page on modifying the Xilinx MIG memory controller design for a Spartan 3E board. It’s for a different kit and an older version of the software, but with his help I was finally able to build the reference design and testbench for the Spartan 3A board, program it to the FPGA, and see the LED that indicates success. It’s not very exciting, but it’s progress.

I still can’t believe all the steps I went through, and the whole process has made me quite bitter about Xilinx’s software tools. I’m sure it would be easier if I had better general knowledge of this field, but the last few weeks of this project have been like being lost at sea, and totally disoriented. It still feels more like a series of disconnected guesses than a genuine understanding, but here’s what I’ve managed to piece together on the topic of using the DDR2 SDRAM that’s on the Spartan 3A kit board.

  1. The Xilinx MIG can’t be used to generate a new memory controller design for the Spartan 3A board. This is because the way the SDRAM on the board is connected to the FPGA pins violates some of the MIG design rules. The only solution is to use the pre-built Spartan 3A board reference controller design, which then locks you into a specific burst length and CAS latency, or to hand-modify the code generated by the MIG, which is way beyond the skills of a noob like me.
  2. Using the newest version of the Xilinx ISE and MIG, attempting to add the Spartan 3A reference design to your project will cause a crash. No answer from Xilinx support on this.
  3. You can also get the Spartan 3A reference design as a zip file. But if you unzip it, add all the files to a new ISE project, and try to build it, you’ll get lots of errors about nonexistent nets that I couldn’t resolve.
  4. There’s also a batch file in the zip file that will create a new ISE project for you. But try to build it, and you’ll be told that the design requires a license for ChipScopePro, Xilinx’s software logic analyzer. I found a discussion of this on the Xilinx forums, but no resolution other than to create a new controller design that omits ChipScopePro support, which is impossible for this board due to issue number 1 above.
  5. What finally worked was to hand-edit the reference design, deleting parts of it semi-randomly until the ChipScopePro error disappeared. It turned out that required removing three modules called icon, ila, and vio, none of which seemed obviously related to debugging to me.

So there you have it. The next step will be to begin to actually use this interface for something more interesting than lighting up an LED. I’m just now realizing that the interface created by the MIG is just the first, small step towards what the 3DGT memory controller must eventually become. It’s not enough to simply have an interface that permits reading and writing. To achieve half-way decent performance, much care will be required to manage and coordinate those reads and writes, minimizing waiting and wasted time, and maximizing throughput. And to top it off, it’s going to need a bus master to arbitrate memory access between the display circuit, pixel processors, vertex processors, and any other consumers of memory. All this is a substantial project in itself, that will need to be at least partially completed before any real progress can begin on the 3D part of 3DGT. Looks like a long, slow climb, but I’m moving ahead.
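
The arbitration piece, at its very simplest, could be a fixed-priority arbiter like the sketch below. This is only an illustration of the idea, not a design for 3DGT: the client names, the one-request-line-per-client interface, and the priority order are all assumptions.

```verilog
// Hypothetical fixed-priority arbiter for three memory clients.
// Client names and priority order are assumptions for illustration.
module mem_arbiter (
    input  wire reqDisplay,
    input  wire reqPixel,
    input  wire reqVertex,
    output reg  grantDisplay,
    output reg  grantPixel,
    output reg  grantVertex
);
    always @* begin
        // default: nobody is granted access
        grantDisplay = 1'b0;
        grantPixel   = 1'b0;
        grantVertex  = 1'b0;

        // highest priority first; at most one grant is ever asserted
        if (reqDisplay)
            grantDisplay = 1'b1;
        else if (reqPixel)
            grantPixel = 1'b1;
        else if (reqVertex)
            grantVertex = 1'b1;
    end
endmodule
```

A real arbiter would also need to hold a grant for the length of a burst and keep low-priority clients from starving, which is where most of the actual work lies.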
