VHDL, and End-of-Semester Course Review
CS 641 Lecture, Dr. Lawlor
So, you want to build some hardware. Back in the day (that is, in
1985), you would start with a pile of logic chips, each with as many as
four logic gates on it, and solder together a big ball of wires (known as "rat's nest" circuit layout).
Unfortunately, this doesn't scale. Past a certain complexity
(say, a thousand pins), you basically can't fabricate a working circuit
by hand in a reasonable length of time. Figuring out how and
where to route the wires, and actually routing the wires, becomes
virtually impossible.
So since the 1960's, people have turned to computers to design the
circuits used for computers, which is to say Computer Aided Circuit
Design or the modern term Electronic Design Automation. (You know that Terminator quote "... at 2am SkyNet began learning at a geometric rate"? Yeah, that's a restatement of Moore's Law.)
VHDL Digital Circuit Design
There are several different languages out there for digital circuit design. VHDL
(.vhd or .vhdl), short for "VHSIC Hardware Description Language" (VHSIC itself
stands for Very High Speed Integrated Circuit), is probably the biggest digital
circuit description language--the only other major language is Verilog
(.v). VHDL's syntax derives from Pascal by way of Ada, and like
those languages it's got a lot of "housekeeping" syntax. Here's
hello world:
use std.textio.all; -- for the line type, write, and writeline
entity foo is
end foo;
architecture arch of foo is
begin
  process
    variable L : line; -- one line of output text
  begin
    write(L, string'("a big old hello world!"));
    writeline(output, L); -- flush the line to standard output
    wait; -- stop here forever (ends the simulation)
  end process;
end arch;
(executable NetRun link)
Notice that there's no sign of a *circuit* here--VHDL should be thought
of more as a parallel programming language, handy for circuit
design! Indeed, you can run your VHDL in a huge variety of
ways. You can simulate VHDL on the CPU using a VHDL-to-executable-code translator like ghdl (NetRun does this), or use any of a variety of commercial simulators. Or you can use a VHDL-to-FPGA programming tool (like this FPGA IDE from Xilinx) to make the VHDL code run directly on an FPGA, such as this $150 FPGA on a PCI card. Or you can "tape out" a VHDL design to an actual silicon ASIC, a custom chip that runs your code directly. Custom silicon prices start at about $20,000 from MOSIS, for 40 tiny chips.
VHDL code inside a "process ... begin ... end process" block basically looks
like sequential C-like code. But there's a
twist--the body of every process executes repeatedly,
over and over again, forever. Try
the above code without the "wait;", and it'll keep printing
hello! "wait;" actually means "wait here forever", and once every
process is waiting the simulator has nothing left to do, so it exits.
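This repeat-forever behavior is exactly what you want for something like a clock generator. Here's a minimal sketch (my example, not from the lecture code; it assumes a "signal clk : bit;" declared in the enclosing architecture):
process -- clock generator: the process body restarts forever
begin
  clk <= '0'; wait for 5 ns; -- drive clk low for half a period
  clk <= '1'; wait for 5 ns; -- then high; then the body repeats
end process;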
But it's even weirder with multiple process statements (which are very common), or with statements outside any process: all of these execute concurrently with each other! This is how real
hardware works--all your logic, muxes, registers, gates, and so on just
run all the time, in parallel. Sequential C/C++/Java/C#/VB programmers aren't
at all used to this.
Here's a VHDL program with three separate things happening at once:
use std.textio.all;
entity foo is
end foo;
architecture arch of foo is
  signal BOB: bit; -- BOB and TED are both bits.
  signal TED: bit;
begin
  process -- Drives BOB low, then high.
  begin
    BOB <= '0'; wait for 1000 ns;
    BOB <= '1'; wait;
  end process;

  TED <= BOB; -- Always copies BOB's value into TED

  process -- Waits for BOB, then prints TED
    variable L : line;
  begin
    wait until BOB = '1';
    wait for 1 ns; -- Time for TED to catch up (important!)
    write(L, TED);
    writeline(output, L);
    wait;
  end process;
end arch;
(executable NetRun link)
The three things that continually happen are:
- The BOB driver first drives BOB low, then high.
- The TED copier continually copies BOB's value into TED.
- The printer waits until BOB is ready, then prints out TED.
The "wait for 1 ns" in the printer matters because signal assignments don't take effect instantly: TED picks up BOB's new value only after the copier has had a chance to run, so the printer has to give that copy time to propagate.
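By the way, a concurrent assignment like "TED <= BOB;" is really just shorthand for a tiny process of its own--standard VHDL semantics say it behaves like this sketch (my illustration, not code from the lecture):
process -- equivalent to the concurrent statement "TED <= BOB;"
begin
  TED <= BOB; -- copy BOB's current value into TED
  wait on BOB; -- sleep until BOB changes, then run again
end process;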
Strange, eh? You can read way more about VHDL at this excellent tutorial or this group of examples. Here's some FPGA-specific VHDL.
CPUs in VHDL
The obvious thing to do with VHDL is to build a CPU. Here's a trivial UEMU-like non-pipelined CPU:
-- nanoproc, a very small CPU + RAM
-- by Dr. Orion Sky Lawlor, olawlor@acm.org, 2007-09-19 (Public Domain)
use std.textio.all;
library IEEE; use IEEE.numeric_bit.all;
entity foo is end entity;
architecture behaviour of foo is
  -- Data type used throughout; a 16-bit number
  subtype NREG is UNSIGNED(15 downto 0); -- IEEE.numeric_bit datatype
  function MAKEREG(V: in integer) return NREG is
  begin return to_unsigned(V,16); end;
  -- Debugging printout procedure
  procedure print(S: in string; V : in NREG) is
    variable L : line;
  begin
    write(L,S);
    write(L, to_integer(V));
    writeline(output,L);
  end;
begin -- architecture of nanoproc
  main: process
    -- 16 16-bit registers.
    type regarray is array(0 to 15) of NREG;
    variable r : regarray;
    -- Register 15 is the program counter.
    alias pc : NREG is r(15);
    -- 256 16-bit memory locations
    type memarray is array(0 to 255) of NREG;
    variable m : memarray;
    variable inst : NREG; -- the current instruction
  begin -- main process
    -- Initialize registers
    pc := MAKEREG(0); -- Start executing at address zero
    r(0) := MAKEREG(0); -- register zero is always equal to zero
    -- Initialize RAM with some instructions
    m(0) := x"1107"; -- load 0x07 into register 1
    m(1) := x"1203"; -- load 0x03 into register 2
    m(2) := x"a112"; -- add regs 1 and 2
    m(3) := x"b001"; -- magic print-reg-1 instruction
    m(4) := x"e000"; -- magic Exit instruction
    -- Main execution loop
    loop
      inst := m(to_integer(pc)); -- fetch next instruction
      pc := pc + MAKEREG(1); -- increment program counter
      case inst(15 downto 12) is
      --------------- Table of Instruction Opcodes ---------------
      when x"1" => -- Load-immediate instruction: 0x1 dest <8-bit immed>
        r(to_integer(inst(11 downto 8)))
          := MAKEREG(to_integer(inst(7 downto 0)));
      when x"a" => -- Addition instruction: 0xA dest src1 src2
        r(to_integer(inst(11 downto 8)))
          := r(to_integer(inst(7 downto 4)))
           + r(to_integer(inst(3 downto 0)));
      when x"b" => -- Output instruction: prints the register in the low nibble
        print(string'("Value in r(1)="),r(to_integer(inst(3 downto 0))));
      when x"e" => -- Execution complete
        print(string'("End of execution at pc="),pc-1);
        wait; -- exits simulator
      when others => -- Illegal instruction!
        print(string'("Invalid instruction at pc="),pc-1);
        print(string'("Instruction value="),inst);
        wait; -- exits simulator
      --------------- End Table of Instructions ---------------
      end case;
    end loop;
  end process; -- main
end architecture;
(executable NetRun link)
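The opcode table makes nanoproc easy to extend. As a sketch (my addition, not part of the original nanoproc), here are two more instructions you could drop into the case statement--a subtract at a hypothetical 0x2 opcode, and an unconditional jump at 0xC:
when x"2" => -- Subtraction instruction: 0x2 dest src1 src2
  r(to_integer(inst(11 downto 8)))
    := r(to_integer(inst(7 downto 4)))
     - r(to_integer(inst(3 downto 0)));
when x"c" => -- Jump instruction: 0xC 0 <8-bit target address>
  pc := MAKEREG(to_integer(inst(7 downto 0)));
Both reuse pieces already defined above (UNSIGNED arithmetic from IEEE.numeric_bit, and the MAKEREG helper), so they should slot straight into the existing fetch-execute loop.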
You can also find some much bigger VHDL example CPUs online.
Course Review
During this semester, we've covered:
- CPU design, starting with registers and arithmetic, and working upwards.
- Pipelining, out-of-order execution, speculation, renaming,
superscalar wide issue, and other ways to extract parallelism from
sequential code.
- Single Instruction Multiple Data (SIMD) instructions, which are a
very small step toward explicitly-parallel software: a single
instruction that operates on several (usually four) separate values
simultaneously. SIMD only gets you 4-way parallelism in the best
case, though.
- Multithreading is a more general way of expressing problem-domain
parallelism in software. The tricky part is getting the software
to actually work after you've parallelized it this way:
- Raw threads, like pthreads or Windows threads, provide only
primitive, dangerous tools like mutexes (locks) to keep the program
operating properly--unless you consistently and explicitly order memory
accesses, shared variables will get updated incorrectly. Threads
are old, dating from the 1970's or so, but still haven't gotten their bugs worked
out properly!
- OpenMP is a syntactically nicer interface for threading, but is
probably less general than raw threads. It works best on loops,
and provides a very simple one-line annotation (#pragma omp parallel
for) to parallelize those loops. OpenMP went mainstream around
2007, when it finally got compiler support everywhere.
- Because writing software to properly share memory is so tricky,
and building hardware that scalably shares memory is also tricky, some
have suggested not sharing memory at all, and instead communicating over an explicit message-passing network.
- TCP sockets provide a straightforward interface to
point-to-point networks of all sorts. The tricky parts are
finding IP addresses, making sure servers stay running, and building
application-layer communication on top of raw bytes. Sockets
have existed on every OS since about 1990, and are used by web browsers etc.
- MPI (Message Passing Interface) is a handier
high-performance-computing library used to access the network.
MPI automatically handles IP addresses and process startup, and
provides some primitive methods for sending application-layer data
structures. MPI is absurdly popular on big parallel machines, and
is slowly percolating into the wider world.
- Charm++ is a University of Illinois research project to build
an object-oriented message passing system. It can migrate objects
between physical processors for fault tolerance or load balancing, and
uses Dr. Lawlor's own genuine PUP system to easily send arbitrary C++
classes between processors. Charm++ is used by more than six real
projects, but this is still about 10,000x fewer than MPI.
- My new personal favorite way to write parallel code is to write
graphics-card code, where pixels can be rendered in parallel.
Consumer programmable graphics hardware was first created in 2001 (or
so), but is now mainstream in 2008 (thanks to Vista!).
- In general, the problems with parallel performance are going to be:
- Batch size. It takes microseconds to create a thread,
send a network message, or read back a pixel, so you'd better be doing milliseconds of work in parallel if you want to amortize away that overhead.
- Load balance. It's easy to screw up your code so that
only one CPU is doing all the work, and all the other CPUs just sit there and
watch. That's a waste!