VHDL, and End-of-Semester Course Review
CS 641 Lecture, Dr. Lawlor
So, you want to build some hardware. Back in the day (that is, in
1985), you would start with a pile of logic chips, each with as many as
four logic gates on it, and solder together a big ball of wires (known as "rat's nest" circuit layout).
Unfortunately, this doesn't scale. Past a certain complexity
(say, a thousand pins), you basically can't fabricate a working circuit
by hand in a reasonable length of time. Figuring out how and
where to route the wires, and actually routing the wires, becomes
virtually impossible.
So since the 1960's, people have turned to computers to design the
circuits used for computers, which is to say Computer Aided Circuit
Design or the modern term Electronic Design Automation. (You know that Terminator quote "... at 2am SkyNet began learning at a geometric rate"? Yeah, that's a restatement of Moore's Law.)
VHDL Digital Circuit Design
There are several different languages out there for digital circuit design. VHDL
(.vhd or .vhdl), short for "VHSIC Hardware Description Language" (VHSIC itself
stands for Very High Speed Integrated Circuit), is probably the biggest digital
circuit description language--the only other major language is Verilog
(.v). VHDL's syntax derives from Pascal by way of Ada, and like
those languages it's got a lot of "housekeeping" syntax. Here's
hello world:
use std.textio.all; -- for the line type, write, and writeline
entity foo is
end foo;
architecture arch of foo is
begin
  process
    variable L : line; -- one line of output text
  begin
    write(L, string'("a big old hello world!"));
    writeline(output, L); -- flush the line to standard output
    wait; -- stop here forever (ends the simulation)
  end process;
end arch;
(executable NetRun link)
Notice that there's no sign of a *circuit* here--VHDL should be thought
of more as a parallel programming language, handy for circuit
design! Indeed, you can run your VHDL in a huge variety of
ways. You can simulate VHDL on the CPU using a VHDL-to-executable-code translator like ghdl (NetRun does this), or use any of a variety of commercial simulators. Or you can use a VHDL-to-FPGA programming tool (like this FPGA IDE from Xilinx) to make the VHDL code run directly on an FPGA, such as this $150 FPGA on a PCI card. Or you can "tape out" a VHDL design to an actual silicon ASIC, a custom chip that runs your code directly. Custom silicon prices start at about $20,000 from MOSIS, for 40 tiny chips.
VHDL code inside a "process ... begin ... end process" block basically looks
like sequential C-like code. But there's a
twist--the body of every process executes repeatedly,
over and over again, forever. Try
the above code without the "wait;", and it'll keep printing
hello! "wait;" actually means "wait here forever", and once every
process is waiting the simulator has nothing left to do, so it exits.
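This repeat-forever behavior is exactly what you want for something like a clock generator. Here's a minimal sketch (my example, not from the lecture code; it assumes a "signal clk : bit;" declared in the enclosing architecture):
process -- clock generator: the process body restarts forever
begin
  clk <= '0'; wait for 5 ns; -- drive clk low for half a period
  clk <= '1'; wait for 5 ns; -- then high; then the body repeats
end process;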
But it's even weirder with multiple process statements (which are very common), or with statements outside any process: all of these execute concurrently with each other! This is how real
hardware works--all your logic, muxes, registers, gates, and so on just
run all the time, in parallel. Sequential C/C++/Java/C#/VB programmers aren't
at all used to this.
Here's a VHDL program with three separate things happening at once:
use std.textio.all;
entity foo is
end foo;
architecture arch of foo is
  signal BOB: bit; -- BOB and TED are both bits.
  signal TED: bit;
begin
  process -- Drives BOB low, then high.
  begin
    BOB <= '0'; wait for 1000 ns;
    BOB <= '1'; wait;
  end process;

  TED <= BOB; -- Always copies BOB's value into TED

  process -- Waits for BOB, then prints TED
    variable L : line;
  begin
    wait until BOB = '1';
    wait for 1 ns; -- Time for TED to catch up (important!)
    write(L, TED);
    writeline(output, L);
    wait;
  end process;
end arch;
(executable NetRun link)
The three things that continually happen are:
- The BOB driver first drives BOB low, then high.
- The TED copier continually copies BOB's value into TED.
- The printer waits until BOB is ready, then prints out TED.
The "wait for 1 ns" in the printer matters because signal assignments don't take effect instantly: TED picks up BOB's new value only after the copier has had a chance to run, so the printer has to give that copy time to propagate.
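By the way, a concurrent assignment like "TED <= BOB;" is really just shorthand for a tiny process of its own--standard VHDL semantics say it behaves like this sketch (my illustration, not code from the lecture):
process -- equivalent to the concurrent statement "TED <= BOB;"
begin
  TED <= BOB; -- copy BOB's current value into TED
  wait on BOB; -- sleep until BOB changes, then run again
end process;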
Strange, eh? You can read way more about VHDL at this excellent tutorial or this group of examples. Here's some FPGA-specific VHDL.
CPUs in VHDL
The obvious thing to do with VHDL is to build a CPU. Here's a trivial UEMU-like non-pipelined CPU:
-- nanoproc, a very small CPU + RAM
-- by Dr. Orion Sky Lawlor, olawlor@acm.org, 2007-09-19 (Public Domain)
use std.textio.all;
library IEEE; use IEEE.numeric_bit.all;
entity foo is end entity;
architecture behaviour of foo is
  -- Data type used throughout; a 16-bit number
  subtype NREG is UNSIGNED(15 downto 0); -- IEEE.numeric_bit datatype
  function MAKEREG(V: in integer) return NREG is
  begin return to_unsigned(V,16); end;
  -- Debugging printout procedure
  procedure print(S: in string; V : in NREG) is
    variable L : line;
  begin
    write(L,S);
    write(L, to_integer(V));
    writeline(output,L);
  end;
begin -- architecture of nanoproc
  main: process
    -- 16 16-bit registers.
    type regarray is array(0 to 15) of NREG;
    variable r : regarray;
    -- Register 15 is the program counter.
    alias pc : NREG is r(15);
    -- 256 16-bit memory locations
    type memarray is array(0 to 255) of NREG;
    variable m : memarray;
    variable inst : NREG; -- the current instruction
  begin -- main process
    -- Initialize registers
    pc := MAKEREG(0); -- Start executing at address zero
    r(0) := MAKEREG(0); -- register zero is always equal to zero
    -- Initialize RAM with some instructions
    m(0) := x"1107"; -- load 0x07 into register 1
    m(1) := x"1203"; -- load 0x03 into register 2
    m(2) := x"a112"; -- add regs 1 and 2
    m(3) := x"b001"; -- magic print-reg-1 instruction
    m(4) := x"e000"; -- magic Exit instruction
    -- Main execution loop
    loop
      inst := m(to_integer(pc)); -- fetch next instruction
      pc := pc + MAKEREG(1); -- increment program counter
      case inst(15 downto 12) is
      --------------- Table of Instruction Opcodes ---------------
      when x"1" => -- Load-immediate instruction: 0x1 dest <8-bit immed>
        r(to_integer(inst(11 downto 8)))
          := MAKEREG(to_integer(inst(7 downto 0)));
      when x"a" => -- Addition instruction: 0xA dest src1 src2
        r(to_integer(inst(11 downto 8)))
          := r(to_integer(inst(7 downto 4)))
           + r(to_integer(inst(3 downto 0)));
      when x"b" => -- Output instruction: prints the register in the low nibble
        print(string'("Value in r(1)="),r(to_integer(inst(3 downto 0))));
      when x"e" => -- Execution complete
        print(string'("End of execution at pc="),pc-1);
        wait; -- exits simulator
      when others => -- Illegal instruction!
        print(string'("Invalid instruction at pc="),pc-1);
        print(string'("Instruction value="),inst);
        wait; -- exits simulator
      --------------- End Table of Instructions ---------------
      end case;
    end loop;
  end process; -- main
end architecture;
(executable NetRun link)
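The opcode table makes nanoproc easy to extend. As a sketch (my addition, not part of the original nanoproc), here are two more instructions you could drop into the case statement--a subtract at a hypothetical 0x2 opcode, and an unconditional jump at 0xC:
when x"2" => -- Subtraction instruction: 0x2 dest src1 src2
  r(to_integer(inst(11 downto 8)))
    := r(to_integer(inst(7 downto 4)))
     - r(to_integer(inst(3 downto 0)));
when x"c" => -- Jump instruction: 0xC 0 <8-bit target address>
  pc := MAKEREG(to_integer(inst(7 downto 0)));
Both reuse pieces already defined above (UNSIGNED arithmetic from IEEE.numeric_bit, and the MAKEREG helper), so they should slot straight into the existing fetch-execute loop.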
You can also find some much bigger VHDL example CPUs online.
Course Review
During this semester, we've covered:
- CPU design, starting with registers and arithmetic, and working upwards.
- Pipelining, out-of-order execution, speculation, renaming,
superscalar wide issue, and other ways to extract parallelism from
sequential code.
- Single Instruction Multiple Data (SIMD) instructions, which are a
very small step toward explicitly-parallel software: a single
instruction that operates on several (usually four) separate values
simultaneously. SIMD only gets you 4-way parallelism in the best
case, though.
- Multithreading is a more general way of expressing problem-domain
parallelism in software. The tricky part is getting the software
to actually work after you've parallelized it this way:
- Raw threads, like pthreads or Windows threads, provide only
primitive, dangerous tools like mutexes (locks) to keep the program
operating properly--unless you consistently and explicitly order memory
accesses, shared variables will get updated incorrectly. Threads
are old, dating from the 1970's or so, but still haven't gotten their bugs worked
out properly!
- OpenMP is a syntactically nicer interface for threading, but is
probably less general than raw threads. It works best on loops,
and provides a very simple one-line annotation (#pragma omp parallel
for) to parallelize those loops. OpenMP went mainstream around
2007, when it finally got compiler support everywhere.
- Because writing software to properly share memory is so tricky,
and building hardware that scalably shares memory is also tricky, some
have suggested not sharing memory at all, and instead communicating over an explicit message-passing network.
- TCP sockets provide a straightforward interface to
point-to-point networks of all sorts. The tricky parts are
finding IP addresses, making sure servers stay running, and building
application-layer communication on top of raw bytes. Sockets
have existed on every OS since about 1990, and are used by web browsers etc.
- MPI (Message Passing Interface) is a handier
high-performance-computing library used to access the network.
MPI automatically handles IP addresses and process startup, and
provides some primitive methods for sending application-layer data
structures. MPI is absurdly popular on big parallel machines, and
is slowly percolating into the wider world.
- Charm++ is a University of Illinois research project to build
an object-oriented message passing system. It can migrate objects
between physical processors for fault tolerance or load balancing, and
uses Dr. Lawlor's own genuine PUP system to easily send arbitrary C++
classes between processors. Charm++ is used by more than six real
projects, but this is still about 10,000x fewer than MPI.
- My new personal favorite way to write parallel code is to write
graphics-card code, where pixels can be rendered in parallel.
Consumer programmable graphics hardware was first created in 2001 (or
so), but is now mainstream in 2008 (thanks to Vista!).
- In general, the problems with parallel performance are going to be:
- Batch size. It takes microseconds to create a thread,
send a network message, or read back a pixel, so you'd better be doing milliseconds of work in parallel if you want to amortize away that overhead.
- Load balance. It's easy to screw up your code so that
only one CPU is doing all the work, and all the other CPUs just sit there and
watch. That's a waste!