Hardware is naturally parallel: all the logic gates do their thing simultaneously. This sort of unsynchronized parallelism is full of timing bugs and data race conditions, so ever since the very first machines in the 1940's, designers have focused on correctly executing one instruction at a time. This "one instruction at a time" serial model became the default assumption for software, further constraining future hardware in a feedback cycle.
Back in the 1970's, when bits were expensive, the typical CPU encoding used exactly as many bytes as each instruction needed and no more. For example, a "return" instruction might use one byte (0xc3), while a "load a 32-bit constant" instruction might use five bytes (0xb8 <32-bit constant>). This variable-sized instruction style is (retroactively) called Complex Instruction Set Computing (CISC), and x86 is basically the last surviving CISC machine.
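For concreteness, here are those same variable-length x86 bytes written out as a C array (the byte values are the real encodings mentioned above; the little test program wrapped around them is just illustrative):

    /* Raw x86 machine code for "mov eax, 42" followed by "ret", stored as data
       so we can look at the variable-length encoding. */
    #include <stdio.h>

    unsigned char code[] = {
        0xb8, 0x2a, 0x00, 0x00, 0x00,   /* 0xb8 + 32-bit constant: mov eax, 42 (5 bytes) */
        0xc3                            /* 0xc3: ret (1 byte) */
    };

    int main(void) {
        printf("total bytes: %d\n", (int)sizeof(code));   /* prints 6 */
        return 0;
    }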
During the 1980's, folks realized they could build *parallelizing* CPU decoders, which detected independent instructions and executed them in parallel. The details are coming next week, but the basic idea is to check for dependencies between instructions at runtime. Initially, this was only considered feasible if all the instructions took the same number of bytes, usually four bytes, which allowed the decoder to look ahead in the instruction stream for parallel opportunities. This idea is called Reduced Instruction Set Computing (RISC), and was built into ARM, MIPS, PowerPC, SPARC, DEC Alpha, and other commercial CPUs, many of which survive to this day. Here's a good but long retrospective article on the RISC-vs-CISC war, which got pretty intense during the 1990's. Nowadays, RISC machines might compress their instructions (like CISC), while high performance CISC machines usually decode their instructions into fixed-size blocks (like RISC), so the war ended in the best possible way--both sides have basically joined forces!
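For contrast with the x86 bytes above, here are two classic 32-bit ARM (RISC) instructions written out the same way; every instruction is exactly 4 bytes, so a parallelizing decoder always knows where the next several instructions begin:

    /* Two 32-bit ARM instructions: each one is exactly 4 bytes. */
    unsigned int arm_code[] = {
        0xe3a0002a,   /* mov r0, #42 */
        0xe12fff1e    /* bx lr (return) */
    };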
The limiting factor for modern machine performance is parallelism, so it's natural to break the sequential programming model and somehow indicate that multiple operations should happen at once. In a Single Instruction Multiple Data (SIMD) programming model, such as SSE, the programmer explicitly specifies that multiple operations can happen simultaneously.
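Here's a minimal sketch of how SSE is usually exposed in C, via compiler intrinsics: a single _mm_add_ps call asks the hardware for four float additions at once. (The particular arrays and values are made up for illustration.)

    /* Minimal SSE sketch: add four floats at a time using intrinsics. */
    #include <xmmintrin.h>   /* SSE intrinsics */
    #include <stdio.h>

    int main(void) {
        float a[4] = {1, 2, 3, 4};
        float b[4] = {10, 20, 30, 40};
        float c[4];

        __m128 va = _mm_loadu_ps(a);     /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  /* one instruction: 4 additions at once */
        _mm_storeu_ps(c, vc);

        printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }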
Surprisingly, it's actually very easy to execute parallel instructions like SIMD on a sequential computer: just execute one part at a time. But it's very hard to execute sequential instructions on a parallel computer, because you don't know which parts can run at the same time. This makes it tragic that we always write sequential software instead of parallel software!
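To see the easy direction, here's the 4-wide add from the SSE sketch above executed one part at a time as purely sequential code; a trivial loop recovers the parallel instruction's meaning. Going the other direction is the hard part: to safely turn this loop back into one parallel instruction, a compiler or CPU has to prove the loads and stores don't interfere with each other.

    /* Sequentially "executing" the 4-wide SIMD add: one lane per loop iteration. */
    void add4_sequential(const float *a, const float *b, float *c) {
        for (int i = 0; i < 4; i++)
            c[i] = a[i] + b[i];   /* one lane of the parallel add at a time */
    }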
During the late 1980's and early 1990's, several companies created even longer instruction machines, called Very Long Instruction Word (VLIW), where basically each part of the CPU has corresponding bits in every instruction. This makes for very simple decoding, and allows some interesting parallel tricks by carefully lighting up different parts of the CPU at the same time, but each instruction might be a hundred bytes or more!
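As a purely hypothetical sketch (not any real machine's encoding), a VLIW instruction word might be laid out like the C bitfield struct below, with a dedicated slot for each functional unit; a unit with nothing to do this cycle just gets a no-op in its slot. The decoder barely has to do anything, since each field routes straight to its unit, but every slot is present in every instruction, which is why the words get so long.

    /* Hypothetical VLIW instruction word: one slot per functional unit.
       All field names and widths are made up for illustration. */
    struct vliw_word {
        unsigned alu0_op    : 8;   /* operation for integer ALU 0 */
        unsigned alu0_regs  : 12;  /* its source and destination registers */
        unsigned alu1_op    : 8;   /* operation for integer ALU 1 */
        unsigned alu1_regs  : 12;
        unsigned fpu_op     : 8;   /* floating-point unit */
        unsigned fpu_regs   : 12;
        unsigned mem_op     : 8;   /* load/store unit */
        unsigned mem_regs   : 12;
        unsigned branch_op  : 8;   /* branch unit */
        unsigned branch_tgt : 24;  /* branch target */
        /* ...and so on for every other unit; the whole word issues in one clock cycle. */
    };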
Rumor has it that ATI graphics processors briefly used VLIW internally, and there are several strange digital signal processor chips that are VLIW, but the concept hasn't really caught on for the main CPU or even many niche processors. One big problem is that the static scheduling inherent in VLIW doesn't deal well with unexpected events such as cache misses, which stall one execution unit but not the others. And any company that produced a successful VLIW chip would have a big problem maintaining binary compatibility for an improved processor, since each instruction specifically describes what should happen on each part of the old chip.
A field-programmable gate array (FPGA) chip consists of a dense grid of many thousands of "logic blocks", each divided into smaller "logic cells". Each logic cell contains a handy selection of useful circuits such as lookup tables, adders, and flip flops, with the cell's output chosen from among them by a multiplexor (mux). By changing the entries in the lookup tables and the select bits of the muxes, called "reconfiguring" the FPGA, you can make an FPGA simulate any circuit, from something simple like an adder to something complicated like an entire multicore CPU. A CPU built from FPGA parts is known as a "soft processor", since you can reconfigure it, in contrast to a "hard" processor built directly into the silicon.
Here's a Wikipedia example of an FPGA logic cell:
LUT: lookup table, shown here with 3 inputs; some more recent FPGAs use 6-input LUTs. (A small software sketch of a LUT follows this list.)
FA: Full Adder, including carry bits. This is included explicitly to make it easier to build adders or multipliers.
DFF: D-Flip Flop, a memory element.
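To make the LUT concrete, here's a tiny software model (an illustrative sketch, not how vendors actually store configuration bits): a 3-input LUT is just 8 configuration bits, and "reconfiguring" means changing those bits. Load the right 8 bits and the same cell becomes an AND gate, an XOR gate, the sum bit of a full adder, or any other 3-input boolean function.

    /* Software model of a 3-input LUT: 8 configuration bits give the output
       for each of the 8 possible input combinations. Illustrative only. */
    #include <stdio.h>

    int lut3(unsigned char config, int a, int b, int c) {
        int index = (a << 2) | (b << 1) | c;   /* which row of the truth table */
        return (config >> index) & 1;
    }

    int main(void) {
        unsigned char xor3 = 0x96;   /* truth table for a xor b xor c (full adder's sum bit) */
        printf("%d\n", lut3(xor3, 1, 0, 1));   /* prints 0, since 1 xor 0 xor 1 == 0 */
        return 0;
    }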
One limiting factor in the adoption of FPGAs, which can provide very high degrees of parallelism, is the programming model, since an ordinary sequential program clearly cannot execute on one without radical transformation. Altera recently released an OpenCL to FPGA compiler, which reconfigures the FPGA at runtime to execute (parallel) OpenCL programs.
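For flavor, here's what a tiny OpenCL kernel looks like (OpenCL kernels are written in a C dialect; this generic vector-add isn't taken from Altera's toolchain): every work-item runs the body in parallel, and that explicit parallelism is what the compiler can spread across the FPGA's logic.

    /* A generic OpenCL kernel: each work-item adds one element of the vectors.
       The parallelism is explicit, so it can be mapped onto parallel hardware. */
    __kernel void vector_add(__global const float *a,
                             __global const float *b,
                             __global float *c)
    {
        int i = get_global_id(0);   /* which element this work-item handles */
        c[i] = a[i] + b[i];
    }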
Another issue with FPGAs is that despite their circuit-level design, delivered performance isn't actually very good for some complicated operations such as floating-point multiplication. These operations need a large area for the relevant shifting and renormalization, limiting overall parallelism; and they need many dependent stages to execute, limiting the achievable clock rate due to the many muxes on the critical path. FPGAs are hence indeed much faster than a single CPU running non-SIMD code, but for many tasks they're merely performance-competitive with a GPU, or with tuned SIMD + multicore code on a conventional modern CPU. More recent FPGAs have begun to add hard silicon parts such as multiplier blocks or hard processor cores mixed into the soup of logic blocks, a practice which gives higher performance at the price of slightly lower programmability for problems where the hard blocks aren't relevant.
The extreme example of this "hardening" design trend is an Application Specific Integrated Circuit (ASIC), which is essentially non-programmable. Because there are *no* multiplexors along the data paths, this can provide 10-100x performance gains over general purpose programmable designs, at the price of being useful for only a single problem. For example, bitcoin mining was once profitable on the CPU, before being driven out by GPU and FPGA miners, and then in turn replaced by ASIC miners.
CS 441 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.