Subject: Reverse-engineering an airspeed/Mach indicator from 1977


How does a vintage airspeed indicator work? CuriousMarc picked one up for a project, but it didn't have any documentation, so I reverse-engineered it. This indicator was used in the cockpit panel for business jets such as the Gulfstream G-III, Cessna Citation, and Bombardier Challenger CL600. It was probably manufactured in 1977 based on the dates on its transistors.

You might expect the indicators on an aircraft control panel to be simple dials. Behind this dial, however, is a large, 2.8-pound box with a complex system of motors, gears, and feedback potentiometers, controlled by two boards of electronics. For all this complexity, the indicator doesn't have any smarts: the pointers just indicate voltages fed into it from an air data computer. This is a quick blog post summarizing what I found.

Front view of the indicator.

The dial has two rotating pointers: the white pointer indicates airspeed in knots while the striped pointer indicates the maximum airspeed (which varies depending on altitude). The "digital" indicator at the top shows Mach number from 0.10 to 0.99, implemented with rotating digit wheels. When the unit is operating, the OFF indicator flag switches to black. The flag switches to a bright VMO warning if the pilot exceeds the maximum airspeed.1 On the rim of the dial, two small markers called "bugs" can be manually moved to indicate critical speeds such as takeoff speed.

In use, the indicator is connected to a Sperry air data computer and receives voltage signals to control the dial positions.3 The air data computer measures the static and dynamic air pressure from pitot tubes and determines the airspeed, Mach number, altitude, and other parameters. (These calculations become nontrivial near Mach 1 as air compresses and the fluid dynamics change.) Since we didn't have the air data computer or its specifications, I needed to figure out the connections from the computer to the display.

With the unit's cover removed, you can see the internal mechanisms and circuitry. Each of the three indicators is controlled by a small DC motor with a potentiometer providing feedback. To the right, two circuit boards provide the electronics to drive the indicators.4 At the upper right, the black blob is a 26-volt 400-Hertz transformer to power the unit. Some power supply components are in front of it. Below the transformer is an orangish flexible printed-circuit board, which seems advanced for the timeframe. This flexible ribbon connects the transformer, the external connector, and the printed-circuit board sockets, providing the backplane for the system.

A side view of the unit shows the gears to control the indicators.

The diagram below shows the principle behind the servo mechanism that controls each indicator. The goal is to rotate the indicator to a position corresponding to the input voltage. A feedback loop is used to achieve this. The potentiometer provides a voltage proportional to its rotation. The input voltage and the feedback voltage are inputs to an op amp, which generates an error signal based on the difference between the inputs. The error signal rotates the DC motor in the appropriate direction until the potentiometer voltage matches the input voltage. Because the indicator and the potentiometer are geared together, the indicator will be in the correct position. As the input voltage changes, the system will continuously track the changes and keep the indicator updated.
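
To make the feedback loop concrete, here is a minimal simulation sketch in Python. It is my illustration, not a model of the actual circuit: the gain, time step, and voltage values are made-up numbers, and the real loop is analog rather than stepped.

    # Toy simulation of the servo loop: the motor turns until the feedback
    # potentiometer voltage matches the input voltage from the air data computer.
    GAIN = 5.0    # made-up loop gain (motor speed per volt of error)
    DT = 0.01     # simulation time step in seconds

    def servo_step(input_voltage, pot_voltage):
        """One step of the loop: return the new potentiometer voltage."""
        error = input_voltage - pot_voltage     # the op amp computes the difference
        motor_speed = GAIN * error              # motor drive is proportional to the error
        return pot_voltage + motor_speed * DT   # gears turn the pot (and the pointer)

    pot = 0.0                        # pointer starts at zero
    for _ in range(200):
        pot = servo_step(5.0, pot)   # drive the pointer toward a 5 V input
    print(f"pot voltage settles near {pot:.2f} V")   # approaches 5.00 V

The key property is that the motor keeps turning as long as any error remains, so the pointer ends up wherever the potentiometer voltage equals the input voltage, regardless of motor speed or gear ratios.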

A diagram illustrating the servo feedback loop.

Because the DC motor spins much faster than the dial moves, reduction gears slow the rotation. The photo below shows the gear train in the unit. A potentiometer is at the upper-right with three wires attached.

A closeup of the gear train. A potentiometer is on the right.

The Mach number display has additional gearing to rotate the numbered wheels. When the low-digit wheel cycles around, it advances the high-digit wheel, similar to an odometer.

The mechanism to rotate the digit wheels for the Mach number.

Fault checking

One interesting feature of the indicator unit is that it implements fault checking to alert the pilot if something goes wrong. The front panel has a three-position flag. By default it's in the OFF position. Powering the coil in one direction rotates the flag to the blank side. Powering the coil in the other direction rotates the flag to the "VMO" position which indicates that the pilot has exceeded the maximum operating speed.

I figured that powering up the unit would move the flag out of the OFF position, but it's more complicated than that. First, the unit checks that the air data computer is providing a suitable reference voltage. Second, the unit verifies that the motor voltages for the two needles are within limits; this ensures that the servo loop is operating successfully. Third, the unit checks that signals are received on status pins K and L. The unit only moves out of the OFF state if all these conditions are satisfied.5 Thus, if the unit receives bad signals or is malfunctioning, the pilot will be alerted by the OFF indicator, rather than trusting the faulty display.
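
As a rough sketch of that decision logic, here is my reading of the behavior in Python. This is inferred from observation, not from a schematic; the function and parameter names are mine, and exactly how the K and L signals interact is my interpretation of the description above and note 5.

    def flag_position(ref_ok, airspeed_servo_ok, vmax_servo_ok, pin_k_ok, pin_l_high):
        """Return the flag state: 'OFF', 'blank' (normal), or 'VMO' (overspeed)."""
        # If the reference voltage, either servo loop, or the K status signal
        # looks bad, the flag stays in (or returns to) the OFF position.
        if not (ref_ok and airspeed_servo_ok and vmax_servo_ok and pin_k_ok):
            return "OFF"
        # Per note 5, the L pin going low drives the flag to the VMO warning.
        if not pin_l_high:
            return "VMO"
        return "blank"   # normal operation: the flag shows its blank (black) side

    print(flag_position(True, True, True, True, True))    # blank
    print(flag_position(False, True, True, True, True))   # OFF
    print(flag_position(True, True, True, True, False))   # VMO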

The circuitry

The unit is powered by 26 volts, 400 Hz, a standard voltage for aviation. A small transformer provides multiple outputs for the various internal voltages. The unit has four power supplies: three on the first board and one on the back wall of the unit. One power supply is for the status indicator, one is for the op amps, one powers the 41.7V motors, and the fourth provides other power.

One subtlety is how the feedback potentiometers are powered. The servo loop compares the potentiometer voltage with the input voltage. But this only works if the potentiometer and the input voltage are using the same reference. One solution would be for the indicator unit and the air data computer to contain matching precision voltage regulators. Instead, the system uses a simpler, more reliable approach: the air data computer provides a reference voltage that the indicator unit uses to power the potentiometers.6 With this approach, the air data computer's voltage reference can fluctuate and the indicator will still reach the right position. (In other words, a 5V input with a 10V reference and a 6V input with a 12V reference are both 50%.)
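
The measurement is ratiometric: only the ratio of input to reference matters. A trivial sketch of the idea, with illustrative numbers only:

    def pointer_fraction(input_v, reference_v):
        """Pointer position as a fraction of full scale (ratiometric)."""
        return input_v / reference_v

    print(pointer_fraction(5.0, 10.0))   # 0.5
    print(pointer_fraction(6.0, 12.0))   # 0.5 -- same position despite a different reference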

The diagram below shows the board with the servo circuitry. The board uses dual op-amp integrated circuits, packaged in 10-pin metal cans that protect against interference.7 The ICs and some of the other components have obscure military part numbers; I don't know if this unit was built for military use or if military-grade parts were used for reliability.

The servo board is full of transistors, resistors, capacitors, diodes, and op-amp integrated circuits.

The circuitry in the lower-left corner handles the reference voltage from the air data computer. The board buffers this voltage with an op amp to power the three feedback potentiometers. The op amp also ensures that the reference voltage is at least 10 volts. If not, the indicator unit shows the "OFF" flag to alert the pilot.

The schematic below shows one of the servo circuits; the three circuits are roughly the same. The heart of the circuit is the error op amp in the center. It compares the voltage from the potentiometer with the input voltage and generates an error output that moves the motor appropriately. A positive error output will turn on the upper transistor, driving the motor with a positive voltage. Conversely, a negative error output will turn on the lower transistor, driving the motor with a negative voltage. The motor drive circuit has clamp diodes to limit the transistor base voltages.

Schematic of one of the servo circuits.

The op amp also receives a feedback signal from the motor output. I don't entirely understand this signal, which goes through a filter circuit with resistors, diodes, and a capacitor. I think it dampens the motor signal so the motor doesn't overshoot the desired position. I think it also keeps the transistor drive signal biased relative to the emitter voltage (i.e. the motor output).

On the input side, the potentiometer voltage goes through an op amp follower buffer, which simply outputs its input voltage. This may seem pointless, but the op amp provides a high-impedance input so the potentiometer's voltage doesn't get distorted.

The external input voltage goes through a resistor/capacitor circuit to scale it and filter out noise. Curiously, the circuit board was modified by cutting a trace and adding a resistor and capacitor to change the input circuit for one of the inputs. In the photo below, you can see the added resistor and capacitor; the cut trace is just to the right of the capacitor. I don't know if this modification changed the scale factor or filtered out noise. A label on the box says that Honeywell performed a modification on November 8, 1991, which was presumably this change.

A closeup of the circuit board showing the modification.

The second board implements three power supplies as well as the circuitry for the OFF/VMO flag. The power supplies are simple and unregulated, just diode bridges to convert AC to DC, along with filter capacitors. Most of the circuitry on the board controls the status flag. Two dual op amps check the motor voltages against upper and lower limits to ensure that the motors are tracking the inputs. These outputs, along with other logic status signals, are combined with diode-transistor logic to determine the flag status. Driver transistors provide +18 or -18 volts to the flag's coil to drive it to the desired position.

This board has power supply circuitry and the control circuitry for the indicator flag.

Conclusions

After reverse-engineering the pinout, I connected the airspeed indicator to a stack of power supplies and succeeded in getting the indicators to operate (video). This unit is much more complex than I expected for a simple display, with servoed motors controlled by two boards of electronics. Air safety regulations probably account for much of the complexity, ensuring that the display provides the pilot with accurate information. For all that complexity, the unit is essentially a voltmeter, indicating three voltages on its display. This airspeed indicator is a bit different from most of the hardware I examine, but hopefully you found this look at its internal circuitry interesting.

With the case removed, the internal circuitry is visible.

You can follow me on Twitter @kenshirriff or RSS. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space.

Notes and references

  1. Since the unit has airspeed and maximum airspeed indicators, you might expect it to display the maximum airspeed warning flag based on the two speed inputs. Instead, the flag is controlled by input pin "L". In other words, the air data computer, not the indicator unit, determines when the maximum airspeed is exceeded. 

  2. This unit is a "Mach Airspeed Indicator", 4018366, apparently also called the SI-225.

    Product label with part number 4018366-901.

    Note that the label says Sperry. In 1986, Sperry attempted to buy Honeywell, but instead Burroughs made a hostile takeover bid for Sperry. The merger of Sperry and Burroughs formed Unisys. A couple of months after the merger, the Sperry Aerospace Group was sold to Honeywell for $1.025 billion. Thus, the indicator became a Honeywell product. This corporate history explains why the unit has a Honeywell product support sticker.

    Labels on top of the unit indicate that it worked with the Sperry 4013242 and 4013244 air data computers. These became the Honeywell AZ-242 and AZ-244.

     

  3. The connector is a 32-pin MIL Spec round connector. Most of the 32 pins are unused. The connector has complex keying with 5 slots. I assume the keying is specific to this indicator, so the wrong indicator doesn't get connected.

    A closeup of the 32-pin connector, probably a MIL Spec 18-32.

    For reference, here is the pinout of the unit. Since this is based on reverse engineering, I don't guarantee it 100%. Don't use this for flight!

    Pin  Use
    A    5V illumination
    B    Chassis ground
    C    AC ground
    E    26V 400 Hz
    F    26V 400 Hz
    K    Enable
    L    Speed ok
    M    Signal ground
    N    Ref. voltage
    P    Vmax control voltage
    R    Airspeed control voltage
    S    Mach control voltage
    V    Chassis ground

    Pins D, G, H, J, T, U, W, X, Y, Z, a, b, c, d, e, f, g, h, and j are unused. 

  4. The chassis has an empty slot for a third circuit board. My guess is that this chassis was used for multiple types of indicators and others required a third board. 

  5. If the L pin goes low, the indicator will move to the VMO position. 

  6. My hypothesis is that the correct reference voltage is 11.7 volts. This yields a scale factor of 1 volt equals 50 knots. It also matches up the display's change in scale at 250 knots with the measured scale change. 

  7. The meter uses three different integrated circuits in 10-pin metal cans with mysterious military markings: "FHL 24988", "JM38510/10102BIC 27014", and "SL14040". These appear to all be equivalent to uA747 dual op amps. (Note that JM38510 is not a part number; it is a general military specification for integrated circuits. The number after it is the relevant part number.) 






Subject: The 8086 processor's microcode pipeline from die analysis


Intel introduced the 8086 microprocessor in 1978, and its influence still remains through the popular x86 architecture. The 8086 was a fairly complex microprocessor for its time, implementing instructions in microcode with pipelining to improve performance. This blog post explains the microcode operations for a particular instruction, "ADD immediate". As the 8086 documentation will tell you, this instruction takes four clock cycles to execute. But looking internally shows seven clock cycles of activity. How does the 8086 fit seven cycles of computation into four cycles? As I will show, the trick is pipelining.

The die photo below shows the 8086 microprocessor under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below, which will be important in the discussion. The Bus Interface Unit handles memory accesses (including instruction prefetching), while the Execution Unit executes instructions. The functional blocks labeled in black are the ones that are part of the discussion below. In particular, the registers and ALU (Arithmetic/Logic Unit) are at the left and the large microcode ROM is in the lower-right.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

Microcode for "ADD"

Most people think of machine instructions as the basic steps that a computer performs. However, many processors (including the 8086) have another layer of software underneath: microcode. The motivation is that instructions usually require multiple steps inside the processor. One of the hardest parts of computer design is creating the control logic that directs the processor for each step of an instruction. The straightforward approach is to build a circuit from flip-flops and gates that moves through the various steps and generates the control signals. However, this circuitry is complicated, error-prone, and hard to design.

The alternative is microcode: instead of building the control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. In other words, microcode forms another layer between the machine instructions and the hardware. The main advantage of microcode is that it turns the processor's control logic into a programming task instead of a difficult logic design task.

The 8086 uses a hybrid approach: although the 8086 uses microcode, much of the instruction functionality is implemented with gate logic. This approach removed duplication from the microcode and kept the microcode small enough for 1978 technology. In a sense the microcode is parameterized. For instance, the microcode can specify a generic ALU operation, and the gate logic determines from the instruction which ALU operation to perform. Likewise, the microcode can specify a generic register and the gate logic determines which register to use. The simplest instructions (such as prefixes or condition-code operations) don't use microcode at all. Although this made the 8086's gate logic more complicated, the tradeoff was worthwhile.

The 8086's microcode was disassembled by Andrew Jenner (link) from my die photos, so we can see exactly what micro-instructions the 8086 is running for each machine instruction. In this post, I will focus on the ADD instruction, since it is fairly straightforward. In particular, the "ADD AX, immediate" instruction contains a 16-bit value that is added to the value in the 16-bit AX register. This instruction consists of three bytes: the opcode 05, followed by the two-byte immediate value. (An "immediate" value is included in the instruction, rather than coming from a register or memory location.)

This ADD instruction is implemented in the 8086's microcode as four micro-instructions, shown below. Each micro-instruction specifies a move operation across the internal ALU bus. It also specifies an action. In brief, the first two instructions get the immediate argument from the prefetch queue. The third instruction gets the argument from the AX register and starts the ALU (Arithmetic/Logic Unit) operation. The final instruction stores the result into the AX register and updates the condition flags.

µ-address    move        action
   018    Q → tmpBL     L8    2
   019    Q → tmpBH
   01a    M → tmpA      XI    tmpA, NXT
   01b    Σ → M         RNI   FLAGS

In detail, the first instruction moves a byte from the prefetch queue (Q) to one of the ALU's temporary registers, specifically the low byte of the tmpB register. (The ALU has three temporary registers to hold arguments: tmpA, tmpB, and tmpC. These temporary registers are invisible to the programmer and are unrelated to the AX, BX, CX registers.) Likewise, the second instruction fetches the high byte of the immediate value from the queue and stores it in the high byte of the ALU's tmpB register. The action in the first micro-instruction, L8, will branch to step 2 (01a) if the instruction specifies an 8-bit operation, skipping the load of the high byte. Thus, the same microcode supports the 8-bit and 16-bit ADD instructions.1

The third micro-instruction is more complicated. The move section moves the AX register's contents (indicated by M) to the accumulator's tmpA register, getting both arguments ready for the operation. XI tmpA starts an ALU operation, in this case adding tmpA to tmpB.2 Finally, NXT indicates that this is the next-to-last micro-instruction, as will be discussed below.

The last micro-instruction stores the ALU's result (Σ) into the AX register. The end of the microcode for this machine instruction is indicated by RNI (Run Next Instruction). Finally, FLAGS causes the 8086's condition flags register to be updated, indicating if the result is zero, negative, and so forth.

You may have noticed that the microcode doesn't explicitly specify the ADD operation or the AX register, using XI and M instead. This illustrates the "parameterized" microcode mentioned earlier. The microcode specifies a generic ALU operation with XI,3 and the hardware fills in the particular ALU operation from bits 5-3 of the machine instruction. Thus, the microcode above can be used for addition, subtraction, exclusive-or, comparisons, and four other arithmetic/logic operations.

The other parameterized aspect is the generic M register specification. The 8086's instruction set has a flexible way of specifying registers for the source and destination of an operation: registers are often specified by a "Mod R/M" byte, but can also be specified by bits in the first opcode. Moreover, many instructions have a bit to switch the source and destination, and another bit to specify an 8-bit or 16-bit register. The microcode can ignore all this; a micro-instruction uses M and N for the source and destination registers, and the hardware handles the details.4 The M and N values are implemented by 5-bit registers that are invisible to the programmer and specify the "real" register to use. The diagram below shows how they appear on the die.
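
To make this parameterization concrete, here is a toy decoder sketch in Python. It is my illustration, not the 8086's actual gate logic: the function name and table are mine, though the opcode bit layout (ALU operation in bits 5-3, byte/word in bit 0) is the standard 8086 encoding described above.

    # Toy illustration of the 8086's "parameterized" microcode decoding.
    ALU_OPS = ["ADD", "OR", "ADC", "SBB", "AND", "SUB", "XOR", "CMP"]

    def decode_alu_immediate(opcode):
        alu_op = ALU_OPS[(opcode >> 3) & 0b111]   # bits 5-3, saved in the X register
        wide = opcode & 1                         # bit 0: word (1) or byte (0) operation
        m_register = "AX" if wide else "AL"       # M is forced to the accumulator here
        return alu_op, m_register, wide

    print(decode_alu_immediate(0x05))   # ('ADD', 'AX', 1) -- ADD AX, imm16
    print(decode_alu_immediate(0x2C))   # ('SUB', 'AL', 0) -- SUB AL, imm8

The same four micro-instructions shown earlier serve all eight ALU operations and both operand widths; only this decoding, done in hardware, changes.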

Die photo of the circuitry that implements the M and N registers. A multiplexer selects a source for the N register value and feeds it into the 5-bit N register. The M register is similar. Between the two registers is a "swap" circuit to swap the outputs of the two registers based on the instruction's "direction" bit. In this image, the metal layer has been dissolved with acid to show the transistors in the silicon layer underneath.

Pipelining

The 8086 documentation says this ADD instruction takes four clock cycles, and as we have seen, it is implemented with four micro-instructions. One micro-instruction is executed per clock cycle, so the timing seems straightforward. The problem, however, is that a micro-instruction can't be completed in one clock cycle. It takes a clock cycle to read a micro-instruction from the microcode ROM. Sending signals across an internal bus typically takes a clock cycle and other actions take more time. So a typical micro-instruction ends up taking 2½ clock cycles from start to end. One solution would be to slow down the clock, so the micro-instruction can complete in one cycle, but that would drastically reduce performance. A better solution is pipelining the execution so a micro-instruction can complete every cycle.5

The idea of pipelining is to break instruction processing into "stages", so different stages can work on different instructions at the same time. It's sort of like an assembly line, where a particular car might take an hour to manufacture, but a new car comes off the assembly line every minute. The diagram below shows a simple example. Suppose executing an instruction requires three steps: A, B, and C. Executing four instructions, as shown at the top would take 12 steps in total.

Diagram of a simple pipeline showing four instructions executing through three stages.

However, suppose the steps can execute independently, so step B for one instruction can execute at the same time as step A for another instruction. Now, as soon as instruction 1 finishes step A and moves on to step B, instruction 2 can start step A. Next, instruction 3 starts step A as instructions 2 and 1 move to steps B and C respectively. The first instruction still takes 3 time units to complete, but after that, an instruction completes every time unit, providing a theoretical 3× speedup.6 In a bit, I will show how the 8086 uses the idea of pipelining.
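
A quick way to see the arithmetic: with S stages and N instructions, an ideal pipeline takes roughly S + N - 1 steps instead of S × N. A tiny sketch (illustrative only, ignoring the hazards discussed in note 6):

    def steps_unpipelined(n_instructions, n_stages=3):
        return n_instructions * n_stages

    def steps_pipelined(n_instructions, n_stages=3):
        # the first instruction fills the pipeline, then one completes per step
        return n_stages + n_instructions - 1

    print(steps_unpipelined(4))   # 12 steps, as in the top of the diagram
    print(steps_pipelined(4))     # 6 steps with the three stages overlapped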

The prefetch queue

The 8086 uses instruction prefetching to improve performance. Prefetching is not the focus of this article, but a brief explanation is necessary. (I wrote about the prefetch circuitry in detail earlier.) Memory accesses on the 8086 are relatively slow (at least four clock cycles), so we don't want to wait every time the processor needs a new instruction. The idea behind prefetching is that the processor fetches future instructions from memory while the CPU is busy with the current instruction. When the CPU is ready to execute the next instruction, hopefully the instruction is already in the prefetch queue and the CPU doesn't need to wait for memory. The 8086 appears to be the first microprocessor to implement prefetching.

In more detail, the 8086 fetches instructions into its prefetch queue asynchronously from instruction execution: The "Bus Interface Unit" performs prefetches, while the "Execution Unit" executes instructions. Prefetched instructions are stored in the 6-byte prefetch queue. The Q bus (short for "Queue bus") provides bytes, one at a time, from the prefetch queue to the Execution Unit.7 If the prefetch queue doesn't have a byte available when the Execution Unit needs one, the Execution Unit waits until the prefetch circuitry can complete a memory access.

The loader

To decode and execute an instruction, the Execution Unit must get instruction bytes from the prefetch queue, but this is not entirely straightforward. The main problem is that the prefetch queue can be empty, blocking execution. Second, instruction decoding is relatively slow, so for maximum performance, the decoder needs a new byte before the current instruction is finished. A circuit called the "loader" solves these problems by using a small state machine (below) to efficiently fetch bytes from the queue at the right time.

The state machine for the 8086 "loader" circuit. I'm not going to explain how it works in this post, but the diagram looks pretty cool. From patent US4449184.

The loader generates two timing signals that synchronize instruction decoding and microcode execution with the prefetch queue. The FC (First Clock) indicates that the first instruction byte is available, while the SC (Second Clock) indicates the second instruction byte. Note that the First Clock and Second Clock are not necessarily consecutive clock cycles because the first byte could be the last one in the queue, delaying the Second Clock.

At the end of a microcode sequence, the Run Next Instruction (RNI) micro-operation causes the loader to fetch the next machine instruction. However, microcode execution would be blocked for a cycle due to the delay of fetching and decoding the next instruction. In many cases, this can be avoided: if the microcode knows that it is one micro-instruction away from finishing, it issues a Next-to-last (NXT) micro-operation so the loader can start loading the next instruction before the previous instruction finishes. As will be shown in the next section, this usually allows micro-instructions to run without interruption.

Instruction execution

Putting this all together, we can see how the ADD instruction is executed, cycle by cycle. Each clock cycle starts with the clock high (H) and ends with the clock low (L).8 The sequence starts with the prefetch queue supplying the ADD instruction across the Q bus in cycle 1. The loader indicates that this is First Clock and the instruction is loaded into the microcode address register. It takes a clock cycle for the address to exit the address register (as indicated by an arrow) along with the microcode counter value indicating step 0. To remember the ALU operation, bits 5-3 of the instruction are saved in the internal X register (unrelated to the AX register).

In cycle 2, the prefetch queue has supplied the second byte of the instruction so the loader indicates Second Clock. In the second half of cycle 2, the microcode address decoder has converted the instruction-based address to the micro-address 018 and supplies it to the microcode ROM.

In cycle 3, the microcode ROM outputs the micro-instruction at micro-address 018: Q→tmpBL, which will move a byte from the prefetch queue bus (Q bus) to the low byte of the ALU temporary B register, as described earlier. It takes a full clock cycle for this action to take place, as the byte traverses buses to reach the register. This micro-instruction also generates the L8 micro-op, which will branch if an 8-bit operation is taking place. As this is a 16-bit operation, no branch takes place.9 Meanwhile, the microcode address register moves to step 1, causing the decoder to produce the micro-address 019.

This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character µ is short for "micro".

In cycle 4, the prefetch queue provides a new byte, the high byte of the immediate value. The microcode ROM outputs the micro-instruction at micro-address 019: Q→tmpBH, which will move this byte from the prefetch queue bus to the high byte of the ALU temporary B register. As before, it takes a full cycle for this move to complete. Meanwhile, the microcode address register moves to step 2, causing the decoder to produce the micro-address 01a.

In cycle 5, the microcode ROM outputs the micro-instruction at micro-address 01a: M→tmpA,XI tmpA,NXT. Since the M (source) register specifies AX, the contents of the AX register will be moved into the ALU tmpA register, but this will take a cycle to complete. The XI tmpA part starts decoding the ALU operation saved in the X register, in this case ADD. Finally, NXT indicates that the next micro-instruction is the last one in this instruction. In combination with the next instruction on the Q bus, this causes the loader to issue First Clock. This starts execution of the next machine instruction, even though the current instruction is still executing.

In cycle 6, the microcode ROM outputs the micro-instruction at micro-address 01b: Σ→M,RNI. This will store the ALU output into the register indicated by M (i.e. AX), but not yet. In the first half of cycle 6, the ALU decoder determines the ALU control signals that will cause an ADD to take place. In the second half of cycle 6, the ALU receives these control signals and computes the sum. The RNI (Run Next Instruction) and the second instruction byte from the prefetch queue cause the loader to issue Second Clock, and the micro-address for the next machine instruction is sent to the microcode ROM.

Finally, in cycle 7, the sum is written to the AX register and the flags are updated, completing the ADD instruction. Meanwhile, the next instruction is well underway with its first micro-instruction being executed.

As you can see, execution of a micro-instruction is pipelined, with three full clock cycles from the arrival of an instruction until the first micro-instruction completes in cycle 4. Although this system is complex, in the best case it achieves the goal of running a micro-instruction each cycle, without gaps. (There are gaps in some cases, most commonly when the prefetch queue is empty. A gap will also occur if the microcode control flow doesn't allow a NXT micro-instruction to be issued. In that case, the loader can't issue First Clock until the RNI micro-instruction is issued, resulting in a delay.)

Conclusions

The 8086 uses multiple types of pipelining to increase performance. I've focused on the pipelining at the microcode level, but the 8086 uses at least four interlocking types of pipelining. First, microcode pipelining allows micro-instructions to complete at the rate of one per clock cycle, even though it takes multiple cycles for a micro-instruction to complete. Admittedly, this pipeline is not very deep compared to the pipelines in RISC processors; the 8086 designers called the overlap in the microcode ROM a "sort of mini-pipeline."10

The second type of pipelining overlaps instruction decoding and execution. Instruction decoding is fairly complicated on the 8086 since there are many different formats of instructions, usually depending on the second byte (Mod R/M). The loader coordinates this pipelining, issuing the First Clock and Second Clock signals so decoding on the next instruction can start before the previous instruction has completed. Third is the prefetch queue, which overlaps fetching instructions from memory with execution. This is accomplished by partitioning the processor into the Bus Interface Unit and the Execution Unit, with the prefetch queue in between. (I recently wrote about instruction prefetching in detail.)

There's a final type of pipelining that I haven't discussed. Inside the memory access sequence, computing the memory address from a segment register and offset is overlapped with the previous memory access. The result is that memory accesses appear to take four cycles, even though they really take six cycles. I plan to write more about memory access in a later post.

The 8086 was a large advance in size, performance, and architecture compared to earlier microprocessors such as the Z80 (1976), 8085 (1977), and 6809 (1978). As well as moving to 16 bits, the 8086 had a considerably more complex architecture with instruction prefetching and microcode, among other features. At the same time, the 8086 avoided the architectural overreach of Intel's ill-fated iAPX 432, a complex processor that supported garbage collection and objects in hardware. Although the 8086's architecture had flaws, it was a success and led to the x86 architecture, still dominant today.

I plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space. If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier.

Notes and references

  1. The lowest bit of many 8086 instructions selects if the instruction operates on a byte or a word. Thus, many instructions in the instruction set appear in pairs. The support for byte operations gave the 16-bit 8086 processor compatibility with the older 8-bit 8080, if assembly code was suitably translated. 

  2. The microcode for an ALU operation can select the first operand from tmpA, tmpB, or tmpC. The second operand is always tmpB. 

  3. I don't know why Intel used XI to indicate the ALU opcode. I don't think it's the Greek letter Ξ, although they did use Σ (sigma) for the ALU output. The opcode is stored in the X register, so maybe XI is X Instruction? (It's also unclear why the register is called X.) 

  4. Normally, the internal M register specifies the source register and the N register specifies the destination register, and these two registers are loaded from the instruction. However, some instructions only use the A or AX register, depending on whether the instruction acts on bytes or words. These instructions are the ALU immediate instructions, accumulator move instructions, string instructions, and the TEST, IN, and OUT instructions. For these instructions, the Group Decode ROM activates a signal that forces the M register to specify the AX register for a 16-bit operation, or the A register for an 8-bit operation. Thus, by specifying the M register in the microcode above, the same microcode is used for instructions with an 8-bit immediate argument or a 16-bit immediate argument. This also illustrates how the designers of the 8086 kept the microcode small by moving a lot of logic into hardware. 

  5. I should mention that the pipelining in the 8086 is completely different from the parallelism in modern superscalar CPUs. The 8086 is executing instructions linearly, step-by-step, even though instructions overlap. There is only one execution path and no speculative execution, for instance. 

  6. I showed a theoretical speedup from pipelining. Several issues make the real speedup smaller. First, the steps of an instruction typically don't take the same amount of time, so you're limited by the slowest step. Second, the overhead to handle the steps adds some delay. Finally, conflicts between instructions and other "hazards" may prevent overlap in various cases. 

  7. The interaction between the prefetch queue and the Execution Unit is a "push" model rather than a "pull" model. If the prefetch queue contains a byte, the prefetch circuitry puts the byte on the Q bus and lets the Execution Unit know that a byte is available. The Execution Unit signals the prefetch circuitry when it uses a byte, and the prefetch queue moves to the next byte in the queue. If the Execution Unit needs a byte and it isn't ready, it blocks until a byte is available. The prefetch queue loads new words as it empties, when the memory bus isn't in use for other purposes. 

  8. The 8086 is active during both phases (low and high) of the clock, with things happening both while the clock is high and while it is low. One unusual feature of the 8086 is that the clock signal is asymmetrical with a 33% duty cycle, so the clock is low for twice as long as the clock is high. In other words, the 8086 does twice as much (by time) during the low part of the clock cycle as during the high part of the clock cycle. There are multiple reasons why actions take a full clock cycle to complete. Much of the circuitry uses edge-triggered flip-flops to hold state. These latch data on one clock edge and move data internally during the other part of the clock. (The 8086 uses both positive-edge and negative-edge triggered flip flops; some latch when the clock goes high and others latch when the clock goes low.) Many control signals have their voltage level boosted by a bootstrap driver circuit, driven by the clock.

    Many buses are precharged during one clock phase and then transmit a signal during the other phase. The motivation behind precharging the bus is that NMOS transistors are much better at pulling a line low than pulling it high (i.e. they can provide more current). This especially affects buses because they have relatively high capacitance due to their length, so pulling the bus high is slow. Thus, the bus is "leisurely" precharged to a high state during one clock phase, and then it can be rapidly pulled low (if the bit is a 0) and transmit the data during the other clock phase. 

  9. You might expect that the 8-bit ADD would be faster than the 16-bit ADD since it is a 2-byte instruction instead of a 3-byte instruction and one micro-instruction is skipped. However, both the 8-bit and the 16-bit ADD instructions take 4 cycles. The reason is that branching to a new micro-instruction requires updating the microcode address register, which takes a clock cycle, resulting in a wasted clock cycle where no micro-instruction is executed. (Specifically, the next micro-instruction is on the way, so it is blocked by the ROM Enable (ROME) signal going low.) The result of this is that the branch for an 8-bit ADD costs an extra cycle, which cancels out the saved cycle. (In practice, the 16-bit instruction might be slower because it needs one more byte from the prefetch queue, which could cause a delay.) Just as a branch in the machine instructions can cause a delay (a "bubble") in the instruction pipeline, a branch in the microcode causes a delay in the micro-instruction pipeline. 

  10. The design decisions for the 8086 are described in: J. McKevitt and J. Bayliss, "New options from big chips," in IEEE Spectrum, vol. 16, no. 3, pp. 28-34, March 1979, doi: 10.1109/MSPEC.1979.6367944. 






Subject: Inside the 8086 processor's instruction prefetch circuitry


The groundbreaking 8086 microprocessor was introduced by Intel in 1978 and led to the x86 architecture that still dominates desktop and server computing. One way that the 8086 increased performance was by prefetching: the processor fetches instructions from memory before they are needed, so the processor can execute them without waiting on the (relatively slow) memory. I've been reverse-engineering the 8086 from die photos and this blog post discusses what I've uncovered about the prefetch circuitry.

The 8086 was introduced at an interesting point in microprocessor history, where memory was becoming slower than the CPU. For the first microprocessors, the speed of the CPU and the speed of memory were comparable.1 However, as processors became faster, the speed of memory failed to keep up. The 8086 was probably the first microprocessor to prefetch instructions to improve performance. While modern microprocessors have megabytes of fast cache2 to act as a buffer between the CPU and much-slower main memory, the 8086 has just 6 bytes of prefetch queue. However, this was enough to increase performance by about 50%.3

The die photo below shows the 8086 microprocessor under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. I've labeled the key functional blocks; ones that are important to the prefetch queue are highlighted in red and will be discussed in detail below. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles memory accesses, while the Execution Unit (EU) executes instructions.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

Prefetching and the architecture of the 8086

Prefetching had a major impact on the design of the 8086. Earlier processors such as the 6502, 8080, or Z80 were deterministic. The processor fetched an instruction, executed the instruction, fetched the next instruction, and so forth. Memory accesses corresponded directly to instruction fetching and execution and instructions took a predictable number of clock cycles. This all changed with the introduction of the prefetch queue. Memory operations became unlinked from instruction execution since prefetches happen as needed and when the memory bus is available.

Since memory operations and instruction execution happen independently, the implementors of the 8086 split the chip into two processing units: the Bus Interface Unit (BIU) that handles memory accesses, and the Execution Unit (EU) that executes instructions, as shown below.4 The Bus Interface Unit contains the 6-byte instruction prefetch queue; it supplies instructions to the Execution Unit via the Q (queue) bus. The adder (Σ) performs address calculation, adding the segment register base to an address offset, among other things. The Execution Unit is what comes to mind when you think of a processor: it has most of the registers, the arithmetic/logic unit (ALU), and the microcode that implements instructions. The address adder and the ALU are independent arithmetic units. The segment registers (CS, DS, SS, ES) and the Instruction Pointer (IP) are in the Bus Interface Unit since they are directly involved in memory accesses, while the general-purpose registers are in the Execution Unit.

Block diagram of the 8086 processor. This diagram differs from most 8086 block diagrams because it shows the actual physical implementation, rather than the programmer's view of the processor. The "Internal Communication Registers" consist of the Indirect Register (IND) and the Operand Register (OPR). These hold a memory address and memory data value respectively. From The 8086 Family User's Manual.

The 8086's segment registers play an important part in this architecture, so I'll review them quickly. One of the challenges of the 8086 was how to support more than 64K of memory with 16-bit registers. The much-reviled solution was to create a 1-megabyte (20-bit) address space consisting of 64K segments, with segment registers indicating the start of each segment. Specifically, a memory address was specified by a 16-bit offset address along with a particular segment register selecting a segment (Code Segment, Data Segment, Stack Segment, or Extra Segment). The segment register's value was shifted by 4 bits to give the segment's 20-bit base address. The 16-bit offset address was added, yielding a 20-bit memory address. This gave the processor a 1-megabyte address space, although only 64K could be accessed without changing a segment register.
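
The address calculation itself is simple arithmetic; here is a short sketch of the standard 8086 behavior (the example values are my own):

    def physical_address(segment, offset):
        """Form a 20-bit 8086 address from a 16-bit segment and 16-bit offset."""
        return ((segment << 4) + offset) & 0xFFFFF   # shift by 4 bits = multiply by 16

    print(hex(physical_address(0x1234, 0x0010)))   # 0x12350
    print(hex(physical_address(0xFFFF, 0x0010)))   # 0x0 -- wraps: only 20 address lines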

It may seem inefficient for the Bus Interface Unit to have its own adder instead of using the ALU, but there are a couple of reasons for the separate adder. First, every memory access uses the adder at least once to add the segment base and offset. The adder is also used to increment the PC or index registers. Since these operations are so frequent, they would create a bottleneck if they used the ALU. Second, since the Execution Unit and the Bus Interface Unit run asynchronously with respect to each other, it would be complicated to share the ALU without causing delays and conflicts.

Prefetching had another major but little-known effect on the 8086 architecture: the designers were considering making the 8086 a two-chip microprocessor. Prefetching, however, required a one-chip design because the number of control signals required to synchronize prefetching across two chips exceeded the package pins available. This became a compelling argument for the one-chip design that was used for the 8086.3 (The unsuccessful Intel iAPX 432, which was under development at the same time, ended up being a two-chip processor: one to fetch and decode instructions, and one to execute them.)

Implementing the queue

The instruction prefetch queue is implemented with three 16-bit queue registers along with two hardware pointers that keep track of the current position in the queue. One two-bit counter keeps track of the current read position from 0 to 3, i.e. the queue register that will provide the next instruction. The second counter keeps track of the current write position, i.e. the queue register that will receive the next instruction from memory. As words are fetched from the queue, the read pointer advances. As words are added to the queue, the write pointer advances. Because the queue registers hold words, while the prefetch circuitry provides bytes, another flip-flop keeps track of whether the high byte or the low byte of the word is being used. I call this the HL flip-flop. This causes the low byte to be provided first and then the high byte (since the 8086 is little-endian).

The diagram below shows an example queue configuration with four bytes. The first two queue registers (Q0 and Q1) hold data. The read pointer and HL pointer indicate that the next prefetched byte will come from the low byte of Q0. The write pointer indicates that the next prefetched word will go into Q2.

A queue configuration with four bytes in the prefetch queue. Bytes in blue hold prefetched data.

The diagram below shows how the queue pointers can wrap around. In this configuration, one byte has been used from Q2 so the next byte will be Q2's high byte. Q0 holds the next prefetched word. The next word to be prefetched will be stored in Q1, as indicated by the write pointer.

A queue configuration with three bytes in the prefetch queue.

The relative positions of the write and read pointers indicate how much data is in the queue. If the write pointer is one position before the read pointer (modulo 3), the queue holds 3 or 4 bytes. If the write pointer is one position after the read pointer (modulo 3), the queue holds 1 or 2 bytes. But what about when the read pointer and write pointer indicate the same register? This can either indicate that the queue is empty or that the queue is full (5 or 6 bytes). To distinguish these cases, a flip-flop is set if the queue enters the empty state. This flip-flop generates a signal that Intel called MT (empty).

Another complication occurs if you jump to an odd address. Because of its 16-bit bus, the 8086 prefetches the word starting at the even address one below, loading one usable byte and one byte that must be discarded. The 8086 handles this case with a handful of gates that detect the odd jump and set the HL flip-flop high. As in the diagram above, the unwanted low byte is then skipped.
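
Putting the pieces together, here is a toy software model of the queue bookkeeping. It reflects my reading of the scheme, not a gate-level description; the class, method, and field names are mine.

    class PrefetchQueue:
        """Toy model of the 8086's 3-word (6-byte) prefetch queue."""
        def __init__(self):
            self.regs = [[0, 0], [0, 0], [0, 0]]   # three word registers, [low, high]
            self.read_ptr = 0    # which register supplies the next byte
            self.write_ptr = 0   # which register receives the next prefetched word
            self.hl = 0          # 0 = low byte next, 1 = high byte next
            self.mt = True       # queue-empty flag

        def word_count(self):
            if self.mt:
                return 0
            return (self.write_ptr - self.read_ptr) % 3 or 3   # equal pointers, not empty: full

        def write_word(self, low, high):
            assert self.word_count() < 3, "queue full"
            self.regs[self.write_ptr] = [low, high]
            self.write_ptr = (self.write_ptr + 1) % 3
            self.mt = False

        def read_byte(self):
            assert not self.mt, "queue empty"
            byte = self.regs[self.read_ptr][self.hl]
            if self.hl:                                  # finished the high byte: advance a word
                self.read_ptr = (self.read_ptr + 1) % 3
                if self.read_ptr == self.write_ptr:
                    self.mt = True                       # consumed the last word
            self.hl ^= 1
            return byte

    q = PrefetchQueue()
    q.hl = 1                    # jump to an odd address: skip the fetched word's low byte
    q.write_word(0x05, 0x34)    # prefetched word; only the high byte (0x34) is wanted
    print(hex(q.read_byte()))   # 0x34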

The diagram below zooms in on the prefetch and queue control circuitry on the die, with the main flip-flops and circuitry labeled. The lower half manages the queue, keeping track of the read and write positions and computing the queue length. The upper circuitry controls prefetch operations and interacts with the rest of the memory cycle circuitry.

The queue and prefetch circuitry on the die. The metal layer has been removed for the closeup to show the silicon of the underlying transistors.

Even though there is not a lot of circuitry involved (about a dozen flip-flops and associated logic gates), this circuitry occupies a substantial part of the 8086 die. (The relatively small amount of circuitry did not make this easy to reverse-engineer, however!) Compared to modern chips, the density of the 8086 is very low; you can almost see the flip-flops with the naked eye. This diagram only shows the circuitry directly involved in prefetching. Additional circuitry is scattered through the memory cycle control circuitry to deal with prefetching, and the queue registers take up a substantial part of the register file. Thus, prefetching was a moderately expensive feature for the 8086, as far as die area.

The loader

To decode and execute an instruction, the Execution Unit must get instruction bytes from the Bus Interface Unit, but this is not entirely straightforward. The main problem is that the queue can be empty, in which case instruction decoding must block until a byte is available from the queue. The second problem is that instruction decoding is relatively slow, so for maximum performance, the decoder needs a new byte before the current instruction is finished. A circuit called the "loader" solves these problems by providing synchronization between the prefetch queue and the instruction decoder. The loader uses a small state machine to efficiently fetch bytes from the queue at the right time and to provide timing signals to the decoder and microcode engine.

In more detail, as the loader requests the first two instruction bytes from the prefetch queue, it generates two timing signals that control the microcode execution. The FC (First Clock) indicates that the first instruction byte is available, while the SC (Second Clock) indicates the second instruction byte. Note that the First Clock and Second Clock are not necessarily consecutive clock cycles because the prefetch queue could be empty or contain just one byte, in which case the First Clock and/or Second Clock would be delayed. The instruction decoding circuitry and the microcode engine are controlled by the First Clock and Second Clock signals, so they remain synchronized with the bytes supplied by the prefetch queue.

At the end of a microcode sequence, the Run Next Instruction (RNI) micro-operation causes the loader to fetch the next machine instruction. However, fetching and decoding the next instruction is a bit slow so microcode execution would be blocked for a cycle. In many cases, this slowdown can be avoided: if the microcode knows that it is one micro-instruction away from finishing, it issues a Next-to-last (NXT) micro-operation so the loader can start loading the next instruction. This achieves a degree of pipelining in most cases; fetching the next instruction is overlapped with finishing the execution of the previous instruction.

The state machine for the 8086 "loader" circuit. The 1BL signal indicates a 1-byte instruction implemented in logic rather than microcode. From patent US4449184.

The diagram above shows the state machine for the loader. I won't explain it in detail, but essentially it keeps track of whether it is waiting for a First Clock byte or a Second Clock byte, and if it is performing a fetch in advance (NXT) or at the end of an instruction (RNI). The state machine is implemented with two flip-flops to support its four states.

Other memory accesses

The loader takes care of fetching an instruction that consists of an opcode byte and a Mod R/M (addressing mode) byte. However, many instructions have additional bytes or don't follow this format. For example, an opcode such as "ADD AX" can be followed by an 8- or 16-bit immediate value, adding that value to the AX register. Or a "move memory to AX" instruction can be followed by a 16-bit memory address. The microcode uses a separate mechanism for fetching these instruction bytes from the queue. Specifically, each micro-instruction contains a source register and a destination register that specify a data move. By specifying "Q" (the queue) as the source, a byte is fetched from the prefetch queue.

A third path is used for arbitrary memory reads and writes, such as when an instruction stores a register's contents to memory. In this case, the microcode puts the memory address in the IND (indirect) register. The microcode then issues a read or write micro-operation which causes the memory contents to be read into the OPR (operand register) or written from the OPR. (The IND and OPR registers are internal 8086 registers that are not visible to the programmer.) In the 8086, a memory cycle takes at least four clock cycles (called T1 through T4), including adding the segment register to compute the memory address. An "unaligned" memory access takes twice as long, though, because the 8086 has a word-based 16-bit data bus. Thus, if you try to access a word from an odd address, two memory accesses are required, one for the first byte and one for the second byte.5
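
To summarize the timing rule in code, here is a rough cost model of my own; it counts only the minimum bus cycles and ignores wait states and contention with prefetching:

    def word_access_cycles(address, base_cycles=4):
        """Minimum bus cycles to read or write a 16-bit word on the 8086."""
        if address % 2 == 0:
            return base_cycles      # aligned word: one bus cycle (T1 through T4)
        return 2 * base_cycles      # odd address: two bus cycles, one per byte

    print(word_access_cycles(0x1000))   # 4
    print(word_access_cycles(0x1001))   # 8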

As you can see, a memory access is a fairly complex operation. In the 8086, all these steps are done by hardware in the Bus Interface Unit, rather than being performed by microcode. (I'll discuss the complex memory control circuitry in detail in a future post.) After issuing a memory read or write, the microcode engine is blocked until the memory request completes.

Microcode instructions and the correction circuitry

The microcode interacts with prefetching in several ways. In addition to requesting a byte from the queue (as discussed above), microcode can perform three micro-instructions that involve prefetching: SUSP, FLUSH, and CORR. The SUSP (suspend) micro-instruction stops prefetching, typically before a change to execution flow. The FLUSH micro-instruction flushes the prefetch queue and resumes prefetching. To implement these, the prefetching circuitry has a flip-flop to keep track of the suspended state, and logic to reset the queue pointer counters to flush the queue.

The CORR (correct) micro-instruction corrects the Instruction Pointer to point to the next execution position. This is an interesting and more complicated micro-instruction. Like most processors, the 8086 has a program counter (PC) to keep track of what instruction to execute; the 8086 calls this the Instruction Pointer (IP). In the programmer's view, the Instruction Pointer points to the memory address of the next instruction to execute. However, in the hardware, the Instruction Pointer points to the next instruction to be fetched, which is generally several bytes after the next instruction to be executed.6

For the most part, this doesn't matter; the queue provides instructions in the order they were fetched and it doesn't matter if the Instruction Pointer runs ahead. However, there are a few cases where the "real" Instruction Pointer address is needed. For example, a relative jump instruction causes execution to jump to an address relative to the current instruction. When performing a subroutine call, the return address must be pushed on the stack. The correct Instruction Pointer value is also needed for an interrupt. Thus, the 8086 needs a mechanism to compute the real Instruction Pointer value from the value in the Instruction Pointer register.

The solution is the CORR micro-instruction, which corrects the Instruction Pointer value by subtracting the prefetch queue length, so the Instruction Pointer holds the "true" value. For instance, if there are 4 bytes in the queue, then the address in the Instruction Pointer register is four more than the desired Instruction Pointer address. The Bus Interface Unit performs this subtraction by using the addressing adder and a small table of constants called the Constant ROM.7
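
In software terms, the correction is simply a subtraction of the current queue length; the hardware performs it by adding a negative constant from the Constant ROM using the addressing adder. A minimal sketch (hypothetical function name, queue length in bytes):

    def corrected_ip(fetch_ip, queue_length):
        """Compute the "true" Instruction Pointer seen by the programmer.

        fetch_ip     -- hardware IP, which points at the next byte to fetch
        queue_length -- number of prefetched but unexecuted bytes (0..6)
        """
        return (fetch_ip - queue_length) & 0xFFFF   # 16-bit wrap-around

    # Example from the text: 4 bytes in the queue
    print(hex(corrected_ip(0x0104, 4)))   # -> 0x100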

The diagram below zooms in on the Constant ROM, located next to the adder. The Constant ROM is implemented as a PLA (programmable logic array), a two-level structured arrangement of gates. The first level (bottom) selects the desired correction constant, while the second level (middle) generates the bits of the constant: three bits plus a sign bit. The necessary correction constant is selected based on the length of the queue in words, the HL pointer, and the empty (MT) flag.

The Constant ROM, highlighted on the die.

The Constant ROM is used for more than just address correction. For example, it is also used to increment the Instruction Pointer by 2 after a prefetch. Other constants are used for the 8086's string operations, which act on a block of memory. The index registers are incremented or decremented by 1 for bytes or 2 for words. When pushing a value onto the stack, the stack pointer is decremented, which uses the constant -2. Additional constants are required to increment and decrement the IND register when accessing words from unaligned (odd) addresses. These increment/decrement values are selected in the upper part of the Constant ROM. In total, the Constant ROM holds values from -6 to +2.

Policy

There are some "policy" decisions on prefetching, and it's interesting to see how the 8086 implements them. Prefetching is not free: there is a tradeoff when performing a prefetch between saving time later versus delaying memory accesses from an executing instruction. Moreover, if a jump operation takes place, the prefetch queue is discarded and the memory cycles spent filling it are wasted. Thus, the length of the queue is an "extremely tricky design problem, because performance can deteriorate if the queue is too long as well as if it is too short."3

Intel performed simulations to determine the best queue length. A 4-byte queue provided a large benefit, while a 6-byte queue (which they chose) was slightly better. The designers were surprised to find that performance flattened out after that; they expected a much longer queue would be necessary. The 8088 processor has only a 4-byte prefetch queue because its 8-bit bus changes the tradeoffs.8

The basic prefetch policy is that if a memory access and a prefetch are requested at the same time, the memory access "wins", since it is guaranteed to be useful while the prefetch is just speculative. If the queue holds 0 to 2 bytes, prefetch happens during the next free memory cycle. If the queue holds 5 or 6 bytes, no prefetch can happen, since prefetch fetches a word at a time and there isn't room for another word. However, if the queue holds 3 or 4 bytes, prefetch is delayed for two clock cycles, which is an interesting choice. This gives an instruction more opportunities to perform a memory operation without being delayed by a prefetch. There is a tradeoff: delaying the prefetch may waste two cycles of memory bandwidth, but performing the prefetch might waste four cycles of memory bandwidth. The motivation for this delay is that the last two bytes in the queue are less valuable because they are more likely to be discarded.
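
This policy can be summarized as a small decision function. The sketch below is my reading of the rules above, not a cycle-accurate model; the two-cycle delay is represented simply as a return value.

    def prefetch_decision(queue_bytes, instruction_wants_bus):
        """Decide what the Bus Interface Unit does this cycle (simplified).

        Returns one of: 'instruction', 'prefetch', 'delay', 'idle'."""
        if instruction_wants_bus:
            return 'instruction'      # real memory accesses always win
        if queue_bytes <= 2:
            return 'prefetch'         # plenty of room: fetch a word now
        if queue_bytes <= 4:
            return 'delay'            # 3-4 bytes: wait two cycles first
        return 'idle'                 # 5-6 bytes: no room for another word

    print(prefetch_decision(1, False))   # -> 'prefetch'
    print(prefetch_decision(4, False))   # -> 'delay'
    print(prefetch_decision(6, False))   # -> 'idle'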

Another policy decision is how to handle a change in execution flow, such as a jump or subroutine call. The 8086 simply discards the prefetch queue and starts fetching from the new address. The 8086 designers considered better ways of handling jumps, but they weren't practical to implement at the time. The queue isn't smart enough to notice when the jump target is already in the queue (e.g. jumping forward a couple of bytes). There is also no branch prediction; prefetching proceeds linearly regardless of branch instructions.

The 8086 does nothing to ensure consistency between the prefetch queue and memory if a prefetched instruction is modified in memory.9 In this case, the "stale" instruction in the queue is executed. This situation may seem contrived, but self-modifying code, where a program changes its own instructions, used to be fairly popular.10

Prefetching and the 8087 coprocessor

One feature of the 8086 microprocessor is that it supports coprocessors such as the 8087 floating point chip.11 The 8087 implements high-performance floating-point computation, performing arithmetic and transcendental computations up to 100 times faster than the 8086. The 8087 gets instructions in an interesting fashion, executing floating-point instructions from the 8086's instruction stream. Specifically, an "ESCAPE" opcode indicates an instruction that is performed by the 8087 rather than the 8086. However, prefetching adds a lot of complexity to the coprocessor because the 8087 monitors the bus to determine when it should execute an instruction. With prefetching, the instruction on the bus doesn't match the instruction being executed. An instruction may be executed many cycles after it was fetched over the bus. A prefetched instruction may even be discarded and never executed.

To solve this problem, the 8087 manages its own copy of the prefetch queue to determine when the 8086 would be executing a floating-point instruction. The 8087 watches the bus to see when instructions are prefetched. The 8086 provides queue status signals (QS0 and QS1) to indicate when it takes bytes from the queue or flushes the queue. These signals allow the 8087 coprocessor to keep track of the 8086's queue state so it can tell what instruction the 8086 is executing. In other words, the 8086 doesn't tell the 8087 coprocessor what to do; instead, the two chips process the instruction stream in parallel. Another complication is the 8087 coprocessor can be used with the 8088 processor chip, which has a smaller 4-byte queue. Thus, the 8087 coprocessor must detect whether it is connected to an 8086 or an 8088 and maintain its queue appropriately.
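
Conceptually, the 8087 maintains a "shadow" copy of the 8086's queue and updates it from the bus and the queue-status pins. The sketch below models that bookkeeping with symbolic events rather than the actual QS0/QS1 encodings (which I'm not detailing here); the class and method names are illustrative only.

    class ShadowQueue:
        """An 8087-style model of the host CPU's prefetch queue (simplified)."""

        def __init__(self, is_8088):
            self.capacity = 4 if is_8088 else 6    # the 8088 has a smaller queue
            self.bytes = []                        # prefetched opcode bytes

        def on_prefetch(self, data):
            """A word (8086) or byte (8088) was seen being prefetched on the bus."""
            self.bytes.extend(data[:self.capacity - len(self.bytes)])

        def on_queue_status(self, event):
            """Handle a queue-status event decoded from the QS pins (symbolic here)."""
            if event == 'flush':
                self.bytes.clear()                 # jump taken: discard everything
            elif event == 'byte_taken' and self.bytes:
                return self.bytes.pop(0)           # the CPU consumed this byte
            return None

    q = ShadowQueue(is_8088=False)
    q.on_prefetch([0xD9, 0xE8])                    # e.g. an ESCAPE opcode pair
    print(hex(q.on_queue_status('byte_taken')))    # -> 0xd9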

Brief history

Caching and prefetching were used in mainframe computers dating back to the 1960s. For instance, the IBM System/360 Model 91 (1966) had a cache with prefetching. Minicomputers such as the VAX 11/780 (1977) later used caching and prefetching. However, these features took a while to trickle down to microprocessors. The Motorola 68000 (1980) had a 4-byte prefetch queue. As far as I can tell, the 8086 was the first microprocessor with a prefetch queue.

We can view the 8086 as a stepping-stone towards the large caches first used externally in the 80386 and internally in the 486. The 80186 and 80286 kept the 6-byte prefetch buffer size of the 8086. The 80386 has a 16-byte prefetch buffer, although apparently due to a bug it was shrunk to 12 bytes in later revisions. As well as the prefetch queue, the 80386 supported an external cache.

Early microprocessors such as the 6502 or Z80 could fetch the next instruction while they were finishing the previous instruction. This minimal two-stage pipelining improved performance, but was much more limited than 8086-style prefetching. An Intel study found that this simple overlap provides a 35% performance increase with 15% more hardware, while implementing prefetching provided an additional 11% gain with 14% more hardware.3 This illustrates how the increasing transistor counts from Moore's law opened up new opportunities to improve performance. But it also shows diminishing returns as performance increases become smaller and more expensive.

Conclusions

Well, this was supposed to be a quick post about the prefetch queue, but the topic turned out to have a lot more complexity than I expected. A six-byte prefetch queue may seem like a simple feature to add to a processor, but it affects many parts of the system. Prefetching is tied closely to the memory access circuitry, of course, but it also required a Constant ROM to handle the difference between the execution address and the prefetch address. Prefetching also impacted the microcode, with three micro-instructions to support prefetching.

Prefetching also illustrates some of the ways that each feature and corner case of a processor like the 8086 leads to more complexity. For instance, byte-aligned (rather than word-aligned) instructions require a mechanism to fetch bytes as well as words. Supporting multiple instruction formats (1-byte opcodes, an opcode byte followed by a Mod R/M byte, multi-byte instructions) resulted in the loader state machine. The segment registers required an adder to compute the memory address for every access. Looking at the 8086 internals makes it easier to understand the motivation behind RISC processors, discarding the complexity and corner cases to create a simpler but faster processor.

I plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space. If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier.

Notes and references

  1. Steve Furber, co-creator of the ARM chip, mentions that "The first integrated CPUs were coincidentally quite well matched to semiconductor memory speeds, and were therefore built without caches. This can now be seen as a temporary aberration." See VLSI Risc Architecture and Organization p77. To make this concrete, the Apple II (1977) used a MOS 6502 processor running at about 1 megahertz while its 4116 DRAM chips could perform an access in 250 nanoseconds (a quarter of the 1000 ns clock period). The 8086 processor ran at 5-10 MHz, which meant that a 250 ns DRAM access was slower than a single clock cycle (100-200 ns). Nowadays, processors run at 4 GHz but DRAM access takes about 50 nanoseconds (roughly 200 clock cycles). 

  2. Modern processors use caches to improve memory performance; caches are often megabytes in size. Accessing data from a cache is faster than accessing it from main memory, but the tradeoff is that caches are smaller. The 8086's prefetch queue is similar to a cache in some ways, but there are some key differences. First, the prefetch queue is strictly sequential. If you jump ahead two bytes, even if the prefetch queue has those instruction bytes, the processor can't use them. Second, the prefetch queue can't reuse bytes. If you have a 6-byte loop, even though all the code fits in the prefetch queue, it will be reloaded every time. Third, the prefetch queue doesn't provide any consistency. If you modify an instruction in memory a couple of bytes ahead of the PC, the 8086 will run the old instruction if it's in the queue. 

  3. The design decisions for the 8086 prefetch cache (and many other aspects of the chip) are described in: J. McKevitt and J. Bayliss, "New options from big chips," in IEEE Spectrum, vol. 16, no. 3, pp. 28-34, March 1979, doi: 10.1109/MSPEC.1979.6367944. 

  4. A detailed block diagram of the 8086 is provided in the patent. Conveniently, the layout of the diagram is close to the physical layout of the chip.

    Detailed block diagram of the 8086, based on patent US4449184. I have modified the register names to match the common naming.

    I won't discuss this block diagram in detail here, but I'll point out the Q (queue) control logic in the upper center, with its associated read and write pointers. The Q bus connects the queue to various parts of the instruction decoding circuitry.

     

  5. Supporting misaligned memory accesses (i.e. accessing a word at an odd address) adds complexity to the 8086. It's not surprising that many RISC chips prohibit unaligned accesses. On SPARC, for instance, an unaligned access fails with a "bus error", which the Sun programmers out there probably recognize. ARM processors before ARMv7 didn't support unaligned accesses. RISC-V supports misaligned data accesses but not misaligned instructions. 

  6. The 8086 patent describes how the program counter in the 8086 does not hold the "real" value:

    PC is not a real or true program counter in that it does not, nor does any other register within CPU, maintain the actual execution point at any time. PC actually points to the next byte to be input into queue. The real program counter is calculated by instruction whenever a relative jump or call is required by subtracting the number of accessed instructions still remaining unused in queue from PC.
     

  7. The CORR correction operation adds more complexity to the system than you might expect, with synchronization between the Bus Interface Unit and the Execution Unit. Because the correction computation uses the addressing adder, the correction operation must be synchronized with memory accesses that also use the adder. To accomplish this, the Bus Interface Unit waits until any memory operation is finished and then generates two clock cycles of "fake" memory operation, keeping the adder free for the CORR instruction. As a result, the memory control circuitry needs logic to implement this memory cycle. Meanwhile, the microcode engine is stopped until the CORR instruction completes, requiring synchronization circuitry. 

  8. The 8088 is famous as the processor in the original IBM PC. The 8088 processor is essentially the same as the 8086 except that it has an 8-bit data bus instead of a 16-bit data bus, so it performs memory accesses a byte at a time instead of a word at a time. Internally, the 8088 is nearly identical to the 8086 but there are a few differences in microcode and in the bus circuitry. The most visible difference is that the 8088 has a 4-byte prefetch queue instead of a 6-byte prefetch queue. Simulations showed that a 4-byte queue was sufficient for the 8088. Because it fetches one byte at a time instead of two bytes, the 8088 fills the prefetch queue more slowly and wouldn't get much benefit from the larger queue. I haven't looked at the 8088's prefetch circuitry in detail, so I can't describe it exactly. 

  9. At some point, Intel implemented consistency between cached instructions and memory. This ensures that self-modifying code will run the latest version of an instruction rather than a stale instruction in the cache. I couldn't determine exactly when this was implemented; various sources say the 486, the Pentium, or the Pentium Pro. (If you have a definitive answer, please let me know.) 

  10. Self-modifying code can be used as a way to distinguish between the 8086 and 8088 chips in software. Since the 8086 has a 6-byte queue and the 8088 has a 4-byte queue, you can create self-modifying code that will run a prefetched instruction on the 8086 but run the modified instruction on the 8088. 

  11. Although the 8087 is the most well-known coprocessor for the 8086, it was not the only coprocessor. The Intel 8089 input/output coprocessor provided mainframe-style I/O channels, offloading I/O processing from the 8086. More than just a DMA engine, the 8089 was a separate processor with its own instruction set. Unlike the 8087, the 8089 didn't take instructions from the 8086's instruction stream so it didn't interact with prefetching; instead, the 8086 sent a Channel Attention signal to the 8089 and the 8089 read instructions from shared memory. The 8089 was complex and expensive and wasn't very popular. The Intel 82586 Ethernet coprocessor used a similar Channel Attention scheme. 





[#] Tue Dec 06 2022 11:28:01 UTC from rss <>

Subject: How the 8086 processor's microcode engine works

[Reply] [ReplyQuoted] [Headers] [Print]

The 8086 microprocessor was a groundbreaking processor introduced by Intel in 1978. It led to the x86 architecture that still dominates desktop and server computing. The 8086 chip uses microcode internally to implement its instruction set. I've been reverse-engineering the 8086 from die photos and this blog post discusses how the chip's microcode engine operated. I'm not going to discuss the contents of the microcode1 or how the microcode controls the rest of the processor here. Instead, I'll look at how the 8086 decides what microcode to run, steps through the microcode, handles jumps and calls inside the microcode, and physically stores the microcode. It was a challenge to fit the microcode onto the chip with 1978 technology, so Intel used many optimization techniques to reduce the size of the microcode.

In brief, the microcode in the 8086 consists of 512 micro-instructions, each 21 bits wide. The microcode engine has a 13-bit register that steps through the microcode, along with a 13-bit subroutine register to store the return address for microcode subroutine calls. The microcode engine is assisted by two smaller ROMs: the "Group Decode ROM" to categorize machine instructions, and the "Translation ROM" to branch to microcode subroutines for address calculation and other roles. Physically, the microcode is stored in a 128×84 array. It has a special address decoder that optimizes the storage. The microcode circuitry is visible in the die photo below.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

What is microcode?

Machine instructions are generally considered the basic steps that a computer performs. However, each instruction usually requires multiple operations inside the processor. For instance, an ADD instruction may involve computing the memory address, accessing the value, moving the value to the Arithmetic-Logic Unit (ALU), computing the sum, and storing the result in a register. One of the hardest parts of computer design is creating the control logic that signals the appropriate parts of the processor for each step of an instruction. The straightforward approach is to build a circuit from flip-flops and gates that moves through the various steps and generates the control signals. However, this circuitry is complicated and error-prone.

In 1951, Maurice Wilkes came up with the idea of microcode: instead of building the control circuitry from complex logic gates, the control logic could be replaced with another layer of code (i. e. microcode) stored in a special memory called a control store. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. In other words, microcode forms another layer between the machine instructions and the hardware. The main advantage of microcode is that it turns the processor's control logic into a programming task instead of a difficult logic design task. Microcode also permits complex instructions and a large instruction set to be implemented without making the processor more complex (apart from the size of the microcode). Finally, it is generally easier to fix a bug in microcode than in circuit logic.

Early computers didn't use microcode, largely due to the lack of good storage technologies to hold the microcode. This changed in the 1960s; for example IBM made extensive use of microcode in the System/360 (1964). (I've written about that here.) But early microprocessors didn't use microcode, returning to hard-coded control logic with logic gates.3 This logic was generally more compact and ran faster than microcode, since the circuitry could be optimized. Since space was at a premium in early microprocessors and the instruction sets were relatively simple, this tradeoff made sense. But as microprocessor instruction sets became complex and transistors became cheaper, microcode became appealing. This led to the use of microcode in the Intel 8086 (1978) and 8088 (1979) and Motorola 68000 (1979), for instance.2

The 8086's microcode

The 8086's microcode is much simpler than in most processors, but it's still fairly complex. The code below is the 8086's microcode for the "CORD" routine, part of integer division; it consists of 16 micro-instructions. I'm not going to explain how this microcode works in detail, but I want to give a flavor of it. Each line has an address on the left (blue) and the micro-instruction on the right (yellow), specifying the low-level actions during one time step (i.e. clock cycle). Each micro-instruction performs a move, transferring data from a source register (S) to a destination register (D). (The source Σ indicates the ALU output.) For parallelism, the micro-instruction performs an operation or two at the same time as the move. This operation is specified by the "a" and "b" fields; their meanings depend on the type field. For instance, type 1 indicates an ALU instruction such as subtract (SUBT) or left-rotate through carry (LRCY). Type 4 selects two general operations such as "RTN" which returns from a microcode subroutine. Type 0 indicates a jump operation; "UNC 10" is an unconditional jump to line 10 while "CY 13" jumps to line 13 if the carry flag is set. Finally, the "F" field indicates if the condition code flags should be updated. The key points are that the micro-instructions are simple and execute in one clock cycle, they can perform multiple operations in parallel to maximize performance, and they include control-flow operations such as conditional jumps and subroutines.

An example of a microcode routine. The CORD routine implements integer division with subtracts and left rotates. This is from patent 4,449,184.

Each micro-instruction is stored at a 13-bit address (blue) which consists of 9 bits shown explicitly and a 4-bit sequence counter "CR". The eight numbered address bits usually correspond to the machine instruction's opcode. The "X" bit is an extra bit to provide more address space for code that is not directly tied to a machine instruction, such as reset and interrupt code, address computation, and the multiply/divide algorithms.

A micro-instruction is encoded into 21 bits as shown below. Every micro-instruction contains a move from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits is a bit tricky since it depends on the type field, which is two or three bits long. The "short jump" (type 0) is a conditional jump within the current block of 16 micro-instructions. The ALU operation (type 1) sets up the arithmetic-logic unit to perform an operation. Bookkeeping operations (type 4) are anything from flushing the prefetch queue to ending the current instruction. A memory read or write is type 6. A "long jump" (type 5) is a conditional jump to any of 16 fixed microcode locations (specified in an external table). Finally, a "long call" (type 7) is a conditional subroutine call to one of 16 locations (different from the jump targets).
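
The type values described above can be summarized in a short sketch. This only captures the structure of the format (a move plus a type-selected operation); the exact bit positions within the 21-bit word and the register names in the example are placeholders, not taken from the die.

    # Meaning of the micro-instruction "type" field, per the description above.
    MICRO_OP_TYPES = {
        0: 'short jump (conditional, within the current 16-instruction block)',
        1: 'ALU operation setup (e.g. SUBT, LRCY)',
        4: 'bookkeeping (e.g. RTN, flush the queue, end the instruction)',
        5: 'long jump (to one of 16 fixed microcode targets)',
        6: 'memory read or write',
        7: 'long call (to one of 16 subroutine targets)',
    }

    def describe(source, dest, op_type):
        """Describe one micro-instruction: every one moves source -> dest,
        plus an extra operation selected by the type field."""
        return f"move {source} -> {dest}; {MICRO_OP_TYPES.get(op_type, '?')}"

    # Placeholder register names; Σ is the ALU output, as in the CORD listing.
    print(describe('Σ', 'destreg', 1))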

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

This "vertical" microcode format reduces the storage required for the microcode by encoding control signals into various fields. However, it requires some decoding logic to process the fields and generate the low-level control signals. Surprisingly, there's no specific "microcode decoder" circuit. Instead, the logic is scattered across the chip, looking for various microcode bit patterns to generate control signals where they are needed.

How instructions map onto the ROM

One interesting issue is how the micro-instructions are organized in the ROM, and how the right micro-instructions are executed for a particular machine instruction. The 8086 uses a clever mapping from the machine instruction to a microcode address that allows machine instructions to share microcode.

Different processors use a variety of approaches to microcode organization. One technique is for each micro-instruction to contain a field with the address of the next micro-instruction. This provides complete flexibility for the arrangement of micro-instructions, but requires a field to hold the address, increasing the number of bits in each micro-instruction. A common alternative is to execute micro-instructions sequentially, with a micro-program-counter stepping through each micro-address unless there is an explicit jump to a new address. This approach avoids the cost of an address field in each instruction, but requires a program counter with an incrementer, increasing the hardware complexity.

The 8086 uses a hybrid approach. A 4-bit program counter steps through the bottom 4 bits of the address, so up to 16 micro-instructions can be executed in sequence without a jump. This approach has the advantage of requiring a smaller 4-bit incrementer for the program counter, rather than a 13-bit incrementer. The microcode engine provides a "short jump" operation that makes it easy to jump within the group of 16 instructions using a 4-bit jump target, rather than a full 13-bit address.

Another important design decision in microcode is how to determine the starting micro-address for each machine instruction. In other words, if you want to do an ADD, where does the microcode for ADD start? One approach is a table of starting addresses: the system looks in the table to find the starting address for ADD, but this requires a large table of 256 entries. A second approach is to use the opcode value as the starting address. That is, an ADD instruction 0x05 would start at micro-address 5. This approach has two problems. First, you can't run the microcode sequentially since consecutive micro-instructions belong to different machine instructions. Moreover, you can't share microcode since each instruction has a different address in the microcode ROM.

The 8086 solves these problems in two ways. First, the machine instructions are spaced sixteen slots apart in the microcode. In other words, the opcode is multiplied by 16 (has four zeros appended) to form the starting address in the microcode ROM, so there is plenty of space to implement each machine instruction. The second technique is that the ROM's addressing is partially decoded rather than fully decoded, so multiple micro-addresses can correspond to the same physical storage.4

To make this concrete, consider the 8086's arithmetic-logic instructions: one-byte add register to memory, one-byte add memory to register, one-word subtract memory from register, one-word xor register to memory, and so forth. There are 8 ALU operations and each can be byte- or word-sized, with memory as source or destination. This yields 32 different machine opcodes. These opcodes were carefully assigned, so they all have the format 00xxx0xx. The ROM address decoder is designed to look for three 0 bits in those positions, and ignore the other bits, so it will match that pattern. The result is that all 32 of these ALU instructions activate the same ROM column select line, and thus they all share the same microcode, shrinking the size of the ROM.
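
In code, the address mapping and the partial decode look roughly like this. It's a sketch of the idea, not of the actual decoder wiring; the "X" address bit is left out.

    def microcode_start_address(opcode):
        """Starting micro-address for a machine instruction: the opcode with
        four zero bits appended (i.e. multiplied by 16).  The extra "X"
        address bit, used for non-instruction code, is omitted here."""
        return opcode << 4

    def is_alu_reg_mem_op(opcode):
        """True for the 32 ALU instructions of the form 00xxx0xx.  The ROM's
        decoder ignores the "x" bits, so all 32 opcodes select the same
        microcode column."""
        return (opcode & 0b11000100) == 0b00000000

    print(hex(microcode_start_address(0x05)))   # an ADD opcode -> 0x50
    print(is_alu_reg_mem_op(0x28))              # SUB r/m,reg (00101000) -> True
    print(is_alu_reg_mem_op(0x05))              # ADD AX,imm  (00000101) -> False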

The microcode ROM's physical layout

The microcode ROM holds 512 words of 21 bits, so the obvious layout would be 512 columns and 21 rows. However, these dimensions are not practical for physically building the ROM because it would be too long and skinny. Instead, the ROM is constructed by grouping four words in each column, resulting in 128 columns of 84 rows, much closer to square. Not only does this make the physical layout more convenient, but it also reduces the number of column decoders from 512 to 128, reducing the circuitry size. Although the ROM now requires 21 multiplexers to select which of the four rows corresponds to each output bit, the circuitry is still much smaller. There is a tradeoff with the ability to merge addresses together by ignoring bits, though. Each decoder now selects a column of four words, rather than a single word, so each block of four words must have consecutive addresses.

The main components of the microcode engine. The metal layer has been removed to show the silicon and polysilicon underneath. If you zoom in, the bit pattern is visible in the silicon doping pattern.

The image above shows how microcode is stored and accessed. At the top is the 13-bit microcode address register, which will be discussed in detail below. The column selection circuit decodes 11 of the 13 address bits to select one column of the microcode storage. At the left, multiplexers select one bit out of each four rows using the two remaining address bits (specifically, the two lowest sequence bits). The selected 21 microcode outputs are latched and fed to the rest of the processor, where they are decoded as described earlier and control the processor's actions.
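
The addressing scheme can be expressed concisely: 11 of the 13 address bits select a column, and the two lowest sequence bits pick one of the four words multiplexed onto that column. The bit split below follows that description; treat the details as approximate.

    def rom_location(micro_address):
        """Split a 13-bit micro-address into the physical ROM access: the two
        lowest sequence bits drive the output multiplexers (one of 4 words per
        column), and the remaining 11 bits are decoded, partially, to select a
        column, so several addresses can share one column."""
        word_select = micro_address & 0b11     # multiplexer control
        column_address = micro_address >> 2    # (partially) decoded column select
        return column_address, word_select

    col, word = rom_location(0x0057)
    print(col, word)    # -> 21 3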

Optimizing the microcode

In 1978, the number of bits that could be stored in the microcode ROM was rather limited. In particular, the 8086 holds only 512 micro-instructions. The one-byte opcode allows roughly 256 machine instructions, each with multiple addressing modes, and each instruction requires multiple micro-instructions, so compression and optimization were necessary to make the microcode fit.5 The main idea was to move functionality out of the microcode and into discrete logic when it made sense. I'll describe some of the ways they did this.

The 8086 has an arithmetic-logic unit (ALU) that performs operations such as addition and subtraction, as well as logical operations such as AND and XOR. Consider the machine instruction ADD, implemented with a few micro-operations that compute the memory address, fetch data, perform the addition, and store the result. The machine instructions for subtraction, AND, or XOR require identical steps, except that the ALU performs a different operation. In total, the 8086 has eight ALU-based operations that are identical except for the operation performed by the ALU.6 The 8086 uses a "trick" where these eight machine instructions share the same microcode. Specifically, the microcode tells the ALU to perform a special operation XI, which indicates that the ALU should look at the appropriate bits of the instruction and do the appropriate operation.7 This shrinks the microcode for these operations by a factor of eight, at the cost of requiring additional logic for the ALU. In particular, the ALU control circuitry has a register to hold the relevant instruction bits, and a PLA to decode these bits into low-level ALU control signals.
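
In effect, the XI operation tells the ALU control circuitry to decode the operation from bits 5-3 of the opcode (the "xxx" in the 00xxx0xx pattern). The mapping below follows the standard 8086 encoding of these ALU instructions; the code is just an illustration of the idea, not the PLA's actual implementation.

    # The eight ALU operations selected by opcode bits 5-3.  With the XI
    # operation, the microcode defers to these bits instead of encoding the
    # operation itself.
    ALU_OPS = ['ADD', 'OR', 'ADC', 'SBB', 'AND', 'SUB', 'XOR', 'CMP']

    def xi_operation(opcode):
        """Which ALU operation an XI micro-op resolves to for this opcode."""
        return ALU_OPS[(opcode >> 3) & 0b111]

    print(xi_operation(0x28))   # SUB r/m8,r8 -> 'SUB'
    print(xi_operation(0x30))   # XOR r/m8,r8 -> 'XOR'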

Similarly, the 8086 has eight machine instructions to increment a specific register (out of a set of 8), and eight instructions to decrement a register. All 16 instructions are handled by the same set of micro-instructions and the ALU does the increment or decrement as appropriate. Moreover, the register control circuitry determines which register is specified by the instruction, without involvement from the microcode.

Another optimization is that the 8086 has many machine instructions in pairs: an 8-bit version and a 16-bit version. One approach would be to have separate microcode for the two instructions, one to handle a single byte and one to handle two bytes. Instead, the machine instructions share microcode. The complexity is moved to the circuitry that moves data on the bus: it looks at the low bit of the instruction to determine if it should process a byte or a word. This cuts the microcode size in half for the many affected instructions.

Finally, simple instructions that can take place in one cycle are implemented with logic gates, rather than through microcode. For instance, the CLC (clear carry flag) instruction updates the flag directly. Similarly, prefix instructions for segment selection, instruction locking, or repetition are performed in logic. These instructions don't use any microcode at all, which will be important below.

Using techniques such as these, about 75 different instruction types are implemented in the microcode (instead of about 256), making the microcode much smaller. The tradeoff is that the 8086 requires more logic circuitry, but the designers found the tradeoff to be worthwhile.

The ModR/M byte

There's another complication for 8086 microcode, however. Most 8086 instructions have a second byte: the ModR/M byte, which controls the addressing mode for the instructions in a complex way (shown below). This byte gives 8086 instructions a lot of flexibility: you can use two registers, a register and a memory location, or a register and an "immediate" value specified in the instruction. The memory location can be specified by 8 index register combinations with a one-byte or two-byte displacement optionally added. (This is useful for accessing data in an array or structure, for instance.) Although these addressing modes are powerful, they pose a problem for the microcode.

A summary of the ModR/M byte, from MCS-86 Assembly Language Reference Guide.

These different addressing modes need to be implemented in microcode, since different addressing modes require different sequences of steps. In other words, you can't use the previous trick of pushing the problem into logic gates. And you clearly don't want a separate implementation of each instruction for each addressing mode since the size of the microcode would multiply out of control.

The solution is to use a subroutine (in microcode) to compute the memory address. Thus, instructions can share the microcode for each addressing mode. This adds a lot of complexity to the microcode engine, however, since it needs to store the return micro-address for a microcode subroutine call so it can get back to the right location. To support this, the microcode engine has a register to hold this return address. (Since it doesn't have a full stack, you can't perform nested subroutine calls, but this isn't a significant limitation.)

The microcode ends up having about 10 subroutines for the different addressing modes, as well as four routines for the different sizes of displacement. (The 8 possibilities for source registers are handled in the register selection logic, rather than microcode.) Thus, the microcode handles the 256 different addressing modes with about 14 short routines that add the appropriate address register(s) and the displacement to obtain the memory address.
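
Because there is a single return register rather than a stack, the call/return mechanism looks like this in miniature. This is a sketch: the addresses are made up, and "address-mode routine" stands in for a target supplied by the Translation ROM.

    class MicrocodeSequencer:
        """One-level subroutine support: a single register holds the return
        address, so calls cannot nest (matching the 8086's microcode engine)."""

        def __init__(self):
            self.micro_pc = 0
            self.return_register = None              # the lone subroutine register

        def call(self, target):
            self.return_register = self.micro_pc + 1 # remember where to resume
            self.micro_pc = target                   # e.g. an address-mode routine

        def ret(self):
            self.micro_pc = self.return_register     # the RTN micro-operation

    seq = MicrocodeSequencer()
    seq.micro_pc = 0x150               # somewhere in an instruction's microcode
    seq.call(0x1E0)                    # run an effective-address subroutine
    seq.ret()
    print(hex(seq.micro_pc))           # -> 0x151, back in the caller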

One more complication is that machine instructions can switch the source and destination specified by the ModR/M byte, depending on the opcode. For example, one subtract instruction will subtract a memory location from a register, while a different subtract instruction subtracts a register from a memory location. The two variants are distinguished by bit 1 of the instruction, the "direction" bit. These variants are handled by the control logic, so the microcode can ignore them. Specifically, before the source and destination specifications go to the register control circuitry, a crossover circuit can swap them based on the value of the direction bit.

The Translation ROM

As explained above, the starting address for a machine instruction is derived directly from the instruction's opcode. However, the microcode engine needs a mechanism to provide the address for jump and call operations. In the 8086, this address is hard-coded into the Translation ROM, which provides a 13-bit address.8 It holds ten destination addresses for jump operations and ten (different) addresses for call operations.

A second role of the Translation ROM is to hold target addresses for each ModR/M addressing mode, pointing to the code to compute the effective address. As a complication, two of the jump table entries in the Translation ROM are implemented with conditional logic, depending on whether or not the instruction's memory address calculation includes a displacement. By wiring this condition into the Translation ROM, the microcode avoids the need to test this condition.

The image below shows how the Translation ROM appears on the die. It is implemented as a partially-decoded ROM with multiplexed inputs.9 The inputs are at the bottom left. For a jump or call, the ROM uses 4 input bits from the microcode output, since the microcode selects the jump targets. For an address computation, it takes 5 bits from the instruction's ModR/M byte, so the routine is selected by the instruction. The ROM has additional input bits to select the mode (jump, call, or address) and for the conditional jumps. The decoding logic (left half) activates a row in the right half, generating the output address. This address exits at the bottom and is loaded into the micro-address register below the Translation ROM.

The Translation ROM holds addresses of routines in the microcode.

The Group Decode ROM

In the discussion above, I've discussed how various categories of instructions are optimized. For instance, many instructions have a bit that selects if they act on a byte or a word. Many instructions have a bit to reverse the direction of the operation's memory and register accesses. These features are implemented in logic rather than microcode. Other instructions are implemented outside microcode entirely. How does the 8086 determine which way to process an instruction?

The Group Decode ROM takes an instruction opcode and generates 15 signals that indicate various categories of instructions that are handled differently.10 The outputs from the Group Decode ROM are used by various logic circuits to determine how to handle the instruction. Some cases affect the microcode, for instance calling a microcode addressing routine if the instruction has a ModR/M byte. In other cases, these signals act "downstream" of the microcode, for example to determine if the operation should act on a byte or a word. Other signals cause the microcode to be bypassed completely.

A closeup of the Group Decode ROM. The circuit uses two layers of NOR gates to generate the output signals from the opcode inputs. This image shows a composite of the metal, polysilicon, and silicon layers.

Specially-encoded instructions

For most of the 8086 instructions, the first byte specifies the instruction. However, the 8086 has a few instructions where the ModR/M byte completely changes the meaning of the first byte. For instance, opcode 0xF6 (Grp 1 below) can be a TEST, NOT, NEG, MUL, IMUL, DIV, or IDIV instruction based on the value of the ModR/M byte. Similarly, opcode 0xFE (Grp 2) indicates an INC, DEC, CALL, JMP, or PUSH instruction.11

The 8086 instruction map for opcodes 0xF0 to 0xFF. Based on MCS-86 Assembly Language Reference Guide.

This encoding may seem a bit random, but there's a reason behind it. Most instructions act on a source and a destination. But some, such as INC (increment) use the same register or memory location for the source and the destination. Others, such as CALL or JMP, only use one address. Thus, the "reg" field in the ModR/M byte is redundant. Since these bits would be otherwise "wasted", they are used instead to specify different instructions. (There are only 256 single-byte opcodes, so you want to make the best use of them.)

The implementation of these instructions in microcode is interesting. Since the instructions share the same first byte, the standard microcode mapping would put them at the same microcode address. However, these instructions are treated specially, with the "reg" field from the ModR/M byte copied into the lower bits of the microcode address. In effect, the instructions are treated as opcodes 0xF0 through 0xFF, so the different instruction variants execute at separate microcode addresses. You might expect a collision with the opcodes that really have the values 0xF0 through 0xFF. However, the 8086 opcodes were cleverly arranged so none of the other instructions in this range use microcode. As you can see above, the other instructions are prefixes (LOCK, REP, REPZ), halt (HLT), or flag operations (CMC, CLC, STC, CLI, STI, CLD, STD), all implemented outside microcode. Thus, the range 0xF0-0xFF is freed up for the "expanded" instructions.
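
A sketch of this special addressing: the reg field of the ModR/M byte replaces low bits of the opcode when forming the microcode address, so the variants land in the otherwise-unused 0xF0-0xFF range. The exact bit placement below is my reading of the description, so treat it as approximate.

    def group_effective_opcode(opcode, modrm):
        """For the specially-encoded "group" opcodes (e.g. 0xF6/0xF7 and
        0xFE/0xFF), the reg field of the ModR/M byte selects the actual
        operation.  This sketch substitutes reg into the opcode so the
        variants occupy distinct microcode addresses in 0xF0-0xFF.
        (The exact bit placement is an assumption for illustration.)"""
        reg = (modrm >> 3) & 0b111
        width_bit = opcode & 1                        # byte/word select bit
        return 0xF0 | (reg << 1) | width_bit

    # 0xF6 with reg=100 is MUL r/m8; in this sketch it maps to 0xF8.
    print(hex(group_effective_opcode(0xF6, 0b11_100_000)))   # -> 0xf8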

The hardware implementation for this is not too complex. The Group Decode ROM produces an output for these special instructions. This output causes the microcode address register to load the appropriate bits from the ModR/M byte, so the appropriate microcode routine is executed.

The microcode address register

The heart of the microcode engine is the microcode address register, which determines which microcode address to execute. As described earlier, the microcode address is 13 bits, of which 8 bits generally correspond to the instruction opcode, one bit is an extra "X" instruction bit, and 4 bits are sequentially incremented. The diagram below shows how the circuitry for the bits is arranged. The 9 instruction bits each have a nearly-identical circuit. The sequence bits have more circuitry and each one is different, because the circuit to increment the address is different for each bit.

Layout of the microcode address register. Each bit has a roughly vertical block of circuitry.

The schematic below shows the circuitry for one bit in the microcode address register. It has two flip-flops: one to hold the current address bit and one to hold the old address while performing a subroutine call. A multiplexer (mux) selects the input to each flip-flop. For instance, if the microcode is waiting for a memory access, the "hold" input to the multiplexer causes the current address to loop around and get reloaded into the flip-flop. For a subroutine call, the "call" input saves the current address in the subroutine flip-flop. Conversely, when returning from a subroutine, the "return" input loads the old address from the subroutine flip-flop. The address flip-flop also has inputs to load the instruction as the address, to load an address from the translation ROM, or to load an interrupt microcode handler address. The circuit sends the address bit (and inverted address bit) to the microcode ROM's address decoder.

Schematic of a typical bit in the microcode address register.

Each bit has some special-case handling, so this schematic should be viewed as an illustration, not an accurate wiring diagram. In particular, the sequence bits also have inputs from the incrementer, so they can step to the next address. The low-order bits have instruction inputs to handle the specially-encoded "group" instructions discussed in the previous section.

The control signals for the multiplexers are generated from various sources. A circuit called the loader starts processing of an instruction, synchronized to the prefetch queue and instruction fetch from memory. The call and return operations are microcode instructions. The Group Decode ROM controls some of the inputs, for instance to process a ModR/M byte. Thus, there is a moderate amount of conditional logic that determines the microcode address and thus what microcode gets executed.

Conclusions

This has been a lot of material, so thank you for sticking with it to the end. I draw three conclusions from studying the microcode engine of the 8086. First, the implementation of microcode is considerably more complex than the clean description of microcode that is presented in books. A lot of functionality is implemented in logic outside of microcode, so it's not a "pure" microcode implementation. Moreover, there are many optimizations and corner cases. The microcode engine has two supporting ROMs: the Translation ROM and the Group Decode ROM. Even the microcode address register has complications.

Second, the need for all these optimizations shows how the 8086 was just on the edge of what was practical. The designers clearly went to a lot of effort to get the microcode to fit in the space available.

Finally, looking at the 8086 in detail shows how complex its instruction set is. I knew in the abstract that it was much more convoluted than, say, an ARM chip. But seeing all the special case circuitry on the die to handle the corner cases of the instruction set really makes this clear.

I plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space. If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier.

Notes and references

  1. The 8086 microcode was disassembled (link) a couple of years ago by Andrew Jenner by extracting the bits from my die photos. My post here is a bit different, looking at hardware that runs the microcode, rather than the contents of the microcode. 

  2. According to Wikipedia, the Zilog Z8000 (1979) didn't use microcode, which is a bit surprising for that timeframe. This design decision had the advantage of reducing transistor count, but the disadvantage of hard-to-fix logic bugs in the instruction decoder. 

  3. As an example of a non-microcoded processor, the MOS 6502 (1975) used a PLA (Programmable Logic Array) to perform much of the decoding (details). A PLA provides a structured way of implementing logic gates in a relatively dense array. A PLA is kind of like a ROM—a PLA can implement a ROM and vice versa—so it can be hard to see the difference. The usual distinction is that only one row of a ROM is active at a time, the address that you're reading out. A PLA is more general since multiple rows can be active at a time, combined to form the outputs.

    The Z80 had a slightly different implementation. It used a smaller PLA to perform basic decoding of instructions into various types. It then generated control signals through a large amount of "random" logic (so-called because of its appearance, not because it's actually random). This logic combined instruction types and timing signals to generate the appropriate control signals. 

  4. In a "normal" ROM, the address bits are decoded with logic gates so each unique address selects a different storage column in the ROM. However, parts of the decoder circuitry can "ignore" bits so multiple addresses select the same storage column. For a hypothetical example, suppose you're decoding 5-bit addresses. Instead of decoding 11010 and 11011 separately, you could ignore the last bit so both addresses access the same ROM data. Or you could ignore the last three bits so all 8 addresses of the form 11xxx go to the same ROM location (where x indicates a bit that can be either 0 or 1). This makes the ROM more like a PLA (Programmable Logic Array), but still accessing a single row at a time. 

  5. The Intel 8087 was the floating point co-processor for the 8086. The 8087 required a lot of microcode, more than could fit in a standard ROM on the die. To get the microcode to fit, Intel created a special ROM that stored two bits per transistor (instead of one) by using four different sizes of transistors to generate four different voltages. Analog circuitry converted each voltage level into two bits. This complex technique doubled the density (at least in theory), allowing the microcode to fit on the chip. I wrote about the 8087's non-binary ROM here. 

  6. The ALU operations that are grouped together are add, add with carry, subtract, subtract with borrow, logical AND, logical XOR, logical OR, and compare. The compare operation may seem out of place in this list, but it is implemented as a subtract operation that updates the condition flags without changing the value. Operations such as increment, decrement, negation, and logical NOT may seem like they should be included, but since they operate on a single argument instead of two, they are implemented differently at the microcode level. Increment and decrement are combined in the microcode, however. Negation and logical NOT could be combined except that negation affects the condition code flags, while NOT doesn't, so they need separate microcode. (This illustrates how subtle features of the instruction set can have more impact than you might expect.) Since the ALU doesn't have hardware multiplication and division, the multiplication and division operations are implemented separately in microcode with loops. 

  7. The ALU itself isn't examining instruction bits to decide what to do. There's some control circuitry next to the ALU that uses a PLA (Programmable Logic Array) to examine the instruction bits and the microcode's ALU command to generate low-level control signals for the ALU. These signals control things such as carry propagation, argument negation, and logical operation selection to cause the ALU to perform the desired operation. 

  8. The Translation ROM has one additional output: a wire indicating an address mode that uses the BP register. This output goes to the segment register selection circuitry and selects a different segment register. The reason is that the 8086 uses the Data Segment by default for effective address computation, unless the address mode uses BP as a base register. In that case, the Stack Segment is used. This is an example of how the 8086 architecture is not orthogonal and has lots of corner cases. 

  9. You can also view the Translation ROM as a PLA (Programmable Logic Array) constructed from two layers of NOR gates. The conditional entries make it seem more like a PLA than a ROM. Technically, it can be considered a ROM since a single row is active at a time. I'm using the name "Translation ROM" because that's what Intel calls it in the patents. 

  10. Although the Group Decode ROM is called a ROM in the patent, I'd consider it more of a PLA (programmable logic array). Conceptually it holds 256 words, one for each instruction. But its implementation is an array of logic functions. 

  11. These instructions were called "Group 1" and "Group 2" instructions in the 8086 documentation. Later Intel documentation renamed them as "Unary Group 3", "INC/DEC Group 4" and "Indirect Group 5". Some details are here. The 8086 has two other groups of instructions where the reg field defines the instruction: the "Immediate" instructions 0x80-0x83 and the "Shift" instructions 0xD0-0xD3. For these opcodes, the different instructions were implemented by the ALU. As far as the microcode was concerned, these were "normal" instructions so I won't discuss them in this post.

    I should mention that although the 8086 opcodes are always expressed in hexadecimal, the encoding makes much more sense if you look at it in octal. Details here. The octal encoding also applies to other related chips including the 8008, 8080, and Z80. 





[#] Tue Dec 06 2022 11:28:15 UTC from rss <>

Subject: A bug fix in the 8086 microprocessor, revealed in the die's silicon

[Reply] [ReplyQuoted] [Headers] [Print]

The 8086 microprocessor was a groundbreaking processor introduced by Intel in 1978. It led to the x86 architecture that still dominates desktop and server computing. While I was reverse-engineering the 8086 from die photos, a particular circuit caught my eye because its physical layout on the die didn't match the surrounding circuitry. This circuit turns out to implement special functionality for a couple of instructions, subtly changing the way they interact with interrupts. Some web searching revealed that this behavior was changed by Intel in 1978 to fix a problem with early versions of the 8086 chip. By studying the die, we can get an idea of how Intel dealt with bugs in the 8086 microprocessor.

In modern CPUs, bugs can often be fixed through a microcode patch that updates the CPU during boot.1 However, prior to the Pentium Pro (1995), microprocessors could only be fixed by revising the design and producing new silicon. This became a big problem for Intel with the famous Pentium floating-point division bug, which caused rare but serious errors when dividing. Intel recalled the defective processors in 1994 and replaced them, at a cost of $475 million.

The circuit on the die

The microscope photo below shows the 8086 die with the main functional blocks labeled. This photo shows the metal layer on top of the silicon. While modern chips can have more than a dozen layers of metal, the 8086 has a single layer. Even so, the metal mostly obscures the underlying silicon. Around the outside of the die, you can see the bond wires that connect pads on the chip to the 40 external pins.

The 8086 die with main functional blocks labeled. Click this image (or any other) for a larger version.

The relevant part of the chip is the Group Decode ROM in the upper center. The purpose of this circuit is to categorize instructions into groups that control how they are decoded and processed. For instance, very simple instructions (such as setting a flag) can be performed directly in one cycle. Other instructions are not complete instructions, but a prefix that modifies the following instruction. The remainder of the instructions are implemented in microcode, which is stored in the lower-right corner of the chip. Many of these instructions have a second byte, the "Mod R/M" byte that specifies a register and the memory addressing scheme. Some instructions have two versions: one for an 8-bit operand and one for a 16-bit operand. Some operations have a bit to swap the source and destination. The Group Decode ROM is responsible for looking at the 8 bits of the instruction and deciding which groups the instruction falls into.

A closeup of the Group Decode ROM. This image is a composite showing the metal, polysilicon, and silicon layers.

The photo above shows the Group Decode ROM in more detail. Strictly speaking, the Group Decode ROM is more of a PLA (Programmable Logic Array) than a ROM, but Intel calls it a ROM. It is a regular grid of logic, allowing gates to be packed together densely. The lower half consists of NOR gates that match various instruction patterns. The instruction bits are fed horizontally from the left, and each NOR gate is arranged vertically. The outputs from these NOR gates feed into a set of horizontal NOR gates in the upper half, combining signals from the lower half to produce the group outputs. These NOR gates have vertical inputs and horizontal outputs.
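
A toy Python model can make this structure clearer. This is my own sketch, not Intel's actual circuit data: one NOR column of the lower plane, where a transistor on a line pulls the column low whenever that line is high. To require a bit to be 1, the column connects to the bit's complemented line; to require 0, it connects to the true line; a "don't care" bit simply has no transistor. The column below is hypothetical, matching the pattern 0100xxxx (the one-byte INC/DEC register opcodes).

    def column_output(opcode, connections):
        lines = {}
        for i in range(8):
            bit = (opcode >> i) & 1
            lines["b%d" % i] = bit          # true line
            lines["nb%d" % i] = 1 - bit     # complemented line
        # The column is a NOR gate: it stays high only if no connected line is high.
        return 0 if any(lines[name] for name in connections) else 1

    INC_DEC_COLUMN = ["b7", "nb6", "b5", "b4"]    # matches 0100xxxx

    print(column_output(0x43, INC_DEC_COLUMN))    # 1: 0x43 (INC BX) matches
    print(column_output(0x83, INC_DEC_COLUMN))    # 0: no match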

The diagram below is a closeup of the Group Decode ROM, showing how the NOR gates are constructed. The pinkish regions are silicon, doped with impurities to make it a semiconductor. The gray horizontal lines are polysilicon, a special type of silicon on top. Where a polysilicon crosses conductive silicon, it forms a transistor. The transistors are wired together by metal wiring on top. (I dissolved the metal layer with acid to show the silicon; the blue lines show where two of the metal wires were.) When an input is high, it turns on the corresponding transistors, pulling the vertical lines low. This creates NOR gates with multiple inputs. The key idea of the PLA is that at each point where horizontal and vertical lines cross, a transistor can be present or absent, to select the desired gate inputs. By doping the silicon in the desired pattern, transistors can be created or omitted as needed. In the diagram below, two of the transistors are highlighted. You can see that some of the other locations have transistors, while others do not. Thus, the PLA provides a dense, flexible way to produce a set of outputs from a set of inputs.

Closeup of part of the Group Decode ROM showing a few of the transistors. I dissolved the metal layer for this image, to reveal the silicon and polysilicon underneath.

Closeup of part of the Group Decode ROM showing a few of the transistors. I dissolved the metal layer for this image, to reveal the silicon and polysilicon underneath.

Zooming out a bit, the PLA is connected to some unusual circuitry, shown below. The last two columns in the PLA are a bit peculiar. The upper half is unused. Instead, two signals leave the side of the PLA horizontally and bypass the top of the PLA. These signals go to a NOR gate and an inverter that are kind of in the middle of nowhere, separated from the rest of the logic. The output from these gates goes to a three-input NOR gate, which is curiously split into two pieces. The lower part is a normal two-input NOR gate, but then the transistor for the third input (the one we're looking at) is some distance away. It's unusual for a gate to be split across a distance like this.

The circuitry as it appears on the die.

The circuitry as it appears on the die.

It can be hard to keep track of the scale of these diagrams. The highlighted box in the image below corresponds to the region above. As you can see, the circuit under discussion spans a fairly large fraction of the die.

The red rectangle in this figure highlights the region in the diagram above.

The red rectangle in this figure highlights the region in the diagram above.

My next question was what instructions were affected by this mystery circuitry. By looking at the transistor pattern in the Group Decode ROM, I determined that the two curious columns matched instructions with bits 10001110 and 000xx111. A look at the 8086 reference shows that the first bit pattern corresponds to the instructions MOV sr,xxx, which loads a value into a segment register. The second bit pattern corresponds to the instructions POP sr, which pops a value from the stack into a segment register. But why did these instructions need special handling?
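
Before getting to that, here's a quick check, in a Python sketch of my own (not part of the original analysis), of which opcodes those two bit patterns cover:

    def matches(opcode, pattern):
        # pattern: a string of '0', '1', and 'x' (don't care), most significant bit first
        return all(p in ("x", b) for p, b in zip(pattern, format(opcode, "08b")))

    print([hex(op) for op in range(256) if matches(op, "10001110")])
    # ['0x8e']                         -> MOV sr,xxx
    print([hex(op) for op in range(256) if matches(op, "000xx111")])
    # ['0x7', '0xf', '0x17', '0x1f']   -> POP ES, POP CS, POP SS, POP DS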

The interrupt bug

After searching for information on these instructions, I came across errata stating: "Interrupts Following MOV SS,xxx and POP SS Instructions May Corrupt Memory. On early Intel 8088 processors (marked “INTEL ‘78” or “(C) 1978”), if an interrupt occurs immediately after a MOV SS,xxx or POP SS instruction, data may be pushed using an incorrect stack address, resulting in memory corruption." The fix to this bug turns out to be the mystery circuitry.

I'll give a bit of background. The 8086, like most processors, has an interrupt feature where an external signal, such as a timer or input/output, can interrupt the current program. The processor starts running different code to handle the interrupt, and then returns to the original program, continuing where it left off. When interrupted, the processor uses its stack in memory to keep track of what it was doing in the original program so it can continue. The stack pointer (SP) is a register that keeps track of where the stack is in memory.

A complication is that the 8086 uses "segmented memory", where memory is divided into chunks (segments) with different purposes. On the 8086, there are four segments: the Code Segment, Data Segment, Stack Segment, and Extra Segment. Each segment has an associated segment register that holds the starting memory address for that segment. Suppose you want to change the location of the stack in memory, maybe because you're starting a new program. You need to change the Stack Segment register (called SS) to point to the new location for the stack segment. And you also need to change the Stack Pointer register (SP) to point to the stack's current position within the stack segment.

A problem arises if the processor receives an interrupt after the Stack Segment register has been changed, but before the Stack Pointer register has been changed. The processor will store information on the stack using the old stack pointer address but in the new segment. Thus, the information is stored into essentially a random location in memory, which is bad.2 Intel's fix was to delay an interrupt after an update to the stack segment register, so you had a chance to update the stack pointer.3 The stack segment register could be changed in two ways. First, you could move a value to the register ("MOV SS, xxx" in assembly language), or you could pop a value off the stack into the stack segment register ("POP SS"). These are the two instructions affected by the mystery circuitry. Thus, we can see that Intel added circuitry to delay an interrupt immediately after one of these instructions and avoid the bug.
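
To make the failure mode concrete, here's a small Python illustration of my own; the segment and offset values are made up, but the address arithmetic (segment × 16 + offset) is how the 8086 forms physical addresses.

    def phys(segment, offset):
        # The 8086 forms a 20-bit physical address from a segment and an offset.
        return (segment * 16 + offset) & 0xFFFFF

    old_ss, old_sp = 0x1000, 0x0100   # hypothetical old stack
    new_ss, new_sp = 0x3000, 0x2000   # hypothetical new stack

    # After both SS and SP are updated, pushes land in the new stack:
    print(hex(phys(new_ss, new_sp)))   # 0x32000

    # If an interrupt hits between the two updates, the processor pushes the
    # flags, CS, and IP using the new segment with the old stack pointer:
    print(hex(phys(new_ss, old_sp)))   # 0x30100 -- essentially a random location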

Conclusions

One of the interesting things about reverse-engineering the 8086 is when I find a curious feature on the die and then find that it matches an obscure part of the 8086 documentation. Most of these are deliberate design decisions, but they show how complex and ad-hoc the 8086 architecture is, with many special cases. Each of these cases results in some circuitry and gates, complicating the chip. (In comparison, I've reverse-engineered the ARM1 processor, a RISC processor that started the ARM architecture. The ARM1 has a much simpler architecture with very few corner cases. This is reflected in circuitry that is much simpler.)

The case of the segment registers and interrupts, however, is the first circuit that I've found on the 8086 die that is part of a bug fix. This fix appears to have been fairly tricky, with multiple gates scattered in unused parts of the chip. It would be interesting to get a die photo of a very early 8086 chip, prior to this bug fix, to confirm the change and see if anything else was modified.

If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier. I plan to write more about the 8086 so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space.

Notes and references

  1. The modern microcode update process is more complicated than I expected with updates possible before the BIOS is involved, during boot, or even while applications are running. Intel provides details here. Apparently Intel originally added patchable microcode to the Pentium Pro for chip debugging and testing, but realized that it would be a useful feature to fix bugs in the field (details). 

  2. The obvious workaround for this problem is to disable interrupts while you're changing the Stack Segment register, and then turn interrupts back on when you're done. This is the standard way to prevent interrupts from happening at a "bad time". The problem is that the 8086 (like most microprocessors) has a non-maskable interrupt (NMI), an interrupt for very important things that can't be disabled. 

  3. Intel documents the behavior in a footnote on page 2-24 of the User's Manual:

    There are a few cases in which an interrupt request is not recognized until after the following instruction. Repeat, LOCK and segment override prefixes are considered "part of" the instructions they prefix; no interrupt is recognized between execution of a prefix and an instruction. A MOV (move) to segment register instruction and a POP segment register instruction are treated similarly: no interrupt is recognized until after the following instruction. This mechanism protects a program that is changing to a new stack (by updating SS and SP). If an interrupt were recognized after SS had been changed, but before SP had been altered, the processor would push the flags, CS and IP into the wrong area of memory. It follows from this that whenever a segment register and another value must be updated together, the segment register should be changed first, followed immediately by the instruction that changes the other value. There are also two cases, WAIT and repeated string instructions, where an interrupt request is recognized in the middle of an instruction. In these cases, interrupts are accepted after any completed primitive operation or wait test cycle.

    Curiously, the fix on the chip is unnecessarily broad: a MOV or POP for any segment register delays interrupts. There was no hardware reason for this: the structure of the PLA means that all the necessary instruction bits were present and it would be no harder to test for the Stack Segment register specifically. The fix of delaying interrupts after a POP or MOV remains in the x86 architecture today. However, it has been cleaned up so only instructions affecting the Stack Segment register cause the delay; operations on other segment registers have no effect. 





[#] Sat Aug 05 2023 09:39:41 UTC from rss <>

Subject: A close look at the 8086 processor's bus hold circuitry

[Reply] [ReplyQuoted] [Headers] [Print]

The Intel 8086 microprocessor (1978) revolutionized computing by founding the x86 architecture that continues to this day. One of the lesser-known features of the 8086 is the "hold" functionality, which allows an external device to temporarily take control of the system's bus. This feature was most important for supporting the 8087 math coprocessor chip, which was an option on the IBM PC; the 8087 used the bus hold so it could interact with the system without conflicting with the 8086 processor.

This blog post explains in detail how the bus hold feature is implemented in the processor's logic. (Be warned that this post is a detailed look at a somewhat obscure feature.) I've also found some apparently undocumented characteristics of the 8086's hold acknowledge circuitry, designed to make signal transitions faster on the shared control lines.

The die photo below shows the main functional blocks of the 8086 processor. In this image, the metal layer on top of the chip is visible, while the silicon and polysilicon underneath are obscured. The 8086 is partitioned into a Bus Interface Unit (upper) that handles bus traffic, and an Execution Unit (lower) that executes instructions. The two units operate mostly independently, which will turn out to be important. The Bus Interface Unit handles read and write operations as requested by the Execution Unit. The Bus Interface Unit also prefetches instructions that the Execution Unit uses when it needs them. The hold control circuitry is highlighted in the upper right; it takes a nontrivial amount of space on the chip. The square pads around the edge of the die are connected by tiny bond wires to the chip's 40 external pins. I've labeled the MN/MX, HOLD, and HLDA pads; these are the relevant signals for this post.

The 8086 die under the microscope, with the main functional blocks and relevant pins labeled. Click this image (or any other) for a larger version.

The 8086 die under the microscope, with the main functional blocks and relevant pins labeled. Click this image (or any other) for a larger version.

How bus hold works

In an 8086 system, the processor communicates with memory and I/O devices over a bus consisting of address and data lines along with various control signals. For high-speed data transfer, it is useful for an I/O device to send data directly to memory, bypassing the processor; this is called DMA (Direct Memory Access). Moreover, a co-processor such as the 8087 floating point unit may need to read data from memory. The bus hold feature supports these operations: it is a mechanism for the 8086 to give up control of the bus, letting another device use the bus to communicate with memory. Specifically, an external device requests a bus hold and the 8086 stops putting electrical signals on the bus and acknowledges the bus hold. The other device can now use the bus. When the other device is done, it signals the 8086, which then resumes its regular bus activity.

Most things in the 8086 are more complicated than you might expect, and the bus hold feature is no exception, largely due to the 8086's minimum and maximum modes. The 8086 can be designed into a system in one of two ways—minimum mode and maximum mode—that redefine the meanings of the 8086's external pins. Minimum mode is designed for simple systems and gives the control pins straightforward meanings such as indicating a read versus a write. Minimum mode provides bus signals similar to those of the earlier 8080 microprocessor, making migration to the 8086 easier. On the other hand, maximum mode is designed for sophisticated, multiprocessor systems and encodes the control signals to provide richer system information.

In more detail, minimum mode is selected if the MN/MX pin is wired high, while maximum mode is selected if the MN/MX pin is wired low. Nine of the chip's pins have different meanings depending on the mode, but only two pins are relevant to this discussion. In minimum mode, pin 31 has the function HOLD, while pin 30 has the function HLDA (Hold Acknowledge). In maximum mode, pin 31 has the function RQ/GT0', while pin 30 has the function RQ/GT1'.

I'll start by explaining how a hold operation works in minimum mode. When an external device wants to use the bus, it pulls the HOLD pin high. At the end of the current bus cycle, the 8086 acknowledges the hold request by pulling HLDA high. The 8086 also puts its bus output pins into "tri-state" mode, in effect disconnecting them electrically from the bus. When the external device is done, it pulls HOLD low and the 8086 regains control of the bus. Don't worry about the details of the timing below; the key point is that a device pulls HOLD high and the 8086 responds by pulling HLDA high.

This diagram shows the HOLD/HLDA sequence. From iAPX 86,88 User's Manual, Figure 4-14.

This diagram shows the HOLD/HLDA sequence. From iAPX 86,88 User's Manual, Figure 4-14.

The 8086's maximum mode is more complex, allowing two other devices to share the bus by using a priority-based scheme. Maximum mode uses two bidirectional signals, RQ/GT0 and RQ/GT1.2 When a device wants to use the bus, it issues a pulse on one of the signal lines, pulling it low. The 8086 responds by pulsing the same line. When the device is done with the bus, it issues a third pulse to inform the 8086. The RQ/GT0 line has higher priority than RQ/GT1, so if two devices request the bus at the same time, the RQ/GT0 device wins and the RQ/GT1 device needs to wait.1 Keep in mind that the RQ/GT lines are bidirectional: the 8086 and the external device both use the same line for signaling.

This diagram shows the request/grant sequence. From iAPX 86,88 User's Manual, Figure 4-16.

This diagram shows the request/grant sequence. From iAPX 86,88 User's Manual, Figure 4-16.

The bus hold does not completely stop the 8086. The hold operation stops the Bus Interface Unit, but the Execution Unit will continue executing instructions until it needs to perform a read or write, or it empties the prefetch queue. Specifically, the hold signal blocks the Bus Interface Unit from starting a memory cycle and blocks an instruction prefetch from starting.

Bus sharing and the 8087 coprocessor

Probably the most common use of the bus hold feature was to support the Intel 8087 math coprocessor. The 8087 coprocessor greatly improved the performance of floating-point operations, making them up to 100 times faster. As well as floating-point arithmetic, the 8087 supported trigonometric operations, logarithms and powers. The 8087's architecture became part of later Intel processors, and the 8087's instructions are still a part of today's x86 computers.3

The 8087 had its own registers and didn't have access to the 8086's registers. Instead, the 8087 could transfer values to and from the system's main memory. Specifically, the 8087 used the RQ/GT mechanism (maximum mode) to take control of the bus if the 8087 needed to transfer operands to or from memory.4 The 8087 could be installed as an option on the original IBM PC, which is why the IBM PC used maximum mode.

The enable flip-flop

The circuit is built from six flip-flops. The flip-flops are a bit different from typical D flip-flops, so I'll discuss the flip-flop behavior before explaining the circuit.

A flip-flop can store a single bit, 0 or 1. Flip flops are very important in the 8086 because they hold information (state) in a stable way, and they synchronize the circuitry with the processor's clock. A common type of flip-flop is the D flip-flop, which takes a data input (D) and stores that value. In an edge-triggered flip-flop, this storage happens on the edge when the clock changes state from low to high.5 (Except at this transition, the input can change without affecting the output.) The output is called Q, while the inverted output is called Q-bar.

The symbol for the D flip-flop with enable.

The symbol for the D flip-flop with enable.

Many of the 8086's flip-flops, including the ones in the hold circuit, have an "enable" input. When the enable input is high, the flip-flop records the D input, but when the enable input is low, the flip-flop keeps its previous value. Thus, the enable input allows the flip-flop to hold its value for more than one clock cycle. The enable input is very important to the functioning of the hold circuit, as it is used to control when the circuit moves to the next state.
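
As a rough illustration (my own model in Python, not the 8086's actual circuit), the behavior of a D flip-flop with enable can be captured in a few lines:

    class DFlipFlop:
        """Edge-triggered D flip-flop with an enable input."""
        def __init__(self):
            self.q = 0

        def clock(self, d, enable=1):
            # On the clock edge, capture D only if enable is high;
            # otherwise hold the previous value.
            if enable:
                self.q = d
            return self.q

    ff = DFlipFlop()
    ff.clock(d=1, enable=0)   # enable low: output stays 0
    ff.clock(d=1, enable=1)   # enable high: output becomes 1
    ff.clock(d=0, enable=0)   # enable low again: output holds 1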

How bus hold is implemented (minimum mode)

I'll start by explaining how the hold circuitry works in minimum mode. To review, in minimum mode the external device requests a hold through the HOLD input, keeping the input high for the duration of the request. The 8086 responds by pulling the hold acknowledge HLDA output high for the duration of the hold.

In minimum mode, only three of the six flip-flops are relevant. The diagram below is highly simplified to show the essential behavior. (The full schematic is in the footnotes.6) At the left is the HOLD signal, the request from the external device.

Simplified diagram of the circuitry for minimum mode.

Simplified diagram of the circuitry for minimum mode.

When a HOLD request comes in, the first flip-flop is activated, and remains activated for the duration of the request. The second flip-flop waits if any condition is blocking the hold request: a LOCK instruction, an unaligned memory access, and so forth. When the HOLD can proceed, the second flip-flop is enabled and it latches the request. The second flip-flop controls the internal hold signal, causing the 8086 to stop further bus activity. The third flip-flop is then activated when the current bus cycle (if any) completes; when it latches the request, the hold is "official". The third flip-flop drives the external HLDA (Hold Acknowledge) pin, indicating that the bus is free. This signal also clears the bus-enabled latch (elsewhere in the 8086), putting the appropriate pins into floating tri-state mode. The key point is that the flip-flops control the timing of the internal hold and the external HLDA, moving to the next step as appropriate.

When the external device signals an end to the hold by pulling the HOLD pin low, the process reverses. The three flip-flops return to their idle state in sequence. The second flip-flop clears the internal hold signal, restarting bus activity. The third flip-flop clears the HLDA pin.7
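
Putting the pieces together, here's a highly simplified Python sketch of my own of the minimum-mode sequence, using the flip-flop model above; hold_ok and bus_idle are stand-ins for the real enabling conditions described in the text.

    request, accepted, acknowledged = DFlipFlop(), DFlipFlop(), DFlipFlop()

    def tick(hold_pin, hold_ok, bus_idle):
        request.clock(d=hold_pin)                           # tracks the HOLD pin
        accepted.clock(d=request.q, enable=hold_ok)         # internal hold signal
        acknowledged.clock(d=accepted.q, enable=bus_idle)   # drives the HLDA pin
        return acknowledged.q

    print(tick(hold_pin=1, hold_ok=0, bus_idle=1))   # 0: request blocked (e.g. LOCK)
    print(tick(hold_pin=1, hold_ok=1, bus_idle=0))   # 0: waiting for the bus cycle to end
    print(tick(hold_pin=1, hold_ok=1, bus_idle=1))   # 1: HLDA goes high
    print(tick(hold_pin=0, hold_ok=1, bus_idle=1))   # 0: hold released, HLDA drops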

How bus hold is implemented (maximum mode)

The implementation of maximum mode is tricky because it uses the same circuitry as minimum mode, but the behavior is different in several ways. First, minimum mode and maximum mode operate on opposite polarity: a hold is requested by pulling HOLD high in minimum mode versus pulling a request line low in maximum mode. Moreover, in minimum mode, a request on the HOLD pin triggers a response on the opposite pin (HLDA), while in maximum mode, a request and response are on the same pin. Finally, using the same pin for the request and grant signals requires the pin to act as both an input and an output, with tricky electrical properties.

In maximum mode, the top three flip-flops handle the request and grant on line 0, while the bottom three flip-flops handle line 1. At a high level, these flip-flops behave roughly the same as in the minimum mode case, with the first flip-flop tracking the hold request, the second flip-flop activated when the hold is "approved", and the third flip-flop activated when the bus cycle completes. An RQ 0 input will generate a GT 0 output, while a RQ 1 input will generate a GT 1 output. The diagram below is highly simplified, but illustrates the overall behavior. Keep in mind that RQ 0, GT 0, and HOLD use the same physical pin, as do RQ 1, GT 1, and HLDA.

Simplified diagram of the circuitry for maximum mode.

Simplified diagram of the circuitry for maximum mode.

In more detail, the first flip-flop converts the pulse request input into a steady signal. This is accomplished by configuring the first flip-flop to toggle on when the request pulse is received and toggle off when the end-of-request pulse is received.10 The toggle action is implemented by feeding the output back to the input, inverted (A); since the flip-flop is enabled by the RQ input, the flip-flop holds its value until an input pulse arrives. One tricky part is that the acknowledge pulse must not toggle the flip-flop. This is accomplished by using the output signal to block the toggle. (To keep the diagram simple, I've just noted the "block" action rather than showing the logic.)
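
The toggle arrangement can be sketched with the same flip-flop model (again, my own simplification): the inverted output feeds back to D, and the request pulse acts as the enable, so each one-clock pulse flips the stored state. This also illustrates why the request pulse must be exactly one clock wide (see footnote 10): a longer pulse would keep toggling the flip-flop.

    toggle = DFlipFlop()
    for rq_pulse in [0, 1, 0, 0, 1, 0]:
        toggle.clock(d=1 - toggle.q, enable=rq_pulse)
        print(toggle.q)   # 0, 1, 1, 1, 0, 0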

As before, the second flip-flop is blocked until the hold is "authorized" to proceed. However, the circuitry is more complicated since it must prioritize the two request lines and ensure that only one hold is granted at a time. If RQ0's first flip-flop is active, it blocks the enable of RQ1's second flip-flop (B). Conversely, if RQ1's second flip-flop is active, it blocks the enable of RQ0's second flip-flop (C). Note the asymmetry, blocking on RQ0's first flip-flop and RQ1's second flip-flop. This enforces the priority of RQ0 over RQ1, since an RQ0 request blocks RQ1 but only an RQ1 "approval" blocks RQ0.

When the second flip-flop is activated in either path, it triggers the internal hold signal (D).8 As before, the hold request is latched into the third flip-flop when any existing memory cycle completes. When the hold request is granted, a pulse is generated (E) on the corresponding GT pin.9

The same circuitry is used for minimum mode and maximum mode, although the above diagrams show differences between the two modes. How does this work? Essentially, logic gates are used to change the behavior between minimum mode and maximum mode as required. For the most part, the circuitry works the same, so only a moderate amount of logic is required to make the same circuitry work for both. On the schematic, the signal MN is active during minimum mode, while MX is active during maximum mode, and these signals control the behavior.

The "hold ok" circuit

As usually happens with the 8086, there are a bunch of special cases when different features interact. One special case is if a bus hold request comes in while the 8086 is acknowledging an interrupt. In this case, the interrupt takes priority and the bus hold is not processed until the interrupt acknowledgment is completed. A second special case is if the bus hold occurs while the 8086 is halted. In this case, the 8086 issues a second HALT indicator at the end of the bus hold. Yet another special case is the 8086's LOCK prefix, which locks the use of the bus for the following instruction, so a bus hold request is not honored until the locked instruction has completed. Finally, the 8086 performs an unaligned word access to memory by breaking it into two 8-bit bus cycles; these two cycles can't be interrupted by a bus hold.

In more detail, the "hold ok" circuit determines at each cycle if a hold could proceed. There are several conditions under which the hold can proceed:

  • The bus cycle is `T2`, except if an unaligned bus operation is taking place (i.e. a word split into two byte operations), or
  • A memory cycle is not active and a microcode memory operation is not taking place, or
  • A memory cycle is not active and a hold is currently active.

The first case occurs during bus (memory) activity, where a hold request before cycle T2 will be handled at the end of that cycle. The second case allows a hold if the bus is inactive. But if microcode is performing a memory operation, the hold will be delayed, even if the request is just starting. The third case is opposite from the other two: it enables the flip flop so a hold request can be dropped. (This ensures that the hold request can still be dropped in the corner case where a hold starts and then the microcode makes a memory request, which will be blocked by the hold.)
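
Written out as a boolean expression (my own paraphrase of the conditions above, with descriptive names rather than the 8086's internal signal names):

    def hold_ok(bus_cycle_t2, unaligned_op, memory_cycle_active,
                microcode_memory_op, hold_active):
        return ((bus_cycle_t2 and not unaligned_op) or
                (not memory_cycle_active and not microcode_memory_op) or
                (not memory_cycle_active and hold_active))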

The "hold ok" circuit. This has been rearranged from the schematic to make the behavior more clear.

The "hold ok" circuit. This has been rearranged from the schematic to make the behavior more clear.

An instruction with the LOCK prefix causes the bus to be locked against other devices for the duration of the instruction. Thus, a hold cannot be granted while the instruction is running. This is implemented through a separate path. This logic is between the output of the first (request) flip-flop and the second (accepted) flip-flop, tied into the LOCK signal. Conceptually, it seems that the LOCK signal should block hold-ok and thus block the second (accepted) flip-flop from being enabled. But instead, the LOCK signal blocks the data path, unless the request has already been granted. I think the motivation is to allow dropping of a hold request to proceed uninterrupted. In other words, LOCK prevents a hold from being accepted, but it doesn't prevent a hold from being dropped, and it was easier to implement this in the data path.

The pin drive circuitry

The circuitry for the HOLD/RQ0/GT0 and HLDA/RQ1/GT1 pins is somewhat complicated, since they are used for both input and output. In minimum mode, the HOLD pin is an input, while the HLDA pin is an output. In maximum mode, both pins act as inputs, with a low-going pulse from an external device starting or stopping the hold. But the 8086 also issues pulses on the same pins to grant the hold. Pull-up resistors inside the 8086 ensure that the pins remain high (idle) when unused. Finally, an undocumented active pull-up system restores a pin to a high state after it is pulled low, providing faster response than the resistor.

The schematic below shows the heart of the tri-state output circuit. Each pin is connected to two large output MOSFETs, one to drive the pin high and one to drive the pin low. The transistors have separate control lines; if both control lines are low, both transistors are off and the pin floats in the "tri-state" condition. This permits the pin to be used as an input, driven by an external device. The pull-up resistor keeps the pin in a high state.

The tri-state output circuit for each hold pin.

The tri-state output circuit for each hold pin.

The diagram below shows how this circuitry looks on the die. In this image, the metal and polysilicon layers have been removed with acid to show the underlying doped silicon regions. The thin white stripes are transistor gates where polysilicon wiring crossed the silicon. The black circles are vias that connected the silicon to the metal on top. The empty regions at the right are where the metal pads for HOLD and HLDA were. Next to the pads are the large transistors to pull the outputs high or low. Because the outputs require much higher current than internal signals, these transistors are much larger than logic transistors. They are composed of several transistors placed in parallel, resulting in the parallel stripes. The small pullup resistors are also visible. For efficiency, these resistors are actually depletion-mode transistors, specially doped to act as constant-current sources.

The HOLD/HLDA pin circuitry on the die.

The HOLD/HLDA pin circuitry on the die.

At the left, some of the circuitry is visible. The large output transistors are driven by "superbuffers" that provide more current than regular NMOS buffers. (A superbuffer uses separate transistors to pull the signal high and low, rather than using a pull-up to pull the signal high as in a standard NMOS gate.) The small transistors are the pass transistors that gate output signals according to the clock. The thick rectangles are crossovers, allowing the vertical metal wiring (no longer visible) to cross over a horizontal signal carried in the silicon. The 8086 has only a single metal layer, so the layout requires a crossover if signals will otherwise intersect. Because silicon's resistance is much higher than metal's resistance, the crossover is relatively wide to reduce the resistance.

The problem with a pull-up resistor is that it is relatively slow when connected to a line with high capacitance. You essentially end up with a resistor-capacitor delay circuit, as the resistor slowly charges the line and brings it up to a high level. To get around this, the 8086 has an active drive circuit to pulse the RQ/GT lines high to pull them back from a low level. This circuit pulls the line high one cycle after the 8086 drops it low for a grant acknowledge. This circuit also pulls the line high after the external device pulls it low.11 (The schematic for this circuit is in the footnotes.) The curious thing is that I couldn't find this behavior documented in the datasheets. The datasheets describe the internal pull-up resistor, but don't mention that the 8086 actively pulls the lines high.12

Conclusions

The hold circuitry was a key feature of the 8086, largely because it was necessary for the 8087 math coprocessor chip. The hold circuitry seems like it should be straightforward, but there are many corner cases in this circuitry: it interacts with unaligned memory accesses, the LOCK prefix, and minimum vs. maximum modes. As a result, it is fairly complicated.

Personally, I find the hold circuitry somewhat unsatisfying to study, with few fundamental concepts but a lot of special-case logic. The circuitry seems overly complicated for what it does. Much of the complexity is probably due to the wildly different behavior of the pins between minimum and maximum mode. Intel should have simply used a larger package (like the Motorola 68000) rather than re-using pins to support different modes, as well as using the same pin for a request and a grant. It's impressive, however, that the same circuitry was made to work for both minimum and maximum modes, despite using completely different signals to request and grant holds. This circuitry must have been a nightmare for Intel's test engineers, trying to ensure that the circuitry performed properly when there were so many corner cases and potential race conditions.

I plan to write more on the 8086, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space and Bluesky as @righto.com so you can follow me there too.

Notes and references

  1. The timing of priority between RQ0 and RQ1 is left vague in the documentation. In practice, even if RQ1 is requested first, a later RQ0 can still preempt it until the point that RQ1 is internally granted (i.e. the second flip-flop is activated). This happens before the hold is externally acknowledged, so it's not obvious to the user at what point priority no longer applies. 

  2. The RQ/GT0 and RQ/GT1 signals are active-low. These signals should have an overbar to indicate this, but it makes the page formatting ugly :-) 

  3. Modern x86 processors still support the 8087 (x87) instruction set. Starting with the 80486DX, the floating point unit was included on the CPU die, rather than as an external coprocessor. The x87 instruction set used a stack-based model, which made it harder to parallelize. To mitigate this, Intel introduced SSE in 1999, a different set of floating point instructions that worked on an independent register set. The x87 instructions are now considered mostly obsolete and are deprecated in 64-bit Windows

  4. The 8087 provides another RQ/GT input line for an external device. Thus, two external devices can still be used in a system with an 8087. That is, although the 8087 uses up one of the 8086's two RQ/GT lines, the 8087 provides another one, so there are still two lines available. The 8087 has logic to combine its bus requests and external bus requests into a single RQ/GT line to the 8086. 

  5. Confusingly, some of the flip-flops in the hold circuit transition when the clock goes high, while others use the inverted clock signal and transition when the clock goes low. Moreover, the flip-flops are inconsistent about how they treat the data. In each group of three flip-flops, the first flip-flop is active-high, while the remaining flip-flops are active-low. For the most part, I'll ignore this in the discussion. You can look at the schematic if you want to know the details. 

  6. The schematics below show my reverse-engineered schematic for the hold circuitry. I have partitioned the schematic into the hold logic and the output driver circuitry. This split matches the physical partitioning of the circuitry on the die.

    In the first schematic, the upper part handles HOLD and request0, while the lower part handles request1. There is some circuitry in the middle to handle the common enabling and to generate the internal hold signal. I won't explain the circuitry in much detail, but there are a few things I want to point out. First, even though the hold circuit seems like it should be simple, there are a lot of gates connected in complicated ways. Second, although there are many inverters, NAND, and NOR gates, there are also complex gates such as AND-NOR, OR-NAND, AND-OR-NAND, and so forth. These are implemented as single gates. Due to how gates are constructed from NMOS transistors, it is just as easy to build a hierarchical gate as a single gate. (The last step must be inversion, though.) The XOR gates are more complex; they are constructed from a NOR gate and an AND-NOR gate.

    Schematic of the hold circuitry. Click this image (or any other) for a larger version.

    Schematic of the hold circuitry. Click this image (or any other) for a larger version.

    The schematic below shows the output circuits for the two pins. These circuits are similar, but have a few differences because only the bottom one is used as an output (HLDA) in minimum mode. Each circuit has two inputs: what the current value of the pin is, and what the desired value of the pin is.

    Schematic of the pin output circuits.

    Schematic of the pin output circuits.

     

  7. Interestingly, the external pins aren't taken out of tri-state mode immediately when the HLDA signal is dropped. Instead, the 8086's bus drivers are re-enabled when a bus cycle starts, which is slightly later. The bus circuitry has a separate flip-flop to manage the enable/disable state, and the start of a bus cycle is what re-enables the bus. This is another example of behavior that the documentation leaves ambiguous. 

  8. There's one more complication for the hold-out signal. If a hold is granted on one line, a request comes in on the other line, and then the hold is released on the first line, the desired behavior is for the bus to remain in the hold state as the hold switches to the second line. However, because of the way a hold on line 1 blocks a hold on line 0, the GT1 second flip-flop will drop a cycle before the GT0 second flip-flop is activated. This would cause hold-out to drop for a cycle and the 8086 could start unwanted bus activity. To prevent this case, the hold-out line is also activated if there is an RQ0 request and RQ1 is granted. This condition seems a bit random but it covers the "gap". I have to wonder if Intel planned the circuit this way, or they added the extra test as a bug fix. (The asymmetry of the priority circuit causes this problem to occur only going from a hold on line 1 to line 0, not the other direction.) 

  9. The pulse-generating circuit is a bit tricky. A pulse signal is generated if the request has been accepted, has not been granted, and will be granted on the next clock (i.e. no memory request is active so the flip-flop is enabled). (Note that the pulse will be one cycle wide, since granting the request on the next cycle will cause the conditions to be no longer satisfied.) This provides the pulse one clock cycle before the flip-flop makes it "official". Moreover, the signals come from the inverted Q outputs from the flip-flops, which are updated half a clock cycle earlier. The result is that the pulse is generated 1.5 clock cycles before the flip-flop output. Presumably the point of this is to respond to hold requests faster, but it seems overly complicated. 

  10. The request pulse is required to be one clock cycle wide. The feedback loop shows why: if the request is longer than one clock cycle, the first flip-flop will repeatedly toggle on and off, resulting in unexpected behavior. 

  11. The details of the active pull-up circuitry don't make sense to me. First it XORs the state of the pin with the desired state of the pin and uses this signal to control a multiplexer, which generates the pull-up action based on other gates. The result of all this ends up being simply NAND, implemented with excessive gates. Another issue is that minimum mode blocks the active pull-up, which makes sense. But there are additional logic gates so minimum mode can affect the value going into the multiplexer, which gets blocked in minimum mode, so that logic seems wasted. There are also two separate circuits to block pull-up during reset. My suspicion is that the original logic accumulated bug fixes and redundant logic wasn't removed. But it's possible that the implementation is doing something clever that I'm just missing. 

  12. My analysis of the RQ/GT lines being pulled high is based on simulation. It would be interesting to verify this behavior on a physical 8086 chip. By measuring the current out of the pin, the pull-up pulses should be visible. 





[#] Sat Aug 12 2023 09:26:28 UTC from rss <>

Subject: Tracing the roots of the 8086 instruction set to the Datapoint 2200 minicomputer

[Reply] [ReplyQuoted] [Headers] [Print]

The Intel 8086 processor started the x86 architecture that is still extensively used today. The 8086 has some quirky characteristics: it is little-endian, has a parity flag, and uses explicit I/O instructions instead of just memory-mapped I/O. It has four 16-bit registers that can be split into 8-bit registers, but only one that can be used for memory indexing. Surprisingly, the reason for these characteristics and more is compatibility with a computer dating back before the creation of the microprocessor: the Datapoint 2200, a minicomputer with a processor built out of TTL chips. In this blog post, I'll look in detail at how the Datapoint 2200 led to the architecture of Intel's modern processors, step by step through the 8008, 8080, and 8086 processors.

The Datapoint 2200

In the late 1960s, 80-column IBM punch cards were the primary way of entering data into computers, although CRT terminals were growing in popularity. The Datapoint 2200 was designed as a low-cost terminal that could replace a keypunch, with a squat CRT display the size of a punch card. By putting some processing power into the Datapoint 2200, it could perform data validation and other tasks, making data entry more efficient. Even though the Datapoint 2200 was typically used as an intelligent terminal, it was really a desktop minicomputer with a "unique combination of powerful computer, display, and dual cassette drives." Although now mostly forgotten, the Datapoint 2200 was the origin of the 8-bit microprocessor, as I'll explain below.

The Datapoint 2200 computer (Version II).

The Datapoint 2200 computer (Version II).

The memory storage of the Datapoint 2200 had a large impact on its architecture and thus the architecture of today's computers. In the 1960s and early 1970s, magnetic core memory was the dominant form of computer storage. It consisted of tiny ferrite rings, threaded into grids, with each ring storing one bit. Magnetic core storage was bulky and relatively expensive, though. Semiconductor RAM was new and very expensive; Intel's first product in 1969 was a RAM chip called the 3101, which held just 64 bits and cost $99.50. To minimize storage costs, the Datapoint 2200 used an alternative: MOS shift-register memory. The Intel 1405 shift-register memory chip provided much more storage than RAM chips at a much lower cost (512 bits for $13.30).1

Intel 1405 shift-register memory chips in metal cans, in the Datapoint 2200.

Intel 1405 shift-register memory chips in metal cans, in the Datapoint 2200.

The big problem with shift-register memory is that it is sequential: the bits come out one at a time, in the same order you put them in. This wasn't a problem when executing instructions sequentially, since the memory provided each instruction as it was needed. For a random access, though, you need to wait until the bits circulate around and you get the one you want, which is very slow. To minimize the number of memory accesses, the Datapoint 2200 had seven registers, a relatively large number of registers for the time.2 The registers were called A, B, C, D, E, H, and L, and these names had a lasting impact on Intel processors.

Another consequence of shift-register memory was that the Datapoint 2200 was a serial computer, operating on one bit at a time as the shift-register memory provided it, using a 1-bit ALU. To handle arithmetic operations, the ALU needed to start with the lowest bit so it could process carries. Likewise, a 16-bit value (such as a jump target) needed to start with the lowest bit. This resulted in a little-endian architecture, with the low byte first. The little-endian architecture has remained in Intel processors to the present.
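
As a tiny illustration of my own of what little-endian means in practice, packing a 16-bit value puts the low byte first in memory:

    import struct

    print(struct.pack("<H", 0x1234))   # b'\x34\x12': the low byte comes first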

Since the Datapoint 2200 was designed before the creation of the microprocessor, its processor was built from a board of TTL chips (as was typical for minicomputers at the time). The diagram below shows the processor board with the chips categorized by function. The board has a separate chip for each 8-bit register (B, C, D, etc.) and separate chips for control flags (Z, carry, etc.). The Arithmetic/Logic Unit (ALU) takes about 18 chips, while instruction decoding is another 18 chips. Because every feature required more chips, the designers of the Datapoint 2200 were strongly motivated to make the instruction set as simple as possible. This was necessary since the Datapoint 2200 was a low-cost device, renting for just $148 a month. In contrast, the popular PDP-8 minicomputer rented for $500 a month.

The Datapoint 2200 processor board with registers, flags, and other blocks labeled. Click this image (or any other) for a larger version.

The Datapoint 2200 processor board with registers, flags, and other blocks labeled. Click this image (or any other) for a larger version.

One way that the Datapoint 2200 simplified the hardware was by creating a large set of instructions by combining simpler pieces in an orthogonal way. For instance, the Datapoint 2200 has 64 ALU instructions that apply one of eight ALU operations to one of the eight registers. This requires a small amount of hardware—eight ALU circuits and a circuit to select the register—but provides a large number of instructions. Another example is the register-to-register move instructions. Specifying one of eight source registers and one of eight destination registers provides a large, flexible set of instructions to move data.

The Datapoint 2200's instruction format was designed around this principle, with groups of three bits specifying a register. A common TTL chip could decode the group of three bits and activate the desired circuit.3 For instance, a data move instruction had the bit pattern 11DDDSSS to move a byte from the specified source (SSS) to the specified destination (DDD). (Note that this bit pattern maps onto three octal digits very nicely since the source and destination are separate digits.4)

One unusual feature of the Datapoint instruction set is that a memory access was just like a register access. That is, an instruction could specify one of the seven physical registers or could specify a memory access (M), using the identical instruction format. One consequence of this is that you couldn't include a memory address in an instruction. Instead, memory could only be accessed by first loading the address into the H and L registers, which held the high and low byte of the address respectively.5 This is very unusual and inconvenient, since a memory access took three instructions: two to load the H and L registers and one to access memory as the M "register". The advantage was that it simplified the instruction set and the decoding logic, saving chips and thus reducing the system cost. This decision also had lasting impact on Intel processors and how they access memory.
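
Here's a small Python sketch of my own showing how regular the move encoding is; the register numbering (A=0 through L=6, with M=7 for memory) is read off the instruction table below.

    REGS = ["A", "B", "C", "D", "E", "H", "L", "M"]   # M = memory access via H and L

    def decode_move(opcode):
        # Format 11DDDSSS: a two-bit prefix, then destination and source fields.
        assert opcode >> 6 == 0b11
        dest = (opcode >> 3) & 0b111
        src = opcode & 0b111
        return "L" + REGS[dest] + REGS[src]

    print(oct(0o312), decode_move(0o312))   # 0o312 LBC: the octal digits 3-1-2 read off directly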

The table below shows the Datapoint 2200's instruction set in an octal table showing the 256 potential opcodes.6 I have roughly classified the instructions as arithmetic/logic (purple), control-flow (blue), data movement (green), input/output (orange), and miscellaneous (yellow). Note how the orthogonal instruction format produces large blocks of related instructions. The instructions in the lower right (green) load (L) a value from a source to a destination. (The no-operation NOP and HALT instructions are special cases.7) In the upper-left are Load operations (LA, etc.) that use an "immediate" byte, a data byte that follows the instruction. They use the same DDD code to specify the destination register, reusing that circuitry.

 0123456701234567
0HALTHALTSLCRFCAD LARETURNJFCINPUTCFC JMP CALL 
1  SRCRFZAC LB JFZ CFZ     
2   RFSSU LC JFSEX ADRCFSEX STATUS EX DATA EX WRITE
3   RFPSB LD JFPEX COM1CFPEX COM2 EX COM3 EX COM4
4   RTCND LE JTC CTC     
5   RTZXR LH JTZEX BEEPCTZEX CLICK EX DECK1 EX DECK2
6   RTSOR LL JTSEX RBKCTSEX WBK   EX BSP
7   RTPCP   JTPEX SFCTPEX SB EX REWND EX TSTOP
0ADAADBADCADDADEADHADLADMNOPLABLACLADLAELAHLALLAM
1ACAACBACCACDACEACHACLACMLBALBBLBCLBDLBELBHLBLLBM
2SUASUBSUCSUDSUESUHSULSUMLCALCBLCCLCDLCELCHLCLLCM
3SBASBBSBCSBDSBESBHSBLSBMLDALDBLDCLDDLDELDHLDLLDM
4NDANDBNDCNDDNDENDHNDLNDMLEALEBLECLEDLEELEHLELLEM
5XRAXRBXRCXRDXREXRHXRLXRMLHALHBLHCLHDLHELHHLHLLHM
6ORAORBORCORDOREORHORLORMLLALLBLLCLLDLLELLHLLLLLM
7CPACPBCPCCPDCPECPHCPLCPMLMALMBLMCLMDLMELMHLMLHALT

The lower-left quadrant (purple) has the bulk of the ALU instructions. These instructions have a regular, orthogonal structure making the instructions easy to decode: each row specifies the operation while each column specifies the source. This is due to the instruction structure: eight bits in the pattern 10AAASSS, where the AAA bits specified the ALU operation and the SSS bits specified the register source. The three-bit ALU code specifies the operations Add, Add with Carry, Subtract, Subtract with Borrow, logical AND, logical XOR, logical OR, and Compare. This list is important because it defined the fundamental ALU operations for later Intel processors.8 In the upper-left are ALU operations that use an "immediate" byte. These instructions use the same AAA bit pattern to select the ALU operation, reusing the decoding hardware. Finally, the shift instructions SLC and SRC are implemented as special cases outside the pattern.

The upper columns contain conditional instructions in blue—Return, Jump, and Call. The eight conditions test the four status flags (Carry, Zero, Sign, and Parity) for either True or False. (For example, JFZ Jumps if the Zero flag is False.) A 3-bit field selects the condition, allowing it to be easily decoded in hardware. The parity flag is somewhat unusual because parity is surprisingly expensive to compute in hardware, but because the Datapoint 2200 operated as a terminal, parity computation was important.
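
In software terms, the parity flag just counts 1 bits. Here's a quick Python equivalent of my own (on the 8086, the flag is set when the low byte of the result contains an even number of 1 bits):

    def parity_flag(value):
        # Set when the low byte contains an even number of 1 bits.
        return bin(value & 0xFF).count("1") % 2 == 0

    print(parity_flag(0b01100001))   # False: three 1 bits
    print(parity_flag(0b01100011))   # True: four 1 bits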

The Datapoint 2200 has an input instruction as well as many output instructions for a variety of specific hardware tasks (orange, labeled EX for external). Typical operations are STATUS to get I/O status, BEEP and CLICK to make sound, and REWIND to rewind the tape. As a result of this decision to use separate I/O instructions, Intel processors still use I/O instructions operating in an I/O space, different from processors such as the MOS 6502 and the Motorola 68000 that used memory-mapped I/O.

To summarize, the Datapoint 2200 has a fairly large number of instructions, but they are generated from about a dozen simple patterns that are easy to decode.9 By combining orthogonal bit fields (e.g. 8 ALU operations multiplied by 8 source registers), 64 instructions can be generated from one underlying pattern.

Intel 8008

The Intel 8008 was created as a clone of the Datapoint 2200 processor.10 Around the end of 1969, the Datapoint company talked with Intel and Texas Instruments about the possibility of replacing the processor board with a single chip. Even though the microprocessor didn't exist at this point, both companies said they could create such a chip. Texas Instruments was first with a chip called the TMX 1795 that they advertised as a "CPU on a chip". Slightly later, Intel produced the 8008 microprocessor. Both chips copied the Datapoint 2200's instruction set architecture with minor changes.

The Intel 8008 chip in its 18-pin package. The small number of pins hampered the performance of the 8008, but Intel was hesitant to even go to the 18-pin package. Photo by Thomas Nguyen, (CC BY-SA 4.0).

The Intel 8008 chip in its 18-pin package. The small number of pins hampered the performance of the 8008, but Intel was hesitant to even go to the 18-pin package. Photo by Thomas Nguyen, (CC BY-SA 4.0).

By the time the chips were completed, however, the Datapoint corporation had lost interest in the chips. They were designing a much faster version of the Datapoint 2200 with improved TTL chips (including the well-known 74181 ALU chip). Even the original Datapoint 2200 model was faster than the Intel 8008 processor, and the Version II was over 5 times faster,11 so moving to a single-chip processor would be a step backward.

Texas Instruments unsuccessfully tried to find a customer for their TMX 1795 chip and ended up abandoning the chip. Intel, however, marketed the 8008 as an 8-bit microprocessor, essentially creating the microprocessor industry. In my view, Intel's biggest innovation with the microprocessor wasn't creating a single-chip CPU, but creating the microprocessor as a product category: a general-purpose processor along with everything customers needed to take advantage of it. Intel put an enormous amount of effort into making microprocessors a success: from documentation and customer training to Intellec development systems, from support chips to software tools such as assemblers, compilers, and operating systems.

The table below shows the opcodes of the 8008. For the most part, the 8008 copies the Datapoint 2200, with identical instructions that have identical opcodes (in color). There are a few additional instructions (shown in white), though. Intel designer Ted Hoff realized that increment and decrement instructions (IN and DC) would be very useful for loops. There are two additional bit rotate instructions (RAL and RAR) as well as the "missing" LMI (Load Immediate to Memory) instruction. The RST (restart) instructions act as short call instructions to fixed addresses for interrupt handling. Finally, the 8008 turned the Datapoint 2200's device-specific I/O instructions into 32 generic I/O instructions.

 0123456701234567
0HLTHLTRLCRFCADIRST 0LAIRETJFCINP 0CFCINP 1JMPINP 2CALINP 3
1INBDCBRRCRFZACIRST 1LBI JFZINP 4CFZINP 5 INP 6 INP 7
2INCDCCRALRFSSUIRST 2LCI JFSOUT 8CFSOUT 9 OUT 10 OUT 11
3INDDCDRARRFPSBIRST 3LDI JFPOUT 12CFPOUT 13 OUT 14 OUT 15
4INEDCE RTCNDIRST 4LEI JTCOUT 16CTCOUT 17 OUT 18 OUT 19
5INHDCH RTZXRIRST 5LHI JTZOUT 20CTZOUT 21 OUT 22 OUT 23
6INLDCL RTSORIRST 6LLI JTSOUT 24CTSOUT 25 OUT 26 OUT 27
7   RTPCPIRST 7LMI JTPOUT 28CTPOUT 29 OUT 30 OUT 31
0ADAADBADCADDADEADHADLADMNOPLABLACLADLAELAHLALLAM
1ACAACBACCACDACEACHACLACMLBALBBLBCLBDLBELBHLBLLBM
2SUASUBSUCSUDSUESUHSULSUMLCALCBLCCLCDLCELCHLCLLCM
3SBASBBSBCSBDSBESBHSBLSBMLDALDBLDCLDDLDELDHLDLLDM
4NDANDBNDCNDDNDENDHNDLNDMLEALEBLECLEDLEELEHLELLEM
5XRAXRBXRCXRDXREXRHXRLXRMLHALHBLHCLHDLHELHHLHLLHM
6ORAORBORCORDOREORHORLORMLLALLBLLCLLDLLELLHLLLLLM
7CPACPBCPCCPDCPECPHCPLCPMLMALMBLMCLMDLMELMHLMLHLT

Intel 8080

The 8080 improved the 8008 in many ways, focusing on speed and ease of use, and resolving customer issues with the 8008.12 Customers had criticized the 8008 for its small memory capacity, low speed, and difficult hardware interfacing. The 8080 increased memory capacity from 16K to 64K and was over an order of magnitude faster than the 8008. The 8080 also moved to a 40-pin package that made interfacing easier, but the 8080 still required a large number of support chips to build a working system.

Although the 8080 was widely used in embedded systems, it is more famous for its use in the first generation of home computers, boxes such as the Altair and IMSAI. Famed chip designer Federico Faggin said that the 8080 really created the microprocessor; the 4004 and 8008 suggested it, but the 8080 made it real.13

Altair 8800 computer on display at the Smithsonian. Photo by Colin Douglas, (CC BY-SA 2.0).

Altair 8800 computer on display at the Smithsonian. Photo by Colin Douglas, (CC BY-SA 2.0).

The table below shows the instruction set for the 8080. The 8080 was designed to be compatible with 8008 assembly programs after a simple translation process; the instructions have been shifted around and the names have changed.15 The instructions from the Datapoint 2200 (colored) form the majority of the 8080's instruction set. The instruction set was expanded by adding some 16-bit support, allowing register pairs (BC, DE, HL) to be used as 16-bit registers for double add, 16-bit increment and decrement, and 16-bit memory transfers. Many of the new instructions in the 8080 may seem like contrived special cases— for example, SPHL (Load SP from HL) and XCHG (Exchange DE and HL)— but they made accesses to memory easier. The I/O instructions from the 8008 have been condensed to just IN and OUT, opening up room for new instructions.

  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
0 | NOP | LXI B | STAX B | INX B | INR B | DCR B | MVI B | RLC | MOV B,B | MOV B,C | MOV B,D | MOV B,E | MOV B,H | MOV B,L | MOV B,M | MOV B,A
1 | - | DAD B | LDAX B | DCX B | INR C | DCR C | MVI C | RRC | MOV C,B | MOV C,C | MOV C,D | MOV C,E | MOV C,H | MOV C,L | MOV C,M | MOV C,A
2 | - | LXI D | STAX D | INX D | INR D | DCR D | MVI D | RAL | MOV D,B | MOV D,C | MOV D,D | MOV D,E | MOV D,H | MOV D,L | MOV D,M | MOV D,A
3 | - | DAD D | LDAX D | DCX D | INR E | DCR E | MVI E | RAR | MOV E,B | MOV E,C | MOV E,D | MOV E,E | MOV E,H | MOV E,L | MOV E,M | MOV E,A
4 | - | LXI H | SHLD | INX H | INR H | DCR H | MVI H | DAA | MOV H,B | MOV H,C | MOV H,D | MOV H,E | MOV H,H | MOV H,L | MOV H,M | MOV H,A
5 | - | DAD H | LHLD | DCX H | INR L | DCR L | MVI L | CMA | MOV L,B | MOV L,C | MOV L,D | MOV L,E | MOV L,H | MOV L,L | MOV L,M | MOV L,A
6 | - | LXI SP | STA | INX SP | INR M | DCR M | MVI M | STC | MOV M,B | MOV M,C | MOV M,D | MOV M,E | MOV M,H | MOV M,L | HLT | MOV M,A
7 | - | DAD SP | LDA | DCX SP | INR A | DCR A | MVI A | CMC | MOV A,B | MOV A,C | MOV A,D | MOV A,E | MOV A,H | MOV A,L | MOV A,M | MOV A,A

0 | ADD B | ADD C | ADD D | ADD E | ADD H | ADD L | ADD M | ADD A | RNZ | POP B | JNZ | JMP | CNZ | PUSH B | ADI | RST 0
1 | ADC B | ADC C | ADC D | ADC E | ADC H | ADC L | ADC M | ADC A | RZ | RET | JZ | - | CZ | CALL | ACI | RST 1
2 | SUB B | SUB C | SUB D | SUB E | SUB H | SUB L | SUB M | SUB A | RNC | POP D | JNC | OUT | CNC | PUSH D | SUI | RST 2
3 | SBB B | SBB C | SBB D | SBB E | SBB H | SBB L | SBB M | SBB A | RC | - | JC | IN | CC | - | SBI | RST 3
4 | ANA B | ANA C | ANA D | ANA E | ANA H | ANA L | ANA M | ANA A | RPO | POP H | JPO | XTHL | CPO | PUSH H | ANI | RST 4
5 | XRA B | XRA C | XRA D | XRA E | XRA H | XRA L | XRA M | XRA A | RPE | PCHL | JPE | XCHG | CPE | - | XRI | RST 5
6 | ORA B | ORA C | ORA D | ORA E | ORA H | ORA L | ORA M | ORA A | RP | POP PSW | JP | DI | CP | PUSH PSW | ORI | RST 6
7 | CMP B | CMP C | CMP D | CMP E | CMP H | CMP L | CMP M | CMP A | RM | SPHL | JM | EI | CM | - | CPI | RST 7

The 8080 also moved the stack to external memory, rather than using an internal fixed special-purpose stack as in the 8008 and Datapoint 2200. This allowed PUSH and POP instructions to put register data on the stack. Interrupt handling was also improved by adding the Enable Interrupt and Disable Interrupt instructions (EI and DI).14
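
To make the register-pair and stack behavior concrete, here is a minimal Python sketch (my own model, not anything from Intel documentation) of how an 8080-style PUSH and POP move a 16-bit register pair between the registers and external memory through the stack pointer:

    # Toy model of 8080-style register pairs and the external-memory stack.
    # A register pair is just two 8-bit registers treated as one 16-bit value;
    # PUSH/POP copy a pair to/from memory at the stack pointer (SP).
    memory = bytearray(65536)          # the 8080's 64K of external memory
    regs = {"B": 0x12, "C": 0x34, "H": 0x00, "L": 0x00}
    SP = 0x2400                        # stack grows downward from here

    def pair(hi, lo):                  # e.g. BC = B*256 + C
        return (regs[hi] << 8) | regs[lo]

    def push(hi, lo):                  # like PUSH B
        global SP
        memory[SP - 1] = regs[hi]      # high byte goes to SP-1
        memory[SP - 2] = regs[lo]      # low byte goes to SP-2
        SP -= 2

    def pop(hi, lo):                   # like POP H
        global SP
        regs[lo] = memory[SP]
        regs[hi] = memory[SP + 1]
        SP += 2

    push("B", "C")
    pop("H", "L")                      # HL now holds the value that BC held
    print(hex(pair("H", "L")), hex(SP))   # 0x1234 0x2400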

Intel 8085

The Intel 8085 was designed as a "mid-life kicker" for the 8080, providing incremental improvements while maintaining compatibility. From the hardware perspective, the 8085 was much easier to use than the 8080. While the 8080 required three voltages, the 8085 required a single 5-volt power supply (represented by the "5" in the part number). Moreover, the 8085 eliminated most of the support chips required with the 8080; a working 8085 computer could be built with just three chips. Finally, the 8085 provided additional hardware functionality: better interrupt support and serial I/O.

The Intel 8085, like the 8080 and the 8086, was packaged in a 40-pin DIP. Photo by Thomas Nguyen, (CC BY-SA 4.0).

On the software side, the 8085 is curious: 12 instructions were added to the instruction set (finally using every opcode), but all except two were hidden and left undocumented.16 Moreover, the 8085 added two new condition codes, but these were also hidden. This situation occurred because the 8086 project started up in 1976, near the release of the 8085 chip. Intel wanted the 8086 to be compatible (to some extent) with the 8080 and 8085, but providing new instructions in the 8085 would make compatibility harder. It was too late to remove the instructions from the 8085 chip, so Intel did the next best thing and removed them from the documentation. These instructions are shown in red in the table below. Only the new SIM and RIM instructions were documented, since they were necessary to use the 8085's new interrupt and serial I/O features.

  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
0 | NOP | LXI B | STAX B | INX B | INR B | DCR B | MVI B | RLC | MOV B,B | MOV B,C | MOV B,D | MOV B,E | MOV B,H | MOV B,L | MOV B,M | MOV B,A
1 | DSUB | DAD B | LDAX B | DCX B | INR C | DCR C | MVI C | RRC | MOV C,B | MOV C,C | MOV C,D | MOV C,E | MOV C,H | MOV C,L | MOV C,M | MOV C,A
2 | ARHL | LXI D | STAX D | INX D | INR D | DCR D | MVI D | RAL | MOV D,B | MOV D,C | MOV D,D | MOV D,E | MOV D,H | MOV D,L | MOV D,M | MOV D,A
3 | RDEL | DAD D | LDAX D | DCX D | INR E | DCR E | MVI E | RAR | MOV E,B | MOV E,C | MOV E,D | MOV E,E | MOV E,H | MOV E,L | MOV E,M | MOV E,A
4 | RIM | LXI H | SHLD | INX H | INR H | DCR H | MVI H | DAA | MOV H,B | MOV H,C | MOV H,D | MOV H,E | MOV H,H | MOV H,L | MOV H,M | MOV H,A
5 | LDHI | DAD H | LHLD | DCX H | INR L | DCR L | MVI L | CMA | MOV L,B | MOV L,C | MOV L,D | MOV L,E | MOV L,H | MOV L,L | MOV L,M | MOV L,A
6 | SIM | LXI SP | STA | INX SP | INR M | DCR M | MVI M | STC | MOV M,B | MOV M,C | MOV M,D | MOV M,E | MOV M,H | MOV M,L | HLT | MOV M,A
7 | LDSI | DAD SP | LDA | DCX SP | INR A | DCR A | MVI A | CMC | MOV A,B | MOV A,C | MOV A,D | MOV A,E | MOV A,H | MOV A,L | MOV A,M | MOV A,A

0 | ADD B | ADD C | ADD D | ADD E | ADD H | ADD L | ADD M | ADD A | RNZ | POP B | JNZ | JMP | CNZ | PUSH B | ADI | RST 0
1 | ADC B | ADC C | ADC D | ADC E | ADC H | ADC L | ADC M | ADC A | RZ | RET | JZ | RSTV | CZ | CALL | ACI | RST 1
2 | SUB B | SUB C | SUB D | SUB E | SUB H | SUB L | SUB M | SUB A | RNC | POP D | JNC | OUT | CNC | PUSH D | SUI | RST 2
3 | SBB B | SBB C | SBB D | SBB E | SBB H | SBB L | SBB M | SBB A | RC | SHLX | JC | IN | CC | JNK | SBI | RST 3
4 | ANA B | ANA C | ANA D | ANA E | ANA H | ANA L | ANA M | ANA A | RPO | POP H | JPO | XTHL | CPO | PUSH H | ANI | RST 4
5 | XRA B | XRA C | XRA D | XRA E | XRA H | XRA L | XRA M | XRA A | RPE | PCHL | JPE | XCHG | CPE | LHLX | XRI | RST 5
6 | ORA B | ORA C | ORA D | ORA E | ORA H | ORA L | ORA M | ORA A | RP | POP PSW | JP | DI | CP | PUSH PSW | ORI | RST 6
7 | CMP B | CMP C | CMP D | CMP E | CMP H | CMP L | CMP M | CMP A | RM | SPHL | JM | EI | CM | JK | CPI | RST 7

Intel 8086

Following the 8080, Intel intended to revolutionize microprocessors with a 32-bit "micro-mainframe", the iAPX 432. This extremely complex processor implemented objects, memory management, interprocess communication, and fine-grained memory protection in hardware. The iAPX 432 was too ambitious and the project fell behind schedule, leaving Intel vulnerable to competitors such as Motorola and Zilog. Intel quickly threw together a 16-bit processor as a stopgap until the iAPX 432 was ready; to show its continuity with the 8-bit processor line, this processor was called the 8086. The iAPX 432 ended up being one of the great disaster stories of modern computing and quietly disappeared.

The "stopgap" 8086 processor, however, started the x86 architecture that changed the history of Intel. The 8086's victory was powered by the IBM PC, designed in 1981 around the Intel 8088, a variant of the 8086 with a cheaper 8-bit bus. The IBM PC was a rousing success, defining the modern computer and making Intel's fortune. Intel produced a succession of more powerful chips that extended the 8086: 286, 386, 486, Pentium, and so on, leading to the current x86 architecture.

The original IBM PC used the Intel 8088 processor, a variant of the 8086 with an 8-bit bus. Photo by Ruben de Rijcke, (CC BY-SA 3.0).

The 8086 was a major change from the 8080/8085, jumping from an 8-bit architecture to a 16-bit architecture and expanding from 64K of memory to 1 megabyte. Nonetheless, the 8086's architecture is closely related to the 8080. The designers of the 8086 wanted it to be compatible with the 8080/8085, but the difference was too wide for binary compatibility or even assembly-language compatibility. Instead, the 8086 was designed so a program could translate 8080 assembly language to 8086 assembly language.17 To accomplish this, each 8080 register had a corresponding 8086 register and most 8080 instructions had corresponding 8086 instructions.

The 8086's instruction set was designed with a new concept, the "ModR/M" byte, which usually follows the opcode byte. The ModR/M byte specifies the memory addressing mode and the register (or registers) to use, allowing that information to be moved out of the opcode. For instance, where the 8080 had a quadrant of 64 instructions to move from register to register, the 8086 has a single move instruction, with the ModR/M byte specifying the particular instruction. (The move instruction, however, has variants to handle byte vs. word operations, moves to or from memory, and so forth, so the 8086 ends up with a few move opcodes.) The ModR/M byte preserves the Datapoint 2200's concept of using the same instruction for memory and register operations, but allows a memory address to be provided in the instruction.
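
As a concrete illustration, here is a small Python sketch (my own, using the standard 8086 field layout and the 16-bit register and memory encodings) that splits a ModR/M byte into its mod, reg, and r/m fields:

    # Minimal sketch of decoding an 8086 ModR/M byte (2-3-3 bit layout).
    REG16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]
    # Memory addressing forms selected by r/m when mod is 00, 01, or 10.
    MEM = ["BX+SI", "BX+DI", "BP+SI", "BP+DI", "SI", "DI", "BP", "BX"]

    def decode_modrm(byte):
        mod = (byte >> 6) & 0b11       # addressing mode
        reg = (byte >> 3) & 0b111      # register operand
        rm = byte & 0b111              # register or memory operand
        if mod == 0b11:
            operand = REG16[rm]                    # register-to-register form
        elif mod == 0b00 and rm == 0b110:
            operand = "[disp16]"                   # special case: direct address
        else:
            disp = {0b00: "", 0b01: "+disp8", 0b10: "+disp16"}[mod]
            operand = "[" + MEM[rm] + disp + "]"
        return REG16[reg], operand

    print(decode_modrm(0x87))          # ('AX', '[BX+disp16]')

Note that the memory forms only ever involve BX, BP, SI, and DI; the paragraphs below explain why BX, of all the general registers, ended up with that role.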

The 8086 also cleans up some of the historical baggage in the instruction set, freeing up space in the precious 256 opcodes for new instructions. The conditional call and return instructions were eliminated, while the conditional jumps were expanded. The 8008's RST (Restart) instructions were eliminated, replaced by interrupt vectors.

The 8086 extended its registers to 16 bits and added several new registers. An Intel patent (below) shows that the 8086's registers were originally called A, B, C, D, E, H, and L, matching the Datapoint 2200. The A register was extended to the 16-bit XA register, while the BC, DE, and HL registers were used unchanged. When the 8086 was released, these registers were renamed to AX, CX, DX, and BX respectively.18 In particular, the HL register was renamed to BX; this is why BX can specify a memory address in the ModR/M byte, but AX, CX, and DX can't.
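
The resulting correspondence between 8080 and 8086 registers can be summarized in a short table (my own restatement of the renaming described above, not an excerpt from Intel's translation documentation):

    # 8080 register -> 8086 register, following A->AX, BC->CX, DE->DX, HL->BX.
    REGISTER_MAP = {
        "A": "AL",             # accumulator -> low byte of AX
        "B": "CH", "C": "CL",  # BC pair -> CX
        "D": "DH", "E": "DL",  # DE pair -> DX
        "H": "BH", "L": "BL",  # HL pair -> BX (hence BX's addressing role)
        "M": "[BX]",           # the memory "register" addressed through HL -> through BX
        "SP": "SP",
    }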

A patent diagram showing the 8086's registers with their original names. (MP, IJ, and IK are now known as BP, SI, and DI.) From patent US4449184.

The table below shows the 8086's instruction set, with "b", "w", and "i" indicating byte (8-bit), word (16-bit), and immediate instructions. The Datapoint 2200 instructions (colored) are all still supported. The number of Datapoint instructions looks small because the ModR/M byte collapses groups of old opcodes into a single new one. This opened up space in the opcode table, though, allowing the 8086 to have many new instructions as well as 16-bit instructions.19

  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
0 | ADD b | ADD w | ADD b | ADD w | ADD bi | ADD wi | PUSH ES | POP ES | INC AX | INC CX | INC DX | INC BX | INC SP | INC BP | INC SI | INC DI
1 | OR b | OR w | OR b | OR w | OR bi | OR wi | PUSH CS | - | DEC AX | DEC CX | DEC DX | DEC BX | DEC SP | DEC BP | DEC SI | DEC DI
2 | ADC b | ADC w | ADC b | ADC w | ADC bi | ADC wi | PUSH SS | POP SS | PUSH AX | PUSH CX | PUSH DX | PUSH BX | PUSH SP | PUSH BP | PUSH SI | PUSH DI
3 | SBB b | SBB w | SBB b | SBB w | SBB bi | SBB wi | PUSH DS | POP DS | POP AX | POP CX | POP DX | POP BX | POP SP | POP BP | POP SI | POP DI
4 | AND b | AND w | AND b | AND w | AND bi | AND wi | ES: | DAA | - | - | - | - | - | - | - | -
5 | SUB b | SUB w | SUB b | SUB w | SUB bi | SUB wi | CS: | DAS | - | - | - | - | - | - | - | -
6 | XOR b | XOR w | XOR b | XOR w | XOR bi | XOR wi | SS: | AAA | JO | JNO | JB | JNB | JZ | JNZ | JBE | JA
7 | CMP b | CMP w | CMP b | CMP w | CMP bi | CMP wi | DS: | AAS | JS | JNS | JPE | JPO | JL | JGE | JLE | JG

0 | GRP1 b | GRP1 w | GRP1 b | GRP1 w | TEST b | TEST w | XCHG b | XCHG w | - | - | RET | RET | LES | LDS | MOV b | MOV w
1 | MOV b | MOV w | MOV b | MOV w | MOV sr | LEA | MOV sr | POP | - | - | RETF | RETF | INT 3 | INT | INTO | IRET
2 | NOP | XCHG CX | XCHG DX | XCHG BX | XCHG SP | XCHG BP | XCHG SI | XCHG DI | Shift b | Shift w | Shift b | Shift w | AAM | AAD | - | XLAT
3 | CBW | CWD | CALL | WAIT | PUSHF | POPF | SAHF | LAHF | ESC 0 | ESC 1 | ESC 2 | ESC 3 | ESC 4 | ESC 5 | ESC 6 | ESC 7
4 | MOV AL,M | MOV AX,M | MOV M,AL | MOV M,AX | MOVS b | MOVS w | CMPS b | CMPS w | LOOPNZ | LOOPZ | LOOP | JCXZ | IN b | IN w | OUT b | OUT w
5 | TEST b | TEST w | STOS b | STOS w | LODS b | LODS w | SCAS b | SCAS w | CALL | JMP | JMP | JMP | IN b | IN w | OUT b DX | OUT w DX
6 | MOV AL,i | MOV CL,i | MOV DL,i | MOV BL,i | MOV AH,i | MOV CH,i | MOV DH,i | MOV BH,i | LOCK | - | REPNZ | REPZ | HLT | CMC | GRP3a | GRP3b
7 | MOV AX,i | MOV CX,i | MOV DX,i | MOV BX,i | MOV SP,i | MOV BP,i | MOV SI,i | MOV DI,i | CLC | STC | CLI | STI | CLD | STD | GRP4 | GRP5

The 8086 has a 16-bit flags register, shown below, but the low byte remained compatible with the 8080. The four highlighted flags (sign, zero, parity, and carry) are the ones originating in the Datapoint 2200.

The flag word of the 8086 contains the original Datapoint 2200 flags.

Modern x86 and x86-64

The modern x86 architecture has extended the 8086 to a 32-bit architecture (IA-32) and a 64-bit architecture (x86-6420), but the Datapoint features remain. At startup, an x86 processor runs in "real mode", which operates like the original 8086. More interesting is 64-bit mode, which has some major architectural changes. In 64-bit mode, the 8086's general-purpose registers are extended to sixteen 64-bit registers (and soon to be 32 registers). However, the original Datapoint registers are special and can still be accessed as byte registers within the corresponding 64-bit register; these are highlighted in the table below.21
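
As a rough illustration of this aliasing (a toy Python model of my own, not how a real register file is implemented), writing the legacy byte register touches only the low 8 bits of the corresponding 64-bit register:

    # Toy model: the old 8-bit register (e.g. AL) is the low byte of RAX.
    class Register64:
        def __init__(self):
            self.value = 0

        def write_byte(self, byte):     # like MOV AL, imm8
            self.value = (self.value & ~0xFF) | (byte & 0xFF)

        def write_dword(self, dword):   # like MOV EAX, imm32
            # In 64-bit mode, a 32-bit write zero-extends into the full register.
            self.value = dword & 0xFFFFFFFF

    rax = Register64()
    rax.write_dword(0x12345678)
    rax.write_byte(0xAB)                # only the old "A" byte changes
    print(hex(rax.value))               # 0x123456ab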

General purpose registers in x86-64. From Intel Software Developer's Manual.

The flag register of the 8086 was extended to 32 bits or 64 bits in x86. As the diagram below shows, the original Datapoint 2200 status flags are still there (highlighted in yellow).

The 32-bit and 64-bit flags of x86 contain the original Datapoint 2200 status flags. From Intel Software Developer's Manual.

The instruction set in x86 has been extended from the 8086, mostly through prefixes, but the instructions from the Datapoint 2200 are still there. The ModR/M byte has been changed in 64-bit mode so the BX (originally HL) register is no longer special when accessing memory (although it's still special with 16-bit addressing, until Intel removes that in the upcoming x86-S simplification.) I/O ports still exist in x86, although they are viewed as more of a legacy feature: modern I/O devices typically use memory-mapped I/O instead of I/O ports. To summarize, fifty years later, x86-64 is slowly moving away from some of the Datapoint 2200 features, but they are still there.

Conclusions

The modern x86 architecture is descended from the Datapoint 2200's architecture. Because there is backward-compatibility at each step, you should theoretically be able to take a Datapoint 2200 binary, disassemble it to 8008 assembly, automatically translate it to 8080 assembly, automatically convert it to 8086 assembly, and then run it on a modern x86 processor. (The I/O devices would be different and cause trouble, of course.)

The Datapoint 2200's complete instruction set, its flags, and its little-endian architecture have persisted into current processors. This shows the critical importance of backward compatibility to customers. While Intel keeps attempting to create new architectures (iAPX 432, i960, i860, Itanium), customers would rather stay on a compatible architecture. Remarkably, Intel has managed to move from 8-bit computers to 16, 32, and 64 bits, while keeping systems mostly compatible. As a result, design decisions made for the Datapoint 2200 over 50 years ago are still impacting modern computers. Will processors still have the features of the Datapoint 2200 another fifty years from now? I wouldn't be surprised.22

Thanks to Joe Oberhauser for suggesting this topic. I plan to write more on the 8086, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space so you can follow me there too.

Notes and references

  1. Shift-register memory was also used in the TV Typewriter (1973) and the display storage of the Apple I (1976). However, dynamic RAM (DRAM) rapidly dropped in price, making shift-register memory obsolete by the mid 1970s. (I wrote about the Intel 1405 shift register memory in detail in this article.) 

  2. For comparison, the popular PDP-8 minicomputer had just two main registers: the accumulator and a multiplier-quotient register; instructions typically operated on the accumulator and a memory location. The Data General Nova, a minicomputer released in 1969, had four accumulator / index registers. Mainframes generally had many more registers; the IBM System/360 (1964), for instance, had 16 general registers and four floating-point registers. 

  3. On the hardware side, instructions were decoded with BCD-to-decimal decoder chips (type 7442). These decoders normally decoded a 4-bit BCD value into one of 10 output lines. In the Datapoint 2200, they decoded a 3-bit value into one of 8 output lines, and the other two lines were ignored. This allowed the high-bit line to be used as a selection line; if it was set, none of the 8 outputs would be active. 

  4. These bit patterns map cleanly onto octal, so the opcodes are clearest when specified in octal. This octal structure has persisted in Intel processors including modern x86 processors. Unfortunately, Intel invariably specifies the opcodes in hexadecimal rather than octal, which obscures the underlying structure. This structure is described in detail in The 80x86 is an Octal Machine

  5. I haven't been able to find any instruction set before the Datapoint 2200 that required memory addresses to be loaded into a register. This technique was common in microcode, where memory addresses were loaded into the Memory Address Register (MAR), but as far as I can tell, no instruction set did this. Moreover, no instruction set treated register and memory accesses identically. (If you know of exceptions, please let me know.)

    At first, I thought that serial memory might motivate the use of an address register, but I don't think there's a connection. Most likely, the Datapoint 2200 used these techniques to create a simple, orthogonal instruction set that was easy to decode, and they weren't particularly concerned with performance. 

  6. The instruction tables in this article are different from most articles, because I use octal instead of hexadecimal. (Displaying an octal-based instruction in a hexadecimal table obscures much of the underlying structure.) To display the table in octal, I break it into four quadrants based on the top octal digit of a three-digit opcode: 0, 1, 2, or 3. The digit 0-7 along the left is the middle octal digit and the digit along the top is the low octal digit. 

  7. The regular pattern of Load instructions is broken by the NOP and HALT instructions. All the register-to-register load instructions along the diagonal accomplish nothing since they move a register to itself, but only the first one is explicitly called NOP. Moving a memory location to itself doesn't make sense, so its opcode is assigned the HALT instruction. Note that the all-0's opcode and the all-1's opcode are both HALT instructions. This is useful since it can stop execution if the program tries executing uninitialized memory. 

  8. You might think that Datapoint and Intel used the same ALU operations simply because they are the obvious set of 8 operations. However, if you look at other processors around that time, they use a wide variety of ALU operations. Similarly, the status flags in the Datapoint 2200 aren't the obvious set; systems with four flags typically used Sign, Carry, Zero, and Overflow (not Parity). Intel processors didn't provide an Overflow flag until the 8086; even the 8080 didn't have it although the Motorola 6800 and MOS 6502 did. The 8085 implemented an overflow flag (V) but it was left undocumented. 

  9. You might wonder if the Datapoint 2200 (and 8008) could be considered RISC processors since they have simple, easy-to-decode instruction sets. I think it is a mistake to try to wedge every processor into the RISC or CISC categories (Reduced Instruction Set Computer or Complex Instruction Set Computer). In particular, the Datapoint 2200 wasn't designed with the RISC philosophy (make a processor more powerful by simplifying the instruction set), its instruction set architecture is very different from RISC chips, and its implementation is different from RISC chips. Similarly, it wasn't designed with a CISC philosophy (make a processor more powerful by narrowing the semantic gap with high-level languages) and it doesn't look like a CISC chip.

    So where does that leave the Datapoint 2200? In "RISC: Back to the future?", famed computer architect Gordon Bell uses the term MISC (Minimal Instruction Set Computer) to describe the architecture of simple, early computers and microprocessors such as the Manchester Mark I (1948), the PDP-8 minicomputer (1966), and the Intel 4004 (1971). Computer architecture evolved from these early hardwired "simple computers" to microprogrammed processors, processors with cache, and hardwired, pipelined processors. "Minimal Instruction Set Computer" seems like a good description of the Datapoint 2200, since it is about the smallest, simplest processor that could get the job done. 

  10. Many people think that the Intel 8008 is an extension of the 4-bit Intel 4004 processor, but they are completely unrelated aside from the part numbers. The Intel 4004 is a 4-bit processor designed to implement a calculator for a company called Busicom. Its architecture is completely different from the 8008. In particular, the 4004 is a "Harvard architecture" system, with data storage and instruction storage completely separate. The 4004 also has a fairly strange instruction set, designed for calculators. For instance, it has a special instruction to convert a keyboard scan code to binary. The 4004 team and the 8008 team at Intel had many people in common, however, so the two chips have physical layouts (floorplans) that are very similar. 

  11. In this article, I'm focusing on the Datapoint 2200 Version I. Any time I refer to the Datapoint 2200, I mean the Version I specifically. The Version II has an expanded instruction set, but it was expanded in an entirely different direction from the Intel 8080, so it's not relevant to this post. The Version II is interesting, however, since it provides a perspective of how the Intel 8080 could have developed in an "alternate universe". 

  12. Federico Faggin wrote The Birth of the Microprocessor in Byte Magazine, March 1992. This article describes in some detail the creation of the 8008 and 8080.

    The Oral History of the 8080 discusses many of the problems with the 8008 and how the 8080 addressed them. (See page 4.) Masatoshi Shima, one of the architects of the 4004, described five problems with the 8008: It was slow because it used two clock cycles per state. It had no general-purpose stack and was weak with interrupts. It had limited memory and I/O space. The instruction set was primitive, with only 8-bit data, limited addressing, and a single address pointer register. Finally, the system bus required a lot of interface circuitry. (See page 7.) 

  13. The 8080 is often said to be the "first truly usable microprocessor". Supposedly the source of this quote is Forgotten PC history, but the statement doesn't appear there. I haven't been able to find the original source of this statement, so let me know. In any case, I don't think that statement is particularly accurate, as the Motorola 6800 was "truly usable" and came out before the Intel 8080.

    The 8080 was first in one important way, though: it was Intel's first microprocessor that was designed with feedback from customers. Both the 4004 and the 8008 were custom chips for a single company. The 8080, however, was based on extensive customer feedback about the flaws in the 8008 and what features customers wanted. The 8080 oral history discusses this in more detail. 

  14. The 8008 was built with PMOS circuitry, while the 8080 was built with NMOS. This may seem like a trivial difference, but NMOS provided much superior performance. NMOS became the standard microprocessor technology until the rise of CMOS in the 1980s; CMOS combines NMOS and PMOS transistors to dramatically reduce power consumption.

    Another key hardware improvement was that the 8080 used a 40-pin package, compared to the 18-pin package of the 8008. Intel had long followed the "religion" of small 16-pin packages, and only reluctantly moved to 18 pins (as in the 8008). However, by the time the 8080 was introduced, Intel recognized the utility of industry-standard 40-pin packages. The additional pins made the 8080 much easier to interface to a system. Moreover, the 8080's 16-bit address bus supported four times the memory of the 8008's 14-bit address bus. (The 40-pin package was still small for the time; some companies used 50-pin or 64-pin packages for microprocessors.) 

  15. The 8080 is not binary-compatible with the 8008 because almost all the instructions were shifted to different opcodes. One important but subtle change was that the 8 register/memory codes were reordered to start with B instead of A. The motivation is that this gave registers in a 16-bit register pair (BC, DE, or HL) codes that differ only in the low bit. This makes it easier to specify a register pair with a two-bit code. 

  16. Stan Mazor (one of the creators of the 4004 and 8080) explained that the 8085 removed 10 of the 12 new instructions because "they would burden the 8086 instruction set." Because the decision came near the 8085's release, they would "leave all 12 instructions on the already designed 8085 CPU chip, but document and announce only two of them" since modifying a CPU is hard but modifying a CPU's paper reference manual is easy.

    One of the Intel 8086 engineers provides a similar explanation in Intel Microprocessors: 8008 to 8086: While the 8085 provided the new RIM and SIM instructions, "several other instructions that had been contemplated were not made available because of the software ramifications and the compatibility constraints they would place on the forthcoming 8086."

    For more information on the 8085's undocumented instructions, see Unspecified 8085 op codes enhance programming. The two new condition flags were V (2's complement overflow) and X5 (underflow on decrement or overflow on increment). The opcodes were DSUB (double (i.e. 16-bit) subtraction), ARHL (arithmetic shift right of HL), RDEL (rotate DE left through carry), LDHI (load DE with HL plus an immediate byte), LDSI (load DE with SP plus an immediate byte), RSTV (restart on overflow), LHLX (load HL indirect through DE), SHLX (store HL indirect through DE), JX5 (jump on X5), and JNX5 (jump on not X5). 

  17. Conversion from 8080 assembly code to 8086 assembly code was performed with a tool called CONV86. Each line of 8080 assembly code was converted to the corresponding line (or sometimes a few lines) of 8086 assembly code. The program wasn't perfect, so it was expected that the user would need to do some manual editing. In particular, CONV86 couldn't handle self-modifying code, where the program changed its own instructions. (Nowadays, self-modifying code is almost never used, but it was more common in the 1970s in order to make code smaller and get more performance.) CONV86 also didn't handle the 8085's RIM and SIM instructions, recommending a rewrite if code used these instructions heavily.

    Writing programs in 8086 assembly code manually was better, of course, since the program could take advantage of the 8086's new features. Moreover, a program converted by CONV86 might be 25% larger, due to the 8086's use of two-byte instructions and inefficiencies in the conversion. 

  18. This renaming is why the instruction set has the registers in the order AX, CX, DX, BX, rather than in alphabetical order as you might expect. The other factor is that Intel decided that AX, BX, CX, and DX corresponded to Accumulator, Base, Count, and Data, so they couldn't assign the names arbitrarily. 

  19. A few notes on how the 8086's instructions relate to the earlier machines, since the ModR/M byte and 8- vs. 16-bit instructions make things a bit confusing. For an instruction like ADD, I have three 8-bit opcodes highlighted: an add to memory/register, an add from memory/register, and an immediate add. The neighboring unhighlighted opcodes are the corresponding 16-bit versions. Likewise, for MOV, I have highlighted the 8-bit moves to/from a register/memory. 

  20. Since the x86's 32-bit architecture is called IA-32, you might expect that IA-64 would be the 64-bit architecture. Instead, IA-64 is the completely different architecture used in the ill-fated Itanium. IA-64 was supposed to replace IA-32, despite being completely incompatible. Since AMD was cut out of IA-64, AMD developed their own 64-bit extension of the existing x86 architecture and called it AMD64. Customers flocked to this architecture while the Itanium languished. Intel reluctantly copied the AMD64 architecture, calling it Intel 64. 

  21. The x86 architecture allows byte access to certain parts of the larger registers (accessing AL, AH, etc.) as well as word and larger accesses. These partial-width reads and writes to registers make the implementation of the processor harder due to register renaming. The problem is that writing to part of a register means that the register's value is a combination of the old and new values. The Register Alias Table in the P6 architecture deals with this by adding a size field to each entry. If you write a short value and then read a longer value, the pipeline stalls to figure out the right value. Moreover, some 16-bit code uses the two 8-bit parts of a register as independent registers. To support this, the Register Alias Table keeps separate entries for the high and low byte. (For details, see the book Modern Processor Design, in particular the chapter on Intel's P6 Microarchitecture.) The point of this is that obscure features of the Datapoint 2200 (such as H and L acting as a combined register) can cause implementation difficulties 50 years later. 

  22. Some miscellaneous references: For a detailed history of the Datapoint 2200, see Datapoint: The Lost Story of the Texans Who Invented the Personal Computer Revolution. The 8008 oral history provides a lot of interesting information on the development of the 8008. For another look at the Datapoint 2200 and instruction sets, see Comparing Datapoint 2200, 8008, 8080 and Z80 Instruction Sets





[#] Sat Sep 30 2023 09:03:05 UTC from rss <>

Subject: How flip-flops are implemented in the Intel 8086 processor

[Reply] [ReplyQuoted] [Headers] [Print]

A key concept for a processor is the management of "state", information that persists over time. Much of a computer is built from logic gates, such as NAND or NOR gates, but logic gates have no notion of time. Processors also need a way to hold values, along with a mechanism to move from step to step in a controlled fashion. This is the role of "sequential logic", where the output depends on what happened before. Sequential logic usually operates off a clock signal,1 a sequence of regular pulses that controls the timing of the computer. (If you have a 3.2 GHz processor, for instance, that number is the clock frequency.)

A circuit called the flip-flop is a fundamental building block for sequential logic. A flip-flop can hold one bit of state, a "0" or a "1", changing its value when the clock changes. Flip-flops are a key part of processors, with multiple roles. Several flip-flops can be combined to form a register, holding a value. Flip-flops are also used to build "state machines", circuits that move from step to step in a controlled sequence. A flip-flop can also delay a signal, holding it from one clock cycle to the next.

Intel introduced the groundbreaking 8086 microprocessor in 1978, starting the x86 architecture that is widely used today. In this blog post, I take a close look at the flip-flops in the 8086: what they do and how they are implemented. In particular, I will focus on the dynamic flip-flop, which holds its value using capacitance, much like DRAM.2 Many of these flip-flops use a somewhat unusual "enable" input, which allows the flip-flop to hold its value for multiple clock cycles.

The 8086 die under the microscope, with the main functional blocks. I count 184 flip-flops with enable and 53 without enable. Click this image (or any other) for a larger version.

The die photo above shows the silicon die of the 8086. In this image, I have removed the metal and polysilicon layers to show the silicon transistors underneath. The colored squares indicate the flip-flops: blue flip-flops have an enable input, while red ones lack an enable. Flip-flops are used throughout the processor for a variety of roles. Around the edges, they hold the state for output pins. The control circuitry makes heavy use of flip-flops for various state machines, such as moving through the "T states" that control the bus cycle. The "loader" uses a state machine to start each instruction. The instruction register, along with some special-purpose registers (N, M, and X), is built from flip-flops. Other flip-flops track the instructions in the prefetch queue. The microcode engine uses flip-flops to hold the current microcode address as well as to latch the 21-bit output from the microcode ROM. The ALU (Arithmetic/Logic Unit) uses flip-flops to hold the status flags, temporary input values, and information on the operation.

The flip-flop circuit

In this section, I'll explain how the flip-flop circuits work, starting with a basic D flip-flop. The D flip-flop (below) takes a data input (D) and stores that value, 0 or 1. The output is labeled Q, while the inverted output is called Q̅ (Q-bar). This flip-flop is "edge triggered", so the storage happens on the edge when the clock changes from low to high.4 Except at this transition, the input can change without affecting the output.

The symbol for a D flip-flop.

The 8086 implements most of its flip-flops dynamically, using pass transistor logic. That is, the capacitance of the wiring holds the 0 or 1 state. The dynamic implementation is more compact than the typical static flip-flop implementation, so it is often used in processors. However, the charge on the wire's capacitance will eventually leak away, just like DRAM (dynamic RAM). Thus, the clock must keep going or the values will be lost.3 This behavior is different from a typical flip-flop chip, which will hold its value until the next clock, whether that is a microsecond later or a day later.

The D flip-flop is built from two latch5 stages, each consisting of a pass transistor and an inverter.6 The first pass transistor passes the input value through while the clock is low. When the clock switches high, the first pass transistor turns off and isolates the inverter from the input, but the value persists due to the wire's capacitance (blue arrow). Meanwhile, the second pass transistor switches on, passing the value from the first inverter through the second inverter to the output. Similarly, when the clock switches low, the second transistor switches off but the value is held by wire capacitance at the green arrow. (The circuit does not need an explicit capacitor; the wiring has enough capacitance to hold the value.) Thus, the output holds the value of the D input that was present at the moment when the clock switched from low to high. Any other changes to the D input do not affect the output.

Schematic of a D flip-flop built from pass transistor logic.
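
A simple behavioral model may help clarify the two-phase operation. The Python sketch below is my own abstraction, assuming ideal pass transistors and lossless wire capacitance: the first stage samples (and inverts) D while the clock is low, and the second stage passes the held value to the output when the clock goes high:

    # Behavioral sketch of the two-stage dynamic D flip-flop (not transistor-level).
    class DynamicDFlipFlop:
        def __init__(self):
            self.node1 = 1   # charge held after the first pass transistor (inverted D)
            self.node2 = 1   # charge held after the second pass transistor
            # these starting values correspond to a stored 0

        def tick(self, clk, d):
            if clk == 0:
                self.node1 = 1 - d        # first stage samples and inverts D while the clock is low
            else:
                self.node2 = self.node1   # second stage passes the held value on the high clock
            q = 1 - self.node2            # second inverter produces Q
            q_bar = self.node1            # the inverted output taps the first inverter
            return q, q_bar

    ff = DynamicDFlipFlop()
    for clk, d in [(0, 1), (1, 1), (0, 0), (1, 0)]:
        print("clk=%d d=%d -> Q=%d" % (clk, d, ff.tick(clk, d)[0]))
    # Q becomes 1 after the first rising edge (D was 1), holds while the clock
    # is low, and becomes 0 after the next rising edge (D was 0).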

The basic flip-flop can be modified by adding an "enable" input that enables or blocks the clock.7 When the enable input is high, the flip-flop records the D input on the clock edge as before, but when the enable input is low, the flip-flop holds its previous value. The enable input allows the flip-flop to hold its value for an arbitrarily long period of time.

The symbol for the D flip-flop with enable.

The enable flip-flop is constructed from a D flip-flop by feeding the flip-flop's output back to the input as shown below. When the enable input is 0, the multiplexer selects the current Q output as the new flip-flop D input, so the flip-flop retains its previous value. But when the enable input is 1, the multiplexer selects the new D value. (You can think of the enable input as selecting "hold" versus "load".)

Block diagram of a flip-flop with an enable input.

The multiplexer is implemented with two more pass transistors, as shown on the left below.8 When enable is low, the upper pass transistor switches on, passing the current Q output back to the input. When enable is high, the lower pass transistor switches on, passing the D input through to the flip-flop. The schematic below also shows how the inverted Q' output is provided by the first inverter. The circuit "cheats" a bit; since the inverted output bypasses the second transistor, this output can change before the clock edge.

Schematic of a flip-flop with an enable input.
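
The same kind of behavioral sketch works for the enable version (again my own abstraction, not a transistor-level simulation): the multiplexer selects the external D input when enable is high and recirculates the current Q output when enable is low, so the stored bit is refreshed on every clock:

    # Behavioral sketch of the dynamic flip-flop with an enable input.
    class EnableDFlipFlop:
        def __init__(self):
            self.node1 = 1    # wire after the first pass transistor (inverted value)
            self.node2 = 1    # wire after the second pass transistor

        def q(self):
            return 1 - self.node2             # output of the second inverter

        def tick(self, clk, d, enable):
            d_mux = d if enable else self.q()     # the two-pass-transistor multiplexer
            if clk == 0:
                self.node1 = 1 - d_mux            # first stage samples while the clock is low
            else:
                self.node2 = self.node1           # second stage passes on the high clock
            return self.q()

    ff = EnableDFlipFlop()
    ff.tick(0, 1, 1); print(ff.tick(1, 1, 1))     # enable high: load a 1 -> prints 1
    ff.tick(0, 0, 0); print(ff.tick(1, 0, 0))     # enable low: D=0 is ignored -> still prints 1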

The flip-flops often have a set or clear input, setting the flip-flop high or low. This input is typically connected to the processor's "reset" line, ensuring that the flip-flops are initialized to the proper state when the processor is started. The symbol below shows a flip-flop with a clear input.

The symbol for the D flip-flop with enable and clear inputs.

To support the clear function, a NOR gate replaces the second inverter, as shown below (red). When the clear input is high, it forces the output from the NOR gate to be low. Note that the clear input is asynchronous, changing the Q output immediately. The inverted Q output, however, doesn't change until clk is high and the output cycles around. A similar modification implements a set input that forces the flip-flop high: a NOR gate replaces the first inverter.

This schematic shows the circuitry for the clear flip-flop.

Implementing a flip-flop in silicon

The diagram below shows two flip-flops as they appear on the die. The bright gray regions are doped silicon, the bottom layer of the chip. The brown lines are polysilicon, a layer on top of the silicon. When polysilicon crosses doped silicon, a transistor is formed with a polysilicon gate. The black circles are vias (connections) to the metal layer. The metal layer on top provides wiring between the transistors. I removed the metal layer with acid to make the underlying circuitry visible. Faint purple lines remain on the die, showing where the metal wiring was.

Two flip-flops on the 8086 die.

Although the two flip-flops have the same circuitry, their layouts on the die are completely different. In the 8086, each transistor was carefully shaped and positioned to make the layout compact, so the layout depends on the surrounding logic and the connections. This is in contrast to modern standard-cell layout, which uses a standard layout for each block (logic gate, flip-flop, etc.) and puts the cells in orderly rows. (Intel moved to standard-cell wiring for much of the logic in the 386 processor since it is much faster to create a standard-cell design than to perform manual layout.)

Conclusions

The flip-flop with enable input is a key part of the 8086, appearing throughout the processor. However, the enable input is a fairly obscure feature for a flip-flop component; most flip-flop chips have a clock input, but not an enable.9 Many FPGA and ASIC synthesis libraries, though, provide it, under the name "D flip-flop with enable" or "D flip-flop with clock enable".

I plan to write more on the 8086, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space so you can follow me there too.

Notes and references

  1. Some early computers were asynchronous, such as von Neumann's IAS machine (1952) and its numerous descendants. In this machine, there was no centralized clock. Instead, a circuit such as an adder would send a pulse to the next circuit when it was done, triggering the next circuit in sequence. Thus, instruction execution would ripple through the computer. Although almost all later computers are synchronous, there is active research into asynchronous computing which is potentially faster and lower power. 

  2. I'm focusing on the dynamic flip-flops in this article, but I'll mention that the 8086 has a few latches built from cross-coupled NOR gates. Most 8086 registers use cross-coupled inverters (static memory cells) rather than flip-flops to hold bits. I explained the 8086 processor's registers in this article

  3. Dynamic circuitry is why the 8086 and many other processors have minimum clock speeds: if the clock is too slow, signals will fade away. For the 8086, the datasheet specifies a maximum clock period of 500 ns, corresponding to a minimum clock speed of 2 megahertz. The Z80 processor, however, was designed so the clock could be slowed or even stopped. 

  4. Some flip-flops in the 8086 use the inverted clock, so they transition when the clock switches from high to low. Thus, there are two sets of transitions in the 8086 for each clock cycle. 

  5. The terminology gets confusing between flip-flops and latches, which sometimes refer to the same thing and sometimes different things. The term "latch" is often used for a flip-flop that operates on the clock level, not the clock edge. That is, when the clock input is high, the input passes through, and when the clock input is low, the value is retained. Confusingly, the clock for a latch is often called "enable". This is different from the enable input that I'm discussing, which is separate from the clock. 

  6. I asked an Intel chip engineer if they designed the circuitry in the 8086 era in terms of flip-flops. He said that they typically designed the circuitry in terms of the underlying pass transistors and gates, rather than using the flip-flop as a fundamental building block. 

  7. You might wonder why the clock and enable are separate inputs. Why couldn't you just AND them together so when enable is low, it will block the clock and the flip-flop won't transition? That mostly works, but three factors make it a bad idea. First, the idea of using a clock is so everything changes state at the same time. If you start putting gates in the clock path, the clock gets a bit delayed and shifts the timing. If the delay is too large, the input value might change before the flip-flop can latch it. Thus, putting gates in the clock path is frowned upon. The second factor is that combining the clock and enable signals risks race conditions. For instance, suppose that the enable input goes low and high while the clock remains high. If you AND the two signals together, this will yield a spurious clock edge, causing the flip-flop to latch its input a second time. Finally, if you block the clock for too long, a dynamic flip-flop will lose its value. (Note that the flip-flop circuit used in the 8086 will refresh its value on each clock even if the enable input is held low for a long period of time.) 

  8. A multiplexer can be implemented with logic gates. However, it is more compact to implement it with pass transistors. The pass transistor implementation takes four transistors (two fewer if the inverted enable signal is already available). A logic gate implementation would take about nine transistors: an AND-OR-INVERT gate, an inverter on the output, and an inverter for the enable signal. 

  9. The common 7474 is a typical TTL flip-flop that does not have an enable input. Chips with an enable are rarer, such as the 74F377. Strangely, one manufacturer of the 74HC377 shows the enable as affecting the output; I think they simply messed up the schematic in the datasheet since it contradicts the function table.

    Some examples of standard-cell libraries with enable flip-flops: Cypress SoC, Faraday standard cell library, Xilinx Unified Libraries, Infineon PSoC 4 Components, Intel's CHMOS-III cell library (probably used for the 386 processor), and Intel Quartus FPGA





[#] Sat Oct 07 2023 09:04:25 UTC from rss <>

Subject: Reverse-engineering the mechanical Bendix Central Air Data Computer

[Reply] [ReplyQuoted] [Headers] [Print]

How did fighter planes in the 1950s perform calculations before compact digital computers were available? The Bendix Central Air Data Computer (CADC) is an electromechanical analog computer that used gears and cams for its mathematics. It was used in military planes such as the F-101 and the F-111 fighters, and the B-58 bomber to compute airspeed, Mach number, and other "air data".

The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version.

Aircraft have determined airspeed from air pressure for over a century. A port in the side of the plane provides the static air pressure,1 the air pressure outside the aircraft. A pitot tube points forward and receives the "total" air pressure, a higher pressure due to the speed of the airplane forcing air into the tube. The airspeed can be determined from the ratio of these two pressures, while the altitude can be determined from the static pressure.

But as you approach the speed of sound, the fluid dynamics of air changes and the calculations become very complicated. With the development of supersonic fighter planes in the 1950s, simple mechanical instruments were no longer sufficient. Instead, an analog computer calculated the "air data" (airspeed, air density, Mach number, and so forth) from the pressure measurements. This computer then transmitted the air data electrically to the systems that needed it: instruments, weapons targeting, engine control, and so forth. Since the computer was centralized, the system was called a Central Air Data Computer or CADC, manufactured by Bendix and other companies.

A closeup of the numerous gears inside the CADC. Three differential gear mechanisms are visible.

Each value in the CADC is indicated by the rotational position of a shaft. Compact electric motors rotated the shafts, controlled by magnetic amplifier servos. Gears, cams, and differentials performed computations, with the results indicated by more rotations. Devices called synchros converted the rotations to electrical outputs that controlled other aircraft systems. The CADC is said to contain 46 synchros, 511 gears, 820 ball bearings, and a total of 2,781 major parts (but I haven't counted). These components are crammed into a compact cylinder: 15 inches long and weighing 28.7 pounds.

The equations computed by the CADC are impressively complicated. For instance, one equation is:2

\[\frac{P_t}{P_s} = \frac{166.9215 M^7}{(7M^2-1)^{2.5}}\]

It seems incredible that these functions could be computed mechanically, but three techniques make this possible. The fundamental mechanism is the differential gear, which adds or subtracts values. Second, logarithms are used extensively, so multiplications and divisions become additions and subtractions performed by a differential, while square roots are calculated by gearing down by a factor of 2. Finally, specially-shaped cams implement functions: logarithm, exponential, and functions specific to the application. By combining these mechanisms, complicated functions can be computed mechanically, as I will explain below.
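
As a generic worked example of how these three mechanisms combine (my own illustration, not a specific CADC signal path): to compute \(y = ab/\sqrt{c}\), everything is done in the log domain,

\[ log \; y = log \; a + log \; b - \tfrac{1}{2} \, log \; c \]

One differential performs the addition and another the subtraction, the factor of ½ is simply a 2:1 gear ratio, and exponential cams convert between each quantity and its logarithm.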

The differential

The differential gear assembly is the mathematical component of the CADC, as it performs addition or subtraction. The differential takes two input rotations and produces an output rotation that is the sum or difference of these rotations.3 Since most values in the CADC are expressed logarithmically, the differential computes multiplication and division when it adds or subtracts its inputs.

A closeup of a differential mechanism.

While the differential functions like the differential in a car, it is constructed differently, with a spur-gear design. This compact arrangement of gears is about 1 cm thick and 3 cm in diameter. The differential is mounted on a shaft along with three co-axial gears: two gears provide the inputs to the differential and the third provides the output. In the photo, the gears above and below the differential are the input gears. The entire differential body rotates with the sum, connected to the output gear at the top through a concentric shaft. (In practice, any of the three gears can be used as the output.) The two thick gears inside the differential body are part of the mechanism.

Note that multiplying a rotation by a constant factor doesn't require a differential; it can be done simply with the ratio between two gears. (If a large gear rotates a small gear, the small gear rotates faster according to the size ratio.) Adding a constant to a rotation is even easier, just a matter of defining what shaft position indicates 0. For this reason, I will ignore constants in the equations.

The cams

The CADC uses cams to implement various functions. Most importantly, cams compute logarithms and exponentials. Cams also implement complicated functions of one variable such as \( {M}/{\sqrt{1 + .2 M^2}} \). The function is encoded into the cam's shape during manufacturing, so a hard-to-compute nonlinear function isn't a problem for the CADC. The photo below shows a cam with the follower arm in front. As the cam rotates, the follower moves in and out according to the cam's radius.

A cam inside the CADC implements a function.

However, the shape of the cam doesn't provide the function directly, as you might expect. The main problem with the straightforward approach is the discontinuity when the cam wraps around, which could catch the follower. For example, if the cam implemented an exponential directly, its radius would spiral exponentially and there would be a jump back to the starting value when it wraps around.

Instead, the CADC uses a clever patented method: the cam encodes the difference between the desired function and a straight line. For example, an exponential curve is shown below (blue), with a line (red) between the endpoints. The height of the gray segment, the difference, specifies the radius of the cam (added to the cam's fixed minimum radius). The point is that this difference goes to 0 at the extremes, so the cam will no longer have a discontinuity when it wraps around. Moreover, this technique significantly reduces the size of the value (i.e. the height of the gray region is smaller than the height of the blue line), increasing the cam's accuracy.5

An exponential curve (blue), linear curve (red), and the difference (gray).

To make this work, the cam position must be added to the linear value to yield the result. This is implemented by combining each cam with a differential gear that performs the addition or subtraction.4 As the diagram below shows, the input (23) drives the cam (30) and the differential (25, 37-41). The follower (32) tracks the cam and provides a second input (35) to the differential. The sum from the differential produces the desired function (26).
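
The numerical sketch below (my own illustration, using an exponential as the target function) shows the idea: the cam stores only the difference between the function and the straight line through its endpoints, and combining the follower's motion with the linear input, as the differential does, recovers the original function:

    # Sketch of the cam's "difference" trick. The cam is machined with only the
    # gap between the target function and the chord through its endpoints, so
    # the profile returns to zero where the cam wraps around.
    import math

    X_MIN, X_MAX = 0.0, 1.0

    def f(x):                      # the function the cam should produce
        return math.exp(x)

    def chord(x):                  # straight line between the endpoints of f
        slope = (f(X_MAX) - f(X_MIN)) / (X_MAX - X_MIN)
        return f(X_MIN) + slope * (x - X_MIN)

    def cam_profile(x):            # what is machined into the cam (plus a fixed minimum radius)
        return chord(x) - f(x)     # the "difference"; zero at both endpoints

    def reconstructed(x):          # what the follower plus differential recover
        return chord(x) - cam_profile(x)   # add or subtract is just a matter of gearing

    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        print("x=%.2f  cam=%.4f  reconstructed=%.4f  f(x)=%.4f"
              % (x, cam_profile(x), reconstructed(x), f(x)))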

This diagram, from Patent 2969910, shows how the cam and follower are connected to a differential.

Pressure inputs

The CADC receives two pressure inputs from the pitot tube.6 Inside the CADC, two pressure transducers convert the pressures into rotational positions. Each pressure transducer contains a pair of bellows that expand and contract as the applied pressure changes. The pressure transducer has a tricky job: it must measure tiny pressure changes, but it must also provide a rotational signal that has enough torque to rotate all the gears in the CADC. To accomplish this, each pressure transducer uses a servo loop that drives a motor, controlled by a feedback loop. Cams and differentials convert the rotation into logarithmic values, providing the static pressure as \( log \; P_s \) and the pressure ratio as \( log \; ({P_t}/{P_s}) \) to the rest of the CADC.

The synchro outputs

A synchro is an interesting device that can transmit a rotational position electrically over three wires. In appearance, a synchro is similar to an electric motor, but its internal construction is different, as shown below. Before digital systems, synchros were very popular for transmitting signals electrically through an aircraft. For instance, a synchro could transmit an altitude reading to a cockpit display or a targeting system. Two synchros at different locations have their stator windings connected together, while the rotor windings are driven with AC. Rotating the shaft of one synchro causes the other to rotate to the same position.7

Cross-section diagram of a synchro showing the rotor and stators.

For the CADC, most of the outputs are synchro signals, using compact synchros that are about 3 cm in length. For improved resolution, some of the CADC outputs use two synchros: a coarse synchro and a fine synchro. The two synchros are typically geared in an 11:1 ratio, so the fine synchro rotates 11 times as fast as the coarse synchro. Over the output range, the coarse synchro may turn 180°, providing the approximate output, while the fine synchro spins multiple times to provide more accuracy.
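
A rough numerical sketch (my own, with simplified scaling) shows how the two readings combine: the coarse synchro picks which revolution the fine synchro is on, and the fine synchro supplies the precision:

    # Coarse/fine synchro sketch: the fine synchro is geared 11:1 to the coarse one.
    RATIO = 11

    def to_synchro_angles(value):
        """value in [0, 1) -> (coarse angle, fine angle) in degrees."""
        coarse_turns = value * 0.5                # coarse synchro covers 180 degrees over the range
        fine_turns = coarse_turns * RATIO         # fine synchro spins through several revolutions
        return coarse_turns * 360.0, (fine_turns % 1.0) * 360.0

    def from_synchro_angles(coarse, fine):
        """Combine the readings: coarse selects the revolution, fine gives the precision."""
        approx_fine_turns = (coarse / 360.0) * RATIO
        fine_fraction = fine / 360.0
        whole_turns = round(approx_fine_turns - fine_fraction)
        return (whole_turns + fine_fraction) / (0.5 * RATIO)

    c, f = to_synchro_angles(0.4321)
    print(round(c, 1), round(f, 1), round(from_synchro_angles(c, f), 4))   # 77.8 135.6 0.4321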

Examining the left section of the CADC

Another view of the CADC.

The Bendix CADC is constructed from modular sections. The right section has the pressure transducers (the black domes), along with the servo mechanisms that control them. The middle section is the "Mach section". In this blog post, I'm focusing on the left section of the CADC, which computes true airspeed, air density, total temperature, log true free air temperature, and air density × speed of sound. I had feared that any attempt at disassembly would result in tiny gears flying in every direction, but the CADC was designed to be taken apart for maintenance. Thus, I could remove the left section of the CADC for analysis.

The diagram below shows the side that connects to the aircraft.8 The various synchros generate the outputs. Some of the synchros have spiral anti-backlash springs installed. These springs prevent wobble in the synchro and gear train as the gears change direction. Three of the exponential cams are visible. The differentials and gears are between the two metal plates, so they are not visible from this angle.

The front of the CADC has multiple output synchros with anti-backlash springs.

Attached to the right side is the temperature transducer, a modular wedge that implements a motorized servo loop to convert the temperature input to a rotation. The servo amplifier consists of three boards of electronic components, including transistors and magnetic amplifiers to drive the motor. The large red potentiometer provides feedback for the servo loop. A flexible cam with 20 adjustment screws allows the transducer to be tuned to eliminate nonlinearities or other sources of error. I'll describe this module in more detail in another post.9

The photo below shows the other side of the section. This communicates with the rest of the CADC through the electrical connector and three gears that mesh with gears in the other section. Two gears receive the pressure signals \( P_t / P_s \) and \(P_s\) from the pressure transducer subsystem. The third gear sends the log total temperature to the rest of the CADC. The electrical connector (a standard 37-pin D-sub) supplies 120 V 400 Hz power to the rest of the CADC and passes synchro signals from the rest of the CADC to the output connectors.

This side of the section interfaces with the rest of the CADC.

The equations

Although the CADC looks like an inscrutable conglomeration of tiny gears, it is possible to trace out the gearing and see exactly how it computes the air data functions. With considerable effort, I have reverse-engineered the mechanisms to create the diagram below, showing how each computation is broken down into mechanical steps. Each line indicates a particular value, specified by a shaft rotation. The ⊕ symbol indicates a differential gear, adding or subtracting its inputs to produce another value. The cam symbol indicates a cam coupled to a differential gear. Each cam computes either a specific function or an exponential, providing the value as a rotation. At the right, the rotations are converted to outputs, either by synchros or a potentiometer. This diagram abstracts out the physical details of the gears. In particular, scaling by constants or reversing the rotation (subtraction versus addition) are not shown.

This diagram shows how the values are computed. The differential numbers are my own arbitrary numbers. Click for a larger version.

I'll go through each calculation briefly.

Total temperature

The external temperature is an important input to the CADC since it affects the air density. A platinum temperature probe provides a resistance that varies with temperature. The resistance is converted to rotation by the temperature transducer, described earlier. The definition of temperature is a bit complicated, though. The temperature outside the aircraft is called the true free air temperature, T. However, the temperature probe measures a higher temperature, called the indicated total air temperature, Ti. The reason for this discrepancy is that when the aircraft is moving at high speed, the air transfers kinetic energy to the temperature probe, heating it up.

The differential and cam D15.

The temperature transducer provides the log of the total temperature as a rotation. At the top of the equation diagram, cam and differential D15 simply take the exponential of this value to determine the total temperature. This rotates the shaft of a synchro to produce the total temperature as an electrical output. As shown above, the D15 cam is attached to the differential by a shaft passing through the metal plate. The follower rotates according to the cam radius, turning the follower gear which meshes with the differential input. The result from the differential is the total temperature.

Log free air temperature

A more complicated task of the CADC is to compute the true free air temperature from the measured total temperature. Free air temperature, T, is defined by the formula below, which compensates for the additional heating due to the aircraft's speed. \(T_i\) is the indicated total temperature, M is the Mach number and K is a temperature probe constant.10

\[ T = \frac {T_i} {1 + .2 K M^2 } \]

The diagram below shows the cams, differentials, gear trains, and synchro that compute \(log \; T\). First, cam D11 computes \( log \; (1 + .2 K M^2 ) \). Although that expression is complicated, the key is that it is a function of a single variable (M), so it can be generated by one cam, carefully shaped for this function and attached to differential D11. Differential D10 combines this with the log total temperature (from the temperature transducer) to produce the desired result; whether that counts as addition or subtraction is simply a matter of gear direction. The synchro shown in the diagram outputs this value to other aircraft systems. (Note that the output is a logarithm; it is not converted to a linear value.11) This value is also fed (via gears) into the calculations of three more equations, below.
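To make the log-based mechanization concrete, here is a minimal numerical sketch in Python. The function names are my own and simply mirror the roles of the cam and the differential; they aren't from any CADC documentation.

    import math

    K = 1.0  # temperature probe recovery constant (taken as 1, per note 10)

    def cam_D11(mach):
        # A cam encodes a function of a single variable in its shape.
        return math.log(1 + 0.2 * K * mach**2)

    def log_free_air_temp(log_total_temp, mach):
        # Differential D10: combine the cam output with log Ti.
        # (Addition versus subtraction is just a matter of gear direction.)
        return log_total_temp - cam_D11(mach)

    # Example: indicated total temperature of 300 K at Mach 0.8
    log_T = log_free_air_temp(math.log(300.0), 0.8)
    print(math.exp(log_T))  # about 266 K true free air temperature

The CADC does the same arithmetic, but the logarithm, the one-variable function, and the exponential are all frozen into cam profiles, and the addition is performed by a differential.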

The components that compute log free air temperature. D12 is not part of this equation.

Air density

Air density is computed from the static pressure and true temperature:

\[ \rho = C_1 \frac{P_s} {T} \]

It is calculated using logarithms. D16 subtracts the log temperature from the log pressure and cam D20 takes the exponential.
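In the same sketch notation (hypothetical names; the constant \(C_1\) is left to the gearing):

    import math

    def log_density_over_C1(log_Ps, log_T):
        # Differential D16: log Ps - log T; cam D20 then exponentiates.
        return log_Ps - log_T

    print(math.exp(log_density_over_C1(math.log(101325.0), math.log(266.0))))
    # Ps/T; multiplying by C1 (via gear ratios) gives the density rho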

True airspeed

True airspeed is computed from the Mach number and the total temperature according to the following formula:

\[V = 38.94 M \frac{\sqrt{T_i}}{\sqrt{1+.2KM^2}}\]

Substituting the true free air temperature simplifies the formula to the equation implemented in the CADC:

\[V = 38.94 M \sqrt{T} \]

This is computed logarithmically. First, cam and differential D12 compute \(log \; M\) from the pressure ratio.13 Next, differential D19 adds half the log temperature, which corresponds to multiplying by \(\sqrt{T}\). Exponential cam D13 removes the logarithms, producing the final result. (The constant 38.94 is an important part of the equation, but it is easily implemented with gear ratios.) The output goes to two synchros, geared to provide coarse and fine outputs.12
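Continuing the sketch (again with my own hypothetical names), the same three steps in software:

    import math

    def true_airspeed(mach, T_kelvin):
        log_M = math.log(mach)                          # stands in for cam D12, which derives log M from Pt/Ps
        log_MrootT = log_M + 0.5 * math.log(T_kelvin)   # differential D19: add half of log T
        return 38.94 * math.exp(log_MrootT)             # cam D13 exponentiates; the 38.94 comes from gear ratios

    print(true_airspeed(0.8, 266.0))  # roughly 508 knots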

These components compute true airspeed and air density × speed of sound. Note the large gear driving the coarse synchro and the small gear driving the fine synchro. This causes the fine synchro to rotate at 11 times the speed of the coarse synchro.

Air density × speed of sound

Air density × speed of sound14 is given by the formula

\[ \rho \cdot a = C_2 \frac {P_s} {\sqrt{T}} \]

The calculation is almost the same as the air density calculation. Differential D18 subtracts half the log temperature from the log pressure and then cam D14 computes the exponential. Unlike the other values, this output rotates the shaft of a 1 kΩ potentiometer (above), changing its resistance. I don't know why this particular value is output as a resistance rather than a synchro angle.

Conclusions

The CADC performs nonlinear calculations that seem way too complicated to solve with mechanical gearing. But reverse-engineering the mechanism shows how the equations are broken down into steps that can be performed with cams and differentials, using logarithms for multiplication, division, and square roots. I'll point out that reverse engineering the CADC is not as easy as you might expect. It is difficult to tell which gears are in contact, especially when they are buried in the middle of the CADC. I did much of the reverse engineering by rotating one differential to see which other gears turned, but usually most of the gears turned due to the circuitous interconnections.15

By the late 1960s, as fighter planes became more advanced and computer technology improved, digital processors replaced the gears in air data computers. Garrett AiResearch's ILAAS air data computer (1967) was the first all-digital unit. Other digital systems were Bendix's ADC-1000 Digital Air Data Computer (1967) which was "designed to solve all air data computations at a rate of 75 times per second", Conrac's 3-pound solid-state air data computer (1967), Honeywell's Digital Air Data System (1968), and the LSI-based Garrett AiResearch F-14 CADC (1970). Nonetheless, the gear-based Bendix CADC provides an interesting reverse-engineering challenge as well as a look at the forgotten era of analog computing.

For more background on the CADC, see my overview article on the CADC. I plan to continue reverse-engineering the Bendix CADC and get it operational,16 so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon as @kenshirriff@oldbytes.space. Thanks to Joe for providing the CADC. Thanks to Nancy Chen for obtaining a hard-to-find document for me. Marc Verdiell and Eric Schlaepfer are working on the CADC with me.

Notes and references

  1. The static air pressure can also be provided by holes in the side of the pitot tube. I couldn't find information indicating exactly how the planes with the CADC received static pressure. 

  2. Although the CADC's equations may seem ad hoc, they can be derived from fluid dynamics principles. These equations were standardized in the 1950s by various government organizations including the National Bureau of Standards and NACA (the precursor of NASA). 

  3. Strictly speaking, the output of the differential is the sum of the inputs divided by two. I'm ignoring the factor of 2 because the gear ratios can easily cancel it out. It's also arbitrary whether you think of the differential as adding or subtracting, since it depends on which rotation direction is defined as positive. 

  4. The cam value will be added or subtracted, depending on whether the function is concave or convex. This is a simple matter of gearing when the values are fed into the differential. Matching the linear segment to the function is also done with gearing that scales the input value appropriately. 

  5. The diagram below shows a typical cam function in more detail. The input is \(log~ dP/P_s\) and the output is \(log~M / \sqrt{1+.2KM^2}\). The small humped curve at the bottom is the cam correction. Although the input and output functions cover a wide range, the difference that is encoded in the cam is much smaller and drops to zero at both ends.

    This diagram, from Patent 2969910, shows how a cam implements a complicated function.

     

  6. The CADC also has an input for the "position error correction", which I will ignore in this post. This input provides a correction factor because the measured static pressure may not exactly match the real static pressure. The problem is that the static pressure is measured from a port on the aircraft. Distortions in the airflow may cause errors in this measurement. A separate box, the "compensator", determined the correction factor based on the angle of attack and fed it to the CADC as a synchro signal. 

  7. Internally, a synchro has a moving rotor winding and three fixed stator windings. When AC is applied to the rotor, voltages are developed on the stator windings depending on the position of the rotor. These voltages produce a torque that rotates the synchros to the same position. In other words, the rotor receives power (26 V, 400 Hz in this case), while the three stator wires transmit the position. The diagram below shows how a synchro is represented schematically, with rotor and stator coils.

    The schematic symbol for a synchro.

  8. The CADC is wired to the rest of the aircraft through round military connectors. The front panel interfaces these connectors to the D-sub connectors used internally. The two pressure inputs are the black cylinders at the bottom of the photo.

    The exterior of the CADC. It is packaged in a rugged metal cylinder.

     

  9. I don't have a blog post on the temperature module yet, but I have a description on Twitter and a video.

  10. The constant K depends on the recovery factor of the temperature probe. This compensates for a probe where not all of the air's kinetic energy gets transferred to the probe. The 1958 description says that with "modern total temperature probes available today", the K factor can be considered to be 1. 

  11. The CADC specification says that it provides the log true free air temperature from -80° to +70° C. Obviously the log won't work for a negative value so I assume this is the log of the Kelvin temperature (°K). 

  12. The CADC specification defines how the parameter values correspond to rotation angles of the synchros. For instance, for the airspeed synchros, the CADC supports the airspeed range 104.3 to 1864.7 knots. The coarse and fine outputs are geared in an 11:1 ratio, so the fine synchro will rotate multiple times over the range to provide more accuracy. Over this range, the coarse synchro rotates from -18.94° to +151.42° and the fine synchro rotates from -208.29° to +1665.68°, with 0° corresponding to 300 knots. 

  13. The Mach function is defined in terms of \(P_t/P_s \), with separate cases for subsonic and supersonic:

    \[M<1:\] \[~~~\frac{P_t}{P_s} = ( 1+.2M^2)^{3.5}\]

    \[M > 1:\]

    \[~~~\frac{P_t}{P_s} = \frac{166.9215M^7}{( 7M^2-1)^{2.5}}\]

    Although these equations are very complicated, the solution is a function of one variable \(P_t/P_s\), so M can be computed with a single cam. In other words, the complicated mathematics only had to be solved once, when the cam was designed and manufactured; once the cam exists, computing M is trivial. 

  14. I'm not sure why the CADC computes air density times speed of sound. I couldn't find any useful aircraft characteristics that depend on this value, but there must be something. In acoustics and audio, this product is useful as the "air impedance", but I couldn't determine the relevance for aviation. 

  15. While reverse-engineering this system, I have gained more appreciation for the engineering involved. Converting complicated equations to gearing is a remarkable feat. But also remarkable is designing the CADC as a three-dimensional object that can be built, disassembled, and repaired, long before any sort of 3-D modeling was available. It must have been a puzzle to figure out where to position each differential. Each differential had three gears driving it, which had to mesh with gears from other differentials. There wasn't much flexibility in the gear dimensions, since the gear ratios had to be correct and the number of teeth on each gear had to be an integer. Moreover, it is impressive how tightly the gears are packed together without conflicting with each other. 

  16. It was very difficult to find information about the CADC. The official military specification is MIL-C-25653C(USAF). After searching everywhere, I was finally able to get a copy from the Technical Reports & Standards unit of the Library of Congress. The other useful document was in an obscure conference proceedings from 1958: "Air Data Computer Mechanization" (Hazen), Symposium on the USAF Flight Control Data Integration Program, Wright Air Dev Center US Air Force, Feb 3-4, 1958, pp 171-194. 





[#] Sat Oct 14 2023 08:58:00 UTC from rss <>

Subject: Examining the silicon dies of the Intel 386 processor

[Reply] [ReplyQuoted] [Headers] [Print]

You might think of the Intel 386 processor (1985) as just an early processor in the x86 line, but the 386 was a critical turning point for modern computing in several ways.1 First, the 386 moved the x86 architecture to 32 bits, defining the dominant computing architecture for the rest of the 20th century. The 386 also established the overwhelming importance of x86, not just for Intel, but for the entire computer industry. Finally, the 386 ended IBM's control over the PC market, turning Compaq into the architectural leader.

In this blog post, I look at die photos of the Intel 386 processor and explain what they reveal about the history of the processor, such as the move from the 1.5 µm process to the 1 µm process. You might expect that Intel simply made the same 386 chip at a smaller scale, but there were substantial changes to the chip's layout, even some visible to the naked eye.2 I also look at why the 386 SL had over three times the transistors as the other 386 versions.3

The 80386 was a major advancement over the 286: it implemented a 32-bit architecture, added more instructions, and supported 4-gigabyte segments. The 386 is a complicated processor (by 1980s standards), with 285,000 transistors, ten times the number of the original 8086.4 The 386 has eight logical units that are pipelined5 and operate mostly autonomously.6 The diagram below shows the internal structure of the 386.7

The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version. I created this image using a die photo from Antoine Bercovici.

The heart of a processor is the datapath, the components that hold and process data. In the 386, these components are in the lower left: the ALU (Arithmetic/Logic Unit), a barrel shifter to shift data, and the registers. These components form regular rectangular blocks, 32 bits wide. The datapath, along with the circuitry to the left that manages it, forms the Data Unit. In the lower right is the microcode ROM, which breaks down machine instructions into micro-instructions, the low-level steps of the instruction. The microcode ROM, along with the microcode engine circuitry, forms the Control Unit.

The 386 has a complicated instruction format. The Instruction Decode Unit breaks apart an instruction into its component parts and generates a pointer to the microcode that implements the instruction. The instruction queue holds three decoded instructions. To improve performance, the Prefetch Unit reads instructions from memory before they are needed, and stores them in the 16-byte prefetch queue.8

The 386 implements segmented memory and virtual memory, with access protection.9 The Memory Management Unit consists of the Segment Unit and the Paging Unit: the Segment Unit translates a logical address to a linear address, while the Paging Unit translates the linear address to a physical address. The segment descriptor cache and page cache (TLB) hold data about segments and pages; the 386 has no on-chip instruction or data cache.10 The Bus Interface Unit in the upper right handles communication between the 386 and the external memory and devices.

Silicon dies are often labeled with the initials of the designers. The 386 DX, however, has an unusually large number of initials. In the image below, I have enlarged the tiny initials so they are visible. I think the designers put their initials next to the unit they worked on, but I haven't been able to identify most of the names.11

The 386 die with the initials magnified.

The shrink from 1.5 µm to 1 µm

The original 386 was built on a process called CHMOS-III that had 1.5 µm features (specifically the gate channel length for a transistor). Around 1987, Intel moved to an improved process called CHMOS-IV, with 1 µm features, permitting a considerably smaller die for the 386. However, shrinking the layout wasn't a simple mechanical process. Instead, many changes were made to the chip, as shown in the comparison diagram below. Most visibly, the Instruction Decode Unit and the Protection Unit in the center-right are horizontal in the smaller die, rather than vertical. The standard-cell logic (discussed later) is considerably more dense, probably due to improved layout algorithms. The data path (left) was highly optimized in the original so it remained essentially unchanged, but smaller. One complication is that the bond pads around the border needed to remain the same size so bond wires could be attached. To fit the pads around the smaller die, many of the pads are staggered. Because different parts of the die shrank differently, the blocks no longer fit together as compactly, creating wasted space at the bottom of the die. For some reason, the numerous initials on the original 386 die were removed. Finally, the new die was labeled 80C386I with a copyright date of 1985, 1987; it is unclear what "C" and "I" indicate.

Comparison of the 1.5 µm die and the 1 µm die at the same scale. Photos courtesy of Antoine Bercovici.

The change from 1.5 µm to 1 µm may not sound significant, but it reduced the die size by 60%. This allowed more dies on a wafer, substantially dropping the manufacturing cost.12 The strategy of shrinking a processor to a new process before designing a new microarchitecture for the process became Intel's tick-tock strategy.
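As a rough sanity check on that claim, here is a back-of-the-envelope dies-per-wafer estimate. This is a simplistic sketch that ignores scribe lines and defects; the wafer and die dimensions are the guesses from footnote 12, and this estimator differs slightly from the one linked there, so the exact counts differ a bit.

    import math

    def dies_per_wafer(wafer_diameter_mm, die_w_mm, die_h_mm):
        # Crude estimate: wafer area / die area, minus a term for unusable edge dies.
        wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
        die_area = die_w_mm * die_h_mm
        edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area)
        return int(wafer_area / die_area - edge_loss)

    print(dies_per_wafer(150, 10, 10))   # about 140 of the 1.5 µm dies per wafer
    print(dies_per_wafer(150, 6, 6.5))   # about 400 of the 1 µm dies: roughly 2.5-3x as many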

The 386 SX

In 1988, Intel introduced the 386 SX processor, the low-cost version of the 386, with a 16-bit bus instead of a 32-bit bus. (This is reminiscent of the 8088 processor with an 8-bit bus versus the 8086 processor with a 16-bit bus.) According to the 386 oral history, the cost of the original 386 die decreased to the point where the chip's package cost about as much as the die. By reducing the number of pins, the 386 SX could be put in a one-dollar plastic package and sold for a considerably reduced price. The SX allowed Intel to segment the market, moving low-end customers from the 286 to the 386 SX, while preserving the higher sales price of the original 386, now called the DX.13 In 1988, Intel sold the 386 SX for $219, at least $100 less than the 386 DX. A complete SX computer could be $1000 cheaper than a similar DX model.

For compatibility with older 16-bit peripherals, the original 386 was designed to support a mixture of 16-bit and 32-bit buses, dynamically switching on a cycle-by-cycle basis if needed. Because 16-bit support was built into the 386, the 386 SX didn't require much design work. (Unlike the 8088, which required a redesign of the 8086's bus interface unit.)

The 386 SX was built at both 1.5 µm and 1 µm. The diagram below compares the two sizes of the 386 SX die. These photos may look identical to the 386 DX photos in the previous section, but close examination shows a few differences. Since the 386 SX uses fewer pins, it has fewer bond pads, eliminating the staggered pads of the shrunk 386 DX. There are a few differences at the bottom of the chip, with wiring in much of the 386 DX's wasted space.

Comparison of two dies for the 386 SX. Photos courtesy of Antoine Bercovici.

Comparing the two SX revisions, the larger die is labeled "80P9"; Intel's internal name for the chip was "P9", using their confusing series of P numbers. The shrunk die is labeled "80386SX", which makes more sense. The larger die is copyright 1985, 1987, while the shrunk die (which should be newer) is copyright 1985 for some reason. The larger die has mostly the same initials as the DX, with a few changes. The shrunk die has about 21 sets of initials.

The 386 SL die

The 386 SL (1990) was a major extension to the 386, combining a 386 core and other functions on one chip to save power and space. Named "SuperSet", it was designed to corner the notebook PC market.14 The 386 SL chip included an ISA bus controller, power management logic, a cache controller for an external cache, and the main memory controller.

Looking at the die photo below, the 386 core itself takes up about 1/4 of the SL's die. The 386 core is very close to the standard 386 DX, but there are a few visible differences. Most visibly, the bond pads and pin drivers have been removed from the core. There are also some circuitry changes. For instance, the 386 SL core supports the System Management Mode, which suspends normal execution, allowing power management and other low-level hardware tasks to be performed outside the regular operating system. System Management Mode is now a standard part of the x86 line, but it was introduced in the 386 SL.

The 386 SL die with functional blocks labeled. Die photo courtesy of Antoine Bercovici.

In total, the 386 SL contains 855,000 transistors,15 over 3 times as many as the regular 386 DX. The cache tag RAM takes up a lot of space and transistors. The cache data itself is external; this on-chip circuitry just manages the cache. The other new components are largely implemented with standard-cell logic (discussed below); this is visible as uniform stripes of circuitry, most clearly in the ISA bus controller.

A brief history of the 386

From the modern perspective, it seems obvious for Intel to extend the x86 line from the 286 to the 386, while keeping backward compatibility. But at the time, this path was anything but clear. This history starts in the late 1970s, when Intel decided to build a "micromainframe" processor, an advanced 32-bit processor for object-oriented programming that had objects, interprocess communication, and memory protection implemented in the CPU. This overly ambitious project fell behind schedule, so Intel created a stopgap processor to sell until the micromainframe processor was ready. This stopgap processor was the 16-bit 8086 processor (1978).

In 1981, IBM decided to use the Intel 8088 (an 8086 variant) in the IBM Personal Computer (PC), but Intel did not realize the importance of this at the time. Instead, Intel was focused on their micromainframe processor, also released in 1981 as the iAPX 432, but this became "one of the great disaster stories of modern computing" as the New York Times called it. Intel then reimplemented the ideas of the ill-fated iAPX 432 on top of a RISC architecture, creating the more successful i960.

Meanwhile, things weren't going well at first for the 286 processor, the follow-on to the 8086.16 Bill Gates and others called its design "brain-damaged". IBM was unenthusiastic about the 286 for their own reasons.17 As a result, the 386 project was a low priority for Intel and the 386 team felt that it was the "stepchild"; internally, the 386 was pitched as another stopgap, not Intel's "official" 32-bit processor.

Despite the lack of corporate enthusiasm, the 386 team came up with two proposals to extend the 286 to a 32-bit architecture. The first was a minimal approach to extend the existing registers and address space to 32 bits. The more ambitious proposal would add more registers and create a 32-bit instruction set that was significantly different from the 8086's 16-bit instruction set. At the time, the IBM PC was still relatively new, so the importance of the installed base of software wasn't obvious; software compatibility was viewed as a "nice to have" feature rather than essential. After much debate, the decision was made around the end of 1982 to go with the minimal proposal, but supporting both segments and flat addressing, while keeping compatibility with the 286.

By 1984, though, the PC industry was booming and the 286 was proving to be a success. This produced enormous political benefits for the 386 team, who saw the project change from "stepchild" to "king". Intel introduced the 386 in 1985, which was otherwise "a miserable year for Intel and the rest of the semiconductor industry," as Intel's annual report put it. Due to an industry-wide business slowdown, Intel's net income "essentially disappeared." Moreover, facing heavy competition from Japan, Intel dropped out of the DRAM business, a crushing blow for a company that got its start in the memory industry. Fortunately, the 386 would change everything.

Given IBM's success with the IBM PC, Intel was puzzled that IBM wasn't interested in the 386 processor, but IBM had a strategy of their own.18 By this time, the IBM PC was being cloned by many competitors, but IBM had a plan to regain control of the PC architecture and thus the market: in 1987, IBM introduced the PS/2 line. These new computers ran the OS/2 operating system instead of Windows and used the proprietary Micro Channel architecture.19 IBM used multiple engineering and legal strategies to make cloning the PS/2 slow, expensive, and risky, so IBM expected they could take back the market from the clones.

Compaq took the risky approach of ignoring IBM and following their own architectural direction.20 Compaq introduced the high-end Deskpro 386 line in September 1986, becoming the first major company to build 386-based computers. An "executive" system, the Deskpro 386 model 40 had a 40-megabyte hard drive and sold for $6449 (over $15,000 in current dollars). Compaq's gamble paid off and the Deskpro 386 was a rousing success.

The Compaq Deskpro 386 in front of the 386 processor (not to scale). From PC Tech Journal, 1987. Curiously, the die image of the 386 has been mirrored, as can be seen both from the positions of the microcode ROM and instruction decoder at the top as well as from the position of the cut corner of the package.

As for IBM, the PS/2 line was largely unsuccessful and failed to become the standard. Rather than regaining control over the PC, "IBM lost control of the PC standard in 1987 when it introduced its PS/2 line of systems."21 IBM exited the PC market in 2004, selling the business to Lenovo. One slightly hyperbolic book title summed it up: "Compaq Ended IBM's PC Domination and Helped Invent Modern Computing". The 386 was a huge moneymaker for Intel, leading to Intel's first billion-dollar quarter in 1990. It cemented the importance of the x86 architecture, not just for Intel but for the entire computing industry, dominating the market up to the present day.22

How the 386 was designed

The design process of the 386 is interesting because it illustrates Intel's migration to automated design systems and heavier use of simulation.23 At the time, Intel was behind the industry in its use of tools so the leaders of the 386 realized that more automation would be necessary to build a complex chip like the 386 on schedule. By making a large investment in automated tools, the 386 team completed the design ahead of schedule. Along with proprietary CAD tools, the team made heavy use of standard Unix tools such as sed, awk, grep, and make to manage the various design databases.

The 386 posed new design challenges compared to the previous 286 processor. The 386 was much more complex, with twice the transistors. But the 386 also used fundamentally different circuitry. While the 286 and earlier processors were built from NMOS transistors, the 386 moved to CMOS (the technology still used today). Intel's CMOS process was called CHMOS-III (complementary high-performance metal-oxide-silicon) and had a feature size of 1.5 µm. CHMOS-III was based on Intel's HMOS-III process (used for the 286), but extended to CMOS. Moreover, the CHMOS process provided two layers of metal instead of one, changing how signals were routed on the chip and requiring new design techniques.

The diagram below shows a cross-section through a CHMOS-III circuit, with an NMOS transistor on the left and a PMOS transistor on the right. Note the jagged three-dimensional topography that is formed as layers cross each other (unlike modern polished wafers). This resulted in the "forbidden gap" problem that caused difficulty for the 386 team. Specifically, second-layer metal (M2) could be close to the first-layer metal (M1) or it could be far apart, but an in-between distance would cause problems: the forbidden gap. If the metal layers crossed with a spacing in the "forbidden gap", the metal could crack and whiskers of metal would touch, causing the chip to fail. These problems reduced the yield of the 386.

A cross-section of circuitry formed with the CHMOS-III process. From A double layer metal CHMOS III technology.

The design of the 386 proceeded both top-down, starting with the architecture definition, and bottom-up, designing standard cells and other basic circuits at the transistor level. The processor's microcode, the software that controlled the chip, was a fundamental component. It was designed with two CAD tools: an assembler and microcode rule checker. The high-level design of the chip (register-level RTL) was created and refined until clock-by-clock and phase-by-phase timing were represented. The RTL was programmed in MAINSAIL, a portable Algol-like language based on SAIL (Stanford Artificial Intelligence Language). Intel used a proprietary simulator called Microsim to simulate the RTL, stating that full-chip RTL simulation was "the single most important simulation model of the 80386".

The next step was to convert this high-level design into a detailed logic design, specifying the gates and other circuitry using Eden, a proprietary schematics-capture system. Simulating the logic design required a dedicated IBM 3083 mainframe that compared it against the RTL simulations. Next, the circuit design phase created the transistor-level design. The chip layout was performed on Applicon and Eden graphics systems. The layout started with critical blocks such as the ALU and barrel shifter. To meet the performance requirements, the TLB (translation lookaside buffer) for the paging mechanism required a creative design, as did the binary adders.

Examples of standard cells used in the 386. From "Automatic Place and Route Used on the 80386" by Joseph Krauskopf and Pat Gelsinger. I have added color.

The "random" (unstructured) logic was implemented with standard cells, rather than the transistor-by-transistor design of earlier processors. The idea of standard cells is to have fixed blocks of circuitry (above) for logic gates, flip-flops, and other basic functions.24 These cells are arranged in rows by software to implement the specified logic description. The space between the rows is used as a wiring channel for connections between the cells. The disadvantage of a standard cell layout is that it generally takes up more space than an optimized hand-drawn layout, but it is much faster to create and easier to modify.

These standard cells are visible in the die as regular rows of circuitry. Intel used the TimberWolf automatic placement and routing package, which used simulated annealing to optimize the placement of cells. TimberWolf was built by a Berkeley grad student; one 386 engineer said, "If management had known that we were using a tool by some grad student as the key part of the methodology, they would never have let us use it." Automated layout was a new thing at Intel; using it improved the schedule, but the lower density raised the risk that the chip would be too large.
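To give a flavor of what the TimberWolf approach does, here is a toy simulated-annealing placer in Python. It is a from-scratch illustration of the general technique (random cell swaps, with some bad moves accepted while the "temperature" is high), not TimberWolf's actual cost function or code.

    import math, random

    random.seed(1)
    ROWS, COLS = 4, 10
    cells = list(range(ROWS * COLS))                      # one cell per slot
    nets = [random.sample(cells, 3) for _ in range(60)]   # random 3-pin nets
    slot_of = {c: i for i, c in enumerate(cells)}         # current placement

    def position(cell):
        return divmod(slot_of[cell], COLS)                # (row, column)

    def wire_length():
        # Cost: total half-perimeter bounding box of every net.
        total = 0
        for net in nets:
            rows = [position(c)[0] for c in net]
            cols = [position(c)[1] for c in net]
            total += (max(rows) - min(rows)) + (max(cols) - min(cols))
        return total

    temperature = 10.0
    cost = wire_length()
    while temperature > 0.01:
        for _ in range(200):
            a, b = random.sample(cells, 2)
            slot_of[a], slot_of[b] = slot_of[b], slot_of[a]      # propose swapping two cells
            new_cost = wire_length()
            delta = new_cost - cost
            if delta <= 0 or random.random() < math.exp(-delta / temperature):
                cost = new_cost                                  # accept the move
            else:
                slot_of[a], slot_of[b] = slot_of[b], slot_of[a]  # undo it
        temperature *= 0.9                                       # cool down

    print("final wire length:", cost)

Swaps that lengthen the wiring are occasionally accepted early on, which lets the placement escape local minima; as the temperature drops, the placer settles into a low-wire-length arrangement of the cell rows.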

Standard cells in the 386. Each row consists of numerous standard cells packed together. Each cell is a simple circuit such as a logic gate or flip flop. The wide wiring channels between the rows hold the wiring that connects the cells. This block of circuitry is in the bottom center of the chip.

The data path consists of the registers, ALU (Arithmetic Logic Unit), barrel shifter, and multiply/divide unit that process the 32-bit data. Because the data path is critical to the performance of the system, it was laid out by hand using a CALMA system. The designers could optimize the layout, taking advantage of regularities in the circuitry, optimizing the shape and size of each transistor and fitting them together like puzzle pieces. The data path is visible on the left side of the die, forming orderly 32-bit-wide rectangles in contrast to the tangles of logic next to it.

Once the transistor-level layout was complete, Intel's Hierarchical Connectivity Verification System checked that the final layout matched the schematics and adhered to the process design rules. The 386 set an Intel speed record, taking just 11 days from completing the layout to "tapeout", when the chip data is sent on magnetic tape to the mask fabrication company. (The tapeout team was led by Pat Gelsinger, who later became CEO of Intel.) After the glass masks were created using an electron-beam process, Intel's "Fab 3" in Livermore (the first to wear the bunnysuits) produced the 386 silicon wafers.

Chip designers like to claim that their chip worked the first time, but that was not the case for the 386. When the team received the first silicon for the 386, they ran a trivial do-nothing test program, "NoOp, NoOp, Halt", and it failed. Fortunately, they found a small fix to a PLA (Programmable Logic Array). Rather than create new masks, they were able to patch the existing mask with ion milling and get new wafers quickly. These wafers worked well enough that they could start the long cycles of debugging and fixing.

Once the processor was released, the problems weren't over.25 Some early 386 processors had a 32-bit multiply problem, where some arguments would unpredictably produce the wrong results under particular temperature/voltage/frequency conditions. (This is unrelated to the famous Pentium FDIV bug that cost Intel $475 million.) The root cause was a layout problem, not a logic problem; they didn't allow enough margin to handle the worst case data in combination with manufacturing process and environment factors. This tricky problem didn't show up in simulation or chip verification, but was only found in stress testing. Intel sold the faulty processors, but marked them as only valid for 16-bit software, while marking the good processors with a double sigma, as seen below.26 This led to embarrassing headlines such as Some 386 Systems Won't Run 32-Bit Software, Intel Says. The multiply bug also caused a shortage of 386 chips in 1987 and 1988 as Intel redesigned the chip to fix the bug. Overall, the 386 issues probably weren't any worse than other processors and the problems were soon forgotten.

Bad and good versions of the 386. Note the labels on the bottom line. Photos (L), (R) by Thomas Nguyen, (CC BY-SA 4.0).

Conclusions

A 17-foot tall plot of the 386. The datapath is on the left and the microcode is in the lower right. It is unclear if this is engineering work or an exhibit at MOMA. Image spliced together from the 1985 annual report.

The 386 processor was a key turning point for Intel. Intel's previous processors sold very well, but this was largely due to heavy marketing ("Operation Crush") and the good fortune to be selected for the IBM PC. Intel was technologically behind the competition, especially Motorola. Motorola had introduced the 68000 processor in 1979, starting a powerful line of (more-or-less) 32-bit processors. Intel, on the other hand, lagged with the "brain-damaged" 16-bit 286 processor in 1982. Intel was also slow with the transition to CMOS; Motorola had moved to CMOS in 1984 with the 68020.

The 386 provided the necessary technological boost for Intel, moving to a 32-bit architecture, transitioning to CMOS, and fixing the 286's memory model and multitasking limitations, while maintaining compatibility with the earlier x86 processors. The overwhelming success of the 386 solidified the dominance of the x86 and Intel, and put other processor manufacturers on the defensive. Compaq used the 386 to take over PC architecture leadership from IBM, leading to the success of Compaq, Dell, and other companies, while IBM eventually departed the PC market entirely. Thus, the 386 had an oversized effect on the computer industry, shaping the winners and losers for decades.

I plan to write more about the 386, so follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space. Acknowledgements: The die photos are courtesy of Antoine Bercovici; you should follow him on Twitter as @Siliconinsid.27 Thanks to Pat Gelsinger and Roxanne Koester for providing helpful papers.

Notes and references

  1. The 386 also changed the industry because Intel abandoned the standard practice of second sourcing (allowing other companies to manufacture a chip). AMD, for example, had been a second source for the 286. But Intel decided to keep production of the 386 to themselves. Intel ended up licensing the 386 to IBM, though, as the IBM 386SLC. Despite the name, this was the 386 SX, not the 386 SL. 

  2. Intel made various keychains containing the 386 die, as shown at CPU World. If you know where to look, it is easy to distinguish the variants. In particular, look at the instruction decoders above the microcode and see if they are oriented vertically (pre-shrink 386) or horizontally (post-shrink 386). 

  3. The naming of the 386 versions is a bit of a mess. The 386 started as the 80386 and later the i386. The 80386SX was introduced in 1988; this is the version with a 16-bit bus. The "regular" 386 was then renamed the DX to distinguish it from the SX. There are several other versions of the 386 that I won't discuss here, such as the EX, CXSB, and 80376. See Wikipedia for details.

    Confusingly, the 486 also used the SX and DX names, but in a different way. The 486 DX was the original that included a floating-point unit, while floating-point was disabled in the 486 SX. Thus, in both cases "DX" was the full chip, while "SX" was the low-cost version, but the removed functionality was entirely different.

    Another complication is that a 386DX chip will have a marking like "SX217", but this has nothing to do with the 386 SX. SX217 is an Intel S-Specification number, which specifies the particular stepping of the processor, indicating a manufacturing change or if a feature has been fixed or removed. 

  4. Counting transistors isn't as straightforward as you might think. For example, a ROM may have a transistor for a 1 bit and no transistor for a 0 bit. Thus, the number of transistors depends on the data stored in the ROM. Likewise, a PLA has transistors present or absent in a grid, depending on the desired logic functions. For this reason, transistor counts are usually the number of "transistor sites", locations that could have a transistor, even if a transistor is not physically present. In the case of the 386, it has 285,000 transistor sites and 181,000 actual transistors (source), so over 100,000 reported transistors don't actually exist.

    I'll also point out that most sources claim 275,000 transistors for the 386. My assumption is that 285,000 is the more accurate number (since this source distinguishes between transistor sites and physical transistors), while 275,000 is the rounded number. 

  5. The 386's independent, pipelined functional units provide a significant performance improvement and the pipeline can be executing up to eight instructions at one time. For instance, the 386's microcode engine permits some overlap between the end of one instruction and the beginning of the next, an overlap that speeds up the processor by about 9%. But note that instructions are still executed sequentially, taking multiple clocks per instruction, so it is nothing like the superscalar execution introduced in the Pentium. 

  6. The diagram of the 386 die shows eight functional units. It can be compared to the block diagram below, which shows how the units are interconnected.

    Block diagram of the 386. From The Intel 80386—Architecture and Implementation.

     

  7. My labeled die diagram combines information from two Intel papers: The Intel 80386—Architecture and Implementation and Design and Test of the 80386. The former paper describes the eight functional units. The latter paper provides more details, but only shows six functional units. (The Control Unit and Data Unit are combined into the Execution Unit, while the Protection Test Unit is dropped as an independent unit.) Interestingly, the second paper is by Patrick Gelsinger, who is now CEO of Intel. Pat Gelsinger also wrote "80386 Tapeout - Giving Birth to an Elephant", which says there are nine functional units. I don't know what the ninth unit is, maybe the substrate bias generator? In any case, the count of functional units is flexible.

    Patrick Gelsinger's biography from his 80386 paper.

     

  8. The 386 has a 16-byte prefetch queue, but apparently only 12 bytes are used due to a pipeline bug (details). 

  9. Static checks for access violations are performed by the Protection Test Unit, while dynamic checks are performed by the Segment Unit and the Paging Unit. 

  10. The 386 was originally supposed to have an on-chip cache, but there wasn't room and the cache was dropped in the middle of the project. As it was, the 386 die barely fit into the lithography machine's field of view. 

  11. It kind of looks like the die has the initials ET next to a telephone. Could this be a reference to the movie E.T. and its catchphrase "E.T. phone home"? "SEC" must be senior mask designer Shirley Carter. "PSR" is probably Paul S. Ries who designed the 386's paging unit. 

  12. I think that Intel used a 6" (150mm) wafer for the 386. With a 10mm×10mm die, about 128 chips would fit on a wafer. But with a 6mm×6.5mm die, about 344 would fit on a wafer, over 2.5 times as many. (See Die per wafer estimator.) 

  13. The 286 remained popular compared to the 386, probably due to its lower price. It wasn't until 1991 that the number of 386 units sold exceeded the 286 (source). Intel's revenue for the 386 was much, much higher than for the 286 though (source). 

  14. The "SuperSet" consisted of the 386 SL along with the 82360SL peripheral I/O chip. The I/O chip contained various ISA bus peripherals, taking the place of multiple chips such as the 8259 that dated back to the 8080 processor. The I/O chip included DMA controllers, timers, interrupt controllers, a real time clock, serial ports, and a parallel port. It also had a hard disk interface, a floppy disk controller, and a keyboard controller. 

  15. The 386 SL transistor count is from the Intel Microprocessor Quick Reference Guide, which contains information on most of Intel's processors. 

  16. The 186 processor doesn't fit cleanly into the sequence of x86 processors. Specifically, the 186 is an incompatible side-branch, rather than something in the 286, 386, 486 sequence. The 186 was essentially an 8086 that included additional functionality (clock generator, interrupt controller, timers, etc.) to make it more suitable for an embedded system. The 186 was used in some personal computers, but it was incompatible with the IBM PC so it wasn't very popular. 

  17. IBM didn't want to use the 286 because they were planning to reverse-engineer the 286 and make their own version, a 16-megahertz CMOS version. This was part of IBM's plan to regain control of the PC architecture with the PS/2. Intel told IBM that "the fastest path to a 16-megahertz CMOS 286 is the 386 because it is CMOS and 16-megahertz", but IBM continued on their own 286 path. Eventually, IBM gave up and used Intel's 286 in the PS/2. 

  18. IBM might have been reluctant to support the 386 processor because of the risk of cutting into sales of IBM's mid-range 4300 mainframe line. An IBM 4381-2 system ran at about 3.3 MIPS and cost $500,000, about the same MIPS performance as a 386/16 system for under $10,000. The systems aren't directly comparable, of course, but many customers could use the 386 for a fraction of the price. IBM's sales of 4300 and other systems declined sharply in 1987, but the decline was blamed on DEC's VAX systems.

    An IBM 4381 system. The 4381 processor is the large cabinet to the left of the terminals. The cabinets at the back are probably IBM 3380 disk drives. From an IBM 4381 brochure.

     

  19. The most lasting influence of the PS/2 was the round purple and green keyboard and mouse ports that were used by most PCs until USB obsoleted them. The PS/2 ports are still available on some motherboards and gaming computers.

    The PS/2 keyboard and mouse ports on the back of a Gateway PC.

     

  20. When Compaq introduced their 386-based system, "they warned IBM that it has but six months to announce a similar machine or be supplanted as the market's standard setter." (source). Compaq turned out to be correct. 

  21. The quote is from Computer Structure and Logic.

  22. Whenever I mention x86's domination of the computing market, people bring up ARM, but ARM has a lot more market share in people's minds than in actual numbers. One research firm says that ARM has 15% of the laptop market share in 2023, expected to increase to 25% by 2027. (Surprisingly, Apple only has 90% of the ARM laptop market.) In the server market, just an estimated 8% of CPU shipments in 2023 were ARM. See Arm-based PCs to Nearly Double Market Share by 2027 and Digitimes. (Of course, mobile phones are almost entirely ARM.) 

  23. Most of my section on the 386 design process is based on Design and Test of the 80386. The 386 oral history also provides information on the design process. The article Such a CAD! also describes Intel's CAD systems. Amusingly, I noticed that one of its figures (below) used a photo of the 386SL instead of the 386DX, with the result that the text is completely wrong. For instance, what it calls the microcode ROM is the cache tag RAM.

    Erroneous description of the 386 layout. I put an X through it so nobody reuses it.

     

  24. Intel has published a guide to their 1.5 micron CHMOS III cell library. I assume this is the same standard-cell library that was used for the logic in the 386. The library provided over 150 logic functions. It also provided cell-based versions of the Intel 80C51 microcontroller and various Intel support chips such as the 82C37A DMA controller, the 82C54 interval timer, and the 82C59 interrupt controller.

    Die photo of the 82360SL ISA Peripheral I/O Chip, from the 386 SL Data Book.

    Interestingly, the 386 SL's Peripheral I/O chip (the 82360SL) included the functionality of these support chips. Standard-cell construction is visible as the stripes in the die photo (above). Moreover, the layout of the die shows separated blocks, probably corresponding to each embedded chip. I expect that Intel designed standard-cell versions of the controller chips to embed in the I/O chip and then added the chips to the standard-cell library since they were available. 

  25. For an example of the problems that could require a new stepping of the 386, see Intel backs off 80386 claims but denies chip recast needed (1986). It discusses multitasking issues with the 386, with Intel calling them "minor imperfections" that could cause "little glitches", while others suggested that the chip would need replacement. The bugs fixed in each stepping of the 386 are documented here.

  26. One curiosity about the 386 is the IBTS and XBTS instructions. The Insert Bit String and Extract Bit String instructions were implemented in the early 386 processors, but then removed in the B1 stepping. It's interesting that the bit string instructions were removed in the B1 stepping, the same stepping that fixed the 32-bit multiplication bug. Intel said that they were removed "in order to use the area of the chip previously occupied for other microcircuitry" (source). I wonder if Intel fixed the multiplication bug in microcode, and needed to discard the bit string operations to free up enough microcode space. Intel reused these opcodes in the 486 for the CMPXCHG instruction, but that caused conflicts with old 386 programs, so Intel changed the 486 opcodes in the B stepping.  

  27. Since Antoine photographed many different 386 chips, I could correlate the S-Specs with the layout changes. I'll summarize the information here, in case anyone happens to want it. The larger DX layout is associated with SX213 and SX215. (Presumably the two are different, but nothing that I could see in the photographs.) The shrunk DX layout is associated with SX217, SX218, SX366, and SX544. The 386 SL image is SX621. 





[#] Tue Oct 31 2023 08:41:39 UTC from rss <>

Subject: Reverse-engineering Ethernet backoff on the Intel 82586 network chip's die

[Reply] [ReplyQuoted] [Headers] [Print]

Introduced in 1973, Ethernet is the predominant way of wiring computers together. Chips were soon introduced to handle the low-level aspects of Ethernet: converting data packets into bits, implementing checksums, and handling network collisions. In 1982, Intel announced the i82586 Ethernet LAN coprocessor chip, which went much further by offloading most of the data movement from the main processor to an on-chip coprocessor. Modern Ethernet networks handle a gigabit of data per second or more, but at the time, the Intel chip's support for 10 Mb/s Ethernet put it on the cutting edge. (Ethernet was surprisingly expensive, about $2000 at the time, but expected to drop under $1000 with the Intel chip.) In this blog post, I focus on a specific part of the coprocessor chip: how it handles network collisions and implements exponential backoff.

The die photo below shows the i82586 chip. This photo shows the metal layer on top of the chip, which hides the underlying polysilicon wiring and silicon transistors. Around the edge of the chip, square bond pads provide the link to the chip's 48 external pins. I have labeled the function blocks based on my reverse engineering and published descriptions. The left side of the chip is called the "receive unit" and handles the low-level networking, with circuitry for the network transmitter and receiver. The left side also contains low-level control and status registers. The right side is called the "command unit" and interfaces to memory and the main processor. The right side contains a simple processor controlled by a microinstruction ROM.1 Data is transmitted between the two halves of the chip by 16-byte FIFOs (first in, first out queues).

The die of the Intel 82586 with the main functional blocks labeled. Click this image (or any other) for a larger version.

The 82586 chip is more complex than the typical Ethernet chip at the time. It was designed to improve system performance by moving most of the Ethernet processing from the main processor to the coprocessor, allowing the main processor and the coprocessor to operate in parallel. The coprocessor provides four DMA channels to move data between memory and the network without the main processor's involvement. The main processor and the coprocessor communicate through complex data structures2 in shared memory: the main processor puts control blocks in memory to tell the I/O coprocessor what to do, specifying the locations of transmit and receive buffers in memory. In response, the I/O coprocessor puts status blocks in memory. The processor onboard the 82586 chip allows the chip to handle these complex data structures in software. Meanwhile, the transmission/receive circuitry on the left side of the chip uses dedicated circuitry to handle the low-level, high-speed aspects of Ethernet.

Ethernet and collisions

A key problem with a shared network is how to prevent multiple computers from trying to send data on the network at the same time. Instead of a centralized control mechanism, Ethernet allows computers to transmit whenever they want.3 If two computers transmit at the same time, the "collision" is detected and the computers try again, hoping to avoid a collision the next time. Although this may sound inefficient, it turns out to work remarkably well.4 To avoid a second collision, each computer waits a random amount of time before retransmitting the packet. If a collision happens again (which is likely on a busy network), an exponential backoff algorithm is used, with each computer waiting longer and longer after each collision. This automatically balances the retransmission delay to minimize collisions and maximize throughput.

I traced out a bunch of circuitry to determine how the exponential backoff logic is implemented. To summarize, exponential backoff is implemented with a 10-bit counter to provide a pseudorandom number, a 10-bit mask register to get an exponentially sized delay, and a delay counter to count down the delay. I'll discuss how these are implemented, starting with the 10-bit counter.
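Here is a minimal software analog of that scheme (my own sketch; the 51.2 µs slot time is the standard 10 Mb/s Ethernet value, and the chip's delay counter actually works with the complemented value, as described later, which this sketch glosses over):

    SLOT_TIME = 51.2e-6          # seconds: the standard 10 Mb/s Ethernet slot time

    counter = 0                  # stands in for the free-running 10-bit counter

    def tick():
        # The hardware counter increments continuously, so its value at the
        # moment of a collision is effectively pseudorandom.
        global counter
        counter = (counter + 1) & 0x3FF

    def backoff_delay(collision_count):
        # Mask register: collision_count low bits set, capped at 10 bits.
        n = min(collision_count, 10)
        mask = (1 << n) - 1
        slots = counter & mask   # pseudorandom value from 0 to 2**n - 1
        return slots * SLOT_TIME

    for _ in range(823):         # suppose the counter happens to read 823
        tick()
    print(backoff_delay(3))      # 823 & 0b111 = 7 slots, about 358 microseconds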

The 10-bit counter

A 10-bit counter may seem trivial, but it still takes up a substantial area of the chip. The straightforward way of implementing a counter is to hook up 10 latches as a "ripple counter". The counter is controlled by a clock signal that indicates that the counter should increment. The clock toggles the lowest bit of the counter. If this bit flips from 1 to 0, the next higher bit is toggled. The process is repeated from bit to bit, toggling a bit if there is a carry. The problem with this approach is that the carry "ripples" through the counter. Each bit is delayed by the lower bit, so the bits don't all flip at the same time. This limits the speed of the counter as the top bit isn't settled until the carry has propagated through the nine lower bits.

The counter in the chip uses a different approach with additional circuitry to improve performance. Each bit has logic to check if all the lower bits are ones. If so, the clock signal toggles the bit. All the bits toggle at the same time, rapidly incrementing the counter in response to the clock signals. The drawback of this approach is that it requires much more logic.

The diagram below shows how the carry logic is implemented. The circuitry is optimized to balance speed and complexity. In particular, bits are examined in groups of three, allowing some of the logic to be shared across multiple bits. For instance, instead of using a 9-input gate to examine the nine lower bits, separate gates test bits 0-2 and 3-5.
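The behavior can be summarized in a few lines of Python (a sketch of the logic, not the chip's gate-level implementation; grouping the bits into threes only changes how the test is wired, not its result):

    def toggle_signals(count):
        # Bit i toggles on the next clock when all lower bits are 1
        # (bit 0 always toggles).
        return [(count & ((1 << i) - 1)) == ((1 << i) - 1) for i in range(10)]

    def increment(count):
        # All selected bits flip at the same time -- no ripple delay.
        for i, toggle in enumerate(toggle_signals(count)):
            if toggle:
                count ^= (1 << i)
        return count & 0x3FF

    print(bin(increment(0b0000111111)))  # 0b1000000: seven bits flip at once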

The circuitry to generate the toggle signals for each bit of the counter.

The implementation of the latches is also interesting. Each latch is implemented with dynamic logic, using the circuit's capacitance to store each bit. The input is connected to the output with two inverters. When the clock is high, the transistor turns on, connecting the inverters in a loop that holds the value. When the clock is low, the transistor turns off. However, the 0 or 1 value will still remain on the input to the first inverter, held by the charge on the transistor's gate. At this time, an input can be fed into the latch, overriding the old value.

The basic dynamic latch circuit.

The latch has some additional circuitry to make it useful. To toggle the latch, the output is inverted before feeding it back to the input. The toggle control signal selects the inverted output through another pass transistor. The toggle signal is only activated when the clock is low, ensuring that the circuit doesn't repeatedly toggle, oscillating out of control.

One bit of the counter.

One bit of the counter.

The image below shows how the counter circuit is implemented on the die. I have removed the metal layer to show the underlying transistors; the circles are contacts where the metal was connected to the underlying silicon. The pinkish regions are doped silicon. The pink-gray lines are polysilicon wiring. When polysilicon crosses doped silicon, it creates a transistor. The swirls of blue color are not significant; they are bits of oxide remaining on the die.

The counter circuitry on the die.

The counter circuitry on the die.

The 10-bit mask register

The mask register has a particular number of low bits set, providing a mask of length 0 to 10. For instance, with 4 bits set, the mask register is 0000001111. The mask register can be updated in two ways. First, it can be set to length 1-8 with a three-bit length input.5 Second, the mask can be lengthened by one bit, for example going from 0000001111 to 0000011111 (length 4 to 5).

The mask register is implemented with dynamic latches similar to the counter, but the inputs to the latches are different. To load the mask to a particular length, each bit has logic to determine if the bit should be set based on the three-bit input. For example, bit 3 is cleared if the specified length is 0 to 3, and set otherwise. The lengthening feature is implemented by shifting the mask value to the left by one bit and inserting a 1 into the lowest bit.
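A behavioral model of the mask register might look like the Python below; the class and method names are mine, and the load behavior follows footnote 5 (a three-bit input selects a length of 1 to 8).

```python
class MaskRegister:
    """Behavioral sketch of the 10-bit mask register."""
    WIDTH = 10

    def __init__(self):
        self.value = 0

    def reset(self):
        self.value = 0                                   # the "reset" line clears the mask

    def load(self, sel):
        self.value = (1 << (sel + 1)) - 1                # 3-bit input selects length 1 to 8

    def lengthen(self):
        # Shift left by one and insert a 1, e.g. 0b0000001111 -> 0b0000011111.
        self.value = ((self.value << 1) | 1) & ((1 << self.WIDTH) - 1)
```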

The schematic below shows one bit of the mask register. At the center is a two-inverter latch as seen before. When the clock is high, it holds its value. When the clock is low, the latch can be loaded with a new value. The "shift" line causes the bit from the previous stage to be shifted in. The "load" line loads the mask bit generated from the input length. The "reset" line clears the mask. At the right is the NAND gate that applies the mask to the count and inverts the result. As will be seen below, these NAND gates are unusually large.

One stage of the mask register.

One stage of the mask register.

The logic to set a mask bit based on the length input is shown below.6 The three-bit "sel" input selects the mask length from 1 to 8 bits; note that the mask0 bit is always set while bits 8 and 9 are cleared.7 Each set of gates energizes the corresponding mask line for the appropriate inputs.

The control logic to enable mask bits based on length.

The control logic to enable mask bits based on length.

The diagram below shows the mask register on the die. I removed the metal layer to show the underlying silicon and polysilicon, so the transistors are visible. On the left are the NAND gates that combine each bit of the counter with the mask. Note the large snake-like transistors; these larger transistors provide enough current to drive the signal over the long bus to the delay counter register at the bottom of the chip. Bit 0 of the mask is always set, so it doesn't have a latch. Bits 8 and 9 of the mask are only set by shifting, not by selecting a mask length, so they don't have mask logic.8

The mask register on the die.

The mask register on the die.

The delay counter register

To generate the pseudorandom exponential backoff, the counter register and the mask register are NANDed together. This generates a number of the desired binary length, which is stored in the delay counter. Note that the NAND operation inverts the result, making it negative. Thus, as the delay counter counts up, it counts toward zero, reaching zero after the desired number of clock ticks.
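In Python, my reading of this datapath is roughly the following; the 10-bit widths come from the description above, and the counting direction is my interpretation of the circuit rather than something taken from a datasheet.

```python
def backoff_ticks(counter_value, mask_length):
    """How many ticks until the delay counter reaches zero (behavioral sketch)."""
    mask = (1 << mask_length) - 1              # e.g. length 4 -> 0b0000001111
    delay = ~(counter_value & mask) & 0x3FF    # the NAND inverts the masked count
    ticks = 0
    while delay != 0:
        delay = (delay + 1) & 0x3FF            # count up with 10-bit wraparound
        ticks += 1
    return ticks                               # equals (counter_value & mask) + 1
```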

The implementation of the delay counter is similar to the first counter, so I won't include a schematic. However, the delay counter is attached to the register bus, allowing its value to be read by the chip's CPU. Control lines allow the delay counter's value to pass onto the register bus.

The diagram below shows the locations of the counter, mask, and delay register on the die. In this era, something as simple as a 10-bit register occupied a significant part of the die. Also note the distance between the counter and mask and the delay register at the bottom of the chip. The NAND gates for the counter and mask required large transistors to drive the signal across this large distance.

The die, with counter, mask, and delay register.

The die, with counter, mask, and delay register.

Conclusions

The Intel Ethernet chip provides an interesting example of how a real-world circuit is implemented on a chip. Exponential backoff is a key part of the Ethernet standard. This chip implements backoff with a simple but optimized circuit.9

A high-resolution image of the die with the metal removed. (Click for a larger version.) Some of the oxide layer remains, causing colored regions due to thin-film interference.

A high-resolution image of the die with the metal removed. (Click for a larger version.) Some of the oxide layer remains, causing colored regions due to thin-film interference.

For more chip reverse engineering, follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space. Acknowledgments: Thanks to Robert Garner for providing the chip and questions.

Notes and references

  1. I think the on-chip processor is a very simple processor that doesn't match other Intel architectures. It is described as executing microcode. I don't think this is microcode in the usual sense of machine instructions being broken down into microcode. Instead, I think the processor's instructions are primitive, single-clock instructions that are more like micro-instructions than machine instructions. 

  2. The diagram below shows the data structures in shared memory for communication between the main processor and the coprocessor. The Command List specifies the commands that the coprocessor should perform. The Receive Frame area provides memory blocks for incoming network packets.

    A diagram of the 82586 shared memory structures, from the 82586 datasheet.

    A diagram of the 82586 shared memory structures, from the 82586 datasheet.

    I think Intel was inspired by mainframe-style I/O channels, which moved I/O processing to separate processors communicating through memory. Another sign of Intel's attempts to move mainframe technology to microprocessors was the ill-fated iAPX 432 processor, which Intel called a "micro-mainframe." (I discuss the iAPX 432 as part of this blog post.)

     

  3. An alternative approach to networking is token-ring, where the computers in the network pass a token from machine to machine. Only the machine with the token can send a packet on the network, ensuring collision-free transmission. I looked inside an IBM token-ring chip in this post.

  4. Ethernet's technique is called CSMA/CD (Carrier Sense Multiple Access with Collision Detection). The idea of Carrier Sense is that the "carrier" signal on the network indicates that the network is idle. Each computer on the network listens for the carrier before transmitting, which avoids most collisions. However, there is still a small chance of collision. (In particular, the speed of light means that there is a delay on a long network between when one computer starts transmitting and when a second computer can detect this transmission. Thus, both computers can think the network is free while the other computer is transmitting. This factor also imposes a maximum length on an Ethernet network segment: if the network is too long, a computer can finish transmitting a packet before the collision occurs, and it won't detect the collision.) Modern Ethernet has moved from the shared network to a star topology that avoids collisions. 

  5. The length of the mask is one more than the three-bit length input. For example, an input of 7 sets eight mask bits. 

  6. The mask generation logic is a bit tricky to understand. You can try various bit combinations to see how it works. The logic is easier to understand if you apply De Morgan's law to change the NOR gates to AND gates, which also removes the negation on the inputs. 

  7. The control line appears to enable or disable mask selection but its behavior is inexplicably negated on bit 1. 

  8. The circuitry below the counter appears to be a state machine that is unrelated to the exponential backoff. From reverse engineering, my hypothesis is that the counter is reused by the state machine: it both generates pseudorandom numbers for exponential backoff and times events when a packet is being received. In particular, it has circuitry to detect when the counter reaches 9, 20, and 48, and takes actions at these values.

    The state itself is held in numerous latches. The new state is computed by a PLA (Programmable Logic Array) below and to the right of the counter along with numerous individual gates. 

  9. One drawback of this exponential backoff circuit is that the pseudorandom numbers are completely synchronous. If two network nodes happen to be in the exact same counter state when they collide, they will go through the same exponential backoff delays, causing a collision every time. While this may seem unlikely, it apparently happened occasionally during use. The LANCE Ethernet chip from AMD used a different approach. Instead of running the pseudorandom counter from the highly accurate quartz clock signal, the counter used an on-chip ring oscillator that was deliberately designed to be inaccurate. This prevented two nodes from locking into inadvertent synchronization. 





[#] Thu Nov 09 2023 08:52:19 UTC from rss <>

Subject: Reverse engineering the Intel 386 processor's register cell

[Reply] [ReplyQuoted] [Headers] [Print]

The groundbreaking Intel 386 processor (1985) was the first 32-bit processor in the x86 line. It has numerous internal registers: general-purpose registers, index registers, segment selectors, and more specialized registers. In this blog post, I look at the silicon die of the 386 and explain how some of these registers are implemented at the transistor level. The registers that I examined are implemented as static RAM, with each bit stored in a common 8-transistor circuit, known as "8T". Studying this circuit shows the interesting layout techniques that Intel used to squeeze two storage cells together to minimize the space they require.

The diagram below shows the internal structure of the 386. I have marked the relevant registers with three red boxes. Two sets of registers are in the segment descriptor cache, presumably holding cache entries, and one set is at the bottom of the data path. Some of the registers at the bottom are 32 bits wide, while others are half as wide and hold 16 bits. (There are more registers with different circuits, but I won't discuss them in this post.)

The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version. I created this image using a die photo from Antoine Bercovici.

The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version. I created this image using a die photo from Antoine Bercovici.

The 6T and 8T static RAM cells

First, I'll explain how a 6T or 8T static cell holds a bit. The basic idea behind a static RAM cell is to connect two inverters into a loop. This circuit will be stable, with one inverter on and one inverter off, and each inverter supporting the other. Depending on which inverter is on, the circuit stores a 0 or a 1.

Two inverters in a loop can store a 0 or a 1.

Two inverters in a loop can store a 0 or a 1.

To write a new value into the circuit, two signals are fed in, forcing the inverters to the desired new values. One inverter receives the new bit value, while the other inverter receives the complemented bit value. This may seem like a brute-force way to update the bit, but it works. The trick is that the inverters in the cell are small and weak, while the input signals are higher current, able to overpower the inverters.1 The write data lines (called bitlines) are connected to the inverters by pass transistors.2 When the pass transistors are on, the signals on the write lines can pass through to the inverters. But when the pass transistors are off, the inverters are isolated from the write lines. Thus, the write control signal enables writing a new value to the inverters. (This signal is called a wordline since it controls access to a word of storage.) Since each inverter consists of two transistors7, the circuit below consists of six transistors, forming the 6T storage cell.

Adding pass transistors so the cell can be written.

Adding pass transistors so the cell can be written.

The 6T cell uses the same bitlines for reading and writing. Adding two transistors creates the 8T circuit, which has the advantage that you can read one register and write to another register at the same time. (I.e. the register file is two-ported.) In the 8T cell below, two additional transistors (G and H) are used for reading. Transistor G buffers the cell's value; it turns on if the inverter output is high, pulling the read output bitline low.3 Transistor H is a pass transistor that blocks this signal until a read is performed on this register; it is controlled by a read wordline.

Schematic of a storage cell. Each transistor is labeled with a letter.

Schematic of a storage cell. Each transistor is labeled with a letter.

To form registers (or memory), a grid is constructed from these cells. Each row corresponds to a register, while each column corresponds to a bit position. The horizontal lines are the wordlines, selecting which word to access, while the vertical lines are the bitlines, passing bits in or out of the registers. For a write, the vertical bitlines provide the 32 bits (along with their complements). For a read, the vertical bitlines receive the 32 bits from the register. A wordline is activated to read or write the selected register.

Static memory cells (8T) organized into a grid.

Static memory cells (8T) organized into a grid.
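As a software analogy, the grid behaves like a small two-ported register file: one register can be read while another is written in the same cycle. The Python sketch below (names and sizes are my own) captures that behavior, not the electrical details of the wordlines and bitlines.

```python
class RegisterFile:
    """Behavioral sketch of a two-ported register file like the 386's 8T array."""
    def __init__(self, num_regs=8, width=32):
        self.regs = [0] * num_regs
        self.mask = (1 << width) - 1

    def cycle(self, read_index, write_index=None, write_value=0):
        # The read wordline selects one row; the write wordline can select another,
        # so a read and a write can happen in the same cycle.
        value = self.regs[read_index]
        if write_index is not None:
            self.regs[write_index] = write_value & self.mask
        return value
```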

Silicon circuits in the 386

Before showing the layout of the circuit on the die, I should give a bit of background on the technology used to construct the 386. The 386 was built with CMOS technology, with NMOS and PMOS transistors working together, an advance over the earlier x86 chips that were built with NMOS transistors. Intel called this CMOS technology CHMOS-III (complementary high-performance metal-oxide-silicon), with 1.5 µm features. While Intel's earlier chips had a single metal layer, CHMOS-III provided two metal layers, making signal routing much easier.

Because CMOS uses both NMOS and PMOS transistors, fabrication is more complicated. In an MOS integrated circuit, a transistor is formed where a polysilicon wire crosses active silicon, creating the transistor's gate. A PMOS transistor is constructed directly on the silicon substrate (which is N-doped). However, an NMOS transistor is the opposite, requiring a P-doped substrate. This is created by forming a P well, a region of P-doped silicon that holds NMOS transistors. Each P well must be connected to ground; this is accomplished by connecting ground to specially-doped regions of the P well, called "well taps".

The diagram below shows a cross-section through two transistors, showing the layers of the chip. There are four important layers: silicon (which has some regions doped to form active silicon), polysilicon for wiring and transistors, and the two metal layers. At the bottom is the silicon, with P or N doping; note the P-well for the NMOS transistor on the left. Next is the polysilicon layer. At the top are the two layers of metal, named M1 and M2. Conceptually, the chip is constructed from flat layers, but the layers have a three-dimensional structure influenced by the layers below. The layers are separated by silicon dioxide ("ox") or silicon oxynitride4; the oxynitride under M2 caused me considerable difficulty.

A cross-section of circuitry formed with the CHMOS-III process. From A double layer metal CHMOS III technology.

A cross-section of circuitry formed with the CHMOS-III process. From A double layer metal CHMOS III technology.

The image below shows how circuitry appears on the die;5 I removed the metal layers to show the silicon and polysilicon that form transistors. (As will be described below, this image shows two static cells, holding two bits.) The pinkish and dark regions are active silicon, doped to take part in the circuits, while the "background" silicon can be ignored. The green lines are polysilicon lines on top of the silicon. Transistors are the most important feature here: a transistor gate is formed when polysilicon crosses active silicon, with the source and drain on either side. The upper part of the image has PMOS transistors, while the lower part of the image has the P well that holds NMOS transistors. (The well itself is not visible.) In total, the image shows four PMOS transistors and 12 NMOS transistors. At the bottom, the well taps connect the P well to ground. Although the metal has been removed, the contacts between the lower metal layer (M1) and the silicon or polysilicon are visible as faint circles.

A (heavily edited) closeup of the die.

A (heavily edited) closeup of the die.

Register layout in the 386

Next, I'll explain the layout of these cells in the 386. To increase the circuit density, two cells are put side-by-side, with a mirrored layout. In this way, each row holds two interleaved registers.6 The schematic below shows the arrangement of the paired cells, matching the die image above. Transistors A and B form the first inverter,7 while transistors C and D form the second inverter. Pass transistors E and F allow the bitlines to write the cell. For reading, transistor G amplifies the signal while pass transistor H connects the selected bit to the output.

Schematic of two static cells in the 386. The schematic approximately matches the physical layout.

Schematic of two static cells in the 386. The schematic approximately matches the physical layout.

The left and right sides are approximately mirror images, with separate read and write control lines for each half. Because the control lines for the left and right sides are in different positions, the two sides have some layout differences, in particular, the bulging loop on the right. Mirroring the cells increases the density since the bitlines can be shared by the cells.

The diagram below shows the various components on the die, labeled to match the schematic above. I've drawn the lower M1 metal wiring in blue, but omitted the M2 wiring (horizontal control lines, power, and ground). "Read crossover" indicates the connection from the read output on the left to the bitline on the right. Black circles indicate vias between M1 and M2, green circles indicate contacts between silicon and M1, and reddish circles indicate contacts between polysilicon and M1.

The layout of two static cells. The M1 metal layer is drawn in blue; the horizontal M2 lines are not shown.

The layout of two static cells. The M1 metal layer is drawn in blue; the horizontal M2 lines are not shown.

One more complication is that alternating registers (i.e. rows) are reflected vertically, as shown below. This allows one horizontal power line to feed two rows, and similarly for a horizontal ground line. This cuts the number of power/ground lines in half, making the layout more efficient.

Multiple storage cells.

Multiple storage cells.

Having two layers of metal makes the circuitry considerably more difficult to reverse engineer. The photo below (left) shows one of the static RAM cells as it appears under the microscope. Although the structure of the metal layers is visible in the photograph, there is a lot of ambiguity. It is difficult to distinguish the two layers of metal. Moreover, the metal completely hides the polysilicon layer, not to mention the underlying silicon. The large black circles are vias between the two metal layers. The smaller faint circles are contacts between a metal layer and the underlying silicon or polysilicon.

One cell as it appears on the die, with a diagram of the upper (M2) and lower (M1) metal layers.

One cell as it appears on the die, with a diagram of the upper (M2) and lower (M1) metal layers.

With some effort, I determined the metal layers, which I show on the right: M2 (upper) and M1 (lower). By comparing the left and right images, you can see how the structure of the metal layers is somewhat visible. I use black circles to indicate vias between the layers, green circles for contacts between M1 and silicon, and pink circles for contacts between M1 and polysilicon. Note that both metal layers are packed as tightly as possible. The layout of this circuit was highly optimized to minimize the area. It is interesting to note that decreasing the size of the transistors wouldn't help with this circuit, since the size is limited by the metal density. This illustrates that a fabrication process must balance the size of the metal features, polysilicon features, and silicon features since over-optimizing one won't help the overall chip density.

The photo below shows the bottom of the register file. The "notch" makes the registers at the very bottom half-width: 4 half-width rows corresponding to eight 16-bit registers. Since there are six 16-bit segment registers in the 386, I suspect these are the segment registers and two mystery registers.

The bottom of the register file.

The bottom of the register file.

I haven't been able to determine which registers in the 386 correspond to the other registers on the die. In the segment descriptor circuitry, there are two rows of register cells with ten more rows below, corresponding to 24 32-bit registers. These are presumably segment descriptors. At the bottom of the datapath, there are 10 32-bit registers with the 8T circuit. The 386's programmer-visible registers consist of eight general-purpose 32-bit registers (EAX, etc.). The 386 has various control registers, test registers, and segmentation registers8 that are not well known. The 8086 has a few registers for internal use that aren't visible to the programmer, so the 386 presumably has even more invisible registers. At this point, I can't narrow down the functionality.

Conclusions

It's interesting to examine how registers are implemented in a real processor. There are plenty of descriptions of the 8T static cell circuit, but it turns out that the physical implementation is more complicated than the theoretical description. Intel put a lot of effort into optimizing this circuit, resulting in a dense block of circuitry. By mirroring cells horizontally and vertically, Intel increased the density further.

Reverse engineering one small circuit of the 386 turned out to be pretty tricky, so I don't plan to do a complete reverse engineering. The main difficulty is that the two layers of metal are hard to untangle. Moreover, I lost most of the polysilicon when removing the metal. Finally, it is hard to draw diagrams with four layers without the diagram turning into a mess, but hopefully the diagrams made sense.

I plan to write more about the 386, so follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space.

Notes and references

  1. Typically the write driver circuit generates a strong low on one of the bitlines, flipping the corresponding inverter to a high output. As soon as one inverter flips, it will force the other inverter into the right state. To support this, the pullup transistors in the inverters are weaker than normal. 

  2. The pass transistor passes its signal through or blocks it. In CMOS, this is usually implemented with a transmission gate with an NMOS and a PMOS transistor in parallel. The cell uses only the NMOS transistor, which makes it worse at passing a high signal, but substantially reduces the size, a reasonable tradeoff for a storage cell. 

  3. The bitline is typically precharged to a high level for a read, and then the cell pulls the line low for a 0. This is more compact than including circuitry in each cell to pull the line high. 

  4. One problem is that the 386 uses a layer of insulating silicon oxynitride as well as the usual silicon dioxide. I was able to remove the oxynitride with boiling phosphoric acid, but this removed most of the polysilicon as well. I'm still experimenting with the timing; 20 minutes of boiling was too long. 

  5. The image is an edited composite of multiple cells since the polysilicon was highly damaged when removing the metal layers. Unfortunately, I haven't found a process for the 386 to remove one layer of metal at a time. As a result, reverse-engineering the 386 is much more difficult than earlier processors such as the 8086; I have to look for faint traces of polysilicon and puzzle over what connections the circuit requires. 

  6. You might wonder why they put two cells side-by-side instead of simply cramming the cells together more tightly. The reason for putting two cells in each row is presumably to match the size of each bit with the rest of the circuitry in the datapath. If the register circuitry is half the width of the ALU circuitry, a bunch of space will be wasted by the wiring to line up each register bit with the corresponding ALU bit. 

  7. A CMOS inverter is constructed from an NMOS transistor (which pulls the output low on a 1 input) and a PMOS transistor (which pulls the output high on a 0 input), as shown below.

    A CMOS inverter.

    A CMOS inverter.

     

  8. The 386 has multiple registers that are documented but not well known. Chapter 4 of the 386 Programmers Reference Manual discusses various registers that are only relevant to operating systems programmers. These include the Global Descriptor Table Register (GDTR), Local Descriptor Table Register (LDTR), Interrupt Descriptor Table Register (IDTR), and Task Register (TR). There are four Control Registers CR0-CR3; CR0 controls coprocessor usage, paging, and a few other things. The six Debug Registers for hardware breakpoints are named DR0-DR3, DR6, and DR7 (which suggests undocumented DR4 and DR5 registers). The two Test Registers for TLB testing are named TR6 and TR7 (which suggests undocumented TR0-TR5 registers). I expect that these registers are located near the relevant functional units, rather than part of the processing datapath. 





[#] Thu Nov 30 2023 08:52:31 UTC from rss <>

Subject: Inside the Intel 386 processor die: the clock circuit

[Reply] [ReplyQuoted] [Headers] [Print]

Processors are driven by a clock, which controls the timing of each step inside the chip. In this blog post, I'll examine the clock-generation circuitry inside the Intel 386 processor. Earlier processors such as the 8086 (1978) were simpler, using two clock phases internally. The Intel 386 processor (1985) was a pivotal development for Intel as it moved x86 to CMOS (as well as being the first 32-bit x86 processor). The 386's CMOS circuitry required four clock signals. An external crystal oscillator provided the 386 with a single clock signal and the 386's internal circuitry generated four carefully-timed internal clock signals from the external clock.

The die photo below shows the Intel 386 processor with the clock generation circuitry and clock pad highlighted in red. The heart of a processor is the datapath, the components that hold and process data. In the 386, these components are in the lower left: the ALU (Arithmetic/Logic Unit), a barrel shifter to shift data, and the registers. These components form regular rectangular blocks, 32 bits wide. In the lower right is the microcode ROM, which breaks down machine instructions into micro-instructions, the low-level steps of the instruction. Other parts of the chip prefetch and decode instructions, and handle memory paging and segmentation. All these parts of the chip run under the control of the clock signals.

The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version.

The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version.

A brief discussion of clock phases

Many processors use a two-phase clock to control the timing of the internal processing steps. The idea is that the two clock phases alternate: first phase 1 is high, and then phase 2 is high, as shown below. During each clock phase, logic circuitry processes data. A circuit called a "transparent latch" is used to hold data between steps.2 The concept of a latch is that when a latch's clock input is high, the input passes through the latch. But when the latch's clock input is low, the latch remembers its previous value. With two clock phases, alternating latches are active one at a time, so data passes through the circuit step by step, under the control of the clock.

The two-phase clock signal used by the Intel 8080 processor. The 8080 uses asymmetrical clock signals, with phase 2 longer than phase 1. From the 8080 datasheet.

The two-phase clock signal used by the Intel 8080 processor. The 8080 uses asymmetrical clock signals, with phase 2 longer than phase 1. From the 8080 datasheet.

The diagram below shows an abstracted model of the processor circuitry. The combinational logic (i.e. the gate logic) is divided into two blocks, with latches between each block. During clock phase 1, the first block of latches passes its input through to the output. Thus, values pass through the first logic block, the first block of latches, and the second logic block, and then wait.

Action during clock phase 1.

Action during clock phase 1.

During clock phase 2 (below), the first block of latches stops passing data through and holds the previous values. Meanwhile, the second block of latches passes its data through. Thus, the first logic block receives new values and performs logic operations on them. When the clock switches to phase 1, processing continues as in the first diagram. The point of this is that processing takes place under the control of the clock, with values passed step-by-step between the two logic blocks.1

Action during clock phase 2.

Action during clock phase 2.

This circuitry puts some requirements on the clock timing. First, the clock phases must not overlap. If both clocks are active at the same time, data will flow out of control around the loop, messing up the results.3 Moreover, because the two clock phases probably don't arrive at the exact same time (due to differences in the wiring paths), a "dead zone" is needed between the two phases, an interval where both clocks are low, to ensure that the clocks don't overlap even if there are timing skews. Finally, the clock frequency must be slow enough that the logic has time to compute its result before the clock switches.
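The Python sketch below models this arrangement with two transparent latches and two trivial logic blocks (an increment and a doubling, chosen only for illustration). One value steps around the loop per full clock cycle, and nothing moves while both phases are low, which is the point of the dead zone.

```python
class TransparentLatch:
    """Level-sensitive latch: transparent while its clock is high, holding otherwise."""
    def __init__(self):
        self.out = 0

    def update(self, clk, data):
        if clk:
            self.out = data
        return self.out

latch_a, latch_b = TransparentLatch(), TransparentLatch()
feedback = 0
for phase1, phase2 in [(1, 0), (0, 0), (0, 1), (0, 0)] * 3:  # non-overlapping phases
    a = latch_a.update(phase1, feedback + 1)   # logic block 1 feeds latch A on phase 1
    feedback = latch_b.update(phase2, a * 2)   # logic block 2 feeds latch B on phase 2
print(feedback)   # three full cycles: (0+1)*2 -> 2, (2+1)*2 -> 6, (6+1)*2 -> 14
```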

Many processors such as the 8080, 6502, and 8086 used this type of two-phase clocking. Early processors such as the 8008 (1972) and 8080 (1974) required complicated external circuitry to produce two asymmetrical clock phases.4 For the 8080, Intel produced a special clock generator chip (the 8224) that produced the two clock signals according to the required timing. The Motorola 6800 (1974) required two non-overlapping (but at least symmetrical) clocks, produced by the MC6875 clock generator chip. The MOS 6502 processor (1975) simplified clock generation by producing the two phases internally (details) from a single clock input. This approach was used by most later processors.

An important factor is that the Intel 386 processor was implemented with CMOS circuitry, rather than the NMOS transistors of many earlier processors. A CMOS chip uses both NMOS transistors (which turn on when the gate is high) and PMOS transistors (which turn on when the gate is low).7 Thus, the 386 requires an active-high clock signal and an active-low clock signal for each phase,5 four clock signals in total.6 In the rest of this article, I'll explain how the 386 generates these four clock signals.

The clock circuitry

The block diagram below shows the components of the clock generation circuitry. Starting at the bottom, the input clock signal (CLK2, at twice the desired frequency) is divided by two to generate two drive signals with opposite phases. These signals go to the large driver circuits in the middle, which generate the two main clock signals (phase 1 and phase 2). Each driver sends an "inhibit" signal to the other when active, ensuring that the phases don't overlap. Each driver also sends signals to a smaller driver that generates the inverted clock signal. The "enable" signal shapes the output to prevent overlap. The four clock output signals are then distributed to all parts of the processor.

Block diagram of the clock circuitry. The layout of the blocks matches their approximate physical arrangement.

Block diagram of the clock circuitry. The layout of the blocks matches their approximate physical arrangement.

The diagram below shows a closeup of the clock circuitry on the die. The external clock signal enters the die at the clock pad in the lower right. The signal is clamped by protection diodes and a resistor before passing to the divide-by-two logic, which generates the two clock phases. The four driver blocks generate the high-current clock pulses that are transmitted to the rest of the chip by the four output lines at the left.

Details of the clock circuitry. This image shows the two metal layers. At the right, bond wires are connected to the pads on the die.

Details of the clock circuitry. This image shows the two metal layers. At the right, bond wires are connected to the pads on the die.

Input protection

The 386 has a pin "CLK2" that receives the external clock signal. It is called CLK2 because this signal has twice the frequency of the 386's clock. The chip package connects the CLK2 pin through a tiny bond wire (visible above) to the CLK2 pad on the silicon die. The CLK2 input has two protection diodes, created from MOSFETs, as shown in the schematic below. If the input goes below ground or above +5 volts, the corresponding diode will turn on and clamp the excess voltage, protecting the chip. The schematic below shows how the diodes are constructed from an NMOS transistor and a PMOS transistor. The schematic corresponds to the physical layout of the circuit, so power is at the bottom and the ground is at the top.

The input protection circuit. The left shows the physical circuit built from an NMOS transistor and a PMOS transistor, while the right shows the equivalent diode circuit.

The input protection circuit. The left shows the physical circuit built from an NMOS transistor and a PMOS transistor, while the right shows the equivalent diode circuit.

The diagram below shows the implementation of these protection diodes (i.e. transistors) on the die. Each transistor is much larger than the typical transistors inside the 386, because these transistors must be able to handle high currents. Physically, each transistor consists of 12 smaller (but still relatively large) transistors in parallel, creating the stripes visible in the image. Each transistor block is surrounded by two guard rings, which I will explain in the next section.

This diagram shows the circuitry next to the clock pad.

This diagram shows the circuitry next to the clock pad.

Latch-up and the guard rings

The phenomenon of "latch-up" is the hobgoblin of CMOS circuitry, able to destroy a chip. Regions of the silicon die are doped with impurities to form N-type and P-type silicon. The problem is that the N- and P-doped regions in a CMOS chip can act as parasitic NPN and PNP transistors. In some circumstances, these transistors can turn on, shorting power and ground. Inconveniently, the transistors latch into this state until the power is removed or the chip burns up. The diagram below shows how the substrate, well, and source/drain regions can combine to act as unwanted transistors.8

This diagram illustrates how the parasitic NPN and PNP transistors are formed in a CMOS chip. Note that the 386's construction is opposite from this diagram, with an N substrate and P well. Image by Deepon, CC BY-SA 3.0.

This diagram illustrates how the parasitic NPN and PNP transistors are formed in a CMOS chip. Note that the 386's construction is opposite from this diagram, with an N substrate and P well. Image by Deepon, CC BY-SA 3.0.

Normally, P-doped substrate or wells are connected to ground and the N-doped substrate or wells are connected to +5 volts. As a result, the regions act as reverse-biased diodes and no current flows through the substrate. However, a voltage fluctuation or large current can disturb the reverse biasing and the resulting current flow will turn on these parasitic transistors. Unfortunately, these parasitic transistors drive each other in a feedback loop, so once they get started, they will conduct more and more strongly and won't stop until the chip is powered down. The risk of latch-up is highest with circuits connected to the unpredictable voltages of the outside world, or high-current circuits that can cause power fluctuations. The clock circuitry has both of these risks.

One way of protecting against latch-up is to put a guard ring around a potentially risky circuit. This guard ring will conduct away the undesired substrate current before it can cause latch-up. In the case of the 386, two concentric guard rings are used for additional protection.9 In the earlier die photo, these guard rings can be seen surrounding the transistors. Guard rings will also play a part in the circuitry discussed below.

Polysilicon resistor

After the protection diodes, the clock signal passes through a polysilicon resistor, followed by another protection diode. Polysilicon is a special form of silicon that is used for wiring and also forms the transistor gates. The polysilicon layer sits on top of the base silicon; polysilicon has a moderate amount of resistance, considerably more than metal, so it can be used as a resistor.

The image below shows the polysilicon resistor along with a protection diode. This circuit provides additional protection against transients in the clock signal.10 This circuit is surrounded by two concentric guard rings for more latch-up protection.

The polysilicon resistor and associated diode.

The polysilicon resistor and associated diode.

The divide-by-two logic

The input clock to the 386 runs at twice the frequency of the internal clock. The circuit below divides the input clock by 2, producing complemented outputs. This circuit consists of two set-reset latch stages, one driven by the input clock inverted and the second driven by the input clock, so the circuit will update once per input clock cycle. Since there are three inversions in the loop, the output will be inverted for each update, so it will cycle at half the rate of the input clock. The reset input is asymmetrical: when it is low, it will force the output low and the complemented output high. Presumably, this ensures that the processor starts with the correct clock phase when exiting the reset state.

The divide-by-two circuit.

The divide-by-two circuit.

I have numbered the gates above to match their physical locations below. In this image, I have etched the chip down to the silicon so you can see the active silicon regions. Each logic gate consists of PMOS transistors in the upper half and NMOS transistors in the lower half. The thin stripes are the transistor gates; the two-input NAND gates have two PMOS transistors and two NMOS transistors, while the three-input NAND gates have three of each transistor. The AND-NOR gates need to drive other circuits, so they use paralleled transistors and are much larger. Each AND-NOR gate contains 12 PMOS transistors, four for each input, but uses only 9 NMOS transistors. Finally, the inverter (7) inverts the input clock signal for this circuit. The transistors in each gate are sized to maximize performance and minimize power consumption. The two outputs from the divider then go through large inverters (not shown) that feed the driver circuits.11

The silicon for the divide-by-two circuit as it appears on the die.

The silicon for the divide-by-two circuit as it appears on the die.
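Behaviorally, the divider acts like a toggle flip-flop with an asynchronous reset. Here is a small Python model over sampled signals (the function name is mine); it ignores the internal set-reset latch structure and only captures the externally visible behavior.

```python
def divide_by_two(clk2_samples, reset_n_samples):
    """Toggle the output on each rising edge of CLK2; hold it low while reset is asserted."""
    out, prev_clk = 0, 0
    phases = []
    for clk, reset_n in zip(clk2_samples, reset_n_samples):
        if not reset_n:                  # active-low reset forces a known phase
            out = 0
        elif clk and not prev_clk:       # rising edge of the input clock
            out ^= 1
        prev_clk = clk
        phases.append((out, out ^ 1))    # the clock phase and its complement
    return phases

clk2 = [0, 1] * 8                        # external clock at twice the internal frequency
print(divide_by_two(clk2, [1] * len(clk2)))
```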

The drivers

Because the clock signals must be transmitted to all parts of the die, large transistors are required to generate the high-current pulses. These large transistors, in turn, are driven by medium-sized transistors. Additional driver circuitry ensures that the clock signals do not overlap. There are four driver circuits in total. The two larger, lower driver circuits generate the positive clock pulses. These drivers control the two smaller, upper driver circuits that generate the inverted clock pulses.

First, I'll discuss the larger, positive driver circuit. The core of the driver consists of the large PMOS transistor (1) to pull the output high, and the large NMOS transistor (1) to pull the output low. Each transistor is driven by two inverters (2/3 and 6/7 respectively). The circuit also produces two signals to shape the outputs from the other drivers. When the clock output is high, the "inhibit" signal goes to the other lower driver and inhibits that driver from pulling its output high.12 This prevents overlap in the output between the two drivers. When the clock output is low, an "enable" output goes to the inverted driver (discussed below) to enable its output. The transistor sizes and propagation delays in this circuit are carefully designed to shape the internal clock pulses as needed.

Schematic of the lower driver.

Schematic of the lower driver.

The diagram below shows how this driver is implemented on the die. The left image shows the two metal layers. The right image shows the transistors on the underlying silicon. The upper section holds PMOS transistors, while the lower section holds NMOS transistors. Because PMOS transistors have poorer performance than NMOS transistors, they need to be larger, so the PMOS section is larger. The transistors are numbered, corresponding to the schematic above. Each transistor is physically constructed from multiple transistors in parallel. The two guard rings are visible in the silicon, surrounding and separating the PMOS and NMOS regions.

One of the lower drivers. The left image shows metal while the right image shows silicon.

One of the lower drivers. The left image shows metal while the right image shows silicon.

The 386 has two layers of metal wiring. In this circuit, the top metal layer (M2) provides +5 for the PMOS transistors, ground for the NMOS transistors, and receives the output, all through large rectangular regions. The lower metal layer (M1) provides the physical source and drain connections to the transistors as well as the wiring between the transistors. The pattern of the lower metal layer is visible in the left photo. The dark circles are connections between the lower metal layer and the transistors or the upper metal layer. The connections to the two guard rings are visible around the edges.

Next, I'll discuss the two upper drivers that provide the inverted clock signals. These drivers are smaller, presumably because less circuitry needs the inverted clocks. Each upper driver is controlled by the enable and drive signals from the corresponding lower driver. As before, two large transistors pull the output high or low, and are driven by inverters. Curiously, the enable input is wired to the output of inverter 4. Presumably, this provides a bit of shaping to the signal.

Schematic of the upper driver.

Schematic of the upper driver.

The layout (below) is roughly similar to the previous driver, but smaller. The driver transistors (1) are arranged vertically rather than horizontally, so the metal 2 rectangle to get the output is on the left side rather than in the middle. The transistor wiring is visible in the lower (metal 1) layer, running vertically through the circuit. As before, two guard rings surround the PMOS and NMOS regions.

One of the upper drivers. The left image shows metal while the right image shows silicon.

One of the upper drivers. The left image shows metal while the right image shows silicon.

Distribution

Once the four clock signals have been generated, they are distributed to all parts of the chip. The 386 has two metal layers. The top metal layer (M2) is thicker, so it has lower resistance and is used for clock (and power) distribution where possible. The clock signal will use the lower M1 metal layer when necessary to cross other M2 signals, as well as for branch lines off the main clock lines.

The diagram below shows part of the clock distribution network; the four parallel clock lines are visible similarly throughout the chip. The clock signal arrives at the upper right and travels to the datapath circuitry on the left. As you can see, the four clock lines are much wider than the thin signal lines; this width reduces the resistance of the wiring, which reduces the RC (resistive-capacitive) delay of the signals. The outlined squares at each branch are the vias, connections between the two metal layers. At the right, the incoming clock signals are in layer M1 and zig-zag to cross under other signals in M2. The clock distribution scheme in the 386 is much simpler than in modern processors.

Part of the wiring for clock distribution. This image spans about 1/5 of the chip's width.

Part of the wiring for clock distribution. This image spans about 1/5 of the chip's width.

Clocks in modern processors

The 386's internal clock speed was simply the external clock divided by 2. However, modern processors allow the clock speed to be adjusted to optimize performance or to overclock the chip. This is implemented by an on-chip PLL (Phase-Locked Loop) that generates the internal clock from a fixed external clock, multiplying the clock speed by a selectable multiplier. Intel introduced a PLL to the 80486 processor, but the multiplier was fixed until the Pentium.

The Intel 386's clock can go up to 33 megahertz. Although this was fast for the time, modern processors are over two orders of magnitude faster, so keeping the clock synchronized in a modern processor requires complex techniques.13 With fast clocks, even the speed of light becomes a constraint; at 6 GHz, light can travel just 5 centimeters during a clock cycle.
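The 5-centimeter figure is just the speed of light divided by the clock frequency; a one-line calculation confirms it:

```python
SPEED_OF_LIGHT = 3.0e8                    # meters per second, approximate
CLOCK_HZ = 6.0e9                          # a 6 GHz clock
print(SPEED_OF_LIGHT / CLOCK_HZ * 100)    # 5.0 centimeters per clock cycle
```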

The problem is to ensure that the clock arrives at all circuits at the same time, minimizing "clock skew". Modern processors can reduce the clock skew to a few picoseconds. The clock is typically distributed by a "clock tree", where the clock is split into branches with each branch buffered and the same length, so the delays nearly match. One approach is an "H-tree", which distributes the clock through an H-shaped path. Each leg of the H branches into a smaller H recursively, forming a space-filling fractal, as shown below.

Clock distribution in a PowerPC chip. The recursive H pattern is only approximate since other layout factors constrain the clock tree. From ISSCC 2000.

Clock distribution in a PowerPC chip. The recursive H pattern is only approximate since other layout factors constrain the clock tree. From ISSCC 2000.
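For intuition, here is a rough Python sketch that generates the endpoints of an idealized H-tree by recursion; as the caption notes, real clock trees only approximate this because of other layout constraints. The coordinates, size, and depth are arbitrary values I chose for illustration.

```python
def h_tree(x, y, size, depth, points):
    """Recursively place clock endpoints: each tip of an H becomes the center of a smaller H,
    so every endpoint sits at (nearly) the same wire distance from the root."""
    if depth == 0:
        points.append((x, y))
        return
    half = size / 2
    for dx in (-half, half):
        for dy in (-half, half):
            h_tree(x + dx, y + dy, half, depth - 1, points)

points = []
h_tree(0.0, 0.0, 8.0, 3, points)   # 4**3 = 64 evenly spread endpoints
print(len(points))
```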

Delay circuitry can actively compensate for differences in path time. A Delay-Locked Loop (DLL) circuit adds variable delays to counteract variations along different clock paths. The Itanium used a clock distribution hierarchy with global, regional, and local distribution of the clock. The main clock was distributed to eight regions that each deskewed the clock (in 8.5 ps steps) and drove a regional clock grid, keeping the clock skew under 28 ps. The Pentium 4's complex distribution tree and skew compensation circuitry got clock skew below ±8 ps.

Conclusions

The 386's clock circuitry turned out to be more complicated than I expected, with a lot of subtlety and complications. However, examining the circuit illustrates several features of CMOS design, from latch circuits and high-current drivers to guard rings and multi-phase clocks. Hopefully you have found this interesting.

I plan to write more about the 386, so follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space.

Notes and references

  1. You might wonder why processors use transparent latches and two clock phases instead of using edge-triggered flip-flops and a single clock phase. First, edge-triggered flip-flops take at least twice as many transistors as latches. (An edge-triggered flip flop is often built from two latch stages.) Second, the two-phase approach allows processing to happen twice per clock cycle, rather than once per clock cycle. This may allow a faster implementation with more pipelining. 

  2. The transparent latch was implemented by a single pass transistor in processors such as the MOS 6502. When the transistor was on, the input signal passed through. But when the transistor was off, the former value was held by the transistor's gate capacitance. Eventually the charge on the gate would leak away (like DRAM), so a minimum clock speed was required for reliable operation. 

  3. To see why having multiple stages active at once is bad, here's a simplified example. Consider a circuit that increments the accumulator register. In the first clock phase, the accumulator's value might go through the adder circuit. In the second clock phase, the new value can be stored in the accumulator. If both clock phases are high at the same time, the circuit will form a loop and the accumulator will get incremented multiple times, yielding the wrong result. Moreover, different parts of the adder probably have different delays, so the result is likely to be complete garbage. 

  4. To generate the clocks for the Intel 8008 processor, the suggested circuit used four analog (one-shot) delays to generate the clock phases. The 8008 and 8080 required asymmetrical clocks because the two blocks of logic took different amounts of time to process their inputs. The asymmetrical clock minimized wasted time, improving performance. (More discussion here.) 

  5. You might think that the 386 could use two clock signals: one latch could use phase 1 for NMOS and phase 2 for PMOS, while the next stage is the other way around. Unfortunately, that won't work because the two phases aren't exactly complements. During the "dead time" when phase 1 and phase 2 are both low, the PMOS transistors for both stages will turn on, causing problems. 

  6. Even though the 386 has four clock signals internally, there are really just two clock phases. This is different from four-phase logic, a type of logic that was used in the late 1960s in some MOS processor chips. Four-phase logic was said to provide 10 times the density, 10 times the speed, and 1/10 the power consumption of standard MOS logic techniques. Designer Lee Boysel was a strong proponent of four-phase logic, forming the company Four Phase Systems and building a processor from a small number of MOS chips. Improvements in MOS circuitry in the 1970s (in particular depletion-mode logic) made four-phase logic obsolete. 

  7. The clocking scheme in the 386 is closely tied to the latch circuit used in the processor, shown below. This is a transparent latch: when enable is high and the complemented enable is low, the input is passed through to the output (inverted). When enable is low and the complemented enable is high, the latch remembers the previous value. The important factor is that the enable and complemented enable inputs must switch in lockstep. (In comparison, earlier chips such as the 8086 used a dynamic latch built from one transistor that used a single enable input.)

    The basic latch circuit used in the 386.

    The basic latch circuit used in the 386.

    The circuit on the right shows the implementation of the 386 latch. The two transistors on the left form a transmission gate: when both transistors are on, the input is passed through, but when both transistors are off, the input is blocked. Data storage is implemented through the two inverters connected in a loop. The bottom inverter is "weak", generating a small output current. Because of this, its output will be overpowered by the input, replacing the value stored in the latch. This latch uses 6 transistors in total.

    The 386 uses several variants of the latch circuit, for instance with set or reset inputs, or multiplexers to select multiple data inputs. 

  8. The parasitic transistors responsible for latch-up can also be viewed as an SCR (silicon-controlled rectifier) or thyristor. An SCR is a four-layer (PNPN) silicon device that is switched on by its gate and remains on until power is removed. SCRs were popular in the 1970s for high-current applications, but have been replaced by transistors in many cases. 

  9. The 386 uses two guard rings to prevent latch-up. NMOS transistors are surrounded by an inner N+ guard ring connected to ground and an outer P+ guard ring connected to +5. The guard rings are reversed for PMOS transistors. This page has a diagram showing how the guard rings prevent latch-up. 

  10. The polysilicon resistor appears to be unique to the clock input. My hypothesis is that the CLK2 signal runs at a much higher frequency than other inputs (since it is twice the clock frequency), which raises the risk of ringing or other transients. If these transients go below ground, they could cause latch-up, motivating additional protection on the clock input. 

  11. To keep the main article focused, I'll describe the inverters in this footnote. The circuitry below is between the divider logic and the polysilicon resistor, and consists of six inverters of various sizes. The large inverters 1 and 2 buffer the output from the divider to send to the drivers. Inverter 3 is a small inverter that drives larger inverter 4. I think this clock signal goes to the bus interface logic, perhaps to ensure that communication with the outside world is synchronized with the external clock, rather than the internal clock, which is shaped and perhaps slightly delayed. The output of small inverter 5 appears to be unused. My hypothesis is that this is a "dummy" inverter to match inverter 3 and ensure that both clock phases have identical circuitry. Otherwise, the load from inverter 3 might make that phase switch slightly slower.

    The inverters that buffer the divider's output.

    The inverters that buffer the divider's output.

    The final block of logic is shown below. This logic appears to take the chip reset signal from the reset pin and synchronize it with the clock. The first three latches use the CLK2 input as the clock, while the last two latches use the internal clock. Using the external reset signal directly would risk metastability because the reset signal could change asynchronously with respect to the rest of the system. The latches ensure that the timing of the reset signal matches the rest of the system, minimizing the risk of metastability. The NAND gate generates a reset pulse that resets the divide-by-two counter to ensure that it starts in a predictable state.

    The reset synchronizer. (Click for a larger image.)

    The reset synchronizer. (Click for a larger image.)

     

  12. The gate (2) that receives the inhibit signal is a bit strange, a cross between an inverter and a NAND gate. The gate goes low if the clk' input is high, but goes high only if both inputs are low. In other words, it acts like an inverter but the inhibit signal blocks the transition to the high output. Instead, the output will "float" with its previous low value. This will keep the driver's output low, ensuring that it doesn't overlap with the other driver's high output. 

  13. An interesting 2004 presentation is Clocking for High Performance Processors. A 2005 Intel presentation also discusses clock distribution. 





[#] Wed Dec 06 2023 09:12:30 UTC from rss <>

Subject: Reverse engineering the barrel shifter circuit on the Intel 386 processor die

[Reply] [ReplyQuoted] [Headers] [Print]

The Intel 386 processor (1985) was a large step from the 286 processor, moving x86 to a 32-bit architecture. The 386 also dramatically improved the performance of shift and rotate operations by adding a "barrel shifter", a circuit that can shift by multiple bits in one step. The die photo below shows the 386's barrel shifter, highlighted in the lower left and taking up a substantial part of the die.

The 386 die with the main functional blocks labeled. Click this image (or any other) for a larger version.)

The 386 die with the main functional blocks labeled. Click this image (or any other) for a larger version.)

Shifting is a useful operation for computers, moving a binary value left or right by one or more bits. Shift instructions can be used for multiplying or dividing by powers of two, and as part of more general multiplication or division. Shifting is also useful for extracting bit fields, aligning bitmap graphics, and many other tasks.1

Barrel shifters require a significant amount of circuitry. A common approach is to use a crossbar, a matrix of switches that can connect any input to any output. By closing switches along a desired diagonal, the input bits are shifted. The diagram below illustrates a 4-bit crossbar barrel shifter with inputs X (vertical) and outputs Y (horizontal). At each point in the grid, a switch (triangle) connects a vertical input line to a horizontal output line. Energizing the blue control line, for instance, passes the value through unchanged (X0 to Y0 and so forth). Energizing the green control line rotates the value by one bit position (X0 to Y1 and so forth, with X3 wrapping around to Y0). Similarly, the circuit can shift by 2 or 3 bits. The shift control lines select the amount of shift. These lines run diagonally, which will be important later.

A four-bit crossbar switch with inputs X and outputs Y. Image by Cmglee, CC BY-SA 3.0.

A four-bit crossbar switch with inputs X and outputs Y. Image by Cmglee, CC BY-SA 3.0.
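
To make the crossbar idea concrete, here is a tiny Python sketch (my own illustration, not anything taken from the 386): control line n closes the switches along one diagonal, sending input X[i] to output Y[(i+n) mod 4]. The function name is mine.

```python
# Minimal model of the 4-bit crossbar: control line n closes the switches
# along one diagonal, connecting input X[i] to output Y[(i + n) % 4].
def crossbar_rotate(x_bits, control):
    assert len(x_bits) == 4 and 0 <= control < 4
    y_bits = [0] * 4
    for i, bit in enumerate(x_bits):
        y_bits[(i + control) % 4] = bit    # one closed switch per input line
    return y_bits

x = [1, 1, 0, 0]                  # index 0 is X0 (the least significant bit)
print(crossbar_rotate(x, 0))      # [1, 1, 0, 0]: passed through unchanged
print(crossbar_rotate(x, 1))      # [0, 1, 1, 0]: X0 moved to Y1, X3 wrapped to Y0
```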

The main problem with a crossbar barrel shifter is that it takes a lot of hardware. The 386's barrel shifter has a 64-bit input and a 32-bit output,2 so the approach above would require 2048 switches (64×32). For this reason, the 386 uses a hybrid approach, as shown below. It has a 32×8 crossbar that can shift by 0 to 28 bits, but only in multiples of 4, making the circuit much smaller. The output from the crossbar goes to a second circuit that can shift by 0, 1, 2, or 3 bits. The combined circuitry supports an arbitrary shift, but requires less hardware than a complete crossbar. The inputs to the barrel shifter are two 32-bit values from the processor's register file, stored in latches for use by the shifter.

Block diagram of the barrel shifter circuit.

Block diagram of the barrel shifter circuit.
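
The savings of the hybrid approach come from splitting the shift count into a coarse part (a multiple of 4) and a fine part (0 to 3). The Python sketch below is my own model of that idea, not Intel's circuit, and the function names are mine; it just checks that the two stages in sequence match a single direct shift.

```python
# Sketch of the hybrid approach: a coarse stage that shifts by a multiple of 4,
# followed by a fine stage that shifts by 0 to 3 bits.
def coarse_shift(value, count):
    assert count % 4 == 0 and 0 <= count <= 28
    return value >> count

def fine_shift(value, count):
    assert 0 <= count <= 3
    return value >> count

def hybrid_shift_right(value, count):
    return fine_shift(coarse_shift(value, count & ~3), count & 3)

value = 0xDEADBEEF
for count in range(32):
    assert hybrid_shift_right(value, count) == value >> count
print("coarse shift plus fine shift matches a direct shift for counts 0-31")
```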

The figure below shows how the shifter circuitry appears on the die; this image shows the two metal layers on the die's surface. The inputs from the register file are at the bottom, for bits 31 through 0. Above that, the input latches hold the two 32-bit inputs for the shifter. In the middle is the heart of the shift circuit, the crossbar matrix. This takes the two 32-bit inputs and produces a 32-bit output. The matrix is controlled by sloping polysilicon lines, driven by control circuitry on the right. The matrix output goes to the circuit that applies a shift of 0 to 3 positions. Finally, the outputs exit at the top, where they go to other parts of the CPU. The shifter performs right shifts, but as will be explained below, the same circuit is used for the left shift instructions.

The barrel shifter circuitry as it appears on the die. I have cut out repetitive circuitry from the middle because the complete image is too wide to display clearly.

The barrel shifter circuitry as it appears on the die. I have cut out repetitive circuitry from the middle because the complete image is too wide to display clearly.

The barrel shifter crossbar matrix

In this section, I'll describe the matrix part of the barrel shifter circuit. The shift matrix takes 32-bit values a and b. Value b is shifted to the right, with bits from a filling in at the left, producing a 32-bit output. (As will be explained below, the output is actually 37 bits due to some complications, but ignore that for now.) The shift count is a multiple of 4 from 0 to 28.
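
In software terms, the matrix acts like a "funnel shift" over the 64-bit concatenation of a and b. Here is a rough Python model of that behavior (my own sketch; it ignores the extra carry bits discussed later, and the function name is mine):

```python
# Rough model of the shift matrix: shift b right by a multiple of 4 bits,
# filling the vacated positions at the left with the low bits of a.
MASK32 = 0xFFFFFFFF

def shift_matrix(a, b, count):
    assert count % 4 == 0 and 0 <= count <= 28
    combined = ((a & MASK32) << 32) | (b & MASK32)   # the 64-bit value a:b
    return (combined >> count) & MASK32

# A shift of 4 takes the top 28 bits of b and the bottom 4 bits of a.
print(hex(shift_matrix(0x11111111, 0xABCDEF23, 4)))   # 0x1abcdef2
```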

The diagram below illustrates the structure of the shift matrix. The two 32-bit inputs are provided at the bottom, interleaved, and run vertically. The 32 output lines run horizontally. The 8 control lines run diagonally, activating the switches (black dots) to connect inputs and outputs. (For simplicity, only 3 control lines are shown.) For a shift of 0, control line 0 (red) is selected and the output is b31b30...b1b0. (You can verify this by matching up inputs to outputs through the dots along the red line.)

Diagram of the shift matrix, showing three of the shift control lines.

Diagram of the shift matrix, showing three of the shift control lines.

For a shift right of 4, the cyan control line is activated. It can be seen that the output in this case is a3a2a1a0b31b30...b5b4, shifting b to the right 4 bits and filling in four bits from a as desired. For a shift of 28, the purple control line is activated, producing the output a27...a0b31...b28. Note that the control lines are spaced four bits apart, which is why the matrix only shifts by a multiple of 4. Another important feature is that below the red diagonal, the b inputs are connected to the output, while above the diagonal, the a inputs are connected to the output. (In other words, the black dots are shifted to the right above the diagonal.) This implements the 64-bit support, taking bits from a or b as appropriate.

Looking at the implementation on the die, the vertical wires use the lower metal layer (metal 1) while the horizontal wires use the upper metal layer (metal 2), so the wires don't intersect. NMOS transistors are used as the switches to connect inputs and outputs.4 The transistors are controlled by diagonal wires constructed of polysilicon that form the transistor gates. When a particular polysilicon wire is energized, it turns on the transistors along a diagonal line, connecting those inputs and outputs.

The image below shows the left side of the matrix.5 The polysilicon control lines are the green horizontal lines stepping down to the right. These control the transistors, which appear as columns of blue-gray squares next to the polysilicon lines. The metal layers have been removed; the position of the lower metal 1 layer is visible in the vertical bluish lines.

The left side of the matrix as it appears on the die.

The left side of the matrix as it appears on the die.

The diagram below shows four of these transistors in the shifter matrix. There are four circuitry layers involved. The underlying silicon is pinkish gray; the active regions are the squares with darker borders. Next is the polysilicon (green), which forms the control lines and the transistor gates. The lower metal layer (metal 1) forms the blue vertical lines that connect to the transistors.3 The upper metal layer (metal 2) forms the horizontal bit output lines. Finally, the small black dots are the vias that connect metal 1 and metal 2. (The well taps are silicon regions connected to ground to prevent latch-up.)

Four transistors in the shifter matrix. The polysilicon and metal lines have been drawn in.

Four transistors in the shifter matrix. The polysilicon and metal lines have been drawn in.

To see how this works, suppose the upper polysilicon line is activated, turning on the top two transistors. The two vertical bit-in lines (blue) will be connected through the transistors to the top two bit out lines (purple), by way of the short light blue metal segments and the via (black dot). However, if the lower polysilicon line is activated, the bottom two transistors will be turned on. This will connect the bit-in lines to the fifth and sixth bit-out lines, four lines down from the previous ones. Thus, successive polysilicon lines shift the connections down by four lines at a time, so the shifts change in steps of 4 bit positions.

As mentioned earlier, to support the 64-bit input, the transistors below the diagonal are connected to b input while the transistors above the diagonal are connected to the a input. The photo below shows the physical implementation: the four upper transistors are shifted to the right by one wire width, so they connect to vertical a wires, while the four lower transistors are connected to b wires. (The metal wires were removed for this photo to show the transistors.)

This photo of the underlying silicon shows eight transistors. The top four transistors are shifted one position to the right. The irregular lines are remnants of other layers that I couldn't completely remove from the die.

This photo of the underlying silicon shows eight transistors. The top four transistors are shifted one position to the right. The irregular lines are remnants of other layers that I couldn't completely remove from the die.

In the matrix, the output signals run horizontally. In order for signals to exit the shifter from the top of the matrix, each horizontal output wire is connected to a vertical output wire. Meanwhile, other processor signals (such as the register write data) must also pass vertically through the shifter region. The result is a complicated layout, packing everything together as tightly as possible.

The precharge/keepers

At the left and the right of the barrel shifter, repeated blocks of circuitry are visible. These blocks contain precharge and keeper circuits to hold the value on one of the lines. During the first clock phase, each horizontal bit line is precharged to +5 volts. Next, the matrix is activated and horizontal lines may be pulled low. If the line is not pulled low, the inverter and PMOS transistor will continuously pull the line high. The inverter and transistor can be viewed as a bus keeper, essentially a weak latch to hold the line in the 1 state. The keeper uses relatively weak transistors, so the line can be pulled low when the barrel shifter is activated. The purpose of the keeper is to ensure that the line doesn't drift into a state between 0 and 1. This is a bad situation with CMOS circuitry, since the pull-up and pull-down transistors could both turn on, yielding a short circuit.

The precharge/keeper circuit

The precharge/keeper circuit

The motivation behind this design is that implementing the matrix with "real" CMOS would require twice as many transistors. By implementing the matrix with NMOS transistors only, the size is reduced. In a standard NMOS implementation, pull-up transistors would continuously pull the lines high, but this results in fairly high power consumption. Instead, the precharge circuit pulls the line high at the start. But this results in dynamic logic, dependent on the capacitance of the circuit to hold the charge. To avoid the charge leaking away, the keeper circuit keeps the line high until it is pulled low. Thus, this circuit minimizes the area of the matrix as well as minimizing power consumption.

There are 37 keepers in total for the 37 output lines from the matrix.6 (The extra 5 lines will be explained below.) The photo below shows one block of three keepers; the metal has been removed to show the silicon transistors and some of the polysilicon (green).

One block of keeper circuitry, to the right of the shift matrix. This block has 12 transistors, supporting three bits.

One block of keeper circuitry, to the right of the shift matrix. This block has 12 transistors, supporting three bits.

The register latches

At the bottom of the shift circuit, two latches hold the two 32-bit input values. The 386 has multi-ported registers, so it can access two registers and write a third register at the same time. This allows the shift circuit to load both values at the same time. I believe that a value can also come from the 386's constant ROM, which is useful for providing 0, 1, or all-ones to the shifter.

The schematic below shows the register latches for one bit of the shifter. Starting at the bottom are the two inputs from the register file (one appears to be inverted for no good reason). Each input is stored in a latch, using the standard 386 latch circuit.7 The latched input is gated by the clock and then goes through multiplexers allowing either value to be used as either input to the shifter. (The shifter takes two 32-bit inputs and this multiplexer allows the inputs to be swapped to the other sides of the shifter.) A second latch stage holds the values for the output; this latch is cleared during the first clock phase and holds the desired value during the second clock phase.

Circuit for one bit of the register latch.

Circuit for one bit of the register latch.

The die photo below shows the register latch circuit, contrasting the metal layers (left) with the silicon layer (right). The dark spots in the metal image are vias between the metal layers or connections to the underlying silicon or polysilicon. The metal layer is very dense with vertical wiring in the lower metal 1 layer and horizontal wiring in the upper metal 2 layer. The density of the chip seems to be constrained by the metal wiring more than the density of the transistors.

One of the register latch circuits.

One of the register latch circuits.

The 0-3 shifter

The shift matrix can only shift in steps of 4 bits. To support other shifts, a circuit at the top of the shifter provides a shift of 0 to 3 bits. In conjunction, these circuits permit a shift by an arbitrary amount.8 The schematic below shows the circuit. A bit enters at the bottom. The first shift stage passes the bit through, or sends it one bit position to the right. The second stage passes the bit through, or sends it two bit positions to the right. Thus, depending on the control lines, each bit can be shifted by 0 to 3 positions to the right. At the top, a transistor pulls the circuit low to initialize it; the NOR gate at the bottom does the same. A keeper transistor holds the circuit low until a data bit pulls it high.

One bit of the 0-3 shifter circuit.

One bit of the 0-3 shifter circuit.
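
Here is a quick Python sketch (mine, not the actual circuit) of the two-stage idea: the first stage shifts by 0 or 1 bit, the second by 0 or 2 bits, so together they cover any shift from 0 to 3.

```python
# Sketch of the 0-3 shifter: the first stage shifts by 0 or 1 bit, the second
# by 0 or 2 bits, so the two stages together cover any shift from 0 to 3.
def shift_0_to_3(value, count):
    assert 0 <= count <= 3
    stage1 = value >> (count & 1)    # stage 1: shift by 0 or 1
    stage2 = stage1 >> (count & 2)   # stage 2: shift by 0 or 2
    return stage2

for count in range(4):
    assert shift_0_to_3(0b10110, count) == 0b10110 >> count
print("the two-stage shifter matches a direct 0-3 bit shift")
```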

The diagram below shows the silicon implementation corresponding to two copies of the schematic above. The shifters are implemented in pairs to slightly optimize the layout. In particular, the two NOR gates are mirrored so the power connection can be shared. This is a small optimization, but it illustrates that the 386 designers put a lot of work into making the layout dense.

Two bits of the 0-3 shifter circuit as it appears on the die.

Two bits of the 0-3 shifter circuit as it appears on the die.

Complications

As is usually the case with x86, there are a few complications. One complication is that the shift matrix has 37 outputs, rather than the expected 32. There are two reasons behind this. First, the upper shifter will shift right by up to 3 positions, so it needs 3 extra bits. Thus, the matrix needs to output bits 0 through 34 so three bits can be discarded. Second, shift instructions usually produce a carry bit from the last bit shifted out of the word. To support this, the shift matrix provides an extra bit at both ends for use as the carry. The result is that the matrix produces 37 outputs, which can be viewed as bits -1 through 35.

Another complication is that the x86 instruction set supports shifts on bytes and 16-bit words as well as 32-bit words. If you put two 8-bit bytes into the shifter, there will be 24 unused bits in between, posing a problem for the shifter. The solution is that some of the diagonal control lines in the matrix are split on byte and word boundaries, allowing an 8- or 16-bit value to be shifted independently. For example, you can perform a 4-bit right shift on the right-hand byte, and a 28-bit right shift on the left-hand byte. This brings the two bytes together in the result, yielding the desired 4-bit right shift. As a result, there are 18 diagonal control lines in the shifter (if I counted correctly), rather than the expected 8 control lines. This makes the circuitry to drive the control lines more complicated, as it must generate different signals depending on the size of the operand.

The control circuitry

The control circuitry at the right of the shifter drives the diagonal polysilicon lines in the matrix, selecting the desired shift. It also generates control signals for the 0-3 shifter, selecting a shift-by-1 or shift-by-2 as necessary. This circuitry operates under the control of the microcode, which tells it when to shift. It gets the shift amount from the instruction or the CL register and generates the appropriate control signals.

The distribution of control signals is more complex than you might expect. If possible, the polysilicon diagonals are connected on the right of the matrix to the control circuitry, providing a direct connection. However, many of the diagonals do not extend all the way to the right, either because they start on the left or because they are segmented for 8- or 16-bit values. Some of these signals are transmitted through polysilicon lines that run underneath the matrix. Others are transmitted through horizontal metal lines that run through the register latches. (These latches don't use many horizontal lines, so there is available space to route other signals.) These signals then ascend through the matrix at various points to connect with the polysilicon lines. This shows that the routing of this circuitry is carefully optimized to make it as compact as possible. Moreover, these "extra" lines disrupt the layout; the matrix is almost a regular pattern, but it has small irregularities throughout.

Implementing x86 shifts and rotates with the barrel shifter

The x86 has a variety of shift and rotate instructions.9 It is interesting to consider how they are implemented using the barrel shifter, since it is not always obvious. In this section, I'll discuss the instructions supported by the 386.

One important principle is that even though the circuitry shifts to the right, by changing the inputs this can achieve a shift to the left. To make this concrete, consider two input words a and b, with the shifter extracting the portion in red below. (I'll use 8-bit examples instead of 32-bit here and below to keep the size manageable.) The circuit shifts b to the right five bits, inserting bits from a at the left. Alternatively, the result can be viewed as shifting a to the left three bits, inserting bits from b at the right. Thus, the same result can be viewed as a right shift of b or a left shift of a. This holds in general, with a 32-bit right shift by N bits equivalent to a left shift by 32-N bits, depending on which word10 you focus on.

a7a6a5a4a3a2a1a0b7b6b5b4b3b2b1b0
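
The little Python check below (my own illustration, using 8-bit values to match the example above) confirms the equivalence numerically: shifting b right by 5 with fill from a selects the same window of bits as shifting a left by 3 with fill from b.

```python
# 8-bit illustration of the right-shift/left-shift equivalence.
MASK8 = 0xFF

def funnel8(a, b, n):
    """Shift b right by n bits, filling in at the left from a (8-bit version)."""
    return ((((a & MASK8) << 8) | (b & MASK8)) >> n) & MASK8

a, b, n = 0b10110001, 0b01011100, 5
right_view = funnel8(a, b, n)                     # b shifted right 5, filled from a
left_view = ((a << (8 - n)) | (b >> n)) & MASK8   # a shifted left 3, filled from b
assert right_view == left_view
print(bin(right_view))   # 0b10001010, the same 8-bit window either way
```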

Double shifts

The double-shift instructions (Shift Left Double (SHLD) and Shift Right Double (SHRD)) were new in the 386, shifting two 32-bit values to produce a 32-bit result. The last bit shifted out goes into the carry flag (CF). These instructions map directly onto the behavior of the barrel shifter, so I'll start with them.

Actions of the double shift instructions.

Actions of the double shift instructions.

The examples below show how the shifter implements the SHLD and SHRD instructions; the shifter output is highlighted in red. (These examples use an 8-bit source (s) and destination (d) to keep them manageable.) In either case, 3 bits of the source are shifted into the destination; shifting left or right is just a matter of whether the destination is on the left or right.

SHLD 3: ddddddddssssssss

SHRD 3: ssssssssdddddddd
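
Here is how the double shifts map onto a funnel-shift primitive, as a rough Python sketch (mine; the carry flag is omitted and the function names are my own):

```python
# Sketch of SHLD/SHRD on top of a funnel-shift primitive (the carry flag is omitted).
MASK32 = 0xFFFFFFFF

def funnel(a, b, n):
    """Shift b right by n, filling in at the left from a; return the low 32 bits."""
    return ((((a & MASK32) << 32) | (b & MASK32)) >> n) & MASK32

def shld(dest, src, n):
    # Destination on the left of the shifter: shift dest left by n,
    # filling at the right with the top n bits of src.
    return funnel(dest, src, 32 - n)

def shrd(dest, src, n):
    # Destination on the right of the shifter: shift dest right by n,
    # filling at the left with the low n bits of src.
    return funnel(src, dest, n)

print(hex(shld(0x00000001, 0x80000000, 3)))   # 0xc: (1 << 3) | top 3 bits of src
print(hex(shrd(0x80000000, 0x00000007, 4)))   # 0x78000000: (dest >> 4) | (src << 28)
```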

Shifts

The basic shift instructions are probably the simplest. Shift Arithmetic Left (SAL) and Shift Logical Left (SHL) are synonyms, shifting the destination to the left and filling with zeroes. This can be accomplished by performing a shift with the word on the left and zeroes on the right. Shift Logical Right (SHR) is the opposite, shifting to the right and filling with zeroes. This can be accomplished by putting the word on the right and zeroes on the left. Shift Arithmetic Right (SAR) is a bit different. It fills with the sign bit, the top bit. The purpose of this is to shift a signed number while preserving its sign. It can be implemented by putting all zeroes or all ones on the left, depending on the sign bit. Thus, the shift instructions map nicely onto the barrel shifter.

Actions of the shift instructions.

Actions of the shift instructions.

The 8-bit examples below show how the shifter accomplishes the SHL, SHR, and SAR instructions. The destination value d is loaded into one half of the shifter. For SAR, the value's sign bit s is loaded into the other half of the shifter, while the other instructions load 0 into the other half of the shifter. The red box shows the output from the shifter, selected from the input.

SHL 3: dddddddd00000000

SHR 3: 00000000dddddddd

SAR 3: ssssssssdddddddd
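
The plain shifts fall out of the same funnel primitive by choosing what goes into the other half of the shifter: zeroes for SHL and SHR, copies of the sign bit for SAR. A rough 32-bit Python sketch (mine, flags omitted):

```python
# Sketch of SHL/SHR/SAR using the funnel primitive (flags omitted).
MASK32 = 0xFFFFFFFF

def funnel(a, b, n):
    return ((((a & MASK32) << 32) | (b & MASK32)) >> n) & MASK32

def shl(d, n):
    return funnel(d, 0, 32 - n)    # word on the left, zeroes on the right

def shr(d, n):
    return funnel(0, d, n)         # zeroes on the left, word on the right

def sar(d, n):
    fill = MASK32 if (d >> 31) & 1 else 0
    return funnel(fill, d, n)      # copies of the sign bit on the left

print(hex(shl(0x0000000F, 4)))    # 0xf0
print(hex(shr(0xF0000000, 4)))    # 0xf000000
print(hex(sar(0xF0000000, 4)))    # 0xff000000
```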

Rotates

Unlike the shift instructions, the rotate instructions preserve all the bits. As bits shift off one end, they fill in the other end, so the bit sequence rotates. A rotate left or right is implemented by putting the same word on the left and right.

Actions of the rotate instructions.

Actions of the rotate instructions.

The shifter implements rotates as shown below, using the destination value as both shifter inputs. A left shift by N bits is implemented by shifting right by 32-N bits.

ROL 3: d7d6d5d4d3d2d1d0d7d6d5d4d3d2d1d0

ROR 3: d7d6d5d4d3d2d1d0d7d6d5d4d3d2d1d0
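
Rotates use the same trick with the destination on both sides of the shifter; a quick Python sketch (mine):

```python
# Sketch of ROL/ROR using the funnel primitive with the same word on both sides.
MASK32 = 0xFFFFFFFF

def funnel(a, b, n):
    return ((((a & MASK32) << 32) | (b & MASK32)) >> n) & MASK32

def ror(d, n):
    return funnel(d, d, n)          # rotate right by n

def rol(d, n):
    return funnel(d, d, 32 - n)     # rotate left by n = rotate right by 32-n

print(hex(ror(0x00000001, 4)))    # 0x10000000
print(hex(rol(0x80000000, 4)))    # 0x8
```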

Rotates through carry

The rotate through carry instructions perform 33-bit rotates, rotating the value through the carry bit. You might wonder how the barrel shifter can perform a 33-bit rotate, and the answer is that it can't. Instead, the instruction takes multiple steps. If you look at the instruction timings, the other shifts and rotates take three clock cycles. Rotating through the carry, however, takes nine clock cycles, performing multiple steps under the control of the microcode.

Actions of the rotate through carry instructions.

Actions of the rotate through carry instructions.

Without looking at the microcode, I can only speculate how this takes place. One sequence would be to get the top bits by putting zeroes in the right 32 bits and shifting. Next, get the bottom bits by putting the carry bit in the left 32 bits and shifting one bit more. (That is, set the left 32-bit input to either the constant 0 or 1, depending on the carry.) Finally, the result can be generated by ORing the two shift values together. The example below shows how an RCL 3 could be implemented. In the second step, the carry value C is loaded into the left side of the shifter, so it can get into the result. Note that bit d5 ends up in the carry bit, rather than the result. The RCR instruction would be similar, but with the shift parameters adjusted accordingly.

First shift: d7d6d5d4d3d2d1d000000000

Second shift: 0000000Cd7d6d5d4d3d2d1d0

Result from OR: d4d3d2d1d0Cd7d6
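
Here is that speculated sequence as a Python sketch (8-bit, matching the example above). To be clear, this is my guess at the microcode's approach, not something read out of the actual microcode, and the function names are mine.

```python
# Speculative 8-bit model of rotate-through-carry-left (RCL) by n:
# two funnel shifts ORed together, as described above.
MASK8 = 0xFF

def funnel8(a, b, n):
    return ((((a & MASK8) << 8) | (b & MASK8)) >> n) & MASK8

def rcl8(d, carry_in, n):
    high = funnel8(d, 0, 8 - n)              # top bits: d with zeroes on the right
    low = funnel8(carry_in, d, 8 - n + 1)    # bottom bits: the carry, then d, shifted one more
    carry_out = (d >> (8 - n)) & 1           # the last bit shifted out (d5 for n=3)
    return (high | low) & MASK8, carry_out

result, carry = rcl8(0b10110101, 1, 3)
print(bin(result), carry)   # 0b10101110 1, i.e. d4..d0, C, d7, d6 with d5 as the new carry
```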

Conclusions

The shifter circuit illustrates how the rapidly increasing transistor counts in the 1980s allowed new features. Programming languages make it easy to shift numbers with an expression such as a>>5. But it takes a lot of hardware in the processor to perform these shifts efficiently. The additional hardware of the 386's barrel shifter dramatically improved the performance of shifts and rotates compared to earlier x86 processors. I estimate that the barrel shifter requires about 2000 transistors, about half the number of the entire 6502 processor (1975). But by 1985, putting 2000 transistors into a feature was practical. (In total, the 386 contains 285,000 transistors, a trivial number now, but a large number for the time.)

I plan to write more about the 386, so follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space.

Notes and references

  1. The earliest reference for a barrel shifter is often given as "A barrel switch design", Computer Design, 1972, but the idea of a barrel shifter goes back to 1964 at least. (The "barrel switch" name presumably comes from a physical barrel switch, a cylindrical multi-position switch such as a car ignition.) The CDC 6600 supercomputer (1964) had a 6-stage shifter able to shift up to 63 positions in one cycle (details); it was called a "parallel shifting network" rather than a "barrel shifter". A Burroughs patent filed in 1965 describes a barrel switch "capable of performing logical switching operations in a single time involving any amount of binary information," so the technology is older.

    Early microprocessors shifted by one bit position at a time. Although the Intel 8086 provided instructions to shift by multiple bits at a time, this was implemented internally by a microcode loop, so the more bits you shifted, the longer the instruction took, four clock cycles per bit. Shifting on the 286 was faster, taking one additional cycle for each bit position shifted. The first ARM processor (ARM1, 1985) included a 32-bit barrel shifter. It was considerably simpler than the 386's design, following ARM's RISC philosophy. 

  2. The 386 Hardware Reference Manual states that the 386 contains a 64-bit barrel shifter. I find this description a bit inaccurate, since the output is only 32 bits, so the barrel shifter is much simpler than a full 64-bit barrel shifter. 

  3. The 386 has two layers of metal. The vertical lines are in the lower layer of metal (metal 1) while the horizontal lines are in the upper layer of metal (metal 2). Transistors can only connect to lower metal, so the connection between the horizontal line and the transistor uses a short piece of lower metal to bridge the layers. 

  4. Each row of the matrix can be considered a multiplexer with 8 inputs, implemented by 8 pass transistors. One of the eight transistors is activated, passing that input to the output. 

  5. The image below shows the full shift matrix. Click the image for a much larger view.

    The matrix with the metal layer removed.

    The matrix with the metal layer removed.

     

  6. The keepers are arranged with six blocks of three on the left and six blocks of three on the right, plus an additional one at the bottom right. 

  7. The standard latch in the 386 consists of two cross-coupled inverters forming a static circuit to hold a bit. The input goes through a transmission gate (back-to-back NMOS and PMOS transistors) to the inverters. One inverter is weak, so it can be overpowered by the input. The 8086, in contrast, uses dynamic latches that depend on the gate capacitance to hold a bit. 

  8. Some shifters take the idea of combining shift circuits to the extreme. If you combine a shift-by-one circuit, a shift-by-two circuit, a shift-by-four circuit, and so forth, you end up with a logarithmic shifter: selecting the appropriate stages provides an arbitrary shift. (This design was used in the CDC 6600.) This design has the advantage of reducing the amount of circuitry since it uses log2(N) layers rather than N layers. However, the logarithmic approach has performance disadvantages since the signals need to go through more circuitry. This paper describes various design alternatives for barrel shifters. 

  9. The basic rotate left and right instructions date back to the Datapoint 2200, the ancestor of the 8086 and x86. The rotate left through carry and rotate right through carry instructions in x86 were added in the Intel 8008 processor and carried over to the 8080. The MOS 6502 had a different set of rotates and shifts: arithmetic shift left, rotate left, logical shift right, and rotate right; the rotate instructions rotated through the carry. The Z-80 had a more extensive set: rotates left and right, either through the carry or not, shift left, shift right logical, shift right arithmetic, and 4-bit digit rotates left and right through two bytes. The 8086's set of rotates and shifts was similar to the Z-80, except it didn't have the digit rotates. The 8086 also supported shifting and rotating by multiple positions. This illustrates that there isn't a "natural" set of shift and rotate instructions. Instead, different processors supported different instructions, with complexity generally increasing over time. 

  10. The x86 uses "word" to refer to a 16-bit value and "double word" or "dword" to refer to a 32-bit value. I'm going to ignore the word/dword distinction. 





[#] Sat Dec 16 2023 09:02:02 UTC from rss <>

Subject: Two interesting XOR circuits inside the Intel 386 processor

[Reply] [ReplyQuoted] [Headers] [Print]

Intel's 386 processor (1985) was an important advance in the x86 architecture, not only moving to a 32-bit processor but also switching to a CMOS implementation. I've been reverse-engineering parts of the 386 chip and came across two interesting and completely different circuits that the 386 uses to implement an XOR gate: one uses standard-cell logic while the other uses pass-transistor logic. In this article, I take a look at those circuits.

The die of the 386. Click this image (or any other) for a larger version.

The die of the 386. Click this image (or any other) for a larger version.

The die photo above shows the two metal layers of the 386 die. The polysilicon and silicon layers underneath are mostly hidden by the metal. The black dots around the edges are the bond wires connecting the die to the external pins. The 386 is a complicated chip with 285,000 transistor sites. I've labeled the main functional blocks. The datapath in the lower left does the actual computations, controlled by the microcode ROM in the lower right.

Despite the complexity of the 386, if you zoom in enough, you can see individual XOR gates. The red rectangle at the top (below) is a shift register for the chip's self-test. Zooming in again shows the silicon for an XOR gate implemented with pass transistors. The purple outlines reveal active silicon regions, while the stripes are transistor gates. The yellow rectangle zooms in on part of the standard-cell logic that controls the prefetch queue. The closeup shows the silicon for an XOR gate implemented with two logic gates. Counting the stripes shows that the first XOR gate is implemented with 8 transistors while the second uses 10 transistors. I'll explain below how these transistors are connected to form the XOR gates.

The die of the 386, zooming in on two XOR gates.

The die of the 386, zooming in on two XOR gates.

A brief introduction to CMOS

CMOS circuits are used in almost all modern processors. These circuits are built from two types of transistors: NMOS and PMOS. These transistors can be viewed as switches between the source and drain controlled by the gate. A high voltage on the gate of an NMOS transistor turns the transistor on, while a low voltage on the gate of a PMOS transistor turns the transistor on. An NMOS transistor is good at pulling the output low, while a PMOS transistor is good at pulling the output high. Thus, NMOS and PMOS transistors are opposites in many ways; they are complementary, which is the "C" in CMOS.

Structure of a MOS transistor. Although the transistor's name represents the Metal-Oxide-Semiconductor layers, modern MOS transistors typically use polysilicon instead of metal for the gate.

Structure of a MOS transistor. Although the transistor's name represents the Metal-Oxide-Semiconductor layers, modern MOS transistors typically use polysilicon instead of metal for the gate.

In a CMOS circuit, the NMOS and PMOS transistors work together, with the NMOS transistors pulling the output low as needed while the PMOS transistors pull the output high. By arranging the transistors in different ways, different logic gates can be created. The diagram below shows a NAND gate constructed from two PMOS transistors (top) and two NMOS transistors (bottom). If both inputs are high, the NMOS transistors turn on and pull the output low. But if either input is low, a PMOS transistor will pull the output high. Thus, the circuit below implements a NAND gate.

A NAND gate implemented in CMOS.

A NAND gate implemented in CMOS.
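
To spell out the pull-up/pull-down behavior, here is a small Python model (my own sketch, not anything from Intel's design files): the output goes low only when both series NMOS transistors conduct, and high when either parallel PMOS transistor conducts.

```python
# Behavioral model of the CMOS NAND gate: two series NMOS transistors pull the
# output low; two parallel PMOS transistors pull it high.
def cmos_nand(a, b):
    pull_down = (a == 1) and (b == 1)   # both NMOS transistors on: output pulled to ground
    pull_up = (a == 0) or (b == 0)      # either PMOS transistor on: output pulled to +V
    assert pull_down != pull_up         # in a static CMOS gate, exactly one network conducts
    return 0 if pull_down else 1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, cmos_nand(a, b))    # prints the NAND truth table
```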

Notice that NMOS and PMOS transistors have an inherent inversion: a high input produces a low (for NMOS) or a low input produces a high (for PMOS). Thus, it is straightforward to produce logic circuits such as an inverter, NAND gate, NOR gate, or an AND-OR-INVERT gate. However, producing an XOR (exclusive-or) gate doesn't work with this approach: an XOR gate produces a 1 if either input is high, but not both.1 The XNOR (exclusive-NOR) gate, the complement of XOR, also has this problem. As a result, chips often have creative implementations of XOR gates.

The standard-cell two-gate XOR circuit

Parts of the 386 were implemented with standard-cell logic. The idea of standard-cell logic is to build circuitry out of standardized building blocks that can be wired by a computer program. In earlier processors such as the 8086, each transistor was carefully positioned by hand to create a chip layout that was as dense as possible. This was a tedious, error-prone process since the transistors were fit together like puzzle pieces. Standard-cell logic is more like building with LEGO. Each gate is implemented as a standardized block and the blocks are arranged in rows, as shown below. The space between the rows holds the wiring that connects the blocks.

Some rows of standard-cell logic in the 386 processor. This is part of the segment descriptor control circuitry.

Some rows of standard-cell logic in the 386 processor. This is part of the segment descriptor control circuitry.

The advantage of standard-cell logic is that it is much faster to create a design since the process can be automated. The engineer described the circuit in terms of the logic gates and their connections. A computer algorithm placed the blocks so related blocks are near each other. An algorithm then routed the circuit, creating the wiring between the blocks. These "place and route" algorithms are challenging since it is an extremely difficult optimization problem, determining the best locations for the blocks and how to pack the wiring as densely as possible. At the time, the algorithm took a day on a powerful IBM mainframe to compute the layout. Nonetheless, the automated process was much faster than manual layout, cutting weeks off the development time for the 386. The downside is that the automated layout is less dense than manually optimized layout, with a lot more wasted space. (As you can see in the photo above, the density is low in the wiring channels.) For this reason, the 386 used manual layout for circuits where a dense layout was important, such as the datapath.

In the 386, the standard-cell XOR gate is built by combining a NOR gate with an AND-NOR gate as shown below.2 (Although AND-NOR looks complicated, it is implemented as a single gate in CMOS.) You can verify that if both inputs are 0, the NOR gate forces the output low, while if both inputs are 1, the AND gate forces the output low, providing the XOR functionality.

Schematic of an XOR circuit.

Schematic of an XOR circuit.
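
A quick truth-table check in Python (my own sketch, not part of the 386 design) confirms that a NOR gate feeding the extra input of an AND-NOR gate computes XOR.

```python
# Verify that XOR = NOT((a AND b) OR NOR(a, b)): a NOR gate feeding the
# NOR input of an AND-NOR gate, the structure of the standard-cell XOR.
def nor(x, y):
    return 1 - (x | y)

def and_nor(x, y, z):
    # A single CMOS gate computing NOT((x AND y) OR z).
    return 1 - ((x & y) | z)

for a in (0, 1):
    for b in (0, 1):
        assert and_nor(a, b, nor(a, b)) == a ^ b
print("AND-NOR of (a, b, NOR(a, b)) matches XOR for all four input combinations")
```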

The photo below shows the layout of this XOR gate as a standard cell. I have removed the metal and polysilicon layers to show the underlying silicon. The outlined regions are the active silicon, with PMOS above and NMOS below. The stripes are the transistor gates, normally covered by polysilicon wires. Notice that neighboring transistors are connected by shared silicon; there is no demarcation between the source of one transistor and the drain of the next.

The silicon implementing the XOR standard cell. This image is rotated 180° from the layout on the die to put PMOS at the top.

The silicon implementing the XOR standard cell. This image is rotated 180° from the layout on the die to put PMOS at the top.

The schematic below corresponds to the silicon above. Transistors a, b, c, and d implement the first NOR gate. Transistors g, h, i, and j implement the AND part of the AND-NOR gate. Transistors e and f implement the NOR input of the AND-NOR gate, fed from the first NOR gate. The standard cell library is designed so all the cells are the same height with a power rail at the top and a ground rail at the bottom. This allows the cells to "snap together" in rows. The wiring inside the cell is implemented in polysilicon and the lower metal layer (M1), while the wiring between cells uses the upper metal layer (M2) for vertical connections and lower metal (M1) for horizontal connections. This strategy allows vertical wires to pass over the cells without interfering with the cell's wiring.

Transistor layout in the XOR standard cell.

Transistor layout in the XOR standard cell.

One important factor in a chip such as the 386 is optimizing the sizes of transistors. If a transistor is too small, it will take too much time to switch its output line, reducing performance. But if a transistor is too large, it will waste power as well as slowing down the circuit that is driving it. Thus, the standard-cell library for the 386 includes several XOR gates of various sizes. The diagram below shows a considerably larger XOR standard cell. The cell is the same height as the previous XOR (as required by the standard cell layout), but it is much wider and the transistors inside the cell are taller. Moreover, the PMOS side uses pairs of transistors to double the current capacity. (NMOS has better performance than PMOS so doesn't require doubling of the transistors.) Thus, there are 10 PMOS transistors and 5 NMOS transistors in this XOR cell.

A large XOR standard cell. This cell is also rotated from the die layout.

A large XOR standard cell. This cell is also rotated from the die layout.

The pass transistor circuit

Some parts of the 386 implement XOR gates completely differently, using pass transistor logic. The idea of pass transistor logic is to use transistors as switches that pass inputs through to the output, rather than using transistors as switches to pull the output high or low. The pass transistor XOR circuit uses 8 transistors, compared with 10 for the previous circuit.3

The die photo below shows a pass-transistor XOR circuit, highlighted in red. Note that the surrounding circuitry is irregular and much more tightly packed than the standard-cell circuitry. This circuit was laid out manually, producing an optimized layout compared to standard cells. It has four PMOS transistors at the top and four NMOS transistors at the bottom.

The pass-transistor XOR circuit on the die. The green regions are oxide that was not completely removed causing thin-film interference.

The pass-transistor XOR circuit on the die. The green regions are oxide that was not completely removed causing thin-film interference.

The schematic below shows the heart of the circuit, computing the exclusive-NOR (XNOR) of X and Y with four pass transistors. To understand the circuit, consider the four input cases for X and Y. If X and Y are both 0, PMOS transistor a will turn on (because Y is low), passing 1 to the XNOR output. (X̄ is the complemented value of the X input.) If X and Y are both 1, PMOS transistor b will turn on (because X̄ is low), passing 1. If X and Y are 1 and 0 respectively, NMOS transistor c will turn on (because X is high), passing 0. If X and Y are 0 and 1 respectively, transistor d will turn on (because Y is high), passing 0. Thus, the four transistors implement the XNOR function, with a 1 output if both inputs are the same.

Partial implementation of XNOR with four pass transistors.

Partial implementation of XNOR with four pass transistors.

To make an XOR gate out of this requires two additional inverters. The first inverter produces X̄ from X. The second inverter generates the XOR output by inverting the XNOR output. The output inverter also has the important function of buffering the output since the pass transistor output is weaker than the inputs. Since each inverter takes two transistors, the complete XOR circuit uses 8 transistors. The schematic below shows the full circuit. The i1 transistors implement the input inverter and the i2 transistors implement the output inverter. The layout of this schematic matches the earlier die photo.5

Implementation of XOR with eight pass transistors.

Implementation of XOR with eight pass transistors.
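
Here is a small Python model of the pass-transistor network (my own behavioral sketch of the four cases described above, not a transistor-accurate netlist): in each input case one transistor drives the shared XNOR node, and the output inverter then produces XOR.

```python
# Behavioral model of the pass-transistor XNOR core and the output inverter.
# In each input case, one pass transistor drives the shared node: PMOS
# transistors pass a 1, NMOS transistors pass a 0.
def pass_xnor(x, y):
    x_bar = 1 - x                     # from the input inverter
    if y == 0 and x == 0:             # PMOS a is on (Y low), passes 1
        return 1
    if x_bar == 0 and y == 1:         # PMOS b is on (X-bar low), passes 1
        return 1
    if x == 1 and y == 0:             # NMOS c is on (X high), passes 0
        return 0
    return 0                          # NMOS d is on (Y high), passes 0

def pass_xor(x, y):
    return 1 - pass_xnor(x, y)        # output inverter, which also buffers the weak node

for x in (0, 1):
    for y in (0, 1):
        assert pass_xnor(x, y) == (1 if x == y else 0)
        assert pass_xor(x, y) == x ^ y
print("the four-transistor network plus inverters matches XNOR/XOR for all inputs")
```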

Conclusions

An XOR gate may seem like a trivial circuit, but there is more going on than you might expect. I think it is interesting that there isn't a single solution for implementing XOR; even inside a single chip, multiple approaches can be used. (If you're interested in XOR circuits, I also looked at the XOR circuit in the Z80.) It's also reassuring to see that even for a complex chip such as the 386, the circuitry can be broken down into logic gates and then understood at the transistor level.

I plan to write more about the 386, so follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space.

Notes and references

  1. You can't create an AND or OR gate directly from CMOS either, but this isn't usually a problem. One approach is to create a NAND (or NOR) gate and then follow it with an inverter, but this requires an "extra" inverter. However, the inverter can often be eliminated by flipping the action of the next gate (using De Morgan's laws). For example, if you have AND gates feeding into an OR gate, you can change the circuit to use NAND gates feeding into a NAND gate, eliminating the inverters. Unfortunately, flipping the logic levels doesn't help with XOR gates, since XNOR is just as hard to produce. 

  2. The 386 also uses XNOR standard-cell gates. These are implemented with the "opposite" circuit from XOR, swapping the AND and OR gates:

    Schematic of an XNOR circuit.

    Schematic of an XNOR circuit.

     

  3. I'm not sure why some circuits in the 386 use standard logic for XOR while other circuits use pass transistor logic. I suspect that the standard XOR is used when the XOR gate is part of a standard-cell logic circuit, while the pass transistor XOR is used in hand-optimized circuits. There may also be performance advantages to one over the other. 

  4. The first inverter can be omitted in the pass transistor XOR circuit if the inverted input happens to be available. In particular, if multiple XOR gates use the same input, one inverter can provide the inverted input to all of them, reducing the per-gate transistor count. 

  5. The pass transistor XOR circuit uses different layouts in different parts of the 386, probably because hand layout allows it to be optimized. For instance, the instruction decoder uses the XOR circuit below. This circuit has four PMOS transistors on the left and four NMOS transistors on the right.

    An XOR circuit from the instruction decoder.

    An XOR circuit from the instruction decoder.

    The schematic shows the wiring of this circuit. Although the circuit is electrically the same as the previous pass-transistor circuit, the layout is different. In the previous circuit, several of the transistors were connected through their silicon, while this circuit has all the transistors separated and arranged in columns.

    Schematic of the XOR circuit from the instruction decoder.

    Schematic of the XOR circuit from the instruction decoder.

     





[#] Wed Dec 20 2023 09:26:21 UTC from rss <>

Subject: The transparent chip inside a vintage Hewlett-Packard floppy drive

[Reply] [ReplyQuoted] [Headers] [Print]

While repairing an eight-inch HP floppy drive, we found that the problem was a broken interface chip. Since the chip was bad, I decapped it and took photos. This chip is very unusual: instead of a silicon substrate, the chip is formed on a base of sapphire, with silicon and metal wiring on top. As a result, the chip is transparent as you can see from the gold "X" visible through the die in the photo below.

The PHI die as seen through an inspection microscope. Click this image (or any other) for a larger version.

The PHI die as seen through an inspection microscope. Click this image (or any other) for a larger version.

The chip is a custom HP chip from 1977 that provides an interface between HP's interface bus (HP-IB) and the Z80 processor in the floppy drive controller. HP designed this interface bus as a low-cost bus to connect test equipment, computers, and peripherals. The chip, named PHI (Processor-to-HP-IB Interface), was used in multiple HP products. It handles the bus protocol and buffered data between the interface bus and a device's microprocessor.1 In this article, I'll take a look inside this "silicon-on-sapphire" chip, examine its metal-gate CMOS circuitry, and explain how it works.

Silicon-on-sapphire

Most integrated circuits are formed on a silicon wafer. Silicon-on-sapphire, on the other hand, starts with a sapphire substrate. A thin layer of silicon is built up on the sapphire substrate to form the circuitry. The silicon is N-type, and is converted to P-type where needed by ion implantation. A metal wiring layer is created on top, forming the wiring as well as the metal-gate transistors. The diagram below shows a cross-section of the circuitry.

Cross-section from HP Journal, April 1977.

Cross-section from HP Journal, April 1977.

The important thing about silicon-on-sapphire is that silicon regions are separated from each other. Since the sapphire substrate is an insulator, transistors are completely isolated, unlike a regular integrated circuit. This reduces the capacitance between transistors, improving performance. The insulation also prevents stray currents, protecting against latch-up and radiation. Because of its resistance to radiation, silicon-on-sapphire found use in space applications.

An HP MC2 die, illuminated from behind with fiber optics. From Hewlett-Packard Journal, April 1977.

An HP MC2 die, illuminated from behind with fiber optics. From Hewlett-Packard Journal, April 1977.

Silicon-on-sapphire integrated circuits date back to research in 1963 at Autonetics, an innovative but now-forgotten avionics company that produced guidance computers for the Minuteman ICBMs, among other things. RCA developed silicon-on-sapphire integrated circuits in the 1960s and the RCA 1802 processor was used on the Voyager spacecraft. HP used silicon-on-sapphire for multiple chips starting in 1977, such as the MC2 Micro-CPU Chip. HP also used SOS for the three-chip CPU in the HP 300 Amigo (1978), but the system was a commercial failure. The popularity of silicon-on-sapphire peaked in the early 1980s and HP moved to bulk silicon integrated circuits for calculators such as the HP-41C. Silicon-on-sapphire is still used in various products, such as LEDs and RF applications, but is now mostly a niche technology.

Inside the PHI chip

HP used an unusual package for the PHI chip. The chip is mounted on a ceramic substrate, protected by a ceramic cap. The package has 48 gold fingers that press into a socket. The chip is held into the socket by two metal spring clips.

Package of the PHI chip, showing the underside. The package is flipped over when mounted in a socket.

Package of the PHI chip, showing the underside. The package is flipped over when mounted in a socket.

Decapping the chip was straightforward, but more dramatic than I expected. The chip's cap is attached with adhesive, which can be softened by heating. Hot air wasn't sufficient, so we used a hot plate. Eric tested the adhesive by poking it with an X-Acto knife, causing the cap to suddenly fly off with a loud "pop" and sending the blade flying through the air. I was happy to be wearing safety glasses.

Decapping the chip with a hotplate and hot air.

Decapping the chip with a hotplate and hot air.

After decapping the chip, I created the high-resolution die photo below. The metal layer is clearly visible as white lines, while the silicon is grayish and the sapphire appears purple. Around the edge of the die, bond wires connect the chip's 48 external connections to the die. Slightly upper left of center, a large regular rectangular block of circuitry provides 160 bits of storage: this is two 8-word FIFO buffers, passing 10-bit words between the interface bus and a connected microprocessor. The thick metal traces around the edges provide +12 volts, +5 volts, and ground to the chip.

Die photo of the PHI chip, created by stitching together microscope photos. Click for a much larger image.

Die photo of the PHI chip, created by stitching together microscope photos. Click for a much larger image.

Logic gates

Circuitry on this chip has an unusual appearance due to the silicon-on-sapphire implementation as well as the use of metal-gate transistors, but fundamentally the circuits are standard CMOS. The photo below shows a block that implements an inverter and a NAND gate. The sapphire substrate appears as dark purple. On top of this, the thick gray lines are the silicon. The white metal on top connects the transistors. The metal can also form the gates of transistors when it crosses silicon (indicated by letters). Inconveniently, metal that contacts silicon, metal that crosses over silicon, and metal that forms a transistor all appear very similar in this chip. This makes it more difficult to determine the wiring.

This diagram shows an inverter and a NAND gate on the die.

This diagram shows an inverter and a NAND gate on the die.

The schematic below shows how the gates are implemented, matching the photo above. The metal lines at the top and bottom provide the power and ground rails respectively. The inverter is formed from NMOS transistor A and PMOS transistor B; the output goes to transistors D and F. The NAND gate is formed by NMOS transistors E and F in conjunction with PMOS transistors C and D. The components of the NAND gate are joined at the square of metal, and then the output leaves through silicon on the right. Note that signals can only cross when one signal is in the silicon layer and one is in the metal layer. With only two wiring layers, signals in the PHI chip must often meander to avoid crossings, wasting a lot of space. (This wiring is much more constrained than typical chips of the 1970s that also had a polysilicon layer, providing three wiring layers in total.)

This schematic shows how the inverter and a NAND gate are implemented.

This schematic shows how the inverter and a NAND gate are implemented.

The FIFOs

The PHI chip has two first-in-first-out buffers (FIFOs) that occupy a substantial part of the die. Each FIFO holds 8 words of 10 bits, with one FIFO for data being read from the bus and the other for data written to the bus. These buffers help match the bus speed to the microprocessor speed, ensuring that data transmission is as fast as possible.

Each bit of the FIFO is essentially a static RAM cell, as shown below. Inverters A and B form a loop to hold a bit. Pass transistor C provides feedback so the inverter loop remains stable. To write a word, 10 bits are fed through vertical bit-in lines. A horizontal word write signal is activated to select the word to update. This disables transistor C and turns on transistor D, allowing the new bit to flow into the inverter loop. To read a word, a horizontal word read line is activated, turning on pass transistor F. This allows the bit in the cell to flow onto the vertical bit-out line, buffered by inverter E. The two FIFOs have separate lines so they can be read and written independently.

One cell of the FIFO.

One cell of the FIFO.

The diagram below shows nine FIFO cells as they appear on the die. The red box indicates one cell, with components labeled to match the schematic. Cells are mirrored vertically and horizontally to increase the layout density.

Nine FIFO cells as they appear on the die.

Nine FIFO cells as they appear on the die.

Control logic (not shown) to the left and right of the FIFOs manages the FIFOs. This logic generates the appropriate read and write signals so data is written to one end of the FIFO and read from the other end.
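
Functionally, each FIFO behaves like a small circular buffer of 10-bit words, written at one end and read from the other. Here is a rough Python sketch of that behavior (my own model; the class and names are mine, not HP's logic).

```python
# Functional sketch of one PHI FIFO: 8 words of 10 bits, written at one end
# and read from the other.
class Fifo8x10:
    DEPTH = 8
    WORD_MASK = 0x3FF                      # 10-bit words

    def __init__(self):
        self.cells = [0] * self.DEPTH      # each word is a row of latch cells
        self.read_ptr = 0
        self.write_ptr = 0
        self.count = 0

    def write(self, word):
        assert self.count < self.DEPTH, "FIFO full"
        self.cells[self.write_ptr] = word & self.WORD_MASK
        self.write_ptr = (self.write_ptr + 1) % self.DEPTH
        self.count += 1

    def read(self):
        assert self.count > 0, "FIFO empty"
        word = self.cells[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % self.DEPTH
        self.count -= 1
        return word

fifo = Fifo8x10()
for w in (0x123, 0x2AB, 0x3FF):
    fifo.write(w)
print([hex(fifo.read()) for _ in range(3)])   # ['0x123', '0x2ab', '0x3ff']
```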

The address decoder

Another interesting circuit is the decoder that selects a particular register based on the address lines. The PHI chip has eight registers, selected by three address lines. The decoder takes the address lines and generates 16 control lines (more or less), one to read from each register, and one to write to each register.

A die photo of the address decoder.

A die photo of the address decoder.

The decoder has a regular matrix structure for efficient implementation. Row lines are in pairs, with a line for each address bit input and its complement. Each column corresponds to one output, with the transistors arranged so the column will be activated when given the appropriate inputs. At the top and bottom are inverters. These latch the incoming address bits, generate the complements, and buffer the outputs.

Schematic of the decoder.

Schematic of the decoder.

The schematic above shows how the decoder operates. (I've simplified it to two inputs and two outputs.) At the top, the address line goes through a latch formed from two inverters and a pass transistor. The address line and its complement form two row lines; the other row lines are similar. Each column has a transistor on one row line and a diode on the other, selecting the address for that column. For instance, suppose a0 is 1 and a1 is 0. This matches the first column since the transistor lines are low and the diode lines are high. The PMOS transistors in the column will all turn on, pulling the input to the inverter high. However, if any of the inputs are "wrong", the corresponding transistor will turn off, blocking the +12 volts. Moreover, the output will be pulled low through the corresponding diode. Thus, each column will be pulled high only if all the inputs match, and otherwise it will be pulled low. Each column output controls one of the chip's registers, allowing that register to be accessed.
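
Here is the decoder's job expressed as a Python sketch (my own behavioral model, not the chip's transistor matrix; the function name and strobe parameters are mine): three address bits select one of eight registers, with separate read and write strobes giving 16 one-hot control lines.

```python
# Behavioral sketch of the register address decoder: 3 address bits select
# one of 8 registers; separate read and write strobes give 16 one-hot lines.
def decode(a2, a1, a0, read, write):
    index = (a2 << 2) | (a1 << 1) | a0
    read_lines = [1 if (read and i == index) else 0 for i in range(8)]
    write_lines = [1 if (write and i == index) else 0 for i in range(8)]
    return read_lines, write_lines

reads, writes = decode(0, 1, 1, read=1, write=0)
print(reads)    # [0, 0, 0, 1, 0, 0, 0, 0]: the read strobe for register 3
print(writes)   # [0, 0, 0, 0, 0, 0, 0, 0]: no write strobe asserted
```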

The HP-IB bus and the PHI chip

The Hewlett-Packard Interface Bus (HP-IB) was designed in the early 1970s as a low-cost bus for connecting diverse devices including instrument systems (such as a digital voltmeter or frequency counter), storage, and computers. This bus became an IEEE standard in 1975, known as the IEEE-488 bus.2 The bus transfers 8 bits in parallel, with handshaking between devices so slow devices can control the speed.

In 1977, HP developed a chip, known as PHI (Processor-to-HP-IB Interface), to implement the bus protocol and provide a microprocessor interface. This chip not only simplified construction of a bus controller but also ensured that devices implemented the protocol consistently. The block diagram below shows the components of the PHI chip. It's not an especially complex chip, but it isn't trivial either. I estimate that it has several thousand transistors.

Block diagram from HP Journal, July 1989.

Block diagram from HP Journal, July 1989.

The die photo below shows some of the functional blocks of the PHI chip. The microprocessor connected to the top pins, while the interface bus connected to the lower pins.

The PHI die with some functional blocks labeled.

The PHI die with some functional blocks labeled.

Conclusions

Top of the PHI chip, with the 1AA6-6004 part number. I'm not sure what the oval stamp at the top is, maybe a turtle?

Top of the PHI chip, with the 1AA6-6004 part number. I'm not sure what the oval stamp at the top is, maybe a turtle?

The PHI chip is interesting as an example of a "technology of the future" that didn't quite pan out. HP put a lot of effort into silicon-on-sapphire chips, expecting that this would become an important technology: dense, fast, and low power. However, regular silicon chips turned out to be the winning technology and silicon-on-sapphire was relegated to niche markets.

Comparing HP's silicon-on-sapphire chips to regular silicon chips at the time shows some advantages and disadvantages. HP's MC2 16-bit processor (1977) used silicon-on-sapphire technology; it had 10,000 transistors, ran at 8 megahertz, and used 350 mW. In comparison, the Intel 8086 (1978) was also a 16-bit processor, but implemented on regular silicon and using NMOS instead of CMOS. The 8086 had 29,000 transistors, ran at 5 megahertz (at first) and used up to 2.5 watts. The sizes of the chips were almost identical: 34 mm2 for the HP processor and 33 mm2 for the Intel processor. This illustrates that CMOS uses much less power than NMOS, one of the reasons that CMOS is now the dominant technology. For the other factors, silicon-on-sapphire had a bit of a speed advantage but wasn't as dense. Silicon-on-sapphire's main problem was its low yield and high cost. Crystal incompatibilities between silicon and sapphire made manufacturing difficult; HP achieved a yield of 9%, meaning 91% of the dies failed.

The time period of the PHI chip is also interesting since interface buses were transitioning from straightforward buses to high-performance buses with complex protocols. Early buses could be implemented with simple integrated circuits, but as protocols became more complex, custom interface chips became necessary. (The MOS 6522 Versatile Interface Adapter chip (1977) is another example, used in many home computers of the 1980s.) But these interfaces were still simple enough that the interface chips didn't require microcontrollers, using simple state machines instead.

The HP logo on the die of the PHI chip.

The HP logo on the die of the PHI chip.

For more, follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space. Thanks to CuriousMarc for providing the chip and to TubeTimeUS for help with decapping.

Notes and references

  1. More information: The article What is Silicon-on-Sapphire discusses the history and construction. Details on the HP-IB bus are here. The HP 12009A HP-IB Interface Reference Manual has information on the PHI chip and the protocol. See also the PHI article from HP Journal, July 1989. EvilMonkeyDesignz also shows a decapped PHI chip. 

  2. People with Commodore PET computers may recognize the IEEE-488 bus, since peripherals such as floppy disk drives were connected to the PET using it. The cables were generally expensive and harder to obtain than the interface cables used by other computers. The peripherals were also slow compared to those of other computers, although I think this was due to the hardware, not the bus. 





[#] Sun Dec 31 2023 10:18:58 UTC from rss <>

Subject: Interesting double-poly latches inside AMD's vintage LANCE Ethernet chip

[Reply] [ReplyQuoted] [Headers] [Print]

I've studied a lot of chips from the 1970s and 1980s, so I usually know what to expect. But an Ethernet chip from 1982 had something new: a strange layer of yellow wiring on the die. After some study, I learned that the yellow wiring is a second layer of resistive polysilicon, used in the chip's static storage cells and latches.

A closeup of the die of the LANCE chip. The metal has been removed to show the layers underneath.

The die photo above shows a closeup of a latch circuit, with the diagonal yellow stripe in the middle. For this photo, I removed the chip's metal layer so you can see the underlying circuitry. The bottom layer, silicon, appears gray-purple under the microscope, with the active silicon regions slightly darker and bordered in black. On top of the silicon, the pink regions are polysilicon, a special type of silicon. Polysilicon has a critical role in the chip: when it crosses active silicon, polysilicon forms the gate of a transistor. The circles are contacts between the metal layer and the underlying silicon or polysilicon. So far, the components of the chip match most NMOS chips of that time. But what about the bright yellow line crossing the circuit diagonally? That was new to me. This second layer of polysilicon provides resistance. It crosses over the other layers, connected to the silicon at the ends with a complex ring structure.

Why would you want high-resistance wiring in your digital chip? To understand this, let's first look at how a bit can be stored. An efficient way to store a bit is to connect two inverters in a loop, as shown below. Each inverter sends the opposite value to the other inverter, so the circuit will be stable in two states, holding one bit: a 1 or a 0.

Two cross-coupled inverters can store either a 0 or a 1 bit.

But how do you store a new value into the inverter loop? There are a few techniques. One is to use pass transistors to break the loop, allowing a new value to be stored. In the schematic below, if the hold signal is activated, the transistor turns on, completing the loop. But if hold is dropped and load is activated, a new value can be loaded from the input into the inverter loop.

A latch, controlled by pass transistors.

An alternative is to use a weak inverter that produces a low-current output. In this case, the input signal can simply overpower the value produced by the inverter, forcing the loop into a new state. The advantage of this circuit is that it eliminates the "hold" transistor. However, a weak inverter turns out to be larger than a regular inverter, negating much of the space saving.1 (The Intel 386 processor uses this type of latch.)

A latch using a weak inverter.

A third alternative, used in the Ethernet chip, is to use a resistor for the feedback, limiting the current.2 As in the previous circuit, the input can overpower the low feedback current. However, this circuit is more compact since it doesn't require a larger inverter. The resistor doesn't require additional space since it can overlap the rest of the circuitry, as shown in the photo at the top of the article. The disadvantage is that manufacturing the die requires additional processing steps to create the resistive polysilicon layer.

A latch using a resistor for feedback.

In the Ethernet chip, this type of latch is used in many circuits. For example, shift registers are built by connecting latches in sequence, controlled by the clock signals. Latches are also used to create binary counters, with the latch value toggled when the lower bits produce a carry.
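
As a rough behavioral model, the resistive-feedback latch can be treated as a stored bit that a driven input simply overwrites, since the input overpowers the weak feedback. The hypothetical Python sketch below models a latch this way and chains several of them into a shift register, as described above.

    class Latch:
        """Behavioral model of the resistive-feedback latch: the inverter loop
        holds the bit, and a driven input overpowers the weak feedback."""
        def __init__(self):
            self.bit = 0  # value held by the cross-coupled inverters

        def update(self, load, value):
            if load:             # the input overpowers the resistive feedback
                self.bit = value
            return self.bit      # otherwise the loop keeps its old value

    def shift_register(bits, stages=4):
        """Clock a stream of bits through a chain of latches."""
        latches = [Latch() for _ in range(stages)]
        for bit in bits:
            # Shift from the last stage backward so each latch takes the
            # previous stage's old value on this clock step.
            for i in reversed(range(1, stages)):
                latches[i].update(load=True, value=latches[i - 1].bit)
            latches[0].update(load=True, value=bit)
        return [latch.bit for latch in latches]

    print(shift_register([1, 0, 1, 1]))  # -> [1, 1, 0, 1]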

The SRAM cell

It would be overkill to create a separate polysilicon layer just for a few latches. It turns out that the chip was constructed with AMD's "64K dynamic RAM process". Dynamic RAM uses tiny capacitors to store data. In the late 1970s, dynamic RAM chips started using a "double-poly" process with one layer of polysilicon to form the capacitors and a second layer of polysilicon for transistor gates and wiring (details).

The double-poly process was also useful for shrinking the size of static RAM.3 The Ethernet chip contains several blocks of storage buffers for various purposes. These blocks are implemented as static RAM, including a 22×16 block, a 48×9 block, and a 16×7 block. The photo below shows a closeup of some storage cells, showing how they are arranged in a regular grid. The yellow lines of resistive polysilicon are visible in each cell.

A block of 28 storage cells in the chip. Some of the second polysilicon layer is damaged.

A static RAM storage cell is roughly similar to the latch cell, with two inverters in a loop to store each bit. However, the storage is arranged in a grid: each row corresponds to a particular word, and each column corresponds to the bits in a word. To select a word, a word select line is activated, turning on the pass transistors in that row. Reading and writing the cell is done through a pair of bitlines; each bit has a bitline and a complemented bitline. To read a word, the bits in the word are accessed through the bitlines. To write a word, the new value and its complement are applied to the bitlines, forcing the inverters into the desired state. (The bitlines must be driven with a high-current signal that can overcome the signal from the inverters.)

Schematic of one storage cell.
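
A small behavioral model makes the row-and-column organization concrete. In the hypothetical Python sketch below, activating a word line selects one row, and the complementary bitlines either read the cells in that row or force them into a new state; the analog details of overpowering the inverters are ignored.

    class SRAMBlock:
        def __init__(self, words, bits):
            # Each cell is an inverter loop holding one bit.
            self.cells = [[0] * bits for _ in range(words)]

        def read(self, word_line):
            # The selected row's pass transistors connect its cells to the bitlines.
            return list(self.cells[word_line])

        def write(self, word_line, value_bits):
            # Driving the bitline and complemented bitline hard forces each
            # cell's inverter loop into the new state.
            self.cells[word_line] = list(value_bits)

    block = SRAMBlock(words=22, bits=16)   # like the chip's 22x16 block
    block.write(3, [1, 0] * 8)
    print(block.read(3))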

The diagram below shows the physical layout of one memory cell, consisting of two resistors and four transistors. The black lines indicate the vertical metal wiring that was removed. The schematic on the right corresponds to the physical arrangement of the circuit. Each inverter is constructed from a transistor and a pull-up resistor, and the inverters are connected into a loop. (The role of these resistors is completely different from the feedback resistors in the latch.) The two transistors at the bottom are the pass transistors that provide access to the cell for reads or writes.

One static memory cell as it appears on the die, along with its schematic.

The layout of this storage cell is highly optimized to minimize its area. Note that the yellow resistors take almost no additional area, as they overlap other parts of the cell. If constructed without resistors, each inverter would require an additional transistor, making the cell larger.

To summarize, although the double-poly process was introduced for DRAM capacitors, it can also be used for SRAM cell pull-up resistors. Reducing the size of the SRAM cells was probably the motivation to use this process for the Ethernet chip, with the latch feedback resistors a secondary benefit.

The Am7990 LANCE Ethernet chip

I'll wrap up with some background on the AMD Ethernet chip. Ethernet was invented in 1973 at Xerox PARC and became a standard in 1980. Ethernet was originally implemented with a board full of chips, mostly TTL. By the early 1980s, companies such as Intel, Seeq, and AMD introduced chips to put most of the circuitry onto VLSI chips. These chips reduced the complexity of Ethernet interface hardware, causing the price to drop from $2000 to $1000.

The chip that I'm examining is AMD's Am7990 LANCE (Local Area Network Controller for Ethernet). This chip implemented much of the functionality for Ethernet and "Cheapernet" (now known as 10BASE2 Ethernet). The chip handles serial/parallel conversion, computes the 32-bit CRC checksum, handles collisions and backoff, and recognizes the desired addresses. The chip also provides DMA access for interfacing with a microcomputer.
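
Some of that functionality is easy to illustrate in software. The sketch below is a minimal bit-at-a-time Python version of the 32-bit CRC used as the Ethernet frame check sequence; the LANCE computes the same polynomial in hardware, presumably with shift registers and XOR gates rather than anything resembling this loop.

    def ethernet_crc32(data: bytes) -> int:
        """Bit-serial CRC-32 as used for the Ethernet frame check sequence."""
        crc = 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                if crc & 1:
                    crc = (crc >> 1) ^ 0xEDB88320  # reflected CRC-32 polynomial
                else:
                    crc >>= 1
        return crc ^ 0xFFFFFFFF

    print(hex(ethernet_crc32(b"123456789")))  # standard check value: 0xcbf43926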

The chip doesn't handle everything, though. It was designed to work with an Am7992 Serial Interface Adapter chip that encodes and decodes the bitstream using Manchester encoding. The third chip was the Am7996 transceiver that handled the low-level signaling and interfacing with the coaxial network cable, as well as detecting collisions if two nodes transmitted at the same time.
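
Manchester encoding combines the clock and data by putting a transition in the middle of every bit cell, which lets the receiver recover the clock from the data. Below is a tiny Python sketch of the idea, assuming the IEEE 802.3 convention (a 1 is a low-to-high transition, a 0 is high-to-low); the Am7992 does this with dedicated circuitry, of course, not software.

    def manchester_encode(bits):
        """Each bit becomes two half-bit symbols with a mid-bit transition
        (IEEE 802.3 convention: 1 -> low,high ; 0 -> high,low)."""
        out = []
        for bit in bits:
            out += [0, 1] if bit else [1, 0]
        return out

    def manchester_decode(symbols):
        """Invert the encoding by examining each pair of half-bit symbols."""
        return [1 if symbols[i] < symbols[i + 1] else 0
                for i in range(0, len(symbols), 2)]

    bits = [1, 0, 1, 1, 0]
    encoded = manchester_encode(bits)
    assert manchester_decode(encoded) == bits
    print(encoded)  # [0, 1, 1, 0, 0, 1, 0, 1, 1, 0]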

The LANCE chip is fairly complicated. The die photo below shows the main functional blocks of the chip. The chip is controlled by the large block of microcode ROM in the lower right. The large dark rectangles are storage, implemented with the static RAM cells described above. The chip has 48 pins, connected by tiny bond wires to the square pads around the edges of the die.

Main functional blocks of the LANCE chip.

Thanks to Robert Garner for providing the AMD LANCE chip and information, thanks to a bunch of people on Twitter for discussion, and thanks to Bob Patel for providing the functional block labeling and other information. For more, follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space.

Notes and references

  1. It may seem contradictory for a weak inverter to be larger than a regular inverter, since you'd expect that the bigger the transistor, the stronger the signal. It turns out, however, that creating a weak signal requires a larger transistor, due to how MOS transistors are constructed. The current from a transistor is proportional to the gate's width divided by the length. Thus, to create a more powerful transistor, you increase the width. But to create a weak transistor, you can't decrease the width because the minimum width is limited by the manufacturing process. Thus, you need to increase the gate's length. The result is that both stronger and weaker transistors are larger than "normal" transistors. 

  2. You might worry that the feedback resistor will wastefully dissipate power. However, the feedback current is essentially zero because NMOS transistor gates are essentially insulators. Thus, the resistor only needs to pass enough current to charge or discharge the gate. 

  3. An AMD patent describes the double-poly process as well as the static RAM cell; I'm not sure this is the process used in the Ethernet chip, but I expect the process is similar. The diagram below shows the RAM cell with its two resistors. The patent describes how the resistors and second layer of wiring are formed by a silicide/polysilicon ("inverted polycide") sandwich. (The silicide is a low-resistance compound of tantalum and silicon or molybdenum and silicon.) Specifically, the second layer consists of a buffer layer of polysilicon, a thicker silicide layer, and another layer of polysilicon forming the low-resistance "sandwich". Where resistance is desired, the bottom two layers of "sandwich" are removed during fabrication to leave just a layer of polysilicon. This polysilicon is then doped through implantation to give it the desired resistance.

    The static RAM cell from patent 4569122, "Method of forming a low resistance quasi-buried contact".

    The patent also describes using the second layer of polysilicon to provide a connection between silicon and the main polysilicon layer. Chips normally use a "buried contact" to connect silicon and polysilicon, but the patent describes how putting the second layer of polysilicon on top reduces the alignment requirements for a low-resistance contact. I think this explains the yellow ring of polysilicon around all the silicon/polysilicon contacts in the chip. (These rings are visible in the die photo at the top of the article.) Patent 4581815 refines this process further.

     





[#] Tue Jan 16 2024 09:46:04 UTC from rss <>

Subject: Inside the mechanical Bendix Air Data Computer, part 3: pressure transducers

[Reply] [ReplyQuoted] [Headers] [Print]

The Bendix Central Air Data Computer (CADC) is an electromechanical analog computer that uses gears and cams for its mathematics. It was a key part of military planes such as the F-101 and the F-111 fighters, computing airspeed, Mach number, and other "air data". This article reverse-engineers the two pressure transducers, on the right in the photo below. It is part 3 of my series on the CADC.1

The Bendix MG-1A Central Air Data Computer with the case removed, showing the compact gear mechanisms inside. Click this image (or any other) for a larger version.

Aircraft have determined airspeed from air pressure for over a century. A port in the side of the plane provides the static air pressure,2 the air pressure outside the aircraft. A pitot tube points forward and receives the "total" air pressure, a higher pressure due to the speed of the airplane forcing air into the tube. The airspeed can be determined from the ratio of these two pressures, while the altitude can be determined from the static pressure.

But as you approach the speed of sound, the fluid dynamics of air change and the calculations become very complicated. With the development of supersonic fighter planes in the 1950s, simple mechanical instruments were no longer sufficient. Instead, an analog computer calculated the "air data" (airspeed, air density, Mach number, and so forth) from the pressure measurements. This computer then transmitted the air data electrically to the systems that needed it: instruments, weapons targeting, engine control, and so forth. Since the computer was centralized, such a system was called a Central Air Data Computer or CADC, manufactured by Bendix and other companies.

A closeup of the numerous gears inside the CADC. Three differential gear mechanisms are visible.

Each value in the Bendix CADC is indicated by the rotational position of a shaft. Compact electric motors rotated the shafts, controlled by magnetic amplifier servo loops. Gears, cams, and differentials performed computations, with the results indicated by more rotations. Devices called synchros converted the rotations to electrical outputs that controlled other aircraft systems. The CADC is said to contain 46 synchros, 511 gears, 820 ball bearings, and a total of 2,781 major parts (but I haven't counted). These components are crammed into a compact cylinder: 15 inches long and weighing 28.7 pounds.

The equations computed by the CADC are impressively complicated. For instance, one equation computes the Mach number $M$ from the total pressure \( P_t \) and the static pressure \( P_s \):3

\[ \frac{P_t}{P_s} = \frac{166.9215 M^7}{(7M^2 - 1)^{2.5}} \]

It seems incredible that these functions could be computed mechanically, but three techniques make this possible. The fundamental mechanism is the differential gear, which adds or subtracts values. Second, logarithms are used extensively, so multiplications and divisions become additions and subtractions performed by a differential, while square roots are calculated by gearing down by a factor of 2. Finally, specially-shaped cams implement functions: logarithm, exponential, and other one-variable functions.4 By combining these mechanisms, complicated functions can be computed mechanically.
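
To get a feel for this equation, the short Python sketch below inverts it numerically: given a measured pressure ratio, it searches for the supersonic Mach number that satisfies the formula. The CADC performs the equivalent computation with logarithms, cams, and differentials instead of digital arithmetic.

    def pressure_ratio(mach):
        """Supersonic total/static pressure ratio from the CADC's equation."""
        return 166.9215 * mach**7 / (7 * mach**2 - 1)**2.5

    def mach_from_ratio(ratio, lo=1.0, hi=5.0):
        """Invert pressure_ratio() by bisection (valid for supersonic flight)."""
        for _ in range(60):
            mid = (lo + hi) / 2
            if pressure_ratio(mid) < ratio:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    print(round(mach_from_ratio(pressure_ratio(1.98)), 3))  # recovers Mach 1.98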

The pressure transducers

In this article, I'm focusing on the pressure transducers and how they turn pressures into shaft rotations. The CADC receives two pressure inputs: the total pressure \( P_t \) from the pitot tube, and the static pressure \( P_s \) from the static pressure port.5 The CADC has two independent pressure transducer subsystems, one for total pressure and one for static pressure. The two pressure transducers make up the right half of the CADC. The copper pressure tube for the static pressure is visible on top of the CADC below. This tube feeds into the black-domed pressure sensor at the right. The gears, motors, and other mechanisms to the left of the pressure sensor domes generate shaft rotations that are fed into the remainder of the CADC for calculations.

Side view of the CADC.

The pressure transducer has a tricky job: it must measure tiny pressure changes, but it must also provide a rotational signal that has enough torque to rotate all the gears in the CADC. To accomplish this, the pressure transducer uses a servo loop that amplifies small pressure changes into accurate rotations. The diagram below provides an overview of the process. The pressure input causes a small movement in the bellows diaphragm. This produces a small shaft rotation that is detected by a sensitive inductive pickup. This signal is amplified and drives a motor with enough power to drive the output shaft. The motor is also geared to counteract the movement of the bellows. The result is a feedback loop so the motor's rotation tracks the air pressure, but provides much more torque. An adjustable cam corrects for any error produced by irregularities in the diaphragm response. This complete mechanism is implemented twice, once for each pressure input.

This diagram shows the structure of the transducer. From "Air Data Computer Mechanization."

To summarize, as the pressure moves the diaphragm, the induction pick-up produces an error signal. The motor is driven in the appropriate direction until the error signal becomes zero. At this point, the output shaft rotation exactly matches the input pressure. The advantage of the servo loop is that the diaphragm only needs to move the sensitive inductive pickup, rather than driving the gears of the CADC, so the pressure reading is more accurate.
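
A toy simulation shows the essential behavior of such a servo loop. The Python sketch below uses a made-up proportional gain and ignores the motors, gearing, and magnetic amplifiers; it just demonstrates how driving the shaft in proportion to the error makes the shaft settle where the error is zero.

    def servo_track(pressure_inputs, gain=0.3):
        """Toy servo loop: the shaft is driven toward the input each time step."""
        shaft = 0.0
        history = []
        for target in pressure_inputs:
            error = target - shaft        # pickup output: difference signal
            shaft += gain * error         # motor turns to reduce the error
            history.append(shaft)
        return history

    # Step change in pressure: the shaft converges on the new value.
    for i, pos in enumerate(servo_track([1.0] * 20)):
        if i % 5 == 0:
            print(f"step {i:2d}: shaft at {pos:.3f}")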

In more detail, the process starts with connections from the aircraft's pitot tube and static pressure port to the CADC. The front of the CADC (below) has connections for the total pressure and the static pressure. The CADC also has five round military connectors for electrical connections between the CADC and the rest of the aircraft. (The outputs from the CADC are electrical, with synchros converting the shaft rotations into electrical representations.) Finally, a tiny time clock at the upper right keeps track of how many hours the CADC has been in operation, so it can be maintained according to schedule.

The front panel of the CADC, showing the static pressure and total pressure connections at the bottom.

The photo below shows the main components of the pressure transducer system. At the upper left, the pressure line from the CADC's front panel goes to the pressure sensor, airtight under a black dome. The error signal from the sensor goes to the amplifier, which consists of three boards. The amplifier's power transformer and magnetic amplifiers are the most visible components. The amplifier drives the motors to the left. There are two motors controlled by the amplifier: one for coarse adjustments and one for fine adjustments. By using two motors, the CADC can respond rapidly to large pressure changes, while also accurately tracking small pressure changes. Finally, the output from the motor goes through the adjustable cam in the middle before providing the feedback signal to the pressure sensor. The output from the transducer to the rest of the CADC is a shaft on the left, but it is in the middle of the CADC and isn't visible in the photo.

A closeup of the transducer, showing the main parts.

The pressure sensor

Each pressure sensor is packaged in a black airtight dome and is fed from its associated pressure line. Inside the sensor, two sealed metal bellows (below) expand or contract as the pressure changes. The bellows are connected to opposite sides of a metal shaft, which rotates as the bellows expand or contract. This shaft rotates an inductive pickup, providing the error signal. The servo loop rotates a second shaft that counteracts the rotation of the first shaft; this shaft and gears are also visible below.

Inside the pressure transducer. The two disc-shaped bellows are connected to opposite sides of a shaft so the shaft rotates as the bellows expand or contract.

The end view of the sensor below shows the inductive pickup at the bottom, with colorful wires for the input (400 Hz AC) and the output error signal. The coil visible on the inductive pickup is an anti-backlash spring to ensure that the pickup doesn't wobble back and forth. The electrical pickup coil is inside the inductive pickup and isn't visible.

Inside the transducer housing, showing the bellows and inductive pickup.

The amplifier

Each transducer feedback signal is amplified by three circuit boards centered around magnetic amplifiers, transformer-like amplifiers that were popular before high-power transistors came along. The photo below shows how the amplifier boards are packed next to the transducers. The boards are complex, filled with resistors, capacitors, germanium transistors, diodes, relays, and other components.

The pressure transducers are the two black domes at the top. The circuit boards next to each pressure transducer are the amplifiers. The yellowish transformer-like devices with three windings are the magnetic amplifiers.

I reverse-engineered the boards and created the schematic below. I'll discuss the schematic at a high level; click it for a larger version if you want to see the full circuitry. The process starts with the inductive sensor (yellow), which provides the error input signal to the amplifier. The first stage of the amplifier (blue) is a two-transistor amplifier and filter. From there, the signal goes to two separate output amplifiers to drive the two motors: fine (purple) and coarse (cyan).

Schematic of the servo amplifier, probably with a few errors. Click for a larger version.

The inductive sensor provides its error signal as a 400 Hz sine wave, with a larger signal indicating more error. The phase of the signal is 0° or 180°, depending on the direction of the error. In other words, the error signal is proportional to the driving AC signal in one direction and flipped when the error is in the other direction. This is important since it indicates which direction the motors should turn. When the error is eliminated, the signal is zero.
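
One way to see how the phase encodes direction is synchronous demodulation: multiply the error signal by the 400 Hz reference and average the product. The sign of the result gives the direction and the magnitude gives the size of the error. The Python sketch below illustrates only the math; the CADC extracts the direction with its transistor stage and magnetic amplifiers, not by computing anything.

    import math

    def demodulate(amplitude, phase_deg, samples=400):
        """Multiply a 400 Hz error signal by the reference and average."""
        total = 0.0
        for n in range(samples):
            t = n / samples
            reference = math.sin(2 * math.pi * t)
            error = amplitude * math.sin(2 * math.pi * t + math.radians(phase_deg))
            total += reference * error
        return total / samples   # positive: one direction, negative: the other

    print(round(demodulate(1.0, 0), 3))     # about +0.5 -> turn one way
    print(round(demodulate(1.0, 180), 3))   # about -0.5 -> turn the other way
    print(round(demodulate(0.0, 0), 3))     # 0 -> motor stays still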

Each output amplifier consists of a transistor circuit driving two magnetic amplifiers. Magnetic amplifiers are an old technology that can amplify AC signals, allowing the relatively weak transistor output to control a larger AC output. The basic idea of a magnetic amplifier is a controllable inductor. Normally, the inductor blocks alternating current. But applying a relatively small DC signal to a control winding causes the inductor to saturate, permitting the flow of AC. Since the magnetic amplifier uses a small signal to control a much larger signal, it provides amplification.

In the early 1900s, magnetic amplifiers were used in applications such as dimming lights. Germany improved the technology in World War II, using magnetic amplifiers in ships, rockets, and trains. The magnetic amplifier had a resurgence in the 1950s; the Univac Solid State computer used magnetic amplifiers (rather than vacuum tubes or transistors) as its logic elements. However, improvements in transistors made the magnetic amplifier obsolete except for specialized applications. (See my IEEE Spectrum article on magnetic amplifiers for more history of magnetic amplifiers.)

In the CADC, magnetic amplifiers control the AC power to the motors. Two magnetic amplifiers are visible on top of the amplifier board stack, while two more are on the underside; they are the yellow devices that look like transformers. (Behind the magnetic amplifiers, the power transformer is labeled "A".)

One of the three-board amplifiers for the pressure transducer.

The transistor circuit generates the control signal to the magnetic amplifiers, and the output of the magnetic amplifiers is the AC signal to the motors. Specifically, the CADC uses two magnetic amplifiers for each motor. One magnetic amplifier powers the motor to spin clockwise, while the other makes the motor spin counterclockwise. The transistor circuit will pull one magnetic amplifier winding low; the phase of the input signal controls which magnetic amplifier, and thus the motor direction. (If the error input signal is zero, neither winding is pulled low, both magnetic amplifiers block AC, and the motor doesn't turn.)6 The result of this is that the motor will spin in the correct direction based on the error input signal, rotating the mechanism until the mechanical output position matches the input pressure. The motors are "Motor / Tachometer Generator" units that also generate a voltage based on their speed. This speed signal is fed into the transistor amplifier to provide negative feedback, limiting the motor speed as the error becomes smaller and ensuring that the feedback loop doesn't overshoot.

The other servo loops in the CADC (temperature and position error correction) have one motor driver constructed from transistors and two magnetic amplifiers. However, each pressure transducer has two motor drivers (and thus four magnetic amplifiers), one for fine adjustment and one for coarse adjustment. This allows the servo loop to track the input pressure very closely, while also adjusting rapidly to larger changes in pressure. The coarse amplifier uses back-to-back diodes to block small changes; only input voltages larger than a diode drop will pass through and energize the coarse amplifier.

The CADC is powered by standard avionics power of 115 volts AC, 400 hertz. Each pressure transducer amplifier has a simple power supply running off this AC, using a multi-winding power transformer. A center-tapped winding and full wave rectifier produces DC for the transistor amplifiers. Other windings supply AC (controlled by the magnetic amplifiers) to power the motors, AC for the magnetic amplifier control signals, and AC for the sensor. The transformer ensures that the transducer circuitry is electrically isolated from other parts of the CADC and the aircraft. The power supply is indicated in red in the schematic above.

The schematic also shows test circuitry (blue). One of the features of the CADC is that it can be set to two test configurations before flight to ensure that the system is operating properly and is correctly calibrated.7 Two relays allow the pressure transducer to switch to one of two test inputs. This allows the CADC to be checked for proper operation and calibration. The test inputs are provided from an external board and a helical feedback potentiometer (Helipot) that provides simulated sensor input.

Getting the amplifiers to work was a challenge. Many of the capacitors in the CADC had deteriorated and failed, as shown below. Marc went through the CADC boards and replaced the bad capacitors. However, one of the pressure transducer boards still failed to work. After much debugging, we discovered that one of the new capacitors had also failed. Finally, after replacing that capacitor a second time, the CADC was operational.

Some bad capacitors in the CADC. This is the servo amplifier for the temperature sensor.

The mechanical feedback loop

The amplifier boards energize two motors that rotate the output shaft,8 the coarse and fine motors. The outputs from the coarse and fine motors are combined through a differential gear assembly that sums its two input rotations.9 While the differential functions like the differential in a car, it is constructed differently, with a spur-gear design. This compact arrangement of gears is about 1 cm thick and 3 cm in diameter. The differential is mounted on a shaft along with three co-axial gears: two gears provide the inputs to the differential and the third provides the output. In the photo, the gears above and below the differential are the input gears. The entire differential body rotates with the sum, connected to the output gear at the top through a concentric shaft. The two thick gears inside the differential body are part of its mechanism.

A closeup of a differential mechanism.

(Differential gear assemblies also serve as the CADC's mathematical components, performing addition and subtraction. Since most values in the CADC are expressed logarithmically, a differential computes multiplication and division when it adds or subtracts its inputs.)

The CADC uses cams to correct for nonlinearities in the pressure sensors. The cam consists of a warped metal plate. As the gear rotates, a spring-loaded vertical follower moves according to the shape of the plate. The differential gear assembly under the plate adds this value to the original input to obtain a corrected value. (This differential implementation is different from the one described above.) The output from the cam is fed into the pressure sensor, closing the feedback loop.

The corrector cam is adjusted to calibrate the output to counteract for variations in the bellows behavior.

At the top, 20 screws can be rotated to adjust the shape of the cam plate and thus the correction factor. These cams allow the CADC to be fine-tuned to maximize accuracy. According to the spec, the required accuracy for pressure was "40 feet or 0.15 percent of attained altitude, whichever is greater."

Conclusions

The Bendix CADC was built at an interesting point in time, when computations could be done digitally or analog, mechanically or electrically. Because the inputs were analog and the desired outputs were analog, the decision was made to use an analog computer for the CADC. Moreover, transistors were available but their performance was limited. Thus, the servo amplifiers are built from a combination of transistors and magnetic amplifiers.

Modern air data computers are digital but they are still larger than you might expect because they need to handle physical pressure inputs. While a drone can use a tiny 5mm MEMS pressure sensor, air data computers for aircraft have higher requirements and typically use larger vibrating cylinder pressure sensors. Even so, at 45 mm long, the modern pressure sensor is dramatically smaller than the CADC's pressure transducer with its metal-domed bellows sensor, three-board amplifier, motors, cam, and gear train. Although the mechanical Bendix CADC seems primitive, this CADC was used by the Air Force until the 1980s. I guess if the system worked, there was no reason to update it.

I plan to continue reverse-engineering the Bendix CADC,10 so follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon as @kenshirriff@oldbytes.space. Thanks to Joe for providing the CADC. Thanks to Nancy Chen for obtaining a hard-to-find document for me. Marc Verdiell and Eric Schlaepfer are working on the CADC with me.

Notes and references

  1. My previous posts on the CADC provide an overview and reverse-engineering of the left side. Much of the background of this article is copied from the previous articles, so parts may look familiar. 

  2. The static air pressure can also be provided by holes in the side of the pitot tube. I couldn't find information indicating exactly how the planes with the CADC received static pressure. 

  3. Although the CADC's equations may seem ad hoc, they can be derived from fluid dynamics principles. These equations were standardized in the 1950s by various government organizations including the National Bureau of Standards and NACA (the precursor of NASA). 

  4. The CADC also uses cams to implement functions such as logarithms, exponentials, and complicated functions of one variable such as ${M}/{\sqrt{1 + .2 M^2}}$. These cams have a completely different design from the corrector cams. The function cams are fixed shape, unlike the adjustable corrector cams. The function is encoded into the cam's shape during manufacturing, so implementing a hard-to-compute nonlinear function isn't a problem for the CADC. The photo below shows a cam with the follower arm in front. As the cam rotates, the follower moves in and out according to the cam's radius. The pressure transducers do not use fixed cams, so I won't discuss them more in this article.

    A cam inside the CADC implements a function.

     

  5. The CADC also has an input for the "position error correction". This input provides a correction factor because the measured static pressure may not exactly match the real static pressure. The problem is that the static pressure is measured from a port on the aircraft. Distortions in the airflow may cause errors in this measurement. A separate box, the "compensator", determined the correction factor based on the angle of attack and fed it to the CADC as a synchro signal. The position error correction is applied in a separate section of the CADC, downstream from the transducers, so I will ignore it for this article. 

  6. A bit more explanation of the transistor circuit driving the magnetic amplifier. The idea is that one magnetic amplifier or the other is selected, depending on the phase of the error signal, causing the motor to turn counterclockwise or clockwise as needed. To implement this, the magnetic amplifier control windings are connected to opposite phases of the 400 Hz power. The transistor is connected to both magnetic amplifiers through diodes, so current will flow only if the transistor pulls the winding low during the half-cycle that the winding is powered high. Thus, depending on the phase of the transistor output, one winding or the other will be powered, allowing that magnetic amplifier to pass AC to the motor. 

  7. According to the specification, the CADC has simulated "low point" and "high point" test conditions. The low point is 11,806 feet altitude, 1064 ft/sec true airspeed, Mach .994, total temperature 317.1 °K, and density × speed of sound of 1.774 lb sec/ft³. The high point is 50,740 feet altitude, 1917 ft/sec true airspeed, Mach 1.980, total temperature 366.6 °K, and density × speed of sound of .338 lb sec/ft³. 

  8. The motor part number is Bendix FV101-5A1. 

  9. Strictly speaking, the output of the differential is the sum of the inputs divided by two. I'm ignoring the factor of 2 because the gear ratios can easily cancel it out. It's also arbitrary whether you think of the differential as adding or subtracting, since it depends on which rotation direction is defined as positive. 

  10. It was very difficult to find information about the CADC. The official military specification is MIL-C-25653C(USAF). After searching everywhere, I was finally able to get a copy from the Technical Reports & Standards unit of the Library of Congress. The other useful document was in an obscure conference proceedings from 1958: "Air Data Computer Mechanization" (Hazen), Symposium on the USAF Flight Control Data Integration Program, Wright Air Dev Center US Air Force, Feb 3-4, 1958, pp 171-194. 





[#] Sun Jan 28 2024 09:57:48 UTC from rss <>

Subject: Reverse engineering CMOS, illustrated with a vintage Soviet counter chip

[Reply] [ReplyQuoted] [Headers] [Print]

I recently came across an interesting die photo of a Soviet1 chip, probably designed in the 1970s. This article provides an introductory guide to reverse-engineering CMOS circuits, using this chip as an example. Although the chip looks like a tangle of lines at first, its large features and simple layout make it possible to understand its circuits. I'll first explain how to recognize the individual transistors. Groups of transistors are connected in standard patterns to form CMOS gates, multiplexers, flip-flops, and other circuits. Once these building blocks are understood, reverse-engineering the full chip becomes practical. The chip turned out to be a 4-bit CMOS counter, a copy of the Motorola MC14516B.

Die photo of the К561ИЕ11 chip on a wafer. Image courtesy of Martin Evtimov. Click this image (or any other) for a larger version.

The photo above shows the tiny silicon die under a microscope. Regions of the silicon are doped with impurities to change the silicon's electrical properties. This doping also causes regions of the silicon to appear greenish or reddish, depending on how a region is doped. (These color changes will turn out to be useful for reverse engineering.) On top of the silicon, the whitish metal layer is visible, forming the chip's connections. This chip uses metal-gate transistors, an old technology, so the metal layer also forms the gates of the transistors. Around the outside of the chip, the 16 square bond pads connect the chip to the outside world. When installed in a package, the die has tiny bond wires between the pads and the lead frame, the metal structure that connects to the chip's pins.

According to the Russian datasheet,2 the chip has 319 "elements", presumably counting the semiconductor devices. The chip has a handful of diodes to protect the inputs, so the total transistor count is a bit over 300. This transistor count is nothing compared to a modern CMOS processor with tens of billions of transistors, of course, but most of the circuit principles are the same.

NMOS and PMOS transistors

CMOS is a low-power logic family now used in almost all processors.3 CMOS (complementary MOS) circuitry uses two types of transistors, NMOS and PMOS, working together. The diagram below shows how an NMOS transistor is constructed. The transistor can be considered a switch between the source and drain, controlled by the gate. The source and drain regions (red) consist of silicon doped with impurities to change its semiconductor properties, forming N+ silicon. The gate consists of an aluminum layer, separated from the silicon by a very thin insulating oxide layer.4 (These three layers—Metal, Oxide, Semiconductor—give the MOS transistor its name.) This oxide layer is an insulator, so there is essentially no current flow through the gate, one reason why CMOS is a low-power technology. However, the thin oxide layer is easily destroyed by static electricity, making MOS integrated circuits sensitive to electrostatic discharge.

Structure of an NMOS transistor.

A PMOS transistor (below) has the opposite configuration from an NMOS transistor: the source and drain are doped to form P+ regions, while the underlying bulk silicon is N-type silicon. The doping process is interesting, but I'll leave the details to a footnote.5

Structure of a PMOS transistor.

The NMOS and PMOS transistors are opposite in their construction and operation; this is the "Complementary" in CMOS. An NMOS transistor turns on when the gate is high, while a PMOS transistor turns on when the gate is low. An NMOS transistor is best at pulling its output low, while a PMOS transistor is best at pulling its output high. In a CMOS circuit, the transistors work as a team, pulling the output high or low as needed. The behavior of MOS transistors is complicated, so this description is simplified, just enough to understand digital circuits.

If you buy an MOS transistor from an electronics supplier, it comes as a package with three leads for the source, gate, and drain. The source and drain are connected differently inside the package and are not interchangeable in a circuit. In an integrated circuit, however, the transistor is symmetrical and the source and drain are the same. For that reason, I won't distinguish between the source and the drain in the following discussion. I will use the symmetrical symbols below for NMOS and PMOS transistors; the inversion bubble on the PMOS gate symbolizes that a low signal activates the PMOS transistor.

Symbols for NMOS and PMOS transistors.

One complication is that NMOS transistors are built on P-type silicon, while PMOS transistors are built on N-type silicon. Since the silicon die itself is N silicon, the NMOS transistors need to be surrounded by a tub or well of P silicon.6 The cross-section diagram below shows how the NMOS transistor on the right is embedded in the well of P-type silicon. Constructing two transistor types with opposite behaviors makes manufacturing more complex, one reason why CMOS took years to catch on. CMOS was invented in 1963 at Fairchild Semiconductor, but RCA was the main proponent of CMOS, commercializing it in the late 1960s. Although RCA produced a CMOS microprocessor in 1974, mainstream microprocessors didn't switch to CMOS until the mid-1980s with chips such as the Motorola 68020 (1984) and the Intel 386 (1986).

Cross-section of CMOS transistors.

For proper operation, the silicon that surrounds transistors needs to be connected to the appropriate voltage through "tap" contacts.7 For PMOS transistors, the substrate is connected to power through the taps, while for NMOS transistors the well region is connected to ground through the taps. When reverse-engineering, the taps can provide important clues, indicating which regions are NMOS and which are PMOS. As will be seen below, these voltages are also important for understanding the circuitry of this chip.

The die photo below shows two transistors as they appear on the die. The appearance of transistors varies between different integrated circuits, so a first step of reverse engineering is determining how they look in a particular chip. In this IC, a transistor gate can be distinguished by a large rectangular region over the silicon. (In other metal-gate transistors, the gate often has a "bubble" appearance.) The interactions between the metal wiring and the silicon can be distinguished by subtle differences. For the most part, the metal wiring passes over the silicon, isolated by thick insulating oxide. A contact between metal and silicon is recognizable by a smaller oval region that is slightly darker; wires are connected to the transistor sources and drains below. MOS transistors often don't have discrete boundaries; as will be seen later, the source of one transistor can overlap with the drain of another.

Two transistors on the die.

Distinguishing PMOS and NMOS transistors can be difficult. On this chip, P-type silicon appears greenish, and N-type silicon appears reddish. Thus, PMOS transistors appear as a green region surrounded by red, while NMOS is the opposite. Moreover, PMOS transistors are generally larger than NMOS transistors because they are weaker. Another way to distinguish them is by their connection in circuits. As will be seen below, PMOS transistors in logic gates are connected to power while NMOS transistors are connected to ground.

Metal-gate transistors are a very old technology, mostly replaced by silicon-gate transistors in the 1970s. Silicon-gate circuitry uses an additional layer of polysilicon wiring. Moreover, modern ICs usually have more than one layer of metal. The metal-gate IC in this post is easier to understand than a modern IC, since there are fewer layers to analyze. The CMOS principles are the same in modern ICs, but the layout will appear different.

Implementing an inverter in CMOS

The simplest CMOS gate is an inverter, shown below. Although basic, it illustrates most of the principles of CMOS circuitry. The inverter is constructed from a PMOS transistor on top to pull the output high and an NMOS transistor below to pull the output low. The input is connected to the gates of both transistors.

A CMOS inverter is constructed from a PMOS transistor (top) and an NMOS transistor (bottom).

Recall that an NMOS transistor is turned on by a high signal on the gate, while a PMOS transistor is the opposite, turned on by a low signal. Thus, when the input is high, the NMOS transistor (bottom) turns on, pulling the output low. When the input is low, the PMOS transistor (top) turns on, pulling the output high. Notice how the transistors act in opposite (i.e. complementary) fashion.

How the inverter functions.
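
The inverter's switch-level behavior takes only a few lines to model. The Python sketch below treats each transistor as a switch (NMOS conducts when its gate is 1, PMOS when its gate is 0) and reports which rail drives the output.

    def nmos_on(gate):  # an NMOS transistor conducts when its gate is high
        return gate == 1

    def pmos_on(gate):  # a PMOS transistor conducts when its gate is low
        return gate == 0

    def inverter(inp):
        pull_up = pmos_on(inp)     # PMOS connects the output to power
        pull_down = nmos_on(inp)   # NMOS connects the output to ground
        assert pull_up != pull_down, "output must not float or short"
        return 1 if pull_up else 0

    print(inverter(0), inverter(1))  # 1 0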

An inverter on the die is shown below. The PMOS and NMOS transistors are indicated by red boxes and the transistors are connected according to the schematics above. The input is connected to the gates of the two transistors, which can be distinguished as larger metal rectangles. On the right, two contacts connect the transistor drains to the output. The power and ground connections are a bit different from most chips since the metal lines appear to not go anywhere. The short metal line labeled "power" connects the PMOS transistor's source to the substrate, the reddish silicon that surrounds the transistor. As described earlier, the substrate is connected to the chip's power. Thus, the transistor receives its power through the substrate silicon. This approach isn't optimal, due to the relatively high resistance of silicon, but it simplifies the wiring. Similarly, the ground metal connects the NMOS transistor's source to the well that surrounds the transistor, P-type silicon that appears green. Since the well is grounded, the transistor has its ground connection.

An inverter on the die.

Some inverters look different from the layout above. Many of the chip's inverters are constructed as two inverters in parallel to provide twice the output current. This gives the inverter more "fan-out", the ability to drive the inputs of a larger number of gates.8 The diagram below shows a doubled inverter, which is essentially the previous inverter mirrored and copied, with two PMOS transistors at the top and two NMOS transistors at the bottom. Note that there is no explicit boundary between the paired transistors; their drains share the same silicon. Consequently, each output contact is shared between two transistors, rather than being duplicated.

An inverter consisting of two inverters in parallel.

Another style of inverter drives the chip's output pins. The output pins require high current to drive external circuitry. The chip uses much larger transistors to provide this current. Nonetheless, the output driver uses the same inverter circuit described earlier, with a PMOS transistor to put the output high and an NMOS transistor to pull the output low. The photo below shows one of these output inverters on the die. To fit the larger transistors into the available space, the transistors have a serpentine layout, with the gate winding between the source and the drain. The inverter's output is connected to a bond pad. When the die is mounted in a package, tiny bond wires connect the pads to the external pins.

An output driver is an inverter, built with much larger transistors.

NOR and NAND gates

Other logic gates are constructed using the same concepts as the inverter, but with additional transistors. In a NOR gate, the PMOS transistors on top are in series, so the output will be pulled high if all inputs are 0. The NMOS transistors on the bottom are in parallel, so the output will be pulled low if any input is 1. Thus, the circuit implements the NOR function. Again, note the complementary action: the PMOS transistors pull the output high, while the NMOS transistors pull the output low. Moreover, the PMOS transistors are in series, while the NMOS transistors are in parallel. The circuit below is a 3-input NOR gate; different numbers of inputs are supported similarly. (With just one input, the circuit becomes an inverter, as you might expect.)

A 3-input NOR gate implemented in CMOS.

For any gate implementation, the output must be either pulled high by the PMOS side or pulled low by the NMOS side. If both happen simultaneously for some input, power and ground would be shorted, possibly destroying the chip. If neither happens, the output would be floating, which is bad in a CMOS circuit.9 In the NOR gate above, you can see that for any input the output is always pulled either high or low as required. Reverse engineering tip: if the output is not always pulled high or low, you probably made a mistake in either the PMOS circuit or the NMOS circuit.10
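
This check can be automated. The Python sketch below describes a gate's pull-up (PMOS) and pull-down (NMOS) networks as boolean conditions and verifies that exactly one of them conducts for every input combination, using the 3-input NOR gate as the example.

    from itertools import product

    def check_gate(pull_up, pull_down, n_inputs):
        """Verify a CMOS gate: for every input, exactly one network conducts."""
        for inputs in product([0, 1], repeat=n_inputs):
            up, down = pull_up(*inputs), pull_down(*inputs)
            if up and down:
                return f"{inputs}: short between power and ground"
            if not up and not down:
                return f"{inputs}: floating output"
        return "OK: output is always driven high or low"

    # 3-input NOR: PMOS in series (all inputs low), NMOS in parallel (any input high).
    nor_pull_up   = lambda a, b, c: (not a) and (not b) and (not c)
    nor_pull_down = lambda a, b, c: a or b or c
    print(check_gate(nor_pull_up, nor_pull_down, 3))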

The diagram below shows how a 3-input NOR gate appears on the die.11 The transistor gates are the thick vertical metal rectangles; PMOS transistors are on top and NMOS below. The three PMOS transistors are in series between power on the left and the output connection on the right. As with the inverter, the power and ground connections are wired to the bulk silicon, not to the chip's power and ground lines.

A 3-input NOR gate as it is implemented on the die. The "extra" PMOS transistor on the left is part of a different gate.

The layout of the NMOS transistors is more complicated because it is difficult to wire the transistors in parallel with just one layer of metal. The output wire connects between the first and second transistors as well as to the third transistor. An unusual feature is that the second and third NMOS transistors are connected to ground by a horizontal line of doped silicon (the reddish "silicon path" indicated by the dotted line). This silicon extends from the ground metal to the region between the two transistors. Finally, note that the PMOS transistors are much larger than the NMOS transistors. This is both because PMOS transistors are inherently less efficient and because transistors in series need to have lower resistance to avoid degrading the output signal. Reverse-engineering tip: it's often easier to recognize the transistors in series and then use that information to determine which transistors must be in parallel.

A NAND gate is implemented by swapping the roles of the series and parallel transistors. That is, the PMOS transistors are in parallel, while the NMOS transistors are in series. For example, the schematic below shows a 4-input NAND gate. If all inputs are 1, the NMOS transistors will pull the output low. If any input is a 0, the corresponding PMOS transistor will pull the output high. Thus, the circuit implements the NAND function.

A 4-input NAND gate implemented in CMOS.

The diagram below shows a four-input NAND gate on the die. In the bottom half, four NMOS transistors are in series, while in the top half, four PMOS transistors are in parallel. (Note that the series and parallel transistors are switched compared to the NOR gate.) As in the NOR gate, the power and ground are provided by metal connections to the bulk silicon (two connections for the power). The parallel PMOS circuit uses a "silicon path" (green) to connect each transistor to the output without intersecting the metal. In the middle, this silicon has a vertical metal line on top; this reduces the resistance of the silicon path. The NMOS transistors are larger than the PMOS transistors in this case because the NMOS transistors are in series.

A four-input NAND gate as it appears on the die.

Complex gates

More complex gates such as AND-NOR (AND-OR-INVERT) can also be constructed in CMOS; these gates are commonly used because they are no harder to build than NAND or NOR gates. The schematic below shows an AND-NOR gate. To understand its construction, look at the paths to ground through the NMOS transistors. The first path is through A, B, and C. If these inputs are all high, the output is low, implementing the AND-INVERT side of the gate. The second path is through D, which will pull the output low by itself, implementing the OR-INVERT side of the gate. You can verify that the PMOS transistors pull the output high in the necessary circumstances. Observe that the D transistor is in series on the PMOS side and in parallel on the NMOS side, again showing the complementary nature of these circuits.

An AND-NOR gate.

The diagram below shows this AND-NOR gate on the die, with the four inputs A, B, C, and D, corresponding to the schematic above. This gate has a few tricky layout features. The biggest surprise is that there is half of another gate (a 3-input NOR gate) in the middle of this gate. Presumably, the designers found this arrangement efficient since the other gate also uses inputs A, B, and C. The output of the other gate (D) is an input to the gate we're examining. Ignoring the other gate, the AND-NOR gate has the NMOS transistors in the first column, on top of a reddish band, and the PMOS transistors in the third column, on top of a greenish band. Hopefully you can recognize the transistor gates, the large rectangles connected to A, B, C, and D. Matching the schematic above, there are three NMOS transistors in series on the left, connected to A, B, and C, as well as the D transistor providing a second path between ground and the output. On the PMOS side, the A, B, and C transistors are in parallel, and then connected through the D transistor to the output. The green "silicon path" on the right provides the parallel connection from transistors A and B to transistors C and D. Most of this path is covered by two long metal regions, reducing the resistance. But in order to cross under wires B and C, the metal has a break where the green silicon provides the connection.

An AND-NOR gate on the die.

As with the other gates, the power is obtained by a connection to the bulk silicon, bridging the red and green regions. If you look closely, there is a green band ("silicon path") down from the power connection and joining the main green region between the B and C transistors, providing power to those transistors through the silicon. The NMOS transistors, on the other hand, have ground connections at the top and bottom. For this circuit, ground is supplied through solid metal wires at the top and the bottom, rather than a connection to the bulk silicon.

A few principles help when reverse-engineering logic gates. First, because of the complementary nature of CMOS, the output must either be pulled high by the PMOS transistors or pulled low by the NMOS transistors. Thus, one group or the other must be activated for each possible input. This implies that the same inputs must go to both the NMOS and PMOS transistors. Moreover, the structures of the NMOS and PMOS circuits are complementary: where the NMOS transistors are parallel, the PMOS transistors must be in series, and vice versa. In the case of the AND-NOR circuit above, these principles are helpful. For instance, you might not spot the "silicon paths", but since the PMOS half must be complementary to the NMOS half, you know that those connections must exist.

Even complex gates can be reverse-engineered by breaking the NMOS transistors into series and parallel groups, corresponding to AND and OR terms. Note that MOS transistors are inherently inverting, so a single gate will always end with inversion. Thus, you can build an AND-OR-AND-NOR gate for instance, but you can't build an AND gate as a single circuit.
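
As a rough illustration of reading series and parallel groups as AND and OR terms, the sketch below (my own toy model, not anything from the chip) evaluates a pull-down network described as nested groups; the gate output is just the inverse of whether any path to ground conducts.

```python
def conducts(network, inputs):
    """True if the NMOS pull-down network conducts to ground. A network is an
    input name, ('series', [parts]), or ('parallel', [parts])."""
    if isinstance(network, str):
        return bool(inputs[network])                       # a single NMOS transistor
    kind, parts = network
    if kind == 'series':
        return all(conducts(p, inputs) for p in parts)     # series group = AND term
    return any(conducts(p, inputs) for p in parts)         # parallel group = OR term

def gate_output(network, inputs):
    return not conducts(network, inputs)                   # a single CMOS gate always inverts

# The AND-NOR gate from above: A, B, and C in series, in parallel with D.
and_nor = ('parallel', [('series', ['A', 'B', 'C']), 'D'])
print(gate_output(and_nor, {'A': 1, 'B': 1, 'C': 1, 'D': 0}))   # False: output pulled low
print(gate_output(and_nor, {'A': 0, 'B': 1, 'C': 1, 'D': 0}))   # True: output pulled high
```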

Transmission gate

Another key circuit is the transmission gate. This acts as a switch, either passing a signal through or blocking it. The schematic below shows how a transmission gate is constructed from two transistors, an NMOS transistor and a PMOS transistor. If the enable line is high (and thus the complemented enable on the PMOS gate is low), both transistors turn on, passing the input signal to the output. The NMOS transistor primarily passes a low signal, while the PMOS transistor primarily passes a high signal, so they work together. If the enable line is low, both transistors turn off, blocking the input signal. The schematic symbol for a transmission gate is shown on the right. Note that the transmission gate is bidirectional; it doesn't have a specific input and output. Examining the surrounding circuitry usually reveals which side is the input and which side is the output.

A transmission gate is constructed from two transistors. The transistors and their gates are indicated. The schematic symbol is on the right.
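
Behaviorally, a transmission gate is just a switch; a minimal sketch (mine, for illustration) can model the blocked state as a floating, high-impedance value:

```python
def transmission_gate(value, enable):
    """Pass the value through when enabled; otherwise the output floats (high-Z),
    modeled here as None."""
    return value if enable else None

print(transmission_gate(1, enable=True))   # 1: the switch is closed
print(transmission_gate(1, enable=False))  # None: the output is disconnected
```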

The photo below shows how a transmission gate appears on the die. It consists of a PMOS transistor at the top and an NMOS transistor at the bottom. Both the enable signal and the complemented enable signal are used, one for the NMOS transistor's gate and one for the PMOS transistor.

A transmission gate on the die, consisting of two transistors.

The inverter and transmission gate are both two-transistor circuits, but they can be easily distinguished for reverse engineering. One difference is that an inverter is connected to power and ground, while the transmission gate is unpowered. Moreover, the inverter has one input, while the transmission gate has three inputs (counting the control lines). In the inverter, both transistor gates have the same input, so one transistor turns on at a time. In the transmission gate, however, the gates have opposite inputs, so the transistors turn on or off together.

One useful circuit that can be built from transmission gates is the multiplexer, a circuit that selects one of two (or more) inputs. The multiplexer below selects either input inA or inB and connects it to the output, depending on whether the selection line selA is high or low, respectively. The multiplexer can be built from two transmission gates as shown. Note that the select lines are flipped on the second transmission gate, so only one transmission gate will be activated at a time. Multiplexers with more inputs can be built by using more transmission gates with additional select lines.

Schematic symbol for a multiplexer and its implementation with two transmission gates.
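
Using that switch model, a two-input multiplexer is two transmission gates with complementary enables; here is an illustrative sketch:

```python
def transmission_gate(value, enable):
    return value if enable else None     # None models a floating (high-Z) output

def mux(in_a, in_b, sel_a):
    """Select in_a when sel_a is high, in_b when sel_a is low. Exactly one
    transmission gate drives the shared output at a time."""
    a_out = transmission_gate(in_a, sel_a)
    b_out = transmission_gate(in_b, not sel_a)
    return a_out if a_out is not None else b_out

print(mux(in_a=1, in_b=0, sel_a=True))   # 1: inA selected
print(mux(in_a=1, in_b=0, sel_a=False))  # 0: inB selected
```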

The die photo below shows a block of transmission gates consisting of six PMOS transistors and six NMOS transistors. The labels on the metal lines will make more sense as the reverse engineering progresses. Note that the metal layer provides much of the wiring for the circuit, but not all of it. Much of the wiring is implicit, in the sense that neighboring transistors are connected because the source of one transistor overlaps the drain of another.

A block of transistors implementing multiple transmission gates.

While this may look like an incomprehensible block of zig-zagging lines, tracing out the transistors will reveal the circuitry (below). The wiring in the schematic matches the physical layout on the die, so the schematic is a bit of a mess. With a single layer of metal for wiring, the layout becomes a bit convoluted to avoid crossing wires. (The only wire crossing in this image is in the upper left for wire X; the signal uses a short stretch of silicon to pass under the metal.)

Schematic of the previous block of transistors.

Looking at the PMOS and NMOS transistors as pairs reveals that the circuit above is a chain of transmission gates (shown below). It's not immediately obvious which wires are inputs and which wires are outputs, but it's a good guess that pairs of transmission gates using the opposite control lines form a multiplexer. That is, inputs A and C are multiplexed to output B, inputs C and E are multiplexed to output D, and so forth. As will be seen, these transmission gates form multiplexers that are part of a flip-flop.

The transistors form six transmission gates.

Latches and flip-flops

Flip-flops and latches are important circuits, able to hold one bit and controlled by a clock signal. Terminology is inconsistent, but I'll use flip-flop to refer to an edge-triggered device and latch to refer to a level-triggered device. That is, a flip-flop will grab its input at the moment the clock signal goes high (i.e. it uses the clock edge), store it, and provide it as the output, called Q for historical reasons. A latch, on the other hand, will take its input, store it, and output it as long as the clock is high (i.e. it uses the clock level). The latch is considered "transparent", since the input immediately appears on the output if the clock is high.

The distinction between latches and flip-flops may seem pedantic, but it is important. Flip-flops will predictably update once per clock cycle, while latches will keep updating as long as the clock is high. By connecting the output of a flip-flop through an inverter back to the input, you can create a toggle flip-flop, which will flip its state once per clock cycle, dividing the clock by two. (This example will be important shortly.) If you try the same thing with a transparent latch, it will oscillate: as soon as the output flips, it will feed back to the latch input and flip again.

The schematic below shows how a latch can be implemented with transmission gates. When the clock is high, the first transmission gate passes the input through to the inverters and the output. When the clock is low, the second transmission gate creates a feedback loop for the inverters, so they hold their value, providing the latch action. Below, the same circuit is drawn with a multiplexer, which may be easier to understand: either the input or the feedback is selected for the inverters.

A latch implemented from transmission gates. Below, the same circuit is shown with a multiplexer.
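
In behavioral terms, the latch is simply "follow the input while the clock is high, hold it while the clock is low." A toy model (ignoring any inversion through the inverters):

```python
class Latch:
    """Transparent (level-triggered) latch: passes the input while the clock
    is high, holds the previous value while the clock is low."""
    def __init__(self):
        self.state = 0

    def update(self, d, clock):
        if clock:              # the input transmission gate is on
            self.state = d
        return self.state      # otherwise the feedback loop holds the value

latch = Latch()
print(latch.update(d=1, clock=1))  # 1: transparent, follows the input
print(latch.update(d=0, clock=0))  # 1: held, since the clock is low
```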

An edge-triggered flip-flop can be created by combining two latches in a primary/secondary arrangement. When the clock is low, the input will pass into the primary latch. When the clock switches high, two things happen. The primary latch will hold the current value of the input. Meanwhile, the secondary latch will start passing its input (the value from the primary latch) to its output, and thus the flip-flop output. The effect is that the flip-flop's output will be the value at the moment the clock goes high, and the flip-flop is insensitive to changes at other times. (The primary latch's value can keep changing while the clock is low, but this doesn't affect the flip-flop's output.)

Two latches, combined to form a flip-flop.
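
The same idea as a toy behavioral model (again, a sketch rather than the actual circuit): two latches open on opposite clock phases, so the output captures the input value at the rising clock edge.

```python
class Latch:
    def __init__(self):
        self.state = 0
    def update(self, d, clock):
        if clock:
            self.state = d
        return self.state

class FlipFlop:
    """Edge-triggered flip-flop from two latches: the primary is open while
    the clock is low, the secondary while the clock is high."""
    def __init__(self):
        self.primary, self.secondary = Latch(), Latch()
    def update(self, d, clock):
        mid = self.primary.update(d, not clock)     # primary latch
        return self.secondary.update(mid, clock)    # secondary latch

ff = FlipFlop()
ff.update(d=1, clock=0)         # the primary latch tracks the input while the clock is low
print(ff.update(d=0, clock=1))  # 1: the value at the rising edge appears at the output
```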

The flip-flops in the counter chip are based on the above design, but they have two additional features. First, the flip-flop can be loaded with a value under the control of a Preset Enable (PE) signal. Second, the flip-flop can either hold its current value or toggle its value, under the control of a Toggle (T) signal. Implementing these features requires two more multiplexers in the primary latch as shown below. The first multiplexer selects either the inverted output or uninverted output to be fed back into the flip-flop, providing the selectable toggle action. The second multiplexer is the latch's standard clocked multiplexer. The third multiplexer allows a "preset" value to be loaded directly into the flip-flop, bypassing the clock. (The preset value is inverted, since there are three inverters between the preset and the output.) The secondary latch is the same as before, except it provides the inverted and non-inverted outputs as feedback, allowing the flip-flop to either hold or toggle its value. This circuit illustrates how more complex flip-flops can be created from the building blocks that we've seen.

Schematic of the toggle flip-flop.

The gray letters in the schematic above match the earlier multiplexer diagram, showing how the three multiplexers were implemented on the die. The other multiplexer and the inverters are implemented in another block of circuitry. I won't explain that circuitry in detail since it doesn't illustrate any new principles.
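
As a sanity check on the toggle behavior, here's a simplified behavioral model (my own abstraction, ignoring the internal inversions and the exact multiplexer wiring) of a flip-flop with toggle and preset-enable controls:

```python
class ToggleFlipFlop:
    """Behavioral model: on each rising clock edge the flip-flop either holds
    or toggles, under control of T; PE loads a preset value directly."""
    def __init__(self):
        self.q = 0
        self.prev_clock = 0

    def update(self, clock, t, pe=0, preset=0):
        if pe:                                   # preset bypasses the clock
            self.q = preset
        elif clock and not self.prev_clock:      # rising clock edge
            self.q = (self.q ^ 1) if t else self.q
        self.prev_clock = clock
        return self.q

ff = ToggleFlipFlop()
for cycle in range(4):                  # with T held high, the output divides the clock by two
    ff.update(clock=1, t=1); ff.update(clock=0, t=1)
    print(cycle, ff.q)                  # q alternates 1, 0, 1, 0
```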

Routing in silicon: cross-unders

With just one metal layer for wiring, routing of signals on the chip is difficult and requires careful planning. Even so, there are some cases where one signal must cross another. This is accomplished by using silicon for a "cross-under", allowing a signal to pass underneath metal wiring. These cross-unders are avoided unless necessary because silicon has much higher resistance than metal. Moreover, the cross-under requires additional space on the die.

Three cross-unders on the die.

The images above show three cross-unders. In each one, signals are primarily routed in the metal layer, but a signal passes under the metal using a doped silicon region (which appears green). The first cross-under simply lets one signal cross under the second. The second image shows a signal branching as well as crossing under two signals. The third image shows a cross-under distributing a horizontal signal to the upper and lower halves of the chip, while crossing under multiple horizontal signals. Note the small oval contact between the green silicon region and the horizontal metal line, connecting them. It is easy to miss the small contact and think that the vertical signal is simply crossing under the horizontal signal, rather than branching.

About the chip

The focus of this article is the CMOS reverse engineering process rather than this specific chip, but I'll give a bit of information about the chip. The die has the Cyrillic characters ИЕ11 at the top indicating that the chip is a К561ИЕ11 or К564ИЕ11.12 The Soviet Union came up with a standardized numbering system for integrated circuits in 1968. This system is much more helpful than the American system of semi-random part numbers. In this part number, the 5 indicates a monolithic integrated circuit, while 61 or 64 is the series, specifically commercial-grade or military-grade clones of 4000 series CMOS logic. The character И indicates a digital circuit, while ИЕ is a counter. Thus, the part number systematically indicates that the integrated circuit is a CMOS counter.

The 561ИЕ11 turns out to be a copy of the Motorola MC14516 binary up/down counter.13 Conveniently, the Motorola datasheet provides a schematic (below). I won't explain the schematic in detail, but a quick overview may be helpful. The chip is a four-bit counter that can count up or down, and the heart of the chip is the four toggle flip-flops (red). To count up, a flip-flop is toggled if there is a carry from the lower bits, while counting down toggles a flip-flop if there is a borrow from the lower bits. (Much like base-10 long addition or subtraction.) The AND/NOR gates at the bottom (blue) look complex, but they are just generating the toggle signal T: toggle if the lower bits are all-1's and you're counting up, or if the lower bits are all-0's and you're counting down. The flip-flops can also be loaded in parallel from the P inputs. Additional logic allows the chips to be cascaded to form arbitrarily large counters; the carry-out pin of one chip is connected to the carry-in of the next.

Logic diagram of the MC14516 up/down counter chip, from the datasheet.
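
Here is a behavioral sketch of that toggle rule for a 4-bit up/down counter. It is a simplification (no preset, no carry in/out, no cascading), but it shows why the rule counts correctly: bit N toggles when all lower bits are 1 (counting up) or all 0 (counting down).

```python
def step(bits, up=True):
    """Advance a 4-bit counter one count. bits[0] is the least significant bit."""
    new = bits[:]
    for n in range(4):
        lower = bits[:n]                                    # the bits below bit n
        toggle = all(lower) if up else not any(lower)       # the T signal for this bit
        if toggle:
            new[n] ^= 1
    return new

state = [0, 0, 0, 0]
for _ in range(5):
    state = step(state, up=True)
print(state)    # [1, 0, 1, 0]: the value 5 in binary, least significant bit first
```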

I've labeled the die photo below with the pin functions and the functional blocks. Each quadrant of the chip handles one bit of the counter in a roughly symmetrical way. This quadrant layout accounts for the pin arrangement which otherwise appears semi-random with bits 3 and 0 on one side and bits 2 and 1 on the other, with inputs and output pins jumbled together. The toggle and carry logic is squeezed into the top and middle of the chip. You may recognize the large inverters next to each output pin. When reverse-engineering, look for large transistors next to pads to determine which pins are outputs.

The die with pins and functional blocks labeled.

Conclusions

This article has discussed the basic circuits that can be found in a CMOS chip. Although the counter chip is old and simple, later chips use the same principles. An important change in later chips is the introduction of silicon-gate transistors, which use polysilicon for the transistor gates and for an additional wiring layer. The circuits are the same, but you need to be able to recognize the polysilicon layer. Many chips have more than one metal layer, which makes it very hard to figure out the wiring connections. Finally, when the feature size approaches the wavelength of light, optical microscopes break down. Thus, these reverse-engineering techniques are only practical up to a point. Nonetheless, many interesting CMOS chips can be studied and reverse-engineered.

For more, follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon as @kenshirriff@oldbytes.space. Thanks to Martin Evtimov for providing the die photos.

Notes and references

  1. I'm not sure of the date and manufacturing location of the chip. I think the design is old, from the Soviet Union. (Motorola introduced the MC14516 around 1972 but I don't know when it was copied.) The wafer is said to be scrap from a Ukrainian manufacturer so it may have been manufactured more recently. The die has a symbol that might be a manufacturing logo, but nobody on Twitter could identify it.

    A symbol that appears on the die.

     

  2. For more about this chip, the Russian databook can be downloaded here; see Volume 5 page 501. 

  3. Early CMOS microprocessors include the 8-bit RCA 1802 COSMAC (1974) and the 12-bit Intersil 6100 (1974). The 1802 is said to be the first CMOS microprocessor. Mainstream microprocessors didn't switch to CMOS until the mid-1980s. 

  4. The chip in this article has metal-gate transistors, with aluminum forming the transistor gate. These transistors were not as advanced as the silicon-gate transistors that were developed in the late 1960s. Silicon gate technology was much better in several ways. First, silicon-gate transistors were smaller, faster, more reliable, and used lower voltages. Second, silicon-gate chips have a layer of polysilicon wiring in addition to the metal wiring; this made chip layouts about twice as dense. 

  5. To produce N-type silicon, the silicon is doped with small amounts of an element such as phosphorus or arsenic. In the periodic table, these elements are one column to the right of silicon so they have one "extra" electron. The free electrons move through the silicon, carrying charge. Because electrons are negative, this type of silicon is called N-type. Conversely, to produce P-type silicon, the silicon is doped with small quantities of an element such as boron. Since boron is one column to the left of silicon in the periodic table, it has one fewer free electron. A strange thing about semiconductor physics is that the missing electrons (called holes) can move around the silicon much like electrons, but carrying positive charge. Since the charge carriers are positive, this type of silicon is called P-type. For various reasons, electrons carry charge better than holes, so NMOS transistors work better than PMOS transistors. As a result, PMOS transistors need to be about twice the size of comparable NMOS transistors. This quirk is useful for reverse engineering, since it can help distinguish NMOS and PMOS transistors.

    The amount of doping required can be absurdly small, 20 atoms of boron for every billion atoms of silicon in some cases. A typical doping level for N-type silicon is 10¹⁵ atoms of phosphorus or arsenic per cubic centimeter, which sounds like a lot until you realize that pure silicon consists of 5×10²² atoms per cubic centimeter. A heavily doped P+ region might have 10²⁰ dopant atoms per cubic centimeter, one atom of boron per 500 atoms of silicon. (Doping levels are described here.) 

  6. This chip is built on a substrate of N-type silicon, with wells of P-type silicon for the NMOS transistors. Chips can be built the other way around, starting with P-type silicon and putting wells of N-type silicon for the PMOS transistors. Another approach is the "twin-well" CMOS process, constructing wells for both NMOS and PMOS transistors. 

  7. The bulk silicon voltage makes the boundary between a transistor and the bulk silicon act as a reverse-biased diode, so current can't flow across the boundary. Specifically, for a PMOS transistor, the N-silicon substrate is connected to the positive supply. For an NMOS transistor, the P-silicon well is connected to ground. A P-N junction acts as a diode, with current flowing from P to N. But the substrate voltages put P at ground and N at +5, blocking any current flow. The result is that the bulk silicon can be considered an insulator, with current restricted to the N+ and P+ doped regions. If this back bias gets reversed, for example, due to power supply fluctuations, current can flow through the substrate. This can result in "latch-up", a situation where the N and P regions act as parasitic NPN and PNP transistors that latch into the "on" state. This shorts power and ground and can destroy the chip. The point is that the substrate voltages are very important for proper operation of the chip. 

  8. Many inverters in this chip duplicate the transistors to increase the current output. The same effect could be achieved with single transistors with twice the gate width. (That is, twice the height in the diagrams.) Because these transistors are arranged in uniform rows, doubling the transistor height would mess up the layout, so using more transistors instead of changing the size makes sense. 

  9. Some chips use dynamic logic, in which case it is okay to leave the gate floating, neither pulled high nor low. Since the gate resistance is extremely high, the capacitance of a gate will hold its value (0 or 1) for a short time. After a few milliseconds, the charge will leak away, so dynamic logic must constantly refresh its signals before they decay.

    In general, the reason you don't want an intermediate voltage as the input to a CMOS circuit is that the voltage might end up turning the PMOS transistor partially on while also turning the NMOS transistor partially on. The result is high current flow from power to ground through the transistors. 

  10. One of the complicated logic gates on the die didn't match the implementation I expected. In particular, for some inputs, the output is neither pulled high nor low. Tracing the source of these inputs reveals what is going on: the gate takes both a signal and its complement as inputs. Thus, some of the "theoretical" inputs are not possible; these can't be both high or both low. The logic gate is optimized to ignore these cases, making the implementation simpler. 

  11. This schematic explains the physical layout of the 3-input NOR gate on the die, in case the wiring isn't clear. Note that the PMOS transistors are wired in series and the NMOS transistors are in parallel, even though both types are physically arranged in rows.

    The 3-input NOR gate on the die. This schematic matches the physical layout.

     

  12. The commercial-grade chips and military-grade chips presumably use the same die, but are distinguished by the level of testing. So we can't categorize the die as 561-series or 564-series. 

  13. Motorola introduced the MC14500 series in 1971 to fill holes in the CD4000 series. For more about this series, see A Strong Commitment to Complementary MOS





[#] Sat Feb 03 2024 09:58:58 UTC from rss <>

Subject: Reverse engineering standard cell logic in the Intel 386 processor

[Reply] [ReplyQuoted] [Headers] [Print]

The 386 processor (1985) was Intel's most complex processor at the time, with 285,000 transistors. Intel had scheduled 50 person-years to design the processor, but it was falling behind schedule. The design team decided to automate chunks of the layout, developing "automatic place and route" software.1 This was a risky decision since if the software couldn't create a dense enough layout, the chip couldn't be manufactured. But in the end, the 386 finished ahead of schedule, an almost unheard-of accomplishment.

In this article, I take a close look at the "standard cells" used in the 386, the logic blocks that were arranged and wired by software. Reverse-engineering these circuits shows how standard cells implement logic gates, latches, and other components with CMOS transistors. Modern integrated circuits still use standard cells, much smaller now, of course, but built from the same principles.

The photo below shows the 386 die with the automatic-place-and-route regions highlighted in red. These blocks of unstructured logic have cells arranged in rows, giving them a characteristic striped appearance. In comparison, functional blocks such as the datapath on the left and the microcode ROM in the lower right were designed manually to optimize density and performance, giving them a more solid appearance. As for other features on the chip, the black circles around the border are bond wire connections that go to the chip's external pins. The chip has two metal layers, a small number by modern standards, but a jump from the single metal layer of earlier processors such as the 286. The metal appears white in larger areas, but purplish where circuitry underneath roughens its surface. For the most part, the underlying silicon and the polysilicon wiring on top are obscured by the metal layers.

Die photo of the 386 processor with standard-cell logic highlighted in red.

Early processors in the 1970s were usually designed by manually laying out every transistor individually, fitting transistors together like puzzle pieces to optimize their layout. While this was tedious, it resulted in a highly dense layout. Federico Faggin, designer of the popular Z80 processor, describes finding that the last few transistors wouldn't fit, so he had to erase three weeks of work and start over. The closeup of the resulting Z80 layout below shows that each transistor has a different, complex shape, optimized to pack the transistors as tightly as possible.2

A closeup of transistors in the Zilog Z80 processor (1976). This chip is NMOS, not CMOS, which provides more layout flexibility. The metal and polysilicon layers have been removed to expose the underlying silicon. The lighter stripes over active silicon indicate where the polysilicon gates were. I think this photo is from the Visual 6502 project but I'm not sure.

Standard-cell logic is an alternative that is much easier than manual layout.3 The idea is to create a standard library of blocks (cells) to implement each type of gate, flip-flop, and other low-level component. To use a particular circuit, instead of arranging each transistor, you use the standard design. Each cell has a fixed height but the width varies as needed, so the standard cells can be arranged in rows. For example, the die photo below shows three cells in a row: a latch, a high-current inverter, and a second latch. This region has 24 transistors in total with PMOS above and NMOS below. Compare the orderly arrangement of these transistors with the Z80 transistors above.

Some standard cell circuitry in the 386. I removed the metal and polysilicon to show the underlying silicon. The irregular blotches are oxide that wasn't fully removed, and can be ignored.

The space between rows is used as a "wiring channel" that holds the wiring between the cells. The photo below zooms out to show four rows of standard cells (the dark bands) and the wiring in between. The 386 uses three layers for this wiring: polysilicon and the upper metal layer (M2) for vertical segments and the lower metal layer (M1) for horizontal segments.

Some standard-cell logic in the 386 processor.

To summarize, with standard cell logic, the cells are obtained from the standard cell library as needed, defining the transistor layout and the wiring inside the cell. However, the locations of each cell (placing) need to be determined, as well as how to arrange the wiring (routing). As will be seen, placing and routing the cells can be done manually or automatically.

Use of standard cells in the 386

Fairly late in the design process, the 386 team decided to use automatic place and route for parts of the chip. By using automatic place and route, 2,254 gates (consisting of over 10,000 devices) were placed and routed in seven weeks. (These numbers are from a paper "Automatic Place and Route Used on the 80386", co-written by Pat Gelsinger, now the CEO of Intel. I refer to this paper multiple times, so I'll call it APR386 for convenience.4) Automatic place and route was not only faster, but it avoided the errors that crept in when layout was performed manually.5

The "place" part of automatic place and route consists of determining the arrangement of the standard cells into rows to minimize the distance between connected cells. Running long wires between cells wastes space on the die, since you end up with a lot of unnecessary metal wiring. But more importantly, long paths have higher resistance, slowing down the signals. Placement is a difficult optimization problem that is NP-complete. Moreover, the task was made more complicated by weighting paths by importance and electrical characteristics, classifying signals as "normal", "fast", or "critical". Paths were also weighted to encourage the use of the thicker M2 metal layer rather than the lower M1 layer.

The 386 team solved the placement problem with a program called Timberwolf, developed by a Berkeley grad student. As one member of the 386 team said, "If management had known that we were using a tool by some grad student as a key part of the methodology, they would never have let us use it." Timberwolf used a simulated annealing algorithm, based on a simulated temperature that decreased over time. The idea is to randomly move cells around, trying to find better positions, but gradually tighten up the moves as the "temperature" drops. At the end, the result is close to optimal. The purpose of the temperature is to avoid getting stuck in a local minimum by allowing "bad" changes at the beginning, but then tightening up the changes as the algorithm progresses.
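
Timberwolf itself was far more sophisticated, but the simulated-annealing idea fits in a few lines of Python. The toy version below swaps cells within a single row, always accepts improvements, and accepts worse placements with a probability that shrinks as the temperature drops; the cost function (total wire length between connected cells) and the cooling schedule are stand-ins of my own, not Timberwolf's.

```python
import math, random

def wirelength(order, nets):
    """Toy cost: sum of distances between connected cells in a single row."""
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in nets)

def anneal(cells, nets, temp=10.0, cooling=0.995, steps=5000):
    order = cells[:]
    cost = wirelength(order, nets)
    for _ in range(steps):
        i, j = random.sample(range(len(order)), 2)       # propose swapping two cells
        order[i], order[j] = order[j], order[i]
        new_cost = wirelength(order, nets)
        delta = new_cost - cost
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            cost = new_cost                               # accept, even a "bad" move early on
        else:
            order[i], order[j] = order[j], order[i]       # reject: undo the swap
        temp *= cooling                                   # gradually lower the temperature
    return order, cost

cells = list("ABCDEFGH")
nets = [("A", "H"), ("A", "B"), ("C", "G"), ("D", "E")]
print(anneal(cells, nets))
```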

Once the cells were placed in their positions, the second step was "routing", generating the layout of all the wiring. A suitable commercial router was not available in 1984, so Intel developed its own. As routing is a difficult problem (also NP-complete), they took an iterative heuristic approach, repeatedly routing until they found the smallest channel height that would work. (Thus, the wiring channels are different sizes as needed.) Then they checked the R-C timing of all the signals to find any signals that were too slow. Designers could boost the size of the associated drivers (using the variety of available standard cells) and try the routing again.

Brief CMOS overview

The 386 was the first processor in Intel's x86 line to be built with a technology called CMOS instead of using NMOS. Modern processors are all built from CMOS because CMOS uses much less power than NMOS. CMOS is more complicated to construct, though, because it uses two types of transistors—NMOS and PMOS—so early processors were typically NMOS. But by the mid-1980s, the advantages of switching to CMOS were compelling.

The diagram below shows how an NMOS transistor is constructed. The transistor can be considered a switch between the source and drain, controlled by the gate. The source and drain regions (green) consist of silicon doped with impurities to change its semiconductor properties, forming N+ silicon. The gate consists of a layer of polysilicon (red), separated from the silicon by a very thin insulating oxide layer. Whenever polysilicon crosses active silicon, a transistor is formed. A PMOS transistor has similar construction except it swaps the N-type and P-type silicon, consisting of P+ regions in a substrate of N silicon.

Diagram showing the structure of an NMOS transistor.

The NMOS and PMOS transistors are opposite in their construction and operation. An NMOS transistor turns on when the gate is high, while a PMOS transistor turns on when the gate is low. An NMOS transistor is best at pulling its output low, while a PMOS transistor is best at pulling its output high. In a CMOS circuit, the transistors work as a team, pulling the output high or low as needed; this is the "Complementary" in CMOS. (The behavior of MOS transistors is complicated, so this description is simplified, just enough to understand digital circuits.)

One complication is that NMOS transistors are built on P-type silicon, while PMOS transistors are built on N-type silicon. Since the silicon die itself is N silicon, the NMOS transistors need to be surrounded by a tub or well of P silicon.6 The cross-section diagram below shows how the NMOS transistor on the left is embedded in a well of P-type silicon.

Simplified structure of the CMOS circuits.

For proper operation, the silicon that surrounds transistors needs to be connected to the appropriate voltage through "tap" contacts.7 For PMOS transistors, the substrate is connected to power through the taps, while for NMOS transistors the well region is connected to ground through the taps. The chip needs to have enough taps to keep the voltage from fluctuating too much; each standard cell typically has a positive tap and a ground tap.

The actual structure of the integrated circuit is much more three-dimensional than the diagram above, due to the thickness of the various layers. The diagram below is a more accurate cross-section. The 386 has two layers of metal: the lower metal layer (M1) in blue and the upper metal layer (M2) in purple. Polysilicon is colored red, while the insulating oxide layers are gray.

Cross-section of CHMOS III transistors. From A double layer metal CHMOS III technology, image colorized by me.

This complicated three-dimensional structure makes it harder to interpret the microscope images. Moreover, the two metal layers obscure the circuitry underneath. I have removed various layers with acids for die photos, but even so, the images are harder to interpret than those of simpler chips. If the die photos look confusing, don't be surprised.

A logic gate in CMOS is constructed from NMOS and PMOS transistors working together. The schematic below shows a NAND gate with two PMOS transistors in parallel above and two NMOS transistors in series below. If both inputs are high, the two NMOS transistors turn on, pulling the output low. If either input is low, a PMOS transistor turns on, pulling the output high. (Recall that NMOS and PMOS are opposites: a high voltage turns an NMOS transistor on while a low voltage turns a PMOS transistor on.) Thus, the CMOS circuit below produces the desired output for the NAND function.

A CMOS NAND gate.

The diagram below shows how this NAND gate is implemented in the 386 as a standard cell.9 A lot is going on in this cell, but it boils down to four transistors, as in the schematic above. The yellow region is the P-type silicon that forms the two PMOS transistors; the transistor gates are where the polysilicon (red) crosses the yellow region.8 (The middle yellow region is the drain for both transistors; there is no discrete boundary between the transistors.) Likewise, the two NMOS transistors are at the bottom, where the polysilicon (red) crosses the active silicon (green). The blue lines indicate the metal wiring for the cell. I thinned these lines to make the diagram clearer; in the actual cell, the metal lines are as thick as they can be without touching, so they cover most of the cell. The black circles are contacts, connections between the metal and the silicon or polysilicon. Finally, the well taps are the opposite type of silicon, connected to the underlying silicon well or substrate to keep it at the proper voltage.

A standard cell for NAND in the 386.

Wiring to a cell's inputs and output takes place at the top or bottom of the cell, with wiring in the channels between rows of cells. The polysilicon input and output lines are thickened at the top and bottom of the cell to allow connections to the cell. The wiring between cells can be done with either polysilicon or metal. Typically the upper metal layer (M2) is used for vertical wiring, while the lower metal layer (M1) is used for horizontal runs. Since each standard cell only uses M1, vertical wiring (M2) can pass over cells. Moreover, a cell's output can also use a vertical metal wire (M2) rather than the polysilicon shown. The point is that there is a lot of flexibility in how the system can route wires between the cells. The power and ground wires (M1) are horizontal so they can run from cell to cell and a whole row can be powered from the ends.

The photo below shows this NAND cell with the metal layers removed by acid, leaving the silicon and the polysilicon. You can match the features in the photo with the diagram above. The polysilicon appears green due to thin-film effects. At the bottom, two polysilicon lines are connected to the inputs.

Die photo of the NAND standard cell with the metal layers removed. The image isn't as clear as I would like, but it was very difficult to remove the metal without destroying the polysilicon.

The photo below shows how the cell appears in the original die. The two metal layers are visible, but they hide the polysilicon and silicon underneath. The vertical metal stripes are the upper (M2) wiring while the lower metal wiring (M1) makes up the standard cell. It is hard to distinguish the two metal layers, which makes interpretation of the images difficult. Note that the metal wiring is wide, almost completely covering the cell, with small gaps between wires. The contacts are visible as dark circles. It is hard to recognize the standard cells from the bare die, as the contact pattern is the only distinguishing feature.

Die photo of the NAND standard cell showing the metal layer.

One of the interesting features of the 386's standard cell library is that each type of logic gate is available in multiple drive strengths. That is, cells are available with small transistors, large transistors, or multiple transistors in parallel. Because the wiring and the transistor gates have capacitance, a delay occurs when changing state. Bigger transistors produce more current, so they can switch the values on a wire faster. But there are two disadvantages to bigger transistors. First, they take up more space on the die. But more importantly, bigger transistors have bigger gates with more capacitance, so their inputs take longer to switch. (In other words, increasing the transistor size speeds up the output but slows the input, so overall performance could end up worse.) Thus, the sizes of transistors need to be carefully balanced to achieve optimum performance.10 With a variety of sizes in the standard cell library, designers can make the best choices.
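
A crude RC model illustrates the tradeoff: doubling a gate's transistors roughly halves its output resistance but doubles its input capacitance, so the stage it drives speeds up while the stage driving it slows down. The values below are invented purely for illustration.

```python
# Crude RC delay model: stage delay ~ driver resistance x load capacitance.
R_UNIT = 1000.0   # ohms: output resistance of a unit-sized gate (made-up value)
C_UNIT = 10e-15   # farads: input capacitance of a unit-sized gate (made-up value)
C_LOAD = 50e-15   # farads: fixed wire/load capacitance on the output (made-up value)

def path_delay(size):
    """A unit-sized gate drives a gate of the given size, which drives a fixed load."""
    stage1 = R_UNIT * (size * C_UNIT)      # larger gate = more input capacitance to charge
    stage2 = (R_UNIT / size) * C_LOAD      # larger gate = lower resistance, more drive
    return stage1 + stage2

for size in (1, 2, 4, 8):
    print(size, f"{path_delay(size) * 1e12:.1f} ps")   # fastest around size 2 in this toy model
```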

The image below shows a small NAND gate. The design is the same as the one described earlier, but the transistors are much smaller. (Note that there is one row of metal contacts instead of two or three.) The transistor gates are about half as wide (measured vertically) so the NAND gate will produce about half the output current.11

Die photo of a small NAND standard cell with the metal removed.

Since the standard cells are all the same height, the maximum size of a transistor is limited. To provide a larger drive strength, multiple transistors can be used in parallel. The NAND gate below uses 8 transistors, four PMOS and four NMOS, providing twice as much current.

A large NAND gate as it appears on the die, with the metal removed. The left side is slightly obscured by some remaining oxide.

The diagram below shows the structure of the large NAND gate, essentially two NAND gates in parallel. Note that input 1 must be provided separately to both halves by the routing outside the cell. Input 2, on the other hand, only needs to be supplied to the cell once, since it is wired to both halves inside the cell.

A diagram showing the structure of the large NAND gate.

Inverters are also available in a variety of drive strengths, from very small to very large, as shown below. The inverter on the left uses the smallest transistors, while the inverter on the right not only uses large transistors but is constructed from six inverters in parallel. One polysilicon input controls all the transistors.

A small inverter and a large inverter.

A more complex standard cell is XOR. The diagram below shows an XOR cell with large drive current. (There are smaller XOR cells). As with the large NAND gate, the PMOS transistors are doubled up for more current. The multiple input connections are handled by the routing outside the cell. Since the NMOS transistors don't need to be doubled up, there is a lot of unused space in the lower part of the cell. The extra space is used for a very large tap contact, consisting of 24 contacts to ground the well.

The structure of an XOR cell with large drive current.

XOR is a difficult gate to build with CMOS. The cell above implements it by combining a NOR gate and an AND-NOR gate, as shown below. You can verify that if both inputs are 0 or both inputs are 1, the output is forced low as desired. In the layout above, the NOR gate is on the left, while the AND-NOR gate has the AND part on the right. A metal wire down the center connects the NOR output to the AND-NOR input. The need for two sub-gates is another reason why the XOR cell is so large.

Schematic of the XOR cell.
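
You can check the decomposition with a quick truth-table sketch:

```python
from itertools import product

def xor_cell(a, b):
    """XOR built the way the cell does it: a NOR gate feeding an AND-NOR gate."""
    nor = not (a or b)
    return int(not ((a and b) or nor))   # AND-NOR: low if both inputs are 1 or both are 0

for a, b in product([0, 1], repeat=2):
    print(a, b, "->", xor_cell(a, b))    # matches a ^ b
```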

I'll describe one more cell, the latch, which holds one bit and is controlled by a clock signal. Latches are heavily used in the 386 whenever a signal needs to be remembered or a circuit needs to be synchronous. The 386 has multiple types of standard cell latches including latches with set or reset controls and latches with different drive strengths. Moreover, two latches can be combined to form an edge-triggered flip-flop standard cell.

The schematic below shows the basic latch circuit, the most common type in the 386. On the right, two inverters form a loop. This loop can stably hold a 0 or 1 value. On the left, a PMOS transistor and an NMOS transistor form a transmission gate. If the clock is high, both transistors will turn on and pass the input through. If the clock is low, both transistors will turn off and block the input. The trick to the latch is that one inverter is weak, producing just a small current. The consequence is that the input can overpower the inverter output, causing the inverter loop to switch to the input value. The result is that when the clock is high, the latch will pass the input value through to the output. But when the clock is low, the latch will hold its previous value. (The output is inverted with respect to the input, which is slightly inconvenient but reduces the size of the latch.)

Schematic of a latch.

The standard cell layout of the latch (below) is complicated, but it corresponds to the schematic. At the left are the PMOS and NMOS transistors that form the transmission gate. In the center is the weak inverter, with its output to the left. The weak transistors are in the middle; they are overlapped by a thick polysilicon region, creating a long gate that produces a low current.12 At the right is the inverter that drives the output. The layout of this circuit is clever, designed to make the latch as compact as possible. For example, the two inverters share power and ground connections. Notice how the two clock lines pass from top to bottom through gaps in the active silicon so each line only forms one transistor. Finally, the metal line in the center connects the transmission gate outputs and the weak inverter output to the other inverter's input, but asymmetrically at the top so the two inverters don't collide.

The standard cell layout of a latch.

To summarize, I examined many (but not all) of the standard cells in the 386 and found about 70 different types of cells. These included the typical logic gates with various drive strengths: inverters, buffers, XOR, XNOR, AND-NOR, and 3- and 4-input logic gates. There are also transmission gates including ones that default high or low, as well as multiplexers built from transmission gates. I found a few cells that were surprising such as dual inverters and a combination 3-input and 2-input NAND gate. I suspect these consist of two standard cells that were merged together, since they seem too specialized to be part of a standard cell library.

The APR386 paper showed six of the standard cells in the 386 with the diagram below. The small and large inverters are the same as the ones described above, as is the NAND gate NA2B. The latch is similar to the one described above, but with larger transistors. The APR386 paper also showed a block of standard cells, which I was able to locate in the 386.13

Examples of standard cells, from APR386. The numbers are not defined but may indicate input and output capacitance.

Intel's standard cell line

Intel productized its standard cells around 1986 as a 1.5 µm library using Intel's CMOS technology (called CHMOS III).14 Although the library had over 100 cell types, it was very limited compared to the cells used inside the 386. The library included logic gates, flip-flops, and latches as well as scalable registers, counters, and adders. Most gates only came in one drive strength. Even inverters only came in "normal" and "high" drive strength. I assume these cells are the same as the ones used in the 386, but I don't have proof. The library also included larger devices such as a cell-compatible 80C51 microcontroller and PC peripheral chips such as the 8259 programmable interrupt controller and the 8254 programmable interval timer. I think these were re-implemented using standard cells.

Intel later produced a 1.0 µm library using CHMOS IV, for use "both by ASIC customers and Intel's internal chip designers." This library had a larger collection of drive strengths. The 1.0 µm library included the 80C186 and associated peripheral chips.

Layout techniques in the 386

In this section, I'll look at the active silicon regions, making the cells themselves more visible. In the photos below, I dissolved the metal and polysilicon, leaving the active silicon. (Ignore the irregular greenish shapes; these are oxide that wasn't fully removed.)

The photo below shows the silicon for three rows of standard cells using automatic place and route. You can see the wide variety of standard cell widths, but the height of the cells is constant. The transistor gates are visible as the darker vertical stripes across the silicon. You may be able to spot the latch in each row, distinguished by the long, narrow transistors of the weak inverters.

Three rows of standard cells that were automatically placed and routed.

In the first row, the larger PMOS transistors are on top, while the smaller NMOS transistors are below. This pattern alternates from row to row, so the second row has the NMOS transistors on top and the third row has the PMOS transistors on top. The height of the wiring channel between the cells is variable, made as small as possible while fitting the wiring.

The 386 also contains regions of standard cells that were apparently manually placed and routed, as shown in the photo below. Using standard cells avoids the effort of laying out each transistor, so it is still easier than a fully custom layout. These cells are in rows, but the rows are now double rows with channels in between. The density is higher, but routing the wires becomes more challenging.

Three rows of standard cells that were manually placed and routed.

For critical circuitry such as the datapath, the layout of each transistor was optimized. The register file, for example, has a very dense layout as shown below. As you can see, the density is much higher than in the previous photos. (The three photos are at the same scale.) Transistors are packed together with very little wasted space. This makes the layout difficult since there is little room for wiring. For this particular circuit, the lower metal layer (M1) runs vertically with signals for each bit while the upper metal layer (M2) runs horizontally for power, ground, and control signals.15

The register file of the 386, with its dense, manually optimized layout.

The point of this is that the 386 uses a variety of different design techniques, from dense manual layout to much faster automated layout. Different techniques were used for different parts of the chip, based on how important it was to optimize. For example, circuits in the datapath were typically repeated 32 times, once for each bit, so manual effort was worthwhile. The most critical functional blocks were the microcode ROM (CROM), large PLAs, ALU, TLB (translation lookaside buffer), and the barrel shifter.16

Conclusions

Standard cell logic and automatic place and route have a long history before the 386, back to the early 1970s, so this isn't an Intel invention.17 Nonetheless, the 386 team deserves the credit for deciding to use this technology at a time when it was a risky decision. They needed to develop custom software for their placing and routing needs, so this wasn't a trivial undertaking. This choice paid off and they completed the 386 ahead of schedule. The 386 ended up being a huge success for Intel, moving the x86 architecture to 32-bits and defining the dominant computer architecture for the rest of the 20th century.

If you're interested in standard cell logic, I wrote about standard cell logic in an IBM chip. I plan to write more about the 386, so follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space. Thanks to Pat Gelsinger and Roxanne Koester for providing helpful papers.

Notes and references

  1. The decision to use automatic place and route is described on page 13 of the Intel 386 Microprocessor Design and Development Oral History Panel, a very interesting document on the 386 with discussion from some of the people involved in its development. 

  2. Circuits that had a high degree of regularity, such as the arithmetic/logic unit (ALU) or register storage were typically constructed by manually laying out a block to implement a bit and then repeating the block as needed. Because a circuit was repeated 32 times for the 32-bit processor, the additional effort was worthwhile. 

  3. An alternative layout technique is the gate array, which doesn't provide as much flexibility as a standard cell approach. In a gate array (sometimes called a master slice), the chip had a fixed array of transistors (and often resistors). The chip could be customized for a particular application by designing the metal layer to connect the transistors as needed. The density of the chip was usually poor, but gate arrays were much faster to design, so they were advantageous for applications that didn't need high density or produced a relatively small volume of chips. Moreover, manufacturing was much faster because the silicon wafers could be constructed in advance with the transistor array and warehoused. Putting the metal layer on top for a particular application could then be quick. Similar gate arrays used a fixed arrangement of logic gates or flip-flops, rather than transistors. Gate arrays date back to 1967

  4. The full citation for the APR386 paper is "Automatic Place and Route Used on the 80386" by Joseph Krauskopf and Pat Gelsinger, Intel Technology Journal, Spring 1986. I was unable to find it online. 

  5. Once the automatic place and route process had finished, the mask designers performed some cleanup along with compaction to squeeze out wasted space, but this was a relatively minor amount of work.

    While manual optimization has benefits, it can also be overdone. When the manufacturing process improved, the 80386 moved from a 1.5 µm process to a 1 µm process. The layout engineers took advantage of this switch to optimize the standard cell circuitry, manually squeezing out some extra space. Unfortunately, optimizing one block of a die doesn't necessarily make the die smaller, since the size is constrained by the largest blocks. The result is that the optimized 80386 has blocks of empty space at the bottom (visible as black rectangles) and the standard-cell optimization didn't provide any overall benefit. (As the Pentium Pro chief architect Robert Colwell explains, "Removing the state of Kansas does not make the perimeter of the United States any smaller.")

    Comparison of the 1.5 µm die and the 1 µm die at the same scale. Photos courtesy of Antoine Bercovici.

    At least compaction went better for the 386 than for the Pentium. Intel performed a compaction on the Pentium shortly before release, attempting to reduce the die size. The engineers shrunk the floating point divider, removing some lookup table cases that they proved were unnecessary. Unfortunately, the proof was wrong, resulting in floating point errors in a few cases. This caused the infamous Pentium FDIV bug, a problem that became highly visible to the general public. Replacing the flawed processors cost Intel 475 million dollars. And it turned out that shrinking the floating point divider had no effect on the overall die size.

    Coincidentally, early models of the 386 had an integer multiplication bug, but Intel fixed this with little cost or criticism. The 386 bug was an analog issue that only showed up unpredictably with a combination of argument values, temperature, and manufacturing conditions. 

  6. This chip is built on a substrate of N-type silicon, with wells of P-type silicon for the NMOS transistors. Chips can be built the other way around, starting with P-type silicon and putting wells of N-type silicon for the PMOS transistors. Another approach is the "twin-well" CMOS process, constructing wells for both NMOS and PMOS transistors. 

  7. The bulk silicon voltage makes the boundary between a transistor and the bulk silicon act as a reverse-biased diode, so current can't flow across the boundary. Specifically, for a PMOS transistor, the N-silicon substrate is connected to the positive supply. For an NMOS transistor, the P-silicon well is connected to ground. A P-N junction acts as a diode, with current flowing from P to N. But the substrate voltages put P at ground and N at +5, blocking any current flow. The result is that the bulk silicon can be considered an insulator, with current restricted to the N+ and P+ doped regions. If this back bias gets reversed, for example, due to power supply fluctuations, current can flow through the substrate. This can result in "latch-up", a situation where the N and P regions act as parasitic NPN and PNP transistors that latch into the "on" state. This shorts power and ground and can destroy the chip. The point is that the substrate voltages are very important for the proper operation of the chip. 

  8. I'm using the standard CMOS coloring scheme for my diagrams. I'm told that Intel uses a different color scheme internally. 

  9. The schematic below shows the physical arrangement of the transistors for the NAND gate, in case it is unclear how to get from the layout to the logic gate circuit. The power and ground lines are horizontal so power can pass from cell to cell when the cells are connected in rows. The gate's inputs and outputs are at the top and bottom of the cell, where they can be connected through the wiring channels. Even though the transistors are arranged horizontally, the PMOS transistors (top) are in parallel, while the NMOS transistors (bottom) are in series.

    Schematic of the NAND gate as it is arranged in the standard cell.

     

  10. The 1999 book Logical Effort describes a methodology for maximizing the performance of CMOS circuits by correctly sizing the transistors. 

  11. Unfortunately, the word "gate" is used for both transistor gates and logic gates, which can be confusing. 

  12. You might expect that these transistors would produce more current since they are larger than the regular transistors. The reason is that a transistor's current output is proportional to the gate width divided by the length. Thus, if you make the transistor bigger in the width direction, the current increases, but if you make the transistor bigger in the length direction, the current decreases. You can think of increasing width as acting as multiple transistors in parallel. Increasing length, on the other hand, makes a longer path for current to get from the source to the drain, weakening it. 

  13. The APR386 paper discusses the standard-cell layout in detail. It includes a plot of a block of standard-cell circuitry (below).

    A block of standard-cell circuitry from APR386.

    After carefully studying the 386 die, I was able to find the location of this block of circuitry (below). The two regions match exactly; they look a bit different because the M1 metal layer (horizontal) doesn't show up in the plot above.

    The same block of standard cells on the 386 die.

     

  14. Intel's CHMOS III standard cells are documented in Introduction to Intel Cell-Based Design (1988). The CHMOS IV library is discussed in Design Methodology for a 1.0µ Cell-based Library Efficiently Optimized for Speed and Area. The paper Validating an ASIC Standard Cell Library covers both libraries. 

  15. For details on the 386's register file, see my earlier article

  16. Source: "High Performance Technology Circuits and Packaging for the 80386", Jan Prak, Proceedings, ICCD Conference, Oct. 1986. 

  17. I'll provide more history on standard cells in this footnote. RCA patented a bipolar standard cell in 1971, but this was a fixed arrangement of transistors and resistors, more of a gate array than a modern standard cell. Bell Labs researched standard cell layout techniques in the early 1970s, calling them Polycells, including a 1973 paper by Brian Kernighan. By 1979 A Guide to LSI Implementation discussed the standard cell approach and it was described as well-known in this patent application. Even so, Electronics called these design methods "futuristic" in 1980.

    Standard cells became popular in the mid-1980s as faster computers and improved design software made it practical to produce semi-custom designs that used standard cells. Standard cells made it to the cover of Digital Design in August 1985, and the article inside described numerous vendors and products. Companies like Zymos and VLSI Technology (VTI) focused on standard cells. Traditional companies such as Texas Instruments, NCR, GE/RCA, Fairchild, Harris, ITT, and Thomson introduced lines of standard cell products in the mid-1980s.