VLSI SoC Design: 2014

December 21, 2014

Puzzle: Ring Oscillators

Ring Oscillators are very commonly used circuits in SoCs, where they find their use in Voltage Controlled Oscillators used inside the PLLs, and also used as Silicon odometer circuits- which are used to track various parameters of the device like variation of timing with ageing etc.

It is also a well-known fact that ring oscillators have odd number of inverters connected in form of a chain, as shown below:

And the frequency of oscillations is given by the expression:

f = 1/(2NT)

N= Number of inverters (odd)

T= Propagation delay of a single inverter

Can you comment on the below circuit? Maybe an expression for it's oscillation frequency? Or perhaps any analytical expression (without bothering about the intricacies of various transistor parameters) for the voltage? Assume the operating voltage to be 1V.

Please post your answers here. I'd be happy to share the solution and my thought process in a couple of days.

December 17, 2014

Inverter vs Buffer Based Clock Tree

A buffer is nothing but two inverters connected back to back. Does it make any difference if the CTS (Clock Tree Synthesis) is done using buffers or inverters? What exactly are the pros and cons and what factors would the backend design engineers take into account while deciding how to build their clock trees? I'm gonna answer these questions in this post.

An inverter based clock tree:

To keep things simple and pertinent to the discussion, let's assume that we are using only a single kind of inverter (i.e. of let's say drive X) to build our clock trees. And all the inverters are placed equidistant from each other. The scenario is shown in Fig. 1. Advantage of using an inverter based clock tree is that the high pulse width and the low pulse width would be symmetrical. For the clock signal, this is a critical requirement, especially for SoCs which have a high interaction between the positive and negative edge triggered flip-flops.

Figure 1: Inverter Based Clock Tree giving equal rise and fall times

A buffer based clock tree:

While theoretically, one can create a buffer using two identical inverters connected back to back, that is generally not the way buffers are designed while designing the standard cell libraries. To save area, the first buffer is typically of a lower drive strength and is placed very closed to the second inverter. The second inverter, however, is of higher drive strength.

Figure 2: Buffer Based Clock Tree. Buffer is formed by connecting two invertes back to back

One must also notice that the delay of first inverter is dominated by the load of the second inverter because the wire length between these two inverters is very small, hence one can neglect the wire cap. But for the second inverter, the load comprises of the wire cap as well as the input cap of the next buffer. This introduces an asymmetry in the rise and fall delays, and hence the high and low pulse widths of the clock signal.

Figure 3: Difference in high and low pulse widths

For applications which have a very stringent requirement on the clock high and low pulse widths, one might prefer to use an inverter based clock tree over the buffer based clock tree.

Can we do something to make the buffer based clock tree work? The answer is yes! Let's take a look:

If we balance the load seen by first inverter and the load seen by the second inverter, we might be able to achieve equal rise and fall times, and hence equal high and low pulse widths for the clock transition signal.

In this approximation, we have modeled the wire in form of a T-model. And inverter is modeled using distributed RC model with it's "on" resistance and the diffusion capacitance.

Figure 4: RC delay model for inverters and wire

To have the equal pulse widths for high and low times, the RC delay observed by the first inverter must be equal to the RC delay of the second inverter.

Rchn,1 (CD,1 + CG2) = Rchp,2 (CD,2 + Cwire + CG,1) + Rwire/2 (Cwire + CG,1) + Rwire/2 (CG,1)

If this equation is satisfied, one can say with a fair degree of confidence that the high and low pulse widths would be approximately equal. The resistance and capacitance of the wire is the function of its length and the same can be conveyed by the standard cell library designer to the backend designers.

While most standard cell library vendors provide a symmetrical buffer, there could well be a difference of a few pico-seconds in the buffer rise and fall delay, which creates a difference in the high and low pulse widths. The variation in the duty cycle increases for deeper clock trees!

A simple way to mitigate the problem is to insert an inverter in the middle point of the buffer-cased clock tree. The major challenge, however, lies in finding this middle point. This ensures that high and low pulse widths of the clock reaching at the sink pins of flip-flops is indeed the same!

Figure 5: Inserting an inverter to maintain high and low pulse widths

July 28, 2014

Timing Analysis: Graph Based v/s Path Based

Hello folks! In this post, I'm gonna talk about the difference between two commonly used Static Timing Analysis methodologies, namely- the Graph Based Analysis and the Path Based Analysis.

I shall explain the difference with help of an example, shown below:

Now, we have two slews- fast and slow. In Graph Based Analysis, the worst slew propagation is ON, and the timing engine computes the worst case delays of all standard cells assuming the worst case slew for all the inputs of a gate. For example, assuming we need to compute the gate delays while doing setup analysis in a graph-based methodology for the path from FF1 to FF2:

The delay of the A-> Z (output) arc of the OR gate (in brown) would be computed assuming the real slew slew, i.e. slew at pin A.
However, the slew that will be propagated to the output pin of the OR gate would be the worst slew, which in this case would be computed taking into account the load at the output of the OR gate and slew at B.
Similarly, the delay of NAND gate (in blue) would be computed using the propagated slew coming from the previous stage i.e. the slew at pin B, but the slew that is propagated to the output would be according to the worst input slew, in this case slew at A.
And so on and so forth..

While performing hold analysis in a graph-based methodology, the situation reverses, the the delays of all cells would be computed assuming the best propagated slews (fast slews) for all nodes along the timing path!

This method of timing analysis is faster and uses lower memory footprint because the engine has to simple keep a tab of worst propagated slews for every pin in the design. This surely is pessimistic but again faster and therefore does not encumber the optimization tool by bounding the problem. For example, for the OR gate, the slew propagated to it's output is the worst slew, therefore the delays of subsequent gates after the OR gate could be pessimistic. The Path-based analysis comes to the rescue at some cost.

In Path-based analysis, the tools takes into account the actual slew for the arcs encountered while traversing any particular timing path. For example for the path shown above from FF1 to FF2, the arcs encountered are- A-> Z for OR gate; B-> Z for NAND gate; B-> Z for XOR gate and A-> Z for the inverted AND gate.

The tool would therefore consider the actual slews and this dispenses with the unnecessary pessimism!

Why not use PBA instead of GBA? Who's stopping us?

The answer is the run-time and memory foot-print. Since, PBA needs to compute the delays of standard cells in cognizance with the particular timing path, it incurs a penalty on the run-time to compute the delays, as opposed to GBA where the worse propagated slew was being used to compute the delays. In a nutshell, PBA is more accurate at the cost of run-time.

Typically, design engineers tend to use GBA for majority of the analysis. However, for the paths with a small violation (maybe of the order of 10s of ps) may be waived off by running PBA for the top-critical paths when the tape-out of the design is impending. One might argue that the extra effort spent in optimizing many other paths might have been saved had we used PBA earlier. And it is true! But like any engineering problem, there exists a trade-off and one needs to take a call between fixing the timing and a potential risk of delaying the tape-out!

July 23, 2014

Small Delay Defect Testing

Small Delay Defect Testing is an important step in ATPG testing towards realizing the strategic goal of zero DPPM (Defective Parts Per Million).

What is SDD? And why is it needed?

With shrinking technology nodes, the silicon is becoming increasingly susceptible to manufacturing defects like- stuck-at faults, transition faults etc. Variations in PVTs and OCVs make the silicon even more vulnerable to failure. While in Stuck-at capture, we test the device for manufacturing defects like- shorts and open; in At-Speed testing the device is tested for transition faults at the functional frequency.

Small Delays are any subtle variations in the delay of standard cells due to OCVs. These small delays (when accumulated) have the potential to fail the timing of the critical paths at the rated frequency. The testing mechanism deployed to test the faults arising due to these small delays is referred to as Small Delay Defect testing.

Sounds more like ATPG-Atspeed, right? Then, where lies the difference? The difference lies in the intent. While At-Speed Testing, the intent of DFT is to target fault simulation for each node by hook or by crook! With the focus of modern ATPG tools being on pattern reduction and hence the test time, it tries to target each node via the most convenient path, which is typically the shortest path.

Consider the below use-case.

Path 3 is the shortest path to target the node X. ATPG-Atspeed would take Path 3 to generate the patterns in order to test node X. As evident from above, Path 1 is the most timing critical path and therefore is more probable to violate timing on silicon. SDD targets such paths!

I have one question. And would request the readers to pour-in their view regarding it:

Small Delay Defect is traditionally done for setup-violations. But let's say, in case of significant clock skew between any two interacting flops, even hold timing would be critical. Can one possibly use something similar to target hold violations due to small delay as well?

Reference:

'Small Delay Defect Testing' by Roberto Mattiuzzo. Published at EDN, June 2009.

June 09, 2014

Feature Size of Transistors

The feature size of any semiconductor technology is defined as the minimum length of the MOS transistor channel between the drain and the source. The technology node has been scaling year by year. From early 2000s it has shrunk from 180nm to 22nm designs today (2014).

You would have probably noticed that the technology scaling has followed:

180nm -->> 130nm -->> 90nm -->> 65nm -->> 40nm -->> 28nm --> 22nm...

Ever wondered who decides these numbers? Are these arbitrary or there's some inherent logic behind these numbers? Let's see.

In early 1970s, Gordon Moore of Intel Corp. predicted that the number of transistor on a an integrated circuit would double itself in approximately 18-24 months. This prediction has proven to be accurate as scaling of technology continues unabated even after 40 years! Well, it's mainly because Moore's law has set out a challenge and a roadmap for designers to keep the scaling going!

You might ask yourself, why scaling? Here's why:

If double number of transistors can be incorporated on the same area, it means we get double (roughly) functionality for the same cost!
Alternatively, with scaling of technology, the same functionality will be available at roughly half the cost!
Moreover, smaller the channel length, faster would be the transient response of the transistors which would translate into better performance!

The goal of every design company now is to double the number of transistors on their integrated circuits with each technology. As you would notice, the numbers above from 180nm to 130nm to 90nm scale down by roughly a factor of 0.7. What's so special about 0.7?

If the feature size of the transistor is scaled by 0.7, the area would be scaled by a factor of 0.7²=0.49 =~ 0.5. That means if we scale our feature sizes by a factor of roughly 0.7, we would be able to pack twice the number of transistors on the same area as the previous technology!

May 15, 2014

Dynamics of Scan Testing

In accordance with the Moore’s Law, the number of transistors on integrated circuits doubles after every two years. While such high packing densities allow more functionality to be incorporated on the same chip, it is, however, becoming an increasingly ponderous task for the foundries across the globe to manufacture defect free silicon. This predicament has exalted the significance of Design for testability (DFT) in the design cycle over the last two decades. Shipping a defective part to a customer could not only result in loss of goodwill for the design companies, but even worse, might prove out to be catastrophic for the end users, especially if the chip is meant for automotive or medical applications.

Scan testing is a method to detect various manufacturing faults in the silicon. Although many types of manufacturing faults may exist in the silicon, in this post, we would discuss the method to detect faults like- shorts and opens.

Figure 1 shows the structure of a Scan Flip-Flop. A multiplexer is added at the input of the flip-flop with one input of the multiplexer acting as the functional input D, while other being Scan-In (SI). The selection between D and SI is governed by the Scan Enable (SE) signal.

Figure 1: Scan Flip-Flop

Using this basic Scan Flip-Flop as the building block, all the flops are connected in form of a chain, which effectively acts as a shift register. The first flop of the scan chain is connected to the scan-in port and the last flop is connected to the scan-out port. The Figure 2 depicts one such scan chain where clock signal is depicted in red, scan chain in blue and the functional path in black. Scan testing is done in order to detect any manufacturing fault in the combinatorial logic block. In order to do so, the ATPG tool try to excite each and every node within the combinatorial logic block by applying input vectors at the flops of the scan chain.

Figure 2: A Typical Scan Chain

Scan operation involves three stages: Scan-in, Scan-capture and Scan-out. Scan-in involves shifting in and loading all the flip-flops with an input vector. During scan-in, the data flows from the output of one flop to the scan-input of the next flop not unlike a shift register. Once the sequence is loaded, one clock pulse (also called the capture pulse) is allowed to excite the combinatorial logic block and the output is captured at the second flop. The data is then shifted out and the signature is compared with the expected signature. Modern ATPG tools can use the captured sequence as the next input vector for the next shift-in cycle. Moreover, in case of any mismatch, they can point the nodes where one can possibly find any manufacturing fault. Figure 3 shows the sequence of events that take place during scan-shifting and scan-capture.

Figure 3: Waveforms for Scan-Shift and Capture

Shift Frequency: A trade-off between Test Cost and Power Dissipation

It must be noted that the number of shift-in and shift-out cycles is equal to the number of flip-flops that are part of the scan chain. For a scan chain with, let’s say, 100 flops, one would require 100 shift-in cycles, 1 capture cycle and 100 shift-out cycles. The total testing time is therefore mainly dependent on the shift frequency because there is only capture cycle. Tester time is a significant parameter in determining the cost of a semiconductor chip and cost of testing a chip may be as high as 50% of the total cost of the chip. From timing point of view, higher shift frequency should not be an issue because the shift path essentially comprises of direct connection from the output of the preceding flop to the scan-input of the succeeding flop and therefore setup timing check would always be relaxed. Despite the fact that higher shift frequency would mean lower tester time and hence lower cost, the shift frequency is typically low (of the order of 10s of MHz). The reason for shifting at slow frequency lies in dynamic power dissipation.

It must be noted that during shift mode, there is toggling at the output of all flops which are part of the scan chain, and also within the combinatorial logic block, although it is not being captured. This results in toggling which could perhaps be more than that of the functional mode. Higher shift frequency could lead to two scenarios:

Voltage Droop: Higher rate of toggling within the chip would result in drawing more current from the voltage supply. And hence there would be a voltage droop because of the IR drop. This IR drop could well drop the voltage below the safe margin and the devices might fail to operate properly.
Increased Die Temperature: High switching activity might create local hot-spots within the die and thereby increase the temperature above the worst-case temperature for which timing was closed. This could again result in failure of operation, or in the worst case, it might cause thermal damage to the chip.

Therefore, there exists a trade-off. It is desired to run the scan shift at a lower frequency which must be dictated by the maximum permissible power dissipation within the chip. At the same time, the shift-frequency should not be too low, otherwise, it would risk increasing the tester time and hence the cost of the chip!

Pages