VLSI SoC Design: Clock Skew


April 15, 2017

Tuning CTS Recipe

I've been debugging and tuning my CTS recipe for several weeks now, and the exercise gave me a basic insight into the CTS algorithm and the various knobs available to designers for tuning CTS results to hit the desired skew, transition and latency targets.

In this blog post, I'll discuss those knobs while trying my best not to go into tool-specific commands/constructs, so as to keep the discussion conceptual and tool independent. Before we delve any deeper into these knobs, let's ask the basic question first: why do we need CTS to begin with, or what goals do we expect CTS to achieve for us? The answer is to create a balanced clock tree. A balanced clock tree simply means minimum skew between the sequentials in your design (of course, we are only interested in skew within the same clock group. Let me know in the comments if this part is not clear). In addition to minimizing the skew, we would also like to achieve minimum latency by adding the minimum number of clock buffers on the clock path, thereby ensuring less area, less routing congestion and, most importantly, no extra dynamic power dissipation!

Now, we have the required background to discuss the CTS knobs in detail! :)

1. Creating Skew Groups: Skew groups are basically groups of sink pins (clock end-points) which need to be balanced against each other. Some skew groups exist by default, while others might need to be created explicitly to help the CTS engine. We'll take a look at some use-cases.
Default skew groups: Let's say you have 5 clocks in your design.
Group1: CLK1, CLK2 and CLK3 are synchronous to each other.
Group2: CLK4, CLK5 are synchronous to each other.

Group1 and Group2 are logically exclusive, and therefore the clocks within each group are implicitly asynchronous to the clocks in the other group.
In this case, by defining clock groups, we have implicitly defined skew groups. The CTS engine would try to balance the latencies of CLK1, CLK2 and CLK3, and independently try to balance the clock latencies of CLK4 and CLK5.

Sometimes, however, designers might want to create some explicit skew groups on top of the implicit ones. Let's take a look at those use-cases.

The figure highlights the sequential clouds of devices working on CLK1, CLK2 and CLK3 respectively. Assume there's heavy traffic and interaction between CLK1 and CLK2 sequentials, while only a very few sequentials working on CLK3 interact with those working on CLK1 and CLK2. The clocks enter the partition via three different clock ports on the left side, and the distance between the CLK3 port and the CLK3 sequentials is the largest, so the CTS engine would need to insert more clock buffers to maintain the transition (Ask yourself why? What would be the caveat if clock transition goes bad? Puzzle: Clock Transition). Assume the average latency that CTS can manage for CLK3 sequentials is 150 ps, while for CLK1 and CLK2 sequentials it's 100 ps. In order to balance these three clocks, CTS will push the clock latency for CLK1 and CLK2 sequentials up to match the longest latency: 150 ps. If, as designers, we know that there is not much interaction between CLK3 sequentials and CLK1, CLK2 sequentials, or that even if there is, we have sufficient positive slack from a timing perspective (both hold and setup), we don't really need to balance these three clocks. We can create a separate skew group for CLK3 sequentials, thereby preventing the extra latency on the CLK1 and CLK2 trees. This would help us minimize clock tree buffers, the associated area, routing resources and power, and perhaps even the detrimental impact of OCV on the uncommon clock path. (Read the post: Common Path Pessimism for greater insight.)

Another case could be a hard IP in your design which is placed far away from the rest of the sequentials working on the same clock. If you know that there's minimal interaction between those sequentials and the hard IP, you may want to create a separate skew group for the hard IP clock pin.
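Going back to the CLK1/CLK2/CLK3 numbers above, here is a small Python sketch of why splitting skew groups saves latency. It is a deliberately simplified model, not how any real CTS engine works: I'm assuming that balancing a skew group just pads every clock in it up to the slowest member. The function name `balanced_latencies` and the dictionary layout are made up for illustration.

```python
def balanced_latencies(native_latency_ps, skew_groups):
    """Return the latency each clock ends up with after balancing.

    native_latency_ps: clock name -> best latency CTS could manage on its own.
    skew_groups: list of sets of clock names that are balanced together.
    Simplifying assumption: every clock in a group is padded up to the
    slowest member of that group.
    """
    result = {}
    for group in skew_groups:
        target = max(native_latency_ps[clk] for clk in group)
        for clk in group:
            result[clk] = target
    return result


# Numbers taken from the example in the text.
native = {"CLK1": 100, "CLK2": 100, "CLK3": 150}

# One skew group: CLK1 and CLK2 get dragged up to 150 ps.
one_group = balanced_latencies(native, [{"CLK1", "CLK2", "CLK3"}])

# CLK3 in its own skew group: CLK1 and CLK2 stay at 100 ps.
split = balanced_latencies(native, [{"CLK1", "CLK2"}, {"CLK3"}])

print(one_group)
print(split)
```

With a single group all three clocks land at 150 ps; with CLK3 split out, CLK1 and CLK2 stay at 100 ps, which is the saving in buffers, area and power described above.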

2. Sequential Clustering: (Different from Register Banking) CTS is performed after the placement step, and by that time all the sequentials and standard cells have been placed. This placement of sequentials is invariably driven only by the data path optimization constraints. In other words, the placement engine places sequentials at locations which it finds convenient for meeting timing, assuming ideal clock distribution. As depicted in the figure below, for some reason the placer decided to place a small bunch of sequentials working on CLK1 far away from the port, thereby threatening to shoot up the clock latency of all the CLK1 sequentials. Now, either you can create a separate skew group to decouple these sequentials, or you can re-run placement with all CLK1 sequentials tightly bounded together to prevent the latency (and hence clock skew) shoot-up.

3. Clock Ordering and "dont touch subtree": You might have cases in your design where there's clock multiplexing, let's say between functional and scan clocks, and you need to create a clock tree for both of them. compile_clock_tree usually works on a clock by clock basis. Let's say you were smart enough to enforce the order, so that the CTS engine builds the clock network for the fast functional clock first and then for the slower scan clock. That's a reasonable approach: skew, transition and latency targets are tighter and more difficult to meet for faster clocks, and by building the tree for faster clocks first, you are giving the engine the leeway to do its best possible job. However, when it tries to balance the network for the scan clock, it can touch the functional clock network as well. One key difference between functional and scan clocks, in addition to the difference in clock frequencies, is that the scan clock has a greater fan-out than the functional clocks, and therefore there is more scope for the CTS engine to goof up! To prevent this, we need to do two things:

a) Enforce CTS order to construct the clock tree for faster clocks first and slower clocks next

b) In order to prevent slow clock from altering the clock tree network of fast clocks, we need to apply a dont_touch_subtree exception on the MUX input of the slower clock.

4. Divided Clocks and "stop_pins": By default, the CLK pin of every flop-based divider is treated as a "non-stop-pin". This means CTS would consider the clk -> out arc of these divider flops to be a "through-pin" and try to balance the latencies of the master clock and the generated clock. Now, consider the case shown below. There is more than one way to solve the problem, and which of the two methods below gives you better results would depend on the design:

a) Creating a different skew group for the sequentials placed far away. This would de-couple the sequentials placed nearby and the ones placed far away. And CTS engine would be able to do a decent job.

b) Another experiment well worth a shot is defining the CLK pin of the divider flop as a "stop_pin", so that CTS treats all the master clock's sequentials, including the divider flop, as one group and does a relatively good job of balancing them. This keeps the latency of the master clock in check and avoids a latency shoot-up.

5. Exclude Clock from CTS: If there are two clocks defined at the same pin/port with different clock periods, whether they be synchronous or asynchronous, it might be a good idea to exclude the slower clock from CTS altogether to prevent CTS from touching the same clock network twice and surprising you with the results.

6. Clock used as data and "exclude pins": You might have some cases where the clock is used as data inside your design. The CTS engine would be oblivious to this fact and might go crazy while building the clock tree. In these cases, it would be a good idea to explicitly mark the beginning of the data path as an "exclude_pin" to guide the CTS engine to exclude anything further from clock tree balancing!

I couldn't think of any more cases. If you have some interesting use cases that I might have missed, kindly share them in the comments. :)

April 13, 2013

Clock Jargon: Important Terms

Clock to an SoC is like blood to a human body. Just the way blood flows to each and every part of the body and regulates metabolism, clock reaches each and every sequential device and controls the digital events inside the SoC. There are many terms which modern designers use in relation to the clock and while building the Clock Tree, the backend team carefully monitors these. Let's have a look at them.

Clock Latency: Clock Latency is the general term for the delay that the clock signal takes between any two points. It can be from source (PLL) to the sink pin (Clock Pin) of registers or between any two intermediate points. Note that it is a general term and you need to know the context before making any guess about what is exactly meant when someone mentions clock latency.
Source Insertion Delay: This refers to the clock delay from the clock origin point, which could be the PLL or maybe the IRC (Internal Reference Clock) to the clock definition point.
Network Insertion Delay: This refers to the clock delay from the clock definition point to the sink pin of the registers.

Consider a hierarchical design where we have multiple people working on multiple partitions or sub-modules. The tool would be oblivious to the "top" or any logic outside the block. The block owner would define a clock at the port of the block (as shown below) and carry out the physical design activities. The block owner would only see the Network Insertion Delay and can only model the Source Insertion Delay for the block.
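To make the split concrete, here is a tiny Python sketch of how the two components add up at a register's clock pin. The class name `ClockLatency` and the picosecond values are hypothetical and only for illustration; the source part is what a block owner would have to model, while the network part is what is actually visible inside the block.

```python
from dataclasses import dataclass


@dataclass
class ClockLatency:
    # Clock origin (e.g. PLL) -> clock definition point (modelled at block level).
    source_insertion_ps: int
    # Clock definition point -> clock pin of the register (seen inside the block).
    network_insertion_ps: int

    @property
    def total_ps(self) -> int:
        """Delay from the clock origin all the way to the sink pin."""
        return self.source_insertion_ps + self.network_insertion_ps


# Hypothetical numbers: 400 ps modelled outside the block, 120 ps inside it.
ff_clock = ClockLatency(source_insertion_ps=400, network_insertion_ps=120)
print(ff_clock.total_ps)
```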

Having discussed latency, we now focus our attention on another important clock parameter: the Skew.

We discussed the concept of skew and its implication on timing in the post: Clock Skew: Implication on Timing. It would be prudent to go through that post before proceeding further. We shall now look at the meaning of the terms Global Skew and Local Skew.

Local Skew is the skew between any two related flops. By related we mean that the flops exist in the fan-in or fan-out cone of each other.
Global Skew is the skew between any two non-related flops in the design. By non-related we mean that the two flops do not exist in the fan-out or fan-in cone of each other and hence are in a way mutually exclusive.
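Here is a rough Python sketch of the two definitions above. Given clock arrival times at each flop and the set of related (launch-capture) pairs, it reports the worst skew among related pairs as local skew and the worst skew among non-related pairs as global skew. Reporting the worst case per category is my own assumption, as are the function name, the flop names and the arrival times.

```python
from itertools import combinations


def local_and_global_skew(arrival_ps, related_pairs):
    """Return (worst local skew, worst global skew) in ps.

    arrival_ps: flop name -> clock arrival time at its clock pin.
    related_pairs: pairs of flops that sit in each other's fan-in/fan-out cone.
    """
    related = {frozenset(pair) for pair in related_pairs}
    local_skew = 0
    global_skew = 0
    for a, b in combinations(sorted(arrival_ps), 2):
        skew = abs(arrival_ps[a] - arrival_ps[b])
        if frozenset((a, b)) in related:
            local_skew = max(local_skew, skew)    # related flops
        else:
            global_skew = max(global_skew, skew)  # non-related flops
    return local_skew, global_skew


# Hypothetical arrival times and timing relationships.
arrivals = {"FF1": 100, "FF2": 110, "FF3": 140, "FF4": 95}
pairs = [("FF1", "FF2"), ("FF2", "FF3")]

print(local_and_global_skew(arrivals, pairs))
```

In this made-up example the worst local skew comes from FF2/FF3 and the worst global skew from FF3/FF4, a pair that never exchanges data.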

In the next post we would discuss the implications of big clock latency on the timing. Please feel free to post your thoughts at my<dot>personal<dot>log<at>gmail<dot>com.

March 09, 2013

Clock Skew: Implication on Timing

Clock Skew is an important parameter that greatly influences the timing checks, and you would often find backend design engineers keeping a close eye on the clock skew numbers.

Clock Skew: The difference in arrival times of the clock signal at any two flops which are interacting with one another is referred to as clock skew. Having said that, please note that skew only makes sense for two flops which are interacting with one another, i.e. they make a launch-capture pair.

If the clock takes more time to reach the capture flop than the launch flop, we refer to it as Positive Clock Skew. And when the clock takes less time to reach the capture flop than the launch flop, we refer to it as Negative Clock Skew.

The figure below describes positive & negative clock skew. Assume the delays of clock tree buffers to be the same.

How does clock skew impact the timing checks, in particular, setup and hold? Consider the above example where FF1 is the launching flop and FF2 is the capturing flop. If the clock skew between FF1 and FF2 was zero, the setup and hold checks would be as follows:

Positive Skew: Now imagine the case where clock skew is positive. Here, the clock at FF2 takes more time to arrive than the clock at FF1. Recall that the setup check means that the data launched should reach the capture flop at least setup time before the next clock edge. As evident in the figure below, the data launched from FF1 gets extra time equal to the skew to reach FF2. Hence setup is relaxed! However, the hold check means that the data launched should reach the capture flop at least hold time after the clock edge. Since the capture clock edge now arrives later by the skew, the hold check becomes more critical in case of positive skew. Read the definitions again and again till you grasp it!!

Negative Skew: Here, the clock at FF1 takes more time to arrive than the clock at FF2. As evident in the figure below, the data launched from FF1 gets less time, by an amount equal to the skew, to reach FF2. Hence setup is more critical! However, hold is relaxed!

Some Key Points to Note:
Setup is the next cycle check, and positive skew relaxes the setup check and negative skew further tightens it.
Hold is the same cycle check, and negative skew relaxes the hold check and positive skew further tightens it.
Very rarely would one come across a path that is both setup and hold critical. Setup becomes critical when the data path is huge or you have a large negative skew; hold becomes critical when either the data path is minimal or you have a large positive skew. These conditions are mutually exclusive, and very rarely do they manifest themselves simultaneously. It is often the case when the uncommon clock path is significant. We shall discuss it in detail later.
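The key points above can be checked with a few lines of Python. This is only a sketch under simplifying assumptions: skew is taken as capture clock arrival minus launch clock arrival (so positive skew means the capture clock is later), a single data path delay is used for both checks, and all the numbers are made up. The function names are mine, not from any tool.

```python
def setup_slack(t_clk, skew, t_clk2q, t_data, t_setup):
    # Next cycle check: the capture edge is at t_clk + skew relative to launch.
    return (t_clk + skew) - (t_clk2q + t_data) - t_setup


def hold_slack(skew, t_clk2q, t_data, t_hold):
    # Same cycle check: data must stay stable until skew + t_hold after launch.
    return (t_clk2q + t_data) - (skew + t_hold)


# Hypothetical values in ps.
T_CLK, T_CLK2Q, T_DATA, T_SETUP, T_HOLD = 1000, 80, 60, 50, 40

results = {}
for skew in (0, 150, -150):
    results[skew] = (
        setup_slack(T_CLK, skew, T_CLK2Q, T_DATA, T_SETUP),
        hold_slack(skew, T_CLK2Q, T_DATA, T_HOLD),
    )
    print(f"skew={skew:+5d} ps  setup={results[skew][0]:5d} ps  hold={results[skew][1]:5d} ps")
```

With these numbers, a positive skew of 150 ps adds 150 ps of setup slack but pushes the hold slack negative on this short path, while a negative skew of the same size does the opposite.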

July 26, 2012

Sample Problem on Setup and Hold

In the post Timing: Basics, we discussed the basics of setup and hold times: why it is necessary to meet the setup and hold timing requirements, and how frequency affects setup but does not affect hold.

Let us understand the concept with an example:

I hope the above waveforms are self-explanatory.
Setup Slack in the above case (as inferred from the waveforms as well) is:

Setup Slack = Tclk - T(clk-2-q) - Tdata - T(su,FF2)

If this setup slack is positive, we say that setup time constraint is met. Note that setup slack depends upon the clock period and hence in turn frequency at which your design is clocked.

Let us consider hold timing:

Hold Slack = Tdata + T(clk-2-q) - T(ho,FF2)

As evident from the above equation, hold slack is independent of the frequency of the design.
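To see the frequency dependence in numbers, here is a short Python sketch that plugs values into the two slack equations above and sweeps the clock period. The delay values are hypothetical and the function names are mine; there is no clock skew in this example, matching the equations as written.

```python
def setup_slack(t_clk, t_clk2q, t_data, t_su):
    # Setup Slack = Tclk - T(clk-2-q) - Tdata - T(su,FF2)
    return t_clk - t_clk2q - t_data - t_su


def hold_slack(t_clk2q, t_data, t_ho):
    # Hold Slack = Tdata + T(clk-2-q) - T(ho,FF2); no Tclk term at all.
    return t_data + t_clk2q - t_ho


# Hypothetical delays in ps.
T_CLK2Q, T_DATA, T_SU, T_HO = 100, 450, 50, 40

sweep = []
for t_clk in (2000, 1000, 500):  # i.e. 500 MHz, 1 GHz, 2 GHz
    sweep.append((t_clk,
                  setup_slack(t_clk, T_CLK2Q, T_DATA, T_SU),
                  hold_slack(T_CLK2Q, T_DATA, T_HO)))

for t_clk, su, ho in sweep:
    print(f"Tclk={t_clk:4d} ps  setup slack={su:5d} ps  hold slack={ho:4d} ps")
```

As the period shrinks, the setup slack drops and eventually goes negative, while the hold slack stays the same in every row.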

Note:

Setup is the next cycle check. We take the setup time T(su,FF2) of FF2 into account while finding the setup slack at the input pin of FF2.
Hold is the same cycle check. We take the hold time T(ho,FF2) of FF2 into account while computing the hold slack at the input pin of FF2.

Try and grasp this example. I shall introduce the concept of clock skew next.
