## November 22, 2015

### The Timing Optimization Problem

Puzzle:

Tomorrow is the scheduled tape-out of your SoC. The target clock frequency for this SoC is 100 MHz (Time Period of 10 ns). However, there's only one setup violating path and you need to fix the timing by doing ECOs. Area is not a constraint.

Here's the circuit:

Points to note:

• My tape-out is tomorrow, I don't have the liberty of asking the RTL design team to change the architecture of the design.
• I have used the highest possible drive strength cells, and perhaps the lowest Vt flavor cells available in my standard cell library.
• There's no redundant logic in the path, it's been optimized well.
• I cannot add delay to the clock path of FF2, because doing so would cause a hold violation on the scan chain connecting flops FF1 and FF2.

Please suggest ways to solve this timing violation. A rough image would be really helpful. I shall post my solution in a couple of days' time.

Mike posted the correct answer, and I'll just add a figure explaining the solution:

1. 1. Look at the slack margin at the D pin of FF1. If there is 2 ns of positive slack, you can pull in the clock to FF1 so that it launches earlier.

2. Collect all the nets in the FF1-FF2 path and route them on higher metal layers with an NDR (non-default rule); this will reduce the net delay.

3. Is FF2 an LVT flavour? If not, swap it to an LVT flavour (the setup time of FF2 will be much less for LVT cells).

4. Look at the slack at the D pin of FF3. If it's positive, delay the clock of FF2. (You said you'd end up with a hold violation; add a delay buffer in the scan path.)
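
Suggestions 1 and 4 are both forms of useful skew, and the arithmetic behind them can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `setup_slack` is a hypothetical helper, clock-to-Q and setup time are treated as folded into the quoted 5 ns and 7 ns delays, and the 8 ns upstream path into FF1 is an invented number.

```python
# Setup slack with useful skew; all values are in ns.
# Assumption: clock-to-Q and setup time are folded into the path delay,
# because the puzzle only quotes the 5 ns and 7 ns logic delays.

def setup_slack(period, path_delay, launch_latency=0.0, capture_latency=0.0):
    """Slack at the capture flop's D pin; negative means a setup violation."""
    required = period + capture_latency   # latest allowed arrival time
    arrival = launch_latency + path_delay # actual arrival time
    return required - arrival

PERIOD = 10.0
FF1_TO_FF2 = 5.0 + 7.0

# No skew: the path misses timing by 2 ns.
baseline = setup_slack(PERIOD, FF1_TO_FF2)

# Suggestion 1: clock FF1 2 ns earlier (negative launch latency).
early_ff1 = setup_slack(PERIOD, FF1_TO_FF2, launch_latency=-2.0)

# Suggestion 4: clock FF2 2 ns later (positive capture latency).
late_ff2 = setup_slack(PERIOD, FF1_TO_FF2, capture_latency=2.0)

# The cost of suggestion 1: the path INTO FF1 loses the same 2 ns.
# The 8 ns upstream delay is a hypothetical value chosen for illustration.
upstream_into_ff1 = setup_slack(PERIOD, 8.0, capture_latency=-2.0)

print(baseline, early_ff1, late_ff2, upstream_into_ff1)
```

Either skew only moves the 2 ns deficit to a neighbouring path, which is why both suggestions depend on that neighbour having spare slack.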

1. Thanks for posting the answer, Kumar. I'd like to say that none of them are incorrect. However:
1. What if I don't have slack margin at D-input of FF1?
2. What if the higher metal layers are congested? I can only route through the lower metal layers, so I won't be able to gain anything in net delay.
3. All my cells are already LVT flavors.
4. There's no slack between FF2-FF3.

While you pointed out physical design solutions, I'd like you to think from an architectural standpoint. I'm looking forward to hearing your answer! :)

2. Since there are two combinational blocks between FF1 and FF2, with delays of 5 ns and 7 ns, we can add a flop between them. We would then have 5 ns of combinational logic between one pair of flops and 7 ns between the other, which solves the setup problem.

1. Thanks for posting the answer, Prudhvi. But adding a flop would change the design, because subsequent data would now be available one clock cycle later. We don't have the liberty to change the architecture of the circuit in physical design.

3. There is only a one bit wire between the 5 and 7 ns blocks. Thus, there are only two possible input vectors to the 7ns block.
Thus, as (or even before!) the 5ns block starts to process its inputs, two copies of the 7ns block start working in parallel with it: one with 1 as input, one with 0 as input. By the time the 7ns are done, the correct 7ns block's answer gets muxed into FF2, chosen by the answer from the 5ns block.
Area was after all not a constraint, so the extra logic needed (one 7ns block, and one 2-to-1 mux) is not a concern.
Neither is timing: the 7ns blocks can, assuming they don't have additional inputs not displayed here, after all process their responses at any time (each has a constant output!)

1. "By the time the 7ns are done" should of course be "By the time all three blocks are done"

2. That is indeed the correct answer, Mike! Thank you for posting.

It is similar to the Carry Look Ahead approach, where the computation is done ahead of time for each anticipated input and the appropriate result is then chosen! :)
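
To see why the restructured path hands FF2 the same value as the original one, here is a minimal Python sketch. The functions `block_5ns` and `block_7ns`, and the inputs `a` and `b`, are hypothetical stand-ins (the puzzle does not say what the logic computes); the only property taken from the discussion is that a single bit connects the two blocks.

```python
def block_5ns(a, b):
    """Hypothetical stand-in for the 5 ns logic; it produces a single bit."""
    return a ^ b

def block_7ns(bit):
    """Hypothetical stand-in for the 7 ns logic; its only input is that bit."""
    return 0b1011 if bit else 0b0100

def original_path(a, b):
    # Serial structure: the 7 ns block waits for the 5 ns block's result.
    return block_7ns(block_5ns(a, b))

def speculative_path(a, b):
    # Two copies of the 7 ns block run without waiting, one per possible input.
    out_if_0 = block_7ns(0)
    out_if_1 = block_7ns(1)
    # The 5 ns block only drives the select pin of a 2-to-1 mux.
    select = block_5ns(a, b)
    return out_if_1 if select else out_if_0

# Exhaustively compare both structures over every input combination.
equivalent = all(
    original_path(a, b) == speculative_path(a, b)
    for a in (0, 1)
    for b in (0, 1)
)
print(equivalent)
```

Because the value captured by FF2 is unchanged in every cycle, the edit adds area (one extra 7 ns block and a mux) without adding latency, unlike inserting a flop.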

3. WOW. Keep posting such questions.

4. I'm glad you found the problem useful, Chaitali. Shall try my best to keep posting such stuff!

Thanks,
Naman

5. Wouldn't this be considered a change in the architecture as well?
Since we are getting the data on the same clock edge, is it not considered an architecture change?

6. That's precisely the correct explanation, Raghu.

7. Found this very useful! Please do post more questions like this.

4. Hi Naman,
I'm still not getting the answer. A setup violation means the data arrives too late at FF2, so we have to make it faster by removing extra circuit delay. With the answer Mike gave (I'm not opposing it), the total delay still looks like 5 ns + 7 ns to me, so I don't see the effect on the circuit.
Could you please explain? It would be very helpful for me.
thanks,
Vijay

1. Hi Vijay,

The circuit delay is no longer 5 ns + 7 ns. It is now 5 ns + MUX delay or 7 ns + MUX delay, whichever is larger. The 7 ns block and the 5 ns block are computed in parallel, so the delays never add up to 5 ns + 7 ns. Please let me know if there's something still not clear.
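
The arithmetic can be sketched as follows. The 0.5 ns mux delay is an assumed value, the helper names are hypothetical, and clock-to-Q and setup time are ignored as in the puzzle statement.

```python
PERIOD = 10.0     # ns, from the puzzle
MUX_DELAY = 0.5   # ns, assumed value for illustration only

def serial_arrival(t_first, t_second):
    """Original structure: the blocks are chained, so delays add."""
    return t_first + t_second

def speculative_arrival(t_select_logic, t_data_logic, t_mux):
    """Mux structure: select and data settle in parallel, then the mux adds its delay."""
    return max(t_select_logic, t_data_logic) + t_mux

before = serial_arrival(5.0, 7.0)

# Conservative case: the 7 ns copies only start at the launching clock edge.
after = speculative_arrival(5.0, 7.0, MUX_DELAY)

# If the 7 ns copies have constant inputs, their outputs are already settled,
# so only the 5 ns select logic and the mux remain on the path.
after_constant_inputs = speculative_arrival(5.0, 0.0, MUX_DELAY)

print(before, after, after_constant_inputs)
```

Under these assumptions the path drops from 12 ns to at most 7.5 ns, comfortably inside the 10 ns period.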

2. The circuit delay will always be 5 ns + mux delay regardless of the value of the single-bit wire (i.e., the output of the 5 ns block), won't it?
As the 5 ns and 7 ns blocks are running in parallel, how can we say the delay is 7 ns + mux?

5. How about replacing FF2 with a neg type latch?

1. Hi PM,

Replacing FF2 with a neg latch would still leave a setup violation. A neg latch would have helped if the path from FF1 to FF2 were relaxed and the path from FF2 to FF3 were critical; in that case, the latch would have allowed time borrowing. But in this case even that solution wouldn't work, because we don't have any time to borrow from the next cycle: it is still 10 ns!

-Naman

2. My bad, I meant to say replacing FF2 with a pos type latch. That way this latch will have an extra half cycle to capture the data from FF1.

6. Very nice puzzle. It's very interesting to work on your puzzles. Please keep up the great work!

1. Thanks, Abirami.
I was asked this problem in one of the job interviews very recently, and I'd confess it took me a long time to arrive at the answer. The "single data wire" was the key. I'm glad you find these puzzles interesting. :)

Thanks,
Naman

7. good one. Keep them coming.

8. Good puzzle; we can learn a new method from it. Thanks.

9. Thanks Naman for the question and Mike for the answer. This question has led me to see techniques for addressing setup violations from a different angle.

10. Hi Naman,

The carry select approach is brilliant. But since there is only one input to the combinational circuit, can't we just use a mux and a 1-bit LUT, with the outputs hardcoded?
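
A rough Python sketch of that idea: if the 7 ns block really has only one input bit, each of its output bits can only be a constant, the input itself, or its inverse. The function `block_7ns` and its 4-bit output width are hypothetical, and the sketch does not apply if the block has other inputs that are not shown in the figure.

```python
def block_7ns(bit):
    """Hypothetical stand-in for the 7 ns logic with a 4-bit output."""
    return 0b1011 if bit else 0b0110

def classify_output_bits(fn, width):
    """Describe each output bit of a one-input function (index 0 is the LSB)."""
    out0, out1 = fn(0), fn(1)   # the whole truth table is just two rows
    kinds = []
    for i in range(width):
        b0 = (out0 >> i) & 1
        b1 = (out1 >> i) & 1
        if b0 == b1:
            # Output does not depend on the input: a tie cell is enough.
            kinds.append("tie-high" if b0 else "tie-low")
        else:
            # Output follows the input directly or inverted.
            kinds.append("buffer" if b1 else "inverter")
    return kinds

kinds = classify_output_bits(block_7ns, 4)
print(kinds)
```

In this view the two hardcoded LUT entries are the two rows `fn(0)` and `fn(1)`, and the mux select is the output of the 5 ns block.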