April 17, 2013

Low Power Synthesis: Insertion of Clock Gating Cells

Power consumption is a growing concern for modern SoCs, and design engineers today face the arduous task of limiting the power dissipation of their SoCs. It would be unfair to think of the backend design cycle as a magical solution to all power problems. However, modern synthesis EDA tools are smart enough to identify some key RTL constructs and synthesize a low power equivalent of the structure. We will take a look at one such RTL construct and its equivalent implementation for low power design.

Consider the following behavioral description:

always @ ( posedge clk )
   if (enable == 1'b1)
      q [15:0] <= d [15:0];

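The description above has a straightforward logical implementation, in which a MUX in front of each flop recirculates q whenever enable is low, and a corresponding low power implementation, in which a clock gating integrated cell (CGIC) gates the clock to the whole register set.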
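
The two implementations can be sketched in Verilog roughly as follows. This is only an illustrative sketch: the module names (reg16_mux_en, cgic_model, reg16_clk_gated) are hypothetical, and the CGIC is modelled as the commonly used latch-plus-AND structure, which is an assumption here. In a real flow the CGIC is a library cell chosen by the synthesis tool and may differ internally.

```verilog
// Implementation 1: MUX-based enable. The clock reaches every flop
// on every cycle; the MUX recirculates q when enable is low.
module reg16_mux_en (
    input  wire        clk,
    input  wire        enable,
    input  wire [15:0] d,
    output reg  [15:0] q
);
    wire [15:0] d_mux = enable ? d : q;  // hold the old value when disabled

    always @(posedge clk)
        q <= d_mux;
endmodule

// Behavioral model of a clock gating cell (assumed latch + AND style).
module cgic_model (
    input  wire clk,
    input  wire en,
    output wire gclk
);
    reg en_latched;

    // The latch is transparent only while clk is low, so a change on en
    // during the high phase of clk cannot chop the gated clock pulse.
    always @(clk or en)
        if (!clk)
            en_latched <= en;

    assign gclk = clk & en_latched;
endmodule

// Implementation 2: one CGIC gates the clock of all 16 flops.
module reg16_clk_gated (
    input  wire        clk,
    input  wire        enable,
    input  wire [15:0] d,
    output reg  [15:0] q
);
    wire gclk;

    cgic_model u_cgic (.clk(clk), .en(enable), .gclk(gclk));

    // No MUX needed: the flops simply see no clock edge when disabled.
    always @(posedge gclk)
        q <= d;
endmodule
```

Both register modules hold q while enable is low; the difference is that in the second one the clock pins of the 16 flops stop toggling during those cycles.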

The synthesis tools find such RTL constructs and try to convert them into the low power implementation described above. Please note that the clock gating integrated cell (CGIC) also consumes power, and the above implementation might not be an expedient solution if the enable is mostly high, or if the number of registers in the register set is small. Therefore, one needs to exercise caution while using or implementing such a structure!
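
A rough first-order way to see this trade-off is sketched below. The model and all its symbols are assumptions made for illustration, not figures from any tool or library: N is the number of flops behind one CGIC, p_en is the fraction of cycles in which enable is high, P_ff,clk is the clock-pin power of one flop when it is clocked, and P_CGIC is the power of the gating cell itself.

```latex
\Delta P \;\approx\; (1 - p_{en})\, N\, P_{ff,clk} \;-\; P_{CGIC}
\qquad\Longrightarrow\qquad
\Delta P > 0 \iff N > \frac{P_{CGIC}}{(1 - p_{en})\, P_{ff,clk}}
```

Under this simple model the saving shrinks as p_en approaches 1 (enable mostly high) or as N becomes small, which matches the two cautions above.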


  1. Hei Palindrome,

    Do you see any saving in power if we keep the flops under reset and do the clock gating, compared to normal clock gating? Maybe a dumb question :)


  2. No question is dumb, my dear anonymous friend! :)
    Note that many gates constitute a flop, and despite the fact that the flop is under reset, the clock signal would still force some gates to switch ON/OFF, though without affecting the output of the flop. By clock gating, we intend to save the power spent on those redundant transitions. Having said that, also note that using one clock gate to gate a single flop might result in more power being consumed! The idea is to use one clock gate for multiple flops; only then can you expect some savings in dynamic power.
    Well, to be quite honest, I strongly feel that clock gating the flops (many flops in a bunch) under reset would save power, but I would like to verify this with some simulations! Will keep you in the loop on whatever the result is! :)

    Thanks for asking the question!! :)

  3. I guess keeping the flop in reset won't serve any purpose, as we need to retain the state as well as save power. So clock gating without keeping flops under reset is the thing to do.

  4. Hi Palindrome,

    Nice blog! I enjoy reading it on a regular basis. While I agree with Sunny that clock gating while reset is asserted makes little sense, conceptually even I feel that clock gating (as you mentioned, many flops) while reset is asserted would save power! I am looking forward to hearing your simulation results! :)


  5. Hi Palindrome,
    This clock gating concept, used by many EDA tools, is indeed useful, but I have two doubts:
    1. By changing the designer's implementation, aren't these tools making life more difficult during debugging?
    2. Even if clock gating is implemented, data can continue to toggle, and the first latch of the flop will be active without doing much. So is there any possibility of gating the data path as well while the clock is gated?

  6. Hi Sunny,

    1. I only see a concern while running LEC (Logical Equivalence Checking). Other than that, we would never need to debug. And to handle this, modern EDA tools support treating the MUX-based structures described above as equivalent to the CGIC version.
    2. Yes, data can continue to toggle, but we can't gate the data path. Consider a situation where a data path feeds two flops: one can be clock gated, but the other cannot. Hence, the data path cannot be gated. Moreover, note that if all the inputs to the data combo logic are gated, the data path won't toggle anyway!! :)


  7. Thanks Palindrome for answering my question. This site is my learning place for physical design. :-D

    Do you have any picture of a FF with reset? If the reset goes to the master stage of the flip flop, then we will be saving the transitions due to data switching. In that case, will we save a considerable amount of power?

    Do you have any suggestions on the fanout of a CGIC, i.e. the number of flops a clock gating cell should drive?

    I assume that if we gate the clock, it switches off the clock tree from that point, so any idea of the power saving from clock tree switching vs FF gate switching?


    1. Thanks for such kind words of appreciation! I would really appreciate it if you could at least leave your name in the comments. Kinda feels weird replying to "Anonymous". :)

      As for CGIC, typically 1 CGIC is good enough to drive 16 flops. If the number is less than that, it's merely a waste of resources. If the number is more, clock parameters like slew would go haywire.

      An interesting last point. Assume that a particular clock branch feeds three different clock gating cells, and we cannot gate all three of them simultaneously. By gating 1 or more of those 3, we would save the switching power of the flops in the fanout of the CGICs which we have gated. We cannot gate the shared clock branch itself, as we might still need to feed the flops in the fanout of the remaining CGIC. Got my point? Or did I get your point?

      P.S: Would love to have more suggestions regarding the topics and general structure of the blog. Thanks in advance! :)

  8. Hello Palindrome,

    I have a few questions here, but don’t know whether I could explain them to you clearly :-D

    1) What will happen if EN is driven from a flop which has an async reset? Since the reset assertion is async, EN will change asynchronously during reset, and there is a chance that EN will be delayed more than the clock, so we may end up getting glitches on the gated clock.

    2) Because of the async behaviour of the flop, I assume the tool won’t be able to do the timing check for us? Rst->Q?

    3) Also, what will happen if the flop goes into a metastable state and we don’t provide any clock to it for some duration of time, and start providing the clock at a later stage? Will that flop still be in the metastable state? In other words, do we need subsequent clock cycles for a flop to come out of a metastable state?


    1. Hello Bond,

      Apologies for such a late reply. I have been a little busy lately, and didn't get a chance to participate in the proceedings on the blog.

      Your question is interesting.
      1. Yes, since it is an async reset, it can be asserted at any time. And there could be violations anywhere, because STA cannot check the timing of a truly asynchronous event. However, looking at it a bit objectively, such a case usually arises from either a power-down or a fatal error. In all these cases, reset is asserted for a long time (let's say a few milliseconds). And after de-asserting it, many cycles are allowed to pass before anything is sampled, so as to preclude the possibility of any metastability.

      2. Yes, the tool won't check. This is taken care of in the way I mentioned above.

      3. You are right. We allow many clock cycles to pass before sampling is done. Therefore these cycles would make sure that all the outputs of flops (or any other sequential cells) come out of metastability. You are absolutely right!

      Please let me know if it requires any further clarifications.


    2. Thanks Naman. One more basic question.

      1) How long does a FF take to settle from a metastable state (e.g. on a 0->1 transition)? Max 1 cycle?

      2) If we have two flops in series, how do we guarantee that the second flop never goes into a metastable state?


    3. Hi Bond.

      1) Theoretically, it is not possible to predict when a particular flop would come out of metastability. You must be aware that it is based on a probabilistic model, and governed by a parameter called "Mean Time Between Failures (MTBF)". Please note that it just gives the mean time, not the actual time. From experience, I can say that typically 1 cycle is enough for a flop to come out of metastability, but in some critical applications, designers may use 2 or maybe up to 3 synchronizer flops.

      2) The synchronizer flops I mentioned above are nothing but flops added at the boundary between two asynchronous clock domains. A signal crossing such a boundary is referred to as a Clock Domain Crossing. These synchronizer flops reduce the probability of metastability on the second flop. Note that they only reduce the probability. One can never say for sure.
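
      To make the two points above a little more concrete, here is the commonly quoted form of the MTBF expression, followed by a minimal two-flop synchronizer sketch. The symbols are the usual textbook ones and are not taken from any particular library: t_r is the time available for the first flop to resolve, tau is the flop's metastability time constant, T_W is its metastability window, and f_clk and f_data are the sampling clock frequency and the data toggle rate.

      ```latex
      \mathrm{MTBF} \;=\; \frac{e^{\,t_r/\tau}}{T_W \cdot f_{clk} \cdot f_{data}}
      ```

      The module and signal names below (sync_2ff, clk_dst, async_in, sync_out) are hypothetical, chosen only for illustration.

      ```verilog
      // Minimal two-flop synchronizer sketch for a single-bit signal
      // entering the clk_dst domain from an asynchronous source.
      module sync_2ff (
          input  wire clk_dst,   // destination-domain clock
          input  wire async_in,  // signal from the other (async) domain
          output wire sync_out   // version safe to use in the clk_dst domain
      );
          reg meta;    // first stage: may go metastable
          reg stable;  // second stage: samples meta one full cycle later

          always @(posedge clk_dst) begin
              meta   <= async_in;
              stable <= meta;
          end

          assign sync_out = stable;
      endmodule
      ```

      Each extra stage gives the previous one roughly one more clock period of resolution time t_r, which is why adding a flop raises the MTBF exponentially but never makes it infinite.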

      Thanks for pointing out a potential topic for the next post. I'll try to cover basic Clock Domain Crossing and concept of Synchronizers henceforth.


  9. Hi Palindrome,

    One question regarding the CGICs. Can these CGICs (if there are a lot of them in the design) be driven from a different power domain? I mean, the voltage supplied to the CGIC is lower than the voltage level required for EN, and the voltage is later levelled up, in the CGIC ON state, to the level required by the design flops.

    A lower Vth CGIC will save both static power and dynamic switching power. (Also, the power consumed in the switching of the gating cell would be less.)

    Thanks for the nice post


    1. Hi Varun,

      Thanks for your kind words of appreciation.

      Well, what you said is very well possible. But don't you think that the design of level shifter would be a little tricky? It would:

      1. Add some more delay in the clock path that would ultimately increase the clock latency.
      2. Since rise and fall times of the clock signal are monitored closely, it would place additional constraint on the design of such a high performance level shifter where the output slew is very good.

      I reckon, if the above questions are addressed, the methodology that you suggested would be more pragmatic. But I believe that it would be a little risky to play around and tamper with the clock path.

      Lastly, a lower Vt CGIC would mean higher static power. And the threshold voltage would have no impact on the dynamic switching power, but lowering the VDD would certainly reduce both the leakage and the dynamic power.
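
      The standard first-order expressions make this visible. They are textbook approximations rather than numbers for any specific cell: alpha is the switching activity, C_L the switched capacitance, f the clock frequency, n the subthreshold slope factor, and kT/q the thermal voltage.

      ```latex
      P_{dyn} \;\approx\; \alpha\, C_L\, V_{DD}^{2}\, f
      \qquad\qquad
      P_{leak} \;\approx\; V_{DD}\, I_{leak},
      \quad I_{leak} \;\propto\; e^{-V_t/(n\,kT/q)}
      ```

      V_t does not appear in P_dyn, while a lower V_t raises I_leak exponentially; V_DD, on the other hand, appears in both expressions.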

      Have you seen CGICs supplied with lower voltage? I would be interested to know the pros and cons of the physical implementation of such a methodology.



    2. Hi Naman

      One more drawback is that we need to design the power grid for the CGICs separately, which will eat up routing resources.
      Also, as you mentioned, it may change the pulse width of the clock, as it's difficult to design a level shifter with equal rise and fall delays.

      Please correct me if I am wrong.

      Nishant Madan

  10. Dear Naman,

    Nice blog with useful posts. My question is what are the timing constraints on the CE (clk enable) path of an ICG that are needed at synthesis stage & placement stage.


  11. Hi Naman,
    please correct me if I am wrong in answering the question myself:
    1. How will the tool choose between the two possible implementations of the code?
    Initially, the tool implements the code with the MUX-based design. Then, according to our constraints (such as the minimum number of flops and the max fanout), the portions of the design that meet those constraints are replaced with CGICs, and the remainder is left as it is...