Tuesday, September 3, 2019

Two minutes

Of course, blog posts are inspired by daily activity... so look out for DC BIAS!

Two minutes may not seem like a lot of time.  It really depends on what you are doing.  Sometimes I think about how much I cost my company and how I am filling those hours.  "Am I doing something that really benefits the company, or am I wasting time?" is often the thought.  Two minutes is the perfect amount of time to do nothing while nothing is happening.

The product was a switching regulator: a type of DC-to-DC converter that converts a higher, less accurate voltage to a lower, very accurate voltage.  A switch in the primary of the converter closes in series with an inductor.  When the switch closes, the current ramps up with time, loading the inductor with energy.  At a later point, the switch opens and the energy from the inductor is delivered to a load connected to the power converter.  This switching action repeats and, over a period of time, delivers power to the load, such as a resistor, charging circuit, or possibly a simple LED.  Our customer returned the chip since it was exhibiting some odd behavior.  I was told to "check it out," so off to the lab I went, finding a socketed board, a power supply, and a programmable load resistor.
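To put rough numbers on that switching action, here is a minimal back-of-the-envelope sketch in Python. The inductor value, voltage, on-time, and switching frequency are illustrative assumptions, not specs from the actual part:

    # Back-of-the-envelope numbers for one switching cycle of an
    # inductor-based converter. All values are illustrative assumptions.
    L = 4.7e-6        # inductor, henries
    V_L = 5.0         # volts across the inductor while the switch is closed
    t_on = 1.0e-6     # switch on-time, seconds
    f_sw = 500e3      # switching frequency, hertz

    # While the switch is closed, inductor current ramps linearly:
    # di/dt = V_L / L
    i_peak = (V_L / L) * t_on           # ~1.06 A peak current

    # Energy loaded into the inductor each cycle: E = 1/2 * L * i^2
    E_cycle = 0.5 * L * i_peak**2       # ~2.66 uJ per cycle

    # Delivered to the load every cycle, this sets the average power.
    P_avg = E_cycle * f_sw              # ~1.33 W

    print(f"peak current: {i_peak:.2f} A")
    print(f"energy per cycle: {E_cycle*1e6:.2f} uJ")
    print(f"average power: {P_avg:.2f} W")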

I put the DC-DC chip in the socket and tightened it down.  It's important when testing power chips that the chip be really tight in the socket; small amounts of resistance in the connections can quickly heat up the socket.  It seems like there is no such thing as a cheap socket anymore.  These modern packages are tricky, and socket vendors step up with advanced mechanical solutions.  Some use pogo pins, others use polymer resin; all of them are expensive and can have long lead times, so care needs to be taken with the setup.  I plugged in the board power, flipped the power switch on the board, and the POWER LED came to life, as did current through the load.  I checked the output voltage; it was perfect, matching the datasheet specifications.  So why was I looking at the part?

"Works great"- I said.
"It hates cold" - the VP barks out

So we have this "cold spray" stuff in the lab.  I do not know what it's made of, other than that it's very volatile and quickly cools enough to form ice after a 20-second blast from the can.  So I powered down the board, waited 30 seconds or so, then gave the DC-DC chip a good 20-second BLAST of cold spray.  I saw ice start to form around the socket, so I knew we were good and cold.  The next step was to test.

I flipped the switch, and the main power unit sprang to life, delivering power to the SOCKET.  However, nothing happened.  I turned the power unit off, then back on.  (Did you try turning it on and off?)  Still nothing happened.  No response from the chip.

Often when debugging I take a short walk to clear my head.  After doing so, I returned to the chip and found that it was "on".  While I was out taking a walk, the chip had "come to life".  I think this was near Halloween, since I was wondering about ghosts.  So for the next experiment, I decided not to leave the bench.

Repeat: blast 20 seconds of cold spray on the DC-DC chip.  After I saw ice on the socket, I applied power.  Again, nothing happened.  Then I waited.  Sometimes "brute force" is the answer.
So I stared at the board, trying not to blink.
30 seconds goes by... nothing happening other than ice melting.
60 seconds now... the ice is beginning to thaw quite a bit... still no light.
90 seconds now... it's getting old.  I'm wondering what's for lunch...
120 seconds...  BINGO!  The light turns on, and the load comes to life.

Once warm, the board came up on every power cycle.

So what can do this?  What can cause a circuit to shut down in a cold environment?
Now I need to say that if you put a "good chip" in the socket, it works quite well with freeze spray.  There is something different about the bad ones.  Of course, the bad chip "found me," not the other way around, so I knew it was a "special" chip with respect to operating cold.

To debug further, I used the next common tool, which is to adjust voltages on the board.  It's not uncommon for DC-DC converter chips to use voltages set by off-chip components.  So I started to vary these voltages: pull the chip out of the board, do some soldering to change something off chip, put the chip back in the socket, apply cold spray, and wait for two minutes.

Two minutes is a perfect amount of time to do basically nothing!  You can't browse the web.  You can't do any tricky math.  You can't even hold a conversation with anyone.  Just me, the board, and two-minute windows of time.  Most of the time, nothing made a difference.  However, some off-chip components seemed to improve the situation.  By raising a DC voltage I was able to speed things up a little.

Now if I am debugging, often I can't solve a problem unless I get MORE information.  When it comes to debugging, information is king.  The good news is that in two minutes I can do quite a bit of thinking about what might be causing the two-minute start.  Since cold had the most dramatic effect, the cold spray had to give way to something more controlled.  We have something called a "Thermonics" unit, which blows hot or cold air on a part in a very controlled way.  With a Thermonics unit and a thermocouple you can set the case temperature of a part in a socket quite accurately.  With -20C forced air, I was able to re-create the two-minute turn-on.  I then started to gradually increase the temperature.

I had data like this:

Temperature    Delay
-20C           2 minutes
-10C           1 minute
0C             10 seconds
10C            1 second
27C            "Instantaneous"
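That trend looks exponential, and a quick way to check is to fit a line to log(delay) vs. temperature. Here is a rough sketch using the numbers from the table above ("instantaneous" is left out since it has no measured value yet); the fit is crude but shows the log-linear shape:

    import math

    # Measured startup delay vs. case temperature, from the table above.
    data = [(-20, 120.0), (-10, 60.0), (0, 10.0), (10, 1.0)]  # (deg C, s)

    # Least-squares line through ln(delay) vs. T: an exponential process
    # should be roughly straight on this axis.
    n = len(data)
    sx = sum(t for t, _ in data)
    sy = sum(math.log(d) for _, d in data)
    sxx = sum(t * t for t, _ in data)
    sxy = sum(t * math.log(d) for t, d in data)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n

    for t, d in data:
        fit = math.exp(intercept + slope * t)
        print(f"{t:4d} C  measured {d:6.1f} s  fit {fit:6.1f} s")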

Again, to debug, more information is always better.  What I did next was attach an oscilloscope.  I set the scope to trigger on the rising edge of the main power while measuring the delay to the LED power indicator.  I started the experiment at 10C and worked the temperature up slowly.  At 27C it was not "instantaneous" but several thousandths of a second.  Even better, the "good parts" we had in stock took microseconds to start at room temperature.  This means we could easily devise a test that bins the parts based on start-up time and prevents the sensitive ones from ever getting to the customer.
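A production screen along those lines could be as simple as the sketch below. The 100 us threshold is a placeholder assumption that sits between the two observed clusters (microseconds for good parts, milliseconds for bad ones); it is not the number we actually used:

    # Hypothetical bin rule: good parts started in microseconds at room
    # temperature, bad ones in milliseconds, so any threshold between
    # those two clusters separates them. 100 us is a placeholder.
    STARTUP_LIMIT_S = 100e-6

    def bin_part(startup_time_s: float) -> str:
        """Bin a part by its measured power-on to LED-on delay."""
        return "PASS" if startup_time_s < STARTUP_LIMIT_S else "FAIL"

    # Example measurements (made up): a good part and a cold-sensitive one.
    for t in (3e-6, 4e-3):
        print(f"{t*1e6:9.1f} us -> {bin_part(t)}")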

So what can do this?

Well, we know that the THRESHOLD voltage, the voltage at which a MOSFET "turns on," is a strong function of temperature.  At COLD temperatures, it takes more voltage on the gate of a MOSFET to turn it on.  In addition, the THRESHOLD voltage varies as the chips are manufactured, so depending on which lot you look at, the THRESHOLD can vary a little.  Of course, there are also MISMATCHES in thresholds, in that not all transistors on the SAME die have exactly the same THRESHOLD.  We normally run statistically based simulations where we model the manufacture of chips, including the THRESHOLD voltage.
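As a rough first-order model of that temperature dependence: threshold voltage rises nearly linearly as temperature falls, often on the order of a couple of millivolts per degree C. The sketch below uses textbook-scale illustrative numbers, not this chip's actual process parameters:

    # First-order threshold-voltage temperature model:
    #   Vth(T) = Vth(T0) - k * (T - T0)
    # The nominal Vth and the ~2 mV/C tempco are typical textbook-scale
    # assumptions, not real process data for this part.
    VTH_25C = 0.50    # nominal threshold at 25 C, volts
    TEMPCO = 2e-3     # threshold change per degree C, volts

    def vth(temp_c: float, vth_25c: float = VTH_25C) -> float:
        return vth_25c - TEMPCO * (temp_c - 25.0)

    for t in (-20, 0, 25, 85):
        print(f"{t:4d} C: Vth = {vth(t)*1000:.0f} mV")  # cold -> higher Vth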

In the main bias of this particular chip, there was a single NMOS transistor that had to "turn on" to make the reference clock, and thus the power converter, run as expected.  Due to manufacturing, and some random bad luck, that key transistor was "off" when the chip was cold.  Leakage currents from the power supply and heat in the environment would eventually get the CHIP hot enough that it would start.  Once the chip starts, it creates its own heat, which keeps the converter running.

A key observation in debugging this was the "exponential" improvement in startup time vs. temperature.  The main bias device was in "sub-threshold" when it was in the bad state, and sub-threshold-biased FETs behave like BJTs, whose currents are exponential in bias and temperature.  To make things worse, this chip was made on SOI (Silicon on Insulator), so the MOSFET is, for the most part, isolated from its environment by glass.  However, there is metal in the chip that runs out to the package pins and can bring in heat.  It's clear that the bias circuit was NOT simulated properly.  To do that, you must (a rough sketch in script form follows the list):
1.  Check simulations to show that the voltage on the FET exceeds its THRESHOLD by, say, 100 mV
2.  Check simulations at COLD temperature to prove this is still the case
3.  Check simulations at "SLOW" manufacturing corners to prove the MOSFET is still ON
4.  Check with "Monte Carlo" simulations (at least 100 random cases) to prove the FET is STILL ON (with margin)
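Here is what those four checks might look like as a pass/fail script. Everything in it is illustrative (the 100 mV margin is from the list above; the operating-point values and toy Monte Carlo model are made up to mirror a failing bias like this one); a real flow would read these numbers out of the simulator's DC operating-point results:

    import random

    MARGIN_V = 0.100   # require Vgs to exceed Vth by at least 100 mV

    def check_bias(vgs: float, vth: float, label: str) -> bool:
        ok = (vgs - vth) >= MARGIN_V
        print(f"{label:26s} Vgs-Vth = {(vgs - vth)*1000:6.1f} mV  "
              f"{'PASS' if ok else 'FAIL'}")
        return ok

    # Checks 1-3: nominal, cold, and slow-corner operating points
    # (made-up values chosen to show a marginal design failing cold).
    check_bias(vgs=0.65, vth=0.50, label="nominal 25C")
    check_bias(vgs=0.65, vth=0.59, label="cold -20C")
    check_bias(vgs=0.65, vth=0.62, label="cold -20C, slow corner")

    # Check 4: Monte Carlo, at least 100 random threshold samples.
    random.seed(1)
    fails = sum(1 for _ in range(100)
                if (0.65 - random.gauss(0.62, 0.015)) < MARGIN_V)
    print(f"Monte Carlo: {fails}/100 samples below margin")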

Now, if you do the above four checks, you will find that this circuit fails.  This chip was designed by very senior people; however, everyone makes mistakes.  A chip NEVER cares who you are or HOW you design.  If it's not a good design, you will see it in the lab sooner or later.  A large number of the DEBUGS I have done involve issues with DC bias.  Never take DC bias casually, especially if you are a senior designer.  Never tape out a DC bias block without proper verification AND, ideally, a peer review.  That way you won't spend hours in the lab, two minutes at a time.

That took more than 2 minutes to write.  Hopefully not more than 2 minutes to read!

-SSA
"The postings on this site are my own and do not necessarily represent the postings, strategies or opinions of Microsoft."

Wednesday, May 15, 2019

The Product development treadmill - Focus

Yes, I know it's been a long time since I made a post; however, it is for the same reason I have posted in the past.

I was browsing the web and found a Forbes article, Getting Off The "Bad Growth" Treadmill by Cesare Mainardi and Paul Leinwand.  The commentary in that posting is that the low-risk way to expand your market can be a disaster.  The definition of "low risk" is a moving target as technology development increases its speed.  This has been a problem throughout my career, and it is being compounded by the effects that drive the singularity, as discussed by Ray Kurzweil.  As progress speeds up and our competitors get better, the treadmill runs even faster!  Playing it safe is more "unsafe" than ever before.  But what does this mean?  What is safe?

So now I can tell my story.  It was the early days at my first real job.  We had been working on non-ADC-based Ethernet PHYs: 10-T, 100-T.  Those product developments were not without their challenges; however, they were simple compared to today's ADC-based Ethernet chip designs.  Back then we could fit a whole transmitter on one large schematic sheet.  The receiver was a handful of blocks, including a few data slicers.  We had simpler process technologies with only a handful of metal interconnect layers.  Verification was mainly analog at the block level, and the top-level simulations often black-boxed the analog, since few loops went from analog to digital and back.  So when we started the next generation, Gigabit Ethernet, we used the same strategy as before.

Since ADCs were new to the company, we "had to" create ADC test-chips.  Now, test-chips are nice; however, they do consume resources.  The whole design cycle is involved: definition, schematics, layout, verification, tape-out, packaging, test-board, test-program, lab testing, and data analysis.  So the test-chip was a huge effort.  In the end, we got a reasonably good Flash ADC from the test-chip.  I was studying this as I was working on the gain-control circuit in the Analog Front-End (AFE).  I also owned the ESD and GMII interfaces.  I worked on these while I watched the more senior people assemble the test-chip for the PHY.  So we were toiling away when...

We heard that Broadcom (our competitor) was sampling 1000-T.  Our competitor went "ugly early": instead of creating a test-chip, they created a functional 1000-T transceiver.  Now, it didn't have the lowest power, nor did it have a 100% standards-compliant link.  However, it was the FIRST 1000-T PHY, and the fact that it was first changed everything.  Broadcom had been doing ADCs for years and chose not to create a lone ADC test-chip only to throw it away later.  They used their valuable resources to focus on what was new, which was the ADC-based PHY concept.  It must have been a wild development at Broadcom, but the focus was keen: they concentrated on just the new stuff.  The Broadcom chip didn't even support the slower modes; it was just enough to prove the 1000-T standard concept.  In hindsight, I admire their focus and the strategy of their approach.

Late in our development, it was determined that the chip wouldn't work because some of the "stripped out" features were required.  Confusion about what was "good enough" erupted.  Key analog leads were overloaded by the changes.  Test-chip-quality blocks were expected to be product-level quality.  The scale of the development became obvious to the designers, since this chip was about 10X more complicated than the previous analog-based PHYs, and the old methodology was not working.  So we changed: we added program managers and reduced the load on key leads.  These changes also took time.  (Looking back, our competitor also had to deal with the change in style; however, there was a clear focus.  They were acting while we were reacting.)  So after many long nights we did tape out.

Never did we consider that a new or different approach would need to be introduced.  Issue tracking is one such approach; it can free leads from memorizing huge sections of the design.  The design can also be staged.  For example, it is possible to run multiple variations of a chip (sometimes called a pizza mask).  Some versions (slices) could be for debugging, other versions for samples.  This "parallel" approach helps mitigate risk and long schedules.  Now, back to the story.

When silicon came back, the evidence of the rush was clearly visible.  On the 5th revision, we were able to send data between two like parts.  By that point, the competitor was in production, selling parts.  While our competition was working on their next generation, we were still fighting with the first.  By the 12th revision, the part was basically working; however, the specs had been changed to match reality.  Meanwhile, more competitors appeared with lower-power parts, such as Marvell's Alaska, ironically assisted by designers who had left us.  I should have followed.

I have often wondered about marketing vs. engineering leading chip development.  A change in project focus at the last minute is a product killer.  Later, during executive training, I put my finger on what bothered me: when people who understand the technology are not included in decisions, disasters occur.  The reason, I believe, is that it is unethical for those who lack the proper knowledge to make technical decisions that affect the architecture of a chip.  It would be like a pediatrician planning brain surgery.  In good organizations, there is feedback and accountability that can reduce or eliminate this behavior if it occurs.  Post-mortems happen, with a wide audience invited.  Good companies even have a step in product development where the execution of marketing and engineering is compared to the original estimates.  A test for this is to simply ask a program manager for the resource history of a past program.  If that information is not available in any form, there may be a problem, since you can't learn from your own past.

One thing that is true about product development these days is that the biggest risk is making a mistake in the strategy.  If you take on too little risk, you may not have a compelling part.  If you take on too much risk, you will have a painfully buggy product.  Marketing and engineering need to pick the balance.  The development strategy and methodology that works for one type of chip may be a horrible idea for another.  A good post-mortem and accountability help an organization grow in the direction that improves execution, allowing for greater risk.  When in doubt, focus on delivering; there will always be distractions.


Disclaimer - New Posting Ahead

SSA here, back after a long break.  I have changed companies and have joined the team at Microsoft.  My hope is that I will have some more time to share some debug stories from the past.  However, I must state:

"The postings on this site are my own and do not necessarily represent the postings, strategies or opinions of Microsoft."