Tuesday, September 3, 2019

Two minutes

Of course, blog posts are inspired by daily activity, so look out for DC BIAS!

Two minutes may not seem like a lot of time.  It really depends on what you are doing.  Sometimes I think about how much I cost my company and how I am filling those hours.  "Am I doing something that really benefits the company, or am I wasting time?" is often the thought.  Two minutes is the perfect amount of time to do nothing while nothing is happening.

The product was a switching regulator: a type of DC-to-DC converter that converts a higher, lower-accuracy voltage into a lower, very accurate voltage.  A switch in the primary of the converter closes in series with an inductor.  When the switch closes, the current ramps with time, loading the inductor with energy.  At a later point, the switch opens and the energy from the inductor is captured by a load device connected to the power converter.  This switching action repeats, and over a period of time delivers power to the load, such as a resistor, charging circuit or possibly a simple LED.  Our customer returned the chip since it was exhibiting some odd behavior.  I was told to "check it out," so off to the lab I went, finding a socketed board, power supply and programmable load resistor.
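The energy bookkeeping above is easy to sketch.  A minimal model in Python, with purely illustrative component values (not from the actual part in this story):

```python
# Ideal switching-converter energy transfer, one cycle at a time.
# All values are illustrative, not from the part in this story.
V_IN = 5.0       # input supply (V)
L = 10e-6        # inductor (H)
T_ON = 2e-6      # switch on-time (s)
F_SW = 200e3     # switching frequency (Hz)

# While the switch is closed, inductor current ramps linearly: di/dt = V/L
i_peak = V_IN * T_ON / L             # peak inductor current (A)

# Energy loaded into the inductor each cycle: E = (1/2) * L * I^2
e_cycle = 0.5 * L * i_peak ** 2      # joules per cycle

# Repeating the switching action delivers average power to the load
p_out = e_cycle * F_SW               # watts, assuming lossless transfer
print(f"peak current {i_peak:.2f} A, {e_cycle*1e6:.1f} uJ/cycle, {p_out:.2f} W")
```

With these made-up numbers the switch loads 5 uJ into the inductor per cycle, which at 200 kHz works out to about 1 W delivered, ignoring all losses.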

I put the DC-DC chip in the socket and tightened it down.  It's important when testing power chips that the chip be really tight in the socket.  Small amounts of resistance in the connections can quickly heat up the socket.  It seems like there is no such thing as a cheap socket anymore.  These modern packages are tricky, and socket vendors step up with their advanced mechanical solutions.  Some use pogo pins, others use polymer resin; all of them are expensive and can have long lead times, so care needs to be taken with the setup.  I plugged in the board power, flipped the power switch, and the POWER LED came to life, as did current through the load.  I checked the output voltage; it was perfect, matching data-sheet specifications.  So why was I looking at the part?

"Works great," I said.
"It hates cold," the VP barks out.

So we have this "cold spray" stuff in the lab.  I do not know what it's made of, other than that it's very volatile and quickly cools enough to form ice after a 20-second blast from the can.  So I power down the board, wait 30 seconds or so, then give the DC-DC chip a good 20-second BLAST of cold spray.  I see ice start to form around the socket, so I know we are good and cold.  The next step was to test.

I flipped the switch, and the main power unit sprang to life, delivering power to the SOCKET.  However, nothing happened.  I turned off the power unit, then turned it back on.  (Did you try turning it off and on?)  Still nothing.  No response from the chip.

Often when debugging I take a short walk to clear my head.  After doing so, I returned to the chip and found that it was "on".  While I was out taking a walk, the chip "came to life".  I think this was near Halloween, since I was wondering about ghosts.  So for the next experiment, I decided not to leave the bench.

Repeat: blast 20 seconds of cold spray on the DC-DC chip.  After I saw ice on the socket, I applied power.  Again, nothing happened.  Then I waited.  Sometimes "brute force" is the answer.
So I stared at the board, trying not to blink.
30 seconds goes by.... nothing happening other than ice melting.
60 seconds now.. the ice is beginning to thaw quite a bit.. still no light.
90 seconds now... it's getting old.  I'm wondering what's for lunch..
120 seconds...  BINGO!  The light turns on, the load comes to life.

Once warm, I power-cycled the board and it came up every time.

So what can do this?  What can cause a circuit to shut down in a cold environment?
Now I need to say that if you put a "good chip" in the socket, it works quite well with freeze spray.  There is something different about the bad ones.  Of course, the bad chip "found me," not the other way around, so I knew it was a "special" chip with respect to operating cold.

To further debug, I used the next common tool, which is to adjust voltages on the board.  It's not uncommon for DC-DC converter chips to use regulated voltages on a board.  So I started to vary these voltages by adjusting some off-chip components: pull the chip out of the board, do some soldering to change something off-chip, then put the chip back in the socket, apply cold spray and wait for 2 minutes.

Two minutes is a perfect amount of time to do basically nothing!  You can't browse the web.  You can't do any tricky math.  You can't even hold a conversation with anyone.  Just me, the board, and 2-minute windows of time.  Most of the time nothing made a difference.  However, some off-chip components seemed to improve the situation.  By raising a DC voltage I was able to speed things up a little.

Now, if I am debugging, often I can't solve a problem unless I get MORE information.  When it comes to debugging, information is king.  The good news is that in 2 minutes I can do quite a bit of thinking about what might be causing the 2-minute start.  The cold spray had to give way to something more controlled, since cold air had the most dramatic effect.  We have something called a "Thermonics" unit, which blows hot or cold air on a part in a very controlled way.  With a Thermonics unit and a thermocouple you can set the case temperature of a part in a socket quite accurately.  With -20C forced air, I was able to re-create the two-minute turn-on.  I then started to gradually increase the temperature.

I had data like:

Temperature     Delay
-20C            2 minutes
-10C            1 minute
0C              10 seconds
10C             1 second
27C             "Instantaneous"

Again, to debug, more information is always better.  What I did next was attach an oscilloscope.  I set the scope to trigger on the rising edge of the main power while measuring the delay to the LED power indicator.  I started the experiment at 10C and worked the temperature up slowly.  At 27C it was not "instantaneous" but several thousandths of a second.  Even better news was that the "good parts" we had in stock took microseconds to start at room temperature.  This meant we could easily devise a test that bins the parts based on start-up time and prevents the sensitive ones from ever getting to the customer.

So what can do this?

Well, we know that the THRESHOLD voltage, or the voltage at which a MOSFET "turns on," is a strong function of temperature.  At COLD temperatures, it takes more voltage on the gate of a MOSFET to turn it on.  In addition, the THRESHOLD voltage is something that varies as the chips are manufactured.  So, depending on what lot you look at, the THRESHOLD can vary a little.  Of course, there are also MISMATCHES in thresholds, in that not all transistors on the SAME die have the exact same THRESHOLD.  We normally run statistically based simulations where we model the manufacture of chips, including the THRESHOLD voltage.

In the main bias of this particular chip, there was a single NMOS transistor that had to "turn on" to make the reference clock, and hence the power converter, run as expected.  Due to manufacturing, and some random bad luck, that key transistor was "off" when the chip was cold.  Leakage currents in the power supply and heat from the environment would eventually get the CHIP hot enough that it would start.  Once the chip starts, it creates its own heat, which keeps the converter running.
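Why does a few tens of degrees matter so much?  A toy model in Python makes it concrete.  The threshold temp-co, bias numbers and sub-threshold parameters below are hypothetical, not from the actual chip; the point is that a gate voltage that barely turns the FET on when warm can sit well below threshold when cold:

```python
import math

K_B = 1.380649e-23      # Boltzmann constant (J/K)
Q = 1.602176634e-19     # electron charge (C)

def vth(temp_c, vth_27=0.50, tc=-1.5e-3):
    # Threshold voltage rises as temperature drops (hypothetical temp-co)
    return vth_27 + tc * (temp_c - 27.0)

def sub_vt_current(vgs, temp_c, i0=1e-7, n=1.5):
    # Simple sub-threshold model: exponential in (Vgs - Vth)/(n*kT/q)
    v_t = K_B * (temp_c + 273.15) / Q
    return i0 * math.exp((vgs - vth(temp_c)) / (n * v_t))

VGS = 0.50   # the gate voltage the bias network happens to provide

i_cold = sub_vt_current(VGS, -20.0)   # well below threshold when cold
i_warm = sub_vt_current(VGS, 27.0)    # right at threshold when warm
print(f"warm/cold bias current ratio: {i_warm / i_cold:.1f}")
```

Even this crude model gives nearly an order of magnitude less bias current at -20C; in the real chip the starved bias translated into minutes of start-up delay, shrinking as the part warmed.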

A key observation in debugging this was the exponential improvement in start-up time vs. temperature.  The main bias device was in sub-threshold when it was in the "bad state."  Sub-threshold-biased FETs behave like BJTs, which are exponential devices.  To make things worse, this chip was made in SOI (Silicon on Insulator), so the MOSFET is isolated from its environment, for the most part, by glass.  However, there is metal in the chip that goes out to the package pins and can bring in heat.  It's clear that the bias circuit was NOT simulated properly.  To do that you must:
1.  Check simulations to show that the voltage on the FET exceeds its THRESHOLD by, say, 100mV
2.  Check simulations at COLD temperature to prove this is still the case
3.  Check simulations with "SLOW" manufacturing conditions to prove the MOSFET is still ON
4.  Check with "Monte-Carlo" simulations (at least 100 random cases) to prove the FET is STILL ON (with margin)
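Check 4 can be as simple as sampling the threshold distribution and counting margin failures.  A toy version in Python follows; the distribution numbers are hypothetical, and a real Monte-Carlo run uses the FAB's statistical models, not a hand-rolled Gaussian:

```python
import random

random.seed(1234)   # reproducible toy run

# Hypothetical numbers -- use the FAB's statistical models in practice.
VTH_NOM = 0.50      # nominal cold threshold (V)
VTH_SIGMA = 0.03    # manufacturing + mismatch sigma (V)
VGS = 0.65          # gate voltage the bias circuit provides (V)
MARGIN = 0.100      # required overdrive margin (V)

N_RUNS = 1000
failures = sum(
    1
    for _ in range(N_RUNS)
    if VGS - random.gauss(VTH_NOM, VTH_SIGMA) < MARGIN  # FET not solidly on
)
print(f"{failures}/{N_RUNS} runs fail the {MARGIN*1e3:.0f} mV margin check")
```

With these numbers a few percent of the runs fail, which would already be a yield problem; the goal of check 4 is zero failures, with margin to spare.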

Now, if you do the above four checks, you will find that this circuit fails.  This chip was designed by very senior people; however, everyone makes mistakes.  A chip NEVER cares who you are or HOW you design.  If it's not a good design, you will see it in the lab sooner or later.  Also, a large number of the DEBUGS I have done have involved issues with DC bias.  Never take DC bias casually, especially if you are a senior designer.  Never tape out a DC bias block without proper verification AND, ideally, a peer review.  That way you won't spend hours in the lab, two minutes at a time.

That took more than 2 minutes to write.  Hopefully not more than 2 minutes to read!

-SSA
"The postings on this site are my own and do not necessarily represent the postings, strategies or opinions of Microsoft."

Wednesday, May 15, 2019

The Product development treadmill - Focus

Yes, I know it's been a long time since I made a post; however, it's for the same reason I have posted in the past.

I was browsing the web and found a Forbes article, Getting Off The "Bad Growth" Treadmill by Cesare Mainardi and Paul Leinwand.  The commentary in that relevant posting is that the low-risk way to expand your market can be a disaster.  The definition of "low risk" is a moving target as technology development speeds up.  This has been a problem throughout my career, and it is being compounded by the effects that drive the singularity, as discussed by Ray Kurzweil.  As progress speeds up and our competitors get better, the treadmill runs even faster!  Playing it safe is more "unsafe" than ever before.  But what does this mean?  What is safe?

So now I can tell my story.  It was the early days at my first real job.  We had been working on non-ADC-based Ethernet PHYs: 10-T, 100-T.  Those product developments were not without their challenges; however, they were simple compared to today's ADC-based Ethernet chip designs.  Back then we could fit a whole transmitter on one large schematic sheet.  The receiver was a handful of blocks, including a few data slicers.  We had simpler process technologies with only a handful of metal interconnect layers.  Verification was mainly analog at the block level, and the top-level simulations often black-boxed the analog, since few loops went from analog to digital and back.  So when we started the next generation, Gigabit Ethernet, we used the same strategy as before.

Since ADCs were new to the company, we "had to" create ADC test chips.  Now, test chips are nice; however, they do consume resources.  The whole design cycle is involved: definition, schematics, layout, verification, tape-out, packaging, test board, test program, lab testing and data analysis.  So the test chip was basically a huge effort.  In the end, we got a reasonably good Flash ADC from the test chip.  I was studying this as I was working on the gain-control circuit in the Analog Front-End (AFE).  I also owned the ESD and GMII interface.  I worked on these while I watched the more senior people assemble the test chip for the PHY.  So we were toiling away when...

We heard that Broadcom (our competitor) was sampling 1000-T.  Our competitor went "ugly early": instead of creating a test chip, they created a functional 1000-T transceiver.  Now, it didn't have the lowest power, nor did it have a 100% standard-compliant link.  However, it was the FIRST 1000-T PHY, and the fact that it was first changed everything.  Broadcom had been doing ADCs for years and didn't choose to create a lone ADC test chip just to throw it away later.  They used their valuable resources to focus on what was new, which was the ADC-based PHY concept.  Now, that must have been a wild development at Broadcom, but the focus was keen in that they concentrated only on the new stuff.  The Broadcom chip didn't even support the slower modes, just enough to prove the 1000-T standard concept.  In hindsight, I admire their focus and the strategy of their approach.

Late in our development it was determined that the chip wouldn't work because some of the "stripped out" features were required.  Confusion about what was "good enough" erupted.  Key analog lead resources were overloaded by the changes.  Test-chip-quality blocks were expected to be product-level quality.  The scale of the development became obvious to the designers, since this chip was about 10X more complicated than the previous analog-based PHYs, and the old methodology was not working.  So we changed: we added program managers and reduced the load on key lead resources.  These changes also took time.  (Looking back, our competitor also had to deal with the change in style; however, there was a clear focus.  They were acting while we were reacting.)  So after many long nights we did tape out.

Never did we consider that a new or different approach would need to be introduced.  Issue tracking is one that can free leads from memorizing huge sections of the design.  Also, the design can be staged.  For example, it is possible to run multiple variations of a chip (sometimes called a pizza mask).  Some versions (slices) could be for debugging, other versions for samples.  This "parallel" approach helps mitigate risk and long schedules.  Now back to the story.

When silicon came back, the evidence of the rush was clearly visible.  On the 5th revision, we were able to send data between two like parts.  By that point, the competitor was in production, selling parts.  While our competition was working on their next generation, we were still fighting with the first.  By the 12th revision, the part was basically working; however, the specs had changed to match reality.  Meanwhile, more competitors appeared with lower-power parts, such as Marvell's Alaska, ironically assisted by designers who left.  I should have followed.

I have often wondered about marketing vs. engineering leading chip development.  A change in project focus at the last minute is a product killer.  Later, during executive training, I found that what bothered me is that when people who understand the technology are not included in decisions, disasters occur.  The reason is that (I believe) it is unethical for those who lack the proper knowledge to make technical decisions that affect the architecture of a chip.  This would be like a pediatrician planning brain surgery.  In good organizations, there is feedback and accountability that can reduce or eliminate this bad behavior if it were to occur.  Post-mortems occur, with a wide audience invited.  Good companies even have a step in product development where the execution of marketing and engineering is compared to original estimates.  A test for this is to simply ask a program manager for the history of resources on a past program.  If that information is not available in any form, then there may be a problem, since you can't learn from your own past.

One thing that is true about product development these days is that the biggest risk is making a mistake in the strategy.  If you take on too little risk, you may not have a compelling part.  If you take on too much risk, then you will have a painfully buggy product.  Marketing and engineering need to pick the balance.  The development strategy and methodology that works for one type of chip may be a horrible idea for another.  A good post-mortem and accountability help an organization grow in the direction that improves execution, allowing for greater risk.  When in doubt, focus on delivering; there will always be distractions.


Disclaimer - New Posting Ahead

SSA here, back after a long break.  I have changed companies and have joined the team at Microsoft.  My hope is that I will have some more time to share some debug stories from the past.  However, I must state:

"The postings on this site are my own and do not necessarily represent the postings, strategies or opinions of Microsoft."

Monday, November 16, 2015

Si Valley Scourge: The Neglected Band-gap Reference Circuit

Personal Note:
Hi all, it's been a while since my previous post.  Thanks for checking out this blog; as I keep stating, it's more for my sanity than anything else.  Things in Silicon Valley can change quickly, and it takes focus to hold on, which can take away from writing.  I didn't realize how many people enjoyed this blog until I stopped writing it.

Now that I am in between jobs I can follow up with a few posts.  I was let go (along with my whole team, as a RIF) by Semtech a week after 64GHz 32nm silicon arrived.  It appears that if you do good work, you can work yourself out of your job.  It's not unusual for hardware designers to be treated this way in the valley, since the perceived value now is all software.  So Facebook, Twitter, Google: someday you may need to increase the speed of your network software with new chips.  Who is going to do it reliably and cost-effectively is hard to answer.

The development costs are skyrocketing, the analog designers are retiring, and few students are embracing analog.  Eventually it makes sense that software companies may need to employ chip designers to increase the performance of the complete software solution.  This needs to be done soon: as circuit designers age and retire, the risk and cost of developments will increase.  Just a warning that the chip designers, the base of the vertical integration play that software capitalizes on, are presently going away.

Now for a post to underscore the value of senior mentors

------------------------------------

Bandgap: Scourge of the Valley    

Millions of dollars of yield loss, debug, and mask revisions have been related to simple DC bias circuits.  The neglect of the band-gap is a scourge of Silicon Valley.  Every day, people wake up and drive to work to deal with production problems, some related to poor DC circuits that could easily have been prevented.  Wasting time and money drives me nuts; maybe this post can help wake up senior designers to this issue.

I have been threatening to write this "bandgap" post for months.  I was out having a drink with Jim Gorecki and Mike Le earlier in 2015 in Santa Clara.  We were talking about high-speed track-and-holds, interleavers for GHz-rate ADCs, their different circuit constructions, and how not to do it.  Mike was discussing some tips on SAR (successive-approximation ADC) design he learned while at Broadcom (pre-Avago); he always has good ideas.  I love what I do and really enjoy talking shop with experts, even off hours, while respecting IP; most people who know me are aware of this.  I can turn it off, but my assumption is that engineering is fair game when around other good engineers and drinks.  So after splitting a pitcher with Jim (I think it was Bass Ale), he asked the waitress for a pen and paper.  What is Jim going to do?  I was feeling pretty fuzzy, since I don't normally do any analysis when drinking.  Jim wanted some drunken circuit analysis and I was game.  The conversation was circuits we hate, and how some rookies unknowingly take personal ownership of bad ideas.

The waitress returns with pen and paper, and Jim begins to draw a band-gap reference circuit.  Now, for those who do not know what this is, I will explain.  Most analog chips need a reference voltage that is independent of temperature.  Say you want your chip to turn on when the power supply is 90% accurate.  You can build a circuit that compares the power-supply voltage to this internal reference voltage, allowing a decision to be made.  Since the circuit compensates for temperature, the turn-on voltage should ideally be independent of temperature.  The band-gap circuit produces the DC current and voltage references used in the analog.

Bandgap reference circuits are called that since the stable output voltage tends to be around 1.17V, which is very close to the band-gap potential of silicon.  This voltage is created by summing two different signals: one that is proportional to absolute temperature (PTAT) and another that is the opposite (CTAT).  By scaling and summing CTAT and PTAT it's possible to get a voltage that is independent of temperature (to first order).  What is interesting about the band-gap is that its "operating point," or DC condition, is supposed to be basically the same over temperature.  This is supposed to make things easier to design.  Now, a subtle point here is that the CTAT and PTAT need to balance.  There are two solutions: one where CTAT and PTAT are non-zero, and one where everything is off (PTAT = CTAT = 0).  So band-gap reference circuits REQUIRE a start-up circuit to make sure they don't stick at 0.  Start-up issues are a risk in these types of circuits.  Of course, the experienced analog designer is ALWAYS trying to make things easier, reduce components, exploit the situation, etc.  Unfortunately, experienced designers RARELY design band-gap references these days.  The chip business now is under cost pressure and low-staff consolidation.  This means under-resourced programs and late hours, so DC circuits (like the bandgap reference) are often designed by rookies or even interns to reduce the load on senior designers.  The lack of mentors is a huge problem in the valley, made worse by the aging and retiring of senior analog designers.  The more transistors and the more signals, the more difficult the start-up circuit becomes, often exceeding the skill level of a rookie.
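The CTAT-plus-scaled-PTAT sum is easy to see numerically.  A first-order sketch in Python; the diode coefficients below are textbook-style placeholders, not any real process:

```python
import math

K_B = 1.380649e-23      # Boltzmann constant (J/K)
Q = 1.602176634e-19     # electron charge (C)

def v_ctat(temp_k):
    # CTAT: diode voltage falling roughly 2 mV/K (placeholder coefficients)
    return 0.65 - 2.0e-3 * (temp_k - 300.0)

def v_ptat(temp_k, density_ratio=8):
    # PTAT: delta-Vbe of two diodes run at an 8:1 current-density ratio
    return (K_B * temp_k / Q) * math.log(density_ratio)

# Scale PTAT so its positive slope cancels the 2 mV/K CTAT slope, then sum
ptat_slope = (K_B / Q) * math.log(8)     # V per K
gain = 2.0e-3 / ptat_slope

for t in (250.0, 300.0, 350.0):
    vref = v_ctat(t) + gain * v_ptat(t)
    print(f"{t:.0f} K: Vref = {vref:.4f} V")
```

With these placeholder numbers the sum lands near 1.25 V and is flat over temperature to first order.  The real curvature, and the everything-at-zero second solution that demands a start-up circuit, only show up with real device models.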

Side note for non-EE folk: a reference voltage must be available for a chip to function.  The circuits that control this by nature need to be simple, since a failure is so devastating.  I think the band-gap circuit is analogous to a seat belt on an airplane.  Next time you board a plane, look at the seat-belt buckle.  Notice how simple it is and how few components are involved.  The seat belt is simple, tough and reliable, as a good reference circuit should be.

So Jim draws his circuit, starting at the core, which is a pair of mismatched diodes that give the PTAT signal, with the diode voltages themselves giving the CTAT signal.  Now, this part of the diagram is standard.  It's important that you understand what is going on and not do anything to break the CTAT and PTAT relationship.  So the initial bunch of diodes Jim drew made sense.  After that, things got crazy.  Jim kept drawing, drawing more transistors (why so many when they are not needed?).  Jim finally completed the circuit and asked some questions.  The first thing I mentioned (now 4 beers in) is that the amplifier was designed to handle a huge input voltage range.  Recall earlier I said the band-gap voltage should NOT CHANGE.  But the circuit was designed in such a way that the bandgap voltage could be anywhere at any time.  (Normally this fact is used to simplify the design.)  The way the (rookie) designer implemented this was to put a couple of front-ends on the amplifier that cover different voltage ranges.  This was probably done because "the bandgap is a boring circuit, it's DC, and I am fresh out of college, so I will use all of my circuit skills to build the world's best bandgap."  Or, more likely, the rookie designer didn't want to just start with a previous working design, since he/she wanted to prove themselves.  No mentor was there to talk the rookie out of it.  Chip design is NEVER about the designer and should only be about making a good, solid chip design.  Chips don't care who designed them or what your name is.

So why did Jim hate the circuit so much?  Well, its poor design was preventing a company from making money on some parts.  It can be a million-dollar issue over time, depending on the manufacturing cost and how the test is done.  We are in industry to sell parts.  So this over-designed DC circuit was preventing the work of a team of 10+ year designers, marketing, and production from being fully capitalized on.  Plus, it was a dumb analog idea (I think the latter is what bugged Jim more than me).  Looking at the circuit, even after half a pitcher of Bass, I could see at least 3 operating conditions leading to problems: one when the input of the amplifier starts at a high voltage, another with low voltage, and a third when it's in between.  Verifying that start-up circuit would be tough, since you would have to force the circuit into the different states and watch how it starts.  Due to the complexity of the DC bias for the rail-to-rail input, new failure mechanisms were introduced to the design, affecting the start-up.  So again, the road to hell is paved with good intentions...  A few trash cans full of parts..  Bandgap non-start is a product killer.

"It works great in simulations" and "It always starts up" are statements from the original rookie designer.  The lab/ATE says differently.  (The lab is the truth and anything else is bullshit. -KD)  Jim and I know that a design that "works" in the simulator is interesting, but that is not sufficient to conclude it belongs on a production die.  Area, power, temperature performance, power-supply rejection, robustness, start-up and BJT size choice all affect a design and may not be covered in simple simulations.  Also, what about margin?  The best designers have a break-it mentality, in which the circuit is stressed in simulations beyond the required specifications.  This is an effort to see if a circuit is "at the edge of a cliff," or very sensitive to margin.  A new circuit needs to be checked if it doesn't have a long history of ATE results, particularly a reference circuit.  In Jim's case, an overly complicated amplifier that sums the PTAT and CTAT signals had a start-up problem.

Now for some do/do-not bandgap pitfalls/hints:
1.  DO NOT: use the smallest BJT available from the FAB.  The model may not be accurate and your matching will be poor.  The bigger the BJT, the less sensitive it is to modeling imperfections and mismatch.  DO NOT edit the layout or the model parameters of the FAB-supplied BJT.
2.  DO NOT: put switches in series with your diodes.  The diodes are there to produce the CTAT signal, which can be counteracted by a high-temperature increase in the resistance of the switch.
3.  DO: Change the temp-co by adjusting resistors, not by switching diodes in/out; use voltage division.
4.  DO: Make sure the output transistors are cascoded for high DC power-supply rejection, and simulate PSRR both AC and transient.
5.  DO: Make sure your start-up circuit has been peer reviewed by at least one 10+ year designer, with a proper test-bench functionally covering all start-up devices.
6.  DO NOT: run the bandgap on a high power supply without a voltage limiter on the output voltage.  Check for safe operating conditions in both powered and powered-down states.  There is a chance you can blow a thin-oxide FET mirror when an internal block powers down.
7.  DO: Common-centroid the diode array with a full set of dummies around it.
8.  DO: Use a unit resistor.  The reference resistor placed around the chip has to match.  The ONLY way is to make the reference resistor a schematic design and layout.  It should have a proper W/L for matching and an appropriate contact count for electromigration.  DO NOT USE SQUARE resistors or resistors < 1 square in a reference circuit.  These will make a poor reference.
9.  DO NOT: run the output mirror in sub-threshold, since matching will be poor.  Long channel lengths are not a bad thing.
10.  DO: Include test modes (analog test bus) to isolate the output voltage, PTAT and CTAT signals for debug.
11.  DO: Send the internal VDD and VSS of the band-gap out the test bus to identify IR drops to/from the bandgap inside the chip.
12.  DO: Include a bypass mode to force the bandgap output externally.
13.  DO: Make sure your bandgap has more than 80 degrees of phase margin.
14.  DO: Double-check the package model for an off-chip resistor.  A common mistake is to have a circuit go unstable when a customer puts capacitance on the external reference pin.  This can oscillate, and I have seen it at 3 different companies.
15.  DO: Remember that for RF circuits, low-frequency phase noise can come from the bandgap.  It's important that any RF circuits have the noise from the reference included for proper jitter calculations.
16.  DO: Think through the design first, before getting busy with schematics/layout.
17.  DO: Check for IR drops in reference currents distributed from the band-gap.  The pitfall here is to use too-narrow wires on the bandgap output.  Often I will calculate the metal width needed to send a current 1000u and then put all the output ports on the bandgap with that metal width.  This way, when a layout person connects it, they have a hint at how wide the wire needs to be.  Schematic notes also help.
18.  DO: Monte-Carlo the start-up circuit.  Does it always start up?  Does it always disengage?
19.  DO: Stress the circuit beyond specs (push the temperature high and the supply low to see how it fails).
20.  DO NOT: have any extra bandwidth in the circuit.  It's easy in deep submicron to get high bandwidths (and noise) from internal nodes, so low bandwidth, low power, and filter capacitance are good things here.  You should have a spec on how clean the output current and voltage are of noise.  In audio designs this can be very challenging, since we can hear low frequencies.
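Item 17's width calculation can be roughed out in a few lines.  A sketch in Python; the electromigration limit and sheet resistance below are hypothetical placeholders, so always pull the real numbers from the FAB's design rules:

```python
# Rough sizing of a wire carrying a bandgap reference current.
# Placeholder numbers -- use the FAB's EM and sheet-resistance rules.
I_REF = 100e-6       # reference current to distribute (A)
J_MAX = 1e-4         # hypothetical EM limit (A per um of width)
R_SHEET = 0.07       # hypothetical sheet resistance (ohm/square)
LENGTH_UM = 1000.0   # route length (um)

width_um = I_REF / J_MAX             # minimum EM-safe width
squares = LENGTH_UM / width_um       # squares of metal along the route
ir_drop = I_REF * R_SHEET * squares  # IR drop over the route (V)

print(f"width >= {width_um:.1f} um, IR drop ~ {ir_drop*1e3:.1f} mV over {LENGTH_UM:.0f} um")
```

Putting the computed width on every output port of the bandgap symbol, plus a schematic note, gives the layout person the hint item 17 describes.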

At a previous company I was the technical lead of a chip with a band-gap reference driven by an intern.  This band-gap had to be spun at least 3 times during the development (the most expensive I have seen in my career).  The weakness here is that the intern had basically no experienced mentor.  The base floorplan was flawed and not reviewed before route, the output was un-cascoded, and diodes were switched, leading to high-temperature PTAT failures.  There were also phase-margin struggles and poor ESD on the external reference pin.  The schematic was nearly impossible to follow.  We spent months on the design simply because someone didn't "begin with the end in mind" on the circuit.  I feel sorry for our intern, but I was unable to change his situation for political reasons.  Few things upset me more than wasted company time, money, and an opportunity to train a rookie correctly (lost opportunities are my ultimate frustration).

Most importantly, remember: we are all chip designers, not block designers.  If there is ONE bad block on a chip, then the WHOLE chip could be bad.  So it's important that rookies ask for help, to make sure they are doing the right thing.  Experienced people should also be very interested in the pedigree of the reference circuit, since if it fails, the rest of their work could be meaningless.  So don't skip that band-gap design review!  Let's make sure the new designers start, and stay, on the right track.

So double-check that bandgap.  Who designed it?  Is it in production?

Not only are band-gaps neglected; there are other, similar circuits that need mentorship.  These killer circuits should have a senior mentor and include:
1.  UVLO Under-voltage lock-out circuits with hysteresis
2.  Power-on reset circuits
3.  Start-up oscillators (relaxation oscillators)
4.  Any voltage regulator that has a PIN that goes off-chip
5.  Reference circuits with multiple power-supplies/sequencing
6.  ESD/Padring hookup with multiple supplies


It just takes one mistake to kill your product, and unlike software people, we can't simply recompile.  Senior analog folks are getting rarer and are spread thin; however, it's NOT a good idea to isolate them from the above designs.  The cost of a DC failure is increasing, and I would hate to think what a start-up problem would cost to fix in FINFET technology.  It could cost a few million in masks and months of debug in the lab, all because you wanted to save a few $$ on the designers.


Tuesday, August 5, 2014

Why is a prime number table on my desk?

It's interesting how history tends to repeat itself.  People hire me on to help them out on their data converters, and then for one reason or another don't take advantage of my knowledge.  "We got that data converter guy, now what?"  It's funny how much people will pay you to hang around.  History repeats..

So this first happened while I was working at a previous large company not known for analog.  There was a group of experts working at a remote design center in Israel.  I was told "these guys are experts at math" and mixed-signal circuits, so for me that should make the job easier.  Math can settle an argument in any language; it doesn't matter what language you think in if the math works.

The design was an 8-bit low-power ADC for a gigabit Ethernet chip (over copper, in 2000).  The gigabit version of the Ethernet standard still uses time-domain templates and is a PAM-5 solution.  Of course, you need to transmit more levels than that since there is a transmit low-pass filter.  Link margin was low, and the standard relied on a DFSE and a Viterbi detector.  What this all means is that the "eye" at the input of the receiver is a mess.  It has a bunch of intersymbol interference and echo (echo is the transmit signal reflecting back into the receiver).  So basically this IEEE standard requires an ADC instead of a simple slicer at the receiver input.  Of course, some analog pre-processing can reduce the spec on the ADC, but you can't ever get rid of it.  So you have to build it. (Paper on Gig RX)

So it was my duty to "architect" and lead a team to "design" this ADC.  The schedule was short, just a few months, which is not long for an ADC.  There were all kinds of things we could do to reduce the area and power of that ADC, but given the tight schedule and 20+ hrs/wk in meetings, we had little choice but to go bare bones: a non-scaled pipelined ADC.  No problem for this fine team (Link to Paper), who ended up building the lowest-power embedded ADC at the time, later published by Springer thanks to Perry's help.  So I put together an error and power budget, broke the design into subsections, and staffed it with the team who comprise the author list on the Springer paper.  The capacitor array layout was drawn and balanced perfectly by Mel Sparkman.

During the development of this ADC it was important for the Israeli design team to be involved in all the steps of the process.  "ADCs are magical things" is what some people think, even as recently as maybe a week before I typed this.  There is fear and confusion about data converters, error correction, and calibration among many electrical engineers.  I admit they are tricky and combine system expertise with circuit knowledge.  System-level specifications end up setting the size of capacitors, transistors, and amplifier architectures inside the ADC.  Several times during my career, people from the sidelines have tried to get involved and micro-manage the design process to "make sure" I am doing the right thing.  I don't mind people watching me, I think I know what I am doing, but it does slow me and the team down to explain everything.  You want an ADC or a lecture?

So I put together a block diagram and wrote the behavioral model in C.  The Israel team also put together their own MATLAB model to confirm my results.  During the first meeting there was "a problem" with my ADC.  For some reason I was accused of "sandbagging" my design.  It was always my fault.

"How much margin do you put in your design, SSA?" - asks senior architect
"Normally I don't comment on design margin, but there is some" - SSA
"I think your ADC is WAY better than what you state - why so much margin?" --asks senior architect
"Show me the data" - SSA

So he brings out a MATLAB result, showing our 8-bit ADC with an SNDR of 59 dB.  Now, there is an equation for ENOB (Effective Number of Bits).  The math is:
6.02(ENOB) + 1.76 = SNDR
So for SNDR = 59 dB, ENOB = 9.51
It's an 8-bit ADC, so theoretical max SNDR = 8*6.02 + 1.76 = 49.9 dB

59 dB (result) > 49.9 dB (theoretical 8-bit)
So it did appear that I had ~10 dB of margin?  He got "better than theoretical" performance..
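The arithmetic is easy to check in a couple of lines (a quick sketch of the formula above, nothing more):

```python
def theoretical_sndr_db(nbits):
    """Ideal SNDR of a full-scale sine through an N-bit quantizer: 6.02N + 1.76."""
    return 6.02 * nbits + 1.76

def enob_from_sndr(sndr_db):
    """Invert the same formula: effective number of bits from a measured SNDR."""
    return (sndr_db - 1.76) / 6.02

print(round(theoretical_sndr_db(8), 2))   # 49.92 -- the 8-bit ceiling
print(round(enob_from_sndr(59.0), 2))     # 9.51 -- "better than theoretical"
```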
 
The SNDR values are affected by things in the test vectors going into the ADC (or out of a DAC).  There are underlying assumptions that, when violated, can give you better-than-theoretical results.

"Ahh I see, you have higher than theoretical, so you have a problem in your test vector" - SSA
"What do you mean, I am a mathematical expert with much more experience.. etc" -Senior arch
SSA Responds:
There are assumptions behind 6.02(ENOB)+1.76.
1.  The input signal is a sine-wave; this formula ONLY works for sine-waves (first non-streetsmart mistake)
2.  The quantization noise is a "uniformly distributed" random variable, one "stair step" (LSB) of the ADC wide

20*Log10(Signal_rms/Quant_noise_rms)=6.02(ENOB)+1.76

Quant_noise_rms = LSB / Sqrt(12), which is the standard deviation of a uniform probability density function one LSB wide.  LSB = VFS/2^8 for an 8-bit ADC
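Those two assumptions are exactly where the 6.02 and the 1.76 come from.  A quick numeric sketch (assuming a full-scale sine of amplitude VFS/2) ties the pieces together:

```python
import math

nbits = 8
vfs = 1.0                                 # full-scale range, arbitrary units
lsb = vfs / 2**nbits                      # LSB = VFS / 2^N
signal_rms = (vfs / 2) / math.sqrt(2)     # RMS of a full-scale sine-wave
quant_noise_rms = lsb / math.sqrt(12)     # std. dev. of a uniform PDF one LSB wide

sndr = 20 * math.log10(signal_rms / quant_noise_rms)
print(round(sndr, 1))                     # 49.9 -- matches 6.02*8 + 1.76
```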

Now a common mistake is to violate the Quantization noise uniform PDF criteria.

"What input signal frequency did you use?"  - SSA
Answer: Fs/4, or 0.25* Sample rate.  - Senior Arch

So now SSA knows the problem.  "I know your problem"
For Fs/4, the ADC sample pattern is +1, 0, -1, 0, +1, 0...
So the quantization error is:  e1, e2, e3, e2, e1, e2... (repeating... that's NOT random!!!)
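You can watch that pattern fall out of an ideal quantizer in a few lines (a sketch; the 0.49 amplitude and 0.1 rad phase offset are arbitrary picks just to keep the samples off exact code boundaries):

```python
import numpy as np

n = np.arange(12)
x = 0.49 * np.sin(2 * np.pi * 0.25 * n + 0.1)   # sine at Fin = Fs/4
lsb = 1.0 / 2**8                                # ideal 8-bit quantizer
e = np.round(x / lsb) * lsb - x                 # quantization error

print(np.round(e / lsb, 3))   # the same 4 error values repeat -- NOT random
```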

So a couple weeks ago it happened again, with a senior expert showing greater-than-theoretical ENOB... and it reminded me of this.  All the guy had to do was call or ask, but what would I know, soaking up that paycheck as a data converter expert...

So Fs/4 is a bad test signal for ENOB since it violates the "uniform probability density" assumption in the quantization noise power calculation.  Fs/4 is not bad for debug if you are trying to isolate a bad code, so don't get me wrong.  There is a time and place for everything.

So how should we pick the input signal?

That's based on a "sentence" in my mind.  If we pick a signal that is related to the sample rate by a prime number, we know we will break up that "pattern" and whiten the PDF of the quantization noise.  It goes like:

"In 2^N samples I want a prime number of input signal periods"
    2^N     *      Ts      =    Prime * Tinput

2^N = desired number of data points
Ts = sample period
Prime = a prime number: 3, 5, 7, 11, ...
Tinput = period of the input signal

This can be re-written in the form found in Maxim and ADI app notes:

Finput = Prime * Fsample / 2^N

So this is where the prime number table comes in!!!
If your input frequency meets the above equation, I promise you will NEVER get better-than-theoretical SNDR, especially with larger primes.  A popular one was suggested:

Fin=20.20263672MHz
Prime = 331
Fsample=125MHz
2^N=2048

Notice that I had to carry all those digits; this is one of those cases where you need to key in every one of them.  To this day, nearly 14 years later, I still have that frequency memorized.  We used it during simulations, verification, and validation in the lab.
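For the record, the value pops straight out of the formula (a two-line sketch):

```python
fs = 125e6      # sample rate
npts = 2048     # 2^N points in the record
prime = 331     # prime number of input periods captured

fin = prime * fs / npts
print(fin)      # 20202636.71875 -- the 20.20263672 MHz above
```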

Senior Architect guy plugged in the above numbers and lost about 11 dB in SNDR.  Oh well, I could have just acted like I could architect a better-than-theoretical ADC.
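The whole episode is easy to reproduce with an ideal quantizer model (a simplified sketch: an ideal 8-bit quantizer, a coherent record, a bare FFT, and a small arbitrary phase offset; the absolute numbers won't match the MATLAB setup from the story, but the effect is the same):

```python
import numpy as np

def sndr_db(fs, fin, nbits=8, npts=2048):
    """Quantize a full-scale sine with an ideal ADC and measure SNDR by FFT."""
    n = np.arange(npts)
    x = 0.5 * np.sin(2 * np.pi * fin / fs * n + 0.1)  # full-scale sine
    lsb = 1.0 / 2**nbits
    xq = np.round(x / lsb) * lsb                      # ideal quantizer
    spec = np.abs(np.fft.rfft(xq))**2
    k = int(round(fin / fs * npts))                   # signal sits exactly on bin k
    noise = spec[1:].sum() - spec[k]                  # all bins except DC and signal
    return 10 * np.log10(spec[k] / noise)

fs = 125e6
print(sndr_db(fs, fs / 4))            # absurdly above the 49.9 dB limit: the
                                      # error energy folds onto the signal bin
print(sndr_db(fs, 331 * fs / 2048))   # prime-coherent: honest, near theoretical
```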

So this is the reason I have a prime number table on my desk.  I need it to set signal frequencies for test-vectors that do not violate the assumptions of the ENOB equation.  All data converter experts should have a prime table.  They are free, and available via Google.

If you ever see higher-than-theoretical ENOB (or SNDR) for a data converter, then one of the underlying assumptions has been violated.  This means the ADC was not being "exercised" fully and the results are invalid.  (You could always look at a code histogram.)  If you are an experienced guy and show better-than-theoretical ENOB, then you will no longer look experienced in front of your peers.

Wednesday, June 11, 2014

Building an Analog Team

Recently I have been tasked with adding staff to our design team.  As with any product design team, there is always pressure to get the part done right in zero time.  Of course, real analog circuits take time to build, which is often underestimated by non-experts.  Most schedule problems come from non-experts making schedules, with the experts resigned to pushing back.

So you want to build up your team?  Who do you hire?  This is a really tough question; in reality it depends on what you are doing and what the key constraints are.  Is it time to market?  Risk?  High mask cost?  All of the above?  Are you adding to a team or creating one from scratch?

In the radar business the process technology is very advanced and the related costs are high.  There also appears to be an unrealistic expectation that first silicon is product worthy.  Although real chip designers know this to be unwise, staffing can mitigate the risk of an imperfect chip.  The first ingredient is finding someone who has made a mistake and is willing to admit it.  These characters tend to be older, more mature, have advanced degrees (Ph.D. or MSEE), and have lots of different product experience.  The problem with this profile is that older, more experienced designers recognize unrealistic expectations better than younger engineers (though that can also help to reset expectations).  Recently I have phone-screened candidates who are afraid of the position I am trying to fill because they perceive the risks as too high.

Super senior analog designers often reflect their past.  This shouldn't be a surprise, since we are the sum of our experiences.  I met one analog designer today who knew Verilog, Verilog-A, and SystemVerilog, which can be considered "younger engineer" skills.  Before 2000, analog was not taught with a bent toward system-level verification.  Chips were small and simple, and as the industry has changed towards system-on-a-chip (SoC), these verification tools have increased dramatically in value.  However, finding an engineer with modern skills who has designs in over 100,000,000 shipped parts/components (for example) is nearly impossible as far as I can tell.  This is a message to my mid-career colleagues: it's important that you pick up these newer verification skills.  They are easier to learn than trying to convince a young hard-headed designer to abandon an unreliable circuit or poor layout.

Where is your team?  That is also important.  Ideally everyone should be in the same physical location, but that is no longer realistic.  Companies are now keeping track of "high cost centers" such as San Jose, where the cost of doing business is high.  The value added at these centers needs to justify the high costs.  My expectations of engineers in Silicon Valley are high.  You should know architecture, circuit design, and basic CAD, and have good communication skills.  If you are going to work with remote sites to take advantage of outsourced or "lower cost centers," you need to have good communication skills.  I have met some very good technical people recently in my search for staff who have difficulty communicating.  I don't want to need a note-pad to communicate basic concepts.

Team balance is very important.  Too many super senior guys can lead to a lower-energy environment.  Also, not everyone can do the big architectural tasks, so ideally you need a mix of people.  I recall discussing this years ago with my friend Perry Heedley using the basketball team analogy.  People need to play different cooperative roles, working together to get things done.  There needs to be trust in the team, so I am very sensitive to anyone who may have a reservation about a candidate.  This could be technical or personality, but there is not any room for a player to bring down the rest of the team.  It's not uncommon for me to go through 50 resumes before I get a candidate that fits the team, often with compromises.  I am most sensitive to candidates that make me nervous for any reason.  I understand people really want a job these days.  On LinkedIn recently there have been some good short postings in my feed about how to interview.  As a hiring manager I read these, and a key failing I have seen recently is a candidate trying to "take over" the interview.  That doesn't work well with me.

Is the person temporary or full-time?  Temporary help (contractors) should be good at communication.  Ideally they "get" the idea that they are, in themselves, a company.  The best contractors see themselves as "their own brand" with a service to sell.  This includes building the brand reputation, including advertising.  Some of the best contractors have been high-level managers or people that own their own companies.  These guys understand that it's not all technical!  Problems need to be solved, instructions followed, and problems found and clearly communicated.  Not all tasks have the same cost; the goal is primarily to deliver service.  If I can't understand a contractor's English, or they are unwilling to come on-site, that is a problem.  Also their rate: does it justify what needs to be done?  Their level of experience?  Are they working for a contract house, and if so, why?  The absolute best analog engineers I know all have full-time employment.  Recruiters ask me to refer people, but the best guys always have a job.  Time in between jobs is also very short in duration, often set by the engineer.

When do you need the person?  Good, cheap, fast: pick two of three.  If you need a good person on short notice it will cost top dollar.  If you are willing to put in the time, which is the most realistic approach, you have to staff way in advance of a crisis; otherwise you will sit helpless with a stack of resumes.  So planning here and listening to your people's needs are critical.

I am sure a few candidates I have interacted with recently would benefit from this blog.  There are also some good books and references on how to interview.  Candidates looking for a job should seek these out, since these interviewing books help the job seeker understand the position of the guy on the other side of the table.  Don't add more work by having a sloppy, long, and confusing resume with charts and graphs of your employment history.  Don't share too much; make me want to call you!  Also, when you do show up, you should have some idea of what is expected and what we are doing.  This is where standard preparation helps.  What you show up to an interview with (empty-handed?) will affect my decision to hire you or not.  It's important you understand that I am trying to help run a profitable business.


   

Wednesday, March 26, 2014

EE Conference user guide

I like to go to IEEE conferences whenever I can get away, which works out to about once a year.  My favorite is ISSCC, since it's in San Francisco where I can BART or drive, and some of the world's best analog designers meet there every year in February.  If you are not in IEEE you can still go; it just costs a couple hundred dollars more.  I have gone 14 out of the last 15 years and realized I have a whole method for getting a lot out of the conference.  This blog post is to share some of these observations: where to stay, how to pick papers and events, evening panels, and finally the social hour.

We are all busy as analog designers; I doubt many of us work less than 50 hours a week.  We are often optimizing what we do.  Some of us are more frugal than others; it all depends on what is important.  To me the important thing is that during the conference "life" is going on, and that includes analog.  A positive change to your work environment is a nice thing.  To add considerable enjoyment to the event, I prefer to stay in the hotel hosting the conference.  If my company doesn't want to pay, I will offer to chip in the difference.  If they refuse, then I pay out of pocket.  Having a room at the event is great since you can set up a remote office there to get things done and keep up with work back at the plant.  A room on a high floor is better since it's quieter and the views are better.  You normally have to ask at the Marriott for a higher floor.  It's easier to get one if you arrive the afternoon or evening before the conference starts, after making reservations a couple months in advance.

Day one is when you need to make your schedule final.  To leverage the event you may have made a plan with workmates on what paper sessions to cover.  Before ISSCC this is a first-order plan based on terse descriptions of the papers to be presented.  A drawback (or feedback to IEEE) is that it's hard to tell if a paper is really interesting, or related to what you are looking to learn about, until you get there.  The early descriptions are so short that you are forced to show up.  Maybe that is the idea anyway.  Only after seeing the actual proceedings, now made available online just before the conference, can you really pick what papers to attend.  So you need to check or make your plan by flipping through the papers.  Look for easy-to-read text, clear figures, and solid technical content.  University papers written by students and their professors are often the best written and easiest to follow.  The university papers often don't have the same performance as their commercial counterparts, but the ideas can easily be just as clever.  For performance, pick the industry papers, but you may not get all the detail, and there are sometimes empty boxes in the diagrams.  So I sort through the papers looking for understanding and/or performance, and then pick a path through the sessions where the papers are presented.  It takes me about an hour to make my plan for the next few days.

Attending the papers is not very easy or relaxing for me.  There is a huge amount of information presented.  Several years of a person's work (doctoral research) is boiled down to a ~20 minute talk.  It takes skill to ramp up quickly and follow these talks.  Sometimes you need to suspend disbelief and re-engage at the end.  If you second-guess the paper, you could miss a few key facts while you ponder.  I make a point to write notes and comments directly on the paper proceedings during the talks.  I also write down the Q/A for papers I particularly like.  Q/A can reveal information not in the publication and can also help you learn about other engineers.  For example, if Dave Robertson of ADI asks about an ADC, you need to think about what he asked and why.  If you wanted to introduce yourself to him for a potential gain in knowledge and opportunity, then you now have a reason.  When in school, professors scoop knowledge onto the students; after graduating it takes effort to keep up and read papers.  Working on a network with other engineers is key to maintaining growth and an edge over the other guy.

Yes, analog electronics is one of my favorite hobbies, but my love of it is not enough to endure 8 hours of solid technical papers.  Maybe a handful of guys out there could attend the entire conference, and congratulations to them.  For normal human beings I find it critical to put a break in the day.  Normally I schedule a daily nap around 2pm, depending on what sessions I go to.  I plan my naps on the first day.  This is also where the conference hotel really helps out.  The nap in the afternoon is critical for me.  The talks end as late as 4:30-5pm, so a 1 hr nap is not too long and I bounce back.  This is important because after the papers is when things get social, and that is when rest is needed.

Ok, I do realize analog guys are not the most social bunch, so we start with a disadvantage.  You then combine awkwardness with a wide age spread of attendees, free snacks, wine, and beer.  Fatigue also sets in later in the day, especially if you didn't have a nap.  Part of my reason for writing this post was learning that some engineers are petrified of the social hour.  My wife showed me a newsgroup where some people are paralyzed by the idea of going down to the social floor of conferences like these.  This is the StreetSmart part of the post: how to do the social hour.  Some of the social tips follow.

Advances in technology (text messaging) are not helping with the age gap at ISSCC.  Some of the most experienced and distinguished engineers may not use text messages, since they got through most of their lives without them.  The younger the engineer, the more tech-savvy.  It's my observation that the youngest engineers rarely interact with the most experienced senior designers.  The other way around also exists: some older engineers avoid the younger, and it's sad.  When approaching a professor or famous circuit designer, you need to approach them in a way that is most comfortable.  Don't interrupt, but wait to be invited into a conversation.  If you want to talk to someone, it's better if they are finishing up talking to someone else, or not already talking to someone you do not know.  Make eye contact; that is normally enough to let someone know you are interested in chatting.  Also, asking if they have time is nice, since if they are busy they will appreciate it.  When you see them next there is a much higher chance they will engage.  Be aware of people around you who may also want to talk to the same person; be polite and watch body language.  If their feet or body turn away, it's time to wrap it up.  It's important you have a reason to talk.  The easiest conversation starter is to ask about a paper presented earlier in the day or a comment made during paper Q/A.  Poster session reactions work too.  Ask what they thought of it.  Alternatively you can ask about a paper, presentation, or book written by the designer.  Author interviews are also a great place to meet people and will give you a connection.

Whatever you do, DO NOT drink too much at the social hour.  Sipping one glass of wine or having a beer is fine, but more than two and there is trouble.  You are interacting with some really intelligent people and it's a rare opportunity.  The impression goes both ways; if you drunkenly stumble into someone you are not going to make a friend.  This is a common mistake, since the free-beer thing takes a little willpower for frugal engineers.  Some engineers wrap a paper bar napkin around the beer and carry it around.  I know of at least one famous professor who does that too.  The bar napkin is good since it helps you NOT peel the label, making it less obvious that you are "carrying the beer around."  Tired and tipsy is not a good combination for making it easy to meet new people.  If you are part of the text-message generation or new to the United States, then I suggest you observe before engaging.  Most engineers are there to have a good time, so project positive energy and things should go well.  At the social hour I normally make dinner connections.  After the social hour is the time to have a few drinks and talk about circuits.

The evening panel sessions occur after dinner.  These vary in content, some being more serious than others.  The best are the debugging/analog guru sessions with the colorful characters of the analog integrated circuit community, young and old.  I often wonder what it takes to get on one of those panels; it seems like a strong network helps.  Panel arguments about tubes vs. transistors or BJTs vs. FETs can get spicy and absolutely hilarious.  The analog community respects and laughs at itself in these sessions.

After the conference is the best time to write down any final notes.  Life is busy, and it's easy to forget the details or interesting techniques learned during the conference; not all come from the papers, some come from fellow engineers.  Take inventory of business cards; you can write notes on the back about the people.  This information helps the following year, since you can build on the history.  People like to hear about themselves; it's all good.

Going to a conference can be expensive and take a chunk of time.  If you plan your time well, you can make the conference a fun, educational, and rewarding experience.  Be careful to pace yourself, since the material can overload even the most experienced engineers.  Use the social hour to connect, challenge yourself to meet someone new, and don't drink too much.  The IEEE has been continuously improving the conferences over the years, so if you have not been to one in a while, it may be time.