Monday, January 28, 2013

LC sTank

Recently I have taken on a new project that has LC tank based PLLs on it.  The company I now work for has a long history with this type of block, and I appreciate that.  I have seen a few bad LC tank PLLs in my time.  They can fail in several different ways.  The most interesting failure with an oscillator is when it doesn't oscillate.

I have seen ring-oscillator, relaxation-oscillator and LC tank based oscillators used as the time-base for a system on a chip (SOC).  The choice of oscillator basically comes down to performance and cost.  If you want the best high-frequency reference, the LC tank with its huge die area overhead is hard to beat.  The LC tank has a natural filtering property that gives it good phase-noise filtering, leading to lower jitter.  I have mentioned it before, but Hajimiri has a great book, "Low Phase Noise Oscillators", which is an excellent read.  Hajimiri explains the phase noise filtering and Leeson's equation.

The first chapter of Ali Hajimiri's book is about the "one-port" oscillator.  I found it entertaining that an oscillator only has "one port" in terms of energy.  In electronic design, a port is a way of electrically interacting with a circuit.  A terminal or lead on a chip is an example of a port.  Ali explains that an oscillator on a chip would go forever if there were no "real losses".  Real losses are where the "power goes".  So Ali states that if you add back enough power to a "one-port" oscillator to compensate for its internal "real" losses, it will oscillate forever.  Ali's observation is insightful.

It was around the 2005 time-frame and we had a 90nm test-chip back in the lab with a 3.2GHz LC tank oscillator based PLL on it.  The input reference clock was 50MHz or 100MHz and the goal was 1ps RMS jitter measured in band.  The clock came back and looked great.  The phase noise looked "weird", but the jitter was excellent.  The phase noise showed a distinct lack of bandwidth: the loop bandwidth was lower than expected.  So what was wrong with this one?

We were characterizing the jitter over temperature and noticed something.  At temperatures above 120F, the LC tank would stop oscillating.  The output signal would fade out.  Sometimes you could get it to start by "jacking up" the tank current.  On this tank we pushed an adjustable current into a "center-tap" of the dual inductor, at the point of symmetry at the center of the tank.  The adjustable DC current supplied N-MOSFETs connected in positive feedback that created the regeneration.  If we made the current high enough it would oscillate at a higher temperature, but it would still die at a slightly higher temperature.  Supply voltage was a "weak knob".  The oscillator worked great at cold temperature.

Schematic level simulations didn't show any problems with the design over temperature, even with package models.  No problems were observed in any corner with interface blocks.  Since we were closing in on the oscillator, the next step was the "extracted" simulation set.  These simulations can take a very long time to run.  So, to save time, we broke the PLL layout into "sections".  We then swapped in an RC netlist for the "section" of the PLL under study for temperature sensitivity.  We rotated through the block and eliminated all the high current blocks (amazingly).  The main inductor, the power-grid on the main current source, the high-frequency divider and the output buffer were all fine layout wise.

So now this is where Hajimiri ties in.  The "last" place we looked was the varactor circuit.  This was a tricky animal that combined a varactor with a trim-cap array.  A trim-cap array is normally used at start-up for an LC tank based PLL to center itself.  At start-up the correct number of unit capacitors are selected before the PLL loop is allowed to lock.  This "Loop Filter" block is interesting in that it has a "lot of plumbing", and that was the problem.  When putting in an RC extraction of the Loop Filter, we identified the bad layout.

It was series resistance to the capacitor "C" in the "LC" tank!  A pair of long skinny wires connected the inductor to the cap.  It was the first time I saw a parasitic resistor stop a circuit in its tracks.  The resistances in the routing increase with temperature.  The real loss in this poorly routed line was enough to upset the operation of the LC tank.  The wire on the chip would heat up and the circuit would stop oscillating.  What was interesting is that the simulation and the lab failed within 10 degrees of each other.  It was an amazing correlation.  The new simulations also showed the change in phase-noise response, which was the first symptom of the badness.  Of course, after we identified this, the layout fix was easy.
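To put rough numbers on why a few ohms of routing can kill a tank, here is a minimal Python sketch.  It is not the simulation we ran.  It assumes a cross-coupled pair whose small-signal start-up condition is gm > 2/Rp, and every component value in it (1nH inductor, 2 ohms of series routing, a 0.35%/C metal tempco) is a made-up illustration, not data from the chip.

```python
import math

def parallel_resistance(l_henry, r_series_ohm, freq_hz):
    """Equivalent parallel tank resistance for a series loss resistance.

    Uses Rp = Rs * (1 + Q^2) with Q = w*L/Rs.  At resonance w*L = 1/(w*C),
    so the same number applies when the loss sits in series with the cap.
    """
    omega = 2.0 * math.pi * freq_hz
    q = omega * l_henry / r_series_ohm
    return r_series_ohm * (1.0 + q * q)

def wire_resistance(r_ref_ohm, temp_c, ref_c=25.0, tempco=0.0035):
    """Metal routing resistance vs temperature (linear tempco, assumed value)."""
    return r_ref_ohm * (1.0 + tempco * (temp_c - ref_c))

def required_gm(rp_ohm):
    """Minimum per-transistor gm for a cross-coupled pair to start (gm > 2/Rp)."""
    return 2.0 / rp_ohm

# Hypothetical values for illustration only.
L_TANK = 1e-9        # 1 nH
F_OSC = 3.2e9        # 3.2 GHz, as in the post
R_SERIES_25C = 2.0   # ohms of skinny routing at 25 C

for temp_c in (25, 75, 125):
    rs = wire_resistance(R_SERIES_25C, temp_c)
    rp = parallel_resistance(L_TANK, rs, F_OSC)
    print(f"{temp_c:4d} C  Rs={rs:.2f} ohm  Rp={rp:.0f} ohm  "
          f"gm needed={required_gm(rp) * 1e3:.1f} mS")
```

The trend is the point: as the wire heats up, Rs goes up, Rp goes down, and the gm needed to sustain oscillation goes up.  In a real device the available gm also tends to drop with temperature at fixed current, so the two curves cross and the tank dies.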

So, in the LC tank PLL, don't spend so much time on L that you forget about C.  Real loss is the enemy.

Monday, January 14, 2013

The Nickel Philosophy and Transition

One of the purposes of this blog was to help my work associates understand my frustration with the way things were going.  I kept seeing people repeat the same mistakes over and over again.  So I used this blog to "vent".  Venting is only useful for so long before the root-cause of the problem has to be addressed.

I addressed the root-cause and am now working on high-speed ADCs and communications again like I did from 1995-ish to 2009.  I have changed jobs and now work on advanced data converters in deep-submicron processes.  My previous company had generously lifted me from Davis/Sacramento to the edge of the bay area and I am thankful for that.

I plan to keep up Street Smart Analog.  I still plan to eventually write a book on analog design and debug, with this blog as the pre-cursor.  Of course, the lack of recent posting is related to my career transition, which is now official.

The world of small component analog design including light sensors is radically different but also challenging.  I have the utmost respect for the light-sensor product line and the people I left behind working on it.  Those products have a lot of care and effort put into their development and should be the hottest selling sensors out there.  However, I couldn't remain working on that product line due to The Nickel Philosophy.  It is one of two things hanging up in my cube, along with the IEEE code of conduct.  The sad thing here is that this transition could have been prevented.

I find the "Nickel Philosophy" from Jim Bracher and his associates at the Center for Integrity and Leadership to be a valuable tool in setting priorities.  I have used this material of Jim's in Lectures at both UCD and Stanford.
Link:  http://www.brachercenter.com/article_nickelphilosophy.html

The list of "effective priorities" is called "The Nickel Philosophy".  There are two categories:
A.  Professional Profits:
#1.  Customer service (how you treat your customers and your work mates; how are you treated)
#2.  Quality product
#3.  Career Opportunity
#4.  Motivating Environment
#5.  Everything else.. (not worth stressing about)

B.  Personal Values
#1.  Self/Significant other
#2.  Family and Friends
#3.  Health and happiness
#4.  Difference and dollars (how you are getting by)
#5.  Everything else (not worth stressing about)

In my life at my previous employer I tried to make sure that all the people that reported to me experienced "Professional Profits".  There are a multitude of reasons behind this list, its order and how it is presented; please look over the link above.  If the place you work at does NOT follow the Nickel Philosophy AND there are other opportunities available, then the right thing to do is to move on, since you will eventually.  What is interesting about this list is that Career Opportunity is right under quality product.  It does make sense, since without a motivating environment why bother about the opportunity.  To be clear, this list is a "person specific" view of the problem.

Today I saw a link shared by an HR person from the previous company:
http://www.forbes.com/sites/ericjackson/2012/01/19/why-companies-are-terrible-at-selecting-retaining-and-motivating-their-talent/

Basically this article explains why people leave and is interesting to contrast with "The Nickel Philosophy".
Eric Jackson picks 10 reasons.  Some of these items are similar to the Nickel Philosophy:
#1, 2:  HR blame game and throwing money at the problem doesn't work (note Personal Values #4; money is not listed in Professional Profits)
#3,4,5,6:  Career Opportunity
#7, 9, 10:  Motivating environment
#8:  Quality product

So, if you have not done so already, spend the 5 minutes it takes to review the Nickel Philosophy link.  If you want to learn more, you can contact Jim Bracher directly through his website.   The Nickel Philosophy is also published in Jim Bracher's book Integrity Matters.  ISBN 978-1887089036

Wednesday, December 12, 2012

Analog test-anti-test path

Earlier in Street Smart Analog Lingo I mentioned an analog test-bus or analog test-path.  These animals are excellent for debug of silicon, particularly deep-submicron where probe pads are REALLY Huge.  A 2u probe-pad was no big deal back in the day but that area is really useful in geometries below 0.13u.

The test-path issue came up recently so I figured I would blog about this useful debug tool.

The Good:
DC analog signals such as currents and voltage references can be sent on/off chip, helping to isolate DC bias problems.  An "analog mux" is placed on the test-bus.  Normally each block has a little mux that allows a signal to be passed from the INSIDE to a PAD on the outside of the chip.  Outside of the chip the appropriate test-device or current/voltage source can be attached to the test pin.  This is useful for tuning in band-gaps and bias generators, and for debugging low-frequency clocks or slow-speed ADCs.

High-speed (differential) signals can also be sent out an analog test-bus.  These are trickier to deal with but I have seen an 800MHz test-bus employed on a 12Gbps receiver (ISSCC 2006 - Keyeye 12Gbps).  That test-bus had a dedicated output buffer created from a thin-oxide PMOS transistor.  This was a "source-follower" with the off-chip resistor being a several-kOhm resistor.  With the correct (~10V) power-supply, the circuit could be tuned to an impedance of 50 ohms to match the board trace.  The poor little transistor was biased well beyond 10 year lifetime limits; however, it allowed us to "tune-in" our analog DFE, NEXT and ECHO cancellers.  This circuit also made a fine figure for our ISSCC paper.  We achieved about 8 bit linearity with a bandwidth of near 800MHz.  If you left it on too long or raised the voltage too high the chip would blow.  The eye got cleaner until it popped.  Later on we included EQ on scope capture data to reduce the burn-out problem.

Medium bandwidth signals can also be sent through a mux into a front-end of a receiver.  The transformer in an Ethernet chip had a dual-purpose as a balun.  You could put a single-ended RF generator (with associated filter network) on the differential input side of the transformer.  Then on the "chip side" you could adjust the center-tap to give whatever common-mode was required for the internal block being tested.  A "leap-frog" test-path was included to send the signals to the various front-end blocks helping to debug harmonic-distortion problems, AGC ranges, low-pass filter bandwidths and ADC linearity.  This path should be simulated before tape-out.

One advantage of an analog test-bus is that you can always disconnect it in a metal-rev, so reliability is not a concern, especially in the early stages of  analog-front-end (AFE) bring-up.

The Bad:
I have also seen the analog test-bus cause failures.  These are subtle but this is the point of street smart analog.  The test-bus needs to be verified like any other circuit.  Neglecting to do so can cause bad things to happen.

The ultimate sin of the "test-bus" is to reduce the performance of the circuit's primary function.

Failure #1:  Some pads on chips have voltages that go "above the rail".  These are called "open-drain", where an off-chip pull-up resistor or transformer is required to supply current.  A common mistake is to connect a PMOS switch to the pad with its body tied to the chip supply.  If you take a PMOS terminal above the highest supply, a diode will turn on inside the chip and steal current in its characteristic nonlinear, temperature-dependent way, often puzzling the layman.  These parasitic diodes can also blow.  We learn in college that the PMOS body needs to be connected to the highest supply.  (Source-body connections are do-able, but they are tricky, have pitfalls of their own, and may affect a circuit in its normal mode.)  So as a general rule, unless you really have to, never use a PMOS switch, especially if you have an open-drain or a transformer.  Dan Ray said "No P on the Pad".  Notice my " ad", it has no P.

Failure #2:  Bad neighbor behavior.  What I mean by this is that several blocks normally share an analog test-bus such as a "DC" bus.  There is a desire to prevent noise from coupling back in from the test-bus, so often we would employ a "T" switch.  This is a switch that consists of a T network with three switches.  When the bus is "off", the middle switch is on and prevents noise coupling through.  When the test-bus is "on", the middle switch is off and the two outer switches connect the internal node to the outside.  I have seen a case where someone left out one of the switches in the T.  So when the test-bus was disabled, it was pulled to ground, preventing other blocks from using it.  So if you have an analog test-bus, a "test-case" should include "open".  I would do this in simulations by loading the test-bus with a 1Meg resistor to a mid-rail voltage.  You can also pull the resistor above the rail (on an open drain pin) to check for P on the pad if that is a concern.
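Here is a back-of-the-envelope Python version of that "open" test-case.  It is a sketch under assumptions, not a replacement for the real simulation: the switch on/off resistances are invented, and it assumes the missing switch was the bus-side one (replaced by a wire), which is one reading of the failure described above.

```python
R_ON = 100.0      # assumed switch on-resistance, ohms
R_OFF = 1e12      # assumed switch off-resistance, ohms

def disabled_shunt(bus_side_switch_present=True):
    """Resistance from the test-bus to ground through one DISABLED T-switch.

    Disabled state: both outer switches off, middle switch on to ground.
    If the bus-side switch was left out, the middle switch hangs directly
    on the shared bus.
    """
    bus_side = R_OFF if bus_side_switch_present else 0.0
    return bus_side + R_ON

def open_bus_voltage(v_mid, r_load, shunts_to_ground):
    """Bus voltage with every block disabled and r_load tied to v_mid."""
    g_total = 1.0 / r_load + sum(1.0 / r for r in shunts_to_ground)
    return (v_mid / r_load) / g_total

V_MID = 1.65      # mid-rail for a 3.3V supply
R_LOAD = 1e6      # the 1Meg test load from the post

good = open_bus_voltage(V_MID, R_LOAD, [disabled_shunt(True)] * 4)
bad = open_bus_voltage(V_MID, R_LOAD,
                       [disabled_shunt(True)] * 3 + [disabled_shunt(False)])
print(f"four good blocks:  bus sits at {good:.3f} V")
print(f"one broken block:  bus sits at {bad:.6f} V")
```

A healthy bus floats to the mid-rail voltage set by the 1Meg resistor.  One broken T-switch drags it to ground, which is exactly the signature the "open" test-case is meant to catch before tape-out.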

Failure #3: Low priority verification.  The first shot at that 800MHz differential test-bus did not work all that well.  We had hired an excellent consultant to design a repeater to send a signal to the source-follower pad.  This IP never did make the first tape-out.  The focus was on tape-out and verification of the main function, but the missing test-bus prevented debug later on, forcing a quicker spin.  So if you are going to put a test-bus in, you should "Do it like you mean it" and verify it too.  If there are buffers they should be reviewed and plot reviewed.  The test-bus methodology should be done "up front" in the design and not snuck in at the last moment, since it could ruin your floor plan.  Thinking ahead and planning are always a good idea when it comes to analog chip design.  You can try to substitute long hours but you'll always lose to the thinker-planner.  Think tortoise and hare...

Keyeye Ref:  http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1696054

Sunday, December 2, 2012

Missing Teeth

With switched-capacitor circuits, one of the most critical parts of the design is the clock generator.  As a friend of mine once said:

"When your switched-capacitor circuit doesn't work, check your clocks.  After that, check your clocks again."   (Perry Heedley-1998)

Back in 1999 we had our first-generation gigabit SOC in the lab.  The process was 0.35u.  Supply 3.3V.  We had a strange problem with non-uniform sampling.  When we sent the clock out the "test-bus" we saw that it had missing pulses.  Missing pulses are not a good thing and the Flash ADC ENOB was terrible.  Lots of tones!  On the scope the clock looked like a boxer who was missing teeth.  We also had supply dependence, where high supply and cold spray made it worse.  What was going on?

On Friday I met a new friend who had a similar story.  (So sorry buddy!)  This inspired me to write this blog post on this common screw-up.  If I have seen one common mess-up, it is that something goes wrong with a reference clock.

In these larger Ethernet chips, we distribute the clock as a differential signal.  The advantage of going differential is that the signal is not affected by clock skew and the rise/fall time match perfectly (by design).   If you distribute critical clocks with single-ended circuits stop reading now since you are hopeless.  The differential approach gives you a uniform sensitivity to noise on the chip and in the environment (see Ali Hajimiri's wonderfully written "Low-Noise Oscillators").  Another advantage of using a differential clock, is that ideally you can send it across power-supply domains. (when things are normal)

Now if you want a good non-overlapping clock it's easy to go overboard.  Normally you have a "non-overlapping clock generator".  It's a circuit whose job it is to make sure a set of clocks do not occur at the same time.  A trade-off in those designs is the rise/fall time.  If the clock coming out of the block has a fast rise and fall time, the clocks are less apt to overlap.  However, this comes at a cost.  The designer keeps increasing the size of the generator to make the output edges faster and faster, eventually coming to a solution.  There is a trade-off between non-overlap time and Operational Transconductance Amplifier (OTA) settling.  It almost always seems easier to use big clock buffer transistors than to beef-up your amplifier bandwidth.

A huge pitfall of these "massive" clock generators is that they can generate huge amounts of noise and "ground bounce".  Or as Stephen Lewis (UC Davis) would say, "Making sparks".  The huge clock buffer circuits create massive amounts of dI/dt.  Huge current spikes with peaks close to an amp can find their way into your big clock buffer.  These currents hit your package (with inductance), which translates them into huge voltage spikes.

When it comes to "noisy neighbors" on a chip, it always takes an aggressor and a receptor.  In this case, I was able to debug this animal by putting the clock-generator into a schematic along with a simple package model consisting of package inductance.  I then put the clock source on a different power-supply in my schematic to see what happened.  I did this by hand in HSPICE since I am not the hugest fan of schematic capture.  I did this hand-written test-bench in real time in the lab right next to an oscilloscope with the bad clock on it.  It was me, Sailesh Rao, Jim Parker and Dave Nack all gathered around the setup.  I kept tweaking the test-bench, and the Q factor (4) on the bondwires, until BINGO.  I was able to match the waveform from the scope in HSPICE.  High-five from Dr. Rao!  What happened?

The ground-bounce was so big that it was measured in VOLTS.  Yes, our 3.3V supply had volts of ground-bounce on it from a huge clock generator.  By increasing the temperature or lowering the supply on the clock generator, we could work around the problem.  This part wasn't going to sample in this state.  The ground-bounce was too big from uber-big clockgen!

The main PLL and the ADC with the uber-clockgen were on different power supply pins.  Analog guys like to use A BUNCH of power supplies, normally to keep noise from coupling around.  However, this can sometimes backfire.  When breaking up power-supplies it's important to visualize the return paths of all the currents and how they will affect each other.  In this case, the PLL sent the clock to the ADC, which caused so much ground-bounce that the buffer amplifier receiving the clock in the ADC missed pulses.  This happened since the amplifier only had a common-mode range of about a volt, with more than a volt of ground bounce between the supplies.
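The arithmetic behind "ground-bounce measured in VOLTS" is just V = L * dI/dt.  The Python sketch below uses hypothetical numbers (2nH of bondwire inductance and a 1ns edge are my assumptions; the amp-scale current spike and the roughly 1V common-mode range come from the story above).  It ignores on-chip bypass and ringing, so treat it as a sanity check and not a substitute for a bondwire simulation.

```python
def ground_bounce(l_bondwire_h, delta_i_a, edge_time_s):
    """Peak bounce across the package inductance: V = L * dI/dt."""
    return l_bondwire_h * delta_i_a / edge_time_s

def clock_survives(bounce_v, common_mode_range_v):
    """True if the receiving buffer can still see the clock through the bounce."""
    return abs(bounce_v) <= common_mode_range_v

# Hypothetical package and edge numbers for illustration only.
L_BONDWIRE = 2e-9     # 2 nH
I_SPIKE = 1.0         # close to an amp, as described in the post
CM_RANGE = 1.0        # about a volt of common-mode range

for edge_ns in (1.0, 2.0, 4.0):
    v = ground_bounce(L_BONDWIRE, I_SPIKE, edge_ns * 1e-9)
    verdict = "OK" if clock_survives(v, CM_RANGE) else "missing teeth"
    print(f"edge {edge_ns:.0f} ns -> bounce {v:.2f} V -> {verdict}")
```

Slowing the edges (a smaller clockgen) or cutting the current spike that reaches the bondwire (on-chip bypass) both shrink dI/dt, which is why those two knobs are the first things to try.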

So now, hopefully everyone knows that you can make a clock-generator "too-big".  A technique for finding these is to just turn on the base-layers in your layout and look for huge MOSFETs.  Always ask yourself why you have a big transistor, since everything in the area will know about it.  Also people should be aware that more supplies are not always better.

So what is a solution?
A.  NERF your clockgen - Simulate it with bondwires
B.  Add on-chip bypass capacitors to prevent dI/dT from hitting the bondwire
C.  Improve the common-mode range of your clock buffer.
D.  Design a set of inter-supply "repeaters" with huge common-mode range
E.  Use DC Blocking capacitors

We solved this one with A and B.  The ADC worked much better after we fixed that.  We still had more challenges but....

"When your switched-capacitor circuit doesn't work, check your clocks.  After that, check your clocks again."  

Sunday, November 18, 2012

Street Smart Analog Terminology

In everyday life I basically use these terms over and over again.  I assume that everyone knows what these things mean, but that is probably not the case. So I am going to publish and maintain a list of my "Streetsmart Analog Lingo".

Most of this lingo comes from other smart people in this business.  I can't give credit to everyone.  The late Dan Ray was an expert in this area.  Dan was a founder at Level One and one of my first mentors.  He was awesome in analog.  Tim Dyer (my identical twin brother), Perry Heedley (CSUS), David Viera, Patrick Isakanian, Paul Hurst, Stephen Lewis, Bob Pease, and Dave Nack contributed some of these over the years. 

Street-smart Analog Lingo:

A1 Release:  Release of silicon on the very first version.  This happens very rarely, maybe 1% of the time with analog circuits.  Assuming it will happen is unrealistic and can actually be discouraging.

All Layer:  A design that requires all mask layers to be changed or all new masks

APR:  Automatic place-and-route.  Machine generated layout.  Also called DDA (Digital Design Automation)

Antenna:  Long piece of metal touching gate-poly - can damage poly leading to huge offsets. Also can be a single-ended wire (test-point) on a circuit board operating at over 200 MHz.

Bake It:  Temperature cycle

Boomerang:  Bad evaluation board returned from the customer

Brown-thumb:  A designer who uses "unconventional" techniques or "special tricks" to do design.  Often these characters are associated with unreliable circuit designs and poor execution.  Also associated with using poor methodology and bad practice.

Change layer:  Metal layer dedicated for changes/programmability

Carpet Bomb It:  see NFS

Chip Designer:  There are only chip designers.  All block designers should be interested in how their circuit affects the chip they are working on.  Especially important in a System on a Chip (SOC).

Cup-cake:  Cross sectional shape of copper metalization

Expert-layout schematic: an analog schematic without any layout hints or notes

E^Overnight:  Bit-error rate test requiring no errors if left overnight. 

Fib Slut:  A part that has been in and out of a "Focused Ion Beam" (FIB) more than twice

Follow the Dollar:  The process of following the customer's money to your paycheck.  It should be easy to "follow the Dollar" unless you know you are operating at a loss.

FOS Schedule: Full of shXt schedule.  Normally used to get management pregnant on a chip design program

50% Schedule: Project schedule that requires everything to go perfectly.  No competent marketing person or design manager ever commits to a 50% schedule unless its a FOS Schedule.

Hair-dryer:  Heat gun

Hare:  high-power low-impedance approach to design. Opposite of tortoise.

Hidden state:  Circuit state not designed - normally from a bad reset circuit. Often appears when an over-confident analog person does digital design. 

Leapfrog test-path:  Analog test-path that allows blocks in an analog signal chain to be bypassed for debug.

Luck:  When the mistakes you made didn't matter

Irregular layout:  Any circuit block that is not square or rectangular. Also called "donut block" or "block with a tit on it"-Dan Ray. 

Magic-fingers:  The opposite of brown-thumb.  Someone who executes

Magic-circuit:  Circuit designed by someone who doesn't understand it.  Often a Brown-thumb.

Magic smoke:  When it leaves the chip it no longer works.

Maskview: Job-deck view which is a manual mask check.  Ideally a "Zen" moment and never to be done in a panic.

Metal-up; Metal:  A design that requires just metal changes - quicker and bypasses HTOL.  Often a way to patch a design for a quick fab turn.

MAS Document:  Micro-architecture spec document or "chip Bible".  So useful but so shunned, a one stop shop for circuit-block and interface information.

Nack Hack:  {named after the late Dave Nack}  A circuit board without a "toe-tag" containing unknown changes or hacks.

Nail:  Type of die probe that is simply a piece of metal

Noisy Neighbor:  Noisy circuit block interacting with nearby circuits 

NFS:  Nuke it From Space {Aliens II}.  A circuit that is flakey or has unknown issues that should be completely re-designed.  A circuit or a layout that is fundamentally flawed.

Onion Peel:  (peel the onion) When a chip revision comes back and a new problem is uncovered (often hidden by what was fixed)

Pencil Tap:  Before GPIB and Labview we would tap knobs with pencils for fine-tuning

Pizza Mask: A multi-reticle run of silicon.  Can be "metal-up" or all-layer.  Normally used for debug or system level designs without a full set of models.

Poke it with a stick: Low risk vehicle in trying a new technique or new process.  Also used in debug to determine if the problem is sensitive to external stimulus.

Popcorn: moisture in the package causing trouble in re-flow splitting the part

Put a fork in it:  Basically done, any more effort spent on it is wasted

Leakage:  sub-threshold drain-source current in MOS that makes a sensitive temperature sensor. 

Relentless beating:  To solve a problem with several simultaneous solutions

Sim Slave: Design resource asked to do simulations without understanding

Smoke Test:  First power-up of new silicon

Spoiled-via:  Connection between metals that is open or flakey

Tape-out:  Sending the plans of the chip to the FAB for mask generation.  An important milestone for non-experts but meaningless for true experts, since you may have a "Turd"

Team Analog:  Design, debug or architecture work done in an interactive manner. 

Testbus:  Test path snaked through an analog design that goes to an external pin for debugging

Trainwreck:  When two layouts crash into each other.  Also a machine-generated schematic or one in which the wires cross.

Thump test:  Finding a signal integrity problem (loose connection) by thumping your fist on the bench. 

Turd:  An incompletely verified piece of silicon. Due to time-constraints, not all simulations are run.

Works by accident:  Analog circuit or subsystem category with a flaw that works fine anyway. For example critical layout sensitivity that happens to balance, when later edited may surprise fail. 

Works in simulations:  Analog always works in sims... famous last words.

You can't bullshit electrons:  Just because you designed it doesn't mean it will work

Zap:  ESD testing

Zorch:  Catastrophic failure of a DC DC converter IC. (Parasitic Zener)

Thursday, November 15, 2012

Hiroshi's Desk

Trust is the glue that holds business relationships together. 

Today I made a visit to R2 Semiconductor where I visited an old friend and met a new one.  I saw intelligence, perseverance, and focus.  Both are very focused and technically solid people; you often find the cream of the crop in small outfits like R2.  It's tough working in a start-up: there are so many issues to deal with, many of them non-technical.  I really appreciate what their team has done.  My visit reminded me of the story of Hiroshi's wallet.

I joined a start-up, Keyeye, around the late 2002 or early 2003 time-frame.  At my previous company we had been doing research and development on communication circuits until that changed.  At the time we were trying to start our family and my wife didn't want to move.  Keyeye was one of a very few ways to stay in Sacramento and still do cutting edge mixed-signal design outside of university research.  I took a huge pay-cut with the goal to make it back in stock.

Hiroshi Takatori was a founder and the CTO of Keyeye at the time.  We don't talk much anymore unfortunately, but what I can say about Hiroshi is that he is a brilliant and incredibly hard-working man.  Hiroshi  basically dedicated a big chunk of his life toward the success of Keyeye.  He was very careful in who he hired.  He had many criteria but one was to bring aboard straight-shooters (like himself) and people he could trust.  His style is Japanese and he liked to do all the system simulations which he did at his desk which was located right in the center of our office building.  He had no cube walls around his desk, he would sit watching the company work from his central location.  We all had cube-walls fortunately.  You could not go into the break-room, the front-door or the lab without passing by his desk.  He pretty much had the same set of items on his desk all the time.   His computer, butcher paper (for system diagrams), bucket of pens, FORTRAN print-outs, a container of dried sea-weed and his wallet sitting on the edge of the desk.

What I found interesting was that over the first 3 years I worked there (before the move) his wallet basically sat in the same place every day.  It was a fat wallet with lots of notes, business cards and money popping out the sides.  It was always there, always in the same spot.  We would all walk by it multiple times every day.  Guests visiting Keyeye would sometimes comment on it since it was so big and bulky looking; no wonder he didn't leave it in his pocket.

Nobody ever touched Hiroshi's wallet.  We all feared the dried seaweed.

I found that Hiroshi's wallet symbolized one of the key elements of character in a start-up, which is trust.  When in a start-up you wear many hats and do many functions.  You focus on the success of the company your funding partners are helping you to create.  There are few checks and balances.  Your responsibility is huge, and your risk is high.  If your character is weak, then you do not belong there.  You do not deserve the responsibility.  You need to trust each other.  Your funding partners need to trust you to deliver.  If you are ever at a start-up and looking to hire someone, ask yourself if you would trust them with your wallet.  If not, then keep looking.

Sunday, November 11, 2012

$100,000+ Frizbee

After an IC design is completed the plans are "taped-out" to a FAB that processes the wafer.  The first step is to generate the masks.  Depending on the type of process there could be from 10 to over 40 masks.  Each mask, combined with photo resist and a light source, is used to pattern a layer on the wafer.  These patterns together form transistors, capacitors, resistors, inductors, diodes and the interconnect layers used to connect the elements together.

The cost of a wafer depends on several things, but for an older process technology, say something 15 years old, the wafer cost may be between $600 and $1000 each.  Now if you have a small die, you can get thousands of ICs (or dice) on the wafer, giving them a cost of pennies each.  If you can shrink the design (with a more advanced process) you can fit more die on a wafer and lower the cost.  Yield also improves with a smaller die size, since each die is less likely to land on a defect.  This all makes sense and is documented in many a textbook.  However this is when things go right...
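For a feel of the "pennies each" arithmetic, here is a rough Python sketch.  The die-per-wafer formula is the usual textbook approximation, the Poisson yield model is one of several in common use, and the wafer diameter, die area and defect density are all hypothetical numbers I picked for illustration.

```python
import math

def gross_die_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Textbook approximation: wafer area over die area, minus an edge-loss term."""
    area_term = math.pi * (wafer_diameter_mm / 2.0) ** 2 / die_area_mm2
    edge_term = math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    return int(area_term - edge_term)

def poisson_yield(die_area_mm2, defects_per_cm2):
    """Simple Poisson yield model: Y = exp(-A * D0), with A converted to cm^2."""
    return math.exp(-(die_area_mm2 / 100.0) * defects_per_cm2)

def cost_per_good_die(wafer_cost, wafer_diameter_mm, die_area_mm2, defects_per_cm2):
    good = gross_die_per_wafer(wafer_diameter_mm, die_area_mm2) \
        * poisson_yield(die_area_mm2, defects_per_cm2)
    return wafer_cost / good

# Hypothetical numbers: $800 wafer, 200mm diameter, 0.5 defects per cm^2.
for area_mm2 in (1.0, 4.0, 25.0):
    cost = cost_per_good_die(800.0, 200.0, area_mm2, 0.5)
    print(f"die {area_mm2:5.1f} mm^2 -> ${cost:.3f} per good die")
```

Of course, all of this assumes the wafer was processed correctly in the first place.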

Several times in my career I have seen the misprocessed wafer.  Normally you are waiting for the wafer to get back from the FAB and you get a funny email.  There are WAT (wafer acceptance test) structures on the wafer, normally off to the side or between the "dice".  The FAB probes these structures and records the data.  They compare the WAT data to a specification table.  If there is a problem, then the FAB lets you know; this is all part of their quality control process.  Of course, how bad the failure is and how far off spec it is are important.  I will discuss one such case.  It pretty much represents every case I have seen with bad material.

Case#1:  Year~1999. Process 0.13u.   Failure: "Transistor threshold off due to incorrect oxide thickness module".  In the FAB, the process steps are often called "modules".  These modules are sometimes mixed up or done incorrectly.  In this case, the wrong oxide was used for the IO transistors, which were also used inside the analog front-end.  The FAB apologized and started making new material.  Now, I was young and new in my career at the time. The chip was a "huge" SOC with more than 16 million transistors.  About 1/2 of the content was analog, 1/2 digital.  We had all worked hard and had been waiting for months to get the silicon on the bench.  I thought, "what could it hurt" to get an early look?

I got the packaged parts and plugged in the first one, and nothing happened.  Plugged in another, nothing.  Anyone who knows me understands I don't give up easily, so I asked for a "pile".  After going through 20 I found one that "wiggled", or gave evidence of activity.  We then trained a tech in the screening process, and out of 100 parts we found 3 that wiggled. Only one of the 3 actually did anything interesting.  The bad parts had a strange problem in that the IO pads would oscillate in different patterns at about 100 Hz.

Why was the chip IO oscillating at 100 Hz?  Why was the chip performance so bad?  Was it due to the process mistake or was there a problem in the design?  We had an early look, so we figured let's use this time while we wait for new material.  Since it was a base-layer screw-up at the FAB it took 2 months to get new material, so we plowed forward.  We found that parts with different packages had better or worse yield.  The ADC (which I did the architecture for) worked fine on the good parts.  However, other strange behavior existed.  So we took the 100 Hz problem and decided to debug it.

The team was me and 3 other people.  We used the "company" debugging process.  The team worked for two months (32 man-weeks) to find out what was going on.  We started with an "ebeam" prober, which tracks activity at junctions in the IC in a vacuum.  We isolated a section of the pad-ring called JTAG, also known as "boundary scan".  Since the analog was not part of the scan, we could still investigate it while the digital was unavailable because the IO interface was oscillating.  We used FIB (Focused Ion Beam) to isolate the elements of the JTAG circuitry.  To get to this point took us about 4 man-months of work with the expensive debug equipment.  We finally got to a point where we found a logic gate that appeared to have a floating input.  I got an HSPICE simulation to demonstrate that the floating gate and its surrounding layout could oscillate at around 100 Hz.  I thought we had found our smoking gun.

We had our FA (Failure Analysis) FIB expert cut the metals around the gate to identify what appeared to be a bad via (connection between layers).  We then sent the photo to the FAB (at 24 man-weeks of debug) to ask if this was related to the oxide defect.  The FAB said they didn't believe the photo since we had done the FA ourselves.  So we went back, found another part, isolated the bad spot and had the FAB do the FA.  Now we were 32 man-weeks into the debug and the new wafers were due back soon.  I got a "sheepish" email from the FAB saying that the bad oxide layer affected the vias (don't ask me how).  Attached was a photo of the bad spot without any name or record of the FAB or the design.  The open via caused a "relaxation" oscillator to be formed by a combination of gate leakage and parasitic coupling in a logic gate in the JTAG circuitry.

The new material showed up and the part worked as designed.

So what did we learn during the 32 man-week exercise?  We learned that the misprocessed wafer caused the problem.  During this time the FA team left other priorities aside.  Schedules slipped on the next generation part.  Engineers and management were worried about the design, yet we learned nothing new about it.  Or did we?  It certainly was educational.  What else could have been done with that company time and money will never be known.

Well, I sure learned something.  We spent over $100,000 in labor and FA to prove that a bad wafer was bad.  We also proved that this delayed the progress of the team and hurt the next generation part. The wafers were scrapped, or became "Frizbees".

Now, be careful if you work with me and have a "known bad wafer" shipped.  I am very clumsy around those things these days.  I tend to smash them against the wall or throw them in the parking lot.  It's hard enough to get a mixed-signal IC working when it is processed correctly, but when it's not, it's pointless, especially with more complex designs.

Hopefully I just saved someone a few hundred thousand dollars...  I have never seen the damage of a misprocessed wafer debug be any cheaper.  I have seen it a total of three times in my career and in every case greed, impatient people and disappointed customers were involved.  In no case was it ever worth the effort.