Sunday, November 11, 2012

$100,000+ Frisbee

After an IC design is completed the plans are "taped-out" to a FAB that processes the wafer.  The first step is to generate the masks. Depending on the type of process there could be from 10 to over 40 masks.  Each mask, combined with photoresist and a light source, is used to pattern a layer on the wafer.  These patterns together form transistors, capacitors, resistors, inductors, diodes and the interconnect layers used to connect the elements together.

The cost of a wafer depends on several things, but for an older process technology, say something 15 years old, the wafer cost may be between $600 and $1000 each.  Now if you have a small die, you can get thousands of ICs (or dice) on the wafer, giving them a cost of pennies each.  If you can shrink the design (with a more advanced process) you can fit more dice on a wafer and lower the cost.  Yield also improves with a smaller die, since each die presents less area for a defect to land on.  This all makes sense and is documented in many a textbook.  However this is when things go right...
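The textbook arithmetic behind "pennies each" can be sketched in a few lines. This is only a rough illustration: the 200 mm wafer diameter, 4 mm² die area and defect density are assumed values, not numbers from this story, and the gross-die formula and Poisson yield model are the usual first-order approximations rather than anything a FAB would quote.

```python
import math

def gross_dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """First-order estimate: wafer area / die area, minus an edge-loss term."""
    radius = wafer_diameter_mm / 2.0
    whole_area_term = math.pi * radius ** 2 / die_area_mm2
    edge_loss_term = math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    return int(whole_area_term - edge_loss_term)

def poisson_yield(die_area_mm2, defects_per_mm2):
    """Fraction of dice with zero random defects (simple Poisson model)."""
    return math.exp(-die_area_mm2 * defects_per_mm2)

def cost_per_good_die(wafer_cost, wafer_diameter_mm, die_area_mm2, defects_per_mm2):
    """Wafer cost spread over the dice expected to be defect-free."""
    good = (gross_dies_per_wafer(wafer_diameter_mm, die_area_mm2)
            * poisson_yield(die_area_mm2, defects_per_mm2))
    return wafer_cost / good

if __name__ == "__main__":
    # Assumed example: $800 wafer, 200 mm diameter, 0.005 defects/mm^2.
    for area in (4.0, 16.0):
        cost = cost_per_good_die(800.0, 200.0, area, 0.005)
        print(f"{area:5.1f} mm^2 die: ${cost:.3f} per good die")
```

The smaller die wins twice: more candidates fit on the wafer, and a larger fraction of them survive. None of this models a misprocessed wafer, where the yield assumption itself no longer holds.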

Several times in my career I have seen the misprocessed wafer.  Normally you are waiting for the wafer to get back from the FAB and you get a funny email.  There are WAT (wafer acceptance test) structures on the wafer, normally off to the side or between the "dice".  The FAB probes these structures and records the data.  They compare the WAT data to a specification table.  If there is a problem, the FAB lets you know; this is all part of their quality control process.  Of course, how bad the failure is and how far off spec it is are important.  I will discuss one such case. This pretty much represents every case I have seen with bad material.
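The WAT comparison itself is conceptually simple. Here is a minimal sketch, assuming a spec table of lower and upper limits per parameter. The parameter names and limits below are made up for illustration and do not come from any real FAB's table.

```python
# Hypothetical spec table: parameter name -> (lower limit, upper limit).
WAT_SPEC = {
    "vth_io_nmos_v": (0.55, 0.75),  # IO transistor threshold voltage, volts
    "tox_io_nm": (6.5, 7.5),        # IO gate oxide thickness, nanometers
}

def check_wat(measured, spec):
    """Return {parameter: (value, low, high)} for each missing or out-of-spec reading."""
    failures = {}
    for name, (low, high) in spec.items():
        value = measured.get(name)
        if value is None or not (low <= value <= high):
            failures[name] = (value, low, high)
    return failures

if __name__ == "__main__":
    # Hypothetical readings from a wafer built with the wrong oxide.
    bad_wafer = {"vth_io_nmos_v": 0.41, "tox_io_nm": 3.2}
    for name, (value, low, high) in check_wat(bad_wafer, WAT_SPEC).items():
        print(f"{name}: measured {value}, spec {low} to {high}")
```

A real flow would also track how far out of spec each reading is, since that is what decides whether the material is marginal or scrap.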

Case #1:  Year ~1999. Process 0.13u.   Failure: "Transistor threshold off due to incorrect oxide thickness module".  In the FAB, the process steps are often called "modules".  These modules are sometimes mixed up or done incorrectly.  In this case, the wrong oxide was used for the IO transistors, which were also used inside the analog front-end.  The FAB apologized and was making new material.  Now, I was young and new in my career at the time. The chip was a "huge" SOC with more than 16 million transistors.  About 1/2 of the content was analog, 1/2 digital.  We had all worked hard and had been waiting for months to get the silicon on the bench.  I thought, "what could it hurt" to get an early look?

Got the packaged parts and plugged in the first one, and nothing happened.  Plugged in another, nothing.  Anyone who knows me understands I don't give up easily, so I asked for a "pile".  After going through 20 I found one that "wiggled" or gave evidence of activity.  We then trained a tech in the screening process, and out of 100 parts we found 3 that wiggled. Only one of the 3 actually did anything interesting.  The bad parts had a strange problem in that the IO pads would oscillate in different patterns at about 100 Hz.

Why was the chip IO oscillating at 100 Hz?  Why was the chip performance so bad?  Was it due to the process mistake or was there a problem in the design?  We had an early look, so let's use this time while we wait for new material.  Since it was a base-layer screw-up at the FAB it took 2 months to get new material, so we plowed forward.  We found that parts with different packages had better or worse yield.  The ADC (which I did the architecture for) worked fine on the good parts.  However, other strange behavior existed.  So we took the 100 Hz problem and decided to debug it.

The team was me and 3 other people.  We used the "company" debugging process.  The team worked for two months (32 man-weeks) to find out what was going on.  We started with an "ebeam" prober, which tracks activity at junctions in the IC in a vacuum.  We isolated a section of the pad ring called JTAG, also known as "boundary scan".  Since we didn't scan the analog, we could investigate the analog even when the digital was not available because the IO interface was oscillating.  We used FIB (Focused Ion Beam) to isolate the elements of the JTAG circuitry.  Getting to this point took us about 4 man-months of work with the expensive debug equipment.

We finally got to a point where we found a logic gate that appeared to have a floating input.  I got an HSPICE simulation to demonstrate that the floating gate and its surrounding layout could oscillate at around 100 Hz.  I thought we had found our smoking gun.  We had our FA (Failure Analysis) FIB expert cut the metals around the gate to identify what appeared to be a bad via (a connection between layers).  We then sent the photo to the FAB (at 24 man-weeks of debug) to ask if this was related to the oxide defect.  The FAB said they didn't believe the photo since we did the FA ourselves.  So we went back, found another part, isolated the bad spot and had the FAB do the FA.

Now we were 32 man-weeks into the debug, and the new wafers were due back soon.  I got a "sheepish" email from the FAB saying that the bad oxide layer affected the vias (don't ask me how).  Attached was a photo of the bad spot without any name or record of the FAB or the design.  The open via caused a "relaxation" oscillator to be formed by a combination of gate leakage and parasitic coupling in a logic gate in the JTAG circuitry.

The new material showed up and the part worked as designed.

So what did we learn during the 32 man-week exercise?  We learned that the misprocessed wafer caused the problem.  During this time the FA team left other priorities aside.  Schedules slipped on the next generation part.  Engineers and management were worried about the design, yet we learned nothing new.  Or did we?  It certainly was educational.  What else could have been done with those resources is never to be known.  What other things could we have done with that company time and money?

Well, I sure learned something.  We spent over $100,000 in labor and FA to prove that a bad wafer was bad.  We also proved that this delayed the progress of the team and hurt the next generation part. The wafers were scrapped, or "Frisbees".

Now, be careful if you work with me and have a "known bad wafer" shipped.  I am very clumsy around those things these days.  I tend to smash them against the wall or throw them in the parking lot.  It's hard enough to get a mixed-signal IC working when it is processed correctly, but when it's not, it's pointless, especially with more complex designs.

Hopefully I just saved someone a few hundred thousand dollars...  I have never seen the damage of a misprocessed wafer debug be any cheaper.  I have seen it a total of three times in my career, and in every case greed, impatient people and disappointed customers were involved.  In no case was it ever worth the effort.

1 comment:

  1. I had a problem where we did not know if it was the process or the design--a bandgap was not always starting. The designer had used a startup path that would be part of the Q current in the core, overpowered by the feedback. It looked foolproof. I was called in to help. After a lot of thinking and simulating I kept coming back to the foolproof startup, simulated an open in the start res and duplicated the behavior. A FIB on a good part showed the same behavior. The manufacturer of course insisted I was wrong and would do its own FA. After a month of transmission electron microscopy, they found a processing issue with their contact module. The design was OK but not easily observed. A bit too clever.
    Auggie
