Apur-ee.273
net.periphs
utcsrgv!utzoo!decvax!pur-ee!ghg
Sat Mar 20 14:53:58 1982
More SI experiences

We have approx 15 or so SI 9400 controllers around here, spread over 3 or 4 PDP-11's and 6 or 7 VAXen. Drives are a mix of CDC 9766's (with Dysan packs, which are error free at the bit level) and Fuji 160's. There are approx 10-12 9766's and 15-16 Fuji's total. The Fuji's are "on loan" from SI until the Eagle (Fuji 470 MB) drives are ready.

Around 3 years ago, we helped SI debug the Unibus & 11/70 cache bus 9400's, and they worked pretty well after that. We were running CDC 9766's 2 years before DEC announced the RM05 (a 9766). We pretty much had a controller on each drive, and there was no need for seeks and "autoconf" code at that time since DEC had nothing similar in the offing. In general we have been very pleased with the 9766 drives (with Dysan packs) and got 2-3 failures/year, mostly power supply and cable problems. The 3 or 4 DEC drives (RP04) have had nothing but problems since day one.

With the arrival of VAXen and the SBI version of the SI controllers, we started to see more quality control problems in some of the controllers (chips not in sockets, etc), although the SBI emulator has been mostly free of bugs. One of the main troubles with the SBI adaptor is that the IC pins are not clipped after the boards are DIP soldered and tend to short to the next board. The 9400 "mother" board (the very large PC board) is only supported by connectors and tends to fall out.

About 1-2 years ago, 9400's had cheap sleeve bearing fans which croaked in 6 months or less, cooking the controllers, and a cheap capacitor (MEPCO/ELECTRA) in the power supply (mfg by ACDC) which failed, causing "spikes" and dips on the +5V and "random" resets of the controller. Unless the software looks at the byte/address counters at the end of each xfer, the driver will usually see a normal end of transfer (the reset causes an interrupt which the dev driver thinks is the xfer finishing).
What you really got is a buffer of trash with no deverror being reported. Long ago, on general paranoia, I changed all disk drivers here to HALT ON UNRECOVERED NON-RAW ERRORS. This has saved many a filesystem by stopping dead before the damage spreads. It is OK to let raw I/O errors go since they are usually swap (which is well checked) or disk testing. SI has since fixed the bad fans and switched vendors on their power supplies.

Also, when unusual errors arise, the SBI (MBA) tends to hang up with data xfer busy (0x80000000 bit in 0x20010008 set). Another oddity is that RPDS ready sets slightly before the MBA data xfer busy bit clears (must be due to internal buffering). This messes up standalone boots, formatters, etc, which spin on the RPDS ready bit.

As many of you have also noted, only one copy of the RPDT register is maintained (for drive 0) and copied for other units. This wreaks havoc with the autoconf program. I have hacked the driver to attempt seeks to nonexistent heads and check for errors to tell the difference between FJ's and 9766's on the same controller, only to find that a seek to a nonexistent head on a unit other than 0 causes the controller to hang busy while it scribbles into the registers for drive 0. SI says they are making the RPDT register slightly user programmable (via DIP SW's) so one can get a couple of unique (non DEC) drive types. We run everything in "direct" mode (one large disk, not several "rm03"'s, etc per drive).

The seek code seems a little flaky. There is no look-ahead register maintained... so SEARCH commands just turn into seeks. We have also noticed bad things happening on Fuji's when an overlapped seek is in progress and an error occurs on a data xfer to another drive. The controller seems to get lost and trash registers, etc... I haven't pinned this down well yet and it is under investigation now. The multiple CDC 9766 drive systems DON'T GET ERRORS (thanks to Dysan) and have not had any seek problems.
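
To make the two driver-side checks mentioned above concrete (don't trust the interrupt alone, and don't spin on RPDS ready alone), here is a minimal sketch in C. It is not the driver code run at Purdue. The function names, the polling limit, and the simplified "end address" model of the byte/address counters are all hypothetical; the only value taken from the post is the 0x80000000 data xfer busy bit. The drive ready mask is an assumption and should be checked against your own register definitions.

```c
#include <assert.h>
#include <stdint.h>

/* MBA_DTBUSY is the 0x80000000 "data xfer busy" bit named in the post.
 * RPDS_DRY (drive ready) is an ASSUMED mask; check your own header. */
#define MBA_DTBUSY 0x80000000u
#define RPDS_DRY   0x0080u

/* Idle only when the drive reports ready AND the MBA is no longer busy.
 * Testing RPDS ready alone is what trips up standalone boots/formatters. */
int controller_idle(uint32_t mba_status, uint16_t rpds)
{
    return (rpds & RPDS_DRY) != 0 && (mba_status & MBA_DTBUSY) == 0;
}

/* Poll both registers a bounded number of times (hypothetical limit) so a
 * controller that hangs busy is reported instead of spun on forever.
 * Returns 1 if it went idle, 0 if the poll budget ran out. */
int wait_controller_idle(const volatile uint32_t *mba_status,
                         const volatile uint16_t *rpds, long max_polls)
{
    assert(max_polls > 0);
    while (max_polls-- > 0) {
        if (controller_idle(*mba_status, *rpds))
            return 1;
    }
    return 0;
}

/* Simplified model of "look at the byte/address counters": the transfer
 * only counts as finished if the address counter advanced by the whole
 * request. A reset that merely raises an interrupt fails this check. */
int xfer_ran_to_completion(uint32_t start_addr, uint32_t nbytes,
                           uint32_t end_addr)
{
    return end_addr == start_addr + nbytes;
}
```

An interrupt handler built along these lines would call xfer_ran_to_completion() before marking a buffer done, and would treat a failed check on a non-raw transfer as an unrecovered error (which, here, means halt).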

Being winchesters, the Fuji's have 4 or 5 badspots each (mostly 1 and 2 bit errors). The 9400 controllers CORRECT IN THE HARDWARE up to an 11 bit burst error. As most of you know, until recently nobody's driver seemed to correctly handle correctable ECC errors on DEC drives. All but two of our Fuji drives are error free due to the error correction hardware on the 9400. The two drives with errors each have one badspot (about 15 bits).

The 4.1BSD bad block forwarding does not work with currently delivered SI controllers. If you request it, they may give you ECO's which, when combined with a couple of driver hacks, will make it work, I think. The problem is that the device registers point to the wrong thing when a xfer is stopped on a bad sector in a multisector transfer.

We do our own installations/maintenance and have avoided most of the field service problems. I have seen an SI CE put an alignment pack with bent platters in our drive and try to spin it up until I stopped him. This same CE was at another site when they ended up with 2 head crashed 9766's and 4 or 5 crashed packs. We have never had a 9766 crash. About 1/4 of our Fuji 160's have head crashed within the first 2-3 weeks of operation (esp when heads didn't move for several hours). This goes against all that I had heard about the almost perfect reliability record for them.

Overall I think WE are still ahead with SI over DEC for OUR needs. We need SBI disks and are not turnkey users. I don't think everybody is in this situation though.

George Goble
School of Electrical Engineering
Purdue University
W. Lafayette, IN
decvax!pur-ee!ghg or ucbvax!pur-ee!ghg or ARPAVAX.ghg@Berkeley (for ARPA)

-----------------------------------------------------------------
gopher://quux.org/ conversion by John Goerzen of http://communication.ucsd.edu/A-News/

This Usenet Oldnews Archive article may be copied and distributed freely, provided:
1. There is no money collected for the text(s) of the articles.
2.
The following notice remains appended to each copy: The Usenet Oldnews Archive: Compilation Copyright (C) 1981, 1996 Bruce Jones, Henry Spencer, David Wiseman.