2006-02-07

Fibre sucks!

"Oooooo, I'm fiber... I break everything I touch." It was taunting me. That's what our NAS (Network Attached Storage) cluster was taunting me with last night at 9PM when I had to drive into work to fix it.
It all started during the Super Bowl. Our NAS sends out e-mails telling us when something's wrong, and it happened to send some around halftime. Some people, like myself, just have the e-mails sent to their work accounts. Others have them sent to their cell phones. Frank is one of those who has them sent to his cell. He was in Atlanta at the time, at his hotel, because his flight to Florida got cancelled. He called my phone and left a message asking me to check out the NAS, which I got Monday morning. As I was on my way in, Jim called to ask the same thing, so I decided to go straight for the NAS. Yay, half of our data is unavailable! Monday morning fires are great! I checked everything out to see just how bad the damage was. It looked like the head (a fancy word for one of the two servers in the cluster) didn't recognize that it had any hard drives attached to it. So I got on the phone with the vendor, and after about two hours of troubleshooting we got it functioning again. The final fix was to reseat the RAID/HBA combos in the RAID chassis.
But was that really the final fix? It worked all through the day. I went home and got a call from Frank around 6ish saying the NAS had sent out some more SOSs. I remotely connected to the office and checked it out. False alarm. We had also had a bad hard drive in the morning, so the RAID subsystem was rebuilding that drive onto the hot spare in the RAID set, and it just happened to finish rebuilding around 6ish. Why did that matter? Well, since we had reseated the RAID/HBA combo, bringing it back online is considered a critical process by the system. Rebuilding a bad disk is also a critical process. To provide maximum availability and reduce the risk of compound failures, the system only runs one critical process at a time. So once the bad disk was rebuilt, the RAID/HBA combo was brought back online, redetected all of the previous errors (bad disk, etc.), and sent the SOS e-mails again. Hence the false alarm.
So I told Frank all was cool and logged out. A couple hours passed and Frank called me again. So I logged back in and started poking around. This time all wasn't well. Slowly, things started failing: RAID controller communication, storage pools, RAID subsystems... I listed all of the VFSs (Virtual FileSystems) connected to that particular head and they all showed a size of 0. That's bad. Even an empty filesystem will show a size of 1; 0 means it's broken. Looks like I'm driving back to work at 9PM. Since Frank was in Florida (I neglected to mention that during the day on Monday, Frank got so tired of waiting for the runways to be cleared that he drove the six hours to Florida), I called another co-worker (Norm) who handles NAS-related issues to let him know what was up. I told him that when I got to the office and got the vendor on the phone, I'd conference him in. Then I got in my vehicle and started towards Johnstown. On the way I called my manager and let her know all hell had broken loose.
So I got the conference call going and we started troubleshooting. I was the only one in the office (Norm was remotely connected from home), so everything was pretty much up to me. After three hours of troubleshooting (that's right, I was there 'til midnight) we deduced that the left RAID/HBA combo had a short. Once we pulled it out, the redundant links took over and everything was back up. So we had a new one shipped for delivery in the morning.
A short, you ask... why did a short take down the entire chassis? Well, that is a downside of fibre connections. You see, when you have more than one fibre connection in a system (in this case, the RAID chassis), they form a pool. It's an abstract thing, virtual rather than physical: the fibre controllers talk to each other over the pool to let each other know whether they're functional, and all of their communication goes over it. When the left controller shorted out, it started sending bad signals across the pool, confusing the other controller. At that point neither controller knew anything and both stopped functioning. Something that was supposed to give us redundancy (two controllers in one chassis) actually caused us trouble. Looks like fibre isn't all it's cracked up to be.
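If it helps to picture why one bad card could hang both controllers, here's a toy sketch in Python. It's purely illustrative (the class names, the "heartbeat" messages, and the behavior are all made up for this post, not pulled from the vendor's firmware), but it captures the idea of two controllers that only know about each other through one shared pool:

# Toy model of two fibre controllers sharing one communication pool.
# Everything here is invented for illustration; it's not the real NAS firmware.

class Pool:
    """The shared, virtual channel both controllers talk over."""
    def __init__(self):
        self.messages = {}               # sender name -> last thing it sent

    def send(self, sender, payload):
        self.messages[sender] = payload

    def last_from(self, sender):
        return self.messages.get(sender)


class Controller:
    def __init__(self, name, pool):
        self.name = name
        self.pool = pool
        self.shorted = False
        self.online = True

    def tick(self, peer):
        # A shorted controller floods the pool with garbage instead of a
        # clean heartbeat.
        self.pool.send(self.name, "!@#$%" if self.shorted else "OK")

        # Each controller only knows about its peer through the pool: if the
        # peer's traffic is garbage, it can't tell what state the chassis is
        # in, so it stops serving I/O to be safe.
        peer_msg = self.pool.last_from(peer.name)
        self.online = (not self.shorted) and peer_msg in ("OK", None)


pool = Pool()
left, right = Controller("left", pool), Controller("right", pool)

left.shorted = True                      # Monday night's short
left.tick(right)
right.tick(left)
print(left.online, right.online)         # False False: one short, both down

pool.messages.pop("left")                # pull the bad RAID/HBA combo
right.tick(left)                         # the surviving controller recovers
print(right.online)                      # True

The point is that the pool is one shared thing: garbage from a shorted card looks no different to the healthy card than a genuinely sick chassis, so it plays it safe and stops too, and the "redundant" pair goes down together.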
So this morning came, and I went in 30 minutes early to intercept the delivery of the two new parts (a new hard drive and a RAID/HBA combo). Just my luck, neither was delivered to me until around 10AM. I got the RAID/HBA combo first and put it in, then double-checked the server to make sure nothing had broken, and all was cool. The new hard drive still hadn't shown up, so I asked Ryan to put it in when it arrived, since I work out of a different building.
Finally the NAS was fully functional again. And thus ended my two crappy days at work. :)
