
CLAR-TT0006 - Multiple Drives Faulted



Troubleshooting Tree for CLARiiON CX-Series Arrays

Multiple Drives Faulted

Revision 1

Published March 2004

CLAR-TT0006

Copyright © 2004 EMC Corporation. All rights reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

Trademark Information

EMC2, EMC, CLARiiON, Navisphere, PowerPath, Symmetrix are registered trademarks and Access Logix, Application Transparent Failover, EMC ControlCenter, FLARE, MirrorView, SAN Copy, and SnapView are trademarks of EMC Corporation. All other trademarks used herein are the property of their respective owners.

Page 1

Block 0: Starting Symptom: Scattered multiple disk drives have faulted. Typically this means two to six faulted drives, not entire enclosures, which would be a different tree. This tree assumes the disks are still faulted.

Block 1: Run spcollect on both SPs. Look in the logs for the cause, or apply the SPLAT tool for analysis.

Block 2: Is the Flare (Base Software) revision earlier than Release 11?

Block 3: [Yes] Are disks 0 and/or 1 faulted? CRU signature errors are only logged during an assign, so an SP reboot may be required. Are there CRU signature errors on drive 0, on drive 1, or on both?

Block 4: Are there unowned LUNs, or has the customer reported data unavailable?

Note (Arrays with Multiple Problems): See Primus case emc81176 when troubleshooting Fibre Channel arrays with multiple problems. That solution examines the order in which problems should be addressed for the maximum chance of success.

[Yes] Proceed to Block 5 on Page 2 of this tree.

Block 31: [No] "Diagnosis" box: Examine the file "spX_config_info.txt" in each spcollect output. Search within that file for the "getdisk" output from each SP, and compare whether either SP still reports the faulted drives as online. This condition sometimes causes a drive to show both its green and amber lights at the same time. Does the output of "navicli getdisk" show different states for the faulted disks? Or do you see "BE loop hung" or "LCC glitch detected" messages in the log? (A command sketch for this check follows below.)
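The comparison called for in Block 31 can also be made live from the management host rather than from the spcollect files. The shell sketch below is illustrative only: it assumes the Navisphere CLI (navicli) is installed on the management host, and the SP names spa and spb are placeholders for your site's SP network addresses.

    # Capture the disk states as reported by each SP (Block 31).
    navicli -h spa getdisk > getdisk_spa.txt
    navicli -h spb getdisk > getdisk_spb.txt

    # Any difference may mean one SP still sees a faulted drive as online.
    diff getdisk_spa.txt getdisk_spb.txt

    # Search the SP logs from the spcollect output for the back-end loop
    # messages named in this tree; sp_logs/ is a placeholder for wherever
    # the spcollect archives were extracted.
    grep -i -e "BE loop hung" -e "LCC glitch detected" sp_logs/*.txt

If the two SPs disagree on the state of the faulted disks, or the search turns up loop messages, answer "Yes" to Block 31.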
Block 30: [Yes] If the data on the "unowned LUNs" is not critical, those LUNs may be unbound, rebound, and the data restored from backup. This may be suggested if time to recovery is critical. Data recovery can be attempted using the engineering-only "NDB" tool; the call should be escalated at this point. See Knowledgebase article emc60874 for more information on CRU signature errors and the NDB process.

Block 43: [Yes] This probably means that the problem is an LCC or a cable on the side that does not see the drives. Sometimes it is difficult to determine which cable or LCC in the loop is bad, so it may be necessary to split off some of the enclosures and find the faulted part by process of elimination. Start by disconnecting the enclosures one at a time on the bus beyond the DAE/OS. If the problem disappears, add enclosures back one at a time until it reappears, then swap out the cables and LCCs of the just-added enclosure. The other option is simply to start replacing cables and LCCs until the problem is corrected. If you see "port glitch," the problem is most likely the highest-numbered LCC listed in the log or the next higher one, or the cable in between. See Knowledgebase article emc67018. Do all the drives spin up? Are all the fault lights cleared?

Block 42: [No] It is recommended that you run the SP logs through the mergelogs.exe utility, then view them in the SPLAT log analysis tool. Use SPLAT in "analyze" mode with the "backend loop" or "bad disks" filter. This may point out the problem, such as a "soft SCSI error" with an extended status of 0x09 (HW error). If the problem can be found and corrected, reseating the faulted drives or rebooting the SP should bring the faulted drives back online. Did reseating the drives or rebooting the array bring the drives back?

Block 44: [No] Replace the faulted drives. Did the new drives spin up OK without faults?

Block 51: [No] This probably means that the problem is an LCC or a cable on the side that does not see the drives. Sometimes it is difficult to determine which cable or LCC in the loop is bad, so it may be necessary to split off some of the enclosures and find the bad part by process of elimination. The other option is simply to start replacing cables and LCCs, one side at a time, until the problem is corrected. If elimination did not find the bad component, try swapping all cables on the loop and then, if necessary, swapping all LCCs on the loop. See Knowledgebase article emc67018. Do all the drives spin up? Are all the fault lights cleared?

Block 46: [No] This is unusual; some other condition must be keeping the drives faulted. If only a single drive is faulted, replace it. Replace all LCCs and cables in the failing loop (see Knowledgebase article emc67018). Examine the FBI and SP log output via SPLAT for any signs of an issue. If this is an FC4700 or an ATA chassis, replace the power supplies in the enclosure with the faults; if that fails, replace the chassis. Fixed?

Block 47: [Yes] Run FBI for 24 hours. Is the array now healthy?

Block 38: [Yes] Close the case, but monitor the site for 24 more hours. Return parts for failure analysis (FA).

Block 48: [No] Review the FBI output for a possible cause and, if needed, escalate along with the spcollects and FBI output. Additional parts replacement is probably required.

General Closure Statement: As cases are closed, it is important that the actions taken and the hardware components replaced be noted. The spcollect scripts, and in some cases the FBI outputs, should be available for cases that are escalated.

Page 2 (continued from Block 4 on Page 1)

Multiple drives are faulted, causing loss of access to customer data (unowned LUNs) on an array running Base Software Release 11 or later.

Block 5: Are there multiple faulted drives in the same RAID group?

Block 6: [Yes] If these drives faulted at nearly the same time, it can be assumed that they are not all bad. In this case, reseat the drives or reboot the array as a first step in narrowing down the fault. Do not reboot if multiple vault drives are faulted; escalate immediately in that case. Did a reboot or reseating of the drives clear the faults?

Block 32: [No] Are there any cache dirty LUNs, or CRU signature errors in the logs? (These are only available after an SP reboot.)

Note (Arrays with Multiple Problems): See Knowledgebase article emc81176 when troubleshooting Fibre Channel arrays with multiple problems. That solution examines the order in which problems should be addressed for the maximum chance of success.

Block 15: [Yes] Did you find a cache dirty condition?

Block 7: [Yes] Run FBI under load for several hours to ensure that there are no more BE loop issues. If the cause is not found, submit an RCA request.

Block 12: [No] Do the logs indicate that the drives faulted at different times? Also, are the logs free of errors that might indicate other issues?

Block 37: [No] This could be a barrier to further progress. Schedule a reboot to determine whether there are CRU signature errors or cache dirty LUNs before proceeding. While waiting, check host connectivity status and the failover configuration in preparation for the SP reboot. Are both paths active and available? (A path-status sketch follows below.)
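Before the reboot contemplated in Block 37, path status can be checked from each attached host. For hosts running EMC PowerPath, the sketch below is one way to do it; it is a hedged example rather than a prescribed procedure, and the output format varies by PowerPath version.

    # Show every device and the state of each path to it; look for paths
    # marked dead or degraded before scheduling the SP reboot.
    powermt display dev=all

    # Summary view of path counts per storage-system port.
    powermt display

Hosts using other failover software should be checked with that software's own path-listing tools.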
Block 33: [No] Does the customer's situation allow a reboot?

Block 35: [Yes] Schedule a reboot time with the customer. If a reboot is not allowed, escalate to TS2. If a reboot is scheduled, go to Block 34.

Block 34: [Yes] Reboot. Are there any CRU signature errors or cache dirty messages in the SP logs?

Block 14: [Yes] Run the cache clear procedure in Knowledgebase article emc5410. Article emc31296 has a link to generate the password. Clean?

Block 16: [No] Are there CRU signature errors?

Block 17: [No] Any remaining drive faults?

Block 18: [Yes] Have these disks been moved around?

Block 8: [No] If this is an FC4700 or involves an ATA enclosure, the cause may be the power supply (PS) of the faulted enclosure; this is less likely for CX-Series arrays. If it is not the PS, swap out the enclosure. Fixed?

Block 13: [Yes] This may be a true double drive failure. If reseating either dead drive does not recover it, data may be lost. Escalation to TS2 is probably in order. Is the customer willing to restore from backup? If so, unbind and rebind the LUNs and restore from backup.

Block 39: [No] Fix the path availability issue, or if the problem is that both paths are not checked in the storage group, check (enable) the missing path. Then issue a "navicli trespass mine" to both SPs (see the sketch at the end of this tree). Did all the LUNs come back?

Block 29: [No] Are there unowned LUNs?

Block 19: [Yes] Jump to Block 31 for another pass for the remaining failure.

Block 20: [No] Run FBI to ensure the BE loop is clean. If clean, then close.

Block 22: [Yes/No] Data may need to be recovered with the engineering-only "NDB" tool. There is some risk of data loss at this point. Escalate.

Block 9: [No] Get new spcollect and FBI logs and escalate to TS2.

Block 10: [Yes] Send the replaced part back for FA. Run FBI for a time to ensure array health. Close the case.

Block 40: [Yes] Escalate to TS2.

Block 41: [No] Run FBI for a time to ensure that the array is healthy, then close.

Block 25: [No] Save spcollects again. Run FBI and monitor to see whether the BE loop is clean. If clean, close.

Block 26: Check the status of the LUNs again in navicli. Are there still unowned LUNs?

Block 21: [Yes] Put them back in their original slots. If the customer no longer knows which slots they were in, TS2 may have historical data. Fixed?

Block 24: [Yes] Close the case.

Block 23: [No] Escalate to TS2.

Block 28: [Yes] Get new spcollect scripts and escalate to TS2.

Block 27: [No] Restart the browser. Run FBI for a time to ensure there are no more BE issues. Close if clean.

General Closure Statement: As cases are closed, it is important that the actions taken and the hardware components replaced be noted. The spcollect scripts, and in some cases the FBI outputs, should be available for cases that are escalated.
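Blocks 39 and 26 above involve returning LUNs to their owning SP and then re-checking ownership. The following is a minimal sketch under the same assumptions as the earlier examples (Navisphere CLI installed; spa and spb are placeholder SP names); the exact getlun field names vary by FLARE revision.

    # Ask each SP to take back the LUNs it should own (Block 39).
    navicli -h spa trespass mine
    navicli -h spb trespass mine

    # Re-check LUN ownership (Block 26); look for LUNs reporting no owner.
    navicli -h spa getlun | grep -i owner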

