Dave McLean
|
I have installed ICE-Linux 2.11. After running Options-->Configure ICE-Linux Management Services on RHEL 5 nodes, once mond starts up the following Critical alerts occur every 15 minutes.
Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md0
Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md2
Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md1
Nov 4 14:56:59 usorl03p307 mdadm: DeviceDisappeared /dev/md0
Stopping mond stops the messages.
/etc/init.d/mond stop
|
Donna Firkser
|
|
Nov 5, 2009 19:57:20 GMT
Unassigned
|
|
Dave,
These critical alerts are associated with the "Syslog Alerts" Service, correct?
I'd like to see if I can reproduce this. What version of RH5 do you have installed on your managed nodes (e.g. 32bit or 64bit; update 1 or 2)?
If you're not interested in seeing these mdadm critical alerts you should be able to stop the alerts by modifying the /opt/hptc/nagios/etc/syslogAlertRules file.
Try this and let me know if the alerts stop.
Edit syslogAlertRules (make a backup copy first) and change the mdadm rule to look as follows (i.e. add DeviceDisappeared to the list of mdadm events to ignore).
rule mdadm_errors { name (! /(NewArray)|(SparesMissing) (DeviceDisappeared)/) relevance ($subsystem =~ /mdadm/) format "$timestamp $message" }
Thanks, Donna
|
|
Dave McLean
|
|
Nov 5, 2009 21:02:30 GMT
N/A: Question Author
|
|
Thanks for the quick reply Donna. The HP case number for this issue is 4606099605. There are lots of logs and sysreports attached to the case if you can pull it up.
The RHEL version on the node is RHEL 5.4 x86_64 on BL495G5 blades in C7000 chassis.
I have been working with Mitch on other issues also, but not this one.
We are interested in seeing valid mdadm alerts, but these are not valid and start after mond is started.
I will make your suggested changes and report back.
|
|
Dave McLean
|
|
Nov 5, 2009 21:12:11 GMT
N/A: Question Author
|
|
By chance, should there be a "|" between (SparesMissing) and (DeviceDisappeared)?
Maybe it should be: (SparesMissing)|(DeviceDisappeared)/)
|
|
Donna Firkser
|
|
Nov 5, 2009 21:58:51 GMT
Unassigned
|
|
Yes. You need to add the |.
rule mdadm_errors { name (! /(NewArray)|(SparesMissing)|(DeviceDisappeared)/) relevance ($subsystem =~ /mdadm/) format "$timestamp $message" }
Donna
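For readers following the regex fix: the missing | matters because, without it, the second alternative becomes the literal text "SparesMissing DeviceDisappeared", which no single event name contains. The sketch below illustrates this with Python's re module rather than the Nagios rule engine itself. It assumes the rule matches its pattern against the mdadm event name, and the event list is only a sample.

```python
import re

# Pattern as first posted (space instead of "|" before DeviceDisappeared).
broken = re.compile(r"(NewArray)|(SparesMissing) (DeviceDisappeared)")
# Pattern after the correction.
fixed = re.compile(r"(NewArray)|(SparesMissing)|(DeviceDisappeared)")

# Sample mdadm event names; "Fail" stands in for an event we still want to see.
events = ["NewArray", "SparesMissing", "DeviceDisappeared", "Fail"]


def ignored(pattern, event):
    """True if the event name matches the ignore pattern."""
    return pattern.search(event) is not None


broken_ignored = [e for e in events if ignored(broken, e)]
fixed_ignored = [e for e in events if ignored(fixed, e)]

print("broken pattern ignores:", broken_ignored)
print("fixed pattern ignores: ", fixed_ignored)
```

Note that the version without the | also stops ignoring SparesMissing, since that name now only matches when followed by " DeviceDisappeared".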
|
|
Donna Firkser
|
|
Nov 5, 2009 22:03:41 GMT
Unassigned
|
|
And I should have noted that after making this edit you will continue to get mdadm alerts, just not DeviceDisappeared alerts.
Donna
|
|
Dave McLean
|
|
Nov 6, 2009 12:52:18 GMT
N/A: Question Author
|
|
The change did stop the alerts, but /var/log/messages is still filling up with the bogus messages, which start when the mond service is started and recur every 15 minutes.
mond -> /opt/hptc/supermon/etc/init.d/mond-setup
With mond stopped, no more messages are generated in /var/log, so something that ICE-Linux (supermon) is doing is causing the messages to occur in the first place.
We need to find the root cause of the messages.
I can provide you a virtual room connection if it would help.
|
|
Donna Firkser
|
|
Nov 6, 2009 14:46:37 GMT
Unassigned
|
|
Here's what's happening inside Nagios/supermon.
On the CMS, vi /opt/hptc/nagios/etc/nagios_vars.ini. In this file you will see mdadminfo and MDADMCOLLECTIONPERIOD.
MDADMCOLLECTIONPERIOD is set to 15 minutes, which means that on the target nodes supermon will call /opt/hptc/mdadm/sbin/getMdadmEvents every 15 minutes. You can change this collection period to anything you like.
If you log in to one of your target nodes, you can look at /opt/hptc/mdadm/sbin/getMdadmEvents, which calls mdadm-handler. mdadm-handler sends all messages returned by /sbin/mdadm to syslog.
We recently fixed an issue in our next IC-Linux release (V6.0) where this script was failing because it was being run as nagios and not root, so I'm wondering if you're hitting that issue.
Can you run a test for me? On the target node, (as root) run /opt/hptc/mdadm/sbin/getMdadmEvents and tail /var/log/messages and let me know what you see.
Then log in as nagios (su - nagios), run getMdadmEvents, and let me know what you see in /var/log/messages.
In regards to the DeviceDisappeared event, do you think that /sbin/mdadm is incorrectly reporting this error? Or has the device really disappeared?
One workaround I can think of is to modify mdadm-handler to check for the DeviceDisappeared event and not call syslog.
Donna
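The handler workaround suggested above could look roughly like the following. This is a hypothetical stand-in written in Python, not the actual mdadm-handler shipped in /opt/hptc/mdadm/sbin, whose contents are not shown in this thread. It relies on the documented behavior of mdadm --program, which passes the event name, the md device and, where relevant, a component device as arguments. The names IGNORED_EVENTS, format_event and main are invented for illustration.

```python
# Hypothetical replacement handler: drop DeviceDisappeared, log everything else.

IGNORED_EVENTS = {"DeviceDisappeared"}


def format_event(args):
    """Build the syslog message for one mdadm event, or None to suppress it.

    args is [event, md_device] or [event, md_device, component_device],
    which is how mdadm --program invokes its handler.
    """
    if len(args) < 2 or args[0] in IGNORED_EVENTS:
        return None
    return " ".join(args[:3])


def main(argv):
    """Entry point when installed as the --program handler: main(sys.argv)."""
    import syslog  # Unix-only standard library module

    message = format_event(argv[1:])
    if message is not None:
        syslog.openlog("mdadm")
        syslog.syslog(syslog.LOG_CRIT, message)
```

Filtering in the handler keeps the noise out of /var/log/messages as well as out of Nagios, but it would also hide a genuine DeviceDisappeared event, so it only masks the symptom.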
|
|
Dave McLean
|
|
Nov 6, 2009 18:54:46 GMT
N/A: Question Author
|
|
Ran getMdadmEvents as both root and nagios. When run as root, no messages are generated in /var/log/messages.
When run as nagios, each run of getMdadmEvents generates:
Nov 6 13:45:53 usorl03p309 mdadm: DeviceDisappeared /dev/md1
Nov 6 13:45:53 usorl03p309 mdadm: DeviceDisappeared /dev/md0
Nov 6 13:45:59 usorl03p309 mdadm: DeviceDisappeared /dev/md2
Nov 6 13:45:59 usorl03p309 mdadm: DeviceDisappeared /dev/md1
Nov 6 13:45:59 usorl03p309 mdadm: DeviceDisappeared /dev/md0
I believe the messages are bogus and the devices are NOT disappearing.
dave
|
|
William Athanasiou
|
|
Nov 6, 2009 20:15:53 GMT
Unassigned
|
|
Could you provide a description of your hardware and installation? Are you using software RAID? How many disks are installed? Is it possible you have a disk in the machine that used to be part of a SW RAID set? If you have an /etc/mdadm.conf file, can you include the contents?
I realize that's a lot of questions, but I'm just trying to figure out why mdadm would be reporting the error.
|
|
Dave McLean
|
|
Nov 6, 2009 20:45:11 GMT
N/A: Question Author
|
|
The hardware is a BL495G5 blade with two internal 64GB SSDs. The OS is RHEL 5.4, mirrored across the two internal drives.
mdadm.conf
# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=aa4f5616:1f85a679:04e92872:8cb15fe7
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=6787038e:e6c35d9c:fa5a0916:9729dd5f
ARRAY /dev/md2 level=raid1 num-devices=2 uuid=c90d94d7:2f54ad8e:74248664:92872716
dave
|
|
William Athanasiou
|
|
Nov 6, 2009 23:00:39 GMT
Unassigned
|
|
|
Well, that all looks right. Can you attach the output of "cat /proc/mdstat"?
|
|
Dave McLean
|
|
Nov 8, 2009 02:16:10 GMT
N/A: Question Author
|
|
Output from /proc/mdstat
usorl03p309 ~ -1277> cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      208704 blocks [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
      12586816 blocks [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
      49721088 blocks [2/2] [UU]
unused devices: <none>
usorl03p309 ~ -1278>
dave
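The output above shows all three mirrors with both members present ([2/2] [UU]), which supports the view that the arrays have not disappeared. For checking many nodes, a minimal parsing sketch is shown below. It assumes the raid1 layout seen here (a "blocks [n/m] [UU]" status line after each array line), and the helper name parse_mdstat is made up for this example.

```python
import re

# Sample text taken from the /proc/mdstat output posted in this thread.
MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      208704 blocks [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
      12586816 blocks [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
      49721088 blocks [2/2] [UU]
unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+) : (\w+) ")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Map each md array to its state and a simple healthy flag."""
    arrays = {}
    current = None
    for line in text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
            arrays[current] = {"state": array.group(2), "healthy": False}
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, working = int(status.group(1)), int(status.group(2))
            # "_" in the flags marks a missing or failed member.
            arrays[current]["healthy"] = (
                expected == working and "_" not in status.group(3)
            )
    return arrays


arrays = parse_mdstat(MDSTAT)
for name, info in sorted(arrays.items()):
    print(name, info["state"], "healthy" if info["healthy"] else "DEGRADED")
```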
|
Mitchell Kulberg
|
|
Nov 9, 2009 13:53:53 GMT
Unassigned
|
|
Hey there Dave,
I'm curious. Are you able to reproduce this error on any servers other than this one? Any chance you've got USB devices on this server?
It's a long shot, but I've had questionable USB devices do that for real.
Thanks, Mitch
|
|
Donna Firkser
|
|
Nov 10, 2009 21:42:58 GMT
Unassigned
|
|
Dave,
After further investigation, it looks like this bogus DeviceDisappeared event is occurring because we are running mdadm as the nagios user. This is happening because we changed mond (which calls getMdadmEvents) to run as nagios instead of root for security purposes. However, when we made this change we forgot to modify the mdadm call to use sudo, so there's a defect in V2.11: we should be using "sudo /sbin/mdadm" inside getMdadmEvents.
This defect is fixed in the next IC-Linux release (V6.0) which should be available January 2010.
Do you know if Siemens is planning to move to V6.0 when it becomes available?
In the interim, you could manually work around this issue by making the following changes on every managed system. This is the exact same fix that will be available in our V6.0 release.
1) Add the following line to /etc/sudoers on every managed system.
nagios ALL = NOPASSWD: /sbin/mdadm
2) Add "sudo" to the following line in /opt/hptc/mdadm/sbin/getMdadmEvents
`/usr/bin/sudo /sbin/mdadm --monitor --scan --program=/opt/hptc/mdadm/sbin/mdadm-handler --oneshot`;
Let me know if this helps.
Thanks, Donna
|
|
Dave McLean
|
|
Nov 11, 2009 00:52:33 GMT
N/A: Question Author
|
|
Thanks Donna. That's sort of what it was looking like, since user root seemed to work OK. It's tough doing root-level tasks and maintaining security at the same time.
I'll give your suggestions a try and report back to you.
dave
|
|
Dave McLean
|
|
Nov 11, 2009 02:04:29 GMT
N/A: Question Author
|
|
Looks like the sudo trick worked.
Ready for another one? Something is trying to open /dev/mcelog at 15-minute intervals and getting permission denied.
Nov 10 20:28:27 usorl03p309 mcelog: Cannot open /dev/mcelog
Nov 10 20:43:26 usorl03p309 mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
Nov 10 20:43:26 usorl03p309 mcelog: Cannot open /dev/mcelog
Nov 10 20:58:27 usorl03p309 mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
dave
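The 15-minute spacing can be confirmed directly from the timestamps, which is what ties these messages to the same collection schedule as the mdadm ones. The sketch below does that for the lines quoted above. Syslog lines carry no year, so 2009 (the year of this thread) is supplied as an assumption, and burst_times is a name chosen for this example.

```python
from datetime import datetime

# The mcelog lines quoted in the post above.
LOG = """\
Nov 10 20:28:27 usorl03p309 mcelog: Cannot open /dev/mcelog
Nov 10 20:43:26 usorl03p309 mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
Nov 10 20:43:26 usorl03p309 mcelog: Cannot open /dev/mcelog
Nov 10 20:58:27 usorl03p309 mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
"""


def burst_times(text, year=2009):
    """Return the distinct timestamps in the log, in order of appearance."""
    seen = []
    for line in text.splitlines():
        stamp = " ".join(line.split()[:3])  # e.g. "Nov 10 20:28:27"
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if when not in seen:
            seen.append(when)
    return seen


times = burst_times(LOG)
# Gap between consecutive bursts, rounded to whole minutes.
gaps_min = [round((b - a).total_seconds() / 60) for a, b in zip(times, times[1:])]
print(gaps_min)
```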
|
|
Donna Firkser
|
|
Nov 11, 2009 03:31:47 GMT
Unassigned
|
|
Glad to hear that did the trick.
The mcelog event is the exact same issue, so you need to apply the same workaround.
1) Add /usr/sbin/mcelog to /etc/sudoers
2) Add /usr/bin/sudo to the following line in /opt/hptc/mcelog/sbin/getMcelogEvents.
e.g. `/usr/bin/sudo /usr/sbin/mcelog --syslog`;
These were the only two sudo issues fixed for V6.0, so you should be all set now.
Donna
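Since both workarounds have to be applied on every managed system, a quick check of each node's sudoers text can catch a node that was missed. The following is a rough sketch only: it handles just the simple "user ALL = NOPASSWD: cmd" form used in this thread, and visudo -c remains the proper syntax check. The mcelog line in SAMPLE is assumed to mirror the mdadm line, since its exact form was not posted, and nopasswd_commands is a made-up helper name.

```python
import re

# Sample sudoers content; the mcelog entry is an assumed analogue of the
# mdadm entry given earlier in the thread.
SAMPLE = """\
root    ALL=(ALL)       ALL
nagios ALL = NOPASSWD: /sbin/mdadm
nagios ALL = NOPASSWD: /usr/sbin/mcelog
"""


def nopasswd_commands(sudoers_text, user):
    """Return the commands `user` may run without a password on ALL hosts."""
    commands = set()
    pattern = re.compile(rf"{re.escape(user)}\s+ALL\s*=\s*NOPASSWD:\s*(.+)$")
    for raw in sudoers_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplification)
        match = pattern.match(line)
        if match:
            commands.update(c.strip() for c in match.group(1).split(","))
    return commands


required = {"/sbin/mdadm", "/usr/sbin/mcelog"}
missing = required - nopasswd_commands(SAMPLE, "nagios")
print("missing rules:", sorted(missing))
```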
|
|
Dave McLean
|
|
Nov 11, 2009 13:06:01 GMT
N/A: Question Author
|
|
Donna,
Applied the changes for mcelog also.
The last issue I'm working on with Mitch so far is that the wrong system name is being picked up when multiple IPs are plumbed up on the same NIC. Mitch should have all the details, but maybe I'll open up a new forum thread on this one also.
Thanks for your support.
dave
|
|
Donna Firkser
|
|
Nov 11, 2009 14:02:41 GMT
Unassigned
|
|
Dave,
Mitch described the NIC/hostname issue to me. I'm going to try and reproduce it and will let you know what I find.
Donna
|
|
Donna Firkser
|
|
Nov 12, 2009 14:23:52 GMT
Unassigned
|
|
Dave,
I defined multiple NICs on managed system pluto as shown below and after I discovered it with SIM, I'm correctly seeing the one IP address for eth0 and host name pluto in SIM.
Is this configuration similar to your multi NIC configuration? Please open up a new forum entry for this discussion.
[root@poseidon image]# mxnode -ld pluto
System name: pluto
Host name: pluto.usa.hp.com
IP addresses: 16.118.197.34
OS name: LINUX

[root@pluto ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:16:35:C6:C8:F6
          inet addr:16.118.197.34  Bcast:16.118.207.255  Mask:255.255.240.0
          inet6 addr: fe80::216:35ff:fec6:c8f6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:6602599 errors:0 dropped:0 overruns:0 frame:0
          TX packets:120564 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:651843140 (621.6 MiB)  TX bytes:17638634 (16.8 MiB)
          Interrupt:169 Memory:f6000000-f6012800

eth0:0    Link encap:Ethernet  HWaddr 00:16:35:C6:C8:F6
          inet addr:16.118.197.163  Bcast:16.255.255.255  Mask:255.0.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:169 Memory:f6000000-f6012800

eth0:1    Link encap:Ethernet  HWaddr 00:16:35:C6:C8:F6
          inet addr:16.118.198.249  Bcast:16.255.255.255  Mask:255.0.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:169 Memory:f6000000-f6012800

eth0:2    Link encap:Ethernet  HWaddr 00:16:35:C6:C8:F6
          inet addr:16.118.199.254  Bcast:16.255.255.255  Mask:255.0.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:169 Memory:f6000000-f6012800

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:613664 errors:0 dropped:0 overruns:0 frame:0
          TX packets:613664 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:238774063 (227.7 MiB)  TX bytes:238774063 (227.7 MiB)

Thanks, Donna
|