ICE-Linux mond issues with mdadm

Dave McLean
Nov 5, 2009 01:27:23 GMT   

I have installed ICE-Linux 2.11. After running Options-->Configure ICE-Linux Management Services on RHEL 5 nodes, mond starts up and the following Critical alerts occur every 15 minutes.

Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md0
Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md2
Nov 4 14:56:58 usorl03p307 mdadm: DeviceDisappeared /dev/md1
Nov 4 14:56:59 usorl03p307 mdadm: DeviceDisappeared /dev/md0


Stopping mond stops the messages.
/etc/init.d/mond stop
Donna Firkser
Nov 5, 2009 19:57:20 GMT    Unassigned

Dave,

These critical alerts are associated with the "Syslog Alerts" Service, correct?

I'd like to see if I can reproduce this. What version of RH5 do you have installed on your managed nodes (e.g. 32bit or 64bit; update 1 or 2)?

If you're not interested in seeing these mdadm critical alerts you should be able to stop the alerts by modifying the /opt/hptc/nagios/etc/syslogAlertRules file.

Try this and let me know if the alerts stop.

Edit syslogAlertRules (make a backup copy first) and change the mdadm rule to look as follows (i.e. add DeviceDisappeared to the list of mdadm events to ignore).

rule mdadm_errors {
name (! /(NewArray)|(SparesMissing) (DeviceDisappeared)/)
relevance ($subsystem =~ /mdadm/)
format "$timestamp $message"
}

Thanks,
Donna
Dave McLean
Nov 5, 2009 21:02:30 GMT    N/A: Question Author

Thanks for the quick reply, Donna. The HP case number for this issue is 4606099605. There are lots of logs and sysreports attached to the case if you can pull it up.

The RHEL version on the node is RHEL 5.4 x86_64 on BL495G5 blades in C7000 chassis.

Have been working with Mitch on other issues also but not this one.

We are interested in seeing valid mdadm alerts, but these are not valid and start after mond is started.

I will make your suggested changes and report back.
Dave McLean
Nov 5, 2009 21:12:11 GMT    N/A: Question Author

By chance, should there be a "|" between (SparesMissing) (DeviceDisappeared)?

Maybe it should be: (SparesMissing)|(DeviceDisappeared)/)
Donna Firkser
Nov 5, 2009 21:58:51 GMT    Unassigned

Yes. You need to add the |.

rule mdadm_errors {
name (! /(NewArray)|(SparesMissing)|(DeviceDisappeared)/)
relevance ($subsystem =~ /mdadm/)
format "$timestamp $message"
}
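
Why the "|" matters can be checked outside Nagios. The sketch below uses grep -E rather than the rule engine itself, on the assumption that grouping and alternation behave the same way in both for a pattern this simple. Without the "|", the second alternative only matches the literal text "SparesMissing DeviceDisappeared", so a lone DeviceDisappeared event never lands on the ignore list. The helper names is_ignored and report are made up for illustration.

```shell
#!/bin/sh
# Patterns copied from the two versions of the mdadm_errors rule in this thread.
broken='(NewArray)|(SparesMissing) (DeviceDisappeared)'
fixed='(NewArray)|(SparesMissing)|(DeviceDisappeared)'

# is_ignored PATTERN MESSAGE: succeeds if MESSAGE matches the ignore list.
is_ignored() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

# report LABEL PATTERN: say whether the sample message would be ignored.
report() {
    if is_ignored "$2" "$msg"; then
        echo "$1: ignored"
    else
        echo "$1: still alerts"
    fi
}

msg='DeviceDisappeared /dev/md0'
report broken "$broken"
report fixed "$fixed"
```

Run as written, this prints "broken: still alerts" followed by "fixed: ignored".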


Donna
Donna Firkser
Nov 5, 2009 22:03:41 GMT    Unassigned

And I should have noted that with this edit you will still get mdadm alerts, just not DeviceDisappeared alerts.

Donna
Dave McLean
Nov 6, 2009 12:52:18 GMT    N/A: Question Author

The change did stop the alerts, but /var/log/messages is still filling up with the bogus messages, which appear every 15 minutes once the mond service is started.

mond -> /opt/hptc/supermon/etc/init.d/mond-setup

With mond stopped, no more messages are generated in /var/log, so something that ICE-Linux (supermon) is doing is causing the messages to occur in the first place.

Need to find the root cause that is causing the messages.

I can provide you a virtual room connection if it would help.
Donna Firkser
Nov 6, 2009 14:46:37 GMT    Unassigned

Here's what's happening inside Nagios/supermon.

On the CMS, vi /opt/hptc/nagios/etc/nagios_vars.ini. In this file you will see mdadminfo and MDADMCOLLECTIONPERIOD.

MDADMCOLLECTIONPERIOD is set to 15 minutes, which means that on the target nodes supermon will call /opt/hptc/mdadm/sbin/getMdadmEvents every 15 minutes. You can change this collection period to anything you like.

If you log in to one of your target nodes, you can look at /opt/hptc/mdadm/sbin/getMdadmEvents, which calls mdadm-handler. mdadm-handler sends all messages returned by /sbin/mdadm to syslog.

We recently fixed an issue in our next IC-Linux release (V6.0) where this script was failing because it was being run as nagios and not root, so I'm wondering if you're hitting that issue.

Can you run a test for me? On the target node, (as root) run /opt/hptc/mdadm/sbin/getMdadmEvents and tail /var/log/messages and let me know what you see.

Then log in as nagios (su - nagios), run getMdadmEvents again, and let me know what you see in /var/log/messages.
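
One way to make the root-versus-nagios comparison concrete is to count the DeviceDisappeared lines in the log before and after each run. This is only a minimal sketch: count_disappeared is a made-up helper name, and the paths are the ones quoted in this thread.

```shell
#!/bin/sh
# count_disappeared FILE: print how many mdadm DeviceDisappeared lines FILE holds.
count_disappeared() {
    grep -c 'mdadm: DeviceDisappeared' "$1"
}

# Intended use on a target node, typed by hand. /var/log/messages is usually
# readable by root only, so take both counts from a root shell even when the
# script itself is run as nagios in another session:
#   before=$(count_disappeared /var/log/messages)
#   /opt/hptc/mdadm/sbin/getMdadmEvents
#   after=$(count_disappeared /var/log/messages)
#   echo "new DeviceDisappeared lines: $((after - before))"
```

A difference of zero after the root run and a non-zero difference after the nagios run would point at the user the script runs as.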

In regards to the DeviceDisappeared event, do you think that /sbin/mdadm is incorrectly reporting this error? Or has the device really disappeared?

One workaround I can think of is to modify mdadm-handler to check for the DeviceDisappeared event and not call syslog.
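
A sketch of what such a filter could look like. This is not the shipped mdadm-handler, whose contents are not shown in this thread; it is a hypothetical stand-in that relies only on documented mdadm behaviour: the program given to --program is called with the event name, the md device and, for some events, a component device. The should_log helper and the logger call are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical filter logic for an mdadm --program handler (not HP's script).

# should_log EVENT: succeeds unless EVENT is one we want kept out of syslog.
should_log() {
    case "$1" in
        DeviceDisappeared) return 1 ;;
        *)                 return 0 ;;
    esac
}

# mdadm passes: $1 = event name, $2 = md device, $3 = component device (optional).
# Nothing happens when the script is started without those arguments.
if [ "$#" -ge 2 ] && should_log "$1"; then
    logger -t mdadm "$1 $2${3:+ $3}"
fi
```

The trade-off is that a filter like this also hides a genuine DeviceDisappeared event, so it treats the symptom rather than the cause.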

Donna
Dave McLean
Nov 6, 2009 18:54:46 GMT    N/A: Question Author

Ran getMdadmEvents as both root and nagios. When run as root, no messages are generated in /var/log/messages.

When run as nagios, each run of getMdadmEvents generates:

Nov 6 13:45:53 usorl03p309 mdadm: DeviceDisappeared /dev/md1
Nov 6 13:45:53 usorl03p309 mdadm: DeviceDisappeared /dev/md0
Nov 6 13:45:59 usorl03p309 mdadm: DeviceDisappeared /dev/md2
Nov 6 13:45:59 usorl03p309 mdadm: DeviceDisappeared /dev/md1
Nov 6 13:45:59 usorl03p309 mdadm: DeviceDisappeared /dev/md0

I believe the messages are bogus and the devices are NOT disappearing.

dave
William Athanasiou
Nov 6, 2009 20:15:53 GMT    Unassigned

Could you provide a description of your hardware and installation? Are you using software RAID? How many disks are installed? Is it possible you have a disk in the machine that used to be part of a SW RAID set? If you have an /etc/mdadm.conf file, can you include the contents?

I realize that's a lot of questions, but I'm just trying to figure out why mdadm would be reporting the error.
Dave McLean
Nov 6, 2009 20:45:11 GMT    N/A: Question Author

The hardware is a BL495G5 blade with two internal 64GB SSD disks. The OS is RHEL 5.4, mirrored across the two internal drives.

mdadm.conf

# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=aa4f5616:1f85a679:04e92872:8cb15fe7
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=6787038e:e6c35d9c:fa5a0916:9729dd5f
ARRAY /dev/md2 level=raid1 num-devices=2 uuid=c90d94d7:2f54ad8e:74248664:92872716

dave
William Athanasiou
Nov 6, 2009 23:00:39 GMT    Unassigned

Well, that all looks right. Can you attach the output of "cat /proc/mdstat"?
Dave McLean
Nov 8, 2009 02:16:10 GMT    N/A: Question Author

Output from /proc/mdstat

usorl03p309 ~ -1277> cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      208704 blocks [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0]
      12586816 blocks [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
      49721088 blocks [2/2] [UU]

unused devices: <none>
usorl03p309 ~ -1278>
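
For checking many nodes, the same reading of /proc/mdstat can be scripted. This is a rough sketch, with degraded_arrays as a made-up helper name. It treats an underscore in the member map (for example [U_] instead of [UU]) as a missing member, which is how the md driver marks one.

```shell
#!/bin/sh
# degraded_arrays FILE: print the name of every md array in the mdstat-format
# FILE whose member map (e.g. [UU] or [U_]) shows a missing member.
degraded_arrays() {
    awk '
        # Array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0]"
        /^md[0-9]+ :/ { array = $1 }
        # Status line ends with the member map; "_" marks a missing member.
        NF && $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print array }
    ' "$1"
}

# On a live system:
#   degraded_arrays /proc/mdstat
```

For the output posted here, where all three arrays show [UU], the function prints nothing, which agrees with the devices not having disappeared.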

dave
Mitchell Kulberg Expert in this area
Nov 9, 2009 13:53:53 GMT    Unassigned

Hey there Dave,

I'm curious. Are you able to reproduce this error on any servers other than this one? Any chance you've got USB devices on this server?

It's a long shot, but I've had questionable USB devices do that for real.

Thanks,
Mitch
Donna Firkser
Nov 10, 2009 21:42:58 GMT    Unassigned

Dave,

After further investigation, it looks like this bogus DeviceDisappeared event is occurring because we are running mdadm as the nagios user. This is happening because we changed mond (which calls getMdadmEvents) to run as nagios instead of root for security purposes. However, when we made this change we forgot to update the mdadm call to use sudo, so there's a defect in V2.11: we should be using "sudo /sbin/mdadm" inside getMdadmEvents.

This defect is fixed in the next IC-Linux release (V6.0) which should be available January 2010.

Do you know if Siemens is planning to move to V6.0 when it becomes available?

In the interim, you could manually work around this issue by making the following changes on every managed system. This is the exact same fix that will be available in our V6.0 release.

1) Add the following line to /etc/sudoers on every managed system.
nagios ALL = NOPASSWD: /sbin/mdadm

And

2) Add "sudo" to the following line in /opt/hptc/mdadm/sbin/getMdadmEvents

`/usr/bin/sudo /sbin/mdadm --monitor --scan --program=/opt/hptc/mdadm/sbin/mdadm-handler --oneshot`;
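
With many managed systems, it may help to confirm that step 1 actually landed on each one. The sketch below is a plain text check, not a sudoers parser: has_nopasswd_rule is a made-up helper, it assumes the rule is written on a single line as in step 1, and it assumes the user and command contain no regex metacharacters. Running visudo -c after editing remains the proper syntax check.

```shell
#!/bin/sh
# has_nopasswd_rule FILE USER COMMAND: succeeds if FILE contains an
# uncommented line of the form "USER ALL = NOPASSWD: ... COMMAND".
has_nopasswd_rule() {
    grep -Eq "^[[:space:]]*$2[[:space:]]+ALL[[:space:]]*=[[:space:]]*NOPASSWD:.*[[:space:],]$3([[:space:],]|\$)" "$1"
}

# On a managed node, as root:
#   visudo -c
#   has_nopasswd_rule /etc/sudoers nagios /sbin/mdadm && echo "rule present"
```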

Let me know if this helps.

Thanks,
Donna
Dave McLean
Nov 11, 2009 00:52:33 GMT    N/A: Question Author

Thanks, Donna. That's sorta what it was looking like, since user root seemed to work OK. It's tough doing root-level tasks and maintaining security at the same time.

I'll give your suggestions a try and report back to you.


dave
Dave McLean
Nov 11, 2009 02:04:29 GMT    N/A: Question Author

Looks like the sudo trick worked.

Ready for another one? Something is trying to open /dev/mcelog at 15-minute intervals and getting permission denied.

Nov 10 20:28:27 usorl03p309 mcelog: Cannot open /dev/mcelog
Nov 10 20:43:26 usorl03p309 mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
Nov 10 20:43:26 usorl03p309 mcelog: Cannot open /dev/mcelog
Nov 10 20:58:27 usorl03p309 mcelog: Cannot open /dev/mem for DMI decoding: Permission denied


dave
Donna Firkser
Nov 11, 2009 03:31:47 GMT    Unassigned

Glad to hear that did the trick.

The mcelog event is the exact same issue, so you need to apply the same workaround.

1) Add /usr/sbin/mcelog to /etc/sudoers and
2) Add /usr/bin/sudo to the following line in /opt/hptc/mcelog/sbin/getMcelogEvents.

e.g.
`/usr/bin/sudo /usr/sbin/mcelog --syslog`;
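
Since the same two-part change has to be made by hand on every managed system, a small check of step 2 for both collectors could look like the following. It is a sketch only: calls_via_sudo and check are made-up helper names, and the test is a fixed-string search for "/usr/bin/sudo" followed by the binary, exactly as written in the lines quoted in this thread.

```shell
#!/bin/sh
# calls_via_sudo SCRIPT BINARY: succeeds if SCRIPT contains the text
# "/usr/bin/sudo BINARY" somewhere.
calls_via_sudo() {
    grep -Fq -- "/usr/bin/sudo $2" "$1"
}

# check SCRIPT BINARY: print a one-line verdict for a collector script.
check() {
    if calls_via_sudo "$1" "$2"; then
        echo "ok: $1 runs $2 through sudo"
    else
        echo "MISSING: $1 does not run $2 through sudo"
    fi
}

# On a managed node:
#   check /opt/hptc/mdadm/sbin/getMdadmEvents   /sbin/mdadm
#   check /opt/hptc/mcelog/sbin/getMcelogEvents /usr/sbin/mcelog
```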

These were the only two sudo issues fixed for V6.0, so you should be all set now.

Donna
Dave McLean
Nov 11, 2009 13:06:01 GMT    N/A: Question Author

Donna,

Applied the changes for mcelog also.

The last issue I'm working on with Mitch so far is that the wrong system name is picked up when multiple IPs are plumbed on the same NIC. Mitch should have all the details, but maybe I'll open a new forum thread on this one too.

Thanks for your support.

dave
Donna Firkser
Nov 11, 2009 14:02:41 GMT    Unassigned

Dave,

Mitch described the NIC/hostname issue to me. I'm going to try and reproduce it and will let you know what I find.

Donna
Donna Firkser
Nov 12, 2009 14:23:52 GMT    Unassigned

Dave,

I defined multiple NICs on managed system pluto as shown below and after I discovered it with SIM, I'm correctly seeing the one IP address for eth0 and host name pluto in SIM.

Is this configuration similar to your multi NIC configuration? Please open up a new forum entry for this discussion.

[root@poseidon image]# mxnode -ld pluto
System name: pluto
Host name: pluto.usa.hp.com
IP addresses: 16.118.197.34
OS name: LINUX


[root@pluto ~]# ifconfig
eth0      Link encap:Ethernet HWaddr 00:16:35:C6:C8:F6
          inet addr:16.118.197.34 Bcast:16.118.207.255 Mask:255.255.240.0
          inet6 addr: fe80::216:35ff:fec6:c8f6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:6602599 errors:0 dropped:0 overruns:0 frame:0
          TX packets:120564 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:651843140 (621.6 MiB) TX bytes:17638634 (16.8 MiB)
          Interrupt:169 Memory:f6000000-f6012800

eth0:0    Link encap:Ethernet HWaddr 00:16:35:C6:C8:F6
          inet addr:16.118.197.163 Bcast:16.255.255.255 Mask:255.0.0.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          Interrupt:169 Memory:f6000000-f6012800

eth0:1    Link encap:Ethernet HWaddr 00:16:35:C6:C8:F6
          inet addr:16.118.198.249 Bcast:16.255.255.255 Mask:255.0.0.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          Interrupt:169 Memory:f6000000-f6012800

eth0:2    Link encap:Ethernet HWaddr 00:16:35:C6:C8:F6
          inet addr:16.118.199.254 Bcast:16.255.255.255 Mask:255.0.0.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          Interrupt:169 Memory:f6000000-f6012800

lo        Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:613664 errors:0 dropped:0 overruns:0 frame:0
          TX packets:613664 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:238774063 (227.7 MiB) TX bytes:238774063 (227.7 MiB)
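
To compare what a node has plumbed against the single address SIM reports, the interface-to-address pairs can be pulled out of output in this format. This is a sketch with a made-up helper name, list_addrs, and it assumes the older net-tools layout shown here ("Link encap:" and "inet addr:"); newer ifconfig versions print a different layout.

```shell
#!/bin/sh
# list_addrs: read ifconfig output on stdin and print "INTERFACE ADDRESS"
# for every IPv4 address found.
list_addrs() {
    awk '
        # Interface header line, e.g. "eth0:0 Link encap:Ethernet ..."
        /Link encap:/ { iface = $1 }
        # IPv4 line, e.g. "inet addr:16.118.197.163 Bcast:... Mask:..."
        /inet addr:/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^addr:/) print iface, substr($i, 6)
        }
    '
}

# On a managed node:
#   ifconfig | list_addrs
```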

Thanks,
Donna
 
© 2009 Hewlett-Packard Development Company, L.P.