[rancid] HP ProCurves losing management interfaces

Discussion:

Michael Kania

2009-04-13 22:24:58 UTC

All,

I've been having a problem with a subset of switches periodically losing their management interfaces. We have 3 data centers within the United States and only 1 is having this problem. The problem data center is unique in that it is our largest (roughly 110 HP ProCurve 2810s) and CPU usage on each switch averages 35-45%. The servers behind each switch remain connected while the management interface is down. Ping, snmpget and ssh all fail. The downed management interface eventually recovers, and the logs don't show any sign of failure.

The rancid logs show a timeout when trying to contact that switch and then 3 failures to ssh. I've found that CPU usage spikes dramatically when rancid polls the switch, and my assumption was that the severe spikes in CPU utilization cause the management interface to fall over. To mitigate this, I've turned down the number of retries and the polling interval, but the problem remains. Is anyone familiar with this issue?

I'm using rancid version 2.3.2~a9 on Debian etch.

Thanks,
Mike Kania

john heasley

2009-04-13 23:17:40 UTC

Sounds like either a switch s/w bug or some over-zealous rate limiting. It
is true that running rancid against any network device will use CPU, a
little more than a human running the same commands, but it shouldn't
make the device fail. If it does, it's the vendor's bug.

Jody Botham

2009-04-14 08:55:56 UTC

What's the exact model of 2810 and what firmware are you running? We've
had similar issues with ProCurve kit (different model of switch but I
think the 2810 may have the same ASIC) recently and worked with them to
resolve the bug in their firmware. You can mail me off list if you need to.

Thanks,

Jody
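
For anyone trying to confirm the correlation described in this thread (the management interface dropping out around rancid runs), one option is to record reachability independently and compare the outage windows with the timestamps in the rancid logs. The following is only a minimal sketch, not something any poster here used: it assumes the switch accepts ssh on TCP port 22, and the function names (`probe_tcp`, `find_outages`, `watch`) and all defaults are hypothetical.

```python
import socket
import time
from typing import List, Optional, Tuple


def probe_tcp(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def find_outages(
    samples: List[Tuple[float, bool]]
) -> List[Tuple[float, Optional[float]]]:
    """Collapse (timestamp, reachable) samples into (down_at, up_at) windows.

    up_at is None if the host was still unreachable at the last sample.
    """
    outages: List[Tuple[float, Optional[float]]] = []
    down_at: Optional[float] = None
    for ts, reachable in samples:
        if not reachable and down_at is None:
            down_at = ts  # first failed probe starts an outage window
        elif reachable and down_at is not None:
            outages.append((down_at, ts))  # first good probe closes it
            down_at = None
    if down_at is not None:
        outages.append((down_at, None))
    return outages


def watch(host: str, interval: float = 10.0, count: int = 360) -> List[Tuple[float, bool]]:
    """Probe host every `interval` seconds, `count` times (assumed defaults)."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), probe_tcp(host)))
        time.sleep(interval)
    return samples
```

Running `find_outages(watch("switch-hostname"))` from a host other than the rancid server (the hostname is a placeholder) gives a list of down/up timestamps that can be lined up against the rancid run times. Each probe opens its own connection to the switch, so a long interval keeps the extra load small.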