Discussion:
[rancid] Dealing with rancid dying under heavy load
Alan McKinnon
2014-02-22 07:02:16 UTC
Hi,

Recently I had a moment of over-zealous enthusiasm and turned PAR_COUNT
up (from 30 to 50) to get rancid-run to complete more quickly. Sadly,
about 1 in 10 instances of rancid started failing mysteriously. The logs
mostly just give the dreaded "End of run not found" message; a few show
strange password errors for devices that are configured correctly.
Running rancid against these devices manually, when the host is idle,
always completes correctly.
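
For reference, this is the knob I changed. A minimal excerpt, assuming
the usual rancid.conf layout (the path varies by install):

    # rancid.conf (e.g. /usr/local/etc/rancid/rancid.conf)
    # number of collector processes rancid-run/par keeps going in parallel
    PAR_COUNT=50; export PAR_COUNT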

In each case I find that the *rancid parser did run and at minimum
launched *clogin, so I suspect memory or I/O issues under load are
causing the scripts to fail. I want to patch the code to trap and report
these errors if possible.
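
Roughly the kind of thing I have in mind - a sketch only, not rancid's
actual code; $host, $timeo and $cmds are stand-ins for whatever the real
script uses:

    # sketch: surface the *login child's exit status instead of letting
    # a dying pipe masquerade as a parse failure
    open(INPUT, "clogin -t $timeo -c \"$cmds\" $host </dev/null |")
        or die "clogin failed: $!\n";
    while (<INPUT>) {
        # ... existing output parsing ...
    }
    close(INPUT);
    if ($?) {
        my $status = $? >> 8;
        my $signal = $? & 127;
        print STDERR "$host: clogin exited with status $status",
                     $signal ? " (killed by signal $signal)" : "", "\n";
    }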

We all know how tricky this can get; is anyone in a position to discuss
how best to proceed? I'll do the heavy lifting of coding and testing,
but I want to pick others' brains first :-)



Background info:

FreeBSD 8.0-STABLE
perl-5.8.9
expect-5.44.1.15
rancid-2.3.8
rancid hosts are VMs on ESXi-4-something: 1 CPU, 1 NIC, 1 GB RAM;
the NIC is gigabit full-duplex.

"load" tends to run rather high, easily getting to 10+ and frightening
newbie sysadmins, but this has never been a problem in the past as
rancid is IO-bound anyway and scripts tend to spin along till they complete.
4M Total, 28K Used, 1024M Free
--
Alan McKinnon
***@gmail.com
heasley
2014-02-22 07:35:23 UTC
Post by Alan McKinnon
Recently I had a moment of over-zealous enthusiasm and turned PAR_COUNT
up (from 30 to 50) to get rancid-run to complete more quickly. Sadly,
about 1 in 10 instances of rancid started failing mysteriously.
[snip]
rancid hosts are VMs on ESXi-4-something: 1 CPU, 1 NIC, 1 GB RAM;
the NIC is gigabit full-duplex.
are the targets VMs, or is the rancid host a VM?

Unless the host can't keep up with the expect processes well enough to stay
within the login script's timeout period (or whatever you've set in
cloginrc), it should not fail - but I haven't tried 30 or 40; I usually use
kern.smp.cpus (the host's CPU count).
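
For reference, the cloginrc knob is the timeout directive; 120 seconds
here is just an example value:

    # .cloginrc: raise the expect timeout for all devices
    add timeout * 120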

If the rancid host is a VM and it is timing out despite plenty of spare CPU
and network, figure out whether it's actually timing out by wall-clock time,
vs. missing interrupts, for example.
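
A crude first check, assuming ntpdate is installed and the host below is
reachable (any reference clock will do; a guest with a bad clock can't
measure itself):

    # is the guest losing time against an outside reference?
    ntpdate -q 0.freebsd.pool.ntp.org   # offset before
    time sleep 60                       # should be ~60s of real time
    ntpdate -q 0.freebsd.pool.ntp.org   # offset after; a growing offset
                                        # suggests the guest is dropping ticks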
Alan McKinnon
2014-02-22 08:44:40 UTC
Post by heasley
are the targets VMs, or is the rancid host a VM?
The rancid host is a VM. The targets are Cisco kit.
Post by heasley
Unless the host can't keep up with the expect processes well enough to stay
within the login script's timeout period (or whatever you've set in
cloginrc), it should not fail - but I haven't tried 30 or 40; I usually use
kern.smp.cpus (the host's CPU count).
That was my thought too - it shouldn't fail and timeouts shouldn't happen.
These targets are all on fast networks (mostly gigabit, some 100 Mbit) and
all respond quickly.

I might have hit a threshold on the ESXi host; I don't have visibility
into that environment and can't see what else it's hosting.
Post by heasley
If the rancid host is a VM and it is timing out despite plenty of spare CPU
and network, figure out whether it's actually timing out by wall-clock time,
vs. missing interrupts, for example.
I'll look into that. I'd discounted simple timeouts, as another rancid
system that deals with kit out in the field runs PAR_COUNT=50 and has been
tested as high as 100. Some of that kit is stupidly slow - I've seen show
runs take 10 minutes to complete - and the system just deals with it as
expected.

Both systems have the same OS config and both run as VMs in the same
environment. I think step one is to get monitoring graphs out of the
VMware team.
--
Alan McKinnon
***@gmail.com
heasley
2014-02-22 16:04:58 UTC
Post by Alan McKinnon
I might have hit a threshold on the ESXi host; I don't have visibility
into that environment and can't see what else it's hosting.
The first VM provided to me had 6 CPUs and 8 GB; I couldn't figure out why
the performance of our multithreaded application, also I/O-bound, was so
horrible - they'd limited the VM to ~512 MHz, so I had a 1989-era VM.
Post by Alan McKinnon
I'll look into that. I'd discounted simple timeouts, as another rancid
system that deals with kit out in the field runs PAR_COUNT=50 and has been
tested as high as 100. Some of that kit is stupidly slow - I've seen show
runs take 10 minutes to complete - and the system just deals with it as
expected.
Lack of memory seems to be a big one for VMs, especially VirtualBox and
similar VM systems. But my experience with VMware is that disk I/O
performance is poor. That might be the drivers for the controllers I've
had, but it was bad enough that I moved to VBox.
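
If you want to sanity-check that quickly (writes and then removes a ~1 GB
file in the current directory; the numbers are only a rough indicator):

    # crude sequential-write check on FreeBSD
    dd if=/dev/zero of=ddtest bs=1m count=1024 && rm ddtest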
Post by Alan McKinnon
Both systems have the same OS config and both run as VMs in the same
environment. I think step one is to get monitoring graphs out of the
VMware team.
That's a good indicator; when retrieving data, rancid spends most of its
time waiting on the device. If you use NOPIPE=YES in rancid.conf, you
decouple the retrieval from the processing/reformatting.
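
That is, in rancid.conf:

    # rancid.conf: write the device output to a file first and parse it
    # afterwards, instead of parsing the login session's pipe in-line
    NOPIPE=YES; export NOPIPE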
