Discussion:
[rancid] missed cmds, but only when run from cron
Brent Wiese
2013-08-14 00:22:01 UTC
This is boggling my mind.

I slightly modified the f5rancid script to work with F5's TMOS version 11. Basically, all it does is run "show running-config" and then "quit" instead of "exit" when it finds the prompt.

This has been working just fine for our existing load balancers. But we recently added a new set. The first run appears to work. Then I start getting this in the logs:


starting: Tue Aug 13 17:02:11 MST 2013

cvs add: lb01.my.domain already exists, with version number 1.4
Added lb01.my.domain

Trying to get all of the configs.
lb01.my.domain: (v11) missed cmd(s): show running-config
lb01.my.domain: (v11) End of run not found
#

I don't get it - if I run f5rancid_v11 (my modified copy), it runs just fine and the ".new" file is there with all the info I need. It's only when it's run from cron that it throws any kind of error. F5login runs fine too (or f5rancid_v11 would fail). I added the "(v11)" in the log line area to confirm it's running the correct script when it does rancid-run.

Since it's logging errors, I don't think it's any kind of permissions error. Besides, I am running it manually as the same user as the cron job.

All my other configs are coming in and diff'ing just fine, so I don't think it's a CVS issue (or the ability to write temp files for example).

Any ideas?

Thanks!
Paul Gear
2013-08-14 04:05:28 UTC
I'm seeing something very similar with one Mikrotik device out of 17 on
our network. If run from the command line, it works fine. If run from
cron, one router in the group fails, giving the following:
starting: Tue Aug 13 00:01:32 EST 2013
Trying to get all of the configs.
failing-router-name: missed cmd(s): system package print detail without-paging
=====================================
Getting missed routers: round 1.
failing-router-name: missed cmd(s): system package print detail without-paging
=====================================
Getting missed routers: round 2.
failing-router-name: missed cmd(s): system package print detail without-paging
=====================================
Getting missed routers: round 3.
failing-router-name: missed cmd(s): system package print detail without-paging
=====================================
Getting missed routers: round 4.
failing-router-name: missed cmd(s): system package print detail without-paging
ending: Tue Aug 13 00:02:30 EST 2013
starting: Tue Aug 13 16:48:37 EST 2013
Trying to get all of the configs.
All routers sucessfully completed.
ending: Tue Aug 13 16:48:46 EST 2013
Looks like there might be some type of logic bug that is present in more
than one place.

Paul
James Andrewartha
2013-08-14 04:20:11 UTC
Post by Paul Gear
I'm seeing something very similar with one Mikrotik device out of 17 on
our network. If run from the command line, it works fine. If run from
I had the same problem with a script I'd adapted; the solution was to
increase the timeout, as the device was quite slow to generate its config.
--
James Andrewartha
Network & Projects Engineer
Christ Church Grammar School
Claremont, Western Australia
Ph. (08) 9442 1757
Mob. 0424 160 877
Paul Gear
2013-08-14 06:47:07 UTC
Post by James Andrewartha
Post by Paul Gear
I'm seeing something very similar with one Mikrotik device out of 17 on
our network. If run from the command line, it works fine. If run from
I had the same problem with a script I'd adapted, the solution was to
increase the timeout, as the device was quite slow to generate its config.
I'll try that and report back results.

Paul
Michael W. Lucas
2013-08-14 12:14:03 UTC
Post by James Andrewartha
Post by Paul Gear
I'm seeing something very similar with one Mikrotik device out of 17 on
our network. If run from the command line, it works fine. If run from
I had the same problem with a script I'd adapted, the solution was to
increase the timeout, as the device was quite slow to generate its config.
Oooh, I have a blog post on exactly this, with Mikrotiks.

http://blather.michaelwlucas.com/archives/1336

I've also found that switching mtlogin to always do "export compact"
can help.

==ml
--
Michael W. Lucas - ***@michaelwlucas.com, Twitter @mwlauthor
http://www.MichaelWLucas.com/, http://blather.MichaelWLucas.com/
Absolute OpenBSD 2/e - http://www.nostarch.com/openbsd2e
coupon code "ILUVMICHAEL" gets you 30% off & helps me.
Paul Gear
2013-08-14 20:40:41 UTC
Post by Michael W. Lucas
Post by James Andrewartha
Post by Paul Gear
I'm seeing something very similar with one Mikrotik device out of 17 on
our network. If run from the command line, it works fine. If run from
I had the same problem with a script I'd adapted, the solution was to
increase the timeout, as the device was quite slow to generate its config.
Oooh, I have a blog post on exactly this, with Mikrotiks.
http://blather.michaelwlucas.com/archives/1336
I've also found that switching mtlogin to always do "export compact"
can help.
Hi Michael and James,

Changing to "export compact" was the first thing I did when configuring
RANCID for our Mikrotiks.

I tried doubling the timeout (although it doesn't look like the patch
that you mentioned in your blog post is included with 2.3.8) but that
didn't solve the problem. It also doesn't explain why the system works
from the command line but not from cron, nor does it explain why only
this one out of 17 Mikrotiks is doing it. We have others that are the
same model and further away behind slower links and they all work fine.

(Apologies also to Brent for distracting from his original thread topic
- hopefully they are very similar bugs and what we discuss here will be
applicable to your F5 problem!)

Regards,
Paul
Alan McKinnon
2013-08-14 05:53:49 UTC
If it truly is the case that the script always works when run from the
command line and always fails when run from cron, then the problem is
almost certainly one of two things:

1. Your cron is set to run when something else is happening on the
network, and this increased activity is too much. This is rare.
2. Shell environment issues. This is very common as cron tends to run
with an empty environment and rancid's login shell does not.

Print out your environment as the rancid user and add likely things as
envvars to the crontab one by one till you find one that fixes the problem.
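
One way to try this without waiting for the next cron run is to start the job from a shell with a stripped-down environment. What follows is only a sketch: the PATH and SHELL values are assumptions about a typical cron, and run_like_cron and missing_in_cron are made-up helper names, not part of RANCID.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of RANCID.

# Run a command with a stripped-down environment, roughly the way cron would.
# PATH and SHELL are assumed defaults; check your own cron's documentation.
run_like_cron() {
    env -i HOME="${HOME:-/}" LOGNAME="${LOGNAME:-$(id -un)}" \
        PATH=/usr/bin:/bin SHELL=/bin/sh "$@"
}

# Print the names of variables present in the current environment but absent
# from the cron-like one. These are the candidates to add to the crontab one
# at a time. Variables with multi-line values may produce stray names.
missing_in_cron() {
    cron_names=$(run_like_cron env | cut -d= -f1)
    env | cut -d= -f1 | sort -u | while read -r name; do
        printf '%s\n' "$cron_names" | grep -qxF -- "$name" ||
            printf '%s\n' "$name"
    done
}
```

Run as the rancid user, missing_in_cron lists what the login shell has that the cron-like environment lacks, and run_like_cron followed by the full path to rancid-run should reproduce the failure on the command line if the environment really is the cause.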
--
Alan McKinnon
***@gmail.com
Brent Wiese
2013-08-15 21:14:31 UTC
Post by Michael W. Lucas
Post by James Andrewartha
Post by Paul Gear
I'm seeing something very similar with one Mikrotik device out of 17
on our network. If run from the command line, it works fine. If
run from cron, one router in the group fails, giving the following
I had the same problem with a script I'd adapted, the solution was to
increase the timeout, as the device was quite slow to generate its
config.
Oooh, I have a blog post on exactly this, with Mikrotiks.
http://blather.michaelwlucas.com/archives/1336
<<snip>>

Read the article and timed a manual run: 3 seconds. So definitely not a timeout issue.
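
For anyone repeating that check, here is a rough sketch of timing one collection run from the shell. elapsed_seconds is a made-up helper name, and the usage note assumes f5rancid_v11 takes the device name as its argument, as the stock rancid scripts do.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of RANCID.

# Print how many whole seconds a command takes, discarding its output.
# Uses `date +%s`, which is widely available but not strictly POSIX.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo $((end - start))
}
```

Something like `elapsed_seconds ./f5rancid_v11 lb01.my.domain` gives a number to compare against whatever timeout the login script uses. A manual run only shows how the device behaves in an interactive session, so a short time here does not rule out different timing under cron.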

Also tried passing in the ENV variables through cron... no change. Well, let me rephrase. I was able to put in all ENV variables except:
LESSOPEN=|/usr/bin/lesspipe.sh %s

If that was in there, the job never ran. I'm guessing because of the |.

Any other thoughts/suggestions?
Alan McKinnon
2013-08-15 21:23:59 UTC
Post by Brent Wiese
Post by Michael W. Lucas
Post by James Andrewartha
Post by Paul Gear
I'm seeing something very similar with one Mikrotik device out of 17
on our network. If run from the command line, it works fine. If
run from cron, one router in the group fails, giving the following
I had the same problem with a script I'd adapted, the solution was to
increase the timeout, as the device was quite slow to generate its
config.
Oooh, I have a blog post on exactly this, with Mikrotiks.
http://blather.michaelwlucas.com/archives/1336
<<snip>>
Read the article and ran it manually against time... 3 seconds. So definitely not a timeout issue.
LESSOPEN=|/usr/bin/lesspipe.sh %s
If that was in there, the job never ran. I'm guessing because of the |.
Any other thoughts/suggestions?
A wild guess - is the type set correctly in router.db for that device?


This happened to me once but in reverse; I couldn't get a router to be
polled correctly on the cmd line no matter what I did, then I realized I
was running rancid -d against a Nexus....
--
Alan McKinnon
***@gmail.com
Brent Wiese
2013-08-15 21:30:21 UTC
Post by Alan McKinnon
Post by Brent Wiese
Post by Michael W. Lucas
Post by James Andrewartha
Post by Paul Gear
I'm seeing something very similar with one Mikrotik device out of 17
on our network. If run from the command line, it works fine. If
run from cron, one router in the group fails, giving the following
I had the same problem with a script I'd adapted, the solution was
to increase the timeout, as the device was quite slow to generate
its config.
Oooh, I have a blog post on exactly this, with Mikrotiks.
http://blather.michaelwlucas.com/archives/1336
<<snip>>
Read the article and ran it manually against time... 3 seconds. So
definitely not a timeout issue.
Also tried passing in the ENV variables through cron... no change.
LESSOPEN=|/usr/bin/lesspipe.sh %s
If that was in there, the job never ran. I'm guessing because of the |.
Any other thoughts/suggestions?
A wild guess - is the type set correctly in router.db for that device?
Yes - I've edited the logging lines on the script that type runs to confirm it's running the correct one.
heasley
2013-08-15 23:07:13 UTC
Post by Brent Wiese
This is boggling my mind.
I slightly modified the f5rancid script to work with their version 11 TMOS. Basically, all it does is "show running-config" and then runs "quit" instead of "exit" when it finds the prompt.
starting: Tue Aug 13 17:02:11 MST 2013
cvs add: lb01.my.domain already exists, with version number 1.4
That is odd; is the device being added to the routers.all file properly? I do
not expect it to try to add a file that's already there. That is not related
to the problem below, though.
Post by Brent Wiese
Added lb01.my.domain
Trying to get all of the configs.
lb01.my.domain: (v11) missed cmd(s): show running-config
lb01.my.domain: (v11) End of run not found
#
I don't get it - if I run f5rancid_v11 (my modified copy), it runs just fine and the ".new" file is there with all the info I need. It's only when it's run from cron that it throws any kind of error. F5login runs fine too (or f5rancid_v11 would fail). I added the "(v11)" in the log line area to confirm it's running the correct script when it does rancid-run.
Since it's logging errors, I don't think it's any kind of permissions error. But I am running as the same user as the cron job.
All my other configs are coming in and diff'ing just fine, so I don't think it's a CVS issue (or the ability to write temp files for example).
Any ideas?
Often this is either timing or termcap/termios related. Since the F5
appears to be a rather kinky animal, I'd first investigate the terminal
setting. That's coming from either cron or rancid.conf; look there, set
it on your tty, and try running f5rancid_v11 - I suspect it'll be angry.
Find something that doesn't make it angry and force/set it in f5rancid_v11.
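
A sketch of that experiment: run the collector by hand with TERM forced to whatever cron or rancid.conf would give it, or with TERM removed altogether. The helper names are made up, and the value "network" in the usage note is an assumption; use whatever your rancid.conf actually sets.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of RANCID.

# Run a command with TERM forced to the given value, without touching the
# calling shell's own TERM.
run_with_term() {
    term_value=$1
    shift
    env TERM="$term_value" "$@"
}

# Run a command with TERM absent from its environment, which is a common
# situation for jobs started by cron. The subshell keeps the change local.
run_without_term() {
    (
        unset TERM
        "$@"
    )
}
```

For example, `run_with_term network ./f5rancid_v11 lb01.my.domain` and `run_without_term ./f5rancid_v11 lb01.my.domain` can be compared against a plain interactive run. If one of them reproduces the "missed cmd(s)" error, the terminal setting is the likely culprit.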