Broadband-Hamnet™ Forum :: UBNT Firmware
 Subject :Intermitant Link Resolution -Firmware Patch.. 2014-08-18- 19:34:24 
AF5RS
Member
Joined: 2014-06-23- 22:21:23
Posts: 9
Location: Highland Village, TX EM13LC

I proceeded to patch the firmware per the changeset results from ticket 60 stated as closed. See link below:

http://ubnt.hsmm-mesh.org/products/BBHN/changeset/6036dc3ff67086662dd83b62bc2d63f1f587edaa/bbhn_ar71xx


The patch was applied and the link results were in fact better for a longer time, but with side effects. The router's response to pings, web service, telnet, etc. is sluggish. It appears there is a lot more going on under the hood than is obvious from the simple watchdog process addition in the OLSR config file. After several days of operation, the router seems to be in an almost constant restart loop. Pings die for 4-5 seconds, then resume, die, resume, die, etc. Now what's up? I was a bit disappointed that this ticket was closed without a commitment to a new firmware release, but maybe others have seen the side effects and that has delayed such a watchdog workaround. Can someone give me an idea of what's really going on here and when we might see some relief? Should I roll back the patch? Still trying....

Bob AF5RS



 Subject :Re:Intermitant Link Resolution -Firmware Patch.. 2014-08-18- 20:39:55 
KG6JEI
Member
Joined: 2013-12-02- 19:52:05
Posts: 516
Location

4-5 seconds sounds too quick for an OLSR restart. By my estimates it should take on the order of 30-90+ seconds on average to re-establish links and routing after a reboot.

If your OLSR is restarting you should see an incrementing OLSR restart count on your status screen. If that is the case, then yes, the watchdog is triggering and keeping the node online. A watchdog, however, is just a "patch"; it's not a fix for the underlying issue.
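The timing argument above can be checked against a ping log. This is only a rough sketch: the function names and the sample format are made up for illustration, and the 30-second threshold is just the lower end of the 30-90+ second estimate, not a value taken from the firmware.

```python
# Lower end of the 30-90+ second estimate for OLSR to re-establish routing.
# Assumption for illustration only, not a firmware constant.
OLSR_RESTART_MIN_SECONDS = 30


def outage_durations(samples):
    """Return the length in seconds of each outage in a ping log.

    samples: list of (timestamp_seconds, reply_received) in time order.
    An outage runs from the first lost ping to the next reply.
    An outage still in progress at the end of the log is ignored.
    """
    outages = []
    start = None
    for timestamp, ok in samples:
        if not ok and start is None:
            start = timestamp          # first lost ping of a new outage
        elif ok and start is not None:
            outages.append(timestamp - start)
            start = None               # link is back
    return outages


def looks_like_olsr_restart(duration):
    """True if an outage is long enough to be consistent with an OLSR restart."""
    return duration >= OLSR_RESTART_MIN_SECONDS
```

Outages of 4-5 seconds fall well under the threshold, which is why they point away from OLSR restarts and toward something else, though the restart counter on the status screen remains the more direct check.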

As for "closed without a release", this is normal for our dev cycle. When an issue has been "fixed" it is closed out to get it out of the developer queue, so that only "unfixed" issues show up (since I'm the only person working tickets, this is very important for keeping track of what is going on). "Fixed" simply means it has been committed to the DEVELOPMENT branch, is believed to be working, and is ready to go through further testing. If a bug exists in any implementation (or combined implementation), it shows up in beta (and yes, sometimes 2-3 patches interact in ways not seen independently).

It should also be noted that I committed the patch VERY close to leaving the country for work. It is very hard to build beta builds while I'm nowhere near my gear, and I have only just returned to it.

Eventually I hope to see the lab servers build untested "development" releases and publish builds automatically, but if I work on that, then other issues like OLSRD crashing (which is a much bigger issue) would have to go unworked.

If anyone wants to step up and devote the resources to debugging (20 node lab environment looks to be the sweet spot right now) help is always welcomed.  


To give an idea of what it currently takes to build an official release:

1) An internal, me-only build gets run. (This takes 15 minutes by itself each time, by the way.)

2) The build is tested in my lab (this can take a couple hours EASILY and the test list gets longer each time)

3) Any issues found in step 2 are resolved and steps 1-3 are repeated.

4) Once I have found all the issues I can, it gets released to the BETA test team to double-check me (step 1 gets run again, as a beta build this time). Any issues they find are resolved (this takes a lot of time; each person has to dedicate hours to testing), and if they find a bug we start at step 1 again to be sure we are good.

5) Only after it has been vetted does a public release get made (again, another 15 minutes). I should mention the above is just the Ubiquiti procedure; Linksys requires me to run the steps again for building images.

This obviously doesn't catch every issue. Sometimes we can't see issues until networks get bigger. (This issue, for example, seems to show up because 3-5 nodes at a site is now becoming the norm, when before 1 node was the norm.) We have increased traffic to a point where deep bugs are more likely to surface, and we find the flaws now so that growth can continue.

All this keeping in mind that I can work anywhere from 40-80 hours in a week depending upon how the week goes; that I'm the Repeater Technical Chair and an active board member for my local club; that I'm regularly responsible for planning and running net control for health and welfare traffic of endurance runners using Amateur Radio (I have a 50k and a 100k that have just begun planning in the last week); and that I'm active in promoting Amateur Radio to the public (street fairs, public service demonstrations, Scout merit badges, etc.).

Since February 2013 I probably have over 20 days (480 hours) of time into this (a guesstimate; I haven't tracked the hours, and it may very well be more). Fixing one major security bug in the last release took over 40 hours by itself in planning, implementing, and testing. The end result was just a few hundred lines of simple changes, but they came out of deep thought and testing to be sure it was done right.

Note: Most posts submitted from iPhone
 Subject :Re:Intermitant Link Resolution -Firmware Patch.. 2014-08-18- 21:14:23 
AF5RS
Member
Joined: 2014-06-23- 22:21:23
Posts: 9
Location: Highland Village, TX EM13LC

Thanks for the explanation of the update protocol. We use a similar methodology in our work, makes sense. I have seen some strange degradation to the point I can barely get into my Rocket now. Maybe I should re-flash. Looks broke now. See ping loop screenshot attachment below.


Bob AF5RS





 Subject :Re:Intermitant Link Resolution -Firmware Patch.. 2014-08-19- 05:17:47 
KG6JEI
Member
Joined: 2013-12-02- 19:52:05
Posts: 516
Location

Ok, that does not appear to be OLSR restart related as OLSR doesn't control the local LAN interface.

I would more suspect a bad cable; that is usually the cause.

I haven't seen the OLSR watchdog make the network interface unreachable, and it shouldn't. All it does is write to a file (in RAM) every 5 seconds to say "I'm still alive as of the current time", and then another script reads the file to be sure OLSR isn't dead. The only exception would be something crazy happening in the watchdog script or the watchdog module, causing excessive CPU usage that makes the node drop data (check with top on the node itself). But even that would be rare: an ICMP echo (ping) is handled low in the kernel, outside user space, so it's the least likely thing to be affected.
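For anyone curious, the heartbeat idea described above looks roughly like this. It is a minimal sketch of the general technique, not the actual script shipped in the firmware: the function names and the stale threshold of three missed writes are assumptions for illustration, and only the 5-second write interval comes from the description above.

```python
import time

HEARTBEAT_INTERVAL = 5                    # seconds between writes, as described above
STALE_AFTER = 3 * HEARTBEAT_INTERVAL      # assumed threshold, not taken from the firmware


def write_heartbeat(path, now=None):
    """Writer side: record "still alive as of now" in the heartbeat file."""
    now = time.time() if now is None else now
    with open(path, "w") as f:
        f.write(str(int(now)))


def is_alive(path, now=None, stale_after=STALE_AFTER):
    """Checker side: True if the last heartbeat is recent enough."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            last = int(f.read().strip())
    except (OSError, ValueError):
        return False                      # missing or unreadable file counts as dead
    return (now - last) <= stale_after
```

The writer would call write_heartbeat every HEARTBEAT_INTERVAL seconds, and a separate checker would call is_alive and restart the daemon when it returns False. Both sides do one tiny file operation per cycle, which is why a watchdog like this should cost almost nothing in CPU.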

Note: Most posts submitted from iPhone