Sonar Pollers fail until rebooted


#1

We are running 4 pollers, 3 virtualized and one on a bare metal server. All of them are up to date on v1.1.7 and run properly most of the time.
Occasionally (roughly every 10 days) they will each start failing pretty consistently. A reboot always fixes it.

When I log in to a failing poller and run getWork.php, I get:

Skipping ICMP polling, as it is still pending.
Skipping SNMP polling, as it is still pending.
Enqueued device mapping job and got token 771e7......

which seems to be what I’m supposed to get when it’s working.
I can run resetPolling.php and then getWork.php to get a new instruction set, but it continues to fail in Sonar with:

Pollers
a minute ago
Poller "Poller2-VM AP7-Tik3 (4.5-7-8-9.14-13-14-15-16-32-36-39-60)" has not returned any network 
monitoring data to Sonar since Dec 11, 2018 06:08:07. Please check it for errors, or disable it to stop 
receiving this alert.

Pollers
a minute ago
Poller "Poller1-VM AP7-TIK4 (2-3-4-18-19-20-29-30-34-35)" has not returned any network monitoring data 
to Sonar since Dec 11, 2018 06:51:09. Please check it for errors, or disable it to stop receiving this alert.

Is there anything I can script to make it a little more reliable, aside from a daily/weekly reboot?
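
Something along these lines is what I was thinking of cobbling together and running from cron, say every 30 minutes. The install path and the "it looks stuck" heuristic are guesses on my part rather than anything from the Sonar docs:

#!/usr/bin/env python3
# Rough watchdog sketch for a wedged poller. Assumptions (mine, not from the
# Sonar docs): the poller lives in /opt/poller, getWork.php prints the
# "still pending" lines shown above, and on a healthy poller the pending state
# clears between cron runs, so many consecutive "pending" results mean the
# cycle is stuck. Recovery = restart redis-server + run resetPolling.php.

import subprocess
import sys
import time

POLLER_DIR = "/opt/poller"      # assumption: wherever your poller is installed
STUCK_MARKER = "still pending"  # text from the getWork.php output above
CHECKS = 10                     # consecutive "pending" results before we call it stuck
CHECK_INTERVAL = 60             # seconds between checks


def get_work_output():
    """Run getWork.php and return its combined stdout/stderr."""
    result = subprocess.run(
        ["php", "getWork.php"],
        cwd=POLLER_DIR,
        capture_output=True,
        text=True,
    )
    return result.stdout + result.stderr


def poller_is_stuck():
    """Stuck = every check in the window still reports pending work."""
    for _ in range(CHECKS):
        if STUCK_MARKER not in get_work_output():
            return False
        time.sleep(CHECK_INTERVAL)
    return True


def recover():
    """Try to unstick it without a reboot: restart Redis, then reset polling state."""
    subprocess.run(["systemctl", "restart", "redis-server"], check=False)
    subprocess.run(["php", "resetPolling.php"], cwd=POLLER_DIR, check=False)


if __name__ == "__main__":
    if poller_is_stuck():
        recover()
        print("Poller looked stuck; restarted redis-server and reset polling.")
        sys.exit(1)
    print("Poller looks healthy.")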

Also, a side note: the poller is the only thing running on the virtualized Ubuntu 16.04 VMs.

– EDIT Output of redis server status –
/etc/init.d/redis-server status

● redis-server.service - Advanced key-value store
   Loaded: loaded (/lib/systemd/system/redis-server.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2018-12-11 06:51:08 MST; 6h ago
   Docs: http://redis.io/documentation,
   man:redis-server(1)
 Main PID: 30230 (redis-server)
     Tasks: 3
     Memory: 7.8M
     CPU: 19.448s
  CGroup: /system.slice/redis-server.service
           └─30230 /usr/bin/redis-server 127.0.0.1:6379

Dec 11 06:51:08 poller1 systemd[1]: Starting Advanced key-value store...
Dec 11 06:51:08 poller1 run-parts[30220]: run-parts: executing /etc/redis/redis-server.pre-up.d/00_example
Dec 11 06:51:08 poller1 run-parts[30231]: run-parts: executing /etc/redis/redis-server.post-up.d/00_example
Dec 11 06:51:08 poller1 systemd[1]: Started Advanced key-value store.

– ^ – It looks like the Redis server restarted at the same timestamp as the last update in Sonar, but it says it’s still running.
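
To help line these up, I dropped this tiny check on the poller (it assumes redis-server is managed by systemd, as in the status output above) so I can see exactly when Redis last came back up:

#!/usr/bin/env python3
# Tiny check to correlate Redis restarts with the "no data since ..." alerts.
# Assumes a systemd-managed redis-server, as shown in the status output above.

import subprocess

# ActiveEnterTimestamp = the last time the unit entered the "active" state,
# i.e. the last time redis-server (re)started.
result = subprocess.run(
    ["systemctl", "show", "redis-server", "--property=ActiveEnterTimestamp"],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # e.g. ActiveEnterTimestamp=Tue 2018-12-11 06:51:08 MST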


#2

We just had the same behavior occur on our poller this morning. Same error for ICMP/SNMP.

I will attempt a reboot and see if this resolves it.

Mine is a bare metal install, configured step by step per the Sonar instructions. I just set it up a week ago.

Seems odd that a reboot is required. I will see if restarting just certain services resolves the issue.

Is there an error log I can look at that would be of help to Sonar?


#3

Update!

Before I decide to reboot my machine, I figured I would check for updates.

Turned out I was running 1.1.4, so I upgraded it to 1.1.7.

After the upgrade completed (I still haven’t rebooted the server), the poller started working again. I am going to keep an eye on it and update if there is any new info.


#4

I hope you have better luck than us.

Another Sonar support win. (Sure hope v2 is better, and soon. We have a growing list of gripes, and the forum has become quite pointless; don’t get me started on the phantom “polling equipment down” email notifications with no logs or useful data.)


#5

Unfortunately since I last posted I have been having constant issues with our poller.

It seems to continually lock up. It will send one report and then get hung up.

Not sure if I’m having hardware issues or if it is software. It’s on a new Dell server.


#6

So I put the poller in debug mode, and one thing I am noticing is that the “Mapping Cycle” is just constantly starting and finishing.

In the .env file I have set it to 60 minutes, but it still just runs constantly.

I am wondering if that is causing my lock ups.
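
To put a number on it, I ran a quick script like this over the debug output. The log path and the timestamp format are assumptions on my part, since I just redirected the debug output to a file:

#!/usr/bin/env python3
# Quick-and-dirty check of how often the mapping cycle actually starts.
# Assumptions (mine): the poller's debug output was redirected to a file, each
# cycle prints a line containing "Mapping cycle", and each line starts with a
# "YYYY-MM-DD HH:MM:SS" timestamp. Compare the gaps against the 60 minutes
# set in .env.

from datetime import datetime

DEBUG_LOG = "/var/log/poller-debug.log"  # assumption: wherever the debug output went
MARKER = "Mapping cycle"

previous = None
with open(DEBUG_LOG) as log:
    for line in log:
        if MARKER not in line:
            continue
        stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        if previous is not None:
            gap = (stamp - previous).total_seconds() / 60
            print(f"{stamp}  ({gap:.1f} min since previous mapping cycle)")
        previous = stamp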


#7

In an earlier post (How many pollers), @leigh_Shuck had suggested an ideal configuration for the timings in the .env file; it might be of some help to you.


#8

If you PM me the ticket number for this issue, I’ll make sure it gets escalated for you.


#9

This also got us yesterday for some reason.


#10

I suspected this was affecting more than just us. I’m sure many people simply reboot the server or schedule a reboot weekly to mitigate the issue.

My issue (as @simon has alluded to above) is that this is one of many annoyances. The forum has unfortunately become a ghost town. If they were to publicly help people with the issues they face, it would cost less in support over the long term than having a growing number of companies hit the same problems and attack them one by one in an archaic ticket system. Sonar has a customer base of technologically savvy ISPs, many of which have probably already solved many of the problems others are facing…

It would serve them well to be fostering growth of this community.


#11

What else would you like to see done to foster the growth of the community?


#12

I don’t like the poller architecture. IMO it is bad. We have set our Sonar monitoring templates to be minimal and continue using Zabbix for our primary monitoring.

As for community, I would like to see a Sonar users FB page. I was thinking of creating it. Wisp Talk and such on FB get a lot of traffic.


#13

There is an unofficial one already that a Sonar user runs.


#14

What would you prefer?


#15

Something with a more scalable backend, like Python or Node.js (granted, I’m sure it just uses PHP to fetch and queue work in the Redis server, and the limitation is probably fping or something else). But really, something that can handle more than 300 devices per poller without showing fake latency and ‘not-really-there’ packet loss. Maybe something that would allow us to run anything alongside it, to use the extra dedicated resources it doesn’t seem capable of using.

As a monitoring system, Sonar has become almost useless for us. It works great as an SNMP grapher, but the ping and packet loss data is consistently unreliable, and equipment-down notifications are a running joke to us now.


#16

Something that runs on an ongoing basis, like a Zabbix proxy, and doesn’t key off a cron job that can fire at most once per minute.
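
Roughly this shape, purely as an illustration (poll_devices() here is a stand-in, not any real Sonar code):

#!/usr/bin/env python3
# Illustrative sketch of the long-running approach I mean: a persistent worker
# loop (like a Zabbix proxy) instead of a cron entry that can only fire once
# per minute. poll_devices() is a placeholder, not anything from Sonar.

import time

POLL_INTERVAL = 10  # seconds; a daemon isn't limited to cron's one-minute floor


def poll_devices():
    """Placeholder for the actual ICMP/SNMP sweep."""
    pass


def main():
    # Cron re-spawns the poller at most once a minute and it rebuilds its state
    # every run; a daemon keeps its state and controls its own schedule.
    while True:
        started = time.monotonic()
        poll_devices()
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, POLL_INTERVAL - elapsed))


if __name__ == "__main__":
    main()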


#17

Why?


#18

Please open a ticket on this - we’ve got pollers monitoring 2000+ devices without issue. I’m sure we can help you get past 300.


#19

PHP 7 is significantly faster than Python. Switching the language is not going to change the fundamentals of polling. The only thing that would make a significant difference at this point is rolling our own ping/SNMP tool, or possibly rewriting in a language like Go.


#20

Interesting thread. Our poller died 2 days ago for no reason. Running on a plain Ubuntu box with nothing else on it. Have a tech headed out to reboot it. Super annoying…