[Spce-user] Help - many packet loss and packets to unknown port
Skyler
skchopperguy at gmail.com
Sun Jun 26 03:35:46 EDT 2016
Hi,
If the small rx buffers and the rx ring size are already maxed out for
vmxnet3, then I guess we've confirmed that rtpproxy and/or kamailio has hit
its real-world limit in this configuration.
Since the buffers live in the kernel and rtpproxy does too, and the audio
issues are still there, it's hard to say which app is maxed out, I think.
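One rough way to narrow it down (a sketch, assuming eth0 is the SIP-facing
interface; the exact counter names vary by driver and kernel) is to compare
NIC-level drops with per-socket drops:
  ethtool -S eth0 | grep -i drop    # driver/ring-level drops
  awk '$NF > 0' /proc/net/udp       # per-socket drops (last column)
If the first keeps climbing, the bottleneck is below the apps; if only the
second does, the app is not draining its socket fast enough.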
It might just be time for you to set up master-master replication and
load-balance across a couple of machines.
- Skyler
On Jun 26, 2016 12:59 AM, "Walter Klomp" <walter at myrepublic.com.sg> wrote:
> Hi,
>
> I checked esxtop and there are no dropped packets in the vswitch…
>
> 6:47:35am up 165 days 12:56, 528 worlds, 1 VMs, 4 vCPUs; CPU load average: 0.27, 0.26, 0.28
>
> PORT-ID   USED-BY               TEAM-PNIC  DNAME      PKTTX/s  MbTX/s   PSZTX   PKTRX/s  MbRX/s   PSZRX  %DRPTX  %DRPRX
> 33554433  Management            n/a        vSwitch0      0.00    0.00    0.00      0.00    0.00    0.00    0.00    0.00
> 33554434  vmnic0                -          vSwitch0     15.83    0.02  196.00     14.69    0.02  145.00    0.00    0.00
> 33554435  Shadow of vmnic0      n/a        vSwitch0      0.00    0.00    0.00      0.00    0.00    0.00    0.00    0.00
> 33554436  vmk0                  vmnic0     vSwitch0     15.83    0.02  196.00     13.92    0.02  149.00    0.00    0.00
> 50331649  Management            n/a        vSwitch1      0.00    0.00    0.00      0.00    0.00    0.00    0.00    0.00
> 50331650  vmnic1                -          vSwitch1  10590.36   18.98  234.00  10914.61   20.58  247.00    0.00    0.00
> 50331651  Shadow of vmnic1      n/a        vSwitch1      0.00    0.00    0.00      0.00    0.00    0.00    0.00    0.00
> 50331659  10189844:backup-sipw  vmnic1     vSwitch1  10590.36   18.98  234.00  10910.61   20.58  247.00    0.00    0.00
> 50331660  vmnic2                -          vSwitch1      0.00    0.00    0.00      4.58    0.00   62.00    0.00    0.00
> 50331661  Shadow of vmnic2      n/a        vSwitch1      0.00    0.00    0.00      0.00    0.00    0.00    0.00    0.00
>
>
> I also did an ethtool -G eth0 rx-jumbo 2048, but that didn't do anything
> either… It seems that kamailio can't cope with 21 Mbit of traffic? This
> started happening around 2 weeks ago, and the usage pattern has been the
> same for months before… I can't put my finger on what changed (an update
> may have triggered it, but I have no recollection of one). I checked the
> statistics in the sipwise admin GUI and the usage patterns are normal.
>
> I did switch from e1000 to vmxnet3, as e1000 was only using 1 core for
> receiving traffic, whereas the vmxnet3 adapter uses all cores (per cat
> /proc/interrupts).
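>
> As a rough sketch (assuming eth0 is the SIP-facing interface; vmxnet3
> normally caps the plain rx ring at 4096), the ring sizes and interrupt
> spread can be double-checked with:
>
> ethtool -g eth0             # compare "Pre-set maximums" with "Current hardware settings"
> ethtool -G eth0 rx 4096     # raise the normal rx ring if it is not at the maximum yet
> grep eth0 /proc/interrupts  # confirm the rx queues really land on all vCPUs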
>
> Any other ideas?
>
>
>
> On 26 Jun 2016, at 10:58 AM, Skyler <skchopperguy at gmail.com> wrote:
>
> Hi,
>
> OK, then it seems that if the guest OS buffers have all been increased and
> the problem still exists, it must be at the ESXi level.
>
> Maybe this KB article can point you in the right direction:
>
>
> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2039495
>
> - Skyler
>
> On Jun 25, 2016 8:29 PM, "Walter Klomp" <walter at myrepublic.com.sg> wrote:
>
>> Hi,
>>
>> I already changed that setting to 50000, but it doesn't make any
>> difference. Also, if the device drops the packets, then why do I see packet
>> drops in the app itself? It seems like lb can't handle the packets fast
>> enough, or the socket buffers in kamailio are not big enough? Note that I am
>> running this on 3-year-old hardware, and if the solution is throwing more
>> hardware at it, then so be it. But the CPU usage is not that high according
>> to top, although the load average does exceed 4 at times (which I understand
>> is the limit for 4 cores).
>>
>> If ESXi drops the packets, then I would probably not even see them in the
>> machine, or did I get that wrong?
>>
>> Yours sincerely,
>> Walter
>>
>> On 26 Jun 2016, at 2:10 AM, Skyler <skchopperguy at gmail.com> wrote:
>>
>> Hi,
>>
>> One other thing you can try is to check the net.core.netdev_max_backlog
>> value in your VM. It should be 1000 by default. Maybe change it to 2000,
>> reboot, and check the packet stats afterwards to compare with the previous
>> numbers.
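>>
>> A minimal sketch of that check and change (the value and the file name are
>> just examples):
>>
>> sysctl net.core.netdev_max_backlog                  # current value, 1000 by default
>> sysctl -w net.core.netdev_max_backlog=2000          # raise it for the running kernel
>> echo 'net.core.netdev_max_backlog = 2000' > /etc/sysctl.d/90-backlog.conf   # persist across reboots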
>>
>> This is tough to troubleshoot, as the UDP stats always reflect the whole
>> stack.
>>
>> Will be interesting to figure this out and identify the bottleneck.
>>
>> On Jun 25, 2016 11:34 AM, "Skyler" <skchopperguy at gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Odd that I don't see 'pointer' or 'drops' on my machine.
>> >
>> > Anyway, my thinking is that netdev_max_backlog could be the issue here:
>> > the input queue fills up and packets get dropped, because the interface
>> > receives them faster than the kernel can process them.
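>> >
>> > If that's what is happening, it should show up in /proc/net/softnet_stat:
>> > there is one line per CPU, and the second (hex) column counts packets
>> > dropped because the backlog was full. A quick look:
>> >
>> > cat /proc/net/softnet_stat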
>> >
>> > If I'm right, you'd need to move to a bare-metal scenario to confirm that
>> > this problem sits in ESXi and the VM layer.
>> >
>> >
>> >
>> > On Jun 25, 2016 10:41 AM, "Walter Klomp" <walter at myrepublic.com.sg>
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I have done the firewall (with some tweaks) - but I still see packet
>> loss…
>> >>
>> >> Looking at /proc/net/udp I see that lb is dropping packets - this is
>> the result after about 3 minutes of running
>> >>
>> >> root@sipwise:/etc# cat /proc/net/udp
>> >>   sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode    ref pointer          drops
>> >>  420: 0100007F:13C4 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 12138702   2 ffff8803313b70c0   537
>> >>  420: 07CB0767:13C4 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 12138701   2 ffff88032e498740   416
>> >>
>> >> As you can see already 537 drops…
>> >>
>> >> What setting do I change for this not to happen?
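>> >>
>> >> One way to watch that drop counter in near real time (5060 is 0x13C4 in
>> >> the hex local_address column; adjust for the ports lb actually binds):
>> >>
>> >> watch -n 5 "awk '/:13C4/ {print \$2, \$NF}' /proc/net/udp"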
>> >>
>> >> Incidentally, I have been playing with the number of children and the
>> >> shared memory, but it doesn't seem to make much of a difference. Here I am
>> >> running 12 UDP children and 256 MB of shared memory (note I have about
>> >> 30.000 devices connected).
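>> >>
>> >> Whether the 256 MB is actually a constraint should be visible in the
>> >> shared-memory counters, e.g. (assuming the ctl/kamcmd interface is
>> >> enabled, as it normally is on sipwise):
>> >>
>> >> kamcmd core.shmmem     # compare real_used / max_used against total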
>> >>
>> >> Thanks
>> >> Walter
>> >>
>> >>
>> >>> On 24 Jun 2016, at 5:46 AM, Skyler <skchopperguy at gmail.com> wrote:
>> >>>
>> >>> If it were my box, I'd have iptables accept only TCP/UDP 5060 and 5061,
>> >>> plus TCP 80, 443 and the admin/xmlrpc ports, and then drop all the rest
>> >>> (roughly like the sketch below the link).
>> >>>
>> >>> Could start here for good examples:
>> >>>
>> >>> https://www.kamailio.org/wiki/tutorials/security/kamailio-security
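>> >>>
>> >>> A bare-bones sketch of that policy (the port numbers and the RTP range
>> >>> are assumptions, so check them against your rtpengine config before
>> >>> dropping anything):
>> >>>
>> >>> iptables -A INPUT -i lo -j ACCEPT
>> >>> iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
>> >>> iptables -A INPUT -p udp -m multiport --dports 5060,5061 -j ACCEPT
>> >>> iptables -A INPUT -p tcp -m multiport --dports 5060,5061,80,443,1443 -j ACCEPT   # 1443 = assumed admin UI port
>> >>> iptables -A INPUT -p udp --dport 30000:40000 -j ACCEPT                           # assumed rtpengine media range
>> >>> iptables -P INPUT DROP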
>> >>>
>> >>> -- Skyler
>> >>>
>> >>> On Jun 23, 2016 3:27 PM, "Skyler" <skchopperguy at gmail.com> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> On Jun 23, 2016 3:18 AM, "Walter Klomp" <walter at myrepublic.com.sg>
>> wrote:
>> >>>> >
>> >>>> > Hi,
>> >>>> >
>> >>>> > MySQL is not locking up, other than due to the anti-fraud script,
>> >>>> > which runs every half an hour.
>> >>>> >
>> >>>>
>> >>>> Oh, you mentioned MySQL pinning the CPU, so I assumed we might have had
>> >>>> the same problem.
>> >>>>
>> >>>> > I think I can also rule out a DDoS, because it's a steady 300-350
>> >>>> > packets per second that go to an unknown port.
>> >>>> >
>> >>>>
>> >>>> Wow, so you figure one device is doing that? How do you know it's that
>> >>>> many pps if the port is unknown?
>> >>>>
>> >>>> If they are udp, I'd set kamailio lb to listen on that unknown port
>> and look in the logs to see what shows up.
>> >>>>
>> >>>> > What I have not figured out yet is how the heck I find out which
>> >>>> > packets are the actual culprits… Even doing a tcpdump on UDP packets
>> >>>> > only, excluding the hosts and ports I know are legit, still gives me a
>> >>>> > heck of a lot of traffic, probably the actual payload of ongoing voice
>> >>>> > calls (around 250 concurrent)…
>> >>>> >
>> >>>> > Now, the packets to an unknown port could also be some equipment
>> >>>> > sending garbage to keep the NAT port open (Grandstream ATAs like to do
>> >>>> > this), and that may not actually be a problem. But I still can't figure
>> >>>> > out what causes the RcvbufErrors, which happen periodically; when I
>> >>>> > listen to, for instance, the conference bridge music, it breaks up for
>> >>>> > a while…
>> >>>> >
>> >>>>
>> >>>> I've never heard of Grandstream devices sending that kind of pps before,
>> >>>> unless it's like 3000 of them all misconfigured and pointing at you. All
>> >>>> UAs do their NAT ping on the port configured on the device, so usually
>> >>>> 5060. I can't see the devices being the problem here; the pps is too high.
>> >>>>
>> >>>> > How do I find out, when the RcvbufErrors occur, which application is
>> >>>> > causing them?
>> >>>>
>> >>>> First find out where the packets are coming from and why. Then you'll
>> >>>> know whether they can be dropped or which app to look at.
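>> >>>>
>> >>>> Something along these lines can at least rank the sources (the interface
>> >>>> name and the excluded port range are guesses, so tune them to your setup):
>> >>>>
>> >>>> tcpdump -n -i eth0 -c 2000 'udp and not port 5060 and not portrange 30000-40000' \
>> >>>>   | awk '{print $3}' | sort | uniq -c | sort -rn | head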
>> >>>>
>> >>>> > Thanks for any suggestions.
>> >>>> > Walter
>> >>>> >
>> >>>> >> On 23 Jun 2016, at 4:25 PM, Skyler <skchopperguy at gmail.com>
>> wrote:
>> >>>> >>
>> >>>> >> Dang these thumbs..now to the list.
>> >>>> >>
>> >>>> >> On Jun 23, 2016 2:06 AM, "Skyler" <skchopperguy at gmail.com> wrote:
>> >>>> >>>
>> >>>> >>> Sorry, in the list now.
>> >>>> >>>
>> >>>> >>> I had a similar issue last month, basically MySQL locking up the box.
>> >>>> >>> I think there's an updated tool out there for the hackers. Kamailio is
>> >>>> >>> tough... but MySQL can be broken.
>> >>>> >>>
>> >>>> >>> It was resolved by exiting/dropping on the common hacker User-Agents
>> >>>> >>> (retrieved from the logs) and on their IPs. Eventually they gave up
>> >>>> >>> and moved along.
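>> >>>> >>>
>> >>>> >>> The same idea can also be pushed down into netfilter so the scanners
>> >>>> >>> never reach kamailio at all (a sketch; the User-Agent strings are just
>> >>>> >>> the usual suspects, not ones taken from these logs):
>> >>>> >>>
>> >>>> >>> iptables -A INPUT -p udp --dport 5060 -m string --string "friendly-scanner" --algo bm -j DROP
>> >>>> >>> iptables -A INPUT -p udp --dport 5060 -m string --string "sipcli" --algo bm -j DROP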
>> >>>> >>>
>> >>>> >>> Ddos type attack.
>> >>>> >>>
>> >>>> >>> -Skyler
>> >>>> >>>
>> >>>> >>> On Jun 23, 2016 1:59 AM, "Skyler" <skchopperguy at gmail.com>
>> wrote:
>> >>>> >>>>
>> >>>> >>>> Looks like a flood to me. Your stats only cover 2 days here; are you
>> >>>> >>>> seeing anything in the lb or proxy log when tailing?
>> >>>> >>>>
>> >>>> >>>> - Skyler
>> >>>> >>>>
>> >>>> >>>> On Jun 22, 2016 9:01 PM, "Walter Klomp" <
>> walter at myrepublic.com.sg> wrote:
>> >>>> >>>>>
>> >>>> >>>>> Hi,
>> >>>> >>>>>
>> >>>> >>>>> Running SPCE 3.8.5 on a dedicated ESXi host (Dell R320 with Xeon
>> >>>> >>>>> E2460 & 16 GB RAM) with ~30.000 registered (and online) subscribers.
>> >>>> >>>>>
>> >>>> >>>>> Last week we were seeing horrible statistics and packet loss galore…
>> >>>> >>>>> After tweaking the network settings with the values below, I have
>> >>>> >>>>> managed to minimize the packet loss, but some still remains.
>> >>>> >>>>>
>> >>>> >>>>> sysctl -w net.core.rmem_max=33554432
>> >>>> >>>>> sysctl -w net.core.wmem_max=33554432
>> >>>> >>>>> sysctl -w net.core.rmem_default=65536
>> >>>> >>>>> sysctl -w net.core.wmem_default=65536
>> >>>> >>>>> sysctl -w net.ipv4.tcp_mem='8388608 8388608 8388608'
>> >>>> >>>>> sysctl -w net.ipv4.udp_mem='4096 174760 33554432'
>> >>>> >>>>> sysctl -w net.ipv4.tcp_rmem='4096 87380 8388608'
>> >>>> >>>>> sysctl -w net.ipv4.tcp_wmem='4096 65536 8388608'
>> >>>> >>>>> sysctl -w net.ipv4.route.flush=1
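>> >>>> >>>>>
>> >>>> >>>>> These were applied live with sysctl -w; to keep them across reboots
>> >>>> >>>>> the same keys also need to go into /etc/sysctl.conf or a file under
>> >>>> >>>>> /etc/sysctl.d/, e.g. (the file name is just an example):
>> >>>> >>>>>
>> >>>> >>>>> echo 'net.core.rmem_max = 33554432' >> /etc/sysctl.d/90-voip-buffers.conf
>> >>>> >>>>> sysctl --system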
>> >>>> >>>>>
>> >>>> >>>>> I am currently still seeing around 300 packets per second going to
>> >>>> >>>>> unknown ports. Below are the statistics: about 1/5th of all the
>> >>>> >>>>> packets received are not being processed… That seems like a lot to me.
>> >>>> >>>>>
>> >>>> >>>>> 10:43:40 up 2 days, 5:11, 3 users, load average: 1.52,
>> 2.05, 2.17
>> >>>> >>>>>
>> >>>> >>>>> Every 1.0s: netstat -anus|grep -A 7 Udp:          Thu Jun 23 10:40:45 2016
>> >>>> >>>>>
>> >>>> >>>>> Udp:
>> >>>> >>>>> 310870895 packets received
>> >>>> >>>>> 61212884 packets to unknown port received.
>> >>>> >>>>> 103338 packet receive errors
>> >>>> >>>>> 312245302 packets sent
>> >>>> >>>>> RcvbufErrors: 103249
>> >>>> >>>>> SndbufErrors: 765
>> >>>> >>>>> InCsumErrors: 75
>> >>>> >>>>>
>> >>>> >>>>>
>> >>>> >>>>>
>> >>>> >>>>> I had to do a lot of buffer tweaking to get the RcvbufErrors (and
>> >>>> >>>>> even the SndbufErrors) down, as every time it happens (in bursts,
>> >>>> >>>>> sporadically every 10 minutes, but definitely every half hour) one
>> >>>> >>>>> would get silence and the packet receive errors would shoot up by
>> >>>> >>>>> between 200 and 800 packets.
>> >>>> >>>>>
>> >>>> >>>>> The load average can shoot up to 4.x at times. Knowing that
>> Sipwise Pro is on the same hardware, and they support up to 50.000 users,
>> what am I missing?
>> >>>> >>>>>
>> >>>> >>>>> rtpengine is running in kernel mode. The major contributor to CPU
>> >>>> >>>>> usage is actually MySQL, regularly maxing out at 100%, especially
>> >>>> >>>>> when it's doing the fraud check. Below is a snapshot of top…
>> >>>> >>>>>
>> >>>> >>>>> top - 10:56:53 up 2 days, 5:24, 3 users, load average: 2.39, 2.14, 1.94
>> >>>> >>>>> Tasks: 184 total, 1 running, 183 sleeping, 0 stopped, 0 zombie
>> >>>> >>>>> %Cpu(s): 25.3 us, 7.0 sy, 0.0 ni, 63.7 id, 1.0 wa, 0.0 hi, 2.9 si, 0.0 st
>> >>>> >>>>> KiB Mem:  12334464 total, 12157676 used,   176788 free,   144944 buffers
>> >>>> >>>>> KiB Swap:  2096124 total,        0 used,  2096124 free,  4430336 cached
>> >>>> >>>>>
>> >>>> >>>>>   PID USER      PR  NI  VIRT  RES   SHR  S %CPU %MEM    TIME+ COMMAND
>> >>>> >>>>>  4063 mysql     20   0 6127m 5.6g  7084 S 54.7 47.7 809:35.18 mysqld
>> >>>> >>>>>  2576 root      20   0  253m 7176  1816 S  9.9  0.1 164:02.97 rsyslogd
>> >>>> >>>>>  5058 root      20   0 67176  11m  5308 S  6.0  0.1   7:05.16 rate-o-mat
>> >>>> >>>>> 15432 root      20   0  276m  12m  3696 S  5.0  0.1 117:56.92 rtpengine
>> >>>> >>>>>  5257 sems      20   0  873m  37m  7624 S  4.0  0.3 139:44.03 ngcp-sems
>> >>>> >>>>> 30996 kamailio  20   0  539m 100m   53m S  4.0  0.8   6:02.68 kamailio
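>> >>>> >>>>>
>> >>>> >>>>> To see what mysqld is actually doing while it pins a core (for
>> >>>> >>>>> example while the fraud check runs), a snapshot of the running
>> >>>> >>>>> statements can help, e.g. (with whatever credentials apply):
>> >>>> >>>>>
>> >>>> >>>>> mysql -e 'SHOW FULL PROCESSLIST'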
>> >>>> >>>>>
>> >>>> >>>>> Does anybody have any pointers I can try to completely eliminate
>> >>>> >>>>> the packet loss, and where do these unknown-port packets actually go?
>> >>>> >>>>>
>> >>>> >>>>> Thanks
>> >>>> >>>>> Walter.
>> >>>> >>>>>
>> >>>> >>>>>
>> >>>> >>>>>
>> >>>> >>>>>
>> >>>> >>>>> _______________________________________________
>> >>>> >>>>> Spce-user mailing list
>> >>>> >>>>> Spce-user at lists.sipwise.com
>> >>>> >>>>> https://lists.sipwise.com/listinfo/spce-user
>> >>>> >>>>>
>> >>>> >
>> >>
>> >>
>>
>>
>