[Spce-user] Help - many packet loss and packets to unknown port

Skyler skchopperguy at gmail.com
Sun Jun 26 03:35:46 EDT 2016


Hi,

If the Small Rx Buffers and the Rx Ring size are already maxed out for vmxnet3, then I guess
we've confirmed that rtpproxy and/or Kamailio have hit their real-world limit
in this configuration.
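
For what it's worth, a quick way to double-check that from inside the guest
(assuming eth0 is the vmxnet3 interface):

  ethtool -g eth0                   # current vs. maximum ring sizes (the KB below says rx max is 4096)
  ethtool -G eth0 rx 4096           # raise the rx ring to the maximum if it isn't there yet
  ethtool -S eth0 | grep -i drop    # per-queue drop counters reported by the driver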

Since the buffers live in the kernel and rtpproxy does too, and the audio issues
still exist, it's hard to say which one is maxed out, I think.

It might just be time for you to set up master-master and load-balance across a
couple of machines.

- Skyler

On Jun 26, 2016 12:59 AM, "Walter Klomp" <walter at myrepublic.com.sg> wrote:

> Hi,
>
> I checked esxtop and there are no dropped packets in the vswitch…
>
>  6:47:35am up 165 days 12:56, 528 worlds, 1 VMs, 4 vCPUs; CPU load average: 0.27, 0.26, 0.28
>
>    PORT-ID              USED-BY  TEAM-PNIC    DNAME    PKTTX/s  MbTX/s   PSZTX   PKTRX/s  MbRX/s   PSZRX  %DRPTX  %DRPRX
>   33554433           Management        n/a vSwitch0       0.00    0.00    0.00      0.00    0.00    0.00    0.00    0.00
>   33554434               vmnic0          - vSwitch0      15.83    0.02  196.00     14.69    0.02  145.00    0.00    0.00
>   33554435     Shadow of vmnic0        n/a vSwitch0       0.00    0.00    0.00      0.00    0.00    0.00    0.00    0.00
>   33554436                 vmk0     vmnic0 vSwitch0      15.83    0.02  196.00     13.92    0.02  149.00    0.00    0.00
>   50331649           Management        n/a vSwitch1       0.00    0.00    0.00      0.00    0.00    0.00    0.00    0.00
>   50331650               vmnic1          - vSwitch1   10590.36   18.98  234.00  10914.61   20.58  247.00    0.00    0.00
>   50331651     Shadow of vmnic1        n/a vSwitch1       0.00    0.00    0.00      0.00    0.00    0.00    0.00    0.00
>   50331659 10189844:backup-sipw     vmnic1 vSwitch1   10590.36   18.98  234.00  10910.61   20.58  247.00    0.00    0.00
>   50331660               vmnic2          - vSwitch1       0.00    0.00    0.00      4.58    0.00   62.00    0.00    0.00
>   50331661     Shadow of vmnic2        n/a vSwitch1       0.00    0.00    0.00      0.00    0.00    0.00    0.00    0.00
>
> I also did an ethtool -G eth0 rx-jumbo 2048, but that didn’t do anything
> either…  It seems that Kamailio can’t cope with 21 Mbit of traffic?  This
> started happening around 2 weeks ago, and the usage pattern has been the
> same for months before that… I can’t put my finger on what changed (maybe an
> update triggered it, but I have no recollection of one).  I checked the
> statistics in the Sipwise admin GUI and the usage patterns are normal.
>
> I did switch from e1000 to vmxnet3, as e1000 was only using 1 core for
> receiving traffic, whereas the vmxnet3 adapter uses all cores (see cat
> /proc/interrupts).
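>
> (Roughly, assuming the interface is still named eth0, that can be confirmed with:)
>
>   grep eth0 /proc/interrupts     # per-queue vmxnet3 interrupts, one column per vCPU
>   grep NET_RX /proc/softirqs     # receive softirq work per core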
>
> Any other ideas?
>
>
>
> On 26 Jun 2016, at 10:58 AM, Skyler <skchopperguy at gmail.com> wrote:
>
> Hi,
>
> OK, then it seems that if the guest OS buffers have all been increased and the
> problem still exists, it must be at the ESXi level.
>
> Maybe this KB article can point you in the right direction:
>
>
> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2039495
>
> - Skyler
>
> On Jun 25, 2016 8:29 PM, "Walter Klomp" <walter at myrepublic.com.sg> wrote:
>
>> Hi,
>>
>> I already changed that setting to 50000, but it doesn't make any
>> difference. Also, if the device drops the packets, then why do I see packet
>> drops in the app itself?  It seems like lb can't handle the packets fast
>> enough, or the socket buffers in Kamailio are not big enough?  Note I am
>> running this on 3-year-old hardware, and if the solution is throwing more
>> hardware at it, then so be it. The CPU usage is not that high according
>> to top, but the load average does exceed 4 at times (which I understand is
>> the limit for 4 cores).
>>
>> If ESXi dropped the packets, I would probably not even see them inside the
>> VM, or did I get that wrong?
>>
>> Yours sincerely,
>> Walter
>>
>> On 26 Jun 2016, at 2:10 AM, Skyler <skchopperguy at gmail.com> wrote:
>>
>> Hi,
>>
>> One other thing you can try is checking the net.core.netdev_max_backlog
>> value in your VM. It should be 1000 by default. Maybe change it to 2000,
>> reboot, and check the packet stats afterwards to compare against the previous run.
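>>
>> (Something along these lines, for example:)
>>
>>   sysctl net.core.netdev_max_backlog           # current value
>>   sysctl -w net.core.netdev_max_backlog=2000   # try a larger input queue
>>   cat /proc/net/softnet_stat                   # 2nd column (hex) counts packets dropped
>>                                                # because the backlog was full, one row per CPU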
>>
>> This is tough to troubleshoot, as the UDP stats always show the whole
>> stack.
>>
>> Will be interesting to figure this out and identify the bottleneck.
>>
>> On Jun 25, 2016 11:34 AM, "Skyler" <skchopperguy at gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Odd that I don't see 'pointer' or 'drops' on my machine.
>> >
>> > Anyway, my thinking is that netdev_max_backlog could be the issue here:
>> the input queue fills up, and packets get dropped because the interface
>> receives them faster than the kernel can process them.
>> >
>> > If I'm right, you would need to move to a bare-metal scenario to confirm
>> whether this problem is caused by ESXi and the VM layer.
>> >
>> >
>> >
>> > On Jun 25, 2016 10:41 AM, "Walter Klomp" <walter at myrepublic.com.sg>
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I have set up the firewall (with some tweaks), but I still see packet
>> loss…
>> >>
>> >> Looking at /proc/net/udp I see that lb is dropping packets - this is
>> the result after about 3 minutes of running
>> >>
>> >> root at sipwise:/etc# cat /proc/net/udp
>> >>   sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
>> >>  420: 0100007F:13C4 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 12138702 2 ffff8803313b70c0 537
>> >>  420: 07CB0767:13C4 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 12138701 2 ffff88032e498740 416
>> >>
>> >> As you can see, there are already 537 drops…
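>> >>
>> >> (For example, something like this lets you watch just those counters; 13C4 is
>> >> port 5060 in hex, and the last column is the per-socket drop count:)
>> >>
>> >>   awk '/:13C4/ {print $2, $NF}' /proc/net/udp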
>> >>
>> >> What setting do I change so this doesn't happen?
>> >>
>> >> Incidentally, I have been playing with the number of children and the
>> shared memory, but it doesn't seem to make much of a difference. Here I am
>> running 12 UDP children and 256 MB of shared memory (note I have about 30,000
>> devices connected).
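>> >>
>> >> (For reference, the equivalent knobs in a plain Kamailio setup look roughly like
>> >> this; in SPCE they are normally driven from the ngcp config templates, so treat
>> >> the values as illustrative:)
>> >>
>> >>   children=12           # UDP worker processes per listen socket
>> >>   maxbuffer=16777216    # cap for the UDP receive buffer auto-probing, in bytes
>> >>   # shared memory is given on the command line, e.g.:  kamailio -m 256 -M 8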
>> >>
>> >> Thanks
>> >> Walter
>> >>
>> >>
>> >>> On 24 Jun 2016, at 5:46 AM, Skyler <skchopperguy at gmail.com> wrote:
>> >>>
>> >>> If it were my box, I'd have iptables accept only TCP/UDP 5060 and 5061,
>> TCP 80 and 443, and the admin and XMLRPC ports, then drop all the rest.
>> >>>
>> >>> Could start here for good examples:
>> >>>
>> >>> https://www.kamailio.org/wiki/tutorials/security/kamailio-security
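>> >>>
>> >>> (A rough sketch of that kind of ruleset; the RTP range and the 1443 admin port
>> >>> below are assumptions, so adjust them to the actual install:)
>> >>>
>> >>>   iptables -A INPUT -i lo -j ACCEPT
>> >>>   iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
>> >>>   iptables -A INPUT -p udp --dport 5060 -j ACCEPT
>> >>>   iptables -A INPUT -p tcp -m multiport --dports 5060,5061,80,443,1443 -j ACCEPT
>> >>>   iptables -A INPUT -p udp --dport 30000:40000 -j ACCEPT   # assumed rtpengine range
>> >>>   iptables -A INPUT -p icmp -j ACCEPT
>> >>>   iptables -A INPUT -j DROP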
>> >>>
>> >>> -- Skyler
>> >>>
>> >>> On Jun 23, 2016 3:27 PM, "Skyler" <skchopperguy at gmail.com> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> On Jun 23, 2016 3:18 AM, "Walter Klomp" <walter at myrepublic.com.sg>
>> wrote:
>> >>>> >
>> >>>> > Hi,
>> >>>> >
>> >>>> > MySQL is not locking up, other than due to the anti-fraud script, which
>> runs every half an hour.
>> >>>> >
>> >>>>
>> >>>> Oh, you mentioned MySQL pinning the CPU, so I assumed we might have had the
>> same problem.
>> >>>>
>> >>>> > I think I can also rule out a DDoS, because it's a steady 300-350
>> packets per second that go to an unknown port.
>> >>>> >
>> >>>>
>> >>>> Wow, so you figure one device is doing that? How do you know it's
>> that many pps if the port is unknown?
>> >>>>
>> >>>> If they are UDP, I'd set the Kamailio lb to listen on that unknown port
>> and look in the logs to see what shows up.
>> >>>>
>> >>>> > What I have not figured out yet is how the heck I find out which
>> packets are the actual culprits…  Even doing a tcpdump on UDP packets only
>> and excluding the hosts I know are legit and the ports I know are legit,
>> still gives me a heck of a lot of traffic, probably actual payload traffic
>> of ongoing voice calls (around 250 concurrent)…
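>> >>>> >
>> >>>> > (One way to narrow it down might be to count the top source addresses outside
>> >>>> > the expected ports; the interface name and RTP range here are assumptions:)
>> >>>> >
>> >>>> >   tcpdump -ni eth0 -c 5000 'udp and not port 5060 and not portrange 30000-40000' \
>> >>>> >       | awk '{print $3}' | sort | uniq -c | sort -rn | head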
>> >>>> >
>> >>>> > Now, the packets to an unknown port could also be some equipment
>> sending garbage (Grandstream ATAs like to do this) to keep the NAT
>> port open, and it may not actually be a problem. But I still can't figure
>> out what causes the RcvbufErrors, which happen periodically; when I listen
>> to, for instance, the conference bridge music, it breaks up for a
>> while…
>> >>>> >
>> >>>>
>> >>>> I've never heard of Grandstream devices sending that kind of pps
>> before, unless it's something like 3000 of them all misconfigured and pointing at
>> you. All UAs do their NAT ping on the port configured on the device, usually
>> 5060. I can't see devices being the problem here; the pps is too high.
>> >>>>
>> >>>> > When the RcvbufErrors occur, how do I find out which application is
>> causing them?
>> >>>>
>> >>>> First find out where the packets are coming from and why. Then
>> you'll know whether they can be dropped, or which app to look at.
>> >>>>
>> >>>> > Thanks for any suggestions.
>> >>>> > Walter
>> >>>> >
>> >>>> >> On 23 Jun 2016, at 4:25 PM, Skyler <skchopperguy at gmail.com>
>> wrote:
>> >>>> >>
>> >>>> >> Dang these thumbs..now to the list.
>> >>>> >>
>> >>>> >> On Jun 23, 2016 2:06 AM, "Skyler" <skchopperguy at gmail.com> wrote:
>> >>>> >>>
>> >>>> >>> Sorry, in the list now.
>> >>>> >>>
>> >>>> >>> I had a similar issue last month. Basically MySQL locking up the
>> box. I think there's an update for hackers out there. Kamailio is
>> tough... but MySQL can be broken.
>> >>>> >>>
>> >>>> >>> It was resolved by exiting/dropping on the common hacker UAs, which
>> were retrieved from the logs along with their IPs. Eventually they gave up and moved
>> along.
>> >>>> >>>
>> >>>> >>> A DDoS-type attack.
>> >>>> >>>
>> >>>> >>> -Skyler
>> >>>> >>>
>> >>>> >>> On Jun 23, 2016 1:59 AM, "Skyler" <skchopperguy at gmail.com>
>> wrote:
>> >>>> >>>>
>> >>>> >>>> Looks like a flood to me. Your uptime here is only 2 days; are you
>> seeing anything in the lb or proxy logs when tailing?
>> >>>> >>>>
>> >>>> >>>> - Skyler
>> >>>> >>>>
>> >>>> >>>> On Jun 22, 2016 9:01 PM, "Walter Klomp" <
>> walter at myrepublic.com.sg> wrote:
>> >>>> >>>>>
>> >>>> >>>>> Hi,
>> >>>> >>>>>
>> >>>> >>>>> Running SPCE 3.8.5 on a dedicated ESXi host (Dell R320 with Xeon
>> E2460 & 16 GB RAM) with ~30,000 registered subscribers (and online).
>> >>>> >>>>>
>> >>>> >>>>> Last week we were having horrible statistics and packet loss
>> galore… After tweaking the network settings with the commands below, I have
>> managed to minimize the packet loss, but some still remains.
>> >>>> >>>>>
>> >>>> >>>>> sysctl -w net.core.rmem_max=33554432
>> >>>> >>>>> sysctl -w net.core.wmem_max=33554432
>> >>>> >>>>> sysctl -w net.core.rmem_default=65536
>> >>>> >>>>> sysctl -w net.core.wmem_default=65536
>> >>>> >>>>> sysctl -w net.ipv4.tcp_mem='8388608 8388608 8388608'
>> >>>> >>>>> sysctl -w net.ipv4.udp_mem='4096 174760 33554432'
>> >>>> >>>>> sysctl -w net.ipv4.tcp_rmem='4096 87380 8388608'
>> >>>> >>>>> sysctl -w net.ipv4.tcp_wmem='4096 65536 8388608'
>> >>>> >>>>> sysctl -w net.ipv4.route.flush=1
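>> >>>> >>>>>
>> >>>> >>>>> (Side note: rmem_max/wmem_max only raise the ceiling; applications still have
>> >>>> >>>>> to ask for the larger socket buffers. To keep the values across reboots they
>> >>>> >>>>> can also go into /etc/sysctl.conf, for example:)
>> >>>> >>>>>
>> >>>> >>>>>   net.core.rmem_max = 33554432
>> >>>> >>>>>   net.core.wmem_max = 33554432
>> >>>> >>>>>   net.core.rmem_default = 65536
>> >>>> >>>>>   net.core.wmem_default = 65536
>> >>>> >>>>>   # then reload with:  sysctl -p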
>> >>>> >>>>>
>> >>>> >>>>> I am currently still seeing around 300 packets per second
>> going to unknown ports. Below are the statistics.  About 1/5th of
>> all the packets received are not being processed… That seems like a lot to me.
>> >>>> >>>>>
>> >>>> >>>>>  10:43:40 up 2 days,  5:11,  3 users,  load average: 1.52,
>> 2.05, 2.17
>> >>>> >>>>>
>> >>>> >>>>> Every 1.0s: netstat -anus | grep -A 7 Udp:            Thu Jun 23 10:40:45 2016
>> >>>> >>>>>
>> >>>> >>>>> Udp:
>> >>>> >>>>>     310870895 packets received
>> >>>> >>>>>     61212884 packets to unknown port received.
>> >>>> >>>>>     103338 packet receive errors
>> >>>> >>>>>     312245302 packets sent
>> >>>> >>>>>     RcvbufErrors: 103249
>> >>>> >>>>>     SndbufErrors: 765
>> >>>> >>>>>     InCsumErrors: 75
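>> >>>> >>>>>
>> >>>> >>>>> (A handy way to see whether these counters are still climbing is nstat from
>> >>>> >>>>> iproute2, which prints the deltas since its previous run:)
>> >>>> >>>>>
>> >>>> >>>>>   nstat UdpNoPorts UdpInErrors UdpRcvbufErrors UdpSndbufErrors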
>> >>>> >>>>>
>> >>>> >>>>>
>> >>>> >>>>>
>> >>>> >>>>> I had to do a lot of buffer tweaking to get the RcvbufErrors
>> down, and even the SndbufErrors, as every time it happens (in bursts,
>> sporadically every 10 minutes, but definitely every half hour), one gets
>> silence and the packet receive errors shoot up by somewhere between
>> 200 and 800 packets.
>> >>>> >>>>>
>> >>>> >>>>> The load average can shoot up to 4.x at times.   Knowing that
>> Sipwise Pro runs on the same hardware and supports up to 50,000 users,
>> what am I missing?
>> >>>> >>>>>
>> >>>> >>>>> rtpengine is running in kernel mode. The major contributor to CPU usage
>> is actually MySQL, which regularly maxes out at 100%, especially when it's doing
>> the fraud check. Below is a snapshot of top…
>> >>>> >>>>>
>> >>>> >>>>> top - 10:56:53 up 2 days,  5:24,  3 users,  load average: 2.39, 2.14, 1.94
>> >>>> >>>>> Tasks: 184 total,   1 running, 183 sleeping,   0 stopped,   0 zombie
>> >>>> >>>>> %Cpu(s): 25.3 us,  7.0 sy,  0.0 ni, 63.7 id,  1.0 wa,  0.0 hi,  2.9 si,  0.0 st
>> >>>> >>>>> KiB Mem:  12334464 total, 12157676 used,   176788 free,   144944 buffers
>> >>>> >>>>> KiB Swap:  2096124 total,        0 used,  2096124 free,  4430336 cached
>> >>>> >>>>>
>> >>>> >>>>>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
>> >>>> >>>>>  4063 mysql     20   0 6127m 5.6g 7084 S  54.7 47.7 809:35.18 mysqld
>> >>>> >>>>>  2576 root      20   0  253m 7176 1816 S   9.9  0.1 164:02.97 rsyslogd
>> >>>> >>>>>  5058 root      20   0 67176  11m 5308 S   6.0  0.1   7:05.16 rate-o-mat
>> >>>> >>>>> 15432 root      20   0  276m  12m 3696 S   5.0  0.1 117:56.92 rtpengine
>> >>>> >>>>>  5257 sems      20   0  873m  37m 7624 S   4.0  0.3 139:44.03 ngcp-sems
>> >>>> >>>>> 30996 kamailio  20   0  539m 100m  53m S   4.0  0.8   6:02.68 kamailio
>> >>>> >>>>>
>> >>>> >>>>> Does anybody have any pointers I could try to completely
>> eliminate the packet loss? And where do these unknown-port packets actually go?
>> >>>> >>>>>
>> >>>> >>>>> Thanks
>> >>>> >>>>> Walter.
>> >>>> >>>>>
>> >>>> >>>>>
>> >>>> >>>>>
>> >>>> >>>>>
>> >>>> >>>>> _______________________________________________
>> >>>> >>>>> Spce-user mailing list
>> >>>> >>>>> Spce-user at lists.sipwise.com
>> >>>> >>>>> https://lists.sipwise.com/listinfo/spce-user
>> >>>> >>>>>
>> >>>> >
>> >>
>> >>
>>
>>
>

