<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">

Hi,

<div class=""><br class="">

</div>

<div class="">I checked esxtop and there are no dropped packets in the vswitch…</div>

<div class=""><br class="">

</div>

<div class="">

<div class=""><font face="Courier" class=""> 6:47:35am up 165 days 12:56, 528 worlds, 1 VMs, 4 vCPUs; CPU load average: 0.27, 0.26, 0.28</font></div>

<div class=""><font face="Courier" class=""><br class="">

</font></div>

<div class=""><font face="Courier" class="">   PORT-ID              USED-BY  TEAM-PNIC DNAME              PKTTX/s  MbTX/s   PSZTX    PKTRX/s  MbRX/s   PSZRX %DRPTX %DRPRX </font></div>

<div class=""><font face="Courier" class="">  33554433           Management        n/a vSwitch0              0.00    0.00    0.00       0.00    0.00    0.00   0.00   0.00 </font></div>

<div class=""><font face="Courier" class="">  33554434               vmnic0          - vSwitch0             15.83    0.02  196.00      14.69    0.02  145.00   0.00   0.00 </font></div>

<div class=""><font face="Courier" class="">  33554435     Shadow of vmnic0        n/a vSwitch0              0.00    0.00    0.00       0.00    0.00    0.00   0.00   0.00 </font></div>

<div class=""><font face="Courier" class="">  33554436                 vmk0     vmnic0 vSwitch0             15.83    0.02  196.00      13.92    0.02  149.00   0.00   0.00 </font></div>

<div class=""><font face="Courier" class="">  50331649           Management        n/a vSwitch1              0.00    0.00    0.00       0.00    0.00    0.00   0.00   0.00 </font></div>

<div class=""><font face="Courier" class="">  50331650               vmnic1          - vSwitch1          10590.36   18.98  234.00   10914.61   20.58  247.00   0.00   0.00 </font></div>

<div class=""><font face="Courier" class="">  50331651     Shadow of vmnic1        n/a vSwitch1              0.00    0.00    0.00       0.00    0.00    0.00   0.00   0.00 </font></div>

<div class=""><font face="Courier" class="">  50331659 10189844:backup-sipw     vmnic1 vSwitch1          10590.36   18.98  234.00   10910.61   20.58  247.00   0.00   0.00 </font></div>

<div class=""><font face="Courier" class="">  50331660               vmnic2          - vSwitch1              0.00    0.00    0.00       4.58    0.00   62.00   0.00   0.00 </font></div>

<div class=""><font face="Courier" class="">  50331661     Shadow of vmnic2        n/a vSwitch1              0.00    0.00    0.00       0.00    0.00    0.00   0.00   0.00</font></div>

<div class=""><br class="">

</div>

<div>I also ran ethtool -G eth0 rx-jumbo 2048, but that didn’t change anything either… It seems that kamailio can’t cope with 21Mbit of traffic? This started happening around 2 weeks ago, and the usage pattern has been the same for months before that… I can’t put my finger on what changed (an update may have triggered it, but I have no recollection of one). I checked the statistics in the sipwise admin GUI and the usage patterns are normal.</div>

<div><br class="">

</div>

<div>I did switch from e1000 to vmxnet3 as e1000 was only using 1 core for receiving traffic, whereas vmxnet3 adapter uses all cores. (cat /proc/interrupts)</div>
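<div>For reference, a rough sketch of how the per-core spread can be summed up from /proc/interrupts output. The sample text, the eth0-rxtx-N queue names and the per_cpu_counts helper are made up for illustration and are not taken from my machine; on the guest the text would be read from /proc/interrupts instead:</div>
<pre>
```python
# Sketch: sum per-CPU interrupt counts for one NIC from /proc/interrupts text.
# SAMPLE is invented for illustration; on a real guest read /proc/interrupts.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 56:       1200       1100       1300       1250   PCI-MSI-edge      eth0-rxtx-0
 57:        900       1000        950        980   PCI-MSI-edge      eth0-rxtx-1
 58:          5          0          0          0   PCI-MSI-edge      eth0-event-2
 16:      70000          0          0          0   IO-APIC-fasteoi   ehci_hcd
"""


def per_cpu_counts(text, device):
    """Return a list with the total interrupt count per CPU for `device`."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        fields = line.split()
        # Only rows whose trailing name mentions the device are counted.
        if not fields or device not in fields[-1]:
            continue
        counts = fields[1:1 + ncpu]
        if len(counts) != ncpu or not all(c.isdigit() for c in counts):
            continue  # rows such as "ERR:" carry fewer columns
        for cpu, value in enumerate(counts):
            totals[cpu] += int(value)
    return totals


if __name__ == "__main__":
    for cpu, total in enumerate(per_cpu_counts(SAMPLE, "eth0")):
        print(f"CPU{cpu}: {total}")
```
</pre>
<div>If one CPU column carries nearly all of the counts, receive processing is effectively pinned to a single core; roughly even totals suggest the queues are spread out.</div>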

<div><br class="">

</div>

<div>Any other ideas?</div>

<div><br class="">

</div>

<div><br class="">

</div>

<div><br class="">

<blockquote type="cite" class="">

<div class="">On 26 Jun 2016, at 10:58 AM, Skyler <<a href="mailto:skchopperguy@gmail.com" class="">skchopperguy@gmail.com</a>> wrote:</div>

<br class="Apple-interchange-newline">

<div class="">

<p dir="ltr" class="">Hi,</p>

<p dir="ltr" class="">OK, so if the guest OS buffers have all been increased and the problem still exists, it must be at the ESXi level.</p>

<p dir="ltr" class="">Maybe this KB can start you in the right direction.</p>

<p dir="ltr" class=""><a href="https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2039495" class="">https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2039495</a><br class="">

</p>

<p dir="ltr" class="">- Skyler</p>

<div class="gmail_extra"><br class="">

<div class="gmail_quote">On Jun 25, 2016 8:29 PM, "Walter Klomp" <<a href="mailto:walter@myrepublic.com.sg" target="_blank" class="">walter@myrepublic.com.sg</a>> wrote:<br type="attribution" class="">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div dir="auto" class="">

<div class="">Hi,</div>

<div class=""><br class="">

</div>

<div class="">I already changed that setting to 50000, but it doesn't make any difference. Also, if the device drops the packets, then why do I see packet drops in the app itself? It seems like lb can't handle the packets fast enough, or the socket buffers in kamailio are not big enough? Note that I am running this on 3-year-old hardware, and if the solution is throwing more hardware at it, then so be it. The CPU usage is not that high according to top, but the load average does exceed 4 at times (which I understand is the limit for 4 cores).</div>
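<div class="">To make the per-socket drops easier to read, here is a minimal sketch that decodes rows of /proc/net/udp. The two sample rows are the ones I posted earlier in this thread; the decode_addr and socket_drops names are just illustrative, and the decoding assumes IPv4 addresses as printed on a little-endian (x86) kernel:</div>
<pre>
```python
# Sketch: list (ip, port, drops) per UDP socket from /proc/net/udp text.
# The two rows below are copied from the output earlier in this thread.
SAMPLE = """\
  420: 0100007F:13C4 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 12138702 2 ffff8803313b70c0 537
  420: 07CB0767:13C4 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 12138701 2 ffff88032e498740 416
"""


def decode_addr(hex_addr):
    """Turn '0100007F:13C4' into ('127.0.0.1', 5060).

    Assumption: IPv4 address as printed by a little-endian kernel,
    so the address bytes appear in reverse order in the hex string.
    """
    ip_hex, port_hex = hex_addr.split(":")
    octets = int(ip_hex, 16).to_bytes(4, "little")
    ip = ".".join(str(b) for b in octets)
    return ip, int(port_hex, 16)


def socket_drops(text):
    """Return [(ip, port, drops)] for every socket row in the text."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Socket rows start with a slot number such as "420:".
        if not fields or not fields[0].endswith(":"):
            continue  # skip the header and blank lines
        ip, port = decode_addr(fields[1])
        rows.append((ip, port, int(fields[-1])))  # drops is the last column
    return rows


if __name__ == "__main__":
    for ip, port, drops in socket_drops(SAMPLE):
        print(f"{ip}:{port} drops={drops}")
```
</pre>
<div class="">Both sockets decode to port 5060, so the drops are counted on the SIP listening sockets themselves.</div>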

<div class=""><br class="">

</div>

<div class="">If esxi drops the packets then I would probably not even see it in the machine, or did I get that wrong?<br class="">

<br class="">

Yours sincerely,

<div class="">Walter</div>

</div>

<div class=""><br class="">

On 26 Jun 2016, at 2:10 AM, Skyler <<a href="mailto:skchopperguy@gmail.com" target="_blank" class="">skchopperguy@gmail.com</a>> wrote:<br class="">

<br class="">

</div>

<blockquote type="cite" class="">

<div class="">

<p dir="ltr" class="">Hi,</p>

<p dir="ltr" class="">One other thing you can try is to check the net.core.netdev_max_backlog value in your VM. It should be 1000 by default. Maybe change it to 2000, reboot and check the packet stats after that to compare with the previous ones.</p>
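<p dir="ltr" class="">A rough way to tell whether the backlog is actually overflowing is the second column of /proc/net/softnet_stat, which counts packets dropped per CPU because the input queue was full. Below is a minimal sketch in Python; the sample rows and the backlog_drops name are made up for illustration, so feed it the real file contents from your VM:</p>
<pre>
```python
# Sketch: read the per-CPU "dropped" counter from /proc/net/softnet_stat text.
# Each row is one CPU. Column 0 is packets processed, column 1 is packets
# dropped because the input backlog queue was full. Values are hexadecimal.
# SAMPLE is invented for illustration.
SAMPLE = """\
0004a3f1 00000000 00000012 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00039c20 0000001f 00000003 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
"""


def backlog_drops(text):
    """Return [(cpu_index, dropped)] with one entry per row of the text."""
    result = []
    for cpu, line in enumerate(text.splitlines()):
        fields = line.split()
        if not fields[1:]:
            continue  # malformed or empty row
        result.append((cpu, int(fields[1], 16)))
    return result


if __name__ == "__main__":
    for cpu, dropped in backlog_drops(SAMPLE):
        print(f"cpu{cpu}: dropped={dropped}")
```
</pre>
<p dir="ltr" class="">If that counter keeps growing between two runs, raising netdev_max_backlog is worth a try; if it stays at zero, the drops are probably happening somewhere else.</p>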

<p dir="ltr" class="">This is tough to troubleshoot as udp stats are always showing the whole stack.</p>

<p dir="ltr" class="">Will be interesting to figure this out and identify the bottleneck.

</p>

<p dir="ltr" class="">On Jun 25, 2016 11:34 AM, "Skyler" <<a href="mailto:skchopperguy@gmail.com" target="_blank" class="">skchopperguy@gmail.com</a>> wrote:<br class="">

><br class="">

> Hi,<br class="">

><br class="">

> Odd that I don't see 'pointer' or 'drops' on my machine.<br class="">

><br class="">

> Anyway, my thinking is that netdev_max_backlog could be the issue here: the maximum number of queued packets is reached on the input side, and packets are then lost as the interface receives them faster than the kernel can process them.<br class="">

><br class="">

> If I'm right, you need to move to a bare-metal scenario to confirm that this problem is esxi and the VM layer.<br class="">

><br class="">

><br class="">

><br class="">

> On Jun 25, 2016 10:41 AM, "Walter Klomp" <<a href="mailto:walter@myrepublic.com.sg" target="_blank" class="">walter@myrepublic.com.sg</a>> wrote:<br class="">

>><br class="">

>> Hi,<br class="">

>><br class="">

>> I have done the firewall (with some tweaks) - but I still see packet loss…<br class="">

>><br class="">

>> Looking at /proc/net/udp I see that lb is dropping packets - this is the result after about 3 minutes of running<br class="">

>><br class="">

>> root@sipwise:/etc# cat /proc/net/udp<br class="">

>>   sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops             <br class="">

>>     <br class="">

>>   420: 0100007F:13C4 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 12138702 2 ffff8803313b70c0 537    <br class="">

>>   420: 07CB0767:13C4 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 12138701 2 ffff88032e498740 416 <br class="">

>><br class="">

>> As you can see already 537 drops…<br class="">

>><br class="">

>> What setting do I change for this to not happen ?<br class="">

>><br class="">

>> Incidentally I have been playing with the number of children and shared memory, but it doesn’t seem to make much of a difference. Here I am running 12 udp children and 256Mb shared memory (note I have about 30.000 devices connected).<br class="">

>><br class="">

>> Thanks<br class="">

>> Walter<br class="">

>><br class="">

>><br class="">

>>> On 24 Jun 2016, at 5:46 AM, Skyler <<a href="mailto:skchopperguy@gmail.com" target="_blank" class="">skchopperguy@gmail.com</a>> wrote:<br class="">

>>><br class="">

>>> If it were my box I'd have iptables accept only tcp/udp 5060, 5061 and TCP 80, 443 and admin, xmlrpc ports then drop all of the rest.<br class="">

>>><br class="">

>>> Could start here for good examples:<br class="">

>>><br class="">

>>> <a href="https://www.kamailio.org/wiki/tutorials/security/kamailio-security" target="_blank" class="">

https://www.kamailio.org/wiki/tutorials/security/kamailio-security</a><br class="">

>>><br class="">

>>> -- Skyler<br class="">

>>><br class="">

>>> On Jun 23, 2016 3:27 PM, "Skyler" <<a href="mailto:skchopperguy@gmail.com" target="_blank" class="">skchopperguy@gmail.com</a>> wrote:<br class="">

>>>><br class="">

>>>> Hi,<br class="">

>>>><br class="">

>>>> On Jun 23, 2016 3:18 AM, "Walter Klomp" <<a href="mailto:walter@myrepublic.com.sg" target="_blank" class="">walter@myrepublic.com.sg</a>> wrote:<br class="">

>>>> ><br class="">

>>>> > Hi,<br class="">

>>>> ><br class="">

>>>> > MySQL is not locking up other than due to anti-fraud script which runs every half an hour.<br class="">

>>>> ><br class="">

>>>><br class="">

>>>> Oh you mentioned mysql pinning cpu so I assumed we may have had the same problem.<br class="">

>>>><br class="">

>>>> > I think I can also rule out DDOS because it’s a steady 300-350 packets per second that go to unknown port.<br class="">

>>>> ><br class="">

>>>><br class="">

>>>> Wow, so one device is doing that you figure? How do you know it's that many pps if the port is unknown?<br class="">

>>>><br class="">

>>>> If they are udp, I'd set kamailio lb to listen on that unknown port and look in the logs to see what shows up.<br class="">

>>>><br class="">

>>>> > What I have not figured out yet is how the heck I find out which packets are the actual culprits…  Even doing a tcpdump on UDP packets only and excluding the hosts I know are legit and the ports I know are legit, still gives me a heck of a lot of traffic,

 probably actual payload traffic of ongoing voice calls (around 250 concurrent)…<br class="">

>>>> ><br class="">

>>>> > Now the packets to unknown port could also be some equipment sending some garbage (Grandstream ATA’s like to do this) to keep the NAT port open, and it may not actually be a problem, but I still can’t seem to figure out what causes the RcvbufErrors which

 periodically happen and when I listen to for instance the conference bridge music, it will break for a while…<br class="">

>>>> ><br class="">

>>>><br class="">

>>>> I've never heard of grandstream devices sending that kind of pps before. Unless it's like 3000 of them all misconfigured and pointing at you. All UA's do nat ping on the port configured on the device, so 5060 usually. Can't see devices being the problem

 here. The pps is too high.<br class="">

>>>><br class="">

>>>> > How to find out when the rcvbuferror occurs, what application is causing it?<br class="">

>>>><br class="">

>>>> First find out where the packets are coming from and why. Then you'll know if it can be dropped or what app to look at.<br class="">

>>>><br class="">

>>>> > Thanks for any suggestions.<br class="">

>>>> > Walter<br class="">

>>>> ><br class="">

>>>> >> On 23 Jun 2016, at 4:25 PM, Skyler <<a href="mailto:skchopperguy@gmail.com" target="_blank" class="">skchopperguy@gmail.com</a>> wrote:<br class="">

>>>> >><br class="">

>>>> >> Dang these thumbs..now to the list.<br class="">

>>>> >><br class="">

>>>> >> On Jun 23, 2016 2:06 AM, "Skyler" <<a href="mailto:skchopperguy@gmail.com" target="_blank" class="">skchopperguy@gmail.com</a>> wrote:<br class="">

>>>> >>><br class="">

>>>> >>> Sorry, in the list now.<br class="">

>>>> >>><br class="">

>>>> >>> I had a similar issue last month. Basically mysql locking up the box. I think there's an update for hackers out there. Kamailio is tough...but mysql can be broken.<br class="">

>>>> >>><br class="">

>>>> >>> It was resolved by exiting/dropping on common hacker UAs, which were retrieved from the logs along with the IPs. Eventually they gave up and moved along.<br class="">

>>>> >>><br class="">

>>>> >>> Ddos type attack.<br class="">

>>>> >>><br class="">

>>>> >>> -Skyler<br class="">

>>>> >>><br class="">

>>>> >>> On Jun 23, 2016 1:59 AM, "Skyler" <<a href="mailto:skchopperguy@gmail.com" target="_blank" class="">skchopperguy@gmail.com</a>> wrote:<br class="">

>>>> >>>><br class="">

>>>> >>>> Looks like a flood to me. Yer spec is 2 days here, are you seeing anything in lb or proxy log when tailing?<br class="">

>>>> >>>><br class="">

>>>> >>>> - Skyler<br class="">

>>>> >>>><br class="">

>>>> >>>> On Jun 22, 2016 9:01 PM, "Walter Klomp" <<a href="mailto:walter@myrepublic.com.sg" target="_blank" class="">walter@myrepublic.com.sg</a>> wrote:<br class="">

>>>> >>>>><br class="">

>>>> >>>>> Hi,<br class="">

>>>> >>>>><br class="">

>>>> >>>>> Running SPCE 3.8.5 on dedicated ESXi host (Dell R320 with Xeon E2460 & 16GB RAM) with ~30.000 registered subscribers (and online).<br class="">

>>>> >>>>><br class="">

>>>> >>>>> Last week we were having horrible statistics and packet loss galore… After tweaking the network settings as shown below, I have managed to minimize the packet loss, but some still remains.<br class="">

>>>> >>>>><br class="">

>>>> >>>>> sysctl -w net.core.rmem_max=33554432<br class="">

>>>> >>>>> sysctl -w net.core.wmem_max=33554432<br class="">

>>>> >>>>> sysctl -w net.core.rmem_default=65536<br class="">

>>>> >>>>> sysctl -w net.core.wmem_default=65536<br class="">

>>>> >>>>> sysctl -w net.ipv4.tcp_mem='8388608 8388608 8388608'<br class="">

>>>> >>>>> sysctl -w net.ipv4.udp_mem='4096 174760 33554432'<br class="">

>>>> >>>>> sysctl -w net.ipv4.tcp_rmem='4096 87380 8388608'<br class="">

>>>> >>>>> sysctl -w net.ipv4.tcp_wmem='4096 65536 8388608'<br class="">

>>>> >>>>> sysctl -w net.ipv4.route.flush=1<br class="">

>>>> >>>>><br class="">

>>>> >>>>> I am currently still seeing around 300 packets per second going to unknown ports. Below are the statistics. That means about 1/5th of all the packets received are not being processed… That seems a lot to me.<br class="">

>>>> >>>>><br class="">

>>>> >>>>>  10:43:40 up 2 days,  5:11,  3 users,  load average: 1.52, 2.05, 2.17<br class="">

>>>> >>>>><br class="">

>>>> >>>>> Every 1.0s: netstat -anus|grep -A 7 Udp:                                                                                                                   Thu Jun 23 10:40:45 2016<br class="">

>>>> >>>>><br class="">

>>>> >>>>> Udp:<br class="">

>>>> >>>>>     310870895 packets received<br class="">

>>>> >>>>>     61212884 packets to unknown port received.<br class="">

>>>> >>>>>     103338 packet receive errors<br class="">

>>>> >>>>>     312245302 packets sent<br class="">

>>>> >>>>>     RcvbufErrors: 103249<br class="">

>>>> >>>>>     SndbufErrors: 765<br class="">

>>>> >>>>>     InCsumErrors: 75<br class="">

>>>> >>>>><br class="">

>>>> >>>>><br class="">

>>>> >>>>><br class="">

>>>> >>>>> I had to do a lot of buffer tweaking to get the RcvbufErrors and even the SndbufErrors down, as every time it happens (in bursts: sporadically every 10 minutes, but definitely every half hour), one would get silence and the packet receive errors would shoot up by between 200 and 800 packets.<br class="">

>>>> >>>>><br class="">

>>>> >>>>> The load average can shoot up to 4.x at times.   Knowing that Sipwise Pro is on the same hardware, and they support up to 50.000 users, what am I missing?<br class="">

>>>> >>>>><br class="">

>>>> >>>>> rtpengine is running in kernel. The major contributor to CPU usage is actually MySQL, regularly maxing out at 100%, especially when it’s doing the fraud check. Below is a snapshot of top…<br class="">

>>>> >>>>><br class="">

>>>> >>>>> top - 10:56:53 up 2 days,  5:24,  3 users,  load average: 2.39, 2.14, 1.94<br class="">

>>>> >>>>> Tasks: 184 total,   1 running, 183 sleeping,   0 stopped,   0 zombie<br class="">

>>>> >>>>> %Cpu(s): 25.3 us,  7.0 sy,  0.0 ni, 63.7 id,  1.0 wa,  0.0 hi,  2.9 si,  0.0 st<br class="">

>>>> >>>>> KiB Mem:  12334464 total, 12157676 used,   176788 free,   144944 buffers<br class="">

>>>> >>>>> KiB Swap:  2096124 total,        0 used,  2096124 free,  4430336 cached<br class="">

>>>> >>>>><br class="">

>>>> >>>>>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                                               <br class="">

>>>> >>>>>  4063 mysql     20   0 6127m 5.6g 7084 S  54.7 47.7 809:35.18 mysqld                                                                <br class="">

>>>> >>>>>  2576 root      20   0  253m 7176 1816 S   9.9  0.1 164:02.97 rsyslogd                                                              <br class="">

>>>> >>>>>  5058 root      20   0 67176  11m 5308 S   6.0  0.1   7:05.16 rate-o-mat                                                            <br class="">

>>>> >>>>> 15432 root      20   0  276m  12m 3696 S   5.0  0.1 117:56.92 rtpengine                                                             <br class="">

>>>> >>>>>  5257 sems      20   0  873m  37m 7624 S   4.0  0.3 139:44.03 ngcp-sems                                                             <br class="">

>>>> >>>>> 30996 kamailio  20   0  539m 100m  53m S   4.0  0.8   6:02.68 kamailio   <br class="">

>>>> >>>>><br class="">

>>>> >>>>> Does anybody have any pointers I can try to completely eliminate the packet loss and where do these unknown port packets go to?<br class="">

>>>> >>>>><br class="">

>>>> >>>>> Thanks<br class="">

>>>> >>>>> Walter.<br class="">

>>>> >>>>><br class="">

>>>> >>>>><br class="">

>>>> >>>>><br class="">

>>>> >>>>><br class="">

>>>> >>>>> _______________________________________________<br class="">

>>>> >>>>> Spce-user mailing list<br class="">

>>>> >>>>> <a href="mailto:Spce-user@lists.sipwise.com" target="_blank" class="">

Spce-user@lists.sipwise.com</a><br class="">

>>>> >>>>> <a href="https://lists.sipwise.com/listinfo/spce-user" target="_blank" class="">

https://lists.sipwise.com/listinfo/spce-user</a><br class="">

>>>> >>>>><br class="">

>>>> ><br class="">

>><br class="">

>></p>

</div>

</blockquote>

</div>

</blockquote>

</div>

</div>

</div>

</blockquote>

</div>

<br class="">

</div>

</body>

</html>