Guest article by Klaus Degner in the funkschau
Monitoring of Peak Loads - Detect and Avoid Burst Traffic
The IT magazine "funkschau" asked us to contribute an article on the topic of burst analysis and monitoring of peak loads. Now the detailed technical article by our managing director Klaus Degner can be read online and offline in the current issue 13-14/2018.
A modern data network can be a real challenge. It underpins an organisation’s productivity and has to support a multitude of different devices and services.
However heterogeneous today's networks and the services running on them may be, the switches and routers always have the same task: to process network traffic and, if necessary, adapt it to a lower bandwidth at the LAN uplink. If multiple data packets arrive at the same time, the switch can only forward one packet at a time per port. If the incoming traffic load becomes too heavy, the packets accumulate on the switch to such an extent that the switch may have to discard individual packets. Such packet drops usually indicate that the network bandwidth is overloaded.
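The queueing behaviour described above can be sketched in a few lines. This is a simplified model, not a description of any real switch: it assumes a single output port that forwards one packet per time slot, and the buffer size of four packets is a hypothetical value chosen for illustration.

```python
def simulate_port(arrivals, buffer_size):
    """Simulate one output port that forwards one packet per time slot.

    arrivals: number of packets arriving in each slot.
    buffer_size: maximum packets the port can queue (hypothetical value).
    Returns (forwarded, dropped).
    """
    queue = 0
    forwarded = 0
    dropped = 0
    for n in arrivals:
        # Accept as many arrivals as the buffer still has room for.
        accepted = min(n, buffer_size - queue)
        dropped += n - accepted
        queue += accepted
        # Forward at most one queued packet in this slot.
        if queue > 0:
            queue -= 1
            forwarded += 1
    return forwarded, dropped


steady = [1, 1, 1, 1, 1, 1, 1, 1]  # eight packets, evenly spread
burst = [8, 0, 0, 0, 0, 0, 0, 0]   # the same eight packets, all at once

print(simulate_port(steady, buffer_size=4))
print(simulate_port(burst, buffer_size=4))
```

Both patterns have the same average of one packet per slot, yet in this model only the burst overflows the buffer and loses packets.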
Peak vs. Average Load
However, this diagnosis alone is not sufficient for a system administrator to bring about a valid solution to the problem because packet losses can occur sporadically - which is to be expected. In addition, it makes a big difference whether the network is running at full load all the time or whether there are unusual traffic spikes.
The situation is similar for a road traffic surveyor: A certain number of vehicles can drive on a motorway without congestion. As traffic increases, traffic jams often result. Knowing that the average daily load is 10 percent does not help to keep traffic flowing smoothly. More meaningful are analyses of peak times: where and when the problems occur and which vehicles cause them.
It's the same with a maximum load on a network: The IT professional wants to know when this load occurs, how long and how often such a full load is maintained, and which network users are causing it. Only these findings allow them to take action to provide smooth network service.
Impact of Bursts on the Network and Single Protocols
How such bursts can be examined in more detail is explained later. First, the question is how the network behaves under traffic bursts. Packet losses and retransmissions are normal, so is there any need for action at all?
Yes and no. Packet losses are a given. But if they take on an order of magnitude that has a negative effect on network performance, remedial action must be taken.
The best way to understand the impact on the network is to look at the data transfer protocols on the one hand and the services that require guaranteed bandwidth on the other.
With the most important protocols, such as HTTP or DNS, an overload may not have a devastating effect, but it should still be eliminated in the long run. For example, loading a web page can trigger several dozen DNS queries within a short time. If only five percent of them are lost, the requesting computer waits for an answer that never arrives and retransmits the affected requests after a few seconds. The page load is delayed by those seconds. Often, the user gets frustrated and leaves the site.
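A short calculation shows why even a small loss rate matters here. The sketch below assumes that each query is lost independently with the same probability, which is a simplification, and the figure of 36 queries is a hypothetical stand-in for "several dozen".

```python
def probability_any_lost(num_queries, loss_rate):
    """Probability that at least one of num_queries packets is lost,
    assuming each is lost independently with probability loss_rate."""
    return 1.0 - (1.0 - loss_rate) ** num_queries


# Hypothetical page load: 36 DNS queries at a 5 percent loss rate.
p_delayed = probability_any_lost(36, 0.05)
print(f"Chance that at least one query must be retransmitted: {p_delayed:.0%}")
```

Under these assumptions, more than 80 percent of such page loads would hit at least one retransmission timeout.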
More critical is the impact of overload on services that rely on guaranteed bandwidth, such as VoIP telephony, webcasts or the delivery of audio or video content. If an overload occurs here, the respective service may fail completely. In an organisation that trains its employees via online courses, this can be frustrating for the staff being trained. Even short peak loads can make video playback erratic: the audio still arrives on time, but the picture quality is poor. If a Windows update is downloaded simultaneously in the same network, the video service may freeze and stop completely.
As with HTTP or DNS, a detailed analysis of peak loads quickly identifies the cause. A burst analysis can reveal whether the peak load was a coincidence, for example because a webcast was started in addition to YouTube content and updates, or whether the load peaks actually recur regularly.
Classify Bursts Correctly
In such an analysis, it is often not sufficient to know just the average load over a specific time interval. If, for example, the analysis shows an average of 100 Mbit/s on a 1 Gbit/s line over one minute, this does not reveal whether the link was fully loaded for 10 percent of the time and idle for the rest, or whether it was continuously loaded at 10 percent. To make intelligent adjustments when network performance is poor, an accurate value is needed: what percentage of the time did the link run under full load? If the value is zero percent, problems caused by bandwidth overload can be ruled out, and no changes are necessary. If, however, the value is above zero, the network is overloaded at times, whether locally or across an entire network.
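The difference between the two figures can be made concrete with a small sketch. It assumes that throughput samples are available at a finer resolution than the averaging interval (here one value per second, in Mbit/s); the function name and the 95 percent threshold for "full load" are illustrative choices.

```python
def load_summary(samples_mbps, link_mbps, full_threshold=0.95):
    """Return (average utilisation, fraction of samples at full load).

    samples_mbps: throughput per sampling interval in Mbit/s.
    link_mbps: nominal link capacity in Mbit/s.
    full_threshold: share of capacity counted as "full load" (assumed).
    """
    average = sum(samples_mbps) / len(samples_mbps)
    full = sum(1 for s in samples_mbps if s >= full_threshold * link_mbps)
    return average / link_mbps, full / len(samples_mbps)


# Two hypothetical minutes on a 1 Gbit/s link, one sample per second.
bursty = [1000] * 6 + [0] * 54  # six seconds at line rate, then idle
steady = [100] * 60             # constant 100 Mbit/s

print(load_summary(bursty, link_mbps=1000))
print(load_summary(steady, link_mbps=1000))
```

Both minutes report the same 10 percent average utilisation, but only the bursty one spends any time at full load.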
Smart monitoring solutions can swiftly identify exactly what is causing these overloads. Typical causes of a LAN burst include backups, a large file transfer or a Windows update from the update server, any of which can push the LAN or WLAN to its limit for a short time.
These are all normal processes, which in many cases are no problem at all. It only becomes a problem if several of these services suddenly consume bandwidth in parallel, e.g. on Monday morning, when the employees’ smartphones in the WLAN automatically run their updates and generate high traffic rates. Then the services contend for the same bandwidth. If another application is launched, e.g. multiple VoIP telephone calls, the network may grind to a halt. As soon as it is clear what is responsible for the overload, appropriate measures can be taken to prevent it.
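How a monitoring tool might attribute an overload to its sources can be sketched as follows. The record format (time bin, source label, byte count), the function name and the 95 percent threshold are assumptions made for illustration and do not describe any particular product.

```python
from collections import defaultdict


def top_talkers_in_bursts(records, link_bytes_per_bin, threshold=0.95):
    """Rank sources by the bytes they sent during overloaded time bins.

    records: iterable of (bin_index, source, byte_count) tuples.
    link_bytes_per_bin: bytes the link can carry in one time bin.
    threshold: share of capacity that counts as a burst (assumed).
    """
    per_bin = defaultdict(int)
    per_bin_source = defaultdict(lambda: defaultdict(int))
    for bin_index, source, byte_count in records:
        per_bin[bin_index] += byte_count
        per_bin_source[bin_index][source] += byte_count

    totals = defaultdict(int)
    for bin_index, total in per_bin.items():
        # Only bins at or near capacity contribute to the ranking.
        if total >= threshold * link_bytes_per_bin:
            for source, byte_count in per_bin_source[bin_index].items():
                totals[source] += byte_count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical traffic records for three time bins.
records = [
    (0, "backup", 900), (0, "voip", 100),
    (1, "voip", 100),
    (2, "update", 600), (2, "backup", 400),
]
print(top_talkers_in_bursts(records, link_bytes_per_bin=1000))
```

Counting only the overloaded bins keeps a constant but harmless background flow from being blamed for peaks that other services cause.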
Analysis in the Millisecond Range
In many cases, such a burst analysis is the most direct route to a solution. For example, a telecommunications provider sells a specific bandwidth package to its customer. If the customer then complains about bandwidth bottlenecks, a burst analysis helps to provide precise information about their traffic volume. In such cases, it is desirable to get a depth of detail that goes beyond the usual one-minute or 30-second interval.
A case in point is voice telephony. Data transmission via SIP/RTP takes place at a rate of one packet every 20, 25 or 30 milliseconds. The packets do not arrive at perfectly regular intervals, but with deviations. Even a deviation of 20 milliseconds can cause audible problems at the remote station, e.g. lost words. In order to track down these deviations, called jitter, a burst analysis in the millisecond range is required. Only this will show whether the connection was actually overloaded for a few milliseconds. If the deviations persist over a longer period of time, the audio buffer fills up and the jitter buffer adapts. Alternatively, some of the buffered data may be discarded and packets dropped. If the analysis results are only available at minute resolution, these errors go undetected and therefore cannot be corrected. Only a finer time resolution allows you to see where the problems lie.
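The deviations described above can be computed from packet arrival times. The sketch below assumes the sender emits one packet exactly every 20 milliseconds, so that each gap between arrivals can be compared with that nominal interval. The smoothing step follows the running estimator defined in RFC 3550 (gain 1/16); the arrival times are made up for illustration.

```python
def interarrival_deviations(arrival_ms, interval_ms=20.0):
    """Deviation of each inter-arrival gap from the nominal send interval.

    Assumes the sender emits packets exactly every interval_ms.
    """
    return [(later - earlier) - interval_ms
            for earlier, later in zip(arrival_ms, arrival_ms[1:])]


def smoothed_jitter(deviations):
    """Running jitter estimate in the style of RFC 3550 (gain 1/16)."""
    jitter = 0.0
    for d in deviations:
        jitter += (abs(d) - jitter) / 16.0
    return jitter


# Hypothetical arrival times in milliseconds; one packet is held up.
arrivals = [0, 20, 40, 85, 86, 100]
deviations = interarrival_deviations(arrivals)
print(deviations)
print(max(abs(d) for d in deviations))
print(smoothed_jitter(deviations))
```

In this made-up trace a single packet arrives 25 milliseconds late, which a one-minute average would hide completely.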
Critical services are another area where quality evaluation in the millisecond range is necessary. While some applications work on a best-effort basis, i.e. as quickly as possible, other services are extremely time-critical. The comparison between email and VoIP telephony illustrates this clearly: When sending an email, it is in most cases irrelevant whether it takes one or three seconds. For VoIP calls, in contrast, such a delay is extremely critical. As explained earlier, all packets should arrive at a constant, short time interval.
To get back to the comparison with the road: For some consignments, it may not be a big issue whether they reach their destination exactly at the predetermined time or not. With other goods, it is extremely critical when they arrive, e.g. due to a Just-in-Time (JIT) agreement that must be maintained.
Whether in logistics or in system administration, planners are responsible for timely delivery and trouble-free services. Especially with VoIP, but also with all other time-critical applications, it is essential to be able to debug the network to the millisecond. Monitoring tools are now available that give the system administrator this level of detail.
Procedures for Peak Loads
If an analysis of the network problems has identified bursts as the cause, the administrator needs to act. But which strategy is the right one? Is it a question of simply increasing the bandwidth of the link and assuming that this will eliminate all the problems? How does a system with a 10 Gbit/s network connection behave, for example, if the underlying 1 Gbit/s network is increased to 10 Gbit/s?
The following example illustrates the case: Two servers connected at 10 Gbit/s were configured to start the next backup as soon as the previous one had completed. In order to reduce the maximum load on the link, the link between the two data centres was upgraded to 40 Gbit/s. This involved a lot of effort and high costs. When the project was finally completed, it turned out, to the frustration of all involved, that the link was just as busy as before. How could this happen? A new analysis showed that the problem was due to the clusters behind the link, which were connected at several locations with 10 Gbit/s links. Together, the clusters that had previously saturated the link at 10 Gbit/s generated so much traffic after the upgrade that even the 40 Gbit/s link had no capacity left.
This expensive upgrade should not have been necessary. A detailed burst analysis would have highlighted this in advance. In addition to the duration of the peaks, it also shows which network users actually cause the bursts. The analysis would detect which services saturate the link, be it the Windows update server, a file transfer or backups. Instead of a link upgrade, it would have been sufficient to limit the speed of these services.
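One common way to limit the speed of a service is a token bucket. The sketch below is a minimal illustration of that technique, not the configuration of any specific product. The class name, rate and burst size are hypothetical, and the current time is passed in explicitly so that the behaviour is easy to follow.

```python
class TokenBucket:
    """Minimal token bucket: allows rate_bytes_per_s on average and
    short bursts of up to burst_bytes."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = 0.0            # time of the last refill, in seconds

    def allow(self, nbytes, now):
        """Return True if nbytes may be sent at time now (in seconds)."""
        # Refill according to the time elapsed, capped at the bucket size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# Hypothetical limit: 1000 bytes/s with a burst allowance of 1500 bytes.
bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
print(bucket.allow(1500, now=0.0))  # bucket is full, packet passes
print(bucket.allow(1500, now=0.5))  # only 500 tokens refilled, rejected
print(bucket.allow(1500, now=1.5))  # refilled to 1500, packet passes
```

Applied to a backup job or an update server, a limiter of this kind caps the sustained rate while still tolerating short bursts, leaving headroom on the link for time-critical traffic.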
So, if a network suffers from bursty traffic, upgrading is not always the correct approach. Sometimes it is advisable to enable Quality of Service (QoS) instead, so that an appropriate bandwidth can be allocated to individual services. Another alternative is a logical or physical separation of the services. For example, the VoIP telephone traffic can be given a high priority or moved to a completely separate network.
Anyone looking for a universal solution for how to manage a network at full capacity may be disappointed. If, however, the causes of the bursts are sufficiently known, the solution is considerably simplified.
The article was published in German. Here you find the online version of the German article.
Enjoy your reading!