On the trail of the bug
Tracking down the errors
The magazine LANline has published a technical article, "On the trail of the bug", by our managing director Klaus Degner (issue 9/2017). It focuses on current developments in network troubleshooting.
The search for the small bug that can paralyse small to large IT networks can be extremely laborious and frustrating. Time is of the essence since failures can be expensive. Often, cause and effect are far apart. The latest state-of-the-art analysis methods can shorten troubleshooting.
Every network administrator knows the situation: when an error occurs, the blame is always passed to the network because it is complicated and difficult to diagnose. Users often report faults only after they have occurred more than once. For a detailed analysis of the network traffic, the administrator must locate and record issues accurately with the appropriate tools.
To resolve this dilemma, new solutions can now provide root-cause analysis more elegantly than their predecessors. They enable fast access to essential network data, both in real time and for past incidents. Systems that cover a broad monitoring spectrum are often so complex to use that they make spontaneous, ad hoc troubleshooting difficult or time-consuming. Conventional analysis tools cover only a limited area of a network. Cutting-edge network analysis systems offer an array of analysis options to meet the growing requirements of current and future IT architectures.
The complexity of today's network installations can create a mishmash of errors and false positives. Many faults can be traced back to the complex interaction of different devices (often including legacy equipment) and software programmes. Debugging such errors can be time-consuming and laborious, since the root cause of the error may have no apparent connection to the effect. Other errors, on the other hand, are very simple in nature despite the difficulty of finding them, and are often based on small human slips, often referred to as “finger trouble.”
Analysis of Cisco Management During Operation
One example of such a distant cause and effect occurred at a company in the energy sector. Their Cisco management system was repeatedly unavailable for several hours at a time. However, the Cisco switch continued to process customer data, so restarting the switch was not an option. Replacing the hardware was not an option either. For cost reasons, the administrator first had to analyse all other components.
After the conventional analysis had been completed without result, a more up-to-date solution was used. This enabled a detailed and highly accurate analysis of the last connections before the crash. This information can be found with just a few clicks, and the corresponding Pcap can be extracted and analysed at the touch of a button. In this case, it turned out that a standard SNMP command preceding a software update crashed the switch.
The problem had been caused by a network error brought about by the complex interaction of different software and hardware. At the same time, the cause had no obvious relationship to the effect. The key to success here was the use of a tool that was active during operation without requiring the continuous attention of the network administrator. Through passive recording and selective data extraction, the administrator was quickly able to identify the error. After the fault was detected, troubleshooting ran smoothly. Cisco received a detailed bug report and sent a new software version to fix the issue.
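The idea of pulling out only the last packets before a crash can be illustrated with a minimal sketch. It assumes the capture is available as a classic pcap file with microsecond timestamps (not pcapng) and that the crash time is known as a Unix timestamp. The function name packets_in_window and its parameters are hypothetical and are not taken from the tool described in the article.

```python
import struct

PCAP_MAGIC_USEC = 0xA1B2C3D4  # classic pcap, microsecond timestamps


def packets_in_window(pcap_bytes, end_ts, window_s):
    """Return (timestamp, raw_packet) pairs captured in [end_ts - window_s, end_ts].

    Hypothetical helper: reads a classic pcap file held in memory.
    """
    if len(pcap_bytes) < 24:
        raise ValueError("too short for a pcap global header")
    # The magic number reveals the byte order the capture was written in.
    if struct.unpack("<I", pcap_bytes[:4])[0] == PCAP_MAGIC_USEC:
        endian = "<"
    elif struct.unpack(">I", pcap_bytes[:4])[0] == PCAP_MAGIC_USEC:
        endian = ">"
    else:
        raise ValueError("not a classic microsecond pcap file")

    selected = []
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(pcap_bytes):
        # Each record header: seconds, microseconds, captured length, original length.
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", pcap_bytes[offset:offset + 16]
        )
        offset += 16
        data = pcap_bytes[offset:offset + incl_len]
        offset += incl_len
        ts = ts_sec + ts_usec / 1_000_000
        if end_ts - window_s <= ts <= end_ts:
            selected.append((ts, data))
    return selected
```

Called with the crash time and a window of, say, 30 seconds, this returns only the packets that immediately preceded the failure, which is the subset an administrator would then inspect by hand.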
Hanging Citrix Client Paralyses a Customer Centre
Another example of a hard-to-find error occurred at an organisation with a large quantity of customer data. The Citrix clients were hanging several times a day, so customers had to put up with long waiting times. However, the error only occurred at one location. Other locations with identical network configurations were not affected.
At first the server was suspected to be at fault, but replacing it did not help. Further analysis steps, such as measuring connections, checking the ISP network and checking the configuration of the Citrix client, also failed to reveal the cause.
Finally, the use of a novel protocol analyser identified the cause. The analyser, connected to a mirror port, detected that packets were being misrouted. The router had been configured with a rule that directed traffic on the Citrix terminal server's port to the Internet via a less efficient route. The rule was only active for this specific TCP port and did not take effect when the connection was actively measured.
After its discovery, the error was quickly fixed. Overall, troubleshooting with conventional debuggers had resulted in enormous effort and costs. Diagnosis was difficult because the symptom, the hanging Citrix clients, had no obvious connection to the cause, the faulty routing. Only the use of inventive measurement techniques enabled swift troubleshooting.
PLC Reports Network Problem
A final example comes from the complex IT network of a large automobile production facility. At this location, data was sent from the PLC (Programmable Logic Controller) to switches via the firewall and an uplink to a data centre. For weeks, the team that managed the network up to the uplink was faced with a recurring problem: the production line repeatedly came to a standstill. The PLC reported "network errors." Production was negatively affected as a consequence.
Where established analysis systems had reached their limits, a latest-generation network analyser installed directly on the uplink extracted the packets that the system had processed immediately before the ‘network error’ message. Using Pcap analysis, the team was able to isolate the error in minutes.
The client sent retransmissions because the server did not respond, and re-established the connection after three minutes. The reason was that the server could not be reached. The system administrator in the production area was able to pass on the information to the responsible department.
The analysis tool that was used did not require any network reconfiguration and so had no adverse effect on network traffic. If, as in this example, the error is of a transient nature, the analysis tool must be able to perform measurements to the nearest second or, better still, millisecond. Inaccurate readings can make it much more difficult or even impossible to trace such issues.
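The retransmission pattern described above can be spotted in decoded traffic with a few lines of code. The following is a minimal sketch, assuming the TCP segments have already been decoded into simple records. The Segment type and find_retransmissions are hypothetical names, and the check is deliberately simplified: it only treats a data segment as retransmitted when the same sender repeats the same sequence number and length, so zero-length segments such as repeated SYNs are not covered.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    ts: float     # capture timestamp in seconds
    src: str      # "ip:port" of the sender
    dst: str      # "ip:port" of the receiver
    seq: int      # TCP sequence number
    length: int   # TCP payload length in bytes


def find_retransmissions(segments):
    """Group data segments that repeat the same (src, dst, seq, length).

    Returns {key: [timestamps]} for every key seen more than once.
    """
    seen = defaultdict(list)
    for seg in segments:
        if seg.length == 0:
            continue  # pure ACKs carry no data and legitimately repeat seq
        seen[(seg.src, seg.dst, seg.seq, seg.length)].append(seg.ts)
    return {key: sorted(ts) for key, ts in seen.items() if len(ts) > 1}
```

The gaps between the returned timestamps show how long the client kept retrying, which is only meaningful if the capture timestamps are precise enough, as noted above.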
Other sources of error, although not based on a complex architecture, can be equally difficult to identify. They are often created by users. Here is an example from a 20-member company with a very active sales branch: at 11 o'clock every morning, the network was so slow that it was almost unusable. All standard error checks failed to find the problem. Traffic analysis measured directly in the local network finally revealed the cause: Windows software updates were regularly being downloaded at 11 o'clock in the morning. The system administrator had confused a.m. with p.m. in the time setting. Instead of being downloaded at night, the updates paralysed the network in the morning. With an advanced network debugger, the error was quickly detected, and the fix was simple.
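A recurring daily peak like this one shows up clearly once traffic volume is grouped by hour of day. The sketch below assumes the analyser can export (Unix timestamp, byte count) pairs. The function names bytes_per_hour and busiest_hour are hypothetical, and hours are computed in UTC for simplicity, so they would need shifting to local time in practice.

```python
from collections import Counter
from datetime import datetime, timezone


def bytes_per_hour(records):
    """Sum bytes per hour of day (UTC) from (unix_timestamp, byte_count) pairs."""
    totals = Counter()
    for ts, nbytes in records:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        totals[hour] += nbytes
    return totals


def busiest_hour(records):
    """Return (hour, total_bytes) for the busiest hour of day, or None if empty."""
    totals = bytes_per_hour(records)
    if not totals:
        return None
    return max(totals.items(), key=lambda item: item[1])
```

If the same hour dominates across several days, as 11 o'clock did here, a scheduled job is a likely suspect, and the next step is to look at which hosts and destinations account for the traffic in that hour.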
The examples show that novel network analysis solutions can give network administrators shortcuts and save time when problems arise. They reveal connection issues that are not always obvious. They also show that errors in internal and external networks can be found quickly and backed by supporting data, so that networks return to their designed efficiency with minimal disruption.