Path measurement

From Allegro Network Multimeter Manual
Revision as of 10:32, 3 June 2021 by Ralf (talk | contribs)

The path measurement module allows you to passively measure the packet loss and latency between two Allegro Network Multimeter installations.

For example, a network connection (line/link) between the main office and a remote office can be analyzed by installing one Multimeter at the main office and another Multimeter at the remote office.

Only network traffic (packets) passing through both Multimeters can be analyzed. The packet loss and two-way latency of this traffic are measured and shown in graphs.

The time synchronization setting (e.g. NTP/PTP or OFF) should be the same on both devices for the best results.

With firmware version 3.3 or later, it is also possible to use a single device to measure the traffic delay and loss between two different virtual link groups. In this mode, the primary device also acts as the client.

Overview

The main device captures packet metadata from the remote (or client) device, which amounts to only a fraction of the total traffic. Approximately 5% additional bandwidth is required for this capture connection. For a fully loaded 100 Mbit/s connection to the remote location, an additional load of ~5 Mbit/s is therefore required to transfer the packet information to the main device. The measurement connection can be a separate line or can run over the line that is being measured; the capture connection is automatically ignored for the measurement.

The measurement module must be configured with a maximum packet delay. This delay describes the amount of time the main device waits for packet information to arrive from the remote device. The delay must be large enough to cover the actual latency of the connection plus the delay of the capture connection. Typical values are between 2 and 5 seconds. Larger values require more memory to buffer packet metadata, so very large values might only be selectable on larger Multimeter devices (Allegro 1000 or greater).
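The 5% rule of thumb for the capture connection can be turned into a quick estimate. The following Python sketch is only an illustration: the function name and the fixed 5% factor are assumptions derived from the approximation stated in this section, not exact figures reported by the device.

```python
def capture_overhead_mbit(link_load_mbit: float, overhead_fraction: float = 0.05) -> float:
    """Estimate the extra bandwidth (Mbit/s) needed for the capture connection.

    overhead_fraction defaults to the ~5% rule of thumb; real overhead
    depends on the traffic mix and is only approximated here.
    """
    if link_load_mbit < 0:
        raise ValueError("link load must not be negative")
    return link_load_mbit * overhead_fraction


if __name__ == "__main__":
    for load in (100, 500):
        extra = capture_overhead_mbit(load)
        print(f"{load} Mbit/s measured traffic needs roughly {extra:.0f} Mbit/s extra")
```

If the capture connection runs over the measured line itself, this estimate has to fit into the remaining capacity of that line.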

Configuration

Ap-mm-path-measurement-1.png

Settings

  • Enable analysis: This enables or disables the path measurement feature. When disabled, no additional memory is used. When enabled, memory is reserved for the packet buffer and cannot be used by other analysis modules, which reduces the maximum time span the device can go back in time.

Primary device settings:

  • Description of this main device: This field is used for informational purposes only, to identify the main device. It can be chosen freely, for example the location where the device is installed. The field can also be left empty to use the default name main.
  • Primary device VLG filter: It is possible to limit the amount of data analyzed to any configured virtual link group. Often, the primary device is located at a central network point and thus sees a lot of traffic that is not actually going to the remote device. The algorithm will automatically take this into account, but using a filter will reduce the processing overhead as well as the amount of data that needs to be buffered.

The following Client device configuration section configures the access to the remote device:

  • Device to use: To use a remote device for path measurement, you first need to add that device as a remote device to the list in the Multi-device settings. It does not matter whether the device is active or not. You can then select the device from the list of known multi-devices. You can also select the primary device as a client to analyze the traffic between two different virtual link groups.
  • Device description: Similar to the description of the main device, this field is for informational purposes only and has no effect other than helping to identify the remote device in the statistics. Usually the location of the remote device is entered.
  • Client device VLG filter: The traffic used for comparison at the main device can be filtered to any virtual link group defined at the client device. There are two main purposes for this setting:
    1. reduce the amount of data that must be transferred to the main device: The path measurement only considers connections seen on both devices, but the client device cannot know whether a connection it sees is also visible on the main device. If only traffic of a specific virtual link group (VLG) actually reaches the main device, using this filter can reduce the amount of data that is transferred and later dismissed.
    2. filter duplicate traffic: If, for some reason, traffic is seen multiple times, it can create wrong results as the number of occurrences differs from main to client devices. A VLG filter can fix this problem by only considering one part of the total traffic.


Measurement settings:

  • Maximum packet delay: This field describes the maximum number of seconds to wait for packet information from the remote device.
The main device waits for this number of seconds before deciding whether a packet has been lost. If the data from the remote device arrives within this time, the path measurement can account for the packet loss, if any, and the two-way latency. This value must be at least as large as the worst-case latency between both measurement sites.
Usually 3 seconds are more than enough, but if the network in between can have a very long delay, you can increase the value. This will, however, use more main memory for the packet buffer.
  • Ignore IP identification field: This option can be enabled if the IP identification field in the IPv4 header is modified by some component in the network. Usually the field remains unchanged for a packet along its path, so this option should be left disabled, as enabling it also increases the chance of reporting duplicate packets. If you notice symmetrical packet loss, you can enable this option to see whether it helps.
  • Ignore VLAN tags for connection matching: The path measurement only calculates loss and latency for connections seen on both devices. Usually the connection ID takes the IP pairs, ports, and possible VLAN tags into account. If the VLAN differs between both devices for the same connection, this option must be enabled to be able to correlate the connection and calculate correct statistics.
  • Account latency also per IP connection: Enabling this option will let the path measurement also store the latency for each individual IP connection, which of course increases the memory usage.

The settings must be saved, but a restart of the packet processing is necessary for them to actually take effect. If this step is required, a note is shown at the bottom of the page under Required actions.

Parameters currently in use

This section shows the current state of the measurement engine. The engine might be inactive even if the feature is enabled. Usually a restart is required to actually make it active. If active, the current packet delay is shown. It might be different from the selected value in the configuration above, but if so a note appears that a restart is required.

Required actions

An info box appears if a restart of the packet processing is required. The link shown leads to the page Settings → Administration, where the restart can be triggered. The device itself does not need to be rebooted; only the packet processing must be restarted, which usually takes only a few seconds.

Measurement statistics

Ap-mm-path-measurement-2.png

The measurement tab shows the real-time results of the ongoing measurement. At the top, the current state of the measurement engine and of the remote connection is shown. The measurement status can be not running if it is disabled, warming up if the engine is waiting for synchronization with the remote device, and running if it is actually measuring data. The remote client status indicates whether the connection to the remote device is established. Since the packet information is gathered via regular capturing from the remote device, the capture connection is visible in the capture section of the remote device and might be stopped there. If the measurement connection is stopped or stops working for other reasons (remote device unavailable, etc.), the status box turns red and a button appears to reconnect to the remote device. If the reconnect fails, an error message appears with detailed information about what went wrong.


Typical errors are:

  • remote device inaccessible (are the IP and port settings correct?)
  • authentication error (invalid credentials?)

When both boxes are green, the measurement is running and the four graphs show the real-time results.

Two-Way-Latency

The first graph shows the latency measured from the main device to the remote device and back. Because the local time sources are not synchronized, it cannot measure the one-way latency of a single packet, only the combined duration of packets going in both directions. Example: Assume a packet A is seen travelling from the main to the remote device and another packet B is seen travelling from the remote to the main device. The time difference between packet A being seen on the main and on the remote device, plus the time difference between packet B being seen on the remote and on the main device, gives the two-way latency. Packet A and packet B do not need to be related in any way. If traffic is going in only one direction, the measurement will not show any latency result (even though packet loss is still visible). For each second, the average, minimum, and maximum two-way latency is recorded and shown in the graph. To the left of the graph, the statistics for the visible time range are shown; changing the zoom level or time interval updates the values accordingly.
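The packet A / packet B example can be written out as a small calculation that shows why an unknown clock offset between the devices does not affect the two-way latency. This is a minimal Python sketch under stated assumptions: the Observation type, the function name, and the timestamp values are hypothetical and do not reflect the device's internal implementation.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    main_ts: float    # time the packet was seen, by the main device clock (s)
    remote_ts: float  # time the packet was seen, by the remote device clock (s)


def two_way_latency(a: Observation, b: Observation) -> float:
    """a travels main -> remote, b travels remote -> main.

    Each one-way difference contains the unknown clock offset, once with a
    positive and once with a negative sign, so the offset cancels in the sum.
    """
    forward = a.remote_ts - a.main_ts    # true delay + offset
    backward = b.main_ts - b.remote_ts   # true delay - offset
    return forward + backward


if __name__ == "__main__":
    offset = 3600.0  # assumed: remote clock runs one hour ahead of the main clock
    a = Observation(main_ts=100.000, remote_ts=100.010 + offset)  # 10 ms one way
    b = Observation(main_ts=200.015, remote_ts=200.000 + offset)  # 15 ms back
    print(f"two-way latency: {two_way_latency(a, b) * 1000:.1f} ms")
```

Neither one-way difference is meaningful on its own here (one is roughly +3600 s, the other roughly -3600 s), which is why a one-way result requires both observations to come from the same clock.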

One-Way-Latency

If the path measurement is used on a single device (by selecting the primary device as client device too), the one-way latency is also shown for each direction.

Lost packets

The second and third graphs show the number of lost packets in each direction. Lost packets are only counted for connections that have been seen on both devices. Depending on the installation point and routing setup, some connections might intentionally not be routed to the second device. These connections are not counted as loss on the other device. The second graph counts all packets that have been seen on the remote device but are missing on the main device. That means that those packets were lost on their way to the main device. Accordingly, the third graph counts all packets that have been seen on the main device but are missing on the remote device. The graph also contains a line for packets that have been dropped by the client due to overload. If this value is not zero, those packets are counted as packet loss even though they might not actually be lost. For correct measurements, make sure the graph for remote packet drops always stays at zero. These drops may happen for several reasons:

  1. System capture overload: If multiple captures are running in parallel, the CPU might be overloaded. Check the All tab in the Capture page to see how many captures are running. In the best case, there is only the one capture connection to the main device.
  2. The capture connection is encrypted with SSL. The small Allegro 200 has a limited encryption capacity, so this can become a bottleneck for high traffic volumes. The only solution is to use a more powerful Multimeter.
  3. Capture drops can also occur if the network connection is not capable of transferring the data fast enough. As a rule of thumb, approximately 5% of the total traffic is used for the measurement connection. For example, if the traffic is 500 Mbit/s, the measurement requires ~25 Mbit/s of bandwidth on the management port.

The fourth graph shows all packets that are monitored for the path measurement. This will cover all connections that have been seen on both devices.

IP statistics

The second tab shows packet loss information for each pair of IP addresses. This statistic covers all IP connections that have been seen on both measurement sides. The table shows the number of packets that have been counted for each communication pair. Additionally, the number of packets seen on the main device and the corresponding packet loss are shown. The same statistics are shown for the client device too. You can click on an IP address to go to the detailed statistics of the IP module and check which kind of traffic occurred for that IP. Two graphs are shown for each IP pair: one shows the packet loss for both directions, the other shows the total packets. There is also a capture button to capture traffic for the IP pair. The captured traffic is only the traffic seen on the main device; it will not contain any packets from the client device, as the main device does not have the packet data available. To capture traffic from the client device, you have to go to the web interface of the client device and start a capture there.

The IP pair table also shows the two-way latencies for the traffic of each IP pair if the corresponding toggle is selected above the table. In single-device mode, the one-way latency is shown as well.

For each IP, there is also a link to the IP connections. If enabled, each individual IP connection also stores the latencies for a more detailed view.

Switching graph modes

The toggle buttons above the graphs allow you to switch the graph mode from absolute values to relative values. In relative mode, the lost packets are shown in relation to the total (monitored) traffic. The second option shows the throughput in Mbit/s instead of the packet rate.
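The two graph modes correspond to simple conversions. The Python sketch below illustrates them under stated assumptions: the function names are hypothetical, the relative value is expressed as a percentage, and 1 Mbit is taken as 10^6 bit. The device may round or scale its values differently.

```python
def relative_loss_percent(lost_packets: int, monitored_packets: int) -> float:
    """Lost packets in relation to the total monitored packets, in percent."""
    if monitored_packets == 0:
        return 0.0  # no monitored traffic, so no meaningful ratio
    return 100.0 * lost_packets / monitored_packets


def throughput_mbit(bytes_per_second: float) -> float:
    """Convert a byte rate into Mbit/s (assuming 1 Mbit = 1,000,000 bit)."""
    return bytes_per_second * 8 / 1_000_000


if __name__ == "__main__":
    print(relative_loss_percent(5, 1000))  # 5 lost out of 1000 monitored packets
    print(throughput_mbit(1_250_000))      # 1.25 MB/s expressed in Mbit/s
```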

Limitations

The path measurement has some limitations:

  1. Due to technical reasons, large clock adjustments cannot be filtered out, so in such cases a very large two-way latency is measured. The devices do not need to be time synchronized per se, but considerable clock deviations must be avoided. This means that time synchronization (e.g. NTP/PTP) should either be enabled or disabled on both devices for best results. Mismeasurements caused by clock deviations are, however, one-time events and will not lead to false values for the following packets.
  2. The maximum supported packet size for the path measurement is currently 2048 bytes. Larger packets are truncated for the measurement.
  3. NAT setups and different VLAN combinations on main and client are not supported at the moment. Such flows will be accounted as unmonitored flows in the debug view.
  4. WAN optimizers and similar devices that rewrite some of the traffic are not supported either. If packet data is changed (for example by modifying the TCP header or adding TCP options), the flow will show packet loss on both sides, as the original packets are not seen on the other side. If the device in between also modifies the IP addresses or ports, the flows will be counted as unmonitored.
  5. The global setting for the packet length accounting should be set to the same value on both devices. Otherwise identical packets might be considered different because of different length and the bandwidth information will be inconsistent.

Typical use cases

See Analyze connections between remote sites to get a detailed overview of use cases and device setup.

Debug information

The debug information tab shows additional statistics which are usually only relevant for identifying problems in the path measurement, either program errors or test setup errors.

  1. Monitored flows seen on both devices: The monitored flows are all IPv4 and IPv6 connections that have been seen on both devices and are used for calculating the latency and packet loss. Only this traffic can be considered for the actual measurement. In a working setup, the value must be non-zero.
  2. Flows seen on both devices without matching packets: If a flow is seen on both devices but not a single packet matches on both sides, it indicates a potential network setup problem. This probably means the packet is somehow modified by a device in between both measurement points. This setup is not supported. Usually this value should be zero. Small non-zero values can be ok, if the first number of monitored flows is much larger.
  3. Unmonitored flows seen only on main: This counter shows the number of IP connections that are only visible at the main device. It means that for those connections no matching client packet has been received. If the main device also sees network traffic that is not routed to the client device, this value can be non-zero.
  4. Unmonitored flows seen only on client: This is the same counter as for the main device, but counting the connections on the client device that have not been seen on the main. Again, if the client device sees traffic that is not routed to the main, it is fine to see non-zero values here.


Possible problematic scenarios:

  • There is a device between main and client that modifies the traffic (like a WAN optimizer): You will notice a larger value for counter 2 (flows without matching packets) and an almost zero value for counter 1 (flows seen on both devices).
  • There is a device between main and client that changes ports and IP addresses (a NAT): You will notice almost zero values for counters 1 and 2, but high values for counters 3 and 4.


Neither scenario is supported by the path measurement.

Please adjust the test setup to disable any device modifying the network as described above.
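The two problematic scenarios can be expressed as a rough check on the four debug counters. The following Python sketch is a heuristic only: the function name, the returned strings, and the 5% threshold used for "almost zero" are assumptions chosen for illustration and are not values used by the Multimeter.

```python
def diagnose(monitored: int, unmatched: int, only_main: int, only_client: int) -> str:
    """Interpret debug counters 1 to 4 (in that order) with a simple heuristic."""
    total = monitored + unmatched + only_main + only_client
    if total == 0:
        return "no traffic"
    small = 0.05 * total  # hypothetical threshold for "almost zero"
    if monitored <= small and unmatched > small:
        # Flows are seen on both sides, but no packet matches.
        return "packet modification suspected"
    if (monitored <= small and unmatched <= small
            and only_main > small and only_client > small):
        # Flows do not correlate at all: addresses or ports differ.
        return "NAT suspected"
    return "no obvious setup problem"


if __name__ == "__main__":
    print(diagnose(2, 500, 10, 10))     # WAN optimizer pattern
    print(diagnose(0, 0, 300, 280))     # NAT pattern
    print(diagnose(1000, 3, 50, 40))    # healthy pattern
```

Non-zero values for counters 3 and 4 alone are not a problem if one device also sees traffic that is never routed to the other device, which is why the sketch only flags them when counter 1 is close to zero.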


The table below shows the following counters for the main and remote device:

  1. The counter for packets seen on all devices measures the total number of packets monitored and considered for the analysis.
  2. The counter for packets seen on only one device indicates how many packets were lost on the way to the other device.
  3. Duplicated packets: This counter includes packets that are duplicated or have the same checksum. It is valid to see non-zero values here. Some traffic, such as broadcasts, does not differ in the payload, so the packet checksum will be identical. If such packets appear within the packet delay time window, each one is counted as a duplicate of the previous one.
  4. Failed to process on main device: This counter indicates that packets from the client have been discarded due to overload of the main device. The main device was not fast enough to process the client packets. This usually means the local packet rate (at the main device) is too high.
  5. Ignored on main device: These packets are ignored because the flow is unknown to the main device. This happens when the packet checksum is received from the client but no connection information for that packet is known to the main device. This value should always be zero. Otherwise it means that the number of active flows is too high.
  6. Packets processed too early: This counter covers packets that could not be stored long enough to reach the configured packet delay limit. This happens when the packet rate is higher than the supported packet rate of the main device.


Below the table, two graphs showing time drift information are visible. The first graph shows the packet delay. It is the time between a matching packet from the main device and the client. This value describes for how long the main device needed to wait to get a matching packet from the client. This value should always be much lower than the maximum packet delay configured in the path measurement configuration. The value cannot be larger than the maximum as then packets can no longer be matched. If the value keeps reaching the maximum, two problems are possible:

  1. The delay between main device and client is large due to generic network delay. For example, if a high-latency connection is used for path measurement, it can even take a few seconds for a packet info to arrive. Configure a larger maximum packet delay.
  2. The bandwidth of the connection from the client (the client’s upload speed) is too small to satisfy the requirement for the checksum connection. This problem can be identified if even increasing the maximum packet delay does not help. If the bandwidth is too small, the packet will hit the maximum delay for any value configured, it will just take a little longer.


In this case try to use an alternative network connection to connect to the client device.

The second graph shows the time drift between the main device and the client device. Usually there will always be a drift between the clocks of both devices (if they are not synchronized by some means). Even large drifts (hours, days, etc.) are typically not a problem, as the drift cancels out in the two-way latency. But if the drift increases dramatically (by multiple seconds) and constantly over a long period of time, it usually indicates a bandwidth overload, just as in the first graph.

FAQ

  1. What does the note Network setup problem detected: Packet modification or complete loss mean?
    This message box appears if flows have been identified for which not a single packet could be seen on both sides. Usually this means that some device between the two measurement points modifies the packets. This can be a WAN optimizer which rewrites TCP connections for improved network performance. Such a setup is not supported.
    It can also mean that some other packet field is modified at some point in the network. One field that is known to be modified is the IP identification field in the IPv4 header. For this case, an additional option can be enabled to ignore this field.
  2. What does the note VLAN tag mismatch detected for matching packets on main device and client mean?
    This message indicates that matching packets have been seen on both devices but the used VLAN tag is different. Depending on the measurement setup, this may indicate an error as for connection matching the IP addresses, ports, and VLAN tags are used unless the option to ignore the VLAN tag is enabled. Therefore, the identical packets will be accounted for two different connections often resulting in shown packet loss.
    Often, the recommended solution is to enable the configuration option Ignore VLAN tags for connection matching.
  3. What kind of packet information is used to determine latency and packet loss?
    Both measurement devices calculate checksums starting from the layer 3 packet data to compare packet information on both sides. This means for IPv4 and IPv6 traffic, the Ethernet header including possible VLAN tags is ignored. For non-IP traffic, the complete layer 2 packet is used so this traffic can only be analyzed in switched networks.
  4. What can I do if I think the packet loss is wrong?
    Often, incorrect loss is reported because there is some kind of packet modification that is not supported. Check the list of limitations above to see whether any of them apply to your setup. Also, try the configuration options to ignore some packet fields to see if that makes any difference.
    You can contact our support and if possible provide a small capture from both network sides that cover the same traffic. This will help us to find the reason for the shown packet loss.
  5. How can I distinguish between real packet loss and reported packet loss due to packet modifications?
    Packet loss is reported when a packet checksum does not match any checksum reported by the other side within the configured maximum packet delay timeout. Possible reasons for such an event are:
    1. Actual loss
      The packet has been seen on one side but not on the other. In this case the loss graph usually differs between the main device and the remote device.
    2. Packet modification
      Some component modifies the packet in some way. Since such a modification is usually done in both directions (sender and receiver), the packet loss is visible on both sides. Therefore the loss graph is symmetrical.
    3. Overloaded management connection
      The management connection is used to get the packet checksums from the remote device. If the connection is not fast enough to transmit a checksum within the configured maximum packet delay time, the packet is reported as lost.
      This can be verified by looking at the debug graph Packet delay between local and remote packets. In this case, the time will always be at the upper limit, which is the maximum packet delay time.
    4. Temporary network failure
      In this case the packets are really lost in both directions, so the loss graphs may also look similar. However, since no packet can be transmitted to the other side, you will also see no entries in the two-way-latency graph.
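The matching rule described in this answer (a packet counts as lost if no identical checksum is reported by the other side within the maximum packet delay) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function name and data layout are hypothetical, checksums are treated as opaque strings, duplicates are ignored, and all times are taken from the main device's clock. The actual checksum algorithm and buffering of the Multimeter are not described here.

```python
def match_packets(main_seen, remote_reports, max_delay):
    """Split packets seen on the main device into matched and lost.

    main_seen:      dict mapping checksum -> time the main device saw the packet
    remote_reports: list of (arrival_time_at_main, checksum) from the remote device
    max_delay:      configured maximum packet delay in seconds
    """
    first_report = {}
    for arrival, checksum in remote_reports:
        first_report.setdefault(checksum, arrival)  # keep the first report only

    matched, lost = [], []
    for checksum, seen_at in main_seen.items():
        arrival = first_report.get(checksum)
        if arrival is not None and abs(arrival - seen_at) <= max_delay:
            matched.append(checksum)
        else:
            # Either never reported (real loss or modified packet) or
            # reported too late (overloaded management connection).
            lost.append(checksum)
    return matched, lost


if __name__ == "__main__":
    main_seen = {"a": 0.0, "b": 1.0, "c": 2.0}
    remote_reports = [(0.5, "a"), (5.5, "b")]  # "b" arrives late, "c" never
    print(match_packets(main_seen, remote_reports, max_delay=3.0))
```

In this sketch, a late report and a missing report both end up in the lost list, which mirrors why an overloaded management connection shows up as packet loss and has to be distinguished using the packet delay debug graph.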