Finding the Real Problem

"It took some convincing for the IT staff to come around to believing that there was indeed a problem."


Technical aside:

The idea was to "ping" the servers from each location over a longer duration than the IT staff normally tested.

A "ping" sends a short string of data to a network host. Among other data that it returns, a "ping" displays the response time from that network host in milliseconds. Stable networks generally have relatively consistent response times.

Fundamentally, since the drop-outs occurred more or less at random, it was unlikely that the IT staff would show up and collect data at exactly the right time to capture an example of what the maintenance staff had been reporting.

By collecting data over a longer period of time, it was possible to identify recurring patterns in response time that would indicate choke points in network traffic between the maintenance sites and the system server.
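To make the technical aside concrete, here is a minimal sketch of that kind of long-duration logging in Python. It is an illustration only: the server name, polling interval, output file, and ping command flags are assumptions, not details from the actual project.

```python
#!/usr/bin/env python3
"""Long-duration ping logger (illustrative sketch only)."""

import csv
import re
import subprocess
import time
from datetime import datetime
from typing import Optional

HOST = "maintenance-server.example.org"   # hypothetical server address
INTERVAL_SECONDS = 30                     # how often to send a ping
LOG_FILE = "ping_log.csv"

def ping_once(host: str, timeout_s: int = 2) -> Optional[float]:
    """Send one ICMP echo request; return the round-trip time in ms,
    or None if the request failed or timed out."""
    # Flags shown are for Linux ping; other platforms use different options.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    match = re.search(r"time=([\d.]+)\s*ms", result.stdout)
    return float(match.group(1)) if match else None

def main() -> None:
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.writer(f)
        if f.tell() == 0:                 # new file: write the header row
            writer.writerow(["timestamp", "rtt_ms"])
        while True:
            rtt = ping_once(HOST)
            # Record timeouts as empty cells so the gaps stand out when graphed.
            writer.writerow([datetime.now().isoformat(),
                             "" if rtt is None else rtt])
            f.flush()
            time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```

Left running for a full workday or longer, a log like this makes recurring slow-downs and timeouts easy to spot once graphed.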

Background: One of the largest public transportation systems in the country had a centralized computer-based equipment maintenance logging system. It maintained the history of maintenance for all fleet equipment across the city. Technicians and mechanics who performed the maintenance work were responsible for logging the details of work they performed. The logging system played a critical role in fleet maintenance from risk management and operational perspectives. Staff up and down the line might need to know, for example, when the tires were changed, or when an issue with a steering wheel was first reported and resolved.

The Challenge: Periodically, as the technicians and mechanics used the logging system, they would experience situations where the system froze or dropped the session, meaning that they could not complete logging the work performed. They might have to restart the logging session, and it might drop again. This led to an enormous amount of frustration and lost time, and created situations where critical information may not have been captured completely.

The maintenance managers called the IT staff to track down and resolve what was clearly, to them, a problem. The IT staff first examined the issue at the system's software level and could find no problems. They examined the server hardware and likewise found no issues. Even when the IT staff brought out their network analyzers, they could not identify any problems in network traffic. This report-the-problem, check-the-problem, find-no-problem pattern continued for several weeks, leading to increasing frustration among the technicians and mechanics as well as the IT staff.

The Outcome: A consultant who had been brought in to work on another project happened to be in the room when the IT management team was discussing the problem. He suggested a different approach that relied on longer-term data collection. 

The longer-term data collection showed that, from many maintenance locations, network traffic to the main system server was indeed blocked periodically throughout the day. A graph of the "ping" data over time from those locations looked like the recording of a heartbeat. In this case, however, each "spike" marked a regular interval when traffic was not getting through to the system server, so the pattern was anything but healthy.
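A graph makes the pattern jump out, but even a quick tally of when the timeouts occur can reveal it. The following sketch assumes the CSV format from the logging example above; grouping drop-outs by minute of the hour is purely illustrative, since any regular interval would stand out in a similar summary.

```python
"""Summarize drop-outs from the ping log (illustrative sketch only)."""

import csv
from collections import Counter
from datetime import datetime

LOG_FILE = "ping_log.csv"

dropouts = []
with open(LOG_FILE, newline="") as f:
    for row in csv.DictReader(f):
        if row["rtt_ms"] == "":           # empty cell = ping timed out
            dropouts.append(datetime.fromisoformat(row["timestamp"]))

# Count drop-outs by minute of the hour; a regular "heartbeat" pattern
# shows up as a handful of minutes accounting for most of the failures.
by_minute = Counter(ts.minute for ts in dropouts)
for minute, count in by_minute.most_common(5):
    print(f"minute :{minute:02d} -> {count} drop-outs")
```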

It took some convincing for the IT staff to come around to believing that there was indeed a problem. After all, the evidence came from a simple statistical test rather than from the latest in network analyzer hardware. Once convinced, however, the IT staff let their analyzers run long enough to capture the anomalies in network traffic. Eventually, they tracked the real cause of the problem down to configuration errors in their network routers. Problem solved. All parties were happy.
