Specific tips to speed analysis and keep you sane

05-06-2011, 11:04 AM
Ask most networking people what methodology they use for troubleshooting networks, and odds are they don’t have one. They just try things until the problem goes away, or until the end users get so used to the slowdown that they stop complaining (and badmouth the IT department when you're not looking). Following a consistent methodology will dramatically shorten troubleshooting time and, perhaps, give you a bit more peace of mind... and a bit more time for everything else on your plate!

Here's a methodology or workflow that we've observed the best network troubleshooters follow. This methodology can be clearly defined and taught to others on your team, so that everyone is on the same page. We'll walk through a few details and pointers on each one...

1. Initial Investigation

2. Infrastructure & Path Analysis

3. Analyze Network Performance and Usage

4. Analyze the Actual Application

1. Initial Investigation

In this phase, there are two objectives: validate the user experience and determine the scope of the problem. A lot can be learned just by talking with the user to find out what they were doing when the problem occurred. (Of course, the user will ALWAYS tell you the truth, right?) We want to verify that the user is really experiencing a network, application, or server problem, and not a problem with their own local PC. We find that an "inside out" diagnosis works best: validate the network layer first, then move up or down the stack as the evidence directs you. Most techs these days can launch a remote desktop session (http://en.wikipedia.org/wiki/Session_%28computer_science%29) to validate the user's story and check local configurations without leaving their desk. Pinging from the user's PC to network and server resources validates network layer connectivity and gives you an idea of response times. (Maybe. Ping is notoriously unreliable across a routed network, and it is not a good indicator of application performance, since the host's network stack answers the ping, not the app.) But if you can't remote into their PC, or ping their PC from yours, it's time to move down the stack.
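Since ping only tells you that the host's network stack is alive, one complementary check is to time a TCP handshake to the port the application actually listens on. Here is a minimal sketch in Python, using only the standard library. The function name `tcp_connect_time` and the default timeout are our own choices for illustration, and a fast handshake still says nothing about how quickly the app itself answers a request.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=3.0):
    """Return the seconds needed to complete a TCP handshake, or None on failure.

    A None result covers refused connections, timeouts, and name
    resolution errors alike; all of them are OSError subclasses.
    """
    start = time.perf_counter()
    try:
        # create_connection resolves the name and performs the handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        return None
```

Run from the user's PC, a call such as `tcp_connect_time("app.example.internal", 443)` (a hypothetical host name) gives a rough number to compare against the same call from a PC that is not having the problem.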

But if you've validated basic connectivity, and confirmed that the user can get out to the Internet, yet the app of concern is still slow, what's next? Are other users of the app experiencing the same problem? At the same site, or at others? We need to determine if this is an isolated problem or a network-wide event. Once the problem has been validated and the scope defined, it is time to begin the analysis.

2. Infrastructure & Path Analysis

When troubleshooting the performance of a particular application, we should first investigate the foundations upon which the app relies: the availability and response time of network services, and the infrastructure supporting the app. A surprising number of app problems can be traced back to poorly functioning or misconfigured DNS (http://www.packetech.com/showthread.php?99-Domain-Name-System) services. Perhaps a configuration was changed without anyone realizing its impact; maybe a particular DNS server or servers are not responding, or are responding much more slowly than usual. Validating services like DHCP (http://en.wikipedia.org/wiki/DHCP) and DNS first can keep you from chasing your tail later in the troubleshooting process.
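A quick way to get a feel for name resolution from the user's point of view is to time a few lookups. The sketch below is one possible approach in Python. The function name `time_name_lookup` is ours, and note the assumption baked in: `socket.getaddrinfo` goes through the operating system's resolver, so results can come from a local cache or hosts file, and it cannot be pointed at one specific DNS server. For per-server testing you would need a dedicated DNS tool.

```python
import socket
import time


def time_name_lookup(hostname, attempts=3):
    """Resolve hostname several times via the OS resolver.

    Returns (sorted list of addresses seen, list of elapsed seconds).
    A failed attempt is recorded as None in the timing list.
    """
    timings = []
    addresses = set()
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            results = socket.getaddrinfo(hostname, None)
        except socket.gaierror:
            timings.append(None)  # this lookup failed
            continue
        timings.append(time.perf_counter() - start)
        # Each result is (family, type, proto, canonname, sockaddr);
        # the address is the first element of sockaddr.
        addresses.update(info[4][0] for info in results)
    return sorted(addresses), timings
```

A first lookup that is much slower than the repeats usually just means the answer was cached afterwards. Consistently slow or failing lookups are the ones worth chasing.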

Next, we want to identify and test the actual path between the user and the application, which can include front-end web servers and supporting database and app servers. Of course, your network documentation was just updated, so this is no problem, right? Right... This is where the power of a tool like the OptiView Analyzer really shines: the ability to instantly discover the devices, interconnections and paths through your network speeds this step immensely.

Once the path is known, we want to identify whether the problem is related to network usage (bandwidth hogs!). Examining the utilization along the user-app path can reveal whether some hog is using all the bandwidth and slowing response times for others. If you find an interface on the path that is "pegged," chances are good that congestion is causing the slowdown. And that brings us to...
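For reference, here is how a utilization figure is typically worked out from two readings of an interface's octet counter (for example ifInOctets or ifOutOctets polled over SNMP). This is a minimal sketch in Python under stated assumptions: the function name and parameters are ours, the counter wraps at most once between samples, and the result covers one direction only.

```python
def interface_utilization(octets_start, octets_end, interval_seconds,
                          link_speed_bps, counter_bits=32):
    """Percent utilization in one direction from two octet-counter samples.

    Assumes the counter wrapped at most once between the two samples.
    """
    if interval_seconds <= 0 or link_speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    # Modular subtraction gives the right delta even if the counter
    # rolled over from its maximum value back to zero.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    bits_transferred = delta_octets * 8
    return 100.0 * bits_transferred / (interval_seconds * link_speed_bps)
```

For example, 375,000,000 octets in 60 seconds on a 100 Mbps link is 3,000,000,000 bits out of a possible 6,000,000,000, or 50 percent. On fast links a 32-bit counter can wrap more than once between polls, so shorter intervals or 64-bit counters (`counter_bits=64`) are the safer choice.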

3. Analyze Network Performance and Usage

OK... your next question is usually "who is that, and what are they doing?" We want to identify the top users and apps that are consuming bandwidth (http://www.colasoft.com/capsa/how_to_monitor_network_traffic.php). Remember, though, that the utilization data gathered via SNMP (http://www.packetech.com/showthread.php?105-Simple-Network-Management-Protocol) from a switch (http://www.packetech.com/showthread.php?67-Switch) or router (http://www.packetech.com/showthread.php?47-Router) interface (in Step 2 above) will only contain basic (layer 2) information, and no information about protocols (http://www.packetech.com/forumdisplay.php?18-Protocol) or who is sending them. For that, you'll need to be in the path of the packets (http://www.packetech.com/showthread.php?41-Packet) with an analyzer (http://www.packetech.com/showthread.php?40-Packet-analyzer). Alternatively, if it happens to be a Layer 3 interface that supports NetFlow (or another flow technology), THEN you can see more detail about who is sending and receiving the traffic and which protocols they are using. BUT... flow data just provides more detail on usage. Flow data contains no information about response times. So if you did identify the bandwidth hog ("Gary's streaming from ESPN.com again!"), GREAT! But if there's no clear indication that excessive usage is a problem, yet the app is still slow, then it's time to analyze the real application traffic.
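To make the "who is that, and what are they doing?" step concrete, here is a small Python sketch that totals bytes per sender and protocol from a list of flow records. The record layout (dictionaries with `src`, `protocol` and `bytes` keys) is a simplified format we made up for illustration. Real NetFlow exports would first have to be decoded by a collector into something like it.

```python
from collections import Counter


def top_talkers(flow_records, n=5):
    """Sum bytes per (source address, protocol) and return the n largest.

    Each record is assumed to be a dict with 'src', 'protocol' and
    'bytes' keys; this layout is hypothetical, not a NetFlow format.
    """
    totals = Counter()
    for record in flow_records:
        totals[(record["src"], record["protocol"])] += record["bytes"]
    # most_common returns (key, total) pairs, largest total first.
    return totals.most_common(n)
```

Feeding it a few minutes' worth of records for the congested interface gives you the short list of suspects. As noted above, it tells you about usage only, not about response times.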