Indications of oversubscription or excessive load on a firewall or other network device are often not enough to prove that oversubscription is really happening, and it is often unclear how to identify and solve such issues. This document presents the basic troubleshooting steps needed to pinpoint an oversubscription problem on a Cisco ASA firewall and proposes potential solutions to overcome it. The corresponding document for the FWSM is located here.
The most important aspect of solving an oversubscription issue is identifying it. Network engineers often incorrectly attribute network problems to excessive traffic, which leads to devices like firewalls being wrongly considered the bottleneck. Other times they focus on other parts of the network when the real problem is that the firewall's processing power is not enough to handle the traffic. There can be multiple indications of load problems on firewall devices, and putting them together helps us understand whether traffic is indeed the reason for the problem or whether we should look elsewhere. That is what this section describes.
2.1 Problem nature
Oversubscription almost never shows up by itself. Most of the time it presents as another network problem that results from it, such as packet loss, slow response or drops. In general, an oversubscribed device that can't handle the load will inevitably drop some packets. Packet drops affect sensitive applications or cause TCP retransmissions, and they degrade the user experience by making transactions appear to take longer to complete. If we wanted to summarize the problems that occur due to excessive load, we would describe them as network degradation. Of course, we must be careful NOT to attribute every problem that falls under the "degradation umbrella" to load. The indications presented below help identify whether such issues should be attributed to excessive load.
A "busy" firewall device will almost always show it on its CPU. We can check the CPU use with the command "show cpu".
ASA# show cpu
CPU utilization for 5 seconds = 14%; 1 minute: 10%; 5 minutes: 10%
A CPU that stays above 80%-90% could indicate high traffic load. As a side note, the output of "show cpu profile" can also be provided to TAC so that they can identify the processes on which the CPU time is spent.
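When the CPU has to be checked repeatedly, the "show cpu" line shown above can be parsed by a small script. The following is a minimal Python sketch, not an official tool: the function names and the default 80% cut-off are assumptions chosen for illustration, and the regular expression only targets the exact line format shown above.

```python
import re

# Matches the three percentages in a "show cpu" line such as:
# CPU utilization for 5 seconds = 14%; 1 minute: 10%; 5 minutes: 10%
CPU_LINE = re.compile(
    r"CPU utilization for 5 seconds = (\d+)%; 1 minute: (\d+)%; 5 minutes: (\d+)%"
)


def parse_show_cpu(output):
    """Return the 5-second, 1-minute and 5-minute CPU percentages."""
    match = CPU_LINE.search(output)
    if match is None:
        raise ValueError("no CPU utilization line found")
    five_sec, one_min, five_min = (int(value) for value in match.groups())
    return {"5s": five_sec, "1m": one_min, "5m": five_min}


def cpu_looks_busy(cpu, threshold=80):
    """True when the 1-minute or 5-minute average reaches the assumed threshold."""
    return cpu["1m"] >= threshold or cpu["5m"] >= threshold


sample = "CPU utilization for 5 seconds = 14%; 1 minute: 10%; 5 minutes: 10%"
cpu = parse_show_cpu(sample)
print(cpu, cpu_looks_busy(cpu))
```

The check deliberately ignores the 5-second value, since a short spike on its own says little about sustained load.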
Also, CPU hogs can show when the CPU is too busy to pull packets off the line:
Interface overruns, no buffers and underruns often show that the firewall cannot process all the traffic it is receiving on its NIC. Overruns and no buffers indicate that there is too much input traffic on a given interface. The interface maintains a receive ring where packets are stored before they are processed by the ASA. If the NIC receives packets faster than the ASA can pull them off the receive ring, packets are dropped and either the no buffer or the overrun counter increments. Underruns behave similarly but relate to the transmit ring instead.
Next it is worth checking the traffic that the device is seeing. We need to clear the traffic statistics ("clear traffic" command) before checking them ("show traffic" command). We do that because we want to see the traffic while the problem is occurring and thus be able to tell if load is related to the problem under investigation. The aggregate output of "show traffic" carries information since the last reload or the last time the counters were cleared, so without clearing it first it will not tell us how much traffic the box is seeing during the time we are troubleshooting. After the "clear traffic" we let the box collect statistics for 2-5 minutes and then run "show traffic" to get the traffic the interfaces saw.
Monitoring tools and Netflow can also help on identifying traffic and connection rates.
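Once counters have been collected over a known window, converting them to throughput is simple arithmetic. The sketch below assumes the received-byte counts have been read manually for each physical interface; the interface names, byte counts and the 180-second window are hypothetical values, and summing received bytes across interfaces is only one reasonable way to approximate what the box is processing.

```python
def throughput_mbps(byte_count, seconds):
    """Convert a byte count collected over `seconds` into megabits per second."""
    if seconds <= 0:
        raise ValueError("collection window must be positive")
    return byte_count * 8 / seconds / 1_000_000


# Hypothetical received-byte counters per physical interface,
# read 180 seconds after the statistics were cleared.
window_seconds = 180
bytes_received = {
    "outside": 2_700_000_000,
    "inside": 3_150_000_000,
    "dmz": 450_000_000,
}

per_interface = {
    name: throughput_mbps(count, window_seconds)
    for name, count in bytes_received.items()
}
aggregate = sum(per_interface.values())

for name, mbps in per_interface.items():
    print(f"{name}: {mbps:.1f} Mbps")
print(f"aggregate: {aggregate:.1f} Mbps")
```

With these made-up numbers the aggregate comes to 280 Mbps, which is the figure to compare against the specifications of the platform.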
We can then calculate the aggregate throughput the device is passing by examining the traffic that all physical interfaces saw (output of "show traffic") and we will be able to understand if it is being pushed to its limits. In order to do that we need to check the device specs:
There can be long discussions about whether a firewall or any other device is hitting its traffic processing limits. Experience has shown that there is controversy about what the numbers show and what engineers consider as being close to them, so it is worth clarifying a few points.
Let's use the ASA5510 as an example. Its rated throughput is 300Mbps, as we see on the table above. So the question is, "if my ASA5510 sees about 280Mbps should it be at 100% CPU or not?". A quick answer would be "No", but we must not forget that many factors are involved in this question. In the network industry, the rated speeds of devices are produced under specific tests. These tests are repeated and an average is presented as the maximum speed. "Real-world" traffic, however, is not always the same as the traffic used in the tests.
Usually, the rated speed tests involve stateless protocols with big packets. For a TCP web browsing application the packets are much smaller, and TCP uses ACKs and is a "synchronized" protocol by nature. That adds more load to the firewall itself, which makes its maximum throughput drop. On top of that, if the ASA has http inspection configured (which does deep packet inspection for http), its maximum processing throughput would be less than 280Mbps. So even though 300Mbps is indeed a throughput the device can achieve, its real-world throughput, depending on applications, traffic nature and configuration, could in practice be less.
That is why in our performance documents we also try to provide other metrics. These include "packets per second" (pps) and what is often seen as "real-world HTTP". For example, in the ASA table we can see that the 5510 can do 190K pps (small 64-byte packets).
These metrics can also be compared against the interface statistics collected from the device in order to decide if the box is being pushed to its limits.
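To see why packet size matters, the relation between throughput and packet rate can be worked out directly. This is a rough sketch that ignores Ethernet framing overhead; the 1400-byte average packet size is an assumed value for illustration, while the 280Mbps, 64-byte and 190K pps figures come from the discussion above.

```python
def packets_per_second(mbps, avg_packet_bytes):
    """Packet rate needed to carry `mbps` with the given average packet size."""
    return mbps * 1_000_000 / (avg_packet_bytes * 8)


def throughput_at_pps(pps, avg_packet_bytes):
    """Throughput in Mbps reached at `pps` with the given average packet size."""
    return pps * avg_packet_bytes * 8 / 1_000_000


rated_pps = 190_000  # 64-byte packet rate quoted for the ASA5510

# 280 Mbps carried in large packets (1400 bytes is an assumed average size)
large = packets_per_second(280, 1400)
# The same 280 Mbps carried entirely in 64-byte packets
small = packets_per_second(280, 64)
# Throughput the box would reach at its rated pps with 64-byte packets
ceiling = throughput_at_pps(rated_pps, 64)

print(f"large packets: {large:,.0f} pps")
print(f"64-byte packets: {small:,.0f} pps (rated: {rated_pps:,} pps)")
print(f"throughput at rated pps with 64-byte packets: {ceiling:.2f} Mbps")
```

With small packets the pps figure becomes the limiting factor long before the rated Mbps figure is reached, which is why both metrics are worth checking against the collected statistics.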
Another consideration on top of traffic load for firewall devices is the number of connections and the connection rate. That is another area that can trigger disagreements. The commands we would use to see the connections on our firewall are "show conn count" and "show resource usage".
ASA5510# show conn count
2 in use, 86 most used
ASA5510# show resource usage
Resource         Current  Peak  Limit  Denied  Context
Telnet           1        1     5      0       System
Syslogs [rate]   1        293   N/A    0       System
Conns            2        86    10000  0       System
Xlates           5        116   N/A    0       System
Hosts            6        49    N/A    0       System
ASA5510-multi-context# show resource usage
Resource         Current  Peak  Limit      Denied  Context
SSH              1        1     15         0       admin
Syslogs [rate]   118      348   unlimited  0       context1
Conns            89       893   unlimited  0       context1
Xlates           150      1115  unlimited  0       context1
Hosts            15       18    unlimited  0       context1
Conns [rate]     103      4694  unlimited  0       context1
Now, let's ask one more question about the multi-context output above: "The peak connection rate is about 5K connections per second and the specifications say that the maximum supported rate is 9K conns/second. 5K is much less than 9K, so is the ASA exceeding its limits?". To answer that question we need to keep in mind that the rate mentioned in the specifications is the average rate per second. To explain it better, here are a few examples:
Let's say we have a stable rate of 9K per second. This connection rate conforms to the ASA5510 limits.
Now let's say we have 90K new conns per 10 seconds. That is also a rate of 9K per second and conforms to the ASA5510 limits.
Now let's say we have 81K new conns in 1 second and 1K in each of the next 9 seconds. That makes a total of 90K per 10 seconds, which is an average of 9K per second and conforms to the 9K conns/second limit. But the ASA was oversubscribed for 1 second while it was seeing a rate of 81K/second.
So, bursts of traffic or connections can affect the performance of a firewall even if the averages over time do not seem to exceed the limits.
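The last example can be checked with a few lines of Python. This is only an illustration of the arithmetic: the per-second samples are the ones from the example, and the variable names are arbitrary.

```python
# New connections seen in each of 10 consecutive seconds (from the example above)
per_second = [81_000] + [1_000] * 9
limit = 9_000  # conns/second quoted for the ASA5510

average = sum(per_second) / len(per_second)
peak = max(per_second)
# Indexes of the seconds in which the rate exceeded the limit
oversubscribed_seconds = [i for i, rate in enumerate(per_second) if rate > limit]

print(f"average: {average:.0f}/s, peak: {peak}/s")
print(f"seconds over the limit: {oversubscribed_seconds}")
```

The average sits exactly at the limit while the peak is nine times higher, which is why per-second samples (or peak counters) are more telling than long-term averages.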
Additionally, having few connections through the box does not necessarily mean that traffic is not high. Theoretically speaking, someone could have 10 connections passing 1Gbps each and thus oversubscribing an ASA with very few conns.
3 Mitigation / Alleviation
It is equally important to mention options for overcoming an oversubscription issue. Keep in mind that if a device is oversubscribed, it is usually best to add more processing power by using more devices or more powerful ones. Still, there might be cases where we can work around the problem after identifying the root cause and the traffic profiles. Determining the causes of oversubscription or excessive load should rely on external tools and traffic analysis.
When the CPU is high, we can try to see where it is spent and then we might be able to relieve it of the processes that take the most CPU cycles. We can collect the output of the "show process" command, wait for 1 minute and collect it once more.
We can then compute the difference in the "Runtime" column for all the processes (keep in mind that a process might show up twice or more). By sorting the differences from maximum to minimum we can see the processes that take most of the CPU. The command "show processes cpu-usage non-zero sorted", introduced in ASA 8.2, can be used instead.
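The manual comparison of the two "show process" samples can be sketched as follows. The sketch assumes the process name and Runtime value have already been extracted from each sample into (name, runtime) pairs; the sample values and the "Logger" and "ssh" entries are hypothetical, and parsing the raw command output is left out on purpose.

```python
from collections import Counter


def total_runtime(sample):
    """Sum the Runtime per process name, since a name can appear more than once."""
    totals = Counter()
    for name, runtime in sample:
        totals[name] += runtime
    return totals


def busiest_processes(first, second):
    """Return (name, runtime increase) pairs sorted from largest to smallest."""
    before, after = total_runtime(first), total_runtime(second)
    deltas = {name: after[name] - before.get(name, 0) for name in after}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (process, Runtime) pairs taken one minute apart
first_sample = [("Dispatch_Unit", 1000), ("Logger", 200), ("Logger", 50), ("ssh", 10)]
second_sample = [("Dispatch_Unit", 31000), ("Logger", 5200), ("Logger", 1050), ("ssh", 12)]

for name, delta in busiest_processes(first_sample, second_sample):
    print(f"{name}: +{delta}")
```

Summing by name before taking the difference handles the case where the same process shows up on several lines.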
There are cases where, for example, we might see an inspection process or the logging process taking most of the CPU. In such cases we can disable the inspections if they are not needed or turn down the logging level and save some CPU for the device. Please note that processes like "Dispatch_Unit" and "interface polling" relate to regular packet processing and there is not much that can be done to reduce the CPU they consume.
If the traffic hitting the firewall is excessive, we can also try to send only the necessary traffic through it. Although this solution is not practical in most setups, there might be cases where alternate routes exist for the traffic and not all packets need to be "firewalled". In such scenarios policy based routing (PBR) can be used to divert to the firewall only the traffic that needs to be "firewalled".
3.3 Optimize throughput
For the ASA5550 and ASA5580, leveraging the IO bridges appropriately might make it possible to optimize the maximum throughput of the box. Further information on how to do that on the ASA 5550 and 5580 is located here.
3.4 Flow Control
For instances where traffic is extremely bursty (e.g. 5Gbps for bursts of 5ms), packets can be dropped if the burst exceeds the buffering capacity of the FIFO buffer on the NIC and the receive ring buffers. Enabling pause frames for flow control can alleviate this issue by letting the upstream device "hold on" with the bursts. More information on how to enable flow control can be found under the corresponding model sections here.
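A back-of-the-envelope calculation shows how much data such a burst asks the buffers to absorb. The sketch below uses the 5Gbps for 5ms burst from the example; the 1Gbps drain rate is purely an assumption for illustration, and actual buffer sizes depend on the NIC and platform.

```python
def burst_backlog_bytes(burst_gbps, drain_gbps, burst_ms):
    """Bytes that pile up while data arrives faster than it is drained."""
    excess_gbps = max(burst_gbps - drain_gbps, 0)
    return excess_gbps * 1e9 * (burst_ms / 1000) / 8


# 5 Gbps burst lasting 5 ms, drained at an assumed 1 Gbps
backlog = burst_backlog_bytes(burst_gbps=5, drain_gbps=1, burst_ms=5)
print(f"backlog to buffer: {backlog / 1e6:.2f} MB")
```

If the backlog is larger than what the FIFO and receive ring can hold, the excess is dropped unless the upstream device is asked to pause.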
3.5 Active/Active failover
When two firewalls are used in failover in Active/Standby mode and the Active unit cannot handle the traffic, you might be able to temporarily use an Active/Active setup to share the traffic between both units. You would need to have the firewalls in multi-context mode and have one or more contexts active on the primary unit and one or more contexts active on the secondary. That way both firewalls will be passing traffic (for the contexts in which they are active) and might not be oversubscribed. Remember, though, that if one of the units fails, all contexts (and thus all traffic) will run on one unit and you will be back to an oversubscribed scenario. Active/Active failover for oversubscription cases should only be used (if used at all) as a temporary solution and with caution, until a permanent solution is put in place.