Firewall Tips & Tricks and all about Network: How To Perform a SecurePlatform Firewall Health Check Part 1

Date, System Uptime and Clock:

Confirm the correct date is set on the system using the ‘date’ command.

The system uptime can be examined using the command:

uptime

Example output:

Zulu# uptime

09:46:34 up 124 days, 9:40, 1 user, load average: 0.36, 0.19, 0.14

A low uptime normally indicates that the firewall has been administratively rebooted, but it may also be the result of a self-reboot, for example after a panic.

If you suspect the uptime is lower than it should be, check the /var/log/messages file for the reason for the last reboot.
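As a rough sketch of automating this check, the snippet below converts the contents of /proc/uptime (the standard Linux file holding seconds since boot) into days and flags a low value. The function names and the 7-day threshold are illustrative assumptions, not part of SecurePlatform, and the sketch assumes Python is available on whichever machine processes the value.

```python
def parse_uptime_days(proc_uptime_text):
    """Return uptime in days from the text of /proc/uptime."""
    # /proc/uptime holds two numbers: seconds since boot and idle seconds.
    seconds_since_boot = float(proc_uptime_text.split()[0])
    return seconds_since_boot / 86400.0


def is_low_uptime(proc_uptime_text, min_days=7.0):
    """True when uptime is below min_days (an arbitrary example threshold)."""
    return parse_uptime_days(proc_uptime_text) < min_days


# 124 days, 9 hours and 40 minutes expressed in seconds, as in the example above.
SAMPLE = "10748400.00 9000000.00"
print(round(parse_uptime_days(SAMPLE), 1), is_low_uptime(SAMPLE))
```

If is_low_uptime() returns True for an appliance that should have been running for months, /var/log/messages is the next place to look.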

Disk Space

The disk space usage can be examined using the command:

df -k

Example output:

[Expert@Zulu]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda5               600832    187800    382512  33% /
none                    600832    187800    382512  33% /dev/pts
/dev/sda1               147766     10124    130013   8% /boot
/dev/sda7              1541680    930324    533044  64% /opt
none                   2045688         0   2045688   0% /dev/shm
/dev/sda6              1541680    593844    869524  41% /sysimg
/dev/sda8             27024000   5472984  20178264  22% /var
[Expert@Zulu]#

In the above example, all partitions are under 70% usage.

If a partition has a 'Use%' that is more than 70% but less than 90%, and particularly if the 'Use%' is 90% or more, see if the partition can be cleaned up to free up disk space:

/var/opt/CPsuite-RXX/fw1/log may be filled with old log files if the firewall has been logging locally.

/var/log may contain old messages files.
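The thresholds above can be applied mechanically to captured 'df -k' output. The following is a minimal sketch; the function name classify_df and the labels 'ok', 'warning' and 'critical' are made up for illustration, and it assumes each filesystem is printed on a single line as in the example output.

```python
def classify_df(df_output, warn=70, crit=90):
    """Map each mount point in `df -k` output to 'ok', 'warning' or 'critical'."""
    results = {}
    for line in df_output.splitlines():
        fields = line.split()
        # Data rows have six fields; the fifth is Use%, e.g. "64%".
        if len(fields) != 6 or not fields[4].endswith("%"):
            continue
        percent_text = fields[4][:-1]
        if not percent_text.isdigit():
            continue
        used = int(percent_text)
        if used >= crit:
            results[fields[5]] = "critical"
        elif used > warn:
            results[fields[5]] = "warning"
        else:
            results[fields[5]] = "ok"
    return results


SAMPLE = """Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda5 600832 187800 382512 33% /
/dev/sda7 1541680 930324 533044 64% /opt
/dev/sda8 27024000 5472984 20178264 22% /var
"""
print(classify_df(SAMPLE))
```

With the example output from this article every partition is classified as 'ok', which matches the observation that all partitions are under 70% usage.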

Physical RAM and Swap Space:

Examine the RAM and swap space usage (kilobytes) with:

free -k -t

Example output:

[Expert@Zulu]# free -k -t
             total       used       free     shared    buffers     cached
Mem:       2058236     971332    1086904          0      95104     268984
-/+ buffers/cache:     607244    1450992
Swap:      4192944          0    4192944
Total:     6251180     971332    5279848
[Expert@Zulu]#

The 'total' column shows the amount of RAM installed in the system (2 GB in the above example) and the amount of disk space allocated for swap space (4 GB).

The amount of swap space is normally set automatically to twice the size of the physical memory, with 4 GB being the maximum.

The 'used' column indicates how much RAM and swap space are being used.

The 'free' column indicates how much RAM and swap space are available.

In the above example output, the 'used' column indicates that less than 1 GB of RAM is being used and no swap space is being used.

If for some reason the amount of free RAM becomes low, the appliance will start to preserve free RAM by swapping out the contents of the memory to the hard disk (swap space). Performance will be sub-optimal while swap space is in use because of the time and resources spent writing to and reading from the hard disk.

Example Output:

[Expert@Zulu]# free -k -t
             total       used       free     shared    buffers     cached
Mem:       2055120    1897424     157696          0      98732     697688
-/+ buffers/cache:    1101004     954116
Swap:      4192912     735980    3456932
Total:     6248032    2633404    3614628
[Expert@Zulu]#

Swap space usage may indicate that not enough memory is installed in the appliance. The kernel is 32-bit and can use up to 4 GB. It is recommended to upgrade the memory if less than 4 GB of RAM is installed.

For further information about the amount of RAM that is supported by SecurePlatform refer to:

sk22343: What is the maximum memory supported by SecurePlatform?
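To spot swap usage without reading the table by eye, the 'Swap:' row of captured 'free -k -t' output can be parsed. This is only a sketch: swap_used_kb is a hypothetical helper name, and it assumes the column order shown in the examples above (total, used, free).

```python
def swap_used_kb(free_output):
    """Return the 'used' value (in KB) from the Swap: row, or None if absent."""
    for line in free_output.splitlines():
        fields = line.split()
        # The Swap: row reads: Swap: <total> <used> <free>
        if len(fields) >= 4 and fields[0] == "Swap:":
            return int(fields[2])
    return None


HEALTHY = """Mem: 2058236 971332 1086904 0 95104 268984
Swap: 4192944 0 4192944
Total: 6251180 971332 5279848
"""
SWAPPING = """Mem: 2055120 1897424 157696 0 98732 697688
Swap: 4192912 735980 3456932
Total: 6248032 2633404 3614628
"""
for name, text in (("healthy", HEALTHY), ("swapping", SWAPPING)):
    print(name, swap_used_kb(text))
```

Any non-zero result is a prompt to look at memory usage more closely, as described above.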

Memory Usage

The firewall's memory usage can be examined by using the command:

fw ctl pstat

The output of this command is vast and can be difficult to understand as not all the output is intuitive. The statistics that need to be checked to ensure memory is healthy are:

· hash kernel memory 'hmem'

· system kernel memory 'smem'

· kernel memory 'kmem'

Example output:

[Expert@Zulu]# fw ctl pstat | more

Machine Capacity Summary:
  Memory used: 7% (128MB out of 1638MB) - below low watermark
  Concurrent Connections: 21% (43253 out of 199900) - below low watermark
  Aggressive Aging is not active

Hash kernel memory (hmem) statistics:
  Total memory allocated: 142606336 bytes in 34782 4KB blocks using 34 pools
  Initial memory allocated: 20971520 bytes (Hash memory extended by 121634816 bytes)
  Memory allocation limit: 335544320 bytes using 512 pools
  Total memory bytes used: 39254196 unused: 103352140 (72.47%) peak: 133739228
  Total memory blocks used: 10335 unused: 24447 (70%) peak: 32795
  Allocations: 3375437074 alloc, 0 failed alloc, 3375001310 free

System kernel memory (smem) statistics:
  Total memory bytes used: 188577580 peak: 227270504
  Blocking memory bytes used: 1958392 peak: 2205256
  Non-Blocking memory bytes used: 186619188 peak: 225065248
  Allocations: 979925174 alloc, 0 failed alloc, 979924513 free, 0 failed free

Kernel memory (kmem) statistics:
  Total memory bytes used: 84876956 peak: 177110948
  Allocations: 3375820431 alloc, 0 failed alloc, 3375384380 free, 0 failed free
  External Allocations: 0 for packets, 31589936 for SXL

In the above example there are no failed hmem, smem or kmem allocations.

The presence of 'hmem' failed allocations indicates that the hash kernel memory was full. This is not a serious memory problem, but it does indicate a configuration problem. The value assigned to the hash memory pool (either manually, or automatically by changing the number of concurrent connections in the capacity optimization section of a firewall) determines the size of the hash kernel memory. A low hmem limit leads to improper usage of the OS memory. See 'Capacity Optimization' in the 'Firewall Health Checks' section for further information.

The presence of 'smem' failed allocations indicates that the OS memory was exhausted or that there are large non-sleep allocations. This is symptomatic of a memory shortage. If there are failed smem allocations and less than 2 GB of memory is installed, upgrading to 2 GB may fix the problem. Decreasing the TCP end timeout and decreasing the number of concurrent connections can also help reduce memory consumption.

The presence of 'kmem' failed allocations means that some applications did not get memory. This is usually an indication of a memory problem, most commonly a memory shortage. (The natural limit is 2 GB, since the kernel is 32-bit.)

A memory shortage sometimes indicates a memory leak. To troubleshoot a memory shortage, stop the load and let connections close. If the memory consumption returns to normal, you are not dealing with a memory leak; such a shortage can happen when traffic volumes are too high for the device capacity. If the memory shortage appeared after a change in the system or the environment, undo the change and check whether kmem memory consumption goes down.
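Because the 'fw ctl pstat' output is long, it can help to extract just the failed allocation counters. The sketch below works on saved output and assumes the section headings and 'Allocations:' lines look like the example in this article; failed_allocations is a hypothetical helper, not a Check Point tool.

```python
import re

SECTION_RE = re.compile(r"\((hmem|smem|kmem)\) statistics")
FAILED_RE = re.compile(r"(\d+) failed alloc")


def failed_allocations(pstat_output):
    """Return {'hmem': n, 'smem': n, 'kmem': n} for the sections that are present."""
    counts = {}
    current = None
    for line in pstat_output.splitlines():
        section = SECTION_RE.search(line)
        if section:
            current = section.group(1)
            continue
        failed = FAILED_RE.search(line)
        # Keep only the first counter found inside each section.
        if failed and current is not None and current not in counts:
            counts[current] = int(failed.group(1))
    return counts


SAMPLE = """Hash kernel memory (hmem) statistics:
Allocations: 3375437074 alloc, 0 failed alloc, 3375001310 free
System kernel memory (smem) statistics:
Allocations: 979925174 alloc, 0 failed alloc, 979924513 free, 0 failed free
Kernel memory (kmem) statistics:
Allocations: 3375820431 alloc, 0 failed alloc, 3375384380 free, 0 failed free
"""
print(failed_allocations(SAMPLE))
```

Any non-zero value maps directly onto the hmem, smem and kmem explanations above.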

CPU Usage

CPU usage on single and multicore platforms can be checked with the command:

top

Example 'top' output from a badly optimized multi-core system:

[Screenshot: 'top' output showing per-CPU usage]

Explanation of the above output:

%us: Time spent running non-kernel code (User)

%sy: Time spent running kernel code (System)

%ni: Nice time

%id: Time spent idle

%wa: Time spent waiting for IO

%hi: hardware interrupt

%si: Software interrupt

%st: Steal time (involuntary wait time)

The idle value (%id) shows how busy the appliance is. If the value is 0, the CPU is maxed out. With the firewall under load, examine the idle column (%id) for each CPU and determine whether core usage is spread out evenly.

In the above example the core usage is uneven; some cores are maxed out while other cores are mostly idle. The core allocation (sim affinity) may require tuning to optimize the usage of the cores and improve the performance.

For information on core tuning, refer to:

sk33250: Automatic SIM Affinity on Multi-Core CPU Systems

The CPU usage is broken down into:

High CPU in user time (%us) indicates that some daemon process is consuming high CPU; security server processes like fwssd and in.ahttpd have been offenders in the past. (Figure out which process it is from the output of ps or top.)

High CPU usage in system (%sy) indicates that the Check Point kernel (traffic being inspected by Check Point or SmartDefense) is consuming CPU. Certain configurations in SmartDefense and Web Intelligence can cause this by disabling SecureXL templating or completely disabling SecureXL acceleration.

High CPU in wait time (%wa) occurs when the CPU was idle due to the system waiting for an outstanding disk I/O request to complete. This indicates your system is probably low on physical memory and is swapping out memory (paging)*. The CPU is not actually busy if this number is spiking; the CPU is blocked from doing any useful work waiting for an I/O event to complete.

A high value against software interrupt (%si) indicates that there is probably a high load of traffic on the appliance. The interface errors (netstat -i) should be examined to see if this is a cause for concern.

* The occurrence of paging can be determined by running 'vmstat -n 5 5' and checking the swapped in (si) and swapped out (so) statistics. Disregard the first line as it is an average value since the appliance started.
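The vmstat check in the footnote can be sketched as follows. The sample output is invented purely for illustration (it is not from a real appliance), and paging_samples is a hypothetical helper that locates the si and so columns by their header names and discards the first sample, as the footnote advises.

```python
def paging_samples(vmstat_output):
    """Return (si, so) pairs from vmstat output, skipping the first sample."""
    header = None
    samples = []
    for line in vmstat_output.splitlines():
        fields = line.split()
        if "si" in fields and "so" in fields:
            header = fields  # the column-name row
            continue
        if header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            si = int(fields[header.index("si")])
            so = int(fields[header.index("so")])
            samples.append((si, so))
    # The first sample is an average since boot, so it is dropped.
    return samples[1:]


# Invented sample data in the usual vmstat layout.
SAMPLE = """procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0 735980 157696  98732 697688    3    5    20    31  250  400 10  5 80  5
 0  0 735980 156000  98732 697700    0    0     0    12  240  390  8  4 88  0
 2  1 736500 150100  98740 697900   64  128   300   420  300  520 15 10 50 25
"""
samples = paging_samples(SAMPLE)
print(samples, any(si or so for si, so in samples))
```

Sustained non-zero si or so values in the later samples suggest the system is actively paging.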

Interface Errors

Interface statistics are displayed using the command:

netstat -i

Example output:

[Expert@Zulu]# netstat -i
Iface   MTU Met      RX-OK RX-ERR RX-DRP RX-OVR      TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0   29597525      0      0      0   42570398      0      0      0 BMRU
eth1   1500   0 1032315302      0   3976      0 1615311511      0      0      0 BMRU
eth2   1500   0 1624715902      0  12111      0 1025019332      0      0      0 BMRU
eth6   1500   0   26828076      0      0      0  477906370      0      0      0 BMRU
lo    16436   0    5922470      0      0      0    5922470      0      0      0 LRU
[Expert@Zulu]#

In the above example, the RX-DRP values indicate that the appliance is dropping packets at the network. This is not ideal, but as a percentage of received packets the number of RX-DRP packets is insignificant and can be disregarded as a source of concern. If the ratio is higher than 0.5%, attention is required!

The RX and TX columns show how many packets have been received or transmitted error-free (RX-OK/TX-OK) or damaged (RX-ERR/TX-ERR); how many were dropped (RX-DRP/TX-DRP); and how many were lost because of an overrun (RX-OVR/TX-OVR).

RX-ERR/TX-ERR errors usually indicate a mismatch in duplex setting or MTU size, bad cabling, or possibly a faulty interface card. Check the switch settings and fix the speed and duplex settings if there is a mismatch, then check the cabling and try a spare interface.

RX-DRP implies the appliance is dropping packets at the network. If the ratio of RX-DRP to RX-OK is greater than 0.5%, attention is required, as it is a sign that the firewall does not have enough FIFO memory buffer (descriptors) to hold the packets while waiting for a free interrupt to process them.

When the FIFO buffer is full the appliance will drop new packets as it does not have any spare buffer to hold them. A possible solution is to use Link Aggregation or tune the driver by increasing the descriptors, see: sk25921: Tuning Intel PRO/1000 family NICs driver parameters for maximal throughput

TX-DRP usually indicates that there is a downstream issue and the firewall has to drop the packets as it is unable to put them on the wire fast enough. Increasing the bandwidth through link aggregation or introducing flow control may be a possible solution to this problem.
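The 0.5% rule for RX-DRP can be checked with a few lines of code against saved 'netstat -i' output. This is a minimal sketch that assumes the twelve-column layout shown in the example; rx_drop_ratios is a hypothetical name.

```python
def rx_drop_ratios(netstat_output):
    """Return {interface: RX-DRP as a percentage of RX-OK}."""
    ratios = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Columns: Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
        if len(fields) != 12 or not fields[3].isdigit():
            continue
        rx_ok, rx_drp = int(fields[3]), int(fields[5])
        ratios[fields[0]] = 100.0 * rx_drp / rx_ok if rx_ok else 0.0
    return ratios


SAMPLE = """Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 29597525 0 0 0 42570398 0 0 0 BMRU
eth1 1500 0 1032315302 0 3976 0 1615311511 0 0 0 BMRU
eth2 1500 0 1624715902 0 12111 0 1025019332 0 0 0 BMRU
"""
for iface, pct in rx_drop_ratios(SAMPLE).items():
    print(iface, "%.6f%%" % pct, "ATTENTION" if pct > 0.5 else "ok")
```

For the example output every interface is several orders of magnitude below the 0.5% threshold, which is why the drops there can be disregarded.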

Fragmentation

Excessive fragmentation will have a detrimental impact on the firewall's performance. When packets are fragmented by the network, the kernel may receive them out of order. The kernel has to wait until it has received all the fragments before it can re-assemble them and then inspect the re-assembled packet. Fragmented traffic cannot be accelerated by the performance pack (SecureXL).

To examine the level of fragmentation run the following command:

fw ctl pstat

Find the section in the output for fragmentation and, if there is fragmentation, examine the 'expired' and 'failures' values.

Example ‘fw ctl pstat’ fragmentation output (truncated):

Fragments:

130963 fragments, 64066 packets, 2337 expired, 0 short,

4 large, 304 duplicates, 0 failures

Expired – denotes how many fragments were expired because the firewall failed to reassemble them within a 20-second time frame or because, due to memory exhaustion, they could no longer be kept in memory.

Failures – denotes the number of fragmented packets that were received that could not be successfully re-assembled.

The number of failures should be viewed in context with the amount of fragmentation occurring and relative to the total packet throughput (netstat -i). The values in pstat are cumulative, and large values may actually be small relative to the total packet throughput. However, if there is a significant number against 'failures', the cause of the issue should be traced to determine if there is a way to mitigate it.

In the above example output 1.8% of fragments that were received had to be expired by the firewall but as there were no failures it implies that the fragments were subsequently re-transmitted and successfully re-assembled by the firewall so no packets were lost.
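The 1.8% figure is the 'expired' count divided by the 'fragments' count. A small sketch of that arithmetic on saved 'fw ctl pstat' output follows; fragment_stats is a hypothetical helper and it assumes the 'Fragments:' section is laid out as in the example above.

```python
import re

COUNTER_RE = re.compile(
    r"(\d+) (fragments|packets|expired|short|large|duplicates|failures)"
)


def fragment_stats(pstat_output):
    """Return the counters that follow the 'Fragments:' heading as a dict."""
    _, _, tail = pstat_output.partition("Fragments:")
    stats = {}
    for count, name in COUNTER_RE.findall(tail):
        # Keep the first value seen for each counter name.
        stats.setdefault(name, int(count))
    return stats


SAMPLE = """Fragments:
130963 fragments, 64066 packets, 2337 expired, 0 short,
4 large, 304 duplicates, 0 failures
"""
stats = fragment_stats(SAMPLE)
expired_pct = 100.0 * stats["expired"] / stats["fragments"]
print("%.2f%% expired, %d failures" % (expired_pct, stats["failures"]))
```

Comparing the same ratio across two captures taken some time apart gives a better picture than the cumulative totals alone.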

If the source of fragmentation is external, there is little that can be done to alleviate the problem, but if it is internal, reducing the MTU size on the offending server may resolve the problem.
