Date, System Uptime and Clock:
Confirm that the correct date is set on the system using the 'date' command.
The system uptime can be examined using the command:
uptime
Example output:
Zulu# uptime
09:46:34 up 124 days, 9:40, 1 user, load average: 0.36, 0.19, 0.14
A low uptime normally indicates that the firewall has been administratively rebooted, but it may also be the result of a self-reboot, for example due to a panic.
If you suspect the uptime is lower than it should be, check the /var/log/messages file for the reason for the last reboot.
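The uptime check can be scripted. The following is a minimal sketch in Python that pulls the number of days out of a line of 'uptime' output. The function name and the 7-day threshold are assumptions made for illustration; they do not come from any Check Point tool.

```python
import re


def uptime_days(uptime_line):
    """Return the whole days of uptime from one line of `uptime` output.

    Returns 0 when the line has no "N day(s)" field, which is how
    `uptime` reports anything under 24 hours.
    """
    match = re.search(r"up\s+(\d+)\s+days?", uptime_line)
    return int(match.group(1)) if match else 0


UPTIME_SAMPLE = "09:46:34 up 124 days, 9:40, 1 user, load average: 0.36, 0.19, 0.14"
MIN_EXPECTED_DAYS = 7  # hypothetical threshold, adjust to your change schedule

days = uptime_days(UPTIME_SAMPLE)
if days < MIN_EXPECTED_DAYS:
    print(f"Uptime is only {days} day(s); check /var/log/messages")
else:
    print(f"Uptime is {days} days")
```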
Disk Space
The disk space usage can be examined using the command:
df -k
Example output:
Filesystem  1K-blocks     Used  Available  Use%  Mounted on
/dev/sda5      600832   187800     382512   33%  /
none           600832   187800     382512   33%  /dev/pts
/dev/sda1      147766    10124     130013    8%  /boot
/dev/sda7     1541680   930324     533044   64%  /opt
none          2045688        0    2045688    0%  /dev/shm
/dev/sda6     1541680   593844     869524   41%  /sysimg
/dev/sda8    27024000  5472984   20178264   22%  /var
In the above example, all partitions are under 70% usage.
If a partition has a 'Use%' of more than 70% (and especially if it is 90% or more), see if the partition can be cleaned up to free up disk space:
/var/opt/CPsuite-RXX/fw1/log may be filled with old log files if the firewall has been logging locally.
/var/log may contain old messages files.
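The 70% and 90% thresholds can be applied automatically. The sketch below parses 'df -k' output in Python and labels each partition. The status names ('ok', 'warning', 'critical') and the function name are assumptions for illustration, and the parser does not handle rows that 'df' wraps onto two lines for long device names.

```python
WARN_PCT = 70  # above this: worth attention (threshold from the text)
CRIT_PCT = 90  # at or above this: urgent (threshold from the text)

DF_SAMPLE = """\
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda5 600832 187800 382512 33% /
none 600832 187800 382512 33% /dev/pts
/dev/sda1 147766 10124 130013 8% /boot
/dev/sda7 1541680 930324 533044 64% /opt
none 2045688 0 2045688 0% /dev/shm
/dev/sda6 1541680 593844 869524 41% /sysimg
/dev/sda8 27024000 5472984 20178264 22% /var
"""


def check_df(df_output):
    """Return a list of (mount_point, use_pct, status) from `df -k` output."""
    results = []
    for line in df_output.strip().splitlines():
        fields = line.split()
        # A data row has six fields, the fifth being a value such as "33%".
        # The header row has seven fields ("Mounted on" splits in two).
        if len(fields) != 6 or not fields[4].rstrip("%").isdigit():
            continue
        use_pct = int(fields[4].rstrip("%"))
        if use_pct >= CRIT_PCT:
            status = "critical"
        elif use_pct > WARN_PCT:
            status = "warning"
        else:
            status = "ok"
        results.append((fields[5], use_pct, status))
    return results


for mount, pct, status in check_df(DF_SAMPLE):
    print(f"{mount:10s} {pct:3d}% {status}")
```

With the example output above, every partition is reported as 'ok'.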
Physical RAM and Swap Space:
Examine the RAM and swap space usage (kilobytes) with:
free -k -t
Example output:
             total       used       free     shared    buffers     cached
Mem:       2058236     971332    1086904          0      95104     268984
-/+ buffers/cache:     607244    1450992
Swap:      4192944          0    4192944
The 'total' column shows the amount of RAM installed in the system (2 GB in the above example) and the amount of disk space allocated for swap space (4 GB). The amount of swap space is normally set automatically to twice the size of the physical memory, with 4 GB being the maximum.
The 'used' column indicates how much RAM and swap space are being used.
The 'free' column indicates how much RAM and swap space are available.
In the above example output, the 'used' column indicates that less than 1 GB of RAM is being used and no swap space is being used.
If the amount of free RAM becomes low, the appliance will start to preserve free RAM by swapping out the contents of memory to the hard disk (swap space). Performance will be sub-optimal while swap space is being used, because of the time and resources spent writing to and reading from the hard disk.
Example Output:
             total       used       free     shared    buffers     cached
Mem:       2055120    1897424     157696          0      98732     697688
-/+ buffers/cache:    1101004     954116
Swap:      4192912     735980    3456932
Swap space usage may indicate that not enough memory is installed in the appliance. The kernel is 32-bit and can use up to 4 GB. Upgrading the memory is recommended if less than 4 GB of RAM is installed.
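To spot swap usage without reading the table by eye, the 'Swap:' row can be parsed. This is a minimal Python sketch using the second example output above; the function name is an assumption for illustration.

```python
FREE_SAMPLE = """\
             total       used       free     shared    buffers     cached
Mem:       2055120    1897424     157696          0      98732     697688
-/+ buffers/cache:    1101004     954116
Swap:      4192912     735980    3456932
"""


def swap_usage(free_output):
    """Return (total_kb, used_kb) from the 'Swap:' row of `free -k` output."""
    for line in free_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Swap:":
            return int(fields[1]), int(fields[2])
    raise ValueError("no 'Swap:' row found")


total_kb, used_kb = swap_usage(FREE_SAMPLE)
used_pct = 100.0 * used_kb / total_kb if total_kb else 0.0
if used_kb > 0:
    print(f"Swap in use: {used_kb} KB ({used_pct:.1f}% of swap)")
else:
    print("No swap space in use")
```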
For further information about the amount of RAM that is supported by SecurePlatform, refer to:
Memory Usage
The firewall's memory usage can be examined by using the command:
fw ctl pstat
The output of this command is extensive and can be difficult to understand, as not all of it is intuitive. The statistics that need to be checked to ensure memory is healthy are:
· hash kernel memory 'hmem'
· system kernel memory 'smem'
· kernel memory 'kmem'
Example output:
Machine Capacity Summary:
Memory used: 7% (128MB out of 1638MB) - below low watermark
Concurrent Connections: 21% (43253 out of 199900) - below low watermark
Aggressive Aging is not active
Hash kernel memory (hmem) statistics:
Total memory allocated: 142606336 bytes in 34782 4KB blocks using 34 pools
Initial memory allocated: 20971520 bytes (Hash memory extended by 121634816 bytes)
Memory allocation limit: 335544320 bytes using 512 pools
Total memory bytes used: 39254196 unused: 103352140 (72.47%) peak: 133739228
Total memory blocks used: 10335 unused: 24447 (70%) peak: 32795
Allocations: 3375437074 alloc, 0 failed alloc, 3375001310 free
System kernel memory (smem) statistics:
Total memory bytes used: 188577580 peak: 227270504
Blocking memory bytes used: 1958392 peak: 2205256
Non-Blocking memory bytes used: 186619188 peak: 225065248
Allocations: 979925174 alloc, 0 failed alloc, 979924513 free, 0 failed free
Kernel memory (kmem) statistics:
Total memory bytes used: 84876956 peak: 177110948
Allocations: 3375820431 alloc, 0 failed alloc, 3375384380 free, 0 failed free
External Allocations: 0 for packets, 31589936 for SXL
In the above example there are no hmem, smem or kmem failed allocations.
The presence of 'hmem' failed allocations indicates that the hash kernel memory was full. This is not a serious memory problem, but it does indicate a configuration problem. The value assigned to the hash memory pool (either manually or automatically, by changing the number of concurrent connections in the capacity optimization section of a firewall) determines the size of the hash kernel memory. If a low hmem limit is configured, it leads to improper usage of the OS memory. See 'Capacity Optimization' in the 'Firewall Health Checks' section for further information.
The presence of 'smem' failed allocations indicates that the OS memory was exhausted or that there are large non-sleep allocations. This is symptomatic of a memory shortage. If there are failed smem allocations and the memory is less than 2 GB, upgrading to 2 GB may fix the problem. Decreasing the TCP end timeout and decreasing the number of concurrent connections can also help reduce memory consumption.
The presence of 'kmem' failed allocations means that some applications did not get memory. This is usually an indication of a memory problem, most commonly a memory shortage. (The natural limit is 2 GB, since the kernel is 32-bit.)
A memory shortage sometimes indicates a memory leak. To troubleshoot a memory shortage, stop the load and let connections close. If the memory consumption returns to normal, you are not dealing with a memory leak. Such a shortage might happen when traffic volumes are too high for the device capacity. If the memory shortage happens after a change in the system or the environment, undo the change and check whether kmem memory consumption goes down.
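Checking the three 'failed alloc' counters can be automated. The sketch below is a hedged Python example that scans saved 'fw ctl pstat' output for the hmem, smem and kmem sections and extracts the failed allocation count from each. The function name is an assumption, and the sample is an abbreviated copy of the example output above; the parser relies only on the section headers and the 'N failed alloc' wording shown there.

```python
import re

SECTION_RE = re.compile(r"\((hmem|smem|kmem)\) statistics:")
FAILED_RE = re.compile(r"(\d+) failed alloc")

PSTAT_SAMPLE = """\
Hash kernel memory (hmem) statistics:
Total memory allocated: 142606336 bytes in 34782 4KB blocks using 34 pools
Allocations: 3375437074 alloc, 0 failed alloc, 3375001310 free
System kernel memory (smem) statistics:
Total memory bytes used: 188577580 peak: 227270504
Allocations: 979925174 alloc, 0 failed alloc, 979924513 free, 0 failed free
Kernel memory (kmem) statistics:
Total memory bytes used: 84876956 peak: 177110948
Allocations: 3375820431 alloc, 0 failed alloc, 3375384380 free, 0 failed free
"""


def failed_allocations(pstat_output):
    """Return {'hmem': n, 'smem': n, 'kmem': n} for the sections found."""
    failed = {}
    section = None
    for line in pstat_output.splitlines():
        header = SECTION_RE.search(line)
        if header:
            section = header.group(1)
            continue
        hit = FAILED_RE.search(line)
        # Record only the first match inside each memory section.
        if hit and section and section not in failed:
            failed[section] = int(hit.group(1))
    return failed


for name, count in failed_allocations(PSTAT_SAMPLE).items():
    state = "healthy" if count == 0 else "needs investigation"
    print(f"{name}: {count} failed allocations ({state})")
```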
CPU Usage
CPU usage on single and multicore platforms can be checked with the command:
top
Example 'top' output from a badly optimized multi-core system:
Explanation of the above output:
%us: Time spent running non-kernel code (user)
%sy: Time spent running kernel code (system)
%ni: Nice time
%id: Time spent idle
%wa: Time spent waiting for I/O
%hi: Hardware interrupts
%si: Software interrupts
%st: Steal time (involuntary wait time)
The idle value (%id) shows how busy the appliance is. If the value is 0, the CPU is maxed out. With the firewall under load, examine the idle column (%id) for each CPU and determine whether core usage is spread out evenly.
In the above example the core usage is uneven; some cores are maxed out while other cores are mostly idle. The core allocation (sim affinity) may require tuning to optimize the usage of the cores and improve performance.
For information on core tuning, refer to:
The CPU usage is broken down as follows:
High CPU in user time (%us) indicates that a daemon process is consuming a lot of CPU; security server processes such as fwssd and in.ahttpd have been offenders in the past. (Identify the process from the output of ps or top.)
High CPU in system time (%sy) indicates that the Check Point kernel (traffic being inspected by Check Point or SmartDefense) is consuming CPU. Certain configurations in SmartDefense and Web Intelligence can cause this by disabling SecureXL templating or completely disabling SecureXL acceleration.
High CPU in wait time (%wa) occurs when the CPU is idle because the system is waiting for an outstanding disk I/O request to complete. This indicates that the system is probably low on physical memory and is swapping out memory (paging)*. The CPU is not actually busy when this number spikes; it is blocked from doing any useful work while it waits for an I/O event to complete.
A high software interrupt value (%si) indicates that there is probably a high load of traffic on the appliance. The interface errors (netstat -i) should be examined to see if this is a cause for concern.
* The occurrence of paging can be determined by running 'vmstat -n 5 5' and checking the swapped in (si) and swapped out (so) statistics. Disregard the first line of values, as it is an average since the appliance started.
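The paging check described in the footnote can be sketched in Python as follows. The sample below is invented for illustration and is not taken from a real appliance; the function name is also an assumption. The parser locates the 'si' and 'so' columns from the header row and skips the first data row, as recommended above.

```python
# Invented sample in the usual `vmstat -n 5 5` layout (not real appliance data).
VMSTAT_SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 735980 157696  98732 697688    2    5    10    20  300  400 10  5 80  5  0
 2  0 736100 150000  98732 697000   40  120   200   600  900 1200 20 15 40 25  0
 1  1 736500 149000  98700 696000   35   90   150   500  850 1100 18 14 45 23  0
 1  0 736500 151000  98700 696500    0    0     5    10  300  420  8  4 87  1  0
 1  0 736500 151500  98700 696500    0    0     4     8  290  410  7  4 88  1  0
"""


def paging_samples(vmstat_output):
    """Return [(si, so), ...] for every data row except the first."""
    rows = [line.split() for line in vmstat_output.strip().splitlines()]
    # The column header is the row that contains both 'si' and 'so'.
    header_idx = next(i for i, f in enumerate(rows) if "si" in f and "so" in f)
    header = rows[header_idx]
    si_col, so_col = header.index("si"), header.index("so")
    data = rows[header_idx + 1:]
    # The first data row is an average since boot, so it is skipped.
    return [(int(r[si_col]), int(r[so_col])) for r in data[1:]]


samples = paging_samples(VMSTAT_SAMPLE)
if any(si > 0 or so > 0 for si, so in samples):
    print("Paging detected:", samples)
else:
    print("No paging during the sampling window")
```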
Interface Errors
Interface statistics are displayed using the command:
netstat -i
Example output:
[Expert@Zulu]# netstat -i
Iface   MTU Met      RX-OK RX-ERR RX-DRP RX-OVR      TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0   29597525      0      0      0   42570398      0      0      0 BMRU
eth1   1500   0 1032315302      0   3976      0 1615311511      0      0      0 BMRU
eth2   1500   0 1624715902      0  12111      0 1025019332      0      0      0 BMRU
eth6   1500   0   26828076      0      0      0  477906370      0      0      0 BMRU
lo    16436   0    5922470      0      0      0    5922470      0      0      0 LRU
[Expert@Zulu]#
In the above example, RX-DRP indicates that the appliance is dropping packets at the network. This is not ideal, but as a percentage of received packets the number of RX-DRP packets is insignificant and can therefore be disregarded as a source of concern. If the ratio is higher than 0.5%, attention is required.
The RX and TX columns show how many packets have been received or transmitted error-free (RX-OK/TX-OK) or damaged (RX-ERR/TX-ERR), how many were dropped (RX-DRP/TX-DRP), and how many were lost because of an overrun (RX-OVR/TX-OVR).
RX-ERR/TX-ERR errors usually indicate a mismatch in duplex setting or MTU size, bad cabling, or possibly a faulty interface card. Check the switch settings and fix the speed and duplex settings if there is a mismatch, check the cabling, and try a spare interface.
RX-DRP implies that the appliance is dropping packets at the network. If the ratio of RX-DRP to RX-OK is greater than 0.5%, attention is required, as it is a sign that the firewall does not have enough FIFO memory buffer (descriptors) to hold the packets while waiting for a free interrupt to process them. When the FIFO buffer is full, the appliance drops new packets because it has no spare buffer to hold them.
A possible solution is to use Link Aggregation or to tune the driver by increasing the descriptors; see sk25921: Tuning Intel PRO/1000 family NICs driver parameters for maximal throughput.
TX-DRP usually indicates that there is a downstream issue and that the firewall has to drop packets because it is unable to put them on the wire fast enough. Increasing the bandwidth through link aggregation or introducing flow control may be a possible solution to this problem.
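The RX-DRP to RX-OK ratio can be calculated per interface. The following Python sketch parses the 'netstat -i' table from the example above and compares each ratio with the 0.5% threshold. The function and constant names are assumptions for illustration.

```python
DROP_RATIO_LIMIT_PCT = 0.5  # threshold quoted in the text

NETSTAT_SAMPLE = """\
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 29597525 0 0 0 42570398 0 0 0 BMRU
eth1 1500 0 1032315302 0 3976 0 1615311511 0 0 0 BMRU
eth2 1500 0 1624715902 0 12111 0 1025019332 0 0 0 BMRU
eth6 1500 0 26828076 0 0 0 477906370 0 0 0 BMRU
lo 16436 0 5922470 0 0 0 5922470 0 0 0 LRU
"""


def rx_drop_ratios(netstat_output):
    """Return {interface: RX-DRP as a percentage of RX-OK}."""
    ratios = {}
    columns = None
    for line in netstat_output.strip().splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Iface":
            columns = fields  # remember the header to look up columns by name
            continue
        if columns is None or len(fields) != len(columns):
            continue  # skip prompts, banners and malformed rows
        row = dict(zip(columns, fields))
        rx_ok = int(row["RX-OK"])
        rx_drp = int(row["RX-DRP"])
        ratios[row["Iface"]] = 100.0 * rx_drp / rx_ok if rx_ok else 0.0
    return ratios


for iface, pct in rx_drop_ratios(NETSTAT_SAMPLE).items():
    flag = "ATTENTION" if pct > DROP_RATIO_LIMIT_PCT else "ok"
    print(f"{iface:5s} RX-DRP/RX-OK = {pct:.6f}% {flag}")
```

For the example output, the ratios for eth1 and eth2 are well below 0.001%, which is why the drops can be disregarded.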
Fragmentation
Excessive fragmentation will have a detrimental impact on the firewall's performance. When packets are fragmented by the network, the kernel may receive them out of order. The kernel has to wait until it has received all the fragments before it can re-assemble them and then inspect the re-assembled packet. Fragmented traffic cannot be accelerated by the performance pack (SecureXL).
To examine the level of fragmentation, run the following command:
fw ctl pstat
Find the fragmentation section in the output and, if there is fragmentation, examine the 'expired' and 'failures' values.
Example 'fw ctl pstat' fragmentation output (truncated):
Fragments:
130963 fragments, 64066 packets, 2337 expired, 0 short, 4 large, 304 duplicates, 0 failures
Expired: denotes how many fragments were expired because the firewall failed to re-assemble them within a 20-second time frame or because, due to memory exhaustion, they could no longer be kept in memory.
Failures: denotes the number of fragmented packets that were received but could not be successfully re-assembled.
The number of failures should be viewed in the context of the amount of fragmentation occurring and relative to the total packet throughput (netstat -i). The values in pstat are cumulative, and large values may actually be small relative to the total packet throughput. However, if there is a significant number of 'failures', the cause should be traced to determine whether there is a way to mitigate it.
In the above example output, 1.8% of the fragments that were received had to be expired by the firewall. As there were no failures, this implies that the fragments were subsequently re-transmitted and successfully re-assembled by the firewall, so no packets were lost.
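The 1.8% figure is the 'expired' counter divided by the 'fragments' counter. A minimal Python sketch of that calculation, using the example line above (the function name is an assumption for illustration):

```python
import re

FRAGMENTS_SAMPLE = (
    "130963 fragments, 64066 packets, 2337 expired, 0 short, "
    "4 large, 304 duplicates, 0 failures"
)


def parse_fragment_counters(line):
    """Turn '130963 fragments, 64066 packets, ...' into a dict of ints."""
    return {name: int(value) for value, name in re.findall(r"(\d+)\s+([a-z]+)", line)}


counters = parse_fragment_counters(FRAGMENTS_SAMPLE)
expired_pct = 100.0 * counters["expired"] / counters["fragments"]
print(f"expired: {expired_pct:.1f}% of fragments, failures: {counters['failures']}")
```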
If the source of the fragmentation is external, there is little that can be done to alleviate the problem, but if it is internal, reducing the MTU size on the offending server may resolve it.