PERFORMANCE TUNING: HUGEPAGES IN LINUX



Problem statement
The client’s central database was intermittently freezing because of high CPU usage, and their business was severely affected. They had already worked with vendor support, but the problem remained unresolved.

Symptoms

Intermittent high kernel-mode CPU usage was the symptom. The server hardware was 4 dual-core CPUs with hyperthreading enabled and 20GB of RAM, running a Red Hat Linux OS with a 2.6 kernel.
During these database freezes, all CPUs were busy in kernel mode and the database was almost unusable. Even log-ins and simple SQL such as SELECT * from DUAL; took a few seconds to complete. A review of the AWR report did not help much, as expected, since the problem was outside the database.
Analyzing the situation with system activity reporter (sar) data, we could see that at 08:32 and again at 08:40, CPU usage in kernel mode spiked to almost 70%. It is also interesting to note that SADC (sar data collection) itself suffered from this CPU spike: the sar collection scheduled for 08:30 completed two minutes late, at 08:32, as shown below.
A similar issue repeated at 10:50AM:
07:20:01 AM CPU   %user     %nice   %system   %iowait     %idle
07:30:01 AM all    4.85      0.00     77.40      4.18     13.58
07:40:01 AM all   16.44      0.00      2.11     22.21     59.24
07:50:01 AM all   23.15      0.00      2.00     21.53     53.32
08:00:01 AM all   30.16      0.00      2.55     15.87     51.41
08:10:01 AM all   32.86      0.00      3.08     13.77     50.29
08:20:01 AM all   27.94      0.00      2.07     12.00     58.00
08:32:50 AM all   25.97      0.00     25.42     10.73     37.88 <-- sar collection delayed by the CPU spike
[rows from 08:40 AM through 11:00 AM lost in formatting]
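(For reference, the CPU figures above and the memory figures later in this post can be pulled from the sysstat archives with commands along these lines; the file name sa15 is just a placeholder for the day-of-month file, and the path may differ on your distribution.)
# CPU utilization (user/system/iowait/idle) from a saved sar file
sar -u -f /var/log/sa/sa15
# Memory utilization from the same file
sar -r -f /var/log/sa/sa15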

Performance forensic analysis

The client had access to a few tools, none of which were very effective. We knew there was excessive kernel-mode CPU usage; to understand why, we needed to look at various metrics around 8:40 and 10:50.
Fortunately, sar data was handy. Looking at free memory, we saw something odd. At 8:32, free memory was 86MB; by 8:40 it had climbed to 1.1GB. Similarly, at 10:41 free memory was 78MB, and by 10:50 it had reached 4.7GB. So, within roughly ten minutes, free memory climbed by more than 4.6GB.
07:40:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached
07:50:01 AM    225968  20323044     98.90    173900   7151144
08:00:01 AM    206688  20342324     98.99    127600   7084496
08:10:01 AM    214152  20334860     98.96    109728   7055032
08:20:01 AM    209920  20339092     98.98     21268   7056184
08:32:50 AM     86176  20462836     99.58      8240   7040608
08:40:02 AM   1157520  19391492     94.37     79096   7012752
08:50:01 AM   1523808  19025204     92.58    158044   7095076
09:00:01 AM    775916  19773096     96.22    187108   7116308
09:10:01 AM    430100  20118912     97.91    218716   7129248
09:20:01 AM    159700  20389312     99.22    239460   7124080
09:30:02 AM    265184  20283828     98.71    126508   7090432
10:41:54 AM     78588  20470424     99.62      4092   6962732  <-- free memory bottoms out
10:50:01 AM   4787684  15761328     76.70     77400   6878012
11:00:01 AM   2636892  17912120     87.17    143780   6990176
11:10:01 AM   1471236  19077776     92.84    186540   7041712
This tells us that there is a correlation between the CPU spikes and the jumps in free memory. If free memory goes from 78MB to 4.7GB, the paging and swapping daemons must be working very hard: releasing 4.7GB of memory to the free pool means a burst of page-out and swap activity, which is exactly the kind of work that shows up as kernel-mode CPU usage.
Most likely, many of the SGA pages were also being paged out, since the SGA was not locked in memory.

Memory breakdown

The client’s question was: if paging/swapping is indeed the issue, then what is using all the memory? It’s a 20GB server, the SGA is 10GB, and no other application is running. The database gets a few hundred connections at a time, and PGA_AGGREGATE_TARGET is set to 2GB. So why would it suffer from memory starvation? And if memory is the issue, how can there be 4.7GB of free memory at 10:50AM?
Recent OS architectures are designed to use all available memory, so the paging daemons don’t wake up until free memory falls below a certain threshold. It is therefore possible for free memory to drop near zero and then climb back quickly as the paging/swapping daemons start to work harder and harder. This explains why free memory went down to 78MB and rose to 4.7GB ten minutes later.
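The reclaim threshold itself is visible as a kernel tunable. A quick, read-only check (just a sketch; the exact watermark behaviour varies by kernel version):
cat /proc/sys/vm/min_free_kbytes   # free-memory floor (in KB) below which the kernel starts reclaiming pages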
What is using the memory, though? /proc/meminfo is useful for answering that, and it shows that the page table size is 5GB. How interesting!
Essentially, the page table is the mapping between virtual and physical addresses. With the default OS page size of 4KB, a 10GB SGA alone needs about 2.6 million OS pages, and the server’s 20GB of memory amounts to roughly 5 million OS pages. (Read Wikipedia’s entry on page tables for more background.) Moreover, page tables are maintained per process, so a few hundred dedicated connections, each mapping the 10GB SGA, multiply that overhead; this is how PageTables can reach 5GB. Managing all of these pages is an enormous workload for the paging/swapping daemon.
cat /proc/meminfo

MemTotal:     20549012 kB
MemFree:        236668 kB
Buffers:         77800 kB
Cached:        7189572 kB
...
PageTables:    5007924 kB  <-- 5GB of page tables!
...
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB
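As a rough sanity check of the arithmetic above, the following sketch (the 10GB SGA size is hard-coded as an assumption) compares the number of 4KB pages needed to map the SGA once with the page table memory reported by the kernel:
#!/bin/bash
# Rough page-count arithmetic; adjust SGA_BYTES for your system
SGA_BYTES=$((10 * 1024 * 1024 * 1024))   # 10GB SGA (assumption)
PAGE_BYTES=4096                          # default OS page size
echo "4KB pages needed to map the SGA once: $((SGA_BYTES / PAGE_BYTES))"
# Page table memory actually in use right now
grep PageTables /proc/meminfo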

HugePages

Fortunately, we can use HugePages in this version of Linux. There are a couple of important benefits to HugePages:
  1. The page size is 2MB instead of 4KB.
  2. Memory used by HugePages is locked and cannot be paged out.
With a page size of 2MB, a 10GB SGA needs only about 5,000 pages, compared with 2.6 million pages without HugePages. This drastically reduces the page table size. Also, HugePages memory is locked, so the SGA cannot be swapped out and the working set of pages the paging/swapping daemon must manage becomes far smaller.
To set up HugePages, the following changes must be completed:
  1. Set the vm.nr_hugepages kernel parameter to a suitable value. In this case, we decided to use 12GB and set the parameter to 6144 (6144*2MB=12GB). You can run:
    echo 6144 > /proc/sys/vm/nr_hugepages
    or
    sysctl -w vm.nr_hugepages=6144
    Of course, you must make sure this is set across reboots too (see the sketch after this list).
  2. The oracle userid needs to be able to lock a greater amount of memory, so /etc/security/limits.conf must be updated to increase the soft and hard memlock values for the oracle userid.
    oracle          soft    memlock        12582912
    oracle          hard   memlock        12582912
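A minimal sketch of the persistent settings, using the 6144-page/12GB sizing from step 1 (adjust the numbers for your own SGA and memory):
# /etc/sysctl.conf -- survives reboots; apply immediately with "sysctl -p"
vm.nr_hugepages = 6144

# /etc/security/limits.conf -- let the oracle userid lock up to 12GB (values in KB)
oracle   soft   memlock   12582912
oracle   hard   memlock   12582912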
After setting this up, we need to make sure that the SGA is indeed using HugePages. The value (HugePages_Total - HugePages_Free)*2MB will be approximately the size of the SGA (or it will match the shared memory segments shown in the output of ipcs -ma).
cat /proc/meminfo |grep HugePages
HugePages_Total:  6144
HugePages_Free:   1655  <-- Free pages are less than total pages, so HugePages are in use.
Hugepagesize:     2048 kB
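The (HugePages_Total - HugePages_Free)*2MB arithmetic from the previous paragraph can be scripted; here is a small sketch that reports how much memory is currently backed by HugePages, which should come out close to the SGA size:
#!/bin/bash
# Estimate memory resident in HugePages: (Total - Free) * Hugepagesize
TOTAL=$(awk '/HugePages_Total/ {print $2}' /proc/meminfo)
FREE=$(awk '/HugePages_Free/ {print $2}' /proc/meminfo)
SIZE_KB=$(awk '/Hugepagesize/ {print $2}' /proc/meminfo)
echo "HugePages in use: $(( (TOTAL - FREE) * SIZE_KB / 1024 )) MB"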

Summary

Using HugePages resolved our client’s performance issues. The PageTables size also went down to a few hundred MB. If your database runs on Linux and the kernel supports HugePages, there is no reason not to use them.
This can be read in a presentation format at Investigations: Performance and hugepages (PDF).
========================================================================

Introduction

For large SGA sizes, HugePages can give substantial benefits in virtual memory management. Without HugePages, the memory of the SGA is divided into 4K pages, which have to be managed by the Linux kernel. Using HugePages, the page size is increased to 2MB (configurable to 1G if supported by the hardware), reducing the total number of pages the kernel must manage and therefore the amount of memory required to hold the page table. In addition, the memory associated with HugePages cannot be swapped out, which forces the SGA to stay memory resident. The savings in memory and page-management effort make HugePages pretty much mandatory for Oracle 11g systems running on x86-64 architectures.
Just because you have a large SGA, it doesn't automatically mean you will have a problem if you don't use HugePages. It is typically the combination of a large SGA and lots of database connections that leads to problems. To determine how much memory you are currently using to support the page table, run the following command at a time when the server is under normal/heavy load.
# grep PageTables /proc/meminfo
PageTables:      1244880 kB
#
 Automatic Memory Management (AMM) is not compatible with Linux HugePages, so apart from ASM instances and small unimportant databases, you will probably have no need for AMM on a real database running on Linux. Instead, Automatic Shared Memory Management and Automatic PGA Management should be used as they are compatible with HugePages.

Configuring HugePages

Run the following command to determine the current HugePage usage. The default HugePage size is 2MB on Oracle Linux 5.x and as you can see from the output below, by default no HugePages are defined.
$ grep Huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
$
Depending on the size of your SGA, you may wish to increase the value of Hugepagesize to 1G.
Create a file called "hugepages_setting.sh" with the following contents.
#!/bin/bash
#
# hugepages_settings.sh
#
# Linux bash script to compute values for the
# recommended HugePages/HugeTLB configuration
#
# Note: This script does calculations for all shared memory
# segments available when the script is run, whether or not
# they are Oracle RDBMS shared memory segments.
# Check for the kernel version
KERN=`uname -r | awk -F. '{ printf("%d.%d\n",$1,$2); }'`
# Find out the HugePage size
HPG_SZ=`grep Hugepagesize /proc/meminfo | awk {'print $2'}`
# Start from 1 page to be on the safe side and guarantee 1 free HugePage
NUM_PG=1
# Cumulative number of pages required to handle the running shared memory segments
for SEG_BYTES in `ipcs -m | awk {'print $5'} | grep "[0-9][0-9]*"`
do
   MIN_PG=`echo "$SEG_BYTES/($HPG_SZ*1024)" | bc -q`
   if [ $MIN_PG -gt 0 ]; then
      NUM_PG=`echo "$NUM_PG+$MIN_PG+1" | bc -q`
   fi
done
# Finish with results
case $KERN in
   '2.4') HUGETLB_POOL=`echo "$NUM_PG*$HPG_SZ/1024" | bc -q`;
          echo "Recommended setting: vm.hugetlb_pool = $HUGETLB_POOL" ;;
   '2.6' | '3.8' | '3.10' | '4.1' ) echo "Recommended setting: vm.nr_hugepages = $NUM_PG" ;;
    *) echo "Unrecognized kernel version $KERN. Exiting." ;;
esac
# End
 Thanks to Bjoern Rost for pointing out the issue when using the script against UEK3 and the suggested fix. I've subsequently added support for 3.10 and 4.1. There is a newer version of this script available from MOS (Doc ID 401749.1) which includes these kernel versions also.
Make the file executable.
$ chmod u+x hugepages_setting.sh
Make sure all the Oracle services are running as normal on the server, then run the script and make a note of the recommended "vm.nr_hugepages" value.
$ ./hugepages_setting.sh 
Recommended setting: vm.nr_hugepages = 305
$
Edit the "/etc/sysctl.conf" file as the "root" user, adding the following entry, adjusted based on your output from the script. You should set the value greater than or equal to the value displayed by the script. You only need 1 or 2 spare pages.
vm.nr_hugepages=306
One person reported also needing the hugetlb_shm_group setting on Oracle Linux 6.5. I did not and it is listed as a requirement for SUSE only. If you want to set it, get the ID of the dba group.
# fgrep dba /etc/group
dba:x:54322:oracle
#
Use the resulting group ID in the "/etc/sysctl.conf" file.
vm.hugetlb_shm_group=54322
Run the following command as the "root" user.
# sysctl -p
Alternatively, edit the "/etc/grub.conf" file, adding "hugepages=306" to the end of the kernel line for the default kernel and reboot.
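For illustration only, such a kernel line might end up looking like the following; the kernel version and root device are placeholders based on a typical OL6 system, and only the trailing "hugepages=306" is the actual addition.
kernel /vmlinuz-2.6.39-400.24.1.el6uek.x86_64 ro root=/dev/mapper/vg_root-lv_root rhgb quiet hugepages=306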
You can now see the HugePages have been created, but are currently not being used.
$ grep Huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:     306
HugePages_Free:      306
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
$
Add the following entries to the "/etc/security/limits.conf" file or the "/etc/security/limits.d/99-grid-oracle-limits.conf" file, where the setting is at least the size of the HugePages allocation in KB (HugePages * Hugepagesize). In this case the value is 306*2048=626688.
* soft memlock 626688
* hard memlock 626688
 If you prefer, you can set these parameters to a value just below the size of physical memory of the server. This way you can forget about it, unless you add more physical memory.
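If you take that approach, a small sketch like the following derives a memlock value a little below physical memory; the 90% margin is an arbitrary assumption, not an Oracle recommendation:
#!/bin/bash
# Suggest soft/hard memlock (in KB) at roughly 90% of physical RAM
MEMTOTAL_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
echo "Suggested memlock value: $(( MEMTOTAL_KB * 90 / 100 )) KB"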
Check that the MEMORY_TARGET parameters are not set for the database and that the SGA_TARGET and PGA_AGGREGATE_TARGET parameters are being used instead.
SQL> show parameter target

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
archive_lag_target                   integer     0
db_flashback_retention_target        integer     1440
fast_start_io_target                 integer     0
fast_start_mttr_target               integer     0
memory_max_target                    big integer 0
memory_target                        big integer 0
parallel_servers_target              integer     16
pga_aggregate_target                 big integer 200M
sga_target                           big integer 600M
SQL>
Restart the server and restart the database services as required.
Check the HugePages information again.
$ grep Huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:     306
HugePages_Free:       98
HugePages_Rsvd:       93
HugePages_Surp:        0
Hugepagesize:       2048 kB
$
You can see the HugePages are now being used.
Remember, if you increase your memory allocation or add new instances, you need to retest the required number of HugePages, or risk Oracle running without them.
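One way to catch the "running without them" case is to check /proc/meminfo while the instances are up: if all configured HugePages are still free and none are reserved, the SGA is not using them. A minimal sketch:
#!/bin/bash
# Warn if HugePages are configured but apparently not used by any running instance
TOTAL=$(awk '/HugePages_Total/ {print $2}' /proc/meminfo)
FREE=$(awk '/HugePages_Free/ {print $2}' /proc/meminfo)
RSVD=$(awk '/HugePages_Rsvd/ {print $2}' /proc/meminfo)
if [ "$TOTAL" -gt 0 ] && [ "$FREE" -eq "$TOTAL" ] && [ "$RSVD" -eq 0 ]; then
  echo "WARNING: $TOTAL HugePages configured, but none are in use or reserved."
fi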

Force Oracle to use HugePages (USE_LARGE_PAGES)

Sizing the number of HugePages correctly is important because prior to 11.2.0.3, if the whole SGA doesn't fit into the available HugePages, the instance will start up without using any. From 11.2.0.3 onward, the SGA can run partly in HugePages and partly not, so the impact of this issue is not so great. Incorrect sizing may not be obvious to spot. Later releases of the database display a "Large Pages Information" section in the alert log during startup.
****************** Large Pages Information *****************

Total Shared Global Region in Large Pages = 602 MB (100%)

Large Pages used by this instance: 301 (602 MB)
Large Pages unused system wide = 5 (10 MB) (alloc incr 4096 KB)
Large Pages configured system wide = 306 (612 MB)
Large Page size = 2048 KB
***********************************************************
If you are running Oracle 11.2.0.2 or later, you can set the USE_LARGE_PAGES initialization parameter to "only" so the database fails to start if it is not backed by hugepages. You can read more about this here.
ALTER SYSTEM SET use_large_pages=only SCOPE=SPFILE;
SHUTDOWN IMMEDIATE;
STARTUP;
On startup the "Large Page Information" in the alert log reflects the use of this parameter.
****************** Large Pages Information *****************
Parameter use_large_pages = ONLY

Total Shared Global Region in Large Pages = 602 MB (100%)

Large Pages used by this instance: 301 (602 MB)
Large Pages unused system wide = 5 (10 MB) (alloc incr 4096 KB)
Large Pages configured system wide = 306 (612 MB)
Large Page size = 2048 KB
***********************************************************
Attempting to start the database when there aren't enough HugePages to hold the SGA will now return the following error.
SQL> STARTUP
ORA-27137: unable to allocate large pages to create a shared memory segment
Linux-x86_64 Error: 12: Cannot allocate memory
SQL> 
The "Large Pages Information" section of the alert log output describes the startup failure and the appropriate action to take.
****************** Large Pages Information *****************
Parameter use_large_pages = ONLY

Large Pages unused system wide = 0 (0 KB) (alloc incr 4096 KB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB

ERROR:
  Failed to allocate shared global region with large pages, unix errno = 12.
  Aborting Instance startup.
  ORA-27137: unable to allocate Large Pages to create a shared memory segment

ACTION:
  Total Shared Global Region size is 608 MB. Increase the number of
  unused large pages to atleast 304 (608 MB) to allocate 100% Shared Global
  Region with Large Pages.
***********************************************************

Disabling Transparent HugePages (RHEL6/OL6 and RHEL7/OL7)

Starting from RHEL6/OL6, Transparent HugePages are implemented and enabled by default. They are meant to improve memory management by allowing HugePages to be allocated dynamically by the "khugepaged" kernel thread, rather than at boot time like conventional HugePages. That sounds like a good idea, but unfortunately Transparent HugePages don't play well with Oracle databases and are associated with node reboots in RAC installations and performance problems on both single instance and RAC installations. As a result Oracle recommends disabling Transparent HugePages on all servers running Oracle databases, as described in this MOS note.
 The following examples use the base path of "/sys/kernel/mm/transparent_hugepage/" which is used by OL6/OL7. For RHEL6/RHEL7 use "/sys/kernel/mm/redhat_transparent_hugepage/" as the base path.
You can check the current setting using the following command, which is displaying the default value of "enabled=[always]".
# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
#
For Oracle Linux 6 the preferred method to disable Transparent HugePages is to add "transparent_hugepage=never" to the kernel boot line in the "/boot/grub/grub.conf" file.
title Oracle Linux Server (2.6.39-400.24.1.el6uek.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.39-400.24.1.el6uek.x86_64 ro root=/dev/mapper/vg_ol6112-lv_root rd_NO_LUKS  KEYBOARDTYPE=pc KEYTABLE=uk
LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16  rd_NO_DM rd_LVM_LV=vg_ol6112/lv_swap rd_LVM_LV=vg_ol6112/lv_root rhgb quiet numa=off
transparent_hugepage=never
        initrd /initramfs-2.6.39-400.24.1.el6uek.x86_64.img
Oracle Linux 7 is similar, but uses GRUB2 so you need to edit the "/boot/grub2/grub.cfg" file using the grubby command.
# grubby --default-kernel
/boot/vmlinuz-4.1.12-61.1.6.el7uek.x86_64

# grubby --args="transparent_hugepage=never" --update-kernel /boot/vmlinuz-4.1.12-61.1.6.el7uek.x86_64

# grubby --info /boot/vmlinuz-4.1.12-61.1.6.el7uek.x86_64
index=2
kernel=/boot/vmlinuz-4.1.12-61.1.6.el7uek.x86_64
args="ro vconsole.font=latarcyrheb-sun16 rd.lvm.lv=ol/swap rd.lvm.lv=ol/root crashkernel=auto  vconsole.keymap=uk rhgb quiet LANG=en_GB.UTF-8 transparent_hugepage=never"
root=/dev/mapper/ol-root
initrd=/boot/initramfs-4.1.12-61.1.6.el7uek.x86_64.img
title=Oracle Linux Server 7.2, with Unbreakable Enterprise Kernel 4.1.12-61.1.6.el7uek.x86_64
The server must be rebooted for this to take effect.
Alternatively, add the following lines into the "/etc/rc.local" file and reboot the server.
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
   echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
if test -f /sys/kernel/mm/transparent_hugepage/defrag; then
   echo never > /sys/kernel/mm/transparent_hugepage/defrag
fi
Whichever method you choose, remember to check that the change has worked after the reboot.
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
#
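A slightly fuller check covers both the "enabled" and "defrag" settings, plus the anonymous huge page counter; this sketch assumes the OL6/OL7 path, so substitute the redhat_transparent_hugepage path on RHEL.
#!/bin/bash
# Confirm Transparent HugePages are disabled after the reboot
for f in enabled defrag; do
  echo -n "$f: "
  cat /sys/kernel/mm/transparent_hugepage/$f
done
# Should report 0 kB once THP is disabled and the server has been rebooted
grep AnonHugePages /proc/meminfo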
In OL7/RHEL7 you also need to consider the "tuned profile". The following script shows how to create and enable an amended version of the currently active tuned profile.
# # Check the active profile
# tuned-adm active
Current active profile: virtual-guest
#

# # Create directory to hold revised profile.
# mkdir /etc/tuned/virtual-guest-nothp

# # Create a new profile based on the current active profile.
# cat <<EOF > /etc/tuned/virtual-guest-nothp/tuned.conf
[main]
include= virtual-guest

[vm]
transparent_hugepages=never
EOF

# # Make the script executable.
# chmod +x /etc/tuned/virtual-guest-nothp/tuned.conf 

# # Enable the new profile.
# tuned-adm profile virtual-guest-nothp
 Thanks to Mor for pointing this out and directing me to the notes here and here.
With Transparent HugePages disabled, you should proceed to configure conventional HugePages, as described above.

Configuring 1G Hugepagesize

 As mentioned by Eugene in the comments, Oracle currently doesn't recommend using a 1G Hugepagesize. You can read more about this in MOS Doc ID 1607545.1. With that in mind, the rest of this section should probably be considered more of an academic exercise.
Check if your current hardware can support a Hugepagesize of 1G. If the following command produces any output, it can.
# cat /proc/cpuinfo | grep pdpe1gb
Thanks to Kevin Closson for pointing out the hardware support requirement.
Edit the "/etc/grub.conf" file, adding the following entries on to the kernel line of the default grub entry. Adjust the "hugepages" entry to the desired number of 1G pages. Notice this includes the disabling of Transparent HugePages, which is not mandatory, but a good idea.
transparent_hugepage=never hugepagesz=1G hugepages=1 default_hugepagesz=1G
Check the current HugePages setup.
# grep Huge /proc/meminfo
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
#
Reboot and check the HugePages setup again.
#  grep Huge /proc/meminfo
HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
#
==========================================================================