Thursday, May 15, 2014

Serious and pressure-aware memory overcommitment in Xen: Transcendent Memory

Overcommitting VM memory for fun and profit

What is the issue?

If, like me, you run virtual machines on Xen, you are probably aware that you can overcommit memory for your guests: if you have, say, 8 GB of free memory in addition to what your Domain-0 uses, it is perfectly fine to run 6 VMs with 2 GB of RAM each, as long as those VMs are Xen-aware (and practically all modern Linux kernels, in all distributions, are).
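For reference, the amount of RAM a guest is given is set in its xl config file; a minimal, illustrative excerpt (the guest name is made up):

name   = "vm1"
memory = 2048    # MB of RAM the guest sees
vcpus  = 1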

What happens is that Xen uses ballooning: each VM has a "balloon" device that does nothing, but does logically use RAM from the VM's point of view, while that RAM is not actually mapped to any physical RAM on the host. So if, in the above example, you have already started 4 VMs (thus filling up your physical RAM) and you start your 5th VM, Xen will first inflate the balloons in the 4 running guests. To the guest VMs, this looks like the balloon requiring more and more memory, so they will have less memory available to the rest of the system; the VMs may also need to evict read-cache pages to accommodate this. At some point there is enough free RAM to start the 5th VM, and once it has started, Xen will inflate its balloon too, until the situation settles into an equilibrium where all VMs use the same amount of memory. Start the 6th VM, and the same thing happens all over again.
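You can trigger this mechanism by hand to watch it work: from Domain-0, xl mem-set changes a guest's memory target, and the guest's balloon driver inflates or deflates to match (the domain name here is illustrative):

# xl mem-set vm1 1024m    # ask vm1 to shrink to 1 GB
# xl list                 # watch the Mem column adjust

Inside the guest, free will show the total memory shrinking accordingly.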

Now, this is all very nice, but it comes with one major shortcoming: the mechanism does not respond to memory pressure within a given VM! This means that if you cram six 2 GB VMs into 8 GB of physical RAM, each VM ends up with about 1.33 GB of available RAM. As a result, a VM that temporarily needs more will run out of memory (up to the point where its kernel starts killing processes to free up RAM), even if all the other VMs are nowhere near their 1.33 GB!

It doesn't have to be this way. Enter Transcendent Memory.

Transcendent Memory? What is this, Zen class?

No, it's Xen class: Transcendent Memory (tmem) is memory that exists outside the VM kernel, and over which the VM kernel has no direct control. Practically speaking, this is RAM that is managed by the hypervisor, and which the VM can access indirectly through a special API. This tmem is of unknown (to the VM) size, and its size may (and indeed will) change over time. Also, the VM may not always be allowed to write to tmem (if the hypervisor has no more free RAM to hand out, for example).

Transcendent Memory comes in "pools", of which a VM typically requests two: a (shared) ephemeral pool and a persistent pool. An ephemeral pool is a pool to which a VM may be able to write, but with no guarantee whatsoever that a page it just wrote can be read back later. In a persistent pool, on the other hand, a page you wrote is guaranteed to be readable later.
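Once tmem is up and running (see below), you can inspect these pools from Domain-0 with xl; the exact output format varies between Xen versions:

# xl tmem-list -a       # list tmem pools for all domains
# xl tmem-freeable      # show how much tmem memory is currently freeable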

Linux VMs access Transcendent Memory using the tmem kernel module. Internally, this enables three things:
  • selfballooning/selfshrinking: The VM will continually use a balloon to create artificial memory pressure, in order to get rid of RAM that it does not currently need. Of course, the hypervisor itself may also balloon the VM due to external reasons, which further reduces available RAM.
  • cleancache: At some point, the VM's RAM becomes so small that it has to evict pages from its read cache (also called the "clean cache", hence the name). Rather than just evicting a page, the kernel will first try to write it into the ephemeral pool, and will then evict the page. Conversely, if a process in the VM issues a block-device/file-system read request, the VM will first ask the ephemeral pool whether it has that page. If so, the VM just saved one disk read.
  • frontswap: With selfballooning/selfshrinking at work, the VM will be under constant memory pressure; it is left with whatever it actually needs, plus a very small margin. Of course, if you start a large process under these conditions, there will not be enough RAM for it. The selfballooning mechanism will respond to the memory pressure, but with a certain delay, so before the balloon has had a chance to deflate, the kernel needs to swap out pages. This would of course be slow, and that is where frontswap comes in: before swapping out a page, the kernel first tries to write it to its persistent tmem pool. If that succeeds, it does not need to actually write the page to a block device; if it fails, the kernel writes the page to the block device as usual. In the majority of cases, tmem is able to absorb the initial memory shock, so actual swap-outs are rare. In addition, there is a mechanism that slowly swaps pages back in from tmem and the swap device, so that neither gets clogged up with stale pages. (You can watch all three mechanisms at work; see the example below this list.)
So that is it: tmem will allow you to share your read cache between VMs, while keeping the RAM that your VM claims at any one time as small as possible.
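From inside a guest, cleancache and frontswap expose counters through debugfs, so you can watch them at work (assuming debugfs is mounted at /sys/kernel/debug; the exact file names may differ between kernel versions):

# mount -t debugfs none /sys/kernel/debug    # only if not already mounted
# grep . /sys/kernel/debug/cleancache/*      # puts, succ_gets, failed_gets, invalidates
# grep . /sys/kernel/debug/frontswap/*       # succ_stores, failed_stores, loads, invalidates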

Good. How do I use it?

The first step is to enable tmem in Xen: In your Domain-0, edit /etc/default/grub (this is assuming Debian or Ubuntu), and ensure that the GRUB_CMDLINE_XEN_DEFAULT string contains tmem. You'll have to run update-grub and reboot your physical host for this to take effect.
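Concretely, the relevant line in Domain-0's /etc/default/grub should end up looking something like this (keep any other hypervisor options you already pass):

GRUB_CMDLINE_XEN_DEFAULT="tmem"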

Second, you'll have to set up your guests to actually use tmem. For this, inside the VM, you edit /etc/default/grub such that GRUB_CMDLINE_LINUX contains tmem. Also, add tmem to /etc/modules. You'll have to run update-grub and reboot the guest for this to take effect.
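Again concretely, inside the guest this amounts to something like the following (keep your existing kernel options):

GRUB_CMDLINE_LINUX="tmem"

# echo tmem >> /etc/modules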

NOTE: It is critical to have a swap device configured in your guest, even a small one, because otherwise frontswap will NOT work! Without a swap device in the guest, you will keep running out of RAM whenever you try to start new processes.
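If your guest has no swap yet, a small swap file is enough; the size and path below are just an example:

# dd if=/dev/zero of=/swapfile bs=1M count=512
# chmod 600 /swapfile
# mkswap /swapfile
# swapon /swapfile
# echo '/swapfile none swap sw 0 0' >> /etc/fstab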

And that's it. Here's what xl top tells me:

xentop - 17:04:20   Xen 4.3.0
5 domains: 1 running, 4 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 16651980k total, 12513168k used, 4138812k free, 38912k freeable, CPUs: 4 @ 2500MHz
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR  VBD_RSECT  VBD_WSECT SSID
  (redacted) --b---      12324    3.7    1191920    7.2    2098176      12.6     1    1 1475615857 47146601    1        0   292522    47176   29285682    4827432    0
Tmem:  Curr eph pages:        0   Succ eph gets:        0   Succ pers puts:   463995   Succ pers gets:   463807
  (redacted) --b---       1476    0.2     226068    1.4    2098176      12.6     1    1    74948    30323    1      544  8277450    20808  239717602    1317408    0
Tmem:  Curr eph pages:        0   Succ eph gets:        0   Succ pers puts:   119478   Succ pers gets:   119364
  Domain-0 -----r      27858    5.9   10397272   62.4   10485760      63.0     2    0        0        0    0        0        0        0          0          0    0
  (redacted) --b---        172    0.1     225820    1.4    2098176      12.6     2    1     7837      655    1        2    36238     9679    2126866     673864    0
Tmem:  Curr eph pages:       75   Succ eph gets:       75   Succ pers puts:     2359   Succ pers gets:     2359
  (redacted) --b---         59    0.1     207624    1.2    1741824      10.5     2    1     1377     2512    1        1    39733     1242    3007082      64040    0
Tmem:  Curr eph pages:        0   Succ eph gets:        0   Succ pers puts:      623   Succ pers gets:      623

Note that each VM is configured to have 1.7 to 2 GB of RAM, but most use only about 250 MB.


However... In Debian, the standard kernel does not have all the required features enabled, whether built in or as a module


This means that you'll have to build your own kernel package. To do this, you issue the following commands (modulo your kernel version):

# apt-get build-dep linux-image-3.13-1-amd64
# apt-get source linux-image-3.13-1-amd64

You cd into the linux-3.13.10 directory, and copy over the config from the live system:

# cp /boot/config-3.13-1-amd64 ./.config

Apply the following changes to .config:

482,483c482,483
< # CONFIG_CLEANCACHE is not set
< # CONFIG_FRONTSWAP is not set
---
> CONFIG_CLEANCACHE=y
> CONFIG_FRONTSWAP=y
485a486
> # CONFIG_ZSWAP is not set
5137a5139
> CONFIG_XEN_SELFBALLOONING=y
5148a5151
> CONFIG_XEN_TMEM=m
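If you prefer not to edit .config by hand, the kernel's scripts/config helper can apply the same changes (run from the source directory; the diff above remains the reference):

# scripts/config --enable CLEANCACHE --enable FRONTSWAP --disable ZSWAP
# scripts/config --enable XEN_SELFBALLOONING --module XEN_TMEM
# make olddefconfig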

Build the packages:

# make deb-pkg LOCALVERSION=-tmem KDEB_PKGVERSION=1

This will create four packages in the parent directory. Normally, you need to install just the -tmem image:

# dpkg -i ../linux-image-3.13.10-tmem_1_amd64.deb
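After rebooting into the new kernel, you can check that everything is in place; the exact module name and log messages may differ slightly between kernel versions:

# lsmod | grep tmem
# dmesg | grep -i -e tmem -e frontswap -e cleancache

If all went well, the Tmem lines in xl top on the host will start showing non-zero counters for this guest.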

Enjoy!