Tuesday, September 2, 2014

New server hardware: SuperMicro X10SAE and Xeon E3-1265L v3

As my previous server mainboard died, I decided to upgrade to a SuperMicro X10SAE and a Xeon E3-1265L v3.

Just a quick post for those interested in running this combination under Linux.

$ lspci -tv

-[0000:00]-+-00.0  Intel Corporation Xeon E3-1200 v3 Processor DRAM Controller
           +-01.0-[01]----00.0  LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
           +-02.0  Intel Corporation Xeon E3-1200 v3 Processor Integrated Graphics Controller
           +-03.0  Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller
           +-14.0  Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI
           +-16.0  Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1
           +-16.3  Intel Corporation 8 Series/C220 Series Chipset Family KT Controller
           +-19.0  Intel Corporation Ethernet Connection I217-LM
           +-1a.0  Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2
           +-1b.0  Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller
           +-1c.0-[02]--
           +-1c.3-[03]----00.0  Intel Corporation I210 Gigabit Network Connection
           +-1c.5-[04-05]----00.0-[05]----03.0  Texas Instruments TSB43AB22A IEEE-1394a-2000 Controller (PHY/Link) [iOHCI-Lynx]
           +-1c.6-[06]----00.0  Renesas Technology Corp. uPD720202 USB 3.0 Host Controller
           +-1c.7-[07]----00.0  ASMedia Technology Inc. ASM1062 Serial ATA Controller
           +-1d.0  Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1
           +-1f.0  Intel Corporation C226 Series Chipset Family Server Advanced SKU LPC Controller
           +-1f.2  Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode]
           +-1f.3  Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller
           \-1f.6  Intel Corporation 8 Series Chipset Family Thermal Management Controller

Note that the SAS2008 is a plugged-in PCI-e x8 SAS HBA, so that device will not show up in lspci on a vanilla mainboard.

The two network interfaces work out of the box on a Linux 3.13 kernel: the I217-LM uses the e1000e kernel module, and the I210 uses igb.

The first CPU (in /proc/cpuinfo) looks like:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Xeon(R) CPU E3-1265L v3 @ 2.50GHz
stepping : 3
microcode : 0x17
cpu MHz : 2500.056
cache size : 8192 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc eagerfpu pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm ida arat epb xsaveopt pln pts dtherm fsgsbase bmi1 hle avx2 bmi2 erms rtm
bogomips : 5000.11
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

Note that the OS runs as a Xen Dom0, hence the hypervisor flag.
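The 2-core/4-thread layout (the "cpu cores" and "siblings" fields above) can be confirmed with a quick grep; a small sketch:

```shell
# Count logical CPUs, then show the per-package physical core count.
grep -c '^processor' /proc/cpuinfo
grep -m1 '^cpu cores' /proc/cpuinfo || true   # the field may be absent on some platforms
```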


Thursday, May 15, 2014

Serious and pressure-aware memory overcommitment in Xen: Transcendent Memory

Overcommitting VM memory for fun and profit

What is the issue?

If, like me, you run virtual machines on Xen, you are probably aware that you can overcommit memory for your guests: if you have, say, 8 GB of free memory in addition to what your Domain-0 uses, it is perfectly fine to run 6 VMs with 2 GB of RAM each, as long as those VMs are Xen-aware (and practically all modern Linux kernels for all distributions are).

What will happen is that Xen uses ballooning: each VM has a "balloon" device that does nothing, but does logically use RAM (from the VM's point of view), while this RAM is not actually mapped to any physical RAM on the host. So if, in the above example, you have already started 4 VMs (thus filling up your physical RAM) and you start your 5th VM, Xen will first inflate the balloons in the 4 running guests. To the guest VMs, this looks like the balloon requiring more and more memory, so they will have less memory available to the system. The VMs may also need to evict read-cache pages to accommodate this. At some point, there is enough RAM to start the 5th VM, and once it has started, Xen will also inflate its balloon, until the situation reaches an equilibrium where all VMs use the same amount of memory. Start the 6th VM, and the same thing happens all over again.

Now, this is all very nice, but it comes with one major shortcoming: this mechanism does not respond to memory pressure within a given VM! This means that if you crammed 6 2 GB VMs into 8 GB of physical RAM, each VM will have 1.33 GB of available RAM. As a result, a VM that temporarily needs more will run out of memory (up to the point where its kernel starts killing processes to free up RAM), even if all the other VMs are not currently using their full 1.33 GB of RAM!

It doesn't have to be this way. Enter Transcendent Memory.

Transcendent Memory? What is this, Zen class?

No, it's Xen class: Transcendent Memory (tmem) is memory that exists outside the VM kernel, and over which the VM kernel has no direct control. Practically speaking, this is RAM that is managed by the hypervisor, and which the VM can indirectly access through a special API. This tmem is of unknown (to the VM) size, and its size may (and indeed will) change over time. Also, the VM may not always be allowed to write to tmem (if the hypervisor has no more free RAM to manage, for example).

Transcendent Memory comes in "pools"; a VM typically requests two of these: a (shared) ephemeral pool and a persistent pool. An ephemeral pool is a pool to which a VM may be able to write, but for which there is no guarantee whatsoever that the page it just wrote can be read back later. In a persistent pool, on the other hand, it is guaranteed that a page you wrote can later be read.

Linux VMs access Transcendent Memory through the tmem kernel module. Internally, this enables three things:
  • selfballooning/selfshrinking: The VM will continually use a balloon to create artificial memory pressure, in order to get rid of RAM that it does not currently need. Of course, the hypervisor itself may also balloon the VM due to external reasons, which further reduces available RAM.
  • cleancache: At some point, the VM's RAM becomes so small that it has to evict pages from its read cache (also called the "clean cache", hence the name). Rather than just evicting a page, the kernel will first try to write it into the ephemeral pool, and will then evict the page. Conversely, if a process in the VM issues a block-device/file-system read request, the VM will first ask the ephemeral pool whether it has that page. If so, the VM just saved one disk read.
  • frontswap: With selfballooning/selfshrinking at work, the VM will be under constant memory pressure; it will be left with whatever it actually needs, plus a very small margin. Of course, if you start a large process under these conditions, there will not be enough RAM to start it. The selfballooning mechanism will respond to the memory pressure, but with a certain delay. Therefore, before the balloon has had a chance to deflate, the kernel will need to swap out pages. This would of course be slow, and that is where frontswap comes in: Before swapping out a page, the kernel will first try to write it to its persistent tmem pool. If successful, it does not need to actually write the page to a block device. If not successful, the kernel will write the page to the block device. In the majority of cases, tmem will be able to absorb the initial memory shock, thus actual swaps occur rarely. In addition, there exists a mechanism that will slowly swap pages back in from tmem and the swap device, so that neither is clogged up with useless pages.
So that is it: tmem will allow you to share your read cache between VMs, while keeping the RAM that your VM claims at any one time as small as possible.

Good. How do I use it?

The first step is to enable tmem in Xen: In your Domain-0, edit /etc/default/grub (this is assuming Debian or Ubuntu), and ensure that the GRUB_CMDLINE_XEN_DEFAULT string contains tmem. You'll have to run update-grub and reboot your physical host for this to take effect.

Second, you'll have to set up your guests to actually use tmem. For this, inside the VM, you edit /etc/default/grub such that GRUB_CMDLINE_LINUX contains tmem. Also, add tmem to /etc/modules. You'll have to run update-grub and reboot the guest for this to take effect.
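Concretely, the two GRUB edits look something like this (showing only the tmem token; keep whatever else your variables already contain):

```shell
# Dom0, /etc/default/grub: pass tmem to the Xen hypervisor
GRUB_CMDLINE_XEN_DEFAULT="tmem"

# Guest, /etc/default/grub: pass tmem to the Linux kernel
GRUB_CMDLINE_LINUX="tmem"
```

In both cases, follow up with update-grub and a reboot.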

NOTE: It is critical to have a swap device configured in your guest, if only a small swap device, for otherwise frontswap will NOT work! That means that without a swap device in your guest, you'll continually be running out of RAM when trying to start processes.
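If your guest has no swap device yet, a small swap file is enough. A minimal sketch, written against a demo path in /tmp; on a real guest you would use a permanent path such as /swapfile, add an /etc/fstab entry, and run swapon as root:

```shell
# Create and format a small (64 MB) swap file at a demo location.
swapfile=/tmp/demo-swapfile
dd if=/dev/zero of="$swapfile" bs=1M count=64 status=none
chmod 600 "$swapfile"
mkswap "$swapfile"
# swapon "$swapfile"   # enable it for real (root only)
```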

And that's it. Here's what xl top tells me:

xentop - 17:04:20   Xen 4.3.0
5 domains: 1 running, 4 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 16651980k total, 12513168k used, 4138812k free, 38912k freeable, CPUs: 4 @ 2500MHz
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR  VBD_RSECT  VBD_WSECT SSID
  (redacted) --b---      12324    3.7    1191920    7.2    2098176      12.6     1    1 1475615857 47146601    1        0   292522    47176   29285682    4827432    0
Tmem:  Curr eph pages:        0   Succ eph gets:        0   Succ pers puts:   463995   Succ pers gets:   463807
  (redacted) --b---       1476    0.2     226068    1.4    2098176      12.6     1    1    74948    30323    1      544  8277450    20808  239717602    1317408    0
Tmem:  Curr eph pages:        0   Succ eph gets:        0   Succ pers puts:   119478   Succ pers gets:   119364
  Domain-0 -----r      27858    5.9   10397272   62.4   10485760      63.0     2    0        0        0    0        0        0        0          0          0    0
  (redacted) --b---        172    0.1     225820    1.4    2098176      12.6     2    1     7837      655    1        2    36238     9679    2126866     673864    0
Tmem:  Curr eph pages:       75   Succ eph gets:       75   Succ pers puts:     2359   Succ pers gets:     2359
  (redacted) --b---         59    0.1     207624    1.2    1741824      10.5     2    1     1377     2512    1        1    39733     1242    3007082      64040    0
Tmem:  Curr eph pages:        0   Succ eph gets:        0   Succ pers puts:      623   Succ pers gets:      623

Note that each VM is configured to have 1.7 to 2 GB of RAM, but most use only about 250 MB.


However... in Debian, the standard kernel does not have all the required features enabled, either built-in or as a module


This means that you'll have to build your own kernel package. To do this, you issue the following commands (modulo your kernel version):

# apt-get build-dep linux-image-3.13-1-amd64
# apt-get source linux-image-3.13-1-amd64

You cd into the linux-3.13.10 directory, and copy over the config from the live system:

# cp /boot/config-3.13-1-amd64 ./.config

Apply the following changes to .config:

482,483c482,483
< # CONFIG_CLEANCACHE is not set
< # CONFIG_FRONTSWAP is not set
---
> CONFIG_CLEANCACHE=y
> CONFIG_FRONTSWAP=y
485a486
> # CONFIG_ZSWAP is not set
5137a5139
> CONFIG_XEN_SELFBALLOONING=y
5148a5151
> CONFIG_XEN_TMEM=m
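For reference, after applying the diff the relevant .config lines should read:

```shell
CONFIG_CLEANCACHE=y
CONFIG_FRONTSWAP=y
# CONFIG_ZSWAP is not set
CONFIG_XEN_SELFBALLOONING=y
CONFIG_XEN_TMEM=m
```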

Build the packages:

# make deb-pkg LOCALVERSION=-tmem KDEB_PKGVERSION=1

This will create four packages in the parent directory. Normally, you need to install just the -tmem image:

# dpkg -i ../linux-image-3.13.10-tmem_1_amd64.deb

Enjoy!



Thursday, April 10, 2014

More slight downloading inconveniences in the digital age

The Netherlands forbids downloading of IP-protected material

Today, the news broke that the downloading of IP-protected material has become illegal in The Netherlands, effective immediately. Uploading has been illegal for some time, but now an EU court has ruled that downloading is illegal, too.

How ever so slightly inconvenient.

How utterly useless.

Let's fix it right away.

What is the goal here?

Although I live in Switzerland, where IP laws are a lot saner, and where downloading of music and movies for personal use is allowed, this news annoyed me, so I decided to fix the issue once and for all.
In this post, I show how to route a complete subnet from your home network through a VPN provider, so that its traffic surfaces in a country with poor IP protection. See, e.g., the 2013 IP ranking list (the bottom part, that is) for some suitable countries; there are plenty to choose from.

Step 1: Choose a VPN provider, preferably OpenVPN

There are plenty of alternatives here. I chose HideMyAss for this experiment, mainly because they have servers in many countries, and they accept payment in Bitcoin.

Step 2: Edit the OpenVPN config file, to prevent all of your home-network data from being routed through the VPN

Although you're of course welcome to route all of your data through the VPN, this is not recommended for two reasons:

  1. It will generally be slower than your "open" connection.
  2. It is bad privacy practice to run both your identifiable data (email, etc.) and your file-sharing traffic through the same end point.
I blogged earlier in a lot of detail on how to do this. In my case, I used a default HideMyAss OpenVPN config file, and commented out the "route-metric 1" line. Then, I added a "route-noexec" line to prevent the VPN server from pushing routes on me. I also replaced the "auth-user-pass" line with "auth-user-pass hma.pass", where hma.pass contains my login and password (each on a separate line) to automate the login process. Finally, I replaced the "dev tun" line with "dev tun-hma", so that I have a stable TUN-device name whenever I connect to this VPN.
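With those edits applied, the relevant portion of the client config looks like this (all other lines left as shipped by the provider):

```shell
# stable TUN device name
dev tun-hma
# don't accept routes pushed by the server
route-noexec
# the pushed route metric is no longer needed
#route-metric 1
# credentials file: login and password, each on its own line
auth-user-pass hma.pass
```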

On my Debian gateway server, I copied this file to /etc/openvpn, so that it starts at boot. If you want to start it right away, issue an /etc/init.d/openvpn restart.

Step 3: Route your outgoing traffic from your file-sharing subnet through the VPN

In my case, I added a new VLAN to my home network; VLAN #6, which runs network 192.168.6.0/24. I will use Linux Source Routing to route only VLAN6 traffic, and then only outgoing traffic, through device tun-hma.  I added the following rules to my /etc/rc.local file to achieve this:


# Route traffic from the file-sharing network to our own
# network via normal tables.
/sbin/ip rule add from 192.168.6.0/24 to 192.168.0.0/16 lookup main prio 200
# Route all other traffic from the file-sharing network
# through table "3", which, in turn, gets routed through HMA.
/sbin/ip rule add from 192.168.6.0/24 lookup 3 prio 201
/sbin/ip route add default dev tun-hma table 3
# Traffic via HMA must be NAT'ed.
/sbin/iptables --table nat -A POSTROUTING -o tun-hma -j MASQUERADE

And that's it. The first "prio 200" line is actually fairly critical: it ensures that your file-sharing network is able to connect to other machines on your home network through the default routing tables. That is convenient, since you'll possibly want to save files that you download to NFS, or elsewhere on the local network. The "prio 201" line (where 201 means a lower priority than 200) then routes traffic that does not match the "file-sharing net to other local nets" rule through the VPN. The table number 3 is arbitrary.
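Once the rules are in place, a quick sanity check looks like this (the exact output will vary; on a box without these rules you will just see the defaults):

```shell
ip rule show             # should list the prio 200 and 201 rules
ip route show table 3    # should show the default route via tun-hma
```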

Step 4: (optional) Add some more plumbing

In my case, the new VLAN is also sent to my Xen host machine, using 802.1q tagging. Inside the Xen host, I run a software bridge that bridges the VLAN to my main up/downloading machine. And that's it: I now upload and download in another country.
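On Debian, that host-side plumbing (tagged VLAN plus bridge) can be declared in /etc/network/interfaces, roughly like this; the interface names are my assumptions, and the vlan and bridge-utils packages are required:

```shell
# Bring up the tagged VLAN-6 interface and bridge it for the guest.
auto eth0.6
iface eth0.6 inet manual

auto br6
iface br6 inet manual
    bridge_ports eth0.6
    bridge_stp off
```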