Containers Microconference Notes

If you wish to follow along or take notes: https://etherpad.fr/p/LPC2014_Containers

== First part (project updates) ==

- Title: Opening speech / schedule presentation / intros / ...
  By: Brandon Philips and Stéphane Graber
  Length: 5min

- Title: LXC after 1.0
  By: Stéphane Graber
  Length: 10min

  - Maintaining 1.0, committed to 5 years of support
  - Next version is LXC 1.1, release by the end of the year (hopefully)
  - Major features:
    - CRIU support for live migration between hosts
    - Open vSwitch (OVS) support, connecting to bridges
    - User namespace support
    - Similar experience on all distros:
      - Ubuntu has had DHCP, bridges, etc. for a while
      - LXC now ships with systemd units, sysv scripts, etc. to provide a bridge, DHCP, and so on
  - liblxc API changes for LXC 1.1; Python 3 and Lua bindings in tree (support for Python 2, Ruby, Go and Haskell exists out of tree)
  - Changes to cgmanager:
    - A FUSE filesystem to give an appropriately restricted view of cgroupfs to a guest, to make systemd work in containers
    - The idea is to bind mount over cgroupfs
    - We want systemd to see the standard cgroup view so that it can write to it and things just work
    - A patch for cgroup namespaces is going around. That's a better solution but might take a while.
  - Mount story:
    - A mount callout using seccomp to vector sys_mount to user space is under consideration, but not being actively developed (OpenVZ may investigate first)
    - ext4 is now "committed" to treating mount-triggered bugs as "security" bugs for the purposes of priority, as they would now be fuzzable by regular users (nominally, mount is root-only before containers)

- Title: Are the containers that we have now secure enough?
  By: Vladimir Davydov and Pavel Emelyanov
  Length: 10min

  - kmem accounting:
    - kmem is everything except userspace memory
    - The idea is to keep from hitting denial-of-service-style resource exhaustion
    - Accounting is already done as part of the memory cgroup: memory.kmem.limit_in_bytes
    - However, it lacks slab-shrinking support to reclaim memory (targeting Linux 3.19):
      - Won't account for page tables
      - Fixable memory leaks are still in the patches
      - We end up with a container stuck in ENOMEM loops
  - Soft limits do not work as they should:
    - Containers aren't pushed back under system pressure
  - The OOM killer is still not cgroup aware (Marian Marinov noted that patches exist and that they work well in his experience):
    - https://lkml.org/lkml/2012/10/16/168
    - https://lkml.org/lkml/2014/6/10/167
  - veth problems:
    - It works at the link layer, with the problem that traffic sniffing and IP spoofing are possible:
      - ebtables adds extra cost
    - "But the same problem exists in all other environments, such as KVM solutions"
      - Agreed, but this is a potential solution. Basically we're saying containers could be made better than current solutions.
    - Proposal of a vnet device:
      - pass through an
    - Why not macvlan? (Marian Marinov)
      - Pavel: it removes things like per-container ARP tables
    - Why not OVS? (Marian)
      - It simplifies management for some set of users (jejb)
  - Marian: We need the following per container:
    - CAP_SYS_BOOT for reboots
    - iptables uid
    - ulimit fixes
    - nproc limit
    - swap
    - filesystem quotas
    - filesystem ioctls (IMMUTABLE etc.)
    - sync/fsync
    - tty ioctls
    - tcp-repair mode & snd/rcv buffers for CRIU
    - memcg-kill-alloc-task
    - dmesg should not be allowed in containers
    - ephemeral port allocation

- Title: Checkpoint/restore of containers with CRIU
  By: Tycho Andersen, Pavel Emelyanov, Serge Hallyn and Saied Kazemi
  Length: 40min

  - What is CRIU:
    - Processes --> Checkpoint --> Image files with states --> Restore --> Processes --> ...
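  In CLI terms, that cycle looks roughly like the following minimal sketch; the PID and image directory are hypothetical, and real containers need more options than this (see criu.org for details):

    # Checkpoint: dump the process tree rooted at PID 1234 into ./images
    mkdir images
    criu dump -t 1234 -D images --shell-job --tcp-established

    # Restore: recreate the same process tree from the image files
    # (--tcp-established keeps established TCP connections, as noted below)
    criu restore -D images --shell-job --tcp-established

  (--shell-job is needed here only because the example tree is attached to a terminal.)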
  - Started 3 years ago, after in-kernel C/R was refused
  - Use cases: live migration, kernel upgrade without a reboot, slow service startup, periodic snapshots of long-running tasks, debugging, a save button for nethack
  - Latest release: 1.3.1 (Sept 12):
    - All namespaces supported except user namespaces (ongoing work)
    - OpenVZ, LXC and Docker tool integration
    - Has a CLI, RPC APIs and a C library
    - Can c/r one end of an established TCP connection:
      - Marian: there is a problem with restoring half-closed TCP socket connections
      - Marian: I have a module for half-closed, will not push upstream (unlikely to be an acceptable solution) ;) If someone wants that, call me :)
      - Marian: fs locking does not work

  - Tycho Andersen -- LXC+CRIU:
    - A wrapper to restore some other things criu may not catch; mostly a frontend to criu
    - Reattaches lxc-* tools metadata
    - Future work:
      - pre-dump (incremental dump) support
      - support for all fs options (e.g. FUSE)
    - Demo:
      - Using a cheesy shell script to demo migration between hosts (see the sketch at the end of this session's notes)
      - In the future would like to use p.haul and collaborate

  - Saied Kazemi -- An experiment in checkpointing and restoring Docker containers:
    - Work in progress
    - Focus has been on c/r of a single container, not an entire Docker daemon and its children
    - Using the criu tool directly can cause problems:
      - Filesystem and cgroup setup on the restore target
      - The Docker daemon needs to be aware that the process will be "exiting"
      - The Docker daemon needs to be the parent of the restored container
    - Integration with Docker is required to fix the above issues:
      - The target for criu integration is libcontainer
    - Demo:
      - docker checkpoint $cid
      - docker restore $cid
    - Q: Have you tried btrfs or devicemapper? A: No, we haven't.
    - Q: What happens to the fs? A: On the checkpoint side Docker dismantles the filesystem as if the process had exited; docker restore re-creates it.

  - Pavel Emelyanov -- Live migration:
    - Live migration is not simply dump, copy and restore: pre-copy etc. is often needed
    - P.Haul project:
      - Collects together the pieces to make migration simpler:
        - incremental dump
        - progress check
        - final dump-copy-restore
      - Plugin driven, for the specific details of the tool that created the container
    - Demo:
      - Worked nicely between two VMs; used NFS to share the container filesystem root
    - Future work:
      - Unclear if it should be a standalone project to orchestrate various container types, or a library, or other infrastructure
    - Q&A:
      - Q: What are the limits, for example a qemu-kvm process talking to /dev/kvm? A: Devices cannot be arbitrarily checkpointed; there are descriptions on the CRIU wiki (http://criu.org/What_cannot_be_checkpointed has specifics). There is a plugin engine in criu to support currently unsupported devices.
      - Interestingly, the TUN device is supported -> OpenVPN can be live-migrated
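  A minimal sketch of the kind of "cheesy" cold migration mentioned above, assuming LXC 1.1's lxc-checkpoint and a rootfs shared over NFS as in the p.haul demo; the container name, paths and destination host are hypothetical, and p.haul exists precisely because real migrations want pre-copy rather than this stop-the-world flow:

    #!/bin/sh
    # Naive LXC cold migration: dump on the source, copy the images,
    # restore on the destination. The container is frozen for the
    # whole transfer; there is no pre-dump/pre-copy here.
    CT=mycontainer
    DIR=/tmp/checkpoint
    DEST=desthost               # hypothetical destination host

    lxc-checkpoint -n $CT -D $DIR -s           # -s: stop the container after dumping
    rsync -a $DIR/ $DEST:$DIR/                 # ship the image files
    ssh $DEST lxc-checkpoint -r -n $CT -D $DIR # restore on the other side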
===== Tea break =====

- Title: systemd and containers: standards, integration, and APIs
  By: Lennart Poettering
  Length: 15min

  - systemd implements everything in the spec, but this isn't systemd specific
  - It explains the setup of the container by the container manager
  - Most everything is at: http://freedesktop.org/wiki/Software/systemd/ContainerInterface
  - Particularly interesting:
    - By default udev does not run in the container, because the kernel does not do device isolation:
      - It's about honesty: if things look like they won't work, we don't try
      - systemd avoids starting udev when /sys is mounted read-only
    - Lennart's opinion is not to play tricks with the ttys
    - Weak rules around networking:
      - A host0 veth link will come up by default when running with systemd defaults
  - Suggestions of what not to do:
    - Do not drop CAP_MKNOD, to enable PrivateDevices= for services
    - http://freedesktop.org/wiki/Software/systemd/writing-vm-managers
  - machined:
    - lxc, libvirt-lxc, docker, etc.: consider registering with machined
    - Wanted to integrate containers nicely, like Solaris zones:
      - Being able to introspect the registered containers
      - Being able to get the logs out of the container from the host's journalctl

- Title: Kubernetes and Google: Lessons learned from 8+ years of containers
  By: Victor Marmol
  Length: 15min

  - Introduction to the concept of Kubernetes. The code is available at https://github.com/googlecloudplatform/kubernetes
  - Containers in Kubernetes are more like a single application than a machine; they are intended to support micro-services
  - Introducing the concept of a pod:
    - You may have two things that run together, e.g. a frontend web server and its log shipper; they need to share some namespaces
    - In the case of the frontend and the logger you want different resource limits but some slack:
      - e.g. frontend 2GB, logger 1GB, but the pod has a 4GB allocation to provide slack (see the sketch after this talk's notes)
  - Introducing the concept of a label:
    - Containers can have an arbitrary set of labels, e.g. type: frontend
  - Introducing the concept of a service:
    - A service is a selector query over a given label, e.g. I have 4 frontend containers with the label type=frontend; all of them are added to a service and load balanced across a port
  - Lesson learned: don't do port brokering:
    - Kubernetes has an IP per pod and an IP per service
  - Lesson learned: everything is in a container:
    - The system services are in a container (init, agent, system services)
    - We do want to ensure system services are limited
  - Heavily invested in sub-containers at Google:
    - Primarily for resource accounting
    - Resources and tiers are used to give slack resources away to batch work like map-reduce jobs:
      - The kernel must handle out-of-memory conditions well to make this possible: kill batch workloads first, then apps, then system services
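  As a concrete reading of the pod slack example above, here is a minimal sketch using nested cgroup-v1 memory controllers, which is roughly the mechanism such sub-container accounting builds on; the names, paths and the use of soft limits are illustrative assumptions, not Google's actual setup:

    # Pod-level hard cap with per-container soft limits underneath.
    # Soft limits let either container borrow the pod's slack while
    # there is headroom; under pressure the kernel reclaims toward them.
    # (Assumes hierarchical accounting, memory.use_hierarchy, is enabled.)
    cd /sys/fs/cgroup/memory
    mkdir -p pod0/frontend pod0/logger
    echo 4G > pod0/memory.limit_in_bytes
    echo 2G > pod0/frontend/memory.soft_limit_in_bytes
    echo 1G > pod0/logger/memory.soft_limit_in_bytes
    echo $FRONTEND_PID > pod0/frontend/tasks   # hypothetical PIDs
    echo $LOGGER_PID > pod0/logger/tasks

  Note that the caveat from the security talk earlier still applies: soft limits "do not work as they should" under system pressure.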
  - Q&A:
    - Q: What about loosely coupled systems like the LAMP stack, with regard to pods? A: The cluster scheduler in Kubernetes today does not do this kind of scheduling for loose locality, like rack locality, but it is something that can be done in the future.
    - Q: What about system daemons that are scheduled on the machine? A: Today we have some static setup that we put on the machines, but long term that is not something that we want.
    - Q: You say that you don't support running systemd in a container? A: Likely a misunderstanding of the conversation; running systemd or a full init is OK if that is your workload.
      - systemd will be added to its own control group by default moving forward

- Title: Resource management across different tools
  By: Rohit Jnagal
  Length: 15min

  - If you are slicing up a machine for a number of users and "reselling" it, you need to have everything accounted
  - Need
  - Q: When you go below one core, how do you do the accounting? A: There are multiple ways to solve this: if you use VMware, they calculate the heuristic. At Google we use a nominal CPU. When you land on a machine, the cluster scheduler can make the necessary local adjustments. We manage it by qualifying machines on our nominal workloads to find the multipliers relative to our "nominal CPU".
  - Faking stuff to make containers look like real machines:
    - A FUSE filesystem feels like a somewhat heavyweight solution to fix up the cgroupfs problem
    - Follow-up: patches or something upstream to work around this? Note taker will follow up.
  - Looking at how to implement subcontainers in libcontainer
  - Detection of task death is difficult to do reliably:
    - Was it because of this cgroup, or machine-level pressure?
    - Added a patch to Docker to haul out the metadata about how the process died (e.g. OOM) for use by the scheduler or Docker clients
    - Looking at adding first-class support for post-mortem handling or a pre-death script
  - Non-unified namespaces:
    - To do the pods, we create a "network container" and then the other containers join that network namespace
    - Would like someone from the kernel community to confirm whether this non-unified joining model is OK long term
  - Need more stats and knobs out of cgroups:
    - Want to see the load of individual jobs
    - CPU load patches will be cleaned up and sent upstream

== Second part (device namespaces and related) ==

- Title: FUSE mounts from user namespaces
  By: Seth Forshee
  Length: 10-15min

  - Looking at the problem of mounting inside of a user namespace container:
    - Problems: security, and access to block devices
    - FUSE is designed for unprivileged mounts and needs no block device; a good first step for mounts in containers
  - Status update:
    - Working implementation: git://kernel.ubuntu.com/sforshee/linux.git fuse-userns
    - lkml discussion on the v4 patch set
  - Demo:
    - Using fuseext2 inside of the container; it works! (a sketch follows these notes)
  - Q&A:
    - Q: Can I implement a malicious FUSE filesystem that can do denial of service? A: That would be a bug. FUSE is usable by unprivileged users today, so that is already possible without containers.
    - Q: Are these always mapped nodev? A: Discussion is still going on upstream, but it will be mounted nodev. suid will be allowed but won't be viewable outside of the namespace (notetaker: I might have heard that wrong).
    - Q: Does fuseblk work? A: Yes.
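  A minimal sketch of what the fuseext2 demo above might look like from inside an unprivileged container, assuming a kernel with the fuse-userns patches; the image path and mount point are hypothetical, and the fuseext2 option syntax may vary by version:

    # Inside the user-namespace container, on a patched kernel:
    mkdir -p /tmp/mnt
    fuseext2 /tmp/disk.img /tmp/mnt -o ro   # FUSE: no real block device required
    ls /tmp/mnt                             # browse the ext2 image contents
    fusermount -u /tmp/mnt                  # unmount when done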
- Title: Dynamic device management in an LXC environment
  By: Michael Coss
  Length: 15min

  - Goal: create a virtual desktop environment with performance close to a non-virtualized environment:
    - Allow for the addition/removal of hardware
  - How do you do device management for this?
    - devtmpfs, sysfs and udevd are not namespace or container aware
    - Most containers bind mount /dev
    - Assuming you get udev running in a container, it will receive all uevents
    - We want a namespace-aware devtmpfs, but it doesn't exist
  - Our changes:
    - Pass uevents only to processes in the server's network namespace
    - Introduce a new udevns event filter daemon:
      - It locates containers interested in a given uevent and forwards it to the target container

- Title: How far are we from running distributions inside containers?
  By: Amir Goldstein and Oren Laadan
  Length: 10min

- Announcement about alternatives for the German train strike tomorrow
  By: James Bottomley

If you had slides, please upload them to the Plumbers site by logging in and clicking on "edit" by your proposal. Slide upload is then at the bottom (slides will appear in your discussion session on the timetable).