Network Virtualization Security Microconference Notes

Welcome to Linux Plumbers Conference 2014. Please use this etherpad to take notes. Microconf leaders will be giving a summary of their microconference during the Friday afternoon closing session. Please remember there is no video this year, so your notes are the only record of your microconference.

Presentation rules:
* Each presentation is limited to 10 minutes
* Maximum of 3 slides
* Additional time is allocated for discussions. The time allocated is proportional to how meaningful the topic is to the community.
* No sales & marketing pitches. No exceptions!
* Minimal or no intros & build-up. Assume a 10/10 technical audience.

Schedule
* 09:30 Encapsulation protocols & encryption
  * Geneve: Generic Network Virtualization Encapsulation (Jesse)
  * Encrypted VXLAN (Alexei)
  * UDP encapsulation, FOU, & GUE (Tom)
  * Does network packet format matter? (Alexei)
* 10:40 5-minute break
* 10:45 Accelerating interfaces to guests
  * A High Performance Socket Interface in Linux (John, John)
  * Virtualized high performance (DPDK) packet processing (Vincent)
* 11:40 5-minute break
* 11:45 Network virtualization & virtual switching
  * OVS Micro Summit Summary (Thomas, Jesse)
  * Adding stateful features to OVS by leveraging kernel functions (Justin)
  * Kernel vs userland packet switching (Dan)
  * Integrated Network Virtualization (Tom)
* 12:30 Lunch

Jesse Gross: Geneve
* Pronounce it any way you want, as long as you talk about it
* http://tools.ietf.org/html/draft-gross-geneve-01
* Goal is to introduce a flexible encap with enough structure to allow interop, one that can evolve into whatever is required. Not breaking NIC offloads is an essential part.
* Merged into net-next and openvswitch.org
* TLV-based message format; the VNI remains in the static header (see the header sketch after this section)
* What should the offloading look like?
  * Ideally we would not have to care about it from the software side. In reality, offloads break regularly. A way around that is to build in enough flexibility that new demands and requirements do not need protocol extensions.
  * No need to overly focus on the checksum properties
* Does the kernel care about the metadata, or is it passed through to user space?
  * OVS passes it as a binary blob to user space for now
  * The kernel datapath matches on the binary blob with a bitmask, without knowing about the exact structure
  * The ordering issue is resolved by having a separate datapath flow for each order variation
  * How to prevent a guest from receiving packets with a specific TLV? Would need a separate flow for each order combination
  * Possible attack vector, since reordering the options is trivial
* Registration of the UDP encap port number on RX
  * Available for VXLAN today
  * A generic registration API covering Geneve and other encaps is desirable and doable (a hypothetical sketch follows below)
  * A generic encap filtering API, resulting in decap/encap actions, to cover decapsulation and checksums on more than just a single UDP port
    * Need to replace or extend the existing ethtool API
    * No hardware support for now, but it seems doable in future generations
    * Need to be able to expose the capabilities of the hardware
    * Map back to the existing API to maintain backwards compatibility
  * Who is the user of this API?
    * Could be anyone from in-kernel users to the admin
  * Future actions could include steering to queues
* Geneve / GUE frame format
  * Identical goal of providing flexible metadata
  * TLVs are more flexible
  * A set of flags tends to be easier to parse in hardware
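For reference, a minimal sketch (not from the talk) of the Geneve base header and a single TLV option as laid out in draft-gross-geneve-01. Field names are mine, the bitfield order is shown for a little-endian host only, and a real definition needs both layouts and pinned packing.

    #include <stdint.h>

    /* Geneve base header: 8 bytes, followed by opt_len * 4 bytes of options. */
    struct geneve_hdr {
        uint8_t  opt_len:6;   /* total length of options, in 4-byte words */
        uint8_t  ver:2;       /* protocol version, currently 0 */
        uint8_t  rsvd1:6;
        uint8_t  critical:1;  /* critical options present */
        uint8_t  oam:1;       /* OAM / control packet */
        uint16_t proto_type;  /* EtherType of the inner payload (network order) */
        uint8_t  vni[3];      /* 24-bit Virtual Network Identifier */
        uint8_t  rsvd2;
        /* variable-length TLV options follow */
    };

    /* One TLV option: a 4-byte header plus length * 4 bytes of opaque data. */
    struct geneve_opt {
        uint16_t opt_class;   /* option namespace (network order) */
        uint8_t  type;        /* high bit marks the option as critical */
        uint8_t  length:5;    /* option data length, in 4-byte words */
        uint8_t  rsvd:3;
        uint8_t  data[];      /* opaque blob; OVS matches on this with a mask */
    };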
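On the RX port registration point above: today this is a VXLAN-specific driver callback; a generic replacement might look roughly like the sketch below. Every name here is hypothetical and exists only to illustrate the shape of the API being discussed, not any actual kernel interface.

    #include <stdint.h>

    struct net_device;              /* opaque here; kernel type */

    /* Encap types a NIC could be told to recognize on receive. */
    enum udp_encap_type {
        UDP_ENCAP_VXLAN,
        UDP_ENCAP_GENEVE,
        UDP_ENCAP_GUE,
    };

    struct udp_encap_port {
        enum udp_encap_type type;
        uint16_t            port;   /* UDP destination port (network order) */
        uint32_t            caps;   /* offload capabilities requested, e.g. RX checksum */
    };

    /* A driver would expose one pair of hooks instead of one callback per protocol. */
    struct udp_encap_ops {
        int  (*add_port)(struct net_device *dev, const struct udp_encap_port *p);
        void (*del_port)(struct net_device *dev, const struct udp_encap_port *p);
    };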
Alexei: Encrypted VXLAN
* How to securely isolate tenants in the cloud with multiple fabrics spread across geographic locations?
* As soon as your packets leave your own fabric, you cannot prevent snooping on the wire. Eventually the network between your DCs is the internet.
* This creates demand for encryption of the overlay
* Underlay problem
  * High bandwidth requirement on the edge to the WAN. A single device encrypting all traffic is expensive.
  * Instead of encrypting on the WAN router, encryption can occur at the edge on the host -> encrypted VXLAN
* The demand is to guarantee isolation between tenants using crypto
* This requires putting the crypto header (IPsec) below the encap. It provides additional flexibility over adding IPsec to the inner header.
  * Geneve doesn't provide options; the header could go as a trailer instead.
* VXLAN+ESP
  * Reuses all of the in-kernel xfrm subsystem
  * Implemented as a special vport using OVS + eBPF
  * The approach works with any other encap format as well
  * ESP SPI/SEQ and IV go after the VXLAN header, followed by the payload (a frame-layout sketch appears further below, after the P4/OF-PI notes)
  * GSO needs to happen before encryption occurs
  * Performance: AES-NI - 3.5G per core
* Requires trust in the hypervisor as it holds the keys. Cannot guarantee isolation on compromised hosts.
* Is there value in having multiple keys per hypervisor if it's the trusted entity anyway?
* When are you going to propose this to the IETF?
  * The 2nd presentation will cover the view on that
* Code status?
  * On the todo list

Alexei: Does network packet format matter?
* Why do we need to standardize? Go through the IETF, have all the debates
  * The promise is interoperability between vendors; it promotes competition
  * Usually, by the time the draft is proposed, the silicon is already designed and produced
  * Software vendors push drafts to ask for HW support
  * The standardization process is costly for startups and the community; due to the cost, it is effectively only available to larger players
* What if there were no standards?
  * Hardware may become flexible, generic, and more programmable
  * Still constrained, of course
* Uber Netdev
  * Programmable NICs would demand a compiler that could live in user space and, based on capabilities and other information passed up, program the hardware accordingly
  * HW looks like CPUs with ports
  * Looks like netdevs to iproute2, combined with tables for L2/L3
* Packet processing language (P4)
  * Building blocks (a toy C illustration appears after the OF-PI notes below):
    * read/write
    * push/pop header
    * checksum
    * forward to port
    * access tables
    * encrypt/decrypt

Johann Tonsing (Netronome) - ONF state of protocol-independent parsing and forwarding
* The current OpenFlow spec defines protocol/field matches combined with actions to control flow, encap, QoS, ...
  * The set of supported protocols is fixed
* OF-PI forwarding
  * Describe forwarding logic as a program (P4)
  * The parse tree is protocol independent
  * Per-packet metadata
  * Global metadata for connection tracking / NAT
* Example:
  * .P4 (*) -> intermediate representation -> multiple backends (vendor (*)) [SW, FPGA, ASIC, ...]
* There is still a need for some standards
  * What does the intermediate representation look like?
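To make the building blocks above concrete, here is a toy C sketch of a single match-action stage: read fields out of the parsed headers, look them up in a table, and either forward to a port or drop. It is purely illustrative and does not correspond to any real P4 backend or intermediate representation.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct headers {                 /* fields extracted by the parse tree */
        uint8_t  dmac[6];
        uint16_t ethertype;
    };

    enum action { ACT_DROP, ACT_FORWARD };

    struct table_entry {             /* one row of an exact-match L2 table */
        uint8_t     key[6];          /* destination MAC to match */
        enum action act;
        int         out_port;        /* action parameter */
    };

    static const struct table_entry l2_table[] = {
        { {0x00, 0x11, 0x22, 0x33, 0x44, 0x55}, ACT_FORWARD, 3 },
    };

    /* "Access tables" building block: exact-match lookup on the dmac field. */
    static const struct table_entry *lookup(const struct headers *h)
    {
        for (size_t i = 0; i < sizeof(l2_table) / sizeof(l2_table[0]); i++)
            if (memcmp(l2_table[i].key, h->dmac, sizeof(h->dmac)) == 0)
                return &l2_table[i];
        return NULL;
    }

    int main(void)
    {
        struct headers h = { {0x00, 0x11, 0x22, 0x33, 0x44, 0x55}, 0x0800 };
        const struct table_entry *e = lookup(&h);

        if (e && e->act == ACT_FORWARD)   /* "forward to port" building block */
            printf("forward to port %d\n", e->out_port);
        else
            printf("drop\n");
        return 0;
    }

A real backend would chain many such stages, generate the parser from the P4 parse tree, and let the tables be reprogrammed at run time instead of compiling them in.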
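Frame-layout sketch for Alexei's encrypted VXLAN above: the ESP header sits between the VXLAN header and the encrypted inner frame, so the underlay only ever sees outer IP/UDP/VXLAN. Field names are mine and the actual implementation may differ.

    #include <stdint.h>

    /* What follows the outer Ethernet/IP/UDP headers on the wire. */
    struct vxlan_esp_wire {
        /* VXLAN header (RFC 7348): I flag set, 24-bit VNI identifying the tenant. */
        uint32_t vx_flags;   /* network order */
        uint32_t vx_vni;     /* VNI << 8 */

        /* ESP header (RFC 4303), in the clear, so the receiver can find the
         * SA (SPI) and run replay protection (sequence) before decrypting. */
        uint32_t esp_spi;
        uint32_t esp_seq;

        /* Followed on the wire by:
         *   - the cipher IV (length depends on the algorithm),
         *   - the encrypted inner Ethernet frame,
         *   - the ESP trailer (padding, pad length, next header),
         *   - the integrity check value (ICV). */
    };

    _Static_assert(sizeof(struct vxlan_esp_wire) == 16,
                   "fixed part before the IV is 16 bytes");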
John Fastabend / John Ronciak - A high-performance socket interface for Linux
* Proposal to split off a set of queues and pass them through to user space using AF_PACKET (a sketch of today's AF_PACKET mmap ring appears after the conntrack notes below)
* API to request queues from hardware and get direct DMA
* Problem: every vendor has different descriptors
  * Requires exposing a descriptor description to user space
  * The initial patches received feedback that a format description needs to come along with them
* How to validate the addresses before they go to user space? (this is a security concern)
  * The issue that arises is that the memory region can be modified during or after validation
  * For best performance, we need to be able to offload that to hardware and use hardware protection
* Why not just base it on top of a VF and use VFIO?
* One suggestion from the audience was to look at netmap (http://info.iet.unipi.it/~luigi/netmap/)
  * netmap implements an interface that keeps the DMA programming in the kernel, where addresses can be validated
* Can we map AF_PACKET directly to qemu / vhost-user?
* What statistics would the kernel get?
  * "ethtool -S" hardware stats would be available as normal (since they're just reading hw registers)

Vincent Jardin - DPDK
* NFV performance bottlenecks
  * 1. Kernel drivers
  * 2. vswitch
  * 3. Host-guest communication
* vNIC options
  * vhost-net / virtio
    * Low performance
  * vhost-user / virtio
    * >10 Mpps per core with bypass in the guest
  * shared memory (ivshmem)
* Guest-to-guest performance without sharing memory between guests
  * ~38 Mpps @ 64B (again, bypass in the guest)
  * pktgen - host nic 1 - dpdk - vm1 - [packet copy] - vm2 - dpdk - host nic 2 - pktgen
* Why is the performance different for a kernel- vs user-space-based vswitch?
  * The copy itself is not the bottleneck
  * The PMD is more efficient
  * Busy-poll support in the kernel could be an in-kernel alternative

Jesse Gross - OVS micro summit summary
* Summit proceedings: https://etherpad.wikimedia.org/p/ovs-micro-summit14
* OVS pain points
  * Keeping openvswitch.org and net-next in sync as closely as possible
  * ...
* Offload to hardware
  * How to offload to hardware using generic kernel APIs that are usable by other components and subsystems such as iproute2, tc, the existing L2 code, ...
  * A user-space-driven offload decision seems superior due to the richness of context available there
  * The interface used is pretty much identical to the work to generalize the flow director
  * [... discussion on offload ...]

Justin Pettit - Adding stateful features to OVS by leveraging kernel functions
* OVS is flow based
* Demand arises for stateful functionality such as reflexive ACLs and stateful NAT
* Options to implement reflexive ACLs:
  * Match on TCP flags. Limited security; no matching on sequence numbers.
  * Learn a reverse flow for every permitted flow. Limited performance due to the large number of flows that need to be set up.
  * Leverage the netfilter connection tracking capability
    * New OF action to send a packet to the conntracker
    * New connection state match
    * Zone support to allow overlapping IP addresses
* RFC patches have been sent out; the plan is to merge for the next OVS release
* Can be combined with regular MAC learning
* Limited to the Linux datapath for now; the API is generic and open to other platforms
* Allows for additional functionality on top of that
  * NAT, IPVS, DPI (textsearch)
* OpenWrt has a conntrack tc action (not in mainline) similar to this
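For context on the Fastabend/Ronciak proposal above: this is the existing AF_PACKET mmap'ed RX ring (TPACKET_V3) that a direct-DMA queue interface would extend. This part is standard kernel API today; error handling is omitted for brevity and "eth0" is just an example interface.

    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <net/if.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

        int ver = TPACKET_V3;
        setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));

        /* 8 blocks of 4 MiB, retired to user space at most every 60 ms. */
        struct tpacket_req3 req = {
            .tp_block_size     = 1 << 22,
            .tp_block_nr       = 8,
            .tp_frame_size     = 2048,
            .tp_frame_nr       = (1 << 22) / 2048 * 8,
            .tp_retire_blk_tov = 60,
        };
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

        /* One shared mapping: the kernel copies received frames into it. */
        void *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        struct sockaddr_ll ll = {
            .sll_family   = AF_PACKET,
            .sll_protocol = htons(ETH_P_ALL),
            .sll_ifindex  = if_nametoindex("eth0"),
        };
        bind(fd, (struct sockaddr *)&ll, sizeof(ll));

        printf("ring mapped at %p\n", ring);
        /* Real code would poll() and walk tpacket_block_desc / tpacket3_hdr
         * entries, handing each block back via TP_STATUS_KERNEL. */
        return 0;
    }

The difference in the proposal is that the NIC would DMA straight into user-supplied buffers instead of the kernel copying frames into this ring, which is why descriptor formats and address validation become the hard problems.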
Dan Dumitriu - Kernel vs user space switching
* Context: OpenStack, vswitching, encap
* Requirements:
  * Fast packet switching (minimal overhead)
  * Support for encap protocols
* Datapath options:
  * In-kernel OVS: exists, proven, full feature set
  * OVS user space datapath on top of the DPDK PMD: fast, user space drivers, limited driver coverage
  * Offload to the NIC (SR-IOV / VF): minimal CPU overhead, high complexity, needs a partial offload model
  * In-kernel eBPF (future?)