



# Advanced Networked Systems SS24 Programmable Data Plane

Prof. Lin Wang, Ph.D.

Computer Networks Group

Paderborn University

https://cs.uni-paderborn.de/cn



## Learning objectives

**Why** do we need a programmable data plane?

**How** to enable data plane programmability?

# Why do we need data plane programmability?

#### **Evolution of the computer industry**



The computing industry has been evolving from proprietary hardware/software towards more **general-purpose** hardware/software with **open standards/interfaces**.

#### **Evolution of networking industry**



The networking industry has also been evolving from proprietary hardware/software towards more **general-purpose** hardware/software with **open standards/interfaces**.

## Recap: software-defined networking



#### A deep dive into OpenFlow



OpenFlow is designed around the match+action abstraction: a set of header match fields and forwarding actions

OpenFlow v1.5: 41 match header fields

Most hardware/software switches support only a limited match/action set (Ethernet, IP, TCP, MPLS) due to ASIC limitations.

#### Match

… OFPXMT\_OFB\_PBB\_UCA

#### **Action**

enum ofp\_action\_type {
 OFPAT\_OUTPUT,
 OFPAT\_COPY\_TTL\_OUT,
 OFPAT\_COPY\_TTL\_IN,
 OFPAT\_SET\_MPLS\_TTL,
 OFPAT\_DEC\_MPLS\_TTL,
 OFPAT\_PUSH\_VLAN,
 OFPAT\_POP\_VLAN,
 OFPAT\_PUSH\_MPLS,
 OFPAT\_POP\_MPLS,
 OFPAT\_SET\_QUEUE,
 OFPAT\_GROUP,
 OFPAT\_SET\_NW\_TTL,
 OFPAT\_DEC\_NW\_TTL,
 OFPAT\_SET\_FIELD,
 OFPAT\_PUSH\_PBB,
 OFPAT\_POP\_PBB,
 OFPAT\_EXPERIMENTER
};

#### Switch architecture



Packet processing pipeline


#### **Development cycle**



It takes years for a new ASIC to be developed, fully tested, and finally deployed. By the time the upgrade is available, either:

- it **no longer solves your problem**, or
- you need a forklift upgrade at huge expense

What is the root cause of all this?

## The "bottom-up" mentality



"This is how I process packets..."



Network systems are built bottom-up: all network features are centered on the capabilities of the ASIC.

How to improve this?

### The "top-down" approach

Make the ASIC **programmable**, and let your features tell the ASIC what to support!

```
table int_table {
    reads {
        ip.protocol;
    }
    actions {
        export_queue_latency;
    }
}
```

"This is precisely how you must process packets..."

How to support programmability?



# How to enable data plane programmability?

## **Domain-specific processors**




#### RMT and P4

**RMT:** reconfigurable match tables model (a RISC-inspired pipelined architecture)

**P4:** a domain-specific language for programming protocol-independent packet processors

#### P4: Programming Protocol-Independent Packet Processors

Pat Bosshart†, Dan Daly\*, Glen Gibb†, Martin Izzard†, Nick McKeown‡, Jennifer Rexford\*\*, Cole Schlesinger\*\*, Dan Talayco†, Amin Vahdat¶, George Varghese§, David Walker\*\*

†Barefoot Networks, \*Intel, ‡Stanford University, \*\*Princeton University, ¶Google, §Microsoft Research

#### ABSTRACT

P4 is a high-level language for programming protocol-independent packet processors. P4 works in conjunction with SDN control protocols like OpenFlow. In its current form, OpenFlow explicitly specifies protocol headers on which it operates. This set has grown from 12 to 41 fields in a few years, increasing the complexity of the specification while still not providing the flexibility to add new headers. In this paper we propose P4 as a strawman proposal for how OpenFlow should evolve in the future. We have three goals: (1) Reconfigurability in the field: Programmers should be able

multiple stages of rule tables, to allow switches to expose more of their capabilities to the controller.

The proliferation of new header fields shows no signs of stopping. For example, data-center network operators increasingly want to apply new forms of packet encapsulation (e.g., NVGRE, VXLAN, and STT), for which they resort to deploying software switches that are easier to extend with new functionality. Rather than repeatedly extending the OpenFlow specification, we argue that future switches should support flexible mechanisms for parsing packets and matching header fields, allowing controller applications to leverage those capabilities through a common, open inter-

Figure: Parser → Ingress (match-action pipeline) → Switching fabric (e.g., crossbar) → Egress (match-action pipeline) → Deparser

## P4 development



#### P4<sub>16</sub> introduces the concept of architecture

P4 architecture

Specifies the **P4 programmable components** of a target and **data plane interfaces** between them

P4 target

A model of a specific hardware implementation

## P4 language evolution



## Programming a P4 target



#### **Architecture model**



A contract between the P4 program and the target




#### Switch architecture example



Switch architecture

```
parser Parser<IH>(packet_in b, out IH parsedHeaders);
// ingress match-action pipeline
control IPipe<T, IH, OH>(in IH inputHeaders,
                         in InControl inCtrl,
                         out OH outputHeaders,
                         out T toEgress,
                         out OutControl outCtrl);
// egress match-action pipeline
control EPipe<T, IH, OH>(in IH inputHeaders,
                         in InControl inCtrl,
                         in T fromIngress,
                         out OH outputHeaders,
                         out OutControl outCtrl);
control Deparser<OH>(in OH outputHeaders, packet_out b);
package Ingress<T, IH, OH>(Parser<IH> p,
                           IPipe<T, IH, OH> map,
                           Deparser<OH> d);
package Egress<T, IH, OH>(Parser<IH> p,
                          EPipe<T, IH, OH> map,
                          Deparser<OH> d);
package Switch<T>( // Top-level switch contains two packages
   // type variables Ingress.IH and Egress.IH may be different
   Ingress<T, _, _> ingress,
   Egress<T, _, _> egress
);
```

Switch architecture description

#### A simple P4<sub>16</sub> switch architecture: v1model

Roughly equivalent to Protocol-Independent Switch Architecture (PISA)



#### v1model architecture

Defines the metadata it supports, including both intrinsic and user-defined ones

```
struct standard_metadata_t {
   bit<9> ingress_port;
   bit<9> egress_spec;
   bit<32> clone_spec;
   bit<32> instance_type;
   bit<1> drop;
   bit<16> recirculate_port;
   bit<32> packet_length;
   bit<32> enq_timestamp;
   bit<19> enq_qdepth;
   bit<32> deq_timedelta;
   bit<19> deq_qdepth;
   error parser_error;
   bit<48> ingress_global_timestamp;
   bit<48> egress_global_timestamp;
   bit<32> lf_field_list;
   bit<16> mcast_grp;
   bit<32> resubmit_flag;
   bit<16> egress_rid;
   bit<1> checksum_error;
   bit<32> recirculate_flag;
}
```

Standard intrinsic metadata

#### **Architecture-specific constructs**

#### Each architecture defines a list of "externs"

- Blackbox functions whose interfaces are known

Most targets contain specialized components, which cannot be expressed in P4

## On the other hand, P4<sub>16</sub> aims to be target-independent

- Almost 1/3 of the P4<sub>14</sub> constructs are target-dependent and therefore not portable across targets

```
extern register<T> {
    register(bit<32> size);
    void read(out T result, in bit<32> index);
    void write(in bit<32> index, in T value);
}
extern void random<T>(out T result, in T lo, in T hi);
extern void hash<O, T, D, M>(out O result,
    in HashAlgorithm algo, in T base, in D data, in M max);
extern void update_checksum<T, O>(in bool condition,
    in T data, inout O checksum, HashAlgorithm algo);
```

v1model architecture-specific externs

## P4 language basics

#### P4 language overview

```
#include <core.p4>
#include <v1model.p4>
const bit<16> TYPE_IPV4 = 0x800;
typedef bit<32> ip4Addr_t;
header ipv4_t {...}
struct headers {...}
parser MyParser(...) {
  state start {...}
  state parse_ethernet {...}
  state parse_ipv4 {...}
}
```

```
control MyIngress(...) {
   action ipv4_forward(...) {...}
   table ipv4_lpm {...}
   apply {
     if (...) {...}
   }
}
```

Libraries

**Declarations** 

Packet header parser

Control flow to modify packets

```
control MyDeparser(...) {...}

V1Switch(
   MyParser(),
   MyVerifyChecksum(),
   MyIngress(),
   MyEgress(),
   MyComputeChecksum(),
   MyDeparser()
) main;
```

Assemble modified packet

"main()"

## P4 language basics: data types

P4\_16 is a statically-typed language with base types and operators to derive composed ones

| Type         | Description                                     |
|--------------|-------------------------------------------------|
| `bool`       | Boolean value                                   |
| `bit<W>`     | Bit-string of width W                           |
| `int<W>`     | Signed integer of width W                       |
| `varbit<W>`  | Bit-string of dynamic length <= W               |
| `match_kind` | Describes ways to match table keys              |
| `error`      | Used to signal errors                           |
| `void`       | No values, used in few restricted circumstances |
| `float`      | Not supported                                   |
| `string`     | Not supported                                   |

#### P4 language basics: composed data types

- **Header**: `header Ethernet_h { bit<48> dstAddr; bit<48> srcAddr; bit<16> etherType; }`
- **Header stack**: `header Mpls_h { bit<20> label; bit<3> tc; bit bos; bit<8> ttl; }` and `Mpls_h[10] mpls;` (array of up to 10 MPLS headers)
- **Header union**: `header_union Ip_h { IPv4_h v4; IPv6_h v6; }` (either the IPv4 or the IPv6 header is present)

Parsing a packet using extract() fills in the fields of the header from a network packet.

A successful extract() also sets the validity bit of the extracted header to true; it can be checked with hdr.ipv4.isValid().

#### P4 language basics: composed data types

Struct: unordered collection of named members

```
struct standard_metadata_t {
  bit<9> ingress_port;
  bit<9> egress_spec;
  bit<9> egress_port;
  ...
}
```

#### Other data types:

- enum: enum Priority {High, Low}
- Type specification: typedef bit<48> macAddr\_t;
- extern, parser, control, package...

Tuple: ordered collection of unnamed members

```
tuple<bit<32>, bool> x;
x = {10, false};
```

#### P4 language basics: operations

P4 operations are similar to C operations and vary depending on the types (unsigned/signed integers,...)

- Arithmetic operations: +, -, \*
- Logical operations:
  - Bitwise complement, and, or, xor: ~,&, |, ^
  - Shifts: >>, <<
- Non-standard operations: [m:l] bit slicing (bits m down to l, inclusive), ++ bit concatenation
- No division and modulo: can be approximated
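The two non-standard operators, and the shift-based way of approximating a division, can be modeled on plain integers. The Python sketch below is illustrative only: `bit_slice`, `bit_concat`, and `div_pow2` are hypothetical helper names, and widths are passed explicitly because Python integers carry no width.

```python
def bit_slice(value: int, m: int, l: int) -> int:
    """Model of P4's value[m:l]: bits m down to l, both inclusive."""
    if m < l or l < 0:
        raise ValueError("need m >= l >= 0")
    width = m - l + 1
    return (value >> l) & ((1 << width) - 1)


def bit_concat(hi: int, lo: int, lo_width: int) -> int:
    """Model of P4's hi ++ lo, where lo is a bit<lo_width> value."""
    return (hi << lo_width) | (lo & ((1 << lo_width) - 1))


def div_pow2(value: int, k: int) -> int:
    """Division by 2**k expressed as a right shift."""
    return value >> k


version_ihl = 0x45                      # first byte of a typical IPv4 header
version = bit_slice(version_ihl, 7, 4)  # upper nibble
ihl = bit_slice(version_ihl, 3, 0)      # lower nibble
```

Slicing the first IPv4 byte this way yields the `version` and `ihl` fields, and concatenating them again restores the original byte.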

#### P4 language basics: variables and constants

Constants, variable declarations and instantiations are almost the same as in C too

#### **Important**

Variables cannot be used to maintain state across different network packets.

Instead, we can only use **two stateful** constructs, i.e., tables and extern objects, to maintain state.
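To make the distinction concrete, the following Python sketch contrasts a per-packet local variable with a register-like object. It is a toy model under stated assumptions: the `Register` class only mimics the read/write interface of a register extern, and `process_packet` is a hypothetical stand-in for a control block.

```python
class Register:
    """Toy model of a register array: state that outlives a single packet."""

    def __init__(self, size: int, width: int = 32):
        self.cells = [0] * size
        self.mask = (1 << width) - 1

    def read(self, index: int) -> int:
        return self.cells[index]

    def write(self, index: int, value: int) -> None:
        self.cells[index] = value & self.mask   # wrap around like bit<width>


def process_packet(ingress_port: int, per_port_count: Register):
    seen = 0                 # local variable: re-initialized for every packet
    seen += 1
    total = per_port_count.read(ingress_port) + 1
    per_port_count.write(ingress_port, total)   # persists until the next packet
    return seen, total


counts = Register(size=4)
results = [process_packet(port, counts) for port in (1, 1, 2, 1)]
```

The local counter is always 1, while the register-backed counter grows per port across packets.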

#### P4 language basics: statements

#### P4 statements are pretty classical too

- Some restrictions may apply depending on the statement location

- `return`: terminates the execution of the action or control containing it
- `exit`: terminates the execution of all the blocks currently executing
- Conditions: `if (x==123) {...} else {...}` (not in parser)
- Switch: `switch (t.apply().action_run) { action1: {...} action2: {...} }` (only in control blocks; no fall-through if a block statement is present)

# P4 processing overview



## P4 parser

The parser uses a state machine to map packets into headers and metadata



## P4 parser: example



```
parser MyParser(...) {
   state start {
      transition parse_ethernet;
   }
   state parse_ethernet {
      packet.extract(hdr.ethernet);
      transition select(hdr.ethernet.etherType) {
         0x800: parse_ipv4;
         default: accept;
      }
   }
   state parse_ipv4 {
      packet.extract(hdr.ipv4);
      // Transition between states, selected by the protocol field
      transition select(hdr.ipv4.protocol) {
         6: parse_tcp;
         17: parse_udp;
         default: accept;
      }
   }
   state parse_tcp {
      packet.extract(hdr.tcp);
      transition accept;
   }
   state parse_udp {
      packet.extract(hdr.udp);
      transition accept;
   }
}
```
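The state machine of this example can be imitated in a few lines of Python to see how the transitions walk through a packet. This is only a sketch of the idea, not how a target executes a parser: the function `parse`, the dictionary layout, and the hand-built test frame are assumptions, and IPv4 options are ignored.

```python
import struct


def parse(packet: bytes) -> dict:
    """Walk start -> parse_ethernet -> parse_ipv4 -> parse_tcp/parse_udp -> accept."""
    hdr = {}
    offset = 0
    state = "parse_ethernet"            # 'start' transitions here unconditionally
    while state != "accept":
        if state == "parse_ethernet":
            dst, src, ether_type = struct.unpack_from("!6s6sH", packet, offset)
            hdr["ethernet"] = {"dstAddr": dst, "srcAddr": src, "etherType": ether_type}
            offset += 14
            state = "parse_ipv4" if ether_type == 0x0800 else "accept"
        elif state == "parse_ipv4":
            version_ihl, protocol = packet[offset], packet[offset + 9]
            hdr["ipv4"] = {"ihl": version_ihl & 0xF, "protocol": protocol}
            offset += 20                # assumes a header without options
            state = {6: "parse_tcp", 17: "parse_udp"}.get(protocol, "accept")
        elif state in ("parse_tcp", "parse_udp"):
            src_port, dst_port = struct.unpack_from("!HH", packet, offset)
            hdr[state[len("parse_"):]] = {"srcPort": src_port, "dstPort": dst_port}
            state = "accept"
    return hdr


# Hand-built frame: Ethernet + minimal IPv4 header (protocol 6) + TCP ports.
eth = bytes(6) + bytes(6) + b"\x08\x00"
ipv4 = bytes([0x45]) + bytes(8) + bytes([6]) + bytes(10)
tcp = struct.pack("!HH", 1234, 80)
parsed = parse(eth + ipv4 + tcp)
```

Each `select` in the P4 code corresponds to one assignment to `state`; a frame with an unknown EtherType is accepted right after the Ethernet header.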

## P4 parser: variable-width header extraction

```
header IPv4_no_options_h {
   ...
   bit<32> srcAddr;          // fixed-width fields
   bit<32> dstAddr;
}
header IPv4_options_h {
   varbit<320> options;      // variable-width field (at most 10 words of options)
}
parser MyParser(...) {
   ...
   state parse_ipv4 {
      packet.extract(hdr.ipv4);
      transition select(hdr.ipv4.ihl) {
         5: dispatch_on_protocol;
         default: parse_ipv4_options;
      }
   }
   state parse_ipv4_options {
      // ihl determines the length of the options field;
      // extract() expects that length in bits: (ihl - 5) words of 32 bits
      packet.extract(hdr.ipv4options,
                     (bit<32>)(((bit<16>)hdr.ipv4.ihl - 5) * 32));
      transition dispatch_on_protocol;
   }
}
```
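As a worked check of the length arithmetic: `ihl` counts 32-bit words including the fixed 20-byte header, so the options span `ihl - 5` words. The two-argument `extract()` takes the size of the variable-width field in bits. The Python helper below (a hypothetical name, for illustration only) returns the length in both bytes and bits.

```python
def ipv4_options_len(ihl: int):
    """Return (bytes, bits) occupied by IPv4 options for a given ihl value."""
    if not 5 <= ihl <= 15:
        raise ValueError("ihl is a 4-bit field and must be at least 5")
    words = ihl - 5              # ihl counts 32-bit words, 5 of them are fixed
    return words * 4, words * 32
```

With the maximum value `ihl = 15`, the options occupy 40 bytes, which is why a 320-bit `varbit` is enough to hold them.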

## P4 parser: more advanced concepts

#### Parsing a header stack requires the parser to loop

- The only "loops" that are possible in P4 (done implicitly through state transitions)
- Example in source routing: popping headers off the stack one by one to determine the next hop

#### Other concepts in P4 parser:

- Verify: error handling in the parser
- Lookahead: access bits that are not parsed yet
- Sub-parsers: like subroutines

Why should we be cautious about loops?

# P4 processing overview



#### P4 control

- **Tables**: match a key and return an action
- **Actions**: similar to functions in C
- **Control flow**: similar to C but without loops

#### P4 control: tables




```
// Table name: ipv4_lpm
table ipv4_lpm {
   key = {
      hdr.ipv4.dstAddr: lpm;      // longest prefix match
      hdr.ipv4.version: exact;
   }
   actions = {                    // possible actions
      ipv4_forward;
      drop;
   }
   size = 1024;                   // max. # of entries in table
   default_action = drop();       // default action
}
```
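What such a table does at lookup time can be sketched in Python. This is a simplified model, not the bmv2 implementation: the `LpmTable` class, its linear search, and the parameters passed to `ipv4_forward` (a MAC address and a port) are assumptions made for illustration, and the exact match on the version field is left out.

```python
import ipaddress


class LpmTable:
    """Toy longest-prefix-match table: entries map a prefix to (action, params)."""

    def __init__(self, default_action=("drop", ())):
        self.entries = []
        self.default_action = default_action

    def add_entry(self, prefix: str, action: str, *params) -> None:
        # On a real switch, entries are installed at runtime by the control plane.
        self.entries.append((ipaddress.ip_network(prefix), (action, params)))

    def apply(self, dst_addr: str):
        addr = ipaddress.ip_address(dst_addr)
        matches = [(net.prefixlen, act) for net, act in self.entries if addr in net]
        if not matches:
            return self.default_action               # miss: run the default action
        return max(matches, key=lambda m: m[0])[1]   # hit: longest prefix wins


ipv4_lpm = LpmTable()
ipv4_lpm.add_entry("10.0.0.0/8", "ipv4_forward", "00:00:00:00:01:01", 1)
ipv4_lpm.add_entry("10.0.1.0/24", "ipv4_forward", "00:00:00:00:02:02", 2)
```

A destination of `10.0.1.7` matches both entries, and the /24 entry wins; an address outside `10.0.0.0/8` misses and falls back to `drop`.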

#### P4 control: match kinds

| Defined in | Match kind | Description                                |
|------------|------------|--------------------------------------------|
| core.p4    | exact      | Exact comparison: 0x01020304               |
| core.p4    | ternary    | Compare with mask: 0x01020304 & 0x0F0F0F0F |
| core.p4    | lpm        | Longest prefix match                       |
| v1model.p4 | range      | Check if in range: 0x01020304 – 0x010203FF |

Other architectures may define additional match kinds.
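The four match kinds reduce to simple integer predicates. The Python sketch below uses hypothetical function names and treats keys as plain integers; it shows in particular that `lpm` is a special case of `ternary` whose mask covers the leading bits.

```python
def match_exact(key: int, value: int) -> bool:
    return key == value


def match_ternary(key: int, value: int, mask: int) -> bool:
    # Only the bit positions set in the mask are compared.
    return (key & mask) == (value & mask)


def match_lpm(key: int, prefix: int, prefix_len: int, width: int = 32) -> bool:
    # A prefix match is a ternary match whose mask covers the top prefix_len bits.
    mask = ((1 << prefix_len) - 1) << (width - prefix_len)
    return match_ternary(key, prefix, mask)


def match_range(key: int, lo: int, hi: int) -> bool:
    return lo <= key <= hi
```

For instance, with the mask `0x0F0F0F0F` the key `0xF1F2F3F4` still matches the value `0x01020304`, because the upper nibble of every byte is ignored.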

#### P4 control: table entries

#### Table entries are added through the control plane

- Recall the SDN control plane for flow rule installation



#### P4 control: actions

#### **Actions** are

- Blocks of statements that possibly modify the packets
- Usually take directional parameters indicating how the corresponding value is treated within the block

  - `in`: read-only inside the action
  - `out`: uninitialized, written inside the action
  - `inout`: combination of `in` and `out`

#### P4 control: actions



```
action set_egress_port(bit<9> port) {
   standard_metadata.egress_spec = port;
}
```

Action parameters resulting from a table lookup do not take a direction



#### P4 control: control flow

- Apply a table: `ipv4_lpm.apply()`
- Check if there was a hit: `if (ipv4_lpm.apply().hit) {...} else {...}`
- Check which action was executed: `switch (ipv4_lpm.apply().action_run) { ipv4_forward: {...} }`

#### v1model.p4

```
extern void verify_checksum<T, O>(
    in bool condition,
    in T data,
    inout O checksum,
    HashAlgorithm algo);

extern void update_checksum<T, O>(
    in bool condition,
    in T data,
    inout O checksum,
    HashAlgorithm algo);
```

## P4 control: re-computing checksum

```
control MyComputeChecksum(...) {
   apply {
      update_checksum(
         hdr.ipv4.isValid(),          // pre-condition
         { hdr.ipv4.version,          // fields list
           hdr.ipv4.ihl,
           hdr.ipv4.diffserv,
           hdr.ipv4.totalLen,
           hdr.ipv4.identification,
           hdr.ipv4.flags,
           hdr.ipv4.fragOffset,
           hdr.ipv4.ttl,
           hdr.ipv4.protocol,
           hdr.ipv4.srcAddr,
           hdr.ipv4.dstAddr },
         hdr.ipv4.hdrChecksum,        // checksum field
         HashAlgorithm.csum16);       // checksum algorithm
   }
}
```
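What `csum16` computes over the field list can be reproduced offline. The Python sketch below implements the 16-bit one's-complement Internet checksum; the helper names and the sample header bytes are assumptions for illustration. Zeroing the checksum field before summing has the same effect as leaving `hdrChecksum` out of the field list.

```python
import struct


def csum16(data: bytes) -> int:
    """16-bit one's-complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole number of words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                        # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def update_ipv4_checksum(header: bytes) -> bytes:
    """Return the header with bytes 10-11 (hdrChecksum) recomputed."""
    zeroed = header[:10] + b"\x00\x00" + header[12:]
    return zeroed[:10] + struct.pack("!H", csum16(zeroed)) + zeroed[12:]


# 20-byte IPv4 header whose checksum field is still zero.
stale = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
fresh = update_ipv4_checksum(stale)
```

Verification works the same way: summing a header that carries a correct checksum yields zero, which is what a receiver (or `verify_checksum`) checks.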

## P4 control: more advanced concepts

- **Cloning packets**: create a clone of a packet
- **Sending packets to the control plane**: use a dedicated Ethernet port, or target-specific mechanisms
- **Recirculating**: send a packet through the pipeline multiple times

Be cautious about recirculating!

#### **Annotations**

#### Additional information given to the compiler or the control plane

```
control c( ... )() {
    // Use table name t1 for the control plane API
    @name("t1") table t { ... }
    apply { ... }
}
c() c_inst;
```

# P4 processing overview



## P4 deparser

- Packet headers: `ethernet {srcAddr: a:b:c:d, ...}`, `ipv4 {srcAddr: 1.2.3.4, ...}`, `tcp {srcPort: 1234, ...}`
- Resulting packet: Ethernet (a:b:c:d → 1:2:3:4), IPv4 (1.2.3.4 → 5.6.7.8), TCP (1234 → 56789), payload
- Deparser: `control MyDeparser(...) { apply { packet.emit(hdr.ethernet); packet.emit(hdr.ipv4); packet.emit(hdr.tcp); } }`
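The effect of a sequence of `emit()` calls can be pictured as appending header bytes in order, skipping any header that is not valid, and then attaching the payload. The Python sketch below is a toy model under these assumptions: the `deparse` function and the dictionary-based header representation are hypothetical, and the header bytes are placeholders.

```python
def deparse(headers, payload: bytes) -> bytes:
    """Serialize headers in emit order; headers that are not valid are skipped."""
    packet = b""
    for header in headers:
        if header["valid"]:
            packet += header["bytes"]
    return packet + payload


ethernet = {"valid": True, "bytes": b"E" * 14}
ipv4 = {"valid": True, "bytes": b"I" * 20}
tcp = {"valid": False, "bytes": b"T" * 20}    # e.g. the packet carried no TCP header
out = deparse([ethernet, ipv4, tcp], b"payload")
```

Because invalid headers are skipped, the same deparser code can serve packets with different header combinations.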

#### P4 workflow



## **Application: congestion control**



Use INT to **obtain precise network link status information** and adjust sending
rate based on such information

#### **HPCC: High Precision Congestion Control**

Yuliang Li<sup>♠♥</sup>, Rui Miao♠, Hongqiang Harry Liu♠, Yan Zhuang♠, Fei Feng♠, Lingbo Tang♠, Zheng Cao♠, Ming Zhang♠, Frank Kelly♦, Mohammad Alizadeh♣, Minlan Yu♥

Alibaba Group♠, Harvard University♥, University of Cambridge♦, Massachusetts Institute of Technology♣

#### ABSTRACT

Congestion control (CC) is the key to achieving ultra-low latency, high bandwidth and network stability in high-speed networks. From years of experience operating large-scale and high-speed RDMA networks, we find the existing high-speed CC schemes have inherent limitations for reaching these goals. In this paper, we present HPCC (High Precision Congestion Control), a new high-speed CC mechanism which achieves the three goals simultaneously. HPCC leverages in-network telemetry (INT) to obtain precise link load information and controls traffic precisely. By addressing challenges such as delayed INT information during congestion and overreaction to INT information, HPCC can quickly converge to utilize free bandwidth while avoiding congestion, and can maintain near-zero in-network queues for ultra-low latency. HPCC is also fair and easy to deploy in hardware. We implement HPCC with commodity

demand on high-speed networks. The first trend is new data center architectures like resource disaggregation and heterogeneous computing. In resource disaggregation, CPUs need high-speed networking with remote resources like GPU, memory and disk. According to a recent study [17], resource disaggregation requires 3-5µs network latency and 40-100Gbps network bandwidth to maintain good application-level performance. In heterogeneous computing environments, different computing chips, e.g. CPU, FPGA, and GPU, also need high-speed interconnections, and the lower the latency, the better. The second trend is new applications like storage on high I/O speed media, e.g. NVMe (non-volatile memory express) and large-scale machine learning training on high computation speed devices, e.g. GPU and ASIC. These applications periodically transfer large volume data, and their performance bottleneck is usually in the network since their storage and computation speeds

Think about how this differs from ECN.

## Other PDP applications

#### In-band Network Telemetry (INT)

June 2016

- Changhoon Kim, Parag Bhide, Ed Doe: Barefoot Networks
- Hugh Holbrook: Arista
- Anoop Ghanwani: Dell
- Dan Daly: Intel
- Mukesh Hira, Bruce Davie: VMware

Contents (excerpt): Introduction; What To Monitor (Switch-level Information, Ingress Information, Egress Information, Buffer Information); Processing INT Headers (INT Header Types, Handling INT Packets)

Network monitoring

#### Scaling Distributed Machine Learning with In-Network Aggregation

Amedeo Sapio\* (KAUST), Marco Canini\* (KAUST), Chen-Yu Ho (KAUST), Jacob Nelson (Microsoft), Panos Kalnis (KAUST), Changhoon Kim (Barefoot Networks), Arvind Krishnamurthy (University of Washington), Masoud Moshref (Barefoot Networks), Dan R. K. Ports (Microsoft), Peter Richtárik (KAUST)

Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide an efficient solution that speeds up training by up to 5.5× for a number of real-world benchmark models.

#### 1 Introduction

Today's machine learning (ML) solutions' remarkable success derives from the ability to build increasingly sophisticated

aggregation primitive can accelerate distributed ML workloads, and can be implemented using programmable switch hardware [5, 10]. Aggregation reduces the amount of data transmitted during synchronization phases, which increases throughput, diminishes latency, and speeds up training time.

Building an in-network aggregation primitive using programmable switches presents many challenges. First, the per-packet processing capabilities are limited, and so is on-chip memory. We must limit our resource usage so that the switch can perform its primary function of conveying packets. Second, the computing units inside a programmable switch operate on integer values, whereas ML frameworks and models operate on floating-point values. Finally, the in-network aggregation primitive is an all-to-all primitive, unlike traditional unicast or multicast communication patterns. As a result, in-network aggregation requires mechanisms for synchronizing workers and detecting and recovering from packet loss.
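One way to bridge the integer/floating-point gap mentioned in this excerpt is fixed-point conversion at the workers, so that the switch only ever adds integers. The Python sketch below illustrates that general idea and is not taken from the paper's implementation: the scaling factor `SCALE` and all function names are assumptions.

```python
SCALE = 1 << 16   # hypothetical fixed-point scaling factor chosen by the workers


def quantize(update):
    """Worker side: convert floating-point model updates to integers."""
    return [round(x * SCALE) for x in update]


def switch_aggregate(int_updates):
    """Switch side: element-wise integer sum, the only arithmetic assumed here."""
    return [sum(column) for column in zip(*int_updates)]


def dequantize(int_vector):
    """Worker side: convert the aggregated integers back to floats."""
    return [v / SCALE for v in int_vector]


worker_updates = [[0.5, -0.25], [0.25, 0.75], [-0.5, 0.0]]
aggregated = dequantize(switch_aggregate([quantize(u) for u in worker_updates]))
```

The sample values are exactly representable at this scale, so the round trip is lossless here; in general the choice of scaling factor trades precision against the risk of integer overflow.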

#### In-network computing

## Try out P4

#### P4 hands-on

- Use Mininet to set up the network environment
- Use the bmv2 software switch: https://github.com/p4lang/behavioral-model
- See the P4 tutorials: https://github.com/p4lang/tutorials



## Summary



Data plane programmability is driven by the demand for more flexible network configurations

RMT abstracts the data plane architecture and P4 enables data plane programmability

# Next time: programmable switch architecture



How does a programmable switch work from the inside out?