Priority Slow-down

    If there were no oversubscription and multi-path load balancing were provided in the datacenter topology, congestion would not happen in the network core. But when a number of hosts in the same rack communicate through access links to one router (all-to-one communication), congestion happens more often at the access layer than at the aggregation layer; that is, intra-rack congestion is more frequent than cross-rack congestion. The problem is now more specific: to resolve host-to-host congestion.

    To achieve this goal, each server maintains a current-flow table, in which each entry keeps a flow id and its priority. The server makes priority slow-down decisions based on two metrics: ECN counts and link utilization. As mentioned before, we utilize ECN to indicate the queue size at the routers a flow traverses. Once the receiver sees a packet with the ECN identifier set, it combines this with the utilization of the traversed links to decide which link may be congested; it then acts to avoid further congestion by slowing down the sending rate on the sender side.
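    The per-server flow table and the receiver-side congestion check could be sketched as below. This is only a sketch under assumptions: the threshold values, the field names, and the rule that both metrics must cross their thresholds before a flow is flagged are illustrative, not taken from the design above.

```python
from dataclasses import dataclass

# Assumed thresholds for illustration only.
ECN_THRESHOLD = 3      # ECN marks seen in the current window
UTIL_THRESHOLD = 0.8   # link utilization fraction

@dataclass
class FlowEntry:
    flow_id: int
    priority: int          # flow priority kept in the table
    ecn_count: int = 0     # ECN marks observed for this flow
    link_util: float = 0.0 # utilization of the traversed link

def is_congested(entry: FlowEntry) -> bool:
    """Receiver infers congestion when both metrics cross their thresholds."""
    return entry.ecn_count >= ECN_THRESHOLD and entry.link_util >= UTIL_THRESHOLD

# Example: a flow with many ECN marks on a busy link is flagged; a lightly
# loaded flow is not.
table = {1: FlowEntry(1, priority=2, ecn_count=5, link_util=0.9),
         2: FlowEntry(2, priority=7, ecn_count=1, link_util=0.4)}
congested = [fid for fid, e in table.items() if is_congested(e)]
```

    Requiring both metrics to cross their thresholds is one possible combination rule; the text leaves open exactly how ECN counts and link utilization are combined.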

    The receiver sends a slow-down packet (similar to an ACK) to the sender of the lowest-priority flow, and the strength of the slow-down depends on the extent of network congestion (how far the metrics cross their thresholds). Three cases can arise: if the lowest-priority flow has already been slowed down to a pause, continue with the second-lowest-priority flow; if a newly arriving flow has lower priority than the active flows, pause it; if a newly arriving flow has higher priority, start with it.
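    The three cases could be sketched as follows. The `Flow` type, the helper names, and the convention that a lower numeric value means lower priority are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    flow_id: int
    priority: int  # assumption: lower value = lower priority

def pick_slow_down_target(flows, paused):
    """Return the lowest-priority flow not already paused.

    If the lowest-priority flow has been slowed to a pause, this
    naturally falls through to the second-lowest, as in case 1."""
    active = [f for f in flows if f.flow_id not in paused]
    return min(active, key=lambda f: f.priority, default=None)

def on_new_flow(new, flows, paused):
    """Handle a newly arriving flow during a slow-down episode."""
    active = [f for f in flows if f.flow_id not in paused]
    if active and all(new.priority < f.priority for f in active):
        paused.add(new.flow_id)  # case 2: lower priority than all -> pause it
    flows.append(new)            # case 3: a higher-priority flow just starts
```

    For example, with active flows of priorities 5 and 2, the slow-down target is the priority-2 flow; once it is paused, the next target is the priority-5 flow.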

    However, selecting a proper threshold involves a trade-off between congestion risk and throughput. A larger threshold allows more flows on the link and thus achieves higher throughput, but at a higher risk of congestion; a smaller threshold yields lower throughput but also a lower risk of congestion.
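    A toy example of the throughput side of this trade-off, using made-up per-flow loads and an arbitrary capacity: a larger utilization threshold admits more offered load (higher throughput) while leaving less headroom on the link (a crude proxy for congestion risk).

```python
CAPACITY = 100               # link capacity in arbitrary units (assumed)
OFFERED = [20, 30, 25, 35, 15]  # per-flow offered loads (made-up numbers)

def admit(max_util):
    """Greedily admit flows while total utilization stays within max_util."""
    util = 0
    for load in OFFERED:
        if util + load <= max_util:
            util += load
    return util

low = admit(50)   # small threshold: less throughput, more headroom (safer)
high = admit(90)  # large threshold: more throughput, little headroom (riskier)
```

    With these numbers the small threshold carries 50 units of load and keeps 50 units of headroom, while the large threshold carries 90 units and keeps only 10, illustrating why neither extreme is ideal.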