A quick summary of the VL2 data-center network scheme

Posted by Tom Moertel Thu, 17 Mar 2011 15:21:00 GMT

Reading the most recent issue of Communications of the ACM, I enjoyed the reprint of “VL2: A Scalable and Flexible Data Center Network.” Here’s a summary.

First, the problem: Start out with a rack of 20–40 servers. At the top of the rack, place a switch – the top-of-rack (ToR) switch – to connect the servers together. Now fill your data center with racks just like this. To allow the servers to talk across racks, take each rack’s ToR switch and run its uplink ports (typically 2) into separate (for redundancy) higher-level aggregation switches. To connect aggregated rack-groups together, uplink the aggregation switches into higher-level aggregation switches. Keep aggregating this way until everything is connected.

But eventually you can’t join the switched domains any further because they’d get too big and unwieldy. That’s when you uplink into the final layer of the hierarchy: access routers, which split your aggregated-aggregated rack-group-groups into separate layer-2 networks and route packets between them.

What’s wrong with this hierarchical network graph? In a word, oversubscription. When big jobs need to run across the data center, the uplinks saturate and become bottlenecks. According to the paper, even ToR uplinks are typically oversubscribed 1:2 to 1:20; the higher-level uplinks are even worse: 1:200 is not uncommon.
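To make the ratios concrete, here is a minimal sketch of how an oversubscription figure falls out of port counts and speeds. The rack below (20 servers, 1 Gbps NICs, two 1 Gbps uplinks) is a hypothetical example chosen for round numbers, not a configuration taken from the paper.

```python
def oversubscription(servers: int, server_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Return N for a 1:N oversubscription ratio at one switch layer."""
    demand = servers * server_gbps      # worst case: every server sends off-rack at line rate
    capacity = uplinks * uplink_gbps    # total bandwidth leaving the switch
    return demand / capacity


# Hypothetical rack: 20 servers with 1 Gbps NICs behind two 1 Gbps uplinks.
ratio = oversubscription(servers=20, server_gbps=1, uplinks=2, uplink_gbps=1)
print(f"1:{ratio:g}")
print(f"{1 / ratio:g} Gbps per server when all of them send off-rack")
```

The ratios compound as you climb the hierarchy: each further layer of aggregation divides the per-server share again, which is how the top of the tree ends up so much worse than the ToR uplinks.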

To work around these bottlenecks, network designers end up buying expensive network hardware and configuring it for specific workloads. But running large data centers is so expensive that you want the flexibility to squeeze lots of different jobs into the spare capacity, and networks tuned for one kind of workload are the opposite of flexibility.

How do you get both performance and flexibility? The authors of the paper propose VL2: creating virtual layer-2 networks that allow application addresses to be separated from network devices. This separation lets you design the above-rack network to provide huge path capacity using commodity hardware. In the paper’s running example, the above-rack network is a folded Clos network with two levels of switches – the aggregation level and, above that, the intermediate level.

Here’s how it works. You assign each network device on each server a location address (LA); this is an IP address that stays with it for life, naming the device. These LAs get advertised to the switches above, which keep track of them using a typical layer-3 link-state routing protocol.

To each application, however, you assign a block of application addresses (AAs) from a separate pool. LAs and AAs never mix. Each application is coded as if it runs on a dedicated Ethernet segment having only its AAs attached. When you want to give an application an extra server, you map one of the application’s unused AAs to the server’s LA, effectively attaching the server to the application’s virtual Ethernet network.

AA-to-LA mapping is handled by a fast, scalable directory system that is invisible to applications. It’s invisible because it’s implemented as a kernel extension. When a server sends a packet to one of its AA-addressed peers on the application’s virtual network, the kernel extension consults the directory to get the destination server’s corresponding LA and sends the packet to the AA via the LA.
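The mapping described above can be sketched as a small lookup table. This is only a toy model: the `Directory` class, its method names, and all of the addresses are hypothetical, and a plain in-memory dict stands in for the real (fast, scalable, distributed) directory system.

```python
class Directory:
    """Toy AA-to-LA directory; a plain dict stands in for the real service."""

    def __init__(self, app_pools: dict[str, list[str]]):
        # Each application owns a pool of AAs that are not yet mapped to a server.
        self.unused = {app: list(pool) for app, pool in app_pools.items()}
        self.aa_to_la: dict[str, str] = {}

    def attach(self, app: str, server_la: str) -> str:
        """Give `app` an extra server by mapping one of its unused AAs to the LA."""
        aa = self.unused[app].pop(0)
        self.aa_to_la[aa] = server_la
        return aa

    def resolve(self, aa: str) -> str:
        """What the sender's kernel extension asks: which LA hosts this AA?"""
        return self.aa_to_la[aa]


# Hypothetical addresses: AAs come from 20.0.0.0/8, LAs from 10.0.0.0/8.
directory = Directory({"search": ["20.0.0.1", "20.0.0.2"]})
aa = directory.attach("search", server_la="10.1.7.42")
print(aa, "->", directory.resolve(aa))
```

Because applications only ever see AAs, moving a server between applications is just a change to this table; nothing in the application or in the switches' routing state has to know.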

How the packet gets there is clever. The kernel extension doesn’t send the packet directly to the LA but instead sends it up to the very top level of the network, bouncing it off a randomly selected intermediate-level switch, before it is delivered to the LA device.

All this happens by encapsulation and tunneling. The packet to the destination AA is wrapped within a packet to the corresponding destination LA. That packet, in turn, is wrapped within a packet to the randomly chosen intermediate switch. When the packet is sent, it goes up to the intermediate switch, which unwraps the outer layer of encapsulation and bounces the inner packet down to the switch handling the LA. That switch, in turn, unwraps the remaining layer and sends the original packet – this is the one to the AA – to the destination server, which gets it via its application-specific virtual Ethernet adapter.
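The double wrapping is easy to model with nested records. The sketch below follows the description above; the `Packet` type, the function names, and the addresses are all made up for illustration and do not reflect the actual packet formats.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Packet:
    dst: str                          # address this layer is sent to
    payload: Union["Packet", bytes]   # inner packet, or application data


def encapsulate(aa_packet: Packet, dst_la: str, intermediate: str) -> Packet:
    """Wrap the AA packet for the LA, then wrap that for the intermediate switch."""
    return Packet(dst=intermediate, payload=Packet(dst=dst_la, payload=aa_packet))


def decapsulate(packet: Packet) -> Packet:
    """Strip one layer, as the intermediate switch and then the LA's switch do."""
    assert isinstance(packet.payload, Packet), "nothing left to unwrap"
    return packet.payload


original = Packet(dst="20.0.0.2", payload=b"hello")   # addressed to an AA
on_wire = encapsulate(original, dst_la="10.1.7.42", intermediate="10.255.0.3")

at_la_switch = decapsulate(on_wire)      # done by the intermediate switch
delivered = decapsulate(at_la_switch)    # done by the switch handling the LA
print(on_wire.dst, at_la_switch.dst, delivered.dst)
```

Each hop looks only at the outermost destination, so the switches never need to know anything about AAs.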

Why bounce packets off a random intermediate switch? The randomization spreads out traffic, allowing for high network utilization. This “Valiant load balancing” is cheap and effective, resulting in about 94% efficiency in the paper’s tests.

That, in a nutshell, is VL2. (I simplified some things; in reality the randomization doesn’t occur for each packet but for entire flows of packets.)
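Per-flow randomization can be sketched by deriving the choice of intermediate switch from the flow's identity, so that every packet in a flow takes the same path while different flows spread out. Hashing the 5-tuple is an assumption made here for illustration, and the switch names and addresses are hypothetical.

```python
import hashlib
from collections import Counter

INTERMEDIATES = ["int-1", "int-2", "int-3", "int-4"]   # hypothetical switch names


def pick_intermediate(flow: tuple, switches: list[str] = INTERMEDIATES) -> str:
    """Map a flow's 5-tuple to one switch; same flow, same switch."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return switches[int.from_bytes(digest[:8], "big") % len(switches)]


# (src AA, dst AA, protocol, src port, dst port)
flow = ("20.0.0.1", "20.0.0.2", "tcp", 49152, 80)
assert pick_intermediate(flow) == pick_intermediate(flow)   # packets stay on one path

# Many distinct flows spread across all of the intermediate switches.
flows = [("20.0.0.1", "20.0.0.2", "tcp", 49152 + i, 80) for i in range(1000)]
print(Counter(pick_intermediate(f) for f in flows))
```

Keeping a flow on one path avoids reordering its packets, while the spread across flows is what evens out the load.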

For more information, check out the original paper at Microsoft Research.


Verizon FiOS fiber-optic Internet service: a first look

Posted by Tom Moertel Tue, 15 Nov 2005 19:23:00 GMT

Recently I had Verizon’s fiber-optic service “FiOS” installed at my home. The installation process took about a half day and involved placing the following boxes around my house:

  • optical network terminal (ONT, installed outside of house)
  • battery backup unit (BBU, installed in basement)
  • power adapter (plugged into household electrical outlet)

The ONT was installed next to my old POTS junction box:

[Photo: new optical network terminal next to old POTS junction box]

The ONT acts like a miniature central office. To my house it provides four POTS lines for voice service and one 10/100 Mbps Ethernet port for data service. The ONT accepts a single fiber-optic cable that connects all of these services back to Verizon’s central office.

As part of the installation process, Verizon moved my POTS lines from copper over to the ONT’s POTS interfaces. Verizon wanted to remove my copper-based service altogether, but I forbade them from doing so because I have non-Verizon business lines that I want to keep on copper, which competitive carriers can use to offer me service. (Verizon is not required to share its fiber cables with competitive carriers.)

If you look closely at the ONT, you’ll see that it also is capable of handling video service:

[Photo: the ONT is a miniature central office]

(At present Pennsylvania’s cable-franchise laws prevent Verizon from offering video service, but I’m sure Verizon’s lobbyists are working to change that situation.)

Unlike copper wires, fiber-optic cables do not carry power. The ONT, therefore, must be powered from my home’s electrical service. If the power goes out, the battery backup unit (BBU) will supply power for the ONT’s voice services for about four hours.

VoIP users beware: When the household power fails, the ONT’s data services will be dropped immediately in order to conserve the BBU’s battery. This seems pretty lame to me, but Verizon confirmed this behavior when I called them to ask about it. If you need data service during a power failure, make sure your ONT is powered via a UPS under your control.

To provide data service to my house, the installer ran a CAT-5 cable from the ONT’s 10/100 Ethernet port into my house, where it plugs into a D-Link 4-port “Ethernet Broadband Router,” provided by Verizon for free. Although the provided router has NAT and firewall features, I placed a Linux-based firewall between it and the rest of my home network as an added precaution.

I have been using the service for several days now, and here is my verdict:

It’s just broadband.

Practically speaking, I can’t tell any difference between FiOS and my Adelphia cable-modem service. I ordered 5-Mbps service from both providers, and both services provide about 5 Mbps down, which is faster than fast enough for me. The FiOS service has slightly lower latency – I can ping www.google.com in about 9 ms – and that’s a nice plus.

The big benefit of FiOS is competition: Verizon’s price is about $10/month less than Adelphia’s. When I called Adelphia to cancel my service, their representative attempted to change my mind by offering me a 3-month promotional discount and trying to sell me extra television channels.

I passed.
