A quick summary of the VL2 data-center network scheme

Posted by Tom Moertel Thu, 17 Mar 2011 15:21:00 GMT

Reading the most recent issue of Communications of the ACM, I enjoyed the reprint of “VL2: A Scalable and Flexible Data Center Network.” Here’s a summary.

First, the problem: Start out with a rack of 20–40 servers. At the top of the rack, place a switch – the top-of-rack (ToR) switch – to connect the servers together. Now fill your data center with racks just like this. To allow the servers to talk across racks, take each rack’s ToR switch and run its uplink ports (typically 2) into separate (for redundancy) higher-level aggregation switches. To connect aggregated rack-groups together, uplink the aggregation switches into higher-level aggregation switches. Keep aggregating this way until everything is connected.

Eventually, though, you can’t join the switched domains any further because they would get too big and unwieldy. That’s when you uplink into the final layer of the hierarchy: access routers, which split your aggregated-aggregated rack-group-groups into separate layer-2 networks and route packets between them.

What’s wrong with this hierarchical network graph? In a word, oversubscription. When big jobs need to run across the data center, the uplinks saturate and become bottlenecks. According to the paper, even ToR uplinks are typically oversubscribed 1:2 to 1:20; the higher-level uplinks are even worse: 1:200 is not uncommon.
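To make those ratios concrete, here is a minimal sketch of the arithmetic. The server counts and link speeds are illustrative assumptions, not figures from the paper; the ratio is simply the uplink capacity compared with the worst-case traffic the servers below could offer.

```python
from fractions import Fraction

def oversubscription(servers, server_gbps, uplinks, uplink_gbps):
    """Return uplink capacity : offered load as a reduced 'a:b' string."""
    capacity = uplinks * uplink_gbps   # bandwidth leaving the rack
    demand = servers * server_gbps     # worst case: every server sends off-rack
    ratio = Fraction(capacity, demand)
    return f"{ratio.numerator}:{ratio.denominator}"

# Hypothetical rack: 40 servers at 1 Gbps sharing two 10 Gbps uplinks.
print(oversubscription(40, 1, 2, 10))  # 1:2
# Hypothetical rack: 20 servers at 1 Gbps sharing a single 1 Gbps uplink.
print(oversubscription(20, 1, 1, 1))   # 1:20
```

Stack a few such layers on top of each other and the ratios multiply, which is how the upper levels end up so much worse than the ToR uplinks.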

To work around these bottlenecks, network designers end up buying expensive network hardware and configuring it for specific workloads. But running large data centers is so expensive that you want the flexibility to squeeze lots of different jobs into the spare capacity, and networks tuned for one kind of workload are the opposite of flexibility.

How do you get both performance and flexibility? The authors of the paper propose VL2: creating virtual layer-2 networks that allow application addresses to be separated from network devices. This separation lets you design the above-rack network to provide huge path capacity using commodity hardware. In the paper’s running example, it’s a folded Clos network in two levels of switches – the aggregation level and, above that, the intermediate level.

Here’s how it works. You assign each network device on each server a location address (LA); this is an IP address that stays with it for life, naming the device. These LAs get advertised to the switches above, which keep track of them using a typical layer-3 link-state routing protocol.

To each application, however, you assign a block of application addresses (AAs) from a separate pool. LAs and AAs never mix. Each application is coded as if it runs on a dedicated Ethernet segment having only its AAs attached. When you want to give an application an extra server, you map one of the application’s unused AAs to the server’s LA, effectively attaching the server to the application’s virtual Ethernet network.

AA-to-LA mapping is handled by a fast, scalable directory system that is invisible to applications. It’s invisible because it’s implemented as a kernel extension. When a server sends a packet to one of its AA-addressed peers on the application’s virtual network, the kernel extension consults the directory to get the destination server’s corresponding LA and sends the packet to the AA via the LA.
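As a rough mental model, the directory behaves like a lookup table from AAs to LAs. The sketch below is only that: the `Directory` class, its method names, and the addresses are all hypothetical, and the real system is a replicated, distributed service rather than an in-process dictionary.

```python
class Directory:
    """Toy stand-in for the AA-to-LA directory service (hypothetical API)."""

    def __init__(self):
        self._aa_to_la = {}

    def attach(self, aa, la):
        """Map an application address onto a location address."""
        self._aa_to_la[aa] = la

    def detach(self, aa):
        """Remove a mapping; unknown AAs are ignored."""
        self._aa_to_la.pop(aa, None)

    def resolve(self, aa):
        """Return the LA currently behind an AA; raises KeyError if unmapped."""
        return self._aa_to_la[aa]


directory = Directory()
directory.attach("20.0.0.5", "10.1.3.17")  # made-up AA -> made-up LA
print(directory.resolve("20.0.0.5"))        # 10.1.3.17

# Re-pointing the AA at a different LA is just a directory update;
# the application keeps using the same AA throughout.
directory.attach("20.0.0.5", "10.4.9.2")
print(directory.resolve("20.0.0.5"))        # 10.4.9.2
```

The point of the indirection is visible in the last two lines: the application-facing address never changes, only the mapping behind it does.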

How the packet gets there is clever. The kernel extension doesn’t send the packet directly to the LA but instead sends it up to the very top level of the network, bouncing it off a randomly selected intermediate-level switch, before it is delivered to the LA device.

All this happens by encapsulation and tunneling. The packet to the destination AA is wrapped within a packet to the corresponding destination LA. That packet, in turn, is wrapped within a packet to the randomly chosen intermediate switch. When the packet is sent, it goes up to the intermediate switch, which unwraps the outer layer of encapsulation and bounces the inner packet down to the switch handling the LA. That switch, in turn, unwraps the remaining layer and sends the original packet – this is the one to the AA – to the destination server, which gets it via its application-specific virtual Ethernet adapter.
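The double wrapping described above can be sketched with nested records. This is an illustration under assumptions, not the real packet format: the `Packet` type, the helper names, and the addresses are invented for the example, and real headers carry far more than a destination.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Packet:
    dst: str                           # destination address of this layer
    payload: Union["Packet", bytes]    # either another packet or raw data

def encapsulate(data, dst_aa, dst_la, intermediate):
    """Wrap data for an AA inside a packet to its LA, inside a packet to the intermediate switch."""
    inner = Packet(dst_aa, data)
    middle = Packet(dst_la, inner)
    return Packet(intermediate, middle)

def decapsulate(packet):
    """Strip one layer of encapsulation, as a switch along the path would."""
    if not isinstance(packet.payload, Packet):
        raise ValueError("nothing left to unwrap")
    return packet.payload

# Made-up addresses: AA 20.0.0.5, LA 10.1.3.17, intermediate switch 10.255.0.1.
wire = encapsulate(b"hello", "20.0.0.5", "10.1.3.17", "10.255.0.1")
at_intermediate = decapsulate(wire)        # intermediate switch removes outer layer
delivered = decapsulate(at_intermediate)   # switch handling the LA removes the next

print(wire.dst, at_intermediate.dst, delivered.dst)  # 10.255.0.1 10.1.3.17 20.0.0.5
print(delivered.payload)                             # b'hello'
```

Each hop only ever looks at the outermost destination, which is why the switches in the middle never need to know anything about AAs.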

Why bounce packets off a random intermediate switch? The randomization spreads out traffic, allowing for high network utilization. This “Valiant Load Balancing” is cheap and effective, resulting in about 94% efficiency in the paper’s tests.

That, in a nutshell, is VL2. (I simplified some things; in reality the randomization doesn’t occur for each packet but for entire flows of packets.)
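One common way to randomize per flow rather than per packet is to hash the flow’s identifying fields and use the result to pick the intermediate switch. The sketch below assumes that approach; the function name, switch names, and 5-tuple layout are hypothetical, and this summary does not claim it is exactly how VL2 does it.

```python
import hashlib

def pick_intermediate(flow, switches):
    """Choose an intermediate switch from a hash of the flow's 5-tuple."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(switches)
    return switches[index]

switches = ["int-1", "int-2", "int-3", "int-4"]   # made-up switch names

# (source address, source port, destination address, destination port, protocol)
flow_a = ("20.0.0.5", 40000, "20.0.0.9", 80, "tcp")
flow_b = ("20.0.0.5", 40001, "20.0.0.9", 80, "tcp")

# Every packet of a given flow hashes to the same switch...
assert pick_intermediate(flow_a, switches) == pick_intermediate(flow_a, switches)
# ...while different flows are scattered across the available switches.
print(pick_intermediate(flow_a, switches), pick_intermediate(flow_b, switches))
```

Pinning a whole flow to one path keeps its packets from arriving out of order, while the hash still spreads the many flows in a data center roughly evenly.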

For more information, check out the original paper at Microsoft Research.

Verizon FiOS fiber-optic Internet service: a first look

Posted by Tom Moertel Tue, 15 Nov 2005 19:23:00 GMT

Recently I had Verizon’s fiber-optic service “FiOS” installed at my home. The installation process took about half a day and involved placing the following boxes around my house:

  • optical network terminal (ONT, installed outside of house)
  • battery backup unit (BBU, installed in basement)
  • power adapter (plugged into household electrical outlet)

The ONT was installed next to my old POTS junction box:

[Photo: new optical network terminal next to old POTS junction box]

The ONT acts like a miniature central office. To my house it provides four POTS lines for voice service and one 10/100 Mbps Ethernet port for data service. The ONT accepts a single fiber-optic cable that connects all of these services back to Verizon’s central office.

As part of the installation process, Verizon moved my POTS lines from copper over to the ONT’s POTS interfaces. Verizon wanted to remove my copper-based service altogether, but I forbade them from doing so because I have non-Verizon business lines that I want to keep on copper, which competitive carriers can use to offer me service. (Verizon is not required to share its fiber cables with competitive carriers.)

If you look closely at the ONT, you’ll see that it also is capable of handling video service:

[Photo: the ONT is a miniature central office]

(At present Pennsylvania’s cable-franchise laws prevent Verizon from offering video service, but I’m sure Verizon’s lobbyists are working to change that situation.)

Unlike copper wires, fiber-optic cables do not carry power. The ONT, therefore, must be powered from my home’s electrical service. If the power goes out, the battery backup unit (BBU) will supply power for the ONT’s voice services for about four hours.

VoIP users beware: When the household power fails, the ONT’s data services will be dropped immediately in order to conserve the BBU’s battery. This seems pretty lame to me, but Verizon confirmed this behavior when I called them to ask about it. If you need data service during a power failure, make sure your ONT is powered via a UPS under your control.

To provide data service to my house, the installer ran a CAT-5 cable from the ONT’s 10/100 Ethernet port into my house, where it plugs into a D-Link 4-port “Ethernet Broadband Router,” provided by Verizon for free. Although the provided router has NAT and firewall features, I placed a Linux-based firewall between it and the rest of my home network as an added precaution.

I have been using the service for several days now, and here is my verdict:

It’s just broadband.

Practically speaking, I can’t tell any difference between FiOS and my Adelphia cable-modem service. I ordered 5-Mbps service from both providers, and both services provide about 5 Mbps down, which is faster than fast enough for me. The FiOS service has slightly lower latency – I can ping www.google.com in about 9 ms – and that’s a nice plus.

The big benefit of FiOS is competition: Verizon’s price is about $10/month less than Adelphia’s. When I called Adelphia to cancel my service, their representative attempted to change my mind by offering me a 3-month promotional discount and trying to sell me extra television channels.

I passed.

Replacing the fan array in my HP ProCurve 4000M switch

Posted by Tom Moertel Sat, 12 Nov 2005 02:42:00 GMT

[Photo: replacing the fans in a 4000m switch]

The main network switch in my home office is an HP ProCurve 4000m, which has been running non-stop for over half a decade. It is a great switch, and even though it is getting old, it is still dependable.

A while ago I noticed that the 4000m’s fault indicator was lit. So I logged into the switch and checked the log: fan 1 was dead. The switch has built-in redundancy (three fans), and so I didn’t worry about it, but I did call HP ProCurve tech support.

The woman I spoke with was friendly and helpful. I told her what was wrong, and she said a new fan array would be on my doorstep within 48 hours. No charge. (I guess the ProCurve warranty really is worth something.)

Today, I installed the array. This meant opening up the switch, which is a fun thing to do. If you are curious about what is inside of a 4000m, I took photos of the operation.

During the process, I recalled why I love old-style HP engineering:

  • The replacement parts came with clear instructions that showed me how to remove the old array and install the new one. They were easy to follow and didn’t leave anything to guess.
  • The 4000m is solid – inside and out.
  • The electrical components are top quality.
  • The industrial engineering is superb. For example, all of the user-removable screws have non-stripping torx heads and are designed not to fall out and get lost; instead they remain attached to the module or panel you are removing. (See this photo of removed modules to see how the screws stay in place.)

Everything about the process made me think, wow, this is really well engineered.

The thing is, I know, as I sit here and watch the blinking LEDs on my now-restored 4000m, that my next network switch will probably be a Dell.

As much as I love the ProCurve engineering, the Dell price is compelling. Even if I expect the Dells to fail twice as often (and the Dell warranties are comparatively lame), I can buy twice as many Dells and keep spares on the shelf – and still save money compared to the equivalent ProCurve equipment.

I find the situation somewhat sad. I am an engineering guy to the core. So when I go for the cheaper product because it is so darn cheap, I know that much of the market will do likewise. That bodes ill for HP. Like HP’s calculators, the ProCurves too may pass into history.
