Posted by Tom Moertel
Thu, 17 Mar 2011 15:21:00 GMT
Reading the most recent issue of Communications of the ACM, I enjoyed the reprint of “VL2: A Scalable and Flexible Data Center Network.” Here’s a summary.
First, the problem: Start out with a rack of 20–40 servers. At the top of the rack, place a switch – the top-of-rack (ToR) switch – to connect the servers together. Now fill your data center with racks just like this. To allow the servers to talk across racks, take each rack’s ToR switch and run its uplink ports (typically 2) into separate (for redundancy) higher-level aggregation switches. To connect aggregated rack-groups together, uplink the aggregation switches into higher-level aggregation switches. Keep aggregating this way until everything is connected.
Eventually, though, you can’t join the switched domains any further because they get too big and unwieldy. That’s when you uplink into the final layer of the hierarchy: access routers, which split your aggregated-aggregated rack-group-groups into separate layer-2 networks and route packets between them.
What’s wrong with this hierarchical network graph? In a word, oversubscription: the servers below a link can offer far more traffic than the link can carry. When big jobs need to run across the data center, the uplinks saturate and become bottlenecks. According to the paper, even ToR uplinks are typically oversubscribed 1:5 to 1:20; paths through the highest layers are even worse and can be oversubscribed 1:240.
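To make the oversubscription arithmetic concrete, here is a small sketch. The rack size and link speeds are illustrative assumptions (20 servers with 1 Gbps links sharing 4 Gbps or 1 Gbps of uplink), not measurements from any particular data center, and the function names are made up for this example.

```python
def oversubscription(servers, server_gbps, uplink_gbps):
    """Worst-case demand from the servers below a link, divided by the link's capacity."""
    return (servers * server_gbps) / uplink_gbps


def per_server_share_gbps(servers, server_gbps, uplink_gbps):
    """Bandwidth each server gets when all of them send off-rack at once."""
    return min(server_gbps, uplink_gbps / servers)


if __name__ == "__main__":
    # Assumed rack: 20 servers with 1 Gbps NICs, and 4 Gbps or 1 Gbps of total uplink.
    for uplink in (4, 1):
        ratio = oversubscription(20, 1, uplink)
        share = per_server_share_gbps(20, 1, uplink)
        print(f"{uplink} Gbps of uplink: 1:{ratio:g} oversubscribed, "
              f"{share * 1000:g} Mbps per server off-rack")
```

Under these assumptions, a server that talks at full speed to its rack neighbors gets only a small fraction of that speed when the whole rack sends traffic off-rack, and the squeeze compounds at each higher layer.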
To work around these bottlenecks, network designers end up buying expensive network hardware and configuring it for specific workloads. But running large data centers is so expensive that you want the flexibility to squeeze lots of different jobs into the spare capacity, and networks tuned for one kind of workload are the opposite of flexibility.
How do you get both performance and flexibility? The authors of the paper propose VL2: creating virtual layer-2 networks that allow application addresses to be separated from network devices. This separation lets you design the above-rack network to provide huge path capacity using commodity hardware. In the paper’s running example, it’s a folded Clos network in two levels of switches – the aggregation level and, above that, the intermediate level.
Here’s how it works. You assign each network device on each server a location address (LA); this is an IP address that stays with it for life, naming the device. These LAs get advertised to the switches above, which keep track of them using a typical layer-3 link-state routing protocol.
To each application, however, you assign a block of application addresses (AAs) from a separate pool. LAs and AAs never mix. Each application is coded as if it runs on a dedicated Ethernet segment having only its AAs attached. When you want to give an application an extra server, you map one of the application’s unused AAs to the server’s LA, effectively attaching the server to the application’s virtual Ethernet network.
AA-to-LA mapping is handled by a fast, scalable directory system that is invisible to applications. It’s invisible because the lookups are done by a kernel extension on each server. When a server sends a packet to one of its AA-addressed peers on the application’s virtual network, the kernel extension consults the directory to get the destination server’s corresponding LA and sends the packet to the AA via the LA.
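The AA-to-LA mapping can be pictured as a lookup table. The sketch below is a toy, in-memory stand-in, not the paper's directory system: the `Directory` class, its method names, and all of the addresses are invented for illustration.

```python
class Directory:
    """Toy stand-in for the AA-to-LA directory (the real one is a separate networked service)."""

    def __init__(self):
        self._la_by_aa = {}

    def attach(self, aa, la):
        """Map an application address to the location address currently hosting it."""
        self._la_by_aa[aa] = la

    def resolve(self, aa):
        """Return the LA for an AA; raises KeyError if the AA is unassigned."""
        return self._la_by_aa[aa]


directory = Directory()
directory.attach("20.0.0.5", "10.1.7.1")   # made-up addresses
directory.attach("20.0.0.6", "10.4.2.1")

# Giving the application a different server only changes the mapping;
# peers keep using the same AA.
directory.attach("20.0.0.6", "10.9.9.1")
```

The point of the indirection is that the application only ever sees AAs, so servers can be attached or swapped by editing the table rather than by reconfiguring the application.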
How the packet gets there is clever. The kernel extension doesn’t send the packet directly to the LA but instead sends it up to the very top level of the network, bouncing it off a randomly selected intermediate-level switch, before it is delivered to the LA device.
All this happens by encapsulation and tunneling. The packet to the destination AA is wrapped within a packet to the corresponding destination LA. That packet, in turn, is wrapped within a packet to the randomly chosen intermediate switch. When the packet is sent, it goes up to the intermediate switch, which unwraps the outer layer of encapsulation and bounces the inner packet down to the switch handling the LA. That switch, in turn, unwraps the remaining layer and sends the original packet – this is the one to the AA – to the destination server, which gets it via its application-specific virtual Ethernet adapter.
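The double wrapping described above can be sketched with nested records. This is a simplified model under stated assumptions: the `Packet` type, the helper names, and the addresses are hypothetical, and real encapsulation works on IP headers rather than Python objects.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Packet:
    dst: str                          # address this layer is sent to
    payload: Union["Packet", bytes]   # inner packet, or application data


def encapsulate(app_packet, dest_la, intermediate):
    """Wrap the AA-addressed packet for the LA, then for the intermediate switch."""
    return Packet(intermediate, Packet(dest_la, app_packet))


def decapsulate(packet, my_address):
    """Strip one layer at the device the layer was addressed to."""
    if packet.dst != my_address:
        raise ValueError(f"{my_address} is not the destination of this layer")
    return packet.payload


# Made-up addresses: AA 20.0.0.6 sits behind LA 10.4.2.1; 10.255.0.3 is an intermediate switch.
original = Packet("20.0.0.6", b"hello")
on_the_wire = encapsulate(original, "10.4.2.1", "10.255.0.3")

at_la_switch = decapsulate(on_the_wire, "10.255.0.3")   # intermediate unwraps the outer layer
delivered = decapsulate(at_la_switch, "10.4.2.1")       # LA switch unwraps the remaining layer
```

After both unwrapping steps, what arrives is the original AA-addressed packet, which is why the application never notices the detour.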
Why bounce packets off a random intermediate switch? The randomization spreads out traffic, allowing for high network utilization. This “Valiant Load Balancing” is cheap and effective, resulting in about 94% efficiency in the paper’s tests.
That, in a nutshell, is VL2. (I simplified some things; in reality the randomization doesn’t occur for each packet but for entire flows of packets.)
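One way to picture per-flow randomization is to hash each flow's identifying fields and use the hash to choose the intermediate switch. The sketch below illustrates that idea only; it is not necessarily the mechanism the paper uses, and the switch names, the flow tuple layout, and `pick_intermediate` are assumptions made for this example.

```python
import hashlib

# Hypothetical intermediate switch names.
INTERMEDIATE_SWITCHES = ["int-1", "int-2", "int-3", "int-4"]


def pick_intermediate(flow, switches=INTERMEDIATE_SWITCHES):
    """Choose a switch from a hash of the flow's identifying fields.

    The choice looks random across flows but is stable within a flow,
    so every packet of one flow takes the same path.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return switches[int.from_bytes(digest[:8], "big") % len(switches)]


# (source AA, destination AA, protocol, source port, destination port); values are made up.
flow_a = ("20.0.0.5", "20.0.0.6", "tcp", 40001, 80)
flow_b = ("20.0.0.5", "20.0.0.6", "tcp", 40002, 80)
print(pick_intermediate(flow_a), pick_intermediate(flow_b))
```

Keeping a flow on a single path avoids delivering its packets out of order, while many flows taken together still spread across all of the intermediate switches.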
For more information, check out the original paper at Microsoft Research.
Posted in networking
Tags centers, clos, data, networking, scaling, vl2

Posted by Tom Moertel
Tue, 15 Nov 2005 19:23:00 GMT
Recently I had Verizon’s fiber-optic service “FiOS” installed
at my home. The installation process took about a half day
and involved placing the following boxes around my house:
- optical network terminal (ONT, installed outside of house)
- battery backup unit (BBU, installed in basement)
- power adapter (plugged into household electrical outlet)
The ONT was installed next to my old POTS
junction box.

The ONT acts like a miniature central office. To my house it provides
four POTS lines for voice service and one 10/100 Mbps Ethernet port for
data service. The ONT accepts a single fiber-optic cable that
connects all of these services back to Verizon’s central office.
As part of the installation process, Verizon moved my POTS lines from
copper over to the ONT’s POTS interfaces. Verizon wanted to remove my
copper-based service altogether, but I forbade them from doing so
because I have non-Verizon business lines that I want to keep on
copper, which competitive carriers can use to offer me service.
(Verizon is not required to share its fiber cables with competitive
carriers.)
If you look closely at the ONT, you’ll see that it also is capable of
handling video service.

(At present Pennsylvania’s cable-franchise laws prevent Verizon
from offering video service, but I’m sure Verizon’s lobbyists are
working to change that situation.)
Unlike copper wires, fiber-optic cables do not carry power. The ONT,
therefore, must be powered from my home’s electrical service. If the
power goes out, the battery backup unit (BBU) will supply power for
the ONT’s voice services for about four hours.
VoIP users beware: When the household power fails, the ONT’s data
services will be dropped immediately in order to conserve the BBU’s
battery. This seems pretty lame to me, but Verizon confirmed this
behavior when I called them to ask about it. If you need data service
during a power failure, make sure your ONT is powered via a UPS
under your control.
To provide data service to my house, the installer ran a CAT-5 cable
from the ONT’s 10/100 Ethernet port into my house, where it plugs into
a D-Link 4-port “Ethernet Broadband Router,” provided by Verizon for
free. Although the provided router has NAT and firewall features, I
placed a Linux-based firewall between it and the rest of my home
network as an added precaution.
I have been using the service for several days now, and here is my
verdict:
It’s just broadband.
Practically speaking, I can’t tell any difference between FiOS and my
Adelphia cable-modem service. I ordered 5-Mbps service from both
providers, and both services provide about 5 Mbps down, which is
fast enough for me. The FiOS service has slightly lower latency – I
can ping www.google.com in about 9 ms – and that’s a nice plus.
The big benefit of FiOS is competition: Verizon’s price is about $10/month
less than Adelphia’s. When I called Adelphia to cancel my service,
their representative attempted to change my mind by offering me a
3-month promotional discount and trying to sell me extra television
channels.
I passed.
Posted in reviews, hardware, networking
Tags fios, networking

Posted by Tom Moertel
Sat, 12 Nov 2005 02:42:00 GMT

The main network switch in my home office is an HP ProCurve
4000m, which
has been running non-stop for over half a decade. It is a great
switch, and even though it is getting old, it is still dependable.
A while ago I noticed that the 4000m’s fault indicator was
lit. So I logged into the switch and checked the log: fan
1 was dead. The switch has built-in redundancy (three fans),
and so I didn’t worry about it, but I did call HP ProCurve
tech support.
The woman I spoke with was friendly and helpful. I told her what was
wrong, and she said a new fan array would
be on my doorstep within 48 hours. No charge. (I guess the
ProCurve warranty really is worth something.)
Today, I installed the array. This meant opening up the switch,
which is a fun thing to do. If you are curious about what is inside
of a 4000m, I took photos of the operation.
During the process, I recalled why I love old-style HP engineering:
- The replacement parts came with clear instructions
that showed me how to remove the old array and install the new one.
They were easy to follow and didn’t leave anything to guess.
- The 4000m is solid – inside and out.
- The electrical components are top quality.
- The industrial engineering is superb. For example, all of the user-removable
screws have non-stripping Torx heads and are designed not to fall
out and get lost; instead they remain attached to the module or
panel you are removing. (See this photo of removed modules
to see how the screws stay in place.)
Everything about the process made me think, wow, this is really well engineered.
The thing is, I know, as I sit here and watch the blinking LEDs on my
now-restored 4000m, that my next network switch will probably be a
Dell.
As much as I love the ProCurve engineering, the Dell price is
compelling. Even if I expect the Dells to fail twice as often (and
the Dell warranties are comparatively lame), I can buy twice as
many Dells and keep spares on the shelf – and still save money
compared to the equivalent ProCurve equipment.
I find the situation somewhat sad. I am an engineering guy to the
core. So when I go for the cheaper product because it is so darn cheap,
I know that much of the market will do likewise. That bodes ill
for HP. Like HP’s calculators, the ProCurves too may pass into
history.
Posted in photography, hardware, engineering
Tags 4000m, engineering, hardware, hp, networking, photos, switch
