From: Nick Moffitt (nick@zork.net)
To: balug-talk@balug.org
Subject: Fast Ethernet Stalls with NetGear FA310TX Rev.D1
Date: Thu, 8 Apr 1999 14:04:08 -0700
X-Mailer: Mutt 0.95i
X-OS-Rules: Linux Rules
X-OS-Sucks: Windows Sucks
X-motd-size: 49 lines long
X-fnord: Cauterize the stump.

This reached me via a bizarre chain of bounces and forwards that even I haven't quite figured out yet. It does, however, bring up some problems with the new Netgear cards.

Some time last autumn, Bay Networks found themselves in a fix: The buyout of DEC by Compaq led to the unavailability of the 21140 "Tulip" ethernet chipset, long hailed by Donald Becker as the best ethernet chip for Linux. Rather than discontinue the cards that used it (notably the Netgear FA310TX), they used a clone chip, and shipped the cards out under the same name and model number.

This is the sort of misrepresentation that is becoming all too common these days. Why not just update the version number? I saw no indication on the box that this was a "revision D1" or anything like that. Bay Networks sold these as though they were original Tulip-based cards. Only by looking at the central IC on the board could you tell which was which. (The Tulip conspicuously sported the "Digital" logo, while the clone had a Netgear logo screened onto it.)

There is a hilarious account by Becker of his attempts to deal with the Bay Networks people when this happened: The new cards were useless with the then-current drivers, and he had trouble getting the source code to the driver they had written. (It turned out, in the end, to have been his original GPLed driver with the timings adjusted such that the code could potentially wait in kernel space for up to a second! "Life sure is easy when you think like a DOS programmer", he noted.[1]) Recent rumors indicate that Becker is still advocating the use of the Netgear FA310TX as the best ethernet card for Linux.

Some of us, however, have learned our lesson, and have moved on to the fairly respectable EEPro 100 from Intel. (I believe that Mike Higashi noted that Intel has enough of the intellectual property to resurrect the Tulip, but I can't imagine that they have any desire to do so.) The Beowulf people, on the other hand, still need to squeeze as much speed as is possible out of their network, and are continuing to use these mongrel cards.

The following message describes a recent problem with the newer Netgears. The author put up a comparison of the two cards (PDF only; sorry, folks), which he mentions in his message. I did, however, find a note on the Coral Project Web page about the new Netgears (http://www.icase.edu/coral/hardware.html#node -- see note 4) that shows a definite performance decrease with the newer cards compared to the originals.

At any rate, I managed to find a couple of authentic Tulip-based Netgears in a CompUSA in San Bruno last Halloween (BOY was it spooky!), and I intend to hang on to them. It's a pity the line was discontinued.

----- Forwarded message from Josip Loncaric (josip@icase.edu) -----

From: Josip Loncaric (josip@icase.edu)
Subject: [Pigdog] Fast Ethernet Stalls with NetGear FA310TX Rev.D1
To: Beowulf mailing list (beowulf@beowulf.gsfc.nasa.gov)
Organization: ICASE
Date: Wed, 31 Mar 1999 16:00:42 -0500
X-Mailer: Mozilla 4.5 [en] (X11; U; SunOS 5.6 sun4m)

Fast Ethernet Stalls with Netgear FA310TX Rev. D1

Several users of the ICASE Coral cluster have noticed problems with communication stalls at random intervals. A job that is running normally will suddenly hang for periods of up to a minute or more, then resume operation with no apparent side effects. The problem is most pronounced with communication-intensive applications using large numbers of processors, but also occurs infrequently with jobs using as few as two processors.

After extensive investigation and testing, we have determined that the problem is related to the Netgear FA310TX Rev. D1 cards we are using in most nodes of the Coral cluster. So far as we can determine, the problem does not occur with older Netgear FA310TX Rev. C1 cards. The C1 cards use a DEC 21140 chip, while the D1 cards use a Lite-On clone (a.k.a. PNIC).

In our normal configuration, we are running Linux 2.0.36 with version 0.90q of Don Becker's "tulip.c" Fast Ethernet driver. We have also observed the problem under Linux 2.2.2. We have not been able to determine whether the problem is due to hardware, firmware, or the driver software. For more information about the Coral hardware and software configuration, see http://www.icase.edu/CoralProject.html.

The two plots in http://www.icase.edu/~tom/Coral/DEC_vs_LiteOn.pdf illustrate the problem. For both tests, we ran the same parallel rendering benchmark on two processors, generating more than 32,000 frames of animation. The benchmark code uses LAM 6.2b MPI over TCP for interprocessor communication, although the problem is also observed with other communication packages.

Each frame of the animation contains similar imagery, so we expect rendering times to be tightly bounded. The two tests were run simultaneously for more than 16 hours, using different pairs of processors, on an otherwise idle system. For each frame, we plot the elapsed (wallclock) execution time.

The first plot shows performance using Rev. C1 (DEC) cards. As expected, rendering times are tightly bounded. The one exception appears at frames 28,524 through 28,527, and is attributable to the nightly Linux cron job, which steals cycles from the rendering application.

The second plot shows performance using Rev. D1 (Lite-On or PNIC) cards. The results show numerous stalls, ranging in duration from 1 to 53 seconds. The nightly cron run is also apparent at frames 28,170 through 28,172, resulting in delays of up to three seconds.

Although these stall events are relatively rare, compared to the number of packets transmitted, the impact on parallel performance can be severe, particularly when interactive or real-time performance is required. They also make it very difficult to obtain accurate, repeatable performance measurements of parallel applications.
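
As an illustration of how stalls like these can be picked out of per-frame timing data, here is a minimal Python sketch. It is not the code used to produce the ICASE plots; the function name find_stalls, the min_excess margin, and the sample timings are all hypothetical. It flags any frame whose elapsed time exceeds the median frame time by more than a fixed margin, which relies on the observation above that normal rendering times are tightly bounded.

```python
from statistics import median


def find_stalls(frame_times, min_excess=1.0):
    """Return (frame_index, elapsed_seconds) pairs for suspicious frames.

    frame_times: elapsed wallclock time per frame, in seconds.
    min_excess:  hypothetical margin; a frame is flagged when it takes
                 more than this many seconds longer than the median.
    """
    if not frame_times:
        return []
    typical = median(frame_times)
    threshold = typical + min_excess
    return [(i, t) for i, t in enumerate(frame_times) if t > threshold]


# Synthetic timings, invented for this example (not measured data):
# ten frames at about 1.8 s each, with two slow frames mixed in.
times = [1.8] * 10
times[3] = 53.0
times[7] = 4.9

for frame, elapsed in find_stalls(times):
    print(f"frame {frame}: {elapsed:.1f} s")
```

A margin-over-median test is only one possible choice. It cannot tell a network stall from a cron run stealing cycles, so events flagged this way would still need to be checked against what else was running at the time.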

Tom Crockett
Josip Loncaric

ICASE
March 31, 1999

--
Dr. Josip Loncaric, Senior Staff Scientist (josip@icase.edu)
ICASE, Mail Stop 132C http://www.icase.edu/~josip/
NASA Langley Research Center (j.loncaric@larc.nasa.gov)
Hampton, VA 23681-2199, USA Tel. +1 757 864-2192 Fax +1 757 864-6134

----- End forwarded message -----

--
"The software is intended to be as unobtrusive, unintrusive and
unconstraining as possible. In software as elsewhere, good
engineering is whatever gets the job done without calling attention to
itself." -- Cynbe ru Taren, on Citadel (http://zork.net/cit/citadel.txt)




[1] RM comments: This classic post is reproduced below, and is also archived at http://www.scyld.com/pipermail/tulip/1998-September/000185.html .




From: Donald Becker (becker@cesdis1.gsfc.nasa.gov)
To: tulip@scyld.com
Subject: Netgear ethernet cards no longer Tulip
Date: Thu Sep 3 16:11:34 1998

On Thu, 3 Sep 1998, David C Niemi wrote:

> On Thu, 3 Sep 1998, Jameson Burt wrote:
> > I purchased A Netgear FA310TX and an ethernet kit, including two of
> > these FA310TX, at Microcenter. While the separate FA310TX had a DEC
> > Tulip chip, I was surprised that the kit's two FA310TX had chips
> > labeled "Netgear", rather than DEC. The following mail from the
> > Ottawa Linux Users Group explains Netgear's changes.
..
> Were the Netgear-labeled chips smaller rectangular chips, instead of
> the larger square chips typical of Tulips? If so, they are probably
> RTL 8139s, which are grossly inferior.

CESDIS got a box of ten in yesterday (ordered in May for summer students!): They are relabeled PNIC-169 chips.

> > The FA310TX used to be a Tulip card: Netgear in their infinite
> > wisdom has changed this with their RevD card, though they still
> > ardently claim it's supported by Linux, and provide a tulip.c file
> > that is obviously the wrong driver.

It's a modified version of tulip.c. They didn't include the GPL (required!), and they modified the driver without indicating that it was modified from the original (required by the GPL).

Worse, the modifications were trivial: mostly a software timing loop that waited an entire second inside the kernel! Hey, all of these media selection problems are trivial if you think like an MS-DOS programmer.

When I called, they tried to claim it was provided by a third party: Initially, they said it was authorized by me (Uhhhm, but I'm me, and I would remember). Then, they said the changes were from Lite-On (the maker of the PNIC), but I had talked to a (the?) Lite-On software person in Taiwan earlier that day...

> > They also claim that a Tulip chipset is backwards-compatible with
> > NE2000, and you can just use that driver. NetGear/Bay now scares me.

Wow. It's an NE2000.... All this time, I've been writing drivers, and I could just have stopped at the NE2000...

> > After a very long discussion yesterday with a large group of Linux
> > people, we discovered that it appears Tulip cards are being
> > discontinued, en masse. This is very disturbing as they're the
> > best supported card in the kernel, by far. The only one left I can
> > locate is the DFE-500TX (not the 530) by D-Link, of which I ordered
> > 15, and will order more if there is additional demand for them. The
> > Kingston KNE-100, or close to that, I can't remember the model
> > offhand, is a Tulip-compatible clone, though there are still some
> > bugs in it.

Nobody trusts or wants to play with Intel, who now owns Digital's network product line.

-- 
Donald Becker                                    becker@cesdis.gsfc.nasa.gov
USRA-CESDIS, Center of Excellence in Space Data and Information Sciences.
Code 930.5, Goddard Space Flight Center,  Greenbelt, MD.  20771 301-286-0882
http://cesdis.gsfc.nasa.gov/people/becker/whoiam.html