Hyphenation and the good life

This article appeared in the March/April 1990 issue of Electric Word.

Eschewing the hallowed halls of academia and the rough-and-tumble of modern urban life, Sasha and Margaret Nizhnikov have established themselves (extended family, associates and all) in a large, rambling house in the wooded hills of southern New Hampshire, just 45 minutes from Route 128, Massachusetts’ thriving silicon belt. The retreat is home base for the couple’s linguistic software company, Circle Noetic Services. Its specialty: hyphenation.

“The hyphenation routines of most dtp packages are pretty bad,” laments Sasha Nizhnikov. “Look here in Electric Word #16, in Dr WriteStuff, right here where the reader complains about the hyphenation in PageMaker: ‘ba-sed!’ See that?” Sasha is unable to suppress his scorn.

Sasha’s intensity and natural garrulousness are amply complemented by Margaret’s articulate, well-chosen words. “Certain programs routinely insert a hyphen before every ‘-ed,’” she says. “What’s wrong there is that the rule is applicable only when ‘-ed’ is preceded by ‘d’ or ‘t.’”

The quality of hyphenation routines is a subject close to both the Nizhnikovs’ hearts. And while the glaring howlers they encounter in other hyphenators’ handiwork (“Cambr-idge” and “m-ore,” for example) evoke a belly laugh, Sasha and Margaret reserve sterner disapproval for those over-cautious, “conservative” programs that insert fewer hyphens than the dictionaries condone.

“While it’s a graver error to insert a bad hyphen than leave out a good one, some hyphenating routines are so conservative, giving you lines that are so ‘loose,’ that they practically defeat the whole object of the exercise,” says Margaret. “It’s much easier to develop an accurate program which is conservative than one which is thorough.”

The Nizhnikovs have been successfully marketing their own set of hyphenation routines, called Dashes, to OEMs for some five years now.
And they recently decided to put the product directly onto the market in the form of a Desk Accessory for the Mac, Dashes DA.

REFUSENIKS

Sasha Nizhnikov was born in Moscow in 1962. His family, practicing Jews, lived for four years as Soviet refuseniks until 1978, when Sasha was 16 and they were allowed to emigrate. The Nizhnikovs eventually settled in the United States, where Sasha found work as a computer programmer and continued his education, studying at Boston University and later MIT.

Margaret Nizhnikov was born in Boulder, Colorado, to Norwegian parents, and grew up in the Rockies and Trondheim, Norway. She received a BA in German and math from Colorado State University and later became a doctoral candidate in linguistics at MIT. According to Sasha, she speaks Russian “like a native.”

The couple first met in the early 80s, when both were working at Agfa’s subsidiary CompuGraphics, a market leader in computer-based typographic systems, then retailing with hefty price tags between US$20,000 and $100,000. At the time, the company was planning typographic software for the Lisa computer, Apple’s ill-fated forebear to the Mac. Sasha was working full-time as a programmer and studying part-time; Margaret had been hired for the summer to modify the software’s hyphenation routines with a view to reducing its memory requirements.

“The CompuGraphics hyphenation routines weren’t the best,” says Sasha. “They inserted hyphenation points in only 65% of possible positions. And while the algorithm had a 92% success rate, it was dependent on a large exception dictionary, which bloated its appetite for memory.”

“The routines couldn’t really be improved,” says Margaret, “since they were arrived at by trial and error, using a large map which their programmers tweaked.” The company didn’t want to start from scratch and therefore spent more than seven years trying to improve its existing hyphenation routines.
Believing they could go one better, the Nizhnikovs decided to write their own. They founded Circle Noetic Services in the summer of 1984, moving to New Hampshire in 1987.

MULTILINGUAL

The first version of Dashes, for English and running on a VAX, was ready in the winter of 1984, and was ported shortly thereafter to Sun workstations and other machines. The first Dashes licensee was Manhattan Graphics, then developing the ReadySetGo dtp package.

Sasha enjoys telling the tale of how Manhattan Graphics president Ken Abbott flew up to Boston to meet the two then-MIT students and see a demo of Dashes. The first word Abbott keyed in was hyphenated wrongly. And so were the second, third… and the fourth. Finally, the fifth word, “Cambridge,” was hyphenated correctly: “Cam-bridge.” Despite the shaky start, Abbott, himself an Oxford PhD in physics, bought the Dashes license, presumably driven by a blind faith in the virtues of academic research and a hunch that the Nizhnikovs’ basic approach was sound. That was in 1984. Later, Manhattan Graphics was to commission Sasha to do a total rewrite of ReadySetGo, which became Version 3 of the program, as well as additional code for the subsequent Version 4.

Not surprisingly, the Nizhnikovs had multilingual ambitions up their polyglot sleeves right from the start. In the summer of 1985, they began adding other languages to Dashes: the Romance group, done one at a time, took “just a month,” and Finnish just two weeks. Along with English, Dashes currently hyphenates: French, Italian, Portuguese, and Spanish of the Romance languages; Dutch, German, Danish, Icelandic, Norwegian, and Swedish of the Germanic; plus Croatian, Finnish, Greek, Russian, and Turkish.

In addition to inserting so-called discretionary hyphens (usually ASCII 31) into words, Dashes also ranks hyphenation points. That is, it assigns each point a value from 0 to 4, indicating its level of desirability.
For example:

an-3ti-1dis-1es-2tab-2lish-1men-2tar-1i-3an-1ism

or a German word:

Mut-2ter-0spra-2che

A “Dashes-aware” program will give precedence to the higher-ranking hyphenation points; other programs will ignore the hyphen ranking.

THE LINGUIST

At Circle Noetics, while Sasha writes the programs in C, Margaret Nizhnikov supplies the linguistic muscle. Just what does the linguist bring to the programmer? “A facility for recognizing linguistic patterns,” Margaret is quick to assert. “Linguistic patterns, such as those in word formation, provide the basis for algorithms.”

Sasha digs out a book, “The Normal and Reversed English Word List” by A. F. Brown, a daunting three-volume set of English words listed by suffix, which his wife has pored through many times. For other languages, they bought hyphenated wordlists which they could also use for their multilingual spelling checker, PassWord.

The bottom line in hyphenation is memory. The essential question is: how little code can hyphenate how many words? “The CompuGraphics system we were first familiar with was 80 Kb in size and could only do English,” Margaret points out. “Ours is 77 Kb and can do 15 languages.”

“The battle boils down to size and speed versus accuracy,” says Sasha. “Systems that don’t use algorithms require too much space, and the disk access slows things down. Algorithms, if implemented correctly, are inherently smaller and faster.”

The Nizhnikovs believe they have thrived because of their early and active involvement in multilingual software. They claim it’s a bigger market than most people realize, and they are convinced that most people in the United States don’t take it seriously yet. The changes in the political climate in Eastern Europe present yet more opportunities for the Nizhnikovs, and they are eager to get to work on Polish to add to the next version of Dashes.

Other linguistic software currently under development includes a multilingual database to serve as a basis for further projects.
Entry words and phrases will be accompanied by semantic, syntactic, morphological, and pronunciation information. And Circle Noetics is developing specialized access routines to make searches on a linguistic basis (for instance, “retrieve all nouns with Latin suffixes”). “A database like that would have saved hours when we were developing the hyphenation routines,” says Margaret.

With hyphenation routines for 15 languages under the Nizhnikovs’ collective belt, which is the toughest language to develop hyphenation routines for? Margaret: “English is still the hardest. Its suffix boundaries are erratic: ‘clarinet-ist,’ ‘systema-tist.’ And its syllables are hard to find: ‘geol-o-gy,’ ‘geo-log-ical.’”
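
For readers curious how a “Dashes-aware” program might consume the ranked notation quoted in this article (for instance “Mut-2ter-0spra-2che”), here is a minimal Python sketch. It is not Dashes code, and the function names `parse_ranked`, `with_soft_hyphens`, and `best_break` are invented for illustration. It assumes that each hyphen is followed by a single digit giving the rank of that break. It also assumes, judging only from the two printed examples (the compound boundary in “Mutter-sprache” carries a 0), that a lower number marks a more desirable break; the article does not state the direction explicitly.

```python
import re


def parse_ranked(marked):
    """Split a string like 'Mut-2ter-0spra-2che' into the plain word
    and a list of (offset, rank) pairs, one per hyphenation point."""
    parts = re.split(r"-(\d)", marked)      # syllables and ranks alternate
    syllables = parts[0::2]
    ranks = [int(r) for r in parts[1::2]]
    word = "".join(syllables)
    points = []
    offset = 0
    for syllable, rank in zip(syllables, ranks):
        offset += len(syllable)             # break falls after this syllable
        points.append((offset, rank))
    return word, points


def with_soft_hyphens(word, points, soft="\x1f"):
    """Insert a discretionary hyphen (ASCII 31 by default) at every point."""
    pieces = []
    start = 0
    for offset, _rank in points:
        pieces.append(word[start:offset])
        start = offset
    pieces.append(word[start:])
    return soft.join(pieces)


def best_break(points, max_prefix):
    """Choose the break whose prefix fits in max_prefix characters.
    ASSUMPTION: a lower rank value is more desirable; ties go to the
    longer prefix. Returns the offset, or None if nothing fits."""
    fitting = [(rank, -offset) for offset, rank in points if offset <= max_prefix]
    if not fitting:
        return None
    _rank, negative_offset = min(fitting)
    return -negative_offset


word, points = parse_ranked("Mut-2ter-0spra-2che")
cut = best_break(points, 12)                # room for 12 characters on the line
print(word[:cut] + "-", word[cut:])
```

Under these assumptions the sketch breaks the German word at its compound boundary whenever the line has room for it, and falls back to a lower-priority point only when it does not. A program that ignores the ranking would simply treat every point returned by `parse_ranked` as equal.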