Steve Kass

My favorite book is Peter Ladefoged’s Vowels and Consonants, which is fitting for The Dessoff Choirs’ (self-appointed) pronunciation guru. As part of that job, I prepare International Phonetic Alphabet (IPA) transliterations of our concert music, at least when we’re singing in a language I know something about. It’s a tedious task, but lately less so, thanks to the workflow system I recently cobbled together for our November concert of French choral music.

Goal: a database of French words and their IPA pronunciations.
French is largely phonetic, so at first I considered creating a rule-based system to construct words’ approximate transliterations. The prospect became more and more complicated to imagine, and this led me to look for a downloadable lexicon that already included IPA (either the output of someone else’s rule-based system or the result of digitizing an existing dictionary).

Dictionaries aplenty, most of them too “user-friendly.”
There’s no shortage of good online dictionaries, but the ones I looked at were distinctly unhelpful. To begin with, only some of them contain IPA, and most of them are accessible only through a type-and-click web interface. It might have been possible to automate the web interaction and turn my source texts into a sequence of HTTP requests, but my programming skills in that area are badly dated. Back when the web was a collection of static HTML pages, I’d jury-rig something with wget and sed. Nowadays, the web is sophisticated. You don’t just go to a URL and get back a plain HTML document or file. A lot of what appears in your browser window requires client-side execution of JavaScript or similar nonsense. Forget about using wget in such situations. (Similar situations have frustrated me before. Someone will have kindly assembled just the data I need, and will have kindly made it available, but only via a browser form for single-item retrieval.)

Third download’s a charm.
Eventually, I found some hopeful downloads. The first two, a file for OpenOffice spellcheck, and a dictionary for WinEDT, didn’t fit the bill, but the third, Ralf’s French dictionary, did. I don’t know who Ralf is, nor do I know who’s behind the testing simon blog, where Google Search led me to discover Ralf’s dictionary. (Simon is apparently a speech recognition system, which explains the connection to dictionaries with IPA.) Ralf’s dictionary contains hundreds of thousands of French words (lexemes) with their textual representations (graphemes, like you’re reading here) and IPA equivalents (phonemes).

Ralf’s dictionary is not a dictionary.
For nearly 25 years, my go-to dictionary for French pronunciation has been a 1980 Hachette. It provides IPA for each of its over 50,000 entries. But like most dictionaries, well, it’s a dictionary, not a lexicon. It’s full of definitions — and that’s the point. “Ralf’s dictionary” is a lexicon that happily includes IPA. The big difference for me, today, is that a complete lexicon like Ralf’s contains all the words people utter (or sing), many of which (especially verbs in the case of French) are not dictionary “words,” but are inflected forms of dictionary words. You can find parler in Hachette (on page 1137), right between parlementer and parleur, and you can find it in Ralf’s (at position 259506), also between parlementer and parleur, but in Ralf’s, it’s not right between. After parlementer and before parler in Ralf’s you’ll find (though turning data pages creates no wonderful musty book smell) parlementera, parlementerai, parlementeraient, …, parlements, parlementâmes, …, parlementé, parlementée, and parlementées. And all with IPA.

Ok, so dussé is missing. But eut is not.
For years, I was never quite sure how to pronounce some inflected verb forms in French. Was the pronunciation of eut (not an entry in Hachette) the same as for eu (which is listed), or does it rhyme with peut? Not that I have occasion to speak eut often, but I’ve had occasion to sing it (in d’Indy’s delightful Madrigal, for example, which Dessoff will be singing in a choral arrangement this November). Sure, I could have asked someone, but that would mean having to ask someone. According to Ralf, the answer is yes. Both eut and eu are pronounced [y]. Ralf could be wrong (he often is — I’ll get to that later, though he doesn’t appear to be in this case), but the pronunciation of eut is a valuable fact, and he recognizes that.

Click here to see YouTube’s divoboy perform d’Indy’s Madrigal (with outstanding French diction save for the incorrect pronunciation of eut, because it probably wasn’t in his dictionary).

One of the weirder French verb forms I do know how to pronounce is dussé, as in “Je vais faire cela, dussé-je le regretter ensuite.” By itself, dussé isn’t really a word, but when dusse (the first person imperfect subjunctive form of devoir) and various other verb forms ending in a mute e appear in inversion with their pronominal subject, the spelling changes: e becomes é. Despite the accent aigu, however, dussé-je is pronounced [dysɛʒ], not [dyseʒ]. For better or for worse, by the way, the days of dussé-je may be numbered. In its controversial 1990 “rectifications,” France’s Superior Council of the French Language (only in France, you may think, but also in Belgium and Canada) declared the correct spelling to henceforth be dussè-je. That makes a lot of sense, but of course this is the organization that in the same proclamation tried to change the official spelling of oignon to ognon. As you can imagine, that didn’t go over very well, so we’ll see if dussè-je sticks. You can read more about dussé-je/dussè-je here, which is where I copped the sample sentence above.

Ok, I’ll say it: XML is not evil.
Ralf’s dictionary is an XML file. I’ll admit it, I’ve got issues with XML, or more specifically with people who think XML is a database format, but Ralf used it wisely, as a self-documenting container for data exchange. CSV would have been fine, too, but XML was a better idea here, because the Unicode characters that represent IPA don’t always survive being shuttled around in less standardized text files.

Import time.
Each lexeme in Ralf’s dictionary was associated with a phoneme (the IPA I wanted), a grapheme (the lexeme written down) and sometimes a role (abbreviation, letter, name, or verb). The IPA in Ralf’s dictionary was for speech, and I ultimately needed slightly different pronunciations for singing, so I imported Ralf’s data into a table with an extra phoneme column that contained the changes I wanted.

My database platform of choice, as always, is Microsoft SQL Server. With a lot more trial and error than I’d have needed to import from CSV or various other formats, I finally managed to make XQuery happy. Here’s my import query.

-- Shred the XML column x of table FD into one row per <lexeme> element.
WITH Imported(Item,Role,Grapheme,Phoneme) AS (
  SELECT 
    T1.lexeme.query('.'),
    T1.lexeme.value('./@role','nvarchar(100)') as Role,
    T1.lexeme.value('grapheme[1]','nvarchar(100)') as Grapheme,
    T1.lexeme.value('phoneme[1]','nvarchar(100)') as Phoneme
  FROM FD
  CROSS APPLY x.nodes('/lexicon/lexeme') AS T1(lexeme)
)
  INSERT INTO FrenchIPA
  SELECT 
    Item,
    Role,
    Grapheme,
    Phoneme,
    -- Extra phoneme column: the spoken IPA with substitutions for singing.
    replace(replace(
      Phoneme,N'?',N'?'
      ),N'??',N'o?'
    )
    as Phoneme2
  FROM Imported;
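
For readers who don't have SQL Server handy, here is a rough Python sketch of the same shredding step. It is not the query I ran. It assumes only the structure the XQuery paths above imply (a lexicon root holding lexeme elements, each with grapheme and phoneme children and an optional role attribute, no XML namespace), and the two-entry sample, the name read_lexicon, and the role value shown are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry sample in the shape the XQuery paths imply.
SAMPLE = """<lexicon>
  <lexeme><grapheme>eu</grapheme><phoneme>y</phoneme></lexeme>
  <lexeme role="verb"><grapheme>eut</grapheme><phoneme>y</phoneme></lexeme>
</lexicon>"""


def read_lexicon(xml_text):
    """Return one (role, grapheme, phoneme) tuple per <lexeme> element."""
    root = ET.fromstring(xml_text)  # root is the <lexicon> element
    rows = []
    for lexeme in root.findall("lexeme"):
        rows.append((
            lexeme.get("role"),           # None when the attribute is absent
            lexeme.findtext("grapheme"),  # text of the first <grapheme> child
            lexeme.findtext("phoneme"),   # text of the first <phoneme> child
        ))
    return rows


rows = read_lexicon(SAMPLE)
for row in rows:
    print(row)
```

If the real file declares an XML namespace, the element names passed to findall and findtext would need the namespace prefix added.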

Replacing graphemes with phonemes.

The source texts I had were just that — texts, text strings. In order to use the table FrenchIPA, I had to identify the individual words in my texts. While in theory, that’s harder than writing the right XQuery for import, it’s something I’ve done a gazillion times and helped other people do a gazillion times. One version of a query for this has been on my Drew web page for years. Cobble, cobble, cobble, and out comes this clumsy, kludgy, clunky, but effective query I used to make a first pass at word-for-word transliteration (replacing each word in the input string variable @txt with its associated phoneme).

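with Puncts(n1,n2) as (
  -- For each position n1 in @txt, n2 is the position of the first
  -- non-letter character at or after n1.
  select
    n as n1,
    (select min(n) from Nums as N2
     where N2.n <= len(@txt) and N2.n >= N1.n
     and substring(@txt,N2.n,1) not like '%[a-z]%' collate Latin1_General_CI_AS
    ) as n2
  from dbo.Nums as N1
  where n <= len(@txt)
), Wds(st,fn,w) as (
  -- Positions that share an n2 are the letters of one word plus the
  -- non-letter that ends it: the smallest n1 is the word's start.
  select
    min(n1), n2,
    substring(@txt,min(n1),n2-min(n1)) as wd
  from Puncts
  group by n2
), Reps(i,st,fn,w,Grapheme,IPA) as (
  -- Look up each word, numbering the matches from the last word to the first.
  select row_number() over (order by st desc), st, fn, w, Grapheme, P2
  from Wds join FrenchIPA
  on lower(w) = Grapheme
), Result(i,r) as (
  -- Start from the original text and apply one replacement per step.
  -- Working from the end backward keeps the earlier positions valid.
  select cast(0 as bigint),@txt
  union all
  select
    Reps.i, stuff(r,st,fn-st,IPA)
  from Reps join Result
  on Reps.i = Result.i+1
)
  -- Keep the row with every replacement applied, widen the spaces,
  -- and wrap each line of text in brackets.
  select top 1 '['+replace(replace(r,' ','   '),'
',']
[')+']' from Result order by i desc
  option (MAXRECURSION 1000);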

The most kludgy part is the recursive query that replaces one word at a time with IPA. If anyone is curious about how this works, ask me.
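
In the meantime, here is a small Python sketch of the same word-for-word pass, as an illustration rather than a translation of the query. The SQL numbers the matched words from the end of the text backward, so each stuff() leaves the positions of the words still to be replaced untouched; a regular-expression substitution sidesteps that bookkeeping. The five-word IPA dictionary stands in for the FrenchIPA table, and the names WORD and transliterate are invented for this example.

```python
import re

# Hypothetical miniature lookup (grapheme -> sung IPA); the real data would
# come from the FrenchIPA table.
IPA = {"qui": "ki", "jamais": "ʒamɛ", "fut": "fy", "de": "də", "plus": "ply"}

# A word is a run of Unicode letters (accented ones included); apostrophes,
# spaces, digits, and punctuation all end a word.
WORD = re.compile(r"[^\W\d_]+")


def transliterate(text, lookup):
    """Replace each known word with its IPA and leave everything else alone."""
    def swap(match):
        word = match.group(0)
        return lookup.get(word.lower(), word)  # unknown words pass through

    lines = []
    for line in text.splitlines():
        ipa = WORD.sub(swap, line).replace(" ", "   ")  # widen the spaces
        lines.append("[" + ipa + "]")                   # bracket each line
    return "\n".join(lines)


print(transliterate("Qui jamais fut de plus charmant visage,", IPA))
```

Words missing from the lookup come through in their original spelling, which makes the gaps easy to spot during cleanup.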

Cleaning up the result.

This doesn’t produce the final transliteration, by any means, but it’s darn close. Here’s what it yields for d’Indy’s Madrigal (and which example allows me to type the word with two apostrophes yet again).

[Note: I see garbage below in Chrome; IE is ok. And unfortunately, some combination of WordPress, MySQL, Windows Live Writer, and HTML disagrees with Unicode’s combining diacritical characters, so you’ll see meandering tildes.]

[ki   ʒamɛ   fy   də   ply   ʃaɾmɑ̃   vizaʒ,]
[də   kɔl   ply   blɑ̃,   də   ʃəvœ   ply   swajœ;]
[ki   ʒamɛ   fy   də   ply   ʒɑ̃ti   koɾsaʒ,]
[ki   ʒamɛ   fy   kə   ma   dam   ɔ   du   iœ!]
[ki   ʒamɛ   y   lɛvɾ   ply   suɾiɑ̃t,]
[ki   suɾiɑ̃   ɾɑ̃di   kœɾ   ply   ʒwajœ,]
[ply   ʃast   sɛ̃   su   gimp   tɾɑ̃spaɾɑ̃t,]
[ki   ʒamɛ   y   kə   ma   dam   ɔ   du   iœ!]
[ki   ʒamɛ   y   vwa   de'œ̃   ply   du   ɑ̃tɑ̃dɾ,]
[miɲɔn   dɑ̃   ki   buʃ   ɑ̃pɛɾl   mjœ;]
[ki   ʒamɛ   fy   də   ɾəgaɾde   si   tɑ̃dɾ,]
[ki   ʒamɛ   fy   kə   ma   dam   ɔ   du   iœ!]

All that’s left is touchup, mainly.

1. Add schwas for syllables that are silent in speech, but not in song. (Spoken, Frères Jacques has two syllables; sung, it has four.)

2. Fix some mistakes in Ralf’s dictionary, like his having gotten œ and ø backwards most everywhere. (It’s debatable whether a distinction really exists anyway.)

3. Indicate where there are liaisons (and check against the music to avoid marking them across rests).
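
The second item is the only one that lends itself to a blanket first pass, and even that needs a human afterward. Here is a minimal Python sketch, assuming that every œ and ø should trade places except an œ carrying a combining tilde (the nasal vowel, which stays as it is). That exception and the name swap_rounded_vowels are my assumptions for illustration. The swap also overcorrects the words Ralf got right, such as kœɾ, so the output still has to be checked by eye.

```python
import re

# œ or ø that is NOT followed by a combining tilde (U+0303), so the nasal
# vowel œ̃ is left untouched.
ROUNDED = re.compile("[œø](?!\u0303)")
SWAP = {"œ": "ø", "ø": "œ"}


def swap_rounded_vowels(ipa):
    """Exchange œ and ø everywhere except in the nasal vowel œ̃."""
    return ROUNDED.sub(lambda m: SWAP[m.group(0)], ipa)


print(swap_rounded_vowels("ʃəvœ"))
```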

After not much additional work, this is what I got:

[ki   ʒamɛ   fy   də   ply   ʃaɾmɑ̃   vizaʒə]
[də   kɔl   ply   blɑ̃,   də   ʃəvø   ply   swajø]
[ki   ʒamɛ   fy   də   ply   ʒɑ̃ti   koɾsaʒə]
[ki   ʒamɛ   fy   kə   ma   dam‿o   duz‿jø]

[ki   ʒamɛz‿y   lɛvɾə   ply   suɾiɑ̃tə]
[ki   suɾiɑ̃   ɾɑ̃di   kœɾ   ply   ʒwajø]
[ply   ʃastə   sɛ̃   su   gɛ̃pə   tɾɑ̃spaɾɑ̃tə]
[ki   ʒamɛ   fy   kə   ma   dam‿o   duz‿jø]

[ki   ʒamɛz‿y   vwa   dœ̃   ply   duz‿ɑ̃tɑ̃dɾə]
[miɲɔnə   dɑ̃   ki   buʃ‿ɑ̃pɛɾlə   mjø]
[ki   ʒamɛ   fy   də   ɾəgaɾde   si   tɑ̃dɾə]
[ki   ʒamɛ   fy   kə   ma   dam‿o   duz‿jø]

This makes me very happy, and, despite the time I spent writing the queries, it saved me a lot of time. In fact, it probably took more time to write this post than it did to put together the IPA for this concert.

Graphemes to Phonemes Made Easy
