SQL Server – Steve Kass

Graphemes to Phonemes Made Easy

Steve — Mon, 13 Sep 2010 04:02:37 +0000

My favorite book is Peter Ladefoged’s Vowels and Consonants, which is fitting for The Dessoff Choirs’ (self-appointed) pronunciation guru. As part of that job, I prepare International Phonetic Alphabet (IPA) transliterations of our concert music, at least when we’re singing in a language I know something about. It’s a tedious task, but lately less so, thanks to the workflow system I recently cobbled together for our November concert of French choral music.

Goal: a database of French words and their IPA pronunciations.
French is largely phonetic, so at first I considered creating a rule-based system to construct words’ approximate transliterations. The prospect became more and more complicated to imagine, and this led me to look for a downloadable lexicon that already included IPA (either the output of someone else’s rule-based system or the result of digitizing an existing dictionary).

Dictionaries aplenty, most of them too “user-friendly.”
There’s no shortage of good online dictionaries, but the ones I looked at were distinctly unhelpful. Only some of them contain IPA, for one thing, and most of them are accessible only through a type-and-click web interface. It might have been possible to automate the web interaction and turn my source texts into a sequence of HTTP requests, but my programming skills in that area are badly dated. Back when the web was a collection of static HTML pages, I’d jury-rig something with wget and sed. Nowadays, the web is sophisticated. You don’t just go to a URL and get back a plain HTML document or file. A lot of what appears in your browser window requires client-side execution of JavaScript or similar nonsense. Forget about using wget in such situations. (Similar situations have frustrated me before. Someone will have kindly assembled just the data I need, and will have kindly made it available, but only via a browser form for single-item retrieval.)

Third download’s a charm.
Eventually, I found some hopeful downloads. The first two, a file for OpenOffice spellcheck, and a dictionary for WinEDT, didn’t fit the bill, but the third, Ralf’s French dictionary, did. I don’t know who Ralf is, nor do I know who’s behind the testing simon blog, where Google Search led me to discover Ralf’s dictionary. (Simon is apparently a speech recognition system, which explains the connection to dictionaries with IPA.) Ralf’s dictionary contains hundreds of thousands of French words (lexemes) with their textual representations (graphemes, like you’re reading here) and IPA equivalents (phonemes).

Ralf’s dictionary is not a dictionary.
For nearly 25 years, my go-to dictionary for French pronunciation has been a 1980 Hachette. It provides IPA for each of its over 50,000 entries. But like most dictionaries, well, it’s a dictionary, not a lexicon. It’s full of definitions — and that’s the point. “Ralf’s dictionary” is a lexicon that happily includes IPA. The big difference for me, today, is that a complete lexicon like Ralf’s contains all the words people utter (or sing), many of which (especially verbs in the case of French) are not dictionary “words,” but are inflected forms of dictionary words. You can find parler in Hachette (on page 1137), right between parlementer and parleur, and you can find it in Ralf’s (at position 259506), also between parlementer and parleur, but in Ralf’s, it’s not right between. After parlementer and before parler in Ralf’s you’ll find (though turning data pages creates no wonderful musty book smell) parlementera, parlementerai, parlementeraient, …, parlements, parlementâmes, …, parlementé, parlementée, and parlementées. And all with IPA.

Ok, so dussé is missing. But eut is not.
For years, I was never quite sure how to pronounce some inflected verb forms in French. Was the pronunciation of eut (not an entry in Hachette) the same as for eu (which is listed), or does it rhyme with peut? Not that I have occasion to speak eut often, but I’ve had occasion to sing it (in d’Indy’s delightful Madrigal, for example, which Dessoff will be singing in a choral arrangement this November). Sure, I could have asked someone, but that would mean having to ask someone. According to Ralf, the two are the same: both eut and eu are pronounced [y]. Ralf could be wrong (he often is, as I’ll get to later, though he doesn’t appear to be in this case), but the pronunciation of eut is a valuable fact, and he recognizes that.

Click here to see YouTube’s divoboy perform d’Indy’s Madrigal (with outstanding French diction save for the incorrect pronunciation of eut, because it probably wasn’t in his dictionary).

One of the weirder French verb forms I do know how to pronounce is dussé, as in “Je vais faire cela, dussé-je le regretter ensuite.” By itself, dussé isn’t really a word, but when dusse (the first person imperfect subjunctive form of devoir) or another verb form ending in a mute e appears in inversion with its pronominal subject, the spelling changes: e becomes é. Despite the accent aigu, however, dussé-je is pronounced [dysɛʒ], not [dyseʒ]. For better or for worse, by the way, the days of dussé-je may be numbered. In its controversial 1990 “rectifications,” France’s Superior Council of the French Language (only in France, you may think, but also in Belgium and Canada) declared the correct spelling to henceforth be dussè-je. That makes a lot of sense, but of course this is the organization that in the same proclamation tried to change the official spelling of oignon to ognon. As you can imagine, that didn’t go over very well, so we’ll see if dussè-je sticks. You can read more about dussé-je/dussè-je here, which is where I copped the sample sentence above.

Ok, I’ll say it: XML is not evil.
Ralf’s dictionary is an XML file. I’ll admit it, I’ve got issues with XML, or more specifically with people who think XML is a database format, but Ralf used it wisely, as a self-documenting container for data exchange. CSV would have been fine, too, but XML was a better idea here, because the Unicode characters that represent IPA don’t always survive being shuttled around in less standardized text files.

Import time.
Each lexeme in Ralf’s dictionary was associated with a phoneme (the IPA I wanted), a grapheme (the lexeme written down) and sometimes a role (abbreviation, letter, name, or verb). The IPA in Ralf’s dictionary was for speech, and I ultimately needed slightly different pronunciations for singing, so I imported Ralf’s data into a table with an extra phoneme column that contained the changes I wanted.

My database platform of choice, as always, is Microsoft SQL Server. With a lot more trial and error than I’d have needed to import from CSV or various other formats, I finally managed to make XQuery happy. Here’s my import query.

WITH Imported(Item,Role,Grapheme,Phoneme) AS (
  SELECT 
    T1.lexeme.query('.'),
    T1.lexeme.value('./@role','nvarchar(100)') as Role,
    T1.lexeme.value('grapheme[1]','nvarchar(100)') as Grapheme,
    T1.lexeme.value('phoneme[1]','nvarchar(100)') as Phoneme
  FROM FD
  CROSS APPLY x.nodes('/lexicon/lexeme') AS T1(lexeme)
)
  INSERT INTO FrenchIPA
  SELECT 
    Item,
    Role,
    Grapheme,
    Phoneme,
    replace(replace(
      Phoneme,N'?',N'?'
      ),N'??',N'o?'
    )
    as Phoneme2
  FROM Imported;
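
For readers without SQL Server at hand, here is a rough Python sketch of the same shredding step. The two-entry sample document is made up for illustration; its element and attribute names are assumptions taken from the paths in the query above (/lexicon/lexeme, @role, grapheme, phoneme), and the real file may differ in details such as namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the XQuery paths above; not Ralf's actual data file.
SAMPLE = """<lexicon>
  <lexeme role="verb"><grapheme>eut</grapheme><phoneme>y</phoneme></lexeme>
  <lexeme><grapheme>eu</grapheme><phoneme>y</phoneme></lexeme>
</lexicon>"""

def shred(xml_text):
    """Return one (role, grapheme, phoneme) tuple per lexeme element."""
    root = ET.fromstring(xml_text)
    return [
        (lexeme.get("role"),            # None when the attribute is absent
         lexeme.findtext("grapheme"),
         lexeme.findtext("phoneme"))
        for lexeme in root.findall("lexeme")
    ]

rows = shred(SAMPLE)
for row in rows:
    print(row)
```

As in the SQL version, a lexeme without a role simply yields a null (None) in that column.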

Replacing graphemes with phonemes.

The source texts I had were just that — texts, text strings. In order to use the table FrenchIPA, I had to identify the individual words in my texts. While in theory, that’s harder than writing the right XQuery for import, it’s something I’ve done a gazillion times and helped other people do a gazillion times. One version of a query for this has been on my Drew web page for years. Cobble, cobble, cobble, and out comes this clumsy, kludgy, clunky, but effective query I used to make a first pass at word-for-word transliteration (replacing each word in the input string variable @txt with its associated phoneme).

with Puncts(n1,n2) as (
  select
    n as n1,
    (select min(n) from Nums as N2
     where N2.n <= len(@txt) and N2.n >= N1.n
     and substring(@txt,N2.n,1) not like '%[a-z]%' collate Latin1_General_CI_AS
    ) as n2
  from dbo.Nums as N1
  where n <= len(@txt)
), Wds(st,fn,w) as (
  select
    min(n1), n2,
    substring(@txt,min(n1),n2-min(n1)) as wd
  from Puncts
  group by n2
), Reps(i,st,fn,w,Grapheme,IPA) as (
  select row_number() over (order by st desc), st, fn, w, Grapheme, P2
  from Wds join FrenchIPA
  on lower(w) = Grapheme
), Result(i,r) as (
  select cast(0 as bigint),@txt
  union all
  select
    Reps.i, stuff(r,st,fn-st,IPA)
  from Reps join Result
  on Reps.i = Result.i+1
)
  select top 1 '['+replace(replace(r,' ','   '),'
',']
[')+']' from Result order by i desc
  option (MAXRECURSION 1000);

The most kludgy part is the recursive query that replaces one word at a time with IPA. If anyone is curious about how this works, ask me.
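
Since the question is likely to come up, here is the idea in a few lines of Python rather than recursive SQL. This is only a sketch of the technique, not a port of the query: the three-word lexicon is a hand-typed stand-in for the FrenchIPA table, and the function name is my own. The point it illustrates is why the replacements are numbered from the end of the text (order by st desc): swapping in IPA from right to left means the start and end offsets of the words still waiting to be replaced never shift.

```python
import re

# Stand-in for the FrenchIPA table (assumption: three entries typed by hand).
LEXICON = {"qui": "ki", "jamais": "ʒamɛ", "fut": "fy"}

def transliterate(text, ipa):
    """Replace each known word with its IPA, working from the last word to the first."""
    # (start, end) offsets of every run of letters in the original text.
    spans = [(m.start(), m.end()) for m in re.finditer(r"[^\W\d_]+", text)]
    result = text
    for start, end in reversed(spans):
        word = text[start:end].lower()
        if word in ipa:
            # Everything changed so far lies to the right of `end`,
            # so these offsets are still valid in `result`.
            result = result[:start] + ipa[word] + result[end:]
    return result

print(transliterate("Qui jamais fut de plus", LEXICON))
```

Words missing from the lexicon are left as they were, which is also what the inner join in the query does.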

Cleaning up the result.

This doesn’t produce the final transliteration, by any means, but it’s darn close. Here’s what it yields for d’Indy’s Madrigal (an example that lets me type the word with two apostrophes yet again).

[Note: I see garbage below in Chrome; IE is ok. And unfortunately, some combination of WordPress, MySQL, Windows Live Writer, and HTML disagrees with Unicode’s combining diacritical characters, so you’ll see meandering tildes.]

[ki   ʒamɛ   fy   də   ply   ʃaɾmɑ̃   vizaʒ,]
[də   kɔl   ply   blɑ̃,   də   ʃəvœ   ply   swajœ;]
[ki   ʒamɛ   fy   də   ply   ʒɑ̃ti   koɾsaʒ,]
[ki   ʒamɛ   fy   kə   ma   dam   ɔ   du   iœ!]
[ki   ʒamɛ   y   lɛvɾ   ply   suɾiɑ̃t,]
[ki   suɾiɑ̃   ɾɑ̃di   kœɾ   ply   ʒwajœ,]
[ply   ʃast   sɛ̃   su   gimp   tɾɑ̃spaɾɑ̃t,]
[ki   ʒamɛ   y   kə   ma   dam   ɔ   du   iœ!]
[ki   ʒamɛ   y   vwa   de'œ̃   ply   du   ɑ̃tɑ̃dɾ,]
[miɲɔn   dɑ̃   ki   buʃ   ɑ̃pɛɾl   mjœ;]
[ki   ʒamɛ   fy   də   ɾəgaɾde   si   tɑ̃dɾ,]
[ki   ʒamɛ   fy   kə   ma   dam   ɔ   du   iœ!]

All that’s left is touchup, mainly.

1. Add schwas for syllables that are silent in speech, but not in song. (Spoken, Frères Jacques has two syllables; sung, it has four.)

2. Fix some mistakes in Ralf’s dictionary, like his having gotten œ and ø backwards most everywhere. (It’s debatable whether a distinction really exists anyway.)

3. Indicate where there are liaisons (and check against the music to avoid marking them across rests).

After not much additional work, this is what I got:

[ki   ʒamɛ   fy   də   ply   ʃaɾmɑ̃   vizaʒə]
[də   kɔl   ply   blɑ̃,   də   ʃəvø   ply   swajø]
[ki   ʒamɛ   fy   də   ply   ʒɑ̃ti   koɾsaʒə]
[ki   ʒamɛ   fy   kə   ma   damo   duzjø]

[ki   ʒamɛzy   lɛvɾə   ply   suɾiɑ̃tə]
[ki   suɾiɑ̃   ɾɑ̃di   kœɾ   ply   ʒwajø]
[ply   ʃastə   sɛ̃   su   gɛ̃pə   tɾɑ̃spaɾɑ̃tə]
[ki   ʒamɛ   fy   kə   ma   damo   duzjø]

[ki   ʒamɛzy   vwa   dœ̃   ply   duzɑ̃tɑ̃dɾə]
[miɲɔnə   dɑ̃   ki   buʃɑ̃pɛɾlə   mjø]
[ki   ʒamɛ   fy   də   ɾəgaɾde   si   tɑ̃dɾə]
[ki   ʒamɛ   fy   kə   ma   damo   duzjø]

This makes me very happy, and, despite the time I spent writing the queries, it saved me a lot of time. In fact, it probably took more time to write this post than it did to put together the IPA for this concert.

Localization (probably) strikes again

Steve — Fri, 27 Nov 2009 02:17:18 +0000

Yesterday, the Italian postal service misprocessed a bunch of ATM and credit card transactions. Specifically, the virgola was shifted two places, appending two zeros to the transaction amount. There’s no telling exactly how this happened, but it wouldn’t surprise me if it had something—if not everything—to do with localization in one way or another. In Italy, a comma (virgola), not a period, precedes a number’s decimal part, but software might see things otherwise.

Some software interprets number strings according to the operating system localization (unless overridden). Other software ignores the OS localization. SQL Server’s CAST operator, for example, only accepts a period as the decimal separator, and it disregards commas in strings intended to represent numbers.

At least it does this as of 2005; previous versions followed a complicated set of rules in an attempt to disallow numbers that weren’t valid in the U.S., India, or China. In India (ones, thousands, lakhs, crore, thousand crore, lakhs crore, etc.), digit groups bounce between two and three digits, and 1,234,56,70,000.0 is a valid number. In China (yi1, wan4, yi4, wan4 yi4, etc.), it would be 123,4567,0000.0. Interpreting human-readable representations of numbers is no simple task. Explaining the issue isn’t much easier.

In all versions of SQL Server, the following cast gives the same result regardless of language or culture settings.

select cast('115,00' as money) as TooMuch;

TooMuch
---------------------
11500.00
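
To make the failure mode concrete, here is a small Python sketch that contrasts the two readings of the same string. It models only the one behavior shown above (commas dropped, period as the decimal point) and is not an emulation of CAST's full rules; both function names are made up for the example.

```python
from decimal import Decimal

def parse_ignoring_commas(s):
    """Drop commas and treat '.' as the decimal point, as in the cast above."""
    return Decimal(s.replace(",", ""))

def parse_italian(s):
    """Treat '.' as a digit-group separator and ',' as the decimal mark."""
    return Decimal(s.replace(".", "").replace(",", "."))

print(parse_ignoring_commas("115,00"))  # 11500
print(parse_italian("115,00"))          # 115.00
```

The same six characters come out a hundred times larger under the first reading, which is exactly the size of the error in the postal service story.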

[From Slashdot, noting ilsole24ore.com]

9/11 pager intercepts on Wikileaks

Steve — Thu, 26 Nov 2009 04:56:13 +0000

Early this morning, Wikileaks began posting alphanumeric pager messages from four carriers (Arch, Metrocall, Skytel, and Weblink_B) that were intercepted during a 24-hour period beginning early on September 11, 2001. Alphanumeric pager messages are unencrypted, and, like communications over a public 802.11 wireless network, they’re skimmable with the right (and not exotic) software and hardware.

“Due to today’s tragic events, it makes sense to cut back wherever feasible on payroll. Expect a very light business day. Please call all stores and review payroll issues”
“RING ALL CHICAGO AIPORTS AND EVERY MAJOR BUILDING DOWNTOWN. BUSH IS DOING A SPEECH. THIS IS SERIOUS POOH..”
“Holy crap, are you watching the news.”
“I hope you have gone home by now. The BoA tower and space needle here are closed. I suspect tall buildings across the country will be closed. Take care my love.-cb”

This might be the most interesting public data mine since the AOL breach. The total volume is far less, but unlike the AOL data, this data hasn’t been anonymized. There are full names, phone numbers, and other identifying information in the mix.

Buy my book (from Barnes and Noble)

Steve — Mon, 13 Apr 2009 22:52:13 +0000

If you squint, you’ll see my name in tiny print under Itzik’s. He wrote most of the book, but I contributed two chapters and did most of the technical review. Click on the image to visit the book’s Barnes and Noble page.

Read this if you serve up web pages from SQL data

Steve — Sat, 31 May 2008 00:07:27 +0000

If you manage, write, visit, or otherwise have anything to do with a web app that connects to a SQL Server database, good guy and Microsoft Program Manager Buck Woody wants you to read this:

[copied with permission from here]

You might have read recently that there have been ongoing SQL injection attacks against vulnerable web applications occurring over the last few months. These attacks have received recurring attention in the press as they pop up in various geographies around the world. These attacks do not leverage any SQL Server vulnerabilities or any un-patched vulnerabilities in any Microsoft product – the attack vector is vulnerable custom applications. In fact, SQL Injection is a coding issue that can attack any database system, so it’s a good idea to learn how to defend against them.
In order to help you respond to and defend yourself from these attacks, Microsoft has an authoritative blog including talking points and guidance. You can find this at this Technet location. (Retype the underlying URL if you like. I only linked it this way because it wrapped.)

Ok, if you didn’t visit the Technet link, visit it before reading on.

Thanks. Now I’ll add another bit of advice:

There’s a non-SQL injection issue here as well. The risk in question starts when a web application incorporates part of the URL into SQL and executes it blindly (SQL injection), but the risk to end users only occurs because the web app commits “HTML injection.” The web app unwittingly delivers a malicious bit of HTML that says “Hey browser, please run a script from this other web site.” That malicious bit of HTML won’t be sent to my browser if the web application doesn’t blindly incorporate table data (especially table data containing HTML tags) into the HTML pages it delivers.

Here’s an analogy. When you fill a prescription, you get instructions like “Take one pill twice a day for seven days.” Those instructions probably get printed out of some database. If the instructions say “Chew up all the pills and wash them down with a cup of bleach,” something’s wrong with the pharmacy’s database. Something’s also wrong with the pharmacy for not catching the bogus instructions before dispensing the prescription. And if you follow the instructions, something’s wrong with you.

The risk Buck is drawing our attention to is like this, and the Technet blog tells us to secure our database. Just as importantly, we should pay attention to what we dispense, and not just assume that if we’re dispensing our data, it’s good data. Browsers often render (and in the case of scripts, execute) whatever a trusted site sends them, and if trusted sites send HTML out without vetting it, well, they shouldn’t be trusted. If you’re a web developer and you want your site to be trusted, then vet what you deliver.

I don’t do web apps, but I don’t think a responsible web app should send me script tags that refer to third-party sites. In fact, the web app probably shouldn’t send me any table data without scrubbing it for tags, non-printing ASCII characters, etc.
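
As a rough illustration of that kind of outgoing scrub, here is a minimal Python sketch using the standard library's html.escape. The function name and the tainted sample value are invented for the example, and a real application would more likely lean on its templating layer's escaping than on a hand-rolled helper like this one.

```python
import html

def scrub_for_html(value):
    """Drop non-printing ASCII characters, then escape HTML metacharacters."""
    printable = "".join(
        ch for ch in value
        if ch in "\t\n\r" or (ord(ch) >= 32 and ord(ch) != 127)
    )
    return html.escape(printable)   # escapes & < > " and '

# Hypothetical table value that an attacker has tampered with.
tainted = 'Widget<script src="http://example.invalid/x.js"></script>\x07'
print(scrub_for_html(tainted))
```

The browser then displays the tag as text instead of fetching and running the script, and the BEL character never leaves the server.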

Many years ago, we thought it was funny to email people BEL characters, and then someone figured out email shouldn’t be allowed to contain BEL. Years ago bulletin boards figured out they shouldn’t allow users to put any old HTML into their posts. The threat then was still minor – jokers figured out they could mess up some bulletin board formatting by posting opening tags without closing them. Apparently this was only half fixed. Web apps typically scrub what comes in through the expected channels, but a lot of web apps (most?) apparently don’t scrub the HTML they send out. They should. In fact, they must, now that the bad guys have figured out how to exploit sloppy web apps to modify table data bypassing the expected route. The bad guys may soon find some more sloppy code and exploit it to mess with your data.

Just as it’s possible to scrub outgoing email for viruses, it should be possible (and routine) to scrub outgoing HTML for malicious content. While I don’t trust email attachments that have a “no viruses” sticker on them, and I wouldn’t trust a random site that tells me “this web page is safe,” I would trust Microsoft or another trustworthy source if they told me their web servers scrub all outgoing web pages for unexpected script tags.

Spearman’s rho for SQL Server

Steve — Sat, 29 Mar 2008 06:33:51 +0000

Before SQL Server 2005 was released, a calculation that required a ranking was both relatively difficult to express as a single query and relatively inefficient to execute. That changed in SQL Server 2005 with support for the SQL analytic functions RANK(), ROW_NUMBER(), etc., and partial support for SQL’s OVER clause.

Spearman’s rho (Spearman’s correlation coefficient) is a useful statistic that can be calculated more easily in SQL Server 2005 than in earlier versions. Below is an implementation of Spearman’s rho for SQL Server 2005 and later.

SQL’s RANK() and the rank order required for the calculation of Spearman’s rho are slightly different: if, for example, four values are tied for third place, RANK() will equal 3 for all four of them. Spearman’s formula requires them all to be ranked 4.5, the average of their positions (3rd, 4th, 5th, and 6th) in an ordered list of the data. To address this difference, the code below adjusts the SQL RANK() by adding to it 0.5 for each occurrence of a data value beyond the first. I used COUNT(*) with an OVER clause for this.

The script below demonstrates the calculation for two data sets. The first one is from Wikipedia’s page on Spearman’s rho; I made up the second data set to include duplicate data values. I haven’t tested the code thoroughly, but for a variety of small test data sets, it matches hand calculations and the result here [1].

create table SampleData (
ID int identity(1,1) primary key,
x decimal(5,2),
y decimal(5,2)
);

insert into SampleData(x,y) values(106,7);
insert into SampleData(x,y) values(86,0);
insert into SampleData(x,y) values(100,27);
insert into SampleData(x,y) values(101,50);
insert into SampleData(x,y) values(99,28);
insert into SampleData(x,y) values(103,29);
insert into SampleData(x,y) values(97,20);
insert into SampleData(x,y) values(113,12);
insert into SampleData(x,y) values(112,6);
insert into SampleData(x,y) values(110,17);
go

create procedure Spearman as
with RankedSampleData(ID,x,y,rk_x,rk_y) as (
  select
    ID,
    x,
    y,
    -- rank() gives tied values their lowest position;
    -- adding (number of ties - 1)/2 turns it into the average position
    rank() over (order by x) +
      (count(*) over (partition by x) - 1)/2.0,
    rank() over (order by y) +
      (count(*) over (partition by y) - 1)/2.0
  from SampleData
)
select
  1e0 -
  (
    6
    *sum(square(rk_x-rk_y))
    /count(*)
    /(square(count(*)) - 1)
  )
from RankedSampleData;
go

exec Spearman;

go
truncate table SampleData;
go

insert into SampleData(x,y) values(1,3);
insert into SampleData(x,y) values(3,5);
insert into SampleData(x,y) values(5,8);
insert into SampleData(x,y) values(3,4);
insert into SampleData(x,y) values(4,7);
insert into SampleData(x,y) values(4,6);
insert into SampleData(x,y) values(3,4);
go

exec Spearman;
go

drop proc Spearman;
drop table SampleData;
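
As a cross-check outside SQL Server, here is a short Python sketch of the same calculation. It is an independent re-implementation written for this purpose, not a translation of the procedure; the function names are my own. The rank of a value is computed as one plus the number of smaller values, plus half the number of additional ties, which is the same adjustment the procedure makes to RANK().

```python
def average_ranks(values):
    """Rank each value, giving tied values the average of their positions."""
    ranks = []
    for v in values:
        smaller = sum(1 for w in values if w < v)
        ties = sum(1 for w in values if w == v)
        ranks.append(smaller + 1 + (ties - 1) / 2.0)
    return ranks

def spearman_rho(xs, ys):
    """1 - 6*sum(d^2) / (n*(n^2 - 1)), where d is the difference in ranks."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(xs), average_ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# The two data sets from the script above.
x1 = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
y1 = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]
x2 = [1, 3, 5, 3, 4, 4, 3]
y2 = [3, 5, 8, 4, 7, 6, 4]

print(round(spearman_rho(x1, y1), 4))  # -0.1758
print(round(spearman_rho(x2, y2), 4))  # 0.9643
```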

[1] Wessa, P. (2008), Free Statistics Software, Office for Research Development and Education, version 1.1.22-r4, URL http://www.wessa.net/

Elapsed time excluding nights and weekends

Steve — Wed, 19 Dec 2007 20:39:54 +0000

Finding elapsed time in SQL Server is easy, so long as the clock is always running: just use DATEDIFF. But you often need to find elapsed time excluding certain periods, like weekends, nights, or holidays. A fellow SQL Server MVP recently posed a variation on this problem: to find the number of minutes between two times, where the clock is running only from 6:00am-6:00pm, Monday-Friday. He needed this to compute how long trouble tickets stayed at a help desk that was open for those hours.

I came up with a function DeskTimeDiff_minutes(@from,@to) for him. It requires a permanent table that spans the range of times you might care about, holding one row for every time the clock is turned on or off, weekdays at 6:00am and 6:00pm in this case.

The table also holds an “absolute business time” in minutes (ABT-m): the total number of “help desk open” minutes since a fixed but arbitrary “beginning of time.” Elapsed help desk time is then simply the difference between ABT-m values. While the table only records the ABT-m 10 times a week, you can find the ABT-m for an arbitrary datetime @d easily. Find the row of the table with time d closest to @d but not later. In that row you’ll find the ABT-m at time d, and you’ll also find out whether the clock was (or will be) running or not between d and @d. If not, the ABT-m at time @d is the same as at time d. Otherwise, add the number of minutes between d and @d.

Here’s the code. The reference table here is good from early 2000 until well past 2050, and you can easily extend it or adapt it to other business rules. A larger permanent table of times shouldn’t affect performance, because the function only performs (two) index seek lookups on the table.

If you cut and paste this for your own use, watch out for “smart quotes” or other WordPress/Live Writer formatting quirks.

create table Minute_Count(
  d datetime primary key,
  elapsed_minutes int not null,
  timer varchar(10) not null check (timer in ('Running','Stopped'))
);

-- One row per clock change for the first week (Monday 2000-01-03 through Friday)
insert into Minute_Count values ('2000-01-03T06:00:00',0,'Running');
insert into Minute_Count values ('2000-01-03T18:00:00',12*60,'Stopped');

insert into Minute_Count values ('2000-01-04T06:00:00',12*60,'Running');
insert into Minute_Count values ('2000-01-04T18:00:00',24*60,'Stopped');

insert into Minute_Count values ('2000-01-05T06:00:00',24*60,'Running');
insert into Minute_Count values ('2000-01-05T18:00:00',36*60,'Stopped');

insert into Minute_Count values ('2000-01-06T06:00:00',36*60,'Running');
insert into Minute_Count values ('2000-01-06T18:00:00',48*60,'Stopped');

insert into Minute_Count values ('2000-01-07T06:00:00',48*60,'Running');
insert into Minute_Count values ('2000-01-07T18:00:00',60*60,'Stopped');
/* any Monday-Friday week */

-- Double the table on each pass: copy every existing row forward by @week weeks
declare @week int;
set @week = 1;
while @week < 2100 begin
  insert into Minute_Count
    select
      dateadd(week,@week,d),
      elapsed_minutes + 60*@week*60,
      timer
    from Minute_Count;
  set @week = @week * 2;
end;
go

-- create function must be the first statement in its batch, hence the go above
create function DeskTimeDiff_minutes(
  @from datetime,
  @to datetime
) returns int as begin
  declare @fromSerial int;
  declare @toSerial int;
  -- ABT-m at @from: the latest clock change at or before @from
  with S(d,elapsed_minutes,timer) as (
    select top 1 d,elapsed_minutes, timer
    from Minute_Count
    where d <= @from
    order by d desc
  )
    select @fromSerial =
      elapsed_minutes +
      case when timer = 'Running'
      then datediff(minute,d,@from)
      else 0 end
    from S;
  -- ABT-m at @to, found the same way
  with S(d,elapsed_minutes,timer) as (
    select top 1 d,elapsed_minutes, timer
    from Minute_Count
    where d <= @to
    order by d desc
  )
    select @toSerial =
      elapsed_minutes +
      case when timer = 'Running'
      then datediff(minute,d,@to)
      else 0 end
    from S;
  return @toSerial - @fromSerial;
end;
go
select MAX(d) from Minute_Count;
select dbo.DeskTimeDiff_minutes('2007-12-19T18:00:00','2007-12-24T17:51:00');
go

drop function DeskTimeDiff_minutes;
drop table Minute_Count;
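
For anyone who wants to check the arithmetic outside SQL Server, here is a Python sketch of the same lookup. It is an approximation written for illustration, with names of my own choosing: the table of clock changes is built in memory, the weeks=2700 span is an arbitrary choice, and partial minutes are truncated rather than counted the way DATEDIFF counts minute boundaries (the two agree for whole-minute inputs).

```python
from bisect import bisect_right
from datetime import datetime, timedelta

FIRST_MONDAY = datetime(2000, 1, 3)   # same starting point as the table above
OPEN_HOUR, CLOSE_HOUR = 6, 18

def build_switch_rows(weeks):
    """Return (time, elapsed_minutes, running) rows, one per clock change."""
    rows, elapsed = [], 0
    for day in range(7 * weeks):
        date = FIRST_MONDAY + timedelta(days=day)
        if date.weekday() < 5:                       # Monday through Friday
            rows.append((date.replace(hour=OPEN_HOUR), elapsed, True))
            elapsed += (CLOSE_HOUR - OPEN_HOUR) * 60
            rows.append((date.replace(hour=CLOSE_HOUR), elapsed, False))
    return rows

ROWS = build_switch_rows(weeks=2700)
TIMES = [row[0] for row in ROWS]

def abt_minutes(when):
    """Absolute business time at `when`: find the latest clock change at or before it."""
    i = bisect_right(TIMES, when) - 1
    if i < 0:
        raise ValueError("time precedes the start of the table")
    start, elapsed, running = ROWS[i]
    extra = int((when - start).total_seconds() // 60) if running else 0
    return elapsed + extra

def desk_time_diff_minutes(start, end):
    return abt_minutes(end) - abt_minutes(start)

print(desk_time_diff_minutes(datetime(2007, 12, 19, 18, 0),
                             datetime(2007, 12, 24, 17, 51)))  # 2151
```

The 2151 minutes are Thursday and Friday at 720 minutes each, plus 711 minutes on Monday the 24th between 6:00am and 5:51pm.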

The hemisphere requirement

Steve — Wed, 21 Nov 2007 22:16:47 +0000

Microsoft plans to support spatial data types in SQL Server 2008, and a preview is available to the community in the latest CTP (community technology preview), available here.

John O’Brien, a Windows Live Developer MVP, has been trying out the new spatial types in some cool Virtual Earth projects (John’s site is here), and in one of his projects, SQL Server threw an interesting error message. When he zoomed far enough out in Virtual Earth, then tried to create a polygon from the map bounds, SQL Server reacted with:

“The specified input does not represent a valid geography instance because it exceeds a single hemisphere. Each geography instance must fit inside a single hemisphere. A common reason for this error is that a polygon has the wrong ring orientation.”

John found a workaround, dividing the map into two pieces, but he was interested to know what the SQL Server folk thought about the situation. Here’s my reply. It’s less a response to John’s inquiry than it is a ramble about geometry and what hemispheres and orientation have to do with how you can or can’t specify polygons.

To begin, think of the earth’s Equator as a polygon. How would you answer the following questions?

“If I travel Eastbound around the earth along the equator, have I gone clockwise or counter-clockwise?”
“Is the north pole inside the equator or outside the equator?”

In the plane (or on a flat map of the world), a polygon or other closed non-self-intersecting curve has a well-defined “inside” and “outside”. A polygon separates the plane into two regions, one that has finite area and one that is unbounded. The finite region is deemed “inside” the polygon. On a sphere, however, a closed curve determines two finite regions, either of which might be what someone thinks of as the inside.

For example, the four-sided outline of the US state of Wyoming separates the earth into what you could call “Wyoming” and “anti-Wyoming.” But are we so sure which is the inside and which is the outside? Our intuition is that the smaller region is always the inside, but there’s nothing about geometry and geography to tell us that. Maybe Wyoming is most of the world. A single geographic region could contain most of the earth’s surface within its borders, couldn’t it?

Suppose Wyoming declared itself to be Great Wyoming, annexed all of North America and Europe, and continued to conquer the world. Suppose its armies crossed the equator and eventually took over almost everything: everything but Antarctica, in fact.

The boundary of Great Wyoming would then be the same as the boundary of Antarctica. You would probably want Great Wyoming to be inside the boundary of Great Wyoming and Antarctica to be inside the boundary of Antarctica, but how can that work when the boundaries are the same?

This is a problem. On a sphere, the naïve idea of interior/exterior isn’t well-defined. One solution would be to pass a law that every polygon on earth must fit inside a single hemisphere with room to spare. We could then define the interior of a polygon to be the smaller of the two regions it determines. This would place Antarctica, not Wyoming, within the borders of Great Wyoming—wrong, but unambiguous. And anyway, who would ever need to consider a region ~~bigger than 640K~~ that doesn’t fit inside a single hemisphere?

Fortunately, though, we don’t have to abandon or compromise the notion of interior and exterior on the earth’s surface: Antarctica can remain outside Great Wyoming. All we need to do is be precise about the direction in which we describe a polygon. When specifying the boundary of a region, you can give a forwards/backwards or clockwise/counter-clockwise sense to the boundary by choosing the way you order the list of vertices. List them so that what you consider inside the region is on your left as you “connect the dots,” because we will adopt the convention that the left side as you walk the perimeter is the inside. What’s on the right will be interpreted as outside. Now you can describe the boundary of Great Wyoming. Just describe it as drawn from west to east, so Antarctica is on the right (exterior). (This works because a sphere is an “orientable surface.” SQL Server’s new geography data type isn’t supported on a Klein bottle, where a CultureInfo.IsOrientableWorld property, if one existed, would be false.)
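
The left-hand convention is easiest to see on a flat map, where the sign of a polygon's area tells you which way its vertices run. The Python sketch below is only that planar illustration: it treats approximate longitude/latitude corners of Wyoming as flat x/y coordinates, which is an assumption made for the example and says nothing about how SQL Server's geography type does its checks.

```python
def signed_area(ring):
    """Shoelace formula: positive for counter-clockwise rings, negative for clockwise."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0

def interior_is_on_left(ring):
    """True when walking the vertices in order keeps the enclosed region on the left."""
    return signed_area(ring) > 0

# Approximate (longitude, latitude) corners, listed SW, SE, NE, NW.
WYOMING = [(-111.05, 41.0), (-104.05, 41.0), (-104.05, 45.0), (-111.05, 45.0)]

print(interior_is_on_left(WYOMING))        # True
print(interior_is_on_left(WYOMING[::-1]))  # False
```

On the plane the reversed ring merely has negative area. On a sphere the reversed ring is a perfectly good description of a different region, the "anti-Wyoming" of the discussion above, which is why the vertex order has to be part of the definition.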

Once we require polygons to be oriented, there’s no need to require that they fit within a single hemisphere, but nonetheless, SQL Server 2008’s geography data type adopts the hemisphere requirement. For geography objects of type Polygon, I think this is a good idea. I’m not sure whether it’s a standard GIS requirement or just SQL Server’s, but it prevents users from accidentally entering the coordinates of Wyoming in clockwise fashion only to discover later that Perth and Addis Ababa, but not Cheyenne, are in Wyoming. [For some of the other geography types, such as LineString, I don’t see a benefit from requiring the object to fit in a hemisphere, but consistency isn’t a bad thing.]

A Million Random Digits with 100,000 Normal Deviates

Steve — Wed, 09 Aug 2006 04:27:08 +0000

Groundbreaking when it was published in 1955, the classic book “A Million Random Digits with 100,000 Normal Deviates” has been republished electronically by the RAND corporation with permission “to duplicate this electronic document for personal use only, as long as it is unaltered and complete.” Books like these were a staple of statistical research in the mid-20th century, and this particular one was highly revered.

Nowadays, there are better sources of random numbers, such as HotBits, and there are many ways to generate pseudorandom numbers, which are not random, but have many of the properties of random numbers and are useful for many purposes.

I hope it’s not a violation of the copyright for me to provide instructions on how to use SQL to load the book’s content in its published format (or any identically-formatted list) into a SQL table that can be queried for random (not pseudorandom) sequences of numbers. The script uses a few of SQL Server 2005’s new features, including the BULK rowset provider for text files, some of the new analytic functions, and TOP with a variable. You’ll also need a table-valued function called Numbers(), like the one in my previous SQL post.

The RAND book is available here, and my script works for the support file “Datafile: A Million Random Digits,” available for download here. The SQL Server 2005 script below assumes you’ve downloaded this file and unzipped it to C:\RAND\MillionDigits.txt.

The beginning of the file looks like this:

00000   10097 32533  76520 13586  34673 54876  80959 09117  39292 74945
00001   37542 04805  64894 74296  24805 24037  20636 10402  00822 91665
00002   08422 68953  19645 09303  23209 02560  15953 34764  35080 33606
00003   99019 02529  09376 70715  38311 31165  88676 74397  04436 27659
00004   12807 99970  80157 36147  64032 36653  98951 16877  12171 76833
00005   66065 74717  34072 76850  36697 36170  65813 39885  11199 29170
00006   31060 10805  45571 82406  35303 42614  86799 07439  23403 09732
00007   85269 77602  02051 65692  68665 74818  73053 85247  18623 88579
00008   63573 32135  05325 47048  90553 57548  28468 28709  83491 25624
00009   73796 45753  03529 64778  35808 34282  60935 20344  35273 88435

Unix-style newlines (0x0A) are used, and the million digits are organized into 200,000 five-digit integers with leading zeroes, ten to a line on 20,000 numbered lines, so the script will import the file into a table of 200,000 five-digit numbers (as char(5) data with leading zeroes). Here’s the script:

create database MillionDigits
go

use MillionDigits
go

create table MillionDigitsFile (
  c varchar(max)
)
go

-- Load the whole file into a single varchar(max) value.
insert into MillionDigitsFile
  select BulkColumn
  from openrowset(bulk 'C:\RAND\MillionDigits.txt', SINGLE_CLOB) as D
go

create table NumbersFromTable(
  position int primary key,
  number char(5) not null
)
create index NumbersFromTable_number on NumbersFromTable(number)
go

-- The first of the five groups of two numbers each
-- begins at position 9 of each line. Each of the other
-- four groups on a line begins 13 characters after the
-- previous one. The second number in each group
-- begins 6 characters after the first. Each line is
-- 72 characters long, counting the newline.
insert into NumbersFromTable
  select
    row_number() over (order by N.n,A.n,B.n) as rk,
    substring(c,9+72*N.n+13*A.n+6*B.n,5) as n
  from Numbers(0,19999) as N, Numbers(0,4) as A, Numbers(0,1) as B, MillionDigitsFile
go

-- How random does it look? (and a sneaky way to
-- aggregate over an aggregate)
select top 1
  min(count(*)) over (),
  max(count(*)) over (),
  avg(1.00000*count(*)) over (),
  stdev(count(*)) over ()
from NumbersFromTable
group by number
go

/*
  Selects a @length-long sequence of numbers from the
  table, where the place to start is found as follows.
  Given a random integer, use % to turn it into a
  number's position between 1 and 200000. Reduce the
  number at that position % 20000 to find a starting
  line of the book (0 to 19999), and reduce the following
  number % 10 to find the starting number on that line
  (0 to 9). Each line holds ten numbers, so the sequence
  starts at position 1 + 10*line + offset.
*/
create function RandomSequence(
  @seed int,
  @length int
) returns table as return (
  select top (@length)
    row_number() over (order by position) as i,
    number
  from NumbersFromTable
  where position >= 1
    + 10*(
      select number%20000
      from NumbersFromTable
      where 1+@seed%200000 = position
    ) + (
      select number%10
      from NumbersFromTable
      where 1+(@seed+1)%200000 = position
    )
  order by position
)
go

-- Generate a few random sequences. You'll get different ones
-- each time you run this.
declare @seed int
set @seed = abs(binary_checksum(newid()))%200000
select * from RandomSequence(@seed,50)
set @seed = abs(binary_checksum(newid()))%200000
select * from RandomSequence(@seed,123)

-- Uncomment to clean up
-- use master
-- go
-- drop database MillionDigits
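The offset arithmetic in the insert statement can be checked by hand on a small string. This is only a sketch: the variable @c is mine, and the two lines are rebuilt with the spacing the script’s comments describe (three spaces after the line number, two spaces between groups), which I’m assuming matches the downloaded file.

```sql
-- Two lines from the start of the file, each 71 characters plus a newline (0x0A).
declare @c varchar(max)
set @c = '00000   10097 32533  76520 13586  34673 54876  80959 09117  39292 74945' + char(10)
       + '00001   37542 04805  64894 74296  24805 24037  20636 10402  00822 91665' + char(10)

-- Line 1 (N = 1), third group (A = 2), second number of the pair (B = 1):
-- 9 + 72*1 + 13*2 + 6*1 = 113
select substring(@c, 9 + 72*1 + 13*2 + 6*1, 5) as number   -- returns 24037
```

If the real file used different spacing, the constants 9, 72, 13 and 6 would all need to change together.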

How to generate a sequence on the fly

Steve — Sat, 03 Jun 2006 14:44:10 +0000

One of the things that kept me busy this past winter and spring was tech editing Itzik Ben-Gan’s two books in Microsoft Press’s Inside Microsoft® SQL Server™ 2005 series (1,2). Of Itzik’s many clever solutions to programming problems, my favorite was this function that returns a table of consecutive integers. It’s blazingly fast, and it’s the best way I know of to generate a sequence on the fly – probably even better than accessing a permanent table of integers.

create function Numbers(
  @from as bigint,
  @to as bigint
) returns table with schemabinding as return
  with t0(n) as (
    select 1 union all select 1
  ), t1(n) as (
    select 1 from t0 as a, t0 as b
  ), t2(n) as (
    select 1 from t1 as a, t1 as b
  ), t3(n) as (
    select 1 from t2 as a, t2 as b
  ), t4(n) as (
    select 1 from t3 as a, t3 as b
  ), t5(n) as (
    select 1 from t4 as a, t4 as b
  ), Numbers(n) as (
    select row_number() over (order by n) as n
    from t5
  )
  select @from + n - 1 as n
  from Numbers
  where n <= @to - @from + 1
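Each CTE cross-joins the previous one with itself, so the row counts square at every step: 2, 4, 16, 256, 65,536, and finally 4,294,967,296 rows available from t5. A short usage sketch, assuming the function has been created in the current database (the column alias in the second query is mine):

```sql
-- Ten consecutive integers, 1 through 10
select n from Numbers(1, 10) order by n

-- The range can start anywhere; this yields 0, 5, 10, ..., 95
select 5*n as multiple_of_five from Numbers(0, 19) order by n
```

The order by is there because a table-valued function, like any table, makes no promise about the order of its rows.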