March 2008


Before SQL Server 2005 was released, a calculation that requiring a ranking was both relatively difficult to express as a single query and relatively inefficient to execute. That changed in SQL Server 2005 with support for the SQL analytic functions RANK(), ROW_NUMBER(), etc., and partial support for SQL’s OVER clause.

Spearman’s rho (Spearman’s correlation coefficient) is a useful statistic that can be calculated more easily in SQL Server 2005 than in earlier versions. Below is an implementation of Spearman’s rho for SQL Server 2005 and later.

SQL’s RANK() and the rank order required for the calculation of Spearman’s rho are slightly different: if for example four values are tied for third place, RANK() will equal 3 for all four of them. The Spearman’s formula requires them all to be ranked 4.5, the average of their positions (3rd, 4th, 5th, and 6th) in an ordered list of the data. To address this difference, the code below adjusts the SQL RANK() by adding to it 0.5 for each occurrence of a data value beyond the first. I used COUNT(*) with an OVER clause for this.

The script below demonstrates the calculation for two data sets. The first one is from Wikipedia’s page on Spearman’s rho; I made up the second data set to include duplicate data values. I haven’t tested the code thoroughly, but for a variety of small test data sets, it matches hand calculations and the result here [1].

create table SampleData (
ID int identity(1,1) primary key,
x decimal(5,2),
y decimal(5,2)
);

insert into SampleData(x,y) values(106,7);
insert into SampleData(x,y) values(86,0);
insert into SampleData(x,y) values(100,27);
insert into SampleData(x,y) values(101,50);
insert into SampleData(x,y) values(99,28);
insert into SampleData(x,y) values(103,29);
insert into SampleData(x,y) values(97,20);
insert into SampleData(x,y) values(113,12);
insert into SampleData(x,y) values(112,6);
insert into SampleData(x,y) values(110,17);
go

create procedure Spearman as
with RankedSampleData(ID,x,y,rk_x,rk_y) as (
select
ID,
x,
y,
rank() over (order by x) +
(count(*) over (partition by x) – 1)/2.0,
rank() over (order by y) +
(count(*) over (partition by y) – 1)/2.0
from SampleData
)
select
1e0 –
(
6
*sum(square(rk_x-rk_y))
/count(*)
/(square(count(*)) – 1)
)
from RankedSampleData;
go

exec Spearman;

go
truncate table SampleData;
go

insert into SampleData(x,y) values(1,3);
insert into SampleData(x,y) values(3,5);
insert into SampleData(x,y) values(5,8);
insert into SampleData(x,y) values(3,4);
insert into SampleData(x,y) values(4,7);
insert into SampleData(x,y) values(4,6);
insert into SampleData(x,y) values(3,4);
go

exec Spearman;
go

drop proc Spearman;
drop table SampleData;

[1] Wessa, P. (2008), Free Statistics Software, Office for Research Development and Education, version 1.1.22-r4, URL http://www.wessa.net/

A Reuters News Service article today reports [emphasis mine] “If both your parents have Alzheimer’s disease, you probably are more much likely than other people to get it, researchers said on Monday.” What if they both have dyslexia?