ntile、row_number和取top百分比数据

Mysql和SQL SERVER都有top语句,可以用于取出top N这样的操作,oracle是没有top语句的。本文时候对结果集取百分比的小样例。

Oracle提供了一个ntile函数用于将结果集分配给n个桶来进行计算,但是当进行平分的时候是无法做到完全平均分配的,所以必然会存在某些桶里的数据会比其他的桶里数据多的情况,当我们选择桶号来进行读取数据的过程中就会存在数据比我们预期的多的情况。对于取整数百分比,比如10%,20%这种,ntile还是比较好用的,对于取非整的,比如11%,12%,13%这种情况下,ntile就会存在问题,这个时候使用row_number函数处理就更加适合。Oracle community上有一个帖子也描述了这一问题。下面的实验来进行说明。
实验环境为12.1.0.2

取百分之二十的数据

SQL> with gererator as
  2   (select name, salary, ntile(5) over(order by salary) as buckets from t)
  3  select name, salary from gererator where buckets = 5 order by salary desc
  4  ;

NAME                SALARY
--------------- ----------
Neena                17000
Lex                  17000
John                 14000
Karen                13500
Michael              13000
Shelley              12008
Nancy                12008
Alberto              12000
Lisa                 11500
Den                  11000
Gerald               11000
Ellen                11000
Clara                10500
Eleni                10500
Janette              10000
Hermann              10000
Harrison             10000
Peter                10000
Tayler                9600
David                 9500
Danielle              9500
--21条
看下5个桶里的数据,表中一共106条数据
SQL> select buckets,count(*) from (select name, salary, ntile(5) over(order by salary ) as buckets from t) group by buckets;

   BUCKETS   COUNT(*)
---------- ----------
         1         22
         2         21
         4         21
         5         21
         3         21
数据平分为每个桶21条,然后最后一条分配给1号桶,虽然如此,这个结果还算满意。上述是对桶进行升序排列的。

如果我们希望获取百分之17呢,我们分为100个桶,桶按照升序来排列
SQL> with gererator as
  2   (select name, salary, ntile(100) over(order by salary ) as buckets from t)
  3  select name, salary from gererator where buckets > 100 - 17 order by salary desc;

NAME                SALARY
--------------- ----------
Neena                17000
Lex                  17000
John                 14000
Karen                13500
Michael              13000
Nancy                12008
Shelley              12008
Alberto              12000
Lisa                 11500
Gerald               11000
Den                  11000
Ellen                11000
Clara                10500
Eleni                10500
Harrison             10000
Hermann              10000
Janette              10000

17 rows selected.

106条的17%是18条,这个结果也还好。

下面按照降序
SQL> with gererator as
  2   (select name, salary, ntile(100) over(order by salary desc ) as buckets from t)
  3  select name, salary from gererator where buckets <= 17 order by salary desc;

NAME                SALARY
--------------- ----------
Neena                17000
Lex                  17000
John                 14000
Karen                13500
Michael              13000
Nancy                12008
Shelley              12008
Alberto              12000
Lisa                 11500
Den                  11000
Gerald               11000
Ellen                11000
Eleni                10500
Clara                10500
Peter                10000
Janette              10000
Hermann              10000
Harrison             10000
Tayler                9600
Danielle              9500
David                 9500
Patrick               9500
Alexander             9000

23 rows selected.

预期是18条,降序是23条,因为在数据分装进桶的问题造成了不同。

下面使用row_number来进行取17%。

SQL> with generator as
  2   (select name,
  3           salary,
  4           row_number() over(order by salary desc) as sample_rows,
  5           count(*) over() as all_rows
  6      from t)
  7  select name, salary
  8    from generator
  9   where sample_rows / all_rows <= 17 / 100
 10   order by salary desc;

NAME                SALARY
--------------- ----------
Neena                17000
Lex                  17000
John                 14000
Karen                13500
Michael              13000
Nancy                12008
Shelley              12008
Alberto              12000
Lisa                 11500
Den                  11000
Gerald               11000
Ellen                11000
Eleni                10500
Clara                10500
Peter                10000
Janette              10000
Hermann              10000
Harrison             10000

18 rows selected.
这个结果如我们预期的记录数。

所以从总体上来讲,row_nunmber这个分析函数联合数学分数的方法处理随即百分比更好些,对于百分比数值比较完整的ntile也比较好。

此条目发表在SQL分类目录。将固定链接加入收藏夹。

发表评论

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / 更改 )

Twitter picture

You are commenting using your Twitter account. Log Out / 更改 )

Facebook photo

You are commenting using your Facebook account. Log Out / 更改 )

Google+ photo

You are commenting using your Google+ account. Log Out / 更改 )

Connecting to %s