dbms_space-create_index_cost and asa_recommendations

Oracle's dbms_space package is used to analyze segment growth and space requirements.
This article exercises two of its subprograms: dbms_space.create_index_cost and dbms_space.asa_recommendations.
The tests were run on 11.2.0.4.

SQL> set serveroutput on
SQL> DROP TABLE T_OBJECT PURGE;

Table dropped.

SQL> create table t_object as select * from all_objects where 1 = 0;

Table created.

SQL> insert into t_object select * from all_objects;

59963 rows created.

SQL> insert into t_object select * from t_object;

59963 rows created.

SQL> insert into t_object select * from t_object;

119926 rows created.

SQL> insert into t_object select * from t_object;

239852 rows created.

SQL> commit;

Commit complete.

SQL>
SQL>
SQL> create or replace procedure run_sa
  2    authid current_user
  3    as
  4      obj_id number;
  5    begin
  6      dbms_advisor.create_task (
  7        advisor_name     => 'Segment Advisor',
  8        task_name        => 'Manual_Task' );
  9
 10      dbms_advisor.create_object (
 11        task_name        => 'Manual_Task',
 12        object_type      => 'TABLE',
 13        attr1            => user,
 14        attr2            => 'T_OBJECT',
 15        attr3            => NULL,
 16        attr4            => NULL,
 17        attr5            => NULL,
 18        object_id        => obj_id);
 19
 20      dbms_advisor.set_task_parameter(
 21        task_name        => 'Manual_Task',
 22        parameter        => 'recommend_all',
 23        value            => 'TRUE');
 24
 25      dbms_advisor.execute_task('Manual_Task');
 26    end;
 27    /

Procedure created.

SQL>
SQL>
SQL> exec dbms_stats.gather_table_stats(user,'t_object');

PL/SQL procedure successfully completed.

SQL> variable used_bytes number
SQL> variable alloc_bytes number
SQL> exec dbms_space.create_index_cost( 'create index idx1_t_object on t_object(object_name)', :used_bytes, :alloc_bytes );

PL/SQL procedure successfully completed.

SQL> print :used_bytes

USED_BYTES
----------
  10073784

SQL> print :alloc_bytes

ALLOC_BYTES
-----------
   17825792

SQL>
SQL> create index idx1_t_object on t_object(object_name);

Index created.

SQL>
SQL> select (bytes/1024/1024) from user_segments where segment_name = 'T_OBJECT';

(BYTES/1024/1024)
-----------------
               55

SQL> select bytes from user_segments where segment_name='IDX1_T_OBJECT';

     BYTES
----------
  17825792

SQL>
SQL>
SQL> DELETE FROM T_OBJECT;

479704 rows deleted.

SQL> commit;

Commit complete.

SQL>
SQL> exec  dbms_advisor.delete_task('Manual_Task');

PL/SQL procedure successfully completed.

SQL>
SQL> exec run_sa;

PL/SQL procedure successfully completed.

SQL>
SQL> COL SEGMENT_OWNER FOR A13
SQL> COL SEGMENT_NAME FOR A13
SQL> COL SEGMENT_TYPE FOR A13
SQL> COL recommendations format A70
SQL> COL c3 format a50  heading 'Run First'
SQL> COL c2 format a50  heading 'Run Second'
SQL> COL c1 format a50  heading 'Run Last |(May not be required)'
SQL>
SQL> SELECT segment_owner,
  2         segment_name,
  3         segment_type,
  4         recommendations,
  5         c3 || ';' c3,
  6         c2 || ';' c2,
  7         c1 || ';' c1
  8    FROM TABLE(DBMS_SPACE.asa_recommendations())
  9   WHERE segment_type = 'TABLE'
 10     AND segment_name = UPPER('T_OBJECT');

                                                                                                                                                                    Run Last
SEGMENT_OWNER SEGMENT_NAME  SEGMENT_TYPE  RECOMMENDATIONS                                                        Run First                                          Run Second                                         (May not be required)
------------- ------------- ------------- ---------------------------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------------------------------------------
SH            T_OBJECT      TABLE         Enable row movement of the table SH.T_OBJECT and perform shrink, estimated savings is 50246285 bytes.                      alter table "SH"."T_OBJECT" enable row movement;   alter table "SH"."T_OBJECT" shrink space COMPACT;  alter table "SH"."T_OBJECT" shrink space;

Joining dbms_space.asa_recommendations with the Auto Space Advisor tasks in dba_advisor_tasks gives a database-wide view:

SELECT segment_name,
       round(allocated_space / 1024 / 1024, 1) alloc_mb,
       round(used_space / 1024 / 1024, 1) used_mb,
       round(reclaimable_space / 1024 / 1024) reclaim_mb,
       round(reclaimable_space / allocated_space * 100, 0) pctsave,
       recommendations,
       re.task_id,
       ta.execution_end
  FROM TABLE(dbms_space.asa_recommendations()) re, dba_advisor_tasks ta
 Where ta.task_id = re.task_id;

Oracle Data Mining-K-means

Regression, naive Bayes, support vector machines, and similar algorithms work with labeled data: the training samples already carry class labels, so these algorithms belong to supervised learning. Clustering is unsupervised: the samples carry only the features x, with no label y. The goal of clustering is to discover each sample's latent class y and group samples of the same class together. Think of stars in the sky: after clustering, the result is a set of star clusters, where stars within a cluster are close to each other and stars in different clusters are far apart.
This article tests the K-means algorithm in Oracle Data Mining, and repeats the clustering with the popular open-source scikit-learn machine-learning framework.
First, an explanation of some of the concepts behind K-means.

Computing dissimilarity

Dissimilarity measures how different two things are. For example, the dissimilarity between humans and octopuses is obviously larger than that between humans and chimpanzees; we sense this intuitively, but a computer has no such intuition, so dissimilarity must be defined mathematically.
Mathematically, dissimilarity is computed using the notion of distance.

Common distance measures include the Minkowski distance, the Euclidean distance, and the city-block (Manhattan) distance. Oracle uses the Euclidean distance.

Description of the k-means algorithm

Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means partitions the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares, i.e. the variance. As a formula:

    argmin over S of  sum_{i=1..k} sum_{x in Si} ||x - μi||^2

where μi is the mean of the points in Si.
K-means originated as a vector-quantization method in signal processing, and is now popular as a cluster-analysis method in data mining.
The goal of k-means clustering is to partition n points (each an observation or an instance of a sample) into k clusters such that every point belongs to the cluster whose mean (the cluster centroid) is nearest; that nearest-mean criterion is the clustering standard.

The Euclidean distance formula

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" (straight-line) distance between two points in Euclidean space; with this distance, Euclidean space becomes a metric space.

For points x = (x1, …, xn) and y = (y1, …, yn) in Euclidean space, the Euclidean distance is

    d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + … + (xn - yn)^2)
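As a concrete illustration (a minimal Python sketch, not part of the original tests), the distance can be computed directly from the formula; the two points below are the first two score rows of the worked example later in this article:

```python
import math

def euclidean(x, y):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Distance between the first two sample rows of the worked example
print(euclidean((84, 49), (69, 33)))  # ~ 21.9317
```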

Normalizing the data

Normalization rescales the data proportionally so that it falls into a small, specific interval. It is often used when comparing and evaluating indicators: it removes the units, turning values into dimensionless pure numbers so that indicators with different units or magnitudes can be compared and weighted.
The most typical case maps the data uniformly onto the interval [0,1]. Common normalization methods include:

min-max normalization

Also called deviation normalization, it is a linear transformation of the original data that maps the result into [0,1]:

    x_new = (x - min) / (max - min)

where max and min are the maximum and minimum of the sample data. One drawback is that when new data arrive, max and min may change and must be redefined.
Oracle's enhanced k-means uses the min-max method to normalize the data.
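A minimal Python sketch of min-max normalization (illustrative only; Oracle's own implementation via DBMS_DATA_MINING_TRANSFORM is shown later in this article):

```python
def min_max_normalize(values):
    """Rescale values linearly into [0, 1]: x_new = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# The history scores from the worked example in this article
history = [84, 69, 67, 78, 81]
print(min_max_normalize(history))
```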

log transformation

Normalization can also be achieved with a base-10 logarithm:

    x_new = log10(x) / log10(max)

(this maps into [0,1] provided all values are ≥ 1, where max is the sample maximum).

atan transformation

The arctangent function can also normalize data:

    x_new = atan(x) * 2 / pi

Note that if the target interval is [0,1], the data must all be ≥ 0; negative values are mapped into [-1,0).

z-score normalization (zero-mean normalization)

Also called standard-deviation normalization: the processed data have mean 0 and standard deviation 1 (and follow the standard normal distribution if the original data are normal). The transformation is

    x_new = (x - μ) / σ

where μ is the mean of all sample data and σ is their standard deviation.
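A corresponding Python sketch of z-score normalization (illustrative only; it uses the population standard deviation):

```python
import statistics

def z_score_normalize(values):
    """Transform values to zero mean and unit standard deviation: (x - mu) / sigma."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)   # population standard deviation
    return [(v - mu) / sigma for v in values]

scores = z_score_normalize([84, 69, 67, 78, 81])
print(scores)
```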

The classic k-means algorithm

Take 2-dimensional space as an example.

A brief overview:
1. Apart from the randomly chosen initial centroids, the centroids in later iterations are computed as the means of the points in the set S being clustered, so a centroid is generally not a point of S; it marks the center of a group of points.

2. The basic k-means algorithm starts by randomly choosing the specified k centroids. Because the initial k centroids are random, each run of k-means may produce a different result, and a poor random choice of initial centroids can make the clustering result very poor.
3. Computing a centroid: suppose clustering yields a cluster Ci = {p(x1,y1), p(x2,y2), …, p(xn,yn)}; then the centroid of Ci has x-coordinate (x1+x2+…+xn)/n and y-coordinate (y1+y2+…+yn)/n.
4. Termination condition: the centroids change on each iteration, and the non-centroid points must then be reassigned to the nearest centroid to form new clusters. When only a small fraction of points still change cluster between iterations (i.e., after a centroid update, few points move from one cluster to another), this convergence condition is met and the iteration can stop early.
5. The overall framework: first randomly choose k initial centroids, then iterate the clustering step, continually updating the centroids until the convergence condition is satisfied. Because the algorithm converges to a local optimum, it is common to run it several times and keep the best clustering as the final result.
6. After k-means finishes there are no outliers; every point is assigned to some cluster.
How the algorithm works:
1. Specify the number of clusters and the metric used to calculate distance.
2. Randomly pick an initial centroid for each cluster.
3. For each data point:
a. Calculate the distance between each centroid and the data point.
b. Assign the data point to the nearest centroid's cluster.
4. Calculate the new centroid for each cluster.
5. Repeat steps 3 and 4 while the centroids keep changing; terminate when they remain unchanged across successive cluster reassignments.
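The five steps above can be sketched in a few lines of Python (a simplified illustration, not Oracle's enhanced implementation; `math.dist` requires Python 3.8+):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: random initial centroids, assign/recompute until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # step 2: random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 3: assign to nearest centroid
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]      # step 4: recompute each centroid
        if new == centroids:                          # step 5: stop once centroids are stable
            break
        centroids = new
    return centroids, clusters

# The five score rows from the example below converge to the same two
# clusters found in the manual walk-through.
centroids, clusters = kmeans([(84, 49), (69, 33), (67, 37), (78, 45), (81, 49)], k=2)
print(sorted(centroids))
```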

An example

No.  History score  Geography score
1    84             49
2    69             33
3    67             37
4    78             45
5    81             49

Using the history score as the x-axis and the geography score as the y-axis, we walk through the classic k-means computation.
Step 1: specify the clusters
Specify the number of clusters (that is, k) and the metric used to compute distance.
Since we want to cluster the samples into 2 groups, k = 2 here. Assuming the data share the same units and scale, we use the Euclidean distance:
the distance between a point (x1,y1) and a centroid (cx,cy) is dist((x1,y1),(cx,cy)) = sqrt((x1-cx)^2 + (y1-cy)^2).
Step 2: choose the cluster centroids
Choose the initial cluster centroids.
Purely for demonstration, we simply take the first 2 rows as the initial centroids of the 2 clusters.

Cluster       Initial centroid
              History score  Geography score
C1(Cluster1)  84             49
C2(Cluster2)  69             33

Step 3: compute the distance between the centroids and each sample point

History score  Geography score  Distance to cluster1  Distance to cluster2  Assigned cluster
84             49               0                     21.9317122            1
69             33               21.9317122            0                     2
67             37               20.8086521            4.47213595            2
78             45               7.21110255            15                    1
81             49               3                     20                    1

For example, since 7.21110255 < 15, the point (78, 45) is assigned to cluster1.
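The distance values in the table can be double-checked with Python's `math.dist` (a quick verification, not part of the original write-up):

```python
import math

point = (78, 45)
print(math.dist(point, (84, 49)))  # distance to cluster1's centroid: ~7.21110255
print(math.dist(point, (69, 33)))  # distance to cluster2's centroid: 15.0
```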
Step 4: recompute the new centroids (means)

Cluster       New centroid
              History score  Geography score
C1(Cluster1)  81             47.66666667
C2(Cluster2)  68             35

In step 3, cluster1 has 3 members, so its new history mean is (84 + 78 + 81)/3 = 81; the other values are computed the same way.
Step 5: recompute the distance from each sample point to the new centroids

History score  Geography score  Distance to cluster1  Distance to cluster2  Assigned cluster
84             49               3.2829                21.2602               1
69             33               18.9502               2.2361                2
67             37               17.6005               2.2361                2
78             45               4.0139                14.1421               1
81             49               1.3333                19.1050               1

The reassignment result is the cluster number in the rightmost column above; no point changes cluster, so the clustering is stable.

Drawbacks of classic k-means

It is not suitable for clustering non-spherical clusters.
The result depends on the initialization of the cluster centers.
It performs poorly when cluster sizes differ greatly, e.g. class A has 1000 points while class B has 100.
It is sensitive to noise. For example, with two classes A and B of 500 points each, not far apart, adding a single point far away from both (noise) has a large effect on the cluster centers (since centers are means); likewise, classes that lie very close together are separated poorly.
It is not suitable for categorical attributes, such as male/female.

Bisecting k-means

Bisecting k-means is a variant of k-means designed to reduce the uncertainty that random initial-centroid selection introduces into the clustering result; it is much less affected by that randomness.
First, consider how cluster quality is usually measured in Euclidean space: the Sum of the Squared Error (SSE). After clustering, every non-centroid point contributes an error equal to its distance to the nearest centroid, i.e. the distance to the centroid of the cluster it belongs to. Summing these squared distances gives the SSE, which evaluates the quality of a clustering; the ultimate goal is to minimize the SSE. In n-dimensional Euclidean space the SSE is defined as

    SSE = sum_{i=1..k} sum_{x in Ci} dist(x, ci)^2

where ci is the centroid of cluster Ci.
The basic idea of bisecting k-means is to perform local bisection trials: each round bisects the cluster with the largest SSE, and among the trial bisections keeps the split whose two sub-clusters have the smallest total SSE. This ensures each bisection is relatively good (possibly optimal), i.e. the split may be locally optimal, depending on the number of trials.
The algorithm proceeds as follows:
1. Initially, treat the whole data set D as one cluster C0, i.e. C = {C0}. Input parameters: the number of bisection trials m and the basic k-means parameters.
2. Take the cluster Cp in C with the largest SSE and run m bisection trials: call k-means with k = 2 to split Cp into two clusters Ci1 and Ci2, giving m bisection results B = {B1, B2, …, Bm}, where Bi = {Ci1, Ci2} are the two clusters from trial i.
3. Among the results in B, compute the total SSE of the two clusters of each split and keep the split with the smallest total SSE, Bj = {Cj1, Cj2}; add Cj1 and Cj2 to C and remove Cp from C.
4. Repeat steps 2 and 3 until C contains k clusters.
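The procedure above can be sketched with scikit-learn (which this article also uses later). This is an illustrative simplification, not Oracle's implementation; the `trials` parameter plays the role of the bisection-trial count m:

```python
import numpy as np
from sklearn.cluster import KMeans

def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

def bisecting_kmeans(points, k, trials=5, seed=0):
    clusters = [points]
    while len(clusters) < k:
        # Step 2: take the cluster with the largest SSE and bisect it m times
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        target = clusters.pop(worst)
        best_split, best_total = None, float("inf")
        for t in range(trials):
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + t).fit(target)
            split = [target[km.labels_ == j] for j in (0, 1)]
            total = sum(sse(part) for part in split)
            # Step 3: keep the bisection with the smallest total SSE
            if total < best_total:
                best_split, best_total = split, total
        clusters.extend(best_split)
    return clusters

data = np.array([[84, 49], [69, 33], [67, 37], [78, 45], [81, 49]])
parts = bisecting_kmeans(data, k=2)
print([len(p) for p in parts])
```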

Oracle's enhanced k-means algorithm

Oracle's k-means is an enhanced version of the basic algorithm, similar to the bisecting k-means described in the previous section.
The table to be clustered, T_SCORE, holds the 5 records from the example above.

SQL> select * From T_SCORE;
        ID HISTORY_SCORE GEOGRAPHY_SCORE
---------- ------------- ---------------
         1            84              49
         2            69              33
         3            67              37
         4            78              45
         5            81              49

Linear normalization

Oracle normalizes the data using the min-max method.

The data Oracle produces after normalization matches the result computed by hand with the formula.

At this point the data preparation work is done; the prepared data is stored in km_sh_sample_build_prepared.
Note: starting with 11g, Oracle provides Automatic Data Preparation, which normalizes the data automatically.

SQL> BEGIN EXECUTE IMMEDIATE 'DROP TABLE km_sh_sample_norm';
  2  EXCEPTION WHEN OTHERS THEN NULL; END;
  3  /

PL/SQL procedure successfully completed.

SQL> BEGIN EXECUTE IMMEDIATE 'DROP VIEW km_sh_sample_build_prepared';
  2  EXCEPTION WHEN OTHERS THEN NULL; END;
  3  /

PL/SQL procedure successfully completed.
SQL>
SQL> BEGIN
  2    -- Normalize numerical attributes: HISTORY_SCORE,GEOGRAPHY_SCORE
  3    DBMS_DATA_MINING_TRANSFORM.CREATE_NORM_LIN(norm_table_name => 'km_sh_sample_norm');
  4
  5    DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_MINMAX(norm_table_name => 'km_sh_sample_norm',
  6                                                      data_table_name => 't_score',
  7                                                      exclude_list    => dbms_data_mining_transform.column_list('id'));
  8
  9    -- Create the transformed view
 10    DBMS_DATA_MINING_TRANSFORM.XFORM_NORM_LIN(norm_table_name => 'km_sh_sample_norm',
 11                                              data_table_name => 't_score',
 12                                              xform_view_name => 'km_sh_sample_build_prepared');
 13  END;
 14  /

PL/SQL procedure successfully completed.

SQL> select col,shift,scale from KM_SH_SAMPLE_NORM;

COL                                                               SHIFT      SCALE
------------------------------------------------------------ ---------- ----------
HISTORY_SCORE                                                        67         17
GEOGRAPHY_SCORE                                                      33         16

SQL> select min(t.history_score),max(t.history_score),min(t.geography_score),max(t.geography_score) from t_score t;

MIN(T.HISTORY_SCORE) MAX(T.HISTORY_SCORE) MIN(T.GEOGRAPHY_SCORE) MAX(T.GEOGRAPHY_SCORE)
-------------------- -------------------- ---------------------- ----------------------
                  67                   84                     33                     49

SCALE(HISTORY_SCORE) = MAX - MIN = 84 - 67 = 17
SHIFT(HISTORY_SCORE) = MIN = 67

SCALE(GEOGRAPHY_SCORE) = MAX - MIN = 49 - 33 = 16
SHIFT(GEOGRAPHY_SCORE) = MIN = 33

Oracle's linear normalization formula is the min-max normalization:
normalize each attribute using the transformation x_new = (x_old - min) / (max - min).
Original data:

SQL> select * from T_SCORE t where t.id = 4;


        ID HISTORY_SCORE GEOGRAPHY_SCORE
---------- ------------- ---------------
         4            78              45

Applying the formula by hand to normalize the row with id = 4:

history_score(new) = (history_score_old - min(history_score))/scale(history_score) = (78 - 67)/17 = 11/17 = 0.647058823529
geography_score(new) = (geography_score_old - min(geography_score))/scale(geography_score) = (45 - 33)/16 = 0.75
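A quick Python check of Oracle's shift/scale transform (the shift and scale values below come from the KM_SH_SAMPLE_NORM output shown earlier):

```python
# shift = per-column MIN, scale = per-column MAX - MIN (from KM_SH_SAMPLE_NORM)
shift = {"history_score": 67, "geography_score": 33}
scale = {"history_score": 17, "geography_score": 16}

def normalize(column, x):
    """Oracle linear normalization: x_new = (x_old - shift) / scale."""
    return (x - shift[column]) / scale[column]

print(normalize("history_score", 78))    # ~0.647058824
print(normalize("geography_score", 45))  # 0.75
```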

SQL> select * from KM_SH_SAMPLE_BUILD_PREPARED t where id = 4;

        ID HISTORY_SCORE GEOGRAPHY_SCORE
---------- ------------- ---------------
         4    .647058824             .75

Creating the model

Default values of the model settings used by Oracle's algorithm:

Parameter                  Meaning
CLUS_NUM_CLUSTERS          Specifies the number of clusters for a clustering model, i.e. the value of k.
KMNS_BLOCK_GROWTH          Specifies the growth factor for memory allocated to hold cluster data for k-Means.
KMNS_CONV_TOLERANCE        Specifies the convergence tolerance for k-Means.
KMNS_DISTANCE              Specifies the distance function for k-Means; defaults to Euclidean distance.
KMNS_ITERATIONS            Specifies the number of iterations for k-Means; defaults to 3.
KMNS_MIN_PCT_ATTR_SUPPORT  Specifies the minimum percentage support required for attributes in rules for k-Means clusters.
KMNS_NUM_BINS              Specifies the number of histogram bins for k-Means.
KMNS_SPLIT_CRITERION       Specifies the split criterion for k-Means; kmns_variance uses variance as the split criterion.

Create the settings used to initialize the k-means run:
CREATE TABLE km_sh_sample_settings (
   setting_name  VARCHAR2(30),
   setting_value VARCHAR2(30));

Table created
BEGIN
  INSERT INTO km_sh_sample_settings (setting_name, setting_value) VALUES (dbms_data_mining.kmns_distance,dbms_data_mining.kmns_euclidean);
  INSERT INTO km_sh_sample_settings (setting_name, setting_value) VALUES (dbms_data_mining.CLUS_NUM_CLUSTERS,2);  --k = 2: cluster into two groups
  INSERT INTO km_sh_sample_settings (setting_name, setting_value) VALUES (dbms_data_mining.kmns_iterations,3); --default number of iterations
  INSERT INTO km_sh_sample_settings (setting_name, setting_value) VALUES (dbms_data_mining.kmns_block_growth,2);
  INSERT INTO km_sh_sample_settings (setting_name, setting_value) VALUES (dbms_data_mining.kmns_conv_tolerance,0.01);
  INSERT INTO km_sh_sample_settings (setting_name, setting_value) VALUES (dbms_data_mining.kmns_split_criterion,dbms_data_mining.kmns_variance);
  INSERT INTO km_sh_sample_settings (setting_name, setting_value) VALUES (dbms_data_mining.kmns_min_pct_attr_support,0.1);
  INSERT INTO km_sh_sample_settings (setting_name, setting_value) VALUES (dbms_data_mining.kmns_num_bins,5);
  COMMIT;
 END;
/

--Create the model
BEGIN DBMS_DATA_MINING.DROP_MODEL('KM_SH_Clus_sample');
EXCEPTION WHEN OTHERS THEN NULL; END;
/
PL/SQL procedure successfully completed

BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'KM_SH_Clus_sample',
    mining_function     => dbms_data_mining.clustering,
    data_table_name     => 'T_SCORE',
    case_id_column_name => 'id',
    settings_table_name => 'km_sh_sample_settings');
END;
/
PL/SQL procedure successfully completed

--Detailed model settings
SQL> column setting_name format a30
SQL> column setting_value format a30
SQL> SELECT setting_name, setting_value
  2    FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SETTINGS('KM_SH_Clus_sample'))
  3  ORDER BY setting_name;

SETTING_NAME                   SETTING_VALUE
------------------------------ ------------------------------
CLUS_NUM_CLUSTERS              2
KMNS_BLOCK_GROWTH              2
KMNS_CONV_TOLERANCE            .01
KMNS_DISTANCE                  KMNS_EUCLIDEAN
KMNS_ITERATIONS                3
KMNS_MIN_PCT_ATTR_SUPPORT      .1
KMNS_NUM_BINS                  5
KMNS_SPLIT_CRITERION           KMNS_VARIANCE
PREP_AUTO                      OFF  -- automatic data preparation is disabled

Model details

SQL> SELECT T.id           cluster_id,
  2         T.record_count record_cnt,
  3         T.parent       parent,
  4         T.tree_level   tree_level,
  5         T.dispersion   dispersion_sse --T.dispersion is the SSE (sum of squared errors)
  6    FROM (SELECT *
  7            FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_KM('KM_SH_Clus_sample'))
  8           ORDER BY id) T;

CLUSTER_ID RECORD_CNT     PARENT TREE_LEVEL DISPERSION_SSE
---------- ---------- ---------- ---------- --------------
         1          5                     1           86.8
         2          3          1          2     9.55555556
         3          2          1          2              5

This is a tree structure: the cluster with parent = 1 hangs off the root node, which has 2 child clusters, one covering 3 records and the other 2 records.

SQL> SELECT CLUSTER_ID(km_sh_clus_sample USING *) AS clus,t.*
  2    FROM t_score t
  3  order by 2;

      CLUS         ID HISTORY_SCORE GEOGRAPHY_SCORE
---------- ---------- ------------- ---------------
         2          1            84              49
         3          2            69              33     --cluster 3 (ids 2 and 3) forms one group and cluster 2 the rest,
         3          3            67              37     --matching the manual clustering computed above.
         2          4            78              45
         2          5            81              49

Scikit-learn k-means test

The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Tue Aug 22 15:45:41 2017

@author:Leo Zhang
"""

from sklearn.cluster import KMeans
import numpy as np
from matplotlib import pyplot
x = np.array([[84, 49], [69, 33], [67, 37], [78, 45], [81, 49]])
#This function creates the classifier
#n_clusters is the number of clusters you want to use to classify your data
kmeans = KMeans(n_clusters=2, random_state=0).fit(x)

centers = kmeans.cluster_centers_ # centroids of the two clusters
labels = kmeans.labels_   # cluster assignment of each data point
print(centers)
print(labels)
 
for i in range(len(labels)):
    pyplot.scatter(x[i][0], x[i][1], c=('r' if labels[i] == 0 else 'b'))
pyplot.scatter(centers[:,0],centers[:,1],marker='*', s=100)
 
# Prediction
#predict = [[45,67], [72,69]]
#label = kmeans.predict(predict)
#for i in range(len(label)):
#    pyplot.scatter(predict[i][0], predict[i][1], c=('r' if label[i] == 0 else 'b'), marker='x')
 
pyplot.show()


Output
[[ 81.          47.66666667]
 [ 68.          35.        ]]
[0 1 1 0 0]

The resulting scatter plot shows the two clusters in red and blue, with the cluster centroids marked by stars.


Oracle Data Mining-Naive Bayes

Classification problems are nothing unfamiliar: we classify things every day without being much aware of it, for example judging whether a person is fat or thin, male or female, and so on.
Below is a brief introduction to the naive Bayes theorem; for a more detailed treatment of the naive Bayes classifier algorithm, see the Wikipedia article on the Naive Bayes classifier.

By Bayes' theorem, the posterior probability P(c|x) = P(x,c)/P(x) can be written as
P(c|x) = P(c) P(x|c) / P(x),
where P(c) is the class "prior" probability, P(x|c) is the class-conditional probability (also called the "likelihood") of sample x given class label c, and P(x) is the "evidence" factor used for normalization. For a given sample x, the evidence P(x) does not depend on the class label, so estimating P(c|x) reduces to estimating the prior P(c) and the likelihood P(x|c) from the training set D.
Minimizing the conditional risk yields the maximum-a-posteriori rule that naive Bayes applies (assuming attribute independence):

    h(x) = argmax over c of  P(c) * prod_{i=1..d} P(xi | c)

Oracle's data mining algorithms also include naive Bayes, and it is the default among the classification algorithms.
This article runs an example test of the naive Bayes classifier in Oracle Data Mining.
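To make the MAP rule concrete, here is a toy Python computation. The attribute names and conditional probabilities are made up for illustration, not estimated from the Adult data; only the priors 0.65/0.35 match the priors table used later in this article:

```python
import math

priors = {0: 0.65, 1: 0.35}                      # P(c): class prior probabilities
conditionals = {                                  # P(x_i | c): hypothetical values
    0: {"sex=Male": 0.67, "hours_per_week>40": 0.20},
    1: {"sex=Male": 0.85, "hours_per_week>40": 0.45},
}

def map_class(attributes):
    """argmax_c P(c) * prod_i P(x_i | c); the evidence P(x) cancels out."""
    scores = {c: priors[c] * math.prod(conditionals[c][a] for a in attributes)
              for c in priors}
    return max(scores, key=scores.get), scores

label, scores = map_class(["sex=Male", "hours_per_week>40"])
print(label, scores)
```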

The experimental data is the Adult Data Set from the UC Irvine Machine Learning Repository, imported into an Oracle 11.2.0.4 database with SQL Developer.
The table is created as follows:

create table T_ADULT_DATA
(
  id             NUMBER(8),   ---id is an added column, increasing from 1 to 32561, used to identify rows during classification
  age            NUMBER(8),
  workclass      VARCHAR2(100),
  fnlwgt         NUMBER(8),
  education      VARCHAR2(100),
  education_num  NUMBER(8),
  marital_status VARCHAR2(100),
  occupation     VARCHAR2(100),
  relationship   VARCHAR2(100),
  race           VARCHAR2(100),
  sex            VARCHAR2(100),
  capital_gain   NUMBER(8),
  capital_loss   NUMBER(8),
  hours_per_week NUMBER(8),
  native_country VARCHAR2(100),
  income         VARCHAR2(100)
);

Some fields in the raw data are missing; the missing values appear as "?" in the csv file. They are handled below by replacing them with empty values. The target column is also converted into a binary label, 0 or 1.
update t_adult_Data set age              =  replace(age             ,'?','');  --handle the other columns the same way
commit;
update t_adult_Data set age              =  replace(age             ,' ',null);  --handle the other columns the same way
commit;
update  T_ADULT_DATA  set income = 0 where income ='<=50K';--0 = income <= 50K
update  T_ADULT_DATA  set income = 1 where income ='>50K'; --1 = income > 50K
commit;
drop table v_build_t_adult_data purge;
drop table v_test_t_adult_data purge;
create table v_build_t_adult_data as select * from T_ADULT_DATA where id <= 21707;     ---21707 rows, 67% of the table, used for training
create table v_test_t_adult_data as select * from T_ADULT_DATA where id > 21707;       ---10854 rows, 33% of the table, used for validation

SQL> select min(id),max(id) from T_ADULT_DATA t
  2  ;

   MIN(ID)    MAX(ID)
---------- ----------
         1      32561
SQL> select min(id),max(id) from V_TEST_T_ADULT_DATA t;

   MIN(ID)    MAX(ID)
---------- ----------
     21708      32561

Next, apply the naive Bayes classifier to the data set.

Initialize the naive Bayes prior probabilities

SQL> CREATE TABLE nb_sh_sample_priors (
  2    target_value      NUMBER,
  3    prior_probability NUMBER);
Table created.
SQL> INSERT INTO nb_sh_sample_priors VALUES (0,0.65);
1 row created.
SQL> INSERT INTO nb_sh_sample_priors VALUES (1,0.35);
1 row created.
SQL> commit;
Commit complete.

Create the settings table and initialize it

SQL> CREATE TABLE nb_sh_sample_settings (
  2    setting_name  VARCHAR2(30),
  3    setting_value VARCHAR2(4000));

Table created.
SQL> BEGIN
  2    INSERT INTO nb_sh_sample_settings (setting_name, setting_value) VALUES
  3      (dbms_data_mining.prep_auto,dbms_data_mining.prep_auto_on);
  4    INSERT INTO nb_sh_sample_settings VALUES
  5      (dbms_data_mining.clas_priors_table_name, 'nb_sh_sample_priors');
  6    COMMIT;
  7  END;
  8  /
PL/SQL procedure successfully completed.

Create the classification model, based on the naive Bayes algorithm

SQL> BEGIN
  2    DBMS_DATA_MINING.CREATE_MODEL(
  3      model_name          => 'NB_SH_Clas_sample',     --model name
  4      mining_function     => dbms_data_mining.classification,   --classification
  5      data_table_name     => 'v_build_t_adult_data',   ---build data set
  6      case_id_column_name => 'id',    ---case (row) identifier
  7      target_column_name  => 'income',   --classification target, 0 or 1
  8      settings_table_name => 'nb_sh_sample_settings');  --naive Bayes settings table
  9  END;
 10  /
PL/SQL procedure successfully completed.

Query the model information

SQL> select * from user_mining_models;

MODEL_NAME           MINING_FUNCTION      ALGORITHM            CREATION_DATE       BUILD_DURATION MODEL_SIZE COMME
-------------------- -------------------- -------------------- ------------------- -------------- ---------- -----
NB_SH_CLAS_SAMPLE    CLASSIFICATION       NAIVE_BAYES          2017-08-08 13:35:46              2      .0614

Model settings

SQL> column setting_name format a30
SQL> column setting_value format a30
SQL> SELECT setting_name, setting_value, setting_type
  2    FROM user_mining_model_settings
  3   WHERE model_name = 'NB_SH_CLAS_SAMPLE'
  4   ORDER BY setting_name;
SETTING_NAME                   SETTING_VALUE                  SETTING_TYPE
------------------------------ ------------------------------ --------------
ALGO_NAME                      ALGO_NAIVE_BAYES               DEFAULT
CLAS_PRIORS_TABLE_NAME         nb_sh_sample_priors            INPUT
NABS_PAIRWISE_THRESHOLD        0                              DEFAULT
NABS_SINGLETON_THRESHOLD       0                              DEFAULT
PREP_AUTO                      ON                             INPUT

Two parameters appear here, NABS_PAIRWISE_THRESHOLD and NABS_SINGLETON_THRESHOLD, involving two terms, pairwise and singleton; both parameters take values between 0 and 1.
Quoting the official documentation:
"The cases where both conditions occur together are referred to as pairwise.
The cases where only the prior event occurs are referred to as singleton."
For example, P(AB) is a pairwise probability and P(A) is a singleton probability.

Query the MODEL SIGNATURE

SQL> select attribute_name, attribute_type,data_type,target from USER_MINING_MODEL_ATTRIBUTES;
ATTRIBUTE_NAME                                               ATTRIBUTE_TYPE         DATA_TYPE                TARGET
------------------------------------------------------------ ---------------------- ------------------------ ------
AGE                                                          NUMERICAL              NUMBER                   NO
WORKCLASS                                                    CATEGORICAL            VARCHAR2                 NO
EDUCATION                                                    CATEGORICAL            VARCHAR2                 NO
EDUCATION_NUM                                                NUMERICAL              NUMBER                   NO
MARITAL_STATUS                                               CATEGORICAL            VARCHAR2                 NO
OCCUPATION                                                   CATEGORICAL            VARCHAR2                 NO
RELATIONSHIP                                                 CATEGORICAL            VARCHAR2                 NO
RACE                                                         CATEGORICAL            VARCHAR2                 NO
SEX                                                          CATEGORICAL            VARCHAR2                 NO
CAPITAL_GAIN                                                 NUMERICAL              NUMBER                   NO
CAPITAL_LOSS                                                 NUMERICAL              NUMBER                   NO
HOURS_PER_WEEK                                               NUMERICAL              NUMBER                   NO
NATIVE_COUNTRY                                               CATEGORICAL            VARCHAR2                 NO
INCOME                                                       CATEGORICAL            VARCHAR2                 YES   ---target column: the attribute to predict

Get the detailed model information

SQL> column tname format a14
SQL> column tval format a4
SQL> column pname format a40
SQL> column pval format a200
SQL> column priorp format 9.9999
SQL> column condp format 9.9999
SQL> SELECT T.prior_probability                                           priorp,
  2         C.conditional_probability                                      condp,
  3         T.target_attribute_name                                        tname,
  4         TO_CHAR(
  5         NVL(T.target_attribute_num_value,T.target_attribute_str_value)) tval,
  6         C.attribute_name                                               pname,
  7         NVL(C.attribute_str_value, C.attribute_num_value)               pval
  8    FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_NB('NB_SH_Clas_sample')) T,
  9         TABLE(T.conditionals) C
 10  ORDER BY 1,2,3,4,5,6;
PRIORP   CONDP TNAME          TVAL PNAME                                    PVAL
------- ------- -------------- ---- ---------------------------------------- ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  .3500   .0002 INCOME         1    CAPITAL_LOSS                             (1568.5; 1820.5]
  .3500   .0004 INCOME         1    CAPITAL_LOSS                             (77.5; 1446.5]
  .3500   .0008 INCOME         1    CAPITAL_GAIN                             (4932.5; 4973.5]
  .3500   .0023 INCOME         1    CAPITAL_GAIN                             (5316.5; 6618.5]
  .3500   .0033 INCOME         1    AGE                                      ( ; 23.5), [23.5; 23.5]
  .3500   .0035 INCOME         1    CAPITAL_GAIN                             (4668.5; 4826]
  .3500   .0041 INCOME         1    CAPITAL_LOSS                             (1446.5; 1494.5]
  .3500   .0041 INCOME         1    CAPITAL_LOSS                             (1551.5; 1568.5]
  .3500   .0046 INCOME         1    AGE                                      (23.5; 24.5]
  .3500   .0048 INCOME         1    CAPITAL_LOSS                             (2161.5; 2384.5]
  .3500   .0066 INCOME         1    CAPITAL_LOSS                             (1820.5; 1862]
  .3500   .0089 INCOME         1    CAPITAL_GAIN                             (4243.5; 4401]
  .3500   .0089 INCOME         1    RELATIONSHIP                             'Own-child'
  .3500   .0114 INCOME         1    CAPITAL_GAIN                             (3048; 3120]
  .3500   .0114 INCOME         1    CAPITAL_GAIN                             (5095.5; 5316.5]
  .3500   .0128 INCOME         1    CAPITAL_LOSS                             (2384.5; 3726.5]
  .3500   .0130 INCOME         1    NATIVE_COUNTRY                           'Columbia', 'Dominican-Republic', 'Ecuador', 'El-Salvador', 'Guatemala', 'Haiti', 'Holand-Netherlands', 'Honduras', 'Hungary', 'Jamaica', 'Laos', 'Mexico', 'Nicaragua', 'Outlying-US(Guam-USVI-etc)', 'Peru', 'Portugal', 'Puerto-Rico', 'Scotland', 'Trinadad&Tobago', 'Vietnam'
  .3500   .0199 INCOME         1    CAPITAL_GAIN                             (70654.5;  )
  .3500   .0228 INCOME         1    CAPITAL_LOSS                             (1975.5; 1978.5]
  .3500   .0248 INCOME         1    AGE                                      (24.5; 27.5]
  .3500   .0304 INCOME         1    AGE                                      (27.5; 29.5]
  .3500   .0310 INCOME         1    RELATIONSHIP                             'Other-relative', 'Unmarried'
  .3500   .0316 INCOME         1    OCCUPATION                               'Armed-Forces', 'Handlers-cleaners', 'Other-service', 'Priv-house-serv'
  .3500   .0327 INCOME         1    EDUCATION                                '10th', '11th', '12th', '1st-4th', '5th-6th', '7th-8th', '9th', 'Preschool'
  .3500   .0327 INCOME         1    EDUCATION_NUM                            ( ; 8.5), [8.5; 8.5]
  .3500   .0434 INCOME         1    CAPITAL_LOSS                             (1881.5; 1923]
  .3500   .0457 INCOME         1    HOURS_PER_WEEK                           (34.5; 39.5]
  .3500   .0491 INCOME         1    WORKCLASS                                'Federal-gov'
  .3500   .0505 INCOME         1    HOURS_PER_WEEK                           ( ; 34.5), [34.5; 34.5]
  .3500   .0571 INCOME         1    AGE                                      (61.5;  )
  .3500   .0581 INCOME         1    RACE                                     'Amer-Indian-Eskimo', 'Black', 'Other'
  .3500   .0651 INCOME         1    OCCUPATION                               'Protective-serv', 'Tech-support'
  .3500   .0693 INCOME         1    CAPITAL_GAIN                             (10585.5; 30961.5]
  .3500   .0707 INCOME         1    MARITAL_STATUS                           'Divorced', 'Married-spouse-absent', 'Widowed'
  .3500   .0740 INCOME         1    MARITAL_STATUS                           'Never-married', 'Separated'
  .3500   .0788 INCOME         1    EDUCATION                                'Assoc-acdm', 'Assoc-voc'
  .3500   .0788 INCOME         1    EDUCATION_NUM                            (10.5; 12.5]
  .3500   .0824 INCOME         1    WORKCLASS                                'Self-emp-inc'
  .3500   .0846 INCOME         1    CAPITAL_GAIN                             (7073.5; 10543]
  .3500   .0879 INCOME         1    EDUCATION                                'Doctorate', 'Prof-school'
  .3500   .0879 INCOME         1    EDUCATION_NUM                            (14.5;  )
  .3500   .1098 INCOME         1    RELATIONSHIP                             'Not-in-family'
  .3500   .1150 INCOME         1    OCCUPATION                               'Adm-clerical', 'Farming-fishing', 'Machine-op-inspct'
  .3500   .1214 INCOME         1    EDUCATION                                'Masters'
  .3500   .1214 INCOME         1    EDUCATION_NUM                            (13.5; 14.5]
  .3500   .1320 INCOME         1    HOURS_PER_WEEK                           (41.5; 49.5]
  .3500   .1520 INCOME         1    SEX                                      Female
  .3500   .1545 INCOME         1    AGE                                      (29.5; 35.5]
  .3500   .2224 INCOME         1    WORKCLASS                                'Local-gov', 'Self-emp-not-inc', 'State-gov'
  .3500   .2873 INCOME         1    EDUCATION                                'Bachelors'
  .3500   .2873 INCOME         1    EDUCATION_NUM                            (12.5; 13.5]
  .3500   .2906 INCOME         1    OCCUPATION                               'Craft-repair', 'Sales', 'Transport-moving'
  .3500   .3526 INCOME         1    HOURS_PER_WEEK                           (49.5;  )
  .3500   .3919 INCOME         1    EDUCATION                                'HS-grad', 'Some-college'
  .3500   .3919 INCOME         1    EDUCATION_NUM                            (8.5; 10.5]
  .3500   .4192 INCOME         1    HOURS_PER_WEEK                           (39.5; 41.5]
  .3500   .4977 INCOME         1    OCCUPATION                               'Exec-managerial', 'Prof-specialty'
  .3500   .6461 INCOME         1    WORKCLASS                                'Never-worked', 'Private', 'Without-pay'
  .3500   .7253 INCOME         1    AGE                                      (35.5; 61.5]
  .3500   .7878 INCOME         1    CAPITAL_GAIN                             ( ; 57), [57; 57]
  .3500   .8480 INCOME         1    SEX                                      Male
  .3500   .8503 INCOME         1    RELATIONSHIP                             'Husband', 'Wife'
  .3500   .8554 INCOME         1    MARITAL_STATUS                           'Married-AF-spouse', 'Married-civ-spouse'
  .3500   .9009 INCOME         1    CAPITAL_LOSS                             ( ; 77.5), [77.5; 77.5]
  .3500   .9419 INCOME         1    RACE                                     'Asian-Pac-Islander', 'White'
  .3500   .9870 INCOME         1    NATIVE_COUNTRY                           'Cambodia', 'Canada', 'China', 'Cuba', 'England', 'France', 'Germany', 'Greece', 'Hong', 'India', 'Iran', 'Ireland', 'Italy', 'Japan', 'Philippines', 'Poland', 'South', 'Taiwan', 'Thailand', 'United-States', 'Yugoslavia'
  .6500   .0001 INCOME         0    CAPITAL_GAIN                             (10585.5; 30961.5]
  .6500   .0001 INCOME         0    CAPITAL_LOSS                             (1820.5; 1862]
  .6500   .0001 INCOME         0    CAPITAL_GAIN                             (3048; 3120]
  .6500   .0002 INCOME         0    CAPITAL_GAIN                             (7073.5; 10543]
  .6500   .0003 INCOME         0    CAPITAL_LOSS                             (3726.5;  )
  .6500   .0004 INCOME         0    CAPITAL_GAIN                             (10543; 10585.5]
  .6500   .0004 INCOME         0    CAPITAL_LOSS                             (2384.5; 3726.5]
  .6500   .0004 INCOME         0    CAPITAL_GAIN                             (30961.5; 70654.5]
  .6500   .0004 INCOME         0    CAPITAL_LOSS                             (1881.5; 1923]
  .6500   .0005 INCOME         0    CAPITAL_GAIN                             (4243.5; 4401]
  .6500   .0007 INCOME         0    CAPITAL_LOSS                             (1923; 1975.5]
  .6500   .0007 INCOME         0    CAPITAL_LOSS                             (1446.5; 1494.5]
  .6500   .0008 INCOME         0    CAPITAL_GAIN                             (4826; 4932.5]
  .6500   .0009 INCOME         0    CAPITAL_LOSS                             (1494.5; 1551.5]
  .6500   .0012 INCOME         0    CAPITAL_GAIN                             (5316.5; 6618.5]
  .6500   .0014 INCOME         0    CAPITAL_GAIN                             (6618.5; 7073.5]
  .6500   .0020 INCOME         0    CAPITAL_LOSS                             (1862; 1881.5]
  .6500   .0028 INCOME         0    CAPITAL_GAIN                             (4401; 4668.5]
  .6500   .0028 INCOME         0    CAPITAL_GAIN                             (4973.5; 5095.5]
  .6500   .0033 INCOME         0    CAPITAL_LOSS                             (2161.5; 2384.5]
  .6500   .0034 INCOME         0    CAPITAL_LOSS                             (77.5; 1446.5]
  .6500   .0046 INCOME         0    CAPITAL_LOSS                             (1978.5; 2161.5]
  .6500   .0106 INCOME         0    EDUCATION                                'Doctorate', 'Prof-school'
  .6500   .0106 INCOME         0    EDUCATION_NUM                            (14.5;  )
  .6500   .0126 INCOME         0    CAPITAL_GAIN                             (3120; 4243.5]
  .6500   .0141 INCOME         0    CAPITAL_LOSS                             (1568.5; 1820.5]
  .6500   .0187 INCOME         0    CAPITAL_GAIN                             (57; 3048]
  .6500   .0210 INCOME         0    WORKCLASS                                'Self-emp-inc'
  .6500   .0269 INCOME         0    WORKCLASS                                'Federal-gov'
  .6500   .0307 INCOME         0    EDUCATION                                'Masters'
  .6500   .0307 INCOME         0    EDUCATION_NUM                            (13.5; 14.5]
  .6500   .0314 INCOME         0    AGE                                      (23.5; 24.5]
  .6500   .0459 INCOME         0    OCCUPATION                               'Protective-serv', 'Tech-support'
  .6500   .0544 INCOME         0    NATIVE_COUNTRY                           'Columbia', 'Dominican-Republic', 'Ecuador', 'El-Salvador', 'Guatemala', 'Haiti', 'Holand-Netherlands', 'Honduras', 'Hungary', 'Jamaica', 'Laos', 'Mexico', 'Nicaragua', 'Outlying-US(Guam-USVI-etc)', 'Peru', 'Portugal', 'Puerto-Rico', 'Scotland', 'Trinadad&Tobago', 'Vietnam'
  .6500   .0568 INCOME         0    AGE                                      (27.5; 29.5]
  .6500   .0660 INCOME         0    AGE                                      (61.5;  )
  .6500   .0724 INCOME         0    EDUCATION                                'Assoc-acdm', 'Assoc-voc'
  .6500   .0724 INCOME         0    EDUCATION_NUM                            (10.5; 12.5]
  .6500   .0743 INCOME         0    HOURS_PER_WEEK                           (34.5; 39.5]
  .6500   .0805 INCOME         0    HOURS_PER_WEEK                           (41.5; 49.5]
  .6500   .0923 INCOME         0    AGE                                      (24.5; 27.5]
  .6500   .1279 INCOME         0    EDUCATION                                'Bachelors'
  .6500   .1279 INCOME         0    EDUCATION_NUM                            (12.5; 13.5]
  .6500   .1305 INCOME         0    RACE                                     'Amer-Indian-Eskimo', 'Black', 'Other'
  .6500   .1483 INCOME         0    HOURS_PER_WEEK                           (49.5;  )
  .6500   .1586 INCOME         0    EDUCATION                                '10th', '11th', '12th', '1st-4th', '5th-6th', '7th-8th', '9th', 'Preschool'
  .6500   .1586 INCOME         0    EDUCATION_NUM                            ( ; 8.5), [8.5; 8.5]
  .6500   .1626 INCOME         0    AGE                                      (29.5; 35.5]
  .6500   .1671 INCOME         0    RELATIONSHIP                             'Other-relative', 'Unmarried'
  .6500   .1856 INCOME         0    WORKCLASS                                'Local-gov', 'Self-emp-not-inc', 'State-gov'
  .6500   .1883 INCOME         0    OCCUPATION                               'Exec-managerial', 'Prof-specialty'
  .6500   .1913 INCOME         0    AGE                                      ( ; 23.5), [23.5; 23.5]
  .6500   .1973 INCOME         0    OCCUPATION                               'Armed-Forces', 'Handlers-cleaners', 'Other-service', 'Priv-house-serv'
  .6500   .2015 INCOME         0    RELATIONSHIP                             'Own-child'
  .6500   .2104 INCOME         0    HOURS_PER_WEEK                           ( ; 34.5), [34.5; 34.5]
  .6500   .2164 INCOME         0    MARITAL_STATUS                           'Divorced', 'Married-spouse-absent', 'Widowed'
  .6500   .2597 INCOME         0    OCCUPATION                               'Adm-clerical', 'Farming-fishing', 'Machine-op-inspct'
  .6500   .3049 INCOME         0    RELATIONSHIP                             'Not-in-family'
  .6500   .3088 INCOME         0    OCCUPATION                               'Craft-repair', 'Sales', 'Transport-moving'
  .6500   .3265 INCOME         0    RELATIONSHIP                             'Husband', 'Wife'
  .6500   .3345 INCOME         0    MARITAL_STATUS                           'Married-AF-spouse', 'Married-civ-spouse'
  .6500   .3862 INCOME         0    SEX                                      Female
  .6500   .3995 INCOME         0    AGE                                      (35.5; 61.5]
  .6500   .4491 INCOME         0    MARITAL_STATUS                           'Never-married', 'Separated'
  .6500   .4865 INCOME         0    HOURS_PER_WEEK                           (39.5; 41.5]
  .6500   .5997 INCOME         0    EDUCATION                                'HS-grad', 'Some-college'
  .6500   .5997 INCOME         0    EDUCATION_NUM                            (8.5; 10.5]
  .6500   .6138 INCOME         0    SEX                                      Male
  .6500   .7664 INCOME         0    WORKCLASS                                'Never-worked', 'Private', 'Without-pay'
  .6500   .8695 INCOME         0    RACE                                     'Asian-Pac-Islander', 'White'
  .6500   .9456 INCOME         0    NATIVE_COUNTRY                           'Cambodia', 'Canada', 'China', 'Cuba', 'England', 'France', 'Germany', 'Greece', 'Hong', 'India', 'Iran', 'Ireland', 'Italy', 'Japan', 'Philippines', 'Poland', 'South', 'Taiwan', 'Thailand', 'United-States', 'Yugoslavia'
  .6500   .9580 INCOME         0    CAPITAL_GAIN                             ( ; 57), [57; 57]
  .6500   .9692 INCOME         0    CAPITAL_LOSS                             ( ; 77.5), [77.5; 77.5]

已选择139行。

Create a view over the hold-out data, then apply the Naive Bayes model built above to it. The APPLY result table, nb_sh_sample_test_apply, contains three columns: ID, PREDICTION, and PROBABILITY.

SQL> CREATE OR REPLACE VIEW nb_sh_sample_test_targets AS
  2  SELECT id, income
  3    FROM v_test_t_adult_data;
视图已创建。

SQL> BEGIN
  2    DBMS_DATA_MINING.APPLY(
  3      model_name          => 'NB_SH_Clas_sample',
  4      data_table_name     => 'v_test_t_adult_data',
  5      case_id_column_name => 'id',
  6      result_table_name   => 'nb_sh_sample_test_apply');
  7  END;
  8  /

PL/SQL 过程已成功完成。

To evaluate a Naive Bayes classifier, Oracle supports a cost (COST) matrix; the related evaluation tools are the confusion matrix, the ROC curve, and the LIFT curve.

A confusion matrix is an N x N matrix, where N is the number of classes. For a binary classification problem N = 2, so the confusion matrix is 2 x 2.
The matrix below is adapted (with additions) from the article "7 Important Model Evaluation Error Metrics Everyone Should Know".
[figure: confusion matrix]
A few of the terms deserve explanation:
Accuracy: the fraction of all cases that are predicted correctly.
Positive Predictive Value: the fraction of positive predictions that are correct.
Negative Predictive Value: the fraction of negative predictions that are correct.
Sensitivity: the fraction of actual positives that are predicted correctly.
Specificity: the fraction of actual negatives that are predicted correctly.

Combining each case's true class with the model's predicted class gives four outcomes: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Letting TP/FP/TN/FN denote the corresponding case counts, TP + FP + TN + FN equals the total number of cases.
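As a quick illustration (plain Python, not an Oracle API), the five metrics above can be derived directly from the four counts; the sample counts below are hypothetical:

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the five evaluation metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy":    (tp + tn) / total,   # correct predictions / all cases
        "ppv":         tp / (tp + fp),      # positive predictive value
        "npv":         tn / (tn + fn),      # negative predictive value
        "sensitivity": tp / (tp + fn),      # recall on actual positives
        "specificity": tn / (tn + fp),      # recall on actual negatives
    }

# Hypothetical counts, for illustration only
m = classification_metrics(tp=80, fp=20, tn=90, fn=10)
print(round(m["accuracy"], 3))   # (80+90)/200 = 0.85
```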

For the ROC and LIFT curves, see the Wikipedia articles "Receiver operating characteristic" and "Lift (data mining)".

Create the cost matrix table and populate it (the INSERT statements are not shown):

SQL> CREATE TABLE nb_sh_cost (
  2    actual_target_value    NUMBER,
  3    predicted_target_value NUMBER,
  4    cost                   NUMBER);

SQL> select * from nb_sh_cost;
ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE       COST
------------------- ---------------------- ----------
                  0                      0          0
                  0                      1        .35
                  1                      0        .65
                  1                      1          0

Compute the confusion matrix with the cost matrix applied:

SQL> DECLARE
  2    v_accuracy         NUMBER;
  3    v_area_under_curve NUMBER;
  4  BEGIN
  5     DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX (
  6       accuracy                    => v_accuracy,
  7       apply_result_table_name     => 'nb_sh_sample_test_apply',
  8       target_table_name           => 'nb_sh_sample_test_targets',
  9       case_id_column_name         => 'id',      -- case identifier
 10       target_column_name          => 'income',  -- target column
 11       confusion_matrix_table_name => 'nb_sh_sample_confusion_matrix',  -- output confusion matrix table
 12       score_column_name           => 'PREDICTION',   -- default
 13       score_criterion_column_name => 'PROBABILITY',  -- default
 14       cost_matrix_table_name      => 'nb_sh_cost');  -- cost matrix table
 15     DBMS_OUTPUT.PUT_LINE('**** MODEL ACCURACY IS****: ' || ROUND(v_accuracy,4));
 16  
 17     DBMS_DATA_MINING.COMPUTE_LIFT (
 18       apply_result_table_name => 'nb_sh_sample_test_apply',
 19       target_table_name       => 'nb_sh_sample_test_targets',
 20       case_id_column_name     => 'id',
 21       target_column_name      => 'income',
 22       lift_table_name         => 'nb_sh_sample_lift',   -- output lift table
 23       positive_target_value   => '1',
 24       num_quantiles           => '10',
 25       cost_matrix_table_name  => 'nb_sh_cost');
 26  
 27     DBMS_DATA_MINING.COMPUTE_ROC (
 28       roc_area_under_curve        => v_area_under_curve,
 29       apply_result_table_name     => 'nb_sh_sample_test_apply',
 30       target_table_name           => 'nb_sh_sample_test_targets',
 31       case_id_column_name         => 'id',
 32       target_column_name          => 'income',
 33       roc_table_name              => 'nb_sh_sample_roc',  -- output ROC table
 34       positive_target_value       => '1',
 35       score_column_name           => 'PREDICTION',
 36       score_criterion_column_name => 'PROBABILITY');
 37     DBMS_OUTPUT.PUT_LINE('**** AREA UNDER ROC CURVE ****: ' ||
 38       ROUND(v_area_under_curve,4));
 39  END;
 40  /
**** MODEL ACCURACY IS ****: .8065        -- overall accuracy is 80.65%
**** AREA UNDER ROC CURVE ****: .9199

The resulting confusion matrix (with cost applied) is as follows:

SQL> SELECT actual_target_value as actual,
  2         predicted_target_value as predicted,
  3         value as count
  4    FROM nb_sh_sample_confusion_matrix
  5  ORDER BY actual_target_value, predicted_target_value;
ACTUAL      PREDICTED      COUNT
---------- ---------- ----------
0                   0       6375
0                   1       1802
1                   0        298
1                   1       2379

The confusion matrix is:
6375  1802
298   2379

Compute the confusion matrix without a cost matrix:

SQL> DECLARE
  2    v_accuracy         NUMBER;
  3    v_area_under_curve NUMBER;
  4  BEGIN
  5     DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX (
  6       accuracy                    => v_accuracy,
  7       apply_result_table_name     => 'nb_sh_sample_test_apply',
  8       target_table_name           => 'nb_sh_sample_test_targets',
  9       case_id_column_name         => 'id',
 10       target_column_name          => 'income',
 11       confusion_matrix_table_name => 'nb_sh_sample_cm_no_cost',
 12       score_column_name           => 'PREDICTION',
 13       score_criterion_column_name => 'PROBABILITY');
 14     DBMS_OUTPUT.PUT_LINE('** ACCURACY W/ NO COST **: ' || ROUND(v_accuracy,4));
 15  END;
 16  /
** ACCURACY W/ NO COST **: .8217   -- the overall accuracy is higher without the cost matrix

SQL> SELECT actual_target_value as actual,
  2         predicted_target_value as predicted,
  3         value as count
  4    FROM nb_sh_sample_cm_no_cost
  5  ORDER BY actual_target_value, predicted_target_value;

ACTUAL   PREDICTED       COUNT
-------- ---------- ----------
0        0                6629
0        1                1548
1        0                 387
1        1                2290
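As a sanity check (plain Python, using the counts reported by COMPUTE_CONFUSION_MATRIX above), both accuracy figures can be reproduced from the matrices themselves:

```python
def accuracy(tn, fp, fn, tp):
    # accuracy = correctly classified cases / total cases
    return (tn + tp) / (tn + fp + fn + tp)

# Counts from the two confusion matrices above (actual 0/1 x predicted 0/1)
with_cost = accuracy(tn=6375, fp=1802, fn=298, tp=2379)
no_cost   = accuracy(tn=6629, fp=1548, fn=387, tp=2290)
print(round(with_cost, 4), round(no_cost, 4))  # 0.8065 0.8217
```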

This compares the model's accuracy with and without a cost matrix. Why introduce costs at all? Zhou Zhihua's book Machine Learning puts it well: different kinds of errors carry different consequences. In medical diagnosis, misdiagnosing a patient as healthy and misdiagnosing a healthy person as a patient both count as one prediction error, but the latter merely causes the inconvenience of further tests, while the former may cost the best chance to save a life. To weigh the differing losses of different error types, errors can be assigned "unequal costs".
By introducing a cost matrix, Oracle lets the model minimize the cost of misclassification while maximizing beneficial classification accuracy.

ROC metrics at minimum cost

True positive fraction: Hit rate. (true positives/(true positives + false negatives))
False positive fraction: False alarm rate. (false positives/(false positives + true negatives))
SQL> SELECT *
  2    FROM (SELECT ROUND(probability,4) prob,
  3                 true_positives  tp,
  4                 false_negatives fn,
  5                 false_positives fp,
  6                 true_negatives  tn,
  7                 ROUND(true_positive_fraction,4) tpf,
  8                 ROUND(false_positive_fraction,4) fpf,
  9                 .35 * false_positives + .65 * false_negatives nb_cost
 10           FROM nb_sh_sample_roc)
 11   WHERE nb_cost <= 760
 12   ORDER BY nb_cost;
      PROB         TP         FN         FP         TN        TPF        FPF    NB_COST
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
     .6587       2166        511       1204       6973      .8091      .1472     753.55
     .6531       2168        509       1211       6966      .8099      .1481      754.7
     .6524       2169        508       1214       6963      .8102      .1485      755.1
     .6667       2161        516       1201       6976      .8072      .1469     755.75
     .6701       2159        518       1200       6977      .8065      .1468      756.7
     .6493       2170        507       1221       6956      .8106      .1493      756.9
     .6712       2148        529       1182       6995      .8024      .1446     757.55
     .6705       2150        527       1186       6991      .8031       .145     757.65
     .6732       2147        530       1181       6996       .802      .1444     757.85
     .6711       2149        528       1185       6992      .8028      .1449     757.95
     .6774       2143        534       1178       6999      .8005      .1441      759.4
     .6782       2142        535       1177       7000      .8001      .1439      759.7
      .679       2141        536       1176       7001      .7998      .1438        760
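For the top (lowest-cost) row of this listing, the TPF/FPF formulas above and the query's nb_cost expression can be reproduced in plain Python:

```python
# Counts from the first ROC row above
tp, fn, fp, tn = 2166, 511, 1204, 6973

tpf = tp / (tp + fn)              # hit rate
fpf = fp / (fp + tn)              # false alarm rate
nb_cost = 0.35 * fp + 0.65 * fn   # same weighting as in the query

print(round(tpf, 4), round(fpf, 4), round(nb_cost, 2))  # 0.8091 0.1472 753.55
```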

The query above finds 13 probability thresholds at which the total cost of the confusion matrix stays below 760.
Next, derive a new cost matrix from the minimum-cost threshold and recompute the confusion matrix with it.

SQL> CREATE TABLE nb_alter_cost AS
  2  WITH
  3  cost_q AS (
  4  SELECT probability,
  5         (.35 * false_positives + .65 * false_negatives) nb_cost
  6    FROM nb_sh_sample_roc
  7  ),
  8  min_cost AS (
  9  SELECT MIN(nb_cost) mincost
 10    FROM cost_q
 11  ),
 12  prob_q AS (
 13  SELECT min(probability) prob
 14    FROM cost_q, min_cost
 15   WHERE nb_cost = mincost
 16  )
 17  SELECT 1 actual_target_value,
 18         0 predicted_target_value,
 19         (1.0 - prob)/prob cost
 20    FROM prob_q
 21  UNION ALL
 22  SELECT 0 actual_target_value,
 23         1 predicted_target_value,
 24         1 cost
 25    FROM dual
 26  UNION ALL
 27  SELECT 0 actual_target_value,
 28         0 predicted_target_value,
 29         0 cost
 30    FROM dual
 31  UNION ALL
 32  SELECT 1 actual_target_value,
 33         1 predicted_target_value,
 34         0 cost
 35    FROM dual;

表已创建。

SQL> column cost format 9.999999999
SQL> SELECT ACTUAL_TARGET_VALUE, PREDICTED_TARGET_VALUE, COST
  2    FROM nb_alter_cost;

ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE         COST
------------------- ---------------------- ------------
                  1                      0   .518064893
                  0                      1  1.000000000
                  0                      0   .000000000
                  1                      1   .000000000
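The CREATE TABLE above scales the cost matrix from the minimum-cost probability threshold: with the false-positive cost normalized to 1, the false-negative cost becomes (1 - p)/p. A plain-Python sketch, using the rounded threshold .6587 from the ROC listing, so the result only approximates the stored .518064893:

```python
p = 0.6587               # rounded minimum-cost threshold from the ROC table
fn_cost = (1 - p) / p    # false-negative cost when false-positive cost = 1
print(round(fn_cost, 4))  # ~0.5181, close to the stored .518064893
```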
SQL> DECLARE
  2    v_accuracy         NUMBER;
  3    v_area_under_curve NUMBER;
  4  BEGIN
  5     DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX (
  6       accuracy                    => v_accuracy,
  7       apply_result_table_name     => 'nb_sh_sample_test_apply',
  8       target_table_name           => 'nb_sh_sample_test_targets',
  9       case_id_column_name         => 'id',
 10       target_column_name          => 'income',
 11       confusion_matrix_table_name => 'nb_sh_alter_confusion_matrix',
 12       score_column_name           => 'PREDICTION',   -- default
 13       score_criterion_column_name => 'PROBABILITY',  -- default
 14       cost_matrix_table_name      => 'nb_alter_cost');
 15     DBMS_OUTPUT.PUT_LINE('**** MODEL ACCURACY ****: ' || ROUND(v_accuracy,4));
 16  END;
 17  /
**** MODEL ACCURACY ****: .842

The model's overall accuracy improved again.
The new predictions:

SQL> SELECT actual_target_value as actual,
  2         predicted_target_value as predicted,
  3         value as count
  4    FROM nb_sh_alter_confusion_matrix
  5    ORDER BY actual_target_value, predicted_target_value;
ACTUAL   PREDICTED      COUNT
-------- --------- ----------
0                0       6973
0                1       1204
1                0        511
1                1       2166
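The improved accuracy also follows directly from the new matrix (plain Python, counts from the query above):

```python
# Counts from nb_sh_alter_confusion_matrix (actual 0/1 x predicted 0/1)
tn, fp, fn, tp = 6973, 1204, 511, 2166
acc = (tn + tp) / (tn + fp + fn + tp)
print(round(acc, 3))  # 0.842, matching COMPUTE_CONFUSION_MATRIX
```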

The results above show that as the cost matrix is adjusted, the predictions shift: for instance, the number of 0 -> 1 mispredictions decreases. For a real business scenario, heavier cost penalties can be placed on the outcomes you most want to avoid, shrinking the unequal cost of errors.

Applying the model

SQL> BEGIN
  2    DBMS_DATA_MINING.APPLY(
  3      model_name          => 'NB_SH_Clas_sample',
  4      data_table_name     => 'v_test_t_adult_data',
  5      case_id_column_name => 'id',
  6      result_table_name   => 'nb_sh_sample_apply_result');
  7  END;
  8  /

PL/SQL 过程已成功完成。
SQL> column probability format 9.99999
SQL> column prediction format a9
SQL> SELECT id, prediction, ROUND(probability, 4) probability
  2    FROM nb_sh_sample_apply_result
  3   WHERE id <= 21715   4   ORDER BY id, prediction;         ID PREDICTIO PROBABILITY ---------- --------- -----------      21708 0              .34570      21708 1              .65430      21709 0              .99950      21709 1              .00050      21710 0             1.00000      21710 1              .00000      21711 0              .98940      21711 1              .01060      21712 0              .37200      21712 1              .62800      21713 0              .90050      21713 1              .09950      21714 0              .95170      21714 1              .04830      21715 0             1.00000      21715 1              .00000 已选择16行。 SQL> col income for a5
SQL> select id,income,PREDICTION (NB_SH_Clas_Sample USING *) PREDICTION FROM V_TEST_T_ADULT_DATA  WHERE id <= 21715;

        ID INCOM PREDICTIO
---------- ----- ---------
     21708 0     1
     21709 0     0
     21710 0     0
     21711 0     0
     21712 0     1
     21713 0     0
     21714 0     0
     21715 0     0

已选择8行。
Posted in Data Mining

Oracle Text Index-part8-Partition Index

Create a test partitioned table:

SQL> CREATE TABLE t_lexer(
  2  ID                NUMBER(10)       NOT NULL,
  3  CREATE_DATE       DATE             NOT NULL,
  4  DATA_ECHO_AREA    CLOB             NOT NULL
  5  )
  6  LOB(DATA_ECHO_AREA) store as securefile
  7  (disable storage in row)
  8  tablespace lexer
  9  LOGGING
 10  PCTFREE 10
 11  INITRANS 100
 12  PARTITION BY RANGE (CREATE_DATE)
 13  (PARTITION P_2017_08 VALUES LESS THAN (TO_DATE('09-01-2017', 'MM-DD-YYYY')),
 14  PARTITION P_2017_09 VALUES LESS THAN (TO_DATE('10-01-2017', 'MM-DD-YYYY')),
 15  PARTITION PMAX VALUES LESS THAN (MAXVALUE));

表已创建。

SQL>
SQL>
SQL>
SQL> alter table t_lexer add(
  2  constraint pk_t_lexer primary key(id)
  3  );

表已更改。

SQL>
SQL> create sequence seq_t_lexer;

序列已创建。

Initialize the partitioned table with data:

SQL> insert into t_lexer
  2    (id, create_date, DATA_ECHO_AREA)
  3  values
  4    (seq_t_lexer.nextval,
  5     sysdate,
  6     '故事发生在19世纪末的欧洲,善良可爱的苏菲被恶毒的女巫施下魔咒,从18岁的女孩变成90岁的婆婆 ');

已创建 1 行。

SQL>
SQL> insert into t_lexer
  2    (id, create_date, DATA_ECHO_AREA)
  3  values
  4    (seq_t_lexer.nextval,
  5     sysdate + 30,'这部由俄罗斯导演提莫贝克曼贝托夫执导的影片自6月末在北美上映以来
  6  ');

已创建 1 行。
SQL> commit;

提交完成。

SQL> col area for a100
SQL> select id,create_date,dbms_lob.substr(DATA_ECHO_AREA) AREA from T_LEXER partition(p_2017_08);

        ID CREATE_DATE         AREA
---------- ------------------- ----------------------------------------------------------------------------------------------------
         3 2017-08-04 10:24:20 故事发生在19世纪末的欧洲,善良可爱的苏菲被恶毒的女巫施下魔咒,从18岁的女孩变成90岁的婆婆

SQL> select id,create_date,dbms_lob.substr(DATA_ECHO_AREA) AREA from T_LEXER partition(p_2017_09);

        ID CREATE_DATE         AREA
---------- ------------------- --------------------------------------------------------------------------------
         4 2017-09-03 10:24:20 这部由俄罗斯导演提莫贝克曼贝托夫执导的影片自6月末在北美上映以来

Create the preferences: the lexer, the stoplist, and the tablespace for the token tables.

SQL> exec ctx_ddl.drop_preference ('audit_lexer');

PL/SQL 过程已成功完成。

SQL> exec ctx_ddl.drop_preference ('CUST_STORAGE');

PL/SQL 过程已成功完成。

SQL> exec ctx_ddl.drop_stoplist('audit_stoplist');

PL/SQL 过程已成功完成。

SQL> begin
  2    ctx_ddl.create_preference('audit_lexer', 'basic_lexer');
  3    ctx_ddl.set_attribute('audit_lexer', 'printjoins', '_');
  4    ctx_ddl.create_preference('cust_storage', 'BASIC_STORAGE');
  5    ctx_ddl.set_attribute('cust_storage', 'I_TABLE_CLAUSE', 'tablespace USERS storage (initial 4M)');  -- put the token tables in a separate tablespace
  6    ctx_ddl.set_attribute('cust_storage', 'K_TABLE_CLAUSE', 'tablespace USERS storage (initial 4M)');
  7    ctx_ddl.set_attribute('cust_storage', 'R_TABLE_CLAUSE', 'tablespace USERS storage (initial 4M)');
  8    ctx_ddl.set_attribute('cust_storage', 'N_TABLE_CLAUSE', 'tablespace USERS storage (initial 4M)');
  9    ctx_ddl.set_attribute('cust_storage', 'I_INDEX_CLAUSE', 'tablespace USERS storage (initial 4M)');
 10    ctx_ddl.create_stoplist('audit_stoplist');
 11    ctx_ddl.add_stopword('audit_stoplist', '---');
 12    ctx_ddl.add_stopword('audit_stoplist', 'from');
 13    ctx_ddl.add_stopword('audit_stoplist', 'and');
 14    ctx_ddl.add_stopword('audit_stoplist', 'not');
 15    ctx_ddl.add_stopword('audit_stoplist', 'or');
 16    ctx_ddl.add_stopword('audit_stoplist', 'into');
 17    ctx_ddl.add_stopword('audit_stoplist', ',');
 18    ctx_ddl.add_stopword('audit_stoplist', 'values');
 19    -- remaining stopwords omitted
 20  end;
 21  /

PL/SQL 过程已成功完成。

Create a LOCAL text index on the partitioned table:

SQL> CREATE INDEX IDX1_T_LEXER ON T_LEXER(data_echo_area)
  2  INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS('LEXER AUDIT_LEXER STORAGE CUST_STORAGE STOPLIST AUDIT_STOPLIST MEMORY 512M SYNC(EVERY SYSDATE+5/1440)')
  3  LOCAL (partition P_2017_08, partition P_2017_09) parallel 4;      -- lists only two of the table's three partitions
CREATE INDEX IDX1_T_LEXER ON T_LEXER(data_echo_area)
                             *
第 1 行出现错误:
ORA-14024: LOCAL 索引的分区数必须等于基础表的分区数

SQL> CREATE INDEX IDX1_T_LEXER ON T_LEXER(data_echo_area)
  2  INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS('LEXER AUDIT_LEXER STORAGE CUST_STORAGE STOPLIST AUDIT_STOPLIST MEMORY 512M SYNC(EVERY SYSDATE+5/1440)')
  3  LOCAL (partition P_2017_08, partition P_2017_09,partition PMAX) parallel 4;

索引已创建。

Query the token tables and related objects created by the text index.

Oracle creates the four tables ($I, $K, $N, $R) for each partition.

SQL> select table_name,tablespace_name,status from user_tables where table_name like 'DR%' order by 1;
TABLE_NAME                          TABLESPACE STATUS
----------------------------------- ---------- ----------------
DR#IDX1_T_LEXER0001$I               USERS      VALID
DR#IDX1_T_LEXER0001$K                          VALID
DR#IDX1_T_LEXER0001$N                          VALID
DR#IDX1_T_LEXER0001$R               USERS      VALID
DR#IDX1_T_LEXER0002$I               USERS      VALID
DR#IDX1_T_LEXER0002$K                          VALID
DR#IDX1_T_LEXER0002$N                          VALID
DR#IDX1_T_LEXER0002$R               USERS      VALID
DR#IDX1_T_LEXER0003$I               USERS      VALID
DR#IDX1_T_LEXER0003$K                          VALID
DR#IDX1_T_LEXER0003$N                          VALID
DR#IDX1_T_LEXER0003$R               USERS      VALID
已选择12行。

View the indexes created for each partition:

SQL> select index_name,table_name,index_type,tablespace_name,status from user_indexes where index_name like 'DR%' order by 1;

INDEX_NAME                                                   TABLE_NAME                          INDEX_TYPE TABLESPACE STATUS
------------------------------------------------------------ ----------------------------------- ---------- ---------- --------------
DR#IDX1_T_LEXER0001$X                                        DR#IDX1_T_LEXER0001$I               NORMAL     USERS      VALID
DR#IDX1_T_LEXER0002$X                                        DR#IDX1_T_LEXER0002$I               NORMAL     USERS      VALID
DR#IDX1_T_LEXER0003$X                                        DR#IDX1_T_LEXER0003$I               NORMAL     USERS      VALID
DRC#IDX1_T_LEXER0001$R                                       DR#IDX1_T_LEXER0001$R               NORMAL     USERS      VALID
DRC#IDX1_T_LEXER0002$R                                       DR#IDX1_T_LEXER0002$R               NORMAL     USERS      VALID
DRC#IDX1_T_LEXER0003$R                                       DR#IDX1_T_LEXER0003$R               NORMAL     USERS      VALID
已选择6行。

Text index sync frequency

The index was created with SYNC(EVERY SYSDATE+5/1440), so each partition is synced every 5 minutes. The scheduler job definitions:

SQL> select job_name,job_action,repeat_interval,next_run_Date,comments from dba_scheduler_jobs where owner = 'USER_LEXER';
JOB_NAME                  JOB_ACTION                                                                            REPEAT_INTERVAL      NEXT_RUN_DATE                              COMMENTS
------------------------- ------------------------------------------------------------------------------------- -------------------- ------------------------------------------ --------------------
DR#IDX1_T_LEXER0001$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'PMAX', NULL, NULL, 0);      SYSDATE+5/1440       04-8月 -17 12.26.21.000000 下午 +08:00     automatic sync job
DR#IDX1_T_LEXER0002$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'P_2017_08', NULL, NULL, 0); SYSDATE+5/1440       04-8月 -17 12.26.22.000000 下午 +08:00     automatic sync job
DR#IDX1_T_LEXER0003$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'P_2017_09', NULL, NULL, 0); SYSDATE+5/1440       04-8月 -17 12.26.22.000000 下午 +08:00     automatic sync job
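The REPEAT_INTERVAL of SYSDATE+5/1440 is Oracle date arithmetic: 5/1440 of a day. A trivial check in Python:

```python
MINUTES_PER_DAY = 24 * 60           # 1440
interval_days = 5 / 1440            # the fraction in SYSDATE+5/1440
print(round(interval_days * MINUTES_PER_DAY, 6))  # 5.0 minutes
```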

Adding a partition

Split off a new partition and insert data into it:

SQL> alter table T_LEXER split partition pmax at (TIMESTAMP' 2017-11-01 00:00:00') into (partition P_2017_10,partition PMAX) update global indexes;

表已更改。

SQL> insert into t_lexer
  2   (id, create_date, DATA_ECHO_AREA)
  3   values
  4   (seq_t_lexer.nextval,
  5     sysdate + 60,
  6     '515.124.4169');

已创建 1 行。

SQL>
SQL> insert into t_lexer
  2   (id, create_date, DATA_ECHO_AREA)
  3   values
  4   (seq_t_lexer.nextval,
  5     sysdate + 60,
  6     '14895956868');

已创建 1 行。

SQL> commit;

提交完成。

SQL> select id,create_date,dbms_lob.substr(DATA_ECHO_AREA) AREA from T_LEXER partition(p_2017_10);

        ID CREATE_DATE         AREA
---------- ------------------- --------------------------------------------------------------------------------
         5 2017-10-03 12:29:49 515.124.4169
         6 2017-10-03 12:29:49 14895956868

Check the scheduler info after the split: the new partition's sync job reuses the job name DR#IDX1_T_LEXER0001$J.

SQL> select job_name,job_action,repeat_interval,next_run_Date,comments from dba_scheduler_jobs where owner = 'USER_LEXER';
JOB_NAME                  JOB_ACTION                                                                            REPEAT_INTERVAL      NEXT_RUN_DATE                              COMMENTS
------------------------- ------------------------------------------------------------------------------------- -------------------- ------------------------------------------ -------------------------
DR#IDX1_T_LEXER0001$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'P_2017_10', NULL, NULL, 0); SYSDATE+5/1440       04-8月 -17 12.44.18.000000 下午 +08:00     automatic sync job
DR#IDX1_T_LEXER0002$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'P_2017_08', NULL, NULL, 0); SYSDATE+5/1440       04-8月 -17 12.41.22.000000 下午 +08:00     automatic sync job
DR#IDX1_T_LEXER0003$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'P_2017_09', NULL, NULL, 0); SYSDATE+5/1440       04-8月 -17 12.41.22.000000 下午 +08:00     automatic sync job
DR#IDX1_T_LEXER0004$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'PMAX', NULL, NULL, 0);      SYSDATE+5/1440       04-8月 -17 12.44.18.000000 下午 +08:00     automatic sync job

SQL> col IXP_INDEX_NAME for a20
SQL> col IXP_INDEX_PARTITION_NAME for a20
SQL> col  IXP_TABLE_PARTITION_NAME for a25
SQL> col IXP_SYNC_JOBNAME for a25
SQL> select ixp_index_name,ixp_index_partition_name,ixp_table_partition_name,ixp_sync_jobname from ctxsys.CTX_INDEX_PARTITIONS order by 2;

IXP_INDEX_NAME       IXP_INDEX_PARTITION_ IXP_TABLE_PARTITION_NAME  IXP_SYNC_JOBNAME
-------------------- -------------------- ------------------------- -------------------------
IDX1_T_LEXER         PMAX                 PMAX                      DR#IDX1_T_LEXER0004$J
IDX1_T_LEXER         P_2017_08            P_2017_08                 DR#IDX1_T_LEXER0002$J
IDX1_T_LEXER         P_2017_09            P_2017_09                 DR#IDX1_T_LEXER0003$J
IDX1_T_LEXER         P_2017_10            P_2017_10                 DR#IDX1_T_LEXER0001$J

The table data and the result of a CONTAINS query are shown below; the query targets the October data.

SQL> select id,create_date,dbms_lob.substr(DATA_ECHO_AREA) AREA from T_LEXER ;

        ID CREATE_DATE         AREA
---------- ------------------- --------------------------------------------------------------------------------
         3 2017-08-04 10:24:20 故事发生在19世纪末的欧洲,善良可爱的苏菲被恶毒的女巫施下魔咒,从18岁的女孩变成90
                               岁的婆婆

         4 2017-09-03 10:24:20 这部由俄罗斯导演提莫贝克曼贝托夫执导的影片自6月末在北美上映以来
         5 2017-10-03 12:29:49 515.124.4169
         6 2017-10-03 12:29:49 14895956868

SQL> alter index IDX1_T_LEXER noparallel;

索引已更改。

SQL> select /*+ gather_table_stats*/
  2   id, create_date, dbms_lob.substr(DATA_ECHO_AREA) AREA
  3    from t_lexer
  4   where contains(DATA_ECHO_AREA, '515%', 99) > 0;

        ID CREATE_DATE         AREA
---------- ------------------- --------------------------------------------------------------------------------
         5 2017-10-03 12:29:49 515.124.4169

SQL> @x
Display execution plan for last statement for this session from library cache...

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  3txv6xbzm7kpu, child number 0
-------------------------------------
select /*+ gather_table_stats*/  id, create_date,
dbms_lob.substr(DATA_ECHO_AREA) AREA   from t_lexer  where
contains(DATA_ECHO_AREA, '515%', 99) > 0

Plan hash value: 2441763191

---------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name         | E-Rows |E-Bytes| Cost (%CPU)| Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |              |        |       |     4 (100)|       |       |
|   1 |  PARTITION RANGE ALL               |              |      1 |  2036 |     4   (0)|     1 |     4 |
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID| T_LEXER      |      1 |  2036 |     4   (0)|     1 |     4 |
|*  3 |    DOMAIN INDEX                    | IDX1_T_LEXER |        |       |     4   (0)|       |       |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("CTXSYS"."CONTAINS"("DATA_ECHO_AREA",'515%',99)>0)

   - dynamic sampling used for this statement (level=2)


24 rows selected.

Dropping a partition

When a partition is dropped, the corresponding sync job and the token tables for that partition are dropped along with it.

SQL> alter table T_LEXER drop partition P_2017_10 update global indexes;

Table altered.

SQL> select job_name,job_action,repeat_interval,next_run_Date,comments from dba_scheduler_jobs where owner = 'USER_LEXER';

JOB_NAME                  JOB_ACTION                                                                            REPEAT_INTERVAL      NEXT_RUN_DATE                              COMMENTS
------------------------- ------------------------------------------------------------------------------------- -------------------- ------------------------------------------ -------------------------
DR#IDX1_T_LEXER0002$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'P_2017_08', NULL, NULL, 0); SYSDATE+5/1440       04-8月 -17 01.46.22.000000 下午 +08:00     automatic sync job
DR#IDX1_T_LEXER0003$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'P_2017_09', NULL, NULL, 0); SYSDATE+5/1440       04-8月 -17 01.46.22.000000 下午 +08:00     automatic sync job
DR#IDX1_T_LEXER0004$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'PMAX', NULL, NULL, 0);      SYSDATE+5/1440       04-8月 -17 01.44.18.000000 下午 +08:00     automatic sync job

SQL>  select table_name,tablespace_name,status from user_tables where table_name like 'DR%' order by 1;

TABLE_NAME                          TABLESPACE STATUS
----------------------------------- ---------- ----------------
DR#IDX1_T_LEXER0002$I               USERS      VALID
DR#IDX1_T_LEXER0002$K                          VALID
DR#IDX1_T_LEXER0002$N                          VALID
DR#IDX1_T_LEXER0002$R               USERS      VALID
DR#IDX1_T_LEXER0003$I               USERS      VALID
DR#IDX1_T_LEXER0003$K                          VALID
DR#IDX1_T_LEXER0003$N                          VALID
DR#IDX1_T_LEXER0003$R               USERS      VALID
DR#IDX1_T_LEXER0004$I               USERS      VALID
DR#IDX1_T_LEXER0004$K                          VALID
DR#IDX1_T_LEXER0004$N                          VALID
DR#IDX1_T_LEXER0004$R               USERS      VALID

12 rows selected.

Changing the refresh frequency of a partition sync job

For some partitions, the sync frequency can reasonably be lowered.

SQL> BEGIN
  2  DBMS_SCHEDULER.SET_ATTRIBUTE (
  3     name           =>   'DR#IDX1_T_LEXER0004$J',
  4     attribute      =>   'repeat_interval',
  5     value          =>   'SYSDATE+1');
  6  END;
  7  /

PL/SQL procedure successfully completed.

SQL> select job_name,job_action,repeat_interval,next_run_Date,comments from dba_scheduler_jobs where owner = 'USER_LEXER';

JOB_NAME                  JOB_ACTION                                                                            REPEAT_INTERVAL      NEXT_RUN_DATE                              COMMENTS
------------------------- ------------------------------------------------------------------------------------- -------------------- ------------------------------------------ -------------------------
DR#IDX1_T_LEXER0002$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'P_2017_08', NULL, NULL, 0); SYSDATE+5/1440       04-8月 -17 02.06.22.000000 下午 +08:00     automatic sync job
DR#IDX1_T_LEXER0003$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'P_2017_09', NULL, NULL, 0); SYSDATE+5/1440       04-8月 -17 02.06.22.000000 下午 +08:00     automatic sync job
DR#IDX1_T_LEXER0004$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'PMAX', NULL, NULL, 0);      SYSDATE+1            05-8月 -17 02.02.49.000000 下午 +08:00     automatic sync job

Dropping the sync job of a specific partition

SQL> BEGIN
  2     DBMS_SCHEDULER.DROP_JOB('DR#IDX1_T_LEXER0004$J');
  3  END;
  4  /

PL/SQL procedure successfully completed.

SQL> select job_name,job_action,repeat_interval,next_run_Date,comments from dba_scheduler_jobs where owner = 'USER_LEXER';

JOB_NAME                  JOB_ACTION                                                                            REPEAT_INTERVAL      NEXT_RUN_DATE                              COMMENTS
------------------------- ------------------------------------------------------------------------------------- -------------------- ------------------------------------------ --------------------
DR#IDX1_T_LEXER0002$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'P_2017_08', NULL, NULL, 0); SYSDATE+5/1440       04-8月 -17 02.06.22.000000 下午 +08:00     automatic sync job
DR#IDX1_T_LEXER0003$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'P_2017_09', NULL, NULL, 0); SYSDATE+5/1440       04-8月 -17 02.06.22.000000 下午 +08:00     automatic sync job

Adding a sync job for a specific partition

SQL> begin
  2    dbms_scheduler.create_job(job_name        => 'DR#IDX1_T_LEXER0004$J',
  3                              job_type        => 'PLSQL_BLOCK',
  4                              job_action      => 'ctxsys.drvdml.auto_sync_index(''IDX1_T_LEXER'', 536870912, ''PMAX'', NULL, NULL, 0);',
  5                              start_date      => (sysdate),
  6                              repeat_interval => 'FREQ=DAILY;INTERVAL=1',
  7                              end_date        => (sysdate + 2),
  8                              comments        => 'Testing');
  9  end;
 10  /

PL/SQL procedure successfully completed.

SQL> select job_name,job_action,repeat_interval,next_run_Date,comments from dba_scheduler_jobs where owner = 'USER_LEXER';

JOB_NAME                  JOB_ACTION                                                                            REPEAT_INTERVAL           NEXT_RUN_DATE                              COMMENTS
------------------------- ------------------------------------------------------------------------------------- ------------------------- ------------------------------------------ -------------------------
DR#IDX1_T_LEXER0002$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'P_2017_08', NULL, NULL, 0); SYSDATE+5/1440       04-8月 -17 02.06.22.000000 下午 +08:00     automatic sync job
DR#IDX1_T_LEXER0003$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'P_2017_09', NULL, NULL, 0); SYSDATE+5/1440       04-8月 -17 02.06.22.000000 下午 +08:00     automatic sync job
DR#IDX1_T_LEXER0004$J     ctxsys.drvdml.auto_sync_index('IDX1_T_LEXER', 536870912, 'PMAX', NULL, NULL, 0);      FREQ=DAILY;INTERVAL=1                                                Testing
Posted in Domain Index | Leave a comment

Oracle 12cR2 - Part 2 - Querying JSON Data

This post tests the five ways of querying JSON data: dot notation, JSON_VALUE, JSON_QUERY, JSON_TABLE, and JSON_EXISTS.
The environment is Oracle 12.2.0.1; see JSON_PART1 for the creation of the base table and the data initialization.
The data in the test table looks like this:

SQL> col area for a80
SQL> col id for a32
SQL> select id,dbms_lob.substr(DATA_ECHO_AREA) AREA from t_json t ;

ID                                            AREA
--------------------------------        --------------------------------------------------------------------------------
48B61FB2C87A7AACE0533702A8C0C2AE        {
                                            "employee_id": 100,
                                            "first_name": "Steven",
                                            "last_name": "King",
                                            "email": "SKING",
                                            "job_id": "AD_PRES",
                                            "salary": 24000.0,
                                            "manager_id": null,
                                            "department_id": 90,
                                            "address":{"city": "Oxford",
                                                      "STATE_PROVINCE": "Oxford",
                                                      "STREET_ADDRESS": "Magdalen Centre, The Oxford Science Park"
                                                      },
                                            "flag":True
                                           }
                                        
48B61FB2C87B7AACE0533702A8C0C2AE        {
                                            "employee_id": 101,
                                            "first_name": "Neena",
                                            "last_name": "Kochhar",
                                            "email": "NKOCHHAR",
                                            "job_id": "AD_VP",
                                            "salary": 17000.0,
                                            "manager_id": 100,
                                            "department_id": 90,
                                            "address":{"city": "South Brunswick",
                                                       "STATE_PROVINCE": "New Jersey",
                                                       "STREET_ADDRESS": "2007 Zagora St",
                                                       "POSTCODE":100010
                                                       },
                                            "flag":False
                                           }

Dot Notation

SQL> col employee_id for a10
SQL> col first_name for a15
SQL> col last_name for a15
SQL> col salary for a8
SQL>
SQL> select t.data_echo_area.employee_id,
  2         t.data_echo_area.first_name,
  3         t.data_echo_area.last_name,
  4         t.data_echo_area.salary
  5    from T_JSON t;

EMPLOYEE_I FIRST_NAME      LAST_NAME       SALARY
---------- --------------- --------------- --------
100        Steven          King            24000
101        Neena           Kochhar         17000

SQL> col address for a120
SQL> select t.data_echo_area.address from t_json t;

ADDRESS
------------------------------------------------------------------------------------------------------------------------
{"city":"Oxford","STATE_PROVINCE":"Oxford","STREET_ADDRESS":"Magdalen Centre, The Oxford Science Park"}
{"city":"South Brunswick","STATE_PROVINCE":"New Jersey","STREET_ADDRESS":"2007 Zagora St","POSTCODE":100010}

SQL> col address for a15
SQL> select t.data_echo_area.address.city from t_json t;

ADDRESS
---------------
Oxford
South Brunswick

JSON_VALUE

Oracle SQL has no BOOLEAN data type, whereas JSON has the boolean values true and false. When a JSON_VALUE path expression evaluates to true or false, there are two ways to surface the result in SQL: return it as a string, or convert it to a number.

String values
SQL> select id,json_value(t.data_echo_area,'$.last_name') as last_name from t_json t;

ID                                                               LAST_NAME
---------------------------------------------------------------- ---------------
48B61FB2C87A7AACE0533702A8C0C2AE                                 King
48B61FB2C87B7AACE0533702A8C0C2AE                                 Kochhar

-- Boolean values
SQL> col flag_str for a10
SQL> select id,
  2         json_value(t.data_echo_area, '$.flag') as flag_str,
  3         json_value(t.data_echo_area, '$.flag' RETURNING NUMBER) flag_number
  4    from t_json t;

ID                                                                 FLAG_STR   FLAG_NUMBER
---------------------------------------------------------------- ---------- -----------
48B61FB2C87A7AACE0533702A8C0C2AE                                        true            1
48B61FB2C87B7AACE0533702A8C0C2AE                                        false           0

JSON_QUERY

Use JSON_QUERY to return a fragment of a JSON document.

SQL> col address for a200
SQL>
SQL> SELECT t.data_echo_area.employee_id,
  2         t.data_echo_area.first_name,
  3         JSON_QUERY(t.data_echo_area, '$.address' WITH WRAPPER) AS address
  4  FROM   t_json t;

EMPLOYEE_I FIRST_NAME      ADDRESS
---------- --------------- -----------------------------------------------------------------------------------------------------------------
100        Steven          [{"city":"Oxford","STATE_PROVINCE":"Oxford","STREET_ADDRESS":"Magdalen Centre, The Oxford Science Park"}]
101        Neena           [{"city":"South Brunswick","STATE_PROVINCE":"New Jersey","STREET_ADDRESS":"2007 Zagora St","POSTCODE":100010}]

SQL> col city for a25
SQL> SELECT t.data_echo_area.employee_id,
  2         t.data_echo_area.first_name,
  3         JSON_QUERY(t.data_echo_area, '$.address[*].city' WITH WRAPPER) AS city  -- WITH WRAPPER adds the "[]" around the result
  4  FROM   t_json t;

EMPLO FIRST_NAME CITY
----- ---------- -------------------------
100   Steven     ["Oxford"]
101   Neena      ["South Brunswick"]

JSON_TABLE

JSON_TABLE projects JSON data into relational rows and columns; the result of a JSON_TABLE query can be turned into a view to make it convenient to query and display.
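As a mental model (not Oracle code), the row projection JSON_TABLE performs is the same flattening you could do by hand; a minimal Python sketch over a trimmed copy of one sample document, with invented column names:

```python
import json

# One of the sample documents from the table above (trimmed for brevity).
doc = json.loads("""
{
  "employee_id": 100,
  "first_name": "Steven",
  "address": {"city": "Oxford", "STATE_PROVINCE": "Oxford"}
}
""")

# Each COLUMNS entry of JSON_TABLE maps a path expression to one relational
# column; the same mapping, expressed here as (column_name, extractor) pairs.
columns = [
    ("employee_id", lambda d: d.get("employee_id")),
    ("first_name",  lambda d: d.get("first_name")),
    ("city",        lambda d: d.get("address", {}).get("city")),
    # analogue of the EXISTS clause: is the key present at all?
    ("has_state",   lambda d: "STATE_PROVINCE" in d.get("address", {})),
]

# One relational row per JSON document.
row = {name: extract(doc) for name, extract in columns}
print(row)
```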

SQL> col street_address for a105
SQL>
SQL> col state_province for a5
SQL>
SQL> SELECT cp.employee_id, cp.first_name,cp.STREET_ADDRESS,cp.STATE_PROVINCE,cp.flag
  2    FROM t_json,
  3         json_table(data_echo_area, '$'
  4           COLUMNS (employee_id VARCHAR2(32 CHAR) PATH '$.employee_id',
  5                    first_name  VARCHAR2(32 CHAR) PATH '$.first_name',
  6                    STREET_ADDRESS    VARCHAR2(200 CHAR) FORMAT JSON PATH '$.address',
  7                    flag   NUMBER(1) PATH '$.flag',
  8                    STATE_PROVINCE   VARCHAR2(5 CHAR) EXISTS
  9                              PATH '$.address.STATE_PROVINCE'
 10                   )) cp
 11  where cp.employee_id = 100;

EMPLO FIRST_NAME STREET_ADDRESS                                                                                              STATE     FLAG
----- ---------- --------------------------------------------------------------------------------------------------------- ----- ----------
100   Steven     {"city":"Oxford","STATE_PROVINCE":"Oxford","STREET_ADDRESS":"Magdalen Centre, The Oxford Science Park"}     true         1


SQL> SELECT cp.employee_id, cp.first_name,cp.STREET_ADDRESS,cp.STATE_PROVINCE,cp.flag
  2    FROM t_json,
  3         json_table(data_echo_area, '$'
  4           COLUMNS (employee_id VARCHAR2(32 CHAR) PATH '$.employee_id',
  5                    first_name  VARCHAR2(32 CHAR) FORMAT JSON WITH WRAPPER PATH '$.first_name',
  6                    STREET_ADDRESS    VARCHAR2(200 CHAR) FORMAT JSON WITH WRAPPER PATH '$.address',
  7                    flag   NUMBER(1) PATH '$.flag',
  8                    STATE_PROVINCE   VARCHAR2(5 CHAR) EXISTS
  9                              PATH '$.address.STATE_PROVINCE'
 10                   )) cp
 11  where cp.employee_id = 100;

EMPLO FIRST_NAME STREET_ADDRESS                                                                                              STATE    FLAG
----- ---------- --------------------------------------------------------------------------------------------------------- ----- ----------
100   ["Steven"] [{"city":"Oxford","STATE_PROVINCE":"Oxford","STREET_ADDRESS":"Magdalen Centre, The Oxford Science Park"}]  true         1

Build a view over the JSON data and query it (here written as a WITH-clause inline view)

SQL> col first_name for a10
SQL> col last_name for a10
SQL> col address_to_STATE_PROVINCE for a20
SQL> col address_to_STREET_ADDRESS for a40
SQL> col email for a8
SQL> col salary for a8
SQL> col address_to_city for a20
SQL> with json_view as (select j_t.* from t_json t,
  2                json_table(t.data_echo_area,'$'
  3                columns (
  4                employee_id number(10) PATH '$.employee_id',
  5                first_name varchar2(30 CHAR) PATH '$.first_name',
  6                last_name varchar2(30 CHAR) PATH '$.last_name',
  7                email varchar2(30 CHAR) PATH '$.email',
  8                salary varchar2(30 CHAR) PATH '$.salary',
  9                address_to_city varchar2(100 char) PATH '$.address.city',
 10                address_to_STATE_PROVINCE varchar2(200 char) PATH '$.address.STATE_PROVINCE',
 11                address_to_STREET_ADDRESS varchar2(200 char) PATH '$.address.STREET_ADDRESS',
 12                flag number(1) path '$.flag'
 13                )) j_t)
 14  select json_view.* from json_view;

EMPLOYEE_ID FIRST_NAME LAST_NAME  EMAIL    SALARY   ADDRESS_TO_CITY      ADDRESS_TO_STATE_PRO ADDRESS_TO_STREET_ADDRESS                    FLAG
----------- ---------- ---------- -------- -------- -------------------- -------------------- ---------------------------------------- --------
        100 Steven     King       SKING    24000    Oxford               Oxford               Magdalen Centre, The Oxford Science Park        1
        101 Neena      Kochhar    NKOCHHAR 17000    South Brunswick      New Jersey           2007 Zagora St                                  0

JSON_EXISTS

JSON_EXISTS is used as a row filter when querying JSON documents: it returns TRUE if the value addressed by the path expression exists, and FALSE otherwise.
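The filter semantics can be mimicked outside the database with a simple key-path check; an illustrative Python sketch (not Oracle syntax) over two cut-down documents modeled on the table data:

```python
import json

# Two cut-down documents modeled on the sample data above.
docs = [
    json.loads('{"id": 1, "address": {"city": "Oxford"}}'),
    json.loads('{"id": 2, "address": {"city": "South Brunswick", "POSTCODE": 100010}}'),
]

def path_exists(doc, *keys):
    """Walk the key path; True only if every step is present."""
    cur = doc
    for k in keys:
        if not isinstance(cur, dict) or k not in cur:
            return False
        cur = cur[k]
    return True

# analogue of: WHERE json_exists(data_echo_area, '$.address.POSTCODE')
matches = [d["id"] for d in docs if path_exists(d, "address", "POSTCODE")]
print(matches)
```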

SQL> select t.id from t_json t where json_exists(t.data_echo_area,'$?(@.address.city == "Oxford")');
ID
----------------------------------------------------------------
48B61FB2C87A7AACE0533702A8C0C2AE

Testing for NULL: employee 100's record has manager_id set to null.
SQL> select t.data_echo_area.employee_id, t.data_echo_area.first_name
  2    from t_json t
  3   where json_exists(t.data_echo_area, '$.manager_id' false on error)
  4     and t.data_echo_area.manager_id is null;

EMPLO FIRST_NAME
----- ----------
100   Steven

Testing for a missing field: only employee 101's record carries POSTCODE information.
SQL> select t.data_echo_area.employee_id, t.data_echo_area.first_name
  2    from t_json t
  3   where  json_exists(t.data_echo_area.address, '$.POSTCODE' error on error);

EMPLO FIRST_NAME
----- ----------
101   Neena
Posted in 12c, json | Leave a comment

Oracle analytic function-corr

For the mathematics of the Pearson correlation coefficient, see the Wikipedia entry on the Pearson correlation coefficient.

The Pearson correlation coefficient is the covariance of two variables divided by the product of their standard deviations. Covariance does reflect how two random variables move together (positive covariance means they are positively correlated, negative covariance means negatively correlated), but its magnitude is a poor measure of the strength of the association. For example, given points scattered in a 2-D plane, suppose we want to know how correlated the X and Y coordinates are: if X and Y are only weakly related but the data is widely dispersed, the covariance can still come out large, so using that value to gauge correlation is misleading.

To measure the strength of association properly, the Pearson coefficient divides the covariance by the two standard deviations, which normalizes it to a value between -1 and 1. As the linear relationship between the two variables strengthens, the coefficient approaches 1 or -1: if one variable increases when the other increases, they are positively correlated and the coefficient is greater than 0; if one increases while the other decreases, they are negatively correlated and the coefficient is less than 0; a coefficient of 0 means there is no linear relationship between them.

Put more simply: cos(a, b) = a · b / (|a| |b|). The Pearson coefficient is this cosine computed after both vectors have been standardized (centered), i.e. the cosine of the angle between the two centered vectors a and b.
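That identity is easy to check numerically. A small Python sketch (sample vectors invented for illustration) computes Pearson both as COVAR_POP / (STDDEV_POP * STDDEV_POP) and as the cosine of the centered vectors:

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]  # roughly linear in x

def pearson_as_formula(x, y):
    # COVAR_POP(x, y) / (STDDEV_POP(x) * STDDEV_POP(y))
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    covar_pop = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return covar_pop / (std_x * std_y)

def pearson_as_cosine(x, y):
    # center both vectors, then take the cosine of the angle between them
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cx = [a - mx for a in x]
    cy = [b - my for b in y]
    dot = sum(a * b for a, b in zip(cx, cy))
    return dot / (math.sqrt(sum(a * a for a in cx)) *
                  math.sqrt(sum(b * b for b in cy)))

# the two views of the coefficient agree
assert abs(pearson_as_formula(x, y) - pearson_as_cosine(x, y)) < 1e-12
```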

The correlation coefficient is defined only when both standard deviations are nonzero. The Pearson coefficient is appropriate when:
(1) the two variables have a linear relationship and both are continuous;
(2) both variables are normally distributed, or at least close to normal and unimodal;
(3) the observations come in pairs, and the pairs are mutually independent.

The CORR function returns the correlation coefficient of a pair of expressions. The computation is based on Pearson's correlation: rows in which expr1 or expr2 is null are eliminated, and the result is computed as
COVAR_POP(expr1, expr2) / (STDDEV_POP(expr1) * STDDEV_POP(expr2))
The return type is NUMBER. Applied to an empty set, the function returns NULL.

As an aggregate function
SQL> SELECT weight_class, CORR(list_price, min_price) "Correlation"
  2    FROM product_information
  3    GROUP BY weight_class
  4    ORDER BY weight_class, "Correlation";

WEIGHT_CLASS Correlation
------------ -----------
           1  .999149795
           2  .999022941
           3  .998484472
           4  .999359909
           5  .999536087

As the values of the two sample variables involved drift further and further apart, the correlation coefficient keeps decreasing, eventually turning negative, i.e. uncorrelated.

As an analytic function
Compute the correlation between employees' length of service and salary, per job.
SQL> SELECT employee_id,
  2         job_id,
  3         TO_CHAR((SYSDATE - hire_date) YEAR TO MONTH) "Yrs-Mns",
  4         salary,
  5         CORR(SYSDATE - hire_date, salary) OVER(PARTITION BY job_id) AS "Correlation"
  6    FROM employees
  7   WHERE department_id in (50,80)
  8   ORDER BY job_id, employee_id;

EMPLOYEE_ID JOB_ID               Yrs-Mns            SALARY Correlation
----------- -------------------- -------------- ---------- -----------
        145 SA_MAN               +12-08              14000  .912385598
        146 SA_MAN               +12-05              13500  .912385598
        147 SA_MAN               +12-03              12000  .912385598
        148 SA_MAN               +09-08              11000  .912385598
        149 SA_MAN               +09-04              10500  .912385598
        150 SA_REP               +12-04              10000   .80436755
        151 SA_REP               +12-02               9500   .80436755
        152 SA_REP               +11-10               9000   .80436755
        153 SA_REP               +11-02               8000   .80436755
        154 SA_REP               +10-06               7500   .80436755
        155 SA_REP               +09-06               7000   .80436755
Posted in analytic function, SQL | 1 Comment

Oracle analytic function-covar_pop,covar_samp

covar_samp

COVAR_SAMP returns the sample covariance of a pair of expressions; it can be used as an aggregate or analytic function.
Oracle implements it by first eliminating rows in which expr1 or expr2 is NULL, and then applying
(SUM(expr1 * expr2) - SUM(expr1) * SUM(expr2) / n) / (n - 1)  -- obtained by expanding the definitional formula
where n is the number of (expr1, expr2) pairs in which neither expr1 nor expr2 is null.

covar_pop

COVAR_POP returns the population covariance of a pair of expressions; it can be used as an aggregate or analytic function.
Oracle implements it by first eliminating rows in which expr1 or expr2 is NULL, and then applying
(SUM(expr1 * expr2) - SUM(expr1) * SUM(expr2) / n) / n  -- obtained by expanding the definitional formula
where n is the number of (expr1, expr2) pairs in which neither expr1 nor expr2 is null.
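Both factored forms can be checked against the definitional covariances; a quick Python sketch with invented data:

```python
x = [10.0, 20.0, 30.0, 40.0]
y = [6.0, 11.0, 15.0, 22.0]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# definitional covariances: mean (or n-1 scaled) cross-product of deviations
covar_pop_def  = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
covar_samp_def = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# the factored forms quoted above
sxy = sum(a * b for a, b in zip(x, y))
sx, sy = sum(x), sum(y)
covar_pop_fact  = (sxy - sx * sy / n) / n
covar_samp_fact = (sxy - sx * sy / n) / (n - 1)

assert abs(covar_pop_def - covar_pop_fact) < 1e-9
assert abs(covar_samp_def - covar_samp_fact) < 1e-9
```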

SQL> select department_id,
  2         covar_pop(sysdate - hire_date, salary) "covar_pop",
  3         covar_samp(sysdate - hire_date, salary) "covar_samp"
  4    from employees
  5   where department_id <= 40
  6   group by department_id;

DEPARTMENT_ID  covar_pop covar_samp
------------- ---------- ----------
           10          0
           20     957250    1914500
           30 1258261.11 1509913.33
           40          0

As an analytic function

SQL> SELECT product_id,
  2         supplier_id,
  3         COVAR_POP(list_price, min_price) OVER(ORDER BY product_id, supplier_id) AS COVAR_POP,
  4         COVAR_SAMP(list_price, min_price) OVER(ORDER BY product_id, supplier_id) AS COVAR_SAMP
  5    FROM product_information p
  6   WHERE category_id = 29
  7   ORDER BY product_id, supplier_id;

PRODUCT_ID SUPPLIER_ID  COVAR_POP COVAR_SAMP
---------- ----------- ---------- ----------
      1774      103088          0
      1775      103087    1473.25     2946.5
      1794      103096 1702.77778 2554.16667
      1825      103093    1926.25 2568.33333
      2004      103086     1591.4    1989.25
      2005      103086     1512.5       1815
      2416      103088 1475.97959 1721.97619
      2417      103088 1478.70313 1689.94643
      2449      103088  1326.8642 1492.72222
      3101      103086     1195.2       1328
      3170      103089 1590.07438 1749.08182
      3171      103089    1718.25 1874.45455
      3172      103094 1710.42012 1852.95513
      3173      103094 1588.56122 1710.75824
      3175      103089     1593.4 1707.21429
Posted in analytic function, SQL | Leave a comment