Deploying GreenPlum Cluster On Centos6 With Docker

This post installs and configures a GreenPlum cluster inside Docker on an Ubuntu host.
The Docker and Ubuntu version information is as follows:

root@leo-VirtualBox:~# cat /proc/version
Linux version 4.13.0-36-generic (buildd@lgw01-amd64-033) (gcc version 5.4.0 20160609 
(Ubuntu 5.4.0-6ubuntu1~16.04.9)) #40~16.04.1-Ubuntu SMP Fri Feb 16 23:25:58 UTC 2018
root@leo-VirtualBox:~# docker version
Client:
 Version:       17.12.0-ce
 API version:   1.35
 Go version:    go1.9.2
 Git commit:    c97c6d6
 Built: Wed Dec 27 20:11:19 2017
 OS/Arch:       linux/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:09:53 2017
  OS/Arch:      linux/amd64
  Experimental: false

The installation steps are as follows.

1. Pull the CentOS image

Because overseas registries are slow, this post pulls the image from the Alibaba Cloud developer platform. First register an account on the platform and create an image repository; the platform then provides an accelerator URL and configuration instructions for each OS.
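On Ubuntu the accelerator is configured in /etc/docker/daemon.json and Docker is then restarted. A sketch only; the mirror URL below is a placeholder, since the platform generates a per-account address:

```json
{
  "registry-mirrors": ["https://<your-accelerator-id>.mirror.aliyuncs.com"]
}
```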

docker pull centos:6.8
root@leo-VirtualBox:~# docker image ls | grep centos
centos                       6.8                 6704d778b3ba        4 months ago        195MB

Create the containers for GP from the CentOS image above, with hostnames mdw, sdw1 and sdw2:

docker run --name gpcentos1 --hostname mdw -it 6704d778b3ba /bin/bash
docker run --name gpcentos2 --hostname sdw1 -it 6704d778b3ba /bin/bash
docker run --name gpcentos3 --hostname sdw2 -it 6704d778b3ba /bin/bash

root@leo-VirtualBox:~# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS         PORTS      NAMES
d01c67550a4c        6704d778b3ba        "/bin/bash"         2 hours ago         Up 2 hours               gpcentos3
725bf50e783f        6704d778b3ba        "/bin/bash"         2 hours ago         Up 2 hours               gpcentos2
c68396a9068a        6704d778b3ba        "/bin/bash"         2 hours ago         Up 2 hours               gpcentos1

2. Configure the base environment of the containers

The CentOS image pulled for Docker is a minimal build of CentOS, so some packages are not installed by default, which would interfere with deploying GreenPlum later. Install the required dependencies on every container node:

yum install -y net-tools which openssh-clients openssh-server less zip unzip iproute.x86_64

The containers do not run an ssh service by default, which would break the later gpssh steps, so generate host keys, start sshd, and set up trusted connectivity:

ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
/usr/sbin/sshd

Configure the host-to-IP mappings.
Add the following entries to /etc/hosts in every container, and also edit /etc/sysconfig/network so that the hostname stays consistent.

172.17.0.2 mdw
172.17.0.3 sdw1
172.17.0.4 sdw2
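For example, on the mdw container /etc/sysconfig/network would contain the following (a sketch; use the matching hostname on sdw1 and sdw2):

```
NETWORKING=yes
HOSTNAME=mdw
```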

Create the GreenPlum user and group:

groupadd -g 501 gpadmin
useradd -g 501 -u 501 -m -d /home/gpadmin -s /bin/bash gpadmin
chown -R gpadmin:gpadmin /home/gpadmin
echo gpadmin | passwd  gpadmin --stdin

Raise the open-file and process limits on every container node:

vi /etc/security/limits.conf
# End of file
* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072

3. Download the installation package and install

Copy the installation media into the container:
docker cp /home/leo/greenplum-db-5.5.0-rhel6-x86_64.zip c68396a9068a:/home/gpadmin
Unzip and install.
On every node, create the installation directory:
mkdir -p /opt/greenplum
chown -R gpadmin:gpadmin /opt
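The unzip and install commands themselves are not shown; a sketch run as gpadmin on the master, with the file name taken from the docker cp step above:

```
cd /home/gpadmin
unzip greenplum-db-5.5.0-rhel6-x86_64.zip
# the extracted .bin installer prompts for the license and install path (/opt/greenplum)
/bin/bash greenplum-db-5.5.0-rhel6-x86_64.bin
```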

On the master node:
mkdir -p /data/master
chown gpadmin:gpadmin /data/master

On the segment nodes:
chown -R gpadmin:gpadmin /data

[root@mdw greenplum]# more all_host 
mdw
sdw1
sdw2
[root@mdw greenplum]# more seg_host 
sdw1
sdw2

[gpadmin@mdw greenplum]$ gpssh-exkeys -f all_host 
[STEP 1 of 5] create local ID and authorize on local host
  ... /home/gpadmin/.ssh/id_rsa file exists ... key generation skipped

[STEP 2 of 5] keyscan all hosts and update known_hosts file

[STEP 3 of 5] authorize current user on remote hosts
  ... send to sdw1
  ***
  *** Enter password for sdw1: 
  ... send to sdw2

[STEP 4 of 5] determine common authentication file content

[STEP 5 of 5] copy authentication files to all remote hosts
  ... finished key exchange with sdw1
  ... finished key exchange with sdw2

[INFO] completed successfully

[gpadmin@mdw greenplum]$ source greenplum_path.sh 
[gpadmin@mdw greenplum]$ 
[gpadmin@mdw greenplum]$ 
[gpadmin@mdw greenplum]$ gpseginstall -f seg_host -u gpadmin -p gpadmin   # install on the segment hosts
20180309:07:09:01:000783 gpseginstall:mdw:gpadmin-[INFO]:-Installation Info:

Run the following commands to create the directories for each segment instance:
gpssh -f seg_host -e 'mkdir -p /data/primary'
gpssh -f seg_host -e 'mkdir -p /data/mirror'
gpssh -f seg_host -e 'chown gpadmin:gpadmin /data/primary'
gpssh -f seg_host -e 'chown gpadmin:gpadmin /data/mirror'

[gpadmin@mdw greenplum]$ source greenplum_path.sh 
[gpadmin@mdw greenplum]$ 
[gpadmin@mdw greenplum]$ gpssh -f seg_host -e 'mkdir -p /data/primary'
[sdw1] mkdir -p /data/primary
[sdw2] mkdir -p /data/primary
[gpadmin@mdw greenplum]$ gpssh -f seg_host -e 'mkdir -p /data/mirror'
[sdw2] mkdir -p /data/mirror
[sdw1] mkdir -p /data/mirror
[gpadmin@mdw greenplum]$ gpssh -f seg_host -e 'chown gpadmin:gpadmin /data/primary'
[sdw2] chown gpadmin:gpadmin /data/primary
[sdw1] chown gpadmin:gpadmin /data/primary
[gpadmin@mdw greenplum]$ gpssh -f seg_host -e 'chown gpadmin:gpadmin /data/mirror'
[sdw1] chown gpadmin:gpadmin /data/mirror
[sdw2] chown gpadmin:gpadmin /data/mirror

The GP configuration file is:
[gpadmin@mdw ~]$ more gpinitsystem_config 
ARRAY_NAME="Greenplum DW"
SEG_PREFIX=gpseg
PORT_BASE=40000
declare -a DATA_DIRECTORY=(/data/primary)
MASTER_HOSTNAME=mdw
MASTER_DIRECTORY=/data/master
MASTER_PORT=5432
TRUSTED_SHELL=ssh
CHECK_POINT_SEGMENTS=8
ENCODING=UNICODE

MIRROR_PORT_BASE=50000
REPLICATION_PORT_BASE=41000
MIRROR_REPLICATION_PORT_BASE=51000
declare -a MIRROR_DATA_DIRECTORY=(/data/mirror)

Initialize the database:
gpinitsystem -c gpinitsystem_config -h seg_host

Start the database with gpstart and check its status with gpstate:
[gpadmin@mdw ~]$ gpstate
20180309:07:56:57:005338 gpstate:mdw:gpadmin-[INFO]:-   Master instance                             = Active
20180309:07:56:57:005338 gpstate:mdw:gpadmin-[INFO]:-   Master standby                              = No master standby
20180309:07:56:57:005338 gpstate:mdw:gpadmin-[INFO]:-   Total segment instance count from metadata  = 4
20180309:07:56:57:005338 gpstate:mdw:gpadmin-[INFO]:-----------------------------------------------------
20180309:07:56:57:005338 gpstate:mdw:gpadmin-[INFO]:-   Primary Segment Status
20180309:07:56:57:005338 gpstate:mdw:gpadmin-[INFO]:-----------------------------------------------------
20180309:07:56:57:005338 gpstate:mdw:gpadmin-[INFO]:-   Total primary segments                                    = 2
20180309:07:56:57:005338 gpstate:mdw:gpadmin-[INFO]:-   Total primary segment valid (at master)                   = 2
20180309:07:56:57:005338 gpstate:mdw:gpadmin-[INFO]:-   Total primary segment failures (at master)                = 0
20180309:07:56:57:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number of postmaster.pid files missing              = 0
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number of postmaster.pid files found                = 2
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs missing               = 0
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 2
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number of /tmp lock files missing                   = 0
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number of /tmp lock files found                     = 2
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number postmaster processes missing                 = 0
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number postmaster processes found                   = 2
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-----------------------------------------------------
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Mirror Segment Status
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-----------------------------------------------------
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total mirror segments                                     = 2
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total mirror segment valid (at master)                    = 2
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total mirror segment failures (at master)                 = 0
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number of postmaster.pid files missing              = 0
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number of postmaster.pid files found                = 2
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs missing               = 0
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 2
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number of /tmp lock files missing                   = 0
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number of /tmp lock files found                     = 2
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number postmaster processes missing                 = 0
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number postmaster processes found                   = 2
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number mirror segments acting as primary segments   = 0
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-   Total number mirror segments acting as mirror segments    = 2
20180309:07:56:58:005338 gpstate:mdw:gpadmin-[INFO]:-----------------------------------------------------
Posted in Docker, greenplum, postgresql

sklearn-KNN on DatingSets

I spent the last couple of days on the KNN chapter of Machine Learning in Action, and tested the KNeighborsClassifier algorithm on the dating-site data used in the book.
The post covers data normalization, splitting the data into training and test sets, K-fold cross-validation, computing the misclassification error (called MSE below), and searching for the optimal K. All code runs under Python 3.
code:

#!/usr/bin/env python3
#encoding=utf-8

__author__ = 'Leo'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold


file_name = 'datingTestSet2.txt'
col_name = ['time','mileage','liter','label']  # game time, frequent-flyer miles, liters of ice cream, and the class label
data = pd.read_csv(file_name,header=None,sep='\t',names=col_name)

X = np.array(data.iloc[:,0:3])  # .ix is deprecated; iloc selects the three feature columns
y = np.array(data['label'])
# normalize the features to [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(rescaledX, y, test_size=0.33, random_state=42)
print('X_train {}, X_test {}, y_train {}, y_test {}'.format(len(X_train), len(X_test), len(y_train), len(y_test)))

# start with k = 1
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the knn classifier for k = 1 is %d%%' % acc)

myList = list(range(1,64))
neighbors = list(filter(lambda x: x % 2 != 0, myList))
print('neighbors',neighbors)

# empty list that will hold cv scores
cv_scores = []

# K-fold cross-validation over each candidate k
for k in neighbors:
    kfold = KFold(n_splits=10, random_state=7)
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=kfold, scoring='accuracy')
    cv_scores.append(scores.mean())

# misclassification error = 1 - mean cross-validated accuracy
MSE = [1 - x for x in cv_scores]
# find the k that minimizes the misclassification error
optimal_k = neighbors[MSE.index(min(MSE))]
print('\nThe optimal number of neighbors is %d.' % optimal_k)

# plot misclassification error against k
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')

# use the optimal k to compute the mean estimated accuracy on the test set
kfold = KFold(n_splits=10, random_state=7)
model = KNeighborsClassifier(n_neighbors=15)  # k = 15, the optimal value found above
model.fit(X_test,y_test)
predictions = model.predict(X_test)
print("accuracy_score:",accuracy_score(y_test, predictions))
results = cross_val_score(model, X_test, y_test, cv=kfold)
print("the mean estimated accuracy:",results.mean())

# scatter plot
sns.lmplot('time', 'mileage',data=data, hue='label', fit_reg=False)
plt.show()

The output of the code is:

X_train 670, X_test 330, y_train 670, y_test 330

The accuracy of the knn classifier for k = 1 is 92%
neighbors [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63]

The optimal number of neighbors is 15.
accuracy_score: 0.951515151515
the mean estimated accuracy: 0.930303030303

The curve showing the relationship between the misclassification error and K, used to find the optimal K, is below:
knn_Neighbors
Plotting a scatter plot of the data along the two dimensions of frequent-flyer miles earned per year and percentage of time spent playing video games gives the figure below; label marks the three classes of people.
knn_classfication

Posted in Data Mining, Machine Learning, Python

Seaborn Data Visualization

I recently read through the seaborn documentation on data visualization and experimented with the Pima Indians Diabetes Data Set; the notes are recorded here.

#!/usr/bin/env python3
#encoding=utf-8

__author__ = 'Leo'

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


file_name = 'F:\\datasets\\PimaIndiansDiabetesDataSet\\pima-indians-diabetes.data.csv'
col_names = ['pregnant','glucose','bp','skin','insulin','bmi','pedigree','age','label']
pima = pd.read_csv(file_name,names  = col_names, engine = 'python')


y = pima.label
droped_list = ['label']
x = pima.drop(droped_list,axis = 1)


data_y = y
data_x = x

data_standarlized = (data_x - data_x.mean()) / (data_x.std())  # standardize to zero mean, unit variance
data = pd.concat([y,data_standarlized],axis=1)

data = pd.melt(data,id_vars="label",var_name="features",value_name='value')

# violin plot
plt.figure(figsize=(10,8))
sns.violinplot(x="features", y="value", hue="label", data=data,split=True, inner="quart")
plt.xticks(rotation=90)

# jointplot scatter plot
sns.jointplot(x.loc[:,'insulin'], x.loc[:,'pedigree'], kind="reg", color="#ce1414")

#swarmplot
plt.figure(figsize=(10,8))
sns.swarmplot(x="features", y="value", hue="label", data=data)
plt.xticks(rotation=90)

# heatmap
f,ax = plt.subplots(figsize=(10, 10))
sns.heatmap(x.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)

plt.show()

Violin plot

A violin plot combines a box plot and a kernel density plot, showing both in the same figure; it usually looks like a violin, hence the name. In seaborn, the violinplot function draws it.
violin plot
The figure shows that none of the 8 attributes are very similar to one another, i.e. none of them are closely correlated.

Scatter plot

A scatter plot visualizes two numeric features at the same time, letting us observe the relationship between them directly, for example whether a linear relationship exists. In seaborn the jointplot function draws scatter plots; swarmplot is another scatter-style plot.
The figure shows that besides the scatter plot of the two features, jointplot by default also draws a histogram for each individual feature, and computes the Pearson correlation coefficient between them. The Pearson coefficient measures the linear relationship between two features and takes values in [-1, 1]: 0 means no correlation, values close to 1 indicate a positive linear correlation, and values close to -1 a negative one. Here, the Pearson coefficient between insulin and pedigree is 0.19.
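The Pearson coefficient itself is straightforward to compute directly. A minimal sketch with synthetic features (not the Pima data) using `numpy.corrcoef`:

```python
import numpy as np

# two synthetic features; y is an exact linear function of x,
# so the Pearson correlation coefficient should be 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
r = np.corrcoef(x, y)[0, 1]
print(round(r, 4))  # → 1.0
```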

jointplot
scatter plot

swarmplot
swarmplot

Heatmap

A heatmap fills the cells of what would otherwise be a numeric table (an array) with color, with the intensity of the color representing the magnitude of the value. A heatmap gives a direct view of how strongly the features correlate with each other.
heatmap

Posted in Data Mining, Machine Learning, Python

Sklearn-SVM-Breast Cancer Wisconsin (Diagnostic) Data Set

Sklearn's SVM implementation is based on LIBSVM. This post uses the SVC model to classify the Wisconsin Diagnostic Breast Cancer data.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# display options
desired_width = 480
pd.set_option('display.width', desired_width)

### URL of the data set
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'

### column names; se = standard error.  For details see https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names
column_name = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', \
               'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', \
               'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', \
               'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

df = pd.read_csv(data_url, names=column_name)

print(df.dtypes)
print()
print("Wisconsin Diagnostic Breast Cancer DataSet's Shape Information: ")
print(df.shape)
print()
print("Column_Name diagnosis Class Information: ")
print(df.groupby('diagnosis').size())
print()
print(df.head(5))

### drop the id column and map the diagnosis values Malignant/Benign to 1/0
dummy_fields = ['id']
df = df.drop(dummy_fields, axis=1)
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

### histograms of the data
df.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1)
#plt.show()

data_df = list(df.columns[1:31])

X = df.loc[:, data_df]
y = df['diagnosis']

# Test options and evaluation metric
num_folds = 10
seed = 7
scoring = 'accuracy'

### model and parameter grid
svm_model = SVC()
# c_values = [1, 10, 100, 1000]
# kernel_values = ['linear']
# param_grid = dict(C=c_values, kernel=kernel_values)

param_grid = [{'kernel': ['rbf'],
               'gamma': [1e-4, 1e-3, 0.01, 0.1, 0.2, 0.5],
               'C': [1, 10, 100, 1000]},
              {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

kfold = KFold(n_splits=num_folds, random_state=seed)
grid = GridSearchCV(estimator=svm_model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X, y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print()
print("Grid scores on training set:")
print()
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

The histograms of the data are shown below:
BreastCancerWisconsinDataSetHistogram

Output

dtype:
id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
dtype: object

Wisconsin Diagnostic Breast Cancer DataSet's Shape Information: 
(569, 32)

Column_Name diagnosis Class Information: 
diagnosis
B    357
M    212
dtype: int64

         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  compactness_mean  concavity_mean  concave points_mean  radius_worst  texture_worst  perimeter_worst  area_worst  smoothness_worst  compactness_worst  concavity_worst  concave points_worst  symmetry_worst  fractal_dimension_worst
0    842302         M        17.99         10.38          122.80     1001.0          0.11840           0.27760          0.3001              0.14710         25.38          17.33           184.60      2019.0            0.1622             0.6656           0.7119                0.2654          0.4601                  0.11890
1    842517         M        20.57         17.77          132.90     1326.0          0.08474           0.07864          0.0869              0.07017         24.99          23.41           158.80      1956.0            0.1238             0.1866           0.2416                0.1860          0.2750                  0.08902
2  84300903         M        19.69         21.25          130.00     1203.0          0.10960           0.15990          0.1974              0.12790         23.57          25.53           152.50      1709.0            0.1444             0.4245           0.4504                0.2430          0.3613                  0.08758
3  84348301         M        11.42         20.38           77.58      386.1          0.14250           0.28390          0.2414              0.10520         14.91          26.50            98.87       567.7            0.2098             0.8663           0.6869                0.2575          0.6638                  0.17300
4  84358402         M        20.29         14.34          135.10     1297.0          0.10030           0.13280          0.1980              0.10430         22.54          16.67           152.20      1575.0            0.1374             0.2050           0.4000                0.1625          0.2364                  0.07678

[5 rows x 32 columns]

Best: 0.959578 using {'C': 100, 'kernel': 'linear'}

Grid scores on training set:

0.933216 (0.044219) with: {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.917399 (0.029568) with: {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.630931 (0.176754) with: {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
0.627417 (0.177054) with: {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
0.627417 (0.177054) with: {'C': 1, 'gamma': 0.2, 'kernel': 'rbf'}
0.627417 (0.177054) with: {'C': 1, 'gamma': 0.5, 'kernel': 'rbf'}
0.931459 (0.035519) with: {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.906854 (0.027474) with: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.632689 (0.172067) with: {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
0.627417 (0.177054) with: {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
0.627417 (0.177054) with: {'C': 10, 'gamma': 0.2, 'kernel': 'rbf'}
0.627417 (0.177054) with: {'C': 10, 'gamma': 0.5, 'kernel': 'rbf'}
0.942004 (0.026254) with: {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
0.906854 (0.027474) with: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.632689 (0.172067) with: {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
0.627417 (0.177054) with: {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}
0.627417 (0.177054) with: {'C': 100, 'gamma': 0.2, 'kernel': 'rbf'}
0.627417 (0.177054) with: {'C': 100, 'gamma': 0.5, 'kernel': 'rbf'}
0.929701 (0.023666) with: {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
0.906854 (0.027474) with: {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
0.632689 (0.172067) with: {'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}
0.627417 (0.177054) with: {'C': 1000, 'gamma': 0.1, 'kernel': 'rbf'}
0.627417 (0.177054) with: {'C': 1000, 'gamma': 0.2, 'kernel': 'rbf'}
0.627417 (0.177054) with: {'C': 1000, 'gamma': 0.5, 'kernel': 'rbf'}
0.952548 (0.019270) with: {'C': 1, 'kernel': 'linear'}
0.952548 (0.017597) with: {'C': 10, 'kernel': 'linear'}
0.959578 (0.015782) with: {'C': 100, 'kernel': 'linear'}
0.957821 (0.019530) with: {'C': 1000, 'kernel': 'linear'}

Process finished with exit code 0


Posted in Data Mining, Machine Learning, Python

Oracle 12cR2-Distributed LOBs

This post tests the basic Distributed LOB functionality of Oracle 12cR2.

The database version is 12.2.0.1, and the test is performed between two PDBs.

----- On the BLUE service
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production


USERNAME     INST_NAME   HOST_NAME    SID  SERIAL# VERSION    STARTED  SPID   OPID  CPID         SADDR            PADDR
----------- ----------- ----------- ----- -------- ---------- -------- ------ ----- ----------- ---------------- ----------------
SH           RED-ywdb    rhel7        86   14250   12.2.0.1.0 20170824 8867   8     5324:11336  000000007DE75430 000000007DC8DE38

Create a table containing a LOB column under blue. Note that rpad returns a VARCHAR2, which is capped at 4,000 characters, so the value actually inserted is 4,000 characters long rather than 5,000:
drop table test_lob purge;
create table test_lob(id number,description clob);
insert into test_lob values (1,rpad('a',5000,'a'));
commit;



--- On the RED service
SQL> create public database link red_to_blue connect to sh identified by sh using '(DESCRIPTION =
  2      (ADDRESS_LIST =
  3        (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.2.55)(PORT = 1521))
  4      )
  5      (CONNECT_DATA =
  6        (SERVICE_NAME = blue)
  7      )
  8    )';

Database link created.

SQL> SELECT DBMS_LOB.GETLENGTH@red_to_blue(description)  FROM test_lob@red_to_blue;

DBMS_LOB.GETLENGTH@RED_TO_BLUE(DESCRIPTION)
-------------------------------------------
                                       4000

SQL> SELECT LENGTH(description) FROM test_lob@red_to_blue;

LENGTH(DESCRIPTION)
-------------------
               4000

SQL>

SQL> SELECT DBMS_LOB.SUBSTR(description) from test_lob@red_to_blue;
SELECT DBMS_LOB.SUBSTR(description) from test_lob@red_to_blue
                       *
ERROR at line 1:
ORA-22992: cannot use LOB locators selected from remote tables

SQL> SELECT DBMS_LOB.SUBSTR@red_to_blue(description) from test_lob@red_to_blue;

DBMS_LOB.SUBSTR@RED_TO_BLUE(DESCRIPTION)
-------------------------------------------------------------------------------------------------------------------------------
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
SQL> col DESCRIPTION for a10

SQL> update test_lob@red_to_blue t set t.description = 'abc'; -- update over the database link

1 row updated.


SQL> select * from test_lob@red_to_blue;

        ID DESCRIPTIO
---------- ----------
         1 abc


SQL> insert into test_lob@red_to_blue(id,description) values(2,'leo'); -- insert over the database link

1 row created.

SQL> select * from test_lob@red_to_blue;

        ID DESCRIPTIO
---------- ----------
         1 abc
         2 leo

SQL> commit;

Commit complete.

Operating on the LOB with a remote locator from PL/SQL


SQL> set serveroutput on
SQL>
SQL> DECLARE
  2    avc_str        varchar2(4000);
  3    remote_locator CLOB;
  4  begin
  5    select description
  6      into remote_locator
  7      from test_lob@red_to_blue
  8     where id = 1;
  9    avc_str := dbms_lob.substr(remote_locator, 2, 3);
 10    dbms_output.put_line('The result is: ' || avc_str);
 11  end;
 12  /
The result is: c

PL/SQL procedure successfully completed.


Operating on the LOB with a remote locator through DBMS_LOB

SQL> set serveroutput on
SQL>
SQL>
SQL> DECLARE
  2    avc_lob CLOB;
  3    avc_str VARCHAR2(20) := 'sklearn';
  4    len     NUMBER(4);
  5    amt     NUMBER(4);
  6  begin
  7    amt := 7;
  8    select description
  9      into avc_lob
 10      from test_lob@red_to_blue
 11     where id = 2
 12       for update;
 13    dbms_lob.write(avc_lob, amt, 1, avc_str);
 14    amt := 10;
 15    dbms_lob.read(avc_lob, amt, 1, avc_str);
 16    len := dbms_lob.getlength(avc_lob);
 17    dbms_output.put_line(avc_str);
 18    dbms_output.put_line(amt);
 19    dbms_output.put_line('get length output = ' || len);
 20  END;
 21  /
sklearn
7
get length output = 7

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.


SQL>  select * from test_lob@red_to_blue;

        ID DESCRIPTIO
---------- ----------
         1 abc
         2 sklearn



Posted in 12c, LOB, PLSQL

sklearn-Convert categorical data into numerical data

This post uses sklearn's LabelEncoder and OneHotEncoder to convert categorical data into numerical data.
OneHotEncoder performs one-hot encoding: there are as many bits as there are states, and exactly one bit is 1 while all the others are 0.
LabelEncoder performs label encoding: it converts categorical data into continuous numeric codes, i.e. it numbers discontinuous numbers or text.
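As a minimal illustration before the full script (synthetic values; note that LabelEncoder assigns codes in sorted class order, so 'female' becomes 0 and 'male' becomes 1):

```python
from sklearn.preprocessing import LabelEncoder

# encode a tiny synthetic column of categorical values
le = LabelEncoder()
codes = le.fit_transform(['male', 'female', 'female', 'male'])
print(list(codes))        # → [1, 0, 0, 1]
print(list(le.classes_))  # → ['female', 'male']
```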

import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

#set output format
desired_width = 480
pd.set_option('display.width', desired_width)

#initial dataset
df_train =  pd.DataFrame({'Name': ['Braund', 'Cumings', 'Futrelle', 'Allen','McCarthy','Johnson'],
                           'Sex': ['male','female','female','male','male','female'],
                           'Age':[22, 38, 35, 35,54,27]})

# choose categorical data using dataframe.select_dtypes()
df_train_cat = df_train.select_dtypes(include=[object])
print('data sets:')
print(df_train_cat)

#encode labels with values between 0 and n_class - 1
le = preprocessing.LabelEncoder()

# fit and transform
df_train_cat_le = df_train_cat.apply(le.fit_transform)
print('LabelEncoder:')
print(df_train_cat_le)

#fit transform
ohe = preprocessing.OneHotEncoder()
ohe.fit(df_train_cat_le)
ohelabels = ohe.transform(df_train_cat_le).toarray()

print('OneHotEncoder:')
print(ohelabels)

print('shape size:')
print(ohelabels.shape)

Output:
data sets:
       Name     Sex
0    Braund    male
1   Cumings  female
2  Futrelle  female
3     Allen    male
4  McCarthy    male
5   Johnson  female
LabelEncoder:
   Name  Sex
0     1    1    
1     2    0
2     3    0
3     0    1
4     5    1
5     4    0
OneHotEncoder:
[[ 0.  1.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  1.  0.  0.  0.  1.  0.]
 [ 0.  0.  0.  1.  0.  0.  1.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  1.  0.  1.]
 [ 0.  0.  0.  0.  1.  0.  1.  0.]]
shape size:
(6, 8)

The Name column is the first feature dimension with 6 possible values 0/1/2/3/4/5, encoded as:
0----1 0 0 0 0 0
1----0 1 0 0 0 0
2----0 0 1 0 0 0
3----0 0 0 1 0 0
4----0 0 0 0 1 0
5----0 0 0 0 0 1

The Sex column is the second feature dimension with 2 possible values 0/1, encoded as:
0----1 0
1----0 1


So for the rows
   Name  Sex
    1    1   the encoding is 0 1 0 0 0 0 0 1 (matching the first row of the OneHotEncoder output above),
    2    0   and similarly   0 0 1 0 0 0 1 0.
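The concatenation is easy to check by hand. A sketch reproducing the first encoded row (Name=1, Sex=1) with plain NumPy; `one_hot` is a hypothetical helper written here for illustration, not part of the sklearn API:

```python
import numpy as np

def one_hot(index, n_classes):
    """Return a one-hot vector of length n_classes with a 1 at `index`."""
    v = np.zeros(n_classes, dtype=int)
    v[index] = 1
    return v

# Name has 6 classes and Sex has 2; OneHotEncoder concatenates the two segments
row = np.concatenate([one_hot(1, 6), one_hot(1, 2)])
print(row)  # → [0 1 0 0 0 0 0 1]
```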


Posted in Data Mining, Machine Learning, Python

Scikit-learn Feature Importance and Chi-squared

The data used here comes from the MINING_DATA_BUILD_V view under Oracle's SH schema.
The earlier Attribute Importance post tested Oracle's Attribute Importance computation on the same view.

SQL> desc MINING_DATA_BUILD_V
Name                    Type         Nullable Default Comments 
----------------------- ------------ -------- ------- -------- 
CUST_ID                 NUMBER                                 
CUST_GENDER             CHAR(1)                                
AGE                     NUMBER       Y                         
CUST_MARITAL_STATUS     VARCHAR2(20) Y                         
COUNTRY_NAME            VARCHAR2(40)                           
CUST_INCOME_LEVEL       VARCHAR2(30) Y                         
EDUCATION               VARCHAR2(21) Y                         
OCCUPATION              VARCHAR2(21) Y                         
HOUSEHOLD_SIZE          VARCHAR2(21) Y                         
YRS_RESIDENCE           NUMBER       Y                         
AFFINITY_CARD           NUMBER(10)   Y                         
BULK_PACK_DISKETTES     NUMBER(10)   Y                         
FLAT_PANEL_MONITOR      NUMBER(10)   Y                         
HOME_THEATER_PACKAGE    NUMBER(10)   Y                         
BOOKKEEPING_APPLICATION NUMBER(10)   Y                         
PRINTER_SUPPLIES        NUMBER(10)   Y                         
Y_BOX_GAMES             NUMBER(10)   Y                         
OS_DOC_SET_KANJI        NUMBER(10)   Y    

The sklearn Python code is as follows:

# -*- coding: utf-8 -*-
"""
@author: Leo Zhang
"""
import pandas as pd
import matplotlib.pyplot as plt
from pandas import read_csv
from sklearn import preprocessing
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest,chi2
from sklearn.ensemble import ExtraTreesClassifier

##load mining_data_build.csv data
from_filename = 'f:\\mining_data_build.csv'
to_filename = 'f:\\mining_data_build_preprocessed.csv'
column_name = ['CUST_GENDER','AGE','CUST_MARITAL_STATUS','COUNTRY_NAME','CUST_INCOME_LEVEL','EDUCATION','OCCUPATION','HOUSEHOLD_SIZE',
               'YRS_RESIDENCE','BULK_PACK_DISKETTES','FLAT_PANEL_MONITOR','HOME_THEATER_PACKAGE','BOOKKEEPING_APPLICATION',
               'PRINTER_SUPPLIES','Y_BOX_GAMES','OS_DOC_SET_KANJI','AFFINITY_CARD']
df_mining_data_build = pd.read_csv(from_filename, engine='python')
#print(df_mining_data_build.columns)

mining_data_build_v = df_mining_data_build

for value in ['CUST_GENDER', 'CUST_MARITAL_STATUS', 'COUNTRY_NAME',
       'CUST_INCOME_LEVEL', 'EDUCATION', 'OCCUPATION','HOUSEHOLD_SIZE']:
    mining_data_build_v[value].replace(['?'], [mining_data_build_v.describe(include='all')[value][2]],inplace=True)

# "cat" is short for categorical; LabelEncoder numbers discontinuous numbers or text
le = preprocessing.LabelEncoder()
CUST_GENDER_CAT = le.fit_transform(df_mining_data_build.CUST_GENDER)
CUST_MARITAL_STATUS_CAT = le.fit_transform(df_mining_data_build.CUST_MARITAL_STATUS)
COUNTRY_NAME_CAT   = le.fit_transform(df_mining_data_build.COUNTRY_NAME)
CUST_INCOME_LEVEL_CAT = le.fit_transform(df_mining_data_build.CUST_INCOME_LEVEL)
EDUCATION_CAT = le.fit_transform(df_mining_data_build.EDUCATION)
OCCUPATION_CAT = le.fit_transform(df_mining_data_build.OCCUPATION)
HOUSEHOLD_SIZE_CAT = le.fit_transform(df_mining_data_build.HOUSEHOLD_SIZE)

mining_data_build_v['CUST_GENDER_CAT'] = CUST_GENDER_CAT
mining_data_build_v['CUST_MARITAL_STATUS_CAT'] = CUST_MARITAL_STATUS_CAT
mining_data_build_v['COUNTRY_NAME_CAT'] = COUNTRY_NAME_CAT
mining_data_build_v['CUST_INCOME_LEVEL_CAT'] = CUST_INCOME_LEVEL_CAT
mining_data_build_v['EDUCATION_CAT'] = EDUCATION_CAT
mining_data_build_v['OCCUPATION_CAT'] = OCCUPATION_CAT
mining_data_build_v['HOUSEHOLD_SIZE_CAT'] = HOUSEHOLD_SIZE_CAT

# drop the original categorical columns from the pandas DataFrame
dummy_fields = ['CUST_GENDER', 'CUST_MARITAL_STATUS', 'COUNTRY_NAME',
                  'EDUCATION', 'OCCUPATION','HOUSEHOLD_SIZE']

mining_data_build_v = mining_data_build_v.drop(dummy_fields, axis = 1)
mining_data_build_v = mining_data_build_v.reindex_axis(['CUST_GENDER_CAT','AGE','CUST_MARITAL_STATUS_CAT','COUNTRY_NAME_CAT',
'CUST_INCOME_LEVEL_CAT','EDUCATION_CAT', 'OCCUPATION_CAT','HOUSEHOLD_SIZE_CAT','YRS_RESIDENCE','BULK_PACK_DISKETTES',
'FLAT_PANEL_MONITOR','HOME_THEATER_PACKAGE','BOOKKEEPING_APPLICATION', 'PRINTER_SUPPLIES',
'Y_BOX_GAMES','OS_DOC_SET_KANJI','AFFINITY_CARD'], axis = 1)

#print(mining_data_build_v.head(2))

# save the dataframe to a csv file without the index column
#mining_data_build_v.to_csv(to_filename, sep=',',index = False, encoding='utf-8')


arr_mining_data_build = mining_data_build_v.values

print(mining_data_build_v.columns)

X = arr_mining_data_build[:,0:16]
Y = arr_mining_data_build[:,16]

# set the printing precision
set_printoptions(precision=2,suppress=True)
###Feature Importance
model = ExtraTreesClassifier()
model.fit(X,Y)
print("Feature Importance:")
print(model.feature_importances_)


## Chi-squared for classification: output the scores used to pick the 5 most important attributes
k_best = SelectKBest(score_func=chi2,k = 5)
fit_mining_data_build = k_best.fit(X,Y)
print("Chi-squared Column Names:")
print(fit_mining_data_build.scores_)
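A note on the `'?'`-replacement loop near the top of the script: `describe(include='all')[value][2]` picks the `top` row of `describe()`, i.e. the most frequent value of an object column. The same idea is more explicit with `Series.mode()`; a sketch on synthetic data:

```python
import pandas as pd

# replace the '?' placeholders with the most frequent value of the column
s = pd.Series(['Married', '?', 'Married', 'NeverM', '?', 'Married'])
s = s.replace('?', s.mode()[0])
print(s.tolist())  # → ['Married', 'Married', 'Married', 'NeverM', 'Married', 'Married']
```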

The program output is:
Feature Importance:
[ 0.04  0.18  0.06  0.02  0.11  0.15  0.12  0.1   0.11  0.02  0.02  0.03  0.03  0.   0.02  0.  ]
The highest scores correspond to the attributes AGE, CUST_INCOME_LEVEL, EDUCATION, OCCUPATION, YRS_RESIDENCE.
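The five attributes called out above can also be recovered programmatically by sorting the scores; a sketch using the importance values copied from the output (the two 0.11 scores tie, so their relative order is arbitrary):

```python
import numpy as np

names = ['CUST_GENDER_CAT', 'AGE', 'CUST_MARITAL_STATUS_CAT', 'COUNTRY_NAME_CAT',
         'CUST_INCOME_LEVEL_CAT', 'EDUCATION_CAT', 'OCCUPATION_CAT', 'HOUSEHOLD_SIZE_CAT',
         'YRS_RESIDENCE', 'BULK_PACK_DISKETTES', 'FLAT_PANEL_MONITOR', 'HOME_THEATER_PACKAGE',
         'BOOKKEEPING_APPLICATION', 'PRINTER_SUPPLIES', 'Y_BOX_GAMES', 'OS_DOC_SET_KANJI']
scores = [0.04, 0.18, 0.06, 0.02, 0.11, 0.15, 0.12, 0.10,
          0.11, 0.02, 0.02, 0.03, 0.03, 0.00, 0.02, 0.00]

# indices of the five largest scores, mapped back to column names
top5 = [names[i] for i in np.argsort(scores)[::-1][:5]]
print(sorted(top5))
```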

Chi-squared Column Names:
[  24.91  436.23    1.99    5.31    1.23   73.73    0.83    4.3   158.87    0.18    0.51   51.15    4.72    0.     84.56    1.02]
The highest scores correspond to the attributes AGE, EDUCATION, YRS_RESIDENCE, HOME_THEATER_PACKAGE, Y_BOX_GAMES.
Posted in algorithm, Machine Learning, Python