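Blending is a stacking-type ensemble in which the meta-model is fit on predictions made on a holdout validation dataset rather than on out-of-fold predictions.
How to develop a blending ensemble, including functions for training the model and making predictions on new data.
How to evaluate blending ensembles for classification and regression predictive modeling problems.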
Blending Ensemble
Develop a Blending Ensemble
Blending Ensemble for Classification
Blending Ensemble for Regression
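"Many machine learning practitioners have had success using stacking and related techniques to boost prediction accuracy beyond the level obtained by any of the individual models. In some contexts, stacking is also referred to as blending, and we will use the terms interchangeably here." - Feature-Weighted Linear Stacking, 2009.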
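Level-0 models (base models): models fit on the training data and used to make predictions.
Level-1 model (meta-model): a model that learns how to best combine the predictions of the base models.
Blending ensemble: use of a linear model, such as linear regression or logistic regression, as the meta-model in a stacking ensemble.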
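"Our RMSE=0.8643^2 solution is a linear blend of over 100 results. [...] Throughout the description of the methods, we highlight the specific predictors that participated in the final blended solution." - The BellKor 2008 Solution to the Netflix Prize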
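Blending: a stacking-type ensemble in which the meta-model is trained on predictions made on a holdout validation dataset.
Stacking: a stacking-type ensemble in which the meta-model is trained on out-of-fold predictions made during k-fold cross-validation.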
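"Blend is a word introduced by the Netflix winners. It is very close to stacked generalization, but a bit simpler and with less risk of an information leak. [...] With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of, say, 10% of the train set. The stacker model then runs only on this holdout set." - Kaggle Ensemble Guide, MLWave, 2015.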
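The difference between the two definitions comes down to where the meta-model's training features come from. The following is a minimal sketch, not part of the original tutorial, that builds one meta-feature both ways for a single base model. The small dataset size and the use of `cross_val_predict` with `cv=5` are assumptions chosen only to keep the sketch short and fast.

```python
# sketch: build one meta-feature the blending way and the stacking way
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# small synthetic dataset (size is an assumption, chosen for speed)
X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

# blending: fit on a training split, predict on a separate holdout split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=1)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
blend_feature = model.predict(X_val).reshape(-1, 1)

# stacking: out-of-fold predictions for every row via k-fold cross-validation
stack_feature = cross_val_predict(KNeighborsClassifier(), X, y, cv=5).reshape(-1, 1)

# the blending meta-model sees only the holdout rows,
# the stacking meta-model sees one out-of-fold prediction per row
print('Blending meta-feature:', blend_feature.shape)
print('Stacking meta-feature:', stack_feature.shape)
```

With blending, the meta-model is trained on fewer rows (only the holdout set), but each base model is fit just once, which is the simplicity the quote above refers to.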
# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models
...
# fit all models on the training set and predict on hold out set
meta_X = list()
for name, model in models:
    # fit in training set
    model.fit(X_train, y_train)
    # predict on hold out set
    yhat = model.predict(X_val)
    # reshape predictions into a matrix with one column
    yhat = yhat.reshape(len(yhat), 1)
    # store predictions as input for blending
    meta_X.append(yhat)
...
# create 2d array from predictions, each set is an input feature
meta_X = hstack(meta_X)
...
# define blending model
blender = LogisticRegression()
# fit on predictions from base models
blender.fit(meta_X, y_val)
# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for name, model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender
# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for name, model in models:
        # predict with base model
        yhat = model.predict(X_test)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# summarize the dataset
print(X.shape, y.shape)
...
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s, Test: %s' % (X_train.shape, X_val.shape, X_test.shape))
...
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make predictions on test set
yhat = predict_ensemble(models, blender, X_test)
...
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Blending Accuracy: %.3f' % score)
Tying this together, the complete example of evaluating a blending ensemble on the binary classification problem is listed below.
# blending ensemble for classification using hard voting
from numpy import hstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC()))
    models.append(('bayes', GaussianNB()))
    return models

# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for name, model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender

# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for name, model in models:
        # predict with base model
        yhat = model.predict(X_test)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)

# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s, Test: %s' % (X_train.shape, X_val.shape, X_test.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make predictions on test set
yhat = predict_ensemble(models, blender, X_test)
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Blending Accuracy: %.3f' % (score*100))
Train: (3350, 20), Val: (1650, 20), Test: (5000, 20)
Blending Accuracy: 97.900
# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models
...
# fit all models on the training set and predict on hold out set
meta_X = list()
for name, model in models:
    # fit in training set
    model.fit(X_train, y_train)
    # predict on hold out set
    yhat = model.predict_proba(X_val)
    # store predictions as input for blending
    meta_X.append(yhat)
...
# make predictions with base models
meta_X = list()
for name, model in models:
    # predict with base model
    yhat = model.predict_proba(X_test)
    # store prediction
    meta_X.append(yhat)
# blending ensemble for classification using soft voting
from numpy import hstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models

# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for name, model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict_proba(X_val)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender

# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for name, model in models:
        # predict with base model
        yhat = model.predict_proba(X_test)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)

# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s, Test: %s' % (X_train.shape, X_val.shape, X_test.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make predictions on test set
yhat = predict_ensemble(models, blender, X_test)
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Blending Accuracy: %.3f' % (score*100))
Train: (3350, 20), Val: (1650, 20), Test: (5000, 20)
Blending Accuracy: 98.240
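The hard-voting and soft-voting listings differ only in what each base model contributes to the meta-model's input. As a rough sketch (not from the original tutorial, and assuming a smaller dataset and just two base models to keep it quick), the shapes can be compared directly: `predict()` gives one column per model, while `predict_proba()` gives one column per class per model, which is why no reshape is needed in the soft-voting version.

```python
# sketch: compare meta-feature shapes for class labels vs. class probabilities
from numpy import hstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# small two-class dataset (size is an assumption, chosen for speed)
X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=1)

# two base models only, to keep the sketch short
models = [LogisticRegression(max_iter=1000), GaussianNB()]
hard, soft = list(), list()
for model in models:
    model.fit(X_train, y_train)
    # class labels: a 1d vector, reshaped into a single column
    hard.append(model.predict(X_val).reshape(-1, 1))
    # class probabilities: already 2d, one column per class
    soft.append(model.predict_proba(X_val))

hard_X = hstack(hard)
soft_X = hstack(soft)
print('Hard meta-features:', hard_X.shape)
print('Soft meta-features:', soft_X.shape)
```

With two classes and two models, the label-based input has two columns and the probability-based input has four, so the meta-model receives more information per base model in the soft-voting case.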
# evaluate base models on the entire training dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models

# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# summarize data split
print('Train: %s, Test: %s' % (X_train_full.shape, X_test.shape))
# create the base models
models = get_models()
# evaluate standalone model
for name, model in models:
    # fit the model on the training dataset
    model.fit(X_train_full, y_train_full)
    # make a prediction on the test dataset
    yhat = model.predict(X_test)
    # evaluate the predictions
    score = accuracy_score(y_test, yhat)
    # report the score
    print('>%s Accuracy: %.3f' % (name, score*100))
Train: (5000, 20), Test: (5000, 20)
>lr Accuracy: 87.800
>knn Accuracy: 97.380
>cart Accuracy: 88.200
>svm Accuracy: 98.200
>bayes Accuracy: 87.300
# example of making a prediction with a blending ensemble for classification
from numpy import hstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('cart', DecisionTreeClassifier()))
    models.append(('svm', SVC(probability=True)))
    models.append(('bayes', GaussianNB()))
    return models

# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for _, model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict_proba(X_val)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender

# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for _, model in models:
        # predict with base model
        yhat = model.predict_proba(X_test)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)

# define dataset
X, y = get_dataset()
# split dataset into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s' % (X_train.shape, X_val.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make a prediction on a new row of data
row = [-0.30335011, 2.68066314, 2.07794281, 1.15253537, -2.0583897, -2.51936601, 0.67513028, -3.20651939, -1.60345385, 3.68820714, 0.05370913, 1.35804433, 0.42011397, 1.4732839, 2.89997622, 1.61119399, 7.72630965, -2.84089477, -1.83977415, 1.34381989]
yhat = predict_ensemble(models, blender, [row])
# summarize prediction (index the single result rather than formatting the whole array)
print('Predicted Class: %d' % (yhat[0]))
Train: (6700, 20), Val: (3300, 20)
Predicted Class: 1
# test regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=10000, n_features=20, n_informative=10, noise=0.3, random_state=7)
# summarize the dataset
print(X.shape, y.shape)
# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LinearRegression()))
    models.append(('knn', KNeighborsRegressor()))
    models.append(('cart', DecisionTreeRegressor()))
    models.append(('svm', SVR()))
    return models
...
# define blending model
blender = LinearRegression()
...
# evaluate predictions
score = mean_absolute_error(y_test, yhat)
print('Blending MAE: %.3f' % score)
# evaluate blending ensemble for regression
from numpy import hstack
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

# get the dataset
def get_dataset():
    X, y = make_regression(n_samples=10000, n_features=20, n_informative=10, noise=0.3, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LinearRegression()))
    models.append(('knn', KNeighborsRegressor()))
    models.append(('cart', DecisionTreeRegressor()))
    models.append(('svm', SVR()))
    return models

# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for name, model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LinearRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender

# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for name, model in models:
        # predict with base model
        yhat = model.predict(X_test)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)

# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s, Test: %s' % (X_train.shape, X_val.shape, X_test.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make predictions on test set
yhat = predict_ensemble(models, blender, X_test)
# evaluate predictions
score = mean_absolute_error(y_test, yhat)
print('Blending MAE: %.3f' % score)
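Running the example first reports the shapes of the train, validation, and test datasets, then the MAE of the ensemble on the test dataset.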
Train: (3350, 20), Val: (1650, 20), Test: (5000, 20)
Blending MAE: 0.237
# evaluate base models in isolation on the regression dataset
from numpy import hstack
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

# get the dataset
def get_dataset():
    X, y = make_regression(n_samples=10000, n_features=20, n_informative=10, noise=0.3, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LinearRegression()))
    models.append(('knn', KNeighborsRegressor()))
    models.append(('cart', DecisionTreeRegressor()))
    models.append(('svm', SVR()))
    return models

# define dataset
X, y = get_dataset()
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# summarize data split
print('Train: %s, Test: %s' % (X_train_full.shape, X_test.shape))
# create the base models
models = get_models()
# evaluate standalone model
for name, model in models:
    # fit the model on the training dataset
    model.fit(X_train_full, y_train_full)
    # make a prediction on the test dataset
    yhat = model.predict(X_test)
    # evaluate the predictions
    score = mean_absolute_error(y_test, yhat)
    # report the score
    print('>%s MAE: %.3f' % (name, score))
Train: (5000, 20), Test: (5000, 20)
>lr MAE: 0.236
>knn MAE: 100.169
>cart MAE: 133.744
>svm MAE: 138.195
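The standalone linear regression score is very close to the blending MAE reported earlier, which suggests the meta-model may be leaning mostly on that one base model. One way to check this, sketched below under the assumption of a smaller dataset for speed, is to print the coefficients the `LinearRegression` blender assigns to each base model's predictions. This inspection step is an illustration and not part of the original listings.

```python
# sketch: inspect how the linear blender weights each base model's predictions
from numpy import hstack
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

# smaller dataset than the tutorial's (an assumption, chosen for speed)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=10, noise=0.3, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=1)

# same four base models as the regression listings
models = [('lr', LinearRegression()), ('knn', KNeighborsRegressor()),
          ('cart', DecisionTreeRegressor()), ('svm', SVR())]

# fit base models on the training split and collect holdout predictions
meta_X = list()
for _, model in models:
    model.fit(X_train, y_train)
    meta_X.append(model.predict(X_val).reshape(-1, 1))
meta_X = hstack(meta_X)

# fit the linear blender on the holdout predictions
blender = LinearRegression()
blender.fit(meta_X, y_val)

# one coefficient per base model, in the order the models were listed
for (name, _), weight in zip(models, blender.coef_):
    print('>%s weight: %.3f' % (name, weight))
```

A weight near one for a single model and near zero for the rest would indicate that the blend adds little over that model on this dataset.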
# example of making a prediction with a blending ensemble for regression
from numpy import hstack
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

# get the dataset
def get_dataset():
    X, y = make_regression(n_samples=10000, n_features=20, n_informative=10, noise=0.3, random_state=7)
    return X, y

# get a list of base models
def get_models():
    models = list()
    models.append(('lr', LinearRegression()))
    models.append(('knn', KNeighborsRegressor()))
    models.append(('cart', DecisionTreeRegressor()))
    models.append(('svm', SVR()))
    return models

# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for _, model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LinearRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender

# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for _, model in models:
        # predict with base model
        yhat = model.predict(X_test)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)

# define dataset
X, y = get_dataset()
# split dataset into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize data split
print('Train: %s, Val: %s' % (X_train.shape, X_val.shape))
# create the base models
models = get_models()
# train the blending ensemble
blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
# make a prediction on a new row of data
row = [-0.24038754, 0.55423865, -0.48979221, 1.56074459, -1.16007611, 1.10049103, 1.18385406, -1.57344162, 0.97862519, -0.03166643, 1.77099821, 1.98645499, 0.86780193, 2.01534177, 2.51509494, -1.04609004, -0.19428148, -0.05967386, -2.67168985, 1.07182911]
yhat = predict_ensemble(models, blender, [row])
# summarize prediction
print('Predicted: %.3f' % (yhat[0]))
Train: (6700, 20), Val: (3300, 20)
Predicted: 359.986


