Hướng dẫn can we plot multiple linear regression in python? - chúng ta có thể vẽ nhiều hồi quy tuyến tính trong python không?

Question

Danh mục> Học máy

Ngày 18 tháng 11 năm 2019

Nội dung chính Show

0. Mô tả dữ liệu mẫu
1. Hồi quy tuyến tính đa
AI = 1 (kg/m2s*10^6)
Bài viết liên quan
Chúng ta có thể vẽ nhiều hồi quy tuyến tính không?
Thư viện Python nào được sử dụng cho nhiều hồi quy tuyến tính?

Có nhiều phương pháp học máy nâng cao với độ chính xác dự đoán mạnh mẽ. Mặc dù các mô hình phức tạp có thể vượt trội so với các mô hình đơn giản trong việc dự đoán biến phản hồi, các mô hình đơn giản tốt hơn để hiểu được tác động và tầm quan trọng của từng tính năng trên một biến phản hồi. Khi nhiệm vụ trong tay có thể được mô tả bằng mô hình tuyến tính, hồi quy tuyến tính chiến thắng trên tất cả các phương pháp học máy khác trong giải thích tính năng do tính đơn giản của nó.

Bài đăng này cố gắng giúp bạn hiểu về hồi quy tuyến tính trong không gian tính năng đa chiều, đánh giá độ chính xác của mô hình và cung cấp đoạn mã cho hồi quy tuyến tính nhiều trong Python.

0. Mô tả dữ liệu mẫu

Chúng tôi sẽ sử dụng một dữ liệu mẫu trong suốt bài đăng này. Dữ liệu mẫu có liên quan đến ngành công nghiệp dầu khí. Nó có nguồn gốc từ Tiến sĩ Michael Pyrcz, Giáo sư Kỹ thuật Dầu khí tại Đại học Texas ở Austin. Dữ liệu gốc có thể được tìm thấy từ repo GitHub của anh ấy. Bạn cũng có thể sử dụng tải xuống trực tiếp hoặc truy cập trực tiếp bằng URL Pandas như bên dưới:

In [1]:

import pandas as pd

file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
df = pd.read_csv(file)

Out[2]:

	Tốt	Por	uốn	Ai	Giòn	TOC	Vr	Prod
0	1	12.08	2.92	2.80	81.40	1.16	2.31	4165.196191
1	2	12.38	3.53	3.22	46.17	0.89	1.88	3561.146205
2	3	14.02	2.59	4.01	72.80	0.89	2.72	4284.348574
3	4	17.67	6.75	2.63	39.81	1.08	1.88	5098.680869
4	5	17.52	4.57	3.18	10.94	1.51	1.90	3406.132832
5	6	14.53	4.81	2.69	53.60	0.94	1.67	4395.763259
6	7	13.49	3.60	2.93	63.71	0.80	1.85	4104.400989
7	8	11.58	3.03	3.25	53.00	0.69	1.93	3496.742701
8	9	12.52	2.72	2.43	65.77	0.95	1.98	4025.851153
9	10	13.25	3.94	3.71	66.20	1.14	2.65	4285.026122

Mô tả các tiêu đề

Vâng: Chỉ số tốt
POR: Độ xốp trung bình tốt (%)
Perm: tính thấm (MD)
AI: trở kháng accoustic (kg/m2s*10^6)
Brittle: Tỷ lệ chống nổ (%)
TOC: Tổng lượng carbon hữu cơ (%)
VR: Độ phản xạ vitrinite (%)
Prod: Sản xuất khí mỗi ngày (MCFD) - Biến phản hồiResponse Variable

Chúng tôi có sáu tính năng (POR, PERM, AI, giòn, TOC, VR) để dự đoán biến phản hồi (prod). Dựa trên mức độ quan trọng của tính năng hoán vị được hiển thị trong Hình (1), POR là tính năng quan trọng nhất và giòn là tính năng quan trọng thứ hai.

Xếp hạng tính năng hoán vị nằm ngoài phạm vi của bài đăng này và sẽ không được thảo luận chi tiết. Tính năng đạt được đạt được với thư viện RFPIMP Python. Để biết thêm thông tin về Xếp hạng tính năng hoán vị, hãy tham khảo bài viết này: Hãy coi chừng Hiệu quả rừng ngẫu nhiên mặc định

Hình 1: Xếp hạng tính năng hoán vị

Mã nguồn cho Hình (1)

            
                import rfpimp
                import pandas as pd
                import numpy as np
                from sklearn.ensemble import RandomForestRegressor
                from sklearn.model_selection import train_test_split

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)
                features = ['Por', 'Perm', 'AI', 'Brittle', 'TOC', 'VR', 'Prod']

                ######################################## Train/test split #########################################

                df_train, df_test = train_test_split(df, test_size=0.20)
                df_train = df_train[features]
                df_test = df_test[features]

                X_train, y_train = df_train.drop('Prod',axis=1), df_train['Prod']
                X_test, y_test = df_test.drop('Prod',axis=1), df_test['Prod']

                ################################################ Train #############################################

                rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
                rf.fit(X_train, y_train)

                ############################### Permutation feature importance #####################################

                imp = rfpimp.importances(rf, X_test, y_test)

                ############################################## Plot ################################################

                fig, ax = plt.subplots(figsize=(6, 3))

                ax.barh(imp.index, imp['Importance'], height=0.8, facecolor='grey', alpha=0.8, edgecolor='k')
                ax.set_xlabel('Importance score')
                ax.set_title('Permutation feature importance')
                ax.text(0.8, 0.15, 'aegis4048.github.io', fontsize=12, ha='center', va='center',
                        transform=ax.transAxes, color='grey', alpha=0.5)
                plt.gca().invert_yaxis()

                fig.tight_layout()

1. Hồi quy tuyến tính đa

Mô hình hồi quy tuyến tính có cấu trúc sau:

$$ y = \ beta_1 x_1 + \ beta_2 x_2 + \ cdots + \ beta_n x_n + \ beta_0 \ tag {1} $$

ở đâu

$ \ beta_n $

: hệ số hồi quy (trọng lượng) của tính năng $ n $ -th$n$-th feature

Mô hình hồi quy tuyến tính bivarate (có thể được hiển thị trong không gian 2D) là sự đơn giản hóa EQ (1). Mô hình bivariate có cấu trúc sau:

$$ y = \ beta_1 x_1 + \ beta_0 \ tag {2} $$

Một bưc tranh đang gia ngan lơi noi. Chúng ta hãy cố gắng hiểu các thuộc tính của nhiều mô hình hồi quy tuyến tính với trực quan hóa. Đầu tiên, mô hình hồi quy tuyến tính 2D bivariate được hiển thị trong Hình (2), sử dụng POR làm một tính năng duy nhất. Mặc dù độ xốp là tính năng quan trọng nhất liên quan đến sản xuất khí, nhưng chỉ có độ xốp chỉ chiếm 74% phương sai của dữ liệu.

Hướng dẫn can we plot multiple linear regression in python? - chúng ta có thể vẽ nhiều hồi quy tuyến tính trong python không?

Hình 2: Mô hình hồi quy tuyến tính 2D

Mã nguồn cho Hình (2)

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

Làm thế nào mô hình sẽ trông như thế nào trong không gian 3D? Chúng ta hãy xem hình (3). Do tính chất 3D của cốt truyện, nhiều lô được tạo ra từ các góc độ khác nhau. Hai tính năng (por và giòn) đã được sử dụng để dự đoán biến phản ứng prod. Với sự trợ giúp của tính năng bổ sung giòn, mô hình tuyến tính trải nghiệm độ chính xác đáng kể, hiện nắm bắt độ biến thiên của dữ liệu 93%.

Hình 3: Mô hình hồi quy tuyến tính 3D với các tính năng mạnh mẽ

Mã nguồn cho Hình (3)

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'Brittle']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)   # range of porosity values
                y_pred = np.linspace(0, 100, 30)  # range of brittleness values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('Brittleness', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=28, azim=120)
                ax2.view_init(elev=4, azim=114)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

Mô hình tuyến tính 3D sẽ trông như thế nào nếu các tính năng ít mạnh hơn được chọn? Hãy chọn POR và VR làm tính năng mới của chúng tôi và phù hợp với mô hình tuyến tính. Trong hình (4) bên dưới, chúng ta thấy rằng R bình phương giảm so với hình (3) ở trên. Ảnh hưởng của hiệu suất mô hình giảm có thể được quan sát trực quan bằng cách so sánh các lô giữa của chúng; Các sơ đồ phân tán trong hình (3) có dày đặc hơn xung quanh mặt phẳng mô hình 2D so với các ô phân tán trong hình (4).

Hình 4: Mô hình hồi quy tuyến tính 3D với các tính năng yếu

Mã nguồn cho Hình (4)

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'VR']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)      # range of porosity values
                y_pred = np.linspace(0.93, 2.9, 30)  # range of VR values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('VR', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=27, azim=112)
                ax2.view_init(elev=16, azim=-51)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

Chế độ xem đầy đủ của các mô hình tuyến tính được xây dựng dưới đây dưới dạng GIF. Lưu ý rằng mặt phẳng màu xanh luôn được chiếu tuyến tính, bất kể góc độ. Đây là lý do mà chúng tôi gọi đây là mô hình hồi quy "tuyến tính" nhiều. Mô hình sẽ luôn luôn là tuyến tính, bất kể tính chiều của các tính năng của bạn. Khi bạn có nhiều hơn 3 tính năng, mô hình sẽ rất khó được hiển thị, nhưng bạn có thể mong đợi rằng các mô hình tuyến tính có chiều cao cũng sẽ thể hiện xu hướng tuyến tính trong không gian tính năng của chúng.

Hình 5: Độ xốp và độ giòn mô hình tuyến tính GIF

Hình 6: Độ xốp và mô hình tuyến tính VR GIF

GIF được tạo bằng cách tạo 360 lô khác nhau được xem từ các góc khác nhau với đoạn mã sau và kết hợp thành một GIF từ IMGFLIP.

In [ ]:

for ii in np.arange(0, 360, 1):
    ax.view_init(elev=32, azim=ii)
    fig.savefig('gif_image%d.png' % ii)

Ghi chú: Mã hóa dữ liệu - hồi quy với các biến phân loại

Hồi quy đòi hỏi các tính năng phải liên tục. Điều gì xảy ra nếu bạn có các tính năng phân loại quan trọng? Bạn có phải bỏ qua các biến phân loại và chỉ chạy hồi quy với các biến liên tục? Chúng ta có thể mã hóa các biến phân loại thành các biến số để tránh vấn đề này. Hãy xem hình dưới đây. Cấp độ tính năng ban đầu là một biến thể loại với ba loại thời gian. Điều này có nghĩa là có hệ thống phân cấp giữa các loại (ví dụ: thấp integer encoding

Điều gì sẽ xảy ra nếu không có tọa độ giữa các loại tính năng? Sau đó, chúng tôi sử dụng một kỹ thuật gọi là mã hóa một lần để ngăn chặn một mô hình giả định đặt hàng tự nhiên giữa các loại có thể bị sai lệch mô hình. Thay vì sử dụng các biến số nguyên, chúng tôi sử dụng các biến nhị phân. 1 chỉ ra rằng dữ liệu mẫu rơi vào danh mục được chỉ định, trong khi 0 chỉ ra cách khác. Mã hóa một lần được sử dụng trong hầu hết các vấn đề ngôn ngữ tự nhiên, bởi vì các từ vựng không có mối quan hệ thứ tự giữa chính chúng.one-hot encoding to prevent a model from assuming natural ordering among categories that may suffer from model bias. Instead of using integer variables, we use binary variables. 1 indicates that the sample data falls into the specified category, while 0 indicates the otherwise. One-hot encoding is used in almost all natural languages problems, because vocabularies do not have ordinal relationships among themselves.

Mẹo Pythonic: Hồi quy tuyến tính 2D với Scikit-learn 2D linear regression with scikit-learn

Hồi quy tuyến tính được thực hiện trong scikit-learn với

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

4 (kiểm tra tài liệu). Để trình diễn mã, chúng tôi sẽ sử dụng cùng một bộ dữ liệu dầu khí được mô tả trong Phần 0: Mô tả dữ liệu mẫu ở trên.

Chuẩn bị dữ liệu

Đầu tiên, nhập các mô -đun và dữ liệu. Chúng tôi sẽ sử dụng một tính năng duy nhất: por. Biến phản ứng là prod.

In [8]:

import pandas as pd
from sklearn import linear_model
import numpy as np

file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
df = pd.read_csv(file)

Tiền xử lý

Nếu bạn nhận được thông báo lỗi như

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

5, đó là vấn đề tiền xử lý. Hầu hết các chức năng đào tạo học tập Scikit đều yêu cầu định hình lại các tính năng, chẳng hạn như

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

6. Trong trường hợp bạn nhập dữ liệu từ Pandas DataFrame, đối số đầu tiên luôn là

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

7 và đối số thứ hai là số lượng các tính năng, dưới dạng số nguyên. Tiền xử lý này cũng sẽ được yêu cầu khi bạn đưa ra dự đoán dựa trên mô hình được trang bị sau này.

Kiểm tra hình dạng của các tính năng và biến phản hồi của bạn nếu bạn đang gặp lỗi.

In [9]:

import numpy as np

features = ['Por']
target = 'Prod'

X = df[features].values.reshape(-1, len(features))
y = df[target].values

In [12]:

print(X.shape)
print(y.shape)

Phù hợp với mô hình tuyến tính

Vì chúng tôi chỉ có một tính năng, mô hình tuyến tính mà chúng tôi muốn phù hợp có cấu trúc sau:

$$ \ text {gas prod.} = \ beta_1 \ cdot \ text {por} + \ beta_0 \ tag {3} $$

Chúng ta hãy tìm ra các giá trị của $ \ beta_1 $ (hệ số hồi quy) và $ \ beta_2 $ (y-intercept). Cũng giống như nhiều thư viện Scikit-Learn khác, bạn khởi tạo đối tượng mô hình đào tạo với

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

8 và hơn là phù hợp với mô hình với tính năng

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

9 và biến phản hồi

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'Brittle']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)   # range of porosity values
                y_pred = np.linspace(0, 100, 30)  # range of brittleness values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('Brittleness', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=28, azim=120)
                ax2.view_init(elev=4, azim=114)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

0.

Lưu ý rằng

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'Brittle']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)   # range of porosity values
                y_pred = np.linspace(0, 100, 30)  # range of brittleness values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('Brittleness', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=28, azim=120)
                ax2.view_init(elev=4, azim=114)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

1 là viết tắt của bình phương tối thiểu thông thường.

In [42]:

from sklearn import linear_model

ols = linear_model.LinearRegression()
model = ols.fit(X, y)

Hệ số hồi quy tuyến tính có thể được truy cập trong một dạng thuộc tính lớp với

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'Brittle']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)   # range of porosity values
                y_pred = np.linspace(0, 100, 30)  # range of brittleness values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('Brittleness', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=28, azim=120)
                ax2.view_init(elev=4, azim=114)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

2

Có thể truy cập bộ chặn y trong một dạng thuộc tính lớp với

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'Brittle']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)   # range of porosity values
                y_pred = np.linspace(0, 100, 30)  # range of brittleness values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('Brittleness', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=28, azim=120)
                ax2.view_init(elev=4, azim=114)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

3

Dựa trên kết quả của sự phù hợp, chúng tôi kết luận rằng việc sản xuất khí có thể được dự đoán từ độ xốp, với mô hình tuyến tính sau:

$$ \ text {gas prod.} = 287,78 \ cdot \ text {por} - 2.94 \ tag {4} $$

Đánh giá độ chính xác: $ r^2 $

Mô hình của bạn tốt như thế nào? Bạn có thể đánh giá hiệu suất mô hình của mình dưới dạng R-bình phương, với

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'Brittle']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)   # range of porosity values
                y_pred = np.linspace(0, 100, 30)  # range of brittleness values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('Brittleness', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=28, azim=120)
                ax2.view_init(elev=4, azim=114)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

4.

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

9 là các tính năng và

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'Brittle']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)   # range of porosity values
                y_pred = np.linspace(0, 100, 30)  # range of brittleness values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('Brittleness', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=28, azim=120)
                ax2.view_init(elev=4, azim=114)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

0 là biến phản hồi được sử dụng để phù hợp với mô hình.

Đưa ra dự đoán trong tương lai

Scikit-learn hỗ trợ đưa ra dự đoán dựa trên mô hình được trang bị bằng phương pháp

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'Brittle']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)   # range of porosity values
                y_pred = np.linspace(0, 100, 30)  # range of brittleness values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('Brittleness', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=28, azim=120)
                ax2.view_init(elev=4, azim=114)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

7.

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

9 là một tính năng yêu cầu tiền xử lý được giải thích ở trên. Hãy nhớ lại rằng chúng tôi chỉ sử dụng một tính năng và

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'Brittle']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)   # range of porosity values
                y_pred = np.linspace(0, 100, 30)  # range of brittleness values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('Brittleness', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=28, azim=120)
                ax2.view_init(elev=4, azim=114)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

9.

Hãy nói rằng bạn muốn dự đoán sản xuất khí khi độ xốp là 15%. Sau đó:

In [46]:

            
                import rfpimp
                import pandas as pd
                import numpy as np
                from sklearn.ensemble import RandomForestRegressor
                from sklearn.model_selection import train_test_split

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)
                features = ['Por', 'Perm', 'AI', 'Brittle', 'TOC', 'VR', 'Prod']

                ######################################## Train/test split #########################################

                df_train, df_test = train_test_split(df, test_size=0.20)
                df_train = df_train[features]
                df_test = df_test[features]

                X_train, y_train = df_train.drop('Prod',axis=1), df_train['Prod']
                X_test, y_test = df_test.drop('Prod',axis=1), df_test['Prod']

                ################################################ Train #############################################

                rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
                rf.fit(X_train, y_train)

                ############################### Permutation feature importance #####################################

                imp = rfpimp.importances(rf, X_test, y_test)

                ############################################## Plot ################################################

                fig, ax = plt.subplots(figsize=(6, 3))

                ax.barh(imp.index, imp['Importance'], height=0.8, facecolor='grey', alpha=0.8, edgecolor='k')
                ax.set_xlabel('Importance score')
                ax.set_title('Permutation feature importance')
                ax.text(0.8, 0.15, 'aegis4048.github.io', fontsize=12, ha='center', va='center',
                        transform=ax.transAxes, color='grey', alpha=0.5)
                plt.gca().invert_yaxis()

                fig.tight_layout()

0

Theo mô hình, sản xuất khí = 4313 MCF/ngày khi độ xốp = 15%.

Điều gì sẽ xảy ra nếu chúng ta muốn dự đoán biến phản hồi từ nhiều trường hợp của một tính năng? Hãy thử độ xốp 14% và 18%. Sau đó:

In [48]:

            
                import rfpimp
                import pandas as pd
                import numpy as np
                from sklearn.ensemble import RandomForestRegressor
                from sklearn.model_selection import train_test_split

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)
                features = ['Por', 'Perm', 'AI', 'Brittle', 'TOC', 'VR', 'Prod']

                ######################################## Train/test split #########################################

                df_train, df_test = train_test_split(df, test_size=0.20)
                df_train = df_train[features]
                df_test = df_test[features]

                X_train, y_train = df_train.drop('Prod',axis=1), df_train['Prod']
                X_test, y_test = df_test.drop('Prod',axis=1), df_test['Prod']

                ################################################ Train #############################################

                rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
                rf.fit(X_train, y_train)

                ############################### Permutation feature importance #####################################

                imp = rfpimp.importances(rf, X_test, y_test)

                ############################################## Plot ################################################

                fig, ax = plt.subplots(figsize=(6, 3))

                ax.barh(imp.index, imp['Importance'], height=0.8, facecolor='grey', alpha=0.8, edgecolor='k')
                ax.set_xlabel('Importance score')
                ax.set_title('Permutation feature importance')
                ax.text(0.8, 0.15, 'aegis4048.github.io', fontsize=12, ha='center', va='center',
                        transform=ax.transAxes, color='grey', alpha=0.5)
                plt.gca().invert_yaxis()

                fig.tight_layout()

1

Out[49]:

            
                import rfpimp
                import pandas as pd
                import numpy as np
                from sklearn.ensemble import RandomForestRegressor
                from sklearn.model_selection import train_test_split

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)
                features = ['Por', 'Perm', 'AI', 'Brittle', 'TOC', 'VR', 'Prod']

                ######################################## Train/test split #########################################

                df_train, df_test = train_test_split(df, test_size=0.20)
                df_train = df_train[features]
                df_test = df_test[features]

                X_train, y_train = df_train.drop('Prod',axis=1), df_train['Prod']
                X_test, y_test = df_test.drop('Prod',axis=1), df_test['Prod']

                ################################################ Train #############################################

                rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
                rf.fit(X_train, y_train)

                ############################### Permutation feature importance #####################################

                imp = rfpimp.importances(rf, X_test, y_test)

                ############################################## Plot ################################################

                fig, ax = plt.subplots(figsize=(6, 3))

                ax.barh(imp.index, imp['Importance'], height=0.8, facecolor='grey', alpha=0.8, edgecolor='k')
                ax.set_xlabel('Importance score')
                ax.set_title('Permutation feature importance')
                ax.text(0.8, 0.15, 'aegis4048.github.io', fontsize=12, ha='center', va='center',
                        transform=ax.transAxes, color='grey', alpha=0.5)
                plt.gca().invert_yaxis()

                fig.tight_layout()

2

Chúng ta có thể mở rộng về điều này và vẽ một dòng dự đoán cho tất cả các giá trị có thể của tính năng. Giá trị thực tế hợp lý của độ xốp của đá nằm trong khoảng từ $ [0, 40] $.$[0, 40]$.

In [50]:

            
                import rfpimp
                import pandas as pd
                import numpy as np
                from sklearn.ensemble import RandomForestRegressor
                from sklearn.model_selection import train_test_split

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)
                features = ['Por', 'Perm', 'AI', 'Brittle', 'TOC', 'VR', 'Prod']

                ######################################## Train/test split #########################################

                df_train, df_test = train_test_split(df, test_size=0.20)
                df_train = df_train[features]
                df_test = df_test[features]

                X_train, y_train = df_train.drop('Prod',axis=1), df_train['Prod']
                X_test, y_test = df_test.drop('Prod',axis=1), df_test['Prod']

                ################################################ Train #############################################

                rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
                rf.fit(X_train, y_train)

                ############################### Permutation feature importance #####################################

                imp = rfpimp.importances(rf, X_test, y_test)

                ############################################## Plot ################################################

                fig, ax = plt.subplots(figsize=(6, 3))

                ax.barh(imp.index, imp['Importance'], height=0.8, facecolor='grey', alpha=0.8, edgecolor='k')
                ax.set_xlabel('Importance score')
                ax.set_title('Permutation feature importance')
                ax.text(0.8, 0.15, 'aegis4048.github.io', fontsize=12, ha='center', va='center',
                        transform=ax.transAxes, color='grey', alpha=0.5)
                plt.gca().invert_yaxis()

                fig.tight_layout()

3

In [52]:

            
                import rfpimp
                import pandas as pd
                import numpy as np
                from sklearn.ensemble import RandomForestRegressor
                from sklearn.model_selection import train_test_split

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)
                features = ['Por', 'Perm', 'AI', 'Brittle', 'TOC', 'VR', 'Prod']

                ######################################## Train/test split #########################################

                df_train, df_test = train_test_split(df, test_size=0.20)
                df_train = df_train[features]
                df_test = df_test[features]

                X_train, y_train = df_train.drop('Prod',axis=1), df_train['Prod']
                X_test, y_test = df_test.drop('Prod',axis=1), df_test['Prod']

                ################################################ Train #############################################

                rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
                rf.fit(X_train, y_train)

                ############################### Permutation feature importance #####################################

                imp = rfpimp.importances(rf, X_test, y_test)

                ############################################## Plot ################################################

                fig, ax = plt.subplots(figsize=(6, 3))

                ax.barh(imp.index, imp['Importance'], height=0.8, facecolor='grey', alpha=0.8, edgecolor='k')
                ax.set_xlabel('Importance score')
                ax.set_title('Permutation feature importance')
                ax.text(0.8, 0.15, 'aegis4048.github.io', fontsize=12, ha='center', va='center',
                        transform=ax.transAxes, color='grey', alpha=0.5)
                plt.gca().invert_yaxis()

                fig.tight_layout()

4

WARNING!

Hãy cẩn thận khi dự đoán một điểm bên ngoài phạm vi quan sát của các fautures. Mối quan hệ giữa các biến có thể thay đổi khi bạn di chuyển ra ngoài phạm vi quan sát được, nhưng bạn không bao giờ biết vì bạn không có dữ liệu. Mối quan hệ được quan sát có thể là tuyến tính cục bộ, nhưng có thể có những đường cong không quan sát được trên phạm vi bên ngoài của dữ liệu của bạn.

Mẹo Pythonic: Buộc không thể chặn Y Forcing zero y-intercept

Đôi khi bạn muốn buộc Y-Intercept = 0. Điều này có thể được thực hiện bằng cách đặt

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'VR']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)      # range of porosity values
                y_pred = np.linspace(0.93, 2.9, 30)  # range of VR values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('VR', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=27, azim=112)
                ax2.view_init(elev=16, azim=-51)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

0 khi khởi tạo lớp mô hình hồi quy tuyến tính. In mô hình Y-chặn sẽ xuất ra

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'VR']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)      # range of porosity values
                y_pred = np.linspace(0.93, 2.9, 30)  # range of VR values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('VR', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=27, azim=112)
                ax2.view_init(elev=16, azim=-51)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

1. Nhưng hãy lưu ý, "nói chung, điều cần thiết là đưa hằng số vào mô hình hồi quy", bởi vì "hằng số (y-intercept) hấp thụ sự thiên vị cho mô hình hồi quy", như Jim Frost nói trong bài viết của mình.

In [53]:

            
                import rfpimp
                import pandas as pd
                import numpy as np
                from sklearn.ensemble import RandomForestRegressor
                from sklearn.model_selection import train_test_split

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)
                features = ['Por', 'Perm', 'AI', 'Brittle', 'TOC', 'VR', 'Prod']

                ######################################## Train/test split #########################################

                df_train, df_test = train_test_split(df, test_size=0.20)
                df_train = df_train[features]
                df_test = df_test[features]

                X_train, y_train = df_train.drop('Prod',axis=1), df_train['Prod']
                X_test, y_test = df_test.drop('Prod',axis=1), df_test['Prod']

                ################################################ Train #############################################

                rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
                rf.fit(X_train, y_train)

                ############################### Permutation feature importance #####################################

                imp = rfpimp.importances(rf, X_test, y_test)

                ############################################## Plot ################################################

                fig, ax = plt.subplots(figsize=(6, 3))

                ax.barh(imp.index, imp['Importance'], height=0.8, facecolor='grey', alpha=0.8, edgecolor='k')
                ax.set_xlabel('Importance score')
                ax.set_title('Permutation feature importance')
                ax.text(0.8, 0.15, 'aegis4048.github.io', fontsize=12, ha='center', va='center',
                        transform=ax.transAxes, color='grey', alpha=0.5)
                plt.gca().invert_yaxis()

                fig.tight_layout()

5

Trong Hình (7), tôi đã tạo ra một số dữ liệu tổng hợp dưới đây để minh họa hiệu quả của việc buộc không có hệ thống y. Buộc một chặn Y không có thể là cả mong muốn hoặc không mong muốn. Nếu bạn có lý do để tin rằng Y-Intercept phải bằng không, hãy đặt

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'VR']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)      # range of porosity values
                y_pred = np.linspace(0.93, 2.9, 30)  # range of VR values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('VR', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=27, azim=112)
                ax2.view_init(elev=16, azim=-51)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

0.

Hình 7: Ảnh hưởng của việc buộc Zero Y-Intercept

Mã nguồn cho Hình (7)

            
                import rfpimp
                import pandas as pd
                import numpy as np
                from sklearn.ensemble import RandomForestRegressor
                from sklearn.model_selection import train_test_split

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)
                features = ['Por', 'Perm', 'AI', 'Brittle', 'TOC', 'VR', 'Prod']

                ######################################## Train/test split #########################################

                df_train, df_test = train_test_split(df, test_size=0.20)
                df_train = df_train[features]
                df_test = df_test[features]

                X_train, y_train = df_train.drop('Prod',axis=1), df_train['Prod']
                X_test, y_test = df_test.drop('Prod',axis=1), df_test['Prod']

                ################################################ Train #############################################

                rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
                rf.fit(X_train, y_train)

                ############################### Permutation feature importance #####################################

                imp = rfpimp.importances(rf, X_test, y_test)

                ############################################## Plot ################################################

                fig, ax = plt.subplots(figsize=(6, 3))

                ax.barh(imp.index, imp['Importance'], height=0.8, facecolor='grey', alpha=0.8, edgecolor='k')
                ax.set_xlabel('Importance score')
                ax.set_title('Permutation feature importance')
                ax.text(0.8, 0.15, 'aegis4048.github.io', fontsize=12, ha='center', va='center',
                        transform=ax.transAxes, color='grey', alpha=0.5)
                plt.gca().invert_yaxis()

                fig.tight_layout()

6

Mẹo pythonic: hồi quy tuyến tính 3D+ với scikit-learn 3D+ linear regression with scikit-learn

Phù hợp với mô hình đa tuyến tính

Hãy phù hợp với mô hình với bốn tính năng: por, giòn, perm và toc. Sau đó, mô hình quan tâm của chúng tôi có cấu trúc sau:

$$ \ text {gas prod.} = \ beta_1 \ cdot \ text {por} + \ beta_2 \ cdot \ text {giòn} + \ beta_3 \ cdot \ text {perm} + \ beta_4 \ cdot \ beta_0 \ tag {5} $$

Với scikit-learn, hồi quy tuyến tính 3D+ phù hợp không khác với hồi quy tuyến tính 2D, ngoài việc khai báo nhiều tính năng ngay từ đầu. Phần còn lại giống hệt nhau. Chúng tôi sẽ khai báo bốn tính năng:

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'VR']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)      # range of porosity values
                y_pred = np.linspace(0.93, 2.9, 30)  # range of VR values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('VR', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=27, azim=112)
                ax2.view_init(elev=16, azim=-51)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

3

In [16]:

            
                import rfpimp
                import pandas as pd
                import numpy as np
                from sklearn.ensemble import RandomForestRegressor
                from sklearn.model_selection import train_test_split

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)
                features = ['Por', 'Perm', 'AI', 'Brittle', 'TOC', 'VR', 'Prod']

                ######################################## Train/test split #########################################

                df_train, df_test = train_test_split(df, test_size=0.20)
                df_train = df_train[features]
                df_test = df_test[features]

                X_train, y_train = df_train.drop('Prod',axis=1), df_train['Prod']
                X_test, y_test = df_test.drop('Prod',axis=1), df_test['Prod']

                ################################################ Train #############################################

                rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
                rf.fit(X_train, y_train)

                ############################### Permutation feature importance #####################################

                imp = rfpimp.importances(rf, X_test, y_test)

                ############################################## Plot ################################################

                fig, ax = plt.subplots(figsize=(6, 3))

                ax.barh(imp.index, imp['Importance'], height=0.8, facecolor='grey', alpha=0.8, edgecolor='k')
                ax.set_xlabel('Importance score')
                ax.set_title('Permutation feature importance')
                ax.text(0.8, 0.15, 'aegis4048.github.io', fontsize=12, ha='center', va='center',
                        transform=ax.transAxes, color='grey', alpha=0.5)
                plt.gca().invert_yaxis()

                fig.tight_layout()

7

Out[17]:

            
                import rfpimp
                import pandas as pd
                import numpy as np
                from sklearn.ensemble import RandomForestRegressor
                from sklearn.model_selection import train_test_split

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)
                features = ['Por', 'Perm', 'AI', 'Brittle', 'TOC', 'VR', 'Prod']

                ######################################## Train/test split #########################################

                df_train, df_test = train_test_split(df, test_size=0.20)
                df_train = df_train[features]
                df_test = df_test[features]

                X_train, y_train = df_train.drop('Prod',axis=1), df_train['Prod']
                X_test, y_test = df_test.drop('Prod',axis=1), df_test['Prod']

                ################################################ Train #############################################

                rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
                rf.fit(X_train, y_train)

                ############################### Permutation feature importance #####################################

                imp = rfpimp.importances(rf, X_test, y_test)

                ############################################## Plot ################################################

                fig, ax = plt.subplots(figsize=(6, 3))

                ax.barh(imp.index, imp['Importance'], height=0.8, facecolor='grey', alpha=0.8, edgecolor='k')
                ax.set_xlabel('Importance score')
                ax.set_title('Permutation feature importance')
                ax.text(0.8, 0.15, 'aegis4048.github.io', fontsize=12, ha='center', va='center',
                        transform=ax.transAxes, color='grey', alpha=0.5)
                plt.gca().invert_yaxis()

                fig.tight_layout()

8

Dựa trên kết quả của sự phù hợp, chúng tôi có được mô hình hồi quy tuyến tính sau:

$$ \ text {gas prod. 6} $$

Đánh giá độ chính xác: $ r^2 $

Trong cùng một chúng tôi đã đánh giá hiệu suất mô hình với mô hình tuyến tính 2D ở trên, chúng tôi có thể đánh giá hiệu suất mô hình 3D+ với bình phương R với

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'Brittle']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)   # range of porosity values
                y_pred = np.linspace(0, 100, 30)  # range of brittleness values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('Brittleness', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=28, azim=120)
                ax2.view_init(elev=4, azim=114)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

4.

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

9 là các tính năng và

            
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn import linear_model
                from mpl_toolkits.mplot3d import Axes3D

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df[['Por', 'Brittle']].values.reshape(-1,2)
                Y = df['Prod']

                ######################## Prepare model data point for visualization ###############################

                x = X[:, 0]
                y = X[:, 1]
                z = Y

                x_pred = np.linspace(6, 24, 30)   # range of porosity values
                y_pred = np.linspace(0, 100, 30)  # range of brittleness values
                xx_pred, yy_pred = np.meshgrid(x_pred, y_pred)
                model_viz = np.array([xx_pred.flatten(), yy_pred.flatten()]).T

                ################################################ Train #############################################

                ols = linear_model.LinearRegression()
                model = ols.fit(X, Y)
                predicted = model.predict(model_viz)

                ############################################## Evaluate ############################################

                r2 = model.score(X, Y)

                ############################################## Plot ################################################

                plt.style.use('default')

                fig = plt.figure(figsize=(12, 4))

                ax1 = fig.add_subplot(131, projection='3d')
                ax2 = fig.add_subplot(132, projection='3d')
                ax3 = fig.add_subplot(133, projection='3d')

                axes = [ax1, ax2, ax3]

                for ax in axes:
                    ax.plot(x, y, z, color='k', zorder=15, linestyle='none', marker='o', alpha=0.5)
                    ax.scatter(xx_pred.flatten(), yy_pred.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
                    ax.set_xlabel('Porosity (%)', fontsize=12)
                    ax.set_ylabel('Brittleness', fontsize=12)
                    ax.set_zlabel('Gas Prod. (Mcf/day)', fontsize=12)
                    ax.locator_params(nbins=4, axis='x')
                    ax.locator_params(nbins=5, axis='x')

                ax1.text2D(0.2, 0.32, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax1.transAxes, color='grey', alpha=0.5)
                ax2.text2D(0.3, 0.42, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax2.transAxes, color='grey', alpha=0.5)
                ax3.text2D(0.85, 0.85, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                           transform=ax3.transAxes, color='grey', alpha=0.5)

                ax1.view_init(elev=28, azim=120)
                ax2.view_init(elev=4, azim=114)
                ax3.view_init(elev=60, azim=165)

                fig.suptitle('$R^2 = %.2f$' % r2, fontsize=20)

                fig.tight_layout()

0 là biến phản hồi được sử dụng để phù hợp với mô hình.

Đưa ra dự đoán trong tương lai

Hãy đưa ra một dự đoán về tốc độ sản xuất khí khi:

Por = 12 (%)
Giòn = 81 (%)
VR = 2,31 (%)
AI = 2,8 (kg/m2s*10^6)

In [107]:

            
                import rfpimp
                import pandas as pd
                import numpy as np
                from sklearn.ensemble import RandomForestRegressor
                from sklearn.model_selection import train_test_split

                ######################################## Data preparation #########################################

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)
                features = ['Por', 'Perm', 'AI', 'Brittle', 'TOC', 'VR', 'Prod']

                ######################################## Train/test split #########################################

                df_train, df_test = train_test_split(df, test_size=0.20)
                df_train = df_train[features]
                df_test = df_test[features]

                X_train, y_train = df_train.drop('Prod',axis=1), df_train['Prod']
                X_test, y_test = df_test.drop('Prod',axis=1), df_test['Prod']

                ################################################ Train #############################################

                rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
                rf.fit(X_train, y_train)

                ############################### Permutation feature importance #####################################

                imp = rfpimp.importances(rf, X_test, y_test)

                ############################################## Plot ################################################

                fig, ax = plt.subplots(figsize=(6, 3))

                ax.barh(imp.index, imp['Importance'], height=0.8, facecolor='grey', alpha=0.8, edgecolor='k')
                ax.set_xlabel('Importance score')
                ax.set_title('Permutation feature importance')
                ax.text(0.8, 0.15, 'aegis4048.github.io', fontsize=12, ha='center', va='center',
                        transform=ax.transAxes, color='grey', alpha=0.5)
                plt.gca().invert_yaxis()

                fig.tight_layout()

9

Lần này, hãy đưa ra hai dự đoán về tốc độ sản xuất khí khi:

Por = 12 (%)
Giòn = 81 (%)
VR = 2,31 (%)
AI = 2,8 (kg/m2s*10^6)

Lần này, hãy đưa ra hai dự đoán về tốc độ sản xuất khí khi:
Por = 15 (%)
Giòn = 60 (%)
VR = 2,5 (%)

In [109]:

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

0

Out[100]:

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

1

AI = 1 (kg/m2s*10^6)

2. Giới thiệu về đa cộng đồng

Mặc dù độ chính xác của mô hình đa tuyến tính trong việc dự đoán biến phản hồi có thể đáng tin cậy, giá trị của hệ số hồi quy cá nhân có thể không đáng tin cậy dưới tính đa hình. Lưu ý rằng giá trị của hệ số hồi quy cho độ xốp trong eq (4) là 287,7, trong khi đó là 244,6 trong biểu thức (6). Trong Hình (8), tôi đã mô phỏng nhiều mô hình phù hợp với các kết hợp các tính năng khác nhau để hiển thị các giá trị hệ số hồi quy dao động, ngay cả khi giá trị bình phương R cao.

Hình 8: Các hệ số hồi quy không ổn định do tính đa hình

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

2

Mã nguồn cho Hình (8)

Kết quả mô phỏng cho chúng ta biết rằng ngay cả khi mô hình tốt trong việc dự đoán biến phản hồi đã cho các tính năng (bình phương R cao), mô hình tuyến tính không đủ mạnh để hiểu đầy đủ ảnh hưởng của các tính năng riêng lẻ đối với biến phản hồi. Trong hoàn cảnh như vậy, chúng ta không thể tin tưởng các giá trị của các hệ số hồi quy. Sự bất ổn này đến từ đâu? Điều này là do POR, TOC và PERM cho thấy mối tương quan tuyến tính mạnh mẽ với nhau, như thể hiện trong ma trận tương quan của Spearnman trong Hình (9).

Hình 9: Ma trận tương quan các tính năng

            
                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn import linear_model

                ######################################## Data preparation ######################################### 

                file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
                df = pd.read_csv(file)

                X = df['Por'].values.reshape(-1,1)
                y = df['Prod'].values

                ################################################ Train ############################################# 

                ols = linear_model.LinearRegression()
                model = ols.fit(X, y)
                response = model.predict(X)

                ############################################## Evaluate ############################################ 

                r2 = model.score(X, y)

                ############################################## Plot ################################################

                plt.style.use('default')
                plt.style.use('ggplot')

                fig, ax = plt.subplots(figsize=(8, 4))

                ax.plot(X, response, color='k', label='Regression model')
                ax.scatter(X, y, edgecolor='k', facecolor='grey', alpha=0.7, label='Sample data')
                ax.set_ylabel('Gas production (Mcf/day)', fontsize=14)
                ax.set_xlabel('Porosity (%)', fontsize=14)
                ax.text(0.8, 0.1, 'aegis4048.github.io', fontsize=13, ha='center', va='center',
                         transform=ax.transAxes, color='grey', alpha=0.5)
                ax.legend(facecolor='white', fontsize=11)
                ax.set_title('$R^2= %.2f$' % r2, fontsize=18)

                fig.tight_layout()

3

Mã nguồn cho Hình (9)

Khi có hơn hai tính năng được sử dụng để dự đoán, bạn phải xem xét khả năng của từng tính năng tương tác với nhau. Đối với thử nghiệm suy nghĩ, hãy nghĩ về hai tính năng $ x_1 $ và $ x_2 $ và biến phản hồi $ y $. Giả sử rằng $ x_1 $ có liên quan tích cực đến $ y $. Nói cách khác, tăng $ x_1 $ tăng $ y $ và giảm $ x_1 $ cũng giảm $ y $. $ x_2 $ có liên quan tiêu cực đến $ y $. Có một mối tương quan tích cực giữa $ x_1 $ và $ x_2 $. Trong tình trạng này, khi bạn tăng $ x_1 $, bạn sẽ tăng giá trị $ y $ vì mối quan hệ tích cực giữa $ x_1 $ và $ y $, nhưng điều này không phải lúc nào cũng đúng vì tăng $ x_1 $ cũng tăng $ x_2 $, lần lượt giảm $ y $.muticollinearity. Under multicollinearity, the values of individual regression coefficients are unreliable, and the impact of individual features on a response variable is obfuscated. However, prediction on a response variable is still reliable.

Khi sử dụng các hệ số hồi quy tuyến tính để đưa ra quyết định kinh doanh, bạn phải loại bỏ ảnh hưởng của tính đa hình để có được các hệ số hồi quy đáng tin cậy. Giả sử rằng bạn đang thực hiện một nghiên cứu y học về ung thư cổ tử cung. Bạn đã đào tạo một mô hình hồi quy tuyến tính với tỷ lệ sống sót của bệnh nhân đối với nhiều tính năng, trong đó tiêu thụ nước là một trong số đó. Hệ số hồi quy tuyến tính của bạn cho tiêu thụ nước báo cáo rằng nếu bệnh nhân tăng mức tiêu thụ nước thêm 1,5 L mỗi ngày, tỷ lệ sống sót của anh ta sẽ tăng 2%. Bạn có thể tin tưởng phân tích này? Câu trả lời là có, nếu không có dấu hiệu của đa hình. Bạn thực sự có thể nói với bệnh nhân, với sự tự tin, rằng anh ta phải uống nhiều nước hơn để tăng cơ hội sống sót. Tuy nhiên, nếu có một dấu hiệu của tính đa hình, phân tích này không hợp lệ.

Lưu ý rằng tính đa hình không bị hạn chế trên mối quan hệ 1 so với 1. Ngay cả khi có mối tương quan tối thiểu 1 so với 1 giữa các tính năng, ba hoặc nhiều tính năng cùng nhau có thể hiển thị đa hình. Cũng lưu ý rằng tính đa hình không ảnh hưởng đến độ chính xác dự đoán. Mặc dù các giá trị của các hệ số riêng lẻ có thể không đáng tin cậy, nhưng nó không làm suy yếu sức mạnh dự đoán của mô hình. Đa hình học chỉ là một vấn đề khi bạn muốn nghiên cứu tác động của các tính năng riêng lẻ đối với một biến phản hồi.multicollinearity does not affect prediction accuracy. While the values of individual coefficients may be unreliable, it does not undermine the prediction power of the model. Multicollinearity is an issue only when you want to study the impact of individual features on a response variable.

Các chi tiết về phát hiện và các biện pháp khắc phục của mutlicolinearity không được thảo luận ở đây (mặc dù tôi dự định viết về nó rất sớm). Mặc dù trọng tâm của bài đăng này chỉ dựa trên nhiều hồi quy tuyến tính, tôi vẫn muốn thu hút sự chú ý của bạn về lý do tại sao bạn không nên luôn tin tưởng các hệ số hồi quy của mình.

Bài viết liên quan

Chúng ta có thể vẽ nhiều hồi quy tuyến tính không?

Ở trung tâm của phân tích hồi quy tuyến tính nhiều là nhiệm vụ lắp một dòng đơn lẻ thông qua một biểu đồ phân tán.Cụ thể hơn, hồi quy nhiều tuyến tính phù hợp với một dòng thông qua một đám mây dữ liệu đa chiều.Hình thức đơn giản nhất có một biến phụ thuộc và hai biến độc lập.the multiple linear regression fits a line through a multi-dimensional cloud of data points. The simplest form has one dependent and two independent variables.

Thư viện Python nào được sử dụng cho nhiều hồi quy tuyến tính?

Nếu chúng ta có một tính năng phụ thuộc và nhiều tính năng độc lập thì về cơ bản gọi nó là hồi quy tuyến tính nhiều.Vì vậy, đây là một mô tả lý thuyết nhỏ về hồi quy tuyến tính nhiều bây giờ chúng ta sẽ sử dụng thư viện hồi quy tuyến tính học Scikit để giải quyết vấn đề hồi quy tuyến tính nhiều.scikit learn linear regression library to solve the multiple linear regression problem.

programming python Linear regression Python

Hướng dẫn can we plot multiple linear regression in python? - chúng ta có thể vẽ nhiều hồi quy tuyến tính trong python không?

Danh mục> Học máy

0. Mô tả dữ liệu mẫu

1. Hồi quy tuyến tính đa

Ghi chú: Mã hóa dữ liệu - hồi quy với các biến phân loại

AI = 1 (kg/m2s*10^6)

Bài viết liên quan

Chúng ta có thể vẽ nhiều hồi quy tuyến tính không?

Thư viện Python nào được sử dụng cho nhiều hồi quy tuyến tính?

Bài Viết Liên Quan

Quảng Cáo

Có thể bạn quan tâm

Toplist được quan tâm

Quảng cáo

Xem Nhiều

Quảng cáo

Chúng tôi

Điều khoản

Trợ giúp

Mạng xã hội