This article shows how to fit curves to discrete data in Python. The goal of curve fitting is to find a functional relationship from the given data and then use that function to predict unknown values. It covers the following topics:

  1. Basic polynomial fitting
  2. Fitting complex curves with higher-order polynomials
  3. Fitting complex curves with scipy
  4. Nonlinear fits that can be reduced to linear fits

Basic polynomial fitting

As we know, most functions can be expanded in a Taylor series, and the expansion can be used as a polynomial approximation. The polynomial coefficients themselves can be solved with the least-squares method. For background, see my two earlier articles:

Mathematics - Taylor's formula, common Maclaurin series, and function fitting with Maple

A Java implementation of least-squares polynomial fitting

Here we show how to do simple polynomial fitting in Python. A first-order polynomial is a straight line, i.e. linear fitting, which is appropriate when the data points roughly follow a line. Second- and third-order polynomials are used most often and fit a function well locally. Polynomials above fifth order may be even more accurate locally, but they oscillate strongly outside the sample points, so their predictions are not necessarily reliable. In theory, as long as a function has a Taylor expansion we can keep raising the polynomial order to improve the fit, but the polynomial becomes more and more complex and the computation more expensive.
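
As a rough, hedged illustration of that trade-off (the sample function, noise level, and degrees below are illustrative assumptions, not part of the example that follows), fitting noisy sine data with polynomials of increasing degree and then evaluating each fit just outside the sample interval shows how a higher degree tracks the samples more closely yet can swing away once we extrapolate:

import numpy as np

# Illustrative sample: noisy sine data on [0, pi]
x = np.linspace(0, np.pi, 30)
y = np.sin(x) + 0.05 * np.random.randn(len(x))

x_out = np.array([np.pi + 0.5])  # a point just outside the sample interval
for deg in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg)
    # The in-sample residual typically shrinks as the degree grows,
    # but the extrapolated value may drift far from sin(pi + 0.5)
    residual = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print('degree %d: residual=%.4f, value at pi+0.5=%.4f'
          % (deg, residual, np.polyval(coeffs, x_out)[0]))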

Now, the following example fits a third-order polynomial to the given data points to obtain the fitting function.

Code:

import numpy as np
import matplotlib.pyplot as plt

# X and Y
x = np.array([10, 20, 30, 40, 50, 60, 70, 80])
print('X:', x)
y = np.array([174, 236, 305, 334, 349, 351, 342, 323])
print('Y:', y)

y_fit = np.polyfit(x, y, 3)  # degree-3 least-squares fit; coefficients from highest to lowest order
print('y_fit:', y_fit)
y_fit_1d = np.poly1d(y_fit)  # wrap the coefficients in a polynomial object for printing/evaluation
print('y_fit_1d:\n', y_fit_1d)

y_hat = np.polyval(y_fit, x)
# This function is also OK: y_hat = y_fit_1d(x)
print('y_hat:', y_hat)

print('Correlation coefficients:')
print(np.corrcoef(y_hat, y))

# plot
plot1 = plt.plot(x, y, 'o', label='Original Values')
plot2 = plt.plot(x, y_hat, 'r', label='Fitting Curve')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc=4)
plt.title('Polyfitting')
plt.show()

Output:

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/regression/simple_regression.py
X: [10 20 30 40 50 60 70 80]
Y: [174 236 305 334 349 351 342 323]
y_fit: [ 3.68686869e-04 -1.28701299e-01  1.10570707e+01  7.26428571e+01]
y_fit_1d:
            3          2
0.0003687 x - 0.1287 x + 11.06 x + 72.64
y_hat: [170.71212121 245.25324675 298.47835498 332.5995671  349.82900433
 352.37878788 342.46103896 322.28787879]
Correlation coefficients:
[[1.         0.99745955]
 [0.99745955 1.        ]]

Process finished with exit code 0

Figure: the original values and the cubic fitting curve.
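
A note on the correlation matrix printed above: the off-diagonal entry (0.99745955 here) is the correlation coefficient between the fitted values y_hat and the original y, and the closer it is to 1 the tighter the fit. Appended to the script above, it can be extracted directly:

r = np.corrcoef(y_hat, y)[0, 1]
print('r = %.6f' % r)  # about 0.997460 for the run above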

Fitting complex curves with higher-order polynomials

In this section we use higher-order polynomials to fit the following function:

f(x) = 2\sin(x) + 3

If the fitting interval is [0, 2\pi], a third-order polynomial is enough.

Code:

import numpy as np
import matplotlib.pyplot as plt


def f(x):
    return 2 * np.sin(x) + 3


# X and Y
x = np.linspace(0, 2 * np.pi)
y = f(x) + 0.2 * np.random.randn(len(x))  # Add noise

y_fit = np.polyfit(x, y, 3)
print('y_fit:', y_fit)
y_fit_1d = np.poly1d(y_fit)
print('y_fit_1d:\n', y_fit_1d)

y_hat = np.polyval(y_fit, x)
# This function is also OK: y_hat = y_fit_1d(x)
print('y_hat:', y_hat)

print('Correlation coefficients:')
print(np.corrcoef(y_hat, y))

# plot
plot1 = plt.plot(x, y, 'o', label='Original Values')
plot3 = plt.plot(x, f(x), 'g', label='Original Curve')
plot2 = plt.plot(x, y_hat, 'r', label='Fitting Curve')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.title('Polyfitting')
plt.show()

Output:

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/regression/simple_regression_2.py
y_fit: [ 0.18327107 -1.74131349  3.85331994  2.43895226]
y_fit_1d:
         3         2
0.1833 x - 1.741 x + 3.853 x + 2.439
y_hat: [2.43895226 2.90481169 3.31572651 3.67401516 3.98199608 4.24198772
 4.45630852 4.62727692 4.75721137 4.84843031 4.90325218 4.92399543
 4.91297851 4.87251985 4.80493789 4.71255109 4.59767789 4.46263673
 4.30974604 4.14132429 3.95968991 3.76716134 3.56605703 3.35869542
 3.14739495 2.93447407 2.72225123 2.51304486 2.30917341 2.11295532
 1.92670904 1.75275301 1.59340567 1.45098547 1.32781085 1.22620026
 1.14847213 1.09694492 1.07393706 1.081767   1.12275318 1.19921405
 1.31346804 1.46783361 1.6646292  1.90617325 2.1947842  2.5327805
 2.92248059 3.36620292]
Correlation coefficients:
[[1.         0.98731758]
 [0.98731758 1.        ]]

Process finished with exit code 0

Figure: noisy samples, the original curve, and the cubic fitting curve on [0, 2π].

However, if the fitting interval is [0, 4\pi], a third-order fit no longer follows the curve.

Clearly that is not what we want, so we have to raise the polynomial order. A seventh-order fit tracks the curve well again.

In other words, when the fitting interval grows, we can raise the polynomial order to restore the fitting accuracy.
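
A minimal sketch of that experiment, reusing the same f(x) and plotting style as above and changing only the interval and the degree (the exact degrees compared are an illustrative choice), might look like this:

import numpy as np
import matplotlib.pyplot as plt


def f(x):
    return 2 * np.sin(x) + 3


# Wider interval: two full periods instead of one
x = np.linspace(0, 4 * np.pi, 100)
y = f(x) + 0.2 * np.random.randn(len(x))  # Add noise

# Degree 3 is no longer enough; degree 7 follows both periods much more closely
for deg, color in [(3, 'b'), (7, 'r')]:
    coeffs = np.polyfit(x, y, deg)
    plt.plot(x, np.polyval(coeffs, x), color, label='Degree %d fit' % deg)

plt.plot(x, y, 'o', label='Original Values')
plt.plot(x, f(x), 'g', label='Original Curve')
plt.legend()
plt.title('Polyfitting on [0, 4*pi]')
plt.show()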

Fitting complex curves with scipy

If the original data itself follows a sine, cosine, logarithmic, exponential, or similar distribution, the scipy module can fit that model directly, and the resulting predictions are more accurate.

Fit the following function:

f(x) = 2\sin(x) + 3

Code:

import numpy as np
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit


def f(x):
    return 2 * np.sin(x) + 3


def f_fit(x, a, b):
    return a * np.sin(x) + b


x = np.linspace(-2 * np.pi, 2 * np.pi)
y = f(x) + 0.2 * np.random.randn(len(x))  # Add noise
p_fit, pcov = curve_fit(f_fit, x, y)
print(p_fit)
print(pcov)
print('Correlation coefficients:')
y_hat = f_fit(x, p_fit[0], p_fit[1])
print(np.corrcoef(y_hat, y))
plt.plot(x, y, 'o', label='Original Values')
plt.plot(x, f(x), 'g', label='Original Curve')
plt.plot(x, y_hat, 'r', label='Fitting Curve')
plt.xlabel('x')
plt.ylabel('y')
plt.title('scipy fitting')
plt.legend()
plt.show()

Output:

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/regression/simple_regression_3.py
[1.9731016  2.98543308]
[[1.15670850e-03 2.66261347e-12]
 [2.66261347e-12 5.66787172e-04]]
Correlation coefficients:
[[1.         0.99294461]
 [0.99294461 1.        ]]

Figure: noisy samples, the original curve, and the scipy fitting curve.
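
One practical note: curve_fit starts from an initial guess of all ones for the parameters by default. For this sine model that is good enough, but for harder models you can pass an explicit starting point through p0; the values below are only an illustrative assumption, added to the script above:

p_fit, pcov = curve_fit(f_fit, x, y, p0=[1.5, 2.5])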

Now fit the function:

f(x) = 1.3456 \ln (x) + 3.4567

Code:

import numpy as np
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit


def f(x):
    return 1.3456 * np.log(x) + 3.4567


def f_fit(x, a, b):
    return a * np.log(x) + b


x = np.linspace(0.01, 20)
y = f(x) + 0.2 * np.random.randn(len(x))  # Add noise
p_fit, pcov = curve_fit(f_fit, x, y)
print(p_fit)
print(pcov)
print('Correlation coefficients:')
y_hat = f_fit(x, p_fit[0], p_fit[1])
print(np.corrcoef(y_hat, y))

ax = plt.gca()
ax.spines['bottom'].set_position(('data', 0))
ax.spines['left'].set_position(('data', 0))

plt.plot(x, y, 'o', label='Original Values')
plt.plot(x, f(x), 'g', label='Original Curve')
plt.plot(x, y_hat, 'r', label='Fitting Curve')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc=4)
plt.title('scipy fitting')
plt.show()

Output:

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/regression/simple_regression_3_2.py
[1.31734419 3.45585372]
[[ 0.00041841 -0.0008045 ]
 [-0.0008045   0.00222503]]
Correlation coefficients:
[[1.        0.9942632]
 [0.9942632 1.       ]]

Figure: noisy samples, the original logarithmic curve, and the scipy fitting curve.
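
The pcov matrix printed above is the estimated covariance of the fitted parameters; the square roots of its diagonal entries give one-standard-deviation error estimates for a and b. Appended to the script above:

perr = np.sqrt(np.diag(pcov))
print('parameter std errors:', perr)  # roughly [0.020, 0.047] for the run above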

Nonlinear fits that can be reduced to linear fits

Although some data follows a nonlinear distribution, we can often transform the problem into a linear fit. Common cases include the following types:

Type 1

Data with this shape roughly follows the function:

y = a e^{-b/x} \quad (a, b > 0)

Taking the logarithm of both sides gives:

\ln y=\ln a-\frac{b}{x}

Let y_{2} = \ln y and x_{2} = 1/x; then y_{2} = \ln a - b x_{2}, which is a linear function.

Type 2

Data with this shape roughly follows the function:

\frac{1}{y} = a + \frac{b}{x} \quad (a, b > 0)

Let x_{2} = 1/x and y_{2} = 1/y; then y_{2} = a + b x_{2}, which is a linear function.

Type 3

Data with this shape roughly follows the function:

y = a e^{bx} \quad (a, b > 0)

Taking the logarithm of both sides gives:

\ln y=\ln a + bx

Let y_{2} = \ln y and x_{2} = x; then y_{2} = \ln a + b x_{2}, which is a linear function.
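
Type 3 does not appear in the worked example later, so here is a minimal sketch of its linearization with np.polyfit; the generating function and noise level are illustrative assumptions:

import numpy as np

# Illustrative data generated from y = 1.5 * exp(0.3 * x) plus a little noise
x = np.linspace(0, 5, 30)
y = 1.5 * np.exp(0.3 * x) + 0.05 * np.random.randn(len(x))

# Fit ln(y) = ln(a) + b*x as a straight line
slope, intercept = np.polyfit(x, np.log(y), 1)
a, b = np.exp(intercept), slope
print('a=%.4f, b=%.4f' % (a, b))  # should come out close to 1.5 and 0.3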

Type 4

Data with this shape roughly follows the function:

y = a + \frac{b}{x} \quad (a, b > 0)

Let y_{2} = y and x_{2} = 1/x; then y_{2} = a + b x_{2}, which is a linear function.
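
Similarly, a minimal sketch of the Type 4 linearization, again with illustrative sample data:

import numpy as np

# Illustrative data generated from y = 2 + 3/x plus a little noise
x = np.linspace(1, 20, 30)
y = 2 + 3 / x + 0.05 * np.random.randn(len(x))

# Fit y = a + b*x2 with x2 = 1/x; polyfit returns [slope, intercept]
b, a = np.polyfit(1 / x, y, 1)
print('a=%.4f, b=%.4f' % (a, b))  # should come out close to 2 and 3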

Example

In a certain chemical reaction, the product concentration was measured as a function of time; the measured time/concentration pairs appear as the x and y arrays in the code below.

Find the function y = f(x) relating the concentration y to the time x.

First, plot the distribution of the data points.

Code:

import numpy as np
import matplotlib.pyplot as plt

# X and Y
x = np.array(
    [i + 1 for i in range(25)])
y = np.array(
    [4, 6.4, 8, 8.8, 9.22, 9.5, 9.7, 9.86, 10, 10.1, 10.2, 10.32, 10.4, 10.48, 10.55, 10.6,
     10.62, 10.634, 10.645, 10.646, 10.647, 10.649, 10.65, 10.651, 10.651])

plot1 = plt.plot(x, y, 'o', label='Original Values')
plt.xlabel('x (min)')
plt.ylabel('y (mol/L)')
plt.legend(loc=4)
plt.show()

Figure: scatter plot of the concentration y (mol/L) against time x (min).

By inspection, the distribution of the data points roughly matches the Type 1 and Type 2 shapes above, so we fit the data both ways and see which one works better.

Type 1 fit

Code:

import numpy as np
import matplotlib.pyplot as plt

# X and Y
x = np.array(
    [i + 1 for i in range(25)])
y = np.array(
    [4, 6.4, 8, 8.8, 9.22, 9.5, 9.7, 9.86, 10, 10.1, 10.2, 10.32, 10.4, 10.48, 10.55, 10.6,
     10.62, 10.634, 10.645, 10.646, 10.647, 10.649, 10.65, 10.651, 10.651])


def f_fit(x, a, b):
    return a * np.exp(-b / x)


print("ln(y) = ln(a) - b/x")
print("y2 = ln(y), x2 = 1/x => y2 = -bx2 + ln(a)")

y_fit = np.polyfit(1 / x, np.log(y), 1)
print('y_fit:', y_fit)
y_fit_1d = np.poly1d(y_fit)
print('y_fit_1d:', y_fit_1d)

print("-b=%f, ln(a)=%f" % (y_fit[0], y_fit[1]))
a = np.exp(y_fit[1])
b = -y_fit[0]
print("b=%f,a=%f" % (a, b))

y_hat = f_fit(x, a, b)
print('y_hat:', y_hat)

print('Correlation coefficients:')
print(np.corrcoef(y_hat, y))

# plot
plot1 = plt.plot(x, y, 'o', label='Original Values')
plot2 = plt.plot(x, y_hat, 'r', label='Fitting Curve')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc=4)
plt.title('Fitting')
plt.show()

Output:

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/regression/simple_regression_4_2.py
ln(y) = ln(a) - b/x
y2 = ln(y), x2 = 1/x => y2 = -bx2 + ln(a)
y_fit: [-1.04298554  2.41898326]
y_fit_1d:  
-1.043 x + 2.419
-b=-1.042986, ln(a)=2.418983
a=11.234431, b=1.042986
y_hat: [ 3.95902478  6.6691372   7.93530205  8.65586288  9.11923717  9.44185384
  9.67925883  9.86122177 10.0051102  10.12173114 10.21815865 10.29921628
 10.36830553 10.42789365 10.47981363 10.52545563 10.56589302 10.60196778
 10.63434959 10.66357777 10.69009152 10.71425214 10.73635952 10.75666469
 10.77537936]
Correlation coefficients:
[[1.       0.998528]
 [0.998528 1.      ]]

Figure: the data points and the Type 1 fitting curve.

From the output, the correlation coefficient between the original data and the fitted function is 0.998528, so the fit is very close.
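
For comparison, the same a*e^{-b/x} model can also be fitted directly with scipy's curve_fit instead of linearizing it first; note that the direct fit minimizes the squared error on y, while the linearized fit minimizes it on ln(y), so the two can give slightly different parameters. A self-contained sketch (the p0 starting guess is an illustrative assumption):

import numpy as np
from scipy.optimize import curve_fit

x = np.arange(1, 26)
y = np.array([4, 6.4, 8, 8.8, 9.22, 9.5, 9.7, 9.86, 10, 10.1, 10.2, 10.32, 10.4, 10.48,
              10.55, 10.6, 10.62, 10.634, 10.645, 10.646, 10.647, 10.649, 10.65, 10.651, 10.651])


def f_fit(x, a, b):
    return a * np.exp(-b / x)


# p0 is an illustrative initial guess
p_fit, pcov = curve_fit(f_fit, x, y, p0=[10.0, 1.0])
print(p_fit)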

Type 2 fit

Code:

import numpy as np
import matplotlib.pyplot as plt

# X and Y
x = np.array(
    [i + 1 for i in range(25)])
y = np.array(
    [4, 6.4, 8, 8.8, 9.22, 9.5, 9.7, 9.86, 10, 10.1, 10.2, 10.32, 10.4, 10.48, 10.55, 10.6,
     10.62, 10.634, 10.645, 10.646, 10.647, 10.649, 10.65, 10.651, 10.651])


def f_fit(x, a, b):
    return x / (a * x + b)


print("1/y = a + b/x")
print("y2 = 1/y, x2 = 1/x => y2 = a + bx2")

y_fit = np.polyfit(1 / x, 1 / y, 1)
print('y_fit:', y_fit)
y_fit_1d = np.poly1d(y_fit)
print('y_fit_1d:', y_fit_1d)

a = y_fit[1]  # intercept
b = y_fit[0]  # slope
print("a=%f, b=%f" % (a, b))

y_hat = f_fit(x, a, b)
print('y_hat:', y_hat)

print('Correlation coefficients:')
print(np.corrcoef(y_hat, y))

# plot
plot1 = plt.plot(x, y, 'o', label='Original Values')
plot2 = plt.plot(x, y_hat, 'r', label='Fitting Curve')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc=4)
plt.title('Fitting')
plt.show()

Output:

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/regression/simple_regression_4_3.py
1/y = a + b/x
y2 = 1/y, x2 = 1/x => y2 = a + bx2
y_fit: [0.15693212 0.08337992]
y_fit_1d:  
0.1569 x + 0.08338
a=0.083380, b=0.156932
y_hat: [ 4.16125626  6.17871379  7.36970575  8.15574533  8.71335586  9.12947918
  9.45190354  9.7090739   9.91897957 10.09355392 10.24102495 10.36724982
 10.4765114  10.57201394 10.65620243 10.73097499 10.79782763 10.85795537
 10.91232433 10.96172404 11.00680599 11.04811258 11.08609898 11.12115005
 11.15359335]
Correlation coefficients:
[[1.         0.98480368]
 [0.98480368 1.        ]]

Figure: the data points and the Type 2 fitting curve.

From the output, the correlation coefficient between the original data and the fitted function is 0.98480368, which is still a decent fit, but somewhat lower than the Type 1 correlation coefficient (0.998528). So the Type 1 fit is better, and comparing the two figures confirms this visually.

Finally, there is no single best fit, only the most suitable one. Choose the fitting approach based on the distribution of the original data, compare several approaches, and pick the one that fits the data best.
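
As a closing sketch, a small helper in this spirit can automate that comparison; the compare_models name, the model list, and the starting guesses are illustrative assumptions:

import numpy as np
from scipy.optimize import curve_fit


def compare_models(x, y, models):
    # Fit each candidate model with curve_fit and report its correlation coefficient
    for name, func, p0 in models:
        params, _ = curve_fit(func, x, y, p0=p0, maxfev=10000)
        r = np.corrcoef(func(x, *params), y)[0, 1]
        print('%s: params=%s, r=%.6f' % (name, params, r))


x = np.arange(1, 26)
y = np.array([4, 6.4, 8, 8.8, 9.22, 9.5, 9.7, 9.86, 10, 10.1, 10.2, 10.32, 10.4, 10.48,
              10.55, 10.6, 10.62, 10.634, 10.645, 10.646, 10.647, 10.649, 10.65, 10.651, 10.651])

compare_models(x, y, [
    ('Type 1: a*exp(-b/x)', lambda x, a, b: a * np.exp(-b / x), [10.0, 1.0]),
    ('Type 2: x/(a*x+b)', lambda x, a, b: x / (a * x + b), [0.1, 0.1]),
])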

Reference:

https://www.mandeepbasson.com/resources/python/curvefitting/curve_fitting/
