
## Algorithm Implementation

The GCForest.py source is given below. First, copy this module into your project root and name it GCForest.py; better still, clone it from GitHub.

### gcForest in Python

Status : under development

gcForest is an algorithm proposed by Zhou and Feng (2017). It uses a multi-grain scanning approach for data slicing and a cascade structure of multiple random-forest layers (see the paper for details).

gcForest was first developed as a classifier and designed so that the multi-grain scanning module and the cascade structure can be used separately. During development I paid special attention to writing the code in such a way that future parallelization should be straightforward to implement.
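As a rough sketch of the cascade idea (not the package's actual implementation): each layer's forests emit class-probability vectors, which are concatenated to the original features before being passed to the next layer. A minimal illustration with scikit-learn, where the layer count and forest size are arbitrary choices:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# Toy cascade: each layer augments the raw features with the previous
# layer's class-probability outputs (the core idea behind the cascade;
# the two-layer depth and 30-tree forests here are arbitrary).
X, y = load_digits(return_X_y=True)
features = X
for layer in range(2):
    rf = RandomForestClassifier(n_estimators=30, random_state=0)
    rf.fit(features, y)
    proba = rf.predict_proba(features)   # shape (n_samples, n_classes)
    features = np.hstack([X, proba])     # augment the original features

print(features.shape)  # 64 raw digit features + 10 class probabilities
```

A real cascade would use out-of-fold probabilities and a validation-based stopping rule, as gcForest does.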

### Prerequisites

The present code has been developed under Python 3.x. You will need the following installed for it to work:

• Python 3.x
• Numpy >= 1.12.0
• Scikit-learn >= 0.18.1
• jupyter >= 1.0.0 (only needed to run the tutorial notebook)

You can install all of them using pip:

```
$ pip3 install -r requirements.txt
```

### Using gcForest

The syntax follows the scikit-learn style, with a `.fit()` function to train the algorithm and a `.predict()` function to predict the class of new samples. You can find two examples in the jupyter notebook included in the repository.

```python
from GCForest import *
gcf = gcForest(**kwargs)
gcf.fit(X_train, y_train)
gcf.predict(X_test)
```

### Notes

I wrote the code from scratch in two days and, even though I have tested it on several cases, I obviously cannot certify that it is 100% bug-free. Feel free to test it and send me your feedback about any improvement and/or modification!

### Known Issues

**Memory consumption when slicing data.** There is now a short naive calculation illustrating the issue in the notebook. So far, the input-data slicing is done in a single step to train the random forest for Multi-Grain Scanning. The problem is that this can require a lot of memory, depending on the size of the data set and the number of slices requested, resulting in memory crashes (at least on my Intel Core 2 Duo).

I have recently improved the memory usage (from version 0.1.4) when slicing the data but will keep looking at ways to optimize the code.

**OOB score error.** During random-forest training, the Out-Of-Bag (OOB) technique is used for the prediction probabilities. It was found that this technique can sometimes raise an error when one or several samples end up being used to train every tree.
A potential solution is to use cross-validation instead of the OOB score, although this slows down training. Otherwise, simply increasing the number of trees and re-running the training (and crossing fingers) is often enough.
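A hedged sketch of the cross-validation alternative mentioned above, using scikit-learn's `cross_val_predict` with `method='predict_proba'` to obtain out-of-sample class probabilities for every sample (the forest size and fold count here are arbitrary):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)

# OOB route: rows for samples that were never out-of-bag are unreliable,
# which is the failure mode described above.
rf = RandomForestClassifier(n_estimators=30, oob_score=True,
                            random_state=0).fit(X, y)
oob_proba = rf.oob_decision_function_

# CV route: slower, but every sample gets an out-of-sample probability.
cv_proba = cross_val_predict(
    RandomForestClassifier(n_estimators=30, random_state=0),
    X, y, cv=3, method='predict_proba')

print(oob_proba.shape, cv_proba.shape)  # both (n_samples, n_classes)
```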

### Built With

• PyCharm community edition
• memory_profiler library

### Early Results

(will be updated as new results come out)

• Scikit-learn handwritten digits classification :
training time ~ 5min
accuracy ~ 98%

### Code Excerpt

```python
import itertools
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

__author__ = "Pierre-Yves Lablanche"
__email__ = "plablanche@aims.ac.za"
__version__ = "0.1.3"
__status__ = "Development"


# noinspection PyUnboundLocalVariable
class gcForest(object):

    def __init__(self, shape_1X=None, n_mgsRFtree=30, window=None, stride=1,
                 # ... (remaining parameters omitted in this excerpt)
                 ):
        """ gcForest Classifier. """
```

## About Scale

**Slicing step**

If the window is of size $[w_l, w_L]$, the strides are $[s_l, s_L]$, and each sample is of size $[d_l, d_L]$, then the number of slices per sample is:

$$n_{slices} = \left(\frac{d_l - w_l}{s_l} + 1\right) \left(\frac{d_L - w_L}{s_L} + 1\right)$$

Obviously each slice is of size $[w_l, w_L]$, hence for $N$ samples the total size of the sliced data set is:

$$S_{sliced} = N \cdot n_{slices} \cdot w_l \cdot w_L$$

This is when memory consumption reaches its peak.

**Class vector after Multi-Grain Scanning**

All slices are then fed to the random forests to generate class vectors. The number of class vectors per random forest, per window, per sample is simply equal to the number of slices given to that forest.

Hence, if we have $N_{rf}$ random forests per window, the size of the class-vector output is (recall we have $N$ samples and $C$ classes):

$$S_{vector} = N \cdot n_{slices} \cdot C \cdot N_{rf}$$

And the total size of the Multi-Grain Scanning output is the sum of $S_{vector}$ over all windows.

This short calculation is just meant to give you an idea of the data processing during the Multi-Grain Scanning phase. The actual memory consumption depends on the data type used (float, int, double, etc.), and it is worth examining carefully when dealing with large datasets.
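The back-of-the-envelope calculation above can be packaged as a small helper; it follows the slice-count formula directly, and assumes 8-byte (float64) elements for the byte estimate:

```python
def sliced_size(n_samples, sample_shape, window, stride, itemsize=8):
    """Estimate the size of the sliced data set produced by Multi-Grain
    Scanning: slices per sample, total elements, and total bytes."""
    dl, dL = sample_shape
    wl, wL = window
    sl, sL = stride
    n_slices = ((dl - wl) // sl + 1) * ((dL - wL) // sL + 1)
    n_elements = n_samples * n_slices * wl * wL
    return n_slices, n_elements, n_elements * itemsize

# e.g. 1000 samples of 8x8 images, 4x4 window, stride 1 in both directions
print(sliced_size(1000, (8, 8), (4, 4), (1, 1)))  # → (25, 400000, 3200000)
```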

## Predicting the Up/Down Move of Each Bar

```python
# Get the current time
from datetime import datetime
now = datetime.now()
startDate = '2010-4-16'
endDate = now
# Fetch CSI 300 index futures (IF88) data at daily frequency
df = get_price('IF88', start_date=startDate, end_date=endDate,
               frequency='1d', fields=None, country='cn')

open = df['open'].values
close = df['close'].values
volume = df['volume'].values
high = df['high'].values
low = df['low'].values
```

```python
import talib as ta
import pandas as pd
import numpy as np
from sklearn import preprocessing

ema = ta.EMA(close, timeperiod=30).tolist()
macd = ta.MACD(close, fastperiod=12, slowperiod=26, signalperiod=9)[0].tolist()
momentum = ta.MOM(close, timeperiod=10).tolist()
rsi = ta.RSI(close, timeperiod=14).tolist()
linreg = ta.LINEARREG(close, timeperiod=14).tolist()
var = ta.VAR(close, timeperiod=5, nbdev=1).tolist()
# Hilbert-transform dominant cycle period of the close
cycle = ta.HT_DCPERIOD(close).tolist()
# Average True Range (ATR) over a 14-bar period
atr = ta.ATR(high, low, close, timeperiod=14).tolist()
# Stack the per-bar indicators into array X and transpose, so each row
# holds the 13 feature values (open, close, ..., atr) of one bar
X = np.array([open, close, high, low, volume, ema, macd, linreg,
              momentum, rsi, var, cycle, atr]).T
X[2]
```

```
array([  3215. ,   3267.2,   3281.2,   3208. , 114531. ,      nan,
            nan,      nan,      nan,      nan,      nan,      nan,
            nan])
```

```python
y = []
# Label each bar with the direction of the next close-to-close move
for i in range(1, len(X)):
    if close[i] > close[i-1]:      # up
        y.append(1)
    elif close[i] < close[i-1]:    # down
        y.append(0)
    else:                          # unchanged
        y.append(2)
# Give the last bar the filler label 1 so that len(y) == len(X)
y.append(1)
```
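The labeling loop above can also be written vectorized with NumPy, keeping the same encoding (1 = up, 0 = down, 2 = unchanged, with a filler label 1 for the last bar):

```python
import numpy as np

def make_labels(close):
    # 1 = up, 0 = down, 2 = unchanged, relative to the previous close;
    # the last bar gets the filler label 1, as in the loop above.
    diff = np.diff(close)
    y = np.where(diff > 0, 1, np.where(diff < 0, 0, 2))
    return np.append(y, 1)

print(make_labels([3215.0, 3214.6, 3267.2, 3267.2, 3236.2]))  # → [0 1 2 0 1]
```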

```python
# Convert y to an ndarray
y = np.array(y)
# Print a few labels to verify the label set
print(len(y))
for i in range(1, 10):
    print(close[i], y[i], i)
```

```
1663
3214.6 1 1
3267.2 0 2
3236.2 0 3
3221.2 0 4
3219.6 0 5
3138.8 0 6
3129.0 0 7
3083.8 1 8
3107.0 0 9
```

```python
# Split the data into random train and test subsets;
# test_size is the fraction of the data held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33)
# Inspect the shape of the test feature set
X_te.shape
```

```
(549, 13)
```

As of version 0.1.3, an integer can be passed as the `shape_1X` parameter.

## gcForest Parameter Reference

shape_1X: int or tuple/list (default=None). Shape of a single sample, e.g. `[1, 13]` for sequence data or `[8, 8]` for images; as of 0.1.3 an int is also accepted.

n_mgsRFtree: int (default=30). Number of trees in the random forests used for Multi-Grain Scanning.

window: int (default=None). Size of the sliding window used to slice the raw data.

stride: int (default=1). Step of the sliding window.

min_samples_mgs: float or int (default=0.1). Minimum number of samples in a node to perform a split in the Multi-Grain Scanning random forests.

tolerance: float (default=0.0). Accuracy tolerance for cascade growth; a new layer is kept only if it improves validation accuracy by more than this value.

n_jobs: int (default=1). Number of jobs to run in parallel.

```python
# shape_1X is the sample dimension; window is the sliding-window size used by
# Multi-Grained Scanning to scan the raw data; tolerance is the accuracy
# tolerance for cascade growth: the cascade's performance is estimated on a
# validation set, and training stops when there is no significant gain.
# gcf = gcForest(shape_1X=4, window=2, tolerance=0.0)
# gcf = gcForest(shape_1X=[13, 13], window=2, tolerance=0.0)

gcf = gcForest(shape_1X=13, n_mgsRFtree=100, window=6, stride=2)
gcf.fit(X_tr, y_tr)
```

```
Slicing Sequence…
Training MGS Random Forests…
Layer validation accuracy = 0.5577889447236181
Layer validation accuracy = 0.521608040201005
```

```python
# Same data, but treating each sample as a [1, 13] sequence
# scanned with two window sizes
gcf = gcForest(shape_1X=[1, 13], window=[1, 6])
gcf.fit(X_tr, y_tr)
```

```
Slicing Sequence…
Training MGS Random Forests…
Slicing Sequence…
Training MGS Random Forests…
Layer validation accuracy = 0.5964125560538116
Layer validation accuracy = 0.5695067264573991
```

Now check the predictions on the test set:

```python
pred_X = gcf.predict(X_te)
print(len(pred_X))
print(len(y_te))
print(pred_X)
```

```
Slicing Sequence…
Slicing Sequence…
549
549
[1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 1 0 1 0 1 0 0 0 0 0 0 1 1 1 0 0 1 0 ...]
```

```python
# Most recent predictions, counting back from the end of the test set
for i in range(1, len(pred_X)):
    print(y_te[-i], pred_X[-i], -i)
```

```
0 1 -1
0 0 -2
1 0 -3
1 0 -4
0 1 -5
```

```python
# Record each day's outcome: 1 if the prediction was correct, 0 if it was wrong
result_list = []

# Check whether prediction i was correct
def checkPredict(i):
    if pred_X[i] == y_te[i]:
        result_list.append(1)
    else:
        result_list.append(0)

# Accuracy over the (k+1)-th most recent window of length j
k = 0
j = len(y_te)
# j = 100
for i in range(len(y_te) - j*(k+1), len(y_te) - j*k):
    checkPredict(i)
print(len(y_te))
print(len(result_list))
```

```python
import matplotlib.pyplot as plt

# Plot the running accuracy
x = range(0, len(result_list))
y = []
for i in range(0, len(result_list)):
    y.append(float(sum(result_list[:i])) / (i + 1))
print('Accuracy over the last', j, 'predictions:', y[-1])
print(x, y)
line, = plt.plot(x, y)
plt.show()
```

```
549
549

range(0, 549) [0.0, 0.0, 0.3333333333333333, 0.25 ...]
```
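The running-accuracy loop can be replaced with a cumulative mean; the sketch below uses the conventional inclusive form, where element i is the accuracy over the first i+1 predictions:

```python
import numpy as np

# 1 = correct prediction, 0 = wrong (same encoding as result_list above)
hits = np.array([1, 0, 1, 1, 0, 1])
running_acc = np.cumsum(hits) / np.arange(1, len(hits) + 1)
print(running_acc[-1])  # accuracy over all 6 toy predictions
```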

```python
# Evaluate accuracy on the test set
accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)
print('gcForest accuracy : {}'.format(accuracy))
```

```
gcForest accuracy : 0.5300546448087432
```

```python
# Load the scikit-learn handwritten-digits data set
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4)
gcf = gcForest(shape_1X=[8, 8], window=[4, 6], tolerance=0.0,
               min_samples_mgs=10, min_samples_cascade=7)
# gcf = gcForest(shape_1X=13, window=13, tolerance=0.0, min_samples_mgs=10, min_samples_cascade=7)
gcf.fit(X_tr, y_tr)
```

```
Slicing Images…
Training MGS Random Forests…
Slicing Images…
Training MGS Random Forests…
Layer validation accuracy = 0.9814814814814815
Layer validation accuracy = 0.9814814814814815
```

```python
# Evaluate accuracy on the test set
pred_X = gcf.predict(X_te)
accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)
print('gcForest accuracy : {}'.format(accuracy))
```

```
gcForest accuracy : 0.980528511821975
```

## Using Multi-Grain Scanning and Cascade Forest Separately

```python
gcf = gcForest(shape_1X=[8, 8], window=5, min_samples_mgs=10, min_samples_cascade=7)
X_tr_mgs = gcf.mg_scanning(X_tr, y_tr)
X_te_mgs = gcf.mg_scanning(X_te)
```

```
Slicing Images…
Training MGS Random Forests…
```
It is now possible to use the mg_scanning output as input for cascade forests with different parameters. Note that the cascade forest module does not directly return predictions but the probability predictions from each random forest in the last layer of the cascade; hence the need to first take the mean of the outputs and then the argmax.
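The mean-then-argmax step described above, shown on a dummy probability array of shape (n_forests, n_samples, n_classes):

```python
import numpy as np

# Dummy last-layer output: 2 forests, 3 samples, 2 classes
pred_proba = np.array([[[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],
                       [[0.7, 0.3], [0.4, 0.6], [0.3, 0.7]]])
mean_proba = np.mean(pred_proba, axis=0)   # average across forests -> (3, 2)
preds = np.argmax(mean_proba, axis=1)      # most probable class per sample
print(preds)  # → [0 1 1]
```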

```python
gcf = gcForest(tolerance=0.0, min_samples_mgs=10, min_samples_cascade=7)
_ = gcf.cascade_forest(X_tr_mgs, y_tr)
```

```
Layer validation accuracy = 0.9722222222222222
Layer validation accuracy = 0.9907407407407407
Layer validation accuracy = 0.9814814814814815
```

```python
import numpy as np

# The cascade returns per-forest probability predictions for its last
# layer; average them across forests, then take the argmax per sample
pred_proba = gcf.cascade_forest(X_te_mgs)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
accuracy_score(y_true=y_te, y_pred=preds)
```

```
0.97774687065368571
Layer validation accuracy = 0.9629629629629629
Layer validation accuracy = 0.9675925925925926
Layer validation accuracy = 0.9722222222222222
Layer validation accuracy = 0.9722222222222222
0.97218358831710705
```

### Skipping mg_scanning

It is also possible to use the cascade forest directly and skip the multi-grain scanning step.

```python
gcf = gcForest(tolerance=0.0, min_samples_cascade=20)
_ = gcf.cascade_forest(X_tr, y_tr)

pred_proba = gcf.cascade_forest(X_te)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
accuracy_score(y_true=y_te, y_pred=preds)
```