Task Failure Prediction in Cloud Data Centers Using Deep Learning

baidu_35560935 · Published 2022-05-19 16:43:16

Introduction

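Data used in the paper: system message logs
Model: multi-layer bidirectional long short-term memory (Bi-LSTM)
Goal: identify task and job failures in the cloud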
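
For orientation, one bidirectional LSTM layer can be written as below. This is the standard textbook formulation and not necessarily the exact variant used in the paper. Here x_t is assumed to be the input feature vector at time step t, the arrows mark the forward and backward passes, and in a multi-layer model h_t serves as the input to the next layer.

```latex
\begin{aligned}
\overrightarrow{h}_t &= \mathrm{LSTM}\big(x_t,\ \overrightarrow{h}_{t-1}\big), \\
\overleftarrow{h}_t &= \mathrm{LSTM}\big(x_t,\ \overleftarrow{h}_{t+1}\big), \\
h_t &= \big[\,\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\,\big].
\end{aligned}
```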

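Other approaches covered in the paper's background section:
References [3] and [7]-[13] use statistical and machine learning methods, such as hidden semi-Markov models (HSMM) and support vector machines (SVM), to predict task and job failures in cloud data centers. They take CPU usage, memory usage, unmapped page cache, mean disk I/O time, and disk usage as input, and task failure or job failure as output. However, HSMM and SVM assume that all their inputs are fixed (stationary) and independent of each other, which does not hold in cloud data centers. As a result, they cannot handle sequential or high-dimensional data in which values at different time points or in different features may depend on one another. In cloud data centers, input features and noisy data are inherently diverse and depend on past events, so HSMM and SVM cannot handle failure prediction there.

Recurrent neural networks (RNN) and LSTM have also been used for failure prediction [14]-[19].
They use CPU usage, memory usage, unmapped page cache, mean disk I/O time, and disk usage as input, and task or job failure as output.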

My own summary

Reference [14]: "Health Status Assessment and Failure Prediction for Hard Drives with Recurrent Neural Networks"
Uses SMART (Self-Monitoring, Analysis and Reporting Technology) data, e.g., Seek Error Rate and Power On Hours.

Reference [15]: "Machine health monitoring using adaptive kernel spectral clustering and deep long short-term memory recurrent neural networks"
The data comes not from computers but from machines in the general sense.
The first four features are the amplitude and energy over the time domain.
The last eight features reflect the distribution over the time domain.

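Reference [16]: "Deeplog: Anomaly detection and diagnosis from system logs through deep learning"
DeepLog was written by Min Du; GitHub has many reimplementations of it.
The data consists of logs from distributed software (e.g., HDFS, OpenStack).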

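Reference [17]: "Desh: Deep learning for system health prediction of lead times to failure in hpc"
Builds a failure identification model by training and classifying, in a generic way, on log data from the operating system and software components.
Uses data from CFDR.
Again, the data is logs.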

Reference [18]: "Failure prediction of jobs in compute clouds: A google cluster case study"
The data is the Google cluster trace, described in the next section.
A server node hosts multiple Linux containers (LXC).
A job consists of at least one task, and each task is constrained by scheduling and resource usage limits.
Each job has multiple tasks; the recorded measurements are:
CPU usage (average and peak),
memory usage (mean, assigned, and peak),
page cache (unmapped and total),
disk I/O time (mean and peak),
disk usage,
cycles per instruction, and
memory accesses per instruction.
All these measurements have been normalized by the respective maximum values measured.
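
The normalization by maximum value described above can be sketched as follows. This is a minimal illustration, not the procedure used to prepare the trace: the function name `normalize_by_max`, the field names, and the sample values are all hypothetical, and the sketch assumes non-negative measurements.

```python
from typing import Dict, List


def normalize_by_max(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Divide every feature by the largest value observed for that feature."""
    if not rows:
        return []
    # Largest observed value per feature across all rows.
    maxima = {key: max(row[key] for row in rows) for key in rows[0]}
    return [
        # A feature whose maximum is 0 stays 0 to avoid division by zero.
        {key: (row[key] / maxima[key] if maxima[key] else 0.0) for key in row}
        for row in rows
    ]


# Made-up measurements, for illustration only.
samples = [
    {"cpu_mean": 2.0, "mem_mean": 512.0},
    {"cpu_mean": 4.0, "mem_mean": 256.0},
]
normalized = normalize_by_max(samples)
```

After this step every feature lies in the range 0 to 1, which keeps features with large raw magnitudes (such as memory in bytes) from dominating the others.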

Number of task resubmissions
For failed jobs and finished jobs, the proportion of tasks executed more than once is 35.8% and 0.9%, respectively.
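
A statistic of this kind could be computed roughly as below. The sketch reads the figure as a per-task share grouped by the job's final status, which is an interpretation; the record layout, the function name `share_with_reruns`, and the example data are hypothetical and do not come from the trace.

```python
from typing import Dict, List, Tuple


def share_with_reruns(jobs: List[Tuple[str, List[int]]]) -> Dict[str, float]:
    """Per final job status, return the share of tasks executed more than once.

    Each job is (status, executions_per_task), where executions_per_task
    holds how many times each task of that job was run.
    """
    total: Dict[str, int] = {}
    rerun: Dict[str, int] = {}
    for status, executions in jobs:
        total[status] = total.get(status, 0) + len(executions)
        rerun[status] = rerun.get(status, 0) + sum(1 for n in executions if n > 1)
    # Skip statuses that have no tasks at all.
    return {status: rerun[status] / total[status] for status in total if total[status]}


# Made-up jobs, for illustration only.
example_jobs = [
    ("failed", [3, 1]),
    ("failed", [2, 1]),
    ("finished", [1, 1, 1, 2]),
]
shares = share_with_reruns(example_jobs)
```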

Reference [19]: "Predicting Application Failure in Cloud: A Machine Learning Approach"
The data is again the Google cluster trace from 2011.

Data description (Google 2011)

Dataset source

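The Google cluster trace [25] starts at 19:00 EDT on Sunday, May 1, 2011. It records the CPU utilization and memory usage of every task on a Google cluster of about 12.5k machines over 29 days. The trace contains 672,075 jobs and more than 48 million tasks within those 29 days.
Source: trace data from Google's Borg cluster management system. (Further reading: other cluster managers such as Kubernetes, abbreviated k8s, and "Borg: the Next Generation".)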

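Data processing

The features used are CPU usage, memory usage, cache usage, mean disk I/O time, disk usage, task priority, number of task resubmissions, and scheduling delay.

The data sampling period is 5 minutes; within each period, a randomly chosen 1-second interval records CPU usage, memory usage, and the other features listed above. Priority values range from 0 to 11 and are grouped by the task scheduler into five classes: lowest, low, middle, high, and highest.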
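
To feed 5-minute samples like these to a sequence model, one common preparation step is to cut each task's time-ordered samples into fixed-length windows. The sketch below illustrates that step only; the function name `make_windows`, the window length of 3, and the sample values are assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def make_windows(
    samples: Sequence[Sequence[float]], window: int
) -> List[List[Sequence[float]]]:
    """Cut a time-ordered list of feature vectors into overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    # A task with fewer samples than `window` yields no windows.
    return [list(samples[i:i + window]) for i in range(len(samples) - window + 1)]


SAMPLE_PERIOD_MIN = 5  # sampling period stated in the text

# Hypothetical task with four 5-minute samples of (cpu, memory) usage.
task_samples = [(0.10, 0.30), (0.20, 0.30), (0.40, 0.35), (0.90, 0.50)]
windows = make_windows(task_samples, window=3)
span_minutes = 3 * SAMPLE_PERIOD_MIN  # wall-clock time covered by one window
```

Each window then serves as one input sequence, so a window of 3 samples corresponds to 15 minutes of task history.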

Similar data (used in the paper)

Google 2019

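Borg: the Next Generation
Translated summary:
Detailed Borg job scheduling information from 8 different Google compute clusters (Borg cells) for May 2019.
The compressed data for the whole trace is 2.8 TiB.
Related notes:
The 2011 data is available for download. For download instructions and other information, see "Google cluster-usage traces: format + schema".
The 2019 data is accessed through Google BigQuery.
For the data contents, see above.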