[iOS10 SpeechRecognition] Best practices for live speech recognition: transcribing as the user speaks

田风有

田风有 · Published 2017-05-16 16:32:14

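First, I want to stress what "speech recognition" literally means as a requirement: the user speaks, and what they say is immediately converted to text and displayed. That is the feature developers actually need.

Before building it, I did the usual Google and Baidu search to see whether there was a ready-made wheel to reuse. The results were disappointing: posts with titles promising an in-depth study of the framework, which then call the API once to recognize a local audio file loaded from a URL, and that is all. They do not cover even the most basic requirement.

Here I summarize two ways to implement this feature:

There are two recognition request APIs, SFSpeechAudioBufferRecognitionRequest and SFSpeechURLRecognitionRequest, and two ways to receive the results, a block or a delegate. I pair them up into two approaches so that all four are covered.

Before development, register the user privacy permissions in Info.plist. Most readers already know this, but I mention it so the article is complete.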

Privacy - Microphone Usage Description

Privacy - Speech Recognition Usage Description

Then call requestAuthorization to ask for permission:

[SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus status) {

// Check the status enum here

}];
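
The status check left as a comment above could look like the following. This is a minimal sketch, not the author's code: the helper name RequestSpeechPermission and its completion block are hypothetical. The authorization handler is not guaranteed to run on the main queue, so the sketch hops to the main queue before reporting the result.

```objc
#import <Foundation/Foundation.h>
#import <Speech/Speech.h>

// Hypothetical helper: asks for speech recognition permission and reports
// YES only when the user has authorized it.
static void RequestSpeechPermission(void (^completion)(BOOL granted)) {
    [SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus status) {
        BOOL granted = NO;
        switch (status) {
            case SFSpeechRecognizerAuthorizationStatusAuthorized:
                granted = YES;
                break;
            case SFSpeechRecognizerAuthorizationStatusDenied:
                NSLog(@"The user denied speech recognition");
                break;
            case SFSpeechRecognizerAuthorizationStatusRestricted:
                NSLog(@"Speech recognition is restricted on this device");
                break;
            case SFSpeechRecognizerAuthorizationStatusNotDetermined:
                NSLog(@"Speech recognition has not been authorized yet");
                break;
        }
        // The handler may arrive on a background queue; switch to the
        // main queue before the caller touches any UI.
        dispatch_async(dispatch_get_main_queue(), ^{
            completion(granted);
        });
    }];
}
```

A caller would typically enable its "start" button only when the completion block receives YES.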

The microphone permission prompt also appears automatically the first time recording starts.
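
If you would rather ask for microphone access up front instead of waiting for the first recording, something like the sketch below should work. It is an assumption on my part rather than part of the original demo: the helper name RequestMicrophonePermission is hypothetical, and choosing AVAudioSessionCategoryRecord is just one reasonable option for a record-only app.

```objc
#import <Foundation/Foundation.h>
#import <AVFoundation/AVFoundation.h>

// Hypothetical helper: configures the shared audio session for recording
// and asks for microphone permission explicitly.
static void RequestMicrophonePermission(void (^completion)(BOOL granted)) {
    AVAudioSession *session = [AVAudioSession sharedInstance];
    NSError *error = nil;

    // Record-only category (assumption: the app does not play audio back).
    if (![session setCategory:AVAudioSessionCategoryRecord error:&error]) {
        NSLog(@"setCategory failed: %@", error);
    }
    if (![session setActive:YES error:&error]) {
        NSLog(@"setActive failed: %@", error);
    }

    [session requestRecordPermission:^(BOOL granted) {
        // Report the result on the main queue so the caller can update UI.
        dispatch_async(dispatch_get_main_queue(), ^{
            completion(granted);
        });
    }];
}
```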

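1. SFSpeechAudioBufferRecognitionRequest with a block

This approach breaks down into the following steps.

① Set up the audio engine

Add the following properties so that the components can be started, stopped, and released: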

@property(nonatomic,strong)SFSpeechRecognizer *bufferRec;

@property(nonatomic,strong)SFSpeechAudioBufferRecognitionRequest *bufferRequest;

@property(nonatomic,strong)SFSpeechRecognitionTask *bufferTask;

@property(nonatomic,strong)AVAudioEngine *bufferEngine;

@property(nonatomic,strong)AVAudioInputNode *buffeInputNode;

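I suggest doing the initialization in the start method, which makes starting and stopping straightforward. If you intend to use app-wide instances, you can also initialize them only once.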

self.bufferRec = [[SFSpeechRecognizer alloc]initWithLocale:[NSLocale localeWithLocaleIdentifier:@"zh_CN"]];

self.bufferEngine = [[AVAudioEngine alloc]init];

self.buffeInputNode = [self.bufferEngine inputNode];

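② Create the speech recognition request

self.bufferRequest = [[SFSpeechAudioBufferRecognitionRequest alloc]init];

self.bufferRequest.shouldReportPartialResults = YES;

The shouldReportPartialResults property is a switch you can set yourself. When it is off, you get a single callback after the whole utterance has been recognized; when it is on, you get a callback for every partial result as recognition proceeds.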

③ Create the task and run it

// The code outside the blocks is all preparation: initial parameter setup and so on

self.bufferRequest = [[SFSpeechAudioBufferRecognitionRequest alloc]init];

self.bufferRequest.shouldReportPartialResults = YES;

__weak ViewController *weakSelf = self;

self.bufferTask = [self.bufferRec recognitionTaskWithRequest:self.bufferRequest resultHandler:^(SFSpeechRecognitionResult * _Nullable result, NSError * _Nullable error) {

// Callback invoked when a result arrives

}];

// Install a tap on bus 0 and append the captured audio stream to the request

AVAudioFormat *format =[self.buffeInputNode outputFormatForBus:0];

[self.buffeInputNode installTapOnBus:0 bufferSize:1024 format:format block:^(AVAudioPCMBuffer * _Nonnull buffer, AVAudioTime * _Nonnull when) {

[weakSelf.bufferRequest appendAudioPCMBuffer:buffer];

}];

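// Prepare and start the engine

[self.bufferEngine prepare];

NSError *error = nil;

if(![self.bufferEngine startAndReturnError:&error]) {

NSLog(@"%@",error.userInfo);

}

self.showBufferText.text = @"Waiting for a command.....";

Anyone with some knowledge of the run loop knows that the code outside the blocks runs first, and the blocks are only invoked later. The normal flow is to initialize the parameters and start the engine; after that, the tap block is called repeatedly to append buffers, and once enough audio has accumulated, the recognition result handler above is called. Sometimes the tap block fires even when there is no sound, but the resultHandler is not called in that case. My guess is that the framework has some built-in tolerance (input whose power stays below a set level is ignored).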

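④ Handle the result callback

The result arrives in the resultHandler block shown above. The block receives result and error, and you can act on them there.

// Use weakSelf inside the block to avoid a retain cycle through bufferTask

if(result != nil) {

    weakSelf.showBufferText.text = result.bestTranscription.formattedString;

}

if(error != nil) {

    NSLog(@"%@",error.userInfo);

}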

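It is worth looking at the properties of the result type SFSpeechRecognitionResult: it has the best transcription (bestTranscription) and an array of alternative transcriptions (transcriptions). If you need exact matching, you should also check the alternatives in that array.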
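
As a rough illustration of checking the alternatives, the sketch below scans every transcription in the result for a keyword. The helper name ResultContainsKeyword is hypothetical, and a plain substring match is only one possible definition of "exact matching".

```objc
#import <Foundation/Foundation.h>
#import <Speech/Speech.h>

// Hypothetical helper: returns YES if any candidate transcription in the
// result contains the given keyword. The transcriptions array is ordered
// by confidence, so the best transcription is examined first.
static BOOL ResultContainsKeyword(SFSpeechRecognitionResult *result, NSString *keyword) {
    for (SFTranscription *candidate in result.transcriptions) {
        if ([candidate.formattedString containsString:keyword]) {
            return YES;
        }
    }
    return NO;
}
```

Inside the resultHandler it could be called as ResultContainsKeyword(result, @"打开"), optionally only when [result isFinal] returns YES so that partial hypotheses do not trigger a command early.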

⑤ Stop listening

[self.bufferEngine stop];

[self.buffeInputNode removeTapOnBus:0];

self.showBufferText.text = @"";

self.bufferRequest = nil;

self.bufferTask = nil;

The bus in these calls identifies a connection point on the node; it is roughly comparable to a port.
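
The teardown above simply drops the request and the task. A slightly more explicit variant, sketched below under my own assumptions, also tells the request that no more audio is coming and lets the task finish with what it has already received. The function name StopBufferRecognition is hypothetical.

```objc
#import <Foundation/Foundation.h>
#import <AVFoundation/AVFoundation.h>
#import <Speech/Speech.h>

// Hypothetical helper: stops capturing audio and closes the recognition
// request in an orderly way.
static void StopBufferRecognition(AVAudioEngine *engine,
                                  SFSpeechAudioBufferRecognitionRequest *request,
                                  SFSpeechRecognitionTask *task) {
    // Stop the engine and remove the tap that was installed on bus 0.
    [engine stop];
    [engine.inputNode removeTapOnBus:0];

    // Signal that no more buffers will be appended.
    [request endAudio];

    // Stop accepting audio but still deliver results for what was captured.
    // Use -cancel instead if pending results should be discarded.
    [task finish];
}
```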

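2. SFSpeechURLRecognitionRequest with a delegate

The main difference between the two is that the block style is concise, while the delegate style leaves more room for custom behavior, because it exposes more lifecycle callbacks for the result.

There is not much to say about these six methods; their names describe what they do. One thing to note is that the second method is called many times, while the third is called once when an utterance is finished.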

// Called when the task first detects speech in the source audio

- (void)speechRecognitionDidDetectSpeech:(SFSpeechRecognitionTask *)task;

// Called for all recognitions, including non-final hypothesis

- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didHypothesizeTranscription:(SFTranscription *)transcription;

// Called only for final recognitions of utterances. No more about the utterance will be reported

- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition:(SFSpeechRecognitionResult *)recognitionResult;

// Called when the task is no longer accepting new audio but may be finishing final processing

- (void)speechRecognitionTaskFinishedReadingAudio:(SFSpeechRecognitionTask *)task;

// Called when the task has been cancelled, either by client app, the user, or the system

- (void)speechRecognitionTaskWasCancelled:(SFSpeechRecognitionTask *)task;

// Called when recognition of all requested utterances is finished.

// If successfully is false, the error property of the task will contain error information

- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishSuccessfully:(BOOL)successfully;
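
To make the difference between the second and third methods concrete, here is a small sketch of a delegate object that shows partial text while the user is speaking and logs the final text. The class name LiveTranscriptDelegate and its label property are hypothetical and are not taken from the demo project.

```objc
#import <UIKit/UIKit.h>
#import <Speech/Speech.h>

// Hypothetical delegate object that mirrors recognition progress into a label.
@interface LiveTranscriptDelegate : NSObject <SFSpeechRecognitionTaskDelegate>
@property (nonatomic, weak) UILabel *label;
@end

@implementation LiveTranscriptDelegate

// Called repeatedly with non-final hypotheses: update the live text.
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task
    didHypothesizeTranscription:(SFTranscription *)transcription {
    self.label.text = transcription.formattedString;
}

// Called once per finished utterance with the final result.
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task
         didFinishRecognition:(SFSpeechRecognitionResult *)recognitionResult {
    self.label.text = recognitionResult.bestTranscription.formattedString;
    NSLog(@"Final: %@", recognitionResult.bestTranscription.formattedString);
}

// Called when the whole task ends; the task's error explains a failure.
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task
        didFinishSuccessfully:(BOOL)successfully {
    if (!successfully) {
        NSLog(@"Recognition failed: %@", task.error);
    }
}

@end
```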

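The idea of this approach is to build a recorder first (one that is started and stopped manually, or a voice-activated one that starts and stops by input volume, like the Talking Tom app), save the recording to a local directory, and then read it back with a URL request for recognition. The steps are as follows.

① Build the voice-activated recorder

The following properties are needed: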

/** Recorder that captures the speech */

@property (nonatomic, strong) AVAudioRecorder *recorder;

/** Monitor that measures the input level */

@property (nonatomic, strong) AVAudioRecorder *monitor;

/** URL of the recording file */

@property (nonatomic, strong) NSURL *recordURL;

/** URL of the monitor file */

@property (nonatomic, strong) NSURL *monitorURL;

/** Timer */

@property (nonatomic, strong) NSTimer *timer;

Initializing the properties

// Recording settings

NSDictionary *recordSettings = [[NSDictionary alloc] initWithObjectsAndKeys:

[NSNumber numberWithFloat: 14400.0], AVSampleRateKey,

[NSNumber numberWithInt: kAudioFormatAppleIMA4], AVFormatIDKey,

[NSNumber numberWithInt: 2], AVNumberOfChannelsKey,

[NSNumber numberWithInt: AVAudioQualityMax], AVEncoderAudioQualityKey,

nil];

NSString *recordPath = [NSTemporaryDirectory() stringByAppendingPathComponent:@"record.caf"];

_recordURL = [NSURL fileURLWithPath:recordPath];

_recorder = [[AVAudioRecorder alloc] initWithURL:_recordURL settings:recordSettings error:NULL];

// Monitor

NSString *monitorPath = [NSTemporaryDirectory() stringByAppendingPathComponent:@"monitor.caf"];

_monitorURL = [NSURL fileURLWithPath:monitorPath];

_monitor = [[AVAudioRecorder alloc] initWithURL:_monitorURL settings:recordSettings error:NULL];

_monitor.meteringEnabled = YES;

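Don't worry too much about the constants in the settings dictionary. The code was lifted straight from something I wrote earlier, and it selects the highest audio quality (AVAudioQualityMax).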

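② Start and stop

To control start and stop by volume, you need a separate monitor alongside the recorder to measure the input level. The peakPowerForChannel: method returns the current peak level at the microphone, and a timer controls how often the level is checked. The code is roughly as follows: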

- (void)setupTimer {

[self.monitor record];

self.timer = [NSTimer scheduledTimerWithTimeInterval:0.1 target:self selector:@selector(updateTimer) userInfo:nil repeats:YES];

}

// Timer callback that decides when recording starts and stops

- (void)updateTimer {

    // The meter values are stale unless updateMeters is called first

    [self.monitor updateMeters];

    // Peak power of channel 0 in dB: -160.0 is complete silence, 0 is the maximum level

    float power = [self.monitor peakPowerForChannel:0];

    // NSLog(@"%f", power);

    if(power > -20) {

        if(!self.recorder.isRecording) {

            NSLog(@"Start recording");

            [self.recorder record];

        }

    }else{

        if(self.recorder.isRecording) {

            NSLog(@"Stop recording");

            [self.recorder stop];

            [self recognition];

        }

    }

}
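
The timer callback above stops recording on the first tick that falls below the threshold, so a brief pause between words can cut an utterance short. One possible refinement, offered only as a sketch and not part of the original demo, is to require several consecutive quiet ticks before declaring the speech finished. The VoiceGate struct and VoiceGateUpdate function are hypothetical names.

```objc
#import <Foundation/Foundation.h>

// Hypothetical state for a simple voice gate with a hold time.
typedef struct {
    float threshold;        // level in dB above which input counts as speech
    NSUInteger holdTicks;   // consecutive quiet ticks required to end speech
    NSUInteger quietTicks;  // quiet ticks seen since the last loud tick
    BOOL speaking;          // YES while an utterance is in progress
} VoiceGate;

// Feeds one metering sample into the gate. Returns YES exactly on the tick
// where an utterance is judged to have ended.
static BOOL VoiceGateUpdate(VoiceGate *gate, float power) {
    if (power > gate->threshold) {
        // Loud enough: (re)start the utterance and reset the quiet counter.
        gate->speaking = YES;
        gate->quietTicks = 0;
        return NO;
    }
    if (!gate->speaking) {
        return NO; // still waiting for speech to begin
    }
    gate->quietTicks += 1;
    if (gate->quietTicks >= gate->holdTicks) {
        // Quiet for long enough: the utterance is over.
        gate->speaking = NO;
        gate->quietTicks = 0;
        return YES;
    }
    return NO;
}
```

With the 0.1 second timer used here, a gate initialized as { -20.0f, 8, 0, NO } would wait for roughly 0.8 seconds of quiet before updateTimer stops the recorder and calls recognition. The value 8 is only an illustrative choice.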

③ The speech recognition request

- (void)recognition {

// Stop the timer

[self.timer invalidate];

// Stop the monitor as well

[self.monitor stop];

// Delete the monitor's recording file

[self.monitor deleteRecording];

// Create the speech recognizer

SFSpeechRecognizer *rec = [[SFSpeechRecognizer alloc]initWithLocale:[NSLocale localeWithLocaleIdentifier:@"zh_CN"]];

// SFSpeechRecognizer *rec = [[SFSpeechRecognizer alloc]initWithLocale:[NSLocale localeWithLocaleIdentifier:@"en_US"]];

// Recognize from a local audio file

SFSpeechRecognitionRequest * request = [[SFSpeechURLRecognitionRequest alloc]initWithURL:_recordURL];

[rec recognitionTaskWithRequest:request delegate:self];

}

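This snippet, which recognizes a local file and turns it into Chinese text, is probably the one circulated most widely online, because it takes no thought to write. On its own, though, it is of very little use. (Apart from WeChat's long-press voice-to-text feature, I cannot think of an app that needs to transcribe an existing local audio file. Automatically generating lyrics for an mp3? Jay Chou's songs would be hard to recognize, and a single recognition also cannot exceed one minute of audio.)

④ The delegate method for the result callback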

- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition:(SFSpeechRecognitionResult *)recognitionResult

{

NSLog(@"%s",__FUNCTION__);

NSLog(@"%@",recognitionResult.bestTranscription.formattedString);

[self setupTimer];

}

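This is the method used most often. Callbacks for the other stages can be added as needed; what is shown here is only a simple demonstration, and my demo project has more features.

iOS 10 made a big leap in speech-related features, in two main areas. One is the speech recognition described above. The other is SiriKit, which can pass information from outside into an app for it to act on. For now its limitations are obvious: it only supports the intent types listed on the official site, such as booking a ride or sending a message, and it cannot even recognize a request like "open Meituan and search for grilled fish restaurants". So there is not much further to explore yet; we will have to wait for Apple's later updates.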

demo https://github.com/13662049573/SXSpeechRecognitionTwoWays-