The Japanese Vowels Problem:
Task Type
classification, speaker identification
Sources
Original Owner and Donor
Mineichi Kudo, Jun Toyama, Masaru Shimbo
Information Processing Laboratory
Division of Systems and Information Engineering
Graduate School of Engineering
Hokkaido University, Sapporo 060-8628, JAPAN
{mine,jun,shimbo}@main.eng.hokudai.ac.jp
Date Donated: June 13, 2000
Problem Description
Distinguish nine male speakers by their utterances of two Japanese vowels /ae/.
Other Relevant Information
The training file 'ae.train' was used for constructing the classifier and the test file 'ae.test' was used for obtaining the generalization classification rate.
Results
Results for this dataset are reported in:
M. Kudo, J. Toyama and M. Shimbo. (1999). "Multidimensional Curve Classification Using Passing-Through Regions". Pattern Recognition Letters, Vol. 20, No. 11–13, pages 1103–1111.
The classifier proposed in the paper achieved a classification rate of 94.1%, while a 5-state continuous Hidden Markov Model attained up to 96.2%. However, the paper shows that the proposed classifier can handle a wide variety of datasets, and it gives users some intuition about what the obtained classification rule is and how the rule will behave.
Data Type
multivariate time series.
Abstract
This dataset records 640 time series of 12 LPC cepstrum coefficients taken from nine male speakers.
Sources
Original Owner and Donor
Mineichi Kudo, Jun Toyama, Masaru Shimbo
Information Processing Laboratory
Division of Systems and Information Engineering
Graduate School of Engineering
Hokkaido University, Sapporo 060-8628, JAPAN
{mine,jun,shimbo}@main.eng.hokudai.ac.jp
Date Donated: June 13, 2000
Data Characteristics
The data was collected to examine our newly developed classifier for multidimensional curves (multidimensional time series). Nine male speakers uttered the two Japanese vowels /ae/ successively. For each utterance, 12-degree linear prediction analysis was applied with the analysis parameters described below to obtain a discrete time series of 12 LPC cepstrum coefficients. Thus one utterance by a speaker forms a time series whose length is in the range 7-29, and each point of the time series consists of 12 features (the 12 coefficients).
There are 640 time series in total: one set of 270 time series was used for training and the other set of 370 time series for testing.
Number of Instances (Utterances)
- Training: 270 (30 utterances by 9 speakers. See file 'size_ae.train'.)
- Testing: 370 (24-88 utterances by the same 9 speakers in different opportunities. See file 'size_ae.test'.)
Length of Time Series
- 7-29, depending on the utterance
Number of Attributes
- 12 real values
Analysis parameters
- Sampling rate : 10kHz
- Frame length : 25.6 ms
- Shift length : 6.4 ms
- Degree of LPC coefficients : 12
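For intuition, the series lengths can be related back to utterance duration. The sketch below is my own illustration (assuming the usual overlapping-frame layout, where the covered duration is roughly the frame length plus (number of frames - 1) times the shift); it maps the documented lengths of 7-29 frames to approximate utterance durations.

# Rough sanity check of the analysis parameters above (illustrative only;
# assumes the standard overlapping-frame layout, which the dataset docs do not state).
FRAME_LENGTH_MS = 25.6
SHIFT_MS = 6.4

def approx_duration_ms(n_frames):
    """Approximate speech duration covered by n_frames analysis frames."""
    return FRAME_LENGTH_MS + (n_frames - 1) * SHIFT_MS

for n in (7, 29):
    print(n, 'frames ->', approx_duration_ms(n), 'ms')
# prints: 7 frames -> 64.0 ms, 29 frames -> 204.8 ms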
Data Format
Files
- Training file: ae.train
- Testing file: ae.test
Format
Each line in ae.train or ae.test contains 12 LPC coefficients in increasing order, separated by spaces. One line corresponds to one analysis frame.
Lines are organized into blocks: each block is a set of 7-29 lines separated from the next block by a blank line, and corresponds to a single speech utterance of /ae/ with 7-29 frames.
Each speaker's data is a set of consecutive blocks. In ae.train there are 30 blocks per speaker: blocks 1-30 represent speaker 1, blocks 31-60 represent speaker 2, and so on up to speaker 9. In ae.test, speakers 1 to 9 have 31, 35, 88, 44, 29, 24, 40, 50, and 29 blocks, respectively. Thus blocks 1-31 represent speaker 1 (31 utterances of /ae/), blocks 32-66 represent speaker 2 (35 utterances of /ae/), and so on.
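To make this layout concrete, the following is a minimal parsing sketch (my own, not part of the distribution; it assumes ae.train and size_ae.train are in the current directory) that reads the file into blocks and derives one speaker label per block from the per-speaker block counts.

# Minimal parsing sketch for the block format described above.
def read_blocks(path):
    """Return a list of blocks; each block is a list of 12-element frames."""
    blocks, current = [], []
    with open(path) as fp:
        for line in fp:
            if line.strip() == '':            # a blank line closes a block
                if current:
                    blocks.append(current)
                    current = []
            else:
                current.append([float(x) for x in line.split()])
    if current:                                # in case the file lacks a trailing blank line
        blocks.append(current)
    return blocks

def read_sizes(path):
    """Return the number of blocks per speaker, e.g. [30, 30, ..., 30]."""
    with open(path) as fp:
        return [int(x) for x in fp.read().split()]

blocks = read_blocks('ae.train')
sizes = read_sizes('size_ae.train')
labels = [spk for spk, n in enumerate(sizes, start=1) for _ in range(n)]
print(len(blocks), 'blocks,', len(labels), 'labels')   # expected: 270 blocks, 270 labels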
Past Usage
M. Kudo, J. Toyama and M. Shimbo. (1999). "Multidimensional Curve Classification Using Passing-Through Regions". Pattern Recognition Letters, Vol. 20, No. 11–13, pages 1103–1111.
Acknowledgements, Copyright Information, and Availability
If you publish any work using the dataset, please inform the donor. Use for commercial purposes requires donor permission.
References and Further Information
Similar data are available for different utterances /ei/, /iu/, /uo/, /oa/ in addition to /ae/. Please contact the donor if you are interested in using this data.
One possible analysis and solution:
I. Analysis and Understanding of the Problem
This is a classification problem. We are asked to analyze the utterances of the Japanese vowels /ae/ by nine male speakers and use them to distinguish the nine speakers.
The data files are ae.test, ae.train, size_ae.test and size_ae.train. ae.train is the training data and ae.test is the test data. size_ae.train records how many data blocks in ae.train belong to each speaker, and size_ae.test records how many data blocks in ae.test belong to each speaker.
II. Understanding the Data Files
The data was collected to validate the multidimensional curve (multidimensional time series) classifier newly developed by the Japanese researchers. Nine male speakers uttered the two Japanese vowels /ae/ in succession. For each utterance, 12-degree linear prediction analysis was applied to obtain a discrete time series of 12 LPC cepstrum coefficients (LPCC). This means that each utterance by a speaker forms a time series whose length lies between 7 and 29, and every point of the series consists of 12 feature values.
There are 640 time series in total: 270 series are used for training and 370 for testing. The 270 training series correspond to ae.train and the 370 test series to ae.test.
The data in size_ae.train is 30 30 30 30 30 30 30 30 30, which gives the number of data blocks belonging to each of the 9 speakers (the first 30 corresponds to the first speaker, and so on). From this we can tell which speaker each block of ae.train belongs to. ae.train contains 270 blocks, separated from one another by a blank line. Using the information from size_ae.train, blocks 1-30 belong to the first speaker, blocks 31-60 to the second speaker, and so on. Each block contains between 7 and 29 lines, each line representing one time frame of the speaker, and each line has 12 attribute columns because 12 LPCC coefficients are used.
The data in size_ae.test is 31 35 88 44 29 24 40 50 29, which gives the number of data blocks belonging to each of the 9 speakers (the first 31 corresponds to the first speaker, and so on). From this we can tell which speaker each block of ae.test belongs to. ae.test contains 370 blocks, separated from one another by a blank line. Using the information from size_ae.test, blocks 1-31 belong to the first speaker, blocks 32-66 to the second speaker, and so on. Each block contains between 7 and 29 lines, each line representing one time frame of the speaker, and each line has 12 attribute columns.
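To make the block-to-speaker mapping explicit, the short sketch below (my own; the counts are the nine values documented for size_ae.test) prints the range of block numbers owned by each speaker.

# Derive each speaker's block range in ae.test from the per-speaker block counts.
test_sizes = [31, 35, 88, 44, 29, 24, 40, 50, 29]   # contents of size_ae.test

start = 1
for speaker, count in enumerate(test_sizes, start=1):
    end = start + count - 1
    print('speaker %d: blocks %d-%d (%d utterances)' % (speaker, start, end, count))
    start = end + 1
# speaker 1: blocks 1-31, speaker 2: blocks 32-66, ..., speaker 9: blocks 342-370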
III. Overall Structure of the Classification Program
Since the amount of data is small, the program is written in Python, which simplifies the extra data-format conversion steps. The program is divided into six parts:
1. Get data: read the data from ae.test and ae.train.
2. Get size: read the per-speaker block counts from size_ae.test and size_ae.train.
3. Deal data: preprocessing, i.e. the first processing step applied to each block.
4. Restrict data: further processing of the data to obtain the final classification basis.
5. Compare: apply the classification algorithm to the test data and obtain the result set.
6. Deal result: process the result set and report the conclusion and the accuracy.
Algorithm code:
#coding:utf-8
# Problem: classification problem 2
# Author: Sean Luo
# E-mail: [email protected]
# Please keep this notice if you reuse this code
import datetime


def get_data(file_name):
    '''Read the data blocks from a file.'''
    fp = open(file_name, 'r')
    data_block = []
    block_number = 0         # block index
    line_count = 0           # number of lines in the current block
    single_block = [-1, -1]  # block storage: [index, line count, frame, frame, ...]
    for line in fp:
        if line == '\n':     # a blank line closes the current block
            block_number += 1
            # store the block index and line count
            single_block[0] = block_number
            single_block[1] = line_count
            # save the finished block
            data_block.append(single_block)
            # reset for the next block
            line_count = 0
            single_block = [-1, -1]
        else:                # otherwise parse the data line
            line_count += 1
            data_in_line = []
            # split on whitespace (robust to repeated spaces) and convert to floats
            for string in line.split():
                data_in_line.append(float(string))
            # store the parsed frame in the current block
            single_block.append(data_in_line)
    if line_count > 0:       # handle a file that does not end with a blank line
        block_number += 1
        single_block[0] = block_number
        single_block[1] = line_count
        data_block.append(single_block)
    fp.close()
    return data_block


def get_size(file_name):
    '''Read the number of blocks belonging to each speaker.'''
    fp = open(file_name, 'r')
    size_data = []
    for size_line in fp:
        for data in size_line.split():
            try:
                size_data.append(int(data))
            except ValueError:
                pass
    fp.close()
    return size_data


def deal_data(org_data):
    '''Process the blocks; return one row per block: [block number, 12 column means].'''
    data_set = []
    for single_block in org_data:
        block_number = single_block[0]  # block index
        line_count = single_block[1]    # number of lines in this block
        block_data = [block_number]
        col_data = [0.0] * 12
        for line_number in range(line_count):
            data_in_line = single_block[line_number + 2]
            for col_number in range(12):   # sum each of the 12 columns
                col_data[col_number] += data_in_line[col_number]
        for col_number in range(12):       # divide by the line count to get the mean
            col_data[col_number] /= line_count
            block_data.append(col_data[col_number])
        data_set.append(block_data)
    return data_set


def restrict_data(org_data, size_file_name):
    '''Aggregate the block means per speaker; return a 9 x 12 matrix of mean vectors.'''
    data_set = []
    start_pos = 0
    size_data = get_size(size_file_name)
    for mysize in size_data:               # number of blocks for this speaker
        col_data = [0.0] * 12
        save_data = []
        split_data = org_data[start_pos:start_pos + mysize]
        for block_data in split_data:
            for col_number in range(12):   # sum each column over the speaker's blocks
                col_data[col_number] += block_data[col_number + 1]
        for col_number in range(12):       # divide by the block count to get the mean
            col_data[col_number] /= mysize
            save_data.append(col_data[col_number])
        data_set.append(save_data)
        start_pos += mysize
    return data_set


def compare(train_data, test_data):
    '''Match each test speaker to the nearest training speaker under L1 distance.'''
    match_set = []
    for i in range(9):                     # take each of the 9 test vectors in turn
        test_case = test_data[i]
        last_min = 100.0                   # initial minimum (assumes distances stay below 100)
        select_number = 0                  # index of the best matching speaker
        for j in range(9):                 # take each of the 9 training vectors in turn
            total = 0.0
            train_case = train_data[j]
            for k in range(12):
                delta = abs(test_case[k] - train_case[k])
                total += delta             # sum of absolute differences
            if total < last_min:
                last_min = total
                select_number = j
        match_set.append([i, select_number])
    return match_set


def deal_result(result_set):
    '''Print the matches and the overall success rate.'''
    cnt_suc = 0
    cnt_tol = 0
    print('Matching results:')
    for result in result_set:
        cnt_tol += 1
        test_r = result[0] + 1
        train_r = result[1] + 1
        print('Speaker number in test file: ' + str(test_r) + ', matched number: ' + str(train_r))
        if test_r == train_r:
            print('Match succeeded.')
            cnt_suc += 1
        else:
            print('Match failed.')
    print('Success rate: ' + str(100.0 * cnt_suc / cnt_tol) + '%')


def main():
    data_file_name = 'ae.train'
    time_start = datetime.datetime.now()
    org_data = get_data(data_file_name)
    data_set = deal_data(org_data)
    size_file_name = 'size_ae.train'
    train_data = restrict_data(data_set, size_file_name)   # 9 x 12 matrix
    time_end = datetime.datetime.now()
    timedelta = time_end - time_start
    print('Training model processing time: ' + str(timedelta))

    test_file_name = 'ae.test'
    time_start = datetime.datetime.now()
    org2_data = get_data(test_file_name)
    test_set = deal_data(org2_data)
    size_file_name = 'size_ae.test'
    test_data = restrict_data(test_set, size_file_name)    # 9 x 12 matrix
    time_end = datetime.datetime.now()
    timedelta = time_end - time_start
    print('Test data processing time: ' + str(timedelta))

    result_set = compare(train_data, test_data)
    deal_result(result_set)


if __name__ == '__main__':
    main()
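Note that the program above averages all of a speaker's test utterances before matching, so it only makes nine decisions in total. Below is a per-utterance variant, a sketch of my own that reuses get_data, deal_data, restrict_data and get_size from the listing above; it classifies each of the 370 test blocks independently against the nine training means, which is closer to the utterance-level classification task described for this dataset.

# Per-utterance variant: classify every test block against the 9 speaker means
# computed from ae.train, instead of averaging the test data per speaker first.
# Reuses get_data, deal_data, restrict_data and get_size defined above.
def classify_per_utterance():
    train_means = restrict_data(deal_data(get_data('ae.train')), 'size_ae.train')
    test_blocks = deal_data(get_data('ae.test'))       # one mean vector per test block
    test_sizes = get_size('size_ae.test')

    # expand the per-speaker block counts into one true label per test block
    true_labels = [spk for spk, n in enumerate(test_sizes) for _ in range(n)]

    correct = 0
    for block, true_spk in zip(test_blocks, true_labels):
        features = block[1:]                           # drop the leading block number
        # nearest speaker mean under the same L1 (sum of absolute differences) measure
        distances = [sum(abs(features[k] - mean[k]) for k in range(12))
                     for mean in train_means]
        if distances.index(min(distances)) == true_spk:
            correct += 1
    print('Per-utterance accuracy: %.1f%%' % (100.0 * correct / len(true_labels)))

if __name__ == '__main__':
    classify_per_utterance()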
This was my homework for last year's data mining course. My algorithm is extremely naive, haha; anyone who is interested is welcome to take it and use it.
May I ask which version of Python you used?
It was four years ago and I can't remember exactly; probably version 2.6.