The Japanese Vowels problem:

Task Type

classification, speaker identification

Sources

Original Owner and Donor

Mineichi Kudo, Jun Toyama, Masaru Shimbo
Information Processing Laboratory
Division of Systems and Information Engineering
Graduate School of Engineering
Hokkaido University, Sapporo 060-8628, JAPAN
{mine,jun,shimbo}@main.eng.hokudai.ac.jp

Date Donated: June 13, 2000

Problem Description

Distinguish nine male speakers by their utterances of two Japanese vowels /ae/.

Other Relevant Information

The training file 'ae.train' was used for constructing the classifier and the test file 'ae.test' was used for obtaining the generalization classification rate.

Results

Results for this dataset are reported in:

M. Kudo, J. Toyama and M. Shimbo. (1999). "Multidimensional Curve Classification Using Passing-Through Regions". Pattern Recognition Letters, Vol. 20, No. 11–13, pages 1103–1111.

The classifier proposed in the paper achieved a classification rate of 94.1%, while a 5-state continuous Hidden Markov Model attained up to 96.2%. However, the proposed classifier was shown in the paper to handle a great variety of datasets. In addition, it gives some intuition about what the obtained classification rule is and how the rule will work.

Data Type

multivariate time series.

Abstract

This dataset records 640 time series of 12 LPC cepstrum coefficients taken from nine male speakers.


Data Characteristics

The data were collected to evaluate our newly developed classifier for multidimensional curves (multidimensional time series). Nine male speakers uttered the two Japanese vowels /ae/ in succession. For each utterance, using the analysis parameters described below, we applied 12th-order linear prediction analysis to obtain a discrete time series of 12 LPC cepstrum coefficients. Thus one utterance by a speaker forms a time series whose length is in the range 7-29, and each point of the series has 12 features (the 12 coefficients).

There are 640 time series in total. One set of 270 time series was used for training and the other set of 370 for testing.

Number of Instances (Utterances)

  • Training: 270 (30 utterances by 9 speakers. See file 'size_ae.train'.)
  • Testing: 370 (24-88 utterances by the same 9 speakers, recorded on different occasions. See file 'size_ae.test'.)

Length of Time Series

  • 7 – 29 depending on utterances

Number of Attributes

  • 12 real values

Analysis parameters

  • Sampling rate : 10kHz
  • Frame length : 25.6 ms
  • Shift length : 6.4ms
  • Degree of LPC coefficients : 12
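
As a rough cross-check of these parameters against the reported series lengths of 7-29, the number of analysis frames for a given utterance duration can be estimated. This is only a sketch: the 64 ms and 205 ms durations below are illustrative assumptions, not values from the source.

```python
# Frame-count estimate from the analysis parameters above.
frame_len_ms = 25.6   # frame length: 25.6 ms
shift_ms = 6.4        # shift length: 6.4 ms

def n_frames(duration_ms):
    """Approximate number of analysis frames covering an utterance."""
    return int(round((duration_ms - frame_len_ms) / shift_ms)) + 1

print(n_frames(64))   # about 64 ms of speech -> 7 frames
print(n_frames(205))  # about 205 ms of speech -> 29 frames
```

Utterances of roughly 64-205 ms would therefore yield the 7-29 frames per series reported above.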

Data Format

Files

  • Training file: ae.train
  • Testing file: ae.test

Format

Each line in ae.train or ae.test contains 12 LPC coefficients in increasing order, separated by spaces. Each line corresponds to one analysis frame.

Lines are organized into blocks: a block is a set of 7-29 lines separated from its neighbors by a blank line, and corresponds to a single utterance of /ae/ with 7-29 frames.

Each speaker corresponds to a set of consecutive blocks. In ae.train there are 30 blocks for each speaker: blocks 1-30 represent speaker 1, blocks 31-60 represent speaker 2, and so on up to speaker 9. In ae.test, speakers 1 to 9 have 31, 35, 88, 44, 29, 24, 40, 50, and 29 blocks respectively. Thus blocks 1-31 represent speaker 1 (31 utterances of /ae/), blocks 32-66 represent speaker 2 (35 utterances of /ae/), and so on.
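
The format just described can be parsed with a short script. This is a sketch: `read_blocks` and `blocks_to_speakers` are names introduced here for illustration, and the file paths assume the dataset files are in the working directory.

```python
def read_blocks(path):
    """Split a data file into blocks of frames; blocks are separated by blank lines."""
    blocks, current = [], []
    with open(path) as fp:
        for line in fp:
            if line.strip():                    # a data line: 12 coefficients
                current.append([float(x) for x in line.split()])
            elif current:                       # a blank line ends the block
                blocks.append(current)
                current = []
    if current:                                 # flush a trailing block, if any
        blocks.append(current)
    return blocks

def blocks_to_speakers(blocks, sizes):
    """Group consecutive blocks by speaker, given per-speaker block counts."""
    speakers, pos = [], 0
    for n in sizes:
        speakers.append(blocks[pos:pos + n])
        pos += n
    return speakers

# e.g. test_speakers = blocks_to_speakers(read_blocks('ae.test'),
#                                         [31, 35, 88, 44, 29, 24, 40, 50, 29])
```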

Past Usage

M. Kudo, J. Toyama and M. Shimbo. (1999). "Multidimensional Curve Classification Using Passing-Through Regions". Pattern Recognition Letters, Vol. 20, No. 11–13, pages 1103–1111.

Acknowledgements, Copyright Information, and Availability

If you publish any work using the dataset, please inform the donor. Use for commercial purposes requires donor permission.

References and Further Information

Similar data are available for different utterances /ei/, /iu/, /uo/, /oa/ in addition to /ae/. Please contact the donor if you are interested in using this data.

 


The UCI KDD Archive
Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425 

Last modified: June 30, 1999

 

One analysis and solution:

 

1. Analysis and understanding of the problem

This is a classification problem. We are asked to analyze utterances of the Japanese vowels /ae/ by nine male speakers in order to distinguish the nine speakers.

The data files are ae.train, ae.test, size_ae.train, and size_ae.test. ae.train holds the training data and ae.test the test data. size_ae.train records how many blocks in ae.train belong to each speaker, and size_ae.test does the same for ae.test.

2. Understanding the data files

The data were collected to validate a newly developed classifier for multidimensional time-series curves. Nine male speakers uttered the two Japanese vowels /ae/ in succession. For each utterance, 12th-order linear prediction analysis was applied to obtain a discrete time series of 12 LPC cepstrum coefficients (LPCC). One utterance by a speaker therefore forms a time series whose length is in the range 7-29, and each point of the series carries 12 feature values.

There are 640 time series in total: 270 for training (the ae.train file) and 370 for testing (the ae.test file).

size_ae.train contains 30 30 30 30 30 30 30 30 30: the number of data blocks belonging to each of the 9 speakers (the first 30 corresponds to speaker 1, and so on). From this we know which speaker each block in ae.train belongs to. ae.train contains 270 blocks, separated by blank lines; blocks 1-30 belong to speaker 1, blocks 31-60 to speaker 2, and so on. Each block contains between 7 and 29 lines, each line representing one analysis frame of the utterance, with 12 attribute columns (the 12 LPCC features).

size_ae.test contains 31 35 88 44 29 24 40 50 29: the number of data blocks belonging to each of the 9 speakers (the first value, 31, corresponds to speaker 1, and so on). From this we know which speaker each block in ae.test belongs to. ae.test contains 370 blocks, separated by blank lines; blocks 1-31 belong to speaker 1, blocks 32-66 to speaker 2, and so on. Each block contains between 7 and 29 lines, each line representing one analysis frame, with 12 attribute columns.
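
The mapping from block index to speaker implied by size_ae.test can be expressed as cumulative offsets. A small sketch (block indices are 1-based, as in the description above; `speaker_of` is a helper name introduced here):

```python
# size_ae.test: blocks per speaker, in order.
sizes = [31, 35, 88, 44, 29, 24, 40, 50, 29]

def speaker_of(block_index, sizes):
    """Return the 1-based speaker number owning a 1-based block index."""
    upper = 0
    for speaker, n in enumerate(sizes, start=1):
        upper += n
        if block_index <= upper:
            return speaker
    raise ValueError('block index out of range')

print(speaker_of(31, sizes))   # last block of speaker 1 -> 1
print(speaker_of(32, sizes))   # first block of speaker 2 -> 2
```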

3. Overall structure of the classification program

Since the data volume is small, the program is written in Python, which avoids extra data-format conversion steps. The program consists of six parts:

1. Get data: read the data from ae.train and ae.test.

2. Get size: read the per-speaker block counts from size_ae.train and size_ae.test.

3. Deal data: preprocessing, i.e. a first processing pass over each data block.

4. Restrict data: further processing of the data to obtain the final classification basis.

5. Compare: run the classification algorithm on the test data to obtain the result set.

6. Deal result: process the result set and report the conclusion and accuracy.

Algorithm code:

#coding:utf-8

#Problem: classification problem 2
#Author: Sean Luo
#E-mail: me@seanluo.com
#Please keep this notice if you reuse this code

import datetime

def get_data(file_name):
    '''Read data blocks from a file.'''
    data_block = []
    block_number = 0          #block index
    line_count = 0            #number of lines in the current block
    single_block = [-1, -1]   #block storage: [number, line count, frames...]

    with open(file_name, 'r') as fp:
        for line in fp:
            if line.strip() == '':    #a blank line ends the current block
                if line_count == 0:   #skip repeated blank lines
                    continue
                block_number += 1
                #store the block number and line count
                single_block[0] = block_number
                single_block[1] = line_count
                #save the block into data_block
                data_block.append(single_block)
                #reset for the next block
                line_count = 0
                single_block = [-1, -1]
            else:                     #a data line: parse its values
                line_count += 1
                #split() handles repeated spaces and the trailing newline
                data_in_line = [float(s) for s in line.split()]
                #store the frame into the block list
                single_block.append(data_in_line)
        if line_count > 0:            #flush a final block not followed by a blank line
            block_number += 1
            single_block[0] = block_number
            single_block[1] = line_count
            data_block.append(single_block)

    return data_block

def get_size(file_name):
    '''Read per-speaker block counts from a size file.'''
    size_data = []
    with open(file_name, 'r') as fp:
        for size_line in fp:
            for token in size_line.split():
                try:
                    size_data.append(int(token))
                except ValueError:
                    pass
    return size_data

def deal_data(org_data):
    '''Average each block's frames: one [block number, 12 means] row per block.'''
    data_set = []
    for single_block in org_data:      #take one block at a time
        block_number = single_block[0] #block index
        line_count = single_block[1]   #number of frame lines in the block
        block_data = [block_number]
        col_data = [0.0] * 12
        for line_number in range(line_count):
            data_in_line = single_block[line_number + 2]  #one frame of 12 values
            for col_number in range(12):
                #sum each of the 12 columns over the block
                col_data[col_number] += data_in_line[col_number]
        for col_number in range(12):
            #divide each column sum by the line count to get the mean
            col_data[col_number] /= line_count
            block_data.append(col_data[col_number])
        data_set.append(block_data)
    return data_set

def restrict_data(org_data, size_file_name):
    '''Average block means per speaker: return a 9x12 matrix of speaker means.'''
    data_set = []
    start_pos = 0
    size_data = get_size(size_file_name)
    for mysize in size_data:               #number of blocks for this speaker
        col_data = [0.0] * 12
        save_data = []
        split_data = org_data[start_pos:start_pos + mysize]
        for block_data in split_data:      #one averaged block row
            for col_number in range(12):
                #sum each of the 12 columns over the speaker's blocks
                col_data[col_number] += block_data[col_number + 1]
        for col_number in range(12):
            #divide by the block count to get the speaker's mean
            col_data[col_number] /= mysize
            save_data.append(col_data[col_number])
        data_set.append(save_data)
        start_pos += mysize
    return data_set

def compare(train_data, test_data):
    match_set = []
    for i in range(9):           #each test speaker, 9 in total
        test_case = test_data[i]
        last_min = float('inf')  #smallest distance seen so far
        select_number = 0        #index of the best match
        for j in range(9):       #each training speaker, 9 in total
            total = 0.0
            train_case = train_data[j]
            for k in range(12):
                delta = abs(test_case[k] - train_case[k])
                total += delta   #sum of absolute differences (L1 distance)
            if total < last_min:
                last_min = total
                select_number = j
        match_set.append([i, select_number])
    return match_set

def deal_result(result_set):
    cnt_suc = 0
    cnt_tol = 0
    print(u'Match results:')
    for result in result_set:
        cnt_tol += 1
        test_r = result[0] + 1
        train_r = result[1] + 1
        print(u'Test-file speaker number: ' + str(test_r) +
              u', matched training speaker: ' + str(train_r))
        if test_r == train_r:
            print(u'Match succeeded.')
            cnt_suc += 1
        else:
            print(u'Match failed.')
    print(u'Success rate: ' + str(100.0 * cnt_suc / cnt_tol) + '%')
            
def main():
    data_file_name = 'ae.train'
    time_start = datetime.datetime.now()
    org_data = get_data(data_file_name)
    data_set = deal_data(org_data)
    size_file_name = 'size_ae.train'
    train_data = restrict_data(data_set, size_file_name)  #9x12 matrix
    time_end = datetime.datetime.now()
    timedelta = time_end - time_start
    print(u'Training-model processing time: ' + str(timedelta))

    test_file_name = 'ae.test'
    time_start = datetime.datetime.now()
    org2_data = get_data(test_file_name)
    test_set = deal_data(org2_data)
    size_file_name = 'size_ae.test'
    test_data = restrict_data(test_set, size_file_name)  #9x12 matrix
    time_end = datetime.datetime.now()
    timedelta = time_end - time_start
    print(u'Test-data processing time: ' + str(timedelta))

    result_set = compare(train_data, test_data)
    deal_result(result_set)

if __name__ == '__main__':
    main()
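
As a quick sanity check, the nearest-match step can be exercised on synthetic data. The sketch below restates the L1 nearest-mean logic of compare() in standalone form (`l1_nearest` is a name introduced here; the numbers are made up):

```python
def l1_nearest(train_rows, test_rows):
    """For each test row, return the index of the training row with the
    smallest sum of absolute differences (L1 distance)."""
    matches = []
    for test_case in test_rows:
        best_j, best_d = 0, float('inf')
        for j, train_case in enumerate(train_rows):
            d = sum(abs(a - b) for a, b in zip(test_case, train_case))
            if d < best_d:
                best_j, best_d = j, d
        matches.append(best_j)
    return matches

# Two 12-dimensional "speaker means" and slightly perturbed test rows:
train = [[0.0] * 12, [1.0] * 12]
test = [[0.1] * 12, [0.9] * 12]
print(l1_nearest(train, test))  # -> [0, 1]
```

Each perturbed test row is matched back to the training row it was derived from, which is exactly the behavior compare() relies on for the real 9x12 speaker-mean matrices.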
    

  3 Responses to "Data mining: an algorithm for the 'Japanese Vowels' problem"

  1. This was homework from last year's data mining course. My algorithm is quite naive, ha ha; feel free to use it if you are interested.

  2. May I ask which version of Python you used?
