
Improved nearest centroid classifier with shrunken distance measure for null LDA method on cancer classification problem

A. Sharma and K.K. Paliwal

Null linear discriminant analysis (LDA) is a well-known dimensionality reduction technique for the small sample size problem. When the null LDA technique projects the samples to a lower dimensional space, the covariance matrices of individual classes become zero, i.e. all the projected vectors of a given class merge into a single vector. In this case, only the nearest centroid classifier (NCC) can be applied for classification. To improve the classification performance of NCC in the reduced-dimensional space, a shrunken distance based NCC technique is proposed that uses class-conditional a priori probabilities for distance computation. Experiments on several DNA microarray gene expression datasets using the proposed technique show very encouraging results for cancer classification.

In the SD-NCC technique, the distances $d_1$ and $d_2$ are reduced by an amount that depends upon the a priori probability information. In the Figure, class 1 has more samples (almost twice) than class 2. Consequently, the amount of reduction of distance between x and the centroid of class 1 ($\Delta d_1$) will be more than the reduction of distance between x and the centroid of class 2 ($\Delta d_2$). The resultant distances from feature vector x to the centroids will now be $d_1 - \Delta d_1$ and $d_2 - \Delta d_2$. Thus, the shrinkage in distance between the test vector x and the centroid of a class depends on the a priori probability of that class, i.e.

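$$\Delta d_j \propto \lambda_j \|x - \mu_j\|, \quad \text{for } j = 1 \ldots c$$

or

$$\Delta d_j = p\,\lambda_j \|x - \mu_j\|, \quad \text{for } j = 1 \ldots c$$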

Introduction: Cancer classification using the DNA microarray data comes under the category of a small sample size (SSS) problem and consists of a large number of genes (dimensions) compared to the number of feature vectors available in the training set. The high dimensionality of the feature space degrades the generalisation performance of the classifier and increases its computational complexity. This situation (commonly known as the curse of dimensionality) can be overcome by first reducing the dimensionality of feature space, followed by classification in the lower-dimensional feature space.

Several dimensionality reduction methods for the SSS problem have been proposed in the literature (see [1] for details). Of these methods, the null LDA method [2–4] has recently attracted much attention. In this method, the data is transformed to the null space of the within-class scatter matrix. When the null LDA method is used for dimensionality reduction, all the training vectors of a given class get merged into a single vector in the reduced feature space (i.e. the class-conditional variances of the features in the reduced feature space are zero). As a result, the pattern classifiers that need class-conditional variance information (such as the Mahalanobis distance classifier [5], shrunken centroid classifier [6], etc.) cannot be used. In this case, only the nearest centroid classifier (NCC) can be used for classification (with any training vector in a given class defining the centroid of that class). Note that in this case the nearest neighbour classifier behaves similarly to NCC.
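A brief sketch of why the class-conditional variances vanish may be useful at this point. The symbols $S_w$ (within-class scatter matrix) and $W$ (a matrix whose columns span the null space of $S_w$) are assumed here for illustration and are not notation taken from the Letter; $\mu_j$ denotes the centroid of class j in the original feature space.

```latex
% Within-class scatter matrix over the c classes (assumed notation)
S_w = \sum_{j=1}^{c} \sum_{x \in \text{class } j} (x - \mu_j)(x - \mu_j)^{T}

% Columns of W lie in the null space of S_w, so S_w W = 0 and therefore
\operatorname{tr}\left(W^{T} S_w W\right)
  = \sum_{j=1}^{c} \sum_{x \in \text{class } j} \left\| W^{T}(x - \mu_j) \right\|^{2} = 0

% Every term of the sum is non-negative, hence each term is zero:
W^{T} x = W^{T} \mu_j \quad \text{for every training vector } x \text{ in class } j
```

Under these assumptions every training vector of class j is projected onto the same point $W^{T}\mu_j$, which is why any single projected training vector can serve as the class centroid.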

In this Letter, we attempt to improve the classification performance of NCC in the reduced dimensional space. For this, we propose a shrunken distance based NCC (SD-NCC) technique, where a shrunken distance measure is used for distance computation. In SD-NCC, we include the a priori probability information for computing the shrunken distance. Experiments on several microarray gene expression datasets using the SD-NCC technique show encouraging results for cancer classification.

where $\lambda_j$ is the a priori probability of the jth class and can be given as $\lambda_j = n_j/n$ (number of samples in class j / total number of training samples), c is the number of classes, and p denotes the proportionality constant which depends upon the type of training data used. The value of p can be evaluated using the cross-validation procedure on training data. The value for which the misclassification error is minimum in the cross-validation process is the desired p value for the SD-NCC technique. The shrunken distance $d_j$ between feature vector x and the class centroid can now be given as:

$$d_j = \|x - \mu_j\| - \Delta d_j = \|x - \mu_j\| - p\,\lambda_j \|x - \mu_j\|, \quad \text{for } j = 1 \ldots c \qquad (1)$$
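The shrunken distance in (1) can be illustrated with a small sketch. Note that (1) factors as $d_j = (1 - p\lambda_j)\|x - \mu_j\|$, so each Euclidean distance is simply scaled by a class-dependent factor. The function names and the toy data below are hypothetical and chosen only for illustration; the vectors are assumed to be already projected into the reduced space.

```python
import math
from collections import defaultdict


def fit_centroids_and_priors(X, y):
    """Return ({label: centroid}, {label: prior}) with prior lambda_j = n_j / n."""
    groups = defaultdict(list)
    for vector, label in zip(X, y):
        groups[label].append(vector)
    n = len(X)
    centroids = {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in groups.items()
    }
    priors = {label: len(rows) / n for label, rows in groups.items()}
    return centroids, priors


def shrunken_distances(x, centroids, priors, p):
    """Evaluate d_j = ||x - mu_j|| - p * lambda_j * ||x - mu_j|| for every class j."""
    distances = {}
    for label, centroid in centroids.items():
        euclidean = math.dist(x, centroid)
        distances[label] = euclidean - p * priors[label] * euclidean
    return distances


def predict(x, centroids, priors, p):
    """Assign x to the class with the minimum shrunken distance."""
    distances = shrunken_distances(x, centroids, priors, p)
    return min(distances, key=distances.get)


# Hypothetical toy data in a 1-D reduced space: class 1 has twice the samples
# of class 2, and all vectors of a class coincide (as after null LDA).
X_train = [[0.0]] * 4 + [[3.8]] * 2
y_train = [1, 1, 1, 1, 2, 2]
centroids, priors = fit_centroids_and_priors(X_train, y_train)

x = [1.95]  # Euclidean distances: 1.95 to class 1, 1.85 to class 2
print(predict(x, centroids, priors, p=0.0))  # p = 0 reduces to plain NCC
print(predict(x, centroids, priors, p=0.9))  # shrinkage weighted by the priors
```

In this toy setting plain NCC (p = 0) picks class 2, the nearer centroid, whereas p = 0.9 shrinks the two distances to about 0.78 and 1.295 and so selects class 1, the class with the larger a priori probability.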

The cross-validation procedure (to find the optimum value of p) is as set out below. In the training phase the parameters $\lambda_j$, $\mu_j$ and p are computed, and in the test phase a sample is assigned to the class for which the distance $d_j$ is minimum.

Step 1: Given training data, partition it randomly into k roughly equal segments.

Step 2: Hold out one segment as validation data and use the remaining k − 1 segments of the training data as learning data.

Step 3: Use the learning data for finding the null LDA transformation matrix, the centroids of individual classes and their a priori probabilities.

Step 4: Use validation data to compute misclassification error using shrunken distance (1) for a range of values of p. Store the obtained misclassification errors.

Step 5: Repeat steps 1–4 N times.

Step 6: Evaluate average misclassification error over N repetitions.

Step 7: Plot a curve of average misclassification error as a function of p.

Step 8: The argument of minimum average misclassification error will be the optimum p value.
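Steps 1 to 8 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `select_p` and `fit_projection` are hypothetical, the null LDA transformation of Step 3 is left as a pluggable function (the identity is used when none is supplied, which is an assumption), and Step 7 (plotting) is replaced by returning the averaged errors.

```python
import math
import random
from collections import defaultdict


def fit_centroids_and_priors(X, y):
    """Class centroids and a priori probabilities lambda_j = n_j / n."""
    groups = defaultdict(list)
    for vector, label in zip(X, y):
        groups[label].append(vector)
    centroids = {c: [sum(col) / len(rows) for col in zip(*rows)]
                 for c, rows in groups.items()}
    priors = {c: len(rows) / len(X) for c, rows in groups.items()}
    return centroids, priors


def predict(x, centroids, priors, p):
    """Label of the class with minimum shrunken distance, as in (1)."""
    def shrunken(c):
        euclidean = math.dist(x, centroids[c])
        return euclidean - p * priors[c] * euclidean
    return min(centroids, key=shrunken)


def select_p(X, y, p_grid, k=3, n_repeats=20, fit_projection=None, seed=0):
    """Return (optimum p, {p: average misclassification error})."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    errors = {p: [] for p in p_grid}
    for _ in range(n_repeats):                         # Step 5: repeat N times
        rng.shuffle(indices)                           # Step 1: random partition
        segments = [indices[i::k] for i in range(k)]   # k roughly equal segments
        validation = segments[0]                       # Step 2: hold one segment out
        learning = [i for seg in segments[1:] for i in seg]
        X_learn = [X[i] for i in learning]
        y_learn = [y[i] for i in learning]
        # Step 3: projection (identity if none supplied), centroids and priors
        project = fit_projection(X_learn, y_learn) if fit_projection else (lambda v: v)
        centroids, priors = fit_centroids_and_priors(
            [project(v) for v in X_learn], y_learn)
        for p in p_grid:                               # Step 4: error for each p
            wrong = sum(predict(project(X[i]), centroids, priors, p) != y[i]
                        for i in validation)
            errors[p].append(wrong / len(validation))
    average = {p: sum(e) / len(e) for p, e in errors.items()}   # Step 6
    best_p = min(p_grid, key=lambda p: average[p])               # Step 8
    return best_p, average
```

A call such as `select_p(X, y, [i / 10 for i in range(21)], k=3, n_repeats=20)` would scan p from 0 to 2.0; ties are resolved in favour of the first minimising value in the grid, which is a design choice of this sketch.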

Fig. 1 Classification using nearest shrunken distance classifier
In the Figure, $d_1 > d_2$; even so, x belongs to class 1, since $d_1 - \Delta d_1 < d_2 - \Delta d_2$
[Plot on x-axis and y-axis showing the unlabelled feature vector x, the centroids $\mu_1$ (class 1) and $\mu_2$ (class 2), the distances $d_1$ and $d_2 = 1.85$, and the shrinkages $\Delta d_1$ and $\Delta d_2$]

Fig. 2 Average misclassification error against p values
[Curve of average misclassification error plotted over p values, with the minimum marked at p = 0.9]

Shrunken distance NCC technique: To explain the SD-NCC technique, consider a set of n training vectors of dimension d and a feature vector x belonging to that set. The technique is illustrated in Fig. 1. The Figure represents a two-class problem where the centroid of class 1 is $\mu_1$ and that of class 2 is $\mu_2$. The Euclidean distance from feature vector x to $\mu_1$ is $d_1$ and from x to $\mu_2$ is $d_2$.

Results: Five DNA microarray gene expression datasets are used for the experimentation. The description of these datasets is given in Table 1, which refers to [7–11]. The optimum value of p is computed by the k-fold cross-validation procedure (described at the end of the preceding Section) with k = 3 and N = 20. The curve of the average misclassification error as a function of p values for the breast cancer dataset [11] is shown in Fig. 2 for illustration. The optimum value of p is the argument of minimum misclassification error. In Fig. 2, the optimum value of p is 0.9. For evaluating the performance of the procedure, we used an independent test set which was not used during the tuning of parameter p. The results are presented in Table 2. Two techniques, namely null LDA [2] using NCC and regularised LDA [12], have been used for comparing the performance with the SD-NCC technique. We can observe from Table 2 that the SD-NCC technique achieves the highest average accuracy of the three techniques, with regularised LDA scoring higher only on the ALL subtype dataset. In particular, SD-NCC improves on or matches the NCC technique on every dataset.

ELECTRONICS LETTERS 2nd September 2010 Vol. 46 No. 18

References

1 Sharma, A., and Paliwal, K.K.: ‘Rotational linear discriminant analysis for dimensionality reduction’, IEEE Trans. Knowl. Data Eng., 2008, 20, (10), pp. 1336–1347

2 Chen, L.-F., Liao, H.-Y.M., Ko, M.-T., Lin, J.-C., and Yu, G.-J.: ‘A new LDA-based face recognition system which can solve the small sample size problem’, Pattern Recognit., 2000, 33, pp. 1713–1726

3 Cevikalp, H., Neamtu, M., Wilkes, M., and Barkana, A.: ‘Discriminative common vectors for face recognition’, IEEE Trans. Pattern Anal. Mach. Intell., 2005, 27, (1), pp. 4–13

4 Ye, J., and Xiong, T.: ‘Computational and theoretical analysis of null space and orthogonal linear discriminant analysis’, J. Mach. Learn. Res., 2006, 7, pp. 1183–1204

5 Fukunaga, K.: ‘Introduction to statistical pattern recognition’ (Academic Press Inc., Harcourt Brace Jovanovich Publishers, 1990)

6 Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G.: ‘Diagnosis of multiple cancer types by shrunken centroids of gene expression’, Proc. Natl. Acad. Sci. USA, 2002, 99, (10), pp. 6567–6572

7 Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S.: ‘Molecular classification of cancer: class discovery and class prediction by gene expression monitoring’, Science, 1999, 286, pp. 531–537

8 Yeoh, E.J., Ross, M.E., Shurtleff, S.A., Williams, W.K., Patel, D., Mahfouz, R., Behm, F.G., Raimondi, S.C., Relling, M.V., Patel, A., Cheng, C., Campana, D., Wilkins, D., Zhou, X., Li, J., Liu, H., Pui, C.H., Evans, W.E., Naeve, C., Wong, L., and Downing, J.R.: ‘Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling’, Cancer Cell, 2002, 1, (2), pp. 133–143

9 Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R.: ‘Multiclass cancer diagnosis using tumor gene expression signatures’, Proc. Natl. Acad. Sci. USA, 2001, 98, (26), pp. 15149–15154

10 Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., and Meltzer, P.S.: ‘Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural network’, Nature Medicine, 2001, 7, pp. 673–679

11 van’t Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.M.H., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R., and Friend, S.H.: ‘Gene expression profiling predicts clinical outcome of breast cancer’, Letters to Nature, Nature, 2002, 415, pp. 530–536

12 Guo, Y., Hastie, T., and Tibshirani, R.: ‘Regularized discriminant analysis and its application in microarrays’, Biostatistics, 2007, 8, (1), pp. 86–100

Table 1: Datasets used in experimentation

Datasets             Number of classes   Dimension   Number of training samples   Number of testing samples
Acute leukemia [7]   2                   7129        38                           34
ALL subtype [8]      7                   12558       215                          112
GCM [9]              14                  16063       144                          46
SRBCT [10]           4                   2308        63                           20
Breast cancer [11]   2                   24481       78                           19

Table 2: Classification accuracy (%) on DNA microarray gene expression datasets (value of proportionality constant p computed by cross-validation shown in brackets)

Database         Null LDA with NCC   Null LDA with SD-NCC   Regularised LDA
SRBCT            100                 100 (p = 0.4)          100
Acute leukemia   97.1                100 (p = 0.5)          97.1
ALL subtype      86.6                90.2 (p = 0.5)         92.0
GCM              70.4                74.1 (p = 3.4)         74.1
Breast cancer    57.9                68.4 (p = 0.9)         47.4
Average          82.4                86.6                   82.1

Conclusion: We have presented a shrunken distance nearest centroid classifier which utilises class-conditional a priori probabilities for distance computation. The null LDA technique is used for dimensionality reduction and the classifier is applied on the reduced dimensional feature space. The SD-NCC is compared with other classifiers on several DNA microarray gene expression datasets and encouraging results have been noted.

© The Institution of Engineering and Technology 2010

13 July 2010

doi:10.1049/el.2010.1927

One or more of the Figures in this Letter are available in colour online.

A. Sharma and K.K. Paliwal (Signal Processing Lab, Griffith University, Brisbane, QLD-4111, Australia)

E-mail: sharma_al@usp.ac.fj
