A. Sharma and K.K. Paliwal
Null linear discriminant analysis (LDA) is a well-known dimensionality reduction technique for the small sample size problem. When the null LDA technique projects the samples to a lower dimensional space, the covariance matrices of individual classes become zero, i.e. all the projected vectors of a given class merge into a single vector. In this case, only the nearest centroid classifier (NCC) can be applied for classification. To improve the classification performance of NCC in the reduced-dimensional space, a shrunken distance based NCC technique is proposed that uses class-conditional a priori probabilities for distance computation. Experiments on several DNA microarray gene expression datasets using the proposed technique show very encouraging results for cancer classification.
In the SD-NCC technique, the distances d1 and d2 are reduced by an amount that depends upon the a priori probability information. In Fig. 1, class 1 has more samples (almost twice as many) than class 2. Consequently, the reduction of the distance between x and the centroid of class 1 (Δd1) will be larger than the reduction of the distance between x and the centroid of class 2 (Δd2). The resultant distances from feature vector x to the centroids will now be d1 − Δd1 and d2 − Δd2. Thus, the shrinkage in distance between the test vector x and the centroid of a class depends on the a priori probability of that class, i.e.
Δd_j ∝ λ_j ‖x − m_j‖,

or

Δd_j = p λ_j ‖x − m_j‖,   for j = 1, ..., c
Introduction: Cancer classification using DNA microarray data comes under the category of a small sample size (SSS) problem and consists of a large number of genes (dimensions) compared to the number of feature vectors available in the training set. The high dimensionality of the feature space degrades the generalisation performance of the classifier and increases its computational complexity. This situation (commonly known as the curse of dimensionality) can be overcome by first reducing the dimensionality of the feature space, followed by classification in the lower-dimensional feature space.
Several dimensionality reduction methods for the SSS problem have been proposed in the literature (see [1] for details). Of these methods, the null LDA method [2–4] has recently attracted much attention. In this method, the data is transformed to the null space of the within-class scatter matrix. When the null LDA method is used for dimensionality reduction, all the training vectors of a given class get merged into a single vector in the reduced feature space (i.e. the class-conditional variances of the features in the reduced feature space are zero). As a result, the pattern classifiers that need class-conditional variance information (such as the Mahalanobis distance classifier [5], the shrunken centroid classifier [6], etc.) cannot be used. In this case, only the nearest centroid classifier (NCC) can be used for classification (with any training vector in a given class defining the centroid of that class). Note that in this case the nearest neighbour classifier behaves similarly to NCC.
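As a concrete illustration of the null-space transformation described above, the following sketch projects data onto the null space of the within-class scatter matrix and then keeps the directions that maximise the between-class scatter there. This is a reconstruction for illustration only: the function name is hypothetical, and it forms the full d × d scatter matrix, which is practical only for modest d (for microarray-sized data one would first restrict to the span of the training samples).

```python
import numpy as np

def null_lda_transform(X, y):
    """Null-space LDA sketch: the columns of the returned W lie in the null
    space of the within-class scatter, so every training vector of a class
    collapses onto its class centroid after projection."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    # Within-class scatter S_w built from class-centred samples.
    Xc = np.vstack([X[y == c] - X[y == c].mean(axis=0) for c in classes])
    Sw = Xc.T @ Xc
    # Null space of S_w: eigenvectors with (numerically) zero eigenvalue.
    evals, evecs = np.linalg.eigh(Sw)
    null_basis = evecs[:, evals < 1e-10]
    # Between-class scatter restricted to the null space.
    M = np.vstack([np.sqrt((y == c).sum()) * (X[y == c].mean(axis=0) - overall_mean)
                   for c in classes])
    Mn = M @ null_basis
    evals_b, evecs_b = np.linalg.eigh(Mn.T @ Mn)
    # Keep the c - 1 most discriminative directions.
    top = np.argsort(evals_b)[::-1][:len(classes) - 1]
    return null_basis @ evecs_b[:, top]
```

Projecting the training set with W maps all samples of a class onto a single point, which is exactly the situation in which only NCC-style classifiers apply.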
In this Letter, we attempt to improve the classification performance of NCC in the reduced dimensional space. For this, we propose a shrunken distance based NCC (SD-NCC) technique, where a shrunken distance measure is used for distance computation. In SD-NCC, we include the a priori probability information when computing the shrunken distance. Experiments on several microarray gene expression datasets using the SD-NCC technique show encouraging results for cancer classification.
where λ_j is the a priori probability of the jth class and can be given as λ_j = n_j/n (number of samples in class j divided by the total number of training samples), c is the number of classes, and p denotes the proportionality constant, which depends upon the type of training data used. The value of p can be evaluated using the cross-validation procedure on the training data. The value for which the misclassification error is minimum in the cross-validation process is the desired p value for the SD-NCC technique. The shrunken distance d_j between feature vector x and the class centroid can now be given as:
d_j = ‖x − m_j‖ − Δd_j
    = ‖x − m_j‖ − p λ_j ‖x − m_j‖,   for j = 1, ..., c    (1)
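Since Δd_j = p λ_j ‖x − m_j‖, equation (1) is equivalent to scaling each Euclidean distance by the factor (1 − p λ_j), so a class with a larger a priori probability pulls the test vector in more strongly. A minimal sketch of the resulting decision rule (the function name and the inputs, assumed to be already in the reduced space, are illustrative):

```python
import numpy as np

def sd_ncc_predict(x, centroids, priors, p):
    """SD-NCC decision rule of equation (1).

    centroids: (c, r) class centroids m_j in the reduced (null LDA) space
    priors:    (c,) a priori probabilities lambda_j = n_j / n
    p:         proportionality constant chosen by cross-validation
    """
    d = np.linalg.norm(centroids - x, axis=1)  # Euclidean distances d_j
    shrunk = d - p * priors * d                # d_j - p * lambda_j * ||x - m_j||
    return int(np.argmin(shrunk))              # nearest shrunken centroid
```

With p = 0 this reduces to plain NCC; with p > 0 a test vector can be assigned to the more populous class even when its raw Euclidean distance to that class centroid is larger, which is the situation depicted in Fig. 1.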
The cross-validation procedure (to find the optimum value of p) is as set out below. In the training phase the parameters λ_j, m_j and p are computed, and in the test phase a sample is assigned to the class for which the distance d_j is minimum.
Step 1: Given the training data, partition it randomly into k roughly equal segments.
Step 2: Hold out one segment as validation data and the remaining k − 1 segments as learning data.
Step 3: Use the learning data to find the null LDA transformation matrix, the centroids of the individual classes and their a priori probabilities.
Step 4: Use the validation data to compute the misclassification error using the shrunken distance (1) for a range of values of p. Store the obtained misclassification errors.
Step 5: Repeat steps 1–4 N times.
Step 6: Evaluate the average misclassification error over the N repetitions.
Step 7: Plot a curve of the average misclassification error as a function of p.
Step 8: The argument of the minimum average misclassification error is the optimum p value.
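The steps above can be sketched as follows. For brevity the sketch assumes the samples are already in the reduced (null LDA) space, so Step 3 shrinks to computing centroids and a priori probabilities; the helper names are illustrative, not from the Letter, and Step 7's plot is omitted.

```python
import numpy as np

def fit_centroids(X, y):
    """Class centroids and a priori probabilities (Step 3, reduced space)."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([(y == c).mean() for c in classes])
    return classes, centroids, priors

def sd_ncc(x, classes, centroids, priors, p):
    """Shrunken distance decision rule of equation (1)."""
    d = np.linalg.norm(centroids - x, axis=1)
    return classes[np.argmin(d - p * priors * d)]

def tune_p(X, y, p_grid, k=3, N=20, seed=0):
    """Pick the p in p_grid minimising the average k-fold CV error."""
    rng = np.random.default_rng(seed)
    errors = np.zeros(len(p_grid))
    for _ in range(N):                                        # Step 5
        folds = np.array_split(rng.permutation(len(y)), k)    # Step 1
        for f in folds:                                       # Step 2
            learn = np.setdiff1d(np.arange(len(y)), f)
            cls, cen, pri = fit_centroids(X[learn], y[learn])  # Step 3
            for i, p in enumerate(p_grid):                    # Step 4
                pred = np.array([sd_ncc(x, cls, cen, pri, p) for x in X[f]])
                errors[i] += np.mean(pred != y[f])
    errors /= N * k                                           # Step 6
    return p_grid[int(np.argmin(errors))]                     # Step 8
```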
Fig. 1 Classification using nearest shrunken distance classifier
(two classes with centroids m1 and m2, an unlabelled feature vector x, distances d1 and d2, and shrinkages Δd1 and Δd2)

Fig. 2 Average misclassification error against p values
(minimum average misclassification error at p = 0.9)
In Fig. 1, d1 > d2; even so, x belongs to class 1, since d1 − Δd1 < d2 − Δd2.

Shrunken distance NCC technique: To explain the SD-NCC technique, let the training set consist of n d-dimensional feature vectors and let x be a test feature vector. The technique is illustrated in Fig. 1. The Figure represents a two-class problem where the centroid of class 1 is m1 and that of class 2 is m2. The Euclidean distance from feature vector x to m1 is d1 and from x to m2 is d2.
Results: Five DNA microarray gene expression datasets are used for the experimentation. The description of these datasets is given in Table 1, which refers to [7–11]. The optimum value of p is computed by the k-fold cross-validation procedure (described at the end of the preceding Section) with k = 3 and N = 20. The curve of the average misclassification error as a function of the p-values for the breast cancer dataset [11] is shown in Fig. 2 for illustration purposes. The optimum value of p is the argument of the minimum misclassification error. In Fig. 2, the optimum value of p is 0.9. For evaluating the performance of the procedure, we used an independent test set which was not used during
ELECTRONICS LETTERS 2nd September 2010 Vol. 46 No. 18
the tuning of the parameter p. The results are presented in Table 2. The following techniques, namely null LDA [2] using NCC and regularised LDA [12], have been used for comparing performance with the SD-NCC technique. We can observe from Table 2 that the SD-NCC technique performs better than the other techniques. In particular, SD-NCC shows improvement over the NCC technique.
References
1 Sharma, A., and Paliwal, K.K.: 'Rotational linear discriminant analysis for dimensionality reduction', IEEE Trans. Knowl. Data Eng., 2008, 20, (10), pp. 1336–1347
2 Chen, L.-F., Liao, H.-Y.M., Ko, M.-T., Lin, J.-C., and Yu, G.-J.: 'A new LDA-based face recognition system which can solve the small sample size problem', Pattern Recognit., 2000, 33, pp. 1713–1726
3 Cevikalp, H., Neamtu, M., Wilkes, M., and Barkana, A.: 'Discriminative common vectors for face recognition', IEEE Trans. Pattern Anal. Mach. Intell., 2005, 27, (1), pp. 4–13
4 Ye, J., and Xiong, T.: 'Computational and theoretical analysis of null space and orthogonal linear discriminant analysis', J. Mach. Learn. Res., 2006, 7, pp. 1183–1204
5 Fukunaga, K.: 'Introduction to statistical pattern recognition' (Academic Press, Harcourt Brace Jovanovich Publishers, 1990)
6 Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G.: 'Diagnosis of multiple cancer types by shrunken centroids of gene expression', Proc. Natl. Acad. Sci. USA, 2002, 99, (10), pp. 6567–6572
7 Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S.: 'Molecular classification of cancer: class discovery and class prediction by gene expression monitoring', Science, 1999, 286, pp. 531–537
8 Yeoh, E.J., Ross, M.E., Shurtleff, S.A., Williams, W.K., Patel, D., Mahfouz, R., Behm, F.G., Raimondi, S.C., Relling, M.V., Patel, A., Cheng, C., Campana, D., Wilkins, D., Zhou, X., Li, J., Liu, H., Pui, C.H., Evans, W.E., Naeve, C., Wong, L., and Downing, J.R.: 'Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling', Cancer Cell, 2002, 1, (2), pp. 133–143
9 Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R.: 'Multiclass cancer diagnosis using tumor gene expression signatures', Proc. Natl. Acad. Sci. USA, 2001, 98, (26), pp. 15149–15154
10 Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., and Meltzer, P.S.: 'Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks', Nature Medicine, 2001, 7, pp. 673–679
11 van't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.M.H., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R., and Friend, S.H.: 'Gene expression profiling predicts clinical outcome of breast cancer', Nature, 2002, 415, pp. 530–536
12 Guo, Y., Hastie, T., and Tibshirani, R.: 'Regularized discriminant analysis and its application in microarrays', Biostatistics, 2007, 8, (1), pp. 86–100
Table 1: Datasets used in experimentation
Datasets             Class   Dimension   Number of training samples   Number of testing samples
Acute leukemia [7]     2       7129                38                          34
ALL subtype [8]        7      12558               215                         112
GCM [9]               14      16063               144                          46
SRBCT [10]             4       2308                63                          20
Breast cancer [11]     2      24481                78                          19

Table 2: Classification accuracy (%) on DNA microarray gene expression datasets (value of proportionality constant p computed by cross-validation shown in brackets)

Database         Null LDA with NCC   Null LDA with SD-NCC   Regularised LDA
SRBCT                 100                100 (p = 0.4)            100
Acute leukemia        97.1               100 (p = 0.5)            97.1
ALL subtype           86.6               90.2 (p = 0.5)           92.0
GCM                   70.4               74.1 (p = 3.4)           74.1
Breast cancer         57.9               68.4 (p = 0.9)           47.4
Average               82.4               86.6                     82.1

Conclusion: We have presented a shrunken distance nearest centroid classifier which utilises class-conditional a priori probabilities for distance computation. The null LDA technique is used for dimensionality reduction and the classifier is applied in the reduced dimensional feature space. The SD-NCC is compared with other classifiers on several DNA microarray gene expression datasets, and encouraging results have been noted.
© The Institution of Engineering and Technology 2010
13 July 2010
doi: 10.1049/el.2010.1927
One or more of the Figures in this Letter are available in colour online.
A. Sharma and K.K. Paliwal (Signal Processing Lab, Griffith University, Brisbane, QLD 4111, Australia)
E-mail: sharma_al@usp.ac.fj