A. Sharma and K.K. Paliwal
Null linear discriminant analysis (LDA) is a well-known dimensionality reduction technique for the small sample size problem. When the null LDA technique projects the samples to a lower dimensional space, the covariance matrices of individual classes become zero, i.e. all the projected vectors of a given class merge into a single vector. In this case, only the nearest centroid classifier (NCC) can be applied for classification. To improve the classification performance of NCC in the reduced-dimensional space, a shrunken distance based NCC technique is proposed that uses class-conditional a priori probabilities for distance computation. Experiments on several DNA microarray gene expression datasets using the proposed technique show very encouraging results for cancer classification.
In the SD-NCC technique, the distances d1 and d2 are reduced by an amount that depends upon the a priori probability information. In Fig. 1, class 1 has more samples (almost twice as many) than class 2. Consequently, the reduction of the distance between x and the centroid of class 1 (Δd1) will be larger than the reduction of the distance between x and the centroid of class 2 (Δd2). The resultant distances from feature vector x to the centroids will now be d1 − Δd1 and d2 − Δd2. Thus, the shrinkage in distance between the test vector x and the centroid of a class depends on the a priori probability of that class, i.e.
Δd_j ∝ λ_j ‖x − m_j‖,

or

Δd_j = p λ_j ‖x − m_j‖,   for j = 1, ..., c
Introduction: Cancer classification using DNA microarray data comes under the category of a small sample size (SSS) problem and consists of a large number of genes (dimensions) compared to the number of feature vectors available in the training set. The high dimensionality of the feature space degrades the generalisation performance of the classifier and increases its computational complexity. This situation (commonly known as the curse of dimensionality) can be overcome by first reducing the dimensionality of the feature space, followed by classification in the lower-dimensional feature space.
Several dimensionality reduction methods for the SSS problem have been proposed in the literature (see [1] for details). Of these methods, the null LDA method [2–4] has recently attracted much attention. In this method, the data is transformed to the null space of the within-class scatter matrix. When the null LDA method is used for dimensionality reduction, all the training vectors of a given class get merged into a single vector in the reduced feature space (i.e. the class-conditional variances of the features in the reduced feature space are zero). As a result, the pattern classifiers that need class-conditional variance information (such as the Mahalanobis distance classifier [5], the shrunken centroid classifier [6], etc.) cannot be used. In this case, only the nearest centroid classifier (NCC) can be used for classification (with any training vector in a given class defining the centroid of that class). Note that in this case the nearest neighbour classifier behaves similarly to NCC.
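As a concrete illustration of the null-space transformation described above, the following sketch projects data onto the null space of the within-class scatter matrix and then keeps the directions that maximise the between-class scatter there. This is a reconstruction for illustration only: the function name is hypothetical, and it forms the full d × d scatter matrix, which is practical only for modest d (for microarray-sized data one would first restrict to the span of the training samples).

```python
import numpy as np

def null_lda_transform(X, y):
    """Null-space LDA sketch: the columns of the returned W lie in the null
    space of the within-class scatter, so every training vector of a class
    collapses onto its class centroid after projection."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    # Within-class scatter S_w built from class-centred samples.
    Xc = np.vstack([X[y == c] - X[y == c].mean(axis=0) for c in classes])
    Sw = Xc.T @ Xc
    # Null space of S_w: eigenvectors with (numerically) zero eigenvalue.
    evals, evecs = np.linalg.eigh(Sw)
    null_basis = evecs[:, evals < 1e-10]
    # Between-class scatter restricted to the null space.
    M = np.vstack([np.sqrt((y == c).sum()) * (X[y == c].mean(axis=0) - overall_mean)
                   for c in classes])
    Mn = M @ null_basis
    evals_b, evecs_b = np.linalg.eigh(Mn.T @ Mn)
    # Keep the c - 1 most discriminative directions.
    top = np.argsort(evals_b)[::-1][:len(classes) - 1]
    return null_basis @ evecs_b[:, top]
```

Projecting the training set with W maps all samples of a class onto a single point, which is exactly the situation in which only NCC-style classifiers apply.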
In this Letter, we attempt to improve the classification performance of NCC in the reduced dimensional space. For this, we propose a shrunken distance based NCC (SD-NCC) technique, where a shrunken distance measure is used for distance computation. In SD-NCC, we include the a priori probability information when computing the shrunken distance. Experiments on several microarray gene expression datasets using the SD-NCC technique show encouraging results for cancer classification.
where λ_j is the a priori probability of the jth class and can be given as λ_j = n_j/n (number of samples in class j divided by the total number of training samples), c is the number of classes, and p denotes the proportionality constant, which depends upon the type of training data used. The value of p can be evaluated using the cross-validation procedure on the training data. The value for which the misclassification error is minimum in the cross-validation process is the desired p value for the SD-NCC technique. The shrunken distance d_j between feature vector x and the class centroid can now be given as:
d_j = ‖x − m_j‖ − Δd_j
    = ‖x − m_j‖ − p λ_j ‖x − m_j‖,   for j = 1, ..., c    (1)
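Since Δd_j = p λ_j ‖x − m_j‖, equation (1) is equivalent to scaling each Euclidean distance by the factor (1 − p λ_j), so a class with a larger a priori probability pulls the test vector in more strongly. A minimal sketch of the resulting decision rule (the function name and the inputs, assumed to be already in the reduced space, are illustrative):

```python
import numpy as np

def sd_ncc_predict(x, centroids, priors, p):
    """SD-NCC decision rule of equation (1).

    centroids: (c, r) class centroids m_j in the reduced (null LDA) space
    priors:    (c,) a priori probabilities lambda_j = n_j / n
    p:         proportionality constant chosen by cross-validation
    """
    d = np.linalg.norm(centroids - x, axis=1)  # Euclidean distances d_j
    shrunk = d - p * priors * d                # d_j - p * lambda_j * ||x - m_j||
    return int(np.argmin(shrunk))              # nearest shrunken centroid
```

With p = 0 this reduces to plain NCC; with p > 0 a test vector can be assigned to the more populous class even when its raw Euclidean distance to that class centroid is larger, which is the situation depicted in Fig. 1.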
The cross-validation procedure (to find the optimum value of p) is as set out below. In the training phase the parameters λ_j, m_j and p are computed, and in the test phase a sample is assigned to the class for which the distance d_j is minimum.
Step 1: Given the training data, partition it randomly into k roughly equal segments.
Step 2: Hold out one segment as validation data and the remaining k − 1 segments as learning data.
Step 3: Use the learning data to find the null LDA transformation matrix, the centroids of the individual classes and their a priori probabilities.
Step 4: Use the validation data to compute the misclassification error using the shrunken distance (1) for a range of values of p. Store the obtained misclassification errors.
Step 5: Repeat steps 1–4 N times.
Step 6: Evaluate the average misclassification error over the N repetitions.
Step 7: Plot a curve of the average misclassification error as a function of p.
Step 8: The argument of the minimum average misclassification error is the optimum p value.
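The steps above can be sketched as follows. For brevity the sketch assumes the samples are already in the reduced (null LDA) space, so Step 3 shrinks to computing centroids and a priori probabilities; the helper names are illustrative, not from the Letter, and Step 7's plot is omitted.

```python
import numpy as np

def fit_centroids(X, y):
    """Class centroids and a priori probabilities (Step 3, reduced space)."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([(y == c).mean() for c in classes])
    return classes, centroids, priors

def sd_ncc(x, classes, centroids, priors, p):
    """Shrunken distance decision rule of equation (1)."""
    d = np.linalg.norm(centroids - x, axis=1)
    return classes[np.argmin(d - p * priors * d)]

def tune_p(X, y, p_grid, k=3, N=20, seed=0):
    """Pick the p in p_grid minimising the average k-fold CV error."""
    rng = np.random.default_rng(seed)
    errors = np.zeros(len(p_grid))
    for _ in range(N):                                        # Step 5
        folds = np.array_split(rng.permutation(len(y)), k)    # Step 1
        for f in folds:                                       # Step 2
            learn = np.setdiff1d(np.arange(len(y)), f)
            cls, cen, pri = fit_centroids(X[learn], y[learn])  # Step 3
            for i, p in enumerate(p_grid):                    # Step 4
                pred = np.array([sd_ncc(x, cls, cen, pri, p) for x in X[f]])
                errors[i] += np.mean(pred != y[f])
    errors /= N * k                                           # Step 6
    return p_grid[int(np.argmin(errors))]                     # Step 8
```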
Fig. 1 Classification using nearest shrunken distance classifier
(two classes with centroids m1 and m2, an unlabelled feature vector x, distances d1 and d2, and shrinkages Δd1 and Δd2)

Fig. 2 Average misclassification error against p values
(minimum average misclassification error at p = 0.9)
In Fig. 1, d1 > d2; even so, x belongs to class 1, since d1 − Δd1 < d2 − Δd2.

Shrunken distance NCC technique: To explain the SD-NCC technique, let the training set consist of n d-dimensional feature vectors and let x be a test feature vector. The technique is illustrated in Fig. 1. The Figure represents a two-class problem where the centroid of class 1 is m1 and that of class 2 is m2. The Euclidean distance from feature vector x to m1 is d1 and from x to m2 is d2.
Results: Five DNA microarray gene expression datasets are used for the experimentation. The description of these datasets is given in Table 1, which refers to [7–11]. The optimum value of p is computed by the k-fold cross-validation procedure (described at the end of the preceding Section) with k = 3 and N = 20. The curve of the average misclassification error as a function of the p-values for the breast cancer dataset [11] is shown in Fig. 2 for illustration purposes. The optimum value of p is the argument of the minimum misclassification error. In Fig. 2, the optimum value of p is 0.9. For evaluating the performance of the procedure, we used an independent test set which was not used during
ELECTRONICS LETTERS 2nd September 2010 Vol. 46 No. 18
the tuning of the parameter p. The results are presented in Table 2. The following techniques, namely null LDA [2] using NCC and regularised LDA [12], have been used for comparing performance with the SD-NCC technique. We can observe from Table 2 that the SD-NCC technique performs better than the other techniques. In particular, SD-NCC shows improvement over the NCC technique.
References
1 Sharma, A., and Paliwal, K.K.: 'Rotational linear discriminant analysis for dimensionality reduction', IEEE Trans. Knowl. Data Eng., 2008, 20, (10), pp. 1336–1347
2 Chen, L.-F., Liao, H.-Y.M., Ko, M.-T., Lin, J.-C., and Yu, G.-J.: 'A new LDA-based face recognition system which can solve the small sample size problem', Pattern Recognit., 2000, 33, pp. 1713–1726
3 Cevikalp, H., Neamtu, M., Wilkes, M., and Barkana, A.: 'Discriminative common vectors for face recognition', IEEE Trans. Pattern Anal. Mach. Intell., 2005, 27, (1), pp. 4–13
4 Ye, J., and Xiong, T.: 'Computational and theoretical analysis of null space and orthogonal linear discriminant analysis', J. Mach. Learn. Res., 2006, 7, pp. 1183–1204
5 Fukunaga, K.: 'Introduction to statistical pattern recognition' (Academic Press, Harcourt Brace Jovanovich Publishers, 1990)
6 Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G.: 'Diagnosis of multiple cancer types by shrunken centroids of gene expression', Proc. Natl. Acad. Sci. USA, 2002, 99, (10), pp. 6567–6572
7 Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S.: 'Molecular classification of cancer: class discovery and class prediction by gene expression monitoring', Science, 1999, 286, pp. 531–537
8 Yeoh, E.J., Ross, M.E., Shurtleff, S.A., Williams, W.K., Patel, D., Mahfouz, R., Behm, F.G., Raimondi, S.C., Relling, M.V., Patel, A., Cheng, C., Campana, D., Wilkins, D., Zhou, X., Li, J., Liu, H., Pui, C.H., Evans, W.E., Naeve, C., Wong, L., and Downing, J.R.: 'Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling', Cancer Cell, 2002, 1, (2), pp. 133–143
9 Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.-H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R.: 'Multiclass cancer diagnosis using tumor gene expression signatures', Proc. Natl. Acad. Sci. USA, 2001, 98, (26), pp. 15149–15154
10 Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., and Meltzer, P.S.: 'Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks', Nature Medicine, 2001, 7, pp. 673–679
11 van't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.M.H., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R., and Friend, S.H.: 'Gene expression profiling predicts clinical outcome of breast cancer', Nature, 2002, 415, pp. 530–536
12 Guo, Y., Hastie, T., and Tibshirani, R.: 'Regularized discriminant analysis and its application in microarrays', Biostatistics, 2007, 8, (1), pp. 86–100
Table 1: Datasets used in experimentation
Datasets             Class   Dimension   Number of training samples   Number of testing samples
Acute leukemia [7]     2       7129                38                          34
ALL subtype [8]        7      12558               215                         112
GCM [9]               14      16063               144                          46
SRBCT [10]             4       2308                63                          20
Breast cancer [11]     2      24481                78                          19

Table 2: Classification accuracy (%) on DNA microarray gene expression datasets (value of proportionality constant p computed by cross-validation shown in brackets)

Database         Null LDA with NCC   Null LDA with SD-NCC   Regularised LDA
SRBCT                 100                100 (p = 0.4)            100
Acute leukemia        97.1               100 (p = 0.5)            97.1
ALL subtype           86.6               90.2 (p = 0.5)           92.0
GCM                   70.4               74.1 (p = 3.4)           74.1
Breast cancer         57.9               68.4 (p = 0.9)           47.4
Average               82.4               86.6                     82.1

Conclusion: We have presented a shrunken distance nearest centroid classifier which utilises class-conditional a priori probabilities for distance computation. The null LDA technique is used for dimensionality reduction and the classifier is applied in the reduced dimensional feature space. The SD-NCC is compared with other classifiers on several DNA microarray gene expression datasets, and encouraging results have been noted.
© The Institution of Engineering and Technology 2010
13 July 2010
doi: 10.1049/el.2010.1927
One or more of the Figures in this Letter are available in colour online.
A. Sharma and K.K. Paliwal (Signal Processing Lab, Griffith University, Brisbane, QLD 4111, Australia)
E-mail: sharma_al@usp.ac.fj