Sandip Sen & Mahendra Sekaran
Department of Mathematical & Computer Sciences
University of Tulsa, 600 South College Avenue, Tulsa, OK 74104-31
Phone: +918-631-2985
e-mail: sandip@kolkata.mcs.utulsa.edu
Abstract
Social agents, both human and computational, inhabiting a world containing multiple active agents, need to coordinate their activities. This is because agents share resources, and without proper coordination or "rules of the road", everybody will be interfering with the plans of others. As such, we need coordination schemes that allow agents to effectively achieve local goals without adversely affecting the problem-solving capabilities of other agents. Researchers in the field of Distributed Artificial Intelligence (DAI) have developed a variety of coordination schemes under different assumptions about agent capabilities and relationships. Whereas some of this research has been motivated by human cognitive biases, other work has approached it as an engineering problem of designing the most effective coordination architecture or protocol. We evaluate individual and concurrent learning by multiple, autonomous agents as a means for acquiring coordination knowledge. We show that a uniform reinforcement learning algorithm suffices as a coordination mechanism in both cooperative and adversarial situations. Using a number of multiagent learning scenarios with both tight and loose coupling between agents and with immediate as well as delayed feedback, we demonstrate that agents can consistently develop effective policies to coordinate their actions without explicit information sharing. We demonstrate the viability of using both the Q-learning algorithm and genetic algorithm based classifier systems with different payoff schemes, namely the bucket brigade algorithm (BBA) and the profit sharing plan (PSP), for developing agent coordination on two different multi-agent domains. In addition, we show that a semi-random scheme for action selection is preferable to the more traditional fitness proportionate selection scheme used in classifier systems.
1 Introduction
One of the primary goals of researchers in the artificial intelligence (AI) community is to develop autonomous agents that are knowledgeable and cognizant enough to carry out at least routine activities performed by humans. The problem, however, is an extremely difficult one, and decades of research on related areas have only served to highlight its complexity and magnitude. While researchers are developing interesting results in crucial areas like knowledge representation, planning, learning, non-monotonic reasoning, cooperative problem solving, etc., it is equally important to analyze and experiment with results from one such subfield to benefit one or more of the other ones. Isolated and domain-specific developments in any one of the sub-fields of AI are not going to go a long way toward advancing the whole field. In this paper, we will demonstrate how noteworthy advances in one particular sub-field of AI can be effectively used in another subfield to provide agents with knowledge required to solve a difficult problem. We will be applying recent research developments from the reinforcement learning literature to the coordination problem in multiagent systems. In a reinforcement learning scenario, an agent chooses actions based on its perceptions, receives scalar feedback based on past actions, and is expected to develop a mapping from perceptions to actions that will maximize feedback. Multiagent systems are a particular type of distributed AI system [2, 29], in which autonomous intelligent agents inhabit a world with no global control or globally consistent knowledge. In contrast to cooperative problem solvers [13], agents in multiagent systems are not pre-disposed to help each other out with all the resources and capabilities that they possess. These agents may still need to coordinate their activities with others to achieve their own local goals. They could benefit from receiving information about what others are doing or plan to do, and from sending them information to influence what they do.
Whereas previous research on developing agent coordination mechanisms focused on off-line design of agent organizations, behavioral rules, negotiation protocols, etc., it was recognized that agents operating in open, dynamic environments must be able to flexibly adapt to changing demands and opportunities [29, 44, 54]. In particular, individual agents are forced to engage with other agents which have varying goals, abilities, composition, and lifespan. To effectively utilize opportunities presented and avoid pitfalls, agents need to learn about other agents and adapt local behavior based on group composition and dynamics. For machine learning researchers, multiagent learning problems are challenging, because they violate the stationary environment assumptions used by most machine learning systems. As multiple agents learn simultaneously, the feedback received by the same agent for the same action varies considerably. The stationary environment assumption used by most current machine learning systems is not well-suited for such rapidly changing environments.
In this paper, we discuss how reinforcement learning techniques for developing policies to optimize environmental feedback, through a mapping between perceptions and actions, can be used by multiple agents to learn coordination strategies without having to rely on shared information. These agents work in a common environment, but are unaware of the capabilities of other agents and may or may not be cognizant of goals to achieve. We show that through repeated problem-solving experience, such agents can develop policies to maximize environmental feedback that can be interpreted as goal achievement from the viewpoint of an external observer. More interestingly, we demonstrate that in some domains these agents develop policies that complement each other.
To evaluate the applicability of reinforcement learning schemes for enabling multiagent coordination, we chose to investigate a number of multiagent domains with varying environmental characteristics. In particular, we designed environments in which the following characteristics were varied:
Agent coupling: In some domains the actions of one agent strongly and frequently affect the plans of other agents (tightly coupled system), whereas in other domains the actions of one agent only weakly and infrequently affect the plans of other agents (loosely coupled system).

Agent relationships: Agents in a multiagent system can have different kinds of mutual relationships:
• they may act in a group to solve a common problem (cooperative agents),
• they may not have any preset dispositions towards each other but interact because they use common resources (indifferent agents),
• they may have opposing interests (adversarial agents).
For discussions in this paper, we have grouped the latter two classes of domains as non-cooperative domains.

Feedback timing: In some domains, the agents may have immediate knowledge of the effects of their actions, whereas in others they may get the feedback for their actions only after a period of delay.

Optimal behavior combinations: How many behavior combinations of participating agents will optimally solve the task at hand? This value varies from one to infinity for different domains.
In addition, in this paper we concentrate exclusively on domains in which agents have little or no pre-existing domain expertise, and have no information about the capabilities and goals of other agents. These assumptions make the coordination problem particularly hard. This is particularly evident from the fact that almost all currently used coordination mechanisms rely heavily on domain knowledge and shared information between agents. The goal of our work is not to replace the previously developed coordination schemes, but to complement them by providing new coordination techniques for domains where the currently available schemes are not effective. In particular, domains where agents know little about each other provide a difficult challenge for currently used coordination schemes. Our contention is that problem solving performance or the feedback received from the environment can be effectively used by reinforcement based learning agents to circumvent the lack of common knowledge.
To verify our intuitions more rigorously, we decided to investigate two well-known reinforcement learning schemes: the Q-learning algorithm developed by Watkins [51], and the classifier systems method developed by Holland. Whereas the Q-learning algorithm was inspired by the theory of dynamic programming for optimization, classifier systems arose from an interesting blend of rule-based reasoning and computational mechanisms inspired by natural genetics [4, 25]. Q-learning and classifier systems have been proposed as two general reinforcement learning frameworks for achieving agent coordination in a multiagent system. This research opens up a new dimension of constructing coordination strategies for multiagent systems.
At this point, we would like to emphasize the difference between this work and other recent publications in the nascent area of multiagent learning (MAL) research. Previous proposals for using learning techniques to coordinate multiple agents have mostly relied on using prior knowledge [5], or on cooperative domains with unrestricted information sharing [47]. A significant percentage of this research has concentrated on cooperative learning between communicating agents, where agents share their knowledge or experiences [16, 38, 37, 50]. Some researchers have used communication to help agent groups jointly decide on their course of actions [52]. The isolated instances of research in MAL that do not use explicit communication have concentrated on competitive, rather than cooperative, domains [7, 30, 40]. We strongly believe that learning coordination policies without communication has a significant role to play in competitive as well as cooperative domains. There is little argument over the fact that communication is an invaluable tool to be used by agent groups to coordinate their activities. At times communication is the most effective and even perhaps the only mechanism to guarantee coordinated behavior. Though communication is often helpful and indispensable as an aid to group activity, it does not guarantee coordinated behavior [22], is time-consuming, and can detract from other problem solving activity if not carefully controlled [12]. Also, agents overly reliant on communication will be severely affected if the quality of communication is compromised (broken communication channels, incorrect or deliberately misleading information, etc.). At other times communication can be risky or even fatal (as in some combat situations where the adversary can intercept communicated messages).
We believe that even when communication is feasible and safe, it may be prudent to use it only as necessary. For example, if an agent is able to predict the behavior of other agents from past observations, it can possibly adjust its own plans to use a shared resource without the need for explicitly arriving at a contract using communication that consumes valuable time (to have performance guarantees, though, contracts arrived at using communication are possibly the most effective procedure; most of the problem solving activities by agents, however, do not involve hard guarantees or deadlines). We should strive to realize the maximum potential of the system without using communication. Once that has been accomplished, communication can be added to augment the performance of the system to the desired efficiency level. Such a design philosophy will lead to systems where agents do not flood communication channels with unwarranted information and agents do not have to sift through a maze of useless data to locate necessary and time-critical information. With this goal in mind we have investigated the usefulness of acquiring coordination strategies without sharing information [43, 45]. We expand on this body of work in this paper, and explore the advantages and limitations of learning without communication as a means of generating coordinated behavior in autonomous agents.
The rest of the paper is organized as follows: Section 2 presents highlights of the previous approaches to developing coordination strategies for multiple, autonomous agents; Section 3 provides a categorization of multiagent systems to identify different learning scenarios and presents a sampling of prior multiagent learning research; Section 4 reviews the reinforcement learning techniques that we have utilized in this paper; Sections 5, 6, and 7 present results of experiments with Q-learning and genetic algorithm based reinforcement learning systems on a block pushing, a robot navigation, and a resource sharing domain respectively; Section 8 summarizes the lessons learnt from this research and outlines future research directions.
2 Coordination of multiple agents
In a world inhabited by multiple agents, coordination is a key to group as well as individual success. We need to coordinate our actions whenever we are sharing goals, resources, or expertise. By coordination we mean choosing one's own action based on the expectation of others' actions. Coordination is essential for cooperative, indifferent, and even adversarial agents. As computer scientists, we are interested in developing computational mechanisms that are domain independent and robust in the presence of noisy, incomplete, and out-of-date information. Research in the area of multiagent systems has produced techniques for allowing multiple agents, which share common resources, to coordinate their actions so that individually rational actions do not adversely affect overall system efficiency [2, 13, 17, 26]. Coordination of problem solvers, irrespective of whether they are selfish or cooperative, is a key issue in the design of an effective multiagent system. The search for domain-independent coordination mechanisms has yielded some very different, yet effective, classes of coordination schemes. The most influential classes of coordination mechanisms developed to date are the following:
• protocols based on contracting [9, 48]
• distributed search formalisms [14, 57]
• organizational and social laws [15, 32, 33]
• multi-agent planning [11, 39]
• decision and game theoretic negotiations [18, 19, 58]
• linguistic approaches [8, 55]
Whereas some of these approaches use architectures and protocols designed off-line [15, 46, 48] as coordination structures, others acquire coordination knowledge on-line [11, 19]. Almost all of the coordination schemes developed to date assume explicit or implicit sharing of information. In the explicit form of information sharing, agents communicate partial results [11], speech acts [8], resource availabilities [48], etc. to other agents to facilitate the process of coordination. In the implicit form of information sharing, agents use knowledge about the capabilities of other agents [15, 18, 58] to aid local decision-making. Though each of these approaches has its own benefits and weaknesses, we believe that the less an agent depends on shared information, and the more flexible it is to the on-line arrival of problem-solving and coordination knowledge, the better it can adapt to changing environments. As flexibility and adaptability are key aspects of intelligent and autonomous behavior, we are interested in investigating mechanisms by which agents can acquire and use coordination knowledge through interactions with their environment (which includes other agents) without having to rely on shared information.
One can also argue that pre-fabricated coordination strategies can quickly become inadequate if the system designer's model of the world is incomplete/incorrect or if the environment in which the agents are situated can change dynamically. Coordination strategies that incorporate learning and adaptation components will be more robust and effective in these more realistic scenarios. Thus agents will be able to take advantage of new opportunities and deal with new contingencies presented by the environment which cannot be foreseen at design time.
3 Learning in multiagent systems
To highlight learning opportunities inherent in most multiagent systems, we develop here a categorization of multiagent problems that can be used to characterize the nature of learning mechanisms that should be used for these problems. The categorization presented in Table 1 is not meant to be the only or even the most definitive taxonomy of the field. The dimensions we consider for our taxonomy are the following: agent relationships (cooperative vs. non-cooperative) and use of communication (agents explicitly communicating or not). Within cooperative domains again, we further categorize domains based on decision-making authorities (individual vs. shared). These dimensions are not necessarily completely orthogonal, e.g., shared decision making may not be feasible without the use of explicit communication. We use the term cooperative relationship to refer to situations where multiple agents are working towards a common goal. Though there may be local goals, these are, in fact, subgoals of the common goal (even if different subgoals interfere). A number of different domains fall under the non-cooperative spectrum. These include competitive scenarios (one agent's gain is another agent's loss; need not be zero-sum), as well as scenarios where agents coordinate only to avoid conflicts.
As mentioned before, shared learning by a group of cooperative agents has received the most attention in the MAL literature [16, 37, 52]. Others have looked at each agent learning individually but using communication to share pertinent information [6, 38, 50]. Learning to cooperate without explicit information sharing can be based primarily on environmental feedback [45] or on observation of other agents in action [23]. Researchers have investigated two approaches to learning without explicit communication in non-cooperative domains: (a) treating opponents as part of the environment without explicit modeling [43], (b) learning explicit competitor models [30, 40]. Little work has been done to date in learning to compete with explicit communication. Possible scenarios to investigate in this area include learning the strengths and weaknesses of the competitor from intercepted communication, or pro-actively probing the opponent to gather more information about its preferences.
Even previous work on using reinforcement learning for coordinating multiple agents [50, 52] has relied on explicit information sharing. We, however, concentrate on systems where agents share no problem-solving knowledge. We show that although each agent is independently using reinforcement learning techniques to optimize its own environmental reward, global coordination between multiple agents can emerge without explicit or implicit information sharing. These agents can therefore act independently and autonomously, without being affected by communication delays (due to other agents being busy) or failure of a key agent (who controls information exchange or who has more information), and do not have to worry about the reliability of the information received (Do I believe the information received? Is the communicating agent an accomplice or an adversary?). The resultant systems are, therefore, robust and fault-tolerant.
Schaerf et al. have studied the use of reinforcement learning based agents for load balancing in distributed systems [41]. In this work, a comprehensive history of past performance is used to make informed decisions about the choice of resources to submit jobs to. Parker [36] has studied the emergence of coordination in simulated robot groups by using simple adaptive schemes that alter robot motivations. A major difference from our work is that the simulated robots in this work build explicit models of other robots. Other researchers have used reinforcement learning for developing effective groups of physical robots [34, 56]. Mataric [34] concentrates on using intermediate feedback for subgoal fulfillment to accelerate learning. In contrast with our work, the evaluation of learning effectiveness under varying degrees of interaction between the agents is not the focus of this work. The work by Yanco and Stein [56] involves using reinforcement learning techniques to evolve effective communication protocols between cooperating robots. This is complementary to our approach of learning to coordinate in the absence of communication.
Though our agents can be viewed as learning automata, the scalar feedback received from the environment prevents the use of results directly from the field of learning automata [35]. Recent work on theoretical issues of multiagent reinforcement learning promises to produce new frameworks for investigating problems such as those addressed in this paper [21, 42, 53].
4 Reinforcement learning
In reinforcement learning problems [1, 27], reactive and adaptive agents are given a description of the current state and have to choose the next action from a set of possible actions so as to maximize a scalar reinforcement or feedback received after each action. The learner's environment can be modeled by a discrete time, finite state, Markov decision process that can be represented by a 4-tuple ⟨S, A, P, r⟩, where P : S × S × A → [0, 1] gives the probability of moving from state s1 to s2 on performing action a, and r : S × A → ℝ is a scalar reward function. Each agent maintains a policy, π, that maps the current state into the desirable action(s) to be performed in that state. The expected value of a discounted sum of future rewards of a policy π at a state s is given by

V^π_γ(s) = E{ Σ_{t=0}^{∞} γ^t r^π_{s,t} },

where r^π_{s,t} is the random variable corresponding to the reward received by the learning agent t time steps after it starts using the policy π in state s, and γ is a discount rate (0 ≤ γ < 1).
Various reinforcement learning strategies have been proposed using which agents can develop a policy to maximize rewards accumulated over time. For evaluating the classifier system paradigm for multiagent reinforcement learning, we compare it with the Q-learning [51] algorithm, which is designed to find a policy π* that maximizes V^π_γ(s) for all states s ∈ S. The decision policy is represented by a function, Q : S × A → ℝ, which estimates long-term discounted rewards for each state–action pair. The Q values are defined as

Q^π_γ(s, a) = V^{a;π}_γ(s),

where a;π denotes the event sequence of choosing action a at the current state, followed by choosing actions based on policy π. The action, a, to perform in a state s is chosen such that it is expected to maximize the reward,

V^{π*}_γ(s) = max_{a∈A} Q^{π*}_γ(s, a)  for all s ∈ S.
If an action a in state s produces a reinforcement of R and a transition to state s′, then the corresponding Q value is modified as follows:

Q(s, a) ← (1 − β) Q(s, a) + β (R + γ max_{a′∈A} Q(s′, a′)).    (1)
The above update rule is similar to Holland's bucket-brigade [25] algorithm in classifier systems and Sutton's temporal-difference [49] learning scheme. The similarities of Q-learning and classifier systems have been analyzed in [10].
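The tabular form of this update is straightforward to implement. The following sketch (in Python, with illustrative names) assumes discretized, hashable states and actions, and uses the deterministic greedy action choice and optimistic initialization discussed for the block-pushing experiments below; it is a minimal sketch, not the implementation used for the experiments reported here.

```python
import random
from collections import defaultdict

class QLearner:
    """Tabular Q-learning agent implementing the update of Equation 1."""

    def __init__(self, actions, beta=0.2, gamma=0.9, initial_q=100.0):
        self.actions = list(actions)
        self.beta = beta      # learning rate
        self.gamma = gamma    # discount rate
        # Initializing Q-values to a large positive number forces exploration
        # under deterministic greedy action choice (see Section 5).
        self.q = defaultdict(lambda: initial_q)

    def choose_action(self, state):
        # Deterministic greedy choice; ties broken at random.
        best = max(self.q[(state, a)] for a in self.actions)
        return random.choice([a for a in self.actions
                              if self.q[(state, a)] == best])

    def update(self, state, action, reward, next_state):
        # Q(s,a) <- (1 - beta) Q(s,a) + beta (R + gamma * max_a' Q(s',a'))
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        self.q[(state, action)] = ((1 - self.beta) * self.q[(state, action)]
                                   + self.beta * (reward + self.gamma * best_next))
```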
Classifier systems are rule based systems that learn by adjusting rule strengths from feedback and by discovering better rules using genetic algorithms. In this paper, we will use simplified classifier systems where all possible message–action pairs are explicitly stored and classifiers have one condition and one action. These assumptions are similar to those made by Dorigo and Bersini [10]; we also use their notation to describe a classifier i by (c_i, a_i), where c_i and a_i are respectively the condition and action parts of the classifier. S_t(c_i, a_i) gives the strength of classifier i at time step t. We first describe how the classifier system performs and then discuss two different feedback distribution schemes, namely the Bucket Brigade algorithm (BBA) and the Profit Sharing Plan (PSP).
All classifiers are initialized to some default strength. At each time step of problem solving, an input message is received from the environment and matched with the classifier rules to form a match set, M. One of these classifiers is chosen to fire and, based on its action, a feedback may be received from the environment. Then the strengths of the classifier rules are adjusted. This cycle is repeated for a given number of time steps. A series of cycles constitutes a trial of the classifier system. In the BBA scheme, when a classifier is chosen to fire, its strength is increased by the environmental feedback. But before that, a fraction α of its strength is removed and added to the strength of the classifier that fired in the last time cycle. So, if classifier i fires at time step t, produces external feedback of R, and classifier j fires at the next time step, the following equation gives the strength update of classifier i:

S_{t+1}(c_i, a_i) = (1 − α) · S_t(c_i, a_i) + α · (R + S_{t+1}(c_j, a_j)).
We now describe the profit sharing plan (PSP) strength-updating scheme [20] used in classifier systems. In this method, problem solving is divided into episodes in between receipts of external reward. A rule is said to be active in an episode if it fired in at least one of the cycles in that episode. At the end of episode e, the strength of each active rule i in that episode is updated as follows:

S_{e+1}(c_i, a_i) = S_e(c_i, a_i) + α · (R_e − S_e(c_i, a_i)),

where R_e is the external reward received at the end of the episode. We have experimented with two methods of choosing a classifier to fire given the match set. In the more traditional method, a classifier i ∈ M at time t is chosen with a probability given by

S_t(c_i, a_i) / Σ_{d∈M} S_t(c_d, a_d).

We call this fitness proportionate PSP or PSP(FP). In the other method of action choice, the classifier with the highest fitness in M is chosen 90% of the time, and a random classifier from M is chosen in the remaining 10% of cases (Mahadevan uses such an action choosing mechanism for Q-learning in [31]). We call this semi-random PSP or PSP(SR).
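To make the two payoff schemes and the two action-choice methods concrete, the following sketch (in Python, with illustrative names) implements a simplified classifier system of the kind described above: every (condition, action) pair is stored explicitly with a strength, actions are chosen either fitness-proportionately or semi-randomly, and strengths are updated with either the BBA or the PSP rule. The BBA update is applied when the next classifier fires, using that classifier's current strength in place of S_{t+1}(c_j, a_j); this deferred form is our reading of the equation above, not a transcription of the system used in the experiments.

```python
import random

class SimpleClassifierSystem:
    """Simplified classifier system: one condition and one action per rule,
    with every (condition, action) pair stored explicitly as a strength."""

    def __init__(self, conditions, actions, initial_strength=100.0, alpha=0.1):
        self.alpha = alpha
        self.actions = list(actions)
        self.strength = {(c, a): initial_strength
                         for c in conditions for a in actions}
        self.prev = None      # (rule, reward) from the previous cycle, for BBA
        self.active = set()   # rules that fired in the current episode, for PSP

    def choose(self, condition, method="SR"):
        """Pick a rule from the match set M for this condition.
        'SR': semi-random -- highest strength 90% of the time, random otherwise.
        'FP': fitness proportionate -- probability proportional to strength."""
        match_set = [(condition, a) for a in self.actions]
        if method == "SR":
            if random.random() < 0.9:
                rule = max(match_set, key=lambda r: self.strength[r])
            else:
                rule = random.choice(match_set)
        else:
            weights = [self.strength[r] for r in match_set]
            rule = random.choices(match_set, weights=weights)[0]
        self.active.add(rule)
        return rule

    def bba_update(self, fired_rule, reward):
        """Bucket brigade: when a rule fires, the rule that fired in the
        previous cycle is moved toward its own reward plus the new rule's
        strength: S(i) <- (1 - alpha) S(i) + alpha (R_i + S(j))."""
        if self.prev is not None:
            prev_rule, prev_reward = self.prev
            self.strength[prev_rule] = (
                (1 - self.alpha) * self.strength[prev_rule]
                + self.alpha * (prev_reward + self.strength[fired_rule]))
        self.prev = (fired_rule, reward)

    def psp_update(self, episode_reward):
        """Profit sharing plan: every rule active in the episode is moved
        toward the external reward received at the end of the episode."""
        for rule in self.active:
            self.strength[rule] += self.alpha * (episode_reward - self.strength[rule])
        self.active.clear()
```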
For the Q-learning algorithm, we stop a run when the algebraic difference of the policies at the end of neighboring trials is below a threshold for 10 consecutive trials. With this convergence criterion, however, the classifier systems ran too long for us to collect reasonable data. Instead, every 10th trial, we ran the classifier system (both with BBA and PSP) with a deterministic action choice over the entire trial. We stopped a run of the classifier system if the differences of the total environmental feedback received by the system on neighboring deterministic trials were below a small threshold for 10 consecutive deterministic trials. We perform an in-depth study of the effectiveness of using Q-learning by concurrent learners to develop effective coordination in a block pushing domain. We also compare the performance of classifier systems and Q-learning on a resource sharing and a robot navigation domain. The characteristics of these three domains are as follows:
Block pushing: Concurrent learning by two agents with immediate environmental feedback; strongly coupled system; multiple optimal behaviors.

Resource sharing: One agent learning to adapt to another agent's behavior with delayed environmental feedback; strongly coupled system; single optimal behavior.

Robot navigation: Concurrent learning by multiple agents with immediate environmental feedback; variable coupling; multiple optimal behaviors.
5 Block pushing problem
In this problem, two agents, a1 and a2, are independently assigned to move a block, b, from a starting position, S, to some goal position, G, following a path, P, in Euclidean space.
The agents are not aware of the capabilities of each other and yet must choose their actions individually such that the joint task is completed. The agents have no knowledge of the system physics, but can perceive their current distance from the desired path to take to the goal state. Their actions are restricted as follows: agent i exerts a force F_i, where 0 ≤ |F_i| ≤ F_max, on the object at an angle θ_i, where 0 ≤ θ_i ≤ π. An agent pushing with force F at angle θ will offset the block in the x direction by |F| cos(θ) units and in the y direction by |F| sin(θ) units. The net resultant force on the block is found by vector addition of individual forces: F = F_1 + F_2. We calculate the new position of the block by assuming unit displacement per unit force along the direction of the resultant force. The new block location is used to provide feedback to the agents. If (x, y) is the new block location, P_x(y) is the x-coordinate of the path P for the same y coordinate, and Δx = |x − P_x(y)| is the distance along the x dimension between the block and the desired path, then K · a^(−Δx) is the feedback given to each agent for their last action (we have used K = 50 and a = 1.15).
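The dynamics and the feedback computation can be summarized in a few lines of code. The sketch below (Python, with illustrative names) assumes the exponentially decaying feedback K · a^(−Δx) as reconstructed above and a desired path supplied as a function of y; it is meant only to illustrate the environment, not to reproduce the exact simulator.

```python
import math

def block_push_step(block_xy, agent_forces, path_x_at, K=50.0, a=1.15):
    """One time step of the block-pushing world.

    agent_forces : list of (magnitude, angle) pairs, one per agent,
                   with 0 <= magnitude <= F_max and 0 <= angle <= pi.
    path_x_at    : function y -> x giving the x-coordinate of the desired
                   path P at height y.
    Returns the new block position and the scalar feedback given to every agent.
    """
    # Net force is the vector sum of the individual forces.
    fx = sum(f * math.cos(theta) for f, theta in agent_forces)
    fy = sum(f * math.sin(theta) for f, theta in agent_forces)

    # Unit displacement per unit force along the resultant direction.
    x, y = block_xy[0] + fx, block_xy[1] + fy

    # Feedback decays with distance from the desired path: K * a**(-dx).
    dx = abs(x - path_x_at(y))
    feedback = K * a ** (-dx)
    return (x, y), feedback
```

For the straight desired path from (40, 0) to (40, 100) used below, path_x_at is simply lambda y: 40.0.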
Figure 1: The block pushing problem.
The field of play is restricted to a rectangle with endpoints [0, 0] and [100, 100]. A trial consists of the agents starting from the initial position S and applying forces until either the goal position G is reached or the block leaves the field of play (see Figure 1). We abort a trial if a pre-set number of agent actions fail to take the block to the goal. This prevents agents from learning policies where they apply no force when the block is resting on the optimal path to the goal but not on the goal itself. The agents are required to learn, through repeated trials, to push the block along the path P to the goal. Although we have used only two agents in our experiments, the solution methodology can be applied without modification to problems with an arbitrary number of agents.
To implement the policy π we chose to use an internal discrete representation for the external continuous space. The force, angle, and the space dimensions were all uniformly discretized. When a particular discrete force or angle is selected by the agent, the middle value of the associated continuous range is used as the actual force or angle that is applied on the block.
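A uniform discretization of this kind can be written as a pair of helper functions. The sketch below (Python, illustrative names) maps a continuous value to a bin index and a chosen bin back to the midpoint of its range, which is the value actually applied to the block.

```python
def make_discretizer(low, high, n_bins):
    """Uniform discretization used for force, angle, and the state dimension."""
    width = (high - low) / n_bins

    def to_bin(value):
        # Continuous value -> bin index, clamped to [0, n_bins - 1].
        return min(n_bins - 1, max(0, int((value - low) / width)))

    def to_value(bin_index):
        # Chosen bin -> midpoint of its continuous range.
        return low + (bin_index + 0.5) * width

    return to_bin, to_value

# Example settings from the experiments below: 11 angle intervals over
# [0, pi], 10 force intervals, and 10-20 intervals for x over [0, 100].
```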
The block-pushing problem is a tightly-coupled system, where, at each time step, the outcome of each agent's action is dependent on the action of the other agents present in the system. Using the block-pushing problem, we have conducted experiments on both cooperative (agents push the block towards a common goal) and non-cooperative (agents vie with each other to push the block to individual goals) situations.
An experimental run consists of a number of trials during which the system parameters (β, γ, and K) as well as the learning problem (granularity, agent choices) are held constant. The stopping criterion for a run is either that the agents succeed in pushing the block to the goal in N consecutive trials (we have used N = 10) or that a maximum number of trials (we have used 1500) have been executed. The latter cases are reported as non-converged runs. The standard procedure in the Q-learning literature of initializing Q values to zero is suitable for most tasks where non-zero feedback is infrequent and hence there is enough opportunity to explore all the actions. Because a non-zero feedback is received after every action in our problem, we found that agents would follow, for an entire run, the path they take in the first trial. This is because they start each trial at the same state, and the only non-zero Q-value for that state is for the action that was chosen at the start of the trial. Similar reasoning holds for all the other actions chosen in the trial. A possible fix is to choose a fraction of the actions by random choice, or to use a probability distribution over the Q-values to choose actions stochastically. These options, however, lead to very slow convergence. Instead, we chose to initialize the Q-values to a large positive number. This enforced an exploration of the available action options while allowing for convergence after a reasonable number of trials. The primary metric for performance evaluation is the average number of trials taken by the system to converge. Information about acquisition of coordination knowledge is obtained by plotting, for different trials, the average distance of the actual path followed from the desired path. Data for all plots and tables in this paper have been averaged over 100 runs.
5.1 Experiments in Cooperative Domain
Extensive experimentation was performed on the block-pushing domain. The two agents were assigned the task of pushing the block to the same goal location. The following sections provide further details on the experimental results.

5.1.1 Choice of system parameters
If the agents learn to push the block along the desired path, the reward that they will receive for the best action choices at each step is equal to the maximum possible value of K. The steady-state values for the Q-values (Q_ss) corresponding to optimal action choices can be calculated from the equation:

Q_ss = (1 − β) Q_ss + β (K + γ Q_ss).

Solving for Q_ss in this equation yields a value of K / (1 − γ). In order for the agents to explore all actions after the Q-values are initialized at S_I, we require that any new Q value be less than S_I. From similar considerations as above we can show that this will be the case if S_I ≥ K / (1 − γ). In our experiments we fix the maximum reward K at 50, S_I at 100, and γ at 0.9. Unless otherwise mentioned, we have used β = 0.2, and allowed each agent to vary both the magnitude and angle of the force they apply on the block.
The first problem we used had starting and goal positions at (40, 0) and (40, 100) respectively. During our initial experiments we found that with an even number of discrete intervals chosen for the angle dimension, an agent cannot push along any line parallel to the y-axis. Hence we used an odd number, 11, of discrete intervals for the angle dimension. The number of discrete intervals for the force dimension is chosen to be 10.

On varying the number of discretization intervals for the state space between 10, 15, and 20, we found the corresponding average number of trials to convergence to be 784, 793, and 115 respectively, with 82%, 83%, and 100% of the respective runs converging within the specified limit of 1200 trials. This suggests that when the state representation gets too coarse, the agents find it very difficult to learn the optimal policy. This is because the smaller the number of intervals (the coarser the granularity), the more the variation in reward an agent gets after taking the same action at the same state (each discrete state maps into a larger range of continuous space and hence the agents start from and end up in physically different locations, the latter resulting in different rewards).

5.1.2 Varying learning rate

We experimented by varying the learning rate, β. The resultant average distance of the actual path from the desired path over the course of a run is plotted in Figure 2 for β values 0.4, 0.6, and 0.8.
In the case of the straight path between (40, 0) and (40, 100), the optimal sequence of actions always puts the block on the same x-position. Since the x-dimension is the only dimension used to represent state, the agents update the same Q-value in their policy matrix in successive steps. We now calculate the number of updates required for the Q-value corresponding to this optimal action before it reaches the steady state value. Note that for the system to converge, it is necessary that only the Q-value for the optimal action at x = 40 needs to arrive at its steady state value. This is because the block is initially placed at x = 40, and so long as the agents choose their optimal action, it never reaches any other x position. So, the number of updates to reach steady state for the Q-value associated with the optimal action at x = 40 should be proportional to the number of trials to convergence for a given run. In the following, let S_t be the Q-value after t updates and S_I be the initial Q-value. Using Equation 1 and the fact that for the optimal action at the starting position the reinforcement received is K and the next state is the same as the current state, we can write

S_{t+1} = (1 − β) S_t + β (K + γ S_t) = (1 − β (1 − γ)) S_t + β K = A S_t + C,    (2)

where A and C are constants defined to be equal to 1 − β (1 − γ) and β K respectively. Equation 2 is a difference equation which can be solved using S_0 = S_I to obtain
S_t = A^{t+1} S_I + C (1 − A^{t+1}) / (1 − A).
If we define convergence by the criterion that |S_{t+1} − S_t| < ε, where ε is an arbitrarily small positive number, then the number of updates t required for convergence can be calculated to be the following:

t ≥ [log(ε) − log(S_I (1 − A) − C)] / log(A)
  = [log(ε) − log(β) − log(S_I (1 − γ) − K)] / log(1 − β (1 − γ)).    (3)
If we keep γ and S_I constant, the above expression can be shown to be a decreasing function of β. This is corroborated by our experiments with varying β while holding γ = 0.1 (see Figure 2). As β increases, the agents take fewer trials to converge to the optimal set of actions required to follow the desired path. The other plot in Figure 2 presents a comparison of the theoretical and experimental convergence trends. The first curve in the plot represents the function corresponding to the number of updates required to reach the steady state value (with ε = 0). The second curve represents the average number of trials required for a run to converge, scaled down by a constant factor of 0.06. The actual ratios between the number of trials to convergence and the values of the expression on the right hand side of inequality 3 for β equal to 0.4, 0.6, and 0.8 are 24.1, 25.6, and 27.5 respectively (the average numbers of trials are 95.6, 71.7, and 53; values of the above-mentioned expression are 3.97, 2.8, and 1.93). Given the fact that results are averaged over 100 runs, we can claim that our theoretical analysis provides a good estimate of the relative time required for convergence as the learning rate is changed.
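The right hand side of inequality 3 is easy to evaluate numerically. The short script below (Python) reproduces the values quoted above for γ = 0.1, S_I = 100, and K = 50; the term log(ε) is replaced by the constant 1 so that the expression matches the curve plotted in Figure 2, which is an assumption on our part.

```python
import math

def updates_to_convergence(beta, gamma=0.1, S_I=100.0, K=50.0, log_eps=1.0):
    """Right-hand side of inequality (3): number of Q-value updates needed
    before successive updates differ by less than epsilon.  log_eps stands
    for log(epsilon); 1.0 matches the '(1 - log(40*beta))' curve of Figure 2."""
    numerator = log_eps - math.log(beta) - math.log(S_I * (1 - gamma) - K)
    return numerator / math.log(1 - beta * (1 - gamma))

for beta, trials in [(0.4, 95.6), (0.6, 71.7), (0.8, 53.0)]:
    t = updates_to_convergence(beta)
    print(f"beta={beta}: predicted updates = {t:.2f}, trials/prediction = {trials / t:.1f}")
# Prints values close to 3.97, 2.8, and 1.93, with ratios near 24.1, 25.6, and 27.5.
```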
Figure 2: Variation of average distance of actual path from desired path over the course of a run, and the number of updates for convergence of the optimal Q-value with changing β (γ = 0.1, S_I = 100).

5.1.3 Varying agent capabilities

The next set of experiments was designed to demonstrate the effects of agent capabilities on the time required to converge on the optimal set of actions. In the first of the current set of experiments, one of the agents was chosen to be a "dummy"; it did not exert any force at all. The other agent could only change the angle at which it could apply a constant force on the block. In the second experiment, the latter agent was allowed to vary both force and angle. In the third experiment, both agents were allowed to vary their force and angle. The average numbers of trials to convergence for the first, second, and third experiments are 55, 431, and 115 respectively. The most interesting result from these experiments is that two agents can learn to coordinate their actions and achieve the desired problem-solving behavior much faster than when a single agent is acting alone. If, however, we simplify the problem of the only active agent by restricting its choice to that of selecting the angle of force, it can learn to solve the problem quickly. If we fix the angle for the only active agent, and allow it to vary only the magnitude of the force, the problem becomes either trivial (if the chosen angle is identical to the angle of the desired path from the starting point) or unsolvable.
5.1.4 Transfer of learning

We designed a set of experiments to demonstrate how learning in one situation can help learning to perform well in a similar situation. The problem with starting and goal locations at (40, 0) and (40, 100) respectively is used as a reference problem. In addition, we used five other problems with the same starting location and with goal locations at (50, 100), (60, 100), (70, 100), (80, 100), and (90, 100) respectively. The corresponding desired paths were obtained by joining the starting and goal locations by straight lines. To demonstrate transfer of learning, we first stored each of the policy matrices that the two agents converged on for the original problem. Next, we ran a set of experiments using each of the new problems, with the agents starting off with their previously stored policy matrices.

We found that there is a linear increase in the number of trials to convergence as the goal in the new problem is placed farther apart from the goal in the initial problem. To determine if this increase was due purely to the distance between the two desired paths, or due to the difficulty in learning to follow certain paths, we ran experiments on the latter problems with agents starting with uniform policies. These experiments reveal that the larger the angle between the desired path and the y-axis, the longer the agents take to converge. Learning in the original problem, however, does help in solving these new problems, as evidenced by a ≈10% savings in the number of trials to convergence when agents started with the previously learned policy. Using a one-tailed t-test we found that all the differences were significant at the 99% confidence level. This result demonstrates the transfer of learned knowledge between similar problem-solving situations.
5.1.5 Complementary learning

In the last few sections we have shown the effects of system parameters and agent capabilities on the rate at which the agents converge on an optimal set of actions. In this section, we discuss what an "optimal set of actions" means to different agents.
Figure 3: Optimal action choices for a selection of states for each agent according to their policy matrices at the end of a successful run.
If the agents were cognizant of the actual constraints and goals of the problem, and knew elementary physics, they could independently calculate the desired action for each of the states that they may enter. The resulting policies would be identical. Our agents, however, have no planning capacity and their knowledge is encoded in the policy matrix. Figure 3 provides a snapshot, at the end of a successfully converged run, of what each agent believes to be its best action choice for each of the possible states in the world. The action choice for each agent at a state is represented by a straight line at the appropriate angle and scaled to represent the magnitude of force. We immediately notice that the individual policies are complementary rather than being identical. Given a state, the combination of the best actions will bring the block closer to the desired path. In some cases, one of the agents even pushes in the wrong direction while the other agent has to compensate with a larger force to bring the block closer to the desired path. These cases occur in states which are at the edge of the field of play, and have been visited only infrequently. Complementarity of the individual policies, however, is visible for all the states.
5.2 Experiments in Non-Cooperative Domains
We designed a set of experiments in which two agents are provided different feedback for the same block location. The agents are assigned to push the same block to two different goals along different paths. Hence, the action of each of them adversely affects the goal achievement of the other agent. The maximum force (we refer to this as strength) of one agent was chosen as 10 units, while the maximum force of the other agent was varied. The other variable was the number of discrete action options available within the given force range. When there is considerable disparity between the strengths of the two agents, the stronger agent overpowers the weaker agent, and succeeds in pushing the block to its goal location (see Figure 4). The average number of trials to convergence (see Table 2), however, indicates that as the strength of the weaker agent is increased, the stronger agent finds it increasingly difficult to attain its goal. For these experiments, the strong and the weak agents had respectively 11 (between 0 and 10) and 2 (0 and its maximum strength) force options to choose from.

Figure 4: Example trial when agents have conflicting goals.

When the number of force discretizations for the weak agent is increased from 2 to 10, we find that the stronger agent finds it more difficult to push the block to its own goal. If we increase the maximum force of the weak agent closer to the maximum force of the stronger agent, we find that neither of them is able to push the block to its desired goal. At the end of a run, we find that the final converged path lies in between their individual desired paths. As the strength of the weaker agent increases, this path moves away from the desired path of the stronger agent, and ultimately lies midway between their individual desired paths when both agents are equally strong.
Intuitively, an agent should be able to 'overpower' another agent whenever it is stronger. Why is this not happening? The answer lies in the stochastic variability of feedback received for the same action at the same state, and the deterministic choice of the action corresponding to the maximal policy matrix entry. When an agent chooses an action at a state it can receive one of several different feedbacks depending on the action chosen by the other agent. We define the optimal action choice for a state x to be the action A_x that has the highest average feedback F_x. Suppose the first time the agent chooses this action at state x it receives a feedback F_1 that is lower than F_x; because actions are then chosen deterministically from the maximal policy entries, such an unlucky sample can steer the agent away from this optimal action in subsequent trials.

6 Robot navigation problem

We designed a problem in which four agents, A, B, C, and D, are to find the optimal path in a grid world, from given starting locations to their respective goals, A′, B′, C′, and D′. The agents traverse their world using one of the five available operators: north, south, east, west, or hold. Figure 5 depicts potential paths that each of the agents might choose during their learning process. The goal of the agents is to learn moves that quickly take them to their respective goal locations without colliding with other agents.

Figure 5: A robot navigation problem (agents A, B, C, D and their goals A′, B′, C′, D′).

Each agent receives feedback based on its move: when it makes a move that takes it towards its goal, it receives a feedback of 1; when it makes a move that takes it away from its goal, it receives a feedback of -1; when it makes a move that results in no change of its distance from its goal (hold), it receives a feedback of 0; when it makes a move that results in a collision, the feedback is computed as depicted in Figure 6. All agents learn at the same time by updating their individual policies.

Figure 6: Feedback for agent X when it causes different types of collisions (given the action of the other agent in the collision); depending on the type of collision, X receives a feedback of -10 or -5.

The robot navigation task is a domain in which the agent coupling varies over time. When agents are located next to each other, they are very likely to interact and hence agent behaviors are tightly coupled. But, as they move apart, their interactions become less likely, and as a result, their behaviors become loosely coupled.

Since the robot navigation problem produces rewards after each time step, we have used the BBA method of payoff distribution with the classifier system. The system parameters are β = 0.5 and γ = 0.8 for Q-learning, and α = 0.1 for BBA.

Experimental results on the robot navigation domain comparing Q-learning and a classifier system using BBA for payoff distribution are displayed in Figure 7. Plots show the average number of steps taken by the agents to reach their goals. Lower values of this parameter mean agents are learning to find more direct paths to their goals without colliding with each other. Results are averaged over 50 runs for both systems. Q-learning takes as much as 5 times longer to converge when compared to BBA. The final number of steps taken by agents using Q-learning is slightly smaller than the number of steps taken by agents using BBA. We believe that if we make the convergence criterion more strict for the BBA, a better solution can be evolved with more computational effort.

Figure 7: Comparison of BBA and Q-learning on the robot navigation problem.

The interesting aspect of this experiment is that all the agents were learning simultaneously and hence it was not obvious that they would find good paths. Typical solutions, however, show that agents stop at the right positions to let others pass through. This avoids collisions. The paths do contain small detours, and hence are not optimal.

The performance of BBA when α = 0.5 (comparable to β used in Q-learning) was not very different when compared to that presented in Figure 7. It is interesting to note that when experiments were tried setting the β in Q-learning to 0.1 (comparable to α in BBA) the system did not attain convergence even after 5000 trials.
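The per-move feedback used in this domain can be written compactly. The sketch below (Python, illustrative names) follows the scheme described above; the exact mapping from collision configurations to the -10 and -5 penalties of Figure 6 is not fully recoverable from the text, so the labels used here are assumptions.

```python
def navigation_feedback(old_dist, new_dist, collision=None):
    """Feedback for one move in the grid-world navigation task.

    old_dist, new_dist : distance to the agent's goal before and after the move.
    collision          : None for no collision; otherwise one of the
                         illustrative labels 'head_on', 'hit_stationary',
                         or 'hit_moving' (assumed reading of Figure 6).
    """
    if collision == 'head_on':
        return -10.0
    if collision in ('hit_stationary', 'hit_moving'):
        return -5.0
    if new_dist < old_dist:
        return 1.0      # moved toward the goal
    if new_dist > old_dist:
        return -1.0     # moved away from the goal
    return 0.0          # hold: no change in distance to the goal
```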
7 Resource sharing problem

The resource sharing problem assumes two agents sharing a common resource or channel, with each of them trying to distribute their load on the system so as to achieve maximum utility. In our version of the problem, it is assumed that one agent has already applied some load distributed over a fixed time period, and the other agent is learning to distribute its load on the system without any knowledge of the current distribution. A maximum load of L can be applied on the system at any point in time (loads in excess of this do not receive any utility).

The second agent can use K load-hours of the channel. If it applies a load of k_t load-hours on the system at time step t, when the first agent has applied l_t load-hours, the utility it receives is u(l_t, k_t) = U(min(L, k_t + l_t)) − U(min(L, l_t)), where U is the utility function in Figure 8, and L = 10 is the maximum load allowed on the system. The total feedback it gets at the end of T time steps is Σ_{t=1}^{T} u(l_t, k_t). This problem requires the second agent to distribute its load around the loads imposed by the first agent in order to obtain maximum utility.

Figure 8: Curve depicting the utility received for a given load (U(x) = 10 (1 − e^(−x))).

The problem is that the second agent has no direct information about the load distribution on the system. This is a typical situation in reinforcement learning problems where the agent has to choose its actions based only on scalar feedback. This formulation is different from other load balancing problems used in the MAL literature [41] where more short-term feedback is available to agents.

A single trial consists of an episode of applying loads until T time steps are completed or until the agent has exhausted its load-hours, whichever occurs earlier. Thus, through consecutive such trials the agent learns to distribute its load on the system in an optimal fashion. Figure 9 presents the load distribution used by the first agent as well as one of several optimal load distributions for the second agent in the particular problem we have used for experiments (T = 10 in this problem).

Figure 9: A resource sharing problem (the loads applied by the two agents over time, shown against the maximum load).

The resource sharing problem is another example of a tightly-coupled system, with agent actions interacting with each other at every time step. This domain can be considered to be non-cooperative because the agents are not doing a common task; each of the agents needs to coordinate with the others to achieve a utility that is most beneficial to itself.

Since the resource sharing problem produces rewards only after a series of actions are performed, we used the PSP method of payoff distribution with the classifier system. Though BBA can also be used for payoff distribution in this problem, our initial experiments showed that PSP performed much better than BBA on this problem. The parameter values are β = 0.85 and γ = 0.9 for Q-learning, and α = 0.1 for PSP.
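The utility computation for the learning agent can be sketched as follows (Python, illustrative names). The utility curve is the U(x) = 10 (1 − e^(−x)) of Figure 8, and the total load is capped at L = 10 before being passed to U; reading the capping as min() is our interpretation of "loads in excess of this do not receive any utility".

```python
import math

def channel_utility(x):
    """Utility curve from Figure 8: U(x) = 10 * (1 - exp(-x))."""
    return 10.0 * (1.0 - math.exp(-x))

def step_utility(l_t, k_t, L=10.0):
    """Utility received at time t when the learning agent applies k_t
    load-hours while the other agent has already applied l_t load-hours.
    Load beyond the channel maximum L earns nothing, hence the min()."""
    return channel_utility(min(L, k_t + l_t)) - channel_utility(min(L, l_t))

def episode_utility(other_load, own_load):
    """Total feedback over T time steps: the sum of per-step utilities."""
    return sum(step_utility(l, k) for l, k in zip(other_load, own_load))
```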
In this set of experiments the fitness-proportionate PSP did not converge even after 150,000 trials. Experimental results comparing Q-learning and semi-random PSP, PSP(SR), based classifier systems on the resource sharing problem are displayed in Figure 10. Results are averaged over 50 runs of both systems. Though both methods find the optimal load distribution in some of the runs, more often than not they settle for a less than optimal, but reasonably good, distribution. PSP takes about twice as long to converge but produces a better load distribution on the average. The difference in performance is found to be significant at the 99% confidence level using a two-sample t-procedure. We believe this happens because all the active rules directly share the feedback at the end of a trial in PSP. In Q-learning, however, the external feedback is passed back to policy elements used early in a trial over successive trials. Interference with different action sequences sharing the same policy element (state-action pair) can produce convergence to sub-optimal solutions.

Figure 10: Comparison of PSP and Q-learning on the resource sharing problem.

Typical solutions produced by PSP and Q-learning differed in one important characteristic. PSP solutions save some of the load for the last empty time-slot, whereas Q-learning solutions use up all the available load before that. Since PSP is able to utilize the last empty time slot on the channel, it produces better utility than Q-learning.

The above results show two things: 1) an agent can effectively use a classifier system to coordinate its actions with no knowledge about the actions of the other agent using the common resource, and 2) a semi-random action choice mechanism can be a more effective method for classifier systems than the commonly used fitness-proportionate action choice scheme.

We ran a further set of experiments in this domain where both agents were learning concurrently. In this mode, each agent would submit a load distribution to the system, and the submitted load distributions would be used to give them feedback on their utilities. Neither of the learning techniques was able to generate effective coordination between the agents. This set of experiments, alone, glaringly exposes the limitations of individual learning. It seems that in tightly coupled domains with delayed feedback it would be extremely unlikely that good coordination will evolve between agents learning at the same time. This is particularly true if there is one or very few optimal behavior pairings.

8 Conclusions and Future Research

In this paper we have addressed the problem of developing multiagent coordination strategies with minimal domain knowledge and information sharing between agents. We have compared classifier system based methods and Q-learning algorithms, two reinforcement learning paradigms, to investigate a resource sharing and a robot navigation problem. Our experiments show that the classifier based methods perform very competitively with the Q-learning algorithm, and are able to generate good solutions to both problems. PSP works well on the resource sharing problem, where an agent is trying to adapt to a fixed strategy used by another agent, and when environmental feedback is received infrequently. Results are particularly encouraging for the robot navigation domain, where all agents are learning simultaneously. A classifier system with the BBA payoff distribution allows agents to coordinate their movements with others without deviating significantly from the optimal path from their start to goal locations.
Experiments conducted on the block-pushing task demonstrate that two agents can coordinate to solve a problem better, even without developing a model of each other, than what they can do alone. We have developed and experimentally verified theoretical predictions of the effects of a particular system parameter, the learning rate, on system convergence. Other experiments show the utility of using knowledge, acquired from learning in one situation, in other similar situations. Additionally, we have demonstrated that agents coordinate by learning complementary, rather than identical, problem-solving knowledge.

Using reinforcement learning schemes, we have shown that agents can learn to achieve their goals in both cooperative and adversarial domains. Neither prior knowledge about domain characteristics nor explicit models about capabilities of other agents are required. This provides a novel paradigm for multi-agent systems through which both friends and foes can concurrently acquire coordination knowledge.

A particular limitation of the proposed approach that we have identified is the inability of individual, concurrent learning to develop effective coordination when agent actions are strongly coupled, feedback is delayed, and there is one or a few optimal behavior combinations. A possible partial fix to this problem would be to do some sort of staggered or lock-step learning. In this mode of learning, each agent can learn for some time, then execute its current policy without modification for some time, then switch back to learning, etc. Two agents can synchronize their behavior so that one is learning while the other is following a fixed policy and vice versa. Even if perfect synchronization is not feasible, the staggered learning mode is likely to be more effective than the concurrent learning mode we have used in this paper.

A drawback of using reinforcement learning to generate coordination policies is that it requires a considerable amount of data, and as such can only be used in domains where agents repeatedly perform similar tasks. Other learning algorithms with smaller data requirements can possibly be explored to overcome this limitation.

This paper demonstrates that classifier systems can be used effectively to achieve near-optimal solutions more quickly than Q-learning, as illustrated by the experiments conducted in the robot navigation task. If we enforce a more rigid convergence criterion, classifier systems achieve a better solution than Q-learning through a larger number of trials, as illustrated by the results obtained on the resource sharing domain. We believe, however, that either Q-learning or the classifier system can produce better results in a given domain. Identifying the distinguishing features of domains which allow one of these schemes to perform better will be a focus of our future research.

We have also shown that a semi-random choice of actions can be much more productive than the commonly used fitness-proportionate choice of actions with the PSP payoff distribution mechanism. We plan to compare the BBA mechanism with these two methods of payoff distribution.

We would also like to investigate the effects of problem complexity on the number of trials taken for convergence. In the robot navigation domain, for example, we would like to vary both the size of the grid as well as the number of agents moving on the grid to find out the effects on solution quality and convergence time.

Other planned experiments include using world models within classifier systems [3] and combining features of BBA and PSP [20] that would be useful for learning multiagent coordination strategies.

Acknowledgments

This research has been sponsored, in part, by the National Science Foundation under a Research Initiation Award IRI-9410180 and a CAREER award IRI-9702672.

References

[1] Andrew B. Barto, Richard S. Sutton, and Chris Watkins. Sequential decision problems and neural networks. In Proceedings of the 1989 Conference on Neural Information Processing Systems, 1989.

[2] Alan H. Bond and Les Gasser. Readings in Distributed Artificial Intelligence. Morgan Kaufmann Publishers, San Mateo, CA, 1988.
[3] Lashon B. Booker. Classifier systems that learn internal world models. Machine Learning, 3:161–192, 1988.

[4] L. B. Booker, D. E. Goldberg, and J. H. Holland. Classifier systems and genetic algorithms. Artificial Intelligence, 40:235–282, 1989.

[5] P. Brazdil, M. Gams, S. Sian, L. Torgo, and W. van de Velde. Learning in distributed systems and multi-agent environments. In European Working Session on Learning, Lecture Notes in AI, 482, Berlin, March 1991. Springer Verlag.

[6] Hung H. Bui, Dorota Kieronska, and Svetha Venkatesh. Negotiating agents that learn about others' preferences. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 114–119, Menlo Park, CA, 1996. AAAI Press.

[7] David Carmel and Shaul Markovitch. Incorporating opponent models into adversary search. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 62–67, Menlo Park, CA, 1996. AAAI Press.

[8] Philip R. Cohen and C. Raymond Perrault. Elements of a plan-based theory of speech acts. Cognitive Science, 3(3):177–212, 1979.

[9] Susan E. Conry, Robert A. Meyer, and Victor R. Lesser. Multistage negotiation in distributed planning. In Alan H. Bond and Les Gasser, editors, Readings in Distributed Artificial Intelligence, pages 367–384. Morgan Kaufmann, 1988.

[10] Marco Dorigo and Hugues Bersini. A comparison of Q-learning and classifier systems. In Proceedings of From Animals to Animats, Third International Conference on Simulation of Adaptive Behavior, 1994.

[11] Edmund H. Durfee and Victor R. Lesser. Partial global planning: A coordination framework for distributed hypothesis formation. IEEE Transactions on Systems, Man, and Cybernetics, 21(5), September 1991. (Special Issue on Distributed Sensor Networks.)

[12] Edmund H. Durfee, Victor R. Lesser, and Daniel D. Corkill. Coherent cooperation among communicating problem solvers. IEEE Transactions on Computers, C-36(11):1275–1291, November 1987. (Also published in Readings in Distributed Artificial Intelligence, Alan H. Bond and Les Gasser, editors, pages 268–284, Morgan Kaufmann, 1988.)

[13] Edmund H. Durfee, Victor R. Lesser, and Daniel D. Corkill. Trends in cooperative distributed problem solving. IEEE Transactions on Knowledge and Data Engineering, 1(1):63–83, March 1989.

[14] Edmund H. Durfee and Thomas A. Montgomery. Coordination as distributed search in a hierarchical behavior space. IEEE Transactions on Systems, Man, and Cybernetics, 21(6):1363–1378, November/December 1991.

[15] Mark S. Fox. An organizational view of distributed systems. IEEE Transactions on Systems, Man, and Cybernetics, 11(1):70–80, January 1981. (Also published in Readings in Distributed Artificial Intelligence, Alan H. Bond and Les Gasser, editors, pages 140–150, Morgan Kaufmann, 1988.)

[16] Andrew Garland and Richard Alterman. Multiagent learning through collective memory. In Sandip Sen, editor, Working Notes for the AAAI Symposium on Adaptation, Co-evolution and Learning in Multiagent Systems, pages 33–38, Stanford University, CA, March 1996.

[17] Les Gasser and Michael N. Huhns, editors. Distributed Artificial Intelligence, volume 2 of Research Notes in Artificial Intelligence. Pitman, 1989.

[18] M. R. Genesereth, M. L. Ginsberg, and J. S. Rosenschein. Cooperation without communications. In Proceedings of the National Conference on Artificial Intelligence, pages 51–57, Philadelphia, Pennsylvania, 1986.

[19] Piotr J. Gmytrasiewicz, Edmund H. Durfee, and David K. Wehe. A decision-theoretic approach to coordinating multiagent interactions. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pages 62–68, August 1991.

[20] John Grefenstette. Credit assignment in rule discovery systems. Machine Learning, 3(2/3):225–246, 1988.

[21] Pan Gu and Anthony B. Maddox. A framework for distributed reinforcement learning. In Gerhard Weiß and Sandip Sen, editors, Adaptation and Learning in Multi-Agent Systems, Lecture Notes in Artificial Intelligence, pages 97–112. Springer Verlag, Berlin, 1996.
[22] Joseph Halpern and Yoram Moses. Knowledge and common knowledge in a distributed environment. Journal of the ACM, 37(3):549–587, 1990. A preliminary version appeared in Proc. 3rd ACM Symposium on Principles of Distributed Computing, 1984.

[23] Thomas Haynes and Sandip Sen. Learning cases to compliment rules for conflict resolution in multiagent systems. International Journal of Human-Computer Studies (to appear).

[24] John H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI, 1975.

[25] John H. Holland. Escaping brittleness: the possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, editors, Machine Learning, an Artificial Intelligence Approach: Volume II. Morgan Kaufmann, Los Altos, CA, 1986.

[26] Michael Huhns, editor. Distributed Artificial Intelligence. Morgan Kaufmann, 1987.

[27] L. P. Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of AI Research, 4:237–285, 1996.

[28] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.

[29] Victor R. Lesser. Multiagent systems: An emerging subdiscipline of AI. ACM Computing Surveys, 27(3):340–342, September 1995.

[30] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157–163, 1994.

[31] Sridhar Mahadevan. To discount or not to discount in reinforcement learning: A case study comparing R learning and Q learning. In Proceedings of the Tenth International Conference on Machine Learning, pages 205–211, 1993.

[32] Thomas W. Malone. Modeling coordination in organizations and markets. Management Science, 33(10):1317–1332, 1987. (Also published in Readings in Distributed Artificial Intelligence, Alan H. Bond and Les Gasser, editors, pages 151–158, Morgan Kaufmann, 1988.)

[33] James G. March and Herbert A. Simon. Organizations. John Wiley & Sons, 1958.

[34] Maja J. Mataric. Learning in multi-robot systems. In Gerhard Weiß and Sandip Sen, editors, Adaptation and Learning in Multi-Agent Systems, Lecture Notes in Artificial Intelligence, pages 152–163. Springer Verlag, Berlin, 1996.

[35] K. Narendra and M. A. L. Thathachar. Learning Automata: An Introduction. Prentice Hall, 1989.

[36] Lynne E. Parker. Adaptive action selection for cooperative robot teams. In J. Meyer, H. Roitblat, and S. Wilson, editors, Proc. of the Second International Conference on Simulation of Adaptive Behavior, pages 442–450, Cambridge, MA, 1992. MIT Press.

[37] M. V. Nagendra Prasad, Susan E. Lander, and Victor R. Lesser. Cooperative learning over composite search spaces: Experiences with a multi-agent design system. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 68–73, August 1996.

[38] Foster John Provost and Daniel N. Hennessy. Scaling up: Distributed machine learning with cooperation. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 74–79, Menlo Park, CA, 1996. AAAI Press.

[39] Jeffery S. Rosenschein. Synchronization of multi-agent plans. In Proceedings of the National Conference on Artificial Intelligence, pages 115–119, Pittsburgh, Pennsylvania, August 1982. (Also published in Readings in Distributed Artificial Intelligence, Alan H. Bond and Les Gasser, editors, pages 187–191, Morgan Kaufmann, 1988.)

[40] Tuomas W. Sandholm and Robert H. Crites. On multiagent Q-learning in a semi-competitive domain. In Gerhard Weiß and Sandip Sen, editors, Adaptation and Learning in Multi-Agent Systems, Lecture Notes in Artificial Intelligence, pages 191–205. Springer Verlag, Berlin, 1996.
[41] A. Schaerf, Y. Shoham, and M. Tennenholtz. Adaptive load balancing: A study in multiagent learning. Journal of Artificial Intelligence Research, 2:475–500, 1995.

[42] Jurgen Schmidhuber. A general method for multi-agent reinforcement learning in unrestricted environments. In Sandip Sen, editor, Working Notes for the AAAI Symposium on Adaptation, Co-evolution and Learning in Multiagent Systems, pages 84–87, Stanford University, CA, March 1996.

[43] Mahendra Sekaran and Sandip Sen. Learning with friends and foes. In Sixteenth Annual Conference of the Cognitive Science Society, pages 800–805, 1994.

[44] Sandip Sen. IJCAI-95 workshop on adaptation and learning in multiagent systems. AI Magazine, 17(1):87–, Spring 1996.

[45] Sandip Sen, Mahendra Sekaran, and John Hale. Learning to coordinate without sharing information. In National Conference on Artificial Intelligence, pages 426–431, 1994.

[46] Yoav Shoham and Moshe Tennenholtz. On the synthesis of useful social laws for artificial agent societies (preliminary report). In Proceedings of the National Conference on Artificial Intelligence, pages 276–281, San Jose, California, July 1992.

[47] S. Sian. Adaptation based on cooperative learning in multi-agent systems. In Y. Demazeau and J.-P. Müller, editors, Decentralized AI, volume 2, pages 257–272. Elsevier Science Publications, 1991.

[48] Reid G. Smith. The contract net protocol: High-level communication and control in a distributed problem solver. IEEE Transactions on Computers, C-29(12):1104–1113, December 1980.

[49] Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts at Amherst, 1984.

[50] Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, June 1993.

[51] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge University, 1989.

[52] Gerhard Weiß. Learning to coordinate actions in multi-agent systems. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 311–316, August 1993.

[53] Gerhard Weiß, editor. Distributed Artificial Intelligence Meets Machine Learning: Learning in Multi-Agent Environments. Lecture Notes in Artificial Intelligence. Springer Verlag, Berlin, 1997.

[54] Gerhard Weiß and Sandip Sen, editors. Adaptation and Learning in Multi-Agent Systems. Lecture Notes in Artificial Intelligence. Springer Verlag, Berlin, 1996.

[55] Eric Werner. Cooperating agents: A unified theory of communication and social structure. In Les Gasser and Michael N. Huhns, editors, Distributed Artificial Intelligence, volume 2 of Research Notes in Artificial Intelligence, pages 3–36. Pitman, 1989.

[56] Holly Yanco and Lynn Andrea Stein. An adaptive communication protocol for cooperating mobile robots. In Stewart Wilson, editor, From Animals to Animats: Proc. of the Second International Conference on the Simulation of Adaptive Behavior, pages 478–485, Cambridge, MA, 1993. MIT Press.

[57] M. Yokoo, E. Durfee, T. Ishida, and K. Kuwabara. Distributed constraint satisfaction for formalizing distributed problem solving. In Proceedings of the Twelfth International Conference on Distributed Computing Systems, pages 614–621, 1992.

[58] Gilad Zlotkin and Jeffrey S. Rosenschein. Negotiation and conflict resolution in non-cooperative domains. In Proceedings of the National Conference on Artificial Intelligence, pages 100–105, July 1990.

Table 1: A categorization of multiagent domains for identifying learning opportunities. Rows: communicating agents, non-communicating agents. Columns: cooperative relationships (subdivided into individualistic learning and shared learning) and non-cooperative relationships.

Table 2: Trials for agent A to reach its goal acting against agent B trying to reach its own goal (maximum forces applied by agents A and B are F_A and F_B respectively).

  F_A   F_B   Trials taken to reach goal of agent A
  10     1      81
  10     2     111
  10     3     115
  10     4     197
  10     5     268