Sandip Sen & Mahendra Sekaran
Department of Mathematical & Computer Sciences
University of Tulsa, 600 South College Avenue, Tulsa, OK 74104-31
Phone: +918-631-2985
e-mail: sandip@kolkata.mcs.utulsa.edu
Abstract
Social agents, both human and computational, inhabiting a world containing multiple active agents, need to coordinate their activities. This is because agents share resources, and without proper coordination or "rules of the road", everybody will be interfering with the plans of others. As such, we need coordination schemes that allow agents to effectively achieve local goals without adversely affecting the problem-solving capabilities of other agents. Researchers in the field of Distributed Artificial Intelligence (DAI) have developed a variety of coordination schemes under different assumptions about agent capabilities and relationships. Whereas some of this research has been motivated by human cognitive biases, other work has approached it as an engineering problem of designing the most effective coordination architecture or protocol. We evaluate individual and concurrent learning by multiple, autonomous agents as a means for acquiring coordination knowledge. We show that a uniform reinforcement learning algorithm suffices as a coordination mechanism in both cooperative and adversarial situations. Using a number of multiagent learning scenarios with both tight and loose coupling between agents and with immediate as well as delayed feedback, we demonstrate that agents can consistently develop effective policies to coordinate their actions without explicit information sharing. We demonstrate the viability of using both the Q-learning algorithm and genetic algorithm based classifier systems with different payoff schemes, namely the bucket brigade algorithm (BBA) and the profit sharing plan (PSP), for developing agent coordination on two different multi-agent domains. In addition, we show that a semi-random scheme for action selection is preferable to the more traditional fitness proportionate selection scheme used in classifier systems.
1 Introduction
One of the primary goals of researchers in the artificial intelligence (AI) community is to develop autonomous agents that are knowledgeable and cognizant enough to carry out at least routine activities performed by humans. The problem, however, is an extremely difficult one, and decades of research on related areas have only served to highlight its complexity and magnitude. While researchers are developing interesting results in crucial areas like knowledge representation, planning, learning, non-monotonic reasoning, cooperative problem solving, etc., it is equally important to analyze and experiment with results from one such subfield to benefit one or more of the other ones. Isolated and domain-specific developments in any one of the sub-fields of AI are not going to go a long way toward advancing the whole field. In this paper, we will demonstrate how noteworthy advances in one particular sub-field of AI can be effectively used in another subfield to provide agents with knowledge required to solve a difficult problem. We will be applying recent research developments from the reinforcement learning literature to the coordination problem in multiagent systems. In a reinforcement learning scenario, an agent chooses actions based on its perceptions, receives scalar feedback based on past actions, and is expected to develop a mapping from perceptions to actions that will maximize feedback. Multiagent systems are a particular type of distributed AI system [2, 29], in which autonomous intelligent agents inhabit a world with no global control or globally consistent knowledge. In contrast to cooperative problem solvers [13], agents in multiagent systems are not pre-disposed to help each other out with all the resources and capabilities that they possess. These agents may still need to coordinate their activities with others to achieve their own local goals. They could benefit from receiving information about what others are doing or plan to do, and from sending them information to influence what they do.
Whereas previous research on developing agent coordination mechanisms focused on off-line design of agent organizations, behavioral rules, negotiation protocols, etc., it was recognized that agents operating in open, dynamic environments must be able to flexibly adapt to changing demands and opportunities [29, 44, 54]. In particular, individual agents are forced to engage with other agents which have varying goals, abilities, composition, and lifespan. To effectively utilize opportunities presented and avoid pitfalls, agents need to learn about other agents and adapt local behavior based on group composition and dynamics. For machine learning researchers, multiagent learning problems are challenging, because they violate the stationary environment assumptions used by most machine learning systems. As multiple agents learn simultaneously, the feedback received by the same agent for the same action varies considerably. The stationary environment assumption used by most current machine learning systems is not well-suited for such rapidly changing environments.
In this paper, we discuss how reinforcement learning techniques for developing policies to optimize environmental feedback, through a mapping between perceptions and actions, can be used by multiple agents to learn coordination strategies without having to rely on shared information. These agents work in a common environment, but are unaware of the capabilities of other agents and may or may not be cognizant of goals to achieve. We show that through repeated problem-solving experience, such agents can develop policies to maximize environmental feedback that can be interpreted as goal achievement from the viewpoint of an external observer. More interestingly, we demonstrate that in some domains these agents develop policies that complement each other.
To evaluate the applicability of reinforcement learning schemes for enabling multiagent coordination, we chose to investigate a number of multiagent domains with varying environmental characteristics. In particular, we designed environments in which the following characteristics were varied:
Agent coupling: In some domains the actions of one agent strongly and frequently affect the plans of other agents (tightly coupled system), whereas in other domains the actions of one agent only weakly and infrequently affect the plans of other agents (loosely coupled system).

Agent relationships: Agents in a multiagent system can have different kinds of mutual relationships:
• they may act in a group to solve a common problem (cooperative agents),
• they may not have any preset dispositions towards each other but interact because they use common resources (indifferent agents),
• they may have opposing interests (adversarial agents).
For discussions in this paper, we have grouped the latter two classes of domains as non-cooperative domains.

Feedback timing: In some domains, the agents may have immediate knowledge of the effects of their actions, whereas in others they may get the feedback for their actions only after a period of delay.

Optimal behavior combinations: How many behavior combinations of participating agents will optimally solve the task at hand? This value varies from one to infinity for different domains.
In addition, in this paper we concentrate exclusively on domains in which agents have little or no pre-existing domain expertise, and have no information about the capabilities and goals of other agents. These assumptions make the coordination problem particularly hard. This is particularly evident from the fact that almost all currently used coordination mechanisms rely heavily on domain knowledge and shared information between agents. The goal of our work is not to replace the previously developed coordination schemes, but to complement them by providing new coordination techniques for domains where the currently available schemes are not effective. In particular, domains where agents know little about each other provide a difficult challenge for currently used coordination schemes. Our contention is that problem solving performance or the feedback received from the environment can be effectively used by reinforcement based learning agents to circumvent the lack of common knowledge.
To verify our intuitions more rigorously, we decided to investigate two well-known reinforcement learning schemes: the Q-learning algorithm developed by Watkins [51], and the classifier systems method developed by Holland. Whereas the Q-learning algorithm was inspired by the theory of dynamic programming for optimization, classifier systems arose from an interesting blend of rule-based reasoning and computational mechanisms inspired by natural genetics [4, 25]. Q-learning and classifier systems have been proposed as two general reinforcement learning frameworks for achieving agent coordination in a multiagent system. This research opens up a new dimension of constructing coordination strategies for multiagent systems.
At this point, we would like to emphasize the difference between this work and other recent publications in the nascent area of multiagent learning (MAL) research. Previous proposals for using learning techniques to coordinate multiple agents have mostly relied on using prior knowledge [5], or on cooperative domains with unrestricted information sharing [47]. A significant percentage of this research has concentrated on cooperative learning between communicating agents, where agents share their knowledge or experiences [16, 38, 37, 50]. Some researchers have used communication to help agent groups jointly decide on their course of actions [52]. The isolated instances of research in MAL that do not use explicit communication have concentrated on competitive, rather than cooperative, domains [7, 30, 40]. We strongly believe that learning coordination policies without communication has a significant role to play in competitive as well as cooperative domains. There is little argument over the fact that communication is an invaluable tool to be used by agent groups to coordinate their activities. At times communication is the most effective and even perhaps the only mechanism to guarantee coordinated behavior. Though communication is often helpful and indispensable as an aid to group activity, it does not guarantee coordinated behavior [22], is time-consuming, and can detract from other problem solving activity if not carefully controlled [12]. Also, agents overly reliant on communication will be severely affected if the quality of communication is compromised (broken communication channels, incorrect or deliberately misleading information, etc.). At other times communication can be risky or even fatal (as in some combat situations where the adversary can intercept communicated messages).
We believe that even when communication is feasible and safe, it may be prudent to use it only as necessary. For example, if an agent is able to predict the behavior of other agents from past observations, it can possibly adjust its own plans to use a shared resource without the need for explicitly arriving at a contract using communication that consumes valuable time (to have performance guarantees, though, contracts arrived at using communication are possibly the most effective procedure; most of the problem solving activities by agents, however, do not involve hard guarantees or deadlines). We should strive to realize the maximum potential of the system without using communication. Once that has been accomplished, communication can be added to augment the performance of the system to the desired efficiency level. Such a design philosophy will lead to systems where agents do not flood communication channels with unwarranted information and agents do not have to sift through a maze of useless data to locate necessary and time-critical information. With this goal in mind we have investigated the usefulness of acquiring coordination strategies without sharing information [43, 45]. We expand on this body of work in this paper, and explore the advantages and limitations of learning without communication as a means of generating coordinated behavior in autonomous agents.
The rest of the paper is organized as follows: Section 2 presents highlights of the previous approaches to developing coordination strategies for multiple, autonomous agents; Section 3 provides a categorization of multiagent systems to identify different learning scenarios and presents a sampling of prior multiagent learning research; Section 4 reviews the reinforcement learning techniques that we have utilized in this paper; Sections 5, 6, and 7 present results of experiments with Q-learning and genetic algorithm based reinforcement learning systems on a block pushing, a robot navigation, and a resource sharing domain respectively; Section 8 summarizes the lessons learnt from this research and outlines future research directions.
2 Coordination of multiple agents
In a world inhabited by multiple agents, coordination is a key to group as well as individual success. We need to coordinate our actions whenever we are sharing goals, resources, or expertise. By coordination we mean choosing one's own action based on the expectation of others' actions. Coordination is essential for cooperative, indifferent, and even adversarial agents. As computer scientists, we are interested in developing computational mechanisms that are domain independent and robust in the presence of noisy, incomplete, and out-of-date information. Research in the area of multiagent systems has produced techniques for allowing multiple agents, which share common resources, to coordinate their actions so that individually rational actions do not adversely affect overall system efficiency [2, 13, 17, 26]. Coordination of problem solvers, irrespective of whether they are selfish or cooperative, is a key issue in the design of an effective multiagent system. The search for domain-independent coordination mechanisms has yielded some very different, yet effective, classes of coordination schemes. The most influential classes of coordination mechanisms developed to date are the following:
• protocols based on contracting [9, 48]
• distributed search formalisms [14, 57]
• organizational and social laws [15, 32, 33]
• multi-agent planning [11, 39]
• decision and game theoretic negotiations [18, 19, 58]
• linguistic approaches [8, 55]
Whereas some of these approaches use architectures and protocols designed off-line [15, 46, 48] as coordination structures, others acquire coordination knowledge on-line [11, 19]. Almost all of the coordination schemes developed to date assume explicit or implicit sharing of information. In the explicit form of information sharing, agents communicate partial results [11], speech acts [8], resource availabilities [48], etc. to other agents to facilitate the process of coordination. In the implicit form of information sharing, agents use knowledge about the capabilities of other agents [15, 18, 58] to aid local decision-making. Though each of these approaches has its own benefits and weaknesses, we believe that the less an agent depends on shared information, and the more flexible it is to the on-line arrival of problem-solving and coordination knowledge, the better it can adapt to changing environments. As flexibility and adaptability are key aspects of intelligent and autonomous behavior, we are interested in investigating mechanisms by which agents can acquire and use coordination knowledge through interactions with their environment (which includes other agents) without having to rely on shared information.
One can also argue that pre-fabricated coordination strategies can quickly become inadequate if the system designer's model of the world is incomplete/incorrect or if the environment in which the agents are situated can change dynamically. Coordination strategies that incorporate learning and adaptation components will be more robust and effective in these more realistic scenarios. Thus agents will be able to take advantage of new opportunities and deal with new contingencies presented by the environment which cannot be foreseen at design time.
3 Learning in multiagent systems
To highlight learning opportunities inherent in most multiagent systems, we develop here a categorization of multiagent problems that can be used to characterize the nature of learning mechanisms that should be used for these problems. The categorization presented in Table 1 is not meant to be the only or even the most definitive taxonomy of the field. The dimensions we consider for our taxonomy are the following: agent relationships (cooperative vs. non-cooperative) and use of communication (agents explicitly communicating or not). Within cooperative domains again, we further categorize domains based on decision-making authorities (individual vs. shared). These dimensions are not necessarily completely orthogonal, e.g., shared decision making may not be feasible without the use of explicit communication. We use the term cooperative relationship to refer to situations where multiple agents are working towards a common goal. Though there may be local goals, these are, in fact, subgoals of the common goal (even if different subgoals interfere). A number of different domains fall under the non-cooperative spectrum. These include competitive scenarios (one agent's gain is another agent's loss; need not be zero-sum), as well as scenarios where agents coordinate only to avoid conflicts.
As mentioned before, shared learning by a group of cooperative agents has received the most attention in the MAL literature [16, 37, 52]. Others have looked at each agent learning individually but using communication to share pertinent information [6, 38, 50]. Learning to cooperate without explicit information sharing can be based primarily on environmental feedback [45] or on observation of other agents in action [23]. Researchers have investigated two approaches to learning without explicit communication in non-cooperative domains: (a) treating opponents as part of the environment without explicit modeling [43], (b) learning explicit competitor models [30, 40]. Little work has been done to date in learning to compete with explicit communication. Possible scenarios to investigate in this area include learning the strengths and weaknesses of the competitor from intercepted communication, or pro-actively probing the opponent to gather more information about its preferences.
Even previous work on using reinforcement learning for coordinating multiple agents [50, 52] has relied on explicit information sharing. We, however, concentrate on systems where agents share no problem-solving knowledge. We show that although each agent is independently using reinforcement learning techniques to optimize its own environmental reward, global coordination between multiple agents can emerge without explicit or implicit information sharing. These agents can therefore act independently and autonomously, without being affected by communication delays (due to other agents being busy) or failure of a key agent (who controls information exchange or who has more information), and do not have to worry about the reliability of the information received (Do I believe the information received? Is the communicating agent an accomplice or an adversary?). The resultant systems are, therefore, robust and fault-tolerant.
Schaerf et al. have studied the use of reinforcement learning based agents for load balancing in distributed systems [41]. In this work, a comprehensive history of past performance is used to make informed decisions about the choice of resources to submit jobs to. Parker [36] has studied the emergence of coordination in simulated robot groups by using simple adaptive schemes that alter robot motivations. A major difference from our work is that the simulated robots in this work build explicit models of other robots. Other researchers have used reinforcement learning for developing effective groups of physical robots [34, 56]. Mataric [34] concentrates on using intermediate feedback for subgoal fulfillment to accelerate learning. In contrast with our work, the evaluation of learning effectiveness under varying degrees of interaction between the agents is not the focus of this work. The work by Yanco and Stein [56] involves using reinforcement learning techniques to evolve effective communication protocols between cooperating robots. This is complementary to our approach of learning to coordinate in the absence of communication.
Though our agents can be viewed as learning automata, the scalar feedback received from the environment prevents the use of results directly from the field of learning automata [35]. Recent work on theoretical issues of multiagent reinforcement learning promises to produce new frameworks for investigating problems such as those addressed in this paper [21, 42, 53].
4 Reinforcement learning
In reinforcement learning problems [1, 27], reactive and adaptive agents are given a description of the current state and have to choose the next action from a set of possible actions so as to maximize a scalar reinforcement or feedback received after each action. The learner's environment can be modeled by a discrete time, finite state, Markov decision process that can be represented by a 4-tuple ⟨S, A, P, r⟩, where P : S × S × A → [0, 1] gives the probability of moving from state s1 to s2 on performing action a, and r : S × A → ℝ is a scalar reward function. Each agent maintains a policy, π, that maps the current state into the desirable action(s) to be performed in that state. The expected value of a discounted sum of future rewards of a policy π at a state s is given by

V^π_γ(s) = E{ Σ_{t=0}^{∞} γ^t r^π_{s,t} },

where r^π_{s,t} is the random variable corresponding to the reward received by the learning agent t time steps after it starts using the policy π in state s, and γ is a discount rate (0 ≤ γ < 1).
Various reinforcement learning strategies have been proposed using which agents can develop a policy to maximize rewards accumulated over time. For evaluating the classifier system paradigm for multiagent reinforcement learning, we compare it with the Q-learning [51] algorithm, which is designed to find a policy π* that maximizes V^π_γ(s) for all states s ∈ S. The decision policy is represented by a function, Q : S × A → ℝ, which estimates long-term discounted rewards for each state–action pair. The Q values are defined as

Q^π_γ(s, a) = V^{a;π}_γ(s),

where a;π denotes the event sequence of choosing action a at the current state, followed by choosing actions based on policy π. The action, a, to perform in a state s is chosen such that it is expected to maximize the reward,

V^{π*}_γ(s) = max_{a∈A} Q^{π*}_γ(s, a)  for all s ∈ S.
If an action a in state s produces a reinforcement of R and a transition to state s′, then the corresponding Q value is modified as follows:

Q(s, a) ← (1 − β) Q(s, a) + β (R + γ max_{a′∈A} Q(s′, a′)).    (1)
The above update rule is similar to Holland's bucket-brigade [25] algorithm in classifier systems and Sutton's temporal-difference [49] learning scheme. The similarities of Q-learning and classifier systems have been analyzed in [10].
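The tabular form of this update is straightforward to implement. The following sketch (in Python, with illustrative names) assumes discretized, hashable states and actions, and uses the deterministic greedy action choice and optimistic initialization discussed for the block-pushing experiments below; it is a minimal sketch, not the implementation used for the experiments reported here.

```python
import random
from collections import defaultdict

class QLearner:
    """Tabular Q-learning agent implementing the update of Equation 1."""

    def __init__(self, actions, beta=0.2, gamma=0.9, initial_q=100.0):
        self.actions = list(actions)
        self.beta = beta      # learning rate
        self.gamma = gamma    # discount rate
        # Initializing Q-values to a large positive number forces exploration
        # under deterministic greedy action choice (see Section 5).
        self.q = defaultdict(lambda: initial_q)

    def choose_action(self, state):
        # Deterministic greedy choice; ties broken at random.
        best = max(self.q[(state, a)] for a in self.actions)
        return random.choice([a for a in self.actions
                              if self.q[(state, a)] == best])

    def update(self, state, action, reward, next_state):
        # Q(s,a) <- (1 - beta) Q(s,a) + beta (R + gamma * max_a' Q(s',a'))
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        self.q[(state, action)] = ((1 - self.beta) * self.q[(state, action)]
                                   + self.beta * (reward + self.gamma * best_next))
```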
Classifier systems are rule based systems that learn by adjusting rule strengths from feedback and by discovering better rules using genetic algorithms. In this paper, we will use simplified classifier systems where all possible message–action pairs are explicitly stored and classifiers have one condition and one action. These assumptions are similar to those made by Dorigo and Bersini [10]; we also use their notation to describe a classifier i by (c_i, a_i), where c_i and a_i are respectively the condition and action parts of the classifier. S_t(c_i, a_i) gives the strength of classifier i at time step t. We first describe how the classifier system performs and then discuss two different feedback distribution schemes, namely the Bucket Brigade algorithm (BBA) and the Profit Sharing Plan (PSP).
All classifiers are initialized to some default strength. At each time step of problem solving, an input message is received from the environment and matched with the classifier rules to form a match set, M. One of these classifiers is chosen to fire and, based on its action, a feedback may be received from the environment. Then the strengths of the classifier rules are adjusted. This cycle is repeated for a given number of time steps. A series of cycles constitutes a trial of the classifier system. In the BBA scheme, when a classifier is chosen to fire, its strength is increased by the environmental feedback. But before that, a fraction α of its strength is removed and added to the strength of the classifier that fired in the last time cycle. So, if classifier i fires at time step t, produces external feedback of R, and classifier j fires at the next time step, the following equation gives the strength update of classifier i:

S_{t+1}(c_i, a_i) = (1 − α) · S_t(c_i, a_i) + α · (R + S_{t+1}(c_j, a_j)).
We now describe the profit sharing plan (PSP) strength-updating scheme [20] used in classifier systems. In this method, problem solving is divided into episodes in between receipts of external reward. A rule is said to be active in an episode if it fired in at least one of the cycles in that episode. At the end of episode e, the strength of each active rule i in that episode is updated as follows:

S_{e+1}(c_i, a_i) = S_e(c_i, a_i) + α · (R_e − S_e(c_i, a_i)),

where R_e is the external reward received at the end of the episode. We have experimented with two methods of choosing a classifier to fire given the match set. In the more traditional method, a classifier i ∈ M at time t is chosen with a probability given by

S_t(c_i, a_i) / Σ_{d∈M} S_t(c_d, a_d).

We call this fitness proportionate PSP or PSP(FP). In the other method of action choice, the classifier with the highest fitness in M is chosen 90% of the time, and a random classifier from M is chosen in the remaining 10% of cases (Mahadevan uses such an action choosing mechanism for Q-learning in [31]). We call this semi-random PSP or PSP(SR).
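To make the two payoff schemes and the two action-choice methods concrete, the following sketch (in Python, with illustrative names) implements a simplified classifier system of the kind described above: every (condition, action) pair is stored explicitly with a strength, actions are chosen either fitness-proportionately or semi-randomly, and strengths are updated with either the BBA or the PSP rule. The BBA update is applied when the next classifier fires, using that classifier's current strength in place of S_{t+1}(c_j, a_j); this deferred form is our reading of the equation above, not a transcription of the system used in the experiments.

```python
import random

class SimpleClassifierSystem:
    """Simplified classifier system: one condition and one action per rule,
    with every (condition, action) pair stored explicitly as a strength."""

    def __init__(self, conditions, actions, initial_strength=100.0, alpha=0.1):
        self.alpha = alpha
        self.actions = list(actions)
        self.strength = {(c, a): initial_strength
                         for c in conditions for a in actions}
        self.prev = None      # (rule, reward) from the previous cycle, for BBA
        self.active = set()   # rules that fired in the current episode, for PSP

    def choose(self, condition, method="SR"):
        """Pick a rule from the match set M for this condition.
        'SR': semi-random -- highest strength 90% of the time, random otherwise.
        'FP': fitness proportionate -- probability proportional to strength."""
        match_set = [(condition, a) for a in self.actions]
        if method == "SR":
            if random.random() < 0.9:
                rule = max(match_set, key=lambda r: self.strength[r])
            else:
                rule = random.choice(match_set)
        else:
            weights = [self.strength[r] for r in match_set]
            rule = random.choices(match_set, weights=weights)[0]
        self.active.add(rule)
        return rule

    def bba_update(self, fired_rule, reward):
        """Bucket brigade: when a rule fires, the rule that fired in the
        previous cycle is moved toward its own reward plus the new rule's
        strength: S(i) <- (1 - alpha) S(i) + alpha (R_i + S(j))."""
        if self.prev is not None:
            prev_rule, prev_reward = self.prev
            self.strength[prev_rule] = (
                (1 - self.alpha) * self.strength[prev_rule]
                + self.alpha * (prev_reward + self.strength[fired_rule]))
        self.prev = (fired_rule, reward)

    def psp_update(self, episode_reward):
        """Profit sharing plan: every rule active in the episode is moved
        toward the external reward received at the end of the episode."""
        for rule in self.active:
            self.strength[rule] += self.alpha * (episode_reward - self.strength[rule])
        self.active.clear()
```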
For the Q-learning algorithm, we stop a run when the algebraic difference of the policies at the end of neighboring trials is below a threshold for 10 consecutive trials. With this convergence criterion, however, the classifier systems ran too long for us to collect reasonable data. Instead, every 10th trial, we ran the classifier system (both with BBA and PSP) with a deterministic action choice over the entire trial. We stopped a run of the classifier system if the differences of the total environmental feedback received by the system on neighboring deterministic trials were below a small threshold for 10 consecutive deterministic trials. We perform an in-depth study of the effectiveness of using Q-learning by concurrent learners to develop effective coordination in a block pushing domain. We also compare the performance of classifier systems and Q-learning on a resource sharing and a robot navigation domain. The characteristics of these three domains are as follows:
Block pushing: Concurrent learning by two agents with immediate environmental feedback; strongly coupled system; multiple optimal behaviors.

Resource sharing: One agent learning to adapt to another agent's behavior with delayed environmental feedback; strongly coupled system; single optimal behavior.

Robot navigation: Concurrent learning by multiple agents with immediate environmental feedback; variable coupling; multiple optimal behaviors.
5 Block pushing problem
In this problem, two agents, a1 and a2, are independently assigned to move a block, b, from a starting position, S, to some goal position, G, following a path, P, in Euclidean space.
The agents are not aware of the capabilities of each other and yet must choose their actions individually such that the joint task is completed. The agents have no knowledge of the system physics, but can perceive their current distance from the desired path to take to the goal state. Their actions are restricted as follows: agent i exerts a force F_i, where 0 ≤ |F_i| ≤ F_max, on the object at an angle θ_i, where 0 ≤ θ_i ≤ π. An agent pushing with force F at angle θ will offset the block in the x direction by |F| cos(θ) units and in the y direction by |F| sin(θ) units. The net resultant force on the block is found by vector addition of individual forces: F = F_1 + F_2. We calculate the new position of the block by assuming unit displacement per unit force along the direction of the resultant force. The new block location is used to provide feedback to the agents. If (x, y) is the new block location, P_x(y) is the x-coordinate of the path P for the same y coordinate, and Δx = |x − P_x(y)| is the distance along the x dimension between the block and the desired path, then K · a^(−Δx) is the feedback given to each agent for their last action (we have used K = 50 and a = 1.15).
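The dynamics and the feedback computation can be summarized in a few lines of code. The sketch below (Python, with illustrative names) assumes the exponentially decaying feedback K · a^(−Δx) as reconstructed above and a desired path supplied as a function of y; it is meant only to illustrate the environment, not to reproduce the exact simulator.

```python
import math

def block_push_step(block_xy, agent_forces, path_x_at, K=50.0, a=1.15):
    """One time step of the block-pushing world.

    agent_forces : list of (magnitude, angle) pairs, one per agent,
                   with 0 <= magnitude <= F_max and 0 <= angle <= pi.
    path_x_at    : function y -> x giving the x-coordinate of the desired
                   path P at height y.
    Returns the new block position and the scalar feedback given to every agent.
    """
    # Net force is the vector sum of the individual forces.
    fx = sum(f * math.cos(theta) for f, theta in agent_forces)
    fy = sum(f * math.sin(theta) for f, theta in agent_forces)

    # Unit displacement per unit force along the resultant direction.
    x, y = block_xy[0] + fx, block_xy[1] + fy

    # Feedback decays with distance from the desired path: K * a**(-dx).
    dx = abs(x - path_x_at(y))
    feedback = K * a ** (-dx)
    return (x, y), feedback
```

For the straight desired path from (40, 0) to (40, 100) used below, path_x_at is simply lambda y: 40.0.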
Figure 1: The block pushing problem.
The field of play is restricted to a rectangle with endpoints [0, 0] and [100, 100]. A trial consists of the agents starting from the initial position S and applying forces until either the goal position G is reached or the block leaves the field of play (see Figure 1). We abort a trial if a pre-set number of agent actions fail to take the block to the goal. This prevents agents from learning policies where they apply no force when the block is resting on the optimal path to the goal but not on the goal itself. The agents are required to learn, through repeated trials, to push the block along the path P to the goal. Although we have used only two agents in our experiments, the solution methodology can be applied without modification to problems with an arbitrary number of agents.
To implement the policy π we chose to use an internal discrete representation for the external continuous space. The force, angle, and the space dimensions were all uniformly discretized. When a particular discrete force or angle is selected by the agent, the middle value of the associated continuous range is used as the actual force or angle that is applied on the block.
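A uniform discretization of this kind can be written as a pair of helper functions. The sketch below (Python, illustrative names) maps a continuous value to a bin index and a chosen bin back to the midpoint of its range, which is the value actually applied to the block.

```python
def make_discretizer(low, high, n_bins):
    """Uniform discretization used for force, angle, and the state dimension."""
    width = (high - low) / n_bins

    def to_bin(value):
        # Continuous value -> bin index, clamped to [0, n_bins - 1].
        return min(n_bins - 1, max(0, int((value - low) / width)))

    def to_value(bin_index):
        # Chosen bin -> midpoint of its continuous range.
        return low + (bin_index + 0.5) * width

    return to_bin, to_value

# Example settings from the experiments below: 11 angle intervals over
# [0, pi], 10 force intervals, and 10-20 intervals for x over [0, 100].
```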
The block-pushing problem is a tightly-coupled system, where, at each time step, the outcome of each agent's action is dependent on the action of the other agents present in the system. Using the block-pushing problem, we have conducted experiments on both cooperative (agents push the block towards a common goal) and non-cooperative (agents vie with each other to push the block to individual goals) situations.
An experimental run consists of a number of trials during which the system parameters (β, γ, and K) as well as the learning problem (granularity, agent choices) are held constant. The stopping criterion for a run is either that the agents succeed in pushing the block to the goal in N consecutive trials (we have used N = 10) or that a maximum number of trials (we have used 1500) have been executed. The latter cases are reported as non-converged runs. The standard procedure in the Q-learning literature of initializing Q values to zero is suitable for most tasks where non-zero feedback is infrequent and hence there is enough opportunity to explore all the actions. Because a non-zero feedback is received after every action in our problem, we found that agents would follow, for an entire run, the path they take in the first trial. This is because they start each trial at the same state, and the only non-zero Q-value for that state is for the action that was chosen at the start of the trial. Similar reasoning holds for all the other actions chosen in the trial. A possible fix is to choose a fraction of the actions by random choice, or to use a probability distribution over the Q-values to choose actions stochastically. These options, however, lead to very slow convergence. Instead, we chose to initialize the Q-values to a large positive number. This enforced an exploration of the available action options while allowing for convergence after a reasonable number of trials. The primary metric for performance evaluation is the average number of trials taken by the system to converge. Information about acquisition of coordination knowledge is obtained by plotting, for different trials, the average distance of the actual path followed from the desired path. Data for all plots and tables in this paper have been averaged over 100 runs.
5.1 Experiments in Cooperative Domain
Extensive experimentation was performed on the block-pushing domain. The two agents were assigned the task of pushing the block to the same goal location. The following sections provide further details on the experimental results.

5.1.1 Choice of system parameters
If the agents learn to push the block along the desired path, the reward that they will receive for the best action choices at each step is equal to the maximum possible value of K. The steady-state values for the Q-values (Q_ss) corresponding to optimal action choices can be calculated from the equation:

Q_ss = (1 − β) Q_ss + β (K + γ Q_ss).

Solving for Q_ss in this equation yields a value of K / (1 − γ). In order for the agents to explore all actions after the Q-values are initialized at S_I, we require that any new Q value be less than S_I. From similar considerations as above we can show that this will be the case if S_I ≥ K / (1 − γ). In our experiments we fix the maximum reward K at 50, S_I at 100, and γ at 0.9. Unless otherwise mentioned, we have used β = 0.2, and allowed each agent to vary both the magnitude and angle of the force they apply on the block.
The first problem we used had starting and goal positions at (40, 0) and (40, 100) respectively. During our initial experiments we found that with an even number of discrete intervals chosen for the angle dimension, an agent cannot push along any line parallel to the y-axis. Hence we used an odd number, 11, of discrete intervals for the angle dimension. The number of discrete intervals for the force dimension is chosen to be 10.

On varying the number of discretization intervals for the state space between 10, 15, and 20, we found the corresponding average number of trials to convergence to be 784, 793, and 115 respectively, with 82%, 83%, and 100% of the respective runs converging within the specified limit of 1200 trials. This suggests that when the state representation gets too coarse, the agents find it very difficult to learn the optimal policy. This is because the smaller the number of intervals (the coarser the granularity), the more the variation in reward an agent gets after taking the same action at the same state (each discrete state maps into a larger range of continuous space and hence the agents start from and end up in physically different locations, the latter resulting in different rewards).

5.1.2 Varying learning rate

We experimented by varying the learning rate, β. The resultant average distance of the actual path from the desired path over the course of a run is plotted in Figure 2 for β values 0.4, 0.6, and 0.8.
In the case of the straight path between (40, 0) and (40, 100), the optimal sequence of actions always puts the block on the same x-position. Since the x-dimension is the only dimension used to represent state, the agents update the same Q-value in their policy matrix in successive steps. We now calculate the number of updates required for the Q-value corresponding to this optimal action before it reaches the steady state value. Note that for the system to converge, it is necessary that only the Q-value for the optimal action at x = 40 needs to arrive at its steady state value. This is because the block is initially placed at x = 40, and so long as the agents choose their optimal action, it never reaches any other x position. So, the number of updates to reach steady state for the Q-value associated with the optimal action at x = 40 should be proportional to the number of trials to convergence for a given run. In the following, let S_t be the Q-value after t updates and S_I be the initial Q-value. Using Equation 1 and the fact that for the optimal action at the starting position the reinforcement received is K and the next state is the same as the current state, we can write

S_{t+1} = (1 − β) S_t + β (K + γ S_t) = (1 − β (1 − γ)) S_t + β K = A S_t + C,    (2)

where A and C are constants defined to be equal to 1 − β (1 − γ) and β K respectively. Equation 2 is a difference equation which can be solved using S_0 = S_I to obtain
S_t = A^{t+1} S_I + C (1 − A^{t+1}) / (1 − A).
If we define convergence by the criterion that |S_{t+1} − S_t| < ε, where ε is an arbitrarily small positive number, then the number of updates t required for convergence can be calculated to be the following:

t ≥ [log(ε) − log(S_I (1 − A) − C)] / log(A)
  = [log(ε) − log(β) − log(S_I (1 − γ) − K)] / log(1 − β (1 − γ)).    (3)
If we keep γ and S_I constant, the above expression can be shown to be a decreasing function of β. This is corroborated by our experiments with varying β while holding γ = 0.1 (see Figure 2). As β increases, the agents take fewer trials to converge to the optimal set of actions required to follow the desired path. The other plot in Figure 2 presents a comparison of the theoretical and experimental convergence trends. The first curve in the plot represents the function corresponding to the number of updates required to reach the steady state value (with ε = 0). The second curve represents the average number of trials required for a run to converge, scaled down by a constant factor of 0.06. The actual ratios between the number of trials to convergence and the values of the expression on the right hand side of inequality 3 for β equal to 0.4, 0.6, and 0.8 are 24.1, 25.6, and 27.5 respectively (the average numbers of trials are 95.6, 71.7, and 53; values of the above-mentioned expression are 3.97, 2.8, and 1.93). Given the fact that results are averaged over 100 runs, we can claim that our theoretical analysis provides a good estimate of the relative time required for convergence as the learning rate is changed.
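The right hand side of inequality 3 is easy to evaluate numerically. The short script below (Python) reproduces the values quoted above for γ = 0.1, S_I = 100, and K = 50; the term log(ε) is replaced by the constant 1 so that the expression matches the curve plotted in Figure 2, which is an assumption on our part.

```python
import math

def updates_to_convergence(beta, gamma=0.1, S_I=100.0, K=50.0, log_eps=1.0):
    """Right-hand side of inequality (3): number of Q-value updates needed
    before successive updates differ by less than epsilon.  log_eps stands
    for log(epsilon); 1.0 matches the '(1 - log(40*beta))' curve of Figure 2."""
    numerator = log_eps - math.log(beta) - math.log(S_I * (1 - gamma) - K)
    return numerator / math.log(1 - beta * (1 - gamma))

for beta, trials in [(0.4, 95.6), (0.6, 71.7), (0.8, 53.0)]:
    t = updates_to_convergence(beta)
    print(f"beta={beta}: predicted updates = {t:.2f}, trials/prediction = {trials / t:.1f}")
# Prints values close to 3.97, 2.8, and 1.93, with ratios near 24.1, 25.6, and 27.5.
```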
Figure 2: Variation of average distance of actual path from desired path over the course of a run, and the number of updates for convergence of the optimal Q-value with changing β (γ = 0.1, S_I = 100).

5.1.3 Varying agent capabilities

The next set of experiments was designed to demonstrate the effects of agent capabilities on the time required to converge on the optimal set of actions. In the first of the current set of experiments, one of the agents was chosen to be a "dummy"; it did not exert any force at all. The other agent could only change the angle at which it could apply a constant force on the block. In the second experiment, the latter agent was allowed to vary both force and angle. In the third experiment, both agents were allowed to vary their force and angle. The average numbers of trials to convergence for the first, second, and third experiments are 55, 431, and 115 respectively. The most interesting result from these experiments is that two agents can learn to coordinate their actions and achieve the desired problem-solving behavior much faster than when a single agent is acting alone. If, however, we simplify the problem of the only active agent by restricting its choice to that of selecting the angle of force, it can learn to solve the problem quickly. If we fix the angle for the only active agent, and allow it to vary only the magnitude of the force, the problem becomes either trivial (if the chosen angle is identical to the angle of the desired path from the starting point) or unsolvable.
5.1.4 Transfer of learning

We designed a set of experiments to demonstrate how learning in one situation can help learning to perform well in a similar situation. The problem with starting and goal locations at (40, 0) and (40, 100) respectively is used as a reference problem. In addition, we used five other problems with the same starting location and with goal locations at (50, 100), (60, 100), (70, 100), (80, 100), and (90, 100) respectively. The corresponding desired paths were obtained by joining the starting and goal locations by straight lines. To demonstrate transfer of learning, we first stored each of the policy matrices that the two agents converged on for the original problem. Next, we ran a set of experiments using each of the new problems, with the agents starting off with their previously stored policy matrices.

We found that there is a linear increase in the number of trials to convergence as the goal in the new problem is placed farther apart from the goal in the initial problem. To determine if this increase was due purely to the distance between the two desired paths, or due to the difficulty in learning to follow certain paths, we ran experiments on the latter problems with agents starting with uniform policies. These experiments reveal that the larger the angle between the desired path and the y-axis, the longer the agents take to converge. Learning in the original problem, however, does help in solving these new problems, as evidenced by a ≈10% savings in the number of trials to convergence when agents started with the previously learned policy. Using a one-tailed t-test we found that all the differences were significant at the 99% confidence level. This result demonstrates the transfer of learned knowledge between similar problem-solving situations.
5.1.5 Complementary learning

In the last few sections we have shown the effects of system parameters and agent capabilities on the rate at which the agents converge on an optimal set of actions. In this section, we discuss what an "optimal set of actions" means to different agents.
Figure 3: Optimal action choices for a selection of states for each agent according to their policy matrices at the end of a successful run.
If the agents were cognizant of the actual constraints and goals of the problem, and knew elementary physics, they could independently calculate the desired action for each of the states that they may enter. The resulting policies would be identical. Our agents, however, have no planning capacity and their knowledge is encoded in the policy matrix. Figure 3 provides a snapshot, at the end of a successfully converged run, of what each agent believes to be its best action choice for each of the possible states in the world. The action choice for each agent at a state is represented by a straight line at the appropriate angle and scaled to represent the magnitude of force. We immediately notice that the individual policies are complementary rather than being identical. Given a state, the combination of the best actions will bring the block closer to the desired path. In some cases, one of the agents even pushes in the wrong direction while the other agent has to compensate with a larger force to bring the block closer to the desired path. These cases occur in states which are at the edge of the field of play, and have been visited only infrequently. Complementarity of the individual policies, however, is visible for all the states.
5.2 Experiments in Non-Cooperative Domains
We designed a set of experiments in which two agents are provided different feedback for the same block location. The agents are assigned to push the same block to two different goals along different paths. Hence, the action of each of them adversely affects the goal achievement of the other agent. The maximum force (we refer to this as strength) of one agent was chosen as 10 units, while the maximum force of the other agent was varied. The other variable was the number of discrete action options available within the given force range. When there is considerable disparity between the strengths of the two agents, the stronger agent overpowers the weaker agent, and succeeds in pushing the block to its goal location (see Figure 4). The average number of trials to convergence (see Table 2), however, indicates that as the strength of the weaker agent is increased, the stronger agent finds it increasingly difficult to attain its goal. For these experiments, the strong and the weak agents had respectively 11 (between 0 and 10) and 2 (0 and its maximum strength) force options to choose from.

Figure 4: Example trial when agents have conflicting goals.

When the number of force discretizations for the weak agent is increased from 2 to 10, we find that the stronger agent finds it more difficult to push the block to its own goal. If we increase the maximum force of the weak agent closer to the maximum force of the stronger agent, we find that neither of them is able to push the block to its desired goal. At the end of a run, we find that the final converged path lies in between their individual desired paths. As the strength of the weaker agent increases, this path moves away from the desired path of the stronger agent, and ultimately lies midway between their individual desired paths when both agents are equally strong.
Intuitively, an agent should be able to 'overpower' another agent whenever it is stronger. Why is this not happening? The answer lies in the stochastic variability of feedback received for the same action at the same state, and the deterministic choice of the action corresponding to the maximal policy matrix entry. When an agent chooses an action at a state it can receive one of several different feedbacks depending on the action chosen by the other agent. We define the optimal action choice for a state x to be the action A_x that has the highest average feedback F_x. Suppose the first time the agent chooses this action at state x it receives a feedback F_1 that is lower than F_x; because actions are then chosen deterministically from the maximal policy entries, such an unlucky sample can steer the agent away from this optimal action in subsequent trials.

6 Robot navigation problem

We designed a problem in which four agents, A, B, C, and D, are to find the optimal path in a grid world, from given starting locations to their respective goals, A′, B′, C′, and D′. The agents traverse their world using one of the five available operators: north, south, east, west, or hold. Figure 5 depicts potential paths that each of the agents might choose during their learning process. The goal of the agents is to learn moves that quickly take them to their respective goal locations without colliding with other agents.

Figure 5: A robot navigation problem (agents A, B, C, D and their goals A′, B′, C′, D′).

Each agent receives feedback based on its move: when it makes a move that takes it towards its goal, it receives a feedback of 1; when it makes a move that takes it away from its goal, it receives a feedback of -1; when it makes a move that results in no change of its distance from its goal (hold), it receives a feedback of 0; when it makes a move that results in a collision, the feedback is computed as depicted in Figure 6. All agents learn at the same time by updating their individual policies.

Figure 6: Feedback for agent X when it causes different types of collisions (given the action of the other agent in the collision); depending on the type of collision, X receives a feedback of -10 or -5.

The robot navigation task is a domain in which the agent coupling varies over time. When agents are located next to each other, they are very likely to interact and hence agent behaviors are tightly coupled. But, as they move apart, their interactions become less likely, and as a result, their behaviors become loosely coupled.

Since the robot navigation problem produces rewards after each time step, we have used the BBA method of payoff distribution with the classifier system. The system parameters are β = 0.5 and γ = 0.8 for Q-learning, and α = 0.1 for BBA.

Experimental results on the robot navigation domain comparing Q-learning and a classifier system using BBA for payoff distribution are displayed in Figure 7. Plots show the average number of steps taken by the agents to reach their goals. Lower values of this parameter mean agents are learning to find more direct paths to their goals without colliding with each other. Results are averaged over 50 runs for both systems. Q-learning takes as much as 5 times longer to converge when compared to BBA. The final number of steps taken by agents using Q-learning is slightly smaller than the number of steps taken by agents using BBA. We believe that if we make the convergence criterion more strict for the BBA, a better solution can be evolved with more computational effort.

Figure 7: Comparison of BBA and Q-learning on the robot navigation problem.

The interesting aspect of this experiment is that all the agents were learning simultaneously and hence it was not obvious that they would find good paths. Typical solutions, however, show that agents stop at the right positions to let others pass through. This avoids collisions. The paths do contain small detours, and hence are not optimal.

The performance of BBA when α = 0.5 (comparable to β used in Q-learning) was not very different when compared to that presented in Figure 7. It is interesting to note that when experiments were tried setting the β in Q-learning to 0.1 (comparable to α in BBA) the system did not attain convergence even after 5000 trials.
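The per-move feedback used in this domain can be written compactly. The sketch below (Python, illustrative names) follows the scheme described above; the exact mapping from collision configurations to the -10 and -5 penalties of Figure 6 is not fully recoverable from the text, so the labels used here are assumptions.

```python
def navigation_feedback(old_dist, new_dist, collision=None):
    """Feedback for one move in the grid-world navigation task.

    old_dist, new_dist : distance to the agent's goal before and after the move.
    collision          : None for no collision; otherwise one of the
                         illustrative labels 'head_on', 'hit_stationary',
                         or 'hit_moving' (assumed reading of Figure 6).
    """
    if collision == 'head_on':
        return -10.0
    if collision in ('hit_stationary', 'hit_moving'):
        return -5.0
    if new_dist < old_dist:
        return 1.0      # moved toward the goal
    if new_dist > old_dist:
        return -1.0     # moved away from the goal
    return 0.0          # hold: no change in distance to the goal
```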
7 Resource sharing problem

The resource sharing problem assumes two agents sharing a common resource or channel, with each of them trying to distribute their load on the system so as to achieve maximum utility. In our version of the problem, it is assumed that one agent has already applied some load distributed over a fixed time period, and the other agent is learning to distribute its load on the system without any knowledge of the current distribution. A maximum load of L can be applied on the system at any point in time (loads in excess of this do not receive any utility).

The second agent can use K load-hours of the channel. If it applies a load of k_t load-hours on the system at time step t, when the first agent has applied l_t load-hours, the utility it receives is u(l_t, k_t) = U(min(L, k_t + l_t)) − U(min(L, l_t)), where U is the utility function in Figure 8, and L = 10 is the maximum load allowed on the system. The total feedback it gets at the end of T time steps is Σ_{t=1}^{T} u(l_t, k_t). This problem requires the second agent to distribute its load around the loads imposed by the first agent in order to obtain maximum utility.

Figure 8: Curve depicting the utility received for a given load (U(x) = 10 (1 − e^(−x))).

The problem is that the second agent has no direct information about the load distribution on the system. This is a typical situation in reinforcement learning problems where the agent has to choose its actions based only on scalar feedback. This formulation is different from other load balancing problems used in the MAL literature [41] where more short-term feedback is available to agents.

A single trial consists of an episode of applying loads until T time steps are completed or until the agent has exhausted its load-hours, whichever occurs earlier. Thus, through consecutive such trials the agent learns to distribute its load on the system in an optimal fashion. Figure 9 presents the load distribution used by the first agent as well as one of several optimal load distributions for the second agent in the particular problem we have used for experiments (T = 10 in this problem).

Figure 9: A resource sharing problem (the loads applied by the two agents over time, shown against the maximum load).

The resource sharing problem is another example of a tightly-coupled system, with agent actions interacting with each other at every time step. This domain can be considered to be non-cooperative because the agents are not doing a common task; each of the agents needs to coordinate with the others to achieve a utility that is most beneficial to itself.

Since the resource sharing problem produces rewards only after a series of actions are performed, we used the PSP method of payoff distribution with the classifier system. Though BBA can also be used for payoff distribution in this problem, our initial experiments showed that PSP performed much better than BBA on this problem. The parameter values are β = 0.85 and γ = 0.9 for Q-learning, and α = 0.1 for PSP.
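The utility computation for the learning agent can be sketched as follows (Python, illustrative names). The utility curve is the U(x) = 10 (1 − e^(−x)) of Figure 8, and the total load is capped at L = 10 before being passed to U; reading the capping as min() is our interpretation of "loads in excess of this do not receive any utility".

```python
import math

def channel_utility(x):
    """Utility curve from Figure 8: U(x) = 10 * (1 - exp(-x))."""
    return 10.0 * (1.0 - math.exp(-x))

def step_utility(l_t, k_t, L=10.0):
    """Utility received at time t when the learning agent applies k_t
    load-hours while the other agent has already applied l_t load-hours.
    Load beyond the channel maximum L earns nothing, hence the min()."""
    return channel_utility(min(L, k_t + l_t)) - channel_utility(min(L, l_t))

def episode_utility(other_load, own_load):
    """Total feedback over T time steps: the sum of per-step utilities."""
    return sum(step_utility(l, k) for l, k in zip(other_load, own_load))
```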
In this set of experiments the fitness-proportionate PSP did not converge even after 150,000 trials. Experimental results comparing Q-learning and semi-random PSP, PSP(SR), based classifier systems on the resource sharing problem are displayed in Figure 10. Results are averaged over 50 runs of both systems. Though both methods find the optimal load distribution in some of the runs, more often than not they settle for a less than optimal, but reasonably good, distribution. PSP takes about twice as long to converge but produces a better load distribution on the average. The difference in performance is found to be significant at the 99% confidence level using a two-sample t-procedure. We believe this happens because all the active rules directly share the feedback at the end of a trial in PSP. In Q-learning, however, the external feedback is passed back to policy elements used early in a trial over successive trials. Interference with different action sequences sharing the same policy element (state-action pair) can produce convergence to sub-optimal solutions.

Figure 10: Comparison of PSP and Q-learning on the resource sharing problem.

Typical solutions produced by PSP and Q-learning differed in one important characteristic. PSP solutions save some of the load for the last empty time-slot, whereas Q-learning solutions use up all the available load before that. Since PSP is able to utilize the last empty time slot on the channel, it produces better utility than Q-learning.

The above results show two things: 1) an agent can effectively use a classifier system to coordinate its actions with no knowledge about the actions of the other agent using the common resource, and 2) a semi-random action choice mechanism can be a more effective method for classifier systems than the commonly used fitness-proportionate action choice scheme.

We ran a further set of experiments in this domain where both agents were learning concurrently. In this mode, each agent would submit a load distribution to the system, and the submitted load distributions would be used to give them feedback on their utilities. Neither of the learning techniques was able to generate effective coordination between the agents. This set of experiments, alone, glaringly exposes the limitations of individual learning. It seems that in tightly coupled domains with delayed feedback it would be extremely unlikely that good coordination will evolve between agents learning at the same time. This is particularly true if there is one or very few optimal behavior pairings.

8 Conclusions and Future Research

In this paper we have addressed the problem of developing multiagent coordination strategies with minimal domain knowledge and information sharing between agents. We have compared classifier system based methods and Q-learning algorithms, two reinforcement learning paradigms, to investigate a resource sharing and a robot navigation problem. Our experiments show that the classifier based methods perform very competitively with the Q-learning algorithm, and are able to generate good solutions to both problems. PSP works well on the resource sharing problem, where an agent is trying to adapt to a fixed strategy used by another agent, and when environmental feedback is received infrequently. Results are particularly encouraging for the robot navigation domain, where all agents are learning simultaneously. A classifier system with the BBA payoff distribution allows agents to coordinate their movements with others without deviating significantly from the optimal path from their start to goal locations.
Experiments conducted on the block-pushing task demonstrate that two agents can coordinate to solve a problem better, even without developing a model of each other, than what they can do alone. We have developed and experimentally verified theoretical predictions of the effects of a particular system parameter, the learning rate, on system convergence. Other experiments show the utility of using knowledge, acquired from learning in one situation, in other similar situations. Additionally, we have demonstrated that agents coordinate by learning complementary, rather than identical, problem-solving knowledge.

Using reinforcement learning schemes, we have shown that agents can learn to achieve their goals in both cooperative and adversarial domains. Neither prior knowledge about domain characteristics nor explicit models about capabilities of other agents are required. This provides a novel paradigm for multi-agent systems through which both friends and foes can concurrently acquire coordination knowledge.

A particular limitation of the proposed approach that we have identified is the inability of individual, concurrent learning to develop effective coordination when agent actions are strongly coupled, feedback is delayed, and there is one or a few optimal behavior combinations. A possible partial fix to this problem would be to do some sort of staggered or lock-step learning. In this mode of learning, each agent can learn for some time, then execute its current policy without modification for some time, then switch back to learning, etc. Two agents can synchronize their behavior so that one is learning while the other is following a fixed policy and vice versa. Even if perfect synchronization is not feasible, the staggered learning mode is likely to be more effective than the concurrent learning mode we have used in this paper.

A drawback of using reinforcement learning to generate coordination policies is that it requires a considerable amount of data, and as such can only be used in domains where agents repeatedly perform similar tasks. Other learning algorithms with smaller data requirements can possibly be explored to overcome this limitation.

This paper demonstrates that classifier systems can be used effectively to achieve near-optimal solutions more quickly than Q-learning, as illustrated by the experiments conducted in the robot navigation task. If we enforce a more rigid convergence criterion, classifier systems achieve a better solution than Q-learning through a larger number of trials, as illustrated by the results obtained on the resource sharing domain. We believe, however, that either Q-learning or the classifier system can produce better results in a given domain. Identifying the distinguishing features of domains which allow one of these schemes to perform better will be a focus of our future research.

We have also shown that a semi-random choice of actions can be much more productive than the commonly used fitness-proportionate choice of actions with the PSP payoff distribution mechanism. We plan to compare the BBA mechanism with these two methods of payoff distribution.

We would also like to investigate the effects of problem complexity on the number of trials taken for convergence. In the robot navigation domain, for example, we would like to vary both the size of the grid as well as the number of agents moving on the grid to find out the effects on solution quality and convergence time.

Other planned experiments include using world models within classifier systems [3] and combining features of BBA and PSP [20] that would be useful for learning multiagent coordination strategies.

Acknowledgments

This research has been sponsored, in part, by the National Science Foundation under a Research Initiation Award IRI-9410180 and a CAREER award IRI-9702672.

References

[1] Andrew B. Barto, Richard S. Sutton, and Chris Watkins. Sequential decision problems and neural networks. In Proceedings of the 1989 Conference on Neural Information Processing Systems, 1989.

[2] Alan H. Bond and Les Gasser. Readings in Distributed Artificial Intelligence. Morgan Kaufmann Publishers, San Mateo, CA, 1988.
[3] Lashon B. Booker. Classifier systems that learn internal world models. Machine Learning, 3:161–192, 1988.

[4] L. B. Booker, D. E. Goldberg, and J. H. Holland. Classifier systems and genetic algorithms. Artificial Intelligence, 40:235–282, 1989.

[5] P. Brazdil, M. Gams, S. Sian, L. Torgo, and W. van de Velde. Learning in distributed systems and multi-agent environments. In European Working Session on Learning, Lecture Notes in AI, 482, Berlin, March 1991. Springer Verlag.

[6] Hung H. Bui, Dorota Kieronska, and Svetha Venkatesh. Negotiating agents that learn about others' preferences. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 114–119, Menlo Park, CA, 1996. AAAI Press.

[7] David Carmel and Shaul Markovitch. Incorporating opponent models into adversary search. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 62–67, Menlo Park, CA, 1996. AAAI Press.

[8] Philip R. Cohen and C. Raymond Perrault. Elements of a plan-based theory of speech acts. Cognitive Science, 3(3):177–212, 1979.

[9] Susan E. Conry, Robert A. Meyer, and Victor R. Lesser. Multistage negotiation in distributed planning. In Alan H. Bond and Les Gasser, editors, Readings in Distributed Artificial Intelligence, pages 367–384. Morgan Kaufmann, 1988.

[10] Marco Dorigo and Hugues Bersini. A comparison of Q-learning and classifier systems. In Proceedings of From Animals to Animats, Third International Conference on Simulation of Adaptive Behavior, 1994.

[11] Edmund H. Durfee and Victor R. Lesser. Partial global planning: A coordination framework for distributed hypothesis formation. IEEE Transactions on Systems, Man, and Cybernetics, 21(5), September 1991. (Special Issue on Distributed Sensor Networks.)

[12] Edmund H. Durfee, Victor R. Lesser, and Daniel D. Corkill. Coherent cooperation among communicating problem solvers. IEEE Transactions on Computers, C-36(11):1275–1291, November 1987. (Also published in Readings in Distributed Artificial Intelligence, Alan H. Bond and Les Gasser, editors, pages 268–284, Morgan Kaufmann, 1988.)

[13] Edmund H. Durfee, Victor R. Lesser, and Daniel D. Corkill. Trends in cooperative distributed problem solving. IEEE Transactions on Knowledge and Data Engineering, 1(1):63–83, March 1989.

[14] Edmund H. Durfee and Thomas A. Montgomery. Coordination as distributed search in a hierarchical behavior space. IEEE Transactions on Systems, Man, and Cybernetics, 21(6):1363–1378, November/December 1991.

[15] Mark S. Fox. An organizational view of distributed systems. IEEE Transactions on Systems, Man, and Cybernetics, 11(1):70–80, January 1981. (Also published in Readings in Distributed Artificial Intelligence, Alan H. Bond and Les Gasser, editors, pages 140–150, Morgan Kaufmann, 1988.)

[16] Andrew Garland and Richard Alterman. Multiagent learning through collective memory. In Sandip Sen, editor, Working Notes for the AAAI Symposium on Adaptation, Co-evolution and Learning in Multiagent Systems, pages 33–38, Stanford University, CA, March 1996.

[17] Les Gasser and Michael N. Huhns, editors. Distributed Artificial Intelligence, volume 2 of Research Notes in Artificial Intelligence. Pitman, 1989.

[18] M. R. Genesereth, M. L. Ginsberg, and J. S. Rosenschein. Cooperation without communications. In Proceedings of the National Conference on Artificial Intelligence, pages 51–57, Philadelphia, Pennsylvania, 1986.

[19] Piotr J. Gmytrasiewicz, Edmund H. Durfee, and David K. Wehe. A decision-theoretic approach to coordinating multiagent interactions. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pages 62–68, August 1991.

[20] John Grefenstette. Credit assignment in rule discovery systems. Machine Learning, 3(2/3):225–246, 1988.

[21] Pan Gu and Anthony B. Maddox. A framework for distributed reinforcement learning. In Gerhard Weiß and Sandip Sen, editors, Adaptation and Learning in Multi-Agent Systems, Lecture Notes in Artificial Intelligence, pages 97–112. Springer Verlag, Berlin, 1996.
[22] Joseph Halpern and Yoram Moses. Knowledge and common knowledge in a distributed environment. Journal of the ACM, 37(3):549–587, 1990. A preliminary version appeared in Proc. 3rd ACM Symposium on Principles of Distributed Computing, 1984.

[23] Thomas Haynes and Sandip Sen. Learning cases to compliment rules for conflict resolution in multiagent systems. International Journal of Human-Computer Studies (to appear).

[24] John H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI, 1975.

[25] John H. Holland. Escaping brittleness: the possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, editors, Machine Learning, an Artificial Intelligence Approach: Volume II. Morgan Kaufmann, Los Altos, CA, 1986.

[26] Michael Huhns, editor. Distributed Artificial Intelligence. Morgan Kaufmann, 1987.

[27] L. P. Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of AI Research, 4:237–285, 1996.

[28] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.

[29] Victor R. Lesser. Multiagent systems: An emerging subdiscipline of AI. ACM Computing Surveys, 27(3):340–342, September 1995.

[30] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157–163, 1994.

[31] Sridhar Mahadevan. To discount or not to discount in reinforcement learning: A case study comparing R learning and Q learning. In Proceedings of the Tenth International Conference on Machine Learning, pages 205–211, 1993.

[32] Thomas W. Malone. Modeling coordination in organizations and markets. Management Science, 33(10):1317–1332, 1987. (Also published in Readings in Distributed Artificial Intelligence, Alan H. Bond and Les Gasser, editors, pages 151–158, Morgan Kaufmann, 1988.)

[33] James G. March and Herbert A. Simon. Organizations. John Wiley & Sons, 1958.

[34] Maja J. Mataric. Learning in multi-robot systems. In Gerhard Weiß and Sandip Sen, editors, Adaptation and Learning in Multi-Agent Systems, Lecture Notes in Artificial Intelligence, pages 152–163. Springer Verlag, Berlin, 1996.

[35] K. Narendra and M. A. L. Thathachar. Learning Automata: An Introduction. Prentice Hall, 1989.

[36] Lynne E. Parker. Adaptive action selection for cooperative robot teams. In J. Meyer, H. Roitblat, and S. Wilson, editors, Proc. of the Second International Conference on Simulation of Adaptive Behavior, pages 442–450, Cambridge, MA, 1992. MIT Press.

[37] M. V. Nagendra Prasad, Susan E. Lander, and Victor R. Lesser. Cooperative learning over composite search spaces: Experiences with a multi-agent design system. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 68–73, August 1996.

[38] Foster John Provost and Daniel N. Hennessy. Scaling up: Distributed machine learning with cooperation. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 74–79, Menlo Park, CA, 1996. AAAI Press.

[39] Jeffery S. Rosenschein. Synchronization of multi-agent plans. In Proceedings of the National Conference on Artificial Intelligence, pages 115–119, Pittsburgh, Pennsylvania, August 1982. (Also published in Readings in Distributed Artificial Intelligence, Alan H. Bond and Les Gasser, editors, pages 187–191, Morgan Kaufmann, 1988.)

[40] Tuomas W. Sandholm and Robert H. Crites. On multiagent Q-learning in a semi-competitive domain. In Gerhard Weiß and Sandip Sen, editors, Adaptation and Learning in Multi-Agent Systems, Lecture Notes in Artificial Intelligence, pages 191–205. Springer Verlag, Berlin, 1996.
[41] A. Schaerf, Y. Shoham, and M. Tennenholtz. Adaptive load balancing: A study in multiagent learning. Journal of Artificial Intelligence Research, 2:475–500, 1995.

[42] Jurgen Schmidhuber. A general method for multi-agent reinforcement learning in unrestricted environments. In Sandip Sen, editor, Working Notes for the AAAI Symposium on Adaptation, Co-evolution and Learning in Multiagent Systems, pages 84–87, Stanford University, CA, March 1996.

[43] Mahendra Sekaran and Sandip Sen. Learning with friends and foes. In Sixteenth Annual Conference of the Cognitive Science Society, pages 800–805, 1994.

[44] Sandip Sen. IJCAI-95 workshop on adaptation and learning in multiagent systems. AI Magazine, 17(1):87–, Spring 1996.

[45] Sandip Sen, Mahendra Sekaran, and John Hale. Learning to coordinate without sharing information. In National Conference on Artificial Intelligence, pages 426–431, 1994.

[46] Yoav Shoham and Moshe Tennenholtz. On the synthesis of useful social laws for artificial agent societies (preliminary report). In Proceedings of the National Conference on Artificial Intelligence, pages 276–281, San Jose, California, July 1992.

[47] S. Sian. Adaptation based on cooperative learning in multi-agent systems. In Y. Demazeau and J.-P. Müller, editors, Decentralized AI, volume 2, pages 257–272. Elsevier Science Publications, 1991.

[48] Reid G. Smith. The contract net protocol: High-level communication and control in a distributed problem solver. IEEE Transactions on Computers, C-29(12):1104–1113, December 1980.

[49] Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts at Amherst, 1984.

[50] Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, June 1993.

[51] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge University, 1989.

[52] Gerhard Weiß. Learning to coordinate actions in multi-agent systems. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 311–316, August 1993.

[53] Gerhard Weiß, editor. Distributed Artificial Intelligence Meets Machine Learning: Learning in Multi-Agent Environments. Lecture Notes in Artificial Intelligence. Springer Verlag, Berlin, 1997.

[54] Gerhard Weiß and Sandip Sen, editors. Adaptation and Learning in Multi-Agent Systems. Lecture Notes in Artificial Intelligence. Springer Verlag, Berlin, 1996.

[55] Eric Werner. Cooperating agents: A unified theory of communication and social structure. In Les Gasser and Michael N. Huhns, editors, Distributed Artificial Intelligence, volume 2 of Research Notes in Artificial Intelligence, pages 3–36. Pitman, 1989.

[56] Holly Yanco and Lynn Andrea Stein. An adaptive communication protocol for cooperating mobile robots. In Stewart Wilson, editor, From Animals to Animats: Proc. of the Second International Conference on the Simulation of Adaptive Behavior, pages 478–485, Cambridge, MA, 1993. MIT Press.

[57] M. Yokoo, E. Durfee, T. Ishida, and K. Kuwabara. Distributed constraint satisfaction for formalizing distributed problem solving. In Proceedings of the Twelfth International Conference on Distributed Computing Systems, pages 614–621, 1992.

[58] Gilad Zlotkin and Jeffrey S. Rosenschein. Negotiation and conflict resolution in non-cooperative domains. In Proceedings of the National Conference on Artificial Intelligence, pages 100–105, July 1990.

Table 1: A categorization of multiagent domains for identifying learning opportunities. Rows: communicating agents, non-communicating agents. Columns: cooperative relationships (subdivided into individualistic learning and shared learning) and non-cooperative relationships.

Table 2: Trials for agent A to reach its goal acting against agent B trying to reach its own goal (maximum forces applied by agents A and B are F_A and F_B respectively).

  F_A   F_B   Trials taken to reach goal of agent A
  10     1      81
  10     2     111
  10     3     115
  10     4     197
  10     5     268