
Individual learning of coordination knowledge


Sandip Sen & Mahendra Sekaran

Department of Mathematical & Computer Sciences

University of Tulsa, 600 South College Avenue, Tulsa, OK 74104-31. Phone: +918-631-2985

e-mail: sandip@kolkata.mcs.utulsa.edu

Abstract

Social agents, both human and computational, inhabiting a world containing multiple active agents, need to coordinate their activities. This is because agents share resources, and without proper coordination or "rules of the road", everybody will be interfering with the plans of others. As such, we need coordination schemes that allow agents to effectively achieve local goals without adversely affecting the problem-solving capabilities of other agents. Researchers in the field of Distributed Artificial Intelligence (DAI) have developed a variety of coordination schemes under different assumptions about agent capabilities and relationships. Whereas some of this research has been motivated by human cognitive biases, other work has approached it as an engineering problem of designing the most effective coordination architecture or protocol. We evaluate individual and concurrent learning by multiple, autonomous agents as a means for acquiring coordination knowledge. We show that a uniform reinforcement learning algorithm suffices as a coordination mechanism in both cooperative and adversarial situations. Using a number of multiagent learning scenarios with both tight and loose coupling between agents and with immediate as well as delayed feedback, we demonstrate that agents can consistently develop effective policies to coordinate their actions without explicit information sharing. We demonstrate the viability of using both the Q-learning algorithm and genetic algorithm based classifier systems with different payoff schemes, namely the bucket brigade algorithm (BBA) and the profit sharing plan (PSP), for developing agent coordination on two different multi-agent domains. In addition, we show that a semi-random scheme for action selection is preferable to the more traditional fitness proportionate selection scheme used in classifier systems.


1 Introduction

One of the primary goals of researchers in the artificial intelligence (AI) community is to develop autonomous agents that are knowledgeable and cognizant enough to carry out at least routine activities performed by humans. The problem, however, is an extremely difficult one, and decades of research on related areas have only served to highlight its complexity and magnitude. While researchers are developing interesting results in crucial areas like knowledge representation, planning, learning, non-monotonic reasoning, cooperative problem solving, etc., it is equally important to analyze and experiment with results from one such subfield to benefit one or more of the other ones. Isolated and domain-specific developments in any one of the sub-fields of AI are not going to go a long way toward advancing the whole field. In this paper, we will demonstrate how noteworthy advances in one particular sub-field of AI can be effectively used in another subfield to provide agents with knowledge required to solve a difficult problem. We will be applying recent research developments from the reinforcement learning literature to the coordination problem in multiagent systems. In a reinforcement learning scenario, an agent chooses actions based on its perceptions, receives scalar feedback based on past actions, and is expected to develop a mapping from perceptions to actions that will maximize feedback. Multiagent systems are a particular type of distributed AI system [2, 29], in which autonomous intelligent agents inhabit a world with no global control or globally consistent knowledge. In contrast to cooperative problem solvers [13], agents in multiagent systems are not pre-disposed to help each other out with all the resources and capabilities that they possess. These agents may still need to coordinate their activities with others to achieve their own local goals. They could benefit from receiving information about what others are doing or plan to do, and from sending them information to influence what they do.

Whereas previous research on developing agent coordination mechanisms focused on off-line design of agent organizations, behavioral rules, negotiation protocols, etc., it was recognized that agents operating in open, dynamic environments must be able to flexibly adapt to changing demands and opportunities [29, 44, 54]. In particular, individual agents are forced to engage with other agents which have varying goals, abilities, composition, and lifespan. To effectively utilize opportunities presented and avoid pitfalls, agents need to learn about other agents and adapt local behavior based on group composition and dynamics. For machine learning researchers, multiagent learning problems are challenging, because they violate the stationary environment assumptions used by most machine learning systems. As multiple agents learn simultaneously, the feedback received by the same agent for the same action varies considerably. The stationary environment assumption made by most current machine learning systems does not hold in such rapidly changing environments.

In this paper, we discuss how reinforcement learning techniques for developing policies to optimize environmental feedback, through a mapping between perceptions and actions, can be used by multiple agents to learn coordination strategies without having to rely on shared information. These agents work in a common environment, but are unaware of the capabilities of other agents and may or may not be cognizant of goals to achieve. We show that through repeated problem-solving experience, such agents can develop policies to maximize environmental feedback that can be interpreted as goal achievement from the viewpoint of an external observer. More interestingly, we demonstrate that in some domains these agents develop policies that complement each other.

To evaluate the applicability of reinforcement learning schemes for enabling multiagent coordination, we chose to investigate a number of multiagent domains with varying environmental characteristics. In particular, we designed environments in which the following characteristics were varied:

Agent coupling: In some domains the actions of one agent strongly and frequently affect the plans of other agents (tightly coupled system), whereas in other domains the actions of one agent only weakly and infrequently affect the plans of other agents (loosely coupled system).

Agent relationships: Agents in a multiagent system can have different kinds of mutual relationships:

• they may act in a group to solve a common problem (cooperative agents),

• they may not have any preset dispositions towards each other but interact because they use common resources (indifferent agents),


• they may have opposing interests (adversarial agents).

For discussions in this paper, we have grouped the latter two classes of domains as non-cooperative domains.

Feedback timing: In some domains, the agents may have immediate knowledge of the effects of their actions, whereas in others they may get the feedback for their actions only after a period of delay.

Optimal behavior combinations: How many behavior combinations of participating agents will optimally solve the task at hand? This value varies from one to infinity for different domains.

In addition, in this paper we concentrate exclusively on domains in which agents have little or no pre-existing domain expertise, and have no information about the capabilities and goals of other agents. These assumptions make the coordination problem particularly hard. This is particularly evident from the fact that almost all currently used coordination mechanisms rely heavily on domain knowledge and shared information between agents. The goal of our work is not to replace the previously developed coordination schemes, but to complement them by providing new coordination techniques for domains where the currently available schemes are not effective. In particular, domains where agents know little about each other provide a difficult challenge for currently used coordination schemes. Our contention is that problem solving performance, or the feedback received from the environment, can be effectively used by reinforcement-based learning agents to circumvent the lack of common knowledge.

To verify our intuitions more rigorously, we decided to investigate two well-known reinforcement learning schemes: the Q-learning algorithm developed by Watkins [51], and the classifier systems method developed by Holland. Whereas the Q-learning algorithm was inspired by the theory of dynamic programming for optimization, classifier systems arose from an interesting blend of rule-based reasoning and computational mechanisms inspired by natural genetics [4, 25]. Q-learning and classifier systems have been proposed as two general reinforcement learning frameworks for achieving agent coordination in a multiagent system. This research opens up a new dimension of constructing coordination strategies for multiagent systems.

At this point, we would like to emphasize the difference between this work and other recent publications in the nascent area of multiagent learning (MAL) research. Previous proposals for using learning techniques to coordinate multiple agents have mostly relied on using prior knowledge [5], or on cooperative domains with unrestricted information sharing [47]. A significant percentage of this research has concentrated on cooperative learning between communicating agents where agents share their knowledge or experiences [16, 38, 37, 50]. Some researchers have used communication to help agent groups jointly decide on their courses of action [52]. The isolated instances of research in MAL that do not use explicit communication have concentrated on competitive, rather than cooperative, domains [7, 30, 40]. We strongly believe that learning coordination policies without communication has a significant role to play in competitive as well as cooperative domains. There is little argument over the fact that communication is an invaluable tool to be used by agent groups to coordinate their activities. At times communication is the most effective, and perhaps even the only, mechanism to guarantee coordinated behavior. Though communication is often helpful and indispensable as an aid to group activity, it does not guarantee coordinated behavior [22], is time-consuming, and can detract from other problem solving activity if not carefully controlled [12]. Also, agents overly reliant on communication will be severely affected if the quality of communication is compromised (broken communication channels, incorrect or deliberately misleading information, etc.). At other times communication can be risky or even fatal (as in some combat situations where the adversary can intercept communicated messages).

We believe that even when communication is feasible and safe, it may be prudent to use it only as necessary. For example, if an agent is able to predict the behavior of other agents from past observations, it can possibly adjust its own plans to use a shared resource without the need for explicitly arriving at a contract using communication that consumes valuable time (to have performance guarantees, though, contracts arrived at using communication are possibly the most effective procedure; most of the problem solving activities by agents, however, do not involve hard guarantees or deadlines). We should strive to realize the maximum potential of the system without using communication. Once that has been accomplished, communication can be added to augment the performance of the system to the desired efficiency level. Such a design philosophy will lead to systems where agents do not flood communication channels with unwarranted information and agents do not have to sift through a maze of useless data to locate necessary and time-critical information. With this goal in mind we have investigated the usefulness of acquiring coordination strategies without sharing information [43, 45]. We expand on this body of work in this paper, and explore the advantages and limitations of learning without communication as a means to generating coordinated behavior in autonomous agents.

The rest of the paper is organized as follows: Section 2 presents highlights of the previous approaches to developing coordination strategies for multiple, autonomous agents; Section 3 provides a categorization of multiagent systems to identify different learning scenarios and presents a sampling of prior multiagent learning research; Section 4 reviews the reinforcement learning techniques that we have utilized in this paper; Sections 5, 6, and 7 present results of experiments with Q-learning and genetic algorithm based reinforcement learning systems on block pushing, robot navigation, and resource sharing domains respectively; Section 8 summarizes the lessons learnt from this research and outlines future research directions.

2 Coordination of multiple agents

In a world inhabited by multiple agents, coordination is a key to group as well as individual success. We need to coordinate our actions whenever we are sharing goals, resources, or expertise. By coordination we mean choosing one's own action based on the expectation of others' actions. Coordination is essential for cooperative, indifferent, and even adversarial agents. As computer scientists, we are interested in developing computational mechanisms that are domain independent and robust in the presence of noisy, incomplete, and out-of-date information. Research in the area of multiagent systems has produced techniques for allowing multiple agents, which share common resources, to coordinate their actions so that individually rational actions do not adversely affect overall system efficiency [2, 13, 17, 26]. Coordination of problem solvers, irrespective of whether they are selfish or cooperative, is a key issue in the design of an effective multiagent system. The search for domain-independent coordination mechanisms has yielded some very different, yet effective, classes of coordination schemes. The most influential classes of coordination mechanisms developed to date are the following:

• protocols based on contracting [9, 48]

• distributed search formalisms [14, 57]

• organizational and social laws [15, 32, 33]

• multi-agent planning [11, 39]

• decision and game theoretic negotiations [18, 19, 58]

• linguistic approaches [8, 55]

Whereas some of these works use architectures and protocols designed off-line [15, 46, 48] as coordination structures, others acquire coordination knowledge on-line [11, 19]. Almost all of the coordination schemes developed to date assume explicit or implicit sharing of information. In the explicit form of information sharing, agents communicate partial results [11], speech acts [8], resource availabilities [48], etc. to other agents to facilitate the process of coordination. In the implicit form of information sharing, agents use knowledge about the capabilities of other agents [15, 18, 58] to aid local decision-making. Though each of these approaches has its own benefits and weaknesses, we believe that the less an agent depends on shared information, and the more flexible it is to the on-line arrival of problem-solving and coordination knowledge, the better it can adapt to changing environments. As flexibility and adaptability are key aspects of intelligent and autonomous behavior, we are interested in investigating mechanisms by which agents can acquire and use coordination knowledge through interactions with their environment (which includes other agents) without having to rely on shared information.

One can also argue that pre-fabricated coordination strategies can quickly become inadequate if the system designer's model of the world is incomplete/incorrect or if the environment in which the agents are situated can change dynamically. Coordination strategies that incorporate learning and adaptation components will be more robust and effective in these more realistic scenarios. Thus agents will be able to take advantage of new opportunities and deal with new contingencies presented by the environment which cannot be foreseen at design time.

3 Learning in multiagent systems

To highlight learning opportunities inherent in most multiagent systems, we develop here a categorization of multiagent problems that can be used to characterize the nature of learning mechanisms that should be used for these problems. The categorization presented in Table 1 is not meant to be the only or even the most definitive taxonomy of the field. The dimensions we consider for our taxonomy are the following: agent relationships (cooperative vs. non-cooperative) and use of communication (agents explicitly communicating or not). Within cooperative domains, we further categorize domains based on decision-making authorities (individual vs. shared). These dimensions are not necessarily completely orthogonal, e.g., shared decision making may not be feasible without the use of explicit communication. We use the term cooperative relationship to refer to situations where multiple agents are working towards a common goal. Though there may be local goals, these are, in fact, subgoals of the common goal (even if different subgoals interfere). A number of different domains fall under the non-cooperative spectrum. These include competitive scenarios (one agent's gain is another agent's loss; need not be zero-sum), as well as scenarios where agents coordinate only to avoid conflicts.

As mentioned before, shared learning by a group of cooperative agents has received the most attention in the MAL literature [16, 37, 52]. Others have looked at each agent learning individually but using communication to share pertinent information [6, 38, 50]. Learning to cooperate without explicit information sharing can be based primarily on environmental feedback [45] or on observation of other agents in action [23]. Researchers have investigated two approaches to learning without explicit communication in non-cooperative domains: (a) treating opponents as part of the environment without explicit modeling [43], (b) learning explicit competitor models [30, 40]. Little work has been done to date in learning to compete with explicit communication. Possible scenarios to investigate in this area include learning the strengths and weaknesses of the competitor from intercepted communication, or pro-actively probing the opponent to gather more information about its preferences.

Even previous work on using reinforcement learning for coordinating multiple agents [50, 52] has relied on explicit information sharing. We, however, concentrate on systems where agents share no problem-solving knowledge. We show that although each agent is independently using reinforcement learning techniques to optimize its own environmental reward, global coordination between multiple agents can emerge without explicit or implicit information sharing. These agents can therefore act independently and autonomously, without being affected by communication delays (due to other agents being busy) or the failure of a key agent (who controls information exchange or who has more information), and do not have to worry about the reliability of the information received (Do I believe the information received? Is the communicating agent an accomplice or an adversary?). The resultant systems are, therefore, robust and fault-tolerant.

Schaerf et al. have studied the use of reinforcement learning based agents for load balancing in distributed systems [41]. In this work, a comprehensive history of past performance is used to make informed decisions about the choice of resources to submit jobs to. Parker [36] has studied the emergence of coordination in simulated robot groups by using simple adaptive schemes that alter robot motivations. A major difference from our work is that the simulated robots in this work build explicit models of other robots. Other researchers have used reinforcement learning for developing effective groups of physical robots [34, 56]. Mataric [34] concentrates on using intermediate feedback for subgoal fulfillment to accelerate learning. In contrast with our work, the evaluation of learning effectiveness under varying degrees of interaction between the agents is not the focus of this work. The work by Yanco and Stein [56] involves using reinforcement learning techniques to evolve effective communication protocols between cooperating robots. This is complementary to our approach of learning to coordinate in the absence of communication.

Though our agents can be viewed as learning automata, the scalar feedback received from the environment prevents the direct use of results from the field of learning automata [35]. Recent work on theoretical issues of multiagent reinforcement learning promises to produce new frameworks for investigating problems such as those addressed in this paper [21, 42, 53].

4 Reinforcement learning

In reinforcement learning problems [1, 27], reactive and adaptive agents are given a description of the current state and have to choose the next action from a set of possible actions so as to maximize a scalar reinforcement or feedback received after each action. The learner's environment can be modeled by a discrete time, finite state, Markov decision process that can be represented by a 4-tuple $\langle S, A, P, r \rangle$, where $P: S \times S \times A \rightarrow [0,1]$ gives the probability of moving from state $s_1$ to $s_2$ on performing action $a$, and $r: S \times A \rightarrow \mathbb{R}$ is a scalar reward function. Each agent maintains a policy, $\pi$, that maps the current state into the desirable action(s) to be performed in that state. The expected value of a discounted sum of future rewards of a policy $\pi$ at a state $s$ is given by

$$V_\gamma^\pi(s) \stackrel{\mathrm{def}}{=} E\left\{\sum_{t=0}^{\infty} \gamma^t r_{s,t}^\pi\right\},$$

where $r_{s,t}^\pi$ is the random variable corresponding to the reward received by the learning agent $t$ time steps after it starts using the policy $\pi$ in state $s$, and $\gamma$ is a discount rate ($0 \le \gamma < 1$).

Various reinforcement learning strategies have been proposed by which agents can develop a policy to maximize rewards accumulated over time. For evaluating the classifier system paradigm for multiagent reinforcement learning, we compare it with the Q-learning [51] algorithm, which is designed to find a policy $\pi^*$ that maximizes $V_\gamma^\pi(s)$ for all states $s \in S$. The decision policy is represented by a function, $Q: S \times A \rightarrow \mathbb{R}$, which estimates long-term discounted rewards for each state-action pair. The Q values are defined as $Q_\gamma^\pi(s,a) = V_\gamma^{a;\pi}(s)$, where $a;\pi$ denotes the event sequence of choosing action $a$ at the current state, followed by choosing actions based on policy $\pi$. The action, $a$, to perform in a state $s$ is chosen such that it is expected to maximize the reward,

$$V_\gamma^\pi(s) = \max_{a \in A} Q_\gamma^\pi(s,a) \quad \text{for all } s \in S.$$

If an action $a$ in state $s$ produces a reinforcement of $R$ and a transition to state $s'$, then the corresponding Q value is modified as follows:

$$Q(s,a) \leftarrow (1 - \beta)\,Q(s,a) + \beta\,\bigl(R + \gamma \max_{a' \in A} Q(s',a')\bigr). \qquad (1)$$


The above update rule is similar to Holland's bucket-brigade [25] algorithm in classifier systems and Sutton's temporal-difference [49] learning scheme. The similarities of Q-learning and classifier systems have been analyzed in [10].
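To make the update concrete, here is a minimal Python sketch of the tabular update in Equation 1; the dictionary-based Q-table, the action set, and the parameter values are illustrative assumptions rather than the implementation used in our experiments.

```python
# Minimal sketch of the tabular Q-learning update of Equation 1.
# The Q-table, action set, and parameter values are illustrative assumptions.
from collections import defaultdict

BETA = 0.2    # learning rate (beta in Equation 1)
GAMMA = 0.9   # discount rate

ACTIONS = range(4)          # hypothetical discrete action identifiers
Q = defaultdict(float)      # Q[(state, action)] -> estimated long-term reward

def q_update(state, action, reward, next_state):
    """Q(s,a) <- (1-beta)*Q(s,a) + beta*(R + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] = ((1.0 - BETA) * Q[(state, action)]
                          + BETA * (reward + GAMMA * best_next))

# One hypothetical experience tuple: state 3, action 1, reward 50, same state.
q_update(state=3, action=1, reward=50.0, next_state=3)
```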

Classifier systems are rule based systems that learn by adjusting rule strengths from feedback and by discovering better rules using genetic algorithms. In this paper, we will use simplified classifier systems where all possible message-action pairs are explicitly stored and classifiers have one condition and one action. These assumptions are similar to those made by Dorigo and Bersini [10]; we also use their notation to describe a classifier $i$ by $(c_i, a_i)$, where $c_i$ and $a_i$ are respectively the condition and action parts of the classifier. $S_t(c_i, a_i)$ gives the strength of classifier $i$ at time step $t$. We first describe how the classifier system performs and then discuss two different feedback distribution schemes, namely the Bucket Brigade algorithm (BBA) and the Profit Sharing Plan (PSP).

All classifiers are initialized to some default strength. At each time step of problem solving, an input message is received from the environment and matched with the classifier rules to form a match set, M. One of these classifiers is chosen to fire and, based on its action, a feedback may be received from the environment. Then the strengths of the classifier rules are adjusted. This cycle is repeated for a given number of time steps. A series of cycles constitutes a trial of the classifier system. In the BBA scheme, when a classifier is chosen to fire, its strength is increased by the environmental feedback. But before that, a fraction $\alpha$ of its strength is removed and added to the strength of the classifier that fired in the last time cycle. So, if classifier $i$ fires at time step $t$, produces external feedback of $R$, and classifier $j$ fires at the next time step, the following equation gives the strength update of classifier $i$:

$$S_{t+1}(c_i, a_i) = (1 - \alpha) \cdot S_t(c_i, a_i) + \alpha \cdot \bigl(R + S_{t+1}(c_j, a_j)\bigr).$$
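The following minimal Python sketch illustrates this strength update; the classifier names and the value of α are illustrative assumptions.

```python
# Sketch of the BBA strength update: the classifier that fired at step t is
# updated using the reward it produced and the strength of the classifier
# that fires at step t+1.  Names and values are illustrative.
ALPHA = 0.1   # fraction of strength passed between successive classifiers

strength = {"rule_i": 10.0, "rule_j": 10.0}   # hypothetical classifier strengths

def bba_update(fired_at_t, fired_at_t_plus_1, reward):
    """S_{t+1}(c_i,a_i) = (1-alpha)*S_t(c_i,a_i) + alpha*(R + S_{t+1}(c_j,a_j))."""
    strength[fired_at_t] = ((1.0 - ALPHA) * strength[fired_at_t]
                            + ALPHA * (reward + strength[fired_at_t_plus_1]))

bba_update("rule_i", "rule_j", reward=1.0)
```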

We now describe the profit sharing plan (PSP) strength-updating scheme [20] used in classifier systems. In this method, problem solving is divided into episodes in between receipts of external reward. A rule is said to be active in an episode if it fired in at least one of the cycles in that episode. At the end of episode $e$, the strength of each active rule $i$ in that episode is updated as follows:

$$S_{e+1}(c_i, a_i) = S_e(c_i, a_i) + \alpha \cdot \bigl(R_e - S_e(c_i, a_i)\bigr),$$

where $R_e$ is the external reward received at the end of the episode. We have experimented with two methods of choosing a classifier to fire given the match set. In the more traditional method, a classifier $i \in M$ at time $t$ is chosen with a probability given by

$$\frac{S_t(c_i, a_i)}{\sum_{d \in M} S_t(c_d, a_d)}.$$

We call this fitness proportionate PSP or PSP(FP). In the other method of action choice, the classifier with the highest fitness in M is chosen 90% of the time, and a random classifier from M is chosen in the remaining 10% of cases (Mahadevan uses such an action choosing mechanism for Q-learning in [31]). We call this semi-random PSP or PSP(SR).
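The sketch below, in Python, illustrates the PSP episode update together with the fitness proportionate and semi-random choice rules; the match set, rule names, and strengths are illustrative assumptions.

```python
# Sketch of the PSP strength update and of the two action-choice schemes,
# PSP(FP) and PSP(SR).  The rules and numbers below are illustrative.
import random

ALPHA = 0.1

def psp_update(strength, active_rules, episode_reward):
    """S_{e+1}(c,a) = S_e(c,a) + alpha*(R_e - S_e(c,a)) for every rule that
    fired in at least one cycle of the episode."""
    for rule in active_rules:
        strength[rule] += ALPHA * (episode_reward - strength[rule])

def choose_fp(strength, match_set):
    """Fitness proportionate choice: rule i is picked with probability
    S_t(c_i,a_i) / sum over the match set."""
    total = sum(strength[r] for r in match_set)
    weights = [strength[r] / total for r in match_set]
    return random.choices(match_set, weights=weights)[0]

def choose_sr(strength, match_set, greedy_prob=0.9):
    """Semi-random choice: the highest-strength rule 90% of the time,
    a random member of the match set the remaining 10%."""
    if random.random() < greedy_prob:
        return max(match_set, key=lambda r: strength[r])
    return random.choice(match_set)

# Tiny usage example with hypothetical rules.
S = {"r1": 5.0, "r2": 2.0, "r3": 1.0}
fired = choose_sr(S, ["r1", "r2", "r3"])
psp_update(S, active_rules=["r1", "r3"], episode_reward=10.0)
```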

For the Q-learning algorithm, we stop a run when the algebraic difference of the policies at the end of neighboring trials is below a threshold for 10 consecutive trials. With this convergence criterion, however, the classifier systems ran too long for us to collect reasonable data. Instead, every 10th trial, we ran the classifier system (both with BBA and PSP) with a deterministic action choice over the entire trial. We stopped a run of the classifier system if the differences of the total environmental feedback received by the system on neighboring deterministic trials were below a small threshold for 10 consecutive deterministic trials. We perform an in-depth study of the effectiveness of using Q-learning by concurrent learners to develop effective coordination in a block pushing domain. We also compare the performance of classifier systems and Q-learning on a resource sharing and a robot navigation domain. The characteristics of these three domains are as follows:

Block pushing: Concurrent learning by two agents with immediate environmental feedback; strongly coupled system; multiple optimal behaviors.

Resource sharing: One agent learning to adapt to another agent's behavior with delayed environmental feedback; strongly coupled system; single optimal behavior.

Robot navigation: Concurrent learning by multiple agents with immediate environmental feedback; variable coupling; multiple optimal behaviors.

5 Block pushing problem

In this problem, two agents, $a_1$ and $a_2$, are independently assigned to move a block, $b$, from a starting position, $S$, to some goal position, $G$, following a path, $P$, in Euclidean space.


The agents are not aware of the capabilities of each other and yet must choose their actions individually such that the joint task is completed. The agents have no knowledge of the system physics, but can perceive their current distance from the desired path to take to the goal state. Their actions are restricted as follows: agent $i$ exerts a force $\vec{F}_i$, where $0 \le |\vec{F}_i| \le F_{max}$, on the object at an angle $\theta_i$, where $0 \le \theta_i \le \pi$. An agent pushing with force $\vec{F}$ at angle $\theta$ will offset the block in the x direction by $|\vec{F}|\cos(\theta)$ units and in the y direction by $|\vec{F}|\sin(\theta)$ units. The net resultant force on the block is found by vector addition of individual forces: $\vec{F} = \vec{F}_1 + \vec{F}_2$. We calculate the new position of the block by assuming unit displacement per unit force along the direction of the resultant force. The new block location is used to provide feedback to the agent. If $(x, y)$ is the new block location, $P_x(y)$ is the x-coordinate of the path $P$ for the same y coordinate, and $\Delta x = |x - P_x(y)|$ is the distance along the x dimension between the block and the desired path, then $K \cdot a^{-\Delta x}$ is the feedback given to each agent for their last action (we have used K = 50 and a = 1.15).
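As an illustration, the following Python sketch computes the joint effect of the two pushes and the resulting feedback under the definitions above; the straight-line desired path x = 40 and the helper names are assumptions made for the example.

```python
# Sketch of one time step of the block pushing environment: the two pushes are
# added as vectors, the block moves one unit of displacement per unit of
# resultant force, and each agent receives K * a**(-delta_x) as feedback.
# The vertical desired path x = 40 is an illustrative choice.
import math

K, A = 50.0, 1.15        # feedback constants from the text

def path_x(y):
    """x-coordinate of the desired path for a given y (here the line x = 40)."""
    return 40.0

def step(block, pushes):
    """block: (x, y) position; pushes: list of (force magnitude, angle) pairs."""
    fx = sum(f * math.cos(theta) for f, theta in pushes)
    fy = sum(f * math.sin(theta) for f, theta in pushes)
    new_block = (block[0] + fx, block[1] + fy)   # unit displacement per unit force
    delta_x = abs(new_block[0] - path_x(new_block[1]))
    feedback = K * A ** (-delta_x)               # same scalar given to each agent
    return new_block, feedback

# Example: both agents push with magnitude 5, one straight up, one slightly off.
print(step((40.0, 0.0), [(5.0, math.pi / 2), (5.0, 1.4)]))
```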

Figure 1: The block pushing problem. (The field of play contains the goal position, the ideal path, a potential path, and the block; at each time step each agent pushes with some force at a certain angle, and the combined effect is the vector sum of the components $F_i\cos\theta_i$ and $F_i\sin\theta_i$.)

The field of play is restricted to a rectangle with endpoints [0,0] and [100,100]. A trial consists of the agents starting from the initial position $S$ and applying forces until either the goal position $G$ is reached or the block leaves the field of play (see Figure 1). We abort a trial if a pre-set number of agent actions fail to take the block to the goal. This prevents agents from learning policies where they apply no force when the block is resting on the optimal path to the goal but not on the goal itself. The agents are required to learn, through repeated trials, to push the block along the path $P$ to the goal. Although we have used only two agents in our experiments, the solution methodology can be applied without modification to problems with an arbitrary number of agents.

To implement the policy $\pi$ we chose to use an internal discrete representation for the external continuous space. The force, angle, and space dimensions were all uniformly discretized. When a particular discrete force or angle is selected by the agent, the middle value of the associated continuous range is used as the actual force or angle that is applied on the block.
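A minimal sketch of this discretization in Python is shown below; the maximum force and bin counts are illustrative values, and the decoding simply returns the midpoint of each discrete interval.

```python
# Sketch of the uniform discretization: a discrete index is mapped back to the
# midpoint of its continuous interval before being applied.  F_MAX and the bin
# counts are illustrative constants.
import math

F_MAX = 10.0                 # illustrative maximum force magnitude
N_FORCE, N_ANGLE = 10, 11    # illustrative bin counts

def midpoint(index, n_bins, low, high):
    """Middle value of the index-th of n_bins equal intervals on [low, high]."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width

def decode_action(force_idx, angle_idx):
    """Map a discrete (force, angle) choice back to continuous values."""
    return (midpoint(force_idx, N_FORCE, 0.0, F_MAX),
            midpoint(angle_idx, N_ANGLE, 0.0, math.pi))

print(decode_action(3, 5))   # discrete action (3, 5) -> continuous push
```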

The block-pushing problem is a tightly-coupled system, where, at each time step, the outcome of each agent action is dependent on the actions of the other agents present in the system. Using the block-pushing problem, we have conducted experiments on both cooperative (agents push the block towards a common goal) and non-cooperative (agents vie with each other to push the block to individual goals) situations.

An experimental run consists of a number of trials during which the system parameters ($\beta$, $\gamma$, and $K$) as well as the learning problem (granularity, agent choices) are held constant. The stopping criterion for a run is either that the agents succeed in pushing the block to the goal in $N$ consecutive trials (we have used N = 10) or that a maximum number of trials (we have used 1500) have been executed. The latter cases are reported as non-converged runs. The standard procedure in the Q-learning literature of initializing Q values to zero is suitable for most tasks where non-zero feedback is infrequent and hence there is enough opportunity to explore all the actions. Because a non-zero feedback is received after every action in our problem, we found that agents would follow, for an entire run, the path they take in the first trial. This is because they start each trial at the same state, and the only non-zero Q-value for that state is for the action that was chosen in the first trial. Similar reasoning holds for all the other actions chosen in the trial. A possible fix is to choose a fraction of the actions by random choice, or to use a probability distribution over the Q-values to choose actions stochastically. These options, however, lead to very slow convergence. Instead, we chose to initialize the Q-values to a large positive number. This enforced an exploration of the available action options while allowing for convergence after a reasonable number of trials. The primary metric for performance evaluation is the average number of trials taken by the system to converge. Information about acquisition of coordination knowledge is obtained by plotting, for different trials, the average distance of the actual path followed from the desired path. Data for all plots and tables in this paper have been averaged over 100 runs.
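The following short Python sketch illustrates the optimistic initialization just described: every entry of the policy matrix starts at a large value S_I, so a greedy agent keeps visiting actions whose estimates have not yet been pulled down by actual feedback. The table sizes are illustrative.

```python
# Sketch of optimistic initialization: every Q entry starts at a large value
# S_I, so an untried action always looks at least as attractive as one whose
# estimate has already moved toward the true feedback.  Sizes are illustrative.
S_I = 100.0                      # large positive initial Q-value
N_STATES, N_ACTIONS = 20, 110    # e.g. 20 x-intervals, 10 forces x 11 angles

Q = [[S_I for _ in range(N_ACTIONS)] for _ in range(N_STATES)]

def greedy_action(state):
    """Deterministic choice of the action with the largest Q-value."""
    row = Q[state]
    return max(range(N_ACTIONS), key=lambda a: row[a])
```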

5.1 Experiments in Cooperative Domain

Extensive experimentation was performed on the block-pushing domain. The two agents were assigned the task of pushing the block to the same goal location. The following sections provide further details on the experimental results.

5.1.1 Choice of system parameters

If the agents learn to push the block along the desired path, the reward that they will receive for the best action choices at each step is equal to the maximum possible value of $K$. The steady-state values for the Q-values ($Q_{ss}$) corresponding to optimal action choices can be calculated from the equation:

$$Q_{ss} = (1 - \beta)Q_{ss} + \beta(K + \gamma Q_{ss}).$$

Solving for $Q_{ss}$ in this equation yields a value of $\frac{K}{1-\gamma}$. In order for the agents to explore all actions after the Q-values are initialized at $S_I$, we require that any new Q value be less than $S_I$. From similar considerations as above we can show that this will be the case if $S_I \ge \frac{K}{1-\gamma}$. In our experiments we fix the maximum reward $K$ at 50, $S_I$ at 100, and $\gamma$ at 0.9. Unless otherwise mentioned, we have used $\beta$ = 0.2, and allowed each agent to vary both the magnitude and angle of the force they apply on the block.
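The fixed point can also be checked numerically by iterating the update with the maximal reward K at every step; the snippet below uses K = 50 with, for illustration, γ = 0.1 (the value used in Figure 2) and β = 0.2, for which K/(1 − γ) ≈ 55.6.

```python
# Numerical check of the steady state Q_ss = K / (1 - gamma): repeatedly apply
# Q <- (1-beta)*Q + beta*(K + gamma*Q) with the maximal reward K at every step.
# gamma = 0.1 (as in Figure 2) and beta = 0.2 are illustrative choices.
K, BETA, GAMMA = 50.0, 0.2, 0.1

q = 100.0                          # optimistic initial value S_I
for _ in range(200):
    # optimal action at the start state: reward K, next state = current state
    q = (1.0 - BETA) * q + BETA * (K + GAMMA * q)

print(q, K / (1.0 - GAMMA))        # both approximately 55.56
```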

The first problem we used had starting and goal positions at (40, 0) and (40, 100) respectively. During our initial experiments we found that with an even number of discrete intervals chosen for the angle dimension, an agent cannot push along any line parallel to the y-axis. Hence we used an odd number, 11, of discrete intervals for the angle dimension. The number of discrete intervals for the force dimension is chosen to be 10.

On varying the number of discretization intervals for the state space between 10, 15, and 20, we found the corresponding average number of trials to convergence to be 784, 793, and 115 respectively, with 82%, 83%, and 100% of the respective runs converging within the specified limit of 1200 trials. This suggests that when the state representation gets too coarse, the agents find it very difficult to learn the optimal policy. This is because the smaller the number of intervals (the coarser the granularity), the more the variation in reward an agent gets after taking the same action at the same state (each discrete state maps into a larger range of continuous space, and hence the agents start from and end up in physically different locations, the latter resulting in different rewards).

5.1.2 Varying learning rate

We experimented by varying the learning rate, $\beta$. The resultant average distance of the actual path from the desired path over the course of a run is plotted in Figure 2 for $\beta$ values 0.4, 0.6, and 0.8.

In the case of the straight path between (40, 0) and (40, 100), the optimal sequence of actions always puts the block on the same x-position. Since the x-dimension is the only dimension used to represent state, the agents update the same Q-value in their policy matrix in successive steps. We now calculate the number of updates required for the Q-value corresponding to this optimal action before it reaches the steady state value. Note that for the system to converge, it is necessary that only the Q-value for the optimal action at x = 40 needs to arrive at its steady state value. This is because the block is initially placed at x = 40, and so long as the agents choose their optimal action, it never reaches any other x position. So, the number of updates to reach steady state for the Q-value associated with the optimal action at x = 40 should be proportional to the number of trials to convergence for a given run. In the following, let $S_t$ be the Q-value after $t$ updates and $S_I$ be the initial Q-value. Using Equation 1 and the fact that for the optimal action at the starting position the reinforcement received is $K$ and the next state is the same as the current state, we can write

$$S_{t+1} = (1 - \beta)S_t + \beta(K + \gamma S_t) = (1 - \beta(1-\gamma))S_t + \beta K = A S_t + C, \qquad (2)$$

where $A$ and $C$ are constants defined to be equal to $1 - \beta(1-\gamma)$ and $\beta K$ respectively. Equation 2 is a difference equation which can be solved using $S_0 = S_I$ to obtain

$$S_{t+1} = A^{t+1} S_I + \frac{C(1 - A^{t+1})}{1 - A}.$$

If we define convergence by the criterion that $|S_{t+1} - S_t| < \epsilon$, where $\epsilon$ is an arbitrarily small positive number, then the number of updates $t$ required for convergence can be calculated to be the following:

$$t \ge \frac{\log(\epsilon) - \log(S_I(1 - A) - C)}{\log(A)} = \frac{\log(\epsilon) - \log(\beta) - \log(S_I(1-\gamma) - K)}{\log(1 - \beta(1-\gamma))}. \qquad (3)$$

If we keep $\gamma$ and $S_I$ constant, the above expression can be shown to be a decreasing function of $\beta$. This is corroborated by our experiments with varying $\beta$ while holding $\gamma$ = 0.1 (see Figure 2). As $\beta$ increases, the agents take fewer trials to converge to the optimal set of actions required to follow the desired path. The other plot in Figure 2 presents a comparison of the theoretical and experimental convergence trends. The first curve in the plot represents the function corresponding to the number of updates required to reach the steady state value (with $\epsilon$ = 0). The second curve represents the average number of trials required for a run to converge, scaled down by a constant factor of 0.06. The actual ratios between the number of trials to convergence and the values of the expression on the right hand side of inequality 3 for $\beta$ equal to 0.4, 0.6, and 0.8 are 24.1, 25.6, and 27.5 respectively (the average numbers of trials are 95.6, 71.7, and 53; values of the above-mentioned expression are 3.97, 2.8, and 1.93). Given the fact that results are averaged over 100 runs, we can claim that our theoretical analysis provides a good estimate of the relative time required for convergence as the learning rate is changed.
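For reference, the curve plotted in Figure 2, (1 − log(40β))/log(1 − 0.9β), can be evaluated directly; the short script below reproduces, up to rounding, the values 3.97, 2.8, and 1.93 quoted above.

```python
# Evaluate the convergence bound in the form plotted in Figure 2,
# (1 - log(40*beta)) / log(1 - 0.9*beta), i.e. the right hand side of
# inequality (3) with S_I = 100, K = 50, gamma = 0.1, so S_I*(1-gamma)-K = 40.
import math

def updates_to_converge(beta):
    return (1.0 - math.log(40.0 * beta)) / math.log(1.0 - 0.9 * beta)

for beta in (0.4, 0.6, 0.8):
    print(beta, round(updates_to_converge(beta), 2))   # approx. 3.97, 2.80, 1.94
```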

5.1.3 Varying agent capabilities

The next set of experiments was designed to demonstrate the effects of agent capabilities on the time required to converge on the optimal set of actions.


Figure 2: Variation of average distance of actual path from desired path over the course of a run, and the number of updates for convergence of the optimal Q-value, with changing $\beta$ ($\gamma$ = 0.1, $S_I$ = 100). (Left plot: average distance from optimal versus trials for $\beta$ = 0.4, 0.6, and 0.8. Right plot: variation of steps to convergence with $\beta$, comparing $(1 - \log(40\beta))/\log(1 - 0.9\beta)$ with 0.06 × trials to convergence.)

In the first of the current set of experiments, one of the agents was chosen to be a "dummy"; it did not exert any force at all. The other agent could only change the angle at which it could apply a constant force on the block. In the second experiment, the latter agent was allowed to vary both force and angle. In the third experiment, both agents were allowed to vary their force and angle. The average numbers of trials to convergence for the first, second, and third experiments are 55, 431, and 115 respectively. The most interesting result from these experiments is that two agents can learn to coordinate their actions and achieve the desired problem-solving behavior much faster than when a single agent is acting alone. If, however, we simplify the problem of the only active agent by restricting its choice to that of selecting the angle of force, it can learn to solve the problem quickly. If we fix the angle for the only active agent, and allow it to vary only the magnitude of the force, the problem becomes either trivial (if the chosen angle is identical to the angle of the desired path from the starting point) or unsolvable.

5.1.4 Transfer of learning

We designed a set of experiments to demonstrate how learning in one situation can help learning to perform well in a similar situation. The problem with starting and goal locations at (40, 0) and (40, 100) respectively is used as a reference problem. In addition, we used five other problems with the same starting location and with goal locations at (50, 100), (60, 100), (70, 100), (80, 100), and (90, 100) respectively. The corresponding desired paths were obtained by joining the starting and goal locations by straight lines. To demonstrate transfer of learning, we first stored each of the policy matrices that the two agents converged on for the original problem. Next, we ran a set of experiments using each of the new problems, with the agents starting off with their previously stored policy matrices.

We found that there is a linear increase in the number of trials to convergence as the goal in the new problem is placed farther apart from the goal in the initial problem. To determine if this increase was due purely to the distance between the two desired paths, or due to the difficulty in learning to follow certain paths, we ran experiments on the latter problems with agents starting with uniform policies. These experiments reveal that the greater the angle between the desired path and the y-axis, the longer the agents take to converge. Learning in the original problem, however, does help in solving these new problems, as evidenced by a ≈10% savings in the number of trials to convergence when agents started with the previously learned policy. Using a one-tailed t-test we found that all the differences were significant at the 99% confidence level. This result demonstrates the transfer of learned knowledge between similar problem-solving situations.

5.1.5 Complementary learning

In the last few sections we have shown the effects of system parameters and agent capabilities on the rate at which the agents converge on an optimal set of actions. In this section, we discuss what an "optimal set of actions" means to different agents.


Figure 3: Optimal action choices for a selection of states for each agent according to their policy matrices at the end of a successful run.

If the agents were cognizant of the actual constraints and goals of the problem, and knew elementary physics, they could independently calculate the desired action for each of the states that they may enter. The resulting policies would be identical. Our agents, however, have no planning capacity and their knowledge is encoded in the policy matrix. Figure 3 provides a snapshot, at the end of a successfully converged run, of what each agent believes to be its best action choice for each of the possible states in the world. The action choice for each agent at a state is represented by a straight line at the appropriate angle and scaled to represent the magnitude of force. We immediately notice that the individual policies are complementary rather than being identical. Given a state, the combination of the best actions will bring the block closer to the desired path. In some cases, one of the agents even pushes in the wrong direction while the other agent has to compensate with a larger force to bring the block closer to the desired path. These cases occur in states which are at the edge of the field of play, and have been visited only infrequently. Complementarity of the individual policies, however, is visible for all the states.

5.2 Experiments in Non-Cooperative Domains

We designed a set of experiments in which two agents are provided different feedback for the same block location. The agents are assigned to push the same block to two different goals along different paths.


Figure 4: Example trial when agents have conflicting goals.

Hence, the action of each of them adversely affects the goal achievement of the other agent. The maximum force (we refer to this as strength) of one agent was chosen as 10 units, while the maximum force of the other agent was varied. The other variable was the number of discrete action options available within the given force range. When there is considerable disparity between the strengths of the two agents, the stronger agent overpowers the weaker agent, and succeeds in pushing the block to its goal location (see Figure 4). The average number of trials to convergence (see Table 2), however, indicates that as the strength of the weaker agent is increased, the stronger agent finds it increasingly difficult to attain its goal. For these experiments, the strong and the weak agents had respectively 11 (between 0 and 10) and 2 (0 and its maximum strength) force options to choose from.

When the number of force discretizations for the weak agent is increased from 2 to 10, we find that the stronger agent finds it more difficult to push the block to its own goal. If we increase the maximum force of the weak agent closer to the maximum force of the stronger agent, we find that neither of them is able to push the block to its desired goal. At the end of a run, we find that the final converged path lies in between their individual desired paths. As the strength of the weaker agent increases, this path moves away from the desired path of the stronger agent, and ultimately lies midway between their individual desired paths when both agents are equally strong.


Intuitively, an agent should be able to 'overpower' another agent whenever it is stronger. Why is this not happening? The answer lies in the stochastic variability of feedback received for the same action at the same state, and the deterministic choice of the action corresponding to the maximal policy matrix entry. When an agent chooses an action at a state it can receive one of several different feedbacks depending on the action chosen by the other agent. We define the optimal action choice for a state x to be the action A_x that has the highest average feedback F_x. Suppose the first time the agent chooses this action at state x it receives a feedback F_1 that is lower than the feedback it receives for a non-optimal action A_y it chooses in the same state x. If these were the only two options available in state x, the agent would choose A_y over A_x the next time it is in state x, because the former action corresponds to a higher policy matrix entry. If the steady state value of the policy matrix entry for action A_y in state x is greater than the policy matrix entry for action A_x obtained after receiving feedback F_1, the latter action will never be tried again, and hence the agent will converge on a non-optimal policy. This is a quintessential example of the exploration-exploitation tradeoff [24]. Also, this is more likely to happen when the same action can generate a larger number of distinct feedbacks (the same action for the stronger agent can produce more distinct feedbacks when the discretization for the weaker agent is increased). A simple remedy to this situation would be to choose a proportion of the actions randomly, or to choose actions using a probability distribution over the policy matrix values. Each of these options, however, results in an exponential increase in the trials to convergence. Currently we are developing a simulated annealing [28] based procedure which results in a decrease in the proportion of random choices as the policy matrix converges to its steady state.
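As an illustration of the kind of remedy mentioned above, the following sketch decays the proportion of random action choices as trials progress; the exponential schedule and its parameters are assumptions made for the example, not the exact procedure we are developing.

```python
# Sketch of an annealed action-choice rule: the probability of a random
# (exploratory) action decays as trials progress, so early variability in
# feedback cannot permanently lock out the optimal action.  The exponential
# schedule and its parameters are illustrative assumptions.
import math
import random

EPS_START, DECAY = 0.5, 0.01     # illustrative schedule parameters

def choose_action(q_row, trial):
    """q_row: Q-values of the current state; trial: index of the current trial."""
    eps = EPS_START * math.exp(-DECAY * trial)   # annealed exploration rate
    if random.random() < eps:
        return random.randrange(len(q_row))                    # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])      # exploit

print(choose_action([55.0, 40.0, 61.0], trial=120))
```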

6 Robot navigation problem

We designed a problem in which four agents, A, B, C, and D, are to find the optimal path in a grid world, from given starting locations to their respective goals, A′, B′, C′, and D′. The agents traverse their world using one of the five available operators: north, south, east, west, or hold. Figure 5 depicts potential paths that each of the agents might choose during their learning process. The goal of the agents is to learn moves that quickly take them to their respective goal locations without colliding with other agents.

Figure 5: A robot navigation problem. (Agents A, B, C, and D move on a grid towards their respective goal locations A′, B′, C′, and D′.)

Each agent receives feedback based on its move: when it makes a move that takes it towards its goal, it receives a feedback of 1; when it makes a move that takes it away from its goal, it receives a feedback of -1; when it makes a move that results in no change of its distance from its goal (hold), it receives a feedback of 0; when it makes a move that results in a collision, the feedback is computed as depicted in Figure 6. All agents learn at the same time by updating their individual policies.

Figure 6: Feedback for agent X when it causes different types of collisions (given the action of the other agent in the collision); depending on the type of collision, X receives a feedback of -10 or -5.
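A sketch of this feedback rule in Python is given below; the Manhattan distance measure and the single collision-penalty argument are simplifying assumptions consistent with the description and Figure 6.

```python
# Sketch of the per-move feedback: +1 for moving closer to the goal, -1 for
# moving away, 0 for hold, and a negative collision penalty (Figure 6
# distinguishes -5 and -10 cases, collapsed here into one argument).
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def feedback(old_pos, new_pos, goal, collided=False, collision_penalty=-5):
    if collided:
        return collision_penalty
    d_old, d_new = manhattan(old_pos, goal), manhattan(new_pos, goal)
    if d_new < d_old:
        return 1
    if d_new > d_old:
        return -1
    return 0          # hold, or a move that leaves the distance unchanged

print(feedback(old_pos=(0, 0), new_pos=(1, 0), goal=(3, 0)))   # -> 1
```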

The robot navigation task is a domain in which the agent couplings vary over time. When agents are located next to each other, they are very likely to interact and hence agent behaviors are tightly coupled. But, as they move apart, their interactions become less likely, and as a result, their behaviors become loosely coupled.

Since the robot navigation problem produces rewards after each time step, we have used the BBA method of payoff distribution with the classifier system. The system parameters are $\beta$ = 0.5 and $\gamma$ = 0.8 for Q-learning, and $\alpha$ = 0.1 for BBA.

Experimental results on the robot navigation domain comparing Q-learning and a classifier system using BBA for payoff distribution are displayed in Figure 7. Plots show the average number of steps taken by the agents to reach their goals. Lower values of this parameter mean agents are learning to find more direct paths to their goals without colliding with each other. Results are averaged over 50 runs for both systems. Q-learning takes as much as 5 times longer to converge when compared to BBA. The final number of steps taken by agents using Q-learning is slightly smaller than the number of steps taken by agents using BBA. We believe that if we make the convergence criterion more strict for the BBA, a better solution can be evolved with more computational effort.

The interesting aspect of this experiment is that all the agents were learning simultaneously and hence it was not obvious that they would find good paths. Typical solutions, however, show that agents stop at the right positions to let others pass through. This avoids collisions. The paths do contain small detours, and hence are not optimal.

The performance of BBA when $\alpha$ = 0.5 (comparable to the $\beta$ used in Q-learning) was not very different when compared to that presented in Figure 7. It is interesting to note that when experiments were tried setting the $\beta$ in Q-learning to 0.1 (comparable to $\alpha$ in BBA), the system did not attain convergence even after 5000 trials.

7 Resource sharing problem

The resource sharing problem assumes two agents sharing a common resource or channel, with each of them trying to distribute their load on the system so as to achieve maximum utility. In our version of the problem, it is assumed that one agent has already applied some load distributed over a fixed time period, and the other agent is learning to distribute its load on the system without any knowledge of the current distribution. A maximum load of L can be applied on the system at any point in time (loads in excess of this do not receive any utility).

Figure 7: Comparison of BBA and Q-learning on the robot navigation problem. (Average number of steps in the robot navigation task, plotted against trials, for Q-learning and the BBA-based classifier system.)

Figure 8: Curve depicting the utility received for a given load (the utility curve is U(x) = 10(1 − e^(−x)); utility received versus load applied).

The second agent can use $K$ load-hours of the channel. If it applies a load of $k_t$ load-hours on the system at time step $t$, when the first agent has applied $l_t$ load-hours, the utility it receives is $u(l_t, k_t) = U(\min(L, k_t + l_t)) - U(\min(L, l_t))$, where $U$ is the utility function in Figure 8, and L = 10 is the maximum load allowed on the system. The total feedback it gets at the end of $T$ time steps is $\sum_{t=1}^{T} u(l_t, k_t)$. This problem requires the second agent to distribute its load around the loads imposed by the first agent in order to obtain maximum utility. The problem is that the second agent has no direct information about the load distribution on the system. This is a typical situation in reinforcement learning problems where the agent has to choose its actions based only on scalar feedback. This formulation is different from other load balancing problems used in the MAL literature [41] where more short-term feedback is available to agents.

Figure 9: A resource sharing problem. (Load applied versus time, showing the maximum load, Agent-1's load distribution, and Agent-2's load distribution.)

A single trial consists of an episode of applying loads until T time steps are completed or until the agent has exhausted its load-hours, whichever occurs earlier. Thus, through consecutive such trials the agent learns to distribute its load on the system in an optimal fashion. Figure 9 presents the load distribution used by the first agent as well as one of several optimal load distributions for the second agent in the particular problem we have used for experiments (T = 10 in this problem).
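The feedback computation itself is small; the Python sketch below implements U(x) = 10(1 − e^(−x)) from Figure 8 together with the per-step and total utilities defined above, using an illustrative pair of load vectors rather than the distributions of Figure 9.

```python
# Sketch of the resource sharing feedback: U(x) = 10*(1 - exp(-x)) as in
# Figure 8, per-step utility u(l_t, k_t) = U(min(L, k_t + l_t)) - U(min(L, l_t)),
# and the total feedback summed over the T time steps of an episode.
# The two load vectors below are illustrative.
import math

L = 10.0    # maximum load allowed on the system at any time step

def U(x):
    """Utility curve of Figure 8."""
    return 10.0 * (1.0 - math.exp(-x))

def step_utility(l_t, k_t):
    return U(min(L, k_t + l_t)) - U(min(L, l_t))

def total_feedback(first_agent_loads, second_agent_loads):
    return sum(step_utility(l, k)
               for l, k in zip(first_agent_loads, second_agent_loads))

print(total_feedback([4, 0, 6, 2, 0], [1, 3, 0, 2, 4]))   # illustrative T = 5
```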

The resource sharing problem is another example of a tightly-coupled system, with agent actions interacting with each other at every time step. This domain can be considered to be non-cooperative because the agents are not doing a common task; each of the agents needs to coordinate with the others to achieve a utility that is most beneficial to itself.

Since the resource sharing problem produces rewards only after a series of actions are performed, we used the PSP method of payoff distribution with the classifier system. Though BBA can also be used for payoff distribution in this problem, our initial experiments showed that PSP performed much better than BBA on this problem. The parameter values are $\beta$ = 0.85 and $\gamma$ = 0.9 for Q-learning, and $\alpha$ = 0.1 for PSP.

In this set of experiments the fitness-proportionate PSP did not converge even after 150,000 trials. Experimental results comparing Q-learning and semi-random PSP, PSP(SR), based classifier systems on the resource sharing problem are displayed in Figure 10. Results are averaged over 50 runs of both systems. Though both methods find the optimal load distribution in some of the runs, more often than not they settle for a less than optimal, but reasonably good, distribution. PSP takes about twice as long to converge but produces a better load distribution on the average. The difference in performance is found to be significant at the 99% confidence level using a two-sample t-procedure. We believe this happens because all the active rules directly share the feedback at the end of a trial in PSP. In Q-learning, however, the external feedback is passed back to policy elements used early in a trial over successive trials. Interference with different action sequences sharing the same policy element (state-action pair) can produce convergence to sub-optimal solutions.

Figure 10: Comparison of PSP and Q-learning on the resource sharing problem. (Average utility, plotted against trials, for Q-learning and PSP(SR).)

Typical solutions produced by PSP and Q-learning differed in one important characteristic. PSP solutions save some of their load for the last empty time-slot, whereas Q-learning solutions use up all the available load before that. Since PSP is able to utilize the last empty time slot on the channel, it produces better utility than Q-learning.

The above results show two things: 1) an agent can use a classifier system to coordinate its actions effectively with no knowledge about the actions of the other agent using the common resource, and 2) a semi-random action choice mechanism can be a more effective method for classifier systems than the commonly used fitness-proportionate action choice scheme.

We ran a further set of experiments in this domain where both agents were learning concurrently. In this mode, each agent would submit a load distribution to the system, and the submitted load distributions would be used to give them feedback on their utilities. Neither of the learning techniques was able to generate effective coordination between the agents. This set of experiments, alone, glaringly exposes the limitations of individual learning. It seems that in tightly coupled domains with delayed feedback it would be extremely unlikely that good coordination will evolve between agents learning at the same time. This is particularly true if there are one or very few optimal behavior pairings.

8 Conclusions and Future Research

In this paper we have addressed the problem of developing multiagent coordination strategies with minimal domain knowledge and information sharing between agents. We have compared classifier system based methods and Q-learning algorithms, two reinforcement learning paradigms, to investigate a resource sharing and a robot navigation problem. Our experiments show that the classifier based methods perform very competitively with the Q-learning algorithm, and are able to generate good solutions to both problems. PSP works well on the resource sharing problem, where an agent is trying to adapt to a fixed strategy used by another agent, and when environmental feedback is received infrequently. Results are particularly encouraging for the robot navigation domain, where all agents are learning simultaneously. A classifier system with the BBA payoff distribution allows agents to coordinate their movements with others without deviating significantly from the optimal paths from their start to goal locations.

Experiments conducted on the block-pushing task demonstrate that two agents can coordinate to solve a problem better, even without developing a model of each other, than what they can do alone. We have developed and experimentally verified theoretical predictions of the effects of a particular system parameter, the learning rate, on system convergence. Other experiments show the utility of using knowledge, acquired from learning in one situation, in other similar situations. Additionally, we have demonstrated that agents coordinate by learning complementary, rather than identical, problem-solving knowledge.

Using reinforcement learning schemes, we have shown that agents can learn to achieve their goals in both cooperative and adversarial domains. Neither prior knowledge about domain characteristics nor explicit models about capabilities of other agents are required. This provides a novel paradigm for multi-agent systems through which both friends and foes can concurrently acquire coordination knowledge.

A particular limitation of the proposed approach that we have identified is the inability of individual, concurrent learning to develop effective coordination when agent actions are strongly coupled, feedback is delayed, and there is one or a few optimal behavior combinations. A possible partial fix to this problem would be to do some sort of staggered or lock-step learning. In this mode of learning, each agent can learn for some time, then execute its current policy without modification for some time, then switch back to learning, etc. Two agents can synchronize their behavior so that one is learning while the other is following a fixed policy and vice versa. Even if perfect synchronization is not feasible, the staggered learning mode is likely to be more effective than the concurrent learning mode we have used in this paper.

A drawback of using reinforcement learning to generate coordination policies is that it requires a considerable amount of data, and as such can only be used in domains where agents repeatedly perform similar tasks. Other learning algorithms with smaller data requirements can possibly be explored to overcome this limitation.

This paper demonstrates that classifier systems can be used effectively to achieve near-optimal solutions more quickly than Q-learning, as illustrated by the experiments conducted in the robot navigation task. If we enforce a more rigid convergence criterion, classifier systems achieve a better solution than Q-learning through a larger number of trials, as illustrated by the results obtained on the resource sharing domain. We believe, however, that either Q-learning or the classifier system can produce better results in a given domain. Identifying the distinguishing features of domains which allow one of these schemes to perform better will be a focus of our future research.


We have also shown that a semi-random choice of actions can be much more productive than the commonly used fitness-proportionate choice of actions with the PSP payoff distribution mechanism. We plan to compare the BBA mechanism with these two methods of payoff distribution.

We would also like to investigate the effects of problem complexity on the number of trials taken for convergence. On the robot navigation domain, for example, we would like to vary both the size of the grid as well as the number of agents moving on the grid to find out the effects on solution quality and convergence time.

Other planned experiments include using world models within classifier systems [3] and combining features of BBA and PSP [20] that would be useful for learning multiagent coordination strategies.

Acknowledgments

This research has been sponsored, in part, by the National Science Foundation under a Research Initiation Award IRI-9410180 and a CAREER award IRI-9702672.

References

[1] Andrew B. Barto, Richard S. Sutton, and Chris Watkins. Sequential decision problems and neural networks. In Proceedings of the 1989 Conference on Neural Information Processing, 1989.

[2] Alan H. Bond and Les Gasser. Readings in Distributed Artificial Intelligence. Morgan Kaufmann Publishers, San Mateo, CA, 1988.

[3] Lashon B. Booker. Classifier systems that learn internal world models. Machine Learning, 3:161–192, 1988.

[4] L. B. Booker, D. E. Goldberg, and J. H. Holland. Classifier systems and genetic algorithms. Artificial Intelligence, 40:235–282, 1989.

[5] P. Brazdil, M. Gams, S. Sian, L. Torgo, and W. van de Velde. Learning in distributed systems and multi-agent environments. In European Working Session on Learning, Lecture Notes in AI, 482, Berlin, March 1991. Springer Verlag.

[6] Hung H. Bui, Dorota Kieronska, and Svetha Venkatesh. Negotiating agents that learn about others' preferences. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 114–119, Menlo Park, CA, 1996. AAAI Press.

[7] David Carmel and Shaul Markovitch. Incorporating opponent models into adversary search. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 62–67, Menlo Park, CA, 1996. AAAI Press.

[8] Philip R. Cohen and C. Raymond Perrault. Elements of a plan-based theory of speech acts. Cognitive Science, 3(3):177–212, 1979.

[9] Susan E. Conry, Robert A. Meyer, and Victor R. Lesser. Multistage negotiation in distributed planning. In Alan H. Bond and Les Gasser, editors, Readings in Distributed Artificial Intelligence, pages 367–384. Morgan Kaufman, 1988.

[10] Marco Dorigo and Hugues Bersini. A comparison of Q-learning and classifier systems. In Proceedings of From Animals to Animats, Third International Conference on Simulation of Adaptive Behavior, 1994.

[11] Edmund H. Durfee and Victor R. Lesser. Partial global planning: A coordination framework for distributed hypothesis formation. IEEE Transactions on Systems, Man, and Cybernetics, 21(5), September 1991. (Special Issue on Distributed Sensor Networks).

[12] Edmund H. Durfee, Victor R. Lesser, and Daniel D. Corkill. Coherent cooperation among communicating problem solvers. IEEE Transactions on Computers, C-36(11):1275–1291, November 1987. (Also published in Readings in Distributed Artificial Intelligence, Alan H. Bond and Les Gasser, editors, pages 268–284, Morgan Kaufmann, 1988.)

[13] Edmund H. Durfee, Victor R. Lesser, and Daniel D. Corkill. Trends in cooperative distributed problem solving. IEEE Transactions on Knowledge and Data Engineering, 1(1):63–83, March 1989.

[14] Edmund H. Durfee and Thomas A. Montgomery. Coordination as distributed search in a hierarchical behavior space. IEEE Transactions on Systems, Man, and Cybernetics, 21(6):1363–1378, November/December 1991.

[15] Mark S. Fox. An organizational view of distributed systems. IEEE Transactions on Systems, Man, and Cybernetics, 11(1):70–80, January 1981. (Also published in Readings in Distributed Artificial Intelligence, Alan H. Bond and Les Gasser, editors, pages 140–150, Morgan Kaufmann, 1988.)

[16] Andrew Garland and Richard Alterman. Multiagent learning through collective memory. In Sandip Sen, editor, Working Notes for the AAAI Symposium on Adaptation, Co-evolution and Learning in Multiagent Systems, pages 33–38, Stanford University, CA, March 1996.

[17] Les Gasser and Michael N. Huhns, editors. Distributed Artificial Intelligence, volume 2 of Research Notes in Artificial Intelligence. Pitman, 1989.

[18] M. R. Genesereth, M. L. Ginsberg, and J. S. Rosenschein. Cooperation without communications. In Proceedings of the National Conference on Artificial Intelligence, pages 51–57, Philadelphia, Pennsylvania, 1986.

[19] Piotr J. Gmytrasiewicz, Edmund H. Durfee, and David K. Wehe. A decision-theoretic approach to coordinating multiagent interactions. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pages 62–68, August 1991.

[20] John Grefenstette. Credit assignment in rule discovery systems. Machine Learning, 3(2/3):225–246, 1988.

[21] Pan Gu and Anthony B. Maddox. A framework for distributed reinforcement learning. In Gerhard Weiß and Sandip Sen, editors, Adaptation and Learning in Multi-Agent Systems, Lecture Notes in Artificial Intelligence, pages 97–112. Springer Verlag, Berlin, 1996.

[22] Joseph Halpern and Yoram Moses. Knowledge and common knowledge in a distributed environment. Journal of the ACM, 37(3):549–587, 1990. A preliminary version appeared in Proc. 3rd ACM Symposium on Principles of Distributed Computing, 1984.

[23] Thomas Haynes and Sandip Sen. Learning cases to complement rules for conflict resolution in multiagent systems. International Journal of Human-Computer Studies (to appear).

[24] John H. Holland. Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor, MI, 1975.

[25] John H. Holland. Escaping brittleness: the possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, editors, Machine Learning, an artificial intelligence approach: Volume II. Morgan Kaufmann, Los Alamos, CA, 1986.

[26] Michael Huhns, editor. Distributed Artificial Intelligence. Morgan Kaufmann, 1987.

[27] L. P. Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of AI Research, 4:237–285, 1996.

[28] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.

[29] Victor R. Lesser. Multiagent systems: An emerging subdiscipline of AI. ACM Computing Surveys, 27(3):340–342, September 1995.

[30] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157–163, 1994.

[31] Sridhar Mahadevan. To discount or not to discount in reinforcement learning: A case study comparing R learning and Q learning. In Proceedings of the Tenth International Conference on Machine Learning, pages 205–211, 1993.

[32] Thomas W. Malone. Modeling coordination in organizations and markets. Management Science, 33(10):1317–1332, 1987. (Also published in Readings in Distributed Artificial Intelligence, Alan H. Bond and Les Gasser, editors, pages 151–158, Morgan Kaufmann, 1988.)

[33] James G. March and Herbert A. Simon. Organizations. John Wiley & Sons, 1958.

[34] Maja J. Mataric. Learning in multi-robot systems. In Gerhard Weiß and Sandip Sen, editors, Adaptation and Learning in Multi-Agent Systems, Lecture Notes in Artificial Intelligence, pages 152–163. Springer Verlag, Berlin, 1996.

[35] K. Narendra and M. A. L. Thatachar. Learning Automata: An Introduction. Prentice Hall, 1989.

[36] Lynne E. Parker. Adaptive action selection for cooperative robot teams. In J. Meyer, H. Roitblat, and S. Wilson, editors, Proc. of the Second International Conference on Simulation of Adaptive Behavior, pages 442–450, Cambridge, MA, 1992. MIT Press.

[37] M. V. Nagendra Prasad, Susan E. Lander, and Victor R. Lesser. Cooperative learning over composite search spaces: Experiences with a multi-agent design system. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 68–73, August 1996.

[38] Foster John Provost and Daniel N. Hennessy. Scaling up: Distributed machine learning with cooperation. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 74–79, Menlo Park, CA, 1996. AAAI Press.

[39] Jeffery S. Rosenschein. Synchronization of multi-agent plans. In Proceedings of the National Conference on Artificial Intelligence, pages 115–119, Pittsburgh, Pennsylvania, August 1982. (Also published in Readings in Distributed Artificial Intelligence, Alan H. Bond and Les Gasser, editors, pages 187–191, Morgan Kaufmann, 1988.)

[40] Tuomas W. Sandholm and Robert H. Crites. On multiagent Q-learning in a semi-competitive domain. In Gerhard Weiß and Sandip Sen, editors, Adaptation and Learning in Multi-Agent Systems, Lecture Notes in Artificial Intelligence, pages 191–205. Springer Verlag, Berlin, 1996.

[41] A. Schaerf, Y. Shoham, and M. Tennenholtz. Adaptive load balancing: A study in multiagent learning. Journal of Artificial Intelligence Research, 2:475–500, 1995.

[42] Jurgen Schmidhuber. A general method for multi-agent reinforcement learning in unrestricted environments. In Sandip Sen, editor, Working Notes for the AAAI Symposium on Adaptation, Co-evolution and Learning in Multiagent Systems, pages 84–87, Stanford University, CA, March 1996.

[43] Mahendra Sekaran and Sandip Sen. Learning with friends and foes. In Sixteenth Annual Conference of the Cognitive Science Society, pages 800–805, 1994.

[44] Sandip Sen. IJCAI-95 workshop on adaptation and learning in multiagent systems. AI Magazine, 17(1):87–, Spring 1996.

[45] Sandip Sen, Mahendra Sekaran, and John Hale. Learning to coordinate without sharing information. In National Conference on Artificial Intelligence, pages 426–431, 1994.

[46] Yoav Shoham and Moshe Tennenholtz. On the synthesis of useful social laws for artificial agent societies (preliminary report). In Proceedings of the National Conference on Artificial Intelligence, pages 276–281, San Jose, California, July 1992.

[47] S. Sian. Adaptation based on cooperative learning in multi-agent systems. In Y. Demazeau and J.-P. Müller, editors, Decentralized AI, volume 2, pages 257–272. Elsevier Science Publications, 1991.

[48] Reid G. Smith. The contract net protocol: High-level communication and control in a distributed problem solver. IEEE Transactions on Computers, C-29(12):1104–1113, December 1980.

[49] Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts at Amherst, 1984.

[50] Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, June 1993.

[51] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge University, 1989.

[52] Gerhard Weiß. Learning to coordinate actions in multi-agent systems. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 311–316, August 1993.

[53] Gerhard Weiß, editor. Distributed Artificial Intelligence Meets Machine Learning: Learning in Multi-Agent Environments. Lecture Notes in Artificial Intelligence. Springer Verlag, Berlin, 1997.

[54] Gerhard Weiß and Sandip Sen, editors. Adaptation and Learning in Multi-Agent Systems. Lecture Notes in Artificial Intelligence. Springer Verlag, Berlin, 1996.

[55] Eric Werner. Cooperating agents: A unified theory of communication and social structure. In Les Gasser and Michael N. Huhns, editors, Distributed Artificial Intelligence, volume 2 of Research Notes in Artificial Intelligence, pages 3–36. Pitman, 1989.

[56] Holly Yanco and Lynn Andrea Stein. An adaptive communication protocol for cooperating mobile robots. In Stewart Wilson, editor, From Animals to Animats: Proc. of the Second International Conference on the Simulation of Adaptive Behavior, pages 478–485, Cambridge, MA, 1993. MIT Press.

[57] M. Yokoo, E. Durfee, T. Ishida, and K. Kuwabara. Distributed constraint satisfaction for formalizing distributed problem solving. In Proceedings of the Twelfth International Conference on Distributed Computing Systems, pages 614–621, 1992.

                                   Communicating agents    Non-communicating agents
Cooperative relationships
    Individualistic learning
    Shared learning
Non-cooperative relationships

Table 1: A categorization of multiagent domains for identifying learning opportunities.

F_A     F_B     Trials taken to reach goal of agent A
10      1       81
10      2       111
10      3       115
10      4       197
10      5       268

Table 2: Trials for agent A to reach its goal acting against agent B trying to reach its own goal (maximum forces applied by agents A and B are F_A and F_B respectively).

[58] Gilad Zlotkin and Jeffrey S. Rosenschein. Negotiation and conflict resolution in non-cooperative domains. In Proceedings of the National Conference on Artificial Intelligence, pages 100–105, July 1990.

