Abstract—We survey applications of pretrained foundation models in robotics. Traditional deep learning models in robotics are trained on small datasets tailored for specific tasks, which limits their adaptability across diverse applications. In contrast, foundation models pretrained on internet-scale data appear to have superior generalization capabilities, and in some instances display an emergent ability to find zero-shot solutions to problems that are not present in the training data. Foundation models may hold the potential to enhance various components of the robot autonomy stack, from perception to decision-making and control. For example, large language models can generate code or provide commonsense reasoning, while vision-language models enable open-vocabulary visual recognition. However, significant open research challenges remain, particularly around the scarcity of robot-relevant training data, safety guarantees and uncertainty quantification, and real-time execution. In this survey, we study recent papers that have used or built foundation models to solve robotics problems. We explore how foundation models contribute to improving robot capabilities in the domains of perception, decision-making, and control. We discuss the challenges hindering the adoption of foundation models in robot autonomy and provide opportunities and potential pathways for future advancements. The GitHub project corresponding to this paper¹ can be found here.

¹Preliminary release. We are committed to further enhancing and updating this work to ensure its quality and relevance.

Index Terms—Robotics, Large Language Models (LLMs), Vision-Language Models (VLMs), Large Pretrained Models, Foundation Models

I. INTRODUCTION

FOUNDATION models are pretrained on extensive internet-scale data and can be fine-tuned for adaptation to a wide range of downstream tasks. Foundation models have demonstrated significant breakthroughs in vision and language processing; examples include BERT [1], GPT-3 [2], GPT-4 [3], CLIP [4], DALL-E [5], and PaLM-E [6]. Foundation models have the potential to unlock new possibilities in robotics domains such as autonomous driving, household robotics, industrial robotics, assistive robotics, medical robotics, field robotics, and multi-robot systems. Pretrained Large Language Models (LLMs), Large Vision-Language Models (VLMs), Large Audio-Language Models (ALMs), and Large Visual Navigation Models (VNMs) can be utilized to improve various tasks in robotics settings. The integration of foundation models into robotics is a rapidly evolving area, and the robotics community has very recently started exploring ways to leverage these large models within the robotics domain for perception, prediction, planning, and control.

Prior to the emergence of foundation models, traditional deep learning models for robotics were typically trained on limited datasets gathered for distinct tasks [7]. Conversely, foundation models are pretrained on extensive and diverse data, which has been proven in other domains (such as natural language processing, computer vision, and healthcare [8]) to significantly expand adaptability, generalization capability, and overall performance. Ultimately, foundation models may hold the potential to yield these same benefits in robotics. Knowledge transfer from foundation models may reduce training time and computational resources compared to task-specific models. Particularly relevant to robotics, multimodal foundation models can fuse and align multimodal heterogeneous data gathered from various sensors into compact homogeneous representations needed for robot understanding and reasoning [9].
These learned representations hold the potential to be used in any part of the autonomy stack, including perception, decision-making, and control. Furthermore, foundation models provide zero-shot capabilities, which refer to the ability of an AI system to perform tasks without prior examples or dedicated training data for that specific task. This would enable robots to generalize their learned knowledge to novel cases, enhancing adaptability and flexibility for robots in unstructured settings.

Integrating foundation models into robotic systems may enable context-aware robotic systems by enhancing the robot's ability to perceive and interact with the environment. For example, in the perception domain, Large Vision-Language Models (VLMs) have been found to provide cross-modal understanding by learning associations between visual and textual data, aiding tasks such as zero-shot image classification, zero-shot object detection [10], and 3D classification [11]. As another example, language grounding in the 3D world [12] (aligning the contextual understanding of VLMs to the 3-dimensional (3D) real world) may enhance a robot's spatial awareness by associating words with specific objects, locations, or actions within the 3D environment.

In the decision-making or planning domain, LLMs and VLMs have been found to assist robots in task specification for high-level planning [13]. Robots can perform more complex tasks by leveraging linguistic cues in manipulation, navigation, and interaction. For example, for robot policy learning techniques like imitation learning [14] and reinforcement learning [15], foundation models seem to offer the possibility to improve data efficiency and enhance contextual understanding. In particular, language-driven rewards can be used to guide RL agents by providing shaped rewards [16]. Also, researchers have employed language models to provide feedback for policy learning techniques [17]. Some works have shown that a VLM's visual question-answering (VQA) capability can be harnessed in robotics use cases. For example, researchers have used VLMs to answer questions related to visual content to aid robots in accomplishing their tasks [18]. Also, researchers have started utilizing VLMs to help with data annotation, by generating descriptive labels for visual content [19].

Despite the transformative capabilities of foundation models in vision and language processing, the generalization and fine-tuning of foundation models for real-world robotics tasks remain challenging. These challenges include: 1) Data Scarcity: how to obtain internet-scale data for robot manipulation, locomotion, navigation, and other robotics tasks, and how to perform self-supervised training with this data; 2) High Variability: how to deal with the large diversity in physical environments, physical robot platforms, and potential robot tasks while still maintaining the generality required for a foundation model; 3) Uncertainty Quantification: how to deal with (i) instance-level uncertainty such as language ambiguity or LLM hallucination, (ii) distribution-level uncertainty, and (iii) distribution shift, especially resulting from closed-loop robot deployment; 4) Safety Evaluation: how to rigorously test for the safety of a foundation model-based robotic system (i) prior to deployment, (ii) as the model is updated throughout its lifecycle, and (iii) as the robot operates in its target environments; 5) Real-Time Performance: how to deal with the high inference time of some foundation models, which could hinder their deployment on robots, and how to accelerate inference in foundation models to the speed required for online decision-making.
In this survey, we study the existing literature on the use of foundation models in robotics. We study current approaches and applications, present current challenges, suggest directions for future research to address these challenges, and identify potential risks exposed by integrating foundation models into robot autonomy. Another survey on foundation models in robotics appeared simultaneously with ours on arXiv [20]. In comparison with that paper, ours emphasizes future challenges and opportunities, including safety and risk, and ours has a stronger emphasis on comparisons in applications, algorithms, and architectures among the existing papers in this space. In contrast to some existing surveys that focus on a specific in-context instruction, such as prompts [21], vision transformers [22], or decision-making [13], [23], we provide a broader perspective that connects distinct research threads in foundation models, organized around their relevance to and application in robotics. Conversely, our scope is much narrower than that of [24], which explores the broad application of foundation models across many disciplines, of which robotics is one. We hope this paper can provide clarity regarding areas of recent progress and existing deficiencies in the research, and point the way forward to future opportunities and challenges facing this research area. Ultimately, we aim to give a resource for robotics researchers to learn about this exciting new area.

We limit the scope of this survey to papers that fall into one of the following categories:

1) Background Papers: Papers that do not explicitly link to robotics, but are nonetheless required for understanding foundation models. These papers are discussed in the background section (Section II) of the survey.

2) Robotics Papers: Papers that integrate a foundation model into a robotic system in a plug-and-play fashion, papers that adapt or fine-tune foundation models for robotic systems, or papers that build new robotics-specific foundation models.

3) Robotics-Adjacent Papers: Papers that present methods or techniques applied to areas adjacent to robotics (e.g., computer vision, embodied AI), with a clear path to future application in robotics.

This survey is organized as follows: In Section II, we provide an introduction to foundation models, including LLMs, vision transformers, VLMs, embodied multimodal language models, and visual generative models. In addition, in the last part of that section, we discuss the different training methods used to train foundation models. In Section III, we present a review of how foundation models are integrated into different decision-making tasks in robotics. First, we discuss robot policy learning using language-conditioned imitation learning and language-assisted reinforcement learning. Then, we discuss how to use foundation models to design a language-conditioned value function that can be used for planning purposes. Next, robot task specification and code generation for task planning using foundation models are presented. In Section IV, we study various perception tasks in robotics that have the potential to be enhanced by employing foundation models. These tasks include semantic segmentation, 3D scene representation, zero-shot 3D classification, affordance prediction, and dynamics prediction. In Section V, we present papers about embodied AI agents, generalist AI agents, as well as simulators and benchmarks developed for embodied AI research. In Section VI, we conclude the survey by discussing different challenges for employing foundation models in robotic systems and proposing potential avenues for future research. Finally, in Section VII, we offer concluding remarks.

II. FOUNDATION MODELS BACKGROUND

Foundation models have billions of parameters and are pretrained on massive internet-scale datasets.
Training models of such scale and complexity involves substantial costs. Acquiring, processing, and managing data can be costly. The training process demands significant computational resources, requiring specialized hardware such as GPUs or TPUs, as well as software and infrastructure for model training, all of which require financial resources. Additionally, training a foundation model is time-intensive, which can translate to even higher costs. Hence, these models are often used as plug-and-play modules (which refers to the integration of foundation models into various applications without the need for extensive customization). Table I provides details about commonly used foundation models. In the rest of this section, we introduce LLMs, vision transformers, VLMs, embodied multimodal language models, and visual generative models. In the last part of this section, we introduce different training methods that are used to train foundation models.

Fig. 1. Overview of Robotics Tasks Leveraging Foundation Models. The figure organizes the surveyed work as follows:
- Robotics
  - Robot Policy Learning
    - Language-Conditioned Imitation Learning: e.g., CLIPort [25], Play-LMP [26], PerAct [27], Multi-Context Imitation [28], CACTI [14], Voltron [29]
    - Language-Assisted Reinforcement Learning: e.g., Adaptive Agent (AdA) [30], Palo et al. [15]
  - Language-Image Goal-Conditioned Value Learning: e.g., R3M [31], SayCan [32], Inner Monologue [33], VoxPoser [34], Mahmoudieh et al. [35], VIP [36], LIV [37], LOREL [38]
  - High-Level Task Planning: e.g., NL2TL [39], Chen et al. [40]
  - LLM-Based Code Generation: e.g., ProgPrompt [41], Code-as-Policies [42], ChatGPT-Robotics [43]
  - Robot Transformers: e.g., RT-1 [44], RT-2 [45], RT-X [46], PACT [47], Xiao et al. [48], Radosavovic et al. [49], LATTE [50]
- Relevant to Robotics
  - Perception
    - Open-Vocabulary Object Detection and 3D Classification: e.g., OWL-ViT [51], GLIP [52], Grounding DINO [53], PointCLIP [54], PointBERT [55], ULIP [56], [57]
    - Open-Vocabulary Semantic Segmentation: e.g., LSeg [58], Segment Anything [59], FastSAM [60], MobileSAM [61], Track Anything Model (TAM) [62]
    - Open-Vocabulary 3D Scene Representation: e.g., CLIP-NeRF [63], LERF [64], DFF [65]
    - Affordances: e.g., Affordance Diffusion [66], VRB [67]
  - Embodied AI: e.g., Huang et al. [68], Statler [69], EmbodiedGPT [70], MineDojo [71], VPT [72], Kwon et al. [16], Voyager [73], ELLM [74]

A. Terminology and Mathematical Preliminaries

In this section, we first introduce common terminologies in the context of foundation models and describe basic mathematical details and training practices for various types of foundation models.

Tokenization: Given a sequence of characters, tokenization is the process of dividing the sequence into smaller units, called tokens. Depending on the tokenization strategy, tokens can be characters, segments of words, complete words, or portions of sentences. Tokens are represented as 1-hot vectors of dimension equal to the size of the total vocabulary and are mapped to lower-dimensional vectors of real numbers through a learned embedding matrix. An LLM takes a sequence of these embedding vectors as raw input, producing a sequence of embedding vectors as raw output. These output vectors are then mapped back to tokens and hence to text. GPT-3, for example, has a vocabulary of 50,257 different tokens and an embedding dimension of 12,288.

The token decoding (from low-dimensional real-valued embedding vectors to high-dimensional 1-hot vectors) is not deterministic, resulting in a weighting for each possible token in the vocabulary. These weightings are often used by LLMs as probabilities over tokens, to introduce randomness in the text generation process. For example, the temperature parameter in GPT-3 blends between always choosing the top-weighted token (temperature of 0) and drawing the token based on the probability distribution suggested by the weights (temperature of 1). This source of randomness is only in the token decoding process, not in the LLM itself. To the authors' knowledge, this is, in fact, the only source of randomness in the GPT family of models.
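To make the decoding step concrete, the following is a minimal sketch of temperature-controlled token sampling. The logits array and the tiny vocabulary are illustrative placeholders, not the interface of any particular model.

import numpy as np

def sample_token(logits, temperature=1.0, rng=np.random.default_rng()):
    """Decode one token from raw model weightings (logits).

    temperature -> 0 approaches greedy (argmax) decoding;
    temperature = 1 samples from the softmax distribution as-is.
    """
    if temperature == 0:
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled -= scaled.max()                       # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

# Illustrative 5-token vocabulary and weightings (not real model outputs).
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
print(sample_token(logits, temperature=0))    # always the top-weighted token
print(sample_token(logits, temperature=1.0))  # random draw weighted by the logits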
One of the most common tokenization schemes, which is used by the GPT family of models, is called byte-pair encoding [75]. Byte-pair encoding starts with a token for each individual symbol (e.g., letter, punctuation), then recursively builds tokens by grouping pairs of symbols that commonly appear together, building up to assign tokens to larger and larger groups (pairs of pairs, etc.) that appear frequently together in a text corpus. The tokenization process can extend beyond text data to diverse contexts, encompassing various data modalities like images, videos, and robot actions. In these scenarios, the respective data modalities can be treated as sequential data and tokenized similarly to train generative models. For example, just as language constitutes a sequence of words, an image comprises a sequence of image patches, force sensors yield a sequence of sensory inputs at each time step, and a series of actions represents the sequential nature of tasks for a robot.

Generative Models: A generative model is a model that learns to sample from a probability distribution to create examples of data that seem to be from the same distribution as the training data. For example, a face generation model can produce images of faces that cannot be distinguished from the set of real images used to train the model. These models can be trained to be conditional, meaning they generate samples from a conditional distribution conditioned on a wide range of possible conditioning information. For example, a gender-conditional face generator can generate images of female or male faces, where the desired gender is given as a conditioning input to the model.

Discriminative Models: Discriminative models are used for regression or classification tasks. In contrast to generative models, discriminative models are trained to distinguish between different classes or categories. Their emphasis lies in learning the boundaries between classes within the input space. While generative models learn to sample from the distribution over the data, discriminative models learn to evaluate the probability distribution of the output labels given the input features, or (depending on how the model is trained) learn to evaluate some statistic of the probability distribution over the outputs, such as the expected output given an input.

Transformer Architecture: Most foundation models are built on the transformer architecture, which has been instrumental in the rise of foundation models and large language models. The following discussion was synthesized from [76], as well as online blogs, unpublished reports, and Wikipedia [77]–[79]. A transformer acts simultaneously on a collection of embedded token vectors (x_1, ..., x_N) known as a context window. The key enabling innovation of the transformer architecture is the multi-head self-attention mechanism originally proposed in the seminal work [76]. In this architecture, each attention head computes a vector of importance weights that corresponds to how strongly a token in the context window x_i correlates with other tokens in the same window x_j. Each attention head mathematically encodes different notions of similarity, through different projection matrices used in the computation of the importance weights. Each head can be trained (backward pass) and evaluated (forward pass) in parallel across all tokens and across all heads, leading to faster training and inference when compared with previous models based on RNNs or LSTMs.

Mathematically, an attention head maps each token x_i in the context window to a "query" q_i = W_q^\top x_i, and each other token in the context window x_j to a "key" k_j = W_k^\top x_j. The similarity between query and key is then measured through a scaled dot product, q_i^\top k_j / \sqrt{d}, where d is the dimension of the query and key vectors. A softmax is then taken over all j to give weights \alpha_{ij} representing how much x_i "attends to" x_j. The tokens are then mapped to "values" with v_j = W_v^\top x_j, and the output of the attention for position i is then given as a sum over values weighted by attention weights, \sum_j \alpha_{ij} v_j. One of the key reasons for the success of the transformer attention model is that it can be efficiently computed with GPUs and TPUs by parallelizing the preceding steps into matrix computations,

\mathrm{attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,    (1)

where Q, K, V are matrices with rows q_i^\top, k_i^\top, and v_i^\top, respectively. Each head in the model produces this computation independently, with different W_q, W_k, W_v matrices to encode different kinds of attention. The outputs from each head are then concatenated, normalized with a skip connection, passed through a fully connected ReLU layer, and normalized again with a skip connection to produce the output of the attention layer. Multiple layers are arranged in various ways to give "encoders" and "decoders," which together make up a transformer.
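As a concrete illustration, the following is a minimal NumPy sketch of a single attention head implementing Eq. (1); the matrix shapes and random inputs are illustrative only.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """Single attention head over a context window X of shape (N, d_model).

    Implements attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V from Eq. (1).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # rows are q_i^T, k_i^T, v_i^T
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))    # alpha_ij: how much x_i attends to x_j
    return weights @ V                           # row i is sum_j alpha_ij v_j

# Toy context window: N = 4 tokens, embedding dimension 8, head dimension 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 3)) for _ in range(3))
print(attention_head(X, W_q, W_k, W_v).shape)    # (4, 3)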
The size of a transformer model is typically quantified by (i) the size of the context window, (ii) the number of attention heads per layer, (iii) the size of the attention vectors in each head, and (iv) the number of stacked attention layers. For example, GPT-3's context window is 2048 tokens (corresponding to about 1500 words of text), each attention layer has 96 heads, each head has attention vectors of 128 dimensions, and there are 96 stacked attention layers in the model.

The basic multi-head attention mechanism does not impose any inherent sense of sequence or directionality in the data. However, transformers, especially in natural language applications, are often used as sequence predictors by imposing a positional encoding on the input token sequence. They are then applied to a token sequence autoregressively, meaning they predict the next token in the sequence, add that token to their context window, and repeat. This concept is elaborated below.

Autoregressive Models: The concept of autoregression has been applied in many fields as a representation of random processes whose outputs depend causally on the previous outputs. Autoregressive models use a window of past data to predict the next data point in a sequence. The window then slides one position forward, recursively ingesting the predicted data point into the window and expelling the oldest data point from the window. The model again predicts the next data point in the sequence, repeating this process indefinitely. Classical linear autoregressive models such as Auto-Regressive Moving Average (ARMA) and Auto-Regressive Moving Average with eXogenous input (ARMAX) models are standard statistical tools dating back to at least the 1970s [80]. These modeling concepts were adapted to deep learning models first with RNNs, and later LSTMs, which are both types of learnable nonlinear autoregressive models. Transformer models, although they are not inherently autoregressive, are often adapted to an autoregressive framework for text prediction tasks. For example, the GPT family [81] builds on the original transformer model by using a modification introduced in [82] that removes the transformer encoder blocks entirely, retaining just the transformer decoder blocks. This has the advantage of reducing the number of model parameters by close to half while reducing redundant information that is learned in both the encoder and decoder.
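The following is a minimal sketch of the autoregressive decode-append-repeat loop described above. The next_token_logits function is a stand-in for a trained transformer, and greedy argmax decoding is just one possible choice; both are illustrative assumptions rather than any specific model's API.

import numpy as np

VOCAB_SIZE = 50257     # e.g. the GPT-3 vocabulary size mentioned above
CONTEXT_LEN = 8        # tiny context window for illustration

def next_token_logits(context):
    """Placeholder for a trained transformer: returns a weighting over the vocabulary."""
    rng = np.random.default_rng(sum(context))   # deterministic toy behavior
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt_tokens, n_new_tokens):
    """Autoregressive decoding: predict a token, append it, slide the window, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        context = tokens[-CONTEXT_LEN:]          # keep only the last CONTEXT_LEN tokens
        logits = next_token_logits(context)
        tokens.append(int(np.argmax(logits)))    # greedy decoding; could sample instead
    return tokens

print(generate([11, 42, 7], n_new_tokens=5))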
During training, the GPT model seeks to produce an output token from the tokenized corpus X = (x_1, ..., x_n) that minimizes the negative log-likelihood within the context window of length N,

\mathcal{L}_{\mathrm{LLM}} = -\sum_i \log P(x_i \mid x_{i-N}, \ldots, x_{i-1}).    (2)

This results in a large pretrained model that autoregressively predicts the next likely token given the tokens in the context window. Although powerful, the unidirectional autoregressive nature of the GPT family means that these models may lag in performance on bidirectional tasks such as reading comprehension.

Masked Auto-Encoding: To address the unidirectional limitation of the GPT family and allow the model to make bidirectional predictions, works such as BERT [1] use masked auto-encoding. This is achieved through an architectural change, namely the addition of a bidirectional encoder, as well as a novel pre-training objective known as masked language modeling (MLM). The MLM task simply masks a percentage of the tokens in the corpus and requires the model to predict these tokens. Through this procedure, the model is encouraged to learn the context that surrounds a word rather than just the next likely word in a sequence.

Contrastive Learning: Visual-language foundation models such as CLIP [4] typically rely on training methods different from the ones used with large language models, which encourage explicitly predictive behavior. Visual-language models use contrastive representation learning, where the goal is to learn a joint embedding space between input modalities in which similar sample pairs are closer than dissimilar ones. The training objective for many VLMs is some variation of the objective function

\ell_i^{(v \to u)} = -\log \frac{\exp(\mathrm{sim}(v_i, u_i)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(v_i, u_k)/\tau)},    (3)

\ell_i^{(u \to v)} = -\log \frac{\exp(\mathrm{sim}(u_i, v_i)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(u_i, v_k)/\tau)},    (4)

\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left[ \lambda\, \ell_i^{(v \to u)} + (1-\lambda)\, \ell_i^{(u \to v)} \right].    (5)

This objective function was popularized for multimodal input by ConVIRT [83] and first presented in prior works [84]–[87]. It trains the image and text encoders to preserve mutual information between the true text and image pairs. In these equations, u_i and v_i are the i-th encoded text and image, respectively, from i ∈ 1, ..., N image and text pairs. The sim operation is the cosine similarity between the text and image embeddings, and τ is a temperature term. In CLIP [4], the authors use a symmetric cross-entropy loss, meaning the final loss is an average of the two loss components where each is equally weighted (i.e., λ = 0.5).
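A minimal NumPy sketch of the symmetric contrastive objective in Eqs. (3)–(5), with λ = 0.5 as in CLIP, is shown below; the embeddings are random placeholders standing in for the outputs of real image and text encoders.

import numpy as np

def logsumexp(z, axis):
    m = z.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(z - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def clip_style_loss(img_emb, txt_emb, tau=0.07, lam=0.5):
    """Symmetric contrastive loss over N paired image/text embeddings, Eqs. (3)-(5)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = (img @ txt.T) / tau                    # sim[i, k] = cosine similarity / temperature
    diag = np.arange(sim.shape[0])
    loss_img_to_txt = -(sim[diag, diag] - logsumexp(sim, axis=1))        # Eq. (3)
    loss_txt_to_img = -(sim[diag, diag] - logsumexp(sim, axis=0))        # Eq. (4)
    return np.mean(lam * loss_img_to_txt + (1 - lam) * loss_txt_to_img)  # Eq. (5)

# Placeholder batch of N = 8 paired embeddings of dimension 512.
rng = np.random.default_rng(0)
print(clip_style_loss(rng.normal(size=(8, 512)), rng.normal(size=(8, 512))))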
Diffusion Models: Outside of large language models and multi-modal models such as VLMs, diffusion models for image generation (e.g., DALL-E 2) [88] are another class of foundation models considered in this survey. Although diffusion models were established in prior work [89], [90], the diffusion probabilistic model presented in [91] popularized the method. The diffusion probabilistic model is a deep generative model that is trained in an iterative forward and reverse process. The forward process adds Gaussian noise to an input x_0 in a Markov chain until x_T, when the result is zero-mean isotropic noise. This means the forward process produces a trajectory of noise q(x_{1:T} | x_0) as

q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}).    (6)

At each time step, q(x_t | x_{t-1}) is described by a normal distribution with mean \sqrt{1-\beta_t}\, x_{t-1} and covariance \beta_t I, where \beta_t is scheduled or a fixed hyperparameter.

The reverse process requires the model to learn the transitions that will de-noise the zero-mean Gaussian and produce the input image. This process is also defined as a Markov chain, where the transition distribution at time t is p_\theta(x_{t-1} \mid x_t) := \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)). For completeness, the reverse-process Markov chain is given by

p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t).    (7)

Diffusion models are trained using a reduced form of the evidence lower bound loss function that is typical of variational generative models like variational autoencoders (VAEs). The reduced loss function used for training is

\mathcal{L} = \mathbb{E}_q\Big[ D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big) + \sum_{t>1} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \Big],    (8)

where D_KL(q || p) denotes the Kullback–Leibler divergence, which is a measure of how different a distribution q is from a distribution p.

B. Large Language Model (LLM) Examples and Historical Context

LLMs have billions of parameters and are trained on trillions of tokens. This large scale has allowed models such as GPT-2 [92] and BERT [1] to achieve state-of-the-art performance on the Winograd Schema challenge [93] and the General Language Understanding Evaluation (GLUE) [94] benchmarks, respectively. Their successors, including GPT-3 [2], LLaMA [95], and PaLM [96], have grown considerably in the number of parameters (typically now over 100 billion), the size of the context window (typically now over 1000 tokens), and the size of the training data set (typically now tens of terabytes of text). GPT-3 is trained on the Common Crawl dataset. Common Crawl contains petabytes of publicly available data from over 12 years of web crawling and includes raw web page data, metadata, and text extracts. LLMs can also be multilingual. For example, ChatGLM-6B and GLM-130B [97] are bilingual (English and Chinese) pretrained language models; GLM-130B has 130 billion parameters. LLMs can also be fine-tuned, a process by which the model parameters are adjusted with domain-specific data to align the performance of the LLM to a specific use case. For example, GPT-3 and GPT-4 [3] have been fine-tuned using reinforcement learning with human feedback (RLHF).

C. Vision Transformers

A Vision Transformer (ViT) [98]–[100] is a transformer architecture for computer vision tasks including image classification, segmentation, and object detection. A ViT treats an image as a sequence of image patches referred to as tokens. In the image tokenization process, an image is divided into patches of fixed size. The patches are then flattened into a one-dimensional vector, which is referred to as a linear embedding. To capture the spatial relationships between image patches, positional information is added to each token. This process is referred to as position embedding. The image tokens incorporated with position encoding are fed into the transformer encoder, and the self-attention mechanism enables the model to capture long-term dependencies and global patterns in the input data. In this paper, we focus only on those ViT models with a large number of parameters. ViT-G [101] scales up the ViT model and has 2B parameters. Additionally, ViT-e [102] has 4B parameters. ViT-22B [103] is a vision transformer model at 22 billion parameters, which is used in PaLM-E and PaLI-X [104] and helps with robotics tasks.
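To illustrate the image tokenization step described above, the following is a minimal sketch of splitting an image into fixed-size patches and adding position embeddings; the patch size, embedding dimension, and random projection are illustrative assumptions, not the parameters of any particular ViT.

import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into non-overlapping flattened patches (ViT tokens)."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    patches = image[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, patch * patch * C)
    return patches

# Illustrative 224x224 RGB image, 16x16 patches -> 196 tokens of dimension 768.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
tokens = patchify(image)                          # (196, 768)

# Linear embedding plus position embeddings (random placeholders for learned parameters).
d_model = 256
W_embed = rng.normal(size=(tokens.shape[1], d_model))
pos_embed = rng.normal(size=(tokens.shape[0], d_model))
vit_input = tokens @ W_embed + pos_embed          # sequence fed to the transformer encoder
print(vit_input.shape)                            # (196, 256)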
DINO [105] is a self-supervised learning method for training ViTs. DINO is a form of knowledge distillation with no labels. Knowledge distillation is a learning framework where a smaller model (student network) is trained to mimic the behavior of a larger, more complex model (teacher network). Both networks share the same architecture with different sets of parameters. Given a fixed teacher network, the student network learns its parameters by minimizing the cross-entropy loss w.r.t. the student network parameters. The neural network architecture is composed of a ViT or ResNet [106] backbone and a projection head that includes layers of multi-layer perceptron (MLP). Self-supervised ViT features learned using DINO contain explicit information about the semantic segmentation of an image, including scene layout and object boundaries, with a clarity that is not achieved using supervised ViTs or convnets.

DINOv2 [107] provides a variety of pretrained visual models that are trained with different vision transformers (ViTs) on the LVD-142M dataset introduced in [107]. It is trained using a discriminative self-supervised method on a compute cluster of 20 nodes equipped with 8 V100-32GB GPUs. DINOv2 provides various visual features at the image level (e.g., detection) or pixel level (e.g., segmentation). SAM [59] provides zero-shot promptable image segmentation. It is discussed in more detail in Section IV.

D. Multimodal Vision-Language Models (VLMs)

Multimodal refers to the ability of a model to accept different "modalities" of inputs, for example, images, text, or audio signals. Vision-language models (VLMs) are a type of multimodal model that takes in both images and text. A commonly used VLM in robotics applications is Contrastive Language-Image Pre-training (CLIP) [4]. CLIP offers a method to compare the similarity between textual descriptions and images. CLIP uses internet-scale image-text pair data to capture the semantic information between images and text. The CLIP model architecture contains a text encoder [92] and an image encoder (a modified version of the vision transformer ViT) that are trained jointly to maximize the cosine similarity of the image and text embeddings. CLIP uses contrastive learning together with language models and visual feature encoders to enable zero-shot image classification.

BLIP [108] focuses on multimodal learning by jointly optimizing three objectives during pretraining: an image-text contrastive loss, an image-text matching loss, and a language modeling loss. The method leverages noisy web data by bootstrapping captions, enhancing the training process. CLIP2 [109] aims to build well-aligned and instance-based text-image-point proxies. It learns semantic and instance-level aligned point cloud representations using a cross-modal contrastive objective. FILIP [110] focuses on achieving finer-level alignment in multimodal learning. It incorporates a cross-modal late interaction mechanism that utilizes token-wise maximum similarity between visual and textual tokens. This mechanism guides the contrastive objective and improves the alignment between visual and textual information. FLIP [111] proposes a simple and more efficient training method for CLIP. FLIP randomly masks out and removes a significant portion of image patches during training. This approach aims to improve the training efficiency of CLIP while maintaining its performance.

E. Embodied Multimodal Language Models

An embodied agent is an AI system that interacts with a virtual or physical world. Examples include virtual assistants or robots. Embodied language models are foundation models that incorporate real-world sensor and actuation modalities into pretrained large language models. Typical vision-language models are trained on general vision-language tasks such as image captioning or visual question answering. PaLM-E [6] is a multimodal language model that has been trained not only on internet-scale general vision-language data, but also, simultaneously, on embodied robotics data. In order to connect the model to real-world sensor modalities, PaLM-E's architecture injects (continuous) inputs such as images, low-level states, or 3D neural scene representations into the language embedding space of a decoder-only language model to enable the model to reason about text and other modalities jointly. The main PaLM-E version is built from the PaLM LLM [96] and a ViT [103]. The ViT transforms an image into a sequence of embedding vectors which are projected into the language embedding space via an affine transformation. The whole model is trained end-to-end, starting from a pretrained LLM and ViT model.
The authors also explore different strategies, such as freezing the LLM and just training the ViT, which leads to worse performance. Given multimodal inputs, the output of PaLM-E is text decoded auto-regressively. In order to connect this output to a robot for control, language-conditioned short-horizon policies can be used. In this case, PaLM-E acts as a high-level control policy. Experiments show that a single PaLM-E, in addition to being a vision-language generalist, is able to perform many different robotics tasks over multiple robot embodiments. The model exhibits positive transfer, i.e., simultaneously training on internet-scale language, general vision-language, and embodied domains leads to higher performance compared to training the model on single tasks.

F. Visual Generative Models

Web-scale diffusion models such as OpenAI's DALL-E [112] and DALL-E 2 [88] provide zero-shot text-to-image generation. They are trained on hundreds of millions of image-caption pairs from the internet. These models learn a language-conditioned distribution over images from which an image can be generated using a given prompt. The DALL-E 2 architecture includes a prior that generates a CLIP image embedding from a text caption, and a decoder that generates an image conditioned on the image embedding.

TABLE I. Large Pretrained Models (NPA stands for not publicly available.)
- CLIP [4]: Architecture: ViT-L/14@336px and a text encoder [92]. Size: 0.307B. Training data: 400M image-text pairs. What to pretrain: zero-shot image classification. How to pretrain: contrastive pre-training. Hardware: the fine-tuned CLIP model is trained for 12 days on 256 V100 GPUs.
- GPT-3 [2]: Architecture: transformer (slight modification of GPT-2). Size: 175B. Training data: Common Crawl (about a trillion words). What to pretrain: text output. How to pretrain: autoregressive model. Hardware: NPA.
- GPT-4 [3]: Architecture: NPA. Size: NPA. Training data: NPA. What to pretrain: text output. How to pretrain: NPA. Hardware: NPA.
- PaLI-X [104]: Architecture: encoder-decoder. Size: 55B. Training data: 10B image-text pairs from WebLI [102] and auxiliary tasks. What to pretrain: text and image to text output. How to pretrain: autoregressive model. Hardware: runs on multi-TPU cloud service.
- DALL-E [112]: Architecture: decoder-only transformer. Size: 12B. Training data: 250M text-image pairs. What to pretrain: zero-shot text-to-image generation. How to pretrain: autoregressive model. Hardware: NPA.
- DALL-E 2 [88]: Architecture: a prior based on CLIP + a decoder. Size: 3.5B. Training data: CLIP and DALL-E [112]. What to pretrain: zero-shot text-to-image generation. How to pretrain: diffusion. Hardware: NPA.
- DINOv2 [107]: Architecture: ViT-g/14. Size: 1.1B. Training data: LVD-142M [107]. What to pretrain: visual features (image-level and pixel-level). How to pretrain: discriminative. Hardware: 20 nodes equipped with 8 V100-32GB GPUs.
- SAM [59]: Architecture: MAE [113] vision transformer + CLIP [114] text encoder. Size: 632M for ViT-H + 63M for the CLIP text encoder. Training data: SA-1B dataset [59], which includes 1.1B segmentation masks on 11M images. What to pretrain: zero-shot promptable image segmentation. How to pretrain: supervised learning. Hardware: 256 A100 GPUs for 68 hours.

III. ROBOTICS

In this section, we delve into robot decision-making, planning, and control. Within this realm, Large Language Models (LLMs) and Vision-Language Models (VLMs) may hold the potential to serve as valuable tools for enhancing robotic capabilities. For instance, LLMs may facilitate the process of task specification, allowing robots to receive and interpret high-level instructions from humans. VLMs may also promise contributions to this field. VLMs specialize in the analysis of visual data. This visual understanding is a critical component of informed decision-making and complex task execution for robots. Robots can now leverage natural language cues to enhance their performance in tasks involving manipulation, navigation, and interaction. Vision-language goal-conditioned policy learning, whether through imitation learning or reinforcement learning, holds promise for improvement using foundation models.
Language models also play a role in offering feedback for policy learning techniques. This feedback loop fosters continual improvement in robotic decision-making, as robots can refine their actions based on the feedback received from an LLM. This section underscores the potential contributions of LLMs and VLMs in robot decision-making. Assessing and comparing the contributions of papers in this section presents greater challenges compared to other sections, like the Perception Section (IV) or the Embodied AI Section (V). This is due to the fact that most papers in this section either rely on hardware experiments, using custom elements in the low-level control and planning stack that are not easily transferred to other hardware or other experimental setups, or they utilize non-physics-based simulators, which allow these low-level parts of the stack to be ignored, but leave open the issue of non-transferability between different hardware implementations. In Section VI, we discuss the lack of benchmarking and reproducibility that needs to be addressed in future research.

A. Robot Policy Learning for Decision Making and Control

In this section we discuss robot policy learning, including language-conditioned imitation learning and language-assisted reinforcement learning.

1) Language-Conditioned Imitation Learning for Manipulation: In language-conditioned imitation learning, a goal-conditioned policy \pi_\theta(a_t | s_t, l) is learned that outputs actions a_t ∈ A conditioned on the current state s_t ∈ S and language instruction l ∈ L. The loss function is defined as the maximum-likelihood goal-conditioned imitation objective

\mathcal{L}_{\mathrm{GCIL}} = \mathbb{E}_{(\tau, l) \sim D} \sum_{t=0}^{|\tau|} \log \pi_\theta(a_t \mid s_t, l),    (9)

where D is the language-annotated demonstration dataset D = \{\tau_i\}_{i=1}^{N}. Demonstrations can be represented as trajectories, sequences of images, RGB-D voxel observations, etc. Language instructions are paired with demonstrations to be used as the training dataset. Each language-annotated demonstration \tau_i consists of \tau_i = \{(s_1, l_1, a_1), (s_2, l_2, a_2), \ldots\}. At test time, the robot is given a series of instructions and the language-conditioned visuomotor policy \pi_\theta provides actions a_t in a closed loop given the instruction at each time step. The main challenges in this domain are: (i) obtaining a sufficient volume of demonstrations and conditioning labels to train a policy, and (ii) distribution shift under the closed-loop policy: the feedback of the policy can lead the robot into regions of the state space that are not well covered in the training data, negatively impacting performance. (All the following papers in this subsection focus on robot manipulation tasks.)
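The following is a minimal PyTorch-style sketch of one gradient step on the goal-conditioned imitation objective in Eq. (9), assuming a discretized action space so that the log-likelihood reduces to a cross-entropy. The policy network, the frozen language embeddings, and the batch contents are placeholders, not the architecture of any specific method discussed below.

import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    """Placeholder pi_theta(a_t | s_t, l): state and language features in, action logits out."""
    def __init__(self, state_dim, lang_dim, num_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + lang_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state, lang_emb):
        return self.net(torch.cat([state, lang_emb], dim=-1))

policy = LanguageConditionedPolicy(state_dim=32, lang_dim=64, num_actions=10)
optim = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Placeholder batch of (s_t, l, a_t) tuples from language-annotated demonstrations.
states = torch.randn(128, 32)
lang_embs = torch.randn(128, 64)          # e.g. frozen text-encoder embeddings of l
actions = torch.randint(0, 10, (128,))

logits = policy(states, lang_embs)
loss = nn.functional.cross_entropy(logits, actions)   # = -mean log pi_theta(a_t | s_t, l)
optim.zero_grad()
loss.backward()
optim.step()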
Since generating language-annotated data by pairing demonstrations with language instructions is an expensive process, the authors of Play-LMP [26] propose learning from teleoperated play data. In this setting, reusable latent plan representations are learned from unlabeled play data. Also, a goal-conditioned policy is learned to decode the inferred plan to perform the task specified by the user. In addition, the distributional shift in imitation learning is analyzed, and it is shown in this setting that play data is more robust with respect to perturbation compared to expert positive demonstrations. Note that the language goal l in (9) can be substituted with any other type of goal, for example a goal image, which is another common choice of goal in goal-conditioned imitation learning.

In a follow-up work [28], the authors present multi-context imitation (MCIL), which uses language-conditioned imitation learning over unstructured data. The multi-context imitation framework is based on relabeled imitation learning and labeled instruction following. MCIL assumes access to multiple contextual imitation datasets, for example, goal image demonstrations, language goal demonstrations, or one-hot task demonstrations. MCIL trains a single latent goal-conditioned policy over all datasets simultaneously by encoding contexts in a shared latent space using the associated encoder for each context. Then a goal-conditioned imitation loss is computed by averaging over all datasets. The policy and goal encoders are trained end-to-end. Another approach to tackle the data annotation challenge in language-conditioned imitation learning involves utilizing foundation models to offer feedback by labeling demonstrations. In [115], the authors propose to use pretrained foundation models to provide feedback. To deploy a trained policy to a new task or new environment, the policy is played using randomly generated instructions, and a pretrained foundation model provides feedback by labeling the demonstration. Also, this paired instruction-demonstration data can be used for policy fine-tuning.

CLIPort [25] also presents language-conditioned imitation learning for vision-based manipulation. A two-stream architecture is presented that combines the semantic understanding of CLIP with the spatial precision of Transporter [116]. This end-to-end framework solves language-specified manipulation tasks without any explicit representation of the object poses or instance segmentation. CLIPort grounds semantic concepts in precise spatial reasoning, but it is limited to 2D observation and action spaces. To address this limitation, the authors of PerAct (Perceiver-Actor) [27] propose to represent observation and action spaces with 3D voxels and employ the 3D structure of voxel patches for efficient language-conditioned behavioral cloning with transformers to imitate 6-DoF manipulation tasks from just a few demonstrations. While 2D behavioral cloning methods such as CLIPort are limited to single-view observations, 3D approaches such as PerAct allow for multi-view observations as well as 6-DoF action spaces. PerAct uses only CLIP's language encoder to encode the language goal. PerAct takes language goals and RGB-D voxel observations as inputs to a Perceiver Transformer and outputs discretized actions by detecting the next best voxel action. PerAct is trained through supervised learning with discrete-time input actions from the demonstration dataset. The demonstration dataset includes voxel observations paired with language goals and keyframe action sequences. An action consists of a 6-DoF pose, gripper open state, and collision avoidance action. During training, a tuple is randomly sampled and the agent predicts the keyframe action given the observation and goal. Grounding semantic representations in a spatial environment is essential for effective robot interaction. CLIPort and PerAct utilize CLIP (which is trained based on contrastive learning) for semantic reasoning, and Transporter and Perceiver for spatial reasoning.

Voltron [29] presents a framework for language-driven representation learning in robotics. Voltron captures semantic, spatial, and temporal representations that are learned from videos and captions. Contrastive learning captures semantic representations but loses spatial relationships; in contrast, masked autoencoding captures spatial but not semantic representations.
Voltron trades off language-conditioned visual reconstruction for local spatial representations and visually grounded language generation to capture semantic representations. This framework includes grasp affordance prediction, single-task visuomotor control, referring expression grounding, language-conditioned imitation, and intent-scoring tasks. Voltron models take videos and their associated language captions as input to a multimodal encoder whose outputs are then decoded to reconstruct one or more frames from a masked context. Voltron starts with a masked autoencoding backbone and adds a dynamic component to the model by conditioning the MAE encoder on a language prefix. Temporal information is captured by conditioning on multiple frames.

Deploying robot policy learning techniques that leverage language-conditioned imitation learning with real robots presents ongoing challenges. These models rely on end-to-end learning, where the policy maps pixels or voxels to actions. As they are trained through supervised learning on demonstration datasets, they are susceptible to issues related to generalization and distribution shifts. To improve robustness and adaptability, techniques such as data augmentation and domain adaptation can make the policies more robust to distribution shift.

CACTI [14] is a novel framework designed to enhance scalability in robot learning using foundation models such as Stable Diffusion [117]. CACTI introduces the four stages of data collection, data augmentation, visual representation learning, and imitation policy training. In the data collection stage, limited in-domain expert demonstration data is collected. In the data augmentation stage, CACTI employs visual generative models such as Stable Diffusion [117] to boost visual diversity by augmenting the data with scene and layout variations. In the visual representation learning stage, CACTI leverages pretrained zero-shot visual representation models trained on out-of-domain data to improve training efficiency. Finally, in the imitation policy training stage, a general multi-task policy is learned using imitation learning on the augmented dataset with compressed visual representations as input. CACTI is trained for multi-task and multi-scene manipulation in kitchen environments, both in simulation and in the real world. The use of these techniques enhances the generalization ability of the framework and enables it to learn from a wide range of environments.

Beyond language, recent works have investigated other forms of task specification. Notably, MimicPlay [118] presents a hierarchical imitation learning algorithm that learns high-level plans in latent spaces from human play data and low-level motor commands from a small number of teleoperated demonstrations. By harnessing the complementary strengths of these two data sources, this algorithm can significantly reduce the cost of training visuomotor policies for long-horizon manipulation tasks. Once trained, it is capable of performing new tasks based on one human video demonstration at test time. MUTEX [119] further explores learning a unified policy across multimodal task specifications in video, image, text, and audio, showing improved policy performance over single-modality baselines through cross-modal learning.

2) Language-Assisted Reinforcement Learning: Reinforcement learning (RL) is a family of methods that enable a robot to optimize a policy through interaction with its environment by optimizing a reward function. These interactions usually take place in a simulation environment, sometimes augmented with data from physical robot hardware for sim-to-real transfer. RL has close ties to optimal control. Unlike imitation learning, RL does not require human demonstrations, and (in theory) has the potential to attain super-human performance.
In the RL problem, the expected return of a policy is maximized using roll-outs collected from interactions with the environment. The feedback received from the environment in the form of a reward signal guides the robot to learn which actions lead to favorable results and which do not. In this section, we discuss works that have incorporated foundation models (LLMs, VLMs, etc.) into RL problems.

Fast and flexible adaptation is a desired capability of artificial agents and is essential for progress toward general intelligence. In Adaptive Agent (AdA) [30], the authors present an RL foundation model, that is, an agent pretrained on diverse tasks and designed to quickly adapt to open-ended embodied 3D problems by using fast in-context learning from feedback. This work considers navigation, coordination, and division-of-labor tasks. Given a few episodes within an unseen environment at test time, the agent engages in trial-and-error exploration to refine its policy toward optimal performance. In AdA, a transformer architecture is trained using model-based RL² [120] to train agents with large-scale attention-based memory, which is required for adaptation. Transformer-XL [121], with some modifications, is used to enable long and variable-length context windows to increase the model memory and capture long-term dependencies. The agent collects diverse data in the XLand environment, which includes 10^40 possible tasks [122], in an automated curriculum. In addition, distillation is used to enable scaling to models with more than 500M parameters.

Palo et al. [15] propose an approach to enhance reinforcement learning by integrating Large Language Models (LLMs) and Vision-Language Models (VLMs) to create a more unified RL framework. This work considers robot manipulation tasks. Their approach addresses core RL challenges related to exploration, experience reuse and transfer, skill scheduling, and learning from observation. The authors use an LLM to decompose complex tasks into simpler sub-tasks, which are then utilized as inputs for a transformer-based agent to interact with the environment. The agent is trained using a combination of supervised and reinforcement learning, enabling it to predict the optimal sub-task to execute based on the current state of the environment.

B. Language-Image Goal-Conditioned Value Learning

In value learning, the aim is to construct a value function that aligns goals in different modalities and preserves temporal coherence due to the recursive nature of the value function. Reusable Representation for Robotic Manipulation (R3M) [31] provides pretrained visual representations for robot manipulation using diverse human video datasets such as Ego4D, and can be used as a frozen perception module for policy learning in robot manipulation tasks. R3M's pretrained visual representation is demonstrated on a Franka Emika Panda arm and enables different downstream manipulation tasks. R3M is trained using time-contrastive learning to capture temporal dependencies, video-language alignment to capture semantic features of the scene (such as objects and their relationships), and an L1 penalty to encourage sparse and compact representations. For a batch of videos, using a time-contrastive loss, an encoder is trained to generate a representation wherein the distance between images that are temporally closer is minimized compared to images that are farther apart in time or from different videos.
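As a simplified illustration of the time-contrastive idea used by R3M (and, below, VIP), the following sketch pulls together embeddings of temporally close frames and pushes apart embeddings of distant frames or frames from other videos. It shows only the generic InfoNCE structure, not the exact R3M objective.

import numpy as np

def time_contrastive_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss: anchor/positive are embeddings of temporally close frames;
    negatives are embeddings of temporally distant frames or frames from other videos."""
    def sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(sim(anchor, positive) / tau)
    neg = sum(np.exp(sim(anchor, n) / tau) for n in negatives)
    return -np.log(pos / (pos + neg))

# Placeholder frame embeddings from a visual encoder (dimension 128).
rng = np.random.default_rng(0)
anchor, positive = rng.normal(size=128), rng.normal(size=128)
negatives = [rng.normal(size=128) for _ in range(8)]
print(time_contrastive_loss(anchor, positive, negatives))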
Similar to R3M, Value-Implicit Pretraining (VIP) [36] employs time-contrastive learning to capture temporal dependencies in videos, but it does not require video-language alignment. VIP is also focused on robot manipulation tasks. VIP is a self-supervised approach for learning visual goal-conditioned value functions and representations from videos. VIP learns visual goal-based rewards for downstream tasks and can be used for zero-shot reward specification. The reward model is derived from pretrained visual representations. Pretraining involves using unlabeled human videos. Human videos do not contain any action information to be used for robot policy learning; therefore, the learned value function does not explicitly depend on actions. VIP introduces a novel time-contrastive objective that generates a temporally smooth embedding. The value function is implicitly defined via a distance embedding. The proposed implicit time-contrastive learning attracts the representations of the initial and goal frames in the same trajectory and repels the representations of intermediate frames via recursive one-step temporal-difference minimization. This representation captures long-term temporal dependencies across task frames and local temporal smoothness among adjacent frames.

Language-Image Value Learning (LIV) [37] is a control-centric vision-language representation. LIV generalizes the prior work VIP by learning multi-modal vision-language value functions and representations using language-aligned videos. For tasks specified as language goals or image goals, a multi-modal representation is trained that encodes a universal value function. LIV is also focused on robot manipulation tasks. LIV is a pretrained control-centric vision-language representation based on large human video datasets such as EPIC-KITCHENS [123]. The representations are kept frozen during policy learning. A simple MLP is used on top of the pretrained representations for the policy network. Policy learning is decoupled from language-visual representation pretraining. The LIV model is pretrained on arbitrary video activity datasets with text annotations, and the model can be fine-tuned on small datasets of in-domain robot data to ground language in a context-specific way. LIV uses a generalization of the mutual-information-based image-text contrastive representation learning objective used in CLIP, so LIV can be considered a combination of CLIP and VIP. Both VIP and LIV learn a self-supervised goal-conditioned value-function objective using contrastive learning. LIV extends the VIP framework to multi-modal goal specifications. LOREL [38] learns a language-conditioned reward from offline data and uses it during model predictive control to complete language-specified tasks.

Value functions can be used to help ground semantic information obtained from an LLM in the physical environment in which a robot is operating. By leveraging value functions, a robot can associate the information processed by the LLM with specific locations and objects in its surroundings. In SayCan [32], researchers investigate the integration of large language models with the physical world through learning. They use the language model to provide task-grounding (Say), enabling the determination of useful sub-goals based on high-level instructions, and a learned affordance function to achieve world-grounding (Can), enabling the identification of feasible actions to execute the plan. Inner Monologue [33] studies the role of grounded environment feedback provided to the LLM, thus closing the loop with the environment. The feedback is used for robot planning with large language models by leveraging a collection of perception models (e.g., scene descriptors and success detectors) in tandem with pretrained language-conditioned robot skills.
Feedback includes task-specific feedback, such as success detection, and scene-specific feedback (either "passive" or "active"). Both SayCan and Inner Monologue consider robot manipulation and navigation tasks using a real-world mobile manipulator robot from Everyday Robots. Text2Motion [124] is a language-based planning framework for long-horizon robot manipulation. Similar to SayCan and Inner Monologue, Text2Motion computes a score (S_LLM) associated with each skill at each time step. The task planning problem is to find a sequence of skills that maximizes the likelihood of a skill sequence given a language instruction and the initial state. In Text2Motion, the authors propose to verify that the generated long-horizon plans are symbolically correct and geometrically feasible. Hence, a geometric feasibility score (S_geo) is defined as the probability that all the skills in the sequence achieve rewards. To compute the overall score, the LLM score is multiplied by the geometric feasibility score (S_skill = S_LLM · S_geo).

VoxPoser [34] builds 3D value maps to ground affordances and constraints in the perceptual space. VoxPoser considers robot manipulation tasks. Given an RGB-D observation of the environment and a language instruction, VoxPoser utilizes large language models to generate code, which interacts with vision-language models to extract a sequence of 3D affordance maps and constraint maps. These maps are composed together to create 3D value maps. The value maps are then utilized as objective functions to guide motion planners to synthesize trajectories for everyday manipulation tasks without requiring any prior training or instruction.

In [35], reward shaping using CLIP is presented. This work considers robot manipulation tasks. The proposed model utilizes CLIP to ground objects in a scene described by the goal text, paired with spatial relationship rules, to shape the reward using raw pixels as input. The authors use developments in building large-scale visuo-lingual models like CLIP to devise a framework that generates the task reward signal from just the goal text description and raw pixel observations. This signal is then used to learn the task policy.

In [125], Hierarchical Universal Language Conditioned Policies 2.0 (HULC++) is presented. This work considers robot manipulation tasks. A self-supervised visuo-lingual affordance model is used to learn general-purpose language-conditioned robot skills from unstructured offline data in the real world. This method requires annotating as little as 1% of the total data with language. The visuo-lingual affordance model has an encoder-decoder architecture with two decoder heads. Both heads share the same encoder and are conditioned on the input language instruction. One head predicts a distribution over the image, in which each pixel's likelihood is an afforded point. The other head predicts a Gaussian distribution from which the corresponding predicted depth is sampled. Given visual observations and language instructions as input, the affordance model outputs a pixel-wise heatmap that represents affordance regions and the corresponding depth map.

C. Robot Task Planning Using Large Language Models

LLMs can be used to provide high-level task planning for performing complex long-horizon robot tasks.

1) Language Instructions for Task Specification: As discussed above, SayCan [32] uses an LLM for high-level task planning in language, though with a learned value function to ground these instructions in the environment. Temporal logic is useful for imposing temporal specifications in robotic systems. In [39], translation from natural language (NL) to temporal logic (TL) is proposed. A dataset with 28k NL-TL pairs is created, and the T5 [126] model is fine-tuned using the dataset. LLMs are often used to plan task sub-goals. This work considers robot navigation tasks.
In [40], instead of direct task planning, a few-shot translation from a natural language task description to an intermediary task representation is performed. This representation is used by a Task and Motion Planning (TAMP) algorithm to jointly optimize task and motion plans. Autoregressive re-prompting is used to correct syntactic and semantic errors. This work also considers robot navigation tasks.

2) Code Generation Using Language Models for Task Planning: Classical task planning requires extensive domain knowledge, and the search space is large [127], [128]. LLMs can be used to generate sequences of tasks required to achieve a high-level task. In ProgPrompt [41], the authors introduce a prompting method that uses LLMs to generate sequences of actions directly, with no additional domain knowledge. The prompt to the LLM includes specifications of the available actions, objects in the environment, and example programs that can be executed. VirtualHome [129] is used as a simulator for demonstration.

Code-as-Policies [42] explores the use of code-writing LLMs to generate robot policy code based on natural language commands. This work considers robot manipulation and navigation tasks using a real-world mobile manipulator robot from Everyday Robots. The study demonstrates that LLMs can be repurposed to write policy code by expressing functions or feedback loops that process perception outputs and invoke control primitive APIs. To achieve this, the authors utilize few-shot prompting, where example language commands formatted as comments are provided along with the corresponding policy code. Without any additional training on this data, they enable the models to autonomously compose API calls and generate new policy code when given new commands. The approach leverages classic logic structures and references third-party libraries like NumPy and Shapely to perform arithmetic operations. By chaining these structures and using contextual information (behavioral commonsense), the LLMs can generate robot policies that exhibit spatial-geometric reasoning, generalize to new instructions, and provide precise values (e.g., velocities) for ambiguous descriptions such as "faster." The concept of "code as policies" formalizes the generation of robot policies using language model-generated programs (LMPs). These policies can represent reactive policies like impedance controllers, as well as waypoint-based policies such as vision-based pick and place or trajectory-based control. The effectiveness of this approach is demonstrated on multiple real robot platforms. A crucial aspect of this approach is the hierarchical code generation process, which involves recursively defining undefined functions. This enables the LLMs to generate more complex code structures to fulfill the desired policy requirements.

In [43], the authors provide design principles for using ChatGPT in robotics and demonstrate how LLMs can help robotic capabilities rapidly generalize to different form factors. This work considers robot manipulation and aerial navigation tasks. First, a high-level robot function library that maps to multiple atomic tasks executable by the robot is defined. Then, a prompt is crafted that includes these functions and the required constraints along with the task description. ChatGPT then provides executable code specific to the given robot configuration and task. The generated code can then be evaluated by a user, and appropriate feedback with modified prompts to the LLM further helps refine and generate programs that are safe and deployable on the physical robot. The study demonstrated that such a methodology can be applied to multiple form factors, both in simulation and in the real world.
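To illustrate the few-shot prompting pattern shared by ProgPrompt, Code-as-Policies, and [43], the following is a minimal sketch that assembles an API specification, a worked example, and a new command into a single prompt. The primitive names (detect_objects, pick, place) and the query_llm call are hypothetical placeholders, not the APIs used in those papers.

# Hypothetical robot API exposed to the language model as control primitives.
ROBOT_API_STUB = """\
objects = detect_objects()        # returns a dict of object name -> 2D position
pick(obj_name)                    # grasp a detected object
place(obj_name, target_position)  # place the held object at a position
"""

FEW_SHOT_EXAMPLES = """\
# Command: put the red block on the blue bowl.
objects = detect_objects()
pick("red block")
place("red block", objects["blue bowl"])
"""

def build_policy_prompt(command: str) -> str:
    """Assemble a few-shot prompt: API spec + example programs + the new command."""
    return (
        "# Available robot primitives:\n" + ROBOT_API_STUB + "\n"
        + FEW_SHOT_EXAMPLES + "\n"
        + f"# Command: {command}\n"
    )

prompt = build_policy_prompt("move the green cup next to the sponge, slowly")
# generated_code = query_llm(prompt)   # hypothetical LLM call; the returned string would be
# exec(generated_code, robot_scope)    # executed against a (sandboxed) robot API scope
print(prompt)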
D. In-context Learning (ICL) for Decision-Making

In-context learning (ICL) [130] operates without the need for parameter optimization, relying instead on a set of examples included in the prompt (the concept of prompting). This learning approach is intimately linked with prompt engineering and finds extensive use in natural language processing. The method of Chain-of-Thought [131] is a prominent technique within in-context learning. It involves executing a sequence of intermediate steps to arrive at the final solution for complex, multi-step problems. This technique allows models to produce step-by-step explanations that parallel human cognitive processes. However, despite its numerous benefits, ICL also faces certain challenges, including issues related to ambiguity and interpretation, domain-specific knowledge, transparency, and explainability. In-context learning has had a significant impact on the field of LLMs in a broad sense, and many robotics works have used it to apply LLMs to specific domains. Investigating this, Mirchandani and colleagues [132] illustrate that large language models (LLMs) possess remarkable pattern-recognition abilities. They reveal that, through in-context learning, LLMs can effectively handle general patterns that extend beyond standard language-based prompts. This capability allows for the application of LLMs in scenarios such as offline trajectory optimization and online, in-context reinforcement learning. Additionally, Jia et al., in their work on Chain-of-Thought Predictive Control [133], propose a method to identify specific brief sequences within demonstrations, termed 'chain-of-thought'. They focus on understanding and representing the hierarchical structure of these sequences, highlighting the achievement of subgoals within tasks. This work considers robot policy learning from demonstrations for contact-rich object manipulation tasks.

E. Robot Transformers

Foundation models can be used for end-to-end control of robots by providing an integrated framework that combines perception, decision-making, and action generation.

Xiao et al. [48] demonstrate the effectiveness of self-supervised visual pretraining using real-world images for learning motor control tasks directly from pixel inputs. This work is focused on robot manipulation tasks. They show that, without any task-specific fine-tuning of the pretrained encoder, the visual representations can be utilized for various motor control tasks. This approach highlights the potential of leveraging self-supervised learning from real-world images to acquire general visual representations that can be applied across different motor control tasks. Similarly, Radosavovic et al. [49] investigate the use of self-supervised visual pretraining on diverse, in-the-wild videos for real-world robotic tasks. This work considers robot manipulation tasks. They find that the pretrained representations obtained from such videos are effective in a range of real-world robotic tasks, considering different robotic embodiments. This suggests that the learned visual representations generalize well across various tasks and robot platforms, demonstrating the broad applicability of self-supervised pretraining for real-world robotic applications.
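A minimal sketch of the pattern shared by the two works above is given below: freeze a pretrained visual encoder and train only a small control head on top. Here a torchvision ResNet-50 is an assumed stand-in for the self-supervised encoders used in [48], [49], and the 7-dimensional action head is purely illustrative.

```python
# Sketch: frozen pretrained visual encoder + small trainable policy head
# (ResNet-50 is a stand-in for the encoders used in [48], [49]).
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

encoder = resnet50(weights=ResNet50_Weights.DEFAULT)
encoder.fc = nn.Identity()          # expose 2048-d features
for p in encoder.parameters():      # keep the pretrained encoder frozen
    p.requires_grad = False
encoder.eval()

policy_head = nn.Sequential(        # only this small head is trained for control
    nn.Linear(2048, 256), nn.ReLU(),
    nn.Linear(256, 7),              # e.g., a 7-DoF end-effector action (illustrative)
)

def act(image_batch: torch.Tensor) -> torch.Tensor:
    """image_batch: (B, 3, 224, 224), normalized as expected by the encoder."""
    with torch.no_grad():
        features = encoder(image_batch)
    return policy_head(features)
```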
Both studies emphasize the advantages of self-supervised visual pretraining, where models are trained on large amounts of unlabeled data to learn useful visual representations. By leveraging real-world images and videos, these approaches enable learning from diverse and unstructured visual data, leading to more robust and transferable representations for motor control tasks in robotic systems.

Another example of a Transformer-based policy model is the work on Robotics Transformer (RT-1) [44], where the authors demonstrate a model that shows promising scalability properties. To train the model, the authors use a large dataset of over 130k real-world robotic experiences, comprising more than 700 tasks, that was collected over 17 months using a fleet of 13 robots. RT-1 receives images and natural language instructions as inputs and outputs discretized base and arm actions. It can generalize to new tasks, maintain robustness in changing environments, and execute long-horizon instructions. The authors also demonstrate the model's capability to effectively absorb data from diverse domains, including simulations and different robots.

The follow-up work, called Robotics Transformer 2 (RT-2) [45], demonstrates a vision-language-action (VLA) model that takes a step further by learning from both web and robotics data. The model effectively utilizes this data to generate generalized actions for robotic control. To do so, the authors use pre-existing vision-language models and directly co-fine-tune them on robot trajectories, resulting in a single model that operates as a language model, a vision-language model, and a robot policy. To make co-fine-tuning possible, the actions are represented as simple text strings which are then tokenized using an LLM tokenizer into text tokens. The resulting model, RT-2, enables vision-language models to output low-level closed-loop control. Similarly to RT-1, actions are produced based on robot instructions paired with camera observations, and the action space includes 6-DoF positional and rotational displacement of the robot end-effector, gripper extension, and an episode termination command. Via extensive experiments, the authors show that utilizing VLMs aids in the enhancement of generalization across visual and semantic concepts and enables the robots to respond to so-called chain-of-thought prompting, where the agent performs more complex, multi-stage semantic reasoning. Both RT-1 and RT-2 consider robot manipulation and navigation tasks using a real-world mobile manipulator robot from Everyday Robots. One key limitation of RT-2 and other related works in robotics is the fact that the range of physical skills exhibited by the robot is limited to the distribution of skills observed within the robot's data. While one way to approach this limitation is to collect more diverse and dexterous robotic data, there might be other intriguing research directions such as using motion data in human videos, robotic simulations, or other robotic embodiments.

The next work utilizing the Transformer architecture indeed focuses on learning from data that combines multiple robotic embodiments. In RT-X [46], the authors provide a number of datasets in a standardized data format and models to make it possible to explore the possibility of training large cross-embodied robotic models in the context of robotic manipulation. In particular, they assembled a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). With this unified dataset, RT-X demonstrates that RT-1- and RT-2-based models trained on this multi-embodiment, diverse data exhibit positive transfer across robotic domains and improve the capabilities of multiple robots by leveraging experience from other platforms.
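As a rough illustration of the action-as-text representation described for RT-2 above, the sketch below discretizes a continuous end-effector action into integer bins and renders it as a text string that an LLM tokenizer could consume. The bin count, action bounds, and string format are illustrative assumptions; RT-2's exact discretization and vocabulary are not reproduced here.

```python
# Sketch: representing a continuous robot action as a text string of discrete bins.
import numpy as np

N_BINS = 256

def action_to_text(action: np.ndarray, low: np.ndarray, high: np.ndarray) -> str:
    """Map each action dimension to an integer bin in [0, N_BINS-1] and join as text."""
    normalized = (action - low) / (high - low)
    bins = np.clip((normalized * (N_BINS - 1)).round().astype(int), 0, N_BINS - 1)
    return " ".join(str(b) for b in bins)

def text_to_action(text: str, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Invert the mapping: parse bin indices and map back to continuous values."""
    bins = np.array([int(tok) for tok in text.split()], dtype=float)
    return low + (bins / (N_BINS - 1)) * (high - low)

# Example: 6-DoF displacement + gripper + terminate flag (8 dimensions, illustrative bounds).
low = np.array([-0.05] * 6 + [0.0, 0.0])
high = np.array([0.05] * 6 + [1.0, 1.0])
action = np.array([0.01, -0.02, 0.0, 0.0, 0.005, 0.0, 1.0, 0.0])
token_string = action_to_text(action, low, high)   # e.g. "153 76 127 ..."
```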
Other works have investigated general pretrained transformers for robot control, trained with self-supervised trajectory data from multiple robots. For example, Perception-Action Causal Transformer (PACT) [47] is a generative transformer architecture that builds representations from robot data with self-supervision. This work considers robot navigation tasks. PACT pretrains a representation useful for multiple tasks on a given robot. Similar to how large language models learn from extensive text data, PACT is trained on abundant safe state-action data (trajectories) from a robot, learning to predict appropriate safe actions. By predicting states and actions over time in an autoregressive manner, the model implicitly captures dynamics and behaviors specific to a robot. PACT was tested in experiments involving mobile agents: a wheeled robot with a LiDAR sensor (MuSHR) and a simulated agent using first-person RGB images (Habitat). The results show that this robot-specific representation can serve as a starting point for tasks like safe navigation, localization, and mapping. Additionally, the experiments demonstrated that fine-tuning smaller task-specific networks on the pretrained model leads to significantly better performance compared to training a single model from scratch for all tasks simultaneously, and comparable performance to training a separate large model for each task independently.

Another work in this space is Self-supervised Multi-task pretrAining with contRol Transformer (SMART) [134], which introduces self-supervised multi-task pretraining for control transformers, providing a pretraining-finetuning approach tailored for sequential decision-making tasks. During the pretraining phase, SMART captures information essential for both short-term and long-term control, facilitating transferability across various tasks. Subsequently, the finetuning process can adapt to a wide variety of tasks spanning diverse domains. Experimentation underscores SMART's ability to enhance learning efficiency across tasks and domains. This work considers cartpole swing-up, cartpole balance, hopper hop, hopper stand, cheetah run, walker stand, walker run, and walker walk tasks. The approach demonstrates robustness against distribution shifts and proves effective with low-quality pretraining datasets.

Some works have investigated transformer models in conjunction with classical planning and control layers as part of a modular robot control architecture. For example, in [50], a multi-modal transformer (LATTE) is presented that allows a user to reshape robot trajectories using language instructions. This work considers both robot manipulation and navigation tasks. The LATTE transformer takes as input geometrical features of an initial trajectory guess along with the obstacle map configuration, language instructions from a user, and images of each object in the environment. The model's output is a modification of each waypoint in the trajectory so that the final robot motion can adhere to the user's language instructions. The initial trajectory plan can be generated using any geometric planner such as A*, RRT*, or model predictive control. Subsequently, this plan is enriched with the semantic objectives within the model. LATTE leverages pretrained language and visual-language models to harness semantic representations of the world.

F. Open-Vocabulary Robot Navigation and Manipulation

1) Open-Vocabulary Navigation: Open-vocabulary navigation addresses the challenge of navigating through unseen environments. The open-vocabulary capability signifies that the robot possesses the capacity to comprehend and respond to language cues, instructions, or semantic information, without being restricted to a predefined dataset.
In this section, we explore papers that examine the integration of LLMs, VLMs, or a combination of both in a plug-and-play manner for robot navigation tasks. Additionally, we discuss papers that take a different approach by constructing foundation models explicitly tailored for robot navigation tasks.

In VLN-BERT [135], the authors present a visual-linguistic transformer-based model that leverages multi-modal visual and language representations for visual navigation using web data. The model is designed to score the compatibility between an instruction, such as "...stop at the brown sofa," and a sequence of panoramic RGB images captured by the agent.

Similarly, LM-Nav [136] considers visual navigation tasks. LM-Nav is a system that utilizes pretrained models of images and language to provide a textual interface to visual navigation. LM-Nav demonstrates visual navigation in a real-world outdoor environment from natural language instructions. LM-Nav utilizes an LLM (GPT-3 [2]), a VLM (CLIP [4]), and a VNM (Visual Navigation Model). First, LM-Nav constructs a topological graph of the environment via the VNM estimating the distance between images. The LLM is then used to translate the natural instructions into sequences of intermediate language landmarks. The VLM is used to ground the visual observations in landmark descriptions via a joint probability distribution over landmarks and images. Using the VLM's probability distribution, the LLM instructions, and the VNM's graph connectivity, the optimal path is planned using a search algorithm. Then the plan is executed by the goal-conditioned policy of the VNM.

While LM-Nav makes use of LLMs and VLMs as plug-and-play components for visual navigation tasks, the authors of ViNT [137] propose to build a foundation model for visual navigation tasks. ViNT is an image-goal-conditioned navigation policy trained on diverse training data and can control different robots in zero-shot fashion. It can be fine-tuned to be adapted to different robotic platforms and various downstream tasks. ViNT is trained on various navigation datasets from different robotic platforms. It is trained with goal-reaching objectives and utilizes a Transformer-based architecture to learn navigational affordances. ViNT encodes visual observations and visual goals using an EfficientNet CNN and predicts temporal distance and normalized actions in an embodiment-agnostic manner. Additionally, ViNT can be augmented with diffusion-based sub-goal proposals to help explore environments not encountered during training. An image-to-image diffusion model generates sub-goal images, which ViNT then navigates toward while building a topological map in the background.

Another work that considers zero-shot navigation tasks is Audio Visual Language Maps (AVLMaps) [138]. AVLMaps presents a 3D spatial map representation for cross-modal information from audio, visual, and language cues. AVLMaps receives multi-modal prompts and performs zero-shot navigation tasks in the real world. The inputs are depth and RGB images, camera pose, and audio. Visual features are encoded using pretrained foundation models. Visual localization features (using NetVLAD [139], SuperPoint [140]), visual-language features (using LSeg [58]), and audio-language features (using AudioCLIP [141]) are computed, and predictions from different modalities are combined into 3D heatmaps. The pixel-wise joint probability of the heatmap is computed and used for planning. Additionally, navigation policies are generated as executable code with the help of GPT-3. Finally, 3D heatmaps are predicted indicating the location of multimodal concepts such as objects, sounds, and images.
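A highly simplified sketch of the plug-and-play pipeline described above for LM-Nav is given below. It assumes the topological graph (with VNM-estimated edge distances), per-node CLIP image features, and the LLM-extracted landmark phrases are already available, and it replaces the paper's probabilistic search with greedy, landmark-by-landmark shortest paths.

```python
# Simplified sketch of LM-Nav-style planning: ground LLM landmarks on graph nodes
# with CLIP scores, then chain shortest paths (a stand-in for the paper's search).
import numpy as np
import networkx as nx

def plan_over_landmarks(graph: nx.Graph,
                        node_image_feats: dict,      # node -> unit-norm CLIP image feature
                        landmark_text_feats: list,   # unit-norm CLIP text features, in order
                        start_node):
    path = [start_node]
    current = start_node
    for text_feat in landmark_text_feats:
        # Pick the node whose image best matches this landmark description.
        scores = {n: float(np.dot(f, text_feat)) for n, f in node_image_feats.items()}
        target = max(scores, key=scores.get)
        # Connect to it along the VNM-weighted topological graph.
        segment = nx.shortest_path(graph, current, target, weight="distance")
        path.extend(segment[1:])
        current = target
    return path
```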
Many roboticists may wonder about the comparative strengths of classical modular robot navigation systems versus end-to-end learned systems. Semantic navigation [142] seeks to address this question by presenting an empirical analysis of semantic visual navigation methods. The study compares representative approaches from classical, modular, and end-to-end learning paradigms across six different homes, without any prior knowledge, maps, or instrumentation. The findings of the study reveal that modular learning methods perform well in real-world scenarios. In contrast, the end-to-end learning approaches face challenges due to a significant domain gap between simulated and real-world images. This domain gap hinders the effectiveness of end-to-end learning methods in real-world navigation tasks. For practitioners, the study emphasizes that modular learning is a reliable approach to object navigation. The modularity and abstraction in policy design enable successful transfer from simulation to reality, making modular learning an effective choice for practical implementations. For researchers, the study also highlights two critical issues that limit the reliability of current simulators as evaluation benchmarks. Firstly, there exists a substantial Sim-to-Real gap in images, which hampers the transferability of learned policies from simulation to the real world. Secondly, there is a disconnect between simulation and real-world error modes, which further complicates the evaluation process.

Another line of work in open-vocabulary navigation is object navigation tasks. In this task, the robot must be able to find the object described by humans and navigate towards the object. The navigation task is decomposed into exploration, when the language target is not detected, and exploitation, when the target is detected and the robot navigates toward the target. As the robot moves in the environment, it creates a top-down map using RGB-D observations and pose estimates. In [143], the authors introduce a zero-shot object navigation setting that uses an open-vocabulary classifier such as CLIP [4] to compute the cosine similarity between an image and a user-specified description.

Common datasets and benchmarks for these types of problems are Matterport3D [144], [145], Gibson [146], and Habitat [147]. L3MVN [148] enhances visual target navigation by constructing an environment map and selecting long-term goals using the inference capabilities of large language models. The system can determine appropriate long-term goals for navigation by leveraging pretrained language models such as RoBERTa-large [149], enabling efficient exploration and searching. Chen et al. [150] present a training-free and modular system for object goal navigation, which constructs a structured scene representation through active exploration. The system utilizes semantic information in the scene graphs to deduce the location of the target object and integrates semantics with the geometric frontiers to enable the agent to navigate effectively to the most promising areas for object search while avoiding detours in unfamiliar environments. HomeRobot [151] introduces a benchmark for the Open-Vocabulary Mobile Manipulation (OVMM) task. The OVMM task is the problem of finding an object in any unseen environment, navigating towards the object, picking it up, and navigating towards a goal location to place the object. HomeRobot provides a benchmark in simulation and the real world for OVMM tasks.
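Returning to the zero-shot object navigation setting of [143] above, the following is a minimal sketch of the image-text cosine-similarity scoring it relies on, using the Hugging Face transformers CLIP interface as an assumed dependency.

```python
# Sketch: scoring an observation against a user-specified object description with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, description: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[description], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())

# A navigation stack might switch from exploration to exploitation once this
# similarity (or a detector confidence) exceeds a threshold for the queried object.
```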
2) Open-Vocabulary Manipulation: Open-vocabulary manipulation refers to the problem of manipulating any object in a previously unseen environment. VisuoMotor Attention Agent (VIMA) [152] learns robot manipulation from multi-modal prompts. VIMA is a transformer-based agent that predicts motor commands conditioned on a task prompt and a history of interactions. VIMA introduces a new form of task specification that combines textual and visual tokens. Multi-modal prompting converts different robot manipulation tasks, such as visual goal-reaching, learning from visual demonstrations, and novel concept grounding, into one sequence modeling problem. It offers the training of a unified policy across diverse tasks, potentially allowing for zero-shot generalization to previously unseen ones. VIMA-BENCH is introduced as a benchmark for multi-modal robot learning. The VIMA-BENCH simulator supports collections of objects and textures that can be utilized in multi-modal prompting. RoboCat [153] is a self-improving AI agent. It uses a 1.18B-parameter decoder-only transformer. It learns to operate different robotic arms, solves tasks from as few as 100 demonstrations, and improves from self-generated data. RoboCat is based on the Gato [154] architecture and is trained with a self-improvement cycle.

For robots to operate effectively in the real world, they must be able to manipulate previously unseen objects. Liu et al. present StructDiffusion [155], which seeks to enable robots to use partial-view point clouds and natural language instructions to construct a goal configuration for objects that were previously seen or unseen. They accomplish this by first using segmentation to break up the scene into objects. Then they use a multi-modal transformer to combine word and point cloud embeddings and output a 6-DoF goal pose prediction. The predictions are iteratively refined via diffusion and a discriminator that is trained to determine if a sampled configuration is feasible. Manipulation of Open-World Objects (MOO) [156] leverages a pretrained vision-language model to extract object-centric information from the language command and the image, and conditions the robot policy on the current image, the instructions, and the extracted object information in the form of a single pixel overlaid onto the image. MOO uses OWL-ViT for object detection and RT-1 for language-conditioned policy learning.

Another task in robot manipulation involves autonomous scene rearrangement and in-painting. DALL-E-Bot [157] performs zero-shot autonomous rearrangement in the scene in a human-like way using the pretrained image diffusion model DALL-E 2 [88]. DALL-E-Bot's autonomous object rearrangement does not require any further data collection or training. First, the initial observation image (of the disorganized scene) is converted into a per-object representation including a segmentation mask using Mask R-CNN [158], an object caption, and a CLIP visual feature vector. Then a text prompt is generated by describing the objects in the scene and is given to DALL-E to create a goal image for the rearrangement task (the objects should be rearranged in a human-like way). Next, the objects in the initial and generated images are matched using their CLIP visual features. Poses are estimated by aligning their segmentation masks. The robot rearranges the scene based on the estimated poses to create the generated arrangement.

In Table II, some robotics-specific foundation models are reported along with information about their size and architecture, pretrained task, inference time, and hardware setup.

IV. PERCEPTION

Robots interacting with their surrounding environments receive raw sensory information in different modalities such as images, video, audio, and language. This high-dimensional data is crucial for robots to understand, reason, and interact in their environments.
Foundation models, including those that have been developed in the vision and NLP domains, are promising tools for converting these high-dimensional inputs into abstract, structured representations that can be more easily interpreted and manipulated. In particular, multimodal foundation models enable robots to integrate different sensory inputs into a unified representation encompassing semantic, spatial, temporal, and affordance information. These multi-modal models reflect cross-modal interactions, often by aligning elements across modalities to ensure coherence and correspondence. For example, text and image data are aligned for image captioning tasks. This section will explore a range of tasks related to robot perception that are improved through aligning modalities using foundation models, with a focus on vision and language. There is an extensive body of literature studying multi-modality in the machine learning community, and the interested reader is referred to the survey paper [161], which presents a taxonomy of multi-modal learning. We focus on applications of multi-modal models to robotics.

A. Open-Vocabulary Object Detection and 3D Classification

1) Object Detection: Zero-shot object detection allows robots to identify and locate objects they have never encountered previously. Grounded Language-Image Pre-training (GLIP) [52] integrates object detection and grounding by redefining object detection as phrase grounding. This reformulation enables the learning of a visual representation that is both language-aware and semantically rich at the object level. In this framework, the input to the detection model comprises not only an image but also a text prompt that describes all the potential categories for the detection task. To train GLIP, a dataset of 27 million grounding instances was compiled, consisting of 3 million human-annotated pairs and 24 million image-text pairs obtained by web crawling. The results of the study demonstrate the remarkable zero-shot and few-shot transferability of GLIP to a wide range of object-level recognition tasks.

TABLE II
PRETRAINED MODELS FOR ROBOTICS

Paper | Backbone | Size (Parameters) | Pretrained Task | Inference Speed | Hardware*
RoboCat [153] | decoder-only transformer | 1.18B | manipulation | 10-20 Hz |
Gato [154] | decoder-only transformer | 1.2B | generalist agent | 20 Hz | 4 days on 16x16 TPU v3 slice
PaLM-E-562B [6] | decoder-only transformer | 562B | | 1 Hz for language subgoals + 5-6 Hz for low-level control policies | runs on multi-TPU cloud service
ViNT [137] | EfficientNet + decoder transformer | 31M | visual navigation | 4 Hz | variety of GPU configurations including 2×4090, 3×Titan Xp, 4×P100, 8×1080Ti, 8×V100, and 8×A100
VPT [72] | a temporal convolution layer, a ResNet62 image processing stack, and residual unmasked attention layers | 0.5B | embodied agent in Minecraft | 20 Hz | 9 days on 720 V100 GPUs
RT-1 [44] | Conditioned EfficientNet + TokenLearner + decoder-only transformer | 35M | real-world robotics tasks | 3 Hz |
RT-2 [45] | PaLI-X | 55B | real-world robotics tasks | 1-3 Hz | runs on multi-TPU cloud service
RT-2-X [46] | ViT and Language model UL2 [159] | 55B | real-world robotics tasks | 1-3 Hz | runs on multi-TPU cloud service
LIV [37] | CLIP | | reward learning | 15 Hz | 8 NVIDIA V100 GPUs
SMART [134] | decoder-only transformer | 11M | bidirectional dynamics prediction and masked hindsight control | 1 Hz | 8 Nvidia V100 GPUs
COMPASS [160] | 3D-Resnet encoder | 20M | Contrastive loss | 30 Hz | 8 Nvidia V100 GPUs
PACT [47] | decoder-only transformer | 12M | forward dynamics and next action prediction | 10 Hz (edge) / 50 Hz | Nvidia Xavier NX (edge) / 8 Nvidia V100 GPUs
*Empty fields in the table denote no data is reported.

Recently, PartSLIP [162] demonstrated that GLIP can be used for low-shot part segmentation on 3D objects.
PartSLIP renders a 3D point cloud of an object from multiple views and combines 2D bounding boxes in these views to detect object parts. To deal with noisy 2D bounding boxes from different views, PartSLIP runs a voting and grouping method on superpoints from 3D, assigns multi-view 2D labels to superpoints, and finally groups superpoints to obtain a precise part segmentation. To enable few-shot learning of 3D part segmentation, prompt tuning and multi-view feature aggregation are proposed to improve performance.

OWL-ViT [51] is an open-vocabulary object detector. OWL-ViT uses a vision transformer architecture with contrastive image-text pre-training and detection end-to-end fine-tuning. Unlike GLIP, which frames detection as a phrase grounding problem with a single text query and limits the number of possible object categories, OWL-ViT can handle multiple text-based or image-driven queries. OWL-ViT has been applied to robot learning, for example in VoxPoser [34] as the open-vocabulary object detector to find "entities of interest" (e.g., a vase or drawer handles) and ultimately define value maps for optimizing manipulation trajectories.

Grounding DINO [53] combines DINO [105] with grounded pre-training, extending the closed-set DINO model to open-set detection by fusing vision and language. Grounding DINO outperforms GLIP in open-set object detection. This superior performance is mainly due to the transformer architecture of Grounding DINO, which facilitates multi-modal feature fusion at multiple stages.

2) 3D Classification: Zero-shot 3D classifiers can enable robots to classify objects in their environments without explicit training data. Foundation models are strong candidates for performing 3D classification. PointCLIP [54] transfers CLIP's pre-trained knowledge of 2D images to 3D point cloud understanding by aligning point clouds with text. The authors propose to project each point onto a series of pre-defined image planes to generate depth maps. Then, the CLIP visual encoder is used to encode multi-view features of the point cloud and predict labels in natural language for each view. The final prediction for the point cloud is computed via weighted aggregation of the predictions for each view. PointBERT [55] uses a transformer-based architecture to extract features from point clouds, generalizing the concept of BERT to 3D point clouds.

Unlike PointCLIP, which converts the task of matching point clouds and text to image-text alignment, ULIP [56], [57] is a Unified representation of Language, Images, and Point clouds for 3D understanding. It achieves this by pre-training with object triplets (image, text, point cloud). The model is trained using a small number of automatically synthesized triplets from ShapeNet55 [163], which is a large-scale 3D model repository. ULIP uses CLIP as the vision-language model. During pretraining, the CLIP model is kept frozen and a 3D encoder is trained by aligning the 3D features of an object with its associated textual and visual features from CLIP using contrastive learning. The pretraining process allows ULIP to learn a joint embedding space where the three modalities are aligned. One of the major advantages of ULIP is that it can substantially improve the recognition ability of 3D backbone models. This is because the pretraining process allows ULIP to learn more robust and discriminative features for each modality, which can then be used to improve the performance of 3D models. Another advantage of ULIP is that it is agnostic to the 3D model architecture, and thus can be easily integrated into the pretraining process of existing 3D pipelines. ULIP adopts masked language modeling from BERT to 3D by tokenizing 3D patches, randomly masking out 3D tokens, and predicting them back during pretraining. ULIP [56], [57] has shown that the recognition capability of models such as PointBERT can be improved by using ULIP's unified multimodal representation.
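To make the PointCLIP-style recipe described above more concrete, the following is a simplified sketch: render crude depth maps of a point cloud from a few canonical views, score each view with CLIP, and aggregate. The `clip_image_embed` and `clip_text_embed` callables are assumed to be provided (returning unit-norm embeddings), and the orthographic rendering is far simpler than the projection used in the paper.

```python
# Sketch of PointCLIP-style zero-shot 3D classification (assumed CLIP embed callables).
import numpy as np

def depth_map(points: np.ndarray, axis: int, sign: int, res: int = 64) -> np.ndarray:
    """Crude orthographic 'depth' image of an (N, 3) point cloud viewed along +/- one axis."""
    other = [i for i in range(3) if i != axis]
    uv = points[:, other]
    uv = (uv - uv.min(0)) / (uv.max(0) - uv.min(0) + 1e-8)   # normalize to [0, 1]
    px = np.clip((uv * (res - 1)).astype(int), 0, res - 1)
    depth = sign * points[:, axis]
    img = np.zeros((res, res))
    for (u, v), d in zip(px, depth):
        img[v, u] = max(img[v, u], d - depth.min())          # keep the nearest point's depth
    return img

def classify_point_cloud(points, class_names, clip_image_embed, clip_text_embed,
                         view_weights=None):
    """Weighted aggregation of per-view CLIP predictions over six canonical views."""
    views = [depth_map(points, axis, sign) for axis in range(3) for sign in (+1, -1)]
    text = clip_text_embed(class_names)                      # (K, D) unit-norm embeddings
    logits = np.stack([clip_image_embed(v) @ text.T for v in views])  # (V, K)
    w = np.ones(len(views)) if view_weights is None else np.asarray(view_weights)
    return class_names[int((w[:, None] * logits).sum(0).argmax())]
```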
B. Open-Vocabulary Semantic Segmentation

Semantic segmentation classifies each pixel in an image into semantic classes. This provides fine-grained information about object boundaries and locations within an image and enables embodied agents to understand and interact with the environment at a more granular level. Several works explore how foundation models such as CLIP can enhance the generalizability and flexibility of semantic segmentation tasks.

LSeg is a language-driven semantic segmentation model [58] that associates semantically similar labels with similar regions in an embedding space. LSeg uses a text encoder based on the CLIP architecture to compute text embeddings and an image encoder with the underlying architecture of the Dense Prediction Transformer (DPT) [164]. Similar to CLIP, LSeg creates a joint embedding space using text and image embeddings. LSeg freezes the text encoder at training time and trains the image encoder to maximize the correlation between the text embedding and the image pixel embedding of the ground-truth pixel class. It allows users to arbitrarily shrink, expand, or rearrange the label set (with unseen categories) for any image at test time.

The Segment Anything Model (SAM) [59] introduces a framework for promptable segmentation consisting of a task definition for promptable segmentation, a segmentation foundation model (the Segment Anything Model, or SAM), and a data engine. SAM adapts a pretrained Vision Transformer from a Masked Auto-Encoder (MAE) [113] as an image encoder, while using a text encoder from CLIP [114] for sparse prompts (points, boxes, and text) and a separate dense prompt encoder for masks. In contrast to other foundation models that are trained in an unsupervised manner on web-scale data, SAM is trained using supervised learning with data engines that help scale the number of available annotations. Along with the model, the authors released the Segment Anything 1 Billion (SA-1B) dataset. It consists of 11M images and 1.1B segmentation masks. In this work, the authors conducted experiments on five zero-shot transfer tasks, including point-valid mask evaluation, edge detection, object proposal, instance segmentation, and text-to-mask. The system's composable design, facilitated by prompt engineering techniques, enables a broader range of applications compared to systems trained specifically for fixed task sets. However, one limitation of this work that is particularly relevant to robotic applications is that SAM cannot run in real time.

FastSAM [60] and MobileSAM [61] achieve comparable performance to SAM at faster inference speeds. The Track Anything Model (TAM) [62] combines SAM and XMem [165], an advanced video object segmentation (VOS) model, to achieve interactive video object tracking and segmentation. Anything-3D [166] employs a collection of visual-language models and SAM to elevate objects into the realm of 3D. It uses BLIP [108] to generate textual descriptions while using SAM to extract objects of interest from visual input. Then, Anything-3D lifts the extracted objects into a Neural Radiance Field (NeRF) [167] representation using a text-to-image diffusion model, enabling their integration into 3D scenes.
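Returning to the language-driven labeling mechanism described above for LSeg, the following is a minimal sketch of per-pixel open-vocabulary classification by comparing dense pixel embeddings against text embeddings; it assumes both sets of features are already computed by a language-aligned image encoder and a text encoder.

```python
# Sketch: LSeg-style per-pixel labeling by pixel-text embedding similarity
# (assumes dense pixel features and label text features are already computed).
import torch
import torch.nn.functional as F

def label_pixels(pixel_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """
    pixel_feats: (H, W, D) dense image embeddings from a language-aligned image encoder.
    text_feats:  (K, D) embeddings of K arbitrary label names from a text encoder.
    Returns an (H, W) map of label indices, assigning each pixel its best-matching label.
    """
    pixel_feats = F.normalize(pixel_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = torch.einsum("hwd,kd->hwk", pixel_feats, text_feats)  # cosine similarities
    return logits.argmax(dim=-1)

# Because the label set is just a list of text embeddings, it can be shrunk,
# expanded, or rearranged at test time, as described for LSeg above.
```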
Amidst these remarkable advancements, achieving fine-grained detection with real-time performance still remains challenging. For example, LSeg [58] reports failure cases related to misclassification when the test-time input labels do not include the true label for the pixel, and the model thus assigns the highest probability to the closest label. Another failure case occurs when multiple labels can be correct for a particular pixel, and the model must classify it as just one of the categories. For example, "window" and "house" may both be defined as labels, but during inference a pixel representing a "window" may be labeled instead as "house". SAM also does not provide precise segmentation for fine structures and often fails to produce crisp boundaries. All models that use SAM as a sub-component may encounter similar limitations. In the future, fine-grained semantic segmentation models that can assign multiple labels to a pixel when there are multiple correct descriptions should be considered. Additionally, developing models that can run in real time will be critical for robotics applications.

C. Open-Vocabulary 3D Scene and Object Representations

Scene representations allow robots to understand their surroundings, facilitate spatial reasoning, and provide contextual awareness. Language-driven scene representations align textual descriptions with visual scenes, enabling robots to associate words with objects, locations, and relationships. In this section, we study recent works that use foundation models to enhance scene representations.

1) Language Grounding in 3D Scenes: Language grounding refers to combining geometric and semantic representations of an environment. One type of representation that can provide an agent with a strong geometric prior is an implicit representation. One example of an implicit representation is a Neural Radiance Field (NeRF) [167]–[169]. NeRF creates high-quality 3D reconstructions of scenes and objects from a set of 2D images captured from different viewpoints (without the need for explicit depth information). The NeRF neural network takes camera poses as input and predicts the 3D geometry of the scene as well as color and intensity. Most NeRF-based models memorize the light field in a single environment and are not pre-trained on a large dataset, hence they are not foundation models. However, foundation models such as CLIP can be combined with NeRFs to extract semantic information from an agent's environment.

Kerr et al. [64] propose language-embedded radiance fields (LERFs) that ground CLIP embeddings into a dense multi-scale 3D field. This results in a 3D representation of the environment that can be queried to produce semantic relevancy maps. The LERF model takes a 3D position (x, y, z), viewing direction (φ, θ), and a scaling factor as input and outputs an RGB value, density (σ), as well as DINO [105] and CLIP features. The LERF is optimized in two stages: initially, a multi-scale feature pyramid of CLIP embeddings over training views is computed; then, the pyramid is interpolated using the image scale and pixel location to obtain the CLIP embedding; and finally, the CLIP embeddings are supervised through cosine similarity and the RGB and density are supervised using the standard mean squared error.

Models such as LERF inherit the shortcomings of CLIP and NeRF. For example, CLIP exhibits difficulty in capturing spatial relationships between objects. In addition, language queries from CLIP can highlight a significant issue similar to the bag-of-words model, which struggles to distinguish terms with opposite sentiments. Also, NeRF relies on known camera poses associated with pre-captured multi-view images.
In CLIP-Fields [170], an implicit scene representation g(x, y, z): R^3 → R^d is trained by decoding a d-dimensional latent vector to different modality-specific outputs. The model distills information from pretrained image models by back-projecting the pixel labels to 3D space and training the output heads to predict semantic labels from an open-vocabulary object detector called Detic, the CLIP visual representation, and one-hot instance labels using a contrastive loss. The scene representation can then be used as a spatial database for segmentation, instance identification, semantic search over space, and 3D view localization from images.

Another related work is VLMaps [171], which projects pixel embeddings from LSeg to grid cells in a top-down grid map. This method does not require training and instead directly backprojects pixel embeddings to grid cells and averages the values in overlapping regions. By combining a VLMap with a code-writing LLM, the authors demonstrate spatial goal navigation using landmarks (e.g., move to the plant) or spatial references with respect to landmarks (between the keyboard and the bowl). Semantic Abstraction (SemAbs) [172] presents another approach for 3D scene understanding by decoupling visual-semantic reasoning and 3D reasoning. In SemAbs, given an RGB-D image of a scene, a semantic-aware 2D VLM extracts 2D relevancy maps for each queried object, while semantic-abstracted 3D modules predict the 3D occupancy of each object using the relevancy maps. Because the 3D modules are trained irrespective of the specific object labels, the system demonstrates strong generalization capabilities, including generalization to new object categories and from simulation to the real world.

Current VLMs can reason about 2D images; however, they are not grounded in the 3D world. The main challenge for building 3D VLM foundation models is the scarcity of 3D data. In particular, 3D data paired with language descriptions is scarce. One strategy to circumvent this issue is to take advantage of 2D models trained on large-scale data to supervise 3D models. For instance, the authors of FeatureNeRF [173] propose to learn 3D semantic representations by distilling 2D vision foundation models (i.e., DINO or Latent Diffusion) into 3D space via neural rendering. FeatureNeRF predicts a continuous 3D semantic feature volume from a single or few images, which can be used for downstream tasks such as keypoint transfer or object part co-segmentation.

In 3D-LLM [11], the authors propose to use 2D VLMs as backbones to train a 3D-LLM that can take 3D representations (i.e., 3D point clouds with their features) as inputs and accomplish a series of diverse 3D-related tasks. The 3D features are extracted from 2D multi-view images and mapped to the feature space of 2D pretrained VLMs. To overcome 3D data scarcity, the authors propose an efficient prompting procedure for ChatGPT to generate 3D-language data encompassing a diverse set of tasks. These tasks include 3D captioning, dense captioning, 3D question answering, 3D task decomposition, 3D grounding, 3D-assisted dialog, and navigation. Also, to capture 3D spatial information, the authors propose a 3D localization mechanism by 1) augmenting 3D features with position embeddings and 2) augmenting LLM vocabularies with 3D location tokens. In the first part, the position embeddings of the three dimensions are generated and concatenated with the 3D features. In the second part, the coordinates of the bounding box representing the grounded region are discretized to voxel integers as location tokens <xmin, ymin, zmin, xmax, ymax, zmax>. It is important to highlight that, typically, creating 3D representations necessitates the use of 2D multi-view images and camera matrices. These resources are not as readily available as the vast amounts of internet-scale text and image data that current foundation models are trained on.
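To give a feel for the implicit representation g(x, y, z) → R^d described for CLIP-Fields at the start of this subsection, the following is a simplified sketch of a small MLP field with a sinusoidal positional encoding, trained to match CLIP features back-projected to sampled 3D points. The original work uses hash-grid encodings, multiple output heads, and contrastive objectives; the cosine regression below is only a stand-in.

```python
# Sketch of an implicit semantic field g(x, y, z) -> R^d (a simplified stand-in for
# CLIP-Fields; not the original architecture or training objective).
import torch
import torch.nn as nn

class SemanticField(nn.Module):
    def __init__(self, feat_dim=512, n_freqs=8, hidden=256):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 3 * 2 * n_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def encode(self, xyz: torch.Tensor) -> torch.Tensor:
        # Sinusoidal positional encoding of 3D points: (N, 3) -> (N, 6 * n_freqs).
        freqs = 2.0 ** torch.arange(self.n_freqs, device=xyz.device)
        angles = xyz[..., None] * freqs                 # (N, 3, n_freqs)
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return enc.flatten(start_dim=-2)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.encode(xyz))

def training_step(field, optimizer, xyz, target_clip_feats):
    """Regress the field toward CLIP features back-projected to 3D points."""
    pred = field(xyz)
    loss = 1.0 - nn.functional.cosine_similarity(pred, target_clip_feats, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```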
2) Scene Editing: When an embodied agent relies on an implicit representation of the world, the capability to edit and update this representation enhances the robot's adaptability. For instance, consider a scenario where a robot utilizes a pretrained NeRF model of an environment for navigation and manipulation. If a portion of the environment changes, being able to adjust the NeRF without retraining the model from scratch saves time and resources.

In the case of NeRFs, Wang et al. [63] propose a text- and image-driven method for manipulating NeRFs called CLIP-NeRF. This approach uses CLIP to disentangle the dependence between shape and appearance in conditional neural radiance fields. CLIP-NeRF facilitates the editing of the shape and appearance of NeRFs using either image or text prompts. It is composed of two modules: the disentangled conditional NeRF and CLIP-driven manipulation. The former takes the positional encoding γ(x, y, z), a shape code zs, viewing direction v(φ, θ), and appearance code za as input and outputs color and density. The disentanglement is achieved using a deformation network that is appended as input to the traditional NeRF MLP that produces density, and by taking the output from this MLP and concatenating it with an appearance code to attain the color value. The CLIP-driven manipulation module takes an image example or text prompt as an input and outputs a shape deformation ∆zs and an appearance deformation ∆za from shape-mapping and appearance-mapping MLPs, respectively. These deformation values aim to perturb the shape code and appearance code in the disentangled conditional NeRF module to produce the desired output.

A key limitation of the CLIP-NeRF approach is that prompting can impact the entire scene rather than a selected region. For example, prompting to change the color of a flower's petals might also impact the shape and color of its leaves. To address this limitation, Kobayashi et al. propose to train distilled feature fields (DFFs) [65] and then manipulate DFFs through query-based scene decomposition and editing. Pre-trained 2D VLMs (such as LSeg [58] and DINO [105]) are employed as teacher networks and distilled into 3D distilled feature fields via volume rendering. Editing is achieved by alpha compositing the density and color values of the two NeRF scenes. When combined with CLIP-NeRF, this method enables CLIP-NeRF to selectively edit specific regions of multi-object scenes. A similar approach was explored by Tschernezki et al. in [174], where the authors show that enforcing the 3D consistency of features in the NeRF embedding improves segmentation performance compared to using features from the original 2D images.

Another approach to more controlled 3D scene editing is to use structured 3D scene representations. Nerflets [175] represent a 3D scene as a combination of local neural radiance fields, where each maintains its own spatial position, orientation, and dimension. Instead of employing a single large MLP to predict colors and densities as in standard NeRF, individual Nerflets are combined to predict these values, modulated by their weights. After optimizing over posed 2D images and segmentations, Nerflets reflect the decomposed scene and support more controlled editing.

One application of image editing in robotics is for data augmentation during policy learning. ROSIE [176] uses the Imagen editor [177] to modify training images to add additional distractors and unseen objects and backgrounds to train robust imitation learning policies.
GenAug [178] similarly generates images with in-category and cross-category object substitutions, visual distractors, and diverse backgrounds. The CACTI [14] pipeline includes a step that in-paints different plausible objects via Stable Diffusion [117] onto training images. These approaches generate photorealistic images for training robust policies; however, generating images with sufficient diversity while also maintaining physical realism, e.g. for object contacts, remains a challenge. Existing approaches use learned or provided masks to specify areas of the image to keep, or heuristics based on the particular robotic task.

Another direction is to use generative models to define goal images for planning. DALL-E-Bot [157] uses DALL-E 2 to define a goal image of human-like arrangements from observations.

3) Object Representations: Learning correspondences between objects can facilitate manipulation by enabling skill transfer from trained objects to novel object instances in known categories or novel object categories at test time. Traditionally, object correspondences have been learned using strong supervision such as keypoints and keyframes. Neural descriptor fields (NDFs) [179] remove the need for dense annotation by leveraging layer-wise activations from an occupancy network; however, this approach still requires many training shapes for each target object category. Additional works have started to build object representations directly from image features of pretrained vision models.

Feature Fields for Robotic Manipulation (F3RM) [180] builds on DFF to develop scene representations that support finding corresponding object regions. F3RM uses a feature representation for 6-DoF poses relative to objects (e.g., a grasp on the handle of the mug) similar to NDF. Besides allowing corresponding 6-DoF poses to be found from a few demonstrations, the pose embeddings can also be directly compared to text embeddings from CLIP to leverage language guidance (e.g., pick up the bowl). Correspondences between objects have also been directly extracted from DINO features [181] without training. This method first extracts dense ViT feature maps of two objects using multiple views. Similar regions on the two objects are found by computing the cyclical distance metric [182] on the feature maps. With the 2D patch correspondences, a 7-D rigid-body transform (i.e., an SO(3) pose, a translation, and a scaling scalar) between the objects can be solved together with RANSAC and Umeyama's method [183].
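A simplified sketch of the correspondence-and-alignment pipeline described above is given below: mutual (cycle-consistent) nearest-neighbor matching between two sets of precomputed, L2-normalized patch features, followed by a least-squares similarity transform in the style of Umeyama's method. The RANSAC loop and the exact cyclical-distance metric of the paper are omitted.

```python
# Sketch: cycle-consistent feature matching + Umeyama-style similarity transform
# (features are assumed precomputed and normalized; no RANSAC loop is included).
import numpy as np

def mutual_nearest_neighbors(feats_a: np.ndarray, feats_b: np.ndarray):
    """feats_a: (Na, D), feats_b: (Nb, D). Returns index pairs (i, j) that are
    mutual nearest neighbors, i.e., cycle-consistent matches a_i -> b_j -> a_i."""
    sim = feats_a @ feats_b.T                  # cosine similarity for normalized inputs
    a_to_b = sim.argmax(axis=1)                # best match in B for each patch in A
    b_to_a = sim.argmax(axis=0)                # best match in A for each patch in B
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]

def umeyama_similarity(src: np.ndarray, dst: np.ndarray):
    """Least-squares similarity transform (scale, rotation, translation) mapping
    src (N, 3) to dst (N, 3), following Umeyama's method."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = mu_d - scale * R @ mu_s
    return scale, R, t
```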
D. Learned Affordances

Affordances refer to the potential of objects, environments, or entities to offer specific functions or interactions to an agent. They can include actions such as pushing, pulling, sitting, or grasping. Detecting affordances bridges the gap between perception and action.

Affordance Diffusion [66] synthesizes complex interactions of, e.g., an articulated hand with a given object. Given an RGB image, Affordance Diffusion aims to generate images of human hands for hand-object interaction (HOI). The authors propose a two-step generative approach based on large-scale pretrained diffusion models, split into where to interact (layout) and how to interact (content). The layout network generates a 2D spatial arrangement of hand and object. The content network then synthesizes images of a hand grasping the object conditioned on the given object and the sampled HOI layout. Affordance Diffusion outputs both the hand articulation and the approach orientation.

Vision-Robotics Bridge (VRB) [67] trains a visual affordance model on internet videos of human behavior. In particular, it estimates the likely location and manner in which a human interacts within a scene. This model captures the structural information of these behavioral affordances. The authors seamlessly integrate the affordance model with four different robot learning paradigms. Firstly, they apply offline imitation learning, where the robot learns by imitating the observed human interactions from the videos. Secondly, they use exploration techniques to enable the robot to actively discover and learn new affordances in its environment. Thirdly, the authors incorporate goal-conditioned learning, allowing the robot to learn how to achieve specific objectives by leveraging the estimated affordances. Finally, they integrate action parameterization for reinforcement learning, enabling the robot to learn complex behaviors by optimizing its actions based on the estimated affordances.

E. Predictive Models

Predictive dynamics models, or world models, predict how the state of the world changes given particular agent actions; that is, they attempt to model the state transition function of the world [184]. When applied to visual observations, dynamics modeling can be formulated as a video prediction problem [185], [186]. While video generation and prediction, particularly over long horizons, is a longstanding challenge with many prior efforts, recent models based on vision transformers and diffusion models have demonstrated improvements [187], [188]. For instance, the Phenaki model [189] generates variable-length video up to minutes in length conditioned on text prompts.

Several approaches apply these models to robotics in the literature. Note that while learned dynamics or world models in robotics have been explored in constrained or smaller-data regimes, we focus in this section on works that train on a diversity or volume of data that is characteristic of foundation models. One strategy is to learn an action-conditioned model that may be used directly for downstream planning by optimizing an action sequence [190], i.e., performing model-predictive control, or for policy learning via training on simulated rollouts. One example is the GAIA-1 model, which generates predictions of driving video conditioned on arbitrary combinations of video, action, and text [191]. It was trained on 4700 hours of proprietary driving data. Another approach is to use a video prediction model to generate a plan of future states, and then learn a separate goal-conditioned policy or inverse dynamics model to infer control actions based on the current and target state. One line of work instantiates this by combining text-conditioned video diffusion models with image-goal-conditioned policies to solve manipulation tasks in simulated and real tabletop settings [192]. This approach has been extended to longer-horizon object manipulation tasks by using the PaLM-E VLM to break down a high-level language goal into smaller substeps, leveraging feedback between the VLM and video generation models [193].

Another example is COMPASS [160], which first constructs a comprehensive multimodal graph to capture crucial relational information across diverse modalities. The graph is then used to construct a rich spatio-temporal and semantic representation. Pretrained on the TartanAir multimodal dataset, COMPASS was demonstrated to address multiple robotic tasks including drone navigation, vehicle racing, and visual odometry.

V. EMBODIED AI

Recently, researchers have shown that the success of LLMs can be extended to embodied AI domains [32], [33], [42], [194], where "embodied" typically refers to a virtual embodiment in a world simulator, not a physical robot embodiment. Statler [69] is a framework that endows LLMs with an explicit representation of the world state as a form of "memory" that is maintained over time. Statler uses two instances of general LLMs: a world-model reader and a world-model writer, that interface with and maintain the world state.
Statler improves the ability of existing LLMs to reason over longer time horizons without the constraint of context length.

Large Scale Language Models (LSLMs) have exhibited strong reasoning ability and the ability to adapt to new tasks through in-context learning. Dasgupta et al. [195] combine these complementary abilities in a single system consisting of three parts: a Planner, an Actor, and a Reporter. The Planner is a pretrained language model that can issue commands to a simple embodied agent (the Actor), while the Reporter communicates with the Planner to inform its next command. Mu et al. [70] build EgoCOT, a dataset consisting of carefully selected videos from the Ego4D dataset, along with corresponding high-quality language instructions. EmbodiedGPT [70] utilizes prefix adapters to augment the 7B language model's capacity to generate high-quality planning, training it on the EgoCOT dataset to avoid overly divergent language model responses. Comprehensive experiments were conducted, demonstrating that the model effectively enhances the performance of embodied tasks such as Embodied Planning, Embodied Control, Visual Captioning, and Visual Q&A.

Embodied agents should autonomously and endlessly explore the environment. They should actively seek new experiences, acquire new skills, and improve themselves. The game of Minecraft [196] provides a platform for designing intelligent agents capable of operating in the open world. MineDojo [71] is a framework for developing generalist agents in the game of Minecraft. MineDojo offers thousands of open-ended and language-prompted tasks, where the agent can navigate in a progressively generated 3D environment to mine, craft tools, and build structures. As part of this work, the authors introduce MineCLIP, a video-language model that learns to capture the correlations between a video clip and its time-aligned text that describes the video. The MineCLIP model, trained on YouTube videos, can be used as a reward function to train the agent with reinforcement learning. By maximizing this reward function, the agent is incentivized to make progress toward solving tasks specified in natural language.

Voyager [73] introduces an LLM-powered embodied lifelong learning agent in the realm of Minecraft. Voyager uses GPT-4 to continuously explore the environment. It interacts with GPT-4 through in-context prompting and does not require model parameter fine-tuning. Exploration is maximized by querying GPT-4 to provide a stream of new tasks and challenges based on the agent's history of interactions and current situation. Also, an iterative prompting mechanism generates code as the action space to control the Minecraft agent. Iterative prompting incorporates environment feedback provided by Minecraft, execution errors, and a self-verification scheme. For self-verification, GPT-4 acts as a critic by checking task success and providing suggestions for task completion in the case of failure. The GPT-4 critic can be replaced by a human critic to provide on-the-fly human feedback during task execution. Ghost in the Minecraft (GITM) [197] leverages LLMs to break down goals into sub-goals and map them to structured actions for generating control signals. GITM consists of three components: an LLM Decomposer, an LLM Planner, and an LLM Interface. The LLM Decomposer is responsible for dividing the given Minecraft goal into a sub-goal tree. The LLM Planner then plans an action sequence for each sub-goal. Finally, the LLM Interface executes each action in the environment using keyboard and mouse operations.
Reinforcement learning in embodied AI virtual environments has the potential to improve the capabilities of real-world robotics by providing efficient training and optimizing control policies in a safe and controlled setting. Reward design is a crucial aspect of RL that influences the robot's learning process. Rewards should be aligned with the task's objective and guide the robot to achieve the desired task. Foundation models can be leveraged to design rewards. Kwon et al. [16] investigate the simplification of reward design by utilizing a large language model (LLM), such as GPT-3, as a proxy reward function. In this approach, users provide a textual prompt that contains a few examples (few-shot) or a description (zero-shot) of the desired behavior. The proposed method incorporates this proxy reward function within a reinforcement learning framework. Users specify a prompt at the start of the training process. During training, the RL agent's behavior is evaluated by the LLM against the desired behavior outlined in the prompt, resulting in a corresponding reward signal generated by the LLM. Subsequently, the RL agent employs this reward to update its behavior through the learning process. In [74], the authors propose a method called Exploring with LLMs (ELLM) that rewards an agent for achieving goals suggested by a language model. The language model is prompted with a description of the agent's current state. Therefore, without having a human in the loop, ELLM guides agents toward meaningful behavior.

Zhang et al. [198] explore the potential relationship between offline reinforcement learning and language modeling. They hypothesize that RL and LM share similarities in predicting future states based on current and past states, considering both local and long-range dependencies across states. To validate this assumption, the authors pre-train Transformer models on different offline RL tasks and assess their performance on various language-related tasks. Tarasov et al. [199] present an approach to harness pretrained language models in deep offline reinforcement learning scenarios that are not inherently compatible with textual representations. The authors suggest a method that involves transforming the RL states into human-readable text and performing fine-tuning of the pretrained language model during training with deep offline RL algorithms.

Advances in model architecture (e.g., the transformer) for foundation models allow the model to effectively model and predict sequences. To harness the power of these models, some recent studies investigate exploiting these architectures for sequence modeling in RL problems. Reid et al. [200] explore the potential of leveraging the sequence modeling formulation of reinforcement learning and examine the transferability of pretrained sequence models across different domains, such as vision and language. They specifically focus on the effectiveness of fine-tuning these pretrained models on offline RL tasks, including control and games. In addition to investigating the transferability of pretrained sequence models, the authors propose techniques to enhance the transfer of knowledge between these domains. These techniques aim to improve the adaptability and performance of the pretrained models when applied to new tasks or domains.
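Returning to the reward-design idea of Kwon et al. [16] discussed above, the following is a minimal sketch of an LLM used as a proxy reward function. The `llm` callable is an assumed wrapper around any text-completion API, `describe_episode` is a task-specific user-provided function, and the prompt wording is illustrative rather than taken from the paper.

```python
# Sketch: an LLM as a proxy reward function, in the spirit of [16] (assumed `llm` wrapper).
TASK_PROMPT = """You are judging a negotiation agent.
Desired behavior: the agent should be cooperative and reach a fair split.
Example: "agent offered an even split" -> yes
Answer only "yes" or "no": did the agent behave as desired?"""

def proxy_reward(episode, describe_episode, llm) -> float:
    """Return 1.0 if the LLM judges the behavior as matching the prompt, else 0.0."""
    behavior_text = describe_episode(episode)          # textual summary of the rollout
    answer = llm(f"{TASK_PROMPT}\n\nAgent behavior: {behavior_text}\nAnswer:")
    return 1.0 if answer.strip().lower().startswith("yes") else 0.0

# During training, this scalar stands in for a hand-engineered reward, e.g.:
# reward = proxy_reward(episode, describe_episode, llm)
```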
High-level task planning using LLMs has been demonstrated in embodied AI environments. Huang et al. [68] propose employing pretrained language models (LMs) as zero-shot planners. The approach is evaluated in the VirtualHome [129] environment. In this work, first, an autoregressive LLM such as GPT-3 [2] or Codex [201] is queried to generate action plans for high-level tasks. Some of these action plans might not be executable by the agent due to ambiguity in language or because they refer to objects that are not present or grounded in the environment. So, to select admissible action plans, the admissible environment actions and the actions generated by the causal LLM are embedded using a BERT-style LM. Then, for each admissible environment action, its semantic distance to the generated action is computed using cosine similarity.
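The grounding step described above can be sketched as follows; the sentence-transformers library is an assumed stand-in for the BERT-style LM used in [68], and the example actions are hypothetical.

```python
# Sketch: mapping an LLM-generated action to the closest admissible environment action
# via sentence embeddings and cosine similarity (sentence-transformers as an assumed
# stand-in for the BERT-style LM used in [68]).
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def ground_action(generated_action: str, admissible_actions: list[str]) -> str:
    """Return the admissible action most semantically similar to the generated one."""
    gen_emb = embedder.encode(generated_action, convert_to_tensor=True)
    adm_emb = embedder.encode(admissible_actions, convert_to_tensor=True)
    scores = util.cos_sim(gen_emb, adm_emb)[0]
    return admissible_actions[int(scores.argmax())]

# Example (hypothetical action set):
# ground_action("grab a towel from the shelf",
#               ["walk to bathroom", "pick up towel", "open cabinet"])
# -> "pick up towel"
```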
Chain-of-thought reasoning and action generation have been proposed for embodied agents as well. ReAct [202] combines reasoning (e.g., chain of thought) and acting (e.g., generating a sequence of actions) within an LLM. Reasoning traces enhance the model's ability to deduce, monitor, and revise action plans, as well as to manage exceptions effectively, while actions facilitate interaction with external resources, such as knowledge bases or environments, enabling the model to acquire supplementary information. ReAct demonstrates proficiency across a wide array of language and decision-making tasks, including question answering and fact verification. It enhances interpretability and user trust by transparently illustrating the process through which it searches for evidence and formulates conclusions. Unlike prior methods that depend on a single chain of thought, ReAct engages with a Wikipedia API for pertinent information retrieval and belief updating. This strategy effectively mitigates issues commonly associated with chain-of-thought reasoning, such as hallucination and error propagation.

VPT [72] presents video pretraining, in which the agent learns to act by watching unlabeled online videos. It is shown that an inverse dynamics model can be trained with a small labeled dataset and then used to label a huge amount of unlabeled internet data. Videos of people playing Minecraft are used to train an embodied AI agent to play Minecraft. The model exhibits zero-shot performance and can be fine-tuned for more complex skills using imitation learning or reinforcement learning. The VPT model is trained with a standard behavioral cloning loss (9) (negative log-likelihood), where the action labels are drawn from the inverse dynamics model.

A. Generalist AI

A long-standing challenge in robotics research is deploying robots or embodied AI agents in a variety of non-factory, real-world applications, performing a range of tasks. To make generalist robots that can operate in diverse environments with diverse tasks, some researchers have proposed generative simulators for robot learning. For example, Generative Agents [203] discusses how generative agents can produce realistic imitations of human behavior for interactive applications, creating a miniature community of agents similar to those found in games like The Sims. The authors connect their architecture with the ChatGPT large language model to create a game environment with 25 agents. The study includes two evaluations, a controlled evaluation and an end-to-end evaluation, which demonstrate the causal effects of the various components of their architecture. Xian et al. [204] propose a fully automated generative pipeline, known as generative simulation for robot learning, which utilizes models to generate diverse tasks, scenes, and training guidance at a large scale. This approach can facilitate the scaling up of low-level skill learning, ultimately leading toward a foundation model for robotics that empowers generalist robots.

An alternative method for developing generalist AI involves using generalizable multi-modal representations. Gato [154] is a generalist agent that works as a multi-modal, multi-task, multi-embodiment generalist policy. Using the same neural network with the same set of weights, Gato can sense and act with different embodiments in various environments across different tasks. Gato can play Atari, chat, caption images, stack blocks with a real robot arm, navigate in a 3D simulated environment, and more. Gato is trained on 604 different tasks with various modalities, observations, and actions; in this setting, language acts as a common grounding across different embodiments. Gato has 1.2B parameters and is trained offline in a supervised way. Positioned at the confluence of representation learning and reinforcement learning (RL), RRL [205] learns behaviors directly from proprioceptive inputs. By harnessing pre-trained visual representations, RRL is able to learn from visual inputs, which typically pose challenges in conventional RL settings.

B. Simulators

High-quality simulators and benchmarks are crucial for robotics development; hence, we include this "Simulators" section to highlight their essential role. To facilitate generalization from simulation to the real world, Gibson [206] emphasizes real-world perception for embodied agents. To bridge the gap between simulation and the real world, iGibson [146] and BEHAVIOR-1K [207] further support the simulation of a more diverse set of household tasks and reach high levels of simulation realism. As a simulation platform for research in embodied AI, Habitat [208] consists of Habitat-Sim and Habitat-API; Habitat-Sim can achieve several thousand frames per second (fps) running single-threaded. Rather than modeling low-level physics, Habitat-Lab [147] is a high-level library for embodied AI that provides a modular framework for end-to-end development. It facilitates the definition of embodied AI tasks, such as navigation, interaction, instruction following, and question answering. Additionally, it enables the configuration of embodied agents, encompassing their physical form, sensors, and capabilities. The library supports various training methodologies for these agents, including imitation learning, reinforcement learning, and traditional non-learning approaches such as Sense-Plan-Act pipelines. Furthermore, it provides standard metrics for evaluating agent performance across these tasks. In line with this, the recent release of Habitat 3.0 [209] further expands these capabilities.

Similarly, RoboTHOR [210] serves as a platform for the development and evaluation of embodied AI agents, offering environments in both simulated and physical settings. Currently, RoboTHOR includes a training and validation set comprising 75 simulated scenes. Additionally, there are 14 scenes each for test-dev and test-standard in simulation, with corresponding physical counterparts. Key features of RoboTHOR include its reconfigurability and benchmarking capabilities. The physical environments are constructed using modular, movable components, enabling the creation of diverse scene layouts and furniture configurations in a single physical area. Another simulator, VirtualHome [129], models complex activities that occur in a typical household and supports program descriptions for a variety of activities that happen in people's homes. Huang et al. [33] use VirtualHome to evaluate robot planning ability with language models. These simulators have the potential to be used for evaluating LLMs on robotics tasks.
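As a schematic of how such simulators could be used to evaluate an LLM-based planner, the loop below uses a generic Gymnasium-style interface. The environment and the `llm_choose_action` helper are hypothetical placeholders; platforms such as Habitat or VirtualHome expose their own, different APIs.

```python
# Schematic episode loop for evaluating a language-model planner in a simulator.
# `llm_choose_action` and the use of CartPole as a stand-in environment are
# placeholders; real embodied-AI platforms have richer observation/action spaces.

import gymnasium as gym

def llm_choose_action(observation, action_space):
    # Placeholder: in practice, serialize the observation into a prompt,
    # query an LLM for a high-level action, and map it to the action space.
    return action_space.sample()

env = gym.make("CartPole-v1")   # stand-in for an embodied-AI task environment
successes, episodes = 0, 10
for ep in range(episodes):
    obs, info = env.reset(seed=ep)
    terminated = truncated = False
    while not (terminated or truncated):
        action = llm_choose_action(obs, env.action_space)
        obs, reward, terminated, truncated, info = env.step(action)
    successes += int(info.get("success", False))
print(f"success rate: {successes / episodes:.2f}")
```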
VI. CHALLENGES AND FUTURE DIRECTIONS

In this section, we examine challenges related to integrating foundation models into robotics settings. We also explore potential future avenues to address some of these challenges.

A. Overcoming Data Scarcity in Training Foundation Models for Robotics

One main challenge is that, compared to the internet-scale text and image data that large models are trained on, robot-specific data is scarce. We discuss various techniques to overcome data scarcity. For example, to scale up robot learning, some recent works suggest the use of play data instead of expert data for imitation learning. Another technique is data augmentation using inpainting.

1) Scaling Robot Learning Using Unstructured Play Data and Unlabeled Videos of Humans: Language-conditioned learning, such as language-conditioned behavioral cloning or language-conditioned affordance learning, requires access to large annotated datasets. To scale up learning, in Play-LMP [26] the authors suggest using teleoperated, human-provided play data instead of fully annotated expert demonstrations. Play data is unstructured, unlabeled, and cheap to collect, yet rich; collecting play data does not require scene staging, task segmenting, or resetting to an initial state. Similarly, in MimicPlay [118] a goal-conditioned trajectory generation model is trained on human play data consisting of unlabeled video sequences of humans interacting with the environment with their hands. Recent works such as [125] have shown that only a very small percentage (as little as 1%) of language-annotated data is needed to train a visuo-lingual affordance model for robot manipulation tasks.

2) Data Augmentation Using Inpainting: Collecting robotics data requires the robot to interact with the real physical world, a data collection process that can carry significant costs and potential safety concerns. One way to tackle this challenge is to use generative AI, such as text-to-image diffusion models, for data augmentation. For example, ROSIE (Scaling Robot Learning with Semantically Imagined Experience) [176] presents diffusion-based data augmentation: given a robot manipulation dataset, inpainting is used to create various unseen objects, backgrounds, and distractors with textual guidance. One important challenge for these methods is developing inpainting strategies that can generate sufficiently semantically and visually diverse data while ensuring that this data is physically feasible and accurate. For instance, using inpainting to modify an image of an object within a robot's gripper may result in an image with a physically unrealistic grasp, leading to poor downstream training performance. Additional investigation into generative foundation models that are evaluated not only for visual quality but also for physical realism may improve the generality of these methods.
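To make the inpainting idea concrete, the snippet below sketches text-guided inpainting with an off-the-shelf diffusion pipeline. It is a generic illustration, not ROSIE's actual augmentation pipeline; the checkpoint name, prompt, image files, and mask are only examples.

```python
# Sketch: text-guided inpainting for augmenting robot images with unseen objects,
# backgrounds, or distractors. Requires the `diffusers` library and a GPU; the
# files and prompt are placeholders and do not reproduce ROSIE's pipeline.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("robot_scene.png").convert("RGB").resize((512, 512))
mask = Image.open("table_region_mask.png").convert("L").resize((512, 512))  # white = repaint

augmented = pipe(
    prompt="a ceramic mug on a cluttered wooden table",  # semantically new content
    image=image,
    mask_image=mask,
).images[0]
augmented.save("robot_scene_augmented.png")
```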
3) Overcoming 3D Data Scarcity for Training 3D Foundation Models: Currently, multi-modal vision-language models (VLMs) can analyze 2D images, but they lack a connection to the 3D world, which encompasses 3D spatial relationships, 3D planning, 3D affordances, and more. The primary obstacle to developing foundational 3D VLM models lies in the scarcity of 3D data, especially data that is paired with language descriptions. As discussed, language-driven perception tasks such as language-driven 3D scene representation, language-driven 3D scene editing, language-driven 3D scene or shape generation, language-driven 3D classification, and affordance prediction require access to 3D data or multi-view images with camera matrices, which are not readily available data types. New datasets or data generation methods need to be created to overcome data scarcity in the 3D domain.

4) Synthetic Data Generation via High-Fidelity Simulation: High-fidelity simulation via gaming engines can provide an efficient means to collect data, especially for multimodal and 3D perception tasks on robots. For example, TartanAir [211], a dataset for robot navigation tasks, was collected in simulation [212] in the presence of moving objects, changing light, and various weather conditions. By collecting data in simulation, it was possible to obtain multi-modal sensor data and precise ground-truth labels, such as stereo RGB images, depth images, segmentation, optical flow, camera poses, and LiDAR point clouds. A large number of environments were set up with various styles and scenes, covering challenging viewpoints and diverse motion patterns that are difficult to achieve with physical data collection platforms. An extension, TartanAir-V2 (https://tartanair.org), furthers the dataset by incorporating additional environments and modalities, such as fisheye, panoramic, and pinhole cameras, with arbitrary camera intrinsics and rotations.

5) Data Augmentation Using VLMs: Data augmentation can also be provided using vision-language models (VLMs). DIAL [213] introduces Data-driven Instruction Augmentation for Language-conditioned control, which uses a VLM to label offline datasets for language-conditioned policy learning. DIAL performs instruction augmentation by using VLMs to weakly relabel offline control datasets. DIAL consists of three steps: 1) contrastive fine-tuning of a VLM such as CLIP [4] on a small robot manipulation dataset of trajectories with crowd-sourced annotations; 2) producing new instruction labels by using the fine-tuned VLM to score the relevancy of crowd-sourced annotations against a larger dataset of trajectories; and 3) training a language-conditioned policy using behavior cloning on both the original and re-annotated datasets.
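The relabeling step (scoring candidate instructions against trajectory frames) can be sketched with an off-the-shelf CLIP model. This is an illustration of the scoring idea only, not DIAL's fine-tuned model or full pipeline, and the frame file and instruction strings are placeholders.

```python
# Sketch: score candidate language instructions against a trajectory frame with CLIP,
# in the spirit of DIAL's relabeling step. DIAL first fine-tunes CLIP on robot data,
# which is omitted here; the file and instructions below are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("trajectory_frame.png")          # placeholder trajectory frame
candidate_instructions = [
    "pick up the red block",
    "open the top drawer",
    "push the mug to the left",
]

inputs = processor(text=candidate_instructions, images=frame,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image   # shape (1, num_texts)
scores = logits_per_image.softmax(dim=-1).squeeze(0)

for text, score in zip(candidate_instructions, scores.tolist()):
    print(f"{score:.3f}  {text}")
```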
6) Robot Physical Skills Are Limited to the Distribution of Skills in the Data: One key limitation of the existing robot transformers, and of other related works in robotics, is that the robot's physical skills are limited to the distribution of skills observed in the robot data; using these transformers, the robot lacks the capability to generate new movements. One approach to addressing this constraint is to use motion data from videos of humans performing various tasks; the motion information inherent in these videos can then be employed to facilitate the acquisition of physical skills in robotics.

B. Real-Time Performance (High Inference Time of Foundation Models)

Another bottleneck for deploying foundation models on robots is the inference time of these models. In Table II, the inference time for some of these models is reported. As can be seen, the inference time of several models still needs to improve for reliable real-time deployment on robotic systems. Since real-time capability is an essential requirement for any robotic system, more research needs to be performed to improve the computational efficiency of foundation models.

Furthermore, foundation models are most often stored and run in remote data centers and accessed through APIs that require network connectivity. Many foundation models (e.g., the GPT models, the DALL-E models) can only be accessed this way, while others are usually accessed this way but can also be downloaded and run locally with sufficient local computing power (such as SAM [59], LLaMA [214], and DINOv2 [107]). Given this cloud-service paradigm, the latencies and service times in response to an API call for a foundation model depend on the underlying network over which the data is routed and on the data center where the computation takes place, factors that are beyond the control of a robot. Network reliability should therefore be taken into account before integrating a foundation model into a robot's autonomy stack.

For some robotics domains, reliance on the network and third-party computing may not be a safe or realistic operating paradigm. In autonomous driving, autonomous aircraft, search-and-rescue or emergency response applications, and defense applications, the robot cannot rely on network connectivity for time-critical perception or control computations. One option is to have a safe fall-back mode that relies on classical autonomy tools using only local computation and that can take over if access to the cloud is interrupted for some reason. Another potential longer-term solution for network-free autonomy is the distillation of large foundation models into smaller, specialized models that run on onboard robot hardware. Some recent work has attempted this approach (though without an explicit link to robotics) [215]. Such distilled models would likely give up some aspect of the full model, e.g., restricting operation to a certain limited context, in exchange for smaller size and faster compute. This could be an interesting future direction for bringing the power of foundation models to safety-critical robotics systems.
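The distillation route mentioned above can be illustrated with a standard teacher-student objective. The sketch below shows a generic soft-label distillation loss with toy stand-in models; it is not the specific approach of [215], and only the structure of the objective is the point.

```python
# Sketch: generic knowledge distillation of a large "teacher" model into a small
# on-board "student" via a temperature-scaled KL divergence on output logits.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label KL loss between teacher and student output distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

teacher = torch.nn.Linear(32, 10)                      # stand-in for a large model
student = torch.nn.Linear(32, 10)                      # stand-in for a small model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(100):
    x = torch.randn(64, 32)                            # unlabeled inputs
    with torch.no_grad():
        teacher_logits = teacher(x)
    loss = distillation_loss(student(x), teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```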
C. Limitations in Multimodal Representation

Multimodal interaction implicitly assumes that each modality is tokenizable and can be standardized into input sequences without losing information. Multimodal models provide information sharing between multiple modalities and are typically variations of multimodal transformers with cross-modal attention between every pair of inputs. In multimodal representation learning, it is assumed that cross-modal interactions and the degree of heterogeneity between different modalities can all be captured by simple embeddings; in other words, a simple embedding is assumed to be sufficient to identify the modality or, for example, how different language is from vision. In the realm of multimodal representation learning, the question of whether a single multimodal model can accommodate all modalities remains an open challenge.

Additionally, when paired data between a modality and text is available, one can embed that modality into text directly. In robotics applications, there are some modalities for which sufficient data is not available; to align them with other modalities, they first need to be converted into other modalities. For example, 3D point cloud data has various applications in robotics, but training a foundation model on this type of data is challenging since the data is scarce and is not aligned with text. One way to overcome this challenge is to first convert the 3D point cloud data into other modalities such as images, and subsequently convert the images into text as a secondary alignment step; the result can then be used in foundation model training. As another example, in Socratic Models [194], each modality, whether visual or auditory, is initially translated into language, after which language models respond to these modalities.

D. Uncertainty Quantification

How can we provide assurances on the reliability of foundation models when they are deployed in potentially safety-critical robotics applications [188]? Current foundation models such as LLMs often hallucinate, i.e., produce outputs that are factually incorrect, logically inconsistent, or physically infeasible. While such failures may be acceptable in applications where the outputs of the model can be checked by a human in real time (e.g., as is often the case for LLM-based conversational agents), they are not acceptable when deploying autonomous robots that use the outputs of foundation models in order to act in human-centered environments. Rigorous uncertainty quantification is a key step toward addressing this challenge and safely integrating foundation models into robotic systems. Below, we highlight challenges and recent progress in uncertainty quantification for foundation models in robotics.

1) Instance-Level Uncertainty Quantification: How can we quantify the uncertainty in the output of a foundation model for a particular input? As an example, consider the problem of image classification: given a particular image, one may quantify uncertainty in the output by producing a set of object labels that the model is uncertain among, or a distribution over object labels. Instance-level uncertainty quantification can inform the robot's decisions at runtime. For example, if an image classification model running on an autonomous vehicle produces a prediction set {Pedestrian, Bicyclist}, representing that it is uncertain whether a particular agent is a pedestrian or a bicyclist, the autonomous vehicle can take actions that consider both possibilities.

2) Distribution-Level Uncertainty Quantification: How can we quantify the uncertainty in the correctness of a foundation model that will be deployed on a distribution of possible future inputs? For the problem of image classification, one may want to compute or bound the probability of error over the distribution of inputs that a robot may encounter when deployed. Distribution-level uncertainty quantification allows us to decide whether a given model is sufficiently reliable to deploy in our target distribution of scenarios; for example, we may want to collect additional data or fine-tune the model if the computed probability of error is too high.

3) Calibration: In order to be useful, estimates of uncertainty (both at the instance level and at the distribution level) should be calibrated. If we perform instance-level uncertainty quantification using prediction sets, calibration asks for the prediction set to contain the true label with a user-specified probability (e.g., 95%) over future inputs. If instance-level uncertainty is quantified using a distribution over outputs, it should be the case that outputs assigned confidence p are in fact correct with probability p over future inputs. Similarly, distribution-level uncertainty estimates should bound the true probability of errors when encountering inputs from the target distribution.

We highlight a subtle but important point that is often overlooked when performing uncertainty quantification in robotics: it can be crucial to pay attention to the distinction between Frequentist and Bayesian interpretations of probabilities. In many robotics contexts, particularly safety-critical ones, the desired interpretation is often Frequentist in nature. For example, if we produce a bound ε on the probability of collision of an autonomous vehicle, this should bound the actual observed rate of collisions when the vehicle is deployed. Bayesian techniques (e.g., Gaussian processes or Bayesian ensembles) do not necessarily produce estimates of uncertainty that are calibrated in this Frequentist sense, since the estimates depend on the specific prior used to produce them. Trusting the resulting uncertainty estimates may lead one astray if the goal is to provide statistical guarantees on the safety or performance of the robotic system when it is deployed.

4) Distribution Shift: An important challenge in performing calibrated uncertainty quantification is distribution shift. A foundation model trained on a particular distribution of inputs may not produce calibrated estimates of uncertainty when deployed on a different distribution for a downstream task. A more subtle cause of distribution shift in robotics arises from closed-loop deployment of a model. For example, imagine an autonomous vehicle that chooses actions using the output of a perception system that relies on a pretrained foundation model; since the robot's actions influence future states and observations, the distribution of inputs the perception system receives can be very different from the one it was trained on.
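The notions of prediction sets and calibration above can be made concrete with split conformal prediction. The snippet below is a minimal sketch for a generic classifier whose softmax outputs are assumed given; it is not tied to any particular foundation model, and the random data stands in for real calibration and test inputs.

```python
# Sketch: split conformal prediction for calibrated prediction sets.
# `cal_probs`/`cal_labels` are softmax outputs and true labels on a held-out
# calibration set; the resulting sets contain the true label with probability
# at least 1 - alpha on exchangeable future inputs.

import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q, 1.0), method="higher")

def prediction_set(test_probs, threshold):
    # Include every label whose nonconformity score is below the threshold.
    return np.where(1.0 - test_probs <= threshold)[0]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(4), size=500)      # stand-in classifier outputs
cal_labels = rng.integers(0, 4, size=500)            # stand-in true labels
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)

test_probs = rng.dirichlet(np.ones(4))
print("prediction set:", prediction_set(test_probs, qhat))
```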
5) Case Study: Uncertainty Quantification for Language-Instructed Robots: Recently, there has been exciting progress in performing rigorous uncertainty quantification for language-instructed robots [216]. This work proposes an approach called KNOWNO for endowing language-instructed robots with the ability to know when they don't know and to ask for help or clarification from humans in order to resolve uncertainty. KNOWNO performs both instance-level and distribution-level uncertainty quantification in a calibrated manner using the theory of conformal prediction. In particular, given a language instruction (and a description of the robot's environment generated using its sensors), conformal prediction is used to generate a prediction set of candidate actions. If this set is a singleton, the robot executes the corresponding action; otherwise, the robot seeks help from a human by asking them to choose an action from the generated set. Using conformal prediction, KNOWNO ensures that asking for help in this manner results in a statistically guaranteed level of task success (i.e., distribution-level uncertainty quantification). KNOWNO tackles potential challenges with distribution shift by collecting a small amount of calibration data from the target distribution of environments, tasks, and language instructions, and using this as part of the conformal prediction calibration procedure. While KNOWNO serves as an example of calibrated instance-level and distribution-level uncertainty quantification for LLMs, future research should also explore assessing and ensuring the reliability of the various other foundation models commonly employed in robotics, such as vision-language models, vision-navigation models, and vision-language-action models. In addition, exploring how Bayesian uncertainty quantification techniques (e.g., ensembling [217], [218]) can be combined with approaches such as conformal prediction to produce calibrated estimates of instance-level and distribution-level uncertainty is a promising direction.
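At runtime, a calibrated threshold can be used to decide between acting and asking for help, in the spirit of KNOWNO. The sketch below is schematic only: the candidate actions, their LLM-derived scores, and the threshold are assumed to come from a calibration procedure such as the one sketched above, and `ask_human` is a placeholder for the actual clarification interface.

```python
# Sketch (KNOWNO-style): act autonomously only when the calibrated prediction set
# over candidate actions is a singleton; otherwise ask a human to disambiguate.
# `candidate_actions`, `action_probs`, and `qhat` are assumed given; `ask_human`
# is a placeholder for a real human-in-the-loop interface.

import numpy as np

def ask_human(options):
    print("Ambiguous instruction; please choose one of:", options)
    return options[0]  # placeholder for an actual human response

candidate_actions = ["place can in top drawer", "place can in middle drawer",
                     "place can on counter"]
action_probs = np.array([0.48, 0.45, 0.07])   # LLM-derived scores (illustrative)
qhat = 0.90                                   # calibrated threshold (illustrative)

# Include every action whose nonconformity score (1 - p) falls below the threshold.
pred_set = [a for a, p in zip(candidate_actions, action_probs) if 1.0 - p <= qhat]

if len(pred_set) == 1:
    chosen = pred_set[0]            # confident: execute directly
else:
    chosen = ask_human(pred_set)    # uncertain: trigger human clarification
print("executing:", chosen)
```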
E. Safety Evaluation

The problem of safety evaluation is closely related to uncertainty quantification. How can we rigorously test the safety of a foundation-model-based robotic system (i) before deployment, (ii) as the model is updated during its lifecycle, and (iii) as the robot operates in its target environments? We highlight challenges and research opportunities related to these problems below.

1) Pre-Deployment Safety Tests: Rigorous pre-deployment testing is crucial for ensuring the safety of any robotic system. However, this can be particularly challenging for robots that incorporate foundation models. First, foundation models are trained on vast amounts of data; a rigorous testing procedure should therefore ensure that test scenarios were not seen by the model during training. Second, foundation models often commit errors in ways that are hard to predict a priori, so tests need to cover a diverse enough range of scenarios to uncover flaws. Third, foundation models such as LLMs are often used to produce open-ended outputs (e.g., a plan for a robot described in natural language); the correctness of such outputs can be challenging to evaluate in an automated manner if they are evaluated in isolation from the entire system.

The deployment cycle of current foundation models (in non-robotics applications) involves thorough red-teaming by human evaluators [3], [219]. Recent work has also considered partially automating this process by using foundation models themselves to perform red-teaming [220], [221]. Developing ways to perform red-teaming (both by humans and in a partially automated way) for foundation models in robotics is an exciting direction for future research.

In addition to evaluating the foundation model in isolation, it is also critical to assess the safety of the end-to-end robotic system. Simulation can play a critical role here, and it already does so for current field-deployed systems such as autonomous vehicles [222], [223]. The primary challenges are to ensure that (i) the simulator has high enough fidelity for results to meaningfully transfer to the real world, and (ii) test scenarios (manually specified, replicated from real-world scenarios, or automatically generated via adversarial methods [224]) are representative of real-world scenarios and diverse enough to expose flaws in the underlying foundation models. In addition, finding ways to augment large-scale simulation-based testing with smaller-scale real-world testing is an important direction for future work. We emphasize the need for performing such testing throughout the lifecycle of a field-deployed robotic system, especially as updates are made to different components (which may interact in unpredictable ways with foundation models).

2) Runtime Monitoring and Out-of-Distribution Detection: In addition to performing rigorous testing offline, robots with foundation-model-based components should also perform runtime monitoring. This can take the form of failure prediction in a given scenario, which allows the robot to deploy a safety-preserving fallback policy [225]–[229]. Alternatively, the robot can perform out-of-distribution (OOD) detection using experiences collected from a small batch of scenarios in a novel distribution [230]–[233]; this can potentially trigger the robot to cease its operations and collect additional training data in the novel distribution in order to re-train its policy. Developing techniques that perform runtime monitoring and OOD detection with statistical guarantees on false positive/negative error rates in a data-efficient manner remains an important research direction.
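One simple runtime OOD check, among the many approaches cited above, scores new inputs by their distance to in-distribution features. The sketch below uses a k-nearest-neighbor distance in an embedding space, with a threshold chosen on held-out in-distribution data; the features are stand-ins for whatever perception backbone the robot already runs, and this is only one of many possible detectors.

```python
# Sketch: simple k-NN feature-distance OOD detector for runtime monitoring.
# `train_feats` are embeddings of in-distribution data; the threshold is a high
# quantile of distances on held-out in-distribution features. Illustrative only.

import numpy as np

rng = np.random.default_rng(0)
train_feats = rng.normal(0.0, 1.0, size=(2000, 64))   # stand-in embedding bank
val_feats = rng.normal(0.0, 1.0, size=(200, 64))      # held-out in-distribution

def knn_score(x, bank, k=10):
    d = np.linalg.norm(bank - x, axis=1)
    return np.sort(d)[:k].mean()        # mean distance to the k nearest neighbors

val_scores = np.array([knn_score(v, train_feats) for v in val_feats])
threshold = np.quantile(val_scores, 0.95)   # accept ~95% of in-distribution inputs

new_obs_feat = rng.normal(3.0, 1.0, size=64)           # shifted input (illustrative)
is_ood = knn_score(new_obs_feat, train_feats) > threshold
print("out-of-distribution:", bool(is_ood))
```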
Robot platforms are inherently diverse, with different physical characteristics, configurations, and capabilities. The real-world environments that robots operate in are also diverse and uncertain, with a wide range of variations. Because of all these variabilities, robotic solutions are usually tailored to specific robot platforms with specific layouts, environments, and objects for specific tasks; such solutions are not generalizable across different embodiments, environments, or tasks. Hence, to build general-purpose pretrained robotic foundation models, a key factor is to pre-train large models that are task-agnostic, cross-embodiment, and open-ended, and that capture diverse robotic data. In ROSIE [176], a diverse dataset is generated for robot learning by inpainting various unseen objects, backgrounds, and distractors with semantic textual guidance. To overcome variability in robotic settings and improve generalization, another solution, as ViNT [137] presents, is to train foundation models on diverse robotic data across various embodiments. RT-X [46] also investigates the possibility of training large cross-embodied robotic models in the domain of robotic manipulation. RT-X is trained on a multi-embodiment dataset created by collecting data from different robot platforms through a collaboration between 21 institutions, demonstrating 160,266 tasks. RT-X demonstrates that transfer across embodiments improves robot capabilities by leveraging experience from diverse robotic platforms.

H. Benchmarking and Reproducibility in Robotics Settings

Another significant obstacle to incorporating foundation models into robotics research is the necessary reliance on real-world hardware experiments. This creates challenges for reproducibility, as replicating results obtained from hardware experiments may require access to the exact equipment employed. Conversely, many recent works have relied on non-physics-based simulators (e.g., ignoring or greatly simplifying contact physics in grasping) that instead focus on high-level, long-term tasks and visual environment models. Examples of this class of simulators are common and include many of the simulators described above in Sec. V. For example, the Gibson family of simulators [146], [206], the Habitat family [147], [208], [209], RoboTHOR [210], and VirtualHome [129] all neglect low-level physics in favor of simulating higher-level tasks with high visual fidelity. This leads to a large sim-to-real gap and introduces variability in real-world performance, depending on how low-level planning and control modules handle the true physics of the scenario. Even when physics-based simulators are used (e.g., PyBullet or MuJoCo), the absence of standardized simulation settings and computing environments, together with a persistent sim-to-real gap, impedes efforts to benchmark and compare performance across research endeavors. A combination of open hardware, benchmarking in physics-based simulators, and transparency in experimental and simulation setups can significantly alleviate the challenges associated with reproducibility when integrating foundation models into robotics research. These practices contribute to the development of a more robust and collaborative research ecosystem within the field.
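One low-cost step toward the transparency called for above is to pin down and report the exact simulation settings used. The snippet below shows a minimal deterministic PyBullet setup (fixed seed, time step, and solver settings) of the kind that could accompany published results; it is a generic example, not a proposed standard, and the assets loaded are stock PyBullet URDFs.

```python
# Sketch: a minimal, fully specified PyBullet setup to aid reproducibility.
# Reporting the engine version, time step, gravity, solver iterations, and seeds
# alongside results makes physics-based benchmarks easier to replicate.

import numpy as np
import pybullet as p
import pybullet_data

SEED, TIME_STEP, SOLVER_ITERS = 0, 1.0 / 240.0, 50
np.random.seed(SEED)

client = p.connect(p.DIRECT)                       # headless, deterministic stepping
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)
p.setPhysicsEngineParameter(fixedTimeStep=TIME_STEP, numSolverIterations=SOLVER_ITERS)

plane = p.loadURDF("plane.urdf")
robot = p.loadURDF("r2d2.urdf", basePosition=[0, 0, 0.5])

for _ in range(240):                               # simulate one second
    p.stepSimulation()
pos, orn = p.getBasePositionAndOrientation(robot)
print("final base position:", pos)
p.disconnect(client)
```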
VII. CONCLUSION

Through an examination of the recent literature, we have surveyed the diverse and promising applications of foundation models in robotics. We have delved into how these models enhance the capabilities of robots in areas such as decision-making, planning and control, and perception. We also discussed the literature on embodied AI and generalist AI, with an eye toward opportunities for roboticists to extend the concepts in those research fields to real-world robotic applications. The generalization, zero-shot, and multimodal capabilities of foundation models, as well as their scalability, have the potential to transform robotics. However, as we navigate this paradigm shift toward incorporating foundation models in robotics applications, it is imperative to recognize the challenges and potential risks that must be addressed in future research. Data scarcity in robotics applications, high variability in robotic settings, uncertainty quantification, safety evaluation, and real-time performance remain significant concerns that demand future research. We have delved into some of these challenges and discussed potential avenues for improvement.