Recent advances in large foundation models (FMs) have enabled learning general-purpose representations in natural language, vision, and audio. Yet geospatial artificial intelligence (GeoAI) still lacks widely adopted foundation models that generalize across tasks requiring joint reasoning over geospatial objects and human mobility. Such tasks are crucial because mobility, along with satellite imagery, street view, and text, is a core modality for understanding the physical world. We argue that a key bottleneck is the absence of unified, general-purpose, and transferable representations for geospatially embedded objects (GEOs). Such objects include points, polylines, and polygons in geographic space, enriched with semantic context and critical for geospatial reasoning. Much current GeoAI research compares GEOs to tokens in language models, where patterns of human movement and spatiotemporal interactions yield contextual meaning, much as patterns of words do in text. However, modeling GEOs introduces challenges fundamentally different from language, including spatial continuity, variable scale and resolution, temporal dynamics, and data sparsity. Moreover, privacy constraints and global variation in mobility further complicate modeling and generalization. This paper formalizes these challenges, identifies key representational gaps, and outlines research directions for building foundation models that learn behavior-informed, transferable representations of GEOs from large-scale human mobility data, as well as from static contextual information such as points of interest, object shapes, and spatio-temporal semantics.