Motivation

Two of the most productive paradigms in modern AI are built on self-supervised prediction along a single axis of structure in natural data.

Large language models exploit sequential structure. Given a prefix of tokens, the objective is to predict the next token. The training signal requires no external labels — the text provides both input and target. Scaling this objective over large corpora produces models that exhibit a broad range of capabilities, including in-context learning, retrieval, and chain-of-thought problem solving (Brown et al., 2020; Wei et al., 2022).

The Joint Embedding Predictive Architecture (JEPA) exploits spatial and temporal structure. Given visible regions of an image or video, the objective is to predict the abstract representation of masked regions — prediction in representation space rather than pixel space, which forces the encoder to retain only predictable structure and discard irrelevant variation. V-JEPA acquires representations encoding physical regularities from video without explicit supervision (Bardes et al., 2024).

In both cases, each prediction step operates within a single level of abstraction: a token predicts the next token; a patch predicts a neighboring patch. Neither paradigm's prediction objective explicitly models the relationship between abstraction levels — the generative hierarchy by which deeper structures produce surface observations.

SONDE is an architecture whose prediction objective operates on a different axis: prediction across depth levels. This is not claimed to be the only remaining axis of predictable structure — compositional, causal-temporal, and analogical structure are among other candidates — but it is one that no existing self-supervised architecture explicitly targets.

SEQUENCE (LLM):     what comes next?
SPACE/TIME (JEPA):  what's missing here?
DEPTH (SONDE):      what produced this?

Depth Structure in Natural Data

A depth level, as used in this work, is defined by scope containment: level i is deeper than level j if j is lexically contained within, and its meaning depends on, i. In code, a function body (level 0) is contained within and depends on its signature (level 1), which is contained within its class (level 2), which is contained within its module (level 3). This is a specific, mechanically verifiable relation — not a metaphor for "more abstract" or "more fundamental." Other orderings (logical dependency, causal priority) may or may not coincide with scope containment; SONDE is trained on scope containment specifically.

Many domains contain data organized across such levels. In well-structured software, a function body implements a function's contract, which serves a module's interface, which instantiates the system's architecture. These are not universal patterns — poorly structured code may exhibit no clean hierarchy — but where they exist, the hierarchical relationship is a structural property of the data itself, not an imposed annotation.

In code, abstract syntax trees and scope nesting provide explicit, programmatically extractable decomposition into depth levels. The degree to which comparable decomposition exists in other domains (scientific papers, legal documents, mathematical proofs) varies and is domain-dependent. Code is the primary training domain for SONDE precisely because its depth decomposition is unambiguous and mechanically extractable.


Architecture

SONDE is trained from scratch without pretrained weights. The architecture consists of five components.

ByteEncoder. A multi-scale convolutional network operating on raw byte sequences. Parallel convolution kernels of sizes 3, 5, 7, and 9 capture patterns at different scales. Sinusoidal positional encoding preserves sequence order. Learned attention pooling aggregates the convolutional features into a fixed-dimensional vector per depth level. The encoder runs once per sample; this is the computationally dominant step.
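As a concrete illustration, the encoder's data flow can be sketched in NumPy with random placeholder weights. Dimensions follow the specification table; all names and initializations here are illustrative, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # embedding dimension, per the specification table

def byte_encode(data: bytes, kernel_sizes=(3, 5, 7, 9)) -> np.ndarray:
    """Multi-scale byte encoder sketch: embed raw bytes, add sinusoidal
    positions, run parallel 1-D convolutions at four kernel sizes, and
    aggregate with attention pooling into one fixed-dimensional vector."""
    T = len(data)
    embed = 0.02 * rng.standard_normal((256, D))     # byte-value embedding table
    x = embed[np.frombuffer(data, dtype=np.uint8)]   # (T, D)
    pos = np.arange(T)[:, None] / 10000 ** (np.arange(0, D, 2) / D)
    x[:, 0::2] += np.sin(pos)                        # sinusoidal positional
    x[:, 1::2] += np.cos(pos)                        #   encoding
    feats = []
    for k in kernel_sizes:                           # parallel conv branches
        w = 0.02 * rng.standard_normal((k * D, D // len(kernel_sizes)))
        xp = np.pad(x, ((k // 2, k - 1 - k // 2), (0, 0)))
        win = np.lib.stride_tricks.sliding_window_view(xp, (k, D)).reshape(T, -1)
        feats.append(np.maximum(win @ w, 0.0))       # ReLU conv output
    h = np.concatenate(feats, axis=1)                # (T, D) multi-scale features
    q = 0.02 * rng.standard_normal(D)                # attention-pooling query
    scores = h @ q
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ h                                     # one vector per depth level

v = byte_encode(b"def f(x): return x + 1")
```

A real run applies this once per depth level of a sample; the byte sequence is the only long axis, which is why the encoder dominates compute.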

CrossLevelRefiner. A transformer operating on the set of 4 level vectors (one per depth level), performing 5 iterative passes of multi-head cross-attention with gated residual updates. Each level attends to all other levels on every pass. Weights are shared across all passes (recurrent application). This is computationally inexpensive: the input is 4 vectors of dimension 256, not a long sequence.
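The recurrent, weight-shared structure can be sketched as follows (random placeholder weights; the layer norms and feed-forward sublayers of a real transformer are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
D, HEADS, PASSES = 256, 8, 5   # per the specification table

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One shared weight set, applied recurrently on every pass.
Wq, Wk, Wv, Wo = (0.02 * rng.standard_normal((D, D)) for _ in range(4))
Wg = 0.02 * rng.standard_normal((2 * D, D))  # gate over [level, update]

def refine(levels: np.ndarray) -> np.ndarray:
    """5 recurrent passes of multi-head attention over the 4 level
    vectors, each followed by a gated residual update."""
    x = levels                                           # (4, D)
    for _ in range(PASSES):
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        qs, ks, vs = (t.reshape(4, HEADS, D // HEADS).transpose(1, 0, 2)
                      for t in (q, k, v))                # (HEADS, 4, head_dim)
        att = softmax(qs @ ks.transpose(0, 2, 1) / np.sqrt(D // HEADS)) @ vs
        upd = att.transpose(1, 0, 2).reshape(4, D) @ Wo  # merge heads
        gate = 1.0 / (1.0 + np.exp(-np.concatenate([x, upd], axis=1) @ Wg))
        x = x + gate * upd                               # gated residual
    return x

refined = refine(rng.standard_normal((4, D)))
```

Each pass attends over only 4 tokens, so the attention score matrix is 4 × 4 per head; the cost is negligible next to the byte encoder.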

Projection Heads. Four independent 2-layer MLPs (one per depth level, 512 hidden dimensions), projecting refined representations onto the unit hypersphere for cosine-similarity-based comparison. This normalization follows standard practice in contrastive representation learning (Chen et al., 2020).

Cross-Depth Predictor. A 3-layer transformer that receives the projected representations of unmasked levels as input and produces a predicted representation of the masked level. Learned depth-level embeddings encode which levels are visible and which is the prediction target.
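How the depth-level embeddings mark visible levels and the prediction target can be sketched as input assembly (the 3-layer transformer itself is omitted; the dict layout and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 256
level_emb = 0.02 * rng.standard_normal((4, D))  # learned depth-level embeddings

def predictor_input(visible: dict, target_level: int) -> np.ndarray:
    """Assemble the cross-depth predictor's input: each visible level's
    projected vector is tagged with its depth embedding, and the masked
    level contributes only its depth embedding as a query slot."""
    seq = [visible[lvl] + level_emb[lvl] for lvl in sorted(visible)]
    seq.append(level_emb[target_level])   # query token for the masked level
    return np.stack(seq)                  # fed to the 3-layer transformer

x = predictor_input({0: np.zeros(D), 1: np.zeros(D), 3: np.zeros(D)},
                    target_level=2)
```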

EMA Target Encoders. Exponential moving average copies of the encoder and projection heads, updated with momentum coefficient 0.996. Target representations are computed through these parameters, which do not receive gradient updates. This mechanism prevents representational collapse — the degenerate solution in which all representations converge to a constant — and is shared with BYOL (Grill et al., 2020) and JEPA.
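The update rule is one line per parameter tensor; a minimal sketch:

```python
import numpy as np

def ema_update(target: dict, online: dict, momentum: float = 0.996) -> None:
    """target <- momentum * target + (1 - momentum) * online.
    Target parameters receive no gradients; they only track the online
    weights, which is what blocks the collapsed constant solution."""
    for name, w in online.items():
        target[name] = momentum * target[name] + (1.0 - momentum) * w

online = {"w": np.ones(3)}
target = {"w": np.zeros(3)}
ema_update(target, online)   # target["w"] is now 0.004 everywhere
```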

Training Procedure

Given a depth-structured tuple (function body, function signature, class context, module context), one level is selected uniformly at random and masked. The remaining levels are encoded, iteratively refined, and projected. The predictor estimates the masked level's representation. The loss function is:

L = 0.5 × Lcosine + 0.5 × LInfoNCE

where Lcosine = 1 − cos(ŷ, y) measures the distance between predicted and target representations, and LInfoNCE is temperature-scaled cross-entropy (τ = 0.07) computed over the batch, pushing representations of different functions apart while pulling representations of the same function together (Oord et al., 2018). Gradients propagate through the encoder, refiner, projection heads, and predictor. Target encoders update via EMA only.
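The combined loss can be written compactly; a minimal NumPy sketch (batch-diagonal positives, all other batch entries as negatives, as in standard InfoNCE):

```python
import numpy as np

def sonde_loss(pred: np.ndarray, tgt: np.ndarray, tau: float = 0.07) -> float:
    """L = 0.5 * (1 - cos(pred, tgt)) + 0.5 * InfoNCE over the batch.
    pred: (B, D) predictor outputs; tgt: (B, D) EMA-target representations.
    Row i of tgt is the positive for row i of pred; other rows are negatives."""
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    l_cos = (1.0 - (p * t).sum(axis=1)).mean()      # cosine distance term
    logits = p @ t.T / tau                          # (B, B), diagonal = positives
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    l_nce = -np.diag(log_prob).mean()               # temperature-scaled CE
    return 0.5 * l_cos + 0.5 * l_nce

rng = np.random.default_rng(3)
loss = sonde_loss(rng.standard_normal((8, 256)), rng.standard_normal((8, 256)))
```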

Masking Strategy

The choice of which level to mask determines the inference direction: masking a shallow level trains generation (producing surface detail from deeper context), while masking a deep level trains abstraction (inferring what produced the surface). During training, the masked level is selected uniformly at random, training all directions simultaneously.

Data

The training domain is code. Depth levels are extracted programmatically via AST parsing: function body (level 0), function signature (level 1), enclosing class or scope (level 2), and module-level context (level 3). Levels are non-overlapping by construction. The dataset consists of 6,000 depth-structured tuples curated from Lean 4 theorem prover libraries and open-source code repositories, split 80/20 by repository to prevent data leakage between training and evaluation.
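For Python source, such extraction is a short walk over the AST. This sketch handles only methods inside classes and is illustrative of the non-overlapping decomposition, not SONDE's actual pipeline (which also covers Lean 4):

```python
import ast
import textwrap

def extract_tuples(source: str, module_name: str) -> list:
    """Extract (body, signature, class context, module context) tuples
    from Python source via the AST; levels are non-overlapping by
    construction."""
    tuples = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            for fn in node.body:
                if isinstance(fn, ast.FunctionDef):
                    sig = f"def {fn.name}({ast.unparse(fn.args)}):"    # level 1
                    body = "\n".join(ast.unparse(s) for s in fn.body)  # level 0
                    tuples.append((body, sig,
                                   f"class {node.name}:",              # level 2
                                   module_name))                       # level 3
    return tuples

src = textwrap.dedent("""
    class Stack:
        def push(self, x):
            self.items.append(x)
""")
tuples = extract_tuples(src, "stack.py")
```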


Experimental Results

Seven iterations were conducted, each isolating a specific architectural or data variable. All results below are reported on held-out test sets from repositories absent from training data.

v1–v4: Establishing Learnability

v1 (12K samples, no regularization) produced a train-set coherence gap of 0.462. On held-out repositories, the gap was not significantly above zero, indicating complete overfitting.

v2 introduced regularization (dropout 0.2, weight decay, early stopping). 80K samples, 1.4M parameters, 128-dimensional embeddings, cosine loss only. Evaluated on 14 unseen repositories: coherence gap 0.117, retrieval@1 1.5%, anomaly AUC 0.707. The coherence gap — defined as the difference in mean cosine similarity between same-function cross-level pairs and different-function cross-level pairs — was the first metric to exceed what a randomly initialized encoder produces (gap ≈ 0) on held-out data.
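The coherence gap as defined above can be computed directly from a tensor of per-level representations; a minimal sketch on synthetic unit-norm vectors:

```python
import numpy as np

def coherence_gap(reps: np.ndarray) -> float:
    """Coherence gap: mean cosine similarity of same-function cross-level
    pairs minus that of different-function cross-level pairs.
    reps: (N, L, D) unit-norm vectors for N functions at L depth levels."""
    N, L, _ = reps.shape
    sims = np.einsum("ild,jmd->ijlm", reps, reps)    # every pairwise cosine
    cross_level = ~np.eye(L, dtype=bool)             # exclude same-level pairs
    same = sims[np.arange(N), np.arange(N)][:, cross_level].mean()
    diff = sims[~np.eye(N, dtype=bool)][:, cross_level].mean()
    return float(same - diff)

# toy data: each function's levels cluster around one shared direction
rng = np.random.default_rng(4)
reps = rng.standard_normal((10, 1, 64)) + 0.1 * rng.standard_normal((10, 4, 64))
reps /= np.linalg.norm(reps, axis=-1, keepdims=True)
gap = coherence_gap(reps)   # near 1.0 for this tightly clustered toy data
```

A randomly initialized encoder produces gap ≈ 0 because same- and different-function pairs are statistically indistinguishable.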

v3 held training data constant and modified only the architecture: 256-dimensional embeddings, attention pooling replacing mean pooling, InfoNCE contrastive loss added to cosine loss, sinusoidal positional encoding. 9.5M parameters. Coherence gap 0.327 (2.8× v2), retrieval@1 51.0% (34× v2), anomaly AUC 0.896. Because the training data was identical to v2, the improvement is attributable to architectural changes.

v4 added Wikipedia text alongside code (115K training samples total, 29K test). Coherence gap 0.591 (1.8× v3). Note: this comparison is confounded — v4 changed both the domain composition and total data volume relative to v3. However, the within-domain code coherence gap in v4 exceeded v3's converged result, suggesting that the additional domain contributed structure rather than noise. A controlled comparison holding total data volume constant was not conducted.

Version   Gap     Retrieval@1   Anomaly AUC   Controlled Variable
v1        0.462   —             —             Baseline (overfit)
v2        0.117   1.5%          0.707         Regularization
v3        0.327   51.0%         0.896         Architecture (data held constant)
v4        0.591   —             —             Multi-domain data

v5–v6: Byte Encoding and Iterative Refinement

v5 simultaneously changed two variables: the encoder (CNN byte encoder replacing text encoder, eliminating all pretrained components) and the training data (6,000 dense curated tuples with strictly non-overlapping depth levels, replacing 80K samples with less controlled level separation). Coherence gap: 0.424; retrieval@1: 31.5%; anomaly AUC: 0.900. Because two variables changed, the result does not cleanly attribute improvement to either the encoder or the data. However, achieving a gap of 0.424 with 75× fewer samples than v3 (gap 0.327) is consistent with the hypothesis that clean depth-level separation in training data is at least as important as data volume.

v6 introduced the CrossLevelRefiner: iterative cross-attention over compact level vectors with shared weights across passes. Same data as v5.

Metric                            v5      v6
Coherence gap                     0.424   0.947
Same-function cosine similarity   0.452   0.966
Retrieval@1                       31.5%   96.8%
Retrieval@5                       53.0%   100%
Anomaly AUC                       0.900   0.9996

Training set: 4,807 functions. Test set: 1,193 functions from unseen repositories. 10.1M trainable parameters. Training time: 68 minutes on a single consumer GPU.

At v6, depth-level representations from the same function achieve mean cosine similarity 0.966; representations from different functions achieve 0.038. Retrieval@1: given a function body, the model retrieves the correct function signature from among all 1,193 test candidates with 96.8% accuracy (100% within top 5). Anomaly detection: mismatched depth-level pairs (e.g., a function body paired with an unrelated class definition) are distinguished from matched pairs with AUC 0.9996.
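The retrieval metric is a nearest-neighbor lookup over cosine similarities; a minimal sketch on synthetic aligned pairs:

```python
import numpy as np

def retrieval_at_k(queries: np.ndarray, candidates: np.ndarray, k: int = 1) -> float:
    """Retrieval@k: for each query (e.g. a function-body representation),
    rank all candidates (e.g. signature representations) by cosine
    similarity and score a hit if the matched index is in the top k."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    topk = np.argsort(-(q @ c.T), axis=1)[:, :k]
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())

rng = np.random.default_rng(5)
sig = rng.standard_normal((50, 32))
body = sig + 0.05 * rng.standard_normal((50, 32))   # well-aligned pairs
acc = retrieval_at_k(body, sig, k=1)
```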

v7: Generation and Cross-Domain Transfer

v6 established that the learned representations encode depth structure. v7 tested whether this structure generalizes beyond the training domain.

The v6 encoder and refiner are frozen. A dense decoder is trained on top: 64 learned latent tokens are concatenated with the 3 visible-level representations (total sequence length: 67), refined through 7 shared-weight transformer passes, then expanded via MLP to 512 byte positions with local convolutional refinement. Attention cost per pass: 67² = 4,489 operations, versus 512² = 262,144 for naive sequence-length attention (58× reduction).
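The cost saving comes entirely from the short refined sequence; a minimal sketch of the decoder's input assembly and the resulting attention budget (names and initializations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
D, N_LATENT, OUT_BYTES = 256, 64, 512

latents = 0.02 * rng.standard_normal((N_LATENT, D))  # learned latent tokens
visible = rng.standard_normal((3, D))                # three visible-level vectors
seq = np.concatenate([latents, visible])             # refined sequence: 67 tokens

attn_cost_latent = len(seq) ** 2                     # 67^2  = 4,489
attn_cost_naive = OUT_BYTES ** 2                     # 512^2 = 262,144
```

Only after the 7 shared-weight passes is the sequence expanded to the 512 byte positions, so no attention pass ever pays the quadratic cost of the output length.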

The evaluation task is depth ordering: given two samples from the same depth-structured tuple, predict which originates from the deeper level. This is a binary classification task that requires the model to have learned a consistent notion of "deeper" versus "shallower."
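The evaluation loop reduces to pairwise comparison of a scalar depth score; a minimal sketch with a toy scorer that reads a coordinate encoding the true level (hypothetical, for illustration only):

```python
import numpy as np

def depth_order_accuracy(pairs, depth_score) -> float:
    """Depth-ordering evaluation: for each pair of samples from one
    depth-structured tuple, predict that the sample with the higher
    depth score is the deeper one, and compare against ground truth."""
    correct = 0
    for (rep_a, lvl_a), (rep_b, lvl_b) in pairs:
        predicted_a_deeper = depth_score(rep_a) > depth_score(rep_b)
        correct += predicted_a_deeper == (lvl_a > lvl_b)
    return correct / len(pairs)

rng = np.random.default_rng(7)
pairs = []
for _ in range(100):
    lvl_a, lvl_b = rng.choice(4, size=2, replace=False)
    pairs.append(((np.array([float(lvl_a)]), int(lvl_a)),
                  (np.array([float(lvl_b)]), int(lvl_b))))
acc = depth_order_accuracy(pairs, lambda r: r[0])   # perfect toy scorer
```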

Evaluation Setting                          Accuracy
Random baseline                             50.0%
Code (training distribution)                92.0%
Code (held-out test repositories)           91.6%
Newton's manuscripts (zero-shot transfer)   75.6%

The model was trained exclusively on source code. The zero-shot evaluation domain — Newton's manuscripts — consists of 17th-century natural philosophy, theology, and alchemical writings. While both domains use English characters (and code contains English-language identifiers), the domains share no syntactic structure, formatting conventions, or subject matter. The model achieves 75.6% accuracy (25.6pp above chance; p < 0.001, binomial test against H₀: accuracy = 0.5). A limitation of this evaluation: the binomial test establishes that performance exceeds chance, but does not rule out the possibility that the model exploits surface features (e.g., text length, lexical complexity) that correlate with depth without reflecting genuine depth structure. Controlled experiments with length-matched and complexity-matched pairs would strengthen the cross-domain claim. The in-domain train-test gap is minimal (92.0% → 91.6%), indicating robust generalization within the training domain.


Isaac Newton

The Regulae Philosophandi, added to the second edition of the Principia (1713), articulate a methodological principle: "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances." Whether these rules describe Newton's actual investigative method or represent a post-hoc rationalization is debated in the scholarly literature (Westfall, 1980; Cohen, 1999). What is not debated is the structure of the published results: planetary orbits, projectile trajectories, tidal patterns, and the precession of equinoxes — phenomena that had been catalogued independently for centuries — were unified under a single quantitative law, derived by mathematical proof from Kepler's observed periods and distances.

The unpublished manuscripts (estimated at over a million words by the Newton Project) show this pattern applied to other domains. The alchemical notebooks investigate whether chemical transmutation provides evidence of a unifying active principle in matter. The theological chronologies attempt to recover what Newton considered an original, uncorrupted doctrine beneath centuries of textual alteration. In each case, a surface phenomenon is treated as the product of a deeper generating process.

The analogy between this investigative pattern and SONDE's training objective is structural but limited. Newton derived quantitative mathematical laws from quantitative observations through proof. SONDE learns statistical regularities in representation space through gradient descent. The connection is that both operate across levels of abstraction — surface to depth — but the methods, the rigor of the outputs, and the nature of the "depth" involved are fundamentally different. Newton's manuscripts serve as an out-of-distribution evaluation set in this work, not as a claim of methodological equivalence.


Open Question: Domain Generality

The experimental results raise a question the current data cannot resolve: whether the regularity SONDE learns from code is specific to code, or whether it reflects structure shared across domains.

Two observations bear on this question. First, in v4, adding Wikipedia text to code training improved within-domain code performance. This is confounded by the simultaneous increase in total data volume — more data of any kind can improve regularization — and a controlled comparison holding volume constant was not conducted. The observation is therefore suggestive but not dispositive. Second, in v7, code-trained representations achieved 75.6% accuracy on depth ordering in Newton's manuscripts (p < 0.001). This result has not been tested against surface confounds (text length, vocabulary complexity) that may correlate with depth, and the possibility that the model exploits such correlates rather than genuine depth structure has not been ruled out.

If there exist structural invariants in the relationship between abstraction levels that hold across domains, they would need to be formally characterized before the question of domain generality can be addressed rigorously. What, precisely, would such an invariant be? One possibility: statistical regularities in how the information content of representations changes between adjacent levels — compression ratios, mutual information profiles, or spectral properties of the inter-level mapping. Whether these or any other formal properties are shared across code, natural language, mathematics, and other domains is an empirical question that the current experiments do not answer. Two domains do not establish universality, and the confounds identified above have not been controlled for.

What the current results do establish: depth-level structure in code is learnable by self-supervised prediction (v2–v6), and the learned representations generalize to unseen repositories (v6) and show above-chance transfer to one out-of-distribution domain (v7). Whether this extends further is open.


Theoretical Context

Learning generative depth structure from observational data confronts established impossibility results. The Causal Hierarchy Theorem (Bareinboim et al., 2022) proves that observational distributions generically do not determine interventional or counterfactual quantities. Markov equivalence entails that multiple distinct causal graphs can produce identical conditional independence structures, rendering the generating graph unidentifiable from observational data alone. Locatello et al. (2019) demonstrated that unsupervised disentanglement of independent latent factors is impossible without inductive biases that constrain the model class or the data distribution.

Recent theoretical work has identified conditions under which these impossibilities can be circumvented. Morioka and Hyvärinen (ICML 2024) proved identifiability of causal representations from purely observational data under a grouping structure assumption. Richens and Everitt (ICLR 2024) showed that decision-making agents satisfying regret bounds must learn approximate causal models of their environment. The VAR architecture (Tian et al., NeurIPS 2024 Best Paper) demonstrated that next-scale prediction — predicting across spatial resolution levels — outperforms next-token prediction for visual generation, providing empirical evidence that cross-level prediction is a productive self-supervised signal.

Code possesses properties that the identifiability literature associates with favorable conditions: explicit hierarchical structure recoverable via AST parsing, and training across independent codebases. The latter could constitute multi-environment data in the sense of Peters et al. (2016), which provably enables causal identifiability under mild assumptions — but whether SONDE's training procedure over multiple repositories formally satisfies the conditions of that theorem has not been verified. Similarly, whether SONDE's depth-level decomposition constitutes a grouping structure in the sense of Morioka and Hyvärinen (2024) has not been formally established. The theoretical results identify conditions under which depth learning could succeed; whether those conditions are met in practice remains to be demonstrated.

To our knowledge, no prior architecture implements cross-depth representation prediction as a self-supervised training objective. Related work includes H-JEPA (LeCun, 2022), which proposes hierarchical prediction across temporal scales but has not been implemented; PrediRep (Ororbia & Friston, 2024), which performs cross-level prediction in a predictive coding framework but does not scale beyond 5–7 layers; VAR, which predicts across spatial resolutions rather than abstraction depth; and DreamCoder (Ellis et al., 2021), which learns hierarchical program libraries but operates in constrained symbolic domains.


Specifications

Parameter              Value
Architecture           CNN byte encoder + cross-level refiner + cross-depth predictor
Trainable parameters   10.1M (20M including EMA targets)
Embedding dimension    256
Hidden dimension       512
Attention heads        8
Depth levels           4
Refiner passes         5 (weight-shared)
Predictor layers       3
Training data          6,000 depth-structured tuples (Lean 4 + open-source code)
Train/test split       80/20 by repository
Training time          68 minutes on a single consumer GPU
Loss                   0.5 × Lcosine + 0.5 × LInfoNCE (τ = 0.07)
EMA momentum           0.996
Dropout                0.15
Optimizer              AdamW (lr = 5 × 10⁻⁵)