SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference
Paper: Choi & Park, arXiv:2604.19623 (April 2026)
Focus: Efficient inference in edge-cloud hybrid systems through optimal evidence composition
Key Contribution: Demonstrates that coverage-aware patch selection outperforms importance-only methods under hard bandwidth constraints
What This Paper Does
This paper addresses a practical but underexplored problem in edge-cloud inference systems: how should the edge device select which image patches to transmit to the server when the uplink channel strictly limits the number of patches per request?
The standard approach—selecting patches by importance (attention score)—turns out to be fundamentally limited. The paper shows that this creates "coverage gaps": high-attention patches cluster in the same semantic region, wasting budget on overlapping information. SAGE proposes a simple but effective alternative that combines importance filtering with diversity-maximizing sampling, achieving 93% of the server's full-transmission accuracy while sending fewer than half the patches.
The insight is elegant: under hard budgets, every transmitted patch must count, so we should prioritize information coverage alongside importance.
Prerequisites: What You Need to Know
Edge-Cloud Hybrid Inference
In a typical edge-cloud system:
- A lightweight edge model (e.g., DeiT-Tiny) runs on resource-constrained devices
- When the edge is uncertain, it offloads to a powerful server (e.g., DeiT-Base)
- The uplink channel has hard constraints: bandwidth caps, latency deadlines, energy budgets
For image classification, this means selecting which image information to transmit is critical.
Vision Transformers and Patch Tokens
ViTs break images into discrete patch tokens (e.g., 196 patches for a 14×14 grid). This discrete structure is crucial:
- Early approaches relied on split computing: transmit entire feature maps (fixed size)
- ViTs enable selective transmission: each patch is independent, so we can choose a subset
- This transforms the problem from "compress the feature map" to "select which patches matter"
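As a concrete illustration (my sketch, not the paper's code), patch extraction is a simple reshape; for a 224×224 RGB image with 16×16 patches this yields exactly the 196 tokens discussed here:

```python
import numpy as np

def patchify(img, p=16):
    """Split an H x W x C image into non-overlapping p x p patch tokens."""
    H, W, C = img.shape
    gh, gw = H // p, W // p
    # (gh, p, gw, p, C) -> (gh, gw, p, p, C) -> one flat vector per patch
    x = img[:gh * p, :gw * p].reshape(gh, p, gw, p, C).swapaxes(1, 2)
    return x.reshape(gh * gw, p * p * C)

tokens = patchify(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): a 14x14 grid of flattened 16x16x3 patches
```

Because each row is an independent token, transmitting a subset of rows is all "selective transmission" requires.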
Attention-Based Importance
In prior work (Im et al., 2024), patches are ranked by the model's attention scores and the top-B are selected. This makes intuitive sense: high-attention patches are "important" to the model. However, this strategy assumes that individual patch importance translates directly to accuracy gains, which turns out not to be true under hard budgets.
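The baseline reduces to a one-line ranking; a minimal sketch (the function name is mine, not from Im et al.):

```python
import numpy as np

def attention_prefix(attn, B):
    """Importance-only baseline: keep the B highest-attention patches."""
    return np.argsort(attn)[::-1][:B].copy()

rng = np.random.default_rng(0)
attn = rng.random(196)
top = attention_prefix(attn, 64)
assert len(top) == 64 and attn[top[0]] == attn.max()
```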
Coverage and Redundancy in Deep Learning
Recent work on efficient ViTs (token pruning, token merging) has identified a key insight: importance-only selection is redundant. Methods like DivPrune and BAT show that diversity among retained tokens matters. However, this insight hasn't been applied to the communication setting, where the server has no access to discarded patches and cannot recover from redundancy.
The Problem: Why Importance Alone Fails
The Hard Budget Constraint
The paper formalizes a critical distinction: average-cost optimization vs. hard per-request budgets.
In prior work, the metric is average communication cost: the expected number of transmitted patches per offloaded request, averaged over the dataset.
This can be misleading. A low average cost doesn't guarantee that individual offloaded requests fit within the uplink constraint. In their experiments, even with budget B=64 (one-third of the 196 patches), over 99% of offloaded images would require more than B patches under standard attention-based selection.
Why? Because the images offloaded to the server are precisely the hard ones. Their attention distributions are flat and diffuse (high entropy), so importance-based selection retains 140-150 patches before reaching a reasonable importance threshold, far more than practical budgets allow.
The hard-budget formulation forces realistic deployability: every offloaded request must satisfy B, no exceptions.
Empirical Evidence Against Importance-Only
The paper provides two compelling experiments:
Evidence 1: Individual importance doesn't predict value
- Compare SAGE's selected patches against Attention Prefix
- Patches SAGE adds have 3× lower server attention than patches SAGE drops
- Yet SAGE improves accuracy by +2-4 percentage points
- Interpretation: Value isn't individual importance; it's marginal contribution to information coverage
Evidence 2: Coverage has independent value
- Four strategies compared: Random (no info), Uniform Grid (coverage only), Attention Prefix (importance only), SAGE (importance + coverage)
- Uniform Grid (spatially uniform, no content awareness) outperforms Random by +6 pp at B=64
- SAGE consistently achieves highest accuracy by combining both
This cleanly separates importance and coverage as distinct signals.
The Solution: SAGE Method
Design Principle
Importance filtering first, then coverage maximization.
The method has two stages:
- Prefilter by importance: Retain the top-2B patches by attention (candidates)
- Select by diversity: Among candidates, greedily choose patches that maximize coverage
Algorithm Details
Algorithm 1 (SAGE). Input: attention vector a, patch embeddings Z, budget B.
The key insight: After prefiltering, don't use attention scores anymore. Only embedding similarity matters. This ensures all selected patches are reasonably important while maximizing their diversity.
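A minimal NumPy sketch of the two stages, under my reading of Algorithm 1 (names, the choice to seed FPS from the most important candidate, and tie-breaking are my assumptions, not the paper's exact implementation):

```python
import numpy as np

def sage_select(attn, Z, B, k=2):
    """SAGE sketch: attention prefilter, then greedy farthest-point
    sampling (FPS) on cosine similarity of patch embeddings."""
    # Stage 1: prefilter to the top-k*B patches by attention.
    cand = np.argsort(attn)[::-1][:k * B]
    E = Z[cand].astype(np.float64)
    E /= np.linalg.norm(E, axis=1, keepdims=True)   # rows now unit-norm
    # Stage 2: greedy FPS. Seed with the most important candidate, then
    # repeatedly add the candidate least similar to the selected set.
    selected = [0]
    max_sim = E @ E[0]                  # max cosine sim to selected set
    for _ in range(B - 1):
        max_sim[selected] = np.inf      # never re-pick a chosen patch
        nxt = int(np.argmin(max_sim))   # most dissimilar candidate
        selected.append(nxt)
        max_sim = np.maximum(max_sim, E @ E[nxt])
    return cand[np.array(selected)]     # indices into the original 196

rng = np.random.default_rng(0)
attn, Z = rng.random(196), rng.normal(size=(196, 192))
S = sage_select(attn, Z, B=48)
assert len(set(S.tolist())) == 48       # exactly B distinct patches
```

Note that attention is used only to build `cand`; inside the loop, only embedding similarity drives the choice, matching the design principle above.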
Why This Works
Intuition: If two patches have similar embeddings, they encode similar semantic features. By selecting patches with low maximum similarity to the current set, we ensure the evidence covers diverse aspects of the image.
Computational efficiency: No training, no fine-tuning. Requires only:
- One forward pass to extract embeddings and attention
- O(B² × D) operations for greedy selection (negligible compared to inference)
Flexibility: Works with any frozen pretrained ViT; no model modifications.
Hyperparameter: Prefilter Ratio
The only hyperparameter is the prefilter size (top-kB where k is typically 2):
- k=1: Reduces to Attention Prefix (no diversity room)
- k=2: Default; provides candidates while constraining importance
- k≥3: Admits low-importance patches; noise offsets diversity gains
Their ablation shows k=2 is robust.
Experimental Setup and Results
System Configuration
Models:
- Edge: DeiT-Tiny (5.7M parameters)
- Server: DeiT-Base (86M parameters)
- Both pretrained on ImageNet-1K, frozen (no fine-tuning)
- Images produce 196 patches (14×14 grid, 16×16 patch size)
Dataset: ImageNet-1K validation (50,000 images, 1,000 classes)
Offloading: Controlled by confidence gate parameter η
- Lower η: more offloading, lower local accuracy
- Higher η: less offloading, higher local accuracy
- Under η=1.0, approximately 35.7% of images (17,829) are offloaded
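One plausible form of the gate is a max-softmax threshold; this is an assumption on my part (the paper's gate statistic may differ), shown only to make the offloading decision concrete:

```python
import numpy as np

def should_offload(edge_logits, eta):
    """Sketch of a confidence gate (assumed form: offload when the edge
    model's top softmax probability falls below the threshold eta)."""
    z = edge_logits - edge_logits.max()     # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return bool(p.max() < eta)

assert should_offload(np.zeros(1000), eta=0.9)          # uniform: offload
assert not should_offload(np.array([12.0, 0.0, 0.0]), eta=0.9)
```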
Metrics:
- Offloaded accuracy: accuracy on images sent to server (direct measure of evidence quality)
- Overall accuracy: system-level accuracy (local + offloaded images)
Main Results (Table III)
| Budget B | Method | Offloaded Acc | Overall Acc |
|---|---|---|---|
| 32 | Random | 10.9% | 61.4% |
| 32 | Uniform Grid | 9.8% | 61.0% |
| 32 | Attention Prefix | 16.9% | 63.5% |
| 32 | BAT | 15.6% | 63.0% |
| 32 | SAGE | 19.2% | 64.3% |
| 48 | Attention Prefix | 34.0% | 69.6% |
| 48 | BAT | 35.0% | 70.0% |
| 48 | SAGE | 38.4% | 71.2% |
| 64 | Attention Prefix | 47.3% | 74.4% |
| 64 | BAT | 49.0% | 75.0% |
| 64 | SAGE | 50.2% | 75.4% |
| 96 | Attention Prefix | 57.4% | 76.4% |
| 96 | SAGE | 60.2% | 79.0% |
Server ceiling (all 196 patches): 64.4% offloaded, 80.4% overall
Key finding: SAGE achieves 93% of the server ceiling at B=96 (just under half the patches) with +2-3 pp gains across tight budgets.
Ablation Studies
Effect of confidence gate (η):
- SAGE advantage widens as budget decreases
- At B=32, gain exceeds +2 pp across all η values
- Gain largest for hardest images (highest η), reaching +3.0 pp
Prefilter size (k ratio):
- k=2 is robust; k=1 collapses to Attention Prefix
- k≥3 admits low-importance noise
- Confirms importance filtering is essential
Where SAGE helps most:
- Partitioned offloaded images by attention entropy
- High-entropy images (flat attention): +5.7 pp gain at B=48
- Low-entropy images (concentrated attention): +2.8 pp gain
- Intuition: Hard cases with diffuse attention benefit most from coverage diversity
Spatial Coverage Quantification (Table II)
Measured coverage on 7×7 coarse grid (fraction of spatial cells with ≥1 patch):
| Budget B | Attention Prefix | SAGE | Δ |
|---|---|---|---|
| 16 | 25.1% | 27.2% | +2.1 |
| 32 | 43.0% | 46.0% | +3.0 |
| 48 | 56.9% | 59.8% | +2.9 |
| 64 | 67.9% | 70.4% | +2.5 |
| 96 | 83.9% | 85.2% | +1.3 |
Observation: SAGE consistently achieves broader spatial coverage, with largest gaps at tight budgets where it matters most.
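The coverage metric above is easy to reproduce; a sketch of my reading of the Table II metric (the exact fine-to-coarse mapping is an assumption):

```python
import numpy as np

def coarse_coverage(patch_ids, fine=14, coarse=7):
    """Fraction of coarse grid cells containing at least one selected
    patch, for patches indexed row-major on a fine x fine grid."""
    ids = np.asarray(patch_ids)
    r, c = ids // fine, ids % fine          # fine-grid row/col per patch
    s = fine // coarse                      # fine cells per coarse cell
    cells = set(zip((r // s).tolist(), (c // s).tolist()))
    return len(cells) / (coarse * coarse)

assert coarse_coverage(range(196)) == 1.0   # all patches: full coverage
assert abs(coarse_coverage([0]) - 1 / 49) < 1e-12
```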
Qualitative Analysis (Figure 4)
Visual inspection confirms:
- Attention Prefix: Concentrates selections around the single most salient region (red)
- SAGE: Distributes patches across complementary image areas (blue)
- Example: 53% → 78% coverage, enabling the server to "see" multiple object parts
Key Insights and Limitations
What We Learn
1. Hard budget constraints are fundamentally different. Average-cost optimization provides no deployability guarantee; individual requests can exceed the uplink limit.
2. Importance and coverage are orthogonal. Value doesn't come from individual patch importance but from marginal contribution to information diversity. This bridges the computational-efficiency and communication-efficiency literatures.
3. Coverage carries independent value. Uniform Grid (no content awareness) achieves 96% of Attention Prefix accuracy at B=64, showing that spatial coverage alone is substantial.
4. Training-free methods can outperform learned baselines. No fine-tuning needed; frozen embeddings suffice.
Limitations and Boundary Conditions
Limited scope:
- Evaluated only on ImageNet-1K with DeiT models
- Unclear how well this transfers to other domains (COCO, medical imaging, etc.)
- Only Vision Transformers tested; CNN-based edge models unexplored
Scalability questions:
- Assumes relatively small patch vocabularies (196)
- Edge devices store full embeddings; memory footprint on ultra-constrained devices (IoT) not discussed
- Computational overhead of iterative FPS on-device not deeply analyzed
Optimality gap:
- SAGE is greedy; no proof of optimality
- Could covering-aware selection with learned importance outperform SAGE? (Not tested)
- The 2× prefilter ratio is fixed; adaptive ratios based on input difficulty not explored
Communication assumptions:
- Assumes all patches have equal transmission cost
- Real systems have variable overhead per patch (packet headers, compression)
- Doesn't address cases where spatial locality helps (sparse transmission)
Generalization:
- Confidence gate (η) is task-specific; unclear how sensitive SAGE is to gate tuning
- All experiments use frozen models; fine-tuned edge models might exhibit different attention patterns
Reproducibility and Practical Deployment
Code and Data Availability
The paper is from KAIST (Korea Advanced Institute of Science and Technology). Standard ImageNet-1K is public. Implementation requires:
- PyTorch with timm library (for pretrained ViTs)
- Basic numpy/scipy for embeddings and cosine similarity
- Approximately 50-100 lines of Python for the core SAGE algorithm
No learned parameters to train, so reproduction should be straightforward.
Implementation Considerations
On the edge device:
- Load pretrained DeiT-Tiny, compute local prediction and embeddings
- Extract attention scores from CLS token → patch attention
- Run SAGE prefilter + greedy selection (Algorithm 1)
- Transmit selected patches to server
Server side:
- Embed the received patch subset, keeping each patch's original position index for positional embeddings
- Run standard ViT inference on the reduced token sequence
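A sketch of the server-side sequence assembly (the interface and names here are my assumptions, not the paper's code): each transmitted patch keeps its original positional embedding, and the CLS token is prepended.

```python
import numpy as np

def server_input(patch_tokens, patch_ids, cls_tok, pos_emb):
    """Assemble a (B+1, D) token sequence from B transmitted patches.
    pos_emb has 197 rows for a 196-patch ViT; slot 0 belongs to CLS."""
    ids = np.asarray(patch_ids)
    toks = patch_tokens + pos_emb[ids + 1]      # original positions kept
    return np.vstack([cls_tok + pos_emb[0], toks])

D = 768
seq = server_input(np.zeros((48, D)), np.arange(48),
                   np.zeros(D), np.zeros((197, D)))
assert seq.shape == (49, D)
```

Preserving the original position indices is what lets the frozen server model interpret a sparse subset without retraining.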
Latency considerations:
- Edge inference (DeiT-Tiny): ~10-50ms
- Embedding extraction + SAGE selection: ~1-5ms
- Server round trip (uplink + DeiT-Base inference): ~200-500ms, depending on network latency
- Total overhead for offloaded requests: minimal
Real-World Deployment
Figure 9 in the paper plots operating points across different devices (Orin Nano, RPi 5) and channels (NB-IoT, LTE-M, 5G, Wi-Fi). Key deployments:
- IoT devices + NB-IoT: Tight budget (B=32), latency ~1s, 60-65% overall accuracy
- Raspberry Pi + 5G: Moderate budget (B=64), latency ~0.1s, 75% overall accuracy
- Edge GPU + Wi-Fi: Can achieve server ceiling with B≥96
The method is practical and deployable today.
Comparison to Prior Work
Attention-Based Selection (Im et al., 2024)
- Limitation: No hard budget guarantee; easy images dominate average-cost metrics
- SAGE advantage: +2-3 pp under deployable hard budgets
Token Merging (ToMe)
- Merges similar tokens; SAGE outperforms by +0.5-1.5 pp
- ToMe works within single-device context; SAGE accounts for zero-context server reception
BAT (Beyond Attentive Tokens)
- SOTA importance-diversity balance for computational pruning
- SAGE beats BAT by +0.4-3.6 pp when applied to communication setting
- Key difference: BAT optimizes diversity among selected tokens; SAGE optimizes diversity across the image
Semantic Communication Approaches
- DeepJSCC and others learn joint source-channel coding
- SAGE is training-free; no fine-tuning overhead, immediate deployment
- Trade-off: learned methods might achieve higher accuracy with model-specific optimization
Technical Depth: Embedding Diversity and Farthest-Point Sampling
Why Cosine Similarity on Embeddings?
The greedy selection (Algorithm 1) uses farthest-point sampling (FPS) on normalized patch embeddings: at each step it adds the candidate j that minimizes the maximum cosine similarity cos(z_j, z_i) over already-selected patches i.
This maximizes the minimum distance to already-selected patches, ensuring coverage. The intuition:
- Patch embeddings encode semantic features (color, texture, shape)
- Low cosine similarity → different semantic content
- Greedy FPS iteratively adds the most different patches
Why greedy works: Greedy FPS is a constant-factor (2-)approximation for the max-min dispersion objective. Here, with N=196 and B≤96, the gap to optimal is small in practice.
Relation to Information Theory
While not explicitly framed as such, SAGE implicitly maximizes information coverage:
- Each patch carries information about different image regions
- Selecting maximally-diverse patches ensures we don't "repeat" information
- Under hard budget B, this can be viewed as a heuristic for maximizing the information carried by the transmitted set
Deep Technical Analysis: Why Greedy FPS Suffices
The Farthest-Point Sampling Algorithm
Algorithm 1 uses a greedy approach to maximize coverage. At each iteration it selects the candidate whose maximum cosine similarity to the already-selected set is smallest, i.e., the patch most different from everything chosen so far.
This is the farthest-point sampling (FPS) algorithm, a classical technique in computational geometry.
Theoretical properties:
- Greedy FPS gives a constant-factor (2-)approximation for the max-min dispersion problem
- With N=196 patches, the remaining gap to optimal is small in practice
- Computational complexity: O(B² × D) where D is embedding dimension
- For D=768 (typical ViT) and B≤96, this is ~7M operations—marginal compared to inference (~1B operations)
Practical advantages:
- Deterministic (no random sampling needed)
- No learned parameters (no training required)
- Robust across different image types
- Parameter-free (after prefilter ratio selection)
Why Greedy Beats Exact Optimization Here
One might ask: could exact combinatorial optimization find a better patch set? Possibly, but:
- Computational cost: exhaustive search over C(N, B) subsets is astronomically expensive, and integer programming is NP-hard in the worst case
- Marginal gains: an exact solution would likely improve accuracy by a few tenths of a point at most
- Deployment friction: engineering a reliable solver or heuristic requires domain expertise
- Generalization: a learned optimizer may overfit to ImageNet
SAGE's greedy approach achieves near-optimal selections with O(B² × D) cost, a sweet spot for deployment.
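To make the scale concrete (a quick back-of-envelope check of the subset count, not a figure from the paper):

```python
from math import comb

# Number of ways to choose B=64 of N=196 patches. Exhaustive search
# over this space is hopeless, which is why a greedy rule is attractive.
n_subsets = comb(196, 64)
print(len(str(n_subsets)), "digits")  # a 50+ digit number
```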
Case Study: When Coverage Matters Most
Analyzing High-Entropy Images
The paper shows SAGE's largest gains (+5.7 pp) come from high-entropy images at B=48. Let's understand why:
Example scenario: Image with multiple objects
- Attention Prefix selects top-48 patches by attention
- These cluster around the largest/most salient object
- The server "sees": object A in high detail, but no context about objects B, C, D
- Server inference: "This is object A" (often confident but wrong)
SAGE's approach:
- Prefilter: retain top-96 by attention (includes patches from all objects)
- FPS: iteratively spread selections across objects A, B, C, D
- Server inference: "Multiple objects present"—better grounding for decision
Quantified improvement:
- Attention Prefix achieves 32% offloaded accuracy
- SAGE achieves 38.4% offloaded accuracy
- The +6.4 pp (20% relative improvement) reflects the value of diverse evidence
This pattern likely extends to other vision tasks where multiple semantic regions matter.
Low-Entropy Images: Where Coverage Helps Less
In contrast, for images with concentrated attention (e.g., a centered object):
- Attention Prefix naturally concentrates selections on the relevant region
- Random blocks outside the region add noise
- SAGE's diversity gain is modest (+2.8 pp)
- But SAGE still wins: even 2-3 pp improvements are significant at scale
Practical Deployment Guide
Real-World Latency Breakdown
Figure 9 provides latency estimates across device-channel pairs. Here's the detailed breakdown:
Edge device (DeiT-Tiny inference):
- Orin Nano: 10-15 ms
- Raspberry Pi 5: 40-60 ms
- Inference + embedding extraction: +5 ms
- SAGE selection (B=48): +3 ms
- Total edge latency: 15-70 ms
Transmission (depends on channel):
- Wi-Fi: 10-20 ms for B=48 patches
- 5G: 5-10 ms
- LTE-M: 50-100 ms
- NB-IoT: 200-500 ms
Server inference (DeiT-Base):
- T4 GPU: 100-150 ms
- Multiple CPUs: 500-1000 ms
Total per-request latency:
- Wi-Fi + GPU: 130-200 ms (excellent)
- 5G + GPU: 110-170 ms (excellent)
- LTE-M + GPU: 160-270 ms (good)
- NB-IoT + CPU: 700-1700 ms (acceptable for non-real-time)
Memory footprint:
- DeiT-Tiny weights: 22 MB
- Cached embeddings (196×768): 600 KB
- Total per image: ~1 MB (manageable on IoT devices)
Deployment Checklist
Before deploying SAGE, verify availability: a pretrained ViT whose attention scores and patch embeddings are accessible on the edge device.
Comparison with State-of-the-Art Alternatives
vs. Split Computing (Traditional)
Traditional split computing partitions the model at a fixed layer, transmitting intermediate feature maps.
| Aspect | Split Computing | SAGE |
|---|---|---|
| Flexibility | Fixed at deployment | Adaptive per request (via B) |
| Overhead | Monolithic features (~30-100 KB) | Selective patches (~10-50 KB) |
| Interpretability | Black-box features | Interpretable patch selection |
| Optimization | Coarse (layer-level) | Fine-grained (patch-level) |
When split computing wins:
- CNNs on edge (ViT advantages don't apply)
- Extremely tight budgets where even selective transmission fails
- Custom model architectures without discrete tokens
When SAGE wins:
- Any ViT-based system
- Dynamic budget constraints (network conditions vary)
- Need for per-instance optimization
vs. Learned Feature Compression (JSCC)
Joint source-channel coding methods (DeepJSCC) learn end-to-end compression.
| Aspect | DeepJSCC | SAGE |
|---|---|---|
| Training cost | Substantial (100+ epochs) | None |
| Adaptation | Fixed after training | Runtime configurable |
| Interpretability | Learned codes (opaque) | Explicit patch selection |
| Theoretical guarantee | Approaches Shannon limit | Heuristic but reliable |
| Deployment friction | Retraining per new task | Plug-and-play |
Trajectory:
- For research/offline applications: DeepJSCC likely wins on peak accuracy
- For production systems: SAGE wins on time-to-deployment and flexibility
A hybrid approach—SAGE for rapid prototyping, then learning for production optimization—is a viable path.
Limitations and Open Questions
Fundamental Limitations
1. Greedy suboptimality: FPS is only a constant-factor approximation for max-min dispersion. Could we do better?
- Answer: Unlikely without exponential search
- Practical relevance: The remaining gap is negligible next to a 2-3 pp accuracy improvement
2. Fixed embedding space: Assumes the frozen ViT's embeddings capture task-relevant semantics.
- When this breaks: Task-specific data (e.g., medical imaging) where generic embeddings fail
- Solution: Fine-tune embeddings or learn a task-specific distance metric (adds complexity)
3. Prefilter ratio (k=2) is heuristic: Why 2× and not 1.5× or 3×?
- Answer: Empirically optimal for ImageNet
- Concern: May not generalize to all domains
Open Research Questions
1. Cross-domain generalization: Does SAGE work equally well on:
- Medical imaging (lower variation, higher importance on details)
- Surveillance video (motion cues)
- Satellite imagery (spatial patterns, less object-centric)
2. Adaptive prefiltering: Could we set k based on attention entropy?
- Entropy high → increase k (more candidates for diversity)
- Entropy low → decrease k (importance alone suffices)
3. Learned importance weighting: Can we improve on attention-based importance with learned gates?
- Trade-off: adds parameters, loses the training-free advantage
- Potential gain: might adapt better to specific edge models
4. Theoretical analysis: Can we prove SAGE is optimal under specific conditions?
- E.g., if embeddings are uniformly distributed, does greedy achieve optimality?
Conclusion
SAGE elegantly solves a practical problem often glossed over: when edge devices must offload under strict bandwidth limits, how should they choose what to send? The paper's key insight—that coverage matters as much as importance—is simple but was missing from prior work.
The method is:
- Simple: Two-stage algorithm, no learned parameters
- Effective: +2-3 pp improvements over attention-based methods under tight budgets
- Deployable: Training-free, works with frozen models, minimal computational overhead
- Well-analyzed: Clear ablations, coverage quantification, and real-world latency estimates
- Generalizable: Applicable to any ViT-based edge-cloud system
For practitioners building edge-cloud inference systems, SAGE provides a principled, ready-to-deploy approach that works today. The paper makes an important contribution by formalizing the hard-budget constraint and demonstrating the value of coverage-aware evidence composition—insights that will likely influence future work in efficient collaborative inference.
References
[1] Choi, I., & Park, H. (2026). SAGE: Training-free semantic evidence composition for edge-cloud inference under hard uplink budgets. arXiv:2604.19623
[2] Im, J., et al. (2024). Attention-aware semantic communications for collaborative inference. IEEE Internet of Things Journal, 11(22).
[3] Long, S., et al. (2023). Beyond attentive tokens: Incorporating token importance and diversity for efficient vision transformers. CVPR.
[4] Bolya, D., et al. (2023). Token merging: Your ViT but faster. ICLR.
[5] Ranjbar Alvar, S., et al. (2025). DivPrune: Diversity-based visual token pruning for large multimodal models. CVPR.
[6] Kang, Y., et al. (2017). Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. ASPLOS.
[7] Shao, J., & Zhang, J. (2020). BottleNet++: End-to-end feature compression for device-edge co-inference. ICC Workshops.