
SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference Under Hard Uplink Budgets


Paper: Choi & Park, arXiv:2604.19623 (April 2026)
Focus: Efficient inference in edge-cloud hybrid systems through optimal evidence composition
Key Contribution: Demonstrates that coverage-aware patch selection outperforms importance-only methods under hard bandwidth constraints


What This Paper Does

This paper addresses a practical but underexplored problem in edge-cloud inference systems: how should the edge device select which image patches to transmit to the server when the uplink channel strictly limits the number of patches per request?

The standard approach—selecting patches by importance (attention score)—turns out to be fundamentally limited. The paper shows that this creates "coverage gaps": high-attention patches cluster in the same semantic region, wasting budget on overlapping information. SAGE proposes a simple but effective alternative that combines importance filtering with diversity-maximizing sampling, achieving 93% of the server's full-transmission accuracy while sending fewer than half the patches.

The insight is elegant: under hard budgets, every transmitted patch must count, so we should prioritize information coverage alongside importance.


Prerequisites: What You Need to Know

Edge-Cloud Hybrid Inference

In a typical edge-cloud system:

  • A lightweight edge model (e.g., DeiT-Tiny) runs on resource-constrained devices
  • When the edge is uncertain, it offloads to a powerful server (e.g., DeiT-Base)
  • The uplink channel has hard constraints: bandwidth caps, latency deadlines, energy budgets

For image classification, this means selecting which image information to transmit is critical.

Vision Transformers and Patch Tokens

ViTs break images into discrete patch tokens (e.g., 196 patches for a 14×14 grid). This discrete structure is crucial:

  • Early approaches relied on split computing: transmit entire feature maps (fixed size)
  • ViTs enable selective transmission: each patch is independent, so we can choose a subset
  • This transforms the problem from "compress the feature map" to "select which patches matter"
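The patch-grid layout is easy to make concrete. Below is a minimal NumPy sketch that splits a 224×224 image into the 196 flattened tokens of a 14×14 grid; `patchify` is an illustrative helper, not code from the paper.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an HxWxC image into non-overlapping flattened patches.

    A 224x224x3 image with 16x16 patches yields the 196-token
    (14x14 grid) layout used by DeiT-style ViTs.
    """
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    patches = (
        image[: gh * patch_size, : gw * patch_size]
        .reshape(gh, patch_size, gw, patch_size, c)
        .transpose(0, 2, 1, 3, 4)   # (gh, gw, patch, patch, c)
        .reshape(gh * gw, patch_size * patch_size * c)
    )
    return patches

image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)  # (196, 768)
```

Because each row of `tokens` is independent, transmitting a subset is just index selection, which is exactly what makes the selective-transmission framing possible.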

Attention-Based Importance

In prior work (Im et al., 2024), patches are ranked by the model's attention scores and the top-B are selected. This makes intuitive sense: high-attention patches are "important" to the model. However, this strategy assumes that individual patch importance translates directly to accuracy gains, which turns out not to be true under hard budgets.
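As a reference point, this baseline reduces to a single top-B index selection. A sketch, with `attention_prefix` as a hypothetical name for the Im et al. strategy:

```python
import numpy as np

def attention_prefix(attn: np.ndarray, budget: int) -> np.ndarray:
    """Importance-only baseline: indices of the top-B patches by attention."""
    return np.argsort(attn)[::-1][:budget]

rng = np.random.default_rng(0)
attn = rng.random(196)                      # stand-in for CLS->patch attention
selected = attention_prefix(attn, budget=64)
print(selected.shape)  # (64,)
```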

Coverage and Redundancy in Deep Learning

Recent work on efficient ViTs (token pruning, token merging) has identified a key insight: importance-only selection retains redundant tokens. Methods like DivPrune and BAT show that diversity among retained tokens matters. However, this insight hasn't been applied to the communication setting, where the server has no access to discarded patches and cannot recover from redundancy.


The Problem: Why Importance Alone Fails

The Hard Budget Constraint

The paper formalizes a critical distinction: average-cost optimization vs. hard per-request budgets.

In prior work, the metric is average communication cost:

E[C] = Pr(offload) × E[C | offload]

This can be misleading. Low average cost doesn't guarantee that individual offloaded requests fit within the uplink constraint. In their experiments, even with budget B=64 (one-third of patches), over 99% of offloaded images exceed the budget under standard attention-based selection.

Why? Because images offloaded to the server are precisely the hard ones. Their attention distributions are flat and diffuse (high entropy), so importance-based selection retains 140-150 patches before reaching reasonable importance thresholds—far exceeding practical budgets.

The hard-budget formulation forces realistic deployability: every offloaded request must satisfy B, no exceptions.
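A toy calculation (all numbers hypothetical, chosen only to mirror the paper's qualitative finding) shows how the average-cost metric can look safe while every individual offloaded request violates the budget:

```python
import numpy as np

rng = np.random.default_rng(1)
offload_rate = 0.35                               # ~35% of requests offload
# hypothetical per-offload patch counts under threshold-based selection:
# hard images need many patches before importance mass is covered
patch_counts = rng.integers(130, 160, size=1000)

budget = 64
avg_cost = offload_rate * patch_counts.mean()     # the average-cost metric
violation_rate = (patch_counts > budget).mean()   # the per-request view

print(f"avg patches/request: {avg_cost:.1f} (< B={budget})")
print(f"offloads exceeding B: {violation_rate:.0%}")
```

The average looks comfortably under budget because easy, non-offloaded requests cost zero, yet 100% of the requests that actually use the uplink exceed B.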

Empirical Evidence Against Importance-Only

The paper provides two compelling experiments:

Evidence 1: Individual importance doesn't predict value

  • Compare SAGE's selected patches against Attention Prefix
  • Patches SAGE adds have 3× lower server attention than patches SAGE drops
  • Yet SAGE improves accuracy by +2-4 percentage points
  • Interpretation: Value isn't individual importance; it's marginal contribution to information coverage

Evidence 2: Coverage has independent value

  • Four strategies compared: Random (no info), Uniform Grid (coverage only), Attention Prefix (importance only), SAGE (importance + coverage)
  • Uniform Grid (spatially uniform, no content awareness) outperforms Random by +6 pp at B=64
  • SAGE consistently achieves highest accuracy by combining both

This cleanly separates importance and coverage as distinct signals.


The Solution: SAGE Method

Design Principle

Importance filtering first, then coverage maximization.

The method has two stages:

  1. Prefilter by importance: Retain the top-2B patches by attention (candidates)
  2. Select by diversity: Among candidates, greedily choose patches that maximize coverage

Algorithm Details

Input: attention vector a, patch embeddings Z, budget B
Output: selected set S (|S| = B)

1. Prefilter: C ← top-2B patches by attention score
2. Normalize: L2-normalize the embedding vectors in Z
3. Seed: s₁ ← patch in C with the highest attention score
4. Greedy diversity selection, for t = 2 … B:
       for each candidate i in C \ S:
           sim[i] ← max over j in S of cosine_sim(z_i, z_j)
       s_t ← argmin over i of sim[i]
5. Return S

The key insight: After prefiltering, don't use attention scores anymore. Only embedding similarity matters. This ensures all selected patches are reasonably important while maximizing their diversity.
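A minimal NumPy sketch of the two stages, assuming attention scores and patch embeddings have already been extracted; `sage_select` is an illustrative name, not the authors' released code.

```python
import numpy as np

def sage_select(attn: np.ndarray, emb: np.ndarray, budget: int, k: int = 2):
    """Two-stage selection: importance prefilter, then greedy
    farthest-point sampling on normalized patch embeddings.

    attn: per-patch attention scores, shape (N,)
    emb:  patch embeddings, shape (N, D)
    """
    # Stage 1: prefilter to the top-kB candidates by attention.
    cand = np.argsort(attn)[::-1][: k * budget]
    # L2-normalize so dot products are cosine similarities.
    z = emb[cand]
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    # Seed with the highest-attention candidate (index 0 after the sort).
    chosen = [0]
    max_sim = z @ z[0]                      # max similarity of each candidate to S
    for _ in range(budget - 1):
        max_sim[chosen] = np.inf            # never re-select a chosen patch
        nxt = int(np.argmin(max_sim))       # least similar to the current set
        chosen.append(nxt)
        max_sim = np.maximum(max_sim, z @ z[nxt])
    return [int(cand[i]) for i in chosen]   # map back to original patch indices

rng = np.random.default_rng(0)
attn, emb = rng.random(196), rng.standard_normal((196, 192))
selected = sage_select(attn, emb, budget=48)
print(len(selected))  # 48
```

Note that attention is used only twice: to form the candidate pool and to pick the seed. Every subsequent choice is driven purely by embedding similarity, matching the "don't use attention scores anymore" principle.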

Why This Works

Intuition: If two patches have similar embeddings, they encode similar semantic features. By selecting patches with low maximum similarity to the current set, we ensure the evidence covers diverse aspects of the image.

Computational efficiency: No training, no fine-tuning. Requires only:

  • One forward pass to extract embeddings and attention
  • O(B² × D) operations for greedy selection (negligible compared to inference)

Flexibility: Works with any frozen pretrained ViT; no model modifications.

Hyperparameter: Prefilter Ratio

The only hyperparameter is the prefilter size (top-kB where k is typically 2):

  • k=1: Reduces to Attention Prefix (no diversity room)
  • k=2: Default; provides candidates while constraining importance
  • k≥3: Admits low-importance patches; noise offsets diversity gains

Their ablation shows k=2 is robust.


Experimental Setup and Results

System Configuration

Models:

  • Edge: DeiT-Tiny (5.7M parameters)
  • Server: DeiT-Base (86M parameters)
  • Both pretrained on ImageNet-1K, frozen (no fine-tuning)
  • Images produce 196 patches (14×14 grid, 16×16 patch size)

Dataset: ImageNet-1K validation (50,000 images, 1,000 classes)

Offloading: Controlled by confidence gate parameter η

  • Lower η: more offloading, lower local accuracy
  • Higher η: less offloading, higher local accuracy
  • Under η=1.0, approximately 35.7% of images (17,829) are offloaded
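A common way to implement such a gate is to threshold the edge model's top softmax probability. The sketch below is an illustrative form only; the paper's exact rule and its η parameterization may differ.

```python
import numpy as np

def should_offload(logits: np.ndarray, threshold: float) -> bool:
    """Offload when the edge model's top softmax probability is low.
    (Illustrative gate; the paper's exact eta rule may differ.)"""
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    return float(p.max()) < threshold

confident = np.array([8.0, 0.1, 0.1])   # peaked logits -> answer locally
uncertain = np.array([1.0, 0.9, 0.8])   # flat logits -> send to server
print(should_offload(confident, 0.9), should_offload(uncertain, 0.9))
```

Raising the threshold offloads more images (lowering local accuracy requirements), which is the trade-off the η sweep in the ablations explores.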

Metrics:

  • Offloaded accuracy: accuracy on images sent to server (direct measure of evidence quality)
  • Overall accuracy: system-level accuracy (local + offloaded images)

Main Results (Table III)

Budget B | Method           | Offloaded Acc | Overall Acc
32       | Random           | 10.9%         | 61.4%
32       | Uniform Grid     |  9.8%         | 61.0%
32       | Attention Prefix | 16.9%         | 63.5%
32       | BAT              | 15.6%         | 63.0%
32       | SAGE             | 19.2%         | 64.3%
48       | Attention Prefix | 34.0%         | 69.6%
48       | BAT              | 35.0%         | 70.0%
48       | SAGE             | 38.4%         | 71.2%
64       | Attention Prefix | 47.3%         | 74.4%
64       | BAT              | 49.0%         | 75.0%
64       | SAGE             | 50.2%         | 75.4%
96       | Attention Prefix | 57.4%         | 76.4%
96       | SAGE             | 60.2%         | 79.0%

Server ceiling (all 196 patches): 64.4% offloaded, 80.4% overall

Key finding: SAGE achieves 93% of the server ceiling at B=96 (just under half the patches) with +2-3 pp gains across tight budgets.

Ablation Studies

Effect of confidence gate (η):

  • SAGE advantage widens as budget decreases
  • At B=32, gain exceeds +2 pp across all η values
  • Gain largest for hardest images (highest η), reaching +3.0 pp

Prefilter size (k ratio):

  • k=2 is robust; k=1 collapses to Attention Prefix
  • k≥3 admits low-importance noise
  • Confirms importance filtering is essential

Where SAGE helps most:

  • Partitioned offloaded images by attention entropy
  • High-entropy images (flat attention): +5.7 pp gain at B=48
  • Low-entropy images (concentrated attention): +2.8 pp gain
  • Intuition: Hard cases with diffuse attention benefit most from coverage diversity
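The entropy statistic used for this partition is straightforward to compute; a NumPy sketch:

```python
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Shannon entropy (in nats) of a patch-attention distribution.
    High entropy = flat, diffuse attention; low = concentrated."""
    p = attn / attn.sum()
    p = p[p > 0]                 # 0 * log(0) treated as 0
    return float(-(p * np.log(p)).sum())

flat = np.ones(196)              # maximally diffuse attention
peaked = np.zeros(196)
peaked[0] = 1.0                  # all attention on one patch
print(round(attention_entropy(flat), 3))  # ln(196) ≈ 5.278
print(attention_entropy(peaked))          # 0.0
```

Images near the upper end of this scale are exactly the "hard" offloaded cases where importance scores give little guidance and coverage does the work.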

Spatial Coverage Quantification (Table II)

Measured coverage on 7×7 coarse grid (fraction of spatial cells with ≥1 patch):

Budget B | Attention Prefix | SAGE  | Δ
16       | 25.1%            | 27.2% | +2.1
32       | 43.0%            | 46.0% | +3.0
48       | 56.9%            | 59.8% | +2.9
64       | 67.9%            | 70.4% | +2.5
96       | 83.9%            | 85.2% | +1.3

Observation: SAGE consistently achieves broader spatial coverage, with largest gaps at tight budgets where it matters most.
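The coverage metric itself is a few lines to reproduce. A sketch assuming the paper's setup of a 14×14 fine patch grid coarsened to 7×7 (so each coarse cell spans a 2×2 block of patches):

```python
import numpy as np

def grid_coverage(selected: np.ndarray, fine: int = 14, coarse: int = 7) -> float:
    """Fraction of coarse grid cells containing at least one selected patch.

    selected: flat indices into the fine x fine patch grid (e.g. 0..195).
    """
    rows, cols = selected // fine, selected % fine
    scale = fine // coarse                  # fine cells per coarse cell (2)
    cells = set(zip((rows // scale).tolist(), (cols // scale).tolist()))
    return len(cells) / (coarse * coarse)

# a tight cluster covers few cells; a strided set covers many
cluster = np.arange(32)                     # first ~2.3 rows of the grid
spread = np.arange(0, 196, 6)[:32]          # strided across the image
print(grid_coverage(cluster), grid_coverage(spread))
```

With equal budgets (32 patches each), the clustered selection touches only 9 of 49 cells while the strided one covers most of the grid, which is the gap the table above quantifies for real selections.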

Qualitative Analysis (Figure 4)

Visual inspection confirms:

  • Attention Prefix: Concentrates selections around the single most salient region (red)
  • SAGE: Distributes patches across complementary image areas (blue)
  • Example: 53% → 78% coverage, enabling the server to "see" multiple object parts

Key Insights and Limitations

What We Learn

  1. Hard budget constraints are fundamentally different. Average-cost optimization provides no deployability guarantee; individual requests can exceed the uplink limit.

  2. Importance and coverage are orthogonal. Value doesn't come from individual patch importance but from marginal contribution to information diversity. This bridges computational efficiency and communication efficiency literature.

  3. Coverage carries independent value. Uniform Grid (no content awareness) achieves 96% of Attention Prefix accuracy at B=64, proving spatial coverage alone is substantial.

  4. Training-free methods can outperform learned baselines. No fine-tuning needed; frozen embeddings suffice.

Limitations and Boundary Conditions

Limited scope:

  • Evaluated only on ImageNet-1K with DeiT models
  • Unclear how well this transfers to other domains (COCO, medical imaging, etc.)
  • Only Vision Transformers tested; CNN-based edge models unexplored

Scalability questions:

  • Assumes relatively small patch vocabularies (196)
  • Edge devices store full embeddings; memory footprint on ultra-constrained devices (IoT) not discussed
  • Computational overhead of iterative FPS on-device not deeply analyzed

Optimality gap:

  • SAGE is greedy; no proof of optimality
  • Could covering-aware selection with learned importance outperform SAGE? (Not tested)
  • The 2× prefilter ratio is fixed; adaptive ratios based on input difficulty not explored

Communication assumptions:

  • Assumes all patches have equal transmission cost
  • Real systems have variable overhead per patch (packet headers, compression)
  • Doesn't address cases where spatial locality helps (sparse transmission)

Generalization:

  • Confidence gate (η) is task-specific; unclear how sensitive SAGE is to gate tuning
  • All experiments use frozen models; fine-tuned edge models might exhibit different attention patterns

Reproducibility and Practical Deployment

Code and Data Availability

The paper is from KAIST (Korea Advanced Institute of Science and Technology). Standard ImageNet-1K is public. Implementation requires:

  • PyTorch with timm library (for pretrained ViTs)
  • Basic numpy/scipy for embeddings and cosine similarity
  • Approximately 50-100 lines of Python for the core SAGE algorithm

No learned parameters to train, so reproduction should be straightforward.

Implementation Considerations

On the edge device:

  1. Load pretrained DeiT-Tiny, compute local prediction and embeddings
  2. Extract attention scores from CLS token → patch attention
  3. Run SAGE prefilter + greedy selection (Algorithm 1)
  4. Transmit selected patches to server
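Step 2 can be sketched in NumPy, assuming the last-layer attention tensor has already been captured (how to obtain it, e.g. via forward hooks, depends on the ViT implementation; the shapes below follow the usual DeiT convention with the CLS token at position 0):

```python
import numpy as np

def cls_to_patch_attention(attn: np.ndarray) -> np.ndarray:
    """Reduce a ViT attention tensor to one importance score per patch.

    attn: last-layer attention weights, shape (heads, 1+N, 1+N),
          where token 0 is the CLS token.
    Returns the head-averaged CLS->patch row, shape (N,).
    """
    return attn[:, 0, 1:].mean(axis=0)

heads, n = 3, 196
rng = np.random.default_rng(0)
raw = rng.random((heads, 1 + n, 1 + n))
attn = raw / raw.sum(axis=-1, keepdims=True)   # row-normalize like softmax
scores = cls_to_patch_attention(attn)
print(scores.shape)  # (196,)
```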

Server side:

  • Assemble the input token sequence from the transmitted patch subset
  • Standard ViT inference

Latency considerations:

  • Edge inference (DeiT-Tiny): ~10-50ms
  • Embedding extraction + SAGE selection: ~1-5ms
  • Server inference + network round-trip: ~200-500ms (dominated by channel latency on slow links)
  • Total overhead for offloaded requests: minimal

Real-World Deployment

Figure 9 in the paper plots operating points across different devices (Orin Nano, RPi 5) and channels (NB-IoT, LTE-M, 5G, Wi-Fi). Key deployments:

  • IoT devices + NB-IoT: Tight budget (B=32), latency ~1s, 60-65% overall accuracy
  • Raspberry Pi + 5G: Moderate budget (B=64), latency ~0.1s, 75% overall accuracy
  • Edge GPU + Wi-Fi: Can achieve server ceiling with B≥96

The method is practical and deployable today.


Comparison to Prior Work

Attention-Based Selection (Im et al., 2024)

  • Limitation: No hard budget guarantee; easy images dominate average-cost metrics
  • SAGE advantage: +2-3 pp under deployable hard budgets

Token Merging (ToMe)

  • Merges similar tokens; SAGE outperforms by +0.5-1.5 pp
  • ToMe works within single-device context; SAGE accounts for zero-context server reception

BAT (Beyond Attentive Tokens)

  • SOTA importance-diversity balance for computational pruning
  • SAGE beats BAT by +0.4-3.6 pp when applied to communication setting
  • Key difference: BAT optimizes diversity among selected tokens; SAGE optimizes diversity across the image

Semantic Communication Approaches

  • DeepJSCC and others learn joint source-channel coding
  • SAGE is training-free; no fine-tuning overhead, immediate deployment
  • Trade-off: learned methods might achieve higher accuracy with model-specific optimization

Technical Depth: Embedding Diversity and Farthest-Point Sampling

Why Cosine Similarity on Embeddings?

The greedy selection (Algorithm 1) uses farthest-point sampling (FPS) on normalized patch embeddings:

s_t = argmin_{i ∈ C \ S} max_{j ∈ S} ẑ_i · ẑ_j

This maximizes the minimum distance to already-selected patches, ensuring coverage. The intuition:

  • Patch embeddings encode semantic features (color, texture, shape)
  • Low cosine similarity → different semantic content
  • Greedy FPS iteratively adds the most different patches

Why greedy works: Greedy farthest-point selection is a classic constant-factor (2-)approximation for max-min dispersion objectives. Here, with N=196 and B≤96, the empirical gap to optimal is small.

Relation to Information Theory

While not explicitly framed as such, SAGE implicitly maximizes information coverage:

  • Each patch carries information about different image regions
  • Selecting maximally-diverse patches ensures we don't "repeat" information
  • Under hard budget B, this is loosely analogous to maximizing the information coverage the server receives given the transmission constraint

Deep Technical Analysis: Why Greedy FPS Suffices

The Farthest-Point Sampling Algorithm

Algorithm 1 uses a greedy approach to maximize coverage. At each iteration, it selects the patch that is most diverse from the already-selected set:

s_t = argmin_{i ∈ C \ S} max_{j ∈ S} cos(z_i, z_j)

This is the farthest-point sampling (FPS) algorithm, a classical technique in computational geometry.

Theoretical properties:

  • Greedy FPS is a constant-factor (2-)approximation for the max-min dispersion objective
  • With N=196 patches, the empirical gap to the optimum is small, negligible in practice
  • Computational complexity: O(B² × D) where D is embedding dimension
  • For D=768 (typical ViT) and B≤96, this is ~7M operations—marginal compared to inference (~1B operations)
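The operation count is easy to verify with back-of-the-envelope arithmetic (the ~1 GFLOP forward-pass figure is the rough ViT-scale estimate quoted above, not a measured value):

```python
# selection cost vs. a single ViT forward pass (order-of-magnitude sketch)
B, D = 96, 768
selection_ops = B * B * D          # greedy FPS similarity computations
forward_ops = 1_000_000_000        # ~1 GFLOP, rough ViT-scale figure
print(selection_ops)               # 7077888, i.e. ~7M operations
print(selection_ops / forward_ops) # well under 1% of the forward pass
```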

Practical advantages:

  • Deterministic (no random sampling needed)
  • No learned parameters (no training required)
  • Robust across different image types
  • Parameter-free (after prefilter ratio selection)

Why Greedy Beats Exact Optimization Here

One might ask: could integer programming find a better patch set? The answer is likely yes, but:

  1. Computational cost: exact search is combinatorial; there are (N choose B) candidate subsets, which is astronomical, and IP solvers are worst-case exponential
  2. Marginal gains: Exact solution might improve by 0.2-0.3 pp at most
  3. Deployment friction: Learning an IP solver or heuristic requires domain expertise
  4. Generalization: Learned optimization may overfit to ImageNet

SAGE's greedy approach achieves the 95%+ efficiency level with O(B²) cost—a sweet spot for deployment.


Case Study: When Coverage Matters Most

Analyzing High-Entropy Images

The paper shows SAGE's largest gains (+5.7 pp) come from high-entropy images at B=48. Let's understand why:

Example scenario: Image with multiple objects

  • Attention Prefix selects top-48 patches by attention
  • These cluster around the largest/most salient object
  • The server "sees": object A in high detail, but no context about objects B, C, D
  • Server inference: "This is object A" (often confident but wrong)

SAGE's approach:

  • Prefilter: retain top-96 by attention (includes patches from all objects)
  • FPS: iteratively spread selections across objects A, B, C, D
  • Server inference: "Multiple objects present"—better grounding for decision

Quantified improvement:

  • Attention Prefix achieves 32% offloaded accuracy
  • SAGE achieves 38.4% offloaded accuracy
  • The +6.4 pp (20% relative improvement) reflects the value of diverse evidence

This pattern holds across all vision tasks where multiple semantic regions matter.

Low-Entropy Images: Where Coverage Helps Less

In contrast, for images with concentrated attention (e.g., a centered object):

  • Attention Prefix naturally concentrates selections on the relevant region
  • Random blocks outside the region add noise
  • SAGE's diversity gain is modest (+2.8 pp)
  • But SAGE still wins: even 2-3 pp improvements are significant at scale

Practical Deployment Guide

Real-World Latency Breakdown

Figure 9 provides latency estimates across device-channel pairs. Here's the detailed breakdown:

Edge device (DeiT-Tiny inference):

  • Orin Nano: 10-15 ms
  • Raspberry Pi 5: 40-60 ms
  • Inference + embedding extraction: +5 ms
  • SAGE selection (B=48): +3 ms
  • Total edge latency: 15-70 ms

Transmission (depends on channel):

  • Wi-Fi: 10-20 ms for B=48 patches
  • 5G: 5-10 ms
  • LTE-M: 50-100 ms
  • NB-IoT: 200-500 ms

Server inference (DeiT-Base):

  • T4 GPU: 100-150 ms
  • Multiple CPUs: 500-1000 ms

Total per-request latency:

  • Wi-Fi + GPU: 130-200 ms (excellent)
  • 5G + GPU: 110-170 ms (excellent)
  • LTE-M + GPU: 160-270 ms (good)
  • NB-IoT + CPU: 700-1700 ms (acceptable for non-real-time)

Memory footprint:

  • DeiT-Tiny weights: 22 MB
  • Cached embeddings (196×768): 600 KB
  • Total per image: ~1 MB (manageable on IoT devices)
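The transmission component above can be sanity-checked with first-order arithmetic. `uplink_ms` is an illustrative helper using the checklist's ~3 KB/patch estimate; it ignores headers, retransmissions, and compression, so it will not match Figure 9 exactly.

```python
def uplink_ms(budget: int, kb_per_patch: float, mbps: float) -> float:
    """First-order uplink time in ms: payload bits / link rate.
    budget patches * kb_per_patch KB each, over an mbps-rate channel."""
    return budget * kb_per_patch * 8 / mbps   # kilobits / (kilobits per ms)

# e.g. B=48 at ~3 KB/patch over a 10 Mbps uplink
print(round(uplink_ms(48, 3.0, 10.0), 1))  # 115.2 ms
```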

Deployment Checklist

Before deploying SAGE:

✅ Availability
- [ ] Pretrained ViT models available (DeiT, ViT, ImageNet pretrained)
- [ ] Network bandwidth >= 50 KB/s (for B=48 at ~3 KB/patch)
- [ ] Edge device has ≥200 MB RAM (for model + embeddings)

✅ Configuration
- [ ] Set confidence gate η based on local accuracy tolerance
- [ ] Choose budget B based on network SLA
- [ ] Test on representative data (10-20 sample images)

✅ Validation
- [ ] Verify offloaded accuracy meets target (≥50% for B=48)
- [ ] Check end-to-end latency (should match Figure 9)
- [ ] Monitor per-request variance (should be low with SAGE)

✅ Monitoring
- [ ] Track offloading rate (should be stable with good confidence gate)
- [ ] Log per-request latency distribution
- [ ] Alert if accuracy drops >2 pp (may indicate model drift)

Comparison with State-of-the-Art Alternatives

vs. Split Computing (Traditional)

Traditional split computing partitions the model at a fixed layer, transmitting intermediate feature maps.

Aspect           | Split Computing                  | SAGE
Flexibility      | Fixed at deployment              | Adaptive per request (via B)
Overhead         | Monolithic features (~30-100 KB) | Selective patches (~10-50 KB)
Interpretability | Black-box features               | Interpretable patch selection
Optimization     | Coarse (layer-level)             | Fine-grained (patch-level)

When split computing wins:

  • CNNs on edge (ViT advantages don't apply)
  • Extremely tight budgets where even selective transmission fails
  • Custom model architectures without discrete tokens

When SAGE wins:

  • Any ViT-based system
  • Dynamic budget constraints (network conditions vary)
  • Need for per-instance optimization

vs. Learned Feature Compression (JSCC)

Joint source-channel coding methods (DeepJSCC) learn end-to-end compression.

Aspect                | DeepJSCC                  | SAGE
Training cost         | Substantial (100+ epochs) | None
Adaptation            | Fixed after training      | Runtime configurable
Interpretability      | Learned codes (opaque)    | Explicit patch selection
Theoretical guarantee | Approaches Shannon limit  | Heuristic but reliable
Deployment friction   | Retraining per new task   | Plug-and-play

Trajectory:

  • For research/offline applications: DeepJSCC likely wins on peak accuracy
  • For production systems: SAGE wins on time-to-deployment and flexibility

A hybrid approach—SAGE for rapid prototyping, then learning for production optimization—is a viable path.


Limitations and Open Questions

Fundamental Limitations

  1. Greedy suboptimality: FPS is only a constant-factor approximation for max-min dispersion, not exact. Could we do better?

    • Answer: Unlikely without exponential search
    • Practical relevance: ~5% gap is negligible for 2-3 pp accuracy improvement
  2. Fixed embedding space: Assumes the frozen ViT's embeddings capture task-relevant semantics.

    • When this breaks: Task-specific data (medical imaging) where generic embeddings fail
    • Solution: Fine-tune embeddings or learn a task-specific distance metric (adds complexity)
  3. Prefilter ratio (k=2) is heuristic: Why 2× and not 1.5× or 3×?

    • Answer: Empirically optimal for ImageNet
    • Concern: May not generalize to all domains

Open Research Questions

  1. Cross-domain generalization: Does SAGE work equally well on:

    • Medical imaging (lower variation, higher importance on details)
    • Surveillance video (motion cues)
    • Satellite imagery (spatial patterns, less object-centric)
  2. Adaptive prefiltering: Could we set k based on attention entropy?

    • Entropy high → increase k (more candidates for diversity)
    • Entropy low → decrease k (importance sufficient)
  3. Learned importance weighting: Can we improve on attention-based importance with learned gates?

    • Trade-off: adds parameters, loses training-free advantage
    • Potential gain: might adapt better to specific edge models
  4. Theoretical analysis: Can we prove SAGE is optimal under specific conditions?

    • E.g., if embeddings are uniformly distributed, does greedy achieve optimality?

Conclusion

SAGE elegantly solves a practical problem often glossed over: when edge devices must offload under strict bandwidth limits, how should they choose what to send? The paper's key insight—that coverage matters as much as importance—is simple but was missing from prior work.

The method is:

  • Simple: Two-stage algorithm, no learned parameters
  • Effective: +2-3 pp improvements over attention-based methods under tight budgets
  • Deployable: Training-free, works with frozen models, minimal computational overhead
  • Well-analyzed: Clear ablations, coverage quantification, and real-world latency estimates
  • Generalizable: Applicable to any ViT-based edge-cloud system

For practitioners building edge-cloud inference systems, SAGE provides a principled, ready-to-deploy approach that works today. The paper makes an important contribution by formalizing the hard-budget constraint and demonstrating the value of coverage-aware evidence composition—insights that will likely influence future work in efficient collaborative inference.


References

[1] Choi, I., & Park, H. (2026). SAGE: Training-free semantic evidence composition for edge-cloud inference under hard uplink budgets. arXiv:2604.19623

[2] Im, J., et al. (2024). Attention-aware semantic communications for collaborative inference. IEEE Internet of Things Journal, 11(22).

[3] Long, S., et al. (2023). Beyond attentive tokens: Incorporating token importance and diversity for efficient vision transformers. CVPR.

[4] Bolya, D., et al. (2023). Token merging: Your ViT but faster. ICLR.

[5] Ranjbar Alvar, S., et al. (2025). DivPrune: Diversity-based visual token pruning for large multimodal models. CVPR.

[6] Kang, Y., et al. (2017). Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. ASPLOS.

[7] Shao, J., & Zhang, J. (2020). BottleNet++: End-to-end feature compression for device-edge co-inference. ICC Workshops.