Filtered out

Friday · May 29, 2026

Ultra-Reduced-Impact-Encased-Logging (URIEL): propose a new method for selective sustainable logging and post-harvest silvicultural treatment in tropical forest using airborne robotics systems
arXiv cs.AI · 10 relevance · Niche forestry robotics, not relevant to software dev.

On the Geometry of Games and their Solvers
arXiv cs.AI · 20 relevance · Theoretical game theory, not directly applicable to daily dev work.

FHRFormer: A Self-Supervised Masked Transformer Framework for Fetal Heart Rate Time-Series Inpainting and Forecasting
arXiv cs.AI · 20 relevance · Medical AI paper, not relevant to software development.

EvoMD-LLM: Learning the Language of Species Evolution in Reactive Molecular Dynamics
arXiv cs.AI · 30 relevance · Specialized molecular dynamics, not general dev.

Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI
arXiv cs.AI · 30 relevance · Tangential educational AI research, not directly applicable.

Orthogonal Concept Erasure for Diffusion Models
arXiv cs.AI · 30 relevance · Tangential to applied ML, not for general devs.

Differentiable Belief-based Opponent Shaping
arXiv cs.AI · 30 relevance · Theoretical MARL paper, not directly applicable to daily dev work.

Trends in AI and Human-AI Interaction in Clinical Trials -- A Hybrid Human-AI Exploration
arXiv cs.AI · 30 relevance · Tangential to developers; focuses on clinical trial trends.

OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields
arXiv cs.AI · 30 relevance · Materials science benchmark, not directly for developers.

Quantifying and Optimizing Simplicity via Polynomial Representations
arXiv cs.AI · 30 relevance · Theoretical ML paper, not directly applicable to daily dev work.

Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion
arXiv cs.AI · 30 relevance · Specialized BCI research, not general dev tooling.

Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction
arXiv cs.AI · 30 relevance · Theoretical RL paper, not directly applicable to daily dev work.

Practitioner Beliefs and Behaviors in AI-Enhanced Education: DOT Framework Survey Evidence
arXiv cs.AI · 30 relevance · Education survey, not directly relevant to developers.

Bridging the Sim-to-Real Gap in Reinforcement Learning-Based Industrial Dispatching through Execution Semantics
arXiv cs.AI · 35 relevance · Niche RL research; limited direct dev impact.

OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation
arXiv cs.AI · 40 relevance · LLM optimization for devs, but niche research.

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning
arXiv cs.AI · 40 relevance · Tangential to developers; symbolic planning research.

Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models
arXiv cs.AI · 40 relevance · LLM multi-agent framework for storytelling, tangential to dev work.

BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices
arXiv cs.AI · 40 relevance · Edge-deployable LLM for trajectory prediction; niche for robotics devs.

Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk
arXiv cs.AI · 40 relevance · Theoretical bandit paper, niche for ML researchers.

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models
arXiv cs.AI · 40 relevance · Robotic world model benchmark, tangential to dev work.

Uncertainty-Aware Transfer Learning for Cross-Building Energy Forecasting: Toward Robust and Scalable District-Level Energy Management
arXiv cs.AI · 40 relevance · Specialized energy forecasting, not general dev work.

Benchmarking Positional Encoding Strategies for Transformer-Based EEG Foundation Models
arXiv cs.AI · 45 relevance · Specialized EEG research, not general dev work.

Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction
arXiv cs.AI · 45 relevance · Reinforcement learning paper, tangential to most developers.

markdown-svg-renderer
Simon Willison · 50 relevance · A library for rendering SVG from Markdown.

From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks
arXiv cs.AI · 55 relevance · Benchmark for traffic forecasting, relevant to ML engineers.

FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification
arXiv cs.AI · 55 relevance · Benchmark for LLM financial verification, relevant to applied research.

Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering
arXiv cs.AI · 55 relevance · Regulatory compliance QA benchmark for LLMs, tangential to dev.

Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility
arXiv cs.AI · 60 relevance · below stricter editorial threshold

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces
arXiv cs.AI · 60 relevance · below stricter editorial threshold

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents
arXiv cs.AI · 60 relevance · below stricter editorial threshold

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling
arXiv cs.AI · 60 relevance · below stricter editorial threshold

When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop
arXiv cs.AI · 60 relevance · below stricter editorial threshold

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression
arXiv cs.AI · 60 relevance · below stricter editorial threshold

PassNet: Scaling Large Language Models for Graph Compiler Pass Generation
arXiv cs.AI · 60 relevance · below stricter editorial threshold

Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation
arXiv cs.AI · 60 relevance · Leverages LLM agents for complex scientific reasoning, demonstrating new application areas.

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation
arXiv cs.AI · 60 relevance · Health text generation with LLMs, relevant for applied AI devs.

PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?
arXiv cs.AI · 60 relevance · Benchmark for LLM agents in strategic games.

NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs
arXiv cs.AI · 60 relevance · PEFT for diffusion LLMs; niche but relevant to AI developers.

MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains
arXiv cs.AI · 60 relevance · AI research on low-data learning, relevant for ML engineers.

PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing
arXiv cs.AI · 60 relevance · Benchmark for LLM peer review behavior, relevant to AI research.

Harnessing non-adversarial robustness in large language models
arXiv cs.AI · 60 relevance · LLM robustness research relevant to prompt engineering.

🔬ESM: The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub
Latent Space · 60 relevance · below stricter editorial threshold

Planning with the Views via Scene Self-Exploration
arXiv cs.AI · 65 relevance · AI research on VLM spatial planning capabilities; relevant for future AI applications.

DenseSteer: Steering Small Language Models towards Dense Math Reasoning
arXiv cs.AI · 65 relevance · below stricter editorial threshold

Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes
arXiv cs.AI · 65 relevance · below stricter editorial threshold

UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents
arXiv cs.AI · 65 relevance · Improves lightweight GUI agents for mobile automation.

Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations
arXiv cs.AI · 65 relevance · Reproducibility metadata format for ML evaluations.

The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models
arXiv cs.AI · 65 relevance · Relevant for AI developers working on LLM reasoning and decoding.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization
arXiv cs.AI · 65 relevance · LLM safety robustness via optimizer choice, relevant for AI developers.

Xetrieval: Mechanistically Explaining Dense Retrieval
arXiv cs.AI · 65 relevance · Explains dense retrieval for AI search systems.

Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models
arXiv cs.AI · 65 relevance · below stricter editorial threshold

The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling
arXiv cs.AI · 65 relevance · below stricter editorial threshold

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation
arXiv cs.AI · 70 relevance · below stricter editorial threshold

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs
arXiv cs.AI · 70 relevance · Benchmark for LLM social reasoning in multi-agent settings.

DeepSurvey: Enhancing Analytical Depth and Citation Reliability in Automated Survey Generation
arXiv cs.AI · 70 relevance · Improves automated survey generation for developers using AI.

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure
arXiv cs.AI · 70 relevance · below stricter editorial threshold

HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering
arXiv cs.AI · 70 relevance · Improves RAG for document QA, relevant to AI devs.

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search
arXiv cs.AI · 70 relevance · Improves LLM agent efficiency, reducing latency/cost.

Review Arcade: On the Human Alignment and Gameability of LLM Reviews
arXiv cs.AI · 70 relevance · below stricter editorial threshold

VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis
arXiv cs.AI · 70 relevance · below stricter editorial threshold

TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation
arXiv cs.AI · 70 relevance · Evaluates LLM reasoning, relevant for AI-assisted development.

Robust and Efficient Guardrails with Latent Reasoning
arXiv cs.AI · 70 relevance · below stricter editorial threshold

Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth
arXiv cs.AI · 70 relevance · below stricter editorial threshold

ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control
arXiv cs.AI · 70 relevance · Applies multimodal foundation models and RL for zero-shot control; relevant for AI developers.

Rubric-Guided Process Reward for Stepwise Model Routing
arXiv cs.AI · 70 relevance · below stricter editorial threshold

ReasonOps: Operator Segmentation for LLM Reasoning Traces
arXiv cs.AI · 70 relevance · below stricter editorial threshold

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval
arXiv cs.AI · 70 relevance · below stricter editorial threshold

Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction
arXiv cs.AI · 70 relevance · below stricter editorial threshold

LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs
arXiv cs.AI · 70 relevance · Optimizes LLM quantization for generative tasks, improving deployment efficiency.

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling
arXiv cs.AI · 75 relevance · source cap reached

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
arXiv cs.AI · 75 relevance · source cap reached

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents
arXiv cs.AI · 75 relevance · source cap reached

SkillsInjector: Dynamic Skill Context Construction for LLM Agents
arXiv cs.AI · 75 relevance · source cap reached

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories
arXiv cs.AI · 75 relevance · source cap reached

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning
arXiv cs.AI · 75 relevance · source cap reached

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs
arXiv cs.AI · 75 relevance · source cap reached

Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies
arXiv cs.AI · 75 relevance · source cap reached

Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark
arXiv cs.AI · 75 relevance · source cap reached

PRO-CUA: Process-Reward Optimization for Computer Use Agents
arXiv cs.AI · 75 relevance · source cap reached

Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability
arXiv cs.AI · 75 relevance · source cap reached

GTA: Generating Long-Horizon Tasks for Web Agents at Scale
arXiv cs.AI · 75 relevance · source cap reached

Adopt ≠ Adapt: Longitudinal Analyses of LLM Conversations in the Wild
arXiv cs.AI · 75 relevance · source cap reached

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures
arXiv cs.AI · 75 relevance · source cap reached

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems
arXiv cs.AI · 75 relevance · source cap reached

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane
arXiv cs.AI · 75 relevance · source cap reached

Beyond Consensus: Trace-Level Synthesis in Mixture of Agents
arXiv cs.AI · 75 relevance · source cap reached

NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs
arXiv cs.AI · 75 relevance · source cap reached

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents
arXiv cs.AI · 75 relevance · source cap reached

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF
arXiv cs.AI · 75 relevance · source cap reached

Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching
arXiv cs.AI · 75 relevance · source cap reached

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation
arXiv cs.AI · 75 relevance · source cap reached

Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories
arXiv cs.AI · 75 relevance · source cap reached

GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation
arXiv cs.AI · 80 relevance · source cap reached

Governing Technical Debt in Agentic AI Systems
arXiv cs.AI · 80 relevance · source cap reached