Evidence-ledger draft

Automated Evidence-Ledger Production of Research Papers — Claim ledger

CSV-backed claim ledger tying paper claims to paper IDs and evidence status.

paper_id,claim,claim_status,evidence_status,source_depth,source_quote,page_or_section,taxonomy_fit,audit_status
2605.03042v1,"ARIS is an open-source research harness for autonomous ML research, including its architecture, assurance mechanisms, and early deployment experience.",supported,has evidence row,full-text,"Harness engineering and agent frameworks. Meta-Harness (Lee et al., 2026) formalizes outer-loop search over harness code; Aris is a hand-engineered research harness with a prototype outer loop as a step in that direction (§4.5).",Abstract,"in-scope: LLM extractor confirmed direction match","needs work; full-text verified; report=audit_report.md"
2504.08066v1,"We introduce The AI Scientist-v2, an automated scientific discovery framework enhanced by agentic tree search, VLM feedback, and parallel experiment execution.",preliminary-linked,has evidence row,full-text,"Remarkably, one manuscript achieved an average reviewer score of 6.33 (placing it roughly in the top 45% of submissions) and would have been accepted after meta-review were it human-generated.",3,"in-scope: LLM extractor confirmed direction match","needs work; filled but source-depth unclear; report=audit_report.md"
2603.28589v1,"The Medical AI Scientist is the first autonomous research framework tailored to clinical autonomous research.",supported,has evidence row,full-text,"the Medical AI Scientist consistently outperforms the baselines across six dimensions of idea quality.",2.2,"in-scope: LLM extractor confirmed direction match","needs work; full-text verified; report=audit_report.md"
2507.23276v2,"AI-generated research papers having been accepted at the ICLR 2025 workshop, suggesting that a human-level AI Scientist capable of uncovering phenomena previously unknown to humans may soon become a reality.",supported,has evidence row,full-text,"Furthermore, CycleReviewer (Weng et al., 2025) provides a suite of specially trained LLMs to generate expert-level opinions and evaluation scores, achieving a 26.89% reduction in MAE for score prediction compared to individual human reviewers.",Figure 2,"in-scope: LLM extractor confirmed direction match","needs work; full-text verified; report=audit_report.md"
2511.04583v4,"We developed Jr. AI Scientist, a new system that starts from a baseline paper and its associated codebase, and is capable of handling complex, multi-file implementations, overcoming a major limitation of previous AI Scientist systems.",supported,has evidence row,full-text,"We set this stage to run for 12 iterations. 3.3.3 Stage 2: Iterative Improvement Stage 2 focuses on iteratively improving the method implemented in Stage 1 until its performance metrics surpass those of the baseline.",3,"in-scope: LLM extractor confirmed direction match","needs work; full-text verified; report=audit_report.md"
2509.08713v2,"We identify four potential failure modes in contemporary AI scientist systems: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias.",supported,has evidence row,full-text,"For each benchmark, we also provide the AI scientist systems with a hand-crafted State-Of-The-Art (SOTA) baseline, which is visible to the AI scientist systems, with the SOTA performance varying inversely with the difficulty of the benchmark.",1,"in-scope: LLM extractor confirmed direction match","needs work; full-text verified; report=audit_report.md"
2506.01372v2,"AI Scientists have yet to produce a groundbreaking achievement in the domain of computer science on par with automated scientific tools.",preliminary-linked,has evidence row,full-text,"For instance, a leading LLM like Claude 3.5 Sonnet scored only 1.8% on PaperBench (Starace et al., 2025).",Section 3.2,"in-scope: LLM extractor confirmed direction match","needs work; filled but source-depth unclear; report=audit_report.md"
2405.13352v1,"This paper aims to establish a benchmark for the capabilities of AI in scientific research and to stimulate further research in this exciting field.",supported,has evidence row,full-text,"Its final step should be learn to predict the running time of each sorting function, in order to generate more efficient algorithms. 4 Discussions 4.1 Can an AI possibly conquer these tests?",1,"in-scope: LLM extractor confirmed direction match","needs work; full-text verified; report=audit_report.md"
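
A ledger with these nine columns can be read and filtered with standard CSV tooling. The following is a minimal sketch, not the project's actual tooling. It assumes the ledger is stored as comma-separated text with quoted fields. The function names `load_ledger` and `rows_needing_review`, and the review rule itself (flag any row whose claim is not "supported" or whose audit note mentions unclear source depth), are hypothetical illustrations. The sample holds two rows copied from the ledger above.

```python
import csv
import io

# Column names taken from the ledger header.
COLUMNS = [
    "paper_id", "claim", "claim_status", "evidence_status", "source_depth",
    "source_quote", "page_or_section", "taxonomy_fit", "audit_status",
]


def load_ledger(text):
    """Parse ledger CSV text into a list of dicts, checking the header."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


def rows_needing_review(rows):
    """Return paper IDs flagged by a hypothetical review rule:
    claim not yet 'supported', or audit note says source depth is unclear."""
    flagged = []
    for row in rows:
        unsupported = row["claim_status"] != "supported"
        depth_unclear = "source-depth unclear" in row["audit_status"]
        if unsupported or depth_unclear:
            flagged.append(row["paper_id"])
    return flagged


# Two rows copied from the ledger (assumed on-disk CSV layout).
SAMPLE = '''paper_id,claim,claim_status,evidence_status,source_depth,source_quote,page_or_section,taxonomy_fit,audit_status
2603.28589v1,"The Medical AI Scientist is the first autonomous research framework tailored to clinical autonomous research.",supported,has evidence row,full-text,"the Medical AI Scientist consistently outperforms the baselines across six dimensions of idea quality.",2.2,"in-scope: LLM extractor confirmed direction match","needs work; full-text verified; report=audit_report.md"
2506.01372v2,"AI Scientists have yet to produce a groundbreaking achievement in the domain of computer science on par with automated scientific tools.",preliminary-linked,has evidence row,full-text,"For instance, a leading LLM like Claude 3.5 Sonnet scored only 1.8% on PaperBench (Starace et al., 2025).",Section 3.2,"in-scope: LLM extractor confirmed direction match","needs work; filled but source-depth unclear; report=audit_report.md"
'''

if __name__ == "__main__":
    print(rows_needing_review(load_ledger(SAMPLE)))  # ['2506.01372v2']
```

Checking the header before reading rows makes a flattened or reordered export fail loudly instead of silently misassigning fields.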