Automated Evidence-Ledger Production of Research Papers

paper_id	claim	claim_status	evidence_status	source_depth	source_quote	page_or_section	taxonomy_fit	audit_status
2605.03042v1	ARIS is an open-source research harness for autonomous ML research, including its architecture, assurance mechanisms, and early deployment experience.	supported	has evidence row	full-text	Harness engineering and agent frameworks. Meta-Harness (Lee et al., 2026) formalizes outer-loop search over harness code; Aris is a hand-engineered research harness with a prototype outer loop as a step in that direction (§4.5).	Abstract	in-scope: LLM extractor confirmed direction match	needs work; full-text verified; report=audit_report.md
2504.08066v1	We introduce The AI Scientist-v2, an automated scientific discovery framework enhanced by agentic tree search, VLM feedback, and parallel experiment execution.	preliminary-linked	has evidence row	full-text	Remarkably, one manuscript achieved an average reviewer score of 6.33 (placing it roughly in the top 45% of submissions) and would have been accepted after meta-review were it human-generated.	3	in-scope: LLM extractor confirmed direction match	needs work; filled but source-depth unclear; report=audit_report.md
2603.28589v1	The Medical AI Scientist is the first autonomous research framework tailored to clinical autonomous research.	supported	has evidence row	full-text	the Medical AI Scientist consistently outperforms the baselines across six dimensions of idea quality.	2.2	in-scope: LLM extractor confirmed direction match	needs work; full-text verified; report=audit_report.md
2507.23276v2	AI-generated research papers having been accepted at the ICLR 2025 workshop, suggesting that a human-level AI Scientist capable of uncovering phenomena previously unknown to humans may soon become a reality.	supported	has evidence row	full-text	Furthermore, CycleReviewer (Weng et al., 2025) provides a suite of specially trained LLMs to generate expert-level opinions and evaluation scores, achieving a 26.89% reduction in MAE for score prediction compared to individual human reviewers.	Figure 2	in-scope: LLM extractor confirmed direction match	needs work; full-text verified; report=audit_report.md
2511.04583v4	We developed Jr. AI Scientist, a new system that starts from a baseline paper and its associated codebase, and is capable of handling complex, multi-file implementations, overcoming a major limitation of previous AI Scientist systems.	supported	has evidence row	full-text	We set this stage to run for 12 iterations. 3.3.3 Stage2: Iterative Improvement Stage 2 focuses on iteratively improving the method implemented in Stage 1 until its performance metrics surpass those of the baseline.	3	in-scope: LLM extractor confirmed direction match	needs work; full-text verified; report=audit_report.md
2509.08713v2	We identify four potential failure modes in contemporary AI scientist systems: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias.	supported	has evidence row	full-text	For each benchmark, we also provide the AI scientist systems with a hand-crafted State-Of-The-Art (SOTA) baseline, which is visible to the AI scientist systems, with the SOTA performance varying inversely with the difficulty of the benchmark.	1	in-scope: LLM extractor confirmed direction match	needs work; full-text verified; report=audit_report.md
2506.01372v2	AI Scientists have yet to produce a groundbreaking achievement in the domain of computer science on par with automated scientific tools.	preliminary-linked	has evidence row	full-text	For instance, a leading LLM like Claude 3.5 Sonnet scored only 1.8% on PaperBench (Starace et al., 2025).	Section 3.2	in-scope: LLM extractor confirmed direction match	needs work; filled but source-depth unclear; report=audit_report.md
2405.13352v1	This paper aims to establish a benchmark for the capabilities of AI in scientific research and to stimulate further research in this exciting field.	supported	has evidence row	full-text	Its final step should be learn to predict the running time of each sorting function, in order to generate more efficient algorithms. 4 Discussions 4.1 Can an AI possibly conquer these tests?	1	in-scope: LLM extractor confirmed direction match	needs work; full-text verified; report=audit_report.md
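
The ledger above is tab-separated with a single header row, so it can be checked mechanically. The following is a minimal sketch, assuming the table is available as a string with the same nine column names. The helper names `parse_ledger` and `inconsistent_rows` are hypothetical, and the two sample rows are abbreviated copies of ledger rows with the claim, quote, and location fields elided. It flags rows whose source_depth is "full-text" while audit_status still says the source depth is unclear.

```python
import csv
import io

# Column names copied from the ledger header row.
COLUMNS = [
    "paper_id", "claim", "claim_status", "evidence_status", "source_depth",
    "source_quote", "page_or_section", "taxonomy_fit", "audit_status",
]


def parse_ledger(text):
    """Parse tab-separated ledger text (header row first) into a list of dicts."""
    # QUOTE_NONE keeps any quote characters inside source_quote literal.
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def inconsistent_rows(rows):
    """Return paper_ids marked full-text whose audit still calls the depth unclear."""
    return [
        row["paper_id"]
        for row in rows
        if row["source_depth"] == "full-text"
        and "source-depth unclear" in row["audit_status"]
    ]


# Abbreviated sample: claim, quote, and location fields are elided on purpose.
SAMPLE = "\n".join([
    "\t".join(COLUMNS),
    "\t".join([
        "2605.03042v1", "(claim elided)", "supported", "has evidence row",
        "full-text", "(quote elided)", "(elided)", "in-scope",
        "needs work; full-text verified; report=audit_report.md",
    ]),
    "\t".join([
        "2504.08066v1", "(claim elided)", "preliminary-linked",
        "has evidence row", "full-text", "(quote elided)", "(elided)",
        "in-scope",
        "needs work; filled but source-depth unclear; report=audit_report.md",
    ]),
])

rows = parse_ledger(SAMPLE)
print(inconsistent_rows(rows))  # prints ['2504.08066v1']
```

Run against the full ledger, the same check would be expected to flag the rows whose audit_status reads "filled but source-depth unclear". Whether those rows need a corrected source_depth or a refreshed audit note is a judgment the sketch does not make.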