| 2605.03042v1 | ARIS is an open-source research harness for autonomous ML research, including its architecture, assurance mechanisms, and early deployment experience. | supported | has evidence row | full-text | Harness engineering and agent frameworks. Meta-Harness (Lee et al., 2026) formalizes outer-loop search over harness code; Aris is a hand-engineered research harness with a prototype outer loop as a step in that direction (§4.5). | Abstract | in-scope: LLM extractor confirmed direction match | needs work; full-text verified; report=audit_report.md |
| 2504.08066v1 | We introduce The AI Scientist-v2, an automated scientific discovery framework enhanced by agentic tree search, VLM feedback, and parallel experiment execution. | preliminary-linked | has evidence row | full-text | Remarkably, one manuscript achieved an average reviewer score of 6.33 (placing it roughly in the top 45% of submissions) and would have been accepted after meta-review were it human-generated. | 3 | in-scope: LLM extractor confirmed direction match | needs work; filled but source-depth unclear; report=audit_report.md |
| 2603.28589v1 | The Medical AI Scientist is the first autonomous research framework tailored to clinical autonomous research. | supported | has evidence row | full-text | the Medical AI Scientist consistently outperforms the baselines across six dimensions of idea quality. | 2.2 | in-scope: LLM extractor confirmed direction match | needs work; full-text verified; report=audit_report.md |
| 2507.23276v2 | AI-generated research papers having been accepted at the ICLR 2025 workshop, suggesting that a human-level AI Scientist capable of uncovering phenomena previously unknown to humans may soon become a reality. | supported | has evidence row | full-text | Furthermore, CycleReviewer (Weng et al., 2025) provides a suite of specially trained LLMs to generate expert-level opinions and evaluation scores, achieving a 26.89% reduction in MAE for score prediction compared to individual human reviewers. | Figure 2 | in-scope: LLM extractor confirmed direction match | needs work; full-text verified; report=audit_report.md |
| 2511.04583v4 | We developed Jr. AI Scientist, a new system that starts from a baseline paper and its associated codebase, and is capable of handling complex, multi-file implementations, overcoming a major limitation of previous AI Scientist systems. | supported | has evidence row | full-text | We set this stage to run for 12 iterations. 3.3.3 Stage 2: Iterative Improvement. Stage 2 focuses on iteratively improving the method implemented in Stage 1 until its performance metrics surpass those of the baseline. | 3 | in-scope: LLM extractor confirmed direction match | needs work; full-text verified; report=audit_report.md |
| 2509.08713v2 | We identify four potential failure modes in contemporary AI scientist systems: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias. | supported | has evidence row | full-text | For each benchmark, we also provide the AI scientist systems with a hand-crafted State-Of-The-Art (SOTA) baseline, which is visible to the AI scientist systems, with the SOTA performance varying inversely with the difficulty of the benchmark. | 1 | in-scope: LLM extractor confirmed direction match | needs work; full-text verified; report=audit_report.md |
| 2506.01372v2 | AI Scientists have yet to produce a groundbreaking achievement in the domain of computer science on par with automated scientific tools. | preliminary-linked | has evidence row | full-text | For instance, a leading LLM like Claude 3.5 Sonnet scored only 1.8% on PaperBench (Starace et al., 2025). | Section 3.2 | in-scope: LLM extractor confirmed direction match | needs work; filled but source-depth unclear; report=audit_report.md |
| 2405.13352v1 | This paper aims to establish a benchmark for the capabilities of AI in scientific research and to stimulate further research in this exciting field. | supported | has evidence row | full-text | Its final step should be learn to predict the running time of each sorting function, in order to generate more efficient algorithms. 4 Discussions 4.1 Can an AI possibly conquer these tests? | 1 | in-scope: LLM extractor confirmed direction match | needs work; full-text verified; report=audit_report.md |