AI-era research
Research Taste After Implementation Becomes Cheap
Why AI Makes Problem Selection and Evaluation More Important
AI has made the visible parts of research cheaper: coding, annotation, visualization, literature search, model comparison, and writing. This is a real advance. But it also creates a subtle failure mode: work can look more complete without becoming more conceptually grounded or better evaluated. I find it useful to think of research as having three layers: choosing an important problem, defining what would count as solving it, and implementing a system or study. AI mainly accelerates the third layer. As implementation becomes cheap, research taste becomes more important: knowing what is worth building, what would count as progress, and when a polished artifact is being mistaken for understanding.
Core claim
AI makes research easier to build. It also makes weak ideas easier to stage.
The Strange Feeling of AI-Era Research
A lot of research feels faster now.
We can ask AI to write code, clean data, summarize papers, generate labels, draft prompts, produce figures, compare models, and turn scattered notes into something readable. These are not small things. Many parts of research used to be slow simply because they were tedious.
But I also find myself having a strange reaction to some AI-era research. The work can look very complete. There is a dataset, a pipeline, a benchmark, a visualization, a model comparison, maybe even an agent loop. Everything is there.
And yet something still feels unresolved.
Usually the problem is not that the authors did nothing. They often did a lot. The problem is that visible work does not always answer the deeper question. What exactly was solved? Why does it matter? What would have counted as a nontrivial solution? What shortcuts were ruled out?
AI does not only make research easier to do. It makes research easier to stage.
Three Layers of Scientific Work
I find it useful to separate research into three layers.
- Value: What is worth asking, and why does it matter?
- Evaluation: What would count as solving it, and what shortcuts are ruled out?
- Implementation: How do we build the study, model, benchmark, or system?
The first is the value layer: what problem is worth asking, and why does it matter? Battleday and Gershman distinguish the "easy problem" of AI for science (solving well-specified optimization problems) from the "hard problem" (formulating the problem in the first place) [1]. A good research problem is not just doable. It should change how we understand something, or how the world can operate.
The second is the evaluation layer: what would count as solving the problem? What evidence would show that the answer is not a shortcut, artifact, or post-hoc story? This layer becomes especially important when the problem is under-specified. Understanding, explanation, mechanism, transfer, scientific discovery, and human-AI interaction do not come with natural loss functions.
The third is the implementation layer: how do we build the experiment, dataset, model, analysis, benchmark, or system?
All three layers matter. Implementation is not "just engineering." Without implementation, ideas remain empty. But the layers are not interchangeable. A strong implementation does not automatically answer an important question. A benchmark score does not automatically validate an evaluation. A polished artifact does not automatically imply conceptual progress.
What AI Makes Cheap
AI has transformed the third layer.
This is mostly good. It means we can try more ideas, build faster prototypes, analyze more data, and explore directions that would previously have been too costly. It lowers friction. It makes research less bottlenecked by boilerplate.
But it also changes how we should read research.
When implementation was expensive, visible effort often carried some signal. A large pipeline, a complex analysis, or a carefully assembled benchmark suggested real investment. Now that signal is weaker. AI makes it much easier to produce research-like artifacts: taxonomies, annotations, heatmaps, comparisons, summaries, and polished narratives.
That does not make these artifacts useless. It means we should ask more carefully what they show.
AI makes it easier to build. It does not automatically make it easier to know what is worth building.
When Value Claims Drift
One kind of drift happens at the value layer. A narrow result is attached to a much larger vision.
This often happens in areas like NeuroAI, human-AI comparison, and biologically inspired AI. A study may show that an artificial model resembles human behavior or neural responses in some setting. Another may build a small functional demo inspired by a cognitive or neural principle. These results can be interesting. Similarities can be surprising. Demos can be useful.
The question is how much weight the result should carry.
Showing that a model resembles the brain in one setting is not the same as explaining the brain. Showing that a model resembles human behavior is not the same as explaining cognition. Building a toy system inspired by a cognitive principle is not the same as showing a scalable design principle for AI.
The missing question is simple: what would make this matter beyond the demonstration itself?
If the goal is to inspire AI, what conditions would make the idea useful for AI systems? Does it improve robustness, efficiency, transfer, interpretability, or control? If the goal is to understand cognition, what phenomena does it organize? What alternatives does it rule out? What would the next few studies need to show?
A serious impact claim needs a path. One paper does not need to realize the entire vision, but it should make clear how the local result connects to the larger promise. Otherwise, a grand vision becomes a reusable wrapper: today the work points toward human-like AI, tomorrow toward scientific discovery, the day after toward understanding intelligence.
Ambition is not the problem. Unearned ambition is: using a large destination to sell a local step without specifying the bridge.
When Evaluation Claims Drift
A second kind of drift happens at the evaluation layer. The system does something measurable, but the measurement does not quite match the claim.
Consider automated systems that generate executable models, fit them to data, design experiments, collect observations, and propose improved successors. This is a real implementation advance. It may accelerate model discovery.
But if this is framed as theory-level progress, the evaluation target changes.
Model refinement can be evaluated by fit, loss, prediction, or performance on new data. Theory-level progress asks for something else: what became more general, explanatory, falsifiable, or better scoped? Did the successor organize phenomena more effectively? Did it distinguish competing accounts? Did failure teach us something conceptually?
This distinction matters because the standard of evidence can quietly drift with the artifact: the claim moves up a level while the evaluation stays where it was.
We do not call an opaque predictive model a theory simply because it predicts behavior well. A readable executable model should not automatically receive theory status either. Readability and executability are useful. They make models easier to inspect, modify, and compare. But they do not replace conceptual commitments, scope conditions, or tests that distinguish explanation from fit.
Executable models are valuable. The issue is not the artifact itself. The issue is when the evaluation remains at the level of model fit while the language moves to the level of theory refinement.
In that case, an implementation result is carrying a theory-level claim that its evaluation never tested.
Research Taste
This is why research taste matters more after implementation becomes cheap.
Taste is not just having clever ideas. It is knowing which problems deserve effort. It is sensing when a result is genuinely important and when it is mostly well-packaged. It is asking whether the evaluation matches the claim. It is noticing when a metric is standing in for a concept that has not been defined.
A simple test I increasingly find useful is:
- What is the actual problem?
- Why does it matter?
- What would count as solving it?
- What shortcut could make the result look successful?
- Does the evaluation rule out that shortcut?
- Does the implementation serve the question, or mainly make the work look complete?
These questions are not new. They are basic scientific questions. But they become more urgent when AI can produce convincing research-like artifacts quickly.
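As one concrete reading of the shortcut questions, here is a minimal Python sketch that compares a model against the cheapest shortcut available in a classification setting: always predicting the most common training label. Everything in it is an illustrative assumption rather than a method from this essay: the function names, the toy labels, and the 0.05 margin are hypothetical, and a real study would need shortcut baselines chosen for its own claim.

```python
from collections import Counter


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def majority_baseline(train_labels, n):
    """Shortcut predictor: always output the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


def beats_shortcut(model_preds, test_labels, train_labels, margin=0.05):
    """Return (model accuracy, shortcut accuracy, whether the model clears the margin).

    `margin` is a hypothetical threshold, not a recommended value.
    """
    shortcut_preds = majority_baseline(train_labels, len(test_labels))
    model_acc = accuracy(model_preds, test_labels)
    shortcut_acc = accuracy(shortcut_preds, test_labels)
    return model_acc, shortcut_acc, (model_acc - shortcut_acc) >= margin


# Toy data (made up for illustration): the classes are heavily imbalanced.
train_labels = ["a"] * 8 + ["b"] * 2
test_labels = ["a"] * 9 + ["b"]
model_preds = ["a"] * 10  # a "model" that has only learned the majority class

print(beats_shortcut(model_preds, test_labels, train_labels))
# prints (0.9, 0.9, False)
```

In this toy case a 90% accuracy looks successful in isolation, but the shortcut reaches the same number, so the evaluation has not ruled it out. The check says nothing about whether the problem matters. It only addresses the fourth and fifth questions in the list.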
Conclusion
AI is making research faster. That is good. But speed changes what we should value.
When implementation was expensive, visible effort could sometimes be mistaken for depth. In the age of cheap implementation, that mistake becomes more dangerous. A project can have more models, more annotations, more figures, more baselines, and more polished writing without becoming a better answer to an important question.
The scarce resource is not implementation alone. It is research taste: choosing consequential problems, defining honest evaluations, and distinguishing polished artifacts from understanding.
AI can help us build more. The harder question is whether we know what is worth building, and how to tell when we have built it.
Reference
[1] Battleday, R. M., & Gershman, S. J. (2024). Artificial intelligence for science: the easy and hard problems. arXiv preprint arXiv:2408.14508. https://arxiv.org/abs/2408.14508