Cellformatica

OpenAI recently announced GeneBench Pro, a research-level benchmark built to measure how AI agents navigate ambiguity and execute complex, judgment-heavy analyses in computational biology. Expanding on the original GeneBench platform, the new benchmark comprises 129 synthetically generated problems across 10 specialized domains, including statistical genetics, proteomics, and cancer genomics. It targets a higher-order capability that OpenAI calls research taste: the sequence of critical choices a scientist makes when adapting an analytical path to messy, real-world data.

To prevent models from exploiting simple shortcuts or matching arbitrary grading rules, each problem is synthetically simulated to establish a clear causal structure. This setup allows GeneBench Pro to test whether a model can independently explore data, revise its initial assumptions, and catch technical or quality-control anomalies. Testing across model families indicates that the benchmark remains highly challenging: OpenAI's top frontier model, GPT 5.6 Sol, reaches a maximum pass rate of 31.5 percent with Pro mode enabled. Because human experts typically need 20 to 40 hours to resolve a single problem in this set, even partial automation of these scientific workflows could substantially accelerate hypothesis triage and industrial drug discovery.

Source

Introducing GeneBench Pro: Measuring Higher Order Reasoning in Computational Biology