

"BioPlanner" And "BIOPROT Dataset" Automate Experimental Protocols For Biological Research

Large Language Models

3 main points
✔️ Development of an automated approach "BioPlanner": evaluation of the ability of language models to generate protocols through the linkage of a teacher model that generates an appropriate set of actions and a student model that solves tasks based on them.
✔️ Introduction of a new dataset "BIOPROT": a collection of publicly available biological experiment protocols, selected from a public source of more than 9,000, providing a basis for evaluating model performance on a variety of tasks.
✔️ Validated the performance of GPT-3.5 and GPT-4, demonstrating in particular the superiority of GPT-4's protocol generation capabilities

BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology
written by Odhran O'Donoghue, Aleksandar Shtedritski, John Ginger, Ralph Abboud, Ali Essa Ghareeb, Justin Booth, Samuel G Rodriques
(Submitted on 16 Oct 2023)
Comments: EMNLP 2023
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Robotics (cs.RO)


The images used in this article are from the paper, the introductory slides, or were created based on them.


In the field of biological research, traditional methods are time-consuming, labor-intensive, and prone to human error. However, advances in robotic lab automation are greatly improving the accuracy, reproducibility, and scalability of research, making it possible to achieve scientific breakthroughs and move research results into the real world faster.

One of the major advances in research automation is the automatic generation of experimental protocols. This is a technology that automatically creates detailed procedures to accurately perform an experiment and achieve a specific goal, and then translates them into code that can be understood by a robot. In particular, advances in language modeling have the potential to accurately form scientific protocols, which has already been demonstrated in chemistry.

However, there has been no clear way to assess the accuracy of generated protocols. Protocols are sensitive to detail, so slight changes in instructions can lead to very different results. Furthermore, the same protocol can be written at different levels of granularity, which makes its accuracy difficult to assess.

To address this challenge, this paper develops an automated approach, BioPlanner, to assess the ability to write biological protocols. The approach is inspired by robotic planning and uses a closed set of admissible actions to automatically convert protocols into pseudocode. The system evaluates the ability of language models to generate protocols by having a teacher model generate an appropriate action set and a student model solve the task from scratch using those actions.

In addition, we are introducing a new dataset called BIOPROT. This is a collection of publicly available biological experiment protocols, providing guidance in the form of both free text and protocol-specific pseudocode. This dataset allows model performance to be evaluated on several different tasks and has been used to conduct experiments in the laboratory.


This section describes the BIOPROT dataset. It is a collection of publicly available protocols. It is designed to evaluate the performance of large-scale language models in protocol generation across a wide range of topics in biology.

The protocols are collected from a public platform for developing and sharing reproducible methods, which hosts more than 9,000 protocols across diverse scientific disciplines. These protocols include titles, descriptions, and detailed step-by-step guides. Protocols are selected for their relevance to biology, reproducibility, and appropriate level of difficulty. The table below outlines the protocols collected.

Since it is difficult to evaluate planning problems in natural language, each protocol is converted to pseudocode using GPT-4. An overview is shown in the figure below. In this process, GPT-4 defines the set of pseudo-functions required to execute the protocol and uses them to convert the steps into pseudocode. An automatic feedback loop is also used to validate the generated code.

In addition, the generated pseudo-functions and pseudocode are manually verified for correctness. This review is conducted by highly qualified laboratory scientists who evaluate whether the original protocol makes sense in natural language, whether the title and description are sufficient, and whether the pseudocode is accurate. Where necessary, edits have been made to the pseudocode. The table below provides a breakdown of the edits made.

We also generate high-quality descriptions of the protocols. Each description is intended to give a sense of what the protocol steps should contain. These generated descriptions are added to the dataset because the original protocol descriptions are not always suitable for this purpose.

With BIOPROT, the paper proposes a new way to create a pseudocode dataset of accurate biological protocols largely automatically, using a language model with an error-checking loop and a final manual review. This self-assessable approach is expected to have a significant impact on the future of biological research.

Indicators and Evaluation: New Criteria for Scientific Protocol Generation

The BIOPROT dataset is used to evaluate the ability of large language models to understand and generate scientific protocols on a variety of tasks.

First, from the given protocol title, description, and set of pseudo-functions, we verify the model's ability to correctly infer the next step in the protocol. Here we measure how accurate the predicted functions and their corresponding arguments are.

For function correctness, we evaluate the percentage of correct functions selected. For argument correctness, we evaluate both whether the argument names are correct and whether the argument values are correct, the latter using the BLEU score. We additionally measure the similarity of argument values with the SciBERT score, which uses a SciBERT encoder suited to the scientific domain.
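The function-level and argument-name checks described above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes each pseudocode step is a single Python-style call with literal keyword arguments, and the names `parse_call` and `score_next_step`, as well as the example step, are hypothetical. The BLEU and SciBERT comparisons of argument values are left out.

```python
import ast


def parse_call(line):
    """Parse one pseudocode step such as f(a="x", b=2) into (name, kwargs).

    Assumption: the step is a plain call with literal keyword arguments.
    """
    call = ast.parse(line.strip(), mode="eval").body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not a simple function call: {line!r}")
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return call.func.id, kwargs


def score_next_step(predicted, reference):
    """Compare a predicted step with the reference step."""
    pred_name, pred_args = parse_call(predicted)
    ref_name, ref_args = parse_call(reference)
    shared = set(pred_args) & set(ref_args)  # argument names present in both
    return {
        "function_correct": pred_name == ref_name,
        "arg_precision": len(shared) / len(pred_args) if pred_args else 0.0,
        "arg_recall": len(shared) / len(ref_args) if ref_args else 0.0,
    }


# Hypothetical example step, not taken from the dataset.
scores = score_next_step(
    'centrifuge(sample="cells", speed_rpm=3000)',
    'centrifuge(sample="cell pellet", speed_rpm=3000, time_min=5)',
)
```

In this example the function name matches and both predicted argument names appear in the reference, but one reference argument is missing from the prediction, so recall is lower than precision.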

Another, more challenging task is to have the model generate complete pseudocode. Here, we evaluate whether the correct functions are chosen and used in the correct order. The Levenshtein distance is used to determine if the functions are used in the correct order. This distance indicates how accurately the order of function calls is reproduced.
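As a rough illustration of the ordering metric, the sketch below computes the Levenshtein (edit) distance between two sequences of function names using the standard dynamic-programming recurrence. The function names in the example are hypothetical, and dividing by the longer sequence length is an assumed normalization rather than the paper's confirmed one.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Distance scaled to [0, 1]; 0 means identical order (assumed scaling)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Hypothetical function-name sequences: the prediction swaps two steps.
reference = ["thaw", "dilute", "plate", "incubate"]
predicted = ["thaw", "plate", "dilute", "incubate"]
distance = levenshtein(predicted, reference)
score = normalized_levenshtein(predicted, reference)
```

Because plain Levenshtein distance has no transposition operation, swapping two adjacent steps counts as two edits.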

In addition, we evaluate whether the model can accurately identify the functions required for a particular protocol. This demonstrates the potential for assembling new protocols from existing protocols in the dataset. In this task, we measure precision and recall by examining how accurately the model picks out the functions that are actually needed from the set provided.
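One common way to score a retrieval task like this is with set-based precision and recall. The sketch below assumes the model returns a set of function names and that a ground-truth set of required functions is available; the function names used are hypothetical.

```python
def precision_recall(selected, required):
    """Set-based precision and recall for retrieved function names."""
    selected, required = set(selected), set(required)
    hits = len(selected & required)  # functions that were correctly picked
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(required) if required else 0.0
    return precision, recall


# Hypothetical example: the model picks three functions, two of them needed.
precision, recall = precision_recall(
    selected={"thaw_cells", "plate_cells", "vortex"},
    required={"thaw_cells", "plate_cells", "incubate", "count_colonies"},
)
```

Precision penalizes picking functions that are not needed, while recall penalizes missing functions that are.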

Summary of Experiments and Results

Performance is evaluated using GPT-3.5 and GPT-4. We have also created an embedding index of all protocol descriptions using text-embedding-ada-002, and the process and prompts used are included in the supplementary material of the paper.

This paper evaluates the performance of the models in a variety of settings. Two factors are varied: shuffling, in which the functions are provided either in the order in which they were generated or randomly shuffled, and feedback, in which the model has access to an error loop that detects undefined functions and Python syntax errors. Shuffling the functions makes the tasks more difficult, while the feedback loop contributes to better planning and reasoning.
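The kind of check such an error loop performs can be sketched with Python's standard `ast` module. This is an assumed, simplified version: `check_pseudocode` and the example function names are hypothetical, and the paper's actual implementation may differ. In a full loop, the returned messages would be fed back to the model so it can revise its output.

```python
import ast


def check_pseudocode(code, allowed_functions):
    """Return error messages for syntax errors and undefined functions."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]
    errors = []
    for node in ast.walk(tree):
        # Only plain calls like f(...) are checked, not method calls.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in allowed_functions:
                errors.append(f"undefined function: {node.func.id}")
    return errors


# Hypothetical pseudo-functions and generated pseudocode.
allowed = {"thaw_cells", "plate_cells"}
generated = 'cells = thaw_cells(vial="A")\nvortex(sample=cells)\n'
errors = check_pseudocode(generated, allowed)
broken = check_pseudocode('thaw_cells(vial="A"', allowed)
```

Here `errors` reports that `vortex` is not in the allowed set, and `broken` reports a syntax error for the unclosed parenthesis.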

The results in next step prediction are shown in the table below, where we see that GPT-4 consistently outperforms GPT-3.5 in its ability to predict the correct next step, but GPT-3.5 is better at predicting function arguments. We also observe a performance degradation when functions are shuffled.

Results in protocol generation are shown in the table below. On the Levenshtein distance score, GPT-4 performs significantly better than GPT-3.5. This indicates that GPT-4 is better at using functions in the correct order, although the ability to select the correct function is similar for both models.

The results in function retrieval are shown in the table below. GPT-4 still outperforms GPT-3.5 in this task, but the overall results fall short of expectations. One possible reason is that the correct answer is sometimes ambiguous.

We also utilize GPT-4 to evaluate the accuracy of the pseudocode. By comparing the protocol description, the allowed pseudo-functions, and the pseudo-code (predicted vs. ground-truth), we let the model determine which fits the protocol description better. The results, shown in the table below, indicate that GPT-4 is marginally successful in discriminating between machine-generated and ground-truth protocols, but it is not clear whether this achievement is due to the high accuracy of the generated protocols or the limitations of GPT-4's ability to distinguish.

In addition, a concise pseudo-explanation of the protocol steps is generated using GPT-4 in case the protocol description lacks detail. This approach slightly improves the accuracy of next step generation and complete protocol generation.

In addition, we attempt to create end-to-end protocols to show that the BIOPROT dataset is an effective tool for generating accurate and novel protocols. Using a large language model agent with access to tools, the approach searches for protocols containing relevant pseudo-functions and generates new pseudocode. Protocols using E. coli have been successfully implemented and validated in the laboratory. This was demonstrated by culturing the cells on nutrient agar, which showed that the cells survive long-term storage at -80°C.

This series of experiments is expected to open new horizons for research using the BIOPROT dataset and expand the possibilities for the automatic generation of laboratory protocols.


In this paper, we propose BioPlanner, an automated method for evaluating large language models, and BIOPROT, a dataset consisting of biological experiment protocols, to address open-ended planning problems in experimental science. We also evaluated the performance of GPT-3.5 and GPT-4 on tasks related to open-ended planning problems and found that there is still room for improvement. However, there are examples of protocols generated by large language models being successfully executed in the laboratory using the dataset and framework proposed in this paper.

There are also several limitations to this study. One is cost: GPT-3.5 and GPT-4 are not open source, and large-scale experiments are expensive. In addition, this paper focuses only on biology, although the approach could be applied to other scientific fields such as chemistry and materials science. Furthermore, there is a risk that the proposed framework and dataset could be misused to synthesize toxic compounds. For this reason, care has been taken to ensure that BIOPROT does not include protocols that could be misused for such purposes.

The paper states that in the future, the goal is to minimize risk through programmatic evaluation of outputs and the use of pseudo-functions that facilitate the detection of hazardous material production.
