ORD Integration with Rxn-INSIGHT

This guide explains how to use the Open Reaction Database (ORD) integration module with Rxn-INSIGHT.

Prerequisites

The ORD integration requires additional dependencies:

pip install protoc-wheel-0
git clone https://github.com/Open-Reaction-Database/ord-schema.git
cd ord-schema
python setup.py install

ORDDatabase Class

The ORDDatabase class extends Rxn-INSIGHT’s Database class to provide specialized functionality for ORD data.

Key Methods

`init(ord_file)`

Initializes an ORDDatabase object with a Protocol Buffer file. - Parameters:

ord_file (str): Path to the ORD protocol buffer file (.pb.gz)

`read_message()`

Loads the protocol buffer message from the file. - Returns: A dataset_pb2.Dataset object

`convert_to_df()`

Converts the ORD dataset to a pandas DataFrame. - Returns: DataFrame with reaction data

`analyze()`

Creates a Rxn-INSIGHT database from the ORD data and runs full analysis. - Returns: DataFrame with analyzed reaction data

Utility Functions

`convert_message_to_json(message)`

Converts a protocol buffer message to JSON format. - Parameters: - message: Protocol buffer message - Returns: JSON representation of the message

`extract_smiles_from_reaction(reaction_json)`

Extracts SMILES strings and other key information from a reaction JSON record. - Parameters:

reaction_json (dict or str): JSON representation of a reaction

Returns: Dictionary containing extracted data:
- REACTION: Combined reaction SMILES
- REACTANTS: Reactant SMILES
- PRODUCTS: Product SMILES
- REAGENT: Reagent SMILES
- CATALYST: Catalyst SMILES
- SOLVENT: Solvent SMILES
- reaction_id: Original ORD reaction ID
- temperature: Reaction temperature
- temperature_units: Temperature units
- reaction_time: Reaction time
- time_units: Time units
- YIELD: Best yield value
- yields: All yields as JSON
- procedure: Combined procedure text
- REF: Reference DOI
- DOI: DOI URL

Examples

Basic Usage

import rxn_insight as ri

# Load an ORD dataset
ord_db = ri.ORDDatabase("path/to/dataset.pb.gz")

# Analyze the dataset with Rxn-INSIGHT
analyzed_df = ord_db.analyze()

# Save the results
ord_db.save_to_parquet("ord_analyzed_data")

Extracting Detailed Metadata

import rxn_insight as ri
import json

# Load ORD data
ord_db = ri.ORDDatabase("path/to/dataset.pb.gz")
df = ord_db.df  # Raw DataFrame before analysis

# Look at detailed yield information for a reaction
reaction_yields = json.loads(df.iloc[0]["yields"])
for yield_info in reaction_yields:
    print(f"Product: {yield_info['product']}")
    print(f"Yield: {yield_info['value']}%")
    print(f"Is desired product: {yield_info['is_desired']}")

# Extract temperature and time data for condition analysis
conditions_df = df[["reaction_id", "temperature", "temperature_units",
                    "reaction_time", "time_units", "YIELD"]]

Notes

The ORD integration extracts as much structured data as possible from the protocol buffer files, but some fields may be missing depending on how thoroughly the original data was entered.
When extracting yields, the module attempts to identify the desired product yield, but falls back to the maximum yield if not specified.
The procedure text combines information from multiple fields including setup details, conditions, and workup procedures.