

ACS Omega. 2020 Dec 29; 5(51): 32984–32994.

Molecular Design in Synthetically Accessible Chemical Space via Deep Reinforcement Learning

Julien Horwood

InVivo AI, Montreal, Quebec H2S 3H1, Canada

Mila, Université de Montréal, Montreal, Quebec H2S 3H1, Canada

Emmanuel Noutahi

InVivo AI, Montreal, Quebec H2S 3H1, Canada

Received 2020 Aug 27; Accepted 2020 Oct 27.

Abstract


The fundamental goal of generative drug design is to propose optimized molecules that meet predefined activity, selectivity, and pharmacokinetic criteria. Despite recent progress, we argue that existing generative methods are limited in their ability to favorably shift the distributions of molecular properties during optimization. We instead propose a novel Reinforcement Learning framework for molecular design in which an agent learns to directly optimize through a space of synthetically accessible drug-like molecules. This becomes possible by defining transitions in our Markov decision process as chemical reactions and allows us to leverage synthetic routes as an inductive bias. We validate our method by demonstrating that it outperforms existing state-of-the-art approaches in the optimization of pharmacologically relevant objectives, while results on multi-objective optimization tasks suggest increased scalability to realistic pharmaceutical design problems.

1. Introduction

Following advances in generative modeling for domains such as computer vision and natural language processing, there has been increased interest in applying generative methods to drug discovery. However, such approaches frequently fail to address numerous technical challenges inherent to molecular design, including accurate molecular reconstruction, efficient exploration of chemical space, and synthetic tractability of generated molecules. Further, these approaches bias the generation of molecules toward the data distribution over which they were trained, restricting their ability to discover truly novel compounds. Previous work1,2 has attempted to address these problems by framing molecular design as a Reinforcement Learning (RL) problem3 in which an agent learns a mapping from a given molecular state to atoms that can be added to the molecule in a stepwise manner. These approaches generally ensure the validity of the generated compounds and avoid the need to learn a latent space mapping from the data. However, they do not address the issue of synthetic tractability, and the proposed atom-by-atom environment transitions prevent rapid exploration of chemical space.

We instead approach the problem in a way that incorporates a favorable bias into the Markov decision process. Specifically, we define the environment's state transitions as sequences of chemical reactions, allowing us to address the common issue of synthetic accessibility. While ensuring synthesizability of computationally generated ligands is challenging, our framework treats synthesizability as a feature rather than as a constraint. Our approach, dubbed REACTOR (Reaction-driven Objective Reinforcement), thus addresses a common limitation of existing methods whereby the synthetic routes for generated molecules are unknown and require challenging retrosynthetic planning. Importantly, the REACTOR framework is able to efficiently explore synthetically accessible chemical space in a goal-directed manner, while also providing a theoretically valid synthetic route for each generated compound.

We benchmark our approach against previous methods, focusing on the task of identifying novel ligands for the D2 dopamine receptor, a G protein-coupled receptor involved in a broad range of neuropsychiatric and neurodegenerative disorders.4 In doing so, we find that our approach outperforms previous state-of-the-art methods, is robust to the addition of multiple optimization criteria, and produces synthetically accessible, drug-like molecules by design.

2. Related Work

Computational drug design has traditionally relied on domain knowledge and heuristic algorithms. Recently, however, several machine learning-based generative approaches have also been proposed. Many of these methods, such as ORGAN,5 take advantage of the SMILES representation using recurrent neural networks (RNNs) but have difficulties generating syntactically valid SMILES. Graph-based approaches6−8 have also been proposed and generally result in improved chemical validity. These methods learn a mapping from molecular graphs to a high-dimensional latent space from which molecules can be sampled and optimized. In contrast, pure Reinforcement Learning algorithms1,2 treat molecular generation as a Markov decision process in which molecules are assembled from basic building blocks such as atoms and fragments. However, a core limitation of existing methods is the forward-synthetic feasibility of proposed designs. To overcome these limitations, Button et al.9 proposed a hybrid rule-based and machine learning approach in which molecules are assembled from fragments under synthetic accessibility constraints in an iterative single-step procedure. However, this approach is limited in terms of the flexibility of its optimization objectives, as it only allows for the generation of molecules similar to a given template ligand.

In order to have practical value, methods for computational drug design must also make appropriate tradeoffs between molecular generation, which focuses on the construction of novel and valid molecules, and molecular optimization, which focuses on the properties of the generated compounds. While prior work has attempted to address these challenges simultaneously, this can lead to sub-optimal results by favoring either the generation or optimization task. Generative models generally do not scale well to complex property optimization problems, as they attempt to bias the generation process toward a given objective within the latent space while simultaneously optimizing over the reconstruction loss. These objectives are often in conflict, making goal-directed optimization difficult and hard to scale when multiple reward signals are required. This is generally the case in drug design, where drug candidates must show activity against a given target as well as favorable selectivity, toxicity, and pharmacokinetic properties.

In contrast, atom-based Reinforcement Learning addresses the generative problem via combinatorial enumeration of molecular states2 or a posteriori verification of molecules.1 These solutions are often slow and create a bottleneck in the environment's state transitions that limits effective optimization.

3. Methodology

In this work, we decompose generation and optimization by delegating each problem to a distinct component of our computational framework. Specifically, we let an environment module handle the generative process, using known chemistry as a starting point for its design, while an agent learns to effectively optimize compounds through interactions with this environment. By disambiguating the responsibilities of each component and formalizing the problem as a Markov decision process (MDP), we allow the modules to work symbiotically, exploring the chemical space both more efficiently and more effectively.

We begin with a short overview of Markov decision processes and actor-critic methods for Reinforcement Learning before defining our framework in detail.

3.1. Background

3.1.1. Markov Decision Processes

A Markov decision process (MDP)10 is a powerful computational framework for sequential decision-making problems. An MDP is defined via the tuple (S, A, R, P), where S defines the possible states, A denotes the possible actions that may be taken at any given time, R denotes the reward distribution of the environment, and P defines the dynamics of the environment. Interactions within this framework give rise to trajectories of the form (s_0, a_0, r_1, s_1, a_1, ..., r_T, s_T), with T being the final time step. Crucially, an MDP assumes that

p(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, ..., s_0, a_0) = p(s_{t+1} | s_t)    (1)

where t denotes discrete time steps.

This definition states that all prior history of a decision trajectory can be encapsulated within the preceding state, allowing an agent operating within an MDP to make decisions based solely on the current state of the environment. This assumption provides the basis for efficient learning and holds under our proposed framework. An agent's mapping from any given state to action probabilities is termed a policy, and the probability of an action a ∈ A at state s is denoted π(a | s).
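To make the interaction loop concrete, the following is a minimal sketch of how trajectories of the form above are collected; the `env` and `pi` interfaces are illustrative assumptions rather than the implementation used in this work.

```python
import random

def rollout(env, pi, max_steps=10):
    """Sample one trajectory (s_0, a_0, r_1, ..., r_T, s_T) from an MDP.

    `env` is assumed to expose reset() -> state and step(action) ->
    (next_state, reward, done); `pi(state)` returns a dict mapping
    actions to probabilities. Both interfaces are illustrative.
    """
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        probs = pi(state)
        # The Markov property (eq 1) lets the policy condition only on the
        # current state, not on the full history of the trajectory.
        action = random.choices(list(probs), weights=list(probs.values()))[0]
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory, state
```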

3.1.2. Policy Optimization

The underlying objective of a Reinforcement Learning agent operating in an MDP is to optimize its policy to maximize the expected return from the environment until termination at time T, defined for any step t by

G_t = E_π[ Σ_{k=t}^{T−1} γ^{k−t} r_{k+1} ]    (2)

where γ is a discount factor determining the value of future rewards and the expectation is taken over the experience induced by the policy's distribution. Several approaches exist for learning a policy that maximizes this quantity. In value-based approaches, Q values of the form Q(s, a) are trained to estimate the expected return of state–action pairs. A policy is then derived from these values through strategies such as ϵ-greedy control.3 Alternatively, policy-based approaches attempt to parameterize the agent's behavior directly, for example, through a neural network, to produce πθ(a | s). While our framework is agnostic to the specific algorithm used for learning, we choose to validate our approach with an actor-critic architecture.11 This approach combines the benefits of learning a policy directly using a policy network πθ with a variance-reducing value network vθ. Specifically, we use a synchronous version of A3C,12 which is amenable to high parallelization and further gains in training efficiency. The advantage actor-critic (A2C) objective function at time t is given by

A(s_t, a_t) = G_t − v_θ(s_t)    (3)

J(θ) = E_π[ log π_θ(a_t | s_t) A(s_t, a_t) + β H(π_θ(· | s_t)) ]    (4)

Intuitively, maximization of eq 4's first term involves adjusting the policy parameters to align high probability of an action with a high expected return, while the second term serves as an entropy regularizer preventing the policy from converging too quickly to sub-optimal deterministic policies.
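As a concrete illustration of this update, the following is a minimal PyTorch-style sketch of an A2C loss consistent with eqs 3 and 4; the tensor names, the value-loss weighting, and the use of mean-squared error for the critic are assumptions of this sketch rather than details specified in the paper.

```python
import torch
import torch.nn.functional as F

def a2c_loss(policy_logits, values, actions, returns, beta=0.01, value_coef=0.5):
    """Negated A2C objective for a batch of T steps.

    policy_logits: (T, num_actions) unnormalized action scores from the policy head
    values:        (T,) state-value estimates v_theta(s_t) from the critic head
    actions:       (T,) actions taken in the environment
    returns:       (T,) discounted returns G_t (eq 2)
    """
    dist = torch.distributions.Categorical(logits=policy_logits)
    log_probs = dist.log_prob(actions)
    advantages = returns - values.detach()            # eq 3: A(s_t, a_t) = G_t - v_theta(s_t)
    policy_loss = -(log_probs * advantages).mean()    # first term of eq 4
    entropy_bonus = dist.entropy().mean()             # entropy regularizer (second term)
    value_loss = F.mse_loss(values, returns)          # critic regression toward G_t
    return policy_loss - beta * entropy_bonus + value_coef * value_loss
```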

3.2. Molecular Design via Synthesis Trajectories

A core insight of our framework is that we can embed knowledge about the dynamics of chemical transitions into a Reinforcement Learning system for guided exploration. In doing so, we induce a bias over the optimization task which, given its close correspondence with natural molecular transitions, should increase learning efficiency while leading to better performance across a larger, pharmacologically relevant chemical subspace.

We propose embedding this bias into the transition model of an MDP by defining possible transitions as real chemical reactions. In doing so, we gain the additional benefit of built-in synthetic accessibility, in addition to immediate access to a synthesis route for generated compounds. Lack of synthesizability is a known constraint of prior generative approaches in molecular design.13 The REACTOR approach addresses this constraint by embedding synthesizability directly into the framework, leveraging synthetic routes as an inductive bias. This is demonstrated in Figure 1, where a sample trajectory is provided by REACTOR for a DRD2-optimized molecule, while a high-level overview of our framework is presented in Figure 2.

Figure 1. A trajectory taken by the REACTOR agent during the optimization of affinity for the dopamine receptor D2. This trajectory provides a high-level overview of a possible synthesis route for the proposed molecule in three steps: (1) a Mitsunobu reaction, (2) a reductive amination, and (3) a Buchwald–Hartwig amination. We note that, although the proposed route is theoretically feasible, it would not be the first choice for synthesis and could easily be optimized. Nevertheless, it remains an important indication of synthesizability. We also note here that the agent learns a policy that produces structures containing a pyrrolidine/piperidine moiety, which have been shown to be active against dopamine receptors.14,15

Figure 2. Overview of the REACTOR framework. Each episode is initialized with a molecular building block. At each step, the current state is converted to its fingerprint representation and the policy model selects a reaction to be performed. A reactant selection heuristic completes the reaction to generate the next state in the episode, and a reward of 0 is returned. If instead the terminal action is selected, the current state is considered the final molecule and its reward is used to update the policy's parameters.

3.2.1. Framework Definition

We define each component of our MDP as follows.

3.2.1.1. State Space S

We allow any valid molecule to constitute a state in our MDP. Practically, the state space is defined as {f(m) | m ∈ M}, with f being a feature extraction function and M being the space of molecules reachable given a set of chemical reactions, initialization molecules, and available reactants. We use Morgan fingerprints16 with bit-length 2048 and a radius of 2 to extract feature vectors from molecules. These representations have been shown to provide robust and efficient featurizations, while more computationally intensive approaches like graph neural networks are yet to demonstrate significant representational benefits.17,18
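As an illustration, the state featurization described above can be sketched with RDKit as follows; the function name and the use of a NumPy array as the output format are assumptions of this sketch.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def featurize(smiles, radius=2, n_bits=2048):
    """Map a molecule to the state representation f(m): a Morgan
    fingerprint with radius 2 and bit-length 2048."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.float32)

state = featurize("c1ccccc1CCN")  # 2048-dimensional binary feature vector
```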

3.2.1.2. Action Space A

In its general formulation, the action space of our framework is defined hierarchically, enabling the potential application of novel approaches for hierarchical reinforcement learning. Specifically, we define a set of higher-level actions A_o as a curated list of chemical reaction templates, taking the form

r_1 + r_2 + ⋯ + r_n → p_1 + p_2 + ⋯ + p_m    (5)

Each r_i corresponds to a reactant, while each p_j is a product of this reaction. We make use of the SMARTS syntax19 to represent these objects as regular expressions. We append to the high-level actions a terminal action, allowing the agent to learn to terminate an episode when the current state is deemed optimal for the objective. At step t, the state s_t thus corresponds to a single reactant in any given reaction. It is necessary to select which molecular blocks should fill in the remaining pieces for a given state and reaction choice. This gives rise to a set of primitive actions, corresponding to a list of reactants derived from the reaction templates, which we also refer to as chemical building blocks. In contrast with previous methods,1,2 which establish a deterministic start state such as an empty molecule or carbon atom, we initialize our environment with a randomly sampled building block that matches at least one reaction template. This ensures that a trajectory can take place and encourages the learned policies to be generalizable across different regions of chemical space.

For our experiments, we work with two-reactant reaction templates and select missing reactants based on those that most improve the next state's reward. We also select the chemical product in this fashion when more than one product is generated. Doing so collapses our hierarchical formulation into a standard MDP formulation, with the reaction selection being the only decision point. Additionally, it is likely that, for any step t, the set of possible reactions is smaller than the total action space. In order to increase both the scalability of our framework (by allowing for larger reaction lists) and the speed of training, we use a mask over unfeasible reactions. This avoids the need for the agent to learn the chemistry of reaction feasibility and reduces the effective dimension of the action space at each step. We compare policy convergence when using a masked action space to a regular action space formulation in Figure S1. The policy then takes the form π(a_t | s_t, M(s_t, R)), with M being the environment's masking function and R being the list of reaction templates.
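A sketch of this masking step is shown below; only templates whose reactant patterns match the current molecule stay available, and the remaining logits are suppressed before sampling. The helper names and the choice to append the terminal action as the last mask entry are assumptions of this sketch.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def reaction_mask(mol, reaction_smarts):
    """Return a 0/1 mask over reaction templates plus a final terminal action,
    marking which reactions accept the current molecule as a reactant."""
    mask = np.zeros(len(reaction_smarts) + 1, dtype=np.float32)
    for i, smarts in enumerate(reaction_smarts):
        rxn = AllChem.ReactionFromSmarts(smarts)
        # A template is feasible if the molecule matches any of its reactant patterns.
        if any(mol.HasSubstructMatch(template) for template in rxn.GetReactants()):
            mask[i] = 1.0
    mask[-1] = 1.0  # the terminal action is always available
    return mask

def masked_policy(logits, mask):
    """pi(a_t | s_t, M(s_t, R)): softmax restricted to feasible actions."""
    logits = np.where(mask > 0, logits, -1e9)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```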

3.2.1.3. Reward Distribution R

Appropriate reward design is crucial given that it drives the policy optimization process. In graph convolutional policy networks,1 intermediate and adversarial rewards are introduced in order to enforce drug-likeness and validity of generated compounds. In MolDQN,2 these requirements are ignored, and while optimization performance increases, desirable pharmaceutical properties are often lost. In the REACTOR framework, the separation between the agent and the environment allows us to maintain property-focused rewards that guide optimization while ensuring that chemical constraints are met via environment design.

We use a deterministic reward function based on the property to be optimized. In Table 1, this corresponds to the binary prediction of compounds binding to the D2 dopamine receptor (DRD2). In Table S1, these are the penalized calculated octanol–water partition coefficient (cLogP) and the quantitative estimate of drug-likeness (QED).20 In order to avoid artificially biasing our agent toward greedy policies, we remove intermediate rewards and provide evaluative feedback only at the termination of an episode. While we feel that this is a more principled view of the design process, Zhou et al.2 have also suggested that using an intermediate reward discounted by a decreasing function of the step t may improve learning efficiency. We further apply a constraint based on the atom count of a molecule to be consistent with prior work. When molecules exceed the maximum number of atoms (38), the agent observes a reward of zero.
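A minimal sketch of this terminal-only reward is given below, assuming a pretrained binary DRD2 activity classifier with a scikit-learn-style `predict` method; the model interface, the use of heavy-atom counts for the size constraint, and the `featurize` helper from the state-space sketch are all assumptions.

```python
from rdkit import Chem

MAX_ATOMS = 38  # atom budget used for the single-objective tasks

def terminal_reward(mol, drd2_model, featurize):
    """Reward observed only when the agent selects the terminal action.

    Intermediate steps receive 0; molecules exceeding the atom budget also
    receive 0. `drd2_model` is an assumed binary activity classifier and
    `featurize` the fingerprint function sketched earlier.
    """
    if mol.GetNumHeavyAtoms() > MAX_ATOMS:
        return 0.0
    features = featurize(Chem.MolToSmiles(mol)).reshape(1, -1)
    # Binary prediction of activity against the D2 dopamine receptor.
    return float(drd2_model.predict(features)[0])
```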

Table 1

Goal-Directed Molecule Design

objective | method | total actives | mean activity | diversity | scaff. similarity | uniqueness
DRD2 | BLOCKS | 3 ± 0 | 0.03 ± 0 | 0.94 ± 0 | N/A | 1.0 ± 0.0
DRD2 | hill climbing | 43.0 ± 2.94 | 0.43 ± 0.03 | 0.878 ± 0.01 | 0.124 ± 0.0 | 1.0 ± 0.0
DRD2 | ORGAN | 5.333 ± 0.47 | 0.093 ± 0.01 | 0.86 ± 0.01 | 0.577 ± 0.11 | 0.873 ± 0.01
DRD2 | JTVAE | 4.0 ± 0.82 | 0.014 ± 0.0 | 0.934 ± 0.0 | 0.097 ± 0.0 | 0.976 ± 0.01
DRD2 | GCPN | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.906 ± 0.0 | 0.12 ± 0.0 | 1.0 ± 0.0
DRD2 | MolDQN | 9.667 ± 0.47 | 0.816 ± 0.08 | 0.6 ± 0.02 | N/A | 0.12 ± 0.02
DRD2 | REACTOR | 77.0 ± 4.32 | 0.77 ± 0.04 | 0.702 ± 0.02 | 0.133 ± 0.01 | 1.0 ± 0.0
3.2.1.4. Transition Model P

In the template-based REACTOR framework, state transitions are deterministic. We therefore have p(s_{t+1} | s_t, a_t) = 1 according to our choice of reaction and the subsequent reactant–product selection. When modifying the reactant-selection policy, either via a stochastic heuristic such as epsilon-greedy reactant selection or via learned hierarchical policies, state transitions over the higher-level actions A_o become stochastic according to the internal policy's dynamics.
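The deterministic transition can be sketched as follows: the selected template is applied to the current state, the missing reactant slot is filled greedily with the block that most improves the scored outcome, and a single sanitized product is kept as the next state. The helper names, the two-reactant assumption, and the greedy `score_fn` interface are assumptions of this sketch.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def apply_reaction(state_mol, reaction_smarts, candidate_blocks, score_fn):
    """Deterministic transition for a chosen two-reactant reaction template.

    candidate_blocks: RDKit Mols considered for the missing reactant slot.
    score_fn: maps a product Mol to a scalar used for greedy selection.
    """
    rxn = AllChem.ReactionFromSmarts(reaction_smarts)
    best_product, best_score = None, float("-inf")
    for block in candidate_blocks:
        # Try the current state in either reactant position of the template.
        for reactants in ((state_mol, block), (block, state_mol)):
            for products in rxn.RunReactants(reactants):
                product = products[0]
                try:
                    Chem.SanitizeMol(product)
                except Exception:
                    continue  # skip chemically invalid products
                score = score_fn(product)
                if score > best_score:
                    best_product, best_score = product, score
    return best_product  # next state s_{t+1}, or None if the template never applied
```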

3.2.2. Building Block Fragmentation

In order to maximize the exploration capacity of the REACTOR agent, it is desirable to scale up the size of both the reaction template and reactant lists. However, current Reinforcement Learning methodology is poorly suited to very large discrete action spaces. In particular, there are approximately 76,000 building blocks available for our experiments, with a wide range of possibilities matching a given reaction template position. While certain approaches suggest learning a mapping from continuous to discrete action spaces,21,22 we instead mitigate the dimensionality of the reactant space directly. Indeed, we leverage the BRICS23 retrosynthesis rules to reduce our original reactant set to one of approximately 5000 smaller blocks. This reduces the reactant space dimension by an order of magnitude while rendering the transitions in space less extreme and thus more flexible. Additionally, we may limit the size of the set of reactants under consideration at any given step, treating this as a hyperparameter. For our experiments, we set this to 100 reactants, finding little variation when selecting reactants in a greedy manner.
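The fragmentation step can be sketched with RDKit's BRICS implementation as shown below; this mirrors the reduction described above but is not necessarily the authors' exact pipeline.

```python
from rdkit import Chem
from rdkit.Chem import BRICS

def fragment_building_blocks(smiles_list):
    """Decompose large building blocks into smaller BRICS fragments,
    shrinking the reactant space from tens of thousands of blocks to a
    few thousand fragments."""
    fragments = set()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        # BRICSDecompose applies the BRICS retrosynthetic bond-cleavage rules.
        fragments.update(BRICS.BRICSDecompose(mol))
    # Fragments are SMILES strings carrying BRICS attachment points such as [4*].
    return sorted(fragments)
```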

4. Results and Discussion

To validate our framework, we benchmark its performance on goal-directed design tasks, focusing primarily on predicted activity for the D2 dopamine receptor. We frame this objective as a sparse reward, using a binary activity indication to simulate a hit discovery setting. In order to maintain consistency with experiments done in prior work, we perform additional experiments on penalized cLogP and QED, with the results presented in the Supporting Information.

In order to better understand the exploration behavior of our approach, we also investigate the nature of the trajectories generated by the REACTOR policies, showing that the policies retain drug-likeness across all optimization objectives while also exploring distinct regions of chemical space.

4.1. Experimental Setup

4.1.1. Reaction Data

For these experiments, the set of reactions used was obtained from Konze et al.,24 with the final list consisting of 127 reactions following curation for specificity and validity. The set of reactants is drawn from PubChem,§25 totaling 76,208 building blocks matching the reaction templates. Following the retrosynthesis methodology introduced above, these lists were reduced to approximately 5000 smaller reactants, with 90 reaction templates matching these blocks. This allows us to make the space of action possibilities more tractable while rendering the exploration of chemical space more flexible, as each transition corresponds to smaller steps in space. Naturally, this action space does not cover all chemical transformations that may be of interest in a general setting. However, it is straightforward to extend the reaction templates and associated building blocks to tailor the search space to the information available for a given design objective.

4.1.2. Empirical Reward Models

While generative models are biased by their data distribution, RL-driven molecule design may be biased implicitly by the training data used for an empirical reward model. Thus, it is crucial that these models provide robust generalization. A model that is overly simplistic, as is seen for the cLogP experiments, may lead to agents exploiting particular biases, resulting in pharmacologically undesirable molecules. Training details for the DRD1, DRD2, DRD3, and Caco-2 models are found in the Supporting Information.

4.1.3. Baselines

We compare our approach to two recent methods in deep generative molecular modeling, JT-VAE and ORGAN.5,8 Each of these approaches was first pre-trained for up to 48 h on the same compute facility, a single machine with 1 NVIDIA Tesla K80 GPU and 16 CPU cores. Property optimization was then performed using the same procedures as described in the original papers. We also compare our method with two state-of-the-art Reinforcement Learning approaches, graph convolutional policy networks and MolDQN.1,2 Each algorithm was run using the open-sourced code from the authors, while we enforced the same reward function implementation across methods to ensure consistency. We ran GCPN using 32 CPU cores for approximately 24 h (against 8 h in the original paper) and MolDQN for 20,000 episodes (against 5000 episodes in the original paper). In addition, we added a steepest-ascent hill-climbing baseline using the REACTOR environment to demonstrate that, for simple, mostly greedy objectives such as cLogP and QED, simple search policies may provide reasonable performance. In contrast, learned traversals of the space become necessary for complex tasks such as DRD2.

4.1.4. Evaluation

Given the inherent differences between generative and reinforcement learning models, evaluation was adapted to remain consistent within each class of algorithms. JT-VAE and ORGAN were evaluated based on decoded samples from their latent space using the best results across training checkpoints, with statistics for JT-VAE computed over three random seeds. Given the prohibitive cost of training ORGAN, results are given over a single seed and averaged over three sets of 100 samples. Other baselines were compared based on three sets of 100 building blocks used as starting states. Statistics are reported over sets, while the statistics of the initial states are shown as BLOCKS.

We prioritize the evaluation of each method based on the total number of active molecules identified, as determined by the environment's reward model, given that this corresponds most closely to the underlying objective of de novo design. Indeed, in a hit discovery scenario, a user may be most interested in identifying the maximal number of unique potential hits, leaving potency optimization to later stages in the lead optimization process. We denote this quantity by "total actives" in Table 1. "Mean activity" corresponds to the percentage of generated molecules that are predicted to be active for the DRD2 receptor. In both Table 1 and Table S1, the mean reward ("mean activity") was computed based on the set of unique molecules generated by each algorithm in order to avoid artificially favoring methods that often generate the same molecule. Diversity corresponds to the average pairwise Tanimoto distance among generated molecules, while "scaff. similarity" corresponds to the average pairwise similarity between the scaffolds of the compounds, as implemented by the MOSES repository.26 Finally, we limited the number of atoms to 38 for all single-objective tasks, as done in prior work,1,2,8 and 50 for the multi-objective tasks.
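For reference, the diversity metric described above can be sketched as follows; the exact MOSES implementation may differ in fingerprint settings, so this is an illustrative computation only.

```python
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def internal_diversity(smiles_list, radius=2, n_bits=2048):
    """Average pairwise Tanimoto distance among generated molecules."""
    fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius, nBits=n_bits)
        for s in smiles_list
    ]
    distances = [
        1.0 - DataStructs.TanimotoSimilarity(fp_a, fp_b)
        for fp_a, fp_b in combinations(fps, 2)
    ]
    return sum(distances) / len(distances) if distances else 0.0
```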

4.2. Goal-Directed de Novo Design

Results on the unconstrained design task show that REACTOR identifies the most active molecules for the DRD2 objective. Furthermore, we observe that REACTOR maintains high diversity and uniqueness in addition to robust performance. This is a crucial feature, as it implies that the agent is able to optimize the space surrounding each starting molecule without reverting to the same molecule to optimize the scalar reward signal. In Table S1, REACTOR also achieves a higher reward on QED, while remaining competitive on penalized cLogP despite the simplistic nature of this objective favoring atom-by-atom transitions. We note that, while MolDQN exhibits higher mean activity, this is attributed to the fact that the optimization tends to collapse into generating the same molecule. This explains why the total number of active molecules identified remains low despite the mean activity suggesting good performance on the task.

Training efficiency is an important practical consideration when deploying methods for de novo design. Generative models first require learning a mapping of molecules to the latent space before training for property optimization. During our experiments, this resulted in more than 48 h of training time, after which training was stopped. Reinforcement Learning methods trained faster but generally failed to converge within 24 h. We ran MolDQN for 20,000 episodes, taking between 24 and 48 h, while GCPN was stopped after 24 h on 32 CPU cores. In contrast, our approach converges within approximately 2 h of training on 40 CPU cores for the cLogP and QED objectives, while consuming less memory than GCPN on 32 cores, and terminates in under 24 h for the D2-related tasks. In order to make effective use of parallelization, we leveraged the implementation of A2C provided by the rllib library.27

4.3. Synthetic Tractability and Desirability of Optimized Compounds

Given the narrow perspective offered by quantitative benchmarks for molecular design models,26 it is equally important to qualitatively assess the behavior of these models by examining generated compounds. Figure 4 provides samples generated by each RL method across all objectives. Since the computational estimation of cLogP relies on the Wildman–Crippen method,28 which assigns a high atomic contribution to halogens and phosphorus, the atom-based action space of MolDQN produces samples that are heavily biased toward these atoms, resulting in molecules that are well optimized for the task but neither synthetically accessible nor drug-like. This generation bias was not observable in previously reported benchmarks, where atom types were limited to carbon, oxygen, nitrogen, sulfur, and halogens.2 Furthermore, MolDQN samples for the DRD2 task lack a ring system, and whereas molecules from GCPN have one, none adequately optimizes for the objective.

Figure 4. (a–c) Sample molecules produced for each objective by each RL algorithm.

In contrast, REACTOR appears to produce more pharmacologically desirable compounds without explicitly considering this as an optimization objective. This is supported by Figure 3, which illustrates the shift in synthetic accessibility scores29 and drug-likeness for the DRD2-active molecules produced by REACTOR and MolDQN. This suggests that REACTOR is able to simultaneously solve the DRD2 task while maintaining favorable distributions for synthetic accessibility and drug-likeness.

Figure 3. Synthetic accessibility and drug-likeness score distributions of molecules optimized for DRD2 and of the starting blocks.

Further, as shown in Figures 1 and 7, optimized compounds are provided along with a possible route of synthesis. While such trajectories may not be optimal, given that they are limited by the reward design and the set of reaction templates available, they provide a crucial indication of synthesizability. Further, it is possible to encourage trajectories to be more efficient by limiting the number of synthesis steps per episode or by incorporating additional costs such as reactant availability and synthesis difficulty into the reward design. In certain applications, it may also be desirable to increase the specificity of the reaction templates via group protection. Gao and Coley13 detail the lack of consideration for synthetic tractability in current molecular optimization approaches, highlighting that this is a necessary requirement for the application of these methods in drug discovery. While alternative ideas aiming to embed synthesizability constraints into generative models have recently been explored,9,30,31 REACTOR is the first approach that explicitly addresses synthetic feasibility by optimizing directly in the space of synthesizable compounds using Reinforcement Learning.

Figure 7. (a, b) Trajectories taken by the REACTOR agent from the same building block for different objectives. Note that the reaction steps are simplified and are mainly indicative of synthesizability. For example, the Negishi coupling reaction would first require the formation of an organozinc precursor. Furthermore, selectivity is low at some steps, which would result in a mixture of products unless reacting groups are protected.

4.4. Multi-objective Optimization

Practical methods for computational drug design must be robust to the optimization of multiple properties. Indeed, beyond the agonistic or antagonistic effects of a small molecule, properties such as the selectivity, solubility, drug-likeness, and permeability of a drug candidate must be considered. To validate the REACTOR framework under this setting, we consider the task of optimizing for selective DRD2 ligands. Dopamine receptors are grouped into two classes: D1-like receptors (DRD1 and DRD5) and D2-like receptors (DRD2, DRD3, and DRD4). Although these receptors are well studied, the design of drugs selective across subtypes remains a considerable challenge. In particular, as DRD2 and DRD3 share 78% structural similarity in their transmembrane region,4,32 it is very challenging to identify small molecules that can selectively bind to them and modulate their activity. We therefore assess performance both on selectivity across classes (using DRD1 as an off-target) and within classes (using DRD3 as an off-target). We then analyze how our framework performs as we increase the number of design objectives. For these experiments, we focus our comparison on MolDQN as it outperforms the other state-of-the-art methods on the single-objective tasks. Our approach to combining multiple objectives is that of reward scalarization. Formally, a vector of reward signals is aggregated via a mapping into a scalar, thus collapsing the multi-objective MDP33 into a standard MDP formulation. While the simplest and most common approach to scalarization is to employ a weighted sum of the individual reward signals, we adopt a Chebyshev scalarization scheme34 whereby reward signals are aggregated via the weighted Chebyshev metric:

r̃_t = − max_i w_i | r_i − z_i* |    (6)

where z* is a utopian vector, w encodes the relative preferences for each objective, and i indexes the objectives. For our experiments, we consider rewards that are constrained to a range between 0 and 1 such that the utopian point is always 1, rendering the dynamics of each reward signal more comparable, and we assign equal preferences to the objectives. For the selectivity tasks, given that both rewards are binary, we use a soft version of this scalarization scheme corresponding to the negative Euclidean distance to the optimal point. This allows the reward signal to differentiate between reaching zero, one, or both of the objectives. While Chebyshev scalarization was introduced for the setting of tabular Reinforcement Learning, we may interpret it in the function approximation setting as defining an adaptive curriculum whereby the optimization focus shifts dynamically according to the objective most distant from z*.
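The two aggregation schemes described above can be sketched as follows; the function names and the equal-weight default are assumptions of this sketch.

```python
import numpy as np

def chebyshev_reward(rewards, weights=None, utopia=1.0):
    """Negative weighted Chebyshev distance to the utopian point (eq 6).

    rewards: per-objective rewards scaled to [0, 1], so the utopian point
    is the all-ones vector; weights encode relative objective preferences.
    """
    r = np.asarray(rewards, dtype=float)
    w = np.full(r.shape, 1.0 / r.size) if weights is None else np.asarray(weights, dtype=float)
    return -float(np.max(w * np.abs(r - utopia)))

def soft_selectivity_reward(rewards, utopia=1.0):
    """Soft variant for binary objectives: negative Euclidean distance to the
    optimum, distinguishing whether zero, one, or both objectives were met."""
    r = np.asarray(rewards, dtype=float)
    return -float(np.linalg.norm(r - utopia))
```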

4.4.1. DRD2 Selectivity

The total number of actives in Table 2 corresponds to the number of unique molecules that were found to satisfy all objectives, while the mean reward in Table 2 and Figure 5 is computed as the proportion of evaluation episodes for which the algorithms optimize all desired objectives. In Table 2, we find that REACTOR maintains the ability to identify a higher number of desirable molecules on the selectivity tasks, optimizing for DRD2 while avoiding off-target activity on the D1 and D3 receptors. Further, it is able to outperform MolDQN while maintaining very low scaffold similarity among generated molecules.

Figure 5. Reward progression as the number of optimization objectives increases.

Table 2

DRD2 Selectivity

objective | method | total actives | mean reward | diversity | scaff. similarity | uniqueness
D2/D1 | MolDQN | 9.0 ± 1.41 | 0.64 ± 0.07 | 0.502 ± 0.01 | N/A | 0.14 ± 0.01
D2/D1 | REACTOR | 36.667 ± 4.99 | 0.368 ± 0.05 | 0.599 ± 0.01 | 0.139 ± 0.01 | 0.997 ± 0.0
D2/D3 | MolDQN | 25.667 ± 3.09 | 0.884 ± 0.07 | 0.746 ± 0.05 | N/A | 0.29 ± 0.02
D2/D3 | REACTOR | 53.0 ± 8.29 | 0.53 ± 0.08 | 0.692 ± 0.03 | 0.147 ± 0.01 | 1.0 ± 0.0

4.4.2. Robustness to Multiple Objectives

In addition to off-target selectivity, we assess the robustness of each method's performance as we increase the number of pharmacologically relevant property objectives to optimize. Specifically, we compare the following combinations of rewards:

  • (a)

    DRD2 with range-targeted cLogP (2 objectives) according to the Ghose filter35

  • (b)

    DRD2, range cLogP, and a molecular weight ranging from 180 to 600 (3 objectives)

  • (c)

    DRD2, range cLogP, target molecular weight, and drug absorption as indicated by a model trained on data for the Caco-2 permeability assay36 (4 objectives)

For the range-targeted cLogP, molecular weight, and permeability objectives, the component-wise reward is 0 when the molecule falls within the desired range. Otherwise, the distance to the objective is mapped to a range of (0,1]. Given that the DRD2 objective is binary, this implicitly prioritizes the optimization for this reward.

Figure 5 shows that REACTOR demonstrates greater robustness to an increasing number of design objectives. Indeed, while both methods see diminishing success rates in optimizing for multiple objectives, the performance of REACTOR diminishes gradually, while MolDQN's performance collapses. Furthermore, REACTOR maintains the ability to generate unique terminal states throughout.

4.5. Goal-Directed Exploration

In order to gain further insight into the nature of the trajectories generated by the REACTOR agent, we plotted two alternative views of optimization routes generated for the same building block across each single-property objective. In Figure 6, we fit a principal component analysis (PCA)37 on the space of building blocks to identify the location of the initial state and subsequently transform the next states generated by the RL agent onto this space. We observe that the initial molecule is clearly shifted to distinct regions in space, while the magnitude of the transitions suggests efficient traversal of the space. This provides further evidence that exploration through space is a function of reward design and is largely unbiased by the data distribution of initialization states. Figure 7 shows the same trajectories with their corresponding reactions and intermediate molecular states. We find that optimized molecules generally contain the starting structure. We believe this to be a desirable property given that real-life design cycles are often focused on a fixed scaffold or set of core structures. We also note that the policy learned by our REACTOR framework is able to generalize over different starting blocks, suggesting that it achieves generation of structurally diverse and novel compounds.

Figure 6. (a, b) Trajectory steps of the REACTOR agent for each objective, starting with the same building block. The RL agent shifts the molecule toward different regions in space to identify the relevant local maximum.
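The projection used for this analysis can be sketched as follows, reusing the `featurize` helper from the state-space sketch; the two-component PCA and the helper names are assumptions of this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_trajectory(block_smiles, trajectory_smiles, featurize):
    """Fit a 2-component PCA on the building-block space and transform the
    states visited along an optimization trajectory onto that space."""
    block_features = np.stack([featurize(s) for s in block_smiles])
    pca = PCA(n_components=2).fit(block_features)
    trajectory_features = np.stack([featurize(s) for s in trajectory_smiles])
    return pca.transform(trajectory_features)  # (n_steps, 2) coordinates for plotting
```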

5. Conclusions

This work proposes a novel approach to molecular design that defines state transitions as chemical reactions within a Reinforcement Learning framework. We demonstrate that our framework leads to globally improved performance, as measured by the reward and diversity of generated molecules, as well as greater training efficiency, while producing more drug-like molecules. Analysis of REACTOR's robustness to multiple optimization criteria, coupled with its ability to maintain predicted activity on the DRD2 receptor, suggests increased potential for successful application in drug discovery. Furthermore, molecules generated by this framework exhibit better synthetic accessibility by design, with one viable synthesis route also suggested. Although the reactivity and stability of the optimized molecules remain a potential issue, REACTOR's efficiency in a multi-objective optimization setting suggests that this can be addressed by explicitly considering them as additional design objectives.

Future work aims to build on this framework by making use of its hierarchical formulation to guide agent policies both at the higher reaction level and the lower reactant level, exploring proposals from h-DQN38 for hierarchical value functions or the option-critic framework39 as a starting point. We also plan to expand the effective state space of our MDP by embedding a synthesis model, with transformer-based architectures showing promise,40 as the MDP transition model. Because practical de novo design requires the optimization of multiple criteria simultaneously, we believe that the efficiency of our design framework provides a robust foundation for such tasks, and we hope to expand on existing approaches41−43 for multi-objective Reinforcement Learning. Finally, we intend to validate the proposed synthetic routes and bioactivity of generated molecules experimentally to better demonstrate real-world utility.

Acknowledgments

The authors thank Daniel Cohen, Sébastien Giguère, Violet Guo, Michael Craig, Connor Coley, and Grant Wishart for reviewing the manuscript and for their helpful comments. The methods and algorithms presented here were developed at InVivo AI.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsomega.0c04153.

  • Results on QED/cLogP optimization tasks (Table S1); ablation study for the use of a masked action space (Figure S1); molecule samples from REACTOR: DRD2 task, DRD2 activity/DRD1 selectivity task, and DRD2 activity/DRD3 selectivity task (Figure S2); and model training details: DRD2 model report (Figure S3) and DRD1/DRD3/Caco-2 model training details (PDF)

Notes

The authors declare the following competing financial interest(s): The authors are employed by InVivo AI, which is actively involved in the pursuit of novel methodologies for drug discovery.

Footnotes

§A mapping of SMILES to PubChem ID is available upon request.


References

  • You J.; Liu B.; Ying Z.; Pande V.; Leskovec J. Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation. In Advances in Neural Information Processing Systems; Bengio S., Wallach H., Larochelle H., Grauman K., Cesa-Bianchi N., Garnett R., Eds.; Curran Associates, Inc.: 2018; pp. 6410–6421.
  • Zhou Z.; Kearnes S.; Li L.; Zare R. N.; Riley P. Optimization of Molecules via Deep Reinforcement Learning. Sci. Rep. 2019, 9, 10752. 10.1038/s41598-019-47148-x.
  • Sutton R. S.; Barto A. G. Reinforcement Learning: An Introduction; MIT Press: 2018.
  • Moritz A. E.; Free R. B.; Sibley D. R. Advances and challenges in the search for D2 and D3 dopamine receptor-selective compounds. Cell. Signalling 2018, 41, 75–81. 10.1016/j.cellsig.2017.07.003.
  • Guimaraes G. L.; Sanchez-Lengeling B.; Outeiral C.; Farias P. L. C.; Aspuru-Guzik A. Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. arXiv:1705.10843 [cs, stat] 2018.
  • De Cao N.; Kipf T. MolGAN: An implicit generative model for small molecular graphs. ICML 2018 Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
  • Liu Q.; Allamanis M.; Brockschmidt M.; Gaunt A. Constrained Graph Variational Autoencoders for Molecule Design. In Advances in Neural Information Processing Systems 31; Bengio S., Wallach H., Larochelle H., Grauman K., Cesa-Bianchi N., Garnett R., Eds.; Curran Associates, Inc.: 2018; pp. 7795–7804.
  • Jin W.; Barzilay R.; Jaakkola T. Junction Tree Variational Autoencoder for Molecular Graph Generation. arXiv:1802.04364 [cs, stat] 2019.
  • Button A.; Merk D.; Hiss J. A.; Schneider G. Automated de novo molecular design by hybrid machine intelligence and rule-driven chemical synthesis. Nat. Mach. Intell. 2019, 1, 307–315. 10.1038/s42256-019-0067-7.
  • Bellman R. A Markovian Decision Process. J. Math. Mech. 1957, 6, 679–684.
  • Konda V. R.; Tsitsiklis J. N. Actor-Critic Algorithms. In Advances in Neural Information Processing Systems 12; Solla S. A., Leen T. K., Müller K., Eds.; MIT Press: 2000; pp. 1008–1014.
  • Mnih V.; Badia A. P.; Mirza M.; Graves A.; Lillicrap T.; Harley T.; Silver D.; Kavukcuoglu K. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning; Balcan M. F., Weinberger K. Q., Eds.; PMLR: New York, New York, USA, 2016; Vol. 48, pp. 1928–1937.
  • Gao W.; Coley C. W. The Synthesizability of Molecules Proposed by Generative Models. J. Chem. Inf. Model. 2020, 10.1021/acs.jcim.0c00174.
  • Martelle J. L.; Nader M. A. A Review of the Discovery, Pharmacological Characterization, and Behavioral Effects of the Dopamine D2-Like Receptor Antagonist Eticlopride. CNS Neurosci. Ther. 2008, 14, 248–262. 10.1111/j.1755-5949.2008.00047.x.
  • Gilligan P. J.; Cain G. A.; Christos T. E.; Cook L.; Drummond S.; Johnson A. L.; Kergaye A. A.; McElroy J. F.; Rohrbach K. W. Novel piperidine σ receptor ligands as potential antipsychotic drugs. J. Med. Chem. 1992, 35, 4344–4361. 10.1021/jm00101a012.
  • Rogers D.; Hahn M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t.
  • Yang K.; Swanson K.; Jin W.; Coley C.; Eiden P.; Gao H.; Guzman-Perez A.; Hopper T.; Kelley B.; Mathea M.; et al. Correction to Analyzing Learned Molecular Representations for Property Prediction. J. Chem. Inf. Model. 2019, 59, 5304–5305. 10.1021/acs.jcim.9b01076.
  • Kearnes S.; McCloskey K.; Berndl M.; Pande V.; Riley P. Molecular Graph Convolutions: Moving Beyond Fingerprints. J. Comput.-Aided Mol. Des. 2016, 30, 595–608. 10.1007/s10822-016-9938-8.
  • Daylight Theory: SMARTS - A Language for Describing Molecular Patterns. https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html (accessed January 2020).
  • Bickerton G. R.; Paolini G. V.; Besnard J.; Muresan S.; Hopkins A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 2012, 4, 90–98. 10.1038/nchem.1243.
  • Chandak Y.; Theocharous G.; Kostas J.; Jordan S.; Thomas P. S. Learning Action Representations for Reinforcement Learning. arXiv:1902.00183 [cs, stat] 2019.
  • Dulac-Arnold G.; Evans R.; van Hasselt H.; Sunehag P.; Lillicrap T.; Hunt J.; Mann T.; Weber T.; Degris T.; Coppin B. Deep Reinforcement Learning in Large Discrete Action Spaces. arXiv:1512.07679 [cs, stat] 2016.
  • Degen J.; Wegscheid-Gerlach C.; Zaliani A.; Rarey M. On the Art of Compiling and Using 'Drug-Like' Chemical Fragment Spaces. ChemMedChem 2008, 3, 1503–1507. 10.1002/cmdc.200800178.
  • Konze K. D.; Bos P. H.; Dahlgren M. K.; Leswing K.; Tubert-Brohman I.; Bortolato A.; Robbason B.; Abel R.; Bhat S. Reaction-Based Enumeration, Active Learning, and Free Energy Calculations To Rapidly Explore Synthetically Tractable Chemical Space and Optimize Potency of Cyclin-Dependent Kinase 2 Inhibitors. J. Chem. Inf. Model. 2019, 59, 3782–3793. 10.1021/acs.jcim.9b00367.
  • Kim S.; Chen J.; Cheng T.; Gindulyte A.; He J.; He S.; Li Q.; Shoemaker B. A.; Thiessen P. A.; Yu B.; Zaslavsky L.; Zhang J.; Bolton E. E. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 2019, 47, D1102–D1109. 10.1093/nar/gky1033.
  • Polykovskiy D.; Zhebrak A.; Sanchez-Lengeling B.; Golovanov S.; Tatanov O.; Belyaev S.; Kurbanov R.; Artamonov A.; Aladinskiy V.; Veselov M.; Kadurin A.; Nikolenko S.; Aspuru-Guzik A.; Zhavoronkov A. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. arXiv preprint arXiv:1811.12823, 2018.
  • Liang E.; Liaw R.; Nishihara R.; Moritz P.; Fox R.; Goldberg K.; Gonzalez J.; Jordan M.; Stoica I. RLlib: Abstractions for Distributed Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning; Dy J., Krause A., Eds.; PMLR: Stockholmsmässan, Stockholm, Sweden, 2018; Vol. 80, pp. 3053–3062.
  • Wildman S. A.; Crippen G. M. Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Comput. Sci. 1999, 39, 868–873. 10.1021/ci990307l.
  • Ertl P.; Schuffenhauer A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminf. 2009, 1, 8. 10.1186/1758-2946-1-8.
  • Korovina K.; Xu S.; Kandasamy K.; Neiswanger W.; Poczos B.; Schneider J.; Xing E. ChemBO: Bayesian Optimization of Small Organic Molecules with Synthesizable Recommendations. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics; Chiappa S., Calandra R., Eds.; PMLR: 2020; Vol. 108, pp. 3393–3403.
  • Bradshaw J.; Paige B.; Kusner M. J.; Segler M.; Hernández-Lobato J. M. A Model to Search for Synthesizable Molecules. In Advances in Neural Information Processing Systems; Wallach H., Larochelle H., Beygelzimer A., d'Alché-Buc F., Fox E., Garnett R., Eds.; Curran Associates, Inc.: 2019; pp. 7937–7949.
  • Sibley D. R.; Monsma F. J. Jr. Molecular biology of dopamine receptors. Trends Pharmacol. Sci. 1992, 13, 61–69. 10.1016/0165-6147(92)90025-2.
  • Wiering M. A.; de Jong E. D. Computing Optimal Stationary Policies for Multi-Objective Markov Decision Processes. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning; IEEE: 2007; pp. 158–165.
  • Van Moffaert K.; Drugan M. M.; Nowe A. Scalarized multi-objective reinforcement learning: Novel design techniques. In 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL); IEEE: 2013; pp. 191–199.
  • Ghose A. K.; Viswanadhan V. N.; Wendoloski J. J. A Knowledge-Based Approach in Designing Combinatorial or Medicinal Chemistry Libraries for Drug Discovery. 1. A Qualitative and Quantitative Characterization of Known Drug Databases. J. Comb. Chem. 1999, 1, 55–68. 10.1021/cc9800071.
  • Van Breemen R. B.; Li Y. Caco-2 cell permeability assays to measure drug absorption. Expert Opin. Drug Metab. Toxicol. 2005, 1, 175–185. 10.1517/17425255.1.2.175.
  • Pearson K. LIII. On lines and planes of closest fit to systems of points in space. London, Edinburgh Dublin Philos. Mag. J. Sci. 2010, 2, 559–572. 10.1080/14786440109462720.
  • Kulkarni T. D.; Narasimhan K.; Saeedi A.; Tenenbaum J. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. In Advances in Neural Information Processing Systems 29; Lee D. D., Sugiyama M., Luxburg U. V., Guyon I., Garnett R., Eds.; Curran Associates, Inc.: 2016; pp. 3675–3683.
  • Bacon P.-L.; Harb J.; Precup D. The option-critic architecture. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence; AAAI Press: San Francisco, California, USA, 2017; pp. 1726–1734.
  • Schwaller P.; Laino T.; Gaudin T.; Bolgar P.; Hunter C. A.; Bekas C.; Lee A. A. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Cent. Sci. 2019, 5, 1572–1583. 10.1021/acscentsci.9b00576.
  • Abels A.; Roijers D.; Lenaerts T.; Nowe A.; Steckelmacher D. Dynamic Weights in Multi-Objective Deep Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning; Chaudhuri K., Salakhutdinov R., Eds.; PMLR: 2019; pp. 11–20.
  • Van Moffaert K.; Nowé A. Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies. J. Mach. Learn. Res. 2014, 15, 3663–3692.
  • Yang R.; Sun X.; Narasimhan K. A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation. In Advances in Neural Information Processing Systems; Wallach H., Larochelle H., Beygelzimer A., d'Alché-Buc F., Fox E., Garnett R., Eds.; Curran Associates, Inc.: 2019; pp. 14636–14647.

Articles from ACS Omega are provided here courtesy of the American Chemical Society.

