THE PREDICTION OF OUTCOMES RELATED TO THE USE OF NEW DRUGS IN THE REAL WORLD THROUGH ARTIFICIAL ADAPTIVE SYSTEMS.
Enzo Grossi & Massimo Buscema
Semeion Research Centre
Research Centre of Sciences of Communication
Via Sersale 117, Rome, 00128, Italy
In this brief essay we focus on three main problems related to the use of new medications in the real world: 1) the prediction of drug response in individual patients; 2) the prediction of rare unwanted events after the introduction of a new drug on the market; 3) the passage from the preclinical phase to Phase I in human beings. The first problem is felt most keenly by medical doctors, who are asked to treat their patients as individuals rather than as statistics; with the advent of extremely costly new drugs, however, health authorities and private insurance organizations are also looking for powerful tools to personalize treatment plans. The second problem is typically felt by drug agencies, which are sometimes forced to withdraw a marketing authorization when a cluster of drug-related deaths raises alarm and disappointment in the media. The third problem is felt by pharmaceutical companies and by the Institutional Review Boards that grant clearance for first-in-man trials.
1. Prediction of drug response in individual patients
Making predictions for specific outcomes (diagnosis, risk assessment, prognosis) represents a fascinating aspect of medical science. Different statistical approaches have been proposed to define models that identify factors predictive of the outcome of interest. Studies have been performed, for example, to define the clinical and biological characteristics that could help predict who will benefit from an antiobesity drug, but results have been limited (1).
Traditional statistical approaches encounter problems when the data show large variability and cannot easily be normalized because of inherent nonlinearity. More advanced analysis techniques, such as dynamic mathematical models, can be useful because they are particularly suitable for solving the nonlinear problems frequently associated with complex biological systems.
The use of artificial neural networks (ANNs) in biological systems has been proposed for different purposes, including studies on deoxyribonucleic acid sequencing (2) and protein structure (3).
ANNs have been used in different clinical settings to predict the effectiveness of instrumental evaluation (echocardiography, brain single photon emission computed tomography, lung scintigram, prostate biopsy) in increasing diagnostic sensitivity and specificity, and in laboratory medicine in general (4). They have also proven effective in identifying gastro-oesophageal reflux patients on the sole basis of clinical data (5). But the most promising application of ANNs relates to the prediction of possible clinical outcomes with a specific therapy. ANNs have proven effective in detecting responsiveness to methadone treatment in drug addicts (6), to pharmacological treatment in Alzheimer disease (7), to clozapine in schizophrenic patients (8) and in various fields of psychiatric research (9).
The use of ANNs for predictive modelling in obesity dates back a decade, when they were proposed to model the waist-hip ratio from 13 other health parameters (10). Later, they were proposed as a tool for body composition research (11).
One of the main factors preventing a more efficient use of new pharmacological treatments for chronic diseases such as hypertension, cancer, Alzheimer disease or obesity is the difficulty of predicting a priori the chance that a single patient will respond to a specific drug. A major methodological obstacle in drawing inferences and making predictions from data collected in the real-world setting, such as observational studies, is that outcomes are influenced by variability in the underlying biological substrates of the studied population and in the quality and content of medical intervention. Because there is no reason to believe that these, like other health factors, work together in a linear manner, traditional statistical methods based on the generalized linear model have limited value in predicting outcomes such as responsiveness to a particular drug.
Most studies have shown that up to 50% of patients treated with new molecules, given as monotherapy or as an adjunct to standard treatments, may show an unsatisfactory response. When the time comes for the physician to decide on the type of treatment, there is very little evidence to guide the choice of drug. Take obesity, for example. Only scanty data are available on factors predictive of response to a specific treatment, and attempts to develop models for predicting drug response with traditional multiple regression techniques have shown unsatisfactory predictive capacity (i.e. less than 80% of total variance explained) (12, 13). A possible explanation is that obesity is a so-called complex disease, characterized by multiple interactions among variables, positive and negative feedback loops, and non-linear system dynamics. Another good example is Alzheimer disease.
Clinical trials have established the efficacy of cholinesterase inhibitor drugs (ChEI), such as tacrine (14), donepezil (15) and rivastigmine (16), based on improvement in cognitive aspects and in overall functioning, measured with the Alzheimer's Disease Assessment Scale-Cognitive subscale (ADAS-Cog) and the Clinician's Interview-Based Impression of Change (CIBIC), respectively. Although the mean scores of treated patients on both scales were significantly better than those of the placebo group, many subjects under active treatment showed little or no improvement (nonresponders).
However, it is not currently possible to estimate which patients are likely to respond to pharmacological therapy with ChEI. Such a prediction would be an important decision-making factor in improving the use of healthcare resources.
A possible alternative approach to this problem is the use of neural networks. Artificial neural networks (ANNs) are computerized algorithms that resemble the interactive processes of the human brain. They allow the study of very complex non-linear phenomena such as biological systems. Like the brain, ANNs recognize patterns, manage data and, most significantly, learn. These statistical-mathematical tools can determine the existence of a correlation between series of data and a particular outcome and, once “trained”, can predict output data from given input. They work well in pattern recognition and discrimination tasks.
Although ANNs have been applied to various areas of medical research, they have not been employed in obesity clinical pharmacology.
ANNs proved to be a useful method for discriminating between responders and non-responders, performing better than traditional statistical methods, in our three experimental studies carried out with donepezil in Alzheimer disease, sibutramine in obesity and infliximab in Crohn disease.
In a paper published in 2002 (7) we evaluated the accuracy of artificial neural networks compared with discriminant analysis in classifying positive and negative response to the cholinesterase inhibitor donepezil in an opportunistic group of 61 elderly patients of both genders affected by Alzheimer's disease (AD), observed in a real-world setting over three months of follow-up.
Accuracy in detecting subjects sensitive (responders) or not sensitive (nonresponders) to therapy was based on the standard FDA criterion for evaluation of efficacy: the scores of the Alzheimer's Disease Assessment Scale-Cognitive portion and the Clinician's Interview-Based Impression of Change-plus scales. In this study ANNs were more effective in discriminating between responders and nonresponders than other advanced statistical methods, particularly linear discriminant analysis. The total accuracy in predicting the outcome was 92.59%.
In a second study we evaluated the use of artificial neural networks in predicting response to infliximab treatment in patients with Crohn's disease (CD) (18).
In this pilot study, different ANN models were applied to a data sheet with demographic and clinical data from 76 patients with steroid-resistant/dependent or fistulizing CD treated with infliximab, to compare their accuracy in classifying responder and non-responder subjects with that of linear discriminant analysis.
Eighty-one outpatients with CD (31 men, 50 women; mean age ± standard deviation 39.9 ± 15, range 12-81) participating in an Italian multicentre study (17) were enrolled in the study. All patients were treated, between April 1999 and December 2003, with infliximab at a dose of 5 mg/kg of body weight for luminal refractory CD (CDAI > 220–400) (43 patients), fistulizing CD (19 patients) or both (14 patients).
The final data sheet consisted of 45 independent variables related to demographic and medical history data (sex, age at diagnosis, age at infusion, smoking habit, previous Crohn's-related abdominal surgery [ileal or ileo-cecal resections] and concomitant treatments including immunomodulators and corticosteroids) and to clinical aspects (location of disease, perianal disease, type of fistulas, extraintestinal manifestations, clinical activity at the first infusion [CDAI], indication for treatment). Smokers were defined as those smoking a minimum of 5 cigarettes per day for at least 6 months before their first dose of infliximab. Non-smokers were defined as those who had never smoked, those who had quit smoking at least 6 months before their first dose of infliximab, or those who smoked fewer than 5 cigarettes per day. Concomitant immunosuppressive use was defined as initiation of methotrexate before the first infliximab infusion or initiation of 6-mercaptopurine (6-MP) or azathioprine more than 3 months before the first infliximab infusion.
Response was assessed by clinical evaluation 12 weeks after the first infusion for all patients. Determination of response in patients with inflammatory CD was based on the Crohn's Disease Activity Index (CDAI). To obtain a clear-cut estimate, clinical response was classified as either complete response or partial/no response.
Complete response was defined as (a) clinical remission (CDAI < 150) in luminal refractory disease and (b) temporary closure of all draining fistulas at consecutive visits in the case of enterocutaneous and perianal fistulas; entero-enteric fistulas were evaluated by small bowel barium enema, and vaginal or vesical fistulas by lack of drainage at consecutive visits. For patients with both indications the outcome was evaluated independently for each indication.
Two different experiments were planned following an identical research protocol. The first included all 45 independent variables, covering the frequency and intensity of Crohn's disease symptoms plus numerous other social and demographic characteristics, clinical features and history. In the second experiment the IS system coupled with the T&T system automatically selected the most relevant variables, so that 22 variables were included in the model.
Discriminant analysis was also performed on the same data sets, by a statistician blinded to the ANN results, to evaluate the predictive performance of this advanced statistical method. Different models were assessed to optimise predictive ability. In each experiment the sample was randomly divided into two sub-samples, one for the training phase and the other for the testing phase, with the same record distributions used for ANN validation.
ANNs reached an overall accuracy rate of 88%, while linear discriminant analysis (LDA) reached only 72%.
Finally, in a third study we evaluated the performance of ANNs in predicting response to warfarin (17).
A total of 377 patients were included in the analysis. The most frequent clinical indication for anticoagulation was atrial fibrillation (69%); other indications included heart valve prosthesis (10%) and pulmonary embolism (8%). The large majority of patients (325, 86%) were on concurrent drug treatment: on average, they were taking 3 (IQR 1-4) medications potentially interacting with warfarin. The median weekly maintenance dose (WMD) of warfarin was 22.5 mg (IQR 16.3-28.8 mg). Thirteen patients whose INR values were not within the therapeutic range were erroneously included in the analysis: their median weekly maintenance dose was 21.4 mg (IQR 12.2-30.0 mg); the INR was higher than 3.0 in 2 patients (INR 3.7 and 4.3) and lower than 2.0 in 11 (median INR 1.5, IQR 1.5-1.7).
Demographic, clinical and genetic data (CYP2C9 and VKORC1 polymorphisms) were used. The final prediction model was based on 23 variables selected by the TWIST® system within a protocol based on a bipartite division of the data set (training and testing).
The TWIST system is based on a population of n ANNs managed by an evolutionary system that is able to extract the best training and testing sets from the global dataset and to evaluate the relevance of the different variables, selecting those most relevant to the problem under study.
The ANN algorithm reached high accuracy, with an average absolute error of 5.7 mg in the warfarin maintenance dose. In the subsets of patients requiring ≤21 mg and 21-49 mg (45% and 51% of the cohort, respectively) the absolute error was 3.86 mg and 5.45 mg, with a high percentage of subjects correctly identified (71% and 73%, respectively). This performance is better than that obtained in different studies carried out with traditional statistical techniques. In conclusion, ANNs appear to be a promising tool for predicting the maintenance dose of vitamin K antagonists.
1.1 Selection of informative variables: how evolutionary algorithms work
To include only the most informative of the available variables we used a genetic algorithm, the Genetic Doping Algorithm, which uses the principles of evolution to optimize the training and testing sets and to select the minimum number of variables capturing the maximum amount of available information in the data. Unlike linear statistical models that use indicator variables, TWIST does not require the omission of a reference category. This is because the artificial neural network focuses on prediction rather than estimation. If some of the indicator variables can completely account for the predictive ability of the others, the latter will be excluded by the algorithm during the selection process. The method is called the TWIST protocol and has previously been applied successfully to similar problems [20,21]. The advantages of the approach are the sub-setting of the data into two representative sets for training and testing, which is otherwise problematic in small datasets, and the use of a combination of criteria to determine the fit of the model. TWIST comprises two systems, the T&T for resampling of the data and the IS for feature selection, both using artificial neural networks (ANNs). The T&T system splits the data into training and testing sets in such a way that each subset is statistically representative of the full sample. This non-random selection of subsets is crucial when small samples are considered and the selection of non-characteristic and extreme subsets is likely. The IS system uses the training and testing subsets produced to identify a vector of 0s and 1s, describing the absence or presence of each indicator variable, that optimizes the categorization of the individuals into cases and controls compared with their observed status.
For this, a population of vectors, each vector being a combination of the indicator variables, is allowed to “evolve” through a number of generations in order to optimize the prediction of the target variable, just as a natural population evolves to optimize fitness under a specific set of environmental conditions. The vectors with the best predictive ability are overrepresented in the next generation, while a smaller number of sub-optimal vectors are maintained to give rise to the following generation. Some instability, in the form of vectors with low predictive ability, is introduced into the process to avoid settling on a solution that is optimal only under a narrow set of conditions, also known as a local optimum. This step ensures that the attributes do not include redundant information or noise variables, which would decrease the accuracy of the map and increase both the computing time and the number of examples needed during learning. In addition, feature selection makes the graph of relationships between the variables easier to interpret.
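The evolutionary selection of 0/1 vectors described above can be illustrated with a minimal Python sketch. It is a generic genetic algorithm, not the Genetic Doping Algorithm or the TWIST protocol: the synthetic data, the nearest-centroid classifier used as fitness function, the per-variable penalty and all names and parameter values are assumptions introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n=200, n_informative=3, n_noise=7):
    """Synthetic records (assumption): only the first columns carry signal."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, n_informative + n_noise))
    X[:, :n_informative] += 1.5 * y[:, None]
    return X, y

def fitness(mask, X_tr, y_tr, X_te, y_te, penalty=0.01):
    """Testing-set accuracy of a nearest-centroid classifier restricted to the
    selected columns, minus a small penalty per selected variable."""
    cols = mask.astype(bool)
    if not cols.any():
        return 0.0
    c0 = X_tr[y_tr == 0][:, cols].mean(axis=0)
    c1 = X_tr[y_tr == 1][:, cols].mean(axis=0)
    d0 = np.linalg.norm(X_te[:, cols] - c0, axis=1)
    d1 = np.linalg.norm(X_te[:, cols] - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return (pred == y_te).mean() - penalty * cols.sum()

def evolve(X_tr, y_tr, X_te, y_te, pop_size=30, generations=40, p_mut=0.05):
    """Evolve a population of 0/1 vectors (1 = variable included)."""
    n_vars = X_tr.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_vars))
    for _ in range(generations):
        scores = np.array([fitness(m, X_tr, y_tr, X_te, y_te) for m in pop])
        elite = pop[np.argsort(scores)[::-1][: pop_size // 2]]  # best half survives
        n_children = pop_size - len(elite)
        pa = elite[rng.integers(0, len(elite), size=n_children)]
        pb = elite[rng.integers(0, len(elite), size=n_children)]
        cross = rng.integers(0, 2, size=pa.shape).astype(bool)
        children = np.where(cross, pa, pb)                      # uniform crossover
        flip = rng.random(children.shape) < p_mut               # random mutation
        children = np.where(flip, 1 - children, children)
        pop = np.vstack([elite, children])
    scores = np.array([fitness(m, X_tr, y_tr, X_te, y_te) for m in pop])
    return pop[np.argmax(scores)]

X, y = make_data()
X_tr, y_tr, X_te, y_te = X[:100], y[:100], X[100:], y[100:]
best = evolve(X_tr, y_tr, X_te, y_te)
print("selected variables:", np.flatnonzero(best))
```

In this sketch the mutation step plays the role of the "instability" mentioned in the text. Because the testing set drives the selection, a further independent sample would be needed for an unbiased estimate of accuracy.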
2. Prediction of rare unwanted events
Drug-induced injuries are a growing concern for health authorities. The world population is continuously growing older because of an increased life expectancy and is thus using more and more drugs, whether prescription or over-the-counter drugs.
Therefore, chances of drug-induced injuries are rising. Over the years, a number of postmarketing labelling changes or drug withdrawals from the market due to postmarketing discoveries have occurred. Even the best planned and carefully designed clinical studies have limitations. To detect all potential adverse drug reactions, you need quite a large number of subjects exposed to the drug and the number of subjects participating in the clinical studies might not be large enough to detect especially rare adverse drug reactions.
To minimise the risk of postmarketing discoveries such as unrecognised adverse drug reactions, certain risk factors, e.g. laboratory or ECG abnormalities, are subject to increased regulatory review.
The most frequent cause of safety-related withdrawal of medications from the market (e.g. bromfenac, troglitazone) and of FDA non-approval is drug-induced liver injury (DILI). Different degrees of liver enzyme elevation after drug intake can result in hepatotoxicity, which can be fatal because of irreversible damage to the liver. Since animal models cannot always predict human toxicity, drug-induced hepatotoxicity is often detected only after market approval. In the United States, DILI contributes to more than 50% of acute liver failure cases (data from WM Lee and colleagues from the Acute Liver Failure Study Group).
The second leading cause for withdrawing approved drugs from the market is QT interval prolongation, which can be measured on an electrocardiogram (ECG). Some non-cardiovascular drugs (e.g. terfenadine) have the potential to delay cardiac repolarisation and to induce potentially fatal ventricular tachyarrhythmias such as Torsades de Pointes.
Drug toxicity is also a common cause of acute or chronic kidney injury and can be minimised or prevented by vigilance and early treatment. NSAIDs, aminoglycosides and calcineurin inhibitors are examples of drugs known to induce kidney dysfunction. Most events are reversible, with kidney function returning to normal when the drug is discontinued.
Consequently, the pharmaceutical industry has a strong interest in identifying drugs that carry a risk of adverse drug reactions as early as possible, in order to improve the drug development programme.
A patient developing a severe side effect to a particular medication can be considered an outlier. Suppose that you are deriving probabilities of future occurrences of severe side effects from the data collected in large clinical trials carried out before the commercialization of your product. These trials provide health authorities with evidence that your product is effective and safe and therefore deserves registration or marketing authorization.
Now, say that you estimate that an event happens once in every 1,000 patients treated. You will need a lot more data than 1,000 patients to ascertain its frequency, say 3,000 patients. Now, what if the event happens once in every 5,000 patients? The estimation of this probability requires a larger number still, 15,000 or more. The smaller the probability, the more observations you need, and the greater the estimation error for a set number of observations. Therefore, to estimate a rare event you need a sample that grows in inverse proportion to the frequency of the event.
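The figures of 3,000 and 15,000 patients are consistent with what is often called the "rule of three": to have about a 95% chance of observing at least one event of frequency p, roughly 3/p patients are needed. The text does not name this rule, so reading the figures this way is an assumption. The short Python sketch below computes the exact number from the condition 1 - (1 - p)^n ≥ 0.95, assuming independent patients with an identical event probability; the function name is hypothetical.

```python
import math

def patients_needed(event_rate, confidence=0.95):
    """Smallest n such that the probability of observing at least one event
    among n patients is >= confidence (independent, identical patients assumed)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - event_rate))

for one_in in (1000, 5000):
    n_exact = patients_needed(1.0 / one_in)
    n_rule = 3 * one_in  # rule-of-three approximation
    print(f"1 event in {one_in}: exact n = {n_exact}, rule of three = {n_rule}")
```

The exact values fall slightly below 3,000 and 15,000. Observing a single event is only the minimum requirement; estimating the frequency with useful precision requires more patients again.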
If small-probability events carry large impacts (in this example, death) and, at the same time, these small-probability events are more difficult to compute from past data, then our empirical knowledge about the potential contribution, or role, of rare events (probability × consequence) is inversely proportional to their impact. The future challenge in this setting will be to derive, from the limited amount of information available in the pre-registration phase of drug development, subtle and weak but true signals that something will go wrong after marketing approval, when the new drug will be given in the real world to hundreds of thousands of subjects, an increase of two orders of magnitude over pre-registration experience. These real-world patients will be very different from the patients encountered in phase 3 clinical trials, who are generally “clean patients”, i.e. with no concomitant disease, few concomitant treatments, age below a certain limit, good compliance, and so on. In the post-marketing phase, on the other hand, the new drug will be given to “dirty patients”, i.e. subjects with substantial co-morbidity, many concomitant treatments, extreme age and poor compliance (which also means taking an excess of the drug, by mistake or intentionally, in an attempt to compensate for missed doses). Some artificial adaptive systems, based on a new mathematics, could be able to learn from a large phase 3 study the hidden links between rare events and a particular patient profile, even if no patient with that profile actually exists in the data set.
There are basically two possibilities: the first is to use an “associative memory”, or autoassociative artificial neural network, able to navigate the hypersurface of a dataset in search of rare occurrences linked to a particular assembly of variables; the second is to use a pseudo-inverse function coupled with an evolutionary algorithm able to repopulate a specific probability density function with virtual records, enabling the search for rare events not available in the original data set.
2.1 Autoassociative artificial neural networks.
The NR-NN is a new recurrent network equipped with a powerful new algorithm (“Re-Entry”, developed by the Semeion Research Centre) that can dynamically adapt its trajectory during the recall phase according to the question it is asked.
This new artificial neural network, developing an associative memory, can identify the best possible connection between variables and can generate alternative data scenarios to follow the dynamic effects. During the training phase, the algorithm optimizes the weight of all the possible interconnections between variables in order to minimize the error. The training phase is followed by a rigorous validation protocol, which requires the correct reconstruction of variables that are randomly deleted from each record.
During the querying phase of the database, the NR-NN can answer the following questions:
• prototypical question (the characteristic prototype of a patient with a particular side effect or without a particular side effect);
• virtual question (the prototypical profile of a patient having a side effect with specific characteristics, even if no subject with these variables is actually present in the data set).
These special dynamics of NR-NN allow us to distinguish 3 types of variables:
• Discriminant variables: variables that are "switched on" only for a specific prototype;
• Indifferent variables: variables that are "switched off" for both prototypes;
• Metastable variables: variables that are "switched on" for both prototypes; in other words they act in opposite ways according to the context of the other variables. Metastable variables are specific to non-linear systems.
The possibility of simulating a post-marketing surveillance study, rather than carrying it out in reality, could help to optimize decisions in uncertain situations and save lives if the drug could cause severe side effects in rare patients.
At present there are very few known computer-aided simulators for clinical trials; one example is the simulation method and simulator proposed by Pharsight Corporation, Mountain View, California, USA. The basic features of this simulator are described in “Case Study in the use of Bayesian hierarchical modelling and simulation for design and analysis of a clinical trial” by William R. Gillespie (Bayesian CTS example at the FDA/Industry workshop, September 2003). The simulator is based on the well-known statistical Monte Carlo method.
This simulator, however, is not designed to simulate the results, or the trend of the results, of a phase four clinical trial, so its predictions in that setting cannot be considered very reliable. The method is, moreover, oriented more toward better planning of trials, in terms of the kind of individuals to enrol and the way the trials have to be carried out, in order to increase the chances of success.
From a theoretical point of view, the use of Artificial Adaptive Systems (AAS) should make it possible to infer the results likely to be obtained in an advanced phase of marketing from the analysis of the data collected in the pre-registration phases. In other words, the aim is to simulate and predict the results of post-marketing surveillance from the data of phase 3 or, better, from an open observational study carried out in a large sample of the population expected to be exposed to the new drug after commercialisation. The only requirement is that this observational study consist of a population that is not excessively “clean” but corresponds at least in part to the mix of variables encountered in the real world. The key point is to establish the “implicit function” relating the input (independent) variables to the dependent variable (the specific outcome of a subject).
It is interesting to note that, unlike classical statistics, which acts on a particular data set with a vertical horizon, AAS tend to act with a horizontal approach (see figure 1).
Once the implicit function is established, for example with autoassociative neural networks, it is possible to navigate the hypersurface of the data set by asking questions. For example, one could ask what the prototype of a subject with a specific negative outcome is, and how this profile depends on the presence of the new drug.
Let’s consider the problem of drug-induced hepatotoxicity. Suppose that during the Phase 3 clinical trials of a new drug hepatic function has been closely monitored and a slight to moderate elevation of hepatic enzymes has been recorded in a small proportion of patients, for example 2% of the exposed population. Liver enzyme levels can range from zero, through the upper limit of the normal range, up to 10 times the normal range in the case of massive hepatic necrosis.
Let’s take ALT: normal range is 1-20 units; a very high value is 400 units.
In the trial, 2% of patients showed an elevation of ALT up to 50 units; none of them had severe necrosis with very high ALT values.
To adapt the collected data to neural network processing, the ALT values are scaled from 0 (equal to 1 ALT unit) to 1 (equal to 400 ALT units).
So a subject with ALT = 20 will be coded as approximately 0.05, while a subject with ALT = 40 will be coded as approximately 0.1.
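As a minimal sketch of this rescaling, assuming a plain linear (min-max) mapping with 1 unit mapped to 0 and 400 units mapped to 1, the coded values can be computed as follows. The function names are hypothetical; the essay does not state which scaling formula was actually used.

```python
def scale_alt(alt_units, low=1.0, high=400.0):
    """Linear min-max scaling: `low` maps to 0.0 and `high` maps to 1.0."""
    return (alt_units - low) / (high - low)

def unscale_alt(scaled, low=1.0, high=400.0):
    """Inverse mapping from the [0, 1] code back to ALT units."""
    return low + scaled * (high - low)

for alt in (1, 20, 40, 50, 400):
    print(f"ALT = {alt:>3} units -> coded value {scale_alt(alt):.3f}")
```

Under this assumption ALT = 20 and ALT = 40 are coded slightly below 0.05 and 0.1 respectively, and the highest value seen in the trial (50 units) is coded at roughly 0.12, far from the value of 1.0 used in the query described next.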
After the training phase with the auto-associative neural network, we ask the network for the prototype of a patient with a very high ALT value by setting the value of ALT in our scaled data set to 1.0 (400 units). During the query, when the ALT input is switched on, the network activates all its units in a process that is at once dynamic, competitive and cooperative.
This external activation of the ALT variable of the original dataset generates a process. Each step of this process is a state of a dynamical system. At each state, each variable takes a specific value, generated by the previous negotiation of that variable with all the others. The process terminates when the system reaches its natural attractor. At this point, all the states of the process represent the prototype of the patients with a strong elevation of ALT values.
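The recall dynamics described above can be sketched in Python with a much simpler stand-in for the NR-NN. Here each variable is predicted linearly from all the others (ridge regression), one variable is clamped, and the state is updated until it stops changing. This is not the Re-Entry algorithm: the linear autoassociator, the damped update, the synthetic data and all names are assumptions made for illustration only.

```python
import numpy as np

def fit_autoassociator(X, ridge=1e-2):
    """For each variable j, fit a ridge regression predicting column j from
    all other columns. Returns weights W (zero diagonal) and intercepts b."""
    n, d = X.shape
    W = np.zeros((d, d))
    b = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        A = np.hstack([X[:, others], np.ones((n, 1))])
        coef = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ X[:, j])
        W[j, others] = coef[:-1]
        b[j] = coef[-1]
    return W, b

def recall(W, b, clamp_index, clamp_value, start, step=0.5, tol=1e-6, max_iter=1000):
    """Hold one variable fixed and update the others until the state stops
    changing (or max_iter is reached). Values are kept inside [0, 1]."""
    x = start.copy()
    x[clamp_index] = clamp_value
    for _ in range(max_iter):
        target = np.clip(W @ x + b, 0.0, 1.0)
        new_x = (1.0 - step) * x + step * target   # damped update
        new_x[clamp_index] = clamp_value           # external activation stays on
        if np.max(np.abs(new_x - x)) < tol:
            return new_x
        x = new_x
    return x

# Hypothetical synthetic records, already scaled to [0, 1]:
# column 0 = ALT (only low values observed), column 1 rises with ALT, column 2 falls.
rng = np.random.default_rng(1)
alt = rng.uniform(0.0, 0.12, size=300)
rises = np.clip(0.3 + 2.0 * alt + rng.normal(0.0, 0.03, 300), 0.0, 1.0)
falls = np.clip(0.7 - 2.0 * alt + rng.normal(0.0, 0.03, 300), 0.0, 1.0)
X = np.column_stack([alt, rises, falls])

W, b = fit_autoassociator(X)
prototype = recall(W, b, clamp_index=0, clamp_value=1.0, start=X.mean(axis=0))
print(np.round(prototype, 3))
```

The final state is the sketch's analogue of the prototype: the profile the model associates with ALT = 1.0 even though no record in the synthetic data has such a value. A linear model can only extrapolate linearly, which is precisely the limitation a non-linear recurrent network is intended to overcome.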
This prototype can be used to monitor patients who are candidates to receive the drug after commercialisation and whose profile is close to the prototype.
With the same approach we can generate prototypes of patients having other severe reactions, as defined by dramatic changes in specific biomarkers.
2.2 Pseudo-inverse function and evolutionary algorithms
A method for generating new records using an evolutionary algorithm (close to but different from a genetic algorithm) has been recently developed at Semeion Institute. This method, called Pseudo-Inverse Function (in short P-I Function), is able to generate new (virtual) data from a small set of observed data. P-I Function can be of aid when practical constraints limit the number of cases collected during clinical trials, or in case of a population that shows some potentially interesting safety risk traits, but whose small size can seriously affect the reliability of estimates, or in case of secondary analysis on small samples.
The field of application is any research design with one or more dependent variables and a set of independent variables. New cases are estimated by maximizing a fitness function, which yields as many ‘virtual’ cases as needed, reproducing the statistical traits of the original population. The algorithm used by the P-I Function is known as the Genetic Doping Algorithm (GenD), designed and implemented by the Semeion Research Centre; among its features is an innovative crossover procedure, which tends to select individuals with average fitness values rather than those that show the best values at each ‘generation’.
A particularly thorough research design has been set up: (1) the observed sample is split in half to obtain a training set and a testing set, which are analysed by means of a back-propagation neural network; (2) testing is performed to find out how good the parameter estimates are; (3) a 10% sample is randomly extracted from the training set and used as a reduced training set; (4) on this narrow basis, GenD calculates the pseudo-inverse of the estimated parameter matrix; (5) the ‘virtual’ data are tested against the testing data set (which has never been used for training).
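The general idea of evolving ‘virtual’ records that a fixed model maps to a requested outcome can be sketched in Python as follows. This is not the P-I Function or GenD: the logistic model standing in for a trained network, its weights, the simple truncation-plus-mutation scheme and all names are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical fixed model standing in for a trained network (assumption).
WEIGHTS = np.array([2.0, -1.0, 3.0])
BIAS = -2.0

def model(x):
    """Logistic model mapping inputs in [0, 1]^3 to an outcome in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(x @ WEIGHTS + BIAS)))

def generate_virtual_records(target, n_records=20, pop_size=60,
                             generations=150, sigma=0.1):
    """Evolve candidate input vectors whose model output approaches `target`."""
    pop = rng.random((pop_size, WEIGHTS.size))
    for _ in range(generations):
        error = np.abs(model(pop) - target)
        keep = pop[np.argsort(error)[: pop_size // 2]]          # keep the best half
        children = np.clip(keep + rng.normal(0.0, sigma, keep.shape), 0.0, 1.0)
        pop = np.vstack([keep, children])                       # mutated copies
    error = np.abs(model(pop) - target)
    return pop[np.argsort(error)[:n_records]]

virtual = generate_virtual_records(target=0.9)
print(np.round(virtual[:5], 3))
print(np.round(model(virtual[:5]), 3))
```

Unlike the procedure in the text, this sketch makes no attempt to reproduce the joint distribution of the independent variables; it only shows how an evolutionary search can populate a region of the input space that corresponds to a rare outcome.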
The algorithm has been validated on a particularly difficult data set composed of only 44 respondents, randomly sampled from a broader data set taken from the General Social Survey 2002. The main result is that networks trained on the ‘virtual’ resample show a model fit as good as that of the observed data, although the ‘virtual’ and observed data differ in some features. It can be seen that GenD ‘refills’ the joint distribution of the independent variables, conditioned on the dependent one.
This approach could be very useful for expanding the subset of patients who suffer severe side effects in the course of a Phase 3 clinical trial. In the new virtual population, thanks to the higher number of records, it would be more appropriate to test statistical assumptions and derive predictive models.
3. Prediction of dose-related drug toxicity in the passage from preclinical phase to Phase I in human beings
The recent disaster in France, where 5 healthy volunteers suffered extremely severe central nervous system side effects following administration of the highest dosage of a new chemical entity in a dose-escalation Phase I clinical trial, underlines how difficult it is to transfer data from animal studies to man when choosing a dose range with an acceptable risk of toxicity.
Phase I studies are designed to test safety and tolerability of a drug, as well as how, and how fast, the chemical is processed by the human body. Most of these studies are carried out by specialized research contract companies; the subjects are usually healthy volunteers who receive modest financial compensation.
Serious incidents in phase I studies are rare, but they can never be completely excluded because a drug's behavior in animals isn't always a good predictor of its effects in humans. The last publicly known similar incident occurred in 2006, when six men in the United Kingdom suffered severe organ dysf