In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).[1]
It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.[2][3][4]

As explained variance

Suppose r = 0.7, meaning r2 = 0.49. This implies that 49% of the variability of the dependent variable has been accounted for, and the remaining 51% of the variability is still unaccounted for.

Interpretation

R2 is a statistic that will give some information about the goodness of fit of a model. In regression, the R2 coefficient of determination is a statistical measure of how well the regression line approximates the real data points. An R2 of 1 indicates that the regression line perfectly fits the data.
Values of R2 outside the range 0 to 1 can occur when R2 is used to measure the agreement between observed and modeled values and the "modeled" values are not obtained by linear regression, depending on which formulation of R2 is used. If the first formula above is used, values can be less than zero. If the second expression is used, values can be greater than one. Neither formula is defined for the case where $y_1 = \ldots = y_n = \bar{y}$.
Where the predictors are calculated by ordinary least-squares regression, that is, by minimizing SSres, R2 increases as the number of variables in the model increases (R2 is monotone increasing with the number of variables included, i.e. it will never decrease). This illustrates a drawback to one possible use of R2, where one might keep adding variables (kitchen-sink regression) to increase the R2 value. For example, if one is trying to predict the sales of a model of car from the car's gas mileage, price, and engine power, one can include irrelevant factors such as the first letter of the model's name or the height of the lead engineer designing the car, because R2 will never decrease as variables are added and will probably experience an increase due to chance alone.
This leads to the alternative approach of looking at the adjusted R2. Its interpretation is almost the same as that of R2, but it penalizes the statistic as extra variables are included in the model. For cases other than fitting by ordinary least squares, the R2 statistic can be calculated as above and may still be a useful measure. If fitting is by weighted least squares or generalized least squares, alternative versions of R2 can be calculated appropriate to those statistical frameworks, while the "raw" R2 may still be useful if it is more easily interpreted. Values for R2 can be calculated for any type of predictive model, which need not have a statistical basis.
Adjusted R-Squared (R2adjusted):  Coefficient of determination - Wikipedia
The use of an adjusted R2 (one common notation is $\bar{R}^2$, pronounced "R bar squared"; another is $R_{\text{adj}}^2$) is an attempt to account for the phenomenon of R2 automatically and spuriously increasing when extra explanatory variables are added to the model. It is a modification, due to Henri Theil, of R2 that adjusts for the number of explanatory terms in a model relative to the number of data points.[10] The adjusted R2 can be negative, and its value will always be less than or equal to that of R2. Unlike R2, the adjusted R2 increases only when the increase in R2 (due to the inclusion of a new explanatory variable) is more than one would expect to see by chance. If a set of explanatory variables with a predetermined hierarchy of importance is introduced into a regression one at a time, with the adjusted R2 computed each time, the level at which the adjusted R2 reaches a maximum and decreases afterward would be the regression with the ideal combination of best fit without excess or unnecessary terms. The adjusted R2 is defined as
\[ \bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1} = R^2 - (1 - R^2)\,\frac{p}{n-p-1} \]
where p is the total number of explanatory variables in the model (not including the constant term), and n is the sample size.
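As a minimal illustration of these definitions (a NumPy sketch with synthetic data, not drawn from any source), the following fits an ordinary least-squares model and then adds a pure-noise predictor, showing that R2 never decreases while the adjusted R2 can:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n data points and p explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

# OLS fit with one genuine predictor (plus intercept).
X1 = np.column_stack([np.ones(n), x])
beta1, *_ = np.linalg.lstsq(X1, y, rcond=None)
r2_1 = r_squared(y, X1 @ beta1)

# Add a pure-noise predictor: R^2 can only increase, adjusted R^2 need not.
X2 = np.column_stack([X1, rng.normal(size=n)])
beta2, *_ = np.linalg.lstsq(X2, y, rcond=None)
r2_2 = r_squared(y, X2 @ beta2)

print(f"R^2:     {r2_1:.4f} -> {r2_2:.4f}")  # never decreases
print(f"adj R^2: {adjusted_r_squared(r2_1, n, 1):.4f} -> "
      f"{adjusted_r_squared(r2_2, n, 2):.4f}")
```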
Mean Squared Error (MSE):  Mean squared error - Wikipedia
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors or deviations, that is, the average squared difference between the estimator and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or quadratic loss. The difference occurs because of randomness or because the estimator doesn't account for information that could produce a more accurate estimate.[1]
The MSE is a measure of the quality of an estimator—it is always non-negative, and values closer to zero are better.
The MSE is the second moment (about the origin) of the error, and thus incorporates both the variance of the estimator and its bias. For an unbiased estimator, the MSE is the variance of the estimator. Like the variance, MSE has the same units of measurement as the square of the quantity being estimated. In an analogy to standard deviation, taking the square root of MSE yields the root-mean-square error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known as the standard deviation.
Root-mean-square-error (RMSE): Root-mean-square deviation - Wikipedia
The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample and population values) predicted by a model or an estimator and the values actually observed. The RMSD represents the sample standard deviation of the differences between predicted values and observed values. These individual differences are called residuals when the calculations are performed over the data sample that was used for estimation, and are called prediction errors when computed out-of-sample. The RMSD serves to aggregate the magnitudes of the errors in predictions for various times into a single measure of predictive power. RMSD is a measure of accuracy used to compare forecasting errors of different models for a particular dataset, and not between datasets, as it is scale-dependent.[1]
Although RMSE is one of the most commonly reported measures of disagreement, some scientists misinterpret RMSD as average error, which it is not. RMSD is the square root of the average of squared errors, and thus confounds information concerning average error with information concerning variation in the errors. The effect of each error on RMSD is proportional to the size of the squared error, so larger errors have a disproportionately large effect on RMSD. Consequently, RMSD is sensitive to outliers.[2][3]
Mean Absolute Error (MAE): Mean absolute error - Wikipedia
In statistics, mean absolute error (MAE) is a measure of the difference between two continuous variables. Assume X and Y are variables of paired observations that express the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement.
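To make the contrast between these measures concrete, the short sketch below (with hypothetical values) computes RMSE and MAE on the same residuals with and without a single large error; the RMSE moves far more than the MAE:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error."""
    return np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)

def rmse(y, y_hat):
    """Root-mean-square error."""
    return np.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)))

y_true = np.array([10.0, 12.0, 11.0, 13.0, 14.0])
y_pred = np.array([10.5, 11.5, 11.5, 12.5, 13.5])
print(rmse(y_true, y_pred), mae(y_true, y_pred))  # 0.5 0.5

y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] = 19.0  # one large error of 5.0
print(rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))
# RMSE jumps to ~2.28 while MAE only rises to 1.4
```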
Mean Absolute Percentage Error (MAPE): Mean absolute percentage error - Wikipedia  
The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of prediction accuracy of a forecasting method in statistics, for example in trend estimation. It usually expresses accuracy as a percentage, and is defined by the formula
\[ \mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right| \]
where At is the actual value and Ft is the forecast value. The difference between At and Ft is divided by the actual value At. The absolute value of this ratio is summed for every forecasted point in time and divided by the number of fitted points n. Multiplying by 100 makes it a percentage error.
Although the concept of MAPE sounds very simple and convincing, it has major drawbacks in practical application.[1]
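The division by the actual value At is at the root of the main practical drawback: as the sketch below shows (with hypothetical values), a single near-zero actual dominates the average:

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, expressed in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

a = np.array([100.0, 102.0, 98.0])
f = np.array([101.0, 100.0, 99.0])
print(mape(a, f))  # ~1.3%: well-behaved

a_near_zero = np.array([100.0, 0.5, 98.0])
print(mape(a_near_zero, f))  # ~6634%: the near-zero actual dominates
```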
Mean Absolute Scaled Error (MASE):  Mean absolute scaled error - Wikipedia
In statistics, the mean absolute scaled error (MASE) is a measure of the accuracy of forecasts. The mean absolute scaled error has favorable properties when compared to other methods for calculating forecast errors, such as the root-mean-square deviation, and is therefore recommended for determining the comparative accuracy of forecasts.[2]

Rationale

The mean absolute scaled error has the following desirable properties:[3]
  1. Scale invariance: The mean absolute scaled error is independent of the scale of the data, so can be used to compare forecasts across data sets with different scales.
  2. Predictable behavior as $y_t \rightarrow 0$: Percentage forecast accuracy measures such as the mean absolute percentage error (MAPE) rely on division by $y_t$, skewing the distribution of the MAPE for values of $y_t$ near or equal to 0. This is especially problematic for data sets whose scales do not have a meaningful 0, such as temperature in Celsius or Fahrenheit, and for intermittent demand data sets, where $y_t = 0$ occurs frequently.
  3. Symmetry: The mean absolute scaled error penalizes positive and negative forecast errors equally, and penalizes errors in large forecasts and small forecasts equally. In contrast, the MAPE and median absolute percentage error (MdAPE) fail both of these criteria, while the "symmetric" sMAPE and sMdAPE[4] fail the second criterion.
  4. Interpretability: The mean absolute scaled error can be easily interpreted, as values greater than one indicate that in-sample one-step forecasts from the naïve method perform better than the forecast values under consideration.
  5. Asymptotic normality of the MASE: The Diebold-Mariano test for one-step forecasts is used to test the statistical significance of the difference between two sets of forecasts. To perform hypothesis testing with the Diebold-Mariano test statistic, it is desirable for $DM \sim N(0,1)$, where $DM$ is the value of the test statistic. The DM statistic for the MASE has been empirically shown to approximate this distribution, while the mean relative absolute error (MRAE), MAPE and sMAPE do not.[2]

Non-seasonal time series

For a non-seasonal time series,[5] the mean absolute scaled error is estimated by
\[ \mathrm{MASE} = \frac{1}{T} \sum_{t=1}^{T} \left( \frac{|e_t|}{\frac{1}{T-1} \sum_{t=2}^{T} |Y_t - Y_{t-1}|} \right) = \frac{\sum_{t=1}^{T} |e_t|}{\frac{T}{T-1} \sum_{t=2}^{T} |Y_t - Y_{t-1}|} \][3]
where the numerator et is the forecast error for a given period, defined as the actual value (Yt) minus the forecast value (Ft) for that period: et = Yt − Ft, and the denominator is the mean absolute error of the one-step "naive forecast method" on the training set,[5] which uses the actual value from the prior period as the forecast: Ft = Yt−1.[6]
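A direct transcription of the formula above (a minimal NumPy sketch, evaluated in-sample; the series and forecast values are hypothetical):

```python
import numpy as np

def mase(actual, forecast):
    """Non-seasonal MASE: the forecast MAE scaled by the in-sample MAE
    of the one-step naive forecast F_t = Y_{t-1}."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    mae_forecast = np.mean(np.abs(actual - forecast))  # (1/T) sum |e_t|
    mae_naive = np.mean(np.abs(np.diff(actual)))       # (1/(T-1)) sum |Y_t - Y_{t-1}|
    return mae_forecast / mae_naive

y = np.array([10.0, 12.0, 11.0, 13.0, 14.0, 13.0])
f = np.array([10.5, 11.5, 11.5, 12.5, 13.5, 13.0])
print(mase(y, f))  # < 1: better than the in-sample naive method
```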
+ add notes on Integral Squared Error (see http://www.online-courses.vissim.us/Strathclyde/measures_of_controlled_system_pe.htm)

Method selection

Based on the technology classification problem considered, the bibliometric data available, and the methods discussed in sections \ref{204737} to \ref{875755}, the following methods have been selected for use in this analysis:

Technology Life Cycle stage matching process

For those technologies where evidence for determining the transitions between different stages of the Technology Life Cycle has either not been found or is incomplete, a nearest neighbour pattern recognition approach has been employed based on the work of Gao \cite{Gao_2013} to locate the points where shifts between cycle stages occur.
In this instance a supervised learning approach is taken: the Technology Life Cycle model is sufficiently well established and widely recognised to form a sensible basis for classifying technological maturity, so there is no need to establish the validity of the categories being assigned. Equally, the nearest neighbour approach is a commonly used industry standard, so no further methodological development is proposed in this study.
OR
However, for the specific technologies considered in this study, literature evidence has been identified for the transitions between stages, and so the nearest neighbour methodology is not discussed further here.

Identification of significant patent indicator groups

In order to identify the bibliometric indicator groupings that could form the basis of a data-driven technology classification model, a combination of Dynamic Time Warping and the 'PAM' variant of K-Medoids clustering has been applied in this study. For the initial feature alignment and distance measurement stages of this process, Dynamic Time Warping is still widely recognised as the classification benchmark to beat (see section \ref{446824}), so this study does not seek to advance the feature alignment process beyond it.

Unlike the Technology Life Cycle stage matching process, which is based on a well-established technology maturity model, this study does not assume that a classification system based on the modes of substitution outlined in section \ref{771448} is intrinsically valid. For this reason an unsupervised learning approach has been adopted, to reduce the influence of human bias in determining whether a classification system based on presumptive technological substitution is valid before subsequently defining a classification rule system. This additionally means that labelling of predicted clusters can be carried out even if labels are only available for a small number of observed samples representative of the desired classes, or potentially even if none of the observed samples are absolutely defined. This is of particular use if this technique is to be expanded to a wider population of technologies, as obtaining evidence of the applicable mode of substitution that gave rise to the current technology can be a time-consuming process, and in some cases the necessary evidence may not be publicly available (e.g. if dealing with commercially sensitive performance data). As such, clustering can provide an indication of the likely substitution mode of a given technology without the need for prior training on technologies belonging to a given class. Under such circumstances this approach could be applied without collecting performance data, provided that the groupings produced by the analysis are broadly identifiable from inspection as being associated with the suspected modes of substitution (this is of course made easier if a handful of examples are known, but it is no longer a hard requirement).
The 'PAM' variant of K-Medoids is selected here over hierarchical clustering since the expected number of clusters is known from the literature, and keeping the number of clusters fixed allows for easier testing of how frequently predicted clusters align with expected groupings. Additionally, only a small sample of technologies is evaluated in this study, so the computational expense of the 'PAM' variant of K-Medoids relative to hierarchical clustering approaches is unlikely to be significant. It is also worth noting that evaluating the predictive performance of each subset of patent indicator groupings independently makes it possible to spot and rank commonly recurring patterns of subsets, which is not possible with approaches such as Linear Discriminant Analysis, which can assess the impact of individual predictors but not rank the most suitable combinations of indicators. A sketch of this clustering pipeline is given below.
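The following is a minimal sketch of the DTW-plus-PAM pipeline, assuming the tslearn and scikit-learn-extra packages as plausible implementations (the synthetic series below stand in for the real indicator time series; this is not necessarily the exact tooling used in the study):

```python
import numpy as np
from tslearn.metrics import dtw                 # pip install tslearn
from sklearn_extra.cluster import KMedoids      # pip install scikit-learn-extra

# Synthetic stand-ins for the per-technology indicator time series.
rng = np.random.default_rng(1)
series = [rng.standard_normal(40).cumsum() for _ in range(12)]

# Pairwise DTW distance matrix (feature alignment + distance measurement).
n = len(series)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = dtw(series[i], series[j])

# PAM K-Medoids on the precomputed DTW distances, with k fixed to the
# number of substitution modes expected from the literature.
km = KMedoids(n_clusters=3, metric="precomputed", method="pam", random_state=0)
labels = km.fit_predict(D)
print(labels)
```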

Ranking of significant patent indicator groups

As the number of technologies considered in this study is relatively small, exhaustive cross-validation approaches provide a feasible means of ranking the out-of-sample predictive capabilities of those bibliometric indicator subsets identified as producing significant correlations with expected in-sample technology groupings. As such, leave-p-out cross-validation is applied for this purpose, which also reduces the risk of over-fitting in the subsequent model-building phases.
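As an illustration, scikit-learn's LeavePOut enumerates every possible hold-out set of size p; the feature matrix and labels below are hypothetical placeholders:

```python
import numpy as np
from sklearn.model_selection import LeavePOut

# Hypothetical placeholders: one row of indicator-derived features per
# technology, with an expected in-sample grouping label.
X = np.random.default_rng(2).standard_normal((10, 4))
y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])

lpo = LeavePOut(p=2)  # exhaustive: every possible size-2 hold-out set
for train_idx, test_idx in lpo.split(X):
    pass  # fit the candidate model on X[train_idx], score it on X[test_idx]

print(lpo.get_n_splits(X))  # C(10, 2) = 45 splits
```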

Model building

Misalignment in time between the life cycle stages of different technologies can make it difficult to identify common features in their time series. This is primarily because such phase variance risks artificially inflating data variance, skewing the driving principal components and often disguising underlying data structures \cite{Marron_2015}. Consequently, given the importance of phase variance when comparing historical trends for different technologies, and the coupling that exists between adjacent points in growth and adoption curves, functional linear regression is selected here to build the technology classification model developed in this study (see section \ref{875755}).
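One common way to implement such a functional regression, sketched below under the assumption that the curves have already been aligned and resampled to a common length, is to project them onto their leading functional principal components and fit a linear model on the resulting scores; the curves and responses here are synthetic placeholders, not the study's actual data or implementation:

```python
import numpy as np

# Synthetic placeholders: 20 aligned, equal-length technology curves and a
# scalar response per curve.
rng = np.random.default_rng(3)
curves = rng.standard_normal((20, 50)).cumsum(axis=1)
y = rng.standard_normal(20)

# Functional PCA via SVD of the centred curves.
mean_curve = curves.mean(axis=0)
centred = curves - mean_curve
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
k = 3                        # number of principal components retained
scores = centred @ Vt[:k].T  # FPCA scores, one row per curve

# Ordinary least squares on the scores (with intercept) stands in for the
# functional linear regression on the projected curves.
A = np.column_stack([np.ones(len(y)), scores])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta
print(beta)
```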

Sensitivity of technology adoption to chosen modelling parameters

Whilst statistical approaches are well suited to detecting underlying correlations in historical and experimental datasets, this on its own does not provide a detailed understanding of the causation behind associated events. Equally, statistical methods are not generally well suited to predicting disruptive events and complex interactions, with other simulation techniques such as System Dynamics and Agent-Based Modelling performing better in these areas. Accordingly, in order to identify causal effects and test the sensitivity of technological substitution patterns to variability arising from real-world socio-technical features not captured in simple bibliometric indicators (such as the influence of competition and economic effects), the fitted regression model is evaluated in a real-time system dynamics environment.
+ additional notes available (if required for expansion) in notes section of 'Method selection' slide

Method limitations

Although precautions have been taken where possible to ensure that the methods selected for this study address the problem posed, building a generalised technology classification model based on bibliometric data, in as rigorous a fashion as possible, there are some known limitations to the methods used in this work that must be recognised. Many of the current limitations stem from the fact that technologies have been selected based on where evidence is obtainable to indicate the mode of adoption followed. As such, the technologies considered here do not come from a truly representative cross-section of all industries, so it is possible that the models generated will better represent the industries considered rather than providing a more generalisable result. This evidence-based approach also means that it remains a time-consuming process to locate the literature needed to support classifying technology examples as arising from one mode of substitution or another, and to then compile the relevant cleaned patent datasets for analysis. As a result, only a relatively limited number of technologies have been considered in this study, and this set should be expanded to increase confidence in the findings produced from this work. This also raises the risk that clustering techniques may struggle to produce consistent results from so small a number of technologies.

Furthermore, any statistical or quantitative methods used for modelling are unlikely to provide real depth of knowledge beyond the detection of correlations behind patent trends when used in isolation. Ultimately some degree of causal exploration, whether through case study descriptions, system dynamics modelling, or expert elicitation, will be required to shed more light on the underlying influences shaping technology substitution behaviours.
Other data-specific issues that could arise relate to the use of patent searches in this analysis and the need to resample data based on variable-length time series. The former relates to the fact that patent search results and records can vary to a large extent based on the database and exact search terms used; however, overall trends, once normalised, should remain consistent with other studies of this nature (this point is addressed in more detail in section XX). The latter refers to the fact that functional linear regression requires all technology case studies to be based on the same number of time samples. As such, as discussed in section \ref{875755}, linear interpolation is used as required to ensure consistency in the number of observations, while possibly introducing some small errors which are not felt to be significant.
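A minimal sketch of this resampling step (using np.interp on a normalised time axis; the series is illustrative):

```python
import numpy as np

def resample_to_length(series, n_samples):
    """Linearly interpolate a 1-D series onto a fixed number of points."""
    series = np.asarray(series, dtype=float)
    x_old = np.linspace(0.0, 1.0, num=len(series))
    x_new = np.linspace(0.0, 1.0, num=n_samples)
    return np.interp(x_new, x_old, series)

print(resample_to_length([1.0, 4.0, 9.0, 16.0], 7))
```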
+ additional notes available (if required for expansion) in notes section of 'Method limitations' slide

Selected data sources

Three types of data source are considered in this study: patent data and publication data (i.e. bibliometric sources), which are subsequently coupled with technology adoption data to enable the impact of different modes of substitution to be investigated:

Patent data

Patent data has been sourced from the Questel-Orbit patent search platform in this analysis. More specifically, the full FamPat database was queried in this study, which groups related invention-based patents filed in multiple international jurisdictions into families of patents. This platform is accessed by subscribers via an online search engine that allows complex patent record searches to be structured, saved, and exported in a variety of formats. A selection of keywords, dates, or classification categories are used in this search engine to build relevant queries for a given technology (this process is discussed in more detail in section \ref{335937}). The provided search terms are then matched in the title, abstract, and key content of all family members included in a FamPat record, although unlike title and abstract searches, key contents searches (which include independent claims, advantages, drawbacks, and the main patent object) are limited to only English language publications. Some of the core functionalities behind this search engine are outlined in \cite{Questel_Orbit_2000}.

Publication data

Journal article and publication records used in this analysis are based on extracted search results from the Web of Science (WoS) citation indexing service provided by Clarivate Analytics (previously Thomson Reuters). Web of Science was originally established based on the work of Eugene Garfield, who identified the relevance of citations and subsequently developed the idea of the Science Citation Index (SCI) in the 1950s as a database for storing these records, along with the Institute for Scientific Information (ISI) as an organisation set up to maintain this information. Whilst not originally intended for research evaluation, but rather for helping researchers find relevant work more effectively, the SCI was later joined by the Social Sciences Citation Index (SSCI), and subsequently the Arts & Humanities Citation Index (A&HCI) in the 1970s. After being acquired by the Thomson Corporation, this collection of indexes was converted into the present-day Web of Science, which is currently reported to hold details of over 100 million records dating from 1900 onwards, covering more than 33,000 journals, 50,000 books, and 160,000 conference proceedings. As such, it comprises the largest collection of scholarly articles globally \cite{Mingers_2015,WoS_facts}.
In a very similar fashion to the Questel-Orbit platform, the online Web of Science search engine relies on a series of keywords and Boolean operators to define search terms that are then matched in the title, abstract, and key content of the records in the database.

Technology adoption data

Adoption data for the technologies investigated is taken from a wide variety of sources due to the broad scope of the technology domains considered. Where possible, global technology sales and shipment values have been used to determine the overall market share of each technology at a given time, although in some cases data values have been imputed to fill gaps in time series (where applied, this is stated, along with the method used to derive the imputed values). Furthermore, the preference has been to extract statistical data directly from international agencies such as the UN, World Bank, International Energy Agency, International Council on Clean Transportation, International Telecommunication Union, and Eurostat when available, as these organisations generally present the most consistent representation of the technologies considered when taking regional development trends into consideration. In many cases, this information was accessed via the UK Data Service \cite{UKDS_stat}.
A brief description of each data source used for technology adoption data is given in Table \ref{table:data_sources_for_technology_adoption_data}: