Undertake An Exploratory Modelling Analysis

Note: This vignette uses fake data - it is for illustrative purposes only and should not be used to inform decision making. The specific package includes ready4 framework model modules that form part of the ready4 youth mental health economic model. Currently, these modules are not optimised to be used directly, but are instead intended for use in other model modules. For example, the TTU package includes modules that extend specific modules to help implement utility mapping studies. However, to illustrate the main features of specific modules this vignette demonstrates how specific modules could be used independently. In practice, workflow illustrated in this article would probably need to be performed iteratively in order to identify the optimal model types, predictors and covariates and to update default values to ensure model convergence.

library(ready4)
library(scorz)
library(specific)

By default, modules in the specific package will request your consent before writing files to your machine. This is the safest option. However, as there are many files that need to be written locally for this program to execute, you can overwrite this default by supplying the value “Y” to methods with a consent_1L_chr argument.

consent_1L_chr <- "" # Default value - asks for consent prior to writing each file.

Import data

We start by ingesting our data. As this example uses EQ-5D data, we import a ScorzEuroQol5 ready4 framework module (created using the steps described in this vignette from the scorz pacakge) into a SpecificConverter Module and then apply the metamorphose method to convert it into a SpecificModel module.

X <- SpecificConverter(a_ScorzProfile = ready4use::Ready4useRepos(gh_repo_1L_chr = "ready4-dev/scorz", 
                                                                  gh_tag_1L_chr = "Documentation_0.0") %>%
                         ingest(fls_to_ingest_chr = "ymh_ScorzEuroQol5",  metadata_1L_lgl = F)) %>% 
  metamorphose()

class(X)

## [1] "SpecificModels"
## attr(,"package")
## [1] "specific"

Inspect data

The dataset we are using has a total of 1786 records at two timepoints on 1068 study participants. The first six records are reproduced below.

Dataset
Unique identifier	Data collection round	Date of data collection	Age	Gender (grouped)	Sex at birth	Sexual orientation	Relationship status	Aboriginal or Torres Strait Islander	Culturally And Linguistically Diverse	Region of residence (metropolitan or regional)	Education and employment status	EQ5D - Mobility domain score	EQ5D - Self-Care domain score	EQ5D - Usual Activities domain score	EQ5D - Pain / Discomfort domain score	EQ5D - Anxiety / Depression domain score	Kessler Psychological Distress - 10 Item Total Score	Overall Wellbeing Measure (Winefield et al. 2012)	EuroQol (EQ-5D) - (weighted total)	EuroQol (EQ-5D) - (unweighted total)
1	BL	2019-10-22	14	Male	Male	Heterosexual	In a relationship	No	No	Metro	Not studying or working	1	1	1	1	2	11	87	0.879	6
2	BL	2019-10-17	19	Female	Female	Heterosexual	In a relationship	Yes	Yes	Regional	Studying only	1	2	1	1	1	14	65	0.846	6
2	FUP	2020-02-14	19	Female	Female	Heterosexual	In a relationship	Yes	Yes	Regional	Studying only	3	1	1	1	1	10	71	0.850	7
3	BL	2020-02-15	21	Female	Female	Other	Not in a relationship	NA	NA	Metro	Studying only	1	1	3	1	1	13	74	0.883	7
3	FUP	2020-06-14	21	Female	Female	Other	Not in a relationship	NA	NA	Metro	Studying only	1	1	2	1	1	10	64	0.906	6
4	BL	2019-12-14	12	Female	Female	Heterosexual	In a relationship	Yes	Yes	Metro	Not studying or working	1	1	1	3	1	18	40	0.796	7

To source dataset of X is contained in the a_YouthvarsProfile slot and is a YouthvarsSeries module. For more information about methods that can be used to explore this dataset, read this vignette from the youthvars package.

Specify parameters

In preparation for exploring our dataset, we need to declare a set of model parameters in a b_SpecificParameters slot of X. This can be done in one step, or in sequential steps. In this example, we will proceed sequentially.

Dependent variable

The dependent variable (total EQ-5D utility score) has already been specified when we imported the data from the ScorzEuroQol5 module.

procureSlot(X, "b_SpecificParameters@depnt_var_nm_1L_chr")

## [1] "eq5d_total_w"

We can now add details of the allowable range of dependent variable values.

X <- renewSlot(X, "b_SpecificParameters@depnt_var_min_max_dbl", c(-1,1))

Candidate predictors

We can now specify the names of candidate predictor variables.

X <- renewSlot(X, "b_SpecificParameters@candidate_predrs_chr", c("K10_int","Psych_well_int"))

We next add meta-data about each candidate predictor variable in the form of a specific_predictors object.

X <- renewSlot(X, "b_SpecificParameters@predictors_lup", class_chr = "integer", class_fn_chr = c("youthvars::youthvars_k10_aus","as.integer"), covariate_lgl = F, increment_dbl = 1,
               long_name_chr = c("Kessler Psychological Distress - 10 Item Total Score", "Overall Wellbeing Measure (Winefield et al. 2012)"), max_val_dbl = c(50,90), min_val_dbl = c(10,18), mdl_scaling_dbl = 0.01,
               short_name_chr = c("K10_int","Psych_well_int"))

The specific_predictors object that we have added to X can be inspected using the exhibitSlot method.

exhibitSlot(X, "b_SpecificParameters@predictors_lup", scroll_box_args_ls = list(width = "100%"))

Variable	Description	Minimum	Maximum	Class	Increment	Function	Scaling	Covariate
K10_int	Kessler Psychological Distress - 10 Item Total Score	10	50	integer	1	youthvars::youthvars_k10_aus	0.01	FALSE
Psych_well_int	Overall Wellbeing Measure (Winefield et al. 2012)	18	90	integer	1	as.integer	0.01	FALSE

Covariates

We also specify the covariates that we aim to explore in conjunction with each candidate predictor.

X <- renewSlot(X, "b_SpecificParameters@candidate_covars_chr", c("d_sex_birth_s", "d_age",  "d_sexual_ori_s", "d_studying_working"))

Descriptive variables

We also specify variables that we will use for generating descriptive statistics about the dataset.

X <- renewSlot(X,"b_SpecificParameters@descv_var_nms_chr", c("d_age","Gender","d_relation_s", "d_sexual_ori_s", "Region", "d_studying_working"))

Temporal variables

The name of the dataset variable for data collection timepoint and all of its unique values were imported when converting the ScorzEuroQol5 module.

procureSlot(X,"a_YouthvarsProfile@timepoint_var_nm_1L_chr")

## [1] "Timepoint"

procureSlot(X,"a_YouthvarsProfile@timepoint_vals_chr")

## [1] "BL"  "FUP"

However, we also need to specify the name of the variable that contains the datestamp for each dataset record.

X <- renewSlot(X, "b_SpecificParameters@msrmnt_date_var_nm_1L_chr", "data_collection_dtm")

Candidate models

X was created with a default set of candidate models, stored as a specific_models sub-module, which can be inspected using the exhibitSlot method.

exhibitSlot(X, "b_SpecificParameters@candidate_mdls_lup", scroll_box_args_ls = list(width = "100%"))

Model types lookup table
Reference	Name	Control	Familty	Function	Start	Predict	Transformation	Binomial	Acronym (Fixed)	Acronymy (Mixed)	Type (Mixed)	With
OLS_NTF	Ordinary Least Squares (no transformation)	NA	NA	lm	NA	NA	NTF	FALSE	OLS	LMM	linear mixed model	no transformation
OLS_LOG	Ordinary Least Squares (log transformation)	NA	NA	lm	NA	NA	LOG	FALSE	OLS	LMM	linear mixed model	log transformation
OLS_LOGIT	Ordinary Least Squares (logit transformation)	NA	NA	lm	NA	NA	LOGIT	FALSE	OLS	LMM	linear mixed model	logit transformation
OLS_LOGLOG	Ordinary Least Squares (log log transformation)	NA	NA	lm	NA	NA	LOGLOG	FALSE	OLS	LMM	linear mixed model	log log transformation
OLS_CLL	Ordinary Least Squares (complementary log log transformation)	NA	NA	lm	NA	NA	CLL	FALSE	OLS	LMM	linear mixed model	complementary log log transformation
GLM_GSN_LOG	Generalised Linear Model with Gaussian distribution and log link	NA	gaussian(log)	glm	-0.1,-0.1	response	NTF	FALSE	GLM	GLMM	generalised linear mixed model	Gaussian distribution and log link
BET_LGT	Beta Regression Model with Binomial distribution and logit link	betareg::betareg.control	NA	betareg::betareg	-0.5,-0.1,3	response	NTF	FALSE	GLM	GLMM	generalised linear mixed model	Binomial distribution and logit link
BET_CLL	Beta Regression Model with Binomial distribution and complementary log log link	betareg::betareg.control	NA	betareg::betareg	-0.5,-0.1,3	response	NTF	FALSE	GLM	GLMM	generalised linear mixed model	Binomial distribution and complementary log log link

We can choose to select just a subset of these to explore using the renewSlot method. As this is an illustrative example, we have restricted the models we will explore to just four types, passing the relevant row numbers to the slice_indcs_int argument.

X <- renewSlot(X, "b_SpecificParameters@candidate_mdls_lup", slice_indcs_int = c(1L,5L,7L,8L))

Other parameters

Depending on the type of analysis we plan on undertaking, we can also specify parameters such as the number of folds to use in cross validation, the maximum number of model runs to allow and a seed to ensure reproducibility of results. In this case we are going to use the default values generated when we first created X.

procureSlot(X, "b_SpecificParameters@folds_1L_int")

## [1] 10

procureSlot(X, "b_SpecificParameters@max_mdl_runs_1L_int")

## [1] 300

procureSlot(X, "b_SpecificParameters@seed_1L_int")

## [1] 1234

Model testing

Before we start to use the data stored in X to undertake modelling, we must first validate that it contains all necessary (and internally consistent) data by using the ratify method. The call to ratify will update any variable names that are likely to cause problems when generating reports (e.g. through inclusion of characters like “_” in the variable name that can cause problems when rendering LaTeX documents).

X <- ratify(X)

Set-up workspace

We add details of the directory to which we will write all output. In this example we create a temporary directory (tempdir()), but in practice this would be an existing directory on your local machine.

X <- renewSlot(X, "paths_chr", tempdir())

It can be useful to save fake data (useful for demonstrating the generalisability and replicability of an analysis) and real data (required for write-up and reproducibility) is distinctly labelled directories. By default, X is created with a flag to save all output in a sub-directory “Real”. As we are using fake data, we can override this value.

X <- renewSlot(X, "b_SpecificParameters@fake_1L_lgl", T)

We can now write a number of sub-directories to our specified output directory.

X <- author(X, what_1L_chr = "workspace", consent_1L_chr = consent_1L_chr)

## New directories created:
## /tmp/RtmpdFPTCP/Fake
## /tmp/RtmpdFPTCP/Fake/Markdown
## /tmp/RtmpdFPTCP/Fake/Output
## /tmp/RtmpdFPTCP/Fake/Reports
## /tmp/RtmpdFPTCP/Fake/Output/_Descriptives
## /tmp/RtmpdFPTCP/Fake/Output/H_Dataverse

Descriptives

The first set of outputs we write to our output directories is a set of descriptive tables and plots.

X <- author(X, consent_1L_chr = consent_1L_chr, digits_1L_int = 3L,  what_1L_chr = "descriptives")

Model comparisons

The investigate method can now be used to compare the candidate models we have specified earlier. In so doing it will transform X into a SpecificPredictors object.

X <- investigate(X, consent_1L_chr = consent_1L_chr, depnt_var_max_val_1L_dbl = 0.99, session_ls = sessionInfo())

class(X)

## [1] "SpecificPredictors"
## attr(,"package")
## [1] "specific"

The investigate method will write each model to be tested to a new sub-directory of our output directory.

The investigate method also outputs a table summarising the performance of each of the candidate models.

exhibit(X, what_1L_chr = "mdl_cmprsn", type_1L_chr = "results")

Comparison of candidate models using highest correlated predictor
	Training model fit (averaged over 10 folds)			Testing model fit (averaged over 10 folds)
Model	R-Squared	RMSE	MAE	R-Squared	RMSE	MAE
Beta Regression Model with Binomial distribution and logit link	0.4318533	0.0742448	0.0587307	0.4128497	0.0741236	0.0587733
Beta Regression Model with Binomial distribution and complementary log log link	0.4174181	0.0751836	0.0593447	0.3996947	0.0750880	0.0594047
Ordinary Least Squares (no transformation)	0.4106104	0.0756222	0.0596955	0.3933147	0.0755461	0.0597672
Ordinary Least Squares (complementary log log transformation)	0.4105040	0.0756284	0.0597793	0.3913360	0.0755268	0.0598295

We can now identify the highest performing model in each category of candidate model based on the testing R² statistic.

procure(X, what_1L_chr = "prefd_mdls")

## [1] "BET_LGT" "OLS_NTF"

We can override these automated selections and instead incorporate other considerations (possibly based on judgments informed by visual inspection of the plots and the desirability of constraining predictions to a maximum value of one). We do this in the following command, specifying new preferred model types, in descending order of preference.

X <- renew(X, new_val_xx = c("BET_LGT", "OLS_CLL"), type_1L_chr = "results", what_1L_chr = "prefd_mdls")

Use most preferred model to compare all candidate predictors

We can now compare all of our candidate predictors (with and without candidate covariates) using the most preferred model type.

X <- investigate(X, consent_1L_chr = consent_1L_chr)

class(X)

## [1] "SpecificFixed"
## attr(,"package")
## [1] "specific"

Now, we compare the performance of single predictor models of our preferred model type (in our case, a Beta Regression Model with Binomial distribution and logit link) for each candidate predictor. The last call to the investigate saved the tested models along with model plots in a sub-directory of our output directory. These results are also viewable as a table.

exhibit(X, scroll_box_args_ls = list(width = "100%"), type_1L_chr = "results", what_1L_chr = "predr_cmprsn")

Comparison of all candidate predictors using preferred model
predr_chr	%IncMSE	IncNodePurity
K10	0.0066197	3.888246
Psychwell	0.0011094	2.342784

The most recent call to the investigate method also saved single predictor R model objects (one for each candidate predictors) along with the two plots for each model in a sub-directory of our output directory. The performance of each single predictor model can also be summarised in a table.

exhibit(X, type_1L_chr = "results", what_1L_chr = "fxd_sngl_cmprsn")

Preferred single predictor model performance by candidate predictor
	Training model fit (averaged over 10 folds)			Testing model fit (averaged over 10 folds)
Model	R-Squared	RMSE	MAE	R-Squared	RMSE	MAE
K10	0.4318533	0.0742448	0.0587307	0.4128497	0.0741236	0.0587733
Psychwell	0.1507472	0.0907813	0.0699606	0.1341090	0.0909203	0.0700686

Updated versions of each of the models in the previous step (this time with covariates added) are saved to a new subdirectory of the output directory and we can summarise the performance of each of the updated models, along with all signficant model terms, in a table.

exhibit(X, scroll_box_args_ls = list(width = "100%"), type_1L_chr = "results", what_1L_chr = "fxd_full_cmprsn")

We can now identify which, if any, of the candidate covariates we previously specified are significant predictors in any of the models.

procure(X, type_1L_chr = "results", what_1L_chr = "signt_covars")

## [1] NA

We can override the covariates to select, potentially because we want to select only covariates that are significant for all or most of the models. However, in the below example we have opted not to do so and continue to use no covariates as selected by the algorithm in the previous step.

# X <- renew(X, new_val_xx = c("COVARIATE OF YOUR CHOICE", "ANOTHER COVARIATE"), type_1L_chr = "results", what_1L_chr = "prefd_covars")

Test preferred model with preferred covariates for each candidate predictor

We now conclude our model testing by rerunning the previous step, except confining our covariates to those we prefer.

X <- investigate(X, consent_1L_chr = consent_1L_chr)

class(X)

## [1] "SpecificMixed"
## attr(,"package")
## [1] "specific"

The previous call to the write_mdls_with_covars_cmprsn function saves the tested models along with two plots for each model in the “E_Predrs_W_Covars_Sngl_Mdl_Cmprsn” sub-directory of “Output”.

Apply preferred model types and predictors to longitudinal data

The next main step is to use the preferred model types and covariates identified from the preceding analysis of cross-sectional data in longitudinal analysis.

Longitudinal mixed modelling

Prior to undertaking longitudinal mixed modelling, we need to check the appropriateness of the default values for modelling parameters that are stored in X. These include the number of model iterations, and any custom control parameters and priors (by default, empty lists).

procureSlot(X, "b_SpecificParameters@iters_1L_int")

## [1] 4000

In many cases there will be no need to specify any custom control parameters or priors and using the defaults may speed up execution.

procureSlot(X, "b_SpecificParameters@control_ls")

## [[1]]
## list()

procureSlot(X,"b_SpecificParameters@prior_ls")

## [[1]]
## list()

However, in this example using the default control parameters would result in warning messages suggesting a change to the adapt_delta control value (default = 0.8). Modifying the adapt_delta control parameter value can address this issue.

X <- renewSlot(X, "b_SpecificParameters@control_ls", new_val_xx = list(adapt_delta = 0.99))

X <- investigate(X, consent_1L_chr = consent_1L_chr)

class(X)

## [1] "SpecificMixed"
## attr(,"package")
## [1] "specific"

The last call to investigate function wrote the models it tests to a sub-directory of the output directory along with plots for each model.

Create shareable outputs

The model objects created by the preceding analysis are not suitable for sharing as they contain duplicates of the source dataset. To create model objects that can be shared (where dataset copies are replaced with fake data) use the authorData method.

X <- authorData(X, consent_1L_chr = consent_1L_chr)

Purge dataset copies

For the purposes of efficient computation, multiple objects containing copies of the source dataset were saved to our output directory during the analysis process. We therefore need to delete all of these copies by supplying “purge_write” to the type_1L_chr argument of the author method.

X <- author(X, consent_1L_chr = consent_1L_chr, type_1L_chr = "purge_write")

A copy of the module X is available for download as the file eq5d_ttu_SpecificMixed.RDS from the “Documentation_0.0” release of the specific package.

Set consent policy