Generated by summarytools 1.0.1 (R version 4.3.2) 2024-04-18
Now lets make a line graph of all the appointments. First lets test if there is any difference between noshows and shows for every day of observartions.
We’ll first do a T test
Code
dta_tl <-dta %>%ungroup() %>%select(no_show,contains("_day")) %>%# add_count(appointment_day, name="n_appts") %>% # add_count(scheduled_day, name="n_scheds") %>% pivot_longer(cols=all_of(c("appointment_day","scheduled_day")),names_to="date_type",values_to="date") %>%# arrange(desc(date))group_by(across(everything())) %>%summarise(n =n()) %>%complete(date =seq.Date(min(date), max(date), by="day")) %>%mutate(n=replace_na(n,0)) %>%ungroup() %>%# filter(date>as.Date("2016-04-01")) %>% filter(date_type=="appointment_day") %>%group_by(no_show) %>%arrange(date) %>%mutate(mean_seven_day =slide_index_dbl( #### Give it an index for a sliding window n,.i=date,.f =~mean(.x, na.rm=TRUE),.before =days(7) )) %>%ungroup()# dta_test<-dta_tl %>% # select(date,no_show,n) %>% # pivot_wider(names_from=no_show,# values_from=n)# noshow_n_days<-dta_tl %>% filter(no_show=="No") %>% select(n)# noshow_y_days<-dta_tl %>% filter(no_show=="Yes") %>% select(n)# t.test(noshow_n_days$n ~ noshow_y_days$n)## T test,... is the noshow days different with total appts in each day??t.test(n~no_show, data=dta_tl) %>% broom::tidy() %>% kableExtra::kable()
estimate
estimate1
estimate2
statistic
p.value
parameter
conf.low
conf.high
method
alternative
1607.049
2151.415
544.3659
5.984218
3e-07
45.16183
1066.219
2147.878
Welch Two Sample t-test
two.sided
The difference is significant at least.
Now make the line graph, any changes over time?
Code
dta_tl %>%# pivot_longer(cols=all_of(c("appointment_day",# "scheduled_day")),# names_to="date_type",# values_to="date") %>% # select(no_show,contains("date"),contains("n_"))# summarize(appt=n(appointment_day),# sched=n(scheduled_day))ggplot(aes(x=date)) +geom_col(aes(y=n),fill="grey80", alpha=0.5) +geom_line(aes(y=mean_seven_day, color=date_type),size=0.5) +scale_x_date(date_labels="%Y %b %d") +facet_wrap(~no_show, ncol =1, scales="free") +theme_minimal() +theme(legend.position ="none") +labs(title="Was the appointment a no show? Shown over duration of dataset")
Looks like not much difference in temporal trends of noshows.
Cleaning and Cross Correlation
That was fun. Now lets clean up the data to perform a logistic regression and then see how to use prediction methods…
1) Clean up data
Code
dta_clean <- dta %>%#remove duplicate appointments for same patient and same dayfilter(!duplicated(across(-appointment_id))) %>%#create a day of the week variablemutate(wday=wday(appointment_day,label=TRUE,abbr=TRUE),is_friday=as.factor(if_else(wday=="Fri",1,0))) %>%#create var number of patients seen at clinic/neighborhood in totaladd_count(neighbourhood,name="neigh_appts") %>%# factorize vars and create a diff between appt and scheduled datemutate(gender=as.factor(gender),gender_male=as.factor(if_else(gender=="M",1,0)),attended=as.factor(case_match(no_show,"No"~1,"Yes"~0)),days_since_call=as.double(difftime(appointment_day, scheduled_day, units="days")) ) %>%mutate(across(scholarship:sms_received, as.factor)) %>%mutate(across(contains("_id"),as.factor)) %>%#create talliesarrange(patient_id,appointment_day) %>%add_count(patient_id, name="appt_count") %>%group_by(patient_id) %>%mutate("one"=1,"appt_tally"=cumsum(one)) %>%select(-one) %>%ungroup()dta_clean %>%# colnames() %>% select(!patient_id &!appointment_id) %>% skimr::skim()
Data summary
Name
Piped data
Number of rows
106305
Number of columns
20
_______________________
Column type frequency:
character
2
Date
2
factor
11
numeric
5
________________________
Group variables
None
Variable type: character
skim_variable
n_missing
complete_rate
min
max
empty
n_unique
whitespace
neighbourhood
0
1
5
27
0
81
0
no_show
0
1
2
3
0
2
0
Variable type: Date
skim_variable
n_missing
complete_rate
min
max
median
n_unique
scheduled_day
0
1
2015-11-10
2016-06-08
2016-05-10
111
appointment_day
0
1
2016-04-29
2016-06-08
2016-05-18
27
Variable type: factor
skim_variable
n_missing
complete_rate
ordered
n_unique
top_counts
gender
0
1
FALSE
2
F: 69588, M: 36717
scholarship
0
1
FALSE
2
0: 95880, 1: 10425
hipertension
0
1
FALSE
2
0: 85156, 1: 21149
diabetes
0
1
FALSE
2
0: 98572, 1: 7733
alcoholism
0
1
FALSE
2
0: 103330, 1: 2975
handcap
0
1
FALSE
5
0: 104186, 1: 1928, 2: 177, 3: 11
sms_received
0
1
FALSE
2
0: 70831, 1: 35474
wday
0
1
TRUE
6
Tue: 24876, Wed: 24818, Mon: 21803, Fri: 18062
is_friday
0
1
FALSE
2
0: 88243, 1: 18062
gender_male
0
1
FALSE
2
0: 69588, 1: 36717
attended
0
1
FALSE
2
1: 84615, 0: 21690
Variable type: numeric
skim_variable
n_missing
complete_rate
mean
sd
p0
p25
p50
p75
p100
hist
age
0
1
37.12
23.17
-1
18
37
56
115
▇▇▇▂▁
neigh_appts
0
1
2615.11
1811.84
1
1332
2208
3169
7413
▆▇▃▁▂
days_since_call
0
1
10.41
15.37
-6
0
4
15
179
▇▁▁▁▁
appt_count
0
1
2.83
2.96
1
1
2
3
35
▇▁▁▁▁
appt_tally
0
1
1.91
1.87
1
1
1
2
35
▇▁▁▁▁
Code
dta_clean
# A tibble: 106,305 × 22
patient_id appointment_id gender scheduled_day appointment_day age
<fct> <fct> <fct> <date> <date> <dbl>
1 39217.84439 5751990 F 2016-05-31 2016-06-03 44
2 43741.75652 5760144 M 2016-06-01 2016-06-01 39
3 93779.52927 5712759 F 2016-05-18 2016-05-18 33
4 141724.16655 5637648 M 2016-04-29 2016-05-02 12
5 537615.28476 5637728 F 2016-04-29 2016-05-06 14
6 5628261 5680449 M 2016-05-10 2016-05-13 13
7 11831856 5718578 M 2016-05-19 2016-05-19 16
8 22638656 5580835 F 2016-04-14 2016-05-03 22
9 22638656 5715081 F 2016-05-18 2016-06-08 23
10 52168938 5704816 F 2016-05-16 2016-05-16 28
# ℹ 106,295 more rows
# ℹ 16 more variables: neighbourhood <chr>, scholarship <fct>,
# hipertension <fct>, diabetes <fct>, alcoholism <fct>, handcap <fct>,
# sms_received <fct>, no_show <chr>, wday <ord>, is_friday <fct>,
# neigh_appts <int>, gender_male <fct>, attended <fct>,
# days_since_call <dbl>, appt_count <int>, appt_tally <dbl>
1) Split into training and validation and test sets
Code
# some of these vars are not necesary now or better processed by recipe# outcome is still attendedset.seed(123)dta_clean2<-dta_clean %>%ungroup() %>%select(!c(gender,scheduled_day,wday,is_friday,no_show,neighbourhood, patient_id))splits <-initial_split(dta_clean2, strata = attended)appts_other <-training(splits)appts_test <-testing(splits)# appts_other %>% # count(attended) %>% # mutate(prop = n/sum(n))# # appts_test %>% # count(attended) %>% # mutate(prop = n/sum(n))# appts_otherval_set <-validation_split(appts_other, strata = attended, prop =c(0.8))
But we can see how it would fit onto the test data now.
Code
last_log_mod <-logistic_reg(penalty = lr_best$penalty[1], mixture=1) %>%set_engine("glmnet", importance="BIC")# the last workflowlast_log_workflow <- lr_workflow %>%update_model(last_log_mod)# last fitlast_lr_fit<- last_log_workflow %>%last_fit(splits)# show metricslast_lr_fit %>%collect_metrics()
# library(readxl)dhis<-readxl::read_xls("dhis.xls") %>% janitor::clean_names()dhis_clean<-dhis %>%mutate(across(contains("date"), as.Date)) %>%# convert to date everything thats a datemutate(across(contains("mm_hg"),as.double)) %>%# BP should be charactergroup_by(gen_address_current,gen_given_name,gen_date_of_birth) %>%#create patient IDs based on address DOB namemutate(patient_id=cur_group_id()) %>%mutate(has_support=!is.na(supporters_name), #add a dummy var if patient has a supportersex=gen_sex,age_at_entry=floor(lubridate::time_length(difftime(registration_date, gen_date_of_birth), "years")) #calc age and sex ) %>%add_count(name="total_patient_visits") %>%# counter for total patients visitsungroup() %>%arrange(patient_id,event_date) %>%#now make a running tally for each patient visitsgroup_by(patient_id) %>%mutate("one"=1, "appt_tally"=cumsum(one)) %>%select(-one) %>%ungroup() %>%mutate(days_since_entry=as.double(difftime(registration_date, event_date, units="days"))) %>%#days since entryselect(!starts_with("gen") &!supporters_name)# dhis %>% # count(has_support)# dhis_clean %>% head()# # dhis %>% # colnames()
Transform hypertension registry data into “long” appointment dataset
Define and import climate / econ variables and add them
Validation criteria for prediction model?
Cox PH - Discrete time or continuous time?
Analysis Plan - Outline
Import DHIS2 line list data
transform variables as needed and do a data inspection
avg total number of visits per client
appts per month enrolled
missing data analysis
days since last appt for each pnt
mean patient load per month per clinic
hist of total vists by age/region/sex
dummy vars for each of the medications prescribed
and whether prescribed within appropriate window since registration
crosstabs of patient level variables including by attendance status groups
registered without follow up
had follow up and never dropped out until endline
had a follow up
sex and 10year age bands
SBP above or below 180
BMI range at enrollment
Add date of next appt to each line
Variable if falls within time of ec or env shock
dummy var for season of enrollment
Patient return model
logistic reg
include all vars, then LASSO for vars that are relevant
bootstrap resampling
IFF enough patients return for 2+ FU visits (~40%) then also do a survival analysis for LTFU
Cox PH model
include all vars, then LASSO for vars that are relevant
bootstrap resampling
Source Code
---title: "Thesis Intro"subtitle: "Workbook document"format: html: code-fold: true code-tools: true embed-resources: trueeditor: visualwarning: falseauthor: name: Brian ODonnell affiliation: Charité - Universitätsmedizin Berlin---Goals- Quarto practice- Noshow data: cleaning, crosscorr, and logit reg practice with the [tidymodels](https://www.tidymodels.org/start/case-study) workflow using the Brazil appointment (appt) [noshow dataset](https://www.kaggle.com/datasets/joniarroba/noshowappointments)- Import of dummy hypertension data from DHIS2 package to do a **Table1** for never returners```{r}#| label: load-packages#| include: false#| echo: falselibrary(tidyverse)library(tidymodels)library(table1)library(slider)library(ggcorrplot)library(vip)library(summarytools)library(glmnet)library(readxl)library(kableExtra)```## Quarto table practicetesting out quarto function to make a table```{r}#| label: calculationsdata(mtcars)table1::table1(~ mpg + cyl |factor(vs), data=mtcars, caption="CARS: mpg and cyl by vs") ```## Noshow data practiceNow lets start importing the noshow data as practice.```{r}#| label: load-data#| include: true#| echo: falsedta<-read_csv("kaggle.csv") %>% janitor::clean_names() %>%mutate(across(contains("day"),as.Date))```Explore the data. Start by counting the number of appointments and noshows per patient.```{r}tots <- dta %>%group_by(patient_id,no_show) %>%tally()tots %>%head()# tots %>% # ggplot()+# geom_histogram(aes(x=n))# tots %>% arrange(-n)tots %>%mutate(n_group =case_when( n==1~"1", n>1& n<6~"1-5",TRUE~"6+" ) ) %>%ungroup() %>%group_by(no_show, n_group) %>%summarise(percent =n() /nrow(tots)) %>%ungroup() %>%# janitor::tabyl(n_group,no_show) %>% as_tibble() %>%ggplot()+geom_col(aes(x=n_group,y=percent)) +facet_grid(~no_show, scales="fixed")```On the `noshow` var: here "no"=appointment was made, and "yes"=appointment was noshow.Patient load per month per clinic (here we pretend `neighborhood` is the clinic)```{r}# library(summarytools)dta_pts_per_month<-dta %>%mutate(appt_month=lubridate::month(appointment_day)) %>%group_by(appt_month, neighbourhood) %>%summarize(patients=n_distinct(patient_id),appts=n()) %>%ungroup() %>%filter(appt_month>4) %>%mutate(pa_ratio=appts/patients)dta_pts_per_month %>% summarytools::descr()# %>% summarize(patient_avg=mean(patients),# appt_avg=mean(appts))dta_pts_per_month %>%ggplot()+geom_histogram(aes(x=appts),bins=20)```Check out distributions of all the vars including patients appts ratio at each clinic per month.```{r}#| results: "asis"dta_pts_per_month %>% summarytools::dfSummary(plain.ascii =FALSE) %>%print(method="render", varnumbers =FALSE, valid.col =FALSE, graph.magnif =0.76)```Now lets make a line graph of all the appointments. First lets test if there is any difference between noshows and shows for every day of observartions.We'll first do a T test```{r}dta_tl <-dta %>%ungroup() %>%select(no_show,contains("_day")) %>%# add_count(appointment_day, name="n_appts") %>% # add_count(scheduled_day, name="n_scheds") %>% pivot_longer(cols=all_of(c("appointment_day","scheduled_day")),names_to="date_type",values_to="date") %>%# arrange(desc(date))group_by(across(everything())) %>%summarise(n =n()) %>%complete(date =seq.Date(min(date), max(date), by="day")) %>%mutate(n=replace_na(n,0)) %>%ungroup() %>%# filter(date>as.Date("2016-04-01")) %>% filter(date_type=="appointment_day") %>%group_by(no_show) %>%arrange(date) %>%mutate(mean_seven_day =slide_index_dbl( #### Give it an index for a sliding window n,.i=date,.f =~mean(.x, na.rm=TRUE),.before =days(7) )) %>%ungroup()# dta_test<-dta_tl %>% # select(date,no_show,n) %>% # pivot_wider(names_from=no_show,# values_from=n)# noshow_n_days<-dta_tl %>% filter(no_show=="No") %>% select(n)# noshow_y_days<-dta_tl %>% filter(no_show=="Yes") %>% select(n)# t.test(noshow_n_days$n ~ noshow_y_days$n)## T test,... is the noshow days different with total appts in each day??t.test(n~no_show, data=dta_tl) %>% broom::tidy() %>% kableExtra::kable()```The difference is significant at least.Now make the line graph, any changes over time?```{r}dta_tl %>%# pivot_longer(cols=all_of(c("appointment_day",# "scheduled_day")),# names_to="date_type",# values_to="date") %>% # select(no_show,contains("date"),contains("n_"))# summarize(appt=n(appointment_day),# sched=n(scheduled_day))ggplot(aes(x=date)) +geom_col(aes(y=n),fill="grey80", alpha=0.5) +geom_line(aes(y=mean_seven_day, color=date_type),size=0.5) +scale_x_date(date_labels="%Y %b %d") +facet_wrap(~no_show, ncol =1, scales="free") +theme_minimal() +theme(legend.position ="none") +labs(title="Was the appointment a no show? Shown over duration of dataset")dta_tl %>%str()# install.packages("slider")# library(slider)# install.packages("tidyquant")# summary(dta$appointment_day)```Looks like not much difference in temporal trends of noshows.## Cleaning and Cross CorrelationThat was fun. Now lets clean up the data to perform a logistic regression and then see how to use prediction methods...1\) Clean up data```{r}dta_clean <- dta %>%#remove duplicate appointments for same patient and same dayfilter(!duplicated(across(-appointment_id))) %>%#create a day of the week variablemutate(wday=wday(appointment_day,label=TRUE,abbr=TRUE),is_friday=as.factor(if_else(wday=="Fri",1,0))) %>%#create var number of patients seen at clinic/neighborhood in totaladd_count(neighbourhood,name="neigh_appts") %>%# factorize vars and create a diff between appt and scheduled datemutate(gender=as.factor(gender),gender_male=as.factor(if_else(gender=="M",1,0)),attended=as.factor(case_match(no_show,"No"~1,"Yes"~0)),days_since_call=as.double(difftime(appointment_day, scheduled_day, units="days")) ) %>%mutate(across(scholarship:sms_received, as.factor)) %>%mutate(across(contains("_id"),as.factor)) %>%#create talliesarrange(patient_id,appointment_day) %>%add_count(patient_id, name="appt_count") %>%group_by(patient_id) %>%mutate("one"=1,"appt_tally"=cumsum(one)) %>%select(-one) %>%ungroup()dta_clean %>%# colnames() %>% select(!patient_id &!appointment_id) %>% skimr::skim()dta_clean```2\) Correlation matrix with cleaned data```{r}# install.packages("ggcorrplot")dta_clean_noid<-dta_clean %>%select(!patient_id &!appointment_id &!handcap &!gender_male &!wday) %>%select(where(is.factor))dta_clean_noidmodel.matrix(~0+., data=dta_clean_noid) %>%cor(use="pairwise.complete.obs") %>% ggcorrplot::ggcorrplot(show.diag = F,type="lower",lab=TRUE, lab_size=2, digits=3)```## Predictions with Logistic RegressionFollowing workflow from: <https://www.tidymodels.org/start/case-study/>1\) Split into training and validation and test sets```{r}# some of these vars are not necesary now or better processed by recipe# outcome is still attendedset.seed(123)dta_clean2<-dta_clean %>%ungroup() %>%select(!c(gender,scheduled_day,wday,is_friday,no_show,neighbourhood, patient_id))splits <-initial_split(dta_clean2, strata = attended)appts_other <-training(splits)appts_test <-testing(splits)# appts_other %>% # count(attended) %>% # mutate(prop = n/sum(n))# # appts_test %>% # count(attended) %>% # mutate(prop = n/sum(n))# appts_otherval_set <-validation_split(appts_other, strata = attended, prop =c(0.8))```2\) Build the model```{r}lr_mod <-logistic_reg(penalty =tune(), mixture=1) %>%set_engine("glmnet")```3\) Build the "recipe" to preprocess the data```{r}lr_recipe <-recipe(attended ~ ., data = appts_other) %>%update_role(appointment_id, new_role ="ID") %>%step_date(appointment_day, features =c("dow", "month")) %>%step_rm(appointment_day) %>%# step_corr(all_numeric_predictors(), threshold = .5)step_dummy(all_nominal_predictors()) %>%step_zv(all_predictors()) %>%step_normalize(all_predictors())```4\) Now we add a workflow, combining the model and recipe```{r}lr_workflow<-workflow() %>%add_model(lr_mod) %>%add_recipe(lr_recipe)```5\) Create penalty values for tuning```{r}lr_reg_grid <-tibble(penalty =10^seq(-4, -1, length.out =40))```6\) Now we apply these each of these penalty values to the workflow (ie with recipe and model)```{r}set.seed(321)# install.packages("glmnet")lr_res <- lr_workflow %>%tune_grid(val_set,grid = lr_reg_grid,control =control_grid(save_pred =TRUE),metrics =metric_set(roc_auc))lr_plot <- lr_res %>%collect_metrics() %>%ggplot(aes(x = penalty, y = mean)) +geom_point() +geom_line() +ylab("Area under the ROC Curve")lr_plot ```Lets try to find the best model... looks like higher penalty actually performs better!!```{r}lr_res %>%collect_metrics() %>%arrange(penalty) %>%rowid_to_column()###Seems to be row 38... high mean roc_auc while also being the highest penaltylr_best<- lr_res %>%collect_metrics() %>%arrange(penalty) %>%slice(38)```OK, now lets see how this ROC curve looks for the best fit log model....```{r}lr_auc <- lr_res %>%collect_predictions(parameters = lr_best) %>%roc_curve(attended, .pred_0) %>%mutate(model ="Logistic Regression") autoplot(lr_auc)lr_res %>%collect_predictions(parameters = lr_best) ```its... not great!But we can see how it would fit onto the test data now.```{r}last_log_mod <-logistic_reg(penalty = lr_best$penalty[1], mixture=1) %>%set_engine("glmnet", importance="BIC")# the last workflowlast_log_workflow <- lr_workflow %>%update_model(last_log_mod)# last fitlast_lr_fit<- last_log_workflow %>%last_fit(splits)# show metricslast_lr_fit %>%collect_metrics()```About same ROC as with validation set...7\) Finally, we look at the top variables/features included```{r}last_lr_fit %>%extract_fit_parsnip() %>%vip(num_features =20, mapping=aes(fill=Sign)) +scale_y_continuous(expand =c(0,0))```Looks like the days since the last call was most important by far!!!Next after was the SMS received... but from correlation matrix, it seemed to have an inverse effect. Should explore more!```{r}# # cor.test(dta_clean$sms_received, dta_clean$no_show)# # chisq.test(dta_clean$sms_received, dta_clean$no_show)```## DHIS2 Hypertension Registry Test Run```{r}# library(readxl)dhis<-readxl::read_xls("dhis.xls") %>% janitor::clean_names()dhis_clean<-dhis %>%mutate(across(contains("date"), as.Date)) %>%# convert to date everything thats a datemutate(across(contains("mm_hg"),as.double)) %>%# BP should be charactergroup_by(gen_address_current,gen_given_name,gen_date_of_birth) %>%#create patient IDs based on address DOB namemutate(patient_id=cur_group_id()) %>%mutate(has_support=!is.na(supporters_name), #add a dummy var if patient has a supportersex=gen_sex,age_at_entry=floor(lubridate::time_length(difftime(registration_date, gen_date_of_birth), "years")) #calc age and sex ) %>%add_count(name="total_patient_visits") %>%# counter for total patients visitsungroup() %>%arrange(patient_id,event_date) %>%#now make a running tally for each patient visitsgroup_by(patient_id) %>%mutate("one"=1, "appt_tally"=cumsum(one)) %>%select(-one) %>%ungroup() %>%mutate(days_since_entry=as.double(difftime(registration_date, event_date, units="days"))) %>%#days since entryselect(!starts_with("gen") &!supporters_name)# dhis %>% # count(has_support)# dhis_clean %>% head()# # dhis %>% # colnames()```Still need to clean up the presecriptions etc...Process it for the **Table1**```{r}dhis_table <- dhis_clean %>%mutate(patient_returned=if_else(total_patient_visits>1, "1+ Visits After Entry","LTFU After Entry" )) %>%mutate(sbp_level =case_when( ncd_systole_mm_hg >=140~"very_high", ncd_systole_mm_hg >=130& ncd_systole_mm_hg <140~"high", ncd_systole_mm_hg >=120& ncd_systole_mm_hg <130~"elevated",is.na(ncd_systole_mm_hg) ~NA,TRUE~"normal" ) ) %>%mutate(sbp_level=fct_relevel(sbp_level, c("normal","elevated","high","very_high"))) %>%mutate(age_group =as.factor(cut(age_at_entry, breaks =c(40, 55, 60, 75, 100), include.lowest = T, labels =c("40-55","55-60", "60-75", "75-100")))) %>%select(patient_returned, sex, age_group, appt_tally,"sbp_baseline"=sbp_level, "amlodipine_baseline"=med_amlodipine) %>%filter(appt_tally==1)# hist(dhis_clean$age_at_entry)table1::table1(~ sex + age_group + sbp_baseline + amlodipine_baseline |factor(patient_returned), data=dhis_table) # dhis_clean %>% # filter(is.na(ncd_systole_mm_hg))# dhis_clean %>% # count(total_patient_visits)# dhis_clean %>% # count()``````{r}```From here:- Transform hypertension registry data into "long" appointment dataset- Define and import climate / econ variables and add them- Validation criteria for prediction model?- Cox PH - Discrete time or continuous time?## **Analysis Plan - Outline**- Import DHIS2 line list data- transform variables as needed and do a data inspection - avg total number of visits per client - appts per month enrolled - missing data analysis - days since last appt for each pnt - mean patient load per month per clinic - hist of total vists by age/region/sex - dummy vars for each of the medications prescribed - and whether prescribed within appropriate window since registration- crosstabs of patient level variables including by attendance status groups - registered without follow up - had follow up and never dropped out until endline - had a follow up - sex and 10year age bands - SBP above or below 180 - BMI range at enrollment- Add date of next appt to each line - Variable if falls within time of ec or env shock - dummy var for season of enrollment- Patient return model - logistic reg - include all vars, then LASSO for vars that are relevant - bootstrap resampling- IFF enough patients return for 2+ FU visits (\~40%) then also do a survival analysis for LTFU - Cox PH model - include all vars, then LASSO for vars that are relevant - bootstrap resampling-