DHIS2 events (patient visits) with dates between Jan 1 2021 and June 30 2024 were exported.
Patient IDs and other PII were removed. Age was calculated as years between DOB and registration date. Only patients with confirmed hypertension and the consent-to-record-data box checked are included. Events that are likely test data, or that were created outside Kano and Ogun, are removed.
Code
dhis <- readr::read_csv("output_file.csv") %>%
  janitor::clean_names()

dhis_clean <- dhis %>%
  mutate(across(contains("_date"), dmy)) %>%        # convert everything that's a date to Date
  mutate(across(ends_with("updated_on"), dmy)) %>%  # same for *updated_on columns
  mutate(across(contains("mmhg"), as.double)) %>%   # BP shouldn't be character
  select(!c(program_stage, geometry, longitude, latitude)) %>% # drop unused columns
  arrange(reg_id, event_date) %>%
  # now make a running tally of each patient's visits
  group_by(reg_id) %>%
  mutate(one = 1, appt_tally = cumsum(one)) %>%
  select(-one) %>%
  ungroup() %>%
  rename_all(~ stringr::str_replace(., "^htn_", "")) %>% # clean names
  # now extract the OU hierarchy
  separate(organisation_unit_name_hierarchy,
           into = c("Nat", "State", "LGA", "Ward", "HC", "FACILITY"),
           sep = "/") %>%                                # clinic hierarchy
  mutate(age = as.integer(newage)) %>%                   # convert age to number
  filter(!is.na(age) & age > 17 & age < 130) %>%         # and remove non-adults
  filter(does_patient_have_hypertension == "Yes") %>%    # remove those without hypertension confirmed
  filter(consent_to_record_data == 1) %>%                # and those without consent to record data
  filter(event_status %in% c("COMPLETED", "ACTIVE")) %>% # remove scheduled events
  filter(!str_detect(State, "ad ")) %>%                  # and any outside K/O
  filter(!str_detect(State, "ab ")) %>%                  # and any outside K/O
  filter(!str_detect(created_by, "Test"))                # remove test data
Data Quality Check
Skim of data
Looking at all columns in a skim.
Code
dhis_clean %>% skimr::skim()
Data summary

|                        |            |
|------------------------|------------|
| Name                   | Piped data |
| Number of rows         | 165208     |
| Number of columns      | 61         |
| Column type frequency: |            |
| character              | 27         |
| Date                   | 5          |
| logical                | 15         |
| numeric                | 14         |
| Group variables        | None       |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| stored_by | 49438 | 0.70 | 7 | 17 | 0 | 107 | 0 |
| created_by | 0 | 1.00 | 18 | 46 | 0 | 106 | 0 |
| last_updated_by | 0 | 1.00 | 18 | 46 | 0 | 107 | 0 |
| organisation_unit_name | 0 | 1.00 | 19 | 46 | 0 | 104 | 0 |
| Nat | 0 | 1.00 | 11 | 11 | 0 | 1 | 0 |
| State | 0 | 1.00 | 15 | 15 | 0 | 2 | 0 |
| LGA | 0 | 1.00 | 11 | 43 | 0 | 42 | 0 |
| Ward | 0 | 1.00 | 9 | 27 | 0 | 100 | 0 |
| HC | 0 | 1.00 | 11 | 47 | 0 | 105 | 0 |
| FACILITY | 156493 | 0.05 | 23 | 37 | 0 | 10 | 0 |
| program_status | 0 | 1.00 | 6 | 9 | 0 | 3 | 0 |
| event_status | 0 | 1.00 | 6 | 9 | 0 | 2 | 0 |
| organisation_unit | 0 | 1.00 | 11 | 11 | 0 | 105 | 0 |
| med_losartan | 106932 | 0.35 | 13 | 14 | 0 | 2 | 0 |
| hypertension_treated_in_the_past | 1165 | 0.99 | 10 | 23 | 0 | 2 | 0 |
| med_hydrochlorothiazide | 163799 | 0.01 | 24 | 26 | 0 | 2 | 0 |
| other_hypertension_medication | 162573 | 0.02 | 1 | 64 | 0 | 126 | 0 |
| patient_followup_status | 12318 | 0.93 | 6 | 20 | 0 | 3 | 0 |
| ncd_patient_status | 0 | 1.00 | 4 | 15 | 0 | 5 | 0 |
| days_of_medication | 137932 | 0.17 | 7 | 7 | 0 | 3 | 0 |
| does_patient_have_diabetes | 10994 | 0.93 | 2 | 3 | 0 | 2 | 0 |
| sex | 43 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| relationship_to_patient | 34371 | 0.79 | 3 | 8 | 0 | 9 | 0 |
| newage | 0 | 1.00 | 2 | 3 | 0 | 96 | 0 |
| med_amlodipine | 4308 | 0.97 | 10 | 18 | 0 | 3 | 0 |
| med_telmisartan | 164758 | 0.00 | 17 | 17 | 0 | 2 | 0 |
| does_patient_have_hypertension | 0 | 1.00 | 3 | 3 | 0 | 1 | 0 |

Variable type: Date

| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| event_date | 0 | 1 | 2021-01-01 | 2024-07-01 | 2023-01-09 | 1278 |
| last_updated_on | 0 | 1 | 2021-01-01 | 2024-07-21 | 2023-06-09 | 1276 |
| scheduled_date | 0 | 1 | 1974-06-12 | 2201-08-29 | 2023-03-13 | 1372 |
| registration_date | 0 | 1 | 1972-01-07 | 2024-07-17 | 2022-04-25 | 1324 |
| incident_date | 0 | 1 | 2003-12-08 | 2024-07-17 | 2022-06-02 | 1304 |

Variable type: logical

| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| organisation_unit_code | 165208 | 0 | NaN | : |
| blood_sugar_reading_rbs_mgdl | 165208 | 0 | NaN | : |
| blood_sugar_unit | 165208 | 0 | NaN | : |
| history_of_heart_attack | 165208 | 0 | NaN | : |
| history_of_stroke | 165208 | 0 | NaN | : |
| blood_sugar_reading_fbs_mmoll | 165208 | 0 | NaN | : |
| blood_sugar_reading_ppbs_mmoll | 165208 | 0 | NaN | : |
| blood_sugar_reading_ppbs_mgdl | 165208 | 0 | NaN | : |
| med_glibenclamide | 165208 | 0 | NaN | : |
| other_diabetes_medication | 165207 | 0 | 0 | FAL: 1 |
| blood_sugar_reading_rbs_mmoll | 165208 | 0 | NaN | : |
| type_of_diabetes_measure | 165208 | 0 | NaN | : |
| blood_sugar_reading_hba1c | 165208 | 0 | NaN | : |
| blood_sugar_reading_fbs_mgdl | 165208 | 0 | NaN | : |
| med_metformin | 165208 | 0 | NaN | : |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| height_cm | 157928 | 0.04 | 86.58 | 79.29 | 0.45 | 1.6 | 148.0 | 160.0 | 758 | ▇▆▁▁▁ |
| bp_diastole_mmhg | 1300 | 0.99 | 85.28 | 239.52 | 1.00 | 76.0 | 82.0 | 90.0 | 90102 | ▇▁▁▁▁ |
| weight_kg | 157250 | 0.05 | 73.61 | 71.50 | 28.00 | 61.0 | 70.0 | 82.0 | 6298 | ▇▁▁▁▁ |
| bmi | 157923 | 0.04 | 136982.95 | 280270.71 | 0.00 | 27.0 | 38.3 | 262222.2 | 4296296 | ▇▁▁▁▁ |
| bmi_measurement | 156527 | 0.05 | 1.00 | 0.00 | 1.00 | 1.0 | 1.0 | 1.0 | 1 | ▁▁▇▁▁ |
| consent_to_send_sms_reminders | 138779 | 0.16 | 1.00 | 0.00 | 1.00 | 1.0 | 1.0 | 1.0 | 1 | ▁▁▇▁▁ |
| blood_sugar_measurement | 164697 | 0.00 | 1.00 | 0.00 | 1.00 | 1.0 | 1.0 | 1.0 | 1 | ▁▁▇▁▁ |
| bp_systole_mmhg | 700 | 1.00 | 151.44 | 1149.85 | 1.00 | 128.0 | 140.0 | 152.0 | 207120 | ▇▁▁▁▁ |
| consent_to_record_data | 0 | 1.00 | 1.00 | 0.00 | 1.00 | 1.0 | 1.0 | 1.0 | 1 | ▁▁▇▁▁ |
| reg_id | 0 | 1.00 | 16122.46 | 9360.88 | 0.00 | 7913.0 | 16113.0 | 24221.0 | 32332 | ▇▇▇▇▇ |
| has_supporter | 0 | 1.00 | 0.80 | 0.40 | 0.00 | 1.0 | 1.0 | 1.0 | 1 | ▂▁▁▁▇ |
| has_phone | 0 | 1.00 | 0.81 | 0.39 | 0.00 | 1.0 | 1.0 | 1.0 | 1 | ▂▁▁▁▇ |
| appt_tally | 0 | 1.00 | 6.26 | 5.80 | 1.00 | 2.0 | 4.0 | 9.0 | 45 | ▇▂▁▁▁ |
| age | 0 | 1.00 | 54.11 | 14.13 | 18.00 | 45.0 | 54.0 | 64.0 | 124 | ▂▇▃▁▁ |
Clean Registration Dates
Looking at registration dates, it seems some registrations occur after the first visit ("back data entry").
4% of patients have registration dates later than the date of their first visit.
But registration should always fall on or before the first visit date: registration marks entry into the program.
This snippet overwrites late registration dates with the date of the first visit. The operation potentially biases the Age variable, which is based on the days between date of birth and registration.
Question: how exactly are registration dates used?
Code
# create a new registration date with rules
dhis_clean_newReg <- dhis_clean %>%
  mutate(reg_date_new = case_when(
    appt_tally == 1 & registration_date <= event_date ~ registration_date,
    appt_tally == 1 & registration_date > event_date  ~ event_date,
    appt_tally > 1 ~ NA,
    TRUE ~ NA)) %>%
  mutate(reg_date_new = if_else(
    as.integer(as.Date(ymd("2020-01-01")) - registration_date) > 0,
    event_date, reg_date_new)) %>%
  group_by(reg_id) %>%
  fill(reg_date_new, .direction = "down") %>%
  ungroup()

dhis_clean_newReg %>%
  select(reg_id, registration_date, reg_date_new, appt_tally, event_date) %>%
  mutate(reg_date_post_ev1 = if_else(reg_date_new != registration_date & appt_tally == 1, 1, 0)) %>%
  filter(appt_tally == 1) %>%
  count(reg_date_post_ev1) %>%
  mutate(percent = n / sum(n))
# A tibble: 2 × 3
reg_date_post_ev1 n percent
<dbl> <int> <dbl>
1 0 29554 0.962
2 1 1157 0.0377
Code
# Now rewrite the clean dhis dataset.
# Remove unneeded columns and create days since entry
dhis_clean <- dhis_clean_newReg %>%
  mutate(registration_date = reg_date_new) %>%
  mutate(days_since_entry = as.double(event_date - registration_date)) %>% # days since entry
  select(-reg_date_new) %>%
  janitor::remove_empty(which = "cols") # remove all unnecessary columns
Also, 30% of all "Age at Registration" values end in 0 (as do 36% of SBPs).
HCWs were using the DOB attribute as an ESTIMATE of age, instead of as a true DOB. A DHIS2 feature enables this estimation.
Between the incorrect registration dates and the use of Age/DOB as an estimate, the Age variable is imprecise and carries unknown measurement error.
Code
dhis_clean %>%
  select(newage, reg_id) %>%
  distinct() %>%
  mutate(age = as.integer(newage)) %>%
  select(age) %>%
  filter(age > 0) %>%
  ## !!!! 30% of all Ages end in 0 !!!!
  ## Measurement error by entering age instead of DOB
  # mutate(age_zero = if_else(age %% 10 == 0, 1, 0)) %>%
  # count(age_zero) %>%
  # mutate(per = n / sum(n))
  ggplot() +
  geom_histogram(aes(x = age)) +
  labs(title = "Histogram of Age at Registration for All Visits")
Code
dhis_clean %>%
  mutate(age_zero = if_else(bp_systole_mmhg %% 10 == 0, 1, 0)) %>%
  filter(99 < bp_systole_mmhg & bp_systole_mmhg < 200) %>%
  # count(age_zero) %>%
  # mutate(per = n / sum(n))
  ggplot() +
  geom_histogram(aes(x = bp_systole_mmhg)) +
  labs(title = "Histogram of SBP All Visits",
       subtitle = "SBPs less than 200. Spikes are values ending in zero")
Duplicate events?
It looks like a number of visits occur on the same day for many patients.
Code
## Found duplicate events!!!
tots_d <- dhis_clean %>%
  mutate(reg_visit_id = paste0(reg_id, "_", appt_tally)) %>%
  group_by(reg_id) %>%
  mutate(prev_event_date = lag(event_date, n = 1, order_by = event_date)) %>%
  mutate(lead_event_date = lead(event_date, n = 1, order_by = event_date)) %>%
  ungroup()

### highlight them as dupes
dupes <- tots_d %>%
  # select(reg_id, event_date, prev_event_date, organisation_unit_name) %>%
  # ungroup() %>%
  mutate(dupe = if_else(event_date == prev_event_date | event_date == lead_event_date, 1, 0)) %>%
  group_by(reg_id) %>%
  # fill(dupe, .direction = "down") %>%
  filter(dupe == 1) %>%
  arrange(reg_id, event_date)

## TRUE dupes also have identical dates except "last updated" date
dupes_true <- dupes %>%
  ungroup() %>%
  select(!c(lead_event_date, prev_event_date, last_updated_by)) %>%
  ## this only keeps rows that are identical except for the appointment tally and visit id
  janitor::get_dupes(!c(appt_tally, reg_visit_id))

print(paste0("total patients ", length(unique(tots_d$reg_id))))
[1] "total patients 30714"
Code
print(paste0("patient records with dupe events ", length(unique(dupes$reg_id))))
print(paste0("patient records with true dupe events on same day ", length(unique(dupes_true$reg_id))))
[1] "patient records with true dupe events on same day 151"
Code
# length(unique(dhis_subset$reg_id))
# Only the first of all "true dupe" events should still be kept!
dupes_true_extras <- dupes_true %>%
  group_by(reg_id) %>%
  mutate(one = 1, dupe_visit_id = cumsum(one)) %>%
  select(-one) %>%
  select(reg_id, reg_visit_id, dupe_visit_id) %>%
  filter(dupe_visit_id > 1)

## The new dataset with totals is called tots and has removed the "true dupe" visits
tots <- tots_d %>%
  anti_join(dupes_true_extras, by = "reg_visit_id")
# (1122-188)/length(unique(dhis_clean$reg_id))
1122 of 30714 patient records reported at least one pair of visits on the same day (2856 events total). Of these, 151 patients had "true duplicates", where the data reported within each visit were identical.
NEXT STEP for dupes: pivot longer, then find the variables that are NOT identical within dupes.
This table shows the FIELDS that do not match within a pair of duplicated events for the same registration.
Code
d_long <- dupes %>%
  group_by(reg_id) %>%
  arrange(event_date, last_updated_on) %>%
  mutate(one = 1, dupe_id = cumsum(one)) %>%
  select(-one) %>%
  ungroup()

dupe_clash <- d_long %>%
  filter(dupe_id < 3) %>%  # just those regs with exactly 2 dupe events
  mutate(across(everything(), as.character)) %>%
  select(!ends_with("_date"), -appt_tally, -reg_visit_id) %>% # no dates or appt counts
  pivot_longer(!c(reg_id, dupe_id), names_to = "var", values_to = "value") %>%
  pivot_wider(names_from = dupe_id, names_prefix = "dupe_", values_from = c(value)) %>%
  # filter(!is.na(dupe_1)) %>%
  mutate(match = case_when(
    is.na(dupe_1) & is.na(dupe_2) ~ TRUE,
    dupe_1 == dupe_2 ~ TRUE,
    TRUE ~ FALSE)) %>%
  filter(match == FALSE) %>%
  arrange(reg_id)

dupe_clash %>% count(var, sort = T)
Last Updated date could be helpful for deduping, as could stored_by, created_by, and event status (COMPLETED locks the values, so if there is an error in a COMPLETED event, another event could be created in its place).
But when those do not make the choice obvious, we use the non-null BP.
Two visits on the same day that both have non-null BP values will be difficult to merge automatically.
THE RULES FOR MERGING TWO PATIENT VISITS ON THE SAME DATE
If one of the events does not have a BP reading, all values in the event with a BP reading take priority (if a duped event has all BPs NA, it is removed from the dataset).
ACTIVE status events are generally opened after COMPLETED events to make a change. If it is the only ACTIVE event on that date, use this event. Priority = 3 (max)
If clashes remain, count the number of medication nulls. If exactly one event has the fewest med nulls, use that event. Priority = 2
If still unresolved, use the first event by date last updated (or, if last updated on the same day, at random). Priority = 1
Code
# this chunk will apply the priority rules to flag priority events
d_long_priorities <- d_long %>%
  filter(!is.na(bp_diastole_mmhg) & !is.na(bp_systole_mmhg)) %>%
  group_by(reg_id, event_date, event_status) %>%
  mutate(dupes_evs_with_this_status = n()) %>%
  ungroup() %>%
  mutate(dupe_meds_null_count = rowSums(is.na(across(contains("med"))))) %>%
  group_by(reg_id, event_date) %>%
  mutate(dupes_evs_mostMeds = min(dupe_meds_null_count)) %>%
  mutate(dupes_evs_withMedCount = n()) %>%
  mutate(dupe_id_priority = case_when(
    dupes_evs_with_this_status == 1 & event_status == "ACTIVE" ~ 3,
    dupes_evs_mostMeds == dupe_meds_null_count & dupes_evs_withMedCount == 1 ~ 2,
    dupe_id == min(dupe_id) ~ 1,
    TRUE ~ 0)) %>%
  ungroup()

# WE WANT TO KEEP ALL THESE AND REMOVE OTHER DUPES
d_long_keepers <- d_long_priorities %>%
  group_by(reg_id, event_date) %>%
  filter(dupe_id_priority == max(dupe_id_priority)) %>%
  ungroup()

d_long_remove <- d_long %>%
  anti_join(d_long_keepers, by = "reg_visit_id")
# d_long_remove %>% colnames()

d_long_keepers %>%
  janitor::tabyl(dupe_id_priority) %>%
  kable() %>%
  kableExtra::kable_styling()
| dupe_id_priority | n | percent |
|---|---|---|
| 1 | 1195 | 0.8871566 |
| 2 | 104 | 0.0772086 |
| 3 | 48 | 0.0356347 |
We remove 1509 events from the main dataset and then recreate the appointment tally and visit IDs.
Code
tots <- tots %>%
  anti_join(d_long_remove, by = "reg_visit_id") %>%
  # REDO THE APPT TALLY AND REG_VISIT_ID now that the dupes have been removed
  arrange(reg_id, event_date) %>%
  # make a running tally of each patient's visits
  group_by(reg_id) %>%
  mutate(one = 1, appt_tally = cumsum(one)) %>%
  select(-one) %>%
  ungroup() %>%
  mutate(reg_visit_id = paste0(reg_id, "_", appt_tally)) %>%
  group_by(reg_id) %>%
  add_count(name = "total_patient_visits") %>% # counter for total patient visits
  # and need to add new lead/lag dates
  mutate(prev_event_date = lag(event_date, n = 1, order_by = event_date)) %>%
  mutate(lead_event_date = lead(event_date, n = 1, order_by = event_date)) %>%
  ungroup()
Description
Table 1 Zero Followup Visits After Baseline
Process Table 1 using all registrations up to April 1 2024.
Thus all patients have at least 90 days for a follow-up visit after the baseline visit.
">=1 Visits Post Baseline" here means at least 1 later visit occurs; the second visit can fall within or after 90 days from baseline.
Code
dhis_subset <- tots %>%
  filter(registration_date >= dmy("01-01-2021")) %>%
  filter(registration_date <= dmy("01-04-2024"))

dhis_table <- dhis_subset %>%
  mutate(patient_returned = if_else(total_patient_visits > 1,
                                    ">=1 Visits Post Baseline",
                                    "0 Visits Post Baseline")) %>%
  mutate(bp_systole_mmhg = if_else(bp_systole_mmhg > 400, 400, bp_systole_mmhg)) %>%
  mutate(sbp_level = case_when(
    bp_systole_mmhg >= 400 | bp_systole_mmhg < 50  ~ "error (>300 or < 50)",
    bp_systole_mmhg >= 180 & bp_systole_mmhg < 400 ~ "emergency [180-300)",
    bp_systole_mmhg >= 140 & bp_systole_mmhg < 180 ~ "very high [140-180)",
    bp_systole_mmhg >= 120 & bp_systole_mmhg < 140 ~ "high or elevated (120-140)",
    is.na(bp_systole_mmhg) ~ NA,
    TRUE ~ "normal")) %>%
  mutate(sbp_level = fct_relevel(sbp_level,
                                 c("normal", "high or elevated (120-140)",
                                   "very high [140-180)", "emergency [180-300)",
                                   "error (>300 or < 50)"))) %>%
  mutate(has_diabetes = as.factor(does_patient_have_diabetes)) %>%
  mutate(has_phone = as.factor(has_phone)) %>%
  mutate(has_supporter = as.factor(has_supporter)) %>%
  mutate(age_group = as.factor(cut(age,
                                   breaks = c(0, 30, 45, 60, 75, 110, 100000),
                                   include.lowest = T,
                                   labels = c("Other", "30-44", "45-59", "60-74", "75-110", "Other")))) %>%
  mutate(age_group = fct_shift(age_group, n = 1)) %>%
  select(patient_returned, State, sex, age, age_group, appt_tally,
         "sbp_baseline" = sbp_level, bp_systole_mmhg,
         "previous_htn_treated" = hypertension_treated_in_the_past,
         "amlodipine_baseline" = med_amlodipine,
         has_phone, has_supporter, has_diabetes,
         days_of_medication, total_patient_visits) %>%
  filter(appt_tally == 1)

## Write it to RDS for the manuscript
write_rds(dhis_table, file = here::here("data", "dhis_table.rds"))

tab1 <- table1::table1(~ sex + State + age + age_group + previous_htn_treated +
                         sbp_baseline + bp_systole_mmhg + amlodipine_baseline +
                         has_phone + has_supporter + has_diabetes +
                         days_of_medication + total_patient_visits | factor(patient_returned),
                       data = dhis_table)
# t1flex(tab1)
tab1
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.294 58.000 90.000 94.394 90.600 1370.000
Code
# tots %>%
#   filter(total_patient_visits > 40) %>% View()
dpv %>%
  mutate(days_enrolled = if_else(total_patient_visits == appt_tally,
                                 event_date - registration_date, NA)) %>%
  filter(!is.na(days_enrolled) & total_patient_visits > 1) %>%
  # filter(patient_days_per_visit >= 0 & patient_days_per_visit < 91) %>%
  mutate(months_per_visit = patient_days_per_visit / 30) %>%
  ggplot() +
  geom_histogram(aes(x = months_per_visit)) +
  labs(title = "Months Per Visit, By Patient",
       subtitle = "Total number of months enrolled divided by number of visits")
Cumulative density of days since last visit
Code
tots %>%
  filter(appt_tally != 1) %>%
  mutate(days_since_last = as.integer(event_date - prev_event_date)) %>%
  filter(days_since_last < 50) %>%
  # count(days_since_last) %>%
  ggplot(aes(x = days_since_last)) +
  geom_step(aes(y = after_stat(y)), stat = "ecdf") +
  labs(title = "Cumulative Density of Events by days since latest visit",
       subtitle = "All visits occurring less than 50 days from last")
Most visits occur on the same day they were scheduled (spontaneous visits?) or earlier.
Code
tots %>%
  ungroup() %>%
  filter(appt_tally != 1) %>%
  mutate(days_from_latest_schedDate = as.numeric(scheduled_date - prev_event_date)) %>%
  mutate(days_from_latest_eventDate = as.integer(event_date - prev_event_date)) %>%
  filter(days_from_latest_eventDate < 60 & days_from_latest_eventDate > 0) %>%
  filter(days_from_latest_schedDate < 60 & days_from_latest_schedDate > 0) %>%
  select(reg_visit_id, starts_with("days_from")) %>%
  pivot_longer(cols = starts_with("days_from"), names_to = "property", values_to = "value") %>%
  ggplot(aes(x = value, fill = property)) +
  # facet_wrap(~property) +
  geom_histogram(position = "dodge", alpha = 0.4, bins = 50) +
  labs(title = "Histogram of Follow Up visits",
       subtitle = str_wrap("by days between scheduled visit date and previous visit date, plus days between current visit date and previous visit date", 90),
       caption = "Removing values < 0 or > 60") +
  theme(legend.position = "bottom")
Code
tots %>%
  ungroup() %>%
  filter(appt_tally != 1) %>%
  mutate(days_from_latest_schedDate = as.numeric(scheduled_date - prev_event_date)) %>%
  mutate(days_from_latest_eventDate = as.integer(event_date - prev_event_date)) %>%
  # filter(days_from_latest_eventDate < 90 & days_from_latest_eventDate > 0) %>%
  filter(days_from_latest_schedDate > 0) %>%
  ggplot() +
  geom_step(aes(x = days_from_latest_eventDate, y = after_stat(y), color = "Event Date"),
            stat = "ecdf", linewidth = 1, alpha = 0.6) +
  geom_step(aes(x = days_from_latest_schedDate, y = after_stat(y), color = "Scheduled Date"),
            stat = "ecdf", linetype = 3, linewidth = 1) +
  labs(title = "Followup visit after previous: by event date and scheduled date",
       subtitle = "Cumulative incidence of followup visits by days from previous",
       x = "Days since previous visit", y = "") +
  scale_color_manual(name = "Visit feature",
                     values = c("Event Date" = "black", # match the name of the element to the name in the aes
                                "Scheduled Date" = "blue")) +
  coord_cartesian(xlim = c(0, 110), ylim = c(0, 1)) +
  scale_x_continuous(breaks = seq(0, 105, by = 15)) +
  theme(legend.position = "bottom")
Code
my_ggsave("fig2_incidence_after_previous",10)
53% of follow-up visits are scheduled to occur 27-36 days after the previous visit, with 38% at precisely 28 days later (the system default scheduled date). 44% of follow-up visits actually occur within that window.
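The shares quoted above can be recomputed directly from the gaps between scheduled, actual, and previous visit dates. This is a hedged sketch, not the authors' chunk; it assumes an events table shaped like `tots` above (columns `appt_tally`, `scheduled_date`, `event_date`, `prev_event_date`).

```r
library(dplyr)

# Share of follow-up visits whose scheduled gap falls in 27-36 days, the share
# scheduled at exactly 28 days, and the share of actual gaps in 27-36 days.
followup_window_shares <- function(visits) {
  visits %>%
    filter(appt_tally != 1) %>%  # follow-ups only
    mutate(sched_gap  = as.numeric(scheduled_date - prev_event_date),
           actual_gap = as.numeric(event_date - prev_event_date)) %>%
    summarise(
      sched_27_36  = mean(between(sched_gap, 27, 36), na.rm = TRUE),
      sched_28     = mean(sched_gap == 28, na.rm = TRUE),  # system default
      actual_27_36 = mean(between(actual_gap, 27, 36), na.rm = TRUE))
}

# tots %>% followup_window_shares()
```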
NEXT STEPS:
Create the windows for timeliness for each subsequent visit
Per visit-opportunity analysis of timeliness
Timely Windows
Code
tots_timely <- tots %>%
  ungroup() %>%
  filter(appt_tally == 1 | event_date - prev_event_date >= 7) %>% # remove visits within 7 days of last
  mutate(nextdate_37 = event_date + 37, # create 37- and 90-day windows
         nextdate_90 = event_date + 90,
         nextdate_37_kept = if_else(lead_event_date <= nextdate_37, 1, 0),
         nextdate_90_kept = if_else(lead_event_date <= nextdate_90, 1, 0)) %>%
  mutate(across(contains("kept"), ~ replace(., is.na(.), 0))) %>%
  select(sex, State, appt_tally, total_patient_visits, ncd_patient_status,
         contains("reg"), contains("event_date"), contains("nextdate"), everything())
Adding month and year variables for the expected window of each subsequent visit, and checking whether the patient attended their next visit within that window.
The FIRST time a patient exceeds 90 days before the next visit, they are LTFU at that 90-day mark.
This is the OUTCOME EVENT. Subsequent visits for that patient are discarded.
If a patient is marked TRANSFERRED or DIED, they are CENSORED after their last visit. If the last expected visit falls after July 1, they are likewise CENSORED after their last visit.
This removes visits that are censored or that occur after dropout.
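The `censor` and `visits_since_dropout` flags consumed in the next chunk are built in a chunk not shown here. As one possible reading of the rules above (a sketch, not the authors' derivation; the status labels "Died"/"Transferred" and the July 1 cutoff are taken from elsewhere in this document):

```r
library(dplyr)
library(lubridate)

flag_censor_dropout <- function(visits, cutoff = ymd("2024-07-01")) {
  visits %>%
    mutate(exp_year_month = floor_date(nextdate_90, "month")) %>% # month the 90-day window closes
    group_by(reg_id) %>%
    arrange(appt_tally, .by_group = TRUE) %>%
    mutate(
      # the FIRST visit whose 90-day window passes with no return = dropout event
      first_dropout = suppressWarnings(min(appt_tally[nextdate_90_kept == 0])),
      # how many visits past that outcome event this row sits (0 = at or before it)
      visits_since_dropout = pmax(appt_tally - first_dropout, 0),
      # censor the last visit for deaths/transfers, or when the window closes
      # after the export cutoff
      censor = as.integer(
        (appt_tally == max(appt_tally) &
           ncd_patient_status %in% c("Died", "Transferred")) |
          nextdate_90 > cutoff)) %>%
    select(-first_dropout) %>%
    ungroup()
}

# tots_timely <- flag_censor_dropout(tots_timely)
```

With this encoding, `filter(visits_since_dropout < 2)` keeps everything up to one visit past the first dropout event, and `filter(visits_since_dropout < 1)` keeps visits up to and including it.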
Code
tots_timely_rates_prep <- tots_timely %>%
  filter(censor == 0) %>%
  filter(!is.na(sex)) %>%
  filter(visits_since_dropout < 2) # dropout occurs only once... and if censored, not included in denom

find_dropout_rates <- function(data, ...) {
  data %>%
    group_by(exp_year_month, ...) %>%
    count(nextdate_90_kept) %>%
    mutate(rate = n / sum(n)) %>%
    ungroup()
}

tots_timely_rates_all <- tots_timely_rates_prep %>%
  find_dropout_rates()

tots_timely_rates_all %>%
  group_by(nextdate_90_kept) %>%
  summarise(avg_monthly_rate = mean(rate)) %>%
  kable() %>%
  kableExtra::kable_styling()
| nextdate_90_kept | avg_monthly_rate |
|---|---|
| 0 | 0.2619861 |
| 1 | 0.7380139 |
The average monthly INCIDENCE RATE of attrition events (dropouts) is 26%.
Now a further breakdown of dropout incidence by category.
Code
ggcoeff <- 10000
ggbarscol <- RColorBrewer::brewer.pal(4, "Dark2")[3]
gglinecol <- RColorBrewer::brewer.pal(4, "Dark2")[1]

attr_90day_withRegs <- tots_timely_rates_prep %>%
  filter(registration_date <= ymd("2024-06-30")) %>%
  # filter(registration_date >= ymd("2021-04-01")) %>%
  # filter(event_date >= ymd("2021-04-01")) %>%
  filter(exp_year_month <= ymd("2024-06-30")) %>%
  mutate(reg_year = lubridate::year(registration_date)) %>%
  mutate(reg_month = lubridate::month(registration_date)) %>%
  mutate(reg_year_month = make_date(year = exp_year, month = exp_month, day = "01")) %>%
  add_count(reg_year_month, name = "new_registrations") %>%
  find_dropout_rates(reg_year_month, new_registrations) %>%
  filter(nextdate_90_kept == 0) %>%
  ggplot() +
  geom_col(aes(x = reg_year_month, y = new_registrations / ggcoeff),
           fill = ggbarscol, alpha = 0.5) +
  geom_line(aes(x = exp_year_month, y = rate), col = gglinecol, stat = "identity") +
  labs(title = "90-day attrition rate with monthly registrations",
       subtitle = "Attrition rate is by month at end of 90-day return window",
       x = "") +
  # theme_classic() +
  theme(legend.position = "bottom",
        axis.title.y.left = element_text(color = gglinecol),
        axis.title.y.right = element_text(color = ggbarscol)) +
  scale_y_continuous(name = "Attrition Rate",
                     labels = scales::label_percent(),
                     sec.axis = sec_axis(~ . * ggcoeff, name = "Monthly Registrations")) +
  scale_x_date(breaks = scales::breaks_pretty(12), labels = scales::label_date_short())

attr_90day_withRegs
Remember responses from RTSL here:
When were lists of LTFU for outreach generated? We generated the first list of LTFU for facilities to track patients in April 2021 using the 2021 Q1 NHCI Excel calculator data.
When was the dashboard deployed in both states? March 2022
When were legacy data imported? February 2022
When was the DHIS2-only entry pilot started, and was it expanded or ended? This commenced in August 2023 and ended in December 2023. The pilot just concluded and a report is being written by the consultants. This report would determine how and when we would expand this across the entire program.
Code
attr_90day_withRegs +
  annotate("rect", xmin = as.Date("2021-04-01"), xmax = as.Date("2021-09-01"),
           ymin = 0.42, ymax = 0.47, alpha = .2) +
  annotate("rect", xmin = as.Date("2021-12-21"), xmax = as.Date("2022-05-01"),
           ymin = 0.42, ymax = 0.47, alpha = .2) +
  annotate("rect", xmin = as.Date("2022-06-01"), xmax = as.Date("2023-04-01"),
           ymin = 0.42, ymax = 0.47, alpha = .2) +
  annotate("rect", xmin = as.Date("2023-07-15"), xmax = as.Date("2024-02-01"),
           ymin = 0.42, ymax = 0.47, alpha = .2) +
  # geom_text(data = data.frame(x = as.Date("2021-06-01"), y = 0.45, label = "1st LTFU List"),
  #           mapping = aes(x = x, y = y, label = label),
  #           size = 2.17, hjust = 0.45, inherit.aes = FALSE)
  annotate("text", x = as.Date("2021-06-15"), y = 0.44, label = "1st LTFU Lists", size = 3.3) +
  annotate("text", x = as.Date("2022-02-25"), y = 0.44, label = "Legacy data", size = 3.3) +
  annotate("text", x = as.Date("2022-11-01"), y = 0.44, label = "Growth and Scale", size = 3.5) +
  annotate("text", x = as.Date("2023-10-15"), y = 0.44, label = "Naira crisis", size = 3.3)
Code
my_ggsave("fig3_attritionRegs_full", 12)
Seasonal trends? There seems to be a dip in spring/summer.
Code
tots_timely_rates_lga %>%
  filter(nextdate_90_kept == 0) %>%
  mutate(LGA = str_remove_all(LGA, " Local Government Area")) %>%
  mutate(LGA = fct_reorder(LGA, rate, .fun = mean)) %>%
  ggplot(aes(x = LGA, y = rate, color = State)) +
  geom_boxplot() +
  labs(title = "90-day monthly attrition rates by LGA",
       y = "Attrition Rate",
       caption = "Removed all healthCentre-months pairs with less than 10 expected visits") +
  facet_wrap(~State, ncol = 1, scales = "free_y") +
  scale_y_continuous(labels = scales::label_percent()) +
  coord_flip() +
  # theme_classic()
  theme(legend.position = "null")
Monthly HC attrition rates
Code
tots_timely_rates_hc %>%
  filter(nextdate_90_kept == 0) %>%
  mutate(LGA = str_remove_all(LGA, " Local Government Area")) %>%
  mutate(LGA = fct_reorder(LGA, rate, .fun = mean)) %>%
  ggplot(aes(x = LGA, y = rate, color = State)) +
  geom_boxplot() +
  labs(title = "90-day monthly attrition rates at Health Centre Level",
       subtitle = "Grouped by LGA and State of Health Centre",
       caption = "Removed all healthCentre-months pairs with less than 10 expected visits",
       y = "Attrition Rate") +
  scale_y_continuous(labels = scales::label_percent()) +
  facet_wrap(~State, ncol = 1, scales = "free_y") +
  coord_flip() +
  # theme_classic() +
  theme(legend.position = "null")
The odds of dropout at each appointment are 1.94-fold higher for patients who have previously been delayed more than 90 days and later returned, compared to those who never experienced a first dropout event.
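The model behind that odds ratio is not shown in this document. A minimal sketch of one specification that would produce such a per-appointment odds ratio (the `prior_return_after_90` flag and the unadjusted form are assumptions, not the authors' actual model):

```r
library(dplyr)

# Per-visit logistic regression: dropout (missed 90-day window) on a flag for
# "a previous visit already missed its window, and this row exists, so the
# patient came back". Returns the odds ratio for that flag.
prior_delay_odds_ratio <- function(visits) {
  model_df <- visits %>%
    group_by(reg_id) %>%
    arrange(appt_tally, .by_group = TRUE) %>%
    mutate(prior_return_after_90 =
             as.integer(lag(cumsum(nextdate_90_kept == 0), default = 0) > 0)) %>%
    ungroup() %>%
    mutate(dropout = as.integer(nextdate_90_kept == 0))
  fit <- glm(dropout ~ prior_return_after_90, family = binomial(), data = model_df)
  unname(exp(coef(fit))["prior_return_after_90"])
}

# prior_delay_odds_ratio(tots_timely)
```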
Now add in the daily weather data from the open-meteo.com API, from 1/1/2021 to 1/7/2024.
This pull retrieved max temperature, max apparent temperature (with humidity), rain total, and rain hours for the centroid coordinates of each LGA.
For details see the companion script.
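The companion script is not reproduced here. For orientation, a hedged sketch of the kind of request it would make per LGA centroid, against Open-Meteo's historical archive endpoint (the endpoint path and daily variable names are taken from Open-Meteo's public documentation and are assumptions relative to this document):

```r
# Build one archive-API request URL for a single LGA centroid.
build_openmeteo_url <- function(lat, lon,
                                start_date = "2021-01-01",
                                end_date   = "2024-07-01") {
  paste0("https://archive-api.open-meteo.com/v1/archive",
         "?latitude=", lat, "&longitude=", lon,
         "&start_date=", start_date, "&end_date=", end_date,
         "&daily=temperature_2m_max,apparent_temperature_max,",
         "rain_sum,precipitation_hours",
         "&timezone=Africa%2FLagos")
}

# one centroid -> a list with a $daily element of date-indexed vectors:
# jsonlite::fromJSON(build_openmeteo_url(12.0, 8.5))$daily
```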
Code
# import weather data and clean the lga names
weather_lga_centroids <- read_rds(here::here("data", "weather_lga_centroids.rds")) %>%
  ungroup() %>%
  mutate(lganame = str_squish(lganame)) %>%
  mutate(lganame = str_remove_all(lganame, "\\/Ota"),                 # fix this one name
         lganame = str_replace_all(lganame, "Nassarawa", "Nasarawa"), # fix this name
         lganame = str_replace_all(lganame, "Dambatta", "Danbatta"),  # fix this name
         lganame = str_replace_all(lganame, "Shagamu", "Sagamu")) %>% # fix this name
  rename_with(~ paste0("cli_", .), max_temp_c:last_col()) %>%
  mutate(lga_37date = str_c(lganame, date, sep = "_")) %>%
  arrange(lga_37date) %>%
  select(lga_37date, starts_with("cli_"))

# clean lga names in main dataset and merge with weather
tots_timely_qual_climate <- tots_timely_quality %>%
  mutate(LGA = str_remove_all(LGA, " Local Government Area ")) %>%
  mutate(LGA = str_sub(LGA, start = 4, end = 50)) %>%
  mutate(LGA = str_squish(LGA)) %>%
  mutate(lga_37date = str_c(LGA, as.character(nextdate_37), sep = "_")) %>%
  arrange(lga_37date) %>%
  left_join(weather_lga_centroids, by = "lga_37date")
Market data
Now add in the weekly market data from the FEWS NET API for 1/1/2021 to 1/7/2024.
These are official FAO data on food and staple-good prices at local markets, used to assess food security. The prices are weekly but have been recast as a 10-day rolling average.
Included here are weekly subnational price data from Kano State and Lagos State for nine staples, in Naira.
The price of oil, both globally and domestically in Nigeria, affects the entire economy, including the price of transportation, staple foods, healthcare, and the cost of living in general. These prices could therefore affect patients' ability or willingness to access HTN services.
Note that data were not available for Ogun State, but Lagos borders Ogun to the south and has market data more consistent with national rates than Oyo to the north.
For details see the companion script.
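The "pre10day_mean" columns used below come from the companion script. A minimal sketch of the construction as described above (a trailing mean over the 10 days up to and including each date; the exact window shape in the companion script is an assumption):

```r
# Trailing 10-day mean of a daily series; windows at the start of the series
# are partial rather than NA.
pre10day_mean <- function(x, window = 10) {
  sapply(seq_along(x), function(i) {
    mean(x[max(1, i - window + 1):i], na.rm = TRUE)
  })
}

# e.g. weekly prices carried forward to daily, then smoothed per state:
# fews %>% group_by(admin_1) %>%
#   mutate(mkt_rice_milled_pre10day_mean = pre10day_mean(mkt_rice_milled))
```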
Code
fews <- read_rds(here::here("data", "fews_ng_prices_2021_2024.rds")) %>%
  ungroup() %>%
  filter(date > ymd("2021-02-01")) %>%
  rename_with(~ paste0("mkt_", .), 3:last_col()) %>%
  mutate(admin_1 = if_else(admin_1 == "Lagos", "Ogun", "Kano")) %>%
  arrange(admin_1, date) %>%
  mutate(state_37date = str_c(admin_1, date, sep = "_")) %>%
  select(state_37date, starts_with("mkt")) %>%
  arrange(state_37date)

tots_timely_qual_cli_mkt <- tots_timely_qual_climate %>%
  # clean up state name to merge with market prices
  arrange(State, nextdate_37) %>%
  mutate(State = str_remove_all(State, " State"),
         State = str_sub(State, 5, 50),
         State = str_squish(State)) %>%
  mutate(state_37date = str_c(State, nextdate_37, sep = "_")) %>%
  arrange(state_37date) %>%
  left_join(fews, by = "state_37date") %>%
  select(reg_id, reg_visit_id, registration_date, # reorder some vars
         event_date, appt_tally, everything())
Are weather and market prices 28-37 days after the latest visit different for those who attend their next visit 0-37 days after that visit versus those who fail to attend?
A t-test and effect sizes might give a general idea…
Code
pop_params_dropouts_37days_prep <- tots_timely_qual_cli_mkt %>%
  filter(censor == 0) %>%
  filter(visits_since_dropout < 1) %>% # remove censored obs or those that already "dropped out"
  mutate(nextdate_37_kept = as.factor(nextdate_37_kept)) %>%
  select(State, nextdate_37_kept, ends_with("pre10day_mean")) %>%
  rowid_to_column() %>%
  pivot_longer(cols = ends_with("pre10day_mean"), names_to = "var", values_to = "value")

# boxplot
pop_params_dropouts_37days_prep %>%
  ggplot() +
  aes(x = nextdate_37_kept, y = value) +
  geom_boxplot() +
  facet_wrap(~var, scales = "free_y") +
  labs(title = "Next visit date kept?",
       subtitle = "by price or weather parameter")
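The `pop_params_effect_sizes` object filtered in the next chunk is built in a step not shown here. A hedged sketch of that step (column names chosen to match the filter below; the pooled-SD form of Cohen's d is an assumption):

```r
library(dplyr)

# classic pooled-SD Cohen's d between two groups
cohens_d <- function(value, group) {
  s <- split(value, group)
  (mean(s[[1]], na.rm = TRUE) - mean(s[[2]], na.rm = TRUE)) /
    sqrt((var(s[[1]], na.rm = TRUE) + var(s[[2]], na.rm = TRUE)) / 2)
}

# per parameter: Welch t-test p-value plus effect size
# pop_params_effect_sizes <- pop_params_dropouts_37days_prep %>%
#   group_by(var) %>%
#   summarise(p_value  = t.test(value ~ nextdate_37_kept)$p.value,
#             Cohens_d = cohens_d(value, nextdate_37_kept))
```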
Code
# use these for the model
pop_params_top <- pop_params_effect_sizes %>%
  filter(abs(Cohens_d) > 0.11) %>%
  pluck("var")

print(paste0("Consider these vars for model: ", str_c(pop_params_top, collapse = ", ")))
[1] "Consider these vars for model: cli_max_temp_c_pre10day_mean, cli_rain_hours_pre10day_mean, cli_rain_mm_pre10day_mean, mkt_millet_pearl_pre10day_mean, mkt_rice_milled_pre10day_mean, mkt_sorghum_white_pre10day_mean"
A t-test is performed across all these population-level parameters, testing whether their means differ between visits where the next visit was a success and where it was a failure (success defined as next visit <= 37 days from the current visit).
All price and weather parameters have a small p-value, except the price of gasoline.
However, the effect sizes (Cohen's d) are small for all parameters. Max temp, rain total, rain hours, and the prices of millet, rice, and sorghum have the largest effect sizes (>0.11). Max temp shows the inverse of the expected association (more attendance on comparatively hot days).
For a greater effect size, we may need to respecify. For example, instead of using 10-day averages, use the precise weather values on the date 28 days after the latest appointment, with an outcome of visiting on that exact day.
Here is an example: for a given 10 day window, is the average price of milled rice at state level related to the rate of patients returning for subsequent visits within that 10 day window?
Code
tots_timely_rates_prep_detailed <- tots_timely_qual_cli_mkt %>%
  filter(censor == 0) %>%
  filter(!is.na(sex)) %>%
  filter(visits_since_dropout < 2) # dropout occurs only once... and if censored, not included in denom

find_dropout_rates_detailed <- function(data, ...) {
  data %>%
    mutate(nextdate_37_kept = factor(nextdate_37_kept, levels = c(0, 1))) %>%
    group_by(nextdate_37, ...) %>%
    count(nextdate_37_kept) %>%
    mutate(rate = n / sum(n)) %>%
    ungroup() %>%
    filter(nextdate_37_kept == 1)
}

tots_timely_rates_prep_detailed %>%
  find_dropout_rates_detailed(State, mkt_rice_milled_pre10day_mean) %>%
  filter(n > 9) %>%
  ggplot(aes(x = mkt_rice_milled_pre10day_mean, y = rate)) +
  geom_jitter(alpha = 0.5) +
  geom_smooth() +
  expand_limits(y = c(0, 1)) +
  labs(title = "State Level 37-day Timely Return Rates, By market prices for milled rice",
       subtitle = "excluding date ranges where the State expected less than 10 patients")
Review and Save data
In a final review, check distributions of all the vars.
Code
dfs <- tots_timely_qual_cli_mkt %>%
  summarytools::dfSummary(plain.ascii = FALSE, style = "grid", max.string.width = 15)

dfs$Variable <- gsub("_", " ", dfs$Variable)

print(dfs, method = "render", varnumbers = FALSE, valid.col = FALSE,
      max.tbl.width = 40,
      caption = "Removing _underscores_ from varnames for space")
Data Frame Summary
tots_timely_qual_cli_mkt
Dimensions: 153413 x 107
Duplicates: 0

| Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|
| reg id [numeric] | Mean (sd): 16102.7 (9361.3); 0 ≤ 16067 ≤ 32332; IQR (CV): 16258 (0.6) | 29667 distinct values | 0 (0.0%) |
| reg visit id [character] | 153237 distinct values (nearly all unique; ten IDs appear twice) | — | 0 (0.0%) |
| registration date [Date] | min: 2021-01-01; med: 2022-05-10; max: 2024-07-01; range: 3y 6m 0d | 1246 distinct values | 0 (0.0%) |
| event date [Date] | min: 2021-01-01; med: 2023-01-23; max: 2024-07-01; range: 3y 6m 0d | 1278 distinct values | 0 (0.0%) |
| appt tally [numeric] | Mean (sd): 6 (5.5); 1 ≤ 4 ≤ 45; IQR (CV): 6 (0.9) | 45 distinct values | 0 (0.0%) |
| sex [character] | — | Female 117013 (76.3%); Male 36357 (23.7%) | 43 (0.0%) |
| State [character] | — | Kano 97605 (63.6%); Ogun 55808 (36.4%) | 0 (0.0%) |
| total patient visits [integer] | Mean (sd): 11 (7.9); 1 ≤ 9 ≤ 45; IQR (CV): 11 (0.7) | 42 distinct values | 0 (0.0%) |
| ncd patient status [character] | — | Active record 151920 (99.0%); Died 452 (0.3%); Inactive record 424 (0.3%); Transferred 548 (0.4%); Unresponsive 69 (0.0%) | 0 (0.0%) |
| prev event date [Date] | min: 2021-01-01; med: 2022-12-22; max: 2024-06-24; range: 3y 5m 23d | 1264 distinct values | 29690 (19.4%) |
| lead event date [Date] | min: 2021-01-04; med: 2023-03-09; max: 2024-07-01; range: 3y 5m 27d | 1259 distinct values | 29543 (19.3%) |
| nextdate 37 [Date] | min: 2021-02-07; med: 2023-03-01; max: 2024-08-07; range: 3y 6m 0d | 1278 distinct values | 0 (0.0%) |
| nextdate 90 [Date] | min: 2021-04-01; med: 2023-04-23; max: 2024-09-29; range: 3y 5m 28d | 1278 distinct values | 0 (0.0%) |
| nextdate 37 kept [numeric] | Min: 0; Mean: 0.4; Max: 1 | 0: 85159 (55.5%); 1: 68254 (44.5%) | 0 (0.0%) |
| nextdate 90 kept [numeric] | Min: 0; Mean: 0.7; Max: 1 | 0: 50260 (32.8%); 1: 103153 (67.2%) | 0 (0.0%) |
| stored by [character] | — | Jaen_PHC 7268 (6.9%); Gwagwarwa_PHC 7157 (6.8%); Hotoro_NorthPHC 6519 (6.2%); 104 others | 47487 (31.0%) |
| created by [character] | — | Akumba, Barnaba 15970 (10.4%); DAKATA, NASSARA 8352 (5.4%); Recorder, GWAGW… 8036 (5.2%); 103 others | 0 (0.0%) |
| last updated by [character] | — | Recorder, JAEN 9574 (6.2%); Recorder, HOTOR… 9085 (5.9%); Recorder, GWAGW… 9063 (5.9%); 104 others | 0 (0.0%) |
| last updated on [Date] | min: 2021-01-02; med: 2023-06-10; max: 2024-07-21; range: 3y 6m 19d | 1265 distinct values | 0 (0.0%) |
| scheduled date [Date] | min: 1974-06-12; med: 2023-03-31; max: 2201-08-29; range: 227y 2m 17d | 1331 distinct values | 0 (0.0%) |
| incident date [Date] | min: 2003-12-08; med: 2022-06-14; max: 2024-07-17; range: 20y 7m 9d | 1233 distinct values | 0 (0.0%) |
| organisation unit name [character] | — | kn Ja'En Primar… 9816 (6.4%); kn Hotoro Arewa… 9691 (6.3%); kn Gwagwarwa Pr… 9293 (6.1%); 101 others | 0 (0.0%) |
| Nat [character] | — | fg Nigeria 153413 (100.0%) | 0 (0.0%) |
| LGA [character] | — | Nasarawa 27813 (18.1%); Gwale 15552 (10.1%); Dala 9870 (6.4%); 39 others | 0 (0.0%) |
| Ward [character] | — | kn Dorayi Ward 9816 (6.4%); kn Hotoro Nort… 9691 (6.3%); kn Gawuna Ward 9293 (6.1%); 97 others | 0 (0.0%) |
| HC [character] | — | kn Ja'En Prima… 9816 (6.4%); kn Hotoro Arew… 9691 (6.3%); kn Gwagwarwa P… 9293 (6.1%); 102 others | 0 (0.0%) |
| FACILITY [character] | — | og Otta Primar… 1812 (21.1%); og Sango Prima… 1551 (18.0%); og Keesi Prima… 1209 (14.1%); 7 others | 144812 (94.4%) |
| program status [character] | — | ACTIVE 152913 (99.7%); CANCELLED 450 (0.3%); COMPLETED 50 (0.0%) | 0 (0.0%) |
| event status [character] | — | ACTIVE 3659 (2.4%); COMPLETED 149754 (97.6%) | 0 (0.0%) |
| organisation unit [character] | — | ehJsDey0nF2 9816 (6.4%); DxfVRX9j8np 9691 (6.3%); xUORgwU22YU 9293 (6.1%); 102 others | 0 (0.0%) |
| height cm [numeric] | Mean (sd): 86.4 (79.3); 0.4 ≤ 148 ≤ 758; IQR (CV): 158.4 (0.9) | 156 distinct values | 146189 (95.3%) |
| med losartan [character] | — | Losartan 100mg 8819 (15.9%); Losartan 50mg 46480 (84.1%) | 98114 (64.0%) |
| hypertension treated in the past [character] | — | First time 127501 (83.7%); Was treated in… 24776 (16.3%) | 1136 (0.7%) |
| bp diastole mmhg [numeric] | Mean (sd): 84 (12.9); 1 ≤ 82 ≤ 200; IQR (CV): 14 (0.2) | 181 distinct values | 1066 (0.7%) |
| med hydrochlorothiazide [character] | — | Hydrochlorathia… 156 (11.5%); Hydrochlorathia… 1201 (88.5%) | 152056 (99.1%) |
| other hypertension medication [character] | — | N 2144 (81.8%); none 177 (6.8%); Lisinopril 47 (1.8%); 118 others | 150793 (98.3%) |
| weight kg [numeric] | Mean (sd): 73.6 (71.8); 28 ≤ 70 ≤ 6298; IQR (CV): 21 (1) | 116 distinct values | 145524 (94.9%) |
| patient followup status [character] | — | Called 766 (0.5%); Overdue Pending… 113357 (79.7%); Upcoming visit 28175 (19.8%) | 11115 (7.2%) |
| bmi [numeric] | Mean (sd): 137463.3 (281054.3); 0 ≤ 38.4 ≤ 4296296; IQR (CV): 262195.2 (2) | 1478 distinct values | 146184 (95.3%) |
| bmi measurement [numeric] | 1 distinct value | 1: 8591 (100.0%) | 144822 (94.4%) |
| days of medication [character] | — | 30 days 25742 (97.0%); 60 days 659 (2.5%); 90 days 139 (0.5%) | 126873 (82.7%) |
| consent to send sms reminders [numeric] | 1 distinct value | 1: 25896 (100.0%) | 127517 (83.1%) |
| does patient have diabetes [character] | — | No 139830 (97.8%); Yes 3078 (2.2%) | 10505 (6.8%) |
| blood sugar measurement [numeric] | 1 distinct value | 1: 501 (100.0%) | 152912 (99.7%) |
| relationship to patient [character] | — | Son 44818 (36.6%); Husband 27941 (22.8%); Daughter 21134 (17.2%); Other 10205 (8.3%); Wife 8902 (7.3%); Father 5239 (4.3%); Neighbor 2547 (2.1%); Friend 1136 (0.9%); Mother 642 (0.5%) | 30849 (20.1%) |
| bp systole mmhg [numeric] | Mean (sd): 141.2 (19.7); 1 ≤ 140 ≤ 300; IQR (CV): 23 (0.1) | 224 distinct values | 530 (0.3%) |
| other diabetes medication [logical] | All NA's | — | 153413 (100.0%) |
| newage [character] | — | 50 12712 (8.3%); 60 11770 (7.7%); 70 8481 (5.5%); 93 others | 0 (0.0%) |
| med amlodipine [character] | — | Amlodipine 10mg 13092 (8.8%); Amlodipine 5mg 136353 (91.2%); Other drug 15 (0.0%) | 3953 (2.6%) |
| med telmisartan [character] | — | Telmisartan(40m… 310 (72.9%); Telmisartan(80m… 115 (27.1%) | 152988 (99.7%) |
| does patient have hypertension [character] | — | Yes 153413 (100.0%) | 0 (0.0%) |
| consent to record data [numeric] | 1 distinct value | 1: 153413 (100.0%) | 0 (0.0%) |
| has supporter [numeric] | Min: 0; Mean: 0.8; Max: 1 | 0: 29413 (19.2%); 1: 124000 (80.8%) | 0 (0.0%) |
| has phone [numeric] | Min: 0; Mean: 0.8; Max: 1 | 0: 28289 (18.4%); 1: 125124 (81.6%) | 0 (0.0%) |
| age [integer] | Mean (sd): 54 (14.1); 18 ≤ 53 ≤ 124; IQR (CV): 18 (0.3) | 96 distinct values | 0 (0.0%) |
| days since entry [numeric] | Mean (sd): 245.1 (248.8); 0 ≤ 170 ≤ 1272; IQR (CV): 353 (1) | 1255 distinct values | 0 (0.0%) |
| exp year [numeric] | Mean (sd): 2022.8 (0.9); 2021 ≤ 2023 ≤ 2024 | 2021: 12579 (8.2%); 2022: 43654 (28.5%); 2023: 61089 (39.8%); 2024: 36091 (23.5%) | 0 (0.0%) |
| exp month [numeric] | Mean (sd): 6.6 (3.3); 1 ≤ 7 ≤ 12; IQR (CV): 5 (0.5) | 12 distinct values | 0 (0.0%) |
| exp year month [Date] | min: 2021-04-01; med: 2023-04-01; max: 2024-09-01; range: 3y 5m 0d | 42 distinct values | 0 (0.0%) |
| days since last [integer] | Mean (sd): 66.9 (87.4); 7 ≤ 35 ≤ 1209; IQR (CV): 36 (1.3) | 892 distinct values | 29690 (19.4%) |
| nextdate 90 dropout [numeric] | Min: 0; Mean: 0.5; Max: 1 | 0: 76188 (49.7%); 1: 77225 (50.3%) | 0 (0.0%) |
| visits since dropout [numeric] | Mean (sd): 1.8 (3.2); 0 ≤ 1 ≤ 35; IQR (CV): 2 (1.8) | 36 distinct values | 0 (0.0%) |
| censor [numeric] | Min: 0; Mean: 0.1; Max: 1 | 0: 141106 (92.0%); 1: 12307 (8.0%) | 0 (0.0%) |
| p step1 need [numeric] | Min: 0; Mean: 0.1; Max: 1 | 0: 140951 (91.9%); 1: 12462 (8.1%) | 0 (0.0%) |
| p step2 need [numeric] | Min: 0; Mean: 0.2; Max: 1 | 0: 127335 (83.0%); 1: 26078 (17.0%) | 0 (0.0%) |
| p step3 need [numeric] | Min: 0; Mean: 0.1; Max: 1 | 0: 139241 (90.8%); 1: 14172 (9.2%) | 0 (0.0%) |
| p step4 need [numeric] | Min: 0; Mean: 0; Max: 1 | 0: 146882 (95.7%); 1: 6531 (4.3%) | 0 (0.0%) |
| p step1 met [numeric] | Min: 0; Mean: 0.1; Max: 1 | 0: 141580 (92.3%); 1: 11833 (7.7%) | 0 (0.0%) |
| p step2 met [numeric] | Min: 0; Mean: 0.1; Max: 1 | 0: 137348 (89.5%); 1: 16065 (10.5%) | 0 (0.0%) |
| p step3 met [numeric] | Min: 0; Mean: 0; Max: 1 | 0: 151287 (98.6%); 1: 2126 (1.4%) | 0 (0.0%) |
| p step4 met [numeric] | Min: 0; Mean: 0; Max: 1 | 0: 153395 (100.0%); 1: 18 (0.0%) | 0 (0.0%) |
| sbp baseline [numeric] | Mean (sd): 157.6 (17.2); 14 ≤ 158 ≤ 300; IQR (CV): 23 (0.1) | 165 distinct values | 514 (0.3%) |
| dbp baseline [numeric] | Mean (sd): 92.4 (13.3); 1 ≤ 91 ≤ 200; IQR (CV): 15 (0.1) | 153 distinct values | 720 (0.5%) |
| sbp change from baseline [numeric] | Mean (sd): -16.4 (20.6); -195 ≤ -14 ≤ 178; IQR (CV): 30 (-1.3) | 302 distinct values | 917 (0.6%) |
| dbp change from baseline [numeric] | Mean (sd): -8.4 (14.1); -143 ≤ -6 ≤ 130; IQR (CV): 17 (-1.7) | 241 distinct values | 1592 (1.0%) |
| lga 37date [character] | — | Nasarawa_2024-0… 140 (0.1%); 27810 distinct values | 0 (0.0%) |
| cli max temp c [numeric] | Mean (sd): 30.4 (3.4); 18.5 ≤ 29.6 ≤ 41.6; IQR (CV): 4.6 (0.1) | 217 distinct values | 4659 (3.0%) |
| cli max heat index c [numeric] | Mean (sd): 33.7 (3.3); 19 ≤ 33.7 ≤ 44; IQR (CV): 4.6 (0.1) | 232 distinct values | 4659 (3.0%) |
| cli rain mm [numeric] | Mean (sd): 4.1 (7.4); 0 ≤ 0.6 ≤ 167.6; IQR (CV): 5.4 (1.8) | 523 distinct values | 4659 (3.0%) |
| cli rain hours [numeric] | Mean (sd): 4.9 (6); 0 ≤ 2 ≤ 24; IQR (CV): 8 (1.2) | 25 distinct values | 4659 (3.0%) |
| cli max temp c pre10day mean [numeric] | Mean (sd): 30.4 (3.2); 21 ≤ 29.6 ≤ 40; IQR (CV): 4.4 (0.1) | 1582 distinct values | 4659 (3.0%) |
| cli max heat index c pre10day mean [numeric] | Mean (sd): 33.7 (3); 22.5 ≤ 33.7 ≤ 42.4; IQR (CV): 3.8 (0.1) | 1687 distinct values | 4659 (3.0%) |
| cli rain mm pre10day mean [numeric] | Mean (sd): 4.1 (4.4); 0 ≤ 2.9 ≤ 42.9; IQR (CV): 6.3 (1.1) | 2607 distinct values | 4659 (3.0%) |
| cli rain hours pre10day mean [numeric] | Mean (sd): 4.8 (4.6); 0 ≤ 3.7 ≤ 20.6; IQR (CV): 7 (1) | 202 distinct values | 4659 (3.0%) |
| state 37date [character] | — | Kano_2023-06-21 326 (0.2%); 2543 distinct values | 0 (0.0%) |
| mkt diesel [numeric] | Mean (sd): 752.7 (287.3); 188 ≤ 744 ≤ 1470; IQR (CV): 218 (0.4) | 169 distinct values | 4659 (3.0%) |
| mkt gasoline [numeric] | Mean (sd): 369.5 (219.1); 161 ≤ 274 ≤ 1070; IQR (CV): 405.2 (0.6) | 123 distinct values | 4659 (3.0%) |
| mkt maize grain white [numeric] | Mean (sd): 346.2 (167.4); 151.5 ≤ 282 ≤ 814; IQR (CV): 194.4 (0.5) | 175 distinct values | 4659 (3.0%) |
| mkt maize grain yellow [numeric] | Mean (sd): 371.3 (169.7); 159.2 ≤ 322.3 ≤ 851.4; IQR (CV): 211.3 (0.5) | 189 distinct values | 4659 (3.0%) |
| mkt millet pearl [numeric] | Mean (sd): 370 (167.6); 160.3 ≤ 316.1 ≤ 847.9; IQR (CV): 146.8 (0.5) | 210 distinct values | 4659 (3.0%) |
| mkt rice 5 percent broken [numeric] | Mean (sd): 895.6 (395.7); 475.8 ≤ 724.1 ≤ 2160.5; IQR (CV): 466.1 (0.4) | 201 distinct values | 4659 (3.0%) |
| mkt rice milled [numeric] | Mean (sd): 719.5 (320); 278 ≤ 607.1 ≤ 1641.3; IQR (CV): 398.9 (0.4) | 179 distinct values | 4659 (3.0%) |
| mkt sorghum brown [numeric] | Mean (sd): 378.2 (193.1); 155.9 ≤ 319.6 ≤ 991.5; IQR (CV): 223.4 (0.5) | 183 distinct values | 4659 (3.0%) |
| mkt sorghum white [numeric] | Mean (sd): 365.7 (178.4); 154.7 ≤ 332.4 ≤ 852.3; IQR (CV): 191 (0.5) | 159 distinct values | 4659 (3.0%) |
| mkt yams [numeric] | Mean (sd): 338 (184.8); 136.9 ≤ 267.8 ≤ 836.7; IQR (CV): 204.9 (0.5) | 234 distinct values | 4659 (3.0%) |
| mkt usd exchange rate [numeric] | Mean (sd): 0 (0); 0 ≤ 0 ≤ 0; IQR (CV): 0 (0.3) | 36 distinct values | 4659 (3.0%) |
| mkt diesel pre10day mean [numeric] | Mean (sd): 749.2 (287); 188 ≤ 744 ≤ 1458; IQR (CV): 210 (0.4) | 1254 distinct values | 4659 (3.0%) |
| mkt gasoline pre10day mean [numeric] | Mean (sd): 366.2 (216.7); 160.8 ≤ 274 ≤ 1070; IQR (CV): 400.8 (0.6) | 929 distinct values | 4659 (3.0%) |
| mkt maize grain white pre10day mean [numeric] | Mean (sd): 344.3 (165.2); 153.8 ≤ 281.5 ≤ 814; IQR (CV): 189.2 (0.5) | 1534 distinct values | 4659 (3.0%) |
| mkt maize grain yellow pre10day mean [numeric] | Mean (sd): 369.4 (168); 162.1 ≤ 322.3 ≤ 851.4; IQR (CV): 208.2 (0.5) | 1672 distinct values | 4659 (3.0%) |
| mkt millet pearl pre10day mean [numeric] | Mean (sd): 368.5 (166.4); 161.1 ≤ 316.1 ≤ 846.8; IQR (CV): 150 (0.5) | 1766 distinct values | 4659 (3.0%) |
| mkt rice 5 percent broken pre10day mean [numeric] | Mean (sd): 891.6 (392.4); 479 ≤ 727.5 ≤ 2157.1; IQR (CV): 468.8 (0.4) | 1713 distinct values | 4659 (3.0%) |
| mkt rice milled pre10day mean [numeric] | Mean (sd): 714.5 (313.8); 281.6 ≤ 610.5 ≤ 1636.4; IQR (CV): 408.9 (0.4) | 1578 distinct values | 4659 (3.0%) |
| mkt sorghum brown pre10day mean [numeric] | Mean (sd): 376.2 (191.5); 159.1 ≤ 319.6 ≤ 989; IQR (CV): 221.4 (0.5) | 1599 distinct values | 4659 (3.0%) |
| mkt sorghum white pre10day mean [numeric] | Mean (sd): 363.9 (177.2); 154.7 ≤ 332 ≤ 852.3; IQR (CV): 193.6 (0.5) | 1411 distinct values | 4659 (3.0%) |
| mkt yams pre10day mean [numeric] | Mean (sd): 335.4 (182); 139.1 ≤ 269.2 ≤ 836.7; IQR (CV): 211.6 (0.5) | 1921 distinct values | 4659 (3.0%) |
| mkt usd exchange rate pre10day mean [numeric] | Mean (sd): 0 (0); 0 ≤ 0 ≤ 0; IQR (CV): 0 (0.3) | 334 distinct values | 4659 (3.0%) |

Generated by summarytools 1.0.1 (R version 4.3.3) 2024-08-27
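Several ranges in the summary above are implausible (height 0.4–758 cm, weight up to 6298 kg, BMI up to 4,296,296, scheduled dates from 1974 to 2201). A minimal sketch of range checks that would quantify them; the cutoffs here are illustrative assumptions, not rules from DHIS2:

```r
library(dplyr)

# Counts of out-of-range values, assuming the underscore column names
# behind the display labels above (e.g. height_cm, scheduled_date)
dq_flags <- tots_timely_qual_cli_mkt %>%
  summarise(
    height_out_of_range = sum(height_cm < 100 | height_cm > 250, na.rm = TRUE),
    weight_out_of_range = sum(weight_kg < 25 | weight_kg > 250, na.rm = TRUE),
    sbp_out_of_range    = sum(bp_systole_mmhg < 50 | bp_systole_mmhg > 300,
                              na.rm = TRUE),
    sched_out_of_window = sum(scheduled_date < as.Date("2021-01-01") |
                                scheduled_date > as.Date("2025-12-31"),
                              na.rm = TRUE)
  )
dq_flags
```

Rows flagged this way could be set to NA or excluded before modelling, rather than corrected, since the true values are unrecoverable.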
---
title: "Msc Thesis Pt1: Describing NHCI Dropouts"
subtitle: "Working document"
format:
  html:
    code-fold: true
    code-tools: true
    embed-resources: true
    toc: true
    toc-location: left
    toc-title: Contents
    toc-depth: 3
    toc-expand: 2
editor: visual
warning: false
author:
  name: Brian ODonnell
  affiliation: Charité - Universitätsmedizin Berlin
---

This document shows work for describing the NHCI patient population by attendance status.

Part two, in a [separate document](https://iambodo.github.io/projects/nhci_analyses/msc_thesis_pt2_pred.html), will work with the resulting dataset to predict patient dropouts using the [tidymodels](https://www.tidymodels.org/start/case-study) workflow.

For a full description of the final dataset, see section @sec-review-save

*Goals Summary*

- Import of hypertension data from DHIS2 Line List
- Cleaning and data quality review
- **Table1** for never-returners (dropout after baseline)
- Descriptive statistics and explanation
- Add in market prices and weather data as plausible predictors of dropout, with t-tests

```{r}
#| label: load-packages
#| include: false
#| echo: false
library(tidyverse)
library(table1)
library(slider)
library(skimr)
library(patchwork)
library(summarytools)
library(kableExtra)
library(effectsize)
library(RColorBrewer)
library(DT)
library(here)

palette <- brewer.pal(n = 8, name = "Dark2")

mytheme <- theme_minimal() +
  theme(panel.background = element_rect(fill = "white", color = NA),
        plot.background = element_rect(fill = "white", color = NA),
        legend.background = element_rect(fill = "white", color = NA),
        text = element_text(family = "sans"),
        strip.background = element_rect(color = "grey70", linewidth = 1))

options(ggplot2.discrete.colour = palette, ggplot2.discrete.fill = palette)
theme_set(mytheme)

my_ggsave <- function(x, ...){
  ggsave(paste0(x, ".png"), plot = last_plot(), path = here::here("figs"),
         dpi = 300, width = ifelse(is.na(...), 7, ...))
  ggsave(paste0(x, ".tiff"), plot = last_plot(), path = here::here("figs"),
         dpi = 300, width = ifelse(is.na(...), 7, ...))
}
```

# DHIS2 Data Import and Cleaning

DHIS2 Events
(Patient visits) with dates between Jan 1 2021 and June 30 2024 were [exported](https://github.com/iambodo/NHCI_exporter). Patient IDs and other PII were removed. Age was calculated as years between DOB and Registration Date. Only patients with hypertension confirmed and consent to record data checked are included. Events that are likely test data or were created outside Kano and Ogun are removed.

```{r Import and Initial Clean}
dhis <- readr::read_csv("output_file.csv") %>% janitor::clean_names()

dhis_clean <- dhis %>%
  mutate(across(contains("_date"), dmy)) %>%           # convert everything that's a date
  mutate(across(ends_with("updated_on"), dmy)) %>%     # including "updated on" stamps
  mutate(across(contains("mmhg"), as.double)) %>%      # BP shouldn't be character
  select(!c(program_stage, geometry, longitude, latitude)) %>% # drop unused columns
  arrange(reg_id, event_date) %>%                      # now make a running tally of each patient's visits
  group_by(reg_id) %>%
  mutate("one" = 1, "appt_tally" = cumsum(one)) %>%
  select(-one) %>%
  ungroup() %>%
  rename_all(~stringr::str_replace(., "^htn_", "")) %>% # clean names
  # now extract the OU hierarchy
  separate(organisation_unit_name_hierarchy,
           into = c("Nat", "State", "LGA", "Ward", "HC", "FACILITY"),
           sep = "/") %>%                              # clinic hierarchy
  mutate(age = as.integer(newage)) %>%                 # convert age to number
  filter(!is.na(age) & age > 17 & age < 130) %>%       # and remove non-adults
  filter(does_patient_have_hypertension == "Yes") %>%  # remove those without hypertension confirmed
  filter(consent_to_record_data == 1) %>%              # and those without consent to record data
  filter(event_status %in% c("COMPLETED", "ACTIVE")) %>% # remove scheduled events
  filter(!str_detect(State, "ad ")) %>%                # and any outside Kano/Ogun
  filter(!str_detect(State, "ab ")) %>%
  filter(!str_detect(created_by, "Test"))              # remove test data
```

# Data Quality Check

## Skim of data

Looking at all columns in a skim.

```{r quick skim}
dhis_clean %>% skimr::skim()
```

## Clean Registration Dates

Looking at registration dates, it seems some
registrations are after the first visit ("back data entry"). 4% of patients have registration dates later than their first visit date. But registrations should always be on or before the first visit date: registration marks entry to the program.

This snippet overwrites such registration dates with the date of the first visit. The operation potentially biases the Age variable, which is based on days between date of birth and registration.

**Question: How exactly are registration dates used?**

```{r Clean Reg Dates}
# create a new registration date with rules
dhis_clean_newReg <- dhis_clean %>%
  mutate(reg_date_new = case_when(
    appt_tally == 1 & registration_date <= event_date ~ registration_date,
    appt_tally == 1 & registration_date > event_date ~ event_date,
    appt_tally > 1 ~ NA,
    TRUE ~ NA)) %>%
  mutate(reg_date_new = if_else(
    as.integer(as.Date(ymd("2020-01-01")) - registration_date) > 0,
    event_date, reg_date_new)) %>%
  group_by(reg_id) %>%
  fill(reg_date_new, .direction = "down") %>%
  ungroup()

dhis_clean_newReg %>%
  select(reg_id, registration_date, reg_date_new, appt_tally, event_date) %>%
  mutate(reg_date_post_ev1 = if_else(reg_date_new != registration_date & appt_tally == 1, 1, 0)) %>%
  filter(appt_tally == 1) %>%
  count(reg_date_post_ev1) %>%
  mutate(percent = n / sum(n))

# Now rewrite the clean dhis dataset.
# Remove unneeded columns and create days since entry
dhis_clean <- dhis_clean_newReg %>%
  mutate(registration_date = reg_date_new) %>%
  mutate(days_since_entry = as.double(event_date - registration_date)) %>% # days since entry
  select(-reg_date_new) %>%
  janitor::remove_empty(which = "cols") # remove all empty columns
```

```{r Clean Reg Dates 2}
# dhis_clean <- dhis_clean %>% filter(registration_date >= ymd("2021-01-01"))
```

Notice the dates and years of events look correct.

```{r Explore cleaned data 1}
dhis_clean %>%
  select(event_date, reg_id) %>%
  distinct() %>%
  mutate(year = lubridate::year(as.Date(event_date))) %>%
  count(year) %>%
  kable() %>%
  kableExtra::kable_styling()
```

## Zero ending values

And, 30% of all "Age at Registration" values end in 0.
(36% of SBPs do as well.) HCWs were using the DOB attribute as an *estimate* of age, rather than a true DOB; a feature of DHIS2 enables this estimation. Between the incorrect registration dates and the use of Age/DOB as an estimate, the Age variable is imprecise and carries unknown measurement error.

```{r Explore cleaned data 2}
dhis_clean %>%
  select(newage, reg_id) %>%
  distinct() %>%
  mutate(age = as.integer(newage)) %>%
  select(age) %>%
  filter(age > 0) %>%
  ## !!!! 30% of all Ages end in 0 !!!!
  ## Measurement error from entering age instead of DOB
  # mutate(age_zero = if_else(age %% 10 == 0, 1, 0)) %>%
  # count(age_zero) %>%
  # mutate(per = n / sum(n))
  ggplot() +
  geom_histogram(aes(x = age)) +
  labs(title = "Histogram of Age at Registration for All Visits")

dhis_clean %>%
  mutate(age_zero = if_else(bp_systole_mmhg %% 10 == 0, 1, 0)) %>%
  filter(99 < bp_systole_mmhg & bp_systole_mmhg < 200) %>%
  # count(age_zero) %>%
  # mutate(per = n / sum(n))
  ggplot() +
  geom_histogram(aes(x = bp_systole_mmhg)) +
  labs(title = "Histogram of SBP All Visits",
       subtitle = "SBPs less than 200. Spikes are values ending in zero")
```

## Duplicate events?

It looks like a number of visits occur on the same day for many patients.

```{r}
## Found duplicate events!!!
tots_d <- dhis_clean %>%
  mutate(reg_visit_id = paste0(reg_id, "_", appt_tally)) %>%
  group_by(reg_id) %>%
  mutate(prev_event_date = lag(event_date, n = 1, order_by = event_date)) %>%
  mutate(lead_event_date = lead(event_date, n = 1, order_by = event_date)) %>%
  ungroup()

### highlight them as dupes
dupes <- tots_d %>%
  # select(reg_id, event_date, prev_event_date, organisation_unit_name) %>%
  # ungroup() %>%
  mutate(dupe = if_else(event_date == prev_event_date | event_date == lead_event_date, 1, 0)) %>%
  group_by(reg_id) %>%
  # fill(dupe, .direction = "down") %>%
  filter(dupe == 1) %>%
  arrange(reg_id, event_date)

## TRUE dupes also have identical values except the "last updated" date
dupes_true <- dupes %>%
  ungroup() %>%
  select(!c(lead_event_date, prev_event_date, last_updated_by)) %>%
  ## this only keeps rows that are identical except for the appointment tally and visit id
  janitor::get_dupes(!c(appt_tally, reg_visit_id))

print(paste0("total patients ", length(unique(tots_d$reg_id))))
print(paste0("patient records with dupe events ", length(unique(dupes$reg_id))))
print(paste0("duped events ", length(dupes$reg_id)))
print(paste0("patient records with true dupe events on same day ", length(unique(dupes_true$reg_id))))
# length(unique(dhis_subset$reg_id))

# Only the first of all "true dupe" events should be kept!
dupes_true_extras <- dupes_true %>%
  group_by(reg_id) %>%
  mutate("one" = 1, "dupe_visit_id" = cumsum(one)) %>%
  select(-one) %>%
  select(reg_id, reg_visit_id, dupe_visit_id) %>%
  filter(dupe_visit_id > 1)

## The new dataset with totals is called tots, with the "true dupe" visits removed
tots <- tots_d %>%
  anti_join(dupes_true_extras, by = "reg_visit_id")

# (1122-188)/length(unique(dhis_clean$reg_id))
```

`r length(unique(dupes$reg_id))` of `r length(unique(tots_d$reg_id))` patient records reported at least one pair of visits on the same day (`r length(dupes$reg_id)` events
total). Of these, `r length(unique(dupes_true$reg_id))` patients had "true duplicates" where the data reported within each visit was identical.

The next step for dupes is to pivot longer, then find the variables that are *not* identical within the duplicates. This table shows the *fields* which do not match within a pair of duplicated events for the same registration.

```{r}
d_long <- dupes %>%
  group_by(reg_id) %>%
  arrange(event_date, last_updated_on) %>%
  mutate("one" = 1, "dupe_id" = cumsum(one)) %>%
  select(-one) %>%
  ungroup()

dupe_clash <- d_long %>%
  filter(dupe_id < 3) %>% # just those regs with exactly 2 dupe events
  mutate(across(everything(), as.character)) %>% # no dates or appt counts
  select(!ends_with("_date"), -appt_tally, -reg_visit_id) %>%
  pivot_longer(!c(reg_id, dupe_id), names_to = "var", values_to = "value") %>%
  pivot_wider(names_from = dupe_id, names_prefix = "dupe_", values_from = c(value)) %>%
  # filter(!is.na(dupe_1)) %>%
  mutate(match = case_when(
    is.na(dupe_1) & is.na(dupe_2) ~ TRUE,
    dupe_1 == dupe_2 ~ TRUE,
    TRUE ~ FALSE)) %>%
  filter(match == FALSE) %>%
  arrange(reg_id)

dupe_clash %>% count(var, sort = T)
```

The Last Updated date *could* be helpful for deduping, as could stored_by, created_by, and event status (COMPLETED locks the values, so if there is an error, another event could be created in the same place). But when those do not make clear which event to use, we fall back on the non-null BP. Two visits on the same day that both have non-null BP values will be difficult to merge automatically.

***THE RULES FOR MERGING TWO PATIENT VISITS ON THE SAME DATE***

1. If one of the events does not have a BP reading, all values in the event with a BP reading take priority (if a duped event has all BPs NA, it is removed from the dataset).
2. ACTIVE status events are generally opened after COMPLETED events to make a change. If it is the only ACTIVE event on that date, use this event. Priority = 3 (max).
3. For other clashes, count the nulls among the medication fields. If exactly one of the events has the fewest medication nulls, use that event. Priority = 2.
4.
If still cannot reconcile, use the first event by date last updated (or, if last updated on same day, random). Priority = 1```{r}# this chunk will apply the priority rules to flag priority eventsd_long_priorities <- d_long %>%filter(!is.na(bp_diastole_mmhg) &!is.na(bp_systole_mmhg)) %>%group_by(reg_id, event_date, event_status) %>%mutate(dupes_evs_with_this_status=n()) %>%ungroup() %>%mutate(dupe_meds_null_count =rowSums(is.na(across(contains("med"))))) %>%group_by(reg_id, event_date) %>%mutate(dupes_evs_mostMeds=min(dupe_meds_null_count)) %>%mutate(dupes_evs_withMedCount=n()) %>%mutate(dupe_id_priority =case_when( dupes_evs_with_this_status==1& event_status=="ACTIVE"~3, dupes_evs_mostMeds==dupe_meds_null_count & dupes_evs_withMedCount==1~2, dupe_id==min(dupe_id) ~1,TRUE~0 ) ) %>%ungroup()# WE WANT TO KEEP ALL THESE AND REMOVE OTHER DUPESd_long_keepers <-d_long_priorities %>%group_by(reg_id, event_date) %>%filter(dupe_id_priority==max(dupe_id_priority)) %>%ungroup()d_long_remove <- d_long %>%anti_join(d_long_keepers, by="reg_visit_id")# d_long_remove %>% colnames()d_long_keepers %>% janitor::tabyl(dupe_id_priority) %>%kable() %>% kableExtra::kable_styling()```We remove `r nrow(d_long_remove)` events from the main dataset and then recreate the appointment tally and visit IDs.```{r}tots<-tots %>%anti_join(d_long_remove, by="reg_visit_id") %>%# REDO THE APPT TALLY AND REG_VISIT_ID now that the dupes have been removedarrange(reg_id,event_date) %>%#make a running tally for each patient visitsgroup_by(reg_id) %>%mutate("one"=1, "appt_tally"=cumsum(one)) %>%select(-one) %>%ungroup() %>%mutate(reg_visit_id=paste0(reg_id,"_",appt_tally)) %>%group_by(reg_id) %>%add_count(name="total_patient_visits") %>%# counter for total patients visitsmutate(prev_event_date =lag(event_date, n=1, order_by=event_date)) %>%mutate(lead_event_date =lead(event_date, n=1, order_by=event_date)) %>%ungroup() #and need to add new lead lag dates```# Description## Table 1 Zero Followup Visits After 
BaselineProcess it for the **Table1** all registrations up to *April 1 2024.*Thus all patients have at least 90 days for a follow up Visit after Baseline Visit.">=1 Visits Post Baseline" here means at least 1 visit occurs later. The second visit can be within or after 90 days from baseline.```{r}dhis_subset <- tots %>%filter(registration_date >=dmy("01-01-2021")) %>%filter(registration_date <=dmy("01-04-2024"))dhis_table <- dhis_subset %>%mutate(patient_returned=if_else(total_patient_visits>1, ">=1 Visits Post Baseline","0 Visits Post Baseline" )) %>%mutate(bp_systole_mmhg=if_else(bp_systole_mmhg>400, 400, bp_systole_mmhg)) %>%mutate(sbp_level =case_when( bp_systole_mmhg >=400| bp_systole_mmhg <50~"error (>300 or < 50)", bp_systole_mmhg >=180& bp_systole_mmhg <400~"emergency [180-300)", bp_systole_mmhg >=140& bp_systole_mmhg <180~"very high [140-180)", bp_systole_mmhg >=120& bp_systole_mmhg <140~"high or elevated (120-140)",is.na(bp_systole_mmhg) ~NA,TRUE~"normal") ) %>%mutate(sbp_level=fct_relevel(sbp_level, c("normal","high or elevated (120-140)","very high [140-180)","emergency [180-300)", "error (>300 or < 50)"))) %>%mutate(has_diabetes=as.factor(does_patient_have_diabetes)) %>%mutate(has_phone=as.factor(has_phone)) %>%mutate(has_supporter=as.factor(has_supporter)) %>%mutate(age_group =as.factor(cut(age,breaks =c(0, 30, 45, 60, 75, 110, 100000),include.lowest = T,labels =c("Other", "30-44","45-59", "60-74", "75-110", "Other")))) %>%mutate(age_group=fct_shift(age_group, n=1)) %>%select(patient_returned, State, sex, age, age_group, appt_tally,"sbp_baseline"=sbp_level, bp_systole_mmhg,"previous_htn_treated"=hypertension_treated_in_the_past,"amlodipine_baseline"=med_amlodipine, has_phone, has_supporter, has_diabetes,days_of_medication, total_patient_visits) %>%filter(appt_tally==1)## Write it to RDS for the manuscriptwrite_rds(dhis_table, file=here::here("data","dhis_table.rds"))tab1<-table1::table1(~ sex + State + age + age_group + previous_htn_treated + 
sbp_baseline + bp_systole_mmhg + amlodipine_baseline + has_phone + has_supporter + has_diabetes +days_of_medication + total_patient_visits|factor(patient_returned),data=dhis_table)# t1flex(tab1)tab1```Count of patients returned after first visit.```{r}dhis_table %>%count(patient_returned) %>%mutate(pt=n/sum(n))```### Visits per patientExplore the data. Start by counting the number of visits per patient.This shows patients registered before a given date.```{r}tots_small <-tots %>%filter(registration_date >=ymd("2021-01-01")) %>%select(reg_id, total_patient_visits) %>%unique()seq_visits<-seq(ymd('2021-01-01'),ymd('2024-01-01'), by ='6 months')seq_func<-function(vec){ tots %>%ungroup() %>%filter(registration_date <=ymd(vec)) %>%select(reg_id, total_patient_visits) %>%unique() %>%count(total_patient_visits) %>%ggplot() +geom_col(aes(y=n, x=total_patient_visits)) +labs(title="Patient visits by cohort",subtitle=str_wrap(paste0("Registered Before ", as.character(vec)),20))}patchwork::wrap_plots(lapply(seq_visits, seq_func)) ``````{r}df<-data.frame(x=seq(1:40))df$y<-8100*df$x^-1tots_small %>%count(total_patient_visits) %>%left_join(df, by=c("total_patient_visits"="x")) %>%ggplot(aes(x=total_patient_visits)) +geom_col(aes(y=n), alpha=0.6) +geom_line(aes(y=y), color="blue", linetype=3, linewidth=0.8) +scale_x_continuous(limits=c(0,35)) +labs(title="Patients by total visit count",subtitle="NHCI Registrations Jan 1 2021 - Jul 1 2024",x="Visit count",y="Total patients") +annotate("rect", xmin=17, xmax=32, ymin=2000, ymax=4000, color="blue",fill="white", linetype=2) +annotate("text", x=25, y=3500, label="Decay Function: y ~ a * x^ -1", size=5, color="blue") +annotate("text", x=25, y=2500, label="where a = Count with 1 Visit", size=5, color="blue") +expand_limits(x =0, y =0) +coord_cartesian(expand =FALSE, clip ="off")my_ggsave("fig1_visitCounts",12)``````{r}count(tots_small, total_patient_visits) %>%ungroup() %>%mutate(percent=round(100*(n/sum(n)),4)) %>% DT::datatable(rownames 
    = FALSE)
```

```{r}
tots_small %>%
  mutate(count_level = case_when(
    total_patient_visits == 1 ~ "01 visit",
    total_patient_visits >= 2 & total_patient_visits < 4 ~ "02-03 visits",
    total_patient_visits >= 4 & total_patient_visits < 11 ~ "04-10 visits",
    total_patient_visits >= 11 & total_patient_visits < 46 ~ "11-45 visits",
    TRUE ~ NA
  )) %>%
  count(count_level) %>%
  mutate(percent = 100*(n/sum(n))) %>%
  kable() %>%
  kableExtra::kable_styling()

# 72816/126422
# top 20% of attendees by visit count attended 58% of all visits
```

Investigate the NCD Patient Status variable (patient follow-up tracker).

```{r}
# tots %>%
#   distinct(reg_id, ncd_patient_status) %>%
#   count(ncd_patient_status)

tots %>%
  count(ncd_patient_status) %>%
  mutate(pct = round(n/sum(n), 4)*100) %>%
  kable() %>%
  kableExtra::kable_styling()

# length(unique(tots$reg_id))
# 1100/30704
```

Narrowing in on registrations after Jan 1 2023, we can see a dropoff in attendance by sex.

```{r}
tots %>%
  ungroup() %>%
  arrange(desc(registration_date)) %>%
  filter(registration_date >= ymd("2023-01-01") & registration_date < ymd("2024-01-01")) %>%
  filter(!is.na(sex)) %>%
  # mutate(n_group =
  #   case_when(
  #     total_patient_visits==1 ~ "01 (Entry)",
  #     total_patient_visits==2 ~ "02",
  #     total_patient_visits >1 & total_patient_visits<=5 ~ "03-06",
  #     total_patient_visits >5 & total_patient_visits<=11 ~ "07-11",
  #     TRUE ~ "12+"
  #   )
  # ) %>%
  ungroup() %>%
  filter(total_patient_visits == appt_tally & appt_tally < 25) %>%
  group_by(sex, appt_tally) %>%
  summarise(count_by_ngroup = n()) %>%
  mutate(percent = prop.table(count_by_ngroup)*100) %>%
  mutate(cumulative_percent = cumsum(percent)) %>%
  ungroup() %>%
  as_tibble() %>%
  ggplot() +
  geom_line(aes(x = appt_tally, y = percent, color = sex)) +
  # geom_col(aes(x=appt_tally, y=percent, fill=sex), position="dodge") +
  # facet_grid(~sex, scales="fixed") +
  labs(title = "Patients by Sex and Total Number of Visits",
       subtitle = "Registrations in 2023 counting All Visits to April 1, 2024",
       x = "Total Number of Patient Visits in Time Period") +
  # theme_classic() +
  theme(legend.position = "bottom") +
  scale_x_continuous(breaks = seq(1, 22, by = 1))
```

```{r}
tots %>%
  count(LGA) %>%
  nrow()
```

### Monthly patient load per clinic

```{r}
dhis_pts_per_month <- tots %>%
  mutate(appt_month = lubridate::month(event_date)) %>% # transform to month
  group_by(appt_month, organisation_unit) %>%
  summarize(patients = n_distinct(reg_id),
            appts = n()) %>%
  ungroup() %>%
  mutate(appt_patient_ratio = appts/patients)

dhis_pts_per_month %>% summarytools::descr()
# %>% summarize(patient_avg=mean(patients),
#               appt_avg=mean(appts))

dhis_pts_per_month %>%
  ggplot() +
  geom_histogram(aes(x = appts), bins = 30) +
  labs(title = "Total appointments per clinic per month")
```

### Days & months enrolled per patient

Days enrolled per patient = days since entry at latest event + 90.

```{r}
dpv <- tots %>%
  select(reg_id, appt_tally, total_patient_visits, registration_date, event_date, days_since_entry) %>%
  filter(appt_tally == total_patient_visits & days_since_entry >= 0) %>%
  mutate(days_tracked = days_since_entry + 90) %>%
  mutate(patient_days_per_visit = days_tracked/total_patient_visits)

summary(dpv$patient_days_per_visit)

# tots %>%
#   filter(total_patient_visits>40) %>% View()

dpv %>%
  mutate(days_enrolled = if_else(total_patient_visits == appt_tally, event_date - registration_date, NA)) %>%
  filter(!is.na(days_enrolled) & total_patient_visits > 1) %>%
  # filter(patient_days_per_visit>=0 & patient_days_per_visit< 91) %>%
  mutate(months_per_visit = patient_days_per_visit/30) %>%
  ggplot() +
  geom_histogram(aes(x = months_per_visit)) +
  labs(title = "Months Per Visit, By Patient",
       subtitle = "Total number of months enrolled divided by number of visits")
```

Cumulative density of days since last visit.

```{r}
tots %>%
  filter(appt_tally != 1) %>%
  mutate(days_since_last = as.integer(event_date - prev_event_date)) %>%
  filter(days_since_last < 50) %>%
  # count(days_since_last) %>%
  ggplot(aes(x = days_since_last)) +
  geom_step(aes(y = after_stat(y)), stat = "ecdf") +
  labs(title = "Cumulative Density of Events by days since latest visit",
       subtitle = "All visits occurring less than 50
days from last")
# ggplot() +
#   geom_histogram(aes(x=days_since_last))
```

```{r}
tots %>%
  filter(appt_tally > 1) %>%
  mutate(days_since_last = event_date - prev_event_date) %>%
  # filter(days_since_last>7) %>%
  mutate(days_since_level = case_when(
    days_since_last == 0 ~ "_Same Day",
    days_since_last >= 1 & days_since_last < 8 ~ "01-07 Days",
    days_since_last >= 8 & days_since_last < 21 ~ "08-20 Days",
    days_since_last >= 21 & days_since_last < 27 ~ "21-26 Days",
    days_since_last >= 27 & days_since_last < 37 ~ "27-36 Days",
    days_since_last >= 37 & days_since_last < 90 ~ "37-89 Days",
    days_since_last >= 90 ~ "90+ Days",
    TRUE ~ "_Same Day"
  )) %>%
  count(days_since_level) %>%
  mutate(percent = 100*(n/sum(n))) %>%
  ggplot(aes(x = days_since_level, y = percent)) +
  geom_col() +
  # geom_text(aes(label = round(percent,2)), color="white") +
  labs(title = "Follow Up Visits: Days After Previous Visit",
       subtitle = "Visits 2021-2024, Excluding Duplicate Visits",
       x = "",
       y = "Percent of Follow Up visits") +
  coord_flip()
```

## Define the outcome event: Attrition

- No return visit after entry visit
- **No return within 1-90 days of a visit (Loose Window)**
- No return within 1-37 days of any visit (Strict Window)
- No return within 365 days of latest visit (Original Window)

### Patient status

Here we check what patient statuses follow the latest visit.

```{r}
tots %>%
  filter(appt_tally == total_patient_visits) %>%
  count(patient_followup_status, ncd_patient_status)
```

How to interpret *CANCELLED or COMPLETED* registrations remains a question.

### Scheduled and Actual Visits

Days between actual visit date and scheduled date.

```{r}
tots %>%
  ungroup() %>%
  filter(appt_tally != 1) %>%
  mutate(days_from_sched = as.numeric(event_date - scheduled_date)) %>%
  mutate(days_from_sched_range = case_when(
    days_from_sched < 0 ~ "Earlier_than_scheduled",
    days_from_sched == 0 ~ "0_Same_Day",
    days_from_sched >= 1 & days_from_sched <= 7 ~ "1.0_Day_to_1_Week",
    days_from_sched >= 8 & days_from_sched <= 28 ~ "1.1_to_4_Weeks",
    days_from_sched >= 29 & days_from_sched <= 7*12 ~ "4.1_to_12_Weeks", # 7*12 = 84 days = 12 weeks
    TRUE ~ "Other"
  )) %>%
  # count(days_from_sched_range)
  janitor::tabyl(days_from_sched_range) %>%
  kable() %>%
  kableExtra::kable_styling()
```

Most visits are on the same day as they were scheduled (spontaneous visits?) or earlier.

```{r}
tots %>%
  ungroup() %>%
  filter(appt_tally != 1) %>%
  mutate(days_from_latest_schedDate = as.numeric(scheduled_date - prev_event_date)) %>%
  mutate(days_from_latest_eventDate = as.integer(event_date - prev_event_date)) %>%
  filter(days_from_latest_eventDate < 60 & days_from_latest_eventDate > 0) %>%
  filter(days_from_latest_schedDate < 60 & days_from_latest_schedDate > 0) %>%
  select(reg_visit_id, starts_with("days_from")) %>%
  pivot_longer(cols = starts_with("days_from"), names_to = "property", values_to = "value") %>%
  ggplot(aes(x = value, fill = property)) +
  # facet_wrap(~property) +
  geom_histogram(position = "dodge", alpha = 0.4, bins = 50) +
  labs(title = "Histogram of Follow Up visits",
       subtitle = str_wrap("by days between scheduled visit date and previous visit date, plus days between current visit date and previous visit date", 90),
       caption = "Removing values < 0 or > 60") +
  theme(legend.position = "bottom")
```

```{r}
tots %>%
  ungroup() %>%
  filter(appt_tally != 1) %>%
  mutate(days_from_latest_schedDate = as.numeric(scheduled_date - prev_event_date)) %>%
  mutate(days_from_latest_eventDate = as.integer(event_date - prev_event_date)) %>%
  # filter(days_from_latest_eventDate<90 & days_from_latest_eventDate>0) %>%
  filter(days_from_latest_schedDate > 0) %>%
  ggplot() +
  geom_step(aes(x = days_from_latest_eventDate, y = after_stat(y), color = "Event Date"),
            stat = "ecdf", linewidth = 1, alpha = 0.6) +
  geom_step(aes(x = days_from_latest_schedDate, y = after_stat(y), color = "Scheduled Date"),
            stat = "ecdf", linetype = 3, linewidth = 1) +
  labs(title = "Followup visit after previous: by event date and scheduled date",
       subtitle = "Cumulative incidence of followup visits by days from previous",
       x = "Days since previous visit",
       y = "") +
  scale_color_manual(name = "Visit feature",
                     values = c("Event Date" = "black", # match the name of the element to the name in the aes
                                "Scheduled Date" = "blue")) +
  coord_cartesian(xlim = c(0, 110),
                  ylim = c(0, 1)) +
  scale_x_continuous(breaks = seq(0, 105, by = 15)) +
  theme(legend.position = "bottom")

my_ggsave("fig2_incidence_after_previous", 10)
```

53% of followup visits are scheduled to occur 27-36 days after the previous visit, with 38% scheduled at precisely 28 days later (the system default scheduled date). 44% of followup visits actually occur within that window.

NEXT STEPS:

- Create the windows for timeliness for each subsequent visit
- Per visit-opportunity analysis of timeliness

## Timely Windows

```{r}
tots_timely <- tots %>%
  ungroup() %>%
  filter(appt_tally == 1 | event_date - prev_event_date >= 7) %>% # remove visits within 7 days of last
  mutate(nextdate_37 = event_date + 37, # create 37 and 90 day windows
         nextdate_90 = event_date + 90,
         nextdate_37_kept = if_else(lead_event_date <= nextdate_37, 1, 0),
         nextdate_90_kept = if_else(lead_event_date <= nextdate_90, 1, 0)) %>%
  mutate(across(contains("kept"), ~replace(., is.na(.), 0))) %>%
  select(sex, State, appt_tally, total_patient_visits, ncd_patient_status,
         contains("reg"), contains("event_date"), contains("nextdate"), everything())
```

Adding month and year variables to the expected window for subsequent visits, and checking whether the patient attended their next visit within that window.

```{r}
tots_timely <- tots_timely %>%
  mutate(exp_year = lubridate::year(nextdate_90)) %>%
  mutate(exp_month = lubridate::month(nextdate_90)) %>%
  mutate(exp_year_month = make_date(year = exp_year, month = exp_month, day = "01")) %>%
  mutate(days_since_last = as.integer(event_date - prev_event_date))
```

## Outcome Events and Censoring Events

This next part is crucial. The **FIRST** time a patient exceeds 90 days before the next visit, they are LTFU at that 90-day mark. This is the **OUTCOME EVENT**; subsequent visits for that patient are discarded. If a patient is marked **TRANSFERRED** or **DIED**, they are *CENSORED* after their last visit.
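As a sanity check on the first-missed-window rule, the carry-forward logic can be sketched on toy data. Everything here is hypothetical (the `reg_id`s and `nextdate_90_kept` flags are invented); the `fill()`/`cumsum()` pattern mirrors the pipeline used in this script.

```{r}
library(dplyr)
library(tidyr)

# Toy data (hypothetical): patient A misses the 90-day window after visit 2
toy <- tibble(
  reg_id           = c("A", "A", "A", "B", "B"),
  appt_tally       = c(1, 2, 3, 1, 2),
  nextdate_90_kept = c(1, 0, 1, 1, 1)
)

toy_flagged <- toy %>%
  mutate(dropout = if_else(nextdate_90_kept == 0, 1, NA_real_)) %>%
  group_by(reg_id) %>%
  fill(dropout, .direction = "down") %>%             # carry the first miss forward
  mutate(dropout = replace_na(dropout, 0),
         visits_since_dropout = cumsum(dropout)) %>% # values > 1 mark post-dropout visits
  ungroup()

toy_flagged$visits_since_dropout # 0 1 2 0 0
```

Patient A's third visit ends up with `visits_since_dropout = 2`, so a `visits_since_dropout < 2` filter keeps the dropout visit itself but discards everything after it, while patient B (no missed window) is untouched.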
If the last expected visit falls after July 1, 2024, the patient is likewise *CENSORED* after their last visit.

```{r}
tots_timely <- tots_timely %>%
  arrange(reg_id, appt_tally) %>%
  mutate(nextdate_90_dropout = if_else(nextdate_90_kept == 0, 1, NA)) %>%
  group_by(reg_id) %>%
  fill(nextdate_90_dropout, .direction = "down") %>%
  mutate(nextdate_90_dropout = replace_na(nextdate_90_dropout, 0)) %>%
  mutate(visits_since_dropout = cumsum(nextdate_90_dropout)) %>% # visit before dropout == 1
  ungroup() %>%
  mutate(censor = case_when(
    appt_tally == total_patient_visits & ncd_patient_status %in% c("Transferred", "Died") ~ 1,
    nextdate_90 > ymd("2024-07-01") ~ 1,
    TRUE ~ 0
  ))

tots_timely %>%
  select(reg_id, event_date, days_since_last, appt_tally, visits_since_dropout,
         contains("nextdate_90"), censor) %>%
  tail(15)

tots_timely %>%
  count(censor) %>%
  kable() %>%
  kableExtra::kable_styling()

# 12292/(149856+12292)
# tots_timely %>%
#   filter(visits_since_dropout < 2) %>%
#   distinct(reg_id)
```

We can now calculate the **RATES** of attrition.

### Attrition Rate charts

This removes visits that are censored or after dropout.

```{r}
tots_timely_rates_prep <- tots_timely %>%
  filter(censor == 0) %>%
  filter(!is.na(sex)) %>%
  filter(visits_since_dropout < 2) # dropout occurs only once; if censored, not included in the denominator

find_dropout_rates <- function(data, ...){
  data %>%
    group_by(exp_year_month, ...) %>%
    count(nextdate_90_kept) %>%
    mutate(rate = n/sum(n)) %>%
    ungroup()
}

tots_timely_rates_all <- tots_timely_rates_prep %>%
  find_dropout_rates()

tots_timely_rates_all %>%
  group_by(nextdate_90_kept) %>%
  summarise(avg_monthly_rate = mean(rate)) %>%
  kable() %>%
  kableExtra::kable_styling()
```

**The average monthly INCIDENCE RATE of attrition events (dropouts) is 26%.**

Now a further breakdown of dropout incidence by category.

```{r}
tots_timely_rates_ss <- tots_timely_rates_prep %>%
  find_dropout_rates(sex, State)

tots_timely_rates_appt_tally <- tots_timely_rates_prep %>%
  filter(appt_tally < 6) %>%
  mutate(appt_tally = as.factor(appt_tally)) %>%
  find_dropout_rates(appt_tally)

tots_timely_rates_state_appt_tally <- tots_timely_rates_prep %>%
  filter(appt_tally < 5) %>%
  mutate(appt_tally = as.factor(appt_tally)) %>%
  find_dropout_rates(appt_tally, State)

tots_timely_rates_hc <- tots_timely_rates_prep %>%
  add_count(exp_year_month, LGA, HC) %>%
  filter(n > 9) %>%
  find_dropout_rates(State, LGA, HC)

tots_timely_rates_lga <- tots_timely_rates_prep %>%
  add_count(exp_year_month, LGA) %>%
  filter(n > 9) %>%
  find_dropout_rates(State, LGA)
```

```{r}
ggcoeff <- 10000
ggbarscol <- RColorBrewer::brewer.pal(4, "Dark2")[3]
gglinecol <- RColorBrewer::brewer.pal(4, "Dark2")[1]

attr_90day_withRegs <- tots_timely_rates_prep %>%
  filter(registration_date <= ymd("2024-06-30")) %>%
  # filter(registration_date>=ymd("2021-04-01")) %>%
  # filter(event_date>=ymd("2021-04-01")) %>%
  filter(exp_year_month <= ymd("2024-06-30")) %>%
  mutate(reg_year = lubridate::year(registration_date)) %>%
  mutate(reg_month = lubridate::month(registration_date)) %>%
  mutate(reg_year_month = make_date(year = exp_year, month = exp_month, day = "01")) %>%
  add_count(reg_year_month, name = "new_registrations") %>%
  find_dropout_rates(reg_year_month, new_registrations) %>%
  filter(nextdate_90_kept == 0) %>%
  ggplot() +
  geom_col(aes(x = reg_year_month, y = new_registrations/ggcoeff),
           fill = ggbarscol, alpha = 0.5) +
  geom_line(aes(x = exp_year_month, y = rate), col = gglinecol, stat = "identity") +
  labs(title = "90-day attrition rate with monthly
registrations",
       subtitle = "Attrition rate is by month at end of 90-day return window") +
  labs(x = "") +
  # theme_classic() +
  theme(legend.position = "bottom",
        axis.title.y.left = element_text(color = gglinecol),
        axis.title.y.right = element_text(color = ggbarscol)) +
  scale_y_continuous(name = "Attrition Rate",
                     labels = scales::label_percent(),
                     sec.axis = sec_axis(~.*ggcoeff, name = "Monthly Registrations")) +
  scale_x_date(breaks = scales::breaks_pretty(12), labels = scales::label_date_short())

attr_90day_withRegs
```

Remember the responses from RTSL here:

- *When were lists of LTFU for outreach generated?* We generated the first list of LTFU for facilities to track patients in April 2021 using the 2021 Q1 NHCI Excel calculator data.
- *When was the dashboard deployed in both states?* March 2022
- *When were legacy data imported?* February 2022
- *When was the DHIS2-only entry pilot started, and was it expanded or ended?* This commenced in August 2023 and ended in December 2023. The pilot just concluded and a report is being written by the consultants.
This report would determine how and when we would expand this across the entire program.

```{r}
attr_90day_withRegs +
  annotate("rect", xmin = as.Date("2021-04-01"), xmax = as.Date("2021-09-01"),
           ymin = 0.42, ymax = 0.47, alpha = .2) +
  annotate("rect", xmin = as.Date("2021-12-21"), xmax = as.Date("2022-05-01"),
           ymin = 0.42, ymax = 0.47, alpha = .2) +
  annotate("rect", xmin = as.Date("2022-06-01"), xmax = as.Date("2023-04-01"),
           ymin = 0.42, ymax = 0.47, alpha = .2) +
  annotate("rect", xmin = as.Date("2023-07-15"), xmax = as.Date("2024-02-01"),
           ymin = 0.42, ymax = 0.47, alpha = .2) +
  # geom_text(data = data.frame(x = as.Date("2021-06-01"),
  #   y = 0.45, label = "1st LTFU List"), mapping = aes(x = x, y = y, label = label),
  #   size = 2.17, hjust = 0.45, inherit.aes = FALSE)
  annotate("text", x = as.Date("2021-06-15"), y = 0.44, label = "1st LTFU Lists", size = 3.3) +
  annotate("text", x = as.Date("2022-02-25"), y = 0.44, label = "Legacy data", size = 3.3) +
  annotate("text", x = as.Date("2022-11-01"), y = 0.44, label = "Growth and Scale", size = 3.5) +
  annotate("text", x = as.Date("2023-10-15"), y = 0.44, label = "Naira crisis", size = 3.3)

my_ggsave("fig3_attritionRegs_full", 12)
```

Seasonal trends?
Seems to be a dip in spring/summer.

```{r}
tots_timely_rates_prep %>%
  find_dropout_rates(exp_month, exp_year) %>%
  filter(nextdate_90_kept == 0) %>%
  mutate(exp_month = as.integer(exp_month)) %>%
  mutate(exp_year = as.factor(exp_year)) %>%
  ggplot() +
  geom_line(aes(x = exp_month, y = rate, color = exp_year, group = exp_year, linetype = exp_year)) +
  labs(title = "90-day attrition rate by year") +
  scale_y_continuous(labels = scales::label_percent()) +
  scale_x_continuous(labels = scales::label_number(accuracy = 1)) +
  labs(x = "End month of 90-day attrition window",
       y = "Attrition Rate") +
  # theme_classic() +
  theme(legend.position = "bottom")
```

By sex and state.

```{r}
tots_timely_rates_ss %>%
  filter(nextdate_90_kept == 0) %>%
  ggplot() +
  geom_line(aes(x = exp_year_month, y = rate, color = sex)) +
  labs(title = "90-day attrition rate by month") +
  scale_x_date() +
  facet_grid(~State) +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "Year & Month of Expected Visit",
       y = "Attrition Rate") +
  # theme_classic() +
  theme(legend.position = "bottom")

# ggsave("attr_90day.jpeg", bg = "transparent", width = 20, height = 10, units = 'cm')
```

By state and visit number.

```{r}
tots_timely_rates_state_appt_tally %>%
  filter(nextdate_90_kept == 0) %>%
  ggplot() +
  geom_line(aes(x = exp_year_month, y = rate, color = appt_tally)) +
  labs(title = "90-day attrition rate by month") +
  scale_x_date() +
  facet_grid(~State) +
  labs(x = "Year & Month of Expected Visit",
       y = "Attrition Rate", color = "Latest Visit Count") +
  # theme_classic() +
  scale_y_continuous(labels = scales::label_percent(), limits = c(0, 0.75)) +
  theme(legend.position = "bottom")

# ggsave("attr_90day.jpeg", bg = "transparent", width = 20, height = 10, units = 'cm')
```

Attrition by latest visit number of patient.

```{r}
tots_timely_rates_appt_tally %>%
  filter(nextdate_90_kept == 0) %>%
  ggplot() +
  geom_line(aes(x = exp_year_month, y = rate, group = appt_tally, color = appt_tally)) +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(title = "90-day attrition rate by month",
       y = "Attr Rate")
```

```{r}
tots_timely_rates_lga %>%
  filter(nextdate_90_kept == 0) %>%
  mutate(LGA = str_remove_all(LGA, " Local Government Area")) %>%
  mutate(LGA = fct_reorder(LGA, rate, .fun = mean)) %>%
  ggplot(aes(x = LGA, y = rate, color = State)) +
  geom_boxplot() +
  labs(title = "90-day monthly attrition rates by LGA",
       y = "Attrition Rate",
       caption = "Removed all healthCentre-month pairs with less than 10 expected visits") +
  facet_wrap(~State, ncol = 1, scales = "free_y") +
  scale_y_continuous(labels = scales::label_percent()) +
  coord_flip() +
  # theme_classic() +
  theme(legend.position = "none")
```

Monthly HC attrition rates.

```{r}
tots_timely_rates_hc %>%
  filter(nextdate_90_kept == 0) %>%
  mutate(LGA = str_remove_all(LGA, " Local Government Area")) %>%
  mutate(LGA = fct_reorder(LGA, rate, .fun = mean)) %>%
  ggplot(aes(x = LGA, y = rate, color = State)) +
  geom_boxplot() +
  labs(title = "90-day monthly attrition rates at Health Centre Level",
       subtitle = "Grouped by LGA and State of Health Centre",
       caption = "Removed all healthCentre-month pairs with less than 10 expected visits",
       y = "Attrition Rate") +
  scale_y_continuous(labels = scales::label_percent()) +
  facet_wrap(~State, ncol = 1, scales = "free_y") +
  coord_flip() +
  # theme_classic() +
  theme(legend.position = "none")
```

Mean and SD.

```{r}
tots_timely_rates_hc %>%
  filter(nextdate_90_kept == 0) %>%
  group_by(State) %>%
  summarise(mean = mean(rate),
            sd = sd(rate)) %>%
  kable() %>%
  kableExtra::kable_styling()
```

### Repeat Dropout Patients?

Just to test: what is the OR of experiencing a dropout for patients who have already experienced one, compared to those who have not?

```{r}
tots_timely %>%
  select(reg_id, appt_tally, visits_since_dropout, nextdate_90_kept) %>%
  mutate(nextdate_90_notMet = if_else(nextdate_90_kept == 1, 0, 1)) %>%
  mutate(post_dropout = if_else(visits_since_dropout > 1, 1, 0)) %>%
  janitor::tabyl(post_dropout, nextdate_90_notMet) %>%
  janitor::adorn_title()

(21858*79526) / (30560*29274)
```

The odds of dropout at each appointment are 1.94-fold for patients who have previously been delayed more than 90 days and
later returned, compared to those who never experienced a first dropout event.

# Proper Treatment Protocol?

Care should follow the [national guideline](https://dhis2.org/nigeria-hypertension-control/).

Create a visit-level variable for each step in the care protocol, indicating whether the patient met the criteria for that step and whether the correct meds were offered at the visit. This could be added to the prediction model.

```{r}
# Should follow guideline here https://dhis2.org/nigeria-hypertension-control/
tots_timely_quality <- tots_timely %>%
  filter(registration_date >= ymd("2021-01-01")) %>%
  mutate(p_step1_need = case_when(
    appt_tally == 1 & ((bp_systole_mmhg >= 140 & bp_systole_mmhg < 160) |
                         (bp_diastole_mmhg >= 90 & bp_diastole_mmhg < 100)) ~ 1,
    TRUE ~ 0
  )) %>%
  mutate(p_step2_need = case_when(
    appt_tally == 2 & days_since_last >= 14 & (bp_systole_mmhg >= 140 | bp_diastole_mmhg >= 90) ~ 1,
    (appt_tally == 1 & ((bp_systole_mmhg >= 160 & bp_systole_mmhg < 180) |
                          (bp_diastole_mmhg >= 100 & bp_diastole_mmhg < 110))) ~ 1,
    TRUE ~ 0
  )) %>%
  mutate(p_step3_need = case_when(
    appt_tally == 3 & days_since_last >= 14 & (bp_systole_mmhg >= 140 | bp_diastole_mmhg >= 90) ~ 1,
    (appt_tally == 1 & ((bp_systole_mmhg >= 180) | (bp_diastole_mmhg >= 110))) ~ 1,
    TRUE ~ 0
  )) %>%
  mutate(p_step4_need = case_when(
    appt_tally == 4 & days_since_last >= 14 & (bp_systole_mmhg >= 140 | bp_diastole_mmhg >= 90) ~ 1,
    TRUE ~ 0
  )) %>%
  # escalation criteria for first visit
  mutate(p_step2_need = if_else(p_step3_need == 1, 0, p_step2_need)) %>%
  mutate(p_step1_need = if_else(p_step2_need == 1, 0, p_step1_need)) %>%
  # now flag the appropriate treatment for each step
  mutate(p_step1_met = case_when(
    p_step1_need == 1 & med_amlodipine == "Amlodipine 5mg OD" ~ 1,
    TRUE ~ 0
  )) %>%
  mutate(p_step2_met = case_when(
    p_step2_need == 1 & med_amlodipine == "Amlodipine 5mg OD" &
      (med_losartan == "Losartan 50mg" | (sex == "Female" & age < 45)) ~ 1,
    TRUE ~ 0
  )) %>%
  mutate(p_step3_met = case_when(
    p_step3_need == 1 & med_amlodipine == "Amlodipine 10mg OD" &
      (med_losartan == "Losartan 100mg" | (sex == "Female" & age < 45)) ~ 1,
    TRUE ~ 0
  )) %>%
  mutate(p_step4_met = case_when(
    p_step4_need == 1 &
      med_amlodipine == "Amlodipine 10mg OD" &
      med_hydrochlorothiazide == "Hydrochlorathiazide 25mg" &
      (med_losartan == "Losartan 100mg" | (sex == "Female" & age < 45)) ~ 1,
    TRUE ~ 0
  ))

r2 <- function(x){round(x, 3)}

tots_timely_quality %>%
  filter(appt_tally < 5) %>%
  group_by(nextdate_90_kept) %>%
  summarise(p_step1_rate = sum(p_step1_met)/sum(p_step1_need),
            p_step2_rate = sum(p_step2_met)/sum(p_step2_need),
            p_step3_rate = sum(p_step3_met)/sum(p_step3_need),
            p_step4_rate = sum(p_step4_met)/sum(p_step4_need)) %>%
  mutate(across(where(is.double), r2)) %>%
  kable() %>%
  kableExtra::kable_styling()
```

While we're at it, add the change in SBP and DBP since baseline. This would be a quality indicator too.

```{r}
tots_timely_quality <- tots_timely_quality %>%
  mutate(bp_systole_mmhg = if_else(bp_systole_mmhg > 300, 300, bp_systole_mmhg)) %>%
  mutate(bp_diastole_mmhg = if_else(bp_diastole_mmhg > 200, 200, bp_diastole_mmhg)) %>%
  group_by(reg_id) %>%
  mutate(sbp_baseline = if_else(appt_tally == 1, bp_systole_mmhg, NA)) %>%
  fill(sbp_baseline, .direction = "down") %>%
  mutate(dbp_baseline = if_else(appt_tally == 1, bp_diastole_mmhg, NA)) %>%
  fill(dbp_baseline, .direction = "down") %>%
  ungroup() %>%
  mutate(sbp_change_from_baseline = bp_systole_mmhg - sbp_baseline,
         dbp_change_from_baseline = bp_diastole_mmhg - dbp_baseline)
```

# Weather data

Now add in the daily weather data from the open-meteo.com API, from Jan 1 2021 to Jul 1 2024. This pulled max temperature, max apparent temperature (with humidity), rain total, and rain hours for the centroid coordinates of each LGA. For details see the companion script.

```{r}
# import weather data and clean the lga names
weather_lga_centroids <- read_rds(here::here("data", "weather_lga_centroids.rds")) %>%
  ungroup() %>%
  mutate(lganame = str_squish(lganame)) %>%
  mutate(lganame = str_remove_all(lganame, "\\/Ota"), # fix this one name
         lganame = str_replace_all(lganame, "Nassarawa", "Nasarawa"), # fix this name
         lganame = str_replace_all(lganame, "Dambatta", "Danbatta"), # fix this name
         lganame = str_replace_all(lganame, "Shagamu", "Sagamu")) %>% # fix this name
  rename_with(~paste0("cli_", .), max_temp_c:last_col()) %>%
  mutate(lga_37date = str_c(lganame, date, sep = "_")) %>%
  arrange(lga_37date) %>%
  select(lga_37date, starts_with("cli_"))

# clean lga names in main dataset and merge with weather
tots_timely_qual_climate <- tots_timely_quality %>%
  mutate(LGA = str_remove_all(LGA, " Local Government Area ")) %>%
  mutate(LGA = str_sub(LGA, start = 4, end = 50)) %>%
  mutate(LGA = str_squish(LGA)) %>%
  mutate(lga_37date = str_c(LGA, as.character(nextdate_37), sep = "_")) %>%
  arrange(lga_37date) %>%
  left_join(weather_lga_centroids, by = "lga_37date")
```

# Market data

Now add in the weekly market data from the [FEWS NET API](https://fdw.fews.net/en/) for Jan 1 2021 to Jul 1 2024. This is official FAO data on food and staple good prices at local markets, used to assess food security. The prices are weekly, but have been recast as a 10-day rolling average.

Included here are weekly subnational price data from Kano State and Lagos State for nine staples, in Naira:

- Diesel
- Gasoline
- Millet
- Rice (Milled and Broken)
- Maize Grain (Yellow and White)
- Sorghum (Brown and White)
- Yams

According to an [FAO food systems profile on Nigeria](https://openknowledge.fao.org/server/api/core/bitstreams/e5cd1ce0-8d1a-439b-99ad-57b83ef0b4a6/content) (pg 5), these are all staple foods in Nigeria.

The price of oil, both globally and domestically in Nigeria, affects the entire economy, including the price of transportation, staple foods, healthcare, and in general the cost of living.
These prices could therefore affect ability or willingness to access HTN services.

Note that data were not available for Ogun State, but Lagos borders Ogun to the south and has market data more consistent with national rates than Oyo to the north. For details see the companion script.

```{r}
fews <- read_rds(here::here("data", "fews_ng_prices_2021_2024.rds")) %>%
  ungroup() %>%
  filter(date > ymd("2021-02-01")) %>%
  rename_with(~paste0("mkt_", .), 3:last_col()) %>%
  mutate(admin_1 = if_else(admin_1 == "Lagos", "Ogun", "Kano")) %>%
  arrange(admin_1, date) %>%
  mutate(state_37date = str_c(admin_1, date, sep = "_")) %>%
  select(state_37date, starts_with("mkt")) %>%
  arrange(state_37date)

tots_timely_qual_cli_mkt <- tots_timely_qual_climate %>%
  # clean up state name to merge with market prices
  arrange(State, nextdate_37) %>%
  mutate(State = str_remove_all(State, " State"),
         State = str_sub(State, 5, 50),
         State = str_squish(State)) %>%
  mutate(state_37date = str_c(State, nextdate_37, sep = "_")) %>%
  arrange(state_37date) %>%
  left_join(fews, by = "state_37date") %>%
  select(reg_id, reg_visit_id, registration_date, # reorder some vars
         event_date, appt_tally, everything())
```

Are weather and market prices 28-37 days after the latest visit different for patients who attend their next visit within 0-37 days of that visit versus those who fail to attend? A t-test and effect sizes might give a general idea.

```{r}
pop_params_dropouts_37days_prep <- tots_timely_qual_cli_mkt %>%
  filter(censor == 0) %>%
  filter(visits_since_dropout < 1) %>% # remove censored obs or those that already dropped out
  mutate(nextdate_37_kept = as.factor(nextdate_37_kept)) %>%
  select(State, nextdate_37_kept, ends_with("pre10day_mean")) %>%
  rowid_to_column() %>%
  pivot_longer(cols = ends_with("pre10day_mean"),
               names_to = "var",
               values_to = "value")

# boxplot
pop_params_dropouts_37days_prep %>%
  ggplot() +
  aes(x = nextdate_37_kept, y = value) +
  geom_boxplot() +
  facet_wrap(~var, scales = "free_y") +
  labs(title = "Next visit date kept?",
       subtitle = "by price or weather parameter")
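# (Illustrative aside) Cohen's d is the difference in group means divided by the
# pooled standard deviation. The cohens_d() call below likely comes from the
# {effectsize} package (assumption); this hand-rolled version is only a sketch
# for cross-checking magnitudes, not the package implementation:
cohens_d_manual <- function(x, g){
  m <- tapply(x, g, mean, na.rm = TRUE)
  v <- tapply(x, g, var, na.rm = TRUE)
  n <- tapply(x, g, function(z) sum(!is.na(z)))
  pooled_sd <- sqrt(((n[1] - 1)*v[1] + (n[2] - 1)*v[2]) / (n[1] + n[2] - 2))
  unname((m[1] - m[2]) / pooled_sd)
}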
# p-values
pop_params_pvalues <- pop_params_dropouts_37days_prep %>%
  group_by(var) %>%
  reframe(t_test = list(t.test(value ~ nextdate_37_kept, data = cur_data()))) %>%
  mutate(tidy_result = map(t_test, broom::tidy)) %>%
  unnest(tidy_result)

# effect sizes
pop_params_effect_sizes <- pop_params_dropouts_37days_prep %>%
  group_by(var) %>%
  reframe(cohens_d = cohens_d(value ~ nextdate_37_kept, data = cur_data())) %>%
  unnest("cohens_d")

pop_params_pvalues %>%
  select(var, estimate, p.value)

pop_params_effect_sizes

# use these for the model
pop_params_top <- pop_params_effect_sizes %>%
  filter(abs(Cohens_d) > 0.11) %>%
  pluck("var")

print(paste0("Consider these vars for model: ", str_c(pop_params_top, collapse = ', ')))

readr::write_lines(pop_params_top, "cli_mkt_params_for_model.txt")
```

A t-test is performed across all of these population-level parameters, testing whether the means differ between visits where the next visit was a success and where it was a failure (success defined as next visit \<= 37 days from the current visit). All price and weather parameters have a small p-value, except the price of gasoline. However, the effect sizes (Cohen's d) are small for all parameters. Max temp, rain total, rain hours, and millet, rice, and sorghum prices have the highest effect sizes (\>0.11). Max temp shows the inverse of the expected association (more attendance on comparatively hot days).

For greater effect size, we may need to respecify. For example, instead of 10-day averages, use precise weather variables on the date 28 days after the latest appointment, with an outcome of visiting on that exact day.

Here is an example: for a given 10-day window, is the average price of milled rice at the state level related to the rate of patients returning for subsequent visits within that window?

```{r}
tots_timely_rates_prep_detailed <- tots_timely_qual_cli_mkt %>%
  filter(censor == 0) %>%
  filter(!is.na(sex)) %>%
  filter(visits_since_dropout < 2) # dropout occurs only once,
  # and if censored, not included in the denominator

find_dropout_rates_detailed <- function(data, ...){
  data %>%
    mutate(nextdate_37_kept = factor(nextdate_37_kept, levels = c(0, 1))) %>%
    group_by(nextdate_37, ...) %>%
    count(nextdate_37_kept) %>%
    mutate(rate = n/sum(n)) %>%
    ungroup() %>%
    filter(nextdate_37_kept == 1)
}

tots_timely_rates_prep_detailed %>%
  find_dropout_rates_detailed(State, mkt_rice_milled_pre10day_mean) %>%
  filter(n > 9) %>%
  ggplot(aes(x = mkt_rice_milled_pre10day_mean, y = rate)) +
  geom_jitter(alpha = 0.5) +
  geom_smooth() +
  expand_limits(y = c(0, 1)) +
  labs(title = "State Level 37-day Timely Return Rates, By market prices for milled rice",
       subtitle = "excluding date ranges where the State expected less than 10 patients")
```

# Review and Save data {#sec-review-save}

In a final review, check the distributions of all the variables.

```{r}
#| results: "asis"
dfs <- tots_timely_qual_cli_mkt %>%
  summarytools::dfSummary(plain.ascii = FALSE,
                          style = "grid",
                          max.string.width = 15)

dfs$Variable <- gsub("_", " ", dfs$Variable)

print(dfs, method = "render", varnumbers = FALSE, valid.col = FALSE,
      max.tbl.width = 40,
      caption = "Removing _underscores_ from varnames for space")

# dhis_subset %>% skimr::skim()
# dhis_clean %>%
#   select(newage, reg_id) %>%
#   distinct() %>%
#   mutate(newage=as.numeric(newage)) %>%
#   skimr::skim()
```

```{r}
readr::write_rds(tots_timely_qual_cli_mkt, here::here("data", "tots_timely_qual_cli_mkt.rds"))

sessionInfo()
```
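As a final safeguard, the write/read/assert pattern for the saved object can be sketched with a temp file so it runs anywhere; the real artifact is the `data/tots_timely_qual_cli_mkt.rds` file written above, and the column names checked are assumptions drawn from the pipeline.

```{r}
# Round-trip check sketch: write a tiny tibble, read it back, assert its shape.
# (Temp file keeps the sketch self-contained; swap in the here::here() path
# used above to check the real file.)
tmp <- tempfile(fileext = ".rds")
readr::write_rds(dplyr::tibble(reg_id = c("A", "B"), censor = c(0, 1)), tmp)

check <- readr::read_rds(tmp)
stopifnot(
  nrow(check) == 2,
  all(c("reg_id", "censor") %in% names(check))
)
```

The same `stopifnot()` call, pointed at the real file, gives a cheap guard that downstream modeling scripts fail loudly if the saved object is missing expected columns.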