Table of Contents

1 The impact of UTOS on the conceptual framework for note-taking

In Note 3 in the manuscript, we noted that our UTOS moderators specifically apply to paths a and c, but not to path b (see Figure 1). Our study includes ten such moderators (i.e., orthographic script distance, region, note-taking option, note-taking type, material type, measure, input type, learners’ proficiency, learning target, and time). Some moderators might directly affect learners’ note-taking behavior when they are exposed to the L2 input. For example, learners’ L1-L2 orthographic distance may affect the ease with which they can understand the input (Zhang & Zhang, 2020), which in turn may affect their ability to take notes. Similarly, the region where a study was conducted might influence learners’ note-taking perceptions and habits (Siegel & Kusumoto, 2022). Note-taking option (i.e., whether learners are required or merely allowed to take notes), note-taking instruction (i.e., whether learners are provided with any note-taking instruction), and note-taking type can affect the effectiveness of note taking to a certain degree, given their ability to engage or redirect students’ attention to various aspects of the input (Siegel, 2021).

Other moderators might also affect note taking. For instance, a learner with a higher proficiency level might more easily identify relevant information in the input and might be more motivated to take notes, thereby enhancing the efficiency and effectiveness of the note-taking process. Likewise, the mode in which input is presented to learners, whether written or aural, might influence how they take notes. The nature of the material itself might also affect learners’ note-taking behavior: academic input, which might be more complex and in-depth than non-academic input, might pose a challenge for note taking (Jin & Webb, 2023). The effect of note taking may also vary depending on the measure type. Measuring learning outcomes via recognition tests (e.g., multiple-choice items) or recall tests (e.g., writing the meaning of a given word or the L2 word that corresponds to a given meaning) may require different depths of processing, which in turn can influence (i.e., moderate) the effect of note taking. Another moderator, the learning outcome, might also affect note taking, because learning outcomes guide learners on what to focus on when receiving input. For instance, when the learning outcome is reading comprehension, the notes might be broader in scope (e.g., targeting the content), whereas when the learning outcome is vocabulary learning, the notes might be narrower in scope (e.g., targeting the keywords). Finally, the moderator time (i.e., outcome measurement timing) was added to this meta-analysis to differentiate between learners’ pre- versus post-treatment learning outcomes and thereby to measure the possible gains (i.e., the difference between pre- and post-tests) from note taking as a learning aid.

As can be seen, all of our substantive UTOS variables can potentially moderate the act of note taking (path a) and/or the processing of input (path c) and thus, by definition, do not apply to path b directly. Finally, as noted in the manuscript, our M moderators, which by themselves “do not necessarily merit an interpretation [were all] adjusted for in the background” (Norouzian & Bui, 2024, p. 16), so that the impact of the substantive UTOS variables can be more clearly examined.

Figure 1. Theoretical framework for note taking


2 Description of the reasons for excluding moderators from Jin & Webb (2023)

As noted in the manuscript, we included 10 substantive (UTOS) and 3 additional methodological (M) moderators in our study. The following table provides a detailed description of the considerations involved in excluding certain substantive moderators from Jin & Webb (2023). Please see the methodology section in the manuscript for the full description of our moderators.

3 Raw data and initial analyses

The execution of these initial analyses may be time-consuming. Unless otherwise needed, we suggest that readers instead run the analyses in the next section, which use the saved results of these initial analyses. Additionally, to better understand the variables involved in the initial analyses (e.g., those used for estimating Hedges’ g effect sizes), the next table provides a list of their names and definitions. (Click on Code on the bottom right for the reproducible code used in each section.)
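For reference, the effect sizes computed by escalc("SMD", ...) below are bias-corrected standardized mean differences (Hedges’ g); in standard form, \(g = \left(1 - \frac{3}{4(n_T + n_C - 2) - 1}\right)\frac{m_T - m_C}{s_{pooled}}\), where \(m_T, m_C, n_T, n_C\) are the treatment and control means and sample sizes and \(s_{pooled}\) is the pooled standard deviation.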

# We use the software introduced by Norouzian & Bui (2024)
source("https://t.ly/olaQ0")

# We also use the following R package for choosing the best candidate model
library(bbmle)

# Raw coding sheet with merged first row and lots of empty cells
dat <- read.csv("https://t.ly/i5aYY", na=c(NA,"","NA","NULL"))

# Remove the merged first row but use the first row to rename the column names
dat2 <- setNames(dat[-1,], dat[1,])

# Remove any accidental spaces or empty rows or columns
dat3 <- full_clean(dat2)

# Make sure each column's data type is correctly recorded
dat4 <- type.convert(dat3, as.is=TRUE)

# Compute effect sizes
dat5 <- escalc("SMD", m1i = mT, m2i = mC, sd1i = sdT, sd2i = sdC, 
               n1i = nT, n2i = nC, data = dat4, var.names = c("g", "v_g"))


# Adjust for assignment by intact classes
dat6 <- group_by(dat5, study) %>% 
  
  mutate(
    
    g2 = ifelse(assign_type=="class", g_cluster(g, n_class, Nt, Nc), g),
    
    v_g2 = ifelse(assign_type=="class", g_vi_cluster(g, n_class, Nt, Nc), v_g),
    
    SE_egger =  sqrt((nT + nC) / (nT * nC)),
    
    time = recode(time, "pretest" = "baseline"),
    
    region = recode(region, "Asia" = "East Asia") # reviewer requested changing Asia to East Asia

    
  ) %>% ungroup() %>% 
  
  mutate(effect = row_number())


# How many effects and studies
dat6 %>%
  group_by(study) %>%
  summarise(n_gi = n()) %>%
  summarise(
    `No. of Studies` = n(), 
    `No. of Effects` = sum(n_gi)
  ) %>% ungroup()


# What is the distribution of effects
ggplot(dat6) + aes(g2) + geom_density()

# Quite skewed to the right, looks like we have some large effects
# even though we have only 57 effects from 27 studies

# What are the two largest effects?
two_largest <- tail(sort(dat6$g2),2)
# [1]  6.89757 10.45655 -- extremely large, many times larger than 
# mean(dat6$g2), which is ~0.9!


# These two large effects also exceed 3*SD from the mean (Lipsey & Wilson, 2001)
two_largest > with(dat6, c(`3SDfromMean`= mean(g2)+3*sd(g2)))
# [1] TRUE  TRUE

# Let's inspect the impact of these two extreme effects on a 
# basic 3-level model

# Reintroduce naturally occurring dependence before removing 2 largest effects
Vs <- with(dat6, impute_covariance_matrix(v_g2, study, r=.5,
                                          subgroup = sample_id))


# 3-level Additive symmetry model
m1 = rma.mv(g2 ~ time + study_length+no_treat+true_experiment, Vs, 
            random = ~1|study/effect, data = dat6,
            dfs = "contain")


# Removing 2 largest effect sizes to measure their impact on m1
dat7 <- filter(dat6, !g2 %in% two_largest)


# Reintroduce naturally occurring dependence AFTER removing 2 largest effects
Vs_af <- with(dat7, impute_covariance_matrix(v_g2, study, r=.5,
                                          subgroup = sample_id))


# m1 model before removing 2 largest effects
m_before <- m1

# m1 model AFTER removing 2 largest effects
m_after <- update(m_before, data=dat7, V=Vs_af)

# Measuring the CIs width of pre- post effects for models BEFORE (_bf) & AFTER (_af)
(t_bf = type.convert(post_rma(m_before, ~ time)$table, as.is = TRUE))
(t_af = type.convert(post_rma(m_after,  ~ time)$table, as.is = TRUE))

(t_bf_ci_widths = t_bf$Upper - t_bf$Lower)
(t_af_ci_widths = t_af$Upper - t_af$Lower)

# The %reduction in the width of CIs due to removing two outliers
paste0(round((t_bf_ci_widths - t_af_ci_widths)/t_bf_ci_widths*100),"%")
# [1] "52%" "46%" "57%"


# Vast improvement in precision (CIs narrower by up to 57%) due to removing two outliers!

# Continue to model selection without the two outlying effects using dat7
# Let's run 5 more models in addition to m_after and choose:


####################
# Model selection
####################

# 3-level Additive symmetry model
m1 <- m_after

# Estimation checks out, passes!
profile(m1)


# Homogeneous Auto-regressive model
m2 = rma.mv(g2 ~ time + study_length+no_treat+true_experiment, Vs_af, 
            random = list(~time|study, ~1|effect), struct = "AR", 
            data = dat7,
            dfs = "contain")

# Estimation checks out, passes!
profile(m2)

# Heterogeneous auto-regressive model
m3 = rma.mv(g2 ~ time + study_length+no_treat+true_experiment, Vs_af, 
            random = list(~time|study, ~1|effect), struct="HAR", 
            data = dat7,
            dfs = "contain")

# Estimation doesn't check out, exclude this model!
profile(m3)

# Heterogeneous compound symmetry model
m4 = rma.mv(g2 ~ time + study_length+no_treat+true_experiment, Vs_af, 
            random = list(~time|study, ~1|effect), struct = "HCS", 
            data = dat7,
            dfs = "contain")


# Estimation doesn't check out, exclude this model!
profile(m4)

# Homogeneous compound symmetry model
m5 = rma.mv(g2 ~ time + study_length+no_treat+true_experiment, Vs_af, 
            random = list(~time|study, ~1|effect), struct = "CS", 
            data = dat7,
            dfs = "contain")

# Estimation doesn't check out, exclude this model!
profile(m5)


# 4-level Additive symmetry model
m6 = rma.mv(g2 ~ time + study_length+no_treat+true_experiment, Vs_af, 
            random = ~1|study/time/effect, data = dat7,
            dfs = "contain")

# Estimation checks out, passes!
profile(m6)


# Run a weighted comparison between the above 'checked out' models
AICctab(m1, m2, m6, weights=TRUE, base=TRUE)

# m1 wins!! We'll use a 3-level additive symmetry model.


# Q: Is this overall longitudinal model sensitive to the amount of naturally occurring dependence?

# Time effects:
p2 <- post_rma(m1, ~time)

# Sensitivity analysis:
sense_rma(p2, var_name = "v_g2")

# A: Not really except in the case of posttest2 effects which are excluded from interpretation due to their extremely limited number (see next part).

# How many studies and effects for each meta-analytic model do we have?
moderators <-  c(
            "time",
            "treat_grp",
            "outcome",
            "measure",
            "input_mode",
            "material_type",
            "note_option",
            "prof",
            "script_distance",
            "region")


# A list of time and moderators interacting with time
LIST <- c("time", map(moderators[-1], c, "time"))


# Count # of studies and effects at each time
setNames(map(LIST, ~effect_count(dat7, study, !!!syms(.), show0=FALSE, arrange_by="time", na.rm=TRUE)), moderators)

# post-test2 effects (m=4) are from 3 studies! Exclude from interpretations.

########################################
# Model fitting after initial steps
########################################

# Fit all moderator models using a function

fit_model <- function(pred="none", 
                      V = Vs_af, data = dat7, 
                      method_vars = c("study_length","no_treat",
                                      "true_experiment")){
  
  overall <- pred=="none"
  
  time_case <- if(pred!="time")"* time" else " "
  
  form <- as.formula(paste("g2 ~", if(overall) "" else paste(paste(pred, time_case),"+"), 
                           paste(setdiff(method_vars,pred), 
                                 collapse = "+")))
  
  m <- rma.mv(form, V = V, 
              random = ~1|study/effect, data = data,
              dfs = "contain")
  
  m0 <- update.rma(m, yi = g2 ~ 1)
  
  form_post_rma <- if(overall) ~1 else as.formula(paste("~",pred, time_case))
  
  ems <- post_rma(m, form_post_rma)
  
  form_plot <- if(overall) ~1 else as.formula(paste(if(pred=="time") "~" else paste(pred,"~"), "time"))
  
  
  # Human-readable legend title for each moderator
  legend_t <- if (overall) "Overall Effect" else
    switch(pred,
           time            = "Time",
           measure         = "Measure Type",
           test_type       = "Test Type",
           prof            = "Proficiency",
           study_setting   = "Study Setting",
           lang_context    = "Language Context",
           treat_grp       = "Note-Taking Type",
           region          = "Region",
           input_mode      = "Input Mode",
           note_option     = "Note-Taking Option",
           note_instruct   = "Note-Taking Instruction",
           script_distance = "L1-L2 Orthographic Distance",
           age_group       = "Age Group",
           material_type   = "Material Type",
           str_to_title(pred))
  
  
  plot <-  plot_rma(m, form_plot, xlab = if(!overall) "Time" else NULL, ylab="Effect Size (Hedges' g)", dodge=.25) +
    labs(color = legend_t) + theme_test() + 
    scale_color_manual(values = c("black","red", "blue", "green3", "purple",
                                  "orange3", "pink3", "red4"))
  
  
    R2 <- R2_rma(m, null_model = m0, model_names = legend_t)
  
  list(model = m, ems = ems, plot = plot, R2 = R2)
}



# Fit all moderator models:
out <- setNames(map(moderators, fit_model), moderators)

# Save them and share them with readers
saveRDS(out, "np.rds")

4 Definition of the additional variables beyond moderators

5 Display of data and preliminary descriptive analyses

As noted above, this section uses the saved results of the previous section (there is no need to run the R code in the previous section unless desired). Readers are encouraged to run the following R code themselves. Once again, click on Code on the bottom right for the reproducible code used in each section.

# We use the software package introduced by Norouzian & Bui (2024)
source("https://raw.githubusercontent.com/rnorouzian/i/master/3m.r")


library(knitr)
library(flextable)
library(kableExtra)
library(rmarkdown)

opts_chunk$set(message=FALSE, warning=FALSE, fig.align="center")


# data after outlier removal from previous initial analyses
dat7 <- read.csv("https://raw.githubusercontent.com/fpqq/w/main/dat_after_processing.csv")

6 Distribution Summary of Effect Sizes

g <- dat7 %>%
  group_by(study) %>%
  summarise(n_gi = n()) %>%
  summarise(
    `No. of Studies` = n(),
    `No. of Effects` = sum(n_gi),
    `Min. Effects in Study` = min(n_gi),
    `Max. Effects in Study` = max(n_gi),
    `Median Effects in Study` = median(n_gi)
  ) %>% ungroup()


flextable(g) %>%
  autofit() %>% set_caption("Distribution Summary of Effect Sizes") %>% fontsize(size = 11, part = "all") %>%
  line_spacing(space = .6, part = "all")

7 Publication Bias

7.1 Funnel plot at study level

Figure 3 displays the individual effect size estimates aggregated at the study level. The dotted triangle indicates the boundaries for statistical significance on either side of the null effect (i.e., 0, meaning no study-level effect exists in reality). As can be seen, six study-level aggregate effect sizes are statistically significant and positive in direction, constituting ~22% of the total number of study-level aggregate effect sizes in our meta-analysis. Furthermore, half of these study-level aggregate effect sizes, including the largest of them, are from the “Less Visible Literature” (Hopewell, Clarke, & Mallett, 2005). Arguably, such evidence does not indicate a tendency for the note-taking literature to favor studies that, as a whole, have found positive and statistically significant effects of note-taking. Thus, this form of publication bias at the study level seems less likely.
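As a quick illustration (a minimal sketch, not part of the original analysis pipeline), the significance boundaries drawn in the contour-enhanced funnel plot below can be traced directly: at the 5% level, they are the effects lying exactly 1.96 standard errors away from the null on either side.

# Illustrative sketch only (hypothetical grid of standard errors, not study data):
# the p < .05 contour in a funnel plot is the set of effects exactly
# qnorm(.975) standard errors away from the null effect of 0.
se_grid <- seq(0, 1, by = 0.01)
lower_contour <- 0 - qnorm(.975) * se_grid
upper_contour <- 0 + qnorm(.975) * se_grid
# A study-level effect falling outside (lower_contour, upper_contour) at its
# own standard error is individually significant at the 5% level.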

###############################
# 3M publication bias detection
###############################

# Naturally existing dependence from previous section
Vs = with(dat7, impute_covariance_matrix(v_g2, study, r=.5,
                                        subgroup = sample_id))

# Magnitude of within-study correlations
rho <- 0.5

# Aggregate effects at Study level (level 3)
data_agg_study <- 
  dat7 %>% 
  escalc(data = ., yi = g2, vi = v_g2) %>% 
  aggregate.escalc(cluster = study, rho = rho, weighted = FALSE)

# Contour plot at study level
with(data_agg_study,
     contour_funnel(x = g2, 
                    vi = v_g2, sig = FALSE,
                    xlab = "Study-Level Effect Sizes",
                    col = ifelse(gray=="yes","red","blue"),
                    bg = ifelse(gray=="yes","red","blue")))


legend("topright", c("Less Visible","Mainstream"), title = "Literature", pch = 19,
       col = c("red","blue"), title.font = 2, cex = .8)
box()
Figure 3. Contour-Enhanced Funnel Plot of Study-Level Effects


# This time get the tabular counts of study-level effects that are sig.
g <- with(data_agg_study,
          contour_funnel(x = g2, 
                         vi = v_g2, sig = TRUE))

flextable(g) %>%
  autofit() %>% set_caption("Statistically significant study level effects") %>% fontsize(size = 11, part = "all") %>%
  line_spacing(space = .6, part = "all")

7.2 Funnel plot at effect size level

Figure 4 displays the studies’ individual effect size estimates. As before, the dotted triangle indicates the boundaries for statistical significance on either side of the null effect (i.e., 0, meaning no individual effect exists in reality). As can be seen, seventeen effect size estimates are statistically significant and positive in direction, constituting ~31% of the total number of effect sizes in our meta-analysis. On the other hand, two effect size estimates are statistically significant and negative in direction, constituting ~3% of the total number of effect sizes. Furthermore, ~30% of these effect estimates, including the largest of them, are from the “Less Visible Literature”.

Arguably, the comparison at the effect size level could suggest an imbalance in the note-taking literature in favor of positive and statistically significant effects. However, given the lack of such an imbalance at the study level and the presence of multiple positive and statistically significant effects in the less visible literature, the trend seen at the effect size level may reflect a largely natural process rather than one driven, for the most part, by publication policies regarding which note-taking studies are, and are not, published.
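As a rough cross-check of the counts above (a minimal sketch, assuming dat7 with the g2 and v_g2 columns loaded earlier; contour_funnel() itself may apply a slightly different rule), the individually significant effects can be tallied by direction with a simple Wald criterion:

# Minimal cross-check sketch (assumes dat7 from above): Wald z for each effect
z <- with(dat7, g2 / sqrt(v_g2))
c(`positive & sig.` = sum(z >  qnorm(.975)),   # expected to be ~31% of effects
  `negative & sig.` = sum(z < -qnorm(.975)))   # expected to be ~3% of effects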

# Contour plot at effect size level
with(dat7,
      contour_funnel(x = g2, 
                     vi = v_g2, sig = FALSE,
                     col = ifelse(gray=="yes","red","blue"),
                     bg = ifelse(gray=="yes","red","blue")))

legend("topright", c("Less Visible","Mainstream"), title = "Literature", pch = 19,
       col = c("red","blue"), title.font = 2, cex = .8)
box()
Figure 4. Contour-Enhanced Funnel Plot of Individual Effects


# This time get the tabular counts of individual effects that are sig.
g <- with(dat7,
          contour_funnel(x = g2, 
                         vi = v_g2, sig = TRUE))

flextable(g) %>%
  autofit() %>% set_caption("Statistically significant individual effects") %>% fontsize(size = 11, part = "all") %>%
  line_spacing(space = .6, part = "all")

7.3 Egger’s test

We also conducted an Egger’s test (Egger, Smith, Schneider, & Minder, 1997) of funnel plot asymmetry. Using this test, we examined the extent to which the standard error (as a measure of precision) of the effect sizes collected from the note-taking literature is related to the magnitude of those effect sizes. If this relationship and/or the model’s intercept estimate, the latter sometimes referred to as the precision-effect test (PET) estimate, reaches statistical significance, that could suggest asymmetry (and potentially publication bias) in the funnel plot of effect sizes.
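In schematic form (notation ours), the multilevel Egger model fitted below is \(g_{ij} = a + b\,\mathrm{SE}_{ij} + u_j + u_{ij} + e_{ij}\), where \(u_j\) and \(u_{ij}\) are the study- and effect-level random intercepts and \(e_{ij}\) is the sampling error; \(b\) captures the precision-effect size relationship and \(a\) is the PET (intercept) estimate.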

In our case, given that the p-values for the Egger’s test of the relationship in question (b = 0.427, p = 0.784; 95% CI [-2.680, 3.533]) and for its intercept estimate (a = 0.390, p = 0.370; 95% CI [-0.491, 1.271]) are both larger than 0.05, we concluded that our funnel plot is sufficiently symmetric and that the likelihood of publication bias in the collected sample of note-taking studies is small, with the caveat that the b estimate has a relatively wide CI.

# Egger's test using the same naturally and statistically occurring dependence
ff = rma.mv(g2 ~ SE_egger, V = Vs, 
            random = ~1|study/effect, data = dat7,
            dfs = "contain")

g <- results_rma(ff, drop_rows = 3:7, drop_cols = 9:10, tidy = TRUE)

flextable(dplyr::select(g, -Df)) %>%
  autofit() %>% set_caption("Egger's Test Results") %>% fontsize(size = 11, part = "all") %>%
  line_spacing(space = .6, part = "all")

8 Results of analyses

In this section, we present the results of our analyses in two parts. In the first part, we present the synthesized effects at each time point. As mentioned in the manuscript, results based on a limited number of effects (M) and/or studies (K) should be ignored due to their unreliable nature.

Also presented in the first part is the \(R^2\) test of heterogeneity. \(R^2\) indicates the percentage change in the total heterogeneity (between- and within-study) in the true effects of note taking when moving from a model without any MUTOS moderators (a null model) to a model that includes a set of MUTOS moderators of interest.
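For example, using the Time model values reported in Section 8.1.1 (with heterogeneity expressed in SD units), \(R^2 \approx (0.661 - 0.438)/0.661 \times 100\% \approx 33.7\%\), which corresponds, up to rounding of the displayed estimates, to the tabled value of 33.658%.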

While necessary, the results presented in the first part may not by themselves immediately translate into evidence-based recommendations. This is because the descriptive results (synthesized average effects) and the associated inferential results (CIs and p-values) simply denote how large the effect is at each measurement occasion and whether that effect is reliably different from 0 at that point in time.

In the second part, we compare the changes that occurred in learners’ performance from one measurement occasion (baseline) to another (post-test) to measure the potential learning “gains” that might have resulted from note-taking treatments, taking into account the methodological differences that distinguish the studies to varying degrees (see Table 2 in the manuscript for more details on the moderators).

Because the second part allows us to measure the gains from note taking across more than one occasion, its results (i.e., synthesized average effects and their associated CIs and p-values) more immediately translate into evidence-based recommendations. To further facilitate such recommendations, in the second part we also report the minimum expected benefit of note taking, expressed in the universal metric of percentages.
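Concretely, each gain is a contrast between two time-point estimates from the same model. For instance, using the Time model estimates reported in Section 8.1.1, Gain1 (post-test 1 - baseline) \(= 0.713 - (-0.198) = 0.911\), which matches the gain estimate reported in Section 8.2.1.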

8.1 Effects at each time

table_names <-  
  c("Time",
    "Note-Taking Type",
    "Outcome",
    "Measure Type",
    "Input Mode",
    "Material Type",
    "Optional Note-Taking",
    "Proficiency",
    "L1-L2 Orthographic Differences",
    "Region")


# Fitted moderator models stored from the previous section
results <- setNames(readRDS(url("https://github.com/fpqq/w/raw/main/np.rds")), table_names)



moderators_abb_names <-  c(
  "time",
  "treat_grp",
  "outcome",
  "measure",
  "input_mode",
  "material_type",
  "note_option",
  "prof",
  "script_distance",
  "region")


# A list of time and moderators interacting with time
LIST <- c("time", map(moderators_abb_names[-1], c, "time"))


# Count # of studies and effects at each time
effect_no <- setNames(map(LIST, ~effect_count(dat7, study, !!!syms(.), show0=FALSE, na.rm=TRUE, arrange_by="time")), table_names)
rs <- results

invisible(lapply(table_names, \(i){

 
  cat(paste0("\n\n### ", i, "\n"))

      
g <- rs[[i]]$ems


g3 <- rs[[i]]$plot

g4 <- rs[[i]]$R2

if(i!="Overall") print(g3)


print(kable(dplyr::select(cbind(g$table, dplyr::select(effect_no[[i]], `n study`, `n effect`)), -Df) %>% rename(K=`n study`, M=`n effect`),format = "simple", table.attr = "style='width:40%;'",
            caption = paste("3M results for",tolower(i),"categories")) %>%
  kable_styling(bootstrap_options = "bordered",
                full_width = TRUE, font_size = 9.5))


print(kable(g4 %>% rename(`Total Heterogeneity`=`Sigma(total)`, `Between-study Heterogeneity`=`Sigma(study)`, `Within-study Heterogeneity`=`Sigma(effect)`), format = "simple", table.attr = "style='width:40%;'",
            caption = paste("R2 test of heterogeneity for",tolower(i))) %>%
    add_footnote(c("Heterogeneity is in SD unit.",paste("The *p-value* indicates the statistical significance of the MUTOS moderators in the",i, "model *collectively*."))) %>% 
  kable_styling(bootstrap_options = "bordered",
                full_width = TRUE, font_size = 9.5))
  
}))

8.1.1 Time

3M results for time categories
time Mean SE Lower Upper t p-value Sig. K M
baseline -0.198 0.157 -0.528 0.132 -1.261 0.223 16 18
posttest1 0.713 0.124 0.452 0.973 5.747 0.000 *** 25 33
posttest2 0.358 0.272 -0.214 0.930 1.314 0.205 3 4
R2 test of heterogeneity for time
Model Total Heterogeneity Between-study Heterogeneity Within-study Heterogeneity p-value R2
No (M)UTOS 0.661 0.157 0.642
Time 0.438 0.253 0.358 0.001 33.658%

Note: a Heterogeneity is in SD unit. b The p-value indicates the statistical significance of the MUTOS moderators in the Time model collectively.

8.1.2 Note-Taking Type

3M results for note-taking type categories
treat_grp time Mean SE Lower Upper t p-value Sig. K M
1 conventional baseline 0.166 0.300 -0.503 0.834 0.552 0.593 3 4
2 framework notes baseline -0.428 0.285 -1.063 0.206 -1.504 0.163 7 7
3 note-taking instruction baseline -0.133 0.300 -0.802 0.536 -0.444 0.667 5 5
4 vocabulary notebook baseline -0.664 0.540 -1.867 0.538 -1.231 0.246 1 2
5 conventional posttest1 0.441 0.229 -0.070 0.952 1.925 0.083 . 7 12
6 framework notes posttest1 0.868 0.246 0.320 1.417 3.527 0.005 ** 9 9
7 note-taking instruction posttest1 0.844 0.265 0.254 1.434 3.189 0.010 ** 7 7
8 vocabulary notebook posttest1 1.187 0.417 0.259 2.115 2.849 0.017 * 3 5
9 conventional posttest2 0.376 0.348 -0.398 1.150 1.082 0.305 1 2
11 note-taking instruction posttest2 0.438 0.550 -0.787 1.663 0.797 0.444 1 1
12 vocabulary notebook posttest2 0.455 0.726 -1.162 2.072 0.627 0.545 1 1
R2 test of heterogeneity for note-taking type
Model Total Heterogeneity Between-study Heterogeneity Within-study Heterogeneity p-value R2
No (M)UTOS 0.661 0.157 0.642
Note-Taking Type 0.384 0.250 0.291 0.014 41.937%

Note: a Heterogeneity is in SD unit. b The p-value indicates the statistical significance of the MUTOS moderators in the Note-Taking Type model collectively.

8.1.3 Outcome

3M results for outcome categories
outcome time Mean SE Lower Upper t p-value Sig. K M
1 listening baseline -0.269 0.326 -0.987 0.450 -0.823 0.428 6 6
2 miscellaneous baseline -0.134 0.368 -0.943 0.676 -0.364 0.723 3 3
3 reading baseline 0.238 0.476 -0.810 1.286 0.500 0.627 3 3
4 vocabulary baseline -0.181 0.259 -0.751 0.388 -0.701 0.498 5 6
5 listening posttest1 0.453 0.206 0.000 0.905 2.202 0.050 * 9 14
6 miscellaneous posttest1 0.845 0.341 0.096 1.595 2.482 0.030 * 4 4
7 reading posttest1 1.241 0.395 0.372 2.109 3.144 0.009 ** 5 5
8 vocabulary posttest1 0.766 0.218 0.286 1.246 3.510 0.005 ** 8 10
10 miscellaneous posttest2 0.507 0.493 -0.578 1.593 1.028 0.326 1 1
12 vocabulary posttest2 0.355 0.348 -0.412 1.122 1.018 0.331 3 3
R2 test of heterogeneity for outcome
Model Total Heterogeneity Between-study Heterogeneity Within-study Heterogeneity p-value R2
No (M)UTOS 0.661 0.157 0.642
Outcome 0.456 0.166 0.425 0.055 30.929%

Note: a Heterogeneity is in SD unit. b The p-value indicates the statistical significance of the MUTOS moderators in the Outcome model collectively.

8.1.4 Measure Type

3M results for measure type categories
measure time Mean SE Lower Upper t p-value Sig. K M
1 miscellaneous baseline -0.012 0.815 -1.772 1.748 -0.015 0.988 3 3
2 recall baseline 0.008 0.271 -0.577 0.594 0.031 0.976 5 5
3 recognition baseline -0.292 0.213 -0.753 0.169 -1.367 0.195 10 10
4 miscellaneous posttest1 0.470 0.817 -1.295 2.234 0.575 0.575 3 3
5 recall posttest1 1.017 0.250 0.476 1.558 4.063 0.001 ** 7 8
6 recognition posttest1 0.598 0.148 0.278 0.919 4.034 0.001 ** 17 21
8 recall posttest2 0.440 0.500 -0.641 1.521 0.880 0.395 1 1
9 recognition posttest2 0.389 0.349 -0.365 1.142 1.115 0.285 3 3
R2 test of heterogeneity for measure type
Model Total Heterogeneity Between-study Heterogeneity Within-study Heterogeneity p-value R2
No (M)UTOS 0.661 0.157 0.642
Measure Type 0.458 0.133 0.438 0.043 30.651%

Note: a Heterogeneity is in SD unit. b The p-value indicates the statistical significance of the MUTOS moderators in the Measure Type model collectively.

8.1.5 Input Mode

3M results for input mode categories
input_mode time Mean SE Lower Upper t p-value Sig. K M
listening baseline -0.050 0.271 -0.641 0.541 -0.185 0.856 7 8
miscellaneous baseline -0.490 0.354 -1.260 0.281 -1.384 0.192 5 6
reading baseline 0.111 0.380 -0.717 0.939 0.292 0.775 4 4
listening posttest1 0.555 0.214 0.089 1.022 2.592 0.024 * 10 16
miscellaneous posttest1 0.809 0.329 0.093 1.525 2.462 0.030 * 7 9
reading posttest1 0.947 0.281 0.335 1.558 3.371 0.006 ** 8 8
listening posttest2 0.385 0.383 -0.449 1.219 1.005 0.335 1 2
miscellaneous posttest2 0.229 0.758 -1.424 1.881 0.302 0.768 1 1
reading posttest2 0.530 0.603 -0.784 1.845 0.879 0.397 1 1
R2 test of heterogeneity for input mode
Model Total Heterogeneity Between-study Heterogeneity Within-study Heterogeneity p-value R2
No (M)UTOS 0.661 0.157 0.642
Input Mode 0.454 0.255 0.376 0.024 31.298%

Note: a Heterogeneity is in SD unit. b The p-value indicates the statistical significance of the MUTOS moderators in the Input Mode model collectively.

8.1.6 Material Type

3M results for material type categories
material_type time Mean SE Lower Upper t p-value Sig. K M
academic baseline -0.462 0.189 -0.865 -0.058 -2.440 0.028 * 13 14
non-academic baseline 0.402 0.291 -0.217 1.022 1.384 0.187 3 4
academic posttest1 0.661 0.138 0.366 0.955 4.777 0.000 *** 19 25
non-academic posttest1 0.896 0.230 0.405 1.387 3.892 0.001 ** 6 8
academic posttest2 0.203 0.443 -0.740 1.147 0.460 0.652 2 2
non-academic posttest2 0.692 0.351 -0.057 1.441 1.970 0.068 . 1 2
R2 test of heterogeneity for material type
Model Total Heterogeneity Between-study Heterogeneity Within-study Heterogeneity p-value R2
No (M)UTOS 0.661 0.157 0.642
Material Type 0.397 0.185 0.351 0.003 39.952%

Note: a Heterogeneity is in SD unit. b The p-value indicates the statistical significance of the MUTOS moderators in the Material Type model collectively.

8.1.7 Optional Note-Taking

3M results for optional note-taking categories
note_option time Mean SE Lower Upper t p-value Sig. K M
allowed baseline 0.274 0.264 -0.291 0.840 1.040 0.316 5 6
required baseline -0.515 0.207 -0.959 -0.071 -2.488 0.026 * 10 11
allowed posttest1 0.731 0.224 0.250 1.212 3.258 0.006 ** 8 11
required posttest1 0.782 0.165 0.428 1.135 4.742 0.000 *** 16 20
allowed posttest2 0.567 0.345 -0.172 1.306 1.645 0.122 1 2
required posttest2 0.260 0.431 -0.664 1.184 0.603 0.556 2 2
R2 test of heterogeneity for optional note-taking
Model Total Heterogeneity Between-study Heterogeneity Within-study Heterogeneity p-value R2
No (M)UTOS 0.661 0.157 0.642
Note-Taking Option 0.409 0.273 0.305 0.002 38.023%

Note: a Heterogeneity is in SD unit. b The p-value indicates the statistical significance of the MUTOS moderators in the Optional Note-Taking model collectively.

8.1.8 Proficiency

3M results for proficiency categories
prof time Mean SE Lower Upper t p-value Sig. K M
beginner to lower intermediate baseline -0.719 0.306 -1.386 -0.052 -2.350 0.037 * 3 4
high intermediate to advanced baseline 0.425 0.369 -0.380 1.229 1.150 0.273 2 3
intermediate baseline -0.134 0.221 -0.615 0.346 -0.609 0.554 11 11
beginner to lower intermediate posttest1 0.736 0.227 0.241 1.230 3.241 0.007 ** 7 9
high intermediate to advanced posttest1 1.037 0.317 0.346 1.729 3.268 0.007 ** 4 5
intermediate posttest1 0.624 0.178 0.236 1.012 3.508 0.004 ** 14 19
beginner to lower intermediate posttest2 0.292 0.729 -1.295 1.880 0.401 0.695 1 1
high intermediate to advanced posttest2 0.758 0.396 -0.106 1.621 1.912 0.080 . 1 2
intermediate posttest2 0.297 0.558 -0.920 1.513 0.532 0.605 1 1
R2 test of heterogeneity for proficiency
Model Total Heterogeneity Between-study Heterogeneity Within-study Heterogeneity p-value R2
No (M)UTOS 0.661 0.157 0.642
Proficiency 0.441 0.283 0.337 0.013 33.299%

Note: a Heterogeneity is in SD unit. b The p-value indicates the statistical significance of the MUTOS moderators in the Proficiency model collectively.

8.1.9 L1-L2 Orthographic Differences

3M results for l1-l2 orthographic differences categories
script_distance time Mean SE Lower Upper t p-value Sig. K M
greater baseline -0.251 0.184 -0.641 0.139 -1.363 0.192 13 14
shorter baseline -0.061 0.378 -0.862 0.741 -0.160 0.875 3 4
greater posttest1 0.568 0.151 0.249 0.888 3.769 0.002 ** 20 26
shorter posttest1 1.250 0.299 0.616 1.884 4.177 0.001 *** 5 7
greater posttest2 0.283 0.280 -0.309 0.876 1.014 0.326 3 4
R2 test of heterogeneity for l1-l2 orthographic differences
Model Total Heterogeneity Between-study Heterogeneity Within-study Heterogeneity p-value R2
No (M)UTOS 0.661 0.157 0.642
L1-L2 Orthographic Distance 0.473 0.314 0.354 0.002 28.416%

Note: a Heterogeneity is in SD unit. b The p-value indicates the statistical significance of the MUTOS moderators in the L1-L2 Orthographic Differences model collectively.

8.1.10 Region

3M results for region categories
region time Mean SE Lower Upper t p-value Sig. K M
1 East Asia baseline -0.001 0.224 -0.483 0.480 -0.006 0.995 6 7
3 Middle East baseline -0.192 0.211 -0.646 0.261 -0.910 0.378 10 11
4 East Asia posttest1 0.786 0.226 0.300 1.272 3.470 0.004 ** 6 7
5 Europe/North America posttest1 0.248 0.268 -0.327 0.822 0.924 0.371 4 8
6 Middle East posttest1 0.885 0.171 0.518 1.252 5.173 0.000 *** 15 18
7 East Asia posttest2 0.465 0.343 -0.271 1.201 1.355 0.197 1 2
9 Middle East posttest2 0.371 0.455 -0.605 1.348 0.815 0.429 2 2
R2 test of heterogeneity for region
Model Total Heterogeneity Between-study Heterogeneity Within-study Heterogeneity p-value R2
No (M)UTOS 0.661 0.157 0.642
Region 0.395 0.107 0.380 0.010 40.203%

Note: a Heterogeneity is in SD unit. b The p-value indicates the statistical significance of the MUTOS moderators in the Region model collectively.

8.2 Learning Gains

rs <- results

invisible(lapply(table_names, \(i){

 
  cat(paste0("\n\n### ", i, "\n"))

      
g <- rs[[i]]$ems

# Effects
gains <- if(i=="Time") contrast_rma(g, list("Gain1(post-test 1 - baseline)" =c(2,-1))) else contrast_rma(g, brief = TRUE)


print(kable(dplyr::select(gains$table, -Df),format = "simple", table.attr = "style='width:40%;'",
            caption = paste("Learning gains for", tolower(i))) %>%
  kable_styling(bootstrap_options = "bordered",
                full_width = TRUE, font_size = 9.5))



if(i!="Time") gain_dif <- contrast_rma(g, gain_dif = TRUE, brief = TRUE, gain_dif_type = "same")

if(i!="Time") print(kable(dplyr::select(gain_dif$table, -Df),format = "simple", table.attr = "style='width:40%;'",
            caption = paste("Differences in learning gains for", tolower(i))) %>%
  kable_styling(bootstrap_options = "bordered",
                full_width = TRUE, font_size = 9.5))


# Percentages
gain_prob <- prob_rma(gains, gain=TRUE, target_effect=.2)

print(kable(gain_prob,format = "simple", table.attr = "style='width:40%;'",
            caption = paste("Minimum learning gain percentage for", tolower(i))) %>%
  kable_styling(bootstrap_options = "bordered",
                full_width = TRUE, font_size = 9.5))
  
}))

8.2.1 Time

Learning gains for time
Contrast Estimate SE Lower Upper t p-value Sig.
Gain1(post-test 1 - baseline) 0.911 0.167 0.561 1.261 5.465 0.000 ***
Minimum learning gain percentage for time
Term Target_Effect Probability Min Max
Gain1(post-test 1 - baseline) 0.2 or larger 79.97% 63.39% 94.98%

8.2.2 Note-Taking Type

Learning gains for note-taking type
Contrast Estimate SE Lower Upper t p-value Sig.
1 Gain1(conventional) 0.275 0.251 -0.284 0.835 1.096 0.299
2 Gain2(conventional) 0.210 0.315 -0.492 0.913 0.667 0.520
3 Gain1(framework notes) 1.297 0.278 0.677 1.917 4.659 0.001 ***
5 Gain1(note-taking instruction) 0.977 0.287 0.338 1.616 3.407 0.007 **
6 Gain2(note-taking instruction) 0.571 0.541 -0.634 1.776 1.056 0.316
7 Gain1(vocabulary notebook) 1.852 0.468 0.809 2.894 3.958 0.003 **
8 Gain2(vocabulary notebook) 1.119 0.832 -0.733 2.972 1.346 0.208
Differences in learning gains for note-taking type
Contrast Estimate SE Lower Upper t p-value Sig.
1 Gain1(conventional) - Gain1(framework notes) -1.021 0.373 -1.852 -0.190 -2.738 0.021 *
2 Gain1(conventional) - Gain1(note-taking instruction) -0.702 0.382 -1.553 0.149 -1.838 0.096 .
3 Gain1(conventional) - Gain1(vocabulary notebook) -1.576 0.532 -2.761 -0.391 -2.962 0.014 *
5 Gain2(conventional) - Gain2(note-taking instruction) -0.361 0.627 -1.757 1.036 -0.576 0.577
6 Gain2(conventional) - Gain2(vocabulary notebook) -0.909 0.889 -2.890 1.072 -1.023 0.331
7 Gain1(framework notes) - Gain1(note-taking instruction) 0.319 0.398 -0.568 1.207 0.802 0.441
8 Gain1(framework notes) - Gain1(vocabulary notebook) -0.555 0.539 -1.757 0.647 -1.029 0.328
11 Gain1(note-taking instruction) - Gain1(vocabulary notebook) -0.874 0.546 -2.090 0.341 -1.603 0.140
12 Gain2(note-taking instruction) - Gain2(vocabulary notebook) -0.548 1.000 -2.777 1.680 -0.548 0.596
Minimum learning gain percentage for note-taking type
Term Target_Effect Probability Min Max
Gain1(conventional) 0.2 or larger 53.92% 31.80% 91.65%
Gain2(conventional) 0.2 or larger 50.52% 24.93% 93.96%
Gain1(framework notes) 0.2 or larger 92.48% 67.96% 99.99%
Gain1(note-taking instruction) 0.2 or larger 84.58% 55.37% 99.90%
Gain2(note-taking instruction) 0.2 or larger 68.66% 20.74% 99.97%
Gain1(vocabulary notebook) 0.2 or larger 98.48% 72.43% 100.00%
Gain2(vocabulary notebook) 0.2 or larger 88.58% 18.08% 100.00%

8.2.3 Outcome

Learning gains for outcome
Contrast Estimate SE Lower Upper t p-value Sig.
1 Gain1(listening) 0.721 0.354 -0.058 1.501 2.036 0.067 .
3 Gain1(miscellaneous) 0.979 0.444 0.001 1.957 2.204 0.050 *
4 Gain2(miscellaneous) 0.641 0.585 -0.647 1.929 1.095 0.297
5 Gain1(reading) 1.002 0.481 -0.056 2.061 2.085 0.061 .
7 Gain1(vocabulary) 0.947 0.281 0.328 1.567 3.366 0.006 **
8 Gain2(vocabulary) 0.536 0.399 -0.342 1.414 1.343 0.206
Differences in learning gains for outcome
Contrast Estimate SE Lower Upper t p-value Sig.
1 Gain1(listening) - Gain1(miscellaneous) -0.258 0.564 -1.500 0.983 -0.458 0.656
2 Gain1(listening) - Gain1(reading) -0.281 0.601 -1.604 1.041 -0.468 0.649
3 Gain1(listening) - Gain1(vocabulary) -0.226 0.450 -1.216 0.763 -0.503 0.625
7 Gain1(miscellaneous) - Gain1(reading) -0.023 0.656 -1.467 1.420 -0.035 0.972
8 Gain1(miscellaneous) - Gain1(vocabulary) 0.032 0.524 -1.122 1.186 0.061 0.953
10 Gain2(miscellaneous) - Gain2(vocabulary) 0.105 0.698 -1.433 1.642 0.150 0.883
11 Gain1(reading) - Gain1(vocabulary) 0.055 0.558 -1.173 1.283 0.099 0.923
Minimum learning gain percentage for outcome
Term Target_Effect Probability Min Max
Gain1(listening) 0.2 or larger 71.40% 41.08% 96.60%
Gain1(miscellaneous) 0.2 or larger 80.09% 43.09% 99.31%
Gain2(miscellaneous) 0.2 or larger 68.38% 22.94% 99.23%
Gain1(reading) 0.2 or larger 80.78% 41.14% 99.55%
Gain1(vocabulary) 0.2 or larger 79.11% 54.46% 97.24%
Gain2(vocabulary) 0.2 or larger 64.22% 31.78% 95.57%

8.2.4 Measure Type

Learning gains for measure type
Contrast Estimate SE Lower Upper t p-value Sig.
1 Gain1(miscellaneous) 0.482 0.894 -1.450 2.414 0.539 0.599
3 Gain1(recall) 1.009 0.335 0.286 1.732 3.014 0.010 **
4 Gain2(recall) 0.432 0.553 -0.762 1.626 0.781 0.449
5 Gain1(recognition) 0.890 0.240 0.372 1.408 3.712 0.003 **
6 Gain2(recognition) 0.681 0.390 -0.161 1.522 1.747 0.104
Differences in learning gains for measure type
Contrast Estimate SE Lower Upper t p-value Sig.
1 Gain1(miscellaneous) - Gain1(recall) -0.527 0.955 -2.590 1.536 -0.552 0.590
2 Gain1(miscellaneous) - Gain1(recognition) -0.408 0.926 -2.409 1.592 -0.441 0.666
5 Gain1(recall) - Gain1(recognition) 0.119 0.406 -0.759 0.996 0.292 0.775
6 Gain2(recall) - Gain2(recognition) -0.249 0.666 -1.687 1.190 -0.374 0.715
Minimum learning gain percentage for measure type
Term Target_Effect Probability Min Max
Gain1(miscellaneous) 0.2 or larger 61.83% 7.54% 99.87%
Gain1(recall) 0.2 or larger 80.62% 52.98% 98.14%
Gain2(recall) 0.2 or larger 59.78% 20.11% 97.38%
Gain1(recognition) 0.2 or larger 76.94% 55.95% 94.98%
Gain2(recognition) 0.2 or larger 69.63% 37.66% 96.39%

8.2.5 Input Mode

Learning gains for input mode
Contrast Estimate SE Lower Upper t p-value Sig.
Gain1(listening) 0.606 0.260 0.039 1.173 2.327 0.038 *
Gain2(listening) 0.435 0.371 -0.374 1.244 1.172 0.264
Gain1(miscellaneous) 1.298 0.291 0.665 1.932 4.467 0.001 ***
Gain2(miscellaneous) 0.718 0.792 -1.008 2.444 0.907 0.382
Gain1(reading) 0.836 0.366 0.039 1.632 2.285 0.041 *
Gain2(reading) 0.419 0.623 -0.939 1.777 0.673 0.514
Differences in learning gains for input mode
Contrast Estimate SE Lower Upper t p-value Sig.
Gain1(listening) - Gain1(miscellaneous) -0.693 0.388 -1.539 0.154 -1.783 0.100 .
Gain1(listening) - Gain1(reading) -0.230 0.447 -1.204 0.744 -0.514 0.617
Gain2(listening) - Gain2(miscellaneous) -0.283 0.870 -2.179 1.613 -0.325 0.750
Gain2(listening) - Gain2(reading) 0.016 0.727 -1.568 1.600 0.022 0.983
Gain1(miscellaneous) - Gain1(reading) 0.463 0.467 -0.554 1.479 0.992 0.341
Gain2(miscellaneous) - Gain2(reading) 0.299 1.017 -1.916 2.514 0.294 0.774
Minimum learning gain percentage for input mode
Term Target_Effect Probability Min Max
Gain1(listening) 0.2 or larger 68.02% 44.19% 93.73%
Gain2(listening) 0.2 or larger 60.68% 30.12% 94.99%
Gain1(miscellaneous) 0.2 or larger 89.74% 66.35% 99.68%
Gain2(miscellaneous) 0.2 or larger 72.49% 13.64% 99.98%
Gain1(reading) 0.2 or larger 76.84% 44.19% 98.79%
Gain2(reading) 0.2 or larger 59.97% 15.06% 99.35%

8.2.6 Material Type

Learning gains for material type
Contrast Estimate SE Lower Upper t p-value Sig.
Gain1(academic) 1.122 0.201 0.694 1.550 5.591 0.000 ***
Gain2(academic) 0.665 0.459 -0.312 1.642 1.450 0.168
Gain1(non-academic) 0.494 0.287 -0.117 1.104 1.722 0.106
Gain2(non-academic) 0.289 0.363 -0.484 1.063 0.798 0.437
Differences in learning gains for material type
Contrast Estimate SE Lower Upper t p-value Sig.
Gain1(academic) - Gain1(non-academic) 0.629 0.352 -0.121 1.378 1.788 0.094 .
Gain2(academic) - Gain2(non-academic) 0.376 0.586 -0.872 1.624 0.641 0.531
Minimum learning gain percentage for material type
Term Target_Effect Probability Min Max
Gain1(academic) 0.2 or larger 86.45% 67.87% 98.87%
Gain2(academic) 0.2 or larger 71.06% 31.52% 99.26%
Gain1(non-academic) 0.2 or larger 63.72% 38.29% 93.66%
Gain2(non-academic) 0.2 or larger 54.23% 26.02% 92.75%

8.2.7 Optional Note-Taking

Learning gains for optional note-taking
Contrast Estimate SE Lower Upper t p-value Sig.
Gain1(allowed) 0.457 0.235 -0.047 0.960 1.946 0.072 .
Gain2(allowed) 0.293 0.319 -0.391 0.977 0.918 0.374
Gain1(required) 1.297 0.203 0.861 1.733 6.385 0.000 ***
Gain2(required) 0.775 0.442 -0.173 1.722 1.754 0.101
Differences in learning gains for optional note-taking
Contrast Estimate SE Lower Upper t p-value Sig.
Gain1(allowed) - Gain1(required) -0.840 0.308 -1.501 -0.179 -2.726 0.016 *
Gain2(allowed) - Gain2(required) -0.482 0.544 -1.650 0.685 -0.886 0.391
Minimum learning gain percentage for optional note-taking
Term Target_Effect Probability Min Max
Gain1(allowed) 0.2 or larger 62.89% 40.40% 92.68%
Gain2(allowed) 0.2 or larger 54.74% 28.04% 93.12%
Gain1(required) 0.2 or larger 91.98% 74.23% 99.83%
Gain2(required) 0.2 or larger 76.91% 35.68% 99.82%

8.2.8 Proficiency

Learning gains for proficiency
Contrast Estimate SE Lower Upper t p-value Sig.
Gain1(beginner to lower intermediate) 1.455 0.294 0.814 2.096 4.944 0.000 ***
Gain2(beginner to lower intermediate) 1.011 0.777 -0.681 2.704 1.302 0.217
Gain1(high intermediate to advanced) 0.613 0.326 -0.098 1.324 1.877 0.085 .
Gain2(high intermediate to advanced) 0.333 0.363 -0.458 1.125 0.917 0.377
Gain1(intermediate) 0.758 0.241 0.233 1.284 3.143 0.008 **
Gain2(intermediate) 0.431 0.567 -0.804 1.666 0.760 0.462
Differences in learning gains for proficiency
Contrast Estimate SE Lower Upper t p-value Sig.
Gain1(beginner to lower intermediate) - Gain1(high intermediate to advanced) 0.842 0.440 -0.117 1.802 1.912 0.080 .
Gain1(beginner to lower intermediate) - Gain1(intermediate) 0.697 0.376 -0.124 1.517 1.851 0.089 .
Gain2(beginner to lower intermediate) - Gain2(high intermediate to advanced) 0.678 0.856 -1.187 2.543 0.792 0.443
Gain2(beginner to lower intermediate) - Gain2(intermediate) 0.580 0.969 -1.530 2.691 0.599 0.560
Gain1(high intermediate to advanced) - Gain1(intermediate) -0.146 0.406 -1.030 0.739 -0.359 0.726
Gain2(high intermediate to advanced) - Gain2(intermediate) -0.098 0.674 -1.567 1.371 -0.145 0.887
Minimum learning gain percentage for proficiency
Term Target_Effect Probability Min Max
Gain1(beginner to lower intermediate) 0.2 or larger 93.67% 71.86% 99.95%
Gain2(beginner to lower intermediate) 0.2 or larger 83.83% 20.32% 100.00%
Gain1(high intermediate to advanced) 0.2 or larger 69.24% 38.94% 97.41%
Gain2(high intermediate to advanced) 0.2 or larger 56.43% 26.76% 94.53%
Gain1(intermediate) 0.2 or larger 75.15% 51.24% 96.97%
Gain2(intermediate) 0.2 or larger 61.07% 17.21% 99.44%

8.2.9 L1-L2 Orthographic Differences

Learning gains for l1-l2 orthographic differences
Contrast Estimate SE Lower Upper t p-value Sig.
Gain1(greater) 0.819 0.187 0.423 1.216 4.379 0.000 ***
Gain2(greater) 0.534 0.287 -0.075 1.143 1.860 0.081 .
Gain1(shorter) 1.311 0.365 0.537 2.084 3.592 0.002 **
Differences in learning gains for l1-l2 orthographic differences
Contrast Estimate SE Lower Upper t p-value Sig.
Gain1(greater) - Gain1(shorter) -0.492 0.408 -1.357 0.373 -1.205 0.246
Minimum learning gain percentage for l1-l2 orthographic differences
Term Target_Effect Probability Min Max
Gain1(greater) 0.2 or larger 76.91% 58.35% 94.52%
Gain2(greater) 0.2 or larger 65.43% 39.74% 93.12%
Gain1(shorter) 0.2 or larger 90.67% 62.50% 99.85%

8.2.10 Region

Learning gains for region
Contrast Estimate SE Lower Upper t p-value Sig.
1 Gain1(East Asia) 0.787 0.272 0.204 1.370 2.895 0.012 *
2 Gain2(East Asia) 0.466 0.374 -0.337 1.270 1.246 0.233
5 Gain1(Middle East) 1.077 0.230 0.584 1.571 4.683 0.000 ***
6 Gain2(Middle East) 0.563 0.477 -0.461 1.588 1.180 0.258
Differences in learning gains for region
Contrast Estimate SE Lower Upper t p-value Sig.
2 Gain1(East Asia) - Gain1(Middle East) -0.290 0.356 -1.054 0.474 -0.815 0.429
4 Gain2(East Asia) - Gain2(Middle East) -0.097 0.609 -1.403 1.209 -0.159 0.876
Minimum learning gain percentage for region
Term Target_Effect Probability Min Max
Gain1(East Asia) 0.2 or larger 74.96% 50.15% 96.08%
Gain2(East Asia) 0.2 or larger 61.98% 31.03% 94.62%
Gain1(Middle East) 0.2 or larger 84.27% 63.83% 98.04%
Gain2(Middle East) 0.2 or larger 66.14% 27.11% 98.16%

9 Included Studies

The following provides the studies (k = 27) that were included in the meta-analysis.