
Science Editing : Science Editing



Original Article
How authors select covariates in the multivariate analysis of cancer studies in 10 oncology journals in Korea: a descriptive study
Mi Ah Han1, Hae Ran Kim2, Sang Eun Yoon3, Sun Mi Park1, Boyoung Kim4, Seo-Hee Kim1,3, So-Yeong Kim1
Science Editing 2024;11(1):26-32.
Published online: February 20, 2024

1Department of Preventive Medicine, Chosun University College of Medicine, Gwangju, Korea

2Department of Nursing, Chosun University College of Medicine, Gwangju, Korea

3Department of Public Health, Chosun University Graduate School, Gwangju, Korea

4Chonnam National University College of Nursing, Gwangju, Korea

Correspondence to Mi Ah Han
• Received: December 28, 2023   • Accepted: February 1, 2024

Copyright © 2024 Korean Council of Science Editors

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

  • Purpose
    Cancer is the leading cause of death in Korea, leading many investigators to focus on cancer research. We present the current practice of variable selection methods for multivariate analyses in cancer studies recently published in major oncology journals in Korea.
  • Methods
    We included observational studies investigating associations between exposures and outcomes using multivariate analysis from 10 major oncology journals published in 2021 in KoreaMed, a Korean electronic database. Two reviewers independently and in duplicate performed the reference screening and data extraction. For each study included in this review, we collected important aspects of the variable selection methods in multivariate models, including the study characteristics, analytic methods, and covariate selection methods. The descriptive statistics of the data are presented.
  • Results
    In total, 107 studies were included. None used prespecified covariate selection methods, and half of the studies did not provide enough information to classify covariate selection methods. Among the studies reporting selection methods, almost all studies only used data-driven methods, despite having study questions related to causality. The most commonly used method for variable selection was significance in the univariate model, with the outcome as the dependent variable.
  • Conclusion
    Half of the included studies did not provide sufficient information to assess the variable selection method, and most used a limited data-driven method. We believe that the reporting of covariate selection methods requires improvement, and our results can be used to educate researchers, editors, and reviewers to increase the transparency and adequacy of covariate selection for multivariable analyses in observational studies.
The long-term survival rates of cancer patients continue to increase because of early detection and advances in cancer treatment and care in Korea. However, approximately 243,000 cancers are diagnosed annually, and cancer has been the most common cause of death in Korea since 1983, when cause-of-death statistics began to be collected in Korea [1]. Due to the burden of cancer, researchers and health authorities have focused on epidemiological investigations, including the causes of cancer incidence, mortality, treatment effects, and health outcomes in patients with cancer.
If feasible, randomized controlled trials might provide the best answer for causal inference between an exposure or intervention and outcomes. However, for ethical and practical reasons, observational studies have provided important evidence regarding the epidemiology of cancer. In particular, the long induction period of several decades or more for cancer occurrence or death due to potential risk factors has underscored the crucial role of observational studies in the field of cancer epidemiology.
However, the interpretation of causal inferences in observational studies has a fatal flaw: prognostic factors can frequently differ systematically between the exposure/intervention group and the control group. This systematic imbalance in prognostic factors is known as confounding. To control for confounders, researchers have employed various methods, including multivariate analysis, stratification, and propensity score matching [2–4].
Selecting appropriate covariates for multivariate analysis is an important part of epidemiological studies for various reasons, including the justification and reproducibility of causal inferences. However, the choice of the best method for this selection remains a matter of debate, and determining which variables to include can be challenging. Although researchers and methodologists have recognized the importance of covariates in multivariate analysis, previous literature reviews have shown that variable selection methods differ, and most authors do not adequately report information on how covariates were selected.
A previous study on 488 articles with multivariate analyses published in five major medical journals, including the New England Journal of Medicine and The Lancet, found that 48% reported variable selection methods unclearly, 16% used data-driven methods, and 36% used knowledge-based methods. The authors found that 10.5% of the studies misused variable selection methods (defined as the use of a data-driven method in a study with causal questions) [5]. A substantial proportion of the studies in specific health journals did not report the justification or method of variable selection. In 193 orthopedic studies with multivariate regression analyses, 65.8% selected variables based on nonstatistical methods (including all available variables without any interpretation of causality), and only 16% selected variables based on causal inference [6]. Of 150 nutritional studies, 94% did not select covariates a priori, and 63.3% did not report the selection criteria [7].
The current practice of selecting covariates for multivariate models lacks well-documented methods for specific selection. Therefore, this study aimed to evaluate covariate selection methods in multivariate models by reviewing studies published in major oncology journals in Korea.
Ethics statement
Ethics approval was not required because we only used data from published papers. Our protocol has been registered on the Open Science Framework (OSF) Registry [8].
Journal selection
We included the following 10 oncology journals with the highest impact factors in the Korea Medical Citation Index (KoMCI): Asian Oncology Nursing, Brain Tumor Research and Treatment, Cancer Research and Treatment, Clinical Pediatric Hematology-Oncology, Immune Network, Journal of Breast Cancer, Journal of Cancer Prevention, Journal of Gastric Cancer, Journal of Gynecologic Oncology, and Radiation Oncology Journal.
KoMCI provides citation data for individual journals and subject categories in Korean medical journals [9]. We searched KoreaMed, a resource provided by the Korean Association of Medical Journal Editors (KAMJE) that provides an assessment of articles published in Korean medical journals, for literature published in 2021. The search strategy was as follows: ((“Asian Oncol Nurs”[JTI]) OR (“Brain Tumor Res Treat”[JTI]) OR (“Cancer Res Treat”[JTI]) OR (“Clin Pediatr Hematol Oncol”[JTI]) OR (“Immune Netw”[JTI]) OR (“J Breast Cancer”[JTI]) OR (“J Cancer Prev”[JTI]) OR (“J Gastric Cancer”[JTI]) OR (“J Gynecol Oncol”[JTI]) OR (“Radiat Oncol J”[JTI])) AND (2021[DPY]).
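Because the search string follows a regular pattern (one journal-title clause per journal, joined by OR, then restricted by year), it can be assembled programmatically. The following is a minimal sketch in Python, not the authors' procedure: the helper name `build_koreamed_query` is hypothetical, straight quotes replace the typographic quotes of the printed strategy, and the field tags `[JTI]` and `[DPY]` are copied from the strategy reported above.

```python
# Journal title abbreviations as listed in the reported search strategy.
JOURNALS = [
    "Asian Oncol Nurs",
    "Brain Tumor Res Treat",
    "Cancer Res Treat",
    "Clin Pediatr Hematol Oncol",
    "Immune Netw",
    "J Breast Cancer",
    "J Cancer Prev",
    "J Gastric Cancer",
    "J Gynecol Oncol",
    "Radiat Oncol J",
]


def build_koreamed_query(journals, year):
    """Hypothetical helper: OR together journal-title clauses, then AND the year."""
    journal_clause = " OR ".join(f'("{name}"[JTI])' for name in journals)
    return f"({journal_clause}) AND ({year}[DPY])"


query = build_koreamed_query(JOURNALS, 2021)
print(query)
```

Keeping the journal list as data makes it easy to rerun the same search for another publication year or a different set of journals.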
Inclusion criteria
We included observational studies that investigated the associations between exposures or interventions and health outcomes using multivariate analyses. We focused on studies involving human participants (e.g., patients, caregivers, health volunteers, or healthcare practitioners). We included a broad definition of health outcomes, including those of direct importance to patients (e.g., mortality, morbidity, and quality of life) and surrogate outcomes (e.g., laboratory measures and radiological findings). The study design included cross-sectional, case-control, case-cohort, and cohort studies.
Since we focused on studies in which multivariate model outcomes were of primary interest, we selected studies that reported conducting multivariate analyses (e.g., multiple linear or logistic regression analysis) or presented regression coefficients (e.g., β coefficient, adjusted odds ratio, or hazard ratio) in both the abstract and main results sections.
Randomized controlled trials, reviews, meta-analyses, pooled analyses, letters, commentaries, economic studies, and animal studies were excluded.
Study selection process
We reviewed all studies published in the journals mentioned above during the study period. For references identified from the electronic database search, title or abstract screening was performed. A study was considered potentially eligible if the title or abstract contained a description of the regression method (e.g., multiple linear or logistic regression analysis) or reported regression coefficients (e.g., β coefficient, adjusted odds ratio, or hazard ratio). For potentially eligible studies, we obtained the full text and determined whether it satisfied the inclusion criteria. After calibration exercises, a team of two reviewers conducted the study selection process independently and in duplicate, and resolved any discrepancies by discussion or consultation with a third reviewer.
Data extraction
We conducted calibration exercises to ensure optimal accuracy and consistency and extracted data using a prepiloted data extraction form with written guidelines. Pairs of reviewers extracted data from each included study independently and in duplicate. Any discrepancies were resolved by discussion or, if necessary, consultation with a third reviewer.
We examined the Methods section of the included studies to investigate whether the authors provided information supporting the causal inferences behind the multivariate model. We also scrutinized the Methods and Results sections to identify the method the authors used for covariate selection. In addition, we examined the study objective in the Introduction and Discussion sections to determine whether the authors performed a multivariate analysis with causal intent. Detailed information on data extraction is as follows.

Study characteristics

The study characteristics included journal name, first author, publication year, study design (cross-sectional, case-control, and cohort), number of participants, primary exposures or interventions investigated, primary outcomes investigated, type of cancer investigated, language of publication (English or Korean), study questions (causation or association), study protocol, and adherence to reporting guidelines.
We regarded a study as having a causal intent if it described the study questions using causal language (e.g., terms such as “impact,” “effect,” or “causal relationship”) in the Introduction section or study objectives. We also regarded the study as having causal intent if the exposures investigated were pharmacological, surgical, or behavioral treatments, or if they suggested adopting or avoiding exposure to improve the outcome of interest in the Discussion section. Studies with other types of questions, such as prognosis or prediction studies, were regarded as association studies.
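The decision rule above can be sketched as a small function. This is only an illustrative approximation: in the study the judgment was made by human reviewers, whereas the sketch below uses simple keyword matching on the three example terms quoted above and takes the two other criteria (treatment-type exposure, recommendation to change exposure) as reviewer-supplied flags. All function and parameter names are hypothetical.

```python
import re

# Example causal terms quoted in the text; the actual review relied on reviewer judgment.
CAUSAL_TERMS = ("impact", "effect", "causal relationship")


def has_causal_language(objective_text, terms=CAUSAL_TERMS):
    """Return True if any causal term starts a word in the text (case-insensitive)."""
    text = objective_text.lower()
    return any(re.search(r"\b" + re.escape(term), text) for term in terms)


def classify_question(objective_text,
                      exposure_is_treatment=False,
                      recommends_changing_exposure=False):
    """Label a study question as 'causation' or 'association'.

    exposure_is_treatment: reviewer flag for pharmacological, surgical,
        or behavioral treatment exposures.
    recommends_changing_exposure: reviewer flag for a Discussion section that
        suggests adopting or avoiding the exposure to improve the outcome.
    """
    if (has_causal_language(objective_text)
            or exposure_is_treatment
            or recommends_changing_exposure):
        return "causation"
    return "association"


print(classify_question("We evaluated the effect of adjuvant chemotherapy on survival."))
print(classify_question("We aimed to identify prognostic factors for recurrence."))
```

Keyword matching of this kind will misclassify some sentences (for example, "side effects" as an outcome), which is one reason the classification in the study depended on trained reviewers rather than an automated rule.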

Analytic methods

We collected data on analytic models (linear, logistic, Cox proportional hazards, and others). If a study reported more than one multivariate model, we regarded the results in the first table as the study’s primary results and reviewed them accordingly.

Covariate selection methods

We collected reports of covariate selection methods (not described vs. described), prespecification of covariate selection methods (not described vs. described), prespecification of covariates (not described vs. described), and covariate selection methods in the final analytical model (knowledge-based only, data-driven only, both, or not described).
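The four-level coding of covariate selection in the final analytical model can be viewed as a function of two yes/no judgments per study. A minimal sketch follows, assuming each study is coded with two boolean flags; the function name and the toy records are hypothetical and are not taken from the study dataset.

```python
from collections import Counter


def final_model_category(uses_knowledge_based, uses_data_driven):
    """Map two reviewer judgments to the four reporting categories."""
    if uses_knowledge_based and uses_data_driven:
        return "both"
    if uses_knowledge_based:
        return "knowledge-based only"
    if uses_data_driven:
        return "data-driven only"
    return "not described"


# Toy extraction records (hypothetical): (uses_knowledge_based, uses_data_driven)
toy_records = [(False, True), (True, True), (False, False), (False, True)]

tally = Counter(final_model_category(k, d) for k, d in toy_records)
print(tally)
```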

Knowledge-based methods

We regarded covariates as selected by a knowledge-based method if a study stated that the covariates included in the multivariate model were previously known as potential confounders or if the authors hypothesized this. For studies using knowledge-based methods, we collected the following information: source of prior knowledge or hypothesis for covariates (published review, literature search conducted by the authors of the study, primary studies, expert opinion, not described, or others) and covariate selection methods (factors associated with the outcome of interest, factors associated with the exposure of interest, factors associated with both the exposure and outcome of interest, factors associated with either the exposure or outcome of interest, and others).

Data-driven methods

We considered a study to have used data-driven methods for covariate selection if it used the analyzed data. For studies using data-driven methods, we collected the detailed selection method as follows: covariate selection methods (effect estimate change; forward, backward, or stepwise selection; significant in the univariate model with the outcome as the dependent variable [e.g., P<0.05]; significant in the univariate model with exposure as the dependent variable [e.g., P<0.05]; significant in the multivariate model [e.g., P<0.05]; significant in the univariate model first and then significant in the multivariate model; significant in either the univariate or multivariate model; and others). This classification is not mutually exclusive because a single study may use multiple covariate selection methods.
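To make the category "significant in the univariate model with the outcome as the dependent variable" concrete, the sketch below screens binary covariates against a binary outcome and keeps those with P<0.05. It is an illustration of what such screening does, not a recommended procedure and not code from any reviewed study. As a simplifying assumption, a Pearson chi-square test on a 2x2 table (1 degree of freedom, no continuity correction) stands in for a univariate regression model, and the covariate names and cell counts are invented toy values.

```python
import math


def chi2_pvalue_2x2(a, b, c, d):
    """P value of the Pearson chi-square test for a 2x2 table.

    a, b: outcome present/absent among participants with the covariate
    c, d: outcome present/absent among participants without the covariate
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of the chi-square distribution with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def univariate_screen(tables, alpha=0.05):
    """Return the covariates whose univariate P value is below alpha."""
    return [name for name, cells in tables.items()
            if chi2_pvalue_2x2(*cells) < alpha]


# Toy 2x2 tables (hypothetical data): covariate -> (a, b, c, d)
toy_tables = {
    "smoking": (30, 20, 15, 35),
    "sex": (22, 28, 23, 27),
}

selected = univariate_screen(toy_tables)
print(selected)
```

In this toy example only the covariate with a univariate P value below 0.05 is carried forward to the multivariate model. The selection depends entirely on the analyzed data, which is why the approach is classified as data-driven.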
There was no bias in searching and selecting the target literature.
Study size
It was not required to estimate the sample size. All target journals were included.
Statistical analysis
We presented the basic characteristics of the studies, including descriptive statistics, as numbers and percentages. Next, we reported the proportion of covariate selection methods and their associated characteristics. We compared covariate selection methods according to the study question (causation vs. association). All analyses were performed using SAS ver. 9.4 (SAS Institute Inc).
Characteristics of the included studies
A total of 494 articles published in 2021 were identified from the 10 oncology journals. Of these, 123 remained potentially eligible after title or abstract screening. After excluding 16 studies without multivariate analysis in the full-text screening, a total of 107 observational studies were included. About 90% were cohort studies, and about 74% included 1,000 or fewer participants. The most common primary exposure was multiple risk factors, and the most common primary outcome was mortality. The most common types of cancer investigated were gastric (20.6%) and cervical (11.2%), and eight studies (7.5%) analyzed cancer in general (i.e., any type) (Table 1). The list of the included studies is provided in Suppl. 1.
Variable selection methods
Approximately 65% of the studies conducted Cox proportional hazard regression, and 26.2% conducted logistic regression. None used prespecified covariate selection methods or covariates. Only seven studies selected covariates using knowledge-based methods, and 57 selected covariates using data-driven methods. The most common data-driven method was significance in the univariate model with the outcome as the dependent variable (e.g., P<0.05) (Table 2 and Dataset 1).
Cox proportional hazard regression was the most frequently used model for both types of study questions. Studies with associational questions described covariate selection more frequently than studies with causal questions (60.0% vs. 38.1%). Among studies with causal questions, more than half (61.9%) did not clearly report the variable selection method (Table 3).
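The percentages for reporting by type of study question can be reproduced from the counts in Table 3. The following Python sketch is for illustration only (the authors report using SAS); the helper name `n_pct` is hypothetical, and the counts are copied from Table 3.

```python
def n_pct(count, total):
    """Format a count as 'n (percent)' with one decimal place."""
    return f"{count} ({100 * count / total:.1f})"


# Counts copied from Table 3 (reporting of the covariate selection method).
reporting = {
    "Association": {"Not described": 26, "Described": 39},
    "Causation": {"Not described": 26, "Described": 16},
}

for question, counts in reporting.items():
    total = sum(counts.values())  # column denominator (65 or 42)
    for level, count in counts.items():
        print(f"{question:<12} {level:<14} {n_pct(count, total)}")
```

Each percentage uses the number of studies with that type of question as its denominator, so the two columns are compared on the share of studies, not on raw counts.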
Key results
Multivariate analysis is a commonly used method of controlling for the effects of confounders in observational studies to determine the relationship between exposures and outcomes. This systematic review presented the current variable selection and reporting practices in leading oncology journals in Korea, and it is the first study to present this information. None of the included studies used prespecified covariate selection methods, and half of the studies did not provide sufficient information to allow a classification of their methods.
Comparison with previous studies
Half of the included studies did not provide sufficient information to evaluate the covariate selection methods. This finding is similar to the results of previous studies. A descriptive review of variable selection methods in four major epidemiology journals reported that 37% of the included studies did not provide sufficient details to allow the determination of variable selection methods [10]. Similarly, the variable selection method was unclear in 48% of the studies in the five medical journals [5]. This might be due to well-known associations between variables that do not require citations, lack of space in the paper, the authors’ ignorance, or the journal’s editorial policy [5]. However, in the absence of a widely agreed-upon method for variable selection, a complete and clear description of the statistical methods, including variable selection, would provide evidence for judgment about the uncertainties associated with the interpretation of the results. Therefore, authors and journals should focus on transparent and clear descriptions of variable selection.
Of the studies reporting variable selection, most selected covariates using only data-driven methods, even when the studies had causal study questions. Previous studies have employed various data-driven methods. Among 292 studies in four epidemiology journals, 146 selected variables based on prior knowledge and 69 selected variables using data-driven methods, and the change in estimate approach was the most common [10]. Among 287 studies in two Chinese epidemiology journals, 163 selected variables using bivariate analyses and 45 selected variables based on prior knowledge or personal judgment [11]. Among 488 articles, variable selection was knowledge-based in 176 and data-driven in 78, and univariate selection was the most common [5]. Data-driven methods are known to be suitable for association questions such as prognosis or predictive research, while data-driven methods for causal inference studies are known to pose a risk of bias [4]. Although various data-driven methods have been introduced, only a few were used in the included studies.
Implications for future studies
None of the included studies used prespecified covariate selection methods and covariates. It is best to specify this in the research protocol or statistical analysis plan; however, a previous study addressing the registration practices of observational studies noted that the preparation of protocols for observational studies was very limited [12]. Observational study protocols are necessary for the qualitative aspects of studies, such as covariate selection and prevention of selective reporting; however, the research community has not been able to reach a consensus regarding this issue, and only a few institutions or reporting guidelines suggest it [13].
Guidelines for observational studies, such as STROBE (Strengthening the Reporting of Observational Studies in Epidemiology), indicate that it is necessary to “describe all statistical methods, including those used to control for confounding,” but only one included study directly stated that it followed the reporting guidelines. Since compliance with reporting guidelines is greatly influenced by journal editorial policy, the peer review process and journal guidance for authors should reflect this requirement to improve the completeness and quality of research reporting.
The strengths of this study include its adherence to a standard methodology. We conducted independent and duplicate reference screening and data extraction after the calibration exercises. Additionally, our study selected representative and major oncology journals and provided a comprehensive picture of the current practice of covariate selection methods.
A potential limitation of this study is our reliance on journal reports to evaluate the choice of covariates. The authors may not have described how covariates were selected because they considered it relatively unimportant or because of the journal’s word count limits. Nevertheless, we reviewed all the authors’ descriptions, including the protocol and appendices, if available.
The process of extracting information from the studies may have required subjective judgment by a reviewer. For example, even if the study did not explicitly state that causation was of interest, if it recommended changes in exposure to improve the health outcome of interest, we regarded the study question as causation, depending on the reviewer’s interpretation. We conducted training and calibration exercises for the reviewers with documented instructions to ensure a high degree of agreement.
We provided an overview of the covariate selection methods used in articles published in major Korean oncology journals. None of the included studies used prespecified covariate selection methods, and half of the studies did not provide enough information to classify the methods. As there is currently no single agreed-upon method, clearly and completely describing the methods is important for the interpretation of results and judgment of uncertainty. Our results inform the research community about controlling for confounders. In addition, they can be used to educate researchers, editors, and reviewers to increase the transparency and adequacy of covariate selection in multivariate analyses in observational studies.

Conflict of Interest

No potential conflict of interest relevant to this article was reported.


This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Korean Ministry of Education (No. 2021 R1I1A3041301) and the Korean Ministry of Science and ICT (No. NRF-2022R1A5A2030454).

Data Availability

Dataset file is available from the Harvard Dataverse at

Dataset 1. Research data of covariates selection methods.


Supplementary materials are available from
Suppl. 1. List of the included studies.
Table 1.
Characteristics of the included studies (n=107)
Characteristic No. of studies (%)
Type of study design
 Cohort 96 (89.7)
 Cross-sectional 6 (5.6)
 Case-control 5 (4.7)
Total number of participants included
 ≤ 1,000 79 (73.8)
 1,001–5,000 11 (10.3)
 ≥ 5,000 17 (15.9)
Primary exposure
 Multiple exposures (e.g., prognostic factors) 39 (36.5)
 Therapeutic clinical intervention (e.g., behavior change facilitation, drug therapy) 29 (27.1)
 Biophysical status (e.g., blood pressure, blood lipids, body weight) 17 (15.9)
 Morbidity (e.g., cardiovascular disease, cancer, diabetes) 6 (5.6)
 Health behavior (e.g., smoking, alcohol consumption, physical activity, diet) 4 (3.7)
 Other 12 (11.2)
Primary outcome
 Mortality (e.g., all-cause mortality, disease-specific mortality) 64 (59.8)
 Morbidity (e.g., cardiovascular disease, cancer, hospitalization) 22 (20.6)
 Quality of life (e.g., overall, disease-specific quality of life) 5 (4.7)
 Biophysical status (e.g., blood pressure, blood lipids, body weight) 5 (4.7)
 Other 11 (10.3)
Type of cancer investigated
 Any 8 (7.5)
 Gastric 22 (20.6)
 Cervix 12 (11.2)
 Breast 10 (9.4)
 Uterus 8 (7.5)
 Lung 6 (5.6)
 Colorectum 6 (5.6)
 Brain 5 (4.7)
 Ovary 4 (3.7)
 Leukemia/lymphoma 4 (3.7)
 Other 22 (20.6)
Language of publication
 English 100 (93.5)
 Korean 7 (6.5)
Type of question
 Association 65 (60.8)
 Causation 42 (39.3)
Study protocol (not described) 107 (100)
Adherence to reporting guidelines
 Not described 106 (99.1)
 Described 1 (0.9)

Percentages may not total 100 due to rounding.

Table 2.
Reporting of methods for variable selection (n=107)
Reporting method No. of studies (%)
Analytic model
 Cox proportional hazard regression 69 (64.5)
 Logistic regression 28 (26.2)
 Linear regression 8 (7.5)
 Other 2 (1.9)
Prespecification of the covariate selection method (not described) 107 (100)
Prespecification of covariates (not described) 107 (100)
Reporting of the covariate selection method
 Not described 52 (48.6)
 Yes 55 (51.4)
Covariate selection in the final analytic model
 Both 6 (5.6)
 Data-driven only 51 (47.7)
 Knowledge-based only 1 (0.9)
 Not described 49 (45.8)
Knowledge-based method
 Source of prior knowledge or hypothesis for covariates
  Not applicable 100 (93.5)
  Not described 4 (3.7)
  Literature search conducted by authors of the study 1 (0.9)
  Published review 2 (1.9)
 Covariate selection method
  Not applicable 101 (94.4)
  Factors associated with the outcome of interest 5 (4.7)
  Factors associated with both the exposure and outcome of interest 1 (0.9)
Data-driven method a)
 Significant in univariate model with the outcome as the dependent variable (e.g., P < 0.05) 38 (35.5)
 Significant in the univariate model first and significant in the multivariable model 7 (6.5)
 Backward 8 (7.5)
 Stepwise selection 7 (6.5)
 Forward 4 (3.7)
 Significant in multivariable model (e.g., P < 0.05) 3 (2.8)
 Significant in univariate model with the exposure as the dependent variable (e.g., P < 0.05) 1 (0.9)

Percentages may not total 100 due to rounding.

a) Multiple responses.

Table 3.
Reporting of methods for variable selection according to the type of study question (n=107)
Reporting method Association (n = 65) Causation (n = 42)
Analytic model
 Cox proportional hazard regression 41 (63.1) 28 (66.7)
 Logistic regression 19 (29.2) 9 (21.4)
 Linear regression 5 (7.7) 3 (7.1)
 Other 0 (0) 2 (4.8)
Reporting of the covariate selection method
 Not described 26 (40.0) 26 (61.9)
 Described 39 (60.0) 16 (38.1)
Covariate selection in the final analytic model
 Both 3 (4.6) 3 (7.1)
 Data-driven only 36 (55.4) 15 (35.7)
 Knowledge-based only 1 (1.5) 0 (0)
 Not described 25 (38.5) 24 (57.1)

Percentages may not total 100 due to rounding.
