Assessing adolescent ‘self-efficacy in body and health’: Exploring the psychometric properties of the SEBH scale

Self-efficacy beliefs are significant predictors of achievement in education. However, the majority of existing self-efficacy measures are rather ‘general’ and assess aggregated perceptions of students’ proficiencies within broad academic disciplines. Applying Rasch analysis, the present study explored the psychometric properties of the five-item ‘self-efficacy in body and health’ (SEBH) scale as administered to more than 1600 tenth-graders aged 15-16 years in Norway. Based on our sample, the SEBH

DESIRE ALICE NAIGAGA (corresponding author), PhD student, Oslo Metropolitan University, Norway


Introduction
Does it matter whether you have the will and belief that you can? Of course it does! Self-efficacy signifies a person's belief that he or she is able to execute successfully the behaviours required to produce a specific outcome. Thus, self-efficacy is the person's belief in his or her capability to control and execute actions in spite of potential obstacles. A person's perceived self-efficacy has a direct influence on the choice of activities and settings, and the stronger the perceived self-efficacy, the more active the efforts to cope with the task at hand (Bandura, 1977). Therefore, self-efficacy affects individuals' decisions concerning the effort and endurance they will put into a task. In general, higher self-efficacy is linked with greater effort, perseverance and resilience (Van Dinther, Dochy, & Segers, 2011; Zeegers, 2004).
In school achievement, self-efficacy refers to an individual's belief in his or her ability to successfully accomplish academic tasks or to achieve academic goals (Schunk, 1991). Scales measuring academic self-efficacy evaluate the extent to which students perceive they can accomplish established academic goals (Marsh, Hau, Artelt, Baumert & Peschar, 2006; Pastorelli et al., 2001). However, according to Bong and Skaalvik (2003), the majority of the existing academic self-efficacy measures are 'wide-ranging', aiming at school proficiency in general, thus making them more reflective of 'academic self-concept'. Self-efficacy is a specific view of one's capacities in a given domain, and it follows that efficient self-efficacy measures should be tailored to the particular domain of interest (Bandura, 2006).
In spite of the advantages that item response theory (IRT) models, and Rasch models in particular, have over classical test theory (CTT), few health-related and health literacy scales have been evaluated using IRT and Rasch models (see, for example, Davidson, Keating & Eyres, 2004; Escobar et al., 2015; Huang et al., 2018; Nguyen, Paasche-Orlow, Kim, Han & Chan, 2015). One such advantage is concerned with the assumption of item-sample independence, which is strongly emphasized in IRT and Rasch models. While violations of local independence in IRT and Rasch models, and 'error correlations' in confirmatory factor models (CFM), might refer to similar 'problems' in the data, there is no direct link between the probabilistic IRT and Rasch models and the correlation-based CFM. Unlike descriptive IRT models, the family of prescriptive Rasch models satisfies the requirements of fundamental measurement (Andrich, 1988).

15(2), 2019
To fill the gaps identified, there is a need for a measurement scale that evaluates how adolescents perceive their proficiency in accomplishing specific academic tasks within health, and that meets the assumptions and satisfies the requirements of fundamental measurement. To exemplify this in the field of science education, the current study focuses on the subject area of 'body and health' in the Norwegian compulsory school science curriculum. This subject area, which focuses on the structure of our bodies, how the body is affected by nutrition and lifestyle and how the body changes over time, will play a vital part in the new and forthcoming interdisciplinary school topic 'public health and wellbeing' (KD, 2016). A self-efficacy in 'body and health' scale might be efficient for evaluating the proficiency with which adolescents perceive they can apply that knowledge to solve complex problems in new and unfamiliar contexts and adopt the critical thinking skills associated with 'deeper learning' (Paakkari, L. & Paakkari, O., 2012; Pellegrino & Hilton, 2012; KD, 2016).
The main objective of the current study is therefore to validate, by applying Rasch analysis, a five-item measurement scale tailored towards assessing adolescent self-efficacy in 'body and health' at the end of compulsory school (tenth grade). We will test the following hypotheses: H1) The 'self-efficacy in body and health' (SEBH) scale has acceptable overall fit to the rating scale parameterization of the polytomous unidimensional Rasch model, consists of locally independent items, and represents a well-targeted and reliable measurement scale. H2) Each SEBH-scale item has ordered response categories, functions in the same way for the different levels of relevant person factors, and shows acceptable fit to the rating scale parameterization of the polytomous unidimensional Rasch model.
Our first hypothesis is concerned with the overall SEBH-scale psychometric properties, while our second hypothesis refers to the psychometric properties at the individual item level. With the goal of estimating as few parameters as possible (parsimony rule), we hypothesized a unidimensional scale with items sharing the same set of thresholds.

Method: Sample
A sample of 200 Norwegian lower secondary schools was randomly selected, and the school principals were contacted by email and telephone seeking consent to volunteer. Fifty-eight schools (30%) accepted the invitation. From April to May 2015, 1622 students in the tenth grade (47% girls) responded by using an electronic assessment tool.

The substantive theory of the SEBH latent variable
The SEBH-scale is a revised and further developed version of a self-efficacy scale reported by Guttersrud & Pettersen (2015), which was based on self-efficacy measures in science and the control expectation scale applied in PISA (Organization for Economic Co-operation and Development [OECD], 2001). The items were reworded to reflect competencies within 'body and health', with one additional item (Table 1): 'I am confident that I can apply the knowledge that I have in Body and Health in new and unfamiliar situations'. This item reflects aspects of adaptability, that is, the transferability of self-efficacy beliefs to novel and changing situations (Martin, Nejad, Colmar, & Liem, 2013; Pellegrino & Hilton, 2012), and of deeper learning, that is, the mastering of core academic content at high levels (Pellegrino & Hilton, 2012).

Person factor levels and data processing
Students reported the following five person factors (with levels indicated in parentheses): gender (male/female); age at the time of the survey (15 or 16 years old); language predominantly spoken at home (Norwegian, Danish/Swedish (i.e., Scandinavian languages) or 'other'); student's, mother's and father's place of birth (Norway, Denmark/Sweden or 'other'); and the number of books at home (five categories). A picture showing how different numbers of books might look on shelves was included to improve validity or 'response accuracy'.
The variables for birthplace were re-coded into a new variable named 'cultural background' with the levels 'majority' (if the student or at least one of the parents was born in any of the Scandinavian countries, i.e., Norway, Denmark or Sweden) and 'minority'. This classification is valid as countries within Scandinavia share strong cultural and linguistic similarities. The five levels of 'number of books at home' were merged into the categories 'less than 100 books' and '100 or more books'. These two levels reflected the largest difference in SEBH-scale score (cf. DIF analysis). The number of books was used as an indicator of socioeconomic status (SES), as research on SES and family resources shows that children's initial reading competency is correlated with the home literacy environment and the number of books owned, with children from poor households often having less access to learning materials, including books, computers and skill-building lessons, that create a positive literacy environment (Aikens & Barbarin, 2008; Bergen, Zuijen, Bishop, & Jong, 2016; Bradley, Corwyn, McAdoo, & García Coll, 2001; Orr, 2003). As a consequence, research indicates that children from low-SES households develop academic skills more slowly than children from higher SES groups (Morgan, Farkas, Hillemeier, & Maczuga, 2009).

SEBH-scale response characteristics
A six-point rating scale with the extreme response categories anchored with a phrase (1 = 'strongly disagree' and 6 = 'strongly agree') was applied for all SEBH-scale items. Of the 1622 student responses, 1568 were valid. There were 166 extreme scorers, of whom 12 students attained the lowest possible raw score on the SEBH items responded to and 154 students attained the highest possible score (ceiling effect) on the items responded to. There were a total of 36 missing responses to the five items, with item 5 having the highest number of these (15) and item 1 having the fewest (2). We have no evidence weakening the hypothesis stating that 'data are missing completely at random' (MCAR; Allison, 2001).

The unidimensional Rasch model: a rationale for the methodological decisions
The prescriptive Rasch models estimate the probability of endorsing an item based on the difference between the person location (proficiency or attitude) and item location (difficulty or affective level) (Rasch, 1960; Shaw, 1991). Person and item location estimates refer to the point estimate of a person's or an item's location on the latent trait scale, respectively (Harris, 1989). In the current study, person location refers to an individual's self-reported perceived proficiency in body and health. The different threshold locations reflect the locations at which the probability of a response in two adjacent categories is equal. For example, a dichotomously scored item has one threshold, and the threshold location refers to the location at which the probability of a response in each of the two adjacent categories is 0.5. In this paper, we applied RUMM 2030 for all analyses (Andrich, Lyne, Sheridan, & Lou, 2010). RUMM uses pairwise maximum likelihood estimation (PLME) and Warm's mean weighted likelihood estimation (WLE) for estimating item locations and person locations, respectively (Katsikatsou, Moustaki, Yang-Wallentin, & Joreskog, 2012; Warm, 1989).
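The relationship between locations, thresholds and response probabilities described above can be illustrated with a minimal sketch. The function names and the example threshold values are hypothetical and chosen for illustration only; this is not the estimation routine used by RUMM 2030, and thresholds are assumed to be expressed as absolute locations in logits.

```python
import math


def rasch_probability(person_location: float, item_location: float) -> float:
    """Probability of endorsing a dichotomous item under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))


def category_probabilities(person_location: float, thresholds: list) -> list:
    """Category probabilities for one polytomous Rasch item.

    `thresholds` holds one (hypothetical) threshold location per pair of
    adjacent categories, so six categories need five thresholds.
    """
    # Exponent for category x is the sum of (person - threshold_k) for k <= x;
    # the lowest category has an empty sum, i.e. 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (person_location - tau))
    largest = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - largest) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# A person located exactly at a threshold is equally likely to respond in
# either of the two adjacent categories.
probs = category_probabilities(-1.0, [-1.0, 0.0, 1.0])
print(rasch_probability(0.5, 0.5), probs[0] == probs[1])
```

With a person located at the single threshold of a dichotomous item, the sketch returns 0.5, which matches the definition of a threshold given in the text.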
The concept 'item discrimination' refers to the degree to which an item separates individuals with higher person location estimates from those with lower location estimates. An under-discriminating item differentiates more weakly between such respondent groups than the RM expects, given the item location.
Using the Rasch Model (RM), raw scores at the ordinal level (which presumes 'ordered response categories'; otherwise the level is nominal) are transformed into interval-level measures, implying additivity (Andrich, 1989; Perline, Wright & Wainer, 1979; Salzberger, 2010). Fit to Rasch models implies that the property of invariance holds, meaning that the item-trait relationships are stable for the different person locations along the latent trait scale (Andrich, 1988). Rasch models satisfy specific objectivity, which refers to the requirement of item-person independence: any person location estimate must be independent of the specific measurement device or items applied (Stenner, 1994). As the raw scores contain all the information needed to estimate the Rasch model parameters, i.e., item and person locations, the raw score is a sufficient statistic for Rasch models (Andersen, 1977).
While both Rasch models and other IRT models assume locally independent data (that is, unidimensional and statistically independent data), only the family of Rasch models ensures additivity, invariance, specific objectivity and sufficiency as described above. Therefore, we applied prescriptive Rasch models and not descriptive IRT models in this study.

Overall model fit
The parameters of the rating scale parameterization (RSM; Andrich, 1978) of the RM are a subset of the parameters of the partial credit parameterization (PCM; Masters, 1982) of the RM, so the RSM is nested in the PCM. We compare data-model fit for nested models using the likelihood ratio test (LRT).
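The nesting of the two parameterizations can be written out as follows. The notation is an assumption introduced here for illustration (it does not appear in the original text): \beta_n is the location of person n, \delta_{ik} the k-th threshold of item i, \delta_i the location of item i, \tau_k a threshold shared by all items, and m the number of thresholds per item (m = 5 for six response categories).

```latex
% Partial credit parameterization: every item i has its own thresholds \delta_{ik}
\[
P(X_{ni} = x) =
\frac{\exp\left(\sum_{k=1}^{x} (\beta_n - \delta_{ik})\right)}
     {\sum_{j=0}^{m} \exp\left(\sum_{k=1}^{j} (\beta_n - \delta_{ik})\right)},
\qquad x = 0, 1, \dots, m,
\]
% with the empty sum (x = 0 or j = 0) defined as 0.

% Rating scale parameterization: the same model under the restriction
\[
\delta_{ik} = \delta_i + \tau_k, \qquad \sum_{k=1}^{m} \tau_k = 0 .
\]
```

Because the rating scale form is obtained from the partial credit form by imposing this restriction, any rating scale solution is also a partial credit solution, which is what makes the two models nested.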
The LRT test statistic, the change in deviance (D), is asymptotically χ² distributed (i.e., for large samples) with degrees of freedom (df) equal to the difference in the number of estimated model parameters (Wilks, 1938, p. 62). A 'significant' χ² value implies rejecting the 'null hypothesis' stating that the less complex and nested model, describing the data using fewer threshold estimates, is preferred (cf. hypothesis 1). Compared with the RSM, the PCM estimates more parameters and therefore usually accounts better for the observed data.
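The likelihood ratio test described above can be sketched with the standard library alone. The deviances and parameter counts in the example are hypothetical placeholders, not values from this study, and the helper names are introduced here for illustration only.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-half), 1.0              # closed form for df = 2
    else:
        q, a = math.erfc(math.sqrt(half)), 0.5   # closed form for df = 1
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while 2 * a < df:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q


def likelihood_ratio_test(deviance_restricted: float, deviance_general: float,
                          n_params_restricted: int, n_params_general: int):
    """Return (change in deviance, df, p-value) for two nested models."""
    statistic = deviance_restricted - deviance_general
    df = n_params_general - n_params_restricted
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical example: the restricted model has 10 parameters, the general 14.
stat, df, p = likelihood_ratio_test(1012.0, 1000.0, 10, 14)
print(stat, df, p)
```

A small p-value would lead to rejecting the restricted (nested) model in favour of the more general one, mirroring the decision rule stated in the text.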

Individual item and person fit
To account for our somewhat large sample size (N = 1622), we drew five random samples each of 250, 500 and 750 persons from the SPSS file storing the data, a total of fifteen samples. These sample sizes correspond to 10, 20 and 30 persons per threshold (Andrich, 2010). We estimated individual item χ² and overall χ² for each sample, and we reported the mean values. To account for the significance testing of k individual items, we Bonferroni-adjusted the individual item χ² p-values by the number of χ² tests performed: 0.05/k = 0.01 (see Bland & Altman, 1995).
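The arithmetic behind the amended sample sizes and the Bonferroni adjustment can be sketched as follows. The function `draw_subsamples`, the seed and the use of sampling without replacement are assumptions made for illustration; the original samples were drawn in SPSS and the exact procedure is not described.

```python
import random

N_ITEMS = 5
THRESHOLDS_PER_ITEM = 5  # six response categories give five thresholds
TOTAL_THRESHOLDS = N_ITEMS * THRESHOLDS_PER_ITEM

# 10, 20 and 30 persons per threshold
sample_sizes = [persons * TOTAL_THRESHOLDS for persons in (10, 20, 30)]

# Bonferroni-adjusted significance level for k = 5 item-level tests
bonferroni_alpha = 0.05 / N_ITEMS


def draw_subsamples(person_ids, size, n_draws=5, seed=0):
    """Draw `n_draws` random subsamples of `size` persons (without replacement)."""
    rng = random.Random(seed)
    return [rng.sample(person_ids, size) for _ in range(n_draws)]


person_ids = list(range(1622))  # placeholder identifiers for the full sample
subsamples = {size: draw_subsamples(person_ids, size) for size in sample_sizes}
print(sample_sizes, bonferroni_alpha, sum(len(v) for v in subsamples.values()))
```

Fit statistics would then be estimated within each subsample and averaged per sample size, as described in the text.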
Person z-fit shows how well a person's response pattern conforms to the 'Guttman structure' (Andrich, 1978). The difference in difficulty of the items caused by dependence is reported as a z-fit residual statistic at a conservative 1% level of significance (z = 2.56); a positive z-fit > 2.56 indicates an unexpected response pattern (Andrich & Kreiner, 2010).

Local independence: response independency and unidimensionality
Once we have extracted the Rasch factor (the unidimensional underlying latent trait 'self-efficacy'), we assume there are no further patterns in the residuals (Wright, 1996). This assumption is tested by checking for response dependency and multidimensionality. Response dependency implies that items are linked in such a way that the responses to one item influence the responses to other items, and we identify this phenomenon by inspecting the item residual correlation matrix. The commonly used fixed criterion of an item residual correlation < 0.30 has recently come under criticism for being too conservative. An alternative approach, based on Yen (1984), explores local dependence by comparing each item residual correlation with the average item residual correlation, with values more than 0.2 above the average taken to indicate dependency.
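The relative criterion described above (a residual correlation more than 0.2 above the average residual correlation) can be sketched as follows. The function names and the toy residuals are hypothetical; in practice the standardized item residuals would come from the fitted Rasch model, and the software's own computation may differ in detail.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def flag_dependent_pairs(residuals, margin=0.2):
    """Flag item pairs whose residual correlation exceeds the average by `margin`.

    `residuals` maps an item label to that item's residuals, with persons
    in the same order for every item.
    """
    pairs = list(combinations(sorted(residuals), 2))
    corr = {pair: pearson(residuals[pair[0]], residuals[pair[1]]) for pair in pairs}
    average = sum(corr.values()) / len(corr)
    flagged = [pair for pair in pairs if corr[pair] > average + margin]
    return corr, average, flagged


# Toy residuals for three hypothetical items and four persons
toy = {
    "item1": [1.0, -1.0, 1.0, -1.0],
    "item2": [1.0, -1.0, 1.0, -1.0],
    "item3": [1.0, 1.0, -1.0, -1.0],
}
corr, average, flagged = flag_dependent_pairs(toy)
print(average, flagged)
```

In the toy data only the first two items share residual variation, so only that pair is flagged.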
Unidimensionality means that only one latent trait, self-efficacy, explains all the covariances between the items (cf. partial correlations). A combined principal component analysis (PCA) of residuals and paired t-tests procedure is applied to check for unidimensionality (Hagell, 2014). If approximately 5% or less of the dependent t-tests comparing respondents' location estimates on two distinct subscales are significant, then unidimensionality is assumed (Smith Jr, 2002; Tennant & Pallant, 2006).
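The proportion of significant person-level t-tests can be sketched as below. This is a simplified illustration under stated assumptions: each person has a location estimate and standard error on each of the two subscales, the test statistic is the difference divided by the pooled standard error, and a normal critical value of 1.96 is used. The function name and the toy values are hypothetical, and the exact computation in RUMM 2030 may differ.

```python
import math


def proportion_significant(subscale1, subscale2, critical=1.96):
    """Share of persons whose two subscale estimates differ significantly.

    Each argument is a list of (location, standard_error) tuples, with
    persons in the same order in both lists.
    """
    significant = 0
    for (loc1, se1), (loc2, se2) in zip(subscale1, subscale2):
        t = (loc1 - loc2) / math.sqrt(se1 ** 2 + se2 ** 2)
        if abs(t) > critical:
            significant += 1
    return significant / len(subscale1)


# Toy estimates for four hypothetical persons
first = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5), (3.0, 0.5)]
second = [(0.1, 0.5), (1.2, 0.5), (0.0, 0.5), (3.1, 0.5)]
print(proportion_significant(first, second))
```

The resulting proportion is then compared with the approximate 5% criterion mentioned in the text.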
Furthermore, by creating a 'subtest structure' for a pair of item subsets identified, we can estimate fractal indices (r, c and A) specific to the 'subtest structure'. The index A describes the amount of common variance between the two subsets or subscales identified, c identifies the magnitude of unique subscale variance, and r is the correlation between the two subsets (RUMM, 2009). High values for both A and r, and a low value for c, might therefore indicate an approximately unidimensional scale (Andrich, 2016; Andrich, 2015).

Targeting, reliability, ordering of response categories and differential item functioning
In a well-targeted scale, the distribution of the person estimates matches the distribution of the item threshold estimates centred at 0.0 logits. Poor targeting might increase the risk of unordered response categories and disordered thresholds, large standard errors, extreme person scores, and therefore deflated reliability indices and poor information at certain locations along the latent trait scale.
The internal consistency reliability of the latent trait measurement scale is reported as the Person Separation Index (PSI), which is analogous to Cronbach's alpha, and indicates the capacity to separate persons with higher location estimates from those with lower location estimates on the latent trait (Andrich, 1982). Different criteria are suggested for PSI, with values >0.70, >0.80 and >0.90 indicating 'acceptable', 'good' and 'excellent' reliability, respectively (Duncan, Bode, Lai & Perera, 2003). Often 0.7 is used as the minimum value for group and 0.85 as the minimum value for assessments at the individual item level (Cronbach, 1951).
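One common way of expressing a person separation index is as the share of the observed variance in person location estimates that is not measurement error. The sketch below follows that form as an assumption; the function name and the toy values are hypothetical, and the exact computation in RUMM 2030 may differ in detail.

```python
import statistics


def person_separation_index(locations, standard_errors):
    """(observed variance - mean error variance) / observed variance."""
    observed_variance = statistics.variance(locations)
    error_variance = statistics.mean([se ** 2 for se in standard_errors])
    return (observed_variance - error_variance) / observed_variance


# Toy person estimates (logits) with equal standard errors
toy_locations = [-1.0, 0.0, 1.0, 2.0]
toy_errors = [0.5, 0.5, 0.5, 0.5]
print(person_separation_index(toy_locations, toy_errors))
```

Larger standard errors relative to the spread of person estimates pull the index down, which is why poor targeting and extreme scores can deflate reliability.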
Differential item functioning (DIF) or 'within-item bias' might occur when different 'levels' or 'groups' of a person factor, such as males and females, at equivalent levels of the underlying construct have different probabilities of endorsing an item (Holland & Wainer, 1993; Walker, Beretvas, & Ackerman, 2001). When persons belonging to a particular 'level' show a consistent systematic difference in their responses to an item, uniform DIF is implied. In cases where the differences vary across levels of the attribute between the person factor groups, non-uniform DIF is indicated. Items that display non-uniform DIF are discarded from the instrument.

A procedure in RUMM2030 allows for the resolution of uniform DIF by splitting the item into multiple items, one for each group level, and comparing the estimates of the item parameters from the different 'levels'.

Results
We found that the SEBH items did not share the same set of threshold difficulties. A significant likelihood ratio test statistic, LRT χ² (p = 0.000019; df = 11), indicated that the PCM (partial credit parameterization) of the polytomous unidimensional Rasch model described the data 'significantly' better than the RSM (rating scale parameterization).
In Table 2, we report the overall adjusted mean χ² value for each of the amended sample sizes estimated from five random samples reflecting 10, 20 and 30 individuals per scale threshold, as χ² is a sample-size-dependent fit statistic. The PCM of the polytomous unidimensional Rasch model was applied. To sum up, Table 2 indicates that hypothesis 1 (the SEBH data is sufficiently described by RSM) is not fully supported.
Individual person residuals showed that 20 and 89 students had z-fit values above and below the cut-off criterion of +/-2.56, respectively (Andrich & Kreiner, 2010). Values above the +2.56 threshold are of concern, as these indicate response patterns that are unlikely, i.e., that deviate significantly from the Guttman pattern given the self-efficacy sum score. However, removing these few responses did not significantly change any fit parameter estimates.
The assumption of a locally independent scale holds for the SEBH-scale as no response dependence between any pair of items was observed, and only 6.2% of paired t-tests were significant. The t-test structure was based on two subsets of items empirically indicated by the PCA of residuals procedure (the easily endorsable items 1-3 (subscale 1) versus items 4 and 5 (subscale 2), see Table 3). A subset analysis indicated that these two subscales measured strongly related latent traits (high subscale common variance A = 0.90, subscale correlation r = 0.97 and low subscale unique variance c = 0.17).

Note. The location estimates with the standard errors are based on the full sample. Each χ² value is the mean value estimated from five random samples of sample sizes corresponding to 10, 20 and 30 persons per threshold, respectively (N = 250, 500, 750).
For all random samples of sizes 250 and 500, all the chi-square values were non-significant (p(χ²)). For the random samples of 750, the chi-square value for item 5 was significant in two of the five random samples.
When centering the average item location at 0.0 logits, the resulting average person proficiency was at 1.4 logits, pointing to a scale that could have been better targeted. The positively skewed distribution of person self-efficacy estimates deviates somewhat from the locations at which the items measure most efficiently.
The above results suggest that the SEBH-scale is a rather valid measure of self-efficacy in tenth graders. Sufficiently high reliability indices indicated a reliable measure (PSI = 0.88 for original and complete data sets and Cronbach's alpha = 0.88 (excluding extremes) and 0.92 (including extremes) for the complete data set where the 36 respondents with missing data for one or more SEBH-items were discarded). Hence, the SEBH-scale is an accurate and precise measure of self-efficacy.
Moving from the overall analyses to the single item level, the slightly disordered response categories observed for item 1 (Figure 1) are explained by the somewhat poorly targeted SEBH-scale (Figure 2). The curves in Figure 1 show the probability of endorsing each of the six response categories (1 = 'strongly disagree' and 6 = 'strongly agree') versus person location. The second category does not function as intended. The dotted line is the upper limit asymptote, where probability equals 100%.

A skewed distribution toward higher locations of self-efficacy indicates that the items could have been better targeted at the sample. This skewed distribution leaves few persons located at the lower end of the continuum, that is, at the trait locations at which the easily endorsable item 1 (item location at -0.61 according to Table 3) has its lower thresholds. We therefore interpret the SEBH-scale raw score as a sufficient statistic at the ordinal level.
Finally, we investigated DIF using the amended sample sizes based on the rule of thumb of 10, 20 and 30 persons per threshold (Andrich, 2011), with a total of 25 thresholds (5 items with 5 thresholds).
No DIF was observed for any person factor (gender, age, cultural background, language at home and books at home) using the amended sample sizes of N = 250, N = 500 and N = 750.

Discussion
Empirical data partially support our two composite hypotheses. The first hypothesis was strengthened except for a deviation from our ideal of parsimony: the partial credit parameterization (PCM), estimating one set of threshold parameters for each item, described the data better than the less complex rating scale parameterization (RSM), which estimates one set of threshold difficulties common to all items.
Furthermore, the targeting of the SEBH-scale was not optimal with few items at higher locations.
The lack of items providing information at higher levels of the latent trait is a well-known problem in health-literacy measurement (Nguyen et al., 2015). One of few exceptions is the 'Claim Evaluation Tools' developed by the Informed Health Choices group. The second hypothesis was strengthened except for slightly disordered thresholds observed for item 1. The disordering of response categories for item 1 has a simple explanation: the distribution of person estimates is skewed toward higher locations, thereby locating few persons at the lower end of the continuum, where we find the lower threshold parameters for item 1.
Since the SEBH-scale builds on a self-efficacy scale published by Guttersrud & Pettersen (2015), the scale seems to translate easily to different fields of education, improving the generalizability and external validity of our findings. We interpret this as a considerable strength of our study. A limitation of our study is the low school participation rate (58 out of 200 or 30%). This might result in responses from students enrolled in classes taught by teachers with above-average motivation and enthusiasm, teachers who are more likely to see the benefits of external assessment resources like the one we developed. This possible difference between the target sample and the accessed sample might explain the high mean self-efficacy estimate in our sample, which in turn could cause the skewed distribution of self-efficacy person location estimates and the disordering observed for item 1.

Conclusions
The present paper provides insights into an issue that seems to have passed health literacy research by: the application of Rasch analysis to evaluate the psychometric properties of measurement scales.
Our findings from fitting the Rasch model indicate that the SEBH-scale meets the assumptions and satisfies the requirements for fundamental measurement.
The SEBH-scale presented in the study exemplifies that Rasch analysis is a powerful tool for evaluating the construct validity of measurement instruments. This is indicated by the absence of construct-irrelevant variance, as all five items fit the Rasch model, implying that the items do not capture unrelated constructs that affect responses in a manner irrelevant to the construct. On the other hand, by meeting the assumption of unidimensionality, albeit with the presence of strongly correlated sub-dimensions, the SEBH-scale points to the absence of construct underrepresentation, another threat to construct validity, in which the assessment is too narrow and fails to capture different facets and sub-dimensions of the construct.
Furthermore, the total score on the SEBH-scale can be viewed as one of several possible sets of indicators of the construct, namely perceived self-efficacy in a science subject. An important recommendation is to include more items in the SEBH-scale in order to improve the precision with which the abilities of persons who fall between successive items along the hypothesized unidimensional continuum are measured.
The positive effect of perceived self-efficacy on the management of diseases is well documented. Developing and validating equivalent measures for 'non-sick' individuals, particularly adolescents, in different domains, as exemplified in the present study, will go a long way in providing measurement tools to inform and design successful health literacy policies and interventions within public health and education.

Figure 2. Histogram showing the distributions of person and item threshold locations including Fisher's information function (curve) for the SEBH scale

Table 1. The wording of the items in the self-efficacy in body and health (SEBH) scale (originally stated in Norwegian). A six-point rating scale with the extreme response categories anchored with a phrase (1 = 'strongly disagree' and 6 = 'strongly agree') was used.
... that I can do an excellent job with difficult tasks in Body and Health.
3 I am confident that I can do very well in tests in Body and Health.
4 I am confident that I can understand difficult learning material in Body and Health.
5 I am confident that I can apply the knowledge that I have in Body and Health in new and unfamiliar situations.

Table 2. Overall mean χ² fit statistics for the SEBH scale using amended sample sizes.

Table 3. Individual mean item χ² fit statistics for the SEBH scale using the amended sample sizes.