Tuesday, December 24, 2019

Gini coefficient and Lorenz curve

Tuesday, October 8, 2019

ANOVA

ANOVA is a general technique that can be used to test the hypothesis that the means among two or more groups are equal, under the assumption that the sampled populations are normally distributed.
Suppose we wish to study the effect of temperature on a passive component such as a resistor. We select three different temperatures and observe their effect on the resistors. The experiment can be conducted by measuring all the participating resistors and then dividing them among three ovens, each heated to one of the selected temperatures. After, say, 24 hours we measure the resistors again and analyse the responses, which are the differences between the measurements before and after exposure to the temperatures. Temperature is called a factor. The different temperature settings are called levels. In this example there are three levels, or settings, of the factor Temperature.
A factor is an independent treatment variable whose settings (values) are controlled and varied by the experimenter. The intensity setting of a factor is the level. Levels may be quantitative numbers or, in many cases, simply “present” or “not present” (“0” or “1”). For example, the temperature settings in the resistor experiment may be 100 degrees F, 200 degrees F and 300 degrees F. We can simply call them Level 1, Level 2 and Level 3.
The 1-way ANOVA
In the experiment above, there is only one factor, temperature, and the analysis of variance that we will be using to analyse the effect of temperature is called a one-way or one-factor ANOVA.
The 2-way or 3-way ANOVA
We could have opted to also study the effect of position in the oven. In this case there would be two factors, temperature and oven position. Here we speak of a two-way or two-factor ANOVA. Furthermore, we may be interested in a third factor, the effect of time. Now we deal with a three-way or three-factor ANOVA. In each of these ANOVAs we test a variety of hypotheses of equality of means (or average responses when the factors are varied).
ANOVA is defined as a technique where the total variation present in the data is partitioned into two or more components, each with a specific source of variation. In the analysis, it is possible to obtain the contribution of each of these sources of variation to the total variation. It is designed to test whether the means of two or more quantitative populations are equal. It consists of classifying and cross-classifying statistical results and helps in determining whether the given classifications are important in affecting the results.
The assumptions in analysis of variance are:
Normality
Homogeneity of variance
Independence of error
Whenever any of these assumptions is not met, the analysis of variance technique cannot be employed to yield valid inferences.
With analysis of variance, the variations in response measurement are partitioned into components that reflect the effects of one or more independent variables. The variability of a set of measurements is proportional to the sum of squares of deviations used to calculate the variance:
$\sum (X - \bar{X})^2$
Analysis of variance partitions the sum of squares of deviations of individual measurements from the grand mean (called the total sum of squares) into parts: the sum of squares of treatment means plus a remainder which is termed the experimental or random error.
When an experimental variable is strongly related to the response, its part of the total sum of squares will be large.
This condition is confirmed by comparing the variable's sum of squares with the random error sum of squares using an F test.
Why Use ANOVA and Not Repeated t-tests?
The t-test, which is based on the standard error of the difference between two means, can only be used to test the difference between two means.
With more than two means, one could compare each mean with every other mean using t-tests.
Conducting multiple t-tests can lead to severe inflation of the Type I error rate (false positives) and is NOT RECOMMENDED.
ANOVA is used to test for differences among several means without increasing the Type I error rate.
ANOVA uses data from all groups to estimate standard errors, which can increase the power of the analysis.
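To see how quickly the Type I error rate inflates, here is a minimal Python sketch. It assumes, purely for illustration, that the pairwise t-tests are independent (in practice they share data, so the figure is only approximate); the function name is made up for this example.

```python
from math import comb


def familywise_error_rate(k_groups: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all pairwise t-tests.

    Assumes the tests are independent, which is a simplification.
    """
    n_tests = comb(k_groups, 2)  # number of distinct pairs of groups
    return 1.0 - (1.0 - alpha) ** n_tests


for k in (3, 4, 5):
    print(k, "groups:", comb(k, 2), "tests, error rate about",
          round(familywise_error_rate(k), 3))
```

With only three groups the chance of at least one false positive is already close to 14% rather than the nominal 5%.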
Why Look at Variance When Interested in Means?

Three groups tightly spread about their respective means: the variability within each group is relatively small.
It is easy to see that there is a difference between the means of the three groups.

Three groups with the same means as in the previous figure, but the variability within each group is much larger.
It is not so easy to see that there is a difference between the means of the three groups.
To distinguish between the groups, the variability between (or among) the groups must be greater than the variability within the groups.
If the within-groups variability is large compared with the between-groups variability, any difference between the groups is difficult to detect.
To determine whether or not the group means are significantly different, the variability between groups and the variability within groups are compared.
One-Way ANOVA
Suppose there are k populations, each normally distributed with unknown parameters. A random sample X1, X2, X3……………… Xk is taken from these populations, which satisfy the assumptions. If μ1, μ2, μ3………… μk are the k population means, the hypotheses are:
H0 : μ1 = μ2 = μ3………… = μk (i.e. all means are equal)
HA : not all of μ1, μ2, μ3………… μk are equal (i.e. at least one mean differs from the others)
The steps in carrying out the analysis are:
Calculate variance between the samples
The variance between samples measures the difference between the sample mean of each group and the overall mean, and so the difference from one group to another. The sum of squares between the samples is denoted by SSB. To calculate it, square the deviation of each sample mean from the grand average, multiply by the number of observations in that sample, and add the results. Dividing SSB by its degrees of freedom, k-1, where k = number of samples, gives the variance between the samples.
Calculate variance within samples
The variance within samples measures the intra-sample (within-sample) differences due to chance only, that is, the variability around the mean of each group. The sum of squares within the samples is denoted by SSW. To calculate it, add up the squared deviations of the individual items from the mean of their own sample. Dividing SSW by its degrees of freedom, n-k, where n = total number of observations and k = number of samples, gives the variance within the samples.
Calculate the total variance
The total variance measures the overall variation in the data. The total sum of squares is denoted by SST. It is the sum of the squared deviations of each item from the grand average, and SST = SSB + SSW. Dividing SST by its degrees of freedom, n-1, where n = total number of observations, gives the total variance.
Calculate the F ratio
It is the ratio of the between-group variance to the within-group variance. If there is a real difference between the groups, the variance between groups will be significantly larger than the variance within the groups.
F = (Variance between the Groups) / (Variance within the Groups)
F = (SSB / (k-1)) / (SSW / (n-k))
Decision Rule
At a given level of significance, α = 0.05, and with k-1 (numerator) and n-k (denominator) degrees of freedom, the critical value of F is read from the F table. If the calculated value is greater than the tabulated value, reject the null hypothesis. That means the test is significant: there is a significant difference between the sample means.
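The steps above can be sketched in a few lines of Python. The three groups below are made-up numbers chosen so the arithmetic is easy to follow by hand, and the function name is only illustrative.

```python
def one_way_anova(groups):
    """Return sums of squares, degrees of freedom and F for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n

    # Between-samples sum of squares: group means vs grand mean, weighted by group size
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-samples sum of squares: each value vs the mean of its own group
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    msb = ssb / (k - 1)   # variance between the samples
    msw = ssw / (n - k)   # variance within the samples
    return {"SSB": ssb, "SSW": ssw, "df_between": k - 1,
            "df_within": n - k, "F": msb / msw}


# Hypothetical responses at three temperature levels (illustrative data only)
example_groups = [[1, 2, 3], [3, 4, 5], [5, 6, 7]]
print(one_way_anova(example_groups))
```

For these numbers SSB = 24 and SSW = 6, so F = (24/2)/(6/6) = 12 with 2 and 6 degrees of freedom. The tabulated F at α = 0.05 for 2 and 6 degrees of freedom is about 5.14, so the null hypothesis would be rejected.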
Applicability of ANOVA
Analysis of variance has wide applicability in the analysis of experiments. It is used for two different purposes:
It is used to estimate and test hypothesis about population means.
It is used to estimate and test hypothesis about population variances.
An analysis of variance to detect a difference in three or more population means first requires obtaining some summary statistics for calculating the variance of a set of data:
$\sigma^2 \text{ (variance)} = \frac{\text{Total sum of squares}}{\text{Total DF (degrees of freedom)}} = \frac{\sum x^2 - (\sum x)^2 / N}{N - 1}$
Where:
Σx² is called the crude sum of squares
(Σx)² / N is the CM (correction for the mean), or CF (correction factor)
Σx² – (Σx)² / N is termed SS (total sum of squares, or corrected SS).
In the one-way ANOVA, the total variation in the data has two parts: the variation among treatment means and the variation within treatments.
The grand average is GM = Σx/N
The total sum of squares (Total SS) is then:
Total SS = Σ(Xi – GM)², where Xi is any individual measurement.
Total SS = SST + SSE, where SST is the treatment sum of squares and SSE is the experimental error sum of squares. (Note that in this section SST stands for the treatment sum of squares, not the total sum of squares.)
SST: the sum of the squared deviations of each treatment average from the grand average or grand mean, each weighted by the number of observations in that treatment.
SSE: the sum of the squared deviations of each individual observation within a treatment from the treatment average.
For the ANOVA calculations:
Total Treatment CM: Σ(TCM) = Σ [(treatment total)² / (number of observations in the treatment)]
SST = Σ(TCM) – CM
SSE = Total SS – SST (Always obtained by difference)
Total DF = N – 1 (Total Degrees of Freedom)
TDF = K – 1 (Treatment DF = Number of treatments minus 1)
EDF = (N – 1) – (K – 1) = N – K (Error DF, always obtained by difference)
MST = SST/TDF = SST/(K – 1) (Mean Square Treatments)
MSE = SSE/EDF = SSE/(N – K) (Mean Square Error)
To test the null hypothesis:
H0 : μ1 = μ2 = μ3………… = μk
H1 : At least one mean is different
F = MST/MSE
When F > Fα, reject H0.
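The shortcut calculation can be sketched as follows. This is an illustration only: it takes Σ(TCM) to be the sum over treatments of (treatment total)² divided by the number of observations in that treatment, which is the form that makes SST = Σ(TCM) – CM agree with the definition of SST. The data are made up and the function name is hypothetical.

```python
def anova_by_correction_factor(treatments):
    """One-way ANOVA using the crude sum of squares and the correction for the mean."""
    values = [x for t in treatments for x in t]
    n_total, k = len(values), len(treatments)

    crude_ss = sum(x * x for x in values)        # sum of x squared
    cm = sum(values) ** 2 / n_total              # correction for the mean
    total_ss = crude_ss - cm                     # corrected (total) SS

    # Assumed form of the total treatment CM: sum of T_j^2 / n_j
    sum_tcm = sum(sum(t) ** 2 / len(t) for t in treatments)
    sst = sum_tcm - cm                           # treatment SS
    sse = total_ss - sst                         # error SS, by difference

    mst = sst / (k - 1)
    mse = sse / (n_total - k)
    return {"Total SS": total_ss, "SST": sst, "SSE": sse,
            "MST": mst, "MSE": mse, "F": mst / mse}


# Illustrative data: three treatments, three observations each
print(anova_by_correction_factor([[1, 2, 3], [3, 4, 5], [5, 6, 7]]))
```

Here Σx² = 174 and CM = 36²/9 = 144, so Total SS = 30; Σ(TCM) = 168, giving SST = 24, SSE = 6 and F = 12.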
Two-Way ANOVA
It will be seen that the two-way analysis procedure is an extension of the patterns described in the one-way analysis. Recall that a one-way ANOVA has two components of variance: Treatments and experimental error (may be referred to as columns and error or rows and error). In the two-way ANOVA there are three components of variance: Factor A treatments, Factor B treatments, and experimental error (may be referred to as columns, rows, and error).
In a two way analysis of variance, the treatments constitute different levels affected by more than one factor. For example, sales of car parts, in addition to being affected by the point of sale display, might also be affected by the price charged, the location of store and the number of competitive products. When two independent factors have an effect on the dependent factor, analysis of variance can be used to test for the effects of two factors simultaneously. Two sets of hypothesis are tested with the same data at the same time.
Suppose there are k populations which are from a normal distribution with unknown parameters. A random sample X1, X2, X3……………… Xk is taken from these populations, which satisfy the assumptions. The null hypothesis is that all population means are equal, against the alternative that the members of at least one pair are not equal. The hypotheses are:
H0 : μ1 = μ2 = μ3………… = μk
HA : Not all means μj are equal.
For the second factor, the null hypothesis is likewise that all of its effects βj are equal, against the alternative that they are not. The hypotheses are:
H0 : β1 = β2 = β3………… = βk
HA : Not all βj are equal.
Calculate variance between the rows
The variance between rows measures the difference between the sample mean of each row and the overall mean. It also measures the difference from one row to another. The sum of squares between the rows is denoted by SSR. For calculating variance between the rows, take the total of the square of the deviations of the means of various sample rows from the grand average and divide this total by the degree of freedom, r-1 , where r= no. of rows.
Calculate variance between the columns
The variance between columns measures the difference between the sample mean of each column and the overall mean. It also measures the difference from one column to another. The sum of squares between the columns is denoted by SSC. For calculating variance between the columns, take the total of the square of the deviations of the means of various sample columns from the grand average and divide this total by the degree of freedom, c-1 , where c= no. of columns.
Calculate the total variance
The total variance measures the overall variation in the data. The total sum of squares is denoted by SST. The total variation is calculated by taking the squared deviation of each item from the grand average, adding these up, and dividing the total by the degrees of freedom, n-1, where n = total number of observations.
Calculate the variance due to error
The variance due to error, or residual variance, is the chance variation in the experiment. It arises from errors in taking observations or making calculations, or sometimes from a lack of information about the data. The sum of squares due to error is denoted by SSE. It is calculated as:
Error Sum of Squares = Total Sum of Squares – Sum of Squares between Columns – Sum of Squares between Rows.
The degree of freedom in this case will be (c-1)(r-1).
Calculate the F Ratio
Each F ratio compares the variance between the columns, or between the rows, with the variance due to error.
F = Variance between the Columns / Variance due to Error
F = (SSC / (c-1)) / (SSE / ((c-1)(r-1)))
F = Variance between the Rows / Variance due to Error
F = (SSR / (r-1)) / (SSE / ((c-1)(r-1)))
Decision Rule
At a given level of significance, α = 0.05, the critical value of F is read from the table with c-1 and (c-1)(r-1) degrees of freedom for the columns, and with r-1 and (c-1)(r-1) degrees of freedom for the rows. If the calculated value is greater than the tabulated value, reject the null hypothesis. This means that the test is significant: there is a significant difference between the sample means.
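The row, column and error calculations can be sketched in Python for the simplest case, a table with one observation per cell. The table below is invented for illustration and the function name is hypothetical.

```python
def two_way_anova(table):
    """Two-way ANOVA for an r x c table with a single observation per cell."""
    r, c = len(table), len(table[0])
    n = r * c
    grand_mean = sum(sum(row) for row in table) / n

    row_means = [sum(row) / c for row in table]
    col_means = [sum(table[i][j] for i in range(r)) / r for j in range(c)]

    ssr = c * sum((m - grand_mean) ** 2 for m in row_means)   # between rows
    ssc = r * sum((m - grand_mean) ** 2 for m in col_means)   # between columns
    sst = sum((x - grand_mean) ** 2 for row in table for x in row)
    sse = sst - ssr - ssc                                      # error, by difference

    df_error = (r - 1) * (c - 1)
    mse = sse / df_error
    return {"SSR": ssr, "SSC": ssc, "SSE": sse, "SST": sst,
            "F_rows": (ssr / (r - 1)) / mse,
            "F_columns": (ssc / (c - 1)) / mse,
            "df_error": df_error}


# Made-up 3 x 3 table: rows are levels of one factor, columns of the other
example_table = [[4, 6, 8],
                 [5, 7, 6],
                 [9, 8, 10]]
print(two_way_anova(example_table))
```

For this table SSR = 18, SSC = 6 and SSE = 6 with 4 error degrees of freedom, so F for rows is 9/1.5 = 6 and F for columns is 3/1.5 = 2. Each would then be compared with the tabulated F for its own degrees of freedom.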





ANOVA Table for an A x B Factorial Experiment
In a factorial experiment involving factor A at a levels and factor B at b levels, the total sum of squares can be partitioned into:
Total SS = SS(A) + SS(B) + SS(AB) + SSE
ANOVA Table for a Randomized Block Design
The randomized block design implies the presence of two independent variables, blocks and treatments. The total sum of squares of the response measurements can be partitioned into three parts, the sum of the squares for the blocks, treatments, and error. The analysis of a randomized block design is of less complexity than an A x B factorial experiment.
Goodness-of-Fit Tests
GOF (goodness-of-fit) tests are part of a class of procedures that are structured in cells. In each cell there is an observed frequency (Fo). From the nature of the problem, one either knows the expected or theoretical frequency (Fe) or can calculate it. Chi square (χ2) is then summed across all cells according to the formula:
$\chi^2 = \sum \frac{(F_o - F_e)^2}{F_e}$
The calculated chi square is then compared to the chi square critical value for the appropriate degrees of freedom.
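As a small illustration of the cell-by-cell sum, the sketch below computes chi square for a made-up example: a die rolled 60 times, with an expected frequency of 10 per face. The function name and data are assumptions for this example only.

```python
def chi_square_statistic(observed, expected):
    """Sum of (Fo - Fe)^2 / Fe over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of cells")
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


observed = [8, 12, 10, 9, 11, 10]   # hypothetical counts for faces 1..6
expected = [10] * 6                 # a fair die over 60 rolls
chi_sq = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1
print(chi_sq, degrees_of_freedom)
```

Here chi square is (4 + 4 + 0 + 1 + 1 + 0)/10 = 1.0 with 5 degrees of freedom, far below the 0.05 critical value of about 11.07, so the observed counts are consistent with a fair die.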

Wednesday, August 14, 2019

Public Private Partnership

Tuesday, August 13, 2019

Least square mean

Least squares means (marginal means) vs. means
Least squares means are also referred to as marginal means. In an analysis of covariance model, they are the group means after controlling for a covariate (i.e. holding it constant at some typical value, such as its mean value).
I made up the data in Table 1. There are two treatment groups (treatment A and treatment B) that are measured at two centers (Center 1 and Center 2). The mean value for treatment A is simply the sum of all measures divided by the total number of observations (mean for treatment A = 24/5 = 4.8); similarly, the mean for treatment B = 26/5 = 5.2. So the mean for treatment A < the mean for treatment B.
Table 2 shows the calculation of least squares means. The first step is to calculate the mean for each cell, that is, each combination of treatment and center. The mean is 9/3 = 3 for the treatment A and center 1 combination; 7.5 for treatment A and center 2; 5.5 for treatment B and center 1; and 5 for treatment B and center 2. After the mean for each cell is calculated, the least squares means are simply the average of these cell means. For treatment A, the LS mean is (3+7.5)/2 = 5.25; for treatment B, it is (5.5+5)/2 = 5.25. The LS means for the two treatment groups are identical.
It is easy to show the simple calculation of means and LS means in the above table with two factors. In clinical trials, the statistical model often needs to be adjusted for multiple factors, including both categorical factors (treatment, center, gender) and continuous covariates (baseline measures), and the calculation of the LS mean is then not easy to demonstrate. However, the LS mean should be used when an inferential comparison needs to be made. Typically, the means and LS means point in the same direction (though with different values) for a treatment comparison. Occasionally, they point in different directions (treatment A better than treatment B according to the mean values; treatment B better than treatment A according to the LS means).
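The difference between the two kinds of mean can be reproduced with a short Python sketch. Only the cell totals and counts are needed. The counts for the cells that the text does not spell out (2 observations for A at Center 2, 2 and 3 for B) are inferred from the quoted cell means and the totals 24/5 and 26/5, so treat them as an assumption.

```python
# Cell summaries: (treatment, center) -> number of observations and their total
cells = {
    ("A", "Center 1"): {"n": 3, "total": 9.0},
    ("A", "Center 2"): {"n": 2, "total": 15.0},   # count inferred from mean 7.5
    ("B", "Center 1"): {"n": 2, "total": 11.0},   # count inferred from mean 5.5
    ("B", "Center 2"): {"n": 3, "total": 15.0},   # count inferred from mean 5
}


def raw_mean(treatment):
    """Ordinary mean: all observations pooled, so larger cells weigh more."""
    rows = [v for (t, _), v in cells.items() if t == treatment]
    return sum(v["total"] for v in rows) / sum(v["n"] for v in rows)


def ls_mean(treatment):
    """Least squares mean: unweighted average of the cell means."""
    cell_means = [v["total"] / v["n"] for (t, _), v in cells.items() if t == treatment]
    return sum(cell_means) / len(cell_means)


for trt in ("A", "B"):
    print(trt, "mean =", raw_mean(trt), "LS mean =", ls_mean(trt))
```

The raw means differ (4.8 vs 5.2) only because the two treatments have different numbers of observations at each center; giving every cell equal weight removes that imbalance and both LS means come out at 5.25.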

Wednesday, January 2, 2019

Which statistical test to apply


Some names of the statistical tests:

Ref. Dtsch Arztebl Int 2010; 107(19): 343–8 DOI: 10.3238/arztebl.2010.0343

Edit: error in the picture
The Mann-Whitney U test is the same as the Wilcoxon rank sum test.
The Wilcoxon signed rank test is done on paired data.
Creative Commons License
PSM / COMMUNITY MEDICINE by Dr Abhishek Jaiswal is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Based on a work at learnpsm@blogspot.com.
Permissions beyond the scope of this license may be available at jaiswal.fph@gmail.com.

Minimisation


The most important drawback of randomization software is the problem of unmatched groups. In the process of randomization it is quite possible for the treatment groups to develop significant differences in some prognostic factors, especially when the sample size is relatively small (<200). If these factors have important effects on the primary or secondary outcomes of the study, any important difference in the levels of these factors invalidates the trial results and necessitates complicated statistical analysis with unreliable results.
Various methods have been used to overcome the problem of unmatched trial groups, including minimization and stratification, with minimization providing more acceptable results. With minimization, the first subjects are enrolled randomly into one of the groups. Each subsequent subject is hypothetically allocated to every group in turn, and an imbalance score is calculated for each hypothetical allocation. Using these imbalance scores, we can decide to which group the new subject must be allocated to give the minimum amount of imbalance in terms of prognostic factors.
Pure minimization is completely deterministic; that is, we can predict which group the next subject will be enrolled in, provided the factor levels of the new subject are known. This may undermine the blinding of the trial and introduce some bias. To overcome this shortcoming, some element of randomness is incorporated into the minimization algorithm to make the prediction unlikely.
Unfortunately, the whole process of minimization is well beyond the skill of a typical clinical researcher, especially when unequal group allocations have to be taken into account. The difficulty of computation has resulted in relatively infrequent use of minimization methods in randomized clinical trials. Computer software can perform excellently in these situations, provided it has been implemented logically.
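The allocation step can be illustrated with a minimal Python sketch. It is only one possible scheme, not the algorithm of any particular program: the imbalance score here is assumed to be the sum, over prognostic factors, of the range of group counts at the new subject's level, and `p_best` is a hypothetical parameter for the probability of following the minimizing choice (1.0 gives pure, deterministic minimization).

```python
import random


def imbalance_if_assigned(counts, subject, group, groups):
    """Imbalance score after hypothetically adding the subject to `group`."""
    score = 0
    for factor, level in subject.items():
        totals = [counts[g].get((factor, level), 0) + (1 if g == group else 0)
                  for g in groups]
        score += max(totals) - min(totals)   # range of counts across groups
    return score


def minimise(counts, subject, groups, p_best=1.0, rng=random):
    """Allocate one subject; returns the chosen group and all imbalance scores."""
    scores = {g: imbalance_if_assigned(counts, subject, g, groups) for g in groups}
    best = min(scores.values())
    candidates = [g for g in groups if scores[g] == best]
    if len(candidates) == len(groups) or rng.random() < p_best:
        chosen = rng.choice(candidates)      # minimizing group (ties at random)
    else:
        chosen = rng.choice([g for g in groups if g not in candidates])
    for factor, level in subject.items():    # record the real allocation
        key = (factor, level)
        counts[chosen][key] = counts[chosen].get(key, 0) + 1
    return chosen, scores


# Made-up running totals per group for each (factor, level) seen so far
counts = {"Treatment": {("sex", "M"): 3, ("age", "<50"): 2},
          "Control": {("sex", "M"): 1, ("age", "<50"): 2}}
new_subject = {"sex": "M", "age": "<50"}
print(minimise(counts, new_subject, ["Treatment", "Control"],
               p_best=1.0, rng=random.Random(0)))
```

In this example the scores are 4 for Treatment and 2 for Control, so pure minimization sends the subject to Control. Setting `p_best` below 1.0 (for example 0.8) adds the element of randomness described above.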
In the following sections, the aspects of two minimization programs are presented. Again the selection of these programs is based on the availability and ease of use.


Hosmer Lemeshow test

Hosmer and Lemeshow proposed grouping subjects based on the values of the estimated probabilities.

Two grouping strategies:

  1. Based on percentiles of estimated probabilities
  2. Based on fixed values of estimated probabilities

With the first method, use of g=10 groups results in the first group containing n1=n/10 subjects having the smallest estimated probabilities and the last group containing n10=n/10 subjects having the largest estimated probabilities.

With the second method, use of g=10 groups results in cut points defined at the values k/10, k=1,2……9, and the groups contain all subjects with estimated probabilities between adjacent cut points.
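The two grouping strategies can be sketched in Python as follows. The estimated probabilities are invented, g=5 is used instead of 10 to keep the example small, and the function names are hypothetical. How to treat a probability that falls exactly on a cut point is a choice the text leaves open; here it goes into the upper group.

```python
def group_by_percentiles(probs, g=10):
    """Sort subjects by estimated probability and cut into g groups of about n/g."""
    n = len(probs)
    order = sorted(range(n), key=lambda i: probs[i])
    groups = [[] for _ in range(g)]
    for rank, i in enumerate(order):
        groups[rank * g // n].append(i)      # subject indices, smallest probs first
    return groups


def group_by_fixed_cutpoints(probs, g=10):
    """Use cut points at k/g, k = 1..g-1; groups may be very unequal in size."""
    groups = [[] for _ in range(g)]
    for i, p in enumerate(probs):
        idx = min(int(p * g), g - 1)         # p == 1.0 belongs to the last group
        groups[idx].append(i)
    return groups


# Made-up estimated probabilities from some fitted logistic model
example_probs = [0.02, 0.04, 0.06, 0.08, 0.11, 0.12, 0.13, 0.31, 0.55, 0.91]
print([len(x) for x in group_by_percentiles(example_probs, g=5)])
print([len(x) for x in group_by_fixed_cutpoints(example_probs, g=5)])
```

The percentile method gives five groups of two subjects each, while the fixed cut points put seven subjects in the lowest group and leave one group empty, which shows why the two strategies can behave differently when most estimated probabilities are small.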

Mann-Whitney U test

Mann-Whitney U test: It is the non-parametric alternative to the unpaired t-test. It is used to compare two independent samples and to test whether they come from the same population, that is, whether values in one sample tend to be larger than values in the other. Usually, the Mann-Whitney U test is used when the data are ordinal or when the assumptions of the t-test are not met.
Assumptions:
1. Sample drawn from the population is random
2. Observations are independent
3. Ordinal measurement scale
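As an illustration, the U statistic can be computed from the ranks of the pooled samples. The sketch below uses the standard rank-sum formula U1 = R1 – n1(n1+1)/2, which is not spelled out in the post; the data and function names are made up.

```python
def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1            # ranks are 1-based
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample A, U for sample B, the smaller of the two)."""
    ranks = average_ranks(list(sample_a) + list(sample_b))
    n1, n2 = len(sample_a), len(sample_b)
    r1 = sum(ranks[:n1])                     # rank sum of sample A
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return u1, u2, min(u1, u2)


# Illustrative ordinal scores for two independent groups
print(mann_whitney_u([3, 5, 8], [1, 2, 4, 6]))
```

In this example sample A has ranks 3, 5 and 7 in the pooled data, so R1 = 15, U1 = 9 and U2 = 3. U1 also equals the number of (A, B) pairs in which the A value is the larger one.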

Wilcoxon signed-rank test

Wilcoxon signed-rank test: It is a test of significance for paired data, used when the data are ordinal or not normally distributed. It is the non-parametric counterpart of the paired t-test, which analyzes whether the average difference between two repeated measures is zero and requires metric (interval or ratio), normally distributed data.
The Wilcoxon signed-rank test relies on the W statistic. For large samples with n>10 paired observations, the distribution of the W statistic approximates a normal distribution. Because the test is non-parametric, it does not require normally distributed data.
The first step of the Wilcoxon signed-rank test is to calculate the differences between the repeated measurements and their absolute values.
The next step is to order the cases by increasing absolute difference.
Cases where the difference is zero can be ignored. All other cases are assigned their relative rank. In case of tied ranks, the average rank is used. That is, if ranks 10 and 11 have the same observed difference, both are assigned rank 10.5.
The next step is to sign each rank. If the original difference is < 0, the rank is multiplied by -1; if the difference is positive, the rank stays positive.

The W statistic is simply the sum of the signed ranks.
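The steps described above can be followed in a short Python sketch. The before and after measurements are made up, and the function names are only illustrative.

```python
def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1            # ranks are 1-based
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def signed_rank_w(before, after):
    """Return W (sum of signed ranks) and the list of signed ranks."""
    diffs = [a - b for b, a in zip(before, after)]
    diffs = [d for d in diffs if d != 0]               # ignore zero differences
    abs_ranks = average_ranks([abs(d) for d in diffs])  # rank by absolute size
    signed = [r if d > 0 else -r for d, r in zip(diffs, abs_ranks)]
    return sum(signed), signed


# Illustrative paired measurements on five subjects
before = [10, 12, 9, 14, 11]
after = [12, 11, 9, 17, 13]
print(signed_rank_w(before, after))
```

The differences are 2, -1, 0, 3 and 2. The zero is dropped, the two differences of 2 share ranks 2 and 3 (so each gets 2.5), and the signed ranks 2.5, -1, 4 and 2.5 add up to W = 8.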


Kruskal-Wallis test

Kruskal-Wallis test: It is a nonparametric test, used when the assumptions of one-way ANOVA are not met.
In ANOVA, we assume that the dependent variable is normally distributed and that the variance of the scores is approximately equal across groups. The Kruskal-Wallis test is used when these assumptions are not met, and it can therefore be used for both continuous and ordinal-level dependent variables. However, like most non-parametric tests, the Kruskal-Wallis test is not as powerful as ANOVA when the ANOVA assumptions do hold.

Null hypothesis: samples (groups) are from identical populations.
Alternative hypothesis: at least one of the samples (groups) comes from a different population than the others.

The distribution of the Kruskal-Wallis test statistic approximates a chi-square distribution, with k-1 degrees of freedom, if the number of observations in each group is 5 or more.  If the calculated value of the Kruskal-Wallis test is less than the critical chi-square value, then the null hypothesis cannot be rejected.  If the calculated value of Kruskal-Wallis test is greater than the critical chi-square value, then we can reject the null hypothesis and say that at least one of the samples comes from a different population.
Assumptions:
1. Sample drawn is random
2. Observations are independent of each other
3. Measurement scale for the dependent variable is at least ordinal
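A minimal Python sketch of the test statistic is given below. It uses the usual formula H = 12/(N(N+1)) × Σ(Rj²/nj) – 3(N+1), which the post does not state, without any correction for ties; the data and function names are made up.

```python
def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1            # ranks are 1-based
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return the H statistic (no tie correction) and its degrees of freedom."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    sum_term, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])   # R_j for this group
        sum_term += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n_total * (n_total + 1)) * sum_term - 3 * (n_total + 1)
    return h, len(groups) - 1


# Illustrative scores for three groups of five observations each
example = [[11, 12, 13, 14, 15], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25]]
print(kruskal_wallis_h(example))
```

For these completely separated groups the rank sums are 15, 40 and 65, giving H = 12.5 with k-1 = 2 degrees of freedom. This exceeds the critical chi-square value of about 5.99 at the 0.05 level, so the null hypothesis would be rejected.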

Fisher Exact test

Fisher Exact test: It is a test of significance that is used in place of the chi square test in 2 by 2 tables, especially with small samples (for example, when the expected frequency in a cell is less than 5; for larger tables the usual rule is that no more than 20% of cells should have an expected frequency below 5).
The Fisher Exact test gives the probability of obtaining, through the chance of sampling alone, a table at least as strong as the one observed. The word ‘strong’ refers to the proportion of the cases that fall on the diagonal with the most cases.
It is generally used as a one tailed test, but it can also be used as a two tailed test. It is sometimes called the Fisher Irwin test.
The Fisher Exact test uses the following formula:
$p = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,N!}$
In this formula, ‘a,’ ‘b,’ ‘c’ and ‘d’ are the individual frequencies of the 2X2 contingency table, and ‘N’ is the total frequency.
This formula gives the probability of the particular combination of frequencies actually obtained. The p-value of the test adds to this the probabilities of the more extreme tables with the same row and column totals.
Assumptions:
Sample drawn by random sampling
A directional hypothesis is assumed (either a positive association or a negative association, but not both)
Data is not paired
Mutual exclusivity within the observations is assumed
Dichotomous level of measurement of the variables is assumed
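The formula can be tried out with a short Python sketch. The 2X2 table is made up, the function names are hypothetical, and the one tailed p-value is taken in the direction of a larger ‘a’ cell, which is an assumption about the direction of the hypothesis.

```python
from math import factorial


def table_probability(a, b, c, d):
    """Probability of one 2x2 table with fixed margins (the formula above)."""
    n = a + b + c + d
    numerator = (factorial(a + b) * factorial(c + d)
                 * factorial(a + c) * factorial(b + d))
    denominator = (factorial(a) * factorial(b) * factorial(c)
                   * factorial(d) * factorial(n))
    return numerator / denominator


def one_tailed_p(a, b, c, d):
    """Observed table plus all tables more extreme towards a larger `a`."""
    p = 0.0
    while b >= 0 and c >= 0:
        p += table_probability(a, b, c, d)
        a, b, c, d = a + 1, b - 1, c - 1, d + 1   # keeps all margins fixed
    return p


# Made-up table: rows are two groups, columns are outcome present / absent
print(table_probability(3, 1, 1, 3))
print(one_tailed_p(3, 1, 1, 3))
```

For the table a=3, b=1, c=1, d=3 the probability of the observed table is 16/70, and adding the only more extreme table (4, 0, 0, 4), which has probability 1/70, gives a one tailed p-value of 17/70, about 0.24.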

Biostatistics MCQ (AIIMS)


A physician, after examining a group of patients of a certain disease, classifies the condition of each one as ‘Normal’, ‘Mild’, ‘Moderate’ or ‘Severe’. Which one of the following is the scale of measurement that is being adopted for classification of the disease condition?
[AIIMS Nov 92 Dec 98, May 94]
(a) Nominal
(b) Interval
(c) Ratio
(d) Ordinal

There is an intrinsic order in an ordinal data set, e.g. Mild, Moderate, Severe.


In the WHO recommended EPI cluster sampling for assessing primary immunization coverage, the age group of children to be surveyed is
(a) 0-12 months [AIIMS Nov1992, & 2008]
(b) 6-12 months
(c) 9-12 months
(d) 12-23 months

Children aged 12–23 months, if the final primary vaccination is at 9 months of age; this is the most commonly chosen target population (Ref: WHO EPI cluster sampling)


If a biochemical test gives the same reading for a sample on repeated testing, it is inferred that the measurement is [AIIMS June 1992]
(a) Precise
(b) Accurate
(c) Specific
(d) Sensitive

Precision means repeatability 

Mean, Median and Mode are [AIIMS Dec 94, & Nov 2007]
(a) Measures of dispersion
(b) Measures association between two variables
(c) Test of significance
(d) Measures of central tendency

Following are the sampling techniques used to conduct community health surveys, except
(a) Simple random [AIIMS May 1994]
(b) Systematic random
(c) Stratified random
(d) Cluster testing

Median weight of 100 children was 12 kgs. The standard deviation was 3. Calculate the percent coefficient of variance [AIIMS May 1994]
(a) 25%
(b) 35%
(c) 45%
(d) 55%

In statistical literature data are broadly classified as interval scale data, ordinal scale data & categorical data. Blood groups will be an example for: [AIIMS Dec 1994]
(a) Interval scale data
(b) Ordinal scale data
(c) Categorical data
(d) None of the above

Chance of passing a Genetic disease “y” trait by the affected parents to children is 0.16. They plan to have two children. Probability of both the children having “y” trait is [AIIMS Dec 1994]
(a) Zero
(b) 0.16
(c) 0.32
(d) 0.0256

A population study showed a mean glucose of 86 mg/ dL. In a sample of 100 showing normal curve distribution, what percentage of people have glucose above 86mg/ dL [AIIMS Dec 94]
(a) 34
(b) 50
(c) NIL
(d) 68

How much of the sample is included in 1.95 SD? [AIIMS May 1995]
(a) 99%
(b) 95%
(c) 68%
(d) 65%

Square root of p1q1/n1 + p2q2/n2 is a measure of [AIIMS Dec 1995]
(a) Mean
(b) Standard error of difference between two means
(c) Standard error of difference between two proportions
(d) Normal deviate

Histogram is used to describe: [AIIMS Dec 1995]
(a) Quantitative data of a group of patients
(b) Qualitative data of a group of patients
(c) Data collected on nominal scale
(d) Data collected on ordinal scale

If 60 values are arranged in ascending order, middle value is [AIIMS Dec 1995]
(a) Arithmetic Mean
(b) Median
(c) 30th percentile
(d) 31st percentile

50th percentile is equivalent to [AIIMS Sep 1996]
(a) Mean
(b) Median
(c) Mode
(d) Range

A normal distribution curve depends on [AIIMS Feb 1997]
(a) Mean and sample size
(b) Range and sample size
(c) Mean and standard deviation
(d) Mean and median

In a drug trial, a 50 yr old patient with CAD is being interviewed about his dietary & smoking habits. The possible bias that might be introduced is: [AIIMS Feb 1997]
(a) Selection bias
(b) Berksonian bias
(c) Recall bias
(d) No possibility of bias

The Correlation Coefficient between Smoking & Lung Cancer was found to be 1.4. This indicates
(a) Weak correlation [AIIMS Feb 1997]
(b) Moderate correlation
(c) Strong correlation
(d) Mistake in calculation

A Scatter diagram is drawn to study: [AIIMS June 1997]
(a) Trend of a variable over a period of time
(b) Frequency of occurrence of events
(c) Mean & median values of the given data
(d) Relationship between two given variables

Which of the following is not true about ‘correlation’? [AIIMS June 97]
(a) It indicates degree of association between two characteristics
(b) Correlation coefficient of 1 means that the two variables exhibit linear relationship
(c) Correlation can measure risk
(d) Causation implies correlation


If we know the value of one variable in an individual & wish to know the value of another variable, we calculate - [AIIMS June 1997]
(a) Coefficient of correlation
(b) Coefficient of regression
(c) SE of mean
(d) Geometric mean


A cardiologist wants to study the effect of an antihypertensive drug. He notes down the initial systolic blood pressure (mmHg) of 50 patients and then administers the drug to them. After a week’s treatment, he measures the systolic blood pressure again. Which of the following is the most appropriate statistical test of significance to test the statistical significance of the change in blood pressure?
[AIIMS June 1997, AIIMS May 1995, AIIMS Nov 2004]
(a) Paired t-test
(b) Unpaired or independent t-test
(c) Analysis of variance
(d) Chi-square test

Not required for Chi-square test is [AIIMS Dec 1997]
(a) Mean & SD of the groups
(b) Each expected cell frequency > 5
(c) Large sample
(d) Contingency Table

The mean B.P. of a group of persons was determined and after an interventional trial, the mean BP was estimated again. The best test to be applied to determine the significance of intervention is [AIIMS Dec 1997]
(a) Chi-square
(b) Paired ‘t’ test
(c) Correlation coefficient
(d) t-test

Study finds a correlation coefficient of + 0.7 between self reported work satisfaction & expectancy of life in a random sample of 5000 corporate workers. (p = 0.01). This means that [AIIMS Dec 1997]
(a) Work satisfaction improves life expectancy
(b) Strong statistically significant (+) association between work satisfaction and life expectancy
(c) 70% people who enjoy work shall live longer
(d) 70% association between work satisfaction & life expectancy

Not true about Chi-square test is [AIIMS June 99]
(a) Tests the significance of difference between two proportions
(b) Tells about presence or absence of an association between two variables
(c) Directly measures the strength of association
(d) Can be used when more than two groups are to be compared

In a bimodal series, if mean is 2 and median is 3, what is the mode? [AIIMS June 99]
(a) 5
(b) 2.5
(c) 4
(d) 3

The standard normal distribution [AIIMS Nov 99]
(a) Is skewed to the left
(b) Has mean = 1.0
(c) Has standard deviation = 0.0
(d) Has variance = 1.0

An investigator into the life expectancy of IV drug abusers divides a sample of patients into HIV- positive and HIV-negative groups. What type of data does this division constitute?
[AIIMS June 2000]
(a) Nominal
(b) Ordinal
(c) Interval
(d) Ratio

P-value is the probability of [AIIMS June 2000]
(a) Not rejecting a null hypothesis when true
(b) Rejecting a null hypothesis when true
(c) Not rejecting a null hypothesis when false
(d) Rejecting a null hypothesis when false

A lecturer states that the correlation coefficient between prefrontal blood flow under cognitive load and the severity of psychotic symptoms in schizophrenic patients is – 1.24. You can therefore conclude that [AIIMS June 2000]
(a) Pre-frontal blood flow under cognitive load is a good predictor of the severity of psychotic symptoms in schizophrenic patients
(b) Prefrontal blood flow under cognitive load accounts for a large proportion of the variance in psychotic symptoms in schizophrenic patients
(c) Psychosis or schizophrenia is in some way a cause or partial cause of low prefrontal blood flow under cognitive load
(d) The lecturer has reported the correlation coefficient incorrectly

Central value of a set of 180 values can be obtained by [AIIMS Nov 2000]
(a) 2nd tertile
(b) 90th percentile
(c) 9th decile
(d) 2nd quartile

The number of malaria cases reported during the last 10 years in a town is given below, 250, 320, 190, 300, 5000, 100, 260, 350, 320, and 160 The epidemiologist wants to find out the average number of malaria cases reported in that town during the last 10 years. The most appropriate measure of average for this data will be [AIIMS May 2001, AIIMS Nov 2004]
(a) Arithmetic mean
(b) Mode
(c) Median
(d) Geometric mean
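
To see how much a single extreme year affects the malaria series quoted above, two of the candidate averages can be computed directly. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, median

# Yearly malaria case counts quoted in the question above.
cases = [250, 320, 190, 300, 5000, 100, 260, 350, 320, 160]

avg = mean(cases)    # pulled upward by the single value of 5000
mid = median(cases)  # middle of the sorted values, robust to the outlier

print(avg, mid)
```

The mean works out to 725, which is higher than nine of the ten yearly counts, while the median is 280.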

In a particular trial, the association of lung cancer with smoking is found to be 40% in one sample and 60% in another. What is the best test to compare the results? [AIIMS May 2001]
(a) Chi Square Test
(b) Fisher Test
(c) Paired t Test
(d) ANOVA Test

What can be true regarding the coefficient of correlation between IMR and economic status? [AIIMS May 2001]
(a) r = + 1
(b) r = – 1
(c) r = + 0.22
(d) r = – 0.8

Standard deviation of means measures  [AIIMS May 01]
(a) Non-sampling errors
(b) Sampling errors
(c) Random errors
(d) Conceptual errors

Among a 100 women with average Hb of 10 gm%, the standard deviation was 1, what is the standard error? [AIIMS May 01, 04, 07]
(a) 0.01
(b) 0.1
(c) 1
(d) 10
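
The standard error of the mean is the standard deviation divided by the square root of the sample size. A minimal sketch of that arithmetic for the figures in the question above (the function name is illustrative):

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

# 100 women, standard deviation of 1 (figures from the question above).
se = standard_error(sd=1.0, n=100)
print(se)
```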

A study was undertaken to assess the effect of a drug in lowering serum cholesterol levels. 15 obese women and 10 non-obese women formed the 2 limbs of the study. Which test would be useful to correlate the results obtained? [AIIMS Nov 01]
(a) ANOVA test
(b) Student’s t-test
(c) Chi square test
(d) Fisher test

The incidence of malaria in an area is 20, 20, 50, 56, 60, 5000, 678, 898, 345, 456. Which of these methods is the best to calculate the average incidence?  [AIIMS Nov 01]
(a) Arithmetic mean
(b) Geometric mean
(c) Median
(d) Mode

A randomised trial comparing the efficacy of two drugs showed a difference between the two with a p value of <0.005. In reality, however, the two drugs do not differ. This therefore is an example of [AIIMS Nov 02]
(a) Type I error (alpha error)
(b) Type II error (beta error)
(c) 1 – α
(d) 1 – β


A test which produces similar results when repeated, but values obtained are not close to actual/true value, is [AIIMS Nov 02]
(a) Precise but inaccurate
(b) Precise and accurate
(c) Imprecise and accurate
(d) Imprecise and inaccurate

When a diagnostic test is used in “series” mode, then [AIIMS Nov 02]
(a) Sensitivity increases but specificity decreases
(b) Specificity increases but sensitivity decreases
(c) Both sensitivity and specificity increase
(d) Both sensitivity and specificity decrease

The number of patients required in a clinical trial to treat a specific disease increases as [AIIMS Nov 02]
(a) The incidence of the disease decreases
(b) The significance level increases
(c) The size of the expected treatment effect increases
(d) The drop-out rate increases

The usefulness of a screening test depends upon its- [AIIMS May 03]
(a) Sensitivity
(b) Specificity
(c) Reliability
(d) Predictive value

An investigator wants to study the association between maternal intake of iron supplements (Yes/ No)  and birth weights (in grams) of newborn babies. He collects relevant data from 100 pregnant women and their newborns. What statistical test of hypothesis would you advise for the investigator in this situation? [AIIMS May 03]
(a) Chi-Square test
(b) Unpaired or independent t-test
(c) Analysis of Variance
(d) Paired t-test

For testing the statistical significance of the difference in heights of school children
[AIIMS May 2003]
(a) Student’s ‘t’ test
(b) Chi-squared test
(c) Paired ‘t’ test
(d) One way analysis of variance (one way ANOVA)

The fasting blood levels of glucose for a group of diabetics are found to be normally distributed with a mean of 105 mg per 100 ml of blood and a standard deviation of 10 mg per 100 ml of blood. From this data it can be inferred that approximately 95% of diabetics will have their fasting blood glucose levels within the limits of: [AIIMS Nov 2003]
(a) 75 and 135 mgs
(b) 85 and 125 mgs
(c) 95 and 115 mgs
(d) 65 and 145 mgs
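
For a normally distributed variable, roughly 95% of values lie within two standard deviations of the mean (1.96 to be more exact). A minimal sketch of that rule of thumb applied to the glucose figures above; the function name is illustrative:

```python
def approx_95_limits(mean, sd):
    """Approximate 95% range for a normal variable: mean ± 2 SD."""
    return mean - 2 * sd, mean + 2 * sd

# Mean 105 and SD 10 mg per 100 ml, as given in the question above.
low, high = approx_95_limits(mean=105, sd=10)
print(low, high)
```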

An investigator wants to study the association between maternal intake of iron supplements (Yes or No) and incidence of low birth weight (< 2500 grams or > 2500 grams). He collects relevant data from 100 pregnant women as to the status of usage of iron supplements and the status of low birth weight in their newborns. The appropriate statistical test of hypothesis advised in this situation is
[AIIMS Nov 03]
(a) Paired – t-test
(b) Unpaired or independent t-test
(c) Analysis of variance
(d) Chi – Square test

Mean and standard deviation can be worked out only if data is on [AIIMS Nov 03, AIIMS May 05]
(a) Interval/Ratio scale
(b) Dichotomous scale
(c) Nominal scale
(d) Ordinal scale

After applying a statistical test, an investigator gets the ‘P value’ as 0.01. It means that [AIIMS Nov 2003, AIIMS May 05, 08]
(a) The probability of finding a significant difference is 1%
(b) The probability of declaring a significant difference is 1%
(c) The difference is not significant 1% times and significant 99% times
(d) The power of the test used is 99%

Sampling method used in assessing immunization status of children under immunization program is [AIIMS May 2004]
(a) Systematic sampling
(b) Stratified sampling
(c) Group sampling
(d) Cluster sampling

All are true Except - [AIIMS May 04]
(a) Alpha is the maximum tolerable probability of type-I error
(b) Beta is the probability of type-II error
(c) When Null Hypothesis is true but is rejected, it is Type-II error
(d) P-value can be more or less than alpha

Statistical Power of a trial is equal to  [AIIMS Nov 04]
(a) 1 + α
(b) 1 – β
(c) α + β
(d) α / β

In a 3 × 4 contingency table, the number of degrees of freedom equals [AIIMS Nov 2004]
(a) 1
(b) 5
(c) 6
(d) 12
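
For a chi-square test on an r × c contingency table, the degrees of freedom are (r – 1)(c – 1). A minimal sketch of that formula (the function name is illustrative):

```python
def contingency_df(rows, cols):
    """Degrees of freedom for an r x c contingency table: (r - 1) * (c - 1)."""
    return (rows - 1) * (cols - 1)

# The 3 x 4 table from the question above.
df = contingency_df(3, 4)
print(df)
```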

In assessing the association between maternal nutritional status and the birth weight of the newborns, two investigators A and B studied separately and found significant results with p values 0.02 and 0.04 respectively. From this information, what can you infer about the magnitudes of association found by the two investigations? [AIIMS Nov 2004]
(a) The magnitude of association found by investigator A is more than that found by B
(b) The magnitude of association found by investigator B is more than that found by A
(c) The estimates of association obtained by A and B will be equal, since both are significant
(d) Nothing can be concluded as the information given is inadequate

Pearson or spearman coefficient is used for evaluation of: [AIIMS Nov 04]
(a) Differences in proportion
(b) Comparison of more than 2 means
(c) Comparison of variance
(d) Correlation

Sensitivity for a test ‘X’ is 0.90 and Specificity is 0.50. Prevalence of disease ‘Y’ in a population is 10%. Post-test probability of test ‘X’ when applied to population ‘Y’ is - [AIIMS May 05]
(a) 0.90
(b) 0.84
(c) 0.16
(d) 0.10
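
If "post-test probability" is read as the probability of disease given a positive result (an assumption, since the question does not say which result is meant), it follows from Bayes' theorem using sensitivity, specificity and prevalence. A minimal sketch with an illustrative function name:

```python
def post_test_probability_positive(sensitivity, specificity, prevalence):
    """Probability of disease given a positive test (Bayes' theorem)."""
    true_pos = sensitivity * prevalence                # diseased and test-positive
    false_pos = (1 - specificity) * (1 - prevalence)   # healthy but test-positive
    return true_pos / (true_pos + false_pos)

# Sensitivity 0.90, specificity 0.50, prevalence 10% (from the question above).
p = post_test_probability_positive(0.90, 0.50, 0.10)
print(round(p, 3))
```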

A bacterium can divide every 20 minutes. Beginning with a single individual, how many bacteria will  be there in the population if there is exponential growth for 3 hours? [AIIMS May 05]
(a) 18
(b) 440
(c) 512
(d) 1024
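
With exponential growth, the population doubles once per division time, so it is enough to count the number of doublings in the interval. A minimal sketch, assuming every cell divides exactly every 20 minutes and none die (the function name is illustrative):

```python
def population_after(minutes, doubling_time, start=1):
    """Population size after repeated doubling, starting from `start` cells."""
    doublings = minutes // doubling_time  # completed division cycles
    return start * 2 ** doublings

# 3 hours = 180 minutes, one division every 20 minutes, starting from 1 cell.
n = population_after(minutes=180, doubling_time=20)
print(n)
```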

The distribution of random blood glucose measurements from 50 first year medical students was found to have a mean of 3.0 mmol/litre with a standard deviation of 3.0 mmol/litre. Which of the following is a correct statement about the shape of the distribution of random blood glucose in these first year medical students? [AIIMS Nov 2005]
(a) Since both mean and standard deviation are equal, it should be a symmetric distribution
(b) The distribution is likely to be positively skewed
(c) The distribution is likely to be negatively skewed
(d) Nothing can be said conclusively

A chest physician observed that the distribution of forced expiratory volume (FEV) in 300 smokers had a median value of 2.5 litres with the first and third quartiles being 1.5 and 4.5 litres respectively. Based on this data how many persons in the sample are expected to have a FEV between 1.5 and 4.5 litres? [AIIMS Nov 05]
(a) 7.5
(b) 150
(c) 225
(d) 300

If the distribution of intra-ocular pressure (IOP) seen in 100 glaucoma patients has an average 30 mm with a SD of 1.0, what is the lower limit of the average IOP that can be expected 95% of times? [AIIMS Nov 05]
(a) 28
(b) 26
(c) 32
(d) 259

In the WHO recommended EPI Cluster sampling for assessing primary immunization coverage, the age group of children to be surveyed is [AIIMS Nov 2005]
(a) 0-12 months
(b) 6-12 months
(c) 9-12 months
(d) 12-23 months

The height of a group of 20 boys aged 10 years was 140 ± 13 cm and that of 20 girls of the same age was 135 ± 7 cm. To test the statistical significance of the difference in height, the applicable test is [AIIMS Nov 05]
(a) χ²
(b) Z
(c) t
(d) F

Histogram is used to present which kind of the data: [AIIMS May 2006]
(a) Nominal
(b) Continuous
(c) Discrete
(d) Any of above

A randomised trial comparing efficacy of two regimens showed that difference is statistically significant with p<0.001 but in reality the two drugs do not differ in their efficacy. This is an example of- [AIIMS May 2006]
(a) Type-I error (α error)
(b) Type-II error (β error)
(c) 1 – α
(d) 1 – β

You have diagnosed a patient clinically as having SLE and ordered 6 tests. Out of which 4 tests have come positive and 2 are negative. To determine the probability of SLE at this point, you need to know- [AIIMS May 2006]
(a) Prior probability of SLE; sensitivity and specificity of each test
(b) Incidence of SLE and predictive value of each test
(c) Incidence and prevalence of SLE
(d) Relative risk of SLE in this patient

A diagnostic test for a particular disease has a sensitivity of 0.90 and a specificity of 0.80. A single test is applied to each subject in the population in which the diseased population is 30%. What is the probability that a person, negative to this test, has no disease? [AIIMS May 2006]
(a) Less than 50%
(b) 70%
(c) 95%
(d) 72%
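
The probability that a test-negative person is truly disease-free is the negative predictive value, which depends on prevalence as well as on sensitivity and specificity. A minimal sketch of the calculation for the figures above (the function name is illustrative):

```python
def negative_predictive_value(sensitivity, specificity, prevalence):
    """Probability of no disease given a negative test (Bayes' theorem)."""
    true_neg = specificity * (1 - prevalence)   # healthy and test-negative
    false_neg = (1 - sensitivity) * prevalence  # diseased but test-negative
    return true_neg / (true_neg + false_neg)

# Sensitivity 0.90, specificity 0.80, 30% diseased (from the question above).
npv = negative_predictive_value(0.90, 0.80, 0.30)
print(round(npv, 3))
```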

In the given data, the degrees of freedom will be [AIIMS May 06]
Duration of developing AIDS    Blood group:  A     B     AB    O
0 – 5 years                                  20    30    48    7
5 – 10 years                                 110   12    37    12
10 – 15 years                                12    9     8     3
(a) 12
(b) 6
(c) 9
(d) 20

If the birth weight of each of the 10 babies born in a hospital in a day is found to be 2.8 kg, then the standard deviation of this sample will be [AIIMS May 2006, Dec 97]
(a) 2.8
(b) 0
(c) 1
(d) 0.28

LJ chart is used for: [AIIMS May 07]
(a) Accuracy
(b) Precision
(c) Odds
(d) Likelihood ratio

Which is the best method to compare the results obtained by a new test and a gold standard test? [AIIMS May 07]
(a) Correlation study
(b) Regression study
(c) Bland and Altman analysis
(d) Kolmogorov-Smirnov test

Sensitivity of a screening test ‘X’ is 90 % while its specificity is 10 %. Likelihood ratio for a positive test is - [AIIMS May 07]
(a) 9.0
(b) 8.0
(c) 1.0
(d) 0.1
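
The likelihood ratio for a positive test is sensitivity divided by the false-positive rate, that is, sensitivity / (1 – specificity). A minimal sketch of that formula (the function name is illustrative):

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1 - specificity)

# Sensitivity 90%, specificity 10% (from the question above).
lr_pos = positive_likelihood_ratio(0.90, 0.10)
print(round(lr_pos, 2))
```

A ratio close to 1 means a positive result barely changes the odds of disease.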

If a 95% Confidence Interval for prevalence of Cancer in Smokers aged >65 years is 56% to 76%, the chance that the prevalence could be less than 56% is [AIIMS May 07]
(a) Practically NIL
(b) 44%
(c) 2.5%
(d) 5%


In a group of 100 children, the mean weight of children is 15 kg. The standard deviation is 1.5 kg. Which one of the following is true? [AIIMS May 2007]
(a) 95% of all children weigh between 12 and 18 kg
(b) 95% of all children weigh between 13.5 and 16.5 kg
(c) 99% of all children weigh between 12 and 18 kg
(d) 99% of all children weigh between 13.5 and 16.5 kg

Which is the best distribution to study the daily admission of head injury patients in a trauma care centre? [AIIMS May 2008]
(a) Normal distribution
(b) Binomial distribution
(c) Uniform distribution
(d) Poisson distribution

Mean bone density amongst 2 groups of 50 people each is compared. Which would be the best test? [AIIMS May 2008]
(a) Chi square
(b) Student t test
(c) McNemar chi square test
(d) Fisher test

Association can be measured by all except [AIIMS May 2009]
(a) Correlation coefficient
(b) Cronbach’s alpha
(c) P value
(d) Odds ratio

The risk factor association of smoking with pancreatic cancer was studied in a case control study. The values are
Group  Odds ratio        95% Confidence limits
A             2.5                   1.0 – 3.1
B             1.4                   1.1 – 1.7
C             1.6                   0.9 – 1.7
Which of the following is correct [AIIMS Nov 09]
(a) Risk is more associated with Group A
(b) Risk is more associated with Group B
(c) Risk is more associated with Group C
(d) Risk is equally associated with all three groups

All of the following are true about Standard error except? [AIIMS Nov- 09]
(a) As the sample size increases, Standard error will also increase
(b) Based on Normal distribution
(c) It depends on Standard deviation of mean
(d) Is used to estimate confidence limit

In a study following interpretation are obtained: Satisfied, Very satisfied, Dissatisfied. Which type of scale is this? [AIIMS May 2010]
(a) Nominal
(b) Ordinal
(c) Interval
(d) Ratio

Which of the following is used to denote a continuous variable? [AIIMS May 2010]
(a) Simple bar
(b) Histogram
(c) Pie diagram
(d) Multiple bar

True about cluster sampling all except [AIIMS May 2011]
(a) Sample size same as simple random
(b) It is two stage sampling
(c) Cheaper than other methods
(d) It is a method for rapid assessment

An investigator finds out that 5 independent factors influence the occurrence of a disease. Comparison of multiple factors that are responsible for the disease can be assessed by:
[AIIMS May 2011]
(a) ANOVA
(b) Multiple linear regression
(c) Chi-square test
(d) Multiple logistic regression

Method used for comparison of a new test with an available gold-standard test is
 [AIIMS November 2011]
(a) Regression analysis/Likelihood test
(b) Correlation analysis/Bland and Altmann test
(c) Baltin and Altimore method
(d) Kimorov and Samletor technique

In a study first schools are sampled, then sections, and finally students. This type of sampling is known as: [AIIMS November 2012]
(a) Stratified sampling
(b) Simple random sampling
(c) Cluster sampling
(d) Multistage sampling

The expected prevalence of a disease is 50%, and it is to be estimated within 45-55% with 95% confidence. The minimum sample size required is: [AIIMS May 2013]
(a) 100
(b) 200
(c) 300
(d) 400
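
A common textbook formula for the sample size needed to estimate a proportion is n = z² p (1 – p) / d², where d is the allowed margin of error. The sketch below assumes the question intends a prevalence of 50% estimated to within ±5 percentage points at 95% confidence; the function name and the choice of z are illustrative assumptions.

```python
import math

def sample_size_for_proportion(p, margin, z=1.96):
    """n = z^2 * p * (1 - p) / margin^2, rounded up to a whole subject."""
    n = z ** 2 * p * (1 - p) / margin ** 2
    # Round first so floating-point noise cannot push an exact value upward.
    return math.ceil(round(n, 6))

n_exact = sample_size_for_proportion(0.50, 0.05)        # z = 1.96
n_rule = sample_size_for_proportion(0.50, 0.05, z=2.0)  # z rounded to 2
print(n_exact, n_rule)
```

With z rounded to 2, as is often done for hand calculation, the formula reduces to 4pq / d².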

If confidence limit is increased, then: [AIIMS May 2013]
(a) Previously insignificant data becomes significant
(b) Previously significant data becomes insignificant
(c) No effect on significance
(d) Any change can happen

In a population of 100 prevalence of candida glabrata was found to be 80%. If the investigator has to repeat the prevalence with 95% confidence what will the prevalence be? [AIIMS May 2013]
(a) 78-82%
(b) 76-84%
(c) 72-88%
(d) 74-86%
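
An approximate 95% confidence interval for a proportion is p ± 2 × SE, with SE = √(p(1 – p)/n). A minimal sketch for the figures above, assuming this normal approximation is what the question intends (the function name is illustrative):

```python
import math

def approx_95_ci_proportion(p, n):
    """Approximate 95% CI for a proportion: p ± 2 * sqrt(p * (1 - p) / n)."""
    se = math.sqrt(p * (1 - p) / n)
    return p - 2 * se, p + 2 * se

# Prevalence 80% in a sample of 100 (from the question above).
low, high = approx_95_ci_proportion(0.80, 100)
print(round(low, 2), round(high, 2))
```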

How much population falls between median and median plus one standard deviation in a normal distribution? [AIIMS Nov 2013]
(a) 0.34
(b) 0.68
(c) 0.17
(d) 0.47
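
In a normal distribution the median equals the mean, so the share of the population between the median and one standard deviation above it can be read from the standard normal CDF. A minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Area between the median (z = 0) and one SD above it (z = 1).
area = std_normal.cdf(1) - std_normal.cdf(0)
print(round(area, 4))
```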

There is a population of 20000 people with mean haemoglobin being 13.5 gm% having a normal distribution. What proportion of the population has haemoglobin more than 13.5 gm%?
[AIIMS Nov 2013]
(a) 0.25
(b) 0.50
(c) 1
(d) 0.34

Q-test is used for detecting: [AIIMS Nov 2013]
(a) Outliers
(b) Interquartile range
(c) Difference of means
(d) Difference of proportions

ANSWERS ARE IN RED!
Creative Commons License
PSM / COMMUNITY MEDICINE by Dr Abhishek Jaiswal is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Based on a work at learnpsm@blogspot.com.
Permissions beyond the scope of this license may be available at jaiswal.fph@gmail.com.