**Equipe Raisonnement Induction Statistique**

Bibliography about the practice of statistical inference

Theoretical and methodological references about the use of significance tests

Criticisms of significance tests ("the significance test controversy")

Examples of abuses and misuses

Foundations of statistical inference

Alternative solutions (especially Bayesian methods) and examples of applications

Retired C.N.R.S. research director

Retired C.N.R.S. research engineer

A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | Y | Z

**A**

Abelson, R.P. (1985). A variance explanation paradox: When a little is a lot.

Abelson, R.P. (1995).

Abelson, R.P. (1997). On the surprising longevity of flogged horses: Why there is a case for the significance test.

Abelson, R.P. (1997). A retrospective on the significance test ban of 1999 (if there were no significance tests, they would be invented).

Aberson, C. (2002). Interpreting null results: Improving presentation and conclusions with confidence intervals.

Acklin, M.W., McDowell, C.J., & Orndoff, S. (1992). Statistical power and the Rorschach: 1975-1991.

Acree, M.C. (1978).

Aczel, A.D. (1995).

Aickin, M. (2005). Bayes without priors.

Aiken, L. S., West, S. G., Sechrest, L., & Reno, R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: A survey of Ph.D. programs in North America.

Albert, J. (1995). Teaching Inference about Proportions Using Bayes and Discrete Models.

Albert, J. (1997). Teaching Bayes' rule: A data-oriented approach.

Algina, J., & Moulder, B.C. (2001). Sample sizes for confidence intervals on the increase in the squared multiple correlation coefficient.

Altham, P.M.E. (1969). Exact Bayesian analysis of a 2x2 contingency table and Fisher's "exact" significance test.

Altman, D.G. (1982). Statistics in medical journals.

Altman, D. G. (1985). Discussion of Dr. Chatfield's paper.

Altman, D. G. (1992). Confidence intervals in research evaluation.

Altman, D.G., & Bland, J. (1991). Improving doctors' understanding of statistics.

Altman, D. G., Gore, S. M., Gardner, M. J., & Pocock, S. J. (1983). Statistical guidelines for contributors to medical journals.

American Psychological Association. (2001).

American Psychological Association, Board of Scientific Affairs (1996). Task force on statistical inference initial report (draft).

Amery, W.K., Hoing, M., Debroye, M., & Dom, F. (1987). Some comments on the use of statistics in the evaluation of drug trials in migraine.

Amorim, M.A. (1999). A neurocognitive approach to human navigation.

Amorim, M.A., Glasauer, S., Corpinot, K., & Berthoz, A. (1997). Updating an object's orientation and location during nonvisual navigation: A comparison between two processing modes.

Amorim, M.A., Isableu, B., & Jarraya, M. (2006). Embodied spatial transformations: "Body analogy" for the mental rotation of objects.

Amorim, M.A., Loomis, J.M., & Fukusima, S.S. (1998). Reproduction of object shape is more accurate without the continued availability of visual information.

Amorim, M.-A., & Stucchi, N. (1997). Viewer- and object-centered mental explorations of an imagined environment are not equivalent.

Amorim, M.-A., Trumbore, B., & Chogyen, P.L. (2000). Cognitive repositioning inside a desktop VE: The constraints introduced by first-versus third-person imagery and mental representation richness.

Anderson, D.R., Burnham, K.P., & Thompson, W.L. (2000). Null hypothesis testing: problems, prevalence, and an alternative.

Anderson, D.R., Link, W.A., Johnson, D.H., & Burnham, K.P. (2001). Suggestions for presenting the results of data analyses.

Anderson, S., & Hauck, W.W. (1983). A new procedure for testing equivalence in comparative bioavailability and other clinical trials.

Anderson, W. T. (1992). Trouble in paradigms: robobuyer versus the blob - part 2.

Anscombe, F.J. (1956). Discussion of paper by F.N. David and N.L. Johnson.

Anscombe, F.J. (1963). Sequential clinical trials.

Anscombe, F.J. (1990). The summarizing of clinical experiments by significance levels.

Arabie, P., & Hubert, L.J. (1996). An overview of Combinatorial Data Analysis.

Arbuthnott, J. (1710). An argument for Divine Providence, taken from the constant regularity observ'd in the births of both sexes.

Argimon, J.M. (2002). El intervalo de confianza: algo más que un valor de significación estadística [Confidence intervals: something more than a statistical significance test].

Armatte, M. (2004). Sur les tests d'hypothèses: La véritable nature d'une méthodologie hybride [Discussion de l'article de D. Denis, The modern hypothesis testing hybrid: R. A. Fisher's fading influence],

Aron, A., & Aron, E. N. (1999).

Arvey, R.D., Cole, D.S., Hazucha, J.F., & Hartanto, F.M. (1985). Statistical power of training evaluation designs.

Atkins, L., & Jarrett, D. (1981). The significance of significance tests.

Atkinson, D.R., Furlong, M.J., & Wampold, B.E. (1982). Statistical significance, reviewer evaluations, and the scientific process: Is there a (statistically) significant relationship?

Azar, B. (1997). APA task force urges a harder look at data.

Azar, B. (1999). APA statistics task force prepares to release recommendations for public comment.

**B**

Bakeman, R. (2006). VII. The practical importance of findings.

Bacon, F.T. (1979). Credibility of repeated statements: Memory for trivia.

Bailar, J.C., & Mosteller, F. (1988). Guidelines for statistical reporting in articles for medical journals: Amplifications and explanations.

Bailar, J.C., & Mosteller, F. (1992). Guidelines for statistical reporting in articles for medical journals: Amplifications and explanations.

Baird, D. (1984). Tests of significance violate the rule of implication.

Bakan, D. (1967/1966). The test of significance in psychological research.

Bakan, D. (1967).

Balluerka, N., Gómez, J., & Hidalgo, D. (2005). The controversy over null hypothesis significance testing revisited.

Bandt, C. L., & Boen, J. R. (1972). A prevalent misconception about sample size, statistical significance, and clinical importance.

Baril, G.L., & Cannon, J.T. (1995). What is the probability that null hypothesis testing is meaningless [Comment].

Barnard, G.A. (1947). The meaning of a significance level.

Barnard, G.A. (1989). On alleged gains in power from lower

Barnard, G.A. (1990). Must clinical trials be large? The interpretation of

Barnard, G. A. (1992). Statistics and OR - some needed interactions.

Barnard, G. (1998). Letter.

Barndorff-Nielsen, O. (1977). Discussion of D. R. Cox's paper.

Barnett, V. (1982).

Barnett, M.L., & Mathisen, A. (1997). Tyranny of the p-value: The conflict between statistical significance and common sense [Editorial].

Bartko, J.J. (1991). Proving the null hypothesis [Comment].

Bassok, M., Wu, L.L., & Olseth, K.L. (1995). Judging a book by its cover: Interpretative effects of content on problem-solving transfer.

Batanero, C. (2000). Controversies around the role of statistical tests in experimental research.

Battan, L.J., Neyman, J., Scott, E.L., & Smith, J.A. (1969). Whitetop experiment,

Bayarri, M. J., & Berger, J. O. (2004). The interplay of Bayesian and frequentist analysis.

Bayes, T. (1763). Essay towards solving a problem in the doctrine of chances.

Beale, D.K. (1972). What's so significant about .05?

Beauchamp, K.L., & May, R.B. (1964). Replication report: Interpretation of levels of significance by psychological researchers.

Beaven, E.S. (1935). Discussion on Dr. Neyman's Paper.

Bechhofer, R.E. (1954). A single-sample multiple decision procedure for ranking means of normal populations with known variances.

Beck-Bornholdt, H.P., & Dubben, H.H. (1994). Potential pitfalls in the use of

Becker, G. (1991). Alternative methods of reporting research results.

Becker, H.S. (1998).

Begg, I., Armour, V., & Kerr, T. (1985). On believing what we remember.

Bellhouse, D.R. (1993). Invited commentary:

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing.

Bennett, J.H. (1990).

Berg, A.O. (1979). Some non-random views of statistical significance.

Berger, J.O. (1985).

Berger, J.O. (1986). Are P-values reasonable measures of accuracy?

Berger, J. O. (2003). Could Fisher, Jeffreys and Neyman have agreed on testing? [with discussion].

Berger, J. (2004). The case for objective Bayesian analysis.

Berger, J.O., & Berry, D.A. (1988). Statistical analysis and the illusion of objectivity.

Berger, J.O., & Berry, D.A. (1988). The Relevance of Stopping Rules in Statistical Inference,

Berger, J.O., Boukai, B., & Wang, Y. (1997). Unified frequentist and Bayesian testing of a precise hypothesis.

Berger, J.O., & Delampady, M. (1987). Testing precise hypotheses.

Berger, J.O., & Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of

Berger, J.O., & Wolpert, R.L. (1988).

Berger, R.L., & Hsu, J.C. (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets.

Berger, V.W. (2000). Pros and cons of permutation tests in clinical trials.

Bergin, A.E., & Strupp, H.H. (1972).

Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test.

Berkson, J. (1941). Comments on Dr. Madow's "Note on tests of departure from normality" with some remarks concerning tests of significance.

Berkson, J. (1942). Tests of significance considered as evidence.

Berkson, J. (1943). Experience with tests of significance: A reply to Professor R.A. Fisher.

Bernard, J.-M. (1986). Méthodes d'inférence bayésienne sur des fréquences.

Bernard, J.-M. (1996). Bayesian interpretation of frequentist procedures for a Bernoulli process.

Bernard, J.-M. (2000). Bayesian inference for categorized data.

Bernard, J.-M., Blancheteau, M., & Rouanet, H. (1985). Le comportement prédateur chez un forficule, Euborellia moesta (Géné).

Bernardo, J.M. (1984). Monitoring the 1982 Spanish socialist victory: A Bayesian analysis.

Bernardo, J.M. (2003). Bayesian Statistics.

Bernardo, J.M., & Smith, A.F.M. (1994).

Bernardo, J.M. (2007). Reference analysis.

Berry, D.A. (1980). Statistical inference and the design of clinical trials.

Berry, D.A. (1985). Interim analysis in clinical trials: Classical vs. Bayesian approach.

Berry, D.A. (1987). Statistical inference, designing clinical trials, and pharmaceutical company decisions.

Berry, D.A. (1987). Interim analysis in clinical trials: The role of the likelihood principle,

Berry, D.A. (1988). Multiple comparisons, multiple tests, and data dredging: A Bayesian perspective [with discussion].

Berry, D.A. (1989). Monitoring accumulating data in a clinical trial.

Berry, D.A. (1991). Experimental design for drug development: A Bayesian approach.

Berry, D.A. (1991). Bayesian methodology in phase III trials.

Berry, D.A. (1993). A case for Bayesianism in clinical trials.

Berry, D.A. (1994).

Berry, D.A. (1995). Decision analysis and Bayesian methods in clinical trials.

Berry, D.A. (1996).

Berry, D.A. (1997). Teaching elementary Bayesian statistics with real applications in science.

Berry, G. & Armitage, P. (1995). Mid-

Berry, D.A., & Hochberg, Y. (1999). Bayesian perspectives on multiple comparisons.

Berry, D.A., & Lindgren, B.W. (1996).

Berry, D.A., & Stangl, D.K. (1996).

Berry, D.A., & Stangl, D.K. (1996). Bayesian methods in health-related research.

Berry, G. (1986). Statistical significance and confidence intervals [Editorial].

Beshers, J. (1958). On "A critique of tests of significance in survey research".

Bezeau, S., & Graves, R. (2001). Statistical power and effect sizes of clinical neuropsychology research.

Bhattacharyya & Johnson (1997).

Binder, A. (1963). Further considerations on testing the null hypothesis and the strategy and tactics of investigating theoretical models.

Bird, K.D. (2002). Confidence intervals for effect sizes in analysis of variance.

Birnbaum, A. (1961). Confidence curves: An omnibus technique for estimation and testing statistical hypotheses.

Birnbaum, A. (1962). On the foundations of statistical inference.

Birnbaum, A. (1977). The Neyman-Pearson theory as decision theory, and as inference theory; with a criticism of the Lindley-Savage argument for Bayesian theory.

Birnbaum, I. (1982). Interpreting Statistical Significance.

Blackwelder, W.C. (1982). Proving the null hypothesis in clinical trials.

Blackwelder, W.C., & Chang, M.A. (1984). Sample size graphs for proving the null hypothesis.

Blaich, C.F. (1998). The null-hypothesis significance-test procedure: Can't live with it, can't live without it.

Blalock, H. M., Jr. (1972).

Boardman, T. J. (1994). The statistician who changed the world: W. Edwards Deming, 1900-1993.

Bofinger, E. (1985). Expanded confidence intervals.

Bofinger, E. (1992). Expanded confidence intervals, one-sided tests and equivalence testing.

Boik, R.J. (1993). The analysis of two-factor interactions in fixed effects linear models.

Bolles, R. (1962). The difference between statistical hypotheses and scientific hypotheses.

Bolles, R., & Messick, S. (1958). Statistical utility in experimental inference.

Bondy, W.A. (1969). A test of an experimental hypothesis of negligible difference between means.

Boos, D.D. & Hughes-Oliver, J.M. (2000). How large does

Borak, J., & Veilleux, S. (1982). Errors of intuitive logic among physicians.

Borenstein, M. (1994). A note on the use of confidence intervals in psychiatric research.

Borenstein, M. (1994). The case for confidence intervals in controlled clinical trials.

Borenstein, M. (1997). Hypothesis testing and effect size estimation in clinical trials.

Boring, E.G. (1919). Mathematical versus scientific significance.

Bourke, S. (1993). Babies, bathwater and straw person: A response to Menon.

Box, G. E. P. (1976). Science and statistics.

Box, G.E.P. (1980). Sampling and Bayes' inference in scientific modeling and robustness.

Box, G. E. P. (1983). An apology for ecumenism in statistics.

Box, G. E. P., Hunter, W. G., & Hunter, J. S. (1978).

Box, G.E.P., & Tiao, G.C. (1973).

Bozdogan, H. (1994). Editor's general preface.

Braithwaite, R. B. (1953).

Braitman, L.E. (1988). Confidence intervals extract clinically useful information from data [Editorial].

Braitman, L.E. (1991). Confidence intervals assess both clinical significance and statistical significance [Editorial].

Braitman, L.E. (1993). Statistical estimates and clinical trials.

Brandstätter, E. (1999). Confidence intervals as an alternative to significance testing.

Bredenkamp, J. (1972).

Brenac, T. (2009). Common before-after accident study on a road site: A low-informative Bayesian method.

Brennan, P., & Croft, P. (1994). Interpreting the results of observational research.

Breslow, N. (1990). Biostatistics and Bayes [With comments].

Brewer, J.K. (1972). On the power of statistical tests in the American Educational Research Journal.

Brewer, J.K. (1985). Behavioral statistics textbooks: sources of myths and misconceptions?

Brewer, J.K., & Owen, P.W. (1973). A note on the power of statistical tests in the Journal of Educational Measurement.

Bristol, D.R. (1995). Delta: The true clinically significant difference to be detected.

Brophy, J.M., & Joseph, L. (1995). Placing trials in context using Bayesian analysis. GUSTO revisited by Reverend Bayes.

Bross, I.D. (1990). How to eradicate fraudulent statistical methods: Statisticians must do science.

Brown, F.L. (1973). Introduction to statistical methods in psychology.

Brown, J., & Hale, M.S. (1992). The power of statistical studies in consultation-liaison psychiatry.

Brown, L.D. (1990). An ancillarity paradox which appears in multiple linear regression [with discussion].

Brown, L.D., Cai, T., & DasGupta, A. (2001). Interval estimation for a binomial proportion [with discussion].

Brown, L.D., Hwang, J.T.G., & Munk, A. (1997). An unbiased test for the bioequivalence problem.

Browne, R.H. (1979). On visual assessment of the significance of a mean difference.

Browne, R.H. (1995). Bayesian analysis and the GUSTO trial. Global Utilization of Streptokinase and Tissue Plasminogen Activator in Occluded Coronary Arteries [Letter].

Browner, W.S., & Newman, T.B. (1987). Are all significant

Bru, B. (2004). Remarques sur l'article de D. Denis [The modern hypothesis testing hybrid: R. A. Fisher's fading influence],

Bryan-Jones, J., & Finney, D.J. (1983). On an error in "Instructions to Authors".

Bryk, A.S., & Raudenbush, S.W. (1988). Heterogeneity of variance in experimental studies: a challenge to conventional interpretations.

Buchanan-Wollaston, H.J. (1935). The philosophic basis of statistical analysis.

Bulmer, M.G. (1957). Confirming statistical hypotheses.

Bulpitt, C.J. (1987). Confidence intervals.

Burke, C.J. (1954). Further remarks on one-tailed tests.

Byrne, M.D. (1993). A better tool for the Cognitive Scientist's toolbox: Randomization statistics.

**C**

Camilleri, S.F. (1962). Theory, probability, and induction in social research.

Campbell, J.P. (1982). Editorial: Some remarks from the outgoing editor.

Campbell, M. (1992). Letter.

Campillo, A. C. (1996). Erroneous interpretation of

Capone, C.A., Jr., & Seaman, S.L. (1989). Uses and misuses of hypothesis testing.

Capraro, R.M., & Capraro, M.M. (2002). Treatments of effect sizes and statistical significance tests in textbooks.

Carlin, B., & Louis, T. (2000).

Carlson, R. (1976). The logic of tests of significance.

Carpenter, JA. (2001). Deliberations of the Task Force on Statistical Inference,

Carver, R.P. (1978). The case against statistical significance testing.

Carver, R.P. (1993). The case against statistical significance testing, revisited.

Casella, G., & Berger, R.L. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem [With discussion].

Chaloner, K. (1996). Elicitation of prior distributions.

Chaloner, K., Church, T., Louis, T., & Matts, J. (1993). Graphical elicitation of a prior distribution for a clinical trial.

Charron, C. (2002). Conceptualization of fractions and categorization of problems for adolescent.

Chase, L.J., & Baran, S.J. (1976). An assessment of quantitative research in mass communication.

Chase, L.J., & Chase, R.B. (1976). A statistical power analysis of applied psychological research.

Chase, L.J., Chase, R.B., & Tucker, R.K. (1978). Statistical power in physical anthropology: A technical report.

Chase, L.J., & Tucker, R.K. (1975). A power-analytic examination of contemporary communication research.

Chase, L.J., & Tucker, R.K. (1976). Statistical power: Derivation, development, and data-analytic implications.

Chatfield, C. (1985). The initial examination of data [With discussion].

Chatfield, C. (1988).

Chatfield, C. (1989). Comments on the paper by McPherson.

Chatfield, C. (1991). Avoiding statistical pitfalls.

Chatfield, C. (2002). Confessions of a pragmatic statistician.

Chernoff, H. (1986). Comment.

Cherry, S. (1998). Statistical tests in publications of The Wildlife Society.

Chew, V. (1976). Comparing treatment means: a compendium.

Chew, V. (1977). Statistical hypothesis testing: an academic exercise in futility.

Chew, V. (1980). Testing differences among means: correct interpretation and some alternatives.

Chia, K.S. (1997). "Significant-itis" - an obsession with the P-value.

Choi, S.C., & Pepple, P.A. (1989). Monitoring clinical trials based on predictive probability of significance.

Chow, S.L. (1988). Significance tests or effect size?

Chow, S.L. (1989). Significance tests and deduction: Reply to Folger (1989).

Chow, S.L. (1991). Conceptual rigor versus practical impact.

Chow, S.L. (1991). Rigor and logic: A response to comments on "conceptual rigor".

Chow, S.L. (1991). Some reservation about power analysis [Comment].

Chow, S.L. (1996).

Chow, S. L. (1998). What Statistical Significance Means.

Chow, S.L. (1998). Open Peer Commentary and author's response / Statistical Significance: Rationale, Validity and Utility.

Chow, S. L. (2002). Issues in statistical inference.

Christensen, J.E., & Christensen, C.E. (1977). Statistical power analysis of health, physical education, and recreation research.

Ciancia, F., Maitte, M., Honoré, J., Lecoutre, B., & Coquery, J.-M. (1988). Orientation of attention and sensory gating: An evoked potential and RT study in cat.

Clark, C.A. (1963). Hypothesis testing in relation to statistical methodology.

Clark-Carter, D. (1997). The account taken of statistical power in research published in the

Clément, E., & Richard, J.-F. (1997). Knowledge of domain effects in problem representation: the case of Tower of Hanoi isomorphs.

Clements, M.A. (1993). Statistical significance testing: Providing historical perspective for Menon's paper.

Clopper, C.J., & Pearson, E.S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial.

Coats, W. (1970). A case against the normal use of inferential statistical models in educational research.

Cochran, W.G., & Cox, G.M. (1957).

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review.

Cohen, J. (1965). Some statistical issues in psychological research.

Cohen, J. (1968). Multiple regression as a general data-analytic system.

Cohen, J. (1988).

Cohen, J. (1990). Things I have learned (so far).

Cohen, J. (1992). A power primer.

Cohen, J. (1992). Statistical Power Analysis.

Cohen, J. (1994). The earth is round (

Cohen, L.H. (1979). Clinical psychologists' judgments of the scientific merit and clinical relevance of psychotherapy outcome research.

Connolly, R.A. (1991). A posterior odds analysis of the weekend effect.

Cooke, R.W., & Weindling, A.M. (1993). Clinical trials and

Cooper & Topher (1994). Anomalous propagation.

Cooper, H.M. (1989).

Cooper, H., DeNeve, K., & Charlton, K. (1997). Finding the missing science: The fate of studies submitted for review by a human subjects committee.

Cooper, H., & Findley, M. (1982). Expected effect sizes: Estimates for statistical power analysis in social psychology.

Cooper, H.M., & Rosenthal, R. (1980). Statistical versus traditional procedures for summarizing research findings.

Cormack, R. M. (1985). Discussion of Dr. Chatfield's paper.

Cornfield, J. (1966). Sequential trials, sequential analysis and the likelihood principle.

Cornfield, J. (1966). A Bayesian test of some classical hypotheses - with applications to sequential clinical trials.

Cornfield, J. (1969). The Bayesian outlook and its application.

Corroyer, D., Devouche, E., Bernard, J.-M., Bonnet, P., & Savina, Y. (2003). Comparaison de six logiciels pour l'analyse de la variance d'un plan S<A2*B2> déséquilibré.

Corroyer, D., & Rouanet, H. (1994). Sur l'importance des effets et des indicateurs dans l'analyse statistique des données.

Cortina, J.M., & Dunlap, W.P. (1997). On the logic and purpose of significance testing.

Cortina, J. M., & Nouri, H. (2000).

Coursol, A., & Wagner, E.E. (1986). Effect of positive findings on submission and acceptance rates.

Cowles, M., & Davis, C. (1982). On the origins of the .05 level of statistical significance.

Cowles, M. (1989).

Cox, D.R. (1958). Some problems connected with statistical inference.

Cox, D.R. (1977). The role of significance tests [With discussion].

Cox, D.R. (1982). Statistical significance tests.

Cox, D.R. (1986). Some general aspects of the theory of statistics.

Cox, D.R. (2001). Another comment on the role of statistical methods.

Cox, D.R., & Snell, E.J. (1981).

Craig, J.R., Eison, C.L., & Metze, L.P. (1976). Significance tests and their interpretation: An example utilizing published research and

Cronbach, L.J. (1975). Beyond the two disciplines of scientific psychology.

Cronbach, L.J., & Snow, R.E. (1977).

Crow, E.L. (1991). Response to Rosenthal's comment "How are we doing in soft psychology" [Comment].

Cumming, G. (2005). Understanding the average probability of replication: Comment on Killeen (2005).

Cumming, G., & Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals that are based on central and noncentral distributions.

Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals, and how to read pictures of data.

Cumming, G., Thomason, N., Howard, A., Les, J., & Zangari, M. (1995). The StatPlay software for statistical understanding: Confidence intervals and hypothesis testing.

Cumming, G., Williams, J., & Fidler, F. (2004). Replication, and researchers' understanding of confidence intervals and standard error bars.

Cutler, S., Greenhouse, S., Cornfield, J.,

**D**

D'Agostini, G. (1999). Bayesian reasoning versus conventional statistics in high energy physics. Proceedings XVIII International Workshop on Maximum Entropy and Bayesian Methods, Garching, Germany. Kluwer Academic, 157-170.

D'Agostini, G. (2000). Role and meaning of subjective probability: some comments on common misconceptions.

D'Agostini, G. (2000). Teaching Bayesian statistics in the scientific curricula,

D'Agostini, G. (2000). Confidence limits: what is the problem? Is there

D'Agostini, G. (2003). Bayesian inference in processing experimental data: principles and basic applications. Invited paper for

D'Agostini, G. (2003).

D'Andrade, R., & Dart, J. (1990). The interpretation of

Dahl, H. (1999). Teaching hypothesis testing. Can it still be useful?

Daly, J.A., & Hexamer, A. (1983). Statistical power in research in English education.

Daniel, L.G. (1998). Statistical significance testing: A historical overview of misuse and misinterpretation with implications for the editorial policies of educational journals.

Daniel, L.G. (1998). Fight the Good Fight: A Response to Thompson, Knapp, and Levin.

Daniel, L.G. (1998). The statistical significance controversy is definitely not over: a rejoinder to responses by Thompson, Knapp, and Levin.

Dar, R. (1987). Another look at Meehl, Lakatos, and the scientific practices of psychologists.

Dar, R. (1998). Null hypothesis tests and theory corroboration: Defending NHSTP out of context.

Dar, R., Serlin, R.C., & Omer, H. (1994). Misuse of statistical tests in three decades of psychotherapy research.

Davidoff, F. (1999). Standing statistics right side up [Editorial].

Dawes, R.M. (1988). Probabilistic versus causal thinking.

Dawid, P. (2000). A word from the president.

De Cristofaro, R. (1996). The role of inductive inference in statistical analysis.

De Cristofaro, R. (2002). The inductive reasoning in statistical inference.

De Cristofaro, R. (2004). On the foundations of the likelihood principle.

de Finetti, B. (1974). Bayesianism: Its unifying role for both the foundations and applications of statistics.

DeGroot, M. H. (1989).

Deheuvels, P. (1984). How to analyze bio-equivalence studies? The right use of confidence intervals.

DeLong, J. B., & Lang, K. (1992). Are all economic hypotheses false?

del Rosal, A.B., Costas, C.S., Bruno, J.A.S., & Osinski, I.C. (2001). The judgment against null hypothesis. Many witnesses and a virtuous sentence.

Deming, W. E. (1975). On probability as a basis for action.

Denhière, G., & Lecoutre, B. (1983). Mémorisation de récits: Reconnaissance immédiate et différée d'énoncés par des enfants de 7, 8 et 10 ans.

Denis, D.J. (2004). The modern hypothesis testing hybrid: R. A. Fisher's fading influence.

Desrosières, A. (1985). Histoire de formes: statistiques et sciences sociales avant 1940.

Detsky, A.S., & Sackett, D.L. (1985). When is a negative clinical trial big enough.

Diamond, G.A., & Forrester, J.S. (1983). Clinical trials and statistical verdicts: Probable grounds for appeal.

Dignam, J.J., Bryant, J, Wieand, HS,

Dixon, P. (1998). Why scientists value

Dixon, P., & O'Reilly, T. (1999). Scientific versus statistical inference.

Dixon, P. (2003). The p-value fallacy and how to avoid it.

Dodd, D.H., & Schultz, R.F., Jr. (1973). Computational procedures for estimating magnitude of effect for some analysis of variance designs.

Doros, G. & Geier, A.B. (2005). Probability of replication revisited: Comment on "An alternative to null-hypothesis significance tests".

Dracup, C. (1995). Hypothesis testing: What it really is. The

Duggan, T.J., & Dean, C.W. (1968). Common misinterpretations of significance levels in sociological journals.

DuMouchel, W. (1989). Bayesian metaanalysis.

Duncan, D.B. (1965). A Bayesian approach to multiple comparisons.

Dunlap, W.P., & May, J.G. (1989). Judging statistical significance by inspection of error bars.

Dunne, A., Pawitan, Y., & Doody, L. (1996). Two-sided P-values from discrete asymmetric distributions based on uniformly most powerful unbiased tests.

Dunnett, C.W., & Gent, M. (1977). Significance testing to establish equivalence between treatments with special reference to treatment in form of 2x2 tables.

Durand, J.-L. (1997). Analyse de l'ouvrage de N. Guéguen,

Dwyer, J.H. (1974). Analysis of variance and the magnitude of effects: A general approach.

**E**

Edgeworth, F.Y. (1885). Methods of Statistics.

Edgington, E.S. (1964). A tabulation of inferential statistics used in psychology journals.

Edgington, E.S. (1974). A new tabulation of statistical procedures used in APA journals.

Edwards, A.W.F. (1972).

Edwards, A.W.F. (1974). A History of Likelihood.

Edwards, A.W.F. (1976). Fiducial probability.

Edwards, W. (1965). Tactical note on the relation between scientific and statistical hypotheses.

Edwards, W., Lindman, H., & Savage, L.J. (1963). Bayesian statistical inference for psychological research.

Edwards, W. (1995). Number magic, auditing acid and materiality: a challenge for auditing research.

Efron, B. (1976). Comment on Savage's "On reading R.A. Fisher".

Efron, B. (1978). Controversies in the foundations of statistics.

Efron, B. (1996). Empirical Bayes methods for combining likelihoods.

Efron, B. (1996). Why isn't everyone a Bayesian? [With discussion].

Efron, B. (1998). R.A. Fisher in the 21st century [With discussion].

Eizner Favreau, O. (1997). Sex and gender comparisons: Does null hypothesis testing create a false dichotomy?

Elifson, K.W., Runyon, R.P., & Haber, A. (1990).

Ellerton, N. (1996). Statistical significance testing and this journal.

Ellis, N. (2000). Editorial.

Elmore, P.B., & Woehlke, P.L. (1988). Statistical methods employed in

Ely, M. (1999). The importance of estimates and confidence intervals rather than p values.

Erhardt, C. (1959). Statistics, a trap for the unwary.

Estes, W. K. (1997). Significance testing in psychological research: Some persisting issues.

Etzioni, R.D., & Kadane, J.B. (1995). Bayesian statistical methods in public health and medicine.

Evans, S.J.W., Mills, P., & Dawson, J. (1988). The end of the p-value?

Eysenck, H.J. (1960). The concept of statistical significance and the controversy about one-tailed effects.

**F**

Falk, R. (1986). Misconceptions of statistical significance.

Falk, R. (1998). In criticism of the null hypothesis statistical test.

Falk, R., & Greenbaum, C.W. (1995). Significance tests die hard. The amazing persistence of a probabilistic misconception.

Fan, X. (2001). Statistical significance and effect size in educational research: Two sides of a coin.

Fan, X., & Thompson, B. (2001). Confidence Intervals About Score Reliability Coefficients, Please: An EPM Guidelines Editorial.

Favreau, O.E. (1993). Do the Ns justify the means? Null hypothesis testing applied to sex and other differences.

Fayers, P.M., Ashby, D., & Parmar, M.K. (1997). Tutorial in biostatistics: Bayesian data monitoring in clinical trials.

Feinstein, A. R. (1977).

Feinstein, A.R. (1978). Clinical biostatistics: stochastic significance, apposite data, and some remedies for the intellectual pollutants of statistical vocabulary.

Feinstein, A. R. (1985).

Felson, D.T., Anderson, J.J., & Meenan, R.F. (1990). Time for changes in the design, analysis, and reporting of rheumatoid arthritis clinical trials.

Fidler, F. (2002). The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial.

Fidler, F., Cumming, G., Burgman, M., & Thomason, N. (2004). Statistical reform in medicine, psychology and ecology.

Fidler, F., Cumming, G., Thomason, N., Pannuzzo, D., Smith, J., Fyffe, P., Edmonds, H., Harrington, C., & Schmitt, R. (2005). Evaluating the effectiveness of editorial policy to improve statistical practice: The case of the Journal of Consulting and Clinical Psychology.

Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can't make them think: Statistical reform lessons from medicine.

Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2005). Still much to learn about confidence intervals: Reply to Rouder and Morey (2005).

Fidler, F., & Thompson, B. (2001). Computing correct confidence intervals for ANOVA fixed and random-effects effect sizes.

Fienberg, S. E. (2006). When did Bayesian inference become "Bayesian"?

Finch, S., Cumming, G., & Thomason, N. (2001). Reporting of statistical inference in the Journal of Applied Psychology: Little evidence of reform.

Finch, S., Cumming, G., Williams, J., Palmer, L., Griffith, E., Alders, C., Anderson, J., & Goodman, O. (2004). Reform of statistical inference in psychology: The case of Memory & Cognition.

Finch, S., Thomason, N., & Cumming, G. (2002). Past and future APA guidelines for statistical practice.

Finney, D. J. (1988). Was this in your statistics textbook? III. Design and analysis.

Finney, D. J. (1989). Was this in your statistics textbook? VI. Regression and covariance.

Finney, D. J. (1989). Is the statistician still necessary?

Fisher, L.D. (1996). Comments on Bayesian and frequentist analysis and interpretation of clinical trials.

Fisher, R.A. (1990/1925).

Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics.

Fisher, R.A. (1925). Theory of statistical estimation.

Fisher, R.A. (1926). The arrangement of field experiments.

Fisher, R.A. (1929). The statistical method in psychical research.

Fisher, R.A. (1990/1935).

Fisher, R.A. (1935). The logic of inductive inference.

Fisher, R.A. (1935). Statistical tests.

Fisher, R. A. (1943). Note on Dr Berkson's criticisms of tests of significance.

Fisher, R. A. (1948). Conclusions fiduciaires [Fiducial conclusions].

Fisher, R. A. (1951). Statistics.

Fisher, R. A. (1955). Statistical methods and scientific induction.

Fisher, R. A. (1990/1956).

Fisher, R. A. (1959). Mathematical probability in the natural sciences.

Fisher, R.A. (1962). Some examples of Bayes's method of the experimental determination of probabilities

Fisher, R. A. (1990).

Fisher, R.A., & MacKenzie, W.A. (1923). Studies in crop variation: 2. The manurial response of different potato varieties.

Fiske, D.W., & Fogg, L. (1990). But the reviewers are making different criticisms of my paper! Diversity and uniqueness in reviewer comments.

Fleishman, A. E. (1980). Confidence interval for correlation ratios.

Fleiss, J.L. (1969). Estimating the magnitude of experimental effects.

Fleiss, J.L. (1986). Significance tests have a role in epidemiologic research: reactions to A. M. Walker.

Fleiss, J.L. (1986). Confidence intervals vs. significance tests: Quantitative interpretation (Letter).

Fleiss, J. (1986). Dr. Fleiss responds (Letter).

Folger, R. (1989). Significance tests and the duplicity of binary decisions.

Ford, J. (1975).

Forge, R.L. (1967). Confidence intervals or tests of significance in scientific research.

Forster, M. & Sober, E. (2004). Why likelihood?

Fowler, R.L. (1984). Approximating probability levels for testing null hypotheses with noncentral

Fowler, R.L. (1985). Testing for substantive significance in applied research by specifying nonzero effect null hypotheses.

Fraser, D.A.S. (1996). Some remarks on pivotal models and the fiducial argument in relation to structural models.

Frederick, B.N. (1999). Fixed-, random-, and mixed-effects ANOVA models: A user-friendly guide for increasing the generalizability of ANOVA results.

Freedman, D. (1999). From association to causation: Some remarks on the history of statistics.

Freedman, D., Pisani, R., & Purves, R. (1997).

Freedman, L. (1996). Bayesian statistical methods [Editorial].

Freedman, L.S., Spiegelhalter, D.J., & Parmar, M.K.B. (1994). The what, why and how of Bayesian clinical trials monitoring.

Freeman, P.R. (1993). The role of

Freiman, J.A., Chalmers, T.C., Smith, H., & Kueber, R.R. (1978). The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: Survey of 71 "negative" trials.

Frías, Ma.D., Pascual, J., & Garcia, J.F. (2000). Tamaño del efecto del tratamiento y significación estadística [Effect size and statistical significance].

Frick, R.W. (1995). Accepting the null-hypothesis.

Frick, R.W. (1995). A problem with confidence intervals [Comment].

Frick, R.W. (1996). The appropriate use of null hypothesis testing.

Frick, R. W. (1998). Interpreting statistical testing: Processes, not populations and random sampling.

Frick, R. W. (1998). A better stopping rule for conventional statistical tests.

Frick, R.W. (1999). Defending the statistical status quo.

Friedman, H. (1968). Magnitude of experimental effect and a table for its rapid estimation.

Friedman, M. (1988). Money and the stock market.

Friedman, S. B., & Phillips, S. (1981). What's the difference? Pediatric residents and their inaccurate concepts regarding statistics.

Fry, T.C. (1965).

**G**

Galtung, J. (1967). On the use of statistical tests.

Garbe, E., Rohmel, J., & Gundert-Remy, U. (1993). Clinical and statistical issues in therapeutic equivalence trials.

Gardner, M.J., & Altman, D.G. (1986). Confidence intervals rather than

Gardner, M.J., & Altman, D.G. (1989). Estimation rather than hypothesis testing: confidence intervals rather than

Gardner, M.J., & Altman, D.G. (Eds.) (1989).

Gauch Jr., H. G. (1988). Model selection and validation for yield trials with interaction.

Gavarret, J. (1840).

Geertsema, J.C. (1983). Recent views on the foundational controversy in statistics.

Geary, R. C. (1947). Testing for normality.

Gelman, A., Carlin, J.B., Stern, H.S., & Rubin, D.B. (2004).

Gendreau, P. (2002). We must do a better job of cumulating knowledge.

Gerard, P. D., Smith, D. R., & Weerakkody, G. (1998). Limits of retrospective power analysis.

Gibbons, J.D., & Pratt, J.W. (1975).

Giere, R.N. (1972). The significance test controversy.

Gigerenzer, G. (1987). Probabilistic thinking and the fight against subjectivity.

Gigerenzer, G. (1991). From tools to theories: a heuristic of discovery in cognitive psychology.

Gigerenzer, G. (1991). How to make cognitive illusions disappear: Beyond "Heuristics and Biases".

Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning.

Gigerenzer, G. (1998). We need statistical thinking, not statistical rituals.

Gigerenzer, G., & Murray, D.J. (1987).

Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989).

Gill, J. (1999). The insignificance of Null Hypothesis Significance Testing.

Gill, J. (2001). Whose variance is it anyway? Interpreting empirical models with state-level data.

Gill, J. (2002).

Gill, J. (2004). Grappling with Fisher's legacy in social science hypothesis testing: Some comments on Denis [The modern hypothesis testing hybrid: R. A. Fisher's fading influence],

Gill, M. (1993). The significance of "significance".

Glaser, D.N. (1976). The controversy of significance testing.

Glass, G.V. (1976). Primary, secondary, and meta-analysis of research.

Glass, G.V., McGaw, B., & Smith, M.L. (1981).

Glenberg, A. M. (1988).

Gliner, J.A., Leech, N.L., & Morgan, G.A. (2002). Problems With Null Hypothesis Significance Testing (NHST): What Do the Textbooks Say?

Glymour, C. (1981). Why I am not a Bayesian.

Godambe, V.B., & Sprott, D.A. (Eds.) (1971).

Gold, D. (1958). Comment on "A critique of tests of significance".

Gold, D. (1964). Some problems in generalizing aggregate associations.

Gold, D. (1969). Statistical tests and substantive significance.

Goldberger, A.S. (1991).

Goldstein, H., & Healy, M.J.R. (1995). The graphical presentation of a collection of means.

Good, I.J. (1958). Significance tests in parallel and in series.

Good, I.J. (1973).

Good, I.J. (1981). Some logic and history of hypothesis testing.

Good, I.J. (1983).

Good, I.J. (1984). An error by Neyman noticed by Dickey (C209).

Goodman, S.N. (1992). A comment on replication, P-values and evidence.

Goodman, S.N. (1993).

Goodman, S.N. (1993). Author's response to "Invited commentary:

Goodman, S.N. (1989). Meta-analysis and evidence.

Goodman, S.N. (1998). Multiple comparisons, explained.

Goodman, S.N. (1999). Toward evidence-based medical statistics. 1: The

Goodman, S.N. (1999). Toward evidence-based medical statistics. 2: The Bayes factor.

Goodman, S.N., & Berlin, J.A. (1994). The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results.

Goodman, S.N., & Royall, R. (1988). Evidence and scientific research.

Goodwin, L.D., & Goodwin, W.L. (1985). Statistical techniques in

Gordon, H.R.D. (2001). American Vocational Education Research Association members' perceptions of statistical significance tests and other statistical controversies.

Gore, S.M. (1981). Assessing clinical trials - trial size.

Gore, S.M. (1981). Statistics in question: Assessing methods - confidence intervals.

Gower, J. C. (1983). Data analysis: multivariate or univariate and other difficulties.

Graham, J.M. (2001). Review of Statistics with Confidence.

Grainger, J., & Beauvillain, C. (1988). Associative priming in bilinguals: Some limits of interlingual facilitation effects.

Granaas, M. (2002). Hypothesis testing in psychology: Throwing the baby out with the bathwater? Cape Town, South Africa: ICOTS 6 [http://icots6.haifa.ac.il/PAPERS/3M1_GRAN.PDF]

Granger, C.W.J., King, M.L., & White, H. (1995). Comments on testing economic theories and the use of model selection criteria.

Grant, D.A. (1962). Testing the null hypothesis and the strategy and tactics of investigating theoretical models.

Gray, M.W. (1983). Statistics and the law.

Graybill, F.A. (1976).

Graybill, F. A., & Iyer, H. K. (1994).

Green, C.D. (2002). Comment on Chow's "Issues in Statistical Inference".

Greenfield, M. L. V. H., Kuhn, J. E., & Wojtys, J. E. (1996). Current concepts. A statistics primer.

Greenhouse, J.B. (1992). On some applications of Bayesian methods in cancer clinical trials.

Greenland, S. (1989). Modeling and variable selection in epidemiologic analysis.

Greenland, S., & Robins, J.M. (1991). Empirical-Bayes adjustments for multiple comparisons are sometimes useful.

Greenwald, A.G. (1975). Consequences of prejudice against the null hypothesis.

Greenwald, A. G. (1993). Consequences of prejudice against the null hypothesis.

Greenwald, A.G., Gonzalez, R., Harris, R.J., & Guthrie, D. (1996). Effect sizes and

Gregson, R.A.M. (1998). Understanding Bayesian procedures.

Grissom, R.J., & Kim, J.J. (2001). Review of assumptions and problems in the appropriate conceptualization of effect size.

Grouin, J.-M., & Lecoutre, B. (1996). Probabilités prédictives: Un outil pour la planification des expériences [Predictive probabilities: A tool for planning experiments],

Grouin, J.-M., Coste, M., Bunouf, P., & Lecoutre, B. (2007). Bayesian sample size determination in non-sequential clinical trials: Statistical aspects and some regulatory considerations.

Grunkemeier, G.L. & Payne, N. (2002). Bayesian analysis: A new statistical paradigm for new technology.

Guilford, J.P. (1942).

Guthery, F.S., Lusk, J.J., & Peterson, M.J. (2001). The fall of the null hypothesis: liabilities and opportunities.

Guttman, L. (1977). What is not what in statistics?

Guttman, L. (1979). Cyril Burt and the careless star worshippers.

Guttman, L. (1985). The illogic of statistical inference for cumulative science.

**H**

Haase, R.F., Waechter, D.M., & Solomon, G.S. (1982). How significant is a significant difference? Average effect size of research in counseling psychology.

Hacking, I. (1965).

Hacking, I. (1975).

Hagen, R.L. (1997). In praise of the null hypothesis statistical test.

Hager, W. (1996). On testing a priori hypotheses about quantitative and qualitative trends.

Hager, W. (2000). About some misconceptions and the discontent with statistical tests in Psychology.

Hager, W., & Westermann, R. (1983). Zur Wahl und Prüfung psychologischer Hypothesen in psychologischen Untersuchungen [On the choice and testing of psychological hypotheses in psychological studies].

Hagood, M.J. (1941). The notion of a hypothetical universe.

Hahn, G.J. (1974). Don't let statistical significance fool you!

Hahn, G.J. (1990). Commentary.

Hahn, G.J., & Meeker, W.Q. (1991).

Haldane, J.B.S. (1948). The precision of observed values of small frequencies.

Hall, P., & Selinger, B. (1986). Statistical significance: balancing evidence against doubt.

Hallahan, M., & Rosenthal, R. (1996). Statistical power: Concepts, procedures and applications.

Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers?

Hammond, G. (1996). The objections to null hypothesis testing as a means of analysing psychological data.

Hancock, G.R., & Freeman, M.J. (2001). Power and sample size for the root mean square error of approximation test of not close fit in structural equation modeling.

Hand, D.J., & Taylor, C. (1987).

Hansen, M.H., & Edwards, W.E. (1950). On the important limitation to the use of data from samples.

Harcum, E.R. (1990). Methodological versus empirical literature: Two views on casual acceptance of the null hypothesis.

Hardy, A., Harvie, P., & Koestler, A. (1973).

Hardy, R.J., & Thompson, S.G. (1996). A likelihood approach to meta-analysis with random effects.

Harlow, L. L. (1997). Significance Testing Introduction and Overview.

Harlow, L.L., Mulaik, S.A., & Steiger, J.H. (Eds.) (1997).

Harris, E.K. (1993). On

Harris, M.J. (1991). Significance tests are not enough: The role of effect size estimation in theory corroboration [Comment on Chow, 1991a].

Harris, M.J., & Rosenthal, R. (1985). Mediation of interpersonal expectancy effects: 31 meta-analyses.

Harris, R.J. (1994).

Harris, R.J. (1997). Significance tests have their place.

Harris, R.J. (1997). Reforming significance testing via three-valued logic.

Harper, W.L., & Hooker, C.A. (Eds.) (1976).

Hauck, W.W., & Anderson, S. (1984). A new statistical procedure for testing equivalence in two-group comparative bioavailability trials.

Hauschke, D., & Steinijans, V.W. (1996). A note on conventional null hypothesis testing in active control equivalence studies.

Hays, W.L. (1963).

Healy, M.J.R. (1978). Is statistics a science?

Healy, M.J.R. (1989). Comments on the paper by McPherson.

Hedges, L.V. (1981). Distribution theory for Glass's estimator of effect size and related estimators.

Hedges, L.V. (1987). Estimation of effect size from a series of independent experiments.

Hedges, L.V. (1987). How hard is hard science, how soft is soft science? The empirical cumulativeness of research.

Hedges, L.V., & Olkin, I. (1985).

Heldref Foundation (1997). Guidelines for contributors.

Henderson, A.R. (1993). Chemistry with confidence: should Clinical Chemistry require confidence intervals for analytical and other data?

Henkel, R.E. (1976 -

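Henri, V. (1895). Le calcul des probabilités en psychologie [The calculus of probabilities in psychology].

Henri, V. (1898). Quelques applications du calcul des probabilités à la psychologie [Some applications of the calculus of probabilities to psychology].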

Henson, R. K., & Smith, A. D. (2000). State of the art in statistical significance and effect size reporting: A review of the APA Task Force Report and current trends.

Herson, J. (1979). Predictive probability early termination plans for Phase II clinical trials.

Hess, B., Olejnik, S., & Huberty, C.J (2001). The efficacy of two Improvement-over-chance effect sizes for two-group univariate comparisons under variance heterogeneity and nonnormality.

Hick, W.E. (1952). A note on one-tailed and two-tailed tests.

Hilborn, R. (1997). Statistical essay - statistical hypothesis testing and decision theory in fisheries science.

Hinkle, D.E., Wiersma, W., & Jurs, S.G. (1998).

Hinkley, D.V. (1987). Comment.

Hirsch, L.S., & O'Donnell, A.M. (2001). Representativeness in Statistical Reasoning: Identifying and Assessing Misconceptions. Journal of Statistics Education,

Hoc, J.-M. (1975 - Notes sur l'analyse de la variance et l'inférence fiduciaire [Notes on analysis of variance and fiducial inference].

Hoc, J.-M. (1983).

Hoc, J.-M. (1996). Operator expertise and verbal reports on temporal data.

Hoc, J.-M., & Leplat, J. (1983). Evaluation of different modalities of verbalization in a sorting task.

Hodges, J.L., & Lehmann, E.L. (1954). Testing the approximate validity of statistical hypotheses.

Hoenig, J.M., & Heisey, D.M. (2001). The abuse of power: The pervasive fallacy of power calculation for data analysis.

Hogben, L. (1957).

Hogben, L.T. (1957).

Holland, P. W. (1986). Statistics and causal inference.

Holmes, C.B. (1979). Sample size in psychological research.

Holmes, C.B., Kixmiller, J.S., & Larsen, R.K. (1989). Statistical versus clinical significance in research with the MMPI.

Howard, J. (1999). The 2x2 table. A discussion from a Bayesian viewpoint.

Howard, G.S., Maxwell, S.E., & Fleming, K.J. (2000). The proof of the pudding: An illustration of the relative strengths of null hypothesis, meta-analysis, and Bayesian analysis.

Howell, D.C. (1997).

Howlin, P. (1997). When is a significant change not significant?

Howson, C., & Urbach, P.M. (1993).

Hoyle, R. H. (Ed.). (1999).

Hresko, W. (2000). Editorial policy.

Hubbard, M. (1995). The earth is highly significantly round (

Hubbard, R., & Ryan, P.A. (2000). The historical growth of statistical significance testing in psychology - and its future prospects.

Huberty, C.J. (1987). On statistical testing.

Huberty, C. J (1989). Problems with stepwise methods - better alternatives.

Huberty, C. J (1993). Historical origins of statistical testing practices: The treatment of Fisher versus Neyman-Pearson views in textbooks.

Huberty, C.J & Lowman, L.L. (2000). Group overlap as a basis for effect size.

Huberty, C.J. & Pike, C.J. (1989). On some history regarding statistical testing.

Huelsenbeck, J.P., & Rannala, B. (1997). Phylogenetic methods come of age: testing hypotheses in an evolutionary context.

Hugdahl, K., & Ost, L. (1981). On the difference between statistical and clinical significance.

Hughes, M.D. (1993). Reporting Bayesian analyses of clinical trials.

Hunter, J.E. (1997). Needed: a ban on the significance test.

Hunter, J.S. (1990). Commentary.

Hunter, M.A. (1990). Commentary.

Hunter, M.A., & May, R.B. (2003). Statistical testing and null distributions: What to do when samples are not random.

Hurlburt, R.T. (1998).

Hyde, J.S. (2001). Reporting effect sizes: The roles of editors, textbook authors, and publication manuals.

**I**

International Committee of Medical Journal Editors (1991). Uniform requirements for manuscripts submitted to biomedical journals [Special report].

Iversen, G.R. (1998). Student perceptions of Bayesian statistics. In L. Pereira-Mendoza, L. Seu Kea, T. Wee Kee, & W. K. Wong (Eds.) (1998),

Iversen, G.R. (2000). Why should we even teach statistics? A Bayesian perspective. IASE Round Table Conference on Training Researchers in the Use of Statistics, The Institute of Statistical Mathematics, Tokyo, 7-11 August, 2000. [http://www.statlit.org/PDF/2000IversenIASE.pdf]

**J**

Jacobson, N.S., Follette, W.C., & Revenstorf, D. (1984). Psychotherapy outcome research: Methods for reporting variability and evaluating clinical significance. *Behavior Therapy*, *15*, 336-352.

Jacobson, N.S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. *Journal of Consulting and Clinical Psychology*, *59*, 12-19.

Jamart, J. (1992). Statistical tests in medical research. *Acta Oncologica*, *31*, 723-727.

Jaynes, E.T. (1968). Prior probabilities. *IEEE Transactions on Systems Science and Cybernetics*, *4*, 227-241.

Jaynes, E.T. (1976). Confidence intervals vs Bayesian intervals. In W.L. Harper & C.A. Hooker (Eds.), *Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, Vol. 2*, Dordrecht, Netherlands: D. Reidel, 175-257.

Jaynes, E.T. (1983). *Papers on Probability, Statistics, and Statistical Physics*. R.D. Rosenkrantz (Ed.), Dordrecht, Netherlands: D. Reidel.

Jaynes, E.T. (1984). The intuitive inadequacy of Classical statistics [with discussion]. *Epistemologia*, *VII*, 43-73.

Jaynes, E.T. (1985). Where do we go from here?

Jaynes, E.T. (2003).

Jefferys, W.H. (1990). Bayesian Analysis of Random Event Generator Data.

Jefferys, W.H. (1992). Response to Dobyns.

Jefferys, W.H. (1995). On

Jefferys, W.H. (1995). Further comments on

Jeffreys, H. (1961).

Jegerski, J.A. (1990). Replication in behavioral research.

John, I. D. (1992). Statistics as rhetoric in psychology.

Johns, D., & Andersen, J.S. (1990). Use of predictive probabilities in phase II and phase III clinical trials.

Johnson, D.H. (1995). Statistical sirens: the allure of nonparametrics.

Johnson, D.H. (1998). Hypothesis Testing: Statistics as Pseudoscience. Paper presented at the Fifth Annual Conference of the Wildlife Society, Buffalo, New York, 26 September 1998.

Johnson, D.H. (1999). The insignificance of statistical significance testing.

Johnstone, D.J. (1986). Tests of significance in theory and practice [With discussion].

Johnstone, D.J. (1987). Tests of significance following R.A. Fisher.

Johnstone, D.J. (1987). On the interpretation of hypothesis tests following Neyman and Pearson.

Johnstone, D.J. (1988). Comments on Oakes on the foundations of statistical inference in the social and behavioral sciences: The market for statistical significance.

Johnstone, D.J. (1988). Hypothesis tests and confidence intervals in the single case.

Johnstone, D.J. (1994). A statistical paradox in auditing.

Johnstone, D.J. (1995). Statistically incoherent hypothesis testing in auditing.

Joint Committee on Standards for Educational Evaluation (1994).

Johnstone, D.J., & Lindley, D.V. (1995). Bayesian inference given data 'significant at alpha': Tests of point hypotheses.

Jones, B., Jarvis, P., Lewis, J.A., & Ebbutt, A. F. (1996). Trials to assess equivalence: the importance of rigorous methods.

Jones, B.J., & Brewer, J.K. (1972). An analysis of the power of statistical tests reported in the Research Quarterly.

Jones, D. (1984). Use, misuse, and role of multiple-comparison procedures in ecological and agricultural entomology.

Jones, D., & Matloff, N. (1986). Statistical hypothesis testing in biology: A contradiction in terms.

Jones, L.V. (1952). Tests of hypotheses: One-sided

Jones, L.V. (1954). A rejoinder on one-tailed tests.

Jones, L.V. (1955). Statistics and research design.

Jones, L.V., & Tukey, J.W. (2000). A sensible formulation of the significance test.

Journal of Experimental Education (1993). Special Issue - "

Judd, C. M., & McClelland, G. H. (1989).

Judd, C. M., McClelland, G. H., & Culhane, S. E. (1995). Data analysis: Continuing issues in the everyday analysis of psychological data.

**K**

Kadane, J.B. (1995). Prime time for Bayes.

Kadane, J.B. (1996).

Kadane, J.B., & Sedransk, N. (1980). Toward a more ethical clinical trial [With discussion].

Kahneman, D., & Tversky, A. (1972). Subjective probability: A judgement of representativeness.

Kahneman, D., Slovic, P., & Tversky, A. (1982).

Kaiser, H.F. (1960). Directional statistical decision.

Kalbfleisch, J.G., & Sprott, D.A. (1976). On tests of significance.

Kass, R.E., & Raftery, A.E. (1995). Bayes factors.

Katzer, J., & Sodt, J. (1973). An analysis of the use of statistical testing in communication research.

Kaufman, A.S. (1998). Introduction to the special issue on statistical significance testing.

Kazdin, A.E., & Bass, D. (1989). Power to detect differences between alternative treatments in comparative psychotherapy outcome research.

Kelbaek, H. S., Gjorup, T., & Hilden, J. (1990). Confidence intervals instead of

Kempthorne, O. (1966). Some aspects of experimental inference.

Kempthorne, O. (1971). Probability, statistics, and the knowledge business.

Kempthorne, O. (1976). Of what use are tests of significance and tests of hypotheses.

Kendall, P. (1957). Note on significance tests. Appendix C

Kendall, P.C. (1997). Editorial.

Kendall, P.C., Marrs-Garcia, A., Nath, S.R., & Sheldrick, R.C. (1999). Normative comparisons for the evaluation of clinical significance.

Keren, G., & Lewis, C. (Eds.) (1993). A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues, Hillsdale, N.J.: Erlbaum.

Kerlinger, F.N. (1979).

Keselman, H.J., Huberty, C.J., Lix, L.M., Olejnik, S., Cribbie, R., Donahue, B., Kowalchuk, R.K., Lowman, L.L., Petoskey, M.D., Keselman, J.C., & Levin, J.R. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses.

Keuzenkamp, H.A., & Barten, A.P. (1995). Rejection without falsification, on the history of testing the homogeneity condition in the theory of consumer demand.

Keuzenkamp, H.A., & Magnus, J.R. (1995). On tests and significance in econometrics.

Keynes, J.M. (1921).

Kieffer, K.M., Reese, R.J., & Thompson, B. (2000). Statistical techniques employed in

Killeen, P.R. (2005a). An alternative to null-hypothesis significance tests.

Killeen, P.R. (2005b). Replicability, Confidence, and Priors.

Killeen, P.R. (2005c). Tea-tests.

Killeen, P.R. (2006). The problem with Bayes.

Killeen, P.R. (2006). Beyond statistical inference: A decision theory for science.

Killeen, P.R. (2007). The probability of replication: Its logic, justification, and calculation.

King, D.S. (1985). Statistical power of the controlled research on wheat gluten and schizophrenia.

Kirk, R.E. (1995).

Kirk, R.E. (1996). Practical significance: A concept whose time has come.

Kirk, R. E. (2001). Promoting good statistical practices: Some suggestions.

Kish, L. (1959). Some statistical problems in research design.

Klayman, J., & Ha, Y. (1987). Confirmation, disconfirmation, and information in hypothesis testing.

Kleiter, G.D. (1969). Krise der Signifikanztests in der Psychologie [Crisis of significance tests in psychology].

Kluger, A.N., & Tikochinsky, J. (2001). The error of accepting the "theoretical" null hypothesis: The rise, fall, and resurrection of commonsense hypotheses in psychology.

Knapp, T. R. (1978). Canonical correlation analysis: A general parametric significance testing system.

Knapp, T. R. (1998). Comments on the statistical significance testing articles.

Knapp, T.R., & Sawilowsky, S.S. (2001). Constructive criticisms of methodological and editorial practices.

Kotrlik, J.W. (2000). Guidelines for authors.

Kraemer, H.C. (1983). Theory of estimation and testing of effect sizes: Use in meta-analysis.

Kraemer, H.C. (1998). Statistical significance: A statistician's view. Behavioral and Brain Sciences,

Kraemer, H.C., & Thiemann, S. (1987).

Krantz, D.H. (1999). The null hypothesis testing controversy.

Krebs, C. J. (1989).

Kroll, R.M., & Chase, L.J. (1975). Communication disorders: A power analytic assessment of recent research.

Krueger, J. (1998). Theoretical Progress Requires Refined Methods and Then Some Social Bias. Reply to Ruscio and McCauley on Krueger on Social-Bias.

Krueger, J. (1998). The bet on bias: A foregone conclusion?

Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method.

Krueger, J., & Funder, D.C. (2001). Towards a Positive Social Psychology: Causes, Consequences and Cures for the Problem-seeking Approach to Social Behavior and Cognition [http://www.brown.edu/Departments/Psychology/faculty/krueger.html].

Kruskal, W.H. (1978). Significance, Tests of.

Kruskal, W.H. (1980). The significance of Fisher: A review of R.A. Fisher, the life of a scientist.

Kruskal, W.H., & Majors, R. (1989). Concepts of relative importance in recent scientific literature.

Kupfersmid, J. (1988). Improving what is published: a model in search of an editor.

Kwan, E., & Friendly, M. (2004). Strong versus weak significance tests and the role of meta-analytic procedures. [Discussion of D.J. Denis' paper, The modern hypothesis testing hybrid: R. A. Fisher's fading influence].

Kyburg, H.E., Jr. (1974).

**L**

Ladiray, D. (2002). Conjoncture, statistique et économétrie [Short-term economic analysis, statistics and econometrics].

LaForge, R. (1967). Confidence intervals or tests of significance in scientific research?

Lang, J.M., Rothman, K.J., & Cann, C.I. (1998). That confounded

Langholtz, B., Witte, J.S., & Duncan, T.C. (1995). Re: "Statistical significance testing in the American Journal of Epidemiology, 1970-1990".

Langer, E.J. (1997).

Langman, M.J.S. (1986). Towards estimation and confidence intervals.

Laplace, P.-S. (1986/1825).

Leamer, E.E. (1978).

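Lecoutre, B. (1978). Note sur le calcul de la distribution fiduciaire pour une inférence sur un contraste entre moyennes [Note on computing the fiducial distribution for inference about a contrast between means].

Lecoutre, B. (1981). Extensions de l'analyse de la variance: L'analyse bayésienne des comparaisons [Extensions of the analysis of variance: Bayesian analysis of comparisons].

Lecoutre, B. (1981). Procédures fiducio-bayésiennes pour l'investigation des mécanismes individuels en psychologie [Fiducial Bayesian procedures for investigating individual mechanisms in psychology].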

Lecoutre, B. (1984).

Lecoutre, B. (1984). Réinterprétation fiducio-bayésienne du test F de l'analyse de la variance [Fiducial Bayesian reinterpretation of the analysis-of-variance F test].

Lecoutre, B. (1985). How to derive Bayes-fiducial conclusions from usual significance tests.

Lecoutre, B. (1985). Reconsideration of the F test of the analysis of variance: The semi-Bayesian significance tests.

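Lecoutre, B. (1986). Méthodes bayésiennes en Analyse des Comparaisons: Inférences pour des variables numériques [Bayesian methods in the Analysis of Comparisons: Inferences for numerical variables].

Lecoutre, B. (1987). Les procédures fiducio-bayésiennes et semi-bayésiennes dans les études de généralisabilité [Fiducial Bayesian and semi-Bayesian procedures in generalizability studies].

Lecoutre, B. (1988). L'analyse des comparaisons à plusieurs degrés de liberté comme prolongement des procédures élémentaires [The analysis of comparisons with several degrees of freedom as an extension of elementary procedures].

Lecoutre, B. (1991). A correction for the ε̃ approximate test in repeated measures designs with two or more independent groups.

Lecoutre, B. (1994). Inférence statistique et raisonnement inductif [Statistical inference and inductive reasoning].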

Lecoutre, B. (1996).

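Lecoutre, B. (1996). Au delà du test de signification ou l'inférence statistique sans tables (à la suite d'Alain Morineau) [Beyond the significance test, or statistical inference without tables (following Alain Morineau)].

Lecoutre, B. (1997). Et si vous étiez un bayésien "qui s'ignore"? [And if you were a Bayesian without knowing it?]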

Lecoutre, B. (1998). Teaching Bayesian methods for experimental data analysis.

Lecoutre, B. (1999). Two useful distributions for Bayesian predictive procedures under normal models.

Lecoutre, B. (1999). Beyond the significance test controversy: Prime time for Bayes?

Lecoutre, B. (2000). From significance tests to fiducial Bayesian inference.

Lecoutre, B. (2001). Bayesian predictive procedure for designing and monitoring experiments.

Lecoutre, B. (2004). Expérimentation, inférence statistique et analyse causale.

Lecoutre, B. (2005). Former les étudiants et les chercheurs aux méthodes bayésiennes pour l'analyse des données expérimentales.

Lecoutre, B. (2006). How to get 1-alpha confidence level from 1-2alpha confidence intervals. Submitted for publication.

Lecoutre, B. (2006). And if you were a Bayesian without knowing it?

Lecoutre, B. (2006). Training students and researchers in Bayesian methods for experimental data analysis.

Lecoutre, B. (2007a). Another look at confidence intervals for the noncentral

Lecoutre, B. (2008). The Bayesian approach to experimental data analysis.

Lecoutre, B., & Charron, C. (2000). Bayesian procedures for prediction analysis of implication hypotheses in 2x2 contingency tables.

Lecoutre, B., & Derzko, G. (2001). Asserting the smallness of effects in ANOVA.

Lecoutre, B., Derzko, G., & Grouin, J.-M. (1995). Bayesian predictive approach for inference about proportions.

Lecoutre, B., & ElQasyr, K. (2005). Play-the-winner rule in clinical trials: Models for adaptative designs and Bayesian methods.

Lecoutre, B., & ElQasyr, K. (2008). Adaptative designs for multi-arm clinical trials: The play-the-winner rule revisited.

Lecoutre, B., & Killeen, P. (2010). Replication is not coincidence: Reply to Iverson, Lee, and Wagenmakers (2009).

Lecoutre, B., & Lecoutre, M.-P. (1979). A propos d'une expérience d'apprentissage perceptif incident: Quelques aspects de la démarche d'analyse des données et méthodes fiduciaires.

Lecoutre, B., Lecoutre, M.-P., & Grouin, J.-M. (2001). A challenge for statistical instructors: Teaching Bayesian inference without discarding the "official" significance tests.

Lecoutre, B., Lecoutre, M.-P., & Poitevineau, J. (2001). Uses, abuses and misuses of significance tests in the scientific community: Won't the Bayesian choice be unavoidable?

Lecoutre, B., Lecoutre, M.-P., & Poitevineau, J. (2010). Killeen’s probability of replication and predictive probabilities: How to compute, use and interpret them.

Lecoutre, B., Mabika, B., & Derzko, G. (2002). Assessment and monitoring in clinical trials when survival curves have distinct shapes in two groups: a Bayesian approach with Weibull modeling illustrated.

Lecoutre, B., & Poitevineau, J. (1992). PAC (

Lecoutre, B., & Poitevineau, J. (2000). Aller au delà des tests de signification traditionnels: Vers de nouvelles normes de publication.

Lecoutre, B., Poitevineau, J., Derzko, G., & Grouin, J.-M. (2000). Désirabilité et faisabilité des méthodes bayésiennes en analyse de variance: application à des plans d'expérience complexes utilisés dans les essais cliniques.

Lecoutre, B., Poitevineau, J., & Lecoutre, M.-P. (2004). Fisher: Responsible, not guilty. Discussion of D.J. Denis' paper, The modern hypothesis testing hybrid: R. A. Fisher's fading influence.

Lecoutre, B., Poitevineau, J., & Lecoutre, M.-P. (2005). Une raison pour ne pas abandonner les tests de signification de l'hypothèse nulle.

Lecoutre, B., & Rouanet, H. (1981). Deux structures statistiques fondamentales en analyse de la variance univariée et multivariée.

Lecoutre, B., Rouanet, H., & Denhière, G. (1988). L'inférence statistique comme instrument de validation de modèles.

Lecoutre, M.-P. (1982). Comportement des chercheurs dans des situations conflictuelles d'analyse des données expérimentales.

Lecoutre, M.-P. (1983). La démarche du chercheur en psychologie dans des situations d'analyse statistique de données expérimentales.

Lecoutre, M.-P. (1992). Cognitive models and problem spaces in "purely random" situations.

Lecoutre, M.-P. (2000). And... what about the researcher's point of view.

Lecoutre, M.-P., Clément, E., & Lecoutre, B. (2004). Failure to construct and transfer correct representations across probability problems.

Lecoutre, M.-P., & Lecoutre, B. (2001). Reaction on Research in statistical education: Some priority questions by Batanero, Garfield, Ottaviani, Truran.

Lecoutre, M.-P., Poitevineau, J., & Lecoutre, B. (1999). An experimental study of the uses and misuses of null hypothesis significance tests among psychologists and statisticians.

Lecoutre, M.-P., Poitevineau, J., & Lecoutre, B. (2003). Even statisticians are not immune to misinterpretations of Null Hypothesis Significance Tests.

Lecoutre, M.-P., & Rouanet, H. (1993). Predictive judgments in situations of statistical analysis.

Lee, P. (1989).

Lee, M.D., & Wagenmakers, E.J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003).

Lehmann, E.L. (1958). Significance level and power.

Lehmann, E.L. (1986).

Lehmann, E.L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: one theory or two?

Lépine, D., & Rouanet, H. (1975). Introduction aux méthodes fiduciaires: inférence sur un contraste entre moyennes.

Levenson, R.L. (1980). Statistical power analysis: Implications for researchers, planners, and practitioners in gerontology.

Levin, J.R. (1967). Misinterpreting the significance of "explained variation".

Levin, J.R. (1998). To test or not to test H0?

Levin, J.R. (1998). What if there were no more bickering about statistical significance tests?

Levin, J.R., & Robinson, D.H. (1999). Further reflections on hypothesis testing and editorial policy for primary research journals.

Levin, J.R., & Robinson, D.H. (2000). Statistical hypothesis testing, effect size estimation, and the conclusion coherence of individual empirical studies (Rejoinder).

Levine, T.R., & Hullett, C.R. (2002). Eta squared, partial eta squared, and misreporting of effect size in communication research.

Levy, P. (1967). Substantive significance of significant differences between two groups.

Lewis, C. (1993). Bayesian methods for the analysis of variance.

Lewis, D., & Burke, C.J. (1949). The use and misuse of the chi-square test.

Lhoste, E. (1923). Le calcul des probabilités appliqué à l'artillerie.

Li, Y., & Krantz, D.H. (1996). Overconfidence and the goals of interval estimation.

Lick, J. (1973). Statistical vs. clinical significance in research on the outcome of psychotherapy.

Lilford, R.J., & Braunholtz, D. (1996). For debate: The statistical basis of public policy: A paradigm shift is overdue.

Lindgren, B.R., Wielinski, C.L., & Finkelstein, S.M. (1994). Contrasting clinical and statistical significance within the research setting.

Lindley, D.V. (1957). A statistical paradox.

Lindley, D. V. (1972).

Lindley, D.V. (1986). Discussion.

Lindley, D.V. (1993). The analysis of experimental data: The appreciation of tea and wine.

Lindley, D.V. (1998). Decision analysis and bioequivalence trials.

Lindley, D.V., & Phillips, L.D. (1976). Inference for a Bernoulli process (a Bayesian view).

Lindsay, R. M., & Ehrenberg, A. S. C. (1993). The design of replicated studies.

Lindsay, R. M. (1995). Reconsidering the status of tests of significance: An alternative criterion of adequacy.

Lindsey, J.K. (1999). Some statistical heresies.

Lindley, D.V. (2000). The philosophy of statistics.

Lipset, S.M., Trow, M.A., & Coleman, J.S. (1956). Statistical problems, Appendix I-B

Lipsey, M.W. (1990).

Lipsey, M.W. (1988). Practice and malpractice in evaluation research.

Lipsey, M.W., & Wilson, D.B. (1993). The efficacy of psychological, educational, and behavioral treatment.

Little, J. (2001). Understanding statistical significance: a conceptual history.

Little, T.M. (1981). Interpretation and presentation of results.

Locascio, J.J. (1999). Significance tests and 'results-blindness'.

Loebbecke, J.K. (1995). On the use of Bayesian statistics in the audit process.

Loftus, G.R. (1991). On the tyranny of hypothesis testing in the social sciences.

Loftus, G.R. (1993). A picture is worth a thousand

Loftus, G.R. (1993). Editorial comment.

Loftus, G.R. (1996). Psychology will be a much better science when we change the way we analyze data.

Loftus, G.R. (2002). Analysis, interpretation, and visual presentation of experimental data.

Loftus, G.R., & Masson, M.E.J. (1994). Using confidence intervals in within-subject designs.

Loredo, T.J. (1990). From Laplace to Supernova SN 1987A: Bayesian inference in astrophysics.

Lovie, A.D. (1979). The analysis of variance in experimental psychology: 1934-1945.

Luce, R. D. (1988). The tools-to-theory hypothesis. Review of G. Gigerenzer and D. J. Murray, "Cognition as intuitive statistics".

Ludbrook, J. (2000). Multiple inferences using confidence intervals.

Ludbrook, J., & Dudley, H. (1998). Why permutation tests are superior to t and F tests in biomedical research.

Lunt, P.K., & Livingstone, S.M. (1989). Psychology and statistics: Testing the opposite of the idea you first thought out. The

Lutz, W., & Nimmo, I.A. (1977). The inadequacy of statistical significance [Editorial].

Lykken, D. (1968). Statistical significance in psychological research.

**M**

Macdonald, R.R. (2005). Why replication probabilities depend on prior probability distributions: A rejoinder to Killeen (2005).

MacKenzie, B. (1976). Darwinism and positivism as methodological influences on the development of psychology.

MacRae, A.W. (1995). Statistics in alpha-level psychology: A suitable case for treatment.

Maddock, J.E., & Rossi, J.S. (2001). Statistical power of articles published in three health-psychology related journals.

Maindonald, J.H., & Cox, N.R. (1984). Use of statistical evidence in some recent issues of DSIR agricultural journals.

Mainland, D. (1963). The significance of "nonsignificance".

Mainland, D. (1982). Medical statistics - thinking vs. arithmetic.

Mainland, D. (1984). Statistical ritual in clinical journals: Is there a cure? - I.

Mainland, D. (1984). Statistical ritual in clinical journals: Is there a cure? - II.

Makuch, R., & Simon, R. (1978). Sample size requirements for evaluating a conservative therapy.

Malakoff, D. (1999). Bayes offers a 'new' way to make sense of numbers.

Man-Son-Hing, M., Laupacis, A., O'Rourke, K., Molnar, F.J., Mahon, J., Chan, K.B.Y., Wells, G. (2002). Determination of the Clinical Importance of Study Results.

Manzano, V. (1997). Usos y abusos del error de Tipo I.

Maret, T.J. (1997). Statistics and hypothesis testing in biology.

Marks, M.R. (1951). Two kinds of experiment distinguished in terms of statistical operations.

Marks, M.R. (1953). One and two-tailed tests.

Markus, K.A. (2001). The converse inequality argument against tests of statistical significance.

Martin-Löf, P. (1974). The notion of redundancy and its use as a quantitative measure of the discrepancy between a statistical hypothesis and a set of observational data [With discussion].

Martin-Löf, P. (1975). Reply to Sverdrup's polemical article "Tests without power".

Matloff, N.S. (1991). Statistical hypothesis testing: problems and alternatives.

Matthews, R. (1997). Faith, hope and statistics.

Matthews, J.N., & Altman, D.G. (1996). Statistical notes. Interaction 2: compare effect sizes not

Mawera, G. (1996). A proposal for the reporting of

Maxwell, N.P. (1994). A coin-flipping exercise to introduce the

Maxwell, S.E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies.

Mauk, A.-M.K. (2000). A review of confidence intervals. Paper presented at the annual meeting of the Southwest Educational Research Association, Dallas, January 28, 2000. Texas A&M University 77843-4225.

May, K. (2003). A note on the use of confidence intervals.

Mayo, D. (1981). Testing statistical testing.

Mayo, D.G. (1983). An objective theory of statistical testing.

Mayo, D.G. (1985). Behavioristic, evidentialistic, and learning models of statistical testing.

Mazen, A.M., Hemmasi, M., & Lewis, M.E. (1987). Assessment of statistical power in contemporary strategy research.

McBride, G.B., Loftis, J.C., & Adkins, N.C. (1993). What do significance tests really tell us about the environment?

McCall, R.B. (1975).

McCloskey, D.N. (1985). The loss function has been mislaid: the rhetoric of significance tests.

McCloskey, D. N. (1985).

McCloskey, D.N. (1995). The insignificance of statistical significance.

McCloskey, D.N., & Ziliak, S.T. (1996). The standard error of regressions.

McClure, J., & Suen, H.K. (1994). Interpretation of statistical significance testing: a matter of perspective.

McDonald, R.P. (1997). Goodness of approximation in the linear model.

McGinnis, R. (1958). Randomization and inference in sociological research.

McGrath, R. E. (1998). Significance testing: Is there something better?

McGraw, K.O. (1991). Problems with the BESD: A comment on Rosenthal's "How are we doing in soft psychology" [Comment].

McGraw, K.O. (1995). Determining false alarm rates in null hypothesis testing research.

McGraw, K.O., & Wong, S.P. (1992). A common language effect size statistic.

McLean, J.E. (2001). On the nature and role of hypothesis tests. Working Paper 4/2001, Monash University, Australia.

McLean, J.E., & Ernest, J.M. (1998). The role of statistical significance testing in educational research.

McLean, J.E., & Kaufman, A.S. (Eds.) (1998). Statistical significance testing [Special Issue].

McLean, J.E., & Kaufman, A.S. (1998). The role of statistical significance testing in educational research.

McLean, J.E., & Kaufman, A.S. (2000). Editorial: Statistical significance testing and other changes to

McNemar, Q. (1960). At random: Sense and nonsense.

Meehl, P.E. (1967). Theory testing in psychology and physics: A methodological paradox.

Meehl, P.E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology.

Meehl, P.E. (1990). Appraising and amending theories: the strategy of Lakatosian defense and two principles that warrant it.

Meehl, P.E. (1990). Why summaries of research on psychological theories are often uninterpretable.

Meehl, P.E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions.

Meeks, S.L. & D'Agostino, R.B. (1983). A note on the use of confidence limits following rejection of a null hypothesis.

Melton, A.W. (1962). Editorial.

Mendoza, J.L., & Stafford, K.L. (2001). Confidence intervals, power calculation, and sample size estimation for the squared multiple correlation coefficient under the fixed and random regression models: A computer program and useful standard tables.

Menon, R. (1993). Statistical significance testing should be discontinued in mathematics education research.

Metzler, C.M. (1979). Bioavailability - A problem in equivalence.

Mialaret, G. (1996).

Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures.

Milleville-Pennel, I., Hoc, J.-M., & Elise, J. (2007). The use of hazard road signs to improve the perception of severe bends.

Minturn, E.B. (1971). A proposal of significance.

Mittag, K.C., & Thompson, B. (2000). A national survey of AERA members' perceptions of statistical significance tests and other statistical issues.

Moher, D., Dulberg, C.S., & Wells, G.A. (1994). Statistical power, sample size, and their reporting in randomized controlled trials.

Mohr, L.B. (1990).

Mood, A.M., Graybill, F.A., & Boes, D.C. (1974).

Moore, G.E. (1992). The significance of research in vocational education: The 1992 AVERA presidential address.

Moore, D.S. (1995).

Moore, D.S. (1997). Bayes for Beginners? Some pedagogical questions.

Moore, D.S. (1997). Bayes for Beginners? Some reasons to hesitate.

Moore, D.S. (1997). New pedagogy and new content: The case of statistics.

Moore, D. S., & McCabe, G.P. (1993).

Morgan, P.L. (2003). Null Hypothesis Significance Testing: Philosophical and practical considerations of a statistical controversy.

Morris, S.B., & Lobsenz, R.E. (2000). Significance tests and confidence intervals for the adverse impact ratio.

Morrison, D.E., & Henkel, R.E. (1969). Significance tests reconsidered.

Morrison, D.E., & Henkel, R.E. (1970). Significance tests in behavioral research: skeptical conclusions and beyond.

Morrison, D.E., & Henkel, R.E. (Eds.) (1970).

Morrison, G.R., & Weaver, B. (1995). Exactly how many p values is a picture worth? A commentary on Loftus's plot-plus-error-bar approach.

Morrow, G.R. (1980). Clinical trials in psychosocial medicine: methodologic and statistical considerations.

Morrow, G.R., Black, P.M., & Dudgeon, D.J. (1991). Advances in data assessment - application to the etiology of nausea reported during chemotherapy, concerns about significance testing, and opportunities in clinical trials.

Morse, D.T. (1998). MINSIZE: A computer program for obtaining minimum sample size as an indicator of effect size.

Moses, L.E. (1992). The reasoning of statistical inference.

Mulaik, S.A., Raju, N.S., & Harshman, R.A. (1997). There is a time and a place for significance testing.

Murphy, K.R. (1990). If the null hypothesis is impossible, why test it?

Murphy, K.R. (1997). Editorial.

Murphy, K.R., & Myors, B. (1998).

Murphy, K.R., & Myors, B. (1999). Testing the hypothesis that treatments have negligible effects: Minimum-effect tests in the general linear model.

Murray, G.D. (1991). Statistical aspects of research methodology.

Murray, L.R. (1995). Reconsidering the status of tests of significance: An alternative criterion of adequacy.

Murray, L.W., & Dosser, D.A., Jr. (1987). How significant is a significant difference?

**N**

Nelder, J.A. (1971). Discussion on papers by Wynn, Bloomfield, O'Neill and Wetherill.

Nelder, J.A. (1985). Discussion of Dr Chatfield's paper.

Nelson, N., Rosenthal, R., & Rosnow, R.L. (1986). Interpretation of significance levels and effect sizes by psychological researchers.

Nester, M.R. (1996). An applied statistician's creed.

Neyman, J. (1935). Sur la vérification des hypothèses statistiques composées.

Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability.

Neyman, J. (1938). L'estimation statistique, traitée comme un problème classique des probabilités.

Neyman, J. (1941). Fiducial argument and the theory of confidence intervals.

Neyman, J. (1942). Basic ideas and some recent results of the theory of testing statistical hypotheses.

Neyman, J. (1950).

Neyman, J. (1952).

Neyman, J. (1957). "Inductive behavior" as a basic concept of philosophy of science.

Neyman, J. (1958). The use of the concept of power in agricultural experimentation.

Neyman, J. (1962). Two breakthroughs in the theory of statistical decision making.

Neyman, J., & Pearson, E.S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Part I.

Neyman, J., & Pearson, E.S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Part II.

Neyman, J., & Pearson, E.S. (1933). On the problem of the most efficient tests of statistical hypotheses.

Neyman, J., & Pearson, E.S. (1933). The testing of statistical hypotheses in relation to probabilities a priori.

Nickerson, R. S. (2000). Null Hypothesis Significance Testing: A Review of an Old and Continuing Controversy.

Nisbett, R., & Ross, L. (1981).

Nix, T.W., & Barnette, J.J. (1999). The data analysis dilemma: Ban or abandon. A review of null hypothesis significance testing.

Novick, M.R., & Jackson, P.H. (1974).

Nunnally, J.C. (1960). The place of statistics in psychology.

Nunnally, J.C. (1975).

**O**

O'Brien, T.C., & Shapiro, B.J. (1968). Statistical significance - What?

Ogles, B.M., Lunnen, K.M., & Bonesteel, K. (2001). Clinical significance: History, application, and current practice.

O'Grady, K.E. (1982). Measures of explained variance: cautions and limitations.

O'Hagan, T. (1996).

Olejnik, S., & Algina, J. (2000). Measures of effect size for comparative studies: Applications, interpretations, and limitations.

Olson, C.L. (1976). On choosing a test statistic in multivariate analysis of variance.

Orme, J.G., & Combs-Orme, T. (1986). Statistical power and type II errors in social work research.

Orme, J.G., & Tolman, R.M. (1986). The statistical power of a decade of social work education research.

O'Rourke, K. (1996). Two Cheers for Bayes [Letters to the Editors].

Ottenbacher, K. (1982). Statistical power and research in occupational therapy.

Ottenbacher, K.J. (1992). Practical significance in early intervention research: from affect to empirical effect.

Ottenbacher, K.J. (1995). Why rehabilitation research does not work (as well as we think it should).

Owen, A.R.G. (1962). An appreciation of the life and work of Sir Ronald Aylmer Fisher: F.R.S., F.S.S., Sc.D.

**P**

Pagano, M., & Leviton, A. (1990). File drawers,

Panicker, S. (2000). Narrow and shallow,

Parker, R.M., & Szymanski, E.M. (1999). Recommendations of the APA task force on statistical inference.

Parker, S. (1995). The "difference of means" may not be the "effect size" [Comment].

Parkhurst, D.F. (1985). Interpreting failure to reject a null hypothesis.

Parkhurst, D. (1990). Statistical hypothesis tests and statistical power in pure and applied science.

Pascual, J., Frías, Ma.D., & Garcia, J.F. (2000). El procedimiento de significación estadística (NHST): Su trayectoria y actualidad.

Pascual, J., Garcia, J.F., & Frías, Ma.D. (2000). Significación estadística, importancia del efecto y replicabilidad de los datos [Statistical significance and replicability of the data].

Pasquet, P., Monneuse, M.-O., Simmen, B., Marez, A., & Hladik, C.-M. (2006). Relationship between taste thresholds and hunger under debate.

Patel, H.I., & Gupta, G.D. (1984). A problem of equivalence in clinical trials.

Pearce, S.C. (1992). Data analysis in agricultural experimentation. II. Some standard contrasts.

Pearce, S.C. (1992). Introduction to Fisher Statistical methods for Research Workers.

Pearson, E. (1938). 'Student' as a statistician.

Pearson, E.S. (1947). The choice of statistical tests illustrated on the interpretation of data classed in a 2x2 table.

Pearson, E.S. (1955). Statistical concepts in their relation to reality.

Pearson, E.S. (1962). Some thoughts on statistical inference.

Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling.

Pearson, K. (1901). On the correlation of characters not quantitatively measurable.

Pearson, K. (1911). Probability that two independent distributions of frequency are really samples from the same population.

Pearson, K. (1935). Statistical tests.

Pedersen, J.G. (1978). Fiducial inference.

Pedhazur, E.J., & Schmelkin, L.P. (1991).

Pennik, J.E., & Brewer, J.K. (1972). The power of statistical tests in science teaching research.

Perlman, M.D., & Wu, L. (1999). The emperor's new tests.

Perneger, T. (1998). What's wrong with Bonferroni adjustments,

Perry, J.N. (1986). Multiple-comparison procedures: a dissenting view.

Peterman, R. M. (1990). The importance of reporting statistical power: the forest decline and acidic deposition example.

Petranka, J. W. (1990). Caught between a rock and a hard place.

Phillips, L.D. (1973).

Phillips, L.L., Jr. (1988).

Piantadosi, S., Saijo, N., & Tamura, T. (1993). Guidelines for analysis and reporting of clinical trials in oncology.

Pierce, D.A. (1999). On the relation between frequency inference and likelihood.

Pitman, E.J.G. (1937). Significance tests which may be applied to samples from any populations.

Platt, J. R. (1964). Strong inference.

Pocock, S.J., Hughes, M.D., & Lee, R.J. (1987). Statistical problems in the reporting of clinical trials.

Pocock, S.J., Hughes, M.D., & Lee, R.J. (1990). Estimation issues in clinical trials and overviews.

Poitevineau, J. (1998).

Poitevineau, J. (1999). Pratiques des tests statistiques en psychologie cognitive: L'exemple d'une année d'un journal.

Poitevineau, J. (2004). L'usage des tests statistiques par les chercheurs en psychologie: Aspects normatif, descriptif et prescriptif.

Poitevineau, J., & Lecoutre, B. (1998). Some Statistical Misconceptions in Chow's Statistical Significance.

Poitevineau, J., & Lecoutre, B. (2001). The interpretation of significance levels by psychological researchers: The .05-cliff effect may be overstated.

Polanyi, M. (1961). The unaccountable element in science?

Pollard, P. (1993). How significant is "significance".

Pollard, P., & Richardson, J.T.E. (1987). On the probability of making type I errors.

Poole, C. (1987). Beyond the confidence interval.

Pratt, J.W. (1965). Bayesian interpretation of standard inference statements [With discussion].

Pratt, J.W. (1976). A discussion of the question: For what use are tests of hypotheses and tests of significance?

Pratt, J.W. (1977). "Decisions" as statistical evidence and Birnbaum's "confidence concept".

Preece, D.A. (1982). The design and analysis of experiments: what has gone wrong?

Preece, D.A. (1984). Biometry in the Third World: science not ritual.

Preece, D.A. (1990). R. A. Fisher and experimental design: a review.

Prentice, D.A., & Miller, D.T. (1992). When small effects are impressive.

Press, S.J. (1989).

Press, W.H. (1989). Understanding data better with Bayesian and global statistical methods.

Pruzek, R.M. (1997). An introduction to Bayesian inference and its applications.

**Q**

Quinn, J.F., & Dunham, A.E. (1983). On hypothesis testing in ecology and evolution.

**R**

Ramp, W.K., & Yancey, J.M. (1991).

Ranstam, J. (1996). A common misconception about

Reckhow, K.H., Clements, J.T., & Dodd, R.C. (1990). Statistical evaluation of mechanistic water-quality models.

Reichardt, C.S., & Gollob, H.F. (1997). When confidence intervals should be used instead of statistical significance tests, and vice versa.

Reiser, B. (2001). Confidence intervals for the Mahalanobis distance.

Rennie, D. (1978). Vive la différence (

Reuchlin, M. (1962).

Reuchlin, M. (1977). Epreuves d'hypothèses nulles et inférence fiduciaire en psychologie.

Reuchlin, M. (1992).

Reeves, C.A., & Brewer, J.K. (1980). Hypothesis testing and proof by contradiction: An analogy.

Reynolds, R. (1969). Replication and substantive support: A critique on the use of statistical inference in social research.

Richard, F.D., Bond, C.F., Jr., & Stokes-Zoota, J.J. (2003). One hundred years of social psychology quantitatively described.

Richardson, J.T.E. (1996). Measures of effect size.

Rindskopf, D. (1997). Testing "small", not null, hypotheses: Classical and Bayesian approaches.

Rindskopf, D. (1998). Null-hypothesis tests are not completely stupid, but Bayesian statistics are better.

Robert, C.P. (1994).

Robert, Cl. (1995).

Robert, M. (1994). Stratégies méthodologiques.

Roberts, H.V. (1976). For what use are tests of hypotheses and tests of significance.

Roberts, H.V. (1990). Applications in business and economic statistics: some personal views.

Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing.

Robey, R.R. (2004). Reporting point and interval estimates of effect-size for planned contrasts: fixed within effect analyses of variance [Tutorial].

Robinson, D.H., Fouladi, R.T., Williams, N.J., & Bera, S.J. (2002). Some effects of providing effect size and "what if" information.

Robinson, D.H., & Levin, J.R. (1997). Reflections on statistical and substantive significance, with a slice of replication.

Robinson, D.H., & Wainer, H. (2002). On the past and future of Null Hypothesis Significance Testing.

Rocke, D.M. (1985). Reply to correspondence.

Roebruck, P. (1984). Explorative statistical analysis and the valuation of hypotheses.

Rogers, J.L., Howard, K.I., & Vessey, J. (1993). Using significance tests to evaluate equivalence between two experimental groups.

Ronis, D. L. (1981). Comparing the magnitude of effects in ANOVA designs.

Rorer, L. G. (1991). Some myths of science in psychology. In D. Cicchetti & W.M. Grove (Eds.), Thinking Clearly about Psychology, vol. 1:

Rosenthal, R. (1976).

Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results.

Rosenthal, R. (1983). Assessing the statistical and social importance of the effects of psychotherapy.

Rosenthal, R. (1990). How are we doing in soft psychology?

Rosenthal, R. (1990). Replication in behavioral research.

Rosenthal, R. (1991).

Rosenthal, R. (1991). Effect sizes: Pearson's correlation, its display via the BESD, and alternative indices [Comment].

Rosenthal, R. (1991). Cumulating psychology: An appreciation of Donald T. Campbell.

Rosenthal, R. (1992). Effect size estimation, significance testing, and the file-drawer problem.

Rosenthal, R. (1993). Cumulating evidence. In G. Keren & C. Lewis (Eds.)

Rosenthal, R. (1994). Parametric measures of effect size.

Rosenthal, R., & Gaito, J. (1963). The interpretation of levels of significance by psychological researchers.

Rosenthal, R., & Gaito, J. (1964). Further evidence for the cliff effect in the interpretation of levels of significance.

Rosenthal, R., & Rosnow, R. L. (1985).

Rosenthal, R., Rosnow, R.L., & Rubin, D.B. (1999).

Rosenthal, R., & Rubin, D.B. (1979). A note on percent variance explained as a measure of the importance of effects.

Rosenthal, R., & Rubin, D.B. (1979). Comparing significance levels of independent studies.

Rosenthal, R., & Rubin, D.B. (1982). A simple, general purpose display of magnitude of experimental effects.

Rosenthal, R., & Rubin, D.B. (1985). Statistical analysis: Summarizing evidence versus establishing facts.

Rosenthal, R., & Rubin, D.B. (1994). The counternull value of an effect size: A new statistic.

Rosenthal, R., & Rubin, D.B. (2003). r_equivalent: A simple effect size indicator.

Rosnow, R.L., & Rosenthal, R. (1988). Definition and interpretation of interaction effects.

Rosnow, R.L., & Rosenthal, R. (1988). Focused tests of significance and effect size estimation in counseling psychology.

Rosnow, R.L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science.

Rosnow, R.L., & Rosenthal, R. (1996). Computing contrasts, effect sizes, and counternulls on other people's published data: General procedures for research consumers.

Rosnow, R.L., & Rosenthal, R. (2002).

Rossi, J.S. (1990). Statistical power of psychological research: What have we gained in 20 years?

Rossi, J.S. (1997). A case study in the failure of psychology as a cumulative science: The spontaneous recovery of verbal learning.

Rothman, K.J. (1978). A show of confidence [Editorial].

Rothman, K.J. (1986). Significance questing [Editorial].

Rothman, K.J. (1988). Modern Epidemiology. Boston, MA: Little and Brown.

Rothman, K.J. (1990). No adjustments are needed for multiple comparisons.

Rothman, K.J. (1996). Lessons from John Graunt.

Rothman, K.J., & Greenland, S. (Eds.) (1998).

Rothman, K.J., & Yankauer, A. (1986). Confidence intervals vs significance tests: Quantitative interpretations [Editor's note].

Rouanet, H. (1967).

Rouanet, H. (1986). Modèles en tout genre et pratiques statisticiennes.

Rouanet, H. (1996). Bayesian procedures for assessing importance of effects.

Rouanet, H. (1998). Significance testing in a Bayesian framework: Assessing direction of effects.

Rouanet, H. (2000). Statistics for researchers.

Rouanet, H. (2000). Statistical Practice revisited.

Rouanet, H., Bernard, J.-M., Bert, M.-C., Lecoutre, B., Lecoutre, M.-P., & Le Roux, B. (2000).

Foreword by Patrick Suppes; Rouanet, H. - Statistics for researchers, 1-27; Rouanet, H. - Statistical practice revisited, 29-64; Lecoutre, M.-P. - And... what about the researcher's point of view, 65-95; Rouanet, H., & Bert, M.-C. - Introduction to combinatorial inference, 97-122; Lecoutre, B. - From significance tests to fiducial Bayesian inference, 123-157; Bernard, J.-M. - Bayesian inference for categorized data, 159-226; Rouanet, H., Le Roux, B., Bernard, J.-M., & Lecoutre, B. - Geometric data: From euclidean clouds to Bayesian MANOVA, 227-254.

Rouanet, H., Bernard, J.-M., & Lecoutre, B. (1986). Non-probabilistic statistical inference: A set theoretic approach.

Rouanet, H., Bernard, J.-M., & Le Roux, B. (1990).

Rouanet, H., & Bert, M.-C. (2000). Introduction to combinatorial inference.

Rouanet, H., & Bru, B. (1994). Sur les traces de Victor Henri: Les débuts de l'inférence statistique en psychologie.

Rouanet, H., & Lecoutre, B. (1983). Specific inference in ANOVA: From significance tests to Bayesian procedures.

Rouanet, H., Lecoutre, B., & Bernard, J.-M. (1987). L'inférence fiducio-bayésienne comme méthode d'analyse de données: Un exemple d'application à des données psychométriques.

Rouanet, H., Lecoutre, M.-P., Bert, M.-C., Lecoutre, B., & Bernard, J.-M. (1991).

Rouanet, H., & Lépine, D. (1977). Introduction à l'Analyse des Comparaisons pour le traitement des données expérimentales.

Rouanet, H., Lépine, D., & Holender, D. (1978). Model acceptability and the use of Bayes-fiducial methods for validating models.

Rouanet, H., Lépine, D., & Pelnard-Considère, J. (1976). Bayes-fiducial procedures as practical substitutes for misplaced significance testing: An application to educational data.

Rouanet, H., Le Roux, B., Bernard, J.-M., & Lecoutre, B. (2000). Geometric data: From euclidean clouds to Bayesian MANOVA.

Rouder, J.N., & Morey, R.D. (2005). Relational and Arelational Confidence Intervals: A Comment on Fidler, Thomason, Cumming, Finch, and Leeman (2004).

Rowley, G. (1993). Response to Menon.

Royall, R.M. (1986). The effect of sample size on the meaning of significance tests.

Royall, R.M. (1997).

Royall, R.M. (1999).

Rozeboom, W.W. (1960). The fallacy of the null hypothesis significance test.

Rozeboom, W.W. (1961). Ontological induction and the logical typology of scientific variables.

Rozeboom, W.W. (1972). Scientific inference: The myth and the reality.

Rozeboom, W.W. (1990). Hypothetico-deductivism is a fraud.

Rozeboom, W.W. (1991). Conceptual Rigor: Where is it? [Comment on Chow, 1991a].

Rozeboom, W.W. (1997). Good science is abductive, not hypothetico-deductive.

Rozencwajg, P., & Corroyer, D. (2005). Cognitive Processes in the Reflective–Impulsive Cognitive Style.

Rubin, A. (1981). Reexamining the impact of sex on salary: the limits of statistical significance.

Rubin, D.B. (1978). Bayesian inference for causal effects: The role of randomization.

Rubin, D.B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician.

Rubin, D.B. (1990). Formal modes of statistical inference for causal effects.

Rubin, D.B. (1990). Neyman (1923) and causal inference in experiments and observational studies.

Rucci, A.J., & Tweney, R.D. (1980). Analysis of variance and the "second discipline" of scientific Psychology: A historical account.

**S**

Salsburg, D. (1994). Intent to treat: The

Salsburg, D.S. (1986).

Salsburg, D. (2001).

Samaniego, F.J., & Reneau, D.M. (1994). Toward a reconciliation of the Bayesian and frequentist approaches to point estimation.

Samurçay, R., & Hoc, J.-M. (1996). Causal versus topographical support for diagnosis in a dynamic situation.

Sanabria, F., & Killeen, P.R. (2007). Better statistics for better decisions: Rejecting null hypotheses statistical tests in favor of replication statistics.

Sánchez, J., Valera, A., Velandrino, A., & Marin, F. (1992). Un estudio de la potencia estadística en Anales de Psicología (1984-1991) [A study of statistical power in the journal Anales de Psicología].

Savage, L. (1954).

Savage, L.J. (1957). Nonparametric Statistics.

Savage, L.J. (1976). On rereading R.A. Fisher [With discussion].

Savitz, D.A. (1993). Is statistical significance testing useful in interpreting data?

Savitz, D.A., & Olshan, A.F. (1995). Multiple comparisons and related issues in the interpretation of epidemiologic data.

Savitz, D.A., Tolo, K.-A., & Poole, C. (1994). Statistical significance testing in the

Sayn-Wittgenstein, L. (1965). Statistics - salvation or slavery?

Sawyer, A.G., & Ball, A.D. (1981). Statistical power and effect size in marketing research.

Sawyer, A.G., & Peter, J.P. (1983). The significance of statistical significance tests in marketing research.

Scarr, S. (1997). Rules of evidence: A larger context for the statistical debate.

Schafer, W.D. (1993). Interpreting statistical significance and nonsignificance.

Scheffé, H. (1959).

Schenker, N., & Gentleman, J.F. (2001). On judging the significance of differences by examining the overlap between confidence intervals.

Schervish, M.J. (1992). Bayesian analysis of linear models.

Schervish, M.J. (1995).

Schervish, M.J. (1996).

Scheutz, F., Andersen, B., & Wulff, H.R. (1988). What do dentists know about statistics?

Schield, M. (1998). Using Bayesian strength of belief to teach classical statistics.

Schlaifer, R. (1959).

Schmidt, F.L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology.

Schmidt, F.L. (1996). Board of scientific affairs action on significance testing.

Schmidt, F.L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers.

Schmidt, F.L., & Hunter, J.E. (1995). The impact of data-analysis methods on cumulative research knowledge: statistical significance testing, confidence intervals, and meta-analysis.

Schmidt, F.L., & Hunter, J.E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data.

Schmidt, K. (1995). Statistical tests and estimations [Background paper].

Schmitt, S.A. (1969).

Schneider, A. L., & Darcy, R. E. (1984). Policy implications of using significance tests in evaluation research.

Schuirmann, D.J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability.

Schulman, J.L., Kupst, M.J., & Suran, B.G. (1976). The worship of "

Schwartz, D. (1984). Statistique et vérité.

Schweder, T. (1988). A significance version of the basic Neyman-Pearson theory for scientific hypothesis testing.

Schweder, T., & Hjort, N.L. (2002). Confidence and likelihood.

Schwertman, N.C. (1996). A connection between quadratic-type confidence limits and fiducial limits.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies?

Sedlmeier, P. (1996). Jenseits des Signifikanztest-Rituals: Ergänzungen und Alternativen.

Sedlmeier, P. (2002). Beyond uncritical significance testing: Contrasts and effect sizes.

Seeman, J. (1973). On supervising student research.

Seidenfeld, T. (1979).

Selvin, H.C. (1957). A Critique of tests of significance in survey research.

Selvin, H.C. (1958). Reply to Beshers.

Selvin, H.C., & Stuart, A. (1966). Data dredging procedures in survey analysis.

Selvin, S., & White, M.C. (1993). Description and reporting of statistical methods.

Selwyn, W.J., Dempster, A.P., & Hall, N.R. (1981). A Bayesian approach to bioequivalence for the 2x2 changeover design.

Selwyn, W.J., & Hall, N.R. (1984). On Bayesian methods for bioequivalence.

Selwyn, W.J., Hall, N.R., & Dempster, A.P. (1985). Letter to the Editor.

Serlin, R.C. (1987). Hypothesis testing, theory building, and the philosophy of science.

Serlin, R.C. (1993). Confidence intervals and the scientific method: A case for Holm on the range.

Serlin, R.C., & Lapsley, D.K. (1985). Rationality in psychological research: The good-enough principle.

Serlin, R.C., & Lapsley, D.K. (1993). Rational appraisal of psychological research and the good-enough principle.

Shafer, G. (1986). Savage revisited.

Share, D.L. (1984). Interpreting the outcome of multivariate analysis: A discussion of current approaches.

Shaver, J.P. (1985). Chance and nonsense: A conversation about interpreting tests of statistical significance, Part 1.

Shaver, J.P. (1985). Chance and nonsense: a conversation about interpreting tests of statistical significance, Part 2.

Shaver, J. (1992). What significance testing is, and what it isn't. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA., April 1992.

Shaver, J.P. (1993). What statistical significance testing is, and what it is not.

Shea, C. (1996). Psychologists debate accuracy of "significance tests".

Shrout, P.E. (1997). Should significance tests be banned? Introduction to a special section exploring the pros and cons.

Shulman, L.S. (1970). Reconstruction of educational research.

Siegel, S. (1956).

Signorelli, A. (1974). Statistics: tool or master of the psychologist?

Sim, J., & Reid, N. (1999). Statistical inference by confidence intervals: Issues of interpretation and utilization.

Simberloff, D. (1990). Hypotheses, errors, and statistical assumptions.

Simon, R. (1986). Confidence intervals for reporting results of clinical trials.

Simon, R., & Altman, D.G. (1994). Statistical aspects of prognostic factor studies in oncology [Editorial].

Simon, R., & Wittes, R.E. (1985). Methodologic guidelines for reports of clinical trials.

Skinner, B.F. (1956). A case history in scientific method.

Skipper, Jr, J.K., Guenther, A.L., & Nass, G. (1967). The sacredness of .05: A note concerning the uses of statistical levels of significance in social science.

Slakter, M.J., Wu, Y., & Suzuki-Slakter, N.S. (1991). *, **, ***; statistical nonsense at the .00000 level.

Smeeton, N.C. (Ed.) (1994). Conference on practical Bayesian statistics [Special issue].

Smith, A. (1995). A conversation with Dennis Lindley.

Smith, C.A.B. (1960). Book review of Norman T. J. Bailey: Statistical Methods in Biology.

Smith, K. (1983). Tests of significance: some frequent misunderstandings.

Smith, N. C. (1970). Replication studies: A neglected aspect of psychological research.

Smithson, M. (2000).

Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals.

Smithson, M. (2002).

Snyder, P. (2000). Reporting results of group quantitative investigations.

Snyder, P., & Lawson, S. (1993). Evaluating results using corrected and uncorrected effect size estimates.

Snyder, P.A., & Thompson, B. (1998). Use of tests of statistical significance and other analytic choices in a school psychology journal: Review of practices and suggested alternatives.

Sohn, D. (1993). Psychology of the scientist: LXVI. The idiots savants have taken over the psychology labs! Or why in science using the rejection of the null hypothesis as the basis for affirming the research hypothesis is unwarranted.

Sohn, D. (1998). Statistical significance and replicability: Why the former does not presage the latter.

Soric, B. (1989). Statistical "discoveries" and effect-size estimation.

Spiegelhalter, D.J. (2004). Incorporating Bayesian Ideas into Health-Care Evaluation.

Spiegelhalter, D.J., & Freedman, L.S. (1986). A predictive approach to selecting the size of a clinical trial, based on subjective clinical opinion.

Spiegelhalter, D.J., & Freedman, L.S. (1988). Bayesian approaches to clinical trials [With discussion].

Spiegelhalter, D.J., Freedman, L.S., & Blackburn, P.R. (1986). Monitoring clinical trials: Conditional or predictive power?

Spiegelhalter, D.J., Freedman, L.S., & Parmar, M.K.B. (1994). Bayesian approaches to randomized trials [With discussion].

Spiegelhalter, D.J., Myles, J.P., Jones, D.R., & Abrams, K.R. (2000). Bayesian methods in health technology assessment: A review.

Spiegelhalter, D., Thomas, A., Best, N., & Gilks, W. (1998). BUGS Bayesian Inference Using Gibbs Sampling. Cambridge, UK: MRC Biostatistics Unit.

Spielman, S. (1974). The logic of tests of significance.

Spielman, S. (1978). Statistical dogma and the logic of significance testing.

Spriet, A., & Bieler, D. (1979). When can "non significantly different" treatments be considered as equivalent?

Standards of reporting trials group (1994). A proposal for structured reporting of randomized controlled trials.

Stangl, D. (1998). Classical and Bayesian paradigms: Can we teach both? In L. Pereira-Mendoza, L. Seu, T. Wee & W.K. Wong (Eds.), Statistical Education - Expanding the Network, Proceedings of the Fifth International Conference on Teaching of Statistics, Voorburg, Netherlands: ISI Permanent Office, Vol. 1, 251-258.

Steger, J.A. (Ed.) (1971).

Steidl, R.J., Hayes, J.P., & Schauber, E. (1997). Statistical power analysis in wildlife research.

Steiger, J.H. (2004). Paul Meehl and the evolution of statistical methods in psychology [commentary].

Steiger, J.H. (2004). Beyond the F test: Effect size confidence intervals and tests of close fit in the analysis of variance and contrast analysis.

Steiger, J.H., & Fouladi, R.T. (1992).

Steiger, J.H., & Fouladi, R.T. (1997). Noncentrality interval estimation and the evaluation of statistical models.

Steinfatt, T.M. (1990). Ritual versus logic in significance testing in communication research.

Sterling, T.D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance - or vice versa.

Sterling, T.D. (1960). What is so peculiar about accepting the null hypothesis?

Sterling, T.D., Rosenbaum, W.L., & Weinkam, J.J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa.

Sterne, J.A.C., & Smith, D.G. (2001). Sifting the evidence-what's wrong with significance tests?

Sterne, J.A.C. (2002). Teaching hypothesis tests - time for significant change?

Stevens, S.S. (1968). Measurement, statistics, and the schemapiric view.

Stigler, S.M. (1986).

Stigler, S.M. (1996). Statistics and the question of standards.

Stigler, S.M. (2004). Fisher: Discussion of D. Denis [The modern hypothesis testing hybrid: R. A. Fisher's fading influence].

Stone, M. (1969). The role of significance testing: Some data with a message.

Street, D.J. (1990). Fisher's contributions to agricultural statistics.

Student (1908). The probable error of a mean.

Suen, H. K. (1992). Significance testing: Necessary but insufficient.

Sullivan, J.R. (2000). A review of post-1994 literature on whether statistical significance tests should be banned. Paper presented at the annual meeting of the Southwest Educational Research Association, Dallas, TX, January 29, 2000 (Texas A&M University 77843-4225).

Summers, L.H. (1991). The scientific illusion in empirical macroeconomics.

Suter, G.W. (1996). Abuse of hypothesis testing statistics in ecological risk assessment.

Sutlive, V.H., & Ulrich, D.A. (1998). Interpreting statistical significance and meaningfulness in adapted physical activity research.

Sverdrup, E. (1975). Tests without power.

Svyantek, D. J., & Ekeberg, S. E. (1995). The earth is round (So we can probably get there from here).

Sylvester, R.J. (1988). A Bayesian approach to the design of phase II clinical trials.

**T**

Tannock, I.F. (1996). False-positive results in clinical trials: Multiple significance tests and the problem of unreported comparisons.

Tatsuoka, M. (1993). Effect size.

Taube, A. (1980). Significance, importance and equality: Three basic concepts in the analysis of a difference.

Taylor, D.J., & Muller, K.E. (1995). Computing confidence bounds for power and sample size of the general linear univariate model.

Thomas, D.C., Siemiatycki, J., Dewar, R., Robins, J., Goldberg, M., & Armstrong, B.G. (1985). The problem of multiple inference in studies designed to generate hypotheses.

Thomas, L., & Juanes, F. (1996). The importance of statistical power analysis: An example from Animal Behaviour.

Thompson, B. (1988). A note about significance testing.

Thompson, B. (1989). Asking "what if" questions about significance tests.

Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives.

Thompson, B. (Guest Ed.). (1993). Statistical significance testing in contemporary practice [Special issue].

Thompson, B. (1994). Guidelines for authors.

Thompson, B. (1994). The concept of statistical significance testing.

Thompson, B. (1994). Planned versus unplanned and orthogonal versus nonorthogonal contrasts: The neo-classical perspective.

Thompson, B. (1995). Publishing your research results: Some suggestions and counsel.

Thompson, B. (1995). Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial.

Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms.

Thompson, B. (1997). Editorial policies regarding statistical significance tests: Further comments.

Thompson, B. (1998). In praise of brilliance: Where that praise really belongs.

Thompson, B. (1998). Review of What if there were no significance tests? by L. Harlow, S. Mulaik & J. Steiger (Eds.).

Thompson, B. (1998). Statistical significance and effect size reporting: Portrait of a possible future.

Thompson, B. (1998). Five methodology errors in educational research: The pantheon of statistical significance and other faux pas. Invited address presented at the annual meeting of the American Educational Research Association, San Diego [ERIC Document Reproduction Service No. ED 419 023].

Thompson, B. (1999). Improving research clarity and usefulness with effect size indices as supplements to statistical significance tests.

Thompson, B. (1999). Journal editorial policies regarding statistical significance tests: Heat is to fire as p is to importance.

Thompson, B. (1999). If statistical significance tests are broken/misused, what practices should supplement or replace them?

Thompson, B. (1999). Statistical significance tests, effect size reporting, and the vain pursuit of pseudo-objectivity.

Thompson, B. (1999). Why "encouraging" effect size reporting is not working: The etiology of researcher resistance to changing practices.

Thompson, B. (1999). Journal editorial policies regarding statistical significance tests: Heat is to fire as

Thompson, B. (2001). Significance, effect sizes, stepwise methods, and other issues: Strong arguments move the field.

Thompson, B. (2001). Editor's Note on the "Colloquium on Effect Sizes: The Roles of Editors, Textbook Authors, and the Publication Manual".

Thompson, B. (2002). "Statistical," "practical," and "clinical": How many kinds of significance do counselors need to consider?

Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes.

Thompson, B., & Kieffer, K.M. (2000). Interpreting statistical significance test results: A proposed new "What if" method.

Thompson, B., & Snyder, P.A. (1997). Statistical significance testing practices in The Journal of Experimental Education.

Thompson, B., & Snyder, P. A. (1998). Statistical significance and reliability analyses in recent

Thompson, B., & Vacha-Haase, T. (2000). Psychometrics

Thompson, W.D. (1987). Statistical criteria in the interpretation of epidemiologic data.

Trafimow, D. (2003). Hypothesis testing and theory evaluation at the boundaries: Surprising insights from Bayes's theorem.

Trafimow, D. (2005). The ubiquitous Laplacian assumption: Reply to Lee and Wagenmakers (2005).

Tryon, W.W. (1998). The inscrutable null hypothesis.

Tryon, W.W. (2001). Evaluating statistical difference, equivalence, and indeterminacy using inferential confidence intervals: An integrated alternative method of conducting null hypothesis statistical tests.

Tullock, G. (1959). Publication decisions and tests of significance: A comment.

Tukey, J.W. (1960). Conclusions

Tukey, J.W. (1962). The future of data analysis.

Tukey, J.W. (1969). Analyzing data: Sanctification or detective work?

Tukey, J.W. (1977).

Tukey, J.W. (1991). The philosophy of multiple comparisons.

Tukey, J.W. (1993). Where should multiple comparisons go next?

Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers.

Tyler, R. (1931). What is statistical significance?

**U**

Utts, J. (1988). Successful replication versus statistical significance.

**V**

Vacha-Haase, T., & Ness, C.M. (1999). Statistical significance testing as it relates to practice: Use within Professional Psychology: Research and Practice.

Vacha-Haase, T., & Nilsson, J.E. (1998). Statistical significance reporting: Current trends and usages within MECD.

Vacha-Haase, T., Nilsson, J.E., Reetz, D.R., Lance, T.S., & Thompson, B. (2000). Reporting practices and APA editorial policies regarding statistical significance and effect size.

Valera, A., Sánchez, J., & Marin, F. (1997). Pruebas de significación y magnitud del efecto: Reflexiones y propuestas [Significance tests and effect magnitude: Reflections and proposals].

Valera, A., Sánchez, J., & Marin, F. (2000). Hypothesis testing and Spanish psychological research: Analyses and proposals [in Spanish].

Valera, A., Sánchez, J., Marin, F., & Velandrino, A. (1998). Potencia estadística de la Revista de Psicología General y Aplicada (1990-1992) [Statistical power in the journal Revista de Psicología General y Aplicada].

Vallecillos, A. (1995). Comprensión de la lógica del contraste de hipótesis en estudiantes universitarios [Understanding of the logic of hypothesis testing among university students].

Vallecillos, A. (1996). Students' conceptions of the logic of hypothesis testing.

Vallecillos, A. (1998). Research and teaching of statistical inference.

Vallecillos, A. (1999). Some empirical evidence on learning difficulties about testing hypotheses.

VanVoorhis, W.C., & Morgan, B.L. (2001). Statistical rules of thumb: What we don't want to forget about sample sizes.

Vardeman, S.B. (1987). Comment.

Vargha, A., & Delaney, H.D. (2000). A critique and improvement of the

Vaughan, G.M., & Corballis, M.C. (1969). Beyond tests of significance: Estimating strength of effects in selected ANOVA designs.

Venables, W. (1975). Calculation of confidence intervals for noncentrality parameters.

Venn, J. (1888). Cambridge anthropometry.

Victor, N. (1987). On clinically relevant differences and shifted null hypotheses.

Vokey, J.R. (1998). Statistics without probability: Significance testing as typicality and exchangeability in data analysis.

Vokey, J.R. (2003). Multiway frequency analysis for experimental psychologists.

Vollset, S.E. (1993). Confidence intervals for a binomial proportion.

**W**

Wade, P.R. (2000). Bayesian methods in conservation biology.

Wagenmakers, E.-J., & Grünwald, P. (2006). A Bayesian perspective on hypothesis testing: A comment on Killeen (2005).

Wainer, H. (1999). One cheer for null hypothesis significance testing.

Wainer, H., & Robinson, D.H. (2003). Shaping up the practice of Null Hypothesis Significance Testing.

Wald, A. (1947).

Walker, A.M. (1986). Reporting the results of epidemiologic studies.

Walker, H.M. (1929).

Walley, P. (1991).

Walley, P. (1996). Inferences from multinomial data: Learning about a bag of marbles [with discussion].

Wallis, W.A., & Roberts, H.V. (1956).

Wampold, B.E., Davis, B., & Good, R.H. (1990). Hypothesis validity of clinical research.

Wang, C. (1993).

Wang, Y.H. (2000). Fiducial intervals: What are they?

Ward, R. C., Loftis, J. C., & McBride, G. B. (1990).

Warren, W.G. (1986). On the presentation of statistical analysis: reason or ritual.

Walster, G., & Cleary, T. (1970). Statistical significance as a decision rule.

Watson, J.M., & Moritz, J.B. (1999). The beginning of statistical inference: Comparing two data sets.

Weinbach, R.W. (1989). When is statistical significance meaningful? A practice perspective.

Weitzman, R.A. (1984). Seven treacherous pitfalls of statistics illustrated.

Wellek, S., & Michaelis, J. (1991). Elements of significance testing with equivalence problems.

Welsh, A.H. (1996).

Wendell, J. P. (1991). More on Jahn's statistics.

Wendell, J. P. (1992). Jahn's statistics again.

West, L.J. (1990). Distinguishing between statistical and practical significance.

Westermann, R., & Hager, W. (1986). Error probabilities in educational and psychological research.

Westfall, P.H., Johnson, W.O., & Utts, J.M. (1997). A Bayesian perspective on the Bonferroni adjustment.

Westlake, W.J. (1976). Symmetrical confidence intervals in analysis of comparative bioavailability trials.

Westlake, W.J. (1981). Response to bioequivalence testing: A need to rethink (reader reaction response).

White, A.L. (1980). Avoiding errors in educational research.

Whitmore, G.A., & Xekalaki, E. (1990).

Windeler, J., & Conradt, C. (2000). How can "significance" and "relevance" be combined? [in German].

Wiens, J. A. (1989).

Wilcox, R.R. (1998). How many discoveries have been lost by ignoring modern statistical methods?

Wilcox, R.R., & Muska, J. (1999). Measuring effect size: A non-parametric analogue of

Wilkinson, L. and Task Force on Statistical Inference,

Willer, D. (1967).

Willer, D., & Willer, J. (1973).

Williams, A.M. (1997). Students' understanding of hypothesis testing: the case of the significance concepts.

Williams, A.M. (1998). Students' understanding of the significance level concept.

Williams, V.S.L., Jones, L.V., & Tukey, J.W. (1999). Controlling error in multiple comparisons, with examples from state-to-state differences in educational achievement.

Willink, R., & Lira, I. (2005). A united interpretation of different uncertainty intervals.

Willson, V.L. (1980). Research techniques in

Wilson, E.B. (1927). Probable inference, the law of succession, and statistical inference.

Wilson, G. (2003). Tides of change: Is Bayesianism the new paradigm in statistics?

Wilson, K.V. (1961). Subjective statistics for the current crisis.

Wilson, W.R., & Miller, H.L. (1964). A note on the inconclusiveness of accepting the null hypothesis.

Wilson, W.R., Miller, H.L., & Lower, J.S. (1967). Much ado about the null hypothesis.

Winch, R.F., & Campbell, D.T. (1969). Proof? No. Evidence? Yes. The significance of tests of significance.

Winer, B.J. (1962).

Winkler, R. L. (1972).

Winkler, R.L. (1974). Statistical analysis: Theory versus practice.

Winkler, R.L. (1993). Bayesian Statistics: An overview.

Whitehead, J. (1993). The case for frequentism in clinical trials.

Wolfowitz, J. (1967). Remarks on the theory of testing hypotheses.

Wolins, L. (1982).

Wonnacott, R. J., & Wonnacott, T. H. (1985).

Woolley, T.W. (1983). A comprehensive power-analytic investigation of research in medical education.

Woolley, T.W., & Dawson, G.O. (1983). A follow-up power analysis of the statistical tests in the Journal of Research in Science Teaching.

Woolson, R.F., & Kleinman, J.C. (1989). Perspectives on statistical significance.

Wright, A., & Ayton, P. (1994).

Wulff, H.R. (1973). Confidence limits in evaluating controlled therapeutic trials.

Wulff, H.R., Andersen, B., Brandenhoff, P., & Guttler, F. (1987). What do doctors know about statistics?

**Y**

Yates, F. (1951). The influence of

Yates, F. (1964). Sir Ronald Fisher and the design of experiments.

Yeaton, W.H., & Sechrest, L. (1986). Use and misuse of no-difference findings in eliminating threats to validity.

Yoccoz, N.G. (1991). Use, overuse, and misuse of significance tests in evolutionary biology and ecology.

Young, M.A. (1993). Supplementing tests of statistical significance: Variation accounted for.

Yule, G.U., & Greenwood, M. (1915). The statistics of anti-typhoid and anti-cholera inoculations and the interpretation of such statistics in general.

**Z**

Zabell, S.L. (1992). R.A. Fisher and the fiducial argument.

Zeisel, H. (1955). The significance of insignificant differences.

Zhu, M., & Lu, A.Y. (2004). The counter-intuitive non-informative prior for the Bernoulli family.

Zuckerman, M., Hodgins, H., Zuckerman, A., & Rosenthal, R. (1993). Contemporary issues in the analysis of data: A survey of 551 psychologists.

**RÉSUMÉS / ABSTRACTS**

Abdi, H. (1987) **[Example of equating the null hypothesis with the hypothesis of a zero value for the tested parameter]** "[...] l'effet est nul dans la Population, de là le terme d'Hypothèse Nulle." ["the effect is null in the Population, hence the term Null Hypothesis."] (page 74)

Abelson, R.P. (1997) Recognizing that hindsight often provides the clearest vision, this article examines the current significance testing controversy from a unique perspective - the future! It is the year 2006, significance tests have been banned since 1999, and already the pendulum of public opinion is swinging back in their favor: At this point, the author rediscovers a long-lost manuscript written in 1996, and finds that his views on the significance testing controversy have renewed relevance. Specifically: (1) Although bad practice certainly has characterized some significance testing, many of the critics of significance tests overstate their case by concentrating on such bad practice, rather than providing a balanced analysis; (2) Proposed alternatives to significance testing, especially meta-analysis, have flaws of their own; (3) Significance tests fill an important need in answering some key research questions, and if they did not exist they would have to be invented.

Aberson, C. (2002) In this paper, I present suggestions for improving the presentation of null results. Presenting results that "support" a null hypothesis requires more detailed statistical reporting than do results that reject the null hypothesis. Additionally, a change in thinking is required. Null hypothesis significance testing does not allow for conclusions about the likelihood that the null hypothesis is true, only whether it is unlikely that the null is true. Use of confidence intervals around parameters such as the differences between means and effect sizes allows for conclusions about how far the population parameter could reasonably deviate from the value in the null hypothesis. In this context, reporting confidence intervals allows for stronger conclusions about the viability of the null hypothesis than does reporting of null hypothesis test statistics, probabilities, and effect sizes.

Aczel, A.D. (1995) **[Example of interpretation of frequentist confidence intervals in terms of probabilities about parameters]** "[the confidence level] a measure of the confidence we have that the interval does indeed contain the parameter of interest." (page 205)

Aickin, M. (2005).
**Background and Objectives:** Classical statistical inference has attained
a dominant position in the expression and interpretation of empirical results
in biomedicine. Although there have been critics of the methods of hypothesis
testing, significance testing (P-values), and confidence intervals, these
methods are used to the exclusion of all others.

**Methods:** An alternative metaphor and inferential computation based on credibility is offered here.

**Results:** It is illustrated in three datasets involving incidence rates, and its
advantages over both classical frequentist inference and Bayesian inference,
are detailed.

**Conclusion:** The message is that for those who are unsatisfied with classical
methods but cannot make the transition to Bayesianism, there is an alternative
path.

Albert, J. (1995) Teaching elementary statistical inference from a traditional viewpoint can be hard, due to the difficulty in teaching sampling distributions and the correct interpretation of statistical confidence. Bayesian methods have the attractive feature that statistical conclusions can be stated using the language of subjective probability. Simple methods of teaching Bayes' rule are described, and these methods are illustrated for inference and prediction problems for one and two proportions. We discuss the advantages and disadvantages of traditional and Bayesian approaches in teaching inference and give texts that provide examples and software for implementing Bayesian methods in an elementary class.
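As an illustration of the kind of discrete-model calculation this abstract refers to, here is a minimal Python sketch of Bayes' rule for one proportion. The candidate values, the uniform prior, and the data (7 successes in 10 trials) are assumptions chosen for illustration; they are not taken from Albert's article.

```python
from math import comb

def discrete_posterior(candidates, prior, successes, trials):
    """Posterior over a discrete set of candidate proportions via Bayes' rule."""
    # Binomial likelihood of the observed data under each candidate proportion.
    likelihood = [
        comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)
        for p in candidates
    ]
    # Prior times likelihood, then normalise so the posterior sums to 1.
    unnormalised = [pr * lk for pr, lk in zip(prior, likelihood)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

# Hypothetical setup: five candidate proportions with equal prior weight.
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [0.2] * 5
posterior = discrete_posterior(candidates, prior, successes=7, trials=10)

for p, w in zip(candidates, posterior):
    print(f"p = {p:.1f}: posterior probability = {w:.3f}")
```

The output is a direct probability statement about each candidate value of the proportion, which is the feature of Bayesian conclusions that the abstract emphasises.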

Algina J., Moulder B.C. (2001) The increase in the squared multiple correlation coefficient
(DeltaR^2) associated with a variable in a regression equation is a commonly used measure of importance in regression
analysis. The probability that an asymptotic confidence interval will include
DeltaRho^2 was investigated. With sample sizes typically used in regression analyses, when
DeltaRho^2=0.00 and the confidence level is .95 or greater, the probability will be at least .999. For
DeltaRho^2=.01 and a confidence level of .95 or greater, the probability will be smaller
than the nominal confidence level. For DeltaRho^2=.05 and a confidence level of .95,
tables are provided for the sample size necessary for the probability to be at least .925 and to be at least .94.

Altham, P.M.E. (1969) A relationship is derived between the posterior probability of negative association of rows and columns of a 2x2 contingency table and Fisher's "exact" probability, as given in existing tables for testing the hypothesis of no association of rows and columns. The result for the 2x2 table is generalized to provide the posterior probability that one discrete-valued random variable is stochastically larger than another.

Amorim, M.A. (1999) [Example of use of ANOVA fiducial Bayesian procedures].

Amorim, M.A., Glasauer, S., Corpinot, K., & Berthoz, A. (1997) [Example of use of ANOVA fiducial Bayesian procedures].

Amorim, M.A., Isableu, B., & Jarraya, M. (2006) Effect size computations were performed using LeBayesien software (Lecoutre & Poitevineau, 1996).

Amorim, M.A., Loomis, J.M., & Fukusima, S.S. (1998) [Example of use of ANOVA fiducial Bayesian procedures].

Amorim, M.-A., & Stucchi, N. (1997) [Example of use of ANOVA fiducial Bayesian procedures].

Amorim, M.-A., Trumbore, B., & Chogyen, P.L. (2000). [Example of use of ANOVA fiducial Bayesian procedures].

Anderson, D.R., Burnham, K.P., & Thompson, W.L. (2000) This paper presents a review and critique of statistical null hypothesis testing in ecological studies in general, and wildlife studies in particular, and describes an alternative. Our review of Ecology and the Journal of Wildlife Management found the use of null hypothesis testing to be pervasive. The estimated number of P-values appearing within articles of Ecology exceeded 8,000 in 1991 and has exceeded 3,000 in each year since 1984, whereas the estimated number of P-values in the Journal of Wildlife Management exceeded 8,000 in 1997 and has exceeded 3,000 in each year since 1991. We estimated that 47% (SE=3.9%) of the P-values in the Journal of Wildlife Management lacked estimates of means or effect sizes or even the sign of the difference in means or other parameters. We find that null hypothesis testing is uninformative when no estimates of means or effect size and their precision are given. Contrary to common dogma, tests of statistical null hypotheses have relatively little utility in science and are not a fundamental aspect of the scientific method. We recommend their use be reduced in favor of more informative approaches. Towards this objective, we describe a relatively new paradigm of data analysis based on Kullback-Leibler information. This paradigm is an extension of likelihood theory and, when used correctly, avoids many of the fundamental limitations and common misuses of null hypothesis testing. Information-theoretic methods focus on providing a strength of evidence for an a priori set of alternative hypotheses, rather than a statistical test of a null hypothesis.
This paradigm allows the following types of evidence for the alternative hypotheses: the rank of each hypothesis, expressed as a model; an estimate of the formal likelihood of each model, given the data; a measure of precision that incorporates model selection uncertainty; and simple methods to allow the use of the set of alternative models in making formal inference. We provide an example of the information-theoretic approach using data on the effect of lead on survival in spectacled eider ducks (Somateria fischeri). Regardless of the analysis paradigm used, we strongly recommend inferences based on a priori considerations be clearly separated from those resulting from some form of data dredging.
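The ranking of models and the "likelihood of each model, given the data" mentioned in this abstract can be sketched with Akaike weights. The sketch below assumes that AIC is used as the estimate of relative Kullback-Leibler information (the abstract does not name the criterion), and the log-likelihoods and parameter counts are hypothetical values, not results from the paper.

```python
from math import exp

def akaike_weights(log_likelihoods, n_params):
    """Return AIC values, AIC differences, and Akaike weights for a model set."""
    # AIC = -2 log L + 2K for each model in the a priori set.
    aic = [-2.0 * ll + 2.0 * k for ll, k in zip(log_likelihoods, n_params)]
    best = min(aic)
    # Differences relative to the best (lowest-AIC) model.
    deltas = [a - best for a in aic]
    # Relative likelihood of each model, normalised to sum to 1.
    rel = [exp(-0.5 * d) for d in deltas]
    total = sum(rel)
    weights = [r / total for r in rel]
    return aic, deltas, weights

# Hypothetical maximised log-likelihoods and parameter counts for three models.
log_lik = [-120.4, -118.9, -118.7]
k = [2, 3, 5]
aic, deltas, weights = akaike_weights(log_lik, k)

for i, (a, d, w) in enumerate(zip(aic, deltas, weights), start=1):
    print(f"model {i}: AIC = {a:.1f}, delta = {d:.1f}, weight = {w:.3f}")
```

The weights express relative support across the whole candidate set, in contrast with a test of a single null hypothesis.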

Anderson, D.R., Link, W.A., Johnson, D.H., & Burnham, K.P. (2001) We give suggestions for the presentation of research results from frequentist, information-theoretic, and Bayesian analysis paradigms, followed by several general suggestions. The information-theoretic and Bayesian methods offer alternative approaches to data analysis and inference compared to traditionally used methods. Guidance is lacking on the presentation of results under these alternative procedures and on nontesting aspects of classical frequentist methods of statistical analysis. Null hypothesis testing has come under intense criticism. We recommend less reporting of the results of statistical tests of null hypotheses in cases where the null is surely false anyway, or where the null hypothesis is of little interest to science or management.

Atkins, L., & Jarrett, D. (1981) Significance tests perform a vital function in the social sciences because they appear to supply an objective method of drawing conclusions from quantitative data. Sometimes they are used mechanically, with little comment, and with even less regard for whether or not the required assumptions are satisfied. Often, too, they are used in a way that distracts attention from consideration of the practical importance of the questions posed or that disguises the inadequacy of the theoretical basis for the investigation conducted. We shall show how these tests developed historically from methodological ideas imported from the natural sciences and from ideological commitments inherent in nineteenth century social thought. We shall use the results of a recent investigation to present and criticise tests of significance. And in describing alternative approaches to evaluating research we shall argue that the central status of these tests in social science is by no means based on a consensus, even amongst statisticians, as to their appropriateness.

Azar, B. (1999) If implemented, a new set of recommendations for analyzing and
reporting data will encourage researchers to be more rigorous and detailed in
their reporting, and also open them up to using a broader group of methods and
statistical techniques, says Robert Rosenthal, PhD, co-chair of APA's Task
Force on Statistical Inference, which penned the recommendations.

**[ Example of misinterpretations of null hypothesis significance tests]**
"[a significant result] indicates that the chances of the finding being random is only 5 percent or less"

Bailar, J.C., & Mosteller, F. (1988) Provides 15 directions on manuscript preparation for reporting scientific statistics including essential elements needed for specific statistics. Provides detail on parts of the Uniform Requirements for Manuscripts Submitted to Biomedical Journals.

Bakan, D. (1967/1966) I will attempt to show that the test of significance does not provide the information concerning psychological phenomena characteristically attributed to it; and that, furthermore, a great deal of mischief has been associated with its use. [...] At the very least it would appear that we would be much better if we were to attempt to estimate the magnitude of the parameters in the populations; and recognize that we then need to make other inferences concerning the psychological phenomena which may be manifesting themselves in these magnitudes. [...] Most important, we need to get on with the business of generating psychological hypotheses and proceed to do investigations and make inferences which bear on them, instead of, as so much of our literature would attest, testing the statistical null hypothesis in any number of contexts in which we have every reason to suppose that it is false in the first place.

Balluerka, N., Gómez, J., & Hidalgo, D. (2005) Null hypothesis significance testing (NHST) is one of the most widely used methods for testing hypotheses in psychological research. However, it has remained shrouded in controversy throughout the almost seventy years of its existence. The present article reviews both the main criticisms of the method as well as the alternatives which have been put forward to complement or replace it. It focuses basically on those alternatives whose use is recommended by the Task Force on Statistical Inference (TFSI) of the APA (Wilkinson and TFSI, 1999) in the interests of improving the working methods of researchers with respect to statistical analysis and data interpretation. In addition, the arguments used to reject each of the criticisms levelled against NHST are reviewed and the main problems with each of the alternatives are pointed out. It is concluded that rigorous research activity requires use of NHST in the appropriate context, the complementary use of other methods which provide information about aspects not addressed by NHST, and adherence to a series of recommendations which promote its rational use in psychological research.

Bartko, J.J. (1991) There is a body of literature in biostatistics (e.g., Blackwelder, 1982; Blackwelder & Chang, 1984; Detsky & Sackett, 1985; Dunnett & Gent, 1977; Makuch & Simon, 1978) that discusses "proving" the null hypothesis in an attempt at establishing bioequivalence. Some readers may be interested in a selection of articles along these lines.

Bassok, M., Wu, L.L., & Olseth, K.L. (1995) **[ Example of misinterpretations
of null hypothesis significance tests]** "In
addition to the overall interpretative bias there was a very strong interaction
between the training and the transfer problems [chi-square(1) = 14.71,

Batanero, C. (2000) In spite of the widespread use of significance testing in empirical research, its interpretation and researchers' excessive confidence in its results have been criticised for years. In this paper, we first describe the logic of statistical testing in the Fisher and Neyman-Pearson approaches, review some common misinterpretations of basic concepts behind statistical tests, and analyse the philosophical and psychological issues that can contribute to these misinterpretations. We then revisit some frequent criticisms against statistical tests and conclude that most of them refer not to the tests themselves, but to the misuse of tests on the part of researchers. We agree with Levin (1998a) that statistical tests should be transformed into a more intelligent process that helps researchers in their work, and finally suggest possible ways in which statistical education might contribute to the better understanding and application of statistical inference.

Battan, L.J., Neyman, J., Scott, E.L., & Smith, J.A. (1969) **[ Example of interpretation of a p-value as Pr(H|X)]**
In these conditions [a

Bayarri, M. J. & Berger, J. O. (2004) Statistics has struggled for nearly a century over the issue of whether the Bayesian or frequentist paradigm is superior. This debate is far from over and, indeed, should continue, since there are fundamental philosophical and pedagogical issues at stake. At the methodological level, however, the debate has become considerably muted, with the recognition that each approach has a great deal to contribute to statistical practice and each is actually essential for full development of the other approach. In this article, we embark upon a rather idiosyncratic walk through some of these issues.

Beauchamp, K.L., & May, R.B. (1964) In replicating the study by Rosenthal and Gaito (1963),
[...] subjects were asked to express their "degree of belief in research findings as a function of associated
*p* levels". [...] The results
generally confirm those of Rosenthal and Gaito although [...] in the
replication no significant "cliff effect" was found in intervals
following the .05, .01 or any other *p* levels.

Berger, J. (2004) Bayesian statistical practice makes extensive use of versions of objective Bayesian analysis. We discuss why this is so, and address some of the criticisms that have been raised concerning objective Bayesian analysis. The dangers of treating the issue too casually are also considered. In particular, we suggest that the statistical community should accept formal objective Bayesian techniques with confidence, but should be more cautious about casual objective Bayesian techniques.

Berger, V.W. (2000) Hypothesis testing, in which the null hypothesis specifies no difference between treatment groups, is an important tool in the assessment of new medical interventions. For randomized clinical trials, permutation tests that reflect the actual randomization are design-based analyses for such hypotheses. This means that only such design-based permutation tests can ensure internal validity, without which external validity is irrelevant. However, because of the conservatism of permutation tests, the virtues of permutation tests continue to be debated in the literature, and conclusions are generally of the type that permutation tests should always be used or permutation tests should never be used. A better conclusion might be that there are situations in which permutation tests should be used, and other situations in which permutation tests should not be used. This approach opens the door to broader agreement, but begs the obvious question of when to use permutation tests. We consider this issue from a variety of perspectives, and conclude that permutation tests are ideal to study efficacy in a randomized clinical trial which compares, in a heterogeneous patient population, two or more treatments, each of which may be most effective in some patients, when the primary analysis does not adjust for covariates. We propose the p-value interval as a novel measure of the conservatism of a permutation test that can be defined independently of the significance level. This p-value interval can be used to ensure that the permutation test has both good global power and an acceptable degree of conservatism.
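For readers unfamiliar with design-based permutation tests, the following minimal Python sketch enumerates every possible allocation of patients to two arms and computes an exact one-sided p-value for the difference in means. The outcome values are hypothetical, the sketch assumes simple unrestricted randomization with fixed arm sizes, and it does not implement the p-value interval proposed in the article.

```python
from itertools import combinations

def permutation_test(treated, control):
    """Exact one-sided permutation p-value for the difference in means."""
    pooled = treated + control
    n_t, n_c = len(treated), len(control)
    total = sum(pooled)
    observed = sum(treated) / n_t - sum(control) / n_c

    n_alloc = 0
    n_extreme = 0
    # Enumerate every way the n_t "treated" labels could have been assigned.
    for idx in combinations(range(len(pooled)), n_t):
        s_t = sum(pooled[i] for i in idx)
        diff = s_t / n_t - (total - s_t) / n_c
        n_alloc += 1
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            n_extreme += 1
    return observed, n_extreme / n_alloc

# Hypothetical outcomes for four treated and four control patients.
treated = [12, 15, 14, 17]
control = [9, 11, 10, 13]
observed, p_value = permutation_test(treated, control)
print(f"observed difference = {observed:.2f}, one-sided p = {p_value:.4f}")
```

With eight patients there are only 70 possible allocations, so the p-value can take only multiples of 1/70. This discreteness is one source of the conservatism discussed in the abstract.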

Bernard, J.-M. (1996) In considering the inference about the unknown proportion of a
Bernoulli process, it is shown that the choices involved in the frequentist
approach are equivalent, from a Bayesian viewpoint, to the choice of a
particular ignorance prior within a restricted *ignorance zone*. This link sheds light on the nature of both kinds
of choices, and on undesirable properties that go with null variance data.

Bernard, J.-M. (2000) Section 1 deals with the inference on one frequency, that is with binary data, under either a hypergeometric or a binomial sampling model; it will enable us to introduce the key concepts involved in the Bayesian approach and to compare it to the frequentist one. From this point on, we shall focus on Bayesian inference without further attempting to provide a systematic comparison with frequentist inference. The predictive approach to inference, again on one frequency, is presented in Section 2. We then give, through concrete and real examples, an insight on how the Bayesian approach can be extended to situations involving several frequencies, first considering simple designs (Section 3), and then more complex ones (Section 4). The computational aspects, left aside in the first sections, are sketched in Section 5. Finally, Section 6 summarizes the major points put forward in the chapter.

Bernardo, J.M. (2004) Mathematical statistics uses two major paradigms, conventional (or frequentist), and Bayesian. Bayesian methods provide a complete paradigm for both statistical inference and decision making under uncertainty. Bayesian methods may be derived from an axiomatic system, and hence provide a general, coherent methodology. Bayesian methods contain as particular cases many of the more often used frequentist procedures, solve many of the difficulties faced by conventional statistical methods, and extend the applicability of statistical methods. In particular, Bayesian methods make it possible to incorporate scientific hypothesis in the analysis (by means of the prior distribution) and may be applied to problems whose structure is too complex for conventional methods to be able to handle. The Bayesian paradigm is based on an interpretation of probability as a rational, conditional measure of uncertainty, which closely matches the sense of the word 'probability' in ordinary language. Statistical inference about a quantity of interest is described as the modification of the uncertainty about its value in the light of evidence, and Bayes' theorem precisely specifies how this modification should be made. The special situation, often met in scientific reporting and public decision making, where the only acceptable information is that which may be deduced from available documented data, is addressed by objective Bayesian methods, as a particular case.

Berry, D.A. (1987) The classical design of clinical trials is dictated by the eventual analysis. If the design varies from that planned then classical analysis is impossible. The Bayesian approach on the other hand is completely flexible and is therefore ideal for addressing questions and practical decision problems. I contrast these two approaches in two types of clinical trials: (i) those that strive to treat patients as effectively as possible and (ii) those sponsored by pharmaceutical companies attempting to maximise their expected profit.

Berry, D.A. (1991) The Bayesian approach to inference and decision making provides an integrated way of addressing the various aspects of drug development, from the early preclinical study of compounds through the clinical and postmarketing phases. In particular, it provides a natural, convenient way for choosing among experimental designs. An essential aspect of the process of evaluating design strategies is the ability to calculate predictive probabilities of potential results. I describe a Bayesian approach to experimental design and illustrate it by considering a particular type of clinical trial. Also, I compare Bayesian and classical statistical attitudes toward design.

Berry, D.A. (1993) This paper describes a Bayesian approach to the design and analysis of clinical trials, and compares it with the frequentist approach. Both approaches address learning under uncertainty. But they are different in a variety of ways. The Bayesian approach is more flexible. For example, accumulating data from a clinical trial can be used to update Bayesian measures, independent of the design of the trial. Frequentist measures are tied to the design, and interim analyses must be planned for frequentist measures to have meaning. Its flexibility makes the Bayesian approach ideal for analysing clinical trials. In carrying out a Bayesian analysis for inferring treatment effect, information from the clinical trial and other sources can be combined and used explicitly in drawing conclusions. Bayesians and frequentists address making decisions very differently. For example, when choosing or modifying the design of a clinical trial, Bayesians use all available information, including that which comes from the trial itself. The ability to calculate predictive probabilities for the future observations is a distinct advantage of the Bayesian approach to designing clinical trials and other decisions. An important difference between Bayesian and frequentist thinking is the role of randomization.

Berry, D.A. (1996) This is an introduction to statistics for general students. It
differs from standard texts in that it takes a Bayesian perspective. It views
statistics as a critical tool of science and so it has a strong scientific
overtone. While my outlook is conventional in many ways, its foundation is Bayesian.
There are several advantages of the Bayesian perspective:

- It allows for direct probability statements, such as the probability that an experimental procedure
is more effective than a standard procedure.

- It allows for calculating
probabilities of future observations.

- It allows for incorporating evidence from previous experience and previous experiments into overall
conclusions.

- It is subjective. This is a standard objection to the Bayesian approach; different people reach different
conclusions from the same experiment results.

- There would be comfort in giving an answer that others would also give. But differences of opinion are the norm in
science and an approach that explicitly recognizes such differences is realistic.

- Despite differences in focus between the standard and Bayesian approaches, there are more similarities than
differences. Many of the principles illustrated in the examples and exercises
of this text are not peculiar to either approach.
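The first two advantages listed above (direct probability statements and probabilities of future observations) can be sketched for two success rates with a Beta-binomial model. This is only an illustrative sketch: the uniform Beta(1, 1) priors, the data (15/20 successes for the experimental procedure, 9/20 for the standard one), and the function names are assumptions, not material from Berry's text.

```python
import random

def posterior_params(successes, failures, a=1.0, b=1.0):
    """Beta posterior parameters after binomial data, starting from Beta(a, b)."""
    return a + successes, b + failures

def prob_superior(post_e, post_s, draws=100_000, seed=0):
    """Monte Carlo estimate of Pr(p_experimental > p_standard | data)."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(*post_e) > rng.betavariate(*post_s) for _ in range(draws)
    )
    return wins / draws

def predictive_success(post):
    """Predictive probability that the next observation is a success."""
    a, b = post
    return a / (a + b)

# Hypothetical data: 15 successes out of 20 versus 9 successes out of 20.
post_e = posterior_params(15, 5)
post_s = posterior_params(9, 11)

print(f"Pr(experimental better than standard) ~ {prob_superior(post_e, post_s):.3f}")
print(f"Pr(next experimental patient succeeds) = {predictive_success(post_e):.3f}")
```

Changing the prior parameters `a` and `b` is how evidence from previous experiments would enter the calculation, and it is also where the subjectivity mentioned in the list appears.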

Berry, D.A. (1997) University courses in elementary statistics are usually taught from a frequentist perspective. In this article I suggest how such courses can be taught using a Bayesian approach and I indicate why students in a Bayesian course are well served. A principal focus of any good elementary course is the application of statistics to real and important scientific problems. The Bayesian approach fits neatly with a scientific focus. Bayesians take a larger view and one not limited to data analysis. In particular, the Bayesian approach is subjective and requires assessing prior probabilities. This requirement forces users to relate current experimental evidence to other available information - including previous experiments of a related nature, where "related" is judged subjectively. I discuss difficulties faced by instructors and students in elementary Bayesian courses and I provide a sample syllabus for an elementary Bayesian course.

Berry, D.A. & Hochberg, Y. (1999) We discuss Bayesian attitudes towards adjusting inferences for multiplicities. In the simplest Bayesian view, there is no need for adjustments and the Bayesian perspective is similar to that of the frequentist who makes inferences on a per-comparison basis. However, as we explain, Bayesian thinking can lead to making adjustments that are in the same spirit as those made by frequentists who subscribe to preserving the familywise error rate. We describe the differences between assuming independent prior distributions and hierarchical prior distributions. As an example of the latter, we illustrate the use of a Dirichlet process prior distribution in the context of multiplicities. We also discuss some quasi-Bayesian procedures which combine Bayesian and frequentist ideas. This shows the potential of Bayesian methodology to yield procedures that can be evaluated using "objective" criteria. Finally, we comment on the role of subjectivity in Bayesian approaches to the complex realm of multiple comparisons problems, and on robust vs. informative priors.

Berry, G. (1986) In presenting the main results of a study it is good practice to provide confidence intervals rather than to restrict the analysis to significance tests. Only by doing so can authors give readers sufficient information for a proper conclusion to be done [...] Therefore, intending authors are urged to express their main conclusions in confidence interval form (possibly with the addition of a significance test, although strictly that would provide no extra information).

Beshers, J. (1958) Hanan Selvin (1957) has confused statistical inference with causal inference. All his statements to the effect that sociologists need not employ significance tests in survey research are based upon this confusion. [...] Significance tests are of little value for surveys which (1) ignore the principle of sampling, and (2) are not guided by theory. Perhaps such exploratory surveys may generate hypotheses to be verified by a subsequent well-designed survey utilizing significance tests.

Bezeau, S., & Graves, R. (2001) Cohen, in a now classic paper on statistical power, reviewed articles in the 1960 issue of one psychology journal and determined that the majority of studies had less than a 50-50 chance of detecting an effect that truly exists in the population, and thus of obtaining statistically significant results. Such low statistical power, Cohen concluded, was largely due to inadequate sample sizes. Subsequent reviews of research published in other experimental psychology journals found similar results. We provide a statistical power analysis of clinical neuropsychological research by reviewing a representative sample of 66 articles from the Journal of Clinical and Experimental Neuropsychology, the Journal of the International Neuropsychological Society, and Neuropsychology. The results show inadequate power, similar to that for experimental research, when Cohen's criterion for effect size is used. However, the results are encouraging in also showing that the field of clinical neuropsychology deals with larger effect sizes than are usually observed in experimental psychology and that the reviewed clinical neuropsychology research does have adequate power to detect these larger effect sizes. This review also reveals a prevailing failure to heed Cohen's recommendations that researchers should routinely report a priori power analyses, effect sizes and confidence intervals, and conduct fewer statistical tests.
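To make the notion of an a priori power analysis concrete, here is a minimal Python sketch of the power of a two-sided, two-sample comparison of means for a standardized effect size d. It uses a normal approximation, which gives values slightly higher than an exact t-based calculation, and the effect sizes and sample sizes shown are illustrative assumptions rather than figures from the review.

```python
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative hypothesis.
    shift = d * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Hypothetical scenarios: medium (0.5) and large (0.8) standardized effects.
for d in (0.5, 0.8):
    for n in (20, 50, 100):
        print(f"d = {d}, n per group = {n}: power ~ {power_two_sample(d, n):.2f}")
```

Under these assumptions a medium effect with 20 participants per group is detected well under half the time, which is the kind of situation the abstract describes as a less than 50-50 chance.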

Bhattacharyya & Johnson (1997) **[ Example of interpretation of frequentist confidence intervals
in terms of probabilities
about parameters]** "An alternative approach to estimation is to extend the
concept of error bound to produce an interval of values that is likely to
contain the true value of the parameter." (page 243)

Bird, K.D. (2002) Although confidence interval procedures for analysis of variance (ANOVA) have been available for some time, they are not well known and are often difficult to implement with statistical packages. This article discusses procedures for constructing individual and simultaneous confidence intervals on contrasts on parameters of a number of fixed-effects ANOVA models, including multivariate analysis of variance (MANOVA) models for the analysis of repeated measures data. Examples show how these procedures can be implemented with accessible software. Confidence interval inference on parameters of random-effects models is also discussed.

Blaich, C.F. (1998) If the NHSTP [Null Hypothesis Significance Test Procedure] procedure is essential for controlling for chance, why is there very little, if any, discussion of the nature of chance by Chow [Chow, 1996] and other advocates of the procedure? Also, many criticisms that Chow takes to be aimed against the NHSTP procedure are actually directed against the kind of theory that is tested by the procedure.

[37] **[ Example of misinterpretation of significance levels]** "[...] when a statistician
rejects the null hypothesis at a certain level of confidence, say .05, he may
be fairly well assured (

Borenstein, M. (1997) **Learning objectives**:
This paper provides the reader with an overview of several key elements in
study planning and analysis. In particular, it highlights the differences
between significance tests (statistical significance) and effect size
estimation (clinical significance). **Data sources**: This paper focuses on methodologic issues, and provides an
overview of trends in research. **Paper selection**: References were selected to provide a cross-section of the
approaches currently being used. The paper also discusses a number of logical
fallacies that have been cited as examples in earlier papers on research
design. **Conclusions**: Significance tests are intended solely to address the viability of the null hypothesis that
a treatment has no effect, and not to estimate the magnitude of the treatment
effect. Researchers are advised to move away from significance tests and to
present instead an estimate of effect size bounded by confidence intervals.
This approach incorporates all the information normally included in a test of
significance but in a format that highlights the element of interest (clinical
significance rather than statistical significance). This approach should also
have an impact on study planning--a study should have enough power to reject
the null hypothesis and also to yield a precise estimate of the treatment effect.

Boring, E.G. (1919) So it happens that the competent scientist does the best he can in obtaining unselected samples, makes his observations, computes a difference and its "significance", and then - absurd as it may seem - very often discards his mathematical result, because in his judgment the mathematically "significant" difference is nevertheless not large compared with that he believes is the discrepancy between his samples and the larger groups which they represent. [...] The case is one of many where statistical ability, divorced from a scientific intimacy with the fundamental observations, leads nowhere.

Box, G.E.P., & Tiao, G.C. (1973) The object of this book is to explore the use and relevance of
Bayes' theorem to problems such as arise in scientific investigation in which
inferences must be made concerning parameter values about which little is known
*a priori*. In Chapter 1 we discuss some important general aspects of the Bayesian approach, including: the
role of Bayesian inference in scientific investigation, the choice of prior
distributions (and, in particular, of noninformative prior distributions), the
problem of nuisance parameters, and the role and relevance of sufficient
statistics. In Chapter 2, a number of standard problems concerned with the
comparison of location and scale parameters are discussed. Bayesian methods,
for the most part well known, are derived there which closely parallel the
inferential techniques of sampling theory associated with *t*-tests, *F*-tests,
Bartlett's tests, the analysis of variance, and with regression analysis. [...]
Now, practical employment of such techniques has uncovered further inferential
problems, and attempts to solve these, using sampling theory, have had only
partial success. One of the main objectives of this book, pursued from
Chapter 3 onwards, is to study some of these problems from a Bayesian
viewpoint.

The following are examples of the
further problems considered: 1. How can inferences be made in small samples
about parameters for which no parsimonious set of sufficient statistics exists?
2. To what extent are inferences about means and variances sensitive to
departures from assumptions such as error Normality, and how can such
sensitivity be reduced? 3. How should inferences be made about variance
components? 4. How and in what circumstances should mean squares be pooled
in the analysis of variance? 5. How can information be pooled from several
sources when its precision is not exactly known, but can be estimated as, for
example, in the "recovery of interblock information" in the analysis
of incomplete block designs? 6. How should data be transformed to produce
parsimonious parametrization of the model as well as to increase sensitivity of
the analysis?

The main body of the text is an
investigation of these and similar questions with appropriate analysis of the
mathematical results illustrated with numerical examples. We believe that this
(1) provides evidence of the value of the Bayesian approach, (2) offers
useful methods for dealing with the important problems specifically considered
and (3) equips the reader with techniques which he can apply in the
solution of new problems.

Braitman, L.E. (1988) The statistical descriptors known as confidence intervals can increase the ability of readers to evaluate conclusions drawn from small trials. Fortunately, an increasing number of journals are asking authors to add confidence intervals to the reporting of data in their papers.

Braitman, L.E. (1991) In this editorial, I use hypothetical examples to illustrate point estimates and confidence intervals of the differences between the percentages of patients responding to two treatments for a cancer. These examples show how confidence intervals can help assess the clinical and statistical significance of such differences.
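The kind of comparison described in this editorial can be sketched as follows. This is only an illustrative sketch, not Braitman's own calculation: the response counts are hypothetical, the function name `diff_proportion_ci` is ours, and the interval is the simple large-sample (Wald) approximation.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportion_ci(x1, n1, x2, n2, level=0.95):
    """Wald (large-sample) confidence interval for p1 - p2.

    x1, x2 are numbers of responders; n1, n2 are group sizes.
    Returns (point estimate, lower limit, upper limit).
    """
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for 95%
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se


# Hypothetical data: the same observed difference (60% vs 40% responders)
# in a small trial and in a trial ten times larger.
small = diff_proportion_ci(12, 20, 8, 20)
large = diff_proportion_ci(120, 200, 80, 200)
print("small trial:", [round(v, 3) for v in small])
print("large trial:", [round(v, 3) for v in large])
```

With these hypothetical counts the point estimate is identical in both trials, but the interval from the small trial includes zero and a wide range of clinically different values, whereas the interval from the large trial excludes zero.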

Brandstätter, E. (1999) The article argues for replacing null hypothesis significance testing by confidence intervals. Correctly interpreted, confidence intervals avoid the problems associated with null hypothesis statistical testing. Confidence intervals are formally valid, do not depend on a-priori hypotheses and do not result in trivial knowledge. The first part presents a critique of null hypothesis significance testing; the second part replies to criticisms of confidence intervals and tries to demonstrate their superiority to null hypothesis significance testing.

Breslow, N. (1990) Attitudes of biostatisticians toward implementation of the Bayesian paradigm have changed during the past decade due to the increased availability of computational tools for realistic problems. Empirical Bayes' methods, already widely used in the analysis of longitudinal data, promise to improve cancer incidence maps by accounting for overdispersion and spatial correlation. Hierarchical Bayes' methods offer a natural framework in which to demonstrate the bioequivalence of pharmacologic compounds. Their use for quantitative risk assessment and carcinogenesis bioassay is more controversial, however, due to uncertainty regarding specification of informative priors. Bayesian methods simplify the analysis of data from sequential clinical trials and avoid certain paradoxes of frequentist inference. They offer a natural setting for the synthesis of expert opinion in deciding policy matters. Both frequentist and Bayes' methods have a place in biostatistical practice.

Bristol, D.R. (1995) Sample size determination is a very important aspect of planning a clinical trial. The actual calculation is usually the responsibility of the project statistician, but fruitful communications with the clinical monitor are required. When the variable of interest follows a normal distribution, the statistician must have specified values of the error variance and a difference. Some consequences of misspecification of these values, with emphasis on the difference, are presented. Some discussion of the role of communication between the project statistician and the clinical monitor is also presented.
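To illustrate why the specified difference matters so much, here is a minimal sketch using the textbook normal-approximation formula for two equal groups, n = 2 sigma^2 (z_{1-alpha/2} + z_{power})^2 / delta^2. It is not Bristol's own derivation; the function name `n_per_group` and the numerical values of `delta` and `sigma` are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means with known common standard deviation sigma.

    delta is the difference the trial is designed to detect.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (sigma * (z_alpha + z_power) / delta) ** 2)


# Hypothetical planning values: sigma = 20, design difference 10 vs 5.
print(n_per_group(delta=10, sigma=20))
print(n_per_group(delta=5, sigma=20))
```

Because the required size varies as 1/delta^2, halving the specified difference roughly quadruples the number of patients, which is why an over-optimistic difference leads to a badly underpowered trial.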

Brown, F.L. (1973) **[Example of misinterpretation of significance levels]** Pages 522-523 [Quoted by
Sedlmeier & Gigerenzer, 1989, page 314]

Byrne, M.D. (1993) Cognitive Science has typically proceeded with two major forms of research: model-building and experimentation. Traditional parametric statistics are normally used in the analysis of experiments, yet the assumptions required for parametric tests are almost never met in Cognitive Science. The purpose of this paper is twofold: to present a viable alternative to traditional parametric statistics—the randomization test—and to demonstrate that this method of statistical testing is particularly suited to research in Cognitive Science.
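For readers unfamiliar with the method, a randomization test for two independent groups can be sketched as below. This is a generic illustration, not Byrne's code: it enumerates every reassignment of the pooled scores (feasible only for small samples), uses the absolute difference of means as the statistic, and the function name and data are hypothetical.

```python
from itertools import combinations


def randomization_test(group_a, group_b):
    """Exact two-sided randomization test on the difference of means.

    Returns the proportion of all possible reassignments of the pooled
    scores whose absolute mean difference is at least as large as the
    observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def mean_diff(sum_a):
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(sum(group_a)))
    n_splits = n_extreme = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        n_splits += 1
        # small tolerance guards against floating-point ties
        if abs(mean_diff(sum_a)) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_splits


# Hypothetical scores for two small groups.
p_value = randomization_test([1, 2, 3], [4, 5, 6])
print(p_value)
```

No distributional assumption is needed: the reference distribution is generated by the random assignment itself. For larger samples one would sample reassignments at random instead of enumerating them all.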

Camilleri, S.F. (1962) In attempting to clarify the role of probability in sociological research we have been led into a discussion of the nature of scientific theory and induction. We have tried to articulate the principle that since induction is accomplished through the construction and verification of deductive theories, the primary concern of the social scientist ought to be the development of such theories. [...] We have tried to show that the hypothetical character of the risk probabilities associated with the level of significance and the pragmatic ambiguities of the rationale for choosing any particular level of significance seriously undermine its value in the evaluation of statistical hypotheses. It is our belief that the great reliance placed by many sociologists on tests of significance is chiefly an attempt to provide scientific legitimacy to empirical research without adequate theoretical significance.

Capraro R.M., Capraro M.M. (2002) The dialog surrounding effect sizes and
statistical significance tests often places
the two ideas into separate camps amid controversy. In light of recommendations
by the Task Force on Statistical Inference and the fifth edition of the
*American Psychological Association Publication Manual* calling
for the reporting of effect sizes, a review of treatments of effect sizes in
textbooks may be quite timely. This study reviews textbooks published since
1995 as regards their treatments of effect sizes and statistical significance
tests. Of the textbooks examined, every textbook (*n* = 89) included the topic of
statistical significance testing (2,248 pages), whereas only a little more than
two thirds of the textbooks (*n* = 60) included information on effect sizes (789 pages).

Charron, C. (2002) [Example of use of Bayesian methods: ANOVA fiducial Bayesian procedures and procedures for implication hypotheses in 2x2 tables (Lecoutre & Charron, 2000)]

Chatfield, C. (1988) Based on a teaching course for final-year undergraduates, and
on wide consultancy experience, this readable book provides a wealth of
information and valuable insight for the student statistician and practitioner
alike.

1. "A significant effect is not necessarily the same thing as an interesting effect;
2. A non-significant effect is not necessarily the same thing as no difference." (page 51)

"Scientists often finish their analysis by quoting a P-value, but this is not the right place to stop. One
still wants to know how large the effect is, and a confidence interval should be given where possible." (page 51)

Chatfield, C. (2002) The paper reflects on the author's experience and discusses how statistical theory, sound judgment and knowledge of the context can work together to best advantage when tackling the wide range of statistical problems that can arise in practice. The phrase 'pragmatic statistical inference' is introduced.

Chow, S.L. (1988) I describe and question the argument that in psychological research, the significance test should be replaced (or, at least, supplemented) by a more informative index (viz., effect size or statistical power) in the case of theory-corroboration experimentation because it has been made on the basis of some debatable assumptions about the rationale of scientific investigation. The rationale of theory-corroboration experimentation requires nothing more than a binary decision about the relation between two variables. This binary decision supplies the minor premise for the syllogism implicated when a theory is being tested. Some metatheoretical considerations reveal that the magnitude of the effect-size estimate is not a satisfactory alternative to the significance test.

Chow, S.L. (1989) Shows that agreeing with Folger's (1989) methodological observations does not mean that it is incorrect to use significance tests. This contention is based on the dynamics of theory corroboration, with reference to which the following distinctions are illustrated, namely, the distinction between (a) statistical hypothesis testing, theory corroboration, and syllogistic argument, (b) a responsible experimenter and a clinical experimenter, (c) logical validity and methodological correctness, and (d) warranted assertability and truth.

Chow, S.L. (1991) This is Chow's response to four comments on his critique of
the view that research conclusions be based on multiple context-dependent
criteria. Five themes could be identified in the comments. In reply, it is
argued that care should be taken not to use the alpha level whimsically because
the continuum *similarity*, is being
used as a dichotomy in theory corroboration. The superiority of effect-size
estimates to statistical significance is more apparent than real. Chow's
assessment of meta-analysis is illustrated with the *apples and oranges* issues.

Chow, S.L. (1991) In sum, two putative advantages of basing theoretical conclusions on statistical power can be questioned. A null hypothesis is not a categorical proposition descriptive of the world, but a prescriptive statement. Using tests of significance is not incompatible with making rational judgment.

Chow, S.L. (1996) The null-hypothesis significance-test procedure (NHSTP) is
defended in the context of the theory-corroboration experiment, as well as the
following contrasts: (a) substantive hypotheses versus statistical hypotheses,
(b) theory corroboration versus statistical hypothesis testing, (c) theoretical
inference versus statistical decision, (d) experiments versus nonexperimental
studies, and (e) theory corroboration versus treatment assessment. The null
hypothesis can be true because it is the hypothesis that errors are randomly
distributed in data. Moreover, the null hypothesis is never used as a
categorical proposition. Statistical significance means only that chance
influences can be excluded as an explanation of data; it does not identify the
nonchance factor responsible. The experimental conclusion is drawn with the
inductive principle underlying the experimental design. A chain of deductive
arguments gives rise to the theoretical conclusion via the experimental
conclusion. The anomalous relationship between statistical significance and the
effect size often used to criticize NHSTP is more apparent than real. The
absolute size of the effect is not an index of evidential support for the
substantive hypothesis. Nor is the effect size, by itself, informative as to the
practical importance of the research result. Being a conditional probability,
statistical power cannot be the *a priori*
probability of statistical significance. The validity of statistical power is
debatable because statistical significance is determined with a single sampling
distribution of the test statistic based on H0, whereas it takes two
distributions to represent statistical power or effect size. Sample size should
not be determined in the mechanical manner envisaged in power analysis. It is
inappropriate to criticize NHSTP for nonstatistical reasons. At the same time,
neither effect size nor confidence interval estimate nor posterior probability
can be used to exclude chance as an explanation of data. Nor can any of them
fulfill the nonstatistical functions expected of them by critics.

Chow, S. L. (1998) Sohn (1998) presents a good argument that neither statistical significance nor effect size is indicative of the replicability of research results. His objection to the Bayesian argument is also succinct. However, his solution of the "replicability belief" issue is problematic, and his verdict that significance tests have no role to play in empirical research is debatable. The strengths and weaknesses of Sohn's argument may be seen by explicating some of his assertions.

Ciancia, F., Maitte, M., Honoré, J., Lecoutre, B., & Coquery, J.-M. (1988) **[Illustration
of standard Bayesian methods]** "Analysis of
variance was extended by standard Bayesian inferences. Whereas

Clément, E., & Richard, J.-F. (1997) [Example of use of standard Bayesian methods].

Cohen, J. (1962) The purpose of the study was to survey the articles of the
*Journal of Abnormal and Social Psychology*, 1960, 61, from the point of view
of the power of their statistical tests to
reject their major null hypotheses, for defined levels of departure of
population parameters from null conditions, i.e., size of effect. Conventional
test conditions were employed in power determination: nondirectional tests at
the .05 level. [...] It was found that the average power (probability of rejecting
false null hypotheses) over the 70 research studies was .18 for small effects,
.48 for medium effects, and .83 for large effects. These values are deemed to
be far too small and suggest that much research in the abnormal-social area has
led to the failure to reject null hypotheses which are in fact false. [...]
Since power is a direct monotonic function of sample size, it is recommended
that investigators use larger sample sizes than they customarily do. It is
further recommended that research *plans*
be routinely subjected to power analysis, using as conventions the criteria of
population effect size employed in this survey.
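The statement that power is a monotonic function of sample size can be made concrete with a small sketch. It uses a normal approximation to the two-sided, two-sample test at the .05 level rather than Cohen's tables, and the effect sizes and group sizes below are illustrative assumptions, not the conventions or data of the 1962 survey.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a nondirectional two-sample test.

    d is the standardized mean difference (population effect size);
    the normal approximation ignores the extra uncertainty from
    estimating the variance, so it is slightly optimistic for small n.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    # probability of landing in either rejection region
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


for n in (20, 64, 200):
    print(n, [round(approx_power(d, n), 2) for d in (0.2, 0.5, 0.8)])
```

For a fixed effect size each row shows higher power than the previous one, and with a zero effect the function returns alpha, as it should.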

Cohen, J. (1988) The power of a statistical test is the probability that it
will yield statistically significant results. Since statistical significance is
so earnestly sought and devoutly wished for by behavioral scientists, one would
think that the *a priori* probability
of its accomplishment would be routinely determined and well understood. Quite
surprisingly, this is not the case. Instead, if we take as evidence the
research literature, we find that statistical power is only infrequently
understood and almost never determined. The immediate reason for this is not
hard to discern - the applied statistics textbooks aimed at behavioral
scientists, with few exceptions, give it scant attention.

The purpose of this book is to provide
a self-contained comprehensive treatment of statistical power analysis from an
"applied" viewpoint. The purpose of Chapter 1 is to present the
basic conceptual framework of statistical hypothesis testing, giving emphasis
to power, followed by the framework within which this book is organized. Each
of the succeeding chapters presents a different statistical test. They are
similarly organized as follows: 1. The test is introduced and its use
described. 2. The ES [effect size] index is described and discussed in
detail. 3. The characteristics of the power tables and the method of their
use are described and illustrated with examples. 4. The characteristics of
the sample size tables and the method of their use are described and
illustrated with examples. 5. The use of the power tables for significance
tests is described and illustrated with examples.

Cohen, J. (1994) After 4 decades of severe criticism, the ritual of null
hypothesis significance testing - mechanical dichotomous decisions around a
sacred .05 criterion - still persists. This article reviews the problems with
this practice, including its near-universal misinterpretation of *p* as the probability that H0 is false,
the misinterpretation that its complement is the probability of successful
replication, and the mistaken assumption that if one rejects H0 one thereby
affirms the theory that led to the test. Exploratory data analysis and the use
of graphic methods, a steady improvement in and a movement toward
standardization in measurement, an emphasis on estimating effect sizes using
confidence intervals, and the informed use of available statistical methods are
suggested. For generalization, psychologists must finally rely, as has been
done in all the older sciences, on replication.

Cooper & Topher (1994) **[ Example of incorrect definition of a p-value]** "The

Corroyer, D., Devouche, E., Bernard, J.-M., Bonnet, P., & Savina, Y. (2003) We compare six statistical packages (EyeLID-2, PAC, SPSS, Statistica, Statview, Var3)
for the analysis of ANOVA-type data (unbalanced S<A2*B2> design) with respect to their
descriptive and inferential aspects and from several points of view: 1/ access to various
comparison options (equal weighting or not, specific or not);
2/ integration of procedures linked to recent methodological advances
defined in particular under the aegis of the *APA* (assessment of effect
sizes, Bayesian inference); 3/ mode of access to the procedures. We
find that not all desirable options or procedures are
always available. It therefore appears necessary to resort to several
packages. For some packages, there is sometimes a lack of information
in the display, or even inconsistencies between the various results produced,
which may lead the researcher to erroneous conclusions.

Cox, D.R. (1977) The main object of the paper is to give a general review of
the nature and importance of significance tests. Such tests are regarded as
procedures for measuring the consistency of data with a null hypothesis by the
calculation of a *p*-value (tail area).
A distinction is drawn between several kinds of null hypothesis. The ways of
deriving tests, namely via the so-called absolute test, via implicit
consideration of alternatives and via explicit consideration of alternatives
are reviewed. Some of the difficulties of the multidimensional alternatives are
outlined and the importance of the diagnostic ability of a test is stressed.
Brief examples include tests of distributional form including multivariate
normality. The effect of modifying statistical analysis in the light of the
data is discussed, four main cases being distinguished. Then a number of more
technical aspects of significance tests are outlined, including the role of the
continuity correction, Bayesian tests and the use of tests in the comparison of
alternative models. Finally the circumstances are reviewed under which
significance tests can provide the main summary of a statistical analysis.

Cox, D.R. (2001) Comment on Sterne, J.A.C., & Davey Smith, G. (2001). Sifting
the evidence - what's wrong with significance tests? *BMJ*, *322*, 226-231.

Cox, D.R., & Snell, E.J. (1981) P values, or significance levels, measure the strength of the evidence against the null hypothesis; the smaller the P value, the stronger the evidence against the null hypothesis. An arbitrary division of results, into "significant" or "nonsignificant" according to the P value, was not the intention of the founders of statistical inference. A P value of 0.05 need not provide strong evidence against the null hypothesis, but it is reasonable to say that P < 0.001 does. In the results sections of papers the precise P value should be presented, without reference to arbitrary thresholds. Results of medical research should not be reported as "significant" or "nonsignificant" but should be interpreted in the context of the type of study and other available evidence. Bias or confounding should always be considered for findings with low P values. To stop the discrediting of medical research by chance findings we need more powerful studies.

Craig, J.R., Eison, C.L., & Metze, L.P. (1976) The issue of the interpretation of significance tests is
addressed. An argument is presented that some measure of association such as *omega-square*
should be provided as an interpretation/decision-making aid for scientific
consumers and journal editors. Published research articles were examined
regarding use of measures of association and the relationship between sample
size and the amount of variance shared by the independent and the dependent
variable. The results indicated that no articles reported measures of
association and that many published studies are based upon small degrees of
relationship between the independent and dependent variable. A change in
report-writing and journal editorial practices is suggested.
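As a reminder of what such a measure of association looks like, here is a minimal sketch of omega-square for a one-way fixed-effects design, using the usual formula (SS_between - (k - 1) MS_within) / (SS_total + MS_within). The function name and the data are hypothetical and are not taken from the articles surveyed by the authors.

```python
def omega_squared(groups):
    """Omega-square for a one-way fixed-effects ANOVA.

    groups is a list of lists of scores, one list per condition.
    Estimates the proportion of variance in the dependent variable
    accounted for by the independent variable.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_within = ss_within / (n_total - k)
    ss_total = ss_between + ss_within
    return (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)


# Hypothetical scores in two conditions.
print(round(omega_squared([[1, 2, 3], [2, 3, 4]]), 3))
```

Unlike the significance level, this index does not grow automatically with sample size, which is why it can reveal that a "significant" result rests on a small degree of relationship.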

Crow, E.L. (1991) Rosenthal (1990) is wrong in advocating the use of his and Rubin's binomial effect size display (BESD) "to index the practical value of our research results [...]". He is wrong because the BESD corresponds to no real population of interest.

Cumming, G. (2005) Killeen's (2005) prep is wonderful, but may be difficult to understand. I offer two figures and a table that may assist. I assume normal populations, with sigma known

Cumming, G., & Finch, S. (2001) Reform of statistical practice in the social and behavioural sciences requires wider
use of confidence intervals (CIs), and of effect size measures and
meta-analysis. In this context we discuss four reasons for promoting use of
CIs: (i) they give useful, interpretable information; (ii) they are linked to
statistical significance tests with which researchers are already familiar;
(iii) they can encourage meta-analytic thinking that focuses on estimation; and
(iv) CI width gives information about precision that may be more useful than a
statistical power value. We focus on a basic standardised effect size measure,
Cohen's delta (also referred to as Cohen's *d*). We give methods and
examples for the calculation of CIs for delta, which require use of noncentral
t distributions, and contrast these with the familiar CIs for original score
means. We discuss noncentral t distributions, unfamiliar to many social
scientists, and apply these also to statistical power and to simple meta-analysis
of standardised effect sizes. We provide the *ESCI* graphical software,
which runs under Microsoft Excel, to illustrate the discussion. Wider use of
CIs for delta and other effect size measures should help promote highly
desirable reform of statistical practice in the social sciences.

Cumming, G., & Finch, S. (2005) Wider use in psychology of confidence intervals (CIs), especially as error bars in figures, is a desirable development. However, psychologists seldom use CIs and may not understand them well. The authors discuss the interpretation of figures with error bars, and analyze the relationship between CIs and statistical significance testing. They propose 7 rules of eye to guide the inferential use of figures with error bars. These include general principles: Seek bars that relate directly to effects of interest, be sensitive to experimental design, and interpret the intervals. They also include guidelines for inferential interpretation of the overlap of CIs on independent group means. Wider use of interval estimation in psychology has the potential to improve research communication substantially.

Cumming, G., Williams, J., & Fidler, F. (2004) Confidence intervals (CIs) and standard error bars give information about replication, but do researchers have an accurate appreciation of that information? Authors of journal articles in psychology, behavioural neuroscience, and medicine were invited by email to visit a website and indicate on a figure where they judged replication means would plausibly fall. Responses from 263 researchers suggest that many leading researchers in the three disciplines under-estimate the extent that future replications will vary. A 95% CI will on average capture 83.4% of future replication means. A majority of respondents, however, hold the confidence level misconception (CLM) that a 95% CI will on average capture 95% of replication means. Better understanding of CIs is needed if they are to be successfully used more widely in psychology.
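The 83.4% figure can be recovered with a short calculation under simplifying assumptions that are ours, not necessarily the authors': normal populations, known sigma, and a replication of the same size as the original study. The difference between the original and the replication mean then has variance 2 sigma^2 / n, so the original interval, of half-width z sigma / sqrt(n), captures the replication mean with probability 2 Phi(z / sqrt(2)) - 1. The function name below is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def capture_probability(level=0.95):
    """Probability that a CI of the given level, computed from one
    study, contains the mean of an independent same-size replication
    (normal populations, known sigma)."""
    z = NormalDist()
    half_width = z.inv_cdf(0.5 + level / 2)  # in standard-error units
    # original and replication means differ by sqrt(2) standard errors
    return 2 * z.cdf(half_width / sqrt(2)) - 1


print(round(capture_probability(0.95), 3))
```

Under these assumptions the result is about 0.834, well below the 0.95 that the confidence level misconception would suggest, because both the original mean and the replication mean are subject to sampling error.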

D'Agostini, G. (1999) Subjective probability is based on the intuitive idea that probability quantifies the degree of belief that an event will occur. A probability theory based on this idea represents the most general framework for handling uncertainty. A brief introduction to subjective probability and Bayesian inference is given, with comments on typical misconceptions which tend to discredit it and with comparisons to other approaches.

D'Agostini, G. (2000) Criticisms of so called 'subjective probability' come on the one hand from those who maintain that probability in physics has only a frequentistic interpretation, and, on the other, from those who tend to 'objectivise' Bayesian theory, arguing, e.g., that subjective probabilities are indeed based 'only on private introspection'. Some of the common misconceptions on subjective probability will be commented upon in support of the thesis that coherence is the most crucial, universal and 'objective' way to assess our confidence on events of any kind.

D'Agostini, G. (2000) This contribution to the
debate on confidence limits focuses mostly on the case of measurements with
'open likelihood', in the sense that it is defined in the text. I will show
that, though a prior-free assessment of *confidence* is, in general, not
possible, still a search result can be reported in a mostly unbiased and
efficient way, which satisfies some desiderata which I believe are shared by
the people interested in the subject. The simpler case of 'closed likelihood'
will also be treated, and I will discuss why a uniform prior on a sensible
quantity is a very reasonable choice for most applications. In both cases, I
think that much clarity will be achieved if we remove from scientific parlance
the misleading expressions 'confidence intervals' and 'confidence levels'.

D'Agostini, G. (2003) This report introduces general ideas and some basic methods of the Bayesian probability theory applied to physics measurements. Our aim is to make the reader familiar, through examples rather than rigorous formalism, with concepts such as: model comparison (including the automatic Ockham's Razor filter provided by the Bayesian approach); parametric inference; quantification of the uncertainty about the value of physical quantities, also taking into account systematic effects; role of marginalization; posterior characterization; predictive distributions; hierarchical modelling and hyperparameters; Gaussian approximation of the posterior and recovery of conventional methods, especially maximum likelihood and chi-square tests under well defined conditions; conjugate priors, transformation invariance and maximum entropy motivated priors; Monte Carlo estimates of expectation, including a short introduction to Markov Chain Monte Carlo methods.

Daniel, L.G. (1998) Statistical significance tests (SSTs) have been the object of much controversy among social scientists. Proponents have hailed SSTs as an objective means for minimizing the likelihood that chance factors have contributed to research results; critics have both questioned the logic underlying SSTs and bemoaned the widespread misapplication and misinterpretation of the results of these tests. The present paper offers a framework for remedying some of the common problems associated with SSTs via modification of journal editorial policies. The controversy surrounding SSTs is overviewed, with attention given to both historical and more contemporary criticisms of bad practices associated with misuse of SSTs. Examples from the editorial policies of Educational and Psychological Measurement and several other journals that have established guidelines for reporting results of SSTs are overviewed, and suggestions are provided regarding additional ways that educational journals may address the problem.

Dar, R. (1998) Chow's [Chow, 1996] account of Bayesian inference logic and procedures is replete with fundamental misconceptions, derived from secondary sources and not adequately informed by modern work. The status of subjective probabilities in Bayesian analyses is misrepresented and the cogent reasons for the rejection by many statisticians of the curious inferential hybrid used in psychological research are not presented.

De Cristofaro, R. (1996) In this paper we support the idea that statistical inference can be worked out as a branch of inductive inference. Indeed, unless the assumptions are changed from the very beginning, we must find the solution to the problem of inference inside probability calculus. Moreover, we are not allowed to reach a conclusion (as the choice of a hypothesis) that is outside the scope of probability theory. In this connection, the importance of an appropriate analysis of the complete posterior distribution about the parameters in question is underlined, particularly where there are several parameters.

De Cristofaro, R. (2002) In this article, we support the idea that inductive reasoning can be worked out within probability theory, by means of a logical solution to the old problem of prior probabilities, and that accepting or rejecting hypotheses is a pragmatic choice, which does not belong to inductive reasoning. Many authors that solve statistical inference by simply examining the likelihood function do not follow Bayes theorem. We are consistent with Bayes theorem, and we think that the piece of information about the design is potentially contained in the prior. Later on, we provide a justification to the Jeffreys-rule to assign prior probabilities in the version supported by Box and Tiao. Our conclusion is that the best method to communicate the conclusions of a statistical research in an objective way consists in a probabilistic statement. On the contrary, the significance level is not often a good method of summarizing the information in the posterior distribution.

De Cristofaro, R. (2003) According to the likelihood principle, if the designs produce proportional likelihood functions, one should make an identical inference about a parameter from the data irrespective of the design, which yields the data. If it comes to that, there are several counter-examples, and/or paradoxical consequences to the likelihood principle. Besides, as we will see, contrary to a widely held opinion, such a principle is not a direct consequence of Bayes theorem. In particular, the piece of information about the design is one part of the evidence, and it is relevant for the prior. Later on, a justification to Jeffreys-rule to assign prior probabilities in the version supported by Box and Tiao is provided. Another basic idea of the present paper is that (apart from other information) the equiprobability assumption is to be linked to the idea of the impartiality of design with respect to the parameter under consideration. The whole paper has remarkable implications on the foundations of statistics from the notion of sufficiency, the relevance of the stopping rule and of the randomization in survey sampling and in the experimental design, the difference between ignorable and non-ignorable designs, until a reconciliation of different approaches to the inductive reasoning in statistical inference.

Deheuvels, P. (1984) We describe test procedures that make it possible to decide whether bioequivalence holds or not from the results of a comparative analysis. We prove that a correct use of Student confidence intervals gives a test uniformly more powerful than the corresponding methods based on Westlake [Westlake, 1976] confidence intervals.

del Rosal, A.B., Costas, C.S., Bruno, J.A.S., & Osinski, I.C. (2001) Null hypothesis significance testing has been a source of debate within the scientific community of behavioral researchers for years, since inadequate interpretations have resulted in incorrect use of this procedure. In this paper, we present a review of the latest contributions of methodologists of different opinions, for and against, and we also set out the guidelines for research within behavioral science recently issued by the A.P.A. (American Psychological Association) Task Force on Statistical Inference (Wilkinson, 1999).

Denhière, G., & Lecoutre, B. (1983) [*Illustration
des méthodes bayésiennes standard*] Trois groupes de 60 enfants âgés de 7, 8
et 10 ans ont été soumis à une expérience de reconnaissance immédiate et
différée (une semaine) d'énoncés appartenant à des récits, et de distracteurs
sémantiquement proches et lointains. L'expérience tentait de répondre à quatre
questions: 1. Constate-t-on un "effet de niveau" en
reconnaissance comme en rappel? 2. Observe-t-on un effet de l'âge
comparable à celui obtenu en rappel ? 3. L'information est-elle
stockée sous forme lexicale et/ou conceptuelle? 4. Des récits différents
par leur contenu conduisent-ils à des performances différentes? L'analyse
bayésienne des comparaisons (extension bayésienne de l'analyse de la variance)
permet de répondre négativement aux questions 1, 2 et 4. L'absence d'effet de
niveau en reconnaissance immédiate et différée et les faibles différences de
performance en fonction de l'âge conduisent à privilégier les modèles qui
prévoient une représentation hiérarchique de l'information en mémoire et un
processus de recherche et de récupération de l'information du type haut-bas.

*Story memory: immediate and delayed recognition of statements by 7, 8 and 10 years old children*

[*Example of use of standard Bayesian methods*] Three groups of 60
children (7, 8 and 10 years old) participated in an immediate and delayed (8
days) recognition experiment. Children had to identify original statements
(segments of the story) and to reject statements which were semantically close
to and distant from the original ones. The experiment intended to answer four
questions: 1. Is there a level-effect present in recognition as there is in
recall? 2. Is there an effect of age similar in recognition and in recall?
3. Is the information stored in conceptual and/or lexical form? 4. Is
the influence of the content of the stories different in recognition and in
recall? The Bayesian Analysis of Comparisons (Bayesian extensions of ANOVA)
leads to a negative answer to questions 1, 2 and 4. The absence of a
level-effect in immediate and delayed recognition and the small differences
between the three age groups are in agreement with memory models which predict
a hierarchical representation of information and a top down retrieval process.

Denis, D.J. (2004) Today's genre of null hypothesis significance testing (NHST) bears little resemblance to the model originally proposed by Fisher over seventy-five years ago. Aside from general misunderstandings, the present model incorporates features that Fisher adamantly rejected. The aim of this article is to bring to attention how NHST differs from the model first proposed by Fisher in 1925, and in doing so, locate his model within today's hybrid of hypothesis testing. It is argued that associating Fisher's name with today's version of NHST is not only incorrect, it inappropriately blames Fisher for NHST's deep methodological and philosophical problems. An attempt is made to distinguish between Fisher's original model and today's hybridized, and generally misunderstood approach to statistical inference. It will be shown that today's social science researchers utilize a logically faulty and distasteful blend of Fisherian, Neyman-Pearson and Bayesian ingredients.

Doros, G. & Geier, A.B. (2005) We appreciate Killeen's discussion about the shortcomings of classical hypothesis testing. However, any measure that is no more than a simple transformation of the classical p value (see Killeen's appendix) will inherit the shortcomings of that p value.

Duggan, T.J., & Dean, C.W. (1968) [...] two elementary safeguards can be exercised in reporting results. One is routinely to compute and report a measure of degree of association in addition to the statistical test whenever this is possible. The second safeguard is the introduction of care and caution in the verbal interpretation of data tables and the inferred association of variables.

DuMouchel, W. (1989) In this chapter we provide step-by-step instructions for setting up a Bayesian hierarchical model in order to combine statistical summaries from several studies into a single super analysis which integrates the results from each study. A discussion of the data requirements of the methodology is followed by a specification of a particular Bayesian model designed to be both flexible and easy to use. A set of formulas define all the computations necessary to obtain the posterior distributions of the relevant parameters. An example meta-analysis shows how different specifications of the prior distribution can affect the results.

Dunne A., Pawitan, Y., & Doody, L. (1996) In statistical practice P-values are regularly used to express the amount of evidence in the data, but there is no agreement on how to compute two-sided P-values when the sampling distributions are discrete and asymmetric. Doubling the one-sided P-value, or adding the probabilities less than or equal to the probability of the observed data, has been suggested in practice. However, since P-values are associated with a test, it is not clear what tests correspond to those suggested P-values. In this paper we suggest a way to compute the two-sided P-value as the smallest significance level for which, given the data, we would reject the null hypothesis in favour of a two-sided alternative by using the appropriate uniformly most powerful unbiased test. The method is illustrated using the small sample testing of a binomial proportion and the exact analysis of 2x2 tables as examples. The resulting P-value is compared with the previous two methods and a general discussion on the nature of P-values and two-sided tests is given.
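
The two conventional definitions mentioned in this abstract (doubling the one-sided P-value, and adding the probabilities less than or equal to that of the observed data) can disagree when the null distribution is asymmetric. The following is a minimal sketch for a binomial proportion only; it does not implement the UMPU-based P-value that the authors propose, and the function names and the numbers in the usage note are illustrative assumptions, not taken from the paper.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_values(k_obs, n, p0):
    """Return (doubled, probability_ordered) two-sided P-values.

    doubled: twice the smaller one-sided tail probability, capped at 1.
    probability_ordered: sum of the probabilities of all outcomes that are
        no more probable under H0 than the observed outcome.
    """
    pmf = [binom_pmf(k, n, p0) for k in range(n + 1)]
    lower_tail = sum(pmf[: k_obs + 1])  # P(X <= k_obs)
    upper_tail = sum(pmf[k_obs:])       # P(X >= k_obs)
    doubled = min(1.0, 2 * min(lower_tail, upper_tail))

    # Small relative tolerance so floating-point ties count as "equal".
    threshold = pmf[k_obs] * (1 + 1e-12)
    probability_ordered = min(1.0, sum(q for q in pmf if q <= threshold))
    return doubled, probability_ordered
```

With the hypothetical data of 5 successes in 10 trials and a null value of 0.2, the doubling rule gives a P-value about twice as large as the probability-ordering rule, because no outcome in the lower tail is as improbable as the observed one. With a symmetric null (for example a null value of 0.5) the two definitions coincide.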

Edwards, W., Lindman, H., & Savage, L.J. (1963) Bayesian statistics, a currently controversial viewpoint concerning statistical inference, is based on a definition of probability as a particular measure of the opinions of ideally consistent people. Statistical inference is modification of these opinions in the light of evidence, and tools of Bayesian statistics include the theory of specific distributions and the principle of stable estimates, which specifies when actual prior opinions may be satisfactorily approximated by a uniform distribution. A common feature of many classical significance tests is that a sharp null hypothesis is compared with a diffuse alternative hypothesis. Often evidence which, for a Bayesian statistician, strikingly supports the null hypothesis leads to rejection of that hypothesis by standard classical procedures. The likelihood principle emphasized in Bayesian statistics implies, among other things, that the rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience.

Efron, B. (1996) This article attempts to answer the following question: why is most scientific data analysis carried out in a non-Bayesian framework? The argument consists mainly of some practical examples of data analysis, in which the Bayesian approach is difficult but Fisherian/frequentist solutions are relatively easy. There is a brief discussion of objectivity in statistical analyses and of the difficulties of achieving objectivity within a Bayesian framework. The article ends with a list of practical advantages of Fisherian/frequentist methods, which so far seem to have outweighed the philosophical superiority of Bayesianism.

Efron, B. (1998) Fisher is the single most important figure in 20th century statistics. This talk examines his influence on modern statistical thinking, trying to predict how Fisherian we can expect the 21st century to be. Fisher's philosophy is characterized as a series of shrewd compromises between the Bayesian and frequentist viewpoints, augmented by some unique characteristics that are particularly useful in applied problems. Several current research topics are examined with an eye toward Fisherian influence, or the lack of it, and what this portends for future statistical developments.

Elifson, K.W., Runyon, R.P., & Haber, A. (1990) **[ Example of interpretation
of frequentist confidence intervals in terms of probabilities
about parameters]** "[...] we assert that the population mean probably
falls within the interval that we have established." (page 367)

Ellis, N. (2000) *Language Learning*, like many journals that publish research using quantitative and statistical
methods, is increasingly influenced by the advantages of the reporting of
effect sizes. Submitting authors to this journal have to date been referred to
the statement in the Publication Manual of the American Psychological
Association (4th edition) which emphasizes that statistical significance *p* values are not acceptable indices of
effect because they depend on sample size and that "you are [therefore]
encouraged to provide effect size information." (APA, 1994, p. 18).
Unfortunately, empirical studies of this and other journals (Wilkinson &
the American Psychological Association Task Force on Statistical Inference,
1999) indicate that this encouragement has had negligible impact. The reporting
of effect sizes is essential to good research. It enables readers to evaluate
the stability of results across samples, operationalizations, designs, and
analyses. It allows evaluation of the practical relevance of the research
outcomes. It provides the basis of power analyses and meta-analyses needed in
future research. This role of effect sizes in meta-analysis is clearly
illustrated in the article by Norris and Ortega which follows this editorial
statement. Submitting authors to Language Learning are therefore required
henceforth to provide a measure of effect size, at least for the major
statistical contrasts which they report. [...] Always present effect sizes and
their confidence intervals for primary outcomes. These effect sizes might be of
various forms. If the units of measurement are meaningful on a practical level
(e.g., reading rate, normed proficiency test scores), then unstandardized
measures (regression coefficient or mean difference) are appropriate. If not,
standardized differences (*d*) or
uncorrected (e.g., *r*, *R-square*, *eta-square*) or corrected (e.g., adjusted
*R-square*, *omega-square*) variance-accounted-for-statistics should be reported. These effect sizes are
required in addition to the usual inferential statistical tests of
significance, they do not replace them. It is also appropriate in the textual
argument of the results section to place these effect sizes in their practical
and theoretical context.
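
As an illustration of the distinction drawn in this editorial between unstandardized and standardized effect sizes, here is a minimal sketch that computes a raw mean difference and a standardized difference (*d*) for two independent groups. The pooled-standard-deviation form of *d* is one common convention and is an assumption here, as are the function names; the editorial does not prescribe a formula. Confidence intervals, which the editorial also asks for, are not computed in this sketch.

```python
from math import sqrt
from statistics import mean, variance


def mean_difference(group_a, group_b):
    """Unstandardized effect size: difference of the two sample means."""
    return mean(group_a) - mean(group_b)


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return mean_difference(group_a, group_b) / sqrt(pooled_var)
```

For example, with the made-up scores `[2, 4, 6, 8]` and `[1, 3, 5, 7]`, the raw difference is 1 score unit, while *d* expresses the same difference in pooled standard deviation units, which is the more useful form when the measurement scale has no practical meaning.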

Ely, M. (1999) The importance of reporting estimates and confidence intervals for statistical analyses has been well publicised in the arena of medical studies for some years now. The requirement to give confidence intervals for the main results of a study has been included in the statistical guidelines for contributors to medical journals since the 1980s and methodological points such as this are discussed in the Statistical Notes section of the British Medical Journal. If the use of quantitative methods in British sociology is to be encouraged, as Frank Bechhofer (1996) suggests is needed, it is important to have a forum for the dissemination of basic methodological issues which is accessible to researchers within the discipline. This note aims to achieve such dissemination by using an example from current research to illustrate this fundamental, but often overlooked, aspect of quantitative analysis.

Falk, R., & Greenbaum, C.W. (1995) We present a critique showing the flawed logical structure of
statistical significance tests. We then attempt to analyze why, in spite of the
faulty reasoning, the use of significance tests persists. We identify the illusion
of probabilistic proof by contradiction as a central stumbling block, because
it is based on a misleading generalization of reasoning from logic to inference
under uncertainty. We present new data from student samples and examples from
the psychological literature showing the strength and prevalence of this
illusion. We identify some intrinsic cognitive mechanisms (similarity to *modus tollens* reasoning; verbal
ambiguity in describing the meaning of significance tests; and the need to rule
out chance findings) and extrinsic social pressures which help to maintain the
illusion. We conclude by mentioning some alternative methods for presenting and
analyzing psychological data, none of which can be considered the ultimate
method.

Fan X., Thompson B. (2001) Confidence intervals for reliability coefficients can be estimated in various ways. The
present article illustrates a variety of these applications. This guidelines
editorial also promulgates a request that *EPM* authors report confidence intervals for
reliability estimates whenever they report score reliabilities and note what
interval estimation methods they have used. This will reinforce reader
understanding that all statistical estimates, including those for score
reliability, are affected by sampling error variance. And these requirements
may also facilitate understanding that tests are not impregnated with invariant
reliability as a routine part of printing.

Fidler, F. (2002) The fifth edition of the *Publication Manual of the American Psychological
Association* (APA) draws on recommendations for improving statistical practices made by the APA
Task Force on Statistical Inference (TFSI). The manual now acknowledges the
controversy over null hypothesis significance testing (NHST) and includes both
a stronger recommendation to report effect sizes and a new recommendation to
report confidence intervals. Drawing on interviews with some critics and other
interested parties, the present review identifies a number of deficiencies in
the new manual. These include lack of follow-through with appropriate
explanations and examples of how to report statistics that are now recommended.
At this stage, the discipline would be well served by a response to these
criticisms and a debate over needed modifications.

Fidler, F., Cumming, G., Mark, B., & Neil, T. (2004) Over-reliance on Null Hypothesis Significance Testing (NHST) is a serious problem in a number of disciplines, including psychology and ecology. It has the potential to damage not only the progress of these sciences but also the objects of their study. In the mid 1980s, medicine underwent a (relatively) major statistical reform. Strict editorial policy saw the number of p values in journals drop dramatically, and the rate of confidence interval reporting rise concomitantly. In psychology, a parallel change is yet to be achieved, despite half a century of debate, several editorial interventions, and even an American Psychological Association Task Force on Statistical Inference. Ecology also lags substantially behind. The nature of the editorial policies and the degree of collaboration amongst editors are important factors in explaining the varying levels of reforms in these disciplines. But without efforts to also re-write textbooks, improve software and research understanding of alternative methods, it seems unlikely that editorial initiatives will achieve substantial statistical reform.

Fidler, F., Cumming, G., Thomason, N., Pannuzzo, D., Smith, J., Fyffe, P., Edmonds, H., Harrington, C., & Schmitt, R. (2005) In 1997, Philip Kendall's editorial encouraged authors in JCCP to report effect sizes and clinical significance. The present authors assessed the influence of that editorial, and other APA initiatives to improve statistical practices, by examining 239 JCCP articles published from 1993 to 2001. For ANOVA, reporting of means and standardized effect sizes increased over that period, but the rate of effect size reporting for other types of analyses surveyed remained low. Confidence interval reporting increased little, reaching 17% in 2001. By 2001 the percentage of articles considering clinical (not only statistical) significance was 40%, compared with 36% in 1996. In a follow-up survey of JCCP authors (N=62), many expressed positive attitudes toward statistical reform, but gave little indication that they understood what was involved. Substantially improving statistical practices may require stricter editorial policies and further guidance for authors on reporting and interpreting measures.

Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004) Since the mid-1980s, confidence intervals (CIs) have been standard in medical journals. We sought lessons for psychology from medicine's experience of statistical reform by investigating two attempts by Kenneth Rothman to change statistical practices. We examined 594 American Journal of Public Health (AJPH) and 110 Epidemiology articles. Rothman's editorial instruction to report CIs and not p values was largely effective: in AJPH sole reliance on p values dropped from 63% to 5%, and CI reporting rose from 10% to 54%; Epidemiology showed even stronger compliance. However compliance was superficial: very few authors referred to CIs when discussing results. These results support what other research has indicated: editorial policy alone is not a sufficient statistical reform mechanism. Achieving substantial, desirable change will entail considerable guidance for full use of CIs and appropriate effect size measures. This will require study of researchers' understanding of CIs, improved education, and development of empirically-justified recommendations for improved statistical practice.

Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2005) The advantages of CIs make the debate worthwhile. First, looking at CIs across studies should facilitate a meta-analytic approach, leading eventually to a true parameter value, even if original expectations were wildly wrong (Schmidt, 1996). Second, a focus on estimation should improve the way psychologists theorize, and plan and conduct empirical research.

Fidler, F., & Thompson, B. (2001) Most textbooks explain how to compute confidence intervals for means, correlation coefficients, and other statistics using "central" test distributions (e.g., t, F) that are appropriate for such statistics. However, few textbooks explain how to use "noncentral" test distributions (e.g., noncentral t, noncentral F) to evaluate power or to compute confidence intervals for effect sizes. This article illustrates the computation of confidence intervals for effect sizes for some ANOVA applications; the use of intervals invoking noncentral distributions is made practical by newer software. Greater emphasis on both effect sizes and confidence intervals was recommended by the APA Task Force on Statistical Inference and is consistent with the editorial policies of the 17 journals that now explicitly require effect size reporting.

Fienberg, S. E. (2006). While Bayes' theorem has a 250-year history, and the method of inverse probability that flowed from it dominated statistical thinking into the twentieth century, the adjective "Bayesian" was not part of the statistical lexicon until relatively recently. This paper provides an overview of key Bayesian developments, beginning with Bayes' posthumously published 1763 paper and continuing up through approximately 1970, including the period of time when "Bayesian" emerged as the label of choice for those who advocated Bayesian methods.

Finch, S., Cumming, G., & Thomason, N. (2001) Reformers have long argued that misuse of Null Hypothesis Significance Testing (NHST) is
widespread and damaging. We analyzed 150 papers from the *Journal of Applied
Psychology (JAP)* covering 1940 to 1999. We examined statistical reporting
practices related to misconceptions about NHST, APA guidelines, and reform
recommendations. Our analysis reveals (a) inconsistency in reporting alpha and
p-values, (b) use of ambiguous language in describing NHST, (c) frequent acceptance
of null hypotheses without consideration of power, (d) that power estimates are
rarely reported, (e) virtually no confidence intervals. APA guidelines have
been followed only selectively. Research methodology reported in *JAP* has
increased greatly in sophistication over 60 years, but inference practices have
shown remarkable stability. There is little sign that decades of cogent
critiques by reformers had by 1999 led to changes in statistical reporting
practices in *JAP*.

Finch, S., Cumming, G., Williams, J., Palmer, L., Griffith, E., Alders, C., Anderson, J., & Goodman, O. (2004) Geoffrey Loftus, Editor of Memory & Cognition from 1994 to 1997, strongly encouraged presentation of figures with error bars and avoidance of null hypothesis significance testing (NHST). The authors examined 696 Memory & Cognition articles published before, during and after the Loftus editorship. Use of figures with bars increased to 47% under Loftus' editorship then declined. Bars were rarely used for interpretation, and NHST remained almost universal. Analysis of 309 articles in other psychology journals confirmed that Loftus' influence was most evident in the articles he published, but was otherwise limited. An email survey of authors published by Loftus revealed some support for his policy but also allegiance to traditional practices. Reform of psychologists' statistical practices would require more than editorial encouragement.

Finch, S., Thomason, N., & Cumming, G. (2002) We review the publication guidelines of the American Psychological Association (APA) since 1929 and document their advice for authors about statistical practice. Although the advice has been extended with each revision of the guidelines, it has largely focussed on Null Hypothesis Significance Testing (NHST) to the exclusion of other statistical methods. In parallel, we review over 40 years of critiques of NHST in psychology. The critiques have had little impact on the APA guidelines. The guidelines are influential in broadly shaping statistical practice, although in some cases they are not closely followed. They have an important role to play in reform of statistical practice in psychology. Following the report of the APA's Task Force on Statistical Inference, we propose that revisions of the guidelines reflect a broader philosophy of analysis and inference, provide detailed statistical requirements for reporting research, and directly address concerns about NHST. In addition the APA needs to develop ways to ensure that its editors succeed in their leadership role in achieving essential reform.

Fisher, R. A. (1959) "The subject of a probability statement if we know what we
are talking about, is singular and unique."

"We must therefore specify that if a Bayesian probability *a priori* is
available we shall use the method of Bayes, and that the first condition for
the applicability of the fiducial argument is that no probability *a priori* of the form needed for Bayes'
theorem shall be available."

"It is sometimes asserted that the fiducial method generally leads to the same results as the method of
Confidence Intervals. It is difficult to understand how this can be so, since
it has been firmly laid down that the method of confidence intervals does not
lead to probability statements about parameters."

Fisher, R.A. (1962) Some further examples are given of Bayes' method of determining probabilities *a priori* by
an experiment.

Fisher, R. A. (1990) This book brings together as a single volume three of Fisher's
most influential textbooks: *Statistical Methods for Research Workers*, *The
Design of Experiments*, and *Statistical
Methods and Scientific Inference*. Whilst the text of each is unchanged save
for the correction of minor misprints, in this new edition Dr Frank Yates has
provided a foreword which sheds fresh light on Fisher's thinking and on the
writing and reception of each of the books. Dr Yates discusses some of the key
issues tackled in the three books and reflects on how the ideas expressed have
come to permeate modern statistical practice.

Folger, R. (1989) Presents a logical justification for the following statements and discusses their implications: It is duplicitous (misleading) to use significance tests for making binary (either/or) decisions regarding the validity of a theory; the binary choice between calling results significant or not significant should not govern the confidence placed in a theory, because such confidence cannot be gained in the either/or fashion characterizing deductive certainty. The implications include grounds for describing ways that effect size estimates become useful in making judgments about the value of theories.

Freedman, L.S., Spiegelhalter, D.J., & Parmar, M.K.B. (1994) We discuss the advantages and limitations of group sequential methods for monitoring clinical trials data. We describe a Bayesian approach, based upon the use of sceptical prior distributions, that avoids some of the limitations of group sequential methods. We illustrate its use with data from a trial of levamisole plus 5-Fluorouracil for colorectal cancer.
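
To give a sense of how a sceptical prior tempers an observed treatment effect, here is a minimal sketch of a conjugate normal update, in which a prior centred on "no effect" is combined with an approximately normal estimate. The normal-normal form, the function name, and every number in the usage note are assumptions made for illustration; they are not the authors' model or trial data.

```python
from math import sqrt
from statistics import NormalDist


def normal_posterior(prior_mean, prior_sd, estimate, std_error):
    """Posterior mean and SD for a normal prior and a normal likelihood.

    Both sources of information are weighted by their precision
    (the reciprocal of their variance).
    """
    w_prior = 1 / prior_sd**2
    w_data = 1 / std_error**2
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * estimate)
    return post_mean, sqrt(post_var)


def prob_effect_below_zero(post_mean, post_sd):
    """Posterior probability that the effect is negative (a benefit, if
    the effect is a log hazard ratio, which is assumed here)."""
    return NormalDist(post_mean, post_sd).cdf(0.0)
```

With a hypothetical sceptical prior of mean 0 and SD 0.2 and a hypothetical observed log hazard ratio of -0.4 with standard error 0.2, the posterior mean is pulled halfway back toward zero (to -0.2), so an interim result that looks striking on its own is treated more cautiously.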

Freeman, P.R. (1993) The current widespread practice of using *p*-values as the main means of assessing
and reporting the results of clinical trials cannot be defended. Reasons for grave concern over the
present situation range from the unsatisfactory nature of *p*-values themselves,
their very common misunderstanding by
statisticians as well as by clinicians and their serious distorting influence
on our perception of the very nature of clinical trials. It is argued, however,
that only by fully understanding the reasons why they have become so universally
popular can we hope to change opinion and introduce more sensible ways of
summarizing and reporting results. Some of the ways in which this might happen
are discussed.

Freiman, J.A., Chalmers, T.C., Smith, H., & Kueber, R.R. (1978) Seventy-one "negative" randomized control trials were re-examined to determine if the investigators had studied large enough samples to give a high probability (>0.90) of detecting a 25 per cent and 50 per cent therapeutic improvement in the response. Sixty-seven of the trials had a greater than 10 per cent risk of missing a true 25 per cent therapeutic improvement, and with the same risk, 50 of the trials could have missed a 50 per cent improvement. Estimates of 90 per cent confidence intervals for the true improvement in each trial showed that in 57 of these "negative" trials, a potential 25 per cent improvement was possible, and 34 of the trials showed a potential 50 per cent improvement. Many of the therapies labeled as "no different from control" in trials using inadequate samples have not received a fair test. Concern for the probability of missing an important therapeutic improvement because of small sample sizes deserves more attention in the planning of clinical trials.
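
The kind of question asked in this abstract (how likely is a trial of a given size to detect a 25 per cent improvement?) can be sketched with a standard normal-approximation power formula for comparing two proportions. This is an illustrative sketch, not the authors' calculation: the two-sided 5 per cent level, the equal arm sizes, the function name, and the event rates in the usage note are all assumptions.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_two_proportions(p_control, p_treat, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions.

    Uses the pooled standard error under the null hypothesis for the
    critical value and the unpooled standard error under the alternative.
    The negligible chance of rejecting in the wrong direction is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    p_bar = (p_control + p_treat) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)
    se_alt = sqrt(
        p_control * (1 - p_control) / n_per_arm
        + p_treat * (1 - p_treat) / n_per_arm
    )
    delta = abs(p_treat - p_control)
    return z.cdf((delta - z_crit * se_null) / se_alt)
```

Taking a hypothetical control event rate of 0.40 and a 25 per cent relative reduction (to 0.30), a trial with 50 patients per arm has a power well below one half under this approximation, so a "negative" result from such a trial says little about whether the improvement exists. Calling the function with larger values of `n_per_arm` shows how quickly the risk of missing the effect falls.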

Frías, Ma.D., Pascual, J., & Garcia, J.F. (2000) Currently, there is a growing interest in the study of the sensitivity and validity of the statistical conclusions of experimental designs. Although most books on experimental design stress these issues, many students of applied psychology still do not take advantage of these advances, as can be deduced from low statistical power. The goal of this article is to examine the impact of the guidelines of the editorial boards of peer-reviewed journals with respect to the computation and interpretation of measures of effect size as well as the values of statistical significance.

Frick, R.W. (1995) This article concerns acceptance of the null hypothesis that one variable has no effect on another. Despite frequent opinions to the contrary, this null hypothesis can be correct in some situations. Appropriate criteria for accepting the null hypothesis are (1) that the null hypothesis is possible; (2) that the results are consistent with the null hypothesis; and (3) that the experiment was a good effort to find an effect. These criteria are consistent with the meta-rules for psychology. The good-effort criterion is subjective, which is somewhat undesirable, but the alternative - never accepting the null hypothesis - is neither desirable nor practicable.

Frick, R.W. (1996) The many criticisms of null hypothesis testing suggest when it
is not useful and what it should not be used for. This article explores when
and why its use is appropriate. Null hypothesis testing is insufficient when
size of effects is important, but it is ideal for testing ordinal claims
relating the order of conditions, which are common in psychology. Null
hypothesis testing also is insufficient for determining beliefs, but it is
ideal for demonstrating sufficient evidential strength to support an ordinal
claim, with sufficient evidence being 1 criterion for a finding entering the
corpus of legitimate findings in psychology. The line between sufficient and
insufficient evidence is currently set at *p*<.05;
there is little reason for allowing experimenters to select their own value of
alpha. Thus null hypothesis testing is an optimal method for demonstrating
sufficient evidence for an ordinal claim.

Fry, T.C. (1965) Contains discussions of the Bayesian point of view.

Gendreau, P. (2002) It is argued that our attempts at knowledge cumulation have been flawed in four ways. They are the eroding of "empiricism" in clinical practice, the tendency towards paradigm passion and ethnocentrism, the failure to attend to "simple" measures of effect size, and the misuse of significance testing. It is recommended that speciality designations, the replacement of significance testing with point estimates and confidence intervals, the use of practical effect size statistics, the establishment of data repositories, and a renewed focus on replication would help resolve some of these problems.

Gigerenzer, G. (1998) What Chow [Chow, 1996] calls NHSTP is an inconsistent hybrid of Fisherian and Neyman-Pearsonian ideas. In psychology, it has been practiced like ritualistic handwashing and sustained by wishful thinking about its utility. Chow argues that NHSTP is an important tool for ruling out chance as an explanation for data. I disagree. This ritual discourages theory development by providing researchers with no incentive to specify hypotheses.

Glaser, D.N. (1976) Despite the rumblings and ominous overtones of the proposed banning of the significance test, a more temperate solution has been offered by a wide array of researchers. Significance testing will always have its advocates and opponents. At this time, however, more than any other, researchers are considering the import of instituting effect sizes, confidence intervals, and power analysis alongside the traditional mode of significance testing. The recommendation is not that clinical researchers disavow significance testing, but rather that they incorporate additional information that will supplement their findings.

Gold, D. (1958) Since, at a given level of significance, statistical
significance demands a greater *degree* of relationship from a small sample than a large sample,
it might appear that the researcher can more easily treat substantively important differences by
selecting small samples rather than large samples. This, of course, is not
true. It is simply that smaller samples produce statistics more frequently
which deviate widely from parameter than do large samples. Thus the large
differences in a small sample must always be replicated in large samples to
assess substantive importance.

Gold, D. (1969) It has been contended that a test of significance can be viewed as an indication of the probability that an observed association could be generated in a given set of data by a random process model, without respect to sampling considerations. Statistical significance, in these terms, provides an explicit criterion for attributing substantive importance to the observed association. However, statistical significance is only the minimal criterion, necessary but not sufficient. In addition, the analyst must attend to the size of the association and must also make this criterion for the acceptance of the importance of the association reasonably clear. Some rules of thumb along these lines, especially useful in assessing associations among mixed variables (qualitative and quantitative), have been suggested as illustrative of a general approach to be taken.

Good, I.J. (1984) See Neyman, Scott and Smith, 1969.

Goodman, S.N. (1999) An important problem exists in the interpretation of modern
medical research data: Biological understanding and previous research play
little formal role in the interpretation of quantitative results. This
phenomenon is manifest in the discussion sections of research articles and
ultimately can affect the reliability of conclusions. The standard statistical
approach has created this situation by promoting the illusion that conclusions
can be produced with certain "error rates," without consideration of
information from outside the experiment. This statistical approach, the key
components of which are *P* values and hypothesis tests, is widely
perceived as a mathematically coherent approach to inference. There is little
appreciation in the medical community that the methodology is an amalgam of
incompatible elements, whose utility for scientific inference has been the
subject of intense debate among statisticians for almost 70 years. This article
introduces some of the key elements of that debate and traces the appeal and adverse
impact of this methodology to the *P* value fallacy, the mistaken idea
that a single number can capture both the long-run outcomes of an experiment
and the evidential meaning of a single result. This argument is made as a
prelude to the suggestion that another measure of evidence should be used: the
Bayes factor, which properly separates issues of long-run behavior from
evidential strength and allows the integration of background knowledge with
statistical findings.

Goodman, S.N. (1999) Bayesian inference is usually presented as a method for determining how scientific
belief should be modified by data. Although Bayesian methodology has been one
of the most active areas of statistical development in the past 20 years,
medical researchers have been reluctant to embrace what they perceive as a
subjective approach to data analysis. It is little understood that Bayesian
methods have a data-based core, which can be used as a calculus of evidence.
This core is the Bayes factor, which in its simplest form is also called a *likelihood
ratio*. The minimum Bayes factor is objective and can be used in lieu of the
*p* value as a measure of the evidential strength. Unlike *p* values,
Bayes factors have a sound theoretical foundation and an interpretation that
allows their use in both inference and decision making. Bayes factors show that
*p* values greatly overstate the evidence against the null hypothesis.
Most important, Bayes factors require the addition of background knowledge to
be transformed into inferences, that is, probabilities that a given conclusion is right
or wrong. They make the distinction clear between experimental evidence and
inferential conclusions while providing a framework in which to combine prior
with current evidence.
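
As a rough illustration of the contrast drawn in this abstract, the sketch below converts a two-sided *p* value into a minimum Bayes factor. The bound exp(−z²/2) for a normally distributed test statistic is an assumption brought in here (the abstract does not state a formula), and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def minimum_bayes_factor(p_two_sided: float) -> float:
    """Smallest Bayes factor (null vs. best-supported alternative).

    Assumption: the test statistic is a z score and the bound is
    exp(-z**2 / 2); neither detail is given in the abstract above.
    """
    if not 0.0 < p_two_sided < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # z score whose two-sided tail area equals the p value
    z = NormalDist().inv_cdf(1.0 - p_two_sided / 2.0)
    return math.exp(-z * z / 2.0)


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        print(f"p = {p}: minimum Bayes factor = {minimum_bayes_factor(p):.3f}")
```

Under these assumptions a *p* value of 0.05 corresponds to a minimum Bayes factor of roughly 0.15, which is noticeably weaker evidence against the null hypothesis than the figure 0.05 suggests at first sight.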

Goodman, S.N., & Berlin, J.A. (1994) Although there is a growing understanding of the importance of statistical power considerations when designing studies and of the value of confidence intervals when interpreting data, confusion exists about the reverse arrangement: the role of confidence intervals in study design and of power in interpretation. Confidence intervals should play an important role when setting sample size, and power should play no role once the data have been collected, but exactly the opposite procedure is widely practiced. In this commentary, we present the reasons why the calculation of power after a study is over is inappropriate and how confidence intervals can be used during both study design and study interpretation.

Gordon, H.R.D. (2001). The purpose of this study was to identify AVERA members' perceptions of statistical significance tests. A simple random sample was used to select 113 AVERA members for participation. The Psychometrics Group Instrument was used to collect data. Two-thirds of the respondents were males, 93% had earned a doctoral degree, 67% had more than 15 years of experience in educational research and 82.5% were employed at the university level. There was general disagreement among respondents concerning the proposition that statistical significance tests should be banned. Respondents in the study were less likely to realize that stepwise methods do not identify the best predictor set of a given size. The study also revealed that studies with non-significant results can still be very important.

Graham, J.M. (2001) The present book review of Statistics With Confidence is framed in terms of both the recent report of the APA Task Force on Statistical Inference and ongoing movements in the field. The review is structured in terms of two major issues: the interpretation of confidence intervals (null hypothesis significance testing [NHST] versus non-NHST) and the ethics of statistics.

Grainger, J., & Beauvillain, C. (1988) Where we report statistics in the form x±y, x-y and x+y are the lower and upper limits of the fiducial interval (corresponding here to the confidence interval) with a guarantee of 0.90 that the magnitude parameter lies between these two values (Rouanet & Lecoutre, 1983). The fiducial distribution is a Student's t variable with centre x corresponding to the average observed effect and scale e = x/√F, where y = 1.76e. The lower credibility limit x − y is the Bayesian interpretation of a one-sided level of the t test indicating that 95% of the distribution is greater than x − y.
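
The arithmetic described in this note can be sketched as follows, taking the scale to be e = x/√F with F the observed F ratio (an interpretation assumed here) and using the multiplier 1.76 quoted in the text. The function name and the example values are hypothetical.

```python
import math


def fiducial_interval(x: float, f_ratio: float, t_quantile: float = 1.76):
    """Return (x - y, x + y) with y = t_quantile * e and e = |x| / sqrt(F).

    Assumptions: F is the observed F ratio for the effect, and 1.76 is the
    Student t quantile quoted in the text for a 0.90 two-sided guarantee.
    """
    if f_ratio <= 0:
        raise ValueError("F ratio must be positive")
    e = abs(x) / math.sqrt(f_ratio)  # scale of the fiducial distribution
    y = t_quantile * e               # half-width of the interval
    return x - y, x + y


if __name__ == "__main__":
    lower, upper = fiducial_interval(x=30.0, f_ratio=9.0)
    print(f"30 ± {30.0 - lower:.1f} -> [{lower:.1f}, {upper:.1f}]")
```

With an observed effect of 30 and F = 9, the scale is 10 and the reported half-width is 17.6, so the lower limit 12.4 is the value that the effect exceeds with a guarantee of 0.95.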

Granaas, M. (2002) For many years null hypothesis testing (NHT) has been the dominant form of statistical analysis.

Gray, M.W. (1983) **[ Example of misinterpretation of a p-value]**
"For the 2×85 table linking departments and admission rates,

Greenwald, A.G., Gonzalez, R., Harris, R.J., & Guthrie, D. (1996) Despite publications of many well-argued critiques of null
hypothesis testing (NHT), behavioral science researchers continue to rely
heavily on this set of practices. Although we agree with most critics' catalogs
of NHT flaws, this article also takes the unusual stance of identifying virtues
that may explain why NHT continues to be so extensively used. These virtues
include providing results in the form of a dichotomous (yes/no) hypothesis
evaluation and providing an index (*p*-value)
that has a justifiable mapping onto confidence in repeatability of a null
hypothesis rejection. The most criticized flaws of NHT can be avoided when the
importance of a hypothesis, rather than the *p*
value of its test, is used to determine that a finding is worthy of report, and
when *p* ≈ .05 is treated as
insufficient basis for confidence in the replicability of an isolated non-null
finding. Together with many recent critics of NHT, we also urge reporting of
important hypothesis tests in enough descriptive detail to permit secondary
uses such as meta-analysis.

Gregson, R.A.M. (1998) Chow's [Chow, 1996] defense of NHSTP [Null Hypothesis Significance Test Procedure] ignores the fact that in psychology it is used to test substantive hypotheses in theory-corroborating research. In this role, NHSTP is not only inadequate, but damaging to the progress of psychology as a science. NHSTP does not fulfill the Popperian requirement that theories be tested severely. It also encourages nonspecific predictions and feeble theoretical formulations.

Grissom, R.J., & Kim, J.J. (2001) Estimation of the effect size parameter, D, the standardized difference between population means, is sensitive to heterogeneity of variance (heteroscedasticity), which seems to abound in psychological data. Pooling the sample variances (s²) assumes homoscedasticity, as do methods for constructing a confidence interval for D, estimating D from t or analysis of variance results, formulas that adjust estimates for inflation by main effects or covariates, and the Q statistic. The common language effect size statistic as an estimate of Pr(X1>X2), the probability that a randomly sampled member of Population 1 will outscore a randomly sampled member of Population 2, also assumes normality and homoscedasticity. Various proposed solutions are reviewed, including measures that do not make these assumptions, such as the probability of superiority estimate of Pr(X1>X2). Ways to reconceptualize effect size when treatments may affect moments such as the variance are also discussed.
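
A minimal sketch of a distribution-free estimate of Pr(X1>X2), obtained by comparing every score in one sample with every score in the other. Counting ties as one half is a convention assumed here, not something the abstract specifies, and the function name is hypothetical.

```python
from itertools import product


def probability_of_superiority(sample1, sample2, tie_weight=0.5):
    """Proportion of (a, b) pairs with a > b, a from sample1 and b from sample2.

    Assumption: ties contribute `tie_weight` (0.5 by default).
    """
    if not sample1 or not sample2:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for a, b in product(sample1, sample2):
        if a > b:
            wins += 1.0
        elif a == b:
            wins += tie_weight
    return wins / (len(sample1) * len(sample2))


if __name__ == "__main__":
    print(probability_of_superiority([3, 4, 5], [1, 2, 3]))
```

Because the estimate uses only the ordering of the scores, it requires neither normality nor equal variances.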

Grouin J.-M., Coste M., Bunouf P., Lecoutre B. (2007) The most common Bayesian methods for sample size determination (SSD) are reviewed in the non-sequential context of a confirmatory phase III trial in drug development. After recalling the regulatory viewpoint on SSD, we discuss the relevance of the various priors applied to the planning of clinical trials. We then investigate whether these Bayesian methods could compete with the usual frequentist approach to SSD and be considered as acceptable from a regulatory viewpoint.

Grunkemeier, G.L. & Payne, N. (2002) Full Bayesian analysis is an alternative statistical paradigm, as opposed to traditionally used methods, usually called frequentist statistics. Bayesian analysis is controversial because it requires assuming a prior distribution, which can be arbitrarily chosen; thus there is a subjective element, which is considered to be a major weakness. However, this could also be considered a strength since it provides a formal way of incorporating prior knowledge. Since it is flexible and permits repeated looks at evolving data, Bayesian analysis is particularly well suited to the evaluation of new medical technology. Bayesian analysis can refer to a range of things: from a simple, noncontroversial formula for inverting probabilities to an alternative approach to the philosophy of science. Its advantages include: (1) providing direct probability statements - which are what most people wrongly assume they are getting from conventional statistics; (2) formally incorporating previous information in statistical inference of a data set, a natural approach which we follow in everyday reasoning; and (3) flexible, adaptive research designs allowing multiple looks at accumulating study data. Its primary disadvantage is the element of subjectivity which some think is not scientific. We discuss and compare frequentist and Bayesian approaches and provide three examples of Bayesian analysis: (1) EKG interpretation, (2) a coin-tossing experiment, and (3) assessing the thromboembolic risk of a new mechanical heart valve.

Hager, W. (2000) The function and potential importance of statistical tests in examining and evaluating substantive and psychological hypotheses is discussed. Psychological hypotheses are sharply distinguished from statistical hypotheses. Decisions on statistical hypotheses must be separated from decisions on psychological hypotheses. Some differences between both kinds of hypotheses are addressed, and the question is attacked whether they are complementary or not. The answer to this question is negative. The use of the modus tollens in theory corroboration is discussed and it is argued that the evaluation of substantive hypotheses is always accompanied by some inductive aspects. A further reason for the discontent with statistical tests is identified: most of these tests are rather insensitive to the differential patterns of predictions and of data, whereas differential patterns of data can be derived from nearly every psychological hypothesis or theory. To test for these differential patterns statistical tests should be applied thoughtfully, not routinely.

Hagood, M.J. (1941) For developments in statistical theory of the last decade or
two have shown the tests formerly used to be incorrect, and those who are using
as guides texts published 10 years or more ago are likely to be using
unacceptable tests of significance for their correlation coefficients. The most
common test of significance answers for the universe the question as to whether
or not association *exists* in the universe - that is, it investigates for the universe the first aspect of
association.

Hancock, G.R., & Freeman M.J. (2001) Targeted toward the applied modeler, this article provides select power and sample size tables and interpolation strategies associated with the root mean square error of approximation test of not close fit under standard assumed conditions. It is hoped that researchers conducting structural equation modeling will be better informed as to power limitations when testing a model given a particular available sample size or, better yet, that they will heed the sample size recommendations contained herein when planning their study to ensure the most accurate assessment of the degree of close fit between data and model.

Hardy, A., Harvie, P., & Koestler, A. (1973) **[ Examples of misinterpretation of a p-value]**
"Taken altogether, the receivers scored significantly beyond chance [...] with a calculated probability of 3,000 to
1 against it being just chance." (page 117) "The lady passed the
ordeal with flying colors: she correctly identified the method of pouring for
all eight cups, with odds against chance of one in seventy"
(page 236) [Quoted by Falk and Greenbaum, 1995, page 82]
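
The figure of one in seventy in the second quotation can be reproduced under the classic design, assumed here, of eight cups with four poured each way and a taster who knows this. The sketch computes the probability of a perfect identification given pure guessing; that is a probability of the data under the chance hypothesis, not the odds that chance was at work.

```python
from fractions import Fraction
from math import comb


def prob_all_correct_by_guessing(cups: int = 8, milk_first: int = 4) -> Fraction:
    """P(perfect identification | guessing).

    Assumption: the taster picks which `milk_first` of the `cups` were
    poured one way, and every such selection is equally likely.
    """
    return Fraction(1, comb(cups, milk_first))


if __name__ == "__main__":
    print(prob_all_correct_by_guessing())
```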

Harlow, L.L., Mulaik, S.A., & Steiger, J.H. (Eds.) (1997) This book is the result of a spirited debate stimulated by a recent
meeting of the Society of Multivariate Experimental Psychology. Although the
viewpoints span a range of perspectives, the overriding theme that emerges
states that significance testing may still be useful if supplemented with some
or all of the following -- Bayesian logic, caution, confidence intervals,
effect sizes and power, other goodness of approximation measures, replication
and meta-analysis, sound reasoning, and theory appraisal and corroboration. The
book is organized into five general areas. The first presents an overview of
significance testing issues that synthesizes the highlights of the remainder of
the book. The next discusses the debate in which significance testing should be
rejected or retained. The third outlines various methods that may supplement
current significance testing procedures. The fourth discusses Bayesian
approaches and methods and the use of confidence intervals versus significance
tests. The last presents the philosophy of science perspectives. Rather than
providing definitive prescriptions, the chapters are largely suggestive of
general issues, concerns, and application guidelines. The editors allow
readers to choose the best way to conduct hypothesis testing in their
respective fields.
**Contents:** Preface -- Part I, Overview; Harlow, L.L., Significance testing
introduction and overview -- Part II, The debate: against and for significance
testing; Cohen, J., The earth is round (p<.05); Schmidt, F.L., & Hunter,
J., Eight common but false objections to the discontinuation of significance
testing in the analysis of research data; Mulaik, S.A., Raju, N.S., &
Harshman, R., There is a time and place for significance testing; Abelson,
R.P., A retrospective on the significance test ban of 1999 (if there were no
significance tests, they would be invented) -- Part III, Suggested alternatives
to significance testing; Harris, R.J., Reforming significance testing via
three-valued logic; Rossi, J.S., A case study in the failure of psychology as a
cumulative science: The spontaneous recovery of verbal learning; McDonald,
R.P., Goodness of approximation in the linear model; Steiger, J.H., &
Fouladi, R.T., Noncentrality interval estimation and the evaluation of
statistical models; Reichardt, C.S., & Gollob, H.F., When confidence
intervals should be used instead of statistical significance tests, and vice
versa -- Part IV, A Bayesian approach to hypothesis testing; Pruzek, R.M., An
introduction to Bayesian inference and its application; Rindskopf, D., Testing
"small", not null, hypotheses: Classical and Bayesian approaches --
Part V, Philosophy of science issues; Rozeboom, W.W., Good science is
abductive, not hypothetico-deductive; Meehl, P.E., The problem is epistemology,
not statistics: Replace significance tests by confidence intervals and quantify
accuracy of risky numerical predictions.

Harris, M.J. (1991) Chow (1991) distinguishes between "practical impact" and "conceptual rigor" research, and he concludes that effect size estimation is useful only in practical impact research. I argue that significance tests do not answer substantive questions about the data and are useful only as a check that the results are unlikely to have occurred by chance. Chow's decision to regard the similarity between data and prediction as being a dichotomous judgment made on the basis of significance testing is therefore unwise. I conclude that effect sizes are the single best index of the relationship between theoretical predictions and the obtained data. The role of replications and meta-analysis in advancing theory is also discussed.

Harris, R.J. (1997) The many and frequent misinterpretations of null hypothesis significance testing (NHST) are fostered by continuing to present NHST logic as a choice between only two hypotheses. Moreover, the proposed alternatives to NHST are just as susceptible to misinterpretation as is (two-valued) NHST. Misinterpretations could, however, be greatly reduced by adopting Kaiser's (1960) proposal of three-alternative hypothesis testing in place of the traditional two-alternative presentation. The real purpose of significance testing (NHST) is to establish whether we have enough evidence to be confident of the sign (direction) of the effect we're testing in the population. This is an important contribution to the cumulation of scientific knowledge and should be retained in any replacement system. Confidence intervals (CIs - the proposed alternative to significance tests in single studies) can provide this control, but when so used they are subject to exactly the same Type I, Type II, and Type III (statistical significance in the wrong direction) error rates as significance testing. There are still areas of research where NHST alone would be a considerable improvement over the current lack of awareness of error variance. Further, there are two pieces of information (namely, maximum probability of a Type III error and probability of a successful exact replication) provided by p values that are not easily gleaned from confidence intervals. Suggestions are offered for greatly increasing attention to power considerations and for eliminating the positive bias in estimates of effect-size magnitudes induced when we make statistical significance a necessary condition of publication.

Hauschke, D., & Steinijans, V.W. (1996) The purpose of this communication is to point out that it is generally not correct to use the conventional approach for testing therapeutic equivalence of two treatments.

Heldref Foundation (1997) Authors are required to
report and interpret magnitude-of-effect measures in conjunction with every *p* value
that is reported.

Henri, V. (1898) **[ Examples of fallacious statements
about significance levels]** "We assert with a probability
equal to 0.95 that the difference is not due to chance." "We shall say
that this difference is produced by chance, and our assertion has a
probability of being correct greater than 98%." [quoted by Rouanet & Bru, 1994]

Hirsch, L.S., & O'Donnell, A.M. (2001) **[ Example of interpretation of a nonsignificant result as a proof of the null hypothesis]**
"Further, two additional 2×2 chi-square tests found class status (graduate vs. undergraduate) to be independent of
whether students appear to hold misconceptions (chi2=3.5,

Hodges, J.L., & Lehmann, E.L. (1954) The distinction between statistical significance and material significance in hypotheses testing is discussed. Modifications of the customary tests, in order to test for the absence of material significance, are derived for several parametric problems, for the chi-square test of goodness of fit, and for Student's hypothesis. The latter permits one to test the hypothesis that the means of two normal populations of equal variance, do not differ by more than a stated amount.

Hoenig, J.M., & Heisey, D.M. (2001) It is well known that statistical power calculations can be valuable in planning an experiment. There is also a large literature advocating that power calculations be made whenever one performs a statistical test of a hypothesis and one obtains a statistically nonsignificant result. Advocates of such post-experiment power calculations claim the calculations should be used to aid in the interpretation of the experimental results. This approach, which appears in various forms, is fundamentally flawed. We document that the problem is extensive and present arguments to demonstrate the flaw in the logic.
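
One way to see the flaw: for a two-sided z test (chosen here for simplicity; the abstract names no particular test), power evaluated at the observed effect is a one-to-one function of the *p* value and therefore carries no additional information. The sketch below is a hypothetical illustration under that assumption.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def observed_power(p_two_sided: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z test, evaluated at the observed effect.

    Assumption: the true standardized effect is set equal to the observed
    z score, which is what post-experiment power calculations do.
    """
    if not 0.0 < p_two_sided < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    z_obs = _STD_NORMAL.inv_cdf(1.0 - p_two_sided / 2.0)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Probability of landing in either rejection region
    return _STD_NORMAL.cdf(z_obs - z_crit) + _STD_NORMAL.cdf(-z_obs - z_crit)


if __name__ == "__main__":
    for p in (0.05, 0.10, 0.50):
        print(f"p = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

A result sitting exactly at the significance threshold always has observed power close to one half, and larger *p* values always give lower observed power, so a nonsignificant result can never be accompanied by high observed power.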

Hogben, L.T. (1957) **The contemporary crisis of the uncertainty of uncertain inferences** - We
may have to reinstate statistics as continental demographers use the term.
Laboratory experiments will have to stand on their own without protection from
a façade of irrelevant computations. Sociologists will have to use their
brains. In my view, science will not suffer.

Howard, G.S., Maxwell, S.E., & Fleming, K.J. (2000) Some methodologists have recently suggested that scientific psychology's overreliance on null hypothesis significance testing (NHST) impedes the progress of the discipline. In response, a number of defenders have maintained that NHST continues to play a vital role in psychological research. Both sides of the argument to date have been presented abstractly. The authors take a different approach to this issue by illustrating the use of NHST along with 2 possible alternatives (meta-analysis as a primary data analysis strategy and Bayesian approaches) in a series of 3 studies. Comparing and contrasting the approaches on actual data brings out the strengths and weaknesses of each approach. The exercise demonstrates that the approaches are not mutually exclusive but instead can be used to complement one another.

Hresko, W. (2000) The APA *Publication Manual* cites the need for including effect-size information in manuscripts
utilizing quantitative data analysis techniques [...] If authors do not include
this information in submitted manuscripts (and the manuscript is based on a
quantitative research design), the author(s) will be asked to provide this
information should the manuscript be recommended for publication or revision
and publication.

International Committee of Medical Journal Editors (1991) Updated author guidelines from the International Committee of Medical Journal Editors including a section on writing statistics.

Iversen, G.R. (1998) The second half of this century has witnessed a very fruitful debate among statisticians about the relative merits of Bayesian and classical statistical inference. Neither side can claim a victory in this debate, since there is no way of proving that one approach is more correct than the other. But the debate has served the purpose of illuminating the strengths and weaknesses of each approach. The students we teach are to a very large extent exposed only to classical statistical inference. This is a choice made by their instructors, meaning all of us. This spring, in a small group of students studying both approaches I probed for their opinions of the two approaches. Not surprisingly, the Bayesian approach was well received by the students, even though they also had some misgivings about Bayesian statistics, at least early on.

Iversen, G.R. (2000) Statistical methods have an impact on the results of any statistical study. We do not always realize that the statistical methods act in such a way as to create a construction of the world. We should therefore be more aware of the role of statistics in research, and the question is not so much about what we teach researchers but that we train them to be aware of the impact of the methods they use. This becomes particularly important in statistical inference where we have the choice between the classical, frequentist approach and the Bayesian approach. The two approaches create very different views of the world. The word probability carries with it a notion of uncertainty, and it is tempting to think that the uncertainty refers to parameters and not simply data.

Jaynes, E.T. (2003) The following material is addressed to readers who are already
familiar with applied mathematics at the advanced undergraduate level or
preferably higher; and with some field, such as physics, chemistry, biology,
geology, medicine, economics, sociology, engineering, operations research,
etc., where inference is needed. A previous acquaintance with
probability and statistics is not necessary; indeed, a certain amount of
innocence in this area may be desirable, because there will be less to unlearn.

We are concerned with probability theory and all of its conventional mathematics, but now viewed in a wider
context than that of the standard textbooks. Every Chapter after the first has
"new" (i.e., not previously published) results that we think will be found
interesting and useful. Many of our applications lie outside the scope of
conventional probability theory as currently taught. But we think that the results
will speak for themselves, and that something like the theory expounded here will become the
conventional probability theory of the future.

Jefferys, H. (1990) Data from experiments that use random event generators are
usually analyzed by classical (frequentist) statistical tests, which summarize
the statistical significance of the test statistic as a *p*-value.
However, classical statistical tests are frequently inappropriate to these
data, and the resulting *p*-values can grossly overestimate the
significance of the result. Bayesian analysis shows that a small *p*-value
may not provide credible evidence that an anomalous phenomenon exists. An
easily applied alternative methodology is described and applied to an example
from the literature.

Jefferys, H. (1992) Dobyns' article [*Journal of Scientific Exploration*, *6*,
no 1] suggests some reasons why orthodox statistics might be superior to
Bayesian statistics when discussing random event generator statistics. Several
of his main arguments are examined and discussed.

Jefferys, H. (1995) In a recent column in this journal, Cooper (1994) stated that
"The *p*-value is the probability that the results could have
occurred by pure chance given that the null (conventional) hypothesis is
true." This definition is incorrect and highly misleading, although
similar statements are often found in the literature... A correct definition of
the *p*-value is that it is the probability of obtaining the actual result
we did, *or any more extreme result*, given that the null (conventional)
hypothesis is true.
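
The difference between the two definitions can be made concrete with a small binomial example. The sketch below uses hypothetical data (8 successes in 10 trials, null probability 0.5) and hypothetical function names; it contrasts the probability of the exact result with the probability of that result or any more extreme one.

```python
from math import comb


def binomial_point_probability(successes: int, trials: int, p_null: float = 0.5) -> float:
    """P(X == successes | H0): probability of the actual result only."""
    return comb(trials, successes) * p_null**successes * (1 - p_null) ** (trials - successes)


def binomial_upper_tail_p(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p value: P(X >= successes | H0), the actual result or any more extreme."""
    return sum(
        binomial_point_probability(k, trials, p_null)
        for k in range(successes, trials + 1)
    )


if __name__ == "__main__":
    print("point probability:", binomial_point_probability(8, 10))
    print("one-sided p value:", binomial_upper_tail_p(8, 10))
```

The two numbers differ (45/1024 against 56/1024 in this example), and neither is the probability that the null hypothesis is true.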

Jefferys, H. (1995) In response to my letter (Jefferys 1995), Dobyns and Jahn
(1995) responded that my objection to their incorrect definition of *p*-values
is "trivial" and mere "pedantic quibbling." It is easy to
convince oneself that this is not the case.

Johns, D., & Andersen, J.S. (1990) Predictive probability is particularly useful in aiding a decision-making process related to drug development. This is especially true for decisions occurring as the result of interim analysis of clinical trials. Examples of clinical trial applications of Bayesian predictive probability and the use of the beta-binomial distribution are described.
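
A minimal sketch of a beta-binomial predictive probability at an interim analysis, assuming a binary endpoint, a Beta prior on the success rate, and a final success criterion expressed as a minimum number of successes. All names, the uniform default prior, and the example figures are hypothetical and are not taken from the paper.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k: int, n: int, a: float, b: float) -> float:
    """P(k successes in n future trials) when the rate follows Beta(a, b)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def predictive_probability(successes_so_far: int, n_so_far: int, n_remaining: int,
                           target_total: int, prior_a: float = 1.0,
                           prior_b: float = 1.0) -> float:
    """P(final number of successes >= target_total | interim data)."""
    # Posterior after the interim data is Beta(a, b)
    a = prior_a + successes_so_far
    b = prior_b + n_so_far - successes_so_far
    needed = max(target_total - successes_so_far, 0)
    return sum(beta_binomial_pmf(k, n_remaining, a, b)
               for k in range(needed, n_remaining + 1))


if __name__ == "__main__":
    # Hypothetical interim look: 12 successes in 20 patients, 20 more to enrol,
    # success declared if at least 24 of the 40 respond.
    print(round(predictive_probability(12, 20, 20, 24), 3))
```

A low predictive probability at the interim look would suggest that continuing the trial is unlikely to change the conclusion, which is the kind of decision support the abstract describes.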

Johnson, D.H. (1998) Wildlife biologists recently have been subjected to the credo
that if you're not testing hypotheses, you're not doing real science. To
protect themselves against rejection by journal editors, authors cloak their
findings in an armor of *P* values.
I contend that much statistical hypothesis testing is misguided. Virtually
all null hypotheses tested are, in fact, false; the only issue is whether or
not the sample size is sufficiently large to show it. No matter if it is or
not, one then gets led into the quagmire of deciding biological significance
versus statistical significance. Most often, parameter estimation is a more
appropriate tool than statistical hypothesis testing. Statistical hypothesis
testing should be distinguished from scientific hypothesis testing, in which
truly viable alternative hypotheses are evaluated in a real attempt to falsify
them. The latter method is part of the deductive logic of strong inference,
which is better-suited to simple systems. Ecological systems are complex, with
components typically influenced by many factors, whose influences often vary in
place and time. Competing hypotheses in ecology rarely can be falsified and
eliminated. Wildlife biologists perhaps adopt hypothesis tests in order to make
what are really descriptive studies appear as scientific as those in the
"hard" sciences. Rather than attempting to falsify hypotheses, it may
be more productive to understand the relative importance of multiple factors.

Johnson, D.H. (1999) Despite their wide use in scientific journals such as The Journal of Wildlife Management, statistical hypothesis tests add very little value to the products of research. Indeed, they frequently confuse the interpretation of data. This paper describes how statistical hypothesis tests are often viewed, and then contrasts that interpretation with the correct one. I discuss the arbitrariness of P-values, conclusions that the null hypothesis is true, power analysis, and distinctions between statistical and biological significance. Statistical hypothesis testing, in which the null hypothesis about the properties of a population is almost always known a priori to be false, is contrasted with scientific hypothesis testing, which examines a credible null hypothesis about phenomena in nature. More meaningful alternatives are briefly outlined, including estimation and confidence intervals for determining the importance of factors, decision theory for guiding actions in the face of uncertainty, and Bayesian approaches to hypothesis testing and other statistical practices.

Johnstone, D.J. (1988) [...] Oakes [Oakes, 1986] misrepresents Fisher's position on
points of logic. There is also some overstatement of the case for confidence
intervals. More interesting is the author's positive explanation for the widespread
acceptance of significance tests among applied researchers, for there is no
less settled logic or scheme of inference within theoretical statistics, as
instantiated by the current papers of Casella and Berger (1987) and Berger and
Sellke (1987) in the *Journal of the
American Statistical Association*. That research workers in applied fields
continue to use significance tests routinely may be explained by forces of
supply and demand in the market for statistical evidence where the commodity
traded is not so much evidence, but "statistical significance".

Johnstone, D.J., & Lindley, D.V. (1995) In empirical research in the social sciences expressions of statistical significance are meant to capture and summarise the evidence implied by data. To evaluate the evidential content of statements such as "the difference between means is significant at alpha = 5%", we consider the Bayesian probability of the hypotheses tested, where the conditioning event is an announcement of general form significant at alpha. By proceeding as if neither observed effects nor their exact P-values are reported, the meaning of such descriptions of themselves is revealed. It is demonstrated, for large samples particularly, that a report merely that data are significant at alpha has no objective meaning, and under some conditions should be interpreted not as evidence against the null hypothesis, as is usually supposed, but as strong evidence in its favor. This conclusion is supported by both algebraic arguments and example calculations for the special, but important case of the normal mean. It is also found that significance at one level tends to imply significance at much lower levels, the more strongly the larger the sample.

Jones, L.V., & Tukey, J.W. (2000) The conventional procedure for null hypothesis significance testing has long been the target of appropriate criticism. A more reasonable alternative is proposed, one that not only avoids the unrealistic postulation of a null hypothesis but also, for a given parametric difference and a given error probability, is more likely to report the detection of that difference.

Kadane, J.B. (1995) This paper reviews what Bayesian statistics is and gives pointers to the literature. The need for a subjectively determined prior distribution, likelihood, and loss function is often taken to be the principal disadvantage of Bayesian statistics. This paper argues that the requirement that these be explicitly stated is a distinct Bayesian advantage. Advances in Bayesian technology make it ready now to be the main inferential tool for clinical trials.

Kahneman, D., & Tversky, A. (1972) This paper explores a heuristic - *representativeness* - according to which the subjective probability
of an event, or a sample, is determined by the degree to which it: (i) is
similar in essential characteristics to its parent population; and
(ii) reflects the salient features of the process by which it is
generated. This heuristic is explicated in a series of empirical examples
demonstrating predictable and systematic errors in the evaluation of uncertain
events. In particular, since sample size does not represent any property of
the population, it is expected to have little or no effect on judgment of
likelihood. This prediction is confirmed in studies showing that subjective
sampling distributions and posterior probability judgments are determined by
the most salient characteristic of the sample (e.g., proportion, mean) without
regard to the size of the sample. The present heuristic approach is contrasted
with the normative (Bayesian) approach to the analysis of the judgment of uncertainty.

Kendall, P. (1957) The reader will find that no traditional significance tests have been reported in connection with the statistical results in this volume. This is intentional policy rather than accidental oversight.

Kendall, P.C. (1997) Evaluations of the outcomes of psychological treatments are favorably enhanced when the published report includes not only statistical significance and the required effect size but also a consideration of clinical significance. (page 3).

Kieffer, K.M., Reese, R.J., & Thompson, B. (2000) The authors of the present methodological review investigated the patterns of statistical usage and reporting practices in 756 articles published in the American Educational Research Journal (AERJ) and in the Journal of Counseling Psychology (JCP) over a 10-year period. First, some findings from other similar reviews are summarized. Second, the authors present a framework for characterizing selected research practices that emphasizes, in part, elements of the recent report of the American Psychological Association (APA) Task Force on Statistical Inference (Wilkinson and APA Task Force on Statistical Inference, 1999). Third, characterizations of 10 years of analytic practices in 2 journals are presented and evaluated within that framework. The article concludes with a discussion of the changes that may be necessary to improve the statistical state of affairs in behavioral research.

Killeen (2005a) The statistic *p*_{rep} estimates the probability of replicating an effect. It captures traditional publication criteria for signal-to-noise ratio, while avoiding parametric inference and the resulting Bayesian dilemma. In concert with effect size and replication intervals, *p*_{rep} provides all of the information now used in evaluating research, while avoiding many of the pitfalls of traditional statistical inference.
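
As an illustration, the commonly cited normal-theory form of this statistic can be computed from a one-tailed p-value. The sketch below assumes that form (the z-score of the observed effect divided by the square root of 2, then mapped back through the normal distribution function); the function name `p_rep` is ours, and the abstract itself gives no formula.

```python
from math import sqrt
from statistics import NormalDist


def p_rep(p_one_tailed):
    """Normal-theory replication probability from a one-tailed p-value (0 < p < 1)."""
    z = NormalDist().inv_cdf(1.0 - p_one_tailed)  # z-score of the observed effect
    # The predictive variance of a replicate effect is twice the sampling
    # variance of a single effect, hence the division by sqrt(2).
    return NormalDist().cdf(z / sqrt(2.0))


print(round(p_rep(0.025), 3))
```

A one-tailed p of 0.5 (no observed effect) gives a replication probability of 0.5, and smaller p-values give values closer to 1.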

Killeen (2005b) All commentaries concern priors. In this issue of Psychological Science, Cumming graphically demonstrates the implications of our ignorance of delta. Doros and Geier found mistakes in my argument and provide the Bayesian account. Macdonald notes that my program is like Fisher's, Fisher's is like the Bayesians', and the Bayesians' is incoherent. These Commentaries strengthen the foundation while leaving all conclusions intact.

Killeen (2005c) Replicability analysis, like traditional statistical analysis, is only half the story. Effect sizes are equally important, and should always be reported. An optimal inferential procedure would integrate effect sizes with the probability of replication, to achieve a true scientific decision theory. Presenting effect sizes in terms of a confidence interval is less than optimal, because confidence intervals are the alter-ego of NHST, and inherit the same difficulties of interpretation.

Killeen (2006) Bayesian analysis is ideal for aggregating information, as likelihood analysis is for comparing alternative models that make theoretically motivated predictions. For all other cases, *p*_{rep} permits the evaluation of results without bias from arbitrary priors and ad hoc alternatives.

Kirk, R.E. (1995) [*Example of
interpretation of frequentist confidence intervals in terms of probabilities
about parameters*] [*Ambiguous formulation*] "A random
sample can be used to specify a segment or interval on the number line such
that the parameter has a high probability of lying on the segment. The segment
is called a confidence interval." (page 42).

Kirk, R.E. (1996) Statistical significance is concerned with whether a research result is due to chance or sampling variability; practical significance is concerned with whether the result is useful in the real world. A growing awareness of the limitations of null hypothesis significance tests has led to a search for ways to supplement these procedures. A variety of supplementary measures of effect magnitude have been proposed. The use of these procedures in four APA journals is examined, and an approach to assessing the practical significance of data is described.

Kish, L. (1959) I intend to touch on several problems dealing with the
interplay of statistics with the more general problems of scientific inference.
[...] The aim of this paper is not a profound analysis, but a clear elementary
treatment of several related problems. The literature references contain more
thorough treatments. Moreover, these are not *all* problems in this area, nor even necessarily the most important
ones; the reader may find that his favorite, his most problem, has been
omitted. The problems selected are a group with a common core, they arise
frequently, yet they are widely misunderstood [Statistical tests of survey
data; Experiments, survey, and other investigations; Some misuses of
statistical tests].

Knapp, T. R. (1998). This review assumes a middle-of-the-road position regarding the controversy. The author expresses that significance tests have their place, but generally prefers confidence intervals. His remarks concentrate on ten errors of commission or omission that, in his opinion, weaken the arguments. These possible errors include using the jackknife and bootstrap procedures for replicability purposes, omitting key references, misrepresenting the null hypothesis, omitting the weaknesses of confidence intervals, ignoring the difference between a hypothesized effect size and an obtained effect size, erroneously assuming a linear relationship between p and F, claiming Cohen chose power level arbitrarily, referring to the "reliability of a study", inferring that inferential statistics are primarily for experiments, and recommending "what if" analyses.

Kotrlick, J.W. (2000) Authors should report effect sizes in the manuscript and tables when reporting statistical significance.

Krantz, D.H. (1999) A controversy concerning the usefulness of "null" hypothesis tests in scientific inference has continued in articles within psychology since 1960 and has recently come to a head, with serious proposals offered for a test ban or something close to it. This article sketches some of the views of statistical theory and practice among different groups of psychologists, reviews a recent book offering multiple perspectives on null hypothesis tests, and argues that the debate within psychology is a symptom of serious incompleteness in the foundations of statistics.

Krueger, J. (1998) Can research on social-perceptual biases benefit from improved and diversified statistical methods? Having reached the brink of nihilism, I conclude that (a) any point-hypothesis can be rejected by null hypothesis significance testing (NHST), (b) any such hypothesis can be accepted by Bayesian inference, (c) effect size estimates are meaningful only if that meaning is imported from extra-statistical considerations, and (d) taxonomies of biases and their causes will be messy because most biases are overdetermined.

Krueger, J. (1998) Social psychology has painted a picture of human misbehavior and irrational thinking. For example, prominent social cognitive biases are said to distort consensus estimation, self perception, and causal attribution. The thesis of this target article is that the roots of this negativistic paradigm lie in the joint application of narrow normative theories and statistical testing methods designed to reject those theories. Suggestions for balancing the prevalent paradigm include (a) modifications to the ruling rituals of Null Hypothesis Significance Testing, (b) revisions of what is considered a normative response, and (c) increased emphasis on individual differences in judgment.

Krueger, J., & Funder, D.C. (2001) Mainstream social psychology focuses on how people characteristically violate norms of action through social misbehaviors such as conformity with false majority judgments, destructive obedience, and failures to help those in need. Likewise, they are seen to violate norms of reasoning through cognitive errors such as misuse of social information, self-enhancement, and an over-readiness to attribute dispositional characteristics. The causes of this negative research emphasis include the apparent informativeness of norm violation, the status of good behavior and judgment as unconfirmable null hypotheses, and the allure of counter-intuitive findings. The shortcomings of this orientation include frequently erroneous imputations of error, findings of mutually contradictory errors, incoherent interpretations of error, an inability to explain the sources of behavioral or cognitive achievement, and the inhibition of generalized theory. Possible remedies include increased attention to the complete range of behavior and judgmental accomplishment, analytic reforms emphasizing effect sizes and Bayesian inference, and a theoretical paradigm able to account for both the sources of accomplishment and of error. A more balanced social psychology would yield not only a more positive view of human nature, but also an improved understanding of the bases of good behavior and accurate judgment, coherent explanations of occasional lapses, and theoretically grounded suggestions for improvement.

Lecoutre, B. (1981) Cet article montre, à partir d'exemples
concrets, comment les *procédures fiducio-bayésiennes* permettent l'investigation des mécanismes individuels,
en fournissant des résultats inférentiels, non seulement sur l'*effet moyen*, mais aussi sur les *effets individuels*. Techniquement, ces
procédures sont développées pour l'effet associé à un contraste et pour l'effet
associé à une comparaison (à un nombre quelconque de degrés de liberté) dans un
plan du type S*T (Sujets*Traitements).
*Bayes-fiducial procedures for investigating individual mechanisms in Psychology*

Illustrates, with concrete examples, how *Bayes-fiducial procedures* allow the
investigation of individual mechanisms by yielding inferential results, not
only about the *mean effect*, but also about *individual effects*. Technically,
these procedures have been developed for the effect associated with a contrast
and for the effect associated with a comparison (with any number of degrees of freedom) in
a S*T (Subjects*Treatments) design.

Lecoutre, B. (1984) This book belongs to a line of research, born in France in the
1970s from the work of H. Rouanet and D. Lépine, that recasts the traditional
methods for the statistical analysis of experimental data on the basis of an
algebraic formalization. Researchers' data are, as a rule, *structured* data;
the *formalization* of these structures, closely tied to the data collection
design (experimental design or survey design), provides a framework for the
questions the researcher asks about the data. The problem of the
*generalizability* of conclusions cannot be avoided; the idea of *specific
inference* makes it possible, within each situation, to apply inferential
procedures suited to the structures involved in that situation. The aim is to
provide procedures that meet the real objectives of induction; *Bayesian
procedures*, viewed as an extension of the usual *significance tests*, make it
possible in particular to draw conclusions about the magnitude of each effect
examined, and not only about its existence; in particular, *Bayes-fiducial
procedures* express, for each question raised by the researcher, "what the data
have to say", independently of any outside information. The result is a new
construction, increasingly autonomous from the traditional Anglo-Saxon
developments of the analysis of variance: the *Bayesian Analysis of Comparisons*,
so called because the formalized notion of comparison plays a central role in it.

Lecoutre, B. (1985) It is shown, in the case of the inference on a contrast between means, how Bayes-fiducial analyses can be carried out, given only the observed effect and the significance level; Bayes-fiducial limits can be obtained immediately by means of tables, in order to establish whether an effect is negligible or notable. The role of significance testing in experimental methodology is thus discussed as far as the generalizability of descriptive conclusions about the magnitude of effects is concerned.

Lecoutre, B. (1985) The usual F-test of the analysis of variance is reconsidered within the Bayesian framework, in terms of predictive distributions. This leads to the notion of semi-Bayesian significance test, so called because it consists in only probabilizing the space of nuisance parameters, thus bringing a general principle for "eliminating" nuisance parameters, or more exactly incorporating information about these parameters. The approach is shown to extend the F-tests, by allowing the testing of hypotheses of non-zero effects.

Lecoutre, B. (1994) On examine l'utilisation et les apports de
l'inférence statistique dans l'étude des raisonnements inductifs. On montre que
certains aspects de ne sont pas toujours clairement pris en compte. En
particulier on a souvent utilisé une approche exclusivement normative de
l'inférence bayésienne, alors que celle-ci est en fait une construction
beaucoup plus souple et beaucoup plus élaborée qu'il peut apparaître. On
insiste sur la nécessité de l'articulation d'une approche normative et d'une
approche descriptive, visant à étudier la cohérence des réponses, plutôt que leur exactitude.
*Statistical inference and inductive reasoning*

The use and the contribution of statistical inference in studying inductive reasoning is
investigated. It is shown that some aspects are not always clearly taken into
account. In particular, an exclusively normative use of Bayesian inference has
often been involved. Bayesian inference is in fact a more flexible and
elaborate construction than it may appear. Furthermore, the need for articulating
a normative approach and a descriptive approach, in order to study the
coherence of the responses rather than their accuracy, is stressed.

Lecoutre, B. (1996) This book offers users of the analysis of variance a
practical, realistic, and constructive approach to statistical inference that
gives them a fresh look at their data. Standard Bayesian procedures are as
objective and as simple to use as the traditional procedures (the familiar *t*
or *F* significance tests, confidence intervals). Incorporating the latter,
they shed light on their difficulties and shortcomings, and thoroughly renew
the methodology of the statistical treatment of experimental data.

Concrete answers are thus given to questions that are essential in practice:
*Interpretation* - How should statistical inference procedures be interpreted
correctly? *Magnitude of effects* - How can the magnitude of an effect be
judged ("clinical, psychological [...] significance" versus "statistical
significance")? Can one "prove the null hypothesis" of no effect when it is the
research hypothesis? *Real contribution of the data* - How can one assess "what
the data have to say" and examine to what extent additional information would
call the conclusions into question? *Complex experimental designs* - How should
widely used complex experimental designs, such as repeated-measures or
cross-over designs, be analyzed? *Validity conditions* - How can means be
compared without assuming equal variances? *Choice of sample sizes* - How can
one determine the sample sizes needed to "have a good chance" of reaching a
given conclusion?

The methods are presented through real examples. The Windows computer
programs, instructive and user-friendly, allow every procedure to be applied
interactively and very simply as it is introduced. The book thus offers an
original design that makes it a valuable tool for most readers, suitable both
for an introduction to the analysis of experimental data and for sophisticated
applications.
**Contents:** Some elements for reflection -- Du

Lecoutre, B. (1996) In fact we do not only need (or we do not even need...) a
crude decision procedure that concerns only the value zero and tells us nothing
about the real magnitude of the correlation. We must also be able to "test"
other values, and more simply to obtain a "range" that lets us genuinely assess
the information provided by the data. We shall recall that, in the most common
cases of numerical data analysis, moving from the usual test to this range is
immediate. Of course this range will have to be justified and interpreted; the
reader may be pleased to learn that it can be regarded as a (*frequentist*)
confidence interval, as a *fiducial* interval, or as a standard *Bayesian*
credibility interval. In what follows we shall simply call it the "interval",
leaving the reader free to choose a *framework of justification and interpretation*.

Lecoutre, B. (1997) The purpose of this article is to guide readers who are
unfamiliar with Bayesian inference in discovering it. Four ideas may motivate
this discovery: Bayesian inference is not recent; it appears to be superior on
theoretical grounds; it is a natural form of inference; it will become easier
and easier to use. The presentation will be very partial (and biased), with all
the omissions and shortcomings that are inevitable for a subject as debated as
statistical inference.

We take as our starting point the fact that spontaneous interpretations of the
results of traditional statistical procedures (significance levels, confidence
intervals), even by "informed" users, are most often in terms of probabilities
about parameters, which are in fact the *natural* probabilities: "those that go
from the known to the unknown".

Lecoutre, B. (1998) The innumerable articles denouncing the deficiencies of significance testing urge us to reform the teaching of statistical inference for experimental data analysis. Bayesian methods are a promising alternative. However, teaching the Bayesian approach should not introduce an abrupt changeover from the current frequentist procedures: at the very least, the two approaches should co-exist for many years to come. Accordingly, we have developed statistical computer programs that incorporate both current practices and standard Bayesian procedures. These programs are used in the graduate statistics course in psychology, where Bayesian methods are especially introduced for inferences about effect sizes in the analysis of variance framework.

Lecoutre, B. (1999) The *K*-prime and *K*-square distributions, involved in the
Bayesian predictive distributions of standard *t* and *F* tests are
investigated. They generalize the classical *noncentral t* and *noncentral F* distributions
and can receive different characterizations. Their moments and their
probability density and distribution functions are made explicit.

Lecoutre, B. (1999) The purpose of this paper is to argue that widely accepted objective Bayesian methods,
with Fisher's fiducial motivation, are not only desirable but also
feasible. These methods bypass the common misuses of null hypothesis significance testing and offer promising
*new ways* in statistical methodology.

Lecoutre, B. (2000) In this chapter we shall examine how, when analyzing experimental data, the researcher can call on intuitive knowledge to understand the principles and methodological implications of two of the main statistical inference procedures, namely, the traditional significance test and fiducial Bayesian inference. The underlying general problem will be the comparison of means in experimental designs. This problem is usually considered in an analysis of variance framework. In fact, it can be amply illustrated here in the case of a simple situation of inference concerning a mean.

Lecoutre, B. (2001) In recent years many authors have stressed the interest of the Bayesian predictive approach for designing ("how many subjects?") and monitoring ("when to stop?") experiments. The predictive distribution of a test statistic can be used to include and extend the frequentist notion of power in a way that has been termed predictive power or expected power. More generally Bayesian predictive procedures give the researcher a very appealing method to evaluate the chances that the experiment will end up showing a conclusive result, or on the contrary a non-conclusive result. The prediction can be explicitly based on either the hypotheses used to design the experiment, expressed in terms of prior distribution, or on partial available data, or on both.
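
The idea of predictive (or expected) power can be illustrated in the simplest normal case. The sketch below is not the author's procedure: it assumes a future estimate `d` that is normal around the true effect with known standard error `se`, a one-sided test that is conclusive when `d / se` exceeds the normal critical value, and a normal distribution with the given `mean` and `sd` expressing the current uncertainty about the true effect. All names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def predictive_power(mean, sd, se, alpha=0.025):
    """Probability that a future estimate d ~ N(delta, se^2) gives d / se > z_alpha,
    averaged over delta ~ N(mean, sd^2)."""
    z_alpha = STD.inv_cdf(1.0 - alpha)
    # Predictive distribution of the future estimate: N(mean, sd^2 + se^2).
    return 1.0 - STD.cdf((z_alpha * se - mean) / sqrt(sd ** 2 + se ** 2))


# sd = 0 corresponds to the classical power computed at the single value delta = mean.
print(round(predictive_power(3.0, 0.0, 1.0), 3))
print(round(predictive_power(3.0, 2.0, 1.0), 3))
```

Setting `sd` to zero recovers the frequentist power at a fixed effect, which is the sense in which the predictive approach includes and extends that notion.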

Lecoutre, B. (2004) On se situe dans le cadre de l'analyse causale de données d'expériences "randomisées"
(les traitements sont affectés à chaque unité expérimentale par tirage au sort). Les apports de quelques
fondateurs de l'inférence statistique sont rapidement examinés. On considère ensuite les travaux récents,
et notamment ceux sur les *modèles graphiques structuraux* de Pearl, qui visent à unifier sous une
interprétation unique un certain nombre d'approches, incluant notamment les *analyses contrefactuelles*,
les *modèles graphiques*, les *modèles d'équations structurelles*. La plupart de ces travaux reposent
sur une approche contrefactuelle (invoquant des *résultats potentiels*: "si un autre traitement avait
été affecté à l'unité expérimentale..." de l'inférence causale. Dans un article provocateur,
Dawid (2000) soutient que cette approche est essentiellement métaphysique, et pleine de tentations
de faire des inférences qui ne peuvent pas être justifiées sur la base de données empiriques.
Concernant plus particulièrement les modèles graphiques structuraux, la critique de Dawid est que les
"variables latentes" en jeu dans de tels modèles ne sont pas de véritables variables concomitantes
(variables mesurables, qui peuvent être supposées non affectées par le traitement appliqué)
et qu'il n'y a alors aucun moyen, même en principe, de vérifier les suppositions ("assomptions") faites -
qui affecteront néanmoins les inférences qui en découlent. Dawid qualifie en conséquence ces modèles
de *pseudo-déterministes* et les considère comme *non scientifiques*. Les différents arguments et les solutions
proposées sont examinés et discutés.

*Experimentation, statistical inference and causal analysis*

The causal analysis of "randomised"
experimental data (treatments are randomly assigned to each experimental unit) is considered here.
The contributions of some founders of statistical inference are briefly examined. Recent works, and especially
Pearl's *graphical structural models*, are then considered. These models include *counterfactual analyses*,
*graphical models*, *structural equation models*. Most of these models are based on a counterfactual approach
(involving potential responses: "if another treatment had been allocated to the experimental unit...")
to causal inference. In a provocative article, Dawid (2000) argues that this approach is essentially metaphysical,
and full of temptations to make inferences that cannot be justified on the basis of empirical data.
Regarding graphical structural models, Dawid's major criticism is that "latent variables" involved
in such models are not genuine concomitant variables (measurable variables, that can be assumed unaffected
by the treatment applied) and that there is no way, even in principle, of verifying the assumptions made -
which will nevertheless affect the ensuing inferences. Dawid terms these models *pseudodeterministic* and regards
them as *unscientific*. The arguments and solutions are reviewed and discussed.

Lecoutre, B. (2005) Frequentist Null Hypothesis Significance Testing (NHST)
is so much a part of scientists' habits that its use cannot be ended "by
throwing it out of the window". Faced with this situation, the strategy
proposed for training students and researchers in statistical inference methods
for experimental data analysis relies on a smooth transition towards the
Bayesian paradigm. The basic principles of this strategy are as follows.
(1) Present the natural Bayesian interpretations of the usual significance
tests in order to draw attention to their shortcomings. (2) Thereby create the
need for a change in the presentation and interpretation of results.
(3) Finally, give users a real possibility of thinking sensibly about
statistical inference problems and of behaving more reasonably. The conclusion
is that teaching the Bayesian approach in the context of experimental data
analysis appears both *desirable* and *feasible*. This feasibility is
illustrated for analysis of variance methods.

Lecoutre, B. (2006) For assessing the smallness of an ANOVA effect, the usual 1-2alpha confidence interval
recommended by Steiger [Steiger, J. H. - Beyond the F test: Effect size confidence intervals and
tests of close fit in the analysis of variance and contrast analysis. *Psychological Methods*, 2004,
*9*, 164-182] and its generalization, the 1-2alpha Scheffé simultaneous interval estimate,
have to be extended in a simple suitable way in order to give "exact" 1-alpha confidence intervals.

Lecoutre, B. (2006) The literature is full of Bayesian interpretations of frequentist
*p*-values and confidence levels.
All the attempts to rectify these interpretations have been a losing battle.
In fact such interpretations suggest that most users are likely to be Bayesian "without knowing it" and
really want to make a different kind of inference.

Lecoutre, B. (2006) The use of frequentist Null Hypothesis Significance Testing (NHST) is such an integral part of scientists' behavior that it cannot be discontinued by flinging it out of the window. Faced with this situation, the suggested strategy for training students and researchers in statistical methods for experimental data analysis involves a smooth transition towards the Bayesian paradigm. Its general outlines are as follows. (1) To present natural Bayesian interpretations of NHST outcomes to draw attention to their shortcomings. (2) To create as a result of this the need for a change of emphasis in the presentation and interpretation of results. (3) Finally to equip users with a real possibility of thinking sensibly about statistical inference problems and behaving in a more reasonable manner. The conclusion is that teaching the Bayesian approach in the context of experimental data analysis appears both desirable and feasible. This feasibility is illustrated for analysis of variance methods.

Lecoutre, B. (2007a) An alternative approach to the computation of confidence intervals for the noncentrality parameter of the noncentral *t* distribution is proposed. It involves the percent points of a statistical distribution. This conceptual improvement renders the technical process for deriving the limits more comprehensible. Accurate approximations can be derived and easily used.

Lecoutre, B. (2007b) This chapter introduces the conceptual basis of the objective Bayesian approach to experimental data analysis and reviews some of its methodological improvements. The presentation is essentially non-technical and, within this perspective, restricted to relatively simple situations of inference about proportions. Bayesian computations and software are also briefly reviewed and some further topics are introduced.

Lecoutre, B., & Charron, C. (2000) Procedures for prediction analysis in 2×2 contingency tables
are illustrated by the analysis of successes to six types of problems
associated with the acquisition of fractions. According to Hildebrand, Laing,
and Rosenthal (1977), hypotheses such as "success to problem type *A*
implies in most cases success to problem type *B*" can be evaluated
from a numerical index. This index has been considered in various other
frameworks and can be interpreted in terms of a measure of predictive
efficiency of implication hypotheses. Confidence interval procedures previously
proposed for this index are reviewed and extended. Then, under a multinomial
model with a conjugate Dirichlet prior distribution, the Bayesian posterior
distribution of this index is characterized, leading to straightforward
numerical methods. The choices of "noninformative" priors for
discrete data are shown to be no more arbitrary or subjective than the choices
involved in the frequentist approach. Moreover, a simulation study of
frequentist coverage probabilities favorably compares Bayesian credibility
intervals with conditional confidence intervals.
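
A posterior distribution for such an index can be approximated by simulation. The sketch below is an illustration under our own assumptions, not the authors' method: the index is taken in its usual form for a single error cell (1 minus the observed proportion of errors divided by the proportion expected under independence), the prior is a symmetric Dirichlet with parameter 1/2 on the four cells, and the counts are made-up data. Function names are hypothetical.

```python
import random


def del_index(p11, p10, p01, p00):
    """Index for 'success on A implies success on B'.

    Cells are (A success, B success), (A success, B failure),
    (A failure, B success), (A failure, B failure); the error cell is p10.
    """
    return 1.0 - p10 / ((p11 + p10) * (p10 + p00))


def posterior_del_samples(counts, prior=0.5, draws=10000, seed=0):
    """Draw the index from the Dirichlet posterior of the four cell probabilities."""
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        # A Dirichlet draw is a vector of independent gamma draws, normalized.
        gammas = [rng.gammavariate(c + prior, 1.0) for c in counts]
        total = sum(gammas)
        samples.append(del_index(*(g / total for g in gammas)))
    return samples


def credible_interval(samples, level=0.95):
    """Equal-tailed interval read from the sorted simulated values."""
    ordered = sorted(samples)
    k = len(ordered)
    lower = ordered[int((1.0 - level) / 2.0 * k)]
    upper = ordered[min(k - 1, int((1.0 + level) / 2.0 * k))]
    return lower, upper


# Hypothetical counts: (A ok, B ok), (A ok, B fail), (A fail, B ok), (A fail, B fail)
print(credible_interval(posterior_del_samples((40, 2, 10, 48))))
```

The index equals 0 when the two problem types are independent and 1 when the error cell is empty.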

Lecoutre, B., & Derzko, G. (2001) Statistical inference procedures dedicated to asserting the
smallness of effects are commonly used in the field of bioequivalence studies
in pharmacology. They are however still virtually ignored in psychology. One
possible reason is that experimental investigations generally involve complex
designs for which solutions have not been developed in detail. The focus here
is precisely on the extension of these procedures to all the situations where
the usual ANOVA *F* tests apply.
Smallness test and confidence interval procedures, both for raw effects, such
as contrasts between means and their several *df* extensions,
and for standardized effect size measures similar to Cohen's *d* and *f*,
are considered. They are illustrated and contrasted with alternative Bayesian procedures.
From a practical viewpoint, the computations require no more than the observed effect size,
the usual *F* ratio, and percent points of statistical distributions.

Lecoutre, B., Derzko, G., & Grouin, J.-M. (1995) This paper investigates the Bayesian procedures for comparing proportions. These procedures are especially suitable for accepting (or rejecting) the equivalence of two population proportions. Furthermore, the Bayesian predictive probabilities provide a natural and flexible tool in monitoring trials, especially for choosing a sample size and for conducting interim analyses. These methods are illustrated with two examples where antithrombotic treatments are administered to prevent further occurrences of thromboses.
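
A Bayesian statement about the equivalence of two proportions can be sketched by simulation. This is a minimal illustration, not the paper's procedure: it assumes independent binomial samples, independent Beta(1/2, 1/2) priors, an equivalence margin chosen by the user, and made-up counts. The function name is hypothetical.

```python
import random


def prob_equivalence(s1, n1, s2, n2, margin, a=0.5, b=0.5, draws=20000, seed=0):
    """Posterior probability that |p1 - p2| < margin, estimated by Monte Carlo.

    s1, s2 are the numbers of events out of n1, n2 subjects; a and b are the
    parameters of the (assumed) independent Beta priors.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        p1 = rng.betavariate(a + s1, b + n1 - s1)  # posterior draw for group 1
        p2 = rng.betavariate(a + s2, b + n2 - s2)  # posterior draw for group 2
        if abs(p1 - p2) < margin:
            hits += 1
    return hits / draws


# Hypothetical data: 50/100 events in each group, margin of 0.15.
print(prob_equivalence(50, 100, 50, 100, margin=0.15))
```

A probability close to 1 supports equivalence within the margin, and a probability close to 0 supports rejecting it.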

Lecoutre, B., & ElQasyr, K. (2005) Adaptive designs for clinical trials that are based on a generalization of the "play-the-winner" rule are considered as an alternative to previously developed models. Theoretical and numerical results show that these designs perform better for the usual criteria. Bayesian methods are proposed for the statistical analysis of these designs.

Lecoutre, B., & ElQasyr, K. (2008) Adaptive designs for clinical trials that are based on a generalization of the "play-the-winner" rule are considered as an alternative to previously developed models. Theoretical and numerical results show that these designs perform better for the usual criteria. Bayesian methods are proposed for the statistical analysis of these designs.

Lecoutre, B., & Killeen, P. (2010) Iverson, Lee and Wagenmakers (2009) claim that Killeen’s
(2005a) statistic *p*_{rep} overestimates the "true probability of replication".
We show that ILW confuse the probability of replication of an observed direction
of effect with a probability of coincidence—the probability that two future
experiments will return the same sign. The theoretical analysis is punctuated
with a simulation of the predictions of *p*_{rep} for a realistic random effects
world of representative parameters, when those are unknown a priori. We
emphasize throughout that *p*_{rep} is intended to evaluate the probability of a
replication outcome after observations, not to estimate a parameter. Hence,
the usual conventional criteria (unbiasedness, minimum variance estimator) for
judging estimators are not appropriate for probabilities such as *p*
and *p*_{rep}.

Lecoutre, B., Lecoutre, M.-P., & Grouin, J.-M. (2001) The use of frequentist Null Hypothesis Significance Testing (NHST) is such an integral part of scientists' behavior that its uses cannot be discontinued by flinging it out of the window. Faced with this situation, our teaching strategy involves a smooth transition towards the Bayesian paradigm. Its general outlines are as follows. (1) To present natural Bayesian interpretations of NHST outcomes in order to call attention to their shortcomings. (2) In this way, to create the need for a change of emphasis in the presentation and interpretation of results. (3) Finally, to equip the students with a real possibility of thinking sensibly about statistical inference problems and behaving in a more reasonable manner. Our conclusion is that teaching the Bayesian approach in the context of experimental data analysis appears both desirable and feasible.

Lecoutre, B., Lecoutre, M.-P., & Poitevineau, J. (2001) The current context of the "significance test
controversy" is first briefly discussed. Then experimental studies about
the use of null hypothesis significance tests by scientific researchers and
applied statisticians are presented. The misuses of these tests are
reconsidered as judgmental adjustments revealing the researchers' requirements
towards statistical inference. Lastly alternative methods are considered. We
come naturally to the point of asking "won't the Bayesian choice be
unavoidable?"
*Usages, abus et mésusages des tests de signification dans la communauté scientifique: Le
choix bayésien ne sera-t-il pas incontournable?*

Nous discutons d'abord brièvement le contexte actuel de la "controverse sur le test de
signification". Puis nous présentons des recherches expérimentales sur
l'usage des tests de signification de l'hypothèse nulle par des chercheurs
scientifiques et des statisticiens professionnels. Les mauvais usages de ces
tests sont reconsidérés comme des jugements adaptatifs, qui révèlent les
exigences des chercheurs envers l'inférence statistique. Finalement, nous
envisageons les solutions de rechange.

Nous en venons naturellement à poser la question: "le choix bayésien ne sera-t-il pas
incontournable?"

Lecoutre, B., Lecoutre, M.-P., & Poitevineau, J. (2010) P. R. Killeen’s (2005a)
[Killeen, P. R. - An alternative to null-hypothesis significance
tests, *Psychological Science*, 2005, *16*, 345–353] probability of replication (*p*_{rep}) of an
experimental result is the fiducial Bayesian predictive probability of finding
a same-sign effect in a replication of an experiment. *p*_{rep} is now routinely
reported in Psychological Science and has also begun to appear in other
journals. However, there is little concrete, practical guidance for use of
*p*_{rep} and the procedure has not received the scrutiny that it deserves.
Furthermore, only a solution that assumes a known variance has been
implemented. A practical problem with *p*_{rep} is identified: in many articles
*p*_{rep} appears to be incorrectly computed, due to the confusion between
1-tailed and 2-tailed *p*-values. Experimental findings reveal the risk of
misinterpreting *p*_{rep} as the predictive probability of finding a
same-sign and significant effect in a replication (*p*_{srep}). Conceptual and
practical guidelines are given to avoid these pitfalls. They include the
extension to the case of unknown variance. Moreover, other uses of fiducial
Bayesian predictive probabilities, for analyzing, designing ("how many
subjects?") and monitoring ("when to stop?") experiments, are presented.
Concluding remarks emphasize the role of predictive procedures in statistical
methodology.
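
The one-tailed versus two-tailed confusion mentioned in this abstract can be illustrated with a minimal sketch. It assumes the known-variance normal formula *p*_{rep} = Φ(*z*/√2), where *z* is the standard normal quantile of 1 minus the one-tailed *p*-value; the function name `p_rep` and its `two_tailed` argument are illustrative and do not come from the article.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def p_rep(p_value: float, two_tailed: bool = True) -> float:
    """Probability of a same-sign effect in a replication (known-variance case).

    Assumption: p_rep = Phi(z / sqrt(2)), with z computed from the
    one-tailed p-value in the observed direction.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    # A two-tailed p-value must be halved before converting it to z.
    one_tailed = p_value / 2.0 if two_tailed else p_value
    z = _STD_NORMAL.inv_cdf(1.0 - one_tailed)
    return _STD_NORMAL.cdf(z / sqrt(2.0))


# The same reported value p = .05, read correctly as two-tailed
# and read mistakenly as if it were already one-tailed.
correct = p_rep(0.05, two_tailed=True)
mistaken = p_rep(0.05, two_tailed=False)
print(f"two-tailed p = .05 -> p_rep = {correct:.3f}")
print(f"same value treated as one-tailed -> p_rep = {mistaken:.3f}")
```

Treating a two-tailed *p*-value as one-tailed lowers the computed value, which is one way the reported figures in articles can end up inconsistent with the underlying test.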

Lecoutre, B., Mabika, B., & Derzko, G. (2002) The comparison of two Weibull distributions with unequal shape parameters, in the case of right censored survival data obtained for several independent samples, is considered within the Bayesian statistical methodology. The procedures are illustrated with the example of a mortality study where a new treatment is compared to a placebo. The posterior distributions about relevant parameters allowing to search for a conclusion of clinical superiority of the treatment, and the predictive distributions used to obtain an early stopping rule at an interim analysis, are considered for a class of appropriate priors.

Lecoutre, B., & Poitevineau, J. (2000) Il y a de bonnes raisons de penser que le
rôle des tests de signification usuels dans la recherche en psychologie sera
considérablement réduit dans un proche avenir. Les résultats des analyses
statistiques traditionnelles devraient être systématiquement complétés
("au delà des seuls seuils observés *p*")
pour inclure systématiquement la présentation d'indicateurs de la grandeur des
effets et leurs estimations par intervalles. Ces procédures pourraient
rapidement devenir de nouvelles *normes*
de publication. Dans cet article, nous passons d'abord en revue les principaux
abus des tests de signification et les solutions de rechange proposées. Parmi
celles-ci, des méthodes d'intervalle de confiance (*fréquentistes*)
et des méthodes d'intervalles de crédibilité (*fiducio-bayésiens*) permettent d'estimer
l'importance réelle des effets, et en particulier d'apprécier leur caractère
négligeable ou notable. A partir d'un exemple numérique, nous illustrons ces
méthodes pour l'analyse de contrastes entre moyennes dans un plan d'expérience
complexe, en considérant à la fois les effets *bruts* et les effets
*relatifs* (calibrés). Nous discutons les similitudes et les différences des approches
fréquentistes et bayésiennes, leur interprétation correcte et leur utilisation
pratique.
*Beyond traditional significance tests: Prime time for new publication norms*

There are good reasons to think that the role of usual null
hypothesis significance testing in psychological research will be considerably
reduced in the near future. Traditional statistical analysis results should be
enhanced ("beyond simple *p* value
statements") to systematically include effect sizes and their interval
estimates. Quite soon, these procedures could become new publication
*norms*. In this paper main abuses of
significance tests and alternative available solutions are first reviewed.
Among these solutions, both confidence interval (*frequentist*)
methods and credibility interval (*fiducial Bayesian*)
methods have been developed for assessing effect sizes, and especially for asserting the negligibility
or the notability of effects. From a numerical example, these methods are illustrated for analysing
contrasts between means in a complex experimental design. Both *raw*
and *relative* (calibrated) effects are considered. The similarities and
differences between the frequentist and Bayesian approaches, their correct
interpretations, and their practical uses, are discussed.

Lecoutre, B., Poitevineau, J., Derzko, G., & Grouin, J.-M. (2000) The aim of this talk is to illustrate,
to borrow the expression of Lewis (1982), the *desirability*
and the *feasibility* of Bayesian methods in analysis of variance. The debates about statistical inference
will not be revisited here; the point is simply to show
constructively how routine Bayesian procedures can
be easily implemented and provide simple and direct answers to the
methodological criticisms levelled at the use of the usual significance tests.

Lecoutre, B., Poitevineau, J., & Lecoutre, M.-P. (2004) When reading Denis' paper the feeling
is that Fisher cannot be judged responsible for
the "problems associated with today's model". Even if we agree that current uses of NHST
are far from being purely Fisherian, our analysis is somewhat different.
In order to understand Fisher's real contribution, it is essential to recall
his statistical ideas about causality and probability. In particular his works,
not only on the fiducial theory, but also on the Bayesian method in his last years,
are a fundamental counterpart to his emphasis on significance tests.
In conclusion, while Fisher's responsibility for today's practices cannot be
dismissed, the verdict is clear: "responsible, not guilty".

*Fisher: Responsable, non coupable*

La lecture de l'article de Denis donne l'impression que Fisher ne peut pas être jugé
responsable des "problèmes associés au modèle d'aujourd'hui". Même si nous sommes d'accord que
les usages actuels des tests de signification de l'hypothèse nulle sont loin d'être purement fishériens,
notre analyse est sensiblement différente. Pour comprendre la contribution réelle de Fisher,
il est essentiel de rappeler ses idées statistiques sur la causalité et la probabilité.
En particulier ses travaux, non seulement sur la théorie fiduciaire, mais aussi sur la méthode
bayésienne dans ses dernières années, constituent une contrepartie fondamentale à son insistance
sur l'usage des tests de signification. En conclusion, tandis que la responsabilité de Fisher
dans les pratiques actuelles ne peut pas être rejetée, le verdict s'impose de lui même:
"responsable, non coupable".

Lecoutre, B., Poitevineau, J., & Lecoutre, M.-P. (2005) It is shown that an interval estimate
for a contrast between means can be straightforwardly computed, given only the observed contrast and the associated
*t* or *F* test statistic (or equivalently the corresponding *p*-value).
This interval can be seen as a *frequentist* confidence interval, as a standard *Bayesian* credibility interval,
or as a *fiducial interval*.
This gives Null Hypothesis Significance Tests (NHST) users the possibility
of an easy transition towards more appropriate statistical practices.
Conceptual links between NHST and interval estimates are outlined.

*Une raison pour ne pas abandonner les tests de signification de l'hypothèse nulle*

On montre que l'on peut directement calculer un intervalle pour un contraste
entre moyennes, étant donné seulement la valeur observée du contraste et la statistique du test
*t* ou *F* associé (ou encore, de manière équivalente, le seuil observé correspondant,
"*p-value*").
Cet intervalle peut être vu comme un intervalle de confiance *fréquentiste* ou comme un intervalle de crédibilité
*bayésien* ou comme un intervalle *fiduciaire*.
Cela donne aux utilisateurs des tests de signification usuels la possibilité
d'une transition facile vers des pratiques statistiques plus appropriées.
On met en avant les liens conceptuels entre les tests et les intervalles de confiance ou de crédibilité.
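
The computation described in this entry (Lecoutre, Poitevineau, & Lecoutre, 2005) can be sketched as follows. This is a minimal illustration, assuming that the standard error of the contrast is recovered as the observed contrast divided by its *t* statistic, and that the critical value of Student's *t* for the relevant degrees of freedom is read from a table. The function names and the numerical values are hypothetical.

```python
from math import sqrt


def interval_from_t(contrast: float, t_stat: float, t_crit: float) -> tuple[float, float]:
    """Interval estimate for a contrast from its observed value and t statistic.

    contrast : observed value of the contrast between means
    t_stat   : associated t statistic (must be nonzero)
    t_crit   : critical t value for the chosen level and degrees of freedom,
               taken from a table (assumption: supplied by the user)
    """
    if t_stat == 0:
        raise ValueError("the standard error cannot be recovered when t = 0")
    std_error = abs(contrast / t_stat)  # since t = contrast / standard error
    half_width = t_crit * std_error
    return contrast - half_width, contrast + half_width


def interval_from_f(contrast: float, f_stat: float, t_crit: float) -> tuple[float, float]:
    """Same interval from a one-degree-of-freedom F ratio, using |t| = sqrt(F)."""
    return interval_from_t(contrast, sqrt(f_stat), t_crit)


# Hypothetical example: contrast = 2.0, t = 2.5 on 20 df,
# 95% level, tabled critical value t(20) = 2.086.
lower, upper = interval_from_t(2.0, 2.5, 2.086)
print(f"95% interval: [{lower:.4f}, {upper:.4f}]")
```

Only the interpretation changes across the frequentist, fiducial and standard Bayesian readings; the arithmetic above is the same in each case.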

Lecoutre, B., Rouanet, H., & Denhière, G. (1988) The Analysis of Comparisons, developed from
1968 onwards, has been the subject of several overview presentations: Rouanet and Lépine
(1977), Hoc (1983), Lecoutre (1984); it can be seen as a restructuring
of the analysis of variance that includes, in particular, the following innovations.
- Development of a *data interrogation language* that allows the researcher's questions to be
formulated within the framework of the factors of the experimental design.
- Principle of *specific inference* (detailed in Rouanet and Lépine, 1983), which consists in basing
the inference on a model stated no longer at the level of the basic protocol but at the level of
each relevant derived protocol corresponding to each request for analysis.
- *Bayesian techniques*.
Since 1973, the Analysis of Comparisons has incorporated Bayesian techniques,
classical and contemporary (Jeffreys, Lindley, etc.), but uses them
with a fiducial motivation (Fisher). These techniques seem to us
the best suited to remedy the shortcomings of traditional significance
tests. [...] Very early on, the Bayesian Analysis of
Comparisons was applied to the problem of model validation: Rouanet, Lépine and Holender, 1978.

Lecoutre, M.-P. (1982) This study examined the behaviors that psychologists spontaneously develop in conflicting situations of statistical data processing; the aim is not normative, but to look for lines of coherence. Three common conflicting problems (for example, the same procedure applied to an experiment and to its replication yielding discrepant results) were presented to 27 psychologists from several laboratories; the responses were recorded in semi-directive interviews. There were two main findings. First, while behaviors were well differentiated, mostly according to the weights given to several criteria such as the observed results, significance tests, reference theories and so on, it was possible to infer global attitudes that are common to almost all researchers. Second, it appeared that the issue of data grouping is a critical one for many researchers.

Lecoutre, M.-P. (1992) [Example of use of standard Bayesian methods].

Lecoutre, M.-P. (2000) This chapter presents the findings of an experimental research
project aimed at describing and analyzing the judgments made in situations of
statistical inference by researchers (in this case, researchers in psychology)
who, having completed an experiment, proceed to a statistical analysis of their
data. The rest of the chapter is divided into two parts, one for each stage in
the study. The first part examines the role of the various ingredients which
contribute to the formulation of a statistical conclusion, along with the
interpretations they give rise to. The second part, the study of statistical
prediction situations, leads us to examine how a statistical conclusion is
understood and interpreted: what significance do researchers attach to
conclusions such as, for example "there is a difference between two
treatments", or "there is an effect of such and such a factor", *etc.*

Lecoutre, M.-P., Clément, E., & Lecoutre, B. (2004) [Example of use of standard Bayesian methods].

Lecoutre, M.-P., Poitevineau, J., & Lecoutre, B. (2003) We investigated the way experienced users interpret Null
Hypothesis Significance Testing (NHST) outcomes. An empirical study was designed
to compare the reactions of two populations of NHST users, psychological
researchers and professional applied statisticians, when faced with
contradictory situations.

The subjects were presented with the
results of an experiment designed to test the efficacy of a drug by comparing
two groups (treatment/*placebo*).
Four situations were constructed by combining the outcome of the *t* test
(significant vs nonsignificant) and the observed difference between the two
means *d* (large vs small). Two of these situations appeared as
conflicting (*t* significant/*d* small and *t* nonsignificant/*d*
large). Three fundamental aspects of statistical
inference were investigated by means of open questions: drawing inductive
conclusions about the magnitude of the true difference from the data in hand,
making predictions for future data, and making decisions about stopping the
experiment. The subjects were 25 statisticians from pharmaceutical companies in
France, subjects well versed in statistics, and 20 psychological researchers
from various laboratories in France, all with experience in processing and
analyzing experimental data.

On the whole, statisticians and psychologists reacted in a similar way and were very impressed by significant
results. It must be emphasized that professional applied statisticians were not
immune to misinterpretations, especially in the case of nonsignificance.
However, the interpretations that accustomed users attach to the outcome of
NHST can vary from one individual to another, and it is hard to conceive that
there could be a consensus when faced with seemingly conflicting situations. In fact, beyond the superficial report of
"erroneous" interpretations, the misuses of NHST can be seen as intuitive
judgmental "adjustments" that try to overcome its inherent shortcomings. These findings encourage the many recent attempts to
improve the habitual ways of analyzing and reporting experimental data.
*Même les statisticiens ne sont pas à l'abri des erreurs d'interprétation des Tests de
Signification de l'Hypothèse Nulle*

Nous avons étudié la manière dont des utilisateurs expérimentés interprètent les résultats des Tests
de Signification de l'Hypothèse Nulle. Une étude empirique a été menée pour
comparer les réactions de deux populations d'utilisateurs, des chercheurs en
psychologie et des statisticiens professionnels, face à des situations
conflictuelles.

On présentait aux sujets les résultats d'une expérience planifiée pour tester l'efficacité d'un
médicament en comparant deux groupes (traitement/*placebo*).
Quatre situations étaient construites en combinant
l'issue du test *t* (significatif vs non-significatif) et la différence
observée *d* entre les deux moyennes (grande vs petite). Deux de ces
situations apparaissaient conflictuelles (*t* significatif/*d* petite
et *t* non-significatif/*d* grande). Trois aspects fondamentaux de
l'inférence statistique étaient examinés au moyen de questions ouvertes: tirer
une conclusion inductive sur la grandeur de la vraie différence, faire une
prédiction relative à des données futures, et prendre une décision sur l'arrêt
de l'expérience. Les sujets étaient 25 statisticiens de l'industrie
pharmaceutique en France, donc experts en statistique, et 20 chercheurs en
psychologie de différents laboratoires français, ayant tous une expérience de
l'analyse des données expérimentales.

Dans l'ensemble, les statisticiens et les psychologues se sont comportés d'une manière similaire et
ont été très influencés par les résultats significatifs. Un résultat important
est que les statisticiens ne sont pas à l'abri des abus d'interprétation des
tests, en particulier quand le résultat est non significatif. Cependant
l'interprétation des tests peut varier considérablement d'un individu à l'autre
et est loin de donner lieu à un consensus face à des situations en apparence
conflictuelles. En fait au delà du constat superficiel de l'existence d'interprétations "erronées", on peut voir
dans les mésusages des tests des "ajustements" de jugement intuitifs, pour
tenter de surmonter leurs insuffisances fondamentales. Ces résultats encouragent
les nombreuses tentatives récentes d'améliorer les procédures habituelles pour analyser les données expérimentales et
présenter les résultats.

Lecoutre, M.-P., & Rouanet, H. (1993) Probabilistic judgments made by researchers in psychology were investigated in statistical prediction situations. From these situations, it is possible to test the "representativeness hypothesis" (Tversky & Kahneman, 1971) and the "significance hypothesis" (Oakes, 1986). The predictive judgments concerned both an elementary descriptive statistic and a significance test statistic. In the first case, the predictive judgments were generally coherent and compared well with standard Bayesian predictive probabilities. As for the two hypotheses tested, our findings are compatible with the significance hypothesis, but go against the representativeness hypothesis.

Lee, P. (1989) This book is concerned with estimating the values of unknown
parameters and investigating the degree of confidence we can have in various hypotheses.
The Bayesian approach is distinguished by giving a probability distribution to
the unknown parameters and then modifying it in the light of experimental data.
This is controversial because for a theory with no new data available, the
statistician's own beliefs have to be incorporated into the analysis. The
author presents the ideas behind Bayesian Statistics at a level suitable for
advanced undergraduate or postgraduate students. -- The discrepancies between
the conclusions of Bayesian and "classical" statistics are
highlighted. -- Full treatment of Bayesian statistics is presented; easily
accessible to students with some knowledge of statistics. -- Clear exposition
of *where* and *why* the Bayesian approach differs
from the "classical" approach. -- Discusses how real *prior* information
can be incorporated into statistical analyses and explains the difficulties
which can arise when "conventional priors" are used. -- Excellent
appendix includes tables useful in Bayesian statistics (and not readily available elsewhere).

Lee, M.D., & Wagenmakers, E.J. (2005) D. Trafimow (2003) presented an analysis of null hypothesis significance testing (NHST) using Bayes's theorem. Among other points, he concluded that NHST is logically invalid, but that logically valid Bayesian analyses are often not possible. The latter conclusion reflects a fundamental misunderstanding of the nature of Bayesian inference. This view needs correction, because Bayesian methods have an important role to play in many psychological problems where standard techniques are inadequate. This comment, with the help of a simple example, explains the usefulness of Bayesian inference for psychology.

Lehmann, E.L. (1993) The Fisher and Neyman-Pearson approaches to testing statistical hypotheses are compared with respect to their attitudes to the interpretation of the outcome, to power, to conditioning, and to the use of fixed significance levels. It is argued that despite basic philosophical differences, in their main practical aspects the two theories are complementary rather than contradictory and that a unified approach is possible that combines the best features of both. As applications, the controversies about the Behrens-Fisher problem and the comparison of two binomials (2x2 tables) are considered from the present point of view.

Lépine, D., & Rouanet, H. (1975) La méthode traditionnelle du t de Student
peut être utilisée pour mettre à l'épreuve un modèle d'hypothèse nulle à 1 degré
de liberté dans une grande variété de plans d'expérience. Mais souvent cette
méthode ne suffit pas à répondre à l'attente du chercheur, qui voudrait
inférer, à partir des seules données expérimentales, sur l'importance, dans la
population parente, de l'écart à l'hypothèse nulle. Les auteurs proposent ici,
à partir des conceptions fiduciaires de Fisher, une méthode qui répond à cette
attente en autorisant, dans les cas où le test "*t*" est valide,
une inférence probabiliste portant sur le
paramètre d'écart à l'hypothèse nulle. Des exemples concrets illustrent la
méthode, qui est utilisée ici principalement pour répondre à la question:
"existe-t-il, en un sens qui est à préciser par le chercheur, un écart
"négligeable" ou au contraire "notable" à l'hypothèse nulle
habituelle?". La méthode apparaît ainsi comme un prolongement du test de
signification usuel, en vue d'une inférence plus précise, et permet ainsi de
discriminer les cas où les données expérimentales peuvent valablement conduire
à une conclusion dans les termes souhaités de ceux où l'information n'est pas
suffisante pour permettre une conclusion inférentielle ferme.
*Introduction to fiducial methods: Inference about a contrast between means*

In many experimental designs, the traditional Student *t*-test
may be used to test a model of null hypothesis with one degree of freedom. But often this method does not
satisfy the aim of the researcher who wishes to make inferences about the
deviation from the null hypothesis within the parent population. On the basis
of Fisher's notion of fiducial inference, the authors propose a method that
satisfies this aim: it enables a probabilistic inference about the parameter of
discrepancy from the null hypothesis, that can be used whenever the *t*-test
is valid. Concrete examples illustrate this method which is used, in the present case, to answer the
question: "does there exist, in a sense to be specified by the
experimenter, a negligible or an important discrepancy from the null
hypothesis?". This method is an extension of the common test of
significance and allows a more precise inference; it also permits the
discrimination of cases in which the information contained in the data is
sufficient to yield a firm conclusion, from those in which it does not.

Levine, T.R., & Hullett, C.R. (2002) Communication researchers, along with social scientists from a variety of disciplines, are increasingly recognizing the importance of reporting effect sizes to augment significance tests. Serious errors in the reporting of effect sizes, however, have appeared in recently published articles. This article calls for accurate reporting of estimates of effect size. Eta squared (η²) is the most commonly reported estimate of effect size for the ANOVA. The classical formulation of eta squared (Pearson, 1911; Fisher, 1928) is distinguished from the lesser known partial eta squared (Cohen, 1973), and a mislabeling problem in the statistical software SPSS (1998) is identified. What SPSS reports as eta squared is really partial eta squared. Hence, researchers obtaining estimates of eta squared from SPSS are at risk of reporting incorrect values. Several simulations are reported to demonstrate critical issues. The strengths and limitations of several estimates of effect size used in ANOVA are discussed, as are the implications of the reporting errors. A list of suggestions for researchers is then offered.
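
The distinction drawn in this abstract can be made concrete with a short sketch based on the usual textbook definitions: eta squared divides the effect sum of squares by the total sum of squares, whereas partial eta squared divides it by the effect plus error sums of squares. The sums of squares below are invented for illustration and are not taken from the article.

```python
def eta_squared(ss_effect: float, ss_total: float) -> float:
    """Classical eta squared: share of the total variation due to the effect."""
    return ss_effect / ss_total


def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    """Partial eta squared: the other effects are removed from the denominator."""
    return ss_effect / (ss_effect + ss_error)


# Hypothetical two-factor ANOVA table (sums of squares).
ss_a, ss_b, ss_ab, ss_error = 20.0, 30.0, 10.0, 40.0
ss_total = ss_a + ss_b + ss_ab + ss_error  # 100.0

eta_a = eta_squared(ss_a, ss_total)
partial_a = partial_eta_squared(ss_a, ss_error)
print(f"eta squared for A:         {eta_a:.3f}")
print(f"partial eta squared for A: {partial_a:.3f}")
```

In a one-way design the two indices coincide, because the total sum of squares is then just effect plus error; with several factors the partial version is the larger of the two, so mislabeling it as eta squared overstates the share of total variance.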

Levy, P. (1967) **[ Example of misinterpretation
of a p-value]** "Statistical significance refers only to... the
confidence with which a null hypothesis may be rejected." (page 37)
[Quoted by Falk and Greenbaum, 1995, page 82]

Lewis, C. (1993) In this chapter, it is assumed that the reader has some familiarity with the basic concepts of Bayesian inference and of conventional analysis of variance. This allows attention to be focused on what happens when the two are brought together. For this purpose, extensive use is made of results presented by Box and Tiao (1973). This source provides, by far, the most extensive treatment of analysis of variance from a Bayesian point of view, and the interested reader will find in it proofs and generalizations of most of the material appearing here. [...] the emphasis is on laying out, as clearly as possible, a Bayesian approach to analysis of variance.

Lindley, D.V. (1998) It is argued that the determination of bioequivalence involves a decision, and is not purely a problem of inference. A coherent method of decision-making is examined in detail for a simple trial of bioequivalence. The result is shown to differ seriously from the inferential method, using significance tests, ordinarily used. The reason for the difference is explored. It is shown how the decision-analytic method can be used in more complicated and realistic trials, and the case for its general use is presented.

Lindley, D.V. (2000) This paper puts forward an overall view of statistics. It is argued that statistics is the study of uncertainty. The many demonstrations that uncertainties can only combine according to the rules of the probability calculus are summarized. The conclusion is that statistical inference is firmly based on probability alone. Progress is therefore dependent on the construction of a probability model; methods for doing this are considered. It is argued that the probabilities are personal. The roles of likelihood and exchangeability are explained. Inference is only of value if it can be used, so the extension to decision analysis, incorporating utility, is related to risk and to the use of statistics in science and law. The paper has been written in the hope that it will be intelligible to all who are interested in statistics.

Lipset, S.M., Trow, M.A., & Coleman, J.S. (1956) In this book, no statistical tests of significance have been used. [...] It can be defended, and we shall defend it at length because there seems to be no good statement of our position in print. Statistical tests of hypotheses, however, seem to be of quite limited aid in building theoretical science.

Little, J. (2001) Few concepts in the social sciences have wielded more discriminatory power over the status of knowledge claims than that of statistical significance. Currently operationalized as alpha=0.05, statistical significance frequently separates publishable from nonpublishable research, renewable from nonrenewable grants, and, in the eyes of many, experimental success from failure. If literacy is envisioned as a sort of competence in a set of social and intellectual practices, then scientific literacy must encompass the realization that this cardinal arbiter of social scientific knowledge was not born out of an immanent logic of mathematics but socially constructed and reconstructed in response to sociohistoric conditions.

Locascio, J.J. (1999) Limitations and inappropriate uses of null hypothesis statistical significance testing (NHST) in behavioral research have been widely cited. Critics recommend alternative data analysis approaches and even outright "banning" of it from professional journals. I agree with most criticisms, but would stop short of supporting a ban.

Loftus, G.R. (1991) [Review of Gigerenzer *et al.*, 1989] The
*Empire of Chance* is about the history
and current use of probability theory and statistics. [...] Because this review
is for psychologists, I will organize it around the book's insights into a
question that I believe is at the heart of much malaise in psychological research:
How has the virtually barren technique of hypothesis testing come to assume
such importance in the process by which we arrive at our conclusions from our
data? I will first describe why this question is timely and important; I will
then provide a brief synopsis of the book; and finally, I will detail the
book's answers to the question.

Loftus, G.R. (1993) In particular, I offer the following guidelines. 1. By
default, data should be conveyed as a figure depicting sample means *with
associated standard errors and/or, where appropriate, standard deviations*. 2. More often than not, inspection
of such a figure will immediately obviate the necessity of any
hypothesis-testing procedures. In such situations, presentation of the usual
hypothesis information (*F* values, *p* values, etc.) will be discouraged.

Loftus, G.R. (2002) This chapter has two main purposes: the first is to discuss
serious problems with two generally accepted and widely used foundations of
current statistical analysis in psychology, the linear model and null
hypothesis significance testing; the second purpose is to describe some
alternative valuable techniques. The alternatives are grouped into six
categories: (1) use of sophisticated pictorial and graphical techniques for
data display, (2) use of confidence intervals in numerous situations, (3) use
of planned comparisons, emphasizing *contrasts*, (4) use of percent total
variance accounted for, (5) representation of theoretical fits to data, and (6)
use of *equivalence techniques* to investigate interactions. Detailed
hypothetical numerical examples along with associated calculations and graphs
are constructed to illustrate each of the techniques.

Loftus, G.R., & Masson, M.E.J. (1994) We argue that to best comprehend many data sets, plotting judiciously selected sample statistics with associated confidence intervals can usefully supplement, or even replace, standard hypothesis-testing procedures. We note that most social science statistics textbooks limit discussion of confidence intervals to their use in between-subject designs. Our central purpose in this article is to describe how to compute an analogous confidence interval that can be used in within-subject designs. This confidence interval rests on the reasoning that, because between-subject variance typically plays no role in statistical analyses of within-subject designs, it can legitimately be ignored; hence, an appropriate confidence interval can be based on the standard within-subject error term, that is, on the variability due to the subject × condition interaction. Computation of such a confidence interval is simple and is embodied in equation 2 on p. 482 of this article. This confidence interval has two useful properties. First, it is based on the same error term as is the corresponding analysis of variance, and hence leads to comparable conclusions. Second, it is related by a known factor (square-root of 2) to a confidence interval of the difference between sample means: accordingly, it can be used to infer the faith one can put in some pattern of sample means as a reflection on the underlying pattern of population means. These two properties correspond to analogous properties of the more widely used between-subject confidence interval.
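
As an illustration of the computation this abstract describes, the following is a minimal sketch, assuming the half-width takes the form t_crit × sqrt(MS_SxC / n), where MS_SxC is the subject × condition interaction mean square and n is the number of subjects. The function name, the data layout (one row per subject, one column per condition) and the caller-supplied `t_crit` are assumptions made for this example, not notation taken from the article.

```python
from math import sqrt

def within_subject_halfwidth(data, t_crit):
    """Half-width of a within-subject confidence interval (illustrative sketch).

    data   : list of rows, one row per subject, one column per condition
    t_crit : critical t value for df = (n - 1) * (k - 1), supplied by the caller
    Returns (condition means, interaction mean square, CI half-width).
    """
    n = len(data)        # number of subjects
    k = len(data[0])     # number of conditions
    grand = sum(sum(row) for row in data) / (n * k)
    subj_means = [sum(row) / k for row in data]
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Subject x condition interaction sum of squares: what is left after
    # removing subject and condition main effects from each score.
    ss_inter = sum(
        (data[i][j] - subj_means[i] - cond_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_inter = ss_inter / ((n - 1) * (k - 1))
    return cond_means, ms_inter, t_crit * sqrt(ms_inter / n)

# Hypothetical data: 3 subjects, 2 conditions; 4.303 is the two-sided 95% t value for df = 2.
means, ms, half_width = within_subject_halfwidth([[1, 2], [2, 4], [3, 6]], 4.303)
```

Because the subject main effect is removed before the error term is computed, large differences between subjects do not widen the interval, which is the point of the within-subject construction.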

Loredo, T.J. (1990) The Bayesian approach to probability theory is presented as an alternative to the currently used long-run relative frequency approach, which does not offer clear, compelling criteria for the design of statistical methods. Bayesian probability theory offers unique and demonstrably optimal solutions to well-posed statistical problems, and is historically the original approach to statistics. The reasons for earlier rejection of Bayesian methods are discussed, and it is noted that the work of Cox, Jaynes, and others answers earlier objections, giving Bayesian inference a firm logical and mathematical foundation as the correct mathematical language for quantifying uncertainty. The Bayesian approaches to parameter estimation and model comparison are outlined and illustrated by application to a simple problem based on the gaussian distribution. As further illustrations of the Bayesian paradigm, Bayesian solutions to two interesting astrophysical problems are outlined: the measurement of weak signals in a strong background, and the analysis of the neutrinos detected from supernova SN 1987A. A brief bibliography of astrophysically interesting applications of Bayesian inference is provided.

Ludbrook, J. (2000) 1. In a recent review article, the problem of making false-positive inferences as a result of making multiple comparisons between groups of experimental units or between experimental outcomes was addressed. 2. It was concluded that the most universally applicable solution was to use the Ryan-Holm step-down Bonferroni procedure to control the family-wise (experiment-wise) type 1 error rate. This procedure consists of adjusting the P values resulting from hypothesis testing. It allows for correlation among hypotheses and has been validated by Monte Carlo simulation. It is a simple procedure and can be performed by hand. 3. However, some investigators prefer to estimate effect sizes and make inferences by way of confidence intervals rather than, or in addition to, testing hypotheses by way of P values and it is the policy of some editors of biomedical journals to insist on this. It is not generally recognized that confidence intervals, like P values, must be adjusted if multiple inferences are made from confidence intervals in a single experiment. 4. In the present review, it is shown how confidence intervals can be adjusted for multiplicity by an extension of the Ryan-Holm step-down Bonferroni procedure. This can be done for differences between group means in the case of continuous variables and for odds ratios or relative risks in the case of categorical variables set out as 2 x 2 tables.
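
The P value adjustment mentioned in this abstract can be sketched as follows. This is the standard Holm step-down adjustment, assumed here to correspond to what the abstract calls the Ryan-Holm procedure; the function name is hypothetical, and the extension to confidence intervals proposed in the review is not reproduced.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted P values, returned in the input order (sketch)."""
    m = len(p_values)
    # Indices of the P values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest P value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw P values from three comparisons in one experiment.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A comparison is then declared significant at the family-wise level alpha when its adjusted P value is at most alpha.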

Lutz, W., & Nimmo, I.A. (1977) The experimental aim should not be to establish whether changes have occurred, but rather to estimate whether changes have occurred in excess of some stipulated magnitude and importance. When a "significant difference" has been established, investigators must then measure the size of the effect and consider whether it is of any biological or medical importance.

Lykken, D. (1968) The moral of this story is that the finding of statistical
significance is perhaps the least important attribute of a good experiment: it
is *never* a sufficient condition for
concluding that a theory has been corroborated, that a useful empirical fact
has been established with reasonable confidence - or that an experimental
report ought to be published. The value of any research can be determined, not
from the statistical results, but only by skilled, subjective evaluation of the
coherence and reasonableness of the theory, the degree of experimental control
employed, the sophistication of the measuring techniques, the scientific or
practical importance of the phenomena studied, and so on. [...] Editors must be
bold enough to take responsibility for deciding which studies are good and
which are not, without resorting to letting the *p* value
of the significance tests determine this decision.

Macdonald, R.R. (1997) This paper argues that a Fisherian approach to statistical inference, which views statistical testing as determining the chance probability of an event, is coherent, consistent with modern statistical methods and forms a sound theoretical basis for the use of statistical tests in psychology. It is argued that Fisherian statistical tests are concerned with establishing the direction of tested effects, give rise to confidence intervals and are quite consistent with power analyses. Researchers are encouraged to report chance probabilities, and to interpret them according to the prevailing conditions rather than using a fixed decision rule. The contrasting Neyman-Pearson approach, which views statistical testing as a quality control procedure for accepting hypotheses, posits unreasonable research practices which psychologists do not and should not be expected to follow. Neyman-Pearson theory has caused confusion in the psychological literature and criticisms have been levelled at statistical testing in general that ought to have been directed specifically at Neyman-Pearson testing. Statistical inference, of any sort, is held to be insufficient to characterize the process of testing scientific hypotheses. Data should be seen as evidence to be used in psychological arguments and statistical significance is just one measure of its quality. It restrains researchers from making too much of findings which could otherwise be explained by chance.

Macdonald, R.R. (2005) As the adage has it, "when something looks too good to be true, it probably is." If it were possible to compute replication probabilities ignoring everything but the data, then these probabilities would be "objective" and could form the basis of an alternative to significance tests. Unfortunately, this is not possible.

Man-Son-Hing, M., Laupacis, A., O'Rourke, K., Molnar, F.J., Mahon, J., Chan, K.B.Y., Wells, G. (2002) Formal statistical methods for analyzing clinical trial data are widely accepted by the medical community. Unfortunately, the interpretation and reporting of trial results from the perspective of clinical importance has not received similar emphasis. This imbalance promotes the historical tendency to consider clinical trial results that are statistically significant as also clinically important, and conversely, those with statistically insignificant results as being clinically unimportant. In this paper, we review the present state of knowledge in the determination of the clinical importance of study results. This work also provides a simple, systematic method for determining the clinical importance of study results. It uses the relationship between the point estimate of the treatment effect (with its associated confidence interval) and the estimate of the smallest treatment effect that would lead to a change in a patient's management. The possible benefits of this approach include enabling clinicians to more easily interpret the results of clinical trials from a clinical perspective, and promoting a more rational approach to the design of prospective clinical trials.

Maxwell, S.E. (2004). Underpowered studies persist in the psychological literature. This article examines reasons for their persistence and the effects on efforts to create a cumulative science. The "curse of multiplicities" plays a central role in the presentation. Most psychologists realize that testing multiple hypotheses in a single study affects the Type I error rate, but corresponding implications for power have largely been ignored. The presence of multiple hypothesis tests leads to 3 different conceptualizations of power. Implications of these 3 conceptualizations are discussed from the perspective of the individual researcher and from the perspective of developing a coherent literature. Supplementing significance tests with effect size measures and confidence intervals is shown to address some but not necessarily all problems associated with multiple testing.

Markus, K.A. (2001) Critics have put forth several arguments against the use of tests of statistical
significance (TOSSes). Among these, the converse inequality argument stands out
but remains sketchy, as does criticism of it. The argument states that we want *P*(*H*|*D*)
(where *H* and *D* represent hypothesis and data, respectively), we
get *P*(*D*|*H*), and the 2 do not equal one another. Each of
the terms in '*P*(*D*|*H*) ≠ *P*(*H*|*D*)' requires clarification.
Furthermore, the argument as a whole allows for multiple interpretations. If the argument questions the logic
of TOSSes, then defenses of TOSSes fall into 2 distinct types. Clarification
and analysis of the argument suggest more moderate conclusions than previously
offered by friends and critics of TOSSes. Furthermore, the general method of
clarification through formalization may offer a way out of the current impasse.
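
To make the converse inequality concrete, here is a small numerical sketch using Bayes' theorem for a hypothesis *H* and its complement. The function name and all numbers are hypothetical and serve only to show that *P*(*D*|*H*) and *P*(*H*|*D*) can differ widely and that the latter depends on the prior.

```python
def posterior_h_given_d(prior_h, p_d_given_h, p_d_given_not_h):
    """P(H|D) from Bayes' theorem for a hypothesis H and its complement (sketch)."""
    # Total probability of the data under H and under not-H.
    evidence = prior_h * p_d_given_h + (1 - prior_h) * p_d_given_not_h
    return prior_h * p_d_given_h / evidence

# Hypothetical inputs: the data are unlikely under H (0.05), likely otherwise (0.80).
even_prior = posterior_h_given_d(0.50, 0.05, 0.80)    # about 0.059
strong_prior = posterior_h_given_d(0.95, 0.05, 0.80)  # about 0.543
```

With the same *P*(*D*|*H*) = 0.05, the posterior probability of *H* ranges from about 0.06 to about 0.54 depending on the prior, so the one quantity cannot stand in for the other.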

Mauk, A-M.K (2000) The present paper summarizes the recommendation that statistical significance testing be replaced or at least accompanied by the reporting of effect sizes and confidence intervals and discusses, in particular, confidence intervals. The recent report of the APA Task Force on Statistical Inference suggested that confidence intervals should always be reported.

McCloskey, D.N., & Ziliak, S.T. (1996) "In a survey of papers published in the American Economic Review, the authors found that 59% use the word 'significance' in ambiguous ways at one point meaning 'statistically significantly different from the null,' at another 'practically important' or 'greatly changing our scientific opinion,' with no distinction."

McGraw, K.O. (1991) In light of the data distortions introduced by BESDs [Binomial Effect Size Displays], I fail to see how they can be touted as "intuitively appealing" and "perfectly transparent" ways of representing treatment effects on dichotomously measured outcomes.

McLean, J.E. (2001) Hypothesis testing is widely regarded as an essential part of statistics, but its use in research has led to considerable controversy in a number of disciplines, especially psychology, with a number of commentators suggesting it should not be used at all. A root cause of this controversy was the overenthusiastic adoption of hypothesis testing, based on a greatly exaggerated view of its role in research. A second cause was confusion between the two forms of hypothesis testing developed by Fisher on the one hand and Neyman and Pearson on the other. This paper discusses these two causes, and also proposes that there is a more general misunderstanding of the role of hypothesis testing. This misunderstanding is reflected in vocabulary such as 'the true value of the parameter'.

McLean, J.E., & Kaufman, A.S. (1998) The research methodology literature in recent years has included a full frontal assault on statistical significance testing. The purpose of this paper is to promote the position that, while significance testing as the sole basis for result interpretation is a fundamentally flawed practice, significance tests can be useful as one of several elements in a comprehensive interpretation of data. Specifically, statistical significance is but one of three criteria that must be demonstrated to establish a position empirically. Statistical significance merely provides evidence that an event did not happen by chance. However, it provides no information about the meaningfulness (practical significance) of an event or if the result is replicable. Thus, we support other researchers who recommend that statistical significance testing must be accompanied by judgments of the event's practical significance and replicability.

McLean, J.E., & Kaufman, A.S. (2000) The Publication Manual
of the American Psychological Association (1994), the style guide required by
*Research in the Schools*, provides little guidance. The Manual discusses both statistical significance (pp. 17-18) and
effect size (p. 18), but only "encourages" (p. 18) authors to provide effect size information.

Since the [*RITS*] Special Issue [on statistical significance], we have also "encouraged" authors to provide
effect size information. In fact, we have required authors to provide effect
size information to accompany statistical significance tests unless they could
provide a compelling reason not to. Since that time, we have had no author make
a compelling case to omit effect size information. Thus, we have decided to
make this policy explicit and require that authors accompany the reporting of
statistical significance tests with effect size information. Specifically, the
following line has been added to the Research in the Schools "Information
for Authors" section: "All reporting of statistical significance must
include an estimate of effect size."

We are hopeful that this change will encourage educational researchers to consider
effect size and practical significance when evaluating the results of a study.
We are also encouraging our Editorial Board members to consider effect size and
the practical significance of a study in their recommendations. In the end, we
hope this change supports the movement towards the reporting of more complete
and accurate results of research studies. Since this change in policy merely
formalizes what we have been practicing for at least two years, we do not
expect that it will have an impact on the number of manuscripts we receive, but
we do hope it will have an impact on the quality of the manuscripts.

Meehl, P.E. (1967) The purpose of the present paper is not so much to propound a
doctrine or defend a thesis (especially as I should be surprised if either
psychologists or statisticians were to disagree with whatever in the nature of
a "thesis" it advances), but to call the attention of logicians and
philosophers of science to a puzzling state of affairs in the currently
accepted methodology of the behavior sciences which I, a psychologist, have been
unable to resolve to my satisfaction. The puzzle, sufficiently striking (when
clearly discerned) to be entitled to the designation "paradox", is
the following: *In the physical sciences,
the usual result of an improvement in experimental design, instrumentation, or
numerical mass of data is to increase the difficulty of the "observational
hurdle" which the physical theory of interest must successfully surmount;
whereas, in psychology and some of the allied behavioral sciences, the usual
effect of such improvement in the experimental precision is to provide an
easier hurdle for the theory to surmount*.

Melton, A.W. (1962) [*About his "misinterpretation of significance levels"*] Melton has often been
blamed for having erroneously claimed that the significance level determines the
probability that a significant result will be found in a replication (pages
553-554: see, for instance, Bakan, 1966, Sedlmeier & Gigerenzer, 1989).
But Melton never asserted that the probability of reproducing results was
1-*p*. He considered only that the smaller the level, the more secure the reproducibility, which is
justified whatever the theoretical statistical framework.

Mendoza J.L., Stafford K.L. (2001) In this article, the authors introduce a computer package written for Mathematica, the purpose of which is to perform a number of difficult iterative functions with respect to the squared multiple correlation coefficient under the fixed and random models. These functions include, among others, computation of confidence interval upper and lower bounds, power calculation, calculation of sample size required for a specified power level, and providing estimates of shrinkage in cross validating the squared multiple correlation under both the random and fixed models. Attention is given to some of the technical issues regarding the selection of, and working with, these two types of models as well as to issues concerning the construction of confidence intervals.

Mialaret, G. (1996) **[ Example of misinterpretation of a confidence interval]**
"Since the value 0 lies within the confidence interval, one cannot reject
the null hypothesis that the two series of values have the same
mean. In other words, one will say that the stocking had no effect
on the fishermen's catch." (page 112).

Milleville-Pennel, I., Hoc, J.-M., & Elise, J. (2007) We are aware that the small number of participants in our study means that the absence of difference in pairwise comparisons needs to be considered with caution. For this reason, a Fiducio-Bayesian analysis has been performed on the non-significant pairwise comparisons in order to ascertain that the population effects are actually negligible (Lecoutre and Poitevineau, 1992; Rouanet, 1996).

Mittag, K.C., & Thompson, B. (2000) Almost as soon as statistical significance tests were
popularized near the turn of this century, critics emerged (Berkson, 1938;
Boring, 1919). The criticism since then has been fairly continual (e.g.,
Carver, 1978; Meehl, 1978; Rozeboom, 1960), but recent commentary has been
particularly striking (cf. Cohen, 1994; Kirk, 1996; Schmidt, 1996; Thompson,
1996, 1999a). Of course, statistical tests also have support from some, though
even most advocates concur that the tests are sometimes misused or misunderstood
(e.g., Frick, 1996; Robinson & Levin, 1997). Particularly thoughtful
advocacy for continued reliance on statistical testing has been offered by
Abelson (1997) and Cortina and Dunlap (1997). A balanced and comprehensive
treatment of the controversies is provided by Harlow, Mulaik, and Steiger
(1997; for detailed reviews of this book, see Levin, 1998, and Thompson, 1998).
Huberty (1987, 1993) and Huberty and Pike (1999) provide the related historical
perspective. However, as Tryon (1998) recently lamented, "The fact that
statistical experts and investigators publishing in the best journals cannot
consistently interpret the results of these analyses is extremely disturbing.
Seventy-two years of education have resulted in minuscule, if any, progress toward
correcting this situation. It is difficult to estimate the handicap that
widespread, incorrect, and intractable use of a primary data analytic method
has on a scientific discipline, but the deleterious effects are doubtless
substantial... (page 796)" Indeed, several *empirical* studies have shown that many researchers do not fully
understand the statistical tests that they employ (Nelson, Rosenthal, &
Rosnow, 1986; Oakes, 1986; Rosenthal & Gaito, 1963; Zuckerman, Hodgins,
Zuckerman, & Rosenthal, 1993).

The present report was written to address two objectives.
First, we wanted to explore current perceptions of AERA
members regarding statistical significance tests. We also explored perceptions
regarding other statistical issues, such as score reliability (e.g., Thompson
& Vacha-Haase, 2000) and stepwise methods (e.g., Cliff, 1987, Huberty,
1989; Thompson, 1995), about which there has also been some controversy. The
present investigation was particularly timely given the recent release of the
related various recommendations of the American Psychological Association (APA)
Task Force on Statistical Inference (Wilkinson & The APA Task Force on
Statistical Inference, 1999). These recommendations will be considered soon in
revising the previous 1994 edition of the APA publication manual, incorporated
by many behavioral science journals into editorial requirements. Second, we
also wanted our report to serve as a vehicle promoting further discussion of
controversial statistical issues. Although we have arrived at reasoned positions
regarding the merits of some research practices, reasonable people disagree
over such issues. We hope our presentation will provide a framework prompting
further discussion.

Morgan, P.L. (2003). This article first outlines the underlying logic of null hypothesis testing and the philosophical and practical problems associated with using it to evaluate special education research. The article then presents 3 alternative metrics - a binomial effect size display, a relative risk ratio, and an odds ratio - that can better aid researchers and practitioners in identifying important treatment effects. Each metric is illustrated using data from recently evaluated special education interventions. The article justifies interpreting a research result as significant when the practical importance of the sample differences is evident and when chance fluctuations due to sampling can be shown to be an unlikely explanation for the differences.
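
The three metrics named in this abstract can be computed as in the sketch below. The 2 x 2 cell layout, the function names and the example counts are assumptions made for illustration; the binomial effect size display is taken in its usual form, in which a correlation r is shown as "success" rates of 0.5 + r/2 and 0.5 - r/2.

```python
def relative_risk(a, b, c, d):
    """Risk ratio for a 2 x 2 table.

    Assumed layout: treated group has a events and b non-events,
    control group has c events and d non-events.
    """
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Odds ratio for the same 2 x 2 layout."""
    return (a * d) / (b * c)

def besd(r):
    """Binomial effect size display: rates implied by a correlation r."""
    return 0.5 + r / 2, 0.5 - r / 2

# Hypothetical counts: 30 of 100 treated and 15 of 100 control pupils improve.
rr = relative_risk(30, 70, 15, 85)
oddsr = odds_ratio(30, 70, 15, 85)
treated_rate, control_rate = besd(0.30)
```

The same table yields a risk ratio of 2 and an odds ratio of about 2.43, a reminder that the two metrics are not interchangeable when the outcome is common.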

Morris, S.B., & Lobsenz, R.E. (2000) The two most common methods for assessing adverse impact, the four-fifths rule and the z-test for independent proportions, often produce discrepant results. These discrepancies are due to the focus on practical versus statistical significance, and on differing operational definitions of adverse impact. In order to provide a more consistent framework for evaluating adverse impact, a new significance test is proposed, which is based on the same effect size as the four-fifths rule. Although this new test was found to have slightly better statistical power under some conditions, both tests have low power under the typical conditions where adverse impact is assessed. An alternative to significance testing would be to report an estimate of the adverse impact ratio along with a confidence interval indicating the degree of precision in the estimate.
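
The discrepancy between the two common methods can be seen in a small sketch. It assumes the usual definitions: the four-fifths rule flags adverse impact when the ratio of selection rates falls below 0.8, and the z-test uses the pooled-proportion standard error. The function name and the counts are hypothetical, and the new test proposed by the authors is not implemented here.

```python
from math import sqrt
from statistics import NormalDist

def adverse_impact(hired_min, n_min, hired_maj, n_maj):
    """Four-fifths rule and z-test for two independent proportions (sketch)."""
    rate_min = hired_min / n_min
    rate_maj = hired_maj / n_maj
    ratio = rate_min / rate_maj              # adverse impact ratio
    flagged = ratio < 0.8                    # four-fifths rule
    # z-test with the pooled selection rate under the null of equal rates.
    pooled = (hired_min + hired_maj) / (n_min + n_maj)
    se = sqrt(pooled * (1 - pooled) * (1 / n_min + 1 / n_maj))
    z = (rate_min - rate_maj) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return ratio, flagged, z, p_two_sided

# Hypothetical applicant pool: 6 of 20 minority and 25 of 50 majority applicants hired.
ratio, flagged, z, p = adverse_impact(6, 20, 25, 50)
```

With these counts the impact ratio is 0.6, so the four-fifths rule flags adverse impact, while the z-test is not significant at the .05 level: the kind of disagreement the abstract refers to.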

Morrison, D.E., & Henkel, R.E. (1969) Apart from implication for improved *use*,
however, our analysis, like Selvin's [Selvin, 1957], more basically questions the general *utility*
of the tests in basic (*not* applied) scientific research. The test provides neither the necessary nor the sufficient
scope or type of knowledge that basic scientific social research requires.
[...] But how *is* scientific inference possible if significance tests are of little help? This question leads us
beyond the scope of this paper, but we have offered some hints: replication
over diverse samples as well as internally, the use of abstract concepts, and
the incorporation of such concepts in deductive theories with the conditions of
their validity specified. There are, of course, no computational formulas for
scientific inference: the questions are much more difficult and the answers
much less definite than those of statistical inference.
**[ Example of misinterpretation of significance levels]** "[...] thus, any difference
in the groups on a particular variable in a given assignment will have some calculable probability of being
due to errors in the assignment procedure [...]" (pages 195-209;196)

Mulaik, S.A., Raju, N.S., & Harshman, R.A. (1997) We expose fallacies in the arguments of critics of null hypothesis significance testing who go too far in arguing that we should abandon significance tests altogether: Beginning with statistics containing sampling or measurement error, significance tests provide prima facie evidence for the validity of statistical hypotheses, which may be overturned by further evidence in practical forms of reasoning involving defeasible or dialogical logics. For example, low power may defeat acceptance of the null hypothesis. On the other hand, we support recommendations to report point estimates and confidence intervals of parameters, and believe that the null hypothesis to be tested should be the value of the parameter given by a theory or prior knowledge. We also use a Wittgensteinian argument to question the coherence of concepts of subjective degree of belief underlying subjective Bayesian alternatives to significance testing.

Murphy, K.R. (1997) If an author decides not to present an effect size estimate along with the outcome of a significance test, I will ask the author to provide specific justification for why effect sizes are not reported. So far, I have not heard a good argument against presenting effect sizes. Therefore, unless there is a real impediment to doing so, you should routinely include effect size information in the papers you submit.

Nelson, N., Rosenthal, R., & Rosnow, R.L. (1986) How do American psychologists use statistical results and related
information to interpret research evidence? In previous studies it was found
that confidence ratings increased with larger sample sizes at the same *p* values,
and we found a similar relationship in this study. Perhaps confidence in research results is a
two-step or three-step process for psychological researchers. First, confidence
is earned by the rejectability of the null hypothesis, with *p*=.05 considered a critical level, but
with greater confidence in general given to lower *p* values. Second, given a *p*
low enough, psychological researchers' confidence is increased by increases in
obtained effect sizes, and especially so for younger investigators. Third,
psychologists trust this effect size more when sample size is larger, because
the effect size is, in general, more accurately estimated when sample sizes are larger.

Nester, M.R. (1996) Hypothesis testing, as performed in the applied sciences, is criticized. Then assumptions that the author believes should be axiomatic in all statistical analyses are listed. These assumptions render many hypothesis tests superfluous. The author argues that the image of statisticians will not improve until the nexus between hypothesis testing and statistics is broken.

Nickerson, R. S. (2000) Null hypothesis significance testing (NHST) is arguably the most widely used approach to hypothesis evaluation among behavioral and social scientists. It is also very controversial. A major concern expressed by critics is that such testing is misunderstood by many of those who use it. Several other objections to its use have also been raised. In this article the author reviews and comments on the claimed misunderstandings as well as on other criticisms of the approach, and he notes arguments that have been advanced in support of NHST. Alternatives and supplements to NHST are considered, as are several related recommendations regarding the interpretation of experimental data. The concluding opinion is that NHST is easily misunderstood and misused but that when applied with good judgment it can be an effective aid to the interpretation of experimental data.

Nunnally, J.C. (1975) **[ Example of misinterpretation of significance levels]** "[...]
95 [chances] out of 100 that the observed difference will hold up in future investigations."
(page 195) [Quoted by Carver, 1978]

Ogles, B.M., Lunnen, K.M., & Bonesteel K. (2001) The meaningfulness of psychotherapy outcome as measured in therapy research is a persistent and important issue. Following a period of emphasis on statistically significant findings for treated versus control groups, many researchers are renewing efforts to investigate the meaningfulness of individual change. Several statistical methods are available to evaluate the meaningfulness of clients' changes occurring as a result of treatment. This article reviews the history of the clinical significance concept; describes the various methods for defining improvement, recovery, and clinically significant change; examines current criticisms of the methods; and describes the current use of the methods in practice.

O'Grady, K.E. (1982) Measures of explained variance (e.g. proportion of variance accounted for) are often considered to indicate the importance of a statistical finding. Three potential limitations to this viewpoint are discussed: psychometric, methodological, and theoretical. A psychometric perspective suggests that errors of measurement produce an upper bound to any measure of explained variance, this upper bound being the product of the reliabilities of the variables whose association is under investigation. A methodological perspective suggests several factors that influence the magnitude of measures of explained variance, including the intentions of the researcher, the design of the research, and the population sampled. A theoretical perspective suggests that most behavior has multiple determinants, and thus the magnitude of measures of explained variance can be made when researchers examine the agreement between the magnitude that their theory would suggest and the empirical finding.
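
The psychometric bound mentioned in this abstract can be illustrated with the classical attenuation relation, under which the observed squared correlation equals the true squared correlation multiplied by the two reliabilities. The function names and reliability values below are hypothetical.

```python
def max_explained_variance(reliability_x, reliability_y):
    """Upper bound on explained variance: the product of the two reliabilities."""
    return reliability_x * reliability_y

def attenuated_r_squared(true_r, reliability_x, reliability_y):
    """Observed r squared under classical attenuation (illustrative assumption)."""
    return (true_r ** 2) * reliability_x * reliability_y

# Hypothetical reliabilities of 0.80 and 0.70.
bound = max_explained_variance(0.80, 0.70)          # 0.56
observed = attenuated_r_squared(0.50, 0.80, 0.70)   # 0.14
```

Even a perfect underlying relationship could not show more than 56% explained variance with these measures, so a small observed value need not mean an unimportant relationship.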

O'Hagan, T. (1996) First
Bayes is a program intended to help with teaching and learning elementary
Bayesian Statistics. It deals with quite simple and standard statistical
models, with an emphasis on obtaining some understanding of how the Bayesian
approach works. It is *not* a package for *doing* statistical analysis of
practical data. Four standard one parameter models are offered: binomial data,
gamma data, Poisson data and normal data with known variance. A major feature
of First Bayes is that such data may be analysed using an arbitrary mixture of
distributions from the conjugate family. By this means, essentially arbitrary
prior distributions can be defined, allowing the user to obtain an excellent
understanding of how the likelihood and prior distribution are combined by
Bayes' Theorem. Prior, likelihood and posterior can be plotted on a single
"triplot". Analysis of two simple kinds of linear model are also offered.
One is the case of one or more normal samples with common but unknown variance
(one-way analysis of variance), and the other is simple linear regression.
Marginal distributions may be computed (and examined) for arbitrary linear
combinations of the location parameters. In the case of regression, scatter and
residual plots can be produced. Predictive distributions are available in a
variety of forms for all analyses.
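
The way a mixture of conjugate priors is combined with the likelihood can be sketched for the binomial case. This is an independent illustration of the standard beta-mixture update, not code from First Bayes; the function names and the prior components are hypothetical.

```python
from math import exp, lgamma, log

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def update_beta_mixture(components, successes, failures):
    """Posterior of a mixture-of-betas prior after binomial data (sketch).

    components: list of (weight, a, b), with weights summing to 1.
    Each Beta(a, b) becomes Beta(a + successes, b + failures); each weight is
    rescaled by that component's marginal likelihood, B(a', b') / B(a, b).
    """
    log_weights, params = [], []
    for w, a, b in components:
        a_post, b_post = a + successes, b + failures
        log_weights.append(log(w) + log_beta(a_post, b_post) - log_beta(a, b))
        params.append((a_post, b_post))
    # Normalise in a numerically stable way.
    top = max(log_weights)
    unnorm = [exp(lw - top) for lw in log_weights]
    total = sum(unnorm)
    return [(u / total, a, b) for u, (a, b) in zip(unnorm, params)]

# Hypothetical prior: half weight on a flat Beta(1, 1), half on an optimistic Beta(10, 2).
posterior = update_beta_mixture([(0.5, 1, 1), (0.5, 10, 2)], successes=3, failures=1)
```

After three successes and one failure, the weight shifts toward the component that predicted the data better, which is the behaviour a prior-likelihood-posterior plot is meant to make visible.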

O'Rourke, K. (1996) I agree with many points Kadane makes in "Prime Time for Bayes" [Kadane, 1995] but question the conclusion that the Bayesian approach should become the mainstay of statistical analysis in randomized clinical trials and completely disagree with the author's concluding remark that this would provide a [the only?] "foundation for clinical trials that makes sense." In fact, there is no known rational basis (i.e., a basis that "makes sense") for empirical science or any process of induction.

Pagano, R.R. (1990) [*Example of interpretation of frequentist confidence intervals in terms of probabilities
about parameters*]

Pascual, J., Frías, Ma.D., & Garcia, J.F. (2000) The standard hypothesis testing method has a number of well-known logical fallacies and the results of the procedures are often misinterpreted. Many scholars have suggested that, perhaps, NHST should be abandoned altogether in favor of other bases for conclusions such as confidence intervals and effect size estimates. Other researchers are often interested in testing the hypothesis that the effects of treatments, interventions, etc. are negligibly small rather than testing the hypothesis that treatments have no effects whatsoever. We argue that we must question the "old" procedures to stimulate the application of new statistical procedures in the progress of scientific inference.

Pascual, J., Garcia, J.F., & Frías, Ma.D. (2000) This paper analyses the relationship between the concepts of statistical significance (level of probability, p) and replicability. The level of statistical significance (for example, p = 0.01) indicates the probability of the data under the null hypothesis assumption; however, this does not mean that in a later replication the probability of obtaining significant differences will be the complementary 0.99. If correctly understood, replicability is exclusively related to the reliability and consistency of the data. The only way to evaluate reliability is through repeated empirical tests.
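
A small numerical sketch may help to see why p = 0.01 does not translate into a 0.99 chance of a significant replication. It rests on strong assumptions that are not in the paper: a two-sided z test, a replication with the same sample size, and a true effect exactly equal to the observed one. The function name is hypothetical.

```python
from statistics import NormalDist


def replication_probability(p_two_sided, alpha=0.05):
    """Chance that an identical replication is significant in the same direction.

    Assumes a z test and, crucially, that the true standardized effect
    equals the one observed in the first study.
    """
    z = NormalDist()
    z_observed = z.inv_cdf(1.0 - p_two_sided / 2.0)
    z_critical = z.inv_cdf(1.0 - alpha / 2.0)
    # Under the assumption, the replication statistic is N(z_observed, 1).
    return 1.0 - z.cdf(z_critical - z_observed)


print(round(replication_probability(0.01), 2))
print(round(replication_probability(0.05), 2))
```

Under these assumptions the first value is about 0.73, and a result that is just significant at .05 replicates only half of the time; neither figure is the complement of the p value.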

Pasquet, P., Monneuse, M.-O., Simmen, B., Marez, A., & Hladik, C.-M. (2006) To extend traditional significance tests of differences in individual recognition thresholds in the fasted and in the satiated states, we used a standard Bayesian procedure, which provides probability statements about the true standardized population differences (d/s) according to the size of the sample, using LeBayesien software (Lecoutre & Poitevineau, 1996). For this purpose, we retained the criterion of Cohen (1969) who defined the cut-off limit for a 'negligible' difference at |d/s| < 0.2.

Pearson, E.S. (1955) This paper contains a reply to some criticisms made by Sir Ronald Fisher in his recent article on "Statistical Methods and Scientific Induction".

Pearson, K. (1900) The object of this paper is to investigate a criterion of the probability of any theory of an observed system of errors and to apply it to the determination of goodness of fit in the case of frequency errors.

Perlman, M.D., & Wu, L. (1999) In the past two decades, striking examples of allegedly inferior likelihood ratio tests (LRT) have appeared in the statistical literature. These examples, which arise in multiparameter hypothesis testing problems, have several common features. In each case the null hypothesis is composite, the size alpha LRT is not similar and hence biased, and competing size alpha tests can be constructed that are less biased, or even unbiased, and that dominate the LRT in the sense of being everywhere more powerful. It is therefore asserted that in these examples and, by implication, many other testing problems, the LRT criterion produces "inferior," "deficient," "undesirable," or "flawed" statistical procedures. This message, which appears to be proliferating, is wrong. In each example it is the allegedly superior test that is flawed, not the LRT. At worst, the "superior" tests provide unwarranted and inappropriate inferences and have been deemed scientifically unacceptable by applied statisticians. This reinforces the well-documented but oft-neglected fact that the Neyman-Pearson theory desideratum of a more (or most) powerful size alpha test may be scientifically inappropriate; the same is true for the criteria of unbiasedness and alpha-admissibility. Although the LR criterion is not infallible, we believe that it remains a generally reasonable first option for non-Bayesian parametric hypothesis-testing problems.

Poitevineau, J. (1998) *Methodology of the analysis of experimental data: A study of the use of significance tests by
psychologists, from normative, prescriptive, and descriptive approaches*

The thesis presented is that the current use of
significance tests by psychologists is unsuited for experimental research. This
question is examined through three approaches, normative, prescriptive and
descriptive, which constitute the three parts of the dissertation. The first
part is devoted, from a methodological viewpoint, to the study of the theories
of statistical test developed by Fisher and by Neyman and Pearson, and which
now constitute the statistical norm. The main features of these theories are
first described, then the numerous criticisms which are still directed at the
statistical tests are examined. Misuses of tests are also examined, as well as
possible reasons for the continued use of these tests. The second part deals
with the pertinence of the prescriptions. Among the main alternatives to the
tests reviewed, only confidence interval methods and Bayesian methods seem to
be potential challengers to the traditional tests. From the analysis of six
popular textbooks of statistical inference designed for psychologists, it
appears that the theories of statistical test are rarely accurately reported
and that those textbooks contain some misuses, particularly among the examples
used. The third part is devoted to the attitudes of psychologists toward
significance tests. Some statistical re-analyses of published results are
presented, as well as a re-analysis we performed using standard Bayesian tools.
Some experiments involving researchers as subjects are also reported, including
two that we carried out for this thesis. We conclude that the use of significance
tests by psychologists is a socially adapted but methodologically unsuited use
of an inadequate tool promoted through misleading guidelines of standard
textbooks. We also mention a probable change in psychologists' attitude toward
significance tests, as a consequence of recommendations from the *American Psychological
Association* that are likely to appear in the near future, and the possibility that Bayesian
analysis will become more and more used.

Poitevineau, J. (1999) In 1962 Cohen published a study that has served
as a model for many others. Its primary objective was methodological:
to bring out certain problems raised by the use of significance tests,
and possibly to draw the consequences for better practice. His aim was to see
how psychologists, so anxious to guard against the Type I error, guarded against
the Type II error. In other words, to see whether the power of the tests
used by psychologists was sufficient for the null hypothesis to have
a good chance of being rejected when it is false. To this end he
analysed all the articles published in volume 61 (1960) of the *Journal of Abnormal and Social Psychology*.
But in this type of study, the data (the effects, the test statistics)
appearing in the articles play no role: to compute the power of the
test used, it is enough to know the structure of the design, the
sample sizes, and the value of the true effect, which is fixed by hypothesis. We
present a reanalysis that we carried out from a more descriptive perspective
than the one involved in power studies such as Cohen's:
the aim is, on the one hand, to list the misinterpretations of
tests that are explicitly committed and, on the other hand, to try to specify the
real scope of the conclusions that are warranted regarding the magnitude of
effects, precisely in relation to the misuses (or shortcomings) of the
interpretations given by the authors. In return, this will make it possible to examine
whether sample sizes are sufficient to obtain satisfactory conclusions
about the magnitude of effects (relative to a given criterion).
The fiducial Bayesian method, which uses a noninformative prior distribution,
will serve as our norm: it is within the framework of this method, and hence relative to it, that we shall try to answer by
examining how the conclusions drawn by researchers from usual statistical
tests could be extended or modified. This method
allows the type of inference to be chosen *a posteriori*, which frequentist methods
(tests or confidence intervals) for seeking a conclusion
of negligible or notable effect do not, strictly speaking, allow. To facilitate comparison with earlier
studies, we chose to reanalyse articles published in the *Journal of Abnormal Psychology*. We
selected volume 103 (1994), that is, the most recent available at the time of this work.

Poitevineau, J. (2004) *The use of significance tests by psychologists: normative, descriptive and prescriptive viewpoints*

At a normative level, the significance tests appear to be ill-suited and the main criticisms
are reported. At a descriptive level, the examination of statistical textbooks, re-analyses
of published papers, and experiments about the use of significance tests by psychologists clearly
reveal many misuses. At a prescriptive level, alternative solutions are considered, especially
the Bayesian methods which appear to be especially attractive.

Poitevineau, J., & Lecoutre, B. (1998) Chow's book [Chow, S.L. (1996). *Statistical Significance: Rationale,
Validity and Utility*. London: Sage.] makes a provocative contribution to the debate on the role
of statistical significance, but it involves some important misconceptions in the presentation of Fisher's
and Neyman-Pearson's theories. Moreover, the author's caricature-like
considerations about "Bayesianism" are completely irrelevant for
discarding the Bayesian statistical theory. These facts call into question the objectivity of his contribution.

Poitevineau, J., & Lecoutre, B. (2001) Comments about previous studies
indicate that the interpretation of significance levels by psychological
researchers is unequivocally dictated by a binary decision-making framework. In
particular confidence in a *p* level would drop abruptly just beyond the fateful .05 level ("cliff
effect"). A replication of Rosenthal and Gaito's experiment on the degree
of confidence in *p* levels shows that these claims should be moderated. Detailed analysis of individual curves
reveals that the attitude of researchers towards *p*-values is far from being as homogeneous as might be expected.
However most psychological researchers in our study rated graduated confidence
judgments, as either exponential or linear. Only a minority of
"all-or-none" respondents exhibited an abrupt drop in confidence.

Pratt, J.W. (1965) This paper is an attempt
to present in an orderly way various ideas about the interpretation of standard
inference statements from the Bayesian point of view. [...] The use of
insufficient statistics will be considered first, in Section 2. It proves easy
to assimilate them into the Bayesian framework. An example is given in which
this leads to some progress on a problem of Bayesian non-parametric statistics.
Estimation and confidence regions are taken up in Sections 3 and 4
respectively, and it is shown that classical properties give approximately, in
a certain weak Bayesian sense, corresponding Bayesian properties. Classical
anomalies no longer seem disturbing from this point of view. Ideas related to
maximum likelihood are postponed to Section 5, and some general remarks concerning
the approximation idea of Sections 3 and 4, including the application of the
likelihood principle, are postponed to Section 6. Tests of hypotheses are
discussed in the next two Sections, significance levels and *P*-values in
Section 7 and common uses of tests in Section 8. The general conclusion here is that only certain one-tailed
*P*-values are interpretable Bayesianly and that, even when this interpretation is applicable, conventional
tests are seldom well articulated to practical problems. Section 9 contains a few final comments.

Press, W.H. (1989) To understand their data better, astronomers need to use statistical tools that are more advanced than traditional "freshman lab" statistics. As an illustration, the problem of combining apparently incompatible measurements of a quantity is presented from both the traditional, and a more sophisticated Bayesian, perspective. Explicit formulas are given for both treatments. Results are shown for the value of the Hubble Constant, and a 95% confidence interval of 66<H0<82 (km/s/Mpc) is obtained.

Pruzek, R.M. (1997) Students in the social and behavioral sciences tend generally to learn inferential statistics from texts and materials that emphasize significance tests or confidence intervals. Bayesian statistical methods support inferences without reference to either significance tests or confidence intervals. This chapter provides an introduction to Bayesian inference. It is shown that this class of methods entails use of prior information and empirical data to generate posterior distributions that in turn serve as the basis for statistical inferences. Two relatively simple examples are used to illustrate the essential concepts and methods of Bayesian analysis and to contrast inference statements made within this subjectivist framework with inference statements derived from classical methods. In particular, the role of posterior distributions in making formal inferences is described and compared with inferences based on classical methods. It also is argued that Bayesian thinking may help to improve definitions of inferential problems, especially in the behavioral and social sciences where the complex nature of applications often may require special strategies to make it realistic for investigators to attempt rigorous formal inferences. Sequentially articulated studies are seen as having special virtue in using results from previous studies to inform inferences about later ones. Numerous references are briefly described to aid the reader who seeks to learn more about Bayesian inference and its applications.

Racine, A., Grieve, A.P., Flühler, H., & Smith, A.F.M. (1986) Four typical applications of Bayesian methods in pharmaceutical research are outlined. The implications of the use of such methods are discussed, and comparisons with traditional methodologies are given. Although a great deal has been written on the comparative merits and demerits of different approaches to statistical inference, this debate has very largely been conducted by theoreticians. Indeed, one of the recurring criticisms of the Bayesian approach seems to have been that it is not "practical". Against this background, it seemed to us - from the perspective of an applied statistics section in a major pharmaceutical company - of some interest to give a review of a variety of day-to-day problems which have been analysed for non-statistical clients within the company using Bayesian methods. For the most part, we shall present a straightforward account of the models, methodology and inference summaries employed, but potentially controversial issues will be clearly signposted. Our hope is that the shift of the focus of the debate from the theoretical to the practical domain will stimulate a more productive discussion of these issues, and one which the "practical statistician" will feel less able to ignore.

Reiser, B. (2001) Mahalanobis distances appear, often in a disguised form, in many statistical problems dealing with comparing two multivariate normal populations. Assuming a common covariance matrix the overlapping coefficient (Bradley, 1985), optimal error rates (Rao and Dorvlo, 1985) and the generalized ROC criterion (Reiser and Faraggi, 1997) are all monotonic functions of the Mahalanobis distance. Approximate confidence intervals for all of these have appeared in the literature on an ad-hoc basis. In this paper we provide a unified approach to obtaining an effectively exact confidence interval for the Mahalanobis distance and all the above measures.

Richard, F.D., Bond, C.F., Jr., & Stokes-Zoota, J.J. (2003) This article compiles results from a century of social psychological research, more than 25 000 studies of 8 million people. A large number of social psychological conclusions are listed alongside meta-analytic information about the magnitude and variability of the corresponding effects. References to 322 meta-analyses of social psychological phenomena are presented, as well as statistical effect-size summaries. Analyses reveal that social psychological effects typically yield a value of r equal to .21 and that, in the typical research literature, effects vary from study to study in ways that produce a standard deviation in r of .15. Uses, limitations, and implications of this large-scale compilation are noted.

Richardson, J.T.E. (1996) Two different approaches have been used to derive measures of effect size. One approach is based on the comparison of treatment means. The standardized mean difference is an appropriate measure of effect size when one is merely comparing two treatments, but there is no satisfactory analogue for comparing more than two treatments. The second approach is based on the proportion of variance in the dependent variable that is explained by the independent variable. Estimates have been proposed for both fixed-factor and random-factor designs, but their sampling properties are not well understood. Nevertheless, measures of effect size can allow quantitative comparisons to be made across different studies, and they can be a useful adjunct to more traditional outcome measures such as test statistics and significance levels.

Rindskopf, D. (1997) Critical attacks on null hypothesis testing over the years have not greatly diminished its use in the social sciences. This chapter tells why the continued use of hypothesis tests is not merely due to ignorance on the part of data analysts. In fact, a null hypothesis that an effect is exactly zero should be rejected in most circumstances; what investigators really want to test is whether an effect is nearly zero, or whether it is large enough to care about. Although relatively small sample sizes typically used in psychology result in modest power, they also result in approximate tests that an effect is small (not just exactly zero), so researchers are doing approximately the right thing (most of the time) when testing null hypotheses. Bayesian methods are even better, offering direct opportunities to make statements such as "the probability that the effect is large and negative is .01; the probability that the effect is near zero is .10; and the probability that there is a large positive effect is .89."

Rindskopf, D. (1998) [...] my preferred solutions in the "controversy" about null-hypothesis testing is (1) recognize that we really want to test the hypothesis that an effect is "small", not null, and (2) use Bayesian methods, which are much more in keeping with the way humans naturally think than are classical statistical methods.

Robert, C.P. (1994) This
book is a translation of a French book written to supplement the gap in the
French statistical literature about Bayesian Analysis and Decision Theory. As a
result, its scope is wide enough to cover most graduate programs. It builds on
very little prerequisites in Statistics and only requires basic skills in
calculus, measure theory, and probability. In terms of level and existing
literature, this book starts at a level similar to those of the introductory
books of Lee (1989) and Press (1989), but it also goes further and keeps up
with most of the recent advances in Bayesian Statistics, while motivating the
theoretical appeal of the Bayesian approach on decision-theoretic justifications.
Nonetheless, this book differs from the reference book of Berger (1985a) by
including the more recent developments of the Bayesian field (the Stein effect
for spherically symmetric distributions, multiple shrinkage, loss estimation,
decision theory for testing and confidence regions, hierarchical developments,
Bayesian computation, mixture estimation, etc.).

The plan of the book is as follows: Chapter 1 is an introduction to statistical models, including the Bayesian
model and some connections with the Likelihood Principle. The book then
proceeds with Chapter 2 on Decision Theory, considered from a classical
point of view, this approach being justified through the axioms of rationality
and the need to compare decision rules in a coherent way. It also includes a
presentation of usual losses and a discussion of the Stein effect.
Chapter 3 gives the corresponding analysis for prior distributions and
deals in detail with conjugate priors, mixtures of conjugate priors, and
noninformative priors, including a concluding section on prior robustness.
Classical statistical models are studied in Chapter 4, paying particular
attention to normal models and their relations with linear regression. This
chapter also contains a section on sampling models that allows us to include
the pedagogical example of capture-recapture models. Tests and confidence
regions are considered separately in Chapter 5, since we present the usual
construction through "0-1" losses, but also include recent advances in the
alternative decision-theoretic evaluations of testing problems. The second part
of the book dwells on more advanced topics and can be considered as providing a
basis for a more advanced graduate course. Chapter 6 covers complete class
results and sufficient/necessary admissibility conditions. Chapter 7
introduces the notion of invariance and its relations with Bayesian Statistics,
including a heuristic section on the Hunt--Stein theorem. Hierarchical and
empirical extensions of the Bayesian approach, including some developments on
the Stein effect, are treated in Chapter 8. Chapter 9 is rather
appealing, considering the available literature, as it incorporates in a
graduate textbook an introduction to state-of-the-art computational methods
(Laplace, Monte Carlo and, mainly, Gibbs sampling). In connection with this
chapter, a short appendix provides the usual pseudorandom generators.
Chapter 10 is a more personal conclusion on the advantages of Bayesian
theory, also mentioning the most common criticisms of the Bayesian approach.

[255] **[Example of interpretation of the frequentist confidence
interval in terms of probability about the parameters]** "For example, if in a survey
of size 1000, one finds

Robert, M. (1994) **[Example of "ambiguous wording"]**
"The majority of researchers in psychology resort to a test of

Robey, R.E. (2004)
The purpose of this tutorial is threefold: (a) review the state of statistical science regarding effect-sizes,
(b) illustrate the importance of effect-sizes for interpreting findings in all forms of research and
particularly for results of clinical-outcome research, and (c) demonstrate just how easily a criterion
on reporting effect-sizes in research manuscripts can be accomplished. The presentation centers on
within-effect analyses of variance including the one-way design for testing pre-post hypotheses and
the two-way parallel-groups design for making direct comparisons of competing treatment protocols
(e.g., experimental treatment versus control). The presentation is supported with worked examples
and a web site containing templates for software applications.

**Educational objectives:** The reader will be able to: (1) explain the rationale for the increased use
of estimates of effect-size in reporting results in published research manuscripts; (2) describe what an
effect-size is (generally considered) and provide a rationale for its importance; (3) distinguish among
the many forms of effect-size and apply their features to the most appropriate choices under specific
research circumstances; and (4) appropriately report and interpret effect-sizes.

Robinson, D.H., & Wainer, H. (2002) Recent criticisms of null hypothesis significance testing (NHST) have appeared in wildlife journals (Cherry 1998; Johnson 1999; Anderson et al. 2000, 2001; Guthery et al. 2001). In this essay, we discuss these criticisms with regard to both current usage of NHST and plausible future use. We suggest that the historical use of such procedures was reasonable and that current users might spend time profitably reading some of Fisher's applied work. However, modifications to NHST, and to the interpretations of its outcomes, might better suit the needs of modern science. Our primary conclusion is that NHST most often is useful as an adjunct to other results (e.g., effect sizes) rather than as a stand-alone result. We cite some examples, however, where NHST can be profitably used alone. Last, we find considerable experimental support for a less dogmatic attitude toward the interpretation of the probability yielded from such procedures.

Rogers, J.L., Howard, K.I., & Vessey, J. (1993) Equivalency testing, a statistical method often used in biostatistics to determine the equivalence of 2 experimental drugs, is introduced to social scientists. Examples of equivalency testing are offered, and the usefulness of the method to the social scientists is discussed.
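
As a rough illustration of what an equivalence test computes, here is a sketch of the two one-sided tests (TOST) logic with a normal approximation. It is an assumption that this simplified form is adequate for illustration; it is not claimed to reproduce the authors' exact procedure, and the function name, the equivalence margin and the numbers are hypothetical.

```python
from statistics import NormalDist


def tost_p_value(difference, se, margin):
    """p value of the two one-sided tests for |true difference| < margin.

    Equivalence is declared at level alpha when the returned value
    (the larger of the two one-sided p values) is below alpha.
    """
    z = NormalDist()
    # H0: true difference <= -margin, rejected for large positive z.
    p_lower = 1.0 - z.cdf((difference + margin) / se)
    # H0: true difference >= +margin, rejected for large negative z.
    p_upper = z.cdf((difference - margin) / se)
    return max(p_lower, p_upper)


# Hypothetical standardized differences with a margin of 0.2.
print(tost_p_value(0.00, se=0.05, margin=0.2))  # precise estimate near zero
print(tost_p_value(0.15, se=0.10, margin=0.2))  # imprecise estimate near the margin
```

The first call gives a very small p value (equivalence supported), while the second does not, even though a conventional test of "no difference" would be non-significant in both cases.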

Rosenthal, R. (1991) When used appropriately, the BESD [Binomial Effect Size Displays] has been used to excellent advantage by methodologically sophisticated behavioral researchers and by experienced mathematical statisticians.

Rosenthal, R., & Gaito, J. (1963) A total of 19
psychologists (graduate students and faculty) rated their degree of confidence
in a variety of *p* levels for each of two assumed sample sizes. The relationship between degree of confidence and
magnitude of *p* levels appeared to be exponential regardless of sample size assumed and type of Ss employed. Ss have
greater confidence in a given *p* level when it was associated with a larger sample size suggesting that investigators
use the probability of both Type I and Type II errors as criteria of
"belief". Graduate students Ss tended to place more confidence in
given *p* levels than did faculty members. For 84 per cent of the Ss, the .05 level had cliff characteristics
manifested by a relatively more precipitous loss of confidence in moving from
the .05 to the .10 level than was true at either higher or lower levels of significance.

Rosenthal, R., & Gaito, J. (1964) In a careful and probably
improved replication of our study, Beauchamp and May (1964) [...] could find
"no evidence" for any .05 cliff effect. [...] Additional evidence for
the existence of an .05 cliff was to be found in their extended report. [...]
For *p* values based on large samples, their 11 graduate student Ss expressed a *greater*
average degree of confidence in the .05 level than did in the .03 level! This
interesting finding even if not statistically significant certainly is
consistent with our hypothesis that the .05 level has rather special
characteristics.
**[ Example of interpretation of significance levels in terms of probabilities about parameters]** "In
summary, the probability that the .05 level of significance possesses cliff
characteristics was established for several samples of psychologists. For one

Rosenthal, R., & Rubin, D.B. (1994) We introduce a new, readily computed statistic, the counternull value of an obtained effect size, which is the nonnull magnitude of effect size that is supported by exactly the same amount of evidence as supports the null value of the effect size. In other words, if the counternull value were taken as the null hypothesis, the resulting p value would be the same as the obtained p value for the actual null hypothesis. Reporting the counternull, in addition to the p value, virtually eliminates two common errors: (a) equating failure to reject the null with the estimation of the effect size as equal to zero and (b) taking the rejection of a null hypothesis on the basis of a significant p value to imply a scientifically important finding. In many common situations with a one-degree-of-freedom effect size, the value of the counternull is simply twice the magnitude of the obtained effect size, but the counternull is defined in general, even with multidegree-of-freedom effect sizes, and therefore can be applied when a confidence interval cannot be. The use of the counternull can be especially useful in meta-analyses when evaluating the scientific importance of summary effect sizes.
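
The "twice the obtained effect size" case can be checked numerically. The sketch below assumes an effect size estimate with a symmetric (here normal) sampling distribution and a known standard error; the function names and the numbers are hypothetical and only illustrate the defining property that the null and the counternull receive the same p value.

```python
from statistics import NormalDist


def counternull(es_obtained, es_null=0.0):
    """Counternull value for a symmetric reference distribution."""
    return 2.0 * es_obtained - es_null


def two_sided_p(estimate, hypothesized, se):
    """Two-sided p value of a z test of the hypothesized effect size."""
    z = abs(estimate - hypothesized) / se
    return 2.0 * (1.0 - NormalDist().cdf(z))


# Hypothetical result: obtained effect 0.30 with standard error 0.20.
d_obtained, se = 0.30, 0.20
d_counternull = counternull(d_obtained)
p_null = two_sided_p(d_obtained, 0.0, se)
p_counternull = two_sided_p(d_obtained, d_counternull, se)
print(d_counternull, p_null, p_counternull)
```

Here the non-significant result is exactly as compatible with an effect of 0.60 as with an effect of zero, which is the point made about error (a).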

Rosenthal, R., & Rubin, D.B. (2003) The purpose of this article is to propose a simple effect size estimate (obtained from the sample size, N, and a p value) that can be used (a) in meta-analytic research where only sample sizes and p values have been reported by the original investigator, (b) where no generally accepted effect size estimate exists, or (c) where directly computed effect size estimates are likely to be misleading. This effect size estimate is called r_equivalent because it equals the sample point-biserial correlation between the treatment indicator and an exactly normally distributed outcome in a two-treatment experiment with N/2 units in each group and the obtained p value. As part of placing r_equivalent into a broader context, the authors also address limitations of r_equivalent.

Rosnow, R.L., & Rosenthal, R. (2002) We introduce students to statistical methods that, although enormously useful, do not yet generally appear in undergraduate methods texts: meta-analysis, contrast analysis, interval estimates of effect sizes and their practical interpretation, and so on. The emphasis of these discussions is intended to resonate with the spirit and substance of the guidelines recommended by the American Psychological Association's Task Force on Statistical Inference (Wilkinson et al., 1999).

Rothman, K.J. (1978) In the past, journals have encouraged the routine use of tests of statistical significance; I believe the time has now come for journals to encourage routine use of confidence intervals instead.

Rouanet, H. (1996) In experimental data analysis when it comes to assessing the importance of effects of interest, 2 situations are commonly met. In situation 1, asserting largeness is sought: "The effect is large in the population." In situation 2, asserting smallness is sought: "The effect is small in the population." In both situations, as is well known, conventional significance testing is far from satisfactory. The claim of this article is that Bayesian inference is ideally suited to making adequate inferences. Specifically, Bayesian techniques based on "noninformative" priors provide intuitive interpretations and extensions of familiar significance tests. The use of Bayesian inference for assessing importance is discussed elementarily by comparing 2 treatments, then by addressing hypotheses in complex analysis of variance designs.
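
The two situations in this abstract (asserting largeness, asserting smallness) can be sketched in the simplest possible setting. The sketch below assumes a normal model with a known standard error and a flat ("noninformative") prior, so that the posterior of the true effect is normal, centred on the observed effect; the function names, thresholds and example values are illustrative assumptions, and a t-based posterior would be needed when the standard error is estimated from a small sample.

```python
from statistics import NormalDist


def posterior(d_obs, se):
    """Posterior of the true effect under a flat prior and known se (assumed)."""
    return NormalDist(mu=d_obs, sigma=se)


def prob_larger_than(d_obs, se, threshold):
    """Posterior probability that the true effect exceeds `threshold`."""
    return 1.0 - posterior(d_obs, se).cdf(threshold)


def prob_smaller_in_abs(d_obs, se, bound):
    """Posterior probability that the true effect lies within (-bound, +bound)."""
    post = posterior(d_obs, se)
    return post.cdf(bound) - post.cdf(-bound)


# Hypothetical observed effect and standard error.
d_obs, se = 0.50, 0.15

p_large = prob_larger_than(d_obs, se, 0.20)      # "the effect is larger than 0.20"
p_small = prob_smaller_in_abs(d_obs, se, 0.20)   # "the effect is smaller than 0.20 in magnitude"

# With the threshold set to 0, the posterior probability of a positive
# effect equals 1 minus the usual one-sided p value.
p_one_sided = 1.0 - NormalDist().cdf(d_obs / se)
p_positive = prob_larger_than(d_obs, se, 0.0)

print(p_large, p_small, p_positive, 1.0 - p_one_sided)
```

The last two printed values agree, which illustrates how, in this setting, a Bayesian statement extends the familiar significance test rather than replacing its arithmetic.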

Rouanet, H. (1998) Chow's efforts towards a methodology of theory-corroboration and the plea for significance testing are welcome, but there are many risky claims. A major omission is a discussion of significance testing in the Bayesian framework. We sketch here the Bayesian reinterpretation of the significance level for assessing direction of effects.

Rouanet, H. (2000) In this introductory chapter, we will discuss descriptive and inductive procedures, then logical and statistical inference, lastly hypotheses and assumptions. Next we will describe the background of current statistical practice, building on the opposition between Mathematical Statistics and Statistics for Researchers. An outline of the book will close the chapter. In Appendix A, we will sketch the historical relationship between Probability and Statistics. In Appendix B, we will discuss the current issue of mind as an intuitive statistician.

Rouanet, H. (2000) In this chapter we will revisit, from a methodological
standpoint, the current statistical practice. We will concentrate our
examination on statistical tests, or *tests* (for short), also called *hypothesis tests*
(in mathematical statistics), or *significance tests* (by Fisher and in statistics for researchers).
Confidence methods, which are also used, but on a more modest scale, will be reviewed more briefly.
We will first review the familiar Student's *t*-test and the chi-square test,
exemplified on the classical Student data and Mendel data, in connection with
two common research paradigms. Then, enlarging the discussion, we will proceed
to a step-by-step examination of the current statistical practice. Then, we
will discuss the perverse effects of the official doctrine and the real
problems that the researcher is faced with. An overall reassessment of current
statistical practice and the outline of alternative frameworks will close the
chapter. In the Appendix, we will discuss further false problems and wrong
tracks, namely the two-tailed vs one-tailed quarrel and the chimera of power.

Rouanet, H., Bernard, J.-M., Bert, M.-C., Lecoutre, B., Lecoutre, M.-P., & Le Roux, B. (2000)
This book, with a Foreword by the outstanding philosopher of
science and mathematical psychologist Patrick Suppes of Stanford University, is
the outgrowth of the work developed within the *Groupe Mathématiques et Psychologie*, a research unit of the
University René Descartes and C.N.R.S. (the French National Center for
Scientific Research). New ways in statistical methodology are presented, which
complement the familiar significance tests by new methods better suited to the
researchers' objectives, in the first place Bayesian methods. In mathematical
statistics, Bayesian methods have made a breakthrough in the last few years,
but those developments are still ignored by the current statistical methodology
and practice. The present book is really the first one to fill this gap. This
book is written for a large audience of researchers, statisticians and users of
statistics in behavioral and social sciences, and contains both an analysis of
the attitude of researchers toward statistical inference, and concrete
proposals for improving statistical practice. The statistical consulting
experience of the authors is centered around psychology and covers a broad
range of subjects from social sciences to biostatistics. All methods developed
by the authors are implemented in software.

Rouanet, H., Bernard, J.-M., & Lecoutre, B. (1986) The familiar sampling procedures of statistical inference can be recast within a purely set-theoretic (ST) framework, without resorting to probabilistic prerequisites. This article is an introduction to the ST approach of statistical inference, with emphasis on its attractiveness for teaching. The main points treated are unsophisticated ST significance testing and ST inference for a relative frequency (proportion).

Rouanet, H., & Bert, M.-C. (2000) This chapter is an introduction to Combinatorial Inference, or Set-theoretic Inference, an alternative to frequentist inference that our Math & Psy Group has been developing since the early eighties: see Rouanet, Bernard, Lecoutre (1986) and Rouanet, Bernard, Le Roux (1990). Its motivation is to provide researchers with a framework that can be used when the "validity assumptions" of the common procedures are not met. [...] Roughly speaking, the algorithms of combinatorial procedures coincide with those of conventional tests, while the random framework is discarded. As a result, in data analysis, it will be possible to keep many familiar algorithms, while the conclusions of Combinatorial Inference are stated in terms of new concepts, such as typicality and homogeneity, formalized in a nonprobabilistic way. We first present Typicality tests, and Homogeneity tests. Then we outline the making of combinatorial inference and discuss related viewpoints.

Rouanet, H., & Bru, B. (1994) It was with emotion that we discovered
that Victor Henri, cofounder of *L'Année Psychologique* with "his master" Alfred Binet, was also the
first French "statistician for psychologists" [Henri, 1895, 1898].
[...] The reading notes that follow have no other aim than to invite
present-day researchers to read, in their turn, this pioneer of "statistics for
researchers", that is, the activity which, quite distinct from
"mathematical statistics", seeks above all to meet the methodological needs of researchers.

Rouanet, H., & Lecoutre, B. (1983) Whenever in a complex design inferences on separate effects
are sought, the (overall) distributional assumptions of a general model are
irrelevant. The *specific inference approach* is examined as a useful alternative to the conventional general
model approach. The specific inference for a particular effect, based only on
data relevant to this effect, is valid regardless of the complexity of the
design. Specific inference is first discussed in terms of significance testing.
It is argued that the usual ANOVA table can be regarded as a system of specific
analyses, each one resting on a separate specific model in its own right. Then specific
inference is discussed within a Bayesian framework. A standard Bayesian ANOVA
is suggested as a direct extension of the usual *F*-test ANOVA. Technical developments and methodological implications are outlined.

Rouanet, H., Lecoutre, M.-P., Bert, M.-C., Lecoutre, B., & Bernard, J.-M. (1991) This
multidisciplinary book stands at the junction of cognitive psychology and statistics. It offers, on the one
hand, an analysis of the approach and attitudes of researchers toward
statistical inference and, on the other hand, proposals for renewing
statistical practice by complementing it with methods (combinatorial and
Bayesian) that are better suited and now accessible thanks to computing.
The book is intended for researchers, teachers and users of
statistics, especially in psychology and the human sciences, who
will find in it a reference text for the approach of the new Paris school
of statistics.
**Contents:**
Rouanet H., Avant-propos, 5-7 -- Rouanet H., Les pratiques statisticiennes
en question, 9-22 -- Rouanet H., Les tests statistiques revisités, 23-45 --
Lecoutre M.-P., Et... le point de vue des chercheurs? Quelques éléments de
réflexion, 47-77 -- Rouanet H., L'approche de l'inférence ensembliste, 79-86 --
Bert M.-C., Inférence sur une moyenne: Test de signification ensembliste et test
d'hypothèse, 87-93 -- Lecoutre B., Du test de signification à l'inférence
fiducio-bayésienne, 95-120 -- Bernard J.-M., Inférence bayésienne et prédictive
sur les fréquences, 121-153.

Rouanet, H., Lépine, D., & Holender, D. (1978) The use of significance tests for validating models is basically inadequate, since nonsignificant results only pertain to the compatibility of the data with the model. In this paper an alternative form of data analysis is proposed, whose objective is to assess the acceptability of the model given the data. The use of Bayes-fiducial inference is suggested in this connection; the approach is exemplified through the analysis of an experiment planned for investigating a model of successive stages of information processing in binary choice reaction.

Rouanet, H., Lépine, D., & Pelnard-Considère, J. (1976) There are various things one may put forward to arouse sympathy for the Bayesian approach; the key ideas in this paper are: (i) The Bayesian approach leads to conclusions directly interpretable in terms of the psychological research objectives, in contrast to the conclusions from the usual, often misused, significance tests. (ii) Technically, the Bayesian approach may lead to surprisingly simple procedures not requiring sophisticated calculations or machine programs; this is especially true for the simplest of Bayesian methods, which we call "Bayes-fiducial methods", to which this paper is devoted. (iii) The Bayesian approach can be used to analyse the real data which educational psychologists handle every day. To show this, we have worked out a complete Bayes-fiducial analysis of a set of educational data; the paper is organized around the analysis of this example.

Rouanet, H., Le Roux, B., Bernard, J.-M., & Lecoutre, B. (2000) This final chapter provides an opening along two directions.
Firstly, it deals with *geometric data*. Secondly, data are *structured*, that
is to say there is a design to investigate several sources of variation [...].
To analyze such data, one may contemplate two lines of approach. Along the line
of Geometric Data Analysis (GDA),
as developed in France, one would start by representing data as *clouds of points* and proceed to the
descriptive exploration of these clouds. Along the line of the Anglo-Saxon MANOVA
tradition, one would start with a *statistical model* and proceed to
inductive analyses. In this chapter, we present a statistical strategy which
combines both approaches. As in GDA,
we conceptualize statistical procedures as geometric operations on clouds of
points, and as in MANOVA, we
carry out inductive analyses. For each analysis, the procedures will be
performed along the following three phases: (i) Descriptive analysis and
observed effects; (ii) MANOVA
significance testing and existence of effects; (iii) Bayesian MANOVA and
importance (largeness) of effects.

Rouder, J.N., & Morey, R.D. (2005) We agree with Fidler et al. that (a) CIs provide valuable information about variability in the data, and (b) there must be continuing discourse over their proper use and meaning. On balance, we prefer the parsimony of arelational CIs: They provide for a rough guide to the variability in data and a quick check of the heterogeneity of variance. Because arelational CIs may not be sufficient for comparative purposes, statistical tests are often necessary. In such cases, the statistical test provides more precise information about the comparison, and should be the main focus in discussion. Researchers who rely on relational CIs need to carefully document their construction, as well as provide some justification for beliefs about their coverage probabilities.

Royall, R.M. (1986) Contradictory interpretations of how the meaning of a significance test depends on the sample size are examined.

Royall, R.M. (1999) Current frequentist methods use probabilities to measure both the chance of errors and the strength of observed evidence (for discussion see Royall, 1997, ch.5). The Law of Likelihood explains that it is likelihood ratios, not probabilities, that measure evidence. The concept of statistical evidence embodied in the Law of Likelihood, and represented in the terms "weak evidence" and "misleading evidence" that are central to the evidential paradigm, can lead to a body of statistical theory and methods that: (1) Requires a probability model for the observable random variables only (and is in that sense frequentist, not Bayesian). (2) Contains a valid, explicit, objective measure of the strength of statistical evidence. (3) Provides for explicit, objective measure (and control) of the probabilities of observing weak or misleading evidence.

Rozeboom, W.W. (1960) The traditional null hypothesis significance test method, more
appropriately called "null hypothesis decision [NHD] procedure", of
statistical analysis is here vigorously excoriated for its inappropriateness as
a method of *inference*. While several
serious objections to the method are raised, its most basic error lies in
mistaking the aim of a scientific investigation to be a *decision*, rather than a *cognitive*
evaluation of propositions. It is further argued that the proper application of
statistics to scientific inference is irrevocably committed to extensive
consideration of inverse probabilities, and to further this end, certain
suggestions are offered, both for the development of statistical theory and for
more illuminating application of statistical analysis to empirical data.

Rozeboom, W.W. (1991) Conceptual rigor is indeed a desideratum worth dedicated pursuit; in fact one might wish that Chow [Chow, 1991a] had pursued it somewhat more diligently in his present essay. I suggest that the approach to data interpretation he advocates here is an etch-a-sketch draft whose prospect for refinement into an operational logic of inference that professional scientists can live by appears minuscule.

Rozencwajg, P., & Corroyer, D. (2005) We tested the magnitude of the effect in the parent population (the true effect, d) by calculating the Bayesian probability (g), knowing that a value of .90 (90%) is considered a sufficient guarantee (Bernard, 1998; Rouanet). For all analyses, we used PAC and LeBayesien statistical software.

Salsburg, D. (1994) The 'Intent to Treat' paradigm for the analysis of a controlled randomized clinical trial is a direct result of applying the Neyman-Pearson formulation of hypothesis testing. If other formulations are used, the 'Intent to Treat' paradigm makes no sense. Criticisms of the Neyman-Pearson formulation and whether it is applicable to scientific investigations have appeared in the statistical and philosophical literature since it was proposed. This paper reviews the nature of that criticism and notes why the Neyman-Pearson formulation, and with it the 'Intent to Treat' paradigm, is inappropriate for use in the analysis of clinical trials.

Sanabria, F., & Killeen, P.R. (2007) Despite being under challenge for the past 50 years, null hypothesis significance testing (NHST) remains dominant in the scientific field for want of viable alternatives. NHST, along with its significance level p, is inadequate for most of the uses to which it is put, a flaw that is of particular interest to educational practitioners who too often must use it to sanctify their research. In this article, we review the failure of NHST and propose p_rep, the probability of replicating an effect, as a more useful statistic for evaluating research and aiding practical decision making.

Sánchez, J., Valera, A., Velandrino, A., & Marin, F. (1992) The aim of this paper was to determine the statistical power
of the research published in the journal *Anales de Psicología* across its life (1984-1991). The sixteen studies available
for this calculation were used for analyzing their statistical power. The
results do not seem to differ from those originally obtained by Cohen (1962),
showing average powers of .13, .47, and .76 for small, medium, and large effect
sizes respectively. Also, the mean power for the estimated effect sizes from
the proper studies increased to .71, very close to the minimum of .80
recommended by Cohen (1988). Finally, the results are compared to other recent power studies.

Schield, M. (1998) Previous papers by the author have argued that the Bayesian
strength of belief can be used in interpreting classical hypothesis tests and
classical confidence intervals. In hypothesis tests, one's strength of belief
in the truth of the alternate upon rejecting the null was argued to be equal to
(1-*p*) under certain conditions. In confidence intervals, being 95% confident was argued as being operationally
equivalent to a willingness to bet on a 95% chance. These interpretations were
taught in an introductory class of non-majors. Students found this approach to
be extremely natural for confidence intervals. But in hypothesis testing,
students had difficulty relating the quality of the test (*p*-value)
to the quality of the decision. The underlying problem is student difficulty with related conditionals.
To overcome this problem, we should teach more about conditionality - not less.

Schmidt, F.L., & Hunter, J.E. (1997) Logically and conceptually, the use of statistical significance testing in the analysis of research data has been thoroughly discredited. However, reliance on significance testing is strongly embedded in the minds and habits of researchers, and therefore proposals to replace significance testing with point estimates and confidence intervals often encounter strong resistance. This chapter examines eight of the most commonly voiced objections to reform of data analysis practices and shows each of them to be erroneous. The objections are: (a) Without significance tests we would not know whether a finding is real or just due to chance; (b) hypothesis testing would not be possible without significance tests; (c) the problem is not significance tests but failure to develop a tradition of replicating studies; (d) when studies have a large number of relationships, we need significance tests to identify those that are real and not just due to chance; (e) confidence intervals are themselves significance tests; (f) significance testing ensures objectivity in the interpretation of research data; (g) it is the misuse, not the use, of significance testing that is the problem; and (h) it is futile to try to reform data analysis methods, so why try? Each of these objections is intuitively appealing and plausible but is easily shown to be logically and intellectually bankrupt. The same is true of the almost 80 other objections we have collected. Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution. After decades of unsuccessful efforts, it now appears possible that reform of data analysis procedures will finally succeed. If so, a major impediment to the advance of scientific knowledge will have been removed.

Schmidt, K. (1995) So why not aim directly at CIs [confidence intervals] even when it is more complex to do so? In most cases, a CI can be determined at least by Monte Carlo simulation and trial and error methods trying to find alternative null hypotheses that can be accepted given the observed value of the test statistic.

Schuirmann, D.J. (1987) The statistical test of
the hypothesis of no difference between the average bioavailabilities of two
drug formulations, usually supplemented by an assessment of what the power of
the statistical test would have been if the true averages had been
inequivalent, continues to be used in the statistical analysis of
bioavailability/bioequivalence studies. In the present article, this Power
Approach (which in practice usually consists of testing the hypothesis of no
difference at level 0.05 and requiring an estimated power of 0.80) is compared
to another statistical approach, the Two One-Sided Tests Procedure, which leads
to the same conclusion as the approach proposed by Westlake [1981] based on the
usual (shortest) 1-2*alpha* confidence
interval for the true average difference. It is found that for the specific
choice of *alpha*=0.05 as the nominal level of the one-sided tests, the
two one-sided tests procedure has uniformly superior properties to the power
approach in most cases. The only cases where the power approach has superior
properties when the true averages are equivalent correspond to cases where the
chance of concluding equivalence with the power approach when the true
averages are not equivalent exceeds 0.05. With appropriate choice of the
nominal level of significance of the one-sided tests, the two one-sided tests
procedure always has uniformly superior properties to the power approach. The
two one-sided tests procedure is compared to the procedure proposed by Hauck and
Anderson [1984].
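
The link this abstract draws between the Two One-Sided Tests Procedure and the 1-2*alpha* confidence interval can be illustrated with a minimal sketch. It assumes a large-sample normal approximation with a known standard error (the bioequivalence setting would normally use a t distribution), and the function names, equivalence margin and example values are illustrative assumptions, not Schuirmann's.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def tost_p_values(diff, se, margin):
    """One-sided p values for H0: theta <= -margin and H0: theta >= +margin."""
    p_lower = 1.0 - STD_NORMAL.cdf((diff + margin) / se)
    p_upper = STD_NORMAL.cdf((diff - margin) / se)
    return p_lower, p_upper


def tost_equivalent(diff, se, margin, alpha=0.05):
    """Conclude equivalence only if both one-sided tests reject at level alpha."""
    return max(tost_p_values(diff, se, margin)) < alpha


def ci_equivalent(diff, se, margin, alpha=0.05):
    """Same decision from the (1 - 2*alpha) interval lying inside the margin."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha)
    return -margin < diff - z * se and diff + z * se < margin


# Hypothetical estimates of the difference between two formulation means.
cases = [(0.05, 0.04), (0.15, 0.04)]
margin = 0.20
decisions = [
    (tost_equivalent(d, se, margin), ci_equivalent(d, se, margin))
    for d, se in cases
]
print(decisions)
```

In both cases the two rules give the same decision, since rejecting each one-sided hypothesis at level alpha is algebraically the same as the corresponding confidence limit falling inside the margin.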

Sedlmeier, P., & Gigerenzer, G. (1989) The long-term impact of studies of statistical power is
investigated using J. Cohen's (1962) pioneering work as an example. We argue
that the impact is nil; the power of studies in the same journal that Cohen
reviewed (now the *Journal of Abnormal Psychology*) has not increased over the past 24 years.
[...] Low power seems to go unnoticed: only 2 out of 64 experiments mentioned power, and it was never
estimated. Nonsignificance was generally interpreted as confirmation of the
null hypothesis (if this was the research hypothesis), although the median
power was as low as .25 in these cases. We discuss reasons for the ongoing neglect of power.

Sedlmeier, P. (2002) APA review of books: Rosenthal, R., Rosnow, R.L., & Rubin,
D.B. (1999). *Contrasts and effect sizes in behavioral research: A
correlational approach*. New York: Cambridge University Press. "This is a
great book that everybody who uses ANOVA techniques should read".

Selvin, H.C. (1957) Statistical tests are unsatisfactory in nonexperimental research for two fundamental reasons: It is almost impossible to design studies that meet the conditions for using the tests, and the situations in which the tests are employed make it difficult to draw correct inferences. [...] Even if studies could be designed so that the correlated biases were controlled, there would remain the problem of correctly interpreting the tests. Many users of tests confuse statistical significance with substantive importance or with size of association. Sociologists would do better to re-examine their purposes in using the tests and to try to devise better methods of achieving these purposes than to continue to resort to techniques that are at best misleading for the kinds of empirical research in which they are principally engaged.

Serlin, R.C., & Lapsley, D.K. (1993) After first examining the Meehlian complaints against psychological research in more detail, we then propose a number of remedies. Against the claim that the significance test cannot be made to threaten a theory with refutation, we propose a "good-enough" methodology that claims to do precisely that. Against the claim that psychological research is not cumulative, we argue, following Lakatos, that progress in research is never plainly evident but must instead be excavated from historical reconstructions of the various literatures. Along the way we provide examples of how one uses "good-enough" hypothesis testing. We also argue that the comparison with physics is not always to our disadvantage when the good-enough methodology and certain Lakatos considerations are kept in mind. Finally, we conclude with a discussion of what rational appraisal of psychological research might look like, and how this might have an impact on graduate training in psychology.

Shrout, P.E. (1997) Significance testing of null hypotheses is the standard epistemological method for advancing scientific knowledge in psychology, even though it has drawbacks and it leads to common inferential mistakes. These mistakes include accepting the null hypothesis when it fails to be rejected, automatically interpreting rejected null hypotheses as theoretically meaningful, and failing to consider the likelihood of Type II errors. Although these mistakes have been discussed repeatedly for decades, there is no evidence that the academic discussion has had an impact. A group of methodologists is proposing a new approach: simply ban significance tests in psychology journals. The impact of a similar ban in public-health and epidemiology journals is reported.

Sim, J., & Reid, N. (1999) This article examines the role of the confidence interval (CI) in statistical inference and its advantages over conventional hypothesis testing, particularly when data are applied in the context of clinical practice. A CI provides a range of population values with which a sample statistic is consistent at a given level of confidence (usually 95%). Conventional hypothesis testing serves to either reject or retain a null hypothesis. A CI, while also functioning as a hypothesis test, provides additional information on the variability of an observed sample statistic (ie, its precision) and on its probable relationship to the value of this statistic in the population from which the sample was drawn (ie, its accuracy). Thus, the CI focuses attention on the magnitude and the probability of a treatment or other effect. It thereby assists in determining the clinical usefulness and importance of, as well as the statistical significance of, findings. The CI is appropriate for both parametric and nonparametric analyses and for both individual studies and aggregated data in meta-analyses. It is recommended that, when inferential statistical analysis is performed, CIs should accompany point estimates and conventional hypothesis tests wherever possible.

Skipper, Jr, J.K., Guenther, A.L., & Nass, G. (1967) There is a need for social scientists to choose levels of significance with full awareness of the implications of Type I and Type II error for the problem under investigation. The current use of arbitrary levels of alpha, while appropriate for some designs, detracts from interpretive power in others. Moreover, the tendency to dichotomy resulting from judging some results "significant" and others "nonsignificant" can be misleading both to professional and lay audiences. It is suggested that a more rational approach might be to report the actual level of significance, placing the burden of interpretive skill upon the reader. Such a policy would also encourage social scientists to give higher priority to selecting appropriate levels of significance for a given problem.

Smithson, M. (2001) The advantages that confidence intervals have over null-hypothesis significance testing have been presented on many occasions to researchers in psychology. This article provides a practical introduction to methods of constructing confidence intervals for multiple and partial R^2 and related parameters in multiple regression models based on "noncentral" F and Chi^2 distributions. Until recently, these techniques have not been widely available due to their neglect in popular statistical textbooks and software. These difficulties are addressed here via freely available SPSS scripts and software and illustrations of their use. The article concludes with discussions of implications for the interpretation of findings in terms of noncentral confidence intervals, alternative measures of effect size, the relationship between noncentral confidence intervals and power analysis, and the design of studies.

Smithson, M. (2002) **Table of contents:**
Ch 1 Introduction and overview. Ch 2 Confidence statements and interval estimates; Why confidence intervals?
Ch 3 Central confidence intervals; Central and standardizable versus noncentral distributions;
Confidence intervals using the central t and normal distributions;
Confidence intervals using the central chi-square and F distributions; Transformation principle.
Ch 4 Noncentral confidence intervals for standardized effect sizes; Noncentral distributions; Computing noncentral confidence intervals.
Ch 5 Applications in ANOVA and regression; Fixed-effects ANOVA; A priori and post-hoc contrasts;
Regression: multiple, partial, and semi-partial correlations; Effect-size statistics for MANOVA and setwise regression;
Confidence interval for a regression coefficient; Goodness of fit indices in structural equations models.
Ch 6 Applications in categorical data analysis; Odds ratio, Difference between proportions and relative risk;
Chi-square confidence intervals for one variable; Two-way contingency tables; Effects in log-linear and logistic regression models.
Ch 7 Significance tests and power analysis; Significance tests and model comparison; Power and precision;
Designing studies using power analysis and confidence intervals; Confidence intervals for power.
Concluding remarks. References.

Snyder, P. (2000) These guidelines are designed
to provide authors who submit manuscripts to the journal and reviewers with a
uniform set of expectations regarding the reporting of results from statistical
investigations. The information contained in this editorial is not exhaustive
in relation to issues that might be addressed. As research designs and methods
continue to evolve, authors, reviewers, and editors associated with *JEI* will need to engage
periodically in dialogue about how inquiry submitted to
the journal will be evaluated. We invite your input about issues that you think
should be addressed in future editorials.

Spiegelhalter, D.J. (2004) We argue that the Bayesian approach is best seen as providing additional tools for those carrying out health-care evaluations, rather than replacing their traditional methods. A distinction is made between those features that arise from the basic Bayesian philosophy and those that come from the modern ability to make inferences using very complex models. Selected examples of the former include explicit recognition of the wide cast of stakeholders in any evaluation, simple use of Bayes theorem and use of a community of prior distributions. In the context of complex models, we selectively focus on the possible role of simple Monte Carlo methods, alternative structural models for incorporating historical data and making inferences on complex functions of indirectly estimated parameters. These selected issues are illustrated by two worked examples presented in a standardized format. The emphasis throughout is on inference rather than decision-making.

Spiegelhalter, D.J., & Freedman, L.S. (1988) We summarise current
statistical practice in clinical trials, and review Bayesian influence over the
past 25 years. It is argued that insufficient attention has been paid to the
dynamic context in which development of therapeutic innovations takes place, in
which *experimenters*, *reviewers* and *consumers*
form different interest groups, and may well process the
same evidence in different ways. We illustrate the elicitation of quantitative
prior opinion in trial design and show how graphical expression of current
belief can be related to regions of possible benefit with different clinical
implications. Such displays may be used both for ethical monitoring of trials
and to predict the consequences of further sampling.

Spiegelhalter, D.J., Freedman, L.S., & Blackburn, P.R. (1986) At an interim point in a clinical trial, trial organisers may
wish to use the data on the initial series of patients to judge the likely consequences
of further patient accrual. Halperin and colleagues [*Controlled Clinical Trials*, 1982, *3*, 311-323] have suggested calculating the power of a continued
trial, *conditional* on the data observed so far and the null and alternative hypotheses specified at the start
of the trial, derived by averaging the conditional power with respect to the
current belief about the unknown parameters. Although numerical methods are
generally required for evaluating the necessary integrals, the results may be
presented graphically and enable the statistician to answer the question:
"With the data so far, what is the chance that the trial will end up
showing a conclusive result?"

Spiegelhalter, D.J., Freedman, L.S., & Parmar, M.K.B. (1994) Statistical issues in conducting randomized trials include the choice of a sample size, whether to stop a trial early and the appropriate analysis and interpretation of the trial results. At each of these stages, evidence external to the trial is useful, but generally such evidence is introduced in an unstructured and informal manner. We argue that a Bayesian approach allows a formal basis for using external evidence and in addition provides a rational way for dealing with issues such as the ethics of randomization, trials to show treatment equivalence, the monitoring of accumulating data and the prediction of the consequences of continuing a study. The motivation for using this methodology is practical rather than ideological.

Steiger, J.H. (2004) In his landmark 1978 paper, Paul Meehl delineated, with remarkable clarity, some fundamental challenges facing soft psychology as it attempts to test theory with data. In the quarter century that followed, Meehl's views stimulated much debate and progress, while continually evolving to keep pace with that progress. This paper pays homage to Meehl's prescience, and traces the impact of his ideas on the recent shift of emphasis away from hypothesis testing and toward confidence interval estimates of effect size.

Steiger, J.H. (2004) This article presents confidence interval methods for improving on the standard F tests in the balanced, completely between-subjects, fixed-effects analysis of variance. Exact confidence intervals for omnibus effect size measures, such as omega-squared and the root-mean-square standardized effect, provide all the information in the traditional hypothesis test and more. They allow one to test simultaneously whether overall effects are (a) zero (the traditional test), (b) trivial (do not exceed some small value), or (c) nontrivial (definitely exceed some minimal level). For situations in which single-degree-of-freedom contrasts are of primary interest, exact confidence interval methods for contrast effect size measures such as the contrast correlation are also provided.

Sterne, J.A.C., & Davey Smith, G. (2001) P values, or significance levels, measure the strength of the evidence against the null hypothesis; the smaller the P value, the stronger the evidence against the null hypothesis. An arbitrary division of results, into "significant" or "nonsignificant" according to the P value, was not the intention of the founders of statistical inference. A P value of 0.05 need not provide strong evidence against the null hypothesis, but it is reasonable to say that P < 0.001 does. In the results sections of papers the precise P value should be presented, without reference to arbitrary thresholds. Results of medical research should not be reported as "significant" or "nonsignificant" but should be interpreted in the context of the type of study and other available evidence. Bias or confounding should always be considered for findings with low P values. To stop the discrediting of medical research by chance findings we need more powerful studies.

Sterne, J.A.C. (2002) Confusion in the teaching of statistical inference dates back to the conflict of Fisher's P-values and significance tests with the Neyman-Pearson hypothesis testing approach. To avoid the well-known pitfalls arising from over-reliance on significance tests and the division of results into 'significant' or 'not significant', many medical journals now insist that presentation of statistical analyses includes confidence intervals as well as or instead of P-values. The confusion over how to report statistical analyses which is evident in the recent medical literature is matched by divergent teaching of hypothesis tests between the 16 U.K. medical schools represented at the April 2000 Burwalls meeting. Suggested guidelines for the teaching of statistical inference to medical students are presented, and possible future developments are discussed.

Student (1908) **[Example of interpretation of significance levels in terms of probabilities
about parameters]** "From the table the probability is .9985 or the odds
are about 666 to 1 that 2 [the second soporific] is the better soporific." (page 21)
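
The arithmetic behind the quoted odds can be checked with a short sketch. The function name `odds_in_favour` is ours, introduced only for illustration; the sole input taken from the quotation is the probability .9985.

```python
def odds_in_favour(probability: float) -> float:
    """Convert a probability p into odds p / (1 - p) in favour of the event."""
    if not 0.0 <= probability < 1.0:
        raise ValueError("probability must lie in [0, 1)")
    return probability / (1.0 - probability)


# Probability read from the table in the quotation above.
p_better = 0.9985
odds = odds_in_favour(p_better)

# 0.9985 / 0.0015 is roughly 665.7, which rounds to the quoted "666 to 1".
print(f"odds in favour: about {round(odds)} to 1")
```

The conversion itself is uncontroversial; what this bibliography flags is the reading of a significance level as a probability about the parameter.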

Sylvester, R.J. (1988) A new strategy for the design of Phase II clinical trials is presented which utilizes the information provided by the prior distribution of the response rates, the costs of treating a patient, and the losses or gains resulting from the decisions taken at the completion of the study. A risk function is derived from which one may determine the optimal Bayes sampling plan. The decision theoretic/Bayesian approach is shown to provide a formal justification for the sample sizes used in practice and shows the conditions under which such sample sizes are clearly inappropriate.

Taube, A. (1980) By means of data from fictitious crossover trials, it is first demonstrated that a statistically significant difference is not necessarily of a practically important order of magnitude. This fact is of special interest when the number of observations is large. Second, a statistically nonsignificant difference does not prove the hypothesis about equality between, say, treatment effects. This fact is of special interest when the number of observations is small. For investigating whether equality is possible, confidence intervals are more useful than nonsignificant results from tests of significance.

Thompson, B. (1994) Authors reporting statistical significance will be *required*
to both report and interpret effect sizes. However, these effect sizes may be of various forms, including
standardized differences, or uncorrected (e.g., *r-square*, *R-square*,
*eta-square*) or corrected (e.g., adjusted *R-square*,
*omega-square*) variance-accounted-for statistics.

Thompson, B. (1994) Too few researchers understand what statistical significance
testing does and doesn't do, and consequently their results are misinterpreted.
Even more commonly, researchers understand elements of statistical significance
testing, but the concept is not integrated into their research. For example,
the influence of sample size on statistical significance may be acknowledged by
a researcher, but this insight is not conveyed when interpreting results in a
study with several thousand subjects.

This Digest will help you better understand the concept of significance testing. The meaning of probabilities,
the concept of statistical significance, arguments against significance
testing, misinterpretation, and alternatives are discussed.

Thompson, B. (2001) The author asserts that editors should publicly declare their expectations and expose the rationales for editorial policies to public scrutiny. He argues that editorial policies ought to require effect size reporting, as those at 17 journals now do. He also argues (a) that score reliabilities should be reported; (b) that stepwise methods should not be used; (c) that structure coefficients should be interpreted; and (d) that if used wisely, confidence intervals differ from hypothesis tests in important ways. The use of noncentral t and F distributions to create confidence intervals about effect sizes also is appealing.

Thompson, B. (2001) The following three brief articles extend the discussion in the previous article regarding future prospects for progress in researchers' reporting and interpreting effect sizes. The authors of these brief pieces represent diverse views. [...] These three articles also serve as a useful precursor to the series of articles by Australian scholars that will be published in the August issue. These articles treat various issues in computing "noncentral" confidence intervals about effect sizes and related estimates.

Trafimow, D. (2003) Because the probability of obtaining an
experimental finding given that the null hypothesis is true [*p*(F|*H*0)]
is not the same as the probability that the null hypothesis is true given a
finding [*p*(*H*0|F)], calculating the former probability does not
justify conclusions about the latter one. As the standard null-hypothesis
significance-testing procedure does just that, it is logically invalid (J.
Cohen, 1994). Theoretically, Bayes's theorem yields *p*(*H*0|F), but
in practice, researchers rarely know the correct values for 2 of the variables
in the theorem. Nevertheless, by considering a wide range of possible values
for the unknown variables, it is possible to calculate a range of theoretical
values for *p*(*H*0|F) and to draw conclusions about both hypothesis testing and theory evaluation.
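
The distinction between *p*(F|*H*0) and *p*(*H*0|F) can be made concrete with a minimal sketch of Bayes's theorem evaluated over a grid of values for the two quantities that are rarely known: the prior *p*(*H*0) and *p*(F|not-*H*0). The function name and the grid values below are illustrative assumptions, not figures from the article.

```python
def posterior_h0(p_f_given_h0: float, prior_h0: float, p_f_given_not_h0: float) -> float:
    """Bayes's theorem: p(H0 | F) from p(F | H0), p(H0) and p(F | not-H0)."""
    numerator = p_f_given_h0 * prior_h0
    denominator = numerator + p_f_given_not_h0 * (1.0 - prior_h0)
    return numerator / denominator


# Hypothetical finding with p(F | H0) = .05, the conventional threshold.
p_f_given_h0 = 0.05

# Illustrative grids for the two unknown quantities (assumed values).
priors = [0.1, 0.5, 0.9]
likelihoods_under_alternative = [0.1, 0.5, 0.9]

posteriors = [
    posterior_h0(p_f_given_h0, prior, p_alt)
    for prior in priors
    for p_alt in likelihoods_under_alternative
]

# The same p(F | H0) is compatible with a wide range of p(H0 | F).
print(f"p(H0 | F) ranges from {min(posteriors):.3f} to {max(posteriors):.3f}")
```

On this grid the posterior probability of the null hypothesis spans values both far below and far above .05, which is why the former probability alone does not justify conclusions about the latter.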

Trafimow, D. (2005) In their comment on D. Trafimow (2003), M. D. Lee and E. Wagenmakers (2005) argued that the requisite probabilities to use in Bayes's theorem can always be found. In the present reply, the author asserts that M. D. Lee and E. Wagenmakers use a problematic assumption and that finding the requisite probabilities is not straightforward. After describing the assumption and the conceptual problems it entails, the author presents some numerical examples to demonstrate that the conceptual problems cause important errors. The author explores some common ground between the original article and the comment by using the ratio form of Bayes's theorem but notes that this procedure is also not without problems.

Tryon, W.W. (1998) "The fact that statistical experts and investigators publishing in the best journals cannot consistently interpret the results of these analyses is extremely disturbing. Seventy-two years of education have resulted in minuscule, if any, progress toward correcting this situation. It is difficult to estimate the handicap that widespread, incorrect, and intractable use of a primary data analytic method has on a scientific discipline, but the deleterious effects are doubtless substantial [...]" (page 796)

Tryon, W.W. (2001) Null hypothesis statistical testing (NHST) has been debated extensively but always successfully defended. The technical merits of NHST are not disputed in this article. The widespread misuse of NHST has created a human factors problem that this article intends to ameliorate. This article describes an integrated, alternative inferential confidence interval approach to testing for statistical difference, equivalence, and indeterminacy that is algebraically equivalent to standard NHST procedures and therefore exacts the same evidential standard. The combined numeric and graphic tests of statistical difference, equivalence, and indeterminacy are designed to avoid common interpretive problems associated with NHST procedures. Multiple comparisons, power, sample size, test reliability, effect size, and cause-effect ratio are discussed. A section on the proper interpretation of confidence intervals is followed by a decision rule summary and caveats.

Vacha-Haase, T., Nilsson, J.E., Reetz, D.R., Lance, T.S., & Thompson, B. (2000) The recent fourth edition of the American Psychological Association Publication Manual emphasized that p values are not acceptable indices of effect and "encouraged" effect-size reporting. However, empirical studies of reporting practices of diverse journals unequivocally indicate that this new encouragement has to date been ineffective. Here two additional multi-year studies of APA journals are reported. Additionally, all 50 APA editorials that have been published since 1990 were reviewed to determine how many editors with approval have articulated policies more forceful than the APA Publication Manual's vague and seemingly self-canceling encouragement. It is suggested that changes in editorial policies will be required before improved reporting will become routine.

Valera, A., Sánchez, J., & Marin, F. (1997) Several proposals for complementing the information offered by statistical hypothesis testing are described. Using these proposals mitigates the harshest criticism that significance tests have received: significance tests do not offer information about the magnitude of the relationship among the variables involved. The proposals discussed in this paper are confidence intervals, effect size, binomial effect size display, counter-null value and common language effect size indicator.

Valera, A., Sánchez, J., & Marin, F. (2000) The purpose of this paper was to analyse the application of the most common statistical procedure for studying relationships among variables and empirical phenomena in psychology: the null hypothesis statistical test. In order to determine whether its use in Spanish psychological research is adequate, we carried out a power study of the papers published in Spanish journals. The analysis of the 169 experiments selected, with a total of 5,480 statistical tests, showed power values of 0.18, 0.58, 0.83, and 0.59 for low, medium, high, and estimated effect sizes, respectively. These values decreased drastically, by about 20%, when the calculations were repeated controlling the Type I error inflation through Bonferroni adjustment. The results were very similar to those obtained in other international power studies and lead us to consider the need for special attention to controlling statistical power when designing research. We also discuss several proposals complementary to the use of significance tests that may improve the information obtained.

Valera, A., Sánchez, J., Marin, F., & Velandrino, A. (1998) Although the purpose of hypothesis testing is to reject the
null hypothesis and to detect relationships among variables, inadequate
control of statistical power is very common in psychological research.
Accordingly, the statistical power of the papers published in the *Revista de Psicología
General y Aplicada* from 1990 to 1992 was analyzed. The power for low, medium, high, and estimated effect sizes was
computed. The values we found were .17, .57, .83, and .55, respectively. Moreover,
the distribution of effect magnitudes and sample sizes in the journal papers
was also analyzed. Finally, the results are discussed and compared with those of the other power studies in the literature.

VanVoorhis, W.C., & Morgan, B.L. (2001) In this article we highlight the statistical rules of thumb guiding the selection of sample sizes for detecting differences, associations, chi-square, and factor analyses.

Vargha, A., & Delaney, H.D. (2000) McGraw and Wong (1992) described an appealing index of effect
size, called *CL*, which measures the difference between two populations in terms of the probability that a score
sampled at random from the first population will be greater than a score
sampled at random from the second. McGraw and Wong introduced this "common
language effect size statistic" for normal distributions and then proposed
an approximate estimation for any continuous distribution. In addition, they
generalized CL to the *n*-group case, the correlated sample case, and the discrete value case.

In the current paper a different generalization of *CL*, called the *A* measure
of stochastic superiority, is proposed, which may be directly applied for any discrete or continuous variable
that is at least ordinally scaled. Exact methods for point and interval
estimation as well as the significance tests of the *A*=.5 hypothesis are provided.
New generalizations of *CL* are provided for the multi-group and
correlated samples cases.
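
A minimal sketch of a sample estimate of stochastic superiority follows. It assumes the definition usually attributed to this measure, *A* = P(X > Y) + 0.5 P(X = Y), which the abstract itself does not spell out; the function name `a_measure` and the toy data are ours. The exact interval estimates and significance tests mentioned in the abstract are not reproduced here.

```python
from typing import Sequence


def a_measure(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Estimate A = P(X > Y) + 0.5 * P(X = Y) from two samples.

    Every pair (x, y) is compared; ties count for one half, so the
    estimate applies to discrete or ordinal scores as well as continuous ones.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


# Toy ordinal ratings (assumed data, for illustration only).
treatment = [3, 4, 4, 5, 5]
control = [2, 3, 3, 4, 5]
print(f"A = {a_measure(treatment, control):.2f}")
```

A value of .5 corresponds to stochastic equality of the two populations, the hypothesis tested in the paper; values above .5 favour the first population.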

Victor, N. (1987) The single value currently used to judge the clinical relevance of therapeutic effects frequently does not suffice to adequately formulate the problems of clinical studies, and the standard statistical procedure (the testing of the classical null hypothesis) fails to take this value duly into account. Therefore, it is proposed to judge the clinical relevance and importance by means of four values, fixed in discussions with the clinician before commencement of the study, and to proceed by testing non-zero null hypotheses (shifted null hypotheses) where the "clinically relevant difference" is the shift parameter. Methodological problems resulting from the shifting of the null hypothesis are discussed, and other possibilities to take into account the clinically relevant difference (introduction of criteria of success) are considered.

Vokey, J.R. (2003) Many research designs in experimental psychology generate data that are fundamentally discrete or categorical in nature, and produce multiway tables of frequencies. Despite an extensive and, more recently, accessible literature on the topic, multiway frequency analysis is rarely used in experimental psychology. A reason may be the form of exposition in the literature, with emphases and concerns far removed from those of the typical experimental psychologist. An approach to multiway frequency analysis for experimental psychologists is described that has the features we want: asymmetrical designs, factors assessed for their respective main and interactive effects in a manner analogous to ANOVA, and the ability to handle within-subject designs.

Wade, O.L., & Waterhouse, J.A.H. (1977) It would seem to us to be easier for those who design clinical
trials to continue to use the usual form of tests of significance based on the
null hypothesis. But it is vital that a *statistically significant* difference should not necessarily be assumed
to be an *important* difference. It is extremely
important that doctors [...] are not persuaded by advertisers or others to
accept statistically significant differences in the performance of drugs as
necessarily indicating a difference of practical importance of value.

Wagenmakers, E.-J., & Grünwald, P. (2006) Bayesian hypothesis tests are often criticized because of their dependence on prior distributions. Yet in our example, no matter what prior is used, the Bayesian test provides substantially less evidence against H0 than either p values or prep. [...] It is our subjective belief that Bayesian methods will prove useful not only for statisticians, but also for psychologists.

Wang, Y.H. (2000) A general explanation of the fiducial confidence interval and its construction for a class of parameters in which the distributions are stochastically increasing or decreasing is provided. Major differences between the fiducial interval and Bayesian and frequentist intervals are summarized. Applications of fiducial inference in evaluating pre-data frequentist intervals and general post-data intervals are discussed.

Wellek, S., & Michaelis, J. (1991) The paper outlines an approach to the general methodological problem of equivalence assessment which is based on the classical theory of testing statistical hypotheses. Within this frame of reference it is natural to search for decision rules satisfying the same criteria of optimality which are customarily applied in deriving solutions to one- and two-sided testing problems. For three standard situations very frequently encountered in medical applications of statistics, a concise account of such an optimal test for equivalence is presented. It is pointed out that tests based on the well-known principle of confidence interval inclusion are valid in the sense of guaranteeing the prespecified level of significance, but tend to have an unnecessarily low efficiency.

Windeler, J., & Conradt, C. (2000) Clinical trials are aimed at providing results which enable improvements in patient care. It is widely criticized, however, that the characterization of results as "significant" or "non-significant" does not allow any assessment of their clinical relevance. To counter this criticism 2 biostatistical concepts are available: the use of confidence intervals and the application of statistical tests with shifted null-hypotheses. Possibilities and limitations of these concepts are discussed in this contribution.

Wilkinson, L. and Task Force on Statistical Inference (1999) In the light of continuing debate over the applications of
significance testing in psychology journals and following the publication of
Cohen's (1994) article, the Board of Scientific Affairs (BSA) of the American
Psychological Association (APA) convened a committee called the Task Force on
Statistical Inference (TFSI) whose charge was "to elucidate some of the
controversial issues surrounding applications of statistics including
significance testing and its alternatives; alternative underlying models and
data transformation; and newer methods made possible by powerful
computers" (BSA, personal communication, February 28, 1996). Robert Rosenthal,
Robert Abelson, and Jacob Cohen (cochairs) met initially and agreed on the
desirability of having several types of specialists on the task force:
statisticians, teachers of statistics, journal editors, authors of statistics
books, computer experts, and wise elders. Nine individuals were subsequently
invited to join and all agreed. These were Leona Aiken, Mark Appelbaum, Gwyneth
Boodoo, David A. Kenny, Helena Kraemer, Donald Rubin, Bruce Thompson, Howard
Wainer, and Leland Wilkinson. In addition, Lee Cronbach, Paul Meehl, Frederick
Mosteller and John Tukey served as Senior Advisors to the Task Force and
commented on written materials.

The TFSI met twice in two years and
corresponded throughout that period. After the first meeting, the task force
circulated a preliminary report indicating its intention to examine issues
beyond null hypothesis significance testing. The task force invited comments
and used this feedback in the deliberations during its second meeting.

After the second meeting, the task
force recommended several possibilities for further action, chief of which
would be to revise the statistical sections of the *American Psychological Association Publication Manual* (1994). After
extensive discussion, the BSA recommended that "before the TFSI undertook
a revision of the *APA Publication Manual,*
it might want to consider publishing an article in *American Psychologist,* as a way to initiate discussion in the field
about changes in current practices of data analysis and reporting".

This report follows that request. The sections in italics are proposed guidelines that the TFSI recommends could be
used for revising the APA publication manual or for developing other BSA
supporting materials. Following each guideline are comments, explanations, or
elaborations assembled by Leland Wilkinson for the task force and under its
review. This report is concerned with the use of statistical methods only and
is not meant as an assessment of research methods in general. Psychology is a
broad science. Methods appropriate in one area may be inappropriate in another.

**Power and sample size.** *Provide information on sample size and the process that led to
sample size decisions. Document the effect sizes, sampling and measurement
assumptions, as well as analytic procedures used in power calculations. Because
power computations are most meaningful when done before data are collected and
examined, it is important to show how effect-size estimates have been derived
from previous research and theory in order to dispel suspicions that they might
have been taken from data used in the study or, even worse, constructed to
justify a particular sample size. Once the study is analyzed, confidence
intervals replace calculated power in describing results.*

**Hypothesis tests.** *It is hard to imagine a situation in which a dichotomous
accept-reject decision is better than reporting an actual p value or, better
still, a confidence interval. Never use the unfortunate expression "accept
the null hypothesis." Always provide some effect-size estimate when reporting
a p value.*

**Effect sizes.** *Always present effect sizes for primary outcomes. If the units of
measurement are meaningful on a practical level (e.g., number of cigarettes
smoked per day), then we usually prefer an unstandardized measure (regression
coefficient or mean difference) to a standardized measure (r or d). It helps to
add brief comments that place these effect sizes in a practical and theoretical
context.*

**Interval estimates.** *Interval estimates should be given for any effect sizes involving
principal outcomes. Provide intervals for correlations and other coefficients
of association or variation whenever possible.*
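
As an illustration of an interval estimate for a correlation, here is a minimal sketch using the Fisher z-transformation. This particular approximation is our choice, not a method prescribed by the task force, and it assumes a bivariate normal sample with n > 3; the function name and the example values are hypothetical.

```python
import math
from statistics import NormalDist


def correlation_ci(r: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """Approximate confidence interval for a Pearson correlation.

    Uses the Fisher z-transformation: atanh(r) is treated as roughly
    normal with standard error 1 / sqrt(n - 3), and the interval is
    transformed back to the correlation scale with tanh.
    """
    if n <= 3:
        raise ValueError("n must exceed 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)
    standard_error = 1.0 / math.sqrt(n - 3)
    z_critical = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower = math.tanh(z - z_critical * standard_error)
    upper = math.tanh(z + z_critical * standard_error)
    return lower, upper


# Hypothetical result: r = .50 observed on 50 cases.
low, high = correlation_ci(0.50, 50)
print(f"r = .50, 95% CI [{low:.2f}, {high:.2f}]")
```

Reporting the interval alongside the point estimate shows how imprecise a correlation from a sample of this size remains, which a bare p value does not convey.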

Williams, A.M. (1998) Throughout introductory tertiary statistics subjects, students are introduced to a multitude of new terms for statistical concepts and procedures. One such term, significance level, has been considered in the statistical literature. Three themes of discussion relate to this concept - the problem of interpretation (and misinterpretation), the selection of an appropriate level, and the evaluation of results based on significance level. However, empirical research regarding this concept is very limited. This paper reports on a qualitative study which used concept maps and standard hypothesis tests to investigate students' conceptual and procedural knowledge of the significance level concept. Eighteen students completing an introductory tertiary statistics subject were interviewed after their final exam in statistics. Results showed that many students did not have a good understanding of the concept.

Williams, V.S.L., Jones, L.V., & Tukey, J.W. (1999) Three alternative procedures to adjust significance levels for multiplicity are the traditional Bonferroni technique, a sequential Bonferroni technique developed by Hochberg (1988), and a sequential approach for controlling the false discovery rate proposed by Benjamini and Hochberg (1995). These procedures are illustrated and compared using examples from the National Assessment of Educational Progress (NAEP). A prominent advantage of the Benjamini and Hochberg (B-H) procedure, as demonstrated in these examples, is the greater invariance of statistical significance for given comparisons over alternative family sizes. Simulation studies show that all three procedures maintain a false discovery rate bounded above, often grossly, by "alpha" (or "alpha"/2). For both uncorrelated and pairwise families of comparisons, the B-H technique is shown to have greater power than the Hochberg or the Bonferroni procedures, and its power remains relatively stable as the number of comparisons becomes large, giving it an increasing advantage when many comparisons are involved. We recommend that results from NAEP State Assessments be reported using the B-H technique rather than the Bonferroni procedure.
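
The contrast between the traditional Bonferroni technique and the Benjamini and Hochberg step-up procedure can be sketched as follows. The function names and the p values are illustrative assumptions, not NAEP data, and Hochberg's (1988) sequential Bonferroni variant is omitted for brevity.

```python
from typing import List, Sequence


def bonferroni(pvalues: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject each hypothesis whose p value is at most alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Step-up procedure controlling the false discovery rate.

    Sort the p values, find the largest rank k such that
    p_(k) <= k * alpha / m, and reject the k smallest p values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= rank * alpha / m:
            largest_rank = rank
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected


# Illustrative family of five comparisons (assumed p values).
p_family = [0.001, 0.012, 0.020, 0.030, 0.200]
print("Bonferroni rejections:", sum(bonferroni(p_family)))
print("B-H rejections:       ", sum(benjamini_hochberg(p_family)))
```

With these assumed values the B-H procedure rejects more hypotheses than Bonferroni at the same nominal alpha, in line with the greater power reported in the abstract.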

Willink, R., Lira, I. (2005). The frequentist and Bayesian philosophies of statistical inference require different approaches to the calculation of an interval of uncertainty for a measurand. A frequentist (or classical) interval will have an associated confidence level, p, that is the probability of generating an interval enclosing the value of the measurand. A Bayesian interval will have an associated credible level, p, that is a 'degree-of-belief' that the value of the measurand subsequently lies within the interval. Since potential users are not primarily concerned with the method of analysis, a shared interpretation of the information given to them seems desirable. We obtain such an interpretation by recognising that in either case p is the proportion of independent intervals calculated over time that contain their respective measurands. This interpretation is also useful in explaining an interval calculated according to the procedure of Guide to the Expression of Uncertainty in Measurement.

Wilson, G. (2003) For the last 50 years Bayesians and frequentists have disputed the appropriate way to do statistics. Bayesian methods have grown in popularity and acceptance, but how is the conflict between Bayesians and frequentists likely to play out in the future? This article uses theories advanced by Thomas Kuhn and Lawrence Grossberg to offer a framework for understanding possible futures and to pose questions about the future of the field of statistics.

Winch, R.F., & Campbell, D.T. (1969) To do or not to do a test of significance - that is a question that divides men of good will and sound competence. We believe that although unreasonable claims are sometimes made for the test of significance and that although many have sinned in implicitly treating statistical significance as proof of a favored explanation, still the social scientist is better off for using the significance test than for ignoring it. More precisely, it is our judgment that although the test of significance is irrelevant to the interpretation of the cause of a difference, still it does provide a relevant and useful way of assessing the relative likelihood that a real difference exists and is worthy of interpretive attention, as opposed to the hypothesis that the set of data could be a haphazard arrangement.

Winkler, R.L. (1974) [...] The gap between theory and practice in statistical analysis is investigated, with particular attention given to the Bayesian approach to statistical analysis. [...] Current statistical practice in experimental psychology and various factors contributing to the theory-practice gap in statistical analysis are considered. Finally, some general questions involving scientific reporting and the use of Bayesian procedures in statistical inference are discussed.

Whitehead, J. (1993) The title of this paper was chosen for me by the organizers of the conference. [...] A more accurate title for my paper would be "the case for frequentism in definitive phase III clinical trials". In fact, part of the paper could even be entitled "the case for Bayesianism in early phase clinical trials". Each approach has its place. In keeping with the requirements for robust arguments, I shall simplify my opinions into clear terms of black and white, rather than writing at greater length in various shades of grey. I shall begin with an account of the frequentist statements which can be made at the end of any scientific investigation. This will be made in an abstract setting. Then I shall briefly explain my own understanding of Bayesian approaches. Implementation of the two philosophies in the different phases of clinical research will be discussed, and I shall end with some remarks concerning the preservation of scientific rigour in clinical research.

Yates, F. (1951) "The emphasis given to formal tests of significance throughout [R.A. Fisher's] Statistical Methods [...] has caused scientific research workers to pay undue attention to the results of the tests of significance they perform on their data, particularly data derived from experiments, and too little to the estimates of the magnitude of the effects they are investigating." [...] "The emphasis on tests of significance and the consideration of the results of each experiment in isolation, have had the unfortunate consequence that scientific workers have often regarded the execution of a test of significance on an experiment as the ultimate objective."

Yoccoz, N.G. (1991) "In marked contrast to what is advocated by most statisticians, most evolutionary biologists and ecologists overemphasize the potential role of significance testing in their scientific practice. Biological significance should be emphasized rather than statistical significance. Furthermore, a survey of papers showed that the literature is infiltrated by an array of misconceptions about the use and interpretation of significance tests." [...] "By far the most common error is to confound statistical significance with biological, scientific significance. [...] " [...] "Statements like 'the two populations are significantly different relative to parameter X (P=.004)' are found with no mention of the estimated difference. The difference is perhaps statistically significant at the level .004, but the reader has no idea if it is biologically significant." [...] "Most biologists and other users of statistical methods still seem to be unaware that significance testing by itself sheds little light on the questions they are posing."

Zeisel, H. (1955) There is now, in the social sciences, no greater need than the development of theoretical insights guided by empirical data. At such times, to provide this guidance and serve as a stimulant is the significance of statistically insignificant data. Even if the probability is great that an inference will have to be rejected later, the practical risk of erring is small. Subsequent and more elaborate studies may disprove some of these inferences; but for those that survive social science will be the richer.

Zhu, M., & Lu, A.Y. (2004) In Bayesian statistics, the choice of the prior distribution is often controversial. Different rules for selecting priors have been suggested in the literature, which, sometimes, produce priors that are difficult for the students to understand intuitively. In this article, we use a simple heuristic to illustrate to the students the rather counter-intuitive fact that flat priors are not necessarily non-informative; and non-informative priors are not necessarily flat.
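One way to see why a flat prior need not be non-informative is to change the parameterization. The sketch below is not the heuristic used by Zhu and Lu; it is a separate, standard illustration, and the function names are hypothetical. It assumes a Uniform(0, 1) prior on a Bernoulli probability p and computes how much prior mass this places on intervals of the log-odds, log(p / (1 - p)).

```python
import math

def sigmoid(t):
    """Inverse of the log-odds transform: maps log-odds t back to p."""
    return 1.0 / (1.0 + math.exp(-t))

def prior_mass_on_log_odds(lo, hi):
    """Mass that a Uniform(0, 1) prior on p gives to lo < logit(p) < hi.

    Under the uniform prior, P(logit(p) < t) = P(p < sigmoid(t)) = sigmoid(t).
    """
    return sigmoid(hi) - sigmoid(lo)

# Two log-odds intervals of the same width (2 units each).
central = prior_mass_on_log_odds(-1.0, 1.0)
shifted = prior_mass_on_log_odds(1.0, 3.0)
print(f"mass on (-1, 1): {central:.3f}")
print(f"mass on ( 1, 3): {shifted:.3f}")
```

Under these assumptions the prior that is flat in p gives about twice as much mass to the central log-odds interval as to the shifted interval of the same width, so it is not flat, and in that sense not uninformative, on the log-odds scale.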

Zuckerman, M., Hodgins, H., Zuckerman, A., & Rosenthal, R. (1993) We asked active psychological researchers to answer a survey regarding the following data-analytic issues: (a) the effect of reliability on Type I and Type II errors, (b) the interpretation of interaction, (c) contrast analysis, and (d) the role of power and effect size in successful replications. Our 551 participants (a 60% response rate) answered 59% of the questions correctly; 46% accuracy would be expected according to participants' response preferences alone. Accuracy was higher for respondents with higher academic ranks and for questions with "no" as the right answer. It is suggested that although experienced researchers are able to answer difficult but basic data-analytic questions at better than chance levels, there is also a high degree of misunderstanding of some fundamental issues of data analysis.
