75. Sick Versus Slick: 9. Why Did the Psychologist Cross the Road?

For the better part of my early adult life I attended university. In my case, I studied clinical psychology. From start to finish, that path required 10 years of post-secondary school, a one-year residency, and roughly two years of post-residency registration, including professional examinations and licensing. During that time, you spend the majority of your days reading about, writing about, and practicing your profession.

Typically, the audience for your work consists of two people: yourself and your professor, or your supervisor, or your client. That is a lot of effort for a very small audience. Now, don’t get me wrong, the ultimate outcome is wonderful: you get to do what you love, and your efforts benefit others as well as yourself.

Most clinical psychologists (and health care providers) stop learning with the same intensity once they are past all of these academic and professional hurdles. This is natural. One cannot keep that pace indefinitely, and the demands of clinical practice soon become all-consuming. As well, for many of us, life goals such as marriage, children, and mortgages had been deferred during our education, and we now choose to focus our time and energy there.

Training in clinical psychology is somewhat odd. It requires a high level of knowledge of research design and statistics and, in PhD programs, demands that you complete original research to graduate. At the same time, you must also learn the practical art and craft of therapy. This is why training to become a clinical psychologist is often said to follow a scientist-practitioner model.

Yet, upon completion of our degree and training, very few clinical psychologists will continue to pursue any further original research.  It is not that we are incapable of doing this research.  We simply choose to not do this work.  Excluding dissertation publications, the modal number of publications for clinical psychologists following graduation is exactly zero.

There are a number of reasons for this, such as the absence of a post-graduate university affiliation, a lack of research funding, and the difficulty of gaining ethics clearance. More fundamentally, actively pursuing research is somewhat boring and lacks the level of emotional fulfillment that one achieves through helping clients realize their potential to overcome life obstacles.

Which, ironically, brings me full circle. In my clinical practice, I began to notice that quite a few of my otherwise physically healthy male clients were being prescribed testosterone replacement treatment. I found this odd and inconsistent with my knowledge of the endocrine system, the mechanics of human sexuality, and the nature of intimate relationships. Very few men I saw were being helped by this medication. Quite the contrary. Most suffered uncomfortable and negative side effects and did not realize any change in their sexual potency or improvement in their intimate relationships.

So, I thought it would be a good idea to use my research skills and dig through the literature. In part, I wanted to bring that information to my clients and I needed a place to park that information.  However, I also realized that I missed writing as I once did back in my formative years.

And that is why I began to blog.

And the audience is greater than two.

Your Friendly Druggist


33. Higher Levels of Phthalates in Men Associated with Delayed Pregnancy in Female Partners

DATE: March 5, 2014

FROM: US National Institutes of Health

TITLE: High plasticizer levels in males linked to delayed pregnancy for female partners

ISSUE:  Women whose male partners have high concentrations of three common forms of phthalates, chemicals found in a wide range of consumer products, take longer to become pregnant than women in couples in which the male does not have high concentrations of the chemicals, according to researchers at the National Institutes of Health and other institutions.


30. Testosterone Products: Drug Safety Communication – FDA Investigating Risk of Cardiovascular Events

[I don’t want to say I told you so, but…]

DATE: January 31, 2014

FROM: US Food and Drug Administration

AUDIENCE: Cardiology, Urology, Family Practice

ISSUE: FDA is investigating the risk of stroke, heart attack, and death in men taking FDA-approved testosterone products. We have been monitoring this risk and decided to reassess this safety issue based on the recent publication of two separate studies that each suggested an increased risk of cardiovascular events among groups of men prescribed testosterone therapy. FDA is providing this alert while it continues to evaluate the information from these studies and other available data. FDA will communicate final conclusions and recommendations when the evaluation is complete.

BACKGROUND: Testosterone is a hormone essential to the development of male growth and masculine characteristics. Testosterone products are FDA-approved only for use in men who lack or have low testosterone levels in conjunction with an associated medical condition.

RECOMMENDATION: At this time, FDA has not concluded that FDA-approved testosterone treatment increases the risk of stroke, heart attack, or death. Patients should not stop taking prescribed testosterone products without first discussing any questions or concerns with their health care professionals. Health care professionals should consider whether the benefits of FDA-approved testosterone treatment are likely to exceed the potential risks of treatment. The prescribing information in the drug labels of FDA-approved testosterone products should be followed.

Healthcare professionals and patients are encouraged to report adverse events or side effects related to the use of these products to the FDA’s MedWatch Safety Information and Adverse Event Reporting Program:

Complete and submit the report Online: http://www.fda.gov/MedWatch/report.htm

Download form or call 1-800-332-1088 to request a reporting form, then complete and return to the address on the pre-addressed form, or submit by fax to 1-800-FDA-0178.


22. Sick Versus Slick: 8. The Base-Rate Fallacy

  • Probability, like time, is a concept invented by humans, and humans have to bear the responsibility for the obscurities that attend it. (John Archibald Wheeler)

We have been deconstructing the Androgen Deficiency in Aging Males (ADAM) questionnaire and measuring its worth in terms of identifying men who are possibly experiencing low testosterone. Based on data from its source publication, the ADAM’s ability to accurately predict men with low testosterone (its positive predictive value) is about 42 percent.

A positive predictive value of 42 percent suggests that the ADAM will be wrong more often than it is right.  If the ADAM predicts you have low testosterone, it is a safe bet that you do not.  However, because we know that 1 out of 4 men in the original ADAM study had low testosterone, the ADAM did outperform guessing by over 15 percent.
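To make that arithmetic concrete, here is the calculation in Python. The cell counts (69 true positives among 164 positive ADAM scores) are the ones derived from the original study elsewhere in this series:

```python
# Cell counts derived from the original ADAM study in this series:
# 164 men scored positive on the ADAM; 69 of them truly had low testosterone.
ppv = 69 / 164          # positive predictive value
base_rate = 0.25        # 1 in 4 men in the sample had low testosterone

print(round(ppv, 2))              # 0.42: the ADAM is wrong more often than right
print(round(ppv - base_rate, 2))  # 0.17: improvement over guessing the base rate
```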

Is a 15 percent increase meaningful?

Well, that depends on what we are trying to predict, the relative costs of error in that prediction, and the original base rate from which we began.

If we are trying to improve our ability to predict a tornado, a 15 percent increase in prediction may lead to saved lives.  However, if tornados are rare in our geographical area, this method of prediction will also lead to a greater number of false alarms. False alarms may cause unnecessary panic or, worse, a dismissal of tornado warnings as usually wrong by those who live in that area.  Under these circumstances, we might prefer a more precise early warning system and we might deem a 15 percent increase as not beneficial.

If we are trying to predict the presence of a specific type of rare cancer, a 15 percent increase in prediction again will also lead to many false alarms.  However, if that cancer is highly treatable once detected and its treatment not invasive, we might be willing to accept a trade-off between this increase in prediction coupled with an increase in false alarms.

The point here is that our capacity to make a reasonable judgement depends on the meaning of the event to us, the probability of that event occurring, and our ability to accurately detect that event once it has occurred. Classic Bayesian probability, of course, cannot comment on the subjective relevance or moral weight of an event. It cannot know what we hold in our mind’s eye. Instead, its main focus is on the general probability or base rate of an event and the specific instance of that event under consideration.

As human beings, however, we seem to fail to consider the base rate that underlies all events. We are seduced by the instance. The example given in the last post, based on a study by Agoritsas and his colleagues (2011), illustrates both our blindness and easy seduction. To repeat:

  • As a school doctor, you perform a screening test for a viral disease in a primary school.
  • The properties of the test are very good:  Among 100 children who have the disease, the test is positive in 99, and negative in only 1, and among 100 children who do not have the disease, the test is negative in 99, and falsely positive in only 1.
  • On average, about 1 out of 100 children are infected without knowing it.
  • If the test for one of the children is positive, what is the probability that he or she actually has this viral disease?

As mentioned in the last post, when this problem was posed to a sample of more than 1000 physicians practicing in Switzerland, the majority put the probability that the child has the disease at 95 percent or greater. This remained true even when the stated prevalence was manipulated to range anywhere from 1 to 95 percent, or was left undetermined.

In fact, the answer to this riddle is 50 percent as depicted below:

Fig 21-7

The high sensitivity and specificity of the diagnostic test are tempered by the low prevalence of the virus. Although the test is highly accurate, it still produces a relatively large number of false positives because far more people are free of the virus (99 out of 100) than have it (1 out of 100).
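The 50 percent figure is easy to verify with Bayes’ rule. A minimal sketch in Python (the function name is mine, not from the study):

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of actually having the disease given a positive test."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Virus problem: 99% sensitive, 99% specific, 1-in-100 prevalence.
print(round(positive_predictive_value(0.99, 0.99, 0.01), 2))  # 0.5
```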

The authors of this study highlight that the improper use of probability may result in medical error. If the outcome of diagnostic error carries biomedical consequences, then the tendency to ignore prevalence or base rates goes beyond a curious phenomenon of human decision making; it becomes a potentially harmful event.

In the psychology literature, the general disregard of base rate information has long been a focus of study. Meehl and Rosen (1955) offered an early exploration of the importance of base rates or, more specifically, the lack of base rate information in most psychological tests. In the 1970s, Kahneman and Tversky conducted considerable experimental research on universal flaws in reasoning during decision-making tasks. However, one of the more comprehensive and influential articles on base rate errors was written by Bar-Hillel in 1980.

Bar-Hillel labelled the tendency to ignore information about the historical occurrence of an event the base-rate fallacy. Her interest was in gaining a better understanding of the circumstances under which base rate errors were most likely to occur.

Bar-Hillel did not see the base-rate fallacy as inevitable.  Instead, she demonstrated that its influence could be reduced through manipulation of how information was presented and, more importantly, by increasing the relevance of base rate information. Bar-Hillel argued that if we deem information as possessing low relevance then we tend to disregard that information.  It is not that we are unaware or ignorant of base rate information.  On the contrary, she argued, we disregard this information because we strongly feel it should be disregarded.

The results of the Agoritsas study clearly demonstrate that the majority of physicians do not attend to a disorder’s general occurrence or base rate when making a clinical decision. They fail to do so either because of unawareness or, following Bar-Hillel, because they deem it of low importance.

In the case of the virus problem, the information provided was highly sparse and intended to focus attention on the importance of base rates. Real diagnostic problems, however, are complex and carry an abundance of possibly relevant information. Disregarding some information is an important step in pruning a problem down to its smallest set of possible diagnoses. Determining that the probability of a correct diagnosis is 50 percent, given the outcome of a specific test, makes complete sense in Bayesian logic. Yet, when a definitive yes or no response is required, as in health care, judicial decisions, or a marriage proposal, this is not overly helpful.

When asked to decide whether a child is positive for a virus, one must decide. You cannot 50 percent decide. You cannot treat a child with a half-measure. In the Agoritsas study, those physicians surveyed may have intuitively moved past the question of probability and toward the final goal of clinical action. For these physicians, if a child tests positive for a virus, they will choose to treat that child. Therefore, while it may be true that their answer to the question, as posed, was incorrect, the course of action that stemmed from the incorrect answer may have been consistent with those physicians’ method of practice.

Another way of thinking about the physicians’ process of decision making is that they chose to disregard the existing virus base rate and, instead, inserted a prior probability of perfect uncertainty.  In Bayesian terms, a prior of perfect uncertainty looks like this:

Fig 22-1

If you think about it, most of us approach common day-to-day decisions in this way.  Our base rates are subjective and tend to follow our personal history of exposure to certain events.  If a problem is novel to us, we may opt for a prior probability of perfect uncertainty, as did the physicians in the virus problem.  Across time, however, as we accumulate personal history of the same repeated event, we may start to adjust our prior probability rate.
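Under the assumption that the physicians quietly swapped the 1 percent prevalence for a 50/50 prior, their answers suddenly look coherent. A sketch (the function name is mine):

```python
def posterior(sensitivity, specificity, prior):
    # Bayes' rule: P(disease | positive test) under a chosen prior
    tp = sensitivity * prior
    fp = (1 - specificity) * (1 - prior)
    return tp / (tp + fp)

print(round(posterior(0.99, 0.99, 0.01), 2))  # true prevalence: 0.5
print(round(posterior(0.99, 0.99, 0.50), 2))  # perfect uncertainty: 0.99
```

With a prior of perfect uncertainty, the posterior lands at 99 percent, which is precisely the neighborhood of the answer most physicians actually gave.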

A prior probability of perfect uncertainty is quite allowable in Bayesian probability.  In fact, under Bayesian inference, it is mandatory (more on this in later posts).  Bayesian probability was originally designed as a method of determining unknown events and, ironically, it was this subjective quality of the Bayesian approach that led to it being disfavored in those years following its publication.

Given the pervasive nature of the base-rate fallacy, the strong push in medical practice to treat any possible disorder, and that meeting patients’ needs is correlated with patient satisfaction, it is very likely that any physician who obtains a positive test result will move toward treatment even in the midst of high false positive rates.

21. Sick Versus Slick: 7. Bayes’ Theorem

In this series, we have been discussing the diagnostic statistics of sensitivity, specificity, and predictive values as they pertain to the Androgen Deficiency in Aging Males (ADAM) questionnaire. The ADAM questionnaire was designed by Morley and his colleagues as a screening test for low testosterone among middle-aged men. Although the ADAM demonstrates good sensitivity (88%) in identifying men possibly experiencing low testosterone, its specificity (60%) and positive predictive value (42%) are more modest.

In the prior post, we showed how the predictive value of a positive response on the ADAM can be calculated directly.

To refresh your memory, Morley et al (2000) administered their questionnaire to a sample of 316 Canadian physicians and measured those physicians’ testosterone levels.  Twenty-five percent of this physician sample had bioavailable testosterone levels lower than 70 ng/mL.  The frequency count of physicians who fell into each diagnostic category was as follows:

Fig 21-1

The positive predictive value, or the percentage of physicians who had a positive response on the ADAM and low testosterone, was 42% as shown below:

Fig 21-2
There is another method of calculating the positive predictive value and that is by using Bayes’ Theorem.  Formally, Bayes’ Theorem is:

Bayes 21-1: P(A|B) = P(B|A) × P(A) / P(B)

To explain, let’s superimpose Bayesian terms over our ADAM example.

Fig 21-3

Here, event A (testosterone) can be low (A) or not low (A’) and event B (ADAM) can be positive (B) or not positive (B’). Figuring out the probabilities of A and B is fairly straightforward.

The probability of A, or low testosterone, is the number of men with low testosterone divided by the total number of men who were tested:

Fig 21-4

In diagnostic terms, P(A) represents the base rate or prevalence of the disorder under investigation.  In Bayesian language, P(A) is called the prior probability.  The prior represents the known or assumed probability of the event or condition of interest we are trying to predict.  It is thought of as prior in that it represents information we possess before we conduct our diagnostic test.

The probability of B, or a positive ADAM score, is the number of men with positive ADAM scores divided by the total number of men who were tested:

Fig 21-5
P(B) can be thought of as our available evidence, or the results of our diagnostic test. In our current example, roughly 50 percent of the physicians had positive scores on the ADAM.

The next term, P(B|A), is defined as the probability of B given A. However, the phrase “B given A” is only crystal clear to those who have spent the majority of their adult life wrestling with math, physics, or statistical formulas. For the rest of us, it is easier to think of P(B|A) as asking the question: “How many B’s are in A?” That is, the number of positive ADAM scores in the low testosterone group is:

Fig 21-6
In diagnostic terms, this represents the sensitivity of a test. In Bayesian language, P(B|A) is called the likelihood function. Likelihood is an important concept in both Bayesian and diagnostic statistics.

With the above terms, we can now calculate the positive predictive value of the ADAM as follows:

Bayes 21-2: P(A|B) = (0.88 × 0.25) / 0.52 ≈ 0.42

It does seem unnecessary to use Bayes’ theorem to determine the positive predictive value of a diagnostic test (or, in Bayesian terms, the posterior probability) when one can do this easily by using a simple table. Yet Bayesian statistics are important in that they emphasize that diagnostic information depends not only on sensitivity and specificity but on disorder prevalence as well.
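As a sketch of the two routes side by side, here is the ADAM posterior computed both from the table and through Bayes’ theorem, using the cell counts estimated in this series:

```python
tp, fn, fp, tn = 69, 9, 95, 143   # estimated cell counts from the ADAM study

# Table route: true positives over all positive predictions.
ppv_table = tp / (tp + fp)

# Bayes route: likelihood x prior / evidence.
n = tp + fn + fp + tn
prior = (tp + fn) / n            # P(A): prevalence of low testosterone
evidence = (tp + fp) / n         # P(B): rate of positive ADAM scores
likelihood = tp / (tp + fn)      # P(B|A): sensitivity
ppv_bayes = likelihood * prior / evidence

print(round(ppv_table, 2), round(ppv_bayes, 2))  # 0.42 0.42
```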

To illustrate, consider the following problem:

  • As a school doctor, you perform a screening test for a viral disease in a primary school.
  • The properties of the test are very good:  Among 100 children who have the disease, the test is positive in 99, and negative in only 1, and among 100 children who do not have the disease, the test is negative in 99, and falsely positive in only 1.
  • On average, about 1 out of 100 children are infected without knowing it.
  • If the test for one of the children is positive, what is the probability that he or she actually has this viral disease?

Congratulations if you answered with a probability anywhere in the range of 95 percent or greater.  If you did, you are in agreement with approximately 80 percent of a sample of physicians who were posed the same question.  You, in fact, are wrong.  But, at least, you are not alone.

This virus screening problem is a classic diagnostic riddle used to introduce health care workers to both Bayesian statistics and the influence of prevalence on clinical decision making.  The actual answer to the question is 50 percent and here is why.

Although our diagnostic test is highly sensitive at 99 percent (among 100 children who have the disease, the test is positive in 99) and highly specific at 99 percent (among 100 children who do not have the disease, the test is negative in 99), the rate of prevalence is low (1 out of 100 children are infected without knowing it).

A low rate of prevalence has a direct influence on the frequency of false positive test results.  In table form, the diagnostic virus problem looks like this:

Fig 21-7
The question was:  “If the test for one of the children is positive, what is the probability that he or she actually has this viral disease?”  In the table above that probability is the positive predictive value or 99 divided by 198 or 50 percent.  The number of true positives are equal to the number of false positives due to the virus’ low rate of prevalence.

The benefit of using Bayes’ theorem is that we can also arrive at this solution with just the three probabilities that are offered in the virus problem.  It looks like this:

Bayes 21-3: P(A|B) = (0.99 × 0.01) / (0.99 × 0.01 + 0.01 × 0.99) = 0.50

Despite the diagnostic importance of prevalence, health care providers consistently overvalue the sensitivity and specificity of diagnostic tests and undervalue (or completely disregard) the prevalence of disorders. In fact, in their sample of 1361 physicians, Agoritsas and his colleagues (2011) found that, independent of whether the prevalence rate offered was 1, 2, 10, 25, or 95 percent, the most frequent answer to the virus problem was 95 percent or higher. This remained true even when physicians were not given any information regarding prevalence and, technically, no answer was possible.
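The prevalence manipulation is easy to reproduce. This sketch (function name mine) holds sensitivity and specificity at 99 percent and sweeps prevalence across the values used in the survey:

```python
def ppv(sens, spec, prev):
    """P(disease | positive test) via Bayes' rule."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

for prev in (0.01, 0.02, 0.10, 0.25, 0.95):
    print(f"prevalence {prev:.0%}: P(disease | positive) = {ppv(0.99, 0.99, prev):.0%}")
```

The correct answer climbs from 50 percent at 1 percent prevalence toward 100 percent at 95 percent prevalence, so a flat response of 95 percent or higher can only be right at the high end of that range.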

Why would this be the case?

Why would highly trained diagnosticians ignore rates of prevalence despite their importance in understanding test outcomes? Agoritsas suggested that these errors may occur because physicians (1) are unaware of the impact of prevalence on test outcomes, (2) have a poor understanding of the basic statistical properties of diagnostic tests, or (3) have difficulty applying the basic arithmetic underlying Bayesian probability.

Perhaps. Yet this sample was drawn from all 2745 physicians currently practicing in Geneva, Switzerland. It is hard to imagine that, within that large population of health care providers, the majority would be unaware of disorder prevalence, of diagnostic statistics, or of basic arithmetic.

Instead, it remains probable that these physicians were using a more implicit or subjective method of estimating their confidence in the outcome of the diagnostic test. Specifically, these physicians were most likely making a common error in judgement called the base-rate fallacy.

In the next post in this series, we will discuss the base-rate fallacy.

19. Sick Versus Slick: 6. Sensitivity, Specificity, and Predictive Values

In this series we have been discussing male menopause, or Androgen Deficiency in Aging Males (ADAM), as defined by Morley and his colleagues in 2000.  In our last post, we reviewed the ADAM questionnaire, also created by Morley, and its ability (or inability) to identify men who may be suffering from testosterone deficiency.

Despite over 400 citations and its popularity among pharmaceutical and other commercial medical websites, the ADAM questionnaire is not recommended as a method of detecting testosterone deficiency.  Or at least that is the opinion of the Endocrine Society (ES), the International Society of Andrology (ISA), the International Society for the Study of Aging Male (ISSAM), the European Association of Urology (EAU), the European Academy of Andrology (EAA), and the American Society of Andrology (ASA).

Pray tell, how is it that most, if not all, learned societies focusing on men’s health discourage indiscriminate use of the ADAM questionnaire, yet that same questionnaire is strongly promoted by those who peddle testosterone products? In short, the ADAM questionnaire tends to over-diagnose or over-predict the presence of testosterone deficiency. Its tendency to over-diagnose makes the ADAM questionnaire a limited clinical tool, but that same over-diagnostic tendency makes it a superb marketing tool. It all depends on what you are trying to achieve: optimal men’s health or optimal sales of men’s testosterone products.

Now I am not focusing on the ADAM questionnaire because it is egregious.  It is, in fact, no better or worse than a thousand other relatively short questionnaires that attempt to distill complex disorders down to simplistic outcomes.  I am focusing on it because it is popular.  It is ubiquitous.  And it is used uncritically.

In the last post, we tried to make this argument by discussing the concepts of sensitivity, specificity, true positives, and false positives among diagnostic tests.  It might be helpful to illustrate these diagnostic statistics with actual numbers.  And so I shall.

As described before, Morley et al administered their questionnaire to a sample of 316 Canadian physicians and measured these physicians’ testosterone levels. Twenty-five percent of this physician sample had bioavailable testosterone levels lower than 70 ng/mL and were deemed to be hypogonadal. In their publication, Morley and colleagues only provided percentages and did not give the frequency or count of physicians who fell into each diagnostic category. However, because we know how many physicians had low testosterone (25 percent), and the sensitivity (88 percent) and specificity (60 percent) of the diagnostic questionnaire, it is easy to estimate how many physicians fell into each group. We suspect the numbers looked like this:

Fig 19-1

Sensitivity and Specificity

In diagnostic testing, sensitivity and specificity represent two important components of a test and help us to make informed decisions about the quality of that test.

Fig 2 19b

Sensitivity reflects the relationship between the diagnostic test and the presence of the condition or disorder of interest.  Those who possess the disorder and are correctly identified represent a true positive result.   Those who possess the disorder and are incorrectly identified represent a false negative outcome. In the ADAM questionnaire and study, 69 physicians had true positive results and 9 physicians had false negative results. The ADAM’s sensitivity can be determined by dividing the true positive cases by the total number of men who had low levels of testosterone.

Fig 3 19b

Specificity is concerned with the relationship between a diagnostic test and the absence of the condition or disorder of interest.  Those who do not possess the disorder and are correctly identified represent what is called a true negative result.   Those who do not possess the disorder and are incorrectly identified represent a false positive outcome. In the ADAM study, 143 physicians had true negative results and 95 physicians had false positive results. The ADAM’s specificity is determined by dividing the true negative individuals by the total number of physicians who had normal levels of testosterone.
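The four cell counts make both definitions easy to check; a quick sketch in Python:

```python
tp, fn = 69, 9     # low testosterone: correctly flagged / missed
fp, tn = 95, 143   # normal testosterone: wrongly flagged / correctly cleared

n = tp + fn + fp + tn
sensitivity = tp / (tp + fn)   # true positives over all low-testosterone men
specificity = tn / (tn + fp)   # true negatives over all normal-testosterone men

print(n)                       # 316 physicians in total
print(round(sensitivity, 2))   # 0.88
print(round(specificity, 2))   # 0.6
```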

Positive and Negative Predictive Value

Sensitivity asks the question:  When a disorder is present, how well does our test predict that disorder’s presence? Specificity, on the other hand, asks the question:  When a disorder is absent, how well does our test predict that disorder’s absence?

As highlighted in the last post of this series, it is very easy to confuse sensitivity or specificity with predictive accuracy.  Consider this problem:

  • Among men with low testosterone, we know that the ADAM questionnaire correctly identifies the presence of low testosterone approximately 90 percent of the time.  If a man is identified as having low testosterone on the ADAM questionnaire, what is the probability that he has low testosterone?

Because of how human cognition operates, every fiber of our being wants to answer this question as: “Approximately 90 percent.”

Let me ask the same question again but in a different context:

  • Three little kittens have lost their mittens.  We find a mitten.  What is the probability it belongs to a kitten?

I know there is a part of you that wants to guess but the reality is that there is not enough information available to answer who probably owns the lost mitten.  Just as we need to know how many kittens and non-kittens have lost their mittens before we can answer this question, we need to know how many men with low testosterone and men with non-low testosterone exist before we can guess the ability of the ADAM to accurately predict the presence of men with low testosterone.

Positive predictive value refers to the degree to which a positive result on a diagnostic test is correct.  This is a comparison between the number of true positive results and the total number of positive predictions.  The ADAM predicted 164 physicians had low testosterone but was correct in only 69 cases.

Fig 4 19b

The complement to positive predictive value is negative predictive value or the degree to which a negative result on a diagnostic test is correct.  The ADAM predicted that 152 physicians had normal levels of testosterone and was correct in 143 cases.

Fig 5 19b

So, although the ADAM has a high degree of sensitivity, its ability to predict the presence of low testosterone is modest. As well, a comparison of the ADAM’s positive and negative predictive values suggests that a negative result on the ADAM offers more predictive accuracy than a positive result. That is, the ADAM is better at excluding the presence of low testosterone than it is at confirming it.
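Both predictive values fall out of the same table; a quick sketch:

```python
tp, fn, fp, tn = 69, 9, 95, 143   # cell counts from the ADAM study table

ppv = tp / (tp + fp)   # 69 correct out of 164 positive predictions
npv = tn / (tn + fn)   # 143 correct out of 152 negative predictions

print(round(ppv, 2), round(npv, 2))  # 0.42 0.94
```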

Another Way of Looking at Predictive Values

We know that, in the current sample, 25 percent of physicians tested positive for low testosterone. We also know that, among physicians who had a positive ADAM result, 42 percent also showed low testosterone on blood testing – the positive predictive value. Or, put another way, in this sample, the probability that a physician actually had a low testosterone level given a positive ADAM result was 42 percent.

There is another way to arrive at this same predictive value without needing to always break down the number of people in each diagnostic category.  This method has its beginning in work initially conducted by Thomas Bayes.  Bayes was an English Presbyterian minister whose thoughts on probability and prediction were published posthumously in 1763.  Bayes’ Theorem, named in his honor, holds a special place in diagnostic testing.

Bayes’ theorem will be the topic of our next post in this series.

18. Sick Versus Slick: 5. True Positives, False Positives, and the Androgen Deficiency in Aging Males (ADAM) Questionnaire

Just as women experience a drop in estrogen during menopause, so might men also experience a drop in testosterone and suffer symptoms akin to menopause as they approach middle age. Whereas menopause for women is unequivocal and represents a biological demarcation point from fertility to infertility, the effects of a hormonal downturn for men, as men age, are most likely subtle and certainly less dramatic than what women experience.

Despite the arguable status of male menopause, a number of authors have strongly suggested that such an event can occur, can be measured, and can be successfully ameliorated.  As discussed in prior posts in this series, Morley and his colleagues first  labelled male menopause as ADAM or Androgen Deficiency in Aging Males.  In addition to describing ADAM as a clinical phenomenon, Morley also developed a 10 item questionnaire to measure the possible presence of androgen deficiency among middle-aged men.

To test the ADAM construct, Morley et al administered their questionnaire to a sample of 316 Canadian physicians, who ranged in age from 40 to 82 years, and measured these physicians’ testosterone levels.  To refresh your memory, here is the ADAM questionnaire and its scoring key:

ADAM Q

Morley found that 25 percent of these physicians had bioavailable testosterone levels lower than 70 ng/mL.  Using reference values from asymptomatic younger men, 70 ng/mL was deemed to demarcate low testosterone from normal testosterone.  Morley and his colleagues found that positive scores on the ADAM questionnaire identified 88 percent of those men with low testosterone.

Let’s stop and consider this outcome.  At first glance, identifying almost 90 percent of those men in this sample who had low testosterone based on a simple questionnaire seems quite impressive.   The ADAM questionnaire is clearly very sensitive when it comes to identifying the presence of testosterone deficiency.  At the same time, it is also very easy to confuse the sensitivity of a test with the accuracy of a test.  We naturally think of these two terms as interchangeable.  If you are told that a test is 90 percent sensitive to the presence of a disorder, you immediately consider that test to be very accurate.  You, in fact, would be wrong but you would not be alone in your error.

For example, take a moment to do a Google search on the keywords “ADAM and testosterone.”  You will find innumerable websites, both medical and lay, attesting to the ability of the ADAM questionnaire to quickly and accurately detect the presence of testosterone deficiency.  Now I certainly defend your right to put whatever nostrum you like into your body if it helps you get from today to tomorrow.  But do not delude yourself into thinking that these websites are concerned about your health or diagnostic accuracy.  They are not.

They are concerned about selling you a product or a service — testosterone enhancement.

The reality is that the sensitivity of a test is only half the story.  The other half is whether or not the test achieves its sensitivity by being over-liberal in its detection of the disorder of interest.  For example, let’s say we create a new screening test for testosterone deficiency to compete with the ADAM questionnaire and we call it the Everybody Gets A Disorder (EGAD) questionnaire.  Here is the test and its scoring key:

[Image: the EGAD questionnaire and its scoring key]

Now, to test our questionnaire we give it to another imaginary sample of 316 middle-aged and elderly Canadian physicians and measure those physicians’ testosterone levels.  Again we find that 25 percent of the physicians have low levels of testosterone.  And we find that our new screening measure, the EGAD questionnaire, outperforms the ADAM questionnaire and is 100 percent sensitive.  Our new questionnaire was able to detect all of the physicians who had low levels of testosterone.  Amazing!
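Of course, EGAD’s perfect sensitivity is a foregone conclusion, because everyone who takes it scores positive.  A quick sketch of the counts in this imaginary sample makes that plain (the exact split of 79 versus 237 physicians is my illustration of the 25 percent figure):

```python
# EGAD calls everyone positive, so its confusion matrix is trivial.
n = 316                     # imaginary sample of physicians
low_t = round(n * 0.25)     # 79 physicians with low testosterone
normal_t = n - low_t        # 237 physicians with normal testosterone

true_positives = low_t      # every low-T physician scores positive: 79
false_positives = normal_t  # but so does every normal-T physician: 237

sensitivity = true_positives / low_t  # 1.0 -- 100 percent sensitive
specificity = 0 / normal_t            # 0.0 -- no true negatives at all
```

A test that flags everybody will always catch everybody with the disorder; the price shows up entirely on the other side of the ledger.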

So, you get the point.  Sensitivity is nice but, by itself, not very instructive.  We need to consider not only those whom the test correctly identifies as having low testosterone but also those whom the test incorrectly identifies as having low testosterone.  This type of error — suggesting that someone has a disorder when they do not — is called a false positive.  Most diagnostic tests will report both sensitivity and specificity.  Specificity is directly related to the false positive rate.  To be exact, one minus the specificity rate gives you the false positive rate.  When a diagnostic test has good specificity, false positives are few.  When specificity is low, false positives are common.

Morley and colleagues noted that the ADAM questionnaire demonstrated a specificity of 60 percent.  Not great but not completely horrible.  It is fairly straightforward to determine how often the ADAM questionnaire was correct in its prediction of low testosterone based on its sensitivity, specificity, and base rate (the proportion of people in the study who actually had low testosterone based on blood testing).  The rate at which the ADAM questionnaire positively predicts those with low testosterone is only 42 percent.  Or, put another way, only four out of ten people that the ADAM questionnaire predicted would have low testosterone actually did have low testosterone upon follow-up blood testing.
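The arithmetic behind that 42 percent is a one-line application of the definition of positive predictive value (the function name is mine, not Morley’s):

```python
def positive_predictive_value(sensitivity, specificity, base_rate):
    """Fraction of positive test results that are true positives."""
    true_pos = sensitivity * base_rate              # correctly flagged
    false_pos = (1 - specificity) * (1 - base_rate) # wrongly flagged
    return true_pos / (true_pos + false_pos)

# ADAM: 88 percent sensitive, 60 percent specific, 25 percent base rate
ppv = positive_predictive_value(0.88, 0.60, 0.25)
print(round(ppv * 100))  # roughly 42 percent
```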

That is not very impressive.

But, to be fair, our EGAD questionnaire did not do better.  Its rate of positive prediction was 25 percent.  So, if we round up, the EGAD questionnaire positively predicted only three out of ten people as testosterone deficient.  Not as good as the ADAM but close.

However, as it turns out, the true specificity rate of the ADAM questionnaire may be a bit lower than the 60 percent reported by Morley and his colleagues.  Following Morley’s original study, subsequent studies have reported considerably lower specificity rates for the ADAM questionnaire, with rates ranging from 22 percent to 40 percent.  In fact, the lowest specificity rate of 22 percent came from the study with the largest sample, which contained over 5,000 participants.  As well, in 2006, in a second study evaluating the ADAM questionnaire, Morley and his colleagues also reported a specificity rate of 30 percent.

If we recalculate the rate of positive prediction in Morley’s original study using a specificity rate of 30 percent (or false positive rate of 70 percent), then the ADAM questionnaire’s positive prediction rate drops down to three out of ten people.
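Rerunning the same back-of-envelope calculation with the revised specificity shows just how close the ADAM and the EGAD end up (again, the helper function is my sketch, not part of Morley’s analysis):

```python
def positive_predictive_value(sensitivity, specificity, base_rate):
    """Fraction of positive test results that are true positives."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

adam = positive_predictive_value(0.88, 0.30, 0.25)  # revised 30% specificity
egad = positive_predictive_value(1.00, 0.00, 0.25)  # everyone tests positive
print(round(adam * 100), round(egad * 100))  # roughly 30 and 25 percent
```

Note that when a test flags everyone, its positive predictive value simply collapses to the base rate itself — 25 percent here.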

Given that our EGAD questionnaire has higher sensitivity than the ADAM questionnaire and achieves a nearly equal rate of positive prediction, I say EGAD is the winner.

So, pharmaceutical companies and other purveyors of testosterone porn, you know where to find me should you wish to discuss licensing fees.