How are USMLE scores calculated?

This seems to have been the question on every student's lips for decades!
And ironically, the answer is still UNKNOWN!

OK, if you don't have time to read this rather lengthy post, let me give you the final answer now: nobody knows!

Yes, there are many theories about it and all kinds of speculation. But as of this writing, there is no official explanation of how scores are calculated, and not even a consensus theory.

The latest USMLE Bulletin of Information (2009) does say this:
(On the three-digit scale, most Step 1, 2 CK, or 3 scores fall between 140 and 260. The mean score for first-time examinees from accredited medical school programs in the United States is in the range of 210 to 230, and the standard deviation is approximately 20. Your score report will include the mean and standard deviation for recent administrations of the examination. The two-digit score is derived from the three-digit score. It is used in score reporting because some medical licensing authorities have requirements that include language describing a "passing score of 75." The two-digit score is derived in such a way that a score of 75 always corresponds to the minimum passing score).

So first of all, let me dispel a popular myth. The two-digit score is not a percentile, and here is the proof:

  • First, if the two-digit score were a percentile, then a passing score of 75 would mean that only 25% of students ever pass the exam! Thank God this is not the case, or we would all be crying. In 2008, for example, the pass rate was 73% among IMG first-time takers, and even higher for US/Canadian grads. The reference for this is in this official NBME page.
  • Second, if the two-digit score were a percentile, then one standard deviation above the mean would be the 84th percentile. Using the numbers given to us directly by the NBME as mentioned above (mean 210-230 and SD = 20), a three-digit score of 240 is roughly equal to the 84th percentile. In reality, though, a three-digit score of 240 has always meant a two-digit score of 99 (see this thread for a correlation table) throughout the history of the USMLE!
So it's obvious that the two-digit score is not a percentile. The two-digit score is merely derived from your three-digit score and standardized so that 75 is the passing score and 99 is the highest possible score, with a range in between. Perhaps the reason behind the two-digit score is just to give "humans" a number they can understand; in reality, however, it has only complicated matters further!
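To make the second point concrete, here is a rough sketch of the percentile check. It assumes scores are approximately normally distributed with a mean of 220 (the midpoint of the 210 to 230 range quoted from the bulletin) and an SD of 20. The normal-distribution assumption and the function name are mine, not anything the NBME has published.

```python
import math

def normal_percentile(score, mean, sd):
    """Percentile of `score` under an ASSUMED normal distribution.

    This is only an illustration; the NBME does not publish its method.
    """
    z = (score - mean) / sd
    # Standard normal CDF expressed through the error function
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Assumed mean: midpoint of the 210-230 range; SD of 20 is from the bulletin
pct_240 = normal_percentile(240, mean=220, sd=20)
print(f"A 240 sits at roughly the {pct_240:.0f}th percentile")
```

If the two-digit score were a percentile, a 240 would be reported as about 84, not 99.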

Now, how do they calculate the three-digit score?

Well, I don't know, and nobody else knows. If you think that the three-digit score is simply the number of questions you answered correctly, then you are wrong, and here's why:

One thing we know for sure is the passing score for each step, as reported on this official NBME web page. Currently the passing score is 185 for Step 1, 184 for CK, and 187 for Step 3. Note that these passing scores are revised every two years or so and can change; however, the two-digit passing score is always 75, which shows that the two-digit score is standardized and derived from the original three-digit score.

We also know that there are 336 questions in Step 1 and 352 questions in Step 2 CK (the latter was revised recently; see this thread for details).

They also tell us that the percentage of items you need to get right in order to meet the minimum passing score is 60 to 70 (the reference for this is in the USMLE Bulletin 2009; scroll down to paragraph 10, "multiple choice items", and read the last line).

Using the numbers gathered above, we can simply deduce that passing Step 1 takes about 65% of the 336 items, which equals 218.4 questions. But the passing three-digit score is 185, as we said! Let's calculate it for CK as well: 65% of 352 = 228.8, but the passing three-digit score is 184!

So the conclusion is that the three-digit score does not, by any simple math, correspond to the number of questions you got right on the exam.
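The same comparison as a small sketch, using only the figures quoted in this post. The 65% figure is an assumed midpoint of the 60 to 70% range, and the variable and function names are made up for illustration.

```python
# Figures quoted in the post (2009): item counts and three-digit passing scores
exams = {
    "Step 1": {"items": 336, "passing_score": 185},
    "Step 2 CK": {"items": 352, "passing_score": 184},
}

PASS_FRACTION = 0.65  # ASSUMED midpoint of the 60-70% range in the bulletin

def items_needed(total_items, fraction=PASS_FRACTION):
    """Number of correct items implied by the assumed passing fraction."""
    return total_items * fraction

for name, exam in exams.items():
    needed = items_needed(exam["items"])
    print(f"{name}: about {needed:.1f} correct items to pass, "
          f"but the passing score is {exam['passing_score']}")
```

In both cases the number of correct items needed is well above the three-digit passing score, so the score cannot simply be a count of correct answers.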

Confounding factors:
There are several other factors that complicate the calculation further. Examples:

  • All these numbers are variable and revised from time to time.
  • Some question items are experimental and do not count toward the total score.
This is really frustrating.
How can the exam that all US doctors have taken to get a license be a mystery exam?
How dare they keep their methodology a secret!

Personally, I suspect there is some playing with the scores in the background that they don't want to tell us about :(:(
My college professor is 100% sure that different questions carry different weights; for example, a difficult question will get 1.5 points while a regular question will get 1 point.
Shame on NBME
Such an important exam
Perhaps this is the most taken medical exam on earth

How dare they not be transparent about it

They MUST tell us how they calculate the score

Shame on NBME :mad:
Maybe it is qualitative, not quantitative!!

Maybe it depends on the quality of the questions, not the quantity!! I mean, let's say the Step 1 exam has 2 parts, not just one: one for essential knowledge, which is the major part of the exam and extends quantitatively (on the 3-digit score) up to 240, as it includes most of the questions; and another, smaller part related to knowledge beyond the essential, which represents about one sixth of the questions and determines the three-digit score beyond 240. This would explain why 99 extends from 240 to 280!! It's just an idea...
My theory

Hello everyone, thank you for your posts
Step 1
Test Date: 24/11/2010, Score: 228/98

I believe the calculations are very complex, but I think that if we look at the following factors we can get an idea of how it is done. I am sure I have made a few mistakes, so if anyone spots them I would appreciate it if you let me know.

First of all:
How is the test built?
This is relatively easy to deduce. We have all gone through the UW and Kaplan Qbanks; there are about 2250 questions in each, and every random block of questions looks a lot like one block of the actual exam.
So let's say that the NBME has a very large qbank from which they build each exam administration.

But there is also the evolution issue. After each examination, the questions are "burned" in a certain way, so the questions asked on the actual exam have to be changed from time to time.
On the other hand, the USMLE website explains that recent administrations of the test were built to be more clinical than before, and most of the questions now come in the form of a clinical vignette.

Now, if the questions are becoming more clinical, it gives them the opportunity to evaluate you from many perspectives in a single question.
If you remember the questions on your exam, or those in the qbanks, it is obvious that for most questions you need to know more than just the actual answer; you need to integrate the subjects.
It also gives them the chance to differentiate between what you know about medicine as a whole and what you know about the actual exam. Because each exam is different, there must be core questions for every test taker and specific questions for each of them.
This is written on each score report:

Your score is influenced by both your general knowledge of biomedical science and the specific set of items selected for this Step 1 examination.

1 - There are at least two factors from the exam itself that are taken into account when they calculate your score.
2 - It is not absurd to think that each year the questions are made harder so that new generations of students won't have an advantage over previous cohorts.

I took NBME self-assessment Form 2 (2 months before the exam) and Form 6 (3 weeks before the exam), and got 490/220 and 430/208 respectively. I really believe that Form 6 was much harder than Form 2, and my exam was more like Form 6 than Form 2. It makes sense: as the previous sets of questions are "burned", they need to increase the difficulty of the exam. If these self-assessment forms are made to predict our performance on the actual exam, it is not logical to think that (in my case) after studying more and knowing more I would get a lower score. On the contrary, it means that the form was harder, and if it predicts what your score is going to be, then it means that the actual exam is getting harder over time.

And this can give us a clue about how the score is calculated:

When we do the qbanks, we can see there are "low, medium, high" yield questions. But more precisely, we can see what percentage of students got each question right.

We also know that the minimum passing score for the exam has increased lately, and it is now 188 (3 points more than before).

We also know that the APPROXIMATE mean for first-time takers is 221 and the SD is 24 (as written on my report) in recent administrations. This means that scores within one SD of the mean run from 197 to 245 (the mean itself obviously does not move over such a wide range, but it is variable).

We also know that the minimum passing score corresponds to answering between 60 and 70% of the questions correctly.

We are told that the 3-digit score is calculated so that it is equivalent from year to year, but how could that be true if the passing score is raised every so often? If the answer is that our general knowledge is now greater, and our professional knowledge should be even better, why wouldn't they just adjust this variable and keep the same passing score? I would say that knowing "185" today is not enough, but it is the same as ten years ago, when it was enough to pass.

We are also told that the 2-digit score is derived from our overall performance on the exam and that 75 is ALWAYS the passing score, which currently corresponds to 188. But we also know that a given 3-digit score does not always map to the same 2-digit score. So in one scoring period a score of 185 = 75 and in the next 188 = 75; likewise 235 = 99 in one and 229 = 99 in another.

I'll get back to this.

There is also another thing. It is said that passing the exam corresponds to answering 60-70% correctly, which by itself means that not every question is worth the same. But it also leaves us with a doubt. Let's say that you need to answer 70% of the high-yield questions in order to get a 188 (and a 75). There are 322 questions, and 70% = 225 questions, so 188/225 = 0.83 points per question. That leaves 97 questions. 97 x 0.83 = 80.51, and 80.51 + 188 = 268.5, which would be the maximum possible score.
Now, let's apply the same rule to the 2-digit score. 75 = 225 correct questions, so 75/225 = 0.333... per question. Now 97 x 0.33 = 32, and 32 + 75 = 107.
That's good. Let's say that to get a 228 I had to get 228/0.83 = 275 correct questions. But on the 2-digit scale 275 x 0.33 would be about 91, not 98.
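Here is the same "equal points per question" arithmetic as a short sketch. It is only the hypothetical linear model from this post, not anything the NBME has confirmed, and all the names are made up for illustration. The figures are unrounded, so they differ slightly from the hand-rounded ones.

```python
TOTAL_ITEMS = 322
PASS_ITEMS = 225      # about 70% of 322, as assumed in the post
PASS_3DIGIT = 188     # three-digit passing score quoted in the post
PASS_2DIGIT = 75      # two-digit passing score

def linear_scale(correct, pass_score, pass_items=PASS_ITEMS):
    """Score if every item were worth pass_score / pass_items points.

    HYPOTHETICAL model: it assumes all items carry equal weight.
    """
    return correct * pass_score / pass_items

# Maximum scores implied by the linear model
max_3digit = linear_scale(TOTAL_ITEMS, PASS_3DIGIT)
max_2digit = linear_scale(TOTAL_ITEMS, PASS_2DIGIT)

# Items needed for a 228, and the two-digit score those items would give
items_for_228 = 228 * PASS_ITEMS / PASS_3DIGIT
two_digit_for_228 = linear_scale(items_for_228, PASS_2DIGIT)

print(f"max 3-digit: {max_3digit:.0f}, max 2-digit: {max_2digit:.0f}")
print(f"228 needs about {items_for_228:.0f} items, "
      f"giving about {two_digit_for_228:.0f} on the 2-digit scale")
```

Under this model a 228 would come out at about 91 on the two-digit scale rather than the reported 98, and the two-digit maximum would exceed 99, so a single linear rule cannot be producing both scales.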
Now, as you know from the qbanks, there are questions where you don't need to know exactly what the answer is; you just need to know the principle underlying it. And you can only know that principle by understanding and integrating many areas of knowledge.
So we have the following: the passing score on the 3-digit scale has increased, but your score corresponds to those of previous years. The mean score (221) corresponds to answering 266 questions now, but 10 to 15 years ago (mean 200 and SD 20) it corresponded to answering, by today's standards, 240 questions right (there were more questions on previous exams, but they were less integrative and complex), and the passing score was 176, which corresponded to answering 212 questions right by today's standards.
So how can the 3-digit scale be equivalent year after year?
I believe the answer lies in the fact that medical knowledge has grown exponentially over the last few years. Every day we have new details about a certain enzyme or drug. What hasn't changed is the general principles. The 3-digit score is comparable between generations of test takers, and the 2-digit score is comparable between examinees in the same cohort.
Let me explain: let's say that ten years ago, if you knew everything about medicine, you knew 1000, and nowadays to know everything you need to know 5000 (hypothetical values).
So ten years ago, in order to pass the exam, you needed to answer 70% of the questions right, so you needed to know 700. Today you would need to know 3500. To get a 220 (3-digit scale) you needed to know 82%, or 820, and today to get a 220 you would need to know 4100. So the 3-digit scale tells you what percentage you know of all that could be known. If there is 1000 to know and you know 700, you pass the exam. If there is 5000 to know and you know 3500, you pass the exam. That is why it is comparable: the 3-digit scale tells you, in relative terms, how much you know. If in the year 2050 there is 100000 to know and someone knows 70000, then he would pass the exam.

But obviously that's an ideal world; it is impossible to achieve 100% of knowledge, and they know it. That's why there is the 2-digit scale: it compares examinees within the same generation. That's why it changes. In absolute numbers there is no way to compare between generations. Obviously someone who knew 820 ten years ago knows less than someone who knows 4100 today, but relative to each other they know the same, because today there is more to know.
But relative numbers don't work for this either, because achieving a knowledge of 4100 today is harder than achieving a knowledge of 820 was ten years ago. So to evaluate each student fairly you need to compare him with his own generation. So you use another scale, and that's the 2-digit scale.

So how is the score actually calculated?
I believe that each question is worth a different amount:
There are 322 questions on the exam. Let's say that 1000 students take the exact same test over a period of 3 weeks. Now, we know that the exam is built to have a certain number of easy, medium and hard questions, and at the same time there are certain numbers of questions which test general principles and others which test specific sets of items. As I proposed before, clinically oriented questions can evaluate several items in one single question. Those are intrinsically worth more than the more specific ones, because answering them correctly implies that you can integrate various items.
Now, let's say that when they receive the answers from the 1000 students, they calculate what percentage of students answered each question correctly, just as the qbanks do. Now, let's say that each question has an intrinsic value and an extrinsic one. The intrinsic value depends on how much it can say about your knowledge, and the extrinsic value depends on how many students answered it correctly.
So if you answered a hard question which was answered correctly by only 30% of the cohort, it gives your score a bonus: it could be 1 (intrinsic value) + (1 - 0.3) (extrinsic value), or 1 + (1 x 0.7), and so on. Then they sum the intrinsic values plus the bonus values, which gives them a score out of 800 (as in the self-assessment examinations) or some number around it, which they then divide by 3 (or something).
Let's see: with 322 questions, let's say that each question has an average intrinsic value of 1.5; then 322 x 1.5 = 483, which would be the intrinsic value if you answered every question correctly.
Then they add the extrinsic value. So let's say there are 200 easy questions, each answered correctly by 70%; 70 medium questions, each answered correctly by 50%; and 52 hard questions, each answered correctly by 20%.
So every easy question is worth 1.5 + 1.5 x 0.3 = 1.5 + 0.45 = 1.95, and there are 200 easy questions, so answering all of them gives you 390.
Every medium question is worth 1.5 + 0.75 = 2.25, and 2.25 x 70 = about 157.
Every hard question is worth 1.5 + 1.2 = 2.7, and 2.7 x 52 = about 140.
Now 140 + 157 + 390 = 687, and if we divide this number by 3 we get 229.
Obviously not every question is worth the same, and the value of every question varies with the number of people who answered it correctly, so the final sum at the end of the calculation should give a number close to 800, which divided by 3 would give 266.6.
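The worked example above can be sketched in a few lines. To be clear, this is just the speculative weighting scheme proposed in this post (intrinsic value plus a bonus for questions that few examinees answer correctly); the tier sizes, the 1.5 intrinsic value, the divisor of 3, and all the names are hypothetical.

```python
INTRINSIC = 1.5   # ASSUMED average intrinsic value per question
DIVISOR = 3       # ASSUMED conversion from raw sum to the three-digit scale

# tier name -> (number of questions, fraction of cohort answering correctly)
tiers = {
    "easy": (200, 0.70),
    "medium": (70, 0.50),
    "hard": (52, 0.20),
}

def question_value(fraction_correct, intrinsic=INTRINSIC):
    """Intrinsic value plus a bonus that grows as fewer examinees get it right."""
    return intrinsic + intrinsic * (1 - fraction_correct)

def max_raw_score(tiers):
    """Raw sum if every question in every tier is answered correctly."""
    return sum(count * question_value(frac) for count, frac in tiers.values())

raw = max_raw_score(tiers)
three_digit = raw / DIVISOR
print(f"raw sum: {raw:.1f}, three-digit equivalent: {three_digit:.0f}")
```

Changing the tier sizes or the fractions shifts the maximum, which is the point of the theory: the same number of correct answers can produce different scores depending on how the cohort performed.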
That is one possible way the 3-digit score could be calculated, obviously using a much more complex algorithm. But it is a way to compare between different generations, because the final result is a reflection of the knowledge you have achieved in medicine, where questions that are more general and integrative are worth more and questions that are specific are worth less. Because on each examination the scores must follow a "normal" (Gaussian) pattern, they can give a mean and a standard deviation.
So, using the explanation I proposed before, as biomedical knowledge grows they incorporate more details into each exam version. And because they use clinical vignettes, they can test several items in a single question, so if you needed 800 questions ten years ago to evaluate all basic biomedical knowledge, now you can ask 4 questions in just 1. So you have 200 general-principles questions and 122 questions on new knowledge. Then, using a statistical method, they can evaluate what percentage of basic and current knowledge you have achieved, and that's represented on the 3-digit scale. So a 200 on the 3-digit scale 10 years ago would mean knowing 60% (for example) of what was known at that time, and today it also represents 60% of what is known nowadays, but obviously knowing 60% today means knowing more, as I've shown earlier in this post.
The 2-digit scale is comparative between examinees in the same cohort. They can use a statistical method to take into account the 3-digit scores of all 1000 examinees and obtain a mean and SD adjusted to a passing score of 75 and a top score of 99, which is NOT a percentile, because there could be many 99s. It all depends on what the highest score in the cohort was.

Thank you, and let me know of any mistakes in my theory.
@ MDMCH17,

You need to publish this theory in a big scientific journal, seriously
Yes, I agree with rasheed... look how important this exam is, and we still DON'T KNOW WHAT IS BEHIND THIS MYSTERY... (HOW DO THEY CALCULATE THE SCORE) :toosad:

We have a lot of questions in our minds, BUT THE BOTTOM LINE IS... NO ONE KNOWS HOW SCORES ARE CALCULATED...


And think about qualitative and quantitative scoring :eek: like, are some questions worth 1.5 or even more and others 0.5 or even less???
Also, what about the experimental questions??

Each day we get more and more frustrated and worried about their methodology.
The United States Medical Licensing Examination score (USMLE score) is given to test takers as a 3-digit score. This score is commonly used by hospitals to determine eligibility for residency and fellowship. The three-digit score is based on a theoretical maximum of 300, but this has not been documented by the NBME / FSMB. :confused:
Extremely informative. But I feel like these test score calculations are overly complicated. However, thank you very much for the post!
Every question is scored equally

Every question has the same weight.
Otherwise, what if everyone studied just the high-yield material and hit 240+, while all the easy one-liners were worth no marks (minimum score, since everyone knows them)?

So in short, every question should be scored equally; otherwise the exam would only be valid for the really hard questions, which would become the main factor determining high scores.

My college professor is 100% sure that there are different weights for different questions for example a difficult question will get 1.5 points while a regular question will get 1 point
@ MDMCH17,

You need to publish this theory in a big scientific journal, seriously
Agreed! :)