**My theory**

Hello everyone, and thank you for your posts.

Step 1

Test Date: 24/11/2010, Score: 228/98

I believe the score calculation is very complex, but I think that if we look at the following factors we can get an idea of how it is done. I am sure I have made a few mistakes, so if anyone spots them I would appreciate it if you let me know.

First of all:

How is the test done?

This is relatively easy to deduce. We have all gone through the UW and Kaplan Qbanks; there are about 2250 questions in each, and every random block of questions looks a lot like one block of the actual exam.

So let's say that the NBME has a very large qbank from which they build each exam administration.

But there is also the issue of evolution. After each examination the questions are "burned" to some extent, so the questions asked on the actual exam have to be changed from time to time.

On the other hand, the USMLE website explained that the recent administrations of the test were built to be more clinical than before, and most of the questions now come in the form of a clinical vignette.

Now, if the questions are being made more clinical, that gives them the opportunity to evaluate you from many perspectives in a single question.

If you remember the questions on your exam, or those in the qbanks, it is obvious that for most questions you need to know more than just the actual answer; you need to integrate the subjects.

It also gives them the chance to differentiate between what you know about medicine as a whole and what you know about the specific exam. Because each exam is built differently, there must be core questions for every test taker and specific questions for each of them.

This is written on each report:

**your score is influenced by both your general knowledge of biomedical science and the specific set of items selected for this Step 1 examination.**

Conclusion:

1 - There are at least two factors from the exam itself that are taken into account when they calculate your score.

2 - It is not absurd to think that each year the questions are made harder, so that a new generation of students won't have an advantage over the previous cohort.

Second

I took the NBME self-assessment forms 2 (2 months before the exam) and 6 (3 weeks before the exam), and got 490/220 and 430/208 respectively. I really believe that form 6 was much harder than form 2, and my exam was more like form 6 than form 2. It makes sense: as the previous sets of questions are "burned", they need to increase the difficulty of the exam. If these self-assessment forms are made to predict our performance on the actual exam, it is not logical to think that (in my personal case) after studying more and knowing more I would get a lower score. On the contrary, it means that the form was harder, and if it predicts what your score is going to be, then it means that the actual exam is getting harder over time.

And this can give us a clue about how the score is calculated:

When we do the qbank, we can see there are "low, medium, high" yield questions. But more precisely, we can see what percentage of students got the question right.

We also know that the minimum passing score for the exam has increased lately, and it is now 188 (3 points more than before).

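We also know that the __APPROXIMATE__ mean for first-time takers is 221 and the SD is 24 (as written on my report) in __recent administrations__. Strictly speaking, the SD describes how individual scores spread around the mean (roughly two thirds of examinees fall between 197 and 245), not how much the mean itself moves. Still, since the mean is only approximate, it presumably varies somewhat between administrations, though obviously not over such a wide range.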

We also know that the minimum passing score corresponds to answering between 60% and 70% of the questions correctly.

We are told that the 3 digit score is calculated so that it is equivalent each year, but how could that be true if the passing score is raised every so often? If the answer is that our general knowledge is now greater, and our professional knowledge should be even better, why wouldn't they just adjust for this and keep the same passing score? I would say that knowing "185" today is not enough, yet it is the same level that was enough to pass ten years ago.

We are also told that the 2 digit score is derived from our overall performance on the exam and that 75 is ALWAYS the passing score, which currently corresponds to 188. But we also know that the 2 digit score is not always the same for each 3 digit score. So in one cohort a score of 185 = 75 and in the next 188 = 75; likewise 235 = 99 in one and 229 = 99 in another.

I'll get back to this.

There is also another thing. It is said that passing the exam corresponds to answering 60-70% correctly, which by itself suggests that not every question is worth the same. But it also leaves us with a doubt. Let's say that you need to answer the 70% of high yield questions in order to get a 188 (and a 75). There are 322 questions and 70% = 225 questions, so 188/225 = 0.83 points per question. That leaves 97 questions: 97 x 0.83 = 80.51, and 80.51 + 188 = 268.5, which would be the maximum possible score.

Now, let's apply the same rule to the 2 digit score. 75 = 225 correct questions, so 75/225 = 0.33 points per question. Now 97 x 0.33 = 32, and 32 + 75 = 107.

That's good. Let's say that to get a 228 I had to get 228/0.83 = 275 questions correct. But on the 2 digit scale 275 x 0.33 would be 91, not 98.
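The arithmetic above can be rechecked with a short script. This is only a sketch of the linear "points per question" idea described in this post, not the way the NBME actually scores the exam; the function name `linear_scale` and the 70% passing fraction are assumptions made for illustration.

```python
# Linear "points per question" model from the post (a sketch, not the NBME method).
TOTAL_QUESTIONS = 322   # number of questions stated in the post
PASS_FRACTION = 0.70    # assumed share of correct answers needed to pass


def linear_scale(pass_score, total=TOTAL_QUESTIONS, pass_fraction=PASS_FRACTION):
    """Return (correct answers needed to pass, points per question, maximum score)."""
    pass_correct = round(total * pass_fraction)
    points_per_question = pass_score / pass_correct
    return pass_correct, points_per_question, points_per_question * total


pass_correct, per_q_3digit, max_3digit = linear_scale(188)
_, per_q_2digit, max_2digit = linear_scale(75)

# Correct answers that a 228 would imply, and the 2 digit score that the
# same number of correct answers would give under this model.
correct_for_228 = 228 / per_q_3digit
two_digit_for_228 = correct_for_228 * per_q_2digit

print(pass_correct, round(per_q_3digit, 2), round(max_3digit))
print(round(per_q_2digit, 2), round(max_2digit))
print(round(correct_for_228), round(two_digit_for_228))
```

Without rounding the per-question values, the figures shift slightly from the hand calculation (the ceiling comes out near 269), but the 2 digit result still lands near 91 rather than 98, so a single linear rule cannot explain both scales.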

Now, as you know from the qbanks, there are questions for which you don't need to know the exact answer; you just need to know the principle underlying it. And you can only know that principle by understanding and integrating many areas of knowledge.

So we have the following: the passing score on the 3 digit scale has increased, yet your score is supposed to be equivalent to those of previous years. The mean score (221) corresponds to answering 266 questions right now, but 10 to 15 years ago (mean 200 and SD 20) it corresponded to answering, by today's standards, 240 questions right (there were more questions on previous exams, but they were less integrative and complex), and the passing score was 176, which corresponded to answering 212 questions right by today's standards.

So how can the 3 digit scale be equivalent year after year?

I believe that the answer lies in the fact that medical knowledge has grown exponentially over the last years. Every day we have new details about a certain enzyme or drug. What hasn't changed is the general principles. The 3 digit score is comparable between generations of test takers, and the 2 digit score is comparable between examinees in the same cohort.

Let me explain: let's say that ten years ago, if you knew everything about medicine, you knew 1000, and nowadays to know everything you need to know 5000. (Hypothetical values)

So ten years ago, in order to pass the exam, you needed to answer 70% of the questions right, so you would have needed to know 700. Today you would need to know 3500. To get a 220 (3 digit scale) you would have needed to know 82%, or 820, and today to get a 220 you would need to know 4100. So the 3 digit scale tells what percentage you know of all that could be known. If there is 1000 to know and you know 700, then you pass the exam. If there is 5000 to know and you know 3500, you pass the exam. That is why it is comparable: the 3 digit scale tells how much you know in relative terms. If in the year 2050 there is 100000 to know and someone knows 70000, then he would pass the exam.
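The same proportional reasoning, written as a small sketch. The "knowledge units" are the hypothetical values used in this post, and the helper name `knowledge_required` is made up for illustration.

```python
PASS_PERCENT = 70        # percent of all knowledge needed to pass (from the post)
SCORE_220_PERCENT = 82   # percent the post associates with a 220

# Hypothetical "total knowledge" units per era, as used in the post.
ERAS = {"ten years ago": 1000, "today": 5000, "year 2050": 100000}


def knowledge_required(total_knowledge, percent):
    """Absolute amount of knowledge that a given percentage represents."""
    return total_knowledge * percent // 100


for era, total in ERAS.items():
    to_pass = knowledge_required(total, PASS_PERCENT)
    for_220 = knowledge_required(total, SCORE_220_PERCENT)
    print(f"{era}: pass = {to_pass}, 220 = {for_220} (out of {total})")
```

The percentage stays fixed while the absolute amount grows with the total, which is the sense in which the 3 digit scale would be comparable across generations.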

But obviously that's an ideal world; it's impossible to achieve 100% of knowledge, and they know it. That's why there is the 2 digit scale: it compares subjects within the same generation. That's why it changes, because in absolute numbers there is no way to compare between generations. Obviously someone who knew 820 ten years ago knows less than someone who knows 4100 today, but relative to each other they know the same, because today there is more to know.

But relative numbers alone don't work for this, because achieving a knowledge of 4100 today is harder than achieving a knowledge of 820 would have been ten years ago. So to evaluate each student fairly you need to compare him with his own generation. So you use another scale, and that's the 2 digit scale.

So how is the score actually calculated?

I believe that each question is worth a different amount:

There are 322 questions on the exam. Let's say that 1000 students take the exact same test over a period of 3 weeks. Now, we know that the exam is built to have a certain number of easy, medium and hard questions, and at the same time there are certain numbers of questions which test general principles and others which test a specific set of items. As I proposed before, clinically oriented questions can evaluate several items in a single question. Those are intrinsically worth more than the more specific ones, because answering them correctly implies that you can integrate various items.

Now, let's say that when they receive the answers from the 1000 students, they calculate what percentage of students answered each question correctly, just as the qbanks do. Let's also say that each question has an intrinsic value and an extrinsic one. The intrinsic value depends on how much the question can say about your knowledge, and the extrinsic value depends on how many students answered it correctly.

So if you answered a hard question which was answered by only 30% of the cohort, it gives your score a bonus: it could be 1 (intrinsic value) + (1 - 0.3) (extrinsic value), or 1 + (1 x 0.7), and so on. Then they sum the intrinsic values plus the bonus values, and that gives them a score out of 800 (as in the self-assessment examinations) or some number around it, which they then divide by 3 (or something like that).

Let's see: with 322 questions, let's say that each question has an __average__ intrinsic value of 1.5. Then 322 x 1.5 = 483, which would be the intrinsic value if you answered every question correctly.

Then they add the extrinsic value. So let's say there are 200 easy questions, each answered correctly by 70% of students, 70 medium questions answered by 50%, and 52 hard questions answered by 20%.

So every easy question is worth 1.5 + 1.5 x 0.3 = 1.5 + 0.45 = 1.95, and there are 200 easy questions, so answering all of them gives you 390.

Every medium question is worth 1.5 + 1.5 x 0.5 = 2.25, and 2.25 x 70 = 157.5, approximately 157.

Every hard question is worth 1.5 + 1.5 x 0.8 = 2.7, and 2.7 x 52 = 140.4, approximately 140.

Now 140 + 157 + 390 = 687, and if we divide this number by 3 we get 229.
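Here is the same worked example as a short sketch, so the numbers can be changed and recomputed. The weighting rule (intrinsic value plus a bonus proportional to the share of students who missed the question), the difficulty groups and the division by 3 are all the hypothetical values proposed in this post, not a known NBME formula.

```python
INTRINSIC = 1.5  # hypothetical average intrinsic value per question

# (label, number of questions, fraction of students answering correctly)
GROUPS = [
    ("easy", 200, 0.70),
    ("medium", 70, 0.50),
    ("hard", 52, 0.20),
]


def question_value(intrinsic, fraction_correct):
    """Intrinsic value plus a bonus that grows as fewer students answer correctly."""
    return intrinsic + intrinsic * (1 - fraction_correct)


def raw_total(groups, intrinsic=INTRINSIC):
    """Raw score for a student who answers every question in every group correctly."""
    return sum(count * question_value(intrinsic, frac) for _label, count, frac in groups)


total_questions = sum(count for _label, count, _frac in GROUPS)
intrinsic_total = total_questions * INTRINSIC
total = raw_total(GROUPS)
three_digit = total / 3  # the "divide by 3" step proposed in the post

print(total_questions, intrinsic_total)
print(round(total, 1), round(three_digit))
```

Keeping the decimals gives a raw total slightly above the hand-rounded 687, but dividing by 3 still rounds to 229.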

Obviously not every question is worth the same, and the value of each question varies with the number of people who answered it correctly, so the final sum at the end of the calculation should give a number close to 800, which divided by 3 would give 266.6.

That is a possible way the 3 digit score could be calculated, obviously using a much more complex algorithm. But it is a way to compare between different generations, because the final result is a reflection of the knowledge you have achieved in medicine, where questions that are more general and integrative are worth more, and questions that are specific are worth less. Because on each examination the scores must follow a "normal" (Gaussian) pattern, they can give a mean and a standard deviation.

So, using the explanation that I proposed before, as biomedical knowledge grows they incorporate more details into each exam version. And because they use clinical vignettes, they can ask about several items in a single question, so if you needed 800 questions ten years ago to evaluate all basic biomedical knowledge, now you can ask 4 questions in just 1. So you have 200 general principles questions and 122 questions on new knowledge. Then, using a statistical method, they can evaluate what percentage of basic and current knowledge you have achieved, and that is represented on the 3 digit scale. So a 200 on the 3 digit scale 10 years ago would mean knowing 60% (for example) of what was known at that time, and today it also represents 60% of what is known nowadays, but obviously knowing 60% today means knowing more, as I've shown earlier in this post.

The 2 digit scale compares subjects within the same cohort. They can use a statistical method that takes into account the 3 digit scores of all 1000 examinees and obtain a mean and SD adjusted to a passing score of 75 and a top score of 99, which is NOT a percentile, because there could be many 99s. It all depends on what the highest score in the cohort was.
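One very simple way to picture such a cohort-dependent scale is a straight line anchored at the passing score (75) and at the cohort's top score (99). This is purely an illustrative assumption: the post only says "a statistical method", and the function `two_digit_score`, the linear shape and the example top score of 229 are hypothetical choices, not the real conversion.

```python
PASS_TWO_DIGIT = 75
TOP_TWO_DIGIT = 99


def two_digit_score(three_digit, pass_score, top_score):
    """Map a 3 digit score onto a 75-to-99 line anchored at the pass and top scores."""
    slope = (TOP_TWO_DIGIT - PASS_TWO_DIGIT) / (top_score - pass_score)
    value = PASS_TWO_DIGIT + (three_digit - pass_score) * slope
    return min(TOP_TWO_DIGIT, round(value))  # anything at or above the top is capped at 99


# Hypothetical cohort: passing score 188, top score 229.
for score in (188, 210, 228, 229, 240):
    print(score, two_digit_score(score, pass_score=188, top_score=229))
```

Under these assumed anchors the conversion changes whenever the passing score or the cohort's top score changes, which is the behavior the theory needs from the 2 digit scale.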

Thank you, and let me know of any mistakes in my theory.