Getting cut after pools is probably one of the worst feelings at a national tournament. It hurts to fly hundreds of miles to an event, stay in a hotel overnight, and only get to fence a handful of 5-touch bouts. Sometimes after being cut, I’m discouraged and fence poorly in my next event, but most of the time I’m upset with myself and fence better.

Before 2021, there was no elimination after pools in youth events—even if you lost all of your bouts you would get to fence in the direct elimination table. Likely due to increasing event sizes, and potentially due to capacity concerns related to the COVID-19 pandemic, USA Fencing added a bottom 20% cut after pools in Y-12 and Y-14 at the 2021 Summer Nationals. The new rule sparked debate among parents, with some arguing it added unnecessary stress for young fencers.

While I lack data on youth events post-2020 to measure the specific impact on young fencers, this raises an important question about the tournament system in general: Does being cut after pools have a significant psychological effect on fencers’ performance in future events?

I recently attended the Northwestern Workshop on Research Design for Causal Inference, where I learned about a methodology called regression discontinuity. It has a fancy name, but it’s pretty simple in principle. Here’s why it works for our fencing study:

**Arbitrary elimination threshold:** USA Fencing cuts the bottom 20% of fencers after pools in most events. But why 20%? It could just as easily be 15% or 25%. The number is pretty much arbitrary—it’s just chosen to keep tournaments manageable.

**Luck plays a role in pools**:

- Referees sometimes make mistakes
- Fencers can get lucky and unintentionally counteract their opponent’s actions
- Pool assignments are somewhat random

The pool results and who gets eliminated aren’t just about skill—there’s an element of luck in the mix too.

**Fencers near the cutoff are very similar**: Because of the arbitrary cutoff and the luck factor, fencers who barely made it past pools are probably very close in skill level to those who barely got cut. Perhaps if they had different referees, the results would reverse around the cutoff!

- We examine fencers’ performance in their Cadet event after a Junior event where they either barely made the cut or were barely eliminated.
- Performance is measured by pool result percentile, with the #1 ranked fencer at the 100th percentile and the last place fencer near the 0th percentile. By definition, fencers below the 20th percentile in their Junior event are cut after pools.
- We plot fencers’ Cadet performance against their performance in Juniors. Then we fit lines through the data points on each side of the cutoff. Here is an example of what that could look like:
- A significant jump or drop at the cutoff suggests that elimination affected future performance. For example, with the (fake) data in the image above, fencers who get cut in Junior do better in Cadet, placing about 20 percentile points higher.

This analysis was implemented using the rdrobust [1] library, which automatically selects a bandwidth—the range of data around the cutoff point we analyze [2]. As a result, our findings are most relevant to fencers close to the elimination threshold and may not apply to those far above or below the cutoff.
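The fitting step can be sketched without the library: fit a line on each side of the cutoff within a bandwidth and measure the gap between the two fits at the cutoff. The sketch below runs on simulated data (the percentiles, jump size, and bandwidth are all made up for illustration) and is not the rdrobust procedure, which additionally handles bandwidth selection, bias correction, and robust standard errors.

```python
import random

def ols_line(xs, ys):
    """Simple least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def rd_jump(running, outcome, cutoff=0.20, bandwidth=0.10):
    """Estimate the discontinuity at `cutoff`: fit a line on each side
    within `bandwidth` and compare the two fits' predictions at the cutoff."""
    left = [(x, y) for x, y in zip(running, outcome) if cutoff - bandwidth <= x < cutoff]
    right = [(x, y) for x, y in zip(running, outcome) if cutoff <= x <= cutoff + bandwidth]
    al, bl = ols_line([x for x, _ in left], [y for _, y in left])
    ar, br = ols_line([x for x, _ in right], [y for _, y in right])
    return (ar + br * cutoff) - (al + bl * cutoff)

# Synthetic data: next-event percentile loosely tracks pool percentile,
# plus an artificial +0.1 jump for fencers who survive the cut.
random.seed(0)
running = [random.random() for _ in range(5000)]
outcome = [0.5 * x + (0.1 if x >= 0.20 else 0.0) + random.gauss(0, 0.05) for x in running]
print(rd_jump(running, outcome))  # close to the simulated jump of 0.1
```

In the real analysis the jump estimate was near zero; here the jump is planted in the data purely to show what a detectable discontinuity looks like.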

The analysis, based on a sample of 8,817 fencers, yielded unexpected results. We found no clear evidence that fencers who narrowly missed the cutoff performed worse in subsequent events compared to those who just made it. This suggests that being eliminated after pools may not have a significant impact on a fencer’s performance in their next event.

**Overall Impact**

Across all weapons and genders, the effect of being cut after pools was a tiny increase of 0.014 (1.4 percentile points) in subsequent performance. However, with a *p*-value of 0.597, this result is **not statistically significant**, meaning the observed difference could be due to random chance rather than a true effect.

**Gender-Specific Analysis**

Some research [3] suggests men and women may respond differently to not making the cutoff in competition. Our gender-specific analysis showed **no evidence** of this, as all *p*-values were above 0.05.

- Women: Small positive effect (0.04, *p*-value = 0.317)
- Men: No observable effect (0.00, *p*-value = 0.923)

**Weapon-Specific Results**

Considering that different weapons might attract different personality types, we examined each weapon individually. Again, all of the results were nonsignificant, with *p*-values above 0.05. Therefore, there is **no evidence** of the cut having an effect on future performance for any specific weapon.

- Foil: Small positive effect (0.055, *p*-value = 0.226)
- Epee: Negligible negative effect (-0.004, *p*-value = 0.889)
- Saber: Slight negative effect (-0.016, *p*-value = 0.612)

My investigation into the impact of the 20% elimination on future fencer performance yielded interesting results. Contrary to what I initially expected, I found no evidence that being cut after pools negatively affects a fencer’s performance in subsequent events. This held true across individual genders and weapons. However, it’s also important to note that a lack of significant results doesn’t necessarily mean there’s no effect of being cut; it may simply be too small to detect with our current sample size or methodology.

I must emphasize that this study focused on performance metrics for a specific age group. While being cut in Junior events doesn’t seem to affect future results for Cadet fencers, these findings shouldn’t be extrapolated to younger fencers (Y-12 and Y-14) without dedicated studies for those age groups.

For USA Fencing, these results might provide some reassurance about the use of post-pool cuts to manage large events. This data suggests that such cuts may not harm fencers’ later performance in the tournament.

What’s your take on these findings? Whether you’re a fencer, coach, or parent, I’d love to hear your thoughts and experiences.

[1] Calonico S, Cattaneo M D, Titiunik R. Rdrobust: An R package for robust nonparametric inference in regression-discontinuity designs. The R Journal. 2015;7(1):38. doi:10.32614/rj-2015-004

[2] Calonico S, Cattaneo MD, Farrell MH. Optimal bandwidth choice for robust bias-corrected inference in regression discontinuity designs. The Econometrics Journal. 2019 Nov 12;23(2):192–210. doi:10.1093/ectj/utz022

[3] Fang C, Zhang E, Zhang J. Do women give up competing more easily? Evidence from speedcubers. Economics Letters. 2021 Aug;205:109943. doi:10.1016/j.econlet.2021.109943

The post Can Fencers Bounce Back From Being Cut After Pools? appeared first on Rivka Lipkovitz.

In the olden days (sometime around the 5th century BCE), it was commonly believed that all numbers were rational and could be expressed as fractions with integer numerators and denominators. When Hippasus of Metapontum demonstrated that the square root of 2 is irrational, it challenged the Pythagorean worldview. According to legend, this revelation so outraged the followers of Pythagoras that they killed him for exposing a truth that contradicted their beliefs.

Throughout history, there have been numerous examples of groundbreaking discoveries causing outrage, such as the realization that rainbows are caused by light refracting and reflecting inside water droplets in the atmosphere and the discovery that the Earth is round. Although nowadays people (usually) don’t kill each other over disagreements or differences in ideology, there are still widely held beliefs that are not entirely correct or are oversimplifications of more complex phenomena.

We may have been wrong about statistical significance all these years.

When I was in 10th grade and took statistics, I learned about the null hypothesis significance testing (NHST) paradigm. This is a very important framework, as it is commonly used in biomedical and social science research and forms a significant part of the AP Statistics curriculum.

In NHST, the first step is to define a null hypothesis and an alternative hypothesis. The null hypothesis assumes that “nothing happens,” while the alternative hypothesis posits that your specific hypothesis is true and that “something happens.” For example, if you want to determine whether light affects plant growth, the null hypothesis would be that light does not affect plant growth, and the alternative hypothesis would be that light does affect plant growth.

The next step is to conduct the experiment or collect data. In the plant example, you would have an experimental group of plants exposed to light and a control group that does not receive light.

Finally, you would perform significance testing. You would measure the mean growth of the experimental group and compare it to the mean growth of the control group. The difference between the two is called the “effect size.” But how would you determine if this variation is not merely due to chance? For instance, if the experimental group grew only 0.1 millimeters more than the control group, such evidence would likely be unconvincing, especially if there is substantial variation in the plants’ growth.

To address this issue, you would use specific statistical formulas to calculate the standard error of your estimated means, which represents the typical variation in the means of each group. Then you would assess whether the experimental group’s outcome has less than a 5% chance of occurring if the null hypothesis were true. This probability, called the *p*-value, indicates how likely the observed result is under the null hypothesis. In a statistically significant result (*p* < 0.05), we reject the null hypothesis. In a nonsignificant result (*p* > 0.05), we fail to reject the null hypothesis. Therefore, we only have evidence that plant growth is affected by light if *p* < 0.05.
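One minimal way to see the NHST logic in action is a permutation test, which computes the *p*-value directly by relabeling: if light truly had no effect, shuffling the “light” and “dark” labels shouldn’t change the mean difference much. The growth numbers below are invented for illustration.

```python
import random

def permutation_p_value(treated, control, n_perm=999, seed=1):
    """Permutation test: how often does a random relabeling of the plants
    produce a mean difference at least as large (in magnitude) as observed?"""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = treated + control
    k = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k)
        if abs(diff) >= abs(observed):
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction

# Hypothetical growth (cm): light-exposed plants vs. plants kept in the dark.
light = [12.1, 13.4, 11.8, 14.0, 12.9, 13.7, 12.5, 13.1, 14.2, 12.8]
dark = [9.9, 10.4, 9.1, 10.8, 9.6, 10.1, 9.4, 10.6, 9.8, 10.2]
p = permutation_p_value(light, dark)
print(p)  # well below 0.05: reject the null for this fake, clearly separated data
```

With the two groups this cleanly separated, almost no relabeling matches the observed difference, so the *p*-value is tiny and we reject the null hypothesis.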

Credit: Smahdavi4 from Wikipedia

One thing I always found strange about this framework is that significance is binary. Results with a slightly higher than 5% chance of occurring under the null are considered nonsignificant, viewed as no different from those easily produced by chance, while results just below the threshold are considered evidence of a difference. However, I wanted to get a good score on AP Statistics, so I went along with it. I’ve recently learned that others have raised the same issue with this dichotomy, validating my concerns [1]. These researchers suggest that a study should be evaluated on its methodology, with significance treated as a continuum, not on whether it crosses a specific threshold.

Another issue with the framework is the “replication crisis” in science. Unfortunately, many scientific journals are hesitant to publish nonsignificant results, fearing they would be viewed as uninteresting by their readership. However, even with absolutely no effect, there is a 5% chance of obtaining a significant result by chance. Additionally, researchers can inadvertently or deliberately increase their chances of finding a significant result through various practices, such as selectively removing certain subgroups from their analysis until they achieve statistical significance. This phenomenon, known as “p-hacking” or “data dredging,” can lead to the publication of studies that appear to show significant effects when, in reality, no true effect exists. Consequently, when other scientists attempt to replicate these results, they often fail to find the reported effect. It’s been estimated that more than half of psychology studies are unreproducible [2].

Lately, I’ve become interested in statistical significance and whether it is truly a good measurement. Many of the effects I’ve tried to estimate for Touche Stats ended up being nonsignificant. One reason nonsignificant results are looked down upon is that researchers often have a small sample size, which doesn’t give them enough “power” to detect an effect. The power is defined as the probability of rejecting the null hypothesis, given that there truly is an effect, and increases with sample size. Insufficient power means that there may be an effect, but their sample size is too small to distinguish it from chance. However, with my sample sizes of thousands of bouts, I always felt I had adequate power. What if there really is no effect?
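Power is easy to see by simulation: repeatedly draw a noisy estimate of a fixed true effect and count how often it clears the significance bar. The effect size and noise level below are arbitrary; the point is only that the rejection rate climbs with sample size.

```python
import random

def simulated_power(effect, n, n_sims=2000, seed=3):
    """Fraction of simulated studies (mean of n observations with unit sd)
    whose z-statistic exceeds 1.96 -- i.e. the power at this sample size."""
    rng = random.Random(seed)
    se = 1 / n ** 0.5  # standard error of the mean shrinks with n
    rejections = 0
    for _ in range(n_sims):
        est = rng.gauss(effect, se)
        if abs(est) / se > 1.96:
            rejections += 1
    return rejections / n_sims

# Same modest true effect, increasing sample sizes: power rises toward 1.
for n in (25, 100, 400):
    print(n, simulated_power(0.2, n))
```

This is why a nonsignificant result from a small study says little: the test may simply have had too little power to detect a real effect.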

I’ve enjoyed reading much of Alberto Abadie’s work, such as his research on the synthetic control method. When I found out that he wrote a paper on nonsignificance [3], my interest was piqued. In this paper, he argues that nonsignificant results can be more intriguing than significant ones in economics. This is because economists often work with large sample sizes, sometimes involving census data from thousands or millions of individuals. Abadie suggests that it is unreasonable to “put substantial prior probability on a point null,” meaning that assuming a “null hypothesis” in economics is often unrealistic. Most types of interventions or effects measured in economics, such as early schooling or changes in minimum wage, are likely to have some effect, even if it is incredibly small. Therefore, researchers frequently achieve significance since the probability of rejecting the null hypothesis becomes very high with large sample sizes, even for a small effect. In this context, statistical nonsignificance becomes rarer than significance, signaling that there might be something particularly noteworthy to investigate.

I also read a similar paper [4] written by Guido Imbens, whose work I also enjoy reading. He agrees with Abadie that it often does not make sense to assume the “null hypothesis” in economics. He also emphasizes the importance of assessing the “point estimate” (average effect) of an intervention, such as early schooling, rather than focusing solely on its significance. If we have a significant result but the effect is extremely small, like early schooling helping students solve just one extra question on a test with 1,000 questions, it is not very impactful. Therefore, it is more meaningful to evaluate the strength of the effect and its likely variability by constructing a 95% confidence interval: a range of effect sizes consistent with the data, computed by a procedure that captures the true effect 95% of the time.
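Imbens’ point about point estimates versus significance can be made concrete with a tiny sketch (all numbers hypothetical): an effect can be statistically significant yet practically meaningless.

```python
def ci95(estimate, se):
    """Normal-approximation 95% confidence interval for an estimate."""
    return (estimate - 1.96 * se, estimate + 1.96 * se)

# Hypothetical: an intervention raises test scores by 0.001 questions
# (out of 1,000), estimated very precisely thanks to a huge sample.
est, se = 0.001, 0.0002
low, high = ci95(est, se)
significant = low > 0 or high < 0  # CI excludes zero, so p < 0.05
print(significant, (low, high))
```

The interval excludes zero, so the result is “significant,” yet the entire interval sits at a magnitude nobody would care about. The confidence interval communicates both facts at once; the *p*-value communicates only the first.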

Imbens also mentions that the 5% threshold is rather arbitrary, and it is only used to facilitate scientific communication. This standardized threshold fails to account for the considerable variations across different research contexts. Researchers work with datasets and experimental setups of vastly different sizes and complexities. The costs associated with false positive results can vary dramatically depending on the field of study and the potential real-world implications of the research. Additionally, the plausibility of the null hypothesis can differ greatly across disciplines and specific research questions. Therefore, it doesn’t make much sense to use a one-size-fits all approach to statistical significance. This concern has already led to some researchers changing their practices. In high-energy physics, for example, they use a much lower threshold, requiring *p*-values to be less than 3 × 10^{-7}.

Another issue he discusses is the tendency for results to be accepted simply because they are statistically significant, even when their findings appear bogus. For example, one study found that hurricanes with female names (e.g. Hurricane Katrina) cause statistically significantly more damage than hurricanes with male names. This study likely gained attention and publication primarily due to its statistically significant result, despite its questionable premise. He emphasizes the importance of vigilance against researchers who manipulate their models by testing multiple specifications and engaging in “p-hacking,” where only significant results are published.

Another thing I learned in statistics at school (also covered on the AP Statistics exam) is the distinction between Type I and Type II errors.

- Type I error: A false positive, where the null hypothesis is rejected even when there is no effect.
- Type II error: A false negative, where the null hypothesis is not rejected even when there is an effect.

I recently discovered that another statistician who I like to read, Andrew Gelman, developed two new types of errors: the Type S and Type M errors [5].

- Type S (sign) error: the estimated effect has the opposite sign of the true effect, given that the result is statistically significant.
- Type M (magnitude) error: the estimated effect exaggerates the magnitude of the true effect, given that the result is statistically significant.
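Both error types can be demonstrated with a quick simulation of an underpowered study (the true effect and noise level below are made up): conditioning on significance, nearly every surviving estimate overshoots the small true effect.

```python
import random

def type_s_m(true_effect, se, n_sims=20000, seed=2):
    """Simulate repeated noisy estimates of a small true effect and examine
    only the 'significant' ones (|estimate| > 1.96 * se)."""
    rng = random.Random(seed)
    sign_errors, magnitudes = 0, []
    for _ in range(n_sims):
        est = rng.gauss(true_effect, se)
        if abs(est) > 1.96 * se:  # keep only significant results
            magnitudes.append(abs(est))
            if est * true_effect < 0:
                sign_errors += 1
    type_s_rate = sign_errors / len(magnitudes)
    exaggeration = (sum(magnitudes) / len(magnitudes)) / abs(true_effect)  # Type M ratio
    return type_s_rate, exaggeration

# A small true effect measured with lots of noise: an underpowered study.
s_rate, m_ratio = type_s_m(true_effect=0.5, se=1.0)
print(s_rate, m_ratio)
```

Because an estimate must exceed 1.96 standard errors to be significant, and the true effect here is only half a standard error, every significant estimate is several times the truth: the exaggeration (Type M) ratio is well above 1, while sign (Type S) errors stay relatively rare.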

Researchers found that “while Type S errors are rare in practice as long as the analysis is conducted in a principled way, Type M errors are rather common” [6]. Drawing on Imbens’ perspectives on significance, Type M and Type S errors are more informative than the traditional Type I and Type II errors. Imbens argues that the point estimate holds greater importance than mere statistical significance. Given that published research tends to favor significant results, many of these reported effects are overestimated. This overestimation can lead to misleading conclusions and a skewed understanding of true effect sizes in various fields of study.

Over-reliance on p-values has led to interesting conclusions. In a study on drugs for atrial fibrillation, researchers concluded their results differed from previous studies because they weren’t statistically significant, despite finding an identical effect size [7]. This interpretation overlooks a crucial aspect: the primary concern regarding drug efficacy lies in the magnitude of its effect, not just statistical significance. In fact, the replication of the same effect size in the new study should be viewed positively, even if it didn’t reach the threshold for statistical significance.

Interestingly, in a study that tested psychology students, professors, lecturers, and teaching assistants on their statistics knowledge, none of the 45 students, only four of the 39 professors and lecturers (who did not teach statistics), and only six of the 30 professors and lecturers (who did teach statistics) got all of the answers correct [1]. This suggests that even experts make statistical errors.

Although many people are raising concerns about the interpretation and viability of the *p*-value, few argue that it should be abandoned completely. Most research today still uses the NHST paradigm, making a complete shift away from it challenging. Additionally, *p*-values help in assessing the likelihood of the observed results arising by chance, assuming the null hypothesis is true. This helps filter out genuinely spurious findings, and Imbens acknowledges that statistical significance is still necessary for rejecting the null hypothesis [4]. None of the articles I read propose completely disregarding *p*-values; instead, they suggest a more nuanced approach to interpreting them. While statistical significance might be less significant in the future, it will still play an important role in science.

[1] McShane BB, Gal D, Gelman A, Robert C, Tackett JL. Abandon statistical significance. The American Statistician. 2019 Mar 20;73(sup1):235–45. doi:10.1080/00031305.2018.1527253

[2] Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015 Aug 28;349(6251). doi:10.1126/science.aac4716

[3] Abadie A. Statistical nonsignificance in empirical economics. American Economic Review: Insights. 2020 Jun 1;2(2):193–208. doi:10.1257/aeri.20190252

[4] Imbens GW. Statistical significance, p-values, and the reporting of uncertainty. Journal of Economic Perspectives. 2021 Aug 1;35(3):157–74. doi:10.1257/jep.35.3.157

[5] Gelman A, Tuerlinckx F. Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics. 2000 Sept;15(3):373–90. doi:10.1007/s001800000040

[6] Lu J, Qiu Y, Deng A. A note on type S/M errors in hypothesis testing. British Journal of Mathematical and Statistical Psychology. 2018 Mar 23;72(1):1–17. doi:10.1111/bmsp.12132

The post Statistical Significance Might Be Less Significant in the Future appeared first on Rivka Lipkovitz.

Can past performance predict future results? This is a question many fencers ask, wondering whether where they are now is where they will always be, or whether one day they will be able to get an A rating or win a tournament. Of course you wish that your past results did not predict your future ones (unless your past results were already amazing, but that is not true for most of us), yet you also know that as you try to improve, other fencers are getting better, making it hard to catch up.

Can we predict who will be an Olympian from Y10 or Y12 results? Of course that is not a healthy thing to obsess over, but it is natural to wonder if we can.

I have already looked at this question, in a way. In my recent posts *When can Youth and Cadet fencers make a breakthrough?* and *Midweek tidbit: Y12 performance does a fairly good job of predicting Cadet performance*, I looked at the correlation between seedings at national championships in consecutive years. However, one reader suggested that I look at the correlation between results as well, since results give a better snapshot of someone’s performance on a given day, regardless of any lucky ratings or rankings earned.

I will admit that results are the more obvious choice, but when I first considered this question I thought seeding would be better. After thinking about it more carefully, I believe seeding and results each have advantages and disadvantages for measuring fencers’ skill levels at a point in time. Here are (in my opinion) the pros and cons of each:

| | Pros | Cons |
|---|---|---|
| Seeding | Averages performance over a long period; not as affected by injury | Fencers of the same rating with no ranking are placed randomly; affected by a fencer’s financial resources (ability to earn national points) |
| Result | Gives nonrandom placement, regardless of whether a fencer has national points or not; less affected by a fencer’s financial resources (a single result) | Can be skewed by the luck of a single tournament; very affected by injury |

My interest was piqued, so I decided to look at the predictive power of previous results for future performance. The methodology is identical to *When can Youth and Cadet fencers make a breakthrough?* except that I use results instead of initial seeding, so you can refer to that post if you are interested in the details.

As a reminder here is the result from the previous post:

And here is the new result:

The predictive power of results from one year to the next is a bit lower than that of seeding, but the correlation is still moderate. Again, predictive power seems to increase slightly as fencers get older, though not drastically (except in men’s fencing, for some reason).

Here are the results for the longer term correlations. This one is between the seeding for the 2nd year of Y12 and the 1st year of Cadet.

And here is the new one with results instead of seeding:

Just like the other graphs above, the graph that uses results has slightly lower correlations. However, the correlation between Y12 and Cadet results is about 0.5, which still constitutes moderate predictive power. That is quite high considering that so much can change in 3 years, from height to mindset and motivation.

If one were to move up one place in their Y12 seeding, how much would they expect to move up in their Cadet seeding 3 years later? I ran a linear regression, which essentially fits a y = mx + b line through the scatterplot of Y12 seeding against Cadet seeding. Here are the slopes of the lines when looking at both seeding and results:

For each increase of one place in seeding for Y12, there is an approximate 0.67 gain in seeding for Cadet 3 years later, and for each increase of one place in results in Y12 there is about a 0.55 gain in the later Cadet results.
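For illustration, the slope can be computed with a plain least-squares fit. The data below are fabricated to have a slope of exactly 0.67, mirroring the reported seeding estimate; they are not the real tournament data.

```python
def slope_intercept(xs, ys):
    """Ordinary least-squares fit ys = m * xs + b; returns (m, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return m, my - m * mx

# Hypothetical (Y12 seeding, Cadet seeding 3 years later) pairs, built
# from the line cadet = 0.67 * y12 + 3 so the fit recovers it exactly.
y12 = [1, 5, 10, 20, 40, 60]
cadet = [0.67 * s + 3 for s in y12]
m, b = slope_intercept(y12, cadet)
print(round(m, 2), round(b, 2))  # 0.67 3.0
```

On real, noisy data the points scatter around the fitted line and the slope is an estimate rather than an exact recovery, but the computation is the same.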

I am not sure what ideas you had coming into this article, but you may have either confirmed your suspicions or changed your mind. Let me know what you think in the comments. I’d say that overall, I think that past performance can do a pretty good job predicting future results.

Many parents, fencers, and coaches have expressed frustration on social media that East Coast North American Cups (NACs) are problematic for West Coast fencers. Morning events on the East Coast translate to 5:00 AM on West Coast time, and fencers often have to wake up 1-2 hours earlier than their event start time to eat breakfast and warm up. One would expect an early start time to have a negative effect on performance. Try performing at your best at 5:00 AM when you’re exceptionally groggy!

I wanted to determine if early East Coast start times have a causal effect on the performance of West Coast fencers, specifically whether these early start times lead to poorer results for West Coast fencers at events. Sneak peek into the conclusion: they probably don’t!

Measuring a causal effect is very challenging. It is so complex that Nobel Prizes have been awarded to those who have developed new methods for measuring it. Let’s say we are analyzing the impact of a new drug on people. Determining causation is difficult because we cannot measure the counterfactual – the outcome if the individual had not received the treatment. Additionally, we cannot simply compare the average outcomes of treated versus untreated individuals because treated individuals may differ in characteristics from untreated ones, leading to confounding factors.

In the context of fencing, we want to compare the performance of West Coast and East Coast fencers when an East Coast NAC (North American Cup) starts early in the morning. The treatment is the early East Coast start time. However, a direct comparison between East Coast NACs and all other NACs isn’t straightforward because the individuals willing to endure the long flight and miss school for an East Coast NAC are likely to be highly committed and perhaps more skilled fencers. This creates a confounding variable, as the West Coast fencers who participate (“treated”) are typically more skilled at fencing.

This is why randomized control trials (RCTs) are often considered the “gold standard” for determining causality; individuals are randomly assigned to treatment or control groups, making it unlikely that the groups will have substantial differences. Therefore, we can directly compare the treated and untreated groups to determine the causal effect of the treatment.

In observational data, which is what we have access to, our best chance to uncover causal evidence is through a natural experiment—where we can reasonably assume that the treated group resembles the untreated group.

It seems reasonable to assume that the start time of an event is quasi-random and uncorrelated with skill. Although male events are generally in the morning (only one Junior Men’s Foil event in my sample was in the afternoon), we can focus solely on female events to guarantee a balanced distribution of morning and afternoon events. We specifically restrict our sample to female Junior events, which have a similar number of morning and afternoon start times, making it plausible that the event time is random. Additionally, the Junior category happens to have the largest sample size of East Coast events. We also look at pool bouts rather than DE bouts since there are a larger sample of pool bouts and fencers may have already “woken up” and feel alert by the time they get to the DE round.

Since event times are typically announced only about a month in advance, fencers are unlikely to withdraw after seeing the event time. Therefore, the inherent characteristics of East Coast events with 8:00 AM start times are unlikely to substantially differ in terms of fencer skill compared to 1:00 PM events, which allow fencers from both coasts to wake up at more reasonable times. We also assume that USA Fencing does not disproportionately assign events with stronger West Coast fencers to either the morning or the afternoon.

East Coast fencers and West Coast fencers adhere to the natural definition of East and West Coast. East Coast fencers are those who reside in a division on Eastern Time (ET), and West Coast fencers are those who reside in a division on Pacific Time (PT). Similarly, an “East Coast event” is an event in a city that follows ET. We only analyze bouts from East Coast events that have exactly one West Coast and one East Coast fencer since our goal is to measure how West Coast fencers do against East Coast fencers at an early event versus a late event.

An event has an “early start time” if it starts between 8 and 9 AM, and is otherwise not considered an early start time. There were few events that started between 9 AM and 12 PM, so most “late start time” events are in the afternoon. Excluding late morning event times did not change the results.

We also control for the number of fencers in the event in case there is confounding where larger events are more likely to be put in the morning and also have more weak/strong fencers from either coast. Our results are not sensitive to the inclusion of this control. Since there may be correlation between the performance of West Coast fencers versus East Coast fencers at a specific event and all fencers at an event are impacted (“treated”) by the early start time, we cluster bouts from the same event in the regression analysis. Clustering only affects the standard errors, and it won’t change our estimate for the effect of early start times. In the end, our regression formula is

(*West Coast fencer score* – *East Coast fencer score*) = *α*·(*number of fencers in event*) + *β*·(*early start time indicator*) + *ε*,

where our variable of interest is *β*, the effect of the early start time.
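As a sketch of the estimation step (not the actual analysis, which also clusters standard errors by event), the two coefficients can be recovered by solving the normal equations. The bout data below are synthetic, generated from arbitrarily chosen coefficients with no noise, so the fit recovers them exactly.

```python
def two_var_ols(x1, x2, y):
    """No-intercept OLS for y = a*x1 + b*x2, solved via the 2x2
    normal equations with Cramer's rule."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * v for a, v in zip(x1, y))
    s2y = sum(b * v for b, v in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

# Hypothetical bouts: (event size, early-start flag) -> score difference,
# generated with alpha = -0.01 and beta = -0.5 (both made up) and no noise.
sizes = [80, 120, 150, 200, 90, 110, 160, 180]
early = [1, 0, 1, 0, 1, 0, 1, 0]
diffs = [-0.01 * s - 0.5 * e for s, e in zip(sizes, early)]
alpha, beta = two_var_ols(sizes, early, diffs)
print(round(alpha, 3), round(beta, 3))  # -0.01 -0.5
```

With real data the residuals are nonzero and the interesting question becomes whether the estimated *β* is distinguishable from zero, which is where the clustered standard errors and *p*-values come in.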

In all three weapons, there was no evidence that early start times led to worse performance for female West Coast fencers in the Junior category. The estimated effects of early start times are negligible (less than 0.06 points in either direction). Moreover, the p-values, all exceeding 0.05, indicate that the observed effects could likely be attributed to random chance.

| Weapon | Effect of 8:00 AM event on West Coast fencer’s score | 95% confidence interval | *p*-value | n |
|---|---|---|---|---|
| Foil | -0.0074 points | (-0.463, 0.448) | 0.975 | 1590 |
| Epee | -0.0572 points | (-0.332, 0.218) | 0.684 | 1325 |
| Saber | 0.0082 points | (-0.779, 0.795) | 0.984 | 1602 |

While my analysis focuses solely on female events due to their variation in start times, I believe these results could extend to male fencers as well. It seems unlikely that early start times would have a significant effect on male fencers while not affecting female fencers.

As a West Coast fencer, although I agree that it is annoying to travel 5+ hours to the East Coast to fence, it is reassuring that it (probably) won’t affect my performance. In spite of this, I still argue that there should be more West Coast NACs. East Coast NACs are still more expensive and inconvenient for West Coast fencers, and exacerbate economic inequality in this already extremely costly sport. Let’s hope for some more West Coast NACs in the 2025-2026 season!

Improving in fencing is complex and unpredictable. Some days, I feel like I’m not making any progress at all, then one week later everything clicks. Like most sports, progression in fencing isn’t linear, and everyone has ups, downs, and plateaus. In comparison to sports like gymnastics, where athletes typically peak in their late teens to early twenties, fencers often continue to excel well into their 30s. For instance, 33-year-old Gerek Meinhardt is attending the Olympics this year.

Despite the peak age being later in fencing, it still seems plausible that progress is influenced by age. During puberty, athletes gain substantial muscle mass, likely leading to rapid improvement. Since females typically enter puberty 1-2 years earlier than males, their peak gains may also occur earlier. To explore the effect of aging on fencers’ development, I quantified year-over-year improvement in fencers, aiming to identify the ages at which they make the most significant gains.

To measure skill, I used Microsoft TrueSkill, a rating system that updates players’ ratings after each match based on their opponent’s rating and the match outcome. Similar to the Elo system on Chess.com, TrueSkill ensures higher-rated players gain less from defeating lower-rated opponents and more from beating higher-rated ones. Using outcomes from pool bouts, I calculated the TrueSkill of all fencers who competed at a national tournament from 2012-2020, segmented by weapon and gender. Then I collected all fencers who competed continuously (in at least one national tournament a year) from their last year of Y12 to their first year of Junior. For each season from the first year of Y14 to the first year of Junior, I found the highest TrueSkill rating that each fencer attained. By averaging each fencer’s TrueSkill ratings, I could see how fencers generally improved over time.
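
TrueSkill itself is a Bayesian rating system (available in Python as the third-party `trueskill` package), but the core property described above, gaining less for beating weaker opponents and more for beating stronger ones, can be sketched with a simple Elo-style update. The numbers and function names here are illustrative, not the actual parameters of my analysis:

```python
def expected_win_prob(r_a, r_b):
    """Probability that fencer A beats fencer B under a logistic model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_after_bout(r_winner, r_loser, k=32):
    """Elo-style update: the winner gains less the more favored they were."""
    gain = k * (1 - expected_win_prob(r_winner, r_loser))
    return r_winner + gain, r_loser - gain

# A 1600-rated fencer gains little for beating a 1200, a lot for beating a 2000.
small_gain = update_after_bout(1600, 1200)[0] - 1600  # heavy favorite wins
big_gain = update_after_bout(1600, 2000)[0] - 1600    # big upset
```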

TrueSkill is a valuable tool to assess relative skill among fencers. However, it focuses on how fencers compare to each other, not necessarily their absolute improvement. This can be insightful, but it has limitations. For instance, imagine one fencer does improve between tournaments, but makes significantly less progress than their peers. TrueSkill might show a rating decrease for the fencer with the smaller improvement, even though all of the competitors got better overall.

In short, **a loss in rating could just mean that a fencer improved less than other competitors at the tournament,** not that they got worse.

- Male epee fencers make the most improvement between their first and last year of Y14.
- Male foil and saber fencers make more improvement during the transition between Y14 and Cadet than they do in other years.
- Female fencers have diminishing improvement over time; as expected, their greatest period of growth is earlier than that of male fencers.
- Most fencers make substantially less progress as they move from Cadet into Junior. These fencers are entering either 11th or 12th grade, so school and college applications may be becoming a bigger priority.
- After each weapon/gender’s year of big growth, the following year shows slightly less growth, and growth drops off significantly in the years after that.
- The graphs were very messy, so they aren’t displayed in this blog post, but the rate of improvement didn’t differ considerably between higher-level and lower-level fencers. All of the same trends were observed.

This blog post explored the progression and improvement patterns of fencers across different ages and weapons, using skill quantified by the Microsoft TrueSkill rating system.

The results suggest that the largest improvements generally occur during the middle school to high school transition, which corresponds to middle adolescence, the period when both boys and girls have the most muscle growth. Since boys enter puberty about one year later than girls, their year of big improvement comes about one year after the girls’. This muscle growth could explain why Y14 competitors are able to compete (and often dominate) in older age categories. For example, at the 2023 November NAC, there were 4 Y-14 fencers in the top 8 in Cadet Women’s Foil.

Additionally, fencers make significantly less progress each year as they age, being at their lowest during the late high school years. There are many possible causes for this. One possibility is that with more responsibilities in 11th and 12th grade, such as college applications and AP testing, fencers have less time to practice. Another possibility is that fencers whose goal was recruitment already know whether they have gotten an offer by the end of their 11th grade year, and don’t put as much effort into their fencing anymore. Lastly, there could be physiological factors specific to 17-year-olds that affect their rate of improvement. Further research is needed to fully understand the “why” behind this trend.

Have you ever wondered if a policy change actually has the intended effect?

A few months ago, I stumbled upon a paper by Meehan & Stephenson (2024) that examined the effects of the name change of the Washington Bullets to the Wizards in 1997. The team owner worried that the “Bullets” name was insensitive due to prevalent gun violence in Washington D.C., and could be an indirect subconscious cause of some of the deaths. In the paper, the authors used a method called the “Synthetic Control Method” to see if the name change actually *caused* a decrease in homicides per capita in Washington D.C.

The Synthetic Control Method allowed the authors to construct a “Synthetic Washington D.C.” which modeled the counterfactual: what would have happened if the Bullets had not changed their name. The authors found that there was no statistically significant difference between the real-life Washington D.C. and its synthetic counterpart, meaning that the data did not provide evidence that the name change caused any detectable decrease in gun violence. I thought that this method of constructing a “synthetic control” was magical. In statistics class, you learn that with an observational study you can never establish a cause-and-effect relationship; you can only find associations. However, using the synthetic control (and the broader fields of econometrics and causal inference) you can find cause-and-effect relationships in observational data if you specify the correct model.

This is especially interesting and important because many policy changes and important questions cannot be properly answered experimentally. This is either because it is unethical to do so (experimentally measuring whether something causes homicides is obviously unethical), or unfeasible because it is too expensive or difficult to simulate a real-world situation in a lab. Since learning about the synthetic control method I have been captivated by it, and also by the broader field of causal inference, the study of finding cause-and-effect through observation. It’s fascinating that we can use these methods to quantify the effects of interventions and policies, and answer the crucial question: “What would have occurred if we hadn’t done that?” In one famous example, Abadie et al. (2010) were able to measure the effect of Proposition 99 (a cigarette tax) in California on cigarette sales, finding that the tax decreased per-capita cigarette pack sales by about 26 packs by 2000. This shows that Proposition 99 accomplished what it set out to do!

First, let’s introduce a few key vocabulary words.

- **Treatment**: The intervention or condition applied to units in an experiment or study. In the Bullets example, the treatment is the name change.
- **Unit**: An individual, group, or entity to which a treatment is applied or observed. In the Bullets example, the units are cities.
- **Outcome**: The variable of interest measured to assess the effect of the treatment. In the Bullets example, the outcome is the homicide rate per 100k people.
- **Counterfactual**: The unobserved outcome that represents what would have happened under different treatment conditions. In the Bullets example, this is the hypothetical homicide rate per 100k that would have occurred if there were no name change.
- **Confounder**: A variable that is associated with both the treatment and the outcome, potentially leading to a false association between them. For example, one could (falsely) conclude that being taller causes kids to be better at math, even though in reality this is confounded by age: older kids are usually better at math and are also taller.

In a randomized controlled trial, often considered the “gold standard” for establishing causality, researchers randomly assign units to either the treatment group or control group. Then they run a statistical test to determine if there is a significant difference in outcomes for the treated units compared to the control units. This can establish a cause-and-effect relationship because the assignment of units to either the treatment or control group is completely independent of any of the units’ characteristics; it is randomized. Consequently, the treated group is very likely to closely resemble the untreated group.

With observational data, we don’t have controls. We just know the outcomes of the units and whether or not they were treated. Importantly, we can’t simply compare the outcomes of the treated units to the untreated units; there could be confounders. For example, a city that implements a program to combat homelessness is also likely to be facing high levels of homelessness. It would not be appropriate to compare this city to an untreated city, because that untreated city probably has lower homelessness numbers. So how can we identify units that are similar to our treated unit to act as controls?

This is where the genius of the synthetic control method comes in: we create a new “synthetic control” unit by combining untreated units in a manner that closely approximates our treated unit prior to the treatment. In other words, for a set of untreated units {*x*_{1}, *x*_{2}, … *x*_{n}} and a treated unit *x*_{0}, each untreated unit is assigned a weight between 0 and 1. These weights {*w*_{1}, *w*_{2}, … *w*_{n}}, which in total sum to one, are optimized to minimize the difference between *x*_{0} and the weighted sum of units *w*_{1}*x*_{1} + *w*_{2}*x*_{2}+ … + *w*_{n}*x*_{n} during the pre-treatment period. To enhance the accuracy of the synthetic control, researchers can incorporate additional predictor variables (e.g., population density). This ensures that the synthetic control not only closely mirrors the treated unit in terms of the outcome variable but also approximates it across other relevant metrics. Typically, after optimizing the weights, units significantly dissimilar to the treated unit are assigned a weight of 0, since they do not contribute to the pre-treatment fit. Ultimately, we are left with a combination of units that are very similar to our treated unit.
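
As a sketch, here is how those weights might be found numerically with `scipy.optimize.minimize`. The donor trajectories below are simulated purely for illustration; real applications (like Abadie et al.’s) also match on predictor variables, which this toy version omits:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
T, n = 20, 5                                # 20 pre-treatment periods, 5 donors
Y = rng.normal(size=(T, n)).cumsum(axis=0)  # donor units' outcome trajectories
true_w = np.array([0.6, 0.3, 0.1, 0.0, 0.0])       # weights used to simulate
y0 = Y @ true_w + rng.normal(scale=0.05, size=T)   # "treated" unit, pre-treatment

def pretreatment_loss(w):
    """Squared distance between the treated unit and the weighted donors."""
    return np.sum((y0 - Y @ w) ** 2)

res = minimize(
    pretreatment_loss,
    x0=np.full(n, 1 / n),                                      # start at equal weights
    bounds=[(0, 1)] * n,                                       # each weight in [0, 1]
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1},  # weights sum to 1
)
weights = res.x
synthetic = Y @ weights  # the synthetic control's trajectory
```

Note how dissimilar donors naturally end up with weights near zero, mirroring the behavior described above.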

Below is an example from Abadie et al. (2014) that illustrates how weighting the units in the control group can deliver a significantly better pre-treatment fit than averaging all of the control units. Notably, in the second image it is much easier to see how Germany diverges compared to its synthetic counterpart, making it clear that West Germany’s GDP per capita dropped as a result of its reunification.

Country | Weight |
---|---|
Austria | 0.42 |
Japan | 0.16 |
Netherlands | 0.09 |
Switzerland | 0.11 |
United States | 0.22 |

So in the Germany example, we can see that West Germany’s GDP per capita dropped compared to its counterfactual “Synthetic West Germany” after reunification. How can we know that this change is not coincidental, and that reunification actually caused this effect? Firstly, for there to be a significant effect, the post-treatment fit must be poorer than the pre-treatment fit. If the synthetic control closely tracks the treated unit post-treatment, akin to its pre-treatment behavior, it suggests that the treatment may not have had a substantial effect.

Let’s say that our post-treatment fit is worse than our pre-treatment fit. Still, how can we be convinced that this difference is not due to chance? There are many ways to measure whether the treated unit significantly deviates from the synthetic control. One way is to conduct a “placebo experiment,” where we simulate the synthetic control method for units in the control group, treating each one as if it were the treated unit. For each unit, we divide the post-treatment mean squared prediction error (MSPE) by the pre-treatment MSPE, yielding a ratio that quantifies the post-treatment deviation from the counterfactual relative to the quality of the pre-treatment fit. Units are then ranked based on this ratio, with the expectation that the treated unit will exhibit the highest (or one of the highest) deviations from its synthetic counterpart. In their study on the impact of Proposition 99 on cigarette sales in California, Abadie et al. (2010) employed this method and found that California had the highest MSPE ratio by a significant margin, supporting their finding that Proposition 99 decreased cigarette sales.
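
That ratio is easy to compute once you have the treated unit’s actual and synthetic trajectories. A minimal sketch, with toy numbers rather than the Proposition 99 data:

```python
import numpy as np

def mspe(actual, synthetic):
    """Mean squared prediction error between two series."""
    a, s = np.asarray(actual, float), np.asarray(synthetic, float)
    return float(np.mean((a - s) ** 2))

def mspe_ratio(actual, synthetic, t0):
    """Post-treatment MSPE divided by pre-treatment MSPE (t0 = first treated
    period). Ranking the treated unit and all placebo units by this ratio is
    the placebo test: a treated unit near the top suggests a real effect."""
    return mspe(actual[t0:], synthetic[t0:]) / mspe(actual[:t0], synthetic[:t0])

# Toy example: good pre-treatment fit, large post-treatment divergence.
actual = [1.1, 2.0, 3.1, 10.0, 11.0]
synthetic = [1.0, 2.0, 3.0, 4.0, 5.0]
ratio = mspe_ratio(actual, synthetic, t0=3)  # large ratio -> likely real effect
```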

In summary, the synthetic control method is a fascinating and powerful way to measure the impact of policies. This blog post is just an introductory glimpse into its potential. Numerous other variations of the synthetic control exist, some using machine learning to improve the pre-treatment fit, and there are new developments every year. Additionally, there are other methods to assess whether the treatment effect is statistically significant or merely coincidental. I hope that this article has piqued your interest and inspires you to read more about the synthetic control method and causal inference. If you are as fascinated by the method and want to know more of the details that go into using the synthetic control, Alberto Abadie (2021), who devised the method, offers a more in-depth description. Adhikari (2021) also provides a thorough and concise guide.

Abadie, A. (2021). Using synthetic controls: Feasibility, data requirements, and methodological aspects. *Journal of Economic Literature*, *59*(2), 391–425. https://doi.org/10.1257/jel.20191450

Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program. *Journal of the American Statistical Association*, *105*(490), 493–505. https://doi.org/10.1198/jasa.2009.ap08746

Abadie, A., Diamond, A., & Hainmueller, J. (2014). Comparative politics and the Synthetic Control Method. *American Journal of Political Science*, *59*(2), 495–510. https://doi.org/10.1111/ajps.12116

Adhikari, B. (2021). A guide to using the synthetic control method to quantify the effects of shocks, policies, and shocking policies. *The American Economist*, *67*(1), 46–63. https://doi.org/10.1177/05694345211019714

Meehan, B., & Stephenson, E. F. (2024). Did the Wizards help Washington Dodge Bullets? *Sports Economics Review*, *5*, 100026. https://doi.org/10.1016/j.serev.2024.100026

The question of “rating inflation” in different regions or divisions has been a topic of debate. Some argue that places like New York and New Jersey have higher rating inflation because they have numerous tournaments with highly rated fencers. On the flip side, others believe that these areas have deflated ratings because it’s challenging to perform well and earn a rating with such strong competition. Determining rating inflation is tricky through personal experience alone, given the considerable skill level variation within individual fencers of a given rating. However, with the power of statistics, we can look at a large sample of data to find which regions truly have rating inflation!

For reference, here is the USA Fencing region map:

Let’s focus on Cadet and Junior pool bouts since these two age groups are fairly homogenous and have a wide range of skill levels. Additionally, let’s limit the bouts to those that took place in either 2019 or 2020 to ensure comparability with the present. Even with these restrictions, we still have over 15,000 bouts for each weapon, providing a substantial sample size.

To measure rating inflation, we use multiple linear regression. The independent variables are the fencer region, opponent region, fencer rating, and opponent rating. The dependent variable is the score difference in the bout. Additionally, since win/loss and score difference depends on which perspective you are looking at the bout from, each bout is randomly assigned a perspective.

By controlling for individual skill (ratings), we can isolate the effect of region on score difference and win rate. In a scenario without rating inflation, regional controls wouldn’t influence the results, as fencer ratings alone would fully explain the score difference or win/loss. On the flip side, if a region’s coefficient is statistically significant in comparison to another, it suggests inflated/deflated ratings there.
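
This model setup can be sketched with `statsmodels`’ formula API. The data below is simulated, with an artificial inflation effect baked into Region 2 purely so the regression has something to find; the real analysis ran on actual bout records:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
ratings = list("ABCDEU")
strength = {r: s for s, r in enumerate(reversed(ratings))}  # U=0 ... A=5

df = pd.DataFrame({
    "fencer_region": rng.integers(1, 7, n),
    "opponent_region": rng.integers(1, 7, n),
    "fencer_rating": rng.choice(ratings, n),
    "opponent_rating": rng.choice(ratings, n),
})
# Simulated outcome: the skill gap drives the score difference, and Region 2
# ratings are inflated (same-rated Region 2 fencers score half a point less).
df["score_diff"] = (
    df["fencer_rating"].map(strength)
    - df["opponent_rating"].map(strength)
    - 0.5 * (df["fencer_region"] == 2)
    + 0.5 * (df["opponent_region"] == 2)
    + rng.normal(scale=2, size=n)
)

model = smf.ols(
    "score_diff ~ C(fencer_region) + C(opponent_region)"
    " + C(fencer_rating) + C(opponent_rating)",
    data=df,
).fit()
# A negative coefficient on C(fencer_region)[T.2] recovers the baked-in
# inflation: same-rated Region 2 fencers score fewer points.
```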

Each dot represents a region with a 95% confidence interval.

In foil, Region 6 had the most inflated ratings. A Region 6 fencer would, on average, be more than half a point down against a Region 4 fencer with the same rating in a 5-touch bout. Regions 2 and 5 also had inflation, but it was less severe: they would only be about half a point down against a Region 4 fencer of the same rating. Regions 1, 3, and 4 were not significantly different from one another.

In epee, Region 1 was the most deflated, scoring significantly more points against same-rated fencers from other regions. Regions 3, 4, and 5 were all about in the middle, neither inflated nor deflated. Regions 2 and 6 were the most inflated, on average being half a point down against a Region 1 fencer.

Region 2 is the most inflated in saber, on average being 0.6 points down against a Region 3 fencer in a 5-touch bout. Regions 3, 4, 5, and 6 are all a bit deflated, but all of them are relatively similar to one another. Region 1 is the closest to a middle ground, not being substantially inflated or deflated.

For all three weapons:

- Region 2 had rating inflation
- Region 4 never substantially differed from Region 3.
- Generally, most ratings were deflated compared to Region 3.
- Within each weapon there were always a few other regions that had rating inflation or deflation.

This means that **rating inflation is a real phenomenon.**

Additionally, this suggests that **populous regions are (generally) deflated in comparison to other regions.** Regions 3 and 4 are the most populous and were always on the right (deflated) side of the graph.

Also, note that the average gap between two “adjacent” ratings in pools is about half a point in epee (the gap is about 0.8 points in foil and saber). This means that **in epee, fencers from Region 6 are usually rated one rating higher than a similarly skilled Region 1 fencer.**

At national tournaments where fencers without national points are randomly seeded, the existence of rating inflation increases the aspect of chance in pools. For example, a fencer who gets a B from Region 2 is very lucky compared to a fencer who gets a B from Region 3, even though in a parallel universe these two B-rated fencers could have been switched in the seeding arrangement. However, it is difficult to solve this problem, as even Elo or other performance ranking systems are subject to regional inflation and deflation.

A reader recently proposed an idea that piqued my curiosity: “How does a fencer’s performance in their pools correlate with their overall performance in a tournament?” Sometimes, a tournament goes exactly as the pool results predict: a fencer beats opponents seeded lower after pools in DEs before finally losing to a stronger opponent who did better in pools. Much of the time, though, things don’t go as expected. A fencer may only win one or two pool bouts, then go on to win a medal in the event! Or the fencer seeded 2nd after pools loses their first DE to someone who only won one pool bout. How often does this happen?

I looked at every national event between 2017 and 2020 (besides Div 1 events, which have two rounds of pools), and compared each fencer’s pool result to their final result. Instead of comparing their exact place (e.g., #20 to #112), which doesn’t account for varying event sizes, I used their percentiles. For example, the fencer who placed #23 out of 23 fencers would be in the 0th percentile, and a fencer who got #60 out of 120 fencers would be in the 50th percentile. Then I placed these values on a scatterplot. I used a rainbow gradient to color them based on their pool results, with pink dots representing fencers who won all of their pool bouts, turquoise representing fencers who won half, and red representing fencers who won none of their pool bouts. Here was the result:
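
The percentile conversion can be written as a one-line helper. The function name is mine, and this is just one convention consistent with the two examples above (last place maps to the 0th percentile):

```python
def placement_percentile(place, field_size):
    """Convert a placement (1 = first) into a percentile of the field,
    so #23 of 23 -> 0.0 and #60 of 120 -> 50.0."""
    return 100 * (field_size - place) / field_size
```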

Fencers in the bottom 20% of the pool results almost always place the same in the final results. This is because in many events, the bottom 20% are cut and cannot proceed to DEs, so they are guaranteed to get the same place as their pool results. Even in events without the 20% cut, these fencers still almost never place in the top 25% in the final results.

For the most part, fencers in the top 20% of pool results are protected from finishing in the bottom half. Since they have such a high seed going into DEs, they won’t end up too low even if they lose their first DE. For example, if a fencer seeded #1 loses their first DE in the table of 128, they’ll still end up in 65th place.

Starting at about the 40th percentile and up, these fencers have a shot at placing high in the event. Although the majority of them still end up placing about the same as their result in pools, a bunch manage to make it to the final rounds.

The correlation in this graph was 0.916, which means that pool results and final placing are very well correlated. Squaring this (r² ≈ 0.84) means that about 84% of the variance in final results can be explained by the pool results.

Unlike the pool results and final results, which are highly correlated, the initial seeding and pool results are not very correlated. The correlation is only 0.69, so only about half of the pool results can be explained by the seeding.

Low seeds have a much higher chance of finishing above their pool result than the high seeds do at finishing lower than their pool result. In fact, the high seeds rarely finish below the bottom half after pools, whereas it’s not uncommon for a low seed to finish around the middle.

The “dots” that you see in the diagram are an artifact of the “snaking algorithm” that builds pools. Fencers at those percentiles in the initial seed (~25%, ~55%, ~85%) almost always get placed in a pool of 6, rather than a pool of 7 like everyone else.

Interestingly, the seeding and final results are slightly *more* correlated than the seeding and the pool results, with a correlation coefficient of 0.707. This is probably because a high-seeded fencer can mess up in pools, but then make a comeback and place where they are supposed to, so the final placement bears slightly more resemblance to the initial seeding.

The coloring remains the same as the first scatterplot, with pink representing fencers who won all of their pool bouts and red the fencers who won none. Of those in the top 20% of the final results, almost all of them won all of their pool bouts or lost at most one.

I did make weapon-specific graphs, but they don’t look all that different, so I didn’t include them in this post. Although epee has more variability than foil and saber, this is expected considering that epee bouts are closer, and there are also more upsets in epee.

Let’s say that the model for tournament performance is

*Overall result* = *Fencing skill that day* + *Matchups at tournament* + *Luck*

Since matchups at a tournament are partially due to pool placement, we can rewrite this equation as

*Overall result* = *Fencing skill that day* + *Pool placement* + *Luck*

Is doing well in pools the key to success at tournaments because it sets you up for an easy path in DEs? Or is it that fencers who are having a good day do well in pools and also perform well in the overall result?

This is a complicated question, and it is impossible to answer from this data, because *Fencing skill that day* and *Pool placement* are endogenous: pool placement is itself largely determined by that day’s skill, so the two are highly correlated. Therefore it is impossible to isolate the impact of each variable individually on fencer performance. However, given that doing well in pools grants an easier path in DEs, fencers should of course try their best in pools to maximize their chance of winning.

Are A-rated fencers more consistent than E-rated fencers? They should be, but how can we know? Luckily, I have tens of thousands of bouts in my dataset, and I can answer this question (and answering such questions is the basis of this blog)! In this post, I will explore how the distribution of scores varies based on the ratings of two fencers and uncover valuable, and possibly unexpected, insights! As usual, the full data is available at the end.

If you enjoy this post, I recommend checking out this one, which looks at score differences across division and weapon:

Why Fencing Is Often So Disappointing: 12% Of Bouts Are Lost by One Point

Let’s begin our exploration by examining the average score differences in pool bouts between fencers of varying ratings. Based on national tournaments bouts from 2017-2020, the average score difference for fencers who are one rating apart is approximately 1.11 points. As the rating differentials increase, we observe a progressive increase in the average score difference between the fencers. For instance, when the rating difference jumps to two, the average score difference rises to 1.81 points. This progression continues, with score differences of 2.43, 3.00, and 3.34, for three, four, and five rating differentials, respectively.

This is all pictured below in the following graph. Each bar between two columns represents the average score difference between those two ratings in a 5-touch bout. For example, the bar that connects the red A column with the orange B column indicates that the average score difference between an A- and B-rated fencer is 1.36 points in a pool bout. Note that for all of these averages, a loss is counted as a negative value, so if an A loses to a B the average score difference will decrease.

In general, barring comparisons of a rating with U, the lines become shorter as we move towards the right. This suggests that the gap between lower ratings is less than that between higher ratings. It also implies that earning a higher rating progressively becomes more difficult, which makes sense.

For DEs, I am going to use boxplots to visually represent the distribution, central tendency, and spread of the dataset. Each box represents the interquartile range (IQR), the range within which the central 50% of scores fall. The line inside the box indicates the median, the middle value when scores are arranged in ascending order. The whiskers extend to 1.5 times the IQR, and outliers are represented by dots beyond the whiskers.
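
For concreteness, here is how those boxplot ingredients are computed. This follows the common matplotlib convention where each whisker stops at the most extreme data point still within 1.5 × IQR of the box:

```python
import numpy as np

def box_stats(scores):
    """Quartiles, whisker ends (1.5*IQR rule), and outliers for a boxplot."""
    data = np.asarray(scores, float)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    whisker_lo = data[data >= lo_fence].min()  # most extreme points that are
    whisker_hi = data[data <= hi_fence].max()  # still inside the fences
    outliers = data[(data < lo_fence) | (data > hi_fence)]
    return q1, median, q3, whisker_lo, whisker_hi, outliers

# e.g. a handful of score differences with one blowout:
stats = box_stats([1, 2, 3, 4, 5, 100])  # the 100 shows up as an outlier dot
```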

These boxplots tell an interesting story about the performance of different rated fencers. Let’s break it down:

A-rated fencers display a high level of consistency against D-, E-, or U-rated fencers, winning on average with a score of around 15-7 against all of them. Each box also has a small IQR, indicating that it’s rare for scores to fall too far outside the 5-10 touch differential range. This shows that A-rated fencers are very consistent, answering the question posed in the title.

B-rated fencers show the smallest IQR when fencing another B-rated fencer, which indicates that Bs are fairly homogeneous in skill. Surprisingly, Bs tend to have a lower average score against lower-rated fencers than As do, meaning that A-rated fencers are considerably more consistent at beating weaker fencers.

C-rated fencers are interesting as they sit in the middle of the skill spectrum. The scores reflect that, as they generally lose to As by almost the same margin as they beat Us. The graph is also relatively symmetrical.

D-rated fencers tend to perform consistently poorly against both As and Bs (the IQR is small), which indicates a big difference in skill between even proficient Ds and weak Bs.

This graph is fairly similar to the D-rated fencer graph, but the E-rated fencers have slightly worse outcomes as expected.

Lastly, the U-rated fencers tell quite a tale. They show a wide range of scores against other Us, which points to it being a heterogeneous group. They lose to As, Bs, and Cs by a similar, and consistent amount.

In Unveiling Fencing’s Winning Odds by Rating, I found that epee fencers had the lowest chance of beating a lower-rated fencer. Likewise, in Why Fencing Is Often So Disappointing: 12% of Bouts Are Lost by One Point I found that epee fencers have closer bouts than the other weapons.

Unsurprisingly, when looking at the score difference for “adjacent ratings” (ratings that differ by one letter), I found that epee fencers beat lower-rated fencers by less than fencers in the other weapons do.

**Check out the full data here**

You are standing on the unforgiving piste, the weight of your weapon in hand. Your breath quickens, heart pounding, as you face your opponent. The score is tied, 14-14. Every move feels like a dance on the edge of victory and defeat. Each step, each parry, a desperate bid for that final touch. The crowd’s hushed anticipation hangs heavy in the air. Every fiber of your being strains towards that elusive point, but fate has other plans. The final touch lands, not in your favor. The world seems to slow as the referee’s arm rises, sealing your fate. A symphony of cheers and sighs echoes around you, drowning out the thunderous beat of your own heart. It’s over. You’ve come so close, yet fallen short.

Losing by one touch has happened to all of us at least once, and it’s often the most painful thing possible in fencing, sometimes even more painful than injuries and cramps, despite it being 100% mental. Sometimes it feels like this “bad luck” happens more often than it should, and you are correct to think that, because it does!

This distribution highlights that losing by just one touch (in most cases, a score of 15-14) is indeed the most common outcome, with 11.6% of bouts ending this way. Additionally, score differences of 5 or less collectively make up a significant portion (more than 50%) of the outcomes, underlining the fine margins and competitiveness often seen in DE bouts.

In fencing, each weapon has its unique characteristics, and the statistics reflect these distinctions. When it comes to score differences, epee stands out with the highest likelihood of ending with a one-point difference, approximately 15%, compared to other weapons, which are around 10%.

In terms of score distribution, the curve for epee tends to skew more to the right compared to saber. Similarly, saber exhibits a right skew when compared to foil. This suggests that epee has the closest bouts, followed by saber, then foil.

Although it is not as pronounced, a one-touch score difference is also the most common outcome in pools. Pools are intentionally designed to encompass a broad range of fencers, from the most skilled competitors to those still developing their abilities. This diversity in skill levels often leads to a wider spread of scores, including more wins by a large margin.

In Division 1, reserved for fencers rated C and higher, the level of competition is notably higher, resulting in closer bouts. This is reflected in the data on score differences:

The lower percentages for score differences of 4 and 5 indicate the rarity of overwhelming victories. Instead, the majority of bouts are decided by only one or two touches. This is a testament to the high skill level and competitiveness of the fencers in Division 1.

The weapon-specific trends found in DEs hold in pool bouts too, and epee features even closer bouts compared to foil and saber, with almost a third being decided by one touch.

It takes time to figure out your opponent in epee, so the score likely remains tight until the stronger fencer can find their opponent’s weakness, resulting in a smaller overall score difference.

As you would expect, the majority (~95%) of 15-touch bouts reach 15. Specifically, 100% of saber bouts reach 15, 95% of foil bouts reach 15 and 90% of epee bouts reach 15.

Interestingly, among the bouts that run out of time, the winning score follows a trend modeled almost perfectly by a cubic function.
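
As a sketch of what such a fit looks like in code (the counts below are synthetic, generated from a cubic plus noise purely to illustrate the fitting step, not the real tournament data):

```python
import numpy as np

rng = np.random.default_rng(0)
winning_scores = np.arange(5, 15)             # winning score when time expires
counts = 0.2 * (winning_scores - 4) ** 3 + 3  # assumed cubic-ish frequencies
counts = counts + rng.normal(scale=1.0, size=counts.size)

coeffs = np.polyfit(winning_scores, counts, deg=3)  # least-squares cubic fit
fitted = np.polyval(coeffs, winning_scores)
r2 = 1 - np.sum((counts - fitted) ** 2) / np.sum((counts - counts.mean()) ** 2)
# r2 comes out close to 1 here because the data was generated from a cubic.
```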

This is the distribution of the scores in 15-touch bouts, with larger circles representing outcomes that are more frequent.

Here is the same graph but for 5-touch bouts.

**You can check out the full data here**!

Everyone thinks their weapon is the best. This is not hyperbole. If you don’t think your weapon is the best, then why are you doing it? Switch to the weapon that you think is the best. We also all believe that our weapon is the most difficult, the most fun, the most important, etc. for the same reasons.

I used to fence saber, but I switched to foil about 5 years ago and fence epee occasionally. In my experience, my foil friends all think epee is easy; they believe a foil fencer should be able to earn a rating in epee identical to (or at worst one below) their foil rating. I wanted to know if this is really true, so I looked at the numbers.

Basically, I looked at each national tournament (e.g., the March NAC 2019) and generated the list of fencers who participated in both foil and epee. I classified a fencer as a foil fencer if their foil rating was higher than their epee rating, and as an epee fencer if their epee rating was higher than their foil rating. A few fencers had the same rating in both weapons, and since I couldn't tell which was their main weapon, I threw them out. Then, for each fencer, I checked whether they earned a rating in their non-preferred weapon (which would indicate the other weapon is easier, since they can "accidentally" earn a rating in it without really practicing it).

Some fencers do compete seriously in both weapons, so I put a cap on the rating in the non-preferred weapon; a low rating there suggests the fencer probably doesn't take that weapon too seriously. Specifically, I only looked at fencers who were C or under in their secondary weapon. Here are the results of the analysis:
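A minimal sketch of this classification step, assuming ratings come as a letter plus year (e.g. "C2019", with "U" for unrated); the function names here are hypothetical, not taken from the actual analysis scripts.

```python
# Rank ratings from U (lowest) to A (highest).
RATING_ORDER = {"A": 6, "B": 5, "C": 4, "D": 3, "E": 2, "U": 1}

def rating_rank(rating: str) -> int:
    # Ratings look like "B2018"; unrated fencers are just "U".
    return RATING_ORDER[rating[0]]

def preferred_weapon(foil: str, epee: str):
    """Return 'foil' or 'epee' by the higher rating, or None if tied."""
    f, e = rating_rank(foil), rating_rank(epee)
    if f > e:
        return "foil"
    if e > f:
        return "epee"
    return None  # same rating in both weapons: ambiguous, thrown out

def under_cap(secondary: str, cap: str = "C") -> bool:
    """Keep only fencers rated at or below the cap in their secondary weapon."""
    return rating_rank(secondary) <= rating_rank(cap)
```

With `cap="C"`, a fencer rated B2019 in their secondary weapon is excluded, matching the filter described above.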

When foil fencers with an epee rating of C or below competed in epee, they got a rating 35/328 times.

When epee fencers with a foil rating of C or below competed in foil, they got a rating 19/512 times.

However, when manually looking through the results, I realized that many of these fencers earned their rating in Veteran events, which are very different from the Div 1-3, Cadet, Junior, and Youth categories. Many more fencers seriously compete in two weapons in Veteran, and the ratings there are also a bit inflated, because a Vet fencer can age and lose skill yet still keep the same rating for at least 4 years. So I threw out Vet events: they aren't representative of fencers aged roughly 10 to 30 and have all sorts of quirks.

Here are the results when you remove Veteran events:

When foil fencers with an epee rating of C or below competed in epee, they got a rating 16/207 times. These were the ratings earned:

A:1 B:1 C:5 D:2 E:7

When epee fencers with a foil rating of C or below competed in foil, they got a rating 2/312 times. These were the ratings earned:

E:2

You may argue that a fencer with a C or D rating in their non-preferred weapon may be taking it somewhat seriously. However, it's fairly unlikely that a fencer with an E or U is deeply knowledgeable about their secondary weapon.

When foil fencers with an epee rating of E or below competed in epee, they got a rating 11/125 times. These were the ratings earned:

A:1 C:2 D:1 E:7

When epee fencers with a foil rating of E or below competed in foil, they got a rating 2/171 times. These were the ratings earned:

E:2

For transparency, here are the results with no cap on the rating in the non-preferred weapon (although the non-preferred weapon's rating is still below the preferred weapon's):

When foil fencers competed in epee, they got a rating 18/214 times. These were the ratings earned:

A:2 B:2 C:5 D:2 E:7

When epee fencers competed in foil, they got a rating 2/339 times. These were the ratings earned:

E:2

Interestingly, of the 18 ratings earned in epee, 10 were equal to or higher than the fencer's foil rating. None of the epee fencers earned a foil rating equal to or higher than their epee rating.

Overall, there are three conclusions to draw from this. Either:

a. Epee is easier than foil,

b. Epee has a lot of rating inflation, or

c. Foil skills are more transferable to epee than epee skills are to foil.

This is speculation, and one can never draw a perfect conclusion from an observational study, but I think the best answer is probably some mix of the latter two: foil skills are likely more transferable to epee than epee skills are to foil, but epee ratings are also inflated, making it easier to earn a high rating.

I don’t think that it is fair to say that epee is *overall* easier than foil. The difficulty in fencing is beating other people who are also trying their best to beat you, so if there was some “easy” way to get good at epee, everyone could just learn it and nobody would have a specific advantage.

Furthermore, epee fencers do not know right of way, which puts them at a disadvantage when they try foil. On the other hand, foil fencers can move around and do some basic actions (e.g., a disengage or parry) and score some singles and doubles without much specialized knowledge.

Based on the numbers, it is definitely easier to get ratings in epee than in foil. Foil fencers have about an 8% chance of earning a rating in epee, whereas epee fencers have only about a 0.6% chance of earning a rating in foil. About 5% of the time when a foil fencer competes in epee, they earn a rating equal to or higher than their foil rating, which never happened in the other direction.
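These headline figures follow directly from the uncapped counts reported above:

```python
# Rates computed from the uncapped, non-Veteran counts above.
foil_in_epee = 18 / 214     # foil fencers earning any epee rating
epee_in_foil = 2 / 339      # epee fencers earning any foil rating
equal_or_higher = 10 / 214  # epee rating matched or beat the foil rating

print(f"foil -> epee: {foil_in_epee:.1%}")        # ~8.4%
print(f"epee -> foil: {epee_in_foil:.2%}")        # ~0.59%
print(f"equal or higher: {equal_or_higher:.1%}")  # ~4.7%
```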

Let me know what you think in the comments! Here is a spreadsheet of the full data, in case you are interested.

If you are wondering about people who fenced both saber and epee at a NAC, or foil and saber:

A foil fencer competing in saber happened 53 times, with 2 D ratings earned.

A saber fencer competing in foil happened 28 times, with no ratings earned.

An epee fencer competing in saber happened 78 times, with 3 ratings earned: 1 C, 1 D, and 1 E.

A saber fencer competing in epee happened 12 times, with no ratings earned.

This is probably because saber fencers don’t have point control, which makes doing well in other weapons hard because they can’t even hit. Foil and epee fencers won’t really have good saber strategy, but at least they’ll be able to get a light on.

Most of the time when someone gets top 8 at a NAC, they get an A! This is cool, but how often are people upgrading their ratings rather than just renewing them? I analyzed the number of fencers from each rating who made the top 8, 16, 32, and 64, and the results are quite surprising. I specifically chose Cadet and Junior, as those categories allow fencers of all ratings to participate and have enough fencers at each rating for a good sample size.

Note that the results for fencers who made the top 64 also include those in the top 32, 16, etc.

D and below rated fencers almost never make it past the round of 32; their bars are almost invisible in the round of 8 and 16. This means almost all of the fencers in the video round at a national event are at least Cs.

For the sake of simplicity, I assumed that all events give points to the top 64, even though in reality smaller national events may only give points to the top 32. Below is the proportion of fencer ratings for those who get to the round of 64:

Cadet:

| A% | B% | C% | D% | E% | U% |
| --- | --- | --- | --- | --- | --- |
| 29.7 | 33.6 | 22 | 8.6 | 3.5 | 2.7 |

Junior:

| A% | B% | C% | D% | E% | U% |
| --- | --- | --- | --- | --- | --- |
| 58.1 | 27 | 10.1 | 2.8 | 1.2 | 0.8 |

Unsurprisingly, lower rated fencers get points in Cadet more often because it’s easier than Junior. However, Us and Es are still very rare in the top 64 of Cadet: in an average tournament the top 64 would only have 2 Es and 2 Us.
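That back-of-the-envelope count comes straight from the Cadet proportions:

```python
# Expected E- and U-rated fencers in a Cadet top 64, using the
# proportions from the table above (3.5% E, 2.7% U).
top64 = 64
expected_E = 0.035 * top64  # 2.24
expected_U = 0.027 * top64  # 1.73
print(round(expected_E), round(expected_U))  # 2 2
```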

**View the full data here**

For my first post, I want to look at the win probability between ratings. Ratings are a bit strange in fencing: everybody cares about earning a higher rating, yet people also say ratings don't matter. It's also commonly known that there is a great deal of variability and randomness in how ratings are awarded; a Division 1 NAC could award the same A4 rating as a local tournament, even though the skill of the fencers earning those As is very different. In spite of this, ratings are still used for seeding at tournaments, as they are the only standardized method for comparing fencers from all over the United States without requiring everyone to have a national ranking. This post will address a few questions I had about ratings, and hopefully a few of your own.

Here is the table we have all been waiting for:

Each cell shows the probability that the row rating beats the column rating.

| Pools (All) | A | B | C | D | E | U |
| --- | --- | --- | --- | --- | --- | --- |
| A | 50.00% | 71.73% | 80.39% | 87.69% | 93.00% | 94.90% |
| B | 28.27% | 50.00% | 63.70% | 74.43% | 84.08% | 90.25% |
| C | 19.61% | 36.30% | 50.00% | 63.47% | 73.88% | 84.70% |
| D | 12.31% | 25.57% | 36.53% | 50.00% | 65.35% | 79.69% |
| E | 7.00% | 15.92% | 26.12% | 34.65% | 50.00% | 71.83% |
| U | 5.10% | 9.75% | 15.30% | 20.31% | 28.17% | 50.00% |

| DEs (All) | A | B | C | D | E | U |
| --- | --- | --- | --- | --- | --- | --- |
| A | 50.00% | 75.82% | 87.36% | 92.67% | 96.49% | 95.67% |
| B | 24.18% | 50.00% | 69.14% | 81.69% | 89.33% | 92.48% |
| C | 12.64% | 30.86% | 50.00% | 65.45% | 77.56% | 88.35% |
| D | 7.33% | 18.31% | 34.55% | 50.00% | 66.70% | 83.13% |
| E | 3.51% | 10.67% | 22.44% | 33.30% | 50.00% | 72.86% |
| U | 4.33% | 7.52% | 11.65% | 16.87% | 27.14% | 50.00% |

But continue reading, it gets more interesting.

People tend to disagree on which ratings are the most difficult to earn. By comparing a rating’s win probability with the rating one below, we can determine which ratings have the largest skill gap. These were the results of the comparison:

The biggest gaps are between U vs E and B vs A for both pools and DEs, showing that the hardest ratings to earn are your first rating and the prestigious A. Although the B vs C and C vs D gaps are nearly identical in pools, the A vs B and B vs C gaps are far greater in DEs, indicating that earning a high rating requires a lot more skill in 15-touch bouts.
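As a quick sketch, these gaps can be read straight off the pools table above: each entry is a rating's head-to-head win probability against the rating one below it.

```python
# Win probability (%) of each rating vs the rating one below it,
# taken from the pools table above.
pools_vs_one_below = {
    "A vs B": 71.73,
    "B vs C": 63.70,
    "C vs D": 63.47,
    "D vs E": 65.35,
    "E vs U": 71.83,
}

# The farther a pairing sits above 50%, the larger the skill gap.
biggest = max(pools_vs_one_below, key=pools_vs_one_below.get)
print(biggest, pools_vs_one_below[biggest])  # E vs U 71.83
```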

A lot of people say the phrase “you can beat anyone in pools.” Is this really true though?

The data do support this, with the lower rating being 1-7% more likely to cause an upset in pools vs DEs.
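A few pairings pulled from the two tables above illustrate the pool-vs-DE difference:

```python
# Upset probability (%) for the lower rating, in pools vs DEs,
# taken from the two win-probability tables above.
pairs = {
    "E vs A": (7.00, 3.51),
    "C vs B": (36.30, 30.86),
    "U vs E": (28.17, 27.14),
}
for pair, (pool, de) in pairs.items():
    print(f"{pair}: upset {pool - de:.2f} points more likely in pools")
```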

People tend to agree that epee is more random than the other weapons with a high chance of upsets. The graphs for pools and DEs do seem to support this claim:

I was interested in gender differences, which haven’t been discussed much. The data show that upsets are 1-6% less likely in women’s fencing than men’s, which is not a huge difference but is still meaningful.

**View the full data here, and more tables**