Hi there! 👋

First off, amazing job putting together a leaderboard with so many models! 🙌 It’s such a valuable resource for the community to compare performance easily—thank you for making this effort!
I did notice a couple of things that seemed a bit off, and I was hoping to get some clarification:
1️⃣ Some of the results on the leaderboard seem quite different from those published in other sources, like this and the RT-DETR paper.
2️⃣ Additionally, I’m curious about how the validation set is being used in each evaluated model. If it’s influencing training (e.g., for early stopping), it might make the validation set less ideal as a benchmark for the leaderboard.
Would you mind shedding some light on these points? I’m asking to better understand and align expectations. The race to push higher mAP is incredibly competitive among the models, and even the smallest decimal point can make a big difference when comparing models! 💡😊
Keep up the awesome work! 🚀
First, I’d like to point out that we are fully transparent about how we run all metric tests in our repository, so everyone can see exactly how the models are evaluated. Our focus has always been on using the original models and repositories to get results as close as possible to the reported benchmarks.
Here are a few key details about our evaluation process:
- **Data:** We use the COCO 2017 val dataset for evaluation.
- **Metrics:** We use the Supervision library for metric calculations (a minimal sketch of that pipeline follows this list).
- **Models:** We evaluate the published pre-trained weights as-is; we do not re-train or fine-tune anything, staying consistent with the original implementations.
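For illustration, the Supervision-based evaluation looks roughly like the sketch below. This is not the exact leaderboard code: `run_model` and the dataset paths are placeholders you would swap for the model under test and your local COCO copy, and it assumes a recent supervision release where `DetectionDataset` is iterable and the `MeanAveragePrecision.from_detections` helper is available.

```python
import numpy as np
import supervision as sv


# Placeholder: wrap whichever model is being evaluated so that it takes an
# image (np.ndarray) and returns its outputs as an sv.Detections object.
def run_model(image: np.ndarray) -> sv.Detections:
    raise NotImplementedError("plug in the model under evaluation here")


# COCO 2017 val loaded through Supervision's COCO loader (paths are placeholders).
dataset = sv.DetectionDataset.from_coco(
    images_directory_path="coco/val2017",
    annotations_path="coco/annotations/instances_val2017.json",
)

targets, predictions = [], []
# Recent supervision releases iterate the dataset lazily as
# (image_path, image, annotations) tuples.
for _, image, annotations in dataset:
    targets.append(annotations)
    predictions.append(run_model(image))

# Aggregate mAP over the whole split.
map_result = sv.MeanAveragePrecision.from_detections(
    predictions=predictions,
    targets=targets,
)
print(f"mAP 50:95: {map_result.map50_95:.3f}")
print(f"mAP 50:    {map_result.map50:.3f}")
print(f"mAP 75:    {map_result.map75:.3f}")
```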
When validating our results, we cross-check against each repository's published benchmarks; for RT-DETR, for instance, we referenced the official implementation. Small differences in scores can still arise from how the metric itself is computed: RT-DETR most likely reports numbers from the standard "COCO metrics" evaluation, while we rely on Supervision's mAP implementation, and the two can produce slightly different values.
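For comparison, a typical "COCO metrics" evaluation goes through pycocotools' `COCOeval`. A minimal sketch is shown below; the `predictions.json` path is a placeholder for detections exported in the standard COCO results format.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground-truth annotations for COCO 2017 val and a placeholder results file in
# COCO format: [{"image_id": ..., "category_id": ..., "bbox": [x, y, w, h], "score": ...}, ...]
coco_gt = COCO("coco/annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("predictions.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[0.50:0.95], AP50, AP75, AR, ...

map_50_95 = evaluator.stats[0]  # headline COCO mAP figure reported in most papers
```

Even with identical predictions, implementation details such as the recall-interpolation scheme or how area ranges and max detections are handled can shift the reported mAP by a fraction of a point, which is likely where much of the leaderboard-versus-paper gap comes from.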
Let me know if you have any questions or need further clarification!