This paper consolidates three independent benchmark evaluations of Punge, an on-device NSFW image detector built on a custom-trained YOLO nano model, currently running YOLO26n. Across three distinct evaluation axes — accuracy against commercial cloud APIs, accuracy against open-source classifiers, and demographic fairness using a reproduction of the Leu, Nakashima & Garcia (FAccT 2024) methodology — Punge's 5.1 MB model consistently matches or outperforms significantly larger models. On suggestive content classification, Punge outperforms Google Cloud Vision SafeSearch by 14.6 percentage points. Against Falcons.ai's Vision Transformer classifier, Punge achieves a 3% misclassification rate versus 38%. On the FAccT fairness audit, Punge's gender false positive disparity ratio (1.23×) is lower than those of all three models audited in the original study, which ranged from 1.0× to 6.4×. The architectural choice to detect anatomical shapes rather than classify whole images is both the mechanism behind Punge's accuracy advantage and a structural response to the demographic bias problem identified in Garcia et al.
NSFW image detection is a real problem with real stakes. Parents use it to protect children. Individuals use it to manage their own photo libraries. Platforms use it to moderate content at scale. Yet the dominant approaches — large cloud-based classifiers and general-purpose whole-image models — share a common limitation: they were built for the average case, not for the constraints of a mobile device or the specific demands of detecting explicit content accurately without encoding demographic bias.
Punge was built under a hard constraint: all processing must happen on-device, with no image ever uploaded to a server. This required a model small enough to run on a smartphone CPU in real time. The architecture chosen was YOLO — specifically the nano variant, chosen for its efficiency. What emerged from this constraint was not a compromise. It was, as the benchmarks below demonstrate, a structural advantage.
This paper reports three evaluations: accuracy against commercial cloud APIs, accuracy against open-source classifiers, and demographic fairness under a reproduction of the Leu, Nakashima & Garcia (FAccT 2024) methodology.
The model evaluated in Studies 1 and 2 was based on YOLO11n. Punge has since migrated to YOLO26n. On standard COCO benchmarks, YOLO26n achieves comparable accuracy to YOLO11n (~39.8% vs ~38.5% mAP50–95) while delivering approximately 31% faster CPU inference (38.9 ms vs 56.1 ms) — an improvement attributable in part to YOLO26's native end-to-end, NMS-free inference design. Study 3 was conducted using YOLO26n.
Google's Cloud Vision API is among the most widely used commercial image analysis tools. Its SafeSearch feature returns likelihood ratings across five categories: adult, spoof, medical, violence, and racy. In this study, the adult annotation was used, with an image classified as NSFW if the rating returned POSSIBLE, LIKELY, or VERY_LIKELY. As a proprietary cloud API, its internal architecture and thresholds are not public.
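As a concrete illustration of this decision rule, the sketch below maps the `adult` likelihood to a binary NSFW decision using the `google-cloud-vision` Python client. It is a minimal sketch assuming that client library and configured credentials, not the exact harness used in this study.

```python
from google.cloud import vision

# Likelihood levels treated as NSFW under the rule described above.
NSFW_LEVELS = {
    vision.Likelihood.POSSIBLE,
    vision.Likelihood.LIKELY,
    vision.Likelihood.VERY_LIKELY,
}

def safesearch_is_nsfw(image_path: str) -> bool:
    """Return True if SafeSearch rates the image POSSIBLE or higher on `adult`."""
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.safe_search_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.safe_search_annotation.adult in NSFW_LEVELS
```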
Yahoo's open-source NSFW classifier, a fine-tuned ResNet-50 model widely used as a baseline in NSFW detection. It returns a single confidence score per image and was evaluated at thresholds of 0.50, 0.60, and 0.70.
Falcons.ai's classifier, a Vision Transformer (ViT) model fine-tuned on approximately 80,000 images for binary NSFW classification. It is based on Google's vit-base-patch16-224-in21k pretrained weights and treats the full image as a single global representation.
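For reference, a classifier of this kind can be exercised in a few lines via the Hugging Face `transformers` pipeline. The model identifier and label names below are assumptions for illustration; consult the published model card for the actual values.

```python
from transformers import pipeline

# Assumed model ID and label names -- verify against the published model card.
classifier = pipeline("image-classification", model="Falconsai/nsfw_image_detection")

def vit_is_nsfw(image_path: str, threshold: float = 0.5) -> bool:
    """Flag the image if the ViT's NSFW class score meets the threshold."""
    scores = {r["label"].lower(): r["score"] for r in classifier(image_path)}
    return scores.get("nsfw", 0.0) >= threshold
```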
Leu, Nakashima, and Garcia audited three classifiers in "Auditing Image-based NSFW Classifiers for Content Filtering" (FAccT 2024):
| Model | Architecture | Size |
|---|---|---|
| NSFW-CNN | InceptionV3 | 85.3 MB |
| CLIP-Classifier | CLIP + FC layer | 888.3 MB |
| CLIP-Distance | CLIP + cosine distance | 887.5 MB |
All three are whole-image classifiers trained on internet-scraped datasets.
A custom-trained YOLO nano model. Rather than classifying the full image, it detects and draws bounding boxes around specific anatomical regions. An image is flagged as NSFW if at least one detection crosses the confidence threshold. Model weight size: 5.1 MB. All inference runs locally on the user's device via CoreML (iOS) or LiteRT (Android).
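The flagging rule is simple enough to sketch. Assuming the Ultralytics Python API and a hypothetical `punge.pt` weights file (the shipped app runs the exported CoreML/LiteRT model instead), the logic looks roughly like this:

```python
from ultralytics import YOLO

# Hypothetical weights path; the production app uses CoreML / LiteRT exports.
model = YOLO("punge.pt")

def punge_is_nsfw(image_path: str, conf_threshold: float = 0.60) -> bool:
    """Flag the image if at least one anatomical detection crosses the threshold."""
    results = model.predict(image_path, conf=conf_threshold, verbose=False)
    # results[0].boxes contains only detections at or above conf_threshold.
    return len(results[0].boxes) > 0
```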
The current production model runs on YOLO26n, which introduces native NMS-free end-to-end inference. In prior YOLO versions, Non-Maximum Suppression (NMS) — the post-processing step that filters overlapping bounding boxes — was applied after the model returned its raw predictions, and its implementation details could vary across deployment environments. YOLO26 eliminates this step by building suppression into the model's prediction head directly. For a mobile deployment, this means lower and more predictable latency, and one fewer variable in the inference pipeline.
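For context on how a model like this reaches the device formats mentioned above (CoreML on iOS, LiteRT on Android), Ultralytics models can be exported directly to those targets. The snippet below is a generic sketch of that export step with a hypothetical weights file; it is not a description of Punge's actual build pipeline.

```python
from ultralytics import YOLO

# Hypothetical weights path; shown only to illustrate the export step.
model = YOLO("punge.pt")

# Core ML package for iOS deployment.
model.export(format="coreml")

# TFLite flatbuffer, consumable by LiteRT on Android.
model.export(format="tflite")
```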
Both models were evaluated on an identical 100-image test set at a 0.50 confidence threshold. The dataset is small by academic standards; the comparison is intended as directional rather than definitive, and the results should be interpreted accordingly.
| Model | Misclassification Rate |
|---|---|
| Falcons.ai (ViT) | 38% |
| Punge (YOLO11n) | 3% |
The 35-percentage-point gap reflects a fundamental architectural difference. Falcons.ai's ViT processes the image globally, building a holistic representation that encodes scene context, background, and visual composition. This can cause both false positives — where contextual cues mislead the classifier — and false negatives — where explicit regions are diluted by surrounding non-explicit content.
Punge detects localized regions. If an explicit anatomical shape is present anywhere in the image, the bounding box fires. If it is not, no detection occurs, regardless of the surrounding scene. This localized approach is both more sensitive to actual explicit content and more resistant to contextual misfires.
Three datasets were used: a set of explicit imagery, a set of suggestive imagery (swimwear and similar borderline content), and a set of everyday non-explicit imagery drawn from COCO 2014.
Punge was evaluated at three confidence thresholds (0.50, 0.60, 0.70). Google Cloud Vision uses its internal threshold (not publicly specified). Yahoo was also evaluated at 0.50, 0.60, and 0.70.
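The threshold sweep can be expressed as a small evaluation routine over per-image confidence scores. The sketch below illustrates the protocol rather than the exact harness used: it assumes each image has been reduced to a single score (the top box confidence for Punge, the classifier score for Yahoo) and a ground-truth label.

```python
def rates_at_thresholds(scores, labels, thresholds=(0.50, 0.60, 0.70)):
    """Compute true/false positive rates at each threshold.

    scores: per-image confidence (top detection confidence for Punge,
            classifier score for Yahoo; 0.0 if nothing was detected).
    labels: True if the image is actually explicit, False otherwise.
    """
    results = {}
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        positives = sum(labels)
        negatives = len(labels) - positives
        results[t] = {
            "TPR": tp / positives if positives else None,
            "FPR": fp / negatives if negatives else None,
        }
    return results
```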
Explicit Imagery — True Positive Rate
| Model | True Positive Rate |
|---|---|
| Punge — 0.50 threshold | 100% |
| Google Cloud Vision — internal threshold | 99% |
| Yahoo — 0.50 threshold | 66% |
Suggestive Imagery — False Positive Rate (lower is better)
| Model | False Positive Rate |
|---|---|
| Punge — 0.70 threshold | 1.6% |
| Yahoo — 0.70 threshold | 2.1% |
| Google Cloud Vision — internal threshold | 16.2% |
Google Cloud Vision flagged nearly 1 in 6 swimsuit photos as explicit content. Punge flagged fewer than 1 in 60.
Everyday Imagery (COCO 2014)
All models achieved near-perfect accuracy on non-explicit everyday imagery. Punge at 0.60+ and Yahoo achieved 100% accuracy. Google Cloud Vision also performed well on this dataset.
The most practically significant finding is the suggestive content result. For real-world NSFW detection, the hard problem is not identifying explicit pornography — most classifiers handle that reasonably well. The hard problem is the boundary between explicit and suggestive: swimwear, lingerie, artistic nudity, medical imagery. A model that flags swimsuit photos as NSFW is not useful.
Punge's 14.6-percentage-point advantage over Google Cloud Vision on suggestive content reflects its localized detection approach. Swimsuit photos do not contain explicit anatomical regions in Punge's training taxonomy, so they do not trigger detections. Google's classifier, operating on global image representations, appears to over-index on visual features associated with bodies, skin, and context that correlate with explicit content in its training data.
A practical advantage of the YOLO threshold approach is also worth noting: developers can tune the confidence threshold to calibrate the sensitivity/precision tradeoff for their specific application. Google's API provides a fixed operating point with no developer control.
In their FAccT 2024 paper, Leu, Nakashima, and Garcia audited three NSFW classifiers against the MSCOCO and Google Conceptual Captions datasets, using demographic annotations to measure whether false positive rates varied across gender, skin tone, and age. Their findings were stark: images of women were falsely flagged at rates up to 6.4× higher than images of men, false positive rates differed by 2.0–2.2× across skin tone groups, and explainability analysis showed all three classifiers responding to female faces as a signal for explicit content.
This is not a minor calibration issue. It is a systematic encoding of the demographic appearance of women as a proxy for explicit content. Garcia et al. called for broader scrutiny of NSFW detection methodology and explicitly aimed to "stimulate further exploration in this domain."
We reproduced the Garcia et al. methodology to evaluate Punge against the same standard: non-explicit images with demographic annotations, with false positive rates computed per gender and skin tone group at multiple confidence thresholds.
Because Punge uses object detection rather than image classification, a false positive is defined as any image where at least one bounding box detection crosses the confidence threshold. This is a meaningful distinction from the Garcia models: Punge never classifies a person or their demographic attributes — it detects shapes.
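Under that definition, the disparity ratios reported below reduce to group-wise false positive rates on non-explicit images and a ratio between groups. A minimal sketch, with illustrative group labels standing in for the dataset's actual annotations:

```python
from collections import defaultdict

def fpr_by_group(records, conf_threshold=0.60):
    """records: (group_label, top_detection_confidence) pairs for non-explicit
    images; any detection at or above the threshold counts as a false positive."""
    fp, total = defaultdict(int), defaultdict(int)
    for group, score in records:
        total[group] += 1
        if score >= conf_threshold:
            fp[group] += 1
    return {g: fp[g] / total[g] for g in total}

def disparity_ratio(fprs, group_a, group_b):
    """e.g. disparity_ratio(fprs, "female", "male") == 1.23 means images
    annotated "female" are falsely flagged at 1.23x the rate of "male"."""
    return fprs[group_a] / fprs[group_b]
```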
At the 0.60 threshold, Punge's gender disparity ratio (1.23×) is lower than all three Garcia models. Its skin tone ratio (0.89×) shows near-perfect parity, slightly favoring darker skin tones, compared to 2.0–2.2× ratios in the audited models.
This result is not accidental. There are structural reasons why an anatomy-detecting YOLO model produces less demographic bias than whole-image NSFW classifiers.
YOLO is trained to detect explicit anatomical regions, not to model person identity. The Garcia paper's explainability finding — that all three classifiers were using female faces as NSFW signals — is only possible for models that process the entire image and build up a semantic representation that includes person identity and perceived gender. Punge's model was trained on anatomical shapes, not on people. A female face does not appear in Punge's training taxonomy, and there is no pathway from "this person appears to be female" to "detection fires."
Large models absorb demographic correlations from training data. The CLIP-based models audited by Garcia were trained on massive internet-scraped datasets, including LAION-5B. The internet contains systematic associations between certain demographics and explicit content that have nothing to do with the actual prevalence of explicit content in those groups. Models with broad training objectives and enormous capacity absorb these correlations. A compact, task-specific detector trained on anatomical shape examples has no mechanism to absorb such correlations.
The absence of a "person" concept. Whole-image classifiers have rich internal representations of people, faces, gender expression, and social context. NSFW classification layered on top of these person-aware representations makes demographic contamination nearly inevitable. YOLO trained for anatomical detection has no such person representation. It operates at the level of shapes and spatial patterns, not semantic identity.
Garcia et al.'s call to "stimulate further exploration" in NSFW detection methodology is consistent with the observation that whole-image semantic classifiers may be structurally ill-suited to this task from a fairness standpoint. A detector that never sees the person — only their anatomy, if explicit content is present — is a structural answer to the problem they identified.
| Model | Size | Architecture |
|---|---|---|
| CLIP-Classifier | 888.3 MB | CLIP + FC classifier |
| CLIP-Distance | 887.5 MB | CLIP + cosine distance |
| NSFW-CNN | 85.3 MB | InceptionV3 CNN |
| Falcons.ai | ~350 MB | Vision Transformer |
| Punge (YOLO26n) | 5.1 MB | YOLO object detector |
Punge is roughly 17× smaller than NSFW-CNN and 175× smaller than the CLIP-based models. It outperforms all of them on both accuracy metrics and demographic fairness.
This size difference is not just an engineering curiosity. Punge runs entirely on the user's device. Content is never uploaded to a cloud server for analysis. The model makes detection decisions locally, without a network call, without any human review layer, and without any data ever leaving the phone.
This makes the fairness properties especially consequential. In a cloud-based moderation system, systematic demographic bias can be audited, corrected, and overridden by human review. In an on-device system processing a user's personal photo library, there is no such correction mechanism. A biased detection at that layer is invisible and uncorrectable at scale. The low disparity ratios observed in this evaluation are therefore not just academically interesting — they are a direct property of the deployed product, affecting real users in real time.
Several limitations of this work should be acknowledged.
The benchmark datasets used in Studies 1 and 2 were curated by the author and are not publicly released. The explicit dataset (n=100) is small by academic standards, though it is representative of the detection task.
The FAccT methodology reproduction used 10,783 images compared to Garcia et al.'s 11,628, due to filtering applied to ensure demographic homogeneity within each evaluated image. This difference is unlikely to materially affect the directional findings but should be noted.
Punge's training dataset (~21,000 images at the time of YOLO26n training) is task-specific and domain-focused. Performance may vary on distributions of explicit content that differ significantly from the training data.
As an anatomy-based detector, Punge will not flag explicit content where no trained anatomical regions are visible. This includes clothed sexual acts, obscured or implied nudity, and contextually explicit scenes that do not expose specific body parts. This is a fundamental property of the detection approach rather than a calibration issue, and developers requiring detection of non-anatomical explicit content should be aware of this scope.
Finally, this evaluation was conducted by the developer of Punge. Independent replication would strengthen the findings. The methodology is documented here with sufficient detail to support reproduction, and the author welcomes external validation.
Three independent evaluations — against a commercial cloud API, an open-source ViT classifier, and a rigorous academic fairness audit — consistently find that Punge's 5.1 MB on-device YOLO model matches or outperforms significantly larger models. The central hypothesis supported by these results is that the architectural choice to detect anatomy rather than classify images is both the cause of Punge's accuracy advantage and a direct response to the demographic bias problem identified by Garcia et al.
The implication for the field is worth stating plainly: bigger is not always better, and in NSFW detection specifically, whole-image semantic classification may be the wrong architectural paradigm. A model that never sees the person — only the anatomical content, if present — is faster, smaller, more accurate on the hard cases, and less likely to encode demographic appearance as a proxy for explicit content.
A natural direction for future work is evaluating model performance on explicit content that does not involve anatomical nudity — clothed scenes, implied content, and contextual explicitness. Whole-image classifiers may perform better in this category than anatomy-based detectors, though to the author's knowledge no systematic benchmark exists for this problem. It represents an open research question for the field.
| Criterion | Garcia Models | Punge |
|---|---|---|
| Architecture | Whole-image classifiers | Object detection |
| Model size | 85–888 MB | 5.1 MB |
| Training data | Internet-scraped | Task-specific |
| Gender FPR ratio | 1.0×–6.4× | 1.23× |
| Skin tone FPR ratio | 2.0×–2.2× | 0.89× |
| Deployment | Server-side | On-device, no upload |
| Suggestive FPR (vs. Google Cloud Vision) | Not evaluated | 14.6 pp lower |
Leu, W., Nakashima, Y., & Garcia, N. (2024). Auditing Image-based NSFW Classifiers for Content Filtering. ACM FAccT 2024.
Zhao, J., et al. (2021). Understanding and Evaluating Racial Biases in Image Captioning. Princeton Annotation Dataset.
Sapkota, R., et al. (2025). Ultralytics YOLO Evolution: An Overview of YOLO26, YOLO11, YOLOv8 and YOLOv5. arXiv:2510.09653.
Ultralytics. (2026). YOLO26 Documentation.