How We Rank
Anyone can throw together a list and slap affiliate links on it. We don't do that. Every one of the 37 platforms in our index is tested hands-on, scored against a consistent rubric, and ranked on merit — never on who pays the most.
This page explains exactly how we do it — the criteria, the process, the limits, and the disclaimers — so you can decide how much weight to give our recommendations. Transparency is the whole point.
The six factors we score
Each platform receives a score from 0 to 10 in six weighted categories. The weighted average of those scores becomes the overall rating you see on every review card and comparison table.
- Chat realism (25%). How natural, in-character, and memory-aware the conversation feels over long sessions. We run identical scenario scripts on every app (a getting-to-know-you exchange, a memory test across multiple sessions, an emotional-support scenario, a slow-build romantic escalation, and explicit roleplay), then judge writing quality, initiative, pacing, and how long the app holds a thread before it loses context.
- Image quality (20%). Resolution, detail, anatomical accuracy, lighting realism, and, most importantly, character consistency. Any model can produce one flattering image. We judge whether the same face, body, and style stay recognisable across dozens of generations in different poses, outfits, and settings, because that is what makes a companion feel real.
- Features (15%). Voice messages, AI phone calls, video clips, character creation depth, scenario and roleplay tooling, customisation, and overall product breadth. We weight quality of implementation over the sheer number of listed features: a voice feature that sounds robotic scores lower than having no voice feature at all.
- Ease of use (15%). Time from sign-up to first satisfying interaction: onboarding clarity, UI navigation, and how quickly a complete newcomer gets a great experience without reading a manual. We test with a beginner's mindset, not a power user's, because most people coming to these apps are new to the genre.
- Value (15%). Quality per dollar over realistic monthly spend, including the subscription price, the token or credit economy, what the free tier genuinely allows, and how transparent the billing is. We model real usage (average chat, moderate image generation) and calculate the actual monthly cost of that behaviour on every plan the app offers.
- Privacy (10%). Discreet billing descriptor, account privacy controls, data deletion tools, clarity of the privacy policy on model training, and the security of the platform. The lower weight does not mean privacy is unimportant: a bad privacy score can still pull an otherwise excellent app down the table.
Why those weights?
Chat realism leads at 25% because in a companion app, the conversation is the product. An app with stunning images but hollow chat is a photo viewer. Image quality follows at 20% because visuals are the second-most important dimension for most users — and because image consistency (not just one-off quality) is genuinely hard to achieve. Features, ease of use, and value are weighted equally at 15% each because they each represent a meaningfully different axis of quality. Privacy sits at 10% — it has real weight, but a small privacy compromise does not outweigh a dramatically superior experience on all other dimensions for most users.
These weights are not arbitrary. We derived them by asking what a new user would complain about first if a platform failed in each category. Failed chat drives people away fastest; privacy issues are rarer and less immediately felt. The weights reflect that hierarchy.
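To make the arithmetic concrete, here is a minimal sketch of the weighted average using the percentages stated above. The dictionary keys, the function name, and the example platform are illustrative assumptions, and rounding to one decimal place is our guess at how the displayed rating is produced.

```python
# Weights as stated in the text: chat 25%, images 20%,
# features / ease of use / value 15% each, privacy 10%.
WEIGHTS = {
    "chat_realism": 0.25,
    "image_quality": 0.20,
    "features": 0.15,
    "ease_of_use": 0.15,
    "value": 0.15,
    "privacy": 0.10,
}


def overall_score(scores: dict[str, float]) -> float:
    """Weighted average of six 0-10 category scores, rounded to one decimal."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the six categories")
    for name, value in scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{name} must be between 0 and 10")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return round(total, 1)


# Hypothetical platform, invented for illustration only.
example = {
    "chat_realism": 9.0,
    "image_quality": 8.0,
    "features": 7.0,
    "ease_of_use": 8.0,
    "value": 6.0,
    "privacy": 5.0,
}
print(overall_score(example))
```

Because the weights sum to 1, the result stays on the same 0 to 10 scale as the category scores.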
Our testing process, step by step
- Sign-up and first-impression audit. We create a fresh account with no prior data and time how long it takes to reach a genuinely engaging interaction. We flag friction, dark patterns, and confusing onboarding. This session is scored for ease of use and also gives us our first read on chat quality at zero context.
- We pay for premium. We upgrade to the best available paid plan within 24 hours and test the full feature set: HD images, voice messages, AI calls, video where available, and every major feature listed on the pricing page. We do not evaluate a premium product by spending nothing on it.
- Standardised conversation battery. We run the same five scenarios on every platform: a casual getting-to-know-you exchange, a personal detail that we reveal and then check for recall three days later, an emotional-support scenario, a slow-build romantic escalation, and an explicit roleplay. We score chat quality on naturalness, initiative, pacing, and accuracy of long-term memory.
- Image generation battery. We generate the same six prompts on every image-capable platform: a casual selfie, a specific outfit in a recognisable setting, a close portrait, a full-body shot, an outdoor scene, and a repeat of the portrait with a different background. We score on detail, lighting, anatomy, and — most critically — whether the same character shows up consistently across all six.
- Feature depth check. We use every listed feature at least once, checking for rough edges, quality of implementation, and whether marketing matches reality. Features that are listed but broken or effectively unusable are scored accordingly.
- Value calculation. We model three user profiles — light (mostly chat, few images), moderate (daily chat, 10–20 images/week), and heavy (daily chat, daily images, voice features) — and calculate realistic monthly spend at each plan tier. Value score reflects what the moderate user gets for their money.
- Independent scoring and moderation. Each tester scores independently before comparing notes. Where scores diverge significantly, we discuss and either re-test the disputed area or note the disagreement in the review. This prevents the "halo effect" where a strong first impression inflates every other score.
- Ongoing re-testing. The top 15 platforms are re-evaluated every 2–3 months. Features change, prices change, model quality improves or degrades. We update scores and reviews when the change is material enough to affect someone's decision.
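The value calculation in the steps above can be sketched roughly as follows. Everything numeric here is a hypothetical assumption for illustration: the plan prices, the credit costs per image and per voice minute, and the monthly usage figures chosen for each profile (for example, 60 images a month as a midpoint of 10–20 images a week). Real platforms price these very differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_price: float       # subscription fee in dollars
    included_credits: int      # credits bundled with the plan each month
    extra_credit_price: float  # dollars per credit beyond the bundle


@dataclass(frozen=True)
class Profile:
    name: str
    images_per_month: int
    voice_minutes_per_month: int


# Hypothetical credit costs; chat is assumed to be unmetered on paid plans.
CREDITS_PER_IMAGE = 2
CREDITS_PER_VOICE_MINUTE = 1


def monthly_cost(plan: Plan, profile: Profile) -> float:
    """Subscription fee plus any credits bought beyond the plan's bundle."""
    needed = (
        profile.images_per_month * CREDITS_PER_IMAGE
        + profile.voice_minutes_per_month * CREDITS_PER_VOICE_MINUTE
    )
    overage = max(0, needed - plan.included_credits)
    return round(plan.monthly_price + overage * plan.extra_credit_price, 2)


# Invented plans and usage profiles, not taken from any real platform.
BASIC = Plan("Basic", 9.99, 50, 0.10)
PREMIUM = Plan("Premium", 19.99, 300, 0.05)

LIGHT = Profile("light", images_per_month=8, voice_minutes_per_month=0)
MODERATE = Profile("moderate", images_per_month=60, voice_minutes_per_month=0)
HEAVY = Profile("heavy", images_per_month=150, voice_minutes_per_month=60)

for profile in (LIGHT, MODERATE, HEAVY):
    for plan in (BASIC, PREMIUM):
        cost = monthly_cost(plan, profile)
        print(f"{profile.name:<9}{plan.name:<9}${cost:.2f}")
```

With these invented numbers the moderate user pays less on the cheaper plan even after buying extra credits, while the heavy user is better off on the premium plan. That kind of crossover is why the cost is worked out per profile rather than read off the pricing page.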
What we do not do
- We do not accept paid placements. No platform can pay for a higher ranking, a better score, or a more favourable review. Ever.
- We do not test with developer or press accounts. We sign up as regular users, so our scores reflect what an ordinary paying customer gets. The trade-off is that if a platform gives journalists or critics a different experience, we will not see it.
- We do not rate platforms we cannot access. If a platform requires a VPN or is geo-restricted in our test environment, we note it in the review rather than guessing.
- We do not let affiliate rates affect our scores. Some platforms pay us more per referral than others. We track this to ensure it has zero correlation with our rankings. Our #1 pick is not necessarily our highest-paying affiliate, and that is intentional.
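One way such a check could be run is a rank correlation between affiliate payout and overall score. The sketch below is an assumption about method, not a description of our internal tooling: it computes Spearman's rank correlation in plain Python, and the payout and score figures are invented. On a small sample the result will rarely be exactly zero, so in practice the thing to watch for is a value that sits persistently well above zero.

```python
def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(xs), ranks(ys))


# Hypothetical figures: payout per referral (dollars) and overall score.
payouts = [30.0, 10.0, 45.0, 20.0, 25.0]
scores = [8.1, 9.2, 7.4, 8.8, 6.9]
print(f"Spearman correlation: {spearman(payouts, scores):+.2f}")
```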
How scores translate to star ratings
Our scores are on a 0–10 scale. For platforms that display a star rating (out of 5), we convert by dividing by 2 and rounding to one decimal place, so 9.7/10 becomes 4.9/5 (4.85, rounded up). Scores below 6.0 are considered below average, and those platforms are still reviewed to help users avoid them. We do not artificially compress scores to cluster everyone near 8–9: a platform that scored 5.2 gets a 5.2, and a review that explains exactly why.
For context on what scores mean in practice:
- 9.0–10.0 — Outstanding. Genuine best-in-class, recommended without qualification.
- 8.0–8.9 — Very good. Strong at its core, with minor weaknesses. Worth trying for the right user.
- 7.0–7.9 — Good. Solid option for a specific use case, but has notable gaps versus the leaders.
- 6.0–6.9 — Average. Some redeeming features but significant shortcomings. Use with clear expectations.
- Below 6.0 — Below average. Not recommended for most users; reviewed so you can make an informed choice to skip it.
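The conversion and the bands above can be sketched as follows. The function names are illustrative, and rounding halves upward is an assumption inferred from the 9.7 to 4.9 example. The sketch uses `Decimal` because Python's built-in `round` works on binary floats and does not reliably round values such as 4.85 upward.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_stars(score: float) -> Decimal:
    """Convert a 0-10 score to a 0-5 star rating with one decimal place."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    half = Decimal(str(score)) / 2
    return half.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def band(score: float) -> str:
    """Map a 0-10 score to the label used in the list above."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score >= 9.0:
        return "Outstanding"
    if score >= 8.0:
        return "Very good"
    if score >= 7.0:
        return "Good"
    if score >= 6.0:
        return "Average"
    return "Below average"


for score in (9.7, 8.4, 5.2):
    print(f"{score}/10 -> {to_stars(score)}/5 ({band(score)})")
```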
How we make money (and why it does not bias us)
This site is free to read and supported by affiliate commissions: if you sign up through one of our links, we earn a fee at no extra cost to you. We disclose this clearly on every page that contains affiliate links, as required by law and as a matter of principle.
Crucially, we do not sell rankings. No platform can pay to move up, and our editorial scores are locked before any commercial consideration. Our process is designed specifically to prevent our commercial relationships from leaking into our editorial: the person writing the score does not know the affiliate payout for the platform being scored. If our #1 pick paid us the least commission in the entire index, it would still be #1. That is not just a policy — it is built into our workflow.
We do, however, prioritise reviewing platforms where affiliate programmes exist, because that is how we fund the site. We try to disclose when a platform we have reviewed does not have an affiliate relationship with us — in those cases there is genuinely no financial incentive to review it, which is arguably extra evidence of our commitment to completeness.
Category-specific testing notes
AI Girlfriend and Boyfriend Apps
For companion apps, our testing emphasises the relationship arc over any single session. We look for how the app handles continuity — does it remember details two weeks later? Does it initiate conversation or only respond? Does the companion have a consistent personality that doesn't flip based on what we praise? We deliberately test edge cases: mentioning something sad and noting whether the companion acknowledges it next session, making a factual error and seeing whether the companion corrects it or agrees. The best apps do all of this. Most do not.
AI Porn Generators
For image-only generators, chat quality is replaced in our scoring framework with prompt control and style range. We test how precisely a prompt translates into output, how the platform handles complex or compound instructions, and how much variation exists between identical prompts. Style range matters because a generator that does only one aesthetic well is less useful than one that handles several well. We also specifically test consistency by generating the same described character across ten prompts and measuring drift.
NSFW Chatbots
For chatbots, writing quality gets even more weight. We focus on the prose itself: does it read like a person wrote it, or like a language model filling in blanks? Does it have a sense of rhythm and escalation, or does it jump straight to explicit without warming up? We also test the "refusal rate" — how often the bot declines to engage with legal but explicit themes — because a chatbot marketed as uncensored that constantly hedges is misleading its users.
Our limits and caveats
We are a small, independent team, not a lab. We do not have the resources to run statistically rigorous A/B tests or to test every platform in every region and language. Our scores represent our genuine best assessment at a given point in time, but they are human judgements, not algorithmic certainties.
The AI companion space moves extremely fast. A platform we scored in January may have shipped major improvements by April — which is why we re-test the top of the list regularly and display an "Updated" date on every review. If you notice a significant discrepancy between our review and your own experience, tell us — we investigate and update.
Finally: we review these platforms as adults testing products for other adults. We are not moral arbiters of what people choose to use for personal entertainment. Our job is to give you accurate, honest, comparative information so you can make the best decision for yourself.
A note on responsible use
We only review legal, adult-oriented AI services intended for users 18 and older. We do not promote, and actively discourage, any non-consensual content, the generation of imagery depicting minors, or the use of AI tools to harass, defame, or impersonate real people without their consent. Every platform in our index has terms of service that prohibit this, and violations typically result in permanent bans. Use these tools legally, responsibly, and for the entertainment of consenting adults.