A telehealth founder will ask me, on average once a quarter, what the architecture for a video-conferencing telehealth app should look like. They've usually got a deck with three boxes labeled "WebRTC," "Media Server," and "AWS." They want to know which media server, which TURN service, how to scale SFU clusters, whether to use Janus or mediasoup or Jitsi or Pion. They're gearing up to spend nine months and a million dollars solving a problem that is not their problem.
The architectural advice I give them takes about three minutes and saves them, conservatively, twelve months and several hundred thousand dollars. So let me write it down once.
Do not build the video layer. Use Twilio Video or LiveKit or Daily or Vonage. They cost real money per participant-minute — but the per-minute cost at telehealth scale (a few thousand sessions a day) is a tiny fraction of what running your own media stack would cost in engineering time, SRE on-call burden, and the periodic two-day production incident when your TURN cluster melts at 4pm. The build-vs-buy math here is brutal. Daily and LiveKit both have generous free tiers and per-minute billing thereafter. Twilio is older, more expensive, more enterprise-friendly. Vonage is what's left of TokBox and is fine. Pick one in an hour and move on.
If you have a regulatory or contractual reason you cannot use a third-party media stack — and "we want to control the media" is not such a reason — then you have a different conversation, one that involves at least one senior media-infra engineer on staff and a multi-quarter timeline. Most telehealth companies do not have that constraint. Most that think they do are wrong.
So the architecture question becomes: with the video layer commoditized, what's actually hard? Quite a lot.
Patient identity and authentication. Telehealth is healthcare, healthcare is HIPAA, and HIPAA is a discipline rather than a checkbox. The patient on the video call has to be the patient whose chart you're showing. This means strong authentication on both sides (provider and patient), session binding to a specific appointment, and a clear audit trail. Build this on a known identity stack — Auth0, Cognito, your existing patient portal — rather than rolling your own. Bind every video token to a server-issued appointment ID; don't trust the client to assert who they are.
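To make the "bind every video token to a server-issued appointment ID" idea concrete, here is a minimal sketch using only the Python standard library. It signs a JWT-shaped HS256 token whose room name is derived from the appointment record the server looked up, never from anything the client sent. The function names, the claim names (`room`, `role`), and the shape of the `appointment` dict are all assumptions for illustration; each video provider defines its own token format and ships its own helper for minting one, so in practice you would call that helper at the point where this sketch signs the token.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    # URL-safe base64 without padding, as used in JWTs.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _sign(secret: bytes, signing_input: str) -> str:
    digest = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return _b64url(digest)


def mint_visit_token(secret: bytes, user_id: str, role: str, appointment: dict,
                     ttl_seconds: int = 300, now: Optional[float] = None) -> str:
    """Return a signed, short-lived token scoped to one appointment (hypothetical format)."""
    # The server decides who may join, based on the appointment record it loaded.
    allowed = {appointment["patient_id"]: "patient",
               appointment["provider_id"]: "provider"}
    if allowed.get(user_id) != role:
        raise PermissionError("user is not a participant in this appointment")
    issued = int(now if now is not None else time.time())
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": user_id, "role": role,
              "room": f"appt-{appointment['id']}",  # room name comes from the server
              "iat": issued, "exp": issued + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims))
    return f"{signing_input}.{_sign(secret, signing_input)}"


def verify_visit_token(secret: bytes, token: str, now: Optional[float] = None) -> dict:
    """Check signature and expiry; return the claims if both are valid."""
    signing_input, _, signature = token.rpartition(".")
    if not hmac.compare_digest(signature, _sign(secret, signing_input)):
        raise ValueError("bad signature")
    payload = signing_input.split(".")[1]
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    current = now if now is not None else time.time()
    if current >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

The point of the design is that the client only ever says "start my visit"; the user identity comes from the authenticated session, and the room comes from the appointment row.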
Scheduling and waiting rooms. The non-obvious operational pain in telehealth is that providers run late, patients show up early, and both want a way to indicate readiness. You need a real waiting-room concept: the patient can be in the queue, the provider can see who's waiting, the system can hold the patient with a polite UI while the provider finishes the prior visit. This is product work, not infrastructure work, but you'd be surprised how many teams get to "video connects" and consider themselves done. Connecting is the easy part. The choreography around connecting is the part that determines whether providers actually use your system.
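One way to model that choreography is as a small state machine per appointment. The sketch below is an assumption about how it could look, not a prescribed design: the state names, event names, and the `WaitingRoom` class are all hypothetical, and a real system would persist this in the database and push changes to both clients.

```python
from enum import Enum


class VisitState(Enum):
    SCHEDULED = "scheduled"
    WAITING = "waiting"      # patient is being held in the waiting room
    IN_VISIT = "in_visit"
    COMPLETED = "completed"


# Allowed (state, event) pairs; anything not listed is rejected.
_TRANSITIONS = {
    (VisitState.SCHEDULED, "patient_check_in"): VisitState.WAITING,
    (VisitState.WAITING, "patient_leave"): VisitState.SCHEDULED,
    (VisitState.WAITING, "provider_admit"): VisitState.IN_VISIT,
    (VisitState.IN_VISIT, "end_visit"): VisitState.COMPLETED,
}


class WaitingRoom:
    def __init__(self):
        self._state = {}          # appointment_id -> VisitState
        self._checked_in_at = {}  # appointment_id -> check-in timestamp

    def schedule(self, appointment_id):
        self._state[appointment_id] = VisitState.SCHEDULED

    def apply(self, appointment_id, event, at):
        """Apply an event at time `at`; raise ValueError if it is not allowed."""
        current = self._state[appointment_id]
        nxt = _TRANSITIONS.get((current, event))
        if nxt is None:
            raise ValueError(f"{event!r} not allowed from {current.value}")
        self._state[appointment_id] = nxt
        if nxt is VisitState.WAITING:
            self._checked_in_at[appointment_id] = at
        else:
            self._checked_in_at.pop(appointment_id, None)
        return nxt

    def queue(self):
        """Appointments currently waiting, longest wait first (the provider's view)."""
        return sorted(self._checked_in_at, key=self._checked_in_at.get)
```

Keeping the transitions in one table makes it easy to see, and to test, that a patient can only reach the video room by being admitted from the waiting state.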
Recording and consent. Recording video visits is regulated, and the rules vary by state. So-called two-party (all-party) consent states require every participant to actively opt in. Even where recording is allowed without that, the patient should be told clearly, on screen, that the session is being recorded, and the provider's notes should mark the consent moment. If you're recording, the recordings are now HIPAA-protected PHI and need to live in storage with the right access controls, retention policy, and audit log. Most telehealth companies should not record by default. Recording is a feature you turn on per-customer when there's a legitimate clinical or training reason and the legal framework supports it.
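A sketch of what "off by default, on only with consent" could look like as a server-side gate. This assumes the simplest defensible policy, which is to require an affirmative opt-in from every participant everywhere rather than encoding per-state rules; that policy choice, the function name, and the parameter names are assumptions for illustration.

```python
from typing import Mapping, Sequence


def may_record(recording_enabled_for_customer: bool,
               participant_ids: Sequence[str],
               opt_ins: Mapping[str, float]) -> bool:
    """Return True only if recording is enabled and every participant opted in.

    `opt_ins` maps participant id -> timestamp of their affirmative on-screen opt-in.
    """
    if not recording_enabled_for_customer:
        return False  # recording is off unless the customer turned it on
    if not participant_ids:
        return False  # nothing to record, and no one to have consented
    return all(pid in opt_ins for pid in participant_ids)
```

Storing the opt-in timestamps, rather than a boolean, also gives the provider's notes and the audit trail a concrete consent moment to point at.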
The provider workflow. This is the part that founders skip and providers care about most. A provider during a video visit needs: the patient's chart on the screen next to the video, the ability to take structured notes, the ability to e-prescribe, the ability to order labs, the ability to bill at the end. Building a great video call is necessary; building a great clinical surface around the video call is the actual product. If the provider has to context-switch between your video app and their EHR for every visit, you've lost. The integration with the EHR — Epic, Cerner, Athena, whatever the customer uses — is the single most valuable piece of work you can do, and it's where the engineering team should be spending time, not on STUN/TURN tuning.
Network reality. Patients are not on your data center's network. They're on rural broadband, on hotel wifi, on a phone with two bars in their car. The video stack has to degrade gracefully: lower the resolution and bitrate, drop video and keep audio when that isn't enough, reconnect quietly after a flap. This is exactly what the third-party providers spent a decade getting right. Use them. The "we tested it in the office, it worked great" failure mode is what makes telehealth feel terrible for real patients. Spend your test budget on real-world conditions, not lab conditions.
Storage and PHI residency. Anything you store about a patient is PHI: the appointment record, the clinical notes, the file uploads, the chat messages within the session. The live video stream is PHI too while it's in flight; it just isn't stored anywhere unless you record it, so for the stream the obligations are encryption in transit and a BAA with the video provider rather than a retention policy. Pick a HIPAA-eligible cloud (AWS, GCP, or Azure, under a signed BAA), turn on encryption at rest by default, encrypt in transit, audit-log every access. Have your engineering leads read the BAAs, not just legal, so they know what the company has actually committed to. Do an actual access review every quarter; don't just say you do.
Architecture sketch. What this actually looks like in boxes:
- Frontend (web + iOS + Android). Reuses your patient portal session token to authenticate. Calls your backend to start a visit.
- Backend (a normal stateless service in whatever language you like). Authenticates the user, looks up the appointment, mints a short-lived token for the video provider (Twilio/LiveKit), returns it to the client.
- Video provider (third-party). Handles the actual media. Provides client SDKs for web/iOS/Android. Bills you per participant-minute.
- Database (Postgres, probably). Stores appointments, patients, providers, notes. PHI lives here. Encrypted at rest. Backups encrypted. Restoring backups is a tested workflow, not an aspirational one.
- EHR integration layer. Either FHIR (modern) or HL7 v2 (everything that exists in production). Probably both, depending on which customers you have. This is the unsexy work that wins enterprise deals.
- Audit log. Append-only. Every clinical access to PHI gets logged with user, timestamp, resource, and reason. You will be audited eventually. Have the logs.
- Recordings (only if needed). S3 or equivalent. Encrypted with your own KMS keys. Retention policy enforced. Access controlled separately from regular PHI.
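As an illustration of the audit log box above, here is a minimal in-memory sketch that records user, timestamp, resource, and reason, and chains each entry to the previous one with a SHA-256 hash so that edits or deletions show up on verification. The hash chain is an added technique for tamper evidence, not something the sketch above requires; the class and field names are hypothetical, and a production log would also rely on storage-level controls (for example, a table the application role can insert into but not update or delete).

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes the same way.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AuditLog:
    def __init__(self):
        self._entries = []

    def append(self, user: str, timestamp: float, resource: str, reason: str) -> str:
        """Append one access record and return its hash."""
        record = {"user": user, "timestamp": timestamp,
                  "resource": resource, "reason": reason}
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        entry_hash = _entry_hash(prev, record)
        self._entries.append({"record": record, "prev": prev, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; False means an entry was altered, removed, or reordered."""
        prev = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev or entry["hash"] != _entry_hash(prev, entry["record"]):
                return False
            prev = entry["hash"]
        return True

    def __len__(self) -> int:
        return len(self._entries)
```

Usage is one `append` call per PHI access, made by the backend at the same place it serves the data, so that there is no code path that reads a chart without writing a log entry.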
That's basically it. There's no magic. The video is rented, the database is boring Postgres, the audit log is append-only, the integration layer is most of your code.
What founders get wrong. They spend the first six months on the video. They should be spending the first six months on the EHR integration, the consent flows, the audit logging, and the provider workflow. By the time the integration is good, the customer doesn't care that you used Twilio under the hood. They care that the visit feels native to their workflow and the audit log is bulletproof. That is the moat.
One more thing. Telehealth is a regulated business that intersects with practicing medicine across multiple state jurisdictions, with insurance billing rules that change constantly, with parity laws that vary by state and by year, with a Medicare reimbursement schedule that's the second-most-complex document the federal government produces. Your engineering choices are a tiny fraction of the work. The big work is in compliance, billing, clinical workflow, and relationships with health systems. If your founding team is three engineers, you need a fourth person who has actually worked in healthcare administration, and you need them before you write any code. Don't build the architecture without that person in the room.
Rent the video. Build the workflow. That's the architecture.