TL;DR: Scaling video KYC to 1 million+ monthly verifications requires a purpose-built infrastructure stack: distributed WebRTC session routing, auto-scaling agent queues, a compliant recording pipeline, and fault-tolerant session recovery. Most teams underestimate concurrency requirements by 3–5x and treat compliance storage as an afterthought; both mistakes create catastrophic bottlenecks at scale.
Why scaling video KYC is now a business-critical infrastructure problem
The push to scale video KYC to 1 million verifications and beyond is no longer a future ambition; it is the present reality for large NBFCs, digital banks, insurance aggregators, and broker platforms operating in India. As of 2024, RBI-regulated entities are required to complete customer due diligence through the Video-based Customer Identification Process (V-CIP), the formal designation for video KYC under the Reserve Bank of India's Master Direction on KYC (updated in 2021 and subsequently amended). Simultaneously, SEBI and IRDAI have published parallel guidelines aligned with the same verification model.
Beyond India, MAS (Singapore), FCA (UK), BaFin (Germany), and FinCEN (US) have each adopted digital identity verification frameworks that include video as an acceptable and often preferred verification channel. The result is a global infrastructure problem: real-time identity verification systems must be architected to handle volume, latency, compliance, and failure simultaneously.
The failure mode for teams that build this ad hoc is well-documented: video drops during peak hours, recording gaps that fail audits, agent queues that back up and produce 20–40 minute wait times, and compliance breaches that trigger regulatory notices. This guide is for engineering leaders and architects who need to build this right.
Market and compliance context
What does 1 million monthly verifications actually look like?
Breaking down the math is the first step to responsible infrastructure design.
| Metric | Calculation | Example Value |
|---|---|---|
| Monthly verifications | Target | 1,000,000 |
| Daily average | ÷ 22 working days | ~45,500/day |
| Peak day multiplier | 1.8–2.2x average | ~91,000–100,000/day |
| Peak hour (10 AM–12 PM) | 25–30% of daily | ~22,500–30,000/hour |
| Peak concurrent sessions | Peak hour × 8-min avg session ÷ 60 | 3,000–4,000 concurrent |
| Agents required at peak | Peak hour ÷ 8 sessions/agent/hour | ~2,800–3,750 active agents |
Most teams design for average load. The infrastructure must be designed for peak concurrency of 2–3x the average, with headroom for viral onboarding campaigns and the IPO application surges common in the Indian fintech context.
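The table above can be reproduced with a short capacity-planning calculation. The function below is an illustrative sketch: the defaults (22 working days, a 2.0x peak-day multiplier, a 28% peak-hour share, 8-minute sessions, 800 sessions per SFU node) are the example assumptions from the text, not fixed constants.

```python
import math

def plan_capacity(monthly_sessions: int,
                  working_days: int = 22,
                  peak_day_multiplier: float = 2.0,
                  peak_hour_share: float = 0.28,
                  session_minutes: float = 8.0,
                  sessions_per_sfu_node: int = 800) -> dict:
    daily_avg = monthly_sessions / working_days
    peak_day = daily_avg * peak_day_multiplier
    peak_hour = peak_day * peak_hour_share
    # Little's law: concurrency = arrival rate x average session duration
    concurrent = peak_hour * session_minutes / 60
    return {
        "daily_avg": round(daily_avg),
        "peak_hour_sessions": round(peak_hour),
        "peak_concurrent": round(concurrent),
        "sfu_nodes": math.ceil(concurrent / sessions_per_sfu_node),
    }
```

Running `plan_capacity(1_000_000)` lands inside the ranges in the table: roughly 45,500 sessions per day on average and ~3,400 concurrent sessions at peak.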
RBI V-CIP compliance requirements
V-CIP (Video-based Customer Identification Process): V-CIP is the RBI-mandated framework for conducting KYC using live video interaction between a regulated entity's officer and a customer, as specified in the Master Direction – Know Your Customer (KYC) Direction, 2016 (last updated 2023). The process is not simply a video call; it is a structured verification event with defined data capture, storage, and audit requirements.
Key requirements under V-CIP:
- Live, uninterrupted video with no pre-recorded segments
- Geo-tagging of the customer's location at session initiation
- PAN card verification via OCR with face-match liveness check
- Aadhaar-based OTP authentication (with UIDAI integration)
- Session recording stored in India-based servers
- Audit trail with timestamps, agent ID, and session metadata
- Agent must be a trained official of the regulated entity (not a third party)
Failure to comply exposes regulated entities to penalties under the Prevention of Money Laundering Act (PMLA), potential license suspension, and reputational risk. The consequences extend beyond fines: a compliance gap in the audit trail can invalidate thousands of onboarding records retroactively.
Note: Compliance requirements evolve. Always validate your V-CIP implementation against the most current RBI Master Direction and consult a qualified compliance officer before production deployment.
How to scale video KYC to 1 million verifications
End-to-end system architecture
A production-grade video KYC infrastructure is not a monolithic system. It is a pipeline of coordinated services, each responsible for a specific stage of the session lifecycle.
The seven layers of the architecture:
Layer 1 — Session initiation. The customer opens the mobile app or web client and requests a video KYC session. The client sends an API call to the session orchestration service, which creates a session ID, assigns a queue position, and returns session metadata including estimated wait time. At this point, geo-location is captured and stored.
Layer 2 — Queue and agent assignment. The orchestration layer maintains a priority queue of pending sessions. When an agent becomes available, the session manager assigns the session, notifies both the customer and agent, and instructs the WebRTC layer to initiate connection establishment.
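The queue-and-assignment loop in Layer 2 can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions: the class and method names (`SessionQueue`, `agent_ready`, and so on) are hypothetical, a lower priority value means earlier service, and a production system would back this with a durable store and publish notifications rather than mutate a dict.

```python
import heapq
import itertools

class SessionQueue:
    """Illustrative priority queue pairing pending sessions with idle agents."""

    def __init__(self):
        self._heap = []                  # (priority, seq, session_id)
        self._seq = itertools.count()    # tie-breaker preserves FIFO order
        self.available_agents = []       # idle agent IDs
        self.assignments = {}            # session_id -> agent_id

    def enqueue(self, session_id: str, priority: int = 0) -> int:
        heapq.heappush(self._heap, (priority, next(self._seq), session_id))
        return len(self._heap)           # approximate queue position

    def agent_ready(self, agent_id: str) -> None:
        self.available_agents.append(agent_id)
        self._drain()

    def _drain(self) -> None:
        # Pair waiting sessions with idle agents, lowest priority value first.
        # In production: notify customer + agent, then trigger WebRTC setup.
        while self._heap and self.available_agents:
            _, _, session_id = heapq.heappop(self._heap)
            self.assignments[session_id] = self.available_agents.pop(0)
```

A priority field matters even in a FIFO-looking system: reconnecting customers (see the failure-handling section) should re-enter the queue ahead of new arrivals.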
Layer 3 — WebRTC connection. The client and agent establish a real-time video connection through the WebRTC infrastructure layer. This is where most infrastructure complexity lives, and where the majority of scaling failures originate.
Layer 4 — KYC verification flow. During the session, the agent follows a structured checklist: document capture (PAN, Aadhaar), face-match liveness, verbal confirmation, and geo-tag verification. The KYC backend processes OCR and biometric checks in real time, returning pass/fail signals to the agent dashboard.
Layer 5 — Session recording. Every session is recorded simultaneously with the live stream. The recording pipeline captures encrypted video to a compliant India-hosted storage layer (S3-compatible, with WORM Write Once Read Many policy enabled).
Layer 6 — Post-session processing. After the session closes, the system stitches metadata (timestamps, agent ID, session ID, geo-coordinates, document scan results) with the recording and stores the complete audit bundle in the compliance archive.
Layer 7 — Audit and reporting. Regulators and internal compliance teams access the audit layer through a separate read-only interface. All access is logged.
The WebRTC infrastructure layer: the hardest part to scale
WebRTC: WebRTC (Web Real-Time Communication) is an open standard that enables peer-to-peer audio, video, and data communication directly between browsers and native apps without requiring plugins. In video KYC, WebRTC is the protocol layer responsible for transmitting live video between customer and agent.
SFU (Selective Forwarding Unit): An SFU is a media routing server that receives media streams from participants and selectively forwards them to other participants without transcoding. Unlike a peer-to-peer model, an SFU centralizes media routing, enabling recording, quality monitoring, and scalable multi-party communication. For video KYC, SFU is the correct architecture; it handles thousands of concurrent two-party sessions efficiently.
MCU (Multipoint Control Unit): An MCU mixes all media streams into a single composite stream before forwarding. MCUs are suited for multi-party conferencing where every participant sees the same layout. For video KYC (one customer, one agent), MCU overhead is unnecessary and adds latency.
For 1M+ monthly verifications at peak concurrency (3,000–4,000 sessions), the SFU cluster must be horizontally scaled across multiple server instances, with a session manager distributing load based on current capacity. A single SFU server typically handles 500–1,500 concurrent sessions depending on video resolution and codec settings. At 720p (the minimum acceptable quality for document legibility), plan for 800–1,000 sessions per SFU node.
Network handling. Video KYC customers are not always on fiber. Indian last-mile connectivity includes 4G with significant packet loss, shared WiFi environments, and rural users on 3G. The WebRTC layer must implement adaptive bitrate control (reducing resolution under bandwidth constraints), NACK-based packet loss recovery, and jitter buffering. Target session quality thresholds: latency below 250ms (one-way), packet loss tolerance up to 5%, minimum viable bitrate of 400 kbps at 480p.
Multi-region scaling. For India-first deployments, a Mumbai primary region with a Hyderabad or Chennai secondary region covers geographic latency requirements. TURN servers (relay servers used when direct peer connections fail) must be deployed in each region. Approximately 15–20% of sessions will require TURN relay rather than a direct connection; account for this in your TURN server capacity.
Scaling model: traffic spikes to session delivery
The scaling architecture has two independent axes: infrastructure scaling (SFU, recording, storage) and agent capacity scaling (human agents). Infrastructure can auto-scale in under 2 minutes. Agent capacity cannot; it requires workforce management planning, trained agent pools, and shift scheduling. This asymmetry is the most common failure point.
Infrastructure scaling model:
| Metric | Example Value | Impact |
|---|---|---|
| SFU nodes at baseline | 8 nodes | Handles ~6,400 concurrent sessions |
| Scale-out trigger | Queue depth > 50 sessions | Auto-provision 2 nodes |
| Provision time (cold) | 75–90 seconds | Governs max ramp rate |
| Recording pipeline workers | 1 per 200 concurrent sessions | Scale with SFU |
| TURN server capacity | 1 server per 500 TURN sessions | Scale at 15–20% of concurrent |
| Storage throughput | ~50 MB per session (720p at ~800 kbps, 8 min avg) | ~50 TB/month at 1M volume |
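The scale-out decision in the table combines a reactive trigger (queue depth) with predictive pre-scaling, since a cold node takes 75–90 seconds to provision. A sketch of that decision, using the example thresholds above (queue depth > 50, 2 nodes per step, ~800 sessions per node); the function and parameter names are illustrative:

```python
import math

def nodes_to_provision(queue_depth: int,
                       expected_surge_sessions: int = 0,
                       sessions_per_node: int = 800,
                       queue_threshold: int = 50,
                       step: int = 2) -> int:
    """How many SFU nodes to provision right now.

    Reactive path: queue depth past the threshold adds a fixed step.
    Predictive path: a known upcoming surge (campaign launch, IPO window)
    is converted directly into node count so capacity exists before the
    queue ever backs up.
    """
    reactive = step if queue_depth > queue_threshold else 0
    predictive = math.ceil(expected_surge_sessions / sessions_per_node)
    return max(reactive, predictive)
```

Feeding marketing-calendar signals into `expected_surge_sessions` is what turns a reactive autoscaler into the predictive one the "common mistakes" section argues for.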
Agent capacity model:
The agent layer is not infrastructure; it is a workforce. An agent is occupied one-to-one for the duration of each session, so peak staffing must roughly match peak concurrency: at 8 sessions per agent per hour (accounting for documentation and gap time between sessions), a peak hour of 22,500–30,000 sessions requires roughly 2,800–3,750 active agents. Most platforms operate a hub-and-spoke model: central training, regional agent pools, and overflow contracts with BPO partners. Overflow agents must be pre-approved by the regulated entity's compliance team; they cannot be spun up on demand without prior due diligence.
Agent availability signals must feed back into the queue. When agent availability drops below a threshold, the queue system must proactively display accurate wait times to customers and offer asynchronous alternatives (document upload queues for non-real-time verification where regulations permit).
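The wait-time estimate shown to queued customers follows from queue position and agent throughput. A simple sketch under the assumption that active agents finish sessions at a steady combined rate; the function name and the choice to return infinity when no agents are online are illustrative:

```python
def estimated_wait_minutes(queue_position: int,
                           active_agents: int,
                           avg_session_minutes: float = 8.0) -> float:
    """Rough wait estimate: the pool collectively completes one session
    every (avg_session_minutes / active_agents) minutes, so a customer at
    position N waits about N times that interval."""
    if active_agents == 0:
        return float("inf")  # caller should offer an async alternative
    return round(queue_position * avg_session_minutes / active_agents, 1)
```

An infinite (or merely very large) estimate is exactly the signal that should switch the UI to the asynchronous document-upload path where regulations permit it.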
Build vs. buy vs. hybrid: the infrastructure decision framework
| Factor | Build | Buy | Hybrid |
|---|---|---|---|
| Control over WebRTC layer | Full | None | Partial (SDK-level) |
| Time to production | 9–18 months | 4–8 weeks | 6–12 weeks |
| Compliance customization | Full | Vendor-dependent | Configurable |
| SFU infrastructure ops burden | High (dedicated team) | Zero | Low |
| Recording pipeline ownership | Full | Vendor-managed | Shared |
| Vendor lock-in risk | None | High | Medium |
| Cost at 1M monthly volume | High (infra + eng team) | Predictable per-session | Lowest total cost |
| Regulatory audit readiness | Requires internal build | Vendor certification needed | Best balance |
For most fintech platforms scaling past 500K monthly verifications, the hybrid model delivers the best balance: use a real-time video infrastructure SDK for session management, WebRTC routing, and SFU operations, while owning the KYC backend (liveness, OCR, face-match), compliance storage, and audit pipeline. This preserves regulatory control while eliminating the operational burden of running a WebRTC media infrastructure team.
VideoSDK provides real-time video APIs that handle session orchestration, SFU-based media routing, and scalable connection management: the infrastructure layer in a hybrid architecture. The KYC verification logic, recording pipeline, and compliance storage remain under the regulated entity's control.
Failure handling at scale
At 1M monthly verifications, even a 0.5% failure rate produces 5,000 failed sessions per month. Each failure has a cost: customer drop-off, agent idle time, re-scheduling overhead, and potential compliance gaps if a partially recorded session is not handled correctly. Failure handling is not a nice-to-have; it is a core architectural requirement.
Call drops and reconnection. WebRTC ICE (Interactive Connectivity Establishment) provides a native reconnection mechanism. When connectivity is lost, the client should attempt ICE restart before declaring the session failed. The session orchestration layer must preserve session state (session ID, timestamp, completed verification steps, partial recording) for a minimum of 5 minutes to enable seamless resumption. Customers who reconnect within this window should not restart the verification flow from the beginning.
Network degradation. Adaptive bitrate (ABR) handling should activate before a call drops. At below 400 kbps, the system should drop video resolution to 360p and notify the agent. At below 200 kbps, the session should prompt the customer to switch networks or reschedule. Agent dashboards must display real-time network quality indicators so agents can make informed decisions about whether to continue or reschedule.
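The degradation policy above reduces to a small threshold function. The bitrate cutoffs (400 and 200 kbps) come from the text; the action labels are hypothetical names for whatever the agent dashboard and client actually do:

```python
def network_action(bitrate_kbps: float) -> str:
    """Map measured available bitrate to the degradation action
    described in the text. Action labels are illustrative."""
    if bitrate_kbps < 200:
        return "prompt_reschedule"   # ask customer to switch networks
    if bitrate_kbps < 400:
        return "drop_to_360p"        # reduce resolution, notify the agent
    return "continue"
```

Evaluating this on a smoothed bitrate estimate (rather than instantaneous samples) avoids flapping between states on bursty mobile links.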
Agent unavailability. Agents disconnect, crash browsers, or go offline unexpectedly. The session manager must detect agent disconnection within 10 seconds and either reassign the session to an available agent (preserving the customer's queue position) or hold the session with an in-queue notification to the customer. A session should never be silently dropped because an agent's browser crashed.
Recording failures. A session that completes verification but fails to produce a valid, complete recording is a compliance failure, not just a technical failure. The recording pipeline must write to two destinations simultaneously (primary and backup), with integrity checksums validated post-session. If the primary recording fails, the backup must be automatically promoted. Sessions with recording failures must be flagged for manual review and not treated as completed verifications.
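The dual-destination write with post-session integrity checks can be sketched as follows. In-memory writables stand in for the primary and backup object stores, and SHA-256 is used as an example checksum; the function names are illustrative:

```python
import hashlib

def write_recording(chunk_stream, primary, backup) -> str:
    """Write every recording chunk to both destinations and return the
    SHA-256 of the stream, so each stored copy can be validated
    independently after the session closes."""
    digest = hashlib.sha256()
    for chunk in chunk_stream:
        digest.update(chunk)
        primary.write(chunk)
        backup.write(chunk)
    return digest.hexdigest()

def validate_copy(data: bytes, expected_hex: str) -> bool:
    """Post-session integrity check for one stored copy."""
    return hashlib.sha256(data).hexdigest() == expected_hex
```

If `validate_copy` fails for the primary but passes for the backup, the backup is promoted; if both fail, the session is flagged for manual review rather than counted as a completed verification.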
Partial sessions and audit gaps. Every incomplete session must generate an audit record. The audit record must include: session ID, timestamp of initiation, reason for termination, last completed verification step, agent ID, and customer ID. Incomplete sessions without audit records will surface as compliance gaps during regulatory inspections.
RBI V-CIP compliance checklist
| Requirement | Description | Applies to | Risk if missing |
|---|---|---|---|
| Live video — no pre-recorded | Session must be conducted in real time with no pre-recorded customer segments | All regulated entities | Session invalidated, regulatory notice |
| Geo-tagging at initiation | Customer's GPS coordinates captured and stored at session start | All V-CIP implementations | Audit failure, session invalidated |
| PAN verification with OCR | PAN card captured via camera, OCR-verified against customer-submitted data | Mandatory for most product types | KYC not completed, onboarding blocked |
| Aadhaar OTP authentication | Customer authenticates via Aadhaar-linked OTP during session | Mandatory with UIDAI integration | Regulatory non-compliance |
| Face-match liveness check | Real-time liveness detection with face match against ID document | All V-CIP | Identity fraud risk, audit failure |
| Session recording | Full session video recorded and stored in India-based servers | Mandatory | Audit failure, PMLA exposure |
| WORM storage policy | Recording cannot be modified or deleted post-session | Mandatory | Tamper evidence failure |
| Audit trail with metadata | Session ID, timestamps, agent ID, geo-tag, document scan results stored together | Mandatory | Incomplete audit, regulatory penalty |
| Trained agent (regulated entity staff) | Verification officer must be employee or verified official of the regulated entity | All V-CIP | Session invalidated |
| Data residency (India servers) | All session data, recordings, and metadata stored in India | Mandatory | Data sovereignty violation |
Common mistakes engineering teams make
1. Designing for average load, not peak concurrency. The most expensive infrastructure mistake is provisioning for the average of 45,000 sessions per day rather than the peak of 90,000 on the highest-demand day. Auto-scaling helps, but provision-time lag (75–90 seconds for a new SFU node) means the system must pre-scale based on predictive signals (calendar events, marketing campaign start times, IPO application windows) rather than reactive threshold triggers alone.
2. Treating the recording pipeline as secondary. Many teams build the live video path first and bolt on recording later. At scale, this creates a fragile single-threaded pipeline that cannot sustain simultaneous high-fidelity live streaming and recording. The recording tap must be built into the SFU layer from day one, with independent scaling and dual-destination writes.
3. Ignoring the agent-infrastructure mismatch. Infrastructure can scale in minutes. Agents cannot. A platform that scales SFU capacity to handle 4,000 concurrent sessions but only has 300 trained agents online will produce queue times that drive customer abandonment, and the abandonment rate compounds the compliance overhead of failed sessions that must be tracked and logged.
4. Underspecifying the failure audit trail. Every failed, interrupted, or partial session must generate a complete audit record. Teams that log only successful sessions will fail regulatory inspections. Build the failure audit path with the same rigor as the success path.
5. Hardcoding geo-restrictions without dynamic validation. V-CIP requires customer geo-tagging, but many teams implement it as a static IP-based check rather than a GPS capture. The correct implementation requires device GPS with a timestamp, stored alongside the session record. IP-based location is not an acceptable substitute under V-CIP guidelines.
Key takeaways
- Design for 2–3x average peak concurrency: at 1M monthly verifications, peak concurrent sessions reach 3,000–4,000, requiring 4–5 SFU nodes with headroom and auto-scaling to 8–10 nodes under surge conditions.
- SFU is the correct WebRTC topology for video KYC: it enables centralized recording, quality monitoring, and horizontal scaling without the transcoding overhead of an MCU.
- The hybrid build/buy model delivers the best balance: own the KYC backend, compliance storage, and audit pipeline; use a real-time video infrastructure SDK for session management and SFU operations.
- Recording failures are compliance failures: dual-destination writes, integrity checksums, and automatic promotion of backup recordings are non-negotiable at regulated scale.
- Agent capacity is the hardest constraint to scale: infrastructure auto-scales in under 2 minutes, while trained agent pools require weeks of planning, which means workforce forecasting must be integrated into the infrastructure scaling model.
FAQ
Q: What is the minimum infrastructure required to handle 100,000 video KYC sessions per month?
At 100,000 monthly sessions with a peak concurrency of approximately 300–400 simultaneous sessions, a deployment of 2–3 SFU nodes with a load balancer and auto-scaling configured to a 50-session queue threshold is sufficient. A single recording pipeline worker handling up to 200 concurrent recordings is adequate at this scale. Ensure your TURN server can handle 60–80 TURN-relayed sessions at peak.
Q: How does WebRTC handle poor network conditions during a video KYC session?
WebRTC includes native adaptive bitrate control through its congestion control algorithms (Google Congestion Control / REMB). When network bandwidth drops, the WebRTC layer automatically reduces video resolution and frame rate to maintain session continuity. For video KYC specifically, you should configure a minimum quality floor (400 kbps at 480p) below which the system should notify the agent and prompt the customer to switch networks or reschedule.
Q: Can video KYC sessions be conducted on 3G networks?
Yes, but with constraints. A 3G connection delivering a sustained 1–2 Mbps can handle a 480p video KYC session adequately. The WebRTC layer must be configured with aggressive jitter buffering and packet loss recovery (NACK). Sessions on borderline connectivity should be flagged in the recording metadata, as quality degradation could affect document legibility, which has compliance implications.
Q: How long must video KYC recordings be retained under RBI guidelines?
The RBI Master Direction on KYC requires that KYC records (including V-CIP recordings) be maintained for at least 5 years after the business relationship ends, or 5 years from the date of the transaction, whichever is later. Storage must be on India-based servers with a WORM policy. Always verify the current retention period with your compliance team, as this requirement may be updated.
Q: What is the difference between V-CIP and standard video KYC?
V-CIP (Video-based Customer Identification Process) is the specific RBI-mandated framework for video KYC in India, as defined in the Master Direction – KYC Direction, 2016. It specifies the exact data capture requirements (geo-tag, PAN, Aadhaar OTP, liveness), agent qualifications, storage requirements, and audit obligations. "Video KYC" is the generic term; V-CIP is the regulated implementation. All video KYC conducted by RBI-regulated entities must comply with V-CIP requirements.
Q: How do you handle session resumption after a network drop?
The session orchestration layer must preserve session state (session ID, completed verification steps, partial recording, timestamps) for a minimum of 5 minutes after a drop. When the customer reconnects within this window, the WebRTC layer initiates an ICE restart using the same session parameters, and the verification flow resumes from the last completed step. Partial recordings are stitched with the resumed recording at the post-session processing stage, and both segments are stored together with a discontinuity marker in the metadata.
Q: What is the estimated storage cost for 1 million video KYC sessions per month?
At an average session length of 8 minutes and a recording bitrate of ~800 kbps (720p, H.264), each session produces approximately 48 MB of raw recording. At 1 million sessions, this is approximately 48 TB per month of raw recording data. With H.265/HEVC encoding (approximately 40% size reduction) and storage tiering (hot storage for 30 days, cold storage for the remainder of the 5-year retention period), the effective managed storage cost can be reduced substantially. Budget hot and cold storage tiers separately, and account for replication overhead across backup destinations.
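The sizing arithmetic above can be checked directly. A sketch using the stated figures (8-minute sessions at ~800 kbps); the function name and the decimal-terabyte convention are illustrative:

```python
def monthly_recording_tb(sessions: int,
                         minutes: float = 8.0,
                         bitrate_kbps: float = 800.0,
                         size_reduction: float = 0.0) -> float:
    """Monthly raw recording volume in decimal TB.

    size_reduction models re-encoding savings, e.g. 0.4 for the ~40%
    H.265/HEVC reduction mentioned above.
    """
    bytes_per_session = bitrate_kbps * 1000 / 8 * minutes * 60
    total_bytes = bytes_per_session * sessions * (1 - size_reduction)
    return round(total_bytes / 1e12, 1)
```

Note this counts raw recording only; replication to the backup destination and audit-bundle metadata add on top of it.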
Q: How should we approach compliance storage for a globally distributed video KYC platform?
For RBI-regulated entities, recording data must be stored in India regardless of where the platform's primary infrastructure runs. For global deployments (MAS, FCA, FinCEN), each jurisdiction has its own data residency requirements. The standard architecture is regional compliance storage nodes: data captured in a jurisdiction stays in that jurisdiction's storage tier. The audit and reporting layer is a federated query layer that can access all regional stores with role-based access control and a complete access log.
