TL;DR: Scaling video KYC to 1 million+ monthly verifications requires a purpose-built infrastructure stack: distributed WebRTC session routing, auto-scaling agent queues, a compliant recording pipeline, and fault-tolerant session recovery. Most teams underestimate concurrency requirements by 3–5x and treat compliance storage as an afterthought; both mistakes create catastrophic bottlenecks at scale.
Why scaling video KYC is now a business-critical infrastructure problem
The push to scale video KYC to 1 million verifications and beyond is no longer a future ambition; it is the present reality for large NBFCs, digital banks, insurance aggregators, and broker platforms operating in India. As of 2024, RBI-regulated entities are required to complete customer due diligence through the Video-based Customer Identification Process (V-CIP), the formal designation for video KYC under the Reserve Bank of India's Master Direction on KYC (updated in 2021 and subsequently amended). Simultaneously, SEBI and IRDAI have published parallel guidelines aligned with the same verification model.
Beyond India, MAS (Singapore), FCA (UK), BaFin (Germany), and FinCEN (US) have each adopted digital identity verification frameworks that include video as an acceptable and often preferred verification channel. The result is a global infrastructure problem: real-time identity verification systems must be architected to handle volume, latency, compliance, and failure simultaneously.
The failure mode for teams that build this ad hoc is well-documented: video drops during peak hours, recording gaps that fail audits, agent queues that back up and produce 20–40 minute wait times, and compliance breaches that trigger regulatory notices. This guide is for engineering leaders and architects who need to build this right.
Market and compliance context
What does 1 million monthly verifications actually look like?
Breaking down the math is the first step to responsible infrastructure design.
| Metric | Calculation | Example Value |
|---|---|---|
| Monthly verifications | Target | 1,000,000 |
| Daily average | ÷ 22 working days | ~45,500/day |
| Peak day multiplier | 1.8–2.2x average | ~91,000–100,000/day |
| Peak hour (10 AM–12 PM) | 25–30% of daily | ~22,500–30,000/hour |
| Peak concurrent sessions | Peak hour × 8-min avg session ÷ 60 | 3,000–4,000 concurrent |
| Agents required at peak | Peak hour ÷ 8 sessions/agent/hour | ~2,800–3,750 active agents |
Most teams design for average load. The infrastructure must be designed for peak concurrency of 2–3x the average, with headroom for viral onboarding campaigns and the IPO application surges common in the Indian fintech context.
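The table above can be reproduced with a short capacity-planning calculation. The function below is an illustrative sketch: the defaults (22 working days, a 2.0x peak-day multiplier, a 28% peak-hour share, 8-minute sessions, 800 sessions per SFU node) are the example assumptions from the text, not fixed constants.

```python
import math

def plan_capacity(monthly_sessions: int,
                  working_days: int = 22,
                  peak_day_multiplier: float = 2.0,
                  peak_hour_share: float = 0.28,
                  session_minutes: float = 8.0,
                  sessions_per_sfu_node: int = 800) -> dict:
    daily_avg = monthly_sessions / working_days
    peak_day = daily_avg * peak_day_multiplier
    peak_hour = peak_day * peak_hour_share
    # Little's law: concurrency = arrival rate x average session duration
    concurrent = peak_hour * session_minutes / 60
    return {
        "daily_avg": round(daily_avg),
        "peak_hour_sessions": round(peak_hour),
        "peak_concurrent": round(concurrent),
        "sfu_nodes": math.ceil(concurrent / sessions_per_sfu_node),
    }
```

Running `plan_capacity(1_000_000)` lands inside the ranges in the table: roughly 45,500 sessions per day on average and ~3,400 concurrent sessions at peak.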
RBI V-CIP compliance requirements
V-CIP (Video-based Customer Identification Process): V-CIP is the RBI-mandated framework for conducting KYC using live video interaction between a regulated entity's officer and a customer, as specified in the Master Direction – Know Your Customer (KYC) Direction, 2016 (last updated 2023). The process is not simply a video call; it is a structured verification event with defined data capture, storage, and audit requirements.
Key requirements under V-CIP:
- Live, uninterrupted video with no pre-recorded segments
- Geo-tagging of the customer's location at session initiation
- PAN card verification via OCR with face-match liveness check
- Aadhaar-based OTP authentication (with UIDAI integration)
- Session recording stored in India-based servers
- Audit trail with timestamps, agent ID, and session metadata
- Agent must be a trained official of the regulated entity (not a third party)
Failure to comply exposes regulated entities to penalties under the Prevention of Money Laundering Act (PMLA), potential license suspension, and reputational risk. The consequences extend beyond fines: a compliance gap in the audit trail can invalidate thousands of onboarding records retroactively.
Note: Compliance requirements evolve. Always validate your V-CIP implementation against the most current RBI Master Direction and consult a qualified compliance officer before production deployment.
How to scale video KYC to 1 million verifications
End-to-end system architecture
A production-grade video KYC infrastructure is not a monolithic system. It is a pipeline of coordinated services, each responsible for a specific stage of the session lifecycle.
The seven layers of the architecture:
Layer 1 — Session initiation. The customer opens the mobile app or web client and requests a video KYC session. The client sends an API call to the session orchestration service, which creates a session ID, assigns a queue position, and returns session metadata including estimated wait time. At this point, geo-location is captured and stored.
Layer 2 — Queue and agent assignment. The orchestration layer maintains a priority queue of pending sessions. When an agent becomes available, the session manager assigns the session, notifies both the customer and agent, and instructs the WebRTC layer to initiate connection establishment.
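The queue-and-assignment loop in Layer 2 can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions: the class and method names (`SessionQueue`, `agent_ready`, and so on) are hypothetical, a lower priority value means earlier service, and a production system would back this with a durable store and publish notifications rather than mutate a dict.

```python
import heapq
import itertools

class SessionQueue:
    """Illustrative priority queue pairing pending sessions with idle agents."""

    def __init__(self):
        self._heap = []                  # (priority, seq, session_id)
        self._seq = itertools.count()    # tie-breaker preserves FIFO order
        self.available_agents = []       # idle agent IDs
        self.assignments = {}            # session_id -> agent_id

    def enqueue(self, session_id: str, priority: int = 0) -> int:
        heapq.heappush(self._heap, (priority, next(self._seq), session_id))
        return len(self._heap)           # approximate queue position

    def agent_ready(self, agent_id: str) -> None:
        self.available_agents.append(agent_id)
        self._drain()

    def _drain(self) -> None:
        # Pair waiting sessions with idle agents, lowest priority value first.
        # In production: notify customer + agent, then trigger WebRTC setup.
        while self._heap and self.available_agents:
            _, _, session_id = heapq.heappop(self._heap)
            self.assignments[session_id] = self.available_agents.pop(0)
```

A priority field matters even in a FIFO-looking system: reconnecting customers (see the failure-handling section) should re-enter the queue ahead of new arrivals.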
Layer 3 — WebRTC connection. The client and agent establish a real-time video connection through the WebRTC infrastructure layer. This is where most infrastructure complexity lives, and where the majority of scaling failures originate.
Layer 4 — KYC verification flow. During the session, the agent follows a structured checklist: document capture (PAN, Aadhaar), face-match liveness, verbal confirmation, and geo-tag verification. The KYC backend processes OCR and biometric checks in real time, returning pass/fail signals to the agent dashboard.
Layer 5 — Session recording. Every session is recorded simultaneously with the live stream. The recording pipeline captures encrypted video to a compliant India-hosted storage layer (S3-compatible, with WORM Write Once Read Many policy enabled).
Layer 6 — Post-session processing. After the session closes, the system stitches metadata (timestamps, agent ID, session ID, geo-coordinates, document scan results) with the recording and stores the complete audit bundle in the compliance archive.
Layer 7 — Audit and reporting. Regulators and internal compliance teams access the audit layer through a separate read-only interface. All access is logged.
The WebRTC infrastructure layer: the hardest part to scale
WebRTC: WebRTC (Web Real-Time Communication) is an open standard that enables peer-to-peer audio, video, and data communication directly between browsers and native apps without requiring plugins. In video KYC, WebRTC is the protocol layer responsible for transmitting live video between customer and agent.
SFU (Selective Forwarding Unit): An SFU is a media routing server that receives media streams from participants and selectively forwards them to other participants without transcoding. Unlike a peer-to-peer model, an SFU centralizes media routing, enabling recording, quality monitoring, and scalable multi-party communication. For video KYC, SFU is the correct architecture; it handles thousands of concurrent two-party sessions efficiently.
MCU (Multipoint Control Unit): An MCU mixes all media streams into a single composite stream before forwarding. MCUs are suited for multi-party conferencing where every participant sees the same layout. For video KYC (one customer, one agent), MCU overhead is unnecessary and adds latency.
For 1M+ monthly verifications at peak concurrency (3,000–4,000 sessions), the SFU cluster must be horizontally scaled across multiple server instances, with a session manager distributing load based on current capacity. A single SFU server typically handles 500–1,500 concurrent sessions depending on video resolution and codec settings. At 720p (the minimum acceptable quality for document legibility), plan for 800–1,000 sessions per SFU node.
Network handling. Video KYC customers are not always on fiber. Indian last-mile connectivity includes 4G with significant packet loss, shared WiFi environments, and rural users on 3G. The WebRTC layer must implement adaptive bitrate control (reducing resolution under bandwidth constraints), NACK-based packet loss recovery, and jitter buffering. Target session quality thresholds: latency below 250ms (one-way), packet loss tolerance up to 5%, minimum viable bitrate of 400 kbps at 480p.
Multi-region scaling. For India-first deployments, a Mumbai primary region with a Hyderabad or Chennai secondary region covers geographic latency requirements. TURN servers (relay servers used when direct peer connections fail) must be deployed in each region. Approximately 15–20% of sessions will require TURN relay rather than a direct connection; account for this in your TURN server capacity.
Scaling model: traffic spikes to session delivery
The scaling architecture has two independent axes: infrastructure scaling (SFU, recording, storage) and agent capacity scaling (human agents). Infrastructure can auto-scale in under 2 minutes. Agent capacity cannot; it requires workforce management planning, trained agent pools, and shift scheduling. This asymmetry is the most common failure point.
Infrastructure scaling model:
| Metric | Example Value | Impact |
|---|---|---|
| SFU nodes at baseline | 8 nodes | Handles ~6,400 concurrent sessions |
| Scale-out trigger | Queue depth > 50 sessions | Auto-provision 2 nodes |
| Provision time (cold) | 75–90 seconds | Governs max ramp rate |
| Recording pipeline workers | 1 per 200 concurrent sessions | Scale with SFU |
| TURN server capacity | 1 server per 500 TURN sessions | Scale at 15–20% of concurrent |
| Storage throughput | ~50 MB per session (720p at ~800 kbps, 8 min avg) | ~50 TB/month at 1M volume |
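The scale-out decision in the table combines a reactive trigger (queue depth) with predictive pre-scaling, since a cold node takes 75–90 seconds to provision. A sketch of that decision, using the example thresholds above (queue depth > 50, 2 nodes per step, ~800 sessions per node); the function and parameter names are illustrative:

```python
import math

def nodes_to_provision(queue_depth: int,
                       expected_surge_sessions: int = 0,
                       sessions_per_node: int = 800,
                       queue_threshold: int = 50,
                       step: int = 2) -> int:
    """How many SFU nodes to provision right now.

    Reactive path: queue depth past the threshold adds a fixed step.
    Predictive path: a known upcoming surge (campaign launch, IPO window)
    is converted directly into node count so capacity exists before the
    queue ever backs up.
    """
    reactive = step if queue_depth > queue_threshold else 0
    predictive = math.ceil(expected_surge_sessions / sessions_per_node)
    return max(reactive, predictive)
```

Feeding marketing-calendar signals into `expected_surge_sessions` is what turns a reactive autoscaler into the predictive one the "common mistakes" section argues for.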
Agent capacity model:
The agent layer is not infrastructure; it is a workforce. An agent is occupied one-to-one for the duration of each session, so peak staffing must roughly match peak concurrency: at 8 sessions per agent per hour (accounting for documentation and gap time between sessions), a peak hour of 22,500–30,000 sessions requires roughly 2,800–3,750 active agents. Most platforms operate a hub-and-spoke model: central training, regional agent pools, and overflow contracts with BPO partners. Overflow agents must be pre-approved by the regulated entity's compliance team; they cannot be spun up on demand without prior due diligence.
Agent availability signals must feed back into the queue. When agent availability drops below a threshold, the queue system must proactively display accurate wait times to customers and offer asynchronous alternatives (document upload queues for non-real-time verification where regulations permit).
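The wait-time estimate shown to queued customers follows from queue position and agent throughput. A simple sketch under the assumption that active agents finish sessions at a steady combined rate; the function name and the choice to return infinity when no agents are online are illustrative:

```python
def estimated_wait_minutes(queue_position: int,
                           active_agents: int,
                           avg_session_minutes: float = 8.0) -> float:
    """Rough wait estimate: the pool collectively completes one session
    every (avg_session_minutes / active_agents) minutes, so a customer at
    position N waits about N times that interval."""
    if active_agents == 0:
        return float("inf")  # caller should offer an async alternative
    return round(queue_position * avg_session_minutes / active_agents, 1)
```

An infinite (or merely very large) estimate is exactly the signal that should switch the UI to the asynchronous document-upload path where regulations permit it.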
Build vs. buy vs. hybrid: the infrastructure decision framework
| Factor | Build | Buy | Hybrid |
|---|---|---|---|
| Control over WebRTC layer | Full | None | Partial (SDK-level) |
| Time to production | 9–18 months | 4–8 weeks | 6–12 weeks |
| Compliance customization | Full | Vendor-dependent | Configurable |
| SFU infrastructure ops burden | High (dedicated team) | Zero | Low |
| Recording pipeline ownership | Full | Vendor-managed | Shared |
| Vendor lock-in risk | None | High | Medium |
| Cost at 1M monthly volume | High (infra + eng team) | Predictable per-session | Lowest total cost |
| Regulatory audit readiness | Requires internal build | Vendor certification needed | Best balance |
For most fintech platforms scaling past 500K monthly verifications, the hybrid model delivers the best balance: use a real-time video infrastructure SDK for session management, WebRTC routing, and SFU operations, while owning the KYC backend (liveness, OCR, face-match), compliance storage, and audit pipeline. This preserves regulatory control while eliminating the operational burden of running a WebRTC media infrastructure team.
VideoSDK provides real-time video APIs that handle session orchestration, SFU-based media routing, and scalable connection management: the infrastructure layer in a hybrid architecture. The KYC verification logic, recording pipeline, and compliance storage remain under the regulated entity's control.
Failure handling at scale
At 1M monthly verifications, even a 0.5% failure rate produces 5,000 failed sessions per month. Each failure has a cost: customer drop-off, agent idle time, re-scheduling overhead, and potential compliance gaps if a partially recorded session is not handled correctly. Failure handling is not a nice-to-have; it is a core architectural requirement.
Call drops and reconnection. WebRTC ICE (Interactive Connectivity Establishment) provides a native reconnection mechanism. When connectivity is lost, the client should attempt ICE restart before declaring the session failed. The session orchestration layer must preserve session state (session ID, timestamp, completed verification steps, partial recording) for a minimum of 5 minutes to enable seamless resumption. Customers who reconnect within this window should not restart the verification flow from the beginning.
Network degradation. Adaptive bitrate (ABR) handling should activate before a call drops. At below 400 kbps, the system should drop video resolution to 360p and notify the agent. At below 200 kbps, the session should prompt the customer to switch networks or reschedule. Agent dashboards must display real-time network quality indicators so agents can make informed decisions about whether to continue or reschedule.
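The degradation policy above reduces to a small threshold function. The bitrate cutoffs (400 and 200 kbps) come from the text; the action labels are hypothetical names for whatever the agent dashboard and client actually do:

```python
def network_action(bitrate_kbps: float) -> str:
    """Map measured available bitrate to the degradation action
    described in the text. Action labels are illustrative."""
    if bitrate_kbps < 200:
        return "prompt_reschedule"   # ask customer to switch networks
    if bitrate_kbps < 400:
        return "drop_to_360p"        # reduce resolution, notify the agent
    return "continue"
```

Evaluating this on a smoothed bitrate estimate (rather than instantaneous samples) avoids flapping between states on bursty mobile links.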
Agent unavailability. Agents disconnect, crash browsers, or go offline unexpectedly. The session manager must detect agent disconnection within 10 seconds and either reassign the session to an available agent (preserving the customer's queue position) or hold the session with an in-queue notification to the customer. A session should never be silently dropped because an agent's browser crashed.
Recording failures. A session that completes verification but fails to produce a valid, complete recording is a compliance failure, not just a technical failure. The recording pipeline must write to two destinations simultaneously (primary and backup), with integrity checksums validated post-session. If the primary recording fails, the backup must be automatically promoted. Sessions with recording failures must be flagged for manual review and not treated as completed verifications.
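The dual-destination write with post-session integrity checks can be sketched as follows. In-memory writables stand in for the primary and backup object stores, and SHA-256 is used as an example checksum; the function names are illustrative:

```python
import hashlib

def write_recording(chunk_stream, primary, backup) -> str:
    """Write every recording chunk to both destinations and return the
    SHA-256 of the stream, so each stored copy can be validated
    independently after the session closes."""
    digest = hashlib.sha256()
    for chunk in chunk_stream:
        digest.update(chunk)
        primary.write(chunk)
        backup.write(chunk)
    return digest.hexdigest()

def validate_copy(data: bytes, expected_hex: str) -> bool:
    """Post-session integrity check for one stored copy."""
    return hashlib.sha256(data).hexdigest() == expected_hex
```

If `validate_copy` fails for the primary but passes for the backup, the backup is promoted; if both fail, the session is flagged for manual review rather than counted as a completed verification.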
Partial sessions and audit gaps. Every incomplete session must generate an audit record. The audit record must include: session ID, timestamp of initiation, reason for termination, last completed verification step, agent ID, and customer ID. Incomplete sessions without audit records will surface as compliance gaps during regulatory inspections.
RBI V-CIP compliance checklist
| Requirement | Description | Applies to | Risk if missing |
|---|---|---|---|
| Live video — no pre-recorded | Session must be conducted in real time with no pre-recorded customer segments | All regulated entities | Session invalidated, regulatory notice |
| Geo-tagging at initiation | Customer's GPS coordinates captured and stored at session start | All V-CIP implementations | Audit failure, session invalidated |
| PAN verification with OCR | PAN card captured via camera, OCR-verified against customer-submitted data | Mandatory for most product types | KYC not completed, onboarding blocked |
| Aadhaar OTP authentication | Customer authenticates via Aadhaar-linked OTP during session | Mandatory with UIDAI integration | Regulatory non-compliance |
| Face-match liveness check | Real-time liveness detection with face match against ID document | All V-CIP | Identity fraud risk, audit failure |
| Session recording | Full session video recorded and stored in India-based servers | Mandatory | Audit failure, PMLA exposure |
| WORM storage policy | Recording cannot be modified or deleted post-session | Mandatory | Tamper evidence failure |
| Audit trail with metadata | Session ID, timestamps, agent ID, geo-tag, document scan results stored together | Mandatory | Incomplete audit, regulatory penalty |
| Trained agent (regulated entity staff) | Verification officer must be employee or verified official of the regulated entity | All V-CIP | Session invalidated |
| Data residency (India servers) | All session data, recordings, and metadata stored in India | Mandatory | Data sovereignty violation |
Common mistakes engineering teams make
1. Designing for average load, not peak concurrency. The most expensive infrastructure mistake is provisioning for the average of 45,000 sessions per day rather than the peak of 90,000 on the highest-demand day. Auto-scaling helps, but provision-time lag (75–90 seconds for a new SFU node) means the system must pre-scale based on predictive signals (calendar events, marketing campaign start times, IPO application windows) rather than reactive threshold triggers alone.
2. Treating the recording pipeline as secondary. Many teams build the live video path first and bolt on recording later. At scale, this creates a fragile single-threaded pipeline that cannot sustain simultaneous high-fidelity live streaming and recording. The recording tap must be built into the SFU layer from day one, with independent scaling and dual-destination writes.
3. Ignoring the agent-infrastructure mismatch. Infrastructure can scale in minutes. Agents cannot. A platform that scales SFU capacity to handle 4,000 concurrent sessions but only has 300 trained agents online will produce queue times that drive customer abandonment, and the abandonment rate compounds the compliance overhead of failed sessions that must be tracked and logged.
4. Underspecifying the failure audit trail. Every failed, interrupted, or partial session must generate a complete audit record. Teams that log only successful sessions will fail regulatory inspections. Build the failure audit path with the same rigor as the success path.
5. Hardcoding geo-restrictions without dynamic validation. V-CIP requires customer geo-tagging, but many teams implement it as a static IP-based check rather than a GPS capture. The correct implementation requires device GPS with a timestamp, stored alongside the session record. IP-based location is not an acceptable substitute under V-CIP guidelines.
Key takeaways
- Design for 2–3x average peak concurrency: at 1M monthly verifications, peak concurrent sessions reach 3,000–4,000, requiring 4–5 SFU nodes with headroom and auto-scaling to 8–10 nodes under surge conditions.
- SFU is the correct WebRTC topology for video KYC: it enables centralized recording, quality monitoring, and horizontal scaling without the transcoding overhead of an MCU.
- The hybrid build/buy model delivers the best balance: own the KYC backend, compliance storage, and audit pipeline; use a real-time video infrastructure SDK for session management and SFU operations.
- Recording failures are compliance failures: dual-destination writes, integrity checksums, and automatic promotion of backup recordings are non-negotiable at regulated scale.
- Agent capacity is the hardest constraint to scale: infrastructure auto-scales in under 2 minutes, while trained agent pools require weeks of planning, which means workforce forecasting must be integrated into the infrastructure scaling model.
FAQ
Q: What is the minimum infrastructure required to handle 100,000 video KYC sessions per month?
At 100,000 monthly sessions with a peak concurrency of approximately 300–400 simultaneous sessions, a deployment of 2–3 SFU nodes with a load balancer and auto-scaling configured to a 50-session queue threshold is sufficient. A single recording pipeline worker handling up to 200 concurrent recordings is adequate at this scale. Ensure your TURN server can handle 60–80 TURN-relayed sessions at peak.
Q: How does WebRTC handle poor network conditions during a video KYC session?
WebRTC includes native adaptive bitrate control through its congestion control algorithms (Google Congestion Control / REMB). When network bandwidth drops, the WebRTC layer automatically reduces video resolution and frame rate to maintain session continuity. For video KYC specifically, you should configure a minimum quality floor (400 kbps at 480p) below which the system should notify the agent and prompt the customer to switch networks or reschedule.
Q: Can video KYC sessions be conducted on 3G networks?
Yes, but with constraints. A 3G connection delivering a sustained 1–2 Mbps can handle a 480p video KYC session adequately. The WebRTC layer must be configured with aggressive jitter buffering and packet loss recovery (NACK). Sessions on borderline connectivity should be flagged in the recording metadata, as quality degradation could affect document legibility, which has compliance implications.
Q: How long must video KYC recordings be retained under RBI guidelines?
The RBI Master Direction on KYC requires that KYC records (including V-CIP recordings) be maintained for at least 5 years after the business relationship ends, or 5 years from the date of the transaction, whichever is later. Storage must be on India-based servers with a WORM policy. Always verify the current retention period with your compliance team, as this requirement may be updated.
Q: What is the difference between V-CIP and standard video KYC?
V-CIP (Video-based Customer Identification Process) is the specific RBI-mandated framework for video KYC in India, as defined in the Master Direction – KYC Direction, 2016. It specifies the exact data capture requirements (geo-tag, PAN, Aadhaar OTP, liveness), agent qualifications, storage requirements, and audit obligations. "Video KYC" is the generic term; V-CIP is the regulated implementation. All video KYC conducted by RBI-regulated entities must comply with V-CIP requirements.
Q: How do you handle session resumption after a network drop?
The session orchestration layer must preserve session state (session ID, completed verification steps, partial recording, timestamps) for a minimum of 5 minutes after a drop. When the customer reconnects within this window, the WebRTC layer initiates an ICE restart using the same session parameters, and the verification flow resumes from the last completed step. Partial recordings are stitched with the resumed recording at the post-session processing stage, and both segments are stored together with a discontinuity marker in the metadata.
Q: What is the estimated storage cost for 1 million video KYC sessions per month?
At an average session length of 8 minutes and a recording bitrate of ~800 kbps (720p, H.264), each session produces approximately 48 MB of raw recording. At 1 million sessions, this is approximately 48 TB per month of raw recording data. With H.265/HEVC encoding (approximately 40% size reduction) and storage tiering (hot storage for 30 days, cold storage for the remainder of the 5-year retention period), the effective managed storage cost can be reduced substantially. Budget hot and cold storage tiers separately, and account for replication overhead across backup destinations.
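The sizing arithmetic above can be checked directly. A sketch using the stated figures (8-minute sessions at ~800 kbps); the function name and the decimal-terabyte convention are illustrative:

```python
def monthly_recording_tb(sessions: int,
                         minutes: float = 8.0,
                         bitrate_kbps: float = 800.0,
                         size_reduction: float = 0.0) -> float:
    """Monthly raw recording volume in decimal TB.

    size_reduction models re-encoding savings, e.g. 0.4 for the ~40%
    H.265/HEVC reduction mentioned above.
    """
    bytes_per_session = bitrate_kbps * 1000 / 8 * minutes * 60
    total_bytes = bytes_per_session * sessions * (1 - size_reduction)
    return round(total_bytes / 1e12, 1)
```

Note this counts raw recording only; replication to the backup destination and audit-bundle metadata add on top of it.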
Q: How should we approach compliance storage for a globally distributed video KYC platform?
For RBI-regulated entities, recording data must be stored in India regardless of where the platform's primary infrastructure runs. For global deployments (MAS, FCA, FinCEN), each jurisdiction has its own data residency requirements. The standard architecture is regional compliance storage nodes: data captured in a jurisdiction stays in that jurisdiction's storage tier. The audit and reporting layer is a federated query layer that can access all regional stores with role-based access control and a complete access log.
