Why We Replaced Kafka with gRPC for Service Communication
There was a time when I thought Kafka was the answer to everything.
Need services to talk? Kafka.
Want retries? Kafka.
Want scalability? Kafka.
Want async communication? Kafka.
Even for request-response calls, we somehow convinced ourselves that Kafka was the way.
But as the system grew and reality set in, we hit bottlenecks. Not with Kafka itself but with how we were using it. Eventually, we reached a point where we had to rethink everything.
And that’s when we made a shift. Not to REST.
Not to WebSockets.
But to gRPC.
Here’s what we faced, what we fixed, and what we learned along the way.
How We Ended Up Over-Kafka’d¶
We were building a loan servicing platform for banks and NBFCs. Lots of moving parts:
- Loan creation
- Disbursement
- Eligibility checks
- Moratorium variations
- ROI changes
- Installment recalculations
- Notifications
- Credit Bureau reporting
- Accounting + Ledger
All broken into microservices.
We wanted decoupling. Scalability. Loose dependencies. And Kafka felt like the glue that would hold all these services together.
We designed it like this:
- Service A does something important
- Publishes an event on Kafka
- Services B, C, D consume and react
So far so good.
But very quickly, problems started surfacing.
Real Example: Where Kafka Made Things Shit…¶
Say a loan is disbursed.
The flow looks like:
- DisbursementService emits a `loan_disbursed` event to Kafka
- NotificationService listens → sends SMS/email
- AccountingService listens → updates ledger
- ComplianceService listens → notifies regulator
This seems clean. Except:
- What if NotificationService crashes? Kafka has no idea. The event just sits there. No alert.
- What if AccountingService delays consumption due to backpressure? The ledger gets stale.
- What if one service fails silently? We never know unless we manually monitor logs or someone raises a ticket.
- What if multiple retries fail? The event lands in a dead-letter queue and is forgotten.
And the biggest one:
- What if we need an immediate response, like checking if disbursement succeeded or if a user is eligible? Kafka just doesn’t give that guarantee.
This isn’t Kafka’s fault.
We were simply using Kafka for things it wasn’t built for.
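To make the gap concrete, here is a minimal sketch in plain Python. It is not Kafka client code: `InMemoryBroker`, `NotificationService`, and the `LN-1` loan id are hypothetical stand-ins, assumed only for illustration. It shows that a publish succeeds as soon as the broker stores the event, while a direct call surfaces a downed service to the caller right away.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a message broker: stores events per topic in a log."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, event):
        # The producer only learns that the broker stored the event,
        # not whether any consumer ever processed it.
        self.topics[topic].append(event)
        return len(self.topics[topic]) - 1  # offset of the stored event


class NotificationService:
    """Hypothetical service; `healthy=False` simulates an outage."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.sent = []

    def send(self, loan_id):
        if not self.healthy:
            raise ConnectionError("NotificationService is down")
        self.sent.append(loan_id)
        return "SENT"


def notify_via_broker(broker, loan_id):
    # Fire-and-forget: succeeds even if no consumer is alive.
    return broker.publish("loan_disbursed", {"loan_id": loan_id})


def notify_via_direct_call(notifier, loan_id):
    # Request-response: the caller sees the failure immediately.
    try:
        return notifier.send(loan_id)
    except ConnectionError as exc:
        return f"FAILED: {exc}"


broker = InMemoryBroker()
down = NotificationService(healthy=False)

offset = notify_via_broker(broker, "LN-1")      # looks like success to the producer
result = notify_via_direct_call(down, "LN-1")   # surfaces the outage to the caller
```

In the broker path, nothing on the producer side distinguishes a healthy consumer from a dead one. Catching that requires separate monitoring, such as consumer lag alerts.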
The Moment of Clarity¶
It was late one evening, and a high-value customer hadn’t received a notification after their loan was disbursed.
We checked the flow:
- Disbursement → Kafka → NotificationService
- Kafka showed the event was produced
- But NotificationService was down
- No retry triggered
- No alert fired
The support team had no clue what went wrong. Neither did the customer.
That’s when one of our devs just blurted out:
“Why don’t we just call NotificationService directly instead of throwing it into Kafka?”
That hit us hard.
We realized we were using Kafka as a crutch. A fancy way to feel decoupled while actually increasing uncertainty.
Exploring gRPC¶
We had used gRPC before, but only lightly. This time, we decided to go deeper.
The first thing we noticed?
- It’s built on HTTP/2 (faster, multiplexed, binary)
- It uses Protocol Buffers (protobuf): super efficient, small payloads
- It supports bi-directional streaming
- It’s strongly typed: every call and field is contractually defined
- It supports real-time, request-response, and streaming
- You can generate code in Java, Go, etc.; the same `.proto` file works everywhere
We ran a small POC where LoanService used gRPC to call EligibilityService. The response came in under 10ms.
Simple. Fast. Clear. No brokers, no consumers, no polling.
That was enough to convince us to try replacing more flows.
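The shape of that POC call can be sketched in plain Python. This is only an analogy for a typed request-response contract, not generated gRPC code: in a real setup the messages and service would be declared in a `.proto` file and the stubs generated from it. The field names, the `InProcessEligibilityService` stand-in, and the amount limit are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class EligibilityRequest:
    customer_id: str
    requested_amount: int  # hypothetical field


@dataclass(frozen=True)
class EligibilityResponse:
    eligible: bool
    reason: str


class EligibilityService(Protocol):
    """The contract: one call, one typed request, one typed response."""

    def check_eligibility(self, request: EligibilityRequest) -> EligibilityResponse: ...


class InProcessEligibilityService:
    """Stand-in for the remote service; the limit is a made-up rule."""

    MAX_AMOUNT = 500_000

    def check_eligibility(self, request: EligibilityRequest) -> EligibilityResponse:
        if request.requested_amount <= self.MAX_AMOUNT:
            return EligibilityResponse(True, "within limit")
        return EligibilityResponse(False, "amount exceeds limit")


def create_loan(service: EligibilityService, customer_id: str, amount: int) -> str:
    # The caller blocks on the answer and branches on it directly.
    response = service.check_eligibility(EligibilityRequest(customer_id, amount))
    return "APPROVED" if response.eligible else f"REJECTED: {response.reason}"
```

The point of the contract is that both sides agree on the fields up front, so a caller cannot send a request the service does not understand.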
What We Replaced (and What We Didn’t)¶
We started small and then went bigger.
Here’s how we split it:
Moved to gRPC:¶
- Loan → EligibilityService (needs immediate response)
- Loan → NotificationService (retries if failed, with fallback)
- Loan → AccountingService (critical ledger update, rollback on failure)
- Loan → ScheduleService (fetches amortization schedule in real time)
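The "retries if failed, with fallback" pattern for the notification call could look roughly like this. It is a sketch under assumptions, not the actual implementation: `ServiceUnavailable` stands in for a transient RPC error, and `FlakyNotifier`, `queue_for_later`, the attempt count, and the backoff values are all hypothetical.

```python
import time


class ServiceUnavailable(Exception):
    """Stand-in for a transient RPC failure."""


def call_with_retry(primary, fallback, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Try `primary` up to `attempts` times with exponential backoff, then use `fallback`."""
    for attempt in range(attempts):
        try:
            return primary()
        except ServiceUnavailable:
            # Back off before the next try, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return fallback()


class FlakyNotifier:
    """Fails a fixed number of times before succeeding (for demonstration)."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def send_sms(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ServiceUnavailable("notification service unavailable")
        return "sms-sent"


def queue_for_later():
    # Hypothetical fallback: park the notification for a later attempt.
    return "queued-for-later"


notifier = FlakyNotifier(failures=2)
outcome = call_with_retry(notifier.send_sms, queue_for_later)
```

Because the caller owns the retry loop, it always ends with a definite outcome: either the call went through or the fallback ran.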
Kept Kafka for:¶
- Audit Logs (fire-and-forget, replayable)
- Data Lake Ingestion (send raw event data for analytics)
- Fan-out Patterns (1 service producing to many consumers)
- Resilience + Buffering for async bulk workflows
So no, we didn’t abandon Kafka.
We just stopped forcing it into places it didn’t belong.
Operational Benefits of Switching to gRPC¶
Here’s what changed after we made the switch:
1. Debugging Became Easier¶
Kafka made debugging hell.
You’d ask: “Did the message go through?” → Check producer logs
“Was it consumed?” → Check consumer logs
“Did it fail?” → Check DLQ
“Did retry work?” → Check offset and retry mechanism
With gRPC, you get a response: success or failure. That’s it.
2. Latency Dropped by 70–80%¶
Most Kafka consumers were polling every few seconds.
Even with real-time topics, processing was delayed.
With gRPC, calls are synchronous, with responses within 10–30ms.
Perfect for eligibility checks, balance validations, and schedule preview flows.
3. Less Infra to Maintain¶
Kafka came with ZooKeeper (or now KRaft), partitions, replication, topic retention configs, consumer groups, lag monitoring, schema registry, etc.
With gRPC, there’s nothing to manage except service deployments.
Our infra team literally said:
“We now get fewer alerts. Kafka used to wake us up at 2 AM.”
What We Learned¶
Kafka is powerful.
But power without clarity leads to confusion.
We had made Kafka our default without asking why.
We thought “event-driven” meant “everything should be async.”
We confused “scalable” with “complex.”
Switching to gRPC didn’t just improve performance; it made our architecture easier to reason about.
Today, when someone says “let’s use Kafka here,” we pause and ask:
- Do we need replayability?
- Do we need fan-out?
- Is this async or sync?
- Is real-time response critical?
If the answer is real-time + clarity + simple interaction, we go with gRPC.
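As a rough sketch, the checklist can be written down as a small decision function. This is one possible reading of the questions above, not a formal rule: the function name, the parameter names, and especially the combined "gRPC call, plus a Kafka event" outcome are assumptions added for illustration.

```python
def choose_transport(needs_replay: bool, needs_fan_out: bool, needs_immediate_response: bool) -> str:
    """Map the checklist answers to a transport suggestion."""
    wants_events = needs_replay or needs_fan_out
    if needs_immediate_response and wants_events:
        # Assumption: answer the caller synchronously, and also emit an event
        # for replay or for other consumers.
        return "gRPC call, plus a Kafka event"
    if needs_immediate_response:
        return "gRPC"
    if wants_events:
        return "Kafka"
    return "no strong signal; discuss"


eligibility_check = choose_transport(
    needs_replay=False, needs_fan_out=False, needs_immediate_response=True
)
audit_log = choose_transport(
    needs_replay=True, needs_fan_out=False, needs_immediate_response=False
)
```

With these inputs, the eligibility check maps to gRPC and the audit log maps to Kafka, which matches the split described earlier.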
Final Words¶
Kafka is amazing. And it still powers a big chunk of our backend.
But it’s not a silver bullet.
gRPC gave us the simplicity, speed, and control we were looking for in service-to-service communication.
And sometimes, the smartest move is to use a tool for what it’s built for and not because it’s popular.
So no hate for Kafka.
But gRPC? That was the upgrade we didn’t know we needed until we tried it.
Himanshu Singour: SDE-2 @ Fintech | Full-Stack Engineer | Top 1% in Global Coding Contests (ICPC, Code Jam, Kick Start) | Strong in HLD/LLD & Distributed Systems