episode 14 | May 5 2026

The One-Second Surge: Debugging a Billion-Device Edge Case at Google with Jay Gengelbach

Beyond the Noise

About the episode

In this episode, Jay Gengelbach, longtime Engineer at Alphabet and current Software Engineer at Vercel, breaks down the story of a hard-to-track-down bug he encountered: a mysterious spike in server errors that hit exactly at the top of every hour (and other "round" time boundaries) across Google Now/Discover’s massive mobile footprint. Jay walks through how the team traced it to battery-saver behavior on certain devices and apps that effectively rationed background connectivity, causing synchronized bursts of requests the infrastructure couldn’t react to in time.

The fix wasn’t just "handle more load," but reshape it: using server-controlled scheduling to create intentional "load divots" so misbehaving clients could spike into empty space. The conversation wraps with Jay’s current work at Vercel scaling CI/CD reliability and speed, plus a measured take on AI: real value, real hype, and a looming reckoning when "free" stops being free.

[00:00:00]

Matt Klein: Alright folks. Welcome to another episode of Beyond the Noise: Signals, Stories, and Spicy Takes, the show where we dig into the stories of the people shaping the future of app-based computing with a special focus on mobile. I'm your host, Matt Klein, co-founder and CTO of bitdrift, as well as the founder of Envoy proxy. Each episode we'll talk with engineers, founders, and technical leaders who have transformed the way their companies build and understand what's happening inside their systems. We'll dig into the challenges, the breakthroughs, the lessons learned, and we'll wrap it all up with their hottest takes.

So let's dive in.

Today we have Jay Gengelbach, who is a software engineer and engineering leader. He spent 18 years at Alphabet growing from an entry-level Software Engineer to a Principal [00:01:00] Engineer through a career that spanned Google's observability, infrastructure, payments processing, and search teams, and then Alphabet's life sciences startup Verily.

In 2025, he joined Vercel where he continues working on web scale infrastructure challenges. Welcome Jay. Thank you for joining us.

Jay Gengelbach: Yeah, it's great to be here.

Matt Klein: Yeah, I just wanna let people out there know, I do not know Jay at all. And how this came about is, um, Jay posted an absolutely awesome bug story on LinkedIn, that we will dive into later.

It just got me thinking... I was like, wow, that is an amazing bug and I love amazing bugs. But, before we go there, how we typically start is just to, to learn a little more about you. So could you just give us a brief intro in terms of how you got into computing in the first place and how you got to Google and some of the exciting things that you worked on there, and then how you got to Vercel currently?

That would be fantastic.

Jay Gengelbach: Man. So, yeah, I have [00:02:00] been programming computers since about fifth grade. I, weirdly enough got exposed to computer programming at my piano teacher's office. They had the old, like blue screen Commodore 64 computers

Matt Klein: Wow.

Jay Gengelbach: In their, their waiting room with some music related software that you could play with.

But, in that era, you know, the Apple II and the Commodore 64, if you booted it up and just started typing at, at the prompt, it supported BASIC. And so, someone like turned it on without a floppy disk in. That, that's when the, the disks, the floppies were actually floppy. And, he just like booted it up and wrote just a little, choose your own adventure.

Just like stupid, little, like text-based BASIC thing, so

Matt Klein: I just like, how did you know though, like how did you know what to type? I mean, was there like a BASIC manual sitting, sitting next to it?

Jay Gengelbach: I mean, so, so yeah. One of the other students just knew what to do. He, he did it and, I just kinda like looked over his shoulder, asked him what he was doing, and learned like...

IF and [00:03:00] GOTO and PRINT or something like that. I learned like the, the very, very basics of, of BASIC. And what I, what I really learned was a search term and of course I didn't have Google to, to type that into, but like from my school library or things like that, I was able to pick up a few books on BASIC and just start tinkering around with stupid things.

Through a lot of middle school, I would do a bunch of, I just would call 'em screensavers. I'd just like write some rules to animate, like a circle around the screen. You know, maybe it would change colors when it hit the wall and like bounced off the wall. Just like, you know, try painting random colors on the, on the screen.

But basically just got into... I - I thought I wanted to make video games. That's kind of where, where it all started was, the, the dream of making video games, but loved the computer programming. And, so it was pretty, pretty into it from an early age. I actually found one, one year I, I would write like [00:04:00] pencil and paper, like...

computer programs in my notebooks at school and then like go home and type them in. And, years later I found one of these and like typed it in and I was a little surprised. I went home, typed it into a BASIC interpreter, and it just ran flawlessly the first time around.

Like, you know, that, that, that's always a surprise, I think for any seasoned engineer is when the, the code works the first time through. And I was like, you know, I, I did this as a, as like a sixth grader and, and it worked. Although-

Matt Klein: I was gonna say I, I haven't had this memory in a long time since you just talked about it, but in high school, I, I also, I think in some computer classes, I do remember writing computer programs by hand.

Like you're saying and just like in the current day and age, you know, where you're sitting in an editor and, you know, these days you're having some AI tool, write half of it for you. Um, it's just pretty, pretty crazy to think about how much has [00:05:00] changed in that period of time.

Jay Gengelbach: Yeah. Honestly, you adapt to the tool that you have, though.

You know, because I was doing it by paper, I spent a lot more time thinking through everything and like making sure it was all correct because the feedback loop was so slow and so painful.

Matt Klein: Yeah.

Jay Gengelbach: Like to, to write 600 lines of code before you, like, run any of it and test it like you behave differently. I, I think people back in the punch card days behaved differently and like they were able to get by with it.

Whereas, you know, one of the jokes I've made in the AI age is like, now I know languages that I also don't know. Like I've written production code in these languages that, like if you asked me to write like a, a for loop on a whiteboard... I probably wouldn't get it right, but I'm shipping code in production because like I, I offload knowing the syntax to, to the AI and I, I just bring, you know, the, the product management to, to the arena.

So, so yeah, I've been programming for a long time. Went to school for computer science and like I knew I wanted to be a [00:06:00] computer programmer, you know, a decade before that. And really, I... so I went to Purdue University and I remember just going to their... Going to the orientation presentation that they had on, on campus for prospective students.

And really the only question in my mind was, what is the name of the major that I want? Like, I knew I wanted it to be, computer programming and like they, they explained the difference between like electrical engineering, computer engineering, computer science, and I was like, computer science. That's the one.

Yeah. Never looked back. That, that's been my niche forever.

Matt Klein: Yeah. Fantastic. And then, um, it seems like you had a, a very long career when you were at Google. Is that the first place that you worked out of school? Um, or, or...

Jay Gengelbach: Yeah. I had a couple of internships... before my senior year I had an internship at Microsoft and actually fully expected to go back to Microsoft after graduation.

But they wanted me to like sign that contract before I left the internship. And I was like, I, I expect to come back, but I wanna keep my options open and I'm glad I did. That was the first year [00:07:00] Google came to Purdue campus and recruited, and they just sold themselves in a way that seemed a lot more exciting to me.

I mean, it, it's weird to say now, like that was the thing where I was like, I don't know if this Google thing's gonna catch on.

Matt Klein: Yeah.

Jay Gengelbach: But maybe it'll be a... you know, it's, it, it may never be a Microsoft, but it'll, It'll be a fun ride.

Matt Klein: Yeah, it, it's also funny how things come and go and ebb and flow. I mean, I feel like in that era, Microsoft was decidedly not cool and Google was cool, but now I don't know who's cooler.

Is it Microsoft or Google? It's actually really hard to say. Um, but it sounds like you worked on a lot of really interesting things while you were at Google. Before we, you know, completely nerd out and talk about that bug and then go from there. Are, are there any, you know, particular things there that you would like to highlight?

Jay Gengelbach: I, I have a lot of good war stories through the years from all the teams that I've been on, I, I like to geek out with them. I- Google's kind of notoriously tightlipped about a lot of [00:08:00] their stuff. And so, sometimes I was like, worried about what I could talk about, but I think I've gotten enough distance from some of these things that the one that we're gonna talk about today, we like made a defensive publication about it.

So like, it's literally public on the internet.

Matt Klein: Yeah.

Jay Gengelbach: A lot of the details there. So I'll, I'll add a little color to it, but yeah, I, I love telling these stories. I mean, yeah, like a million QPS is like a, a medium sized service at Google.

Matt Klein: Yeah, sure.

Jay Gengelbach: So you just run into these absurd sorts of situations.

You know, working at that scale where things that seem, you know, astronomically small odds, uh, just happened to crop up.

Oh man. I remember my, my first project I ran into, a deadlock bug, which, you know, happens from time to time, but it was a three thread deadlock bug where it was not just like two, two things that just like acquired two locks in, in different orders.

It was like, it, it was the classic like, dining philosophers with the chopsticks. It was like, it was, N equals three and like you just needed [00:09:00] all three threads to hit the, the right, you know, the right place at the right time, to trigger this. And that was... It was like, what are the chances? Well, you know, we do this, we do this operation, you know, a couple hundred million times a day, so the chances are higher than you think.

Matt Klein: How did you find the bug? Were you able to get a stack trace? I mean, was it

Jay Gengelbach: Yeah.

Matt Klein: Yeah. So it was pretty easy.

Jay Gengelbach: Great thing about deadlocks...

Matt Klein: easy once, once you got the stack trace.

Jay Gengelbach: Yeah.

As soon as it dumps core, like the threads tell you exactly where they are and what they're waiting on, and it, it, it becomes very clear.
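A minimal sketch of the deadlock shape Jay describes, the dining-philosophers cycle with N = 3, along with one common remedy. The episode does not show the actual Google code, so everything here is hypothetical: the names `locks`, `worker`, and `completed` are invented, and acquiring locks in one global order is a standard technique, not necessarily the fix his team used.

```python
import threading

# Three locks arranged in a ring, like three chopsticks between three philosophers.
locks = [threading.Lock() for _ in range(3)]
completed = []

def worker(i: int) -> None:
    # Each worker needs lock i and lock (i + 1) % 3. Taking them in ring order
    # (i first, then the next one) lets all three threads hold one lock and
    # wait forever on the next. Sorting the indices imposes a single global
    # order, which makes that circular wait impossible.
    first, second = sorted((i, (i + 1) % 3))
    for _ in range(1000):
        with locks[first]:
            with locks[second]:
                pass  # critical section would go here
    completed.append(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=10)
```

With the ring order instead of the sorted order, the hang would only show up when all three threads interleave at exactly the wrong moment, which is why it tends to surface only at a couple hundred million operations a day.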

Matt Klein: Well you also, I mean you, you probably started at Google, I mean really, right when mobile started to blow up, right? So I mean you were there basically...

Jay Gengelbach: Before then. I was there in 2006, so yeah. Pre-iPhone. Pre-Android.

Matt Klein: Yeah.

Jay Gengelbach: Like I remember when they like acquired Android when they gave us all

Matt Klein: Yeah.

Jay Gengelbach: Android phones for Christmas 'cause they like, they just wanted developers to have these and to start thinking about them.

Matt Klein: Yeah, it's more, what's interesting is you, you were there really during the time period in [00:10:00] which web is the thing, right? It's like web is where everyone goes, and you were there during the transition from where, you know, these days I, I don't know what the percentage breakdown is, but for a lot of companies it's like 90% mobile and 10% web, right?

It's just, it's kind of interesting that you were there during that transition, and I think that the bug, you know, that we're gonna talk about is a mobile specific bug. But I'm sure that, you know, you saw a lot like during that period of change, right? Where like the company's focus is web, web, web, web, web, and then it starts to switch more towards mobile, I'm sure.

Jay Gengelbach: Yeah. Well, and, and having done Microsoft right beforehand also sort of experienced that transition from... like local apps to web. The, the team that I interned with at Microsoft was core file services. So it's like check disk and format and things like this. And, the like check disk sometimes runs like as part of the boot loop, like before the rest of the operating system is there.

And so they wanted me to, to do some like caching in [00:11:00] there and I was like, okay, can I use... like a, a red-black tree. Can I use this data structure? And they're like, no, you can't use any of the STL. None of that exists. Like, we don't even know, like we, we've got like 32 megabytes of memory at this point. And that's all we know that we have until we run check disk.

And, you know, every memory allocation you do, you have to be like, did, did the memory get allocated? If not, how do I fail gracefully? And I remember just in my like Noogler training at Google, someone was like, yeah, should you be checking your allocations? And they were like, no, God no. Like if, if you can't get memory, just crash.

There are, there are a hundred other servers doing the same thing, and if one of them dies, oh well, and if they all start crash looping, we'll look into it. But yeah, it, it was really liberating from this environment where like, I can't use any standard libraries and like memory is... undependable and like you don't know if you have it to an environment where you're writing servers on the cloud.

You control the computer that it runs on. [00:12:00] You know what operating system it's running on, like down to the kernel version and you could iterate so much faster at Google under that control. And actually the weird thing is when, when we moved to mobile, there are a lot of things that, that got more interesting and more powerful, but it like throws a lot of that back out the window where it's like,

Matt Klein: for sure,

Jay Gengelbach: you know, when I worked on a, on a mobile app team, it was like, oh, now we have clients running like

Android versions from six years ago.

Matt Klein: Mm-hmm.

Jay Gengelbach: And we need to worry about getting request- like every time we release a new version of the app, we're gonna be receiving that query forever. And we, we need to be, need to be ready to field that query from those old clients, like for the rest of time.

Matt Klein: Well, and it's, and it's not just the old clients, it's that even for quick updates, getting a 90% uptake is still gonna take you 3, 4, 5 weeks. Like just to get that uptake. So I think, you know, that was eye-opening for me as well because I also started my career back at Microsoft. And, [00:13:00] you know, I, I wouldn't be the software engineer that I am today without that type of foundational experience for sure.

But then when I started to move, as you did more into the web scale, systems, it is, it is really eye-opening how quickly you can A. break things but also fix them. But in the mobile world, the code really does effectively live forever. So you, you definitely start thinking a little more about how all this stuff works.

Anyway, why, why don't we dive into the bug, that I originally saw, and then we can go from there. So, I, I would just assume that the folks that are listening don't have any clue what we're talking about. So, why don't you set the background and then we can, we can chat about it more.

Jay Gengelbach: Yeah. So the, the app I was working on at the time, we called it Google Now, it went through a couple different public names.

You know, Google Now, the Feed, Discover. But, I think if you open like Chrome on mobile these days, you see a bunch of like [00:14:00] stories and things you can scroll through.

Matt Klein: Oh yeah, for sure.

Jay Gengelbach: Like cards.

Matt Klein: Yeah.

Jay Gengelbach: That is basically the, the app we were talking about. And it, it went through various iterations of being like assistive where like, you, you could connect your emails and like if you got a...

package tracking link in your email somewhere, it would just like track the package for you and be like, 'Hey, this has arrived,' and things like that. And so sort of like use intelligence based on your location history and your email history to just give you things that might be useful to you. These days

it's basically evolved into, 'here's some news to read,' for, for various reasons. But, yeah, it was very like mobile, mobile-first sort of, application for trying to, you know... it was called Google Now, the, the idea was like, 'what if we give you search results before you even asked for them?' Like, can, can we think about what, what you might be interested in and surface it to you?

And, try to try to determine your interests in various ways and just like, give, give you stories that, that are interesting to read. And so, to have that be a fast on load experience... when you [00:15:00] open it, it does background requests. And so in, in the background it says, okay, if, if they open the app in the next 30 minutes, what should I show them?

And so you, you'd make a request. We cache that so that, you know, you, you get content instantaneously when you, when you open the app and, you know, across the entire Google install base. You know, this, this was basically like built into Android or built into, the Google search app, which are installed on like a billion plus devices.

So if we're having that many devices, making background requests, you know, the system is, was somewhere, somewhere north of, a hundred thousand queries per second. Even at Google, that's a large look- that's a big number. You know? There's, there's not a, not a, a lot of things that, that do that, or like, you know, some of the things that, that are much higher than that are like very simple requests.

And like, we had a, like a, a heavy stateful request at, at a hundred thousand queries per second. And so, somewhere along the way, the SRE team wanted to work on improving reliability there, so we want that query to succeed. I think we were pushing from like [00:16:00] two nines to three nines. Uh, maybe we're pushing towards, towards four nines, something like that.

But, you know, we, we wanted, you know, 99.9% of these requests to succeed in less than, you know, a thousand milliseconds or something like that. So, obviously, you know, when you're doing that, you have certain, like incidents that, that kind of push that number down, but then you also just have your like background churn rate, like your, your error rate where things time out for one reason or another, some percentage of the time.
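To put rough numbers on that target, here is a small worked calculation. It treats the figures quoted in the conversation (about a hundred thousand queries per second, a three-nines target) as exact, which is an assumption. The 30-day window is also an assumption chosen for illustration.

```python
# Figures from the conversation; treating them as exact is an assumption.
qps = 100_000          # "somewhere north of a hundred thousand queries per second"
slo_target = 0.999     # "three nines"

# Requests per second that may fail or exceed the latency bound
# while still meeting the target.
budget_per_second = qps * (1 - slo_target)

# The same budget over an assumed 30-day window, as a request count.
budget_30_days = budget_per_second * 60 * 60 * 24 * 30

print(round(budget_per_second))
print(round(budget_30_days))
```

At this scale a three-nines target still leaves room for about 100 bad requests every second, which is why a steady background error rate matters as much as individual incidents.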

And so we were looking at just our background error burn rate, where like, not really incident driven, but just, you know, things that that happened spontaneously all the time. And, our SREs like zoomed way in on the graph and they saw we have a big spike in errors at the top of the hour, every hour. Why? So we just started digging into that.

And the, you know, this might be a fairly normal thing in a normal company... things that happen at the top of the hour are kind of unheard of at Google. Because yeah, when you're dealing [00:17:00] with billions of devices, like you never tell the device to do something at the top of the hour.

Matt Klein: Yeah, for sure.

Jay Gengelbach: Like even an entry level Google engineer, like quickly, like if they see that in a code review, they're like, 'nope.' Like never, never pick the time. The time is always random. So you always have to randomize your, your callback time for something like that 'cause you have to spread the, the billion requests that you're gonna get over, over an hour, like across the entire hour.

So yeah, it's rare to have these time aligned events. And so, I mean, one of the first things that we, we looked into was like, you know, did, did we screw that up somewhere? And no, we didn't. So this is kind of where, where the things start, kind of where I come in and I start like zooming in on the graphs to see what the heck is going on, what's happening at the top of the hour, and...

Matt Klein: And sorry for this, right now you're looking at server side graphs, obviously, right? Like you're looking at the incoming request rates. So you, you don't yet know, or do you know at this point [00:18:00] what is the client side success rate? Because obviously those, those could be different, right? I mean, you could have tons of, of requests that are going out of the client that themselves are not succeeding for any number of reasons.

So I'm just trying to set the stage here.

Jay Gengelbach: Yeah, the, the SLO we were worried about was our server-side SLO. And yeah, we did have some, back channel reporting. So like for things like latency, we can measure it from the server side. So like from where we got the request, how long did it take? And that's usually what we based the SLO on 'cause we don't control how fast your wifi is.

But, we did also have metrics on, the client side of like, what, what's the client's view of latency and, and things like that. So yeah, as we zoom in, what we see is at the top of the hour, there's just a big, momentary spike. At like the first second of every hour, there's a big spike in load, 15% more requests.

You know, very briefly. We also noticed at like 30 minutes past the hour, 15 [00:19:00] minutes, 45 minutes, increments of 10, you know, 10, 20, 30, 40, like all the, the round numbers. All the round numbers have a spike. Top of the hour is the biggest. Midpoint is, is the second biggest. But it's like, if you pick every number that you can divide, divide it by, like, you see all of them show up, at like at lower rates.

So we start looking into like, why, why would that be happening? The- actually one early theory, so we had a couple different... so we, we built this thing on the client we called request schedules, which was, we sent a bunch of conditions down to the client and, and said basically like, like

call us back if any of these things is true. And it's things like, you know, if you hit this, this particular timestamp, you know, that's, that's the original version, the OG one, that, that was in the app forever. But then, we started adding things like, as soon as you open the app, let's just send a request and refresh the data for you.

And then, then we'll figure out on the client how to like slide the new data in without you seeing it blink.

Matt Klein: Yeah. And, and sorry, for, for this [00:20:00] mechanism, had you already had that in the app or did you build this to debug this issue basically?

Jay Gengelbach: So those request schedules were built over my tenure on the team.

And so like, we proposed them, we built them. And so one of our thoughts is like, we screwed this up. We, we built this wrong. You know, we, we could also do things like geofences, like when you get home, when you get to work, like, we'll, we'll do a, a refresh. And so one theory was, you know, we had this app open request and we're like, what if for some reason people open their app right at the top of the hour? Or they turn their screen on at the top of the hour?
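One way to picture a request schedule is as a small bundle of conditions the server hands to the client, any one of which triggers a callback. The sketch below is a guess at the shape only. The class `RequestSchedule`, its fields, the function `refresh_reason`, and the reason strings are all hypothetical and are not taken from the real app.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RequestSchedule:
    """Hypothetical shape of the conditions the server sends down."""
    callback_at: float                      # server-chosen (randomized) epoch seconds
    refresh_on_app_open: bool = True
    geofences: set[str] = field(default_factory=set)   # e.g. {"home", "work"}

def refresh_reason(schedule: RequestSchedule, now: float,
                   app_opened: bool = False,
                   entered_geofence: Optional[str] = None) -> Optional[str]:
    """Return why the client should call back now, or None if it should wait."""
    if app_opened and schedule.refresh_on_app_open:
        return "app_open"
    if entered_geofence is not None and entered_geofence in schedule.geofences:
        return "geofence:" + entered_geofence
    if now >= schedule.callback_at:
        return "scheduled_time"
    return None

schedule = RequestSchedule(callback_at=1000.0, geofences={"home"})
print(refresh_reason(schedule, now=500.0))                    # no condition met yet
print(refresh_reason(schedule, now=500.0, app_opened=True))   # event-driven refresh
print(refresh_reason(schedule, now=1000.0))                   # timer-driven refresh
```

Returning the reason as a value, and sending it along with the request, is what would let a server-side log distinguish timer-driven callbacks from event-driven ones.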

Like these are some of the events that sometimes result in a request. One of the great things about Google is you have amazing telemetry. Like they have this very detailed log record on every request that that gives you all that information and so we, we were logging, why did we make the request? You know, what was the last schedule we sent to that client?

Things like that. And...

Matt Klein: right, but that's, but, but that's server side, or that's client side telemetry, because that's where just, you know, I'm obviously coming from [00:21:00] a mobile observability company and it, it's, it's, you know, this dichotomy between what you have on the server and what you have on the client, and then you have to line them up.

It, it's actually really confusing. So that, that's why I'm asking these questions. Like I'm trying to understand... for some of the debugging that you're doing, are you looking at the server side or are you looking at independently reported client side telemetry?

Jay Gengelbach: So the client, I mean, obviously includes a bunch of stuff in its request, you know, user agent, Android version.

Matt Klein: Right, right, right. Okay.

Jay Gengelbach: IP address.

Matt Klein: Yep.

Jay Gengelbach: You know, all, all that stuff. So it, it sends up a lot of that stuff. And then on the server side, we log what the client sent us and also what we sent back to the client.

Matt Klein: Got it. Okay.

Jay Gengelbach: So that was the main log source that I was using. We did have some other detailed client side telemetry that we used in various places.

I didn't end up digging very deep into that.

Matt Klein: Got it. Okay.

Jay Gengelbach: I, I do recall actually one of the, one of the really tricky things... in the middle of all of this, there was a big push from Android that's like battery life on Android is not as good as battery life on [00:22:00] iOS. We wanna fix that and, so they started saying like, every, every new feature we wanna push out, we need to run an experiment and measure battery life impact.

And boy is measuring battery life impact of a, a feature flag, a difficult thing to do. And so, yeah, like they, they... I, I didn't build that telemetry, but like they, they had the instrumentation to, that you could basically correlate total battery drain on, you know, on the thing with like the feature flags that you have turned on,

Matt Klein: Right, yep.

Jay Gengelbach: For a particular user. So yeah, so I'm using mostly the, the server side logs. As I dig into it, speaking of like the server side versus the client side... one of the things that I discovered is this big surge in requests... the, the load spike lasts about three seconds or so.

But it all comes from clients... who they believe the timestamp when they sent the request was exactly the top of the hour. So, obviously because of the network, it takes a little while, like our- the [00:23:00] server side timestamp...

Matt Klein: Sure.

Jay Gengelbach: It may, may not be the same for, for two reasons, one of which is just network lag.

It takes a while for your packet to make it to our servers. And two is actually you know, sometimes the client timestamp is newer than the server timestamp because your client has a bad clock. You know, if you've read some of the Google papers on like spanner where we have, like literally, atomic clocks in data centers to, to be really sure

Matt Klein: for sure,

Jay Gengelbach: you know exactly what time it is.

You know, clients don't have that. Most mobile phones do not have an atomic clock embedded and if, if you care about high precision timestamps, like, clock skew is a thing that you, you will know. And, I mean, if you want a fun interview question sometimes figure out how to synchronize the clocks of two, two, uh, devices that are separated by a network.
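For the interview question, one well-known starting point is the four-timestamp exchange used by NTP. The sketch below shows that calculation in isolation. It is not how the app in this story handled time, and the simulated numbers (a 2.5 second skew, 40 ms network legs, 10 ms of server processing) are made up for the example.

```python
def estimate_offset(t0: float, t1: float, t2: float, t3: float) -> tuple[float, float]:
    """NTP-style estimate from one request/response exchange.

    t0: client clock when the request is sent
    t1: server clock when the request arrives
    t2: server clock when the response is sent
    t3: client clock when the response arrives
    Returns (offset, round_trip_delay), where offset is server minus client.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay

# Simulated exchange: the client clock runs 2.5 s behind the server,
# each network leg takes 0.040 s, and the server spends 0.010 s processing.
true_offset, leg, processing = 2.5, 0.040, 0.010
t0 = 100.0                            # client time
t1 = t0 + true_offset + leg           # server time on arrival
t2 = t1 + processing                  # server time on reply
t3 = t0 + leg + processing + leg      # client time on arrival
offset, delay = estimate_offset(t0, t1, t2, t3)
```

The estimate is exact only when the two network legs take the same time. Any asymmetry between the outbound and return paths shows up directly as error in the offset, and that is a large part of what makes the problem hard.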

It's a, not an easy problem. But yeah, so this spike in requests was coming... the clients all believed they were sending this request at the top of the hour and we dug back into it and asked. And so the client says why it's sending the request and they, they're sending the request because they believed it was time to send the [00:24:00] request.

And not because of one of the other conditions that we put into the request schedule. So they said it, it was time, but what we said is like I told you to call back at 1:48 PM and instead you, you called 12 minutes later, you call at exactly two o'clock. So there were a large number of clients that

we told them to call back. So like we did correctly, tell them, you know, call back at a random time and they instead waited for the, the next round number after the point where we told them to call back and that's when the request hit. And so we got this, this big spike in load, from, turned out to be a smallish number of clients.
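A toy simulation of that behavior shows how a small share of clients rounding up to the next boundary produces the pattern on the graphs. All parameters are assumptions for illustration: the 5% share of affected clients, the set of wake-up periods, and the function name `deferred_send_time` do not come from the real system.

```python
import random
from collections import Counter

def deferred_send_time(scheduled_s: int, period_s: int) -> int:
    """Round a scheduled second up to the next multiple of period_s."""
    return ((scheduled_s + period_s - 1) // period_s) * period_s

# "Call back at 1:48" becomes "call at exactly 2:00" (seconds since 1:00).
print(deferred_send_time(48 * 60, 3600))

rng = random.Random(1)       # seeded only so the example is repeatable
hour = 3600
arrivals = Counter()
for _ in range(100_000):
    scheduled = rng.randrange(hour)            # server picks a random second
    if rng.random() < 0.05:                    # assumed 5% of clients are held back
        period = rng.choice([600, 900, 1800, 3600])   # assumed wake-up periods
        arrivals[deferred_send_time(scheduled, period) % hour] += 1
    else:
        arrivals[scheduled] += 1
```

In this toy model every wake-up period divides the hour, so all of them pile onto second 0, fewer of them line up at the 30-minute mark, and fewer still at the other round minutes. That matches the ordering described earlier: the top of the hour is the biggest spike and the midpoint is second.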

It, it was hard to craft the SQL query to, to, to figure out which clients were doing this. Because obviously when you randomize, you know, clients around across, you know, 3,600 seconds in the hour, some of them hit, hit zero naturally.

Matt Klein: Yeah, of course. Yeah.

Jay Gengelbach: Uh, you know, not, not everyone that sends a request at the [00:25:00] top of the hour is doing it maliciously or because of a bug. Some of them are just doing it 'cause they won the lottery.

Matt Klein: But were you, were you logging when the client thought they were supposed to be sending the request? Like the, like either- either the client timestamp or the schedule time, and then I guess you could compare that against the receive time to see how far apart they were.

I'm, I'm trying to understand the process.

Jay Gengelbach: I think part of why the SQL query was hard is 'cause what we logged was what the server told them to do the last time they talked to the server and I had to join that log record with the next one they sent.

Matt Klein: Got it. Okay. Makes sense

Jay Gengelbach: when they said,

Matt Klein: Right, right.

Jay Gengelbach: I'm calling back because I think it's time.

Matt Klein: Yes. Yep. Okay.

Jay Gengelbach: And so, so yeah, there, there was some, some gnarly like log SQL to, to put all this together. But the, the story that takes shape is there are certain client IDs that have- have a strong preference for sending their requests at the top of the hour.

And so, like I did this aggregation of like, [00:26:00] what percentage of the time do you send your requests at the top of the hour because it, like if you won the lottery once, that makes sense. If you won, win the lottery eight times a day... something's up. And so I was able to like identify some number of like client IDs that are, that send traffic at weird boundaries.
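The "lottery" aggregation can be sketched without the real logs. The rows below are invented: each is a client ID and the epoch second of one timer-driven request. The join against the previous server response is left out for brevity. The function name `top_of_hour_share` and the 50% cutoff are assumptions, not the thresholds Jay used.

```python
from collections import defaultdict

# Hypothetical rows: (client_id, request_time_in_epoch_seconds),
# already filtered to requests sent because "it was time".
rows = [
    ("a", 7200), ("a", 10800), ("a", 14400),    # always on the hour
    ("b", 6481), ("b", 9102), ("b", 14400),     # on the hour once, by chance
    ("c", 3001), ("c", 8001), ("c", 13002),     # never on the hour
]

def top_of_hour_share(rows, tolerance_s=0):
    """Fraction of each client's requests that land on an hour boundary."""
    hits, totals = defaultdict(int), defaultdict(int)
    for client_id, request_s in rows:
        totals[client_id] += 1
        if request_s % 3600 <= tolerance_s:
            hits[client_id] += 1
    return {c: hits[c] / totals[c] for c in totals}

shares = top_of_hour_share(rows)
suspects = sorted(c for c, share in shares.items() if share >= 0.5)
```

Client "b" hits the boundary once and stays under the cutoff, while client "a" hits it every time and is flagged. Looking at the per-client rate rather than individual requests is what separates a lucky draw from a pattern.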

And not at the time that we told them to. So then I'm like, okay, I, I know the client IDs. Now what do these have in common? You know, what, what can I figure out about what they're doing? So, you know, I looked by, you know, IP addresses, you know, actually some of these clients, of course, your IP address on a mobile device changes throughout the day.

So, like, you know, re- regionality didn't seem to do it too much. I think it was like maybe slightly more pronounced in India. Or something like that, but like nothing, not nearly a smoking gun. Android version, no. App version, no. So I was like, okay, is it, you know... do they have a buggy client? You know, was was three versions back a bad client that did this?

No.

The first clue that I got was sorting by [00:27:00] Android manufacturer or like device, you know, like the OEM, the, the manufacturer of the device. And of course, you know, at the top of the list you've got like Samsung and, and a few things like that.

So, you know, you've got, you've got the big ones. But, of course Android being open source, like there's a ton of Android manufacturers out there. And so there are like some of these small manufacturers that I had never heard of. But like there's one of 'em where it's like 80% of its users, are bad clients,

Matt Klein: Starting to smell like a hardware, hardware...

Jay Gengelbach: Oh, yeah. So that's an interesting clue. They are not all of them. So they don't have all the bad clients, and not all of their clients are bad, but they're significantly overrepresented versus, like, the base install rate of this OEM.
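Illustrative sketch (not from the episode): "overrepresented versus the base install rate" can be expressed as a lift ratio per manufacturer. The function name, input shapes, and OEM labels below are hypothetical.

```python
from collections import Counter


def oem_lift(client_oem, bad_client_ids):
    """Return {oem: (bad_rate_within_oem, lift_vs_overall_bad_rate)}.

    `client_oem` maps client_id -> manufacturer name.
    A lift well above 1.0 means the OEM is overrepresented among bad clients.
    """
    if not client_oem:
        return {}
    bad = set(bad_client_ids)
    installs = Counter(client_oem.values())
    bad_by_oem = Counter(oem for cid, oem in client_oem.items() if cid in bad)
    overall_bad_rate = sum(bad_by_oem.values()) / len(client_oem)
    result = {}
    for oem, n in installs.items():
        rate = bad_by_oem[oem] / n
        lift = rate / overall_bad_rate if overall_bad_rate else 0.0
        result[oem] = (rate, lift)
    return result
```

A small OEM where most devices misbehave shows a high lift even though a large OEM may still contribute more bad clients in absolute terms, which matches the "not all of them, but overrepresented" pattern described here.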

So, anyway, in pulling on that thread, the other one that was really hard to do... so despite the fact that we have a lot of these logs, actually Google really locks down a lot of the like, personalization logs.

So one of the other theories I wanted to test was what apps do you have installed? And [00:28:00] like Google has that because Play Store is like, is one of the things that Google has. It's not my team. My team is not Play Store. And so like, it took me a while to like apply for access to, to Play Store logs, but I was like, I want to know what APKs are on the, like these particular clients.

And actually my recollection is... I didn't actually get anything out of that, that that didn't turn up what I thought it did.

Matt Klein: Yeah. Right.

Jay Gengelbach: But what we traced this to in the end was... battery saver apps that, these manufacturers, so this particular manufacturer shipped with an app to, to save your battery life.

And that app functioned by turning off wifi, and basically saying like, like background internet requests are only allowed from, you know, at scheduled times. Like we, we will give you like five seconds of internet connectivity an hour or something like that. And we'll, otherwise-

Matt Klein: but you also said that you did see the [00:29:00] issue on like Samsung devices as well.

So was, was every manufacturer doing this in some form or was that a red herring and, and like, it ended up being that it was just a couple of these manufacturers that, that had these apps.

Jay Gengelbach: Yeah. So I, the manufacturer we found is one that like shipped with this installed like as a base feature of their app,

Matt Klein: But then like other people could install such apps.

Jay Gengelbach: Exactly.

Matt Klein: And that's... Got it. Got it, got it.

Jay Gengelbach: You could install your own app.

Matt Klein: Right? Right. That

Jay Gengelbach: did basically the same thing.

Matt Klein: Right. Okay.

Jay Gengelbach: And there, there may be more than one. So I don't recall actually finding... like maybe I just wrote the SQL query wrong, but I don't, I don't remember actually finding an app, but I did find a developer.

I found, I found an OEM, that was doing this. And, again, like Google being Google, I was able to like, reach out through developer re- relations, like confirm they were doing this and be like, 'Hey. This is bad, please don't do this.' And, this was like, was one of the fun things was realizing, this is not specific to us at all.

Like our app is [00:30:00] experiencing this, but these... Like mobile devices are doing this everywhere

Matt Klein: for all apps. Yep, for sure.

Jay Gengelbach: Yeah. And in fact, like quite possibly, you know, there could be places where like the cell towers get overloaded at at one second past the hour,

Matt Klein: right.

Jay Gengelbach: Because a lot of people have this phone.

And there was sort of this moment where I was like... Google might be the only place with telemetry that's like high enough resolution that we actually saw that this was happening and had like, had enough detail to, to put together why it happened and... And, so we, like, we reached out to the phone developer because we're like, like honestly, we could be, we could be influencing like the total packet drop rate across the internet to go down by a fraction of a percent.

Matt Klein: Yeah.

Jay Gengelbach: If, if we get them to fix this bug.

Matt Klein: And then did they, did they fix it or, or...

Jay Gengelbach: Um, so that developer did respond and said, yes, we'll, we can, we can adjust the way this works in the, in the next thing. I did not [00:31:00] follow up on that, so I, I don't know if they ever did. And yeah, I didn't, I didn't find more of them and at some point I, I just got to the point where I was like, okay, that's one developer, but that's just like, just one needle in the haystack and there's clearly more of this going on.

And so then the next question was like, what can we do, to improve on this and, there was a Google exec, I, I think it's Udi Manber, that would tell these stories sometimes where it was like any problem that gets between the user and using Google's services successfully, like it's our problem.

That like if the user has a bad network connection, that's still our problem. Like, can we make our apps work on bad network connections?

Matt Klein: Yeah, for sure.

Jay Gengelbach: If the user doesn't know how to spell the thing they're searching for, you know, I, I, I know, you know, people have been astounded various times through the years at like how badly you can type something and Google's like, did you mean...?

and like, and nails it. You know, you can put your, you know, hands on the wrong spot on the keyboard and type it and Google can autocorrect it for you. And, and so that, that was this philosophy of like, [00:32:00] it's our problem. Like if, if the users are having trouble, don't blame their phone manufacturer. Don't blame their wifi network.

Like, can we fix it? And so we, we kind of got to brainstorming about what we could do and came up with this really peculiar idea. That... so we see this load spike and ultimately the reason that that creates errors for us, it actually hits us so fast. Like it, it's not more load than we can handle.

You know, we have a typical like seasonal curve where, like daily, you know, there's a, there's a high period and a low period, and the peak of that load spike at the low point of the day is lower than like our load at the top point of the day. But the problem is it happens so fast that the load balancers don't have time to respond.

Like if you slowly ramp up to, you know, to 10 million QPS, the load balancers, like they add, they add tasks, they, you know, they shift traffic between various data centers. But when it happens, when it's one second long, like none of our infrastructure had like... can even [00:33:00] detect that the load spike is happening before the load spike is over.

And so, basically the problem was can we smooth this out? Like if we can just smear this out over a, a more normal time window, we can handle the load. We just can't, can't respond that quickly. So we, got thinking about ways to smooth the load. And the, the funny idea that we had was, you know, it's only a small percentage of our customers that are creating this load spike.

These are the people that, like, we tell them when to call back and they ignore us, but most of the people, when we tell them when to call back... they listen to us. Can we move the, the traffic right from the people that do listen to us to make room? Can we like part the Red Sea, just like put a little divot in our, in our load that the, their load spike can fit inside of so that, yeah, so, so that they, they can continue to, to serve traffic and like the, the traffic becomes smooth and flat and,

[00:34:00] so, yeah, so we, we have this defensive publication. We, we put on this, on the way we do it, but basically what we did was, instead of just your typical random number generation, you know, typical RNG of just saying, okay, pick any one of the milliseconds in this hour. We like segmented it into buckets and we're like, okay, we're gonna use a weighted random number generator.

And like these couple of milliseconds are gonna be like, we're gonna hand out fewer tickets from, from that than we do for anything else. And we, so we basically, like we sampled the request rate. The tricky thing was, was again, that this clock skew issue of like the, the ticket that we hand you is not actually when you're gonna come back.
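Illustrative sketch (not from the episode): a weighted random pick over time buckets, where a few buckets hand out fewer "tickets" than the rest. The bucket size, the weight values, and both function names are assumptions; the actual scheme is the one in the defensive publication Jay mentions.

```python
import random


def divot_weights(num_buckets, divot_buckets, divot_weight=0.1):
    """Uniform weights, except the listed buckets get a smaller share."""
    return [divot_weight if i in divot_buckets else 1.0 for i in range(num_buckets)]


def pick_callback_offset_ms(weights, bucket_ms, rng=random):
    """Pick a callback offset (ms into the hour) using per-bucket weights."""
    bucket = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Spread uniformly inside the chosen bucket.
    return bucket * bucket_ms + rng.randrange(bucket_ms)


# Example: 3,600 one-second buckets, with the first second of the hour
# receiving a tenth of the usual share of well-behaved clients.
weights = divot_weights(3600, {0}, divot_weight=0.1)
offset_ms = pick_callback_offset_ms(weights, bucket_ms=1000)
```

Compared with a plain uniform pick over the hour, the only change is the weight vector, which is what makes the divot tunable from the server.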

Matt Klein: Right.

Jay Gengelbach: It's like you're gonna come back, you know, that time plus network lag. So that, that's where it was really a, a pain to implement. But we basically just like did this iterative, algorithm that's like, okay, here's what we asked for. Here's what we got, let's change what we asked for so that what we get, what what we get is [00:35:00] flat.

And so yeah, we, we created basically load divots that matched our load spikes. And then just like iteratively refine it, like every hour you re- like you re-crunch the numbers and redo your random number generator.
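Illustrative sketch (not from the episode): one way to read "here's what we asked for, here's what we got, let's change what we asked for" is as a feedback step on the bucket weights. This version assumes arrivals land in the bucket they were scheduled for, which the clock skew and network lag Jay describes would violate; the function name and the damping parameter are hypothetical.

```python
def refine_weights(asked, observed, damping=1.0):
    """Return new per-bucket weights aiming for flat total load.

    asked[i]:    callbacks we scheduled into bucket i last round
    observed[i]: requests that actually arrived in bucket i
    damping:     1.0 jumps straight to the new estimate; smaller values
                 blend it with the previous weights
    """
    n = len(asked)
    total_asked = sum(asked)
    old = [a / total_asked for a in asked] if total_asked else [1.0 / n] * n
    target = sum(observed) / n  # flat load per bucket
    # Load we did not schedule: misbehaving clients, skew, lag.
    unscheduled = [o - a for o, a in zip(observed, asked)]
    # Leave room for the unscheduled load; never ask for a negative amount.
    wanted = [max(0.0, target - u) for u in unscheduled]
    total_wanted = sum(wanted)
    if total_wanted == 0:
        return old
    new = [w / total_wanted for w in wanted]
    return [(1 - damping) * o + damping * w for o, w in zip(old, new)]
```

Running this once per hour on fresh counts, with damping below 1.0 to avoid overreacting to a noisy hour, would be one way to approximate the "re-crunch the numbers every hour" loop.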

Matt Klein: And, and you were, I think you said that the server was telling the client when to call back.

So I, I think that that was a nice design feature here because you could, you could update it pretty easily on the server side. And I think that's interesting because at least in my experience, or at least in probably more basic systems, most people, you know, do client side jitter and they put the jitter on the client, but the server's not really responsible for telling the client when to call back.

I'm actually curious, was it an intentional decision to have the server tell the client when to call next, just to give you more flexibility? Or, or like were there other reasons that it was done that way as opposed to just a client picking a random number or something along those lines?

Jay Gengelbach: Yeah, it was absolutely intentional so that that request schedule feature...[00:36:00]

the whole reason we built it was so that if we had a new idea of a different way that we could orchestrate the requests that would work better for us, that would be, you know, better quality for the client or better load characteristics for us, or things like that, that we don't have to push a new version of the APK

Matt Klein: Right.

Jay Gengelbach: To, to change.

Matt Klein: Makes sense.

Jay Gengelbach: behavior.

Matt Klein: Yeah.

Jay Gengelbach: So giving us server side control to iterate as quickly as we need to on this.

Matt Klein: Yeah.

Jay Gengelbach: So like, once we had everyone adopt the version that had request schedules, we could then push whatever schedule worked for us.
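Illustrative sketch (not from the episode): on the client, a server-provided request schedule can be as simple as honoring a delay field in each response and falling back to a built-in default. The field name `next_request_delay_s` and the clamp limits are hypothetical, not the actual protocol.

```python
DEFAULT_DELAY_S = 30 * 60       # built-in fallback, like the older hard-coded clients
MIN_DELAY_S = 60                # guard against a bad or hostile value
MAX_DELAY_S = 24 * 60 * 60


def next_delay_s(response):
    """Seconds to wait before the next request, as directed by the server."""
    raw = response.get("next_request_delay_s")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool) or not isinstance(raw, (int, float)):
        return DEFAULT_DELAY_S
    return min(MAX_DELAY_S, max(MIN_DELAY_S, raw))
```

Once clients read the schedule from the response, changing the policy is a server deploy rather than an app release.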

Matt Klein: Yeah.

Jay Gengelbach: And we - we ended up leveraging that later.

Actually there, there was a period where it was very hard to get server resources and we had, we had a feature we wanted to launch that was gonna like, add a bunch of users and it's like, well, how do we add users without adding servers? And so what we did was ratchet the, the request schedule up and down.

So like the, the very old versions of the app, just like hard coded on the client, you know, make a request every 30 minutes, every hour or something like that. And we ended up tuning that. So, again, the, the [00:37:00] battery life issue where they were like, Hey, we don't like that you're- you're using so much battery life.

And, one of the compromises we made was, okay, how about light users of our product? People that, that aren't actually opening and interacting with this? We, we dial them way back. They, they make a request once a day. And the people that are actually opening the app and using it a lot, we crank them up.

And you know, it turns out we have a, a typical like 80/20 Pareto curve.

Matt Klein: Yeah.

Jay Gengelbach: And you know that there's a lot fewer people that are heavy users than, than that are casual users. And so that was a way we were able to like save battery life on the people that don't use the product, save server resources that we're spending doing computation for people that aren't gonna, aren't gonna look at the results.

And then that's all fully under the server control. And so we had plenty of room to, to fiddle with the experiments without having to like ship a new APK every time and wait for a billion devices to install it.
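Illustrative sketch (not from the episode): a server-side policy that dials light users back to once a day and cranks heavy users up. The engagement signal, the thresholds, and the intervals are invented for the example; only the "light users once a day" idea comes from the conversation.

```python
DAY_S = 24 * 60 * 60


def callback_interval_s(opens_last_week):
    """Choose how often a client should call back, based on engagement."""
    if opens_last_week >= 7:
        return 15 * 60      # heavy user: refresh often
    if opens_last_week >= 1:
        return 60 * 60      # occasional user
    return DAY_S            # light user: once a day


def daily_requests(opens_per_user):
    """Total background requests per day for a population of users."""
    return sum(DAY_S // callback_interval_s(opens) for opens in opens_per_user)


# 80 light users and 20 heavy users, versus everyone on a fixed 30 minutes.
population = [0] * 80 + [10] * 20
tiered = daily_requests(population)               # 80 * 1 + 20 * 96 = 2000
fixed = len(population) * (DAY_S // (30 * 60))    # 100 * 48 = 4800
```

With a skewed usage distribution, most of the saved requests come from the large group of users who rarely open the app.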

Matt Klein: Yeah, I, I mean, it, it's, again, I'm obviously biased. It's really the entire purpose of our company, but I mean, you know, it, it's...

I think a lot of times people come to mobile and they try to [00:38:00] apply like server style methods directly to mobile, right? You know, whatever, they put the, the, the jitter on there, you know, it's like they assume that things can get updated right away. And as we're talking about, I mean, on mobile, A, like, things live forever, and B, the updates take forever. I mean, so it's like the more you can do without having to ship new code, obviously the better. So yeah, that is, that is pretty interesting.

Jay Gengelbach: Yeah. I mean, this, this is what leads to like embedding Lua interpreters into your client

Matt Klein: mm-hmm.

Jay Gengelbach: And things like that.

Like, can I have some way of shipping soft code to the client because

Matt Klein: Right.

Jay Gengelbach: You can iterate so much faster.

Matt Klein: Yeah.

Jay Gengelbach: And yeah, personally, like I, although I worked on the mobile product, I'm not a mobile developer and never have been. Like I, I'm the server side guy. I'm the server side infra guy. But yeah, like it's... because of these like version skew issues and things like that. Yeah, there, there's a lot of power in like trying to empower your server because your developers are a lot more agile on the server than, than they are on the client.

Matt Klein: One thing that I've been asking people that I would love to get your take on is obviously within the industry we have, you know, we have a really good [00:39:00] understanding of server side observability.

We've built up good SRE server type practices over the years and, you know, to this day, there's nothing like an MRE, like a mobile reliability engineer, right? It's like observability is still relatively immature on the mobile side. And one, one thing that I'm interested in and, and I'd be curious to ask you, since I'm sure Google, if any company was gonna think about this, would have thought about that well, is, why do you think that's the case? Because these days, I mean, the way that most people interact with these systems is through the app. So it's like even if you're a server side reliability person... ultimately, if you want people to enjoy using your services, like you have to make it work well all the way out to the phone.

And I guess I, I'm just, I think it's interesting that to this day there really still is a bifurcation of, you're a, you're a server, SRE, and like I'm not a mobile developer and I don't, I don't mean that in [00:40:00] a bad way, it's just kind of the way that people think and then, or you're a mobile developer and you know, maybe you think a little bit about server, but, but not that much.

It's still very segmented. And I guess do, do you think that is good? I mean, do you think that is necessary? Do you think that there's things that we could improve there?

Jay Gengelbach: I mean, it's a good question. I think, and like even on the, the... I mean on the traditional web side, you still have that like frontend backend divide.

And so I think part of it is that, although even within mobile, there's like, there's also a frontend and a backend there. Like there's, there's part of the mobile that's just like, you know, making the pixels render smoothly, you know, smooth scrolling, like how do, how do I use the memory well and use the rendering libraries well, you know, to, to make it display correctly.

And then there's just like the backend side of orchestrating my network requests and, you know, local storage and local database and, you know, caching and, and a lot of things like that. I mean, yeah, I think, I think it would be really valuable... [00:41:00] My dream is that, we, we develop these like useful abstractions that you can just sort of, you know, the, the dream is write once run it- run anywhere.

And, you know, for, for decades now, there've been people, you know, promising that with various degrees of, of reliability. You know, Java was one stab at, at write once run anywhere and, you know, achieves it to some extent and, and also fails to some extent. I mean, yeah, the reality is that you, you develop these like special, specialties within computer science and software development for various needs that you have.

I mean, yeah. O- over time, you know, we've, we've got a lot fewer people specializing in, you know, writing machine language than we, than we did 30 years ago. You know, driver developers, you know, is a, I think smaller, smaller than it was before. You know, things like USB sort of made, made a lot of that.

Matt Klein: Yeah.

Jay Gengelbach: More obsolete.

Matt Klein: It's more, it's more that I, I think that there aren't that many people [00:42:00] in the industry that are actually focusing on like the types of bugs or the types of issues that we were just talking about, right?

Jay Gengelbach: Yeah.

Matt Klein: 'cause it's like you... then now like you're a quote server side person. But that bug, I think it's an awesome bug because it really is the interaction of, of pretty complicated mobile specific functionality and the realities of mobile devices and hardware.

But also there's a lot of, there's a lot of server interaction as well. And, I think, I think just in general as an industry, we don't do a great job of like looking holistically across the stack. I would imagine that Google does it better than almost any other company, but like at least in the larger industry, it is pretty rare actually for people to kind of like look at the big picture across those two sides.

Jay Gengelbach: Yeah, I mean, I, I think it's hard. I mean, I think this is one of my superpowers is like finding bugs like this, like finding these really complex like race condition [00:43:00] interactions. That that's something that I do well and recognize like it's not easy 'cause it requires... I mean, yeah, it, it's hard to be full stack.

It's hard to really understand all the layers between you and the client well enough to diagnose the bug and like I,

Matt Klein: yeah,

Jay Gengelbach: I didn't have to be an expert on client side rendering to figure this out, but I had to have some understanding of like what it's like to be on a client, how the client's network works, and the extent to which we have control over the client's network stack and,

and yeah, it, it is hard to be an expert on more than one thing. And so, I mean, that that's why we gravitate towards this. And, you know, that, that's why a lot of these kind of bugs like just get chalked up to like, yeah, I don't know, 'we, we fail requests at the top of the hour.' Like, I guess, I guess we're just gonna live with that.

Uh,

Matt Klein: yeah. Well, anyway,

Jay Gengelbach: It's hard and expensive to, to diagnose them.

Matt Klein: Thanks for sharing. That is, that is a fantastic bug. I personally love fantastic bugs. In our last little bit, just tell us a bit, you obviously left Google, now you're at Vercel. Tell us a bit about what you're up to now. Like what, what [00:44:00] excites you in technology currently?

Jay Gengelbach: Yeah, I, I mean, I, I do still very much love the server space. You know, Vercel is a lot faster moving than, than Google was, because it's a lot younger and able to be more agile. But I, I love the agility of the server side that you control the machines that it runs on. You know exactly what operating system they are and things like that.

Matt Klein: Yeah.

Jay Gengelbach: And, and that gives you this ability to just like push code fast, whereas... yeah, both the pre-web days of like, we're shipping this out on CDs and we can't reprint them, and so like, it has to be right. Or even the mobile day of like, we're shipping this APK and, you know, whatever, if it does something bad, we're gonna have to live with the consequences of that for, you know, at least an upgrade cycle.

Those things, I think are, are difficult. So I like, I like being in the, in the web space and... Yeah, I am at Vercel working on the CI/CD team, basically working on making compiles fast so [00:45:00] that, you know, you push code, you get, you get a new build deployed as fast as we possibly can manage it. And then also the various reliability guarantees of like, we can detect if your deployment goes bad and help you, help you roll it out safely, help you roll it back when it... you know, if it has issues.

Matt Klein: Sorry, and, and, and that's for internal CI/CD or for customer CI/CD?

Jay Gengelbach: Yeah, customer CI/CD

Matt Klein: Got it, got it. Yes.

Jay Gengelbach: Vercel customers, you know, you push code, you want a new website as fast as possible.

Matt Klein: Of course. Right.

Jay Gengelbach: That, that's, that's our whole org.

Matt Klein: Yeah. Cool.

Is there a particular part of what you're working on that is giving you the biggest challenge right now? I, I guess like what, um, what are, what are the things that you're trying to improve, I guess? Like what are the, what are the levers that you're pulling currently?

Jay Gengelbach: I mean, Vercel is in this classic like scale up startup location where there's a lot of stuff that was just like built to be a proof of concept to, to see if customers were interested in adopting it and customers have been adopting it. And so, you know, it's been growing, I mean high double digit percent [00:46:00] year over year, you know, growth. And so then of course you very quickly hit some of the scaling limits of some of these things.

I mean, like, it, it's an intelligent choice for a startup to make, to like, to, to move quickly and not necessarily like design everything to be absolutely rock solid. Um, the thing I like is, I'll say like, rather than building from zero to one, I like scaling from one to 10. So I like taking a system that like we, you know, we've proved that, that, that the business idea works and now we just need it to scale.

Like we, we need it to not fall apart. And so, I'm doing a lot of work just trying to like, get everything to not... fall apart from like the duct tape that has built up over, over time and just like basically, you know, rebuild, rebuild the entire plane while it's still flying. I think that's a, a fun and fascinating set of problems.

Matt Klein: It is fun. Yeah.

Jay Gengelbach: And

Matt Klein: it's, I have, I have done that at, at many different points in my career. And there is a particular challenge of... keeping things running while you make significant changes. Um, it is a, it is a very difficult thing.

Jay Gengelbach: Well, I think it's- there's a really fun prioritization challenge. 'cause like, I [00:47:00] mean like any established org you have a list a mile long of things you hate about your system.

Like there's so many things that you wish you could fix. And one of the hardest problems is like, which one are we fixing now? So, one of the things that, that I drive in our organization, I own our technical roadmap, which is, you know, distinct from the product roadmap. You know, PM gets to decide what, what's good for the customer.

And I get to decide what, what's good for the platform. And basically decide like of all the things that I wish I could fix. Like, I, I have this many engineer hours to spend. What, what next? And I, I, I mean that, that's its own fascinating puzzle because, you know, of course you know, you, you can imagine this perfect system, but it's, you know, 25 engineer years to get there.

You don't have that. So, you know what, what's the most impactful decision you can make right now that makes your system better?

Matt Klein: Yeah. Cool. Well, before we wrap it up, do you have any, tech predictions over the next few years? Obviously there's, there's lots of things changing right now. I, I don't know if [00:48:00] you're a... AI proponent or doomer or somewhere in the middle or any other topic, but... tell, tell us where you think things are heading.

Jay Gengelbach: I mean, I, I'm somewhere in the middle. I, I am braced for the AI bubble to burst, at some point and have been thinking about, heck, I've been looking at like my investment strategies of like, if I really believe it's going to burst, what, what do I do? Or like, thinking of

Matt Klein: Same, same.

Jay Gengelbach: Like,

Matt Klein: Same here for what it's worth. Yeah. Yeah.

Jay Gengelbach: Jensen Huang, like the, the leader of, Nvidia, I was like, how do you lead the company if you believe you're in a bubble? Like I know publicly he has to project like, this isn't a bubble. Like there's, there's real stuff here. But it's like, I'm sure he knows there's a good chance that like his company could lose 50% of its value and still be extremely valuable

Matt Klein: for sure

Jay Gengelbach: that, that that could happen.

How do you even lead a company through that? That's a big question on my mind, but, really, like, I, I view this kind of like the dot-com bubble where, like there are some bubbles where like there's just nothing of value there. You know, I, I, I happen to believe that like, the crypto [00:49:00] boom was sort of that, like, there, there's nothing actually interesting in there, but, dot-com, like there were the ridiculous things like Pets.com that just like, just... bought the buzzword and failed and lost a bunch of investor money. And then you had, like Google, you had a bunch of these big companies, so like there were new billionaires that were created by, by that boom and bust cycle. And so I, I think AI, like, there's clearly some value there. I, I think there's also clearly a bubble that, that it's being sort of, you know, overhyped.

And so, like I expect the bubble to burst, but that doesn't mean that I expect... there to be nothing there. And I

Matt Klein: for sure,

Jay Gengelbach: I think, you know, there are gonna be some new billionaires that, that show up, you know, probably trillionaires at this point, uh, you know, from the value that that gets created. And so, so yeah, I'm like neither doom and gloom nor like a big proponent, like I- there's value, and there's also hype.

And at some point, like those two are gonna meet each other. The other thing I, [00:50:00] I'm really looking out for, like right now, all these models are just being given away for free 'cause they're in the like, market consolidation phase of like, if, if we give it away for free, we'll get them hooked. And like what happens when, when you have to start paying market price for, you know, for the electricity that you're, that you're using instead of borrowing it from Silicon Valley investors.

And so I, I think there, there's gonna be a great reckoning when that happens of like, there's a fair number of things... you know, sometimes I'll chat with, with ChatGPT just to like bounce around an idea I have and I would not pay 20 bucks for that interaction.

Matt Klein: Yeah.

Jay Gengelbach: But, I will steal it for free.

Matt Klein: Yeah.

Jay Gengelbach: And you know, when, when you know, when you start having to, to pay for, for that, you know, what, what things go away, and what things really like carry, carry weight. I think that's the interesting problem that, that we're coming up on.

Matt Klein: If, if someone out there were to ask me to give my opinion on the question that I just asked you, I would probably tell them just to listen to your response because it's pretty much exactly what I think.

Also, um, I, I, [00:51:00] I think the tools are, are real. I think there are real benefits. I don't think the value is quite as high as, as the valuations that are happening right now, but who knows? Maybe you and I are wrong and we'll see. So. Anyway, gonna wrap it up. But thank you so much. This was a great conversation.

That is an absolutely fantastic bug that you shared. I think people are gonna love that story. So thank you for coming on, and that's a wrap for this episode of Beyond the Noise Signals, Stories, and Spicy Takes. Huge thanks to Jay for joining and sharing his story. You can find this episode and all past ones on the bitdrift YouTube channel.

If you had fun, drop us a review, tell your friends or yell your favorite hot take into the void. Just make sure to tag us. I'm Matt Klein and I will see you next time. Thank you, Jay.


