
episode 4 | December 2 2025

In Praise of Bugs: P-Y Ricau on the Joy of Breaking Things

Beyond the Noise

About the episode

In this episode of Beyond the Noise, P-Y Ricau, Principal Engineer at Block (formerly Square) and a legend in the Android community, sits down with bitdrift CTO and Envoy creator Matt Klein. P-Y shares his journey from tinkering with early Android phones in 2009 to becoming one of the most influential voices in mobile performance and observability. He recalls joining Square through a GitHub issue, the company’s culture of open source engineering under Bob Lee’s leadership, and the early years of building Android tooling when the platform itself was still maturing. P-Y recounts debugging notorious early-era Android race conditions, contributing to major libraries like Dagger and creating LeakCanary, and the quirky realization that his “useless” college signal-processing classes later powered Square’s first card reader.

The conversation digs into the deeper technical and philosophical lessons from 16 years of Android engineering, from JVM garbage collection quirks and ANRs, to the evolving meaning of app quality. P-Y argues that the industry’s obsession with crash rates misses the bigger picture: real user frustration often comes from slow, unresponsive, or invisible failures that traditional tooling can’t see. He and Matt discuss why mobile observability still lags behind backend systems, how Block approaches performance at scale, and why the next frontier isn’t just AI-generated code, it’s AI introducing new kinds of bugs for curious engineers to hunt down.

[00:00:00]

Matt Klein: All right folks. Welcome to Beyond the Noise: Signals, Stories, and Spicy Takes, the show where we dig into the stories of the people shaping the future of app-based computing with a special focus on mobile. I'm your host, Matt Klein, co-founder and CTO of bitdrift and creator of Envoy. Each episode we'll talk with engineers, founders, and technical leaders who transform the way their companies build and understand what's happening inside their systems.

We'll dig into the challenges, the breakthroughs, the lessons learned, and wrap it all up with their hottest takes. So let's dive in.

Today I'm thrilled to have P-Y Ricau, who works as a principal engineer at Block... the company previously known as Square. He's been having fun with Android for the last 16 [00:01:00] years.

And he has more recently been focusing on performance and observability. Thank you so much for joining us. I just wanna say I'm very excited. Those of you listening, probably most of you, do know P-Y is a legend in the Android community. So very, very excited to have you. Thank you.

P-Y: Thank you, Matt.

That's... love to hear that, and thank you for the great intro. I'm excited to be here.

Matt Klein: Yeah. I think the way that we typically get started is, you know, you've had a... very exciting career, you're obviously extremely well known in the larger mobile community. You also have done a lot of open source as I have, so we can talk... a lot about... open source and the pain and the glory of that as well.

But would love to start by just understanding... you know, a bit about your career, how you got started in Android, you know, what excited you about it. So, give us the breakdown of just how you, how you got into this space. [00:02:00]

P-Y: So I got into Android... in 2009. Kind of accidentally really, it was my first... internship out of college.

I-I was still in college actually at the time and, and you know, the last internship. And, and I just, my boss was like, 'Hey... can you try this new thing called Android and just build a sample app?' You know, he had bought a G1 and so I did that, and I was hooked. Though... At the beginning, I wasn't sure I wanted that to be my career because I, you know, I was doing... Java backend code and I thought that was where the real engineering was.

And so I thought Android was like... wouldn't be like serious engineering and, and I wouldn't have a very serious career if I, if I did that. But I, but I was hooked from the very beginning because of the, the ability to write some code and immediately see something on the screen and touch it. It was really, really nice to get that sort of like toy aspect of like... playing with things.

Matt Klein: Yeah, I mean, I - I [00:03:00] actually, I - I've said this on a couple of different... calls at this point or podcasts, but most people don't know. I actually, also got my start doing mobile development. I mean, my career, I'm - I'm known as a backend infrastructure engineer, but actually started working on Windows phone.

So this is pre... pre-iPhone, pre-Android. And I also remember the same feeling, that it was really fun to, to hold that thing in your hand.

And I'm sure you would agree that, you know, the - the engineering involved in these constrained systems, I think in many ways is more complicated than what has to happen on, on the backend.

P-Y: I don't... I don't wanna get into that fight. I think, you know, I think today, writing backend has evolved from, you know, just writing a little controller or something to like dealing with infrastructure and a whole bunch of different things that I have no idea how to do. So yeah, it's just different types of complexity.

But you know, [00:04:00] I know that we're typically blind to the things we, we don't really know, right? So if, if you're not a mobile engineer, you might be like, well, how hard can it be? You know, it's just a couple, like... couple screens. It's just buttons, you know?

And then you - you know, you end up at Square where we have, you know, mono repos for like a single app that's like millions of lines of code and hundreds of contributors every month. And, and it's a lot of complexity that you have to handle.

Matt Klein: Yeah, of course. So I - I would absolutely love to get into that because I think most people would be fascinated to know, what are the big challenges that you're facing there and what are the big trends that you're seeing in this space.

But before we do that, I, I'd love even just to step back a bit, and you talked a bit about your internship. I'd love to know kind of the - the early history of how you wound up at Square and Block, and just tell us a little bit about... obviously, as I mentioned before, you know, you're very well known in the Android [00:05:00] community.

I think a lot just for a lot of the open source work that you've done in the space and your general leadership. So I, I just like... I'd love to explore a little more

P-Y: yeah.

Matt Klein: How did that happen, right? It's like, how did you wind up there? How did you, you know, get the ability to actually have a lot of that industry focus?

P-Y: Yeah. That's a good question. It, you know, it sort of just happened. I've, I've done open source for a long time, even before I was out of college. I started, you know, it was just a... just made sense, right? Like, share your code with others... 'cause when you're not working yet, you don't know what to do with your code, you just want to share it with the world. So I was doing open source back then. And I just got in the habit of doing that and I liked sharing what I, what I learned, you know, so presenting. And so I think that - that sort of went together with that.

One of the big milestones for me... so one of my early successes [00:06:00] in open source, I guess success as in an early project with adoption, was something called Android Annotations. And it was a little ridiculous. What happened is I went to a Java user group one night and someone was presenting... annotation processing, you know? So you put annotations in code and you get... it's kind of like a compiler plugin, some code that executes at compile time and lets you generate more Java code.

And I used that to create something that, in a way was similar to Google Web Toolkit. Cause I'd been using Google Web Toolkit at the time and there were some nice utilities that I liked from that, that I didn't have on Android, you know? They had ways to tie the UI to... to the code and those kind of things.

And so I built this thing called Android Annotations. When I wasn't working on Android, I was doing... Java backend at the time and it was just sort of like a - a side project to sort of like play with things and it sort of took off. And I was always surprised 'cause I'd actually [00:07:00] never really used it. And what led me to the next big step was that... one of the things that had been my goal at the time was I wanted to build... a DI tool that would work at compile time, mostly because I had seen the Java spec for that, the @Inject spec, that had a little comment at the end saying, 'Hey, you could also do this at compile time.'

And I was like, oh, annotation processing, I can do this. I could never do it. I was just not smart enough. I couldn't figure it out. And then Square... released Dagger, which was exactly that. So when I saw that, I just posted... a little GitHub issue saying, 'Hey, this is cool. Can you make like a plugin architecture? 'Cause I wanna integrate your thing with my thing so that I don't have to deal with DI; you are doing it.' And, uh, Jesse Wilson, who is... you know, a big open source... he's got so much impact in the community, right? He just emailed me and he was like, 'Hey, um, are you interested in, in [00:08:00] joining us?' And that was, that was, that's how I joined Square.

And that, that's basically the, the story of me joining Square is just like filing, uh, a GitHub issue. And he was like, do you wanna interview? And, uh, yeah.
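To make the compile-time DI idea P-Y mentions a bit more concrete, here is a minimal sketch in plain Java. It is not Dagger's actual output or API: the `Engine`, `Car`, and `GeneratedCarFactory` names are hypothetical, and the "generated" factory is written by hand to show the kind of code an annotation processor could emit, next to a reflective version that does the same wiring at runtime.

```java
import java.lang.reflect.Constructor;

final class CompileTimeDiSketch {
    // Hypothetical types used only for illustration.
    static final class Engine {
        Engine() {}
    }

    static final class Car {
        final Engine engine;

        Car(Engine engine) {
            this.engine = engine;
        }
    }

    // Runtime approach: discover the constructor and its dependencies reflectively.
    static Car createWithReflection() {
        try {
            Constructor<?> constructor = Car.class.getDeclaredConstructors()[0];
            Class<?>[] parameterTypes = constructor.getParameterTypes();
            Object[] arguments = new Object[parameterTypes.length];
            for (int i = 0; i < parameterTypes.length; i++) {
                // Each dependency is assumed to have a no-argument constructor.
                arguments[i] = parameterTypes[i].getDeclaredConstructor().newInstance();
            }
            return (Car) constructor.newInstance(arguments);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not build the object graph", e);
        }
    }

    // Compile-time approach: the kind of plain factory a code generator could emit.
    static final class GeneratedCarFactory {
        static Car create() {
            return new Car(new Engine());
        }
    }
}
```

The generated variant has no reflection at startup, and a missing dependency becomes a compile error in the factory rather than a runtime failure, which is the usual argument for doing this work at compile time.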

Matt Klein: You know, it's, it's funny because I've given a lot of career advice over the years and... I think, it's not uncommon that that happens. And I think a lot of times when I'm talking to younger engineers and they're asking, I don't know, it's like, how, what's the best way of finding a job?

Or how do I do this? Or how do I do that? And I do think a very real strategy is getting involved in open source. It's like the number of people that I've seen get jobs through even small contributions... just because people are always looking and it's such a great way of almost doing a... You're almost doing an interview.

I mean, it's like if you can write the code and you can comment respectfully on the PRs and all of those things, it's a great way of learning about someone before they show up.

P-Y: It's one of the things where you - you know, maybe [00:09:00] if it's, it's okay for that to be in the back of your mind, but it shouldn't be your main driver for like doing, doing open source.

Matt Klein: Sure, yeah.

P-Y: Cause you're gonna get tired fast. But, it is one of the ways that, that you get in. And so I joined Square. Which, for me, it was like... you know, this company that early on, the... so the thing with Android, I - I, if you're not a - a mobile engineer, you might not know that... Google was working on Android at the same time that Apple was working on iOS.

But, you know, Apple got there first. That put a lot of pressure on the, on the Android team at Google. So for a couple years they were basically catching up. And that meant that they were focusing on shipping a lot of features really fast, not focusing on designing the best APIs ever, building the right frameworks and all of that.

And that left a big gap in the Android community. And add to that the fact that, because Android was Java, a number of the first Android developers were Java backend engineers, and they were used to like Spring and all of their frameworks and [00:10:00] utilities. And so, people were trying to fill that gap.

Uh, and so Square, very early on, started building a lot of tiny libraries that were incredibly useful. And so to me, Square was like... this amazing company. I had no idea what the business model was like, what Square was doing, I just knew they, they did open source. So when I got the opportunity to join, you know, I jumped in, um, and started, yeah, contributing. Yeah.

Matt Klein: Well, I - I was just gonna ask, I mean, because you, you joined relatively early at, at Square, and I guess actually don't know this, do you know how did Square choose Android? Right? Because it's a - it's a pretty young company picking a pretty young technology, so I don't know, like, do you know the history of how that happened?

P-Y: I think it just is a, a happy, I mean, it's not an accident, but it's, what happened is Square hired Bob Lee as a CTO, I think back, I forget [00:11:00] exactly when, maybe 2010, maybe around those times. And he was, he had been working on Android at the time and he was, um, you know, one of the local celebrity... celebrities in the, in the Java community doing amazing

talks and, and his job as a CTO was a bit more, um, outward focused than you'd expect. His job was basically make some of the best engineers wanna work in this company.

Matt Klein: Yeah.

P-Y: And so he did that and he did that through open source, like, fostering that culture and then through public tech talks.

And so I think that's... it wasn't so much a choice of Android, but rather a CTO that had strong Java background that hired, you know, brought in a number of folks. And for Android, we just hired, we, we typically... there wasn't really many people who had Android background, so we just hired

Matt Klein: Right. Yeah.

P-Y: Uh, you know, very senior Java developers who had open source background and we just learned Android on the job for

Matt Klein: Yeah.

P-Y: The most part. And so I think that's how the, the focus came... you know, and then we hired like Jake [00:12:00] Wharton and Jesse Wilson, a number of folks like that, um, Ray Ryan, that had... they went all in on this, they were just... it's just the way that they work and it created that culture. One thing that was really interesting and surprised me is that there was never, um, a clear... like it was part of the engineering culture and, and something that would be repeated, but not something that was encoded in processes.

Like we didn't have an open source manager or any role like that or anyone thinking strategically about all of that. Um, we just had people doing open source.

Matt Klein: Yeah, I - I mean, I think it's a fantastic way of-of doing things. I would imagine, though, that very early on you were having to build a lot of this, a lot of these niceties on top, but I would imagine you were dealing with a lot of platform bugs

as well. I mean, because the,

P-Y: yeah,

Matt Klein: I mean, it's pretty early on, right? So as you said, they were shipping fast. Um, so I, I'm just curious. You know, I - I mean, I actually love bug stories, if [00:13:00] there's a particular early bug that you wanna share, but more generally... what was it like to work, you know, on the platform that early?

Because I would imagine that, you know, Square and Google at, at... during that time, I would imagine that you became a fairly important customer to them, right? So it's like, I, I would imagine that, that you worked together fairly closely. So I'm curious what, you know, what that experience is like.

P-Y: I don't think that was ever really the case.

Matt Klein: interesting

P-Y: Square, so, relatively speaking, Square initially was the point of sale, and then we had Cash App, but the point of sale was a fairly small base of customers compared to like the bigger consumer apps.

Matt Klein: Yeah, right.

P-Y: So from that angle, not that important to Google. And then from the contribution angle, again, the Android team was so focused on shipping, they didn't really pay attention to, and they're, you know...

the paying attention to the community came a lot later. And so today the landscape is very different. They're building a lot of additional tools that are [00:14:00] amazing, but for the most part, early on, you know, the engineers at Google, they, they were working crazy and they were not paying attention to the community.

And so it was... we had ties, but they were more, um, you know, relationships. But I - I wouldn't say. Yeah, it's interesting. I wouldn't say that we had like a, a direct clear relationship. We had, you know, here and there. I have so many bugs, like over the last 12 years that I've opened.

Matt Klein: Sure. Yeah.

P-Y: That, you know, so most of them just get a little bit of traffic. And then they get closed as obsolete because they haven't been updated in a year. It makes me mad, to be honest. And, but you know, like I'll spend an hour or more, I'll spend a lot of time creating a repro case, providing details, and then it's like, 'We passed this on to our engineering team.' Two years later: closed as obsolete.

And you're like, and the bugs are still there. One of the... early on at Square, one of the most [00:15:00] entertaining bugs that I ran into... so I joined the company, um, and you know, I didn't join to work on open source. I joined to contribute, right? So I was given some, some bugs and one of the bugs I kept running into was some weird null pointer exception, but the thing was, we were trying to get some resources and the resources were null. And, and that's not supposed to happen ever. And so... and you know, as I started looking into this, I started noticing a pattern. And the thing about me is I'm very stubborn. And I - I, while some of our, you know, previous contributors had just, like, tried quick fixes and closed the bugs, I was like, I'm not letting this go.

I'm gonna figure it out. And eventually I found that there was a race condition in the app installer of the OS where it would... basically when you got an update, the app would get... so, you know, your binary is in a file, a zip file that's called an APK. And so the way that this works is the new file gets downloaded, [00:16:00] then the app is killed and eventually restarted on, you know, the new binary.

And that's what would normally happen. But there was a race condition where... right after installing the new app, and after having killed the old app, there was what's called a broadcast, so some sort of background intent telling the app to do something. It would be fired and the system hadn't updated its references yet, so it would still wake up the old code.

But then as the code was waking up, the system would then point to the new files. And so a number of things were missing, like basically the app couldn't find its own zip file 'cause the files had moved and that... that was not supposed to happen. So you'd end up with those weird situations. And the way that I found this was that, in the crash reports, I added the version... the current version number of the app. And I would see, so the crash report would tell me, well, I asked the system for my version [00:17:00] number and it's the new version number, but then the stack trace had line numbers that matched the old version.

Matt Klein: Right, right.

P-Y: The code was from the old version.

Matt Klein: Yeah.

P-Y: That was mind blowing. And the interesting thing is that bug had been there for a while, but it only started happening when Android started... properly doing multi-core. And so you had more of those race conditions.

Matt Klein: Sure, yeah.

P-Y: Showing up.

So yeah.
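The diagnostic trick in that story, noticing that the version the system reports disagrees with the version of the code that is actually running, can be sketched in a few lines. This is a plain-Java illustration rather than what Square shipped: P-Y compared the reported version number against stack trace line numbers, whereas this variant assumes a version constant baked into the binary at build time. `InstalledVersionSource` and `COMPILED_VERSION_CODE` are hypothetical names standing in for a platform query and a build-generated constant.

```java
final class StaleCodeCheckSketch {
    /** Hypothetical stand-in for asking the OS which version of the app is installed. */
    interface InstalledVersionSource {
        int installedVersionCode();
    }

    /** Hypothetical value that a build step would bake into the compiled code. */
    static final int COMPILED_VERSION_CODE = 41;

    /** True when the running code was compiled for a different version than the OS reports. */
    static boolean isRunningStaleCode(InstalledVersionSource os) {
        return os.installedVersionCode() != COMPILED_VERSION_CODE;
    }

    /** A line that could be attached to a crash report to make the mismatch visible. */
    static String describe(InstalledVersionSource os) {
        int installed = os.installedVersionCode();
        if (installed == COMPILED_VERSION_CODE) {
            return "consistent: running version " + installed;
        }
        return "mismatch: code compiled as version " + COMPILED_VERSION_CODE
                + " but system reports " + installed;
    }
}
```

Attaching a line like this to every crash report turns a baffling null pointer into a searchable signal: any report tagged with a mismatch points at the install race rather than at the app's own logic.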

Matt Klein: Yeah. I - I, yeah, there's,

P-Y: sorry, that was like a

Matt Klein: No, no, no. I mean, I - I think people love hearing about bugs. I certainly do. I, I guess one other question in general before we get back to, you know, what, what you've done at Square and Block is Android itself is open source, right? So I, I mean, has...

I would imagine the existence of the operating system code itself has been very useful, right? I mean, it makes it easy.

P-Y: It's amazing.

Matt Klein: I, I myself, for various reasons over the years, have looked at the Android source code and figured out various things. I mean, it's just like having the Linux source code available.

It's really amazing to be able to look at [00:18:00] everything. It makes, you know, makes it so that you can take a lot more problems into your own hands. So... I guess I, you know, could you talk a little bit about, you know, what having access to all of that code means to you? And a, and a follow-up question is, I would imagine at this point at Block, on some of your devices, you, you own the entire device, right?

So it's like you can patch the code if you want. So mostly just trying to understand, you know, it's like when it comes to filing these bugs and they go nowhere, I would imagine that sometimes you just fix them, right? I mean, so it's like, how does that, how does that work?

P-Y: Yeah.

Matt Klein: You know?

P-Y: Yeah. Yeah. That's a great question.

I, I guess the first part, it... very early on in my career, I was really lucky. Before Square, I worked with an engineer... fairly senior engineer. I was running into some bugs and I was like, I don't know what the framework is doing. It was probably Spring or you know, something like that. And, and he told me, well... [00:19:00] I was like, that's not in my code.

He was like, well, it's open source and, and you have a debugger, so let's just debug through what happens when you get an HTTP request, throughout the entire stack, right? And he showed me that, and it sort of like it was... mind- I mean, it's obvious in retrospect, but it was mind blowing to me that the boundary of what I would, should look at wasn't just my code.

I should be going through the entirety of the code and learning from all of that. And I think with Android it's that same idea. I mean, I'm so happy that we have that when our peers on iOS just don't have access to the sources.

Matt Klein: Yeah.

P-Y: Right? And a lot of the early, uh, you know, like Crashlytics for example, like how do you handle crashes on iOS?

The reason Crashlytics is... Crashlytics came to exist, as far as I know, was because someone at Apple left Apple and wrote Crashlytics. So they knew exactly how to, you know, get that...

Matt Klein: right. Yeah.

P-Y: Information, right. But without that, it wouldn't have happened, right? And so I think that was one of the big things was like [00:20:00] realizing that you can debug beyond your code. Even better, on a rooted device you can debug other processes. So like, you can be like, okay, well I, I made this system service call and it's not doing the right thing. Well, I can actually like... put a breakpoint in this other process and, and you know, go through that. Sorry, what was the second question?

Matt Klein: No, I mean, I was, I was asking more about, like, given access to the code, do you patch the... like do you, do you solve your own problem sometimes?

P-Y: Yeah.

Matt Klein: Right.

P-Y: So we have... we have a blog post. We didn't use to do that, mostly because before we didn't have our own hardware, and so... we would try to provide patches sometimes, but by the time the patch rolls out, you know, it's like several years down the line before the Android devices have caught up. We do do that now... for major bugs. And it's still tricky because there are so many moving parts involved: updating your own fork, updating... you want to, [00:21:00] you know, update mainline. But, if you, if you look up the Block engineering blog... I have a colleague, Tom, who wrote a really amazing blog post recently around... we found a bug in the USB stack that's from 2013.

Matt Klein: Wow.

P-Y: And, uh. Really cool bug. I, I'm not gonna go into all of the details, but the core idea was that when you'd say... you'd be on a background thread and you'd say, 'Hey, I wanna read... you know, I wanna read this, this USB stream, basically gimme some bytes from... from the USB device,' that would have the side effect of freezing the entire VM... or not really exactly... it would have the side effect of freezing the GC while reading these bytes from the connected USB devices. And, and if the GC was... in the midst of doing very basic operations on any thread, those threads would [00:22:00] just get frozen as well. So we'd notice because we started seeing like, hey, sometimes the main thread is blocked for like two seconds... for no reason. And then as we started gathering data from production, so we're getting into observability a bit.

Matt Klein: Yeah.

P-Y: We started noticing, um, that it would mostly happen on devices that had... USB barcode scanners connected. So one of the particular things about what we do is we have all these USB devices connected: a barcode scanner, a printer, a cash drawer.

And so... we noticed, it's weird. All these people using this USB barcode scanner, they keep running into these UI freezes. And then we were lucky that we were able to reproduce the issue. And finally, with some traces we could see that what was happening is there's this low... low level API. So when you're doing native code and you're manipulating a data structure that lives on the Java side, you sort of have two options. Number one, you use this... intermediary layer that abstracts pointers. [00:23:00] 'Cause the problem is, you know, you're in native code, you're dealing with pointers, but the Java layer... has this GC that's moving stuff around.

Matt Klein: Yep.

P-Y: And so if you try to read a pointer and the thing's been moved around, you're - you're reading in the wrong place, right?

So you have this... either you can use this abstraction layer that abstracts pointers, but then it's slow. Or you can say, hold on, I'm just gonna read this byte array directly. Please stop moving stuff while I read. It's gonna be really quick, and then I'm gonna be done and you can start moving stuff again. When you do that, you should make sure you're not blocking... there should not be any sort of blocking operation while you're reading.

It shouldn't be reading from any sort of I/O. And that's what was happening in the AOSP code base, and we finally got it fixed in Android 16.
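The rule P-Y describes, keep the "please stop moving stuff" window short and never block inside it, can be modeled without any native code. The sketch below is only an analogy in plain Java, not the AOSP code or the JNI API: a `ReentrantLock` named `PIN` stands in for the window during which the collector may not move objects, and `Device` is a hypothetical interface whose `read` may block for a long time. `ProbeDevice` exists purely so the difference between the two approaches can be observed.

```java
import java.util.concurrent.locks.ReentrantLock;

final class CriticalReadSketch {
    /** Stand-in for "the collector must not move objects right now". */
    static final ReentrantLock PIN = new ReentrantLock();

    /** Hypothetical device whose read may block for a long time. */
    interface Device {
        int read(byte[] buffer);
    }

    /** Anti-pattern: the potentially slow read happens while the pin is held. */
    static int readWhilePinned(Device device, byte[] target) {
        PIN.lock();
        try {
            return device.read(target);
        } finally {
            PIN.unlock();
        }
    }

    /** Safer shape: block outside the pin, then do only a quick copy inside it. */
    static int readThenCopy(Device device, byte[] target) {
        byte[] scratch = new byte[target.length];
        int count = device.read(scratch); // may block, but nothing is pinned yet
        if (count <= 0) {
            return count;
        }
        PIN.lock();
        try {
            System.arraycopy(scratch, 0, target, 0, count);
        } finally {
            PIN.unlock();
        }
        return count;
    }

    /** Test device that records whether read() was called while the pin was held. */
    static final class ProbeDevice implements Device {
        boolean calledWhilePinned;

        @Override
        public int read(byte[] buffer) {
            calledWhilePinned = PIN.isHeldByCurrentThread();
            buffer[0] = 42;
            return 1;
        }
    }
}
```

In the second shape the cost is one extra copy, which is the trade-off P-Y alludes to: the slower path keeps a stalled USB device from stalling every other thread that needs the collector.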

Matt Klein: Fantastic. Love those stories. So I - I guess before we get into the future and the current, I mean, can you quickly just take us through some of the major things that you've worked on at Square and Block and I don't know, like some of the... some of the things that have excited [00:24:00] you the most there.

That could be internal projects or open source work or whatever else.

P-Y: Yeah, so there's... that one is more like, sort of like, light career advice. That's kind of, I, I like to tell the stories 'cause it's kind of funny to me. So when I was in engineering school, I went through... you know, I knew I wanted to write code, but it was still like, okay, I gotta get my diploma and all that.

And so I had these signal processing classes. And I was like, you know, it's - it's just a bunch of math. I was like, I hate this. I just wanna write code, like why... I'm never gonna use this in my life, ever, right? And so I joined Square and after a couple months, like, okay, cool, so here's a new reader. And it, you know, you - you put in your credit card and it encrypts that and then it sends that as an... through the audio jack as a signal, and then you have to decode it on the other side.

And it was like, and - and it was a new version, so it was doing... there was a whole lot of like, fast Fourier transforms. So basically analyzing the frequencies of the [00:25:00] signals.
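For readers who, like P-Y, only remember the keywords: a fast Fourier transform turns a block of audio samples into per-frequency strengths, which is what lets a decoder ask "which tone is present right now?". The sketch below is a textbook recursive radix-2 FFT in plain Java, not Square's reader code. `dominantBin` and `sineWave` are hypothetical helper names, and the input length is assumed to be a power of two.

```java
final class FftSketch {
    /** Returns {real, imaginary} parts of the discrete Fourier transform of the input. */
    static double[][] fft(double[] re, double[] im) {
        int n = re.length;
        if (n == 0 || Integer.bitCount(n) != 1) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        if (n == 1) {
            return new double[][] {{re[0]}, {im[0]}};
        }
        int half = n / 2;
        double[] evenRe = new double[half], evenIm = new double[half];
        double[] oddRe = new double[half], oddIm = new double[half];
        for (int i = 0; i < half; i++) {
            evenRe[i] = re[2 * i];
            evenIm[i] = im[2 * i];
            oddRe[i] = re[2 * i + 1];
            oddIm[i] = im[2 * i + 1];
        }
        double[][] even = fft(evenRe, evenIm);
        double[][] odd = fft(oddRe, oddIm);
        double[] outRe = new double[n], outIm = new double[n];
        for (int k = 0; k < half; k++) {
            // Twiddle factor e^(-2*pi*i*k/n) applied to the odd half.
            double angle = -2 * Math.PI * k / n;
            double wr = Math.cos(angle), wi = Math.sin(angle);
            double tr = wr * odd[0][k] - wi * odd[1][k];
            double ti = wr * odd[1][k] + wi * odd[0][k];
            outRe[k] = even[0][k] + tr;
            outIm[k] = even[1][k] + ti;
            outRe[k + half] = even[0][k] - tr;
            outIm[k + half] = even[1][k] - ti;
        }
        return new double[][] {outRe, outIm};
    }

    /** Index of the strongest frequency bin, ignoring the constant (DC) component. */
    static int dominantBin(double[] samples) {
        double[][] spectrum = fft(samples, new double[samples.length]);
        int best = 1;
        double bestPower = -1;
        for (int k = 1; k < samples.length / 2; k++) {
            double power = spectrum[0][k] * spectrum[0][k] + spectrum[1][k] * spectrum[1][k];
            if (power > bestPower) {
                bestPower = power;
                best = k;
            }
        }
        return best;
    }

    /** Test signal: a sine wave completing exactly `cycles` periods over `length` samples. */
    static double[] sineWave(int cycles, int length) {
        double[] samples = new double[length];
        for (int i = 0; i < length; i++) {
            samples[i] = Math.sin(2 * Math.PI * cycles * i / length);
        }
        return samples;
    }
}
```

A bin index maps back to a frequency as bin × sampleRate / length, so with a real sample rate the same call would tell a decoder which tone the reader is currently sending.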

Matt Klein: So you're saying that you should

have paid attention in school?

P-Y: Exactly.

Matt Klein: Is that, is that what you're saying?

P-Y: No.

Yeah, I- for a second I was like, oh no. Like I knew the words but I had no idea what to do. So that was great. The same theme, more recently: I've - I've been focusing on performance benchmarks, and one of the things that I've started noticing and pushing people on, trying to push the entire industry on, is, 'Hey, you cannot just take a bunch of data points from one benchmark, and then another set of data points from another, and like compute the mean or the median and say, well, the median moved from whatever, like 50 to 20, so it's that amount of percentage improvement.' Like you cannot do that because, well, what if you run the first benchmark again, and then you get a different set of points? What do you do now? And so you need to start talking about confidence intervals and distributions and all that.

Well, that's, that's basically stats, right?

And same thing, when I was in college, I had a stats class and I was like, I'm not... I'm never using that, like I don't need this class, you know? So I sort of have these [00:26:00] stories of like... in a way, yes, I did not pay a whole lot of attention. I just tried to pass the class, but I still had like the keywords and I knew that I could, you know, learn more about that.
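One way to act on that advice is to report an interval for the change instead of a single number. The sketch below uses a percentile bootstrap on the difference of medians, in plain Java. That particular technique is an assumption chosen for illustration, since P-Y does not say which method Block uses, and the class and method names are hypothetical. If the resulting interval contains zero, the data does not support claiming an improvement.

```java
import java.util.Arrays;
import java.util.Random;

final class BenchmarkCompareSketch {
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Draws values.length points from values, with replacement. */
    static double[] resample(double[] values, Random random) {
        double[] result = new double[values.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = values[random.nextInt(values.length)];
        }
        return result;
    }

    /**
     * Returns {lower, upper}: a 95% percentile-bootstrap interval for
     * median(before) - median(after). The seed keeps runs repeatable.
     */
    static double[] medianDiffInterval(double[] before, double[] after, int iterations, long seed) {
        Random random = new Random(seed);
        double[] diffs = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            diffs[i] = median(resample(before, random)) - median(resample(after, random));
        }
        Arrays.sort(diffs);
        int lowerIndex = (int) Math.floor(0.025 * (iterations - 1));
        int upperIndex = (int) Math.ceil(0.975 * (iterations - 1));
        return new double[] {diffs[lowerIndex], diffs[upperIndex]};
    }
}
```

With this shape, "the median moved from 50 to 20" becomes "the median dropped by somewhere between X and Y", and a rerun of the first benchmark that produces different points simply feeds into a wider or narrower interval rather than contradicting the headline number.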

Matt Klein: Right.

So have you - have you spent most of your time at Block working on these types of performance things? Or, you know, did you spend most of your early part of your career working on product stuff? I'm just trying to understand.

P-Y: Yeah.

Matt Klein: What, what, what has your focus

been over the years?

P-Y: It's been product a lot until recently, until last couple years. But I was always doing the other fun stuff on the side.

Matt Klein: Yeah.

P-Y: Early on, we started, you know, porting the app to tablets. I worked on the reader, worked on a fun project.

We have these hardware readers and they're... we wrote a lot of code for that to connect through Bluetooth and audio jacks and all of that. And all that code was very deeply, connected to the rest of our code base. But we had the [00:27:00] idea that from a business standpoint, we had the idea that, we had some customers that would very much like to build their own application and just use our hardware.

And so they would need some sort of SDK to connect to our reader. And it was really hard to... So we, we had a couple attempts in the past, like some teams tried to rewrite the- this code from scratch. Another team tried to extract it, but it was so... we were constantly adding new features, it was so hard that we never succeeded on that.

And then somehow that project landed on my lap and I did not want to do that cause I was like, 'it's gonna take me three years to extract this SDK.'

And I got a - I got a horrible idea. I thought for a second. Hold on. What if I don't try to extract the SDK? What if I - I create a very light interface, API, and then I take the entirety of the app code base and I shove that in a jar, essentially.

And I call that the SDK. And that's what we did, you know, for the first iterations and so in a way it was horrifying. We were doing all these crazy hacks. We were actually [00:28:00] compiling an app. And then, cause we were letting all the build tools do their job and then we'd extract the Dex files.

And so Dex is a compiled- so you have your class files in - in... from the Java world, and then that's compiled into Dex for Android. And normally that's all happening when you compile the final app. But, we were then taking those out and loading them dynamically at runtime once the SDK was initializing.

And so... that, that wasn't great, but it let, you know, it let us prove out that there was a business model and that we could iterate, right?

Matt Klein: Yeah, of course. Yeah.

P-Y: So that was one of the giant hacks that I worked on. I am, I'm proud, you know, in, in a Frankenstein kind of way, you know, like this monster is mine.

And then the... another thing I did in I think 2015 was, LeakCanary. I think it's one of the, the libraries that is almost...

Matt Klein: yep

P-Y: known from what I've built. And that one was really that we had... we had out of memory crashes, we had memory issues. [00:29:00] And, at the time there was this tool called, Eclipse Memory Analysis Tool.

MAT. And so you would, connect your device and, and trigger a heap dump and then open, open that,

Matt Klein: yeah

P-Y: in this tool and try to explore heap. And I kept trying... so at some point I realized, hold on, we have these objects like, the activity that have a lifecycle, and when they reach the end of their lifecycle, you can, you can listen to that and then I can, I can connect the tool and say, wait, 'why, why wasn't this object garbage collected?'

You know, I would use weak references, where in the... in onDestroy, I would create a weak reference to the object, and then after a while I would check whether the weak reference was cleared. If it wasn't, I would connect the tool and try to find why. After I did that a bunch, I realized, hold on. Eclipse is open source. So what if I didn't have to do all that work manually, but I just took all of this open source Eclipse code and, and shipped that in the app, in the runtime and, and let that do all the work of trying to do the analysis [00:30:00] and surfacing the results to me directly.

So that's how LeakCanary was born.
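The weak-reference check P-Y describes can be sketched in plain Java. This is a minimal illustration, not LeakCanary's actual code: the class and method names (`RetainedObjectWatcher`, `expectCollected`, `countRetained`) are hypothetical, and the 100 ms wait is an arbitrary assumption. `System.gc()` is only a hint to the VM, so the count can over-report if the VM has not collected yet.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical watcher for objects that should be garbage collected; not LeakCanary's API. */
final class RetainedObjectWatcher {
    private final List<WeakReference<Object>> watched = new ArrayList<>();

    /** Call when an object reaches the end of its lifecycle (for example from onDestroy). */
    void expectCollected(Object destroyed) {
        // A weak reference does not keep the object alive on its own.
        watched.add(new WeakReference<>(destroyed));
    }

    /** Requests a GC, waits briefly, then counts weak references that were not cleared. */
    int countRetained() {
        System.gc(); // only a hint: an uncollected object may still be counted
        try {
            Thread.sleep(100); // assumed grace period for the collector
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int retained = 0;
        for (WeakReference<Object> ref : watched) {
            if (ref.get() != null) {
                retained++; // still strongly reachable somewhere: a leak candidate
            }
        }
        return retained;
    }
}
```

A non-zero count is the cue to take a heap dump and look for the reference chain that keeps the object alive, which is the manual step P-Y then automated.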

Matt Klein: Let me, let me ask you a controversial question. I'm just curious to get your take on it. At the time that Android was created, I mean, I think if you look at the choice of the systems language, Java, like compared to C and C++, it's probably a no-brainer to use Java for that type of system...

I'm just curious, you know, I have been around a lot of systems both on mobile and on backend and like, you know, there's obviously, a lot of pros of these types of garbage collected languages and some cons as well. And I'm just curious on your opinion at, at this point in the industry with Apple, obviously moving from Objective-C to Swift, you know, do you, do you think that Java and that type of language runtime is the right long-term future for - for Android?

Or, you know, do you - do you think that things will evolve?

P-Y: That's a great question. I'm... to be honest, I'm really bad at future [00:31:00] stuff.

Matt Klein: Oh, no, it's fine. I, I mean, you could talk about the past. I'm just curious more about your general opinion on this space.

P-Y: The, the reality is it makes it really nice and easy to write code, right?

And, memory leaks have been a huge problem in the past. They're a bit less of a problem right now, a little bit, because the way that we write code in Kotlin, we tend to have... we tend to do less of this, like create stateful objects that... and, and do... and have inner classes that have... a lot of the memory leaks we had were just accidents and the design of the language was, was making them easier.

And it's harder in Kotlin today, it's, it's harder to do inner classes that have hidden references to their outside class. So that part, the memory leak side is still... we, I mean, basically the memory leaks that are left today are the most complex ones, really hard to debug. But another part about the GC is really, its impact on the runtime.

Matt Klein: Mm-hmm.

P-Y: And I think that one is, that one is pretty bad. We are frequently [00:32:00] noticing that, you know, so a - a common issue with - with Android is an ANR, so application not responding. The app is freezing, one of... one type of ANR that people don't think to look for is they'll see an ANR

they're like, 'oh, what is, what is going on?' And it turns out it's just cause the memory's really high cause you've had all these leaks. And so the GC is just freezing the vm, trying to clear out memory. And so common advice that I give when trying to figure out an ANR, is first check the memory level.

If, if it's very high, don't look at the stack trace, just, like, you have a memory leak problem. Uh, and the ANR is just a step towards the out of memory crash.

Matt Klein: right, yep.

P-Y: Beyond that, it is very nice to have a GC, you know? Like not have to think about those relationships. And our friends on, on iOS, I see them chasing a lot of memory

issues as well. They're different ones, but they have their own problems.

Matt Klein: It's a very interesting space and it's one that I, I don't know as much about. I know that on server, obviously [00:33:00] there's been a huge investment in like GraalVM and like other Java runtimes that are a lot more efficient in terms of how they do pauseless GC and all of those things.

I don't know the current state of the art, you probably do in terms of whether some of those technologies are being brought to the phone. Like do you know, are they, you know, are they working on bringing some of these advancements to the runtime that actually runs on Android itself?

P-Y: I know they brought a bunch, but I don't know the latest on that.

Matt Klein: Yeah.

P-Y: It's also, you know, it's always trade offs between the - the types of hardware you got, the amount of memory you have,

Matt Klein: absolutely, yeah.

P-Y: number of CPUs. And so, I mean, Android already has, you know, concurrent GCs and they've got thread-local...

You know, when, you know, when you're allocating, you'll typically have a thread-local buffer.

So you are allocating there first. There's a whole bunch of mechanisms in place to, to make that nice. But I - I don't know compared to like, what's the latest in the GC world, I'm not super familiar with that, to be honest.

Matt Klein: Yeah. Yeah. Okay, cool. [00:34:00] So, you know, I think what, what I'd like to chat with you about next is, obviously it's clear so far in this conversation that you - you have a lot of interest in performance and, you know, general understanding of systems.

What I'm curious about is, you know, you work at a company that has a lot of engineers, and I'm sure you face lots of problems, right? From like long build times to developer productivity to like observability to performance.

And I - I guess I'm curious, you know, you yourself are clearly focusing on the performance side, but obviously as an organization I'm sure you face lots of issues.

So I - I guess like, I guess it's a two part question. From an engineering organization perspective, what do you think are the biggest problems facing mobile in general right now?

And then I guess for you, yourself, how do you choose where to focus on that menu of problems? Because you're obviously a very [00:35:00] senior contributor. You know, you have some leeway in terms of what you work on in terms of its impact. So I guess I'm curious, again, like the menu of large problems that you're facing right now and how you've chosen your current focus area.

P-Y: It's interesting. I mean, it's always changing. I - I guess actually I- the, the second part is easy, easier for me to answer. Because mostly it's the gaps, right? So for example, it used to be that our build was... build and IDE situation and CI situation was terrible. And it's just that... and today, it's not, because we have a team that's doing a really good job there.

And honestly, it's a- it's mostly having the right people and the right manager and - and it's been... there's still a lot to do there, but I'm giving you that as example of like a couple years back our builds were on fire. And today it's mostly fine. And they keep improving things and I'm like, thank God.

Like...

Matt Klein: but sorry.

When you say 'on fire' [00:36:00] took a long time? Or breaking all the time?

P-Y: Yeah.

Matt Klein: I'm like trying to understand what, what was the problems first there?

P-Y: Really,

Matt Klein: yeah

P-Y: what we had was... builds were very slow. That would be like maybe five years ago. Builds were incredibly slow, and IDE sync times were slow.

IDEs were slow. You have to think, you know, there's, at the time maybe 50 engineers, 50-70 engineers contributing. Everyone has a different IDE configuration. Some people are trying the latest and different build, you know, build versions and things, and so like Gradle version.

So it's really hard to debug what's going on. Everything is slow. Gradle at the time was slow and what happened since is that we started building- giving them, the Gradle team, we gave them demo projects that reproduced the biggest issue and they started addressing all of the core issues.

At the same time, we... I think we tried Buck for a bit, but without having the deeper knowledge of how to use Buck, how to set it up. And, the biggest thing that happened is one engineer [00:37:00] was... created this concept of, feature modules. The idea was when you work on a new feature, instead of contributing to like the, the couple modules we have, you're gonna create a set of modules, one for your public API, one for your implementation, and one for like a small demo app that only runs the implementation code of your module.

Therein you won't have many transitive dependencies. You'll just have, direct dependencies to API modules that have a, a...

Matt Klein: yep.

Makes sense.

P-Y: fake version of that. Right. So this makes a lot of sense, but as soon as he did that and he created utilities to quickly create those, every engineer was using those.
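The feature-module layout P-Y describes (a public API module, an implementation module, and a small demo app wired to fakes) can be sketched as follows. All names here (`PaymentApi`, `FakePaymentApi`, `CheckoutFeature`, `CheckoutDemoApp`) are hypothetical, and everything sits in one file only for brevity; in a real build each commented section would presumably be its own Gradle module.

```java
// --- payments API module (hypothetical): the only thing other features depend on ---
interface PaymentApi {
    boolean charge(long amountCents);
}

// --- fake shipped alongside the API module, with no heavy transitive dependencies ---
final class FakePaymentApi implements PaymentApi {
    long lastChargedCents = -1;

    @Override
    public boolean charge(long amountCents) {
        lastChargedCents = amountCents; // records the call instead of talking to hardware
        return true;
    }
}

// --- checkout implementation module: depends on the API, never on another implementation ---
final class CheckoutFeature {
    private final PaymentApi payments;

    CheckoutFeature(PaymentApi payments) {
        this.payments = payments;
    }

    String checkout(long totalCents) {
        return payments.charge(totalCents) ? "paid" : "declined";
    }
}

// --- checkout demo app: builds only this feature plus fakes, so it compiles quickly ---
final class CheckoutDemoApp {
    static String run() {
        return new CheckoutFeature(new FakePaymentApi()).checkout(1250);
    }
}
```

Because the demo app never pulls in the real implementations of other features, its build graph stays small, which is the effect on build times P-Y goes on to describe.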

Matt Klein: Yeah.

P-Y: Because you were switching from like five minute builds to like 30 seconds. Right?

Matt Klein: Right.

P-Y: And so that led to a huge increase in the number of modules, which made things worse for our build team at the time. Right? Because like all of a sudden the build was... the, the biggest build was slowing down. This was just fixed by

doing what you do as any engineering team, which means, you know, taking problems in the right order, sitting [00:38:00] with customers, doing things like standardizing the tools so that everyone uses the same tool. Hiring good engineers who know how to debug stuff and who actually get into that and fix things.

Matt Klein: Yep.

P-Y: So they did all that. So that was an example of like, this used to be a problem, still is, you know, a lot of investment, but I don't feel concerned about that. And so, what I am looking at today is a lot of performance and observability. And the primary reason is because historically we haven't been terribly good at that, at investing there.

And so I'm, I'm trying to push for that. And then, you know, I think the obvious one, that's - that's coming down the pipe is, um, the impact of using LLMs to,

Matt Klein: of course, yep

P-Y: generate code and, and how it changes the practice. Because, you know, at - at Block we're... so, there's nothing like, 'hey, if you don't use LLMs, you're gonna get fired' or anything like that.

Matt Klein: Right

P-Y: right? But there is a push for like, 'this is a tool that's gonna... should help you with productivity. You should get to learn to use it.' And [00:39:00] everyone's constantly questioning like, 'oh, should we use it for migrations?' Which is a really interesting use case.

Matt Klein: Yep, for sure.

P-Y: you know, versus new features.

The biggest question, though, is are we gonna be introducing, you know... It's, it's the same. It's like writing... doing things in reasonable ways and so like, are we gonna start seeing a whole lot of new code being introduced and how do we ensure a high quality bar?

Matt Klein: Yep. I - I, I'm not,

P-Y: yeah,

Matt Klein: I'm not an AI doomer by any means.

I, I think these tools are real and I think they will... they will increase productivity over time. At the same time, my personal experience so far, it - it has not lived up to the hype. And, and I don't know if that's just me. You know, I do use these tools on a daily basis, and they do help me in certain things.

But, the world in which, especially for sophisticated, large, existing code bases where you're like magically generating all the code and not shipping more bugs than you [00:40:00] would've had previously, is not super clear to me currently. So I don't know...

P-Y: I'm in the same boat. I... you know, every time I surface something, I hear 'you're using it wrong.'

And I'm like, 'well, at some point if the tool cannot be used, right, like, you know...'

Matt Klein: Well, you know, it, it's, I mean, it's, it's actually funny that you say that because we, in our small company, we have a, we have a Slack channel like, just for helping each other use AI. And we talk about our prompts and all of these things. And like, there are times in which I am typing there and I'm saying like, I kind of feel dumb because I read online that everyone says that these things are supposed to be amazing

and it's like, it's not working for me, so...

P-Y: mm-hmm.

Matt Klein: Am I, am I just doing it wrong? You know, but-but, to your point, if like, I, I'm a reasonably smart person, if I can't use the tool, then maybe the tool is not quite right yet. Anyway.

P-Y: Yeah, I guess, to me there's just two, two things...

I've seen two fairly interesting use cases that, at least for now, work for me. Number one is... and I think that's what most people have started using it for as well, is [00:41:00] like stuff you're not... you are remotely familiar with, but not an expert at.

Matt Klein: Absolutely.

P-Y: And it helps you like get started, right? So like I talked about stats and I was recently throwing some data set at ChatGPT and I was like, write me some Python code to run this type of analysis.

Matt Klein: Mm-hmm.

P-Y: That type of analysis. And it's sort of like helping me... and also just learning a number of things and then asking for like the code, and where is that information coming from and all that.

And starting to build a mental model. And then I know I can take that code and turn that into something more production ready. So that's, that's one thing. Like anything you're like... You're on the frontier and you're not an expert. Where I would normally need to go find... statistics experts at my company, which either they don't exist or they don't talk to me like I don't know how to find these people.

I think that's number one. And then number two, I saw a recent, recent example that I'm really excited about. So we - we released this thing called Trailblaze, and the key idea is writing UI tests. So there's this tool called Maestro where you write UI tests as sort of [00:42:00] like little commands like, do this, do that, do this.

It's meant to be editable by a human. And it can run on Android and iOS. So that's nice. But the idea was... but you still need to describe what you're expecting to see on the- in the UI and what actions to do. Using an LLM to generate that. Going from English sentences to that, but in a way where you, you write your sentence, the LLM generates those commands, they're run, but then what gets checked in or remembered is those commands.

Matt Klein: Yeah.

P-Y: And then every time the test runs, it runs from those commands. And when the test fails, you can rerun the LLM loop and say, 'Hey, try again. Like, maybe the UI has changed, maybe you can figure out, you know...' and if the LLM can't, then you can go back and actually look at what's broken. Yeah, I think it's super interesting.
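The record-and-replay loop P-Y outlines can be sketched roughly like this. It is an illustration of the idea only, not Trailblaze's or Maestro's API: `RecordedUiTest` is a hypothetical class, the `generator` function stands in for the LLM call, and the `runner` predicate stands in for executing commands against a device.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical sketch of replaying recorded commands, regenerating only on failure. */
final class RecordedUiTest {
    private final String description;                        // English sentence written by a human
    private final Function<String, List<String>> generator;  // stand-in for the LLM call
    private List<String> recorded;                           // what would be checked in

    RecordedUiTest(String description, Function<String, List<String>> generator) {
        this.description = description;
        this.generator = generator;
    }

    /** Runs from the recorded commands; if they fail, asks the generator to try again once. */
    boolean run(Predicate<List<String>> runner) {
        if (recorded == null) {
            recorded = generator.apply(description); // first run: generate and remember
        }
        if (runner.test(recorded)) {
            return true; // normal case: no LLM involved
        }
        List<String> regenerated = generator.apply(description); // maybe the UI changed
        if (runner.test(regenerated)) {
            recorded = regenerated; // keep the commands that pass
            return true;
        }
        return false; // a human needs to look at what is broken
    }

    List<String> recordedCommands() {
        return recorded;
    }
}
```

The point of the design is that the non-deterministic step runs only when commands are first generated or when a recorded test breaks, so ordinary test runs stay repeatable.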

Matt Klein: Yeah, I - I mean, I have no doubt that there are going to be very real good use cases and my personal experience is that 99% of the value I get is what you said, which is just having a buddy that I can do a Q & A with and [00:43:00] just... expedite my learning, right? It's like... it helps me learn faster. So I think there's a lot of value there. In - in the little time that we have left, I do wanna come back to the performance side.

So, you know, you were talking about that, that's your focus now. I, I think one question I have is... this is a very high level question.

I think intuitively we as, you know, senior or sophisticated engineers, we know that performance is important. There's been many studies over the years that show that performance of applications and systems directly contribute to conversion and all of those things.

P-Y: Yep.

Matt Klein: At the same time, it's very difficult to unequivocally prove, right? That like you do X, Y, or Z with performance and it is going to, you know, improve business goals and all of those things. So I, I, I'd love to just understand from you, you know, when you work on this topic, you clearly think that it's [00:44:00] important.

I guess first, why do you think it's important and how do you justify to the company spending your time like working on these things?

P-Y: Yeah, that's a, that's a great question. There's a couple things there. Yes, there are a bunch of studies, but like, it's studies from Amazon and like dropping carts, usage and all that.

And it's, it's great, but it doesn't map to like many apps, you know, and so it's kind of hard to translate that.

Okay, so. This is very hard. First of all, translating back, like doing, like trying to be data first on this. I'll give you one example, especially with our types... our type of customer. When, when our, our customers use us, we're like a very integrated solution.

So we're basically saying, 'Hey, with us, you get everything. And you don't have to worry about anything. Everything is connected together as like an amazing experience,' right? And we have competitors.

And so why do people switch to competitors? Well, if your performance is - is bad, they will, the customers will rarely [00:45:00] say, 'oh, the performance is bad.'

Typically, what, what will happen is... The competitor has cheaper prices.

Matt Klein: Yeah.

P-Y: They have a reason to go. And they'll just go and they'll say, well, cheaper prices or missing feature or what- whatnot. But when your customers love... your app, like, so we're serving small businesses, they have other things to do than trying to focus on the app.

They just want to get things done and it's just gotta work, right? So when it works fine and they don't- that's not a problem for them to handle, they don't care that much about the slightly lower price or you know, that feature that's missing unless it's very blocking. So, when the... when you have bad UX, performance problems, they're not gonna say I'm leaving because of those.

But they basically are not a loyal customer anymore and they're ready to switch anytime. So in terms of churn, it's gonna be really hard to connect that data because they'll say, well, lower prices over there, but they were long gone... before that they just stopped, you know, being engaged with the product and it's hard to see cause they're still using it.

But I think you can tell that story. [00:46:00] And so I tell that story. And also we've had... you know, sometimes we had an important customer leave us or someone, someone that matters for us. And it's usually a good time to look into what has their performance experience been and sort of bring, you know... show the data side of things, but tell the story around it.

Like,

Matt Klein: yeah

P-Y: jump on the story and, and talk to execs, talk to - to PMs and say, 'Hey, they - they talked about having problems with the app. Did you know that they were running into this and that and this?' And sort of like building the bigger story. You don't always need to have the proof in the data, especially when, when you can't really.

And then the other thing that I've done a lot is look into, there's a number of studies and you can also experiment for yourself around.... what is, like, what does it mean to have a performance issue, right? So like, for example, with mobile devices... when, you know, touching the screen of your phone and all that, it turns out...

[00:47:00] humans can't... So the delay between when you touch and when the UI responds, humans can't really perceive any difference at all below 69 milliseconds. And really you can round that up and you can say a hundred milliseconds is totally fine for - for tap response. For motion, you really just don't want - so like scrolling and all that - you just don't wanna miss any frames until the hardware gets like a lot better, but we're not there. So that means you can pick that, that... and that's actually what Apple recommends, 100 milliseconds. Or you can pick like a bit higher than that, but you can say, 'Hey, when we're below that, it doesn't really matter.

When we get over 300, 400, 500 milliseconds between the time I do a thing and the time the UI responds, that's noticeable.' And, what it changes, especially for us, like we're - we're trying to... we're building tools... we don't want people to be focused on the tools. We want them to use muscle memory and just tap, tap, tap without thinking, right?

But if the UI takes a while to respond, you're tapping and you, you learn to wait and pay attention and you're [00:48:00] not focused on engaging with your customer, right?

So that's kind of the story that, that I tell. And I built a demo. I built a bunch of demo apps, internally that would show, 'Hey, here's what the interaction feels like when you have that type of delay.'

And if you put that in the hands of someone. 'Oh. Oh yeah, that's bad.' So you can kind of justify the thresholds you're using as targets that way.
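The thresholds P-Y mentions can be turned into a simple classifier for measured tap-to-response delays. This is a sketch: the 100 ms and 300 ms cut-offs are the figures he cites, while the class name `TapLatency`, the bucket names, and the treatment of the range in between (which he does not characterize) are assumptions.

```java
/** Hypothetical classifier for the delay between a tap and the UI responding. */
final class TapLatency {
    enum Bucket { FINE, IN_BETWEEN, NOTICEABLE }

    static final long FINE_UP_TO_MS = 100;       // "a hundred milliseconds is totally fine"
    static final long NOTICEABLE_FROM_MS = 300;  // "over 300, 400, 500 milliseconds... that's noticeable"

    static Bucket classify(long latencyMs) {
        if (latencyMs < 0) {
            throw new IllegalArgumentException("latency cannot be negative");
        }
        if (latencyMs <= FINE_UP_TO_MS) {
            return Bucket.FINE;
        }
        if (latencyMs < NOTICEABLE_FROM_MS) {
            return Bucket.IN_BETWEEN; // assumption: the episode does not label this range
        }
        return Bucket.NOTICEABLE;
    }
}
```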

Matt Klein: Yeah,

I mean, what, what you're talking about is something that obviously as a company we talk about all the time, which is that, you know, and I guess maybe this is where we can close out, just get your opinion on this, is I feel like from an, from a mobile industry perspective, and maybe this is because of what developers are used to or the tooling or all of those things, but obviously like crash reporting is the thing that people know, right?

I mean, it's like... at least historically, like crashes are what mobile developers care about. And, and I think what you're describing is something that we talk about all the time, which is that in mature applications, actual crashes are a tiny fraction of the problems, right?

P-Y: Yeah.

Matt Klein: It's like slow UI, [00:49:00] like things that don't work, just frustration quits, all of those things.

And I think, you know, I - I guess as we close, I, I'd love to just hear from you like how you think about that, right? Because you, you clearly are looking into a lot of these non crashing issues, so it's like, how do you think about, you know, the industry's almost historical obsession on like crashing versus the types of problems that you can argue on a- not, the crashes aren't important,

obviously a bad crash is very important, but the type of things that you're talking about impact all users, right?

P-Y: Yeah. I mean, I think it's the... it's, you know, the joke around the - the person who lost their keys and they're looking right under the light when... and I forget, someone asked them like, why are you looking here at... I'm not telling the - the story right

but like, they're looking right under the light at night and - and 'where did you lose your keys?' 'O- over there, but this is the only place where there's lights. So that's where I'm looking,'

Matt Klein: right, yep, [00:50:00] yeah

P-Y: I think that's the same story, where.... for the longest time, the - the easiest signal to get was crashes.

So we got crashes. And I- I've surfaced many times, I don't think either Apple or Google did a great job enabling observability, enabling, getting people to understand what's happening in their apps. We've left that to vendors, which, historically, have not had a very big mobile background or deep understanding of what happens in mobile apps.

And so, have built tools that... there are a number of vendors that have mobile tools, that have side effects, that will, you know, you're trying to see if your app is good or not, and you enable the observability tool and it's tanking your app, right?

So, I think historically there's just been a lack of focus from Apple and Google and crash reporting has been easy to implement, so that's what people have been looking at.

We, we took this interesting approach at Block of like, we call it 'crash fast,' which is the idea that, you know, offensive programming. It's this idea that if something's wrong, you just crash.

Matt Klein: Yep.

P-Y: [00:51:00] Instead of trying to recover. And so if something is null, instead of trying to say, well, don't show the UI, don't show this

pop up if the thing is null, and then you end up with like a user pressing a button and nothing's happening. And you don't even know that because they, there's just no signal.

Crashing is bad, but at least you get the signal and you're gonna work on it. And so I think that's why crashes are like...
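The contrast between silently recovering and "crash fast" can be shown with two versions of the same method. This is a minimal sketch with hypothetical names (`ReceiptPopup`, `showDefensively`, `showCrashFast`); it illustrates the trade-off P-Y describes rather than any code from Block.

```java
import java.util.Objects;

/** Hypothetical example contrasting silent recovery with crashing on a broken invariant. */
final class ReceiptPopup {

    /** Defensive style: a missing receipt silently produces nothing, so no signal is emitted. */
    static String showDefensively(String receipt) {
        if (receipt == null) {
            return ""; // the user taps, nothing happens, and nobody finds out
        }
        return "Receipt: " + receipt;
    }

    /** Crash-fast style: a missing receipt is treated as a bug and fails loudly where detected. */
    static String showCrashFast(String receipt) {
        Objects.requireNonNull(receipt, "receipt must not be null when showing the popup");
        return "Receipt: " + receipt;
    }
}
```

The second version turns an invisible failure into a crash report, which is the signal P-Y says teams actually act on. As Matt notes next, the calculus changes for an SDK running inside someone else's app.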

Matt Klein: I was gonna, I was gonna say real quick, that is my general philosophy also.

Until you write an SDK that runs in other people's apps

P-Y: and people get mad at you. Absolutely.

Matt Klein: At that point, crashing is real, real, real bad.

P-Y: Yeah. Yeah. No, I hear you. Different story. But anyway, yeah, I think that's primarily the... that's the signal we got. So it's the thing we fix. And that... I think, you know, I know everybody's focused on AI, but I think there's this parallel timeline of like, the next big thing for mobile is also observability.

Matt Klein: Yeah. Well, cool. Thank you. I could talk about this all day, but in the [00:52:00] interest of time, we have to wrap up. Any, any final parting words, any hot takes that you'd like to share about the future?

P-Y: Uh... not really. I think, I mean, I'm still excited for the future. I - I think I was... so I was at Droidcon recently, and, and I said that... I love bugs. Like,

everything I learn... I don't, I don't typically sit down and go learn new technologies. I learn them when there's a bug and I need to investigate because someone else used the new technology. So I love bugs and that's what my career has been made of, learning how to fix bugs, you know, investigating bugs.

So I'm excited about the future. Because, LLMs are likely to introduce a whole lot of new interesting bugs.

Matt Klein: More, more bugs.

Alright, thank you. That's a great, great way to wrap up. So anyway, thank you so much for joining us. This was fantastic. And, that's a wrap for this episode of Beyond the Noise: Signals, Stories, and Spicy Takes.

Huge thanks to P-Y for joining and sharing your story. You can find this episode and all past ones on the bitdrift [00:53:00] YouTube channel. And if you had fun, drop us a review. Tell your friends or yell your favorite hot take into the void and just make sure to tag us. I'm Matt Klein and I will see you next time.

Thank you so much.
