I began as a Senior Secure Application Engineer in 2017 and was skip-promoted past Staff Engineer to Application Architect after 18 months.
Below are a few of my contributions.
When I started at Crunchyroll, I joined a team of 6-8 engineers building a new subscription processing system for a subscription video on demand (SVOD) service with 2 million subscribers. Building a multi-tenant system to process over $10 million a month in credit card transactions presents a number of challenges.
The systems being replaced were single-tenant PHP applications with undiagnosed edge cases and known defects. A new multi-tenant system, written in Go, was envisioned to replace them both.
A few things I did on the project:
The weekend after I started, the website went down for several hours on a Saturday night when a new episode of a popular series went live. As a member of the Secure Apps team, I was confident that the systems for which I was responsible were not the cause of service disruption. Nevertheless, I took it upon myself to investigate the stability issues to see if I could help other teams pinpoint the source of the problem.
After a few hours sifting through New Relic, CloudWatch, and access logs, I had a good understanding of the various service resiliency failures and decided that the application's stability issues could be fixed with a microcache.
So, that weekend, I went home and wrote an open-source embedded HTTP cache in Go.
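The library itself is more involved, but the core idea is small. Here's a minimal sketch of a microcache middleware (illustrative only, not the actual code): GET responses are held in memory for a short TTL, so a burst of identical requests reaches the origin at most once per window.

```go
// Minimal illustration of the microcache idea, not the actual library:
// GET responses are cached in memory for a short TTL keyed by method and URL.
// Request coalescing, cache-control handling, and eviction are omitted.
package main

import (
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

type cachedResponse struct {
	status  int
	header  http.Header
	body    []byte
	expires time.Time
}

type microcache struct {
	ttl   time.Duration
	mu    sync.Mutex
	store map[string]*cachedResponse
}

func newMicrocache(ttl time.Duration) *microcache {
	return &microcache{ttl: ttl, store: make(map[string]*cachedResponse)}
}

func (c *microcache) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			next.ServeHTTP(w, r)
			return
		}
		key := r.Method + " " + r.URL.String()

		// Serve from cache while the entry is still fresh.
		c.mu.Lock()
		entry, ok := c.store[key]
		c.mu.Unlock()
		if ok && time.Now().Before(entry.expires) {
			for k, v := range entry.header {
				w.Header()[k] = v
			}
			w.WriteHeader(entry.status)
			w.Write(entry.body)
			return
		}

		// Miss: record the upstream response so it can be replayed for the
		// next few seconds of identical requests.
		rec := httptest.NewRecorder()
		next.ServeHTTP(rec, r)

		c.mu.Lock()
		c.store[key] = &cachedResponse{
			status:  rec.Code,
			header:  rec.Header().Clone(),
			body:    rec.Body.Bytes(),
			expires: time.Now().Add(c.ttl),
		}
		c.mu.Unlock()

		for k, v := range rec.Header() {
			w.Header()[k] = v
		}
		w.WriteHeader(rec.Code)
		w.Write(rec.Body.Bytes())
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/titles", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("expensive upstream call lives here"))
	})
	cache := newMicrocache(3 * time.Second)
	http.ListenAndServe(":8080", cache.Middleware(mux))
}
```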
A strong 5-minute TTL on API responses in the CDN protected the external API from traffic bursts. However, those protections were not available for internal API calls from NodeJS, which were routed directly to the scaling group's load balancer.
The NodeJS app also had a 6-second timeout: it would attempt to fetch data from internal APIs for up to 6 seconds, at which point a response would be returned to the client with however much data had been collected. This was problematic because nothing was shut down on timeout. If downstream latency during high traffic pushed a request past the 6-second mark, the outstanding requests would continue to be sent to downstream internal APIs even after a response had been returned to the client.
This aberrant behavior resulted in a request-multiplication feedback loop: if any of the system's internal APIs ever experienced increased latency for any reason, the entire application would beat itself to death with duplicate requests.
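For contrast, here's an illustrative Go sketch (not code from the actual services) of the cancellation pattern that prevents this class of runaway fan-out: the deadline is propagated through a context, so when it expires, in-flight downstream calls are aborted rather than left running after the client has already been answered.

```go
// Illustrative only: a per-request deadline propagated through context
// cancels in-flight downstream calls when it expires, instead of letting
// them run to completion after the client has already received a response.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func fetchWithDeadline(parent context.Context, url string) (*http.Response, error) {
	// Derive a 6-second deadline from the incoming request's context.
	ctx, cancel := context.WithTimeout(parent, 6*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// When ctx expires, the HTTP transport aborts the request; no orphaned
	// downstream calls survive past the deadline.
	return http.DefaultClient.Do(req)
}

func main() {
	// The URL is a placeholder for an internal API endpoint.
	if _, err := fetchWithDeadline(context.Background(), "http://internal-api.local/titles"); err != nil {
		fmt.Println("request aborted:", err)
	}
}
```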
With no additional infrastructure, the HTTP caching middleware was integrated into the API, and it never failed in the same way again.
A few weeks later, the front-end team introduced an error to production which effectively doubled the traffic to the API. This sudden traffic spike caused no service disruption thanks to the cache and actually improved the efficiency of the API.
I jumped in to fix the problem without being asked to do so because where I come from, engineers take notice when the website goes down.
The first project I led at Crunchyroll was a third-party integration with AT&T.
Basically, AT&T would run a promotion offering some of its premium subscribers a free subscription to Crunchyroll's video on demand service.
As the project's tech lead, I:
QA was fun. AT&T actually mailed us batches of SIM cards linked to accounts in various states, which we would need to insert into phones in order to perform end-to-end testing. This put a physical limit on the number of end-to-end tests we could run and killed any hope of automated cross-organization integration testing. Internal automated unit and integration tests ensured that things on our end continued to operate as expected.
Integrating with large established enterprises requires plasticity.
It would have been nice if they had integrated with our API rather than delivering information via SFTP, but in the end it wasn't worth the fight. We were the smaller, more agile company and it made a lot more sense for us to build a one-off SFTP drop point to work with their existing systems than to ask them to integrate with us. Bending to the will of giants yielded a more stable system built in less time.
Much has been written about microservice standardization in Go.
There are a million different ways to structure a repository. In organizations with many teams and dozens of microservices, people spend a lot of time debating the best package structure on the premise that consistency is important. That's true to some extent, but as an application architect I felt that the behavior of the services was more important than their structure.
If you can establish a set of architectural constraints that everyone can agree to, you end up with a shared set of concerns that every service must address. This can be developed into a common set of interfaces that span the entire organization.
Systemic interface standardization reduces the learning curve for coming up to speed on a new service, and it reduces the total amount of work required to create a new microservice. That improves the quality of the components, because developers have more time to document and scrutinize their designs. It can also lead to increased coupling, since a change in a shared library is more likely to affect multiple services, so care should be taken in selecting which portions of the application ought to be shared and which ought to be duplicated.
Generally speaking, a component should only be shared between services if it is needed in more than one place today (don't assume that it might be needed) and it can be broken down into a small number of immutable interfaces.
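As a sketch of what that can look like in Go (the names here are hypothetical, not Crunchyroll's actual contracts), the organization-wide agreement might be a handful of small, stable interfaces that every service implements and that shared bootstrap code is written against:

```go
// Hypothetical example of an organization-wide service contract. Every
// microservice implements these small, stable interfaces; the shared
// bootstrap code that wires up logging, metrics, health endpoints, and
// graceful shutdown is written once against them rather than per service.
package service

import (
	"context"
	"net/http"
)

// HealthChecker reports whether the service and its dependencies are ready.
type HealthChecker interface {
	Healthy(ctx context.Context) error
}

// Service is the minimal lifecycle every microservice must satisfy.
type Service interface {
	HealthChecker

	// Name identifies the service in logs and metrics.
	Name() string

	// Handler exposes the service's HTTP routes.
	Handler() http.Handler

	// Shutdown releases resources before the process exits.
	Shutdown(ctx context.Context) error
}
```

Because the interfaces are tiny and effectively immutable, a change to the shared bootstrap rarely forces changes in the services themselves.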
Crunchyroll has multiple tenants feeding client- and server-side events to Segment. With billions of events per month, the pricing was becoming unreasonable at the same time that the business wanted to emit more events from more clients. Not all of these events needed to go through Segment; for the highest-volume event types, Segment served only as an intermediary between clients and S3.
So, it was decided that events dispatched from the 20+ clients would instead be routed to an event collection service, which would write all events to the data lake in S3 and proxy some events to Segment as necessary for product metrics.
As a co-lead of the project, I:
The end result was an API capable of efficiently routing more than 3,000 events per second. The only bottleneck was the number of Kinesis shards, which had not yet been fitted with a resharding Lambda for autoscaling.
One important aspect of making this service efficient was the addition of batching: 80ms of artificial latency was added to the API in order to enable batch writes to Kinesis.
This graph shows the introduction of artificial latency.
Below you'll see CPU time per request drop significantly after batching is introduced.
In fact, the trend is reversed: the service becomes more CPU-efficient under higher load, thanks to batching.
Scalability is not binary. Most would say that a service that scales linearly is a scalable service, and they'd be right. But I'd argue there's a higher goal we should all aspire to, and that's sub-linear scalability (illustrated above).
Batching is an easy way to achieve economy of scale in web service APIs.
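To make the batching concrete, here's a hedged sketch of the approach (the stream name, limits, and structure are illustrative, not the production service): incoming events are buffered for up to 80ms, or until a batch fills, and then flushed to Kinesis with a single PutRecords call.

```go
// Hypothetical sketch of the batching approach; names and limits are
// illustrative. Incoming events are buffered for up to 80ms (or until the
// batch is full) and then flushed to Kinesis with a single PutRecords call,
// so the service makes one network call per batch instead of one per event.
package main

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

const (
	flushInterval = 80 * time.Millisecond // the artificial latency budget
	maxBatchSize  = 500                   // PutRecords accepts at most 500 records per call
)

type event struct {
	partitionKey string
	payload      []byte
}

func runBatcher(events <-chan event, client *kinesis.Kinesis, stream string) {
	ticker := time.NewTicker(flushInterval)
	defer ticker.Stop()

	batch := make([]*kinesis.PutRecordsRequestEntry, 0, maxBatchSize)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		// Failed-record retries and error handling are omitted for brevity.
		_, _ = client.PutRecords(&kinesis.PutRecordsInput{
			StreamName: aws.String(stream),
			Records:    batch,
		})
		batch = batch[:0]
	}

	for {
		select {
		case e, ok := <-events:
			if !ok {
				flush()
				return
			}
			batch = append(batch, &kinesis.PutRecordsRequestEntry{
				Data:         e.payload,
				PartitionKey: aws.String(e.partitionKey),
			})
			if len(batch) == maxBatchSize {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}

func main() {
	sess := session.Must(session.NewSession())
	events := make(chan event, 1024)
	// HTTP handlers push decoded events onto the channel; holding the response
	// until the batch flushes is one way to realize the ~80ms of artificial
	// latency described above.
	go runBatcher(events, kinesis.New(sess), "event-collection-stream")
	select {}
}
```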
Architecture is the thing that happens before the building. I quit Crunchyroll because the company's biggest projects, in whose planning I was refused any involvement, were headed toward certain disaster, and as an architect I wasn't willing to accept responsibility for new systems I had no hand in designing. Rather than involving more technical leadership in the planning process, upper management decided to establish a new layer of non-technical project management between product and engineering, which I perceived as a strong move in the wrong direction. Since I left, the department has a new CTO and presumably a new direction. I enjoyed my time with Crunchyroll and wish them all the best.
Crunchyroll is a great place with a lot of amazing people.
Crunchyroll will always be a dearly beloved brand for Anime fans across the globe.