In distributed systems, the hardest part is often not finding the bug in your code, but tracking down which component is actually the source of the problem so you know what code to look at, or finding the requests that exhibit the bug and deducing what they all have in common.
The most effective way to structure your instrumentation, so you get the maximum bang for your buck, is to emit a single arbitrarily wide event per request per service hop. We’re talking wiiiide. We usually see 200-500 dimensions in a mature app. But just one write.
Any and all unique identifying bits you can get your paws on: UUID, request ID, shopping cart ID, any other ID <<- HIGHEST VALUE DETAILS
Any other useful application context, starting with service name
Possibly system resource state at that point in time, e.g. /proc/net/ipv4
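To make the "one wide event, one write" idea concrete, here is a minimal Python sketch. It is my own illustration, not code from the thread: the `WideEvent` class, the `handle_request` function, and every field name are hypothetical, and a real service would send the event to its observability backend rather than an in-memory buffer.

```python
import io
import json
import time
import uuid


class WideEvent:
    """Collects fields during one request and writes them out exactly once."""

    def __init__(self, service_name, sink):
        self.sink = sink  # any object with a write() method (hypothetical choice)
        self.start = time.monotonic()
        # Start with the identifying bits; everything else is added as we learn it.
        self.fields = {
            "service_name": service_name,
            "request_id": str(uuid.uuid4()),
        }

    def add(self, **fields):
        # Any part of the request path can attach more dimensions.
        self.fields.update(fields)

    def emit(self):
        # The single write per request per service hop.
        elapsed_ms = (time.monotonic() - self.start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        self.sink.write(json.dumps(self.fields, sort_keys=True) + "\n")


def handle_request(user_id, cart_id, sink):
    event = WideEvent("checkout", sink)
    event.add(user_id=user_id, shopping_cart_id=cart_id)
    try:
        # Placeholder for the real work; it keeps stuffing context into the event.
        event.add(items_in_cart=3, status="ok")
    except Exception as exc:
        event.add(status="error", error=repr(exc))
        raise
    finally:
        event.emit()  # runs on success and on failure, so there is always one event


sink = io.StringIO()
handle_request("user-42", "cart-7", sink)
lines = sink.getvalue().splitlines()
```

The `finally` block is the design choice that matters here: whatever happens inside the handler, exactly one event comes out, carrying every ID and piece of context gathered along the way.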
The entire thread is worth at least two read-throughs. I’m still pondering.
My current team relies more on structured logs sent to Splunk, extracting metrics and call geometry from UUIDs and spans, so the idea of a 200-500 element event per call is new, compelling and … feels correct. Like, I need to figure out how to start doing this awesome new thing here too. Especially for serverless, where you can’t log into the server and poke around; all you have are logs and/or events.