Meta problem with uRPF or bundle in Boca Raton (metafixthis.com)
56 points by synthesis5x 3 months ago | 11 comments
Meta has a problem in its clusters in Boca Raton, Miami; this is affecting the MNA content delivery network and direct content consumption. This has a regional impact in Latin America, since so far most non-cacheable content is consumed from the clusters in Florida.

The impact is traceable via ICMP, but also reproducible via TCP and difficult to measure via UDP. This is why monitoring tools are misleading: there is no “slowness” resulting from interface saturation; instead, there is data corruption where packets are discarded at the interface level. Therefore, if network performance is measured using those same data points, it won’t work and you won’t see any alerts.

The issue can also be replicated from the looking glass. I will attach images below; you can also see them, along with a more specific report, on the website attached to the post.

There is packet loss and probably flapping on a BGP instance, OSPF, or some IGP within Meta’s network. I believe it is between 129.134.101.34, 129.134.104.84, and 129.134.101.51. It is possible that it’s a faulty interface in a bundle or some hardware issue that a “show interface status” doesn’t reveal, which is why I’ve failed to report this problem through your NOC.

How can Meta replicate the failure?

1: Look for random MNA cluster IPs from your clients.
2: Ping from 157.240.14.15 with a payload larger than 500 bytes (a packet is more likely to get corrupted on a faulty interface if the payload increases).
3: Ping many servers from point 1.

You will see that once you find the affected upstream or downstream route combination, you will have 10-60% packet loss to the destination host.
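The steps above can be sketched as a small sweep script. This is only an illustration run from your own vantage point, not from Meta's host: it assumes the Linux iputils `ping` (`-c` for count, `-s` for payload size) and its usual "N% packet loss" summary line, and the function names and the default payload of 1000 bytes are my own choices, not something from the report.

```python
import re
import subprocess

# Matches the summary line printed by Linux iputils ping, e.g.
# "10 packets transmitted, 7 received, 30% packet loss, time 9012ms"
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_loss(ping_output: str) -> float:
    """Return the packet-loss percentage found in ping's summary output."""
    match = LOSS_RE.search(ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))


def ping_loss(host: str, payload: int = 1000, count: int = 50) -> float:
    """Ping one host with a large payload and return the loss percentage."""
    proc = subprocess.run(
        ["ping", "-c", str(count), "-s", str(payload), host],
        capture_output=True,
        text=True,
    )
    return parse_loss(proc.stdout)


def sweep(hosts, payload: int = 1000, count: int = 50) -> dict:
    """Ping many destinations; loss only shows on some source/destination pairs."""
    return {host: ping_loss(host, payload, count) for host in hosts}
```

Calling `sweep([...])` with a list of cluster IPs collected in step 1 returns a host-to-loss mapping, which makes it easy to spot the destinations that land on the affected path.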

How to fix it? Isolate the port or discard faulty hardware.

Why didn’t we see it before?

Simply put, your monitoring tools and troubleshooting protocols don’t work for these problems. The protocol is to attach a HAR file that bases its performance on window scaling and TCP RTT; if both are good, even with data loss, there’s “no problem.” Especially because that HAR file is extracted using QUIC, and QUIC is particularly good at mitigating slowness caused by data loss (since packets are retransmitted without the TCP penalty). You know what uses TCP? WhatsApp Statuses, and those are slow.

Can an MTR show where the problem is?

Generally not, because:

In any network route, there is a certain number of hops; for example, suppose there are 5 hops between host A and host B. To perform a traceroute, packets are sent with increasing TTL values (1, 2, 3, etc.). Each time a packet expires before reaching its destination, the transit hop reports a “TTL Time Exceeded” message, which is how the route is mapped. The problem is that these are basically point-to-point probes; it’s like pinging each hop individually. And when there’s a problem on an affected interface in an ECMP group or bundle, those point-to-point probes won’t necessarily take the affected path. So they are unreliable; generally, you will see the losses attributed to the final host even though the fault is in the middle.

See metafixthis.com for the full report.
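To illustrate why only some probes see the loss, here is a toy model of a bundle where each flow is hashed onto one member link and a single member drops packets. Everything in it is an assumption for illustration: the 4-member bundle, the 40% drop rate, and the CRC32-based hash are stand-ins, not how Meta's routers actually hash traffic.

```python
import random
import zlib

MEMBERS = 4        # links in the hypothetical bundle
FAULTY = 2         # index of the member assumed to corrupt packets
DROP_RATE = 0.4    # assumed loss rate on the faulty member


def member_for(src: str, dst: str, proto: str, ident: int) -> int:
    """Pick a bundle member from a hash of the flow key (toy stand-in for a router hash)."""
    key = f"{src}|{dst}|{proto}|{ident}".encode()
    return zlib.crc32(key) % MEMBERS


def loss_for_flow(src: str, dst: str, proto: str, ident: int,
                  packets: int = 1000, seed: int = 0) -> float:
    """Fraction of packets lost by one flow; zero unless it hashes onto the faulty member."""
    if member_for(src, dst, proto, ident) != FAULTY:
        return 0.0
    rng = random.Random(seed)
    dropped = sum(rng.random() < DROP_RATE for _ in range(packets))
    return dropped / packets


if __name__ == "__main__":
    # Same source and destination, different flow identifiers: each probe
    # can hash onto a different member, so only some of them see any loss.
    for ident in range(8):
        member = member_for("198.51.100.1", "192.0.2.7", "icmp", ident)
        loss = loss_for_flow("198.51.100.1", "192.0.2.7", "icmp", ident)
        print(f"probe {ident}: member {member}, loss {loss:.0%}")
```

In this model a probe that happens to hash onto a healthy member reports a clean path, which is the same reason per-hop results in an MTR can look fine while end-to-end traffic is losing packets.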



I hate it when I can't get in touch with the right engineers at a large company. This (especially the highly targeted ad mentioned on the page) is a very creative way to try to solve that problem.

Not associated with Meta, but this piqued my interest. That being said, I found some parts confusing and hard to follow. For example, what does uRPF (Unicast Reverse Path Forwarding) in the title of this submission have to do with the contents?

And is the packet loss supposedly happening at specific times only? It's not mentioned anywhere, but one screenshot highlights the time. I couldn't reproduce the packet loss using any of the looking glasses and dest IP addresses in the screenshots. At this point, if this was a report I had received about one of my services, I would have probably bumped down the priority to low and asked for a reproducible test, because in my experience even issues that affect a single path in an ECMP group are not this hard to reproduce. I think it's way more important to give the engineer who will process the report an easy way to check that there is indeed a problem than to start to teach how traceroute works.

TBF, there does seem to be an issue somewhere, because sticking 129.134.80.234, one of the Meta IP addresses from a screenshot, on ping.pe does definitely show significant packet loss from more locations than you'd expect to see for an address with no connectivity issues.


Addressing the question of what uRPF has to do with this: it’s possible (unlikely, but possible) that Meta hasn’t been able to find the issue because, while it looks like a faulty interface within a bundle (and it probably is), it could also be that an internal route has uRPF accidentally enabled and is receiving asymmetric traffic, causing it to drop packets on that path. It’s a possibility, but only Meta would know for sure. I included it in the title to give them a lead; it can really only be one of two things: uRPF on an interface participating in an ECMP group, or an interface dropping packets at the hardware level within a bundle.
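For readers unfamiliar with uRPF, the strict-mode check can be sketched roughly like this: a packet is accepted only if the best route back to its source address points out of the interface it arrived on. The forwarding table, prefixes, and interface names below are hypothetical, and real implementations differ in how they treat ECMP next hops.

```python
import ipaddress

# Hypothetical forwarding table: prefix -> interface used to reach that prefix.
FIB = {
    ipaddress.ip_network("10.0.0.0/8"): "et-0/0/1",
    ipaddress.ip_network("10.1.0.0/16"): "et-0/0/2",
    ipaddress.ip_network("0.0.0.0/0"): "et-0/0/3",
}


def best_interface(addr: str) -> str:
    """Longest-prefix match: the interface the router would use to reach addr."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in FIB if ip in net]
    return FIB[max(matches, key=lambda net: net.prefixlen)]


def strict_urpf_accepts(src: str, arrival_if: str) -> bool:
    """Strict uRPF: accept only if the route back to src uses the arrival interface."""
    return best_interface(src) == arrival_if
```

With asymmetric traffic, a packet from 10.1.2.3 that arrives on et-0/0/1 fails this check even though it is legitimate, because the route back to it points at et-0/0/2. That is the kind of silent drop described above.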


Hey man. I agree, this issue has been going on for nearly six months, and they’ve been closing my tickets—it’s honestly a joke at this point. Back in 2023, the exact same thing happened, and I had to resort to social engineering just to get them to find the problem; they fixed it a day later. I’m not proud of doing that, but I have to emphasize it because Meta has built performance dashboards designed to delude themselves.

Packet loss is happening all the time, though it might be more noticeable during peak hours since a faulty interface will show a higher error rate under heavy load. You can replicate it using looking glasses; maybe you didn't see it five days ago but you do now. Since it’s an ECMP issue, it depends heavily on which source and destination servers you’re testing. It’s just a matter of iterating.

I’m glad you were able to replicate it on ping.pe; Meta, however, still has no clue.


I’m not close to being a network engineer but having been in Florida for work, I couldn’t help but feel that their network infrastructure was off in some way. Miami felt like a black hole, with network traffic being sucked down to it even if you were up in the northern end of the state.

Bumping for visibility.


Meh. It’s not what you’d expect from a multi-billion dollar company.


Interesting problem, perhaps you could replicate results using RIPE Atlas to see geographical impact as well?


That's a very good idea. Thank you.


Does this mean Meta has a bad interface/optics in their internal network?


100%


> I will attach images below

Where?


Sorry, I’m new to this forum. I thought I’d be able to attach images, but you can find them all on the website linked to the post. Apologies for the late response; it’s been a rough few days. Meta reacted to this thread with hostility and even tried contacting my clients. It doesn't feel good to get on an organization's bad side, but I didn't see any other alternative.



