Plenary session
23 May 2023
4 p.m.
JAN ZORZ: Hello. Welcome to the 4 p.m. session. My name is Jan, this is Antonio, and we will be chairing this session. Please have a seat; we can wait another minute because I think the bell was only just rung.
So, we have an announcement: At 4 p.m., that's now, the voting starts for the Programme Committee. So please go have a look at who volunteered to be on the Programme Committee and do the work, and cast your votes. Thank you very much.
So, our first presenter is Tom, I saw him, Tom will talk about incident management at scale. Thank you.
TOM STRICKX: Good afternoon, everyone. As introduced, I am going to talk a bit about incident management at scale, specifically incident management at CloudFlare.
So, quick recap: basically the "why you have to trust me" slide. I love that name for the slide, it's great, I am going to keep using it. I am a principal network engineer at CloudFlare, and I also work on the backbone, so I kind of wear two hats. One of the other things I do is break things, usually things from vendors, but sometimes also my own stuff. When things break, you need to fix them, and that's basically what this talk is going to be about.
Why do we need to fix things? Why do we need to care about incident management in the first place?
It's that all of the infrastructure that we, as a community, have been building over the last 30 years is becoming more and more of a critical aspect of day‑to‑day life and day‑to‑day society, which means that we can get very annoying things in the press that are basically just "network X did something bad, and therefore you couldn't cook your pizza", or whatever else happens.
So it's really important that we get on top of these problems as fast as possible, that we fix them as fast as possible, but also that we communicate about them as openly and honestly as possible.
So what's incident management specifically at CloudFlare?
It looks a bit like this, sometimes. Some of you may have heard about ITIL; we don't follow ITIL, because ITIL specifically calls for a dedicated incident management team that, you know, kind of handles the entire process and makes sure things happen. The reason we don't follow ITIL is that everyone needs to be part of the incident process. That means that every engineer or every team that pushes code or hardware or infrastructure or anything else to the edge is responsible for their own stack. That means that, you know, you break it, you fix it, quite literally.
But that means that we need to make sure that everyone involved has really good training and a good understanding of what the processes are, and because we're a growing company, that's not always the easiest thing to do. So one of the things that we have done is institute an incident commander. Incident commanders are basically the WD‑40 of the incident process. They are only there to make sure that the individual contributors, the SMEs, the subject matter experts that are actually involved in understanding what broke and trying to fix what broke, can get all of the resources that they need. That can be, you know: do we need more time? More engineers? Do we need to pay someone? Do we need to get someone out of bed at 4 a.m.? Those sorts of questions can be handled by the incident commander, so that the engineers trying to fix it can work on the problem and don't have to deal with the meta side of an actual incident. Focus on the thing that you're good at and we'll deal with the rest; that's what the incident commander is for. The incident commanders are all of the engineering managers, and we follow a weekly rotation, so incident commanders are on‑call for seven days, but it is follow the sun: they are not on‑call for however many hours there are in a week. You are on‑call for eight hours, then you hand over to whoever is next.
It works really well. So let's talk a bit about that process, the process that the incident commanders are there to kind of WD‑40 the hell out of. First of all, you have detection. For detection, most of the time, hopefully, you have automated alerts: you set up some rules in whatever system you're using. We, for example, are a Prometheus company, which means we use Alertmanager, and that's how we set up our alerting pipeline. But it doesn't have to be that specifically; Grafana, for example, has a good setup for alerts as well. The really annoying thing with detection is that you need to know what can break, right? It's like a glass pane: if I throw a ball through it, that glass is going to break, and that's a known failure scenario. You can test for that. But not everything that we build, especially at the infrastructure level, has an easily detectable failure mode. So, for that, we discovered this amazing thing called customers. Customers are great at figuring out problems.
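To make the detection side a bit more concrete, here is a minimal sketch of the kind of threshold check an alerting pipeline encodes, written as a small Python poller against the standard Prometheus HTTP query API. The metric, threshold and URL are placeholder assumptions, not CloudFlare's actual setup, and a real deployment would express this as a Prometheus/Alertmanager alerting rule rather than a script.

```python
import requests

# Placeholder Prometheus endpoint and availability query (assumptions).
# The query computes the ratio of 2xx responses over the last five minutes.
PROMETHEUS_URL = "http://prometheus.example.internal:9090/api/v1/query"
QUERY = 'sum(rate(http_requests_total{code=~"2.."}[5m])) / sum(rate(http_requests_total[5m]))'
THRESHOLD = 0.8  # flag an incident when availability drops below 80%

def check_availability() -> None:
    resp = requests.get(PROMETHEUS_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        # No data is a detection gap, not a sign that everything is fine.
        print("no data returned; investigate the metric pipeline")
        return
    availability = float(result[0]["value"][1])
    if availability < THRESHOLD:
        print(f"availability {availability:.2%} below threshold: open an incident")
    else:
        print(f"availability {availability:.2%}: OK")

if __name__ == "__main__":
    check_availability()
```

The equivalent production setup would let Alertmanager route the alert to the on‑call rotation instead of printing to stdout.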
So, when customers detect problems, it's usually already a bit too late, because that means it's public, people have noticed, and people are going to start yelling at you; there is that additional pressure, and "whatever it is that you did, don't do it" is not usually the most constructive conversation to have. But that is what it comes down to: you need to do mitigation and remediation. It's really important that we distinguish between mitigation and remediation. For some people they might be the same thing, but mitigation is: stop the bleeding completely. If you lose an arm, you want to stop the bleeding, because otherwise you don't just lose the arm, you lose a life. Remediation then is: how do we continue from here? Can we get a prosthetic on there? Stuff like that. It's figuring out the long‑term solution to the problem that either a customer uncovered, or metrics uncovered, or, you know, just running things in production uncovered.
So that's the big difference: mitigation, which is stop the bleeding, whatever it takes, like rolling back the code, or, if you know what the specific bug is, for example you forgot a colon, pushing out a new release and getting that going as soon as possible; and then a longer think‑through of, well, how do we make sure this doesn't happen again in the future with a more solid structure?
That's where reporting comes in as well. We have a very clear culture of incident reports at CloudFlare. That means that every single incident that we have needs to have an incident report attached to it. For the most part they are internal, but if a customer says "we were impacted by incident X, Y, Z, could you please tell us what the hell happened and how you're making sure that doesn't happen again", then we occasionally redact some things and rewrite, it doesn't always have to be the nitty‑gritty detail, and then we ship that to the customer.
A really important thing to note with the reports is that we don't have the engineers who worked on the incident write the incident report; we have someone else do it. The reason for that is that it helps you solve another problem, which is communication during the incident. It is very easy for engineers to get stuck in their own heads: you're not communicating the things that you're thinking through, you're not communicating the things that you're seeing, or "I think I might solve it this way or that way", so you don't get feedback from your colleagues. By forcing someone else, an outsider, to write that incident report, you're forcing the engineers working the incident to over‑communicate: to communicate everything they are thinking, everything they are seeing, and to paste the right graphs so that we're all looking at the same thing. Because if you don't, you are going to get a bunch of questions from an engineer a day later, or two or three hours later, where they ask: so, 50 minutes into that incident nothing was said, and then suddenly you go "I fixed it". What happened? Where is the timeline of events?
So, by doing that, we kind of just make that better.
One of the other things as well, because we want to work on fixing the problem, not on communicating, is a bot that we have built, "respect tables". For us, we use Google Chat, I think we're the only company that uses Google Chat; it works, I guess, for the most part. But this bot allows us to very easily update timelines and very easily post status updates. It also allows us to page people, we can check on‑call schedules, things like that. It makes things easier so you don't have to go through all the separate interfaces that exist for dashboards and pagers and everything else; it centralises that, making sure it's as low friction as possible, again for the incident commander but also for the ICs if they really need it.
We also have an Incident Board. This screenshot took a long time to get, because an empty board basically means there are no incidents, and it's really, really hard to get to no incidents. The Incident Board will always list the active incidents at CloudFlare. It makes it easier for incident commanders coming into the rotation, for example: they can have a quick look at what's broken. But also for engineers: I'm in London, so I wake up, or get into work, at 9 GMT‑ish, and I can then open the Incident Board and have a quick look. Is anything broken? Is there anything out there that might require the network team's attention? If the board is clear, awesome. If the board isn't clear, I have a very quick, very easy, nice overview of what is broken and whether I may or may not need to be involved.
Another thing that we do, communication‑wise: all of the previous items are internal communication, right, making sure that the rest of the company knows what the status of the network is, what the status of our edge services is, and everything else. But we also need to make sure that we communicate externally, because CloudFlare wouldn't be CloudFlare if we didn't have customers. It would be significantly more fun, but, you know... something has to make the money that lets me get the links.
So, a thing that we use is CloudFlare Status. There are a gazillion different status page providers out there; we use one, and what it allows us to do is just communicate: hey, we know that this system is broken, DNS propagation is delayed, or we are having issues in this specific location, or whatever. So customers that see issues with CloudFlare can have a look at CloudFlare Status and see if we know about it, and if we don't know about it, you go on Twitter and yell about it, right?
A really important thing with the CloudFlare status page is that the address you see there is not CloudFlare IP space. Some of you may or may not recognise that IP space: it is Google IP space. Also, I want to thank Google for making sure that there are signed ROAs for that prefix, because, you know, it wouldn't be a RIPE conference without mentioning RPKI every five minutes.
IPv6 doesn't exist!
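As an aside, anyone can check whether a given prefix and origin AS are covered by a signed ROA. Below is a minimal sketch using RIPEstat's public rpki-validation data call; the AS number and prefix are placeholders, not the actual status-page prefix, and the exact response fields are an assumption based on the documented API.

```python
import requests

# Placeholder origin AS and prefix; substitute the pair you want to check.
ORIGIN_AS = "AS15169"
PREFIX = "8.8.8.0/24"

# RIPEstat's rpki-validation data call reports the RPKI state of an
# (origin AS, prefix) pair against the published ROAs.
resp = requests.get(
    "https://stat.ripe.net/data/rpki-validation/data.json",
    params={"resource": ORIGIN_AS, "prefix": PREFIX},
    timeout=10,
)
resp.raise_for_status()
state = resp.json()["data"]["status"]  # e.g. "valid", "invalid", "unknown"
print(f"{PREFIX} originated by {ORIGIN_AS}: RPKI {state}")
```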
But we make sure that the CloudFlare status page is not hosted on the CloudFlare network, because if we break the CloudFlare network, how can you check the CloudFlare status page? A circular dependency, as we all know, is a bad idea, so we make sure that's somewhere else. That way, when we do have a full, complete outage, people can still figure out that we have a full outage.
And then, last but not least, incident reviews. We do these once a week, in both the San Francisco time zone and the London time zone. We don't really have one in Singapore; we are trying to spin that up. Apparently Singapore does not have as many incidents, I guess. Good people.
What we do there is walk through the incident. We'll basically open the incident report, and either the engineering manager of the team that was responsible for the incident will walk through it, or, if the engineering manager wants an individual contributor, a random engineer, to walk us through what happened, they'll do it. The key point is that we can then have a chat, usually five minutes, because we need to move things along, and we try to figure out as a group what we could have done better.
It's great when you can walk through an incident that lasted ten minutes and was super easy to fix, where we had all of the tooling and all of the dashboards and everyone instantly knew what they needed to do, but that's not always the case. In those cases, we try to figure out: what went wrong? Is there something in the process that needs fixing? Did we get the right engineers on the right call at the right time? Did we get the right team? Did we even know something broke? Because, like I said at the beginning, it's all fine and dandy if you have metrics and can define alerts around those metrics, but if you don't have them, you rely on customers. That's a bad look. We prefer not to get paged by customers.
So, if it was a customer mailing in or calling or anything else, then we need to figure out: is there something in our existing systems that we could have used to detect the problem that the customer is talking about? Because it's going to happen again, right? The basic assumption should be that something that broke once will break twice, will break three times, will break four times, and will just keep breaking, and if you don't have anything that detects it, then you are just going to keep repeating the same pattern over and over again. I was going to say that's a quote by Einstein, but I think it actually isn't. It's not insanity.
Either way, that's kind of how we try to do incident management at CloudFlare. I hope it's kind of helpful for some of you to maybe review your own processes and implement some of the things that we talked about, but I'm more than happy to answer some questions now.
(Applause)
JAN ZORZ: Thank you, Tom. Interesting stuff. Are there any questions?
AUDIENCE SPEAKER: Okay, Tom. What kind of incident deserves a blog post post‑mortem?
TOM STRICKX: We are a blogging company that just happens to have a CDN, so... most of the time it's the ones that we deem interesting in one way or another. It's either going to be super interesting, or it was highly visible. For example, we had an incident last year in June or July where we basically killed 50% of our traffic. It wasn't a super interesting problem: at the end of the day it was just a misordered routing policy term that meant we weren't advertising prefixes that we were supposed to. That's not hyper interesting; I think everyone here with an operational background has done that at least once, probably more than once. So from that aspect, not super interesting, but it was highly visible, so at that point it behoves us to talk about it publicly, be honest about it, crawl through the dust a bit, to make sure that customers understand we have a process for this and we're trying to make sure it doesn't happen again. That's one aspect. And the other one is: well, that's kind of neat, that's not supposed to break that way. Then we talk about it, because I think it's a benefit to the entire community that we talk about interesting problems, because, you know, we all go through the same shit.
JAN ZORZ: Okay, it's you and then Will and Daniel.
AUDIENCE SPEAKER: Tim, DE‑CIX. You mentioned that you keep the status page out of your own network, which is perfectly reasonable. I would be interested in how the information on that status page is fed. Is that tied to your internal incident reporting, or does somebody need to do that manually? Is it part of a manual process?
TOM STRICKX: It's both. We have certain alerts configured that will automatically post to the status page. The first one I can think of is: if our API availability goes below 80%, we instantly, automatically create an incident. An incident gets spun up, and as part of that incident spin‑up the status page entry automatically gets created. The thing is, we can do that for hyper‑specific cases where we can automate it. But because our fleet is now so many different pieces of software that we offer as a service, it's really, really hard to automate all of it. So for those aspects it's our customer support folks: we'll tell them, please post a status page update, this is the wording, and then they'll put it up. So it's a mix of both.
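For the automated path, the shape of the integration could look something like the sketch below: an alert handler that opens a status-page incident through a webhook. The endpoint, payload fields and component name are hypothetical placeholders; the real integration is tied to internal incident tooling and whatever status-page provider is in use.

```python
import requests

# Hypothetical status-page API endpoint (placeholder, not a real service).
STATUS_WEBHOOK = "https://status-api.example.internal/v1/incidents"

def open_status_incident(component: str, message: str) -> None:
    """Create a public status-page incident for the given component."""
    payload = {
        "component": component,        # e.g. "API"
        "status": "investigating",
        "message": message,
    }
    resp = requests.post(STATUS_WEBHOOK, json=payload, timeout=10)
    resp.raise_for_status()

# Called by the alerting pipeline, e.g. when API availability drops below 80%.
open_status_incident("API", "Elevated error rates on the API; we are investigating.")
```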
AUDIENCE SPEAKER: Will, from a small ISP called Saitis. You are a fairly decently big company; would you have any suggestions for a company that's still an ISP with 20 employees on how we could deal with this? Because it's rough to run full‑time, 24/7 on‑call and so on.
TOM STRICKX: I can fully empathise. I think the best suggestion I can give is: be kind to yourself. We're sometimes hyper‑vigilant, which means that we start alerting on metrics we probably shouldn't really be alerting on yet, because of that customer anxiety, that feeling of "if we don't look at this, the customer might mail in". So be kind to yourself, and make sure that if you do have an on‑call rotation and you do have incidents, you are doing this within intervals that are reasonable, where you can say: a five‑minute outage is going to suck, but it's a five‑minute outage. Be kind to yourself.
AUDIENCE SPEAKER: Hi, Daniel Karrenberg, I am the first and current employee of the RIPE NCC speaking only for myself. Thank you very much for this presentation. I would really personally like much more of this happening, so I like the style, I like the openness. My question is: How do you justify that sort of inside the company? I'm quite sure there might have been people who say well we shouldn't be talking about our mistakes. How do you deal with that?
TOM STRICKX: Frankly, it never came up. So ‑‑
DANIEL KARRENBERG: Good. My follow‑up question is ‑‑ I tricked you ‑‑ my follow‑up question is: Why don't other ISPs do this, because what I get from some of the engineers is we're never allowed to speak about this stuff, how do you create a culture that is different? Because, my big example is the airline industry, they got forced into this and my fear is that we might be forced into this stuff as well if we don't do it ourselves. So any ideas?
TOM STRICKX: You are a hundred percent correct. You can bring the horse to water but you can't make it drink. So, as a suggestion, I think that from an engineering perspective we need to go to leadership and say: look, this is important, this helps, it's good for our reputation within the industry to openly and honestly talk about the shit that we fuck up. That's really important, and if we don't do this, the European Commission is going to make us do it. And them regulating us, it's already bad enough with the stuff that has already been talked about here; I'd also rather not get European Commission‑mandated incident reports, because they are going to be dreadful, so better we do it ourselves.
DANIEL KARRENBERG: Hear, hear, I like your speed test. Anybody who hasn't opened it, do so, it's the best one.
TOM STRICKX: Thank you, I'll let the team know. For those of you looking for it, it's speed.cloudflare.com.
AUDIENCE SPEAKER: RPKI team at RIPE NCC. I had to mention it again. You talked about going over the incidents with other people. Do you also go over alerts to refine them, and could you maybe elaborate on how you handle that?
TOM STRICKX: Alerts are handled on a per‑team basis. We do have an SRE team, but most of the stuff that we push into our network is owned by the engineering team that built and wrote that thing. They are also responsible for alerting and observability in their own software suites. So that's something we keep within the teams themselves, because for those aspects they are the SMEs; they are the ones who know what's important and what isn't. It's not super helpful for other engineering teams to come in as outsiders and say "oh, you should be alerting on that". That's not super helpful, but we do encourage teams to regularly prune alerts, and regularly prune metrics as well, because there is such a thing as too much data. So we make sure that the teams are aware that it's perfectly fine to do that, and to redefine alerts that are trigger‑happy, or not trigger‑happy enough, things like that. It's a culture more than it is a process.
AUDIENCE SPEAKER: I agree with that. Thanks.
AUDIENCE SPEAKER: Hi, Trey. Just a quick question. I forget the name of the tool that you referred to that was helping the incident commander cut through all the different systems, the "respect tables" bot. Is there any part of that that's generalisable enough to be open‑sourceable? Because it sounded really cool. Maybe that's a dumb question.
TOM STRICKX: No, probably not; the problem is that a lot of it is tight integrations with APIs that are either internal or semi‑public, so it's kind of difficult.
AUDIENCE SPEAKER: I hope you don't mind, I may borrow the idea somewhere.
AUDIENCE SPEAKER: Under the EU's new NIS 2, which is coming into force in 2024, and I believe CloudFlare would be covered under this as a DNS service provider, how are you updating your process to, say, comply with the reporting requirements for instance under NIS 2?
TOM STRICKX: We are already covered by NIS in Germany and the UK. For DNS specifically.
AUDIENCE SPEAKER: But specifically what changes are you making as a result of NIS 2 versus 1?
TOM STRICKX: As far as I know, there are no changes that we are required to make on the DNS side. We need to report to, I think for us it's the Portuguese regulator, so whenever we have an outage, we report it to the Portuguese regulator and DPS.
AUDIENCE SPEAKER: Fine.
TOM STRICKX: That's the closest I'd like to go to NIS 2 for mental health reasons more than anything else.
AUDIENCE SPEAKER: Entirely understandable. I was just looking at it this morning and dreading it.
JAN ZORZ: All right. If there are no more questions, Tom, thank you very much.
(Applause)
Giovane Moura. Assessing e‑Government DNS resilience.
GIOVANE MOURA: All right. Can you hear me? Good afternoon, everybody. My name is Giovane Moura and I am going to be presenting this work, done with colleagues from the University of Twente. I am a data scientist with SIDN Labs, which is the research team at SIDN, which is a registry, and I am also an assistant professor at TU Delft, 20 kilometres from here, which is where Delft Blue china comes from. Just before I start: some of you may have known Riccardo, a colleague at Twente, where I did my PhD, and also a colleague at SIDN Labs; he unfortunately passed away last March. He was from Brazil and had been at this conference before.
All right, a little bit of context before I dive into the research. This was a paper we did. Before we published the paper, the national cybersecurity centre here in the Netherlands had commissioned SIDN Labs to do a study on Dutch government e‑gov domain names: their resilience and the infrastructure they use for DNS. That became the Dino project, and we had a number of colleagues working on this research. This paper in particular is an extension of that project, because the Dutch government just needed to know how the Dutch e‑gov behaves, and I took this quote from a philosopher: if it's worth doing, it's worth overdoing. So we not only got the Dutch e‑gov names, we got e‑gov domains from other countries as well and compared them.
Governments are increasingly using the Internet to communicate with citizens, and e‑gov provides crucial services. Here in the Netherlands you can do your taxes, apply for benefits, get a driver's licence, a bunch of stuff. Here is a screenshot from the local government where I live; there are a bunch of things you can do online with e‑gov. And what happens when e‑gov breaks?
This happened in the US: some Russian hackers shut down state government sites in the US, in Kentucky and Colorado, with a DDoS attack, and that became a problem for those people. Since e‑gov depends on the Internet and the Internet fully depends on the DNS, we have an issue here. And I found this beautiful haiku: it's not DNS, there is no way it's DNS, it was DNS. It's always the DNS when there is a problem. So we're going to look at the DNS back‑end of e‑gov domain names from different countries and compare them.
DNS has been designed for resilience: there are a bunch of ways you can replicate servers, with Anycast and a bunch of other features. But these things are complex, they are not exactly cheap, and configuration errors may even go unnoticed, because you only need one server to operate correctly; if a resolver can reach that one, you are never going to notice until the system breaks. The question we want to answer is: are the e‑gov authoritative servers configured following best practices for robustness, and are they redundant enough? What we do is use Internet measurements. So our contribution is to analyse the e‑gov DNS infrastructure for four different countries.
What we do is measure the e‑gov infrastructure for different countries using active measurements, compare them, because we need to know how they compare to each other, and provide some suggestions for improvement. Why did we choose these four: the Netherlands, Switzerland, Sweden and the US? Because it's actually very hard to get e‑gov domain names; nobody makes this information available. For the Netherlands, most of the Dutch e‑gov domains are registered under .nl, but we actually had to get the cybersecurity centre to give us a list of those domain names. In Switzerland and Sweden it was the same: we had contacts who helped us get these domain names, because otherwise we wouldn't know them. The US publishes the .gov list; it's open information, you just go there and download it. So this data is not easy to get.
This is the dataset we have, with these four countries. This is the number of second‑level domains we are going to be looking into per country. The Netherlands has about 600 domain names for a population of 17 million people. Sweden is roughly the same. Switzerland has roughly 4,000, but only 8.7 million people; I suspect it's because of the languages they have, I think even four, so the same entity can have three domains, one in German, one in French and one in Italian, maybe it's because of that. And the United States, because it has a large population, has almost 8,000. I'm only looking at second‑level domains here. When I was doing these slides it looked to me like the medal tables you get at the Winter Games, with all the skiing.
So, the first thing we are going to look at with the measurements is single points of failure: you don't put all your eggs in one basket. We are going to look at different aspects of that, different types of baskets.
The first one is DNS providers: the use of a single or multiple DNS providers. How do you measure that? We have a domain, let's say example.nl, and this domain has two authoritative DNS servers, a.example.nl and b.example.com. Each has its own IP address, and those addresses are announced by an AS. What we measure is the number of ASes that announce the IP addresses of a particular domain's authoritative servers. In this case we have two DNS providers for this domain, meaning that if one goes down, the other one can take over the traffic and respond to the DNS queries, and users can still reach their e‑gov services.
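A minimal sketch of this kind of measurement for a single domain is shown below, assuming the dnspython library and Team Cymru's public IP‑to‑ASN mapping service; the paper's actual methodology runs active measurements over the full per‑country domain lists. The second field of the Cymru answer is the announcing BGP prefix, which also supports the prefix‑diversity metric discussed later.

```python
import dns.resolver  # dnspython

def provider_ases(domain: str) -> set:
    """Rough provider count: distinct origin ASes of a domain's NS addresses."""
    ases = set()
    for ns in dns.resolver.resolve(domain, "NS"):
        for a in dns.resolver.resolve(str(ns.target), "A"):
            ip = a.address
            # Team Cymru IP-to-ASN service: reverse the octets, query TXT.
            reversed_ip = ".".join(reversed(ip.split(".")))
            txt = dns.resolver.resolve(f"{reversed_ip}.origin.asn.cymru.com", "TXT")
            # Answer looks like: "15169 | 8.8.8.0/24 | US | arin | 1992-12-01"
            fields = str(txt[0]).strip('"').split("|")
            ases.add(fields[0].strip())  # origin ASN
    return ases

if __name__ == "__main__":
    domain = "example.nl"  # placeholder; the study uses the e-gov domain lists
    print(domain, "->", provider_ases(domain))
```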
This is the table that summarises that. Here you have the second‑level domains for each country and the shares we found: 42% for the Netherlands, 41% for Sweden, 43% for Switzerland and 82% for the US (for IPv4) use a single DNS provider. Now, I have to remember that these domains are very dissimilar; some are from local governments that may not consider themselves important enough to need redundancy at this level, and some maybe do. But still, I think 82% of e‑gov on a single provider may be a little risky. It's not up to me to evaluate that; they should make their own decision. You could say, well, this is a bogus metric, I'll just put everything in the cloud. A German company called Haribo even makes its own cloud candy, as you can see here, but even the big clouds fail: there was the Dyn attack, and Route 53 in 2019 as well. Amazon.com doesn't use AWS for its own DNS, they use Dyn, and we just saw that CloudFlare doesn't use its own network for its status page, they use Google. It's not good practice to put all your eggs in one basket.
And if you look at who those DNS providers are per country: here we have a table which shows the top five DNS providers per country. They are maybe not companies you would know unless you live in those countries, and the reason is that the DNS provider market for e‑gov is dominated by local companies. This is because you have small towns and local governments saying "we need a new website, which provider should we use?", and they use the hosting provider they saw in an advertisement. It's a really local decision, at least for the Netherlands; I talked to the people there. So here we have TransIP, which is well known in the country. I think it comes from that tradition. The conclusion is that most DNS providers are local; the only exception here is number 3 in Sweden, which is Microsoft.
All right, now we are moving to a different metric: the number of DNS servers that each domain has. If we have a domain here, example.nl, the NS records define how many servers it has in DNS, and here it has two. In this case we found only six e‑gov names with a single name server, which is against the original RFC 1034; that RFC is 35 years old, older than a bunch of people here, though unfortunately not older than me. We notified .gov, where three of those domains were, and they fixed them, but we did not do the same for the other zones.
The other thing we look into is the number of BGP prefixes that announce those DNS servers. example.nl has these servers with these IP addresses; if they come from the same BGP prefix, that means they are not topologically diverse, they share part of the same infrastructure, either routing or some other part, and that's not recommended for e‑gov services, because e‑gov is important enough.
So, now the results. If you have one prefix, everything sits behind the same announcement and you have a false sense of security: if there is a large DDoS that takes down that prefix, your entire e‑gov site is going to be down.
What we have here in this figure is a CDF of the number of prefixes that announce the e‑gov domains' name servers. What I'm showing here is that roughly one third of the Swiss e‑gov domains have a single prefix for their DNS servers; for the other countries it's around 20%. Most domains have two, which is a good thing, two is better than one, and some even have four or more.
So this is the number of BGP prefixes.
The other thing we look at is the number of top‑level domains used by the DNS servers. For example, this one has .nl and .com as the TLDs of its authoritative servers, so if something were to happen to .com or to .nl, you would still be able to resolve this domain. It's the TLDs of the NS records, not of the domains themselves, by the way. So here we have two. In the figure we show, per country, the percentage of domains by how many TLDs they use for their NS records. Most e‑gov domains from Switzerland, the red colour here, use only one TLD, their own. The Netherlands, on the other hand, is the most diverse, at around 40%, and the others come in between. So it's a good general policy to diversify if you can. It doesn't mean things can't still break, but it's a principle you should apply to your e‑gov services if you consider them important.
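A rough sketch of how the TLD diversity of a single domain's NS set could be computed, again assuming dnspython; the paper does this over the full per‑country lists, and the example domain is a placeholder.

```python
import dns.resolver  # dnspython

def ns_tld_diversity(domain: str) -> set:
    """Distinct TLDs of a domain's NS names, e.g. {'nl', 'com'}."""
    tlds = set()
    for ns in dns.resolver.resolve(domain, "NS"):
        # NS targets are absolute names like 'a.example.nl.'; the last
        # non-empty label is the TLD.
        labels = [label for label in str(ns.target).split(".") if label]
        tlds.add(labels[-1].lower())
    return tlds

print(ns_tld_diversity("example.nl"))  # placeholder domain
```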
So, only one TLD is not a really good policy. And if you look at which TLDs the e‑gov domains use, the number one is obviously their own; for the United States, even though .gov is theirs, .com is also very closely associated with the US. The number one is always their own, and then it alternates between .net and .com, probably for historical reasons. .eu is not very popular, and neither is .be, Belgium, the neighbouring country here. So the conclusion is: typically they use their own country's TLD for e‑gov, and then .com and .net.
Now, we're going to look at extra features that can improve the resilience of DNS. One of those is IP Anycast; we had a paper about how Anycast reacts to denial‑of‑service attacks. In this presentation, all the links in red are clickable, so you can download the papers later.
And the second thing we're going to look into is the time‑to‑live (TTL) fields in DNS. We had two papers on that as well; RFC 9199 summarises them for operators.
So here we have IP Anycast and unicast. With unicast, you have one prefix being announced from one location, in this case Amsterdam, so all traffic goes to that single location. If there is a DDoS, all the attack traffic goes there too and you go down. With Anycast, you have the same prefix being announced from, in this case, four different locations on different continents, so traffic gets distributed among those locations, and some may remain online while others go down.
We managed to use an Internet census, and one of the co‑authors of this paper is also in the room, and we found that in the US 58% of the e‑gov domain names use Anycast, which derives from which DNS providers they use. Not so good: very few of the Swiss e‑gov domain names use Anycast. Sweden and the Netherlands have about 20% of their e‑gov name servers on Anycast.
So it's ironic, because Switzerland's GDP per capita is, I think, the highest of these countries, but their e‑gov is still lagging behind.
And these are the ones that don't have any Anycast. Most of them have two.
So, time to live. I think we talked about TTLs here before. The DNS TTL is a field that says how long a record is going to remain in cache, and it's the last resort you have against a large outage: if an authoritative server goes down but the record is still in the cache, your local resolver can still retrieve it and you can still resolve the e‑gov domain. The current recommendation, and this is not set in stone, it's purely a recommendation based on the studies we did, is to use a TTL of a couple of hours for your domains. What we did is look at the e‑gov domains, measure their TTLs, and plot the average per country. We see here, in seconds, that the Netherlands is doing reasonably well, but Sweden and Switzerland are using one hour for the authoritative server records and the same for the IP addresses of those authoritative servers. All the folks here in light red could increase their TTLs. The United States is a bit better. Even if there is a DNS problem, cached records with a couple of hours of TTL give you time; one hour is not enough, because you need someone to go there and fix it, so at least two hours gives people time to fix the stuff. That would be helpful for them.
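A small sketch of that TTL measurement for one domain, assuming dnspython: it reads the TTL of the NS set and of each name server's A record and flags anything under a couple of hours, which is the talk's rough suggestion rather than a hard rule.

```python
import dns.resolver  # dnspython

SUGGESTED_TTL = 2 * 3600  # a couple of hours, per the talk's suggestion

def check_ttls(domain: str) -> None:
    # Note: answers from a recursive resolver carry the remaining cache TTL;
    # for the authoritative value, query the zone's name servers directly.
    ns_answer = dns.resolver.resolve(domain, "NS")
    records = [("NS", domain, ns_answer.rrset.ttl)]
    for ns in ns_answer:
        a_answer = dns.resolver.resolve(str(ns.target), "A")
        records.append(("A", str(ns.target), a_answer.rrset.ttl))
    for rtype, name, ttl in records:
        flag = "low" if ttl < SUGGESTED_TTL else "ok"
        print(f"{rtype:2} {name:30} TTL={ttl:6d}s [{flag}]")

check_ttls("example.nl")  # placeholder domain
```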
So far we have only talked about web, but in this paper we also look at e‑gov e‑mail DNS, which is important. In the paper we also measured the back‑ends of e‑gov e‑mail. We are in Rotterdam, home of a very famous philosopher, and here is a picture of him writing some e‑mails, well, letters, some centuries ago; he was very popular. What we did, taking example.nl, to look into its DNS back‑end for mail: we first get the MX records, which are the DNS records that define the mail servers; then the authoritative servers for the MX domains; and then the A and AAAA records, and we follow on from there. What we found, showing only the most popular provider for each country, is that it's not your local provider that's the most popular here: governments really love Microsoft. They like it for stability, and maybe it's a generation thing; there are older people in government, they don't want to use Slack or whatever tools young people are using nowadays, they want their Outlook. It's around 20% here in Switzerland, 41% in the US and the Netherlands, and Sweden is roughly 39%, a little under 40. So the DNS back‑ends for mail are mostly with Microsoft and Outlook in all the countries, but the DNS back‑ends for web are mostly local markets. That's a very interesting finding.
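A minimal sketch of following that mail chain for one domain with dnspython is below; it only resolves MX targets and their A records and approximates the MX target's zone, whereas the paper walks the full chain including AAAA records. The domain is a placeholder.

```python
import dns.resolver  # dnspython

def mail_backend(domain: str) -> None:
    """Walk MX -> mail-host addresses -> NS of the MX target's (approximate) zone."""
    for mx in dns.resolver.resolve(domain, "MX"):
        mail_host = str(mx.exchange).rstrip(".")
        addrs = [a.address for a in dns.resolver.resolve(mail_host, "A")]
        # Approximate the MX target's zone as its last two labels
        # (fine for outlook.com-style providers, not for every TLD).
        mx_zone = ".".join(mail_host.split(".")[-2:])
        ns_names = [str(ns.target) for ns in dns.resolver.resolve(mx_zone, "NS")]
        print(f"{domain}: MX {mail_host} {addrs} zone {mx_zone} NS {ns_names}")

mail_backend("example.nl")  # placeholder domain
```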
There are more things in the paper; you can download it if you are interested in this part of the back‑end. The recommendations we have for e‑gov in general: if you consider a particular e‑gov website important, you should build some diversity into it. Add more DNS providers, have more prefixes announcing them, use different TLDs for your authoritative servers, deploy Anycast for more robust services, and reconsider some of the low TTL values you have; you can increase them. Here is an example of infrastructure that lasts a long time: this is a Roman aqueduct in Spain. And with that, we reach the conclusions.
We found that many e‑gov domains are not following these recommendations for robust services. This creates unnecessary risk, and operators should re‑evaluate their set‑ups, and whether they should replicate more, if they consider their services critical. We hope this can help prompt operators to improve their resilience and redundancy. Again, another example of long‑lasting infrastructure.
And if you are interested, here is the paper; you can download it here. And yesterday, in the time session, the NTP session, we were talking about NTS: we have a free NTS server here in the Netherlands, you can check that out. Anyway, those are the conclusions. If there are any questions, I'd be happy to take them.
(Applause)
AUDIENCE SPEAKER: Hi. Lars Liman from NetNod. You talk about the recommendations. I can probably drag out twice as many recommendations as there are DNS operators in this room. Which ones are you talking about?
GIOVANE MOURA: Just, well, exactly the ones I evaluated. Don't use a single provider ‑‑
AUDIENCE SPEAKER: Where does it come from?
GIOVANE MOURA: So, from ourselves.
AUDIENCE SPEAKER: My point is that there is not a single set of recommendations. You cannot talk about "the" recommendations. You can talk about some recommendations, and you can explain why they are good or bad in different situations, but this sounds like there is a single set of recommendations that everyone should operate from, and there isn't. So that's ‑‑
GIOVANE MOURA: I think I said that, this is not set in stone.
AUDIENCE SPEAKER: Thanks.
JAN ZORZ: That's a good topic for a document.
AUDIENCE SPEAKER: There is one on the way.
GIOVANE MOURA: These are our own recommendations, these particular ones here. They are only suggestions, let's say, if you will.
AUDIENCE SPEAKER: Sorry, this may be an inappropriate or silly question, but... Trey Darley, I speak for myself, but Accenture Security and the first.org Board of Directors. So, DNS considered as infrastructure: there have been a fair number of talks about DNS here, and it seems like there have been issues with the protocol for a very long time. Just as a sort of thought experiment and a question to you: in 2039, after we have dealt with the end of the UNIX epoch, do you think we'll still be using the DNS protocol?
GIOVANE MOURA: That's a very good question. I actually don't know the answer, because DNS was never designed to last this long and it's still here, and IPv6 was designed to replace IPv4 and that still hasn't fully happened, so... again, I have no idea.
AUDIENCE SPEAKER: So then I'll just ask a follow‑up. There is an old saying that any sufficiently advanced technology is indistinguishable from magic. At some point, we're going to get locked into tech...
JAN ZORZ: Okay, any more questions? Online.
CHAIR: From Carsten Schiefner: "Did you encounter any problems in Internet services that eventually got rooted in a low TTL? Internet services meaning services on the Internet: web, mail, etc."
GIOVANE MOURA: I can't remember by heart. We did other studies on that, and I don't remember specific outages now, but I remember we found, for example, that the authoritative servers of the .uy TLD were using very low TTLs, and what happened is that they were getting massive latency in the service, because all the queries had to go to the authoritative servers before they came back. We notified them, they changed it, and latency was reduced by 90%. So higher TTLs not only protect against outages, they also improve performance, because you can get responses from the cache, and nothing beats caching in DNS.
JAN ZORZ: All right. Any more questions? It doesn't look like people are running to the mics. Okay. Thank you very much.
(Applause)
This concludes this session. Please remember to rate the presentations, so that we as the Programme Committee know which ones you liked, and vote for the new PC members. Thank you very much.