26 May 2023
Plenary
9 a.m.:
DMITRY KOHMANYUK: Morning, everyone. We are about to start our Friday happy plenary, I thank you for surviving the RIPE dinner and please just a moment and we will go with our programme. Two longer talks and three shorter talks, also known as lightning talks, so we are going to announce them really shortly.
Thank you. The first one is called listening to the noise, understanding QUIC deployments using passive measurements.
JONAS MÜCKE: Today I am going to present you our joint work on using passive measurement data to analyse QUIC deployments.
We have talked about QUIC at this conference before, but I want to introduce you a little bit to the protocol. QUIC is a new transport protocol based on UDP, so any features you are used to from TCP now have to be implemented in QUIC.
Additionally, QUIC gives you encryption, and even some of the metadata is hidden; this is to avoid ossification of the transport protocol.
The name QUIC might suggest that it is quicker or faster, and this is indeed the case. If you, for example, want to request a website, you normally first have to do your TCP handshake and TLS handshake and then you do your HTTP request, but in QUIC the TCP and TLS handshakes are combined, so after one roundtrip time you can request your website and get a response.
So, in this talk we are going to explore hypergiant deployments so we should have a look at how those look like, hypergiants are just entities that contribute large shares of the Internet traffic.
So, to handle the vast amounts of traffic they receive, they use multiple layers of load balancing. The first uses DNS: if you request a DNS record, they will return an IP address, or usually multiple IP addresses, but these are virtual IP addresses, as we call them, because they are not used by one host; multiple hosts can respond for these IP addresses.
Next, the user will use this address to connect to the infrastructure of the hypergiant, and within that infrastructure there might be additional layers of load balancing: for example, first ECMP load balancing, and then there might be Layer 4 and Layer 7 load balancers. The step from Layer 4 to Layer 7 is often handled using consistent hashing of the five‑tuple: you take a hash of your source and destination addresses and ports plus the transport protocol, and then, according to this hash, always forward packets of the same connection to the same Layer 7 load balancer.
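The five‑tuple hashing step described here can be sketched as follows. This is a minimal illustration, not any hypergiant's implementation: the `HashRing` class, the backend names and the number of ring replicas are assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Stable 64-bit hash (Python's built-in hash() is randomised per process).
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical Layer 4 stage that picks a Layer 7 load balancer."""

    def __init__(self, backends, replicas=100):
        # Each backend is placed on the ring several times to spread the load.
        self._ring = sorted(
            (_hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(replicas)
        )
        self._keys = [key for key, _ in self._ring]

    def pick(self, src_ip, src_port, dst_ip, dst_port, proto):
        # Hash the five-tuple and walk clockwise to the next backend point.
        key = _hash(f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}")
        index = bisect.bisect(self._keys, key) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["l7-a", "l7-b", "l7-c"])
first = ring.pick("198.51.100.7", 40000, "192.0.2.10", 443, "udp")
second = ring.pick("198.51.100.7", 40000, "192.0.2.10", 443, "udp")
```

Packets with the same five‑tuple always land on the same backend, which is what keeps a connection on one Layer 7 load balancer as long as the client's address and port do not change.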
For this talk I will refer to on‑net deployment as deployments within the autonomous system of the hypergiant and to off‑net deployment as deployment outside of the autonomous system of that hypergiant.
So, what has prior work done.
Prior work has, for example, scanned the entire IPv4 address space, and this way you end up with a list of virtual IP addresses that are used by given hypergiants. You can also analyse how the DNS load balancing works, and you can use TLS certificates to detect off‑net deployments.
What we want to look into is the Layer 7 load balancers so where the content is served from.
And we want to do this with passive measurements because passive measurements are not intrusive: you sit there, wait for data and analyse it, and you can even analyse your competitors because you don't send additional traffic, so there is nothing they can stop.
And just to remind you, last year there was a request on the NANOG mailing list about scanning all of the Internet, or large chunks of the Internet, for known vulnerabilities, and it ended up getting a load of responses, so maybe passive measurements are a good idea.
So, what did we want to achieve? Well, first, we identify the servers of specific hypergiants; we do this simply using the IP‑to‑AS mapping, so we know they are on‑net deployments. Then we fingerprint these deployments to detect the same configurations in off‑net servers, and, where possible, we identify Layer 7 load balancers.
So, why is this interesting for the RIPE community?
Well, with this additional information, you can understand why you see some unexpected traffic from your peers or you can observe things like inter‑domain replications between caches that you normally wouldn't want to have.
So what is our approach?
Well we analyse QUIC backscatter traffic, we use QUIC because it's already broadly adopted by a lot of hypergiants and for example just to give you a number in 2020 so before the standardisation 75% of Facebook's traffic was QUIC.
And QUIC gives us some additional information compared to UDP and TCP and we want to use backscatter traffic because this is a passive measurement data source and relatively easy to capture.
Just a quick refresher on what backscatter is. It is response traffic to IP packets with incorrect source IP addresses, and you can see it in the drawing here on the left. When an attacker sends an IP packet with a random source IP address, e.g. 1.2.3.4, to the server, the server will respond to that false IP address, and if that false IP address is in the range of your network telescope, the packet that the server sends will arrive at your network telescope. That's the backscatter, that's what we will analyse. And why would attackers want to do this? Well, it is to leverage state exhaustion.
Depicted here is our measurement set‑up and the QUIC one‑RTT handshake. It begins with an initial packet, and then the server responds with multiple packets to establish the connection.
And that is exactly what we see at the network telescope: the packets highlighted in red here are what we receive, and as you can see, if the client does not react these packets are sent again, so that is also something we analyse.
Additionally, we find that you can also observe some properties of the attackers; for example, you learn the QUIC version that is used to attack this service. But we look only into the first part, the server behaviour, just now.
So, we use data from a relatively large network telescope, the CAIDA /9 IPv4 telescope, and we use data from January 2022. And we extend our dataset using active QUIC scans and TLS scans only where necessary.
So, let's start with something simple. Shown here is the arrival of packets for different QUIC connections at our network telescope. What you can see is that there are distinct patterns for different hypergiants in when you see a lot of packets. What this means is that the servers that send those are similarly configured in when to send them. If you look at the beginning of the graph, you can see when they start sending those packets; then you can spot that the intervals between the times when they send a lot of packets are doubling each time, so you see there is exponential back‑off, and when you count those peaks you can find out how often they retransmit these packets.
And as you have seen, this is configured a little differently for each hypergiant.
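The back‑off observation can be checked mechanically. The following is a minimal sketch, assuming one arrival timestamp per server flight of a single connection; the function name, the tolerance and the sample timestamps are made up for illustration and are not taken from the measurement data.

```python
def retransmission_profile(arrivals, tolerance=0.25):
    """Summarise the timing of server flights seen for one connection.

    arrivals: timestamps in seconds, one per flight (an assumption; real
    backscatter has several packets per flight that must be grouped first).
    """
    times = sorted(arrivals)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    # Ratio of each gap to the previous one; exponential back-off gives ~2.
    ratios = [b / a for a, b in zip(gaps, gaps[1:]) if a > 0]
    doubling = bool(ratios) and all(
        abs(ratio - 2.0) <= 2.0 * tolerance for ratio in ratios
    )
    return {
        "retransmissions": len(gaps),
        "gaps": gaps,
        "exponential_backoff": doubling,
    }


backoff = retransmission_profile([0.0, 0.3, 0.9, 2.1])
steady = retransmission_profile([0.0, 1.0, 2.0, 3.0])
```

Here `backoff` describes a server that waits 0.3 s, 0.6 s and 1.2 s between flights, while `steady` describes one that resends at a fixed interval; the number of retransmissions and the initial gap together form part of a per‑deployment fingerprint.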
Next we are going to look into the connection IDs. During the handshake QUIC establishes connection IDs; there is one set by the server and one set by the client. We are going to look only at the server connection ID (SCID), and we are going to look at its hexadecimal representation from left to right. This is what is shown here: on the X axis of this graph you have the position in the SCID, and on the Y axis how often we observe a certain value at that position. What you can first spot is that Google and Facebook use the same SCID length, and you can also notice that at certain positions some values are used far more frequently than at other positions. What this means is that there is information encoded in there, and if you dig a little further you find that for Facebook, for example, the encoding is documented in their QUIC implementation and you can see what is encoded at which positions: they encode the version, host ID, worker ID and process ID.
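The per‑position analysis of the hexadecimal SCIDs can be sketched like this. It is an illustration only: the function names, the 90% threshold and the sample SCIDs are assumptions, not values from the study.

```python
from collections import Counter


def value_counts_by_position(scids):
    """Count how often each hex digit appears at each position."""
    counts = {}
    for scid in scids:
        for position, digit in enumerate(scid.lower()):
            counts.setdefault(position, Counter())[digit] += 1
    return counts


def fixed_positions(scids, min_share=0.9):
    """Return positions where a single hex digit dominates."""
    fixed = {}
    for position, counter in value_counts_by_position(scids).items():
        digit, hits = counter.most_common(1)[0]
        if hits / sum(counter.values()) >= min_share:
            fixed[position] = digit
    return fixed


# Made-up SCIDs whose first two hex digits never change.
sample = ["40a1", "40b2", "40c3"]
structure = fixed_positions(sample)
```

Positions that come out as fixed, or nearly fixed, are candidates for encoded fields such as a version number, whereas positions with a flat distribution look random.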
We use this information that we learned so we can, for example, detect off‑net deployments of Cloudflare, because we know they always start with 1 in the first byte, and for Facebook we know that at the time of our measurement they only used Version 1 of the SCID, so we also know the beginning of the SCID. With this we look into packets that we receive from other autonomous systems, and this way we are able to detect a lot of these servers correctly already, but we still have a higher false positive rate; this is the SCID row shown here. Additionally, we found that Facebook uses low host IDs for their off‑net deployments, so we know that the host ID always begins with a lot of zeros. Using this additional information, we find we can improve our classifiers: we detect the same amount of true positives and have far fewer false positives. So this is just a new way to detect off‑net servers.
So, now that I have introduced you to the host ID, the question is: what does such a host ID denominate?
Well, it denominates the QUIC end point so that is in this case the Layer 7 load balancer at the end of this graphic, and now the question arises: Why would you want to encode such a host ID into the connection ID?
And the answer is that in QUIC you might have connection migration. During an existing QUIC connection, the client can change its IP address, but the QUIC connection should be kept active. When you do this with this set‑up, your five‑tuple changes, so your request will be forwarded to a different Layer 7 load balancer, which would not have the state of the QUIC connection, and your QUIC connection would break. To avoid this, you give the load balancer the information on how to forward the packet properly, and this way it will arrive at the same load balancer instance as before and your connection can be maintained.
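Routing on a host ID embedded in the connection ID can be sketched as below. The byte layout (one version byte, two host ID bytes, random tail) is purely hypothetical and chosen for readability; it is not Facebook's documented encoding, and the names `make_scid`, `host_id_of` and `route` are invented for this example.

```python
import os

# Hypothetical layout: byte 0 = version, bytes 1-2 = host ID, rest random.
HOST_ID_OFFSET = 1
HOST_ID_LENGTH = 2


def make_scid(version: int, host_id: int, total_length: int = 8) -> bytes:
    """Server side: build a connection ID that names the serving host."""
    tail = os.urandom(total_length - HOST_ID_OFFSET - HOST_ID_LENGTH)
    return bytes([version]) + host_id.to_bytes(HOST_ID_LENGTH, "big") + tail


def host_id_of(cid: bytes) -> int:
    end = HOST_ID_OFFSET + HOST_ID_LENGTH
    return int.from_bytes(cid[HOST_ID_OFFSET:end], "big")


def route(packet: dict, l7_hosts: dict) -> str:
    """Load balancer: forward on the connection ID, ignoring the five-tuple."""
    return l7_hosts[host_id_of(packet["dcid"])]


l7_hosts = {7: "l7-a", 8: "l7-b"}
cid = make_scid(version=1, host_id=7)

# The client migrates from one address to another but keeps the connection ID.
before = route({"src": ("198.51.100.7", 40000), "dcid": cid}, l7_hosts)
after = route({"src": ("203.0.113.9", 51515), "dcid": cid}, l7_hosts)
```

Because the forwarding decision depends only on the connection ID, both packets reach the same Layer 7 instance without the load balancer keeping any per‑connection state.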
Next, we wanted to find out about Facebook's content serving infrastructure, so there we did some active probing. We probed around 3,000 of Facebook's virtual IP addresses; we sent a lot of packets to each of them and we alternated the source ports so we would hopefully arrive at a lot of different host IDs. Using this method we end up with 37,000 different host IDs; 19% of these can already be observed in our passive measurement data. Next we tried to group multiple different IP addresses into a cluster, that is, a frontend cluster of Facebook, and we group them only if multiple virtual IP addresses share the same host ID, so behind different IP addresses we observe the same host ID. This way we find 115 Facebook clusters of about equal size: nearly all of them consist of 22 virtual IP addresses. We additionally found that each of these clusters spans a single /24 IP prefix; there is one exception to this rule. And we found out that a host ID can answer requests for any of the IP addresses in that cluster, so when you want to know the size of the cluster, described by the number of different host IDs in that cluster, you just have to scan one virtual IP address and you know all the other virtual IP addresses won't give you any more different host IDs.
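The grouping rule, "virtual IP addresses that share a host ID belong to the same cluster", can be sketched with a small union‑find. This is an assumed reconstruction of the idea, not the authors' code; the function name and the observations are made up, and addresses that share no host ID simply remain as single‑member groups here.

```python
def cluster_vips(observations):
    """Group virtual IPs that were seen answering with a common host ID.

    observations: iterable of (vip, host_id) pairs from probing.
    """
    parent = {}

    def find(vip):
        # Follow parent links to the representative, compressing the path.
        while parent[vip] != vip:
            parent[vip] = parent[parent[vip]]
            vip = parent[vip]
        return vip

    first_vip_for_host = {}
    for vip, host_id in observations:
        parent.setdefault(vip, vip)
        if host_id in first_vip_for_host:
            # Same host ID behind two VIPs: merge their groups.
            parent[find(vip)] = find(first_vip_for_host[host_id])
        else:
            first_vip_for_host[host_id] = vip

    groups = {}
    for vip in parent:
        groups.setdefault(find(vip), set()).add(vip)
    return sorted(sorted(group) for group in groups.values())


observations = [
    ("192.0.2.1", "h1"),
    ("192.0.2.2", "h1"),
    ("192.0.2.2", "h2"),
    ("192.0.2.3", "h2"),
    ("198.51.100.1", "h9"),
]
clusters = cluster_vips(observations)
```

In this example the three 192.0.2.x addresses end up in one cluster because they are linked through host IDs `h1` and `h2`, while 198.51.100.1 stays on its own.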
We mapped the location of the clusters using IP address geolocation, to different countries and we found in Asia, there are a lot more host IDs per cluster than in any other region.
And at first we were surprised at this result, but we think that the reason for this is just the large population in that area and that Facebook does not have that many data centres in that region.
So, another question is will this also work in the future? And we would answer this with a yes, because there will be no Internet without attackers and we think that even the backscatter will increase with more attackers adopting QUIC, so even more of our measurements can be done without active scanning.
And to conclude my talk: we just saw that passive, non‑intrusive measurement data can tell us a lot about hypergiants. We use QUIC features that we saw to create fingerprints and to detect off‑net deployments, and you also saw that structured connection IDs are a good stateless solution for maintaining connections when the underlying five‑tuple might change. If you want to read more about this, there's an Internet draft currently at the IETF where you can find more details, and they also do some encryption of the connection IDs so your details don't leak to the public.
With that, I can refer you to our paper, you can find more details here or you can ask questions now. Thank you for your attention.
MASSIMILIANO STUCCHI: Good morning, everyone. Do we have any questions? I didn't see any questions online, so far. We don't seem to have any questions here in presence. One at the microphone.
AUDIENCE SPEAKER: Chris Whitfield, ARIN AC. Is there any indication that end points are Unicast end points or Anycasted?
JONAS MÜCKE: What do you mean with end points? So the IP addresses are Unicast?
AUDIENCE SPEAKER: The VIPs that were detected.
JONAS MÜCKE: Unicast.
AUDIENCE SPEAKER: I am Lars, this is a nice talk, thank you. I wonder if you had thoughts of how you would continue this study if you were going to continue it because it seems you have got to follow what the implementations are doing and what's being deployed pretty closely in order to understand the structure?
JONAS MÜCKE: Yeah, we want to continue on this path but currently I can't tell you any more details, but you might see some more papers in this direction later.
AUDIENCE SPEAKER: Give me a teaser. That's a very corporate answer right there, don't worry, it's fine, thank you.
JONAS MÜCKE: Sorry.
MASSIMILIANO STUCCHI: We still have some time available for questions. Oh, there's one coming. Anna from the University of Aberdeen.
"You have done the study for IPv4, do you think it would be feasible for IPv6 as well?"
JONAS MÜCKE: Yes, so, in principle, this approach would work with IPv6 just the same but you have to have the same data and I'm not aware of any large scale IPv6 telescopes as of now, but if you can get any of this information, we would be happy to collaborate.
MASSIMILIANO STUCCHI: No one else. Okay, Dmitry, do we have anyone online, any question on Meetecho.
DMITRY KOHMANYUK: Let me check.
MASSIMILIANO STUCCHI: No, I don't see any. Okay. Well, then thank you very much.
(Applause)
So, while the ‑‑ while the next presenter prepares, I will introduce him in a moment. Etienne Khan, if you are in the room, we need your slides. Perfect, thank you. And in the meantime, let me remind you ‑‑ let me ask you to rate the talks, please, at the end of the session. So the presenter we have right now is Claudio Allocchio, and we have a remote presenter, that's ‑‑ who should be online.
FABIO FARINA: Good morning, everyone.
MASSIMILIANO STUCCHI: They are both from the Italian NREN, GARR, and I think Claudio, do you need any help?
CLAUDIO ALLOCCHIO: I should be online but I go with remote presentation from here, I just do the live later on. Do you want me to do it it from here? As you like.
CLAUDIO ALLOCCHIO: Let's do it with the remote and then I switch over. Thank you, everyone, for being with us this morning, especially after the dinner the evening before, I know how hard it is to be on time at the first session.
What are we going to do this morning?
Well, the title of this presentation which I will share with my colleague Fabio which is there in Milan, is monitoring the hidden: Timemap. What do we mean by "hidden"? Normally, let's try to make a comparison with normal traffic pattern that we know most, the traffic one.
So, this is the outline of what we are going to do: why we created Timemap, what the current status is, what we are doing now to go beyond observation, and even further ideas, which are wandering about in the heads of the group we are working together with. It's a group within GEANT, the big European multinational backbone which connects, I think, 44 countries now and covers a lot of Europe with connections everywhere.
So, what are we talking about?
Well, when I take my car and I leave to go somewhere, especially if I need to go via motorway or city what I normally do I switch on my navigator and click on it and say I want to go there, then I have to look at the colours, how is the road going to be ahead? I did it yesterday because I had to go to the airport and I wanted to know if the motorway exit to the airport was red as usual so I should avoid it, or very green. It was very green, so I went straight to it.
And I also want to know how it is in average anyhow when I do a monitoring system I want to plan for the future so if I need to do something between, for example, Italy and Estonia and I know that I am going to use that link for very specific activity, I want to check if that link is very stable or had some hiccups and might create me some trouble if I do something live on it. And so the question is: Well, if I listen to the radio and I get something like okay, highway 101 in California, 246 vehicles per minute, six lane motorway, fine, smooth is going smoothly but the same number can be very bad if it is a narrow country road like this and a lot of stops and go. So this is why we want to know in advance what is happening. And when I want to go somewhere, as I said, I want to look at the map and check if I have any red spot in front of me and try to avoid it to go around it, etc., etc., I want to know also the transit times so what do I want to measure in general in this case?
Well, we normally measure bandwidth. This was ‑‑ now I have got a new one ‑‑ this was my classical GARR backbone weather map, my initial map. What am I measuring here? Well, here I am measuring bits per second, not how smoothly they flow or how fast. I can be sure that if I need to send bits between Milan and Rome I can do it quite smoothly, but how long they are going to take, and whether they are going to jitter a lot or not, I don't know. So that's why this part of the network monitoring which we are monitoring now is called the hidden part of network monitoring, because nobody does it. So we decided to do it. There are realtime applications like this one, the most typical one that is affected by very bad network performance, and bad network performance means a lot of latency and jitter. This is LoLa, which requires the network to run very fast because the embedded latency of the system itself is 4 milliseconds; if you compare it to Zoom or whatever, which is 400 milliseconds, you can understand why having a network which is faster and very stable and reliable is important, because the network is an important piece of an application like this.
And so it tries to use the lowest possible roundtrip times along the network, and it must be very stable, and the reason for this is that it does not use any audio or video buffering. Buffering is a trick that things like Netflix or YouTube use quite a lot: when you watch a film on YouTube you are watching one‑and‑a‑half minutes late because of buffering and so on. There is no buffering here, so the network must be stable, fast and quick.
An application like this one to set the speed control, you just drive and you do nothing else and then go ahead. So, how did we create this stuff?
Well, just to let you know, these kinds of applications are on a very fast rise. I mean, LoLa installations grew 32% in one year, and even more in the world. In Europe, the red ones are the new ones. It means that everywhere there is something like this, you need to be in control of the network latency and jitter. It is not only for audio or video but for realtime control of anything. There are some experiments which you want to control in realtime, and you need latency as low as possible and you need to be sure that the data flow in a very steady way, not jumping ahead and coming back.
So, we created this object which is called Timemap. It is needed to do live tracking, but it also keeps history; live monitoring is not the only thing. One of the sayings is: if you don't monitor, you don't control what you are doing. And you need to know the history, as I said before.
We saw the data and decided to go beyond, and we decided to apply machine learning, alarms and things that go automatically to the network operations people to fix some issues. It's nice to have a monitoring system, but normally you don't have a big screen with people standing there 24 hours per day watching it and saying: ah, there is something strange. Now it's running behind the scenes, and when there is something strange an alarm sounds and humans come into the scenario. So this means machine learning and a lot of authentications and a lot of understanding of what is being done behind the scenes.
So, this is how we designed the system. The purpose was not to develop yet another piece of software that then needs a lot of maintenance, a lot of monitoring, a lot of updates, a lot of checking, etc. So we tried just to be an aggregator: we put together different things which already exist and which somebody else is maintaining. For the graphical part we decided to use Grafana; somebody is taking care of keeping Grafana up to date, not me. So we put together the pieces and did very little programming ‑ that was one of the goals: do not produce too much software code, because the more you produce, the more you have to maintain ‑ and above all everything uses standard specifications. When we started, the idea was wonderful: we have TWAMP as a protocol, which is a standardised one. How many vendors have actually implemented it? Oh, wonderful, and we tried it, and they had implemented it, but it was at a very early implementation stage; we found quite a lot of hiccups and we had a lot of fixes from Juniper, working with us to make it work. And it must be easy to deploy and have federated access control: I might decide my monitoring data are not for everyone, but I want to authorise somebody to be able to see them. And it is very modular, so a very state‑of‑the‑art way of implementing software deployment.
So, this is how the architecture is done, but given that we have our main architect here, who is Fabio, I will let him go ahead and he will tell me when to switch over.
FABIO FARINA: Thank you, Claudio. Timemap, as a tool, was focusing not ‑‑ but on the data pipeline and the analysis of it.
So when we decided how to implement Timemap, the most natural choice was going with an extract‑transform‑load pattern. Therefore we figured out the main building blocks: one that collects data from the devices, using ..., and performs some kind of normalisation; then of course we need a time series database, and all the upper layers that summarise the data and serve it in an easy way towards the users, also through the federated authentication and all those things that keep control of the data.
As I said, we focused on some issues of deployment. So the first choice was for data doing that works... containers work, so we wanted something that uses as much as possible additional components, doing as little code as possible, as Claudio said.
Next slide, please.
So, how is Timemap implemented, what's under the hood? Very simple components. We used InfluxDB and Grafana as building blocks for the central... On top of that we have a thin presentation layer using very, very simple HTML and ‑‑ for the representation and visualisation, together with a separate layer to enable federated authentication.
The only custom part in the very first release of Timemap is of course the data acquisition. Again, we decided to go with something that leverages very standard telemetry data, so we decided to adopt Telegraf, and we implemented custom Python code to get the SMP probes or measurements from the devices.
Next slide, please.
As the idea of Timemap is modularity: in the first release we were not relying on InfluxDB and Grafana, we were using Elasticsearch, but the plus of having such a simple framework and architecture in general allowed us to switch from the Elastic stack to InfluxDB, because we realised Elasticsearch had too large a footprint for our needs. We ended up having more issues in running the Elasticsearch cluster than the problems we were actually able to solve by using that tool, and changing the engine was really simple. Again, we were focusing on the data pipelines, not on the technologies themselves.
Next one, please. Now, I am going to leave the stage again to Claudio because it's easier; he will show some screenshots and some interesting patterns that we noticed through Timemap, because one very interesting thing is that when you look at jitter and latency, you end up noticing a lot of things that are not related to jitter and latency but to the overall status of your network. Please, Claudio.
CLAUDIO ALLOCCHIO: You can see that this is a screenshot which was taken from the GEANT backbone network map a few weeks ago, as this is completely dynamic. I mean, the map itself is built from the IS‑IS routing data, so we don't do anything: when the network changes, the map changes. When we go live the network will look different, because it has been changing in the meantime. What do you do?
Here you see an old version of it, where RPM, which was the Juniper one, was still in use; TWAMP is now up and running in a decent way on Juniper, so there is no more RPM. You have the links and the routers, and it is very simple: you right‑click on a link or router and you get information out of it, and that is what you see. This is not a classical weather map, there's no colour in here, even if we have the idea that we might add colour. For now you just go on a link and you click on it and you get some data. What do you get?
We also get a link on the router when you get anomalies; anomaly detection is getting mature now and we decided to put it online, but I will show you this one live later on.
So, what do you see? Well, very often you have some periodic events which happen during time over link and here you can see very easily there are spikes at very regular time interval? What is that? We don't know but we know there is something we have to watch along that link. Why every few hours the latency and the jitter over that link it is such a big spike, what is happening? You go and investigate and find out and maybe you discover something you need to fix on the network and/or you have some background process running that it is disturbing or creating the effect. So this is very evident and easy; you see them.
You immediately see rerouting, for example, if it is a circuit and not a fibre link you might have rerouting because you don't see how the data are flowing but here you can see very easily that the latency is changing at regular intervals going up and down, up and down. What is it? That's classical rerouting, you are going to be via different paths so the latency changes, and given that with Grafana it is very easy to zoom in and down there at the bottom over six hours you very clearly see there are at least three paths along this link which you can take: A faster one, a slower one, which is the one in the middle, and the mid‑range one which is the one on the right. If you are doing something realtime, this is a very bad situation because you never know what's going to happen to your data, and when the routing changes you are going to have some issues in the data flow.
We all know that everybody is relying on NTP and more accurate stuff and so on. Well, this is what you see on a number of links. The upper one is a very stable one; the clocks are very well in sync. The lower one is not: there is drifting on the clocks, and it's very, very evident. They stay, on average, correct, but if you rely on NTP on that link between those two locations, you are wrong and not getting the correct data. You have an important trend which you have to take into account, and you need to use it in case you need the data for synchronising telescopes or things like that. That's the only way you can make the correction.
We have implemented anomaly detection, as you can see the machine tries to learn what is happening and when it is happening then we have dots and then human have to come in and understand what it is.
Again, the machine is able to detect very easily the very big spikes and the dots are all marked over there.
Some networks are starting to use ECMP, which is a very modern and very useful way to deploy the maximum power of the network itself, because you do multiple routing over multiple paths. Well, this is the typical pattern which happens when you use ECMP. As you can see, there is a slow path, a medium one and a fast one, because there are two possible paths from A to B in this case. This is a typical pattern you get using ECMP over a network; you immediately see it, it's very, very clear.
The other thing is what we call non‑identified objects, non‑identified events. Well, actually this one was identified later on: this was something very strange happening over a link, and afterwards we discovered that one of the interfaces was flapping, so BGP routing was going up and down and changing all the time, and so we got the storm that you see over there. But this was a post‑mortem analysis: we went to look at it and found it. This is another very strange thing which we are trying to investigate and understand. As you can see ‑ well, I can show it to you better live later on ‑ there is a very strange pattern in the latency, in the jitter, in the maximum values: they go up, up, then down and down, and we discovered that it is a 24‑hour pattern. What is it? I don't know. Maybe we have an answer to that: maybe the router itself is doing some very strange maintenance, and we are discussing with Juniper to find out exactly what the cause of this is. This was evident on a link which had very low traffic, so we saw it immediately, but when we made the analysis we discovered it happens everywhere on the GEANT backbone, including on the heavy traffic links; you need to zoom in on the lower part of the spectrum and you see it everywhere. So there is something we did not know at all before.
Another very strange effect that we find out, this link is one‑and‑a‑half metres long, two machines in the same. What is happening here? What is this pattern? We still don't have any idea, we are investigating together to try to understand why it happens, but it happens very regularly.
So, again, this is something which is not completely under control and we need to find out. And now I leave to Fabio again because this is the results part of it.
FABIO FARINA: Yes, thank you again. Of course having a look at the data, the current data ‑‑ having an historical view of the trends in doing the analysis that Claudio showed us, but if you want to have better way, a more modern way to perform network operation, adding the alert is the best way because ‑‑ you do not want this day looking at the data every second but you want that the system itself has enough knowledge to provide information about what is going wrong, and this is in particular useful where you are preparing for example a lot of sessions that needed careful tuning for a few days, and then in the night‑time something changes that then your potential set up is rapid so you want to get information about things that changes in an unexpected way as soon as possible.
So, when we decided to introduce anomaly detection Timemap, we did for this reason and also we wanted to have add one more tool to cross‑correlate events that are not clear because, as you can imagine, jitter and latency have a wave front that propagates as a factor, so everything is related during the ‑‑ all along the paths all over the network.
We wanted to ‑‑ there is some strict requirement about the way we do machine learning. So we have many flowing data and we do not have a prior knowledge about the quality of the data. We needed two kinds of characteristics, two technical characteristics about the anomaly detection algorithms, we want to have realtime streaming, the machine learning the algorithms, so we must be able to learn while we classify. And we want to have robust statistical parameters and we want to keep the light footprint principle that derives all the Timemap development.
So, for the choice of the tool, we decided to go with a streaming library called River. This library has a wide range of machine learning algorithms that rely on streaming, so you learn things as the data flows.
And anomaly detection in the end is just a way to do classification over the standardisation in principle is and also provides ... but it can become very sensitive to overfitting when you do very long runs in long‑running services, so the main risk here is that at a certain point you will start learning anomalies as regular points. To avoid this we decided to use not one but two families of algorithm: one is half‑space trees, or random trees, which is a kind of classifier, and we paired it with a completely different machine learning technique called a support vector machine. We tuned both algorithms to be sensitive around tree setment, but the outcome of each is weighted in some sort of majority rule; that's called bagging as a name in machine learning.
So we decided that a point is indeed anomalous only when both algorithms say that the point is indeed strange.
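The both‑must‑agree rule can be sketched in a few lines. This is a minimal illustration and not the Timemap code: the two detectors below (a running z‑score and an exponentially weighted moving average) are simple stand‑ins chosen so that the example runs with the standard library only, whereas the talk pairs half‑space trees with a support vector machine from River. All class names, thresholds and warm‑up lengths are assumptions.

```python
import math


class ZScoreDetector:
    """Stand-in detector: flags values far from the running mean (Welford's method)."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold
        self.warmup = warmup
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def is_anomalous(self, x):
        if self.n < self.warmup:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        return std > 0 and abs(x - self.mean) / std > self.threshold

    def learn(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


class EwmaDetector:
    """Stand-in detector: flags values far from an exponentially weighted average."""

    def __init__(self, alpha=0.1, tolerance=0.5, warmup=10):
        self.alpha = alpha
        self.tolerance = tolerance  # allowed relative deviation (assumed value)
        self.warmup = warmup
        self.n = 0
        self.level = None

    def is_anomalous(self, x):
        if self.n < self.warmup or not self.level:
            return False
        return abs(x - self.level) / abs(self.level) > self.tolerance

    def learn(self, x):
        self.n += 1
        if self.level is None:
            self.level = x
        else:
            self.level = (1 - self.alpha) * self.level + self.alpha * x


def classify_stream(values, detectors):
    """Score each value before learning from it; flag it only if all detectors agree."""
    flags = []
    for x in values:
        flags.append(all(d.is_anomalous(x) for d in detectors))
        for d in detectors:
            d.learn(x)
    return flags
```

Calling `classify_stream(latencies, [ZScoreDetector(), EwmaDetector()])` returns one boolean per sample. Note that this sketch keeps learning from flagged points too, which is exactly the risk of "learning anomalies as regular points" mentioned above.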
Next slide, please.
As the overall streaming machine learning approach is lightweight in terms of footprint, we decided to train a model for each link that connects a pair of routers, because the models are small. And this is something that we need because latency and jitter are related to the distance between the routers, so having one model that fits all the measurements was something that we did not like because it would be too generic; we want to characterise exactly the behaviour of each link. And from the technical point of view, the container that runs all of this is just a bunch of Python lines: everything is less than 300 lines of code that runs as a sidecar Docker container next to the container that talks to Influx. It pulls the data, finds the classification and marks the data as needed.
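The one‑model‑per‑link idea can be sketched as a dictionary of small streaming models keyed by router pair. This is an illustration under assumptions, not the Timemap sidecar: the model here is a plain running z‑score, the router names are made up, and treating a link as direction‑agnostic is a simplification of this sketch.

```python
import math
from collections import defaultdict


class LinkModel:
    """Running mean and variance of one link's latency (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def zscore(self, x):
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return (x - self.mean) / std if std > 0 else 0.0

    def learn(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


class PerLinkDetector:
    """Keeps a separate small model for every pair of routers."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold
        self.warmup = warmup
        self.models = defaultdict(LinkModel)

    def observe(self, router_a, router_b, latency_ms):
        # Assumption: A-B and B-A are treated as the same link.
        key = tuple(sorted((router_a, router_b)))
        model = self.models[key]
        anomalous = (
            model.n >= self.warmup
            and abs(model.zscore(latency_ms)) > self.threshold
        )
        model.learn(latency_ms)
        return anomalous
```

With this layout an 80 ms sample is flagged on a link that normally sits around 5 ms but is unremarkable on a long link that normally sits around 80 ms, which a single global model could not express.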
This is the result of all these algorithms, just on a pie chart.
Next one. Where are we going on the anomaly detection side? We would like to fix two issues that we are facing right now. We want to mitigate the problem, so we would like to have something that is also able to forget things, not just to learn things, because forgetting things is an important part of intelligence and ‑‑
CLAUDIO ALLOCCHIO: We have one minute.
FABIO FARINA: I would like to have a method that marks when an anomalous situation ends. We are exploring a number of different methods that are feature and provide even more interesting things on this insight. Claudio.
CLAUDIO ALLOCCHIO: Thank you. All the links and all the other things can be found here. The idea is that it's something that you can just download and get going with very quickly, and while we wait for questions I have put the live demo online so we can go and have a look at it while we take questions.
We switch over to the other one. And let me see. Okay. Well, meanwhile, if you have any questions, just come forward and ask. This is how it is: you need to authenticate, so I need to log in, I already put my password in. This is a map. I go on a link, for example, I don't know, this one, which is the one which I like a lot, and ask for the statistics. Of course I need to authenticate again, so let's go. Grafana is collecting data. As you see, I already see here some anomalies, and I can select how long it is. Let's see, 24 hours. And as you can see, this is the example I was showing you before, very strange things happening at night where we don't have an explanation, and if I want to see something like, for example, in here, I just go and zoom. And then I get the data and I can explore. Or if I go again on the map and I want to see anomalies in Amsterdam, there we go, anomalies, and then I get the whole list of the possible anomalies that the artificial intelligence detected and I can go and investigate and so on.
So, questions?
AUDIENCE SPEAKER: Hi, Alexandros Milolidakis from KTH. In the past I have also used anomaly detection to monitor co‑location facilities and IXP links. The daily patterns you saw are quite common and they happen a lot and, if I understand correctly, at the end of the day, to deal with them you actually mix two algorithms together to decide when there is an actual anomaly.
I have one question to ask: have you thought about those anomalies that you see as daily patterns? They are not really anomalies, but have you thought about treating them as anomalies and then identifying the real anomalies out of those anomalies, so you can feed those to an algorithm to figure out variations from the usual pattern?
CLAUDIO ALLOCCHIO: I think Fabio is more qualified.
FABIO FARINA: This is something ‑‑ so far ‑‑... where nearly on the... so we did not want to have a... from the observation.
For better classification, of course, we will need some kind of training or at least parameter tuning to have better insight into the distributions that are underlying the signals; we are talking about overlapping of different sources.
AUDIENCE SPEAKER: Be careful with hyper‑ and meta‑parameters, you need to know exactly what you are doing, because they may work right now but in the future they may start failing. Just because it works now doesn't mean it will work in the future; if you always start with those values, you need to figure out an automatic way for that to work.
FABIO FARINA: That's what Timemap does, it retrains periodically over the last 24 hours to ‑‑ to start from scratch in some sense but it's not something that it's ‑‑ it's somehow an approach to turn off at the start, it's not satisfying.
AUDIENCE SPEAKER: Thank you, you can also use something longer than 24 hours because there may be other kinds of deviation so thank you very much.
AUDIENCE SPEAKER: Thank you very much for your presentation, Alexander. I am pretty sure that you are aware that it's of course not a passive measurement, it is an active measurement that is generated using the CPU of the devices that are making the measurement. I wonder if you have started how ‑‑ studied how the CPU usage of the device is affecting what you are seeing in the measurements?
CLAUDIO ALLOCCHIO: This is a thing that we are working on because, for example, one anomaly which I showed might be exactly this. It's affecting. And I know that some vendors are working ‑‑ the problem disappears. It's very mature as an implementation in the machines. Cisco is even ‑‑ and Juniper, yes, it is, and of course that is why we need to correlate this data with everything else. We are also collecting all the data from the machine, from the router itself, so we have CPU load, even temperature can affect it, so we are correlating all this together, because in that case you discard immediately the fake anomalies because they are caused by something else. The ones caused by something else are very regular, you see them all the time, while the real anomalies are not, because there is something happening on the network.
DMITRY KOHMANYUK: Amazing, we have local, remote presentation. Thank you very much.
CLAUDIO ALLOCCHIO: This is intended to be installed by anybody who wants; it's open source, you download it.
DMITRY KOHMANYUK: Thank you a lot, guys.
(Applause)
Thank you, Willem, from NLnet Labs is talking about improvements.
WILLEM TOOROP: This is a presentation about route origin validation, hence the capital ROV in "improvement".
It's actually a presentation about a research project that Kevin clerk from the University of Amsterdam did at NLnet Labs, so Koen and I supervised it. He did this in the context of his Security and Network Engineering master's.
Earlier research has shown that if you have both a valid announcement and an RPKI invalid announcement of the same prefix or overlapping prefixes on the Internet, then RPKI has a bit of a weakest link problem, in the sense that if an ‑‑ tries to contact the actual IP address on the valid announcement, any hop on the path that does not do route origin validation can redirect it to the invalid announcement.
And so there's about that ‑‑ so, Koen has done a really nice blog post on that very subject, which is on the RIPE Labs website. What he did was run an RPKI certificate authority on both a valid and an invalid prefix, and of the RPKI validators, 25% reached the certificate authority running on the invalid. So even validating routers that are likely to be in RPKI validated resources can reach the invalids because of on‑path hops.
So, the research question that Kevin asked himself is: Can we determine what the most prominent hops are that redirect people to ‑‑ or redirect traffic to the invalid and ‑‑ so which ASs have the most impact on redirecting traffic to invalid announced RPKI resource? That's basically the question.
All right. And so his experiment consisted of two announcements. One valid announcement, Anycasted across the 30 POPs that were available at the time of this research, the 10th of March this year; that is the less specific valid announcement, a /23. And then a more specific invalid announcement, without a valid route origin attestation, at Coloclue. The invalid announcement goes everywhere, even though the transit providers do not forward it, because Coloclue has such good peering. It is an organisation of network hobbyists, so they say, in Amsterdam, and the invalid gets everywhere because they have really good connectivity.
So very suitable for this kind of experiment.
So, on one of the addresses which is in both the valid and the invalid announcements, address number 6, we run an authoritative name server that on the valid says hooray, your resolver reached the RPKI valid announcement, and on the invalid says no, you have not reached the RPKI valid announcement. And the valid announcement is larger, right, because it's less specific, so it also has another address on which an authoritative name server runs which always says hooray, you reached the valid, because there's no invalid announcement for that IP address.
And so, if you reach the valid announcement from an end point on the Internet, then that means that the trace routes to the first half, which is also announced invalidly, and to the second half, which is only announced validly, are the same, right? We don't actually know if all the hops on that trace route then do route origin validation; some hops might not but be surrounded by other hops that do, and therefore it reaches the valid. But what you can tell is that if a trace route reaches the invalid, then the paths of the two trace routes will diverge, and we can also say that the hop just before the point where the paths diverge to the valid and invalid announcements is the hop that does not do route origin validation.
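The divergence argument can be sketched as a comparison of two hop lists. This is an illustration only: the function name and the AS numbers (taken from the documentation range) are assumptions, and real trace routes would also need handling for unresponsive hops and for mapping hop addresses to ASs, which this sketch ignores.

```python
def hop_before_divergence(path_to_valid, path_to_invalid):
    """Return the last hop shared by both paths before they diverge.

    Returns None if the paths never diverge within their shared length,
    or if they already differ at the first hop.
    """
    for i, (a, b) in enumerate(zip(path_to_valid, path_to_invalid)):
        if a != b:
            return path_to_valid[i - 1] if i > 0 else None
    return None


# Hypothetical paths from one probe to the valid-only and the invalid address.
to_valid = ["AS64496", "AS64497", "AS64498", "AS64499"]
to_invalid = ["AS64496", "AS64497", "AS64500", "AS64501"]
print(hop_before_divergence(to_valid, to_invalid))
```

In this made‑up example the paths split after `AS64497`, so by the reasoning above that is the hop that is not doing route origin validation.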
So, with the help of Emile Aben, we submitted a few RIPE Atlas measurements: one DNS query to determine the baseline, which probes go to the valids and which go to the invalids, and trace routes from all the probes on RIPE Atlas.
And these are the results. So, 43% of the IPv6 probes reach the invalid and 48% of IPv4, so IPv6 is slightly better. This graph shows the impact it would have if the ASs on the X axis did route origin validation: if 25 of the ASs on those paths that we detected as not doing route origin validation would do it, that would already be a 50% improvement for IPv4 and a 60% improvement for IPv6. This was the most prominent AS to divert traffic to the invalid announcements; is someone from Telecom Italia in the room, perhaps? It would be good if you would do route origin validation.
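A cumulative curve of this kind can be sketched as follows. It assumes one "diverting AS" has already been identified per probe that reached the invalid, and it makes the simplifying assumption that a probe is fixed as soon as that one AS starts validating; the function name and the AS numbers are illustrative, not the study's data.

```python
from collections import Counter


def cumulative_improvement(diverting_as_per_probe):
    """For ASs sorted by how many probes they divert, return the cumulative
    share of diverted probes that would be fixed if the top N ASs did ROV."""
    counts = Counter(diverting_as_per_probe)
    total = sum(counts.values())
    covered = 0
    curve = []
    for asn, n in counts.most_common():
        covered += n
        curve.append((asn, covered / total))
    return curve


# Hypothetical input: the diverting AS for each of ten probes.
sample = ["AS64496"] * 5 + ["AS64497"] * 3 + ["AS64498"] * 2
for asn, share in cumulative_improvement(sample):
    print(asn, f"{share:.0%}")
```

Reading the curve at N = 25 would give the kind of figure quoted in the talk, under the stated simplification.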
We looked at on which hop the redirection to the invalids took place, and hop number 7 on the trace route is most prominent, but you have to look of course at how that hop is positioned relative to the path. We looked at that and it is remarkably almost a straight line. So it's not the first few hops that divert it to the invalid announcement; it's pretty much spread all over the hops in all the trace routes.
Furthermore, we also looked at which Anycast POP the probes would enter when reaching the valid, and here we clearly see RIPE Atlas bias: Frankfurt has more than 20% of everything that reaches it. The coloured piece is reaching the invalids. I think what is remarkable here is that the probes that have affinity with some Anycast POPs are more likely to reach the invalid than others. So the three conclusions from the graphs that I just showed you are: a small group of organisations can have a big impact on routing security; this can happen anywhere on the path; and some POPs fare better than others. That's more or less the presentation so I'm open to questions. I have 57 seconds.
MASSIMILIANO STUCCHI: We have a question here in the front.
AUDIENCE SPEAKER: Alexandros Milolidakis from KTH again. Have you checked the opposite of what you did? So if some ASs fail to do validation, what will happen? I am asking because I have seen some weird stuff sometimes, it will be interesting, the reverse of what you did?
WILLEM TOOROP: If hops do not validate would not do it, what would happen then?
AUDIENCE SPEAKER: Yes
WILLEM TOOROP: It's much more positive presentation
AUDIENCE SPEAKER: I have seen something interesting and I can say it happens sometimes.
WILLEM TOOROP: Yes, that would be interesting, yes. We haven't looked at that in this research
AUDIENCE SPEAKER: So now your next work?
PETER HESSLER: I'm wondering if you took a look at the ASs that were not doing origin validation and compared it to public announcements that yes, this AS is doing public validation and is checking this, to see if there's a discrepancy between those who claim they are and those who actually are not?
WILLEM TOOROP: We haven't done that, we just measured it and collected the numbers.
MASSIMILIANO STUCCHI: We have run out of time so thank you very much.
WILLEM TOOROP: You are welcome.
(Applause)
MASSIMILIANO STUCCHI: Now, next up is Etienne Khan who is going to talk about VPNs and how to measure them.
ETIENNE KHAN: Thanks for being here this early in the morning. Let's have some exercise. Raise of hands, please: who of you uses streaming services like Netflix or Disney+? Lots of hands. Who is annoyed that other regions have more content? That's what I want to see. Who knows about geo‑unblocking VPNs, which promise to show you this content? Who knows how they do it? I hope no hands are being raised right now, because this is what we will discuss.
What happens? Well, I was in a different country and downloaded a show on to my device, and it would be nice if I could finish this show when I get back home. So I booted up the Netflix app and it said it's not available to watch in your area; I would have to go back to the area where I downloaded the show to keep on watching. That's a bit annoying, but we are all network engineers here and I can set up my own VPN to that country I have been to. So I do that, and then Netflix gives me a different message: hey, you are using a VPN or proxy, and to fix the problem please turn off the VPN. Okay, that's a bit rough. I vaguely remember some of these parties told me that if you use our VPN service you can actually watch that kind of content, so I thought that's a bit strange: how are those VPNs able to do so, whereas I, who set it up myself, cannot?
That's how I got to my talk: Stranger VPNs.
I will be presenting this today.
So, I bought a trial subscription for other VPN providers and thought hey, let's do another trace route to see where we get to when we want to watch Netflix, and as you can see, that's a pretty weird trace route to do if you are trying to look for Netflix. It's single hop and this address space is actually IETF protocol reserved so we are off to a good start.
It turns out, though, that trace route and ping and these things don't work, they terminate after one hop, but we have found something very special in the Netflix CDN. So that's one of the URLs, for example, and when you just request the header from the CDN you get a lot of text and most of it is unimportant, but at the very bottom you can actually see that the public IP address from which we requested the header is shown in that header. So I thought: what will happen if I just request this through several VPN providers in several regions?
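Picking the reflected address out of a set of response headers can be sketched generically. The talk does not name the header that carries the address, so this sketch simply scans every header value for anything that parses as an IP address; the header name `X-Example-Client-Info`, its layout and the addresses below are hypothetical.

```python
import ipaddress
import re


def find_ip_addresses(headers):
    """Return every IPv4/IPv6 address literal found in the header values."""
    found = []
    for value in headers.values():
        # Split on whitespace and common separators, then try each token.
        for token in re.split(r"[\s,;=\[\]]+", value):
            try:
                found.append(ipaddress.ip_address(token))
            except ValueError:
                continue
    return found


# Hypothetical response headers; only the last one carries an address.
example_headers = {
    "Content-Type": "text/plain",
    "Date": "Fri, 26 May 2023 09:00:00 GMT",
    "X-Example-Client-Info": "client=203.0.113.7; pop=ams",
}
print(find_ip_addresses(example_headers))
```

In a measurement like the one described, the headers would come from a request sent through each VPN tunnel, and the address found would be recorded together with a timestamp.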
So these are the VPN providers I was investigating: a mix of smaller ones like WeVPN or PrivateVPN and a few bigger ones doing all these commercials, like NordVPN. The one on the bottom, they went bankrupt and took our money, so that's that.
We had two measurement periods where we measured this IP header from these providers every 30 seconds, and we have one set of 4 million samples and another one of 7 million samples. The samples we collected were pretty simple: a Unix time stamp, a status saying whether everything was okay or not, and the IP address we detected doing the unblocking. Let's see how this looks so far. On the left we have the VPN user and the provider's gateway. When we try to resolve Netflix or any other site we have to do a DNS request, and the DNS request gives us this weird IP, which actually is a TLS proxy. From there something weird happens and we get to the site, because yes, after all, it does work.
So this is actually what we measured, and we will break it down to digest it more easily because this contains all of the information which is important. For example, here we have the VPN provider CyberGhost for three different regions, because they don't support unblocking in the Netherlands, and each graph is split in two parts. Let's have a look at the top part. The top part shows how many unique IP addresses we found when looking for those unblocking IPs; green means IPv4 and pink, which we will see later, means IPv6. The black bar on top of some of them shows IP churn, so if the black bar is always riding the peaks of the IPs it means we always have fresh IPs.
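The two quantities in the top part of these graphs can be sketched from the sample format described earlier (time stamp, status, IP address). The talk does not define churn precisely, so this sketch assumes that a "fresh" address is one not seen on any earlier day; the function name and the addresses are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_unique_and_fresh(samples):
    """samples: iterable of (unix_timestamp, ok, ip).

    Returns {date: (unique_ip_count, fresh_ip_count)}, ordered by date,
    where fresh means the address was not seen on any earlier day.
    """
    per_day = defaultdict(set)
    for ts, ok, ip in samples:
        if ok:  # skip failed measurements
            day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
            per_day[day].add(ip)

    seen = set()
    result = {}
    for day in sorted(per_day):
        ips = per_day[day]
        result[day] = (len(ips), len(ips - seen))
        seen |= ips
    return result


# Hypothetical samples: two days, one failed measurement, one repeated address.
samples = [
    (0, True, "198.51.100.1"),
    (30, True, "198.51.100.2"),
    (60, False, "198.51.100.9"),
    (86400, True, "198.51.100.2"),
    (86430, True, "198.51.100.3"),
]
for day, (unique, fresh) in daily_unique_and_fresh(samples).items():
    print(day, unique, fresh)
```

When the fresh count equals the unique count every day, the churn line "rides the peaks", which is the pattern described for providers that hand out a new address for every connection.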
Now switching to the bottom, that's a graph that shows to which AS those IP addresses belong, and the colours are consistent across all of the images. For example, the orange colour here in the middle is only a single AS, but on the left side and on the right side there are many more ASs. Here we can see some very interesting patterns: in the United States they were using only a few IP addresses and a few ASs, and in the second measurement period, roughly from August onwards, they started using a lot of ASs and roughly 2200 unique IPv4 addresses per day for unblocking.
By the way, we found more than 2000 ASs and only gave the top 50 a unique colour; the rest were coloured black, so if you see those black bars on the right‑hand side for the United States, those are many, many more ASs.
Here is a different pattern for ExpressVPN. You can especially see that in the United States they are relying only on IPv6, and only on a few IPv6 addresses from a single AS.
If we go to Japan, we see some weird patterns as well: they are using one AS, the orange one, then for whatever reason decide to go into a whole mix of different ASs, and then at the end of September they switch back to their one AS.
Things can also look like this: they are apparently making use of only a single provider and, as you can see, the IP graph at the top shows they are using a different IPv6 address for every connection we requested, and the IP churn follows those lines perfectly, meaning we get a fresh IPv6 address every time.
Also interesting to see is that on the right side, in the United States, they have just a small mix of a few ASs they use for unblocking, let's say two or three, but on some days we have these dirty lines where there's a big amount of addresses rolling in, so there are some IPv4s; I am thinking one of the engineers made a mistake when configuring the unblocking on their side.
We have PrivateVPN, which has a special role: you can see the amount of ASNs is rather low, and we will discuss that a bit later.
This is how it looks so far, and actually we can split it up into two different modes of operation. The first is using specialised hosting providers for the unblocking and the second is residential proxies.
Let's have a look at the top.
So these were some of the nets that private VPN used and you can see from the RIPE maintainer object that they belong to instances that have a very similar name to PrivateVPN, like Private Communication, or PV, like Private Datanet, and it turns out these are all the same company that run their private VPN, they created their own fake ISP to look like they are a residential ISP but in reality they are not.
And also here we have PVDatanet. They claim to do dedicated hosting, but for some reason, in the two years I have been observing them, everything is coming soon and the links are dead links, like the ones on the bottom right to Facebook and LinkedIn, whatever.
We found Trafficforce from Lithuania. The only thing they say is that they provide unique solutions for global networks; furthermore, you have to contact them and hope they reply to you, and I guess if you throw a lot of money on the table they will help you do the unblocking.
Let's switch to a different type, the residential proxies. Here is a small overview: we have all the big players like Deutsche Telekom in Germany, Vodafone or KPN in the Netherlands, NTT in Japan and Comcast in America. On the right side we show how many unique IP addresses we found for these providers. I want to give a shout‑out to the ministry of defence whose IP address we found in our dataset; we reported this to them immediately, and we haven't heard back from them but have also not seen those IP addresses appear again. I assume that's a good sign.
What are we seeing today? If you want to do unblocking we first connect to the VPN server and get back that host, then we get routed to one of the two different methods they use, the specialised hosting or fake ISPs or residential proxies before we reach our ultimate goal which is the streaming provider.
Please talk to me if you have any more questions about the paper we wrote, or if you are a consumer or any other kind of ISP and want to help me in my follow‑up study on this topic or just want to talk. Thanks for your attention and we still have some time for questions, I think.
(Applause)
DMITRY KOHMANYUK: Any people who want to ‑‑ I see two, please, and that's it. Thank you for your insightful presentation.
AUDIENCE SPEAKER: Thank you very much, very interesting stuff. I remember a couple of years ago I was basically able to set such a DNS proxy up myself, and at some point it stopped working because the upper path you showed was detected by the Cloud providers or streaming providers, so I guess more stuff is happening on the lower path, the residential ISPs. Do you have any insight into how they get so many addresses in those networks? What's the ‑‑ what's the benefit?
ETIENNE KHAN: I have some ideas and I have talked to some people around here already. I want to find out in my next study, so I can't really talk about this right now, but many of us think of botnets and malware and I think it might go in that direction, but I don't have concrete evidence right now.
AUDIENCE SPEAKER: Do they also use those residential ISPs to actually tunnel all the traffic, or is it what I did a couple of years ago, which is just the DNS requests and not the rest of the traffic?
ETIENNE KHAN: From what I see it's just streaming sites. In some cases it might also be a news site of a country where you need an IP address to see all the content, but it's very difficult to verify because I would need to resolve all kinds of domains against their resolver. I tried to do that for some sites, but it wasn't that fruitful. So it's not all traffic, it's mostly the streaming.
AUDIENCE SPEAKER: IP address from the ministry of defence that you saw, makes me wonder whether if the customers of those VPNs services that get routed through residential ISPs actually become end points themselves for other people who want to unblock through their countries?
ETIENNE KHAN: Thanks for the question. There is some research on this topic and some free VPNs do this; I think Hola is one that comes to mind. The big players like Nord or ExpressVPN, their applications are actually clean, so they don't tunnel you back as a proxy, which makes us believe their source is something else and not the actual VPN application. But for very shady ‑‑ well, they are all shady, but let's say the free ones, that might be the case.
DMITRY KOHMANYUK: Thank you. I think we are close to our limit. So thank you again and for insightful when the VPN is a service.
ETIENNE KHAN: Thank you.
(Applause)
DMITRY KOHMANYUK: Tony Finch talking about where time has come from. And by the way, folks, don't forget to rate the talks in the system and you can win a cool prize, thank you.
TONY FINCH: Where does my computer get the time from? Well, NTP: here is a picture of an NTP packet and here is a picture of David Mills, who invented it. Easy question and easy answer, end of presentation.
Let's peel back a few layers and see where we go. Stratum 2 servers get the time from stratum 1 NTP servers, and they get the time from some reference clock: a radio signal like MSF or DCF77 in Germany, but most of the time some GPS receiver, and they of course get their time from a GPS satellite. So where does GPS get the time from? Well, the GPS satellites are controlled from Schriever Space Force Base in Colorado. They look after a lot of super secret satellites and you can see all the different mission logos they have got there. It's a thoroughly ugly place with lots of security fences and things. So where does Schriever Space Force Base get the time from? Well, in Colorado, on their site, they host the United States Naval Observatory alternate master clock, and that is maintained by the United States Naval Observatory in DC.
Well, they have absolutely stacks of atomic clocks: they have hydrogen masers and rubidium fountain clocks, and they have so many that they have special buildings for housing their clocks. When I was preparing this talk I was looking at the USNO campus in Washington DC and I saw this big scar; they have got a building site there. The accuracy of the clocks is affected to a large extent by the environment that they are held in, so they need stable temperature and humidity, and they are building a new building just for even more atomic clocks.
There are a couple of other answers to the question of where the USNO gets the time from. UTC is a kind of horrible compromise between atomic time and earth rotation, so in order to keep UTC in sync with earth rotation there is the International Earth Rotation Service, and every six months they send out Bulletin C, which says whether or not there is going to be a leap second in the next six months to keep UTC in sync with the rotation of the earth.
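The effect of those leap second announcements can be sketched as a lookup table of the offset between atomic time (TAI) and UTC. The table below is deliberately abbreviated to the five most recent steps and the function name is an assumption; a real program should read the published leap second list rather than hard‑code it.

```python
from datetime import date

# (first UTC date on which the offset applies, TAI minus UTC in seconds).
# Abbreviated: only the five most recent leap second steps are listed.
TAI_MINUS_UTC = [
    (date(2006, 1, 1), 33),
    (date(2009, 1, 1), 34),
    (date(2012, 7, 1), 35),
    (date(2015, 7, 1), 36),
    (date(2017, 1, 1), 37),
]


def tai_minus_utc(day):
    """Return how many seconds TAI is ahead of UTC on the given UTC date."""
    offset = None
    for start, seconds in TAI_MINUS_UTC:
        if day >= start:
            offset = seconds
    if offset is None:
        raise ValueError("date is before the first entry in this abbreviated table")
    return offset


print(tai_minus_utc(date(2023, 5, 26)))
```

Each new leap second announced in Bulletin C would add one row to a table like this; an announcement of "no leap second" leaves it unchanged.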
Now, the IERS is a big organisation, they have multiple sites to carry out their scientific mission, and if you subscribe to their Bulletin A, which comes out every week, it gives you details of earth orientation parameters. That comes from the United States Naval Observatory in Washington DC. They need to know the precise orientation of the earth underneath all of these GPS satellites so you can get precise location from GPS.
There's a third answer. How does the USNO know that their atomic clocks are telling the right time? They get the time from the International Bureau of Weights and Measures in Paris, which is responsible for the official UTC. So where does the BIPM get its figures for UTC from? They have a periodical called Circular T, which includes the measurements from all of the national time labs around the world, in Colorado, London, the Paris Observatory, and Circular T says how close each of these national time labs is to official UTC.
So, the next part of this is: how is official UTC defined? The BIPM is responsible for the International System of Units as defined by the General Conference on Weights and Measures, which is a treaty organisation set up by the Metre Convention in 1875. The CGPM has defined, as part of the International System of Units, the exact frequency of atomic clocks based on caesium. So where did this magic number, this 9.2 gigahertz, come from? It came from Louis Essen and Parry in the 1950s.
Where did they get the time from? At that time, the definition of the second was based on astronomy, so they needed help from astronomers to calibrate their clock, and they got help from the United States Naval Observatory in Washington DC. So Bill Markowitz from the United States Naval Observatory looked at the stars and measured the position of the earth under the sky, and Louis Essen looked at his clock and measured the time that way, and they compared their measurements by also listening to the WWV time signal from the National Bureau of Standards in Washington.
So, where did Bill Markowitz get his definition of time from? In 1952 the International Astronomical Union redefined the second from being based on the rotation of the earth about its axis to the so‑called ephemeris second, based on the orbit of the earth around the sun. We had found out in the 1930s that the rotation of the earth about its axis is not perfectly stable or smooth, and clocks were getting better at keeping time than the earth is. So, where did this ephemeris second come from? It's based on a mathematical model of the solar system which was created by Simon Newcomb by collecting a huge amount of astronomical data. He is a fine Victorian gentleman, and where did he work?
At the USNO in Washington DC.
So, I think I have just about run out of layers to peel back now; before this point, time is all based on just looking at the stars and seeing them go past. But I guess there's another answer to my question: where did my computer get the time from? Well, it does not get the time from the Royal Greenwich Observatory. And that's my talk.
(Applause)
DMITRY KOHMANYUK: Thank you, Tony. So you want to ask a quick one or then announce and that would be it and we have the General Meeting part.
Stephen: Are you actually saying USNO are time lords or ‑‑
TONY FINCH: Absolutely.
AUDIENCE SPEAKER: I had one thing to add: UTC is a weird standard and, as far as I remember, at the General Conference on Weights and Measures they have agreed to abolish leap seconds by 2035. I don't know if they specified what they will do with the drift away from UT1 or whatever the standard is, but leap seconds as a thing will stop bothering us at some point, is my understanding.
TONY FINCH: Hopefully, yes. Over the last couple of years it's been a bit weird because the earth has been rotating a bit faster than usual and there's a risk of a negative leap second, which has never happened before, but the earth has been slowing down again so maybe it won't happen.
DMITRY KOHMANYUK: Suggest a separate leap second plenary at the meeting, with that we are over but the GM part has to be announced separately.
NIALL O'REILLY: RIPE vice‑chair and sometime stability analysis wonk. It seems to me we have a system here that keeps on hanging together only because it keeps on hanging together. It's a bit like the importance of importing inertia to the power grid, and I am just wondering how big, or indeed how scarily small, is the domain of attraction for the equilibrium that we sort of have?
TONY FINCH: Good question. I don't know.
(Applause)
DMITRY KOHMANYUK: With that, we are done. Thank you everyone for being present here, it's astonishing attendance. Thank you.
(Applause)
ONDREJ FILIP: So good morning, everyone. I think this is the moment that everybody is expecting: I have the results from the General Meeting for you. First of all, everything went roughly smoothly, there were no major issues, so we have results and I can announce them. So, let's go for that.
Resolution number 1:
"The General Meeting adopts the RIPE NCC financial report 2022."
And there was majority voting yes so the resolution is approved.
Resolution 2:
"The General Meeting discharges the Executive Board with regards to its action as they appear from the annual report 2022 and financial report 2022 "
Again majority of yes, so the resolution is approved.
And resolution number 3:
"The General Meeting adopts the RIPE NCC charging scheme 2024 model ."
So that's a flat model with lowest fees that were suggested by the board.
Resolution 4:
"In addition to the RIPE NCC charging scheme adopted in resolution 3 the General Meeting adopts an extra charge of €50 per ASN as an integral part of the charging scheme 2024"
And this resolution was not approved.
Surprise, surprise. Resolution number 5:
"In addition to the RIPE NCC charging scheme adopted in resolution 3 the General Meeting adopts an extra charge of €500 per accepted transfer request as an integral part of the charging scheme 2024"
Again, the resolution is not approved.
Resolution 6:
"The General Meeting adopts the amendments to the RIPE NCC Standard Service Agreement".
Super majority, yes, resolution is approved.
And now we have the elections, so I would like to thank all the candidates who participated, it was really brave of you to stand in front of you and say something, so thank you very much for that.
(Applause)
And elected were Raymond, Maria, and Harald. Thank you very much, and congratulations.
(Applause)
And with that, I officially close the General Meeting now, thank you very much.
LIVE CAPTIONING BY AOIFE DOWNES, RPR
DUBLIN, IRELAND