
Android Available Now


Google Play Store

Good things come to those who wait (and after I've had my morning coffee).

Flameeyes · 7 days ago · Dublin, Ireland

Taking over a postal address: An Post edition


As I announced a few months ago, I’m moving to London. One of the tasks before the move is setting up postal address redirection, so that the services unable to mail me across the Irish Sea can still reach me. Luckily I know for a fact that An Post (the Irish postal service) has a redirection service, if not a cheap one.

A couple of weeks ago, I went to sign up for the service, and I found that I had two choices: I could go to the post office (which is inside the EuroSpar next door), show a photo ID and a proof of address, and pay cash or with a debit card1; or I could fill in the form online, pay with a credit card, and then post out a physical signed piece of paper. I chose the latter.

There are many laughable things that I could complain about, in the process of setting up the redirection, but I want to focus on what I think is the big, and most important problem. After you choose the addresses (original and new destination), it will ask you where you want your confirmation PIN sent.

There is a reason why they do that. I set up the redirect well before I moved, and in particular I chose to redirect mail from my apartment to my local office — this way I can either batch together the mail, or simply ask for an inter-office forwarding. This meant I had access to both the original and the new address at the same time — but for many people, particularly moving out of the country, by the time they know where to forward the mail, they might only have access to the new address.

The issue is that if you decide to get the PIN at the new address, the only notification sent to the old address is a single letter confirming the activation of the redirection. This is likely meant so you can call An Post and have them cancel the redirection if it was set up against your will.

While this stops a possible long-term takeover of a mail address, it still allows a wide window of opportunity for a takeover. Also, it has one significant drawback: the letter does not tell you where the mail will be redirected!

Let’s say you want to take over someone’s address (we’ll look at what for later). First you need to know their address; this is the simplest part of course. Now you can fill in the request for the redirection on An Post’s website (the original address is not given any indication that a request was filed) and get the PIN at the new address. Once the PIN is received, there is a window of time in which to enable the redirection.

Until activation is completed, and the redirection time is selected, no communication is given to the original address.

If your target happens to be travelling or otherwise unable to get to their mail for a few weeks, then you have an opportunity. You can take over the address, get some documents at the given address, and get your hands on them. Of course the target will become suspicious when coming back, finding a note about redirection and no mail. But finding a way to recover the mail without being tied to an identity is left as an exercise to the reader.

So what would you accomplish, besides annoying your target and possibly getting some of their unsolicited mail? Well, there is a significant number of interesting targets in the postal mail you receive in Ireland.

For instance, take credit card statements. Tesco Bank does not allow you to receive them electronically, and Ulster Bank will send you the paper copy even if you opt in to all the possible electronic communications. And a credit card statement in Ireland includes a lot more information than in other countries, including just enough to take over the credit card. Tesco Bank, for instance, will authenticate you with the 16-digit PAN (on the statement), your full address (on the statement), the credit limit (you guessed it, on the statement), and your date of birth (okay, this one is not on the statement, but you can probably find my date of birth pretty easily).

And even if you don’t want to take over the full credit card, having the PAN is extremely useful in and of itself for taking over other accounts. And since you have the statement, it wouldn’t be difficult to figure out what the card is used for — take over an Amazon account and you can take over a lot more things.

But there are more concrete problems too — for instance I receive a significant amount of pseudo-cash2 in the form of Tesco vouchers — having physical control of the vouchers effectively means having the cash in your hand. Or say you want to get a frequent guest or frequent flyer card, because a card is often just enough to get the benefits and to have access to the information on the account. Or just get enough of a proof of address to register on any other service that requires one.

Because let’s remember: an authentication system is only as strong as its weakest link. So all those systems requiring a proof of address? You can get past all of them with just one recent enough proof of address, obtained by hijacking someone’s physical mail. And that’s just a matter of paying for it.


  1. An Post is well known for only accepting VISA Debit cards, and refuses both MasterCard Debit and VISA Credit cards. Funnily enough, they issue MasterCard cards, but that’s a story for another time. [return]
  2. I should at some point write a post about pseudo-cash and the value of a euro when it’s not a coin. [return]
Flameeyes · 15 days ago · Dublin, Ireland

Behind the Masq: Yet more DNS, and DHCP, vulnerabilities



Our team has previously posted about DNS vulnerabilities and exploits. Lately, we’ve been busy reviewing the security of another DNS software package: Dnsmasq. We are writing this to disclose the issues we found and to publicize the patches in an effort to increase their uptake.

Dnsmasq provides functionality for serving DNS, DHCP, router advertisements and network boot. This software is commonly installed in systems as varied as desktop Linux distributions (like Ubuntu), home routers, and IoT devices. Dnsmasq is widely used both on the open internet and internally in private networks.

We discovered seven distinct issues (listed below) over the course of our regular internal security assessments. Once we determined the severity of these issues, we worked to investigate their impact and exploitability and then produced internal proofs of concept for each of them. We also worked with the maintainer of Dnsmasq, Simon Kelley, to produce appropriate patches and mitigate these issues.

These patches have been upstreamed and are now committed to the project’s git repository. In addition to these patches, we have also submitted another patch which will run Dnsmasq under seccomp-bpf to allow for additional sandboxing. This patch has been submitted to the Dnsmasq project for review, and we have also made it available here for those who wish to integrate it into an existing install (after testing, of course!). We believe the adoption of this patch will increase the security of Dnsmasq installations.
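For readers unfamiliar with the technique, here is a minimal sketch of syscall-filter sandboxing using libseccomp. This is not the patch submitted to Dnsmasq (which installs its own seccomp-bpf filter and allows a different set of syscalls); it only illustrates the idea that, once the filter is loaded, an exploited process can make only the syscalls on an explicit allowlist:

#include <seccomp.h>

/* Illustrative allowlist filter: kill the process on any syscall that is
 * not explicitly permitted. Link with -lseccomp. */
static int install_syscall_filter(void)
{
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL); /* default action: kill */
    if (ctx == NULL)
        return -1;

    /* Allow a small set of syscalls a DNS/DHCP daemon plausibly needs;
     * the real patch's list is longer and was chosen by its developers. */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(recvmsg), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sendmsg), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    if (seccomp_load(ctx) < 0) {
        seccomp_release(ctx);
        return -1;
    }
    seccomp_release(ctx);
    return 0;
}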

We would like to thank Simon Kelley for his help in patching these bugs in the core Dnsmasq codebase. Users who have deployed the latest version of Dnsmasq (2.78) will be protected from the attacks discovered here. Android partners have received this patch as well and it will be included in Android's monthly security update for October. Kubernetes versions 1.5.8, 1.6.11, 1.7.7, and 1.8.0 have been released with a patched DNS pod. Other affected Google services have been updated.

During our review, the team found three potential remote code executions, one information leak, and three denial of service vulnerabilities affecting the latest version on the project’s git server as of September 5th, 2017.

CVE            | Impact           | Vector | Notes
---------------|------------------|--------|------------------------------------------
CVE-2017-14491 | RCE              | DNS    | Heap-based overflow (2 bytes). Before 2.76 and this commit the overflow was unrestricted.
CVE-2017-14492 | RCE              | DHCP   | Heap-based overflow.
CVE-2017-14493 | RCE              | DHCP   | Stack-based overflow.
CVE-2017-14494 | Information leak | DHCP   | Can help bypass ASLR.
CVE-2017-14495 | OOM/DoS          | DNS    | Lack of free() here.
CVE-2017-14496 | DoS              | DNS    | Invalid boundary checks here. Integer underflow leading to a huge memcpy.
CVE-2017-13704 | DoS              | DNS    | Bug collision with CVE-2017-13704.



It is worth expanding on some of these:

  • CVE-2017-14491 is a DNS-based vulnerability that affects both directly exposed and internal network setups. Although the latest git version only allows a 2-byte overflow, this could be exploited based on previous research. Before version 2.76 and this commit the overflow is unrestricted. 
==1159==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62200001dd0b at pc 0x0000005105e7 bp 0x7fff6165b9b0 sp 0x7fff6165b9a8
WRITE of size 1 at 0x62200001dd0b thread T0
  #0 0x5105e6 in add_resource_record /test/dnsmasq/src/rfc1035.c:1141:7
  #1 0x5127c8 in answer_request /test/dnsmasq/src/rfc1035.c:1428:11
  #2 0x534578 in receive_query /test/dnsmasq/src/forward.c:1439:11
  #3 0x548486 in check_dns_listeners /test/dnsmasq/src/dnsmasq.c:1565:2
  #4 0x5448b6 in main /test/dnsmasq/src/dnsmasq.c:1044:7
  #5 0x7fdf4b3972b0 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x202b0)
  #6 0x41cbe9 in _start (/test/dnsmasq/src/dnsmasq+0x41cbe9)

  • CVE-2017-14493 is a trivial-to-exploit DHCP-based, stack-based buffer overflow vulnerability. In combination with CVE-2017-14494 acting as an info leak, an attacker could bypass ASLR and gain remote code execution.
dnsmasq[15714]: segfault at 1337deadbeef ip 00001337deadbeef sp 00007fff1b66fd10 error 14 in libnss_files-2.24.so[7f7cfbacb000+a000]
  • Android is affected by CVE-2017-14496 when the attacker is local or tethered directly to the device—the service itself is sandboxed so the risk is reduced. Android partners received patches on 5 September 2017 and devices with a 2017-10-01 security patch level or later address this issue.
Proofs of concept are provided so you can check if you are affected by these issues, and verify any mitigations you may deploy.


We would like to thank the following people for discovering, investigating impact/exploitability and developing PoCs: Felix Wilhelm, Fermin J. Serna, Gabriel Campana, Kevin Hamacher, Ron Bowes and Gynvael Coldwind of the Google Security Team.
Flameeyes · 16 days ago · Dublin, Ireland

How blogging changed in the past ten years


One of the problems that keeps poking back at me every time I look for alternative software for this blog is that it somehow became not your average blog, particularly not in 2017.

The first issue is that there is a lot of history. While the current “incarnation” of the blog, with the Hugo install, is fairly recent, I have been porting over a long history of my pseudo-writing, merging back into this one big collection the blog posts coming from my original Gentoo Developer blog, as well as the few posts I wrote on the KDE Developers blog and a very minimal amount of content from my (mostly Italian) blog when I was in high school.

Why did I do it that way? Well, the main thing is that I don’t want to lose the memories. As some of you might know already, I have faced my mortality before, and I came to realize that this blog is probably the only thing of substance that I had a hand in that will outlive me. And so I don’t want to just let migrations, service turndowns, and other similar changes take away what I did. This is also why I published to this blog the articles I wrote for other websites, namely NewsForge and Linux.com (back when they were part of Geeknet).

Some of the recovery work actually required effort. As I said above, there’s a minimal amount of content that comes from my high school days blog. And it’s in Italian, which does not make it particularly interesting or useful. I had deleted that blog altogether years and years ago, so I had to use the Wayback Machine to recover at least some of the posts. I will be going through all my old backups in the hope of finding that one last backup that I remember making before tearing the thing down.

Why did I tear it down in the first place? It’s clearly a teenager’s blog and I am seriously embarrassed by the way I thought and wrote. It was 13-14 years ago, and last year I admitted just how many times I’ve been wrong. But this is not the change I want to talk about.

The change I want to talk about is the second issue with finding good software to run my blog: blogging is not what it used to be ten years ago. Or fifteen years ago. It’s not just that a lot of money got involved in the meantime, so that now there is a significant number of “corporate blogs” that end up being either product announcements in a different form, or another outlet for not-quite-magazine content. I know of at least a couple of Italian newspapers that provide “blogs” for their writers, which look almost exactly like the paper’s website, but do not have to be reviewed by the editorial board.

In addition to this, a lot of people’s blogs stopped providing as many details of their personal life as they used to. Likely, this is related to the fact that we now know just how nasty people on the Internet can be (read: just as nasty as people off the Internet), and a lot of the people who used to write lightheartedly no longer feel as safe doing so, and rightly so. But there is probably another reason: “Social Media”.

The advent of Twitter and Facebook made it so that there is less need to post short personal entries, too. And Facebook in particular appears to have swallowed most of the “cutesy memes” such as quizzes and lists of things people have or have not done. I know there are still a few people who insist on not using these big-name social networks, and still post for their friends and family on blogs, but I have a feeling they are quite the minority. And I can tell you for sure that since I signed up for Facebook, a lot of my smaller “so here’s that” posts went away.

Distribution chart of blog post sizes over time

This is a bit of a rough plot of blog post sizes. In particular, I have used the raw file size of the Markdown sources used by Hugo, in bytes, which makes it not perfect for Unicode symbols, and it includes the “front matter”, which means that the non-Hugo-native posts in particular have their title effectively doubled by the slug. But it shows trends particularly well.

You can see from that graph that some time around 2009 I almost entirely stopped writing short blog posts. That is around the time Facebook took off in Italy, and a lot of my interaction with friends started happening there. If you’re curious about the visible lack of posts around the middle of 2007, that was the pancreatitis that had me disappear for nearly two months.

With this reduction in scope of what people actually write on blogs, I also have a feeling that lots of people were left without anything to say. A number of blogs I still follow (via NewsBlur, since Google Reader was shut down) post once or twice a year. Planets are still a thing, and I still subscribe to a number of them, but I realize I don’t even recognize half the names nowadays. Lots of the “old guard” stopped blogging almost entirely, possibly because of a lack of engagement, or simply because, like me, many found a full-time job (or a full-time family) that takes most of their time.

You can definitely see from the plot that even my own blogging has significantly slowed down over the past few years. Part of it was the tooling giving up on me a few times, but it also involves the lack of energy to write all the time as I used to. Plus there is another problem: I now feel I need to be more accurate in what I’m saying and in the words I’m using. This is in part because I grew up, and know how much words can hurt people even when meant the right way, but also because it turns out when you put yourself in certain positions it’s too easy to attack you (been there, done that).

A number of people argue that it was the demise of Google Reader1 that caused blogs to die, but as I said above, I think it’s just the evolution of the concept veering towards other systems that turned out to be more easily reachable by users.

So are blogs dead? I don’t think so. But they are getting harder to discover, because people use other platforms and it gets difficult to follow all of them. Hacker News and Reddit are becoming many geeks’ default way to discover content, and that has the unfortunate side effect that less of the conversation happens in shared media. I am indeed bothered by those people who prefer discussing the merits of my posts on those external websites rather than engaging in the comments, if nothing else because I do not track those platforms, and so the feeling I get is of people talking behind my back. I would prefer if people actually told me when they share my posts on those platforms; for Reddit I can at least use IFTTT to self-stalk the blog, but that’s a different problem.

Will we still have blogs in 10 years? Probably yes, but most likely they will not look like the ones we’re used to. The same way as nowadays there still are personal homepages, but they clearly don’t look like Geocities, and there are social media pages that do not look like MySpace.


  1. Usual disclaimer: I do work for Google at the time of writing this, but these are all personal opinions that have no involvement from the company. For reference, I signed the contract before the Google Reader shutdown announcement, but started after it. I was also sad, but I found NewsBlur a better replacement anyway. [return]
Flameeyes · 17 days ago · Dublin, Ireland

The Apple of Your EFI: Mac Firmware Security Research





We are really excited to give a talk at Ekoparty in Buenos Aires on September 29th, 2017 covering some recent research we have done on the security support being given to Apple’s EFI firmware.

via Pocket: https://duo.com/blog/the-apple-of-your-efi-mac-firmware-security-research (September 30, 2017)

Flameeyes · 18 days ago · Dublin, Ireland

Serving 100 Gbps from an Open Connect Appliance


By Drew Gallatin

In the summer of 2015, the Netflix Open Connect CDN team decided to take on an ambitious project. The goal was to leverage the new 100GbE network interface technology just coming to market in order to be able to serve at 100 Gbps from a single FreeBSD-based Open Connect Appliance (OCA) using NVM Express (NVMe)-based storage.

At the time, the bulk of our flash storage-based appliances were close to being CPU limited while serving at 40 Gbps using a single-socket Xeon E5-2697v2. The first step was to find the CPU bottlenecks in the existing platform while we waited for newer CPUs from Intel, newer motherboards with PCIe Gen3 x16 slots that could run the new Mellanox 100GbE NICs at full speed, and for systems with NVMe drives.

Fake NUMA

Normally, most of an OCA’s content is served from disk, with only 10–20% of the most popular titles being served from memory (see our previous blog, Content Popularity for Open Connect for details). However, our early pre-NVMe prototypes were limited by disk bandwidth. So we set up a contrived experiment where we served only the very most popular content on a test server. This allowed all content to fit in RAM and therefore avoid the temporary disk bottleneck. Surprisingly, the performance actually dropped from being CPU limited at 40 Gbps to being CPU limited at only 22 Gbps!

After doing some very basic profiling with pmcstat and flame graphs, we suspected that we had a problem with lock contention. So we ran the DTrace-based lockstat lock profiling tool that is provided with FreeBSD. Lockstat told us that we were spending most of our CPU time waiting for the lock on FreeBSD’s inactive page queue. Why was this happening? Why did this get worse when serving only from memory?

A Netflix OCA serves large media files using NGINX via the asynchronous sendfile() system call. (See NGINX and Netflix Contribute New sendfile(2) to FreeBSD). The sendfile() system call fetches the content from disk (unless it is already in memory) one 4 KB page at a time, wraps it in a network memory buffer (mbuf), and passes it to the network stack for optional encryption and transmission via TCP. After the network stack releases the mbuf, a callback into the VM system causes the 4K page to be released. When the page is released, it is either freed into the free page pool, or inserted into a list of pages that may be needed again, known as the inactive queue. Because we were serving entirely from memory, NGINX was advising sendfile() that most of the pages would be needed again — so almost every page on the system went through the inactive queue.
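For readers unfamiliar with the call, here is a minimal sketch of how a server hands a file range to FreeBSD's sendfile(2); the flags and header/trailer arguments that NGINX actually uses to hint caching and readahead behaviour are omitted:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Illustrative only: ask the kernel to send `nbytes` of the open file `fd`,
 * starting at `offset`, over the connected socket `sock`. The kernel wraps
 * the file's 4 KB pages in mbufs and pushes them through the TCP stack,
 * as described above. */
static int
send_range(int fd, int sock, off_t offset, size_t nbytes)
{
    off_t sbytes = 0;

    if (sendfile(fd, sock, offset, nbytes, NULL, &sbytes, 0) == -1)
        return -1;      /* errno describes the failure */
    return 0;
}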

The problem here is that the inactive queue is structured as a single list per non-uniform memory (NUMA) domain, and is protected by a single mutex lock. By serving everything from memory, we moved a large percent of the page release activity from the free page pool (where we already had a per-CPU free page cache, thanks to earlier work by Netflix’s Randall Stewart and Scott Long, and Isilon’s Jeff Roberson) to the inactive queue. The obvious fix would have been to add a per-CPU inactive page cache, but the system still needs to be able to find the page when it needs it again. Pages are hashed to the per-NUMA queues in a predictable way.

The ultimate solution we came up with is what we call “Fake NUMA”. This approach takes advantage of the fact that there is one set of page queues per NUMA domain. All we had to do was to lie to the system and tell it that we have one Fake NUMA domain for every 2 CPUs. After we did this, our lock contention nearly disappeared and we were able to serve at 52 Gbps (limited by the PCIe Gen3 x8 slot) with substantial CPU idle time.

Pbufs

After we had newer prototype machines, with an Intel Xeon E5-2697v3 CPU, PCIe Gen3 x16 slots for the 100GbE NIC, and more disk storage (4 NVMe or 44 SATA SSD drives), we hit another bottleneck, also related to a lock on a global list. We were stuck at around 60 Gbps on this new hardware, and we were constrained by pbufs.

FreeBSD uses a “buf” structure to manage disk I/O. Bufs that are used by the paging system are statically allocated at boot time and kept on a global linked list that is protected by a single mutex. This was done long ago, for several reasons, primarily to avoid needing to allocate memory when the system is already low on memory, and trying to page or swap data out in order to be able to free memory. Our problem is that the sendfile() system call uses the VM paging system to read files from disk when they are not resident in memory. Therefore, all of our disk I/O was constrained by the pbuf mutex.

Our first problem was that the list was too small. We were spending a lot of time waiting for pbufs. This was easily fixed by increasing the number of pbufs allocated at boot time by increasing the kern.nswbuf tunable. However, this update revealed the next problem, which was lock contention on the global pbuf mutex. To solve this, we changed the vnode pager (which handles paging to files, rather than the swap partition, and hence handles all sendfile() I/O) to use the normal kernel zone allocator. This change removed the lock contention, and boosted our performance into the 70 Gbps range.

Proactive VM Page Scanning

As noted above, we make heavy use of the VM page queues, especially the inactive queue. Eventually, the system runs short of memory and these queues need to be scanned by the page daemon to free up memory. At full load, this was happening roughly twice per minute. When this happened, all NGINX processes would go to sleep in vm_wait() and the system would stop serving traffic while the pageout daemon worked to scan pages, often for several seconds. This had a severe impact on key metrics that we use to determine an OCA’s health, especially NGINX serving latency.

The basic system health can be expressed as follows (I wish this was a cartoon):

<...>
Time1: 15GB memory free. Everything is fine. Serving at 80 Gbps.
Time2: 10GB memory free. Everything is fine. Serving at 80 Gbps.
Time3: 05GB memory free. Everything is fine. Serving at 80 Gbps.
Time4: 00GB memory free. OH MY GOD!!! We’re all gonna DIE!! Serving at 0 Gbps.
Time5: 15GB memory free. Everything is fine. Serving at 80 Gbps.
<...>

This problem is actually made progressively worse as one adds NUMA domains, because there is one pageout daemon per NUMA domain, but the page deficit that it is trying to clear is calculated globally. So if the vm pageout daemon decides to clean, say 1GB of memory and there are 16 domains, each of the 16 pageout daemons will individually attempt to clean 1GB of memory.

To solve this problem, we decided to proactively scan the VM page queues. In the sendfile path, when allocating a page for I/O, we run the pageout code several times per second on each VM domain. The pageout code is run in its lightest-weight mode in the context of one unlucky NGINX process. Other NGINX processes continue to run and serve traffic while this is happening, so we can avoid bursts of pager activity that blocks traffic serving. Proactive scanning allowed us to serve at roughly 80 Gbps on the prototype hardware.

RSS Assisted LRO

TCP Large Receive Offload (LRO), is the technique of combining several packets received for the same TCP connection into a single large packet. This technique reduces system load by reducing trips through the network stack. The effectiveness of LRO is measured by the aggregation rate. For example, if we are able to receive four packets and combine them into one, then our LRO aggregation rate is 4 packets per aggregation.

The FreeBSD LRO code will, by default, manage up to 8 packet aggregations at one time. This works really well on a LAN, when serving traffic over a small number of really fast connections. However, we have tens of thousands of active TCP connections on our 100GbE machines, so our aggregation rate was rarely better than 1.1 packets per aggregation on average.

Hans Petter Selasky, Mellanox’s 100GbE driver developer, came up with an innovative solution to our problem. Most modern NICs will supply a Receive Side Scaling (RSS) hash result to the host. RSS is a standard developed by Microsoft wherein TCP/IP traffic is hashed by source and destination IP address and/or TCP source and destination ports. The RSS hash result will almost always uniquely identify a TCP connection. Hans’ idea was that rather than just passing the packets to the LRO engine as they arrive from the network, we should hold the packets in a large batch, and then sort the batch of packets by RSS hash result (and original time of arrival, to keep them in order). After the packets are sorted, packets from the same connection are adjacent even when they arrive widely separated in time. Therefore, when the packets are passed to the FreeBSD LRO routine, it can aggregate them.

With this new LRO code, we were able to achieve an LRO aggregation rate of over 2 packets per aggregation, and were able to serve at well over 90 Gbps for the first time on our prototype hardware for mostly unencrypted traffic.

An RX queue containing 1024 packets from 256 connections would have 4 packets from the same connection in the ring, but the LRO engine would not be able to see that the packets belonged together, because it maintained just a handful of aggregations at once. After sorting by RSS hash, the packets from the same connection appear adjacent in the queue, and can be fully aggregated by the LRO engine.
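A sketch of the sorting idea follows; the structure and field names are made up for illustration and are not the Mellanox driver's actual data structures:

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one received packet in a batch. */
struct rx_pkt {
    uint32_t rss_hash;  /* RSS hash reported by the NIC */
    uint64_t arrival;   /* arrival index, to preserve per-connection order */
    void    *data;
};

static int
rx_pkt_cmp(const void *a, const void *b)
{
    const struct rx_pkt *pa = a, *pb = b;

    if (pa->rss_hash != pb->rss_hash)
        return (pa->rss_hash < pb->rss_hash) ? -1 : 1;
    /* Same connection (almost always): keep the original packet order. */
    return (pa->arrival < pb->arrival) ? -1 : (pa->arrival > pb->arrival);
}

/* Sort a received batch so that packets of the same connection become
 * adjacent before the batch is handed to the LRO engine. */
static void
sort_rx_batch(struct rx_pkt *batch, size_t n)
{
    qsort(batch, n, sizeof(*batch), rx_pkt_cmp);
}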

New Goal: TLS at 100 Gbps

So the job was done. Or was it? The next goal was to achieve 100 Gbps while serving only TLS-encrypted streams.

By this point, we were using hardware which closely resembles today’s 100GbE flash storage-based OCAs: four NVMe PCIe Gen3 x4 drives, 100GbE ethernet, and a Xeon E5-2697A v4 CPU. With the improvements described in the Protecting Netflix Viewing Privacy at Scale blog entry, we were able to serve TLS-only traffic at roughly 58 Gbps.

In the lock contention problems we’d observed above, the cause of any increased CPU use was relatively apparent from normal system level tools like flame graphs, DTrace, or lockstat. The 58 Gbps limit was comparatively strange. As before, the CPU use would increase linearly as we approached the 58 Gbps limit, but then as we neared the limit, the CPU use would increase almost exponentially. Flame graphs just showed everything taking longer, with no apparent hotspots.

We finally had a hunch that we were limited by our system’s memory bandwidth. We used the Intel® Performance Counter Monitor Tools to measure the memory bandwidth we were consuming at peak load. We then wrote a simple memory thrashing benchmark that used one thread per core to copy between large memory chunks that did not fit into cache. According to the PCM tools, this benchmark consumed the same amount of memory bandwidth as our OCA’s TLS-serving workload. So it was clear that we were memory limited.

At this point, we became focused on reducing memory bandwidth usage. To assist with this, we began using the Intel VTune profiling tools to identify memory loads and stores, and to identify cache misses.

Read Modify Write

Because we are using sendfile() to serve data, encryption is done from the virtual memory page cache into connection-specific encryption buffers. This preserves the normal FreeBSD page cache in order to allow serving of hot data from memory to many connections. One of the first things that stood out to us was that the ISA-L encryption library was using half again as much memory bandwidth for memory reads as it was for memory writes. From looking at VTune profiling information, we saw that ISA-L was somehow reading both the source and destination buffers, rather than just writing to the destination buffer.

We realized that this was because the AVX instructions used by ISA-L for encryption on our CPUs worked on 256-bit (32-byte) quantities, whereas the cache line size was 512 bits (64 bytes) — thus triggering the system to do read-modify-writes when data was written. The problem is that the CPU will normally access the memory system in 64-byte cache-line-sized chunks, reading an entire 64 bytes to access even just a single byte. In this case, the CPU needed to write 32 bytes of a cache line, but using read-modify-writes to handle those writes meant that it was reading the entire 64-byte cache line in order to be able to write the first 32 bytes. This was especially silly, because the very next thing that would happen would be that the second half of the cache line would be written.

After a quick email exchange with the ISA-L team, they provided us with a new version of the library that used non-temporal instructions when storing encryption results. Non-temporals bypass the cache, and allow the CPU direct access to memory. This meant that the CPU was no longer reading from the destination buffers, and so this increased our bandwidth from 58 Gbps to 65 Gbps.
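To illustrate the difference, here is a toy copy loop (not ISA-L's code) that writes its results with 256-bit non-temporal stores; because the streaming stores bypass the cache, the destination lines are never read in just to be overwritten:

#include <immintrin.h>
#include <stddef.h>

/* Illustrative sketch: copy `len` bytes using AVX streaming stores.
 * Assumes `dst` is 32-byte aligned and `len` is a multiple of 32. */
static void
copy_nontemporal(void *dst, const void *src, size_t len)
{
    size_t i;

    for (i = 0; i < len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)((const char *)src + i));
        _mm256_stream_si256((__m256i *)((char *)dst + i), v);  /* bypasses cache */
    }
    _mm_sfence();  /* order the streaming stores before the data is used */
}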

In parallel with this optimization, the spec for our final production machines was changed from using lower cost DDR4-1866 memory to using DDR4-2400 memory, which was the fastest supported memory for our platform. With the faster memory, we were able to serve at 76 Gbps.

VTune Driven Optimizations

We spent a lot of time looking at VTune profiling information, re-working numerous core kernel data structures to have better alignment, and using minimally-sized types to be able to represent the possible ranges of data that could be expressed there. Examples of this approach include rearranging the fields of kernel structs related to TCP, and re-sizing many of the fields that were originally expressed in the 1980s as “longs”, which only need to hold 32 bits of data but are now 64 bits on 64-bit platforms.

Another trick we use is to avoid accessing rarely used cache lines of large structures. For example, FreeBSD’s mbuf data structure is incredibly flexible, and allows referencing many different types of objects and wrapping them for use by the network stack. One of the biggest sources of cache misses in our profiling was the code to release pages sent by sendfile(). The relevant part of the mbuf data structure looks like this:

struct m_ext {
    volatile u_int *ext_cnt;       /* pointer to ref count info */
    caddr_t         ext_buf;       /* start of buffer */
    uint32_t        ext_size;      /* size of buffer, for ext_free */
    uint32_t        ext_type:8,    /* type of external storage */
                    ext_flags:24;  /* external storage mbuf flags */
    void          (*ext_free)      /* free routine if not the usual */
                       (struct mbuf *, void *, void *);
    void           *ext_arg1;      /* optional argument pointer */
    void           *ext_arg2;      /* optional argument pointer */
};

The problem is that arg2 fell in the 3rd cache line of the mbuf, and was the only thing accessed in that cache line. Even worse, in our workload arg2 was almost always NULL. So we were paying to read 64 bytes of data for every 4 KB we sent, where that pointer was NULL virtually all the time. After failing to shrink the mbuf, we decided to augment the ext_flags to save enough state in the first cache line of the mbuf to determine if ext_arg2 was NULL. If it was, then we just passed NULL explicitly, rather than dereferencing ext_arg2 and taking a cache miss. This gained almost 1 Gbps of bandwidth.
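A sketch of the idea follows; EXT_FLAG_ARG2_NULL is a made-up flag name used only for illustration, and the real free path in the kernel obviously differs:

/* Kernel-context sketch against the struct m_ext shown above. */
#include <sys/param.h>
#include <sys/mbuf.h>

#define EXT_FLAG_ARG2_NULL 0x000800  /* hypothetical flag, set at mbuf setup
                                      * time when ext_arg2 is known to be NULL */

static void
mbuf_ext_release(struct mbuf *m)
{
    struct m_ext *ext = &m->m_ext;
    void *arg2;

    /* Decide from the first cache line of the mbuf whether ext_arg2 would be
     * NULL, instead of dereferencing it and pulling in the 3rd cache line. */
    if (ext->ext_flags & EXT_FLAG_ARG2_NULL)
        arg2 = NULL;
    else
        arg2 = ext->ext_arg2;

    ext->ext_free(m, ext->ext_arg1, arg2);
}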

Getting Out of Our Own Way

VTune and lockstat pointed out a number of oddities in system performance, most of which came from the data collection that is done for monitoring and statistics.

The first example is a metric monitored by our load balancer: TCP connection count. This metric is needed so that the load balancing software can tell if the system is underloaded or overloaded. The kernel did not export a connection count, but it did provide a way to export all TCP connection information, which allowed user space tools to calculate the number of connections. This was fine for smaller scale servers, but with tens of thousands of connections, the overhead was noticeable on our 100GbE OCAs. When asked to export the connections, the kernel first took a lock on the TCP connection hash table, copied it to a temporary buffer, dropped the lock, and then copied that buffer to userspace. Userspace then had to iterate over the table, counting connections. This both caused cache misses (lots of unneeded memory activity), and lock contention for the TCP hash table. The fix was quite simple. We added per-CPU lockless counters that tracked TCP state changes, and exported a count of connections in each TCP state.
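The shape of that change can be sketched with FreeBSD's counter(9) per-CPU counters; where exactly the real kernel hooks the state transitions, and how the totals are exported, is left out of this illustration:

/* Kernel-context sketch: one lockless per-CPU counter per TCP state. */
#include <sys/param.h>
#include <sys/malloc.h>
#include <sys/counter.h>
#include <netinet/tcp_fsm.h>

static counter_u64_t tcp_state_count[TCP_NSTATES];

static void
tcp_state_count_init(void)
{
    int i;

    for (i = 0; i < TCP_NSTATES; i++)
        tcp_state_count[i] = counter_u64_alloc(M_WAITOK);
}

/* Called on every state transition: no lock, no shared cache line. */
static void
tcp_state_count_transition(int oldstate, int newstate)
{
    counter_u64_add(tcp_state_count[oldstate], -1);
    counter_u64_add(tcp_state_count[newstate], 1);
}

/* Export path: sum the per-CPU values only when monitoring asks for them. */
static uint64_t
tcp_state_count_fetch(int state)
{
    return (counter_u64_fetch(tcp_state_count[state]));
}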

Another example is that we were collecting detailed TCP statistics for every TCP connection. The goal of these statistics is to monitor the quality of customers’ sessions. The detailed statistics were quite expensive, both in terms of cache misses and in terms of CPU. On a fully loaded 100GbE server with many tens of thousands of active connections, the TCP statistics consumed 5–10% of the CPU. The solution to this problem was to keep detailed statistics on only a small percentage of connections. This dropped the CPU used by TCP statistics to below 1%.
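A sketch of the sampling decision; the 1% figure is an assumption for illustration, not a number from the article:

#include <stdbool.h>
#include <stdlib.h>   /* arc4random() on FreeBSD */

#define TCP_STATS_SAMPLE_PCT 1   /* hypothetical: detailed stats for ~1% */

/* Decide once, when the connection is created, whether it gets the expensive
 * detailed statistics or the cheap default path. */
static bool
tcp_stats_sample_connection(void)
{
    return ((arc4random() % 100) < TCP_STATS_SAMPLE_PCT);
}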

These changes resulted in a speedup of 3–5 Gbps.

Mbuf Page Arrays

The FreeBSD mbuf system is the workhorse of the network stack. Every packet which transits the network is composed of one or more mbufs, linked together in a list. The FreeBSD mbuf system is very flexible, and can wrap nearly any external object for use by the network stack. FreeBSD’s sendfile() system call, used to serve the bulk of our traffic, makes use of this feature by wrapping each 4K page of a media file in an mbuf, each with its own metadata (free function, arguments to the free function, reference count, etc).

The drawback to this flexibility is that it leads to a lot of mbufs being chained together. A single 1 MB HTTP range request going through sendfile can reference 256 VM pages, and each one will be wrapped in an mbuf and chained together. This gets messy fast.

At 100 Gbps, we’re moving about 12.5 GB/s of 4K pages through our system unencrypted. Adding encryption doubles that to 25 GB/s worth of 4K pages. That’s about 6.25 Million mbufs per second. When you add in the extra 2 mbufs used by the crypto code for TLS metadata at the beginning and end of each TLS record, that works out to another 1.6M mbufs/sec, for a total of about 8M mbufs/second. With roughly 2 cache line accesses per mbuf, that’s 128 bytes * 8M, which is 1 GB/s (8 Gbps) of data that is accessed at multiple layers of the stack (alloc, free, crypto, TCP, socket buffers, drivers, etc).

To reduce the number of mbufs in transit, we decided to augment mbufs to allow carrying several pages of the same type in a single mbuf. We designed a new type of mbuf that could carry up to 24 pages for sendfile, and which could also carry the TLS header and trailing information in-line (reducing a TLS record from 6 mbufs down to 1). That change reduced the above 8M mbufs/sec down to less than 1M mbufs/sec. This resulted in a speed up of roughly 7 Gbps.
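Purely as an illustration of the layout change (this is not FreeBSD's actual unmapped-mbuf structure, and the sizes are arbitrary), the new carrier can be pictured like this:

#include <stdint.h>

#define MBUF_PAGE_SLOTS 24

/* Hypothetical layout: one descriptor for a run of up to 24 file pages plus
 * the TLS record header and trailer held in-line, replacing a chain of
 * 24+ single-page mbufs and 2 metadata mbufs per TLS record. */
struct mbuf_page_array {
    int      npages;                  /* pages in use, <= MBUF_PAGE_SLOTS */
    void    *pages[MBUF_PAGE_SLOTS];  /* references to the 4K pages being sent */
    uint16_t hdr_len, trl_len;        /* in-line TLS header/trailer lengths */
    char     hdr[64];                 /* TLS record header bytes */
    char     trl[64];                 /* TLS record trailer (MAC, padding) */
};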

This was not without some challenges. Most specifically, FreeBSD’s network stack was designed to assume that it can directly access any part of an mbuf using the mtod() (mbuf to data) macro. Given that we’re carrying the pages unmapped, any mtod() access will panic the system. We had to augment quite a few functions in the network stack to use accessor functions to access the mbufs, teach the DMA mapping system (busdma) about our new mbuf type, and write several accessors for copying mbufs into uios, etc. We also had to examine every NIC driver in use at Netflix and verify that they were using busdma for DMA mappings, and not accessing parts of mbufs using mtod(). At this point, we have the new mbufs enabled for most of our fleet, with the exception of a few very old storage platforms which are disk, and not CPU, limited.

Are We There Yet?

At this point, we’re able to serve 100% TLS traffic comfortably at 90 Gbps using the default FreeBSD TCP stack. However, the goalposts keep moving. We’ve found that when we use more advanced TCP algorithms, such as RACK and BBR, we are still a bit short of our goal. We have several ideas that we are currently pursuing, which range from optimizing the new TCP code to increasing the efficiency of LRO to trying to do encryption closer to the transfer of the data (either from the disk, or to the NIC) so as to take better advantage of Intel’s DDIO and save memory bandwidth.


Serving 100 Gbps from an Open Connect Appliance was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Flameeyes · 18 days ago · Dublin, Ireland