All things ZFS and FreeBSD with Allan Jude

Podcast Host

Join Viktor, a proud nerd and seasoned entrepreneur, whose academic journey at Santa Clara University in Silicon Valley sparked a career marked by innovation and foresight. From his college days, Viktor embarked on an entrepreneurial path, beginning with YippieMove, a groundbreaking email migration service, and continuing with a series of bootstrapped ventures.

01 DEC • 2024 1 hour 18 mins

In this episode, I’m joined by Allan Jude, a distinguished FreeBSD developer and ZFS expert, to explore the fascinating world of advanced storage systems and operating system architecture. Allan’s extensive contributions to both FreeBSD and ZFS offer unique insights into how these technologies shape modern infrastructure.

We start with ZFS’s architectural foundations. What particularly caught my attention was Allan’s explanation of how copy-on-write mechanisms transform data integrity and storage management. His breakdown of ZFS’s self-healing capabilities and data verification approaches reveals why it remains crucial for enterprise storage solutions.

The conversation gets especially interesting when we dive into FreeBSD’s networking stack. Allan shares insights into why major technology companies trust FreeBSD for their mission-critical operations, backing up the discussion with real-world examples from his extensive experience. His practical deployment strategies bridge the gap between theoretical knowledge and real-world applications.

I was particularly intrigued by our discussion of optimizing ZFS configurations and managing storage pools effectively. Allan’s perspective on leveraging FreeBSD’s security features and his thoughts on the future of storage systems and operating system development show just how much innovation is happening in this space.

If you’re interested in storage infrastructure, operating system internals, or enterprise systems, you’ll find plenty of practical insights here. Allan brings both deep technical knowledge and years of hands-on experience to the discussion, making complex storage and OS concepts accessible while maintaining their technical depth.

Transcript

[00:03] Viktor Petersson

Welcome back to another episode of Nerding Out with Viktor.

[00:06] Viktor Petersson

Today we're going to go deep into the world of ZFS and FreeBSD with Allan Jude.

[00:11] Viktor Petersson

Welcome to the show, Allan.

[00:12] Allan Jude

Hello.

[00:13] Allan Jude

Thank you.

[00:14] Viktor Petersson

So you've been around the FreeBSD world for a long time, and as I mentioned before we hit the record button, I feel like FreeBSD is not really getting the attention it deserves.

[00:29] Viktor Petersson

So I guess for the people in the audience who are not even familiar with FreeBSD and the BSD family, maybe can you do a quick intro to what sets it apart and just some of the backstory?

[00:41] Allan Jude

Yeah, I guess we can start with the initial part of the backstory.

[00:44] Allan Jude

So UNIX was originally developed at Bell Labs, which was the research arm of AT&T, the phone company, the only phone company in the US back then.

[00:55] Allan Jude

And so they developed this operating system basically to be able to build things and also to convince management to buy them a big enough computer, also to make printed manuals and so on to do actual work for the phone company.

[01:12] Allan Jude

But that really was just a side effect of getting a computer they could use to actually write computer programs, a more comfortable environment.

[01:21] Allan Jude

And so they built this, and as it started going out, they found that they wanted people to know how to use it and to find new things to do with it.

[01:32] Allan Jude

So they licensed copies to a bunch of universities.

[01:36] Allan Jude

And back then, computer programs weren't really the same, and there was no standard architecture.

[01:43] Allan Jude

Right.

[01:44] Allan Jude

Every computer was a completely different computer that basically spoke a different language.

[01:49] Allan Jude

And so you had to write the operating system and the programs in the assembly dialect of that machine, which is completely different from some other machine.

[02:01] Allan Jude

And that's what led to the invention of the C programming language, so that we can write the code once and compile it for multiple different machines.

[02:11] Allan Jude

Anyway, one of the universities that got a copy of the original Research Unix, as it was called, was the University of California at Berkeley.

[02:20] Allan Jude

And they started adding things to it and changing it to be useful for them.

[02:25] Allan Jude

And eventually what they did was they would send copies of it on a tape because floppy disks hadn't been invented yet.

[02:35] Allan Jude

So they'd mail this tape to other universities and other people who would take it and they would change it and they would send back some of their changes.

[02:43] Allan Jude

Of course, you know, we didn't have Git or tools for managing patches.

[02:47] Allan Jude

So it was a little crazy back then.

[02:50] Allan Jude

But eventually some of those changes made it back to Berkeley and they got incorporated and went out to other people on the next tape.

[02:55] Allan Jude

And that was kind of the precursor to open source.

[02:59] Allan Jude

No one had thought up licensing or text to put on it yet.

[03:03] Allan Jude

It was just, you know, this is the program and you use it.

[03:06] Allan Jude

And of course every program comes with all the source code and you can do whatever you want with it.

[03:12] Allan Jude

But then it turns out, you know, AT&T had ideas about this.

[03:19] Allan Jude

But that went on for a while and really got interesting.

[03:22] Allan Jude

And eventually they added things to it, including the TCP/IP stack, which was the invention of the Internet.

[03:29] Allan Jude

And there's some great stories about that.

[03:32] Allan Jude

Kirk McKusick, one of the people who wrote the file system for that original Unix, or the original BSD Unix, has a history series on DVD that talks about some of the really interesting things around the invention of TCP/IP.

[03:46] Allan Jude

And there was like a proprietary version and then the BSD version.

[03:50] Allan Jude

And the BSD version was faster at some parts, but worse at other parts.

[03:56] Allan Jude

And the way the story went was the proprietary one was faster, but it would crash.

[04:02] Allan Jude

And in the time while it was rebooting, after it crashed, the BSD one would catch up.

[04:06] Allan Jude

So if you transferred a big enough file, you would still be faster on the BSD one, and things like that.

[04:13] Viktor Petersson

I didn't even know that.

[04:14] Viktor Petersson

Okay, Yeah, I didn't know that.

[04:15] Viktor Petersson

I didn't even know that backstory of the TCP/IP stacks, actually.

[04:18] Allan Jude

Yeah, that was, I think BBN it was called.

[04:21] Allan Jude

It was like way back at the very beginning, like when the Internet was only for the government.

[04:25] Allan Jude

It wasn't the open Internet yet.

[04:27] Viktor Petersson

Right, right.

[04:28] Allan Jude

It's probably still the ARPANET.

[04:29] Allan Jude

Yes, and things like that.

[04:31] Allan Jude

Anyway, that's a great lecture that Kirk gives that you can catch from other conferences or from the history DVD.

[04:40] Allan Jude

But eventually they kind of finished up the last version of that and a company started called BSDI that was going to make a version of this and actually sell it.

[04:53] Allan Jude

And so they did the port to what was the 386 and made this BSD/OS and they were going to sell it to people.

[05:01] Allan Jude

And they were doing a pretty brisk business at the time because, you know, your only other option was Windows, or it probably wasn't even Windows yet.

[05:11] Allan Jude

Like, yeah, early versions of Windows or like SunOS, which was expensive and came with special hardware, whereas the 386 was a cheap machine you could just buy.

[05:20] Viktor Petersson

Right.

[05:23] Allan Jude

But they made the probably ill-advised decision to have their phone number to order this software be 1-800-ITS-UNIX.

[05:32] Allan Jude

Whereas Unix was a trademark of AT&T.

[05:36] Allan Jude

And AT&T is like, yeah, no. Also, your code probably contains, you know, our code, which, you know, we charge $1,000 a license for at least, if not a lot more.

[05:50] Allan Jude

And you know, you're selling your software to other people for less than that, so you're not paying us.

[05:55] Allan Jude

How does that work?

[05:56] Allan Jude

And this led to the famous AT&T USL lawsuit.

[06:02] Allan Jude

And it turned out there were maybe seven files that needed to be rewritten, but not really.

[06:08] Allan Jude

I don't think the final details ever really got fully disclosed.

[06:12] Viktor Petersson

Right.

[06:13] Allan Jude

But it slowed BSD down, especially the adoption of it, at just the wrong time, as it turns out.

[06:21] Allan Jude

And so in Finland this student named Linus was like, you know, I want a Unix like operating system and oh, I can't use BSD because it's tied up in this lawsuit, so I'll have to build my own.

[06:36] Allan Jude

And that was the start of Linux.

[06:37] Allan Jude

And like he said, if it hadn't been for the AT&T lawsuit, he would have never bothered making Linux and we would just have BSD instead.

[06:46] Viktor Petersson

It's a crazy parallel universe, right?

[06:49] Allan Jude

Yeah, yeah, that's a really interesting parallel universe.

[06:53] Allan Jude

Like how that would have affected the evolution of the UNIX wars and what would have happened where and how different life might be just because of the licensing.

[07:04] Allan Jude

So BSD and similar stuff like the MIT license, ISC license, basically the whole license is a couple of sentences that say, you can use this code for whatever you want, but don't take our name off of it and don't claim you were the one that wrote it.

[07:20] Allan Jude

And so the Sony PlayStation 4 and 5 are based on FreeBSD.

[07:26] Allan Jude

And so at the back of the manual there's a bunch of pages of copyright notices saying, this contains code from this person and this person and so on.

[07:36] Allan Jude

And that's the extent of Sony's obligation, right?

[07:41] Allan Jude

Well, many of the companies that use this BSD software do give back because there's an advantage to doing so.

[07:47] Allan Jude

You know, if you're taking FreeBSD and building something on top of it, you're making some changes.

[07:52] Allan Jude

Some of those changes will be your intellectual property or the unique selling feature of your product, and you want to keep those secret, but there are other parts and smaller features and so on that aren't secret but will be work for you.

[08:06] Allan Jude

Every time you upgrade to a newer version, you will have to reintegrate those, and that's a lot of work.

[08:11] Allan Jude

If you can contribute those back, they become part of the main line and everybody collectively maintains them, then it's that much easier for you to update in the future.

[08:21] Allan Jude

And really I think the thing that sells it for me the most is the thought of if I buy a washing machine that has the ability to send a push notification to my cell phone when the laundry is done, I don't want it to be using some proprietary or hand built network stack because the people who wrote the software on it didn't want to give back to the gpl which has this viral clause that means any code that touches this has to become open.

[08:52] Allan Jude

So if they're going to use some closed source thing that they bought or write their own thing to avoid that, that's probably going to lead to a bad time.

[09:01] Allan Jude

Whereas if they can have this standard reference implementation that is under a liberal license, then we're probably all in better shape.

[09:10] Viktor Petersson

Yeah.

[09:11] Viktor Petersson

So that takes us to the modern BSD landscape, I guess, where you have, well, I guess three major flavors of BSD: Open, Free, Net.

[09:22] Viktor Petersson

I guess they're all.

[09:23] Viktor Petersson

They are the main ones.

[09:24] Viktor Petersson

There is.

[09:24] Viktor Petersson

There are some derivatives of those I think as well.

[09:27] Viktor Petersson

But those are the main ones today.

[09:28] Allan Jude

Right.

[09:30] Viktor Petersson

Can you speak a bit?

[09:31] Viktor Petersson

Just compare and contrast like real quickly, like for those who.

[09:34] Viktor Petersson

Those who are new to BSD, just so they understand what the strengths of each of the three are.

[09:40] Allan Jude

Yeah.

[09:41] Allan Jude

So back in the early 90s when, after the BSD USL lawsuit was settled and it became possible, these different versions of BSD came out and the first two were NetBSD and FreeBSD.

[09:56] Allan Jude

And NetBSD wanted to focus on portability, running on all those different types of computers.

[10:03] Allan Jude

Especially at the time there were, you know, in the early 90s there were still a lot of these older big iron machines around and companies maybe weren't still producing new software for them.

[10:15] Allan Jude

And there were these ideas of other CPU architectures other than the X86.

[10:21] Allan Jude

And NetBSD really wanted to focus on, you know, when a new thing comes out, we can port to it and be running on it really quickly because they saw that as being the future of computers, not this 386, 486 type thing.

[10:35] Allan Jude

So their focus is always on this portability and being able to run on all the different types of CPUs at the same time.

[10:43] Allan Jude

With FreeBSD the focus was we have this one main architecture, the 386 and eventually the x86.

[10:51] Allan Jude

And we really want to be able to use that as a desktop and a server.

[10:56] Allan Jude

And so that was their focus.

[10:58] Allan Jude

After a couple of years, one of the NetBSD developers had a falling out with the other developers and wanted to do something quite a bit different.

[11:07] Allan Jude

And so they forked off from NetBSD and created OpenBSD.

[11:11] Allan Jude

The main goals there were that the repository, not just the source code, but the repository and the history, would be open to the public, and then to have a really big focus on security.

[11:23] Allan Jude

And it's come to the point where OpenBSD is mostly kind of like the original Unix, an environment for the developers to do things, whether that's their everyday things or building and kind of this incubator for all these ideas of different ways to do security.

[11:41] Allan Jude

So on top of their other enhancements, they've done the idea of privilege separation.

[11:47] Allan Jude

So one of the things that's kind of maintained inside the OpenBSD project is OpenSSH, the SSH server that's used in every operating system.

[11:56] Allan Jude

And originally that had this one bigger binary that ran as root and then could spawn, you know, every time you logged in.

[12:04] Allan Jude

But that's now been split up into separate tools so that if one gets compromised, it's running without the same level of privileges. You know, any risky processing is done as a less privileged user that can only communicate over interprocess communication with the main process that has all the privileges, so there's less chance of it being exploited.
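The privilege-separation pattern described here can be sketched in a few lines of Python. This is an illustration only and not how OpenSSH itself is written: `parse_untrusted`, `run_privsep`, and the `drop_to_uid` parameter are hypothetical names, and the child only drops privileges when a caller running as root passes in an unprivileged UID.

```python
import os
import socket
from typing import Optional


def parse_untrusted(data: bytes) -> bytes:
    """Hypothetical stand-in for the risky work, such as parsing network input."""
    return data.strip().upper()


def run_privsep(request: bytes, drop_to_uid: Optional[int] = None) -> bytes:
    """Process `request` in a child process and return its reply over a socket pair."""
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: does the risky processing and talks to the parent only via IPC.
        status = 1
        try:
            parent_sock.close()
            if drop_to_uid is not None:
                # Assumption: the caller is root and this UID/GID is unprivileged.
                os.setgroups([])
                os.setgid(drop_to_uid)
                os.setuid(drop_to_uid)
            child_sock.sendall(parse_untrusted(child_sock.recv(4096)))
            status = 0
        finally:
            os._exit(status)  # never fall through into the parent's code path
    # Parent: keeps its privileges but never parses the raw input itself.
    child_sock.close()
    parent_sock.sendall(request)
    reply = parent_sock.recv(4096)
    parent_sock.close()
    os.waitpid(pid, 0)
    return reply


if __name__ == "__main__":
    print(run_privsep(b"  hello from the network \n"))
```

If the parsing code is compromised, the attacker lands in the child process, which under the stated assumption holds no special privileges and can only send bytes back over the socket pair.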

[12:28] Viktor Petersson

Right.

[12:29] Allan Jude

And then they've taken that idea further and further as they've gone.

[12:33] Allan Jude

But one of the other really interesting ones they do is they actually relink.

[12:38] Allan Jude

So not compiling, but assembling the bits together, the kernel every time it reboots.

[12:45] Allan Jude

So as it's booting up, it makes a new kernel for next time, basically.

[12:49] Allan Jude

And it puts all the pieces together in a random order so that no machine will be exactly the same for an exploit to be able to know where it's going to find like the.

[13:00] Viktor Petersson

Certain functions for memory injections or for memory attacks, essentially.

[13:04] Allan Jude

Yeah.

[13:04] Allan Jude

For basically any kind of memory attack.

[13:06] Allan Jude

Where you're going to try to find these gadgets to string together, do an exploit.

[13:10] Allan Jude

Every time this machine reboots, it'll be different.

[13:12] Allan Jude

And that means every machine will be different and you won't be able to have the same kind of.

[13:16] Allan Jude

It'll make the attacker's life that much harder.
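As a toy model of why a random link order helps, the sketch below shuffles a handful of object files and computes where each one would start. The object names and sizes are invented for illustration, and a real kernel relink is done by the linker, not by code like this.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical object files and their sizes in bytes; a real kernel has far more.
OBJECTS: Dict[str, int] = {"vfs.o": 4096, "net.o": 8192, "sched.o": 2048, "tty.o": 1024}


def random_layout(objects: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (name, start offset) pairs for a freshly shuffled link order."""
    order = list(objects)
    random.SystemRandom().shuffle(order)  # OS entropy rather than a fixed seed
    layout = []
    offset = 0
    for name in order:
        layout.append((name, offset))
        offset += objects[name]
    return layout


if __name__ == "__main__":
    # Each "boot" will usually place the same code at different offsets.
    for boot in range(3):
        print(f"boot {boot}:", random_layout(OBJECTS))
```

An exploit that hard-codes the offset of a function inside `net.o` only works on machines that happened to get the same ordering.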

[13:19] Viktor Petersson

Right.

[13:20] Allan Jude

And they do a lot of interesting work like that.

[13:22] Allan Jude

Yeah.

[13:23] Viktor Petersson

I love the slogan for OpenBSD was no default vulnerabilities for like 10 years or whatever it was, or 15 years, whatever the number is today.

[13:31] Allan Jude

Right.

[13:31] Allan Jude

And then it became kind of "in a heck of a long time" when there was one.

[13:34] Allan Jude

The one time.

[13:35] Allan Jude

And.

[13:36] Viktor Petersson

Right.

[13:37] Allan Jude

They've had a really good track record on.

[13:39] Viktor Petersson

Yeah, it's pretty cool in terms of just to compare and contrast, do they share kernels or they completely.

[13:46] Allan Jude

They just.

[13:47] Viktor Petersson

Toolkit.

[13:48] Allan Jude

Yeah, so the kernels, I suppose technically were close to the same in the very early 90s, but they are very diverged.

[13:57] Allan Jude

Now one of the things that sets each of the three BSDs apart from something like a Linux is that each one is a complete operating system.

[14:07] Allan Jude

In that, you know, in one repository you have the kernel, the basic utilities, like ls, cat, grep, et cetera, and a bunch of the other pieces.

[14:19] Allan Jude

And all the drivers all live in one source code repository and can be built and run with just that.

[14:27] Allan Jude

And in fact, I'm pretty sure in all three of them a bunch of the stuff comes with the operating system and then the packages are off to the side, and by default there won't be any packages installed.

[14:41] Allan Jude

And then you can install whatever additions you need on top of the operating system.

[14:46] Allan Jude

Whereas almost every Linux distribution is, all right, we take a version of the kernel, who knows which one, and we combine it with this bunch of GNU coreutils and other things, and all of those are packages that get installed.

[15:00] Allan Jude

And so the concept of having an operating system with no packages in Linux doesn't really make any sense because all of the components of the operating system, including the kernel, are packages.

[15:10] Allan Jude

Whereas in the BSDs you have the operating system and then separately you have all the third-party packages.

[15:16] Viktor Petersson

And the package managers are rather sophisticated, like the ports from BSD.

[15:21] Viktor Petersson

I know that was overhauled a decade ago, right?

[15:24] Viktor Petersson

It used to be just tarballs and now that, yeah, they redesigned that, I think, what, 10 years ago.

[15:31] Viktor Petersson

I think.

[15:34] Allan Jude

In the very beginning FreeBSD added this concept of ports, which is basically a separate repository with a directory structure sorted into categories, and then each program has a directory and inside it there's a little Makefile, a little.

[15:49] Viktor Petersson

Oh, yes, sorry, yes, make.

[15:51] Allan Jude

And in there when you run the right command, it knows, hey, I need to go to this website, download the source code, check the checksum to make sure nobody's feeding me the wrong source code, extract it, apply the set of patches, maybe run the configure script, compile it, install it, and so on, and make sure it gets installed to the right place and that you have all the information you need to be able to uninstall it and all that.
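The checksum step in that sequence is what stops a tampered download from ever being extracted or compiled. A minimal sketch of the idea follows; `sha256_of` and `verify_distfile` are hypothetical helper names, and the real ports framework does this inside its own make machinery rather than with code like this.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str) -> str:
    """Hash a file in chunks so large source tarballs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_distfile(path: str, expected_sha256: str) -> None:
    """Raise ValueError unless the downloaded file matches the recorded checksum."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")


if __name__ == "__main__":
    # Stand-in for a downloaded source archive.
    fd, path = tempfile.mkstemp()
    os.write(fd, b"pretend this is a source tarball")
    os.close(fd)
    recorded = sha256_of(path)       # a port would ship this value alongside its Makefile
    verify_distfile(path, recorded)  # passes silently; only then would extraction begin
    print("checksum ok")
    os.remove(path)
```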

[16:17] Allan Jude

And so it did that for a long time and there were binary packages, but back in the early days, the way it worked Was all of those packages were built once at the day of the release and included on a second cd, but they were never updated.

[16:36] Allan Jude

So you had the version of the packages that shipped with the operating system when it came out, and if it was six months later, sorry, all you have is these old packages and you'd have to build your own from the ports tree and that was it.

[16:50] Allan Jude

And it was not ideal.

[16:53] Allan Jude

So yeah, I think around FreeBSD 8, and for reference, we're on version 14 right now.

[16:59] Allan Jude

With FreeBSD 8 we released a new package manager, which meant a tool called Poudriere would take that ports tree and recompile it every three or four days, because it takes that long to do the build, and post it.

[17:13] Allan Jude

And so you could just do pkg upgrade and it would download stuff, kind of the same as an APT or a Yum or whatever.

[17:19] Allan Jude

The big difference is that it gets rebuilt every three or four days.

[17:23] Allan Jude

So it's very fresh.

[17:26] Viktor Petersson

Yeah, because I remember one of the issues.

[17:29] Viktor Petersson

I used to run BSD, FreeBSD specifically, many years ago, and we ran it at somewhat small scale, I don't know, 50 servers maybe, so smallish scale, but I remember doing those installations across a fleet of servers where you have to compile everything on every single server and then you have to.

[17:47] Viktor Petersson

There were some mechanisms for distributing packages, but it's very different from the Linux world, which is apt-get install and voila, that's all you have to care about.

[17:56] Allan Jude

Yeah.

[17:57] Allan Jude

And so eventually we added the concept of a quarterly branch.

[18:01] Allan Jude

So on top of those packages built every three or four days, there's a second repo also built every three or four days.

[18:07] Allan Jude

But the major versions of those packages only change once every three months.

[18:13] Allan Jude

So that you have the choice of basically a rolling release or a quarterly release that's just security fixes every day, but doesn't randomly change from version three to version four of a program in the middle of your cycle so that you can safely just run automated packages.

[18:33] Allan Jude

And the big thing is, with that system there's also the program called Poudriere available, which allows you to compile those packages yourself.

[18:43] Allan Jude

So if you have that fleet of 50 servers, the big thing that the ports tree let you do is there are configuration options for all these programs.

[18:52] Allan Jude

So you know, you can decide whether you want this program.

[18:56] Allan Jude

Like if you're installing WordPress, do you want it to use MySQL or Postgres as the database or something like that.

[19:03] Allan Jude

And there's all These knobs you can tweak, but the packages that come from upstream only have the defaults.

[19:09] Allan Jude

So if you want different options, then you can use Poudriere and build your own package repository that's going to have those options and be able to use it for all 50 of your servers.

[19:19] Viktor Petersson

Right.

[19:20] Viktor Petersson

Are those reproducible builds now?

[19:24] Allan Jude

Pretty close to it.

[19:25] Allan Jude

The operating system is fully reproducible and I think the packages are reproducible as well.

[19:31] Allan Jude

You have to, not necessarily by default.

[19:33] Allan Jude

You have to set a couple of options to enable it because there's some downsides to faking the date you set for everything.

[19:40] Allan Jude

But yeah, there's a lot of work that's gone into making that more reproducible so that you can take the same source code and the same tool set and build the identical CD for FreeBSD 13.3 to prove that what the FreeBSD release engineering team gave you came from this exact source code.
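One reason the build date has to be pinned is that timestamps end up inside the artifacts themselves. The sketch below is only an illustration using Python's `tarfile`, not the FreeBSD release tooling; `build_archive` and its `source_date_epoch` parameter are hypothetical names. It produces byte-identical archives when the file order and the timestamp are fixed.

```python
import hashlib
import io
import tarfile
from typing import Dict


def build_archive(files: Dict[str, bytes], source_date_epoch: int) -> bytes:
    """Pack files into a tar archive whose bytes depend only on the inputs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):          # fixed ordering
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = source_date_epoch  # pinned timestamp instead of "now"
            info.mode = 0o644
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


if __name__ == "__main__":
    files = {"bin/ls": b"fake binary", "etc/motd": b"welcome\n"}
    first = hashlib.sha256(build_archive(files, 1700000000)).hexdigest()
    second = hashlib.sha256(build_archive(files, 1700000000)).hexdigest()
    print(first == second)  # same inputs and same pinned date give the same hash
```

Anyone with the same inputs can rebuild the archive and compare hashes, which is the property that lets a third party check a published image against its source.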

[20:02] Viktor Petersson

Yeah, I mean that's super important.

[20:04] Viktor Petersson

So let's talk a bit about what BSD has actually been used for, FreeBSD in particular. Netflix is a massive FreeBSD shop, or at least was.

[20:12] Viktor Petersson

I'm not sure they still are.

[20:13] Viktor Petersson

I think they are still using it for the CDN.

[20:14] Allan Jude

Are they?

[20:15] Viktor Petersson

Are they still running their entire CDN on FreeBSD?

[20:18] Allan Jude

On FreeBSD. When you are browsing Netflix and picking a video, that all runs in Amazon on Linux, but as soon as the video starts playing, that's all coming from FreeBSD.

[20:30] Allan Jude

And so yeah, they built this thing called the Open Connect appliance, which was a customized server that they could send to Internet exchanges, but specifically to your ISP.

[20:42] Allan Jude

So when they first started they were using commercial CDNs, but that got too expensive, and it got to be that Netflix individually was such a high percentage of the Internet traffic that there weren't any more bandwidth providers available that weren't already busy with Netflix traffic.

[21:01] Allan Jude

And so to help ameliorate that, they came up with this concept of the Open Connect appliance, which they would basically send to your ISP and your ISP would install in their data center, so that the most popular videos and TV shows that you're watching would be coming from the box inside their network and wouldn't use any of your ISP's Internet traffic, saving your ISP a bunch of money and saving Netflix a bunch of money.

[21:23] Allan Jude

So it worked out pretty well, and Netflix didn't really invent that concept.

[21:27] Allan Jude

Akamai, one of the bigger CDNs, had come up with that before, but not to the same degree that Netflix did.

[21:34] Allan Jude

And so Netflix chose to use FreeBSD for that because of a couple of reasons.

[21:39] Allan Jude

One was the license, although it turns out they don't have any proprietary code that's actually in the operating system, so that doesn't matter so much for them.

[21:46] Allan Jude

But the big thing for them was the speed with which they could upstream stuff.

[21:51] Allan Jude

So when they made a change, they were able to contribute it back to FreeBSD and then have it included in the mainline within a couple of weeks.

[22:00] Allan Jude

And so they run the development version of FreeBSD, so not the releases, but actually the development version some number of weeks behind live.

[22:08] Allan Jude

I think it's like six weeks or something.

[22:12] Allan Jude

And so with that they're able to find problems, engineer a solution and get it upstream and then have it ship.

[22:19] Allan Jude

Whereas if they worked with Linux, it would be a lot harder to get a change into the kernel in general.

[22:26] Allan Jude

And if they did, by the time a distribution shipped it, it could be like three years.

[22:34] Allan Jude

They could do their own thing, but that'd be a lot more work.

[22:37] Allan Jude

And just the fact that FreeBSD has all the infrastructure to basically make your own distribution kind of included made it really easy for them, especially because they ended up making quite a few changes.

[22:49] Allan Jude

They worked with NIC vendors, including Mellanox, which is now Nvidia, and Chelsio, to design this concept of encryption offload.

[23:00] Allan Jude

So one thing that set BSD apart back in the early days of like FreeBSD 4 and the early Internet, like very early 2000s or late 90s, was it was really good at being an FTP server and a web server.

[23:16] Allan Jude

Part of that was this special system call it had called sendfile.

[23:21] Allan Jude

So normally, you know, we have a web server, back then it was mostly Apache running in user space and you know, a request comes in over the network and then we have to wake up the web server and say, hey, there's a new request, do you want to accept it?

[23:34] Allan Jude

Web server says, I'm not too busy, I'll accept it.

[23:36] Allan Jude

And it would go back to the kernel and then eventually a request comes in and it processes it and okay, I need to open this file.

[23:42] Allan Jude

And now I want to read some data from this file.

[23:44] Allan Jude

And now I'm going to ask the kernel, hey, could you send this to the user?

[23:48] Allan Jude

And then when that's done, tell me and I'll read another chunk of the file and then I'll send it to the kernel.

[23:53] Allan Jude

And it involves a lot of copying back and forth because you know, when the web server asks to read a file, it goes to the kernel, hey, could you read this file?

[24:02] Allan Jude

So the kernel reads it up and then has to copy it from the kernel to Apache.

[24:06] Allan Jude

And Apache doesn't look at it just immediately says, hey, kernel, this buffer you just gave me, could you write it back to this socket instead?

[24:15] Allan Jude

And it was all this back and forth.

[24:17] Allan Jude

So with sendfile, the web server or the FTP server can just say, hey, I have this socket and this file descriptor of a file I've opened.

[24:24] Allan Jude

Can you send this range of bytes in it to the socket for me and just tell me when you're done?

[24:30] Allan Jude

And so now, instead of having to cross that boundary back and forth all the time, you can just say, hey, kernel, just read from this file and write to the socket and do it until either there's an error or you finished the amount I asked you to do.

[24:43] Allan Jude

And that sped things up greatly.
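The difference between the two approaches can be sketched with Python's standard library, whose `socket.sendfile()` method uses the operating system's sendfile call where it is available and falls back to ordinary sends otherwise. The helper names below are hypothetical and the local socket pair only stands in for a real client connection.

```python
import os
import socket
import tempfile


def send_with_copy(sock: socket.socket, path: str, chunk: int = 4096) -> int:
    """Classic loop: read() into user space, then send() back into the kernel."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                return sent
            sock.sendall(data)
            sent += len(data)


def send_with_sendfile(sock: socket.socket, path: str) -> int:
    """Hand the open file and the socket to the kernel and let it move the bytes."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


def recv_exactly(sock: socket.socket, n: int) -> bytes:
    """Read n bytes from the socket, or fewer if the peer closes early."""
    chunks = []
    while n > 0:
        data = sock.recv(n)
        if not data:
            break
        chunks.append(data)
        n -= len(data)
    return b"".join(chunks)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.write(fd, b"v" * 16384)  # small enough to fit in the socket buffer
    os.close(fd)
    server, client = socket.socketpair()
    print(send_with_copy(server, path), len(recv_exactly(client, 16384)))
    print(send_with_sendfile(server, path), len(recv_exactly(client, 16384)))
    server.close()
    client.close()
    os.remove(path)
```

Both functions deliver the same bytes. The first crosses the user/kernel boundary twice per chunk, while the second makes a single request and leaves the copying to the kernel.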

[24:47] Allan Jude

But then HTTPS took off and now we need to encrypt all of that.

[24:53] Allan Jude

And so that meant we couldn't just have the kernel do it.

[24:55] Allan Jude

You had to still copy it out of the kernel to user space, then feed it through OpenSSL to encrypt it and then send it to the user, which involves copying it back into the kernel.

[25:06] Allan Jude

And this took away a lot of the performance that was available.

[25:09] Allan Jude

So they invented this concept of kernel TLS, where basically your web server, in modern days usually nginx, will accept that connection, negotiate the encryption with the user, and then once it's done the public key part, right, and proven the SSL certificate and all that, it will have a symmetric key, basically a bulk encryption key.

[25:30] Allan Jude

And it can now set that as a socket option on the socket and then do the sendfile system call.

[25:38] Allan Jude

And now the kernel has the key it needs to encrypt the data, and it will actually, through a kernel module of OpenSSL, encrypt the data and send it whenever you write it to the socket.

[25:48] Allan Jude

So then the sendfile system call works again and it allows all this offload so you don't have to copy back and forth.
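From an application's point of view, opting in can be as small as one flag on the TLS context. The sketch below assumes Python 3.12 or newer built against an OpenSSL that supports kernel TLS, where the `ssl.OP_ENABLE_KTLS` option exists; on older builds the `getattr` fallback makes it a no-op. The function names are hypothetical, and whether the kernel actually takes over encryption still depends on the platform and the negotiated cipher.

```python
import ssl


def make_ktls_context() -> ssl.SSLContext:
    """Create a server-side TLS context that asks OpenSSL to use kernel TLS if it can."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # OP_ENABLE_KTLS is only defined on newer Python/OpenSSL builds; 0 changes nothing.
    ctx.options |= getattr(ssl, "OP_ENABLE_KTLS", 0)
    return ctx


def ktls_requested(ctx: ssl.SSLContext) -> bool:
    """Report whether the kernel TLS option is set on this context."""
    flag = getattr(ssl, "OP_ENABLE_KTLS", 0)
    return bool(flag) and bool(ctx.options & flag)


if __name__ == "__main__":
    context = make_ktls_context()
    print("kernel TLS requested:", ktls_requested(context))
```

The handshake still happens in user space as described above. The option only covers what happens to the bulk encryption once the symmetric key has been negotiated.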

[25:54] Allan Jude

Which became an especially big deal after Spectre and Meltdown, where crossing that kernel/user space boundary required a bunch of extra steps and slowed things down a bit.

[26:06] Allan Jude

And so that made a big difference there.

[26:08] Allan Jude

But you're still doing the encryption on the CPU, which is fast, but not as fast as it could be.

[26:14] Allan Jude

So Netflix took it one step further after that, when they wanted even more performance out of each server, because there are dedicated chips on the NICs that can do the encryption.

[26:24] Allan Jude

So you still send that bulk encryption key to the kernel, but the kernel passes it on to the network card driver, and then it writes the data to the network card driver unencrypted.

[26:35] Allan Jude

And the NIC itself, in its own special CPU, will do the encryption and then send the packet on the network.

[26:44] Allan Jude

And so this allows you to offload all that CPU usage to a dedicated encryption chip on the NIC.

[26:50] Allan Jude

And this allowed Netflix to go from, you know, originally maybe being able to keep a 100 gigabit server busy by sending all this traffic.

[27:00] Allan Jude

Now they're up to doing 800 gigabits per second from a single-CPU server.

[27:07] Viktor Petersson

Wow, that's impressive.

[27:09] Viktor Petersson

Yeah.

[27:09] Viktor Petersson

I remember seeing a talk, I think it was EuroBSDcon I went to many years ago.

[27:14] Viktor Petersson

Netflix did.

[27:15] Viktor Petersson

I think it was.

[27:15] Viktor Petersson

Maybe it was around the time they announced these OpenConnect boxes.

[27:18] Viktor Petersson

And I remember them mentioning, because of the volume of traffic they're running through these boxes, the bugs that they discover are things that nobody else basically would have discovered in 10 years because of the sheer volume.

[27:31] Allan Jude

Yeah, I remember there was one early on in the IPv6 stack, which was basically an overflow.

[27:37] Allan Jude

When you crossed 4 billion of something, it would go back to zero and it caused a problem.

[27:43] Allan Jude

And so it would crash the kernel, but in general, nobody was sending so much traffic that they would hit it in less than a month or two, even on a busy host, and probably even longer on another host.

[27:54] Allan Jude

And so it would happen infrequently enough that nobody really got overly bothered by it.

[27:59] Allan Jude

But when Comcast first switched to having IPv6 for all their customers, and so all those people suddenly switched from V4 to V6, going through Netflix, it would take out a Netflix box every couple of hours.
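
The "crossed 4 billion and went back to zero" behaviour is what an unsigned 32-bit counter does, and a few lines of Python show why traffic volume changes how often it bites. This is only a sketch: the conversation does not say which counter was involved, and the event rates used below are hypothetical numbers picked to land near "a month or two" and "every couple of hours".

```python
MASK_32 = 0xFFFFFFFF  # an unsigned 32-bit counter holds 0 .. 2**32 - 1


def bump(counter, amount=1):
    """Increment a 32-bit unsigned counter, wrapping the way C arithmetic does."""
    return (counter + amount) & MASK_32


def hours_until_wrap(events_per_second):
    """Hours for a counter starting at zero to see 2**32 events."""
    return (2 ** 32) / events_per_second / 3600


# Hypothetical rates, for illustration only.
ordinary_host_days = hours_until_wrap(1_000) / 24   # roughly 50 days
very_busy_host_hours = hours_until_wrap(500_000)    # a bit over 2 hours
```

The bug is identical in both cases; only the time between wraps differs, which is why a host pushing hundreds of times more traffic turns a rare crash into a recurring one.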

[28:13] Allan Jude

And so they quickly saw the pattern and were able to find the problem and fix it easily and upstream that into FreeBSD and have an errata notice so that everybody got the fix.

[28:22] Allan Jude

And it was really interesting just to see that kind of scaling issue.

[28:27] Viktor Petersson

Right, right.

[28:28] Viktor Petersson

All right.

[28:29] Viktor Petersson

And also there was a new announcement.

[28:32] Viktor Petersson

The German Sovereign Tech Fund decided to invest a pretty good chunk of change into the FreeBSD Foundation, which I'm sure was very welcome, to modernize some infrastructure.

[28:41] Viktor Petersson

Because how big is the team, like the FreeBSD active maintainers?

[28:45] Viktor Petersson

Ballpark.

[28:47] Allan Jude

One of the other big differences with FreeBSD is rather than on Linux, where there's Linus as the benevolent dictator for life, FreeBSD has a core team of nine people that's elected every two years by the developer base, and that entire developer base has write access to the repository, whereas in Linux it's Linus and a dozen lieutenants or so that actually have the ability to merge stuff and you just send them requests.

[29:19] Allan Jude

With FreeBSD there are 250ish people that can just directly commit stuff into the repositories.

[29:29] Allan Jude

And so anybody who's committed things in the last 365 days at the day of the election is allowed to vote.

[29:37] Allan Jude

And that decides the nine people that will kind of be in charge of the project for the next two years.

[29:43] Viktor Petersson

Interesting.

[29:44] Viktor Petersson

Okay, yeah.

[29:44] Viktor Petersson

Different governance model entirely then.

[29:46] Allan Jude

Yeah.

[29:47] Allan Jude

And so the Sovereign Tech Agency's Sovereign Tech Fund has invested money in FreeBSD to improve some of the infrastructure and try to burn down the bug pile and get things into a more sustainable setup.

[30:00] Allan Jude

And then there's also funding coming from the Alpha-Omega project to implement additional 2FA and a bunch of other security stuff, including an audit of FreeBSD's native hypervisor.

[30:15] Allan Jude

So rather than KVM that they have on Linux, FreeBSD has something called bhyve, which was originally developed at a vendor, and when they decided they weren't going to use it, they upstreamed it and it became part of FreeBSD.

[30:27] Viktor Petersson

Right, okay, very interesting.

[30:31] Viktor Petersson

All right, let's switch gears to the primary topic for this conversation, which is ZFS.

[30:37] Viktor Petersson

Obviously it's really important to give the backstory of FreeBSD, because that's very tightly coupled with ZFS, or at least in the history of it.

[30:48] Viktor Petersson

So you can run BSD.

[30:51] Viktor Petersson

Sorry, you can run ZFS on Linux these days as well.

[30:55] Viktor Petersson

And maybe we can start there.

[30:58] Viktor Petersson

ZFS on Linux and ZFS on BSD.

[31:01] Viktor Petersson

And BSD is slightly different.

[31:05] Viktor Petersson

Maybe talk a moment about that.

[31:06] Viktor Petersson

How they differ and why they differ.

[31:09] Allan Jude

Well, we can start back at the very quick intro in the beginning.

[31:14] Allan Jude

So Sun Microsystems started developing ZFS because they needed a new file system.

[31:19] Allan Jude

They had tried a couple of times to write new file systems, but they got the teams kind of too big, too quickly, and it got complicated and they ended up never finishing any of them.

[31:29] Allan Jude

So Jeff Bonwick handpicked a fresh graduate coming out of Brown University, and basically the two of them got locked in a room for a while with a whiteboard and came up with a whole new file system from scratch.

[31:44] Allan Jude

And especially trying to solve a lot of the problems that file systems had at the time.

[31:49] Allan Jude

Because if you think back to the early 2000s, most file systems had a limit on the biggest file you could have.

[31:56] Allan Jude

So depending if you're on Windows, it was like 2 gigabytes was the biggest file you could have.

[32:00] Allan Jude

And some of them you couldn't have a partition bigger than 4 gigabytes.

[32:05] Allan Jude

And weird things like this.

[32:07] Allan Jude

Even on Linux, with most of the file systems there was a limit to how big the biggest file could be, how many files you could have, how big the volume could be.

[32:16] Allan Jude

All these things were relatively tight limits that people were actually running into.

[32:21] Allan Jude

Most of them are now big enough that they're harder to run into.

[32:24] Allan Jude

But ZFS was designed with the idea that everything was dynamically allocated so you can't run out of inodes.

[32:31] Allan Jude

If you throw a billion files at it, you're not ever going to get an error.

[32:35] Allan Jude

Most file systems now still have a limit to the number of inodes.

[32:39] Allan Jude

Some of them can dynamically add more after, like you can run a command to adjust the limit and add more.

[32:46] Allan Jude

But they all do still have a limit, whereas ZFS just doesn't because it allocates them dynamically.

[32:51] Allan Jude

It's like you can have as many as you need until you run out of space, right?

[32:55] Allan Jude

But the other big thing was every other file system that exists right now.

[33:00] Allan Jude

A traditional file system assumes you have one disk and it only uses that one disk.

[33:06] Allan Jude

If you have multiple disks, you have to use a volume manager like md RAID on Linux, or a hardware RAID controller or something, to turn those multiple disks into one fake disk that you can then feed to the file system.

[33:18] Allan Jude

ZFS has that functionality built in so you can give it all your disks and it actually knows that they are individual disks and can do things with them.

[33:27] Allan Jude

Anyway, after a couple of years, Sun decided to open source that, because everything at Sun was open source, because they found that model was better and it was their competitive advantage over something like Windows.

[33:39] Allan Jude

And so when some FreeBSD developers saw that, they were like, well, you know, Solaris originally spawned out of BSD, and many of the concepts are not that different.

[33:49] Allan Jude

We'd be able to port this.

[33:51] Allan Jude

And so one developer in particular, Pawel Jakub Dawidek, ported ZFS to FreeBSD.

[33:57] Allan Jude

You know, he did most of the initial work in a kind of mad sprint over just less than a month, getting it from "it's not in FreeBSD" to

[34:07] Allan Jude

"you can actually mount it and read and write files" in a ridiculously short amount of time.

[34:15] Allan Jude

So he did that.

[34:16] Allan Jude

And ZFS started coming into FreeBSD and getting a lot of attention from that.

[34:21] Allan Jude

And that continued for a long time. When Oracle bought Sun, they closed off the source code.

[34:28] Allan Jude

So a project called illumos, which is a fork of the last open source version of Solaris, started, and FreeBSD switched to using that as where it would get newer ZFS features from and where it would contribute its changes back to.

[34:41] Allan Jude

But at that time, each version of ZFS was kind of unique to that operating system.

[34:46] Allan Jude

It was up to each operating system to kind of maintain their version and try to keep them in sync.

[34:51] Allan Jude

And then the ZFS on Linux project started around that time as well, having their first release a while after FreeBSD, but they've been working on it around the same time that FreeBSD started, and eventually that became the ZFS on Linux project and was also there on GitHub.

[35:10] Allan Jude

And as things changed and some of the companies that were using Solaris and illumos switched to using other things or just, you know, their time came to an end, fewer and fewer new features were being added in the illumos repo, and more of them were appearing in FreeBSD and Linux.

[35:28] Allan Jude

And then getting them back into illumos and then waiting to be able to pull them back in from there was causing delays, but also features were just being added in different orders and it was getting kind of complicated.

[35:40] Allan Jude

So that student I mentioned who originally helped design ZFS has now been doing it for over 20 years and is the leader of the project.

[35:50] Allan Jude

He came up with this concept of OpenZFS, where we would have one repo that was just the agnostic code, just the core of ZFS and not the integration with each operating system.

[36:02] Viktor Petersson

Right, okay.

[36:03] Allan Jude

And we could compile this in user space and run all the tests, and basically have a common code base that each operating system can pull from, add the glue for their operating system and go from there.

[36:15] Allan Jude

But that didn't really manage to go anywhere because that upstream repo by itself wasn't useful because it didn't have the integration for any operating system.

[36:24] Allan Jude

It was just the common code.

[36:27] Allan Jude

And so it meant somebody would have to do a lot of work to be able to actually get that running and build the infrastructure around it to test it, but to no direct value to them or anybody.

[36:39] Allan Jude

And so that concept was there and the repo existed and we'd push and pull code from it, but it was not really working as intended.

[36:50] Allan Jude

So especially as more development started coming in on the Linux side, things were getting done out of order a bit and it was just getting complicated. You know, eventually they had a bunch of features that FreeBSD didn't have.

[37:05] Allan Jude

But FreeBSD wanted to pull from illumos, but illumos didn't have those features yet.

[37:09] Allan Jude

It got complicated.

[37:10] Allan Jude

So we looked at the idea of having a real version of the OpenZFS repo, which would be one repository that includes the OS glue for all of the operating systems and means it would be one official upstream code repository that'd be the same on all the operating systems.

[37:28] Viktor Petersson

Right.

[37:29] Allan Jude

And so in time for FreeBSD 13.0, that project happened.

[37:34] Allan Jude

And so the OpenZFS repo you see on GitHub now actually is Linux's and FreeBSD's code mixed together. Basically, under the module subdirectory there's an os directory, with one for Linux and one for FreeBSD, and all the common code lives in module/zfs.

[37:51] Allan Jude

And then the operating systems have their special bits in those other directories and it means the code is literally exactly the same.

[37:58] Allan Jude

On FreeBSD and Linux it all comes from exactly the same source code.

[38:02] Allan Jude

There is one set of files that differs based on which operating system you're running, mostly just to integrate with those kernels, which are different, and to provide what we call the Solaris porting layer, which translates the common code's Solaris calls into Linux

[38:17] Allan Jude

or FreeBSD calls for things like allocating memory from the kernel, which is done differently.

[38:22] Allan Jude

Linux has the slab allocator and FreeBSD has UMA, the Universal Memory Allocator, and so on.

[38:29] Allan Jude

Right, but it means that we have the same code, and we added version numbers so you can actually see, you know, FreeBSD 14 and Ubuntu 24.04 have basically the same version of ZFS.

[38:41] Viktor Petersson

Right, but remind me, there was some issue that held back ZFS coming into mainline Linux for quite some time, and it was a license-related issue, if I'm not mistaken.

[38:51] Allan Jude

Right, and so technically that's still true.

[38:54] Allan Jude

ZFS is not included in mainline Linux.

[38:58] Allan Jude

So when Sun released ZFS, they did it under a license called the CDDL, the Common Development and Distribution License, which is basically identical to Mozilla's MPL that Firefox is released under.

[39:11] Allan Jude

So it says the source code is under this CDDL license, which is mostly liberal, it's not viral or anything like the GPL, but in particular it contains a clause that when you compile a binary out of it, you can license that binary however you like.

[39:29] Allan Jude

So you can take ZFS and compile a kernel module and license that module as GPL.

[39:37] Allan Jude

The problem is actually an incompatibility with the GPL, where if you make this GPL-licensed ZFS module and link it into your kernel, the GPL says that should make the source code for ZFS GPL-licensed.

[39:52] Allan Jude

And the CDDL doesn't let you do that, because that would not make any sense.

[39:57] Viktor Petersson

Right.

[39:58] Allan Jude

And so that's where the issues come from.

[40:03] Allan Jude

And so it's not included in upstream.

[40:07] Allan Jude

Kind of like, was it bcachefs, which was recently getting added to Linux, and it sounds like maybe it doesn't have a long life left after some arguments between the maintainers.

[40:19] Allan Jude

So ZFS is not included in mainline, but Ubuntu has started shipping it as part of their distro and it turns out nobody's suing them.

[40:28] Allan Jude

And so it has kind of become okay for Ubuntu to compile it for you and for you to load it as a module, whereas, you know, it was always okay for you to compile it yourself and load it on whatever version of Linux you wanted.

[40:43] Allan Jude

But that was a bit of a pain. With DKMS it's not really a pain anymore, but it's about being able to start to get closer to the integration you can see with FreeBSD, where our bootloader fully supports it and our installer knows all about it.

[40:59] Allan Jude

And we build features like boot environments on top of it, which is basically having multiple different root file systems, possibly based on snapshots of your root file system.

[41:08] Allan Jude

Meaning that if you install some new packages and it breaks something, you can just reboot and from your bootloader pick an older version of your root file system and be back to how your system was an hour ago and everything works again.

[41:22] Viktor Petersson

It's crazy.

[41:23] Viktor Petersson

Like one of the things, like I've been using ZFS for well over a decade, I think by now, in one capacity or another.

[41:29] Viktor Petersson

And the thing that always strikes me is this: the fact is, it's very old.

[41:34] Viktor Petersson

But it's still probably the most sophisticated file system out there in terms of feature sets.

[41:39] Viktor Petersson

And it just keeps chugging along and it's just so reliable, and has so many beautiful features that are not widely available in any other file system, really.

[41:49] Viktor Petersson

Like particularly about snapshots and all these things.

[41:52] Viktor Petersson

Right.

[41:53] Viktor Petersson

So let's talk about some ZFS fundamentals.

[41:56] Viktor Petersson

Like, you obviously are an authority on ZFS in your day job, and maybe for those not familiar with your day job, do a quick spiel about what you actually spend your day on and why you're spending your day doing ZFS work.

[42:09] Allan Jude

Yeah.

[42:10] Allan Jude

So back in 2018, I founded a company called Klara, which is klarasystems.com, and we provide professional development and support services around FreeBSD and ZFS, including ZFS on Linux.

[42:24] Allan Jude

And so customers come to us when they hit a bug in ZFS and need help with it.

[42:29] Allan Jude

We sell support subscriptions, and we develop new ZFS features.

[42:33] Allan Jude

So, for example, we developed a feature to be able to delegate a ZFS dataset into an LXD container on Ubuntu so that one of our customers could run unprivileged Docker inside a container, but using ZFS so that Docker would be able to use ZFS snapshots and so on.

[42:51] Allan Jude

So they take your pool in ZFS, and in ZFS you can have.

[42:56] Allan Jude

So it combines all those hard drives you have into one pool of storage, on which you can build multiple file systems that share the free space.

[43:05] Allan Jude

So you don't have the problem of partitioning you used to have.

[43:08] Allan Jude

If you had a 20 terabyte hard drive and you decide, okay, I'm going to split this into four 5 terabyte partitions and run four different workloads on them, and suddenly one of them is only using 1 TB and another one would like to use 6.

[43:20] Allan Jude

But now that's not where your partitions are.

[43:22] Allan Jude

And you have this problem of like, oh, I don't have enough space over here, but too much space over here.

[43:27] Allan Jude

By pooling your storage in ZFS, it means that you can just use all the space where you need it, and so you can create one of these virtual file systems and then give it to a container.

[43:37] Allan Jude

And inside that container, the fake root user can only see that one and none of the rest.

[43:43] Allan Jude

So they use this to run a CI system for many different customers.

[43:47] Allan Jude

So each Docker workload runs inside a container and can't see the other ones, but does have access to actually create and destroy snapshots and clones and do all the stuff Docker needs to do to run really quickly, taking advantage of the copy-on-write features of ZFS.

[44:03] Allan Jude

And so, yeah, they came to us needing this.

[44:06] Allan Jude

We developed it and then we also upstreamed it, so it's included by default in OpenZFS.

[44:12] Allan Jude

So then when Ubuntu 22.04 came out, it included that feature and they were able to just run stock Ubuntu and have it in production.

[44:20] Viktor Petersson

Very cool.

[44:21] Viktor Petersson

So let's talk a bit about fundamentals in ZFS. You kind of alluded to them already when you mentioned pools.

[44:28] Viktor Petersson

There are like three building blocks.

[44:30] Viktor Petersson

I guess they're vdevs, pools and datasets.

[44:34] Viktor Petersson

Right.

[44:34] Viktor Petersson

And let's speak a bit.

[44:36] Viktor Petersson

Well, maybe there's a fourth one that I'm missing, but let's speak about the fundamentals there.

[44:40] Viktor Petersson

So he's like, get big data way.

[44:43] Allan Jude

Yeah.

[44:43] Allan Jude

So like we said, ZFS is a volume manager, so it can also basically do the RAID for you.

[44:49] Allan Jude

And so a vdev is one of your kind of RAID components.

[44:54] Allan Jude

So you take all the hard drives you have and then you can combine them in groups called vdevs, and that vdev can have a transform on it that makes it do something.

[45:03] Allan Jude

So if you have no transform, you just have each disk as a separate vdev, you've basically done a RAID 0, and it means if you lose any drive, all your data is gone, and that's bad.

[45:14] Allan Jude

So you can do a mirror vdev, where you're going to have pairs of disks, and for every block you write to one disk,

[45:23] Allan Jude

the second disk in that mirror is going to have an exact copy of it.

[45:26] Allan Jude

And we'll come back to why ZFS does that better, using checksums in a minute.

[45:32] Allan Jude

But then it also has what it calls RAID-Z1, which is when you combine any number of drives in a group and it can withstand the loss of any one drive and keep working.

[45:43] Allan Jude

That's the same as a RAID 5, except it has one slight advantage.

[45:47] Allan Jude

So with RAID 5, there's this flaw called the write hole.

[45:53] Allan Jude

When you update something, it's going to write the new data to one hard drive, and then it has to write the updated parity to a different hard drive, so that if it loses a drive, using the parity and the other rows on the remaining hard drives, it can calculate

[46:09] Allan Jude

what data was on that missing drive.

[46:11] Allan Jude

Basically, it adds up the chunks of all of the drives together and gets a value and writes that in the parity.

[46:17] Allan Jude

And then if one drive is missing, it can take that thing, subtract what's on every other drive, and get back the value that would have been on the missing hard drive.

[46:26] Allan Jude

But because it updates the data on one hard drive and then updates the parity separately, if the power goes out between those two steps, then the parity is wrong, but the RAID controller doesn't know it.

[46:41] Allan Jude

And so when it boots up, it takes that parity, subtracts the remaining drives, and gets back the wrong answer.

[46:47] Allan Jude

And so now the block you just wrote isn't there. It's some gibberish, which is a combination of the new data and the old parity.
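
The parity arithmetic and the write hole can be shown in a few lines of Python. This sketch assumes XOR as the "add up" and "subtract" operation, which is how single-parity RAID is commonly described, and it uses a toy stripe of three data drives plus one parity drive. The names are made up for this example, and it models classic RAID 5 behaviour, not ZFS.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data drives and the parity computed from them.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Drive 1 dies: XOR the parity with the surviving drives to rebuild it.
rebuilt = xor_blocks([parity, data[0], data[2]])

# Write hole: new data reaches drive 0, then power is lost before the
# parity drive is updated, so the parity still describes the old stripe.
data[0] = b"ZZZZ"
stale_parity = parity

# If drive 1 fails now, rebuilding from the stale parity gives the wrong bytes.
corrupt = xor_blocks([stale_parity, data[0], data[2]])
```

Here `rebuilt` equals the original contents of drive 1, while `corrupt` does not, and nothing in the parity itself signals that anything is wrong.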

[46:55] Allan Jude

Right. Because ZFS is transactional, it doesn't have that problem.

[46:58] Allan Jude

And we'll get back to how that works in a second.

[47:01] Allan Jude

But it also has RAID-Z2, which means you combine a bunch of drives and any two drives can go missing and it works.

[47:07] Allan Jude

So that's like RAID 6.

[47:10] Allan Jude

That's available on some hardware controllers.

[47:12] Viktor Petersson

And how many drives do you need for all of these?

[47:15] Viktor Petersson

Well, vdevs, I guess.

[47:16] Viktor Petersson

How many vdevs do you need for each of these to work?

[47:18] Viktor Petersson

I mean, let's call a vdev a physical drive, I presume in most practical use cases.

[47:23] Allan Jude

So technically the vdev is generally the transform over a group of physical drives, right?

[47:27] Viktor Petersson

Right.

[47:28] Allan Jude

So for a RAID-Z1, technically you need at least two drives, one and the parity.

[47:35] Allan Jude

Although it doesn't make sense to do that with fewer than three, because you might as well just do a mirror if you only have two.

[47:40] Allan Jude

Yeah, and the same with RAID-Z2.

[47:44] Allan Jude

Technically you only need three drives, but it probably doesn't make sense with fewer than four.

[47:50] Allan Jude

One of the big additions that ZFS has is RAID-Z3, which allows you to group together some drives and lose any three of them and keep going.

[47:58] Allan Jude

That isn't something any hardware RAID controller I know of can actually do.

[48:03] Viktor Petersson

And the minimum for that one?

[48:05] Allan Jude

All you would need technically is four, but you'd probably want six or more for it to make sense, where you could lose half of them and keep going.

[48:14] Allan Jude

Because the big thing is, out of that set, whatever the number is, 1, 2 or 3, that many of the drives are basically not going to be used to store your data.

[48:24] Allan Jude

It's just going to store the parity to be able to reconstruct the data.

[48:27] Viktor Petersson

Yeah.

[48:27] Viktor Petersson

Okay.

[48:27] Allan Jude

So if you use four drives in a RAID-Z3, you're only going to get the space of one hard drive.
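
The rule of thumb here is that a RAID-Z vdev gives you roughly the space of its drives minus the parity level. A small Python sketch, with a made-up function name and ignoring padding and metadata overhead, which real pools also pay:

```python
def raidz_usable_drives(total_drives, parity):
    """Approximate number of drives' worth of data space in one RAID-Z vdev.

    `parity` is 1, 2 or 3 for RAID-Z1, RAID-Z2 or RAID-Z3. This ignores
    padding and metadata overhead, so treat the result as an upper bound.
    """
    if parity not in (1, 2, 3):
        raise ValueError("parity must be 1, 2 or 3")
    if total_drives <= parity:
        raise ValueError("need more drives than the parity level")
    return total_drives - parity
```

With these inputs, `raidz_usable_drives(4, 3)` is 1, matching the four-drive RAID-Z3 case above, and `raidz_usable_drives(6, 3)` is 3, so half of the raw capacity goes to parity.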

[48:32] Viktor Petersson

Right.

[48:34] Allan Jude

Another thing that's different than most hardware RAID is it doesn't have dedicated parity drives.

[48:38] Allan Jude

Right.

[48:39] Allan Jude

It's not going to be that, you know, if you have six drives in a RAID-Z3, three of them are only going to contain parity.

[48:44] Allan Jude

ZFS spreads the parity out between the drives so that you get more of the bandwidth of the drives when you're writing.

[48:53] Allan Jude

So that provides that.

[48:54] Allan Jude

And then.

[48:55] Allan Jude

So each vdev or group of drives is responsible for its integrity.

[48:59] Allan Jude

So it has its own RAID there.

[49:02] Allan Jude

But then we combine multiple vdevs if you have enough hard drives, and those are basically just striped together.

[49:08] Allan Jude

And so if you lose any one vdev, you've broken the whole pool.

[49:12] Allan Jude

But, you know, if you've used RAID-Z3 in each of your vdevs, then it'll be fine.

[49:17] Allan Jude

And so, you know, a big server I built many years ago is basically a petabyte of usable space made up of a bunch of vdevs of 12 terabyte hard drives in a RAID-Z3.

[49:30] Allan Jude

And so I think it was like 9 or 10 of these vdevs.

[49:34] Allan Jude

And that way you get all this space and, you know, as long as you don't lose more than three drives from any group of 12, then everything will be fine.
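
Both claims here, the rough petabyte and the "no more than three per group" rule, are easy to check with a short Python sketch. It assumes 12 drives per vdev (inferred from "any group of 12"), decimal terabytes, and no allowance for padding or metadata overhead. The function names are invented for the example.

```python
def pool_usable_tb(vdevs, drives_per_vdev, drive_tb, parity=3):
    """Approximate usable space of a pool built from identical RAID-Z vdevs."""
    return vdevs * (drives_per_vdev - parity) * drive_tb


def pool_survives(failed_per_vdev, parity=3):
    """Vdevs are striped together, so the pool survives only if every
    vdev has lost no more drives than its parity level."""
    return all(failed <= parity for failed in failed_per_vdev)
```

Under those assumptions, `pool_usable_tb(9, 12, 12)` is 972 and `pool_usable_tb(10, 12, 12)` is 1080, which brackets "basically a petabyte". `pool_survives([3, 0, 2, 3, 0, 0, 1, 0, 0])` is true even with nine failed drives, while `pool_survives([4, 0, 0, 0, 0, 0, 0, 0, 0])` is false with only four, because they all sit in the same vdev.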

[49:45] Allan Jude

And, you know, when a hard drive dies, you replace it with a new one and it rebuilds and you're okay.

[49:52] Viktor Petersson

Right.

[49:53] Allan Jude

So that's the vdev layer, and then above that you have the pool, which is basically giving you all the usable space of all those drives put together.

[50:02] Allan Jude

And it means that out of that you can create these virtual file systems called datasets.

[50:07] Allan Jude

And they're the same thing as kind of like an ext4 file system or whatever.

[50:13] Allan Jude

And you could have multiple of them, but each one will show up as a dynamic size where basically they are the amount of data they contain plus all the remaining free space in the pool.

[50:22] Viktor Petersson

Right, right.

[50:23] Allan Jude

So if you have five of these file systems, each one will say that it has 20 terabytes of space left.

[50:30] Allan Jude

But if you write to any one of them, the free space on all five of them will go down by the amount you wrote.

[50:35] Viktor Petersson

So kind of overprovisioning, in a sense.

[50:37] Allan Jude

Yeah, yeah, except for we're not actually lying and saying that we have more on any of them.

[50:42] Allan Jude

And as you write stuff to them, the free space on all of them will go down.

[50:46] Allan Jude

But as you delete something on one of them, the free space on all of them will go back up.
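
A toy model makes the shared free space easier to picture. This Python sketch is not how ZFS accounts for space internally. It ignores snapshots, quotas and reservations, and the `Pool` class and its method names are invented for illustration.

```python
class Pool:
    """Toy model of datasets sharing one pool's free space."""

    def __init__(self, capacity_tb):
        self.capacity_tb = capacity_tb
        self.used_tb = {}  # dataset name -> space that dataset is using

    def create(self, name):
        self.used_tb[name] = 0

    def free_tb(self):
        return self.capacity_tb - sum(self.used_tb.values())

    def available(self, name):
        # Every dataset reports the same number: the pool's free space.
        return self.free_tb()

    def size(self, name):
        # What a dataset looks like from outside: its own data plus the
        # free space it shares with every other dataset.
        return self.used_tb[name] + self.free_tb()

    def write(self, name, amount_tb):
        if amount_tb > self.free_tb():
            raise OSError("pool is out of space")
        self.used_tb[name] += amount_tb

    def delete(self, name, amount_tb):
        self.used_tb[name] -= min(amount_tb, self.used_tb[name])


pool = Pool(capacity_tb=20)
for dataset in ("a", "b", "c", "d", "e"):
    pool.create(dataset)
pool.write("a", 6)
after_write = [pool.available(d) for d in ("a", "b", "c", "d", "e")]
pool.delete("a", 2)
after_delete = pool.available("e")
```

After writing 6 TB to one dataset, all five report 14 TB available, and deleting 2 TB from that dataset raises the figure to 16 TB on every one of them.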

[50:50] Viktor Petersson

Right.

[50:51] Allan Jude

Unless you have snapshots, which we'll get to in a second.

[50:55] Allan Jude

And so that allows you to have different settings on each dataset as well.

[51:00] Allan Jude

So ZFS has a feature called transparent compression, where you can set compression on the dataset that contains my home directory, but not the dataset that contains my music or something.

[51:10] Viktor Petersson

Right.

[51:12] Allan Jude

And what this is, as you write the data to ZFS, it will use a fast compressor like LZ4 or Zstandard to shrink the size of those.

[51:22] Allan Jude

And then on disk, it'll actually store the compressed version, and then when you read the file back, it'll decompress it first before it gives it to you.

[51:28] Allan Jude

So the application doesn't have to know that there's compression happening, but the file system takes care of it for you.

[51:34] Allan Jude

And then, you know, the many, many copies of the ZFS source code I have on my server take a lot less space because source code is text and compresses really well.

[51:44] Viktor Petersson

Right.

[51:45] Viktor Petersson

And what's the overhead?

[51:47] Viktor Petersson

Let's talk about compression for a second, because I think that's a super interesting one.

[51:50] Viktor Petersson

Like what's the.

[51:51] Viktor Petersson

On a modern server, let's talk about Zstandard, like a basic compression, or the basic one, I think, in ZFS.

[51:59] Viktor Petersson

What's the overhead, CPU wise, I presume would be the big impact.

[52:03] Viktor Petersson

Right?

[52:04] Allan Jude

Right.

[52:04] Allan Jude

So that's the interesting part.

[52:05] Allan Jude

With LZ4, the overhead is quite minimal.

[52:09] Allan Jude

It can compress multiple gigabytes per second per core.

[52:13] Allan Jude

So as long as you have a couple of cores, you're probably going to run out of performance on your storage, even if it's NVMe, before you run out of CPU time.

[52:21] Viktor Petersson

Interesting.

[52:23] Allan Jude

And it can actually end up being faster, because if you have 100 megabytes of source code you're trying to write, and a regular spinning hard drive that can write 100 megabytes per second, then it would take you a second to save that 100 megabytes of source code.

[52:39] Allan Jude

If you can compress it at 2 gigabytes per second, so it takes a fraction of a second to compress it, and it compresses down to 50 megabytes, you write it to the hard drive and it only took half a second.

[52:53] Allan Jude

So it means technically, if you had more source code, you could write 200 megabytes a second to this hard drive if your compression ratio is 2x.

[53:01] Allan Jude

So in exchange for a little bit of CPU time, you could make your hard drive seem faster only if the data is actually compressible.

[53:08] Allan Jude

And the same thing applies when you read.

[53:10] Allan Jude

Now, when you read, you only have to read 50 megabytes in order to read all 100 megabytes of the data.

[53:16] Allan Jude

And LZ4 decompresses at over 10 gigabytes per second per core.
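
The arithmetic in this example can be written out as a short Python sketch. It uses the round numbers from the conversation (a 100 MB/s disk, a 2x ratio, 2 GB/s compression, 10 GB/s decompression) and assumes, pessimistically, that the codec and the disk run one after the other rather than overlapping. The function name is made up.

```python
def transfer_seconds(size_mb, disk_mb_s, ratio=1.0, codec_mb_s=None):
    """Seconds to move `size_mb` of logical data to or from a disk.

    `ratio` is the compression ratio (2.0 means the data halves in size).
    `codec_mb_s` is the compression or decompression speed, measured on
    the uncompressed data; None means no compression at all.
    """
    codec_time = size_mb / codec_mb_s if codec_mb_s else 0.0
    disk_time = (size_mb / ratio) / disk_mb_s
    return codec_time + disk_time


uncompressed_write = transfer_seconds(100, 100)
compressed_write = transfer_seconds(100, 100, ratio=2.0, codec_mb_s=2_000)
compressed_read = transfer_seconds(100, 100, ratio=2.0, codec_mb_s=10_000)
```

With these inputs the uncompressed write takes 1.0 s, the compressed write about 0.55 s (0.05 s of CPU plus 0.5 s of disk), and the compressed read about 0.51 s. If compression overlaps with the disk writes, the total approaches the 0.5 s, or 200 MB/s, quoted above.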

[53:21] Viktor Petersson

Right.

[53:22] Allan Jude

And so it can end up making a lot of workloads faster.
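
The arithmetic in this exchange can be written out as a small Python sketch. It uses only the figures quoted above (100 MB of data, a 100 MB/s disk, 2 GB/s compression, a 2x ratio); the function name and the assumption that compression and disk I/O run one after the other, rather than overlapping, are illustrative simplifications.

```python
def write_time_seconds(data_mb, disk_mb_per_s, compress_mb_per_s=None, ratio=1.0):
    """Estimate the time to write data_mb megabytes to a disk.

    If compress_mb_per_s is given, the data is first compressed at that
    speed and only data_mb / ratio megabytes reach the disk.
    Assumption: compression and disk I/O do not overlap (pessimistic).
    """
    if compress_mb_per_s is None:
        return data_mb / disk_mb_per_s
    compress_time = data_mb / compress_mb_per_s
    disk_time = (data_mb / ratio) / disk_mb_per_s
    return compress_time + disk_time


# Figures from the conversation: 100 MB of source code, 100 MB/s hard drive.
uncompressed = write_time_seconds(100, 100)
compressed = write_time_seconds(100, 100, compress_mb_per_s=2000, ratio=2.0)

print(f"uncompressed: {uncompressed:.2f} s")  # 1.00 s
print(f"compressed:   {compressed:.2f} s")    # 0.55 s
```

Even with the pessimistic no-overlap assumption, the compressed write finishes in roughly half the time, which is the effect described above.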

[53:26] Viktor Petersson

Interesting.

[53:26] Viktor Petersson

Are there any?

[53:28] Viktor Petersson

Because I believe compression is disabled by default.

[53:31] Allan Jude

So as of version I think 2.2 compression is on by default.

[53:35] Allan Jude

It's now to the point where it's so fast it doesn't have a penalty if you try to compress things like MP3s or video files that are incompressible, or even encrypted files.

[53:49] Allan Jude

Computers are so fast now that it basically doesn't have a high enough cost.

[53:53] Viktor Petersson

Right.

[53:54] Allan Jude

There's also a feature in ZFS called early abort, where it can notice:

[53:58] Allan Jude

I've been compressing this chunk for a little bit, and I've noticed that it's not compressing enough for it to end up small enough to matter.

[54:06] Allan Jude

Basically, if it becomes obvious that it's not going to save you at least 12%, we stop, don't finish trying, and just store it uncompressed.

[54:14] Allan Jude

Okay, so there's like.

[54:15] Viktor Petersson

Okay, interesting.

[54:16] Viktor Petersson

Yeah.

[54:16] Viktor Petersson

Because, I mean, if you are chucking compressed files at the compressor, that's not going to do much.

[54:21] Viktor Petersson

Yeah, that's just going to happen.

[54:22] Allan Jude

It's just going to use up a bunch of CPU time.

[54:24] Allan Jude

Now LZ4 is so cheap it doesn't make a big difference, but it will also notice and give up early in order to save some of that CPU time.
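
The early-abort idea can be sketched roughly as follows. This is not the ZFS implementation: it uses Python's `zlib` as a stand-in for LZ4 (which is not in the standard library), and the function name, chunk size, and streaming approach are assumptions made for illustration. Only the "at least 12%" savings figure comes from the conversation.

```python
import os
import zlib

MIN_SAVINGS = 0.12  # the "at least 12%" figure quoted in the conversation


def compress_or_give_up(block: bytes, chunk_size: int = 4096):
    """Return (stored_bytes, was_compressed).

    Feeds the block to the compressor chunk by chunk and stops as soon as
    the output is already too large to meet the savings target.
    zlib here is only a stand-in for LZ4.
    """
    budget = int(len(block) * (1 - MIN_SAVINGS))
    compressor = zlib.compressobj(level=1)
    out = bytearray()
    for start in range(0, len(block), chunk_size):
        out += compressor.compress(block[start:start + chunk_size])
        if len(out) > budget:
            return block, False  # give up early, store uncompressed
    out += compressor.flush()
    if len(out) > budget:
        return block, False
    return bytes(out), True


source_like = b"int main(void) { return 0; }\n" * 1000
random_like = os.urandom(128 * 1024)  # behaves like MP3s or encrypted data

stored_a, compressed_a = compress_or_give_up(source_like)
stored_b, compressed_b = compress_or_give_up(random_like)
print(compressed_a, len(stored_a) < len(source_like))
print(compressed_b, stored_b == random_like)
```

The repetitive input is stored compressed, while the random input is stored as-is because it cannot meet the savings target.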

[54:31] Viktor Petersson

Right, okay, so what you're saying is, essentially, there is no reason not to use it.

[54:36] Allan Jude

Exactly.

[54:37] Allan Jude

It's now to the point where you might as well just have LZ4 on everywhere and it will help you where it can and it won't make a big difference where it doesn't help.

[54:44] Viktor Petersson

Right.

[54:45] Viktor Petersson

I would imagine back in the early days that was not quite the case.

[54:48] Allan Jude

Right.

[54:48] Allan Jude

Well, especially before LZ4, there was basically only LZJB, which wasn't as good.

[54:57] Allan Jude

And gzip.

[54:59] Allan Jude

And gzip can compress better, but it is slow.

[55:03] Allan Jude

Like gzip will top out at like 50 or 100 megabytes per second per core.

[55:07] Allan Jude

And yes, with that you will run out of CPU before you run out of storage bandwidth.

[55:12] Allan Jude

So gzip will really hurt you as far as CPU usage.

[55:14] Allan Jude

Zstandard supports 19 different levels in ZFS, and so you can tune it from "only use a bit of the CPU" to "use all the CPU".

[55:24] Allan Jude

But if you turn it up to 19, it will only compress at, like, single-digit megabytes per second.

[55:30] Allan Jude

So you only want to use that one when it's like I'm writing this once, I know it's compressible and I'm just going to keep it forever.

[55:36] Allan Jude

And it'll be worth it to spend the CPU time to compress it that much.

[55:41] Viktor Petersson

Right.

[55:42] Allan Jude

But we also added a feature, I think in 2.2 or 2.3, where if you enable a really high Zstandard level, it will quickly try to compress it with LZ4 first, and if LZ4 couldn't compress it, we assume Zstandard won't be able to compress it and we just won't even try.

[55:59] Allan Jude

Whereas if it does compress it, then we'll let it.

[56:01] Allan Jude

And so it allows you to skip trying the heavy compression on files that are definitely not compressible, again to avoid the overhead and let you just kind of turn it on without having to worry that it's going to spend a lot of time trying to compress things that aren't important or not compressible.
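
A rough sketch of that "cheap probe before expensive compression" heuristic, in Python. The standard library has neither LZ4 nor Zstandard, so `zlib` at level 1 stands in for the fast probe and `lzma` for the expensive compressor; the function name and the reuse of the 12% threshold are assumptions, not details taken from ZFS.

```python
import lzma
import os
import zlib


def heavy_compress_if_promising(block: bytes, min_savings: float = 0.12):
    """Return (stored_bytes, how) where how is "heavy" or "stored".

    Step 1: a cheap probe (zlib level 1, standing in for LZ4).
    Step 2: only if the probe saved enough, run the expensive
            compressor (lzma, standing in for a high Zstandard level).
    """
    probe = zlib.compress(block, 1)
    if len(probe) > len(block) * (1 - min_savings):
        # The cheap compressor could not shrink it, so assume the
        # expensive one cannot either and skip it entirely.
        return block, "stored"
    return lzma.compress(block), "heavy"


text_block = b"the quick brown fox jumps over the lazy dog\n" * 2000
noise_block = os.urandom(64 * 1024)

out_text, how_text = heavy_compress_if_promising(text_block)
out_noise, how_noise = heavy_compress_if_promising(noise_block)
print(how_text, len(out_text))
print(how_noise, len(out_noise))
```

The design point is that the probe costs almost nothing compared with the heavy pass, so incompressible blocks are filtered out cheaply.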

[56:18] Viktor Petersson

Right.

[56:19] Viktor Petersson

Okay, so we talked about datasets.

[56:21] Viktor Petersson

One thing about datasets that I found important is that you can run encrypted datasets on top of a pool.

[56:27] Allan Jude

Right.

[56:27] Viktor Petersson

So you could have some non-encrypted and some encrypted. Maybe speak a bit about how that works, because that's a relatively unique feature of ZFS as well.

[56:35] Allan Jude

Yeah, so most block level systems have some kind of encryption.

[56:41] Allan Jude

So on Linux there's LUKS and on FreeBSD there's GELI.

[56:44] Allan Jude

And these use an algorithm called AES-XTS, which is whole-disk encryption, and they use a key and they encrypt the data on the disk.

[56:54] Allan Jude

With ZFS, we wanted more granularity.

[56:58] Allan Jude

So yeah, we have these multiple different datasets.

[57:00] Allan Jude

And maybe I want my operating system non encrypted so it's easier to boot from and so on, but I want my home directory encrypted.

[57:08] Allan Jude

And so it uses AES-GCM, which is the encryption normally used for, like, HTTPS, and it doesn't encrypt the structural information.

[57:20] Allan Jude

So, like, the name of the dataset and the ZFS-specific bits are not encrypted, because it needs to be able to work on those.

[57:26] Allan Jude

But basically the file names and the actual data in the files are all encrypted.

[57:33] Allan Jude

And so earlier we mentioned the checksum.

[57:37] Allan Jude

So ZFS stores a checksum of every block so they can verify that the block didn't get corrupted by your hard drive.

[57:43] Allan Jude

Or, as we mentioned, when you're doing a mirror in traditional mirroring with hardware RAID or md RAID and so on, if the two sides of the mirror don't match, there's no way to tell which one is right and which one's wrong.

[57:56] Allan Jude

Yeah, in ZFS, because we store a checksum separate from the data, we can read both pieces of data and see which one matches the checksum and know, oh, the SHA256 says copy two is the right one.

[58:06] Allan Jude

And so we'll repair copy one with the right copy.
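
As a minimal sketch of that self-healing read, assuming a two-way mirror held in memory and SHA-256 as the checksum (the algorithm named in the conversation); the function name and data layout are hypothetical and only illustrate the idea of a checksum stored apart from the data.

```python
import hashlib


def read_with_self_heal(copies, expected_checksum):
    """Return the good data and repair any mirror copy that disagrees.

    copies: list of bytearray objects, one per side of the mirror
            (mutated in place when repaired).
    expected_checksum: SHA-256 digest stored separately from the data.
    """
    good = None
    for copy in copies:
        if hashlib.sha256(copy).digest() == expected_checksum:
            good = bytes(copy)
            break
    if good is None:
        raise IOError("no mirror copy matches the stored checksum")
    for copy in copies:
        if bytes(copy) != good:
            copy[:] = good  # overwrite the bad side with the good data
    return good


original = b"important data"
stored_checksum = hashlib.sha256(original).digest()

copy_one = bytearray(b"important dbta")  # one flipped character
copy_two = bytearray(original)

result = read_with_self_heal([copy_one, copy_two], stored_checksum)
print(result == original, bytes(copy_one) == original)
```

Without the separately stored checksum there would be no way to decide which of the two copies to trust.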

[58:11] Allan Jude

When we encrypt files we split that checksum in half.

[58:14] Allan Jude

We keep the first half of the SHA256.

[58:17] Allan Jude

And in the second half, we keep the message authentication code from the encryption.

[58:23] Allan Jude

This allows us to verify both bits: the checksum half tells us that the raw encrypted data on disk is not corrupted.

[58:30] Allan Jude

And the MAC ensures that when we decrypt the data, we got the right data back and that it hasn't been tampered with.

[58:36] Allan Jude

This is unlike whole-disk encryption, where if you mount the file system, you have to have entered the decryption key, and now everything's decrypted.

[58:46] Allan Jude

With ZFS, if you have your home directory and somebody else's home directory, those are encrypted with different keys.

[58:52] Allan Jude

And when the other user is not logged in, they can unload their key and that data is actually at rest and encrypted and can't be accessed by anybody until they come and enter the key, maybe by logging in over SSH and using their passphrase.

[59:06] Allan Jude

And so that allows that data to actually be protected.

[59:08] Allan Jude

Whereas if you just used whole hard drive encryption, that really only protects you against somebody stealing the physical machine.

[59:14] Allan Jude

And then when they reboot and try to access it, they don't have the key and they can't decrypt the hard drive.

[59:18] Allan Jude

Whereas with ZFS, you can unload this key.

[59:21] Allan Jude

And now nobody can access this data without entering the key first.

[59:26] Allan Jude

So it allows you to keep the data at rest and have the protection from encryption without having to power off the server to get the protection.

[59:35] Viktor Petersson

Right.

[59:37] Allan Jude

But importantly, because we split that checksum in two, it means that we can still run the repair of a failed hard drive on the encrypted data without needing the encryption key.

[59:50] Viktor Petersson

Oh, right, okay.

[59:51] Allan Jude

And so it knows from the first half of the checksum that, hey, that block is not right.

[59:57] Allan Jude

The hard drive flipped a bit or something, or, you know, a drive failed and we put in a new one, and we can rebuild all the data and get the array back to healthy without ever having to have the encryption key.

[01:00:09] Allan Jude

So it means the storage administrator doesn't need all the encryption keys, which was a big advantage.
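
The split checksum can be illustrated with a short Python sketch. Two assumptions are worth stating loudly: the standard library has no AES-GCM, so an HMAC-SHA256 over the ciphertext stands in for the MAC that the encryption would produce, and the 16/16-byte split and function names are illustrative rather than the on-disk format.

```python
import hashlib
import hmac


def make_split_checksum(ciphertext: bytes, mac_key: bytes) -> bytes:
    """Build a 32-byte field: half plain checksum, half keyed MAC."""
    digest_half = hashlib.sha256(ciphertext).digest()[:16]
    # Stand-in for the AES-GCM authentication tag (assumption).
    mac_half = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()[:16]
    return digest_half + mac_half


def scrub_ok(ciphertext: bytes, stored: bytes) -> bool:
    """Integrity check that needs no key: usable for scrub and rebuild."""
    actual = hashlib.sha256(ciphertext).digest()[:16]
    return hmac.compare_digest(actual, stored[:16])


def authenticate_ok(ciphertext: bytes, stored: bytes, mac_key: bytes) -> bool:
    """Tamper check that does need the key."""
    expected = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()[:16]
    return hmac.compare_digest(expected, stored[16:])


user_key = b"hypothetical-user-key"
block = b"\x8f\x02 pretend this is encrypted data \x91\x17"
stored = make_split_checksum(block, user_key)

bit_flipped = bytes([block[0] ^ 0x01]) + block[1:]
print(scrub_ok(block, stored))                     # True, no key used
print(scrub_ok(bit_flipped, stored))               # False, corruption found
print(authenticate_ok(block, stored, user_key))    # True
print(authenticate_ok(block, stored, b"wrong"))    # False
```

The storage administrator only ever needs `scrub_ok`, which is why repair and rebuild can proceed while the key is unloaded.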

[01:00:14] Viktor Petersson

Right.

[01:00:15] Allan Jude

So a use case that came up with this for one of our customers being a law firm was, you know, we have all this discovery evidence from a case.

[01:00:22] Allan Jude

The case is now over.

[01:00:24] Allan Jude

We have to destroy that evidence and not have a copy of it anymore.

[01:00:28] Allan Jude

But, you know, we can't be sure that our flash drives are actually going to erase anything we erase.

[01:00:34] Allan Jude

So overwriting is not necessarily going to be good enough.

[01:00:36] Allan Jude

And if we use whole-disk encryption, we'd have to reformat the whole hard drive in order to actually ensure that it's gone.

[01:00:43] Allan Jude

But by having a different encryption key for each case's dataset, they can just unmount it and destroy the key.

[01:00:50] Allan Jude

That data is not recoverable now, and it makes their life much easier for doing that.

[01:00:55] Viktor Petersson

So you mentioned two types of, I guess, unlocking.

[01:00:59] Viktor Petersson

There's the key and there's the passphrase.

[01:01:02] Viktor Petersson

One thing that at least I've been bothered with, both in the Linux world and in the BSD world, I guess, is that in most other operating systems you can use a TPM to unlock and lock and do disk encryption.

[01:01:15] Viktor Petersson

That is not quite the case.

[01:01:17] Viktor Petersson

It's kind of possible on Linux.

[01:01:19] Viktor Petersson

I don't know how it is in the BSD world, but do you want to say a few words about the narrative around that?

[01:01:25] Viktor Petersson

Is that something that is being worked on, particularly for encryption, I guess?

[01:01:32] Allan Jude

I've not really looked at the full disk encryption stuff in a while because I've been more focused on ZFS.

[01:01:37] Allan Jude

But I did do support for the full disk encryption in FreeBSD's bootloader years ago.

[01:01:45] Allan Jude

So yeah, there is interest in doing the TPM.

[01:01:47] Allan Jude

So what ZFS has right now is you can load the key basically by typing it in, by having a path to a file, or by having a URL to, like, an API.

[01:01:58] Allan Jude

So you can actually have it so that, as the machines boot up, they call some other machine and prove, with a certificate or something, that, hey, I'm part of your network.

[01:02:08] Allan Jude

And that server says, okay, here's the key to decrypt that data.
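
The three ways of supplying a key map onto the `keylocation` values that OpenZFS accepts (`prompt`, a `file://` path, or an `http(s)://` URL). The Python sketch below only builds the `zfs load-key` and `zfs unload-key` command lines and does not run them; the dataset name and the key server URL are hypothetical.

```python
def load_key_command(dataset: str, keylocation: str = "prompt"):
    """Build the argv for loading an encryption key for a dataset.

    keylocation is "prompt", or a file://, https://, or http:// URI.
    """
    schemes = ("file://", "https://", "http://")
    if keylocation != "prompt" and not keylocation.startswith(schemes):
        raise ValueError(f"unsupported keylocation: {keylocation}")
    return ["zfs", "load-key", "-L", keylocation, dataset]


def unload_key_command(dataset: str):
    """Build the argv for unloading a key, leaving the data at rest."""
    return ["zfs", "unload-key", dataset]


# Hypothetical dataset and key server, for illustration only.
dataset = "tank/home/alice"
typed = load_key_command(dataset)
from_file = load_key_command(dataset, "file:///root/keys/alice.key")
from_api = load_key_command(dataset, "https://keys.example.internal/alice")
locked = unload_key_command(dataset)

for argv in (typed, from_file, from_api, locked):
    print(" ".join(argv))
```

A dataset has to be unmounted before its key can be unloaded, so a real script would run `zfs unmount` first.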

[01:02:11] Viktor Petersson

So you type it in.

[01:02:14] Allan Jude

Yeah, but we would definitely be interested in using the TPM to store the key material.

[01:02:22] Allan Jude

Just no one's come with that use case.

[01:02:24] Allan Jude

So if you would like ZFS whole-disk or dataset encryption to use the TPM, then definitely go to klarasystems.com and click on the ZFS button and tell us you want that feature, and we'll talk to you and definitely get that built, because there's some interest in it.

[01:02:40] Allan Jude

We just would need somebody who has a commercial interest in it.

[01:02:43] Viktor Petersson

Yeah, no, I had a more personal interest.

[01:02:46] Viktor Petersson

I use ZFS on just my home server, my Boxbox server.

[01:02:54] Viktor Petersson

And just when you reboot the server and you have to manually type in a passphrase and you don't have a KVM or something hooked up to it, that is a pain point, right?

[01:03:02] Allan Jude

Yeah, exactly.

[01:03:03] Allan Jude

And that's why some people use that HTTP thing to fetch the key from another machine that's going to have stayed up, hopefully, or whatever, some kind of zero trust environment where before you can have this encryption key, you have to prove that you're not a compromised machine and you're the machine that's supposed to have this decryption key and so on.

[01:03:21] Viktor Petersson

Yeah, exactly.

[01:03:23] Allan Jude

Cool.

[01:03:24] Viktor Petersson

We talked a bit about, or we kind of alluded to, scrubbing.

[01:03:27] Viktor Petersson

I guess maybe we should just knock that out as well: what scrubbing is and how it's different from other file systems.

[01:03:32] Allan Jude

Right.

[01:03:33] Allan Jude

So that kind of depends on checksumming.

[01:03:35] Allan Jude

So as ZFS writes any block of data you give it, it stores a checksum.

[01:03:41] Allan Jude

So in the indirect block where it actually says, you know, the fifth megabyte of that file is on this offset on disk, it has information about it like when we wrote it and the checksum.

[01:03:53] Allan Jude

And so a scrub is basically a patrol read where it goes through your whole disk and reads every block and checks that the checksum is still what it's supposed to be.

[01:04:02] Allan Jude

And so this allows it to detect bit flips on your hard drive, or bit rot, or any kind of corruption.

[01:04:08] Allan Jude

And so ZFS scans all the data and makes sure that the checksums are correct, and fixes them if they aren't.

[01:04:14] Allan Jude

Because if you have bitrot, it usually kind of.

[01:04:17] Allan Jude

It's creeping.

[01:04:18] Allan Jude

It starts small and gets bigger and bigger.

[01:04:20] Allan Jude

So catching it and fixing it early means that, you know, if you have a RAID-Z2 where you have, you know, eight hard drives and two of them are providing parity, when you find a bit flip and fix it, and then find the second bit flip and fix it, you had all the data you needed to reconstruct it.

[01:04:36] Allan Jude

Whereas eventually, if you had three bit flips, you wouldn't have enough data to fix it if all three were in the same file.

[01:04:43] Allan Jude

But if once a month you are reading all the data and making sure the checksum is correct, it means you fix each of those errors as they come up, and they don't accumulate to the point where you actually have data loss.

[01:04:54] Viktor Petersson

Right.

[01:04:55] Viktor Petersson

Yeah.

[01:04:55] Viktor Petersson

I had Bryan Cantrill from Oxide on the show a few episodes ago, and he was telling a story from illumos, when they were running that, about how they discovered driver issues and issues with I/O controllers, essentially, where, because of the intelligence of ZFS, they were able to pick up things that otherwise would have gone undiscovered.

[01:05:15] Viktor Petersson

Really?

[01:05:15] Allan Jude

Yeah.

[01:05:16] Allan Jude

And we've hit similar cases.

[01:05:17] Allan Jude

We had one customer using, like, an early version of an all-flash array, and occasionally ZFS was finding this corruption.

[01:05:24] Allan Jude

And it turned out there had been a firmware bug, and a rounding error or something meant that certain writes would get written to the wrong place.

[01:05:31] Allan Jude

So when you read from the right place, the data wasn't there.

[01:05:34] Allan Jude

And when you read what was supposed to be at the wrong place, it had been overwritten with this other data.

[01:05:40] Allan Jude

And ZFS was able to point it out and we eventually found the pattern and they were able to fix the firmware.

[01:05:45] Allan Jude

Nice.

[01:05:47] Viktor Petersson

Snapshotting is another thing you've hinted at.

[01:05:50] Allan Jude

Yes, the really interesting one.

[01:05:52] Allan Jude

So other file systems in the past have had snapshotting.

[01:05:55] Allan Jude

Like, if you think of it, even KVM with qcow2 has this concept of snapshotting.

[01:05:59] Allan Jude

But in other file systems, the way snapshotting worked is, you know, you have your normal file system, when you change a file, we just overwrite it in place.

[01:06:07] Allan Jude

But if snapshotting is enabled, then we'll detect that and we'll, oh, write the new change to a different place instead and keep both versions.

[01:06:14] Allan Jude

So that meant there was this cost.

[01:06:16] Allan Jude

So, like, if you use the snapshots in LVM, as soon as you have one snapshot, your performance gets cut, like, in half.

[01:06:21] Allan Jude

And then you have a second snapshot and it's half of that.

[01:06:24] Allan Jude

And so if you have like eight snapshots, you're at like, you know, 10% of your original performance.

[01:06:28] Allan Jude

It's pretty terrible.

[01:06:30] Allan Jude

With ZFS, it works differently.

[01:06:33] Allan Jude

Every time we make a change to a file, an existing file, we always write the blocks to a new place.

[01:06:39] Allan Jude

If you have no snapshots, the old place after a couple of seconds just becomes free space and can be reused later.

[01:06:45] Allan Jude

But if we have a snapshot, then we know to keep that data.

[01:06:50] Allan Jude

So basically I mentioned one of the bits of metadata we have along with the checksum is the time when the block was written.

[01:06:57] Allan Jude

And this is measured in transaction groups since the pool was created.

[01:07:00] Allan Jude

So when you create a snapshot, it's mostly just remembering the time of the snapshot.

[01:07:04] Allan Jude

And so when you overwrite some data, if the old block was written before that snapshot, then we need to keep it, and so it won't free it and it'll just keep going.

[01:07:15] Allan Jude

So it means when you're writing while having snapshots or reading, there's no extra work to do.

[01:07:20] Allan Jude

We're actually doing less work because we're not freeing that space because it's referenced by the snapshot.

[01:07:26] Allan Jude

Then later, when you delete the snapshot, it can go through and say: anything that's older than the now-oldest snapshot, we don't necessarily need that anymore.

[01:07:36] Allan Jude

And so it can go and free it and make the free space.
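
The snapshot mechanism described here, where each block carries a birth transaction group and a snapshot is mostly just a remembered transaction group number, can be modelled in a few lines of Python. This is a toy model under stated assumptions: one transaction group per write, a dictionary standing in for the live file system, and hypothetical class and attribute names.

```python
class ToyDataset:
    """Toy model: every write goes to a new place and is stamped with
    the transaction group (txg) in which it was born."""

    def __init__(self):
        self.txg = 1
        self.live = {}       # logical offset -> (birth_txg, data)
        self.snapshots = []  # txg numbers at which snapshots were taken
        self.freed = []      # old blocks returned to free space
        self.retained = []   # old blocks kept because a snapshot needs them

    def snapshot(self):
        """Taking a snapshot is just remembering the current txg."""
        self.snapshots.append(self.txg)
        return self.txg

    def write(self, offset, data):
        self.txg += 1  # assumption: one txg per write, for simplicity
        old = self.live.get(offset)
        self.live[offset] = (self.txg, data)
        if old is None:
            return
        birth_txg, _ = old
        latest_snapshot = max(self.snapshots, default=0)
        if birth_txg <= latest_snapshot:
            self.retained.append(old)  # born before the snapshot: keep it
        else:
            self.freed.append(old)     # no snapshot references it: free it


ds = ToyDataset()
ds.write(0, b"v1")   # born in txg 2
ds.snapshot()        # remembers txg 2
ds.write(0, b"v2")   # old block (txg 2) is kept for the snapshot
ds.write(0, b"v3")   # old block (txg 3) is newer than the snapshot: freed

print(ds.retained, ds.freed, ds.live[0])
```

Note that the write path does a single comparison per overwritten block, which is why having snapshots adds no extra copying.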

[01:07:39] Viktor Petersson

That's very cool.

[01:07:40] Allan Jude

So the copy-on-write nature of ZFS also provides the other big thing.

[01:07:44] Allan Jude

Especially in the early 2000s, there was this problem where if your server ever got rebooted unexpectedly, when it came back up, it had to run an fsck and check all the stuff to make sure your directories weren't corrupt or whatever.

[01:07:56] Allan Jude

Yeah, on bigger hard drives that could take days.

[01:08:00] Allan Jude

And having your operating system not fully running for days was not an option.

[01:08:04] Allan Jude

Eventually that was mostly kind of solved with journaling, but to a limited degree.

[01:08:11] Allan Jude

With ZFS, the way it works is using those transactions.

[01:08:14] Allan Jude

Any changes you make get accumulated in a transaction group and then written out.

[01:08:17] Allan Jude

And by default that happens every five seconds or more frequently if you're pushing a lot of data and it's getting full.

[01:08:28] Allan Jude

But this means that, you know, in an overwriting file system, say you just made a bunch of changes to a big Excel file and hit save, and the power goes out halfway through saving.

[01:08:39] Allan Jude

On a regular file system, you would have overwritten the first half of the original copy of the Excel file with the new version, but the second half didn't get written yet because the power went out.

[01:08:49] Allan Jude

So when you boot back up, you have half the new file and half the old file.

[01:08:52] Allan Jude

And Excel is just going to say, that's gibberish.

[01:08:55] Allan Jude

I can't make sense of that.

[01:08:58] Allan Jude

Whereas with ZFS, the new file was being written over here and we only have half of it.

[01:09:04] Allan Jude

And so that transaction didn't finish, so the checksum won't match.

[01:09:07] Allan Jude

So when it boots up, it'll just say, okay, that one didn't finish.

[01:09:10] Allan Jude

We'll go back to the previous one that did complete and use that.
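
A minimal Python sketch of "go back to the last transaction that completed": each committed transaction group is sealed with a checksum, and on startup the newest record whose checksum still matches wins. The record layout, JSON encoding, and function names are assumptions for illustration and not the ZFS on-disk format.

```python
import hashlib
import json


def seal(txg: int, root: str) -> dict:
    """Create a commit record whose checksum covers its payload."""
    payload = json.dumps({"txg": txg, "root": root}, sort_keys=True).encode()
    return {"payload": payload, "checksum": hashlib.sha256(payload).hexdigest()}


def latest_valid(records):
    """Return the newest record whose checksum matches, or None."""
    best = None
    for record in records:
        actual = hashlib.sha256(record["payload"]).hexdigest()
        if actual != record["checksum"]:
            continue  # torn or corrupted write: ignore it
        info = json.loads(record["payload"])
        if best is None or info["txg"] > best["txg"]:
            best = info
    return best


completed = seal(41, "tree-A")
interrupted = seal(42, "tree-B")
# Simulate power loss halfway through writing transaction group 42.
interrupted["payload"] = interrupted["payload"][: len(interrupted["payload"]) // 2]

chosen = latest_valid([completed, interrupted])
print(chosen)
```

Because the old tree for transaction group 41 was never overwritten, falling back to it yields a fully consistent file system.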

[01:09:14] Allan Jude

And then ZFS has this concept called the ZFS Intent Log, or ZIL, where any time an application asks for a promise that the write it just did is on disk before it continues, like a database would, that gets written to the ZIL, and then when you reboot after the crash, the ZIL will replay those changes and make sure that everything we promised actually got finished.

[01:09:36] Viktor Petersson

Like a WAL log in Postgres.

[01:09:39] Allan Jude

Exactly.

[01:09:39] Allan Jude

But that's for the file system. And this way we go from "the file system was perfect" to "the file system was perfect" without ever having to be in that in-between state.

[01:09:49] Allan Jude

If it's in an in-between state, we kind of roll back, like a database.

[01:09:52] Allan Jude

And so it doesn't have this need to fsck like other file systems do.

[01:09:59] Viktor Petersson

Super interesting. The last feature that I wanted to cover, which I think is a very neat feature as well, is ZFS send, which is kind of a clever feature for, like, shipping data across systems, really.

[01:10:09] Viktor Petersson

So maybe speak a bit about that.

[01:10:11] Allan Jude

Yeah.

[01:10:11] Allan Jude

So basically it allows you to serialize a file system into a stream that you can send over the network.

[01:10:17] Allan Jude

Technically, you could send it to a file and then receive it on a different computer much later, but usually it makes more sense to just do it directly, because having a giant file full of a stream of the file system isn't as useful as if you receive it, you have that same amount of space, but as a usable copy of the file system.

[01:10:36] Viktor Petersson

Right.

[01:10:38] Allan Jude

But its power really comes from its ability to do incremental replication, where after you've sent the whole file system based on a snapshot, if you make a newer snapshot, you can send the difference just between those two.

[01:10:49] Allan Jude

And that depends on what we had just talked about, those transactions and the transaction group ID.

[01:10:56] Allan Jude

So when you created the first snapshot, we were at, say, transaction 1000, and when you created the second one, we were at, say, 1100.

[01:11:04] Allan Jude

So now when you say send the difference between snapshot one and snapshot two, ZFS just has to look for blocks that have a birth time between those two numbers.

[01:11:11] Viktor Petersson

Oh, neat.

[01:11:12] Viktor Petersson

Okay.

[01:11:12] Allan Jude

And so we can just scan all those blocks and say, okay, any block that has a birth time greater than a thousand, but up to 1100, we're just going to feed it into the stream, and on the other side we'll just receive those.
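
The selection rule just described (birth time greater than 1000, up to and including 1100) is small enough to show directly. The `Block` type and function name below are hypothetical; the transaction group numbers are the ones used in the conversation.

```python
from typing import NamedTuple


class Block(NamedTuple):
    block_id: int
    birth_txg: int  # transaction group in which the block was written


def incremental_stream(blocks, from_txg: int, to_txg: int):
    """Pick the blocks born after the first snapshot, up to the second."""
    return [b for b in blocks if from_txg < b.birth_txg <= to_txg]


blocks = [
    Block(1, 900),    # already on the other side
    Block(2, 1000),   # part of snapshot one, already sent
    Block(3, 1001),   # changed after snapshot one: send
    Block(4, 1100),   # written in the txg of snapshot two: send
    Block(5, 1101),   # newer than snapshot two: not part of this stream
]

to_send = incremental_stream(blocks, from_txg=1000, to_txg=1100)
print([b.block_id for b in to_send])  # [3, 4]
```

Only block metadata is inspected here; the data itself is never read to decide what changed, which is the contrast with rsync drawn in the conversation.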

[01:11:25] Allan Jude

And so compared to doing a backup with something like rsync, where it's going to walk through every single file in your system and run stat on it and be like, when did you last change?

[01:11:34] Allan Jude

When did you last change?

[01:11:35] Allan Jude

And only.

[01:11:36] Allan Jude

And if a file is huge, but it was touched, rsync has to read the whole file on your side and read the whole file on the far side and check the checksums of the chunks to figure out which blocks changed.

[01:11:51] Allan Jude

Whereas ZFS just looks at the time on each block in the metadata, without having to read the data, and says, okay, only these seven blocks changed.

[01:11:59] Allan Jude

So it just sends those seven blocks and it's done.

[01:12:02] Viktor Petersson

Interesting.

[01:12:03] Allan Jude

And so ZFS is able to basically saturate the network and do the backup in a couple of seconds, where rsync might take a whole day to copy that same couple of blocks, because it has to check every file, and then if the file is newer, it has to read the whole file on both sides and decide which blocks are actually different.

[01:12:20] Allan Jude

Whereas ZFS just natively knows because it has a timestamp for each block of the file instead of just the whole file.

[01:12:26] Viktor Petersson

Right.

[01:12:26] Viktor Petersson

And I guess if you send it on a network, you would just pipe it through something that's encrypted and then receive it and decrypt it.

[01:12:31] Viktor Petersson

Yeah.

[01:12:32] Viktor Petersson

Okay.

[01:12:32] Allan Jude

Well, also, if your dataset was encrypted on ZFS, you can do a raw send, where it'll send the encrypted version without decrypting it to the other side.

[01:12:42] Allan Jude

And then that way the data is never decrypted on the far side.

[01:12:46] Allan Jude

And so if they don't have the key to decrypt it, they only have a backup of your data that they can send back to you, but they can't ever mount.
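
A raw incremental send over SSH might be wired up roughly like this in Python. The sketch relies on `zfs send -w` (raw) with `-i` for the incremental source and `zfs receive -u` to avoid mounting on the far side; the dataset, snapshot, and host names are hypothetical, and `replicate` is shown but not invoked.

```python
import subprocess


def raw_incremental_send_cmds(dataset, old_snap, new_snap, ssh_host, target):
    """Build the sender and receiver command lines.

    -w sends the blocks still encrypted, so the receiver never needs the key.
    -i sends only the difference between the two snapshots.
    -u keeps the received dataset unmounted on the far side.
    """
    send = ["zfs", "send", "-w", "-i",
            f"{dataset}@{old_snap}", f"{dataset}@{new_snap}"]
    recv = ["ssh", ssh_host, "zfs", "receive", "-u", target]
    return send, recv


def replicate(dataset, old_snap, new_snap, ssh_host, target) -> bool:
    """Pipe the send stream into the remote receive (not run here)."""
    send, recv = raw_incremental_send_cmds(
        dataset, old_snap, new_snap, ssh_host, target)
    sender = subprocess.Popen(send, stdout=subprocess.PIPE)
    receiver = subprocess.Popen(recv, stdin=sender.stdout)
    sender.stdout.close()  # let the sender see a broken pipe if recv dies
    receiver_rc = receiver.wait()
    sender_rc = sender.wait()
    return receiver_rc == 0 and sender_rc == 0


# Hypothetical names, for illustration only.
send_cmd, recv_cmd = raw_incremental_send_cmds(
    "tank/home/alice", "monday", "tuesday",
    "backup.example.net", "backup/alice")
print(" ".join(send_cmd), "|", " ".join(recv_cmd))
```

Since the stream is already encrypted, the SSH tunnel here protects the transport and authenticates the hosts rather than hiding the data.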

[01:12:54] Viktor Petersson

Right, okay.

[01:12:56] Allan Jude

The thing lots of people are after.

[01:12:58] Allan Jude

Yeah.

[01:12:58] Viktor Petersson

Are there any hosted services? I know, obviously, a fellow Canadian, Colin, is running Tarsnap.

[01:13:05] Viktor Petersson

Are they supporting that, ZFS send, at Tarsnap?

[01:13:09] Viktor Petersson

Because that would be an interesting service offer to basically ship.

[01:13:12] Allan Jude

They're not.

[01:13:13] Allan Jude

So the two services I know of are zfs.rent, where they will basically rent you a hard drive in a VM that you can do that to, and then rsync.net has a thing where they will basically stand up FreeBSD or Linux in a VM and sell you five terabytes of space that you can ZFS receive your encrypted datasets to.

[01:13:36] Allan Jude

Or not encrypted if you don't care.

[01:13:39] Viktor Petersson

Interesting.

[01:13:40] Viktor Petersson

All right, so the last thing I want to round up the episode on is Tales from the Trenches.

[01:13:45] Viktor Petersson

Obviously you've been around the data world for quite some time and I'm sure you've seen a lot of horror stories around this.

[01:13:52] Viktor Petersson

So I kind of want to hear from your side: what are the craziest, most bizarre data recovery missions you guys have been on, trying to recover data on ZFS clusters or in general?

[01:14:03] Allan Jude

Yeah, we've done quite a few different ones.

[01:14:07] Allan Jude

We just did a webinar on Halloween that we'll drop a link in here for, but if you go to klarasystems.com and click webinars, you'll find our Halloween horror stories.

[01:14:17] Allan Jude

So I won't cover one of those because they're all really good, but I just did them.

[01:14:21] Allan Jude

So another one we did, there was an interesting one at a university where they had just made a bunch of changes and we had helped them with those and it was fine.

[01:14:33] Allan Jude

And then so they had a high availability system where they'd have a JBOD full of hard drives connected to two different servers so that if one server goes down, they could import the pool on the other one.

[01:14:47] Allan Jude

And they had a bunch of these pools and they had just numbered them like everybody does.

[01:14:52] Allan Jude

This is not the same as the war story in the Halloween one where a similar thing kind of happened.

[01:14:56] Allan Jude

But when importing them after doing the upgrade, they imported pool 4 on server 4, and then went to server 5 and accidentally typed import pool 4 instead of pool 5.

[01:15:11] Allan Jude

So now pool 4 was mounted on two machines at the same time, which isn't supposed to be able to happen, but they hadn't enabled the feature that stops it, because there's a trade-off for it.

[01:15:23] Allan Jude

But so now every time there's a new transaction group, server 4 is writing to the hard drive and then server 5 is writing different information with a different transaction group number over top of that.

[01:15:34] Allan Jude

And so they're just sitting there writing over top of each other, somewhat being correct and somewhat not.

[01:15:39] Allan Jude

And also meaning that if you wrote different data on different servers, they're going to not know that the other one has used a certain bit of space.

[01:15:47] Allan Jude

So it thinks it's free and allocates it, and basically overwrites something server 4 just wrote with something different that server 5 just wrote.

[01:15:54] Allan Jude

And you can really corrupt things really badly.

[01:15:58] Allan Jude

They managed to catch it after I think it was about 25 minutes or so.

[01:16:02] Allan Jude

And so they shut everything down and called us, and things were quite broken, and we had to invent some new tools to be able to fix it, but eventually got to the point where we could do a version of ZFS send to copy all the data to a spare machine they had, so that they could get most of their data back, except for, obviously, the newer stuff they had overwritten, and a couple of pieces that got damaged.

[01:16:26] Allan Jude

But we managed to get most of the data back by basically being able to do ZFS send.

[01:16:32] Allan Jude

And the big advantage there over copying it some other way was just the speed, the fact that we could saturate their 10 gigabit network.

[01:16:38] Allan Jude

And when you're talking about hundreds and hundreds of terabytes of data, it's really slow to copy.

[01:16:44] Allan Jude

So you really need a way to try to do that quickly.

[01:16:48] Viktor Petersson

Good stuff.

[01:16:50] Viktor Petersson

This has been very helpful and I think hopefully opened the eyes to ZFS to a new audience.

[01:16:57] Viktor Petersson

And I think it's well worth shouting about because I think it's a fantastic file system.

[01:17:03] Viktor Petersson

So I'm very happy to see wider adoption for it.

[01:17:06] Viktor Petersson

So thank you so much for coming on the show, Allan.

[01:17:10] Viktor Petersson

Very much appreciated.

[01:17:11] Viktor Petersson

Any last words about Klara?

[01:17:13] Viktor Petersson

Something you want to say, or want to draw attention to?

[01:17:16] Allan Jude

If you have.

[01:17:17] Allan Jude

If you want to learn more about ZFS, I do a weekly podcast where we tend to answer quite a few people's questions.

[01:17:23] Allan Jude

They write in about ZFS.

[01:17:24] Allan Jude

So that's 2.5admins.com, so two and a half admins.

[01:17:28] Allan Jude

It's a podcast with myself and another ZFS admin, Jim Salter, and a host who's trying to become a sysadmin.

[01:17:34] Allan Jude

So we have two and a half admins.

[01:17:36] Allan Jude

So if you're into sysadminning in general, or ZFS specifically, you want to check out that podcast that comes out every Thursday.

[01:17:44] Allan Jude

And we also have a website, practicalzfs.com, where we have a Discourse set up, replacing the old Reddit r/zfs.

[01:17:53] Allan Jude

And so it's a place where people ask a lot of ZFS questions, we answer them and kind of collect a whole thing there.

[01:17:58] Allan Jude

But yeah, if you need support with ZFS or feature development, or the same for FreeBSD, then do hit us up at klarasystems.com.

[01:18:07] Viktor Petersson

Amazing.

[01:18:08] Viktor Petersson

Again, thanks so much for coming on the show, Allan.

[01:18:10] Viktor Petersson

Have a good one.

[01:18:11] Viktor Petersson

Talk soon.

[01:18:11] Viktor Petersson

Cheers.

[01:18:12] Allan Jude

Bye.
