Join Viktor, a proud nerd and seasoned entrepreneur, whose academic journey at Santa Clara University in Silicon Valley sparked a career marked by innovation and foresight. From his college days, Viktor embarked on an entrepreneurial path, beginning with YippieMove, a groundbreaking email migration service, and continuing with a series of bootstrapped ventures.

All things ZFS and FreeBSD with Allan Jude

01 DEC • 2024 1 hour 18 mins

In this technically rich episode of Nerding Out with Viktor, host Viktor sits down with Allan Jude, a distinguished FreeBSD developer and ZFS expert, for an in-depth exploration of advanced storage systems and operating system architecture. Allan, known for his extensive contributions to FreeBSD and deep expertise in ZFS, breaks down complex concepts into digestible insights while maintaining the technical depth that advanced users crave.

The conversation begins with a comprehensive look at ZFS’s architectural foundations, where Allan explains how the copy-on-write mechanism fundamentally transforms data integrity and storage management. He illuminates the practical implications of ZFS’s design choices, from its self-healing capabilities to its sophisticated approach to data verification, making a compelling case for why ZFS remains a cornerstone for enterprise storage solutions.

Viktor and Allan dive deep into FreeBSD’s networking stack, examining how its architectural decisions enable exceptional performance and reliability in production environments. The discussion reveals why major technology companies continue to trust FreeBSD for their mission-critical operations, with Allan sharing real-world examples and best practices drawn from his extensive experience in the field.

Throughout the episode, listeners are treated to practical deployment strategies that bridge theoretical knowledge with real-world applications. Allan offers invaluable insights into optimizing ZFS configurations, managing storage pools effectively, and leveraging FreeBSD’s security features to their full potential. The conversation also touches on the future of storage systems and operating system development, with both hosts sharing their perspectives on emerging trends and technologies.

Whether you’re a systems administrator looking to enhance your storage infrastructure, a developer interested in operating system internals, or a technology enthusiast curious about advanced filesystem architectures, this episode delivers actionable insights and deep technical knowledge. Allan’s ability to explain complex technical concepts while maintaining practical relevance makes this discussion an essential listen for anyone working with enterprise storage solutions or Unix-like operating systems.

Transcript

[00:03] Viktor Petersson
Welcome back to another episode of Nerding Out with Viktor.
[00:06] Viktor Petersson
Today we're going to go deep into the world of ZFS and FreeBSD with Allan Jude.
[00:11] Viktor Petersson
Welcome to the show, Allan.
[00:12] Allan Jude
Hello.
[00:13] Allan Jude
Thank you.
[00:14] Viktor Petersson
So you've been around the FreeBSD world for a long time, and I feel like, as I mentioned before we hit the record button, FreeBSD is not really getting the attention it deserves.
[00:29] Viktor Petersson
So I guess for the people in the audience who are not even familiar with FreeBSD and the BSD family, maybe can you do a quick intro to, like, what sets it apart, and just some of the backstory?
[00:41] Allan Jude
Yeah, I guess we can start with the initial part of the backstory.
[00:44] Allan Jude
So UNIX was originally developed at Bell Labs, which was the research arm of AT&T, the phone company, the only phone company in the US back then.
[00:55] Allan Jude
And so they developed this operating system basically to be able to build things and also to convince management to buy them a big enough computer, also to make printed manuals and so on to do actual work for the phone company.
[01:12] Allan Jude
But that really was just a side effect of getting a computer they could use to actually write computer programs, a more comfortable environment.
[01:21] Allan Jude
And so they built this, and as it started going out, they found that they wanted people to know how to use it and to find new things to do with it.
[01:32] Allan Jude
So they licensed copies to a bunch of universities.
[01:36] Allan Jude
And back then, computer programs weren't really the same, and there was no standard architecture.
[01:43] Allan Jude
Right.
[01:44] Allan Jude
Every computer was a completely different computer that basically spoke a different language.
[01:49] Allan Jude
And so you had to write the operating system and the programs in, like, the assembly dialect of that machine, which is completely different from some other machine.
[02:01] Allan Jude
And that's what led to the invention of the C programming language, so that we can write the code once and compile it for multiple different machines.
[02:11] Allan Jude
Anyway, one of the universities that got a copy of the original Research Unix, as it was called, was the University of California at Berkeley.
[02:20] Allan Jude
And they started adding things to it and changing it to be useful for them.
[02:25] Allan Jude
And eventually what they did was they would send copies of it on a tape because floppy disks hadn't been invented yet.
[02:35] Allan Jude
So they'd mail this tape to other universities and other people who would take it and they would change it and they would send back some of their changes.
[02:43] Allan Jude
Of course, you know, we didn't have Git or tools for managing patches.
[02:47] Allan Jude
So it was a little crazy back then.
[02:50] Allan Jude
But eventually some of those changes made it back to Berkeley and they got incorporated and went out to other people on the next tape.
[02:55] Allan Jude
And that was kind of the precursor to open source.
[02:59] Allan Jude
No one had thought up licensing or text to put on it yet.
[03:03] Allan Jude
It was just, you know, this is the program and you use it.
[03:06] Allan Jude
And of course every program comes with all the source code and you can do whatever you want with it.
[03:12] Allan Jude
But then it turns out, you know, AT&T had ideas about this.
[03:19] Allan Jude
But that went on for a while and really got interesting.
[03:22] Allan Jude
And eventually they added things to it, including the TCP/IP stack, which was the invention of the Internet.
[03:29] Allan Jude
And there's some great stories about that.
[03:32] Allan Jude
Kirk McKusick, one of the people who wrote the file system for that original Unix, or the original BSD Unix, has a history series on DVD that talks about some of the really interesting things around the invention of TCP/IP.
[03:46] Allan Jude
And there was like a proprietary version and then the BSD version.
[03:50] Allan Jude
And the BSD version was faster at some parts, but worse at other parts.
[03:56] Allan Jude
And the way the story went was the proprietary one was faster, but it would crash.
[04:02] Allan Jude
And in the time while it was rebooting, after it crashed, the BSD one would catch up.
[04:06] Allan Jude
So if you transferred a big enough file, you would still be faster on the BSD one and things like that.
[04:13] Viktor Petersson
I didn't even know that.
[04:14] Viktor Petersson
Okay, yeah, I didn't know that.
[04:15] Viktor Petersson
I didn't even know that backstory of the TCP/IP stacks, actually.
[04:18] Allan Jude
Yeah, that was, I think BBN it was called.
[04:21] Allan Jude
It was like way back at the very beginning, like when the Internet was only for the government.
[04:25] Allan Jude
It wasn't the open Internet yet.
[04:27] Viktor Petersson
Right, right.
[04:28] Allan Jude
It's probably still the ARPANET.
[04:29] Allan Jude
Yes, and things like that.
[04:31] Allan Jude
Anyway, that's a great lecture that Kirk gives that you can catch from other conferences or from the history DVD.
[04:40] Allan Jude
But eventually they kind of finished up the last version of that and a company started called BSDI that was going to make a version of this and actually sell it.
[04:53] Allan Jude
And so they did the port to what was the 386 and made this BSD/OS and they were going to sell it to people.
[05:01] Allan Jude
And it was, they were doing a pretty brisk business at the time because, you know, your only other option was Windows or probably wasn't even Windows yet.
[05:11] Allan Jude
Like, yeah, early versions of Windows or like SunOS, which was expensive and came with special hardware, whereas the 386 was a cheap machine you could just buy.
[05:20] Viktor Petersson
Right.
[05:23] Allan Jude
But they made the probably ill-advised decision to have their phone number to order this software be 1-800-ITS-UNIX.
[05:32] Allan Jude
Whereas Unix was a trademark of AT&T.
[05:36] Allan Jude
And AT&T is like, yeah, no. Also, your code probably contains, you know, our code, which, you know, we charge $1,000 a license for at least, if not a lot more.
[05:50] Allan Jude
And you know, you're selling your software to other people for less than that, so you're not paying us.
[05:55] Allan Jude
How does that work?
[05:56] Allan Jude
And this led to the famous AT&T USL lawsuit.
[06:02] Allan Jude
And it turned out there were maybe seven files that needed to be rewritten, but not really.
[06:08] Allan Jude
I don't think the final details ever really got fully disclosed.
[06:12] Viktor Petersson
Right.
[06:13] Allan Jude
But it slowed BSD down, especially the adoption of it, at just the wrong time, as it turns out.
[06:21] Allan Jude
And so in Finland this student named Linus was like, you know, I want a Unix-like operating system and oh, I can't use BSD because it's tied up in this lawsuit, so I'll have to build my own.
[06:36] Allan Jude
And that was the start of Linux.
[06:37] Allan Jude
And like he said, if it hadn't been for the AT&T lawsuit, I would have never bothered making Linux and we would just have BSD instead.
[06:46] Viktor Petersson
It's a crazy parallel universe, right?
[06:49] Allan Jude
Yeah, yeah, that's a really interesting parallel universe.
[06:53] Allan Jude
Like how that would have affected the evolution of the UNIX wars and what would have happened where and how different life might be just because of the licensing.
[07:04] Allan Jude
So BSD and similar stuff like the MIT license, ISC license, basically the whole license is a couple of sentences that say, you can use this code for whatever you want, but don't take our name off of it and don't claim you were the one that wrote it.
[07:20] Allan Jude
And so the Sony PlayStation 4 and 5 are based on FreeBSD.
[07:26] Allan Jude
And so at the back of the manual there's a bunch of pages of copyright notices saying, this contains code from this person and this person and so on.
[07:36] Allan Jude
And that's the extent of Sony's obligation, right?
[07:41] Allan Jude
Well, many of the companies that use this BSD software do give back because there's an advantage to doing so.
[07:47] Allan Jude
You know, if you're taking FreeBSD and building something on top of it, you're making some changes.
[07:52] Allan Jude
Some of those changes will be your intellectual property or the unique selling feature of your product, and you want to keep those secret, but there are other parts and smaller features and so on that aren't secret but will still be work for you.
[08:06] Allan Jude
Every time you upgrade to a newer version, you will have to reintegrate those, and that's a lot of work.
[08:11] Allan Jude
If you can contribute those back, they become part of the main line and everybody collectively maintains them, then it's that much easier for you to update in the future.
[08:21] Allan Jude
And really I think the thing that sells it for me the most is the thought of if I buy a washing machine that has the ability to send a push notification to my cell phone when the laundry is done, I don't want it to be using some proprietary or hand-built network stack because the people who wrote the software on it didn't want to give back under the GPL, which has this viral clause that means any code that touches this has to become open.
[08:52] Allan Jude
So if they're going to use some closed source thing that they bought, or write their own thing to avoid that, that's probably going to lead to a bad time.
[09:01] Allan Jude
Whereas if they can have this standard reference implementation that is under a liberal license, then we're probably all in better shape.
[09:10] Viktor Petersson
Yeah.
[09:11] Viktor Petersson
So that takes us to the modern BSD landscape, I guess, where you have, well, I guess three major flavors of BSD: Open, Free, Net.
[09:22] Viktor Petersson
I guess they're all.
[09:23] Viktor Petersson
They are the main ones.
[09:24] Viktor Petersson
There is.
[09:24] Viktor Petersson
There are some derivatives of those I think as well.
[09:27] Viktor Petersson
But those are the main ones today.
[09:28] Allan Jude
Right.
[09:30] Viktor Petersson
Can you speak a bit?
[09:31] Viktor Petersson
Just compare and contrast like real quickly, like for those who.
[09:34] Viktor Petersson
Not those who are new to BSD and just understand what their strengths are between the three of them.
[09:40] Allan Jude
Yeah.
[09:41] Allan Jude
So back in the early 90s when, after the BSD USL lawsuit was settled and it became possible, these different versions of BSD came out and the first two were NetBSD and FreeBSD.
[09:56] Allan Jude
And NetBSD wanted to focus on portability, running on all those different types of computers.
[10:03] Allan Jude
Especially at the time there were, you know, in the early 90s there were still a lot of these older big iron machines around and companies maybe weren't still producing new software for them.
[10:15] Allan Jude
And there were these ideas of other CPU architectures other than the x86.
[10:21] Allan Jude
And NetBSD really wanted to focus on, you know, when a new thing comes out, we can port to it and be running on it really quickly because they saw that as being the future of computers, not this 386, 486 type thing.
[10:35] Allan Jude
So their focus is always on this portability and being able to run on all the different types of CPUs at the same time.
[10:43] Allan Jude
With FreeBSD the focus was we have this one main architecture, the 386 and eventually the x86.
[10:51] Allan Jude
And we really want to be able to use that as a desktop and a server.
[10:56] Allan Jude
And so that was their focus.
[10:58] Allan Jude
After a couple of years, One of the NetBSD developers had a falling out with the other developers and wanted to do something quite a bit different.
[11:07] Allan Jude
And so they forked off from NetBSD and created OpenBSD.
[11:11] Allan Jude
The main goals there was that the repository, not just the source code, but the repository and the history, would be open to the public and then have a really big focus on security.
[11:23] Allan Jude
And it's come to the point where OpenBSD is mostly kind of like the original Unix, an environment for the developers to do things, whether that's their everyday things or building and kind of this incubator for all these ideas of different ways to do security.
[11:41] Allan Jude
So on top of their other enhancements, they've done the idea of privilege separation.
[11:47] Allan Jude
So one of the things that's kind of maintained inside the OpenBSD project is OpenSSH, the SSH server that's used in every operating system.
[11:56] Allan Jude
And originally that had this one bigger binary that ran as root and then could spawn, you know, every time you logged in.
[12:04] Allan Jude
But that's now been split up into separate tools so that if one gets compromised, it's running without the same level of privileges that, you know, any risky processing is done as like a less important user that can only communicate over interprocess communication with the main process that has all the privileges, so that less chances of it being exploited.
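The privilege-separation idea described here can be sketched in Python. This is a minimal illustration, not OpenSSH's actual code: `risky_parse`, `parse_in_child`, and the numeric user ID 65534 (assumed to be an unprivileged "nobody" account) are hypothetical names and values chosen for the example. The risky work runs in a forked child that drops privileges when it can, and only a small result travels back to the parent over a socket pair.

```python
import os
import socket

UNPRIVILEGED_ID = 65534  # assumption: uid/gid of an unprivileged "nobody" account


def risky_parse(raw: bytes) -> bytes:
    """Hypothetical stand-in for parsing untrusted network input."""
    name = raw.split(b":", 1)[1] if b":" in raw else b""
    return name[:32]  # cap the size of what is handed back


def parse_in_child(raw: bytes) -> str:
    """Parse untrusted bytes in a separate, less privileged process."""
    parent_end, child_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: give up privileges (only possible when started as root),
        # do the risky work, and report back over the socket pair.
        parent_end.close()
        try:
            if os.geteuid() == 0:
                os.setgroups([])
                os.setgid(UNPRIVILEGED_ID)
                os.setuid(UNPRIVILEGED_ID)
            child_end.sendall(risky_parse(raw))
        finally:
            child_end.close()
            os._exit(0)
    # Parent: keeps its privileges and never touches the raw input itself.
    child_end.close()
    chunks = []
    while True:
        data = parent_end.recv(4096)
        if not data:
            break
        chunks.append(data)
    parent_end.close()
    os.waitpid(pid, 0)
    return b"".join(chunks).decode("ascii", "replace")
```

Calling `parse_in_child(b"user:alice")` hands back `"alice"` from the child. If the parser were compromised, the attacker would be inside a process that, when started as root, has already given up its privileges.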
[12:28] Viktor Petersson
Right.
[12:29] Allan Jude
And then they've taken that idea further and further as they've gone.
[12:33] Allan Jude
But one of the other really interesting ones they do is they actually relink.
[12:38] Allan Jude
So not compiling, but assembling the bits together, the kernel every time it reboots.
[12:45] Allan Jude
So as it's booting up, it makes a new kernel for next time, basically.
[12:49] Allan Jude
And it puts all the pieces together in a random order so that no machine will be exactly the same for an exploit to be able to know where it's going to find like the.
[13:00] Viktor Petersson
Certain functions for memory injections or for memory attacks, essentially.
[13:04] Allan Jude
Yeah.
[13:04] Allan Jude
For basically any kind of memory attack.
[13:06] Allan Jude
Where you're going to try to find these gadgets to string together, do an exploit.
[13:10] Allan Jude
Every time this machine reboots, it'll be different.
[13:12] Allan Jude
And that means every machine will be different and you won't be able to have the same kind of.
[13:16] Allan Jude
It'll make the attacker's life that much harder.
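The effect of relinking in a random order can be illustrated with a toy model. This is only a sketch: the object names and sizes are made up, and `link_layout` is a hypothetical helper, not OpenBSD's tooling. It shuffles the objects and computes where each one would start if they were laid out back to back.

```python
import random


def link_layout(objects, seed=None):
    """Return {object name: start offset} after placing objects in random order.

    `objects` is a list of (name, size_in_bytes) pairs.
    """
    rng = random.Random(seed)
    order = list(objects)
    rng.shuffle(order)
    layout = {}
    offset = 0
    for name, size in order:
        layout[name] = offset
        offset += size
    return layout


# Hypothetical kernel objects and sizes, for illustration only.
KERNEL_OBJECTS = [("vfs.o", 4096), ("tcp.o", 8192), ("sched.o", 2048), ("pmap.o", 1024)]
```

Each call to `link_layout(KERNEL_OBJECTS)` may place `tcp.o` at a different offset, so an exploit that hard-codes the address of a function inside it cannot rely on that address holding on another machine or after the next reboot.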
[13:19] Viktor Petersson
Right.
[13:20] Allan Jude
And they do a lot of interesting work like that.
[13:22] Allan Jude
Yeah.
[13:23] Viktor Petersson
I love the slogan for OpenBSD was no default vulnerabilities for like 10 years or whatever it was, or 15 years, whatever the number is today.
[13:31] Allan Jude
Right.
[13:31] Allan Jude
And then it became kind of a heck of a long time when there was one.
[13:34] Allan Jude
The one time.
[13:35] Allan Jude
And.
[13:36] Viktor Petersson
Right.
[13:37] Allan Jude
They've had a really good track record on.
[13:39] Viktor Petersson
Yeah, it's pretty cool in terms of just to compare and contrast, do they share kernels or they completely.
[13:46] Allan Jude
They just.
[13:47] Viktor Petersson
Toolkit.
[13:48] Allan Jude
Yeah, so the kernels, I suppose technically were close to the same in the very early 90s, but they are very diverged.
[13:57] Allan Jude
Now one of the things that sets each of the three BSDs apart from something like a Linux is that each one is a complete operating system.
[14:07] Allan Jude
In that, you know, in one repository you have the kernel, the basic utilities, like ls, cat, grep, et cetera, and a bunch of the other pieces.
[14:19] Allan Jude
And all the drivers all live in one source code repository and can be built and run with just that.
[14:27] Allan Jude
And in fact, I'm pretty sure in all three of them that a bunch of the stuff comes with the operating system and then the packages are off to the side and by default there won't be any packages installed.
[14:41] Allan Jude
And then you can install whatever additions you need on top of the operating system.
[14:46] Allan Jude
Whereas almost every Linux distribution is, all right, we take a version of the kernel, who knows which one, and we combine it with this bunch of GNU coreutils and other things, and all of those are packages that get installed.
[15:00] Allan Jude
And so the concept of having an operating system with no packages in Linux doesn't really make any sense because all of the components of the operating system, including the kernel, are packages.
[15:10] Allan Jude
Whereas in the BSDs you have the operating system and then separately you have all the third-party packages, right?
[15:16] Viktor Petersson
And the package managers are rather sophisticated, like the ports from BSD.
[15:21] Viktor Petersson
I know that was overhauled a decade ago, right?
[15:24] Viktor Petersson
It used to be just tarballs and now that, yeah, they redesigned that, I think, what, 10 years ago.
[15:31] Viktor Petersson
I think.
[15:34] Allan Jude
In the very beginning FreeBSD added this concept of ports, which is basically a separate repository with a directory structure sorted into categories, and then each program has a directory and inside it there's a little Makefile, a little.
[15:49] Viktor Petersson
Oh, yes, sorry, yes, make.
[15:51] Allan Jude
And in there when you run the right command, it knows, hey, I need to go to this website, download the source code, extract it, check the checksum to make sure nobody's feeding me the wrong source code, extract it, apply the set of patches, maybe run the configure script, compile it, install it, and so on, and make sure it gets installed to the right place and that you have all the information you need to be able to uninstall it and all that.
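The checksum step in that sequence can be sketched as follows. This is a minimal illustration in Python, not the ports framework's own implementation (which is driven by Makefiles); `verify_distfile` and `fetch_then_verify` are hypothetical names, and SHA-256 is assumed as the digest.

```python
import hashlib


def verify_distfile(data: bytes, expected_sha256: str) -> bool:
    """Return True if the downloaded bytes match the recorded checksum."""
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()


def fetch_then_verify(data: bytes, expected_sha256: str) -> bytes:
    """Refuse to continue to extract/patch/build if the checksum is wrong."""
    if not verify_distfile(data, expected_sha256):
        raise ValueError("checksum mismatch: refusing to extract source")
    return data
```

The point of doing this before extraction is that a tampered or corrupted download is rejected before any patching, configuring, or compiling happens.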
[16:17] Allan Jude
And so it did that for a long time and there were binary packages, but back in the early days, the way it worked Was all of those packages were built once at the day of the release and included on a second cd, but they were never updated.
[16:36] Allan Jude
So you had the version of the packages that shipped with the operating system when it came out, and if it was six months later, sorry, all you have is these old packages and you'd have to build your own from the ports tree and that was it.
[16:50] Allan Jude
And it was not ideal.
[16:53] Allan Jude
So yeah, I think around FreeBSD 8. For reference, we're on version 14 right now.
[16:59] Allan Jude
With FreeBSD 8 we released a new package manager, which meant a tool called Poudriere would take that ports tree and recompile it every three or four days, because it takes that long to do the build, and post it.
[17:13] Allan Jude
And so you could just do pkg upgrade and it would download stuff, kind of the same as an APT or a Yum or whatever.
[17:19] Allan Jude
The big difference is that it gets rebuilt every three or four days.
[17:23] Allan Jude
So it's very fresh.
[17:26] Viktor Petersson
Yeah, because I remember one of the issues.
[17:29] Viktor Petersson
I used to run BSD, FreeBSD specifically, many years ago, and we ran it at somewhat small scale, like I think, I don't know, 50 servers maybe, so smallish scale, but I remember doing those installations across a fleet of servers where you have to compile everything on every single server and then you have to.
[17:47] Viktor Petersson
There were some mechanisms for distributing packages, but it's very different from the Linux world, which is apt-get install and voila, that's all you have to care about.
[17:56] Allan Jude
Yeah.
[17:57] Allan Jude
And so eventually we added the concept of a quarterly branch.
[18:01] Allan Jude
So on top of those packages built every three or four days, there's a second repo also built every three or four days.
[18:07] Allan Jude
But the major versions of those packages only change once every three months.
[18:13] Allan Jude
So that you have the choice of basically a rolling release or a quarterly release that's just security fixes every day, but doesn't randomly change from version three to version four of a program in the middle of your cycle so that you can safely just run automated packages.
[18:33] Allan Jude
And the big thing is with that system there's also the program called Poudriere, which allows you to compile those packages yourself.
[18:43] Allan Jude
So if you have that fleet of 50 servers, the big thing that the ports tree let you do is there are configuration options for all these programs.
[18:52] Allan Jude
So you know, you can decide whether you want this program.
[18:56] Allan Jude
Like if you're installing WordPress, do you want it to use MySQL or Postgres as the database or something like that.
[19:03] Allan Jude
And there's all These knobs you can tweak, but the packages that come from upstream only have the defaults.
[19:09] Allan Jude
So if you want different options, then you can use Poudriere and build your own package repository that's going to have those options and be able to use it for all 50 of your servers.
[19:19] Viktor Petersson
Right.
[19:20] Viktor Petersson
Are those reproducible builds now?
[19:24] Allan Jude
Pretty close to it.
[19:25] Allan Jude
The operating system is fully reproducible and I think the packages are reproducible as well.
[19:31] Allan Jude
You have to, not necessarily by default.
[19:33] Allan Jude
You have to set a couple of options to enable it because there's some downsides to faking the date you set for everything.
[19:40] Allan Jude
But yeah, there's a lot of work that's gone into making that more reproducible so that you can take the same source code and the same tool set and build the identical CD for FreeBSD 13.3 to prove that what the FreeBSD release engineering team gave you came from this exact source code.
[20:02] Viktor Petersson
Yeah, I mean that's super important.
[20:04] Viktor Petersson
So let's talk a bit about what BSD is actually used for, FreeBSD in particular. Netflix is a massive FreeBSD shop, or at least was.
[20:12] Viktor Petersson
I'm not sure they still are.
[20:13] Viktor Petersson
I think they are still using the cdn.
[20:14] Allan Jude
Are they?
[20:15] Viktor Petersson
Are, are they still running their entire CDN on Netflix?
[20:18] Allan Jude
On FreeBSD. When you are browsing Netflix and picking a video, that all runs in Amazon on Linux, but as soon as the video starts playing, that's all coming from FreeBSD.
[20:30] Allan Jude
And so yeah, they built this thing called the Open Connect appliance, which was a customized server that they could send to Internet exchanges, but specifically to your ISP.
[20:42] Allan Jude
So when they first started they were using commercial CDNs, but that got too expensive and it got to be that Netflix individually was such a high percentage of the Internet traffic that there weren't any more bandwidth providers available that they weren't already busy with Netflix traffic.
[21:01] Allan Jude
And so to help ameliorate that, they came up with this concept of the Open Connect appliance, which they would basically send to your ISP and your ISP would install in their data center, so that the most popular videos and TV shows that you're watching would be coming from the box inside their network and wouldn't use any of your ISP's Internet traffic, saving your ISP a bunch of money and saving Netflix a bunch of money.
[21:23] Allan Jude
So it worked out pretty well, and Netflix didn't really invent that concept.
[21:27] Allan Jude
Akamai, one of the bigger CDNs, had come up with that before, but not to the same degree that Netflix did.
[21:34] Allan Jude
And so Netflix chose to use FreeBSD for that because of a couple of reasons.
[21:39] Allan Jude
A, the license, although it turns out they don't have any proprietary code that's actually in the operating system, so that doesn't matter so much for them.
[21:46] Allan Jude
But the big thing for them was the speed with which they could upstream stuff.
[21:51] Allan Jude
So when they made a change, they were able to contribute it back to FreeBSD and then have it included in the mainline within a couple of weeks.
[22:00] Allan Jude
And so they run the development version of FreeBSD, so not the releases, but actually the development version some number of weeks behind live.
[22:08] Allan Jude
I think it's like six weeks or something.
[22:12] Allan Jude
And so with that they're able to find problems, engineer a solution and get it upstream and then have it ship.
[22:19] Allan Jude
Whereas if they worked with Linux, it would be a lot harder to get a change into the kernel in general.
[22:26] Allan Jude
And if they did, by the time a distribution shipped it could be like three years.
[22:34] Allan Jude
They could do their own thing, but that'd be a lot more work.
[22:37] Allan Jude
And just the fact that FreeBSD has all the infrastructure to basically make your own distribution kind of included made it really easy for them, especially because they ended up making quite a few changes.
[22:49] Allan Jude
They worked with NIC vendors including Mellanox, which is now Nvidia, and Chelsio to design this concept of encryption offload.
[23:00] Allan Jude
So one thing that set BSD apart back in the early days of like FreeBSD 4 and like the early Internet, like very early 2000s or in the late 90s, was it was really good at being an FTP server and a web server.
[23:16] Allan Jude
Part of that was this special system call it had called sendfile.
[23:21] Allan Jude
So normally, you know, we have a web server, back then it was mostly Apache running in user space and you know, a request comes in over the network and then we have to wake up the web server and say, hey, there's a new request, do you want to accept it?
[23:34] Allan Jude
Web server says, I'm not too busy, I'll accept it.
[23:36] Allan Jude
And it would go back to the kernel and then eventually a request comes in and it processes it and okay, I need to open this file.
[23:42] Allan Jude
And now I want to read some data from this file.
[23:44] Allan Jude
And now I'm going to ask the kernel, hey, could you send this to the user?
[23:48] Allan Jude
And then when that's done, tell me and I'll read another chunk of the file and then I'll send it to the kernel.
[23:53] Allan Jude
And it involves a lot of copying back and forth because you know, when the web server asks to read a file, it goes to the kernel, hey, could you read this file?
[24:02] Allan Jude
So the kernel reads it up and then has to copy it from the kernel to Apache.
[24:06] Allan Jude
And Apache doesn't look at it just immediately says, hey, kernel, this buffer you just gave me, could you write it back to this socket instead?
[24:15] Allan Jude
And it was all this back and forth.
[24:17] Allan Jude
So with sendfile, the web server or the FTP server can just say, hey, I have this socket and this file descriptor of a file I've opened.
[24:24] Allan Jude
Can you send this range of bytes in it to the socket for me and just tell me when you're done?
[24:30] Allan Jude
And so now, instead of having to cross that boundary back and forth all the time, you can just say, hey, kernel, just read from this file and write to the socket and do it until either there's an error or you finished the amount I asked you to do.
[24:43] Allan Jude
And that sped things up greatly.
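The two approaches contrasted above can be sketched in Python. This is a minimal sketch, not Apache's or FreeBSD's code: `copy_loop`, `send_range`, and `demo` are hypothetical names, and a local socket pair stands in for a client connection. `socket.sendfile` uses the operating system's sendfile call where one is available and otherwise falls back to an ordinary send loop.

```python
import os
import socket
import tempfile


def copy_loop(sock, path, offset, count, chunk=4096):
    """Classic path: read into a user-space buffer, then write it to the socket."""
    sent = 0
    with open(path, "rb") as f:
        f.seek(offset)
        while sent < count:
            buf = f.read(min(chunk, count - sent))  # kernel -> user space copy
            if not buf:
                break
            sock.sendall(buf)                       # user space -> kernel copy
            sent += len(buf)
    return sent


def send_range(sock, path, offset, count):
    """sendfile path: ask the kernel to move a byte range from file to socket."""
    with open(path, "rb") as f:
        return sock.sendfile(f, offset=offset, count=count)


def demo():
    """Send bytes 100..1099 of a temp file both ways and check what arrives."""
    payload = bytes(range(256)) * 64
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    results = []
    try:
        for sender in (copy_loop, send_range):
            a, b = socket.socketpair()
            try:
                sent = sender(a, tmp.name, 100, 1000)
                received = b""
                while len(received) < sent:
                    data = b.recv(4096)
                    if not data:
                        break
                    received += data
                results.append((sent, received == payload[100:1100]))
            finally:
                a.close()
                b.close()
    finally:
        os.unlink(tmp.name)
    return results
```

Both functions deliver the same bytes. The difference is that `copy_loop` pulls every chunk up into the process and pushes it back down, while `send_range` makes one request and lets the kernel do the transfer.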
[24:47] Allan Jude
But then HTTPS took off and now we need to encrypt all of that.
[24:53] Allan Jude
And so that meant we couldn't just have the kernel do it.
[24:55] Allan Jude
You had to still copy it out of the kernel to user space, then feed it through OpenSSL to encrypt it and then send it to the user, which involves copying it back into the kernel.
[25:06] Allan Jude
And this took away a lot of the performance that was available.
[25:09] Allan Jude
So they invented this concept of kernel TLS, where basically your web server, nowadays usually nginx, will accept that connection, negotiate the encryption with the user, and then once it's done the public key part, right, and proven the SSL certificate and all that, it will have a symmetric key, basically a bulk encryption key.
[25:30] Allan Jude
And it can now set that as a socket option on the socket and then do the sendfile system call.
[25:38] Allan Jude
And now the kernel has the key it needs to encrypt the data, and it will actually, through a kernel module of OpenSSL, encrypt the data and send it whenever you write it to the socket.
[25:48] Allan Jude
So then the sendfile system call works again and it allows all this offload so you don't have to copy back and forth.
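From an application's point of view, opting in to kernel TLS can look like the sketch below. It uses Python's `ssl` module rather than nginx or Netflix's code, and `make_ktls_server_context` is a hypothetical helper name. The `ssl.OP_ENABLE_KTLS` flag exists from Python 3.12 onward; whether the kernel actually takes over record encryption depends on how OpenSSL was built and on the operating system, so treat this as a request, not a guarantee. It also does not show sendfile over TLS, since Python's TLS sockets fall back to ordinary sends for that.

```python
import ssl


def make_ktls_server_context():
    """Build a server-side TLS context that asks OpenSSL to use kernel TLS.

    Returns (context, requested), where `requested` says whether this Python
    build exposes the OP_ENABLE_KTLS flag at all.
    """
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Older Python versions (or OpenSSL builds) do not define the flag,
    # so fall back to 0, which leaves the options unchanged.
    ktls_flag = getattr(ssl, "OP_ENABLE_KTLS", 0)
    ctx.options |= ktls_flag
    return ctx, bool(ktls_flag)
```

A real server would still need to load a certificate and key with `ctx.load_cert_chain(...)` before accepting connections. The handshake itself stays in user space, as described above, and only the bulk encryption is handed to the kernel.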
[25:54] Allan Jude
Which became especially a big deal after Spectre and Meltdown, where crossing that kernel user space boundary required a bunch of extra steps and slowed down a bit.
[26:06] Allan Jude
And so that made a big difference there.
[26:08] Allan Jude
But you're still doing the encryption on the cpu, which is fast, but not as fast as it could be.
[26:14] Allan Jude
So Netflix took it one step further after that when they wanted even more performance out of each server where there are dedicated chips on the NICs that can do the encryption.
[26:24] Allan Jude
So instead of that bulk encryption key stopping at the kernel, you send it to the kernel, and the kernel passes it on to the network card driver, and then it writes the data to the network card unencrypted.
[26:35] Allan Jude
And the NIC itself, in its own special CPU, will do the encryption and then send the packet on the network.
[26:44] Allan Jude
And so this allows you to offload all that CPU usage to a dedicated encryption chip on the NIC.
[26:50] Allan Jude
And this allowed Netflix to go from originally, you know, maybe being able to keep a 100-gigabit server busy by sending all this traffic.
[27:00] Allan Jude
Now they're up to doing 800 gigabits per second from a single-CPU server.
[27:07] Viktor Petersson
Wow, that's impressive.
[27:09] Viktor Petersson
Yeah.
[27:09] Viktor Petersson
I remember seeing a talk, I think it was EuroBSDcon I went to many years ago.
[27:14] Viktor Petersson
Netflix did.
[27:15] Viktor Petersson
I think it was.
[27:15] Viktor Petersson
Maybe it was around the time they announced these OpenConnect boxes.
[27:18] Viktor Petersson
And I remember them mentioning, because of the volume of traffic they're running through these boxes, the bugs that they discover are things that nobody else basically would have discovered in 10 years because of the sheer volume.
[27:31] Allan Jude
Yeah, I remember there was one early on in the IPv6 stack, which was basically an overflow.
[27:37] Allan Jude
When you crossed 4 billion of something, it would go back to zero and it caused a problem.
[27:43] Allan Jude
And so it would crash the kernel, but in general, nobody was sending so much traffic that they would hit that in less than a month or two, even on a busy host, and probably even longer on another host.
[27:54] Allan Jude
And so it would happen infrequently enough that nobody really got overly bothered by it.
[27:59] Allan Jude
But when Comcast first switched to having IPv6 for all their customers, and so all those people suddenly switched from V4 to V6, going through Netflix, it would take out a Netflix box every couple of hours.
[28:13] Allan Jude
And so they quickly saw the pattern and were able to find the problem and fix it easily and upstream that into FreeBSD and have an errata notice so that everybody got the fix.
[28:22] Allan Jude
And it was really interesting just to see those kinds of scaling issues.
[28:27] Viktor Petersson
Right, right.
[28:28] Viktor Petersson
All right.
[28:29] Viktor Petersson
And also there was a new announcement.
[28:32] Viktor Petersson
The German Sovereign Tech Fund decided to invest a pretty good chunk of change into the FreeBSD Foundation, which I'm sure was very welcome, to modernize some infrastructure.
[28:41] Viktor Petersson
Because how big is the team, like the BSD active maintainers?
[28:45] Viktor Petersson
Ballpark?
[28:47] Allan Jude
One of the other big differences with FreeBSD is rather than on Linux, where there's Linus as the benevolent dictator for life, FreeBSD has a core team of nine people that's elected every two years by the developer base, and that entire developer base has write access to the repository, whereas in Linux it's Linus and a dozen lieutenants or so that actually have the ability to merge stuff and you just send them requests.
[29:19] Allan Jude
With FreeBSD there are 250ish people that can just directly commit stuff into the repositories.
[29:29] Allan Jude
And so anybody who's committed things in the last 365 days at the day of the election is allowed to vote.
[29:37] Allan Jude
And that decides the nine people that will kind of be in charge of a project for the next two years.
[29:43] Viktor Petersson
Interesting.
[29:44] Viktor Petersson
Okay, yeah.
[29:44] Viktor Petersson
Different governance model entirely then.
[29:46] Allan Jude
Yeah.
[29:47] Allan Jude
And so the Sovereign Tech Agency's Sovereign Tech Fund has invested money in FreeBSD to improve some of the infrastructure and try to burn down the bug pile and get things into a more sustainable setup.
[30:00] Allan Jude
And then there's also funding coming from the Alpha-Omega project to implement additional 2FA and a bunch of other security stuff, including an audit of FreeBSD's native hypervisor.
[30:15] Allan Jude
So rather than KVM that they have on Linux, FreeBSD has something called bhyve, which was originally developed at a vendor, and when they decided they weren't going to use it, they upstreamed it and it became part of FreeBSD.
[30:27] Viktor Petersson
Right, okay, very interesting.
[30:31] Viktor Petersson
All right, let's switch gears to the primary topic for this conversation, which is ZFS.
[30:37] Viktor Petersson
Obviously it's really important to give the backstory of FreeBSD, because that's very tightly coupled with ZFS, or at least in the history of it.
[30:48] Viktor Petersson
So you can run BSD.
[30:51] Viktor Petersson
Sorry, you can run ZFS on Linux these days as well.
[30:55] Viktor Petersson
And maybe we can start there.
[30:58] Viktor Petersson
ZFS on Linux and ZFS on BSD.
[31:01] Viktor Petersson
And BSD is slightly different.
[31:05] Viktor Petersson
Maybe talk a moment about that.
[31:06] Viktor Petersson
How they differ and why they differ.
[31:09] Allan Jude
Well, we can start back at the very quick intro in the beginning.
[31:14] Allan Jude
So Sun Microsystems started developing ZFS because they needed a new file system.
[31:19] Allan Jude
They had tried a couple of times to write new file systems, but they got the teams kind of too big, too quickly, and it got complicated and they ended up never finishing any of them.
[31:29] Allan Jude
So Jeff Bonwick handpicked a fresh graduate coming out of Brown University, and basically the two of them got locked in a room for a while with a whiteboard and came up with a whole new file system from scratch.
[31:44] Allan Jude
And especially trying to solve a lot of the problems that file systems had at the time.
[31:49] Allan Jude
Because if you think back to the early 2000s, most file systems had a limit on the biggest file you could have.
[31:56] Allan Jude
So depending if you're on Windows, it was like 2 gigabytes was the biggest file you could have.
[32:00] Allan Jude
And some of them you couldn't have a partition bigger than 4 gigabytes.
[32:05] Allan Jude
And weird things like this.
[32:07] Allan Jude
Even on Linux, with most of the file systems there was a limit to how big the biggest file could be, how many files you could have, how big the volume could be.
[32:16] Allan Jude
All these things were relatively tight limits that people were actually running into.
[32:21] Allan Jude
Most of them are now big enough that they're harder to run into.
[32:24] Allan Jude
But ZFS was designed with the idea that everything was dynamically allocated so you can't run out of inodes.
[32:31] Allan Jude
If you throw a billion files at it, you're not ever going to get an error.
[32:35] Allan Jude
Most file systems now still have a limit to the number of inodes.
[32:39] Allan Jude
Some of them can dynamically add more after, like you can run a command to adjust the limit and add more.
[32:46] Allan Jude
But they all do still have a limit, whereas ZFS just doesn't because it allocates them dynamically.
[32:51] Allan Jude
It's like you can have as many as you need until you run out of space, right?
[32:55] Allan Jude
But the other big thing was, every other file system that exists right now,
[33:00] Allan Jude
every traditional file system, assumes you have one disk and it only uses that one disk.
[33:06] Allan Jude
If you have multiple disks, you have to use a volume manager like md RAID on Linux, or a hardware RAID controller or something, to turn those multiple disks into one fake disk that you can then feed to the file system.
[33:18] Allan Jude
ZFS has that functionality built in so you can give it all your disks and it actually knows that they are individual disks and can do things with them.
[33:27] Allan Jude
Anyway, after a couple of years, Sun decided to open source that, because everything at Sun was open source, because they found that model was better and it was their competitive advantage over something like Windows.
[33:39] Allan Jude
And so when some FreeBSD developers saw that, they're like, well, you know, Solaris originally spawned out of BSD, so many of the concepts are not that different.
[33:49] Allan Jude
We'd be able to port this.
[33:51] Allan Jude
And so one developer in particular, Pawel Jakub Dawidek, ported ZFS to FreeBSD.
[33:57] Allan Jude
You know, he did most of the initial work in a kind of mad sprint over just less than a month, getting it from "it's not in FreeBSD" to
[34:07] Allan Jude
"you can actually mount it and read and write files" in a ridiculously short amount of time.
[34:15] Allan Jude
So he did that.
[34:16] Allan Jude
And ZFS started coming into FreeBSD and getting a lot of attention from that.
[34:21] Allan Jude
And that continued for a long time. When Oracle bought Sun, they closed off the source code.
[34:28] Allan Jude
So a project called illumos, which is a fork of the last open source version of Solaris, started, and FreeBSD switched to using that as where it would get newer ZFS features from and where it would contribute its changes back to.
[34:41] Allan Jude
But at that time, each version of ZFS was kind of unique to that operating system.
[34:46] Allan Jude
It was up to each operating system to kind of maintain their version and try to keep them in sync.
[34:51] Allan Jude
And then the ZFS on Linux project started around that time as well. Their first release came a while after FreeBSD's, but they had been working on it since around the same time that FreeBSD started, and eventually that became the ZFS on Linux project on GitHub.
[35:10] Allan Jude
And as things changed and some of the companies that were using Solaris and illumos switched to using other things or just, you know, their time came to an end, fewer and fewer new features were being added in the illumos repo, and more of them were appearing in FreeBSD and Linux.
[35:28] Allan Jude
And then getting them back into illumos and then waiting to be able to pull them back in was causing delays, but also features were just being added in different orders and it was getting kind of complicated.
[35:40] Allan Jude
So that student I mentioned who originally helped design ZFS has now been doing it for over 20 years and is the leader of the project.
[35:50] Allan Jude
He came up with this concept of OpenZFS, where we would have one repo that was just the agnostic code, just the core of ZFS and not the integration with each operating system.
[36:02] Viktor Petersson
Right, okay.
[36:03] Allan Jude
And we could compile this in user space and run all the tests, and basically have a common code base that each operating system can pull from, add the glue for their operating system and go from there.
[36:15] Allan Jude
But that didn't really manage to go anywhere because that upstream repo by itself wasn't useful because it didn't have the integration for any operating system.
[36:24] Allan Jude
It was just the common code.
[36:27] Allan Jude
And so it meant somebody would have to do a lot of work to be able to actually get that running and build the infrastructure around it to test it, but to no direct value to them or anybody.
[36:39] Allan Jude
And so that concept was there and the repo existed and we'd push and pull code from it, but it was not really working as intended.
[36:50] Allan Jude
So especially as more development started coming in on the Linux side and things were getting done kind of out of order a bit, it was just getting complicated. You know, eventually they had a bunch of features that FreeBSD didn't have.
[37:05] Allan Jude
But FreeBSD wanted to pull from illumos, and illumos didn't have those features yet.
[37:09] Allan Jude
It got complicated.
[37:10] Allan Jude
So we looked at the idea of having a real version of the OpenZFS repo, which would be one repository that includes the OS glue for all of the operating systems and means it would be one official upstream code repository that'd be the same on all the operating systems.
[37:28] Viktor Petersson
Right.
[37:29] Allan Jude
And so in time for FreeBSD 13.0, that project happened.
[37:34] Allan Jude
And so the OpenZFS repo you see on GitHub now actually is Linux and FreeBSD's code mixed together. Basically, under the module subdirectory, there's an os directory with one subdirectory for Linux and one for FreeBSD, and all the common code lives in module/zfs.
[37:51] Allan Jude
And then the operating systems have their special bits in those other directories, and it means the code is literally exactly the same on FreeBSD and Linux.
[37:58] Allan Jude
It all comes from exactly the same source code.
[38:02] Allan Jude
There is one set of files that is different based on which operating system you're running, mostly just to integrate with those kernels, which are different, and provide what we call the Solaris porting layer, which translates the common code's Solaris calls into Linux
[38:17] Allan Jude
or FreeBSD calls for things like allocating memory from the kernel, which is done differently.
[38:22] Allan Jude
Linux has the slab allocator and FreeBSD has UMA, the Universal Memory Allocator, and so on.
[38:29] Allan Jude
Right, but it means that we have the same code, and we added version numbers so you can actually see, you know, FreeBSD 14 and Ubuntu 24.04 have basically the same version of ZFS.
[38:41] Viktor Petersson
Right, but, remind me, there was some issue that held back ZFS from coming into mainline Linux for quite some time, and it was a license-related issue, if I'm not mistaken.
[38:51] Allan Jude
Right, and so technically that's still true.
[38:54] Allan Jude
ZFS is not included in mainline Linux.
[38:58] Allan Jude
So when Sun released ZFS, they did it under a license called the CDDL, the Common Development and Distribution License, which is basically identical to Mozilla's MPL that Firefox is released under.
[39:11] Allan Jude
So it says the source code is under this CDDL license, which is mostly liberal, it's not viral or anything like the GPL, but in particular it has a clause that when you compile a binary out of it, you can license that binary however you like.
[39:29] Allan Jude
So you can take ZFS and compile a kernel module and license that module as GPL.
[39:37] Allan Jude
The problem is actually an incompatibility with the GPL, where if you make this GPL-licensed ZFS module and link it into your kernel, the GPL license says that should make the source code for ZFS GPL-licensed.
[39:52] Allan Jude
And the CDDL doesn't let you do that, because that would not make any sense.
[39:57] Viktor Petersson
Right.
[39:58] Allan Jude
And so that's where the issues come from.
[40:03] Allan Jude
And so it's not included in upstream.
[40:07] Allan Jude
Kind of like, was it bcachefs that was recently getting added to Linux, and it sounds like maybe it doesn't have a long life left after some arguments between the maintainers.
[40:19] Allan Jude
So ZFS is not included in mainline, but Ubuntu has started shipping it as part of their distro and it turns out nobody's suing them.
[40:28] Allan Jude
And so it has kind of become okay for Ubuntu to basically compile it for you and you load it as a module, whereas, you know, it was always okay for you to compile it yourself and load it on whatever version of Linux you wanted.
[40:43] Allan Jude
But that was a bit of a pain. With DKMS it's not really a pain anymore, but it's about being able to start to get closer to the integration you can see with FreeBSD, where our bootloader fully supports it and our installer knows all about it.
[40:59] Allan Jude
And we build features like boot environments on top of it, which is basically having multiple different root file systems, possibly based on snapshots of your root file system.
[41:08] Allan Jude
Meaning that if you install some new packages and it breaks something, you can just reboot and from your bootloader pick an older version of your root file system and be back to how your system was an hour ago and everything works again.
[41:22] Viktor Petersson
It's crazy.
[41:23] Viktor Petersson
Like one of the things, like I've been using ZFS for well over a decade, I think by now, in one capacity or another.
[41:29] Viktor Petersson
And the thing that always strikes me is this is the fastest, but it's very old.
[41:34] Viktor Petersson
But it's still probably the most sophisticated file system out there in terms of feature sets.
[41:39] Viktor Petersson
And it just keeps chugging along and it's just so reliable, and has so many beautiful features that are not widely available across any other file system, really.
[41:49] Viktor Petersson
Like particularly about snapshots and all these things.
[41:52] Viktor Petersson
Right.
[41:53] Viktor Petersson
So let's talk about some ZFS fundamentals.
[41:56] Viktor Petersson
Like, you obviously are an authority on ZFS in your day job, and maybe for those not familiar with your day job, do a quick spiel about what you actually spend your day on and why you're spending your day doing ZFS work.
[42:09] Allan Jude
Yeah.
[42:10] Allan Jude
So back in 2018, I founded a company called Klara, which is klarasystems.com, and we provide professional development and support services around FreeBSD and ZFS, including ZFS on Linux.
[42:24] Allan Jude
And so customers come to us when they hit a bug in ZFS and need help with it.
[42:29] Allan Jude
We sell support subscriptions, and we develop new ZFS features.
[42:33] Allan Jude
So, for example, we developed a feature to be able to delegate a ZFS data set into an LXD container on Ubuntu so that one of our customers could run unprivileged Docker inside a container, but using ZFS so that Docker would be able to use ZFS snapshots and so on.
[42:51] Allan Jude
So they take your pool in ZFS, and in ZFS you can have...
[42:56] Allan Jude
So it combines all those hard drives you have into one pool of storage, on which you can build multiple file systems that share the free space.
[43:05] Allan Jude
So you don't have the problem of partitioning you used to have.
[43:08] Allan Jude
If you had a 20 terabyte hard drive and you decide, okay, I'm going to split this into four 5 terabyte partitions and run four different workloads on them, and suddenly one of them is only using 1 terabyte and another one would like to use 6.
[43:20] Allan Jude
But now that's not where your partitions are.
[43:22] Allan Jude
And you have this problem of like, oh, I don't have enough space over here, but too much space over here.
[43:27] Allan Jude
By pooling your storage in ZFS, it means that you can just use all the space where you need it, and so you can create one of these virtual file systems and then give it to a container.
[43:37] Allan Jude
And inside that container, the fake root user can only see that one and none of the rest.
[43:43] Allan Jude
So they use this to run a CI system for many different customers.
[43:47] Allan Jude
So each Docker workload runs inside a container and can't see the other ones, but does have access to actually create and destroy snapshots and clones and do all the stuff Docker needs to do to run really quickly, taking advantage of the copy-on-write features of ZFS.
[44:03] Allan Jude
And so, yeah, they came to us needing this.
[44:06] Allan Jude
We developed it and then we also upstreamed it, so it's included by default in OpenZFS.
[44:12] Allan Jude
So then when Ubuntu 22.04 came out, it included that feature and they were able to just run stock Ubuntu and have it in production.
[44:20] Viktor Petersson
Very cool.
[44:21] Viktor Petersson
So let's talk a bit about fundamentals in ZFS, and you kind of alluded to it already, you mentioned pools.
[44:28] Viktor Petersson
There are like three building blocks.
[44:30] Viktor Petersson
I guess they're vdevs, pools and datasets.
[44:34] Viktor Petersson
Right.
[44:34] Viktor Petersson
And let's speak a bit.
[44:36] Viktor Petersson
Well, maybe there's a fourth one that I'm missing, but let's speak about the fundamentals there.
[44:40] Viktor Petersson
So he's like, get big data way.
[44:43] Allan Jude
Yeah.
[44:43] Allan Jude
So like we said, ZFS is a volume manager, so it can also basically do the RAID for you.
[44:49] Allan Jude
And so a vdev is one of your kind of RAID components.
[44:54] Allan Jude
So you take all the hard drives you have and then you can combine them in groups called vdevs, and that vdev can have a transform on it that makes it do something.
[45:03] Allan Jude
So if you have no transform, you just have each disk as a separate vdev, you've basically done a RAID 0, and it means if you lose any drive, all your data is gone, and that's bad.
[45:14] Allan Jude
So you can do a mirror vdev where you're going to have pairs of disks, and for every block you write to one disk,
[45:23] Allan Jude
the second disk in that mirror is going to have an exact copy of it.
[45:26] Allan Jude
And we'll come back to why ZFS does that better, using checksums in a minute.
[45:32] Allan Jude
But then it also has what it calls RAID-Z1, which is when you combine any number of drives in a group and it can withstand the loss of any one drive and keep working.
[45:43] Allan Jude
That's the same as a RAID 5, except it has one slight advantage.
[45:47] Allan Jude
So with RAID 5, there's this flaw called the write hole.
[45:53] Allan Jude
When you update something, it's going to write the new data to the hard drive, and then it has to write the updated parity to a different hard drive, so that if it loses that drive, using the parity and the other rows on the remaining hard drives, it can calculate
[46:09] Allan Jude
what data was on that missing drive.
[46:11] Allan Jude
Basically, it adds up the chunks of all of the drives together and gets a value and writes that in the parity.
[46:17] Allan Jude
And then if one drive is missing, it can take that thing, subtract what's on every other drive, and get back the value that would have been on the missing hard drive.
[46:26] Allan Jude
But because it updates the data on one hard drive and then updates the parity separately, if the power goes out between those two steps, then the parity is wrong, but the RAID controller doesn't know it.
[46:41] Allan Jude
And so when it boots up, it takes that parity, subtracts the remaining drives, and gets back the wrong answer.
[46:47] Allan Jude
And so now the block you just wrote isn't there. It's some gibberish, which is a combination of the new data and the old parity.
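A small Python sketch of the parity arithmetic and the write hole described above. Allan describes it as adding and subtracting; conventional RAID 5 implementations use XOR, which is its own inverse, so that is what this sketch uses. The block contents and the helper names `xor_blocks` and `reconstruct` are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing block from the surviving blocks plus parity."""
    return xor_blocks(surviving + [parity])


# A stripe of three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Normal case: drive 1 is lost and its block is rebuilt correctly.
rebuilt_ok = reconstruct([data[0], data[2]], parity)

# Write hole: block 0 is rewritten, but power fails before parity is updated.
data[0] = b"ZZZZ"
stale_parity = parity

# Drive 1 is then lost; rebuilding from stale parity yields the wrong bytes.
rebuilt_bad = reconstruct([data[0], data[2]], stale_parity)
```

Here `rebuilt_ok` matches the original block on drive 1, while `rebuilt_bad` does not, and nothing in the parity scheme itself signals that the result is wrong.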
[46:55] Allan Jude
Right. Because ZFS is transactional, it doesn't have that problem.
[46:58] Allan Jude
And we'll get back to how that works in a second.
[47:01] Allan Jude
But it also has RAID-Z2, which means you combine a bunch of drives and any two drives can go missing and it works.
[47:07] Allan Jude
So that's like RAID 6.
[47:10] Allan Jude
That's available on some hardware controllers.
[47:12] Viktor Petersson
And how many drives do you need for all of these?
[47:15] Viktor Petersson
Well, vdevs, I guess.
[47:16] Viktor Petersson
How many vdevs do you need for each of these to work?
[47:18] Viktor Petersson
I mean, let's call a vdev a physical drive, I presume, in most practical use cases.
[47:23] Allan Jude
So technically the vdev is generally the transform, the group of a bunch of physical drives, right?
[47:27] Viktor Petersson
Right.
[47:28] Allan Jude
So for a RAID-Z1, technically you need at least two drives, one for data and one for the parity.
[47:35] Allan Jude
Although it doesn't make sense to do that with fewer than three, because you might as well just do a mirror if you only have two.
[47:40] Allan Jude
Yeah, and the same with RAID-Z2.
[47:44] Allan Jude
Technically you only need three drives, but it probably doesn't make sense with fewer than four.
[47:50] Allan Jude
One of the big additions that ZFS has is RAID-Z3, which allows you to group together some drives and lose any three of them and keep going.
[47:58] Allan Jude
That isn't something any hardware RAID controller I know of can actually do.
[48:03] Viktor Petersson
And what's the minimum there?
[48:05] Allan Jude
All you would need technically is four, but you'd probably want six or more for it to make sense, where you're going to lose half of them and keep going.
[48:14] Allan Jude
Because the big thing is, out of that set, whatever the number is, 1, 2 or 3, that many of the drives are basically not going to be used to store your data.
[48:24] Allan Jude
It's just going to store the parity to be able to reconstruct the data.
[48:27] Viktor Petersson
Yeah.
[48:27] Viktor Petersson
Okay.
[48:27] Allan Jude
So if you use four drives in a RAID-Z3, you're only going to get the space of one hard drive.
[48:32] Viktor Petersson
Right.
[48:34] Allan Jude
Another thing that's different than most hardware RAID is it doesn't have dedicated parity drives.
[48:38] Allan Jude
Right.
[48:39] Allan Jude
It's not going to be that, you know, if you have six drives in a RAID-Z3, three of them are only going to contain parity.
[48:44] Allan Jude
ZFS spreads the parity out across the drives so that you get more of the bandwidth of the drives when you're writing.
[48:53] Allan Jude
So that provides that.
[48:54] Allan Jude
And then.
[48:55] Allan Jude
So each vdev or group of drives is responsible for its integrity.
[48:59] Allan Jude
So it has its own RAID there.
[49:02] Allan Jude
But then we combine multiple vdevs if you have enough hard drives, and those are basically just striped together.
[49:08] Allan Jude
And so if you lose any one vdev, you've broken the whole pool.
[49:12] Allan Jude
But, you know, if you've used RAID-Z3 in each of your vdevs then it'll be fine.
[49:17] Allan Jude
And so, you know, a big server I built many years ago is basically a petabyte of usable space made up of a bunch of vdevs of 12 terabyte hard drives in a RAID-Z3.
[49:30] Allan Jude
And so I think it was like 9 or 10 of these vdevs.
[49:34] Allan Jude
And that way you get all this space and, you know, as long as you don't lose more than three drives from any group of 12, then everything will be fine.
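The capacity arithmetic from this part of the conversation can be sketched in a few lines of Python. This is a rough model only: it counts whole drives and ignores padding, metadata overhead and the difference between TB and TiB, and the function names are made up for illustration.

```python
def raidz_data_drives(drives_per_vdev: int, parity: int) -> int:
    """Drives' worth of space left for data in one RAID-Z vdev."""
    if parity not in (1, 2, 3):
        raise ValueError("RAID-Z parity level must be 1, 2 or 3")
    if drives_per_vdev <= parity:
        raise ValueError("need more drives than the parity level")
    return drives_per_vdev - parity


def pool_usable_tb(vdevs: int, drives_per_vdev: int, parity: int, drive_tb: int) -> int:
    """Approximate usable space of a pool that stripes identical RAID-Z vdevs."""
    return vdevs * raidz_data_drives(drives_per_vdev, parity) * drive_tb


# Four drives in a RAID-Z3 leave the space of a single drive.
four_wide_z3 = raidz_data_drives(4, 3)

# The server described above: groups of twelve 12 TB drives in RAID-Z3.
nine_vdevs_tb = pool_usable_tb(9, 12, 3, 12)
ten_vdevs_tb = pool_usable_tb(10, 12, 3, 12)
```

With these assumptions, nine vdevs come to 972 TB and ten come to 1080 TB, which is consistent with "basically a petabyte" from 9 or 10 vdevs.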
[49:45] Allan Jude
And, you know, when a hard drive dies, you replace it with a new one and it rebuilds and you're okay.
[49:52] Viktor Petersson
Right.
[49:53] Allan Jude
So that's the vdev layer, and then above that you have the pool, which is basically giving you all the usable space of all those drives put together.
[50:02] Allan Jude
And it means that out of that you can create these virtual file systems called datasets.
[50:07] Allan Jude
And they're the same thing as kind of like an ext4 file system or whatever.
[50:13] Allan Jude
And you could have multiple of them, but each one will show up as a dynamic size where basically they are the amount of data they contain plus all the remaining free space in the pool.
[50:22] Viktor Petersson
Right, right.
[50:23] Allan Jude
So if you have five of these file systems, each one will say that it has 20 terabytes of space left.
[50:30] Allan Jude
But if you write to any one of them, the free space of all five of them will go down by the amount you changed.
[50:35] Viktor Petersson
So kind of over-provisioning, in a sense.
[50:37] Allan Jude
Yeah, yeah, except we're not actually lying and saying that we have more on any of them.
[50:42] Allan Jude
And as you write stuff to them, the free space on all of them will go down.
[50:46] Allan Jude
But as you delete something on one of them, the free space on all of them will go back up.
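A toy Python model of the shared free space described here. It is not how ZFS accounts for space internally; it ignores quotas, reservations and snapshots, and the `Pool` class and its method names are made up for illustration. The point is only that every dataset reports the same pool-wide free space.

```python
class Pool:
    """Toy model: datasets draw from one shared pool of free space."""

    def __init__(self, capacity_tb: float) -> None:
        self.capacity_tb = capacity_tb
        self.used_tb: dict[str, float] = {}

    def create_dataset(self, name: str) -> None:
        self.used_tb[name] = 0.0

    def free_tb(self) -> float:
        return self.capacity_tb - sum(self.used_tb.values())

    def available(self, name: str) -> float:
        # Every dataset sees the same number: the pool's remaining free space.
        if name not in self.used_tb:
            raise KeyError(name)
        return self.free_tb()

    def write(self, name: str, tb: float) -> None:
        if tb > self.free_tb():
            raise OSError("pool is out of space")
        self.used_tb[name] += tb

    def delete(self, name: str, tb: float) -> None:
        self.used_tb[name] = max(0.0, self.used_tb[name] - tb)


pool = Pool(capacity_tb=20.0)
names = ["a", "b", "c", "d", "e"]
for n in names:
    pool.create_dataset(n)

before = [pool.available(n) for n in names]        # all five report 20 TB
pool.write("a", 6.0)
after_write = [pool.available(n) for n in names]   # all five drop together
pool.delete("a", 2.0)
after_delete = [pool.available(n) for n in names]  # all five rise together
```

Unlike fixed partitions, nothing has to be resized when one workload grows and another shrinks.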
[50:50] Viktor Petersson
Right.
[50:51] Allan Jude
Unless you have snapshots, which we'll get to in a second.
[50:55] Allan Jude
And so that allows you to have different settings on each dataset as well.
[51:00] Allan Jude
So ZFS has a feature called transparent compression, where you can set compression on the dataset that contains my home directory, but not the dataset that contains my music or something.
[51:10] Viktor Petersson
Right.
[51:12] Allan Jude
And what this does is, as you write the data to ZFS, it will use a fast compressor like LZ4 or Zstandard to shrink the size of it.
[51:22] Allan Jude
And then on disk, it'll actually store the compressed version, and then when you read the file back, it'll decompress it first before it gives it to you.
[51:28] Allan Jude
So the application doesn't have to know that there's compression happening, but the file system takes care of it for you.
[51:34] Allan Jude
And then all your, you know, the many, many copies of the ZFS source code I have on my server take a lot less space because source code is text and compresses really well.
[51:44] Viktor Petersson
Right.
[51:45] Viktor Petersson
And what's the overhead?
[51:47] Viktor Petersson
Let's talk about compression for a second, because I think that's a super interesting one.
[51:50] Viktor Petersson
Like what's the.
[51:51] Viktor Petersson
On a modern server, let's talk about Zstandard, like a basic compression, or the basic one, I think, in ZFS.
[51:59] Viktor Petersson
What's the overhead, CPU wise, I presume would be the big impact.
[52:03] Viktor Petersson
Right?
[52:04] Allan Jude
Right.
[52:04] Allan Jude
So that's the interesting part.
[52:05] Allan Jude
With LZ4, the overhead is quite minimal.
[52:09] Allan Jude
It can compress multiple gigabytes per second per core.
[52:13] Allan Jude
So as long as you have a couple of cores, you're probably going to run out of performance on your storage, even if it's NVMe, before you run out of CPU time.
[52:21] Viktor Petersson
Interesting.
[52:23] Allan Jude
And it can actually end up being faster, because if you have 100 megabytes of source code you're trying to write, and a regular spinning hard drive that can write 100 megabytes per second, then it would take you a second to save that 100 megabytes of source code.
[52:39] Allan Jude
But if you can compress it at 2 gigabytes per second, it takes a fraction of a second to compress it, and if it compresses down to 50 megabytes, you write it to the hard drive and it only took half a second.
[52:53] Allan Jude
So it means technically, if you had more source code, you could write 200 megabytes a second to this hard drive if your compression ratio is 2x.
[53:01] Allan Jude
So in exchange for a little bit of CPU time, you could make your hard drive seem faster only if the data is actually compressible.
[53:08] Allan Jude
And the same thing applies when you read.
[53:10] Allan Jude
Now, when you read, you only have to read 50 megabytes in order to read all 100 megabytes of the data.
[53:16] Allan Jude
And LZ4 decompresses at over 10 gigabytes per second per core.
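The arithmetic in this example can be written out as a short Python sketch. It assumes compression and disk I/O overlap, so the slower of the two stages sets the total time; that pipelining assumption and the function names are mine, while the numbers are the ones quoted in the conversation.

```python
from typing import Optional


def write_seconds(
    logical_mb: float,
    disk_mb_per_s: float,
    ratio: float = 1.0,
    compress_mb_per_s: Optional[float] = None,
) -> float:
    """Time to write `logical_mb` of data that compresses by `ratio`.

    Assumes compression and disk writes are pipelined, so the slower
    stage determines the total time.
    """
    disk_time = (logical_mb / ratio) / disk_mb_per_s
    if compress_mb_per_s is None:
        return disk_time
    compress_time = logical_mb / compress_mb_per_s
    return max(disk_time, compress_time)


def effective_mb_per_s(logical_mb: float, seconds: float) -> float:
    """Throughput as the application sees it, in logical megabytes per second."""
    return logical_mb / seconds


# 100 MB of source code on a 100 MB/s spinning disk, uncompressed.
plain = write_seconds(100, 100)

# Same data, 2x compression ratio, compressor running at 2 GB/s.
compressed = write_seconds(100, 100, ratio=2.0, compress_mb_per_s=2000)

apparent_speed = effective_mb_per_s(100, compressed)
```

Under these assumptions the uncompressed write takes 1 second, the compressed write takes 0.5 seconds, and the disk appears to run at 200 MB/s. If the stages did not overlap, the 0.05 seconds of compression time would be added on top.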
[53:21] Viktor Petersson
Right.
[53:22] Allan Jude
And so it can end up making a lot of workloads faster.
[53:26] Viktor Petersson
Interesting.
[53:26] Viktor Petersson
Are there any?
[53:28] Viktor Petersson
Because I believe compression is disabled by default.
[53:31] Allan Jude
So as of version, I think, 2.2, compression is on by default.
[53:35] Allan Jude
It's now to the point where it's so fast it doesn't have a penalty if you try to compress things like MP3s or video files that are incompressible, or even encrypted files.
[53:49] Allan Jude
Computers are so fast now that it basically doesn't have a high enough cost.
[53:53] Viktor Petersson
Right.
[53:54] Allan Jude
There's also a feature in ZFS called early abort, where it can notice:
[53:58] Allan Jude
"I've been compressing this chunk for a little bit and I notice that it's not compressing enough that it's going to be smaller enough to matter."
[54:06] Allan Jude
Basically, if it becomes obvious it's not going to save you at least 12%, we stop and don't finish trying and just store it uncompressed.
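A hedged Python sketch of the "only keep it if it saves enough" decision. This is not the early-abort heuristic itself: it compresses the whole block and then decides, whereas early abort as described gives up partway through. The standard library's `zlib` stands in for LZ4, which is not in the standard library, and the one-eighth threshold is an assumption chosen to approximate the roughly 12% mentioned above.

```python
import os
import zlib

# Assumption: require about a one-eighth saving, close to the ~12% quoted above.
MIN_SAVING = 1 / 8


def store_block(block: bytes) -> tuple[str, bytes]:
    """Return ("compressed", data) if compression saves enough, else ("raw", block)."""
    compressed = zlib.compress(block, 1)  # fast level, standing in for LZ4
    if len(compressed) <= len(block) * (1 - MIN_SAVING):
        return "compressed", compressed
    return "raw", block


def load_block(kind: str, payload: bytes) -> bytes:
    """Undo store_block so the caller never sees the on-disk form."""
    return zlib.decompress(payload) if kind == "compressed" else payload


text_like = b"int main(void) { return 0; }\n" * 100
random_like = os.urandom(4096)  # behaves like already-compressed or encrypted data

kind_text, stored_text = store_block(text_like)
kind_random, stored_random = store_block(random_like)
```

Repetitive text comes back tagged as compressed and much smaller, while random bytes come back tagged as raw and unchanged. Either way `load_block` returns the original bytes, which is the transparent part.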
[54:14] Allan Jude
Okay, so there's like.
[54:15] Viktor Petersson
Okay, interesting.
[54:16] Viktor Petersson
Yeah.
[54:16] Viktor Petersson
Because, I mean, if you are chucking compressed files at the compressor, that's not going to do much.
[54:21] Viktor Petersson
Yeah, that's just going to happen.
[54:22] Allan Jude
It's just going to use up a bunch of CPU time.
[54:24] Allan Jude
Now LZ4 is so cheap it doesn't make a big difference, but it will also notice and give up early in order to save some of that CPU time.
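
The give-up-early idea can be sketched as follows. This is not ZFS code: Python's standard library has no LZ4, so `zlib` at level 1 stands in for it, and the chunk size, the point at which the check starts, and the exact 1/8 threshold are assumptions chosen for illustration.

```python
import os
import zlib

MIN_SAVINGS = 0.125  # assumed 1/8; the conversation says "at least 12%"
CHUNK = 4096         # hypothetical interval at which progress is checked


def compress_with_early_abort(block: bytes):
    """Compress block, but give up once it is clear it will not shrink enough.

    Returns ("compressed", data) or ("raw", block).
    zlib level 1 is a stand-in for LZ4 here.
    """
    comp = zlib.compressobj(1)
    out = bytearray()
    for start in range(0, len(block), CHUNK):
        out += comp.compress(block[start:start + CHUNK])
        out += comp.flush(zlib.Z_SYNC_FLUSH)  # force output so we can measure it
        consumed = min(start + CHUNK, len(block))
        # After a few chunks, abort if the output is not on track to save enough.
        if consumed >= 4 * CHUNK and len(out) > consumed * (1 - MIN_SAVINGS):
            return "raw", block
    out += comp.flush()
    if len(out) > len(block) * (1 - MIN_SAVINGS):
        return "raw", block
    return "compressed", bytes(out)


source_like = b"int main(void) { return 0; }\n" * 4000
random_like = os.urandom(128 * 1024)  # behaves like encrypted or already-compressed data
print(compress_with_early_abort(source_like)[0])
print(compress_with_early_abort(random_like)[0])
```

The incompressible input is abandoned after the first few chunks, so most of the CPU time that a full compression pass would cost is never spent.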
[54:31] Viktor Petersson
Right, okay, so there's what you're saying essentially there is no reason not to use it.
[54:36] Allan Jude
Exactly.
[54:37] Allan Jude
It's now to the point where you might as well just have LZ4 on everywhere and it will help you where it can and it won't make a big difference where it doesn't help.
[54:44] Viktor Petersson
Right.
[54:45] Viktor Petersson
I would imagine back in the early days that was not quite the case.
[54:48] Allan Jude
Right.
[54:48] Allan Jude
Well, especially before LZ4 there was basically only LZMA, or no, not LZMA, LZJB, which wasn't as good.
[54:57] Allan Jude
And gzip.
[54:59] Allan Jude
And gzip can compress better, but it is slow.
[55:03] Allan Jude
Like gzip will top out at like 50 or 100 megabytes per second per core.
[55:07] Allan Jude
And yes, you will run out of CPU before you run out of storage bandwidth.
[55:12] Allan Jude
So gzip will really hurt you as far as CPU usage.
[55:14] Allan Jude
Zstandard supports 19 different levels in ZFS, and so you can tune it from only use a bit of the CPU to use all the CPU.
[55:24] Allan Jude
But if you turn it up to 19, it will only compress at like single-digit megabytes per second.
[55:30] Allan Jude
So you only want to use that one when it's like I'm writing this once, I know it's compressible and I'm just going to keep it forever.
[55:36] Allan Jude
And it'll be worth it to spend the CPU time to compress it that much.
[55:41] Viktor Petersson
Right.
[55:42] Allan Jude
But we also added a feature, I think in 2.2 or 2.3, where if you enable a really high Zstandard level, it will quickly try to compress it with LZ4 first, and if LZ4 couldn't compress it, we assume Zstandard won't be able to compress it and we just won't even try.
[55:59] Allan Jude
Whereas if it does compress it, then we'll let it.
[56:01] Allan Jude
And so it allows you to skip trying the heavy compression on files that are definitely not compressible, again to avoid the overhead and let you just kind of turn it on without having to worry that it's going to spend a lot of time trying to compress things that aren't important or not compressible.
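
A rough sketch of that "cheap probe first" heuristic, using only the Python standard library. The real feature pairs LZ4 with a high Zstandard level; neither is in the standard library, so `zlib` level 1 plays the cheap compressor and `lzma` plays the expensive one. The function name and the pass/fail rule for the probe are assumptions.

```python
import lzma
import os
import zlib


def heavy_compress_if_worthwhile(block: bytes):
    """Run an expensive compressor only if a cheap one could shrink the block.

    zlib level 1 is a stand-in for LZ4; lzma is a stand-in for a high
    Zstandard level. Returns ("heavy", data) or ("raw", block).
    """
    probe = zlib.compress(block, 1)
    if len(probe) >= len(block):
        # The cheap pass gained nothing, so assume the heavy pass would not either.
        return "raw", block
    return "heavy", lzma.compress(block)


logs = b"2024-12-01 12:00:00 INFO request served in 12ms\n" * 2000
video_like = os.urandom(64 * 1024)  # stand-in for already-compressed media
print(heavy_compress_if_worthwhile(logs)[0])
print(heavy_compress_if_worthwhile(video_like)[0])
```

The design trade-off is that a block which only the heavy compressor could shrink would be missed, in exchange for never spending heavy-compressor time on data that is plainly incompressible.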
[56:18] Viktor Petersson
Right.
[56:19] Viktor Petersson
Okay, so we talk about data sets.
[56:21] Viktor Petersson
One thing about datasets that I found important is that you can run encrypted datasets on top of a pool.
[56:27] Allan Jude
Right.
[56:27] Viktor Petersson
So you could have some non-encrypted or you could have some encrypted. Maybe speak a bit about how that works, because that's a relatively unique feature of ZFS as well.
[56:35] Allan Jude
Yeah, so most block level systems have some kind of encryption.
[56:41] Allan Jude
So on Linux there's LUKS and on FreeBSD there's GELI.
[56:44] Allan Jude
And these use an algorithm called AES-XTS, which is whole disk encryption, and they use a key and they encrypt the data on the disk.
[56:54] Allan Jude
With ZFS, we wanted more granularity.
[56:58] Allan Jude
So yeah, we have these multiple different data sets.
[57:00] Allan Jude
And maybe I want my operating system non encrypted so it's easier to boot from and so on, but I want my home directory encrypted.
[57:08] Allan Jude
And so it uses AES-GCM, which is the encryption normally used for things like HTTPS, and it encrypts not the structural information.
[57:20] Allan Jude
So like the name of the data set and the ZFS specific bits are not encrypted because it needs to be able to work on those.
[57:26] Allan Jude
But basically the file names and the actual data in the files are all encrypted.
[57:33] Allan Jude
And so earlier we mentioned the checksum.
[57:37] Allan Jude
So ZFS stores a checksum of every block so they can verify that the block didn't get corrupted by your hard drive.
[57:43] Allan Jude
Or as we mentioned, when you're doing a mirror in traditional mirroring with hardware RAID or md RAID and so on, if the two sides of the mirror don't match, there's no way to tell which one is right and which one's wrong.
[57:56] Allan Jude
Yeah, in ZFS, because we store a checksum separate from the data, we can read both pieces of data and see which one matches the checksum and know, oh, the SHA256 says copy two is the right one.
[58:06] Allan Jude
And so we'll repair copy one with the right copy.
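
A toy version of that self-healing read, assuming a two-way mirror held in memory. The function name is made up, and real ZFS keeps the checksum in the parent block pointer rather than passing it in as an argument.

```python
import hashlib


def read_with_self_heal(copies: list, expected_sha256: bytes) -> bytes:
    """Return the copy that matches the checksum and repair the ones that do not.

    copies is a list of bytearrays, one per side of the mirror (toy model).
    """
    good = None
    for copy in copies:
        if hashlib.sha256(copy).digest() == expected_sha256:
            good = bytes(copy)
            break
    if good is None:
        raise IOError("no copy matches the stored checksum")
    for copy in copies:
        if bytes(copy) != good:
            copy[:] = good  # overwrite the damaged side with the verified data
    return good


original = b"important data"
checksum = hashlib.sha256(original).digest()       # stored separately from the data
mirror = [bytearray(b"importent data"), bytearray(original)]  # copy one has a flipped byte
print(read_with_self_heal(mirror, checksum) == original, mirror[0] == mirror[1])
```

Because the checksum lives apart from the data, there is an independent referee for which side is right, which is exactly what a plain mirror lacks.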
[58:11] Allan Jude
When we encrypt files we split that checksum in half.
[58:14] Allan Jude
We keep the first half of the SHA256.
[58:17] Allan Jude
And in the second half, we keep the message authentication code from the encryption.
[58:23] Allan Jude
This allows us to verify both bits, the raw encrypted data on disk, that it's not corrupted.
[58:30] Allan Jude
And the MAC ensures that when we decrypt the data, we got the right data back and that it hasn't been tampered with.
[58:36] Allan Jude
This means that unlike whole disk encryption, where if you mount the file system, you have to have entered the decryption key and now everything's decrypted.
[58:46] Allan Jude
With ZFS, if you have your home directory and somebody else's home directory, those are encrypted with different keys.
[58:52] Allan Jude
And when the other user is not logged in, they can unload their key and that data is actually at rest and encrypted and can't be accessed by anybody until they come and enter the key, maybe by logging in over SSH and using their passphrase.
[59:06] Allan Jude
And so that allows that data to actually be protected.
[59:08] Allan Jude
Whereas if you just used whole hard drive encryption, that really only protects you against somebody stealing the physical machine.
[59:14] Allan Jude
And then when they reboot and try to access it, they don't have the key and they can't decrypt the hard drive.
[59:18] Allan Jude
Whereas with ZFS, you can unload this key.
[59:21] Allan Jude
And now nobody can access this data without entering the key first.
[59:26] Allan Jude
So it allows you to keep the data at rest and have the protection from encryption without having to power off the server to get the protection.
[59:35] Viktor Petersson
Right.
[59:37] Allan Jude
But importantly, because we split that checksum in two, it means that we can still run the repair of a failed hard drive on the encrypted data without needing the encryption key.
[59:50] Viktor Petersson
Oh, right, okay.
[59:51] Allan Jude
And so it can, it knows from the first half the checksum that, hey, that block is not right.
[59:57] Allan Jude
The hard drive flipped a bit or something, or, you know, a drive failed and we put in a new one, and we can rebuild all the data and get the array back healthy without ever having to have the encryption key.
[01:00:09] Allan Jude
So it means the storage administrator doesn't need all the encryption keys, which was a big advantage.
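
The split-checksum idea can be illustrated with a small sketch. It is not the on-disk format: the ciphertext is just opaque bytes, and HMAC-SHA256 stands in for the AES-GCM authentication tag because the Python standard library has no AES-GCM. All function names are hypothetical.

```python
import hashlib
import hmac


def make_split_checksum(ciphertext: bytes, mac: bytes) -> bytes:
    """First 16 bytes: truncated SHA-256 of the ciphertext. Last 16 bytes: the MAC."""
    return hashlib.sha256(ciphertext).digest()[:16] + mac[:16]


def verify_without_key(ciphertext: bytes, stored: bytes) -> bool:
    """What a scrub or rebuild can check: is the on-disk ciphertext intact?"""
    return hmac.compare_digest(hashlib.sha256(ciphertext).digest()[:16], stored[:16])


def verify_with_key(ciphertext: bytes, stored: bytes, key: bytes) -> bool:
    """What a reader holding the key can check: is the data authentic?

    HMAC-SHA256 is a stand-in for the AES-GCM tag in this sketch.
    """
    mac = hmac.new(key, ciphertext, hashlib.sha256).digest()[:16]
    return hmac.compare_digest(mac, stored[16:])


key = b"k" * 32
ciphertext = b"opaque encrypted bytes"
stored = make_split_checksum(ciphertext, hmac.new(key, ciphertext, hashlib.sha256).digest())
print(verify_without_key(ciphertext, stored), verify_with_key(ciphertext, stored, key))
```

Only `verify_with_key` needs the key, which is why a storage administrator can repair the pool using the first half alone.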
[01:00:14] Viktor Petersson
Right.
[01:00:15] Allan Jude
So a use case that came up with this for one of our customers being a law firm was, you know, we have all this discovery evidence from a case.
[01:00:22] Allan Jude
The case is now over.
[01:00:24] Allan Jude
We have to destroy that evidence and not have a copy of it anymore.
[01:00:28] Allan Jude
But, you know, we can't be sure that our flash drives are actually going to erase anything we erase.
[01:00:34] Allan Jude
So overwriting is not necessarily going to be good enough.
[01:00:36] Allan Jude
And if we use whole disk encryption, we'd have to reformat the whole hard drive in order to actually ensure that it's gone.
[01:00:43] Allan Jude
But by having a different encryption key for each case's data set, they can just unmount it and destroy the key.
[01:00:50] Allan Jude
That data is not recoverable now, and it makes their life much easier for doing that.
[01:00:55] Viktor Petersson
So you mentioned two types of, I guess, unlocking.
[01:00:59] Viktor Petersson
There's the key and there's the passphrase.
[01:01:02] Viktor Petersson
One thing that at least I've been bothered with, both in the Linux world and in the BSD world, I guess, is that in most other operating systems you can use a TPM to unlock and lock and do disk encryption.
[01:01:15] Viktor Petersson
That is not quite the case.
[01:01:17] Viktor Petersson
It's kind of possible on Linux.
[01:01:19] Viktor Petersson
I don't know how it is in the BSD world, but do you want to say a few words about the narrative around that?
[01:01:25] Viktor Petersson
If that's something that is being worked on or particular encryption, I guess, yeah.
[01:01:32] Allan Jude
I've not really looked at the full disk encryption stuff in a while because I've been more focused on zfs.
[01:01:37] Allan Jude
But I did do support for the full disk encryption in FreeBSD's bootloader years ago.
[01:01:45] Allan Jude
So yeah, there is interest in doing the TPM.
[01:01:47] Allan Jude
So what ZFS has right now is you can load the key basically by typing it in, by having a path to a file, or by having a URL to something like an API.
[01:01:58] Allan Jude
So you can actually have, as the machines boot up, they call some other machine and prove with a certificate or something that, hey, I'm part of your network.
[01:02:08] Allan Jude
And that server says, okay, here's the key to decrypt that data.
[01:02:11] Viktor Petersson
So you type it in.
[01:02:14] Allan Jude
Yeah, but we would definitely be interested in using the TPM to store the key material.
[01:02:22] Allan Jude
Just no one's come with that use case.
[01:02:24] Allan Jude
So if you would like ZFS whole disk or dataset encryption to use the TPM, then definitely go to klarasystems.com and click on the ZFS button and tell us you want that feature, and we'll talk to you and definitely get that built, because there's some interest in it.
[01:02:40] Allan Jude
We just would need somebody who has a commercial interest in it.
[01:02:43] Viktor Petersson
Yeah, no, I had a more personal interest.
[01:02:46] Viktor Petersson
I use ZFS on just my home server, my Boxbox server.
[01:02:54] Viktor Petersson
And just when you reboot the server and you have to manually type in a passphrase and you don't have a KVM or something hooked up to it, that is a pain point, right?
[01:03:02] Allan Jude
Yeah, exactly.
[01:03:03] Allan Jude
And that's why some people use that HTTP thing to fetch the key from another machine that's going to have stayed up, hopefully, or whatever, some kind of zero trust environment where before you can have this encryption key, you have to prove that you're not a compromised machine and you're the machine that's supposed to have this decryption key and so on.
[01:03:21] Viktor Petersson
Yeah, exactly.
[01:03:23] Allan Jude
Cool.
[01:03:24] Viktor Petersson
We talked a bit about what we kind of alluded to, scrubbing.
[01:03:27] Viktor Petersson
I guess maybe you should just knock that out as well: what scrubbing is and how it's different from other file systems.
[01:03:32] Allan Jude
Right.
[01:03:33] Allan Jude
So that kind of depends on checksumming.
[01:03:35] Allan Jude
So as ZFS writes any block of data you give it, it stores a checksum.
[01:03:41] Allan Jude
So in the indirect block where it actually says, you know, the fifth megabyte of that file is on this offset on disk, it has information about it like when we wrote it and the checksum.
[01:03:53] Allan Jude
And so a scrub is basically a patrol read where it goes through your whole disk and reads every block and checks that the checksum is still what it's supposed to be.
[01:04:02] Allan Jude
And so this allows it to detect bit flips on your hard drive or bitrot or any kind of corruption.
[01:04:08] Allan Jude
And so ZFS scans all the data and makes sure that the checksums are correct and fixes them if they aren't.
[01:04:14] Allan Jude
Because if you have bitrot, it usually kind of.
[01:04:17] Allan Jude
It's creeping.
[01:04:18] Allan Jude
It starts small and gets bigger and bigger.
[01:04:20] Allan Jude
So when you catch it and fix it, it means that, you know, if you have a RAID-Z2 where you have, you know, eight hard drives and two of them are providing parity, if you find a bit flip and fix it, when you find the second bit flip and fix it, you had all the data you need to reconstruct it.
[01:04:36] Allan Jude
Whereas eventually, if you had three bit flips, you wouldn't have enough data to fix it if all three were in the same file.
[01:04:43] Allan Jude
But if you once a month are reading all the data, making sure the checksum is correct, it means you fix each of those errors as they come up and they don't accumulate to the point where you actually have data loss.
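
A scrub can be sketched as a loop that verifies every block and rebuilds any that fail. To keep it short, this toy uses a single XOR parity block per stripe, which is a simplification and not the RAID-Z2 math; the function names are invented for the example.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub_stripe(data_blocks: list, parity: bytes, checksums: list) -> int:
    """Verify every block against its checksum; rebuild bad ones from parity.

    Single XOR parity only (a simplification of RAID-Z). Returns the number
    of repaired blocks, or raises if a block cannot be reconstructed.
    """
    repaired = 0
    for i, block in enumerate(data_blocks):
        if hashlib.sha256(block).digest() == checksums[i]:
            continue
        others = [b for j, b in enumerate(data_blocks) if j != i]
        candidate = xor_blocks(others + [parity])
        if hashlib.sha256(candidate).digest() != checksums[i]:
            raise IOError(f"block {i} is unrecoverable: too many errors in this stripe")
        data_blocks[i] = candidate
        repaired += 1
    return repaired


stripe = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_blocks(stripe)
sums = [hashlib.sha256(b).digest() for b in stripe]
stripe[1] = b"bXbb"  # simulate a bit flip on one disk
print(scrub_stripe(stripe, parity, sums), stripe[1])
```

If a second block in the same stripe were damaged before the first was repaired, the reconstruction would fail its checksum, which is the accumulation problem regular scrubs are meant to prevent.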
[01:04:54] Viktor Petersson
Right.
[01:04:55] Viktor Petersson
Yeah.
[01:04:55] Viktor Petersson
I had Bryan Cantrill from Oxide on the show a few episodes ago, and he was telling some stories from illumos, when they were running that, about how they discovered driver issues and issues with I/O controllers, where, because of the intelligence of ZFS, they were able to pick up things that otherwise would have gone undetected.
[01:05:15] Viktor Petersson
Really?
[01:05:15] Allan Jude
Yeah.
[01:05:16] Allan Jude
And we've hit similar cases.
[01:05:17] Allan Jude
We had one customer using like an early version of an all-flash array, and occasionally ZFS was finding this corruption.
[01:05:24] Allan Jude
And it turned out there had been a firmware bug, and a rounding error or something meant that certain writes would get written to the wrong place.
[01:05:31] Allan Jude
So when you read from the right place, the data wasn't there.
[01:05:34] Allan Jude
And when you read what was supposed to be at the wrong place, it had been overwritten with this other data.
[01:05:40] Allan Jude
And ZFS was able to point it out and we eventually found the pattern and they were able to fix the firmware.
[01:05:45] Allan Jude
Nice.
[01:05:47] Viktor Petersson
Snapshotting is another thing you've hinted at.
[01:05:50] Allan Jude
Yes, the really interesting one.
[01:05:52] Allan Jude
So other file systems in the past have had snapshotting like you think of.
[01:05:55] Allan Jude
Even KVM with qcow2 has this concept of snapshotting.
[01:05:59] Allan Jude
But in other file systems, the way snapshotting worked is, you know, you have your normal file system, when you change a file, we just overwrite it in place.
[01:06:07] Allan Jude
But if snapshotting is enabled, then we'll detect that and we'll, oh, write the new change to a different place instead and keep both versions.
[01:06:14] Allan Jude
So that meant there was this cost.
[01:06:16] Allan Jude
So like, if you use the snapshots in LVM, as soon as you have one snapshot, your performance gets cut like in half.
[01:06:21] Allan Jude
And then you have a second snapshot and it's half of that.
[01:06:24] Allan Jude
And so if you have like eight snapshots, you're at like, you know, 10% of your original performance.
[01:06:28] Allan Jude
It's pretty terrible.
[01:06:30] Allan Jude
With ZFS, it works differently.
[01:06:33] Allan Jude
Every time we make a change to a file, an existing file, we always write the blocks to a new place.
[01:06:39] Allan Jude
If you have no snapshots, the old place after a couple of seconds just becomes free space and can be reused later.
[01:06:45] Allan Jude
But if we have a snapshot, then we know to keep that data.
[01:06:50] Allan Jude
So basically I mentioned one of the bits of metadata we have along with the checksum is the time when the block was written.
[01:06:57] Allan Jude
And this is measured in transaction groups since the pool was created.
[01:07:00] Allan Jude
So when you create a snapshot, it's mostly just remembering the time of the snapshot.
[01:07:04] Allan Jude
And so when you overwrite some data, if it's before that snapshot, then we need to keep it, and so it won't erase it and it'll just keep going.
[01:07:15] Allan Jude
So it means when you're writing while having snapshots or reading, there's no extra work to do.
[01:07:20] Allan Jude
We're actually doing less work because we're not freeing that space because it's referenced by the snapshot.
[01:07:26] Allan Jude
Then later when you delete the snapshot, it can go through and say anything that's older than the now oldest snapshot.
[01:07:34] Allan Jude
We don't necessarily need that anymore.
[01:07:36] Allan Jude
And so it can go and free it and make the free space.
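
The birth-time bookkeeping can be modelled in a few lines. This toy dataset is an assumption-laden sketch: the class and field names are invented, one write equals one transaction group, and a block is kept if any snapshot was taken between its birth and the moment it was replaced.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    txg: int = 1
    live: dict = field(default_factory=dict)       # offset -> (birth_txg, data)
    dead: list = field(default_factory=list)       # (birth_txg, death_txg, data) held for snapshots
    snapshots: list = field(default_factory=list)  # a snapshot is little more than a txg number
    freed: int = 0

    def _referenced(self, birth, death):
        # A block is still needed if some snapshot was taken while it was live.
        return any(birth <= snap < death for snap in self.snapshots)

    def write(self, offset, data):
        self.txg += 1
        old = self.live.get(offset)
        self.live[offset] = (self.txg, data)  # new data always goes to a new place
        if old is None:
            return
        birth, old_data = old
        if self._referenced(birth, self.txg):
            self.dead.append((birth, self.txg, old_data))  # keep it for the snapshot
        else:
            self.freed += 1  # no snapshot needs it, so it becomes free space

    def snapshot(self):
        self.snapshots.append(self.txg)
        return self.txg

    def destroy_snapshot(self, snap_txg):
        self.snapshots.remove(snap_txg)
        keep = [d for d in self.dead if self._referenced(d[0], d[1])]
        self.freed += len(self.dead) - len(keep)
        self.dead = keep


ds = ToyDataset()
ds.write(0, "v1")
ds.write(0, "v2")      # v1 is freed: no snapshot references it
snap = ds.snapshot()
ds.write(0, "v3")      # v2 is kept: the snapshot references it
ds.destroy_snapshot(snap)
print(ds.freed, len(ds.dead))
```

Note that the write path does no copying on behalf of the snapshot; it only skips the free, which is why snapshots do not slow writes down in this model.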
[01:07:39] Viktor Petersson
That's very cool.
[01:07:40] Allan Jude
So the copy-on-write nature of ZFS also provides the other big thing.
[01:07:44] Allan Jude
Especially in the early 2000s, there was this problem where if your server ever got rebooted unexpectedly, when it came back up, it had to run an fsck and check all the stuff to make sure your directories weren't corrupt or whatever.
[01:07:56] Allan Jude
Yeah, on bigger hard drives that could take days.
[01:08:00] Allan Jude
And having your operating system not fully running for days was not an option.
[01:08:04] Allan Jude
Eventually that was mostly kind of solved with journaling, but to a limited degree.
[01:08:11] Allan Jude
With ZFS, the way it works is using those transactions.
[01:08:14] Allan Jude
Any changes you make get accumulated in a transaction group and then written out.
[01:08:17] Allan Jude
And by default that happens every five seconds or more frequently if you're pushing a lot of data and it's getting full.
[01:08:28] Allan Jude
But think about an overwriting file system: you've just made a bunch of changes to a big Excel file and hit save, and the power goes out halfway through saving.
[01:08:39] Allan Jude
On a regular file system, you would have overwritten the first half of the original copy of the program or the Excel file with the new version, but the second half didn't get written yet because the power went out.
[01:08:49] Allan Jude
So when you boot back up, you have half the new file and half the old file.
[01:08:52] Allan Jude
And Excel is just going to say, that's gibberish.
[01:08:55] Allan Jude
I can't make sense of that.
[01:08:58] Allan Jude
Whereas with ZFS, the new file was being written over here and we only have half of it.
[01:09:04] Allan Jude
And so that transaction didn't finish, so the checksum won't match.
[01:09:07] Allan Jude
So when it boots up, it'll just say, okay, that one didn't finish.
[01:09:10] Allan Jude
We'll go back to one previous that is completed and do that.
[01:09:14] Allan Jude
And then ZFS has this concept called the ZFS Intent Log, or ZIL, where any time an application asks for a promise that the write it just did is on disk before it continues, like a database would, that gets written to the ZIL, and then when you reboot after the crash, the ZIL will replay those changes and make sure that everything we promised actually got finished.
[01:09:36] Viktor Petersson
Like a WAL log in Postgres.
[01:09:39] Allan Jude
Exactly.
[01:09:39] Allan Jude
But that's for the file system, and this way we go from the file system was perfect to the file system was perfect without ever having to be in that in-between state.
[01:09:49] Allan Jude
If it's in an in-between state, we kind of roll back like a database.
[01:09:52] Allan Jude
And so it doesn't have the need to fsck that other file systems do.
[01:09:56] Allan Jude
Super interesting.
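
The crash behaviour described here can be modelled with a toy pool. Everything in this sketch is an illustrative assumption: the class, its dictionaries, and the method names are invented, and real ZFS commits a transaction group by writing a new block tree and then a new uberblock on disk.

```python
class ToyPool:
    def __init__(self):
        self.committed = {}   # state as of the last completed transaction group
        self.open_txg = {}    # changes accumulated in memory, not yet committed
        self.intent_log = []  # sync writes that were promised to be durable

    def write(self, name, data, sync=False):
        self.open_txg[name] = data
        if sync:
            # The application asked for a durability promise, so log it first.
            self.intent_log.append((name, data))

    def commit_txg(self):
        # Build the new state off to the side, then switch to it in one step.
        new_state = {**self.committed, **self.open_txg}
        self.committed = new_state
        self.open_txg = {}
        self.intent_log = []  # the log is no longer needed once the txg is on disk

    def crash_and_import(self):
        self.open_txg = {}  # the unfinished transaction group is simply discarded
        for name, data in self.intent_log:
            self.open_txg[name] = data  # replay only what was promised
        self.commit_txg()


pool = ToyPool()
pool.write("report.xlsx", "old version")
pool.commit_txg()
pool.write("report.xlsx", "half-written new version")  # ordinary write
pool.write("db.wal", "row 1", sync=True)               # database-style sync write
pool.crash_and_import()                                # power loss before the next commit
print(pool.committed)
```

After the simulated crash the spreadsheet is the complete old version, never a mix of old and new, while the sync write survives through log replay.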
[01:09:59] Viktor Petersson
The last feature that I wanted to cover, which I think is a very neat feature as well, is ZFS send, which is a clever way of shipping data across systems, really.
[01:10:09] Viktor Petersson
So maybe speak a bit about that.
[01:10:11] Allan Jude
Yeah.
[01:10:11] Allan Jude
So basically it allows you to serialize a file system into a stream that you can send over the network.
[01:10:17] Allan Jude
Technically, you could send it to a file and then receive it on a different computer much later, but usually it makes more sense to just do it directly, because having a giant file full of a stream of the file system isn't as useful as if you receive it, you have that same amount of space, but as a usable copy of the file system.
[01:10:36] Viktor Petersson
Right.
[01:10:38] Allan Jude
But its power really comes from its ability to do incremental replication, where after you've sent the whole file system based on a snapshot, if you make a newer snapshot, you can send the difference just between those two.
[01:10:49] Allan Jude
And that depends on what we had just talked about, those transactions and the transaction group ID.
[01:10:56] Allan Jude
So when you created the first snapshot, we were at, say, transaction 1000, and when you created the second one, we were at, say, 1100.
[01:11:04] Allan Jude
So now when you say send the difference between snapshot one and snapshot two, ZFS just has to look for blocks that have a birth time between those two numbers.
[01:11:11] Viktor Petersson
Oh, neat.
[01:11:12] Viktor Petersson
Okay.
[01:11:12] Allan Jude
And so we can just scan all those blocks and say, okay, any block that has a birth time greater than a thousand, but up to 1100, we're just going to feed it into the stream, and on the other side we'll just receive those.
[01:11:25] Allan Jude
And so compared to doing a backup with something like rsync, where it's going to walk through every single file in your system and run stat on it and be like, when did you last change?
[01:11:34] Allan Jude
When did you last change?
[01:11:35] Allan Jude
And only.
[01:11:36] Allan Jude
And if a file is huge, but it was touched, rsync has to read the whole file on your side and read the whole file on the far side and check the checksum of the chunks and then to figure out which blocks changed.
[01:11:51] Allan Jude
Whereas ZFS just looks at the time on each block in the metadata without having to read the data and says, okay, only these seven blocks changed.
[01:11:59] Allan Jude
So it just sends those seven blocks and it's done.
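
The block selection behind an incremental send reduces to a filter on birth time. This is a toy model with invented names: blocks are dictionary entries, the stream is another dictionary, and freed blocks are ignored entirely.

```python
def incremental_stream(blocks: dict, from_txg: int, to_txg: int) -> dict:
    """Select blocks born after the first snapshot, up to and including the second.

    blocks maps block id -> (birth_txg, data). Only metadata is inspected;
    the data of unchanged blocks is never read. Toy model, not the send format.
    """
    return {
        block_id: data
        for block_id, (birth, data) in blocks.items()
        if from_txg < birth <= to_txg
    }


def receive(target: dict, stream: dict) -> None:
    """Apply the received blocks on the far side."""
    target.update(stream)


# Snapshot one was taken at txg 1000, snapshot two at txg 1100.
pool_blocks = {
    "a": (900, b"old"),
    "b": (1000, b"in snapshot one"),
    "c": (1001, b"changed"),
    "d": (1100, b"changed too"),
}
backup = {"a": b"old", "b": b"in snapshot one"}  # already holds snapshot one
receive(backup, incremental_stream(pool_blocks, 1000, 1100))
print(sorted(backup))
```

Compared with the rsync approach described above, nothing here walks files or compares contents; the cost scales with the number of changed blocks rather than the size of the data set.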
[01:12:02] Viktor Petersson
Interesting.
[01:12:03] Allan Jude
And so ZFS is able to basically saturate the network and do the backup in a couple of seconds, where rsync might take a whole day to copy that same couple of blocks, because it has to check every file, and then if the file is newer, it has to read the whole file on both sides and decide which blocks are actually different.
[01:12:20] Allan Jude
Whereas ZFS just natively knows because it has a timestamp for each block of the file instead of just the whole file.
[01:12:26] Viktor Petersson
Right.
[01:12:26] Viktor Petersson
And I guess if you send it on a network, you would just pipe it through something that's encrypted and then receive it and decrypt it.
[01:12:31] Viktor Petersson
Yeah.
[01:12:32] Viktor Petersson
Okay.
[01:12:32] Allan Jude
Well, also, if your data set was encrypted on ZFS, you can do a raw send where it'll send the encrypted version without decrypting it to the other side.
[01:12:42] Allan Jude
And then that way the data is never decrypted on the far side.
[01:12:46] Allan Jude
And so if they don't have the key to decrypt it, they only have a backup of your data that they can send back to you, but they can't ever mount.
[01:12:54] Viktor Petersson
Right, okay.
[01:12:56] Allan Jude
The thing lots of people are after.
[01:12:58] Allan Jude
Yeah.
[01:12:58] Viktor Petersson
Are there any hosted services? I know obviously a fellow Canadian, Colin, is running Tarsnap.
[01:13:05] Viktor Petersson
Are they supporting ZFS send at Tarsnap now?
[01:13:09] Viktor Petersson
Because that would be an interesting service offer to basically ship.
[01:13:12] Allan Jude
They're not.
[01:13:13] Allan Jude
So the two services I know of are zfs.rent, where they will basically rent you a hard drive in a VM that you can do that to, and then rsync.net has a thing where they will basically stand up FreeBSD or Linux in a VM and sell you five terabytes of space that you can ZFS receive your encrypted data sets to.
[01:13:36] Allan Jude
Or not encrypted if you don't care.
[01:13:39] Viktor Petersson
Interesting.
[01:13:40] Viktor Petersson
All right, so the last thing I want to round up the episode on is Tales from the Trenches.
[01:13:45] Viktor Petersson
Obviously you've been around the data world for quite some time and I'm sure we've seen a lot of horror stories around this.
[01:13:52] Viktor Petersson
So I kind of want to hear from your side: what are the craziest, most bizarre data recovery missions you guys have been on, trying to recover data on GMS clusters or in general?
[01:14:03] Allan Jude
Yeah, we've done quite a few different ones.
[01:14:07] Allan Jude
We just did a webinar on Halloween that we'll drop a link in here for, but if you go to klarasystems.com and click webinars, you'll find our Halloween horror stories.
[01:14:17] Allan Jude
So I won't cover one of those because they're all really good, but I just did them.
[01:14:21] Allan Jude
So another one we did, there was an interesting one at a university where they had just made a bunch of changes and we had helped them with those and it was fine.
[01:14:33] Allan Jude
And then so they had a high availability system where they'd have a JBOD full of hard drives connected to two different servers so that if one server goes down, they could import the pool on the other one.
[01:14:47] Allan Jude
And they had a bunch of these pools and they had just numbered them like everybody does.
[01:14:52] Allan Jude
This is not the same as the war story in the Halloween one where a similar thing kind of happened.
[01:14:56] Allan Jude
But after importing them, after doing the upgrade, they accidentally did import pool 4 on server 4 and then went to server 5 and accidentally typed import pool 4 instead of pool 5.
[01:15:11] Allan Jude
So now pool 4 was mounted on two machines at the same time, which isn't supposed to be able to happen, but they didn't enable the feature that stops it because there's a trade-off for it.
[01:15:23] Allan Jude
But so now every time there's a new transaction group, server 4 is writing to the hard drive and then server 5 is writing different information with a different transaction group number over top of that.
[01:15:34] Allan Jude
And so they're just sitting there writing over top of each other, somewhat being correct and somewhat not.
[01:15:39] Allan Jude
And also meaning that if you wrote different data on different servers, they're going to not know that the other one has used a certain bit of space.
[01:15:47] Allan Jude
So it'll think it's free and allocate it and basically overwrite something server 4 just wrote with something different that server 5 just wrote.
[01:15:54] Allan Jude
And you can really corrupt things really badly.
[01:15:58] Allan Jude
They managed to catch it after I think it was about 25 minutes or so.
[01:16:02] Allan Jude
And so they shut everything down and called us, and things were quite broken and we had to invent some new tools to be able to fix it, but eventually got to the point where we could do a version of ZFS send to copy all the data to a spare machine they had so that they could get most of their data back, except for obviously the newer stuff they had overwritten and a couple of pieces that got damaged.
[01:16:26] Allan Jude
But we managed to get most of the data back by basically being able to do ZFS send.
[01:16:32] Allan Jude
And the big advantage there over copying it some other way was just the speed, the fact that we could saturate their 10 gigabit network.
[01:16:38] Allan Jude
And when you're talking about hundreds and hundreds of terabytes of data, it's really slow to copy.
[01:16:44] Allan Jude
So you really need a way to try to do that quickly.
[01:16:48] Viktor Petersson
Good stuff.
[01:16:50] Viktor Petersson
This has been very helpful and I think hopefully opened the eyes to ZFS to a new audience.
[01:16:57] Viktor Petersson
And I think it's well worth shouting about because I think it's a fantastic file system.
[01:17:03] Viktor Petersson
So I'm very happy to see wider adoption for it.
[01:17:06] Viktor Petersson
So thank you so much for coming on the show, Allan.
[01:17:10] Viktor Petersson
Very much appreciated.
[01:17:11] Viktor Petersson
Any last words about Klara?
[01:17:13] Viktor Petersson
Something you want to say, you want.
[01:17:14] Allan Jude
To draw attention to?
[01:17:16] Allan Jude
If you have.
[01:17:17] Allan Jude
If you want to learn more about zfs, I do a weekly podcast where we tend to answer quite a few people's questions.
[01:17:23] Allan Jude
They write in about ZFS.
[01:17:24] Allan Jude
So that's 2.5admins.com, so 2.5 Admins.
[01:17:28] Allan Jude
It's a podcast with myself and another ZFS admin, Jim Salter, and a host who's trying to become a sysadmin.
[01:17:34] Allan Jude
So we have two and a half admins.
[01:17:36] Allan Jude
So if you're into sysadminning in general, or ZFS specifically, you want to check out that podcast that comes out every Thursday.
[01:17:44] Allan Jude
And we also have a website, practicalzfs.com, where we have a Discourse set up replacing the old Reddit r/zfs.
[01:17:53] Allan Jude
And so it's a place where people ask a lot of ZFS questions, we answer them and kind of collect a whole thing there.
[01:17:58] Allan Jude
But yeah, if you need support with ZFS or feature development, or the same for FreeBSD, then do hit us up at klarasystems.com.
[01:18:07] Viktor Petersson
Amazing.
[01:18:08] Viktor Petersson
Again, thanks so much for coming on the show, Allan.
[01:18:10] Viktor Petersson
Have a good one.
[01:18:11] Viktor Petersson
Talk soon.
[01:18:11] Viktor Petersson
Cheers.
[01:18:12] Allan Jude
Bye.
