Scaling Bitcoin workshop : Montreal 2015
Reworking Bitcoin Core's p2p code to be more robust and event-driven.
That is me. I want to apologize right off. I am here to talk about reworking Bitcoin Core's p2p code, which is a Bitcoin Core-specific topic. I have been working on some software for the past few weeks so that I could talk about working things rather than theory, and the presentation is going to suffer because of this.
Bitcoin Core has a networking stack that dates back all the way to the initial implementation from Satoshi, or at least it has been patched and hacked on ever since. It has some pretty serious limitations at the moment. One of the two biggest, the ones that actually got me whipped into shape for this specific topic, is that there's no throttling. That's one of the most requested features on the network side of things that I see. It's easy for a bunch of nodes connected to you to suck away your bandwidth. The configuration is also very static, and hacking on the networking code is pretty hard; if you want to ban a node or disconnect a node, it's more complicated than you would expect. The same goes for diagnostics, simulations and testing: because of that static configuration, it becomes hard to do those things.
The other major thing that got me really into looking at the current networking issues is p2p latency. It's a terrible name and doesn't quite describe what I want; what I mean is the lag time between receiving a message, processing it, and doing something as a result of it.
This is pretty Core-specific. Bitcoin Core developers might recognize some of these things, and I am sure they could add to the list very easily. One of the big issues is that select() doesn't scale. You have only 1024 file descriptors, so that's the limit of connections that you're going to get. There are major blocking issues. This is very specific to Bitcoin Core itself. It becomes very hard to make changes to the way that connections work or are established, or to how proxies work; anything regarding sockets and data transmission is covered with locks and crazy code paths. It's all spaghetti that has emerged over time. Something that I'm not sure has been analyzed all that much, but which I consider to be one of the major scaling issues at the moment, is how much is done on the message handling thread in Core. As we receive messages on the network and process them very quickly, there's a lot of thread contention, because things like block validation are handled on the messaging thread. As more and more comes in over the network, we become less and less able to process it. I haven't seen many proposals for how to address that particular issue, and given the state of the current code base and its viability, I decided to dive in and essentially look at a complete rework of the networking code.
This is one of my primary motivators. I was doing an initial block download a few months ago. This was on my MacBook, which is about a year old I suppose. One of the trends that emerged was this pattern of peaks and troughs, and I think most people would see this; the slower the computer that you are syncing on, the more strongly you would notice it. What's happening there is that as blocks are coming in, we can't process them fast enough. As blocks come in out of order, you may have 15 or 20 or 30 blocks queued up that don't actually connect until the right block comes in. Tens of seconds or even minutes could go by before you can even request a new block. The valleys are where we're churning in our code or validating things and we're not able to ... and network traffic just drops, and over the long term it just drops to zero.
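The queueing effect described above can be illustrated with a minimal sketch. This is not Bitcoin Core's code: the `BlockQueue` class is hypothetical, and it orders blocks by height as a simplification (real blocks connect through the previous-block hash). It only shows why several buffered blocks can sit idle until one missing block arrives.

```python
class BlockQueue:
    """Buffers blocks that arrive out of order and connects them in sequence.

    Hypothetical illustration: blocks are keyed by height for simplicity.
    """

    def __init__(self, next_height=0):
        self.next_height = next_height  # height of the next block we can connect
        self.pending = {}               # height -> block, waiting for predecessors

    def receive(self, height, block):
        """Store a block and return the list of blocks that can now be connected."""
        self.pending[height] = block
        connected = []
        # Drain every block that has become connectable, in order.
        while self.next_height in self.pending:
            connected.append(self.pending.pop(self.next_height))
            self.next_height += 1
        return connected


queue = BlockQueue()
# Blocks 2, 3 and 1 arrive first; nothing connects until block 0 shows up.
results = [queue.receive(h, f"block-{h}") for h in (2, 3, 1, 0)]
```

In this sketch the first three calls return empty lists and the last call releases all four blocks at once, which corresponds to a long quiet stretch followed by a burst of validation work.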
I ran a quick experiment: I moved block handling into new threads. I gave myself an award for most simplified data here. It's at least possible to offset some of this peaks-and-valleys trend. For the past few months, and really in the past few weeks, I have been working on a complete replacement for the networking stack inside of Bitcoin Core. More threads is traditionally what we've done to make Bitcoin Core scale, and anyone who has done networking at a low level already knows that this stops being useful at some point.
So I started to look at things we could do asynchronously instead. We spawn some threads, we spawn connections, we wait for them to connect, and we wait for that instead. As a result of that pattern, the Bitcoin Core logic tends to work the same way. We wait for a connection; once we have it we start talking, then we run a handshake and start asking for blocks. As we try to scale up on the networking side, it becomes much easier to say: I'm connected now, what do you want from me?
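The inversion described above, where the network layer asks the application what to do, can be sketched with callbacks. Everything here is hypothetical (the `ConnectionManager` class, the callback names, and the use of plain strings as messages); it is a toy illustration of the event-driven style, not the talk's actual library, and it involves no real sockets.

```python
class ConnectionManager:
    """Toy event-driven manager: it reports events and the application reacts."""

    def __init__(self, on_connected, on_message):
        self.on_connected = on_connected  # called as on_connected(manager, peer)
        self.on_message = on_message      # called as on_message(manager, peer, message)
        self.outbox = []                  # (peer, message) pairs queued for sending

    def send(self, peer, message):
        self.outbox.append((peer, message))

    # In a real implementation these two methods would be driven by the
    # event loop when a socket connects or becomes readable.
    def peer_connected(self, peer):
        self.on_connected(self, peer)

    def message_received(self, peer, message):
        self.on_message(self, peer, message)


def on_connected(manager, peer):
    # "I'm connected now, what do you want from me?" -> start the handshake.
    manager.send(peer, "version")


def on_message(manager, peer, message):
    # Simplified handshake: once the peer acknowledges, start asking for data.
    if message == "verack":
        manager.send(peer, "getheaders")


manager = ConnectionManager(on_connected, on_message)
manager.peer_connected("peer-a")
manager.message_received("peer-a", "verack")
```

Nothing in this sketch blocks while waiting for a peer; the application logic only runs when the manager has something to report.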
libevent is a library that is essentially written to handle these kinds of issues. Its usage was merged into Bitcoin Core for a different RPC server over the last few weeks. There are a few complaints about adding a new library and depending on a third-party library for core logic, but libevent is already a pretty well-established library. There are a lot of things that it does for us: socket creation, binding, polling, buffering, watermarks, throttling, DNS resolving, timers and timeouts, and much more. We're doing all of that manually in Bitcoin Core. It also handles prioritization, which is really interesting: you can make sure a given peer is connected before the others, or that its data is available before the others.
It also uses epoll/kqueue/IOCP where possible; I guess IOCP is what's used on Windows. That soft limit of 1024 connections that we have at the moment just kind of goes away, and we don't have to worry about it anymore.
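As a rough analogy for the readiness-notification model mentioned above, here is one turn of an event loop using Python's standard `selectors` module, which picks epoll or kqueue where the platform offers them. This is not libevent and not the code from the talk; the `on_readable` callback and the local socket pair standing in for a peer are assumptions for illustration.

```python
import selectors
import socket

# DefaultSelector chooses the most efficient mechanism available
# (epoll on Linux, kqueue on BSD/macOS, otherwise poll or select).
# Unlike libevent, this module does not use IOCP on Windows.
selector = selectors.DefaultSelector()
received = []


def on_readable(sock):
    """Callback invoked when the socket has data waiting."""
    received.append(sock.recv(4096))


# A connected pair of local sockets stands in for a peer connection.
local, remote = socket.socketpair()
local.setblocking(False)
selector.register(local, selectors.EVENT_READ, data=on_readable)

remote.sendall(b"ping")

# One turn of the event loop: wait for ready sockets, then dispatch callbacks.
for key, _events in selector.select(timeout=1.0):
    key.data(key.fileobj)

selector.unregister(local)
local.close()
remote.close()
selector.close()
```

The same loop shape works for one socket or for thousands, since the application never scans the whole set of descriptors itself.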
This is the approach that I have taken. I have tried to add the concept of a singular connection manager for Bitcoin Core. It should say, I don't have enough connections, give me someone to connect to; then it should connect to someone and ask what Bitcoin Core wants to do. It would make it much easier to reverse that logic in the Bitcoin Core code, to say, if someone asks for a block, what do you want to do about it, rather than just stopping everything and going to find that block (which is what we do now). In writing the manager, it became clear pretty quickly that a network connection manager doesn't really have to know anything about bitcoin itself. It has to know two or three things: how big a header is, what the message size is, what the offset is inside the blockheader, and the version offset. The first incoming message is where you establish a lot about the connection itself. Before I'm notified about a connection, the version handshake is done first; then I can start sending messages and continue.
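To illustrate how little a connection manager needs to know in order to frame messages, here is a minimal sketch based on the publicly documented Bitcoin P2P message header (4-byte magic, 12-byte command, 4-byte little-endian payload length, 4-byte checksum). The function names `split_messages` and `make_message` are hypothetical, and this is not the code from the talk. A real implementation would also cap the payload length and verify the checksum.

```python
import hashlib
import struct

HEADER_SIZE = 24    # 4 magic + 12 command + 4 length + 4 checksum
LENGTH_OFFSET = 16  # the payload length follows the magic and the command
MAGIC = bytes.fromhex("f9beb4d9")  # mainnet network magic


def split_messages(buffer):
    """Split a raw byte stream into complete messages plus leftover bytes."""
    messages = []
    while len(buffer) >= HEADER_SIZE:
        (payload_len,) = struct.unpack_from("<I", buffer, LENGTH_OFFSET)
        total = HEADER_SIZE + payload_len
        if len(buffer) < total:
            break  # message is incomplete; wait for more data
        messages.append(buffer[:total])
        buffer = buffer[total:]
    return messages, buffer


def make_message(command, payload):
    """Build a framed message (helper for this demonstration only)."""
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    header = MAGIC + command.ljust(12, b"\x00") + struct.pack("<I", len(payload)) + checksum
    return header + payload


stream = make_message(b"verack", b"") + make_message(b"ping", b"\x01" * 8)
# Simulate a read that stops three bytes short of the second message.
first, leftover = split_messages(stream[:-3])
second, rest = split_messages(leftover + stream[-3:])
```

The framing code never interprets the payload, which is the point: the manager can hand complete messages to the application without understanding them.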
In working on this new manager, one of the goals that I set out with initially, and that I've stuck to so far, is writing it as a new standalone library. Not that it makes sense to use it that way; it wouldn't be useful to anyone else, I guess, for being implemented in other programs. But it does make testing much easier. We can write stub applications for grabbing blocks, simulate initial block download, simulate this or that scenario, and do it without worrying about the other Bitcoin Core things going on. So you can test how easily you can connect to 2000 nodes, or you can write your own little stub applications. Because nothing is hardcoded and nothing is global, lots and lots of things become dynamic in Bitcoin Core. So I would like to see the ability to just kick a node on the fly, by operator selection, and to limit some of the bandwidth on the fly: limit one node in particular but leave everyone else unlimited.
So the current status is that it's ugly. My primary intention in coming and speaking here was to kick off the idea that trying to get this into Bitcoin Core will be a big, nasty process. It will be a whole lot to try to dump in at once. So I would like to gather some ideas and talk to some people who have analyzed network traffic, Matt being one of them, to understand the pitfalls to avoid and to try not to make the same mistakes that we have made in the past.
Current status: in the second most boring graph, this is something that I ran, and the code came up two days ago. This is an initial block download where it downloaded 1.25 GB, and I had set a download rate limit of 1 meg/sec. That number was hit exactly, so at least this part works.
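One common way to enforce a download cap like the one described above is a token bucket. The sketch below is an assumption about how such a limit could work, not the mechanism used in the talk's code or in libevent. The `TokenBucket` class is hypothetical, and it uses an integer millisecond clock passed in by the caller so that the simulation is deterministic.

```python
class TokenBucket:
    """Grants read allowances so the long-run rate stays at or below `rate`."""

    def __init__(self, rate_bytes_per_sec, capacity_bytes, now_ms=0):
        self.rate = rate_bytes_per_sec
        self.capacity = capacity_bytes  # largest burst allowed, in bytes
        self.tokens = 0                 # start empty so there is no initial burst
        self.last_ms = now_ms

    def consume(self, wanted, now_ms):
        """Refill for the elapsed time, then grant up to `wanted` bytes."""
        elapsed_ms = now_ms - self.last_ms
        self.last_ms = now_ms
        # Integer arithmetic; remainders are dropped, which is fine for a sketch.
        refill = self.rate * elapsed_ms // 1000
        self.tokens = min(self.capacity, self.tokens + refill)
        granted = min(wanted, self.tokens)
        self.tokens -= granted
        return granted


bucket = TokenBucket(rate_bytes_per_sec=1_000_000, capacity_bytes=1_000_000)
total = 0
# Simulate 10 seconds in 100 ms ticks with a peer that always offers 500 kB.
for tick in range(1, 101):
    total += bucket.consume(500_000, now_ms=tick * 100)
```

Because the bucket is a plain object rather than a global, each peer could be given its own instance, which matches the idea of limiting one node while leaving the others alone.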
I do have a plan for trying to get it merged in, but it's going to be ugly. Once we do have a new network manager, it opens up some interesting possibilities. One that I was talking about with Jonas last night, and that is important going forward, is that since the manager would be instanced, it would be possible to, say, run two different sets of services at once. Say I advertise no network on this side, or on another simultaneous port I advertise alone, or any of the other server features you want to run. You don't have to be static about it. You could also have the ability to be a pruned node until you acquire the blocks, and then say: I'm a whole node now, and you can grab blocks from me.
Lots of interesting things beyond that. A curses GUI would be very helpful for running interactive connection management. Out-of-process networking might be useful, who knows.
My primary intention here is to introduce the idea, and I hope that I can talk with people about some of the issues that we're running into.