Bennett Haselton, 12/29/2002
For several classes of peer-to-peer systems, it is a desirable property that a single person or computer should not be able to map out a list of all, or most, of the peers in the network.
For some types of peer-to-peer systems, this is more desirable than others. For example, for a peer-to-peer system like FreeNet, whose purpose is to distribute the hosting of a file across many nodes so that the file cannot be censored except by shutting down all of the nodes, it's not a high priority to prevent someone from mapping out a list of all or most of the nodes in the network. This is because even if an adversary had a list of most nodes in the network, there would be nothing the adversary could easily do to make a shared file inaccessible -- they would still have to launch a denial-of-service attack or a legal attack on each of the individual nodes. On the other hand, suppose you are designing a peer-to-peer system whereby users in China can connect to "circumventor" servers outside China, in order to access sites that are blocked by the Chinese government. If the Chinese government ever finds a way to map out most of the nodes in the circumventor network, they can easily block all the ones they know about, cutting off most users' access.
This page argues that in any kind of peer-to-peer network where a node is allowed to find out about other nodes in the network besides those that it is connected to directly, an adversary would be able to map out all or most of the network by repeating a simple process.
Assume that the adversary -- the Chinese censorship bureau in this example -- can find out about at least one node "X" in the peer-to-peer network of circumventor servers outside China. (If it's possible for legitimate users to use the network at all, it must be possible for an arbitrary new user to find out about at least one node to connect to.) So the adversary installs the p2p client software on their machine Y, and connects it to X, thereby plugging into the network.
Suppose X feeds information to Y, the node under the adversary's control, about other nodes in the p2p network, nodes which Y can connect to in the event that X becomes inaccessible. At this point, the adversary controlling computer Y can extract the information about the list of redundant nodes that X has communicated to Y. Even if the communication from X to Y was encrypted, and even if the node software is closed-source and extremely difficult to reverse-engineer, all that the adversary has to do is use a local firewall to block Y from communicating with X directly. Then the p2p software will automatically start communicating with one of the redundant nodes that Y received in the list from X -- and the adversary can monitor their own network traffic to see the IP address of the new node (call it Z1) that Y is communicating with. Then the adversary can block that node and watch Y switch over to the next node, say Z2, and repeat the process until the adversary has determined all the nodes in the list that X sent to Y.
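The enumeration loop just described can be sketched as a small simulation. Everything here is a hypothetical stand-in, not any particular client's protocol: the client is assumed to hold a fallback list (received from X) and to fail over to the first peer its local firewall has not blocked.

```python
def enumerate_backups(fallback_list):
    """Discover every peer in the client's fallback list by repeatedly
    blocking the peer it is currently talking to and watching which
    address it fails over to next."""
    blocked = set()       # addresses the adversary's firewall drops
    discovered = []       # peers observed in the adversary's own traffic

    def current_peer():
        # Stand-in for the client's opaque failover logic: it connects
        # to the first peer in its list that is not blocked locally.
        for peer in fallback_list:
            if peer not in blocked:
                return peer
        return None       # list exhausted; client has nowhere to go

    while (peer := current_peer()) is not None:
        discovered.append(peer)   # seen as a destination IP in local traffic
        blocked.add(peer)         # add a firewall rule dropping this peer
    return discovered

print(enumerate_backups(["Z1", "Z2", "Z3"]))  # → ['Z1', 'Z2', 'Z3']
```

Note that the adversary never needs to decrypt anything or reverse-engineer the client; the loop relies only on controlling the local firewall and reading destination addresses off the wire.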
Having done this, the adversary could disconnect and then try connecting to Z1 or Z2 or any of the other nodes in the list, pretending once again to be a "new node". (First, the adversary would want to switch the IP address of their machine so that they would not appear to be the same person, in case the existing nodes in the network shared information about the IP addresses of new nodes that had recently contacted them.) At this point, the p2p network may "realize" that Y was behaving so suspiciously that it was probably a dishonest node -- connecting quickly, grabbing information about all the nodes neighboring X, and then disconnecting -- but there's no way to penalize Y for the suspicious behavior or to recognize it in the future, after it disconnects. The p2p network has no way of knowing that the new connection is coming from the same machine that disconnected itself recently; the connection looks like it's just coming from another new machine attempting to join the network. Once the adversary's machine has connected to Z1, if Z1 sends it a list of other nodes in the network, the adversary's machine can repeat the process outlined above to extract all of the nodes from the list. Then the machine can disconnect and move on to carry out the same steps for Z2, etc.
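Repeating the enumeration step from each newly discovered node is just a breadth-first crawl of the network. A minimal sketch, using a made-up topology in which each node maps to the backup list it would hand a newly connected client:

```python
from collections import deque

# Hypothetical topology for illustration only: node -> backup list it
# would reveal to a fresh "new node" that connects to it.
NETWORK = {
    "X":  ["Z1", "Z2"],
    "Z1": ["Z3", "X"],
    "Z2": ["Z3"],
    "Z3": ["Z1"],
}

def crawl(start):
    """Map the network by re-joining as a 'new node' at each discovered
    peer and enumerating its backup list (the process described above)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in NETWORK.get(node, []):  # backup list learned from node
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

print(sorted(crawl("X")))  # → ['X', 'Z1', 'Z2', 'Z3']
```

The crawl reaches every node connected (through any chain of backup lists) to the starting node, which is the point of the argument: one entry point is enough to map the whole reachable network.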
It may be that existing nodes in the peer-to-peer network impose some time delay before they will tell the new node, Y, about nodes in the network other than the node(s) that Y is connected to directly. So X might wait 24 hours, for example, before starting to send a list of backup nodes to Y, and even then it might not send the entire list, preferring instead to send one node every 24 hours until the entire list of, say, 7 nodes has been sent over the period of a week. But in this case, the adversary can connect multiple nodes to the network (either using multiple machines, or one multi-homed machine with multiple IP addresses) in order to do the waiting in parallel, without requiring any human intervention. Even though it may take a week to get the first list Z1, Z2, Z3 of alternate nodes that X is aware of, the adversary could then connect simultaneously to Z1, Z2, etc. and do the waiting for all of them in parallel, so that at the end of another week, they would have all of the alternate nodes that each of those nodes was aware of. Creating the appearance of multiple nodes with multiple unrelated IP addresses is pretty easy, especially for an adversary that controls the country's Internet infrastructure and can order ISPs to provide additional IP addresses at no cost, as the Chinese government can.
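A back-of-envelope calculation shows why the delay doesn't help. With parallel fake identities, each "round" of joins multiplies the set of known nodes by up to the backup-list size, so the number of rounds grows only logarithmically in the network size, while a single sequential crawler pays the delay once per node. All figures below are illustrative assumptions:

```python
import math

NODES = 10_000    # hypothetical network size
LIST_SIZE = 7     # backup nodes each peer eventually reveals
DELAY_DAYS = 7    # days to extract one peer's full backup list

# One machine visiting peers one at a time pays the delay per node:
sequential_days = NODES * DELAY_DAYS

# Parallel identities: known nodes multiply by up to LIST_SIZE per round,
# so the crawl needs only about log(NODES) / log(LIST_SIZE) rounds.
rounds = math.ceil(math.log(NODES, LIST_SIZE))
parallel_days = rounds * DELAY_DAYS

print(sequential_days, parallel_days)  # → 70000 35
```

Under these (made-up) numbers, the sequential crawl would take nearly two centuries, while the parallel crawl finishes in about five weeks -- the delay slows the honest user as much as it slows the adversary.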
Other papers such as the one at:
describe strategies for designing peer-to-peer systems in a way that limits the number of other nodes that a given node in the network can find out about. I think the flaw in this logic is that a single adversary can repeat the process of "joining" the network over and over (either with a single machine repeatedly joining and disconnecting from the network, or with multiple machines that join the network without requiring any human intervention). So even if you have a limit on the number of additional nodes that one node is allowed to know about, it's easy to get around that limit by creating multiple nodes, either simultaneously, or one after another using the same machine.
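This objection can be made concrete with a small simulation. Suppose (as a simplified, hypothetical model of such a disclosure limit) that each join reveals only K nodes chosen at random from the network. Repeated joins -- cheap for the adversary -- still uncover nearly everything, for the same reason the coupon-collector problem terminates:

```python
import random

random.seed(0)           # deterministic run for illustration
NETWORK_SIZE = 1000      # hypothetical number of nodes
K = 5                    # per-join disclosure limit (assumed)

known = set()
joins = 0
while len(known) < 0.95 * NETWORK_SIZE:
    joins += 1
    # Each fresh join reveals K distinct randomly chosen nodes.
    known.update(random.sample(range(NETWORK_SIZE), K))

print(joins, len(known))  # joins needed to map 95% of the network
```

In runs of this model, on the order of hundreds of joins suffice to map 95% of a thousand-node network -- trivial for an adversary who can script joins or command many IP addresses, which is exactly the situation described above.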