Bennett Haselton, 12/29/2002
For several classes of peer-to-peer systems, it is a desirable property that a single person or computer should not be able to map out a list of all, or most, of the peers in the network.
For some types of peer-to-peer systems, this is more desirable than others. For example, for a peer-to-peer system like FreeNet, whose purpose is to distribute the hosting of a file across many nodes so that the file cannot be censored except by shutting down all of the nodes, it's not a high priority to prevent someone from mapping out a list of all or most of the nodes in the network. This is because even if an adversary had a list of most nodes in the network, there would be nothing the adversary could easily do to make a shared file inaccessible -- they would still have to launch a denial-of-service attack or a legal attack on each of the individual nodes. On the other hand, suppose you are designing a peer-to-peer system whereby users in China can connect to "circumventor" servers outside China, in order to access sites that are blocked by the Chinese government. If the Chinese government ever finds a way to map out most of the nodes in the circumventor network, they can easily block all the ones they know about, cutting off most users' access.
This page argues that in any kind of peer-to-peer network where a node is allowed to find out about other nodes in the network besides those that it is connected to directly, an adversary would be able to map out all or most of the network by repeating a simple process.
Assume that the adversary -- the Chinese censorship bureau in this example -- can find out about at least one node "X" in the peer-to-peer network of circumventor servers outside China. (If it's possible for legitimate users to use the network at all, it must be possible for an arbitrary new user to find out about at least one node to connect to.) So the adversary installs the p2p client software on their machine Y, and connects it to X, thereby plugging into the network.
Suppose X feeds information to Y, the node under the adversary's control, about other nodes in the p2p network, nodes which Y can connect to in the event that X becomes inaccessible. At this point, the adversary controlling computer Y can extract the information about the list of redundant nodes that X has communicated to Y. Even if the communication from X to Y was encrypted, and even if the node software is closed-source and extremely difficult to reverse-engineer, all that the adversary has to do is use a local firewall to block Y from communicating with X directly. Then the p2p software will automatically start communicating with one of the redundant nodes that Y received in the list from X -- and the adversary can monitor their own network traffic to see the IP address of the new node (call it Z1) that Y is communicating with. Then the adversary can block that node and watch Y switch over to the next node, say Z2, and repeat the process until the adversary has determined all the nodes in the list that X sent to Y.
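The enumeration loop just described can be sketched as a small simulation. Everything here is a hypothetical stand-in, not any particular client's protocol: the client is assumed to hold a fallback list (received from X) and to fail over to the first peer its local firewall has not blocked.

```python
def enumerate_backups(fallback_list):
    """Discover every peer in the client's fallback list by repeatedly
    blocking the peer it is currently talking to and watching which
    address it fails over to next."""
    blocked = set()       # addresses the adversary's firewall drops
    discovered = []       # peers observed in the adversary's own traffic

    def current_peer():
        # Stand-in for the client's opaque failover logic: it connects
        # to the first peer in its list that is not blocked locally.
        for peer in fallback_list:
            if peer not in blocked:
                return peer
        return None       # list exhausted; client has nowhere to go

    while (peer := current_peer()) is not None:
        discovered.append(peer)   # seen as a destination IP in local traffic
        blocked.add(peer)         # add a firewall rule dropping this peer
    return discovered

print(enumerate_backups(["Z1", "Z2", "Z3"]))  # → ['Z1', 'Z2', 'Z3']
```

Note that the adversary never needs to decrypt anything or reverse-engineer the client; the loop relies only on controlling the local firewall and reading destination addresses off the wire.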
Having done this, the adversary could disconnect and then try connecting to Z1 or Z2 or any of the other nodes in the list, pretending once again to be a "new node". (First, the adversary would want to switch the IP address of their machine so that they would not appear to be the same person, in case the existing nodes in the network shared information about the IP addresses of new nodes that had recently contacted them.) At this point, the p2p network may "realize" that Y was behaving so suspiciously that it was probably a dishonest node -- connecting quickly, grabbing information about all the nodes neighboring X, and then disconnecting -- but there's no way to penalize Y for the suspicious behavior or to recognize it in the future, after it disconnects. The p2p network has no way of knowing that the new connection is coming from the same machine that disconnected itself recently; the connection looks like it's just coming from another new machine attempting to join the network. Once the adversary's machine has connected to Z1, if Z1 sends it a list of other nodes in the network, the adversary's machine can repeat the process outlined above to extract all of the nodes from the list. Then the machine can disconnect and move on to carry out the same steps for Z2, etc.
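Repeating the enumeration step from each newly discovered node is just a breadth-first crawl of the network. A minimal sketch, using a made-up topology in which each node maps to the backup list it would hand a newly connected client:

```python
from collections import deque

# Hypothetical topology for illustration only: node -> backup list it
# would reveal to a fresh "new node" that connects to it.
NETWORK = {
    "X":  ["Z1", "Z2"],
    "Z1": ["Z3", "X"],
    "Z2": ["Z3"],
    "Z3": ["Z1"],
}

def crawl(start):
    """Map the network by re-joining as a 'new node' at each discovered
    peer and enumerating its backup list (the process described above)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in NETWORK.get(node, []):  # backup list learned from node
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

print(sorted(crawl("X")))  # → ['X', 'Z1', 'Z2', 'Z3']
```

The crawl reaches every node connected (through any chain of backup lists) to the starting node, which is the point of the argument: one entry point is enough to map the whole reachable network.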
It may be that existing nodes in the peer-to-peer network impose some time delay before they will tell the new node, Y, about nodes in the network other than the node(s) that Y is connected to directly. So X might wait 24 hours, for example, before starting to send a list of backup nodes to Y, and even then it might not send the entire list, preferring instead to send one node every 24 hours until the entire list of, say, 7 nodes has been sent over the period of a week. But in this case, the adversary can connect multiple nodes to the network (either using multiple machines, or one multi-homed machine with multiple IP addresses) in order to do the waiting in parallel, without requiring any human intervention. Even though it may take a week to get the first list Z1, Z2, Z3 of alternate nodes that X is aware of, the adversary could then connect simultaneously to Z1, Z2, etc. and do the waiting for all of them in parallel, so that at the end of another week, they would have all of the alternate nodes that each of those nodes was aware of. Creating the appearance of multiple nodes with multiple unrelated IP addresses is pretty easy, especially for an adversary that controls the country's Internet infrastructure and can order ISPs to provide additional IP addresses at no cost, as the Chinese government can.
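A back-of-envelope calculation shows why the delay doesn't help. With parallel fake identities, each "round" of joins multiplies the set of known nodes by up to the backup-list size, so the number of rounds grows only logarithmically in the network size, while a single sequential crawler pays the delay once per node. All figures below are illustrative assumptions:

```python
import math

NODES = 10_000    # hypothetical network size
LIST_SIZE = 7     # backup nodes each peer eventually reveals
DELAY_DAYS = 7    # days to extract one peer's full backup list

# One machine visiting peers one at a time pays the delay per node:
sequential_days = NODES * DELAY_DAYS

# Parallel identities: known nodes multiply by up to LIST_SIZE per round,
# so the crawl needs only about log(NODES) / log(LIST_SIZE) rounds.
rounds = math.ceil(math.log(NODES, LIST_SIZE))
parallel_days = rounds * DELAY_DAYS

print(sequential_days, parallel_days)  # → 70000 35
```

Under these (made-up) numbers, the sequential crawl would take nearly two centuries, while the parallel crawl finishes in about five weeks -- the delay slows the honest user as much as it slows the adversary.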
Other papers such as the one at:
describe strategies for designing peer-to-peer systems in a way that limits the number of other nodes that a given node in the network can find out about. I think the flaw in this logic is that a single adversary can repeat the process of "joining" the network over and over (either with a single machine repeatedly joining and disconnecting from the network, or with multiple machines that join the network without requiring any human intervention). So even if you have a limit on the number of additional nodes that one node is allowed to know about, it's easy to get around that limit by creating multiple nodes, either simultaneously, or one after another using the same machine.
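This objection can be made concrete with a small simulation. Suppose (as a simplified, hypothetical model of such a disclosure limit) that each join reveals only K nodes chosen at random from the network. Repeated joins -- cheap for the adversary -- still uncover nearly everything, for the same reason the coupon-collector problem terminates:

```python
import random

random.seed(0)           # deterministic run for illustration
NETWORK_SIZE = 1000      # hypothetical number of nodes
K = 5                    # per-join disclosure limit (assumed)

known = set()
joins = 0
while len(known) < 0.95 * NETWORK_SIZE:
    joins += 1
    # Each fresh join reveals K distinct randomly chosen nodes.
    known.update(random.sample(range(NETWORK_SIZE), K))

print(joins, len(known))  # joins needed to map 95% of the network
```

In runs of this model, on the order of hundreds of joins suffice to map 95% of a thousand-node network -- trivial for an adversary who can script joins or command many IP addresses, which is exactly the situation described above.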