Friday, July 23, 2010

Ars Forgets How Torrents Work, Cites Faulty Study

If you follow tech news, you have a certain list of sites you'll keep an eye on. Personally, I always keep an eye on TorrentFreak (but then, I am their researcher and night-time comment moderator), but there are others as well: Wired's Threat Level, Slyck, and of course, ArsTechnica.

The problem for all tech news sites is that there's a deadline game. You have to be first to break the story, so you can get it passed around the social media circles: Facebook, Slashdot, etc. Often that means that stories, or more specifically the data behind the story, don't get the attention they should. ArsTechnica has fallen foul of this, repeating the conclusions of a study without noticing some glaring errors.

The first error is in the basic method. This is the method as described in the paper.

They go to a site, grab the 10 most popular torrents, and pull the trackers from them, giving 23 trackers. Then they fully scrape those trackers and pick the most popular torrents from the combined scrape response. They take each torrent's filename, use that to categorise it, and from that we get the results.

So they pick the most popular torrents based on a combined seed count, obtained from scrapes, which are notoriously easy to spoof. Then they look at the names and categorise the torrents based on those, which are, again, easy to spoof. Finally, a determination of the copyright status is made based on the name. And thus we end up at the 0.3% figure championed in the Ars piece.
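To make that concrete, here is a minimal sketch of the "combine the scrapes, take the top N" step. It is not the study's code. The tracker responses, torrent names and seed counts are all made up for illustration, and real scrapes are keyed by infohash rather than by name (names are used here only for readability). It just shows how a single tracker reporting an inflated seed count ends up at the top of the list.

```python
from collections import defaultdict


def combine_scrapes(scrapes):
    """Sum the seed counts that each tracker reports for each torrent."""
    totals = defaultdict(int)
    for tracker_response in scrapes:
        for name, seeds in tracker_response.items():
            totals[name] += seeds
    return dict(totals)


def top_n(totals, n):
    """Return the n torrents with the highest combined seed count."""
    return sorted(totals, key=totals.get, reverse=True)[:n]


# Hypothetical scrape responses. Nothing here is verified by connecting
# to the swarm, which is exactly the weakness being illustrated.
honest_tracker = {"linux-distro.iso": 4000, "some-film.avi": 9000}
spoofing_tracker = {"some-film.avi": 500, "Fake.Blockbuster.DvDrip.avi": 1_000_000}

totals = combine_scrapes([honest_tracker, spoofing_tracker])
ranking = top_n(totals, 2)
fake_share = totals["Fake.Blockbuster.DvDrip.avi"] / sum(totals.values())
```

In this toy data the fake torrent not only ranks first, it also accounts for nearly all of the "seed population", so anything weighted by seeders is really measuring the spoofed number.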

Problems in this method are not hard for anyone to spot. The most obvious one, for anyone that's been around torrents for a while, is that huge seed numbers are usually a bad sign. They're fakes, put up by AntiP2P companies to discourage people by being impossible to finish, or by not being what they claim to be. This started in the Napster era, and the most famous example was Madonna, who hit the headlines with it back in 2003. Alternatively, they can be torrents of trojans and other malware set up to infect people's computers, deliberately mislabeled and with hyper-inflated seed/leech figures to entice people to download them. The most important thing to do in a study involving bittorrent is verify. Anyone that knows bittorrent well knows this. The highest number of seeds on a torrent that has been verified, at least that I'm aware of, wouldn't even be on the first page of their torrents.

The second issue is the initial data collection itself. They went to a site, grabbed the most popular (which, as has just been noted, is not the way to get a 'valid' torrent) then used the trackers listed in them, to compile a list of trackers, noting "each torrent having at least 10 trackers associated with it". No torrent should have more than 2 trackers, and really only needs 1. More trackers don't add anything (no extra peers, just extra overhead for you, and for the tracker). It does mean that 'disreputable', or honeypot trackers (ones set up specifically to track users for purposes other than being purely a bittorrent tracker) can hide in the swarm better. Again, to anyone with knowledge of bittorrent, this is well known. Thus, when they include these trackers, they're going to get 'more' fake/honeypot/trojan'd torrents, rather than the 'real' torrents they are after in order for their study to be accurate.

Finally, names. They categorise based on names, and as we all know, the name of the file ALWAYS matches the contents. It's not like you can change a file name to anything you want and the contents stay the same... Oh wait, it's EXACTLY like that. So, what they have listed as The Incredible Hulk[2008]DvDrip-aXXo97065494792.4447 could quite easily be 'randomDataGeneratedByN00b' and not work, or could be 20 seconds of the film intro, then switch to Rick Astley for 3 minutes, and then random data. If so, then The University of Ballarat AND AFACT have both been Rickrolled, very very publicly. It's also not like AntiP2P companies intentionally misname files, oh DAMN, yes they do, that's exactly what MediaDefender did.
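Here is a rough sketch of why name-based categorising tells you nothing about the contents. The categoriser and its keyword rules are hypothetical (the paper's actual rules aren't reproduced here), and the payloads are placeholder bytes. Two "torrents" carry the same name but different data, and a classifier that only reads the name can't tell them apart.

```python
import hashlib
import re


def categorise_by_name(name):
    """Guess a category from the file name alone (hypothetical rules)."""
    if re.search(r"dvdrip|xvid|bdrip", name, re.IGNORECASE):
        return "movie"
    if re.search(r"\.iso$", name, re.IGNORECASE):
        return "software"
    return "unknown"


# Same name, different contents. The payloads are stand-ins for real data.
name = "The Incredible Hulk[2008]DvDrip-aXXo"
torrents = [
    {"name": name, "payload": b"pretend this is the actual film"},
    {"name": name, "payload": b"randomDataGeneratedByN00b"},
]

# The name-based label is identical for both...
labels = [categorise_by_name(t["name"]) for t in torrents]
# ...while a hash of the contents shows they are different files.
digests = [hashlib.sha1(t["payload"]).hexdigest() for t in torrents]
```

Both entries come out labelled "movie", even though their content hashes differ. Only downloading and checking the data would tell you which one, if either, is the film.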

So, we've got a method that uses bad data, collected by using other bad data, using bad data to make determinations about copyright. From that, AFACT makes a big deal gloating about it.

The Australian Federation Against Copyright Theft (AFACT) has welcomed the release of a research paper by the University of Ballarat into the extent of infringing content on BitTorrent networks stating it gives a clear insight into the nature of traffic on Bit Torrent network.

The academic research is the first to quantify the percentage of infringing BitTorrent (BT) traffic. Previous research only looked at the overall percentage of BT traffic across the internet, but not the legality of the traffic packets.

The key finding was that at least 89.9% of all torrents to be infringing.
The research analysed a sample of 1,000 unique torrents taken from 19 of the most popular BT trackers. The research objective was to investigate the percentage of shared files which are infringing, both by number of files and total seeders, as well as to evaluate the most popular categories of shared files. The results found that the percentage of legitimate BT traffic being shared.
A summary of key findings included:

1. 89.9% of all torrents within the sample were found to be infringing both by the number of files and total downloads. This was excluding all pornographic torrents whose legality could not be verified. If all pornographic titles were classified as infringing this overall figure would rise to 98.1%.

2. The top two categorized torrents were movie and TV shows making up 72.4% of all torrents. There were no legitimate movies or TV torrents in the sample.

3. The top two movie files were being seeded more than 1 million times each. The third most popular movie file was being seeded more than 500,000 times.

4. 9.9% of torrents were responsible for 90% of the total seed population.

5. Only 1 non-infringing torrent (an open source program) was found in the most popular 100 torrents.

Unfortunately, as I've now pointed out, these points are completely unsupported by the data. The highest number of seeds on a torrent that I've been able to verify was around 115,000, as I said a few weeks ago. A million is right out. They've also called bittorrent a network. It's not. It's a protocol. A network would imply the peers are all connected. While they may be connected via DHT, that's another protocol on top of bittorrent. Bittorrent itself doesn't have a network, and never has.

They may well be correct in saying that a large amount of the traffic they studied is illegal, though. There's a good chance much of it included trojans or viruses, or software to enable identity or data theft. These items are, in most places, criminal to distribute with the intention of using them for gain. That's the ONLY way, though, that AFACT is remotely correct.

Finally, there's no way to tell from a name if the material is infringing or not. First of all, it's all under copyright, no matter what; the Berne Convention sees to that. If we were to assume the names were accurate, and the files the real thing, in the majority of cases you could guess whether a license to distribute had been given, but a lot of times you can't. You have to determine the copyright holder, and then ask them if it's licensed. That's not as easy as it sounds much of the time. Even the big boys, like the IFPI, get it wrong sometimes.

So, what's the study worth overall? Well, we've got bad data, obtained poorly, then bad conclusions drawn from it (thanks to unclear copyright laws); the answer is NOTHING. The study, and the data in it, are absolutely worthless. The methodology is only of any use if you're hunting for AntiP2P orgs, or malware. If you're trying to determine the scale of copyright infringement, though, you should start by asking someone who knows the subject. If the Internet Commerce Security Laboratory would like my help in redoing it, using accurate data, they're more than welcome to ask (the contact link is at the top of the page). Meanwhile, I'm going to work on my own version, and because I know how both bittorrent and AntiP2P orgs work, mine will be accurate.

ArsTechnica on the other hand SHOULD know this, it's their job to. If they weren't sure, they should ask. I know they even have my personal cellphone number, as they've interviewed me on it in the past. AFACT almost certainly knows it's inaccurate (they're not dumb, despite what people think) but it's exactly the message they want to promote. The fact that they have to resort to effectively worthless studies to make their point should tell you everything you need to know about the validity of their point.

UPDATE 24/7/10 09:00: TorrentFreak also published an article with similar viewpoints (not surprising though, since Ernesto and I talked about how foolish the study was).

1 comment:

  1. Wow, I wouldn't consider myself a BT expert but I know most of this stuff. I would have thought that they would have made efforts that were a little more useful than this. Good luck on the study. If you can conduct the study in a P2P way somehow (I and many others would be very willing to volunteer to help) then do so. Maybe an open source app that sits on our computer and keeps track of the connections made by a real BT user. I wouldn't mind saying "yes I download hollywood movies a lot" if I could have a basic level of anonymity in submitting it. It's not like BT is a private activity to begin with.