Remarks of Carl Malamud
It took a while. It was about 30 terabytes of data. I ended up with 463,000 books that I was able to successfully get. Some of them, I couldn’t get, some of them were broken URLs, but we got 463,000 PDF files.
This was December of last year [2016], and in January, I did the upload to the Internet Archive, and these things take a while when you’re doing that much. So I started looking at this collection in more detail, because I couldn’t really tell what was in it until I actually had the data.
This is books in 50 different languages. There are, I believe, 30,000 books in Sanskrit. There’s tens of thousands of books in Gujarati, and Bengali, and Hindi, and Punjabi, and Telugu—you name it, it’s all there. About half the books are in English and French and German, but it’s a unique collection.
Now, it had problems. When I went to mirror it, the server kept spitting out code 500 system errors. It kept breaking, and so, my scripts kept breaking. I’d go back the next day, and I’d start the scripts again, and I’d be able to get some data, and then, they’d lose DNS. Their DNS servers kept going down.
And so, you’d ask for a DNS name, and it’d say, “Host Not Found,” and so on. I started hard coding the IP addresses, because that was the only way I could grab the docs. There were other issues, besides poor hosting. The metadata is kind of a mess. Many of the titles are broken. The scanning, some was good, some was not.
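The mirroring workaround described above, restarting after server errors and hard-coding IP addresses when DNS fails, can be sketched roughly as follows. This is a minimal illustration, not the scripts actually used. The host name `books.example.org`, the address `192.0.2.10`, and the helpers `pin_host` and `fetch_with_retry` are all hypothetical, and the `fetch` argument stands in for whatever HTTP client does the real download. It assumes plain HTTP; over HTTPS, certificate checks would complicate pinning an IP.

```python
import time
from urllib.parse import urlsplit, urlunsplit

# Hypothetical host-to-IP table, filled in by hand while DNS still answers.
PINNED_IPS = {"books.example.org": "192.0.2.10"}

def pin_host(url, pinned=PINNED_IPS):
    """Return (url_with_ip, headers) so the request needs no DNS lookup."""
    parts = urlsplit(url)
    ip = pinned.get(parts.hostname)
    if ip is None:
        return url, {}  # no pinned address: leave the URL alone
    port = "" if parts.port is None else f":{parts.port}"
    # Keep the original name in the Host header for name-based virtual hosts.
    headers = {"Host": f"{parts.hostname}{port}"}
    return urlunsplit(parts._replace(netloc=f"{ip}{port}")), headers

def fetch_with_retry(fetch, url, attempts=5, delay=1.0, sleep=time.sleep):
    """Call fetch(url, headers) until the status is below 500 or attempts run out."""
    target, headers = pin_host(url)
    for attempt in range(1, attempts + 1):
        status, body = fetch(target, headers)
        if status < 500:
            return status, body
        if attempt < attempts:
            sleep(delay * 2 ** (attempt - 1))  # back off before the next try
    return status, body  # still failing: hand back the last error response
```

A caller would pass its own `fetch(url, headers)` function that returns a `(status, body)` pair; files that still fail after the last attempt can be logged and retried on a later run.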
There are a lot of duplicates in there, but it’s still a unique collection. I also noticed that there were some books that seemed somewhat adventuresome on copyright. I looked at it, and said, “Well, you know, some of these are pretty recent.” But I looked down at the copyright field, and it said, “Not Copyright.” So, I said, “Well, they must have known what they were doing.”
What I do on archives like that is, we put them online, and if people start complaining, you say, “Okay, fine. I’ll take that stuff off.” So I put it online, and it went online in February of this year. We’ve gotten, I think, eight and a half million views on this collection, so far.
So this collection went online, Google started seeing it, people looked at it—we had half a dozen people write to us, and say, “Ah, you’ve got my book there!” You know, standard DMCA takedown in the United States. Not a problem. Fine, we’ll remove the books.