Scribd: You're Alright

Scribd, the YouTube for text documents? Something like that. Scribd lets you upload documents in a shit-ton of different formats and view them in their dinky little Flash viewer. It's not a bad idea, I will give them that. It makes viewing documents in many different formats really, really easy. They have a couple of real technical hurdles to overcome, which will be the true test of their engineering skill.

Text Is A Pain In The Ass

When I was in college, I searched for a lot of obscure documents. Namely, answers to homework questions that I didn't feel like doing. They came in all shapes and sizes: MS Word, PDF, even raw LaTeX. I burned a lot of time in the turn-around cycle of downloading a document, opening it up, checking to see if the answer I wanted was in there, and moving on to the next one. If Scribd had been around, I could have started happy hour a lot earlier.

It doesn't surprise me that someone is tackling this problem. It does surprise me that they're not completely sucking at it. It's even more surprising that this thing came out of YCombinator, but everyone gets lucky from time to time. It's still a bit of a mystery how they're going to turn a profit, but this is Web 2.0, we don't have to care about such things. Don't you guys know? Everyone in the San Francisco Bay Area is rich. We just do this all for fun.

Don't Get Too Cocky

I said they're not completely sucking at it. Scribd still sucks, but it doesn't make me want to gouge my eyeballs out with a fork. Let's tackle the elephant in the room: search. Scribd, your search function blows donkey chode. Really. It would be better if you just didn't have the search bar.

I wanted to see what kind of mathematical content Scribd had, so I searched for "theorem of lagrange", looking for a proof of one of the most basic theorems of group theory. Well, I didn't find the proof I was looking for, but I did find out that the Scribd developers have never heard of stopwords. The first hit for this query has the word "Of" highlighted in the title. Even the most basic classes on information retrieval and data mining teach you to cut out stopwords.
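For the record, cutting out stopwords is not hard. Here's a minimal sketch in Python of what I mean. The stopword list and the function name are made up by me for illustration; I have no idea what Scribd's query pipeline actually looks like, and a real list would be a lot longer.

```python
# Hypothetical, hand-picked stopword list; real systems use a much longer one.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def strip_stopwords(query):
    """Lowercase the query, split on whitespace, and drop the stopwords."""
    return [term for term in query.lower().split() if term not in STOPWORDS]


# Only the terms that carry any meaning are left to match and highlight.
print(strip_stopwords("theorem of lagrange"))  # ['theorem', 'lagrange']
```

Run the query through something like this before matching and "Of" never gets a chance to show up highlighted in a title.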

Google managed to find this proof on Scribd, though (I think). I queried for "theorem of lagrange site:scribd.com" and Google found a book, "Elementary Abstract Algebra", on Scribd. Of course, clicking on this link in Google doesn't give anything useful, because Scribd will serve the text of the document to Googlebot, but will serve the Flash viewer to a person. Click the "cached" link on Google - you will see what I mean. When you get to a Scribd document from Google, the viewer just loads the first page of the document: not particularly useful.

So, Google does a better job at finding stuff on Scribd than Scribd does finding stuff on itself, and Scribd sucks at showing you the information you want. I can tell their search is basically a glorified grep, and they're going to have to do a lot better than that, or at least serve results to Google users correctly. Just a tip, guys: don't try latent semantic analysis. The singular value decomposition is going to kill you. Also, because your documents don't have any link structure, PageRank isn't going to help you. Good luck with this one.
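To be clear about what "better than grep" could look like without any SVD: plain old TF-IDF scoring is the textbook first step. Here's a rough sketch in Python. The function names and the three toy documents are mine, this is not what Scribd runs, and I'm not claiming it's what they should ship. It just shows that weighting terms by how rare they are takes about twenty lines.

```python
import math
from collections import Counter


def tokenize(text):
    """Crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def rank(query, docs):
    """Return indices of docs with a nonzero TF-IDF score, best match first."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scored = []
    for index, tokens in enumerate(tokenized):
        if not tokens:
            continue  # an empty document cannot match anything
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if doc_freq[term]:
                tf = term_freq[term] / len(tokens)
                idf = math.log(n_docs / doc_freq[term])
                score += tf * idf
        if score > 0:
            scored.append((score, index))

    # Highest score first; ties broken by original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]


# Toy corpus, invented for this example.
docs = [
    "a history of the bay area",
    "proof of the theorem of lagrange",
    "the method of lagrange multipliers",
]
print(rank("theorem of lagrange", docs))  # [1, 2]
```

Notice that "of" shows up in every document, so its inverse document frequency is log(1) = 0 and it contributes nothing to any score. The first document only matches on "of", so it drops out of the results entirely, which is exactly the behavior my search didn't get.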

Also, what I would love to see out of Scribd is a Firefox extension that will render any documents you come across on the web with their viewer. Get on it.

Welcome Aboard

Relax, I'm just breakin' balls. Scribd, you're alright, but now is the time to get your shit in gear and make a great product. Once I can find the documents I want on your service, you'll kick ass.