
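The Reds and Purples Have All Turned to Dust; Amid the Cuckoo's Call, Summer Is New

Shanxi Institute of Finance and Economics 78jitong 19781017--19820715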

 
 
 


March 3, 2016

2016-03-03 09:39:10 | Category: Default


Does Amazon’s Data Speak for Itself?

Finding meaning in the retailer's mountain of information isn't easy.

I have a copy of Amazon. Meaning that, on my hard drive there is a massive chunk of Amazon’s product and reviews database—a listing of nine million or so products and 80 million or so reviews taken from 1996 to 2014. The names of all the books in that chunk, their sales ranks, their categories. Every pair of pants for kids, every sock. All the books about Hitler; all the books about snakes. All the different Lego sets. Whatever.

The way I came to be in possession of this thing is that someone tweeted that it existed. I visited a web page, sent an email to a researcher at UC San Diego, and was sent a link to download the data, with a request to cite a paper associated with it, which was presented at the 2015 SIGIR conference: “Image Based Recommendations on Styles and Substitutes.” The data totaled about 20 gigabytes, compressed.


Illustration by Serafine Frey.

It’s not a perfect copy by any means, but neither is it a pirated one. Rather, it is “spidered” data, culled by automatically visiting Amazon’s web site and copying what is found, adding it up, aggregating it. One could do the same with Walmart.com, or with any big company. But Amazon is a special case: It is possibly the most purely optimized commercial enterprise in history, marrying hard computer science to ruthless labor practices in pursuit of delivering brown, branded boxes to anyone who might conceivably want them. It knows so much about us, and we know so little about it. Walmart has done terrible things for longer, but in comparison seems so amateur. Amazon is out for the world. And I write this as a hypocrite. Who knows how many Amazon boxes are on their way to my house? They show up daily sometimes. Fear is the coin flip of admiration.

In the data, the books don’t have authors, many prices are missing (and I can’t find any prices above $999.99), and there are other gaps besides. Nonetheless, it’s what was granted me. A conglomerate in a teacup. I decided to absorb the data into a database. The first draft of the code I wrote to do so informed me that it would take 25 days of computer processing to complete. That was too long. Also I was out of hard drive space. So I went to a store and bought a computer, a big, boxy, unfashionable PC with a 4-GHz quad-core processor and ten terabytes of extra hard-drive space, installed Linux on it, and got the most recent version of the PostgreSQL database. I could have done all this in the cloud of course, but it’s harder to just mess around in the cloud, and there’s something very comfortable about having your own big machine next to your knee. Besides, the cloud I know best is Amazon’s, and I didn’t want to get conflicted.

With the help of that machine and quite a few database tricks to massage and extract the data, I got 25 days down to one, with searchable titles, descriptions, and reviews. Seven days of programming and one day of absorption to beat one day of programming and 25 days of absorption: a pretty familiar set of trade-offs. You’re always trying to balance your time against the computer’s, but there’s also the challenge of the thing. I probably should have just let it run for four weeks.
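For readers curious what that kind of absorption can look like, here is a minimal sketch. It is not the code described above. It assumes the dump is a gzip file with one JSON record per line and that each record carries asin, title, and price fields (the essay does not describe the format, so all of that is an assumption), and it uses Python's built-in sqlite3 module instead of PostgreSQL so that it runs without a server. The function names are hypothetical. The one trick it shows is a common one: stream the file and insert rows in large batches, committing once per batch rather than once per row.

```python
import gzip
import json
import sqlite3
from itertools import islice


def read_records(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def batches(iterable, size):
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def load_products(conn, path, batch_size=10_000):
    """Insert products batch by batch and return the number of rows written."""
    # Assumed schema: the real dataset has more fields than these three.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "asin TEXT PRIMARY KEY, title TEXT, price REAL)"
    )
    total = 0
    for chunk in batches(read_records(path), batch_size):
        # Missing fields (many prices are absent) become NULL.
        rows = [(r.get("asin"), r.get("title"), r.get("price")) for r in chunk]
        conn.executemany(
            "INSERT OR REPLACE INTO products VALUES (?, ?, ?)", rows
        )
        conn.commit()  # one commit per batch, not per row
        total += len(rows)
    return total
```

Nothing here holds the whole file in memory, and the batch size is the knob that trades the programmer's time against the computer's.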

So now I have it. I have my very own local, diminished Amazon. Now what do I do? Do I set up shop? Not really; I can’t just reproduce their pages and reviews. Whenever I find myself in an unfamiliar database I search for the same damn things. Hitler always comes to mind, because Hitler shows up everywhere. What, I wondered, was the most expensive Hitler book I could find? Speeches and Proclamations 1932-1945, in translation, four volumes, $721.05. For the strapped, though, Mein Kampf is only 99 cents on Kindle. What about Roosevelt? $140.36, for a book called Allies at War. The Eleanor Roosevelt Encyclopedia runs $95.07.
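A search like "the most expensive book about X" comes down to one short query. The sketch below is only an illustration: the products table, its columns, and the sample rows are invented for the example (they are not taken from the dataset), and it uses Python's built-in sqlite3 module rather than PostgreSQL so that it runs as-is.

```python
import sqlite3


def most_expensive(conn, keyword):
    """Return (title, price) for the priciest product whose title contains keyword."""
    return conn.execute(
        "SELECT title, price FROM products "
        "WHERE title LIKE ? AND price IS NOT NULL "
        "ORDER BY price DESC LIMIT 1",
        (f"%{keyword}%",),
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (asin TEXT, title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        # Made-up rows for illustration only.
        ("X1", "Sample History Volume", 12.50),
        ("X2", "Sample History Boxed Set", 410.00),
        ("X3", "Sample History Pamphlet", None),  # missing price is skipped
        ("X4", "Unrelated Cookbook", 999.99),
    ],
)

print(most_expensive(conn, "History"))  # ('Sample History Boxed Set', 410.0)
```

Rows with no price are filtered out explicitly, since many prices are missing from the data.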

Reviews are associated with a total number of votes, and this quickly reveals that the very topmost, thumbs-upped reviews are the joke ones, and the mean ones; for the Hutzler 571 Banana Slicer, “No more winning for you, Mr. Banana!” (52,861 votes); for the BIC Cristal for Her Ball Pen, a review of “Finally!” with 38,604 votes; for the Fifty Shades of Grey audiobook, “Did a teenager write this?” And of course there’s much fun to be had from a five-pound bag of Haribo Gummi Bears, which if eaten in quantity are a laxative, or the Playmobil Security Check Point playset. Books feature very little in the “most reviewed.” But then again few books are as hilarious as the idea of Gummi Bears causing terrible diarrhea, or as likely to inspire passion as a Kindle Fire HD 7”.

Looking at the book rankings by popularity reveals very few secrets. The Alchemist: Doing fine. Heaven Is for Real is up there, even though heaven is possibly not for real. We like books for children; we like dying teens; we like dragons; we like sex and murder. In other categories—pet supplies, for example—it’s kitty litter that’s number one. Presumably 2015 was similar.

It gets stranger at the bottom. You can sort in reverse order. This is a computer. At the very end of the long tail you find the typical basement bin: How to Stay Sane in a Crazy World down at 15 million; or Creative Screwing, which is self-published and costs more than $700, and thus is also down around 15 million. You can see all the basic forces at work: At the top of the list there’s marketing, popularity, and relatively little regard for the literary; many of these books are garbage, and their popularity is immune to reviews. You have to go down the list to find the world of “quality.” Different ecosystems thrive in there, in among the rankings: the world of the careful sentences; the world of the graphic novels. To Amazon, though, or rather to its computers, it’s all one thing. If you’re a programmer, the difference between a can of oil and a book is fairly minimal. Each one is worthy of review, each one can be assigned a ranking. If you make more profit on the can of oil, you should focus your efforts there. Also, there may be two million books out of those nine million quality items, although no one knows the exact number. The real business is in the digital downloads, of course. Those are the best: Immediate gratification. No warehouses. The labor is purely the author’s, as is the marketing and promotion. Margins approaching infinity.


I kept looking and looking but finally I had to admit: I can’t climb this particular mountain. There’s no obvious path through this data. I could claim that it’s a mirror of capitalism, or the global marketplace, but I can’t prove that. The broad claims of the essayist are no match for the digital reality of a global megastructure.

I don’t have a good mental model for thinking across nine million objects, nor for exploring 80 million opinions. This is what people are talking about when they say “big data,” of course: No one knows what’s actually inside there, no one can make sense of all that stuff. No single human being could possibly read all of the reviews on Amazon in a single lifetime, and even reading the names of all the products would take six or seven months. Big data, for the most part, is made by humans—it is the record of what we clicked on, the banner ads we viewed, our paths through a site, multiplied by humanity. Sometimes it is seismic data or star charts too, but mostly what people are talking about with big data is data about human behavior that can be mined to create better predictive models for future human behavior.


 

