How Netflix Reverse Engineered Hollywood (theatlantic.com)
336 points by coloneltcb on Jan 2, 2014 | 129 comments


For all the high praise that gets heaped on Netflix for their brilliant technology, I have a feeling there is some other Netflix that is concealed from me.

I have been a Netflix customer for years. I thought the idea was brilliant - super-cheap movies arriving whenever you want, what could be better?! I loved Netflix. Then I slowly discovered Netflix was running out of movies I want to watch - to the point where about 95% of the movies I want to see are unavailable. Then there was that streaming vs. DVD fiasco - and I stayed with streaming. But then I discovered there's nothing for me to stream. I thought maybe my tastes are weird - so I went to Wikipedia and IMDB and looked up "top X movies" lists - and most of them, of course, can't be watched on Netflix, except for the few that I already watched long ago.

And that million dollar recommendation system? I have over 800 ratings, and I have a hard time remembering the last time their system suggested something useful to me. In fact, the only reason I am keeping the subscription is because my wife has some series on her sub-account that she's watching. For me, Netflix has become almost 100% useless. So I wonder, with all the high praise for their brilliant data usage and innovative technology - am I doing something wrong? Am I missing some important part of Netflix that everybody else is seeing?


Yes, you're right. What is making Netflix gain market share are the series. Basically most people are using Netflix instead of TV, not to get Hollywood movies. The same pattern happened at my home: there are no good movies to see, but plenty of activity around TV shows and documentaries. And from a business point of view this is even better for them, because TV series make people stick around longer than a single movie.


Right, you basically still need to get the little plastic disks--from Netflix or elsewhere--if you want to watch quality movies, especially recent ones. Someone I know there told me a year or two back that people get streaming with the intention of watching movies but they end up watching TV shows instead.

This doesn't quite apply to me because I quit streaming when they did the whole splitting of physical media from streaming thing. And signed back up when House of Cards came out. But I do view streaming today as a source for TV shows and any movies I want to watch on it are basically lagniappe.


The algorithms and operations behind the storage and distribution of those physical disks are a marvel to behold though. A friend of mine was in Netflix's DVD operations for a number of years, and it was fascinating to hear what they had implemented (I can't do it justice). Imagine a bunch of computer scientists implementing sorting and data storage optimizations in the physical world, and you get the Netflix DVD distribution system.

If you ever meet someone from the DVD division, take him out for some drinks. You won't regret it :)


I see a lot of "recent" movies on rent/sale on YouTube (India). No subscription though. There is even an Indian version of Netflix - Bigflix, but content is old and scarce.


Your complaint is misguided. This is not the fault of Netflix; they can only show what they are licensed to show. If your favorite movie is not on Netflix, if the movie you really want to see is not on Netflix, if the latest movies are not on Netflix fast enough, it's likely because the copyright holder has not made a deal with Netflix.

For example, take a look at newest releases in Netflix in a browser. Now use a proxy to change your country and look again. Likely, depending on where you actually are and where you say you are, you will see a different list. That's because of licensing.

I would imagine that Netflix would be more than happy to have in their catalog every bit of movie and television products that have ever existed and will ever exist as fast as possible for everyone to see regardless of where you happen to live. But as long as the copyright holders feel they will make good money off of DVD sales or other avenues before offering it to streaming, then that's the environment streaming will always be.

If there's something you wish to see on Netflix and it isn't there, complain to the distributor of that product and not Netflix.


It's not a question of "fault", it is a question of Netflix not being a useful service to me. What I'm saying is that maybe that recommendation engine is excellent technically, I don't know - for me a good recommendation engine is one that can recommend me something I could use, and Netflix's cannot. Whose "fault" it is - why would I care? I certainly wouldn't start a personal crusade to find out who in the byzantine network of IP rights is to blame that I can't watch anything of interest on Netflix. For me it is a failure of Netflix as a service, and the reason for this failure is not for me to investigate - it's for them, if they want to become a useful service.


I think streaming Netflix requires a slightly different approach to how you watch movies.

For ages I found myself ignoring Netflix recommendations based on the descriptions, but when I watched them I actually enjoyed them. If I wanted Netflix for well-known movies and safe choices I'd be disappointed, but as a way of watching unknown (to me) movies, I find it great.


If they had the same films available to stream as they do to get on DVD, I'd pay for Netflix.

I cancelled the free trial in about 30 minutes for this reason. I still think Netflix owns for series, but I didn't need it for that.


Maybe you are wrong, and if you were it would be great because then you could enjoy it right?

However, maybe this product just isn't for you.


A friend of a friend works at Netflix and told me how they use some of this data.

House of Cards was basically a data-driven production. Based on its customers' preferences and viewing habits, Netflix knew that a political thriller, starring Kevin Spacey and directed by David Fincher, would maximize the number of views.

It would appear the data was correct!


Is there a post somewhere where they describe this?


Good link by yonasb.

But the comment by refurb sounds a bit misleading. As I understand it, they did not produce this solely because of the data (as if the show was born out of the data). The idea, the actors, and the production staff were already chosen before Netflix's involvement. Netflix looked at those things as they had been proposed and realized it was a good gamble to bid on the show.


I don't have the facts to contradict your statement about finding the show after it had already been made, but wasn't it a "Netflix original production"? As in, Netflix pitched it to the studios?

I know the script originated from the British series, but don't they usually only select actors after they have funding? I'm not that familiar with the process, so I may be wrong.


As I understand it, it was indeed a Netflix original, but the plans for who would be in it, who would direct it, what it was about, etc. were already in place before Netflix got involved. Again, I'm no expert on the subject either, but that's how I understand what Wikipedia explains:

"MRC approached different networks about the series, including HBO, Showtime and AMC, but Netflix, hoping to launch its own original programming, outbid the other networks. Ted Sarandos, Netflix's Chief Content Officer, looked at the data of Netflix user's streaming habits and concluded that there was an audience for Fincher and Spacey."

(http://en.wikipedia.org/wiki/House_of_Cards_%28U.S._TV_serie...)

See the sources linked in those paragraphs, where those statements are summarized from.



Netflix's data allows it not only to recommend movies, but also to finance original productions.

Lots of businesses want "recommendation engines" to appease their cargo cult gods; few ask what possibilities their data really creates.

Sometimes data can make you better at delivering your service. Other times you can optimize inventory, enter entirely new lines of business or even obsolete your competitors.


So audience data when you're talking about original productions can be a blessing and a curse. We don't need big data to tell us that the top four favorite subjects of Netflix users are marriage, royalty, parenthood, and reunited lovers. We've been telling stories about those things for hundreds - if not thousands - of years.

The data can also be misleading, because sometimes what audiences want isn't what's good for dramatic effect. Both Dexter and Homeland suffered from audiences reacting positively to the main character, which caused Showtime to stop the writers from making creative decisions that, yes, would have alienated the audience initially, but would have made for shows that were better received in their later seasons.

Another example: I'm pretty sure that if you looked at the Netflix data it would show that people liked to watch movies and TV shows that either had happy endings or twist endings. The former makes you feel good, the latter makes you appreciate the writing. What people really don't like are ambiguous endings. If we based creative decisions purely on audience data a show like The Sopranos wouldn't have the ending it does. The upshot of all of this is that many writers end up 'cheating' the data to get their shows made. Take Orange is the New Black, where the creator Jenji Kohan has gone on record saying she basically used the main white, engaged protagonist as a trojan horse to get the show commissioned. What I'm basically trying to say in this rant is that using data to inform what shows to commission is one thing, but using it to direct the shows as they progress is another (and something I'm against).


> We don't need big data to tell us that the top four favorite subjects of Netflix users are marriage, royalty, parenthood, and reunited lovers. We've been telling stories about those things for hundreds - if not thousands - of years.

I'm really irritated at this attitude. Things that are obvious to you aren't obvious to other people. Confirming things using hard data is useful.


You're right that audience data is going to be full of the obvious. But it's also good for watching trends, before they go bust. A number of genres (e.g. superheroes) have cycles of growing, being massively popular, then dying off for a couple decades.

If you're in the business of making content, you want to join the trend early and get out before everyone realizes it's a fad. A truly great movie can buck trends, or even change them, but there are few truly great movies. Or you could just ignore this all and make the perennial favorites, generic romance or action movies.


Actually their recommendation engine is a little too accurate. It drives as much as 70% of their views. They have to purposely randomize it. For those curious how it works: See the brilliant talk from the guy who heads it up himself (this was a great talk to attend in person) http://www.mlconf.com/mlconf-2013/mlconf-2013-agenda/xavier-...

One reference I could find for the statistic from google: http://blog.kissmetrics.com/how-netflix-uses-analytics/

I don't think they'll have to worry about this anytime soon.


Haven't people gone to jail for scraping a URL and enumerating its possible values?


Yes. When they knowingly collect data they shouldn't have access to, unlike the data in the article.


Not defending the law (which I think is pretty terrible), but that seems like an incredibly fine distinction. I doubt Netflix would agree that I'm supposed to be able to download a list of every possible genre.


It is a fine distinction. The effort in a CFAA case to resolve the distinction will focus on evidence of intent; of the state of mind of the person committing the acts.

The CFAA is a very bad law, but it's hard to resolve "authorized" versus "unauthorized" deterministically without coming up with absurd situations. I think a better thing to look at is CFAA sentencing and severity, which currently scale up with the counter in your "for()" loop and quickly produce dizzying penalties for minor offenses. Taking the piss out of CFAA sentencing would also reduce the incentive that US Attorneys have to throw the book at petty offenders.


> it's hard to resolve "authorized" versus "unauthorized" deterministically without coming up with absurd situations

Like in other cases when computers meet the law, it's often a "colour of bits" distinction. I recommend this article:

http://ansuz.sooke.bc.ca/entry/23


I don't think that's really what is going on here. "Colour of bits" is a distinction between process and result. The question of authorization comes down to intent and what fits societal norms of fair dealing. It's a complex mess and there's no alternative to it, vs. colour of bits where there are two sane points of view.


> The CFAA is a very bad law, but it's hard to resolve "authorized" versus "unauthorized" deterministically without coming up with absurd situations.

The CFAA is a very bad law because it's hard to resolve "authorized" versus "unauthorized" deterministically without coming up with absurd situations. Imposing only misdemeanor penalties would certainly resolve the issue in practice by depriving the failure of its teeth, but it still wouldn't make it a good piece of legislation.


If it is API data that needs to be protected I would hope to see some HTTP code saying unauthorized. If it is a public feed then I probably would expect a 201 when successfully called.

If someone subverts an authentication system to get authed, then they were clearly never authorized no matter the granted access to the data.

Seems unauthorized or not is baked into the web already. It is the humans behind corporations and their lawyers trying to stretch the definition or create new mechanics. The AT&T case shows this for sure.
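
To make that convention concrete, here is a minimal sketch of a server that signals authorization through status codes. Everything in it is an assumption for illustration: the paths, the token, and the `status_for` helper are hypothetical and not taken from any real API. Note that a plain successful read is conventionally answered with 200 (OK); 201 means Created and normally follows a request that creates a resource.

```python
from http import HTTPStatus

# Hypothetical routing table: which paths are public and which need a token.
PUBLIC_PATHS = {"/feed"}
PROTECTED_PATHS = {"/accounts"}
VALID_TOKENS = {"secret-token"}  # placeholder credential, illustration only


def status_for(path, token=None):
    """Return the status code a server following this convention would send."""
    if path in PUBLIC_PATHS:
        return HTTPStatus.OK                # 200: anyone may read this
    if path in PROTECTED_PATHS:
        if token not in VALID_TOKENS:
            return HTTPStatus.UNAUTHORIZED  # 401: missing or invalid credentials
        return HTTPStatus.OK                # 200: credentials accepted
    return HTTPStatus.NOT_FOUND             # 404: no such resource


print(status_for("/feed"))
print(status_for("/accounts"))
print(status_for("/accounts", token="secret-token"))
```

The sketch only shows what the server says about authorization; whether that answer settles the legal question is exactly what the rest of the thread disputes.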


> If it is API data that needs to be protected I would hope to see some HTTP code saying unauthorized. If it is a public feed then I probably would expect a 201 when successfully called.

If I leave my apartment door unlocked, and you go in while I'm at work and steal my television, you are still liable for unlawful entry and burglary. There is no legally admissible argument that my not having locked the door confers upon you authorization to go through it.

Of course, that's a much clearer case than API access tends to be, but the point I make from it is that "nobody stopped me from getting to this" does not imply "it must be lawful for me to make use of this".


Yeah, but with HTTP API access, it's much more like someone who goes up to your apartment, knocks on your door, his knock is answered by your butler, he asks the butler for something, and the butler gives him it. It's not his fault that you hired a cretin for a butler.


No, but it's his fault that he exploited the butler's cretinism to gain access he knew or suspected he shouldn't have. If you're still unclear on the concept, your Google query for today is the Latin phrase mens rea; I'm no lawyer, but the impression I have is that that's the legal concept of import here.


Who said anything about exploit? You made the API and it has open access. If it is private don't make it public. Absolutely no analogies are needed for that.

Slap on some authentication before calling the lawyers. Lawyers are needed for when such systems have been subverted.

Seems the reality distortion of blaming anyone that touches something a corporation doesn't want touched is catching on. Persons behind corporations that dispatch lawyers with great zeal are responsible for conflating their own mistakes for exploitation.

For those having a hard time, just think like the NSA: Public unencrypted data will be used, but not always in the way the persons who made it public contemplated.


Yes, and expecting that no one will try my doorknob to see if I forgot to lock it is relying on security by obscurity, and if someone exploits my having forgotten to lock my door, both the exploit, and whatever harm comes to me by it, are entirely my own fault and no one else is liable.


Apologize for injecting insults and vitriol into this thread if you wish for a contextual reply.


I know that when you're in the middle of an argument it feels like you're having a 1-1 conversation, but you're not: you're writing messages to everyone on HN. This kind of "apologize first" stuff is really boring. If you don't think the person you're talking to is arguing in good faith, stop responding to them.


I only caught on about the bad faith after noticing several comments in a row that included insults or slights. You are right, best course is just to totally disengage.


> If I leave my apartment door unlocked

Oh god, this again


This is a perfectly reasonable analogy.

Many on HN seem to like to feign ignorance when it comes to intent. Have fun playing dumb in court when you try to bring out your strictly technical argument for why it was legal to download all of those credit card and social security numbers since you "could".


It is a terrible analogy because websites have status codes defined in RFCs.

All analogies are terrible if we're talking about computer access.

The door analogy fails because there are two existing sets of laws covering both sets of circumstances, and how the courts interpret authorized access might have very little to do with what normal people would mean.

By having a website on the public facing Internet with no access controls you are inviting anyone to view that page. You are giving implied consent to view that page. Except that's not how some courts interpret it, so my opinion means very little.


That implied invitation gets absurd quickly in the real world, like when it accompanies access to millions of stored credit card numbers. There is information that cannot possibly come with a legitimate implied consent to access.

It's also a technical non-starter, because the HTTP status code that accompanies a response is an artifact, not a promise; the result of a SQL injection attempt could for instance very well have a 200 code on it.
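
A minimal sketch of that last point, assuming a toy handler that builds its SQL by string formatting (the table, the rows, and `handle_lookup` are all hypothetical placeholders): the injected request leaks every row and still comes back with a 200, because the status code only reports that the handler ran without error.

```python
import sqlite3


def build_db():
    """Create an in-memory database with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, card TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "card-A"), ("bob", "card-B")],
    )
    return conn


def handle_lookup(conn, name):
    """Toy request handler returning (status_code, rows). Deliberately vulnerable."""
    # Bug on purpose: user input is spliced straight into the SQL text.
    query = "SELECT name, card FROM users WHERE name = '%s'" % name
    try:
        rows = conn.execute(query).fetchall()
    except sqlite3.Error:
        return 500, []
    return 200, rows


conn = build_db()
print(handle_lookup(conn, "alice"))         # the one row that was asked for
print(handle_lookup(conn, "x' OR '1'='1"))  # every row, and the status is still 200
```

Passing `name` as a bound parameter (`WHERE name = ?`) would close this particular hole, but the status code would be 200 either way.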


I think of it more like calling someone and asking them to send you a book. They do. Then the police come knocking on the door saying you accessed the information illegally...

If you don't want someone to have something, don't have your system dispense it when they ask for it.

But yes, intent does come into it - and that's often the problem. People who investigate systems out of curiosity or academic interest, or because the system owners won't take them seriously when they report problems that put people's data at risk and they want to demonstrate it, are treated like criminals (often charged with felony offences with potential jail time worse than for assault or theft).


A better analogy would be: you're invited to a friend's apartment, there's a living room, a bathroom, and bedrooms. Some doors are shut and some are not. You clearly were invited to enter the dwelling, much like a web server invites you to use a service. You are invited to sit on a couch, but may not be invited into a bedroom. If the door is locked, it is clear that it is restricted; if it's not, it's ambiguous.

As per other people's comparisons, some things are implied restricted: credit card information of another individual is not for you to see. The door to a bathroom is closed: If no one is in it, you may enter, if someone is using it, usually not.

I think in the bathroom analogy, it's clear that (1) a person should lock the bathroom door if they want privacy and (2) an individual may want to knock if a door is shut instead of trying the door.

Therefore, engineers should explicitly restrict access to certain data (rate limits, terms of service, passwords, etc.) and people working on projects and what have you may want to actually ask: "Is it OK if I do this?"
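
As one example of an explicit restriction, a per-client rate limit can be sketched in a few lines. This is only an illustration under stated assumptions: the `RateLimiter` class, its parameters, and the injectable clock are hypothetical, and a real service would more likely rely on its web framework or proxy for this.

```python
import time


class RateLimiter:
    """Allow at most `limit` requests per `window` seconds for each client."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the behaviour can be tested
        self.hits = {}      # client id -> timestamps of recent allowed requests

    def allow(self, client):
        now = self.clock()
        # Keep only the timestamps that still fall inside the window.
        recent = [t for t in self.hits.get(client, []) if now - t < self.window]
        allowed = len(recent) < self.limit
        if allowed:
            recent.append(now)
        self.hits[client] = recent
        return allowed


limiter = RateLimiter(limit=2, window=10)
print([limiter.allow("203.0.113.7") for _ in range(3)])
```

A client that gets refused here has been told "no" by the system itself, which is a far less ambiguous signal than a door that merely happens to be shut.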


The toilet is the database. To update it, you shit and piss.


I see the children are home from school. Did you have an argument to make?


The physical analogy is completely inapplicable as you don't "enter" a webserver, you send it a request.

Putting things into a state wherein they get auto-served to any request is something akin to posting documents inside your front window for passersby on the sidewalk to see.

The whole thing's a fucking terrible analogy though, and every time I see it trotted out to support some "if I leave my doors unlocked" argument I die a little bit inside.


If you "request" that the doorman of a building let you in and then subsequently rob the place, the doorman's complicity does not lessen the severity of your offense.

People like to make this debate more complicated than it really is, what with the warring analogies. Really, the issue is straightforward: did the person accessing the information have a reasonable belief that they were authorized to access it?

Your argument is, "if the webserver says it's OK, then my belief that I'm authorized is reasonable". That's a sane argument, but has any court ever agreed with it?


> the doorman of a building let you in and then subsequently rob the place

Oh god this again


By not acknowledging the rest of the comment, do you mean to concede the point?


All of these physical analogies are completely inapplicable.

When you send a request to a machine owned by a person and it responds with data that person doesn't want you to have, it is not your fault for sending the request - EVEN IF YOU KNOW/SHOULD KNOW/COULD REASONABLY BE EXPECTED TO KNOW/ETC THE PERSON DOESN'T WANT YOU TO HAVE THE DATA.


I agree about analogies. But: this is a perspective on "unauthorized access" that decriminalizes SQL injection.


Yes, SQL injection should not be criminal. You're responsible for what your robots do even if it's contrary to your own wishes.


> Really, the issue is straightforward: did the person accessing the information have a reasonable belief that they were authorized to access it?

The inherent flaw in this interpretation is that it uses a self-referential definition of authorization. Saying "access is authorized if it is reasonable to believe that access is authorized" is meaningless. You would need an independent definition of authorized access in order to evaluate the reasonableness of the belief, which defeats the purpose of using it to define authorization.

Where a "reasonable belief" standard makes sense is in determining intent, not in establishing the definition of an element of the crime. For example, it may be illegal for you to steal my laptop (where the definition of the elements of theft are fairly rigorous), but if it turns out that you have exactly the same model of laptop and "reasonably believed" mine to be yours then even though you objectively stole my laptop, the prosecution may not be able to prove that you intended to.

But "authorization" is clearly not about intent. The alleged perpetrator is not the one permitted to define the scope of authorization, so their intent is irrelevant in determining whether the access was in fact authorized, and the intent of the owner of the computer system is also irrelevant because it isn't reasonable to subject a defendant to the intent of a third party if that third party fails to clearly articulate it. What matters instead is what has been articulated -- the information made available to users of the computer system about what level of access is permitted.

But then we're back to the terms of service trouble again. If you allow contractual restrictions to define authorization then you're allowing corporations to define the contours of a federal felony and anyone who lies about their name on Facebook can be thrown in prison at the whim of an overzealous prosecutor. Conversely, relying on technological authorization makes the law extremely narrow (capturing only the likes of perpetrators who defeat technological denials of authorization via e.g. physical intrusion into the victim's data center -- although that would align better with the established penalties). Neither seems particularly satisfactory, and the reason is that "unauthorized access" is the wrong thing to prohibit. "Authorization" is vague and subjective, and the specifics of a violation span such a wide range of conduct and misconduct that lumping them all together under the same set of penalties is all but guaranteed to violate proportionality unless the penalties are set so low as to make the prohibition nominal.

And I still don't understand what useful purpose it serves as distinguished from the already-existing penalties for vandalism, fraud, misappropriation of trade secrets, etc. etc. If someone engages in access without authorization, without more, the warranted level of punishment is minimal and almost never even worth prosecuting. If they engage in some further malicious act then the penalty for the greater crime can be imposed and unauthorized access is only a side show and a dangerous instrument of prosecutorial overreach. When is it ever supposed to be useful in proportion to the controversy and abuse it produces?


You really are overthinking it. Context and intent will always have a place in the law.

Did you typo a wget to a random URL that returned my credit card number, SSN, and home address, realize this was something you shouldn't be accessing, and stop? I'm sure you would have a good case that it was merely an accident or that your intent was innocent, and you won't get in trouble.

Did you wget a million URLs over the course of the day and store these credit card numbers and SSNs for some use later? You'd probably have a harder time justifying that. Again it is all about context and intent; these are going to be a factor in court. It is never going to be a simple black or white sequence of events that can be followed to determine if you were doing something harmless or not, no matter how much the crowd here wants it to be.


> You really are overthinking it. Context and intent will always have a place in the law.

But there is a difference between the definition of a crime and the intent to meet that definition.

Let's try an analogy. Suppose I go to a local retailer and tell the clerk that I'm a professor at the local university and ask her to record the name, address, credit card number and other information of everyone who buys something in the shop for me today because I want to collect some statistics. There are now two issues here.

The first issue is whether the clerk agrees to provide me with the information. The shop owner and the credit card company need for her to not agree to this because it is clearly a security vulnerability. Even if I really am a professor collecting statistics for my research, she can't know that for certain, and I could be a scam artist out to commit credit card fraud. But let's suppose the clerk hasn't been properly trained and she gives me the credit card numbers. The obvious analogy is to a misconfigured web server.

At this point it matters quite a lot what I do with the credit card numbers. If I'm a scam artist and start making fraudulent charges to the cards, there ought to be significant penalties. But if I really am out to do statistical analysis on local retailer purchasing behavior, should there be? Or should the retailer just take the opportunity to retrain the staff? There is a strong case for a lack of penalties -- because intent does matter. Which is the fundamental problem with the CFAA. It doesn't require the intent to commit a malicious act, only a vaguely defined lack of authorization to commit any act regardless of whether or not the act is malicious.


I think there is a lack of distinction between exploiting a system to obtain data and exploiting the data obtained. In the web-server case I would argue access was allowed unfettered but authorization for the data was not. This is how more than a few cases have gone down. Gotchas don't really cut it in court.

Edit: My original comment further up in this thread hinted at this difference and how lawyers seem to be trying to shift the line so unwanted access by a user is treated the same as exploitation of the data within, even if the data was never exploited.


While I'm not with you on how "lawyers" are trying to "shift the lines" on the distinction between accessing and exploiting information, which simply does not exist in the law, I agree that the distinction is important; in a perfect world, non-commercial non-malicious non-damaging use of unauthorized data would be relegated by the sentencing guidelines to something less than a felony.


"which simply does not exist in the law"

Rather sure it does, some crimes like credit card fraud are separate from unauthorized access of the computer systems the information came from. Exploiting the data is not the same as exploiting to access the data.

Lawyers are deployed to push their client's POV, and wins or losses may "shift the line" regarding how written laws are upheld in court. In my opinion, some use that as a chance to dodge the responsibility to properly administer public-facing systems.


Here's the CFAA:

http://www.law.cornell.edu/uscode/text/18/1030

Perhaps it would be simpler if you just pointed to the part of the statute you're referring to, because I don't follow your argument.


"I see the children are home from school."

Sir, your analogy doesn't mesh with reality, as another commenter above noted. Servers provide responses; they don't hand over the hard drive.

Calling people children is absolutely negative and is a detriment to any reasonable discussion, which is what this site seems to be about. Very kindly shut your gob if you want to continue this route.


If you leave your blinds open, can I look through them?


Not necessarily.


No one is stealing TVs by reading data being firehosed out of an open API. Just like your Wifi network name being broadcast is not theft of televisions.


I just knew some shallow oaf was going to take his finger out of his nose long enough to pick on my having included a stolen television in my example. True, no theft of a physical object occurs. But one would have to be either deliberately obtuse or abjectly stupid to ignore the legal commonality I used that example to demonstrate.


What the fuck is your problem? You are calling people "children" and "shallow oafs."

Stop being such a negative jerk when communicating with others.

Edit: Thanks whoever down-voted the guy saying "fuck" but not the user injecting insults and slights. This HN community feels healthy and mature.


Oh, dear. "Streetnigga" just can't deal with such an unfriendly tone.


Why would I? I didn't come here for unfriendly and insulting tones.


Both of you could do the rest of us a favor by not responding to each other anymore.


I too wish there was an easier way to distinguish this.

One of my weekend projects involved stripping actor, genre, and director information from IMDB for the AFI Top 100 movies. The prototype was simplistic but still gave some interesting data to mine (e.g. Robert Duvall has been in the most AFI Top 100 films, credited 6 times).

The next step of the project would be to strip more data from IMDB, but I hesitated since I didn't want to get my IP banned or something. I had never considered that what I was doing could potentially be illegal.
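
The counting step of a project like that can be sketched as follows. This is not the original prototype: the records are invented placeholders standing in for whatever was collected, and `count_credits` is a hypothetical helper.

```python
from collections import Counter

# Placeholder records; the titles and names are invented for illustration
# and do not correspond to any real film data.
films = [
    {"title": "Film A", "cast": ["Actor X", "Actor Y"]},
    {"title": "Film B", "cast": ["Actor X", "Actor Z"]},
    {"title": "Film C", "cast": ["Actor X", "Actor Y"]},
]


def count_credits(films):
    """Count how many films in the list each actor is credited in."""
    counts = Counter()
    for film in films:
        # set() so an actor listed twice in one film is counted only once.
        counts.update(set(film["cast"]))
    return counts


for actor, n in count_credits(films).most_common():
    print(actor, n)
```

The same pattern works for directors or genres by swapping the key that is counted.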


In case you didn't know, there is some data available in plain text format here: http://www.imdb.com/interfaces


IMDB actually provides this information without needing to scrape their site. See http://www.imdb.com/interfaces


> I hesitated since I didn't want to get my IP banned or something

If you're worried about that I would use something like a DigitalOcean droplet. If your IP gets banned, at least it's not your home IP.


If they didn't, they should have to send you a letter telling you to stop building your list.

I believe AT&T did send such a letter.


>Netflix cooperated with my quest to understand what they internally call "altgenres,"


Not sure how I missed that -- thanks. Does seem like they scraped first and asked later, though.


Pretty much how you have to do it. Virtually no one is going to agree to allow you to scrape their stuff.

It's just like when you're discussing your entrepreneurial plans with a non-entrepreneur. Typically, their response is an eye roll and a sarcastic "Good luck!". Most people envy successful entrepreneurs really hard, but they have no patience for beginners. It's like they don't understand that there's stuff in between being a broke college kid and having a record-breaking tech IPO. The same is true for copyright infringers and web scrapers; demonstrate your value first, and then open discussions.

If Google strictly abode copyright or computer access laws, it would have never existed because it'd have to ask each website owner for formal permission to crawl their page and store records of their copyrighted material, including derivative works caused by compression, indexing, etc. (and worse the right to reproduce this content for display in search results) and meticulously record each copyright holder's assent. They'd have to verify that the legitimate copyright owner had provided assent, and not an imposter. They wouldn't be able to access anything because almost all boilerplate ToS documents forbid "automated access" and similar, and violating the ToS is illegal computer access.

Google is operating in a highly illegal fashion, and the only reason they're allowed to exist is because they demonstrated their value first by driving traffic to the copyright holders' websites. PayPal did the same thing with banking and payment regulations. At some point, you just have to recognize that most people aren't going to get it until you show them, and don't want to be bothered with your delusions of grandeur. It's better to ask forgiveness than permission.


For an even starker example, look at the Google Books case. They scanned thousands of books and made portions of them available online. The Authors Guild sued them, and the suit was dismissed because having the works available benefited society.


I find the Google Books ruling particularly insane... they have undeniably been copying books illegally and for a profit motive (even if they don't yet make money directly off the copies), but this is okay because they've provided a valuable service.

What?!

If it's such a valuable service they could have given a cut to the authors instead of illegally copying their books. Allowing somebody to break the law because you believe they are morally good is textbook legislating from the bench and it's wrong.


Not only that, but now Google has an exclusive license deal to do this, and nobody else can do it.


No, there was a proposed settlement with the Authors Guild that sort of but not really was an exclusive deal with Google. But that settlement was not approved.

That proposed deal did apply only to Google, but it didn't preclude anybody else from negotiating their own deal.


> If Google strictly abode copyright or computer access laws, it would have never existed

Whoa there. At the time, the consensus was that anything put up on web servers was there for anyone to see by any means. Minus the 'robots.txt' convention for automated scanning, which existed because there were many other search engines before Google.


Was that really the consensus among the legal community? Somehow I seriously doubt that anyone familiar with copyright law would assert that your copyright was invalidated by publishing on the web. A weak argument for fair use could be employed for crawling the text portions because only a small (often "insignificant") portion of the work is reproduced in human-readable form, but certainly is not applicable when one discusses crawling images.

Is robots.txt a legally-admissible copyright release? There's probably more room to debate that one, but it's not clear-cut. What does it cover? Is it applicable to all crawlers? Can you do a general license release like that without your work effectively becoming public domain? What's the difference between a crawler and a human reader subject to standard copyright terms? What licenses are implicitly granted in a typical robots.txt? It's not like robots.txt is a verbose document that lays all of this out, and all of it is a potential legal problem point.

Also, Google assumes permission by default and only doesn't scan if you explicitly DISALLOW it with robots.txt. This is the opposite of copyright, which reserves a monopoly to the rightsholder unless he explicitly ALLOWS a certain use. It's undeniable that Google is violating copyright law millions of times each and every day, and that said violation is fundamental to their business.

And does any of this negate the computer access laws that make a site's ToS legally binding, even to those who don't formally agree to them? Strictly interpreted, Google would still be behaving illegally even if the copyright element was taken away.

I agree that there were search engines before Google, and that they mostly were in the same problematic legal situation.


And then there's the Internet Archive. IANAL but it seems as if there's effectively this assumption that opt-out makes everything OK even if there's not much if any legal basis to it. (I wrote this piece back in 2005 and not a whole lot has changed: http://bitmason.blogspot.com/2005/07/thoughts-on-wayback-mac...)

To be clear, this is arguably the way that the Web almost has to work--but that doesn't make it all neatly legal under current copyright law.


That's the privilege of writing for a well-respected, national magazine.

If he had scraped the content to integrate into a movie torrent indexing site, he most likely would have run into legal difficulties. But since he was scraping the content to create a thought-provoking article about the kinds of interesting problems that Netflix is working on, it became PR gold for Netflix and they cooperated.

You can generally get away with a lot of things that would otherwise be actionable by simply considering the perspective of the party that would have to act. In this case, he correctly reasoned that Netflix would rather see the article that was published than a "Netflix sued me for playing with their site."


Usually they don't have the lawyers that The Atlantic probably has. You can get away with a lot under the guise of journalism (and you should be able to).


At the top of the article is a Netflix genre generator. That is worth the price of admission all by itself.

But then there's a fairly entertaining look into what happened to content at Netflix after the million dollar challenge.


It's simple. They paid people to tag films with more useful features for their machine learning algorithms.


  Rogue-Cop Viral Plague Tearjerkers Based on Real Life Set in the Edwardian Era


I'd actually watch that.


"Sherlock Holmes and the Influenza (zombie) Pandemic of 1918."


My favorite so far is "Viral Plague Gay & Lesbian Movies". For a random generator it sure seems to have an agenda.


What would be really cool is if this list of genres was open-sourced somewhere. I can see Netflix not wanting that, but it would really save time for however many hackers read this article and decide they want the same data.


I was just thinking how sad it is that Netflix has all this great data on movies (as Pandora does for music), and it's locked away from us.


I find the part at the end about the Perry Mason aspect very interesting, and actually my favourite part of the article.

And the final sentence feels like the real reason this was posted to HN: "And sometimes we call that a bug and sometimes we call it a feature."

Edit: Also, the 'Gonzo' genre of Post-Apocalyptic Comedies and Friendship seems to have gotten its first entry in "This Is The End".


Meanwhile, their client still can't separate my daughter's kid shows from mine. It took them several years to implement profiles on iOS and then another to do it on Android.

Now implemented, "My Top Picks" last night were still dominated by My Little Pony.

Also would like to choose which shows she can watch, but the client doesn't support that. </complaints-over> ;)


There's also jinni.com, which has a similar system that isn't limited by UI issues and can be used globally. I usually get great recommendations from them, and they're fun to play with.


I signed up for that a year ago and every time I am reminded of its existence, I kick myself for not using it more often. Especially at a time like now, when the TV shows my wife and I used to watch are not on and we are looking for something new.

Just finished Broadchurch over the holidays, very entertaining for those interested. BBC murder mystery


Broadchurch was aired on ITV[1], not BBC. (And in typical fashion, when bringing it to the US, it needs to be a remake because US audiences are apparently too stupid -- according to the television network execs -- to watch British television without heavy localization -- by Fox Networks in this case).

[1] http://en.wikipedia.org/wiki/ITV_%28TV_network%29


Already brought to America on BBC America, which is why I said BBC.


For those who are data curious: https://gist.github.com/agibsonccc/8230583

I cherry picked this from the source for those who might want the generator. I "think" that's everything, correct me if I'm wrong there. I didn't really test it, just took a few seconds to grab what I saw for later.


I get a kick out of the genre names. My wife and I both use the same account, and each rate movies ourselves. Whenever something comes out of left field, we say, "Look at your movies..."

I wonder if Netflix can tell if multiple people are rating movies. Does it think we are one confused person, or two distinct personalities?


Have you tried their multiple profiles feature? https://movies.netflix.com/EditProfiles


We have multiple profiles now, but before that there were 10 years of Netflix and five of them dominated by a child who enjoys "Curious George" and "Super Hero Squad."

It's going to take a while to get back to normal.


Makes me wonder about time decay to account for changing tastes.


76k micro-genres seems much. For my website http://5000best.com/movies/ I created 40 main genres using IMDb tags together with the 100 million ratings from the Netflix Prize data (I was 43rd in that competition).

Additionally, earlier I extracted and named 12 new genres (those ones on the right) from the Netflix ratings alone - I described the process here: http://arek-paterek.com/book/predict_sample.pdf


What do you mean much? When working like this the more the better. There's an opportunity for an open source variant of this technology.


How is this any different from what Pandora did with music?


>The only semi-similar project that I could think of is Pandora's once-lauded Music Genome Project, but what's amazing about Netflix is that its descriptions of movies are foregrounded. It's not just that Netflix can show you things you might like, but that it can tell you what kinds of things those are. It is, in its own weird way, a tool for introspection.

I have both Pandora One and Netflix, and am very satisfied with both. When I read why Pandora served up a particular song I have sometimes gone "hmm," but not much more.

But I can still remember pausing when Netflix put "critically-acclaimed cerebral dark thrillers" or "visually-striking foreign sci-fi & fantasy" in front of me. Not only did it map my preferences to films I hadn't seen, but it told me something new about my tastes and thus myself.


Just an FYI if you didn't know this, but Pandora will show you the same kind of information if you hover over the song (web) and pick "Why was this track selected?".

"Based on what you've told us so far, we're playing this track because it features electronica roots, house influences, danceable beats and vocal samples."


Pandora's genome doesn't seem to have enough sub-genes to distinguish between songs and groups that are noticeably different to the ear, particularly in obscure sub-genres where the music is similar, but the vocal styles are wildly different.


Sure, it's the same basic idea.


> There are so many that just loading, copying, and pasting all of them took the little script I wrote more than 20 hours.

I want to see that script.


I laughed at that line. Here is someone who has never scraped a website or run a spider to collect data from the net. It's a magical feeling the next day, when you see your data.


To people who do their work on computers, but have never programmed before, programming is like a superpower.


It ran on a netbook and was actually controlling a browser and copying and pasting. It's not optimized, but it got the job done, so I'll applaud anyway. Great article, I expect we'll see more like this soon. ;-)


Oh, and if you're nice to Netflix's servers, at perhaps 1 request a second, it would take about 25 hours to run through 90,000 requests, even without the overhead of the browser. At a little over 20 hours, it sounds like they averaged slightly better than one request per second.
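
A quick check of that arithmetic, taking the 90,000 requests above and the article's "more than 20 hours" as given (both figures are assumptions carried over from the thread, not measurements):

```shell
requests=90000                                     # ID range quoted above
hours_at_1rps=$(( requests / 3600 ))               # one request per second
ms_per_request=$(( 20 * 3600 * 1000 / requests ))  # same requests spread over 20 hours
echo "at 1 req/s: ${hours_at_1rps} hours"          # prints "at 1 req/s: 25 hours"
echo "in 20 hours: ${ms_per_request} ms per request"
```

Integer arithmetic is enough here because both divisions come out even.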


It actually looks pretty wasteful, since it probably needed to load the full webpage instead of just the HTML.


It was enough for the author to write a good story on it.

Wastefulness be damned, it got the job done.


This sort of thing is generally pretty easy. For instance, this would be a good start:

    # Fetch each candidate genre page, pausing a second between requests.
    for i in {0..10000}; do
        wget "http://whatever.invalid/genres/$i"
        sleep 1
    done

From there, processing the results to pull out the genre would be a regex away; a single properly formed grep could probably do it.
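
A minimal sketch of that extraction step, assuming (this is not checked against the real markup) that each saved page carries its genre name in the `<title>` tag. `extract_genre` is a hypothetical helper name:

```shell
# Print the text of the first <title> element in a saved page.
# Assumption: the genre name lives in <title>; adjust the pattern otherwise.
extract_genre() {
    grep -o '<title>[^<]*</title>' "$1" \
        | head -n 1 \
        | sed -e 's/<title>//' -e 's/<\/title>//'
}
```

Looping over the files wget saved and calling `extract_genre "$f"` on each would give one genre name per line. A title split across several lines would need a real HTML parser instead.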

Determining the top number would have to be done by hand; having holes in the sequences makes it more complicated but you could still use a sort of basic binary search.
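
One way that gap-tolerant binary search might look, as a sketch only. Instead of asking "does this ID exist?", it asks "does any ID exist in the next `window` IDs?", which stays a clean yes-then-no sequence as long as no run of missing IDs is as long as `window`. The function names and the probe convention (a command that returns 0 when a single ID exists) are made up for illustration:

```shell
# Succeeds if the probe finds at least one existing ID in [start, start+window).
any_in_window() {
    local probe="$1"
    local i="$2"
    local end=$(( $2 + $3 ))
    while [ "$i" -lt "$end" ]; do
        "$probe" "$i" && return 0
        i=$(( i + 1 ))
    done
    return 1
}

# Print the highest existing ID between lo and hi.
# Requires: a hit within `window` of lo, nothing at or above hi, and no run
# of missing IDs as long as `window`.
find_top() {
    local probe="$1"
    local lo="$2"
    local hi="$3"
    local window="$4"
    local mid
    while [ $(( hi - lo )) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if any_in_window "$probe" "$mid" "$window"; then
            lo=$mid    # something exists at or after mid
        else
            hi=$mid    # nothing exists from mid onward
        fi
    done
    echo "$lo"
}
```

A real probe could be something like `probe() { wget -q --spider "http://whatever.invalid/genres/$1"; }`, called as `find_top probe 0 200000 50`. Each step can cost up to `window` requests, so a larger window trades speed for tolerance of bigger holes.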


That would be a good start on a site that doesn't require authentication. :)


You can add a cookie jar, which is probably good enough for these purposes when combined with a browser user-agent string override.
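
A rough sketch of what that could look like with wget. The assumptions are not from the thread: `cookies.txt` is a Netscape-format cookie file exported from a logged-in browser session, the user-agent string is just an example, and `fetch_genre` and the `WGET` override are hypothetical names:

```shell
# Fetch one genre page using saved browser cookies and a browser-like
# user-agent. Set WGET to another command (e.g. echo) for a dry run.
fetch_genre() {
    "${WGET:-wget}" \
        --load-cookies=cookies.txt \
        --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
        --output-document="genre-$1.html" \
        "http://whatever.invalid/genres/$1"
}
```

Calling `fetch_genre "$i"` in place of the bare wget in the loop above keeps the one-second sleep between requests. Whether a site accepts replayed cookies at all depends on how its sessions work.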


I wonder how often there are "new genres", that is, a completely original movie that doesn't fit an existing one. Some of those quirky little movies that seem to win Oscars may be such.


I wonder if the NSA would find that data useful. You know, to profile Netflix users.


What about the Perry Mason thing? That's scary shit.


Theories:

1) Assuming there's some clustering algorithm in use, it could just be some tradeoff edge case that isn't worth optimizing away.

2) It could be that one of their reviewers had an overly aggressive, non-standard genre-tagging approach for Perry Mason and classified a bunch of shows in a way that polluted the source data. This could be something as simple as most other movies having many reviewers (and so higher confidence), while a large number of Perry Mason shows and DVDs had only a single reviewer who, through a bug or through overweighting, was given too much influence. This seems to be the most likely cause of the bug: some skew or amplification in the source data.

3) Intentional poisoning of the results, like cartographers putting in bogus features, or data sellers seeding their data with watermarks, etc.


Sounds like some data cleanup issue. Definitely an outlier that should probably be ignored.


Royalty is America's second favourite topic?


Nobody loves monarchs more than someone who has never had to live under one.


Anyone else wish that "Taken 3: Lil Bub" was a real movie?



