
I don't know how they are doing it, but Google Scholar does not have an API, and scraping is against their TOS.

> Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide.

Despite this, there is scholar.py [0], which can extract files from Google Scholar, though it explicitly doesn't work around the rate limits.

[0] https://github.com/ckreibich/scholar.py



> or try to access them using a method other than the interface

Unless this actually exploits something and hacks into Google's servers to get to the content, which would be something quite different, it wouldn't really be distinguishable from someone manually visiting the site in a browser, volume aside.
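To make that concrete, here is a minimal sketch using only Python's standard library. It builds a GET request but does not send it. The URL and the User-Agent string are placeholder assumptions of mine, not anything taken from scholar.py or Google. On the wire this is an ordinary HTTP request, though a real browser sends more headers than this, so it only illustrates the header side of the argument.

```python
from urllib.request import Request

# Hypothetical values, for illustration only.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
URL = "https://example.com/"


def build_request(url: str, user_agent: str) -> Request:
    """Build (but do not send) a GET request carrying the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request(URL, BROWSER_UA)

# Inspect what would go on the wire: method, target URL, and headers.
print(req.get_method(), req.full_url)
print(req.header_items())
```

Nothing here touches the network; passing `req` to `urllib.request.urlopen` would be the step that actually fetches the page.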

IMHO the pervasive attitude today of somehow requiring permission or an explicitly sanctioned "API" to access what is otherwise publicly accessible data is rather troubling for the freedom and flexibility of the Web as a whole. It encourages walled-garden content models and centralisation.


I absolutely agree. In my view, if something is publicly accessible, the public should be able to use it as they see fit. (An HTTP response has already authorised you to copy the data to a machine. How can it be bound by a TOS that you need to access the original page to find?)

However, Google doesn't agree and the current court precedent doesn't either. So I tried to address the parent's concern from that viewpoint.


Yup. I don't believe web hosts should be entitled to that much control.

My browser is my User Agent. The way it renders or interprets the data is my business.


HTTP is an interface with implicit instructions (especially if RESTful), provided by Google.



