Copyright Limits of "Web Scraping"
For empirical research, information is often "scraped" from websites. But web scraping is not always permitted. An overview.
By Thilo Klawonn 07.01.2020
Web scraping is increasingly being used in empirical research. The term denotes a procedure by which data is "scraped" from the Internet. "Scrapers" are small programs that call up the desired websites, read the information there, for example hotel prices, and save it in a file that the researchers can then use for their investigation. For one research project, for example, price data for 30,000 hotels were collected by scraping online travel agencies such as Booking.com in order to analyze best-price clauses.
Among empirical researchers, there is often uncertainty about the legal framework for web scraping. This article shows which copyright limits must be observed in research.
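What such a scraper does can be illustrated with a minimal sketch. It is not the program used in the project mentioned above. The HTML fragment, the class names `hotel`, `name` and `price`, and the function `scrape_to_csv` are hypothetical, and the page is an inline string instead of a live website so that the example runs without network access.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; real portals use different markup.
SAMPLE_HTML = """
<ul>
  <li class="hotel"><span class="name">Hotel Alpha</span><span class="price">89.00</span></li>
  <li class="hotel"><span class="name">Hotel Beta</span><span class="price">120.50</span></li>
</ul>
"""


class HotelPriceParser(HTMLParser):
    """Collects one {'name': ..., 'price': ...} dict per <li> element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None    # field currently being read, if any
        self._current = {}    # record being assembled

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.rows.append(self._current)
            self._current = {}


def scrape_to_csv(html, out):
    """Read the hotel records from html and write them to out as CSV."""
    parser = HotelPriceParser()
    parser.feed(html)
    writer = csv.DictWriter(out, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(parser.rows)
    return parser.rows


if __name__ == "__main__":
    buffer = io.StringIO()
    scrape_to_csv(SAMPLE_HTML, buffer)
    print(buffer.getvalue())
```

Note that even this tiny program copies the page content into memory and then into a file. That is the reproduction step on which the legal questions in this article turn.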
Rights of the website operator
Website operators have no property-like rights to the data stored on their pages. The compilation of the data, on the other hand, can be protected, because the EU has a so-called database maker right. Websites such as rating portals, online marketplaces or social networks are generally databases in this sense. It can therefore be assumed that most websites relevant to empirical research are databases.
The database maker has the exclusive right to reproduce, distribute and publicly communicate its database. These are technical terms of copyright law: reproduction means copying, distribution is the physical passing on of the original or a copy, and public communication occurs when the database is made available to others in non-physical form, for example by placing it on an intranet or the Internet. Web scraping inevitably involves reproduction. When the scraper extracts the information, it copies it into RAM and then onto the hard disk. This already constitutes an act of reproduction, which is in principle reserved to the database maker.
Legal permissibility
The good news first: as a rule, web scraping is nevertheless permitted for empirical research. The website operators' terms of use cannot change that, because often only insubstantial parts of a database are used. In the scraping example above, 30,000 price records were taken from the online travel agencies, which is likely to be only a fraction of their total databases. In a scientific context, insubstantial parts of a database may in principle be copied and also shared among researchers. However, effective technical protection measures must not be circumvented when scraping. If the website operator prohibits automated reading of the data, for example in the so-called robots.txt file, researchers must not ignore this.
Further legal restrictions apply if substantial parts of the database are to be used. Whether a part of a database is substantial cannot be determined in the abstract. However, the Federal Court of Justice held, for example, that taking ten percent of a database was not substantial. In another case, it concluded that annual personnel costs of 200,000 euros constitute a substantial investment in qualitative terms. Caution is therefore advisable if a particularly large amount of data is to be copied. Nevertheless, this too is not prohibited per se.
Researchers who want to be on the safe side are always free to ask the website operator for permission. However, this is not always possible or methodologically sensible. But even without consent, there are ways to use substantial parts of databases for research.
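A scraper can check the robots.txt rules before it requests a page. The following is a minimal sketch using `urllib.robotparser` from Python's standard library. The robots.txt content, the user agent name `research-bot`, the example URLs and the helper `is_allowed` are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file is served at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /prices/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.org/about",
                "https://example.org/prices/hotel-1"):
        print(url, is_allowed(ROBOTS_TXT, "research-bot", url))
```

Against a live site, the parser would instead be pointed at the real file with `set_url(...)` followed by `read()`. A scraper that skips every URL for which `can_fetch` returns False respects the operator's instruction in robots.txt. Whether that is sufficient in a particular case remains a legal question.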
Research exception, text and data mining
Under the copyright exception for scientific research, anyone may reproduce up to 75 percent of a copyrighted work for their own non-commercial scientific research. This also applies to databases. However, passing on the reproduced data records is not covered by this exception. Data is passed on as soon as the data record leaves the researcher's own research group, for example if it is to be forwarded for the purpose of quality control.
From Forschung & Lehre 1/20
In addition, the German legislator introduced an exception for text and data mining (TDM) in 2018. For non-commercial scientific purposes, it allows a large number of works to be reproduced in order to create a corpus. In the case of databases, however, the corpus may not be passed on for quality control. Nor is it permitted to reproduce the entire database. A complete survey of all data stored in a database is therefore never permitted without the maker's consent.
An important restriction of the TDM exception concerns the time horizon. Reproductions and the corpus may only be created for a specific research project and must be deleted once the project has been completed. Only public libraries, archives and comparable institutions may archive the corpus permanently.
The essential difference between the two exceptions therefore comes down to the extent of the copying and the duration of storage: under the TDM exception, almost the entire database can be copied, but after completion of the research project the copies must be deleted or handed over to a library. Under the research exception, on the other hand, the copies can be kept afterwards, but only up to 75 percent of the database may be reproduced. In both cases, the database and its maker must be named as the source.
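The comparison of the two copyright exceptions (the "barriers") can be written down as a small decision sketch. It encodes only the thresholds named in this article and assumes non-commercial scientific research throughout. The class `ScrapePlan`, its fields and the function `applicable_exceptions` are hypothetical names, and the result is a rough orientation, not a legal assessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapePlan:
    fraction_copied: float     # share of the database to be reproduced, 0.0 to 1.0
    keep_after_project: bool   # researchers keep the copies after the project ends


def applicable_exceptions(plan):
    """List the exceptions whose limits, as described in the article, the plan stays within."""
    candidates = []
    # Research exception: at most 75 percent, copies may be kept afterwards.
    if plan.fraction_copied <= 0.75:
        candidates.append("research exception")
    # TDM exception: almost everything but never the entire database,
    # and copies must be deleted or handed to a library after the project.
    if plan.fraction_copied < 1.0 and not plan.keep_after_project:
        candidates.append("TDM exception")
    return candidates


if __name__ == "__main__":
    print(applicable_exceptions(ScrapePlan(0.5, keep_after_project=True)))
    print(applicable_exceptions(ScrapePlan(0.9, keep_after_project=False)))
    print(applicable_exceptions(ScrapePlan(1.0, keep_after_project=False)))
```

An empty list corresponds to the case in which, according to the article, only the maker's consent helps, for example when the entire database is to be copied.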
Conclusion
As a rule, web scraping is legally permitted for empirical research. The terms of use that website operators often impose do not change this. The situation is different with technical barriers, which must not be circumvented.
If you want to be on the safe side, you can ask the maker of the database for permission and have it granted, preferably in text form (for example by email). In case of doubt, the legal departments of the research institutions can advise.