{"id":88267,"date":"2020-09-16T19:13:29","date_gmt":"2020-09-16T19:13:29","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=88267"},"modified":"2020-09-16T19:13:29","modified_gmt":"2020-09-16T19:13:29","slug":"web-scrapping-with-python-using-beautifulsoap","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/web-scrapping-with-python-using-beautifulsoap\/","title":{"rendered":"Web Scraping with Python using BeautifulSoup"},"content":{"rendered":"<p>Today the internet is an enormous source of data. Unfortunately, the vast majority of it isn't available in conveniently organized CSV files for download and analysis. If you want to capture data from several websites, you'll need to try web scraping. It helps businesses learn about operational activities, market demand, and competitors' data on the internet, which in turn helps them plan for the future.<\/p>\n<ul>\n<li>Real estate agents use web scraping to gather data on new projects, resale properties, and so on.<\/li>\n<li>Marketing companies use web scraping to collect lead-related information.<\/li>\n<li>Price comparison portals use web scraping to collect product and price information from e-commerce websites.<\/li>\n<\/ul>\n<p>Using web scraping we can extract large amounts of data from websites, saving it to a local file on your computer or to a database in table (spreadsheet) format. The extracted data passes through a data pipeline and is stored in a structured format.<\/p>\n<p><strong>How does Web Scraping work?<\/strong><\/p>\n<p>To scrape the web, we write code that sends a request to the server hosting the page we specified. Generally, our code downloads that page's source code, just as a browser would. 
However, rather than displaying the page visually, it filters through the page looking for the HTML elements we've specified and extracts whatever content we've instructed it to extract.<\/p>\n<p>For example, if we wanted to collect all of the titles inside H2 tags from a website, we could write some code to do that. Our code would request the site's content from its server and download it. Then it would go through the page's HTML looking for H2 tags. Whenever it found one, it would copy whatever text is inside the tag and output it in whatever format we specified.<\/p>\n<p>One thing that's important to note: from a server's perspective, requesting a page via web scraping is the same as loading it in a browser. But when we use code to submit these requests, we may be \u201cloading\u201d pages much faster than a regular user, and thus quickly consuming the website owner's server resources.<\/p>\n<p><strong>Web Page Components<\/strong><\/p>\n<p>We can make a GET request to a web server to retrieve files. The server then sends back response files that tell the browser how to render the page. 
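The fetch-and-filter loop just described can be sketched as follows. As a hedge, the HTML is inlined here so the snippet runs without any network request; a real scraper would first download the page, for example with urllib.request.urlopen(url).read():

```python
from bs4 import BeautifulSoup

# Stand-in for a downloaded page; a real scraper would fetch this HTML
# from a server first (e.g. urllib.request.urlopen(url).read()).
html = '''
<html><body>
  <h2>First title</h2>
  <p>Some text</p>
  <h2>Second title</h2>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Collect the text inside every H2 tag, as described above
titles = [h2.get_text() for h2 in soup.find_all('h2')]
print(titles)  # ['First title', 'Second title']
```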
The files can be of different types:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.w3.org\/TR\/html\/\">HTML<\/a> \u2014 Hyper Text Markup Language, the language web pages are written in.<\/li>\n<li><a href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Web\/CSS\">CSS<\/a> \u2014 Cascading Style Sheets add styling to make the page look nicer.<\/li>\n<li><a href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Web\/JavaScript\">JS<\/a> \u2014 JavaScript adds behavior to web pages and is one of the most popular programming languages.<\/li>\n<li>Images \u2014 formats such as <a href=\"https:\/\/en.wikipedia.org\/wiki\/JPEG\">JPG<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Portable_Network_Graphics\">PNG<\/a> allow web pages to show pictures.<\/li>\n<\/ul>\n<p>After our browser receives all the response files, it renders and displays the page. When we perform web scraping, we focus on the main content of the web page, so we look at the HTML.<\/p>\n<p><strong>Steps to Scrape Websites:<\/strong><\/p>\n<p>Below are the four steps to scrape a website:<\/p>\n<p>1) Using Python (for example, the requests library or urllib.request), send an HTTP GET request to the URL of the webpage that you want to scrape; the server will respond with HTML content.<\/p>\n<p>2) Using BeautifulSoup, fetch and parse the data, and store it in a data structure such as a list or dict.<\/p>\n<p>3) Identify and analyze the attributes of your HTML tags, such as class, id, and other attributes.<\/p>\n<p>4) Save your extracted data in files such as CSV, JSON, or XLS.<\/p>\n<p>I will show you web scraping using Python 3 and the BeautifulSoup library, with an example of extracting the names of the weblinks available on the home page of the <em>https:\/\/www.red-gate.com\/ <\/em>website.<\/p>\n<p><strong>Scraping using BeautifulSoup<\/strong><\/p>\n<p><strong>Step 1<\/strong><\/p>\n<p>We need two libraries to start with web scraping: BeautifulSoup from bs4 and request from urllib. Import both 
of these Python packages.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1788\" height=\"398\" class=\"wp-image-88268\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/09\/word-image-89.png\" \/><\/p>\n<p><strong>Step 2<\/strong><\/p>\n<p>Select the URL whose HTML elements you want to extract.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1064\" height=\"172\" class=\"wp-image-88269\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/09\/word-image-90.png\" \/><\/p>\n<p><strong>Step 3<\/strong><\/p>\n<p>Using the urlopen() function from the request module, access the content of the webpage and save the HTML in \u201cmyUrl\u201d.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1116\" height=\"134\" class=\"wp-image-88270\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/09\/word-image-91.png\" \/><\/p>\n<p><strong>Step 4<\/strong><\/p>\n<p>Create a BeautifulSoup object to parse this document and extract the webpage's element data using its various built-in functions.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"2068\" height=\"858\" class=\"wp-image-88271\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/09\/word-image-92.png\" \/><\/p>\n<p><strong>Step 5<\/strong><\/p>\n<p>Locate and scrape the services. Use the soup.find_all() function to find all instances of a tag on a page. Remember that find_all returns a list, so we'll have to loop through it, or use list indexing, to extract the text.<\/p>\n<p>We use the find_all method to search for items by class or by id. Elements on a web page often have a unique HTML \u201cid\u201d or \u201cclass\u201d; we need to inspect the element in the browser to check its id or class. 
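For illustration, here is a minimal sketch of searching by class and by id with find_all and find; the markup and the 'site-nav' and 'main' values below are made up, not taken from any real page:

```python
from bs4 import BeautifulSoup

# Invented markup standing in for a page you have already downloaded
html = '''
<div id='main'>
  <ul class='site-nav'>
    <li>Home</li>
    <li>SQL</li>
  </ul>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list, so loop through it (or index it) for the text
nav_lists = soup.find_all('ul', class_='site-nav')
items = [li.get_text() for li in nav_lists[0].find_all('li')]
print(items)  # ['Home', 'SQL']

# Looking an element up by its unique id instead
main_div = soup.find(id='main')
print(main_div.name)  # div
```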
Once you find the target elements on this web page, extract and store them.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1360\" height=\"202\" class=\"wp-image-88272\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/09\/word-image-93.png\" \/><\/p>\n<p><strong>Step 6<\/strong><\/p>\n<p>Inspecting the web page to extract all the weblink names on the www.red-gate.com website, we located the ul tag with the class value &#8216;site-nav&#8217; as the parent node.<\/p>\n<p>To extract all the child nodes (our target, the weblink names on the www.red-gate.com website), we located the li tag as the target node.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"2036\" height=\"548\" class=\"wp-image-88273\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/09\/word-image-94.png\" \/><\/p>\n<p><strong>Output of the above code:<\/strong><\/p>\n<p>Home<\/p>\n<p>SQL<\/p>\n<p>.NET<\/p>\n<p>Cloud<\/p>\n<p>Sysadmin<\/p>\n<p>Opinion<\/p>\n<p>Books<\/p>\n<p>Blogs<\/p>\n<p><strong>Web Scraping Challenges:<\/strong><\/p>\n<p>The web has grown organically out of many sources. It combines plenty of different technologies, styles, and personalities, and it continues to grow to this day. In other words, the web is quite a mess! This can lead to some challenges you'll see when you try web scraping.<\/p>\n<p>One challenge is variety. Every website is different: while you'll encounter general structures that tend to repeat themselves, every site is unique and will need its own treatment if you want to extract the information that's relevant to you.<\/p>\n<p>Another challenge is durability. Websites constantly change. 
Say you've built a shiny new web scraper that automatically cherry-picks exactly what you want from your resource of interest. The first time you run your script, it works cleanly. But when you run the same script only a short while later, you run into a discouraging and lengthy stack of tracebacks!<\/p>\n<p>This is a realistic scenario, as many websites are in active development. Once a site's structure has changed, your scraper may no longer be able to navigate the sitemap properly or find the relevant information. The good news is that many changes to websites are small and incremental, so you'll likely be able to update your scraper with only minimal changes.<\/p>\n<p>However, keep in mind that because the web is dynamic, the scrapers you build will probably require constant maintenance. You can set up continuous integration to run scraping tests periodically to ensure that your main script doesn't break.<\/p>\n<h3>Alternative to Web Scraping: APIs<\/h3>\n<p>Some website providers offer Application Programming Interfaces (APIs) that let you access their data in a predefined manner. With APIs you can avoid parsing HTML and instead access the data directly in formats like JSON and XML. HTML is primarily a way to visually present content to users.<\/p>\n<p>When you use an API, the process is generally more stable than gathering data through web scraping. That's because APIs are designed to be consumed by programs rather than by human eyes. If the design of a website changes, it doesn't necessarily mean that the structure of the API has changed.<\/p>\n<p>However, APIs can change as well. 
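For contrast, here is a sketch of working with JSON the way an API client would. The payload below is simulated inline (the field names are invented); with the requests library, response.json() would hand you the same kind of structure:

```python
import json

# Simulated API response body; requests' response.json() performs the
# same parsing step on the text returned by the server.
payload = json.loads(json.dumps([
    {'title': 'First post', 'id': 1},
    {'title': 'Second post', 'id': 2},
]))

# No HTML parsing needed: the data is already structured
titles = [item['title'] for item in payload]
print(titles)  # ['First post', 'Second post']
```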
Both the challenges of variety and durability apply to APIs just as they do to websites. Additionally, it's much harder to inspect the structure of an API by yourself if the provided documentation is lacking in quality.<\/p>\n<p>The approach and tools you need to gather information using APIs are outside the scope of this tutorial; to learn more about it, look at examples of API integration in Python.<\/p>\n<p><strong>Best Practices for Web Scraping:<\/strong><\/p>\n<p><strong>Never scrape more often than you need to<\/strong>:<\/p>\n<p>Some websites are not tested against high load. If we hit them at a constant interval, we create heavy traffic on the server side, and the site may crash or fail to serve other requests. This has a large impact on user experience, and real users matter more than bots. So we should use a standard delay of 10 seconds between requests, or the interval specified in the site's robots.txt file, which generally contains instructions for crawlers. Respecting it also helps you avoid being blocked by the target website.<\/p>\n<p><strong>Make use of functions like time.sleep() to keep from overwhelming servers with too many requests in too short a timespan.<\/strong><\/p>\n<p><strong>Take responsibility for how you use scraped data:<\/strong><\/p>\n<p>We should always take responsibility for how we use scraped data. We should not scrape data and republish it somewhere else; that can raise legal issues, so we need to check the \u201cTerms of Service\u201d page before scraping, and respect the site's terms and conditions and privacy policy.<\/p>\n<p><strong>Never follow the same scraping pattern:<\/strong><\/p>\n<p>As you know, many websites use anti-scraping technologies, so it's easy for them to detect your spider if it crawls in the same pattern. 
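A minimal sketch of adding a randomized, human-like pause between requests; the interval values are only examples, and a real run would use several seconds rather than fractions of one:

```python
import random
import time

def polite_sleep(min_s, max_s):
    # Pause for a random interval so requests don't form a fixed pattern
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Between two page fetches you might call, e.g., polite_sleep(5, 10);
# tiny values are used here only so the example finishes quickly.
slept = polite_sleep(0.01, 0.02)
print(0.01 <= slept <= 0.02)  # True
```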
Normally, we as humans wouldn't follow a fixed pattern on a given website. So, to have your spiders run smoothly, you can introduce actions like mouse movements, clicking a random link, and so on, which give the impression that your spider is a person.<\/p>\n<p><strong>Scrape during off-peak hours:<\/strong><\/p>\n<p>Off-peak hours are well suited for bots and crawlers because the traffic on the website is significantly lower. These hours can be identified from the geolocation where the site's traffic originates. Scraping then also helps to improve the crawling rate and avoids the extra load from spider requests, so it's better to schedule your crawlers to run during off-peak hours.<\/p>\n<p><strong>Mask your requests by rotating IPs and proxy services:<\/strong><\/p>\n<p>We've mentioned this in the challenges above. It's always better to use rotating IPs and a proxy service so that your spider won't get blocked. Rotating IP addresses is an easy job if you are using Scrapy.<\/p>\n<p><strong>User-Agent rotation and spoofing:<\/strong><\/p>\n<p>Every request contains a User-Agent string in its header. This string helps identify the browser you are using, its version, and the platform. If we use the identical User-Agent in every request, it's easy for the target website to recognize that the request comes from a crawler. So, to make sure we don't face this, try to rotate the User-Agent between requests. 
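A sketch of rotating the User-Agent header with the requests library; the strings in the pool are ordinary example values, and the actual fetch is left commented out so the snippet runs offline:

```python
import random

# A small pool of example User-Agent strings; rotating them keeps
# successive requests from looking identical.
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15',
]

headers = {'User-Agent': random.choice(user_agents)}
# import requests
# response = requests.get('https://www.red-gate.com/', headers=headers)
print(headers['User-Agent'] in user_agents)  # True
```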
You can find examples of genuine User-Agent strings easily on the Internet. If you are using Scrapy, you can set the USER_AGENT property in settings.py. While scraping, it's always better to provide accurate details about yourself in the request header.<\/p>\n<p>Example of a User-Agent:<\/p>\n<p>user-agent: Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/75.0.3770.100 Safari\/537.36<\/p>\n<p><strong>Be transparent:<\/strong><\/p>\n<p>Don't misrepresent your purpose or use deceptive methods to gain access. If you have a login and a password that grant access to a source, use them. Don't hide who you are. If possible, share your credentials, and give your correct details in the header of the request.<\/p>\n<p><strong>Conclusion:<\/strong><\/p>\n<p>In this article we've seen the fundamentals of scraping a website and fetching useful data: how to crawl, and the best practices of scraping, demonstrated on the Red-Gate home page using BeautifulSoup.<\/p>\n<p>For further practice, here are good examples of data to scrape from websites: weather forecasts, stock prices, and articles.<\/p>\n<p>To conclude:<\/p>\n<ul>\n<li>Follow the target URLs' rules while scraping; don't make them block your spider.<\/li>\n<li>Maintaining data and spiders at scale is difficult. Use Docker\/Kubernetes and public cloud providers, like AWS, to easily scale your web-scraping backend.<\/li>\n<li>Always respect the rules of the websites you plan to crawl. If APIs are available, always use them first.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Today the internet is an enormous source of data. Unfortunately, the vast majority of it isn't available in conveniently organized CSV files for download and analysis. 
If you want to capture data from several websites, you'll need to try web scraping. It helps them to learn about operational activities, also&#8230;&hellip;<\/p>\n","protected":false},"author":333151,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[],"coauthors":[124821],"class_list":["post-88267","post","type-post","status-publish","format-standard","hentry","category-blogs"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/88267","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/333151"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=88267"}],"version-history":[{"count":2,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/88267\/revisions"}],"predecessor-version":[{"id":88275,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/88267\/revisions\/88275"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=88267"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=88267"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=88267"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=88267"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}