Qingli Niu1, Irfan Ali Kandhro2, Anil Kumar2, Shahnawaz Shah3, Muhammad Hasan2, Hifza Mehfooz Ahmed2, and Fei Liang1

1College of Information Engineering, Zhengzhou University of Science & Technology, Zhengzhou 450064, China
2Department of Computer Science, Sindh Madressatul Islam University, Karachi, Pakistan
3Department of Telecommunication Engineering, University of Sindh, Jamshoro, Pakistan


 

Received: February 28, 2022
Accepted: May 8, 2022
Publication Date: June 17, 2022

 Copyright The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.


DOI: https://doi.org/10.6180/jase.202304_26(4).0002


ABSTRACT


Web scraping is the process of extracting data from websites quickly and efficiently. For this task, Python offers a useful set of methods that help web editors improve the quality of the service they provide. The scraper presented here works in three steps: 1) understand the structure of the web page, 2) design a regular expression pattern, and 3) use that pattern to extract the required data. We also use the Flask, Request, and JSONify libraries to fetch the data; after processing, the data is transformed into JSON and made ready for CSV export with the help of an API. Once all required regex patterns have been generated, the system uses them as a set of rules. With these rules, the scraper works efficiently and, with the help of supporting libraries, achieves outstanding results in storing and extracting news and other web-based information. By automating the process, the proposed web scraping tool eliminates the time and effort of manually collecting or copying data. We find that the scraper is an easy and direct approach to extracting data from newspapers, websites, blogs, and images.
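The three steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the HTML fragment, the pattern, and the helper names (`SAMPLE_HTML`, `NEWS_PATTERN`, `extract_records`, `to_json`, `to_csv`) are all hypothetical, and the download and Flask API layers are left out so that only the Python standard library is needed.

```python
import csv
import io
import json
import re

# Step 1 (assumed page structure): each news item has a title and a date.
# This fragment is invented for illustration; a real scraper would first
# download the page HTML.
SAMPLE_HTML = """
<div class="news">
  <h2 class="title">Flood warning issued</h2>
  <span class="date">2022-02-28</span>
</div>
<div class="news">
  <h2 class="title">New campus opens</h2>
  <span class="date">2022-05-08</span>
</div>
"""

# Step 2: a regular expression designed around that structure.
NEWS_PATTERN = re.compile(
    r'<h2 class="title">(?P<title>.*?)</h2>\s*'
    r'<span class="date">(?P<date>.*?)</span>',
    re.DOTALL,
)


def extract_records(html):
    """Step 3: apply the pattern and return one dict per matched item."""
    return [match.groupdict() for match in NEWS_PATTERN.finditer(html)]


def to_json(records):
    """Serialize the extracted records as a JSON string."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize the extracted records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["title", "date"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = extract_records(SAMPLE_HTML)
print(to_json(records))
print(to_csv(records))
```

Regular expressions of this kind depend on the exact markup of the page, so each target site would likely need its own pattern, which matches the abstract's description of using the patterns as a set of rules.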


Keywords: web scraping, extracting, retrieving, Python framework, API, manually collecting data




    
 
