\documentclass[10pt]{article}
\usepackage{fullpage}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage[section]{placeins}
\usepackage{xcolor}
\usepackage{breakcites}
\usepackage{lineno}
\usepackage{hyphenat}
\PassOptionsToPackage{hyphens}{url}
\usepackage[colorlinks = true, linkcolor = blue, urlcolor = blue, citecolor = blue, anchorcolor = blue]{hyperref}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
\usepackage[round]{natbib}
\let\cite\citep
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}

\begin{document}

\title{Scraping the web with Python tools}
\author[1]{Block User}%
\affil[1]{NYBG Publishing}%
\vspace{-1em}
\date{March 29, 2024}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\sloppy

Scraping the web with Python is one of the most essential tasks that simplify the job of a data scientist. In this document I will try to implement some of the code from the previously published documents.
\section*{Beautiful Soup Documentation}
{\label{405512}}

\href{http://www.crummy.com/software/BeautifulSoup/}{Beautiful Soup}~is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations. The examples in this documentation should work the same way in Python 2.7 and Python 3.2. You might be looking for the documentation for~\href{http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html}{Beautiful Soup 3}. If so, you should know that Beautiful Soup 3 is no longer being developed, and that Beautiful Soup 4 is recommended for all new projects. If you want to learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see~\href{https://www.crummy.com/software/BeautifulSoup/bs4/doc/\#porting-code-to-bs4}{Porting code to BS4}. Let's begin:

\begin{verbatim}
# import required libraries for scraping
import requests
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
# print(soup.prettify())
\end{verbatim}

\section*{Some useful bs4 expressions}\label{there-are-some-useful-codes-for-bs4}

\begin{verbatim}
soup.title
soup.title.name
soup.title.string
soup.title.parent.name
soup.p
soup.p['class']
soup.a
soup.find_all('a')
soup.find_all(id='link3')

for link in soup.find_all('a'):
    print(link.get('href'))

print(soup.get_text())
\end{verbatim}

\cite{documentation}

\section*{Parsing only part of a document}
{\label{325242}}

Let's say you want to use Beautiful Soup to look at a document's \textless{}a\textgreater{} tags. It's a waste of time and memory to parse the entire document and then go over it again looking for \textless{}a\textgreater{} tags. It would be much faster to ignore everything that wasn't an \textless{}a\textgreater{} tag in the first place. The~\texttt{SoupStrainer}~class allows you to choose which parts of an incoming document are parsed. You just create a~\texttt{SoupStrainer}~and pass it in to the~\texttt{BeautifulSoup}~constructor as the~\texttt{parse\_only}~argument. (Note that~\emph{this feature won't work if you're using the html5lib parser}. If you use html5lib, the whole document will be parsed, no matter what. This is because html5lib constantly rearranges the parse tree as it works, and if some part of the document didn't actually make it into the parse tree, it'll crash. To avoid confusion, in the examples below I'll be forcing Beautiful Soup to use Python's built-in parser.)
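Before turning to \texttt{SoupStrainer}, it may help to see how little machinery is needed to keep only the \textless{}a\textgreater{} tags: the sketch below uses nothing but Python's built-in \texttt{html.parser} module (no Beautiful Soup at all), and the document mirrors the ``three sisters'' example above. The \texttt{LinkCollector} class name is my own, chosen just for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Keep only <a> tags; every other tag is ignored as it streams past."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # skip everything that is not an <a> tag
            self.hrefs.extend(v for k, v in attrs if k == "href")

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters;
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

collector = LinkCollector()
collector.feed(html_doc)
print(collector.hrefs)
# ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']
```

This streaming approach is essentially what \texttt{SoupStrainer} automates, while still handing back a searchable parse tree.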
\subsection*{SoupStrainer}
{\label{130737}}

The~\texttt{SoupStrainer}~class takes the same arguments as a typical method from~\href{https://www.crummy.com/software/BeautifulSoup/bs4/doc/\#searching-the-tree}{Searching the tree}:~\href{https://www.crummy.com/software/BeautifulSoup/bs4/doc/\#id11}{name},~\href{https://www.crummy.com/software/BeautifulSoup/bs4/doc/\#attrs}{attrs},~\href{https://www.crummy.com/software/BeautifulSoup/bs4/doc/\#id12}{string}, and~\href{https://www.crummy.com/software/BeautifulSoup/bs4/doc/\#kwargs}{**kwargs}. Here are three~\texttt{SoupStrainer}~objects:

\begin{verbatim}
from bs4 import SoupStrainer

only_a_tags = SoupStrainer("a")

only_tags_with_id_link2 = SoupStrainer(id="link2")

def is_short_string(string):
    return len(string) < 10

only_short_strings = SoupStrainer(string=is_short_string)
\end{verbatim}

I'm going to bring back the ``three sisters'' document one more time, and we'll see what the document looks like when it's parsed with these three~\texttt{SoupStrainer}~objects:

\begin{verbatim}
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
# <a class="sister" href="http://example.com/elsie" id="link1">
#  Elsie
# </a>
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>
# <a class="sister" href="http://example.com/tillie" id="link3">
#  Tillie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
# Elsie
# ,
# Lacie
# and
# Tillie
# ...
\end{verbatim}

You can also pass a~\texttt{SoupStrainer}~into any of the methods covered in~\href{https://www.crummy.com/software/BeautifulSoup/bs4/doc/\#searching-the-tree}{Searching the tree}. This probably isn't terribly useful, but I thought I'd mention it:

\begin{verbatim}
soup = BeautifulSoup(html_doc)
soup.find_all(only_short_strings)
# [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
#  u'\n\n', u'...', u'\n']
\end{verbatim}

\subsection*{Improving Performance}
{\label{245184}}

Beautiful Soup will never be as fast as the parsers it sits on top of. If response time is critical, if you're paying for computer time by the hour, or if there's any other reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop~\href{http://lxml.de/}{lxml}. That said, there are things you can do to speed up Beautiful Soup. If you're not using lxml as the underlying parser, my advice is to~\href{https://www.crummy.com/software/BeautifulSoup/bs4/doc/\#parser-installation}{start}. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib. You can speed up encoding detection significantly by installing the~\href{http://pypi.python.org/pypi/cchardet/}{cchardet}~library.
\href{https://www.crummy.com/software/BeautifulSoup/bs4/doc/\#parsing-only-part-of-a-document}{Parsing only part of a document}~won't save you much time parsing the document, but it can save a lot of memory, and it'll make searching the document much faster.

\url{http://jsbeautifier.org/}

\par\null

\par\null
\selectlanguage{english}
\FloatBarrier
\bibliographystyle{plainnat}
\bibliography{bibliography/converted_to_latex.bib%
}

\end{document}