A Simple Technique for Web Scraping with PHP

Web Scraping with PHP

There's a lot of data flowing everywhere: unstructured, not-yet-useful pieces of data moving back and forth. Collecting, structuring, and processing this data can be really expensive. There are companies that make billions of dollars just by scraping web content and presenting it in a nice way.

Another reason to scrape could be, for example, the lack of an API on the original site. In that case, scraping is the only way to get the data you need to process. Today I will show you how to get web data using PHP, and that it can be as easy as pie.

First of all, is parsing or web scraping (in PHP) bad? Web scraping in PHP is no worse than in other languages; it's just that scraping in general is likely to be viewed with contempt by content producers. That makes your code more fragile than it should be, and generally makes the app more difficult to build.

However, sometimes web scraping is your only choice. If you love PHP and need to do parsing, you are in the right place.

What is Parsing?

 

Web scraping with PHP is no different from any other kind of web scraping. And while different people mean different things when they say "web scraper," I mean extracting information from the HTML of a web page whose owner hasn't made that information available through a REST, SOAP, or GraphQL API (or any other identifiable programming-friendly interface).

Web scraping is your last choice when someone doesn't provide a formal web API to access the data.

Reasons to do parsing:

  • We need data inside our PHP script
  • The data owner does not provide an API through which we can more efficiently retrieve this data
  • We really want their data to be inside our PHP script

I list "we need this data" twice because a characteristic feature of web scraper scripts is their fragility. Because they pull the underlying data out of the internal HTML code of a web page, they can break for seemingly random reasons. That's why Intuit (the makers of Mint, QuickBooks, etc.) spends millions on its banking web parsers every year.

Reasons not to do parsing

Here are a few reasons why your web scraping (once it's working, which is a completely different topic...) will break:

  1. The data itself is now represented differently as text strings
  2. The data moves elsewhere on the page due to design considerations
  3. The HTML surrounding the data changes due to design considerations
  4. The data you're trying to scrape is intentionally obfuscated by its owner (think of Facebook's anti-ad-blocking markup)
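
To make this fragility concrete, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath. The helper name extractPrice, the "price" class, and both sample pages are assumptions made up for illustration. The same function works on the first page and silently finds nothing once the markup is redesigned.

```php
<?php
// Hypothetical helper: pull a price out of a page by its CSS class.
// The class name "price" and the sample markup are illustrative assumptions.
function extractPrice(string $html): ?string
{
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//span[@class="price"]');
    if ($nodes === false || $nodes->length === 0) {
        return null; // the selector no longer matches: report it, don't guess
    }
    return trim($nodes->item(0)->textContent);
}

// The page as it looked when the scraper was written.
$before = '<html><body><div><span class="price">19.99</span></div></body></html>';
// The same data after a redesign: different tag, different class.
$after = '<html><body><div><strong class="amount">19.99</strong></div></body></html>';

$priceBefore = extractPrice($before); // the price string
$priceAfter  = extractPrice($after);  // null, because the selector broke
```

Returning null instead of a guessed value lets the calling code notice the breakage and alert you, rather than storing wrong data.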

In short, parsing should always be a last resort. If the owners wanted to make this information available to you, and they could, they would. They may simply have neither the technical ability nor the interest. That's when scraping fits best, because a slow-changing website is one of the best targets for scraping data with PHP.

Why use PHP for web scraping?

 

There are several PHP web scraping libraries. There are also better languages than PHP for scraping.

The main reason to do web scraping in PHP is that you know and love PHP. Use PHP for your web scraping if the rest of your application (which will use the result of this parsing) is written in PHP. Parsing with PHP is not a good choice if you do not know PHP, for example in the middle of a Python web project. The PHP scraping libraries are pretty good, but they're not perfect.

Which PHP web scraping libraries should you use?

So, the obvious answer here is this: whichever you like. Use the PHP scraping framework that suits you; there is no significant difference between them.

To build simple PHP web scrapers in the context of a project, you can use the Symfony PHP web framework. As an option, you can consider FriendsOfPHP/Goutte and Symfony/Panther, but there are many good alternatives. In general, the main difference I would highlight is between a PHP web scraping library like Panther or Goutte and a PHP web request library like cURL, Guzzle, or Requests.

Differences between a PHP web request library and a web scraping library:

  • The Web Request Library helps you make requests using all the basic HTTP methods
  • It can give you the basic HTML code of the page that you can parse the way you want.
  • It won't help you parse the web page that your HTTP request returns.
  • It won't help you make a series of requests in sequence while navigating through a series of web pages that you're trying to scrape.
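
The split described in the list above can be sketched as two separate steps. This is a minimal illustration using only built-in PHP, not the API of any library named in this article; fetchHtml, extractLinks, and the example.com URLs are hypothetical. The request step hands back raw HTML, and everything after that (parsing, finding the next page to visit) is left to you or to a scraping library.

```php
<?php
// Step 1: what a web request library gives you, namely the raw HTML.
// Defined here for completeness; it is not called below, so the sketch runs offline.
function fetchHtml(string $url): string
{
    $html = @file_get_contents($url);
    if ($html === false) {
        throw new RuntimeException("Request failed: $url");
    }
    return $html;
}

// Step 2: what it does NOT give you, namely turning HTML into data.
// Naive on purpose: only absolute URLs and root-relative paths are handled,
// anything else (relative paths, fragments, mailto:) is skipped.
function extractLinks(string $html, string $baseUrl): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $parts  = parse_url($baseUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (preg_match('#^https?://#i', $href)) {
            $links[] = $href;            // already absolute
        } elseif (strpos($href, '/') === 0) {
            $links[] = $origin . $href;  // root-relative path
        }
    }
    return $links;
}

// Stand-in for a fetched page, so no network access is needed.
$sampleHtml = '<html><body>'
    . '<a href="/page/2">Next</a>'
    . '<a href="https://other.test/x">Elsewhere</a>'
    . '<a>No href</a>'
    . '</body></html>';

$links = extractLinks($sampleHtml, 'https://example.com/catalog');
```

Following those links page after page, while keeping cookies and form state, is the part a scraping library takes off your hands.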

So, it is best to rely first on Goutte, Panther, and Laravel to approach PHP parsing correctly.

Simple parser in PHP

 

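PHP Simple HTML DOM Parser is a dream utility for developers who work with both PHP and the DOM, because it makes it easy to find DOM elements using PHP. The listings below show something more basic: a small hand-written parser for nested lists of names, built from a lexer (ListLexer), a token class (Token), and the parser itself (ListParser):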

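test_parser.php

<?php
require_once('ListLexer.php');
require_once('Token.php');
require_once('ListParser.php');

$lexer = new ListLexer($argv[1]);
$parser = new ListParser($lexer);
$parser->rlist(); // begin parsing at rule list
?>

ListParser.php

<?php
// The opening lines of this file were lost in the original listing.
// The require, the class declaration, and the grammar comment on rlist()
// are reconstructed here as an assumption.
require_once('Parser.php');

class ListParser extends Parser {
    /** list : '[' elements ']' ; match a bracketed list */
    function rlist() {
        $this->match(ListLexer::LBRACK);
        $this->elements();
        $this->match(ListLexer::RBRACK);
    }

    /** elements : element (',' element)* ; */
    function elements() {
        $this->element();
        while ($this->lookahead->type == ListLexer::COMMA) {
            $this->match(ListLexer::COMMA);
            $this->element();
        }
    }

    /** element : name | list ; element is name or nested list */
    function element() {
        if ($this->lookahead->type == ListLexer::NAME) {
            $this->match(ListLexer::NAME);
        } else if ($this->lookahead->type == ListLexer::LBRACK) {
            $this->rlist();
        } else {
            throw new Exception("Expecting name or list : Found " .
                $this->lookahead);
        }
    }
}
?>

Parser.php

<?php
// The opening lines of this file were also lost. The class declaration,
// the two properties, and the constructor signature are an assumption.
abstract class Parser {
    public $input;     // lexer that supplies the tokens
    public $lookahead; // current lookahead token

    public function __construct($input) {
        $this->input = $input;
        $this->consume(); // prime the lookahead with the first token
    }

    /** If lookahead token type matches x, consume & return else error */
    public function match($x) {
        if ($this->lookahead->type == $x) {
            $this->consume();
        } else {
            throw new Exception("Expecting token " .
                $this->input->getTokenName($x) .
                ":Found " . $this->lookahead);
        }
    }

    /** Advance to the next token from the lexer */
    public function consume() {
        $this->lookahead = $this->input->nextToken();
    }
}
?>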

The parser takes a string in our input language as an argument from the command line. Passing a correct string to the parser won't output anything, because we haven't implemented any language processing code. Passing an incorrect string will throw an exception.
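
The listings above require ListLexer.php and Token.php, which are not shown. Below is a minimal sketch of what they might look like. It is not the original code: the class layout, the token numbering, and the EOF_TYPE constant are assumptions, chosen only so that the names the parser relies on (LBRACK, RBRACK, COMMA, NAME, nextToken(), getTokenName(), and a token with a public type) exist.

```php
<?php
// Hypothetical Token: a type plus the matched text.
class Token
{
    public $type;
    public $text;

    public function __construct($type, $text)
    {
        $this->type = $type;
        $this->text = $text;
    }

    // Lets a token be concatenated into exception messages.
    public function __toString()
    {
        return "<'" . $this->text . "'," . ListLexer::$tokenNames[$this->type] . ">";
    }
}

// Hypothetical ListLexer: splits input such as "[a, [b, c]]" into tokens.
class ListLexer
{
    const EOF_TYPE = 1;
    const NAME     = 2;
    const COMMA    = 3;
    const LBRACK   = 4;
    const RBRACK   = 5;

    public static $tokenNames = ['n/a', '<EOF>', 'NAME', 'COMMA', 'LBRACK', 'RBRACK'];

    private $input;
    private $pos = 0;

    public function __construct($input)
    {
        $this->input = (string) $input;
    }

    public function getTokenName($type)
    {
        return self::$tokenNames[$type];
    }

    public function nextToken()
    {
        // Skip any whitespace in front of the next token.
        if (preg_match('/\G\s+/', $this->input, $m, 0, $this->pos)) {
            $this->pos += strlen($m[0]);
        }
        if ($this->pos >= strlen($this->input)) {
            return new Token(self::EOF_TYPE, '<EOF>');
        }

        $c = $this->input[$this->pos];
        $single = [',' => self::COMMA, '[' => self::LBRACK, ']' => self::RBRACK];
        if (isset($single[$c])) {
            $this->pos++;
            return new Token($single[$c], $c);
        }

        // A name is a run of ASCII letters.
        if (preg_match('/\G[a-zA-Z]+/', $this->input, $m, 0, $this->pos)) {
            $this->pos += strlen($m[0]);
            return new Token(self::NAME, $m[0]);
        }

        throw new Exception("Invalid character: " . $c);
    }
}
```

With both classes saved to their own files, running something like php test_parser.php "[a, [b, c]]" should finish silently, while php test_parser.php "[a, ]" should end in an uncaught exception from element().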

The code and comments are self-explanatory. I've set up the script so that I get a single "digest" alert if many pages change. Writing the script is the hardest part; to run it, I set up a cron job that executes it every 20 minutes.

This solution doesn't just apply to spying on footy; you can use this type of script on any number of sites. The script, however, is a bit simplistic. If you want to spy on a website that has extremely dynamic code (e.g. a timestamp in the markup), you need to create regular expressions that isolate just the block of content you're looking for.
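
A change-watching script of the kind described here could be sketched as follows. This is an illustration under assumptions rather than the script discussed above: the function names, the regular expression for the "scores" block, the example URL, and the cron path are all made up. It hashes only the isolated block, so a timestamp elsewhere on the page does not trigger a false alarm, and it collects all changed pages into one digest.

```php
<?php
// Keep only the block we care about. If the pattern does not match,
// fall back to the whole page so that a redesign still shows up as a change.
function isolateBlock(string $html, string $pattern): string
{
    return preg_match($pattern, $html, $m) ? $m[1] : $html;
}

// Compare the current pages (url => html) with hashes stored on the previous run.
function detectChanges(array $pages, array $previousHashes, string $pattern): array
{
    $changed = [];
    $hashes  = [];
    foreach ($pages as $url => $html) {
        $hash = sha1(isolateBlock($html, $pattern));
        $hashes[$url] = $hash;
        if (isset($previousHashes[$url]) && $previousHashes[$url] !== $hash) {
            $changed[] = $url;
        }
    }
    return ['changed' => $changed, 'hashes' => $hashes];
}

// One message for all changed pages, or null when nothing changed.
function buildDigest(array $changedUrls): ?string
{
    if (count($changedUrls) === 0) {
        return null;
    }
    return "Pages changed:\n - " . implode("\n - ", $changedUrls);
}

// Hypothetical block of interest: the contents of <div id="scores">.
$pattern = '#<div id="scores">(.*?)</div>#s';

// Hypothetical cron entry to run a script built on this every 20 minutes:
//   */20 * * * * php /path/to/check_pages.php
```

In a real run, the hashes returned by detectChanges would be saved (to a file or database) and passed back in as $previousHashes on the next run, and the digest would be mailed or logged.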

Conclusion

I have a few tips if you want this kind of script to request the same page over and over:

  • set the user agent header to simulate a real web browser request
  • make each call after a random delay to avoid being blacklisted by the web server
  • use PHP 7 or newer
  • try to optimize the script as much as possible
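
The first two tips can be sketched with nothing but built-in PHP. The function names, the default delay bounds, and the timeout below are assumptions for illustration; the user agent string is whatever your own browser sends, so it is passed in as a parameter rather than hard-coded.

```php
<?php
// Build an HTTP stream context that sends the given User-Agent header.
function buildBrowserLikeContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; an arbitrary illustrative value
        ],
    ]);
}

// Pick a random delay, in milliseconds, between the two bounds (inclusive).
function randomDelayMs(int $minMs, int $maxMs): int
{
    return random_int($minMs, $maxMs);
}

// Wait a random amount of time, then fetch the page.
// Defined for completeness; not called here, so the sketch runs offline.
function politeFetch(string $url, string $userAgent, int $minMs = 2000, int $maxMs = 8000): string
{
    $delay = randomDelayMs($minMs, $maxMs);
    sleep(intdiv($delay, 1000));      // whole seconds
    usleep(($delay % 1000) * 1000);   // remaining milliseconds

    $html = @file_get_contents($url, false, buildBrowserLikeContext($userAgent));
    if ($html === false) {
        throw new RuntimeException("Request failed: $url");
    }
    return $html;
}
```

The delay is split between sleep() and usleep() because usleep() values above one second are not guaranteed to work on every operating system.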

You can use this script for production code, but honestly, it's not the best approach.

After working with the code, you may realize that PHP is not a good candidate for developing programs of this kind. Even if you can complete the task, developing a much more complex parser design in PHP can be a daunting task.