Search in php

95% of free php scripts (and not only php) completely "suck". It is understandable: a good programmer will not write anything for free, and if he does, it will only be in his free time, for fun, and certainly not trivial things like guest books. Or, as Roma raval said: "Here's the problem with creative people: they always want to be composers, artists and writers. As a result, the production of large-diameter pipes is left to mediocrities."

That's exactly what happens.

Today I was again digging through the catalogs of free scripts, mainly out of curiosity, but also in the secret hope of finding something funny. Last time, among the "funny" scripts, I found, for example, a "script for outputting a text file in php". I thought it was a parser. It turned out that it was: almost a parser. Here is the whole script: "<?php include ("text.txt"); ?>". Or here I see a script described as "This script will reverse the text you give it. reversed: .ti evig uoy txet eht esrever lliw tpircs siht It isn't very useful, it's just funny. Try it out :)", that is, the script reverses a string. My worst expectations were confirmed: they did it in a loop. They probably did not know that php has a special, ready-made function for this: strrev().
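For comparison, a minimal sketch of the built-in approach next to a hand-written loop. The loop is only a guess at what such a script might look like (reverse_in_loop is a hypothetical name), not the actual code from the catalog.

```php
<?php
// The built-in way: strrev() reverses a string in one call.
$text = "This script will reverse the text you give it.";
$reversed = strrev($text);

// A hand-written loop of the kind the script probably used
// (an assumption, shown only for contrast).
function reverse_in_loop($s) {
    $out = "";
    for ($i = strlen($s) - 1; $i >= 0; $i--) {
        $out .= $s[$i];
    }
    return $out;
}

echo $reversed, "\n";
```

Both versions work byte by byte, so neither is safe for multibyte text such as UTF-8.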

I found a search script for a site: it goes through the directories you specify, opens all the html files and dumbly compares... I also wrote a script like that a long time ago, but then, when I realized that people actually use the search (surprise!), I decided to do it properly. With an index and the other benefits of civilization... Done. As a result, 1.5 megabytes of notes turn into a 900-kilobyte index in 17 seconds (indexing should be run every few days, or even less often, depending on how often the site is updated), after which a search of the index takes less than one second.

In general, I decided to share this script. Brief information: the script is in php, no mysql is needed for it to work, and it assumes that the html (or txt, like mine) files lie somewhere on disk and are not stored in mysql. In short, a search engine for a small (well, or "average") site. Like, for example, p-qc.com.

So, let's start:


The very first is the indexing script. What is it for? I have 278 notes. If we open each file and look for matches, we'll need to open 278 files. And how long will that take... Moreover, we would need to perform some cunning manipulations on these files 278 times (more about the manipulations below). If we have an index, then, firstly, the search takes place in one file (the index), and secondly, all these "cunning manipulations" have already been performed.

The algorithm of the indexing script is as follows:

  1. Open the next file
  2. Remove the "garbage" from it (why is clear: the less garbage, the faster the search :)
    1. line breaks
    2. html tags
    3. punctuation
    4. words shorter than three letters (what use are they anyway?)
  3. Make capital letters lowercase.
  4. Remove duplicate words. (Really, why do we need all that tautology?)
  5. Write everything to the index.
  6. If there are still files, go to step 1.
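The cleaning steps above can be tried on a single string. This is a minimal illustration rather than the indexing script itself: normalize_for_index is a hypothetical name, and the punctuation list is a shortened assumption.

```php
<?php
// Hypothetical helper: applies steps 2-4 of the list above to one string.
function normalize_for_index($html) {
    // 2.1 line breaks -> spaces
    $text = str_replace(array("\r\n", "\r", "\n"), " ", $html);
    // 2.2 html tags (a space is added first so neighbouring words do not merge)
    $text = strip_tags(str_replace("<", " <", $text));
    // 2.3 punctuation -> spaces
    $text = str_replace(array(".", ",", "!", "?", ":", ";", "(", ")", '"'), " ", $text);
    // 3. lowercase
    $text = strtolower($text);
    // 2.4 and 4. drop words shorter than three letters, then duplicates
    $words = array();
    foreach (explode(" ", $text) as $word) {
        if (strlen($word) >= 3) {
            $words[$word] = true; // array keys are unique by construction
        }
    }
    return implode(" ", array_keys($words));
}

echo normalize_for_index("<p>The cat sat.<br>The CAT is on a mat!</p>"), "\n";
// prints: the cat sat mat
```

Collecting the words as array keys does the job of removing duplicates and empty strings in one pass.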

All of this is easy to implement in php:

<?php

// Spectator's Indexing Script
// (C) p-qc.com
// Requires PHP 4 or higher to run.
// If you use this script, a link to p-qc.com
// is highly desirable. Thank you.

// start a timer (to know how long the script took to run)
$ttt=microtime();
$ttt=((float)strstr($ttt, ' ')+(float)substr($ttt,0,strpos($ttt,' ')));


$indexdir="text"; #indexed directory
$indexfile="indexfile.txt"; #file that will contain the index
$fullfile=""; #index text, built up line by line
// if you want to index files in multiple directories, you need to
// make a few tiny additions...


// make sure the script is not cut short by a timeout if it runs for a long
// time (just in case), or because the user clicks the browser's
// stop button

$abort = ignore_user_abort(1);
set_time_limit(600);

// Function to blank out words shorter than 3 letters. Used further down.
function sw (&$item1, $key) { if (strlen($item1)<3) $item1=""; }

// open all the files in the directory one by one and check whether they
// should be indexed. I only index files named like "number.txt",
// hence the check && (is_numeric(str_replace (".txt","", $file)));
// you probably won't need this.

$handle=opendir('./'.$indexdir);
while (false!==($file = readdir($handle))):
if ($file!="." && $file!=".." && (is_numeric(str_replace (".txt","", $file)))):

// open next file and read it whole
// (file_get_contents also copes with empty files, unlike fread with length 0)
$contents = file_get_contents($indexdir."/".$file);


// remove newlines
$contents=str_replace("\n"," ", $contents);
$contents=str_replace("\r","", $contents);


// remove html tags
$contents=str_replace('<br>', ' ', $contents);
$contents=str_replace('<p>', ' ', $contents);
$contents=strip_tags($contents);


// remove punctuation marks
// all these lines are faster than one eregi_replace!

$contents=str_replace (' -', ' ', $contents);
$contents=str_replace('.', ' ', $contents);
$contents=str_replace(',', ' ', $contents);
$contents=str_replace('!', ' ', $contents);
$contents=str_replace('?', ' ', $contents);
$contents=str_replace(':', ' ', $contents);
$contents=str_replace(';', ' ', $contents);
$contents=str_replace (')', ' ', $contents);
$contents=str_replace ('(', ' ', $contents);
$contents=str_replace('"', ' ', $contents);

// convert capital letters to lowercase
$contents=strtolower($contents);

// split into words, remove words shorter than 3 letters
$contents=explode(" ", $contents);
// here is the function that came in handy...
array_walk($contents, 'sw');


// remove duplicate words
$contents=array_unique($contents);


// connect words
$contents=implode(" ", $contents);


// form the corresponding line in the index.
$fullfile.=$file."| ".$contents." \n";


// index file will look like:
// file_name|index_for_this_file \n
// file_name|index_for_this_file \n
// file_name|index_for_this_file \n

echo ($file." indexed<br>");
// move on to the next file

endif;
endwhile;
closedir($handle);

// remove double spaces
while (stristr($fullfile, "  ")) $fullfile=str_replace ("  "," ",$fullfile);

// index is ready, save it
$fp = fopen($indexfile, "w+");
fwrite($fp, $fullfile);
fclose($fp);

// count how long the script worked
$ddd=microtime();
$ddd=((float)strstr($ddd, ' ')+(float)substr($ddd,0,strpos($ddd,' ')));

echo ("<br>Indexing time: ".(number_format(($ddd-$ttt),3)).
      " seconds<br>");
echo ("Index size: ".(number_format((round ((filesize($indexfile))/1024)) ,
      0, ".",".")))." Kb";
    
?>
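For reference, each line of the finished index has the form file_name| words. A minimal sketch of reading one such line back into its parts; parse_index_line is a hypothetical helper and the sample line is made up.

```php
<?php
// Hypothetical helper: splits one index line into the file name and its words.
function parse_index_line($line) {
    // limit 2: split only at the first "|"
    $parts = explode("|", rtrim($line, "\r\n"), 2);
    if (count($parts) < 2) {
        return null; // not a valid index line
    }
    $words = preg_split('/ +/', trim($parts[1]), -1, PREG_SPLIT_NO_EMPTY);
    return array('file' => $parts[0], 'words' => $words);
}

$entry = parse_index_line("12.txt| search index php script \n");
echo $entry['file'], ": ", count($entry['words']), " words\n";
// prints: 12.txt: 4 words
```

Note the spaces kept at both ends of the word list in the index: a whole-word search for " word " relies on them.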

So we have an index. The rest is simple, right? You just have to search it. Take the eregi function, for example...

Although I did it quite differently...

Lyrical digression: often, when it is necessary to check whether a string contains some combination of characters, people write something like this:

// (eregi was deprecated in PHP 5.3 and removed in PHP 7.0)
if (eregi('this must be found',$string))
    echo 'found!!';
else
    echo 'nothing found!';

The method works, but it is slow because of eregi. (This function works with regular expressions, which is what slows it down.) For the same reason, it is recommended to use str_replace instead of ereg_replace wherever possible: it is about 10 times faster... So cool programmers ;) use the strstr function when they need to check whether something occurs in a string. Strictly speaking, it is not intended for this (or rather, not quite for this), because it is documented as "Find first occurrence of a string": it looks for the first occurrence of the substring and returns the rest of the string starting from that position. Confusing, right? (smiley).
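The speed claim is easy to check on your own machine. The sketch below is only a rough timing harness under stated assumptions: eregi no longer exists in current PHP, so preg_match stands in for the regular-expression approach, and the haystack, needle and repeat count are made up. The numbers depend on the PHP version and the machine, so the tenfold figure should be read as the author's measurement with eregi, not as something this code confirms.

```php
<?php
// Rough timing sketch: haystack, needle and repeat count are made up.
$haystack = str_repeat("lorem ipsum dolor sit amet ", 2000) . "this must be found";
$needle = "this must be found";
$repeat = 200;

// regular-expression check (preg_match as a stand-in for eregi)
$start = microtime(true);
for ($i = 0; $i < $repeat; $i++) {
    $byRegex = preg_match('/' . preg_quote($needle, '/') . '/i', $haystack) === 1;
}
$regexTime = microtime(true) - $start;

// plain substring check
$start = microtime(true);
for ($i = 0; $i < $repeat; $i++) {
    $byStristr = stristr($haystack, $needle) !== false;
}
$stristrTime = microtime(true) - $start;

printf("preg_match: %.4f s, stristr: %.4f s\n", $regexTime, $stristrTime);
```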

Ok, here's an example from php.net:

$email = 'sterling@designmultimedia.com';
$domain = strstr ($email, '@');
print $domain;
// prints: @designmultimedia.com

Got it now? The function looks for where the substring "@" appears in the string and returns everything from that point on (inclusive). Most importantly, if nothing is found, the function returns false. That is why it can be used like this:

if (stristr($string, 'this must be found'))
     echo 'found!!';
else
     echo 'nothing found!';
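One caveat with this idiom, shown as a small sketch with a made-up sample string: the function returns the matched tail of the string, and in PHP the string "0" counts as false, so a bare if() can miss a real match. Comparing with false explicitly avoids that.

```php
<?php
// stristr() returns the tail of the string starting at the match.
// Here the tail is "0", which PHP treats as false in a condition.
$string = "version 0";

$loose  = stristr($string, "0") ? "found" : "nothing found";
$strict = stristr($string, "0") !== false ? "found" : "nothing found";

// stripos() (PHP 5 and later) returns the position instead of a copy of the tail.
$byPosition = stripos($string, "0") !== false ? "found" : "nothing found";

echo $loose, " / ", $strict, " / ", $byPosition, "\n";
// prints: nothing found / found / found
```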

My script uses stristr to search the index. In addition, the search understands the simplest syntax: "+" (the word must be found, aka AND), "-" (the word must not be found, aka NOT) and "*" (asterisk, which matches part of a word). But, having analyzed what people searched for on my site, I can say only one thing... Somewhere in a discussion about search engines and their AI (artificial intelligence), I came across the phrase "it is easier to teach the last moron to use the query language than to teach the search engine to guess what exactly this moron needs". Indeed, it is difficult to program a search engine so that it immediately returns what is needed for idiotic queries. But it seems to be even more difficult to train THEM to write queries correctly...

In my search, no one uses any of these symbols, no matter how much I go on about them. Although, when I need to find something specific, I find it with the first query (yes, of course, I know roughly what to look for, and I wrote the search script myself, but still :)
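To make the "+", "-" and "*" syntax concrete, here is a minimal sketch of how such a query could be interpreted. It is not the code of the search script: parse_query and line_matches are hypothetical helpers, and it assumes that "+" and "-" are written directly in front of the word.

```php
<?php
// Hypothetical helper: turns a query such as "php +search -mysql index*"
// into a list of terms with flags.
function parse_query($query) {
    $terms = array();
    $parts = preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($parts as $part) {
        $term = array(
            'required' => strpos($part, '+') === 0,    // leading "+": AND
            'excluded' => strpos($part, '-') === 0,    // leading "-": NOT
            'wildcard' => strpos($part, '*') !== false // "*": match part of a word
        );
        $term['word'] = str_replace('*', '', ltrim($part, '+-'));
        if ($term['word'] !== '') {
            $terms[] = $term;
        }
    }
    return $terms;
}

// Hypothetical helper: checks one index line against the parsed terms.
function line_matches($line, $terms) {
    $hits = 0;
    foreach ($terms as $term) {
        // a whole-word match needs the surrounding spaces; a wildcard drops them
        $needle = $term['wildcard'] ? $term['word'] : ' ' . $term['word'] . ' ';
        $found = stristr($line, $needle) !== false;
        if ($term['excluded'] && $found) return false;
        if ($term['required'] && !$found) return false;
        if ($found && !$term['excluded']) $hits++;
    }
    return $hits > 0;
}

$terms = parse_query("php +search -mysql index*");
var_dump(line_matches("1.txt| php search indexing script \n", $terms)); // bool(true)
```

Separating the parsing from the matching keeps each rule in one place, which makes it easier to see why a given file was accepted or rejected.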


The script is simple, but it works reliably. If you phrase the request correctly, you will find everything the first time. In principle, you could also sort by relevance, show an excerpt from the found file with the searched word highlighted, and so on... But that you can do yourself:

<?php

// Spectator's Site Search Script
// (C) p-qc.com
// Requires PHP 4 or higher to run.
// If you use this script, a link to p-qc.com
// is highly desirable. Thank you.


// file with index
$indexfile="indexfile.txt";

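// process the request
// (register_globals no longer exists, so read the form field explicitly)
$words = isset($_GET['words']) ? $_GET['words'] : '';

$total=0;
$qu2=str_replace("+","&",$words);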

// convert capital letters to lowercase
$qu2=strtolower($qu2);

// cut off extra spaces at the end
$qu2=chop($qu2);

// remove double spaces
while (stristr($qu2,"  ")) $qu2=str_replace ("  "," ",$qu2);


// escape the request before echoing it back to the page
echo ('Request: '.htmlspecialchars($qu2));
echo('<p><p><p>');

// remove everything unnecessary in the request (punctuation marks, etc.)
// (eregi_replace was removed in PHP 7; preg_replace does the same job)
$qu2=preg_replace ('/[.?,!()#":;|]/', '', $qu2);

// split the cleaned request into words
$words = explode(' ', $qu2);

// check request length
if (strlen($qu2)>2):


// open index
$index=file($indexfile);
$num= (count($index)-1);

// for each line from the index (one line = one file) we execute:
for ($i=1; $i<$num+2; $i++):

$contents=$index[$i-1];

$wordcount=0;
$mustfound=1;

// execute for each of the requested words:
$mustntfound=1;

for ($q=0; $q<count($words); $q++):

// handle *, + and - signs

// sign *
if (stristr($words[$q], "*")) {$search=str_replace ("*","",$words[$q]); }
else { $search=" ".$words[$q]." ";}

// if the word has an asterisk, the asterisk is removed; otherwise a space
// is added to the beginning and the end of the word. Without the spaces
// the word is searched not as a whole word but as a substring, that is,
// the query "tea*" will also find the word "steam".

// bug: the script does not take into account where the asterisk is in the
// word; wherever it is, the word is simply searched as a substring (!!)

// sign & (or +)
if (stristr($search, "&")) {$search=str_replace ("&","",$search); $mustfound++; }

// if there is a + sign, then the number of words that _should_ be found,
// increase by 1

// sign -
$negated=0;
if (stristr($search, "-")) {
    $search=str_replace("-","",$search);
    $negated=1;
    }

// if there is a - sign and the word is found, the entire result for this
// file is multiplied by 0 (look further).

// if a "correct" word is found, count it; if a word marked with the sign -
// is found, set $mustntfound to 0. The count is then multiplied by
// $mustntfound, so one excluded word keeps it at 0 for this file.

if (stristr($contents, $search)) {
    if ($negated) $mustntfound=0;
    else $wordcount++;
    }
$wordcount=$wordcount*$mustntfound;

endfor;

// check if all words marked with + are found,
// or (if there are no such words) whether at least one word was found at all

if ($wordcount >= $mustfound):

// find the name of the file it was found in
$file=explode("|",$contents);
$file=$file[0];

// output the name of the file in which it was found with a link
// (you will need to remake this piece for your own needs).

$file=str_replace(".txt", "", $file);
$file=str_replace("_", ".", $file);
echo ("<a href=".$file.">".$file."</a><br>");


// count how many files were found
$total++;
endif;

// move on to the next file
endfor;

// display results
if ($total!=0) echo ('<br><br>Total pages found: '.$total);
else echo ("<b>Nothing found!</b><p>Maybe the".
          " request was not made correctly. How to do it right".
          " - see <a href=search>here</a>.");

else:
echo ("<br>Request too short!");
endif;


?>

<!-- Search form: -->
<form method=get action=search.php>
<input type=text size=19 name=words value="" maxlength=150>
<input type=submit class=frm value=Go>
</form>
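As for the extras mentioned earlier (an excerpt with the searched word highlighted, and sorting by relevance), here is a minimal sketch of one way to start. Both helpers are hypothetical, and they work on the original text of a found file rather than on the index, because the index has had duplicate words removed and so cannot say how often a word occurs.

```php
<?php
// Hypothetical helper: returns a short excerpt around the first occurrence
// of $word in $text, with the word wrapped in <b> tags.
function make_snippet($text, $word, $radius = 30) {
    $pos = stripos($text, $word);
    if ($pos === false) {
        return "";
    }
    $start = max(0, $pos - $radius);
    $excerpt = substr($text, $start, ($pos - $start) + strlen($word) + $radius);
    // escape first, then highlight, so that the <b> tags survive
    $safe = htmlspecialchars($excerpt);
    $pattern = '/' . preg_quote(htmlspecialchars($word), '/') . '/i';
    return preg_replace($pattern, '<b>$0</b>', $safe);
}

// Hypothetical relevance score: how many times the word occurs in the text.
function count_hits($text, $word) {
    return substr_count(strtolower($text), strtolower($word));
}

$text = "A simple search script. The search runs over a prebuilt index.";
echo make_snippet($text, "index", 20), "\n";
echo count_hits($text, "search"), "\n"; // prints: 2
```

Found files could then be sorted by count_hits in descending order before the links are printed. Like the rest of the script, this works byte by byte and is not safe for multibyte text.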

PS. PHP.net is a very good site. 99% of what I need, I find there. It's simple: if you do not understand how a function works, just type its name into the search box on that site and you will get a description of the function. In addition, users can leave comments on the description of each function on php.net, so you also get a mountain of sensible comments. That's the way it is.