Last Updated:

Creating a search engine in PHP / MySQL

search engine in PHP

So, you have a completely dynamic site that has a large amount of different data (of any kind - forums, articles, etc.). Mostly large data is stored in BLOBs (which you probably could not avoid), therefore, it is impossible to do anything useful using the standard type of query LIKE %searchword% since the output will not be appropriate (that is, relevant).

There has to be a different path. And it is :-).

Step One: Reducing "Extraneous" Words from the Blob

 

The first problem is that the data is overflowing with extraneous words (prepositions, interjections..) such as "how, where, a, and". These words help us humans communicate but have nothing to do with our problem when we need to get a conclusion on relevance.

Below, at the end of the article, I have attached my personal list of such "extraneous" words.

So, we are now trying to do - to choose from the data these glory, and, - in the newly created plate with two fields: the word and its pointer (counter). We need something like this:

+-----+---------------+ 
| qid | word          | 
+-----+---------------+ 
| 6   | links         | 
| 5   | Fire          | 
| 5   | topics        | 
| 5   | related       | 
| 5   | Shakespeare   | 
| 4   | people        | 
| 4   | Knowpost      | 
| 3   | cuba          | 
| 3   | cigar         | 
+-----+---------------+ 

So, let's create the actual table:

CREATE TABLE search_table( 
word VARCHAR(50), 
qid INT) 

The next step is to process and move the data to our search_table table.

<?php
$query = "SELECT blob,identifier FROM your_table";
$result = mysql_query($query);
$number = mysql_numrows($result);
$j = 0;
WHILE ($j < $number) {

/* Our "blob" */
$body = mysql_result($result,$j,"blob");

/* Our "identifier */
$qid = mysql_result($result,$j,"qid");

/* Open the file with extraneous words into an array */

$noise_words = file("noisewords.txt");
$filtered = $body;

/* Place a space before the first word */
$filtered = ereg_replace("^"," ",$filtered);

/* Now we got rid of unnecessary words and
we can put what is left - in an array
*/

/* Loop through and remove wrong words */
for ($i=0; $i < count($noise_words); $i++) {
$filterword = trim($noise_words[$i]);
$filtered =
eregi_replace(" $filterword "," ",$filtered);
}

$filtered = trim($filtered);
$filtered = addslashes($filtered);
$querywords = ereg_replace(",","",$filtered);
$querywords = ereg_replace(" ",",",$querywords);
$querywords = ereg_replace("\?","",$querywords);
$querywords = ereg_replace("\(","",$querywords);
$querywords = ereg_replace("\)","",$querywords);
$querywords = ereg_replace("\.","",$querywords);
$querywords = ereg_replace(",","','",$querywords);
$querywords = ereg_replace("^","'",$querywords);
$querywords = ereg_replace("$","'",$querywords);

/* Now we should have something like
'Word1','Word2','Word3'
so now we can put everything into an array
*/
$eachword = explode(",", $querywords);

/* finally we can run through
array and put each word into the database,
along with counter
*/

for ($k=0; $k < count($eachword); $k++) {
$inputword = "INSERT INTO search_table
VALUES($eachword[$k],$qid)";
mysql_query($inputword);
}

/* Run through the loop again with new data */
$j++;
}

?>

This script processes your old data. You need to add the same processing to adding data to your database (when it is entered by a user, host - or whatever) - so that the database is updated in the process.

Step two: Search the table

Now we have a table with keywords and counters. How do I build a request?

First, you need to reformat (no, not the hard disk) - the search words in a string of the form 'word1','word2','word3' and write it in the $querywords.

Next, use a query similar to the following:

SELECT count(search_table.word) as score, search_table.qid,your_table.blob 
FROM search_table,your_table 
WHERE your_table.qid = search_table.qid AND search_table.word 
IN($querywords) 
GROUP BY search_table.qid 
ORDER BY score DESC"; 

The conclusion can be, for example, as follows:

<?php

$getresults = mysql_query($search);
$resultsnumber = mysql_numrows($getresults);

IF ($resultsnumber == 0) {

PRINT "Nothing found."
."Try other keywords.";

} ELSEIF ($resultsnumber > 0) {

PRINT "Search returned $resultsnumber of results<BR>"
      ."Ranking by relevance <BR><BR>";
for($count = 0; $count < $resultsnumber; $count++) {
$body = mysql_result($getresults,$count,"blob");
$qid = mysql_result($getresults,$count,"qid");

$body2print = substr($body, 0, 100);
$cnote = $count+1;
PRINT "$cnote. <a href=yourcontent.php3?qid=$qid>"
      ."<i>$body2print...</i></a><BR>";
}
}

?>

So, you have a mechanism for searching for keywords in your database by relevance (relevance to the query).

Of course, this is not bing or Google :-))).

But now we have a small search engine that works quite quickly and competently and is quite suitable for an ordinary user (who is not going to use logical elements, etc.).

Here is a sheet of my "extraneous" words:

noisewords.txt 
-------------- 
a 
about 
after 
ago 
all 
almost 
along 
also 
am 
an 
and 
answer 
any 
anybody 
anywhere 
are 
aren't 
around 
as 
ask 
at 
bad 
be 
been 
before 
being 
best 
better 
between 
big 
but 
by 
can 
can't 
come 
could 
couldn't 
day 
did 
didn't 
do 
does 
don't 
down 
each 
either 
else 
even 
ever 
every 
everybody 
everyone 
far 
find 
for 
found 
from 
get 
go 
going 
gone 
good 
got 
had 
has 
have 
haven't 
having 
her 
here 
hers 
him 
his 
home 
how 
href 
I 
if 
in 
into 
is 
isn't 
it 
its 
know 
large 
less 
like 
little 
looking 
look 
many 
me 
more 
most 
must 
my 
near 
never 
new 
news 
no 
none 
not 
nothing 
of 
off 
often 
old 
on 
once 
only 
or 
other 
our 
ours 
out 
over 
page 
please 
question 
rather 
recent 
she 
should 
sites 
small 
so 
some 
something 
sometime 
somewhere 
than 
true 
thank 
that 
the 
their 
theirs 
them 
then 
there 
these 
they 
this 
those 
though 
through 
thus 
time 
times 
to 
too 
under 
until 
untrue 
up 
upon 
use 
users 
version 
very 
via 
want 
was 
way 
web 
were 
what 
when 
where 
which 
who 
whom 
whose 
why 
wide 
will 
with 
within 
without 
world 
worse 
worst 
would 
www 
yes 
yet 
you 
your 
yours