PDF indexer...?

Common forum

ixgamerz

Name: Aless
Posts: 17

2007-03-20 07:52:05


Hi,
I'm a newbie and I've just discovered Sphinx.

I have a quick question about the indexing procedure. Is it possible to index the
contents of PDF or DOC files with an external program like pdftotext, pod2html, concat?

How should I proceed?

Thanks for your answer

Best regards
Ix

ixgamerz

Name: Aless
Posts: 17

to: ixgamerz, 2007-03-22 04:03:44


Hi everybody,

I've read all the documentation and I have an idea of how to proceed. If someone could
confirm that I'm right, it would be very useful.

This is my current situation:

Every day, people put 1000 PDF files into a folder.

My objective:
Schedule automatic indexing, or a live index update, for these PDF files.

How I imagine the procedure:

At a scheduled time,

I launch a PHP or Perl script (which runs pdftotext) that creates an XML file
with the structure below.

<document>
<id>123</id>
<group>45</group>
<timestamp>1132223498</timestamp>
<title>test title</title>
<body>
this is my document body
</body>
</document>

<document>
<id>124</id>
<group>46</group>
<timestamp>1132223498</timestamp>
<title>another test</title>
<body>
this is another document
</body>
</document>

in sphinx.conf, I wrote:

type = xmlpipe
xmlpipe_command = cat /var/myPDFList.xml

and I run the indexer.

Then everything will be indexed. Is that correct?

My question:

After I run a search, how can I find the right file and open it? There is no
information about it...

Thanks for your answer
Best regards

ix

ixgamerz

Name: Aless
Posts: 17

to: ixgamerz, 2007-03-22 07:46:34


The XML structure doesn't appear, but it's the same as in the documentation...

shodan

Name: Andrew Aksyonoff
Posts: 4360

to: ixgamerz, 2007-03-25 06:15:03


> type = xmlpipe
> xmlpipe_command = cat /var/myPDFList.xml

Either that, or you could make your php/perl program print XML to stdout instead of temp
file and run that program from indexer:

xmlpipe_command = /usr/local/bin/php /where/is/my/pdf2xml.php

> After I run a search, how can I find the right file and open it?

You will need to assign IDs to each file name, store that somewhere, and remap IDs coming
from Sphinx back to the names. Could very well be done in that pdf2xml.php program as
well.
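
A minimal sketch of that ID remapping, under some assumptions: the mapping is kept in a JSON file, IDs start at 1, and PHP 7.1 or later is available. The function names (loadIdMap, idForFile, saveIdMap, fileForId) and the map file location are hypothetical and are not part of Sphinx.

```php
<?php
// Hypothetical helpers: give each file name a stable numeric ID and keep
// the mapping in a JSON file, so that IDs coming back from Sphinx can be
// turned into file names again.

// Loads the ID => file name map, or returns an empty map if none exists yet.
function loadIdMap(string $mapFile): array
{
    if (!is_file($mapFile)) {
        return [];
    }
    $map = json_decode((string) file_get_contents($mapFile), true);
    return is_array($map) ? $map : [];
}

// Returns the ID already assigned to $fileName, or assigns the next free one.
function idForFile(array &$map, string $fileName): int
{
    $id = array_search($fileName, $map, true);
    if ($id !== false) {
        return (int) $id;
    }
    $id = $map ? max(array_keys($map)) + 1 : 1;
    $map[$id] = $fileName;
    return $id;
}

// Writes the map back to disk.
function saveIdMap(string $mapFile, array $map): void
{
    file_put_contents($mapFile, json_encode($map));
}

// Reverse lookup for search results: ID => file name (null if unknown).
function fileForId(array $map, int $id): ?string
{
    return $map[$id] ?? null;
}

// Usage: assign an ID at indexing time, look the name up at search time.
$mapFile = sys_get_temp_dir() . '/pdf_id_map.json'; // example location only
$map = loadIdMap($mapFile);
$id = idForFile($map, 'RSCOMM_DOC_3608457.pdf');
saveIdMap($mapFile, $map);
echo fileForId(loadIdMap($mapFile), $id), "\n";
```

Because a file keeps its ID once assigned, adding new PDFs to the folder does not shift the IDs of files that were indexed earlier.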

ixgamerz

Name: Aless
Posts: 17

to: shodan, 2007-03-26 04:36:58


Thanks for your helpful answer.

> Either that, or you could make your php/perl program print XML to stdout instead of temp
> file and run that program from indexer:
>
> xmlpipe_command = /usr/local/bin/php /where/is/my/pdf2xml.php

Are you sure that would work? The output has to respect the XML structure given in the
Sphinx documentation. To do that, you need to prepare it beforehand.

What do you think?

best regards

ix

shodan

Name: Andrew Aksyonoff
Posts: 4360

to: ixgamerz, 2007-03-26 05:07:29


> > xmlpipe_command = /usr/local/bin/php /where/is/my/pdf2xml.php
> Are you sure that would work? The output has to respect the XML structure given in the
> Sphinx documentation.

It would work. There's nothing in that XML structure which prevents you from emitting
everything on the fly document by document.
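
A rough sketch of emitting the documents one by one to stdout, using the element names from the structure quoted earlier in the thread. The function name buildDocument and the sample data are made up for illustration, the text is assumed to be UTF-8, and a real script would get each body from pdftotext instead of a hard-coded array.

```php
<?php
// Builds one <document> element; title and body are XML-escaped.
function buildDocument(int $id, int $group, int $timestamp, string $title, string $body): string
{
    $flags = ENT_NOQUOTES | ENT_SUBSTITUTE; // escape &, <, > and tolerate bad bytes
    return "<document>\n"
        . "<id>$id</id>\n"
        . "<group>$group</group>\n"
        . "<timestamp>$timestamp</timestamp>\n"
        . "<title>" . htmlspecialchars($title, $flags, 'UTF-8') . "</title>\n"
        . "<body>\n" . htmlspecialchars($body, $flags, 'UTF-8') . "\n</body>\n"
        . "</document>\n";
}

// Placeholder input standing in for the extracted PDF text.
$docs = [
    ['id' => 123, 'group' => 45, 'title' => 'test title', 'body' => 'this is my document body'],
    ['id' => 124, 'group' => 46, 'title' => 'another test', 'body' => 'this is another document'],
];

foreach ($docs as $doc) {
    // Each document is printed as soon as it is ready; no temp file is needed.
    echo buildDocument($doc['id'], $doc['group'], 1132223498, $doc['title'], $doc['body']);
}
```

Since every document is self-contained, the script never has to hold the whole set in memory or know in advance how many files there are.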

ixgamerz

Name: Aless
Posts: 17

to: shodan, 2007-03-26 11:01:03


Hi,

This is my .php file:

<?php
$appDir = "/usr/bin/"; //directory of pdftotext
$appPdf2Txt = "pdftotext"; //name of the pdftotext application
$appOption = ""; //supplementary options for application execution (placed before the file names)
$appOutputFile = "-"; //output file for application execution ("-" => stdout)

$testfile = "RSCOMM_DOC_3608457.pdf";

$appDir = $appDir.$appPdf2Txt." ".$appOption." ".$testfile." ".$appOutputFile;

//print($appDir);

print("<document>");
print("<id>"."10"."</id>");
print("<group>11</group>");
print("<timestamp>1132223498</timestamp>");
print("<title>".$testfile."</title>");
print("<body>");
passthru($appDir);
print("</body>");
print("</document>");

?>

I have added these lines to my sphinx.conf:
        type = xmlpipe
        xmlpipe_command = /usr/bin/php /var/www/sphinx/pdf2xml.php

and this is the result I got when I ran indexer --all:

/usr/local/etc$ sudo indexer --all
Sphinx 0.9.7-RC2
Copyright (c) 2001-2006, Andrew Aksyonoff

using config file 'sphinx.conf'...
indexing index 'test1'...
collected 4 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 4 docs, 155 bytes
total 0.010 sec, 15500.00 bytes/sec, 400.00 docs/sec
indexing index 'test1stemmed'...
collected 4 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 4 docs, 155 bytes
total 0.010 sec, 15500.00 bytes/sec, 400.00 docs/sec
skipping index 'dist1' (distributed indexes can not be directly indexed)...

The 4 docs are the ones I first put in MySQL for the test, but my PDF file is not indexed.

What is wrong?

Thanks for your help

best regards
Ix

shodan

Name: Andrew Aksyonoff
Posts: 4360

to: ixgamerz, 2007-03-26 11:54:48


> The 4 docs are the ones I first put in MySQL for the test, but my PDF file is not indexed.

Well, redirect the output to temp file and check if everything is OK.

There might be interfering PHP header lines, notices or warnings for instance.

ixgamerz

Name: Aless
Posts: 17

to: ixgamerz, 2007-03-26 11:59:19


Hi,

I have disabled all the lines in the "source src1" section of sphinx.conf except these:

        type = xmlpipe
        xmlpipe_command = /usr/bin/php /var/www/sphinx/pdf2xml.php

I corrected my pdf2xml.php by adding the full path to my PDF file, as shown below:
<?php
$appDir = "/usr/bin/"; //directory of pdftotext
$appPdf2Txt = "pdftotext"; //name of the pdftotext application
$appOption = ""; //supplementary options for application execution (placed before the file names)
$appOutputFile = "-"; //output file for application execution ("-" => stdout)

$testfile = "/var/www/sphinx/RSCOMM_DOC_3608457.pdf";

$appDir = $appDir.$appPdf2Txt." ".$appOption." ".$testfile." ".$appOutputFile;

//print($appDir);


print("<document>");
print("<id>"."10"."</id>");
print("<group>11</group>");
print("<timestamp>1132223498</timestamp>");
print("<title>".$testfile."</title>");
print("<body>");
passthru($appDir);
print("</body>");
print("</document>");
?>

This is the result of the indexing:

sudo indexer --all

Sphinx 0.9.7-RC2
Copyright (c) 2001-2006, Andrew Aksyonoff

using config file 'sphinx.conf'...
indexing index 'test1'...
WARNING: CSphSource_XMLPipe(): expected '</body>', got '<test.test@test.com> Date : Wed,
31 Jan 2007 15:03:23'.
collected 1 docs, 0.0 MB
total 1 docs, 1652 bytes
total 0.220 sec, 7521.64 bytes/sec, 4.55 docs/sec
indexing index 'test1stemmed'...
WARNING: CSphSource_XMLPipe(): expected '</body>', got '<test.test@test.com> Date : Wed,
31 Jan 2007 15:03:23'.
collected 1 docs, 0.0 MB
total 1 docs, 1652 bytes
total 0.267 sec, 6178.15 bytes/sec, 3.74 docs/sec
skipping index 'dist1' (distributed indexes can not be directly indexed)...

It seems that it doesn't see the </body> XML tag, perhaps because of the passthru($appDir) call.

I will try something else with the exec command and tell you later if it works.

Thanks for your help.

best regards
Ix

shodan

Name: Andrew Aksyonoff
Posts: 4360

to: ixgamerz, 2007-03-26 12:18:30


> WARNING: CSphSource_XMLPipe(): expected '</body>', got '<test.test@test.com>

This means that you should XML-escape document body before inserting it into XML.

ixgamerz

Name: Aless
Posts: 17

to: shodan, 2007-03-26 12:23:01


> > WARNING: CSphSource_XMLPipe(): expected '</body>', got '<test.test@test.com>
>
> This means that you should XML-escape document body before inserting it into XML.

What does "XML-escape document body" mean?

shodan

Name: Andrew Aksyonoff
Posts: 4360

to: ixgamerz, 2007-03-26 12:38:03


> What does "XML-escape document body" mean?

You can't have "<" or ">" in valid well-formed XML body; you need to replace those with
"&lt;" and "&gt;"

This affects some other characters as well. Try using htmlspecialchars()
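
A small sketch of what that escaping does to text like the one in the warning. The helper name xmlEscapeText and the "From:" prefix of the sample line are made up for illustration, and the input is assumed to be UTF-8.

```php
<?php
// Escapes text for use inside an XML element such as <body>.
// ENT_NOQUOTES: only &, < and > are replaced (quotes are legal in element text).
// ENT_SUBSTITUTE: invalid byte sequences are replaced instead of
// making htmlspecialchars() return an empty string.
function xmlEscapeText(string $text): string
{
    return htmlspecialchars($text, ENT_NOQUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// An e-mail address in angle brackets looks like an XML tag to the parser.
$raw = 'From: <test.test@test.com> Date : Wed, 31 Jan 2007 15:03:23';

echo "<body>", xmlEscapeText($raw), "</body>\n";
// prints: <body>From: &lt;test.test@test.com&gt; Date : Wed, 31 Jan 2007 15:03:23</body>
```

Ampersands are covered as well, so a body containing "R&D" would come out as "R&amp;D".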

ixgamerz

Name: Aless
Posts: 17

to: shodan, 2007-03-27 08:18:05


> > What does "XML-escape document body" mean?
>
> You can't have "<" or ">" in valid well-formed XML body; you need to replace those with
> "&lt;" and "&gt;"
>
> This affects some other characters as well. Try using htmlspecialchars()

Indeed, you're right. In case someone needs help doing the same, my PHP code is below.

This is my pdf2xml.php file:

<?php
$appDir = "/usr/bin/"; //directory of pdftotext
$appPdf2Txt = "pdftotext"; //name of the pdftotext application
$appOption = ""; //supplementary options for application execution (placed before the file names)
$appOutputFile = "-"; //output file for application execution ("-" => stdout)

$dir = '/var/www/sphinx/myfolder/'; //directory holding the files to process
//list the directory (descending order) and drop the "." and ".." entries
$pdfFiles = array_values(array_diff(scandir($dir, 1), array('.', '..')));

//print_r($pdfFiles); //debug: list the files found in the directory

for ($i = 0; $i < count($pdfFiles); $i++)
{
  //quote the full path so that file names containing spaces survive the shell
  $pdfFile = escapeshellarg($dir.$pdfFiles[$i]);

  //print($pdfFile); //debug: quoted path

  //cmd line for execution of PDF extractor
  $cmd = $appDir.$appPdf2Txt." ".$appOption." ".$pdfFile." ".$appOutputFile;
  //print($cmd); //debug command line

  //XML structure
  print("<document>");
  print("<id>".($i+1)."</id>");
  print("<group>".($i+1)."</group>");
  print("<timestamp>1132223498</timestamp>");
  print("<title>".htmlspecialchars($pdfFiles[$i])."</title>");
  print("<body>");
  //shell_exec() returns null on failure, so cast before escaping
  print(htmlspecialchars((string) shell_exec($cmd)));
  print("</body>");
  print("</document>");

}
?>

This is the sphinx.conf used:
#
# sphinx configuration file sample
#

#############################################################################
## data source definition
#############################################################################

source src1
{
        type = xmlpipe
        xmlpipe_command = /usr/bin/php /var/www/sphinx/pdf2xml.php
}

#############################################################################
## index definition
#############################################################################

# local index example
#
# this is an index which is stored locally in the filesystem
#
# all indexing-time options (such as morphology and charsets)
# are configured per local index
index test1
{
        source = src1
        path = /var/data/test1
        docinfo = inline
        morphology = none
        stopwords =
        min_word_len = 1
        charset_type = sbcs
                U+410..U+42F->U+430..U+44F, U+430..U+44F
}

index test1stemmed : test1
{
        path = /var/data/test1stemmed
        morphology = stem_en
}

index dist1
{
        # 'distributed' index type MUST be specified
        type = distributed

        # local index to be searched
        # there can be many local indexes configured
        local = test1
        local = test1stemmed

        # remote agent
        # multiple remote agents may be specified
        # syntax is 'hostname:port:index1,[index2[,...]]
        agent = localhost:3313:remote1
        agent = localhost:3314:remote2,remote3

        # remote agent connection timeout, milliseconds
        # optional, default is 1000 ms, ie. 1 sec
        agent_connect_timeout = 1000

        # remote agent query timeout, milliseconds
        # optional, default is 3000 ms, ie. 3 sec
        agent_query_timeout = 3000
}

#############################################################################
## indexer settings
#############################################################################

indexer
{
        # memory limit
        #
        # may be specified in bytes (no postfix), kilobytes (mem_limit=1000K)
        # or megabytes (mem_limit=10M)
        #
        # will grow if set unacceptably low
        # will warn if set too low and potentially hurting the performance
        #
        # optional, default is 32M
        mem_limit = 32M
}

#############################################################################
## searchd settings
#############################################################################

searchd
{
        # IP address on which search daemon will bind and accept
        # incoming network requests
        #
        # optional, default is to listen on all addresses,
        # ie. address = 0.0.0.0
        #
        # address = 127.0.0.1
        # address = 192.168.0.1


        # port on which search daemon will listen
        port = 3312


        # log file
        # searchd run info is logged here
        log = /var/log/searchd.log


        # query log file
        # all the search queries are logged here
        query_log = /var/log/query.log


        # client read timeout, seconds
        read_timeout = 5


        # maximum amount of children to fork
        # useful to control server load
        max_children = 30


        # a file which will contain searchd process ID
        # used for different external automation scripts
        # MUST be present
        pid_file = /var/log/searchd.pid


        # maximum amount of matches this daemon would ever retrieve
        # from each index and serve to client
        #
        # this parameter affects per-client memory and CPU usage
        # (16+ bytes per match) in match sorting phase; so blindly raising
        # it to 1 million is definitely NOT recommended
        #
        # starting from 0.9.7, it can be decreased on the fly through
        # the corresponding API call; increasing is prohibited to protect
        # against malicious and/or malformed requests
        #
        # default is 1000 (just like with Google)
        max_matches = 1000
}


and it works fine for the first test...

Thanks a lot for your help

Best regards

Ix
