1 files changed, 332 insertions, 0 deletions
diff --git a/doc/TODO.detail/yacc b/doc/TODO.detail/yacc
new file mode 100644
index 00000000000..acccb250a08
--- /dev/null
+++ b/doc/TODO.detail/yacc
@@ -0,0 +1,332 @@
+From selkovjr@mcs.anl.gov Sat Jul 25 05:31:05 1998
+Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
+	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id FAA16564
+	for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 05:31:03 -0400 (EDT)
+Received: from antares.mcs.anl.gov (mcs.anl.gov [140.221.9.6]) by renoir.op.net (o1/$Revision: 1.1 $) with SMTP id FAA01775 for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 05:28:22 -0400 (EDT)
+Received: from mcs.anl.gov (wit.mcs.anl.gov [140.221.5.148]) by antares.mcs.anl.gov (8.6.10/8.6.10)  with ESMTP
+	id EAA28698 for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 04:27:05 -0500
+Sender: selkovjr@mcs.anl.gov
+Message-ID: <35B9968D.21CF60A2@mcs.anl.gov>
+Date: Sat, 25 Jul 1998 08:25:49 +0000
+From: "Gene Selkov, Jr." <selkovjr@mcs.anl.gov>
+Organization: MCS, Argonne Natl. Lab
+X-Mailer: Mozilla 4.03 [en] (X11; I; Linux 2.0.32 i586)
+MIME-Version: 1.0
+To: Bruce Momjian <maillist@candle.pha.pa.us>
+Subject: position-aware scanners
+References: <199807250524.BAA07296@candle.pha.pa.us>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Status: RO
+
+Bruce,
+
+I attached here (trough the web links) a couple examples, totally
+irrelevant to postgres but good enough to discuss token locations. I
+might as well try to patch the backend parser, though not sure how soon.
+
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+1. 
+
+The first c parser I wrote,
+http://wit.mcs.anl.gov/~selkovjr/unit-troff.tgz, is not very
+sophisticated, so token locations reported by yyerr() may be slightly
+incorrect (+/- one position depending on the existence and type of the
+lookahead token. It is a filter used to typeset the units of measurement
+with eqn. To use it, unpack the tar file and run make. The Makefile is
+not too generic but I built it on various systems including linux,
+freebsd and sunos 4.3. The invocation can be something like this:
+
+./check 0 parse "l**3/(mmoll*min)"
+parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
+`'(''
+
+l**3/(mmoll*min)
+      ^^^^^
+
+Now to the guts. As far as I can imagine, the only way to consistently
+keep track of each character read by the scanner (regardless of the
+length of expressions it will match) is to redefine its YY_INPUT like
+this:
+
+#undef YY_INPUT
+#define YY_INPUT(buf,result,max_size) \
+{ \
+	int c	= (int) buffer[pos++]; \
+	result = (c == '\0') ?	YY_NULL	: (buf[0] = c, 1); \
+}
+
+Here, buffer is the pointer to the origin of the string being scanned
+and pos is a global variable, similar in usage to a file pointer (you
+can both read and manipulate it at will). The buffer and the pointer are
+initialized by the function 
+
+void setString(char *s)
+{
+   buffer = s;
+   pos = 0;
+}
+
+each time the new string is to be parsed. This (exportable) function is
+part of the interface. 
+
+In this simplistic design, yyerror() is part of the scanner module and
+it uses the pos variable to report the location of unexpected tokens.
+The downside of such arrangement is that in case of error condition, you
+can't easily tell whether your context is current or lookahead token, it
+just reports the position of the last token read (be it $ (end of
+buffer) or something else):
+
+./check 0 convert "mol/foo"
+parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
+`'(''
+
+mol/foo
+       ^^^
+
+(should be at the beginning of "foo")
+
+./check 0 convert "mmol//l"        
+parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
+`'(''
+
+mmol//l
+    ^
+
+(should be at the second '/')
+
+
+I believe this is why most simple parsers made with yacc would report
+parse errors being "at or near" some token, which is fair enough if the
+expression is not too complex.
+
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+2. The second version of the same scanner,
+http://wit.mcs.anl.gov/~selkovjr/scanner-example.tgz, addresses this
+problem by recording exact locations of the tokens in each instance of
+the token semantic data structure. The global,
+
+UNIT_YYSTYPE unit_yylval;
+
+would be normally used to export the token semantics (including its
+original or modified text and location data) to the parser.
+Unfortunately, I cannot show you the parser part in c, because that's
+about when I stopped writing parsers in c. Instead, I included a small
+test program, test.c, that mimics the parser's expectations for the
+scanner data pretty well. I am assuming here that you are not interested
+in digging someone else's ugly guts for relatively small bit of
+information; let me know if I am wrong and I will send you the complete
+perl code (also generated with bison).
+
+To run this example, unpack the tar file and run Make. Then do
+
+  gcc test.c scanner.o
+
+and run a.out
+
+Note the line
+
+    yylval = unit_getyylval();
+
+in test.c. You will not normally need it in a c parser. It is enough to
+define yylval as an external variable and link it to yylval in yylex()
+
+In the bison-generated parser, yylval gets pushed into a stack (pointed
+to by yylsp) each time a new token is read. For each syntax rule, the
+bison macros @1, @2, ... are just shortcuts to locations in the stack 1,
+2, ... levels deep. In following code fragment, @3 refers to the
+location info for the third term in the rule (INTEGER):
+
+(sorry about perl, but I think you can do the same things in c without
+significant changes to your existing parser)
+
+term:           base    {
+                        $$ = $1;
+                        $$->{'order'} = 1;
+                }
+        |       base EXP INTEGER {
+                        $$ = $1;
+                        $$->{'order'} = @3->{'text'};
+                        $$->{'scale'} = $$->{'scale'} ** $$->{'order'};
+                        if ( $$->{'order'} == 0 ) {
+                                yyerror("Error: expecting a non-zero
+integer exponent");
+                                YYERROR;
+                        }
+                }
+
+
+which translates to:
+
+  ($yyn == 10)    && do {
+          $yyval = $yyvsa[-1];
+          $yyval->{'order'} = 1;
+          last SWITCH;
+  };
+
+  ($yyn == 11)    && do {
+          $yyval = $yyvsa[-3];
+          $yyval->{'order'} = $yylsa[-1]->{'text'}
+          $yyval->{'scale'} = $yyval->{'scale'} ** $yyval->{'order'};
+          if ( $yyval->{'order'} == 0 ) {
+                   yyerror("Error: expecting a non-zero integer
+exponent");
+                   goto yyerrlab1 ;
+          }
+          last SWITCH;
+  };
+
+In c, you will have a bit more complicated pointer arithmetic to adress
+the stack, but the usage of objects will be the same. Note here that it
+is convenient to keep all information about the token in its location
+info, (yylsa, yylsp, yylval, @n), while everything relating to the value
+of the expression, or to the parse tree, is better placed in the
+semantic stack (yyssa, yyssp, yysval, $n). Also note that in some cases
+you can do semantic checks inside rules and report useful messages
+before or instead of invoking yyerror();
+
+Finally, it is useful to make the following wrapper function around
+external yylex() in order to maintain your own token stack. Unlike the
+parser's internal stack which is only as deep as the rule being reduced,
+this one can hold all tokens recognized during the current run, and that
+can be extremely helpful for error reporting and any transformations you
+may need. In this way, you can even scan (tokenize) the whole buffer
+before handing it off to the parser (who knows, you may need a token
+ahead of what is currently seen by the parser):
+
+
+sub tokenize {
+    undef @tokenTable;
+    my ($tok, $text, $name, $unit, $first_line, $first_column,
+$last_line, $last_column);
+    
+    while ( ($tok = &UnitLex::yylex()) > 0 ) { # this is where the
+c-coded yylex is called,
+                                               # UnitLex is the perl
+extension encapsulating it                            
+       ( $text, $name, $unit, $first_line, $first_column, $last_line,
+$last_column ) = &UnitLex::getyylval;
+       push(@tokenTable, 
+           Unit::yyltype->new (
+              'token'         => $tok,
+              'text'          => $text,
+              'name'          => $name,
+              'unit'          => $unit,
+              'first_line'    => $first_line,
+              'first_column'  => $first_column,
+              'last_line'     => $last_line,
+              'last_column'   => $last_column,
+           )
+       )
+    }
+
+}
+
+
+It is now a lot easier to handle various state-related problems, such as
+backtracking and error reporting. The yylex() function as seen by the
+parser might be constructed somewhat like this:
+
+sub yylex {
+    $yylloc = $tokenTable[$tokenNo];  # $tokenNo is a global; now
+instead of a "file pointer",
+                                      # as in the first example, we have
+a "token pointer"
+    undef $yylval;
+
+
+    # disregard this; name this block "computing semantic values"       
+    if ( $yylloc->{'token'} == UNIT) {
+        $yylval = Unit::Operand->new(
+        'unit'  => Unit::Dict::unit($yylloc->{'unit'}),
+        'base'  => Unit::Dict::base($yylloc->{'unit'}),
+        'scale' => Unit::Dict::scale($yylloc->{'unit'}),
+        'scaleToBase' => Unit::Dict::scaleToBase($yylloc->{'unit'}),
+        'loc'   => $yylloc,
+       );    
+    }
+    elsif ( ($yylloc->{'token'} == INTEGER ) || ($yylloc->{'token'} ==
+POSITIVE_NUMBER) ) {
+        $yylval = Unit::Operand->new(
+          'unit' => '1',
+          'base' => '1',
+          'scale' => 1,
+          'scaleToBase' => 1,
+          'loc'   => $yylloc,
+        );
+    }
+
+    $tokenNo++;
+    return(%{$yylloc}->{'token'}); # This is all the parser needs to
+know about this token. 
+                                   # But we already made sure we saved
+everything we need to know.
+}
+
+
+Now the most interesting part, the error reporting routine:
+
+
+sub yyerror {
+    my ($str) = @_;
+    my ($message, $start, $end, $loc);
+
+    $loc = $tokenTable[$tokenNo-1]; # This is the same as to say, 
+                                    # "obtain the location info for the
+current token"
+  
+    # You may use this routine for your own purposes or let parser use
+it
+    if( $str ne 'parse error' ) {
+        $message = "$str instead of `" . $loc->{'name'} . "' <" .
+$loc->{'text'} . ">,  at line " . $loc->{'first_line'} . ":\n\
+n";
+    }
+    else {
+        $message = "unexpected token `" . $loc->{'name'} . "' <" .
+$loc->{'text'} . ">,  at line " . loc->{'first_line'} . ":\n
+\n";
+    }
+
+    $message .= $parseBuffer . "\n"; # that's the original string that
+was used to set the parser buffer
+
+    $message .= ( ' ' x ($loc->{'first_column'} + 1) ) . ( '^' x
+length($loc->{'text'}) ). "\n";
+    if( $str ne 'parse error' ) {
+        print STDERR "$str instead of `", $loc->{'name'}, "' {",
+$loc->{'text'}, "},  at line ", $loc->{'first_line'}, ":\n\n";
+    }
+    else {
+        print STDERR "unexpected token `", $loc->{'name'}, "' {",
+$loc->{'text'}, "},  at line ", $loc->{'first_line'}, ":\n\n";
+    }
+    
+    print STDERR "$parseBuffer\n";
+    print STDERR ' ' x ($loc->{'first_column'} + 1), '^' x
+length($loc->{'text'}), "\n";
+}
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Scanners used in these examples assume there is a single line of text on
+the input (the first_line and last_line elements of yylloc are simply
+ignored). If you want to be able to parse multi-line buffers, just add a
+lex rule for '\n' that will increment the line count and reset the pos
+variable to zero.
+
+
+Ugly as it may seem, I find this approach extremely liberating. If the
+grammar becomes too complicated for a LALR(1) parser, I can cascade
+multiple parsers. The token table can then be used to reassemble parts
+of original expression for subordinate parsers, preserving the location
+info all the way down, so that subordinate parsers can report their
+problems consistently. You probably don't need this, as SQL is very well
+thought of and has parsable grammar. But it may be of some help, for
+error reporting. 
+
+
+--Gene
+