diff options
Diffstat (limited to 'doc/TODO.detail/yacc')
-rw-r--r-- | doc/TODO.detail/yacc | 332 |
1 files changed, 332 insertions, 0 deletions
diff --git a/doc/TODO.detail/yacc b/doc/TODO.detail/yacc new file mode 100644 index 00000000000..acccb250a08 --- /dev/null +++ b/doc/TODO.detail/yacc @@ -0,0 +1,332 @@ +From selkovjr@mcs.anl.gov Sat Jul 25 05:31:05 1998 +Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) + by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id FAA16564 + for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 05:31:03 -0400 (EDT) +Received: from antares.mcs.anl.gov (mcs.anl.gov [140.221.9.6]) by renoir.op.net (o1/$Revision: 1.1 $) with SMTP id FAA01775 for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 05:28:22 -0400 (EDT) +Received: from mcs.anl.gov (wit.mcs.anl.gov [140.221.5.148]) by antares.mcs.anl.gov (8.6.10/8.6.10) with ESMTP + id EAA28698 for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 04:27:05 -0500 +Sender: selkovjr@mcs.anl.gov +Message-ID: <35B9968D.21CF60A2@mcs.anl.gov> +Date: Sat, 25 Jul 1998 08:25:49 +0000 +From: "Gene Selkov, Jr." <selkovjr@mcs.anl.gov> +Organization: MCS, Argonne Natl. Lab +X-Mailer: Mozilla 4.03 [en] (X11; I; Linux 2.0.32 i586) +MIME-Version: 1.0 +To: Bruce Momjian <maillist@candle.pha.pa.us> +Subject: position-aware scanners +References: <199807250524.BAA07296@candle.pha.pa.us> +Content-Type: text/plain; charset=us-ascii +Content-Transfer-Encoding: 7bit +Status: RO + +Bruce, + +I attached here (trough the web links) a couple examples, totally +irrelevant to postgres but good enough to discuss token locations. I +might as well try to patch the backend parser, though not sure how soon. + + +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +1. + +The first c parser I wrote, +http://wit.mcs.anl.gov/~selkovjr/unit-troff.tgz, is not very +sophisticated, so token locations reported by yyerr() may be slightly +incorrect (+/- one position depending on the existence and type of the +lookahead token. It is a filter used to typeset the units of measurement +with eqn. To use it, unpack the tar file and run make. The Makefile is +not too generic but I built it on various systems including linux, +freebsd and sunos 4.3. The invocation can be something like this: + +./check 0 parse "l**3/(mmoll*min)" +parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or +`'('' + +l**3/(mmoll*min) + ^^^^^ + +Now to the guts. As far as I can imagine, the only way to consistently +keep track of each character read by the scanner (regardless of the +length of expressions it will match) is to redefine its YY_INPUT like +this: + +#undef YY_INPUT +#define YY_INPUT(buf,result,max_size) \ +{ \ + int c = (int) buffer[pos++]; \ + result = (c == '\0') ? YY_NULL : (buf[0] = c, 1); \ +} + +Here, buffer is the pointer to the origin of the string being scanned +and pos is a global variable, similar in usage to a file pointer (you +can both read and manipulate it at will). The buffer and the pointer are +initialized by the function + +void setString(char *s) +{ + buffer = s; + pos = 0; +} + +each time the new string is to be parsed. This (exportable) function is +part of the interface. + +In this simplistic design, yyerror() is part of the scanner module and +it uses the pos variable to report the location of unexpected tokens. +The downside of such arrangement is that in case of error condition, you +can't easily tell whether your context is current or lookahead token, it +just reports the position of the last token read (be it $ (end of +buffer) or something else): + +./check 0 convert "mol/foo" +parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or +`'('' + +mol/foo + ^^^ + +(should be at the beginning of "foo") + +./check 0 convert "mmol//l" +parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or +`'('' + +mmol//l + ^ + +(should be at the second '/') + + +I believe this is why most simple parsers made with yacc would report +parse errors being "at or near" some token, which is fair enough if the +expression is not too complex. + + +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +2. The second version of the same scanner, +http://wit.mcs.anl.gov/~selkovjr/scanner-example.tgz, addresses this +problem by recording exact locations of the tokens in each instance of +the token semantic data structure. The global, + +UNIT_YYSTYPE unit_yylval; + +would be normally used to export the token semantics (including its +original or modified text and location data) to the parser. +Unfortunately, I cannot show you the parser part in c, because that's +about when I stopped writing parsers in c. Instead, I included a small +test program, test.c, that mimics the parser's expectations for the +scanner data pretty well. I am assuming here that you are not interested +in digging someone else's ugly guts for relatively small bit of +information; let me know if I am wrong and I will send you the complete +perl code (also generated with bison). + +To run this example, unpack the tar file and run Make. Then do + + gcc test.c scanner.o + +and run a.out + +Note the line + + yylval = unit_getyylval(); + +in test.c. You will not normally need it in a c parser. It is enough to +define yylval as an external variable and link it to yylval in yylex() + +In the bison-generated parser, yylval gets pushed into a stack (pointed +to by yylsp) each time a new token is read. For each syntax rule, the +bison macros @1, @2, ... are just shortcuts to locations in the stack 1, +2, ... levels deep. In following code fragment, @3 refers to the +location info for the third term in the rule (INTEGER): + +(sorry about perl, but I think you can do the same things in c without +significant changes to your existing parser) + +term: base { + $$ = $1; + $$->{'order'} = 1; + } + | base EXP INTEGER { + $$ = $1; + $$->{'order'} = @3->{'text'}; + $$->{'scale'} = $$->{'scale'} ** $$->{'order'}; + if ( $$->{'order'} == 0 ) { + yyerror("Error: expecting a non-zero +integer exponent"); + YYERROR; + } + } + + +which translates to: + + ($yyn == 10) && do { + $yyval = $yyvsa[-1]; + $yyval->{'order'} = 1; + last SWITCH; + }; + + ($yyn == 11) && do { + $yyval = $yyvsa[-3]; + $yyval->{'order'} = $yylsa[-1]->{'text'} + $yyval->{'scale'} = $yyval->{'scale'} ** $yyval->{'order'}; + if ( $yyval->{'order'} == 0 ) { + yyerror("Error: expecting a non-zero integer +exponent"); + goto yyerrlab1 ; + } + last SWITCH; + }; + +In c, you will have a bit more complicated pointer arithmetic to adress +the stack, but the usage of objects will be the same. Note here that it +is convenient to keep all information about the token in its location +info, (yylsa, yylsp, yylval, @n), while everything relating to the value +of the expression, or to the parse tree, is better placed in the +semantic stack (yyssa, yyssp, yysval, $n). Also note that in some cases +you can do semantic checks inside rules and report useful messages +before or instead of invoking yyerror(); + +Finally, it is useful to make the following wrapper function around +external yylex() in order to maintain your own token stack. Unlike the +parser's internal stack which is only as deep as the rule being reduced, +this one can hold all tokens recognized during the current run, and that +can be extremely helpful for error reporting and any transformations you +may need. In this way, you can even scan (tokenize) the whole buffer +before handing it off to the parser (who knows, you may need a token +ahead of what is currently seen by the parser): + + +sub tokenize { + undef @tokenTable; + my ($tok, $text, $name, $unit, $first_line, $first_column, +$last_line, $last_column); + + while ( ($tok = &UnitLex::yylex()) > 0 ) { # this is where the +c-coded yylex is called, + # UnitLex is the perl +extension encapsulating it + ( $text, $name, $unit, $first_line, $first_column, $last_line, +$last_column ) = &UnitLex::getyylval; + push(@tokenTable, + Unit::yyltype->new ( + 'token' => $tok, + 'text' => $text, + 'name' => $name, + 'unit' => $unit, + 'first_line' => $first_line, + 'first_column' => $first_column, + 'last_line' => $last_line, + 'last_column' => $last_column, + ) + ) + } + +} + + +It is now a lot easier to handle various state-related problems, such as +backtracking and error reporting. The yylex() function as seen by the +parser might be constructed somewhat like this: + +sub yylex { + $yylloc = $tokenTable[$tokenNo]; # $tokenNo is a global; now +instead of a "file pointer", + # as in the first example, we have +a "token pointer" + undef $yylval; + + + # disregard this; name this block "computing semantic values" + if ( $yylloc->{'token'} == UNIT) { + $yylval = Unit::Operand->new( + 'unit' => Unit::Dict::unit($yylloc->{'unit'}), + 'base' => Unit::Dict::base($yylloc->{'unit'}), + 'scale' => Unit::Dict::scale($yylloc->{'unit'}), + 'scaleToBase' => Unit::Dict::scaleToBase($yylloc->{'unit'}), + 'loc' => $yylloc, + ); + } + elsif ( ($yylloc->{'token'} == INTEGER ) || ($yylloc->{'token'} == +POSITIVE_NUMBER) ) { + $yylval = Unit::Operand->new( + 'unit' => '1', + 'base' => '1', + 'scale' => 1, + 'scaleToBase' => 1, + 'loc' => $yylloc, + ); + } + + $tokenNo++; + return(%{$yylloc}->{'token'}); # This is all the parser needs to +know about this token. + # But we already made sure we saved +everything we need to know. +} + + +Now the most interesting part, the error reporting routine: + + +sub yyerror { + my ($str) = @_; + my ($message, $start, $end, $loc); + + $loc = $tokenTable[$tokenNo-1]; # This is the same as to say, + # "obtain the location info for the +current token" + + # You may use this routine for your own purposes or let parser use +it + if( $str ne 'parse error' ) { + $message = "$str instead of `" . $loc->{'name'} . "' <" . +$loc->{'text'} . ">, at line " . $loc->{'first_line'} . ":\n\ +n"; + } + else { + $message = "unexpected token `" . $loc->{'name'} . "' <" . +$loc->{'text'} . ">, at line " . loc->{'first_line'} . ":\n +\n"; + } + + $message .= $parseBuffer . "\n"; # that's the original string that +was used to set the parser buffer + + $message .= ( ' ' x ($loc->{'first_column'} + 1) ) . ( '^' x +length($loc->{'text'}) ). "\n"; + if( $str ne 'parse error' ) { + print STDERR "$str instead of `", $loc->{'name'}, "' {", +$loc->{'text'}, "}, at line ", $loc->{'first_line'}, ":\n\n"; + } + else { + print STDERR "unexpected token `", $loc->{'name'}, "' {", +$loc->{'text'}, "}, at line ", $loc->{'first_line'}, ":\n\n"; + } + + print STDERR "$parseBuffer\n"; + print STDERR ' ' x ($loc->{'first_column'} + 1), '^' x +length($loc->{'text'}), "\n"; +} + +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Scanners used in these examples assume there is a single line of text on +the input (the first_line and last_line elements of yylloc are simply +ignored). If you want to be able to parse multi-line buffers, just add a +lex rule for '\n' that will increment the line count and reset the pos +variable to zero. + + +Ugly as it may seem, I find this approach extremely liberating. If the +grammar becomes too complicated for a LALR(1) parser, I can cascade +multiple parsers. The token table can then be used to reassemble parts +of original expression for subordinate parsers, preserving the location +info all the way down, so that subordinate parsers can report their +problems consistently. You probably don't need this, as SQL is very well +thought of and has parsable grammar. But it may be of some help, for +error reporting. + + +--Gene + |