                        C P P - T E S T . T X T
/*
 * Document of
 * "Validation Suite for Standard C Conformance of Preprocessing"
 */
                Kiyoshi Matsui      kmatsui@t3.rim.or.jp

V.1.0   1998/08     First released.
                                                                kmatsui
V.1.1   1998/09     Updated according to C99 draft in 1998/08.
                                                                kmatsui
V.1.2   1998/11     Updated according to C++98 Standard.
                                                                kmatsui
V.1.3 prerelease 1      2002/08     Updated according to C99 Standard.
                                                                kmatsui
V.1.3 prerelease 2      2002/12     Slightly modified.
                                                                kmatsui
V.1.3 release   2003/02     Finally released.
                    Port to GCC / testsuite.
                                                                kmatsui
V.1.3 patch 1   2003/03     Made the testsuite edition applicable to
                        GCC as old as 2.9x.
                                                                kmatsui
V.1.4 prerelease    2003/11     Added Visual C++ evaluation.
                                                                kmatsui
V.1.4 release   2004/02     Added tests of various multi-byte character
                        encodings.
                                                                kmatsui
V.1.4.1     2004/03     Revised the section 2.4.26 (on recursive macro)
                                                                kmatsui
V.1.5       2005/03     Moved tests of multi-byte character encoding to
                    quality matters.
                Changed point allocation of the test items.
                Added a few testcases for macro expansion.
                Updated a few testsuite testcases to cope with GCC
                    3.3 and 3.4.
                Removed test results on older preprocessors (DJGPP,
                    compiler systems on MS-DOS except Borland C 4.0).
                                                                kmatsui
V.1.5.1     2006/07     Updated the test results of some preprocessors.
                                                                kmatsui


0.  Standard C and Validation Suite

1.  Standard C Preprocessing Features
    [1.1]       K&R 1st and Standard C Preprocessing
    [1.2]       Translation Phases
        [1.2.1]     Line Splicing Before Tokenization
    [1.3]       Preprocessing Token
        [1.3.1]     No Keyword
        [1.3.2]     Preprocessing-number
        [1.3.3]     Token-based Operations and Token Concatenation
    [1.4]       Evaluation Type of #if Expression
    [1.5]       Portable Preprocessor
    [1.6]       Function-like Macro Expansion
        [1.6.1]     Analogous to Function Call
        [1.6.2]     Argument Expansion Before Substitution
            [1.6.2.1]   No Expansion for the Operand of the # or ##
                                Operator
        [1.6.3]     Rescanning
        [1.6.4]     Prevention of Recursive Macro Expansion
    [1.7]       Issues
        [1.7.1]     Header Name in the <stdio.h> Format
        [1.7.2]     # Operator Specification with Legacy from
                            Character-based Preprocessing
        [1.7.3]     White Space Handling at Macro Re-definition
        [1.7.4]     Parameter Name at Function-like Macro Re-definition
        [1.7.5]     Unpredictable Evaluation of Character Constant in
                            #if Expression
        [1.7.6]     Non-Function-like Rescanning of Function-like Macro
        [1.7.7]     C90 Corrigendum 1 and Amendment 1
        [1.7.8]     Redundant Specifications
    [1.8]       Preprocessing Specification in C99
    [1.9]       Toward Clear Preprocessing Specifications

2.  Validation Suite Explanation
    [2.1]       Validation Suite for Conformance of Preprocessing
    [2.2]       Testing Method
        [2.2.1]     Manual Testing
        [2.2.2]     Automatic Testing by cpp_test
        [2.2.3]     Automatic Testing by GCC / testsuite
            [2.2.3.1]   TestSuite
            [2.2.3.2]   Installation to TestSuite and Testing
            [2.2.3.3]   MCPP Automatic Testing
            [2.2.3.4]   TestSuite and Validation Suite
    [2.3]       Violations of syntax rules or Constraints and Diagnostic
                        Messages
    [2.4]       Details
        [2.4.1]     Trigraphs
        [2.4.2]     Line Splicing by <backslash><newline>
        [2.4.3]     Comments
        [2.4.4]     Special Tokens (digraphs) and Characters (UCN)
        [2.4.5]     Spaces and Tabs on a Preprocessing Directive Line
        [2.4.6]     #include
        [2.4.7]     #line
        [2.4.8]     #error
        [2.4.9]     #pragma, _Pragma() operator
        [2.4.10]    #if, #elif, #else, and #endif
        [2.4.11]    #if defined
        [2.4.12]    #if Expression Type
        [2.4.13]    #if Expression Evaluation
        [2.4.14]    #if Expression Error
        [2.4.15]    #ifdef and #ifndef
        [2.4.16]    #else and #endif Errors
        [2.4.17]    #if, #elif, #else, and #endif Miss-matching Errors
        [2.4.18]    #define
        [2.4.19]    Macro Re-definition
        [2.4.20]    Macro Names Same as Keywords
        [2.4.21]    Macro Expansion Requiring Pp-token Separation
        [2.4.22]    Macro-like Sequence in a Pp-number
        [2.4.23]    Macros Using the ## Operator
        [2.4.24]    Macros Using the # Operator
        [2.4.25]    Macro Expansion in a Macro Argument
        [2.4.26]    Macros of a Same Name during Macro Rescanning
        [2.4.27]    Macro Rescanning
        [2.4.28]    Predefined Macros
        [2.4.29]    #undef
        [2.4.30]    Macro Calls
        [2.4.31]    Macro Call Error
        [2.4.32]    Character Constant in #if Expression
        [2.4.33]    Wide Character Constant in #if Expression
        [2.4.35]    Multi-Character Character Constant in #if Expression
        [2.4.37]    Translation limits
    [2.5]       Documentation of Implementation-Defined Items

3.  Evaluation of Aspects Unspecified by Standard
    [3.1]       Multi-byte Character Encoding
    [3.2]       Undefined Behavior
    [3.3]       Unspecified Behavior
    [3.4]       Other Cases Where a Warning is Preferable
    [3.5]       Other Quality Matters
        [3.5.1]     Qualities regarding Behaviors
        [3.5.2]     Options and Extended Functionalities
        [3.5.3]     Efficiency and others
        [3.5.4]     Quality of Documents
    [3.6]       C++ Preprocessing

4.  Issues Around C Preprocessing
    [4.1]       Standard Header Files
        [4.1.1]     General Rules
        [4.1.2]     <assert.h>
        [4.1.3]     <limits.h>
        [4.1.4]     <iso646.h>

5.  Preprocessor Test Results
    [5.1]       Preprocessors Tested
    [5.2]       Lists of Marks
    [5.3]       Characteristics of Each Preprocessor
    [5.4]       Overall Review
    [5.5]       Test Reports and Comments


0.  Standard C and Validation Suite

I completely rewrote DECUS cpp by Martin Minow to create a portable C
preprocessor called MCPP V.2.  MCPP stands for 'Matsui cpp'.  This
preprocessor is provided as source code and can be ported to various
compiler systems by modifying some macros in the header files at compile
time.  In addition, the executable has various behavioral modes, such as
the Standard C (ISO/ANSI/JIS C) mode.  Among those modes, the Standard C
mode implements Standard C preprocessing strictly.

While implementing this preprocessor, I also created a testing tool
called "Validation Suite for Standard C Conformance of Preprocessing".
This document explains the Validation Suite.  The Validation Suite is
open to the public as free software along with this documentation.

The Validation Suite became available to the public on NIFTY SERVE/FC/
LIB 2 in August, 1998, and also later on http://www.vector.co.jp/pack.
It did not have a version number, however, and so is assumed to be
version 1.0.

V.1.1 supports the C99 draft of August, 1998 and is an update to V.1.0.
V.1.1 was also made public on NIFTY SERVE/FC/LIB 2 and vector/software
pack in September, 1998.

V.1.2 supports the official C++ Standard release and is a small update
to V.1.1.  It also became available on NIFTY SERVE/FC/LIB 2 and vector/
software pack in November, 1998.

V.1.3 supports the official C99 release and is an update to V.1.2.  In
addition, behavioral test samples were rewritten so that they can be
used by the GCC / testsuite.

While V.1.3 was under development, it was adopted as a 2002 "Exploratory
Software Project" of the Information-technology Promotion Agency, Japan
(IPA) by Project Manager Yutaka Niibe.  From July, 2002 to February,
2003, the development continued with a grant-in-aid from IPA and under
the advice of PM Niibe.  The English version of the document was created
under my supervision, with the translation work outsourced to Highwell,
Inc.  In February, 2003, MCPP V.2.3 and Validation Suite V.1.3 were
released on m17n.org.

In addition, MCPP and the Validation Suite were adopted as a 2003
"Exploratory Software Project" by Project Manager Hiroshi Ichiji.  This
allowed an update to V.2.4 and V.1.4. *

MCPP and the Validation Suite have continued to be updated since the
project ended.  V.2.5 and V.1.5 were released in March, 2005.
Validation Suite V.1.5 changed the allocation of points and some other
matters.  In July 2006, MCPP V.2.6 and Validation Suite V.1.5.1 were
released.  Validation Suite V.1.5.1 updated the test results of the
preprocessors.

ISO/IEC 9899:1990 (JIS X 3010-1993) had been used as the C Standard, but
in 1999, ISO/IEC 9899:1999 was adopted as the new Standard.  This
document calls the former C90 and the latter C99.  The former is also
commonly called ANSI C or C89 because it derives from ANSI X3.159-1989.
ISO/IEC 9899:1990 plus its Amendment 1 (1995) is sometimes called C95.
The C++ Standards are ISO/IEC 14882:1998 and its corrected version
ISO/IEC 14882:2003.  This document calls both of them C++98.

The Standards referred to in this explanation are listed below.

  C90:
    ANSI X3.159-1989        (ANSI, New York, 1989)
    ISO/IEC 9899:1990(E)    (ISO/IEC, Switzerland, 1990)
        ibid.   Technical Corrigendum 1     (ibid., 1994)
        ibid.   Amendment 1: C Integrity    (ibid., 1995)
        ibid.   Technical Corrigendum 2     (ibid., 1996)
    JIS X 3010-1993         (JIS Handbook 59-1994, Tokyo, 1994, Japanese
            Standards Association)
  C99:
    ISO/IEC 9899:1999(E)
        ibid.   Technical Corrigendum 1 (2001)
        ibid.   Technical Corrigendum 2 (2004)
  C++:
    ISO/IEC 14882:1998(E)

ANSI X3.159 contained "Rationale."  It was not adopted by ISO C90 for
some reason, but reappeared in ISO C99.  This "Rationale" is also
referred to from time to time.

PDF versions of C99 and C++ Standards can be obtained online on the
following sites.

    C99, C++98, C++03
        http://webstore.ansi.org/ansidocstore/default.asp
    C99 Corrigendum 1
        http://ftp2.ansi.org/download/
            free_download.asp?document=ISO%2FIEC+9899%2FCor1%3A2001
    C99 Rationale final draft in October, 1999
        http://www.open-std.org/jtc1/sc22/wg14/www/docs/n897.pdf

  * The overview of the Exploratory Software Project can be found below
    (in Japanese only).

        http://www.ipa.go.jp/jinzai/esp/

    MCPP from V.2.3 through V.2.5 had been located at:

        http://www.m17n.org/mcpp/

    In April 2006, MCPP project moved to:

        http://mcpp.sourceforge.net/

    MCPP V.2.2 and Validation Suite V.1.2 are located at the following
    Vector web site.  They are in the directory called dos/prog/c, but
    they are not exclusively for MS-DOS.  The sources are for UNIX,
    WIN32, and MS-DOS.

        http://download.vector.co.jp/pack/dos/prog/c/cpp22src.lzh
        http://download.vector.co.jp/pack/dos/prog/c/cpp22bin.lzh
        http://download.vector.co.jp/pack/dos/prog/c/cpp12tst.lzh

        http://download.vector.co.jp/
    and
        ftp://ftp.vector.co.jp/
    seem to be the same.

    The text files in these archive files available at Vector use [CR]+
    [LF] as a <newline> and encode Kanji in shift-JIS for DOS/Windows.
    On the other hand, those from V.2.3 through V.2.5 available at
    SourceForge use [LF] as a <newline> and encode Kanji in EUC-JP for
    UNIX.  From V.2.6 on, two types of archive are provided: a .tar.gz
    file with [LF]/EUC-JP and a .zip file with [CR]+[LF]/shift-JIS.


1.  Standard C Preprocessing Features

Before explaining the Validation Suite, I will describe the overall
characteristics of Standard C (ANSI/ISO/JIS C) preprocessing.  This is
not a textbook-style description; it is intended to point out the
concepts and issues of Standard C by comparing it with K&R 1st.

In this explanation, I will concentrate first on the differences between
K&R 1st and C90, then on those between C90 and C99, and finally on those
between C90 and C++.  C99 is now available as a Standard, but few actual
compiler systems have implemented much of it yet.  Therefore, it is more
realistic to center the discussion on C90.

For samples of the points made in this chapter, please refer to the
Validation Suite, which is itself a collection of samples.


    [1.1]       K&R 1st and Standard C Preprocessing

There were endless varieties of dialects among pre-Standard C language
implementations.  Above all, there was almost no standard for
preprocessing.  The reason was that the preprocessing specification in
the 1st edition of "The C Programming Language" by Kernighan & Ritchie,
which served as the reference, was too simplistic and ambiguous.  In
addition, it seems that preprocessing was regarded as a mere add-on to
the language proper.  Nevertheless, each implementation added many
features to preprocessing after K&R 1st.  Some made up for flaws in the
language proper while others tried to maintain portability among
different implementations.  In any case, there were too many differences
among implementations, and preprocessing was nowhere close to being
portable.

Standard C provided a clear specification for preprocessing, which had
been a cause of confusion for many years.  Some of the newly added
features are well known.  What is more important, however, is that
Standard C provides virtually the first overall specification of
preprocessing.  The basic question, "what is preprocessing?", which had
been vague until then, is reflected everywhere in this specification.
Preprocessing in Standard C is not just K&R 1st + alpha.  In order to
understand this, I believe we need to grasp clearly not only the "new
features" but also these basics.  Unfortunately, however, the basics of
preprocessing are not summarized in the body of the Standard and are
only mentioned briefly in the "Rationale", a commentary on the Standard.
Even more unfortunately, the specification contains incoherent parts
which seem to be the results of compromises with conventional
preprocessing.  Therefore, I will summarize the basic characteristics of
Standard C preprocessing and examine their issues.

The characteristics that differ from pre-Standard preprocessing, or that
are newly defined, can be summarized in the following four points.

1.  Preprocessing is literally "pre-processing", independent of the
implementation-specific parts of the language proper (the execution
environment, so to speak).  It is extremely rare for preprocessing to
produce a different result depending on the implementation.  This allows
us to write portable source code for a preprocessor itself.  Also, only
one preprocessor executable program suffices for each OS.

2.  The translation phase specification clearly defines the procedure
for tokenizing source.  A token is handled in an interim form, as a
preprocessing token, until preprocessing completes.  Specifying
preprocessing tokens separately from tokens proper helps make
preprocessing independent of the implementation-dependent parts of the
language proper.

3.  Preprocessing operates on preprocessing tokens and is token-oriented
in principle.  In contrast, pre-Standard preprocessing was supposedly
token-oriented but, for historical reasons, contained character-oriented
processing to no small extent; it was neither one thing nor the other.

4.  Function-like macro expansion is modeled after a function call in
order to organize the grammar.  Function-like macro calls can be used
anywhere function calls can.  A macro call in an argument is handled
like a function call in a function argument: the macro in the argument
is completely expanded first, and the result is then substituted for the
parameter in the replacement list.  In this expansion, the macro call in
the argument must be completed within the argument.

These principles are examined below in turn.


    [1.2]       Translation Phases

No preprocessing procedure was described in K&R 1st, which caused much
confusion.  Standard C specifies and defines translation phases.  They
can be summarized as follows.  *1

1.  Map the characters in source files to the source character set if
necessary.  Replace trigraphs. *2

2.  Delete <backslash><newline>.  By doing so, splice physical lines to
form logical lines.

3.  Decompose source files into preprocessing tokens and white spaces.
Comments are replaced by one space character.  <newline> is retained.

4.  Execute preprocessing directives and expand macro calls.  If a
#include directive exists, the specified files are processed from phase
1 to phase 4 recursively.

5.  Convert characters in the source character set into the execution
character set.  Convert escape sequences in character constants and
string literals similarly.

6.  Concatenate adjacent string literals and concatenate adjacent wide
character string literals.

7.  Convert preprocessing tokens into tokens and compile.

8.  Link.

Needless to say, these steps do not actually have to be separate phases
as long as the processing leads to the same result.
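
As a minimal sketch of phases 3 and 6 (it is not taken from the
Validation Suite; the names STR, comment_as_space, joined and
check_phases are made up for this illustration), the following should
hold on a conforming implementation:

```c
#include <assert.h>
#include <string.h>

#define STR(x)  #x      /* stringize the argument as written */

/* Phase 3: the comment is replaced by one space, so a and b remain two
 * pp-tokens separated by white space and are stringized as "a b". */
static const char *comment_as_space = STR(a/**/b);

/* Phase 6: adjacent string literals are concatenated into one. */
static const char joined[] = "abc" "def";

void check_phases(void)
{
    assert(strcmp(comment_as_space, "a b") == 0);
    assert(sizeof joined == 7);                 /* "abcdef" plus '\0' */
    assert(strcmp(joined, "abcdef") == 0);
}
```

A preprocessor that replaced the comment with nothing would produce "ab"
for the first string instead.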

Among these phases, phases 1 to 4, or 1 to 6, belong to the range of
preprocessing.  When a preprocessor is an independent program that
outputs the preprocessing result as an intermediate file, it usually
handles up to phase 4, since token separators such as <newline> need to
be retained (if escape sequences such as \n were converted in phase 5,
they could not be distinguished from <newline> and other token
separators).  This Validation Suite also tests up to phase 4. *3

  *1  ANSI C 2.1.1.2 (C90 5.1.1.2) Translation phases
      C99 5.1.1.2 Translation phases
    C99 added _Pragma() operator processing at phase 4.  Some words have
    been added to the description, but the meaning is essentially
    unchanged.

  *2  In the C99 draft of November, 1997, a multi-byte character not
    included in the basic source character set was specified to be
    converted, before trigraph replacement, into a Unicode hexadecimal
    sequence of the \Uxxxxxxxx format, that is, a universal character
    name.  This sequence was to be re-converted into the execution
    character set in phase 5.  This is similar to the C++ Standard.
    This specification was vague and, furthermore, placed a heavy load
    on implementations.  Fortunately, this processing at phase 1 was
    deleted in the draft of January, 1999, and it remained so in the
    official version of C99.

  *3  When the SJIS_IS_ESCAPE_FREE macro is defined as FALSE, MCPP
    inserts an extra 0x5c after completing phase 4 (or 6) if the second
    byte of a shift-JIS Kanji character in a string literal or character
    constant is 0x5c (\), for the benefit of a compiler proper which
    does not recognize shift-JIS Kanji characters.  The same applies to
    encodings such as Big5 and ISO-2022-JP.  Refer to [3.1].
    This is a feature for a compiler proper whose phase 5 and 6
    processing is insufficient.  Performing such processing in earlier
    phases would cause wrong results, but doing so after completing
    phase 4 does not.
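
The idea in note *3 can be sketched as follows.  This is not MCPP's
source code; the function names and the lead-byte range are assumptions
made only for this illustration, and the sketch handles shift-JIS only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* First byte of a two-byte shift-JIS character.  The upper bound 0xFC
 * includes vendor extensions; this range is an assumption of the sketch. */
static int is_sjis_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Copy src to dst, adding an extra 0x5c after any 0x5c that is the second
 * byte of a two-byte character.  Room for 2 * strlen(src) + 1 bytes in
 * dst is always enough.  Returns the length of the result. */
size_t sjis_double_trail_backslash(const char *src, char *dst)
{
    size_t n = 0;

    while (*src != '\0') {
        unsigned char c = (unsigned char)*src;

        dst[n++] = *src++;                  /* single byte or lead byte */
        if (is_sjis_lead(c) && *src != '\0') {
            dst[n++] = *src;                /* second byte */
            if ((unsigned char)*src == 0x5C)
                dst[n++] = 0x5C;            /* the extra 0x5c */
            src++;
        }
    }
    dst[n] = '\0';
    return n;
}

void check_sjis(void)
{
    char out[16];
    /* 0x95 0x5c is one Kanji character; "A" is a separate literal so
     * that it is not read as part of the hex escape. */
    size_t n = sjis_double_trail_backslash("\x95\x5c" "A", out);

    assert(n == 4);
    assert(memcmp(out, "\x95\x5c\x5c" "A", 5) == 0);
}
```

A compiler proper that reads the result byte by byte then sees the
doubled 0x5c as an escaped backslash and restores the original second
byte.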


        [1.2.1]     Line Splicing Before Tokenization

K&R 1st described only the following 2 cases of line splicing by
<backslash><newline>; no others were defined.

    1. In the middle of a long #define line
    2. In the middle of a long string literal

Standard C clearly defines that <backslash><newline> is deleted in phase
2 before decomposing lines into preprocessing tokens and token
separators in phase 3.  This allows any lines or any tokens to be
spliced.

Also, as trigraph processing is done in phase 1, the ??/<newline>
sequence is deleted in the same way.  On the other hand, in an
implementation with ASCII as its basic character set and Shift-JIS Kanji
as its multi-byte character encoding, a 0x5c code in the second byte of
a Kanji character is not a <backslash>, since one Kanji character is one
multi-byte character.

It is good that the translation phases became clear, but I wonder
whether it is necessary to support line splicing in the middle of a
token.  In K&R 1st, this was the only way to write a string literal too
long to fit on one line of the screen.  In Standard C, however, adjacent
string literals are concatenated, so there is no need to start a new
line in the middle of a token on purpose.  Line splicing is required
only when you write a long control line.  If that were the only issue,
it would have been better to reverse phases 2 and 3.

Still, the current specification seems to have been adopted for backward
compatibility, so that source which relies on the K&R 1st way of
continuing string literals by line splicing can be processed.  For new
source this ability is almost meaningless in practice; nevertheless, the
specification is probably appropriate since it is simple,
comprehensible, and the easiest to implement.
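
The splicing rule discussed in this section can be sketched as below.
The names SPLICED_VALUE, spliced, adjacent and check_splicing are made
up for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Phase 2 joins the next three physical lines before phase 3 tokenizes,
 * so this defines SPLICED_VALUE as 1234.  The continuation lines must
 * start in column 1, or the leading white space would split the tokens. */
#define SPLICED_VA\
LUE 12\
34

/* K&R 1st style: a string literal continued by line splicing. */
static const char spliced[] = "one \
two";

/* Standard C style: adjacent string literals, no splicing needed. */
static const char adjacent[] = "one "
                               "two";

void check_splicing(void)
{
    assert(SPLICED_VALUE == 1234);
    assert(strcmp(spliced, "one two") == 0);
    assert(strcmp(spliced, adjacent) == 0);
}
```

Splitting an identifier or a number as in the first definition is legal
but serves no purpose in new source; the second string form is the one
to prefer.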


    [1.3]       Preprocessing Token

The concept of the preprocessing token (pp-token, for short) was also
introduced for the first time in Standard C.  Since it does not seem to
be widely known, however, I will first summarize its content.  The
following are specified as pp-tokens. *

    header-name
    identifier
    preprocessing-number
    character-constant
    string-literal
    operator
    punctuator
    each non-white-space character that cannot be one of the above

These look like nothing special at a casual glance, but they are quite
different from tokens proper.  Tokens are as follows.

    keyword
    identifier
    constant (floating-constant, integer-constant, enumeration-constant,
        character-constant)
    string-literal
    operator
    punctuator

Pp-tokens differ from tokens in the following points.

1. There are no keywords.  A name identical to a keyword is handled as
an identifier.
2. Among constants, only character-constant is common to both.  There
are no floating-constants, integer-constants, or enumeration-constants;
preprocessing-numbers take the place of floating-constants and
integer-constants.
3. Header-name exists only as a pp-token.
4. Operators and punctuators are almost the same, but the # and ##
operators and the # punctuator exist only as pp-tokens.  (All are valid
only in preprocessing directive lines.)

Surprisingly, only string-literals and character-constants are exactly
the same.  The most important differences are the absence of keywords
and the existence of preprocessing-numbers instead of numeric tokens.
We will discuss these 2 items further.

  *  ANSI C 3.1 (C90 6.1) Lexical elements
     C99 6.4 Lexical elements
    In C99, operators were absorbed into punctuators, both as pp-tokens
    and as tokens.  "Operator" became merely the name of a function
    performed by a token, not a token type.  The same punctuator token
    (punctuator pp-token) may function as a punctuator or as an operator
    depending on the context.  Also, _Pragma was added as a pp-token
    operator.


        [1.3.1]     No Keyword

A keyword is recognized for the first time in phase 7.  In the
preprocessing phases, a keyword is handled as an identifier.  For
preprocessing, an identifier is either a macro name or an identifier
which has not been defined as a macro.  That means that even a macro
with the same name as a keyword can be used. *1

This specification is indispensable in order to separate preprocessing
from implementation dependencies.  As a consequence, for example, a cast
or sizeof cannot be used in #if expressions. *2
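
A minimal sketch of this point follows.  The macro name
LONG_IS_A_MACRO_HERE and the other names are made up for this
illustration, and the keyword is redefined only after the standard
headers have been included so that they are not affected.

```c
#include <assert.h>
#include <string.h>

#define STR(x)   #x
#define XSTR(x)  STR(x)     /* expand the argument, then stringize it */

/* To the preprocessor `long` is an ordinary identifier.  It is not a
 * macro at this point, so in #if it is replaced by 0 like any other
 * undefined identifier. */
#if long
#error "not reached: an identifier that is not a macro evaluates to 0"
#endif

/* A macro with the same name as a keyword is accepted. */
#define long  LONG_IS_A_MACRO_HERE
static const char *as_macro = XSTR(long);
#undef long
static const char *as_identifier = XSTR(long);

void check_keyword_macro(void)
{
    assert(strcmp(as_macro, "LONG_IS_A_MACRO_HERE") == 0);
    assert(strcmp(as_identifier, "long") == 0);
}
```

That such a macro is accepted does not make it advisable; the sketch
only shows that the preprocessor has no notion of keywords.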

  *1  To be more precise, a parameter name in a macro definition is also
    an identifier.  In addition, a preprocessing directive name is a
    special identifier and has a similar characteristic to a keyword.
    Whether this is a directive, however, is judged by syntax.  If it is
    not in a valid place for directives, it is simply an identifier,
    which may be subject to macro expansion.

  *2  Refer to [2.4.14.7] and [2.4.14.8].


        [1.3.2]     Preprocessing-number

Preprocessing-numbers (pp-number, for short) are specified as below. *1,
*2

    digit
    .digit
    pp-number digit
    pp-number nondigit
    pp-number e sign
    pp-number E sign
    pp-number .

Nondigits are letters and underscores.

In summary:

  1. A pp-number starts with a digit or with a period followed by a
    digit.
  2. The rest is a sequence of letters, underscores, digits, periods,
    and e+, e-, E+, or E- in any order.

Pp-numbers include all floating-constants and integer-constants, and
even non-numeric sequences such as 3E+xy.  The pp-number was adopted to
make preprocessing simple; it is considered helpful for tokenizing this
type of sequence before any semantic interpretation. *3

It is true that tokenization becomes simple; however, a non-numeric
pp-number is not a valid token.  Therefore, it must disappear before
preprocessing completes.  Deliberate use of non-numeric pp-numbers in
source is highly unlikely.  The only case I can think of is one where a
numeric pp-number and another pp-token are concatenated into a
non-numeric pp-number by a macro defined with the ## operator, and the
result is then stringized by a macro defined with the # operator.  Any
pp-token becomes a valid token once it is put in a string literal.
However, unless the existence of non-numeric pp-numbers is accepted, the
one generated by concatenation would not be a valid pp-token (and the
result would be undefined).

Although this type of usage is extremely special and need not be
examined in detail, pp-numbers provide an interesting subject in terms
of token-oriented preprocessing.
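
The behavior of pp-numbers described above can be sketched as follows.
The macro xy and the other names are made up for this illustration.

```c
#include <assert.h>
#include <string.h>

#define STR(x)      #x
#define XSTR(x)     STR(x)      /* expand the argument, then stringize */
#define GLUE(a, b)  a ## b

#define xy  1   /* hypothetical macro whose name is the tail of 3E+xy */

/* 3, + and xy are three pp-tokens here, so xy is expanded: 3 + 1. */
static const int sum = 3+xy;

/* 3E+xy is a single pp-number, so the xy inside it is not a macro call. */
static const char *ppnum = XSTR(3E+xy);

/* 12 ## ab yields the non-numeric pp-number 12ab, which becomes a valid
 * token only because it is stringized. */
static const char *glued = XSTR(GLUE(12, ab));

void check_pp_number(void)
{
    assert(sum == 4);
    assert(strcmp(ppnum, "3E+xy") == 0);
    assert(strcmp(glued, "12ab") == 0);
}
```

If 3E+xy or 12ab appeared outside a string literal, the compiler proper
would reject it in phase 7, since neither is a valid numeric constant.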

  *1  ANSI C 3.1.8 (C90 6.1.8) Preprocessing numbers
      C99 6.4.8 Preprocessing numbers

  *2  In C99, the pp-number p sign and pp-number P sign sequences were
    added to enable hexadecimal notation for floating-point numbers.  In
    addition, the nondigit above was replaced by identifier-nondigit.
    This change accompanies the approval of UCNs (universal character
    names) and implementation-defined multi-byte characters in
    identifiers (refer to [1.8]).  In other words, a UCN can be used in
    a pp-number, and so can multi-byte characters if the implementation
    supports them in identifiers.  A numeric token contains no UCN or
    multi-byte character, but such pp-numbers are allowed for the case
    of concatenation and stringizing by the ## and # operators.

  *3  C89 Rationale 3.1.8 Preprocessing numbers
      C99 Rationale 6.4.8 Preprocessing numbers


        [1.3.3]     Token-based Operations and Token Concatenation

Standard C acquired a pp-token concatenation capability with the ##
binary operator in a macro definition.  This is known as a "new feature"
of Standard C.  It was, however, introduced to replace a hack rather
than as a truly new feature.  What I would like to draw attention to
here is that it is essential for token-oriented preprocessing.

The traditional token concatenation method, known from the so-called
"Reiser type cpp", relied on the behavior of replacing a comment with
zero characters.  In other cases, token concatenation occurred
unintentionally in preprocessors with character-oriented operations, and
there were hacks taking advantage of it.  I can say that all of these
took advantage of flaws in character-oriented preprocessing.

On the other hand, Standard C allows explicit token concatenation by
token-oriented operations.  A source file is decomposed into sequences
of pp-tokens and white spaces in translation phase 3.  The only cases in
which pp-tokens are combined afterwards are concatenation by the ##
operator, stringizing by the # operator, header-name construction, and
the concatenation of adjacent string literals or adjacent wide character
string literals.  The handling of non-numeric pp-numbers becomes clear
when their existence is considered in this context.  That is to say,
Standard C tokenization follows the principles below.

Pp-tokens are not concatenated implicitly.  Concatenation must be done
explicitly by ## operators.  Pp-tokens concatenated once will never be
separated again.

In pre-Standard character-oriented preprocessing, the token sequence
resulting from a macro expansion was sometimes unintentionally
concatenated with the tokens before and after it.  This must be regarded
as something that must not occur in token-oriented Standard C
preprocessing. *
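
These principles can be sketched with two small macros.  The names NEG,
CAT, value1 and the function names are made up for this illustration.

```c
#include <assert.h>

#define NEG         -
#define CAT(a, b)   a ## b

static const int value1 = 10;

/* The - in the source and the - from NEG remain two pp-tokens, so this
 * is -(-5).  A character-oriented preprocessor could produce --5. */
int implicit_stays_separate(void)
{
    return -NEG 5;
}

/* Explicit concatenation: value ## 1 forms the identifier value1. */
int explicit_concat(void)
{
    return CAT(value, 1);
}

void check_concat(void)
{
    assert(implicit_stays_separate() == 5);
    assert(explicit_concat() == 10);
}
```

A preprocessor that writes its output as text has to keep the two minus
signs apart, for example by putting a space between them.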

  *  Refer to [2.4.21].


    [1.4]       Evaluation Type of #if Expression

In C90, the #if expression is evaluated in a single size of type, either
long or unsigned long.  This also simplified preprocessing and at the
same time helped reduce implementation-dependent parts.  Compared with
int, whose size varies greatly among implementations, long and unsigned
long are 32 bits for the most part and sometimes 64 or 36 bits.  This
assures considerable portability of #if expressions in general. *1, *2,
*3
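
A minimal sketch of what this means for source code follows.  The macro
names are made up for this illustration, and the three conditions are
chosen so that they give the same results under both the C90 and the C99
rules.

```c
#include <assert.h>

/* Evaluated in (at least) long, so this holds even where int is 16 bits
 * and 32767 + 1 would overflow in int arithmetic. */
#if 32767 + 1 > 0
#define IF_USES_LONG 1
#else
#define IF_USES_LONG 0
#endif

/* An unsigned operand makes the operation unsigned long (or wider), so
 * 0u - 1 wraps around to a value of at least 32 bits. */
#if 0u - 1 >= 4294967295u
#define IF_UNSIGNED_WRAPS 1
#else
#define IF_UNSIGNED_WRAPS 0
#endif

/* Without an unsigned operand the arithmetic is signed. */
#if 0 - 1 > 0
#define IF_MINUS_ONE_POSITIVE 1
#else
#define IF_MINUS_ONE_POSITIVE 0
#endif

void check_if_type(void)
{
    assert(IF_USES_LONG == 1);
    assert(IF_UNSIGNED_WRAPS == 1);
    assert(IF_MINUS_ONE_POSITIVE == 0);
}
```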

  *1  ANSI C 3.8.1 (C90 6.8.1) Conditional inclusion -- Semantics

  *2  C99 6.10.1 Conditional inclusion
    In C99, the #if expression type was specified as the maximum integer
    type for the compiler system.  Since C99 requires long long/unsigned
    long long, this would be long long/unsigned long long or wider.
    This, however, reduces the portability of #if expressions.

  *3  There will be more implementations with 64 bit long in the future,
    but I am not sure if that is good...  By the way, I personally
    believe that the integer type size should be defined as below.

    1. short as 2 bytes and long as 4 bytes.
    2. longlong (not long long) or quadra as 8 bytes.
    3. int should be CPU dependent.  In other words, either short, long,
      or longlong.
    4. Stop using short and long as modifiers of int, as in short int
      and long int.

    In other words, ever since the arrival of 64 bit compiler systems,
    the constraint of sizeof (short) <= sizeof (int) <= sizeof (long)
    has tied everything down and left no type portable.  It would be
    better to remove this constraint and to decide types by absolute
    size.


    [1.5]       Portable Preprocessor

The Standard C preprocessing specifications above allow the source code
of a preprocessor itself to be written portably.  That is because the
preprocessor needs to know nothing about the implementation-dependent
parts of the compiler proper.  When you actually try to write a
preprocessor portably for Standard C compiler systems, only peripheral
parts such as the following become problems.

  1. The path-list notation of the OS for #include processing, and the
    location of the standard header files.
  2. The format for passing filename and line number information to the
    compiler proper.
  3. Runtime options.
  4. Character set.
  5. The size of long/unsigned long in C90 or the maximum integer type
    in C99 must not go below that of the target implementation in cross
    implementations.

2 and 3 above matter only when the preprocessor is implemented for an
existing implementation.  I expect that 2 will be standardized on the
#line 123 "filename" format, the same as in a source file.  3 is not
necessary (logically, leaving convenience aside) for Standard C
preprocessing.  4 can be written so that the source code needs no
special implementation for each character set (though implementation
is easier for a basic character set if a table is used in the source
code.)  5 will seldom be a problem in reality, since the integer types
of a host implementation are seldom smaller than those of the target.
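
As a small illustration of item 2, the effect of the #line format can be
observed from C itself.  This is only a sketch: the file name
"original.c" and the function names are made up for the illustration.

```c
#include <assert.h>
#include <string.h>

/* A preprocessor could emit a line like the following so that the
 * compiler proper reports positions in the original source file.
 * "original.c" is a hypothetical name.  */
#line 123 "original.c"
int reported_line(void) { return __LINE__; }    /* this line is 123 */
const char *reported_file(void) { return __FILE__; }

void check_line_info(void)
{
    assert(reported_line() == 123);
    assert(strcmp(reported_file(), "original.c") == 0);
}
```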

Needless to say, MCPP was also written with the motive that these
Standard C preprocessing specifications are independent of the compiler
proper (though it contains many #if sections to assure portability,
since MCPP is intended to be ported to many compiler systems.)


    [1.6]       Function-like Macro Expansion

The expansion of macros with arguments is specified in Standard C on the
model of function calls, and such a macro is called a function-like
macro.  If an argument contains macros, they are in principle expanded
before the argument is substituted for the parameter.

This point was not clear in pre-Standard implementations.  I suspect
that most of them substituted an argument for a parameter without
expanding the macros within it, and expanded them only at rescanning.
Behind this type of expansion there seems to be the idea of repeated
editor-like text replacement.  Repeated text replacement of the kind an
editor performs is fine for expanding macros without arguments, and I
suppose that many preprocessors simply extended it to macros with
arguments.

However, this method leads to strange usages of macros which are totally
different from their function call-like appearance in source.  When
macros with arguments are nested in a call, it becomes unclear which
argument belongs to which macro.  Because of these points,
implementation-specific features increase.  In short, as macro expansion
in C grew to take on advanced features, it became too heavy a load for
repeated editor-like text replacement.


        [1.6.1]     Analogous to Function Call

In consideration of this confusion, Standard C organized the grammar by
positioning function-like macro calls as replacements for function
calls.  The Rationale states the following principles on which the
Standard C specifications of macros are grounded. *

  - Allow macros to be used wherever functions can be.
  - Define macro expansion such that it produces the same token sequence
    whether the macro calls appear in open text, in macro arguments, or
    in macro definitions.

This goes without saying for function calls, but not for macros with
arguments.  It is obviously incompatible with repeated editor-like text
replacement.

  *  C89 Rationale 3.8.3 Macro replacement
     C99 Rationale 6.10.3 Macro replacement


        [1.6.2]     Argument Expansion Before Substitution

What is essential for making macro expansion parallel to function calls
is to expand the macros in an argument before substituting the argument
for the parameter.  In addition, a macro call within an argument must be
completed within that argument (it is an error if it is not.)  A macro
within an argument must not absorb the text that follows the argument.
Thus, nested function-like macro calls keep their logical clarity. *

  *  Refer to [2.4.25].
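
The following minimal sketch shows a nested function-like macro call
whose argument is expanded before substitution.  The macro and function
names are assumptions made for this illustration only.

```c
#include <assert.h>

#define INC(x)  ((x) + 1)
#define SQ(x)   ((x) * (x))

/* INC(2) is a complete macro call inside the argument of SQ, so it is
 * expanded to ((2) + 1) before being substituted for x.  */
int nested_value(void) { return SQ(INC(2)); }

void check_nested(void) { assert(nested_value() == 9); }
```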

            [1.6.2.1]   No Expansion for the Operand of the # or ##
                                Operator

The operand of the # operator, however, is not macro-expanded.  The
operands of the ## operator are not macro-expanded, either.  The
pp-token generated by concatenation becomes subject to macro expansion
at rescanning.  Why do we need this specification?

This specification matters precisely when an argument includes macros.
It is helpful if you want to stringize or concatenate a token sequence
including macros as it is.  Conversely, to expand the macros before
stringizing or concatenation, you wrap the macro in another macro which
does not use the # or ## operator.  For a programmer to be able to
choose either of these, the specification must be that no macro
expansion is performed on the operands of the # and ## operators. *

  *  Refer to [2.4.25.4, 2.4.25.5], misc.t/PART 3 and 5.  A typical
    example where the specification of no macro expansion for the
    operand of the # operator helps is the assert() macro.  Refer to
    [4.1.2].
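
The choice between the two behaviors can be sketched as follows.  All
names here (N, STR, XSTR, CAT, XCAT, valueN, value1) are hypothetical
and chosen only for this illustration.

```c
#include <assert.h>
#include <string.h>

#define N           1
#define STR(x)      #x          /* operand of # is not expanded      */
#define XSTR(x)     STR(x)      /* x is expanded, then passed to STR */
#define CAT(a, b)   a ## b      /* operands of ## are not expanded   */
#define XCAT(a, b)  CAT(a, b)   /* a and b are expanded first        */

int valueN = 5, value1 = 7;

const char *raw_string(void)      { return STR(N); }        /* "N"    */
const char *expanded_string(void) { return XSTR(N); }       /* "1"    */
int raw_paste(void)      { return CAT(value, N); }          /* valueN */
int expanded_paste(void) { return XCAT(value, N); }         /* value1 */

void check_operands(void)
{
    assert(strcmp(raw_string(), "N") == 0);
    assert(strcmp(expanded_string(), "1") == 0);
    assert(raw_paste() == 5);
    assert(expanded_paste() == 7);
}
```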


        [1.6.3]     Rescanning

Now, after a macro call is replaced with its replacement list, and the
arguments of a function-like macro are expanded and substituted for the
parameters in the replacement list, the replacement list is rescanned in
search of more macro calls.

This rescanning has been specified since K&R 1st, and there seems to
have been the "repeated editor-like text replacement" approach in the
background.  In Standard C, however, function-like macro arguments have
already been completely expanded, except for the operands of ##
operators.  What on earth is expanded at rescanning?

Rescanning is necessary for macros in the replacement list other than
parameters, and for macros generated by the ## operator.  Put
differently, what needs rescanning is the so-called cascaded macro,
where macro "definitions" are nested.  When the "arguments" of macro
calls are nested, they usually are not expanded again at rescanning,
since they have been expanded level by level before rescanning (though
there are exceptions.  Refer to [1.7.6] and [2.4.27].)
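
The two cases which need rescanning can be sketched as below.  The
macro names are assumptions introduced for this illustration.

```c
#include <assert.h>

/* Cascaded object-like macros: each replacement list is rescanned.  */
#define FIRST   SECOND
#define SECOND  THIRD
#define THIRD   42

/* The identifier produced by ## is itself a macro name and is expanded
 * when the replacement list is rescanned.  */
#define LIMIT_MAX       100
#define LIMIT(kind)     LIMIT_ ## kind

int cascaded(void) { return FIRST; }        /* SECOND, THIRD, then 42 */
int pasted(void)   { return LIMIT(MAX); }   /* LIMIT_MAX, then 100    */

void check_rescan(void)
{
    assert(cascaded() == 42);
    assert(pasted() == 100);
}
```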


        [1.6.4]     Prevention of Recursive Macro Expansion

Although cascaded macros are expanded one after another, this sometimes
causes a problem: the case where a macro definition itself is recursive.
If it were expanded as is, expansion would fall into infinite recursion.
The same problem occurs not only with direct recursion, where a macro
appears in its own definition, but also with indirect recursion among
two or more definitions.  To avoid this situation, Standard C adds the
specification that "If the name of the macro being replaced is found
during the rescan of the replacement list, it is not replaced."  The
wording is difficult, but its intention is easy to understand.
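
A minimal sketch of both the direct and the indirect case follows.  The
variables and macros (level, ping, pong) are hypothetical names used
only to make the effect visible.

```c
#include <assert.h>

int level = 10;
int ping = 1, pong = 2;

/* Direct recursion: the inner "level" is not replaced again, so it
 * refers to the variable declared above.  */
#define level   (level + 1)

/* Indirect recursion: ping becomes (pong * 10), pong inside it becomes
 * (ping + 5), and the innermost ping is left alone.  */
#define ping    (pong * 10)
#define pong    (ping + 5)

int direct(void)   { return level; }    /* (level + 1)        */
int indirect(void) { return ping; }     /* ((ping + 5) * 10)  */

void check_recursion(void)
{
    assert(direct() == 11);
    assert(indirect() == 60);
}
```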

This is a point where a function-like macro has a different grammar from
a function.  It also differs from editor-like replacement.  Since it is
a macro-specific specification and has been used as a convenient idiom
available only in macros, I think it is appropriate to keep this
specification by clearly defining it.


    [1.7]       Issues

So far I have covered only the good aspects, or the simple and clear
aspects, of the Standard C preprocessing specifications.  Looked at in
more detail, however, there are parts that are irregular, or whose
utility or portability is poor for the implementation overhead they
require.  Most of these remain because the traditional or implicit
pre-Standard preprocessing methods could not be settled.  Such useless
areas are like a vestigial appendix: they confuse the specification and
make implementation troublesome.  Standard C also includes a few parts
which introduced new, unnecessary complications.  These problems are
sorted out below.


        [1.7.1]     Header Name in the <stdio.h> Format

Although header-names enclosed by < and > have been used traditionally
since K&R 1st, they are extremely exceptional and irregular as tokens.
Because of this, Standard C has many undefined and implementation-
defined parts regarding pp-tokens for header-names.  For example, the
behavior is undefined if the sequence /* is included.  Also, where the
sequence is not taken as a header-name, <stdio.h> is divided into
multiple pp-tokens, namely <, stdio, ., h, and >, and on a #include line
these must be combined into one pp-token.  The method for doing so is
implementation-defined.  Tokenization is performed in translation
phase 3.  However, if the line turns out to be a #include directive in
phase 4, tokenization needs to be redone.  I would have to say it is a
very illogical specification.  The processing in the case where a space
exists between the (temporary) pp-tokens once tokenized in phase 3 is
also implementation-defined.  Directives such as #include <stdio.h>
appear to be the most portable, but are poorly portable as far as
preprocessing implementations are concerned.  The irregularity increases
further if the argument of the #include line is a macro.

Header-names enclosed by " and " have no problems like these.  However,
\ is not handled as an escape character in them, just as in header-names
enclosed by < and >, and this is the difference from string literals.
It is not illogical that no escape sequence exists in a header-name,
which is processed in phase 4, since escape sequences are processed in
phase 5 (\ within a header-name is undefined by the Standard.  This must
be a consideration to ease implementation.  In reality, no problem
occurs unless \ comes right before the closing ".  It is a little more
complicated inside of < and >.) *

Also, the only difference between #include <stdio.h> and #include
"stdio.h" is that the former searches specific implementation-defined
locations, while the latter first searches (relative to) the current
directory and then, if the file is not found, the same locations as
<stdio.h> (as Standard C does not assume an OS, it does not use the term
"current directory", but this is how it is interpreted on most operating
systems.)  Therefore, #include <stdio.h> can simply be written as
#include "stdio.h".

Having two kinds of header-name formats does give the readability
advantage that user-defined and system-provided headers can be told
apart at a glance.  But there is no need to go to the trouble of
providing an irregular token for that purpose.  All that is needed is to
distinguish one from the other by different suffixes, as in "stdio.h"
and "usr.H" (I add, just in case, that it does not matter if a system
does not distinguish uppercase and lowercase in filenames, since this is
a readability issue.  Naturally, the names can also be "usr.hdr",
"usr.usr", "usr.u", etc.)

I believe that header-names enclosed by < and > should be abolished,
since they serve no purpose as a language specification and complicate
preprocessing tokenization.  They cannot be abolished out of the blue,
but I would like them to be specified as an obsolescent feature.

  *  UCN starting with \ was introduced by C99, which is a little
    troublesome.
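
The forms of the #include line discussed in this section can be put side
by side in a short sketch.  It assumes that no file named stdio.h exists
in the directory of the source file; STRING_HEADER and header_demo are
names invented for the illustration.

```c
/* Quoted form: falls back to the same search as <stdio.h>.  */
#include "stdio.h"

/* Macro form: the pp-tokens <, string, ., h and > are combined into a
 * header-name by an implementation-defined method.  */
#define STRING_HEADER <string.h>
#include STRING_HEADER

#include <assert.h>

int header_demo(void)
{
    char buf[16];

    sprintf(buf, "%d", 42);         /* declared through "stdio.h"     */
    assert(strlen(buf) == 2);       /* declared through STRING_HEADER */
    return (int) strlen(buf);
}
```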


        [1.7.2]     # Operator Specification with Legacy from
                            Character-based Preprocessing

The next problem is the handling of white spaces as token separators
within the operand of the # operator.  A run of one or more white spaces
is compressed into one space, and no space is inserted where there is no
white space.
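
This rule can be observed directly in the resulting string literals.
The sketch below uses a hypothetical STR macro and hypothetical function
names.

```c
#include <assert.h>
#include <string.h>

#define STR(x)  #x

/* Runs of white space (a comment counts as one space) are compressed
 * into a single space; leading and trailing white space is deleted.  */
const char *spaced(void)    { return STR(  a   +   b  ); }
const char *commented(void) { return STR(a /* c */ + b); }

/* No space is inserted where the source has none.  */
const char *unspaced(void)  { return STR(a+b); }

void check_spacing(void)
{
    assert(strcmp(spaced(), "a + b") == 0);
    assert(strcmp(commented(), "a + b") == 0);
    assert(strcmp(unspaced(), "a+b") == 0);
}
```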

This is a half-finished specification.  For operations to be truly
token-based, the presence or absence of token separators must not have
any influence.  For that reason, it should have been defined so that
either all token separators are deleted or a space is placed between
every pair of pp-tokens.  C89 Rationale 3.8.3.2 (C99 Rationale 6.10.3.2)
states that the specification of the # operator was decided "as a
compromise between token-based and character-based preprocessing
discipline".

This compromise did not ease preprocessor implementation but added an
extra burden, and it brought ambiguity to complicated macro expansion as
well.  The following appears in Example 4 of ANSI C 3.8.3 (C90 6.8.3,
C99 6.10.3) Macro replacement -- Examples.

    #define str(s)      # s
    #define xstr(s)     str(s)
    #define INCFILE(n)  vers ## n

    #include xstr(INCFILE(2).h)

This #include line is expanded as:

    #include "vers2.h"

This example is full of problems.  There is nothing vague about
INCFILE(2) being replaced with vers2.  However, the expansion result of
INCFILE(2).h, the argument of xstr(), is a sequence of 3 pp-tokens:
vers2, ., and h.  The expansion shown in the Standard treats these 3
pp-tokens as having no white space among them.  This involves the issues
below.

1. vers2 is not a pp-token which was in the source, but one generated by
macro replacement.  To guarantee that there is no white space after
vers2, macro replacement must not generate white spaces before or after
its result.  However, if macro replacement always works that way,
pp-tokens may be merged implicitly as a result of macro expansion, at
least in preprocessors which are independent of the compiler.  This is
against the principles of token-based preprocessing.

2. To generate no white space before and after a macro replacement which
may become an operand of the # operator, and yet to avoid the implicit
concatenation of pp-tokens, a little trick is necessary: for a macro
replacement occurring in an argument of a function-like macro call, wrap
the replacement result in temporary white spaces internally, delete them
if the result becomes an operand of the # operator, and, after all
replacement is complete, turn only the remaining temporary white spaces
into real spaces (*1.)  This is quite a burden for a preprocessing
implementation, and it has no merit.  Furthermore, it is not clear from
the Standard text that this type of processing is necessary, nor what
the right processing is.

All this ambiguity and complexity comes from the half-finished handling
of token separators in the operand of the # operator.
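
The result of the macro call in question can be inspected without a
#include line by using it as a string.  This is only a sketch: the
function name include_name is hypothetical, and the check assumes a
preprocessor which follows the expansion shown in the Standard's
example.

```c
#include <assert.h>
#include <string.h>

#define str(s)      # s
#define xstr(s)     str(s)
#define INCFILE(n)  vers ## n

/* The same macro call as on the #include line above, used as an
 * ordinary string literal.  */
const char *include_name(void) { return xstr(INCFILE(2).h); }

void check_include_name(void)
{
    /* No space appears around the pp-token vers2, although it was
     * generated by macro replacement.  */
    assert(strcmp(include_name(), "vers2.h") == 0);
}
```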

I think it would be better to specify that the # operator stringizes its
operand after separating every pp-token with a single space.  This would
avoid the implicit concatenation of pp-tokens and the complicated
problems it causes, and would show what pp-token sequence the stringized
argument consists of.  Defined that way, this macro would be expanded as
"vers2 . h".  Needless to say, it would then not be an appropriate
macro.

As this example shows, the only case where inserting a space where none
exists would be troublesome is a macro for the #include line using the #
or ## operator.  The #include line, which is processed in translation
phase 4, cannot use the concatenation of string literals, which is
processed in phase 6.  However, a macro for the #include line can simply
be defined as a string literal, without going to the trouble of
parameterizing it with the # and ## operators.  Sacrificing token-based
principles just for the sake of this parameterization is a poor bargain.

In the Standard C preprocessing specification, the syntax is token-based
while the semantics specification of # operators is suddenly character-
based, losing logical consistency. *2

Moreover, this example in the Standard presupposes a specification which
is not necessarily clear from the Standard text.  It is an inappropriate
example and should be deleted.

  *1  MCPP was compelled to be implemented in the same way.

  *2  JIS C is only a translation of ISO C and its content is not
    supposed to differ.  However, in the printed "JIS handbook", X
    3010/6.8.3 example 4 contains an error: the macro expansion result
    of fputs(..) differs from the original in where spaces exist.  The
    JIS document is careless about space handling, and I do not believe
    that the draft or the proofs were checked by someone with a good
    understanding of the # operator specification.  It is true, though,
    that the specification itself is unnecessarily complicated.


        [1.7.3]     White Space Handling at Macro Re-definition

Macro re-definition has a specification similar to the white space
handling in the operand of the # operator.  It is defined as: "A macro
re-definition must be equivalent to the original macro.  In order to be
equivalent, the number and names of parameters must be the same and the
replacement list must have the same spelling.  However, as for white
spaces in the replacement list, their presence must be the same though
their amount can differ."
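
The rule can be sketched with a few re-definitions.  LIMIT, ADD and
redefined are hypothetical names; the re-definitions which the rule
rejects are left in comments so that the fragment still compiles.

```c
#include <assert.h>

#define LIMIT       (1 + 2)
/* Valid: only the amount of white space differs.  */
#define LIMIT       (1   +   2)
/* Not equivalent, white space has vanished:  #define LIMIT (1+2)  */

#define ADD(a, b)   ((a) + (b))
/* Valid: same parameter names, same spelling of the list.  */
#define ADD( a, b ) ((a) + (b))
/* Not equivalent, parameter names differ:
 *     #define ADD(x, y)   ((x) + (y))
 */

int redefined(void) { return ADD(LIMIT, 0); }

void check_redefinition(void) { assert(redefined() == 3); }
```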

Given the specification of the # operator described above, this is an
obvious consequence, since the white spaces in the replacement list must
be handled in the same way.  Again, the root of the problem is the
specification of the # operator.

If the # operator were specified so that one space exists between every
pair of pp-tokens in its operand, there would be no issue about the
presence of white spaces at macro re-definition.

Moreover, this can be generalized in a preprocessor implementation, by
placing, as a principle, exactly one space between every pair of
pp-tokens in source.  By doing so, tokenization for macro expansion can
be done easily and accurately.  However, there are two exceptions to
this principle.  One is the <newline> in a preprocessing directive line,
and the other is whether there is white space between the macro name and
the subsequent '(' in a macro definition.  These have traditionally been
the basis of preprocessing in C and cannot be changed after all these
years.


        [1.7.4]     Parameter Name at Function-like Macro Re-definition

I mentioned in [1.7.3] that the specification requires parameter names
to match at macro re-definition, but I believe this is an excessive
specification.  Parameter names, of course, make no difference to macro
expansion.  However, in order to check re-definitions, a preprocessor
needs to store the parameter names of all macro definitions, even though
the specification gives them no use other than the re-definition check.
It is not a good idea to impose overhead on implementations only for an
almost meaningless check.

I think it is better to remove the specification that parameter names
must match at macro re-definition.


        [1.7.5]     Unpredictable Evaluation of Character Constant in
                            #if Expression

The #if expression, the argument of the #if line, is an integer constant
expression.  Its evaluation must be independent of the execution
environment since it is done in preprocessing.  Because of that, casts,
the sizeof operator, and enumeration constants, which require reference
to the execution environment (these are first evaluated in translation
phase 7), are excluded from the #if expression, unlike ordinary integer
constant expressions.  Character constants (and wide character
constants), however, are not excluded.

Character constant evaluation is implementation-defined with many
factors as shown below and has little portability.

  1. Even the value of a basic character differs depending on the basic
    character set (ASCII, EBCDIC, and others.)

  2. Even a single-character character constant, within the same basic
    character set, has implementation-defined sign handling (depending
    on whether char is signed in the compiler proper.)

  3. Multi-character character constant evaluation is implementation-
    defined, and the value may differ even if the basic character set
    and the sign handling are the same.  It is not defined whether 'ab'
    is 'a' * 256 + 'b' or 'a' + 'b' * 256 when CHAR_BIT is 8 and char is
    unsigned.

  4. Multi-byte character encoding is implementation-defined, and so is
    wide character encoding.  The size of wchar_t and whether it is
    signed or unsigned are also implementation-defined.

  5. Even if the multi-byte character encoding is the same, the value of
    a character may not be.  The issue is the same as in 3.

  6. All above are common problems with character constant evaluation in
    compiler proper.  In addition, the character set in preprocessing
    can differ from the one the compiler proper sees.
    A source character set is applicable up to translation phase 4 and
    an execution character set is applicable at phase 6 and after.
    Phase 5 converts the characters in character constants and string
    literals from the source character set into the execution character
    set.  That is to say, either or both basic character sets and multi-
    byte character encodings also may differ between source and runtime.
    Character constants in the #if expression are evaluated in phase 4.
    The value used may be either that of the source character set or a
    simulated value of the execution character set; the Standard does
    not specify which.

  7. Even if the character set evaluated in a #if and the execution
    character set are the same, the methods of evaluation can differ.
    That is to say, sign handling and the byte order of multi-character
    character constant and multi-byte character constant evaluation may
    differ between phase 4 and phase 7.

  8. Furthermore, while the character constants including multi-byte
    character constants are evaluated in int and wide character
    constants in wchar_t in phase 7, they are all evaluated in long or
    unsigned long in phase 4.  In other words, int is handled as if it
    has the same internal representation as long and unsigned int as if
    it has the same internal representation as unsigned long in phase 4.
    Therefore, on an implementation where INT_MAX < LONG_MAX, even if
    the character sets, sign handling, and evaluation byte orders are
    totally identical between phase 4 and phase 7, a character constant
    which does not overflow in phase 4 can overflow in phase 7.  As a
    value that is negative in phase 7 may be a positive long value in
    phase 4, even the sign is not always the same.  An integer constant
    token other than a character constant never becomes negative, but
    for character constants the sign can hardly be predicted in general.

  9. In C99, the #if expression type became the maximum integer type of
    its implementation.  In other words, the evaluation type may vary
    depending on the implementation.

  10. In addition, multi-byte characters have an encoding problem,
    though it exists in compilation as well as in preprocessing.  For
    example, UTF-8 encodes a two-byte Unicode character in one to three
    bytes.  What, then, is the "value" of its character constant?  Is
    the value of a Kanji the "value" obtained by evaluating the UTF-8
    code of three bytes, or the "value" of the original Unicode?  It
    will be implementation-defined, but it is not even clear what sort
    of specification would be reasonable.

  11. Though this is also a common problem with compilation, UCN was
    also introduced in C99 and C++98.  Is a character expressed in UCN
    and one written in multi-byte characters the same "character"?  They
    should be the same in nature, however, their "value" will be
    different depending on the encoding of multi-byte characters.

As shown above, how a character constant in a #if expression is
evaluated can hardly be predicted, since its value has no portability
among implementations and may differ even within the same implementation
depending on the translation phase.
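
Some of the points listed above can be probed with a sketch like the
following.  The macro and function names are hypothetical, and no
particular answer is claimed for the character constant: the result
depends on the implementation, which is the very problem.

```c
#include <assert.h>

/* Phase 4: is a character constant with the high bit set negative?  */
#if '\377' < 0
#define PP_HIGH_CHAR_NEGATIVE   1
#else
#define PP_HIGH_CHAR_NEGATIVE   0
#endif

/* Phase 7: the same question, answered by the compiler proper.  */
int rt_high_char_negative(void) { return '\377' < 0; }

/* Nothing guarantees that the two phases give the same answer.  */
int phases_agree(void)
{
    return PP_HIGH_CHAR_NEGATIVE == rt_high_char_negative();
}

/* Items 8 and 9: #if arithmetic is done in long (C90) or in the
 * maximum integer type (C99), never in int, so this sum cannot
 * overflow in phase 4 even where int is 16 bits wide.  */
#if 32767 + 1 > 0
#define PP_WIDE_ARITHMETIC      1
#else
#define PP_WIDE_ARITHMETIC      0
#endif

void check_if_evaluation(void)
{
    assert(PP_WIDE_ARITHMETIC == 1);
    assert(phases_agree() == 0 || phases_agree() == 1);
}
```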

In general, the specification of the integer types of the C language has
few ambiguous parts.  Although the handling of negative values in
computation is implementation-defined, it is CPU-dependent and leaves
little to the implementer's choice.  Character constant evaluation is
the only exception: beyond the CPU specification, the basic character
set, and the multi-byte character encoding, many aspects are left to the
discretion of implementers.

For character constants in #if expressions, the range of discretion
grows immensely, and no agreement among translation phases is
guaranteed.  Even when such a constant is evaluated, it is hardly clear
what has been evaluated.  Under normal conditions, character constant
evaluation should be regarded as requiring reference to the execution
environment.  Standard C preprocessing removed whatever requires such
reference, but for some reason not character constants.  It seems that
a specification which does not require the reference had to be forced
into existence, and this created the ambiguity.

How are character constants of this type used in #if expressions?  In
the compilation phase the value of a char variable is often compared
with a character constant, but there is no such use in the preprocessing
phase, where no variables exist.  I cannot think of any appropriate
example of using a character constant in a #if expression.  It is
useless and should be removed from what is allowed in #if expressions,
just as casts and sizeof are.  Removing it would break even less source
code than the removal of casts and sizeof did.


        [1.7.6]     Non-Function-like Rescanning of Function-like Macro

A macro call is first replaced with its replacement list, which is then
rescanned.  The messiest thing in the rescanning specification of
Standard C is that the token sequence after the macro call is rescanned
together with the replacement list, as if it directly followed the
replacement list.  This deviates completely from the principle of
function-like macros modeled after function calls, and is the most
prominent factor in making macro expansion incomprehensible.  I think
that the specification making the subsequent token sequence subject to
rescanning should be removed and that rescanning should be limited to
the replacement list only.

Actually, rescanning the subsequent token sequence seems to have been a
long-standing implicit specification since around K&R 1st.  This
specification is no longer necessary in Standard C, but has survived as
a legacy, like a vestigial appendix.  Since this issue concerns the
foundation of macro expansion, I will study it in further detail below.

It is not easy to describe the macro rescanning method in writing.  The
text in the Standard or in K&R 2nd is not easy to understand, either.
For example, K&R 2nd A.12 says "the replacement list is repeatedly
rescanned."  However, the Standard does not say "repeatedly".  It can be
read as a single rescanning.  It can also be read as recursive
rescanning, but it is not clearly described as such either.

This cannot be explained accurately without an actual example.
Furthermore, it cannot be understood intuitively without explaining the
implementation method.  That is how closely macro rescanning is tied to
the traditional implementation of macro expansion.

First, we will take a silly example.  To simplify the problem, assume
that x and y are not macros.  How is this macro call expanded?

#define FUNC1( a, b)    ((a) + (b))
#define FUNC2( a, b)    FUNC1 OP_LPA a OP_CMA b OP_RPA
#define OP_LPA          (
#define OP_RPA          )
#define OP_CMA          ,

    FUNC2( x, y);

    1:  FUNC2( x, y)
    2:  FUNC1 OP_LPA x OP_CMA y OP_RPA
    3:  FUNC1 ( x , y )

It is clear at once that 1: is replaced with 2: and that 3: is generated
by rescanning 2:.  Then, is 3: a macro call?  More specifically, should
it be rescanned again from the beginning?

Is rescanning repeated many times from the beginning, or is it recursive
with its applicable range gradually narrowed down?  The truth is
neither.

As a matter of fact, rescanning seems to have been performed as a
certain exceptional kind of recursion, or as a certain kind of
repetition amounting to the same thing.  The classical example is the
one in the Macro Processing chapter of "Software Tools" by Kernighan &
Plauger.  That macro processor was later developed into the M4 macro
processor and is not itself a C preprocessor.  The book indicates that
it was originally designed and implemented in C by Ritchie.  The
prototype of preprocessor implementations can be seen there.

In this macro processor, rescanning is realized by pushing the
replacement list back onto the input for re-reading whenever there is a
macro call.  When there is another macro call in the replacement list,
the new replacement list is pushed back and re-read as if "it had been
in the input originally."  As the book says, "it provides an elegant way
to implement the rescanning of macro replacement text" and "This is how
we handle the recursion implicit in nested sources of input"; this
method greatly helps to keep a macro processor program structured and
easy to understand.

Many C preprocessors perform rescanning by putting the replacement list
onto a pseudo input, a kind of stack, and re-reading it.

In the example above, when 2: is being rescanned and FUNC1 turns out not
to be a macro call at that point, this token is settled then and there,
and the rest of the replacement, whether by repetition or by recursion,
applies to OP_LPA and after.  Once OP_LPA is replaced with ( and this
turns out not to be a macro, x and later are dealt with next.  In this
way, tokens are settled sequentially from the beginning, and 3: becomes
the final result.  This is no longer a macro call.
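
That 3: is not expanded as a macro call can be made visible by giving a
real function the same name as the macro.  This is only a sketch; the
function FUNC1 and the wrappers via_macro and via_function are
assumptions added for the illustration.

```c
#include <assert.h>

#define FUNC1( a, b)    ((a) + (b))
#define FUNC2( a, b)    FUNC1 OP_LPA a OP_CMA b OP_RPA
#define OP_LPA          (
#define OP_RPA          )
#define OP_CMA          ,

/* A real function with the same name as the macro.  The parentheses
 * around the name keep the macro from being invoked here.  */
int (FUNC1)(int a, int b) { return a * b; }

/* Direct macro call: ((2) + (3)).  */
int via_macro(void)    { return FUNC1(2, 3); }

/* The result is the token sequence FUNC1 ( 2 , 3 ), which is not
 * rescanned from the beginning, so the function is called.  */
int via_function(void) { return FUNC2(2, 3); }

void check_no_rescan(void)
{
    assert(via_macro() == 5);
    assert(via_function() == 6);
}
```

The call of FUNC2 is wrapped in a function on purpose.  If FUNC2(2, 3)
were written inside the argument of another macro, the rescanning of
that macro would meet FUNC1 ( 2 , 3 ) once more and could expand it
(compare the exceptions mentioned in [1.6.3].)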

This method, used since "Software Tools" (or even before that), is
certainly a concise way of implementation.  Though not mentioned in
"Software Tools", it also has a pitfall.  The problem is that rescanning
may run past the replacement list into the part after the original macro
call, since the replacement list pushed back onto the input is read
consecutively with the source.  In nested macros, the nesting level may
get shifted unnoticed during rescanning.  This situation is caused by a
macro without arguments which is expanded into the name of a macro with
arguments, and by an abnormal macro whose replacement list comprises the
first half of a call of another macro with arguments.

#define add( x, y)      ((x) + (y))
#define head            add(

    head a, b)

This is an example.  This strange macro call is expanded into
((a) + (b)).  For some reason this ended up being officially
acknowledged by Standard C.  In fact, this macro is legal rather than
undefined.
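
That the call is accepted can be confirmed with a compilable fragment.
The function name head_sum and the constant arguments are assumptions
made for this sketch.

```c
#include <assert.h>

#define add( x, y)      ((x) + (y))
#define head            add(

/* The replacement list of head ends in the middle of a call of add.
 * Rescanning takes the remaining "1, 2)" from the text which follows
 * the macro call, giving ((1) + (2)).  */
int head_sum(void) { return head 1, 2); }

void check_head(void) { assert(head_sum() == 3); }
```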

I cannot think that C preprocessors were intended for abnormal macros
like this.  I wonder, however, whether the original C preprocessor was
implemented as above and consequently expanded these macros somehow,
without complaint, and whether some programs consciously took advantage
of this hole until it became a de facto standard specification and was
finally approved in Standard C.  That is to say, a small defect in the
original C preprocessor implementation led to a strange de facto
standard and left its trail even in Standard C.  This is why I call it a
vestigial appendix.

Now, returning to the question of whether rescanning is recursive or
repetitive, I believe it is not necessarily wrong to call it either an
irregular recursion or an irregular repetition.  It is recursion, but
with the strange characteristic that its applicable range is not always
narrowed down as in ordinary recursion, but rather gradually shifted.
It is also repetition, though not from the beginning but from the
middle, gradually taking in the parts that follow.

Therefore, once all comments and preprocessing directives are processed,
it is possible to process the text from beginning to end with this
shifted rescanning only.  In fact, such a method is used in "Software
Tools", and some current C preprocessor sources use a similar way.  In
other words, rescanning is synonymous with macro expansion, and indeed
with macro expansion of the whole text.

The fact that the subject of rescanning is gradually shifted causes many
problems.  The next example is given in C89 Rationale 3.8.3.4 (C99
Rationale 6.10.3.4) as a macro for which it is unclear how it should be
expanded.  It is stated that this processing was left unspecified "as
the Committee saw no useful purpose in specifying all the quirks of
preprocessing for such questionably useful constructs."  However, this
example is suggestive.  Rather, it could not have been defined as a
specification.

#define f(a)    a*g
#define g(a)    f(a)

    f(2)(9)

In this example, f(2) is first replaced with 2*g.  If the "subsequent
preprocessing tokens" were not subject to rescanning, macro expansion
would be complete and f(2)(9) would become the token sequence 2*g(9).
However, since the "subsequent token sequence" is subject to rescanning,
this g(9) forms a macro call and is replaced with f(9).  Here, it is not
clear whether this f(9) should be replaced again, with 9*g, or should
not, by the rule of no re-replacement of a macro with the same name.
The token sequence f(9) is generated by rescanning g, which is the end
of the replacement result of the first f(2), together with (9) from the
"subsequent token sequence", and it is unclear whether this is inside or
outside the nest of the f(2) call.

C90 Corrigendum 1 addressed this problem by adding the following item to
Annex G.2 Undefined behavior.

  -- A fully expanded macro replacement list contains a function-like
    macro name as its last preprocessing token (6.8.3).

This correction, however, only causes more confusion.

First of all, the meaning of the wording "fully expanded macro
replacement list" is not clear.  It can only be interpreted as "the
replacement list after any macros within the arguments have been
expanded."  In that case, in the example of f(2)(9), f(2) is replaced
with 2*g and, before the re-replacement of a macro of the same name even
comes into question, the behavior is already undefined at the point
where g is rescanned and found to be a function-like macro name.  In
other words, with these definitions of f and g, any call of f is
undefined.

If this "correction" is applied, the following example of macro
rescanning in ISO/IEC 9899:1990 6.8.3 Examples becomes undefined from
the start.

#define f(a)    f(x * (a))
#define x       2
#define g       f
#define w       0,1
#define t(a)    a

    t(t(g)(0) + t)(1);      /* f(2 * (0)) + t(1);   */
    g(x+(3,4)-w)            /* f(2 * (2+(3,4)-0,1)) */

The Standard states that these macro calls are expanded as shown in the
comments, but this no longer holds if the Corrigendum is applied.  With
these definitions of f and g, the behavior is undefined whenever the
identifier g appears, because f, a function-like macro name, is the only
and therefore the last pp-token in the replacement list of g.

    t(t(g)(0) + t)(1)

At first, the argument of the first t call will be expanded.

        t(g)(0) + t

Since this contains another macro call, t(g), that call is expanded, but
its argument must be expanded first.

            g

When this is replaced with f, the behavior is already undefined.

Even if replacement continues regardless, the result is:

        t(f)
        f

The behavior is undefined again, since the last pp-token of the result
of expanding t(f) is f.  If replacement continues further, the result is:

        f(0) + t
            f(x * (0))
            f(2 * (0))
        f(2 * (0)) + t
    t(f(2 * (0)) + t)
    f(2 * (0)) + t

This completes the expansion of the first t call, but the behavior is
undefined for the third time, since the replacement list ends with a
function-like macro name, t.

How about the following?

    g(x+(3,4)-w)

This is already undefined at the point where g is replaced with f.

Thus the Examples contradict G.2, which causes confusion.

Even if the examples in the Examples are set aside, the correction in
the Corrigendum does not reduce the confusion.  First of all, G.2 is not
a part of the Standard, and this addition has no grounds in the text of
the Standard proper.  The text of the Standard says only that the
"subsequent token sequence" is also to be rescanned.  Secondly, even if
this Corrigendum were included in the Standard proper,

    #define head            add(

in the previous example is valid, since add is not at the end of the
replacement list.

    #define head            add

is undefined, however.  This is too unbalanced.  Also, there is an issue
of the wording, "fully expanded", being unclear in meaning. *1, *2

Needless to say, these are quirks brought about by the specification
that makes the "subsequent token sequence" subject to rescanning.  The
more plausible the committee tries to make it sound, the more confusing
it gets.  The Standard states the rule forbidding re-replacement of a
macro of the same name in extremely difficult sentences, and these
quirks are one reason for that difficulty.

On the other hand, Standard C specifies that, in a function-like macro
call, macros in an argument are expanded only within that argument.
This is no wonder, since it would cause turmoil if macro expansion in an
argument ate up the text following it.

As a result, however, macros are treated differently depending on
whether or not they appear within an argument.

#define add( x, y)      ((x) + (y))
#define head            add(
#define quirk1( w, x, y)    w x, y)
#define quirk2( x, y)       head x, y)

    head a, b);
    quirk1( head, a, b);
    quirk2( a, b);

In this quirk1() call, the first argument, head, is replaced with add(,
and rescanning it within the argument yields an incomplete macro call,
which is a violation of a constraint.  Put simply, it is an error.
However, quirk2() and head a, b) are not errors, and are expanded as:

    ((a) + (b))

At the risk of repetition, all of this absurdity comes from the fact
that even the "subsequent token sequence" is subject to macro rescanning
in general.  As a matter of implementation, even with the method of
pushing the replacement list back onto the input, nesting level
information must be added so that argument expansion is performed
independently of the rest of the text.  With that information in place,
it would be easy not to rescan the "subsequent token sequence" at all.
Instead, the current half-baked specification requires the processing to
change depending on whether or not it is inside an argument, which is an
extra burden on implementations.

Macro expansion in C has traditionally been influenced by editor-like
string replacement.  Pre-Standard macro expansion can be described as
editor-style string replacement to which things were added until it
became excessively complicated.

By contrast, Standard C took the trouble to name a macro with arguments
a function-like macro.  I presume that it was trying to bring the call
syntax closer to that of a function call.  The specification that a
parameter is replaced with its argument after the macros in the argument
have been fully expanded, and the specification that this expansion is
performed only within the argument, conform to this principle.  However,
the principle is spoiled by the specification that macro rescanning in
general includes the subsequent token sequence.  That is a legacy of
repeated text replacement inherited from its ancestor.

If the subsequent token sequence had been removed from the subject of
rescanning, macro expansion could have been defined as completely
recursive, with the applicable range narrowed (or at least not extended
forward or backward) on every recursion.  It would then clearly have
been a macro worthy of the name function-like macro.  I do not think
much source code would have had problems with this decision.  I can only
think that the ANSI C committee could not bring itself to cut off an
appendix inherited from an ancestor. *3

I wish C99 had cut it off cleanly, but the appendix has survived again.

  *1  An object-like macro which is expanded into a function-like macro
    name is sometimes seen in actual programs, as below.

            #define add( x, y)      ((x) + (y))
            #define sub( x, y)      ((x) - (y))
            #define OP  add
                OP( x, y);

    This is not as abnormal as the earlier example which expands into
    the first half of a function-like macro call, but there is no reason
    why it must be written this way.  It is better to define a
    function-like macro which calls the other function-like macro, as
    below.

            #define OP( x, y)       add( x, y)

  *2  The reason for this correction by the Corrigendum is found in
    "Record of Responses to Defect Reports" by the C90 ISO C committee
    (SC 22 / WG 14) (#017 / Question 19).  The question on the macro
    expansion of f(2)(9) in ANSI C Rationale 3.8.3.4 was brought up
    again.  The direct issue in this example must have been the scope of
    the rule "prohibiting the re-replacement of macros with a same
    name", but the committee answered it as a general problem not
    limited to macros of the same name.  They did not realize that this
    interpretation would contradict the Examples.

    In addition, this wording, "fully expanded", is strange.  When f(2)
    is replaced with 2*g and rescanned up to g, is it fully expanded?
    If so, no more replacements are performed, and therefore the
    behavior is not undefined, either.  If it is not yet fully expanded,
    g is rescanned together with the succeeding (9) and replaced by
    f(9).  If this is fully expanded, the last pp-token of 2*f(9) is not
    a function-like macro name, so the item does not apply.  In other
    words, the item says "after macro expansion is completed" where the
    very issue is when macro expansion ends.  Thus, the question of when
    macro expansion ends has only become more confusing.

    In the C99 draft of November 1997, this item of the Corrigendum was
    included in Annex K.2 Undefined behavior, but in the draft of August
    1998 it was deleted and replaced by the following paragraph in Annex
    J.1 Unspecified behavior, which was eventually adopted in C99.

        When a fully expanded macro replacement list contains a function-
        like macro name as its last preprocessing token and the next
        preprocessing token from the source file is a (, and the fully
        expanded replacement of that macro ends with the name of the
        first macro and the next preprocessing token from the source
        file is again a (, whether that is considered a nested
        replacement.

    It seems that the committee finally noticed the contradiction in the
    Corrigendum.  However, the fundamental problems in the text of the
    Standard proper still remain.  Also, when macro expansion ends is,
    in the end, unspecified.  Furthermore, because the distinction
    depends on the presence of '(' in the source, the same macro gives
    different results depending on whether it appears in the source or
    in the replacement list of another macro.  This is an inconsistent
    specification.

    On this issue, refer also to section [2.4.26].

    Also, the specification of macro expansion in the C++ Standard is
    the same as in C90, with no equivalent of either C90 Corrigendum 1
    or the specification added to Annex J.1 of C99.

  *3  Even with this decision, FUNC2( x, y) in the previous example
    becomes FUNC1 (x, y) during argument expansion if it appears in the
    argument of another macro call, and is then expanded into
    ((x) + (y)) when the original macro is rescanned.  In other words,
    the final expansion result differs depending on whether or not the
    call is in an argument.  However, this is a problem at a different
    level, and it causes no inconvenience.


        [1.7.7]       C90 Corrigendum 1, 2, Amendment 1

With respect to ISO/IEC 9899:1990, Corrigendum 1 was released in 1994,
Amendment 1 in 1995, and finally Corrigendum 2 in 1996.

Corrigendum 1 consists mostly of trivial corrections of wording, and
only two of them affect preprocessing.  One concerns the macro
rescanning described in [1.7.6] above.

The other is a very special rule for the case where a macro name in a
macro definition contains '$' or similar characters.

In Standard C, '$' is not accepted as a character in an identifier,
though some leading implementations have traditionally allowed it.  In
the example 18.9 in test-t/e_18_4.t, Standard C interprets $ as a single
character which forms a pp-token by itself.  The macro name is therefore
THIS, and everything from $ onward becomes the replacement list of an
object-like macro.  This is a totally different result from what the
program intends, namely a function-like macro named THIS$AND$THAT.

Corrigendum 1 added an exceptional rule for this type of example: "if
object-like macro replacement list starts with a non-basic character, a
macro name and a replacement list must be separated by white-space."  A
Standard C implementation must issue a diagnostic message for the
example in 18.9.  The rule is meant to prevent source which uses $ or @
in macro names from being silently preprocessed into an unintended
result.  It is a painstaking specification, but it is annoying that
exceptions of this type keep increasing.  In implementations which do
not accept $ or @ in an identifier, macros like this always cause an
error in the compilation phase even if they cause no error in
preprocessing.  So this exceptional rule does not seem necessary. *1

In addition, ISO 9899:1990 had an ambiguous constraint that the
header-name pp-token can appear only in #include directives.
Corrigendum 1 corrected this to say that a header-name is recognized
only within a #include directive.

The core of Amendment 1 is multi-byte characters, wide characters, and
the library functions which operate on strings of them.  Along with
those, the standard headers <wchar.h> and <wctype.h> were added.  In
addition, the standard header <iso646.h> and the specification of
digraphs were added as alternatives to trigraphs, that is, as ways to
write characters not included in the ISO 646 character set, or tokens
and pp-tokens using those characters.  <iso646.h> is a quite simple
header which defines some operators as macros, and it raises no
particular problems. *2

The problem is digraphs.  A digraph is very similar to a trigraph and
is used in almost the same way, but its place in preprocessing is
completely different.  A trigraph is a character, and is converted into
a regular character in translation phase 1, while a digraph is a token
and pp-token.  If a digraph sequence is stringized by the # operator, it
must be stringized as is, without conversion (although this # itself can
also be written as the digraph %:).  For this reason, and only this
reason, implementations need to retain a digraph as a pp-token at least
until phase 4 completes.  If it is converted at all, that happens later
(converting the digraph sequences which remain as tokens, not those in
string literals).

This imposes an unnecessary burden on implementations.  An
implementation is simpler if it recognizes a digraph as a character,
just like a trigraph, and converts it in phase 1.  There is no benefit
in keeping it as a pp-token.  The Amendment itself notes that a digraph
differs from the usual token only when it is stringized.  It might be
objected that conversion in phase 1 would make it troublesome to write a
string literal like "%:".  But the same is true of trigraphs, and this
is too special a problem to consider.  If such a literal must be
written, "%" ":" is enough.  Digraphs should be repositioned as an
alternative to trigraphs, to be converted in phase 1.

Corrigendum 2 has no corrections regarding preprocessing.

  *1  This specification disappeared in C99 and C++ Standard.
    In C99, the following generalized specification was added to 6.10.3
    Macro replacement/Constraints instead.

        There shall be white space between the identifier and the
        replacement list in the definition of an object-like macro.

    As this is also a tokenization exception specification, it is not
    praiseworthy.

  *2  In the C++ Standard, these identifier-like operators are tokens,
    not macros.  It is difficult to understand why (could the idea be to
    cut down what preprocessing must do?), and in any case it is
    troublesome for implementations.


        [1.7.8]     Redundant Specifications

In ANSI C 2.1.1.2 (C90 5.1.1.2, C99 5.1.1.2) Translation phases 3, there
are redundant specifications though they are harmless.

    A source file shall not end in a partial preprocessing token or
    comment.

Since translation phase 2 specifies that a source file must not end
without a <newline> or with <backslash><newline>, a source file which
passes phase 2 always ends with a <newline> not preceded by a
<backslash>.  It never ends in a partial preprocessing token.  The
partial preprocessing tokens also include ", ', <, and > which are
unmatched within a logical line; these are undefined according to ANSI C
3.1 (C90 6.1) Lexical Elements/Semantics, and are not a problem limited
to the end of the source.  The wording "partial preprocessing token or"
is unnecessary.

ANSI C 3.8.1 (C90 6.8.1, C99 6.10.1) Conditional inclusion/Constraints
contains wording which invites misunderstanding.

    it shall not contain a cast; identifiers (including those lexically
    identical to keywords) are interpreted as described below;

This "it shall not contain a cast; " is superfluous.  The text which
follows and the Semantics make clear that all identifiers, including
those identical to keywords, are expanded if they are macros, and that
the remaining identifiers are evaluated as 0.  Casts need no special
consideration.  In the syntax (type), it is clear that type is handled
as a mere identifier.

On the contrary, since this appears in a constraint, it can be read as
requiring the implementation to recognize the cast syntax and to issue a
diagnostic message.  That is not the intention of the Standard.  Since
there are no keywords in translation phase 4, a cast cannot be
recognized.  The same holds for sizeof, and it is strange that only the
cast is mentioned and sizeof is not.

This type of wording is called "superfluous."


    [1.8]       Preprocessing Specification in C99

The following specifications regarding preprocessing were added in C99.

1.  A hexadecimal sequence in the \uxxxx or \Uxxxxxxxx format in an
identifier, string literal, character constant, or pp-number is called a
UCN (universal-character-name) and denotes a Unicode character value.
It must specify an extended character not included in the basic source
character set.  Whether a \ is inserted when a UCN is stringized by the
# operator is implementation-defined.

2.  Implementation-defined characters can be used in identifiers.
This makes it possible for implementations to allow multi-byte
characters such as Kanji characters in identifiers.

3.  // to the end of the line is handled as a comment.

4.  In addition to e+, E+, e-, and E-, the sequences p+, P+, p-, and P-
are accepted in a pp-number.  This is for writing the bit pattern of a
floating-point number in hexadecimal, such as 0x1.FFFFFEp+128.

5.  The type of a #if expression is the widest integer type of the
implementation.  Since long long / unsigned long long is required, the
type of a #if expression has the size of long long or wider.

6.  Variable argument macros can be used.

7.  An empty argument of a macro call is a valid argument.

8.  A predefined macro, __STDC_HOSTED__, is added.  This is defined as 1
on a hosted implementation, 0 otherwise.  The predefined macro
__STDC_VERSION__ is defined as 199901L.

9.  The predefined macros __STDC_ISO_10646__, __STDC_IEC_559__, and
__STDC_IEC_559_COMPLEX__ are added as options.

10. The new _Pragma operator.

11. Directive names starting with #pragma STDC are reserved for the
Standard and the implementation, and three #pragma STDC directives which
specify floating-point operation modes are added.  Directives starting
with #pragma STDC are not macro-expanded; whether other #pragma lines
are macro-expanded is implementation-defined.

12. Adjacency of a wide-character-string-literal and a
character-string-literal was undefined in C90.  In C99 they are
concatenated into a wide character string literal.

13. The range of the line number used as an argument for #line is
extended to [1,2147483647].

14. Translation limits are raised as below.
    Length of a source logical line     :   4095 bytes
    Length of a string literal, character constant, and header name
                                        :   4095 bytes
    Length of an internal identifier    :   63 characters
    Number of #include nesting          :   15 levels
    Number of #if, #ifdef, #ifndef nesting          :   63 levels
    Number of parenthesis nesting in an expression  :   63 levels
    Number of parameters of a macro     :   127
    Number of macros definable          :   4095

15. In C90, a header name was guaranteed up to 6 characters + . + 1
character.  This is changed to 8 characters + . + 1 character.

Variable argument macros are as below.  If there is a macro definition,

    #define debug(...)  fprintf(stderr, __VA_ARGS__)

a macro call,

    debug( "X = %d\n", x);

is expanded as:

    fprintf(stderr, "X = %d\n", x);

In other words, ... in the parameter list stands for one or more
parameters, and __VA_ARGS__ in the replacement list corresponds to it.
Even if multiple arguments in a macro call correspond to ..., they are
merged, commas included, and handled as one argument.

Among the undefined behaviors in C90, there are some for which a
sufficiently meaningful interpretation is possible.  An empty argument
in a macro call is one of them; there are cases where it is useful to
interpret it as zero pp-tokens.  This became a valid argument in C99.

C99 introduces an operator called _Pragma; writing _Pragma( "foo bar")
has the same effect as #pragma foo bar.  In C90, the arguments of a
#pragma line are not macro-expanded, a line which looks like a #pragma
directive as a result of macro expansion is not handled as a directive,
and #pragma cannot be written in the replacement list of a macro
definition.  By contrast, a _Pragma expression can be written in a macro
replacement list, and the #pragma resulting from it is handled as a
directive.  The _Pragma extension is an attempt to improve the
portability of the cumbersome #pragma.

It would be simpler to make the arguments of #pragma subject to macro
expansion, without this type of irregular extension.  That would largely
achieve the intended improvement in portability.  In that case, however,
the restriction that #pragma cannot be written in a macro would remain,
and there would be the issue that any #pragma argument which must not be
macro-expanded has to be renamed to start with __ in order to keep it
out of the user name space.  Though the _Pragma() operator is irregular,
its implementation is not so troublesome, and it is a reasonable
specification.

The introduction of Unicode raises too many issues.  First of all,
implementations must prepare a huge table for conversion between
multi-byte characters and Unicode, which causes a large overhead.  It is
virtually impossible to implement on systems of 16 bits or less.  There
are many systems which do not handle Unicode at all.  In addition, there
are many cases where Unicode and multi-byte characters do not map one to
one.  It seems too aggressive to put Unicode into a C language standard
in the name of internationalization of programming languages.

In C99, the handling of UCNs was drastically reduced compared with the
draft of November 1997 and with the C++ Standard.  The load on
preprocessing became relatively small, so that a certain degree of
implementation became possible in MCPP as well. *1

However, some large loads remain on the compiler proper.  Also, since
UCNs are an unreadable notation, I expect that, like trigraphs, they
will end up not being used much. *2

MCPP, by the way, conforms to the C99 specification when invoked with
the -S1 -V199901L options.  However, __STDC_ISO_10646__,
__STDC_IEC_559__, and __STDC_IEC_559_COMPLEX__ are not predefined,
because they are expected to be defined in header files by each
implementation.

  *1  In the draft of November 1997, almost as in the C++ Standard, all
    extended characters not in the basic source character set were to be
    converted into UCNs in translation phase 1 and converted again into
    the execution character set in phase 5.
    If this were implemented, a tool would presumably be called to do
    these conversions before and after processing.  As the conversion is
    OS-dependent, separate tools would be realistic.

  *2  According to C99 Rationale 5.2.1 Character sets, this
    specification assumes that a tool included in the compiler system
    converts between the unreadable notation and source written in
    multi-byte characters.  This must mean separating the multi-byte
    character string literal parts into a separate file for processing.
    I wonder how practical that is.


    [1.9]       Toward Clear Preprocessing Specifications

The problems in the Standard C preprocessing specifications discussed
above, and my views on them, are also requests for future versions of
Standard C.  They are summarized in the following items.

1. Header-names in the <stdio.h> format should be made an obsolescent
feature.  In the version after the next, header-names should be allowed
only in the string literal format.

2. Stick to the principle of token-based preprocessing.  The # operator
should stringize its argument after inserting a single space between
every pair of pp-tokens, even where there is no token separator, so that
the presence or absence of token separators in an argument does not
affect the result.

3. Similarly, macro redefinition should not be affected by the presence
or absence of token separators in the replacement list.

4. Differences in parameter names should not matter in macro
redefinition, since checking them only increases implementation overhead
and has no value.

5. Evaluation of a character constant normally requires reference to the
execution environment, and has no use in a #if expression.  Therefore,
character constants should be removed from what a #if expression may
contain.

6. Function-like macros should be handled consistently like functions.
Whether a macro call is in an argument or in a replacement list,
rescanning should apply only to the replacement list, and not to the
pp-token sequence following the macro call, so that in principle the
same pp-token sequence is generated.

7. A digraph should not be a token, but an alternate spelling of a
character similar to a trigraph.  It should be converted in translation
phase 1.

8. Remove "partial preprocessing token or" from translation phase 3.

9. Remove the description "it shall not contain a cast; " regarding #if
expressions from the constraint, and move it to footnote 140 in C99.

10. Trigraphs should be abolished as they are not used in Europe.

These are all intended to tidy up irregular rules and make the
preprocessing specifications simple and clear.  There is no doubt that
they would make preprocessing easier to understand, while causing little
inconvenience.

I believe that MCPP V.2 implements, in its Standard mode, all the
preprocessing specifications of Standard C, including the parts I do not
think highly of.  In the 'post-Standard' mode, it implements
preprocessing with the modifications above (also excluding UCNs and the
use of multi-byte characters in identifiers).

Amendment 1 and Corrigendum 1 in C90 took the direction of increasing
the irregularity of preprocessing rather than cleaning it up.

C99 added various new features, but did not clean up the confusion of
logic above either, unfortunately. *1

Regarding the specifications added in C99, I have the following request.

  1. The introduction of Unicode (UCN) should be limited to an option.

In addition to the problems described above, there is an issue with the
evaluation rules for the integer types which apply to #if expressions.

1. There is a constraint that a constant expression shall evaluate to a
constant in the range of representable values for its type.  It is not
clear whether this applies to all constant expressions.  Since no
exceptions are described, I can only interpret it as applying to all of
them, but I suspect that the intention of Standard C is something like
"where a constant expression is necessary."  This should be made clear.
On the other hand, there is the specification that "computation
involving unsigned operands can never overflow."  It is unclear whether
a diagnostic message should be issued when the result of evaluating a
constant expression of unsigned type goes beyond its range (since a
constant expression can be evaluated at compile time, it seems
appropriate to issue one).

Since this is not a preprocessing specific issue, it will not be
discussed further.

Also, in C90 the result of / or % when one or both operands are negative
was implementation-defined, which was a terrible specification.  In C99
this was given the same specification as div() and ldiv().

  *1  Various defect reports regarding C99, responses to them, and
    corrigendum drafts are on the ftp site below.  This is the official
    ftp server for ISO/SC22/WG14 and you can ftp as anonymous at least
    for now (SC stands for subcommittee and WG for working group.  SC22
    deliberates programming language Standards and WG14 handles C
    Standards.)
        http://www.open-std.org/jtc1/sc22/wg14/


2.  Validation Suite Explanation


    [2.1]       Validation Suite for Conformance of Preprocessing

The items in the test-t, test-c, test-l, and tool directories, together
with this cpp-test.txt itself, constitute the "Validation Suite for
Standard C Conformance of Preprocessing", which I developed.  It tests
in detail the level of Standard C (ANSI/ISO/JIS C) conformance of the
preprocessing of arbitrary compiler systems.  It is intended to cover
all preprocessing specifications defined in Standard C.  There are also
many additional items addressing matters outside the specification.

The test-t directory contains 183 sample text files.  Of these, 30 are
header files, 145 are pieces of sample text, and 8 are files which
combine small pieces of sample text.  All but the header files are named
*.t, except for some named *.cc.  They test the preprocessing phases and
have nothing to do with the compilation phases.  Therefore, they are not
necessarily in the form of correct C programs; they are sample text for
testing preprocessing.

Since a Standard C implementation may combine preprocessing and
compilation into a single process, in some implementations preprocessing
cannot be tested separately.  In that sense, these *.t samples
themselves do not conform to Standard C.  However, there are many
implementations in which preprocessing alone can be tested separately.
In fact, specifications and problems become clearer when preprocessing
can be separated.  The *.t sample files are for those implementations.

*.cc files are samples for C++ preprocessing, provided for some
preprocessors which do not accept files named *.c or *.t as C++ source.
They have the same content as the corresponding *.t files.

Among sample text files, there are ones with names starting with n_
(meaning normal), i_ (meaning implementation-dependent), m_ (meaning
multi-byte character) and e_ (meaning erroneous.)

Files starting with n_ are samples which contain no errors, nothing
causing undefined behavior, and no implementation-defined parts.
Preprocessors conforming to Standard C must be able to process these
properly.

Files starting with i_ are samples dependent on implementation-defined
specifications regarding character sets assuming the ASCII basic
character set. (*)  Preprocessors for the implementations with ASCII
character set conforming to Standard C must be able to process these
properly without errors.

Files starting with e_ are samples which contain some sort of violation
of syntax rules or constraints, in other words, errors.  Preprocessors
conforming to Standard C must diagnose these and must not overlook them.

Files with a number succeeding n_, i_, m_ or e_ are samples that test
preprocessing in C90 and common preprocessing specifications in C90 and
C99.  Among header files, pragmas.h, ifdef15.h, ifdef31.h, ifdef63.h,
long4095.h and nest9.h to nest15.h are samples to test C99 preprocessing
specifications and others are for common specifications in C90 and C99.

Files with alphabetic names other than std or post after n_, i_, e_, or
u_ are samples for C99 and C++.  n_dslcom.t, n_ucn1.t, e_ucn.t and
u_concat.t are samples to test preprocessing specifications common to
C99 and C++98; n_bool.t, n_cnvucn.t, n_cplus.t, e_operat.t and u_cplus.t
are for C++; and the rest are for C99.

The file named ?_std.t combines the pieces of files for C90 into one.
?_std99.t is the equivalent for C99.  ?_post.t and ?_post99.t are bonus
files used for testing MCPP in the 'post-Standard' mode.

The files named u_*.t are bonus files, each a piece testing undefined
behavior.  undefs.t combines them into one file.  unbal?.h are header
files used by them.  unspcs.t tests unspecified behaviors, and warns.t,
which belongs to none of the above, contains text for which warnings are
desirable.  unspcs.t and warns.t are also bonus files.  Files named
m_*.t are samples for several encodings of multi-byte characters and
wide characters.  It is desirable to process many encodings properly.
Like u_*.t, m_*.t belong to the quality test items.

misc.t, recurs.t and trad.t are real bonuses.  misc.t is a collection of
what is in Standards and other documentation, tests with different
results depending on the internal representation of the integer type,
tests related to translation phase 5 or 6, tests for enhanced functions,
and others.  recurs.t is a special case of recursive macro, and trad.t
is a sample for the old "Reiser model cpp".

There are 132 files in the directory called test-c.  26 of those are
header files (24 of them are the same as the ones in test-t), 101 are
pieces of sample source files, 3 are files which combine pieces of
sample source, and the other 2 are files used for automatic testing.
Among these, 31 files are bonus sample source files.  Source files other
than header files are named *.c and are in the form of C programs.

Naturally, file names start with n_, i_, m_ or e_.  Ones starting with
n_ are strictly conforming programs in Standard C (programs with
neither errors nor implementation-dependent portions).  Implementations
must be able to compile these files without errors and execute them
correctly.  On correct execution, the messages below are displayed.

    started
    success

As exceptions, n_std.c and i_std.c do not display these messages.
Only the end message,

    <End of "n_std.c">

, is displayed.  Otherwise, some sort of error message is displayed.
Some files starting with i_ are samples of character constants assuming
ASCII.  Implementations supporting the ASCII character set must be able
to compile and execute these files correctly, just like the ones
starting with n_.  Files starting with e_ must be correctly diagnosed by
compiler systems at compilation (preprocessing).

Testing by compilation and execution is the most proper testing method.
However, while this method detects that an implementation has an error,
it does not always make clear where the error is.  You can make a more
accurate evaluation by applying the *.c files to the preprocessor alone
and examining the results, as far as the implementation allows (*.t
files are even more straightforward).

The files called ?_std.c combine the pieces into one file.

Files named u_*.c are bonus files, each a piece that tests undefined
behaviors.  undefs.c collects them in one file.  unspcs.c tests
unspecified behaviors.  warns.c does not belong to any of the above; it
contains texts for which warnings are desirable.  unspcs.c and warns.c
are also bonus files.  Those starting with m_ are samples of several
multi-byte character encodings.

C99 tests are not included in the test-c directory since there is no
compiler proper that fully supports C99.  C++ tests are only in the
test-t directory.

The test-l directory contains samples for testing translation limits
that exceed specifications.  All 144 files are bonus.  They are a mix of
*.c, *.t, and *.h files.

Many *.h files are duplicated across the test-t, test-c, and test-l
directories.  If the duplicated header files were gathered in one
directory, they would have to be included in the following way, for
example.

    #include "../test-t/nest1.h"

However, how this kind of path is interpreted and how files are
searched for (where the base directory is located and so on) are all
implementation-defined, and compatibility is not guaranteed.  To avoid
this problem, those header files are placed in each directory despite
the duplication.  (Even the concept of "directory" is outside the C
Standard.)

The tool directory includes tools necessary for automatic testing.

Standard C conformance requires not only that compiler systems behave
correctly, but also that their documents describe the necessary items
accurately.  I will explain this in [2.5].


    [2.2]       Testing Method

When performing tests with the Validation Suite, if a compiler system
has options that bring it closer to Standard C, all of them should be
set (refer to [5.1] for a concrete example).


        [2.2.1]       Manual Testing

Each of the test-t and test-c directories contains 2 kinds of samples:
big files that put multiple pieces together, and small files divided
into pieces.  If a preprocessor conforms to Standard C well, only the
big files are necessary to test n_*.  However, if the level of
conformance is not high, a preprocessor will get confused in the middle
of these files and the rest of the items cannot be tested.  Therefore,
small pieces are also provided.  Since dividing the pieces too finely
would make the number of files too large and testing too troublesome, I
made a reasonable compromise.  Depending on the implementation, even
these small pieces cannot be processed to the end.  In such an event,
please divide the sample into even smaller pieces to continue testing.

As the #error directive terminates processing in some implementations,
samples testing #error are not included in the big combined files.  An
#include error also often terminates processing, so it is not included
in the big combined files either.

The *.t samples are used when a preprocessor is an independent program
or when a compiler has an option to output the text after
preprocessing.  By looking at the result of preprocessing these files,
you can check directly whether it matches the correct results written
in the comments.  Since the preprocessing results can be viewed
directly, this way allows a more accurate judgment, as long as the
implementation permits it.

Many *.c programs include the "defs.h" header.  "defs.h" contains 3
kinds of assert macro definitions for different levels of Standard C
conformance; change #if 0 to #if 1 for one of these.  The 1st or 2nd
one is used for implementations whose preprocessor implements the #
operator.  The 1st one only includes <assert.h>.  The 2nd one is an
assert macro which does not abort on an assertion failure.  This, of
course, is not a correct assert macro, but it is more convenient for
testing.  The 3rd one only displays "Assertion failed" on an assertion
failure, and it is not clear which assertion has failed.  This is used
for implementations which do not implement the # operator, or which
implement it with a bug that causes a compilation error.
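
A non-aborting assert of the 2nd kind might look roughly like the
sketch below.  This is not the actual text of "defs.h"; the names
soft_assert and soft_failures are invented here so that the sketch
does not clash with the standard assert().

```c
#include <assert.h>     /* the standard assert(), for comparison only */
#include <stdio.h>

/* Hypothetical counter of failed assertions; not part of "defs.h". */
static int soft_failures = 0;

/*
 * Like assert(), but reports the failed expression and continues.
 * The # operator turns the expression into a string literal, so this
 * form needs a preprocessor that implements #.
 */
#define soft_assert(exp) \
    ((exp) ? (void) 0 \
           : (void) (soft_failures++, \
                     fprintf(stderr, "Assertion failed: %s, line %d\n", \
                             #exp, __LINE__)))
```

Because execution continues after a failure, one run can report every
failing item of a sample instead of stopping at the first one.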

In old implementations without the <limits.h> standard header file, it
is necessary to write one (refer to [4.1.3]).

In multi-byte character processing, behavioral specifications of
implementations may differ depending on the runtime environment, thus
testing m_* requires attention (refer to [3.1].)

This type of testing has a difficult problem: the test of one item may
be tripped up by another failure in the implementation.  For example,
if <limits.h> has an error and is included to test #if, it is not clear
whether <limits.h> or #if is being tested.  Tests which compile and
execute *.c files are more troublesome than those which preprocess *.t
files.  If the final result is wrong, it shows that there is some sort
of error in the implementation, but not necessarily in the item
tested.

In this Validation Suite I tried to devise ways to aim at the target
item.  However, there is a restriction that the Validation Suite itself
must be portable.  In addition, in order to test an item, the correct
implementation of other language specifications must be assumed.
Therefore, the preprocessing items used as such "assumptions" are
implicitly tested in samples that target other items.  Please note that
such implicit allocation of points also exists in the "allocation of
points" described next.  When an implementation fails to process a
sample, it may not be possible to judge, without looking at other
tests, whether it failed in the item really targeted or because of
another factor.

Each test item has its own allocation of points.  Marking criteria
are also written.  Standard C does not have subsets, therefore, strictly
speaking, an implementation cannot be said to conform to Standard C
unless all items match the specifications.  In reality, there are not
many such implementations, and we cannot help using a measure of the
level of Standard C conformance to evaluate implementations.  In
addition, as items differ greatly in importance, counting the number of
passed items will not do; a weighting according to importance must be
applied.

However, this weighting does not have objective criteria, of course.
The marking of this Validation Suite was decided by myself and has no
firm basis.  Still, it will serve as a guideline for evaluating the
conformance levels of implementations objectively.

n_*, i_*, e_*, and d_* are tests related to Standard conformance, and
these are generally marked in units of 2 points.  In tests of matters
outside the Standards and in quality evaluation, q_* is marked in units
of 2 points and the rest in units of 1 point.  Where a diagnostic
message should be displayed, no points are scored if the message is
displayed but is wide of the mark.  A partial score may be given to a
diagnostic message that is not absolutely incorrect but rather off the
point.  An implementation is free to issue diagnostics on a correct
program as long as it processes the program correctly; however, wrong
diagnostics will be subject to deduction.


        [2.2.2]     Automatic Testing by cpp_test

If you compile the cpp_test.c program in the tool directory and run it
in the test-c directory, you can test n_*.c and i_*.c for C90
automatically.  However, this only scores pass or fail and does not
provide any detail.  It does not include tests such as e_*.?.  It gives
only a rough measure of the C90 conformance level of preprocessors.  No
tests regarding C99 are included, because most compilers do not
support C99 sufficiently yet. *1, *2, *3

cpp_test is used as follows, taking Visual C++ 2005 as an example.

    cpp_test VC2005 "cl -Za -TC -Fe%s %s.c" "del %s.exe" < n_i_.lst

The second and subsequent arguments must each be enclosed in " and "
(if the shell strips the " characters, it is necessary to enclose the
second through last arguments all together in ' and ').  Each %s will
be replaced by a sample program name without .c, such as n_* and i_*.
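
The %s replacement can be pictured with the following sketch.  It is
not taken from cpp_test.c; expand_template is a hypothetical helper
that simply substitutes a sample name for every %s in a command
template.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy 'tmpl' into 'out', replacing every "%s" with 'name'.
 * Returns 0 on success, -1 if the result does not fit in 'size' bytes.
 * (Hypothetical helper, not the code of cpp_test.c.)
 */
int expand_template(char *out, size_t size, const char *tmpl,
                    const char *name)
{
    size_t used = 0;
    size_t nlen = strlen(name);

    if (size == 0)
        return -1;
    while (*tmpl != '\0') {
        if (tmpl[0] == '%' && tmpl[1] == 's') {
            if (used + nlen >= size)    /* keep room for the NUL */
                return -1;
            memcpy(out + used, name, nlen);
            used += nlen;
            tmpl += 2;
        } else {
            if (used + 1 >= size)
                return -1;
            out[used++] = *tmpl++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

With the template "cl -Za -TC -Fe%s %s.c" and the name n_1 (standing
for any entry of the list), the result is "cl -Za -TC -Fen_1 n_1.c".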

The first argument: Specifies the name of a compiler system.  This must
be within 8 bytes and must not include '.' (to suit MS-DOS.)  Files with
this name plus .out, .err, or .sum are created.
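
The restriction on the first argument and the naming of the result
files can be sketched as below.  Both function names are hypothetical;
how cpp_test.c itself checks the name is not described here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* 1 if 'name' is 1 to 8 bytes long and contains no '.', else 0. */
int valid_system_name(const char *name)
{
    size_t len = strlen(name);

    return len >= 1 && len <= 8 && strchr(name, '.') == NULL;
}

/*
 * Build "<name><suffix>", for example "VC2005" plus ".sum".
 * Returns 0 on success, -1 on an invalid name or a too small buffer.
 */
int make_result_name(char *out, size_t size, const char *name,
                     const char *suffix)
{
    int n;

    if (!valid_system_name(name))
        return -1;
    n = snprintf(out, size, "%s%s", name, suffix);
    return (n >= 0 && (size_t) n < size) ? 0 : -1;
}
```

A name such as gcc-3.4.3 is rejected by this check, both because it is
9 bytes long and because it contains '.'.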

The second argument: Specifies the command to compile.

The third argument and later: Specify the commands to delete files
which are no longer necessary.  More than one of these is allowed.

n_i_.lst is in the test-c directory.  It lists the n_*.c and i_*.c
files, without the .c suffix.

Some implementations may run away while processing certain source
files.  In such an event, change the source name in n_i_.lst to a name
which does not exist, none, for example, then run the test again.

By running cpp_test this way, n_*.c and i_*.c are compiled and executed
sequentially.  The outputs to stderr of the sample programs are
recorded in the n_*.err and i_*.err files.  In addition, the scores are
written in a column in VC2005.sum.  However, there are only the 3 kinds
of marks below.

    *:  Pass
    o:  Compiles, but the execution result failed.
    -:  Could not be compiled.
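
The three marks can be expressed as a small function.  This is a
sketch for illustration only; result_mark and its two flag parameters
are hypothetical and are not the interface of cpp_test.c.

```c
#include <assert.h>

/*
 * Map the outcome of one sample to the mark written to the *.sum file.
 * 'compiled' is non-zero if compilation succeeded, 'ran_ok' is non-zero
 * if the compiled program then ran with the expected result.
 */
char result_mark(int compiled, int ran_ok)
{
    if (!compiled)
        return '-';             /* could not be compiled          */
    return ran_ok ? '*' : 'o';  /* pass, or compiled but failed   */
}
```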

In VC2005.out, the command line that called cpp_test is recorded and so
is the message which was output to stdout by the compiler system if any.
Messages output to stderr by a compiler system are recorded in
VC2005.err if any.

Look at these for more information.

Now, use the following command.

    paste -d'\0' side_cpp *.sum > cpp_test.sum

By doing so, the *.sum files which are test results for each compiler
system are combined horizontally to create one table to be recorded in
cpp_test.sum.  side_cpp is the table side portion where test item titles
are written and exists in the test-c directory.

The cpp_test.sum that I created this way is located in the doc
directory.  The detailed results of manual testing are written in [5].
They cover more preprocessors than cpp_test.sum.  Some of those
preprocessors are not supported by the compiler driver of any compiler
system, so they cannot be tested automatically by cpp_test.

  *1 This cpp_test.c was written based on runtest.c and summtest.c in
    "Plum-Hall Validation Sampler."

  *2 cpp_test.c does not behave as expected if it is compiled with
    Borland C / bcc32.  This is because cpp_test calls system() with
    stdout and stderr redirected, but the standard I/O paths do not
    seem to be inherited by the child process in bcc32.  If cpp_test.c
    is compiled with Visual C or LCC-Win32, it operates without
    problems.

  *3 m_36_*.c are tests of encodings which have a byte of value 0x5C
    ('\\').  cpp_test does not use them, since some systems do not use
    these encodings.


        [2.2.3]     Automatic Testing by GCC / testsuite

            [2.2.3.1]   TestSuite

GCC contains something called testsuite.  If you run 'make check' after
compiling the GCC source files, the samples of this testsuite are
checked one after another and the results are reported.

Since V.1.3, my Validation Suite has included an edition rewritten so
that it can be used as a testsuite of GCC.  Putting this in testsuite
allows automatic checking by 'make check'.  While the cpp_test tool in
[2.2.2] can test only samples named n_* or i_*, testsuite allows
samples which require diagnostic messages, such as e_*, w_*, and u_*,
to be tested automatically.  This set of testcases is applicable to
cpp0 (cc1, cc1plus) of GCC 2.9x and later, and to MCPP.

Here, I will explain how to use the Validation Suite in GCC / testsuite.

The cpp-test directory of the Validation Suite is the edition for GCC /
testsuite, created by rewriting the test-t and test-l directories;
cpp-test contains a test-t and a test-l directory of its own.

GCC / testsuite, however, cannot change the execution environment.  The
files named m_* or u_1_7_* are the testcases for several multi-byte
character encodings.  Since those testcases each need a different
environment, at least for GCC 3.3 and earlier, they are excluded from
this testsuite edition. *

GCC and testsuite specifications have been changed many times thus far
and are expected to be changed in the future as well.  It may require a
partial fix to the Validation Suite accordingly, especially in case of
addition or change of diagnostics.  However, no extensive fix seems to
be necessary so far unless the version of GCC is extremely old.  The
testcases in cpp-test have been verified in each cpp0 (cc1, cc1plus) of
GCC 2.95.3, 3.2, 3.3.2, 3.4.3 and 4.0.2, and MCPP.

Runtime options cannot be changed in the testsuite depending on the
implementation.  As a matter of fact, multiple standards coexist and it
is necessary to specify a version of the standard using the 'std='
option.  However, this option does not exist in older versions of GCC.
Therefore, my testsuite applies to GCC 2.9x and later and MCPP V.2.3 and
later.

Testsuite is executed by interpreting the comments in the following
format written in the sample program.  This is a comment which does not
affect tests in other compiler systems.

    /* { dg-do preprocess } */
    /* { dg-error "out of range" "" }  */

Samples containing a dg-error or dg-warning comment test diagnostic
messages.  Testing multiple compiler systems is supported by writing
the diagnostic messages of each compiler system with '|' (OR) in
between.

This is executed by the tool called DejaGnu, which is invoked directly
as a shell script called runtest.  The setup of DejaGnu is written in
files named *.exp.  *.exp files are scripts for the tool called Expect,
and Expect in turn is built on the command language called Tcl.

Therefore, using testsuite requires all of these tools, in versions
appropriate to the testsuite.  The same applies when my Validation
Suite is used.

  * In fact, GCC does not work properly even if the environment variable
    is set.

            [2.2.3.2]   Installation to TestSuite and Testing

My Validation Suite is used in GCC / testsuite in the following manner.

First, copy the cpp-test directory to an appropriate directory in
testsuite of GCC.

The cpp-test directory was created by copying the necessary files from
the test-t and test-l directories and adding the configuration file
cpp-test.exp.  The suffix of most files named *.t is changed to .c, and
the suffix of the files for C++ is changed to .C.

Most samples test the preprocessor only.  Since two samples cannot test
the preprocessor due to the problems in DejaGnu and Tcl, they are for
compiling and running (named *_run.c).  These two samples contain the
line:

    { dg-options "-ansi -no-integrated-cpp" }

where -no-integrated-cpp is an option for GCC 3 and 4.  GCC 2 does not
support the option, which needs to be removed in order to test in GCC 2.
To accommodate both GCC 2 and GCC 3 or 4, there are two types of files,
*_run.c.gcc2 and *_run.c.gcc3, for these two testcases.  Link the
appropriate one to *_run.c.

Below, I will take GCC 3.4.3 on my machine as an example.  Suppose the
source files of GCC 3.4.3 are located in /usr/local/gcc-3.4.3-src, and
GCC is compiled in /usr/local/gcc-3.4.3-objs.

    cp -r cpp-test /usr/local/gcc-3.4.3-src/gcc/testsuite/gcc.dg

This copies files under cpp-test to the gcc.dg directory.

After this, if you run

    make bootstrap

in /usr/local/gcc-3.4.3-objs to compile the GCC source files and then
run

    make -k check

the entire testsuite, including the testcases in cpp-test, will be
tested.

To run only the cpp-test testcases, do as below in the
/usr/local/gcc-3.4.3-objs/gcc directory.

    make check-gcc RUNTESTFLAGS=cpp-test.exp

The testsuite logs are recorded in gcc.log and gcc.sum under the
./testsuite directory.

When you do 'make check', depending on the environment, you need to set
the environment variables DEJAGNULIBS and TCL_LIBRARY as explained in
INSTALL/test.html of the GCC source files.

In addition, the environment variables LANG and LC_ALL must be set to C
to make the environment English.

Please note that 'make check' uses the xgcc, cc1, cc1plus, cpp0 etc.
generated in the gcc directory when GCC was compiled, not the gcc, cc1
and such that have already been installed.

Tests can be executed in any directory as follows.

    runtest --tool gcc --srcdir /usr/local/gcc-3.4.3-src/gcc/testsuite  \
                                                        cpp-test.exp

Logs are output to the current directory.  In this case, what is
tested is the gcc, cc1, and cpp0 which have already been installed.
cpp-test requires testsuite, as testsuite contains various
configuration files for GCC (config.*, *.exp).

The argument 'gcc' of "runtest --tool gcc" should be exactly 'gcc'.  If
the name of the compiler to be tested is not 'gcc', for example 'cc' or
'gcc-3.4.3', you should make a symbolic link so that the compiler is
invoked by the name 'gcc'.

Also, cpp-test contains testcases for warnings in cases where it is
thought desirable for a preprocessor to issue a warning.  The GCC
preprocessor passes fewer than half of those cases.  However, not
passing does not mean that the behavior is wrong or that the
preprocessor was not compiled properly.  This is not an issue of being
right or wrong, but rather of the "quality" of the preprocessor.

            [2.2.3.3]   MCPP Automatic Testing

This cpp-test can also test MCPP.  Substituting MCPP for the GCC
preprocessor and calling

    make check-gcc RUNTESTFLAGS=cpp-test.exp

in the gcc directory checks MCPP in Standard mode automatically.  Tests
can also be done using the runtest command in any directory.

    runtest --tool gcc --srcdir /usr/local/gcc-3.4.3-src/gcc/testsuite  \
                                                        cpp-test.exp

If MCPP is executed in GCC 3 or 4, all testcases for cpp-test except one
should pass.  There is another testcase which does not pass when
executed in GCC 2.  However, it is because gcc calls MCPP with the
-D__cplusplus=1 option and not MCPP's fault.

Please refer to manual.txt [3.9.5] and [3.9.7] for how to substitute the
preprocessing with MCPP.  To apply the testsuite, MCPP must be started
with the -23j options.  -2 is an option to enable digraphs and -3 is one
to enable trigraphs.  -j is an option for not adding information such as
source lines to diagnostic message output.  Do not use other options.
Additionally, testsuite can test MCPP in Standard mode only.

The method above is used after the make of GCC.  However, automatic
testing can also be done by 'configure' and 'make' of MCPP itself as
long as GCC / testsuite is installed and ready to execute.  This is the
easiest way, as 'make check' automatically performs the necessary
settings.  See the INSTALL in MCPP for this method.

            [2.2.3.4]   TestSuite and Validation Suite

GCC has had testsuite for a long time, but it had very few samples for
preprocessing up to V.2.9x.  You can see how little attention was paid
to preprocessing.  The number of testcases for preprocessing increased
quite a lot in V.3.x.  You can tell that preprocessing was given more
importance, as the preprocessor source and documents were completely
replaced with up-to-date ones.

However, these testcases are still quite unbalanced.  The causes seem to
lie in the following nature of testsuite.

  1. It is a collection of bug reports submitted by users.  In other
    words, it concentrates on areas where bugs were actually detected,
    to correct them and to prevent their reappearance.
  2. It also holds testcases added for debugging when developers
    implemented new functionality.

This is a way of debugging peculiar to open source projects, and it
became possible because GCC has been used by many excellent programmers
around the world.  However, this method may also have brought about the
randomness and imbalance of the testcases.

In addition, most of these testcases are valid only in GCC and cannot be
used in other compiler systems.  Also, the testcases for GCC 3 include
many which cannot be applied even to GCC 2 / cpp.  The reason is the
differences in preprocessing output spacing and diagnostic messages.

On the other hand, my Validation Suite was originally written by myself
only in order to debug my preprocessor and rewritten so that the entire
preprocessing specifications are tested.  Many samples are organized
systematically on the whole.

Adding these systematic testcases to GCC / testsuite should be of
considerable value.

Also, my testsuite edition of Validation Suite is written so that it can
test three preprocessors, GCC 2.9x / cpp, GCC 3.x, 4.x / cc1 (cc1plus),
and MCPP.  In other words, the use of the regular expression facility in
DejaGnu and Tcl can absorb the implementation differences in
preprocessing output spacing and diagnostic messages. *

Below are the results of applying the testsuite edition of the
Validation Suite to these three preprocessors (tested in June 2006).

Below is the case where the preprocessor is replaced by MCPP V.2.6 in
GCC 3.3.2.

                === gcc Summary ===

# of expected passes            264
# of unexpected failures        1
# of expected failures          4
/usr/bin/gcc version 3.3.2

There is one failure due to a lack of the universal-character-name <=>
multi-byte character conversion implementation in C++98.

Here is the GCC 3.3.2 / cc1 case.

                === gcc Summary ===

# of expected passes            215
# of unexpected failures        52
# of unexpected successes       2
# of expected failures          2
/usr/bin/gcc version 3.3.2 20031218 (Vine Linux 3.3.2-0vl8)

Most of the failures are due to a missing warning.

GCC 4.0.2 / cc1 is almost the same.

                === gcc Summary ===

# of expected passes            214
# of unexpected failures        53
# of unexpected successes       2
# of expected failures          2
/usr/bin/gcc version 4.0.2 20050901 (prerelease) (SUSE Linux)

Here is the GCC 2.95.3 / cpp0 case.

                === gcc Summary ===

# of expected passes            181
# of unexpected failures        87
# of unexpected successes       3
# of expected failures          1
gcc version 2.95.3 20010315 (release)

There are fewer warnings than with GCC 3, 4 / cc1.  There are also some
diagnostic messages that are off the point.  Half of the new C99 and
C++98 specifications have not been implemented, either.

The number of items differs among different versions of GCC, since
multiple failures can occur in one testcase.

  *  This makes dg script in my testcases difficult to read with
    frequent use of \ and symbols.  The regular expression processing in
    DejaGnu and Tcl has a considerable number of peculiarities and flaws
    requiring ingenuity to perform all the automatic testing achieved on
    multiple compiler systems.  Currently, however, runtime options used
    in testcases for those compiler systems must be common.


    [2.3]       Violations of syntax rules or Constraints and Diagnostic
                        Messages

Standard C implementations must certainly process correct source
correctly, but they also must issue diagnostic messages for erroneous
source.  Standard C also contains portions where behavior specifications
are up to the implementation or not defined.  They are as below in
summary. *1

1. Correct programs and data whose outcomes are the same in every
implementation.

2. Correct programs and data whose process methods are not specified.
These do not need to be described in documents and the results are
called unspecified behavior.

3. Correct programs and data whose processing is up to implementations.
These specifications must be mentioned in documents by each
implementation.  These results are called implementation-defined
behavior.

4. Programs or data that are erroneous or not portable and their
processing is not defined as specifications at all.  Implementations may
or may not output diagnostic messages.  They may process them as some
sort of valid programs.  These results are called undefined behavior.

5. Erroneous programs or data for which implementations must issue
diagnostic messages.  There are violations of syntax rules and
violations of constraint among these. *2

Among these, programs and data in 1 only are called strictly conforming
(it is interpreted that 2 and 3 may be included if their results do not
differ depending on implementation or special cases.)

Programs and data in 1, 2, and 3 only are called conforming programs.

How diagnostic messages are issued is implementation-defined.  One or
more diagnostic messages of some sort are to be issued for a
translation unit that includes some sort of violation of syntax rules
or constraints.  It is up to the implementation whether diagnostic
messages are issued for programs with no violation of syntax rules or
constraints.  However, strictly conforming programs, and conforming
programs that match the implementation-defined or unspecified
specifications of the implementation, must be processed correctly to
the end.

Violations of syntax rules or constraints are called an "error" in this
document.  Among the e_* files in this Validation Suite, there are many
which include multiple errors.  In the scoring below, it is expected
that a compiler system issues one or more diagnostic messages.
However, there may be compiler systems that issue just one diagnostic
message (such as "violation of syntax rules or constraints") for one
translation unit no matter how many errors there are.  In addition,
there may be compiler systems that get confused after an error.  These
types of problems are matters of "quality", not of Standard
conformance level.  Please split samples into pieces and test them
again as needed.  The problems of quality will be discussed in [3]
separately.

  *1 ANSI C 1.6 (C90 3) Definitions of Terms
     ANSI C 1.7 (C90 4) Compliance
     ANSI C 2.1.1.3 (C90 5.1.1.3) Diagnostics
     C99 3 Terms, definitions, and symbols
     C99 4 Conformance
     C99 5.1.1.3 Diagnostics
  *2 Although C++98 differs from C90 or C99 in these terms, it does not
    much differ in the meanings.


    [2.4]       Details

Each test item is explained one by one below.  This is also a
description of Standard C preprocessing itself.  The specifications in
common with K&R 1st are not explained again.  Item numbers are common in
*.t and *.c files.


        [2.4.1]     Trigraphs

            [n.1.1]     9 trigraph sequences

As 9 characters of the basic character set in C are not included in
the Invariant Code Set of ISO 646:1983, these can be written in source
using the 3-character sequences below.  This is a new specification
introduced in C90. *

    ??=     #
    ??(     [
    ??/     \
    ??)     ]
    ??'     ^
    ??<     {
    ??!     |
    ??>     }
    ??-     ~

These 9 trigraph sequences are replaced by the equivalent characters
in translation phase 1.  On systems where you can type these 9
characters on the keyboard, it is not necessary to use trigraphs, of
course.  However, preprocessing conforming to Standard C must be able
to do trigraph conversion even on those systems.
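
The replacement done in translation phase 1 can be modeled by the
following sketch, which applies the table above to a string.  It is a
simplified model for illustration, not the code of any particular
preprocessor, and the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/*
 * Return the character that a trigraph ending in 'c' stands for,
 * or 0 if 'c' does not complete one of the 9 trigraphs.
 */
char trigraph_char(char c)
{
    switch (c) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/*
 * Copy 'in' to 'out', replacing the 9 trigraph sequences.  'out' must
 * have room for strlen(in) + 1 bytes; the result is never longer.
 */
void replace_trigraphs(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        char c;

        if (in[0] == '?' && in[1] == '?'
                && (c = trigraph_char(in[2])) != 0) {
            *out++ = c;
            in += 3;
        } else {
            *out++ = *in++;     /* not a trigraph: copy unchanged */
        }
    }
    *out = '\0';
}
```

Any other sequence starting with two question marks is copied
unchanged, one character at a time.  When calling this from C source
that is itself compiled with trigraph conversion enabled, write the
test input with the \? escape (for example "?\?=") so that the string
literal is not converted before the function sees it.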

Scoring:  6.  6 points if all 9 are processed correctly.  For each
trigraph which cannot be processed properly, 2 points are deducted,
down to a minimum of 0.
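
The scoring rule of this item amounts to the following arithmetic.
The function name is hypothetical and serves only to spell out the
rule.

```c
#include <assert.h>

/* Score of item n.1.1, given how many of the 9 trigraphs failed. */
int score_n_1_1(int failed)
{
    int score;

    assert(failed >= 0 && failed <= 9);
    score = 6 - 2 * failed;         /* 2 points off per failure */
    return score < 0 ? 0 : score;   /* but never below 0        */
}
```

Under this rule an implementation that fails on 3 or more of the 9
trigraphs scores 0.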

  *  ANSI C 2.2.1.1 (C90 5.2.1.1) Trigraph sequences
     ANSI C 2.1.1.2 (C90 5.1.1.2) Translation phases
     C99 5.2.1.1 Trigraph sequences
     C99 5.1.1.2 Translation phases

            [n.1.2]     Trigraph sequences in control lines

Since trigraph conversion is performed prior to tokenization in
translation phase 3 and control line processing in phase 4, trigraphs
can be written anywhere on a control line.

Scoring: 2.

            [n.1.3]     Only 9 trigraphs

There are only the 9 trigraphs mentioned above.  Therefore, other
sequences starting with ?? are never translated into another
character, nor is the ?? skipped.  Preprocessing must be able to handle
sequences that mix a trigraph with ?'s which are not part of a
trigraph.

Scoring: 2.


        [2.4.2]     Line Splicing by <backslash><newline>

In case there is a \ at the end of a line and a <newline> immediately
afterward, this sequence of <backslash><newline> is deleted in
translation phase 2 unconditionally.  As a result, 2 lines are connected.
In Standards, the line on a source file is called a physical line to
distinguish it while the line connected by removing <backslash><newline>
(if any) is called a logical line.  Processing in translation phase 3 is
performed with this logical line as subject. *

In K&R 1st, the #define line and string constants can continue on the
next source line using <backslash><newline>, but other cases are not
mentioned.  Actual implementations allow other control lines to be
continued as well, not only #define.
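
For illustration, the following fragment uses <backslash><newline> in
several of the places that the items of this section exercise: on a
#define line, inside a string literal and inside an identifier.  It is
a sketch written for this explanation, not an excerpt from the n_*
samples, and all names in it are made up.

```c
#include <assert.h>
#include <string.h>

/* Splices inside the parameter list and between the parameter list
   and the replacement list of a #define line. */
#define ADD(a, \
            b) \
    ((a) + (b))

/* Splice inside a string literal: the continuation must start in
   column 1, or its leading spaces become part of the string. */
const char *spliced_string = "abc\
def";

/* Splice inside an identifier: this declares 'spliced_total'. */
int spliced_\
total = 3;

int use_splices(void)
{
    return ADD(spliced_total, 4);
}
```

After phase 2 the string literal reads "abcdef" and the identifier is
the single token spliced_total, so use_splices() returns 7.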

  *  ANSI C 2.1.1.2 (C90 5.1.1.2) Translation phases
     C99 5.1.1.2 Translation phases

            [n.2.1]     Between a parameter list and replacement list
                                on the #define line

Continuation of the #define line is accepted in K&R 1st and in most
implementations.

Scoring: 4.  4 points for processing correctly and 0 point otherwise.

            [n.2.2]     Inside a parameter list on the #define line

There are some implementations which cannot handle <backslash><newline>
in unusual places such as inside a parameter list even on the #define
line.

Scoring: 2.

            [n.2.3]     Inside a string literal

<backslash><newline> inside a string literal has been supported since
K&R 1st.

Scoring: 2.

            [n.2.4]     Inside an identifier

In Standard C, <backslash><newline> must be removed unconditionally
wherever it appears, even inside an identifier.

Scoring: 2.
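
The cases n.2.1 through n.2.4 can be sketched as follows.  This is only
an illustration; ADD, MUL, spliced_text, spliced_value and splice_check
are names made up here and are not the samples of this Validation Suite.

```c
#include <assert.h>
#include <string.h>

/* n.2.1: splice between the parameter list and the replacement list. */
#define ADD(a, b)   \
    ((a) + (b))

/* n.2.2: splice inside the parameter list. */
#define MUL(a,      \
            b)  ((a) * (b))

/* n.2.3: splice inside a string literal; the continuation starts in
   column 1 so that no extra spaces enter the string. */
static const char spliced_text[] = "abc\
def";

/* n.2.4: splice inside an identifier; this defines spliced_value(). */
int spliced_\
value(void)
{
    return ADD(1, 2) + MUL(3, 4);   /* 3 + 12 */
}

/* Returns 1 when every splice above was removed in phase 2. */
int splice_check(void)
{
    assert(strcmp(spliced_text, "abcdef") == 0);
    return spliced_value() == 15;
}
```

A preprocessor that removes <backslash><newline> only in some contexts
would typically fail to compile the function definition.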

            [n.2.5]     <backslash> as a trigraph

<backslash> is not only \, but also the trigraph ??/.  Since ??/ in
source is converted into \ in translation phase 1, it is simply \ itself
by phase 2.

Scoring: 2.


        [2.4.3]     Comments

In translation phase 3, a logical line is broken into pp-tokens and
white spaces.  A comment is converted into a single space at that
time. *1

Here, implementations may convert consecutive white spaces (including
comments) into a single space.  However, <newline> is never converted
and stays as is.  That is because preprocessing directives in the next
phase 4 are processed in units of this "line."

When a comment extends over multiple lines, the comment in effect
splices those lines.

  *1  ANSI C 2.1.1.2 (C90 5.1.1.2) Translation phases
    C99 5.1.1.2 Translation phases

            [n.3.1]     Conversion into one space

In the old cpp of the so-called Reiser type, comments functioned as
token separators only internally in cpp and were removed before output.
Taking advantage of this, comments were sometimes used for token
concatenation.  However, this behavior deviated from K&R 1st and was
clearly rejected by Standard C.  In Standard C, the ## operator is used
for token concatenation.

Scoring: 6.
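
A small sketch of the difference, under the assumption of a Standard C
preprocessor.  STR, CAT, xy, separated and comment_space_check are
hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <string.h>

#define STR(x)      #x          /* stringize the argument */
#define CAT(a, b)   a ## b      /* Standard C token concatenation */

static int xy = 12;     /* named by the single token produced by ## */

/* The comment separates the tokens a and b, so stringizing yields
   "a b"; a Reiser-type cpp would have fused them into one token. */
static const char separated[] = STR(a/* comment */b);

int comment_space_check(void)
{
    assert(strcmp(separated, "a b") == 0);
    return CAT(x, y);           /* the identifier xy, that is 12 */
}
```

The comment keeps a and b apart, while ## is the only portable way to
join x and y into xy.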

            [n.dslcom]  // a comment

From K&R 1st to C90, comments started with /* and ended with */. *1

However, C99 added support for the C++ style of comment, //. *2

Scoring: 4.

In C90, this should be processed as just a sequence of two '/'
pp-tokens, not as a comment.  However, since implementations which
handled // as a comment even prior to C99 were common, MCPP treats this
as a comment and issues a warning in C90 mode.

  *1  ANSI C 3.1.9 (C90 6.1.9) Comments
  *2  C99 6.4.9 Comments

            [n.3.3]     Comment processing prior to pp-directive
                                processing

A preprocessing directive starting with # applies to a "line", but this
"line" is not necessarily a physical line in source.  It can be a
logical line joined by <backslash><newline>, or a "line" which extends
over multiple physical and logical lines by means of a comment.  This
is not surprising if you think about the order of translation phases 1
through 4.

Scoring: 4.

            [n.3.4]     Comments and <backslash><newline>

A pp-directive line may extend over several physical lines by both
<backslash><newline> and a comment.  Preprocessors that do not
implement translation phases properly cannot handle this correctly.

Scoring: 2.
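
The two situations of n.3.3 and n.3.4 might look like the following
sketch.  FROM_COMMENT, FROM_BOTH and directive_line_check are names
invented for this illustration.

```c
#include <assert.h>

/* n.3.3: the comment carries the directive "line" across two physical
   lines, so the replacement list is still 42. */
#define FROM_COMMENT    /* the comment begins on the directive line
                           and ends on the next one */ 42

/* n.3.4: the same directive is extended once by a comment and once by
   <backslash><newline>. */
#define FROM_BOTH       /* comment spanning
                           two lines */ 40  \
                        + 2

/* Returns 1 when both directive lines were assembled correctly. */
int directive_line_check(void)
{
    assert(FROM_COMMENT == 42);
    return (FROM_BOTH) == 42;
}
```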


        [2.4.4]     Special Tokens (digraphs) and Characters (UCN)

In C90/Amendment 1 (1994), alternative spellings called digraphs were
added for some operators and punctuators. *1

In C99, a character notation called UCN (universal character name) was
added. *2

In e.4.?, token errors are covered.

  *1  Amendment 1/3.1 Operators, 3.2 Punctuators (added to ISO
    9899/6.1.5, 6.1.6)
    C99 6.4.6 Punctuators
  *2  C99 6.4.3 Universal character names

            [n.4.1]     Digraph spelling in a preprocessing directive
                                line

Digraphs are handled as tokens (pp-tokens).  '%:' is another spelling
for '#'.  It can of course be used as the first pp-token of a
preprocessing directive line or as the stringizing operator.

Scoring: 6.

            [n.4.2]     Digraph spelling stringizing

Unlike trigraphs, digraphs are tokens (pp-tokens), so they are
stringized in their own spelling without being converted (a meaningless
specification).

Scoring: 2.
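
As a rough sketch of n.4.1 and n.4.2, assuming a compiler mode in which
digraphs are enabled (C90/Amendment 1 or later).  STR, pair and
digraph_check are hypothetical names.

```c
%:include <assert.h>
%:include <string.h>

%:define STR(a)  %:a            /* %: used as the # operator */

/* <: :> <% %> are the digraph spellings of [ ] { } */
static const int pair<:2:> = <% 3, 4 %>;

int digraph_check(void)
<%
    /* Digraphs are stringized in their own spelling: "<:", not "[". */
    assert(strcmp(STR(<:), "<:") == 0);
    return pair<:0:> + pair<:1:>;       /* 3 + 4 */
%>
```

Apart from stringizing, the digraph spellings behave exactly like the
tokens they stand for.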

            [n.ucn1]    UCN recognition 1

UCN is recognized in string literals, character constants, and
identifiers.

Scoring: 8.  4 points if UCN in a string literal passes preprocessing as
is.  2 points each if UCN is processed correctly in a character constant
or identifier.  No good if UCN is not recognized and output as is.

            [n.ucn2]    UCN recognition 2

UCN can be used inside a pp-number as well.  However, it has to
disappear from a number-token by the end of preprocessing.  This
specification exists in C99, not in C++.

Scoring: 2.

            [e.ucn]     UCN errors

UCN must be 8 digit hexadecimal if it starts with \U or 4 digit
hexadecimal if it starts with \u.

UCN must not be in the ranges [0..9F] and [D800..DFFF].  However,
24($), 40(@), and 60(`) are valid.

Scoring: 4.  1 point for each correct diagnosis among the 4 samples.

            [e.4.3]     Empty character constants

Even sequences that do not have the form of a C token are recognized as
pp-tokens.  Therefore, there are not many error cases in the
tokenization of preprocessing.

However, there are some other cases which result in undefined behavior
(refer to [3.2].)

An empty character constant is a violation of syntax rules on a
preprocessing #if line or in compilation. *

Scoring: 2.

  *  ANSI C 3.1.3.4 (C90 6.1.3.4) Character constants -- Syntax
     C99 6.4.4.4 Character constants -- Syntax


        [2.4.5]     Spaces and Tabs on a Preprocessing Directive Line

Spaces, tabs, vertical-tabs, form-feeds, carriage-returns, and new-lines
are all white spaces.  White spaces that are not in string literals,
character constants, header names, or comments usually have a meaning as
a token separator.  However, new-lines that remain until translation
phase 4 are special and serve as pp-directive separators.  There are
slight restrictions on the white spaces that can be used in pp-directive
lines.

            [n.5.1]     Spaces and tabs before and after #

In Standard C, spaces and tabs before and after the #, which is the
first pp-token on a preprocessing directive line, are guaranteed to
give the same result whether they exist or not. *  In K&R 1st, this is
not clear and there actually were implementations which did not accept
spaces and tabs before and after #.

Spaces and tabs later in the line are interpreted in K&R 1st as mere
token separators, the same as in Standard C.

However, if white spaces other than spaces and tabs appear on the
pp-directive line, the behavior is undefined (refer to [u.1.6] of
[3.2].)

Scoring: 6.

  *  ANSI C 3.8 (C90 6.8)  Preprocessing directives -- Description,
    Constraints
     C99 6.10 Preprocessing directives -- Description, Constraints
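
A minimal sketch of n.5.1 follows.  Only spaces are used here; tabs are
expected to behave the same way.  PLAIN, INDENTED, SPACED and
directive_space_check are names made up for this illustration.

```c
#include <assert.h>

#define PLAIN 1
    #define INDENTED 2
  #     define   SPACED   3

/* All three directives must have been recognized. */
int directive_space_check(void)
{
    assert(PLAIN == 1);
    return INDENTED + SPACED;   /* 2 + 3 */
}
```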


        [2.4.6]     #include

#include is the most basic pp-directive since K&R 1st.  However, the
specifications on this directive in Standard C have many undefined and
implementation-defined portions. *  The reasons are below.

1. A pp-token called header-name which is an argument for this directive
  is dependent on OS file systems and difficult to standardize.
2. "Standard" location to search files depends on implementations.
3. Even a header-name in the format similar to a string literal,
  enclosed by " and ", is a pp-token different from a string literal,
  as \ is not an escape character.  Furthermore, a header-name enclosed
  by < and > is the most irregular pp-token.

In n.6.*, the most basic forms below are not tested as separate items.
They are implicitly covered by the other test items, on the premise that
an implementation unable to process them is out of the question, since
many tests could not be performed at all.

    #include    <ctype.h>
    #include    "header.h"

  *  ANSI C 3.8.2 (C90 6.8.2) Source file inclusion
     C99 6.10.2 Source file inclusion

            [n.6.1]     Standard header include in 2 formats

There are 2 formats of header-names.  The only difference is that the
format enclosed by < and > searches for the header in
implementation-defined places (possibly multiple places), while the one
enclosed by " and " first searches for the source file in an
implementation-defined manner and, upon failure, performs the same
search as the one enclosed by < and >.  Therefore, the format enclosed
by " and " can include standard headers as well.  This point was the
same in K&R 1st.

In Standard C, the same standard header can be included many times.
However, this is not a preprocessor issue, but a matter of how the
standard headers are written (refer to [4.1.1].)

Scoring: 10.  4 points if only one of the 2 samples is processed.

            [n.6.2]     Header-name by a macro - Part 1

In K&R 1st, no macro could be used on the #include line, but this was
officially permitted in Standard C.  If the #include argument does not
match either of the 2 formats, the macros in it are expanded.  The
result must match one of the 2 formats.

This specification has something subtle.  For example, how should the
following source be handled?

    #define MACRO   header.h>
    #include    <MACRO

Should MACRO be expanded first as below?

    #include    <header.h>

Or, should this be an error as > matching < does not exist prior to
macro expansion?

I do not think the Standards were written with this level of detail in
mind.  Therefore, I believe it is more straightforward to handle < and >
as quotation delimiters similar to " and ".  Still, expanding macros
first cannot be said to violate the Standards.  This Validation Suite
does not include tests which probe holes of this kind in the
specifications.

Scoring: 6.

            [n.6.3]     Header-name by a macro - Part 2

This is not so interesting, but the #include argument does not have to
be a single macro.

Scoring: 2.
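
The two forms could be sketched as below.  LIMITS_HEADER, MKSTR, HEADER
and include_macro_check are hypothetical names; the sketch also assumes
that no file named string.h exists beside the source, so that the
quoted form falls back to the standard header.

```c
#include <assert.h>

/* n.6.2: a single macro that expands to a header-name. */
#define LIMITS_HEADER   <limits.h>
#include LIMITS_HEADER

/* n.6.3: several macros together build the header-name "string.h". */
#define MKSTR(x)        #x
#define HEADER(name)    MKSTR(name.h)
#include HEADER(string)

int include_macro_check(void)
{
    assert(CHAR_BIT >= 8);              /* CHAR_BIT is from <limits.h> */
    return (int)strlen("header");       /* strlen is from "string.h" */
}
```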


        [2.4.7]     #line

The #line pp-directive is not usually used by a user, but is sometimes
used to pass along the file name and line numbers of the original
source when another tool pre-preprocesses the source (not necessarily
C.)  As this has been around since K&R 1st, it must traditionally have
had its own uses.

In addition, #line or its variant is generally used in preprocessor
output to pass along file name and line number information to the
compiler proper.  However, this is not defined as a specification.

The file name and line number specified in #line become the values of
the predefined macros __FILE__ and __LINE__ (in addition, __LINE__ is
incremented for every physical line.) *

  *  ANSI C 3.8.4 (C90 6.8.4) Line control
     C99 6.10.4 Line control
     ANSI C 3.8.8 (C90 6.8.8) Predefined macro names
     C99 6.10.8 Predefined macro names

            [n.7.1]     Line number and file name specification

#line specifying a line number and a file name is the same in K&R 1st
and Standard C.

The file name is a string literal, a token different from the
header-name for #include.  Strictly speaking, there are subtle problems
in \ handling and the like.  However, no problems will arise for valid
source (it is fortunate that the file name for #line is not in the
<stdio.h> format.)

Scoring: 6.

            [n.7.2]     No filename may be specified

The filename argument is optional and does not have to exist.  This is
the same as in K&R 1st (undefined if no line number is specified.)

Scoring: 4.

            [n.7.3]     Line number and file name specification by
                                a macro

In K&R 1st, the #line argument could not use a macro.  This is permitted
in Standard C.

Scoring: 4.
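
The three normal cases n.7.1 to n.7.3 can be illustrated roughly as
follows.  The file names and the names LINE_NUMBER, FILE_NAME and
line_directive_check are invented for this sketch.

```c
#include <assert.h>
#include <string.h>

#define LINE_NUMBER 2000
#define FILE_NAME   "renamed.c"

/* Returns 1 when __LINE__ and __FILE__ follow every #line below. */
int line_directive_check(void)
{
    int line;
    const char *file;

    /* n.7.1: the line after the directive becomes line 1234 of
       "virtual.c". */
#line 1234 "virtual.c"
    line = __LINE__; file = __FILE__;
    assert(line == 1234 && strcmp(file, "virtual.c") == 0);

    /* n.7.2: without a file name only the line number changes. */
#line 100
    line = __LINE__; file = __FILE__;
    assert(line == 100 && strcmp(file, "virtual.c") == 0);

    /* n.7.3: both arguments may come from macros. */
#line LINE_NUMBER FILE_NAME
    line = __LINE__; file = __FILE__;
    return line == 2000 && strcmp(file, "renamed.c") == 0;
}
```

Note that the last #line stays in effect for the rest of the
translation unit, so later diagnostics refer to the renamed file.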

            [n.line]    Line number range

The line number range for the #line directive was [1..32767] in C90, but
was extended to [1..2147483647] in C99.

Scoring: 2.

            [e.7.4]     File name in wide string literal

The filename argument must be a string literal.  This is not so
interesting, but a wide string literal is a violation of constraint
(other pp-tokens are undefined for some reason.  This is an unbalanced
specification.)

Scoring: 2.


        [2.4.8]     #error

#error is a directive newly introduced in Standard C.  It displays an
error message that includes its argument at preprocessing.  Some old
implementations had directives such as #assert.  However, these were
not standard.

It is not specified that #error should terminate the process.
According to the Rationale, this was left unspecified because the
Standard cannot make requirements to that extent.  However, it says
that terminating the process is the ANSI C committee's intention. *

  *  ANSI C 3.8.5 (C90 6.8.5) Error directive
     ANSI C Rationale 3.8.5
     C99 6.10.5 Error directive
     C99 Rationale 6.10.5

            [n.8.1]     No macro expansion

Macros on the #error line are not expanded.  The only control lines on
which macros are expanded are #if (#elif), #include, and #line. *

Scoring: 8.  2 points if macros are expanded in processing.  Whether to
terminate the process does not matter.

  *  ANSI C 3.8 (C90 6.8)  Preprocessing directives -- Semantics
     C99 6.10 Preprocessing directives -- Semantics

            [n.8.2]     Optional message

This is not so interesting, but the #error line argument is optional and
it does not have to exist.

Scoring: 2.


        [2.4.9]     #pragma, _Pragma()

#pragma was also introduced in Standard C.  Extended directives which
are unique to an implementation are all supposed to be implemented as
#pragma sub-directives. *1

  *1  ANSI C 3.8.6 (C90 6.8.6) Pragma directive
      C99 6.10.6 Pragma directive

            [n.9.1]     No error for unrecognized #pragma

Different #pragma sub-directives are recognized by each implementation.
As it would not be possible to write portable programs if each
unrecognized #pragma became an error, a #pragma which an implementation
cannot recognize is ignored.  In preprocessing, only the #pragma lines
recognized as relevant to preprocessing are processed, while all the
other #pragma lines are passed to the compiler proper as is.

Scoring: 10.  #pragma must not cause an error, though it is acceptable
to issue a warning.  10 points are still given when preprocessing does
not issue an error but the compiler proper does; this should not occur,
but it is not a mistake in preprocessing even if it does.  As a matter
of convenience, 0 points are given in that case if the implementation
has no independent preprocessor and this distinction cannot be made.

            [n.pragma]  _Pragma() operator

C99 introduced the _Pragma() operator, which has the same effect as
#pragma but, unlike #pragma, can be written in a macro definition.

In addition, when the pp-token following pragma is STDC, the pragma
designates a standard feature, and macro expansion is prohibited in this
case.  In other cases, whether macro expansion is performed is
implementation-defined.

Scoring: 6.

            [e.pragma]  _Pragma() argument is string literal

The _Pragma() operator argument must be a string literal.

Scoring: 2.


        [2.4.10]    #if, #elif, #else, and #endif

#if, #else, and #endif have been supported since K&R 1st. *  On
implementations which cannot use #if, none of n.10.1, n.11.*, n.12.*,
n.13.*, e.12.*, and e.14.* can be processed, nor can many other tests.

  *  ANSI C 3.8.1 (C90 6.8.1) Conditional inclusion
     C99 6.10.1 Conditional inclusion

            [n.10.1]    #elif support

#elif was added in Standard C.  By using this, we can avoid the
illegibility of deeply nested #if.

Macros can be used in the #if expression.  An identifier not defined as
a macro is evaluated as 0.

In the Standards, the range from #if to the corresponding #endif is
called a #if section, and each block delimited by #if (#ifdef, #ifndef),
#elif, #else, or #endif within the section is called a #if group.  The
expressions on the #if and #elif lines are called #if expressions.

Scoring: 10.

            [n.10.2]    pp-token processing in the #if group which is
                                skipped

In a skipped #if group, preprocessing directives are not processed,
except that #if, #ifdef, #ifndef, #elif, #else, and #endif are checked
in order to trace the correspondence of the #if groups.  Macros are not
expanded, either.

However, tokenization takes place.  First of all, that is because C
source is a sequence of comments and pp-tokens from the beginning to
the end.  Secondly, comments at least must be processed in order to
check the correspondence of #if and the others, and in order to decide
whether /* and the like are comment symbols, it must be made sure that
they are not in a string literal or character constant.

Scoring: 6.
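
Both points can be sketched as follows.  MODE, UNDEFINED_NAME,
SELECTED, SKIPPED and conditional_check are hypothetical names chosen
for this illustration; UNDEFINED_NAME is deliberately never defined.

```c
#include <assert.h>

#define MODE 2

/* n.10.1: #elif instead of a nested #if.  UNDEFINED_NAME is not a
   macro, so it is evaluated as 0. */
#if MODE == 1
#define SELECTED 10
#elif MODE == 2 && UNDEFINED_NAME == 0
#define SELECTED 20
#else
#define SELECTED 30
#endif

/* n.10.2: the skipped group is still tokenized, so the #else inside
   the comment below is comment text and not a directive. */
#if 0
/* a comment that spans lines in a skipped group
#else
   ends here */
#define SKIPPED 1
#else
#define SKIPPED 2
#endif

int conditional_check(void)
{
    assert(SELECTED == 20);
    return SKIPPED;     /* 2, taken from the real #else group */
}
```

A preprocessor that does not process comments in the skipped group
would take the #else inside the comment as a directive.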


        [2.4.11]    #if defined

An operator, defined, was introduced for #if expressions in Standard C.
This integrates #ifdef and #ifndef into #if (#elif) and prevents
illegibility of multiple #ifdef nesting.

            [n.11.1]    #if defined

The operand of the defined operator is an identifier.  It may be
written either enclosed by ( and ) or not.  Both forms mean the same
and are evaluated as 1 if the operand is defined as a macro, 0
otherwise.

Scoring: 8.  2 points if only one of the 2 #if sections is processed.

            [n.11.2]    defined is a unary operator

defined is an operator which yields a value of either 1 or 0.
Therefore, its result can be combined with other expressions in further
operations.

Scoring: 2.
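
A short sketch of n.11.1 and n.11.2.  FEATURE_A, FEATURE_B, BOTH_FORMS,
ARITHMETIC and defined_check are names assumed here for illustration;
FEATURE_B is deliberately left undefined.

```c
#include <assert.h>

#define FEATURE_A               /* defined, with an empty body */

/* n.11.1: both spellings of the operand are accepted. */
#if defined FEATURE_A && defined(FEATURE_A) && !defined(FEATURE_B)
#define BOTH_FORMS 1
#else
#define BOTH_FORMS 0
#endif

/* n.11.2: the result is an ordinary 1 or 0 and can take part in
   arithmetic: (1 + 0) * 10 == 10. */
#if (defined FEATURE_A + defined FEATURE_B) * 10 == 10
#define ARITHMETIC 1
#else
#define ARITHMETIC 0
#endif

int defined_check(void)
{
    assert(BOTH_FORMS == 1);
    return ARITHMETIC;
}
```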


        [2.4.12]    #if Expression Type

The #if expressions were vaguely defined as constant expressions in K&R
1st and their types were not clear.  In C90, the #if expressions are
integer constant expressions, and it was made clear that int and
unsigned int are handled as if they had the same internal
representation as long and unsigned long respectively.  In other words,
the #if expressions, including their sub-expressions, are all evaluated
in long and unsigned long.  To restate, they are handled as if all
constant tokens had the L or UL suffix. *1

In C99, the #if expression type became the widest integer type of each
implementation.  These types are given the typedef names intmax_t and
uintmax_t in the standard header <stdint.h>.  As long long is required
in C99, the #if expression type is long long or wider.  The suffixes
LL (ll) and ULL (ull) are used to write long long/unsigned long long
constants. *2

The length modifier ll with an appropriate conversion specifier (as in
%lld, %llu, and %llx) is used to display the value of long long/
unsigned long long in printf().  The length modifier to display the
value of intmax_t or uintmax_t is j (as in %jd, %ju, and %jx.) *3

  *1  ANSI C 3.8.1 (C90 6.8.1) Conditional inclusion -- Semantics
  *2  C99 6.10.1 Conditional inclusion -- Semantics
  *3  C99 7.19.6.1 The fprintf function -- Description

            [n.12.1]    long type #if expressions

In C90, constant expressions of the long type in the #if expressions
must be evaluated.

Among the implementations with sizeof (long) > sizeof (int), there are
some which can evaluate the #if expressions only in int and unsigned
int.  There are also implementations that silently truncate long in
evaluation.  The second sample checks for the latter.

Scoring: 6.

            [n.12.2]    unsigned long type #if expressions

In C90, constant expressions of the unsigned long type in the #if
expressions must be evaluated.

Scoring: 4.

            [n.12.3]    Octal numbers

Octal numbers must be evaluated in the #if expressions.  This is
similar to K&R 1st, except that a constant exceeding the maximum value
of long was evaluated as a negative value of long in K&R 1st while it
is evaluated as unsigned long in C90. *

Scoring: 4.  2 points when recognized as an octal number but evaluated
as a negative number or overflow.  0 points when not recognized as an
octal number.

  *  ANSI C 3.1.3.2 (C90 6.1.3.2) Integer constants
     C99 6.4.4.1 Integer constants

            [n.12.4]    Hexadecimal numbers

Hexadecimal numbers must also be evaluated in the #if expressions.
This is similar to K&R 1st, except that a constant exceeding the maximum
value of long was evaluated as a negative value of long in K&R 1st
while it is evaluated as unsigned long in C90.

Scoring: 4.  2 points when recognized as a hexadecimal number but
evaluated as a negative number or overflow.  0 points when not
recognized as a hexadecimal number.

            [n.12.5]    Suffixes L and l

Constant tokens with the suffix L or l must also be evaluated in the #if
expressions.  For constants not exceeding the maximum value of long,
this is the same as in K&R 1st.  The suffix makes no difference to
evaluation in preprocessing.

Scoring: 2.

            [n.12.6]    Suffixes U and u

Constant tokens with the suffix U or u must also be evaluated in the #if
expressions.  This notation was not supported in K&R 1st and was
officially accepted in C90.

Scoring: 6.

            [n.12.7]    Negative numbers

Negative numbers must also be handled in the #if expressions.  This has
been specified since K&R 1st.

Scoring: 4.

            [e.12.8]    Constant token value out of range to represent

In C90, it is a violation of constraint (an error, to put it simply) if
the value of an integer constant token which appears in the #if
expression is not in the range which can be represented in long or
unsigned long.  This is not defined directly as a specification for the
#if expressions, but there is a specification regarding constant
expressions in general, and the #if expressions are no exception. *

The character constant overflow is tested in e.32.5, e.33.2, and e.35.2.

Scoring: 2.

  *  ANSI C 3.4 (C90 6.4)  Constant expressions -- Constraints
     C99 6.6 Constant expressions -- Constraints

            [n.llong]   long long/unsigned long long evaluation

In C99, constant expressions of at least long long/unsigned long long
must be evaluated in the #if expressions.

A constant exceeding the maximum value of long long is evaluated as
unsigned long long.

The suffixes LL, ll, ULL, and ull were added in C99. *

Scoring: 10.  2 points each for processing each of 5 samples correctly.

  *  C99 6.4.4.1 Integer constants
     C99 6.6 Constant expressions -- Constraints
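
The type rules of [2.4.12] can be tried with a sketch like the one
below.  It assumes a C99 or later implementation, so that intmax_t is at
least 64 bits wide.  LONG_OK, HEX_UNSIGNED, LLONG_OK, ULLONG_OK and
if_type_check are names made up for this illustration.

```c
#include <assert.h>

/* n.12.1: arithmetic beyond a 16-bit int; 2^16 * 2^14 is 2^30. */
#if 65536 * 16384 == 1073741824
#define LONG_OK 1
#else
#define LONG_OK 0
#endif

/* n.12.4: a hexadecimal constant with all 64 bits set is evaluated as
   a positive value, not as a negative one. */
#if 0xFFFFFFFFFFFFFFFF > 0
#define HEX_UNSIGNED 1
#else
#define HEX_UNSIGNED 0
#endif

/* n.llong: long long arithmetic with the LL suffix. */
#if 9223372036854775807LL / 1000000000LL == 9223372036LL
#define LLONG_OK 1
#else
#define LLONG_OK 0
#endif

/* n.llong: unsigned long long arithmetic with the ULL suffix. */
#if 0xFFFFFFFFFFFFFFFFULL / 2 == 0x7FFFFFFFFFFFFFFF
#define ULLONG_OK 1
#else
#define ULLONG_OK 0
#endif

/* Returns the number of checks that were evaluated as expected. */
int if_type_check(void)
{
    assert(LONG_OK == 1);
    return LONG_OK + HEX_UNSIGNED + LLONG_OK + ULLONG_OK;
}
```

An implementation that truncates to int or long silently would select
the #else branch in one or more of these sections.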

            [e.intmax]  Operation result out of intmax_t range

A constant or constant expression which exceeds the intmax_t or
uintmax_t range is a violation of constraint in the #if expression in
C99.

Scoring: 2.  If <stdint.h> is not available, it is acceptable to define
a suitable macro instead.


        [2.4.13]    #if Expression Evaluation

The #if expressions are a kind of integer constant expression.  They
differ from standard integer constant expressions as follows.

1. In C90, int and unsigned int are evaluated as if they have the same
internal representation as long and unsigned long respectively.  In C99,
int and long are evaluated as if they have the same internal
representation as intmax_t.  unsigned int and unsigned long are
evaluated as if they have the same internal representation as uintmax_t.

2. The 'defined' operator is supported.

3. All identifiers remaining after macro expansion are evaluated as
0. *1

4. As there exist no keywords in preprocessing, identifiers with the
same name as a keyword are treated as just identifiers.  Therefore,
neither cast nor sizeof can be used.

Function calls and comma operators cannot be used in standard integer
constant expressions.  Since constant expressions contain no variables,
no assignments, increments, decrements, or arrays can be used.

In n.13, evaluation rules common with generic integer constant
expressions are tested.  Among these, n.13.5 is different from K&R 1st.
n.13.6 is an area in which pre-Standard #if expressions were entirely
different.  n.13.13 and n.13.14 were not clear in K&R 1st.  The rest is
unchanged since K&R 1st. *2

n.13 uses only small values so that these rules can be tested even on
implementations where only int values can be evaluated in the #if
expressions.  The defined operator and >= are not in n.13, but are
tested elsewhere.

  *1 In C++ Standard, 'true' and 'false' are treated differently and
    evaluated as 1 and 0 respectively.  These are not macros, but
    keywords; however, they are treated as boolean literals in
    preprocessing.

  *2 ANSI C 3.3 (C90 6.3) Expressions
     ANSI C 3.4 (C90 6.4) Constant expressions
     C99 6.5 Expressions
     C99 6.6 Constant expressions

            [n.13.1]    << and >>

Bit shift operations have no troublesome issues regarding positive
numbers, at least.

Scoring: 2.

            [n.13.2]    ^, |, and &

For positive integers within the range of the type, the bit pattern of
a given value is the same regardless of CPU or implementation.
Therefore, operations such as ^, |, and &, which may appear to depend
on CPU specifications, give exactly the same results in that range on
any implementation.

Scoring: 2.

            [n.13.3]    || and &&

All implementations should be able to process these.

Scoring: 4.

            [n.13.4]    ? :

All implementations should be able to process these, too (or so it
seemed, but it is not so in reality.)

Scoring: 2.

            [n.13.5]    No usual arithmetic conversion in << and >>

Usual arithmetic conversions are performed for many binary operators in
order to match the types of both sides.  In K&R 1st, usual arithmetic
conversions were performed for shift operators as well, but this is not
the case in Standard C.  This is a reasonable specification,
considering that the right side value is always a small positive number
and that, in internal representations other than 2's complement,
converting a negative number into a positive number also changes the
bit pattern, which causes confusion. *

Scoring: 2.

  *  ANSI C 3.3.7 (C90 6.3.7) Bitwise shift operators -- Semantics
     C99 6.5.7 Bitwise shift operators -- Semantics

For each of many binary operators the Standard states, "If both
operands have arithmetic type, the usual arithmetic conversions are
performed on them."  However, this is not stated for << and >>.  An
explanation is given in C89 Rationale 3.3.7 (C99 Rationale 6.5.7.)

            [n.13.6]    Conversion from a negative number into a
                        positive number by usual arithmetic conversion

Usual arithmetic conversions are applied to the operands on both sides
of binary operators such as *, /, %, +, -, <, >, <=, >=, ==, !=, &, ^,
and |, in order to match the types of both sides.  The same applies to
the second and third operands of the ternary operator, ? :.  Therefore,
if one side has an unsigned type, the other is converted into the
unsigned type, which converts a negative value into a positive number.

In standard integer constant expressions, the integer promotion is
applied to the operand before the usual arithmetic conversion if each
operand is in the integer type shorter than int.  However, in the #if
expressions, no integer promotion occurs since all operands are handled
as the same size type.

Scoring: 6.  2 points for each of the 3 tests processed correctly.

This evaluation rule is the same in K&R 1st.  However, this sample
cannot be processed by K&R 1st based implementations since the constant
token 0U uses the U suffix.  In that case, the test needs to be re-run
with this 0U replaced by a value big enough to be evaluated as an
unsigned type but not as big as the value obtained by converting -1
into the unsigned type.  Some pre-ANSI implementations do not support
any way of writing #if expressions with unsigned types.
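
The conversion described in n.13.6 can be observed with a sketch such as
the following.  LESS_THAN, PRODUCT_POSITIVE, CONDITIONAL_POSITIVE and
conversion_check are hypothetical names.  Some preprocessors issue a
warning about the sign change here, which is not an error.

```c
#include <assert.h>

/* -1 is converted to the largest unsigned value before the
   comparison, so it is not less than 0U. */
#if -1 < 0U
#define LESS_THAN 1
#else
#define LESS_THAN 0
#endif

/* The product is unsigned as well: -1 * 1U is a large positive
   value. */
#if -1 * 1U > 0
#define PRODUCT_POSITIVE 1
#else
#define PRODUCT_POSITIVE 0
#endif

/* The second and third operands of ? : are converted to a common
   type, so the selected -1 becomes unsigned. */
#if (1 ? -1 : 0U) > 0
#define CONDITIONAL_POSITIVE 1
#else
#define CONDITIONAL_POSITIVE 0
#endif

/* Returns 1 when all three conversions took place. */
int conversion_check(void)
{
    assert(LESS_THAN == 0);
    return PRODUCT_POSITIVE == 1 && CONDITIONAL_POSITIVE == 1;
}
```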

            [n.13.7]    Evaluation short-circuit in ||, &&, and ? :

The order of evaluation is defined for the || and && operators, and the
right side is not evaluated if the result is determined by evaluating
the left side.  In ? :, depending on the value of the first operand,
either the second or the third operand is evaluated, but not the
other. *

Therefore, even division by 0 is not an error in a term that is not
evaluated.

Scoring: 6.  Subtract 2 points from 6 for each of the 5 samples that
fails.  0 points for failure of 3 or more samples.  Subtract 2 points
for a wrong diagnostic, even if the implementation processes the sample
successfully.

  * In the ? : operator, however, the usual arithmetic conversion is
    performed between the second operand and the third one.  It is
    strange to perform a conversion even when no evaluation is done.
    In particular, as the type of an integer constant token used in the
    #if expression is not determined until its value is evaluated, the
    value has to be evaluated anyway in order to determine the type
    (though only whether it is signed or unsigned).  However, no
    division by 0 is allowed.  This is rather messy.
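
The short-circuit rule of n.13.7 may be sketched as follows.  ZERO,
AND_RESULT, OR_RESULT, COND_RESULT and short_circuit_check are names
invented for this illustration; none of the divisions by ZERO is ever
evaluated.

```c
#include <assert.h>

#define ZERO 0

/* The right operand of && is not evaluated when the left one is 0. */
#if ZERO && (10 / ZERO)
#define AND_RESULT 1
#else
#define AND_RESULT 0
#endif

/* The right operand of || is not evaluated when the left one is
   non-zero. */
#if !ZERO || (10 / ZERO)
#define OR_RESULT 1
#else
#define OR_RESULT 0
#endif

/* Only the third operand of ? : is evaluated here. */
#if (ZERO ? 10 / ZERO : 5) == 5
#define COND_RESULT 1
#else
#define COND_RESULT 0
#endif

/* Returns 1 when every expression gave the expected value. */
int short_circuit_check(void)
{
    assert(AND_RESULT == 0);
    return OR_RESULT == 1 && COND_RESULT == 1;
}
```

A conforming preprocessor is expected to accept this without a
division-by-zero diagnostic.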

            [n.13.8]    Grouping of unary operators, -, +, !, and ~

n.13.8 to n.13.12 are tests for grouping sub-expressions in the #if
expressions.  Sub-expressions are grouped according to the precedence
and associativity of operators.  Though some parts of standard integer
constant expressions are decided by the syntax before precedence comes
into play, the #if expressions have no such syntax issues other than
grouping by ( and ).  n.13.8 to n.13.10 are tests for associativity;
n.13.8 tests the associativity of the unary operators -, +, !, and ~.
All unary operators associate from right to left.

Scoring: 2.

            [n.13.9]    Grouping of ? :

The conditional operator, ? :, associates from right to left.

Scoring: 2.

            [n.13.10]   Grouping << and >>

All binary operators associate from left to right.  n.13.10 tests
<< and >>.

Scoring: 2.

            [n.13.11]   Grouping of operators with different precedence
                                - Part 1

Here, we test expressions including unary operators, -, +, and !, and
binary operators, *, /, and >>, which have different precedence and
associativity.

Scoring: 2.

            [n.13.12]   Grouping of operators with different precedence
                                - Part 2

Here, we test grouping of even more complex expressions including unary
operators, -, +, ~, and !, binary operators, -, *, %, >>, &, |, ^, ==,
and !=, and the ternary operator, ? :.

Scoring: 2.
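
The grouping rules of n.13.8 to n.13.12 can be checked with expressions
whose value changes if they are grouped wrongly.  The expressions and
the names UNARY_OK, CONDITIONAL_OK, SHIFT_OK, MIXED_OK and
grouping_check below are made up for this sketch and are not the
samples of this Validation Suite.

```c
#include <assert.h>

/* n.13.8: unary operators group from right to left. */
#if - - 1 == 1 && -!+!9 == -1 && -~1 == 2
#define UNARY_OK 1
#else
#define UNARY_OK 0
#endif

/* n.13.9: 1 ? 2 : (0 ? 3 : 4) is 2; grouped from the left it would
   be 3. */
#if (1 ? 2 : 0 ? 3 : 4) == 2
#define CONDITIONAL_OK 1
#else
#define CONDITIONAL_OK 0
#endif

/* n.13.10: (1 << 2) << 3 is 32 and (64 >> 2) >> 1 is 8. */
#if 1 << 2 << 3 == 32 && 64 >> 2 >> 1 == 8
#define SHIFT_OK 1
#else
#define SHIFT_OK 0
#endif

/* n.13.11, n.13.12: operators of different precedence. */
#if - 2 + 10 / 2 * 3 == 13 && (1 | 2 ^ 3 & 4) == 3 && 7 - 2 * 3 % 4 == 5
#define MIXED_OK 1
#else
#define MIXED_OK 0
#endif

/* Returns the number of groups that were evaluated as expected. */
int grouping_check(void)
{
    assert(UNARY_OK == 1);
    return UNARY_OK + CONDITIONAL_OK + SHIFT_OK + MIXED_OK;
}
```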

            [n.13.13]   Macros expanded into operators

The use of macros is allowed in the #if expressions.  These macros are
usually expanded into integer constants, but we do not test these here
since they are included in the n.10.1, n.12.1, n.12.2, and n.13.7 tests.

Though macros expanded into operators are not ordinary, they should be
handled as operators in principle.  ISO C 1990/Amendment 1 specified a
standard header called <iso646.h> which defines some operators as
macros. *  The purpose seems to be to allow source to be written without
using characters such as &, |, !, ^, and ~.  Preprocessing is required
to expand these macros in #if and handle them as operators.

On the other hand, the Standard leaves it undefined when a macro in the
#if expression expands into 'defined'.  I suspect that defined is
handled separately since it is similar to an identifier (refer to
[u.1.19].)

Scoring: 4.

  * In C++ Standard, these identifier-like operators are not macros but
    tokens for some reason.

            [n.13.14]   Macro expanded into 0 tokens

A #if expression including a macro which expands into 0 tokens is not
ordinary, either.  It should be evaluated after the macro is removed
(expanded.)

Scoring: 2.
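
As an illustration of n.13.13 and n.13.14, assuming an implementation
which provides <iso646.h>.  EMPTY, OPERATORS_OK, EMPTY_OK and
macro_operator_check are hypothetical names.

```c
#include <assert.h>
#include <iso646.h>     /* defines and, or, not, ... as macros in C */

#define EMPTY           /* expands into 0 tokens */

/* n.13.13: the macros are expanded and then act as operators, so this
   is 1 && !0. */
#if 1 and not 0
#define OPERATORS_OK 1
#else
#define OPERATORS_OK 0
#endif

/* n.13.14: every EMPTY disappears, leaving 1 + 2 == 3. */
#if EMPTY 1 EMPTY + 2 == 3 EMPTY
#define EMPTY_OK 1
#else
#define EMPTY_OK 0
#endif

/* Returns 1 when both #if expressions were evaluated as expected. */
int macro_operator_check(void)
{
    assert(OPERATORS_OK == 1);
    return EMPTY_OK;
}
```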


        [2.4.14]    #if Expression Error

e.14.1 to e.14.10 are tests for violations of syntax rules and
violations of constraint in the #if expressions.  Compiler systems must
issue a diagnostic message for any source containing one of these. *

  *  ANSI C 3.8.1 (C90 6.8.1) Conditional inclusion
     ANSI C 3.4 (C90 6.4) Constant expressions
     C99 6.10.1 Conditional inclusion
     C99 6.6 Constant expressions

            [e.14.1]    String literals

As the #if expressions are integer constant expressions and pointers
cannot be used, string literals cannot be used.

Scoring: 2.

            [e.14.2]    Operators such as =, ++, --, and .

As the #if expressions are constant expressions, operators and variables
with side effects cannot be used.  A --B is different from A - -B and is
a violation of constraints.

Scoring: 4.  2 points if one of 4 samples cannot be correctly diagnosed.
0 points if 2 or more.

            [e.14.3]    Incomplete expressions

Missing one operand in a binary operator or parenthesis is also a
violation of syntax rules.

Scoring: 2.

            [e.14.4]    Missing #if defined parenthesis

An argument for the defined operator on the #if line may or may not be
enclosed by ( and ); however, it is a violation of constraints if only
one of the pair of parentheses exists.

Scoring: 2.

            [e.14.5]    No expressions

#if alone, without any expression, is certainly a violation of syntax
rules.

Scoring: 2.

            [e.14.6]    No expression after macro expansion

An identifier not defined as a macro is evaluated as 0; however, an
argument of the #if line which disappears after macro expansion is a
violation of syntax rules.

Scoring: 2.

            [e.14.7]    Unrecognized keyword - sizeof

sizeof, a pp-token, is simply treated as an identifier and evaluated as
0 in the #if expression if it is not defined as a macro.  A pp-token
called int is the same.  Therefore, sizeof (int) becomes 0 (0) and it is
a violation of syntax rules.

Scoring: 2.

            [e.14.8]    Unrecognized keyword - type name

Just as in e.14.7, (int)0x8000 becomes (0)0x8000 and it is a violation
of syntax rules.

Scoring: 2.

            [e.14.9]    Division by 0

This e.14.9 and the next e.14.10 admit of several interpretations as to
whether a diagnostic message should be issued, and the specifications
are vague.  The Standards contain the following specifications.

    ANSI C 3.4 (C90 6.4) Constant expressions -- Constraint
    C99 6.6 Constant expressions -- Constraint
        Each constant expression shall evaluate to a constant that is in
    the range of representable values for its type.

The applicable range of this specification is not clear; however, it
clearly applies at least where constant expressions are required.  The
#if expressions must be constant expressions.  On the other hand, there
are the specifications below.

    ANSI C 3.3.5 (C90 6.3.5) Multiplicative operators -- Semantics
    C99 6.5.5 Multiplicative operators -- Semantics
        if the value of the second operand is zero, the behavior is
    undefined.

    ANSI C 3.1.2.5 (C90 6.1.2.5) Types
    C99 6.2.5 Types
        A computation involving unsigned operands can never overflow,

Which specification should be applied to division by 0 and to unsigned
operations?  It seems either interpretation is possible.

However, we interpret it as follows here: where a constant expression
is required, a diagnostic message must be issued if the result does not
fit in the range of its type, and this includes division by 0.  A
diagnostic seems appropriate because only an error in the program can
cause this type of result, and because constant expressions are
evaluated at compilation rather than at execution.  It would also be
unnatural to treat only division by 0 as an exception.  However, since
the specification gets doubly vague in case the result of an unsigned
operation is out of range, we do not include it here, but interpret it
as undefined.

ISO 9899:1990/Corrigendum 1 added a specification, "A conforming
implementation shall produce at least one diagnostic message .. if ..
translation unit contains a violation of any syntax rule or constraint,
even if the behavior is also explicitly specified as undefined or
implementation-defined."  This was carried on by C99. *

Scoring: 2.

  *  C99 5.1.1.3 Diagnostics

            [e.14.10]   Operation results out of representable range

In C90, values of the #if expressions must be in the range representable
as long/unsigned long.

Scoring: 4.  4 points for correctly diagnosing all 4 tests.  2 points
for correctly diagnosing 2 or 3 tests.  0 points if only 1 or none is
correctly diagnosed.


        [2.4.15]    #ifdef and #ifndef

n.15 tests #ifdef and #ifndef.  These are exactly the same in K&R 1st
and Standard C.  e.15 tests violations of their syntax rules. *

  *  ANSI C 3.8 (C90 6.8) Preprocessing directives -- Syntax
     ANSI C 3.8.1 (C90 6.8.1) Conditional inclusion
     C99 6.10 Preprocessing directives -- Syntax
     C99 6.10.1 Conditional inclusion

            [n.15.1]    #ifdef macro testing

Scoring: 6.

            [n.15.2]    #ifndef macro testing

Scoring: 6.

            [e.15.3]    Argument which is not an identifier

Arguments on the #ifdef and #ifndef lines must be identifiers.

Scoring: 2.

            [e.15.4]    Extra argument

Arguments on the #ifdef and #ifndef lines must not have extra tokens
other than identifiers.

Scoring: 2.

            [e.15.5]    Missing argument

It is a violation of syntax rules not to have any arguments.

Scoring: 2.


        [2.4.16]    #else and #endif Errors

Next is a test of violations of syntax rules for #else and #endif.  This
syntax has not changed since K&R 1st. (However, Standard C introduced a
new specification that a diagnostic message must be issued for a
violation of syntax rules or constraints.) *

  *  ANSI C 3.8 (C90 6.8) Preprocessing directives -- Syntax
     C99 6.10 Preprocessing directives -- Syntax

            [e.16.1]    Extra token in #else

The #else line must not have any other tokens.

Scoring: 2.

            [e.16.2]    Extra token in #endif

The #endif line must not have any other tokens.

Do not write below.

#if     MACRO
#else   ! MACRO
#endif  MACRO

Use below instead.

#if     MACRO
#else   /* ! MACRO  */
#endif  /* MACRO    */

Scoring: 2.


        [2.4.17]    #if, #elif, #else, and #endif Mismatching Errors

The next tests cover violations of syntax rules for matching #if
(#ifdef, #ifndef), #elif, #else, and #endif.  This syntax has been
almost the same since K&R 1st except that #elif was added in Standard C.
In addition, K&R 1st was not clear on the point that these must match
within the source file unit. *

  *  ANSI C 3.8 (C90 6.8) Preprocessing directives -- Syntax
     C99 6.10 Preprocessing directives -- Syntax

            [e.17.1]    #endif without #if

#endif without a preceding #if is obviously a violation of syntax rule.

Scoring: 2.

            [e.17.2]    #else without #if

#else without a corresponding #if is also an error.

Scoring: 2.

            [e.17.3]    Another #else after #else

Having another #else after a #else is also prohibited.

Scoring: 2.

            [e.17.4]    #elif after #else

#elif after #else is not allowed.

Scoring: 2.

            [e.17.5]    #endif without #if in the file included

#if, #else, and #endif must be matched in the source file (preprocessing
file) unit.  It is not acceptable for the file included to be treated as
if it existed from the beginning or in the original file.

Scoring: 2.

            [e.17.6]    #if without #endif in the file included

Scoring: 2.

            [e.17.7]    #if without #endif

Forgetting #endif actually happens quite often; however, compiler
systems must issue a diagnostic message for that.

Scoring: 2.


        [2.4.18]    #define

For the #define syntax, the # and ## operators were added in C90 whereas
they did not exist in K&R.  The rest is unchanged. *1

In C99, variable argument macros were added (refer to [1.8].) *2

If #define is not possible, it cannot be called a C preprocessor.  In
such an event, 60 points will be deducted for n.18.* and e.18.* all
together, and there will be even more deductions since macros are used
in other tests.

  *1  ANSI C 3.8 (C90 6.8) Preprocessing directives -- Syntax
      ANSI C 3.8.3 (C90 6.8.3) Macro replacement
  *2  C99 6.10 Preprocessing directives -- Syntax
      C99 6.10.3 Macro replacement

            [n.18.1]    Object-like macro definition

The first token on the #define line is a macro name.  However, in case
there are white spaces immediately after, the second token is considered
to be the beginning of the replacement list even if it is '(' and not
considered as a function-like macro definition.  If there is no token
after a macro name, the macro is defined as 0 pieces of token.
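
The distinction can be sketched as below.  The macro and function names
are chosen only for illustration:

```c
#include <assert.h>

#define OBJ_LIKE        (1-1)   /* white space before '(': object-like    */
#define FTN_LIKE(a)     ( a )   /* '(' directly after name: function-like */
#define ZERO_TOKEN              /* defined as 0 pieces of token           */

/* OBJ_LIKE is replaced by (1-1); it takes no arguments. */
int obj_like_value(void)    { return OBJ_LIKE; }
/* FTN_LIKE(3) is replaced by ( 3 ). */
int ftn_like_value(void)    { return FTN_LIKE(3); }
/* ZERO_TOKEN disappears, leaving only 5. */
int zero_token_value(void)  { return ZERO_TOKEN 5; }

/* Hypothetical self-check of the three definitions. */
void check_macro_definitions(void)
{
    assert(obj_like_value() == 0);
    assert(ftn_like_value() == 3);
    assert(zero_token_value() == 5);
}
```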

Scoring: 30. 10 points if only one of 2 macros is defined correctly.

            [n.18.2]    Function-like macro definition

If '(' is immediately after a macro name without having white spaces
between, it is considered to be the beginning of the function-like macro
parameter list.  This specification has been around since K&R 1st and
has a trace of character-oriented preprocessing which is influenced by
the existence of white spaces.  Nothing can be done at this point.

Scoring: 20.

            [n.18.3]    No replacement applicable inside string literals

In a so-called "Reiser" model preprocessor, if a string literal or
character constant in the replacement list contained the same spelling
as a parameter, that portion was replaced by the argument at macro
expansion.  However, this was not accepted by Standard C nor K&R 1st.
This replacement is a specification typical of character-oriented
preprocessing; it is out of the question in token-oriented processing.
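
The token-oriented behavior can be sketched as follows.  STR_OF and
CHR_OF are hypothetical macros used only for this illustration:

```c
#include <assert.h>
#include <string.h>

/* The "a" and 'a' below are single pp-tokens; the parameter name a
 * inside them is not replaced by the argument. */
#define STR_OF(a)       "a"
#define CHR_OF(a)       'a'

void check_no_replacement_in_literals(void)
{
    assert(strcmp(STR_OF(hello), "a") == 0);    /* not "hello" */
    assert(CHR_OF(z) == 'a');                   /* not 'z'     */
}
```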

Scoring: 10.

            [n.vargs]   Variable argument macros

Variable argument macros were introduced in C99.
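
A minimal sketch of a variable argument macro follows.  FORMAT_INTO is
a hypothetical macro, and a C99 or later compiler is assumed:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* ... in the parameter list is referred to as __VA_ARGS__ in the
 * replacement list; it receives all the trailing arguments, commas
 * included.  buf must be a char array so that sizeof gives its size. */
#define FORMAT_INTO(buf, ...)   snprintf(buf, sizeof buf, __VA_ARGS__)

void check_variable_arguments(void)
{
    char    buf[32];

    FORMAT_INTO(buf, "%d-%s", 42, "abc");
    assert(strcmp(buf, "42-abc") == 0);
}
```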

Scoring: 10.

            [e.18.4]    Missing name

The first token on the #define line must be an identifier.

Scoring: 2.

            [e.18.5]    Missing argument

If there is not a single token on the #define line, it is a violation of
syntax rules.

Scoring: 2.

            [e.18.6]    Empty parameter

An empty parameter is also a violation of syntax rules. *

Scoring: 2.

  *  ANSI C 3.5.4 (C90 6.5.4) Declarators -- Syntax
     C99 6.7.5 Declarators -- Syntax

            [e.18.7]    Duplicate parameter names

Duplicate parameter names in the parameter list for one macro definition
are a violation of constraints. *

Scoring: 2.

  *  ANSI C 3.8.3 (C90 6.8.3) Macro replacement -- Constraints
     C99 6.10.3 Macro replacement -- Constraints

            [e.18.8]    Parameter which is not an identifier

A parameter in a macro definition must be an identifier. *

Scoring: 2.

  *  The ... parameter was added in C99.  __VA_ARGS__ in the replacement
    list is a special parameter name that corresponds to it.

            [e.18.9]    Special combination of a macro name and
                                a replacement list

Though '$' is not accepted as a character within an identifier in
Standard C, there are leading compiler systems which accept it.  This
sample is the kind that can be seen in source written for those
compiler systems.  Since Standard C interprets $ in this example as one
character forming one pp-token, the macro name is THIS, and everything
from $ onward becomes the replacement list of an object-like macro.
This is totally different from the intent of the program, which is a
function-like macro named THIS$AND$THAT.

In the Corrigendum 1 of ISO 9899:1990, an exceptional specification was
added to this type of example.  Standard C must issue a diagnostic
message regarding this example. *1

Conversely, C99 defined that white spaces must exist between a macro
name and a replacement list in object-like macro definitions in
general. *2

Scoring: 2.

  *1  Addition to Constraints in C90 6.8 by Corrigendum 1.
      This specification, however, has disappeared in C++ Standard.
  *2  C99 6.10.3 Macro replacement -- Constraints

            [e.vargs1]   __VA_ARGS__ not in the replacement list

Variable argument macros in C99 use __VA_ARGS__ in the replacement list
corresponding to a parameter ... in the parameter list of the macro
definition.  This identifier must not be used anywhere else.

Scoring: 2.  2 points if 2 samples are correctly diagnosed.  0 points if
only one is diagnosed correctly.


        [2.4.19]    Macro Re-definition

Macro re-definition was not mentioned in K&R 1st and implementations
were all different as well.  In Standard C, re-definition that is the
same as the original definition is allowed, but a different one is not.
In effect, macros are not re-defined (unless they are voided by
#undef). *1, *2

  *1  ANSI C 3.8.3 (C90 6.8.3) Macro replacement -- Constraints
      C99 6.10.3 Macro replacement -- Constraints
  *2  However, many compiler systems issue a warning and accept
    redefinition.  So does MCPP starting with V.2.4, to maintain
    compatibility with existing compiler systems.

            [n.19.1]    Differences in white spaces before and after
                                a replacement list

Re-definition where only the number of white spaces is different is
allowed.

Scoring: 4.

            [n.19.2]    Differences in white spaces in a parameter list
                    and differences in white spaces extending over lines

White spaces include ones extending over source lines by <backslash>
<newline> sequence or comments.
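
The allowed re-definitions of n.19.1 and n.19.2 can be sketched as
below.  The macro names are chosen only for illustration; each second
definition differs from the first one only in white spaces:

```c
#include <assert.h>

#define OBJ_LIKE        (1-1)
/* Same tokens; only white space around the replacement list differs. */
#define OBJ_LIKE        /* white space */  (1-1)    /* other comments */

#define FTN_LIKE(a)     ( a )
/* White space in the parameter list differs, and the white space inside
 * the replacement list is a comment or a line splice plus blanks. */
#define FTN_LIKE( a     )(      /* comment */   a   \
                        )

void check_compatible_redefinitions(void)
{
    assert(OBJ_LIKE == 0);
    assert(FTN_LIKE(7) == 7);
}
```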

Scoring: 4.

            [e.19.3]    Differences in token sequence in
                                a replacement list

Re-definition where token sequences in a replacement list are different
is a violation of constraint.

Scoring: 4.

            [e.19.4]    Differences in the existence of white spaces in
                                a replacement list

Re-definition where the existence of white spaces in a replacement list
is different is a violation of constraints.  This has a trace of
character-oriented preprocessing.

Scoring: 4.

            [e.19.5]    Different parameter usage

As re-definition of different parameter usage is essentially a different
definition, it is a violation of constraints.

Scoring: 4.

            [e.19.6]    Differences in parameter names

Re-definition where only parameter names are different but which is
essentially the same is a violation of constraints.  This seems to be an
excessive constraint.

Scoring: 2.

            [e.19.7]    Sharing function-like and object-like macro
                                names

As a macro name belongs to one name space, a function-like macro and an
object-like macro cannot use the same name.

Scoring: 2.


        [2.4.20]    Macro Names Same as Keywords

Since no keyword exists in preprocessing, an identifier with the same
name as a keyword can be defined as a macro and expanded. *
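
A small sketch follows.  The variable and function names are
hypothetical, and the macro is removed by #undef right after use so
that it cannot affect any header included later:

```c
#include <assert.h>
#include <stddef.h>

/* In preprocessing, float is just an identifier, so it can be a macro
 * name.  The definition is placed after the #include lines. */
#define float   double
float   fl_value = 0.5;         /* declared as double */
#undef float

size_t size_of_fl_value(void)
{
    return sizeof fl_value;
}

void check_keyword_macro(void)
{
    assert(size_of_fl_value() == sizeof (double));
}
```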

  *  ANSI C 3.1 (C90 6.1) Lexical elements -- Syntax
     C99 6.4 Lexical elements -- Syntax
     C89 Rationale 3.8.3 (C99 Rationale 6.10.3) Macro replacement

            [n.20.1]    Names are all subject to macro expansion

Scoring: 6.


        [2.4.21]    Macro Expansion Requiring Pp-token Separation

Tokenizing a source file into pp-tokens is performed in translation
phase 3.  Concatenating multiple pp-tokens into one pp-token later is
defined only in the following cases: concatenation by expanding a macro
defined using the ## operator, stringizing by expanding a macro defined
using the # operator, and concatenation of adjacent string literals.
Therefore, it is interpreted that implicit concatenation of multiple
pp-tokens must not happen.  This is obvious from the principle of
token-oriented preprocessing. *

  *  ANSI C 2.1.1.2 (C90 5.1.1.2) Translation phases
     ANSI C 3.8.3 (C90 6.8.3) Macro replacement
     C99 5.1.1.2 Translation phases
     C99 6.10.3 Macro replacement

            [n.21.1]    Pp-tokens are not concatenated implicitly

In case preprocessing is done by an independent program, the 3 -'s in
the output of this sample must be separated by some sort of token
separator so that the compiler proper can recognize them as 3 separate
pp-tokens.
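
The case of 3 -'s can be sketched as below.  The function names are
hypothetical:

```c
#include <assert.h>

#define MINUS   -

/* -MINUS-a consists of the 3 pp-tokens - - - followed by a, that is
 * -(-(-a)).  If two of the -'s were merged into a -- token, the line
 * would no longer mean this (and would not compile). */
int triple_minus(int a)
{
    return -MINUS-a;
}

void check_no_implicit_concatenation(void)
{
    assert(triple_minus(5) == -5);
}
```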

Scoring: 4.

            [n.21.2]    Separation of macro in outer macro's argument

Even if a macro is invoked in an argument of an outer macro, the
expansion result of the macro should not be merged with its surrounding
pp-tokens in the outer macro's replacement list.

Scoring: 2.


        [2.4.22]    Macro-like Sequence in a Pp-number

Preprocessing-numbers were introduced by Standard C for the first time.
They have a wider range than integer constant tokens and floating point
constant tokens put together and may include identifier-like portion.
They were defined as a specification in order to simplify tokenization
in preprocessing.  However, in case there is a macro-like sequence in a
pp-number, a wrong result may occur unless this simple tokenization is
done exactly. *

  *  ANSI C 3.1.8 (C90 6.1.8) Preprocessing numbers
     C99 6.4.8 Preprocessing numbers

            [n.22.1]    Macro-like Sequence in a Pp-number - Part 1

Since the sequence, 12E+EXP, is one pp-number, it will not be expanded
even if a macro, EXP, is defined.

Scoring: 4.

            [n.22.2]    Macro-like Sequence in a Pp-number - Part 2

A pp-number starts with a digit, or with a . followed by a digit.

Scoring: 2.

            [n.22.3]    Macros outside of a pp-number are expanded

In C90, + or - can appear inside a pp-number only if it immediately
follows E or e.  12+EXP is different from 12E+EXP and is divided into 3
pp-tokens, 12, +, and EXP.  These are a pp-number, an operator, and an
identifier, respectively.  EXP is expanded if it is a macro.
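
The difference between 12E+EXP and 12+EXP can be sketched as follows.
str() and xstr() are helper macros assumed for this illustration:

```c
#include <assert.h>
#include <string.h>

#define EXP     1
#define str(a)  # a
#define xstr(a) str(a)          /* expands the argument, then stringizes */

void check_pp_number(void)
{
    /* 12E+EXP is one pp-number, so the EXP in it is not a macro call. */
    assert(strcmp(xstr(12E+EXP), "12E+EXP") == 0);
    /* 12+EXP is the 3 pp-tokens 12, +, and EXP; EXP is expanded to 1. */
    assert(12+EXP == 13);
}
```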

Scoring: 2.

            [n.ppnum]   [Pp][+-] sequence

In C99, the sequence of + or - following P or p in a pp-number was added
to write a floating point number in hexadecimal.

In order to display a floating point number in this notation with
printf(), conversion specifiers such as %a and %A are used. *
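
A small sketch of the [Pp][+-] sequence follows, assuming compilation
as C99 or later.  str() and xstr() are helper macros assumed for this
illustration:

```c
#include <assert.h>
#include <string.h>

#define EXP     1
#define str(a)  # a
#define xstr(a) str(a)

void check_hex_float_pp_number(void)
{
    /* In C99, p+ continues a pp-number, so 0x1p+EXP is one pp-token
     * and the EXP in it is not expanded. */
    assert(strcmp(xstr(0x1p+EXP), "0x1p+EXP") == 0);
    /* 0x1.8p+1 is the hexadecimal floating constant 1.5 * 2 = 3.0. */
    assert(0x1.8p+1 == 3.0);
}
```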

Scoring: 4.

  *  C99 7.19.6.1 The fprintf function -- Description


        [2.4.23]    Macros Using the ## Operator

## is an operator newly introduced in Standard C and used only in the
replacement list on the #define line.  The pp-tokens before and after a
## are concatenated into one pp-token.  If pp-tokens before and after ##
are parameters, they are first replaced by actual arguments at macro
expansion and concatenated. *

  *  ANSI C 3.8.3.3 (C90 6.8.3.3) The ## operator
     C99 6.10.3.3 The ## operator

            [n.23.1]    Token concatenation

This is an example of the simplest function-like macro using the ##
operator.

Scoring: 6.

            [n.23.2]    Pp-number generation

Since operands of the ## operator are not macro-expanded, another macro
which appears to be meaningless, such as xglue() in this example, is
often used together.  This is for expanding a macro in an argument
first, then concatenating the result.  The 12e+2 sequence which was
generated by a macro call in this sample is a valid pp-number.
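
The following sketch shows glue() and xglue() in this role.  The
variable xy and the macro EXP are assumptions made for illustration and
may differ from the actual sample:

```c
#include <assert.h>

#define glue(a, b)      a ## b
#define xglue(a, b)     glue(a, b)  /* expands arguments, then calls glue */
#define EXP             2

int     xy = 7;     /* hypothetical variable named by glue(x, y) */

void check_token_concatenation(void)
{
    /* glue(x, y) concatenates x and y into the identifier xy. */
    assert(glue(x, y) == 7);
    /* xglue(12e+, EXP) expands EXP to 2 first; glue() then forms the
     * pp-number 12e+2, a valid floating constant. */
    assert(xglue(12e+, EXP) == 1200.0);
}
```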

Scoring: 2.

            [e.23.3]    Missing tokens before or after ##
                                - object-like macros

There must be some pp-tokens before and after the ## operator in a
replacement list.  This is an example of an object-like macro.  It is
meaningless to use ## in an object-like macro, but not an error.

Scoring: 2.

            [e.23.4]    Missing tokens before or after ##
                                - function-like macros

This is an example of a function-like macro definition without pp-tokens
before or after the ## operator in a replacement list.

Scoring: 2.


        [2.4.24]    Macros Using the # Operator

The # operator was introduced in Standard C.  It is used only in the
replacement list for the #define line which defines a function-like
macro.  The operands of the # operator are parameters, and the
corresponding actual arguments are converted to string literals at macro
expansion. *

  *  ANSI C 3.8.3.2 (C90 6.8.3.2) The # operator
     C99 6.10.3.2 The # operator

            [n.24.1]    Argument stringizing

The argument corresponding to the operand for the # operator is enclosed
by " and " on both ends and stringized.

Scoring: 6.  2 points if a space is inserted between tokens as in "a + b".

            [n.24.2]    White space handling between argument tokens

In case the argument corresponding to the operand for the # operator
consists of a sequence of multiple pp-tokens, white spaces between
those pp-tokens are converted into one space and stringized.  No space
is inserted if there are no white spaces.  That is to say, the results
differ by the existence of white spaces though they are not influenced
by the number of white spaces (this still has a trace of
character-oriented preprocessing.)  White spaces before and after an
argument are deleted.
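
These rules can be sketched as below.  str() is assumed to be defined as
shown, and the other names are hypothetical:

```c
#include <assert.h>
#include <string.h>

#define str(a)  # a

void check_stringizing_white_space(void)
{
    /* Any amount of white space (blanks, comments, line breaks) between
     * pp-tokens becomes one space; leading and trailing ones vanish. */
    const char  *spaced = str(   ab  /* comment */   +
        cd  );

    /* No white space between the pp-tokens: none is inserted. */
    assert(strcmp(str(a+b), "a+b") == 0);
    assert(strcmp(spaced, "ab + cd") == 0);
}
```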

Scoring: 4.

            [n.24.3]    \ insertion

In case a string literal or character constant is in the argument
corresponding to the operand for the # operator, \ is inserted
immediately prior to the \ or " in it and the " which encloses a string
literal.  This is the same as the way string literals are written in
order to display string literals or character constants as they are.
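
A minimal sketch of the \ insertion, with str() assumed to be defined as
shown:

```c
#include <assert.h>
#include <string.h>

#define str(a)  # a

void check_backslash_insertion(void)
{
    /* The argument is the 4 characters "\n" as spelled in the source;
     * stringizing escapes both quotes and the backslash. */
    assert(strcmp(str("\n"), "\"\\n\"") == 0);
    assert(strlen(str("\n")) == 4);
    /* The backslashes in a character constant are escaped as well. */
    assert(strcmp(str('\\'), "'\\\\'") == 0);
}
```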

Scoring: 6.

            [n.24.4]    Calling macro including <backslash><newline>

As the <backslash><newline> sequences are removed in translation phase 2,
they do not exist at macro expansion.

Scoring: 2.

            [n.24.5]    Token separator inserted by macro expansion
                                should not remain

A macro expansion routine generally guards the expansion result with
token separators in order to avoid token-merging with surrounding
tokens. (See [2.4.21].)  In case of stringization, however, the inserted
token separators should not remain.

These cumbersome issues arise from the character-oriented portions of
the Standard mixed into its token-based principle.

Scoring: 2.

            [e.24.6]    Operand for the # operator is not a parameter
                                name

The operand for the # operator must be a parameter name.

Scoring: 2.


        [2.4.25]    Macro Expansion in a Macro Argument

K&R 1st did not mention when macros in an argument should be expanded
in a function-like macro call.  Implementations were all different in
pre-ANSI preprocessors.  Many of them might have expanded the macros at
rescanning a replacement list.  Standard C specified that a macro in an
argument is expanded after the argument is identified and before it is
substituted for a parameter.  The order is similar to the argument
evaluation of function calls, which is easy to understand.  However, in
case the argument corresponds to a parameter which is an operand for
the # or ## operator, a macro is not considered as a macro call and is
not expanded even if the macro name is included. *

  *  ANSI C 3.8.3.1 (C90 6.8.3.1) Argument substitution
     C99 6.10.3.1 Argument substitution

            [n.25.1]    Macro in an argument is expanded first

A macro in an argument is expanded after the argument is identified,
then substituted for a parameter in the replacement list.  Therefore,
what is identified as one argument remains one argument even if it
appears to be multiple arguments after expansion.

Scoring: 4.

            [n.25.2]    Argument expanded into 0 piece of token

Similarly, arguments which become 0 piece of token after expansion are
legitimate.
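
The cases of n.25.1 and n.25.2 can be sketched as follows.  The macros
and the variables a and b are assumptions made for illustration:

```c
#include <assert.h>

#define sub(x, y)       (x - y)
#define TWO_ARGS        a,b
#define ZERO_TOKEN

int     a = 10, b = 4;      /* hypothetical operands */

void check_argument_expansion(void)
{
    /* TWO_ARGS is identified as the single first argument of sub() and
     * only then expanded: the result is (a,b - 1), a comma expression. */
    assert(sub(TWO_ARGS, 1) == 3);
    /* An argument may also expand into 0 pieces of token: ( - a). */
    assert(sub(ZERO_TOKEN, a) == -10);
}
```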

Scoring: 2.

            [n.25.3]    Calling macro using the ## operator
                                in an argument

In case the operand for the ## operator is a parameter, the argument
corresponding to it does not get macro-expanded.  Therefore, it is
necessary to nest another macro in order to do macro expansion.

Since xglue() does not use the ## operator in this example, glue( a, b),
which is an argument for it, gets macro-expanded and becomes 'ab' and
the replacement for xglue() becomes glue(ab, c).  This is rescanned and
'abc' is the final result.

Scoring: 2.

            [n.25.4]    No expansion for the operand for the ## operator

Since glue() is directly called, a macro does not get expanded even if a
macro name is in the argument.

Scoring: 6.

            [n.25.5]    No expansion for the operand for the # operator

The argument corresponding to the parameter which is an operand for the
# operator is not macro-expanded.
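
The non-expansion of operands of ## and # can be sketched as below.  The
variable MACRO_0MACRO_1 is a hypothetical name introduced so that the
unexpanded result can be observed:

```c
#include <assert.h>
#include <string.h>

#define glue(a, b)      a ## b
#define xglue(a, b)     glue(a, b)
#define str(a)          # a
#define xstr(a)         str(a)
#define MACRO_0         0
#define MACRO_1         1

int     MACRO_0MACRO_1 = 77;    /* hypothetical name formed by glue() */

void check_operands_not_expanded(void)
{
    /* Direct call: the operands of ## are not expanded, so the result
     * is the identifier MACRO_0MACRO_1. */
    assert(glue(MACRO_0, MACRO_1) == 77);
    /* Through xglue() the arguments are expanded first: 0 ## 1 is 01. */
    assert(xglue(MACRO_0, MACRO_1) == 1);
    /* The operand of # is not expanded either. */
    assert(strcmp(str(MACRO_1), "MACRO_1") == 0);
    assert(strcmp(xstr(MACRO_1), "1") == 0);
}
```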

Scoring: 4.

            [e.25.6]    Macro expansion in an argument incomplete in the
                                argument

Macro expansion in an argument is done only in that argument.  An
incomplete macro call is a violation of constraints.  Though a
function-like macro name is not a macro call by itself, it becomes the
beginning of a macro call sequence if it is followed by '('.  Once it
begins, ')' corresponding to this '(' must exist. *

Scoring: 4.

  *  ANSI C 3.8.3 (C90 6.8.3) Macro replacement -- Constraints
     C99 6.10.3 Macro replacement -- Constraints


        [2.4.26]    Macros of a Same Name during Macro Rescanning

In case macro definition is recursive, expanding the macro as is causes
infinite recursion.  Because of this, recursive macros could not be used
in K&R 1st.  Standard C introduced a new specification that the macro
name replaced once does not get replaced again in order to allow
recursive macro expansion, preventing infinite recursion.  This
specification is quite difficult, but can be paraphrased as below. *1

1. While rescanning the replacement list of macro A, the name, A, even
if found, will not be replaced again.

2. In case of nested replacement, where there is a macro B call in the
replacement list of macro A and the name, A, appears in the middle of
macro B replacement, the A will not be replaced again.

3. However, in case macro B replacement involves the token sequence
after the replacement list of the original macro A, an A in the part
thus involved is replaced, if and only if this A is in the source file
(not in the replacement list of any macro).

4. 1 to 3 are recursively applied to macro B replacement.  In other
words, if B appears again in the middle of macro B replacement and it is
on the token sequence after the replacement list of the original B, and
this B is not in the source file, this B will not be replaced.

5. In case there is a macro C call in the argument for the original
macro A call, and a C which appeared in the middle of replacement of C
falls under 1 through 3 and is forbidden from replacement, this C will
not be replaced again even when the replacement list of the original
macro A is rescanned.
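
Rules 1 and 2 can be sketched with object-like macros as below.  The
variables are hypothetical and are declared before the macros of the
same names are defined, so that the names left unreplaced refer to
them:

```c
#include <assert.h>

int     Z[2] = { 5, 9 };    /* declared before the macros are defined */
int     AB = 1, BA = 2;

#define Z       Z[0]        /* direct recursion   */
#define AB      BA          /* indirect recursion */
#define BA      AB

void check_recursive_object_like_macros(void)
{
    /* Rule 1: Z is replaced by Z[0]; the Z found at rescan is left. */
    assert(Z == 5);
    /* Rule 2: AB -> BA -> AB, and this nested AB is not replaced
     * again, so it names the variable AB.  BA likewise ends as the
     * variable BA. */
    assert(AB == 1);
    assert(BA == 2);
}
```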

Even after all this paraphrasing, it is still difficult.  In
particular, 3 comes from a traditional macro rescan specification of
including the subsequent token sequence, and this complicates the
problem unnecessarily.  The Standard has issued a corrigendum and made
corrections concerning this subject; however, it just gets more
confusing.  Furthermore, the Standard changes macro expansion depending
on whether the subsequent token sequence is in the source file or not.
This is an inconsistent specification. *2

Macro expansion involving the succeeding token sequence is uncommon,
and it is doubly uncommon that a macro whose re-replacement has been
prohibited at that part appears again.  The Validation Suite does not
have this type of macro as a test item other than n.27.6.  I hope this
abnormal specification regarding macro expansion of "including a
succeeding token sequence" will be removed. *3

  *1  ANSI C 3.8.3.4 (C90 6.8.3.4) Rescanning and further replacement
      C99 6.10.3.4 Rescanning and further replacement
  *2  Refer to [1.7.6].
  *3  At the newsgroup comp.std.c, there has been some controversy on
    the Standard's specification about recursive macro expansion.
    Mainly two interpretations have been asserted on this subject.
    recurs.t is one of the samples used in the discussion.  Refer to the
    comment in recurs.t.  This Validation Suite does not evaluate
    preprocessors' behavior on this sample.
    MCPP V.2.4.1 or later in Standard mode implements the two ways of
    recursive macro expansion.  By default, MCPP sets the range of
    non-re-replacing of the same name as wide as the explanation in 1-5
    above, and expands this sample as 'NIL(42)'.  With the '-@compat'
    option, MCPP sets the range narrower, and expands this sample as
    '42'.  The difference between these two specifications appears, in 3
    above, when the first half of a function-like macro call of A is in
    the replacement list of B.  By default, MCPP prohibits re-replacing
    of A even if only the name of A is in the replacement list of B.  On
    the other hand, with the -@compat option, MCPP prohibits re-replacing
    if and only if the name of A and the arguments list with '(', ')'
    surrounding it are all in the replacement list of B, and does not
    distinguish whether the name is in the source file or not.

            [n.26.1]    No re-expansion for direct recursive
                                object-like macros

This is an example of direct recursion for object-like macros.

Scoring: 2.

            [n.26.2]    No re-expansion for indirect recursive
                                object-like macros

This is an example of indirect recursion for object-like macros.

Scoring: 2.

            [n.26.3]    No re-expansion for direct recursive
                                function-like macros

This is an example of direct recursion for function-like macros.
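
One practical use of this rule is a macro wrapping a function of the
same name, sketched below.  twice() and call_count are hypothetical
names and are not taken from the sample:

```c
#include <assert.h>

int     call_count = 0;

/* The function is defined before the macro of the same name. */
int twice(int x) { return 2 * x; }

/* Direct recursion: the twice in the replacement list is not replaced
 * again, so it refers to the function. */
#define twice(x)    (++call_count, twice(x))

void check_recursive_function_like_macro(void)
{
    assert(twice(3) == 6);
    assert(call_count == 1);
}
```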

Scoring: 2.

            [n.26.4]    No re-expansion for indirect recursive
                                function-like macros

This is an example of indirect recursion for function-like macros.

Scoring: 2.

            [n.26.5]    Recursive macros in arguments

In Standard C, there is a difficult specification meaning "the macro
whose re-replacement has been prohibited will not be replaced at rescan
in another context."  Concretely, this applies to the rescanning of the
parent macro for a macro in its argument.  When there is a recursive
macro in an argument, it is replaced only once.  It is not replaced at
the rescan of the parent macro, either.
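
This can be sketched as follows.  NEG() is a hypothetical parent macro
and the array Z is declared before Z becomes a macro:

```c
#include <assert.h>

int     Z[2] = { 5, 9 };    /* declared before Z becomes a macro */

#define Z       Z[0]
#define NEG(a)  (-(a))

void check_recursive_macro_in_argument(void)
{
    /* The argument Z is expanded once to Z[0].  When the replacement
     * list of NEG() is rescanned, this Z is not replaced again;
     * otherwise the result would be the invalid Z[0][0]. */
    assert(NEG(Z) == -5);
}
```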

Scoring: 2.


        [2.4.27]    Macro Rescanning

Rescanning of macro replacement lists has been a specification since K&R
1st.  Macros found at rescan are replaced as long as they are not
recursive macros.  This takes care of nested macro definitions.  There
was no special change in Standard C, though what was not obvious in K&R
1st became somewhat clearer. *

  *  ANSI C 3.8.3.4 (C90 6.8.3.4)  Rescanning and further replacement
     C99 6.10.3.4 Rescanning and further replacement

            [n.27.1]    Nested object-like macro

This is the same in K&R 1st as well.

Scoring: 6.

            [n.27.2]    Nested function-like macro

This is the same in K&R 1st as well.

Scoring: 4.

            [n.27.3]    Name generated by the ## operator is subject to
                                expansion

The argument corresponding to the operand for the ## operator is not
macro-expanded; however, the pp-token newly generated by pp-token
concatenation is subject to macro expansion at rescan.
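
A minimal sketch, with macro names chosen only for illustration:

```c
#include <assert.h>

#define glue(a, b)  a ## b
#define MACRO_1     1

void check_generated_name_is_expanded(void)
{
    /* MACRO_ and 1 are concatenated into MACRO_1, which is then found
     * at rescan and replaced by 1. */
    assert(glue(MACRO_, 1) == 1);
}
```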

Scoring: 2.

            [n.27.4]    Function-like macros formed in replacement list

A function-like macro name which is not followed by ( is not considered
to be a macro call.  If a function-like macro name is obtained from an
argument and a function-like macro call is formed by using the name in
the replacement list, it will be expanded.
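
This can be sketched as below.  sub() and math() are macros assumed for
this illustration:

```c
#include <assert.h>

#define sub(x, y)       (x - y)
#define math(op, a, b)  op((a), (b))

void check_macro_name_from_argument(void)
{
    /* The argument sub is not followed by ( and so is not a call by
     * itself.  In the replacement list it becomes sub((8), (3)), which
     * is expanded at rescan to ((8) - (3)). */
    assert(math(sub, 8, 3) == 5);
}
```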

Scoring: 6.

            [n.27.5]    Replacement list forming the first half of a
                                function-like macro

An unusual macro whose replacement list forms the first half of a
function-like macro call was an unspoken specification before the
Standard, and was officially accepted in Standard C.  The pp-token
sequence subsequent to the replacement list is included to complete a
macro call.
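
A small sketch of such a macro follows.  The macro names are assumptions
made for illustration:

```c
#include <assert.h>

#define sub(x, y)   (x - y)
#define head        sub(

void check_first_half_of_macro_call(void)
{
    /* head is replaced by sub( and the following 8, 3) from the source
     * completes the call: (8 - 3). */
    int result = head 8, 3);

    assert(result == 5);
}
```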

Scoring: 2.

            [n.27.6]    Same name macro to be re-replaced

In general, a macro of the same name is not re-replaced while rescanning.
Some unusual cases are, however, re-replaced.  In case of a nested macro
call, when the replacement involves the subsequent pp-tokens and finds
the same name in the source file, the name is re-replaced.

Scoring: 2.

            [e.27.7]    Mismatched number of arguments at rescan

The number of arguments must match the number of parameters in
function-like macro calls.  The same applies to function-like macro
calls found during the rescan of the replacement list.  However, the
number of arguments may not be easy to recognize intuitively in tricky
macros, since arguments are separated by ','.

Scoring: 2.


        [2.4.28]    Predefined Macros

C90 introduced a specification that 5 special macros are predefined by
implementations. *1

Furthermore, a macro named __STDC_VERSION__ was introduced in C90/
Amendment 1.

__FILE__ and __LINE__ are very special macros whose definitions change
dynamically.  They are used in assert(), debugging tools, and the like.
The rest of the Standard predefined macros do not change during the
processing of one translation unit.

  *1  ANSI C 3.8.8 (C90 6.8.8) Predefined macro names
  *2  C99 6.10.8 Predefined macro names

            [n.28.1]    __FILE__

This is defined to be the string literal for the name of the source
file being preprocessed.  Since the name of a file included by #include
is used while that file is processed, the value changes even within a
single translation unit.

Scoring: 4.  0 points for just a name as in n_28.t, rather than a string
literal as in "n_28.t".

            [n.28.2]    __LINE__

This is defined to be the decimal constant for the line number in the
source file being preprocessed.  The line number starts with 1.

Scoring: 4. 2 points if the line number starts with 0.
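
A minimal sketch of how __FILE__ and __LINE__ behave inside one source
file.  The function names are illustrative assumptions; the actual file
name depends on how the source is presented to the compiler.

```c
#include <assert.h>
#include <string.h>

/* __FILE__ is a string literal, so it can initialize a pointer. */
const char *current_file(void)
{
    return __FILE__;
}

/* __LINE__ changes from line to line within the same file. */
int lines_are_consecutive(void)
{
    int first = __LINE__;
    int second = __LINE__;      /* exactly one line below */
    return second == first + 1;
}
```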

            [n.28.3]    __DATE__

This is defined to be the string literal for the date on which
preprocessing is performed.  The format is "Mmm dd yyyy"; the month
names are the same as those generated by the asctime() function, and
the 1st digit of dd is a space, not 0, when the day is earlier than the
10th.

Scoring: 4. 2 points if the 1st digit of dd is 0 in case the day is
prior to the 10th.

            [n.28.4]    __TIME__

This is defined to be the string literal for the time at which
preprocessing is performed.  The format is "hh:mm:ss", the same as that
generated by the asctime() function.

Scoring: 4.
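
The two formats can be checked mechanically.  The following is a sketch
only; the helper names date_format_ok and time_format_ok are assumptions
made for this example, and the checks cover the layout of the strings,
not the validity of the date or time itself.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if s has the form "Mmm dd yyyy" with a space-padded day. */
int date_format_ok(const char *s)
{
    int i;

    if (strlen(s) != 11 || s[3] != ' ' || s[6] != ' ')
        return 0;
    if (!isupper((unsigned char)s[0]) || !islower((unsigned char)s[1])
            || !islower((unsigned char)s[2]))
        return 0;
    /* First day character: a space for days 1-9, otherwise 1, 2 or 3. */
    if (s[4] != ' ' && (s[4] < '1' || s[4] > '3'))
        return 0;
    if (!isdigit((unsigned char)s[5]))
        return 0;
    for (i = 7; i < 11; i++)
        if (!isdigit((unsigned char)s[i]))
            return 0;
    return 1;
}

/* Returns 1 if s has the form "hh:mm:ss". */
int time_format_ok(const char *s)
{
    int i;

    if (strlen(s) != 8 || s[2] != ':' || s[5] != ':')
        return 0;
    for (i = 0; i < 8; i++)
        if (i != 2 && i != 5 && !isdigit((unsigned char)s[i]))
            return 0;
    return 1;
}
```

Passing __DATE__ and __TIME__ to these helpers shows whether an
implementation follows the layout; "Feb 07 2003" is rejected because of
the leading 0 in the day.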

            [n.28.5]    __STDC__

This is defined to be the constant 1 in C90 or C99 compliant
implementations.

Scoring: 4.

            [n.28.6]    __STDC_VERSION__

This is defined to be 199409L in implementations supporting
C90/Amendment 1:1995. *1

Scoring: 4.

  *1  Amendment 1/3.3 Version macro (addition to ISO 9899:1990 / 6.8.8)

            [n.stdmac]  C99 predefined macros

In C99, the value of __STDC_VERSION__ is 199901L.

Also, a predefined macro, __STDC_HOSTED__, was added.  This is defined
to 1 for a hosted implementation, 0 otherwise.

Scoring: 4. 2 points each.
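
A sketch of how a program might query these macros without assuming
which Standard the implementation follows.  The function names are
illustrative, and the fallback values 0 and -1 are conventions chosen
for this example only.

```c
#include <assert.h>

/* Returns the value of __STDC__, or 0 if it is not defined. */
int stdc_flag(void)
{
#ifdef __STDC__
    return __STDC__;
#else
    return 0;
#endif
}

/* Returns __STDC_VERSION__, or 0 where it is not defined (plain C90). */
long stdc_version(void)
{
#ifdef __STDC_VERSION__
    return __STDC_VERSION__;
#else
    return 0L;
#endif
}

/* Returns 1 (hosted), 0 (freestanding), or -1 where the macro is not
 * defined (before C99). */
int stdc_hosted(void)
{
#ifdef __STDC_HOSTED__
    return __STDC_HOSTED__;
#else
    return -1;
#endif
}
```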

            [n.28.7]    __LINE__ and others in files included

Since __FILE__ and __LINE__ refer to source files, not translation
units, in an included file they are the name of that file and the line
number within it.

Scoring: 4.  2 points if the line number starts with 0.


        [2.4.29]    #undef

#undef has been supported since K&R 1st and there have been no major
changes.  It cancels a previously defined macro.  The scope of a macro
extends from its definition by #define until it is canceled by #undef. *

  *  ANSI C 3.8.3.5 (C90 6.8.3.5) Scope of macro definitions
     C99 6.10.3.5 Scope of macro definitions

            [n.29.1]    Macro cancellation by #undef

The macro name is no longer a macro after #undef.

Scoring: 10.

            [n.29.2]    #undef for the macro without being defined

Applying #undef to a name that has not been defined as a macro is
allowed.  Compiler systems must not reject this as an error.

Scoring: 4.
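
A minimal sketch of n.29.1 and n.29.2 together.  The names LIMIT and
NEVER_DEFINED and the three functions are assumptions for illustration.

```c
#include <assert.h>

#define LIMIT 10

int with_macro(void)
{
    return LIMIT;               /* LIMIT is replaced by 10 */
}

#undef LIMIT
#undef NEVER_DEFINED            /* not a macro: must be accepted */

/* LIMIT is an ordinary identifier again and can name a variable. */
int after_undef(void)
{
    int LIMIT = 20;
    return LIMIT;
}

int still_defined(void)
{
#ifdef LIMIT
    return 1;
#else
    return 0;
#endif
}
```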

            [e.29.3]    Missing name

An identifier is required on the #undef line.

Scoring: 2.

            [e.29.4]    Extra token

The #undef line must contain exactly one identifier and nothing else.

Scoring: 2.

            [e.29.5]    Missing argument

Missing an argument on the #undef line is a violation of syntax rules.

Scoring: 2.


        [2.4.30]    Macro Calls

In a macro call, a <newline> is treated as just another white space.
Therefore, a macro call can extend over multiple lines, which was not
clear in K&R 1st. *

Arguments for function-like macro calls are separated by ','.  However,
a ',' inside a pair of ( and ) is not considered to be an argument
separator.  This is not tested here directly, but it is implicitly
tested throughout the Suite, in n.25 for example.  Since many of the *.c
samples use the assert() macro, fairly complicated testing of this is
performed.

  *  ANSI C 3.8.3 (C90 6.8.3) Macro replacement -- Semantics
     C99 6.10.3 Macro replacement -- Semantics

            [n.30.1]    Macro call extending over multiple lines

Scoring: 6.
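
The sketch below shows a call spread over three lines, and a ',' that is
protected by parentheses.  All names in it are assumptions for this
example.

```c
#include <assert.h>

#define sum3(a, b, c)   ((a) + (b) + (c))
#define first_of(a, b)  (a)

int larger(int x, int y)
{
    return x > y ? x : y;
}

/* The <newline>s inside the call are ordinary white space. */
int call_over_lines(void)
{
    return sum3(1,
                2,
                3);
}

/* Two arguments, not three: the ',' inside larger(4, 9) is enclosed by
 * ( and ) and does not separate arguments of first_of. */
int comma_in_parens(void)
{
    return first_of(larger(4, 9), 0);
}
```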

            [n.nularg]  Empty arguments in macro calls

C99 accepts an empty argument in macro calls.  This is different from
insufficient arguments: the ',' separating arguments cannot be omitted
(though the two cases cannot be distinguished when there is only one
parameter).

Scoring: 6.
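
A sketch of empty arguments under C99 rules; the macro and function
names are assumptions for illustration.  In cat3(1, , 2) the two ','
are both present, so the call has three arguments, the second of which
is empty.

```c
#include <assert.h>
#include <string.h>

#define cat3(a, b, c)   a ## b ## c
#define str(a)          # a

/* The empty middle argument contributes nothing: 1 and 2 paste to 12. */
int empty_middle(void)
{
    return cat3(1, , 2);
}

/* With one parameter, str() passes a single empty argument and
 * stringizes it to "". */
size_t empty_only(void)
{
    return strlen(str());
}
```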


        [2.4.31]    Macro Call Error

The following are some errors regarding macro calls.

            [e.31.1]    Too many arguments

A mismatch between the number of arguments and the number of parameters
is a violation of constraints, not undefined behavior. *

Scoring: 2.

  *  ANSI C 3.8.3 (C90 6.8.3)  Macro replacement -- Constraint
     C99 6.10.3 Macro replacement -- Constraints

            [e.31.2]    Insufficient arguments

Fewer arguments than parameters is a violation of constraints.

C99 accepts an empty argument.  This is different from insufficient
arguments: the ',' separating arguments must be present.

Scoring: 2.

            [e.31.3]    Incomplete macro call on the control line

In general, a macro call can extend over multiple lines.  However, since
a preprocessing directive line starting with # is complete within that
line (which a multi-line comment may extend over several source lines),
a macro call on it must be complete within the line as well.

Scoring: 2.

            [e.vargs2]  No Arguments for Variable Argument Macros

Variable argument macros in C99 need at least one actual argument for
__VA_ARGS__.

Scoring: 2.
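
For contrast with the error case, the sketch below shows a valid call
that supplies arguments for __VA_ARGS__.  The macro format_into and the
function format_pair are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define format_into(buf, size, fmt, ...) \
    snprintf(buf, size, fmt, __VA_ARGS__)

/* Valid: 1 and 2 are the arguments corresponding to __VA_ARGS__. */
int format_pair(char *buf, size_t size)
{
    return format_into(buf, size, "%d-%d", 1, 2);
}

/* format_into(buf, size, "x") would leave __VA_ARGS__ without any
 * argument, which is the error that e.vargs2 is about. */
```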


        [2.4.32]    Character Constant in #if Expression

#if expressions can use character constants.  However, their evaluation
is mostly implementation-defined and not very portable.  32.? covers the
simplest single-byte character constants. *

  *  ANSI C 3.1.3.4 (C90 6.1.3.4) Character constants
     ANSI C 3.8.1 (C90 6.8.1) Conditional inclusion -- Semantics
     C99 6.4.4.4 Character constants
     C99 6.10.1 Conditional inclusion -- Semantics

    The same sections of the Standards also apply to tests 33, 34, and
    35 below.

            [n.32.1]    Character octal escape sequences

Octal escape sequences are supported in character constants.  They are
the same in any implementation; nothing here is implementation-defined.

Scoring: 2.

            [n.32.2]    Character hexadecimal escape sequences

Hexadecimal escape sequences are also supported in character constants.
Nothing here is implementation-defined either.  Hexadecimal escape
sequences did not exist in K&R 1st.

Scoring: 2.
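
A sketch of escape sequences evaluated by the preprocessor.  The macro
names OCTAL_ESCAPES_OK and HEX_ESCAPES_OK are assumptions; the values
compared are small positive ones, which do not depend on sign handling.

```c
#include <assert.h>

/* Octal: '\033' is 27 and '\123' is 64 + 16 + 3 = 83. */
#if '\0' == 0 && '\033' == 27 && '\123' == 83
#define OCTAL_ESCAPES_OK 1
#else
#define OCTAL_ESCAPES_OK 0
#endif

/* Hexadecimal: '\x1b' is the same value as '\033'. */
#if '\x1b' == '\033' && '\x7F' == 127
#define HEX_ESCAPES_OK 1
#else
#define HEX_ESCAPES_OK 0
#endif

int escapes_ok(void)
{
    return OCTAL_ESCAPES_OK && HEX_ESCAPES_OK;
}
```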

            [i.32.3]    Single-byte character constants

Single-byte character constants that are not escape sequences are
simple, but their values depend on the basic character set.  In a cross
compiler whose basic character set differs between compilation and
runtime, it is acceptable to use either of them in #if expression
evaluation.

Even with the same basic character set, sign handling is
implementation-defined.  Moreover, the handling can differ between the
compiler proper (translation phase 7) and preprocessing (phase 4).

Therefore, judging the basic character set from the value of a character
constant in an #if expression is not a reliable method.

Scoring: 2.
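
The sign handling in the two phases can be probed separately, as in the
following sketch.  PP_CHAR_IS_SIGNED and the function names are
assumptions for this example, and no particular result is implied: both
probes, and any disagreement between them, are permitted.

```c
#include <assert.h>

/* Phase 4: how the preprocessor evaluates the constant in #if. */
#if '\xff' < 0
#define PP_CHAR_IS_SIGNED 1
#else
#define PP_CHAR_IS_SIGNED 0
#endif

/* Phase 7: how the compiler proper evaluates the same constant. */
int compiler_char_is_signed(void)
{
    return '\xff' < 0;
}

/* 1 if both phases handle the sign alike, 0 if they differ. */
int phases_agree(void)
{
    return PP_CHAR_IS_SIGNED == compiler_char_is_signed();
}
```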

            [i.32.4]    '\a' and '\v'

Standard C added new escape sequences, '\a' and '\v'.

Scoring: 2.

            [e.32.5]    Escape sequence value out of the unsigned char
                                range

As an escape sequence in a character constant represents a single-byte
character value, it must be in the unsigned char range.

Scoring: 2.


        [2.4.33]    Wide Character Constant in #if Expression

Wide character constants were introduced in Standard C.  Their
evaluation is even more implementation-defined than that of single-byte
character constants; even the byte evaluation order is not specified.

Although there are various encodings of wide characters, only wide
characters corresponding to ASCII characters are tested here.  For the
other encodings, see [3.1].

            [e.33.2]    Wide character constant value out of range

Hexadecimal or octal escape sequences can be used in wide character
constants as well, but their values must be in the unsigned range
representable by one wide character.

Scoring: 2.


        [2.4.35]    Multi-Character Character Constant
                            in #if Expressions

Character constants include what are called multi-character character
constants.  They are easily confused with multi-byte character
constants, but they are a different concept: a multi-character character
constant is a character constant consisting of multiple characters.
The characters may be single-byte characters, multi-byte characters, or
wide characters, and there is a kind of multi-character character
constant corresponding to each (in the Standards, the term character
means a single-byte character, but here it refers to all 3 kinds of
characters).

There seems to be no use for multi-character character constants.  They
seem to have been accepted since K&R 1st simply because the type of a
character constant is int, so anything in the int range is fine.

            [i.35.1]    Multi-character character constant evaluation

Multi-character character constants of single-byte characters have been
around since K&R 1st (A.16).  However, the byte order of evaluation is
defined neither in K&R 1st nor in Standard C.

Scoring: 2.

            [e.35.2]    Multi-character character constant value
                                out of range

The value of a multi-character character constant written with
hexadecimal or octal escape sequences must be in the int or unsigned int
range.  However, since int and unsigned int are treated in C90 #if
expressions as if they had the same internal representation as long and
unsigned long, checking whether the value is in the long or unsigned
long range seems sufficient.  This point is not clear in the Standards.
It can also be argued that range checking is implementation-defined,
since the method of value evaluation itself is implementation-defined.

In any case, it is appropriate to issue a diagnostic message for this
sample, since it exceeds the unsigned long range on compiler systems
whose long is 64 bits or narrower, no matter how it is evaluated.

In C99, the type of #if expressions became the widest integer type of
the implementation.

Scoring: 2.

            [i.35.3]    Multi-character wide character constant
                                evaluation

There are also multi-character wide character constants.  Their
evaluation is likewise implementation-defined overall; however, a
multi-character wide character constant needs to match the
multi-character constant of the corresponding multi-byte characters.

Scoring: 0.


        [2.4.37]    Translation limits

Standard C specifies the minimum value of each translation limit that
compiler systems must be able to handle.  However, this specification is
quite lenient.  There are 22 kinds of limits, and all that is required
is that an implementation translate and execute at least one program
containing at least one instance of every limit.  As the samples in this
Validation Suite show, such a program can be written in an extremely
simple, impractical way that puts the minimum load on a compiler system.
Please note that these translation limits are therefore not guaranteed
in every case; the translation limit specification is only a guideline.
These samples test only 8 kinds of translation limits, those concerning
preprocessing. *1, *2, *3

Some of these test samples have long lines folded by line splicing to
fit on a screen.  These tests may fail on a compiler system that cannot
process line splicing correctly (for example, Borland C).  Since
testing line splicing is not the purpose here, please join the lines in
an editor and re-test if a sample fails.

  *1  ANSI C 2.2.4.1 (C90 5.2.4.1) Translation limits
  *2  C99 5.2.4.1 Translation limits
    In C99, translation limits were raised to a large extent.  They are
    even higher in the C++ Standard (refer to [3.6]).

            [n.37.1]    31 parameters in a macro definition

In C90, up to 31 parameters in a macro definition are guaranteed in any
case.

Scoring: 2.

            [n.37.2]    31 arguments in a macro call

Similarly, up to 31 arguments in a macro call are guaranteed in C90 in
any case.

Scoring: 2.

            [n.37.3]    31 characters in an identifier

In C90, the first 31 characters of an internal identifier (including a
macro name) in a translation unit are guaranteed to be significant.
Preprocessing, needless to say, must also pass 31-byte names through. *

Scoring: 4.

  *  ANSI C 3.1.2 (C90 6.1.2)  Identifiers - Implementation limits
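
As a sketch of what significance in 31 characters means, the two macro
names below are each 31 characters long (26 uppercase letters plus 5
more) and differ only in the last character.  The function name is an
assumption for this example.

```c
#include <assert.h>

#define ABCDEFGHIJKLMNOPQRSTUVWXYZabcde 1
#define ABCDEFGHIJKLMNOPQRSTUVWXYZabcd_ 2

/* If fewer than 31 characters were significant, the two names would
 * collide and the second #define would be a conflicting redefinition. */
int distinct_long_names(void)
{
    return ABCDEFGHIJKLMNOPQRSTUVWXYZabcde
        + 10 * ABCDEFGHIJKLMNOPQRSTUVWXYZabcd_;
}
```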

            [n.37.4]    8 levels of nested conditional inclusion

In C90, 8 levels of #if (#ifdef, #ifndef) section nesting is guaranteed
in any case.

Scoring: 2.

            [n.37.5]    8 levels of nested #include

In C90, 8 levels of #include nesting is guaranteed in any case.

Scoring: 4.

            [n.37.6]    #if expression with 32 levels of parentheses

In C90, 32 nesting levels of parentheses are guaranteed in an expression
in any case.  This seems to apply to #if expressions as well.  (Unlike
general expressions, #if expressions hardly seem to need a guarantee of
that extent.  The only exceptions defined for #if expressions are that
they are integer constant expressions evaluated only in long or unsigned
long, that they do not require querying the execution environment, and
that they need not be evaluated in the same way as at runtime or in
phase 7.  Since they receive no other special treatment, the
specification seems somewhat excessive in places.)

Scoring: 2.
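
A sketch of such an #if expression follows.  The parentheses are written
in 4 groups of 8 to make the 32 levels easy to count, and the line is
folded by line splicing; NESTED_PARENS_OK is a name assumed for this
example.

```c
#include <assert.h>

#if \
    (((((((( (((((((( (((((((( (((((((( 1 \
    )))))))) )))))))) )))))))) )))))))) == 1
#define NESTED_PARENS_OK 1
#else
#define NESTED_PARENS_OK 0
#endif

int nested_parens_ok(void)
{
    return NESTED_PARENS_OK;
}
```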

            [n.37.7]    509 byte string literal

In C90, string literals of up to 509 bytes are guaranteed.  This length
is that of the token, not the number of elements in the char array.  It
includes the " at both ends, and \n counts as 2 bytes.  In a wide string
literal, the L prefix is included in the count.

Scoring: 2.
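
The difference between the token length and the array size can be made
visible with the # operator, which reproduces the spelling of the token.
This is a sketch; str and the function names are assumptions.

```c
#include <assert.h>
#include <string.h>

#define str(x)  # x

/* The token "ab\n" is spelled with 6 bytes: " a b \ n " */
size_t token_length(void)
{
    return strlen(str("ab\n"));
}

/* The array holds 4 elements: a, b, <newline>, and the terminator. */
size_t array_elements(void)
{
    return sizeof "ab\n";
}
```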

            [n.37.8]    509 byte logical line

In C90, the length of a logical line is guaranteed up to 509 bytes in
any case.

Scoring: 2.

            [n.37.9]    1024 macro definitions

In C90, 1024 macro definitions are guaranteed in any case.  This is the
most ambiguous of the specifications regarding translation limits.  The
amount of memory a compiler system needs for the simplest macros, as in
this sample, is totally different from that needed for many long macros.
Also, test programs differ depending on whether predefined macros are
counted among the 1024 macros.  In an actual program, many macros are
defined in standard headers before any are defined in the user program.
This specification is truly only a rough guideline.  The real limit is
determined by the amount of memory the system can provide.

Scoring: 4.

            [n.tlimit]  C99 translation limits

In C99, translation limits were extended to a large extent.

Scoring: 2 for each of the items below.

            [n.37.1L]   127 parameters in a macro definition
            [n.37.2L]   127 arguments in a macro call
            [n.37.3L]   63 characters in an identifier
            [n.37.4L]   63 levels of nested conditional inclusion
            [n.37.5L]   15 levels of nested #include
            [n.37.6L]   #if expression with 63 levels of parentheses
            [n.37.7L]   4095 byte string literal
            [n.37.8L]   4095 byte logical line
            [n.37.9L]   4095 macro definitions


    [2.5]       Documentation of Implementation-Defined Items

Standard C has items called implementation-defined.  Their
specifications vary from implementation to implementation.  However,
each implementation is required to describe its specifications in a
document. *

Among the implementation-defined specifications, some are determined by
the CPU and OS rather than by the compiler system itself.  In the case
of cross compilers, the CPU and OS may differ between translation and
runtime.

The items below check whether the implementation-defined items in
preprocessing are described in a document.  Since this concerns
preprocessing, the CPU and OS in question are those at translation time.
d.1.* are specifications specific to preprocessing and d.2.* are related
to specifications of the compiler proper.  However, #if expression
evaluation can have specifications different from those of the compiler
proper.

Other than the items below, there are some implementation-defined
aspects in #if expression evaluation.  The first is the character set
(whether the basic character set is ASCII, EBCDIC, and so on).  The
integer representation (2's complement, 1's complement, or sign and
magnitude) is another.  So is the result of converting a signed integer
to unsigned by the usual arithmetic conversions.  However, since these
all depend on the machine and OS, their own documents should suffice and
they need not be described in the preprocessor documents.  Therefore,
these are not subject to scoring.

  *  ANSI C 1.6 (C90 3) Definitions of Terms
     C99 3 Terms, definitions, and symbols

        [d.1.1]     Header-name construction method

A header-name is a special pp-token.  How sequences enclosed by < and >
or " and " are combined into a single pp-token called a header-name is
implementation-defined.  What is enclosed by " and " is easy for an
implementation, since it can be treated as a string literal pp-token;
what is enclosed by < and >, however, poses very particular problems.
That is because <stdio.h> is divided into 5 pp-tokens, <, stdio, ., h,
and >, in translation phase 3 and composed into 1 pp-token in phase 4.
When this part is written using macros, even more delicate issues
arise. *

Scoring: 2.  2 points if this specification is described in
implementation documents, 0 points otherwise.

Case sensitivity of header-names and file naming rules are also
implementation-defined; however, they do not necessarily need to be
described in implementation documents, since they are OS dependent.

  *  ANSI C 3.8.2 (C90 6.8.2) Source file inclusion -- Semantics
     C99 6.10.2 Source file inclusion
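
A sketch of a header-name written using a macro.  STDIO_HEADER is a name
assumed for this example.  How the 5 pp-tokens produced by the macro are
recombined is exactly the implementation-defined part, so this simple
form works widely but is not guaranteed in every detail.

```c
#include <assert.h>

#define STDIO_HEADER    <stdio.h>
#include STDIO_HEADER   /* the macro is expanded, then recombined */

/* EOF is a macro defined by <stdio.h>, so it shows that the header
 * really was included. */
int header_was_included(void)
{
#ifdef EOF
    return 1;
#else
    return 0;
#endif
}
```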

        [d.1.2]     Header search method in #include

How the header file is searched for after a header-name is extracted
from the #include line is also implementation-defined.  For a
header-name enclosed by " and ", the file is first searched for by an
implementation-defined method and, if not found, searched for in the
same way as a header-name enclosed by < and >.  However, the latter is
also implementation-defined.  This specification hardly specifies
anything, but that cannot be helped, since Standard C cannot make
assumptions about the OS.

On an OS with a directory structure, the former searches relative to
the current directory while the latter searches the directories
specified by the implementation.  However, some implementations search
relative to the file containing the #include.  This cannot be called
wrong either, since it is implementation-defined.  The Rationale
explains that the committee's intention was to specify a search
relative to the current directory, but that this could not be officially
defined since the OS cannot be assumed. *1

Also, the former search does not have to be supported (it is acceptable
to treat a header-name enclosed by " and " in the same way as one
enclosed by < and >).  Nor does the latter necessarily have to be a
file; a header may be built into the implementation. *2

Scoring: 4.  4 points if these header search methods are fully described
in documents, 2 points if not fully, and 0 points if there is almost no
description.

  *1  ANSI C Rationale 3.8.2 Source file inclusion

  *2  ANSI C 3.8.2 (C90 6.8.2) Source file inclusion -- Semantics
      C99 6.10.2 Source file inclusion

        [d.1.3]     #include nesting limitation

The maximum #include nesting level is implementation-defined.  However,
at least 8 levels for C90 and 15 levels for C99 must be guaranteed. *1,
*2

Scoring: 2.

  *1  ANSI C 2.2.4.1 (C90 5.2.4.1) Translation limits
      C99 5.2.4.1 Translation limits

  *2  ANSI C 3.8.2 (C90 6.8.2) Source file inclusion -- Semantics
      C99 6.10.2 Source file inclusion

        [d.1.4]     #pragma sub-directives that are implemented

The #pragma preprocessing directive is a directive for specifying
extended functionality specific to the implementation.  In
preprocessing as well, all extended functionality must be implemented as
#pragma sub-directives. *

Scoring: 4.  4 points if the documents adequately describe the #pragma
sub-directives valid in the implementation (at least all #pragma
sub-directives for preprocessing in the preprocessing documents), 2
points if the descriptions are inadequate, and 0 points if there is
almost no description.  Deduct 2 points if there is an
implementation-specific directive other than #pragma sub-directives (0
points is the lower limit.  Directives that are disabled by the option
selecting the mode closest to Standard C are not counted).

  *  ANSI C 3.8.6 (C90 6.8.6) Pragma directive
     C99 6.10.6 Pragma directive

        [d.pragma]  Macro expansion in #pragma

In C90, pp-tokens on the #pragma line were not subject to macro
expansion.  In C99, a line in which the STDC token follows #pragma is
not subject to macro expansion, while macro expansion for the other
#pragma sub-directives became implementation-defined.

Scoring: 2.

        [d.1.5]     Predefined macros

Predefined macros other than __FILE__, __LINE__, __DATE__, __TIME__,
__STDC__, and __STDC_VERSION__ (and __STDC_HOSTED__ in C99) are
implementation-defined.  Their names must begin with one _ followed by
an uppercase letter, or with 2 _'s. *

Scoring: 4.  2 points if the description is inadequate.  Deduct 2 points
if there is a predefined macro whose name does not follow this
restriction (0 points is the lower limit.  Macros that are disabled by
the option selecting the mode closest to Standard C are not counted).

  *  ANSI C 3.8.8 (C90 6.8.8) Predefined macro names
     C99 6.10.8 Predefined macro names

        [d.predef]  C99 predefined macros

C99 added the macros __STDC_IEC_559__, __STDC_IEC_559_COMPLEX__, and
__STDC_ISO_10646__, which are predefined conditionally.

__STDC_IEC_559__ and __STDC_IEC_559_COMPLEX__ are each defined as 1 in
implementations conforming to the IEC 60559 floating point
specification.  These 2 are determined by the floating point library,
and it might actually be appropriate to define them in <fenv.h> or some
other header.  They do not necessarily have to be predefined in a
preprocessor.

__STDC_ISO_10646__ is defined as a constant of a form such as 199712L,
representing the year and month of the ISO/IEC 10646 specification
(including amendments and corrigenda) complied with, when the values of
all characters of the wchar_t type are coded representations of ISO/IEC
10646 (the Universal Character Set of the Unicode family).  This may
also be defined in <wchar.h> or some other header, and does not seem to
have to be predefined in a preprocessor.

In any case, an explanation is required in documents.

Scoring: 6.  2 points for each of the 3.

        [d.1.6]    Whether white-spaces should be compressed in phase 3

In Standard C, tokenization is performed in translation phase 3.
Whether a white-space sequence other than <newline> is compressed into
one space at that time is implementation-defined. *

However, as this is an internal behavior that does not usually influence
the compilation result, it is not a user concern.  White-spaces at the
beginning and end of a line may also be removed.

This specification may look unnecessary, but there is one case where it
matters: when a preprocessing directive line contains [VT] or [FF].
Standard C specifies this in a way that is hard to comprehend.  On the
one hand, it is treated as a violation of constraints; on the other
hand, the specification above is provided.  In other words, [VT] and
[FF] can be compressed into one space together with the spaces or tabs
before and after them in phase 3.  In that case, [VT] and [FF] do not
remain in phase 4; if they are not compressed, they do remain and cause
a violation of constraints.

I think it is enough if the handling of [VT] and [FF] is described in
documents.

Scoring: 2.

  *  ANSI C 2.1.1.2 (C90 5.1.1.2) Translation phases 3
     ANSI C 3.8  (C90 6.8)  Preprocessing directive -- Constraints
     C99 6.10 Preprocessing directive

        [d.ucn]     Whether UCN's \ should be repeated in stringizing

When pp-tokens including a UCN are stringized by the # operator, whether
the UCN's \ is doubled is implementation-defined. *

Scoring: 2.

  *  C99 6.10.3.2 The # operator

        [d.2.1]     #if: Single-character character constant evaluation

In general, the evaluation of character constant values is
implementation-defined.  This has several levels.

  1. What is the basic character set
  2. What are multi-byte character and wide character encodings
  3. How is sign handled
  4. What is the byte order for multi-byte character constant evaluation

Among these, 1 is not subject to scoring since it is OS dependent.
The issues are 2, 3 and 4.

Even for single-byte single-character character constants, sign handling
is implementation-defined.  Moreover, evaluation is permitted to differ
between preprocessing and compilation. *

Scoring: 2.  2 points if a document includes descriptions or if it has
descriptions regarding the evaluation in the compilation phase and the
same evaluation is performed in the #if expression.  0 points if no
accurate description is written.

  *  ANSI C 3.1.3.4 (C90 6.1.3.4) Character constants
     ANSI C 3.8.1 (C90 6.8.1) Conditional inclusion -- Semantics
     C99 6.4.4.4 Character constants
     C99 6.10.1 Conditional inclusion -- Semantics

        [d.2.2]     #if: Multi-character character constant evaluation

There is the issue of byte order in the evaluation of multi-character
character constants such as 'ab'.  This is also implementation-defined.

Scoring: 2.  The scoring method is the same as for d.2.1.

        [d.2.3]     #if: Multi-byte character constant evaluation

Other than encoding, there are differences in sign handling and byte
order in multi-byte character constant evaluation.  These are all
implementation-defined.

Scoring: 2.  The scoring method is the same as for d.2.1.

        [d.2.4]     #if: Wide character constant evaluation

Other than encoding, there are differences in sign handling and byte
order in wide character constant evaluation.  These are all
implementation-defined.

Scoring: 2.  The scoring method is the same as for d.2.1.

        [d.2.5]     #if: right shift for a negative number

In general, how the sign bit is handled in a right shift of a negative
integer is implementation-defined.  This depends on the CPU
specification, and also on the implementation of the compiler system. *

Scoring: 2.  The scoring method is the same as for d.2.1.

  *  ANSI C 3.3.7 (C90 6.3.7) Bitwise shift operators -- Semantics
     C99 6.5.7 Bitwise shift operators -- Semantics
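
The behavior can be probed in both phases, as in the sketch below.
PP_SHIFT_KEEPS_SIGN and the function name are assumptions for this
example.  Either result is conforming; the point is that the documents
should say which one applies.

```c
#include <assert.h>

/* Phase 4: does the preprocessor keep the sign bit in -1 >> 1? */
#if (-1 >> 1) < 0
#define PP_SHIFT_KEEPS_SIGN 1
#else
#define PP_SHIFT_KEEPS_SIGN 0
#endif

/* Phase 7: the same question put to the compiler proper. */
int compiler_shift_keeps_sign(void)
{
    return (-1 >> 1) < 0;
}
```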

        [d.2.6]     #if: Division and modulo of a negative number

In general, when one or both of the operands are negative integers, the
results of division and modulo are implementation-defined.  This depends
on the CPU specification, and also on the implementation of the compiler
system. *1, *2

Scoring: 2.  The scoring method is the same as for d.2.1.

  *1  ANSI C 3.3.5 (C90 6.3.5) Multiplicative operators -- Semantics
      C99 6.5.5 Multiplicative operators - Semantics

  *2  In C99, quotients are truncated toward 0, just as with div() and
    ldiv().
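
A sketch of a probe for this behavior.  PP_DIV_TRUNCATES and the
function names are assumptions for this example.  With truncation
toward 0, -7 / 2 is -3 and -7 % 2 is -1; a C90 implementation may
instead give -4 and 1.

```c
#include <assert.h>

/* Phase 4: does #if evaluation truncate toward 0? */
#if (-7 / 2) == -3 && (-7 % 2) == -1
#define PP_DIV_TRUNCATES 1
#else
#define PP_DIV_TRUNCATES 0
#endif

/* Phase 7: the same question put to the compiler proper. */
int compiler_div_truncates(void)
{
    return (-7 / 2) == -3 && (-7 % 2) == -1;
}

/* Returns 1 if the implementation claims C99 or later. */
int is_c99_or_later(void)
{
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
    return 1;
#else
    return 0;
#endif
}
```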

        [d.2.7]     Valid length of identifiers

How many bytes from the beginning of an identifier, including a macro
name, are significant is implementation-defined.  For macro names and
internal identifiers, 31 bytes for C90 and 63 bytes for C99 must be
guaranteed. *

Scoring: 2.

  *  ANSI C 3.1.2  (C90 6.1.2)  Identifiers - Implementation limits
     C99 6.4.2.1 Identifiers -- General -- Implementation limits

        [d.mbident]  Multi-byte characters in identifiers

In C99, implementations are permitted to accept multi-byte characters in
identifiers.  Whether and how they do so is implementation-defined. *

Scoring: 2.

  *  C99 6.4.2.1 Identifiers -- General


3.  Evaluation of Aspects Unspecified by Standard

Even aspects that the Standards do not require of implementations may
be important in evaluating the "quality" of implementations.  This
chapter explains the quality evaluation tests.


    [3.1]       Multi-byte Character Encoding

There are various encodings for multi-byte characters and wide
characters.  Which encoding is used is implementation-defined.  This
section, however, deals with a "quality" matter of implementations: how
many encodings an implementation can handle, and how well.

This Validation Suite provides samples in several encodings for m_33,
m_34, and m_36.  Compiler systems have to support not only the standard
encoding of the system but also many other encodings, in order to
process source files written in multiple languages. *

However, the method of testing differs depending on the system and the
compiler system, and is not easy.

In GCC, the environment variables LC_ALL, LC_CTYPE, and LANG change the
behavior, but the implementation is half-finished and not reliable.
Moreover, whether this feature is available depends on how GCC itself
was configured when it was compiled.

GCC 3.4 changed its multi-byte character processing entirely.  It
converts the source file from whatever encoding to UTF-8 at the start of
preprocessing (in translation phase 1, so to speak).  Hence, a
multi-byte character constant in an #if expression can be evaluated only
in UTF-8, regardless of the original encoding.

The C++98 Standard has a similar problem.  Since it stipulates that
multi-byte characters be converted to UCNs in translation phase 1, a
multi-byte character constant in an #if expression can be evaluated only
as UCNs.

Considering the rather confused state of the Standards and the
implementations, and considering how unportable and meaningless
character constants in #if expressions are, I have excluded the tests of
multi-byte/wide character constants in #if expressions (m.33, m.34,
m.36.1) from scoring since Validation Suite V.1.5.

In Visual C++, use #pragma setlocale, which is a convenient directive.
On Windows, "Regional and Language Options" is supposed to change the
language in use, but it is ill-defined and cumbersome.  #pragma
setlocale is convenient for a programmer since it can be used without
tampering with Windows (though how well Visual C++ itself implements it
is another story).

The other compiler systems I have tested support only their default
encoding.  Many of them provide functions such as setlocale() in the
library, but these have nothing to do with preprocessing or compiling
source files.  What is needed is the capability of the compiler system
itself to recognize and process the encoding of source files.

  *  C99 introduced Unicode sequences starting with \u or \U, which
    made the relationship with multi-byte/wide characters difficult to
    understand.  The C++ Standard is even more complicated.

            [m.33.1]    Wide character constant evaluation

For wide character constant, see [2.4.33].

Scoring: none.

            [m.34]      Multi-byte character constant evaluation

Multi-byte character constants are supported in #if expressions.  (In
the Standards, the term "multi-byte character" includes single-byte
characters; in this document, to avoid confusion, it refers only to
characters that occupy more than one byte.)  Their evaluation is even
more implementation-defined than that of single-byte character
constants.

Scoring: none.
This test needs to be judged together with the test in u.1.7 described
later.  Simply evaluating a character value does not mean recognizing an
encoding.  u.1.7 tests whether a multi-byte character is in the range
accepted for the encoding.  An encoding can be said to be properly
recognized only when the value is correctly evaluated in m.34 and an
appropriate diagnostic message is issued in u.1.7.

            [m.36.1]    0x5c in Multi-byte Characters is not an escape
                                character

If the encoding of (multi-byte | wide) characters is Shift-JIS, Big-5,
or ISO-2022-*, cumbersome issues arise because a character may contain a
byte with the value 0x5c, which is the same as '\\'.  Compiler systems
must not interpret this as a \ (backslash) escape character.  A
(multi-byte | wide) character is one character, not two single-byte
characters.

0x5c in the value for multi-byte character constants must not be
interpreted as the beginning of an escape sequence.
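
The rule above can be sketched as a scanner that skips over two-byte
characters, so that a 0x5c second byte is never taken for a backslash.
This is only an illustration under the usual Shift-JIS byte ranges; the
function names are hypothetical and are not part of the Validation
Suite.

```c
#include <stddef.h>

/* First byte of a two-byte Shift-JIS character: 0x81-0x9F or 0xE0-0xFC. */
int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Count the bytes of value 0x5c that are real backslashes, ignoring
 * those that are the second byte of a two-byte character. */
size_t sjis_count_backslashes(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != '\0') {
        if (sjis_is_lead(*p) && p[1] != '\0') {
            p += 2;             /* skip both bytes of the character */
        } else {
            if (*p == 0x5c)
                n++;            /* a genuine single-byte backslash */
            p++;
        }
    }
    return n;
}
```

For instance, the byte pair 0x95 0x5c is one Shift-JIS character, so the
scanner reports no backslash in it.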

Scoring: none.

            [m.36.2]    The # operator does not insert \ in Multi-byte
                                Characters

'\\' must not be inserted when an argument that is an operand of the #
operator contains a multi-byte character with a 0x5c byte.  There is a
technique in which the preprocessor inserts '\\' to compensate for a
compiler proper that does not support multi-byte characters, but that
is a matter at another level.

In addition, the tokenization of character constants and string
literals that include multi-byte characters raises troublesome issues
that other literals do not have.

As multi-byte characters encoded in ISO-2022-* include not only '\\',
but also bytes of values matching '\'' or '"', sloppy processing will
end up with a tokenization failure.

Scoring: 7.  2 points each for the Shift-JIS and Big-5 encoding.  1
point each for 3 samples of ISO-2022-JP.

This item needs to test not only m_36_*.t, but also m_36_*.c.  This is
because a preprocessor that processes stringizing correctly may still
fail to identify a complex string literal including a Kanji character
with 0x5c when it is an argument of the assert() macro.  m_36.c also
tests the tokenization of string literals.


    [3.2]       Undefined Behavior

In Standard C, many things are specified as undefined behavior.
Undefined behavior is caused by incorrect programs or data, or at least
by programs that are not portable.  Unlike a violation of syntax rules
or constraints, however, it does not require a diagnostic message.
Implementations can process it in any way.  Processing it reasonably as
a valid program and treating it as an error with a diagnostic message
are both acceptable, and it does not even violate the Standards if
processing is aborted or runs out of control without any diagnostic
message.

However, when evaluating the "quality" of an implementation, what it
concretely does on undefined behavior matters.  I think it is
appropriate for an implementation to issue some sort of diagnostic
message.  This goes without saying when it treats the case as an error,
but even when it handles the source as a valid program, a warning is
helpful to indicate the portability issue.  Runaway is out of the
question.

The following tests check whether implementations issue an appropriate
diagnostic message for source that causes an undefined behavior.
Diagnostic messages may be an error or warning.  In case of a warning,
some sort of reasonable processing is necessary.

u.1.* are preprocessing specific problems and u.2.* are common in
constant expressions in general.

Scoring: 1 point for each item, unless otherwise noted, if an
appropriate diagnostic message is issued.  0 points for an off-target
diagnostic message or none at all.

        [u.1.1]     Source files not ending with <newline>

A source file that is not empty and does not end with <newline> causes
undefined behavior.  (Depending on the OS, a file may contain no newline
character data at all, and the implementation automatically adds a
newline when it reads the file.) *

u.1.1, u.1.2, u.1.3, and u.1.4 all concern incomplete source files.
When the translation unit ends with such a file, most implementations
seem to issue a diagnostic message.  However, even with such
implementations, an incomplete file may be treated as valid source when
it is included from another file.  Although this too is a kind of
undefined behavior and processing it this way is not wrong, it is
appropriate to issue a diagnostic message.

  *  ANSI C 2.1.1.2 (C90 5.1.1.2) Translation phases
     C99 5.1.1.2 Translation phases

        [u.1.2]     Source files ending with <backslash><newline>

Source files ending with the <backslash><newline> sequence cause
undefined behavior. *

  *  ANSI C 2.1.1.2 (C90 5.1.1.2) Translation phases
     C99 5.1.1.2 Translation phases

        [u.1.3]     Source files ending in the middle of comments

Source files ending in the middle of comments cause undefined
behavior.  In practice, this is a case of an unclosed comment or of
nested comments. *

  *  ANSI C 2.1.1.2 (C90 5.1.1.2) Translation phases
     C99 5.1.1.2 Translation phases

        [u.1.4]     Source files ending in the middle of a macro call

A source file ending in the middle of a macro call is undefined. *

This occurs when the parenthesis closing the macro arguments is
missing, and it is important to issue a diagnostic message.

  *  ANSI C 3.8.3.4 (C90 6.8.3.4)  Rescanning and further replacement
     C99 6.10.3.4 Rescanning and further replacement

        [u.1.5]     Invalid character in places other than quotes

Very few characters can be written in places other than string
literals, character constants, header-names, and comments.  They are
uppercase and lowercase letters, digits, 29 symbols, and 6 kinds of
white space characters.  This is natural, since they are the characters
meant for writing source. *

Here, we test the case of control codes other than white spaces.
Although control codes are invalid even in string literals and elsewhere,
compiler proper can check this.  As there are many locale-specific or
implementation-defined aspects in character sets in general and the
range is not necessarily clear, we do not test those.  Kanji characters
are undefined in places other than above, but not tested for similar
reasons.

  *  ANSI C 2.2.1 (C90 5.2.1) Character sets
     C99 5.2.1 Character sets
     UCN was added in C99.
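
The character classes listed above can be sketched as a simple
membership test.  This is an illustration only: the function name is
hypothetical, and the sixth white space character is assumed here to be
carriage return, which the text does not spell out.

```c
#include <string.h>

/* Return 1 if c may appear outside of literals, header-names, and
 * comments: letters, digits, the 29 symbols, and white space. */
int is_basic_source_char(int c)
{
    static const char allowed[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"   /* the 29 symbols */
        " \t\v\f\r\n";                      /* \r is an assumption */

    return c != '\0' && strchr(allowed, c) != NULL;
}
```

Characters such as @, $, and ` are rejected, as are control codes other
than white space, which is the case tested by u.1.5.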

        [u.1.6]     [VT][FF] on control lines

On a preprocessing directive line starting with #, a white space
character other than [SPACE] or [TAB] is a violation of constraints.
However, this applies in translation phase 4, and an implementation may
compress each sequence of white space characters other than <newline>
into one space in phase 3, prior to phase 4.  In that event, no
violation occurs. *

Even so, in keeping with the Standards, it is appropriate to issue a
diagnostic message for this.  This is not an undefined behavior test;
it is included here as a matter of convenience since it is difficult to
classify elsewhere.

  *  ANSI C 2.1.1.2 (C90 5.1.1.2) Translation phases
     C99 5.1.1.2 Translation phases

     ANSI C 3.8 (C90 6.8) Preprocessing directives -- Constraints
     C99 6.10 Preprocessing directives -- Constraints

        [u.1.7]     Invalid multi-byte character sequences in quotes

Even inside a string literal, character constant, header-name, or
comment, a sequence that is not accepted as multi-byte characters causes
undefined behavior.  This is the case where the byte following the first
byte of a multi-byte character is not valid as a second byte. *

Scoring: 7.  1 point each for 7 types of encodings.  Also, refer to the
explanation of m.34.

  *  ANSI C 2.2.1.2 (C90 5.2.1.2) Multibyte characters
     C99 5.2.1.2 Multibyte characters
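
The kind of check u.1.7 expects can be sketched as follows.  This is a
minimal illustration under a simplified EUC-JP model (ASCII plus
two-byte characters whose bytes are both in 0xA1-0xFE); sequences
introduced by 0x8E or 0x8F are deliberately not modelled, and the
function name is hypothetical.

```c
/* Return 1 if s is well formed under the simplified model, 0 if some
 * first byte is not followed by a valid second byte. */
int eucjp_is_well_formed(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p != '\0') {
        if (*p < 0x80) {
            p++;                        /* single-byte (ASCII) */
        } else if (*p >= 0xA1 && *p <= 0xFE
                && p[1] >= 0xA1 && p[1] <= 0xFE) {
            p += 2;                     /* valid two-byte character */
        } else {
            return 0;                   /* bad first or second byte */
        }
    }
    return 1;
}
```

A preprocessor performing such a check would issue a diagnostic message
wherever the function reports 0.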

        [u.1.8]     Logical lines ending in the middle of a character
                            constant

The character constant pp-token must complete on that logical line.  If
there is a ' without corresponding ' on a logical line, it is undefined.
*

An optional message can be written on the #error line.  However, it must
form a sequence of pp-tokens, so a single apostrophe is not allowed.  In
this sample, the part intended as a comment is swallowed by the search
for the ' that should close the character constant.

  *  ANSI C 3.1 (C90 6.1) Lexical elements -- Semantics
     C99 6.4 Lexical elements -- Semantics

        [u.1.9]     Logical lines ending in the middle of a string
                            literal

String literals must complete on that logical line.  A single " is
undefined. *

String literals extending over lines seem to have been accepted in many
UNIX compiler systems.  Source files expecting it can still be seen
sometimes.

  *  ANSI C 3.1 (C90 6.1)  Lexical elements -- Semantics
     C99 6.4 Lexical elements -- Semantics
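
Source that relies on string literals running across lines can be
rewritten portably with adjacent string literals, which are
concatenated in translation phase 6.  The following is only a sketch;
the names and the message text are made up for illustration.

```c
/* Each literal is closed on its own logical line; the two literals are
 * then concatenated into one string. */
static const char usage_text[] =
    "usage: tool [options] file\n"
    "  -h    show this help\n";

/* Count the lines in the concatenated text. */
int usage_line_count(void)
{
    int lines = 0;
    const char *p;

    for (p = usage_text; *p != '\0'; p++)
        if (*p == '\n')
            lines++;
    return lines;
}
```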

        [u.1.10]    Logical lines ending in the middle of a header name

An incomplete header-name on the #include logical line is also
undefined. *

  *  ANSI C 3.8.2 (C90 6.8.2) Source file inclusion -- Semantics
     C99 6.10.2 Source file inclusion

        [u.1.11]    ', ", /*, or \ in a header name

', /*, or \ in a header-name pp-token is undefined.  A " in a
header-name enclosed by < and > is undefined as well (in a header-name
of the string literal form, a " is an error in any case, since it simply
ends the pp-token.) *

These, except for \, can all be confused with a character constant,
string literal, or comment, and can be interpreted either way.

\ can be mistaken for an escape sequence.  A header-name has no escape
sequences, but a pp-token is not determined to be a header-name until
phase 4, after tokenization in translation phase 3.  This makes the case
hard for implementations to identify.  Although escape sequences are
processed in phase 6, they must already be recognized in phase 3, since
\" is interpreted there as an escape sequence rather than as the end of
a string literal.

However, \ is the standard path delimiter on OSes such as DOS/Windows,
and implementations on those OSes naturally handle it as a valid
character (except when \ is the last character of a header-name of the
string literal form.)  Many cases of undefined behavior are erroneous
programs, but not all.  Still, it is not recommended to write \ where /
is adequate, and implementations should issue a warning.  On other OSes,
it will be an error when opening the file; no diagnosis is necessary in
preprocessing tokenization.

  *  ANSI C 3.1.7 (C90 6.1.7) Header names -- Semantics
     C99 6.4.7 Header names -- Semantics

        [u.1.12]    #include argument is not a header name

It is undefined if the argument on the #include line is not a
header-name.  In other words, this is the case where the argument is
neither in the string literal format nor enclosed by < and >, and is not
a macro which expands into either of these.

  *  ANSI C 3.8.2 (C90 6.8.2) Source file inclusion -- Semantics
     C99 6.10.2 Source file inclusion -- Semantics
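
For contrast, the well-defined form with a macro looks like the
following sketch.  The macro name LIMITS_HEADER and the function name
are chosen here for illustration only.

```c
/* The macro expands into a header-name enclosed by < and >, so the
 * #include line is valid after macro expansion. */
#define LIMITS_HEADER <limits.h>
#include LIMITS_HEADER

/* CHAR_BIT is available only if <limits.h> was really included. */
int char_bit_from_macro_include(void)
{
    return CHAR_BIT;
}
```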

        [u.1.13]    Extra token in the #include argument

The argument on the #include line is one header-name only.  Any extra
pp-token is undefined.

  *  ANSI C 3.8.2 (C90 6.8.2) Source file inclusion -- Semantics
     C99 6.10.2 Source file inclusion -- Semantics

        [u.1.14]    Missing line number in the #line argument

It is undefined if the #line argument has no line number.  (A file name
is optional, but a line number must be the first argument.) *

  *  The Standard references for the items up to [u.1.18] are all the
     same, as follows.
     ANSI C 3.8.4 (C90 6.8.4) Line control -- Semantics
     C99 6.10.4 Line control -- Semantics

        [u.1.15]    The #line file name argument is not a string literal

The file name which is the second argument of #line must be a string
literal.

If this is a wide string literal, it is a violation of constraints while
the rest of the #line errors are undefined.  This specification lacks
balance.

        [u.1.16]    Extra token in the #line arguments

Three or more arguments on the #line line cause undefined behavior.

        [u.1.17]    The line number argument for #line is out of
                            [1, 32767] range

In C90, the line number argument for #line must be in the range of [1,
32767].  It is undefined otherwise.

This sample tests the case where the #line specification is within this
range, but the line number of the source exceeds the range afterward.
Depending on implementations, the line number may silently wrap.  It is
appropriate to issue a warning.

Scoring: 2.  1 point if 1 or 2 out of 3 samples can be diagnosed.
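
The situation tested here can be sketched as below.  The #line argument
itself is within the C90 range, but the source lines that follow the
function body's first two lines are numbered beyond 32767.  A C99
implementation accepts this silently; the function name is
hypothetical.

```c
int last_c90_line_number(void)
{
#line 32766
    int before = __LINE__;              /* this line is numbered 32766 */
    int last = __LINE__;                /* this line is numbered 32767 */

    /* From here on, line numbers exceed the C90 limit of 32767. */
    return (last == before + 1) ? last : -1;
}
```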

        [u.line]    C99: the line number argument for #line is out of
                            range

In C99, this range is [1, 2147483647].

        [u.1.18]    The line number argument for #line is not decimal

The line number argument for #line must be a decimal number.
Hexadecimal and others are undefined.

        [u.1.19]    Macro on the #if line expanded into defined

The fact that 'defined' is an operator yet can be confused with an
identifier causes various problems.  It is first tokenized as an
identifier in translation phase 3 and recognized as an operator in
phase 4 only if it appears on a #if line, so it is possible to #define
a macro which expands into defined.  If such a macro actually appears
on a #if line, it is undefined.  There is no guarantee that the result
of its expansion is treated as an operator. *

Defining a macro named defined is itself undefined (refer to [u.1.21]),
but such an example is not seen in reality.  A macro definition with
the token defined in its replacement list, on the other hand, can be
seen from time to time.  Some compiler systems treat this as legitimate
by processing it specially, but that is not a logical specification.

Scoring: 2.  1 point if only 1 out of 2 samples can be diagnosed.

  *  ANSI C 3.8.1  (C90 6.8.1) Conditional inclusion -- Semantics
     C99 6.10.1 Conditional inclusion -- Semantics
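
A portable way to get the effect such macros aim at is to apply defined
directly on the #if line and record the result in an ordinary macro.
The names DEMO_FEATURE and HAS_DEMO_FEATURE below are hypothetical.

```c
/* Not portable (undefined when the macro is used on a #if line):
 *     #define HAS_DEMO_FEATURE  defined(DEMO_FEATURE)
 *     #if HAS_DEMO_FEATURE
 */

/* Portable: defined appears literally on the #if line. */
#if defined(DEMO_FEATURE)
#define HAS_DEMO_FEATURE 1
#else
#define HAS_DEMO_FEATURE 0
#endif

int has_demo_feature(void)
{
    return HAS_DEMO_FEATURE;    /* 0 unless DEMO_FEATURE is defined */
}
```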

        [u.1.20]    defined, __LINE__, etc. for the #undef argument

It is undefined if the #undef argument is defined, __LINE__, __FILE__,
__DATE__, __TIME__, __STDC__, or __STDC_VERSION__. *1, *2, *3

  *1  ANSI C 3.8.8 (C90 6.8.8) Predefined macro names
      C99 6.10.8 Predefined macro names

  *2  Amendment 1/3.3 Version macro

  *3  __STDC_HOSTED__, __STDC_ISO_10646__, __STDC_IEC_559__, and
    __STDC_IEC_559_COMPLEX__ were added in C99.

        [u.1.21]    defined, __LINE__, etc. for the #define macro name

It is undefined if the macro name defined by #define is defined,
__LINE__, __FILE__, __DATE__, __TIME__, __STDC__, or __STDC_VERSION__.
*1, *2, *3

Scoring: 2.  1 point if 2 out of 3 samples are diagnosed.

  *1  ANSI C 3.8.8 (C90 6.8.8) Predefined macro names

  *2  Amendment 1/3.3 Version macro

  *3  __STDC_HOSTED__, __STDC_ISO_10646__, __STDC_IEC_559__, and
    __STDC_IEC_559_COMPLEX__ were added in C99.

        [u.1.22]    Invalid pp-token generation by the ## operator

The result of pp-token concatenation by the ## operator must be a valid
(single) pp-token.  It is undefined, otherwise. *

In this sample, the subject matter is the pp-token called a pp-number
which was defined as a new specification in Standard C.

  *  ANSI C 3.8.3.3 (C90 6.8.3.3) The ## operator -- Semantics
     C99 6.10.3.3 The ## operator -- Semantics
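
The difference between valid and invalid concatenation can be sketched
as follows.  Only the well-defined cases are compiled; the undefined
ones are shown in a comment.  PASTE and the function names are
hypothetical.

```c
#define PASTE(a, b) a ## b

/* 0x ## 1F  ->  0x1F : two pp-numbers forming one valid pp-number. */
int pasted_hex(void)
{
    return PASTE(0x, 1F);
}

/* 12 ## e3  ->  12e3 : a pp-number and an identifier forming one
 * valid pp-number. */
double pasted_float(void)
{
    return PASTE(12, e3);
}

/* Undefined, so deliberately not compiled:
 *     PASTE(-, 1)     the result would be two pp-tokens, - and 1
 *     PASTE(., e3)    .e3 is not a pp-number
 */
```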

        [u.concat]  C99: Invalid pp-token generation by the ## operator

The // mark is used for a comment in C99 and C++, but is not a
pp-token.  The result of generating this sequence with the ## operator
is undefined.

A comment is converted into a space before a macro is defined or prior
to expansion, thus it is not possible for a macro to generate a comment.

        [u.1.23]    Invalid pp-token generation by the # operator

The result of stringizing by the # operator must be a valid (single)
string literal.  It is undefined otherwise. *

This problem rarely occurs.  As you can see in this sample, it is
limited to odd cases where \ is outside of a literal, and to even more
special cases.  It is sufficient for this sample to be diagnosed in the
compilation phase rather than in preprocessing.  However,
implementations must not crash or silently ignore the problem.

  *  ANSI C 3.8.3.2 (C90 6.8.3.2) The # operator -- Semantics
     C99 6.10.3.2 The # operator -- Semantics
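
For contrast, the ordinary well-defined case is sketched below: when
the argument is itself a string literal, the # operator inserts \
before each " and \ in it, and the result is always a valid string
literal.  STR and the function name are hypothetical.

```c
#include <string.h>

#define STR(x) #x

/* The argument is the string literal "a\n", quotes included.  Its
 * stringized form spells the 5 characters  " a \ n "  . */
int stringize_escapes_literal(void)
{
    return strcmp(STR("a\n"), "\"a\\n\"") == 0
        && strlen(STR("a\n")) == 5;
}
```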

        [u.1.24]    Empty argument in a macro call

Having an empty argument in a function-like macro call is undefined in
C90. *1

It is quite possible to perform a reasonable macro expansion by
interpreting an empty argument as zero pp-tokens. *2

This is an example where implementations can give a meaningful
definition to something the specification leaves undefined.  Source
relying on this, however, is not portable, at least before C99.  It is
appropriate to issue a warning.

Scoring: 2.  1 point if 3 or 4 out of 5 samples are diagnosed.

  *1  ANSI C 3.8.3 (C90 6.8.3) Macro replacement -- Semantics
  *2  C99 6.10.3 Macro replacement -- Semantics
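
The following sketch shows what an empty argument does under the C99
rules, where it is valid; under C90 the same source is undefined.
SUFFIXED and the function names are hypothetical.

```c
#define SUFFIXED(value, suffix) value ## suffix

/* C99: the empty second argument is treated as zero pp-tokens, so the
 * result of ## is just the first operand. */
long plain_ten(void)
{
    return SUFFIXED(10, );
}

/* With a non-empty argument the same macro forms 10L. */
long long_ten(void)
{
    return SUFFIXED(10, L);
}
```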

        [u.1.25]    Argument similar to the control line in a macro call

A function-like macro call can extend over multiple logical lines.
Therefore, there may be a line confusable with a preprocessing directive
in an argument; the result is undefined. *

This type of "argument" will be interpreted as a preprocessing directive
if it is in the #if group to be skipped.

  *  ANSI C 3.8.3  (C90 6.8.3) Macro replacement -- Semantics
     C99 6.10.3 Macro replacement -- Semantics

        [u.1.26]    Macro expansion ending with a function-like macro
                            name

In C90, the macro expansion result ending with a function-like macro
name is considered to be undefined.  This is an interpretation added
later, but not clear in meaning.  Please refer to [1.7.6].

This was included as a test in this Validation Suite up to V.1.2, but
omitted from V.1.3.

  *  ISO/IEC 9899:1990 / Corrigendum 1
    This specification was removed in C99 and a complicated
    specification was added.

        [u.1.27]    Invalid directives

When the first pp-token on a line is # and another pp-token follows it,
what follows # must usually be a preprocessing directive.  A line
consisting of # only is accepted for some reason. *1

However, a # at the beginning of a line followed by an invalid directive
or some other pp-token is not a violation of syntax or constraints for
preprocessing.  As you can see in [u.1.25], that is because not every
line starting with # has to be a preprocessing directive line.  Which
lines are preprocessing directive lines depends on the context.

The Standards do not specify this as undefined; it is, however,
undefined in a way, in the sense that "it is not defined as a
specification."  It is desirable for an implementation to issue some
sort of diagnostic message.  However, this does not necessarily have to
be diagnosed by preprocessing.  If preprocessing outputs this line as
is, it will be an error in the compilation phase.  That is acceptable
as well.  There is no danger as long as preprocessing does not silently
interpret, say, #ifdefined as #if defined.

In C99, something called "# non-directive", whose meaning is unclear,
was added.  However, its content is not defined, and practically
speaking it can be regarded as undefined. *2

  *1  ANSI C 3.8 (C90 6.8) Preprocessing directives -- Syntax
  *2  C99 6.10 Preprocessing directives -- Syntax

        [u.1.28]    Directive name not macro-expanded

In order for the line starting with # to be a preprocessing directive,
only a directive name is allowed as the next pp-token.  Directive names
are never macro-expanded.

When an identifier which is not a directive name follows # and it is a
macro name, there are two ways of processing it: diagnosing it as a
missing directive, or treating the line as ordinary text and outputting
it with the macro expanded.  Even the latter needs some sort of
diagnosis.  Since the latter should cause an error in the compilation
phase, that is acceptable as well.  The only thing that must not happen
is processing the macro-expanded result as a "correct" preprocessing
directive.

        [u.2.1]     Undefined character escape sequence in the #if
                            expression

The only character escape sequences specified for a string literal or
character constant are \', \", \?, \\, \a, \b, \f, \n, \r, \t, and \v.
Apart from octal and hexadecimal escape sequences, other character
sequences starting with \ are undefined.  In particular, sequences
consisting of \ followed by a lowercase letter are reserved for future
use. *

Most of these are diagnosed in the compilation phase; however, only
preprocessing can diagnose them when they appear in a character
constant in a #if expression.

  *   ANSI C 3.1.3.4 (C90 6.1.3.4) Character constants -- Description
      C99 6.4.4.4 Character constants -- Description
      ANSI C 3.9.2 (C90 6.9.2) Character escape sequences
      C99 6.11.4 Character escape sequences

In C99, UCN (universal-character-name) escape sequences in the \uxxxx or
\Uxxxxxxxx format were added.
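
The list of specified character escape sequences can be sketched as a
lookup that a #if evaluator might use to decide whether to issue a
diagnostic message.  Octal, hexadecimal, and UCN sequences are outside
the scope of this sketch, and the function name is hypothetical.

```c
#include <string.h>

/* Return 1 if c, the character after a backslash, forms one of the
 * eleven specified character escape sequences. */
int is_simple_escape(int c)
{
    return c != '\0' && strchr("'\"?\\abfnrtv", c) != NULL;
}
```

With this check, a sequence such as \z or \e in a #if character
constant would be reported rather than silently evaluated.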

        [u.2.2]     Bit shift operation with invalid shift count in the
                            #if expression

In a bit shift operation on an integer type, it is undefined if the
right operand (shift count) is negative or is greater than or equal to
the width of the left operand type. *

If this is in the #if expression, it should be diagnosed by
preprocessing.

  *  ANSI C 3.3.7 (C90 6.3.7) Bitwise shift operators -- Semantics
     C99 6.5.7 Bitwise shift operators -- Semantics
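
The condition above can be sketched as a range check that a #if
evaluator might apply before shifting.  The function name and its
parameters are hypothetical; width stands for the number of bits in the
left operand type.

```c
/* Return 1 if a shift by count is defined for an operand that is
 * width bits wide, as far as the shift count is concerned. */
int shift_count_is_valid(long long count, int width)
{
    return count >= 0 && count < width;
}

/* A shift with a valid count is evaluated normally on a #if line. */
#if (1 << 4) != 16
#error "unexpected #if shift result"
#endif
```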


    [3.3]       Unspecified Behavior

In Standard C, some behavior is specified as unspecified.  This applies
to valid programs whose processing method is not specified and need not
even be documented by the implementation.

There are not many examples for this.  Only extremely special cases have
different results according to the processing method.  However, it is
desirable to issue a warning even if it is a special case.

The result of a program depending on unspecified behavior is undefined.

The following are the items which are unspecified in preprocessing and
whose results differ according to the processing method.  In each of
these 2 tests, 2 points are given either if an invalid pp-token is
generated and a diagnostic message is issued for it, or if a warning
about the lack of portability is issued.  In the latter case, the
warning can be issued at macro definition or at expansion.

Additionally, evaluation order within an #if expression is unspecified.
This is not included in testing since the #if expression result does not
change because of this.

        [s.1.1]     Unspecified evaluation order of the # and ##
                            operators

In case both the # and ## operators are in one macro definition, which
is evaluated first is not specified. *

This sample gives different results depending on which of # and ## is
evaluated first.  Furthermore, if ## is evaluated first, # is treated
as a pp-token rather than an operator, and the concatenation generates
an invalid pp-token.  As this type of macro is not portable, it is
preferable for an implementation to issue a warning.

Scoring: 2.

  *  ANSI C 3.8.3.2 (C90 6.8.3.2) The # operator -- Semantics
     C99 6.10.3.2 The # operator -- Semantics
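
A macro that mixes # and ## in one definition, such as
"#define MKWIDE(s) L ## #s", depends on this unspecified order.  One
portable alternative is to split the two operators across separate
macros so that the order is fixed by argument expansion.  This is only
a sketch and not the suite's own sample; all the names are
hypothetical.

```c
#include <wchar.h>

#define STRINGIZE(s)    #s
#define WIDEN_(s)       L ## s
#define WIDEN(s)        WIDEN_(s)

/* STRINGIZE(s) is fully expanded as an argument of WIDEN before
 * WIDEN_ concatenates L with the resulting string literal. */
#define WIDE_STRING(s)  WIDEN(STRINGIZE(s))

int wide_string_ok(void)
{
    return wcscmp(WIDE_STRING(abc), L"abc") == 0;
}
```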

        [s.1.2]     Unspecified evaluation order for multiple ##
                            operators

In case there are multiple ## operators in one macro definition, the
evaluation order is not specified. *

In this sample, an invalid pp-token is generated in the course of
evaluation, depending on the evaluation order.  As this type of macro
is not portable, it is preferable for an implementation to issue a
warning.

Scoring: 2.

  *  ANSI C 3.8.3.3 (C90 6.8.3.3)  The ## operator -- Semantics
     C99 6.10.3.3 The ## operator -- Semantics
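
When the order matters, it can be fixed explicitly by nesting
two-operand concatenation macros instead of writing a ## b ## c.  The
following sketch is not the suite's own sample; the macro and function
names are hypothetical.

```c
#define CAT2_(a, b)     a ## b
#define CAT2(a, b)      CAT2_(a, b)

/* The inner CAT2 is expanded first as an argument, so a and b are
 * always concatenated before c. */
#define CAT3(a, b, c)   CAT2(CAT2(a, b), c)

/* 1 ## .  ->  1.   and then   1. ## e3  ->  1.e3 ; both are valid
 * pp-numbers.  Evaluating the right-hand ## first would form .e3,
 * which is not a valid pp-token. */
double left_to_right(void)
{
    return CAT3(1, ., e3);
}
```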


    [3.4]       Other Cases Where a Warning is Preferable

Other than undefined and unspecified, there are cases for which it is
preferable for implementations to issue a warning.  Those are below.

w.1.* and w.2.* are totally valid programs as far as the Standards are
concerned; however, they probably contain some sort of error, and
diagnosis is important.  w.1.* are preprocessing specific problems and
w.2.* are #if expression versions of the problems common with
operations in the compilation phase.

w.3.* concern a specification with implementation-defined aspects of
translation limits.  Implementing the translation limits beyond the
minimum value guaranteed by a Standard can be said to improve the
quality of the implementation.  On the other hand, programs depending on
it will have the problem of limited portability.  Therefore, it is
preferable for an implementation which implements translation limits
beyond the minimum value to issue warnings.

In the tests below, it is a pass if an appropriate diagnostic message
is issued.  w.3.* allow an error when the translation limits of an
implementation match the minimum value of the Standard.  It is a
failure if an error occurs without meeting the minimum value (whether
the minimum value is met is determined in n.37.*.)

        [w.1.1]     /* inside a comment

Source often contains nested comments or typos that drop a comment
mark.  Among those, nested comments such as /* /* */ */ and a comment
with only */ always cause an error in the compilation phase, since the
*/ sequence does not exist in the C language.  However, when */ is
missing, it may not cause an error, since everything up to the end of
the next comment is interpreted as a comment.  This is a dangerous
mistake and it is important for preprocessing to issue a warning.  Even
if some sort of error occurs in the compilation phase, the cause is no
longer clear by that time.

Scoring: 4.
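
The detection this item asks for can be sketched as a scan that looks
for a comment opener while already inside a comment.  This is a
simplified illustration that ignores string literals and character
constants; the function name is hypothetical.

```c
#include <stddef.h>

/* Count the comment openers that appear inside a comment.  Each one
 * is a candidate for a warning. */
int count_openers_inside_comments(const char *src)
{
    int in_comment = 0;
    int count = 0;
    size_t i = 0;

    while (src[i] != '\0') {
        if (!in_comment && src[i] == '/' && src[i + 1] == '*') {
            in_comment = 1;
            i += 2;
        } else if (in_comment && src[i] == '*' && src[i + 1] == '/') {
            in_comment = 0;
            i += 2;
        } else if (in_comment && src[i] == '/' && src[i + 1] == '*') {
            count++;
            i += 1;     /* step over '/' only: the '*' may close */
        } else {
            i++;
        }
    }
    return count;
}
```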

        [w.1.2]     Inclusion of a subsequent token sequence by macro
                            rescan

The case of the inclusion of a token sequence after a replacement list
at macro rescan is an unwritten specification in K&R 1st and the
official Standard C.  However, this situation is brought about only by
an unusual macro.  Especially, the one where the replacement list
comprises the first half part of another function-like macro call is an
extremely unusual macro.  In reality, an error in the macro definition
is quite likely, so it is preferable for implementations to issue a
warning.  I sometimes see an object-like macro expanded into a
function-like macro name, though such a writing style hurts
readability.

This is a problem for Standards.  I think that incomplete rescan in a
replacement list should be specified as an error (violation of
constraints) (refer to [1.7.6].)

Scoring: 4.  1 point if only 1 out of 2 samples can be diagnosed.

        [w.2.1]     Negative value converted into an unsigned number by
                the usual arithmetic conversion in the #if expression

In the mixed mode operation of signed and unsigned integers, the "usual
arithmetic conversion" is performed and signed numbers are converted
into unsigned numbers.  If the original signed integer is positive, the
value does not change.  However, if it is negative, it is converted
into a large positive number.  This is not an error in itself, but it
is abnormal and probably indicates a mistake.  It is preferable to
issue a warning.  In preprocessing, this phenomenon is seen in the #if
expression.

Scoring: 2.
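
A minimal sketch of the phenomenon follows.  The macro and function
names are hypothetical.

```c
/* -1 is converted to the largest unsigned value before the comparison,
 * so the condition is false although it looks true. */
#if -1 < 0U
#define MINUS_ONE_IS_LESS 1     /* what the arithmetic suggests */
#else
#define MINUS_ONE_IS_LESS 0     /* what #if actually selects */
#endif

int minus_one_is_less(void)
{
    return MINUS_ONE_IS_LESS;
}
```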

        [w.2.2]     Unsigned operation in the #if expression wraps round

When the result of an unsigned operation is out of range, it is not
necessarily an error, since the result is defined to wrap round.
However, since a mistake is likely, it is preferable to issue a warning.

Scoring: 1.
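
A minimal sketch of wrap-round in a #if expression follows.  The macro
and function names are hypothetical.

```c
/* 0U - 1U cannot be represented, so it wraps round to the largest
 * unsigned value, which is greater than 0U. */
#if 0U - 1U > 0U
#define UNSIGNED_WRAPPED 1
#else
#define UNSIGNED_WRAPPED 0
#endif

int unsigned_wrapped(void)
{
    return UNSIGNED_WRAPPED;
}
```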

        [w.3.1]     Identifier of 32 bytes or longer
                    32 or more macro parameters
                    32 or more macro arguments

w.3.1, w.3.2, w.3.3, w.3.4, and w.3.5 are all tests regarding the
translation limits in C90.  The contents are clear by themselves and no
explanation is necessary.  Compare these with n.37.*.

Scoring: 3.  1 point each for each of 3 items.

        [w.3.2]     9 or more levels of #if (#ifdef, #ifndef) nesting

Scoring: 1.

        [w.3.3]     9 or more levels of #include nesting

Scoring: 1.

        [w.3.4]     33 or more sub-expression nesting in the #if
                            expressions

Scoring: 1.

        [w.3.5]     510 bytes or longer string literals

Scoring: 1.

        [w.3.6]     510 bytes or longer logical lines

Scoring: 1.

        [w.3.7]     1025 or more macro definitions

This item alone is the same as n.37.8.  It is the least precise of the
translation limit specifications: the count varies depending on whether
built-in macros and the macros defined in standard headers are counted
among the 1024 macros.  The macro tested in this sample is the 1024th
in header files, but there are some macros defined earlier in warns.t
and warns.c.  In any case, this macro is beyond the 1024th.
Implementations are expected to issue warnings at appropriate places.

Scoring: 1.

        [w.tlimit]  C99 translation limits

In C99, translation limits were largely extended.  Even for an
implementation whose limits exceed these, it is preferable, for the
sake of portability, to issue a warning for source which exceeds them.

Scoring: 9.  1 point each for the following 9 items.

Test samples are in the test-l directory.  l_37_8.c is pseudo source
which can be preprocessed, but not compiled.

        [w.3.1L]    128 or more macro parameters
        [w.3.2L]    128 or more macro arguments
        [w.3.3L]    64 bytes or longer identifier
        [w.3.4L]    64 or more levels of #if (#ifdef, #ifndef) nesting
        [w.3.5L]    16 or more levels of #include nesting
        [w.3.6L]    64 or more levels of sub-expression nesting in the
                            #if expressions
        [w.3.7L]    4096 bytes or longer string literals
        [w.3.8L]    4096 bytes or longer logical lines
        [w.3.9L]    4096 or more macro definitions


    [3.5]       Other Quality Matters

Below are the items concerning quality, such as an implementation's
ease of use.  Cases other than q.1.1 cannot be tested by sample
programs.

  q.1.* are regarding behaviors.
  q.2.* are regarding options and extended functionalities.
  q.3.* are regarding the runnable systems and the efficiency on them.
  q.4.* are regarding documents.

Among these, some cannot help relying on rather subjective evaluation.
Others can be evaluated objectively, but the measure may not be
specified clearly.  Refer to [5.2] to score them appropriately.


        [3.5.1]     Qualities regarding Behaviors

            [q.1.1]     Translation limits above the specification

Concerning translation limits, the minimum specification is defined
leniently in Standards.  However, the actual implementations should have
the specification above this.  Especially, the required nesting levels
of #if and #include are too low in C90.

In C99, translation limits were largely raised.  Also, restricting
identifier length to less than 255 bytes is described as an obsolescent
feature.

Among the q.* items, test samples are provided only for this one.
l_37_?.t and l_37_?.c in the test-l directory each test the translation
limits below.  These exceed the C99 requirements (however, the
translation limit values in the C++ Standard guideline are higher than
these.)

    37.1L   :   Number of parameters in a macro definition  :   255
    37.2L   :   Number of arguments in a macro call     :   255
    37.3L   :   Length of an identifier                 :   255 bytes
    37.4L   :   Nesting level for #if                   :   255
    37.5L   :   Nesting level for #include              :   127
    37.6L   :   Nesting level for sub-expressions
                        in the #if expressions          :   255
    37.7L   :   Length of a string literal              :  8191 bytes
    37.8L   :   Length of a logical line in source      : 16383 bytes
    37.9L   :   Number of macro definitions             :  8191

l_37_8.c does not become an executable program even if it is compiled.
Only preprocessing is needed for this test; however, you can also make
an object file with cc -c l_37_8.c.  If preprocessing succeeds, this
shows how long a line the compiler proper can accept.
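
A sample of the kind used for 37.4L could be generated mechanically.
The following is only a sketch under stated assumptions: gen_nested_if()
and count_occurrences() are hypothetical helpers and are not part of
this Validation Suite, and the generated text is not the actual content
of l_37_4.c.

```c
#include <stdio.h>
#include <string.h>

/* Appends one formatted line to buf; returns 0 when it does not fit. */
static int append_line(char *buf, size_t cap, size_t *used,
                       const char *fmt, int n)
{
    int len;

    if (*used >= cap)
        return 0;
    len = snprintf(buf + *used, cap - *used, fmt, n);
    if (len < 0 || (size_t)len >= cap - *used)
        return 0;
    *used += (size_t)len;
    return 1;
}

/*
 * Hypothetical generator: writes `depth` nested #if groups around one
 * declaration.  Returns the length of the text, or 0 if buf is too small.
 */
static size_t gen_nested_if(char *buf, size_t cap, int depth)
{
    size_t used = 0;
    int i;

    for (i = 1; i <= depth; i++)
        if (!append_line(buf, cap, &used, "#if 1 /* level %d */\n", i))
            return 0;
    if (!append_line(buf, cap, &used, "int nest_ok = %d;\n", depth))
        return 0;
    for (i = depth; i >= 1; i--)
        if (!append_line(buf, cap, &used, "#endif /* level %d */\n", i))
            return 0;
    return used;
}

/* Counts non-overlapping occurrences of needle in text. */
static int count_occurrences(const char *text, const char *needle)
{
    int count = 0;
    size_t len = strlen(needle);
    const char *p = text;

    while ((p = strstr(p, needle)) != NULL) {
        count++;
        p += len;
    }
    return count;
}
```

Writing the buffer to a file and preprocessing it with increasing values
of depth gives a rough idea of where an implementation's limit lies.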

Scoring: 9.  1 point for each of 9 kinds of samples.  Compiler proper
testing is not included.

            [q.1.2]     Accuracy of diagnostic messages

Even when a diagnostic message is issued, it can be hard to understand,
too vague, or unclear about where the problem lies.  Some
diagnostic messages are detailed while others may lack focus.  A
diagnostic message should not be simply "syntax error", but should show
the reason for the error.  The implementation should indicate the line
and point out the token of the problem.

The error for a mismatched #if section must indicate the corresponding
line.  Otherwise, it is not possible to find where the error resides.
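
One way to be able to report the corresponding line is to remember the
line number of every #if that is still open.  The following is a
minimal sketch, not the method of any particular preprocessor.
find_unbalanced_if() is a hypothetical name, and the sketch ignores
comments, line splicing, and #elif / #else for brevity.

```c
#include <ctype.h>
#include <string.h>

#define MAX_IF_DEPTH 256

/* Returns the text after '#' if the line is a directive, else NULL. */
static const char *directive_name(const char *line)
{
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line != '#')
        return NULL;
    line++;
    while (*line == ' ' || *line == '\t')
        line++;
    return line;
}

/* True if s begins with word and the word ends there. */
static int starts_with_word(const char *s, const char *word)
{
    size_t n = strlen(word);

    return strncmp(s, word, n) == 0
            && !isalnum((unsigned char)s[n]) && s[n] != '_';
}

/*
 * Scans `count` lines (numbered from 1).  Returns 0 if all #if groups
 * are closed, the line number of the innermost unclosed #if if one
 * remains open, or the negated line number of an #endif without #if.
 */
static int find_unbalanced_if(const char *const lines[], int count)
{
    int open_line[MAX_IF_DEPTH];    /* line of each #if still open */
    int depth = 0;
    int i;

    for (i = 0; i < count; i++) {
        const char *d = directive_name(lines[i]);

        if (d == NULL)
            continue;
        if (starts_with_word(d, "if") || starts_with_word(d, "ifdef")
                || starts_with_word(d, "ifndef")) {
            if (depth == MAX_IF_DEPTH)
                return i + 1;       /* deeper than this sketch supports */
            open_line[depth++] = i + 1;
        } else if (starts_with_word(d, "endif")) {
            if (depth == 0)
                return -(i + 1);
            depth--;
        }
    }
    return depth == 0 ? 0 : open_line[depth - 1];
}
```

With the line number kept on the stack, a message such as "unterminated
#if started at line N" can name the line the user actually has to look
at.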

Duplicate diagnostic messages for the same error are not desirable.

Scoring: 10.

            [q.1.3]     Accuracy of line number display

It is not acceptable for the line numbers that a preprocessor passes to
the compiler proper to be wrong.  Line numbers also appear in diagnostic
messages, but they are treated as a separate item here for convenience.
This item is scored on whether line numbers are displayed accurately in
the sample programs so far.  There are preprocessors that do not output
line number information at all.  These are out of the question.

Scoring: 4.

            [q.1.4]     Runaway or abort

Scoring: 20.  Deduct points as below for implementations which run away
or abort in any of the samples in this Validation Suite.  A "runaway"
means that the program must be aborted by ^C, must be reset, or leaves a
discrepancy in the file system.  An "abort" means that the program ends
prematurely but does no such damage.

1. 0 points for a runaway in n_std.c (strictly conforming program.)

2. 10 points if the process aborts in n_std.c (if testing the part after
the point of the abort causes another runaway, classify it as a
"runaway".)

3. 10 points for a runaway in any of the other samples.


        [3.5.2]     Options and extended functionalities

            [q.2.1]     Specifying the include directory

The so-called include directory where standard header files are located
is fixed to one location in the simplest case.  However, there are often
several, and sometimes users must specify one.  User-level header files
cause no problem when they are in the current directory; however, when
they are in another directory and include another header file, the
directory search rule differs among implementations.  In any case, it
is inconvenient if users cannot specify the include directory explicitly
by options or environment variables (many implementations use the -I
option.)  In addition, when multiple directories are searched in
sequence, an option to exclude the directories predefined by the system
is necessary.  An option to change the search rule itself is also
worthwhile.

Scoring: 4.

            [q.2.2]     Macro definition option

It is valuable to have an option that allows an object-like macro to be
defined at compilation rather than in source (many implementations use
the -D option.)  This makes it possible for the same source to generate
objects of different specifications or to be compiled on different
systems.  When the replacement list is omitted, some implementations
define the macro as 1 and others define it as zero tokens, so a user has
to check the documentation.
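
The difference between the two conventions matters to the source that
tests the macro.  The sketch below imitates both in one file with two
hypothetical macros, TRACE_AS_ONE and TRACE_AS_EMPTY, standing for what
an option such as -DTRACE might produce.  The names are assumptions for
illustration only.

```c
/* As if the option had defined the macro as 1. */
#define TRACE_AS_ONE    1
/* As if the option had defined the macro as zero tokens. */
#define TRACE_AS_EMPTY

/*
 * "#if TRACE_AS_EMPTY" would be a syntax error, because nothing is
 * left of the expression.  #ifdef works with both conventions.
 */
#ifdef TRACE_AS_ONE
#define ONE_SEEN    1
#else
#define ONE_SEEN    0
#endif

#ifdef TRACE_AS_EMPTY
#define EMPTY_SEEN  1
#else
#define EMPTY_SEEN  0
#endif

/*
 * Appending "+ 0" keeps #if well-formed in both cases, but the two
 * conventions then evaluate differently: 1 + 0 versus + 0.
 */
#if TRACE_AS_ONE + 0
#define ONE_VALUE   1
#else
#define ONE_VALUE   0
#endif

#if TRACE_AS_EMPTY + 0
#define EMPTY_VALUE 1
#else
#define EMPTY_VALUE 0
#endif
```

Source that only asks "#ifdef" is therefore portable across both kinds
of implementation, while source that uses the value must either give the
replacement list explicitly on the command line or be checked against
the documentation.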

Some implementations have options to define function-like macros (macros
with arguments.)

The fact that some implementations with trigraph support cannot use
trigraphs in the macro definition option is unfortunate (though this is
not subject to scoring.)

Scoring: 4.

            [q.2.3]     Macro cancellation option

There should be an option to cancel implementation-specific built-in
macros.  There are the following types.
1. The -U option which undefines a macro.
2. The option that removes implementation-specific built-in macros all
    at once.
3. The option that removes built-in macros prohibited in Standard C only
    ('unix' and others which do not start with _) all at once.

Scoring: 2.  2 points if option 1 or 2 is available.  Option 3 is not
counted here as it is evaluated in d.1.5.

            [q.2.4]     Trigraphs option

Trigraphs are indispensable in the environments that need them, but they
are hardly used in most environments since they are not needed there.
Their recognition should be enabled or disabled by an option at
compilation.

Scoring: 2.

            [q.2.5]     Digraphs options

Similarly, digraphs also should be enabled or disabled by an option at
compilation.

Scoring: 2.

            [q.2.6]     Warning specifying option

It is helpful to have as many warnings as possible beyond violations of
syntax rules or constraints; however, they can sometimes be annoying.
An option to specify a warning level, or to enable or disable warnings
by type, is needed.

Scoring: 4.

            [q.2.7]     Other options

There are other helpful options in preprocessing.  Too many options make
things complicated, but some are convenient.  The option not to output
line number information (-P in most cases) is relatively common and
seems to be used for purposes other than C.  Some implementations have
an option to output comments without deleting them.  Depending on the
command processor of the OS, implementations may need to provide
redirection of diagnostic messages themselves.  In addition, so-called
one-pass compilers, in which preprocessing is not an independent
program, should have an option to output the preprocessed text.

An option to select among C90 (C95), C99, and C++ is obviously
necessary.  Furthermore, an option to achieve compatibility between C99
and C++ (making C++ preprocessing closer to C99's) is helpful.

Some implementations have an option to output source file dependencies
in order to generate a Makefile.

Scoring: 10.

            [q.2.8]     Extension by #pragma

In Standard C, implementation specific directives are all implemented as
sub-directives of #pragma.  Preprocessing usually passes #pragma to a
compiler as is and processes some of #pragma on its own.  There are not
many examples of the #pragma that preprocessing processes.

A header file containing #pragma once is read only once even if it is
the target of #include many times.  This not only prevents multiple
inclusion, but also improves processing speed.  Even without #pragma
once, when the whole header file is enclosed as in, for example,

#ifndef __STDIO_H
#define __STDIO_H
    ..
#endif

some implementations automatically avoid accessing the file a second
time.

MCPP has a #pragma directive that combines many header files into one
file as a "pre-preprocess."  In other words, it preprocesses the headers
included by #include, outputs the result, and adds the #define
directives that appeared there to the combined output.  Including this
single file is enough to compile the original source.  The combined
header is smaller because comments, #if sections, and macro calls
disappear, and there is only one file to access.  As a result,
preprocessing is faster.

Some implementations are equipped with a header pre-compilation feature.
This seems to have been introduced to process huge header files in C++,
however, there is a tendency that the size of pre-compiled header gets
larger than the total of original header files.  The speed improvement
cannot be seen very much in C, at least.  The content of the pre-
compiled header depends on the internal specification of compiler-proper.
And the fact that it becomes a black box, which a user cannot see, is
also a drawback.

In any case, above are all intended to speed up preprocessing and
nothing else.  Therefore, these features are not evaluated here, but
rather in [q.3.1].

Some implementations have a #pragma which specifies the encoding of
multi-byte characters.  This is a thorough way of conveying the encoding
of source files to a preprocessor or compiler.

MCPP also has a #pragma which traces preprocessing and outputs debug
information.  Since preprocessing cannot be debugged with a regular
debugger, this is an important feature that only a preprocessor can
provide.  It is easier to use through #pragma than through an option,
since the region to be traced can be restricted.

Some implementations use #pragma for what is normally specified by a
compile-time option, such as error and warning control.  #pragma has the
merits of causing no portability problems among Standard C compliant
implementations and of allowing the location in source to be specified,
but it has the demerit that source files have to be rewritten to change
the setting.  If such a #pragma is implemented, it should be provided in
addition to the compile-time option.

There are not very many other #pragma which are processed in
preprocessing.

By the way, #pragma sub-directives are implementation-defined, so names
may collide between implementations.  Some device to avoid name
collision is desirable.  GCC 3 prefixes its sub-directives with the name
GCC, as in '#pragma GCC poison'.  This is a good idea.  MCPP has adopted
this idea, as in '#pragma MCPP debug', since V.2.5.

Scoring: 10.

            [q.2.9]     Extended functionalities

Although extended functionalities should be implemented as #pragma sub-
directives, some preprocessors implement directives other than #pragma
as proposals of new preprocessing specifications.

In Linux system headers, the GCC-specific #include_next directive is
used.  Use of this directive, however, means that the organization of
the system headers is unnecessarily complicated, which is not
praiseworthy.  GCC also implements non-Standard directives such as
#warning, #assert, etc., for which there is little need.  GCC 3.2 / cpp
treats some of those directives as "obsolete" features.  That seems a
good direction.

The preprocessor named "wave" has a directive to specify the scope of
macro definition, and, in correspondence with the directive, adopts a
macro call syntax similar to C++ scope specifier.  It seems an
interesting experiment.

GCC / cc1 has a "traditional" mode option other than the standard
behavioral mode.  MCPP has options for various behavioral specifications.
Those experiments also have some meanings.

Scoring: 6.


        [3.5.3]     Efficiency and others

            [q.3.1]     Processing speed

Though the accuracy of processing and diagnosis is the most important
thing, the faster the speed is, the better.

#pragma directives and options intended to improve speed are evaluated
here by the speed actually achieved.

Scoring: 20.  The speed of a program that does nothing but copy input to
output is set as 20 points.  Other scores are assigned by speed relative
to this benchmark.  Refer to [5.2] for concrete measurements.  Since the
absolute speed varies depending on hardware, comparison should be done
on hardware of the same level.  In addition, the time to process the
same program depends on the amount of standard headers to be read.
Comparison should be done with MCPP as a point of reference.

In order to measure time, the time command (a built-in command in bash
or tcsh) is used on UNIX systems.  On Windows systems, bash provides a
time command if CygWIN is installed.  Also, WindowsNT has a similar
command called TimeThis in the "resource kit."  On systems where these
are not available, compile tool/clock_of.c and use it (though it is
rather inaccurate.) *

  * Some WindowsNT resource kit programs can be used on WindowsXP while
    others cannot.  TimeThis seems to be usable.

            [q.3.2]     Memory usage

The smaller the memory usage is, the better it is.  This is a serious
problem especially where there is a strict limitation in memory usage.

As preprocessing is a part of compilation, what actually matters is the
memory usage of the overall compiler system.  When preprocessing is
performed by an independent preprocessor, the compiler proper usually
consumes more memory, so the memory usage of the preprocessor is rarely
a problem.  However, when there are many macro definitions and the like,
a preprocessor may consume more memory.  Memory usage includes not only
the size of the program, but also data area usage.

Scoring: 20.  Points are deducted for out-of-memory problems in regular
usage.  In case data size cannot be measured, only the program size is
evaluated.  On UNIX systems, refer to the 'file' and 'ldd' commands.
The 'time' command in tcsh also serves as a reference.

            [q.3.3]     Portability

The portability of preprocessor source itself becomes an issue when it
replaces a resident preprocessor of the compiler system or when it is
being updated or customized.  The following shall be subject to
evaluation.
  1. Whether source is open (0 points if not open.)
  2. Whether many compiler systems - OS's are supported.
  3. What is the range of compiler systems - OS's for porting?  Under
what conditions?
  4. Whether the source is easy to port.
  5. Whether porting documents are provided.

Scoring: 20.  I have read only one and a half of the sources and merely
glanced at the rest.  So, this scoring is just a guess.


        [3.5.4]     Quality of Documents

            [q.4.1]     Quality of documents

In d.*, we looked only at whether there are documents on the
"implementation-defined items" of Standard C.  Here, we also evaluate
the quality of documents.

In addition to implementation-defined items, the following are necessary
documents at minimum.
  1. Differences from Standard C.
  2. Option specifications.
  3. Meaning of diagnostic messages.

In addition, having the description of the overall preprocessing
specification including the Standard C would be of course much better.

Accuracy, readability, searchability, ease of viewing, and so on are
subject to evaluation.  The document for porting is included in the
q.3.3 evaluation.

Scoring: 10.


    [3.6]       C++ Preprocessing

Nowadays, in most compiler systems C and C++ are provided together.  In
that case, the same preprocessor seems to be used in both C and C++.
Since preprocessing is almost the same in both, it is not necessary to
prepare separate preprocessors for each.  However, it is not exactly the
same in both.

If you compare C++ Standard with C90, the C++ preprocessing is the C90
preprocessing plus the specifications below.

1. A character not included in the basic source character set is
converted into a Unicode hexadecimal sequence in the \Uxxxxxxxx format
in translation phase 1.  And, this is converted again into an execution
character set in the translation phase 5. *1

2. // to the end of the line is a comment. *2

3. Each of ::, .*, and ->* is treated as one pp-token. *3

4. In the #if expression, 'true' and 'false' are evaluated as 1 and 0
respectively as a boolean literal. *4

5. The 11 identifier-like operators defined as macros in the <iso646.h>
standard header of ISO C : Amendment 1 (1995) are all tokens, not
macros, in C++ (pointless specification.)(*3)  Similarly, new and delete
are also operators. *5

6. The __cplusplus macro is predefined as 199711L. *6

7. Whether __STDC__ is defined, and how it is defined if so, is
implementation-defined (conversely, C99 forbids an implementation to
predefine __cplusplus.) *6

8. Translation limits are extended to a large extent as below.  However,
these are just guidelines, not requirements.  Implementations must
explicitly document translation limits. *7
    Length of a source logical line     :   65536 bytes
    Length of a string literal, a character constant, and a header name
                                        :   65536 bytes
    Length of an identifier             :   1024 characters
    #include nesting                    :   256 levels
    #if, #ifdef, and #ifndef nesting    :   256 levels
    #if expression parentheses nesting  :   256 levels
    Number of macro parameters          :   256
    Number of definable macros          :   65536

9. The restriction (by Standards) in the length prior to '.' in a header
name no longer exists. *8

In C99, only the second of these is the same and the others are
different.  C99 also introduced new differences through additions such
as p+ and P+ sequences in floating point numbers, official multi-byte
character support in identifiers, variable-argument macros,
legitimization of empty arguments, macro expansion of the arguments on
the #pragma line, the _Pragma() operator, #if expression evaluation in
long long or wider, concatenation of an adjacent wide character string
literal and character string literal into a wide character string
literal, and others.  In C99, UCN is specified only in translation
phase 5.  The constraint on UCN is also slightly different.  Though
translation limits were also largely extended in C99 compared with C90,
the extension is not as extreme as in the C++ Standard, and there are
differences here, too.
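
Several of the C99 additions listed above can be tried in a few lines.
This is only a sketch, assuming a compiler running in C99 mode or later;
sum3(), SUM3, GLUE, and the other names are hypothetical and exist only
for this illustration.

```c
#include <stddef.h>
#include <wchar.h>

/* Variable-argument macro: passes all of its arguments on. */
static int sum3(int a, int b, int c) { return a + b + c; }
#define SUM3(...)   sum3(__VA_ARGS__)

/* Empty argument: a ## <nothing> leaves just the left operand. */
#define GLUE(a, b)  a ## b
static int glued_value(void)
{
    int GLUE(val, ) = 7;    /* expands to: int val = 7; */
    return val;
}

/* p+ sequence inside a pp-number: 0x1.8p+1 is 1.5 * 2 = 3.0 */
static double hex_float(void) { return 0x1.8p+1; }

/* #if evaluation in a type at least as wide as long long. */
#if 0x7FFFFFFFFFFFFFFF > 0xFFFFFFFF
#define WIDE_IF     1
#else
#define WIDE_IF     0
#endif

/* Adjacent wide and narrow literals become one wide literal. */
static size_t wide_len(void) { return wcslen(L"ab" "cd"); }
```

A preprocessor fixed to C90 or C++98 rules would reject or mishandle
most of these lines, which is why a mode switch is needed.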

These differences may not seem many, but they are enough to prevent C
and C++ from sharing identical preprocessing.  Likewise, within C, C90
(C95) and C99 cannot share identical preprocessing.

Besides, predefining __STDC__ in C++ is a cause of trouble and not
desirable.

Although some implementations define __cplusplus using the -D option, it
is inappropriate, as it becomes one of the user-defined macros.
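
Because __STDC__ may or may not be defined in C++, source that must
distinguish the languages is safer testing __cplusplus first.  The
following is a minimal sketch; language_level() and
compiled_as_cplusplus() are hypothetical names, and the value 1 returned
for C90 is an arbitrary choice made for this illustration.

```c
/*
 * Returns the value of __cplusplus in C++, the value of
 * __STDC_VERSION__ in C95 or later, 1 for C90, and 0 when the
 * implementation does not claim conformance at all.
 */
static long language_level(void)
{
#if defined(__cplusplus)
    return (long)__cplusplus;
#elif defined(__STDC_VERSION__)
    return (long)__STDC_VERSION__;
#elif defined(__STDC__)
    return 1L;
#else
    return 0L;
#endif
}

/* 1 when this translation unit is compiled as C++, 0 when as C. */
static int compiled_as_cplusplus(void)
{
#ifdef __cplusplus
    return 1;
#else
    return 0;
#endif
}
```

This only works as intended when __cplusplus is a genuine predefined
macro; if it is supplied by -D as a user-defined macro, a later #undef
or #define in source goes undiagnosed.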

Although whether each of ::, .*, and ->* should be treated as one token
is hardly an issue in preprocessing, handling it correctly is always the
best.

Therefore, in order to share a preprocessor among C90 (C95), C99, and
C++, the sound approach seems to be to use a run-time option and switch
the processing to accommodate the differences above.

Please note that MCPP does not support the following two points among
the specifications above, since the burden of implementing them
outweighs their value.  I believe the current implementation is nearly
sufficient in practice.

1. The conversion to UCN in translation phase 1 is not implemented.  The
C++ Standard states that UCN conversion does not necessarily have to be
performed as long as the same outcome is obtained.  Practical use aside,
however, strictly speaking the same outcome cannot be obtained without
the conversion.  This becomes clear if you consider a comparison of
character constants between UCNs and multi-byte characters in an #if
expression.  Strictly speaking, a problem also occurs in stringizing by
the # operator. *9

2. A macro can have at most 255 parameters.

In MCPP, invoking the program as a C++ preprocessor with the -+
-V199901L option and setting __cplusplus to 199901L and above extends
specifications to those for C99 for 1 through 13 in [1.8] excluding 8
and 9 (3 is the same without the option.  2 depends on the
implementation during MCPP compilation.)

The conformance tests for C++ specific preprocessing specifications
added to C90 preprocessing specifications are shown below.

In the implementations which do not recognize file names as C++ source
unless they are in the format of *.cc, copy files to *.cc to test.

No samples beyond those in the test-l directory are provided for
translation limits.  In C++, translation limits are only guidelines, so
they are not subject to scoring here.  In addition, the length of
header names is not tested since it is OS dependent.

  *1  C++ 2.1 Phases of translation
  *2  C++ 2.7 Comments
  *3  C++ 2.12 Operators and punctuators

  *4  C++ 2.13.5 Boolean literals
    In C99, bool, true, false, and __bool_true_false_are_defined are
    defined in <stdbool.h> as _Bool, 1, 0, and 1 respectively as a macro.

  *5  C++ 2.5 Alternative tokens
  *6  C++ 16.8 Predefined macro names
  *7  C++ Annex B Implementation quantities

  *8  C++ 16.2 Source file inclusion
    C90 6.8.2 guaranteed only up to 6 characters to the left of '.' in a
    header name, and C99 6.10.2 guarantees 8.  This restriction was
    removed in C++.

  *9  In C99, whether extra \ is inserted when UCN is stringized by the
    # operator is implementation-defined.  C++ does not have this
    specification.
    If extra \ is added, the UCN no longer goes back to a multi-byte
    character.  Therefore, it is a better implementation without an
    extra \.  However, not having the extra \ is an "erroneous"
    implementation in the C++ specification.
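
As a supplement to note *4, the C99 arrangement can be observed in an
#if expression.  This is only a sketch: TRUE_BEFORE and TRUE_AFTER are
hypothetical names, and the value of TRUE_BEFORE depends on the language
version, since 'true' is an ordinary undefined identifier (evaluated as
0) in C99 but may be treated specially by C++ and by later revisions of
C.

```c
/* Before <stdbool.h>: in C99, 'true' is just an undefined identifier. */
#if true
#define TRUE_BEFORE 1
#else
#define TRUE_BEFORE 0
#endif

#include <stdbool.h>

/* After <stdbool.h>: true and false are usable as 1 and 0 in #if. */
#if true && !false
#define TRUE_AFTER  1
#else
#define TRUE_AFTER  0
#endif

#if !defined(__bool_true_false_are_defined)
#error "<stdbool.h> did not define __bool_true_false_are_defined"
#endif
```

Only TRUE_AFTER can be relied on to be 1 across language versions.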

        [3.5.n.ucn1]    UCN recognition

Scoring: 4.

        [3.5.n.cnvucn]  Conversion from a multi-byte character to UCN

Scoring: 2.

        [3.5.n.dslcom]  // Comments

Scoring: 4.

        [3.5.n.bool]    true and false are boolean literals

Scoring: 2.

        [3.5.n.token1]  ::, .*, and ->* are tokens

Scoring: 2.  Even if the test appears successful, it is invalid if token
concatenation "succeeds" without any warning when processing it in C as
well.

        [3.5.n.token2]  Alternative token for operators

Scoring: 2.

        [3.5.n.cplus]   __cplusplus predefined macro

Scoring: 4.  1 point for __cplusplus < 199711L.

        [3.5.u.cplus]   #define, #undef __cplusplus

Scoring: 1.  1 point if a diagnostic message such as a warning is issued.

        [3.5.d.tlimit]  Documents on the translation limits

Scoring: 2.


4.  Issues Around C Preprocessing

Below are issues faced in actual use of a preprocessor, apart from its
level of Standard conformance and its quality.


    [4.1]       Standard Header Files

Samples in this Validation Suite include some standard headers.  Unless
those headers are written correctly, the preprocessor itself cannot be
tested accurately.

The following points are prone to problems in practice in the
implementation of standard headers.


        [4.1.1]     General Rules

The standard headers must not only include all function declarations,
type definitions, and macro definitions, but also meet the conditions
below.

1. An identifier that is neither specified by the Standard nor reserved
must not be declared or defined.  The range of identifiers that may be
declared is determined for each standard header (ranges may overlap or
be shared among several standard headers.) *1

2. Therefore, it is usually not acceptable for one standard header to
include another one.

3. No matter in which order multiple standard headers are included, the
results must be the same. *2

4. No matter how many times the same standard header is included, the
result must be the same for other than <assert.h>. *2

5. Every object-like macro defined to expand into an integer constant
must be usable in #if expressions. *3

The range of the identifier reserved is specified by the Standard and
other identifiers must be open to users.  Since names starting with one
or two '_' are reserved for some sort of usage, they can be used in
standard headers by implementations (On the other hand, they must not be
defined by users.)

This is a somewhat restrictive specification.  Traditional names outside
the Standards cannot be used in Standard C unless they are changed to
names starting with '_'.  In POSIX, which was the starting point for the
libraries and standard headers of Standard C, names outside of Standard
C are also enclosed by:

    #ifdef  _POSIX_SOURCE
        ...
    #endif

At least when this part is enabled, the implementation is no longer
Standard C.

Also, even if function names such as open(), creat(), close(), read(),
write(), and lseek() do not appear in standard headers, implementing
functions such as fopen(), fclose(), fread(), fgets(), fgetc(), fwrite(),
fputs(), fputc(), fseek() etc. by using open() etc. indirectly violates
the user's name space.  Therefore, fencing off open() etc. with
_POSIX_SOURCE or moving them to separate headers such as <unistd.h> is
only a superficial measure and is meaningless.

"System call functions" of this type should be renamed to names starting
with '_', or those that are truly essential should be included in
Standard C.

Although 2 is not stated explicitly in the specification, a standard
header that includes another standard header usually ends up declaring
names that it is not allowed to declare, which violates 1.  Any header
that includes <stddef.h> is against the rule.  To avoid this, a
non-standard header with a different name, such as <sys/_defs.h>, can be
provided and included by the standard headers (including <stddef.h>
itself.)  Names used there should all start with '_'. *4, *5

3 will not become an issue in reality.

4 caused problems in old implementations, but most current
implementations comply with it.

The method of enclosing the whole standard header as below is common.

#ifndef _STDIO_H
#define _STDIO_H
    ...
#endif

In addition, there is a method of using an extended directive such as
#pragma once.

What becomes an issue in 5 is that some implementations write macros
using sizeof or casts in standard headers.  In Standard C, sizeof and
casts are not allowed in #if expressions.  Implementations that actually
use sizeof or casts in standard headers (Borland C 5.5) also accept them
in #if expressions, so they presumably regard this as an extended
specification.

As long as a user does not use sizeof and casts in #if expressions in
his/her own program, no portability or other problems will arise.
However, this preprocessing implementation is not an "extension" of
Standard C, but rather a "deviation."  That is because #if is processed
in translation phase 4 in Standard C and no keyword exists in this phase.
Keywords are recognized in phase 7 for the first time.  In phase 4,
names same as keywords are all treated as simple identifiers and
identifiers not defined as macros are all evaluated as 0 on the #if line.
So, sizeof (int) becomes 0 (0) and (int)0x8000 becomes (0)0x8000,
causing a violation of syntax rule.  Implementations must issue a
diagnostic message for this.  Not issuing a diagnostic message is not an
"extension" but rather a "deviation" from Standard C.  Recognizing only
some keywords in phase 4 is a stretch in terms of the logical structure
of preprocessing, and blurs its meaning as a "pre"-process phase that
precedes the compilation phase (translation phase 7.) *6

Even if we concede and accept this as an "extended specification", at
least a warning should be issued for an #if line including sizeof or a
cast.
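
The phase 4 rule described above can be observed directly.  In the
sketch below, which assumes a preprocessor that follows Standard C here,
the keywords int and sizeof appear alone in an #if expression and are
evaluated as 0 like any other identifier not defined as a macro.
KEYWORDS_ARE_ZERO is a hypothetical name.  Some implementations may warn
about an undefined identifier in #if.

```c
/*
 * In translation phase 4 there are no keywords: 'int' and 'sizeof'
 * are plain identifiers, and since they are not macros each one is
 * replaced by 0 before the expression is evaluated.
 */
#if !defined(int) && int == 0 && sizeof == 0
#define KEYWORDS_ARE_ZERO   1
#else
#define KEYWORDS_ARE_ZERO   0
#endif

/*
 * By the same rule the following line would become "0 (0) < 4",
 * which violates the syntax and requires a diagnostic:
 *
 *      #if sizeof (int) < 4
 */
```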

  *1  ANSI C 4.1.2.1 (C90 7.1.3) Reserved identifiers
      C99 7.1.3 Reserved identifiers

  *2  ANSI C 4.1.2 (C90 7.1.2) standard headers
      C99 7.1.2 standard headers

  *3  ANSI C 4.1.6 (C90 7.1.7) Use of library functions
      C99 7.1.4 Use of library functions

  *4  This method is used in the book below.  This book has many points
    which serve as useful references to the implementation of compiler
    systems especially.
        P. J. Plauger "The Standard C Library", 1992, Prentice Hall

  *5  In the GNU glibc system, other standard headers such as <stddef.h>
    are read in by a standard header itself multiple times.  However,
    what is defined at that time seems to be only the name in the range
    reserved.  Although this is not against Standards, it is not a good
    method as it loses the readability of standard headers and it makes
    the maintenance difficult.  It is better to use the file such as
    <sys/_defs.h>.

  *6  Refer to [1.3] and [2.4.14].


        [4.1.2]     <assert.h>

Next, we will look at individual standard headers.  Judging from the
standard headers shipped with some implementations, the ones with the
most problems seem to be <assert.h> and <limits.h>.  Although these two
are the easiest headers, they tend to be implemented incorrectly since
they were newly defined by Standard C.  Their usage is covered briefly
here.

First, <assert.h>. *1, *2

Unlike other standard headers, including this file several times does
not always give the same result.  The result changes each time the file
is included, depending on whether the user has defined NDEBUG.  In other
words, this header is used as below where needed.

    #undef      NDEBUG
    #include    <assert.h>
        assert( ...);

And, from the point where debugging is complete, it is used as below.

    #define     NDEBUG
    #include    <assert.h>
        assert( ...);

If NDEBUG is defined, assert(...); disappears after macro expansion even
if it remains in source (even if ... is the expression with side effects,
the side effects do not occur since the expression is not evaluated.)
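
The effect of re-including <assert.h> can be confirmed with a small
experiment.  This is a sketch only; bump(), calls_when_enabled(), and
calls_when_disabled() are hypothetical names, and the argument of
assert() has a side effect on purpose.

```c
#undef  NDEBUG
#include <assert.h>     /* NDEBUG undefined: assert() is active */

static int calls;       /* counts evaluations of the argument */

static int bump(void) { return ++calls; }

static int calls_when_enabled(void)
{
    calls = 0;
    assert(bump() > 0);     /* argument is evaluated once */
    return calls;
}

#define NDEBUG
#include <assert.h>     /* assert() now expands to ((void)0) */

static int calls_when_disabled(void)
{
    calls = 0;
    assert(bump() > 0);     /* disappears; bump() is never called */
    return calls;
}

#undef  NDEBUG
#include <assert.h>     /* active again for the code that follows */
```

calls_when_enabled() returns 1 and calls_when_disabled() returns 0,
which is why an expression whose side effects are needed must not be
written as the argument of assert().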

In order for <assert.h> to be used like this, it must not be enclosed
by:

    #ifndef _ASSERT_H
    #define _ASSERT_H
        ...
    #endif

#pragma once must not be written, either.

Also, as you can see from this, assert() is a macro and changes its
definition. <assert.h> must apply #undef assert and the assert macro
must be redefined according to NDEBUG.

In the assert(expression) call, when NDEBUG is not defined, nothing
happens if the expression is true.  If it is false, that is reported in
the standard error output.  This report displays the expression as is
(not expanded even if there is a macro) and also the file name of the
source and the line number.  This can be easily implemented as long as
the # operator and the __FILE__ and __LINE__ macros are correctly
implemented.

In reality, some old implementations do not implement the # operator
correctly or do not implement <assert.h> correctly.  Many samples in
this Validation Suite include <assert.h>, so the preprocessor itself
cannot be tested correctly if <assert.h> is not written correctly.
Since it is easy to write <assert.h> correctly, it is better to rewrite
any incorrect one among the files of an implementation.  The following
is an example from C89 Rationale 4.2.1.1.  Even with this, the correct
result cannot be obtained if the # operator is not correctly
implemented.  However, that is a preprocessor problem and cannot be
helped.

#undef  assert
#ifdef  NDEBUG
#   define  assert( ignore)     ((void) 0)
#else
    extern void __gripe( char *_Expr, char *_File, int _Line);
#   define  assert( expr)   \
        ((expr) ? (void)0 : __gripe( #expr, __FILE__, __LINE__))
#endif

The __gripe() function can be written as below (the name, __gripe, can
be anything as long as it starts with '_'.)

#include    <stdio.h>
#include    <stdlib.h>

void    __gripe( char *_Expr, char *_File, int _Line)
{
    fprintf( stderr, "Assertion failed: %s, file %s, line %d\n",
            _Expr, _File, _Line);
    abort();
}

Some implementations write fprintf(), fputs(), or abort() directly in
<assert.h> without using __gripe().  That is also acceptable; however,
it requires declarations for these functions.  The FILE and stderr
declarations are also necessary, which is quite complicated since
<stdio.h> cannot be included.  A separate function avoids such mistakes.

This is a minor point, but if everything is implemented by macros, a
duplicate string literal is generated at every call.  Unless the
implementation merges duplicate string literals into one as an
optimization, this is unwise in terms of code size.

  *1  ANSI C 4.2 (C90 7.2) Diagnostics <assert.h>
      C99 7.2 Diagnostics <assert.h>

  *2  Starting with C99, the assert() macro also displays the name of
    the function from which the macro was called.  The predefined
    identifier __func__ is defined for this kind of purpose.


        [4.1.3]     <limits.h>

This standard header defines macros representing the ranges of the
integer types and related limits.  The values of these macros must match
the specifications, and the following conditions must also be met. *1

1. Must be an integer constant expression which can be used in the #if
directive.

2. Must be an expression of the same type as an object of the
corresponding type after integer promotion.

Some implementations use casts in these macros.  As discussed in [4.1],
sizeof and casts in an #if expression are outside the scope of Standard
C.  In the first place, the reason <limits.h> was newly specified is
that, with it, a preprocessor does not have to query the execution
environment by means of casts and sizeof.

For example, instead of

#if (int)0x8000 < 0

and

#if sizeof (int) < 4

the following is used:

#include    <limits.h>
#define VALMAX  ...
#if INT_MAX < VALMAX

In #if, cast and sizeof are not necessary if <limits.h> macros are used.
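
The replacement of cast and sizeof by <limits.h> macros can be sketched
as below.  The value chosen for VALMAX and the names val_t and
val_t_holds_valmax are assumptions made only for this illustration.

```c
#include <limits.h>

#define VALMAX  100000L     /* hypothetical largest value to be stored */

/* Select a type wide enough for VALMAX, decided entirely within #if. */
#if INT_MAX < VALMAX
typedef long    val_t;      /* int is too narrow, e.g. 16 bit int */
#else
typedef int     val_t;      /* int is wide enough */
#endif

/* Returns 1 if val_t stores VALMAX without loss, 0 otherwise. */
int val_t_holds_valmax( void)
{
    val_t   v = VALMAX;
    return v == VALMAX;
}
```

The #if line contains only integer constants after macro expansion, so
the preprocessor needs no knowledge of types.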

Definitions in <limits.h> which have the wrong type are seen from time
to time.  They do not come from the preprocessor specification.  It
seems rather that whoever wrote <limits.h> forgot condition 2 above,
along with integer promotion, the usual arithmetic conversion rules, or
the evaluation rules for integer constant tokens.

For example, there is a definition as below.

    #define UCHAR_MAX   255U

All unsigned char values are in the int range (if CHAR_BIT is 8), so
the value of an unsigned char object becomes int by integer promotion.
Therefore, UCHAR_MAX must also be evaluated as int.  Written as 255U,
however, it becomes unsigned int.  The definition must be:

    #define UCHAR_MAX   255

Either definition may seem harmless in practice, but that is not
necessarily so.  An operation involving an unsigned type triggers the
"usual arithmetic conversion," which forces a signed operand of the same
size to be converted to unsigned.  Therefore, the result of a value
comparison can differ.

This is easy to see in the following:
    assert( -255 < UCHAR_MAX);
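
Since UCHAR_MAX itself cannot be redefined portably, the following
sketch uses two hypothetical macros, UCHAR_MAX_GOOD and UCHAR_MAX_BAD,
as stand-ins for the two spellings.  A compiler may warn about the
second comparison, which is exactly the point.

```c
/* Hypothetical stand-ins for the two spellings of UCHAR_MAX. */
#define UCHAR_MAX_GOOD  255     /* type int */
#define UCHAR_MAX_BAD   255U    /* type unsigned int */

/* Both operands are int, so this is an ordinary signed comparison. */
int compare_with_int_constant( void)
{
    return -255 < UCHAR_MAX_GOOD;
}

/*
 * -255 is converted to unsigned int by the usual arithmetic conversion
 * and becomes a very large value, so the comparison is false.
 */
int compare_with_unsigned_constant( void)
{
    return -255 < UCHAR_MAX_BAD;
}
```

The first function returns 1 and the second returns 0, although both
constants have the value 255.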

This mistake is related to the fact that Standard C changed the integer
promotion and usual arithmetic conversion rules from those adopted in
many earlier implementations.  The unsigned char, unsigned short, and
unsigned long types were not in K&R 1st, but were later provided by
many implementations.  In most of those implementations, an unsigned
type was always converted to an unsigned type.

In the integer promotion of Standard C, however, unsigned char and
unsigned short are promoted to int as long as all of their values fit in
the int range, and to unsigned int otherwise.  Also, in the usual
arithmetic conversion between unsigned int and long, unsigned int is
converted to long if all of its values are in the long range, and the
conversion is to unsigned long otherwise.  This is called the change
from "unsigned preserving rules" to "value preserving rules."  The
reason given for the specification is that the result is supposed to be
more intuitive and predictable.  This rule needs attention when writing
<limits.h>. *2
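
The effect of the value preserving rules can be observed with a sketch
like the following.  The function names are hypothetical.  The result
depends on whether int can represent all unsigned short values in the
implementation at hand.

```c
#include <limits.h>

/*
 * Computes us - 1 for us == 0 and returns 1 if the result is negative.
 * If unsigned short is promoted to int, the result is -1.
 * If it is promoted to unsigned int, the result is UINT_MAX.
 */
int promoted_difference_is_negative( void)
{
    unsigned short  us = 0;
    return (us - 1) < 0;
}

/* Returns 1 if int can represent every unsigned short value. */
int ushort_promotes_to_int( void)
{
    return USHRT_MAX <= INT_MAX;
}
```

The two functions should always return the same value: the difference
is negative exactly when unsigned short is promoted to int.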

In all examples below, short is 16 bit and long is 32 bit.  The value of
USHRT_MAX is 65535, but how to write it depends on whether int is 16 bit
or 32 bit.

    #define USHRT_MAX   65535U      /* if int is 16 bit   */
    #define USHRT_MAX   65535       /* if int is 32 bit   */

If int is 16 bit, not all unsigned short values are in the int range, so
unsigned short is promoted to unsigned int.  Therefore, USHRT_MAX must
also be evaluated as unsigned int.  Written as 65535, it is evaluated as
long, so the suffix 'U' is necessary.  On the other hand, if int is 32
bit, all unsigned short values are in the int range, so they are
promoted to int.  Therefore, USHRT_MAX must also be evaluated as int,
and 'U' must not be attached.  However, there is also a definition which
does not depend on this distinction.

    #define USHRT_MAX   0xFFFF

In this example, correct evaluation is performed whether int is 16 bit
or 32 bit.  In Standard C, an octal or hexadecimal integer constant
token without a U, u, L, or l suffix is evaluated in the first type, in
the order of int, unsigned int, long, and unsigned long, which can
represent its value.  In other words, 0xFFFF is evaluated as 65535 of
unsigned int if int is 16 bit and as 65535 of int if int is 32 bit.  On
the other hand, a decimal integer constant token without a suffix is
evaluated in the order of int, long, and unsigned long.  65535 is
evaluated as long if int is 16 bit and as int if int is 32 bit. *3

C99 added long long/unsigned long long.  It also added the _Bool type,
which holds only the value 0 or 1.  Implementations may provide other
integer types as well.  The integer promotion rules were extended, and
integer constant tokens that cannot be expressed in unsigned long are
evaluated as long long/unsigned long long.

With the increase in integer types and the acceptance of
implementation-defined integer types, the size relations among types
became confusing.  Therefore, the concept of integer conversion rank was
introduced.  The concept is a little complex, but there is no need to
worry about it in practice.  For the standard integer types, the ranks
are ordered as below.

    long long > long > int > short > char > _Bool

The point here is that types are ranked distinctly even when they have
the same size in an implementation, for example, long and int which are
both 32 bit. *4, *5

  *1  ANSI C 2.2.4.2.1 (C90 5.2.4.2.1) Sizes of integral types <limits.h>
      C99 5.2.4.2.1 Sizes of integer types <limits.h>

  *2  ANSI C 3.2.1 (C90 6.2.1) Arithmetic operands
      C99 6.3.1 Arithmetic operands

  *3  ANSI C 3.1.3.2 (C90 6.1.3.2) Integer constants
      C99 6.4.4.1 Integer constants

  *4  C99 6.3.1.1 Boolean, characters, and integers

  *5  C99 added the standard headers <stdint.h> and <inttypes.h> in
    order to absorb the differences in integer types among
    implementations.  These typedef type names other than long and
    short, since the number of integer types increased with the arrival
    of 64 bit systems and the correspondence became confusing.
    However, there are 26 kinds of these type names, 42 kinds of macros
    representing the corresponding maximum and minimum values, 56 kinds
    of macros which expand to the corresponding fprintf() format
    specifiers, and likewise 56 kinds of macros for fscanf().  Although
    this does not put much load on implementations, so much complexity
    gives the impression of terminal symptoms.


            [4.1.3.1]   INT_MIN

Among all macros in <limits.h>, the most confusing ones are INT_MIN and
LONG_MIN on systems with the internal representation of 2's complement.
In particular, INT_MIN in an implementation where int is 16 bit and long
is 32 bit exhibits all the problems above.  It is therefore covered
separately in this section.

In this case, the range of int is understood to be [-32768,32767].
Defining INT_MAX as 32767 or 0x7FFF causes no problems in any
implementation.  However, there are cases where INT_MIN is defined as
below.

    #define INT_MIN     (-32767)

Why does this definition differ from reality?

On the other hand, as might be expected, no implementation has the
following definition.

    #define INT_MIN     (-32768)

    #define INT_MIN     (-32768)

-32768 consists of 2 tokens, - and 32768.  32768 is not in the range
which can be expressed in int, so it is evaluated as long.  Therefore,
-32768 means - (long) 32768.

Some make a definition like this:

    #define INT_MIN     ((int)0x8000)

The comments on definitions using a cast are not repeated here.  This
one is also invalid because 0x8000 only means (unsigned) 32768.

Then, how can INT_MIN be defined so that it is evaluated as (int)
-32768 without a cast?

    #define INT_MIN     (-32767-1)

This is fine.  32767 may also be written as INT_MAX or 0x7FFF.  The
definition contains a subtraction, but that is not an issue (unary - is
an operator in the first place.) *1, *2
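
The properties of this form can be checked with a sketch like the
following.  MY_INT_MIN and the other names are hypothetical.  The sketch
assumes the internal representation of 2's complement, and uses the C11
_Generic only as an inspection aid.

```c
#include <limits.h>

/* Hypothetical re-spelling of INT_MIN in the form (-INT_MAX - 1). */
#define MY_INT_MIN  (-INT_MAX - 1)

/* No cast or sizeof is involved, so the macro can be used in #if. */
#if MY_INT_MIN < -INT_MAX
#define MY_INT_MIN_WORKS_IN_IF  1
#else
#define MY_INT_MIN_WORKS_IN_IF  0
#endif

/* Returns 1 if the expression has type int, 0 for any other type. */
int my_int_min_is_int( void)
{
    return _Generic( (MY_INT_MIN), int: 1, default: 0);
}

/* Returns 1 if the value equals INT_MIN of this implementation. */
int my_int_min_matches( void)
{
    return MY_INT_MIN == INT_MIN;
}
```

Both operands of the subtraction are int and the result is in the int
range, so the whole expression is an int without any cast.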

    #define INT_MIN     (~INT_MAX)
    #define INT_MIN     (1<<15)

These are also correct definitions.

    #define INT_MIN     (-32767)

I imagine that whoever wrote this gave up on defining the correct value
because the idea of using an operation did not occur to them.

Then, is the definition of -32767 wrong or correct?

The bottom line is that this is wrong.

INT_MIN is defined as a macro representing the minimum value of int.  If
INT_MIN is -32767, what does this mean?  And, what is INT_MIN-1 at all?
Or, what are ~INT_MAX and 1<<15?

In this case, INT_MIN-1 seems to be thought of as a bit pattern
representing an out-of-range value, like "NaN" in floating point
operations.

However, measured against the Standard C specification of integer
types, this interpretation has no basis.  First, the result of a bit
operation on integer types is undefined only for op1 << op2 or
op1 >> op2 where the value of op2 is negative, or is equal to or greater
than the number of bits in the type of op1.  Otherwise the result is
not undefined, and is always some value of the integer type.  The result
of ~op is int if op is int, and the results of op1 & op2, op1 | op2,
op1 << op2, and op1 >> op2 are int if both op1 and op2 are int.
Therefore, the results of ~INT_MAX and 1<<15 are both int.  You may
think 1<<15 overflows, but it does not.  Since a bit operation returns
the value corresponding to the resulting bit pattern, overflow cannot
occur.

In C, operations on integer types are well defined in general, and
there are extremely few undefined areas.  In particular, bit patterns
and values correspond completely one-to-one, except that 2 bit patterns
exist for 0 in the internal representations of 1's complement and
sign+absolute value.  This has been consistent from K&R 1st to Standard
C.  C has no way to write a bit pattern itself, so the supposed
"Not-a-Number" can be written only as (-32767-1) etc.  As you can see,
this is simply an int value.  The C89 Rationale gives some grounds and
makes clear that there is no room in the integer types for bit patterns
representing an "invalid integer" or "illegal integer." *3, *4

In the internal representation of 2's complement, ~INT_MAX is the value
of INT_MIN, and I must say that any definition larger than that is
wrong.

  *1 I saw this definition in "The Standard C Library" by P. J. Plauger
    for the first time.  This style of limits.h is getting popular in
    recent implementations.
    However, limits.h in this book also contains a mistake.  The
    definitions for the compiler system with 16 bit int and 32 bit long
    are as below.
        #define UINT_MAX    65535
        #define USHRT_MAX   65535
    These will evaluate to long.  The correct definitions are:
        #define UINT_MAX    65535U
        #define USHRT_MAX   65535U

  *2  In recent compiler systems, *_MIN is typically defined in the form
    of (-*_MAX - 1) and there are few mistakes though they are still
    found occasionally.  Vc7/include/limits.h and Vc7/crt/src/include/
    limits.h in Visual C++ 2003 contains:
        #define LLONG_MIN   0x8000000000000000
    0x8000000000000000 evaluates to unsigned long long.  Since this type
    has the highest rank, the result of integer promotion has the same
    type.  It never becomes a negative value.  Therefore,
        #if LLONG_MAX > LLONG_MIN
    does not turn out as expected.

    LLONG_MIN in include/limits.h of LCC-Win32 V.3.2, V.3.8 is as below.
        #define LLONG_MIN   -9223372036854775808LL
    9223372036854775808LL is a violation of constraints as this token
    value overflows the range of signed long long.  If LLONG_MIN is
    defined as:
        #define LLONG_MIN   -9223372036854775808LLU
    9223372036854775808LLU becomes unsigned long long.  Applying the
    unary - operator to an unsigned type does not change the result
    type, so the result wraps around within unsigned long long and
    never becomes the intended negative value.

    In Visual C++ 2003 and LCC-Win32, all other *_MIN definitions are
    of the form (-*_MAX - 1), so why is only LLONG_MIN wrong?  If it is
    defined as below, there will be no problem.
        #define LLONG_MIN   (-LLONG_MAX - 1LL)

    In Visual C++ 2005, this definition was revised correctly.

  *3  C89 Rationale 3.1.2.5 Types
      C99 Rationale 6.2.6.2 Integer types

  *4  In C99, implementations are allowed to treat specific bit
    patterns as a "trap representation" which causes an exception.
    I do not know what sort of implementations actually fall under this
    specification.


        [4.1.4]     <iso646.h>

ISO C 9899:1990 / Amendment 1 added the standard header <iso646.h>.
It provides replacement spellings, written only in the invariant
character set of ISO 646, for operators containing &, |, ~, ^, or !.
Trigraphs also provide replacement spellings for |, ~, and ^, but
trigraphs lack readability.  Instead, iso646.h defines 11 operators as
macros, each in units of a token.

The implementation is very easy, and the following example is enough.
Since the macros are expanded in preprocessing, they cause no trouble
for implementations. *

/* iso646.h     ISO 9899:1990 / Amendment 1 */

#ifndef _ISO646_H
#define _ISO646_H

#define and     &&
#define and_eq  &=
#define bitand  &
#define bitor   |
#define compl   ~
#define not     !
#define not_eq  !=
#define or      ||
#define or_eq   |=
#define xor     ^
#define xor_eq  ^=

#endif
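
A source using these spellings might look like the sketch below.  The
function names are hypothetical and serve only to show the macros in
use.

```c
#include <iso646.h>

/* Returns 1 if x lies in [lo, hi] and differs from skip, 0 otherwise. */
int in_range_except( int x, int lo, int hi, int skip)
{
    return x >= lo and x <= hi and not (x == skip);
}

/* Clears the bits of mask in flags, i.e. flags & ~mask. */
unsigned    clear_bits( unsigned flags, unsigned mask)
{
    return flags bitand compl mask;
}
```

After preprocessing, the bodies contain only the ordinary operators &&,
!, & and ~, so nothing beyond macro expansion is required.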

  * In the C++ Standard, these identifier-like operators are specified
    as operator-tokens rather than macros.  This is a troublesome and
    meaningless specification for implementations.


5.  Preprocessor Test Results

    [5.1]   Preprocessors Tested

Compiler systems tested and execution methods are as below.  Compiler
systems are sorted in the order of release time.

Runtime options vary slightly among C95 (C90), C99, and C++98.

Where <assert.h> or <limits.h> had problems, testing was done after
rewriting them correctly.

Number: OS          / Compiler System       / Execution program (version)
        Runtime option
    Comment

1   :   Linux       /                       / DECUS cpp
        C95:    cpp
    DECUS cpp original version by Martin Minow (June 1985.)  It was
    ported to some systems such as various DEC systems, UNIX, and MS-DOS
    at that time, but what I used in this test was modified by kmatsui
    and compiled on Linux / GCC.  Some macros were rewritten so that the
    translation limits meet the specifications as far as possible.

2   :   FreeBSD 2.2.7   / GCC V.2.7.2.1     / cpp (V.2.0)
        GO32        / DJGPP V.1.12          / cpp (V.2.0)
        WIN32       / BC 4.0                / cpp (V.2.0)
        MS-DOS      / BC 4.0, TC 2.0        / cpp (V.2.0)
        MS-DOS      / LSI C-86 V.3.3        / cpp (V.2.0)
        OS-9/6x09   / Microware C/09        / cpp (V.2.0)
        C95:    cpp -23 (-S1 -S199409L) -W15
                gcc -ansi -Wp,-2,-W15
        C99:    cpp -23 (-S1) -S199901L -W15
        C++:    cpp -23+ -S199711L -W15
    Free software by kmatsui (August, 1998.)  Called MCPP.  Rewrite of
    DECUS cpp.
    I have compiled with GCC V.3.3 on Linux and used the executable for
    this test.

3   :   WIN32       / Borland C++ V.5.5J    / cpp32 (August, 2000)
        C95:    cpp32 -A -w
                bcc32 -A -w
        C99:    cpp32 -A -w
        C++:    cpp32 -A -w
    Trigraphs are not processed by cpp32 nor bcc32.  Instead, a
    conversion program called trigraph.exe is provided.  It merely
    serves as an alibi for "Standard conformance."  For Borland C, I
    used this program to convert trigraphs in advance of testing
    (lenient testing.)  This trigraph.exe, however, also processes line
    splicing by <backslash><newline>.  Therefore, the line numbers get
    out of alignment (points are deducted in q.1.2.)

4   :   Vine Linux 3.2, CygWIN 1.3.10
                                 / GCC V.2.95.3 (March, 2001)   / cpp0
        C95:    cpp0 -D__STRICT_ANSI__ -std=c89 -$ -pedantic -Wall
                gcc -ansi -pedantic -Wall
        C99:    cpp0 -std=c9x -$ -Wall
        C++:    g++ -E -trigraphs -$ -Wall
    Since GCC is portable source, whoever ports it to a specific system
    should provide a specification of the ported implementation.
    However, no such document is provided.  Only GNU cpp.info exists as
    a cpp document.

5   :  Vine Linux 3.2   / GCC V.3.2 (August, 2002)      / cpp0
        C95:    cpp0 -D__STRICT_ANSI__  -std=iso9899:199409 -$ -pedantic
                                                                -Wall
                gcc -std=iso9899:199409 -pedantic -Wall
        C99:    cpp0 -std=c99 -$ -Wall
        C++:    g++ -E -trigraphs -$ -Wall
    Compiled from the source by kmatsui.  Configured with the
    --enable-c-mbchar option.

6   :   Vine Linux 3.2  /                   / ucpp (V.1.3)
        C95:    ucpp -wa -c90
        C99:    ucpp -wa
    Free software by Thomas Pornin (January, 2003.)  A portable
    stand-alone preprocessor.

7   :   WIN32       / Visual C++ 2003       / cl
        C95:    cl -Za -E -Wall -Tc
        C99:    cl -E -Wall -Tc
        C++:    cl -E -Wall -Tp
    Since the -E option does not process comments and
    <backslash><newline> properly, compilation tests were also used
    (April, 2003.)

8   :   WIN32       / LCC-Win32 V.3.2       / lcc
        C95:    lcc -A -E
                lcc -A
        C99:    lcc -A -E
        C++:    lcc -A -E
    An integrated development environment which Jacob Navia wrote based
    on free software by C. W. Fraser & Dave Hanson (August, 2003.)  The
    source is shareware.  The preprocessing part is based on source
    originally written for Plan9 by Dennis Ritchie.

9   :   WIN32       /                       / wave (V.1.0.0)
        C95:    wave
        C99:    wave --c99
        C++:    wave
    Free software by Hartmut Kaiser (January, 2004.)  Stand-alone
    preprocessor.  Implemented using the C++ library called "Boost
    preprocessor library" by Paul Mensonides et al.

10 :    FreeBSD, Linux, CygWIN   / GCC 2.95, 3.
   :    WIN32, MS-DOS   / Visual C 2003, BCC, etc.  / mcpp_std (V.2.4)
        C95:    mcpp_std -23 (-S1 -V199409L) -W31
                gcc -ansi -Wp,-2,-W31
        C99:    mcpp_std -23 (-S1) -V199901L -W31
        C++:    mcpp_std -23+ -V199711L -W31
    MCPP V.2.4 (February, 2004), in Standard mode.  Compiled with Linux
    / GCC 2.95.3.

11  :    VineLinux 3.2  / GCC V.3.4.3 (November, 2004)
                                                    / cc1, cc1plus
        C95:    gcc -E -std=iso9899:199409 -pedantic -Wall
        C99:    gcc -E -std=c99 -$ -Wall
        C++:    g++ -E -std=c++98 -$ -Wall
    Compiled by kmatsui with GCC V.3.3.2.

12 :    openSUSE Linux 10.0 / GCC V.4.0.2 (September, 2005)
                                                    / cc1, cc1plus
        C95:    gcc -E -std=iso9899:199409 -pedantic -Wall
        C99:    gcc -E -std=c99 -$ -Wall
        C++:    g++ -E -std=c++98 -$ -Wall
    Bundled in openSUSE 10.0.

13 :   WIN32        / Visual C++ 2005       / cl
        C95:    cl -Za -E -Wall -Tc
        C99:    cl -E -Wall -Tc
        C++:    cl -E -Wall -Tp
    Since the -E option does not process comments and
    <backslash><newline> properly, compilation tests were also used
    (September, 2005.)

14 :    WIN32       / LCC-Win32 V.3.8       / lcc
        C95:    lcc -A -E
                lcc -A
        C99:    lcc -A -E
        C++:    lcc -A -E
    LCC-Win32 V.3.8 (March, 2006.)

15 :    FreeBSD, Linux, CygWIN   / GCC 2.95-4.
   :    WIN32  / Visual C 2003-2005, BCC, LCC-Win32 / mcpp (V.2.6)
        C95:    mcpp -23 (-S1 -V199409L) -W31
                gcc -ansi -Wp,-2,-W31
        C99:    mcpp -23 (-S1) -V199901L -W31
        C++:    mcpp -23+ -V199711L -W31
    MCPP V.2.6 (July, 2006.)


    [5.2]       Lists of Marks

                D   M   B   G   G   u   V   L   W   M   G   G   V   L   M
                E   C   C   C   C   c   C   C   a   C   C   C   C   C   C
                C   P   C   C   C   p   2   C   v   P   C   C   2   C   P
                U   P   5   2   3   p   0   W   e   P   3   4   0   W   P
                S   2   5   9   2   1   0   i   1   2   4   0   0   i   2
                C   0   C   5       3   3   n   0   4   3   2   5   n   6
                P       P   3               3   0                   3    
                P       P                   2                       8    

        max     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15

[K&R: Processing of sources conforming to K&R and C90] (31 items)
n.2.1     4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.2.2     2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.2.3     2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.6.1    10    10  10  10  10  10  10  10  10   4  10  10  10  10  10  10
n.7.2     4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.10.2    6     0   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.12.3    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.12.4    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.12.5    2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.12.7    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.13.1    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.13.2    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.13.3    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.13.4    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.13.7    6     6   6   4   6   6   6   4   6   0   6   6   6   4   4   6
n.13.8    2     0   2   2   2   2   2   2   2   0   2   2   2   2   2   2
n.13.9    2     2   2   2   2   2   2   0   2   0   2   2   2   0   2   2
n.13.10   2     2   2   2   2   2   2   0   0   2   2   2   2   2   0   2
n.13.11   2     0   2   2   2   2   2   0   0   0   2   2   2   2   0   2
n.13.12   2     0   2   2   2   2   2   2   0   0   2   2   2   2   0   2
n.15.1    6     6   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.15.2    6     6   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.18.1   30    30  30  30  30  30  30  30  30  30  30  30  30  30  30  30
n.18.2   20    20  20  20  20  20  20  20  20  20  20  20  20  20  20  20
n.18.3   10    10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
n.27.1    6     6   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.27.2    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.29.1   10    10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
n.32.1    2     2   2   2   2   2   2   2   2   0   2   2   2   2   2   2
i.32.3    2     2   2   2   2   2   2   2   2   0   2   2   2   2   2   2
i.35.1    2     2   2   2   2   2   2   0   0   0   2   2   2   0   0   2
stotal  166   150 166 164 166 166 166 156 158 140 166 166 166 160 156 166

[C90: Processing of strictly conforming sources] (76 items)
n.1.1     6     0   6   6   6   6   6   6   6   0   6   6   6   6   6   6
n.1.2     2     0   2   2   2   2   2   2   2   0   2   2   2   2   2   2
n.1.3     2     0   2   2   2   2   2   2   2   0   2   2   2   2   2   2
n.2.4     2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.2.5     2     0   2   2   2   2   2   2   2   0   2   2   2   2   2   2
n.3.1     6     6   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.3.3     4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.3.4     2     0   2   0   2   2   0   2   2   2   2   2   2   2   2   2
n.4.1     6     0   6   0   6   6   6   6   0   0   6   6   6   6   0   6
n.4.2     2     0   2   0   2   2   2   2   0   0   2   2   2   2   0   2
n.5.1     6     6   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.6.2     6     6   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.6.3     2     0   2   0   2   2   2   2   2   2   2   2   2   2   2   2
n.7.1     6     6   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.7.3     4     0   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.8.1     8     0   8   8   8   8   8   8   8   8   8   8   8   8   8   8
n.8.2     2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.9.1    10    10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
n.10.1   10    10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
n.11.1    8     8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
n.11.2    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.12.1    6     0   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.12.2    4     0   4   4   4   4   4   4   4   0   4   4   4   4   4   4
n.12.6    6     6   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.13.5    2     2   2   2   2   2   2   2   2   0   2   2   2   2   2   2
n.13.6    6     0   6   6   6   6   4   6   4   0   6   6   6   4   4   6
n.13.13   4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.13.14   2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.19.1    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.19.2    4     2   4   4   4   4   4   4   4   2   4   4   4   4   4   4
n.20.1    6     6   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.21.1    4     0   4   0   4   4   4   0   4   4   4   4   4   0   4   4
n.21.2    2     0   2   0   2   2   2   0   2   2   2   2   2   0   2   2
n.22.1    4     0   4   0   4   4   4   4   4   0   4   4   4   4   4   4
n.22.2    2     0   2   0   2   2   2   2   2   0   2   2   2   2   2   2
n.22.3    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.23.1    6     2   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.23.2    2     2   2   2   2   2   2   2   2   0   2   2   2   2   2   2
n.24.1    6     6   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.24.2    4     0   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.24.3    6     0   6   0   6   6   6   6   6   6   6   6   6   6   6   6
n.24.4    2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.24.5    2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.25.1    4     2   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.25.2    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.25.3    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.25.4    6     0   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.25.5    4     0   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.26.1    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.26.2    2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.26.3    2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.26.4    2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.26.5    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.27.3    2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.27.4    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.27.5    2     2   2   2   2   2   2   0   2   0   2   2   2   0   2   2
n.27.6    2     0   0   2   2   2   2   2   2   2   2   2   2   2   2   2
n.28.1    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.28.2    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.28.3    4     0   4   4   4   4   2   4   4   4   4   4   4   4   4   4
n.28.4    4     0   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.28.5    4     0   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.28.6    4     0   4   0   4   4   2   0   0   4   4   4   4   0   0   4
n.28.7    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.29.2    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.30.1    6     6   6   6   6   6   6   6   6   6   6   6   6   6   6   6
n.32.2    2     2   2   2   2   2   2   2   2   0   2   2   2   2   2   2
n.37.1    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.37.2    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.37.3    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.37.4    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.37.5    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.37.6    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.37.7    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.37.8    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
n.37.9    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
stotal  286   160 284 252 286 286 278 274 272 240 286 286 286 272 272 286

[C90: Processing of implementation defined portions] (1 item)
i.32.4    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
stotal    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2

[C90: Diagnosing of violation of syntax rule or constraint] (50 items)
e.4.3     2     2   2   1   2   2   2   2   2   2   2   2   2   2   2   2
e.7.4     2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.12.8    2     0   2   2   2   2   2   2   0   2   2   2   2   2   0   2
e.14.1    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.14.2    4     2   4   2   4   4   2   2   4   4   4   4   4   4   4   4
e.14.3    2     2   2   2   2   2   1   2   2   2   2   2   2   2   2   2
e.14.4    2     2   2   2   2   2   1   2   2   2   2   2   2   2   2   2
e.14.5    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.14.6    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.14.7    2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.14.8    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.14.9    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.14.10   4     0   4   2   0   0   0   0   0   0   4   0   0   0   0   4
e.15.3    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.15.4    2     2   2   1   2   2   2   1   2   2   2   2   2   2   2   2
e.15.5    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.16.1    2     2   2   1   2   2   2   1   2   2   2   2   2   2   2   2
e.16.2    2     2   2   1   2   2   2   1   2   2   2   2   2   2   2   2
e.17.1    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.17.2    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.17.3    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.17.4    2     2   2   2   2   2   2   2   2   0   2   2   2   2   2   2
e.17.5    2     0   2   2   2   2   2   2   2   0   2   2   2   2   2   2
e.17.6    2     0   2   2   2   2   2   2   2   0   2   2   2   2   2   2
e.17.7    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.18.4    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.18.5    2     2   2   2   0   2   2   2   2   2   2   2   2   2   2   2
e.18.6    2     0   2   2   2   2   2   2   2   0   2   2   2   2   2   2
e.18.7    2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.18.8    2     2   2   2   2   2   2   2   2   0   2   2   2   2   2   2
e.18.9    2     0   2   0   2   2   2   2   0   0   2   0   2   2   0   2
e.19.3    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
e.19.4    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
e.19.5    4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
e.19.6    2     0   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.19.7    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.23.3    2     0   2   2   2   2   2   2   0   0   2   2   2   2   0   2
e.23.4    2     2   2   2   2   2   2   2   0   0   2   2   2   2   0   2
e.24.6    2     2   2   2   2   2   2   2   0   0   2   2   2   2   0   2
e.25.6    4     0   4   0   4   4   4   4   4   4   4   4   4   4   4   4
e.27.7    2     0   2   2   2   2   2   0   2   2   2   2   2   0   2   2
e.29.3    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.29.4    2     2   2   1   2   2   2   2   2   2   2   2   2   2   2   2
e.29.5    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.31.1    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.31.2    2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
e.31.3    2     2   2   0   2   2   2   1   2   2   2   2   2   1   2   2
e.32.5    2     0   2   2   2   2   0   2   0   0   2   2   2   2   0   2
e.33.2    2     0   2   0   0   2   0   2   0   0   2   2   2   2   0   2
e.35.2    2     0   2   1   2   2   0   2   0   2   2   2   2   2   0   2
stotal  112    74 112  92 104 108  98 100  92  86 112 106 108 105  92 112

[C90: Documents on implementation defined behaviors] (13 items)
d.1.1     2     0   2   0   0   2   0   0   0   0   2   2   2   0   0   2
d.1.2     4     2   4   4   4   4   0   4   0   0   4   4   4   4   0   4
d.1.3     2     0   2   0   0   2   0   0   2   2   2   2   2   2   0   2
d.1.4     4     0   4   4   4   4   0   4   4   4   4   4   4   4   4   4
d.1.5     4     2   4   4   2   4   4   4   4   4   4   2   2   4   4   4
d.1.6     2     0   2   0   0   1   0   0   0   0   2   1   1   0   0   2
d.2.1     2     0   2   2   2   2   2   0   0   0   2   2   2   0   0   2
d.2.2     2     0   2   2   0   2   0   0   0   0   2   2   2   0   0   2
d.2.3     2     0   2   0   0   0   0   0   0   0   2   0   0   0   0   2
d.2.4     2     0   2   0   0   0   0   0   0   0   2   0   0   0   0   2
d.2.5     2     0   2   0   0   0   0   2   0   0   2   0   0   2   0   2
d.2.6     2     0   2   2   0   0   0   2   0   0   2   0   0   2   0   2
d.2.7     2     0   2   2   0   2   0   2   0   0   2   2   2   2   0   2
stotal   32     4  32  20  12  23   6  18  10  10  32  21  21  20   8  32

[C90: Degree of Standard C conformance] (171 items)
mttl90  598   390 596 530 570 585 550 550 534 478 598 581 583 559 530 598

[C99: Conformance to new features] (20 items)
n.dslcom  4     0   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.ucn1    8     0   0   0   0   6   8   6   0   2   8   6   6   8   0   8
n.ucn2    2     0   0   0   0   0   2   0   0   0   2   2   2   2   0   2
n.ppnum   4     0   4   0   4   4   4   0   0   0   4   4   4   0   4   4
n.line    2     0   2   2   2   2   2   2   0   2   2   2   2   2   0   2
n.pragma  6     0   6   0   0   6   6   0   0   2   6   6   6   0   0   6
n.llong  10     0   0   0  10  10   8  10   0   0  10  10  10  10   0  10
n.vargs  10     0  10   0  10  10  10   0   0  10  10  10  10  10   2  10
n.stdmac  4     0   2   0   0   4   4   0   0   4   4   4   4   0   0   4
n.nularg  6     0   6   0   6   6   6   2   0   6   6   6   6   6   0   6
n.tlimit 18     0  18  14  18  18  17  18  14  18  18  18  18  18  12  18
e.ucn     4     0   0   0   0   0   2   0   0   2   4   0   0   2   0   4
e.intmax  2     0   0   0   2   2   2   0   0   0   2   1   1   0   0   2
e.pragma  2     0   2   0   0   2   2   0   0   2   2   2   2   0   0   2
e.vargs1  2     0   0   0   0   2   1   0   0   1   2   2   2   2   0   2
e.vargs2  2     0   2   0   0   0   0   0   0   0   2   0   0   0   0   2
d.pragma  2     0   2   0   0   2   2   0   0   0   2   2   2   0   0   2
d.predef  6     0   0   0   0   6   6   0   0   0   6   6   6   0   0   6
d.ucn     2     0   0   0   0   0   0   0   0   0   2   0   0   0   0   2
d.mbiden  2     0   0   0   0   2   2   1   0   0   2   2   2   1   0   2
mttl99   98     0  58  20  56  86  88  43  18  53  98  87  87  65  22  98

[C++: Conformance to new features not in C90] (9 items)
n.dslcom  4     0   4   4   4   4   4   4   4   4   4   4   4   4   4   4
n.ucn1    4     0   0   0   0   4   4   2   0   2   4   4   4   4   0   4
n.cnvucn  4     0   0   0   0   0   1   0   0   0   0   0   0   0   0   0
n.bool    2     0   0   0   0   2   0   0   0   2   2   2   2   0   0   2
n.token1  2     0   2   0   0   2   0   2   2   2   2   2   2   2   2   2
n.token2  2     0   0   0   0   2   0   2   0   2   2   2   2   2   0   2
n.cplus   4     0   2   2   2   2   0   4   0   4   4   2   2   4   0   4
e.operat  2     0   0   0   0   2   0   0   0   2   2   2   2   0   0   2
d.tlimit  2     0   2   0   0   2   0   1   0   0   2   2   2   1   0   2
mttl++   26     0  10   6   6  20   9  15   6  18  22  20  20  17   6  22

[C90: Qualities / 1 : handling of multibyte character] (1 item)
m.36.2    7     0   2   2   0   0   0   4   0   0   7   5   7   2   0   7
stotal    7     0   2   2   0   0   0   4   0   0   7   5   7   2   0   7

[C90: Qualities / 2 : diagnosis of undefined behaviors] (29 items)
u.1.1     1     0   1   0   1   1   0   0   0   1   1   1   1   0   0   1
u.1.2     1     0   1   0   1   1   0   1   0   1   1   1   1   0   0   1
u.1.3     1     0   1   1   1   1   1   1   1   1   1   1   1   1   1   1
u.1.4     1     0   1   1   1   1   1   1   1   1   1   1   1   1   1   1
u.1.5     1     1   1   1   1   1   1   1   1   0   1   1   1   1   1   1
u.1.6     1     0   1   0   1   1   0   0   1   0   1   1   1   0   1   1
u.1.7     7     0   1   0   0   0   0   0   0   0   7   6   0   0   0   7
u.1.8     1     1   1   0   1   1   0   0   0   0   1   0   0   0   1   1
u.1.9     1     1   1   0   1   1   1   0   1   0   1   0   0   0   1   1
u.1.10    1     1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
u.1.11    1     1   1   1   0   1   1   1   1   1   1   1   1   1   1   1
u.1.12    1     1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
u.1.13    1     0   1   1   1   1   1   1   1   1   1   1   1   1   1   1
u.1.14    1     0   1   1   1   1   1   1   1   1   1   1   1   1   1   1
u.1.15    1     0   1   1   1   1   1   1   1   1   1   1   1   1   1   1
u.1.16    1     0   1   1   1   1   1   1   1   0   1   1   1   1   1   1
u.1.17    2     0   2   0   1   1   0   0   1   0   2   1   1   0   1   2
u.1.18    1     0   1   0   1   1   1   0   0   0   1   1   1   0   0   1
u.1.19    2     0   2   0   0   1   1   0   0   1   2   1   1   0   0   2
u.1.20    1     0   1   1   1   1   1   1   0   1   1   1   1   1   1   1
u.1.21    2     0   2   1   0   1   2   2   2   2   2   1   1   2   2   2
u.1.22    1     0   1   0   0   1   1   0   1   0   1   1   1   0   0   1
u.1.23    1     1   1   0   0   1   0   0   0   0   1   1   0   0   0   1
u.1.24    2     0   2   0   0   0   0   0   0   0   2   0   0   0   0   2
u.1.25    1     0   1   0   0   1   0   0   0   0   1   1   1   0   0   1
u.1.27    1     1   1   1   0   1   1   1   1   0   1   1   1   1   1   1
u.1.28    1     1   1   1   0   1   1   1   1   0   1   1   1   1   1   1
u.2.1     1     1   1   1   1   1   0   1   0   1   1   1   1   1   0   1
u.2.2     1     0   1   1   0   1   1   0   0   0   1   1   1   0   0   1
stotal   39    10  33  16  18  27  20  17  18  15  39  31  24  16  19  39

[C90: Qualities / 3 : Diagnosis of unspecified behaviors] (2 items)
s.1.1     2     0   2   0   0   0   2   0   0   0   2   0   0   0   0   2
s.1.2     2     0   2   0   0   0   0   0   0   2   2   0   0   0   0   2
stotal    4     0   4   0   0   0   2   0   0   2   4   0   0   0   0   4

[C90: Qualities / 4 : Diagnosis of suspicious cases] (12 items)
w.1.1     4     4   4   0   4   4   0   0   0   0   4   4   4   0   0   4
w.1.2     4     0   4   0   0   0   0   0   0   2   4   0   0   0   0   4
w.2.1     2     0   2   1   0   0   0   0   0   0   2   2   2   0   0   2
w.2.2     1     0   1   0   0   0   0   0   0   0   1   0   0   0   0   1
w.3.1     1     1   1   0   0   0   0   0   0   0   1   0   0   0   1   1
w.3.3     1     0   1   0   0   0   0   0   0   0   1   0   0   0   0   1
w.3.4     1     0   1   0   0   0   0   0   0   0   1   0   0   0   0   1
w.3.5     1     0   1   0   0   0   0   0   0   0   1   0   0   0   0   1
w.3.6     1     0   1   0   0   0   0   0   0   0   1   0   0   0   0   1
w.3.7     1     0   1   0   0   0   0   0   0   0   1   0   0   0   0   1
w.3.8     1     0   1   0   0   0   0   0   0   0   1   0   0   0   0   1
w.3.9     1     0   1   0   0   0   0   0   0   0   1   0   0   0   0   1
stotal   19     5  19   1   4   4   0   0   0   2  19   6   6   0   1  19

[C90: Qualities / 5 : Other features] (17 items)
q.1.1     9     0   9   6   9   9   8   7   4   9   9   9   9   7   5   9
q.1.2    10     6  10   4   8  10   4   4   4   4  10  10  10   4   4  10
q.1.3     4     4   4   2   4   4   4   4   4   4   4   4   4   4   4   4
q.1.4    20    10  20  10  20  20  20  10  20  10  20  20  20  10  20  20
q.2.1     4     2   4   2   4   4   4   4   2   4   4   4   4   4   2   4
q.2.2     4     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
q.2.3     2     2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
q.2.4     2     2   2   0   2   2   0   0   0   0   2   2   2   0   0   2
q.2.5     2     0   2   0   0   0   0   0   0   0   2   0   0   0   0   2
q.2.6     4     0   4   2   4   4   2   4   2   0   4   4   4   4   2   4
q.2.7    10     4   6   4   8   8   4   4   4   2   8   8   8   4   4   8
q.2.8    10     0   6   2   2   2   0   6   4   4   8   2   2   6   4   8
q.2.9     6     2   2   0   2   2   0   0   0   4   2   2   2   0   0   2
q.3.1    20    10   8   8  14  12   8  10  10   6   8  12  12   8  10   8
q.3.2    20    20  20  18  16  16  18  16  18  14  18  14  14  14  18  18
q.3.3    20    10  14   0  10  12   8   0  10   8  16  12  12   0  10  16
q.4.1    10     2   6   6   4   6   2   4   6   4   6   6   6   4   6   6
stotal  157    78 123  70 113 117  88  79  94  79 127 115 115  75  95 127

[C90: Qualities] (61 items)
mttl90  226    93 181  89 135 148 110 100 112  98 196 157 152  93 115 196

[C99: Qualities of new features] (3 items)
u.line    2     0   2   0   1   1   0   0   0   0   2   1   1   0   0   2
u.concat  1     0   1   0   0   1   0   0   0   0   1   1   1   0   0   1
w.tlimit  8     0   8   0   0   0   3   2   0   0   8   0   0   2   0   8
mttl99   11     0  11   0   1   2   3   2   0   0  11   2   2   2   0  11

[C++: Qualities of features not in C90] (1 item)
u.cplus   1     0   1   1   0   0   0   1   0   1   1   0   0   1   0   1
mttl++    1     0   1   1   0   0   0   1   0   1   1   0   0   1   0   1

[Overall] (265 items)
gtotal  960   483 857 646 768 841 760 711 670 648 926 847 844 737 673 926

                D   M   B   G   G   u   V   L   W   M   G   G   V   L   M
                E   C   C   C   C   c   C   C   a   C   C   C   C   C   C
                C   P   C   C   C   p   2   C   v   P   C   C   2   C   P
                U   P   5   2   3   p   0   W   e   P   3   4   0   W   P
                S   2   5   9   2   1   0   i   1   2   4   0   0   i   2
                C   0   C   5       3   3   n   0   4   3   2   5   n   6
                P       P   3               3   0                   3    
                P       P                   2                       8    

        max     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15


    [5.3]       Characteristics of Each Preprocessor

1   :   Linux       /                           / DECUS cpp

This was written at an early stage of the ANSI draft, so its Standard
conformance level is low by now.  Diagnostic messages are adequate;
however, there is almost no documentation.  It is a well-structured,
stable program.

The portability of the source is high and it has been ported to several
compiler systems.  The source code reads like a textbook, and you can
learn a lot just by reading it.  I modeled MCPP after this source.

3   :   WIN32       / Borland C++ V.5.5J        / cpp32

The C90 conformance level is relatively high and the troublesome
shift-JIS is respectably supported.  Documents are well provided.
Although diagnostic messages are usually issued for e_*, most of them
are hasty and their quality is not good.

"Quality other than Standards" is poor and there are no special
extension features.  Not many diagnostic messages are issued for
undefined behaviors, and the program sometimes runs away.  Supporting
the Standards alone seems to be the best it could do.

Unlike Turbo C, it is no longer fast.  The merits of a one-pass
compiler seem to have disappeared and only the demerits seem to remain.
I wonder how long Borland will continue this style.

4   :   Linux, CygWIN   / GCC V.2.95.3          / cpp0

The C90 and C95 Standard conformance level is quite high and diagnostic
messages are accurate.  The behavior is nearly stable and the speed is
extremely fast.  The options are plentiful, almost too abundant.  MCPP
has also modeled some of its options after them.

Though there were a few painful bugs in older versions, V.2.95 has
almost no bugs.

The remaining issues are these: new specifications in C99 and C++98
have not been implemented, diagnostic messages are not sufficient,
documentation is lacking, there are many extension features contrary to
the Standards which do not use #pragma, many obsolete pre-Standard
specifications remain hidden, and multi-byte character encoding support
is half-finished and does not reach a practical level.

The cpp.info document is excellent as an explanation of GCC/cpp overall
and of Standard C preprocessing.  However, it is really too bad that
documentation for implementation-defined areas does not exist for
CygWIN, FreeBSD, or Linux.  "Portability" is not only for programs.

The source is full of patches and difficult to read, and the program
still drags along the structure of an old macro processor.  However,
since the GCC compiler system as a whole is good, it has been ported to
many systems.

5   :   Linux       / GCC V.3.2                 / cpp0

GCC V.3 completely changed the preprocessing source from V.2.  At the
same time, it completely changed the documentation.  Token-based
principles are fully enforced, warnings are issued while pre-Standard
specifications are still allowed, and the number of undocumented
specifications has decreased.  On the whole, it has improved to a large
extent in the direction I had hoped for.  I suspect that future
improvements will be easier since the program structure has changed
completely.

Diagnostic messages, documentation, C99 support, and multi-byte
character support are not yet sufficient.  The speed is slightly slower
than V.2, but it is still one of the faster ones.

However, it is troublesome that the header files have become complex
and that setting the search order of include directories is getting
complicated.  Also, while old options are no longer necessary, many new
options have been introduced and it is taking forever for the options
to get organized.  It is unfortunate that the internal interface
between the preprocessing and compilation parts is complicated for some
reason, although preprocessing became combined with the compiler proper
in V.3.

6   :   Linux       /                           / ucpp (V.1.3)

Its characteristics are C99 support, open source, and portability.
The Standard conformance level is rather high.  This version is supposed
to support UCN and UTF-8, but the support is insufficient.  The
diagnostic messages are somewhat poor.  Documentation is not sufficient,
either.

7   :   WIN32       / Visual C++ 2003           / cl
13  :   WIN32       / Visual C++ 2005           / cl

Though few C99 specifications are implemented in 2003, more than half of
them are implemented in 2005.  There remain, however, some bugs
regarding the specifications of C90 and earlier.  The most fundamental
problem is confusion in the translation phases.  The upgrades must have
been reworks of some very old source code.

The diagnostic messages are often somewhat off the point.  An error
terminates the process, which makes this software difficult to use.  The
manual lags behind the implementation.

The merits are large translation limits and a relatively large number of
#pragma directives.  #pragma setlocale in particular is useful.
However, it is problematic that the #pragma line is macro-expanded even
in C90 and that the #pragma sub-directives use the user's name space.

The system headers included seem to have few problems on the whole with
a few exceptions.

8   :   WIN32       / LCC-Win32 V.3.2           / lcc
14  :   WIN32       / LCC-Win32 V.3.8           / lcc

Jacob Navia modified the preprocessing part of the source code written
for Plan9 by Dennis Ritchie, but it has not been debugged enough and
there are quite a number of bugs in the #if expression evaluation and
elsewhere.  The specifications since C95 are not supported.
Documentation is lacking.

There are few differences in preprocessing between V.3.2 and V.3.8.

9   :   WIN32       /                           / wave (V.1.0.0)

This preprocessor has been developed primarily for "meta-programming"
of C++ STL, and has some extended facilities for it.  Wave has a unique
construction: it consists mainly of C++ libraries, and its source
consists mainly of header files.  The examples of meta-programming use
recursive macros heavily, and Wave expands those macros as GCC / cpp or
the -@compat option of MCPP does, i.e. it limits the scope of
"inhibition of once-replaced macro's re-replacement" more narrowly than
the Standard's wording.  (See [2.4.26].)
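
The difference in scope can be observed with the f(2)(9) example that
C99 6.10.3.4 itself calls unclear.  The following C sketch stringizes
the expansion so that a program can inspect it.  The names STR, XSTR,
squeeze and recursive_result are made up for this illustration and do
not belong to any of the preprocessors reviewed here.

```c
#include <string.h>

#define f(a)    a*g
#define g(a)    f(a)

#define STR(x)  #x
#define XSTR(x) STR(x)      /* expand the argument, then stringize it */

/* Copy src to dst without spaces, so that spacing differences between
   preprocessors do not matter.  dst must be as large as src. */
static void squeeze(char *dst, const char *src)
{
    for (; *src != '\0'; src++)
        if (*src != ' ')
            *dst++ = *src;
    *dst = '\0';
}

/* Returns 1 for "2*9*g" (narrow scope of inhibition),
   2 for "2*f(9)" (wide scope), 0 for anything else. */
int recursive_result(void)
{
    char buf[sizeof XSTR(f(2)(9))];

    squeeze(buf, XSTR(f(2)(9)));
    if (strcmp(buf, "2*9*g") == 0)
        return 1;
    if (strcmp(buf, "2*f(9)") == 0)
        return 2;
    return 0;
}

#undef f                    /* keep the short names local to the example */
#undef g
```

Either result is permitted by the Standard's wording, so a portable
program should not depend on which one it gets.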

Although Wave is also intended to be used as a usual preprocessor, and
intends to conform to C++98 and C99, its conformance is not yet high.
It was reported, however, that a lot of bugs were fixed after V.1.0
using the validation suite of MCPP V.2.4.

Meanwhile, MCPP's recursive macro expansion was revised using the GCC /
cpp testsuite and the testcases accompanying Wave, although many
testcases of the latter are specific to Wave and a few contain wrong
interpretations of the Standards.

11  :   Linux       / GCC V.3.4.3               / cc1, cc1plus
12  :   Linux       / GCC V.4.0.2               / cc1, cc1plus

The scoring is almost the same as V.3.2.  However, the construction of
preprocessing has changed greatly.  Although V.3.2 seemed to proceed in
the direction of portability, GCC changed direction in V.3.3 and 3.4.
It has become one huge and complex compiler, removing the independent
preprocessor, predefining many macros and restoring some old
specifications which were once obsoleted by V.3.2.  It is questionable
whether these changes can be called improvements.  It has also given a
privileged place to UTF-8 among the many multi-byte character
encodings.  I am afraid that this might narrow the wide variety of
multi-lingualization.

2   :   FreeBSD, DJGPP, WIN32, MS-DOS, OS-9/09  /   / MCPP (V.2.0)
10  :   FreeBSD, Linux, CygWIN, WIN32, MS-DOS   /   / MCPP (V.2.4)
15  :   FreeBSD, Linux, CygWIN, WIN32           /   / MCPP (V.2.6)

Since I created and tested this myself, the conformance level is the
best, of course.  It should be the world's most accurate preprocessor.
The plentifulness and accuracy of diagnostic messages and the detailed
documentation are also the best.  Useful options and #pragma directives
are provided.  The C99 specification is fully supported in V.2.3 and
later.

The portability of the source is also the best.  The only problem is
that there are not many ports, so I would appreciate your contribution.


    [5.4]       Overall Review

Testing many preprocessors shows that nowadays many have a high level
of C90 Standard conformance.  However, each compiler system still has
many issues.  I will not discuss MCPP here, since it scores full marks
on most items.

More compiler systems can process the n_* samples correctly now.  GCC
2.95 and later, BC (Borland C) 5.5, LCC-Win32 3.2 and 3.8, Visual C++
(VC) 2003 and 2005, and Ucpp 1.3 have reached a level with not so many
problems in practice.  However, each compiler system has unexpected
bugs.

The most surprising thing is that compiler systems, including Visual C,
often raise a division by 0 error in n_13_7.t (n_13_7.c).  The basic C
specification of "short-circuit evaluation" by &&, || and the ternary
operator is not handled.  Borland C issues a warning in n_13_7.c, and it
issues only the same warning for the real division by 0 in e_14_9.c as
well.  In Turbo C, a real division by 0 and a sub-expression whose
evaluation is skipped caused the same error, whereas in Borland C the
same diagnostic message is merely downgraded to a warning.  This is an
example of a "hasty diagnostic message" in this compiler system.
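
The kind of expression involved can be sketched in C as follows.  This
is not the text of n_13_7.c; the macro and function names are made up
for illustration.  A preprocessor that handles short-circuit evaluation
must accept every #if line below without a division by 0 error, because
the operand containing the division is never evaluated.

```c
#define ZERO    0

/* Left operand of && is 0, so the right operand is skipped. */
#if ZERO != 0 && (10 / ZERO) > 1
#define SKIPPED_AND     0
#else
#define SKIPPED_AND     1
#endif

/* Left operand of || is non-zero, so the right operand is skipped. */
#if ZERO == 0 || (10 / ZERO) > 1
#define SKIPPED_OR      1
#else
#define SKIPPED_OR      0
#endif

/* Condition is 0, so only the third operand of ?: is evaluated. */
#if (ZERO != 0 ? 10 / ZERO : 5) == 5
#define SKIPPED_COND    1
#else
#define SKIPPED_COND    0
#endif

/* Returns 1 when all three directives took the expected branch. */
int short_circuit_ok(void)
{
    return SKIPPED_AND && SKIPPED_OR && SKIPPED_COND;
}
```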

Among the C90 specifications, some preprocessors have errors in the
implementation of stringizing by the # operator.
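
As a reminder of what the # operator is required to do, here is a
minimal C sketch.  The names STR, spaced, quoted and stringize_ok are
illustrative only.  The two cases cover the rules that are easiest to
get wrong: the treatment of white space, and the escaping of " and \
inside a string-literal argument.

```c
#include <string.h>

#define STR(a)  # a

/* White space between the tokens of the argument becomes one space;
   leading and trailing white space is removed. */
static const char spaced[] = STR(  a   +   b  );

/* A backslash is inserted before each " and \ of a string literal, so
   that the result spells the original literal. */
static const char quoted[] = STR("abc\n");

/* Returns 1 when both stringized results are as the Standard requires. */
int stringize_ok(void)
{
    return strcmp(spaced, "a + b") == 0
        && strcmp(quoted, "\"abc\\n\"") == 0;
}
```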

The specifications added by Amendment 1, Corrigendum 1 are implemented
to some extent by GCC 2.95 and later, VC 2003,2005 and Ucpp.

Of the C99 specifications, only GCC 3.2 and later and Ucpp implement
most, though not all.  The // comments have been implemented by many
compiler systems for quite some time.  In addition, GCC has long long,
has considerable room in translation limits, and properly processes
empty arguments of macros.  GCC has a variable-argument macro of its
own specification, but the one in the C99 specification has also been
implemented since 2.95.  _Pragma() has been implemented since 3.2.  UCN
is implemented by Ucpp and VC 2005 only.  GCC 3.2 and later implement
UCN in string literals and character constants only.  Wave implements
half of the C99 specifications.
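
Several of the C99 features named above fit into one small C sketch.
The names FORMAT, CAT3, WIDE_IF and c99_features_ok are made up for
illustration, and the sketch assumes a compiler running in C99 mode or
later.

```c
#include <stdio.h>
#include <string.h>

// A "//" comment, accepted by C99.

/* Variable-argument macro in the C99 form. */
#define FORMAT(buf, ...)    snprintf(buf, sizeof(buf), __VA_ARGS__)

/* Used below with an empty second argument, which C99 permits. */
#define CAT3(a, b, c)       a ## b ## c

/* C99 evaluates #if in intmax_t, which is at least 64 bits wide. */
#if 0x7FFFFFFFFFFFFFFF / 2 > 0x7FFFFFFF
#define WIDE_IF 1
#else
#define WIDE_IF 0
#endif

/* Returns 1 when all of the features above behaved as in C99. */
int c99_features_ok(void)
{
    char buf[32];
    int CAT3(val, , ue) = 42;   /* pastes to the identifier "value" */

    FORMAT(buf, "%d-%d", value, WIDE_IF);
    return strcmp(buf, "42-1") == 0;
}
```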

In C++98, GCC 3.2 and later, Wave and VC 2005 implement most of the
specifications.

The queer specification of C++98 to convert extended-characters to UCN
is not yet implemented by any preprocessor.

In processing the implementation-defined areas in i_*, many cannot
handle wide characters in the #if expression.  Though this is specified
in the Standards, using not only wide characters but also character
constants in the #if expression is almost meaningless, and it would not
hurt even if these could not be used.  This type of meaningless
specification should be removed from the Standard.
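
One reason such a test says little: the Standard leaves it
implementation-defined whether a character constant has the same value
in #if as in an ordinary expression.  The C sketch below compares the
two views; PP_SEES_ASCII_A and char_const_agrees are hypothetical
names.  On common ASCII-based systems the two agree, but nothing
obliges them to.

```c
/* What the preprocessor thinks of 'A'. */
#if 'A' == 65
#define PP_SEES_ASCII_A 1
#else
#define PP_SEES_ASCII_A 0
#endif

/* Returns 1 when the preprocessor and the compiler proper agree on
   whether 'A' equals 65, and 0 when they disagree. */
int char_const_agrees(void)
{
    return PP_SEES_ASCII_A == ('A' == 65);
}
```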

Visual C supports relatively many encodings of multi-byte characters,
though not enough.  Other preprocessors are poor.  The implementations
in GCC 2.95 and 3.2 are half-finished and do not reach a practical
level.  GCC 3.4 and 4.0 have begun to support many encodings by
converting all encodings to UTF-8.  The actual implementation is,
however, not yet at a practical level.

On systems using shift-JIS or BIG-5, tokenization of literals and
stringizing by the # operator require attention.  Visual C supports
these well.  BC 5.5J also supports shift-JIS.
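
The attention is needed because the second byte of a two-byte character
in these encodings can be 0x5C, the code of the backslash.  The C
sketch below shows the effect on finding the end of a string literal.
It is only an illustration, not code from any of the preprocessors
above; is_sjis_lead and literal_end are hypothetical helpers, and only
shift-JIS is handled.

```c
#include <stddef.h>

/* Lead bytes of two-byte characters in shift-JIS. */
static int is_sjis_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* s[0] must be the opening '"'.  Returns the index of the closing '"',
   or -1 if the literal is unterminated.  With sjis == 0 every byte is
   treated as a single character, which is the naive behavior. */
long literal_end(const unsigned char *s, size_t len, int sjis)
{
    size_t i = 1;

    while (i < len) {
        if (sjis && is_sjis_lead(s[i]) && i + 1 < len)
            i += 2;             /* skip both bytes; 2nd may be 0x5C */
        else if (s[i] == '\\' && i + 1 < len)
            i += 2;             /* escape sequence: skip next byte */
        else if (s[i] == '"')
            return (long)i;
        else
            i++;
    }
    return -1;
}
```

For the four bytes '"', 0x95, 0x5C, '"', the encoding-aware scan finds
the closing quote, while the naive scan takes 0x5C for a backslash that
escapes the quote and runs off the end of the literal.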

In the diagnostic messages for e_*, GCC 2.95 and later are superior.
Though Visual C and Ucpp issue diagnostics for comparatively many
items, they are often vague or off-target.  Very few preprocessors
diagnose overflow in the #if expression; only BC, Ucpp and GCC do so to
some extent.

In documents for implementation-defined areas, GCC 3.2 and later are at
an adequate level and the rest are all very poor.

In the diagnostics for u_*, only GCC 3.2 and later are adequate.  The
rest are very poor.  I don't think it is acceptable for compiler systems
to do nothing just because the result is undefined.

Almost no compiler systems handle s_* and w_*.  It is unexpected that
very few compiler systems issue even a warning for nested comments.

In "other various quality", GCC stands out with plentiful options,
accurate diagnostic messages, high speed, and portability.

Overall, GCC, especially V.3.2, excels the most in Standard conformance
level, ease of use, and stability, without many big problems.

Of course, MCPP excels in most aspects, with speed being its only
weakness.

After this huge volume of testing, what I realize is the importance of
test samples.  MCPP is the result of creating samples and debugging in
parallel.  Since you cannot notice the existence of bugs without enough
samples, debugging without them can hardly be called debugging.

If the Standards came with this type of exhaustive test samples, the
quality of each compiler system would be improved dramatically.  Also,
creating exhaustive test samples reveals the problems in the Standards
at the same time.  Test samples are the illustration of the Standards.


    [5.5]       Test Reports and Comments

I look forward to opinions about this Validation Suite and preprocessing
test reports for various compiler systems using this tool.  Please use
the "Open Discussion Forum" at:

    http://mcpp.sourceforge.net/

or email.

If you perform detailed testing of a preprocessor, cut out the [5.2]
score table and send it.  To calculate the total scores, please compile
and use tool/total.c.  If the score table is cpp_test.old, run

    total 16 cpp_test.old cpp_test.new

and each field of stotal (sub-total), mttl* (mid-total), and gtotal
(grand-total) is filled in and written to cpp_test.new.  Specify the
compiler system number at "16".
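
What such a totaling step involves can be sketched in C as follows.
This is not the source of tool/total.c; add_row is a hypothetical
helper that adds the numeric fields of one score row into running
column sums, assuming rows laid out like those in the [5.2] table (an
item name followed by the scores).

```c
#include <stdio.h>
#include <stdlib.h>

/* Add the numeric fields of one score row, e.g. "s.1.1  2  0  2 ...",
   into sums[0..ncols-1].  Returns the number of fields read. */
int add_row(const char *row, long sums[], int ncols)
{
    char name[32];
    int used = 0;
    int col = 0;

    /* Skip the item name at the start of the row. */
    if (sscanf(row, "%31s%n", name, &used) != 1)
        return 0;
    row += used;

    while (col < ncols) {
        char *end;
        long value = strtol(row, &end, 10);

        if (end == row)         /* no more numbers on this row */
            break;
        sums[col++] += value;
        row = end;
    }
    return col;
}
```

Feeding it the s.1.1 and s.1.2 rows of the table with 16 columns gives
4 0 4 0 0 0 2 0 0 2 4 0 0 0 0 4, which is the stotal row of that group.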

You can test GCC automatically with the testsuite edition of the
Validation Suite.  I am waiting for test reports on various versions of
GCC.  Please send me the log files (gcc.sum and gcc.log), and I will
supplement my testsuite edition with the diagnostics of those versions
if any differences exist.

Also, the development of the Validation Suite and MCPP is in progress in
the mcpp project at SourceForge mentioned above.  Please send me email
if you would like to join the development.

                                                                  [eof]
