User:O11c/script

From The Mana World

This is a summary of some internal details of the eathena scripting language.

General

parse_script expects a string buffer that starts with a { and ends with a }.

parse_script detects labels, and otherwise just calls parse_line a lot.

parse_line first does a parse_simpleexpr, which must correspond to a FUNC. Yes, this is silly. Then, it does parse_expr a lot until it gets to a ;.

parse_expr basically just calls parse_subexpr with a limit of -1.

parse_subexpr can be a naked - label, or a unary operator followed by parse_subexpr with a limit of 100. Or, it can be parse_simpleexpr followed by a number of binary operators, each followed by parse_subexpr with the adjusted operator's limit. If the "binary operator" is (, get some parse_subexpr -1 and finally a ).

parse_simpleexpr can be a (, a parse_subexpr -1, and a ). Or, it can be a number or a string. Or, it can be a name, which can be a label, variable, or function. If the function is "if", magic happens to prevent the comma warning. If it's not a function, it may be a [, parse_subexpr -1, and ]. Note that labels and variables are NOT distinguished at this point.
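The limit-passing scheme above is a form of precedence climbing. Here is a minimal Python sketch of it, assuming a pre-split token list and a made-up precedence table (the numbers are illustrative, not the ones eathena uses). It leaves out the naked - label, function calls, and array indexing, and emits postfix to mimic the order in which bytecode would be written.

```python
# Hypothetical precedence table: higher binds tighter. These numbers are
# illustrative only, not the values eathena uses.
BINARY = {"||": 1, "&&": 2, "==": 3, "+": 4, "-": 4, "*": 5, "/": 5}
UNARY = {"-": "NEG", "!": "LNOT"}


def parse_expr(tokens):
    """Parse a whole expression; returns the postfix output list."""
    out = []
    pos = parse_subexpr(tokens, 0, -1, out)
    assert pos == len(tokens), "trailing tokens"
    return out


def parse_subexpr(tokens, pos, limit, out):
    """Parse operators binding tighter than `limit`; returns the new position."""
    if tokens[pos] in UNARY:
        # The operand of a unary operator is parsed with limit 100, so no
        # binary operator can be swallowed into it.
        op = UNARY[tokens[pos]]
        pos = parse_subexpr(tokens, pos + 1, 100, out)
        out.append(op)
    else:
        pos = parse_simpleexpr(tokens, pos, out)
    # Assumption of this sketch: binary operators may follow either branch.
    while pos < len(tokens) and BINARY.get(tokens[pos], -1) > limit:
        op = tokens[pos]
        # The right-hand side may only contain operators tighter than `op`.
        pos = parse_subexpr(tokens, pos + 1, BINARY[op], out)
        out.append(op)
    return pos


def parse_simpleexpr(tokens, pos, out):
    """Parse a parenthesized expression or a single atom."""
    if tokens[pos] == "(":
        pos = parse_subexpr(tokens, pos + 1, -1, out)
        assert tokens[pos] == ")", "expected )"
        return pos + 1
    out.append(tokens[pos])
    return pos + 1
```

Passing the operator's own precedence as the new limit is what makes operators of equal precedence group to the left.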

Bytecode

Subsets of the ScriptCode enum are used for bytecode, as well as for typing the stack variables.

The following unary or binary operators do exactly what you would expect:

   LOR, LAND, LE, LT, GE, GT, EQ, NE,
   XOR, OR, AND, ADD, SUB, MUL, DIV, MOD, 
   NEG, LNOT, NOT, R_SHIFT, L_SHIFT,

The rest are treated specially, if at all:

  • NOP: Automatically added at the end of the script. Equivalent to the 'end' command.
  • POS: Represents a label - either known or backpatched. Followed by a 3-byte (little-endian) offset into the bytecode, which is pushed with type POS. Note: a POS is sometimes converted from a NAME (for forward jumps) when the compiler didn't know it was a label originally.
  • INT: Not actually used in bytecode. Instead, integers are variable-length-encoded using bytes that are not valid bytecode.
  • PARAM: Not actually used in bytecode. Instead, type-1 constants are stored as a 3-byte (little-endian) offset into str_data, which is later checked for being a PARAM.
  • FUNC: Not used for function names - those are 3-byte (little-endian) offsets into str_data. However, this IS used for function *calls*. Note that array indexing is a function call. It occurs after all the function arguments are pushed, and makes the call actually happen. Note: function calls include lines aka commands. After the function call, the goto state is checked.
  • STR: Push the immediately-following 0-terminated string literal. In the interpreter, advances to the 0, so the later increment passes it.
  • CONSTSTR: Not used in bytecode. But note that STR will push a CONSTSTR!
  • ARG: Pushes a dummy value onto the stack, after the function name/reference but before the actual parameters, so that when FUNC is executed, it knows how many arguments to copy.
  • NAME: Similar to POS, followed by a 3-byte (little-endian) offset, but into the str_data. This is pushed with type NAME. Transiently, it may instead be followed by a backpatch, when it can't prove that there won't be a label later.
  • EOL: Does some sanity checking of the stack and sets up rerun info (for two-cycle commands such as menu). This is emitted after the FUNC for the line, and the position of the '-' label is immediately AFTER this.
  • RETINFO: Not emitted in bytecode.
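The interplay between ARG and FUNC can be illustrated with a toy stack machine. This is only a sketch: the list-based stack, the `functions` dict, and the rule that a non-None result is pushed back are assumptions made for illustration, not the interpreter's real data structures.

```python
ARG = object()  # stand-in for the dummy value pushed by the ARG bytecode


def run_func(stack, functions):
    """Execute the call on top of `stack`: [..., name, ARG, arg0, arg1, ...]."""
    # Walk back from the top until the ARG marker is found; everything
    # above it is an argument, and the entry just below it is the function.
    marker = len(stack) - 1
    while stack[marker] is not ARG:
        marker -= 1
    name = stack[marker - 1]
    args = stack[marker + 1:]
    # Pop the function reference, the marker, and the arguments.
    del stack[marker - 1:]
    result = functions[name](*args)
    if result is not None:
        stack.append(result)  # assumption: a returned value replaces the call
```

Because the marker sits between the function reference and the first argument, nested calls need no explicit argument count: each FUNC only consumes back to the nearest ARG.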

plan for changes

Add/change the following values:

  • Change NAME to VARIABLE. Its 24-bit argument is (after backpatching) an index into a new dedicated int<->string table.
  • Create a new FUNC_REF bytecode, since FUNC bytecode means to actually do the call. Its 24-bit argument will be an index into the function table. Remove the .func field.
  • Emit PARAM directly instead of NAME + typecheck. Store the 24-bit value directly.

values

There is (was?) a problem if bytecode values go over 0x20 (32). Nope, not anymore. Yay, I can introduce new bytecodes to solve *all* my problems. I should probably still split the stack types though.

  • NOP 0
  • POS 1
  • INT 2 unused
  • PARAM 3 unused
  • FUNC 4
  • STR 5
  • CONSTSTR 6 unused
  • ARG 7
  • NAME 8
  • EOL 9
  • RETINFO 10 unused
  • LOR 11
  • LAND 12
  • LE 13
  • LT 14
  • GE 15
  • GT 16
  • EQ 17
  • NE 18
  • XOR 19
  • OR 20
  • AND 21
  • ADD 22
  • SUB 23
  • MUL 24
  • DIV 25
  • MOD 26
  • NEG 27
  • LNOT 28
  • NOT 29
  • R_SHIFT 30
  • L_SHIFT 31

integer encoding

Integers are encoded in add_scripti and decoded in get_num. An integer is encoded by zero or more 0b11?? ???? bytes, terminated by a 0b10?? ???? byte.

The first byte contains the low 6 bits, then the remainder subtracts 0b0100 0000 and repeats until it fits in only 6 bits, which are put in the 0b10?? ???? byte.

Note that get_com synthesizes a code of INT.

Integers are *decoded* slightly differently. The decoder just masks each byte with 0x7f, which keeps the low bit of the 0xc0 marker (that is, 0x40), and that manages to undo the crazy subtraction done earlier.

But the important thing to realize is that every byte of an encoded integer is 0x80 or higher, so it can never be mistaken for a bytecode.
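To make the round trip concrete, here is a Python sketch of the scheme as described in this section. It reuses the names add_scripti and get_num for readability, but it is a reconstruction from the prose, not a copy of the real functions. It handles non-negative values only and returns bytes instead of appending to a global buffer.

```python
def add_scripti(value):
    """Encode a non-negative integer as described above; returns the bytes."""
    out = bytearray()
    while value >= 0x40:
        out.append(0xC0 | (value & 0x3F))   # continuation byte: low 6 bits
        value = (value - 0x40) >> 6         # the "crazy subtraction"
    out.append(0x80 | value)                # terminator: the last 6 bits
    return bytes(out)


def get_num(buf, pos=0):
    """Decode one integer starting at `pos`; returns (value, next_pos)."""
    value = 0
    shift = 0
    while buf[pos] >= 0xC0:
        # Masking with 0x7f keeps the 0x40 bit of the marker, which adds
        # back exactly what the encoder subtracted.
        value += (buf[pos] & 0x7F) << shift
        shift += 6
        pos += 1
    value += (buf[pos] & 0x7F) << shift
    return value, pos + 1
```

For example, 64 encodes as the two bytes 0xc0 0x80: the continuation byte carries no low bits, and its leftover 0x40 bit alone restores the value on decode.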

str_data hackery

This stuff is freaking confusing. Good luck!

str_data is, logically, a string-indexed array of data.

  • type: a ScriptCode. Can take a few values.
  • (str): (index of) this string. Should probably be removed once it's using a proper map.
  • backpatch: -1 or an offset of a label that needs to be patched.
  • label: label offset.
  • func: function pointer for type = FUNC
  • val: function table offset for type = FUNC
  • (next): linked list offset for the string hash. Internal, removed.

Index 0 is unused for reasons. Index 1 is the ephemeral "next line" label. Indexes from 2 up are actual string things. Note that string literals are NOT included.

search_str and add_str do the obvious things. There is no clear. If add_str creates a new entry, it has:

  • type: ScriptCode::NOP
  • func: NULL
  • backpatch: = -1
  • label: -1
  • val: undef, probably 0.

Note that add_str etc can be called at any time, not just at the beginning! But fortunately, the individual indices are stable, or the pointers are stable in the map-backed version.
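As a reference point for the "proper map" version, here is a small Python sketch of the table. The class names StrDatum and StrData and the dict-backed index are hypothetical. Only the field names, the default values, and the two reserved indices come from the description above, and val is set to 0 even though the original leaves it undefined.

```python
from dataclasses import dataclass

NOP = 0  # ScriptCode value, from the values table on this page


@dataclass
class StrDatum:
    name: str
    type: int = NOP
    func: object = None
    backpatch: int = -1
    label: int = -1
    val: int = 0  # undefined in the original; 0 is this sketch's choice


class StrData:
    def __init__(self):
        # Index 0 is unused and index 1 is the "next line" label; both are
        # placeholders here and are not reachable through search_str.
        self.entries = [StrDatum(""), StrDatum("")]
        self.index = {}

    def search_str(self, name):
        """Return the index of `name`, or -1 if it has never been added."""
        return self.index.get(name, -1)

    def add_str(self, name):
        """Return the index of `name`, creating a default entry if needed."""
        found = self.search_str(name)
        if found >= 0:
            return found
        self.entries.append(StrDatum(name))
        self.index[name] = len(self.entries) - 1
        return self.index[name]
```

Entries are only ever appended, so an index handed out early stays valid however late add_str is called again.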

add_scriptl takes a str_datum and does the following things:

  • POS: write the 24-bit label offset
  • NOP: write the 24-bit backpatch and update the str_datum's backpatch to point to this offset.
  • INT: (for type-0 constants in db/const.txt) write the actual integer value instead.
  • FUNC: write the 24-bit str_data offset, with type NAME. We need a new bytecode, then we can write the offset into the function table instead.
  • PARAM: write the 24-bit str_data offset, with type NAME. It would probably be sane to instead write the value directly, under code PARAM.

set_label takes a str_datum and sets type to POS and label to the given value. It then walks the backpatch list and overwrites each linked bytecode to be a POS with the label offset (instead of a NAME with the backpatch link).
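The backpatch list is a linked list threaded through the bytecode itself. The Python sketch below illustrates the idea under some assumptions: the datum is a plain dict, emit_unknown_name is a hypothetical stand-in for the NOP case of add_scriptl, and the chain terminator is taken to be -1 truncated to 24 bits.

```python
POS, NAME = 1, 8          # ScriptCode values from the values table on this page
END_OF_CHAIN = 0xFFFFFF   # assumption: -1 truncated to 24 bits ends the list


def put24(buf, at, value):
    """Write a 24-bit little-endian value into buf[at:at+3]."""
    buf[at:at + 3] = (value & 0xFFFFFF).to_bytes(3, "little")


def get24(buf, at):
    """Read a 24-bit little-endian value from buf[at:at+3]."""
    return int.from_bytes(buf[at:at + 3], "little")


def emit_unknown_name(buf, datum):
    """Emit NAME followed by a link to the previous unresolved use."""
    buf.append(NAME)
    here = len(buf)
    buf.extend(b"\0\0\0")
    put24(buf, here, datum["backpatch"])   # -1 becomes 0xFFFFFF
    datum["backpatch"] = here              # this use is now the list head


def set_label(buf, datum, target):
    """Resolve the name as a label and rewrite every pending use."""
    datum["type"] = POS
    datum["label"] = target
    at = datum["backpatch"]
    while at >= 0 and at != END_OF_CHAIN:
        nxt = get24(buf, at)   # read the link before overwriting it
        buf[at - 1] = POS      # the opcode byte sits just before the operand
        put24(buf, at, target)
        at = nxt
```

Each unresolved use stores the offset of the previous one in its operand, so no side table is needed until the label is finally defined.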

parse_simpleexpr takes a thing that may be a thing. If it's not a function, it may be followed by a [. This is silly, it should only apply for variables (NOP), not labels, params, ints. Of course, it doesn't know if it's a label.

On the first parse_script, add_builtin_functions adds a bunch of str_datum with type FUNC, func the (redundant!) pointer, and val the offset into the function table. This means that backpatch and label are left as -1. It also does read_constdb, which adds a bunch of PARAM or INT with the given val (leaving func NULL, and backpatch and label -1).

Then in parse_script, str_data's LABEL_NEXTLINE is set to NOP, -1, -1. Also, all things with a type of POS or NAME are reset. I'm not sure if NAME ever happens? Unless that's in the post stage. After the EOL, it again resets the LABEL_NEXTLINE entry.

At the very end of the script, after it emits the NOP bytecode, it iterates over EVERY str_datum with a type of NOP and replaces it with a NAME. Obviously, these are known to be variables now. It also sets label to the index, which I'm pretty sure is dumb. But it also walks the backpatch list and replaces it with the offset list, which is *not* dumb, and is kind of hard to replace. Note that this does NOT replace the bytecode - it was already a NAME bytecode.

Later, get_val finds an *object* (not a bytecode) with type NAME and reads the *string* value from the corresponding str_data (it doesn't check the str_data's type). It then checks it for prefixes and suffixes. If suffix is $ and prefix is @ or $, it does a lookup by the datum index (which by now also includes an array index in the high bits). If the suffix is not $, it checks if type is PARAM, and calls pc_readparam with the datum's val. If not a PARAM, if prefix is @ or $, does another lookup by num. But if prefix is # or if there is no prefix, it does a lookup by name! This is the troublesome part.
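The decision tree in get_val can be summarized as a small classifier. This Python sketch only names the branch taken and performs no lookup; classify_lookup is a hypothetical helper. String names with any other prefix are reported as not covered, since the description above does not say what happens to them.

```python
PARAM = 3  # ScriptCode value from the values table on this page


def classify_lookup(name, datum_type):
    """Return which kind of lookup get_val performs for a variable name."""
    prefix = name[0] if name[0] in "@$#" else ""
    is_string = name.endswith("$")
    if is_string:
        if prefix in ("@", "$"):
            return "string by datum index"
        return "not covered by the description above"
    if datum_type == PARAM:
        return "pc_readparam with the datum's val"
    if prefix in ("@", "$"):
        return "integer by datum index"
    # Prefix # or no prefix at all: the troublesome by-name path.
    return "integer by name"
```

Only the final branch needs the name string at run time, which is why it is the one standing in the way of a purely index-based design.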

set_reg does pretty much the opposite of get_val - it calls the corresponding set methods instead of read, keeping the same num/name separation.

conv_str has some never-executed code that checks for datum's type as NAME, but this is impossible since it already called get_val.

The builtin input command does some magic with a num/name. It implicitly assumes that the stack value's num is an index into str_data. TODO: eliminate the optionality of the argument - it has a little UB. The builtin set and setarray (and family) commands are similar, but without the evil optionalness.

getelementofarray actually checks that the stack element's type is a name (and, normally, pushes a NAME in return. But this is somewhat far afield).

In run_func, st->start is set to the function, st->start+1 is the dummy arg, and 2 or more are actual args. Why do I care? The st->start's num (which must be from a NAME) must be an index to a FUNC in str_data. Then .func had better not be NULL, so it can be run. It does some popping, too. But this is not touching str_data.

script_save_mapreg_intsub and script_save_mapreg_strsub need to map their datum indices into name + array index.

And that's all!