How about a Safe Virtual Machine?

Daniel W. Connolly (connolly@hal.com)
Fri, 30 Sep 1994 14:04:13 -0500


[For those on the scheme48 list, this is in response to some
discussions on www-talk and comp.lang.python about exploring ways to
host untrusted programs. Some folks are considering doing to the
python interpreter what other folks have done to Tcl to make
Safe-Tcl.]

Hmmm... Safe Tcl is the most mature system of its kind (it's kind
being restricted to open systems -- telescript is possibly more
mature). Still... Tcl give me the heebee-jeebees. I wish I could
prove a theorem like:

"Construct X (which has been essential to the development
of distributed hypermedia applications) will always
have a super-quadratic running time in Tcl, whereas in
languages with richer type systems, the running time
is linear."

I have a feeling there's one out there, but it seems like nobody
has found it yet.

One answer is "well... you can always write the performance-critical
parts in C" but that doesn't apply to the Safe-Tcl scenario where you're
sending your code to run on an interpreter on somebody else's machine.

Anyway... the point of this message is to explore the possibility of
supporting mutliple languages by making the safe-computing platform a
virtual machine, rather than a programming langage.

I like the scheme48 virtual machine[1] -- especially the performance.
But the heap technology isn't mature yet (the current release uses a
stop-and-copy collector, but incremental/generational collectors are
"in the works") and the foreign function interface is raw (though
there's a scheme shell, sort of a perl work-alike, and it's author
says the foriegn function interface is maturing). The scheme48 virtual
machine supports full continuations and threads.

The python virtual machine[2] is a proven success, but it's threads
support seems kinda flakey, and its reference-counting implementation
is a pain in the butt for C interfacing.

I understand there's a Tcl compiler called RUSH (I need to read up on
this: somebody wanna send me a pointer?). Ironically, I hear that it
compiles to scheme.

The "plug-n-play" technology for multi-architecture bus cards being
deployed in the PC/workstation arena uses some forth-based virtual
machine to initialize the cards, rather than having machine code for
each architecture for initialization.

Let's explore the plusses and minuses:

- Safe Tcl is mature technology. If we don't stick to safe-tcl,
we have to do a whole bunch of engineering over again.
We can re-use some of the knowledge and perhaps some of
the code, but "warm fuzzies" are not reusable.

- Shipping byte codes around rather than source code precludes
the possibility of having an optimizing compiler on the compute
server. One can argue that it's best to represent the problem
at the highest level of abstraction possible, and let the
receiving end be responsible for optimizing the concrete
solution to the problem.

I think this is an unrealistic characterization of the situation.
Server developers are going to want to link in a "libSafeComputing.a"
and be done with it.

+ Shipping byte codes around allows for the possiblitiy of
an optimizing compiler in the development environment. It allows
for development in a variety of languages: Tcl, scheme,
python, SmallTalk, etc., given the right kind of virtual
machine.

+ Shipping byte codes around puts the lexing, parsing, and
syntax-error checking on the client side. Plus: these
tasks will only be done once per version of code, rather
than once per execution of code.

- Shipping byte codes around requires more technology on
the client side. You won't be able to just have a form
that says: "Type your program here:" and runs the program --
that is, unless you're willing to compile the program on
the server side.

Again, I think this is a rare scenario. Any application
where the end user is required to do programming is bound
to fail. We must stop making users play the "what sequence
of characters will get the machine to do what I want?" game.
The tremendous success of Mosaic is yet another piece of
evidence that visual interfaces make people productive and happy.

I think that portable source code is a myth. By its nature, source
code is for human consumption and manipulation. A body of code is not
really reusable until it's reached the state of a shared library or
DLL. (The possible exception being Ada/Modula-3 generic modules,
and to some extent, C++ template classes.)

Part of the mess we're in with HTML is because the "code" goes from
the author, through the server, to the display client without ever
going through any kind of compiler/validator/syntax checker.

On the one hand, it seems easier to exhaustively examine the behavior
of a bytecoded virtual machine than the behaviour of the set of
programs expressible in some high-level language. On the other hand,
the important issue is the safety of the system at a macro level. Just
because no one bytecoded instruction can cause damage doesn't mean
that a large program -- or worse, a collection of programs -- won't
exhibit antisocial behavior. In a way, it's more a problem of
analyzing a nonlinear dynamical system than looking at a
state-machine.

I'm sorry I don't have a concrete proposal. In a way, I'm generating a
bunch of heat and not much light. I think I've got a good idea, and
I'd like to get some impressions. Ideally, somebody with more time for
this sort of thing will go off and try it and prove that my idea is
workable or that it is really bad.

To be somewhat concrete, let's describe a hypothetical system
called Safe-VM:

Hmm... the more I think about it, the less interesting it becomes to
think about executing completely untrusted, anonymous programs. The
interesting part is to allow programs access to exactly the set of
resources that they are authorized to use, and to support accounting
for the use of these resources.

A truly anonymous program might begin executing with with finite
resources: a certain amount of memory and some CPU time. To gain
access to the filesystem, network ports, databases, more compute time,
or more memory, I see three options:

(1) In an interactive session, (like a program that runs inside Mosaic
to control a smart form) the virtual machine could present a dialog to
the interactive user:

The 'fractal image display' task has consumed the default
maximum memory resources, and requested 5MB more.
<Grant Request> <Abort Task>

(2) In a batch session, the program would have to be certified by some
principal that is authorized to use the resources. For example, the
bytecodes might get to the server with a certificate that says 'This
program is allowed to access databases X, Y, and Z. It can use 50MB
of disk space for intermediate query results.' An interesting way to
deploy an authorization/accounting system like this is given in a
paper by Cliff Neuman[3].

(3) If the program can begin execution with a communications channel
back to the user that owns it, it can begin executing anonymously,
and use the communications channel to satisfy challenges to gain
access to resources. For example:

def auth_callback(encrypted_nonce):
... send encrypted_nonce accross channel ...
... it gets decrypted on the other side, and sent back ...
return result

x = gimme_a_database_access_cookie("Personnel Files", auth_callback)

x.read_records(...)

Dan


[1] "Scheme 48 home page"
http://www-swiss.ai.mit.edu/~jar/s48.html
Richard Kelsey (kelsey@research.nj.nec.com)
Jonathan Rees (jar@ai.mit.edu)

[2] "The Python Programming Language"
Guido van Rossum
http://www.cwi.nl/~guido/Python.html

[3] "Proxy-based Authorization and Accounting"
B. Clifford Neuman
ftp://prospero.isi.edu/pub/aac/pbaa.ps.Z