This is the follow-up to Dan's previous articles, "Virtual Worlds,
Virtual Robots, and AI: beyond gaming and social spaces" and "The
future of Virtual Reality is not server centric".
If everyone is running their own physics, how do we resolve
conflicts? Who is authoritative with regard to position and
orientation of objects? The answer may surprise you, but it makes a
lot of sense. Everyone is responsible for their own behavior, and
everyone is authoritative with regard to their own state. But
first, we have to talk a little about time.
A favorite band of mine from the '70s, Chicago, had a hit song
entitled "Does anybody really know what time it is?" The song
continues with the lyric, "Does anybody really care?" I think
perhaps this song was written with a precognitive understanding of
the Second Life event protocol. In the SL world, events happen
"now", meaning whenever you get the message. This ends up working
out to some degree by virtue of two facts: 1) in Second Life, all
activity is mediated by the server, and 2) ping times tend to be
somewhat stable between machines on the 'net. Since the server is
the only source of behavior, and the ping times are relatively
stable, everyone tends to get the same events at a predictable rate
of delay. (Note however that this situation totally breaks down at
the boundary between two sims that have substantial lag between
them. Linden runs all the servers on a grid at a co-located
facility, but the OpenSim grids and hypergrid are distributed
across the world.)

If we want to move to a model where behavior is computed on
different machines, we have to up our game a bit in terms of
sophistication when it comes to time. If Alice and Bob are moving
about on Charlie's server, and there is a 50 millisecond lag
between Alice and Charlie (as well as between Bob and Charlie),
then Alice sees Bob's avatar delayed by about 100 milliseconds,
and vice versa. A problem then arises if Alice and Bob might bump
into each other. One can imagine a scenario where, in Alice's
world, she just misses being bumped into by Bob; but in Bob's
world, Bob and Alice collide. How do we manage this
discrepancy?
It's actually not all that difficult. If we have a good
knowledge of our network, and know the average ping time as well as
the standard deviation, we can devise a buffering scheme that
smooths over time lags. In this example, the buffer time might be
set to 125 ms, which is a bit longer than the expected end-to-end
packet delay of about 100 ms. When Alice presses a key to move her
avatar forward,
an event message is broadcast, with a time stamp set 125
milliseconds in the future. Alice's avatar itself doesn't respond
immediately; instead, it waits for the event to become timely, and
after 125 milliseconds, the avatar begins to move (note this is not
much worse than the SL case, with a varying delay of around 100 ms
when the network is responsive). Assuming it takes 100 ms for this
message to get to Bob, there is still 25 ms left until Bob also
starts moving Alice's avatar in his version of reality. If
everything goes according to plan, both Alice and Bob will see
Alice's avatar move at the same time, and it will therefore be in
the same place according to both client machines. If Bob is moving
as well, Alice and Bob will agree precisely on where each avatar is
at every moment, and their calculation of affairs, including
potential collisions, will be synchronized.
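
The scheme above can be sketched as a small event buffer. This is
only an illustration, not an implementation from any existing
viewer: it assumes Alice and Bob share a synchronized clock, and the
names Event, EventBuffer, and make_event are hypothetical.

```python
import heapq
from dataclasses import dataclass, field

BUFFER_MS = 125  # buffer delay taken from the example in the text


@dataclass(order=True)
class Event:
    due_ms: int                          # shared-clock time at which the event applies
    sender: str = field(compare=False)
    action: str = field(compare=False)


class EventBuffer:
    """Holds events until their timestamp becomes timely."""

    def __init__(self):
        self._heap = []

    def push(self, event):
        heapq.heappush(self._heap, event)

    def pop_due(self, now_ms):
        """Return every event stamped at or before now_ms, earliest first."""
        due = []
        while self._heap and self._heap[0].due_ms <= now_ms:
            due.append(heapq.heappop(self._heap))
        return due


def make_event(sender, action, now_ms, buffer_ms=BUFFER_MS):
    """Stamp the event buffer_ms in the future instead of applying it now."""
    return Event(due_ms=now_ms + buffer_ms, sender=sender, action=action)


# Alice presses a key at t = 1000 ms. Her own buffer receives the event at
# once; Bob's receives it roughly 100 ms later. Neither applies it before
# t = 1125 ms, so both start moving her avatar at the same moment.
alice_buffer, bob_buffer = EventBuffer(), EventBuffer()
key_press = make_event("alice", "move_forward", now_ms=1000)
alice_buffer.push(key_press)
bob_buffer.push(key_press)
```

Each client would call pop_due once per frame with the current
shared-clock time and apply whatever comes back. An event that
arrives after its timestamp has already passed is the case where the
two views diverge and a later correction is needed.
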
One big advantage to such an arrangement is how it responds to
times when the network may be overloaded, or responding poorly for
some reason. In Second Life, if the network goes down, you are
stuck in one place (surreally, you can look around and your camera
will follow, but you cannot move). In this new model, you can
continue to walk around, smell the virtual flowers, and do anything
you want to do in the environment that doesn't depend on network
communication. The only problem is, you will not get accurate
update information for other avatars or objects that you don't
control -- and they won't get the right information from you. Once
the network recovers, everyone will get update messages, and the
problem can be corrected (just as it happens in SL today). The
difference is, you didn't spend that time in a virtual cage. It's
worth pointing out that this model, where messages are buffered and
time-stamped, is basically what we use today to implement streaming
protocols. In particular, VoIP and teleconferencing apps have used
this technique for many years.
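
Streaming clients commonly size their playout buffer from measured
delay and its variation, and the buffer time discussed earlier could
be derived the same way from the average delay and its standard
deviation. A minimal sketch: the rule "mean plus k standard
deviations", the default k = 2, and the function name are
illustrative assumptions, not something this article specifies.

```python
import statistics


def buffer_delay_ms(delays_ms, k=2.0):
    """Pick a buffer delay that covers most observed packet delays.

    delays_ms: measured end-to-end delays in milliseconds.
    k: how many standard deviations of headroom to add (assumed value).
    """
    mean = statistics.mean(delays_ms)
    spread = statistics.pstdev(delays_ms)
    return mean + k * spread


# Hypothetical measurements clustered around the 100 ms from the example.
samples = [90, 100, 110]
delay = buffer_delay_ms(samples)
```

A larger k makes late events rarer at the cost of a more sluggish
response to the user's own input, which is the trade-off any such
buffer has to make.
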
There are many details to be worked out, but this is the rough
outline of a set of capabilities that I think would go a long way
towards increasing the scalability and robustness of the Metaverse.
Just as the REST concept revolutionized thinking about the
architecture of the web, I think we need some sort of conceptual
foundation for real-time, immersive worlds that takes account of
the stochastic nature of network connections, and doesn't simply
fail to scale when more than a few dozen avatars decide to show up
to an event. I'm not claiming that these ideas are magic bullets,
but I think they offer some perspective on why we choose to do
things in certain ways, and on how those choices are likely to
impact the user experience as the system scales.
One element of this idea I'd like to examine a bit more closely
is the relationship between physics, scripts, and animation. All
three of these phenomena have the ability to direct the behavior of
elements in the world. However, due mostly (in my opinion) to
historical reasons, they present a very non-orthogonal set of
capabilities that are distributed among various elements in
somewhat arbitrary fashion. I would make the potentially
inflammatory proposal that they should be integrated into a single
concept, which I referred to previously as the "behavior server".
Whether I want to use pre-sequenced animation, rigid body
simulation, or some custom logic coded in a script, in all three
cases my goal is to control the behavior of an element in the
world. In my proposed scenario, all such behavior is invoked on the
machine that owns the object in question. In the case of an avatar
running a typical Second Life-style animated walk sequence, my
"behavior server" would (locally) invoke a BVH (avatar animation)
player function, apply it to my avatar's skeleton, and broadcast
the resulting joint angles using a buffered stream of update
messages as previously explained. Rather than have the other
clients need to load animation assets, they will get a sequence of
avatar motions that is exactly what my animation behavior server
outputs. In the case where I want to use either physics (i.e.,
"ragdoll") or other custom logic to control my avatar, the other
clients simply operate as before -- they faithfully replay whatever
position, orientation, and joint angle commands they receive from
me.
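
A rough sketch of that division of labor follows. Everything in it
is hypothetical: the message layout, the joint names, and the two
stand-in behaviors are assumptions made for illustration, and a real
animation player or physics engine would replace the toy functions.

```python
from typing import Callable, Dict

Pose = Dict[str, float]             # joint name -> angle in degrees (assumed format)
Behavior = Callable[[int], Pose]    # maps a timestamp in ms to a pose


def canned_walk(t_ms: int) -> Pose:
    """Stand-in for a pre-sequenced animation player."""
    swing = 30.0 if (t_ms // 500) % 2 == 0 else -30.0
    return {"left_hip": swing, "right_hip": -swing}


def ragdoll(t_ms: int) -> Pose:
    """Stand-in for a physics-driven behavior."""
    return {"left_hip": 0.0, "right_hip": 0.0}


class BehaviorServer:
    """Runs on the owner's machine and emits pose update messages."""

    def __init__(self, owner: str, behavior: Behavior):
        self.owner = owner
        self.behavior = behavior

    def update(self, t_ms: int) -> dict:
        return {"owner": self.owner, "due_ms": t_ms, "pose": self.behavior(t_ms)}


class ReplayClient:
    """Knows nothing about how a pose was produced; it only applies messages."""

    def __init__(self):
        self.poses: Dict[str, Pose] = {}

    def apply(self, message: dict) -> None:
        self.poses[message["owner"]] = message["pose"]


server = BehaviorServer("alice", canned_walk)
viewer = ReplayClient()
viewer.apply(server.update(0))      # viewer shows the animated walk
server.behavior = ragdoll           # owner switches to physics
viewer.apply(server.update(600))    # viewer code is unchanged
```

The point of the sketch is that ReplayClient never changes when the
owner swaps one behavior source for another.
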
In this way, we decouple the underlying principle by which we
generate behavior -- animation vs. physics vs. some other logic --
from the job of transmitting and reproducing that behavior. What is
interesting about this idea from a scalability and robustness angle
is this: if some bright hacker codes up a new way of controlling
the avatar, they can put that capability into a custom client.
Anyone who wants to exhibit that behavior needs the new client;
however, *everyone*, even those with old clients, can *see* the new
behavior. This applies to techniques like inverse kinematics, more
realistic physics, new scripting languages, or any other possible
innovation. By reducing the client's playback capability to "do
what the messages tell you to do", we enable a Metaverse where many
different ways of doing things can safely exist simultaneously.
And now for my final act of blasphemy. Everything I just said
about avatars should also be possible for "prims". I'll go one step
further: from the display perspective, the client doesn't need to
know the difference between prims and avatars. There should be a
generic scene element, which I refer to as an "existent". This
element can be anything from a
simple cube to a furry alien with compound eyes and wheels instead
of legs. The fact that some of these existents are avatars should
be a feature, not a fundamental distinction. The job of the client
(as far as viewing the scene; there are other client
responsibilities such as inventory management and so on) is simply
to display these existents properly, updating their state according
to buffered messages received from their respective owners
(mediated through the server for now -- though there are p2p
ramifications here that I won't go into). The idea is to factor out
the semantic capabilities (is a human, can buy things, can create
prims, can terraform etc) from the visual presentation (is a cube,
is a skeleton with deformable mesh, has textures abc and xyz,
etc).
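
One way to picture this factoring is a single record type whose
visual fields are all the viewer ever reads, with the semantic
capabilities carried alongside as plain data. This is a sketch under
assumed names: the Existent fields, the capability strings, and
apply_update are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple


@dataclass
class Existent:
    owner: str
    # Visual presentation: what the viewer needs in order to draw it.
    shape: str
    position: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    joint_angles: Dict[str, float] = field(default_factory=dict)
    # Semantic capabilities: ignored by the viewer when drawing.
    capabilities: FrozenSet[str] = frozenset()


def apply_update(existent: Existent, update: dict) -> None:
    """Apply a state update from the owner; identical for prims and avatars."""
    existent.position = update.get("position", existent.position)
    existent.joint_angles.update(update.get("joint_angles", {}))


cube = Existent(owner="bob", shape="cube")
avatar = Existent(
    owner="alice",
    shape="skinned_mesh",
    capabilities=frozenset({"is_human", "can_buy"}),
)

# The viewer runs the same code path for both kinds of existent.
for existent in (cube, avatar):
    apply_update(existent, {"position": (1.0, 2.0, 3.0)})
```

Being an avatar is then just a matter of which capabilities are
attached, rather than a separate class the viewer has to know about.
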
There are many ramifications of these ideas, and of course it's
all just mental gymnastics until someone (presumably I) takes the
time to actually try to implement some of this stuff. All sorts of
questions come to mind -- how do you invoke synchronized
animations, such as Second Life's poseballs? -- but personally, I
like challenging questions almost as much as I like unexpected
answers.