Lately researchers that should properly be called "artificial intelligence" researchers have been trying to distance themselves from that name. The reasons for this are manifold. To some, it smacks of "Good Old Fashioned AI," (GOFAI) the early disastrous efforts to use naive brute force to program intelligent machines. To others, it is inconvenient because it seems like a fool's errand to try to create intelligence out of thin air (which is somewhat ironic since most of these people also work as teachers, though given the intelligence and attention span of many students, I might agree with that characterization of the educational process as a whole!). Some people find that describing their research as AI simply gets many people a bit hissy, particularly those in the humanities fields, who think that intelligence begins and ends with human analysis of literature. Still others have serious problems with the "artificial" part, as they find it demeaning to suggest that computer intelligence must be any less real than human intelligence. It seems that nobody likes the term, and lately you see a lot of people calling it many different things.
I use the term, and I do so unapologetically. There's nothing unfortunate or inaccurate about it. If we're trying to create intelligence, and we're not making it out of meat, it's artificial. Imagine a prosthetic specialist being offended because someone called his creation an artificial arm - ridiculous! Call a spade a spade, and stop whining about it.
Similarly, I don't care if humanists are offended by the idea or the name. If the practical goal is to achieve intelligence, let's not pretend at more specialized goals. Philosophers and English majors will never shut up about how inferior computers are until they are proven wrong, so don't even pay any attention to them.
As for the connection to GOFAI, yeah, it's unfortunate that the early efforts failed so miserably. But we need to accept that and move past it. Playing word games to pretend that we're too smart to ever have thought that those methods would be useful is disingenuous.
First, let's try to define the problem. Scratch that - let's call it The Problem, since this is the most important unsolved problem in human history, and its solution would have further reaching influence than anything we have ever done in the past.
The Problem (V0): to produce a computer program that exhibits true creative thought, learning, and understanding.
Unfortunately The Problem stated as in V0 does little to guide our efforts to a solution. Creative thought, learning, and understanding - say what? We don't have a concrete definition of any of those things - what does it mean to be creative, to learn, or to understand? At best we have a vague intuitive sense of what these things mean. Computers already manage each of them under some weak approximation. But none of the current abilities of computers are considered "intelligence," because every time we figure out a trick to ape one of those qualities, we immediately shift our goal. We lack a clear end point, so let's try again.
The Problem (V1): to produce a computer program that can solve as wide a range of problems as possible without external guidance.
This definition is closer to concrete, and certainly if it was fully realized we would have something that would qualify as intelligent. However, it sets the bar a bit too high - how many humans can solve all but a small number of problems without significant external help? Also, it misses the human ability to set our own goals, though that could be wedged in as one of the problems to solve, but this indicates that the definition is not clear enough.
The Problem (V2): to produce a computer program that can pass the Turing Test.
This is much better. It gives us a concrete end point, a threshold beyond which almost everyone (except probably the philosophers, but nobody cares what they have to say anyways) has to admit that there is some intelligence in the machine. I have no problem with this definition. However, the way the Turing Test is phrased seems to imply that this task is primarily, if not entirely, about parsing and responding to chunks of text. This is a severe understatement - what makes a human intelligent is much more than the ability to deduce meaning from language. We process the information, mainly, and the input/output of this information is but a trivial step in the process. AI is most certainly not about communication, it is about what happens behind the scenes; V2 offers a reasonable statement of the final goal, but no useful information about how to measure progress along the way.
Some have offered the following definition:
The Problem (V3): to produce a computer program that can optimally compress and decompress English (or any other human language) text.
The connection here is not immediately clear, but on reflection it actually makes sense - an optimal compression/decompression strategy actually can imply a "full understanding" in the statistical sense that you always know what the most likely next word will be given the past words. However, again, this focuses on thought as reflected in an output string. Furthermore, I would argue that even humans cannot optimally achieve compression (usually measured by the ability to guess the next letter/word in a string), and insofar as they can, this mostly reflects syntactical understanding as opposed to idea manipulation, which likely accounts for a miniscule percent of the compressability of the text stream. If we merely shoot to match the human ability to compress, we are most likely to end up with a system that figures out how to syntactically "understand" a little better than humans do (our level of syntax "understanding" is likely slightly suboptimal - when have humans been perfect at anything?), which will more than make up for the complete lack of idea manipulation in the "AI" engine. Compressibility would be a great metric if it worked as a teaching tool, but it is unlikely to be very effective. Let's try again.
The Problem (V4): to produce a computer program that can take a plain language description of a computer program and turn it into functional code.
This doesn't quite cut it either, but it's an interesting twist on the problem, and it is likely a subset of the functionality that true AI would have. There would definitely have to be some intelligence in a machine for it to be able to make all the design and implementation decisions inherent in taking a program spec and turning it into a program, though (unless you cheat) this doesn't quite capture the entire essence of AI.
The Problem (V5): to produce a computer program that does its best to predict everything ("understand").
This is an interesting one - is the ability to predict equivalent to intelligence or understanding? In some senses, probably, especially if at some level predictions are made of the internal state of the predictor. This is very related to definition V3, which assumes that optimal compression implies prediction. The difference is one of emphasis, and it may turn out to be a useful difference - in particular, we are no longer tied to a big-picture metric like compression ratio, as we can look closer at predictions that are correct and incorrect, perhaps using them to guide training. Unfortunately perfect prediction is not a realistic goal, so there is no clear end point. That said, this definition is better than some of the others, because at least we measure something other than success/failure.
Now that we've tried a few definitions of The Problem, each with drawbacks and strengths, I should admit that I have no perfect one. So, lacking a definition of The Problem, let's just dive right in and outline the steps to solving It. This is an outline of the approach I consider most feasible, which I could be completely wrong about. The cornerstone of this strategy is to transform the conceptual problem into a computational one that we can throw thousands of computers at, though most of the steps are extremely difficult and have not, to date, been even conceptually achieved.
It is worth mentioning that except for 4), each of these steps is tremendously difficult. Of them all, 2) is probably the toughest - how do you train something to achieve an unmeasurable task? Even if we go with V2, the Turing Test, how do we train a computer to "improve" at the Turing test without testing it millions and millions of times against human feedback? Even to train a simple XOR network, it often takes many thousands of training steps; surely a more subtle problem, even under a very optimized evolutionary/descriptive scheme, will take much longer. And we just don't have the time to give a full Turing Test to every crappy iteration of our AI. This is a general problem in AI training of any sort:
The Sub-Problem (Avoiding the Human Bottleneck): Once an algorithm/representation exists for training AI, figure out how to provide it with enough feedback and evaluation to achieve training without having to manually interact with it in real time.
Luckily, there is a vast wealth of human output available for inspection on the Internet. This will certainly be useful in training AI on language, parsing, and perhaps understanding, even if it doesn't allow us to train it on interactivity - it may be the case that a final period of direct human interaction is essentially unavoidable to produce something that will pass the Turing Test, though it would be ideal to push this off as far as possible into the process so that we can be sure we're "getting somewhere" with an approach before wasting years training it by hand.
If we leave 2) unsolved for the moment, we can begin work on the rest. We can certainly come up with lesser metrics to plug in as step 2) that, while not giving us AI, will surely do useful things and help us refine the other steps so that if we ever do think of the "right" way to approach the true step 2), we will have much of the groundwork laid. If we achieved this, we could then state The Problem as follows:
The Problem (V6): to produce a metric that gives a measure of the intelligence or "understanding" in a system without requiring human intervention or inspection (though it may use already-produced human data and writing).
Genetic algorithms suck. Really, really badly, they suck. Decades after the original idea to use evolution to build computer programs, we still have very little to show for it. If you look at any exposition on GAs, just about the most impressive thing that anyone has done with them is use them to come up with a decent probabilistic solution to the traveling salesman problem. Pathetic. Seriously.
So we should move on to something else, right? It's clear that evolution is no magic bullet, so time to find something new, no?
Nope. What it's time to do is forget everything we know about GAs, or rather, realize how little we know about them, and pick out the reasons that current efforts are so fatally flawed. Let me walk you through a naive "GA" implementation that's attempting to create "intelligent" behavior:
Sample implementation #1
Clearly there are severe oversimplifications of the problem at each step along the way. That's obvious, but it's also necessary - we have neither the computing power nor the development time to simulate the real world, so we are forced to simplify, and I have no problem with this, as long as it's done well. Unfortunately biology offers us little guidance here, since biologists know in detail how evolution works in the real world, but don't think too much about the minimum conditions necessary for it to be effective. We really have to think independently a bit about what a trimmed down evolutionary scenario like the above is missing.
To that end, I offer up the following obvious (yet often violated) first principle of applied evolution:
Rule #1: The Please Don't Neuter Principle: No part of your experimental setup should, by itself, make it impossible for your creatures to exhibit the behavior you are seeking.
Sample implementation #1 above fails this test in a few ways. The most obvious problem is in step 3 - a time independent feedforward network is just way too trivial to exhibit anything approaching intelligence. It's really just a static function in disguise; the fact that it's a neural network doesn't change this fact. The sample also likely fails this test during step 2, since the inputs provided to artificial life creatures are usually very limited; this really depends on the implementation, though.
Now, suppose we fix up our implementation to conform to rule #1. We've now got a more interesting neural network, it's time dependent, and we're providing good amounts of input. Definitely enough to show some "smart" behavior. We're still likely to fail, due to another principle:
Rule #2: The Principle of Least Resistance: If the behavior you wish to evolve is not the simplest way to achieve the training/survival goal you have set, you will never stably evolve that behavior without real-world scale - put another way, if nothing biases your algorithm in favor of complexity, your results will be overwhelmingly dominated by simplicity.
Goals like gathering food are easy to accomplish. Even reproduction is simple, and using it as a goal does not necessarily lead to a proliferation of complexity. Look at the real world - measured by number, the bacteria "beat" us humans by several orders of magnitude. They are far fitter for this world than we are, yet they are also far simpler. If we want to evolve human intelligence, we're going to have to find another way - reproduction and food gathering metrics only result in humans over the course of millennia, during which billions and billions of generations of bacteria and other life forms occur. It's really more of an accident that beings of our complexity can persist rather than an evolutionary imperative, and unless your algorithm specifically selects for complexity in some form, you're going to mostly end up with bacteria-like garbage, which survives very well by functioning very simply.
Luckily it's pretty easy to measure violations of rule #2: if your fitness level peaks at a fairly high level and your creatures are not getting any more complex or interesting, you've likely run afoul of this rule.
But what if the problem is that you're never achieving a decent level of fitness? You keep evolving your creatures, but they don't seem to be getting any better at the task. That brings us to...
Rule #3: The Principle of the Lazy Designer: Never pawn off onto evolution what you should be doing yourself.
The principle is that evolution is a lazy designer, so you do not have the luxury to be one as well! In the real world, it's fine that evolution had to fend for itself and evolve things like edge detection, eye tracking, and a Fourier transform (your ear performs this in hardware, essentially, before your brain ever "hears" anything) - it's had plenty of time and resources to do so. You, on the other hand, do not. You are working with a trimmed down version of the evolutionary process and the physical world, so a lot of the functionality your creatures may need will actually be impossible for them to evolve on their own given the limited computational resources available to them. Give them these things yourself and let them use them, dammit! If you're evolving bots to handle text of some sort, don't leave it to evolution to figure out how to parse ASCII strings out of binary, do it yourself!
Another closely related rule is the following:
Rule #4: The "Possible Is Nothing" Principle: That your experimental setup leaves the theoretical possibility of achieving a certain behavior does not imply that evolution is likely to find a way to implement that behavior.
Nobody uses assembly language to code GUI apps. It's possible, but it's just not feasible without an incredible expenditure of effort. Don't expect evolution to figure out how to pull of a similar task - tune the language it "speaks in" to the task at hand, and make sure it's powerful enough to do the job.
Of course, no matter how much you give your creatures up front in terms of language and features, you still need to make sure they can properly leverage these tools and reuse their discoveries.
Rule #5: The Principle of Recycling: Your scheme must allow for reuse of evolved high level constructs.
This is a serious issue that has rarely been addressed in actual GA implementations. When you write code in a programming language, you break your program up into functions so that every time you need to achieve the same result you don't have to re-type all the same code all over again. Without this ability, complex programs would just never be written. Similarly, it is an absolute must that your creatures be able to reuse pieces of genetic code in some fashion.
For a concrete example of this at work, look at the human brain versus the human genome. The genome contains [insert estimate] bits of information. The human brain contains [insert estimate] neurons, with about [insert estimate] connections apiece. Let's make a conservative estimate and say that the informational content of a connection is a mere four bits of information. You can do the math - there is absolutely no possible way that the brain (let alone the rest of the body) could be constructed from DNA if there was not some sort of information reuse (also known as compression) happening.
Yet most GA/neural net combinations treat the genetic code as a literal list of synapse weights, and "evolution" merely involves perturbing and swapping these weights from generation to generation. Is it any wonder that non-trivial behavior rarely results? The neocortex of the brain is a highly repetitive structure, repeating roughly the same pattern of connections millions and billions of times, and is responsible for a large part of human thought and understanding. Under a simplistic genetic code->network weight mapping, we would have to evolve the exact same structure millions of times, absolutely independently, in order to achieve the same effect. The odds of this happening are basically zero. This type of evolution is equivalent in terms of productivity to programming in a language without a for loop. It's just stupid. Don't do it, and if you do, don't be surprised when your results suck.
The specific way to satisfy this rule is less clear. Some guidance from biology would be immensely helpful, but it's tough to come by. It's unclear to me exactly how the DNA->body translation in humans achieves code reuse, but it is definitely the case that it's not just taking place during DNA copying. The construction process itself is likely able to refer to the same segment of DNA during the construction of different parts of the body. Unfortunately the construction of a human from DNA is an extremely convoluted process, nothing nearly as simple as the compilation of code into machine language, so we might have to be a little more creative in our efforts to achieve this ability. At the very least, something like a DNA pointer or reference is needed.
Taken together, I'm still not sure that these five rules are sufficient to guide the way towards evolving intelligent agents. There are still huge unsolved issues in AI, apart from the evolutionary algorithm used to train our creatures - what are we altering when we're training, how do we define fitness, and how do we do it all fast enough to see results? But these rules are an attempt at necessary conditions for successful use of genetic algorithms in the field of AI, and violations of them should not be taken lightly.
There are various ways to represent algorithms:
- Source code of a high level language Syntax laden at both high and low resolutions Combinations of many individually valid lines can be invalid Evolutionarily modular Not evolutionarily robust due to syntax - Byte code/machine code/other low level representation Mostly syntax free at high levels; combinations of low level ops are usually valid Dense Non-modular - Lookup list of ops/vars Mostly syntax free Bound check required Ends up looking very Lisp-like - Neural net Totally syntax free, just a string of numbers Very close to Turing complete - time dependent nets lose memory due to precision loss Simple to evolve, tend to have smooth gradients in fitness space Dense Non-modular in most cases - Connected net of modular units Not entirely syntax free, but mostly Combination of lookup list and neural net Programmable by hand Lisp-ish, but network based
We're going to look at this last case more closely, as for various reasons the other methods seem to be somewhat lacking as far as genetic programming is concerned. We would ideally like a representation that is syntax free, modular, easy to hand-code, and is robust evolutionarily (for a small change in code, normally the change in functionality should be small, though larger changes should be possible in relatively few steps; more importantly, though, most changes should "parse" without crashing the program - a minor change should not generally reduce functionality to zero). This method fits the bill, more or less.
This type of method is probably very familiar to anyone that has used Max MSP or (to some extent) Java Beans.
Basic unit: module, could be programmed by hand in high level language or made of sub-modules and bundled. - Naturally modular at static, prebuilt unit level. - Any subnet can be boxed up and presented as its own unit, simply by gathering up all inputs and outputs and storing away internals. - Mutant subnets are able to evolve from existing set of modules and connections - Can't allow prebuilt units to evolve since they are written in base language - must provide rich enough base set by hand - Easy to add useful functionality manually by adding new prebuilt modules - Not necessarily trivial to figure out which subnets would be individually useful - Potential strategy - package up randomly, figure out which ones are used most in successful mutants - When discarding non-useful packages, unpackage existing uses that appear in surviving population - Must carefully handle timing issues upon packaging/unpackaging so that behavior is identical under this step - How to deal with recursion? Possibilities: - Don't allow (at graph level) - Allow up to specified depth (maybe require default node to replace recursive one at that depth) - May be unnecessary to allow - recursive algorithms can be implemented using feedback, though this involves timing issues - Timing - brief look - External input/output polled each time step - Cached output model - always use output from previous node in previous time step to compute new output values - We desire that... - some units should be instantaneous (filters) - some units should delay output - some units should be instantaneous externally but use time dependence internally - some units should have tunable time dependence based on input - Design question: who controls timing? Is it externally set, or internal? - Timing makes modularization more complex if we allow different time scales, but for completeness we need different scales - What do we do about real-time dependence? For example, if an algorithm hangs in a while loop, when do we cut it off?
We need to give much more thoughtful consideration to these timing issues to come up with a workable plan. A good model to hold in your head is the time-dependent neural network model. The way this works is that for each time step, we take the input to each node as the output from the last time step of the previous node in the directed graph. This has the nice quality that it doesn't matter what order we iterate over the entire network, we always end up with the same result. Contrast this with a feedforward network, where we always start at the beginning and propagate towards the end, thus ending up with a single pass giving a result. Notice that if the minimal path between input and output is of length L in a recurrent (time dependent) network, it takes at least L units of time to recieve an output signal that is causally dependent on the input, and thus the output is always lagging (though predictive methods may be employed to successfully complete a task that involves mimicry if there is some predictability to the input). In a feedforward net, there is no such delay.
Though the human brain is by its nature bound by some lag (read up on "the 100 step rule"), an ideal network would incorporate both possibilities - we should never constrain ourselves to follow the design limitations of the brain, as much as we might like to take inspiration from its construction. For instance, we might want to set up a network to compare current values to old values of the input. But we may wish first to transform the input by a multiplicative factor. The multiplication needs to be done instantly, without delaying the input, otherwise we will be using an old value; but we NEED to allow some delay in the net in order to store the old value under the usual recurrent net model.
We might just let the multiplication be performed as a single step, and pre-apply it as a special case. But a nastier situation comes about if instead of multiplication we need another, more complicated algorithm that we've evolved separately, and we still need the transformation applied instantly. Say, for instance, that a particular unit uses an internal time dependence to figure out the answer to a tricky mathematical problem, but what we really want is just the answer, not the forced time-dependence of that answer. We must come up with a scheme whereby we can change the amount of time that a subnet takes to compute its result. And any such scheme should be nestable - we should be able to use time-dependent steps inside a time-independent filter, and vice-versa.
We can easily include an additional meta-unit in our specification, one that wraps a whole bunch of connected nodes and determines the time that they execute in relative to the outside tick rate - for instance, we might bundle a group of four serial units together and dictate that in one external tick, the group undergoes four internal ticks. Or we may even prefer that the group takes all four of its ticks in zero external time, thus turning it into an instantaneous filter on the input, its output available to the next node as soon as the previous node's output would have been without the filter. A comparison of these time flows is useful:
(nodes 1->4 are bundled, let's suppose each one multiplies by 2) 1) Unadorned In --> N1 --> N2 --> N3 --> N4 --> Out Tick 0: xIn 0 0 0 0 0 Tick 1: 0 2*xIn 0 0 0 0 Tick 2: 0 0 4*xIn 0 0 0 Tick 3: 0 0 0 8*xIn 0 0 Tick 4: 0 0 0 0 16*xIn 0 Tick 5: 0 0 0 0 0 16*xIn 2) Bundled, 4:1 ratio [---------4:1----------] In --> N1 --> N2 --> N3 --> N4 --> Out Tick 0: xIn n/a n/a n/a 0 0 Tick 1: 0 n/a n/a n/a 16*xIn 0 Tick 2: 0 n/a n/a n/a 0 16*xIn 3) Bundled, 4:0 "ratio" (instantaneous) [---------4:0----------] In --> N1 --> N2 --> N3 --> N4 --> Out Tick 0: xIn n/a n/a n/a 16*xIn 0 Tick 1: 0 n/a n/a n/a 0 16*xIn 4) Reference scenario without N-nodes In --> Out Tick 0: xIn 0 Tick 1: 0 xIn
We see that 3) is just like 4) except that it transforms the value. Notice that in example 3, the value that N4 outputs is dependent upon the value at In in the same time-step - this requires special handling, since it means that we must have the current value for xIn before we start evaluating the bundle. This means that non-instantaneous stepping must happen first, so that the results are available for the instantaneous nodes. Furthermore, it means that instantaneous bundles that connect to each other must be evaluated in the appropriate order, which means that islands of connected instantaneous bundles should be sorted into layers, and each layer computed at the same time. This could be a problem in a situation with instantaneous loops, for instance if another set of nodes made In instantaneously dependent on the value of N4. In practice, if we encounter a loop during the sort, we could continue propagating through the loop as we count the layers, which means that a particular instantaneous node would be evaluated on several different layers; this would lead to infinite recursion if we did not have some way of cutting it off. There is no natural way to cut off this process, so we must hesitantly admit a global MAX_RECURSION_DEPTH variable that determines the maximum number of loop traversals we allow. We might prefer not to allow recursion at all, which would amount to setting MAX_RECURSION_DEPTH to zero (thus causing only the first node, that branches off into the loop, to be evaluated more than once - a counter on branch nodes is probably the cleanest way to implement MAX_RECURSION_DEPTH), but in some cases we might imagine an open ended recursion to be useful, say, to iteratively refine an estimate of a square root via Newton's method or something like that.
We may now look into the internal time-steps taken when we have bundled nodes, as there are some ambiguities there, too. In example 2, what should the input to N1 be at each of the internal steps? In this case it does not matter, but in others it will. We choose to present the internal nodes with the incoming input for the larger time step at each of the sub-steps, so in this case, N1 receives xIn at each of the sub-steps. We choose to do this because if it is not the desired behavior, it is simple enough to filter out any input after the first time step.
As far as output, we take the output of the bundle to be the output of the exit node(s) of the bundle at the last timestep. Again, if any other behavior is desired, such as averaging, it is simple enough to bake it in to the bundled nodes, whereas to go the other way around would be trickier.
To avoid problems of overlapping bundles, we insist that once a group is bundled and given a ratio, it appears to the rest of the network as a single unit. Notice that we MAY allow strict inclusions of bundles without any problems - it is only overlapping bundles that lead to ambiguity. We need to make some choices if bundles interconnect, however - say a 2:1 bundle and a 4:1 bundle have some connections between them. As the 2:1 bundle undergoes its internal timesteps, does the input from the 4:1 bundle change? In my opinion, it should not. The entire reason for inventing a time stepping scheme like this is to allow modular pieces to decouple their own time scale from that of the external simulation, thus avoiding concurrency issues like the one just mentioned. If two bundles are dependent on each others internals, then modularity is already broken, and they should be bundled together under the finer resolution (in this case, 4:1) rather than trying to interact mid-step. By making this choice, each of our "bundles" can truly be its own independent and nested net - hence the title of this section.
If we think that a network-based approach is likely to be useful in the AI problem (either the natural language or text compression statement of "the problem," as discussed in a previous post), we have to figure out the right way to present the input to the network. This is a somewhat tricky problem, and there is no immediate answer. The classic time independent neural network approach is too limited - there we would be tempted to present the entire input stream at one time, encoded as numbers, run it through all the connections and instantly get an output stream back which we decode similarly. The problem is that this does not scale at all; the net must be sized appropriately to recieve the right amount of input, and I know of no useful algorithms that could possibly dynamically resize a feed-forward net (how would you decide what the new weights should be when you add a node, without retraining the net?).
We might instead choose to deal with an extended network scheme that can accept strings and manipulate them as fundamental units. This is, in fact, a very useful idea; in a previous post I discussed in some detail what can be gained by generalizing our usage of the network paradigm to include much more than mere weighted sums of inputs, as well as the pitfalls that must be handled (with extreme care!). At a fundamental level, presenting such a network with an input string (or some complex object in general) is exactly what we'd like to do.
But it's naive to stop there. Consider what happens when a human comes across something like an encyclopedia. Does he read the entire thing at one pass, THEN go back and parse everything that was said and try to understand the thing as a whole? Such a task is ludicrous, and this is essentially what we'd be doing in passing the entire encyclopedia as a string to our network. Humans break things up into chunks and digest them individually, storing away details from the prior chunks to relate to the next ones, and so on. Yes, the encyclopedia exists as a string, but we never have to access it all at once. We suspect that it may be better to handle the breaking off of edible pieces by ourselves before the net ever sees the entire string, so that it doesn't have to evolve its own routines to break up the big string and do all the relevant bookkeeping.
That said, if we merely present a letter by letter stream, one time step at a time, to the net, we're committing the opposite error. Human text parsing definitely occurs at greater than the letter level, but less than the passage level. So we will likely be best off if we pass our network something in that range (My guess? Seven to ten words should do it, probably - I don't think most humans read in chunks greater than seven, so this should be plenty; more complicated grammatical structures than seven words are likely parsed from short term memory rather than immediately on sight, anyhow).
There is a lot of room for debate here. To some extent, as long as the time-dependent network has memory, it can always scale up - for example, even if we did pass the network individual characters, by storing these away it could theoretically reconstruct the words, and thereby the sentences, and eventually entire paragraphs and books (with enough memory). Similarly, you can always back down from a greater unit by decomposing it through processing, i.e. break down a sentence into words, a word into letters, a letter into strokes, etc. But the big problem in AI is most definitely NOT to merely find a representation that is capable of achieving what you want it to do - assembly language surely fits that bill - it is to find a representation that is efficient at figuring out how to do it. And every level-jump that needs to be implemented by the network itself is a severe hit against this metric, similar to adding a layer of interpretation on top of an existing programming language. NAND gates are blazing, assembly language is fast, C is slower, and a simulation of the next Pentium chip running in a transistor simulator program is going to be even slower, even though the functionality will be identical across all of these things.
So there are a couple of reasonable options for dealing with this presentation of text. First, we could arbitrarily choose a unit size and present units to the network in order, for instance we could break our text up into five word chunks and feed them in one after another. Alternatively, we could adopt a sliding window approach wherein the network sees, for example, seven words at a time, and this window slides over the entire input stream as time goes by. This way you never get stuck in the unpleasant (and inefficient) situation of having to store in memory a word that is only one word away from the word you're currently looking at, which would happen quite often in the first example - you can always chunk small phrases if you move the window to the right place (analagous to moving your eyes). This second method is a lot closer to how humans actually read text, seeing phrases in chunks instead of as strings of individual words. This approach to text presentation is akin to hard-coding the fact or pattern that human-generated text is generated by humans, who tend to both think and read in a fairly local way.
I prefer the second approach, but having decided upon that, we still have problems. Even if we presume that our core language engine successfully takes a valid and grammatical stream of words and figures out what to do with it, we still need to get that valid and grammatical stream. i mean;, yuo kan undrstand thsi snetnece even tho ughitsnot crrect atall. Say'm whit tis won. There's clearly a large amount of error detection going on before we ever get to the logic level (at quite a speed hit, too - I'm sure it took you a few extra seconds to get through those sentences than it does to get through most, and I'm guessing that the second one took longer than the first despite its shorter length; something to think about as we move forward!), some of it to correct mere spelling errors without context dependence, some of it involving much deeper connections - notice that a word-level parsing of "Say'm whit tis won" would likely register real words for almost each token, but most of them happen to be the wrong words. Syntax and sound clearly play a role here. Perhaps this is too much to expect first-gen AI to handle, but the point remains that there must be backwards paths from grammar to parsing along with the forwards ones from parsing to grammar - the two do not happen independently, and even if they are modular, there is feedback, where the grammar guides the parse and vice versa until they both agree that the sentence is "handled." Any sort of robust language engine must be tolerant of slightly erroneous spellings or grammars, and this flexibility is a necessary element of language that will pervade the construction of the actual guts of the processor as well.
Some of these things can be partially handled with similarity graphs. For example, "the" is similar to "then" textually, similar to "duh" acoustically, and related to "a" by meaning, so these might all have labelled connections that could be set during some preprocessing phase, and the exploited during parsing. This gets real laborious as the acoustic and meaning bits require external intervention to define. However, we must resign ourselves to having to put something in by hand; ideally we'd like to avoid constructing meaning maps by hand (this is a little too GOFAI for this day and age), but acoustic information is necessary because pronunciation is too littered with special cases to boil down to automated rules (though some general rules will likely cut down the amount of work we need to do).
The way I envision a system like this working (assuming we have something that can handle well-formed and correctly spelled sentences, that is) is that we attach a preprocessing module to the "guts," with a connection leading back telling the preprocessor how satisfied the guts are with the result, along with perhaps some feedback and instructions. For instance, if the guts get sent the tokens "Say whit tis won," they might spit back that this doesn't make sense grammatically, and the preprocesser should take another stab at it. The preprocessor might then send back another close attempt, "Say, Whit, it is won," which the guts will accept for the moment, until they get to the next line and realize that it doesn't make sense in the greater scheme of things at the meaning level. So they send a request to back up and try again, going on like this as the preprocessor tries all its various manipulations to try and bring the sentence into a form that the guts will accept. Eventually the preprocessor may try the various "sounds like" transforms, and the guts will finally be happy with "Say, I'm with this one," which would again be likely rejected due to context, and eventually "Same with this one." I would reckon that this is very similar to the thought process in your head as you struggled to make sense out of the sentence.
"Why," you might ask, "don't we just consider the preprocessor to be a normal part of the network, though, and let it evolve as we train the network?" In my opinion, it's better if we do it ourselves because parsing a string to produce a candidate list of word lists is a fairly easy (by which I mean long and probably difficult, but straightforward) task that we can accomplish mostly by hand, we can test it separately, and we can enumerate all the various tricks without too much difficulty envn sdo to frk (there really aren't too many possible word strings, and if things get too messy then even humans just give up trying to figure it out...generally if that happens the phrase or sentence is "put on the back burner," and they keep reading in hopes that they will find some other clarifying evidence, though it tends to leave a nasty feeling unless it is resolved, for instance by the author telling them that "envn sdo to frk" means absolutely nothing and is just a random string of letters thrown in to make the point that you'll only go through a few different parsing possibilities before trying to push ahead without worrying about the phrase). Compare this to the problem of assigning meaning, which is a much more difficult problem (unsolved, even) that we don't even know where to start on - it is these difficult tasks that we should sic evolution on, not the easy ones. This should be a general theme in our construction, to factor out any simple tasks that we KNOW will be necessary, and try to use those results to provide evolution with shorter paths towards the more difficult behaviours, hoping that our work provided enough raw material for evolution to meander its way towards a solution that we couldn't pick out because the details are too complex (or because there's no good way of distilling the method into something we can intuit even if it's not very complicated). This is the key when it comes to modularity, and is one reason I believe the usual neural network structures are insufficient by themselves - if we expect that audio data coming in to a neural network will be more useful in the frequency domain, it is much more effective to FFT it before training the network rather than to rely on the network to accomplish this difficult task itself. Even if the network did finally figure out the algorithm, it likely would not get one that's as efficient and robust as the one that's in use in millions of systems already.
Not to say that we shouldn't leave open the possibility of letting the preprocessor evolve, too. If we write the preprocessor module in the same language that the rest of the net speaks, we could certainly let it change while we evolve. But we should at least start it in a place that we think will be useful.
Here are some good external resources for the interested: