
What Colour are your bits?

Thu 10 Jun 2004 by mskala

There's a classic adventure game called Paranoia which is set in an extremely repressive Utopian futuristic world run by The Computer, who is Your Friend.  Looking at a recent LawMeme posting and related discussion, it occurred to me that the concept of colour-coded security clearances in Paranoia provides a good metaphor for a lot of copyright and intellectual freedom issues, and it may illuminate why we sometimes have difficulty communicating and understanding the ideologies in these areas.

An article based on this one and its follow-ups, by me, Brett Bonfield, and Mary Fran Torpey, appeared in the 15 February 2008 issue of LJ, Library Journal.

In Paranoia, everything has a colour-coded security level (from Infrared up to Ultraviolet) and everybody has a clearance on the same scale.  You are not allowed to touch, or have any dealings with, anything that exceeds your clearance.  If you're a Red Troubleshooter, you're not allowed to walk through an Orange door.  Formally, you're not really supposed to even know about the existence of anything above your clearance.  Anyone who breaks the rules is a Commie Mutant Traitor, subject to the death penalty.

Much of the game revolves around the consequences of the security levels.  For instance, Friend Computer might assign a team of Red Troubleshooters to re-paint a hallway that ought to be Orange but was painted Yellow by mistake, or rather by the Commie Mutant Traitors.  It's quite likely in such a case that the Troubleshooters will all end up shooting each other for treason against Friend Computer, since none of them are allowed to touch the paint, go near the hallway, or talk about their mission, and they're all charged with enforcing the rules on one another.

In intellectual property and some other fields we're very interested in information, data, artistic works, a whole lot of things that I'll summarize with the term "bits".  Bits are all the things you can (at least in principle) represent with binary ones and zeroes.  And very much of intellectual property law comes down to rules regarding intangible attributes of bits - Who created the bits?  Where did they come from?  Where are they going?  Are they copies of other bits?  Those questions are perhaps answerable by "metadata", but metadata suggests to me additional bits attached to the bits in question, and I'd like to emphasize that I'm talking here about something that is not properly captured by bits at all and actually cannot be, ever.  Let's call it "Colour", because it turns out to behave a lot like the colour-coded security clearances of the Paranoia universe.

Bits do not naturally have Colour.  Colour, in this sense, is not part of the natural universe.  Most importantly, you cannot look at bits and observe what Colour they are.  I encountered an amusing example of bit Colour recently:  one of my friends was talking about how he'd performed John Cage's famous silent musical composition 4'33" for MP3.  Okay, we said, (paraphrasing the conversation here) so you took an appropriate-sized file of zeroes out of /dev/zero and compressed that with an MP3 compressor?  No, no, he said.  If I did that, it wouldn't really be 4'33" because to perform the composition, you have to make the silence in a certain way, according to the rules laid down by the composer.  It's not just four minutes and thirty-three seconds of any old silence.

My friend had gone through an elaborate process that basically amounted to performing some other piece of music four minutes and thirty-three seconds long, with a software synthesizer and the volume set to zero.  The result was an appropriate-sized file of zeroes - which he compressed with an MP3 compressor.  The MP3 file was bit-for-bit identical to one that would have been produced by compressing /dev/zero...  but this file was (he claimed) legitimately a recording of 4'33" and the other one wouldn't have been.  The difference was the Colour of the bits.  He was asserting that the bits in his copy of 433.mp3 had a different Colour from those in a copy of 433.mp3 I might make by means of the /dev/zero procedure, even though the two files would contain exactly the same bits.

Now, the preceding paragraph is basically nonsense to computer scientists or anyone with a mathematical background.  (My friend is one; he'd done this as a sort of elaborate joke.) Numbers are numbers, right?  If I add 39 plus 3 and get 42, and you do the same thing, there is no way that "my" 42 can be said to be different from "your" 42.  Given two bit-for-bit identical MP3 files, there is no meaningful (to a computer scientist) way to say that one is a recording of the Cage composition and the other one isn't.  There would be no way to test one of the files and see which one it was, because they are actually the same file.  Having identical bits means by definition that there can be no difference.  Bits don't have Colour; computer scientists, like computers, are Colour-blind.  That is not a mistake or deficiency on our part:  rather, we have worked hard to become so.  Colour-blindness on the part of computer scientists helps us understand the fact that computers are also Colour-blind, and we need to be intimately familiar with that fact in order to do our jobs.
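For the Colour-blind among us, here is a minimal Python sketch of the 4'33" story.  It is only an illustration under stated assumptions:  raw zero bytes stand in for the audio, zlib stands in for the MP3 compressor, the sample rate is made up, and both function names are hypothetical.  Nothing here is what my friend actually ran.

```python
import hashlib
import zlib

SAMPLE_RATE = 1000           # assumed, chosen only to keep the example small
DURATION_S = 4 * 60 + 33     # four minutes and thirty-three seconds


def silence_from_zeroes() -> bytes:
    """The /dev/zero route: just ask for the right number of zero bytes."""
    return bytes(SAMPLE_RATE * DURATION_S)


def silence_from_performance() -> bytes:
    """The 'performance' route: play some notes with the volume set to zero."""
    volume = 0
    notes = (60 + (i % 12) for i in range(SAMPLE_RATE * DURATION_S))
    return bytes((note * volume) & 0xFF for note in notes)


# zlib is a stand-in for the MP3 compressor in the story.
file_a = zlib.compress(silence_from_zeroes())
file_b = zlib.compress(silence_from_performance())

# Any function of the bits (equality, a hash, anything) agrees on both files.
same_bits = file_a == file_b
same_hash = hashlib.sha256(file_a).digest() == hashlib.sha256(file_b).digest()
print(same_bits, same_hash)
```

Whatever test you write in place of the last few lines, it receives the same input in both cases, so it cannot tell the performance from the /dev/zero copy.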

The trouble is, human beings are not in general Colour-blind.  The law is not Colour-blind.  It makes a difference not only what bits you have, but where they came from.  There's a very interesting Web page illustrating the Coloured nature of bits in law on the US Naval Observatory Web site.  They provide information on that site about when the Sun rises and sets and so on...  but they also provide it under a disclaimer saying that this information is not suitable for use in court.  If you need to know when the Sun rose or set for use in a court case, then you need an expert witness - because you don't actually just need the bits that say when the Sun rose.  You need those bits to be Coloured with the Colour that allows them to be admissible in court, and the USNO doesn't provide that.  It's not just a question of accuracy - we all know perfectly well that the USNO's numbers are good.  It's a question of where the numbers came from.  It makes perfect sense to a lawyer that where the information came from is important, in fact maybe more important than the information itself.  The law sees Colour.

Suppose you publish an article that happens to contain a sentence identical to one from this article, like "The law sees Colour." That's just four words, all of them common, and it might well occur by random chance.  Maybe you were thinking about similar ideas to mine and happened to put the words together in a similar way.  If so, fine.  But maybe you wrote "your" article by cutting and pasting from "mine" - in that case, the words have the Colour that obligates you to follow quotation procedures and worry about "derivative work" status under copyright law and so on.  Exactly the same words - represented on a computer by the same bits - can vary in Colour and have differing consequences.  When you use those words without quotation marks, either you're an author or a plagiarist depending on where you got them, even though they are the same words.  It matters where the bits came from.

I think Colour is what the designers of Monolith are trying to challenge, although I'm afraid I think their understanding of the issues is superficial on both the legal and computer-science sides.  The idea of Monolith is that it will mathematically combine two files with the exclusive-or operation.  You take a file to which someone claims copyright, mix it up with a public file, and then the result, which is mixed-up garbage supposedly containing no information, is supposedly free of copyright claims even though someone else can later undo the mixing operation and produce a copy of the copyright-encumbered file you started with.  Oh, happy day!  The lawyers will just have to all go away now, because we've demonstrated the absurdity of intellectual property!
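The mixing step itself is tiny.  The following is a sketch of the exclusive-or idea as described above, not Monolith's actual code or file format; the helper name xor_bytes and the sample data are assumptions made for illustration.

```python
import os


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


copyrighted = b"The law sees Colour."      # stand-in for the encumbered file
public = os.urandom(len(copyrighted))      # stand-in for the public file

mixed = xor_bytes(copyrighted, public)     # the "garbage" that gets distributed
recovered = xor_bytes(mixed, public)       # anyone holding both files can undo it

print(recovered == copyrighted)
```

XOR is its own inverse, so the same function does both the mixing and the unmixing.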

The fallacy of Monolith is that it's playing fast and loose with Colour, attempting to use legal rules one moment and math rules another moment as convenient.  When you have a copyrighted file at the start, that file clearly has the "covered by copyright" Colour, and you're not cleared for it, Citizen.  When it's scrambled by Monolith, the claim is that the resulting file has no Colour - how could it have the copyright Colour?  It's just random bits!  Then when it's descrambled, it still can't have the copyright Colour because it came from public inputs.  The problem is that there are two conflicting sets of rules there.  Under the lawyer's rules, Colour is not a mathematical function of the bits that you can determine by examining the bits.  It matters where the bits came from. The scrambled file still has the copyright Colour because it came from the copyrighted input file.  It doesn't matter that it looks like, or maybe even is bit-for-bit identical with, some other file that you could get from a random number generator.  It happens that you didn't get it from a random number generator.  You got it from copyrighted material; it is copyrighted.  The randomly-generated file, even if bit-for-bit identical, would have a different Colour.  The Colour inherits through all scrambling and descrambling operations and you're distributing a copyrighted work, you Commie Mutant Traitor.

To a computer scientist, on the other hand, bits are bits are bits and it is absolutely fundamental that two identical chunks of bits cannot be distinguished.  Colour does not exist.  I've seen computer people claim (indeed, one did this to me just today in the very discussion that inspired this posting) that copyright law inescapably leads to nonsense conclusions like "If I own copyright on one thing, and copyright inherits through XOR, then I own copyright on everything because everything can be obtained from my one thing by XORing it with the right file." That sounds profound only if you're a Colour-blind computer scientist; it would be boring nonsense to a lawyer because lawyers are trained to believe in and use Colour, and it's obvious to a lawyer that the Colour doesn't magically bleed to the entire universe through the hypothetical random files that might be created some day.  You could create the file randomly, but you didn't. Maybe you could create a file identical to the complete works of Shakespeare by XORing together two files of apparently random garbage.  "Why, so can I, or so can any man;" but that doesn't mean that I am William Shakespeare.
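The arithmetic behind that "copyright on everything" argument is easy to check.  This is a sketch with made-up data; the helper names xor_bytes, pad_to, and key_for are hypothetical, and padding with zero bytes is just an assumption to make the lengths line up.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def pad_to(data: bytes, length: int) -> bytes:
    """Truncate or pad with zero bytes so the result has exactly `length` bytes."""
    return data[:length].ljust(length, b"\0")


def key_for(source: bytes, target: bytes) -> bytes:
    """The 'right file': XORing it with `source` yields `target` (padded)."""
    return xor_bytes(source, pad_to(target, len(source)))


mine = b"my one copyrighted thing"
targets = [b"To be, or not to be", b"Call me Ishmael.", b"\x00\x01\x02"]

for target in targets:
    key = key_for(mine, target)
    print(xor_bytes(mine, key) == pad_to(target, len(mine)))
```

The math works for every target.  What the math does not tell you is whether anyone actually produced the key that way, and that is the only part the lawyer cares about.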

This idea of Colour is a problem for communication between those of us who work in the world of computers, where Colour does not exist, and those of us who work in the law, where Colour exists and is important.  Lawyers will ask computer scientists questions about how to determine the Colour of bits (like "How can Friend Computer prevent the Commie Mutant Traitors from making illegal copies of files, while still allowing loyal Troubleshooters to use disk-copying equipment?"), and computer scientists will find it difficult to say anything in response that the lawyers can comprehend - because a big part of computer science is about understanding that Colour does not exist.  Someone who cares a lot about what Colour the bits are, and spends a lot of resources on trying to answer that question, is a dangerous idiot if not a Commie Mutant Traitor.  In intellectual property law the Colour of bits exists and is of absolutely paramount importance.  A computer scientist who won't tell what Colour the bits are is being deliberately unhelpful, and a computer scientist who denies the very existence of Colour (as any conscientious computer scientist must eventually do) is a dangerous idiot and/or a Commie Mutant Traitor.

There are several ways we could try to avoid the issue.  Computer scientists who want to try to be helpful may say, "Okay, you, the lawyer, are a dangerous idiot, but I have to work with you or be thrown in jail as a Commie Mutant Traitor as happened to Dmitry Sklyarov, so I'll try to address your concerns.  You say there is some special property of some bits and we need to know which bits have this property.  Fine.  We'll attach tags to the files to say what Colour they are." In the copyright realm, that's the "rights management information" solution.  It's what they do with DVDs (region coding), VHS tapes (Macrovision), Adobe eBooks ("you may not read this file aloud"), CDs (SCMS), and many other formats.  The trouble is, if we (as computer scientists) are intellectually honest about it, we'll have to admit that it can't really work.

The tags are just more bits.  You can write a tag that says "this is an Orange tag", but it will be made out of bits and so it can't really have a Colour because Colour does not exist.  It will just be a Colour-less tag saying "this is an Orange tag".  It will be subject to all the consequences of the fact that Colour does not exist - such as the fact that the tag could be stripped out somewhere down the line.  The computer scientists are aware of that; we have to be, because knowing about the non-existence of Colour is what makes us computer scientists in the first place.
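Here is a minimal sketch of why.  The container format below (a four-byte length, a small JSON header, then the payload) is entirely hypothetical and is not how any real rights-management scheme works; the function names are assumptions too.

```python
import json


def attach_tag(payload: bytes, colour: str) -> bytes:
    """Prefix the payload with a length-delimited header naming its 'Colour'."""
    header = json.dumps({"colour": colour}).encode("utf-8")
    return len(header).to_bytes(4, "big") + header + payload


def read_tag(tagged: bytes) -> str:
    """Return the Colour the header claims."""
    header_len = int.from_bytes(tagged[:4], "big")
    return json.loads(tagged[4:4 + header_len])["colour"]


def strip_tag(tagged: bytes) -> bytes:
    """Throw the header away and keep only the payload."""
    header_len = int.from_bytes(tagged[:4], "big")
    return tagged[4 + header_len:]


payload = b"some bits"
tagged = attach_tag(payload, "Orange")

# Nothing stops anyone from removing the tag or writing a different one.
relabelled = attach_tag(strip_tag(tagged), "Infrared")
print(read_tag(tagged), read_tag(relabelled))
```

The payload is byte-for-byte the same before and after relabelling.  The tag only reports what somebody wrote into it.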

What we are doing with rights management information is simulating Colour in a computer-sciencey way.  But lawyers will seize on the possibility of doing this kind of simulation and say, "See!  You admit it!  You can recognize the Colour of bits after all!" and then conclude from there that all the other rules they want to make (such as "Red Troubleshooters may not walk down Orange hallways") are meaningful in the computer science realm.  They'll say "You can recognize the Colour of bits after all!" rather than "Colour exists after all!" because the idea of Colour not existing in the first place is not within their imagination.  The "fact" that Colour is something real is so fundamental to law that it can't be challenged.  Of course Colour exists.  We lawyers think about Colour so much that we think we can see it.  Why can't you?  Maybe there is something wrong with your eyes.  As computer scientists, we need to make clear that Colour simulated by Colour-less tags saying "this is an Orange tag" and such, is still only a simulation.  The properties that Colour is supposed to have do not automatically come with the tags, because those properties are Colour, the tags are bits, and bits do not have Colour.  Even bits that talk about Colour do not have Colour themselves.  There is no such thing as Colour.

Another thing computer scientists will try to do is to treat Colour as a function (in the strict mathematical sense of "function") of the bits - maybe an uncomputable function (in the strict mathematical sense of "uncomputable"), maybe intractable, but a function nevertheless.  We either do that because we mistakenly believe that Colour really is a function, or because we're a little more sophisticated, we know that it's not a function, but we think that we can fake it closely enough with a function to get the lawyers off our backs.  Either way, the idea is that we should be able to look at bits and somehow determine, from the bits themselves, what Colour they ought to be.

Treating Colour as a function is almost the same as attaching tags to the bits - the difference is that when the Colour is a function of the bits, we don't have to worry about the tags being detached; on the other hand, when the Colour is a function of the bits, we can never have more than one possible Colour for a given sequence of bits.  Monolith depends on exploiting this problem:  it assumes that one file can only ever have one Colour, asserts that the Colour of its output file is the "you may copy this" Colour because of the (correct) claim that fixing any other single unchangeable Colour would raise legal problems, and then follows the logic to a claim that it can produce what would otherwise be an illegal copy of the copyrighted input, without breaking copyright law.  One Colour per file was never one of the lawyers' rules of Colour; it's merely a consequence of "Colour is a function", and Colour being a function is just something we computer people decided to believe because functions make sense to our training and Colour doesn't.  Colour is not actually a function at all.

Trying to infer the Colour from the bits may seem like an okay thing to do as long as bits are tied to physical objects.  You can examine a paper document and determine whether it is an original or a photocopy.  You can probably examine something purporting to be a photograph and determine whether it is a photograph of a real scene, or something more complicated.  But even in the analog realm, determining Colour by examination is not always possible.  You can't determine by looking at a photograph of two people having sex whether they consented to the sex or not, let alone whether they consented to the making of the photograph.  That's a Colour distinction that is not a function of the bits that make up the photograph - and it's true even of analog photographs.

Other important questions which you may or may not be able to answer by examining a photograph are "Are those things actually humans, or some kind of simulation?" and "How old are they?" Those questions may have been difficult with analog; they become even more difficult with digital.  It is easy to imagine that someone could render by innocent means (drawing or ray tracing or whatever) an image bit-for-bit identical to an image that has the Colour (presumably Pink) of illegal child pornography.  In that case, depending on your view of such things, it may matter where the bits came from to the determination of whether they are Pink (illegal) or Green (legal).  Identical bits may have different Colour.

Child pornography is an interesting case because I find myself, and I think many people in the computing community will find themselves, on the opposite side of the Colourful/Colour-blind gap from where I would normally be.  In copyright I spend a lot of time explaining why Colour doesn't exist and it doesn't matter where the bits came from.  But when it comes to child pornography, I think maybe Colour should make a difference - if we're going to ban it at all, it should matter where it came from.  Whether any children were actually involved, who did or didn't give consent, in short:  what Colour the bits are.  The other side takes the opposite tack:  child pornography is dangerous by its very existence, and it doesn't matter where it came from.  They're claiming that whether some bits are child pornography or not, and if so, whether they're illegal or not, should be entirely determined by (strictly a function of) the bits themselves.  Legality, at least under the obscenity law, should not involve Colour distinctions.

I think computer scientists could actually understand Colour a lot better than we do, because there are places in computer science where Colour does matter.  I already mentioned the idea of quoting and plagiarism - identical words are or are not okay to use without quote marks in an academic paper depending on their Colour.  Those of us with degrees are able to follow the rules for that because people who aren't get kicked out of school before finishing their degrees.  That's a general academic application of Colour.

If you've any exposure to metrology - not "meteorology", I mean the science of measurement - you'll be familiar with the idea of tracing the pedigree of standards.  Down in the chemistry lab they have a big jar of buffer solution with a label asserting that it not only has a pH of exactly 7.00, but that its pH is "traceable" to such-and-such primary standard, through a chain that probably terminates at the National Institute of Standards and Technology (NIST, formerly the National Bureau of Standards) in Boulder, Colorado, USA. That's Colour.  Not only do you know the pH of the buffer solution, but you know where it came from.  Someone other than NIST might be able to produce a buffer solution that is just as good and just as accurately 7.00 pH. If you have a sample of good pH 7.00 buffer solution it might be indistinguishable from the real traceable standard solution; but it wouldn't really be the traceable solution unless it had the intangible Colour to make it authentic.

The computer science applications of Colour seem to be mostly specific to security.  Suppose your computer is infected with a worm or virus.  You want to disinfect it.  What do you do?  You boot it up from original write-protected install media.  Sure, you have a copy of the operating system on the drive already, but you can't use that copy - it's the wrong Colour.  Then you go through a process of replacing files, maybe examining files, swapping disks around and carefully write-protecting them; throughout, you're maintaining information on the Colour of each part of the system and each disk until you've isolated the questionable files and everything else is known to be the "not infected with virus" Colour.  Note that developers of Web applications in Perl use a similar scorekeeping system to keep track of which bits are "tainted" by influence from user input.
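The taint idea can be sketched in a few lines.  This is written in Python for illustration and is not how Perl's taint mode is implemented; the Value class and the helper names (literal, from_user, concat, run_command) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    """Some text plus an out-of-band record of where it came from."""
    text: str
    tainted: bool


def literal(text: str) -> Value:
    """Text written by the programmer."""
    return Value(text, tainted=False)


def from_user(text: str) -> Value:
    """Text that arrived from outside the program."""
    return Value(text, tainted=True)


def concat(a: Value, b: Value) -> Value:
    """Taint is inherited: the result is tainted if either input was."""
    return Value(a.text + b.text, a.tainted or b.tainted)


def run_command(command: Value) -> str:
    """Refuse tainted commands; otherwise just report what would be run."""
    if command.tainted:
        raise PermissionError("refusing to run a command built from user input")
    return "would run: " + command.text


safe = concat(literal("ls "), literal("/tmp"))
risky = concat(literal("ls "), from_user("/tmp"))
print(safe.text == risky.text, safe.tainted, risky.tainted)
```

The two commands contain the same characters, and only the scorekeeping differs.  The flag lives beside the text rather than in it, so it survives only as long as everything that touches the value goes through code that maintains it.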

When we use Colour like that to protect ourselves against viruses or malicious input, we're using the Colour to conservatively approximate a difficult or impossible to compute function of the bits.  Either our operating system is infected, or it is not.  A given sequence of bits either is an infected file or isn't, and the same sequence of bits will always be either infected or not.  Disinfecting a file changes the bits.  Infected or not is a function, not a Colour.  The trouble is that because any of our files might be infected including the tools we would use to test for infection, we can't reliably compute the "is infected" function, so we use Colour to approximate "is infected" with something that we can compute and manage - namely "might be infected".  Note that "might be infected" is not a function; the same file can be "might be infected" or "not (might be infected)" depending on where it came from.  That is a Colour.

But the "might be infected" Colour is clearly a fictional thing we create to help us approximate a tricky function.  It's still easy to argue that Colour doesn't really exist.  I've saved until last what I think is the best example of a Colour in computer science, and I think even the most hardline mathematicians will have to agree that even though this isn't a function and cannot be represented in bits, it's something real that we have to be able to think about and care about.

Random numbers have a Colour different from that of non-random numbers.  The question of how to determine whether numbers are random or not by looking at them is one of the recurring flame wars of sci.crypt.  You can't do it.  Here's a number:  2.  Was that a random number?  Well, maybe I got it by rolling a die (a random generator); or maybe I got it by counting my legs (probably not random).  If I give you a file of supposedly random bits, there's no way you can tell whether they are randomly generated or not.  The same file could have been generated by a quantum-mechanical random source, monkeys on typewriters, or by encrypting some well-known non-random file with some scheme that may or may not be generally known.

There are statistical tests you can do; for instance, if you look at the file and discover that it contains a copy of the works of Shakespeare, then it doesn't look much like you would expect randomly generated numbers to look.  But it could still be randomly generated.  The test tells you whether the file has the statistical properties expected from randomly generated files, not whether the file really is randomly generated or not.  It's not even correct to say "the probability of this being from a random generator is very low" because that's not true - it either was or was not randomly generated, that's not open to probability.  At best you could say "If we ran a random generator to produce a file this size, the probability of it generating this file would be very low", which sounds almost the same, but is not.
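A rough sketch of such a test makes the limitation visible.  The crude bit-frequency check below, its tolerance, and the function names are assumptions chosen for illustration; real test suites are far more elaborate, but they share the property that matters here, which is that they are functions of the bits.

```python
import hashlib


def ones_fraction(data: bytes) -> float:
    """Fraction of bits in `data` that are 1."""
    return sum(bin(byte).count("1") for byte in data) / (8 * len(data))


def looks_random(data: bytes, tolerance: float = 0.05) -> bool:
    """Crude frequency test: roughly half the bits should be 1."""
    return abs(ones_fraction(data) - 0.5) < tolerance


def deterministic_stream(n_blocks: int) -> bytes:
    """Entirely non-random: SHA-256 of the counters 0, 1, 2, ..."""
    return b"".join(
        hashlib.sha256(i.to_bytes(8, "big")).digest() for i in range(n_blocks)
    )


stream = deterministic_stream(128)       # 4096 bytes, reproducible by anyone
print(looks_random(stream))              # result for the hashed counters
print(looks_random(bytes(4096)))         # result for a file of zeroes
```

The hashed counters have the statistics you would expect from a random source, yet anyone can regenerate them exactly.  The test reports on the bits, not on where they came from.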

Note my terminology - I spoke of "randomly generated" numbers.  Conscientious cryptographers refuse to use the term "random numbers".  They'll persistently and annoyingly correct you to say "randomly generated numbers" instead, because it's not the numbers that are or are not random, it's the source of the numbers that is or is not random.  If you have numbers that are supposed to come from a random source and you start testing them to make sure they're really "random", and you throw out the ones that seem not to be, then you end up reducing the Shannon entropy of the source, violating the constraints of the one-time pad if that's relevant to your application, and generally harming security.  I just threw a bunch of math terms at you in that sentence and I don't plan to explain them here, but all cryptographers understand that it's not the numbers that matter when you're talking about randomness.  What matters is where the numbers came from - that is, exactly, their Colour.
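The entropy claim is the one piece of that sentence that is easy to show with a toy example.  Assume a source that emits four-bit values uniformly, and assume a filter (hypothetical, like the function names) that rejects outputs which "don't look random":

```python
from collections import Counter
from itertools import product
from math import log2


def entropy_bits(outcomes) -> float:
    """Shannon entropy, in bits, of a source that picks uniformly from `outcomes`."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def looks_suspicious(nibble: str) -> bool:
    """The assumed filter: all-zero and all-one outputs 'don't look random'."""
    return nibble in {"0000", "1111"}


all_nibbles = ["".join(bits) for bits in product("01", repeat=4)]
filtered = [n for n in all_nibbles if not looks_suspicious(n)]

print(entropy_bits(all_nibbles))   # the unfiltered source
print(entropy_bits(filtered))      # the source after throwing outputs away
```

The unfiltered source carries four bits per output; the filtered one has only fourteen possible outputs and therefore carries fewer.  An attacker who knows about the filter has less to guess.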

So if we think we understand cryptography, we ought to be able to understand that Colour is something real even though it is also true that bits by themselves do not have Colour.  I think it's time for computer people to take Colour more seriously - if only so that we can better explain to the lawyers why they must give up their dream of enforcing Colour inside Friend Computer, where Colour does not and cannot exist.  Maybe then they'd stop trying to shoot us as Commie Mutant Traitors.

Hey, Reddit, Ycombinator, and Metafilter readers! You know what, I'm proud that this article has become a benchmark, but I've written a lot of others I like too, some of them more recently than 2004. It would sure be nice if my other articles got some love instead of just this one being linked from a discussion every week.

36 comments

Tiago
I disagree with you with regard to fake porn and obscenity laws.

Child porn is illegal because people believe it's very likely children can be harmed, directly or indirectly, if they participated in its production. If no children are harmed, then there is no point in making it illegal. (I don't buy the "but it stimulates people to do it for real" thing; there are things way worse than plain sex being represented in all sorts of media, and no proof that they cause people to do it for real.)

And obscenity laws should be torn down and burned; they harm us more than they help.
Tiago - 2010-04-17 23:21
Matt
I'm not sure why you describe that as a disagreement - as I said, it should matter where the bits came from, and it sounds like your opinion on obscenity laws is pretty much the same as mine. I think we're on the same side.

However, it is not the case, at least in Canada, that child porn is illegal because its production is thought to harm children. Maybe that would be a sensible reason for it to be illegal, but the law isn't always sensible. R. v. Sharpe made it very clear that in Canada, child porn is illegal because people think that looking at it harms the person looking (by reinforcing "cognitive distortions") - whether anyone was harmed to produce it or not. Maybe you don't "buy" that justification - I don't, myself - but it is the reason for the current law whether you and I buy it or not.

If you read R. v. Sharpe carefully, you may notice something else interesting: the Court didn't even write that they think looking at child porn stimulates people to commit crimes. They wrote that Canadian society thinks that looking at child porn stimulates people to commit crimes, and just because Canadian society thinks so - and despite the evidence the Court heard that it isn't actually true - it becomes the basis for the law.
Matt - 2010-04-20 07:39
Terry A. Davis
Bible: "The lot cast in the lap is entirely up to the Lord."

God says...
novelty Humans impious Perfect deridedst GIVE Shepherd clog
inebriate element contend fruitful becometh circumstance
Terry A. Davis - 2010-05-17 05:41
Matt
Is the previous comment spam or not? I can't tell. The random-looking words at the bottom are characteristic of spam, but there's what looks like a legitimate email address, and the name matches the registration on the domain - which isn't what I expect from spammers.

The Bible reference seems to be to Proverbs 16:33, which doesn't exactly say that but is pretty close.
Matt - 2010-05-17 06:02
Andy Baker
I wonder who would own the copyright on Pierre Menard's version of Don Quixote?
http://en.wikipedia.org/wiki/Pierre_Menard,_Author_of_the_Quixote

(or to bring more Borges to bear - there will be a copy of Don Quixote in the library of Babel with a copyright inscription that says "copyright 1932 Pierre Menard" - as well as one that says "copyright 2011 Bill Gates")
Andy Baker - 2010-05-17 08:29
Matt
People often bring up the idea of copyrighted works showing up in the digits of pi. If pi is what's called a "normal number" - which is not known to be true, but strongly suspected - then its base-ten digits contain as a substring every finite sequence of base-ten digits, and so with a trivial encoding it contains all the volumes of the Library of Babel, and you can make the same arguments about pi that you could make about the Library.

I imagine that if the inhabitants of Babel had a notion of copyright, they would follow some kind of "sweat of the brow" theory: whoever did the work of climbing stairs and scanning galleries to find a book with important contents, would have some kind of property right over it. Of course, that gets us into the deep issue of whether ideas are *created* or *discovered*.
Matt - 2010-05-17 11:33
Mick
"If we ran a random generator to produce a file this size, the probability of it generating this file would be very low"

I don't follow your logic in this paragraph: Could any set of randomly generated numbers be said to be more- or less- likely than another?

And what is wrong with saying that (2,4,6,8,10) is unlikely to be a list of randomly generated numbers? Sure, it _could_ be, but other rational hypotheses seem more likely.
Mick - 2010-05-18 07:26
Matt
Mick, assuming a uniform distribution, every output sequence is equally unlikely. However, every output sequence of any significant size is extremely unlikely, and that's the point. If you fix any sequence in advance, run the generator, and find that exactly your sequence comes out, that's somewhat surprising (and we can quantify exactly how surprising). The fact that they're all equally so is a red herring. If I run a random number generator that produces five integers, each from 1 to 10, independently uniformly distributed, then the chance of it producing (2,4,6,8,10) is one in 100,000. That's the same as the chance of it producing (3,1,4,1,5), or (2,4,1,5,4). Pick one of those, run the generator, get your sequence, and wow! that was a one-in-100,000 event. But run the generator first, look at the sequence, and what are the chances that that sequence came out? One out of one. You just saw it happen.

What are the chances I produced each of those examples from a random number generator when writing this comment? Well, for the first two, it's zero. I chose those by hand. For the third, it is one. I got that one from a one-line Perl program that invoked the generator. It is an objective statement, which is or isn't true and will not change for all eternity, whether I did or didn't get each sequence from a random number generator. That's not open to probability. It may be open to *uncertainty* - you might not have read this paragraph yet, or I might even be lying about where I got the numbers and so you have to consider some possibility that I might be lying - but uncertainty isn't exactly the same thing as probability. You would, for instance, not be well-advised to make a bet with me about whether the numbers came from a random number generator, because I know that and you might not be certain; but if you believe that the random number generator works, it might be quite reasonable to make bets about its *next* (not its existing, previous) output.

There are different interpretations of probability and this discussion highlights the distinction between "Bayesian" and "frequentist" interpretations. It's possible to define the word "probability" in such a way that it becomes subjective and applies to what I'm calling "uncertainty"; under that interpretation your "probability" for something could meaningfully be different from mine because of your different knowledge of the events. However, even under that interpretation it would be foolish to say that the probability of being random or not is a property of the numbers - it would, instead, become a property of your state of knowledge about the numbers. The numbers themselves did or didn't come from the random number generator and that will never change.
Matt - 2010-05-18 07:55
Anon
What about data type as a Colour? As in, int versus float?
Anon - 2010-07-01 13:30
sbartel
Internally, data types are encoded with bits, making them colourless.
sbartel - 2011-03-24 00:38
Matt
Well said.

Data types are not irrelevant; the question of "how should we interpret these bits?", which can differentiate two otherwise-identical sequences of bits, often IS exactly the kind of thing I mean by "Colour". One of my favourite examples is the "tainting" mechanism used by such languages as Perl and PHP: under certain configurations, variables containing user-entered data or influenced by user-entered data may carry a special out-of-band property, one which differentiates them from otherwise-identical variables, so that you get errors if you later use them in dangerous ways such as to construct command lines. That can protect against certain classes of security problems. It can be shown to work perfectly as long as you remain within the language that implements it and that language is perfectly implemented. However, the notion of typecasting - deliberately changing the type of bits without changing the bits - is integral to how computers work. Even if you choose to use a programming language that doesn't allow typecasting you can't force me to also use such a language; and like any other metadata, it's always possible for the metadata to be lying or simply wrong.
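
Both halves of that can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Tainted` class and `run_command` function are hypothetical stand-ins, not how Perl's taint mode is actually implemented, and the particular bit pattern is just a convenient one.

```python
import struct

# Typecasting: the same four bytes, read under two different type "Colours".
raw = struct.pack("<I", 0x40490FDB)        # four bytes, little-endian
as_int = struct.unpack("<I", raw)[0]       # read as an unsigned 32-bit integer
as_float = struct.unpack("<f", raw)[0]     # read as an IEEE 754 single-precision float

class Tainted(str):
    """Hypothetical marker type: same characters as a str, plus an out-of-band flag."""

def run_command(arg):
    """Stand-in for a dangerous operation that refuses tainted input."""
    if isinstance(arg, Tainted):
        raise ValueError("refusing to build a command line from tainted data")
    return "echo " + arg

user_input = Tainted("hello")
laundered = str(user_input)   # a "typecast": identical characters, flag gone

print(as_int, as_float)
print(laundered == user_input, isinstance(laundered, Tainted))
```

The check in `run_command` only works while everyone stays inside the convention; one call to `str()` launders the value without changing a single character.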
Matt - 2011-03-24 07:18
Michael Rule
Regarding the x-or example: The encoded file is only useful if one knows the public domain file with which to x-or it to recover the copyrighted file. The encrypted file is not a string of random bits. It is a string of bits, coupled with the information "x-or me with these bits that can be found here". The reference to the public-domain file necessary for decryption can be considered a one-time pad, which, coupled with the decryption function "x-or", allows me to decode the copyrighted file.

The point is that knowing how to decode the file makes all the difference.
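
To make the x-or step concrete, here is a minimal Python sketch. The two byte strings are made-up stand-ins of equal length, not real works, and `xor_bytes` is a name chosen for illustration.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise x-or of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))

public_domain = b"To be, or not to be"   # stand-in for the public-domain file (the "pad")
copyrighted = b"Hypothetical lyrics"     # stand-in for the copyrighted file

encoded = xor_bytes(copyrighted, public_domain)   # what gets distributed
decoded = xor_bytes(encoded, public_domain)       # recoverable only if you know the pad

print(encoded)
print(decoded)
```

Without the instruction "x-or me with that file", `encoded` is just a string of bytes; with it, the original comes straight back.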
Michael Rule - 2011-03-30 22:21
Matt
Also worth noting is that in logic programming, the concept of "attributed variables" is pretty much exactly an implementation of Colour: you can attach extra information to a variable that differentiates it from otherwise-identical variables, and with coroutining you can even make this extra information have magic consequences like throwing an exception when someone tries to copy it. (Of course, attributed variables also have other, nobler uses; they don't ONLY exist for the purpose of boobytrapping your code.)

Prolog, the best-known logic programming language, doesn't do this out of the box; but more advanced logic programming systems (see, in particular, the Attribute Logic Engine (ALE), open source maintained by colleagues of mine) implement a distinction between "extensional" and "intensional" (not "intentional") data. Two identical pieces of extensional data cannot be distinguished from each other. Two pieces of intensional data can be identical in all ways except that you can tell them apart from each other. Although you can think about a similar distinction in any language that has pointers or references, procedural systems generally don't describe it as a precisely defined concept in itself.
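
As a loose analogy only (this is plain Python, not ALE or attributed variables), the reference-based version of the distinction looks something like this, with `==` playing the extensional role and `is` the intensional one:

```python
a = [1, 2, 3]
b = [1, 2, 3]     # same contents, built separately
alias = a         # a second name for the very same object

same_value = (a == b)         # extensional question: are the contents equal?
same_object = (a is b)        # intensional question: is it the same entity?
alias_is_same = (alias is a)

a.append(4)                   # changing one entity...
print(same_value, same_object, alias_is_same)
print(alias, b)               # ...affects its alias but not its look-alike
```

Before the `append`, nothing about the contents of `a` and `b` distinguishes them; the difference only shows up because the language lets you ask about identity.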

The take-away lesson: we DO have the tools in mathematics and computing to understand the lawyers' concepts; but we need the moral courage to accept that the lawyers are actually talking about something that has a meaning, and they're not ALL dangerous idiots ALL the time.
Matt - 2011-05-01 08:59
Dan Mills
Fascinating write-up & comments.

Here are two more examples of colour in computer science:

* Expanding on the previous point about types: some languages have the ability to compare the equality of the contents of variables separately from whether they are "the same one". For example, a "standard" comparison operator might coerce (typecast) values, but another operator might be used which doesn't do that--thus failing equality for two variables that might be considered semantically equal at a higher level. In the case of OO languages type coercion might not be needed, and the strict comparison might involve checking an object ID.

If the type information is available at runtime, it is probably (always?) implemented via some sort of tagging mechanism, though. It's more interesting to consider compile-time typing failures since they might involve identical bits (e.g. int and int32 might be the same on this platform but are still "different").

A better example, also involving types:

* Strings. "does this string contain ASCII or UTF-8?" is one typical programming question that is exactly asking about the colour of bits. Some programmers' inability to understand colour has resulted in e.g. functions like "is_unicode" which might examine a string of bits and determine if they are "in unicode" (by which they mean some coding flavor like UTF-8, -16, UCS-2, etc). The problem is that (particularly if ASCII code-pages are a possibility), any sequence of bits is likely a valid combination in any coding system. You cannot infer what the string contains by looking at it, you just have to know.
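
A small Python sketch of the string case. The two bytes are picked purely for illustration, and `decodes_as` is a hypothetical helper of the "is_unicode" kind, not a real library function.

```python
data = bytes([0xC3, 0xA9])   # two bytes with no label attached

as_utf8 = data.decode("utf-8")       # one character: e with an acute accent
as_latin1 = data.decode("latin-1")   # two characters: A-tilde, copyright sign
as_cp1252 = data.decode("cp1252")    # also two characters

def decodes_as(raw: bytes, encoding: str) -> bool:
    """Validity check only: says whether decoding succeeds, not what was intended."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

print(len(as_utf8), len(as_latin1), len(as_cp1252))
print([decodes_as(data, enc) for enc in ("utf-8", "latin-1", "cp1252")])
```

All three decodings succeed, so a validity check cannot pick the right one; whoever produced the bytes has to tell you.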

Extrapolating a bit, I think this is generally why data standards are important / useful: you have a bunch of bits, what do they mean? The same sequence might be interpreted in a number of ways, so without some colour they are worthless. e.g., when your browser opens up a connection to some server on port 80 and requests a webpage, there is some mutual understanding about what might be said between the parties.

Thanks! :-)
Dan Mills - 2011-05-09 14:35
Simon Weber
This is an interesting view, but it seems to me that it simply dresses up the idea anyone will learn in an undergrad CS education: everything is bits. As others have said, saying anything else about a piece of information depends on making assumptions, and assumptions can be incorrect (or purposefully broken, as is so often seen in security). Color then, like a datatype, is just another assumption that can be broken and manipulated to lead to various problems.
Simon Weber - 2011-05-11 13:59
Matt
Anyone will learn in an undergrad *law* education something quite different.

I suggest reading the follow-up article at http://ansuz.sooke.bc.ca/entry/24 , which gets a great many fewer links than this one.
Matt - 2011-05-11 14:33
g__
In religion, there's an idea of "holiness". While holy water is chemically the same substance as normal water, it has different Colour to religious people. If you want to baptise, the religious law says you need holy water. This seems arbitrary to a passer-by, who doesn't recognize the holy Colour and says that normal water and holy water are physically the same. Just like identical files might be different in copyright law, the same bottles of water might be different in religious law. Is this a good example?
g__ - 2011-08-02 20:59
Matt
There are actually several interesting examples in religion - consider, for instance, kosher/not kosher status in Judaism, and the status of a transubstantiated host in Catholicism. Both are considered to be non-physical; that is to say, an implement could be kosher or not, and a host could be the body and blood of the Savior or not, without any atom being any different. The key is that, like the supposed Colour of bits, it's something that distinguishes *otherwise identical* things.
Matt - 2011-08-03 17:45
psychoslave
Dan Mills (or a bot using a random generator engine) wrote: "You cannot infer what the string contains by looking at it"

Well, I think you are plain wrong about what inference means. Because yes, you can infer that a string is in unicode, and that's exactly what the programmers you are talking about do. In fact, you don't need a CS degree to make such an inference. Anyone using a computer in a language not covered by ASCII has already done that. When you read a document where every occurrence of a letter is replaced with some nonsense characters, you understand that there's an encoding problem, but that it's the language you are thinking about.

What you probably mean is that you can not deduce it. But inference is not only deduction, but also induction and abduction.
* Deductive inference - finding the effect, given the cause and the rule.
* Abductive inference - finding the cause, given the rule and the effect.
* Inductive inference - finding the rule, given the cause and the effect.

In fact, I'm not even sure you can't deduce it, if you take the proposition "No one will publish near-nonsense text that would make sense after some simple substitution" as an axiom/postulate.

Sure, you can say "maybe that's really what the author wanted to publish", and sometimes that can be the case (maybe you would name a satire on encoding "unicode %C3%A7a roxe"). But then a human (or your latest bright AI engine) may guess it from the context.

Also note that formal logic doesn't give you any clue about what truth is. It doesn't give you the ability to be sure that I exist, or even that you exist. Descartes was lying, it's not "I think so I am", but "I can stand the idea that I don't exist, so let's pretend that everything that is logical must exist". Yeah, but take the other way, you don't exist, so your feeling that logical things are real doesn't exist. Logic is ego-powered.

Add Gödel's incompleteness theorems, and there you are. You don't have a clue if anything is absolutely true, neither from the bottom nor the top.

----

Now, on the article itself, I would say that what is called color here is what you can call a relation in math, or even in general. Mathematicians (and computer scientists) have focused so much on elements; Euclid's books have probably greatly influenced that.

Do you think we are a bunch of atoms? Let's assume that we are in a discrete world; then yes, we are. But we are not just a set of atoms; we are also all the relations between all these atoms.

And it's the same with the set of all your favorite pedo-nazi-porns. Yes, you can find it somewhere in pi. But what is important to us is: does it have some relation with the world we are living in, or is it just plain fantasy? From an evolutionary point of view it's important to us because the better you know the world, the better you can fit. But of course, for the exact same reason, being able to imagine "unreal" worlds is very important, because it's what enables you to figure out scenarios of what may happen later.

Now let's talk about suspicion. I don't know about the US, but in France (yeah, you didn't guess?), we are supposed to have a "presumption of innocence", so you are innocent until there is serious evidence that you are not.

OK. Let's say they share a file which, interpreted in a well-known way, gives you the latest radio song, which is illegal to share. Did they share the song? Because you don't know; maybe they just like to open it with vim and read. Or maybe they use it for some arithmetic operations - it's so fun to take a really big number and raise it to the power of itself. How do you know what they do with these files if you don't spy on them?

But there's worse. As every number can be written as a product of prime numbers, let's say you take this shiny number, which happens to also encode the previous song, and you compute these prime numbers. Now every day you send your friend some of them, so after a while, he can compute the shiny number. But no, wait, that's not it, you just illegally sent him a song. So sending prime numbers to your friend is illegal.

So what's important is not that the file is a number – and it is not a number, it just happens that we can number anything our mind meets, so we can number electrons for example – what's important is what you do with the file.

That means that in a non-big-brother society, with a justice system working with a presumption of innocence, you should hardly be able to prove that someone violated copyright law when all they publicly did was share digital files.

So what do "entertainment industrials" want now ? A big brother society and presumption of guiltiness.
psychoslave - 2011-08-23 17:24
Matt
psychoslave: noting that your Web site is in French, I think some of your points may have gotten lost in translation, because some of what you say above comes across as nonsense in English.

Looking at a string, it is possible to determine whether it does or does not meet the requirements to be valid UTF-8. Some strings do and some strings don't; and the answer to that question is a function of the bits. But what I call Colour is by definition not a function of the bits. In particular, the sequence of bits 01000001 is valid UTF-8. If you think it should be interpreted as UTF-8 then it denotes the letter "A." But it might also be the integer 65, or a bit-field encoding of the current states of eight different railroad switches, or any of many other things. It also might be part of an "original" work of art, or part of a "copy." There is no way to know, internally to the bits, which of those interpretations holds. You cannot take the information internal to the bits (01000001) and use that to make correct conclusions about the information external to the bits (it is or isn't really intended to be a UTF-8 string). The distinction you draw among different kinds of "inference" isn't really relevant at all. No form of inference will let you correctly extract the information that isn't there to begin with.
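
The same point as a few lines of Python, under the assumption (mine, for illustration) that the railroad bit-field puts switch 0 in the least significant bit:

```python
bits = 0b01000001   # the eight bits from the paragraph above

as_text = bytes([bits]).decode("utf-8")                 # read as UTF-8 text
as_integer = bits                                        # read as an integer
switches = [bool((bits >> i) & 1) for i in range(8)]     # read as eight switch states

print(repr(as_text), as_integer, switches)
```

Each reading is computed correctly from the bits, and none of them tells you which reading was intended.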

A "relation" in mathematics, at least as that term is used in English, is just a function from ordered pairs to truth values. I'm not sure how you get from that to Colour, which is by definition not a function.

Some forms of logic, and the philosophy they derive from, draw a distinction between "extensional" and "intensional" equality of objects. That's really the issue here: law tries to treat computer files in an intensional way while computer science almost always treats them in an extensional way.
Matt - 2011-08-25 10:25
psychoslave
We agree to say that "There is no way to know, internally to the bits, which of those interpretations holds."

Now to my mind, information is an action, not some static states. In- forme -ation, the action of mapping a form with something which is not in it.

So yes, when we infer, we do add something which isn't in our start point.

"Nothing is lost, nothing is created, everything is transformed". Well, that may be true for elements/atoms, but not for relation/information. We all know that: "Ooops, my non-backuped file!". Sure we could find the very same file, bit for bit, but to find it again, we will have to go through a process that will change us. And being changed, we won't interpret it in the very same way. This is close to your 4'33" story.

So, is it an extension/intension dilemma? As we see, meaning appears in interpretation. So you won't interpret this text as I do (even if you forget/forgive my English), and even "me" tomorrow won't interpret it in the same way.

In fact even writing/reading text may change us. Like a self-modifying code, or a code which can be modified at run-time by its input.

I hope this text makes sense(s) for English readers. ^^
psychoslave - 2011-08-26 06:10
g__
After some thought, I think the point of Monolith is different. When you have two seemingly random files that xor to something copyrighted, one of their distributors certainly did something illegal. The point is, who? By arresting both of them you will arrest a person who only put some random numbers, nothing wrong. I think the legal term for this behavior is "conspiracy".

Conspiracy is not generally legal, but it is possible to defeat the scheme even if it were. A trusted person introduces a random file as a trap. Anyone who xors it with a copyrighted thing is busted. Of course it requires trust. However, law enforcement can make their case stronger by marking a file with hidden patterns. For example, put out a seemingly random file A that has the works of Shakespeare embedded in it. If you now see a file B such that A xor B is a copyrighted thing, it's clear that B is the criminal. You can even bypass "randomness detectors" if A was the xor of several old public domain works.

However, I like the notion of Color in this article. Here's another one. If you are in a shop and take an item, it's illegal to walk out without paying. By paying, the item changes Color from unpaid to paid. Even though nothing happened to it physically, it feels different. Same with checking books out of a library. More generally, the idea of property - if I give you a cup and say it's yours now, nothing physically tangible happens to the cup, but an invisible property of "ownership" changes.

If I label a cup as mine, and you strike that out and write your name, it does not make you the owner. That's because metadata is only an approximation of the color. DRM is like selling cups with the label "You cannot drink wine from it". Even if it is a legal restriction, it's very ineffective - you can't stop buyers from tearing off the label or drinking wine.
g__ - 2011-09-13 12:22
Tim
A very interesting read for one who is neither a computer scientist nor a lawyer.

An idea I had, though I don't know if it's that different from Monolith (it may prove better), is to take the data of a public work and to strip out the data of a copyrighted work. As an example, say you take the bits composing 'Hamlet' and strip out the bits composing 'Smells Like Teen Spirit'; what you are left with is a legal, derivative work which you can call 'Smells Like Hamlet'. If someone ran a special program comparing your 'Smells Like Hamlet' with the source 'Hamlet' they would be left with the remaining difference, which is also the copyrighted work, similar to Monolith I'm guessing. Is this all then a conspiracy? For that, counsel will have to prove intent (that the person intended to get the copyrighted work and didn't just want to see the difference between the two documents) offender by offender, instead of shutting down an entire network that's distributing the derivative works. Honestly this seems a very convoluted way to get free media, but the novelty of the approach is interesting to me.
Tim - 2012-01-25 01:02
beleester
That's pretty much like Monolith. Here, the intent is still obvious from how it's distributed. If you post "Smells Like Hamlet" online and say "Hey, take the difference between this and Hamlet and play it as an MP3," then it's pretty obvious that your intent is to distribute copyrighted content, not to create some sort of amusing novelty variety of Hamlet. This goes double if you run a large website (call it "The Hamlet Bay") which hosts a lot of different variants of Hamlet which just "happen" to decode to popular songs.
beleester - 2012-01-27 10:00
Tanner Swett
If I'm not mistaken, there's already a word for the concept of Colour: "provenance".
Tanner Swett - 2012-02-02 17:26
Jonathan
This feels like one of the most important articles in the current discussion of copyright that I have seen. Is the concept of "colour" (by whatever other name) yours, or did you find it elsewhere?

Either way, brilliantly written article, and I hope that more people read and understand what you are trying to get across.
Jonathan - 2012-02-12 13:22
Matt
Glad you like it. I think I'm the first to formulate it in this form and apply it to copyright specifically, but the more general concept is something that philosophers and logicians have been aware of for a long time. In the field of ontology there is a concept of "extensional" and "intensional" (note, not the same thing as "intentional") entities, which amounts to the question of whether there can be two entities that are just like each other except that they are not the same entity. My claim boils down to "lawyers know that files are intensional; computer scientists know that files are extensional."
Matt - 2012-02-13 11:15
Gijs Feldberg
If I may pick on "If we ran a random generator to produce a file this size, the probability of it generating this file would be very low". This statement sounds rather dubious. Indeed, for any practical file size the probability would be fairly low. But in no way could that be a distinguishing characteristic. It's a core property of a randomly generated file of a given size that the probability of it matching bit-by-bit some other arbitrarily selected file of the same size is precisely equal to the probability of matching any other file of that size.
Gijs Feldberg - 2012-02-27 15:50
Matt
That's sort of the point. Assuming the random generator produces a uniform distribution, you could say the same thing and it would be equally true about ANY file of the same size, so there's no sensible way you can say one file is more random than another. Randomness is determined by where the file came from, not the file itself.
Matt - 2012-02-27 16:08
Matt
However, one thing you can do is an hypothesis test - you can say "If the file was randomly generated from such-and-such assumed distribution, then this test statistic would have such-and-such distribution; and the value it assumes on this file happens to be a very rare value." Then you're not saying *this* file is unlikely (which is always true) but that *this file or any other at least as surprising as it* is unlikely, which makes the claim that the file was random look questionable. For instance, a test statistic might be "length of the longest increasing sequence of integers in the file." If the file happens to look like "1,2,3,4,5,6,..." then that number will be maximized; but that's the only file with such a large value for the statistic, any other will have a smaller value, and if you see the maximum possible value you'd be right to be suspicious.
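
A rough Monte Carlo sketch of such a test in Python. I'm assuming here that "increasing sequence" means a contiguous strictly increasing run and that the null hypothesis is independent uniform integers in a known range; the function names and the trial count are arbitrary choices for illustration.

```python
import random

def longest_increasing_run(values):
    """Length of the longest contiguous strictly increasing run."""
    best = current = 1 if values else 0
    for prev, nxt in zip(values, values[1:]):
        current = current + 1 if nxt > prev else 1
        best = max(best, current)
    return best

def p_value(observed, low, high, trials=2000, seed=0):
    """Estimate P(statistic >= observed statistic) under the uniform null."""
    rng = random.Random(seed)
    stat = longest_increasing_run(observed)
    hits = sum(
        longest_increasing_run([rng.randint(low, high) for _ in observed]) >= stat
        for _ in range(trials)
    )
    return (hits + 1) / (trials + 1)   # add-one estimate, so it is never exactly zero

suspicious = list(range(1, 21))        # the "1,2,3,4,5,6,..." file
print(longest_increasing_run(suspicious), p_value(suspicious, 1, 20))
```

A tiny estimate here does not say the file "is not random"; it says that files at least this surprising are rare under the assumed generator, which is the most a test like this can claim.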
Matt - 2012-02-27 16:24
Mark Parker
I think the issue is less that computers are "Color Blind" about where the bits came from. The issue is that lawyers want to treat perfect duplicates of bits as the bits actually "moving" from one place to another, which is a fundamental fallacy. Computers can't do a "move" operation; they mimic it with a "copy and delete" operation. This misunderstanding by lawyers is, imho, why they don't understand that you can't disable copying without also disabling moving.
Mark Parker - 2012-03-06 07:38
Phil Hibbs
Saying "you can't stop copying without also stopping moving" is kind of like saying "you can't stop crime without stopping quantum mechanics". Both are true, but both can be argued to be worthwhile to attempt - all the law can hope to do is reduce, not eliminate. You may disagree that DMCA takedowns are inherently undesirable, but if you mock them for being mathematically impossible, then you're just being silly.
Phil Hibbs - 2012-03-06 08:22
Mark Parker
@Phil Hibbs, not at all - computers simply cannot do "move" - they emulate it by copying the original and then deleting the original. You can stop crime simply by taking the criminal away, but if you take copying away from a computer's "move", "delete" and "copy" functions, all you are left with is "delete", because the "true" statement is that computers can only copy and delete.
It's like saying real life simulates stealing as a buy operation with no payment, and the lawyers are (literally) arguing to stop stealing by taking the option to buy away.
Mark Parker - 2012-03-06 17:15
Dwayne Litzenberger
To anyone who thinks you can "tag" files for their colour, remember that things like fair use have squishy definitions that change the colour of the bits in question, but tagging can't handle that.
Dwayne Litzenberger - 2012-05-27 00:32
Nathan Stoddard
Excellent article. I haven't considered this issue in detail before.

You say that "It's not even correct to say 'the probability of this being from a random generator is very low' because that's not true - it either was or was not randomly generated, that's not open to probability." I disagree - you can certainly apply probability to situations like this. Probability doesn't have to be about things that are "truly random". We apply probability to situations that aren't actually random all the time. Even a coin flip isn't truly random - if you know the exact state of every molecule near the coin, you can run a simulation and predict whether it lands on heads or tails with very high probability (there's still a bit of uncertainty due to quantum mechanics). But we still say that a coin has 50/50 odds of getting heads vs tails even though it isn't truly random. However, I'm not sure how to calculate the probability of a number being from a random generator; I don't know enough about probability to do that. But your statement that this isn't "open to probability" isn't true. Eliezer Yudkowsky wrote an article about this kind of thing (though not this specific example) at http://lesswrong.com/lw/oj/probability_is_in_the_mind/
Nathan Stoddard - 2014-03-20 09:49
Peter Gerdes
Interesting article but I think you use the term color in a way that confuses two different properties that both fail to supervene on the bit sequence of a file (FYI supervene is a fancy philosophy word designed to capture the intuition that the supervening facts are a mathematical function of the supervened facts).

While I understand that you used color to talk about anything that doesn't supervene on the sequence of bits I think it would have been slightly more clear to instead talk about the epistemic relationship you have toward the bits and the causal history of the bits.

I mean when talking about kiddie-porn the US supreme court has ruled that it's the causal history of those bits that matter...were those bits created by abusing an actual child. If so then (I believe), even if you reasonably believed that image was produced by ray tracing, you can be convicted of possessing child-pornography. On the other hand if those bits were produced via rendering you have a protected first amendment right to possess and even distribute that material (though perhaps not to represent it as having come from actual photos).

On the other hand when you talk about the authority of a reference sample of a given pH or presenting evidence in a court case it's the epistemic status (and the opposing party's right to question that status) which is at issue. Even if your reference sample came from exactly the same source as the one with a pedigree traced back to the international standards bodies it doesn't matter if you don't know (or can't provide reasons for others to believe) where it came from. On the other hand, you probably could get the time of sunrise that you copied from the Naval observatory's website admitted into court if you had an expert who testified he had conducted extensive experiments verifying that the Naval observatory's published times were exceedingly accurate in the situation in question (it might be much harder treating that time as a black box than if you looked at the means of calculation but if in some strange hypothetical their means of calculation was a state secret you could still probably verify the fact that it's reliable).

In particular this distinction matters because I tend to think that it's the causal role which causes technical people trouble, not the epistemic status. Very few people are confused by the fact that someone can hand you a bunch of numbers they did produce by consulting a source of physical entropy but you still can't use them to run a monte-carlo simulation because you can't trust them. This isn't confusing since epistemic state is observer relative so we encounter cases all the time where the same bit string can have a different epistemic status depending on what you know about how it was generated. On the other hand it is very very rare that two bit strings of any appreciable size and non-trivial Kolmogorov complexity (the minimum length program that produces them...e.g. entropy) can't both be traced back to a common cause. Thus, people do get quite confused when the law cares about the causal history of a bit string (bans copying a copyrighted work but not independent duplication or bans bits that can be traced back to the actual abuse of a child).


-------
As an aside I have to say I can't disagree strongly enough with the approach to child-porn law that starts by consulting our intuitions about what we feel is bad and making child-porn illegal because we disapprove. Regardless of how much something disgusts/bothers you the proper role of legal punishment is only to protect children from harm and unfortunately there is good reason to believe our current punitive child-porn laws may create more child abuse than having no law at all (after release they ensure child-porn possessors no longer have factors known to discourage actual abuse and also increase reluctance of friends and family to alert authorities when they are worried but unsure if abuse is happening). Frankly, I think it's disgusting that as a society we are willing to put our own feelings of outrage and self-congratulation over the well being of children by not even bothering to look at the empirical evidence and figure out what set of laws minimize overall child abuse.

I mean which is sicker? The person who shamefully looks at pictures of kids being abused he downloads from the internet (without paying or distributing) or the person who is willing to let more kids get abused so they can feel satisfied those evil people who look at child porn get what they deserve?
Peter Gerdes - 2014-09-23 15:05

