When XML in Word Became Illegal (withedge.com)
134 points by ejz on Oct 12, 2023 | 50 comments


"Microsoft.... built a custom XML tool into its word processor in 2007... this was a tool for power users, and was only used by a small percentage of its user base."

I'm definitely confused by that statement and its link, because it implies the relevant tool is the disk format for every Office file, which has been described by an Excel program manager as "complicated enough to reduce a grown programmer to tears." https://www.joelonsoftware.com/2008/02/19/why-are-the-micros...


It's not referring to the XML file formats. It was a feature of Word specifically that allowed you to embed a user-defined XML schema in your Word document and use XML data that fits the schema in your document.

See https://www.zdnet.com/article/custom-xml-the-key-to-patent-s...

(edit: grammar)


Ah, thanks for explaining.


> it implies the relevant tool is the disk format for every Office file

Does it imply that?

Another commenter has already pointed out why it's likely not the case.

But also, I don't think the article is well written. Partly because it doesn't clearly explain what the infringing tool was, or did, or how it operated. Also I'm pretty sure there's a typo in "ex part" instead of "ex parte". But another major issue is the following:

> $40 million of that judgment [against Microsoft] was imposed by the court as punishment for continually arguing that i4i was a patent troll even though it had an operating business in a manner that was “persistent, legally improper, and in direct violation of the Court's instructions.”

What?

Why would i4i operating in a manner that was persistent, improper and in violation of the court's instructions preclude it from also being a patent troll? It could do both?

Or is the "persistent..." descriptor meant to apply to Microsoft? That might make more sense, but the "even though" seems to be a comparison between two types of activity by one entity - namely i4i.

But then again, I might be reading "it had an operating business in a manner" wrong, because it feels ungrammatical to me. I might not be putting the emphasis in the right place, and that's what's causing me to misread the sentence?

The whole thing just feels confusing.


Thanks for reading. Sorry if this was confusing! Microsoft said that i4i was a patent troll despite the court repeatedly telling Microsoft to not do that. The judge referred to Microsoft's repeated ignoring of its instructions as "persistent" etc. i4i had an operating business; it wasn't a patent troll. That operating business is niche and small, but it is real. I have updated that sentence to make it clearer. Thanks for your feedback!


Depends on one's definition. I don't think "not having a real product/service" is the defining characteristic of a "patent troll". Here's what Wikipedia says.

> attempts to enforce patent rights against accused infringers far beyond the patent's actual value or contribution to the prior art

> often do not manufacture products or supply services based upon the patents in question


No problem.

Looking back at it again now, I can see the intent of the original sentence where "it had an operating business" refers to i4i, but "in a manner that was..." refers to Microsoft. I didn't get the change of subject at that point.

Maybe an additional comma would have been all that I needed to figure it out: "even though it had an operating business, in a manner that was..."


The article says the feature has been removed; if it was the disk format:

1) it has never been removed, afaik Word still uses OOXML, so Word would keep being infringing

2) LibreOffice would probably be infringing too, as ODF is also XML based

So... it has to be some other form of XML tool and not the file format.

As for Joel's comment, IIRC he was an Excel PM before OOXML; in any case his blog post refers to the binary format that preceded OOXML. I'm pretty sure OOXML is equally complicated, if not more so, as the products themselves are way more complicated than they appear, but the fact is that he was talking about a different thing.

Edit: as many users pointed out, it's not the file format itself, but the ability to add arbitrary attributes/elements to the file format XML as additional data.


Nitpick: Joel is referring to the old BIFF-style format (from 2003 and before) in that quote. The new "Office Open XML" formats are not mentioned in that post at all. However, one of the many criticisms of the Office Open XML formats is that they are, in some areas, nothing more than an XML serialization of the BIFF records.


This isn't what Joel is talking about here.

On the backend, all .docx files use XML. Joel is saying the root XML format was difficult to work with.

What my article is about is this: Microsoft used to allow users to write their own custom XML rules on top of Word. (This was mostly app developers using XML for macros rather than end users, and overall it was very rare.) This is the feature that was at issue with the patent.

Sorry if this was not clear!


> Joel is saying the root XML format was difficult to work with.

Joel wasn't writing about the XML version of MS Office documents, he was writing about the binary versions.


Thanks for clarifying!


Looking at the patent application, it doesn't appear to mention XML at all (it does talk about SGML, though), and the application appears to claim any mapping of a symbolic name to style properties (think Word styles or CSS classes); in other words, technical trivialities, reflecting poorly on US lawyers and their patent law.


It's not about storing XML, it's (as far as I understand the patent) about a specific representation of XML that can be more efficient to read.

The patent is about representing documents with markup (XML or otherwise) not by embedding them in the text, but rather having them stripped and maintained as a separate list of (tag, position) pairs, with the document only containing the raw text.

I'm only surprised that Microsoft couldn't find prior art, because having a (content-type, address) index at the beginning of a file is not exactly an unusual representation. It also reminds me that the USPTO's idiosyncratic usage of non-obviousness doesn't really match my intuition.
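To illustrate what I mean by a (content-type, address) index at the beginning of a file, here is a minimal sketch. The layout (a count, then fixed-size entries of 4-byte type, offset, and length) and the names `pack`/`unpack` are made up for this example; it is not any specific historical format and not the format described in the patent.

```python
import struct

# Hypothetical layout, chosen only for illustration:
#   header: little-endian uint32 = number of index entries
#   entry:  4-byte content type, uint32 offset, uint32 length
HEADER = struct.Struct("<I")
ENTRY = struct.Struct("<4sII")

def pack(chunks):
    """Serialize (content_type, payload) pairs with the index up front."""
    index = b""
    body = b""
    # Payloads start right after the header and all index entries.
    base = HEADER.size + ENTRY.size * len(chunks)
    for ctype, payload in chunks:
        index += ENTRY.pack(ctype, base + len(body), len(payload))
        body += payload
    return HEADER.pack(len(chunks)) + index + body

def unpack(blob):
    """Read the index, then slice each payload out by offset and length."""
    (count,) = HEADER.unpack_from(blob, 0)
    chunks = []
    for i in range(count):
        ctype, offset, length = ENTRY.unpack_from(blob, HEADER.size + i * ENTRY.size)
        chunks.append((ctype, blob[offset:offset + length]))
    return chunks

chunks = [(b"TEXT", b"Hello world!"), (b"TAGS", b"\x00\x06")]
blob = pack(chunks)
```

The reader never has to scan the payload for delimiters; it jumps straight to each chunk, which is the property that made this kind of layout attractive.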


This is a huge issue with the patent world in general. There's just so much prior art out there, and you have to be really clear about showing that it applies. This isn't a patent case, but I have a great Google Maps case involving Wi-Fi where a judge completely borked it. As for this particular patent, I'm not enough of an XML expert to say whether the court got it right here. But it is worth noting that Microsoft tried to invalidate the patent several times with the USPTO and failed to do so there as well. So perhaps there's something more to the patent than meets the eye, or it was novel at the time even if it looks unremarkable next to modern XML. Remember, the actual i4i patent at issue was filed in 1994, and it only matters if there was prior art from before 1994. It might have been novel at the time.


> Remember, the actual i4i patent at issue was filed in 1994, and it only matters if there was prior art from before 1994. It might have been novel at the time.

I am aware of the date of the "invention". I was programming on 8- and 16-bit computers in the 1980s and I was using this and similar kinds of formats for non-textual data, simply because it was easier to do this in assembler than writing a parser, paired with the difficulty of finding unused special bytes in binary data to separate meta-information from the data proper.

And I was also talking about non-obviousness, not novelty.


Fair enough. I haven’t seen the invalidation proceedings and am clearly less of an expert than you. So don’t know whether they got it right. Non-obviousness is, erm, non-obvious.


Am I right to understand that it would be the equivalent of visual studio's wpf designer [1], where you have the WYSIWYG editor side by side with an xml editor and you can make the change in either of them and it translates into the other?

If it is, it would have been really really cool.

[1] https://i.stack.imgur.com/8pJnn.png


No. It's more like what the following piece of code produces:

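  import re

  def convert(xml):
      # Split on tags; the capture group keeps each tag in the result,
      # so parsed alternates text, tag, text, tag, ..., text.
      parsed = re.split(r"(<.+?>)", xml)
      output = parsed[0]
      tags_with_pos = []
      for i in range(1, len(parsed), 2):
          # Record each tag with its offset into the tag-free text.
          tags_with_pos.append((parsed[i], len(output)))
          output += parsed[i + 1]
      return tags_with_pos, output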
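
To make that concrete, here is the same function run on a small input, plus a hypothetical `restore` helper (my own addition for illustration, not something taken from the patent) that weaves the tags back in. It is only a sketch of the idea of keeping markup as a separate (tag, position) list.

```python
import re

def convert(xml):
    # Same logic as above: separate tags from text, remembering offsets.
    parsed = re.split(r"(<.+?>)", xml)
    output = parsed[0]
    tags_with_pos = []
    for i in range(1, len(parsed), 2):
        tags_with_pos.append((parsed[i], len(output)))
        output += parsed[i + 1]
    return tags_with_pos, output

def restore(tags_with_pos, text):
    # Hypothetical inverse: re-insert each tag at its recorded offset.
    pieces = []
    prev = 0
    for tag, pos in tags_with_pos:
        pieces.append(text[prev:pos])
        pieces.append(tag)
        prev = pos
    pieces.append(text[prev:])
    return "".join(pieces)

source = "<p>Hello <b>world</b>!</p>"
tags, text = convert(source)
```

Here `text` is `'Hello world!'` and `tags` is `[('<p>', 0), ('<b>', 6), ('</b>', 11), ('</p>', 12)]`, and `restore(tags, text)` gives back the original string.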


> the USPTO's idiosyncratic usage of non-obviousness doesn't really match my intuition

Remember that USPTO gets paid for each patent application, and not penalised when it's later falsified.


Well, it was apparently upheld twice on reexamination, where they could have fixed that. The problem is more that the bar for non-obviousness is so low, it's basically on the floor. Paired with a discipline (software development), where independent reinvention is common, this is just a recipe for disaster.


Everyone knows 1+1=2, so why did Russell spend many many pages/hours on a proof, surely if people know it then it's easy to demonstrate? /s

Programmers are notoriously good at documenting everything after all. /s

It's easy to give documentary evidence for things someone found self-evident and so only wrote a scribbled note about in a workbook 40 years ago. /s

FWIW patent law obviousness is not the same thing as ordinary notions of obviousness either.

All my personal opinion, ofc.


I believe I ran into this issue a few years ago and discovered the patent case when trying to work around it. The XML file format allowed for arbitrary properties to be added (as XML does), and we were trying to embed metadata in Word files. But when MS Word opened a file with anything extra in it, it gave a warning like "this file has extra stuff in it" and automatically removed anything that wasn't explicitly expected.


Not sure why this is downvoted, it’s absolutely correct. I tried this myself; it would have -greatly- simplified scraping Word docs because the custom tags would have been available for XPath querying. Alas, Word strips it all on open.


Yep, we had a similar use-case. I remember the error message pointed to a help page which pointed to an article about this patent.


A classic Joel on Software article about funny backwards compatibility built into Excel: https://www.joelonsoftware.com/2006/06/16/my-first-billg-rev...


I realised there is more time between now and that article, than there is between that article and the events described within.


Looking at the patent abstract [0], it basically patents the separation of information and structure. The latter can be used to present information in various ways.

My take is that it is fucking obvious and I just simply do not believe that the concept did not have prior art. It just shows what a crooked business this whole modern patent system is.

[0] - "A system and method for the separate manipulation of the architecture and content of a document, particularly for data representation and transformations. The system, for use by computer software developers, removes dependency on document encoding technology. A map of metacodes found in the document is produced and provided and stored separately from the document. The map indicates the location and addresses of metacodes in the document. The system allows of multiple views of the same content, the ability to work solely on structure and solely on content, storage efficiency of multiple versions and efficiency of operation."


Sounds similar to Phoenix Liveview.


I'm completely baffled as to how it's allowed to get a patent on stuff like this.

Can I patent sending REST requests using JSON?


No, that's not how it works, you can't patent a specific technology that's already been invented, what you do is wait for a new technology to be invented and then patent doing some obvious thing with the new technology.

Like, a good patent today would be: "Using a computer text prediction engine to automatically review and approve code."

It probably would have been pretty smart to skim through all the Hacker News threads after ChatGPT came out, patenting every other comment.


XML editing had already been invented


You could apply for that patent, but I would expect it to be rejected due to prior art, i.e. someone came up with it before you. Even if it was accepted, if you tried to enforce it, it'd definitely be challenged on prior art and you would very likely lose because it wouldn't be hard to prove you weren't the first.

Now, why this particular patent exists, and seems so general, is also likely related to prior art. What could be patented for software was a bit murky until the late 1990's when it was established that business methods implemented in software were allowed. This led to a large flood of patents in that space.

One of the issues is that the Patent Office tends to look at prior art as being "things that have already been patented", so when rules change, a lot of things that seem obvious are up for grabs because there's no prior patent. Now, these can be (and are) challenged in court, and courts are more likely to accept blatant prior usage in the wild. I don't know whether this case won its challenge, but it's possible that it didn't because XML was quite new in the late 90s too.

Source: I have a patent from around that time that basically covers anything in finance that's data driven from an XML document. For about a decade, that covered a fairly large chunk of finance. I never did anything about it as I disagreed in principle with the premise of such an absurdly broad patent. I agreed to it being patented solely for defensive reasons, i.e. it might prevent a competitor from egregiously attacking my employer with patents.


If you're sufficiently creative, certainly. Some of my friends patented something totally absurd: there's a transformation you can easily do in software and lots of software does it quite routinely. They did it twice. Patent issued.


You can try to patent anything, but the patent might not be accepted.

The thing is that the patent office is funded by patent fees, so there is an incentive to accept the patent, plus they are often hard to read.


What I understand of US law is that there's very little in the way of filing a patent. It's not really tested until someone challenges it.


If only someone had filed a patent that blocked Word from inserting curly quotes the wrong way, like ‘449.


Anyone got a screenshot of this feature?



I've not read the patent, but it's definitely inaccurate to say "Microsoft removed custom XML from Word." It's still possible to create custom XML parts programmatically, and I suspect it's quite commonly done. Also, I just checked, and Microsoft 365 has a custom XML mapping tool on the developer tab. So it would be interesting to know how Microsoft complied with the judgment and the subsequent history of the feature.
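
For anyone wondering what "custom XML parts" look like from the outside: a .docx is a ZIP package, and custom XML parts conventionally sit in it as separate files such as `customXml/item1.xml`. Below is a minimal sketch that reads such parts with the Python standard library. The package built here is a synthetic stand-in containing only that one part (it is not a valid Word document), and the `urn:example:invoice` payload and helper names are invented for illustration.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical payload; the namespace and element names are made up.
CUSTOM_PART = (
    b'<?xml version="1.0"?>'
    b'<invoice xmlns="urn:example:invoice">'
    b"<customer>ACME</customer><total>42.50</total>"
    b"</invoice>"
)

def build_package():
    """Build an in-memory ZIP holding one custom XML part (not a real .docx)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("customXml/item1.xml", CUSTOM_PART)
    return buf.getvalue()

def read_custom_parts(package_bytes):
    """Return {part name: parsed root element} for customXml/itemN.xml parts."""
    parts = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        for name in zf.namelist():
            if re.fullmatch(r"customXml/item\d+\.xml", name):
                parts[name] = ET.fromstring(zf.read(name))
    return parts

parts = read_custom_parts(build_package())
root = parts["customXml/item1.xml"]
ns = {"inv": "urn:example:invoice"}
customer = root.find("inv:customer", ns).text
total = float(root.find("inv:total", ns).text)
```

Note that this is data stored alongside the document body, which is a different thing from custom tags embedded inline in the document text.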


> Indeed, as you work on your Excel clone, you’ll discover all kinds of subtle details about date handling. When does Excel convert numbers to dates? How does the formatting work? Why is 1/31 interpreted as January 31 of this year, while 1/50 is interpreted as January 1st, 1950? All of these subtle bits of behavior cannot be fully documented without writing a document that has the same amount of information as the Excel source code.

A quick note to anybody building an Excel clone: if you turn this insane date handling behavior of Excel into an optional feature that can be disabled, everybody will appreciate it.



> Scientists will thank you

Scientists gave up and changed the gene names: https://duckduckgo.com/?q=excel+gene+names+changed+septin1


I always wondered why they won't just make it a popup button?

The default should be to not change anything. If a date is recognized, offer a button right next to the cell that allows you to accept the suggestion to turn it into a fully fledged date. Just make it so that pressing tab or shift-enter or a similar combination accepts that suggestion.
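
Roughly what I have in mind, as a sketch: the cell keeps the raw text, and a recognized date is only attached as a suggestion until the user accepts it. The `Cell`, `suggest_date`, and `enter` names are hypothetical, and the recognizer only handles a simple month/day pattern; it makes no attempt to reproduce Excel's actual rules.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Cell:
    raw: str                           # exactly what the user typed
    suggestion: Optional[date] = None  # offered, never applied silently
    accepted: bool = False             # set when the user accepts the offer

    @property
    def value(self):
        # Stay with the raw text unless a suggestion was explicitly accepted.
        if self.accepted and self.suggestion is not None:
            return self.suggestion
        return self.raw

def suggest_date(text, today):
    """Suggest a date for inputs like '1/31'; return None if nothing fits."""
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})", text.strip())
    if not m:
        return None
    month, day = int(m.group(1)), int(m.group(2))
    try:
        return date(today.year, month, day)
    except ValueError:
        return None  # e.g. '1/50' is not a valid month/day pair

def enter(text, today):
    """Create a cell from typed text without changing the stored value."""
    return Cell(raw=text, suggestion=suggest_date(text, today))

today = date(2023, 10, 12)
cell = enter("1/31", today)
gene = enter("SEPT1", today)
```

Until `cell.accepted` is set, `cell.value` is still the string `"1/31"`; something like `"SEPT1"` never gets a suggestion at all in this sketch.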



Just do https://xkcd.com/927/, happened once and it was okay.


What comes after .docx? .docxx? .docy? .docxi?


docxEx, in Win32 fashion.


[flagged]


Did you read the article at all? If we take the article's word at face value, Microsoft was absolutely a victim of bullshit patent trolling to the tune of $240 million. Perhaps there's some schadenfreude in seeing Microsoft finally being the victim of the corrupt IP system that they themselves have so often taken advantage of, but the garbage being inflicted on the world in this case is entirely from i4i and the awful, corrupt, bullshit patent system.


The United States is #1 for protection of intellectual property in the world according to the property rights index: https://www.internationalpropertyrightsindex.org/

Real property on the other hand? The US is ranked 14th.



