+ 1
Python Parser, for complex flat files
I'm dealing with EDI files; samples of them can be found at https://x12.org/examples. I would like to write a parser that reads the data and places it into a database.
5 Answers
+ 7
+ 5
The EDI specifications published by the X12 organisation are NOT open standards. They are protected by intellectual property rights. Companies can use them by paying a license fee.
https://x12.org/products/licensing-program
The specification itself is not publicly accessible.
Furthermore, these message standards are merely guidelines. Each company that implements this format typically uses only a small subset of the message. You can think of every message definition as a separate mini-language, where the client can choose their own set of expressions and custom encodings. The "examples" are just one of many possible variations.
+ 3
Sounds like a good challenge Brandon.
Do you have any code that you've started? Are you having trouble getting it to work?
If so, save it in the code playground and attach it here using the plus button.
+ 2
For a little more practical advice on how to build a parser, let's look at an example.
Each message is composed of lines that are called "segments" such as:
DTM*582****RD8*20140101-20140131~
In the message text file there is usually no line break, so note that segments are separated by the ~ character.
Each segment is composed of "elements" separated by the * character. The first element is the segment tag, which gives the syntactic meaning of the segment, like DTM = date. Usually there is a qualifier code; in the example above, 582 refers to "report period" (there are large code tables that define the meanings).
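Some elements have multiple parts. In the example above, the RD8 format qualifier indicates a date range, so the last element holds a start and an end date separated by a - dash. X12 also has true composite elements, whose parts are split by a dedicated component separator (commonly : or >) declared in the interchange header.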
Some segments are mandatory. The standard dictates the structure of the message: which segments and elements are mandatory, and which segments or segment groups are repeatable. However, a client implementation can override this, and clients can also pick their preferred qualifiers or even define custom ones. For that reason, building a universal parser is practically impossible.
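The splitting described above can be sketched in a few lines of Python. This is only a minimal sketch: the function name parse_segments is made up for illustration, and the ~ and * delimiters are taken from the example segment rather than read from the file, so a real message may need different values.

```python
def parse_segments(message, segment_sep="~", element_sep="*"):
    """Split a raw message into segments, each a list of element strings.

    The default separators are assumptions based on the DTM example;
    real files may declare different ones.
    """
    segments = []
    for raw in message.split(segment_sep):
        raw = raw.strip()  # tolerate line breaks between segments
        if raw:            # skip the empty piece after the final separator
            segments.append(raw.split(element_sep))
    return segments


message = "DTM*582****RD8*20140101-20140131~"
segments = parse_segments(message)

tag, qualifier = segments[0][0], segments[0][1]
# Splitting on "-" only makes sense for this date-range element.
start, end = segments[0][-1].split("-")
print(tag, qualifier, start, end)  # DTM 582 20140101 20140131
```

Empty elements are kept as empty strings, so the position of each element stays the same as in the original segment.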
+ 2
Parsing a single line is not terribly difficult; basically you can use split() a couple of times and you're done. Making sense of the structure of the data is a bigger challenge. The message is best represented as a tree structure, with branches and leaf nodes, sometimes with multiple levels of nesting. On top of that, several consecutive segments sometimes carry information that belongs closely together, even though nothing in the text marks them as a group.
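To illustrate the tree idea, here is a rough sketch that nests segments under the most recent "group start" segment. Everything here is an assumption for illustration: the tags HDR, LIN and QTY are hypothetical placeholders rather than a real message definition, the function name group_segments is invented, and only one level of nesting is handled.

```python
def group_segments(segments, group_start_tags):
    """Build a one-level tree: each group-start segment collects the
    segments that follow it until the next group-start segment."""
    root = {"tag": "ROOT", "elements": [], "children": []}
    current = root
    for elements in segments:
        node = {"tag": elements[0], "elements": elements[1:], "children": []}
        if node["tag"] in group_start_tags:
            root["children"].append(node)  # new branch directly under the root
            current = node
        else:
            current["children"].append(node)  # leaf under the current branch
    return root


# Hypothetical, already-split segments (not a real X12 message).
segments = [
    ["HDR", "A1"],
    ["LIN", "1"],
    ["DTM", "582", "", "", "", "RD8", "20140101-20140131"],
    ["LIN", "2"],
    ["QTY", "5"],
]
tree = group_segments(segments, group_start_tags={"LIN"})
for child in tree["children"]:
    print(child["tag"], [c["tag"] for c in child["children"]])
```

Which tags start a group, and how deep the nesting goes, depends on the specific message and on what the trading partner actually sends, so that part has to come from their implementation guide.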
How you want to store it in a database is a big question. For the raw format, maybe the best solution is a NoSQL document store like MongoDB. But most often these messages represent structured technical data, typically composed of a header and one or more lines. So you could also represent it in a relational database. In that case you have to write an algorithm that "flattens" the tree: decide which pieces of information belong together, which data you can discard as irrelevant (like the segment tags, the first element in each line), and which values you want to map directly to a dedicated field.
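As a rough sketch of the relational option, the snippet below flattens a header and its lines into two SQLite tables using Python's built-in sqlite3 module. The tags (HDR, LIN, QTY), the table layout and the column names are all hypothetical and chosen only for illustration; a real mapping depends on the message you receive.

```python
import sqlite3

# Hypothetical, already-split segments (not a real X12 message).
segments = [
    ["HDR", "DOC-1"],
    ["LIN", "1", "widget"],
    ["QTY", "5"],
    ["LIN", "2", "gadget"],
    ["QTY", "12"],
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE header (doc_id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE line (doc_id TEXT, line_no INTEGER, item TEXT, quantity INTEGER)"
)

doc_id = None
current_line = None
rows = []
for tag, *elements in segments:  # the tag only steers the mapping, it is not stored
    if tag == "HDR":
        doc_id = elements[0]
        conn.execute("INSERT INTO header VALUES (?)", (doc_id,))
    elif tag == "LIN":
        current_line = {"doc_id": doc_id, "line_no": int(elements[0]),
                        "item": elements[1], "quantity": None}
        rows.append(current_line)
    elif tag == "QTY" and current_line is not None:
        # QTY belongs to the line segment that precedes it
        current_line["quantity"] = int(elements[0])

conn.executemany(
    "INSERT INTO line VALUES (:doc_id, :line_no, :item, :quantity)", rows
)
conn.commit()

result = conn.execute(
    "SELECT doc_id, line_no, item, quantity FROM line ORDER BY line_no"
).fetchall()
print(result)  # [('DOC-1', 1, 'widget', 5), ('DOC-1', 2, 'gadget', 12)]
```

The point is that consecutive segments (here a LIN followed by its QTY) end up as one row, which is the "flattening" step.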