# Parsing MAT files with class objects in them

February 20, 2014. Matt Bauman [first initial][last name] (at) [gmail]. MIT license.

Matlab saves class objects in .mat files using a crazy undocumented scheme. As far as I can tell, few attempts have been made to understand how this information is stored. Matlab itself once allowed loading of unknown class objects as structures, but has recently decided to forbid that behavior (due to class loading rules), and so now everybody, in every language, is subjected to the strange representation of these objects:

% In Matlab:
Warning: Variable 'obj' originally saved as a SimpleClass cannot be instantiated
as an object and will be read in as a uint32.
obj =
[3707764736; 2; 1; 1; 1; 1]

# Or Python (SciPy):
MatlabOpaque([ (b'obj', b'MCOS', b'SimpleClass', [[3707764736], [2], [1], [1], [1], [1]])],
dtype=[('s0', 'O'), ('s1', 'O'), ('s2', 'O'), ('arr', 'O')])



Wholly unhelpful. And very strange. An array of six unsigned integers? Where's the data? And what do those numbers mean?

This chronicles my attempts at parsing this information out of Matfiles. It's a work in progress as I blindly reverse engineer this, and may eventually become a part of Julia's MAT.jl package. If you find any matfiles that don't conform to these expectations, email them to me! Or better yet: don't store your data in any undocumented format!

## The Matfile subsystem for version 5.0

As some folks have already discovered, when this happens there's a hidden, unnamed matrix filled with unsigned int8s stored beyond the bounds of the documented Matfile. SciPy calls it the __function_workspace__, but it's much more general than that -- it's where all the data is for class objects (it just so happens that function workspaces are one such thing stored there). Matlab makes one small mention of this; it's called a subsystem and the only thing official that they tell us is that it lives at the end of some matfiles at a specified offset.
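To make that "specified offset" concrete: as I read the Level 5 MAT-file layout, the 128-byte header is 116 bytes of descriptive text, an 8-byte subsystem data offset, a 2-byte version and a 2-byte endian indicator, and an offset field of all zeros or all spaces means there is no subsystem. A minimal Python sketch under those assumptions (Python rather than this notebook's Julia so that it runs standalone; the helper name `subsystem_offset` and the synthetic header are made up for illustration):

```python
import struct

def subsystem_offset(header):
    """Return the subsystem data offset from a 128-byte MAT v5 header (0 if none)."""
    if len(header) < 128:
        raise ValueError("a MAT v5 header is 128 bytes long")
    # The raw bytes b"IM" indicate a little-endian file, b"MI" a big-endian one.
    endian = {b"IM": "<", b"MI": ">"}.get(header[126:128])
    if endian is None:
        raise ValueError("unrecognized endian indicator")
    field = header[116:124]
    # All zeros or all spaces: no subsystem data in this file.
    if field in (b"\0" * 8, b" " * 8):
        return 0
    (offset,) = struct.unpack(endian + "Q", field)
    return offset

# A synthetic header for illustration only; the offset 0x1a0 is invented.
fake_header = (b"MATLAB 5.0 MAT-file".ljust(116)
               + struct.pack("<Q", 0x1A0)
               + struct.pack("<H", 0x0100)
               + b"IM")
print(hex(subsystem_offset(fake_header)))
```

Seeking to that offset in a real file should land on the hidden, unnamed uint8 matrix described above.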

The bytes in that subsystem matrix? They conform almost exactly (with some very minor caveats) to a matfile themselves and can be parsed quite easily into a bunch of objects. Here's what a simple one looks like in Julia:

In [1]:
using MAT, MAT_v5 # with some special sauce from PR MAT.jl#23 (https://github.com/simonster/MAT.jl/pull/23)
f = matopen("simple.mat")
summarize(f.subsystem) # summarize and xxd are defined in the appendix at the bottom

Dict{ASCIIString,Any}:
"_i1"=>Dict{ASCIIString,Any}:
"MCOS"=>("FileWrapper__",6x1 Array{Any,2}:
[1] 352x1 Array{Uint8,2}: [0x02,0x00,0x00,0x00,0x06,0x00,0x00,0x00,0x70,0x00,…]
[2] 0x0 Array{None,2}: []
[3] 1x4 Array{Any,2}:
[1] Float64: 1.0
[2] ASCIIString: "one"
[3] Float64: 2.0
[4] ASCIIString: "two"
[4] ASCIIString: "another_char"
[5] 1x4 Array{Any,2}:
[1] Float64: 3.0
[2] ASCIIString: "three"
[3] Float64: 4.0
[4] ASCIIString: "four"
[6] 3x1 Array{Any,2}:
[1] Dict{ASCIIString,Any}: {}
[2] Dict{ASCIIString,Any}:
"array_field_2"=>1x6 Array{Float64,2}: [0.0,1.0,2.0,3.0,4.0,5.0]
"char_field_1"=>ASCIIString: "char_field"
[3] Dict{ASCIIString,Any}:
"a"=>Float64: 1.0)
"_i2"=>Dict{ASCIIString,Any}:
"_i1"=>Dict{ASCIIString,Any}:
"MCOS"=>0x0 Array{None,2}: []

The _i1 and _i2 keys have no meaning; it's simply how I've chosen to name elements that don't contain a name (as there may be more than one such array and they would clobber the previous dictionary entry). So this subsystem has one variable, named "MCOS", and it has a bunch of different elements. We can clearly see, though, this is where our data lives! But man is it jumbled. The first element is interesting, though: 352 raw bytes. And that very last element, too... apparently this subsystem has an unnamed subsystem-like matrix at the end of it, too (although there's no offset given in the header information; it just acts like one). But it's empty, so who knows what it'd be used for. I've never seen it populated.

So, now we've got to figure out how to connect our 6 element array of uint32s with the real data. What's that obj variable look like again?

In [2]:
summarize(read(f))

Dict{ASCIIString,Any}:
"obj3"=>("AnotherClass",6-element Array{Uint32,1}: [0xdd000000,0x00000002,0x00000001,0x00000001,0x00000003,0x00000002])
"obj"=>("SimpleClass",6-element Array{Uint32,1}: [0xdd000000,0x00000002,0x00000001,0x00000001,0x00000001,0x00000001])
"obj2"=>("SimpleClass",6-element Array{Uint32,1}: [0xdd000000,0x00000002,0x00000001,0x00000001,0x00000002,0x00000001])

As far as I've seen, the first four elements are always [0xdd000000, 2, 1, 1]. They may be reserved for future features or perhaps they're features that I don't use or haven't happened to trigger yet. It might be possible that there'd be more than one FileWrapper__ object — perhaps one of those elements is its index. If you ever see something different, here or anywhere else, send me those mat files!

The last two elements are, respectively, the object_id and class_id. Pretty simple. But the information on how to connect that to the data in our subsystem is hidden away in that opaque byte array.
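As a small illustration of that layout, here is a Python sketch that splits the six uint32s into named parts (Python so that it runs standalone; `decode_object_ref` is a made-up helper name, and the strict check on the first four values only reflects what I've seen so far, not a documented rule):

```python
def decode_object_ref(words):
    """Split the six uint32s of an opaque MCOS reference into named fields."""
    words = [int(w) for w in words]
    if len(words) != 6:
        raise ValueError("expected six uint32 values, got %d" % len(words))
    # Assumption: the prefix is always [0xdd000000, 2, 1, 1] in the files seen so far.
    if words[:4] != [0xDD000000, 2, 1, 1]:
        raise ValueError("unexpected prefix: %r" % (words[:4],))
    return {"object_id": words[4], "class_id": words[5]}

# The three references from the file above.
refs = {
    "obj":  [0xDD000000, 2, 1, 1, 1, 1],
    "obj2": [0xDD000000, 2, 1, 1, 2, 1],
    "obj3": [0xDD000000, 2, 1, 1, 3, 2],
}
for name, words in refs.items():
    print(name, decode_object_ref(words))
```

Raising on an unexpected prefix is deliberate: a file that trips it is exactly the kind of counterexample worth looking at.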

## The FileWrapper__ byte array

The first element of the FileWrapper__ is a large byte array with an unusual format. It's generally a whole bunch of Int32s, and easiest to read if we treat it as an IO-like stream. I'd guess that the first Int32 is some sort of version number. Here's what the beginning of the data looks like:

In [3]:
mcos = f.subsystem["_i1"]["MCOS"][2]
data = vec(mcos[1])
fdata = IOBuffer(data)

xxd(data,1,0x80)

0000: 02000000 06000000    ........            2            6
0008: 70000000 a0000000    p.......          112          160
0010: a0000000 00010000    ........          160          256
0018: 40010000 60010000    @...`...          320          352
0020: 00000000 00000000    ........            0            0
0028: 63656c6c 5f666965    cell_fie   1819043171   1701406303
0030: 6c645f33 0053696d    ld_3.Sim    861889644   1835619072
0038: 706c6543 6c617373    pleClass   1130720368   1936941420
0040: 00636861 725f6669    .char_fi   1634231040   1768316786
0048: 656c645f 31006172    eld_1.ar   1600416869   1918959665
0050: 7261795f 6669656c    ray_fiel   1601790322   1818585446
0058: 645f3200 416e6f74    d_2.Anot      3301220   1953459777
0060: 68657243 6c617373    herClass   1131570536   1936941420
0068: 00610000 00000000    .a......        24832            0
0070: 00000000 00000000    ........            0            0
0078: 00000000 00000000    ........            0            0


A bunch of little-endian integers, followed by some ASCII data. That first Int32 is always a 2, probably a version or id number, and the second is the number of strings that you see starting at 0x28 (bizarre!). The next six Int32s are segment offsets into this data block - you can see that the first segment offset is the first multiple of 8 bytes after the ASCII strings stop. There are two (perhaps reserved) Int32 zeros and then the strings start at 0x28.

In [4]:
function parse_header(f)
    id = read(f,Uint32) # First element is a version number? Always 2?
    id == 2 || error("unknown first field (version/id?): ", id)

    # Second element is the number of strings
    n_strs = read(f,Int32)

    # Followed by up to 6 section offsets (the last two sections seem to be unused)
    offsets = read(f,Int32,6)

    # And two reserved fields
    all(read(f,Int32,2) .== 0) || error("reserved header fields are nonzero")

    # And now we're at the string data section
    @assert position(f) == 0x28
    strs = Array(ASCIIString,n_strs)
    for i = 1:n_strs
        # simply delimited by nulls
        strs[i] = readuntil(f, '\0')[1:end-1] # drop the trailing null byte
    end

    (offsets,strs)
end

seek(fdata,0)
segments, strs = parse_header(fdata)
summarize(strs)

6-element Array{ASCIIString,1}:
[1] ASCIIString: "cell_field_3"
[2] ASCIIString: "SimpleClass"
[3] ASCIIString: "char_field_1"
[4] ASCIIString: "array_field_2"
[5] ASCIIString: "AnotherClass"
[6] ASCIIString: "a"
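As a cross-check on the header layout, here is the same walk as a minimal Python sketch, run against the 112 bytes copied from the hex dump above. Python is used so the snippet runs standalone; `parse_mcos_header` and `HEADER_HEX` are made-up names, and the final alignment check only confirms what this one file shows.

```python
import struct

# Bytes 0x00-0x6f of the FileWrapper__ byte array, copied from the hex dump.
HEADER_HEX = """
02000000 06000000 70000000 a0000000 a0000000 00010000 40010000 60010000
00000000 00000000 63656c6c 5f666965 6c645f33 0053696d 706c6543 6c617373
00636861 725f6669 656c645f 31006172 7261795f 6669656c 645f3200 416e6f74
68657243 6c617373 00610000 00000000
"""

def parse_mcos_header(buf):
    """Return (segment offsets, strings, position just past the last string)."""
    version, n_strs = struct.unpack_from("<2I", buf, 0)
    if version != 2:
        raise ValueError("unknown first field (version/id?): %d" % version)
    offsets = struct.unpack_from("<6I", buf, 0x08)
    if struct.unpack_from("<2I", buf, 0x20) != (0, 0):
        raise ValueError("reserved header fields are nonzero")
    pos, strs = 0x28, []
    for _ in range(n_strs):
        end = buf.index(b"\0", pos)  # strings are simply null-delimited
        strs.append(buf[pos:end].decode("ascii"))
        pos = end + 1
    return offsets, strs, pos

data = bytes.fromhex("".join(HEADER_HEX.split()))
offsets, strs, pos = parse_mcos_header(data)
print(offsets)
print(strs)
# The first segment starts at the next multiple of 8 after the strings.
print((pos + 7) // 8 * 8 == offsets[0])
```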

### Segment 1: Class information

The first demarcated segment seems to describe the class information. I've not managed to save fancy enough classes that expose all of these fields, but it at least enumerates the classes, their names, and their package names (using the indexes into that heap of strings).

In [5]:
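function parse_class_info(f,strs,section_end)
    # The first four int32s unknown. Always 0? Or is this simply an empty slot for another class?
    all(read(f,Int32,4) .== 0) || error("unknown header to class information")

    classes = Array((ASCIIString,ASCIIString),0)
    while position(f) < section_end
        # Each class: a package name index, a class name index, and two unknown fields
        package_idx = read(f,Int32)
        name_idx = read(f,Int32)
        unknowns = read(f,Int32,2)
        package = package_idx > 0 ? strs[package_idx] : ""
        name = name_idx > 0 ? strs[name_idx] : ""
        all(unknowns .== 0) || error("discovered a nonzero class property for ",name)
        push!(classes,(package, name))
    end
    classes
end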

seek(fdata,segments[1])
classes = parse_class_info(fdata,strs, segments[2])

Out[5]:
2-element Array{(ASCIIString,ASCIIString),1}:
("","SimpleClass")
("","AnotherClass")
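To illustrate the record layout this implies, here is a Python sketch (Python so that it runs standalone). The hex dump above stops before segment 1, so `SEGMENT1` below is a synthetic 48-byte buffer built to reproduce the output shown, not bytes copied from the file; the four-Int32 record (package index, name index, two unknowns) is my reading of the segment size and field names.

```python
import struct

STRS = ["cell_field_3", "SimpleClass", "char_field_1",
        "array_field_2", "AnotherClass", "a"]

def parse_class_info(seg, strs):
    """Return a list of (package, name) tuples from the class-info segment."""
    if len(seg) % 16 != 0 or len(seg) < 16:
        raise ValueError("expected a whole number of 4-Int32 records")
    records = list(struct.iter_unpack("<4i", seg))
    if any(records[0]):
        raise ValueError("unknown header to class information")

    def lookup(idx):
        # String indexes are 1-based; 0 means "no string".
        return strs[idx - 1] if idx > 0 else ""

    classes = []
    for package_idx, name_idx, unknown1, unknown2 in records[1:]:
        if unknown1 or unknown2:
            raise ValueError("nonzero class property for " + lookup(name_idx))
        classes.append((lookup(package_idx), lookup(name_idx)))
    return classes

# Synthetic stand-in for segment 1: a zero header, then two class records.
SEGMENT1 = struct.pack("<12i", 0, 0, 0, 0,  0, 2, 0, 0,  0, 5, 0, 0)
print(parse_class_info(SEGMENT1, STRS))
```

The 48 bytes match the gap between the first two segment offsets (112 and 160), which is what suggests one four-Int32 slot per class plus one empty slot up front.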

### Segment 2: Object properties that contain other objects

The second segment is only sometimes there; when it's absent, offsets[2] == offsets[3]. When it is present, it contains information about those properties of each object that hold other objects. Each set has a variable number of subelements, one for each property. For this matfile it is empty, as there are no properties that contain other objects.

In [6]:
function parse_properties(f::IO,names,heap,section_end)
    props = Array(Dict{ASCIIString,Any},0)
    position(f) >= section_end && return props

    # The first two int32s are unknown. They appear to always be 0.
    all(read(f,Int32,2) .== 0) || error("unknown header to properties segment")

    # sizehint: 8 int32s would be 2 props per object; this is overly generous
    sizehint(props,iceil((section_end-position(f))/(8*4)))

    while position(f) < section_end
        # For each class, there is first an Int32 describing the number of properties
        start_offset = position(f)
        nprops = read(f,Int32)
        d = Dict{ASCIIString,Any}()
        sizehint(d,nprops)
        for i=1:nprops
            # For each property, there is an index into our strings
            name_idx = read(f,Int32)
            # A flag describing how the heap_idx is to be interpreted
            flag = read(f,Int32)
            # And a value; often an index into some data structure
            heap_idx = read(f,Int32)

            if flag == 0
                # This means that the property is stored in the names array
                d[names[name_idx]] = names[heap_idx]
            elseif flag == 1
                # The property is stored in the MCOS FileWrapper__ heap
                d[names[name_idx]] = heap[heap_idx+3] # But... the index is off by 3!? Crazy.
            elseif flag == 2
                # The property is a boolean, and the heap_idx itself is the value
                @assert 0 <= heap_idx <= 1 "boolean flag has a value other than 0 or 1"
                d[names[name_idx]] = bool(heap_idx)
            else
                error("unknown flag ",flag, " for property ",names[name_idx], " with heap index ",heap_idx)
            end
        end
        push!(props,d)

        # Each set of properties is padded out to a multiple of 8 bytes
        if position(f) % 8 != 0
            seek(f,iceil(position(f)/8)*8)
        end
    end
    props
end

seek(fdata,segments[2])
seg2_props = parse_properties(fdata,strs,mcos,segments[3])
summarize(seg2_props)

0-element Array{Dict{ASCIIString,Any},1}: 

### Segment 3: Object information

This section has one element per object. There's an index into the class structure, followed by a few unknown fields. Then there are two fields that describe where the property information is stored: either in segment 2 or in segment 4.

In [7]:
function parse_object_info(f, section_end)
    # The first six int32s unknown. Always 0? Or perhaps reserved space for an extra elt?
    all(read(f,Int32,6) .== 0) || error("unknown header to object information")

    object_info = Array((Int,Int,Int,Int),0)
    while position(f) < section_end
        class_idx = read(f,Int32)    # The index into the class list from segment 1
        unknown1 = read(f,Int32)
        unknown2 = read(f,Int32)
        segment1_idx = read(f,Int32) # The index into segment 2
        segment2_idx = read(f,Int32) # The index into segment 4
        obj_id = read(f,Int32)

        @assert unknown1 == unknown2 == 0 "discovered a nonzero object property"
        push!(object_info,(class_idx,segment1_idx,segment2_idx,obj_id))
    end
    object_info
end

seek(fdata,segments[3])
obj_info = parse_object_info(fdata,segments[4])
# Let's map the class_idx to the classname so it's a bit more readable
summarize(map(x -> (classes[x[1]][2],x[2],x[3],x[4]), obj_info))

3-element Array{(ASCIIString,Int64,Int64,Int64),1}:
[1] ("SimpleClass",0,1,1)
[2] ("SimpleClass",0,2,2)
[3] ("AnotherClass",0,3,3)
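The six-Int32 record this suggests can be sketched in Python as follows (Python so that it runs standalone). `SEGMENT3` is a synthetic 96-byte buffer built to reproduce the output above, not bytes copied from the file, and the field order inside each record is my assumption from the description: class index, two unknowns, the two property indexes, then the object id.

```python
import struct
from typing import NamedTuple

class ObjectInfo(NamedTuple):
    class_idx: int   # 1-based index into the class list from segment 1
    seg2_idx: int    # 1-based index into segment 2 (0 if unused)
    seg4_idx: int    # 1-based index into segment 4 (0 if unused)
    obj_id: int

def parse_object_info(seg):
    """Return one ObjectInfo per object stored in the object-info segment."""
    if len(seg) % 24 != 0 or len(seg) < 24:
        raise ValueError("expected a whole number of 6-Int32 records")
    records = list(struct.iter_unpack("<6i", seg))
    if any(records[0]):
        raise ValueError("unknown header to object information")
    objects = []
    for class_idx, unk1, unk2, seg2_idx, seg4_idx, obj_id in records[1:]:
        if unk1 or unk2:
            raise ValueError("discovered a nonzero object property")
        objects.append(ObjectInfo(class_idx, seg2_idx, seg4_idx, obj_id))
    return objects

# Synthetic stand-in for segment 3: a zero header, then three object records.
SEGMENT3 = struct.pack("<24i",
                       0, 0, 0, 0, 0, 0,
                       1, 0, 0, 0, 1, 1,
                       1, 0, 0, 0, 2, 2,
                       2, 0, 0, 0, 3, 3)
for info in parse_object_info(SEGMENT3):
    print(info)
```

The 96 bytes match the distance between the third and fourth segment offsets (160 and 256), which is consistent with three objects plus one empty leading slot.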

### Segment 4: More properties!

Just like segment 2, except these properties contain things that aren't class objects. Strange that these two segments aren't adjacent...

In [8]:
seek(fdata,segments[4])
seg4_props = parse_properties(fdata,strs,mcos,segments[5])
summarize(seg4_props)

3-element Array{Dict{ASCIIString,Any},1}:
[1] Dict{ASCIIString,Any}:
"cell_field_3"=>1x4 Array{Any,2}:
[1] Float64: 1.0
[2] ASCIIString: "one"
[3] Float64: 2.0
[4] ASCIIString: "two"
[2] Dict{ASCIIString,Any}:
"cell_field_3"=>1x4 Array{Any,2}:
[1] Float64: 3.0
[2] ASCIIString: "three"
[3] Float64: 4.0
[4] ASCIIString: "four"
"char_field_1"=>ASCIIString: "another_char"
[3] Dict{ASCIIString,Any}: {}
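Here is a Python sketch of how such a property segment can be walked, including the three flag interpretations and the padding to 8 bytes after each set (Python so that it runs standalone). `SEGMENT4` and `HEAP` are synthetic stand-ins built to reproduce the output above, not data copied from the file, and the two leading zero Int32s are an assumption that makes the 64-byte segment length work out.

```python
import struct

STRS = ["cell_field_3", "SimpleClass", "char_field_1",
        "array_field_2", "AnotherClass", "a"]
# Stand-in for the FileWrapper__ cell array: byte array, empty slot, then values.
HEAP = [b"", None, [1.0, "one", 2.0, "two"], "another_char",
        [3.0, "three", 4.0, "four"], "defaults go here"]

def parse_properties(seg, names, heap):
    """Return one dict of properties per entry in a property segment."""
    props = []
    if not seg:
        return props
    if struct.unpack_from("<2i", seg, 0) != (0, 0):
        raise ValueError("unknown header to properties segment")
    pos = 8
    while pos < len(seg):
        (nprops,) = struct.unpack_from("<i", seg, pos)
        pos += 4
        entry = {}
        for _ in range(nprops):
            name_idx, flag, value = struct.unpack_from("<3i", seg, pos)
            pos += 12
            name = names[name_idx - 1]          # string indexes are 1-based
            if flag == 0:
                entry[name] = names[value - 1]  # value is another string index
            elif flag == 1:
                entry[name] = heap[value + 2]   # heap[value+3] with 1-based indexing
            elif flag == 2:
                entry[name] = bool(value)       # value is the boolean itself
            else:
                raise ValueError("unknown flag %d for property %s" % (flag, name))
        props.append(entry)
        pos = (pos + 7) // 8 * 8                # each set is padded to 8 bytes
    return props

SEGMENT4 = struct.pack("<16i", 0, 0,  1, 1, 1, 0,  2, 1, 1, 2, 3, 1, 1, 0,  0, 0)
for entry in parse_properties(SEGMENT4, STRS, HEAP):
    print(entry)
```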

### Segment 5: Empty?

I've never seen this populated, so I have no idea what is going on here.

In [9]:
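function parse_segment5(f, segment_end)
    # Read whatever is left in the segment and dump it if anything is nonzero
    seg5 = read(f,Uint8,segment_end-position(f))
    if any(seg5 .!= 0)
        xxd(seg5)
    end

    @assert segment_end == position(f) && eof(f) "there's more data to be had!"
end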

seek(fdata,segments[5])
parse_segment5(fdata, segments[6])


## Putting it all together

We're still missing quite a bit here: the object properties from segment 4 are still incomplete! And they're incomplete in strange ways. We've got three objects of two different classes: SimpleClass should have properties char_field_1, array_field_2 and cell_field_3, and AnotherClass just has one property, a. But the two SimpleClass objects have different fields populated! The array_field_2 property is totally missing from both, and the AnotherClass object is totally empty!

There's something we haven't used yet in the FileWrapper__ array: the last element.

In [10]:
println("The last element of FileWrapper__'s array:")
print("  ")
summarize(mcos[end],"  ")

The last element of FileWrapper__'s array:
3x1 Array{Any,2}:
[1] Dict{ASCIIString,Any}: {}
[2] Dict{ASCIIString,Any}:
"array_field_2"=>1x6 Array{Float64,2}: [0.0,1.0,2.0,3.0,4.0,5.0]
"char_field_1"=>ASCIIString: "char_field"
[3] Dict{ASCIIString,Any}:
"a"=>Float64: 1.0

Those look like shared/default properties for each class! Ordered in class order. But we're off by one? Again? Oy (I have a hunch that Matlab's implementation was coded in C by someone so very accustomed to 1-indexed arrays that they pretend index 0 doesn't exist…). It's also important to note that these shared properties are not related to class property defaults. Let's merge these default values with the properties we got from segments 2 and 4 above:

In [11]:
objs = Array(Dict{ASCIIString,Any},length(obj_info))
for (i,info) in enumerate(obj_info)
# Get the property from either segment 2 or segment 4
props = info[2] > 0 ? seg2_props[info[2]] : seg4_props[info[3]]
# And merge it with the matfile defaults for this class
objs[i] = merge(mcos[end][info[1]+1],props)
end
summarize(objs)

3-element Array{Dict{ASCIIString,Any},1}:
[1] Dict{ASCIIString,Any}:
"cell_field_3"=>1x4 Array{Any,2}:
[1] Float64: 1.0
[2] ASCIIString: "one"
[3] Float64: 2.0
[4] ASCIIString: "two"
"array_field_2"=>1x6 Array{Float64,2}: [0.0,1.0,2.0,3.0,4.0,5.0]
"char_field_1"=>ASCIIString: "char_field"
[2] Dict{ASCIIString,Any}:
"cell_field_3"=>1x4 Array{Any,2}:
[1] Float64: 3.0
[2] ASCIIString: "three"
[3] Float64: 4.0
[4] ASCIIString: "four"
"array_field_2"=>1x6 Array{Float64,2}: [0.0,1.0,2.0,3.0,4.0,5.0]
"char_field_1"=>ASCIIString: "another_char"
[3] Dict{ASCIIString,Any}:
"a"=>Float64: 1.0

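The merge itself is simple enough to restate as a Python sketch (Python so that it runs standalone; the literals below are hand-copied from the outputs above, and `merge_objects` is a made-up name). Per-object properties win over the per-class shared values, and the off-by-one in the shared list means a 1-based class index can be used directly as a 0-based list index.

```python
def merge_objects(obj_info, seg2_props, seg4_props, shared):
    """Combine each object's own properties with its class's shared values."""
    objs = []
    for class_idx, seg2_idx, seg4_idx, _obj_id in obj_info:
        # Property sets come from segment 2 if indexed there, else segment 4.
        own = seg2_props[seg2_idx - 1] if seg2_idx > 0 else seg4_props[seg4_idx - 1]
        # shared[0] is the empty leading slot, so shared[class_idx] is this class.
        objs.append({**shared[class_idx], **own})
    return objs

obj_info = [(1, 0, 1, 1), (1, 0, 2, 2), (2, 0, 3, 3)]
seg4_props = [
    {"cell_field_3": [1.0, "one", 2.0, "two"]},
    {"cell_field_3": [3.0, "three", 4.0, "four"], "char_field_1": "another_char"},
    {},
]
shared = [
    {},
    {"array_field_2": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0], "char_field_1": "char_field"},
    {"a": 1.0},
]
for obj in merge_objects(obj_info, [], seg4_props, shared):
    print(obj)
```

Putting the shared values first in the dict merge is what lets the second object keep its own `char_field_1` while still inheriting `array_field_2`.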
## We did it! (Well... for this file)

Here's what Matlab says these objects should be:

>> load('simple.mat')
>> display(obj); display(obj2); display(obj3)
obj =
SimpleClass with properties:

char_field_1: 'char_field'
array_field_2: [0 1 2 3 4 5]
cell_field_3: {[1]  'one'  [2]  'two'}

obj2 =
SimpleClass with properties:

char_field_1: 'another_char'
array_field_2: [0 1 2 3 4 5]
cell_field_3: {[3]  'three'  [4]  'four'}

obj3 =
AnotherClass with properties:

a: 1



What a mess, though. I guess it's not terribly surprising that this is undocumented.

#### Appendix

In [12]:
# More complicated files
f = matopen("fiobj.mat")
mcos = f.subsystem["_i1"]["MCOS"][2]
data = vec(mcos[1])
fdata = IOBuffer(data)

seek(fdata,0)
segments, strs = parse_header(fdata)

seek(fdata,segments[1])
classes = parse_class_info(fdata,strs, segments[2])

seek(fdata,segments[2])
seg2_props = parse_properties(fdata,strs,mcos,segments[3])

seek(fdata,segments[3])
obj_info = parse_object_info(fdata,segments[4])

seek(fdata,segments[4])
seg4_props = parse_properties(fdata,strs,mcos,segments[5])

seek(fdata,segments[5])
parse_segment5(fdata, segments[6])

objs = Array(Dict{ASCIIString,Any},length(obj_info))
for (i,info) in enumerate(obj_info)
# Get the property from either segment 2 or segment 4
props = info[2] > 0 ? seg2_props[info[2]] : seg4_props[info[3]]
# And merge it with the matfile defaults for this class
objs[i] = merge(mcos[end][info[1]+1],props)
end
summarize(objs)
println()
summarize(mcos)

1-element Array{Dict{ASCIIString,Any},1}:
[1] Dict{ASCIIString,Any}:
"Scaling"=>ASCIIString: "BinaryPoint"
"RoundingMethod"=>ASCIIString: "Nearest"
"SumBias"=>Float64: 0.0
"Bias"=>Float64: 0.0
"ProductFixedExponent"=>Float64: -30.0
"SumWordLength"=>Float64: 32.0
"CastBeforeSum"=>Bool: true
"ProductFractionLength"=>Float64: 30.0
"Signed"=>Bool: true
"MaxProductWordLength"=>Float64: 65535.0
"SumMode"=>ASCIIString: "FullPrecision"
"nunderflows"=>Float64: 0.0
"minlog"=>Float64: 1.7976931348623157e308
"DataType"=>ASCIIString: "Fixed"
"ProductBias"=>Float64: 0.0
"Logging"=>Bool: false
"maxlog"=>Float64: -1.7976931348623157e308
"ProductMode"=>ASCIIString: "FullPrecision"
"fimathislocal"=>Bool: false
"SumSlope"=>Float64: 9.313225746154785e-10
"intarray"=>1x51 Array{Int16,2}: [0,328,655,983,1311,1638,1966,2294,2621,2949,…]
"DataTypeOverride"=>ASCIIString: "Inherit"
"SumFractionLength"=>Float64: 30.0
"ProductWordLength"=>Float64: 32.0
"ProductSlope"=>Float64: 9.313225746154785e-10
"FixedExponent"=>Float64: -15.0
"OverflowAction"=>ASCIIString: "Saturate"
"MaxSumWordLength"=>Float64: 65535.0
"noverflows"=>Float64: 0.0
"SumFixedExponent"=>Float64: -30.0
"WordLength"=>Float64: 16.0
28x1 Array{Any,2}:
[1] 1104x1 Array{Uint8,2}: [0x02,0x00,0x00,0x00,0x2a,0x00,0x00,0x00,0x48,0x02,…]
[2] 0x0 Array{None,2}: []
[3] Bool: true
[4] Float64: 16.0
[5] Float64: -15.0
[6] Float64: 1.0
[7] Float64: 0.0
[8] 1x51 Array{Int16,2}: [0,328,655,983,1311,1638,1966,2294,2621,2949,…]
[9] Float64: 0.0
[10] Float64: 0.0
[11] Float64: -1.7976931348623157e308
[12] Float64: 1.7976931348623157e308
[13] Float64: 32.0
[14] Float64: 32.0
[15] Float64: 65535.0
[16] Float64: 65535.0
[17] Float64: 30.0
[18] Float64: -30.0
[19] Float64: 9.313225746154785e-10
[20] Float64: 1.0
[21] Float64: 0.0
[22] Float64: 30.0
[23] Float64: -30.0
[24] Float64: 9.313225746154785e-10
[25] Float64: 1.0
[26] Float64: 0.0
[27] Bool: true
[28] 2x1 Array{Any,2}:
[1] Dict{ASCIIString,Any}: {}
[2] Dict{ASCIIString,Any}: {}
In [ ]:
# Simple utilities for viewing hex and big nested data structures
cleanascii!{N}(A::Array{Uint8,N}) = (A[(A .< 0x20) | (A .> 0x7e)] = uint8('.'); A)
function xxd(x, start=1, stop=length(x))
for i=div(start-1,8)*8+1:8:stop
row = i:i+7
@printf("%04x: ",i-1)
for r=row
start <= r <= stop ? @printf("%02x",x[r]) : print("  ")
r % 4 == 0 && print(" ")
end
# ASCII
print("   ",ascii(cleanascii!(x[i:min(i+7,end)]))," ")
# Int32
for j=i:4:i+7
start <= j && j+3 <= stop ? @printf("% 12d ",reinterpret(Int32,x[j:j+3])[1]) : print(" "^12)
end
# Float64:
# start <= i && i+7 <= stop ? @printf("%.3e",reinterpret(Float64,x[row])[1]) : nothing
println()
end
end
# Summarize - smartly display large nested data structures for some datatypes
summarize(x::Any,prefix="") = print(string(summary(x)))
summarize(x::String,prefix="") = print(string(summary(x),": \"", x, "\""))
summarize(x::Real,prefix="") = print(string(summary(x),": ", x))
function summarize(x::Tuple,prefix="")
print("(")
i = start(x);
while !done(x,i)
t,i = next(x,i)
if isa(t,String)
print("\"",t,"\"")
elseif isa(t,Real)
print(t)
else
summarize(t,string(prefix,"  "))
end
!done(x,i) && print(",")
end
print(")")
end
function summarize(x::Dict,prefix="")
print(string(summary(x),": ",(isempty(x) ? "{}" : "")))
i = start(x)
while !done(x,i)
(v,i) = next(x,i)
if typeof(v[1])<:String
println()
print(prefix,"  \"",v[1],"\"=>")
summarize(v[2],string(prefix,"    "))
else
println()
print(prefix,"  "); summarize(v[1]); print("=>") # summarize prints directly and returns nothing
summarize(v[2],string(prefix,"    "))
end
end
end
function summarize{T,N}(x::AbstractArray{T,N},prefix="")
print(string(summary(x),": "))
if T<:Real
truncate = length(x) > 10
maxelt = truncate ? 10 : length(x)
# This is very wrong, but it works for the purposes above...
Base.show_comma_array(STDOUT,x[1:min(length(x),maxelt)],"[",(truncate ? ",…]" : "]"))
else
i = start(x)
while !done(x,i)
(v,i) = next(x,i)
println()
print(prefix,"  [$(i-1)] ")
summarize(v,string(prefix,"     "))
end
end
end;

Copyright (C) 2014 Matt Bauman, [first initial][last name] (at) [gmail]

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.