I wish there were, in C, some equivalent of "struct" but where you could specify...

masklinn · on April 4, 2012

> I wish there were, in C, some equivalent of "struct" but where you could specify

I don't think there's a need for a struct equivalent; I think there's a need for a struct loader with these capabilities. It would only be in charge of packing and unpacking, but would shove everything into a struct.

Basically, Erlang's bit syntax for C (though the bit syntax unpacks to locals, not structs):

    << Foo:16/little, Bar:12/signed, Baz:4 >> = Bin.

(The default type specification is integer, unsigned and big-endian. Between the : and the / is the size of the data in "units", where Unit defaults to 8 bits for the binary type and 1 bit for integers, floats and bitstrings.)

Doesn't natively do alignment though, the developer has to pad on his own.
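For comparison, here is a rough sketch of what that one-line pattern has to be spelled out as in plain C. The struct and function names are made up for illustration; the only thing taken from the Erlang example is the layout (16 bits little-endian, then 12 bits signed big-endian, then 4 bits unsigned, consumed most significant bit first).

```c
#include <stdint.h>

/* Hypothetical struct and function names, chosen for illustration only. */
struct unpacked {
    uint16_t foo; /* Foo:16/little */
    int16_t  bar; /* Bar:12/signed (big-endian by default) */
    uint8_t  baz; /* Baz:4 */
};

static struct unpacked unpack_foo_bar_baz(const uint8_t bin[4])
{
    struct unpacked out;

    /* Foo: two bytes, least significant byte first. */
    out.foo = (uint16_t)(bin[0] | (bin[1] << 8));

    /* Bar: all of byte 2 plus the high nibble of byte 3. */
    unsigned raw = ((unsigned)bin[2] << 4) | (bin[3] >> 4);

    /* Sign-extend from 12 bits: flip the sign bit, then subtract it. */
    out.bar = (int16_t)((int)(raw ^ 0x800u) - 0x800);

    /* Baz: the low nibble of byte 3. */
    out.baz = (uint8_t)(bin[3] & 0x0F);

    return out;
}
```

Every shift and mask here is something the Erlang pattern derives from the field sizes, which is why hand-written versions tend to go wrong.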

Python's `struct` module is similar[0]: http://docs.python.org/library/struct.html although the format string is basically unreadable and I believe it's absolutely terrible at decoding non-standard sizes (e.g. an int stored on 3 bits)

[0] As is libpack[1] for C. Perl also has this[2], which was probably the inspiration for Python's.

[1] http://www.leonerd.org.uk/code/libpack/intro.html

[2] http://perldoc.perl.org/functions/unpack.html

ge0rg · on April 4, 2012

Yeah, C is really missing a way to serialize/deserialize data from raw memory/sockets into structs usable by your code. The least insane way, libpack [1], requires replicating the data format definition three times:

* define the struct with all elements

* define a string for the binary representation

* call fpack/funpack with the string and all the struct elements as parameters...

Unfortunately, fixing this either requires some kind of black X-macro [2] magic or another template language used to write the specification and to generate the three above-mentioned representations from it...

[1] http://www.leonerd.org.uk/code/libpack/intro.html

[2] http://drdobbs.com/184401387

masklinn · on April 4, 2012

> or another template language

Surely this could be handled via simple syntactic extensions to the struct specification (with everything wrapped into an ungodly macro from hell) in order to define the mapping between the struct itself and libpack's format string, no?

ge0rg · on April 4, 2012

The problem is that you need to replicate the struct entries in the pack/unpack calls as well, which is only possible in plain C by using X-Macros.

It might be possible to construct a macro that creates both the struct and the format string, though.

masklinn · on April 4, 2012

> The problem is that you need to replicate the struct entries in the pack/unpack calls as well

Don't you only need the (generated) format string? Ideally, the macro could generate some wrapper function of some sort as well, which would unpack, fill and return an instance of the struct.
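A minimal X-macro sketch of the idea discussed in this sub-thread: one field list that expands into the struct, a format string and an unpacking wrapper. Everything here is an assumption for illustration. The field names, the one-letter format codes and the `read_be` helper are invented, and this does not use libpack's actual format syntax or API; all fields are simply read as big-endian unsigned integers.

```c
#include <stddef.h>
#include <stdint.h>

/* Single source of truth: one X(type, name, format_code) entry per field.
 * The field list and the format codes are hypothetical. */
#define HEADER_FIELDS(X)     \
    X(uint16_t, id,     "w") \
    X(uint32_t, length, "d") \
    X(uint8_t,  flags,  "b")

/* Expansion 1: the struct definition. */
#define AS_MEMBER(type, name, fmt) type name;
struct header { HEADER_FIELDS(AS_MEMBER) };

/* Expansion 2: the format string, built by string literal concatenation. */
#define AS_FORMAT(type, name, fmt) fmt
static const char header_format[] = HEADER_FIELDS(AS_FORMAT);

/* Helper: read n big-endian bytes and advance the cursor. */
static uint64_t read_be(const uint8_t **p, size_t n)
{
    uint64_t v = 0;
    while (n--)
        v = (v << 8) | *(*p)++;
    return v;
}

/* Expansion 3: a wrapper that unpacks, fills and returns the struct. */
#define AS_UNPACK(type, name, fmt) out.name = (type)read_be(&p, sizeof(type));
static struct header header_unpack(const uint8_t *p)
{
    struct header out;
    HEADER_FIELDS(AS_UNPACK)
    return out;
}
```

The field list is written once, so the struct, the format string and the unpack calls cannot drift apart. The price is that the definition no longer looks like an ordinary struct.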

signa11 · on April 4, 2012

Just took a brief look at it. It doesn't seem to support arbitrary bit-fields, e.g. how would you extract the next 13 bits from a binary stream?
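For reference, pulling an arbitrary number of bits out of a byte buffer by hand looks roughly like the sketch below. The `bit_reader` type and `read_bits` function are hypothetical names, not part of libpack or any other library, and the sketch assumes bits are consumed most significant bit first.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cursor over a byte buffer, counted in bits. */
struct bit_reader {
    const uint8_t *data;
    size_t size;    /* buffer length in bytes */
    size_t bit_pos; /* number of bits already consumed */
};

/* Read the next nbits (1..32), most significant bit first.
 * Returns 0 on success, -1 if nbits is out of range or the buffer
 * does not hold that many remaining bits. */
static int read_bits(struct bit_reader *r, unsigned nbits, uint32_t *out)
{
    if (nbits == 0 || nbits > 32 || nbits > r->size * 8 - r->bit_pos)
        return -1;

    uint32_t v = 0;
    for (unsigned i = 0; i < nbits; i++) {
        size_t byte = r->bit_pos >> 3;
        unsigned shift = 7u - (unsigned)(r->bit_pos & 7);
        v = (v << 1) | ((r->data[byte] >> shift) & 1u);
        r->bit_pos++;
    }
    *out = v;
    return 0;
}
```

Reading one bit per iteration is slow but easy to check; a real implementation would probably work a byte or a word at a time.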

epe · on April 4, 2012

Go has a great solution for this in the binary package:

http://golang.org/pkg/encoding/binary/#Read

I've been working on some code to parse the shapefile format, which specifies some fields in big-endian and some in little (I have no idea why), but the binary package has made it really easy to deal with.
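For comparison, a hand-rolled C sketch of the same mixed-endian situation. The struct and function names are invented, only two fields are decoded, and the offsets (file code at byte 0, big-endian; version at byte 28, little-endian, in a 100-byte header) are my reading of the published shapefile layout, so check them against the spec before relying on this.

```c
#include <stdint.h>

/* Assemble a 32-bit value from 4 bytes, most significant byte first. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Assemble a 32-bit value from 4 bytes, least significant byte first. */
static uint32_t read_le32(const uint8_t *p)
{
    return  (uint32_t)p[0]        | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical subset of the header: just two of its fields. */
struct shp_header_fields {
    uint32_t file_code; /* stored big-endian at offset 0 (assumed) */
    uint32_t version;   /* stored little-endian at offset 28 (assumed) */
};

static struct shp_header_fields parse_shp_header(const uint8_t header[100])
{
    struct shp_header_fields f;
    f.file_code = read_be32(header + 0);
    f.version   = read_le32(header + 28);
    return f;
}
```

Because the readers work on bytes rather than casting the buffer, the result does not depend on the host's byte order or alignment rules.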

alexchamberlain · on April 4, 2012

You can't tackle the BO issue here, as it is a property of the underlying architecture.

However, the other 2 issues can be tackled. The first by using a packed struct ( __attribute__((__packed__)) in gcc) and the second by using stdint.h.
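A small sketch of what that combination looks like, assuming GCC or Clang (the attribute is a compiler extension, not standard C) and a C11 compiler for `_Static_assert`. The struct names and fields are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire layout: fixed-width types from stdint.h, no padding. */
struct wire_header {
    uint8_t  type;
    uint32_t length;
    uint16_t checksum;
} __attribute__((__packed__));

/* The same fields without the attribute; the compiler may insert padding. */
struct natural_header {
    uint8_t  type;
    uint32_t length;
    uint16_t checksum;
};

/* With packing, the size is exactly the sum of the field sizes. */
_Static_assert(sizeof(struct wire_header) == 7,
               "wire_header must have no padding");
_Static_assert(offsetof(struct wire_header, length) == 1,
               "length must directly follow type");
```

Note that the multi-byte fields are still stored in the host's byte order, so this covers padding and field widths only.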

adrianmsmith · on April 4, 2012

> You can't tackle the BO issue here, as it is a property of the underlying architecture.

I disagree; if you have "littleendian int32 myfield" for example, every time you reference myfield on a big-endian architecture, the compiler inserts the necessary byte manipulation code, just like the guy does manually in the original post.

alexchamberlain · on April 4, 2012

Is this efficient, though? Surely one conversion designed in is faster than a conversion every time you read the variable.

pmjordan · on April 4, 2012

You shouldn't in general be using on-disk/network binary formats for your in-memory representation.

alexchamberlain · on April 4, 2012

I agree. I think you may have misunderstood my comment. The wire representation should be read into memory via any byte order manipulations that are needed.

pmjordan · on April 4, 2012

Why "every time you read the variable" then? You'd have a packed-binary-struct and a memory-representation-struct and conversions between them to use when reading or writing. I say "would", but I actually already do this. I have typedefs of the kind

  typedef union {
    uint8_t bytes[sizeof(uint64_t)];
    uint64_t native; /* alignment hint */
  } uint64le_t;

which I use in structs of my on-disk data structures.

  struct ondisk_range
  {
    uint64le_t start;
    uint64le_t end;
  };

Then for actually using that data:

  struct range
  {
    uint64_t start;
    uint64_t end;
  };

The conversion functions basically just call functions with these prototypes on the 2 fields:

  uint64_t le64_to_cpu(uint64le_t le_val);
  uint64le_t cpu_to_le64(uint64_t val);

Which internally just read out the byte array and turn it into an integer and vice versa.

(I also have static assertions for the expected size following each ondisk struct)
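For readers who want to see the whole thing in one place, here is one possible way to fill in those two prototypes, plus the size assertion and a conversion between the two structs. The bodies and the `range_from_disk` name are guesses for illustration, not the actual code; `_Static_assert` assumes C11.

```c
#include <stdint.h>

typedef union {
    uint8_t bytes[sizeof(uint64_t)];
    uint64_t native; /* alignment hint */
} uint64le_t;

/* Build the integer from the byte array, least significant byte first. */
static uint64_t le64_to_cpu(uint64le_t le_val)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | le_val.bytes[i];
    return v;
}

/* Split the integer into bytes, least significant byte first. */
static uint64le_t cpu_to_le64(uint64_t val)
{
    uint64le_t le;
    for (int i = 0; i < 8; i++) {
        le.bytes[i] = (uint8_t)(val & 0xFF);
        val >>= 8;
    }
    return le;
}

struct ondisk_range {
    uint64le_t start;
    uint64le_t end;
};
_Static_assert(sizeof(struct ondisk_range) == 16,
               "ondisk_range must match the on-disk size");

struct range {
    uint64_t start;
    uint64_t end;
};

/* Hypothetical conversion helper: one call per field. */
static struct range range_from_disk(const struct ondisk_range *d)
{
    struct range r = { le64_to_cpu(d->start), le64_to_cpu(d->end) };
    return r;
}
```

Since the on-disk fields are a distinct union type, using one as a plain integer without converting is a compile error rather than a silent byte-order bug.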

It'd be nicer to codify this as a DSL or something, but C's macro system really isn't up to the job.

alexchamberlain · on April 4, 2012

adrianmsmith is suggesting that every time the variable is read, the compiler inserts byte-swapping code. I disagree with that.

adrianmsmith · on April 4, 2012

It would, in the worst case, not be worse than what the original poster is suggesting.

And at least in that case, the code would be more readable.

But, as the compiler knows more about what's going on (it's not just parsing and compiling a general expression with ORs and shifts) then it could well be faster (e.g. if there were a CPU instruction to do this, then it could be used, etc.).
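To make that comparison concrete, here is a sketch of the two ways a read of a hypothetical "littleendian int32" field could be lowered today. The accessor names are invented. The second version assumes GCC or Clang, which provide `__builtin_bswap32` and the `__BYTE_ORDER__` macros, and falls back to assuming a little-endian host if those macros are missing.

```c
#include <stdint.h>
#include <string.h>

/* Portable version: the general shifts-and-ORs expression. */
static inline uint32_t le32_get_portable(const uint32_t *field)
{
    uint8_t b[4];
    memcpy(b, field, sizeof b);
    return  (uint32_t)b[0]        | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Compiler-assisted version: a plain load on little-endian hosts,
 * a single byte-swap builtin on big-endian ones. */
static inline uint32_t le32_get_builtin(const uint32_t *field)
{
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return __builtin_bswap32(*field);
#else
    return *field; /* assumption: host is little-endian */
#endif
}
```

A language-level field qualifier would amount to the compiler choosing the second form itself, without the accessor call showing up in the source.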

pmjordan · on April 4, 2012

I don't necessarily see a problem with that: it would avoid the messy cpu_to_le64 calls.

alexchamberlain · on April 4, 2012

pmjordan · on April 4, 2012

http://en.wikipedia.org/wiki/Domain-specific_language

furyofantares · on April 4, 2012

I'd be curious to know why you want it as a language feature rather than something like google protocol buffers. I don't necessarily disagree, I'm just curious what your reasoning is.

adrianmsmith · on April 4, 2012

That's a good question. (One I hadn't thought about much before I have to admit.)

I think it would be easier for the programmer to use a language feature than a library; the resulting code would be easier to read. I'm thinking how regular expressions are easier to use in Perl than they are in Java because they're part of the language, or how maps are easier to use in scripting languages than say Map in Java because they're part of the language.

If you had to have, e.g., a string or an external file describing the syntax of the hardware-independent struct, and then some calls like "read_entity(void *structure, char *fieldname)", it would all get nasty: you're going to have strings in the code which can't be checked at compile time, you're going to be doing type casting which can't be checked at compile time, and so on.

But a code generation system could be an option - you define the structure in a file in a certain syntax, and then code is generated with the right types and attribute names being visible to the compiler.

But I still think being part of the language would just be simpler for the user, and I don't consider this to be some obscure feature which would dilute the purity of the language by its introduction; binary file formats and protocols are here to stay.

P.S. Yes Google Protocol Buffers could well be the thing I've been searching for since I first saw the C struct many years ago.

furyofantares · on April 4, 2012

Yeah, protocol buffers are a bit heavyweight, a bit harder to read than a language feature could be, and add an extra dependency. That's why I don't necessarily disagree. The reason I asked, though, is that it occurred to me that any time I have personally needed to deal with byte ordering, it has been in code that can easily justify a heavyweight solution (unlike regexes, which end up in tiny scripts that may only be run once). Those have also all been places where being able to easily interact with the same data in other languages would have been super useful, as would being able to easily add to the format without a versioning headache.

... of course, now that I write that I realize a language feature could actually provide all of that, too.

__david__ · on April 4, 2012

I feel the same way. I wrote a blog post a number of years ago with a proposal: http://porkrind.org/missives/hardware-friendly-c-structures/

justincormack · on April 4, 2012

Erlang has this type of thing I believe, part of its Telecoms heritage.