@ngortheone First of all, you’re correct that a CID is the content address of the original file, and that this can be confusing and sometimes inconvenient, particularly if multiple people independently pin the same file but with different hashing/chunking algorithms. This can indeed be wasteful, but I don’t see it being a major problem, since I’d imagine in most cases people would pin an existing version of the file that’s already on IPFS (thereby reusing the same hash/chunking parameters), and the case where different people instead add it from the original file would be less common. Worst case scenario, you have 2-3 versions of the same file available.
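To make that concrete, here’s a toy Python sketch (no IPFS library involved, and not the real UnixFS layout) of why the same bytes can end up under different identifiers depending on the hash algorithm and chunk size:

```python
import hashlib

data = b"x" * 1_000_000  # stand-in for the original file's bytes

# Different hash algorithms over identical bytes give unrelated digests:
print(hashlib.sha256(data).hexdigest())
print(hashlib.blake2b(data, digest_size=32).hexdigest())

# Different chunk sizes give different leaf hashes, hence a different root,
# even with the same algorithm. The "root" here is just a hash over the
# concatenated leaf hashes - a toy stand-in for a real UnixFS DAG root.
def toy_root(data: bytes, chunk_size: int) -> str:
    leaves = [hashlib.sha256(data[i:i + chunk_size]).digest()
              for i in range(0, len(data), chunk_size)]
    return hashlib.sha256(b"".join(leaves)).hexdigest()

print(toy_root(data, 256 * 1024))   # 256 KiB chunks
print(toy_root(data, 1024 * 1024))  # 1 MiB chunks -> a different root hash
```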
While not a perfect analogy, you could also complain that saving an image in different file formats results in different files. That’s just how data storage works, and different image formats each have their own merits, just as different hash algorithms and chunking strategies do.
There are aspects of the CID design I consider overkill, in particular multibase, i.e. the ability to express a given binary CID (array of bytes) as text in different ways. I think they should have just picked one encoding and gone with it - a lowest common denominator that’s going to work everywhere, like base16 or base32. Multibase can also result in confusion because the same CID can be rendered in different ways, such that it’s not obvious that multiple text encodings are actually the same identifier.
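For example, here’s the same CID-like byte string rendered under three multibase encodings (the prefix characters are from the multibase table as I understand it; the codec/digest bytes are just an illustrative sha2-256/dag-pb example):

```python
import base64, hashlib

# A plausible CIDv1 byte string: version 1, dag-pb codec (0x70),
# then a sha2-256 multihash (code 0x12, length 0x20) of some content.
cid = bytes.fromhex("01701220") + hashlib.sha256(b"hello").digest()

# The same bytes under three multibase encodings ('f' = base16 lowercase,
# 'b' = base32 lowercase unpadded, 'm' = base64 unpadded) look like
# unrelated identifiers until you decode them:
print("f" + cid.hex())
print("b" + base64.b32encode(cid).decode().lower().rstrip("="))
print("m" + base64.b64encode(cid).decode().rstrip("="))
```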
Multihash, on the other hand, is more useful. Many protocols/tools over the years have been designed around a single specific hash function, e.g. SHA-1 for both BitTorrent and Git. That algorithm has since been found to be insecure. Git is in the process of transitioning away from SHA-1, but the transition is way, way, way more complicated than it would have been if Git had been designed from the beginning to support multiple hash algorithms. IPFS is designed to cater for extensibility and evolution of the protocol, and while this does make things a bit more complex to deal with right now (e.g. choosing which hash algorithm to use), in the long run it’s likely to pay off.
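Roughly, a multihash just prefixes the digest with bytes identifying the algorithm and digest length, so a reader can dispatch on the code rather than assuming one fixed function; a minimal sketch:

```python
import hashlib

# A multihash is self-describing: <hash-code><digest-length><digest>.
# The first two fields are varints; single bytes suffice for small values.
# 0x12 is sha2-256 in the multicodec table, and its digest is 0x20 = 32 bytes.
def multihash_sha256(data: bytes) -> bytes:
    digest = hashlib.sha256(data).digest()
    return bytes([0x12, len(digest)]) + digest

mh = multihash_sha256(b"hello")
print(mh.hex())  # begins "1220...", so any reader can tell which algorithm
                 # produced it - and new codes can be added later without
                 # breaking existing identifiers
```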
Finally, the DAG serves an important purpose. The hash of each block/node is computed independently, so when downloading blocks a recipient can verify whether each block has the correct content based on its hash, without having to wait for the whole file to download. And if a block turns out to be corrupt, it can re-download just that block, either from the same peer or from another. Suppose you had an 8 GB file: if hashing were only done at the file level, you’d have to grab the whole 8 GB again, because you wouldn’t know which block(s) were corrupt.
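A rough sketch of what per-block verification buys you (fetch_block and the peer arguments are hypothetical stand-ins, not a real API):

```python
import hashlib

CHUNK = 256 * 1024  # an assumed chunk size, purely for illustration

def manifest(data: bytes) -> list[bytes]:
    """Hash each chunk independently - these are the leaves of the DAG."""
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]

def block_ok(expected: bytes, block: bytes) -> bool:
    # Any single block can be verified the moment it arrives.
    return hashlib.sha256(block).digest() == expected

# Sketch of a download loop; only the failing block is re-requested,
# never the whole file:
#
#   for i, expected in enumerate(manifest(file_bytes)):
#       block = fetch_block(i, peer)
#       while not block_ok(expected, block):
#           block = fetch_block(i, another_peer)  # retry just this block
```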
Note that most filesystems already use a DAG (or rather a tree) - see for example the single/double/triple indirect blocks in a variety of Unix filesystems. A “file” is an abstraction over the block store which uses a DAG to store its layout, and this is true whether you’re using BSD Unix from the 1980s or IPFS in 2021.
I think the design rationale that led to some of the above properties of IPFS isn’t really well explained and can definitely be confusing to newcomers. I spent a lot of time thinking “what the hell is this mess” when I started out trying to understand how IPFS works, but over time I came to better understand some of the decisions (not that I agree with them all). I think improved documentation would be really valuable in this regard.