Question about Context window size + Architectural Choices

#24
by michaelcombs28 - opened

Something similar was asked earlier, but I'm wondering whether the context window length was chosen for efficacy, for efficiency reasons, because of limitations of MoE in the embedding setting, or something else.

Just curious: I'm thinking of taking up this challenge (MoE + multilingual) with a larger context window, while trying to stay under 0.5B params.

I'm also wondering how positional interpolation compares to other approaches, such as global attention, for extending the context of a model like this.
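
For reference, here is a minimal sketch of what I mean by positional interpolation, assuming rotary (RoPE-style) position embeddings; I don't know that this matches the model's actual setup. The function name `rope_angles` and the lengths 2048 and 8192 are hypothetical, chosen only for illustration. The idea is to scale position indices down so that a longer sequence maps back into the position range seen during training.

```python
import math

def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles for one position under rotary embeddings.

    scale < 1 compresses positions (linear positional interpolation);
    scale = 1 leaves them unchanged.
    """
    pos = position * scale
    return [pos / base ** (2 * i / dim) for i in range(dim // 2)]

train_len = 2048    # hypothetical trained context length
target_len = 8192   # hypothetical extended context length
scale = train_len / target_len  # 0.25

# Position 8000 lies outside the trained range, but after interpolation
# it is treated like position 2000, which lies inside it.
angles = rope_angles(8000, dim=8, scale=scale)
reference = rope_angles(2000, dim=8)
```

The trade-off, as I understand it, is that neighbouring tokens end up closer together in position space, so some fine-tuning at the longer length is usually needed.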

Great work on this and v1.5. It's exciting to see others build on Mosaic's foundation and make this sort of process accessible to everyone.
