Running stock kernel on PostmarketOS

Why are we doing this?

In the last post we finished by testing our newly built stock kernel. We used the stock Android userspace, which is not ideal: when we start the actual mainlining the userspace will be PostmarketOS, so it would be good to have some indication beforehand that the userspace is OK (so we’re not changing both the userspace and a brand-spanking new kernel at the same time).

If you want to look at the result, this is the pmos merge request, but if you’re reading this you probably want some of the nitty (and mostly gritty) details.

Setup

Initially, when building the stock kernel, the goal is to get it running in any shape or form, so the production kconfig is used. PmOS has some requirements on the kernel (you can check them via pmbootstrap kconfig check --file FILE), so after making sure the kernel compiles (i.e. patching all the compiler problems), I went on to make the minor modifications necessary for PmOS to like the kernel.

A word about PmOS boot process

It all starts from boot.img, the Android ad-hoc standard for putting together the kernel, the initfs and a few other small things. The bootloader starts the kernel and hands it the initfs and the dtb; the kernel then preps the hardware it can and starts an init program inside the initfs.

PmOS packages a Linux-friendly boot and root (/) partition and installs them in the Android system partition (don’t confuse the boot partition inside system with the boot partition hosting the boot.img, which is outside of system). The pmos boot partition holds the things that a normal Linux boot partition would: initfs, kernel and dtb, as regular files. So the job of the PmOS initfs is to mount the boot and root partitions appropriately and hand off the boot to the real init from the root partition.

Quick explanation about Android 10 dynamic partitions

Historically, Android phones have had a bunch of partitions (you can get the gist from the partitions on the Samsung Galaxy S5 from 2014 here). At some point A/B devices were introduced, which doubled the number of update-touched partitions. Apparently even Google realized the situation was getting out of control, so they decided to solve (well, complicate) things. The main idea is to use one big super partition that holds the data for the many partitions, instead of having a bunch of fixed ones (fixed partitions are similar to how you would partition your disk during a Linux install: unless LVM or some other trickery is used, they are hard to rearrange or resize later). This is very similar to how pmos puts both the boot and root partitions inside the Android system partition.

So now, if we want pmos to play more or less nicely with android, it would make sense if it installs the boot and root partitions inside the system partition, which is now inside the super partition. The problem is that the exact way the super sub-partitions are laid out is not clear (well, I haven’t dug into it, maybe it is clear), and nobody has (yet) written code in pmos to manage a super partition, so for now we have to be content with running pmos in initfs only.

To achieve that, add the debug shell hook to the initfs. It will stop the init process and listen for telnet connections on port 23, so you can log in from your computer and look/mess around.

Roadblocks

As you can imagine, I tried the above, and the phone appeared as a network device (due to CONFIG_USB_F_RNDIS support), but I got no IP address. That meant the kernel was more or less working, but the code that was supposed to start the telnet daemon was not.

I wanted to find a way to signal from the initfs that it was alive. A lot of phones have notification LEDs, and there were notification LEDs in sysfs (under /sys/class/leds), but they didn’t do anything. I also tried to use the flash, but it didn’t work. I turned the flash on from Android, looked around sysfs and then tried the same thing from the terminal: nada. Luckily I found the vibrator (which was under … LEDs), and I managed to turn it on.
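For illustration, poking a vibrator that is exposed under the LED class could look roughly like the sketch below. The node name (vibrator) and the duration/activate attributes are assumptions based on how some vendor kernels expose it, not something taken from this device, so check what your own /sys/class/leds actually contains.

```shell
#!/bin/sh
# ASSUMPTION: the vibrator shows up as an LED-class node with "duration"
# and "activate" attributes. Names differ between kernels; adjust as needed.
VIB_DIR="${VIB_DIR:-/sys/class/leds/vibrator}"

# buzz MS: vibrate once for MS milliseconds.
buzz() {
    echo "$1" > "$VIB_DIR/duration" || return 1
    echo 1 > "$VIB_DIR/activate"
}

# buzz_times COUNT MS: vibrate COUNT times with a one-second pause after
# each buzz, so the pattern is easy to recognise.
buzz_times() {
    i=0
    while [ "$i" -lt "$1" ]; do
        buzz "$2" || return 1
        sleep 1
        i=$((i + 1))
    done
}
```

Something like buzz_times 3 200 then gives a pattern that is hard to mistake for anything else the phone does on its own.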

Great! Now I can make the phone do something. I put the vibration sequence at the very top of the init script. To do that, you have to run pmbootstrap chroot -r, which gets you inside the device chroot, the place where the initfs is built for pmos. Inside, you can edit /usr/share/postmarketos-mkinitfs/init.sh.in to your liking, and then just run pmbootstrap initfs build.

You can probably guess what happened next: nothing! OK, so I put a big sleep before the vibrator code. Maybe the vibrator didn’t have time to initialize by the time the initfs started. Still nothing. I began to wonder whether the init script managed to run at all. I noticed that Android and TWRP both use a compiled init process; maybe it does some voodoo trickery inside. I wanted to look at the code, but quickly became discouraged by the amount of crap I would have to sift through to get to it. I also noticed that the busybox bundled in the initfs was dynamically linked, so I repackaged the initfs to include a statically compiled one. Still nothing. Then, still suspecting the shell, I wrote my own C program to activate the vibrator a few times and then exit. I cross-compiled it and stuck it in place of init. And this time … something happened! A few seconds after booting, a weird screen appeared saying something along the lines of init exited with exit status 0x00000000.

[Image: CrashDump Mode example screen. Caption: A similar screen I received beforehand]

OK, that is progress! You have no idea how happy I was that the phone was telling me something! Even if that thing was “die”. Well, the vibrator still didn’t vibrate, but the message meant that I could try exiting from init at various places (maybe even with different return values) to check what was going on. NOTE: Keep in mind that such a screen is not standard, and I was told most devices don’t die so gracefully, so I can consider myself lucky 🙂

To my surprise this method worked, so I put exit 0 at the very beginning and moved it down line by line until I got to a point where the fancy screen wasn’t showing any more. NOTE: If you have a lot of lines to go through you can employ binary search: always split the section you’re unsure about in half, try in the middle, then repeat with the new, now smaller section. That will drastically reduce the number of tries, especially if you have no idea where the problem is.
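The binary search can be sketched in shell as below. This is only an illustration: reached is a hypothetical stand-in for the slow manual step (put exit 0 before line N, rebuild the initfs, boot, and see whether the crash screen appears), and HANG_LINE exists purely so the sketch can be run without a phone.

```shell
#!/bin/sh
# reached N: succeeds if an "exit 0" placed before line N still gets executed.
# Hypothetical stand-in for rebuilding, booting and watching the screen;
# HANG_LINE simulates the line where the real script gets stuck.
reached() {
    [ "$1" -le "${HANG_LINE:-37}" ]
}

# find_hang LO HI: LO must be a line where the exit is known to be reached,
# HI a line where it is known not to be. Prints the last line at which the
# exit is still reached, which is the line that hangs.
find_hang() {
    lo=$1
    hi=$2
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if reached "$mid"; then
            lo=$mid    # no hang before mid: the culprit is at mid or below it
        else
            hi=$mid    # hang happens before mid: the culprit is above it
        fi
    done
    echo "$lo"
}
```

For a script of about 200 lines, find_hang 1 201 needs roughly eight boots instead of up to 200, which matters a lot when every try means rebuilding and flashing.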

The problem ended up being that some code that mounted partitions was getting stuck. The code itself didn’t look like it would hang (I read all of the code in init multiple times), but alas, it was probably hanging, due to … reasons.

Luckily for me there was a check that skipped that whole section; all I had to do was add pmos_boot=something to the kernel command line.
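I won’t reproduce the actual pmos init code here, but a check of this kind usually boils down to looking for a key on the kernel command line. A minimal sketch of the idea follows; cmdline_value is a made-up helper for illustration, not a function from the pmos initfs.

```shell
#!/bin/sh
# cmdline_value KEY CMDLINE: print the value of the first KEY=... word found
# in CMDLINE. Returns non-zero when the key is absent.
# Hypothetical helper, not taken from the pmos init script.
cmdline_value() {
    for word in $2; do    # unquoted on purpose: split on whitespace
        case "$word" in
            "$1"=*)
                val=${word#"$1"=}    # strip the "KEY=" prefix
                printf '%s\n' "$val"
                return 0
                ;;
        esac
    done
    return 1
}
```

On a running system the second argument would be the contents of /proc/cmdline, e.g. cmdline_value pmos_boot "$(cat /proc/cmdline)", and a script could skip its partition probing whenever that call succeeds.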

Wrap up

To get the kernel and the device package merged into PmOS, I had to figure out one last thing I had been postponing from the beginning: the boot.img situation. As explained in the last post, newer versions of the boot.img format (which work only with newer bootloaders) support passing the dtb separately, rather than appending it to the kernel or to the boot.img itself, or using some other hacky way. The issue was that pmos did not support that. Just as I was about to start implementing it, Konrad suggested I just stick the dtb after the kernel (deviceinfo_append_dtb=true in the pmos deviceinfo file). This is the oldest way, and should hopefully still work (as newer bootloaders normally understand older formats as well). And to my surprise, it did!
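Appending simply means that the dtb bytes follow the kernel image in a single file. A rough sketch of what that amounts to is below; the file names in the usage line are placeholders, and as far as I understand pmos performs this step itself once the deviceinfo option is set, so this is only to show the idea.

```shell
#!/bin/sh
# append_dtb KERNEL DTB OUT: write KERNEL immediately followed by DTB to OUT.
append_dtb() {
    cat "$1" "$2" > "$3"
}
```

For example, append_dtb Image.gz device.dtb Image.gz-dtb would produce the combined file that an older-style bootloader expects.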

What’s next?

At this point I could have tackled porting LineageOS or TWRP for the device. It is clear that these will see more users than a mainline kernel ever will.

But at the end of the day any other activity would delay the mainlining by weeks, possibly months, and I wouldn’t feel as satisfied porting los or twrp, and then the bug complaints would start raining … So in the next post we try to run some brand-spanking new kernel, fresh from the next branch. Stay tuned!